概述
在Hive数据库中对SQL进行调优的时候,往往需要了解表的统计信息,比如:分区数量,数据行数,表的大小,文件个数等等。获取Hive表统计信息之前,需要先对Hive表收集统计信息:
- 非分区表
ANALYZE TABLE table_name COMPUTE STATISTICS; - 分区表
ANALYZE TABLE table_name PARTITION(partition_col='partition_value') COMPUTE STATISTICS;
下面提供两种方式来获取Hive表的统计信息。
方式一
方式一比较简单,只需执行命令DESC FORMATTED table_name;即可。
方式二
方式二需要开发者具有连接Hive元数据数据库的权限,使用SQL语句来获取统计信息。
- 非分区表
SELECT b.tbl_name AS '表名'
, SUM(CASE WHEN a.param_key = 'numRows' THEN a.param_value ELSE 0 END) AS '表数据行'
, SUM(CASE WHEN a.param_key = 'numRows' THEN 1 ELSE 0 END) AS '表分区数'
, SUM(CASE WHEN a.param_key = 'totalSize' THEN a.param_value ELSE 0 END) / 1024 / 1024 / 1024 AS '数据量GB'
, SUM(CASE WHEN a.param_key = 'numFiles' THEN a.param_value ELSE 0 END) AS '文件数'
FROM TABLE_PARAMS a
JOIN TBLS b
ON a.tbl_id = b.tbl_id
WHERE b.tbl_name IN ('table_name')
AND b.owner = 'hive'
;
- 分区表
SELECT c.tbl_name AS '表名'
, SUM(CASE WHEN b.param_key = 'numRows' THEN b.param_value ELSE 0 END) AS '表数据行'
, SUM(CASE WHEN b.param_key = 'numRows' THEN 1 ELSE 0 END) AS '表分区数'
, SUM(CASE WHEN b.param_key = 'totalSize' THEN b.param_value ELSE 0 END) / 1024 / 1024 / 1024 AS '数据量GB'
, SUM(CASE WHEN b.param_key = 'numFiles' THEN b.param_value ELSE 0 END) AS '文件数'
FROM PARTITIONS a
JOIN PARTITION_PARAMS b
ON a.part_id = b.part_id
JOIN TBLS c
ON a.tbl_id = c.tbl_id
WHERE c.tbl_name IN ('table_name')
AND c.owner = 'hive'
GROUP BY c.tbl_name
;
