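Overview
When tuning SQL in Hive, you often need a table's statistics, such as the number of partitions, the row count, the table size, and the number of files. Before these statistics can be read, they must first be collected for the Hive table: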
- Non-partitioned table
ANALYZE TABLE table_name COMPUTE STATISTICS;
- Partitioned table
ANALYZE TABLE table_name PARTITION(partition_col='partition_value') COMPUTE STATISTICS;
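Two variants of the same statement can be useful on larger tables. The sketch below reuses the placeholders table_name and partition_col from above, and assumes the first table is partitioned by that single column and the second is not partitioned.

```sql
-- Naming the partition column without a value collects statistics
-- for every partition of the table in one statement.
ANALYZE TABLE table_name PARTITION(partition_col) COMPUTE STATISTICS;

-- NOSCAN does not read the data files: it gathers only the number of
-- files and the physical size in bytes, not the row count.
ANALYZE TABLE table_name COMPUTE STATISTICS NOSCAN;
```

NOSCAN finishes quickly, but numRows is left unrefreshed, so it suits cases where only size and file count matter.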
Two ways to retrieve a Hive table's statistics are described below.
Method 1
Method 1 is the simpler one: just run the command DESC FORMATTED table_name;
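For a partitioned table, the same command also accepts a partition specification, which limits the output to a single partition. A minimal sketch, with table_name, partition_col, and partition_value as placeholders; the statistics typically show up as numFiles, numRows, and totalSize in the parameters section of the output, though the exact layout depends on the Hive version.

```sql
-- Show the metadata and collected statistics of one partition only.
DESC FORMATTED table_name PARTITION(partition_col='partition_value');
```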
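Method 2
Method 2 requires the developer to have permission to connect to the Hive metastore database, and uses SQL statements against it to retrieve the statistics.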
- Non-partitioned table
SELECT b.tbl_name AS table_name
, SUM(CASE WHEN a.param_key = 'numRows' THEN a.param_value ELSE 0 END) AS row_count
-- counts the numRows entries; this is 1 for a single non-partitioned table
, SUM(CASE WHEN a.param_key = 'numRows' THEN 1 ELSE 0 END) AS partition_count
, SUM(CASE WHEN a.param_key = 'totalSize' THEN a.param_value ELSE 0 END) / 1024 / 1024 / 1024 AS size_gb
, SUM(CASE WHEN a.param_key = 'numFiles' THEN a.param_value ELSE 0 END) AS file_count
FROM TABLE_PARAMS a
JOIN TBLS b
ON a.tbl_id = b.tbl_id
WHERE b.tbl_name IN ('table_name')
AND b.owner = 'hive'
-- group by table so that several names in the IN list yield one row each
GROUP BY b.tbl_name
;
- Partitioned table
SELECT c.tbl_name AS table_name
, SUM(CASE WHEN b.param_key = 'numRows' THEN b.param_value ELSE 0 END) AS row_count
-- each partition with statistics has one numRows entry, so this counts partitions
, SUM(CASE WHEN b.param_key = 'numRows' THEN 1 ELSE 0 END) AS partition_count
, SUM(CASE WHEN b.param_key = 'totalSize' THEN b.param_value ELSE 0 END) / 1024 / 1024 / 1024 AS size_gb
, SUM(CASE WHEN b.param_key = 'numFiles' THEN b.param_value ELSE 0 END) AS file_count
FROM PARTITIONS a
JOIN PARTITION_PARAMS b
ON a.part_id = b.part_id
JOIN TBLS c
ON a.tbl_id = c.tbl_id
WHERE c.tbl_name IN ('table_name')
AND c.owner = 'hive'
GROUP BY c.tbl_name
;
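Table names are only unique within a database, so filtering on tbl_name alone can mix statistics from tables in different databases that share a name. The sketch below is a variant of the partitioned-table query that also joins the metastore DBS table and filters on the database name. 'db_name' is a placeholder, and the DBS column names used here (db_id, name) are assumed to match the standard metastore schema, so check them against your metastore version.

```sql
SELECT d.name AS db_name
, c.tbl_name AS table_name
, SUM(CASE WHEN b.param_key = 'numRows' THEN b.param_value ELSE 0 END) AS row_count
, SUM(CASE WHEN b.param_key = 'totalSize' THEN b.param_value ELSE 0 END) / 1024 / 1024 / 1024 AS size_gb
FROM PARTITIONS a
JOIN PARTITION_PARAMS b
ON a.part_id = b.part_id
JOIN TBLS c
ON a.tbl_id = c.tbl_id
-- assumption: DBS holds one row per Hive database, keyed by db_id
JOIN DBS d
ON c.db_id = d.db_id
WHERE d.name = 'db_name'
AND c.tbl_name IN ('table_name')
GROUP BY d.name, c.tbl_name
;
```

The same extra join and filter can be added to the non-partitioned query, since TBLS is already part of it.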