Using GROUPING SETS and GROUPING__ID in Hive

Reposted on 2020-08-26 11:40 (tag: hive)

GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

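These grouping features are commonly used in OLAP scenarios where a metric cannot simply be summed up and has to be computed separately at each level when drilling up or down across dimensions, for example the number of unique visitors (UV) per hour, per day, and per month.

GROUPING SETS aggregates over several combinations of dimensions in one query. It is equivalent to running a GROUP BY for each combination and combining the results with UNION ALL. Put simply, it lets you write several GROUP BY queries over different dimensions as a single SQL statement.

Preparing the data: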

vim /root/test.txt

2015-03,2015-03-10,cookie1
2015-03,2015-03-10,cookie5
2015-03,2015-03-12,cookie7
2015-04,2015-04-12,cookie3
2015-04,2015-04-13,cookie2
2015-04,2015-04-13,cookie4
2015-04,2015-04-16,cookie4
2015-03,2015-03-10,cookie2
2015-03,2015-03-10,cookie3
2015-04,2015-04-12,cookie5
2015-04,2015-04-13,cookie6
2015-04,2015-04-15,cookie3
2015-04,2015-04-15,cookie2
2015-04,2015-04-16,cookie1

Upload the file to an HDFS directory:

hdfs dfs -put /root/test.txt /tmp

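Create the table:

use test;

-- Comma-delimited text table: month, day, cookie ID
create table cookie5(month string, day string, cookieid string) row format delimited fields terminated by ',';

-- load data inpath (without LOCAL) moves the HDFS file into the table's directory
load data inpath "/tmp/test.txt" into table cookie5;

select * from cookie5;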

Using GROUPING SETS

Start with a query:

select
    month, day, count(cookieid)
from cookie5
group by month, day
grouping sets (month, day);

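The result has one row per month (with day set to NULL) and one row per day (with month set to NULL).

The query above is equivalent to several GROUP BY queries combined with UNION ALL: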

select month,NULL as day,count(cookieid) as nums from cookie5 group by month
union all
select NULL as month,day,count(cookieid) as nums from cookie5 group by day;

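Note: with UNION and UNION ALL, every SELECT must return the same number of columns, and the corresponding columns must have compatible types. That is why each branch pads the column it does not group by with NULL, for example NULL as month in the second SELECT above.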
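The equivalence can be checked outside Hive with a small Python emulation. This is only a sketch of the semantics, not of how Hive executes the query; the helper names `group_by` and `grouping_sets` are made up for illustration, and the rows are the sample data from test.txt.

```python
from collections import Counter

# Sample rows from test.txt: (month, day, cookieid)
ROWS = [
    ("2015-03", "2015-03-10", "cookie1"), ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"), ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"), ("2015-04", "2015-04-13", "cookie4"),
    ("2015-04", "2015-04-16", "cookie4"), ("2015-03", "2015-03-10", "cookie2"),
    ("2015-03", "2015-03-10", "cookie3"), ("2015-04", "2015-04-12", "cookie5"),
    ("2015-04", "2015-04-13", "cookie6"), ("2015-04", "2015-04-15", "cookie3"),
    ("2015-04", "2015-04-15", "cookie2"), ("2015-04", "2015-04-16", "cookie1"),
]
COLUMNS = ("month", "day")


def group_by(rows, keys):
    """Emulate: SELECT month, day, count(cookieid) ... GROUP BY <keys>.

    Columns that are not part of the grouping key come back as None (NULL).
    """
    counts = Counter()
    for month, day, cookieid in rows:
        record = {"month": month, "day": day}
        out_key = tuple(record[c] if c in keys else None for c in COLUMNS)
        if cookieid is not None:  # count(col) skips NULLs
            counts[out_key] += 1
    return counts


def grouping_sets(rows, sets):
    """Emulate GROUPING SETS as the UNION ALL of one GROUP BY per set."""
    result = Counter()
    for keys in sets:
        result.update(group_by(rows, keys))
    return result


result = grouping_sets(ROWS, [("month",), ("day",)])
for (month, day), nums in sorted(result.items(), key=lambda kv: (kv[0][0] or "", kv[0][1] or "")):
    print(month, day, nums)
```

On the sample data this gives 8 rows: 2 monthly rows (5 for 2015-03, 9 for 2015-04) and 6 daily rows.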

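The results are identical, but the GROUPING SETS version is usually faster than GROUP BY plus UNION ALL.

That raises a question: why is GROUPING SETS faster than GROUP BY plus UNION ALL?

First, use EXPLAIN to look at the Hive execution plan: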

explain select 
    month,day,count(cookieid) 
from cookie5 
    group by month,day 
grouping sets (month,day);

The execution plan is as follows:

+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: cookie5                         |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: month (type: string), day (type: string), cookieid (type: string) |
|               outputColumnNames: month, day, cookieid |
|               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(cookieid)      |
|                 keys: month (type: string), day (type: string), '0' (type: string) |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1, _col2, _col3 |
|                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
|                   sort order: +++                  |
|                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
|                   Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col3 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1, _col3   |
|           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|           pruneGroupingSetId: true                 |
|           Select Operator                          |
|             expressions: _col0 (type: string), _col1 (type: string), _col3 (type: bigint) |
|             outputColumnNames: _col0, _col1, _col2 |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.TextInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+--+

Now look at the other query:

explain select month,NULL as day,count(cookieid) as nums from cookie5 group by month 
union all 
select NULL as month,day,count(cookieid) as nums from cookie5 group by day;

+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1, Stage-3      |
|   Stage-3 is a root stage                          |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: cookie5                         |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: month (type: string), cookieid (type: string) |
|               outputColumnNames: month, cookieid   |
|               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(cookieid)      |
|                 keys: month (type: string)         |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col0 (type: string), UDFToString(null) (type: string), _col1 (type: bigint) |
|             outputColumnNames: _col0, _col1, _col2 |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.TextInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.TextInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-3                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: cookie5                         |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: day (type: string), cookieid (type: string) |
|               outputColumnNames: day, cookieid     |
|               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(cookieid)      |
|                 keys: day (type: string)           |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: UDFToString(null) (type: string), _col0 (type: string), _col1 (type: bigint) |
|             outputColumnNames: _col0, _col1, _col2 |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+--+

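Comparison: according to EXPLAIN, the GROUPING SETS query has only 2 stages: a single MapReduce job (Stage-1) that scans cookie5 once and has one reduce phase, followed by the fetch stage. The GROUP BY plus UNION ALL query has 4 stages: two MapReduce jobs (Stage-1 and Stage-3), each scanning cookie5 and running its own reduce phase, a map-only job (Stage-2) that unions their outputs, and the fetch stage. With fewer jobs and only one table scan, the GROUPING SETS version should generally be faster.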
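A rough mental model of the difference, written as Python: with GROUPING SETS each input row is expanded into one shuffle key per grouping set during a single scan, whereas the UNION ALL form scans the table once per branch. This is a conceptual sketch only. The `set_id` field is a hypothetical stand-in for the extra key that shows up as '0' in the Group By Operator of the first plan, and the four rows are a subset of the sample data.

```python
from collections import Counter

# Subset of the sample rows: (month, day, cookieid)
ROWS = [
    ("2015-03", "2015-03-10", "cookie1"),
    ("2015-03", "2015-03-12", "cookie7"),
    ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-12", "cookie5"),
]
GROUP_COLS = ("month", "day")
SETS = [("month",), ("day",)]


def shuffle_key(row, keys, set_id):
    """Build the key for one grouping set: unused columns become None."""
    record = dict(zip(GROUP_COLS, row[:2]))
    return tuple(record[c] if c in keys else None for c in GROUP_COLS) + (set_id,)


def single_scan(rows, grouping_sets):
    """GROUPING SETS style: read each row once, emit one key per set."""
    counts = Counter()
    for row in rows:  # one pass over the table
        for set_id, keys in enumerate(grouping_sets):
            counts[shuffle_key(row, keys, set_id)] += 1
    return counts


def scan_per_set(rows, grouping_sets):
    """UNION ALL style: one full pass over the table per GROUP BY branch."""
    counts = Counter()
    for set_id, keys in enumerate(grouping_sets):
        for row in rows:  # the table is read again for every branch
            counts[shuffle_key(row, keys, set_id)] += 1
    return counts


one_pass = single_scan(ROWS, SETS)
many_passes = scan_per_set(ROWS, SETS)
print(one_pass == many_passes)
```

Both functions produce the same counts; the difference is only in how many times the input is read, which is what the two plans above reflect.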

Using GROUPING__ID:

Start with a query:

select 
  month,
  day,
  count(distinct cookieid) as uv,
  GROUPING__ID
from cookie5 
group by month,day 
grouping sets (month,day) 
order by GROUPING__ID;

This is equivalent to:

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day;

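About the result:

The first column is month. It is filled in for rows grouped by month and NULL otherwise.

The second column is day. It is filled in for rows grouped by day and NULL otherwise.

The third column, uv, is the number of distinct cookieid values in each group, whether the group is a month or a day.

The fourth column, GROUPING__ID (note the double underscore), indicates which grouping set the row belongs to. For the grouping columns month and day used here, 1 stands for the month set and 2 stands for the day set.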
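How the number is derived depends on the Hive version. The sketch below reflects my reading of the Hive documentation: before Hive 2.3.0 a bit is set for each column that is present in the grouping set, with the first GROUP BY column as the lowest bit; from 2.3.0 on a bit is set for each column that is aggregated away, with the first column as the highest bit. The two function names are hypothetical helpers, not Hive APIs.

```python
def grouping_id_legacy(columns, grouping_set):
    """Pre-2.3.0 numbering (assumed): bit i is 1 when column i IS in the set.

    The first GROUP BY column maps to the least significant bit.
    """
    return sum(1 << i for i, col in enumerate(columns) if col in grouping_set)


def grouping_id_since_2_3(columns, grouping_set):
    """2.3.0+ numbering (assumed): a bit is 1 when the column is NOT in the set.

    The first GROUP BY column maps to the most significant bit.
    """
    n = len(columns)
    return sum(1 << (n - 1 - i) for i, col in enumerate(columns) if col not in grouping_set)


COLUMNS = ("month", "day")
for grouping_set in [("month",), ("day",), ("month", "day")]:
    print(
        grouping_set,
        grouping_id_legacy(COLUMNS, grouping_set),
        grouping_id_since_2_3(COLUMNS, grouping_set),
    )
```

For these two columns both numberings happen to agree on the single-column sets (1 for month, 2 for day). They differ once a set contains both columns: 3 under the older numbering, 0 under the newer one.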

Another example:

SELECT  month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID 
FROM cookie5 
GROUP BY month,day 
GROUPING SETS (month,day,(month,day)) 
ORDER BY GROUPING__ID;

This is equivalent to:

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month 
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5 GROUP BY month,day;

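GROUPING SETS (month,day,(month,day)) means that the statistics are computed along three dimension combinations: 1. by month, 2. by day, 3. by month and day together. The result confirms this. Note that the value 3 for the (month,day) set follows the GROUPING__ID numbering used before Hive 2.3.0; as far as the Hive documentation describes, from 2.3.0 on that set is reported as 0, while the month and day sets are still 1 and 2.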
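The reason each level needs its own row, rather than being rolled up from the finer one, is that UV is not additive. A short Python check on the sample data illustrates this; `uv_by` is a made-up helper that mimics COUNT(DISTINCT cookieid) per group.

```python
from collections import defaultdict

# Sample rows from test.txt: (month, day, cookieid)
ROWS = [
    ("2015-03", "2015-03-10", "cookie1"), ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"), ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"), ("2015-04", "2015-04-13", "cookie4"),
    ("2015-04", "2015-04-16", "cookie4"), ("2015-03", "2015-03-10", "cookie2"),
    ("2015-03", "2015-03-10", "cookie3"), ("2015-04", "2015-04-12", "cookie5"),
    ("2015-04", "2015-04-13", "cookie6"), ("2015-04", "2015-04-15", "cookie3"),
    ("2015-04", "2015-04-15", "cookie2"), ("2015-04", "2015-04-16", "cookie1"),
]


def uv_by(rows, keys):
    """Distinct cookieid count per group; keys is a subset of ("month", "day")."""
    seen = defaultdict(set)
    for month, day, cookieid in rows:
        record = {"month": month, "day": day}
        seen[tuple(record[k] for k in keys)].add(cookieid)
    return {group: len(cookies) for group, cookies in seen.items()}


monthly = uv_by(ROWS, ("month",))
daily = uv_by(ROWS, ("month", "day"))

april_uv = monthly[("2015-04",)]
# Summing the daily UVs counts a cookie once for every day it appears on.
april_daily_sum = sum(uv for (month, _day), uv in daily.items() if month == "2015-04")
print(april_uv, april_daily_sum)
```

For April the monthly UV is 6, while the daily UVs add up to 9, because cookie2, cookie3 and cookie4 each appear on two different days.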

End. Source: https://www.cnblogs.com/qingyunzong/p/8798987.html#_label2
