Hive: An Introduction to CUBE and ROLLUP


1. GROUPING SETS

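GROUPING SETS is a clause of GROUP BY that lets developers specify several grouping dimensions after a single GROUP BY statement. It can be understood simply as several GROUP BY queries whose results are combined with UNION ALL.

To make this concrete, take testdb.test_1 as an example: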

hive> use testdb;
hive> desc test_1;

user_id        string      id
device_id      string      device type: phone, tablet
os_id          string      operating system type: ios, android
app_id         string      mobile app_id
client_v       string      client version
channel        string      channel

Each GROUPING SETS query below is followed by its equivalent Hive query.

GROUPING SETS query:
select device_id,os_id,app_id,count(user_id) from test_1 group by device_id,os_id,app_id grouping sets((device_id))
Equivalent Hive query:
SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id

GROUPING SETS query:
select device_id,os_id,app_id,count(user_id) from test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id))
Equivalent Hive query:
SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id

GROUPING SETS query:
select device_id,os_id,app_id,count(user_id) from test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id))
Equivalent Hive query:
SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id
UNION ALL
SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id

GROUPING SETS query:
select device_id,os_id,app_id,count(user_id) from test_1 group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),())
Equivalent Hive query:
SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id
UNION ALL
SELECT null,os_id,null,count(user_id) FROM test_1 group by os_id
UNION ALL
SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id
UNION ALL
SELECT null,null,null,count(user_id) FROM test_1
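
The expansion from a GROUPING SETS clause to a UNION ALL query is mechanical, so it can be sketched in a few lines of Python. This is only an illustration of the rewrite rule described above, not how Hive implements it; the helper name expand_grouping_sets and its parameters are made up for this example.

```python
def expand_grouping_sets(select_cols, grouping_sets, table, agg="count(user_id)"):
    """Build the UNION ALL query that a GROUPING SETS clause stands for."""
    branches = []
    for gset in grouping_sets:
        # Columns outside the current grouping set are projected as null.
        projected = [col if col in gset else "null" for col in select_cols]
        sql = f"SELECT {','.join(projected)},{agg} FROM {table}"
        if gset:  # the empty set () aggregates the whole table: no GROUP BY
            sql += " group by " + ",".join(gset)
        branches.append(sql)
    return "\nUNION ALL\n".join(branches)


print(expand_grouping_sets(
    ["device_id", "os_id", "app_id"],
    [("device_id", "os_id"), ("device_id",)],
    "test_1",
))
```

Run on the third example, this prints the same two-branch UNION ALL query shown for grouping sets((device_id,os_id),(device_id)).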

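2. CUBE

CUBE, often called the data cube, aggregates over any combination of the chosen dimensions. cube(a,b,c) first groups by (a,b,c), then by (a,b), (a,c), (a), (b,c), (b) and (c), and finally aggregates the whole table. In other words, CUBE computes the aggregate for every combination of the selected columns, so n columns produce 2^n grouping sets.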

select device_id,os_id,app_id,client_v,channel,count(user_id) 
from test_1 
group by device_id,os_id,app_id,client_v,channel with cube;

This is equivalent to:

SELECT device_id,null,null,null,null ,count(user_id) FROM test_1 group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_1 group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_1 group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_1 group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_1 group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_1 group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_v,null ,count(user_id) FROM test_1 group by client_v
UNION ALL
SELECT device_id,null,null,client_v,null ,count(user_id) FROM test_1 group by device_id,client_v
UNION ALL
SELECT null,os_id,null,client_v,null ,count(user_id) FROM test_1 group by os_id,client_v
UNION ALL
SELECT device_id,os_id,null,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,client_v
UNION ALL
SELECT null,null,app_id,client_v,null ,count(user_id) FROM test_1 group by app_id,client_v
UNION ALL
SELECT device_id,null,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,app_id,client_v
UNION ALL
SELECT null,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by os_id,app_id,client_v
UNION ALL
SELECT device_id,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v
UNION ALL
SELECT null,null,null,null,channel ,count(user_id) FROM test_1 group by channel
UNION ALL
SELECT device_id,null,null,null,channel ,count(user_id) FROM test_1 group by device_id,channel
UNION ALL
SELECT null,os_id,null,null,channel ,count(user_id) FROM test_1 group by os_id,channel
UNION ALL
SELECT device_id,os_id,null,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,channel
UNION ALL
SELECT null,null,app_id,null,channel ,count(user_id) FROM test_1 group by app_id,channel
UNION ALL
SELECT device_id,null,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,app_id,channel
UNION ALL
SELECT null,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by os_id,app_id,channel
UNION ALL
SELECT device_id,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,channel
UNION ALL
SELECT null,null,null,client_v,channel ,count(user_id) FROM test_1 group by client_v,channel
UNION ALL
SELECT device_id,null,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,client_v,channel
UNION ALL
SELECT null,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by os_id,client_v,channel
UNION ALL
SELECT device_id,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,client_v,channel
UNION ALL
SELECT null,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by app_id,client_v,channel
UNION ALL
SELECT device_id,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,app_id,client_v,channel
UNION ALL
SELECT null,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by os_id,app_id,client_v,channel
UNION ALL
SELECT device_id,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v,channel
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_1
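
The 32 branches above are simply every subset of the five columns. As a rough sketch (the function name cube_grouping_sets is invented for this illustration), the list of grouping sets can be generated with itertools.combinations:

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Return every subset of `columns`: the grouping sets CUBE expands to."""
    sets = []
    # Walk from the full column list down to the empty set ().
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


cols = ["device_id", "os_id", "app_id", "client_v", "channel"]
print(len(cube_grouping_sets(cols)))  # 2 ** 5 = 32 grouping sets

for gset in cube_grouping_sets(["a", "b", "c"]):
    print(gset)
```

The order in which the sets are listed here differs from the order of the UNION ALL branches above, but the order does not affect the result; only the collection of sets matters.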

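3. ROLLUP

ROLLUP computes multi-level aggregates by dropping the grouping columns one at a time from right to left, which yields the subtotals along a single hierarchy.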

select device_id,os_id,app_id,client_v,channel,count(user_id) 
from test_1 
group by device_id,os_id,app_id,client_v,channel with rollup;

This is equivalent to:

select device_id,os_id,app_id,client_v,channel,count(user_id) 
from test_1 
group by device_id,os_id,app_id,client_v,channel 
grouping sets ((device_id,os_id,app_id,client_v,channel),(device_id,os_id,app_id,client_v),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
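
Put differently, the grouping sets of a ROLLUP are the prefixes of the column list, from the longest down to the empty set. A minimal sketch (rollup_grouping_sets is a name chosen for this example only):

```python
def rollup_grouping_sets(columns):
    """Return the prefixes of `columns`, longest first, ending with ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


cols = ["device_id", "os_id", "app_id", "client_v", "channel"]
for gset in rollup_grouping_sets(cols):
    print(gset)
```

For n columns this gives n + 1 grouping sets (6 here), compared with 2^n for CUBE.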

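4. GROUPING__ID

When a column does not take part in the grouping for a given output row, its value is shown as NULL. This can be confused with NULL values that exist in the column itself, so we need a way to tell "not grouped" apart from "the value really is NULL". (Write a small algorithm that enumerates the combinations and it becomes clear at once: GROUPING__ID is simply the sum of the binary place values of the columns being grouped on.)

An example: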

Column1 (key)   Column2 (value)
1               NULL
1               1
2               2
3               3
3               NULL
4               5

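HQL query (note that the Hive virtual column is spelled GROUPING__ID, with two underscores):

  SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP

The result is as follows: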

key    value   GROUPING__ID   binary (key bit, value bit)   count(*)
NULL   NULL    0              00                            6
1      NULL    1              10                            2
1      NULL    3              11                            1
1      1       3              11                            1
2      NULL    1              10                            1
2      2       3              11                            1
3      NULL    1              10                            2
3      NULL    3              11                            1
3      3       3              11                            1
4      NULL    1              10                            1
4      5       3              11                            1

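Convert GROUPING__ID to binary, so that each bit corresponds to one GROUP BY column. In the result above, a bit of 1 means the column takes part in the grouping for that row, so a NULL in a column whose bit is 1 is a NULL in the data itself, while a NULL in a column whose bit is 0 only means the column was not grouped on. (Scanning with the class in DataFilterNull.py can filter out results whose columns are null or "".) Note that this is the bit convention of older Hive releases; the behavior of GROUPING__ID changed in Hive 2.3.0, so check the documentation for the version in use.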
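
To make the bit arithmetic concrete, the sketch below emulates GROUP BY key, value WITH ROLLUP on the six sample rows and tags each output row with a grouping id. It assumes the convention of the result table above (bit 1 = column is grouped on, first column = lowest bit); the function name and data layout are illustrative and say nothing about Hive's internals.

```python
from collections import Counter

# The sample table T1 as (key, value) pairs; None stands for NULL.
rows = [(1, None), (1, 1), (2, 2), (3, 3), (3, None), (4, 5)]


def rollup_with_grouping_id(rows, n_cols):
    """Emulate GROUP BY ... WITH ROLLUP with count(*) and a grouping id.

    Assumed convention (matching the result table above): bit i is 1 when
    column i takes part in the grouping set, first column = lowest bit.
    """
    counts = Counter()
    for depth in range(n_cols + 1):      # number of leading columns kept
        gid = (1 << depth) - 1           # depth low bits set: 0, 1, 3, ...
        for row in rows:
            # Columns that were rolled away are reported as None (NULL).
            grouped = row[:depth] + (None,) * (n_cols - depth)
            counts[(grouped, gid)] += 1
    return counts


result = rollup_with_grouping_id(rows, 2)
for (grouped, gid), cnt in result.items():
    print(grouped, gid, cnt)
```

The two output rows with key 1 and value NULL are now distinguishable: the one with grouping id 1 (count 2) is the subtotal for key 1, while the one with grouping id 3 (count 1) is the source row whose value really is NULL.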

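5. Window functions

Hive's window functions appear to be modeled largely on Oracle's, so anyone familiar with Oracle should be able to pick them up at a glance.

For details, see: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html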

Reference article

  1. https://blog.csdn.net/gua___gua/article/details/52523698

