1. GROUPING SETS
GROUPING SETS作為GROUP BY的子句,允許開發人員在GROUP BY語句后面指定多個統維度,可以簡單理解為多條group by語句通過union all把查詢結果聚合起來結合起來。
為方便理解,以testdb.test_1為例:
hive> use testdb;
hive> desc test_1;
user_id string id
device_id string 設備類型:手機、平板
os_id string 操作系統類型:ios、android
app_id string 手機app_id
client_v string 客戶端版本
channel string 渠道
grouping sets語句 | 等價hive語句 |
---|---|
select device_id,os_id,app_id,count(user_id) from test_1 group by device_id,os_id,app_id grouping sets((device_id)) | SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id |
select device_id,os_id,app_id,count(user_id) from test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id)) | SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id |
select device_id,os_id,app_id,count(user_id) from test_1 group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) | SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id UNION ALL SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id |
select device_id,os_id,app_id,count(user_id) from test_1 group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) | SELECT device_id,null,null,count(user_id) FROM test_1 group by device_id UNION ALL SELECT null,os_id,null,count(user_id) FROM test_1 group by os_id UNION ALL SELECT device_id,os_id,null,count(user_id) FROM test_1 group by device_id,os_id UNION ALL SELECT null,null,null,count(user_id) FROM test_1 |
2. CUBE函數
cube簡稱數據魔方,可以實現hive多個任意維度的查詢,cube(a,b,c)則首先會對(a,b,c)進行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在對全表進行group by,cube會統計所選列中值的所有組合的聚合
select device_id,os_id,app_id,client_v,channel,count(user_id)
from test_1
group by device_id,os_id,app_id,client_v,channel with cube;
等價於:
SELECT device_id,null,null,null,null ,count(user_id) FROM test_1 group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_1 group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_1 group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_1 group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_1 group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_1 group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_v,null ,count(user_id) FROM test_1 group by client_v
UNION ALL
SELECT device_id,null,null,client_v,null ,count(user_id) FROM test_1 group by device_id,client_v
UNION ALL
SELECT null,os_id,null,client_v,null ,count(user_id) FROM test_1 group by os_id,client_v
UNION ALL
SELECT device_id,os_id,null,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,client_v
UNION ALL
SELECT null,null,app_id,client_v,null ,count(user_id) FROM test_1 group by app_id,client_v
UNION ALL
SELECT device_id,null,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,app_id,client_v
UNION ALL
SELECT null,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by os_id,app_id,client_v
UNION ALL
SELECT device_id,os_id,app_id,client_v,null ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v
UNION ALL
SELECT null,null,null,null,channel ,count(user_id) FROM test_1 group by channel
UNION ALL
SELECT device_id,null,null,null,channel ,count(user_id) FROM test_1 group by device_id,channel
UNION ALL
SELECT null,os_id,null,null,channel ,count(user_id) FROM test_1 group by os_id,channel
UNION ALL
SELECT device_id,os_id,null,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,channel
UNION ALL
SELECT null,null,app_id,null,channel ,count(user_id) FROM test_1 group by app_id,channel
UNION ALL
SELECT device_id,null,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,app_id,channel
UNION ALL
SELECT null,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by os_id,app_id,channel
UNION ALL
SELECT device_id,os_id,app_id,null,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,channel
UNION ALL
SELECT null,null,null,client_v,channel ,count(user_id) FROM test_1 group by client_v,channel
UNION ALL
SELECT device_id,null,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,client_v,channel
UNION ALL
SELECT null,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by os_id,client_v,channel
UNION ALL
SELECT device_id,os_id,null,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,client_v,channel
UNION ALL
SELECT null,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by app_id,client_v,channel
UNION ALL
SELECT device_id,null,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,app_id,client_v,channel
UNION ALL
SELECT null,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by os_id,app_id,client_v,channel
UNION ALL
SELECT device_id,os_id,app_id,client_v,channel ,count(user_id) FROM test_1 group by device_id,os_id,app_id,client_v,channel
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_1
3. ROLL UP函數
rollup可以實現從右到左遞減多級的統計,顯示統計某一層次結構的聚合
select device_id,os_id,app_id,client_v,channel,count(user_id)
from test_1
group by device_id,os_id,app_id,client_v,channel with rollup;
等價於:
select device_id,os_id,app_id,client_v,channel,count(user_id)
from test_1
group by device_id,os_id,app_id,client_v,channel
grouping sets ((device_id,os_id,app_id,client_v,channel),(device_id,os_id,app_id,client_v),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
4.Grouping_ID函數
當我們沒有統計某一列時,它的值顯示為null,這可能與列本身就有null值沖突,這就需要一種方法區分是沒有統計還是值本來就是null。(寫一個排列組合的算法,就馬上理解了,grouping_id其實就是所統計各列二進制和)
例子如下:
Column1 (key) | Column2 (value) |
---|---|
1 | NULL |
1 | 1 |
2 | 2 |
3 | 3 |
3 | NULL |
4 | 5 |
hql統計:
SELECT key, value, GROUPING_ID, count(*) from T1 GROUP BY key, value WITH ROLLUP
結果如下:
key | value | GROUPING_ID | count(*) |
---|---|---|---|
NULL | NULL | 0 00 | 6 |
1 | NULL | 1 10 | 2 |
1 | NULL | 3 11 | 1 |
1 | 1 | 3 11 | 1 |
2 | NULL | 1 10 | 1 |
2 | 2 | 3 11 | 1 |
3 | NULL | 1 10 | 2 |
3 | NULL | 3 11 | 1 |
3 | 3 | 3 11 | 1 |
4 | NULL | 1 10 | 1 |
4 | 5 | 3 11 | 1 |
GROUPING_ID轉變為二進制,如果對應位上有值為null,說明這列本身值就是null。(通過類DataFilterNull.py 掃描,可以篩選過濾掉列中null、“”統計結果),
5. 窗口函數
hive窗口函數,感覺大部分都是在模仿oracle,有對oracle熟悉的,應該看下就知道怎么用。
具體參見:http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html