Cube
hive (hdata)> select * from test;
test.f1 test.f2 test.f3 test.cnt
A A B 1
B B A 1
A A A 2
hive (hdata)> SELECT f1,
> f2,
> f3,
> sum(cnt)
> FROM test
> GROUP BY f1,
> f2,
> f3;
f1 f2 f3 _c3
A A A 2
A A B 1
B B A 1
hive (hdata)> SELECT f1,
> f2,
> f3,
> sum(cnt),
> GROUPING__ID,
> rpad(reverse(bin(cast(GROUPING__ID AS bigint))),3,'0')
> FROM test
> GROUP BY f1,
> f2,
> f3 WITH CUBE;
f1 f2 f3 _c3 grouping__id _c5
NULL NULL NULL 4 7 000
NULL NULL A 3 6 001
NULL NULL B 1 6 001
NULL A NULL 3 5 010
NULL A A 2 4 011
NULL A B 1 4 011
NULL B NULL 1 5 010
NULL B A 1 4 011
A NULL NULL 3 3 100
A NULL A 2 2 101
A NULL B 1 2 101
A A NULL 3 1 110
A A A 2 0 111
A A B 1 0 111
B NULL NULL 1 3 100
B NULL A 1 2 101
B B NULL 1 1 110
B B A 1 0 111
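CUBE, often called a data cube, lets Hive aggregate over any combination of dimensions. cube(a,b,c) first groups by (a,b,c), then by (a,b), (a,c), (a), (b,c), (b) and (c), and finally aggregates over the whole table. In other words, it computes the aggregate for every combination of the selected columns.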
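The enumeration described above can be sketched in Python. This is only an illustration of the semantics, not how Hive executes the query; the helper names `cube_grouping_sets` and `aggregate` are made up for this sketch, and the rows are the three rows of the `test` table from the example.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Return every subset of `columns`, largest first (2**n sets in total)."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


def aggregate(rows, columns, grouping_set, measure):
    """Sum `measure` per group; columns outside the grouping set become None (NULL)."""
    totals = {}
    for row in rows:
        key = tuple(row[c] if c in grouping_set else None for c in columns)
        totals[key] = totals.get(key, 0) + row[measure]
    return totals


# The three rows of the `test` table shown in the transcript.
rows = [
    {"f1": "A", "f2": "A", "f3": "B", "cnt": 1},
    {"f1": "B", "f2": "B", "f3": "A", "cnt": 1},
    {"f1": "A", "f2": "A", "f3": "A", "cnt": 2},
]
columns = ("f1", "f2", "f3")

# One GROUP BY per grouping set, all results concatenated.
cube_rows = []
for grouping_set in cube_grouping_sets(columns):
    for key, total in aggregate(rows, columns, grouping_set, "cnt").items():
        cube_rows.append((key, total))

for key, total in cube_rows:
    print(key, total)
```

For three columns this yields 8 grouping sets and, on the sample data, the same 18 result rows as the WITH CUBE query, although in a different order.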
GROUPING SETS
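GROUPING SETS is a clause of GROUP BY that lets you specify several groupings in a single GROUP BY statement. It can be understood as several GROUP BY queries whose results are combined with UNION ALL. The following examples illustrate this: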
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id))
is equivalent to
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id))
is equivalent to
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
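The UNION ALL reading of GROUPING SETS can be mimicked in a few lines of Python. This is a minimal sketch under assumptions: the contents of test_xinyan_reg are not shown in the text, so the sample rows are invented, and `group_count` and `grouping_sets` are hypothetical helper names.

```python
from collections import Counter

COLUMNS = ("device_id", "os_id", "app_id")


def group_count(rows, group_cols):
    """One GROUP BY: count(user_id) per group, None (NULL) for ungrouped columns."""
    counts = Counter()
    for row in rows:
        key = tuple(row[c] if c in group_cols else None for c in COLUMNS)
        # count(user_id) ignores NULL user_ids but still keeps the group.
        counts[key] += 0 if row["user_id"] is None else 1
    return [key + (n,) for key, n in counts.items()]


def grouping_sets(rows, sets):
    """GROUPING SETS: run one GROUP BY per set and concatenate the results."""
    result = []
    for group_cols in sets:
        result.extend(group_count(rows, group_cols))
    return result


# Invented sample rows, for illustration only.
rows = [
    {"device_id": "d1", "os_id": "ios", "app_id": "a1", "user_id": 1},
    {"device_id": "d1", "os_id": "ios", "app_id": "a2", "user_id": 2},
    {"device_id": "d1", "os_id": "android", "app_id": "a1", "user_id": 3},
    {"device_id": "d2", "os_id": "ios", "app_id": "a1", "user_id": 4},
]

combined = grouping_sets(rows, [("device_id", "os_id"), ("device_id",)])
for result_row in combined:
    print(result_row)
```

The first three printed rows correspond to the (device_id,os_id) query and the last two to the (device_id) query, with None standing in for the NULL columns.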
Rollup
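ROLLUP computes multi-level aggregates by dropping the grouping columns one at a time from right to left, producing the aggregates for each level of a hierarchy.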
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id with rollup;
is equivalent to
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
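The grouping sets listed above are exactly the prefixes of the GROUP BY column list, from the full list down to the empty set. A small Python sketch (the function name `rollup_grouping_sets` is made up for illustration) generates them:

```python
def rollup_grouping_sets(columns):
    """Return the prefixes of `columns`, longest first, ending with the empty set."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


columns = ["device_id", "os_id", "app_id", "client_version", "from_id"]
rollup_sets = rollup_grouping_sets(columns)

for grouping_set in rollup_sets:
    print("(" + ",".join(grouping_set) + ")")
```

For n columns ROLLUP therefore produces n + 1 grouping sets, whereas CUBE produces 2**n.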
Grouping__ID function
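When a column is not part of the current grouping, its value is shown as NULL. This can be confused with rows in which the column itself is NULL, so we need a way to tell whether the column was aggregated away or its value really is NULL. (Write a small algorithm that enumerates the combinations and it becomes clear immediately: GROUPING__ID is essentially the sum of the binary bit values assigned to the grouped columns.)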
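Following that suggestion, here is a small Python sketch of the bit arithmetic. It assumes the convention of Hive 2.3.0 and later, where a bit is 1 when the column is not part of the grouping set and the first GROUP BY column is the most significant bit; this matches the grouping__id column in the CUBE example. Earlier Hive versions used a different bit convention, so check the documentation for your version. The function names are made up for illustration.

```python
def grouping_id(columns, grouping_set):
    """Assumed Hive 2.3.0+ convention: bit is 1 when the column is aggregated away.

    The first GROUP BY column is the most significant bit.
    """
    gid = 0
    for col in columns:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid


def presence_mask(columns, grouping_set):
    """Bit string with 1 for each column that IS grouped on, first column leftmost."""
    return "".join("1" if col in grouping_set else "0" for col in columns)


columns = ("f1", "f2", "f3")
examples = [
    (),
    ("f3",),
    ("f2",),
    ("f2", "f3"),
    ("f1",),
    ("f1", "f3"),
    ("f1", "f2"),
    ("f1", "f2", "f3"),
]

for grouping_set in examples:
    print(grouping_set, grouping_id(columns, grouping_set),
          presence_mask(columns, grouping_set))
```

The numbers reproduce the grouping__id column of the CUBE output (7 for the grand total, 0 for the full grouping), and the presence mask reads like the _c5 bit string there, with 1 for every column that is grouped on. A row whose f1 is NULL while the f1 bit of grouping_id is 0 therefore holds a genuine NULL value rather than an aggregated column.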
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup