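Analytic functions compute an aggregate value over a group of rows. Unlike aggregate functions, which return a single row per group, an analytic function returns a value for every row in the group.
The windowing clause defines the data window an analytic function operates on, and that window can change from row to row. What exactly a data window is will become clear in the examples below.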
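Basic structure:
analytic function (e.g. sum(), max(), row_number(), ...) + window clause (the over clause)
Form of the over clause: over(partition by cookie_id order by create_time). Rows are first partitioned by cookie_id, so rows with the same cookie_id fall into the same partition; within each partition, rows are sorted by create_time (ascending by default).
Note: without partition by, the entire data set is treated as a single partition. Omitting order by changes the result of some functions, such as sum().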
Version: Hive 1.1.0-cdh5.13.3
Test table
create table if not exists test (
  cookie_id string,
  create_time string,
  pv int
)
row format delimited fields terminated by ',';
Test data
a,2017-12-01,3
b,2017-12-02,3
cookie1,2017-12-02,4
cookie1,2017-12-03,2
cookie1,2017-12-04,3
cookie1,2017-12-05,1
cookie1,2017-12-06,6
cookie1,2017-12-07,7
cookie2,2017-12-02,1
cookie2,2017-12-04,2
cookie3,2017-12-06,7
cookie3,2017-12-03,5
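SUM, AVG, MIN, MAX
Used to compute both whole-partition totals and running (cumulative) statistics within a partition.
Taking SUM as the example: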
SELECT cookie_id, create_time, pv,
  -- default frame: start of the partition through the current row
  SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time) AS pv1,
  -- start of the partition through the current row; same result as pv1 on this data
  SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,
  -- current row + the 3 rows before it
  SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3,
  -- current row + the 3 rows before it + the 1 row after it
  SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4,
  -- current row + all rows after it
  SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5
FROM test;
Result:
+-----------+-------------+----+-----+-----+-----+-----+-----+
| cookie_id | create_time | pv | pv1 | pv2 | pv3 | pv4 | pv5 |
+-----------+-------------+----+-----+-----+-----+-----+-----+
| a         | 2017-12-01  | 3  | 3   | 3   | 3   | 3   | 3   |
| b         | 2017-12-02  | 3  | 3   | 3   | 3   | 3   | 3   |
| cookie1   | 2017-12-02  | 4  | 4   | 4   | 4   | 6   | 23  |
| cookie1   | 2017-12-03  | 2  | 6   | 6   | 6   | 9   | 19  |
| cookie1   | 2017-12-04  | 3  | 9   | 9   | 9   | 10  | 17  |
| cookie1   | 2017-12-05  | 1  | 10  | 10  | 10  | 16  | 14  |
| cookie1   | 2017-12-06  | 6  | 16  | 16  | 12  | 19  | 13  |
| cookie1   | 2017-12-07  | 7  | 23  | 23  | 17  | 17  | 7   |
| cookie2   | 2017-12-02  | 1  | 1   | 1   | 1   | 3   | 3   |
| cookie2   | 2017-12-04  | 2  | 3   | 3   | 3   | 3   | 2   |
| cookie3   | 2017-12-03  | 5  | 5   | 5   | 5   | 12  | 12  |
| cookie3   | 2017-12-06  | 7  | 12  | 12  | 12  | 12  | 7   |
+-----------+-------------+----+-----+-----+-----+-----+-----+
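Note: frames are always evaluated inside a partition. A frame that would extend past the partition boundary is clipped to the rows that actually exist.
As the result shows, when ROWS BETWEEN is not specified, the default window runs from the start of the partition to the current row. Strictly speaking, Hive's default frame with ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which also includes any rows that tie with the current row on the ORDER BY key. Because create_time is unique within each partition here, pv1 and pv2 come out identical.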
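The key is understanding ROWS BETWEEN, also called the window clause:
- PRECEDING: rows before the current row
- FOLLOWING: rows after the current row
- CURRENT ROW: the current row
- UNBOUNDED: no bound. UNBOUNDED PRECEDING starts at the first row of the partition; UNBOUNDED FOLLOWING runs through the last row of the partition
AVG, MIN, and MAX are used the same way as SUM.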
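To make the frame boundaries concrete, here is a minimal Python sketch that emulates SUM over a ROWS BETWEEN frame for a single, already sorted partition. The function name `rows_frame_sum` and its parameters are hypothetical, introduced only for illustration; this is not how Hive evaluates frames internally. It uses the cookie1 pv values from the test data, ordered by create_time.

```python
from typing import List, Optional


def rows_frame_sum(values: List[int],
                   preceding: Optional[int],
                   following: Optional[int]) -> List[int]:
    """Emulate SUM(v) OVER (ORDER BY ... ROWS BETWEEN p PRECEDING AND f FOLLOWING).

    `values` is one partition, already sorted by the ORDER BY key.
    None stands for UNBOUNDED, 0 stands for CURRENT ROW.
    """
    n = len(values)
    sums = []
    for i in range(n):
        # Clip the frame to the partition: it never reaches outside it.
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        sums.append(sum(values[start:end + 1]))
    return sums


# pv values of cookie1, ordered by create_time
cookie1_pv = [4, 2, 3, 1, 6, 7]

pv2 = rows_frame_sum(cookie1_pv, None, 0)  # UNBOUNDED PRECEDING .. CURRENT ROW
pv3 = rows_frame_sum(cookie1_pv, 3, 0)     # 3 PRECEDING .. CURRENT ROW
pv4 = rows_frame_sum(cookie1_pv, 3, 1)     # 3 PRECEDING .. 1 FOLLOWING
pv5 = rows_frame_sum(cookie1_pv, 0, None)  # CURRENT ROW .. UNBOUNDED FOLLOWING

print(pv2)  # [4, 6, 9, 10, 16, 23]
print(pv3)  # [4, 6, 9, 10, 12, 17]
print(pv4)  # [6, 9, 10, 16, 19, 17]
print(pv5)  # [23, 19, 17, 14, 13, 7]
```

The printed lists match the pv2 to pv5 columns for cookie1 in the result table.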
What happens without ORDER BY?
select cookie_id, create_time, pv,
  sum(pv) over(PARTITION BY cookie_id) as pv1
FROM test;
Result:
+-----------+-------------+----+-----+
| cookie_id | create_time | pv | pv1 |
+-----------+-------------+----+-----+
| a         | 2017-12-01  | 3  | 3   |
| b         | 2017-12-02  | 3  | 3   |
| cookie1   | 2017-12-07  | 7  | 23  |
| cookie1   | 2017-12-06  | 6  | 23  |
| cookie1   | 2017-12-05  | 1  | 23  |
| cookie1   | 2017-12-04  | 3  | 23  |
| cookie1   | 2017-12-03  | 2  | 23  |
| cookie1   | 2017-12-02  | 4  | 23  |
| cookie2   | 2017-12-04  | 2  | 3   |
| cookie2   | 2017-12-02  | 1  | 3   |
| cookie3   | 2017-12-03  | 5  | 12  |
| cookie3   | 2017-12-06  | 7  | 12  |
+-----------+-------------+----+-----+
As shown, without order by the rows inside a partition are not sorted, and sum() returns the pv total of the whole partition on every row.
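Note: max() follows the same frame rules as sum(). Without order by it returns the partition-wide maximum on every row. With order by and no explicit frame, the default frame ends at the current row, so the result is the running maximum so far. To get the partition-wide maximum together with an order by, spell out the frame as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.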
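The effect of ORDER BY on the default frame can be sketched in a few lines of Python. This is an illustration under the assumption that the default frame with ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, as the Hive documentation describes; the function name `default_frame_sum` is hypothetical, and the tie example at the end uses made-up rows that are not part of the test table.

```python
from typing import List, Tuple


def default_frame_sum(rows: List[Tuple[str, int]], ordered: bool) -> List[int]:
    """Emulate SUM(value) OVER (PARTITION BY ... [ORDER BY key]) for one partition.

    rows: (order_key, value) pairs of a single partition.
    ordered=False: no ORDER BY, so the frame is the whole partition.
    ordered=True: default RANGE frame, i.e. every row whose key is <= the
                  current key, which includes rows tied with the current one.
    """
    if not ordered:
        total = sum(value for _, value in rows)
        return [total] * len(rows)
    return [sum(v for k, v in rows if k <= key) for key, _ in rows]


cookie1 = [
    ("2017-12-02", 4), ("2017-12-03", 2), ("2017-12-04", 3),
    ("2017-12-05", 1), ("2017-12-06", 6), ("2017-12-07", 7),
]

print(default_frame_sum(cookie1, ordered=False))  # [23, 23, 23, 23, 23, 23]
print(default_frame_sum(cookie1, ordered=True))   # [4, 6, 9, 10, 16, 23]

# Hypothetical rows with a tie on the order key: both "d1" rows see each other.
tied = [("d1", 1), ("d1", 2), ("d2", 3)]
print(default_frame_sum(tied, ordered=True))      # [3, 3, 6]
```

With a ROWS frame the tied rows would instead get 1 and 3, which is why the two frame types only agree when the ORDER BY key is unique within the partition.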
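ROW_NUMBER function
ROW_NUMBER() numbers the rows within a partition sequentially, starting at 1, in the order given by ORDER BY.
ROW_NUMBER() has many uses, for example fetching the top N rows per group, or fetching the first referrer in a session.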
SELECT cookie_id, create_time, pv,
  ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rn
FROM test;
Result:
+-----------+-------------+----+----+
| cookie_id | create_time | pv | rn |
+-----------+-------------+----+----+
| a         | 2017-12-01  | 3  | 1  |
| b         | 2017-12-02  | 3  | 1  |
| cookie1   | 2017-12-07  | 7  | 1  |
| cookie1   | 2017-12-06  | 6  | 2  |
| cookie1   | 2017-12-02  | 4  | 3  |
| cookie1   | 2017-12-04  | 3  | 4  |
| cookie1   | 2017-12-03  | 2  | 5  |
| cookie1   | 2017-12-05  | 1  | 6  |
| cookie2   | 2017-12-04  | 2  | 1  |
| cookie2   | 2017-12-02  | 1  | 2  |
| cookie3   | 2017-12-06  | 7  | 1  |
| cookie3   | 2017-12-03  | 5  | 2  |
+-----------+-------------+----+----+
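The top-N-per-group use case mentioned above amounts to keeping the rows whose row number is at most N. A minimal Python sketch of that idea, using the test data; `top_n_by_pv` is a hypothetical helper name, and in Hive the same filter would be a `WHERE rn <= N` on a subquery that computes rn:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (cookie_id, create_time, pv)

ROWS: List[Row] = [
    ("a", "2017-12-01", 3), ("b", "2017-12-02", 3),
    ("cookie1", "2017-12-02", 4), ("cookie1", "2017-12-03", 2),
    ("cookie1", "2017-12-04", 3), ("cookie1", "2017-12-05", 1),
    ("cookie1", "2017-12-06", 6), ("cookie1", "2017-12-07", 7),
    ("cookie2", "2017-12-02", 1), ("cookie2", "2017-12-04", 2),
    ("cookie3", "2017-12-06", 7), ("cookie3", "2017-12-03", 5),
]


def top_n_by_pv(rows: List[Row], n: int) -> List[Tuple[str, str, int, int]]:
    """Keep the n highest-pv rows per cookie_id, with their row number appended."""
    partitions: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)  # PARTITION BY cookie_id

    result = []
    for cookie_id in sorted(partitions):
        # ORDER BY pv DESC inside the partition
        ordered = sorted(partitions[cookie_id], key=lambda r: r[2], reverse=True)
        for rn, row in enumerate(ordered, start=1):  # ROW_NUMBER() starts at 1
            if rn <= n:
                result.append(row + (rn,))
    return result


top2 = top_n_by_pv(ROWS, 2)
for row in top2:
    print(row)
```

Partitions with fewer than N rows (a and b here) simply contribute all of their rows.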
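RANK and DENSE_RANK functions
RANK() returns the rank of each row within its partition; tied rows share a rank and leave a gap in the ranks that follow.
DENSE_RANK() returns the rank of each row within its partition; tied rows share a rank and leave no gap.
Comparing rank, dense_rank, and row_number side by side makes the difference clear: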
SELECT cookie_id, create_time, pv,
  RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rank_res,
  DENSE_RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS dense_rank_res,
  ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS row_number_res
FROM test;
Result:
- To highlight the difference, the test table was adjusted slightly: the row cookie1,2017-12-02,4 was changed to cookie1,2017-12-02,3
- Focus on cookie1: the rows for 2017-12-04 and 2017-12-02 both have pv = 3
+-----------+-------------+----+----------+----------------+----------------+
| cookie_id | create_time | pv | rank_res | dense_rank_res | row_number_res |
+-----------+-------------+----+----------+----------------+----------------+
| a         | 2017-12-01  | 3  | 1        | 1              | 1              |
| b         | 2017-12-02  | 3  | 1        | 1              | 1              |
| cookie1   | 2017-12-07  | 7  | 1        | 1              | 1              |
| cookie1   | 2017-12-06  | 6  | 2        | 2              | 2              |
| cookie1   | 2017-12-04  | 3  | 3        | 3              | 3              |
| cookie1   | 2017-12-02  | 3  | 3        | 3              | 4              |
| cookie1   | 2017-12-03  | 2  | 5        | 4              | 5              |
| cookie1   | 2017-12-05  | 1  | 6        | 5              | 6              |
| cookie2   | 2017-12-04  | 2  | 1        | 1              | 1              |
| cookie2   | 2017-12-02  | 1  | 2        | 2              | 2              |
| cookie3   | 2017-12-06  | 7  | 1        | 1              | 1              |
| cookie3   | 2017-12-03  | 5  | 2        | 2              | 2              |
+-----------+-------------+----+----------+----------------+----------------+
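The three numbering rules differ only in what happens when a value repeats. A minimal Python sketch, assuming the input is one partition already sorted by pv descending; `rank_all` is a hypothetical name used only here:

```python
from typing import List, Tuple


def rank_all(values_desc: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (value, rank, dense_rank, row_number) for a sorted partition."""
    out = []
    rank = 0
    dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(values_desc, start=1):
        if value != previous:
            rank = row_number   # RANK jumps to the row position, leaving gaps
            dense_rank += 1     # DENSE_RANK advances by exactly one
            previous = value
        out.append((value, rank, dense_rank, row_number))
    return out


# cookie1 pv values after the adjustment described above, sorted descending
ranked = rank_all([7, 6, 3, 3, 2, 1])
for row in ranked:
    print(row)
# (7, 1, 1, 1)
# (6, 2, 2, 2)
# (3, 3, 3, 3)
# (3, 3, 3, 4)
# (2, 5, 4, 5)
# (1, 6, 5, 6)
```

Note that which of the two tied rows gets row number 3 and which gets 4 depends on the order in which they arrive; the ORDER BY key alone does not decide it.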
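LAG and LEAD functions
LAG(col,n,DEFAULT) returns the value of col from the n-th row before the current row within the window.
The first argument is the column name, the second is how many rows to look back (default 1), and the third is the default value, returned when there is no row n rows back (for example at the start of a partition). If no default is given, NULL is returned.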
SELECT cookie_id, create_time,
  LAG(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1,
  LAG(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1_with_default
FROM test;
Result:
+-----------+-------------+------------+-------------------+
| cookie_id | create_time | lag1       | lag1_with_default |
+-----------+-------------+------------+-------------------+
| a         | 2017-12-01  | NULL       | 1970-01-01        |
| b         | 2017-12-02  | NULL       | 1970-01-01        |
| cookie1   | 2017-12-02  | NULL       | 1970-01-01        |
| cookie1   | 2017-12-03  | 2017-12-02 | 2017-12-02        |
| cookie1   | 2017-12-04  | 2017-12-03 | 2017-12-03        |
| cookie1   | 2017-12-05  | 2017-12-04 | 2017-12-04        |
| cookie1   | 2017-12-06  | 2017-12-05 | 2017-12-05        |
| cookie1   | 2017-12-07  | 2017-12-06 | 2017-12-06        |
| cookie2   | 2017-12-02  | NULL       | 1970-01-01        |
| cookie2   | 2017-12-04  | 2017-12-02 | 2017-12-02        |
| cookie3   | 2017-12-03  | NULL       | 1970-01-01        |
| cookie3   | 2017-12-06  | 2017-12-03 | 2017-12-03        |
+-----------+-------------+------------+-------------------+
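LEAD is the mirror image of LAG: LEAD(col,n,DEFAULT) returns the value of col from the n-th row after the current row within the window.
The first argument is the column name, the second is how many rows to look ahead (default 1), and the third is the default value, returned when there is no row n rows ahead (for example at the end of a partition). If no default is given, NULL is returned.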
SELECT cookie_id, create_time,
  LEAD(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag2,
  LEAD(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag2_with_default
FROM test;
Result:
+-----------+-------------+------------+-------------------+
| cookie_id | create_time | lag2       | lag2_with_default |
+-----------+-------------+------------+-------------------+
| a         | 2017-12-01  | NULL       | 1970-01-01        |
| b         | 2017-12-02  | NULL       | 1970-01-01        |
| cookie1   | 2017-12-02  | 2017-12-03 | 2017-12-03        |
| cookie1   | 2017-12-03  | 2017-12-04 | 2017-12-04        |
| cookie1   | 2017-12-04  | 2017-12-05 | 2017-12-05        |
| cookie1   | 2017-12-05  | 2017-12-06 | 2017-12-06        |
| cookie1   | 2017-12-06  | 2017-12-07 | 2017-12-07        |
| cookie1   | 2017-12-07  | NULL       | 1970-01-01        |
| cookie2   | 2017-12-02  | 2017-12-04 | 2017-12-04        |
| cookie2   | 2017-12-04  | NULL       | 1970-01-01        |
| cookie3   | 2017-12-03  | 2017-12-06 | 2017-12-06        |
| cookie3   | 2017-12-06  | NULL       | 1970-01-01        |
+-----------+-------------+------------+-------------------+
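Both functions are simple index shifts inside a sorted partition. A minimal Python sketch, where `lag` and `lead` are hypothetical helpers written for this illustration and None plays the role of NULL:

```python
from typing import List, Optional


def lag(values: List[str], n: int = 1,
        default: Optional[str] = None) -> List[Optional[str]]:
    """Value n rows before each row; `default` when no such row exists."""
    return [values[i - n] if i - n >= 0 else default
            for i in range(len(values))]


def lead(values: List[str], n: int = 1,
         default: Optional[str] = None) -> List[Optional[str]]:
    """Value n rows after each row; `default` when no such row exists."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


# create_time values of cookie2, ordered ascending
cookie2_dates = ["2017-12-02", "2017-12-04"]

print(lag(cookie2_dates))                 # [None, '2017-12-02']
print(lag(cookie2_dates, 1, "1970-01-01"))   # ['1970-01-01', '2017-12-02']
print(lead(cookie2_dates))                # ['2017-12-04', None]
print(lead(cookie2_dates, 1, "1970-01-01"))  # ['2017-12-04', '1970-01-01']
```

A typical use is pairing each visit with the previous or next one, for example to compute the gap between consecutive create_time values of the same cookie.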
GROUPING SETS, GROUPING__ID, CUBE, ROLLUP
Sample data:
+---------+------------+----------+
| month   | day        | cookieid |
+---------+------------+----------+
| 2015-03 | 2015-03-10 | cookie1  |
| 2015-03 | 2015-03-10 | cookie5  |
| 2015-03 | 2015-03-12 | cookie7  |
| 2015-04 | 2015-04-12 | cookie3  |
| 2015-04 | 2015-04-13 | cookie2  |
| 2015-04 | 2015-04-13 | cookie4  |
| 2015-04 | 2015-04-16 | cookie4  |
| 2015-03 | 2015-03-10 | cookie2  |
| 2015-03 | 2015-03-10 | cookie3  |
| 2015-04 | 2015-04-12 | cookie5  |
| 2015-04 | 2015-04-13 | cookie6  |
| 2015-04 | 2015-04-15 | cookie3  |
| 2015-04 | 2015-04-15 | cookie2  |
| 2015-04 | 2015-04-16 | cookie1  |
+---------+------------+----------+
Table:
create table test2 (
  month STRING,
  day STRING,
  cookieid STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
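GROUPING SETS
Aggregates over several different dimension combinations within a single GROUP BY query. It is equivalent to a UNION ALL of the GROUP BY result sets for each combination.
SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv
FROM test2
GROUP BY month, day
GROUPING SETS (month, day);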
Result:
+---------+------------+----+
| month   | day        | uv |
+---------+------------+----+
| NULL    | 2015-03-10 | 4  |
| NULL    | 2015-03-12 | 1  |
| NULL    | 2015-04-12 | 2  |
| NULL    | 2015-04-13 | 3  |
| NULL    | 2015-04-15 | 2  |
| NULL    | 2015-04-16 | 2  |
| 2015-03 | NULL       | 5  |
| 2015-04 | NULL       | 6  |
+---------+------------+----+
Equivalent to:
SELECT month, NULL as day, COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY month
UNION ALL
SELECT NULL as month, day, COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY day;
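Extended version:
SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM test2
GROUP BY month, day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID;
GROUPING__ID indicates which grouping set a result row belongs to. The values shown here come from Hive 1.1.0; the Hive documentation notes that the way GROUPING__ID is computed changed in Hive 2.3.0, so newer versions may report different numbers for the same grouping sets.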
Result:
+---------+------------+----+--------------+
| month   | day        | uv | grouping__id |
+---------+------------+----+--------------+
| 2015-04 | NULL       | 6  | 1            |
| 2015-03 | NULL       | 5  | 1            |
| NULL    | 2015-03-10 | 4  | 2            |
| NULL    | 2015-04-16 | 2  | 2            |
| NULL    | 2015-04-15 | 2  | 2            |
| NULL    | 2015-04-13 | 3  | 2            |
| NULL    | 2015-04-12 | 2  | 2            |
| NULL    | 2015-03-12 | 1  | 2            |
| 2015-04 | 2015-04-16 | 2  | 3            |
| 2015-04 | 2015-04-12 | 2  | 3            |
| 2015-04 | 2015-04-13 | 3  | 3            |
| 2015-03 | 2015-03-12 | 1  | 3            |
| 2015-03 | 2015-03-10 | 4  | 3            |
| 2015-04 | 2015-04-15 | 2  | 3            |
+---------+------------+----+--------------+
Equivalent to:
SELECT month, NULL as day, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM test2 GROUP BY month
UNION ALL
SELECT NULL as month, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM test2 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM test2 GROUP BY month, day;
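The "one GROUP BY per grouping set, then UNION ALL" equivalence can be sketched directly in Python. This is only an illustration of the semantics on the sample data, not of how Hive executes the query; `grouping_sets_uv` and `COLUMNS` are hypothetical names, and None stands in for NULL.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

COLUMNS = ("month", "day")

ROWS: List[Tuple[str, str, str]] = [  # (month, day, cookieid)
    ("2015-03", "2015-03-10", "cookie1"), ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"), ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"), ("2015-04", "2015-04-13", "cookie4"),
    ("2015-04", "2015-04-16", "cookie4"), ("2015-03", "2015-03-10", "cookie2"),
    ("2015-03", "2015-03-10", "cookie3"), ("2015-04", "2015-04-12", "cookie5"),
    ("2015-04", "2015-04-13", "cookie6"), ("2015-04", "2015-04-15", "cookie3"),
    ("2015-04", "2015-04-15", "cookie2"), ("2015-04", "2015-04-16", "cookie1"),
]

Key = Tuple[Optional[str], Optional[str]]


def grouping_sets_uv(rows: List[Tuple[str, str, str]],
                     grouping_sets: Sequence[Sequence[str]]
                     ) -> List[Tuple[Optional[str], Optional[str], int]]:
    """COUNT(DISTINCT cookieid) per grouping set, concatenated like UNION ALL."""
    result = []
    for gset in grouping_sets:
        groups: Dict[Key, Set[str]] = {}
        for month, day, cookieid in rows:
            record = {"month": month, "day": day}
            # Columns outside the current grouping set are reported as NULL.
            key = tuple(record[c] if c in gset else None for c in COLUMNS)
            groups.setdefault(key, set()).add(cookieid)
        result.extend((k[0], k[1], len(cookies)) for k, cookies in groups.items())
    return result


uv = grouping_sets_uv(ROWS, [("month",), ("day",)])
for row in uv:
    print(row)
```

Passing `[("month",), ("day",), ("month", "day")]` instead gives the rows of the extended query, without the GROUPING__ID column.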
CUBE
Aggregates over every combination of the GROUP BY dimensions.
SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM test2
GROUP BY month, day
WITH CUBE
ORDER BY GROUPING__ID;
Result:
+---------+------------+----+--------------+
| month   | day        | uv | grouping__id |
+---------+------------+----+--------------+
| NULL    | NULL       | 7  | 0            |
| 2015-03 | NULL       | 5  | 1            |
| 2015-04 | NULL       | 6  | 1            |
| NULL    | 2015-04-16 | 2  | 2            |
| NULL    | 2015-04-15 | 2  | 2            |
| NULL    | 2015-04-13 | 3  | 2            |
| NULL    | 2015-04-12 | 2  | 2            |
| NULL    | 2015-03-12 | 1  | 2            |
| NULL    | 2015-03-10 | 4  | 2            |
| 2015-04 | 2015-04-12 | 2  | 3            |
| 2015-04 | 2015-04-16 | 2  | 3            |
| 2015-03 | 2015-03-12 | 1  | 3            |
| 2015-03 | 2015-03-10 | 4  | 3            |
| 2015-04 | 2015-04-15 | 2  | 3            |
| 2015-04 | 2015-04-13 | 3  | 3            |
+---------+------------+----+--------------+
Equivalent to:
SELECT NULL as month, NULL as day, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM test2
UNION ALL
SELECT month, NULL as day, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM test2 GROUP BY month
UNION ALL
SELECT NULL as month, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM test2 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM test2 GROUP BY month, day;
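ROLLUP
A subset of CUBE: it takes the leftmost dimension as the root and aggregates hierarchically from that dimension.
For example, hierarchical aggregation rooted at the month dimension:
SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM test2
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;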
Result:
+---------+------------+----+--------------+
| month   | day        | uv | grouping__id |
+---------+------------+----+--------------+
| NULL    | NULL       | 7  | 0            |
| 2015-04 | NULL       | 6  | 1            |
| 2015-03 | NULL       | 5  | 1            |
| 2015-04 | 2015-04-16 | 2  | 3            |
| 2015-04 | 2015-04-15 | 2  | 3            |
| 2015-04 | 2015-04-13 | 3  | 3            |
| 2015-04 | 2015-04-12 | 2  | 3            |
| 2015-03 | 2015-03-12 | 1  | 3            |
| 2015-03 | 2015-03-10 | 4  | 3            |
+---------+------------+----+--------------+
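This implements the drill-up path: UV per month and day -> UV per month -> total UV
If month and day are swapped, the hierarchy is rooted at the day dimension instead:
SELECT day, month,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM test2
GROUP BY day, month
WITH ROLLUP
ORDER BY GROUPING__ID;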
Result:
+------------+---------+----+--------------+
| day        | month   | uv | grouping__id |
+------------+---------+----+--------------+
| NULL       | NULL    | 7  | 0            |
| 2015-04-12 | NULL    | 2  | 1            |
| 2015-04-15 | NULL    | 2  | 1            |
| 2015-03-12 | NULL    | 1  | 1            |
| 2015-04-16 | NULL    | 2  | 1            |
| 2015-03-10 | NULL    | 4  | 1            |
| 2015-04-13 | NULL    | 3  | 1            |
| 2015-04-16 | 2015-04 | 2  | 3            |
| 2015-04-15 | 2015-04 | 2  | 3            |
| 2015-04-13 | 2015-04 | 3  | 3            |
| 2015-03-12 | 2015-03 | 1  | 3            |
| 2015-03-10 | 2015-03 | 4  | 3            |
| 2015-04-12 | 2015-04 | 2  | 3            |
+------------+---------+----+--------------+
This implements the drill-up path: UV per day and month -> UV per day -> total UV
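CUBE and ROLLUP can both be read as shorthand for a list of grouping sets: CUBE expands to every subset of the GROUP BY columns, ROLLUP to every prefix. The Python sketch below enumerates those sets. The names `cube_sets`, `rollup_sets`, and `legacy_grouping_id` are hypothetical; the id formula (first GROUP BY column is the lowest bit, a bit is set when the column takes part in the grouping) is inferred from the result tables in this post, which come from Hive 1.1.0, and should not be assumed to hold for other versions.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def cube_sets(columns: Sequence[str]) -> List[Tuple[str, ...]]:
    """All subsets of the GROUP BY columns, from the empty set to the full set."""
    cols = tuple(columns)
    return [combo
            for size in range(len(cols) + 1)
            for combo in combinations(cols, size)]


def rollup_sets(columns: Sequence[str]) -> List[Tuple[str, ...]]:
    """All prefixes of the GROUP BY columns, from the full list down to ()."""
    cols = tuple(columns)
    return [cols[:size] for size in range(len(cols), -1, -1)]


def legacy_grouping_id(columns: Sequence[str], grouping_set: Sequence[str]) -> int:
    """Id matching the tables above: bit i is set if column i is grouped on."""
    return sum(1 << i for i, col in enumerate(columns) if col in grouping_set)


print(cube_sets(("month", "day")))
# [(), ('month',), ('day',), ('month', 'day')]
print(rollup_sets(("month", "day")))
# [('month', 'day'), ('month',), ()]
print(rollup_sets(("day", "month")))
# [('day', 'month'), ('day',), ()]

for gset in cube_sets(("month", "day")):
    print(gset, legacy_grouping_id(("month", "day"), gset))
# () 0
# ('month',) 1
# ('day',) 2
# ('month', 'day') 3
```

ROLLUP on (month, day) never produces the ('day',) set, which is why no rows with grouping__id 2 appear in its result, and swapping the column order changes which single-column set survives.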
Reference: http://lxw1234.com/archives/category/hive
Reference: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup