A Summary of Hive Window Functions


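An analytic function computes an aggregate value over a group of rows. Unlike an aggregate function, which returns a single row per group, an analytic function returns a value for every row in the group.

The window clause specifies the data window that the analytic function operates on, and this window can change from row to row. What exactly a data window is will become clear in the examples below.

Basic structure:

analytic function (such as sum(), max(), row_number(), ...) + window clause (the over clause)

The over clause is written like this: over(partition by cookie_id order by create_time). Rows are first partitioned by the cookie_id column, so rows with the same cookie_id fall into one partition; within each partition, rows are sorted by create_time (ascending by default).

Note: without partition by, the whole data set is treated as a single partition. Without order by, the results of some functions, such as sum(), are affected.

Version: 1.1.0-cdh5.13.3

Test table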

create table if not exists test (
cookie_id string,
create_time string,
pv int
)row format delimited fields terminated by ',';

 

Test data

a,2017-12-01,3
b,2017-12-02,3
cookie1,2017-12-02,4
cookie1,2017-12-03,2
cookie1,2017-12-04,3
cookie1,2017-12-05,1
cookie1,2017-12-06,6
cookie1,2017-12-07,7
cookie2,2017-12-02,1
cookie2,2017-12-04,2
cookie3,2017-12-06,7
cookie3,2017-12-03,5

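SUM, AVG, MIN, MAX

These functions compute totals over a whole partition as well as running (cumulative) totals within it.

Using SUM as the example: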

SELECT cookie_id,create_time,pv,
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time) AS pv1, -- default frame: from the start of the partition to the current row
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start of the partition to the current row; same result as pv1 here
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3, -- current row + the 3 preceding rows
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4, -- current row + the 3 preceding rows + the 1 following row
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5 -- current row + all following rows
FROM test;

 

Result:

+------------+--------------+-----+------+------+------+------+------+--+
| cookie_id  | create_time  | pv  | pv1  | pv2  | pv3  | pv4  | pv5  |
+------------+--------------+-----+------+------+------+------+------+--+
| a          | 2017-12-01   | 3   | 3    | 3    | 3    | 3    | 3    |
| b          | 2017-12-02   | 3   | 3    | 3    | 3    | 3    | 3    |
| cookie1    | 2017-12-02   | 4   | 4    | 4    | 4    | 6    | 23   |
| cookie1    | 2017-12-03   | 2   | 6    | 6    | 6    | 9    | 19   |
| cookie1    | 2017-12-04   | 3   | 9    | 9    | 9    | 10   | 17   |
| cookie1    | 2017-12-05   | 1   | 10   | 10   | 10   | 16   | 14   |
| cookie1    | 2017-12-06   | 6   | 16   | 16   | 12   | 19   | 13   |
| cookie1    | 2017-12-07   | 7   | 23   | 23   | 17   | 17   | 7    |
| cookie2    | 2017-12-02   | 1   | 1    | 1    | 1    | 3    | 3    |
| cookie2    | 2017-12-04   | 2   | 3    | 3    | 3    | 3    | 2    |
| cookie3    | 2017-12-03   | 5   | 5    | 5    | 5    | 12   | 12   |
| cookie3    | 2017-12-06   | 7   | 12   | 12   | 12   | 12   | 7    |
+------------+--------------+-----+------+------+------+------+------+--+

 

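Note: every window frame is evaluated inside its partition. A frame that would extend past the partition boundary is truncated at that boundary.

As the result shows, when ORDER BY is present and no window clause is given, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, that is, from the start of the partition to the current row. pv1 equals pv2 here because create_time has no duplicates within a partition.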
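The default RANGE frame and an explicit ROWS frame only differ when the ORDER BY column has ties. The Python sketch below emulates both on a small hypothetical partition. The rows and the function names are illustrative and not part of the test table, and the code models the semantics only, not Hive's implementation.

```python
# Hypothetical partition, already sorted by the ORDER BY key: (create_time, pv).
# Two rows share the same create_time, so they are peers under RANGE.
partition = [("2017-12-02", 4), ("2017-12-02", 2), ("2017-12-03", 3)]

def running_sum_rows(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: physical rows only."""
    total, out = 0, []
    for _, pv in rows:
        total += pv
        out.append(total)
    return out

def running_sum_range(rows):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: peers are included."""
    return [sum(pv for key, pv in rows if key <= current) for current, _ in rows]

print(running_sum_rows(partition))   # [4, 6, 9]
print(running_sum_range(partition))  # [6, 6, 9]
```

Under RANGE, both rows dated 2017-12-02 see the same total because each one includes its peer.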

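The key is to understand ROWS BETWEEN, also called the window clause:

  • PRECEDING: rows before the current row

  • FOLLOWING: rows after the current row

  • CURRENT ROW: the current row

  • UNBOUNDED: no bound. UNBOUNDED PRECEDING means starting from the first row of the partition; UNBOUNDED FOLLOWING means ending at the last row of the partition

AVG, MIN, and MAX are used the same way as SUM.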
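To make the frame boundaries concrete, here is a minimal Python sketch that emulates SUM over a ROWS BETWEEN frame for one partition. The helper name window_sum and its convention (None for UNBOUNDED, 0 for CURRENT ROW) are assumptions made for this illustration. The input is the pv column of cookie1, ordered by create_time.

```python
def window_sum(values, preceding, following):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None stands for UNBOUNDED and 0 stands for CURRENT ROW.
    The frame is clipped to the partition, here the whole list.
    """
    last = len(values) - 1
    sums = []
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        end = last if following is None else min(last, i + following)
        sums.append(sum(values[start:end + 1]))
    return sums

# pv values of cookie1, ordered by create_time
cookie1_pv = [4, 2, 3, 1, 6, 7]

pv2 = window_sum(cookie1_pv, None, 0)  # UNBOUNDED PRECEDING .. CURRENT ROW
pv3 = window_sum(cookie1_pv, 3, 0)     # 3 PRECEDING .. CURRENT ROW
pv4 = window_sum(cookie1_pv, 3, 1)     # 3 PRECEDING .. 1 FOLLOWING
pv5 = window_sum(cookie1_pv, 0, None)  # CURRENT ROW .. UNBOUNDED FOLLOWING

print(pv2)  # [4, 6, 9, 10, 16, 23]
print(pv3)  # [4, 6, 9, 10, 12, 17]
print(pv4)  # [6, 9, 10, 16, 19, 17]
print(pv5)  # [23, 19, 17, 14, 13, 7]
```

These lists match the pv2 to pv5 columns of the cookie1 rows in the result table.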

What happens without ORDER BY?

select cookie_id,create_time,pv,
sum(pv) over(PARTITION BY cookie_id) as pv1 
FROM test;

 

Result:

+------------+--------------+-----+------+--+
| cookie_id  | create_time  | pv  | pv1  |
+------------+--------------+-----+------+--+
| a          | 2017-12-01   | 3   | 3    |
| b          | 2017-12-02   | 3   | 3    |
| cookie1    | 2017-12-07   | 7   | 23   |
| cookie1    | 2017-12-06   | 6   | 23   |
| cookie1    | 2017-12-05   | 1   | 23   |
| cookie1    | 2017-12-04   | 3   | 23   |
| cookie1    | 2017-12-03   | 2   | 23   |
| cookie1    | 2017-12-02   | 4   | 23   |
| cookie2    | 2017-12-04   | 2   | 3    |
| cookie2    | 2017-12-02   | 1   | 3    |
| cookie3    | 2017-12-03   | 5   | 12   |
| cookie3    | 2017-12-06   | 7   | 12   |
+------------+--------------+-----+------+--+

 

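As the result shows, without ORDER BY the rows within a partition are not sorted, and sum() returns the total pv of the whole partition on every row.

Note: max() follows the same rule. Without ORDER BY it returns the maximum of the whole partition; with ORDER BY and no window clause it returns the running maximum from the start of the partition to the current row.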
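The effect of ORDER BY on the default frame can be checked with a short Python sketch. It uses the pv values of cookie1 in create_time order; the variable names are illustrative, and the code only models the frame semantics described above.

```python
from itertools import accumulate

# pv values of cookie1, ordered by create_time
cookie1_pv = [4, 2, 3, 1, 6, 7]

# Without ORDER BY the frame is the whole partition,
# so every row gets the same value.
sum_no_order = [sum(cookie1_pv)] * len(cookie1_pv)
max_no_order = [max(cookie1_pv)] * len(cookie1_pv)

# With ORDER BY and no window clause the frame runs from the start
# of the partition to the current row, so the values accumulate.
sum_with_order = list(accumulate(cookie1_pv))
max_with_order = list(accumulate(cookie1_pv, max))

print(sum_no_order)    # [23, 23, 23, 23, 23, 23]
print(max_no_order)    # [7, 7, 7, 7, 7, 7]
print(sum_with_order)  # [4, 6, 9, 10, 16, 23]
print(max_with_order)  # [4, 4, 4, 4, 6, 7]
```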

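ROW_NUMBER function

ROW_NUMBER() numbers the rows within each partition sequentially, starting from 1, in the order given by ORDER BY.

ROW_NUMBER() has many uses, such as fetching the top N rows of each group by some ordering, or fetching the first referrer in a session.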

SELECT cookie_id,create_time,pv,
ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rn  
FROM test;

 

Result:

+------------+--------------+-----+-----+--+
| cookie_id  | create_time  | pv  | rn  |
+------------+--------------+-----+-----+--+
| a          | 2017-12-01   | 3   | 1   |
| b          | 2017-12-02   | 3   | 1   |
| cookie1    | 2017-12-07   | 7   | 1   |
| cookie1    | 2017-12-06   | 6   | 2   |
| cookie1    | 2017-12-02   | 4   | 3   |
| cookie1    | 2017-12-04   | 3   | 4   |
| cookie1    | 2017-12-03   | 2   | 5   |
| cookie1    | 2017-12-05   | 1   | 6   |
| cookie2    | 2017-12-04   | 2   | 1   |
| cookie2    | 2017-12-02   | 1   | 2   |
| cookie3    | 2017-12-06   | 7   | 1   |
| cookie3    | 2017-12-03   | 5   | 2   |
+------------+--------------+-----+-----+--+
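The top N use case can be sketched in Python as follows. The function names with_row_number and top_n are hypothetical; the data are the cookie1 to cookie3 rows of the test table. In Hive the same filter would typically be written by wrapping the query above in a subquery and keeping the rows where rn <= N.

```python
from itertools import groupby
from operator import itemgetter

# (cookie_id, create_time, pv) rows taken from the test data above
rows = [
    ("cookie1", "2017-12-02", 4), ("cookie1", "2017-12-03", 2),
    ("cookie1", "2017-12-04", 3), ("cookie1", "2017-12-05", 1),
    ("cookie1", "2017-12-06", 6), ("cookie1", "2017-12-07", 7),
    ("cookie2", "2017-12-02", 1), ("cookie2", "2017-12-04", 2),
    ("cookie3", "2017-12-06", 7), ("cookie3", "2017-12-03", 5),
]

def with_row_number(rows):
    """Emulate ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv DESC)."""
    numbered = []
    by_cookie = sorted(rows, key=itemgetter(0))  # bring each partition together
    for _, group in groupby(by_cookie, key=itemgetter(0)):
        ordered = sorted(group, key=itemgetter(2), reverse=True)
        for rn, row in enumerate(ordered, start=1):
            numbered.append(row + (rn,))
    return numbered

def top_n(rows, n):
    """Keep the n rows with the highest pv in each partition."""
    return [row for row in with_row_number(rows) if row[3] <= n]

for row in top_n(rows, 1):
    print(row)
# ('cookie1', '2017-12-07', 7, 1)
# ('cookie2', '2017-12-04', 2, 1)
# ('cookie3', '2017-12-06', 7, 1)
```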

 

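RANK and DENSE_RANK functions

RANK() returns the rank of each row within its partition; tied rows share a rank and leave gaps in the sequence.

DENSE_RANK() returns the rank of each row within its partition; tied rows share a rank and leave no gaps.

Comparing rank, dense_rank, and row_number side by side makes the difference clear: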

SELECT cookie_id,create_time,pv,
RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rank_res,
DENSE_RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS dense_rank_res,
ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS row_number_res
FROM test;

 

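Result:

  1. To highlight the difference, the test table was adjusted slightly: the row cookie1,2017-12-02,4 was changed to cookie1,2017-12-02,3

  2. Focus on cookie1: the rows for 2017-12-04 and 2017-12-02 both have pv = 3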

+------------+--------------+-----+-----------+-----------------+-----------------+--+
| cookie_id  | create_time  | pv  | rank_res  | dense_rank_res  | row_number_res  |
+------------+--------------+-----+-----------+-----------------+-----------------+--+
| a          | 2017-12-01   | 3   | 1         | 1               | 1               |
| b          | 2017-12-02   | 3   | 1         | 1               | 1               |
| cookie1    | 2017-12-07   | 7   | 1         | 1               | 1               |
| cookie1    | 2017-12-06   | 6   | 2         | 2               | 2               |
| cookie1    | 2017-12-04   | 3   | 3         | 3               | 3               |
| cookie1    | 2017-12-02   | 3   | 3         | 3               | 4               |
| cookie1    | 2017-12-03   | 2   | 5         | 4               | 5               |
| cookie1    | 2017-12-05   | 1   | 6         | 5               | 6               |
| cookie2    | 2017-12-04   | 2   | 1         | 1               | 1               |
| cookie2    | 2017-12-02   | 1   | 2         | 2               | 2               |
| cookie3    | 2017-12-06   | 7   | 1         | 1               | 1               |
| cookie3    | 2017-12-03   | 5   | 2         | 2               | 2               |
+------------+--------------+-----+-----------+-----------------+-----------------+--+
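The three numbering schemes can be reproduced with a small Python sketch. The function name rank_all is an assumption for this illustration, and the input is the adjusted pv column of cookie1 sorted in descending order.

```python
def rank_all(sorted_values):
    """Return (rank, dense_rank, row_number) for values already sorted
    in the window's ORDER BY order."""
    result = []
    rank = dense_rank = 0
    previous = object()  # sentinel that equals no real value
    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = row_number  # jumps past tied rows, leaving gaps
            dense_rank += 1    # advances by exactly one, no gaps
            previous = value
        result.append((rank, dense_rank, row_number))
    return result

# Adjusted pv values of cookie1, ordered by pv DESC
cookie1_pv_desc = sorted([3, 2, 3, 1, 6, 7], reverse=True)

for pv, numbers in zip(cookie1_pv_desc, rank_all(cookie1_pv_desc)):
    print(pv, numbers)
# 7 (1, 1, 1)
# 6 (2, 2, 2)
# 3 (3, 3, 3)
# 3 (3, 3, 4)
# 2 (5, 4, 5)
# 1 (6, 5, 6)
```

The two rows with pv = 3 share rank 3 under both RANK and DENSE_RANK; the next row gets 5 under RANK and 4 under DENSE_RANK.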

 

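LAG and LEAD functions

LAG(col,n,DEFAULT) returns the value of col from the n-th row before the current row within the window.

The first argument is the column name, the second is the offset n (1 by default), and the third is the default value, returned when there is no n-th preceding row in the partition (NULL if not specified).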

SELECT cookie_id,create_time,
LAG(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1,
LAG(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1_with_default
FROM test;

 

Result:

+------------+--------------+-------------+--------------------+--+
| cookie_id  | create_time  |    lag1     | lag1_with_default  |
+------------+--------------+-------------+--------------------+--+
| a          | 2017-12-01   | NULL        | 1970-01-01         |
| b          | 2017-12-02   | NULL        | 1970-01-01         |
| cookie1    | 2017-12-02   | NULL        | 1970-01-01         |
| cookie1    | 2017-12-03   | 2017-12-02  | 2017-12-02         |
| cookie1    | 2017-12-04   | 2017-12-03  | 2017-12-03         |
| cookie1    | 2017-12-05   | 2017-12-04  | 2017-12-04         |
| cookie1    | 2017-12-06   | 2017-12-05  | 2017-12-05         |
| cookie1    | 2017-12-07   | 2017-12-06  | 2017-12-06         |
| cookie2    | 2017-12-02   | NULL        | 1970-01-01         |
| cookie2    | 2017-12-04   | 2017-12-02  | 2017-12-02         |
| cookie3    | 2017-12-03   | NULL        | 1970-01-01         |
| cookie3    | 2017-12-06   | 2017-12-03  | 2017-12-03         |
+------------+--------------+-------------+--------------------+--+

 

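LEAD is the opposite of LAG: LEAD(col,n,DEFAULT) returns the value of col from the n-th row after the current row within the window.

The first argument is the column name, the second is the offset n (1 by default), and the third is the default value, returned when there is no n-th following row in the partition (NULL if not specified).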

SELECT cookie_id,create_time,
LEAD(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lead1,
LEAD(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lead1_with_default
FROM test;

 

Result:

+------------+--------------+-------------+--------------------+--+
| cookie_id  | create_time  |    lead1    | lead1_with_default |
+------------+--------------+-------------+--------------------+--+
| a          | 2017-12-01   | NULL        | 1970-01-01         |
| b          | 2017-12-02   | NULL        | 1970-01-01         |
| cookie1    | 2017-12-02   | 2017-12-03  | 2017-12-03         |
| cookie1    | 2017-12-03   | 2017-12-04  | 2017-12-04         |
| cookie1    | 2017-12-04   | 2017-12-05  | 2017-12-05         |
| cookie1    | 2017-12-05   | 2017-12-06  | 2017-12-06         |
| cookie1    | 2017-12-06   | 2017-12-07  | 2017-12-07         |
| cookie1    | 2017-12-07   | NULL        | 1970-01-01         |
| cookie2    | 2017-12-02   | 2017-12-04  | 2017-12-04         |
| cookie2    | 2017-12-04   | NULL        | 1970-01-01         |
| cookie3    | 2017-12-03   | 2017-12-06  | 2017-12-06         |
| cookie3    | 2017-12-06   | NULL        | 1970-01-01         |
+------------+--------------+-------------+--------------------+--+
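The offset logic of both functions fits in a few lines of Python. The sketch below is a model of the behaviour described above, not Hive's implementation; it uses the create_time values of cookie2, already ordered within the partition.

```python
def lag(values, n=1, default=None):
    """Value n rows before the current row, or default if there is none."""
    return [values[i - n] if i - n >= 0 else default
            for i in range(len(values))]

def lead(values, n=1, default=None):
    """Value n rows after the current row, or default if there is none."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]

# create_time values of cookie2, ordered by create_time
cookie2_times = ["2017-12-02", "2017-12-04"]

print(lag(cookie2_times))                    # [None, '2017-12-02']
print(lag(cookie2_times, 1, "1970-01-01"))   # ['1970-01-01', '2017-12-02']
print(lead(cookie2_times))                   # ['2017-12-04', None]
print(lead(cookie2_times, 1, "1970-01-01"))  # ['2017-12-04', '1970-01-01']
```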

 

GROUPING SETS, GROUPING__ID, CUBE, ROLLUP

Prepare the data:

+----------+-------------+-----------+--+
|  month   |     day     | cookieid  |
+----------+-------------+-----------+--+
| 2015-03  | 2015-03-10  | cookie1   |
| 2015-03  | 2015-03-10  | cookie5   |
| 2015-03  | 2015-03-12  | cookie7   |
| 2015-04  | 2015-04-12  | cookie3   |
| 2015-04  | 2015-04-13  | cookie2   |
| 2015-04  | 2015-04-13  | cookie4   |
| 2015-04  | 2015-04-16  | cookie4   |
| 2015-03  | 2015-03-10  | cookie2   |
| 2015-03  | 2015-03-10  | cookie3   |
| 2015-04  | 2015-04-12  | cookie5   |
| 2015-04  | 2015-04-13  | cookie6   |
| 2015-04  | 2015-04-15  | cookie3   |
| 2015-04  | 2015-04-15  | cookie2   |
| 2015-04  | 2015-04-16  | cookie1   |
+----------+-------------+-----------+--+

Table:

create table test2(
month STRING,
day STRING, 
cookieid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

 

GROUPING SETS

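Within a single GROUP BY query, GROUPING SETS aggregates by several combinations of dimensions. It is equivalent to running a separate GROUP BY for each combination and combining the result sets with UNION ALL.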

SELECT month
    ,day
    ,COUNT(DISTINCT cookieid) AS uv
FROM test2
GROUP BY month,day 
GROUPING SETS(month, day);

 

Result:

+----------+-------------+-----+--+
|  month   |     day     | uv  |
+----------+-------------+-----+--+
| NULL     | 2015-03-10  | 4   |
| NULL     | 2015-03-12  | 1   |
| NULL     | 2015-04-12  | 2   |
| NULL     | 2015-04-13  | 3   |
| NULL     | 2015-04-15  | 2   |
| NULL     | 2015-04-16  | 2   |
| 2015-03  | NULL        | 5   |
| 2015-04  | NULL        | 6   |
+----------+-------------+-----+--+

This is equivalent to:

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY month 
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY day;
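The "one GROUP BY per grouping set, then UNION ALL" idea can be sketched in Python. The function name grouping_sets_uv is hypothetical; the rows are the test2 data listed above, and None plays the role of NULL.

```python
from collections import defaultdict

# (month, day, cookieid) rows of test2
rows = [
    ("2015-03", "2015-03-10", "cookie1"), ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"), ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"), ("2015-04", "2015-04-13", "cookie4"),
    ("2015-04", "2015-04-16", "cookie4"), ("2015-03", "2015-03-10", "cookie2"),
    ("2015-03", "2015-03-10", "cookie3"), ("2015-04", "2015-04-12", "cookie5"),
    ("2015-04", "2015-04-13", "cookie6"), ("2015-04", "2015-04-15", "cookie3"),
    ("2015-04", "2015-04-15", "cookie2"), ("2015-04", "2015-04-16", "cookie1"),
]
COLUMNS = ("month", "day")

def grouping_sets_uv(rows, grouping_sets):
    """Count distinct cookieid per group, once for each grouping set."""
    result = []
    for grouping_set in grouping_sets:
        cookies_per_group = defaultdict(set)
        for month, day, cookieid in rows:
            values = {"month": month, "day": day}
            # Columns outside the grouping set are reported as None (NULL).
            key = tuple(values[c] if c in grouping_set else None for c in COLUMNS)
            cookies_per_group[key].add(cookieid)
        # Appending each set's groups corresponds to UNION ALL.
        for (month, day), cookies in cookies_per_group.items():
            result.append((month, day, len(cookies)))
    return result

uv = grouping_sets_uv(rows, [("month",), ("day",)])
for line in uv:
    print(line)
```

The output contains two month-level rows and six day-level rows, with the same uv values as the result table; the row order may differ from Hive's.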

 

Extended version, with GROUPING__ID:

SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID 
FROM test2 
    GROUP BY month,day 
    GROUPING SETS (month,day,(month,day)) 
    ORDER BY GROUPING__ID;

 

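GROUPING__ID indicates which grouping set a result row belongs to. The values below come from the Hive version stated above; Hive 2.3.0 changed how GROUPING__ID is computed, so newer releases return different numbers.

Result: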

+----------+-------------+-----+---------------+--+
|  month   |     day     | uv  | grouping__id  |
+----------+-------------+-----+---------------+--+
| 2015-04  | NULL        | 6   | 1             |
| 2015-03  | NULL        | 5   | 1             |
| NULL     | 2015-03-10  | 4   | 2             |
| NULL     | 2015-04-16  | 2   | 2             |
| NULL     | 2015-04-15  | 2   | 2             |
| NULL     | 2015-04-13  | 3   | 2             |
| NULL     | 2015-04-12  | 2   | 2             |
| NULL     | 2015-03-12  | 1   | 2             |
| 2015-04  | 2015-04-16  | 2   | 3             |
| 2015-04  | 2015-04-12  | 2   | 3             |
| 2015-04  | 2015-04-13  | 3   | 3             |
| 2015-03  | 2015-03-12  | 1   | 3             |
| 2015-03  | 2015-03-10  | 4   | 3             |
| 2015-04  | 2015-04-15  | 2   | 3             |
+----------+-------------+-----+---------------+--+

This is equivalent to:

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test2 GROUP BY month 
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test2 GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test2 GROUP BY month,day;
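The GROUPING__ID values in the tables of this document are consistent with a simple bit vector: the first GROUP BY column contributes 1 when it is part of the grouping set, the second contributes 2, and so on. The Python sketch below reproduces that numbering. It is inferred from the results shown here for Hive 1.1.0 and is an assumption, not a description of every Hive version.

```python
def grouping_id(group_by_columns, grouping_set):
    """Bit vector matching the tables in this document (Hive 1.1.0):
    bit i is set when the i-th GROUP BY column is in the grouping set."""
    return sum(1 << i
               for i, column in enumerate(group_by_columns)
               if column in grouping_set)

columns = ("month", "day")
print(grouping_id(columns, ()))                # 0
print(grouping_id(columns, ("month",)))        # 1
print(grouping_id(columns, ("day",)))          # 2
print(grouping_id(columns, ("month", "day")))  # 3
```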

 

CUBE

CUBE aggregates by every possible combination of the GROUP BY dimensions.

SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID 
FROM test2 
    GROUP BY month,day 
    WITH CUBE 
    ORDER BY GROUPING__ID;

 

Result:

+----------+-------------+-----+---------------+--+
|  month   |     day     | uv  | grouping__id  |
+----------+-------------+-----+---------------+--+
| NULL     | NULL        | 7   | 0             |
| 2015-03  | NULL        | 5   | 1             |
| 2015-04  | NULL        | 6   | 1             |
| NULL     | 2015-04-16  | 2   | 2             |
| NULL     | 2015-04-15  | 2   | 2             |
| NULL     | 2015-04-13  | 3   | 2             |
| NULL     | 2015-04-12  | 2   | 2             |
| NULL     | 2015-03-12  | 1   | 2             |
| NULL     | 2015-03-10  | 4   | 2             |
| 2015-04  | 2015-04-12  | 2   | 3             |
| 2015-04  | 2015-04-16  | 2   | 3             |
| 2015-03  | 2015-03-12  | 1   | 3             |
| 2015-03  | 2015-03-10  | 4   | 3             |
| 2015-04  | 2015-04-15  | 2   | 3             |
| 2015-04  | 2015-04-13  | 3   | 3             |
+----------+-------------+-----+---------------+--+

This is equivalent to:

SELECT NULL as month,NULL as day,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM test2
UNION ALL 
SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test2 GROUP BY month 
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test2 GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test2 GROUP BY month,day;
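The grouping sets that CUBE expands to are all subsets of the GROUP BY columns. A minimal Python sketch, with the hypothetical helper name cube_sets, enumerates them:

```python
from itertools import combinations

def cube_sets(columns):
    """All subsets of the GROUP BY columns, from the empty set to the full set."""
    return [subset
            for size in range(len(columns) + 1)
            for subset in combinations(columns, size)]

print(cube_sets(("month", "day")))
# [(), ('month',), ('day',), ('month', 'day')]
```

For n columns this yields 2^n grouping sets, which is why the query above corresponds to four UNION ALL branches.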

 

ROLLUP

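ROLLUP produces a subset of the CUBE grouping sets. It treats the leftmost dimension as the top of a hierarchy and aggregates level by level from there.

For example, aggregating hierarchically starting from the month dimension: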

SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID  
FROM test2 
    GROUP BY month,day
    WITH ROLLUP 
    ORDER BY GROUPING__ID;

 

Result:

+----------+-------------+-----+---------------+--+
|  month   |     day     | uv  | grouping__id  |
+----------+-------------+-----+---------------+--+
| NULL     | NULL        | 7   | 0             |
| 2015-04  | NULL        | 6   | 1             |
| 2015-03  | NULL        | 5   | 1             |
| 2015-04  | 2015-04-16  | 2   | 3             |
| 2015-04  | 2015-04-15  | 2   | 3             |
| 2015-04  | 2015-04-13  | 3   | 3             |
| 2015-04  | 2015-04-12  | 2   | 3             |
| 2015-03  | 2015-03-12  | 1   | 3             |
| 2015-03  | 2015-03-10  | 4   | 3             |
+----------+-------------+-----+---------------+--+

 

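This implements the roll-up path: UV per month and day -> UV per month -> total UV

If month and day are swapped, the hierarchy starts from the day dimension instead: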

SELECT 
    day,
    month,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID  
FROM test2 
    GROUP BY day,month 
    WITH ROLLUP 
    ORDER BY GROUPING__ID;

 

Result:

+-------------+----------+-----+---------------+--+
|     day     |  month   | uv  | grouping__id  |
+-------------+----------+-----+---------------+--+
| NULL        | NULL     | 7   | 0             |
| 2015-04-12  | NULL     | 2   | 1             |
| 2015-04-15  | NULL     | 2   | 1             |
| 2015-03-12  | NULL     | 1   | 1             |
| 2015-04-16  | NULL     | 2   | 1             |
| 2015-03-10  | NULL     | 4   | 1             |
| 2015-04-13  | NULL     | 3   | 1             |
| 2015-04-16  | 2015-04  | 2   | 3             |
| 2015-04-15  | 2015-04  | 2   | 3             |
| 2015-04-13  | 2015-04  | 3   | 3             |
| 2015-03-12  | 2015-03  | 1   | 3             |
| 2015-03-10  | 2015-03  | 4   | 3             |
| 2015-04-12  | 2015-04  | 2   | 3             |
+-------------+----------+-----+---------------+--+

 

This implements the roll-up path: UV per day and month -> UV per day -> total UV
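The grouping sets behind ROLLUP are the prefixes of the GROUP BY column list, which is why the column order matters. A minimal Python sketch, with the hypothetical helper name rollup_sets:

```python
def rollup_sets(columns):
    """Prefixes of the GROUP BY columns, from the full list down to the empty set."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

print(rollup_sets(("month", "day")))
# [('month', 'day'), ('month',), ()]
print(rollup_sets(("day", "month")))
# [('day', 'month'), ('day',), ()]
```

The set ('day',) never appears for GROUP BY month,day, and ('month',) never appears for GROUP BY day,month, matching the two result tables above.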

 


 

Reference: http://lxw1234.com/archives/category/hive

Reference: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

 

