1、相關函數
1.1 窗口函數
- FIRST_VALUE:取分組內排序后,截止到當前行,第一個值
- LAST_VALUE: 取分組內排序后,截止到當前行,最后一個值
- LEAD(col,n,DEFAULT) :用於統計窗口內往后第n行值。
- 第一個參數為列名,
- 第二個參數為往下第n行(可選,默認為1),
- 第三個參數為默認值(當往下第n行為NULL時候,取默認值,如不指定,則為NULL)
- LAG(col,n,DEFAULT) :用於統計窗口內往前第n行值。
- 第一個參數為列名,
- 第二個參數為往上第n行(可選,默認為1),
- 第三個參數為默認值(當往上第n行為NULL時候,取默認值,如不指定,則為NULL)
1.2 OVER從句
1、使用標准的聚合函數COUNT、SUM、MIN、MAX、AVG
2、使用PARTITION BY
語句,使用一個或者多個原始數據類型的列
3、使用PARTITION BY
與ORDER BY
語句,使用一個或者多個數據類型的分區或者排序列
4、使用窗口規范,窗口規范支持以下格式:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
5、當ORDER BY
后面缺少窗口從句條件,窗口規范默認是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
.
6、當ORDER BY
和窗口從句都缺失, 窗口規范默認是 ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
.
7、OVER
從句支持以下函數, 但是並不支持和窗口一起使用它們。
8、Ranking函數: Rank, NTile, DenseRank, CumeDist, PercentRank
.Lead
和 Lag
函數.
1.3 分析函數
- ROW_NUMBER(): 從1開始,按照順序,生成分組內記錄的序列,比如,按照pv降序排列,生成分組內每天的pv名次,ROW_NUMBER()的應用場景非常多,再比如,獲取分組內排序第一的記錄;獲取一個session中的第一條refer等。
- RANK(): 生成數據項在分組中的排名,排名相等會在名次中留下空位
- DENSE_RANK() :生成數據項在分組中的排名,排名相等會在名次中不會留下空位
- CUME_DIST: 小於等於當前值的行數/分組內總行數。比如,統計小於等於當前薪水的人數,所占總人數的比例
- PERCENT_RANK: 分組內當前行的RANK值-1/分組內總行數-1
- NTILE(n) :用於將分組數據按照順序切分成n片,返回當前切片值,如果切片不均勻,默認增加第一個切片的分布。
- NTILE不支持ROWS BETWEEN,
- 比如
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)。
2、測試數據集
hive (default)> select * from business; OK business.name business.orderdate business.cost jack 2017-01-01 10 tony 2017-01-02 15 jack 2017-02-03 23 tony 2017-01-04 29 jack 2017-01-05 46 jack 2017-04-06 42 tony 2017-01-07 50 jack 2017-01-08 55 mart 2017-04-08 62 mart 2017-04-09 68 neil 2017-05-10 12 mart 2017-04-11 75 neil 2017-06-12 80 mart 2017-04-13 94
3、案例
3.1 COUNT、SUM、MIN、MAX、AVG
select name,orderdate,cost, -- 所有行相加 sum(cost) over() as c1, -- 按name分組,組內相加 sum(cost) over(partition by name) as c1, -- (默認起點到當前行相加)按name分組,orderdate排序,組內相加 sum(cost) over(partition by name order by orderdate) as c2, -- 起點到當前行的 sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row ) as c3, -- 當前行+前面2行 sum(cost) over(partition by name order by orderdate rows between 2 preceding and current row ) as c4, -- 當前行+后面2行 sum(cost) over(partition by name order by orderdate rows between current row and 2 following ) as c5, -- 前面2行+當前行+后面2行 sum(cost) over(partition by name order by orderdate rows between 2 preceding and 2 following ) as c6, -- (partition by .. order by)可替換為(distribute by .. sort by ..) sum(cost) over(partition by name order by orderdate rows between 2 preceding and 2 following ) as c7 from business;
注意:
- 結果和ORDER BY相關,默認為升序
- 如果不指定ROWS BETWEEN,默認為從起點到當前行;
- 如果不指定ORDER BY,則將分組內所有值累加;
- order by必須跟在partition by后;
- Rows必須跟在Order by子;
- (partition by .. order by)可替換為(distribute by .. sort by ..)
理解ROWS BETWEEN含義,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:當前行
UNBOUNDED:無界限(起點或終點)
UNBOUNDED PRECEDING:表示從前面的起點
UNBOUNDED FOLLOWING:表示到后面的終點
3.2 first_value與last_value
select name,orderdate,cost, row_number() over(partition by name order by cost) c1, -- 正序時:當前行到第一個值之間,第一個值 first_value(cost) over(partition by name order by cost) c2, -- 正序時:當前行到最后一個值之間,最后一個值 last_value(cost) over(partition by name order by cost) c3, -- 倒序時:當前行到第一個值之間,第一個值 first_value(cost) over(partition by name order by cost desc) c4, -- 倒序時:當前行到最后一個值之間,最后一個值 last_value(cost) over(partition by name order by cost desc) c5, row_number() over(partition by name order by cost desc) c6 from business;
3.3 lead與lag
select name,orderdate,cost, -- 分組內當前行,往后第一行的值(不包括當前行) lead(cost) over(partition by name order by cost) c1, -- 分組內當前行,往后第二行的值(不包括當前行) lead(cost,2) over(partition by name order by cost) c2, -- 分組內當前行,往后第二行的值(不包括當前行),如果為null,則用9999代替 lead(cost,2,9999) over(partition by name order by cost) c3, -- 分組內當前行,往前第一行的值(不包括當前行) lag(cost) over(partition by name order by cost) c4, -- 分組內當前行,往前第二行的值(不包括當前行) lag(cost,2) over(partition by name order by cost) c5, -- 分組內當前行,往前第二行的值(不包括當前行),如果為null,則用-1代替 lag(cost,2,-1) over(partition by name order by cost) c6 from business;
3.4 RANK、ROW_NUMBER、DENSE_RANK
select name,orderdate,cost,c, -- 自然序號排序,不跳數,不重復 ROW_NUMBER() over(partition by name order by c) c1, -- 排序相同,中間會跳數,總數不變 RANK() over(partition by name order by c) c2, -- 排序相同,中間不會跳數,總數會減少 DENSE_RANK() over(partition by name order by c) c3 from( select name,orderdate,cost,date_format(orderdate,'yyyyMM') c from business)T;
3.5 NTILE
select name,orderdate,cost, -- 將組內數據分成1片 ntile(1) over(partition by name order by orderdate) c1, -- 將組內數據分成2片 ntile(2) over(partition by name order by orderdate) c2, -- 將組內數據分成3片 ntile(3) over(partition by name order by orderdate) c3, -- 將組內數據分成4片 ntile(4) over(partition by name order by orderdate) c4 from business;
注意:
如果切片不均勻,默認增加第一個切片的分布
例如:
求20%的數據(按時間排序)
select * from( select name,orderdate,cost, -- 查詢20%時間的訂單 ntile(5) over(order by orderdate) c from business)T where c=1;
3.6 CUME_DIST、PERCENT_RANK
CUME_DIST:
select name,orderdate,cost, -- 不分組,所有數據為一組,當前行占總行數的比例, -- 第一行:1/14=0.07142857142857142 -- 第二行:2/14=0.14285714285714285 CUME_DIST() over(order by orderdate) , -- 組內,計算當前行的行數/組內總行數 CUME_DIST() over(partition by name order by orderdate) from business;
PERCENT_RANK:
select name,orderdate,cost, -- 按name分組,組內的行數 sum(1) over(partition by name), -- 所有數據,按時間排序,排名 rank() over(order by orderdate), -- (排名-1)/(總行數-1) -- 第1行:排名1,(1-1)/(14-1)= 0 -- 第2行:排名4,(4-1)/(14-1)= 0.23076923076923078 -- 第3行:排名6,(6-1)/(14-1)= 0.38461538461538464 PERCENT_RANK() over(order by orderdate), -- (組內當前行-1) / (當前組總行-1) -- 第1行:(1-1)/(5-1)=0 -- 第2行:(2-1)/(5-1)=0.25 -- 第3行:(3-1)/(5-1)=0.5 PERCENT_RANK() over(partition by name order by orderdate) from business;