Hive Advanced Functions: Window Functions


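A window is opened over each group of rows, as if every slice of the data were given its own window to look through, hence the name window function.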

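SQL has a class of functions called aggregate functions, such as sum(), avg(), and max(). They collapse multiple rows into a single row according to some rule, so the result generally has fewer rows than the input. Sometimes, however, we want to show the original rows alongside the aggregated values, and that is what window functions are for.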
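
To make the contrast concrete, here is a minimal sketch in plain Python (not Hive code) that models both behaviours. The four rows are a subset of the test data listed below; the variable names are illustrative only.

```python
from collections import defaultdict

# A few (id, score, clazz) rows taken from the test data in this article.
rows = [
    (111, 69, "class1"),
    (112, 80, "class1"),
    (121, 74, "class2"),
    (122, 86, "class2"),
]

# Aggregate function: one output row per group (like GROUP BY clazz).
totals = defaultdict(int)
for _id, score, clazz in rows:
    totals[clazz] += score
aggregated = sorted(totals.items())

# Window function: every input row is kept and the group total is attached
# (like SUM(score) OVER (PARTITION BY clazz)).
windowed = [(_id, score, clazz, totals[clazz]) for _id, score, clazz in rows]

print(aggregated)  # [('class1', 149), ('class2', 160)]
for row in windowed:
    print(row)
```

The aggregated result has two rows, while the windowed result still has all four input rows, each carrying its class total.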

Test data
111,69,class1,department1
112,80,class1,department1
113,74,class1,department1
114,94,class1,department1
115,93,class1,department1
121,74,class2,department1
122,86,class2,department1
123,78,class2,department1
124,70,class2,department1
211,93,class1,department2
212,83,class1,department2
213,94,class1,department2
214,94,class1,department2
215,82,class1,department2
216,74,class1,department2
221,99,class2,department2
222,78,class2,department2
223,74,class2,department2
224,80,class2,department2
225,85,class2,department2
Table creation statement
create table new_score(
    id  int
    ,score int
    ,clazz string
    ,department string
) row format delimited fields terminated by ",";

1. Ranking functions

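row_number: ranking without ties; every row gets a distinct, consecutive number
  • Usage: select xxxx, row_number() over(partition by <partition column> order by <sort column> desc) as rn from tb
dense_rank: ranking with ties; ranks increase consecutively with no gaps
rank: ranking with ties; ranks skip ahead after a tie, leaving gaps
percent_rank: (rank - 1) / (number of rows in the partition - 1)
cume_dist: the cumulative distribution of a value within a window or partition.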

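Assuming ascending order, the cumulative distribution is: (number of rows with a value less than or equal to the current value x) / (total number of rows in the window or partition), where x is the value, in the current row, of the column named in the order by clause. With descending order, the rows counted are those with a value greater than or equal to x.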
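
The definitions above can be checked by hand. The following is a rough Python model (not Hive code) of the five ranking functions for a single partition ordered by score descending. The helper name `ranking_columns` is made up for this illustration, and the input is the list of class2 scores from the test data.

```python
def ranking_columns(scores):
    """Return (score, row_number, rank, dense_rank, percent_rank, cume_dist)
    per row for ONE partition, ordered by score descending."""
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    out = []
    rank = dense = 0
    for i, s in enumerate(ordered, start=1):
        if i == 1 or s != ordered[i - 2]:
            rank = i      # rank jumps to the row position after a tie
            dense += 1    # dense_rank increases by exactly one
        percent = (rank - 1) / (n - 1) if n > 1 else 0.0
        # Descending order: count rows whose score is >= the current one.
        cume = sum(1 for x in ordered if x >= s) / n
        out.append((s, i, rank, dense, percent, round(cume, 3)))
    return out

class2 = [74, 86, 78, 70, 99, 78, 74, 80, 85]
result = ranking_columns(class2)
for row in result:
    print(row)
```

For the two rows with score 78, this gives row numbers 5 and 6 but rank 5 and dense_rank 5 for both. The next score, 74, gets rank 7 and dense_rank 6.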

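NTILE(n): splits the ordered rows of a partition into n groups and labels each row with its group number
max, min, avg, count, sum: compute the corresponding aggregate over the rows of each partition
Window frame: selects a subset of rows from the partition for the window function to process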
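
As a sketch of how NTILE(n) distributes rows when the partition size is not divisible by n, the Python model below (not Hive code) gives the leftover rows to the earliest groups. That rule is an assumption based on standard SQL behaviour, and it agrees with the ntile_num column in the query results later in this article. The function name `ntile` is illustrative.

```python
def ntile(n_rows, n_buckets):
    """Bucket number (1-based) for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, n_buckets)
    buckets = []
    for b in range(1, n_buckets + 1):
        size = base + (1 if b <= extra else 0)  # earlier buckets get the extras
        buckets.extend([b] * size)
    return buckets

print(ntile(11, 3))  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
print(ntile(9, 3))   # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

With the 11 rows of class1, the groups have sizes 4, 4, and 3; with the 9 rows of class2, each group has 3 rows.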

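Hive provides two ways to define a window frame: ROWS and RANGE. Both require a lower and an upper bound. For example, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW selects every row from the start of the partition through the current row, whereas SUM(close) with RANGE BETWEEN 100 PRECEDING AND 200 FOLLOWING selects rows by the difference in column value: if the current row's close is 200, the frame covers the rows in the partition whose close falls between 100 and 400 (with the window ordered by close ascending). All possible frame definitions are listed below. If no frame is defined and an ORDER BY is present, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; if the ORDER BY is also missing, the frame is the whole partition (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).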
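
The difference between the two frame types can be sketched in Python (a model of the semantics, not Hive code). The `close` values are hypothetical, chosen so that the current row has close = 200 as in the example above; the helper names are illustrative.

```python
def rows_frame(values, i, preceding, following):
    """ROWS frame: pick neighbours by position around index i."""
    lo = max(0, i - preceding)
    hi = min(len(values), i + following + 1)
    return values[lo:hi]

def range_frame(values, i, preceding, following):
    """RANGE frame (ascending order): pick rows by value distance from values[i]."""
    cur = values[i]
    return [v for v in values if cur - preceding <= v <= cur + following]

# Hypothetical `close` values for one partition, already sorted ascending.
close = [50, 100, 200, 250, 400, 450]
i = close.index(200)

print(rows_frame(close, i, 1, 1))            # [100, 200, 250]
print(range_frame(close, i, 100, 200))       # [100, 200, 250, 400]
print(sum(range_frame(close, i, 100, 200)))  # 950
```

ROWS counts physical rows around the current one, so it ignores how far apart the values are. RANGE compares values, so the number of rows in the frame depends on the data.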

Window frames can only be used with these window functions: max, min, avg, count, sum, FIRST_VALUE, and LAST_VALUE.

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
For example: range between 3 PRECEDING and 11 FOLLOWING
SELECT id
     ,score
     ,clazz
     ,SUM(score) OVER w as sum_w
     ,round(avg(score) OVER w,3) as avg_w
     ,count(score) OVER w as cnt_w
FROM new_score
WINDOW w AS (PARTITION BY clazz ORDER BY score rows between 2 PRECEDING and 2 FOLLOWING);
id	score	clazz	sum_w	avg_w	cnt_w
111	69	class1	217	72.333	3
113	74	class1	297	74.25	4
216	74	class1	379	75.8	5
112	80	class1	393	78.6	5
215	82	class1	412	82.4	5
212	83	class1	431	86.2	5
211	93	class1	445	89.0	5
115	93	class1	457	91.4	5
213	94	class1	468	93.6	5
114	94	class1	375	93.75	4
214	94	class1	282	94.0	3
124	70	class2	218	72.667	3
121	74	class2	296	74.0	4
223	74	class2	374	74.8	5
222	78	class2	384	76.8	5
123	78	class2	395	79.0	5
224	80	class2	407	81.4	5
225	85	class2	428	85.6	5
122	86	class2	350	87.5	4
221	99	class2	270	90.0	3
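
The class2 rows of this result can be reproduced with a short Python model of ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING. This is a sketch of the frame arithmetic, not Hive code, and `sliding_stats` is a name invented for the illustration.

```python
def sliding_stats(scores, preceding=2, following=2):
    """(score, sum, avg, count) over ROWS BETWEEN preceding AND following,
    for one partition ordered by score ascending."""
    ordered = sorted(scores)
    out = []
    for i, s in enumerate(ordered):
        # The slice is clipped at the partition edges, so edge frames are shorter.
        frame = ordered[max(0, i - preceding): i + following + 1]
        out.append((s, sum(frame), round(sum(frame) / len(frame), 3), len(frame)))
    return out

class2 = [74, 86, 78, 70, 99, 78, 74, 80, 85]
stats = sliding_stats(class2)
for row in stats:
    print(row)
```

The first and last rows see only 3 rows in their frame, the second and second-to-last see 4, and every other row sees 5, which matches the cnt_w column.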

select  id
        ,score
        ,clazz
        ,department
        ,row_number() over (partition by clazz order by score desc) as rn_rk
        ,dense_rank() over (partition by clazz order by score desc) as dense_rk
        ,rank() over (partition by clazz order by score desc) as rk
        ,percent_rank() over (partition by clazz order by score desc) as percent_rk
        ,round(cume_dist() over (partition by clazz order by score desc),3) as cume_rk
        ,NTILE(3) over (partition by clazz order by score desc) as ntile_num
        ,max(score) over (partition by clazz order by score desc range between 3 PRECEDING and 11 FOLLOWING) as max_p	-- uses a window frame
from new_score;

Here, partition by clazz order by score desc sorts the students of each class by score in descending order.

Note: the max_p values listed in the result equal each partition's maximum score. Under standard RANGE semantics with a descending sort, the frame in the query spans scores from the current score + 3 down to the current score - 11, so a row such as score 83 in class1 would be expected to return 83. It is worth verifying this column on your own Hive version.

id  score   clazz   department  rn_rk  dense_rk  rk  percent_rk  cume_rk ntile_num max_p
114	 94	    class1	department1	  1	     1	   1	  0.0	    0.273	   1	94
214	 94	    class1	department2	  2	     1	   1	  0.0	    0.273	   1	94
213	 94	    class1	department2	  3	     1	   1	  0.0	    0.273	   1	94
211	 93	    class1	department2	  4	     2	   4	  0.3	    0.455	   1	94
115	 93	    class1	department1	  5	     2	   4	  0.3	    0.455	   2	94
212	 83	    class1	department2	  6	     3	   6	  0.5	    0.545	   2	94
215	 82	    class1	department2	  7	     4	   7	  0.6	    0.636	   2	94
112	 80	    class1	department1	  8	     5	   8	  0.7	    0.727	   2	94
113	 74	    class1	department1	  9	     6	   9	  0.8	    0.909	   3	94
216	 74	    class1	department2	  10	 6	   9	  0.8	    0.909	   3	94
111	 69	    class1	department1	  11	 7	   11	  1.0	    1.0        3    94
221	 99	    class2	department2	  1	     1	   1	  0.0	    0.111	   1	99
122	 86	    class2	department1	  2	     2	   2	  0.125	    0.222	   1	99
225	 85	    class2	department2	  3	     3	   3	  0.25	    0.333	   1	99
224	 80	    class2	department2	  4	     4	   4	  0.375	    0.444	   2	99
123	 78	    class2	department1	  5	     5	   5	  0.5	    0.667	   2	99
222	 78	    class2	department2	  6	     5	   5	  0.5	    0.667	   2	99
121	 74	    class2	department1	  7	     6	   7	  0.75	    0.889	   3	99
223	 74	    class2	department2	  8	     6	   7	  0.75	    0.889	   3	99
124	 70	    class2	department1	  9	     7	   9	  1.0	    1.0        3    99

2. Window functions

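LAG(col,n): the value of col from the row n rows before the current row
LEAD(col,n): the value of col from the row n rows after the current row
FIRST_VALUE: the first value in the partition after sorting, up to the current row
LAST_VALUE: the last value in the partition after sorting, up to the current row; among rows tied on the sort key, the last one is taken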
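
LAG and LEAD are simple positional lookups within the ordered partition. The Python sketch below (not Hive code) models them, with None standing in for NULL. The `default` parameter mirrors the optional third argument that Hive's LAG and LEAD accept; the helper names are illustrative.

```python
def lag(values, n, default=None):
    """Value n rows before each row, or default when there is no such row."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]

def lead(values, n, default=None):
    """Value n rows after each row, or default when there is no such row."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]

# ids of class2 in the order produced by ORDER BY score DESC in this article.
ids = [221, 122, 225, 224, 123, 222, 121, 223, 124]

print(lag(ids, 2))   # [None, None, 221, 122, 225, 224, 123, 222, 121]
print(lead(ids, 2))  # [225, 224, 123, 222, 121, 223, 124, None, None]
```
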
select  id
        ,score
        ,clazz
        ,department
        ,lag(id,2) over (partition by clazz order by score desc) as lag_num
        ,LEAD(id,2) over (partition by clazz order by score desc) as lead_num
        ,FIRST_VALUE(id) over (partition by clazz order by score desc) as first_v_num
        ,LAST_VALUE(id) over (partition by clazz order by score desc) as last_v_num
        ,NTILE(3) over (partition by clazz order by score desc) as ntile_num
from new_score;


id  score   clazz   department  lag_num lead_num  first_v_num last_v_num  ntile_num
114	 94	    class1	department1	  NULL	   213	    114	          213	      1
214	 94	    class1	department2	  NULL	   211	    114	          213	      1
213	 94	    class1	department2	  114	   115	    114	          213	      1
211	 93	    class1	department2	  214	   212	    114	          115	      1
115	 93	    class1	department1	  213	   215	    114	          115	      2
212	 83	    class1	department2	  211	   112	    114	          212	      2
215	 82	    class1	department2	  115	   113	    114	          215	      2
112	 80	    class1	department1	  212	   216	    114	          112	      2
113	 74	    class1	department1	  215	   111	    114	          216	      3
216	 74	    class1	department2	  112	   NULL	    114	          216	      3
111	 69	    class1	department1	  113	   NULL	    114	          111	      3
221	 99	    class2	department2	  NULL	   225	    221	          221	      1
122	 86	    class2	department1	  NULL	   224	    221	          122	      1
225	 85	    class2	department2	  221	   123	    221	          225	      1
224	 80	    class2	department2	  122	   222	    221	          224	      2
123	 78	    class2	department1	  225	   121	    221	          222	      2
222	 78	    class2	department2	  224	   223	    221	          222	      2
121	 74	    class2	department1	  123	   124	    221	          223	      3
223	 74	    class2	department2	  222	   NULL	    221	          223	      3
124	 70	    class2	department1	  121	   NULL	    221	          124	      3
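
The last_v_num column may look surprising: it is not the last id of the partition. Because no frame is given, the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW applies, and a RANGE frame includes every row tied with the current row on the sort key. A minimal Python model of this (not Hive code; the function name is invented) reproduces the class2 values:

```python
def first_last_default_frame(rows):
    """rows: (id, score) pairs already ordered by score descending.
    Returns (id, first_value, last_value) under the default frame
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW."""
    out = []
    for _id, score in rows:
        # The frame holds every row up to and including the current row's peers
        # (rows with the same score), because RANGE compares ORDER BY values.
        frame = [r for r in rows if r[1] >= score]
        out.append((_id, frame[0][0], frame[-1][0]))
    return out

class2 = [(221, 99), (122, 86), (225, 85), (224, 80), (123, 78),
          (222, 78), (121, 74), (223, 74), (124, 70)]
result = first_last_default_frame(class2)
for row in result:
    print(row)
```

Both rows with score 78 (ids 123 and 222) get 222 as their last value. To obtain the last value of the whole partition instead, the frame can be written explicitly as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.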

https://blog.csdn.net/qq_26937525/article/details/54925827

