Window Functions in Hive

Reposted article, 2017-09-27 13:21, BigData

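Introduction

This article introduces window functions in Hive. They work much like window functions in standard SQL and are used for data analysis, typically OLAP-style analysis.

Concepts

SQL has a class of functions called aggregate functions, such as sum(), avg(), and max(). They collapse multiple rows into one according to some rule, so the result generally has fewer rows than the input. Sometimes, though, we want to show both the detail rows and the aggregated value side by side. Window functions exist for that purpose.

Before looking at the OVER clause in detail, keep one point in mind: within a query, window functions are evaluated at the very end, after WHERE, GROUP BY, and HAVING, and only the final ORDER BY runs after them.

Data preparation

We prepare an order table with the fields name, orderdate, and cost. The data is as follows: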

jack,2015-01-01,10

tony,2015-01-02,15

jack,2015-02-03,23

tony,2015-01-04,29

jack,2015-01-05,46

jack,2015-04-06,42

tony,2015-01-07,50

jack,2015-01-08,55

mart,2015-04-08,62

mart,2015-04-09,68

neil,2015-05-10,12

mart,2015-04-11,75

neil,2015-06-12,80

mart,2015-04-13,94

Create a table named t_window in Hive and load the data into it.
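For readers who want to follow along without a Hive cluster, the sketch below loads the same 14 rows into an in-memory SQLite table through Python's sqlite3 module. SQLite is only an assumed stand-in here (it needs version 3.25 or newer for window functions) and it is not Hive; the column types are also an assumption, since the article does not give the table definition.

```python
import sqlite3

# The 14 sample rows from the article: (name, orderdate, cost).
ROWS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]

# Assumed schema: the article only names the three fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_window (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO t_window VALUES (?, ?, ?)", ROWS)

# Sanity check: row count and total cost of the loaded data.
row_count, total_cost = conn.execute(
    "SELECT count(*), sum(cost) FROM t_window"
).fetchone()
print(row_count, total_cost)
```

The total cost computed here is the same value that appears in the sample1 column of the window clause example later in the article.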

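Examples

Aggregate function + over

Suppose we want to list the customers who made a purchase in April 2015 along with the total number of customers. A window function can do this: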

select name,count(*) over ()

from t_window

where substring(orderdate,1,7) = '2015-04'

The result is as follows:

name count_window_0

mart 5

jack 5

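So April 2015 actually has 5 purchase records: mart bought 4 times and jack once, and each of those rows carries the count 5 (the listing above shows each name only once). In most cases we only want the deduplicated result. There are two ways to write that: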

Option 1: distinct

select distinct name,count(*) over ()

from t_window

where substring(orderdate,1,7) = '2015-04'

Option 2: group by

select name,count(*) over ()

from t_window

where substring(orderdate,1,7) = '2015-04'

group by name

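The group by query returns the result below. Because the window function is evaluated after grouping, count(*) over () counts the two groups, which is the number of distinct customers. The distinct query, by contrast, removes duplicates only after the window function has run, so it still reports 5 next to each name.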

name count_window_0
mart 2
jack 2
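The difference between the two deduplication approaches can be checked with a small sketch. It uses SQLite through Python's sqlite3 as an assumed stand-in for Hive (SQLite 3.25+ is required), with substr in place of substring; the evaluation order it demonstrates is standard SQL behaviour.

```python
import sqlite3

ROWS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_window (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO t_window VALUES (?, ?, ?)", ROWS)

# Same filter as the article, written with substr for SQLite.
APRIL = "WHERE substr(orderdate, 1, 7) = '2015-04'"

# DISTINCT runs after the window function: the count is taken over
# the 5 April rows, then duplicate (name, 5) pairs are removed.
distinct_rows = conn.execute(
    f"SELECT DISTINCT name, count(*) OVER () FROM t_window {APRIL} ORDER BY name"
).fetchall()

# GROUP BY runs before the window function: the count is taken over
# the grouped rows, one per customer.
grouped_rows = conn.execute(
    f"SELECT name, count(*) OVER () FROM t_window {APRIL} GROUP BY name ORDER BY name"
).fetchall()

print(distinct_rows)
print(grouped_rows)
```

If the goal is the number of customers, the group by form is the one that answers it; the distinct form answers "how many purchase records were there".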

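partition by clause

partition by is the first thing that can follow over. Also called the query partition clause, it resembles group by in that both split the data into groups by the partitioning value. The function in front of over is computed within each partition, and the computation starts over when a new partition begins.

Example

To see each customer's purchase details together with the monthly purchase total, run the following SQL: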

select name,orderdate,cost,sum(cost) over(partition by month(orderdate))

from t_window

The result is as follows:

name orderdate cost sum_window_0

jack 2015-01-01 10 205

jack 2015-01-08 55 205

tony 2015-01-07 50 205

jack 2015-01-05 46 205

tony 2015-01-04 29 205

tony 2015-01-02 15 205

jack 2015-02-03 23 23

mart 2015-04-13 94 341

jack 2015-04-06 42 341

mart 2015-04-11 75 341

mart 2015-04-09 68 341

mart 2015-04-08 62 341

neil 2015-05-10 12 12

neil 2015-06-12 80 80

The cost has been totaled by month, and every detail row is still present.
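A quick way to confirm the monthly totals is to run the same partitioned sum in SQLite via Python's sqlite3, used here as an assumed stand-in for Hive (SQLite 3.25+). SQLite has no month() function, so the sketch partitions on substr(orderdate, 6, 2), which extracts the two-digit month from these date strings.

```python
import sqlite3

ROWS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_window (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO t_window VALUES (?, ?, ?)", ROWS)

# substr(orderdate, 6, 2) stands in for Hive's month(orderdate).
monthly = conn.execute(
    """
    SELECT name, orderdate, cost,
           sum(cost) OVER (PARTITION BY substr(orderdate, 6, 2)) AS month_total
    FROM t_window
    ORDER BY orderdate
    """
).fetchall()

# Every row of a month carries the same total, so one value per month suffices.
totals_by_month = {orderdate[5:7]: total for _, orderdate, _, total in monthly}
print(totals_by_month)
```

All 14 detail rows come back, each annotated with its month's total, which is the property that distinguishes a partitioned window from group by.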

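order by clause

Continuing the scenario above, suppose we want a running total of cost within each month. This is where the order by clause comes in.

order by forces the input rows into a defined order (as mentioned earlier, window functions run last in a SQL statement, so the query's result set can be thought of as their input). order by is essential for functions such as Row_Number(), Lead(), and LAG(), because their results mean nothing on unordered data. It also changes what aggregates such as Count() and Min() return: once order by is present, they are computed over the rows from the start of the partition up to the current row rather than over the whole partition.

Add order by to the previous query: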

select name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate )

from t_window

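The result is as follows (with order by and no window clause, the aggregate covers the rows from the start of the partition to the current row by default):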

name orderdate cost sum_window_0

jack 2015-01-01 10 10

tony 2015-01-02 15 25

tony 2015-01-04 29 54

jack 2015-01-05 46 100

tony 2015-01-07 50 150

jack 2015-01-08 55 205

jack 2015-02-03 23 23

jack 2015-04-06 42 42

mart 2015-04-08 62 104

mart 2015-04-09 68 172

mart 2015-04-11 75 247

mart 2015-04-13 94 341

neil 2015-05-10 12 12

neil 2015-06-12 80 80
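The running total can be reproduced with the sketch below, again using SQLite through Python's sqlite3 as an assumed stand-in for Hive (SQLite 3.25+), with substr(orderdate, 6, 2) replacing month(orderdate). The second column spells out the default frame, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, to show that it matches the implicit behaviour.

```python
import sqlite3

ROWS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_window (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO t_window VALUES (?, ?, ?)", ROWS)

running = conn.execute(
    """
    SELECT orderdate, cost,
           -- implicit frame: start of partition up to the current row
           sum(cost) OVER (PARTITION BY substr(orderdate, 6, 2)
                           ORDER BY orderdate) AS running_total,
           -- the same frame written out explicitly
           sum(cost) OVER (PARTITION BY substr(orderdate, 6, 2)
                           ORDER BY orderdate
                           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS explicit_frame
    FROM t_window
    ORDER BY orderdate
    """
).fetchall()

# Running totals for January only.
january = [total for orderdate, _, total, _ in running if orderdate.startswith("2015-01")]
print(january)
```

Because the default frame is a RANGE frame, rows that tie on the order by key would share one running total; the sample data has no such ties within a month.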

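window clause

partition by has already let us split the data into groups. For a finer-grained split within each group, we use the window clause.

Two points to understand first:
- With only partition by and no order by, the aggregate covers the whole partition.
- With order by but no window clause, the default range is from the start of the partition to the current row.

When one select contains several window functions, they do not affect each other; each applies its own rules.

Window clause keywords:
- PRECEDING: rows before the current row
- FOLLOWING: rows after the current row
- CURRENT ROW: the current row
- UNBOUNDED: the boundary of the partition; UNBOUNDED PRECEDING means from the first row, and UNBOUNDED FOLLOWING means through the last row

The following query partitions by name, orders by purchase date, and sums cost over several different windows: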

select name,orderdate,cost,

sum(cost) over() as sample1, -- sum over all rows

sum(cost) over(partition by name) as sample2, -- partition by name, sum within each partition

sum(cost) over(partition by name order by orderdate) as sample3, -- partition by name, running total within each partition

sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4, -- same result as sample3 here (no tied dates within a name): start of partition to current row

sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- current row plus the previous row

sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6, -- current row plus the previous row and the next row

sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 -- current row and all following rows

from t_window;

The query result is as follows:

name orderdate cost sample1 sample2 sample3 sample4 sample5 sample6 sample7

jack 2015-01-01 10 661 176 10 10 10 56 176

jack 2015-01-05 46 661 176 56 56 56 111 166

jack 2015-01-08 55 661 176 111 111 101 124 120

jack 2015-02-03 23 661 176 134 134 78 120 65

jack 2015-04-06 42 661 176 176 176 65 65 42

mart 2015-04-08 62 661 299 62 62 62 130 299

mart 2015-04-09 68 661 299 130 130 130 205 237

mart 2015-04-11 75 661 299 205 205 143 237 169

mart 2015-04-13 94 661 299 299 299 169 169 94

neil 2015-05-10 12 661 92 12 12 12 92 92

neil 2015-06-12 80 661 92 92 92 92 92 80

tony 2015-01-02 15 661 94 15 15 15 44 94

tony 2015-01-04 29 661 94 44 44 44 94 79

tony 2015-01-07 50 661 94 94 94 79 79 50

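Sequence functions among the window functions

Note that sequence functions do not support the window clause.

The sequence functions commonly used in Hive are the following.

NTILE

· NTILE(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row

· NTILE does not support ROWS BETWEEN,
for example NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)

· If the rows do not divide evenly, the extra rows go to the earliest buckets, starting with the first

When is this useful? Suppose we want, for each customer, the third of their transactions with the highest purchase amounts. NTILE can do that.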

select name,orderdate,cost,

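ntile(3) over() as sample1, -- split all rows into 3 buckets

ntile(3) over(partition by name) as sample2, -- partition by name, split each partition into 3 buckets

ntile(3) over(order by cost) as sample3, -- order all rows by cost ascending, split into 3 buckets

ntile(3) over(partition by name order by cost ) as sample4 -- partition by name, order by cost ascending within each partition, split into 3 buckets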

from t_window

The result is as follows:

name orderdate cost sample1 sample2 sample3 sample4

jack 2015-01-01 10 3 1 1 1

jack 2015-02-03 23 3 1 1 1

jack 2015-04-06 42 2 2 2 2

jack 2015-01-05 46 2 2 2 2

jack 2015-01-08 55 2 3 2 3

mart 2015-04-08 62 2 1 2 1

mart 2015-04-09 68 1 2 3 1

mart 2015-04-11 75 1 3 3 2

mart 2015-04-13 94 1 1 3 3

neil 2015-05-10 12 1 2 1 1

neil 2015-06-12 80 1 1 3 2

tony 2015-01-02 15 3 2 1 1

tony 2015-01-04 29 3 3 1 2

tony 2015-01-07 50 2 1 2 3

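In the data above, the rows with sample4 = 1 make up one third of each customer's records. Because sample4 orders by cost ascending, bucket 1 holds the smallest amounts; to get the third with the largest amounts, order by cost desc inside over and again keep bucket 1.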
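The effect of the sort direction on bucket 1 can be sketched as follows, using SQLite through Python's sqlite3 as an assumed stand-in for Hive (SQLite 3.25+). The column names asc_tile and desc_tile are illustrative and do not come from the article.

```python
import sqlite3

ROWS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_window (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO t_window VALUES (?, ?, ?)", ROWS)

tiles = conn.execute(
    """
    SELECT name, cost,
           ntile(3) OVER (PARTITION BY name ORDER BY cost)      AS asc_tile,
           ntile(3) OVER (PARTITION BY name ORDER BY cost DESC) AS desc_tile
    FROM t_window
    ORDER BY name, cost
    """
).fetchall()

# Bucket 1 of the ascending split holds each customer's smallest amounts.
lowest_third = [(name, cost) for name, cost, asc_tile, _ in tiles if asc_tile == 1]
# Bucket 1 of the descending split holds each customer's largest amounts.
highest_third = [(name, cost) for name, cost, _, desc_tile in tiles if desc_tile == 1]

print(lowest_third)
print(highest_third)
```

jack has five rows, so his buckets have sizes 2, 2, and 1, which is why bucket 1 contains two of his transactions in both directions.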

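row_number, rank, dense_rank

These three window functions are used very widely.
- row_number() numbers the rows within a partition in order, starting from 1. Its values never repeat; when the sort values are equal, the order among the tied rows is not guaranteed unless further sort keys are added
- RANK() gives each row's rank within the partition; ties leave gaps in the ranking
- DENSE_RANK() gives each row's rank within the partition; ties leave no gaps in the ranking

**Note:
rank and dense_rank differ only in whether ties leave gaps in the ranking.**

An example: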

SELECT

cookieid,

createtime,

pv,

RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,

DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,

ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3

FROM lxw1234

WHERE cookieid = 'cookie1';

cookieid createtime pv rn1 rn2 rn3

cookie1 2015-04-12 7 1 1 1

cookie1 2015-04-11 5 2 2 2

cookie1 2015-04-15 4 3 3 3

cookie1 2015-04-16 4 3 3 4

cookie1 2015-04-13 3 5 4 5

cookie1 2015-04-14 2 6 5 6

cookie1 2015-04-10 1 7 6 7

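rn1: the 15th and 16th tie for 3rd, and the 13th is ranked 5th

rn2: the 15th and 16th tie for 3rd, and the 13th is ranked 4th

rn3: tied rows still receive distinct numbers; which tied row comes first is not defined by the query and may be arbitrary.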
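The three ranking functions can be compared on the seven cookie1 rows shown above. The sketch rebuilds that data in SQLite through Python's sqlite3, an assumed stand-in for Hive (SQLite 3.25+); the table schema is an assumption, and createtime is added as a second sort key for row_number so that the tie between the 15th and 16th is resolved deterministically.

```python
import sqlite3

# The cookie1 rows from the result table above: (cookieid, createtime, pv).
PV_ROWS = [
    ("cookie1", "2015-04-10", 1), ("cookie1", "2015-04-11", 5),
    ("cookie1", "2015-04-12", 7), ("cookie1", "2015-04-13", 3),
    ("cookie1", "2015-04-14", 2), ("cookie1", "2015-04-15", 4),
    ("cookie1", "2015-04-16", 4),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lxw1234 (cookieid TEXT, createtime TEXT, pv INTEGER)")
conn.executemany("INSERT INTO lxw1234 VALUES (?, ?, ?)", PV_ROWS)

ranked = conn.execute(
    """
    SELECT createtime, pv,
           rank()       OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn1,
           dense_rank() OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn2,
           -- createtime breaks ties so the numbering is reproducible
           row_number() OVER (PARTITION BY cookieid ORDER BY pv DESC, createtime) AS rn3
    FROM lxw1234
    WHERE cookieid = 'cookie1'
    ORDER BY pv DESC, createtime
    """
).fetchall()

rn1 = [row[2] for row in ranked]
rn2 = [row[3] for row in ranked]
rn3 = [row[4] for row in ranked]
print(rn1, rn2, rn3)
```

Adding a tie-breaking column to order by is a common way to make row_number stable when it feeds a filter such as "keep the first row per group".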

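LAG and LEAD

These two commonly used window functions return values from the rows before or after the current row. Using our order table, if we want each customer's previous purchase date, we can query it like this: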

select name,orderdate,cost,

lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1,

lag(orderdate,2) over (partition by name order by orderdate) as time2

from t_window;

The query returns:

name orderdate cost time1 time2

jack 2015-01-01 10 1900-01-01 NULL

jack 2015-01-05 46 2015-01-01 NULL

jack 2015-01-08 55 2015-01-05 2015-01-01

jack 2015-02-03 23 2015-01-08 2015-01-05

jack 2015-04-06 42 2015-02-03 2015-01-08

mart 2015-04-08 62 1900-01-01 NULL

mart 2015-04-09 68 2015-04-08 NULL

mart 2015-04-11 75 2015-04-09 2015-04-08

mart 2015-04-13 94 2015-04-11 2015-04-09

neil 2015-05-10 12 1900-01-01 NULL

neil 2015-06-12 80 2015-05-10 NULL

tony 2015-01-02 15 1900-01-01 NULL

tony 2015-01-04 29 2015-01-02 NULL

tony 2015-01-07 50 2015-01-04 2015-01-02

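time1: partition by name, sort ascending within the partition, and take the value from the previous row.

time2: partition by name, sort ascending within the partition, and take the value from two rows back. Note that when lag is given no offset, the offset defaults to 1 row, and when no default value is supplied for rows that have no such predecessor, the result is null.

lead works like lag in the opposite direction: it takes values from the rows below.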
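Since the article shows lag but gives no lead query, the sketch below runs both side by side. It uses SQLite through Python's sqlite3 as an assumed stand-in for Hive (SQLite 3.25+); the aliases prev_date, prev2_date, and next_date are illustrative names.

```python
import sqlite3

ROWS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_window (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO t_window VALUES (?, ?, ?)", ROWS)

shifted = conn.execute(
    """
    SELECT name, orderdate,
           -- previous purchase date, with a default for the first row
           lag(orderdate, 1, '1900-01-01') OVER (PARTITION BY name ORDER BY orderdate) AS prev_date,
           -- two rows back, no default: NULL when there is no such row
           lag(orderdate, 2) OVER (PARTITION BY name ORDER BY orderdate) AS prev2_date,
           -- next purchase date; the offset defaults to 1
           lead(orderdate) OVER (PARTITION BY name ORDER BY orderdate) AS next_date
    FROM t_window
    ORDER BY name, orderdate
    """
).fetchall()

jack = [(prev, prev2, nxt) for name, _, prev, prev2, nxt in shifted if name == "jack"]
for row in jack:
    print(row)
```

In Python the SQL NULL values come back as None, which is what the last row of each customer shows for next_date.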

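first_value and last_value

first_value returns the first value within the partition, after sorting, up to the current row
last_value returns the last value within the partition, after sorting, up to the current row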

select name,orderdate,cost,

first_value(orderdate) over(partition by name order by orderdate) as time1,

last_value(orderdate) over(partition by name order by orderdate) as time2

from t_window

The query result is as follows:

name orderdate cost time1 time2

jack 2015-01-01 10 2015-01-01 2015-01-01

jack 2015-01-05 46 2015-01-01 2015-01-05

jack 2015-01-08 55 2015-01-01 2015-01-08

jack 2015-02-03 23 2015-01-01 2015-02-03

jack 2015-04-06 42 2015-01-01 2015-04-06

mart 2015-04-08 62 2015-04-08 2015-04-08

mart 2015-04-09 68 2015-04-08 2015-04-09

mart 2015-04-11 75 2015-04-08 2015-04-11

mart 2015-04-13 94 2015-04-08 2015-04-13

neil 2015-05-10 12 2015-05-10 2015-05-10

neil 2015-06-12 80 2015-05-10 2015-06-12

tony 2015-01-02 15 2015-01-02 2015-01-02

tony 2015-01-04 29 2015-01-02 2015-01-04

tony 2015-01-07 50 2015-01-02 2015-01-07
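In the result above, time2 always equals the row's own orderdate, because the default frame ends at the current row. If the last value of the whole partition is wanted instead, one option is to widen the frame explicitly. The sketch below shows both variants on SQLite through Python's sqlite3, an assumed stand-in for Hive (SQLite 3.25+); the alias names are illustrative.

```python
import sqlite3

ROWS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_window (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO t_window VALUES (?, ?, ?)", ROWS)

values = conn.execute(
    """
    SELECT name, orderdate,
           first_value(orderdate) OVER (PARTITION BY name ORDER BY orderdate) AS first_so_far,
           -- default frame stops at the current row
           last_value(orderdate)  OVER (PARTITION BY name ORDER BY orderdate) AS last_so_far,
           -- explicit frame spanning the whole partition
           last_value(orderdate)  OVER (PARTITION BY name ORDER BY orderdate
                                        ROWS BETWEEN UNBOUNDED PRECEDING
                                             AND UNBOUNDED FOLLOWING) AS last_in_partition
    FROM t_window
    ORDER BY name, orderdate
    """
).fetchall()

tony = [row[1:] for row in values if row[0] == "tony"]
for row in tony:
    print(row)
```

With the widened frame, every one of tony's rows reports his final purchase date, while last_so_far simply mirrors each row's own date.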
