1. Data preparation
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
Create the database and table
create database if not exists cookie;
use cookie;
drop table if exists cookie1;
create table cookie1(cookieid string, createtime string, pv int)
row format delimited fields terminated by ',';
load data local inpath "/home/hadoop/cookie1.txt" into table cookie1;
select * from cookie1;
SUM
Query
select cookieid,
       createtime,
       pv,
       sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1,
       sum(pv) over (partition by cookieid order by createtime) as pv2,
       sum(pv) over (partition by cookieid) as pv3,
       sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,
       sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5,
       sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from cookie1;
Query result
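Explanation
pv1: running total of pv within the partition, from the first row to the current row. For example, pv1 on the 11th = pv on the 10th + pv on the 11th; on the 12th = 10th + 11th + 12th.
pv2: same as pv1.
pv3: the sum of all pv in the partition (cookie1).
pv4: the current row plus the 3 preceding rows. For example, 11th = 10th + 11th; 12th = 10th + 11th + 12th; 13th = 10th + 11th + 12th + 13th; 14th = 11th + 12th + 13th + 14th.
pv5: the current row plus the 3 preceding rows and the 1 following row. For example, 14th = 11th + 12th + 13th + 14th + 15th = 5 + 7 + 3 + 2 + 4 = 21.
pv6: the current row plus all following rows. For example, 13th = 13th + 14th + 15th + 16th = 3 + 2 + 4 + 4 = 13; 14th = 14th + 15th + 16th = 2 + 4 + 4 = 10.
If ROWS BETWEEN is omitted but ORDER BY is present, the frame defaults to the range from the start of the partition to the current row. (Hive applies this default as a RANGE frame, so rows that tie on the ORDER BY key are summed together; createtime is unique here, which is why pv2 equals pv1.)
If ORDER BY is omitted as well, all values in the partition are summed.
The key is to understand ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: toward earlier rows
FOLLOWING: toward later rows
CURRENT ROW: the current row
UNBOUNDED: no limit in that direction
UNBOUNDED PRECEDING: from the first row of the partition
UNBOUNDED FOLLOWING: to the last row of the partition
AVG, MIN and MAX are used the same way as SUM.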
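The frame arithmetic above can be checked with a small plain-Python sketch. This is not Hive code: the helper name `frame_sum` and the convention that `None` stands for UNBOUNDED are assumptions made for illustration, and the list holds the cookie1 pv values already ordered by createtime.

```python
# pv for cookie1 from 2015-04-10 to 2015-04-16, ordered by createtime
pv = [1, 5, 7, 3, 2, 4, 4]


def frame_sum(values, i, preceding, following):
    """Sum the rows of a ROWS BETWEEN frame around row i.

    preceding / following are row counts; None stands for UNBOUNDED
    (a convention of this sketch, not Hive syntax).
    """
    last = len(values) - 1
    start = 0 if preceding is None else max(0, i - preceding)
    end = last if following is None else min(last, i + following)
    return sum(values[start:end + 1])


n = len(pv)
pv1 = [frame_sum(pv, i, None, 0) for i in range(n)]  # unbounded preceding .. current row
pv3 = [sum(pv)] * n                                  # no ORDER BY: the whole partition
pv4 = [frame_sum(pv, i, 3, 0) for i in range(n)]     # 3 preceding .. current row
pv5 = [frame_sum(pv, i, 3, 1) for i in range(n)]     # 3 preceding .. 1 following
pv6 = [frame_sum(pv, i, 0, None) for i in range(n)]  # current row .. unbounded following

for day, row in enumerate(zip(pv, pv1, pv3, pv4, pv5, pv6), start=10):
    print(f"2015-04-{day}", *row)
```

The row for the 14th shows pv5 = 21 and pv6 = 10, matching the worked sums in the explanation.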
AVG
Query
select cookieid,
       createtime,
       pv,
       avg(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, -- partition start to current row (same as the default frame)
       avg(pv) over (partition by cookieid order by createtime) as pv2, -- partition start to current row, same result as pv1
       avg(pv) over (partition by cookieid) as pv3, -- all rows in the partition
       avg(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, -- current row + 3 preceding rows
       avg(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, -- current row + 3 preceding rows + 1 following row
       avg(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6 -- current row + all following rows
from cookie1;
Query result
MIN
Query
select cookieid,
       createtime,
       pv,
       min(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, -- partition start to current row (same as the default frame)
       min(pv) over (partition by cookieid order by createtime) as pv2, -- partition start to current row, same result as pv1
       min(pv) over (partition by cookieid) as pv3, -- all rows in the partition
       min(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, -- current row + 3 preceding rows
       min(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, -- current row + 3 preceding rows + 1 following row
       min(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6 -- current row + all following rows
from cookie1;
Query result
MAX
Query
select cookieid,
       createtime,
       pv,
       max(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, -- partition start to current row (same as the default frame)
       max(pv) over (partition by cookieid order by createtime) as pv2, -- partition start to current row, same result as pv1
       max(pv) over (partition by cookieid) as pv3, -- all rows in the partition
       max(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, -- current row + 3 preceding rows
       max(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, -- current row + 3 preceding rows + 1 following row
       max(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6 -- current row + all following rows
from cookie1;
Query result
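Since AVG, MIN and MAX only swap the aggregate applied to the same frame, the idea can be sketched in plain Python with the aggregate passed in as a function. The helper `over_rows` is hypothetical and only imitates ROWS BETWEEN on an already ordered list; it is not how Hive evaluates the query.

```python
from statistics import mean

# pv for cookie1, ordered by createtime
pv = [1, 5, 7, 3, 2, 4, 4]


def over_rows(values, agg, preceding, following):
    """Apply agg to the ROWS BETWEEN frame of every row.

    preceding / following are row counts; None stands for UNBOUNDED
    (a convention of this sketch).
    """
    last = len(values) - 1
    out = []
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        end = last if following is None else min(last, i + following)
        out.append(agg(values[start:end + 1]))
    return out


avg_pv4 = over_rows(pv, mean, 3, 0)    # avg over 3 preceding .. current row
min_pv5 = over_rows(pv, min, 3, 1)     # min over 3 preceding .. 1 following
max_pv6 = over_rows(pv, max, 0, None)  # max over current row .. unbounded following

print(avg_pv4)
print(min_pv5)
print(max_pv6)
```

For the 14th (the fifth row) the pv4 frame is 5, 7, 3, 2, so the average is 17 / 4 = 4.25.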
2. Data preparation
This part introduces the first few sequence functions, NTILE, ROW_NUMBER, RANK and DENSE_RANK; the purpose of each is explained in turn below.
Note: sequence functions do not support the WINDOW clause (ROWS BETWEEN).
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
Create the table
use cookie;
drop table if exists cookie2;
create table cookie2(cookieid string, createtime string, pv int)
row format delimited fields terminated by ',';
load data local inpath "/home/hadoop/cookie2.txt" into table cookie2;
select * from cookie2;
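NTILE
Explanation
NTILE(n) splits the ordered rows of a partition into n slices and returns the slice number of the current row.
NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER (PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not valid.
If the rows cannot be divided evenly, the extra rows go to the leading slices, starting with the first (for example, 7 rows in 3 slices gives slices of 3, 2 and 2 rows).
Query
select cookieid,
       createtime,
       pv,
       ntile(2) over (partition by cookieid order by createtime) as rn1, -- split each partition into 2 slices
       ntile(3) over (partition by cookieid order by createtime) as rn2, -- split each partition into 3 slices
       ntile(4) over (order by createtime) as rn3 -- split all rows into 4 slices
from cookie.cookie2
order by cookieid, createtime;
Query result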
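The slice numbers for the query above can be reasoned about with a short plain-Python sketch. It assumes the usual SQL rule that the leading slices each take one extra row when the count does not divide evenly; the function name `ntile` is only an illustration and works on row counts, not on a Hive table.

```python
def ntile(row_count, n):
    """Return the 1-based slice number for each of row_count ordered rows."""
    base, extra = divmod(row_count, n)
    slices = []
    for slice_no in range(1, n + 1):
        # The first `extra` slices take one additional row each.
        size = base + (1 if slice_no <= extra else 0)
        slices.extend([slice_no] * size)
    return slices


rn1 = ntile(7, 2)   # each cookie has 7 rows, split into 2 slices
rn2 = ntile(7, 3)   # each cookie has 7 rows, split into 3 slices
rn3 = ntile(14, 4)  # all 14 rows together, split into 4 slices

print(rn1)
print(rn2)
print(rn3)
```

For rn3 several rows share the same createtime across the two cookies, so which of the tied rows lands in which slice depends on how the engine orders ties; only the slice sizes (4, 4, 3, 3) are fixed.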
For example, to find the top third of days by pv for each cookie:
Query
select cookieid,
       createtime,
       pv,
       ntile(3) over (partition by cookieid order by pv desc) as rn
from cookie.cookie2;
Query result
-- The rows with rn = 1 are the result we want.
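ROW_NUMBER
Explanation
ROW_NUMBER() numbers the rows of a partition in order, starting from 1.
For example, ordering by pv descending gives each day's pv position within the partition.
ROW_NUMBER() has many uses, such as fetching the top-ranked row of each partition or the first refer in a session.
Ranking within partitions
select cookieid,
       createtime,
       pv,
       row_number() over (partition by cookieid order by pv desc) as rn
from cookie.cookie2;
Query result
-- So to take the top 3 of each partition, filter on rn <= 3; this suits Top-N queries.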
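The Top-N pattern can be sketched in plain Python as follows. This only imitates numbering rows per cookieid by pv descending and keeping rn <= 3; the helper name `top_n` is hypothetical, and ties are broken by input order here, whereas Hive does not promise any particular order among tied rows.

```python
from itertools import groupby

# (cookieid, createtime, pv) rows copied from the cookie2 sample data
rows = [
    ("cookie1", "2015-04-10", 1), ("cookie1", "2015-04-11", 5),
    ("cookie1", "2015-04-12", 7), ("cookie1", "2015-04-13", 3),
    ("cookie1", "2015-04-14", 2), ("cookie1", "2015-04-15", 4),
    ("cookie1", "2015-04-16", 4),
    ("cookie2", "2015-04-10", 2), ("cookie2", "2015-04-11", 3),
    ("cookie2", "2015-04-12", 5), ("cookie2", "2015-04-13", 6),
    ("cookie2", "2015-04-14", 3), ("cookie2", "2015-04-15", 9),
    ("cookie2", "2015-04-16", 7),
]


def top_n(rows, n):
    """Keep the first n rows of each cookieid when ordered by pv descending."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))  # partition key, then pv desc
    result = {}
    for cookieid, group in groupby(ordered, key=lambda r: r[0]):
        result[cookieid] = [
            (day, pv)
            for rn, (_, day, pv) in enumerate(group, start=1)  # rn plays the role of row_number()
            if rn <= n
        ]
    return result


top3 = top_n(rows, 3)
for cookieid, days in top3.items():
    print(cookieid, days)
```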
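RANK and DENSE_RANK
RANK() returns the rank of a row within its partition; tied rows share a rank and leave gaps in the numbering.
DENSE_RANK() returns the rank of a row within its partition; tied rows share a rank and leave no gaps in the numbering.
Query
select cookieid,
       createtime,
       pv,
       rank() over (partition by cookieid order by pv desc) as rn1,
       dense_rank() over (partition by cookieid order by pv desc) as rn2,
       row_number() over (partition by cookieid order by pv desc) as rn3
from cookie.cookie2
where cookieid = 'cookie1';
Query result
Differences between ROW_NUMBER, RANK and DENSE_RANK
row_number: numbers rows sequentially with no gaps; equal values still get different numbers
rank: numbers rows in order; equal values get the same number, and gaps are left
dense_rank: numbers rows in order; equal values get the same number, and no gaps are left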
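To make the three numbering rules concrete, here is a plain-Python sketch over the cookie1 rows ordered by pv descending. It is an illustration of the definitions above rather than Hive's implementation; the two days with pv = 4 are tied, and this sketch keeps them in input order, which Hive does not guarantee.

```python
# (createtime, pv) for cookie1
rows = [
    ("2015-04-10", 1), ("2015-04-11", 5), ("2015-04-12", 7), ("2015-04-13", 3),
    ("2015-04-14", 2), ("2015-04-15", 4), ("2015-04-16", 4),
]

# Python's sort is stable, so tied rows keep their input order.
ordered = sorted(rows, key=lambda r: r[1], reverse=True)

result = []
rank = dense_rank = 0
previous_pv = None
for row_number, (day, pv) in enumerate(ordered, start=1):
    if pv != previous_pv:
        rank = row_number   # rank jumps to the current position, leaving gaps after ties
        dense_rank += 1     # dense_rank advances by exactly one, leaving no gaps
        previous_pv = pv
    result.append((day, pv, rank, dense_rank, row_number))

for line in result:
    print(*line)
```

The tied rows both get rank 3 and dense_rank 3; the next row then gets rank 5 but dense_rank 4, while row_number simply counts 1 through 7.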
3. Data preparation
cookie3.txt
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
Create the table
use cookie;
drop table if exists cookie3;
create table cookie3(dept string, userid string, sal int)
row format delimited fields terminated by ',';
load data local inpath "/home/hadoop/cookie3.txt" into table cookie3;
select * from cookie3;
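CUME_DIST
Explanation
CUME_DIST: (number of rows with a value less than or equal to the current value) / (total number of rows in the partition)
Query
For example, compute the proportion of people whose salary is less than or equal to the current salary:
select dept,
       userid,
       sal,
       cume_dist() over (order by sal) as rn1,
       cume_dist() over (partition by dept order by sal) as rn2
from cookie.cookie3;
Query result
Result explanation
rn1: there is no PARTITION BY, so all rows form one partition of 5 rows.
  Row 1: 1 row has sal <= 1000, so 1/5 = 0.2
  Row 3: 3 rows have sal <= 3000, so 3/5 = 0.6
rn2: partitioned by dept; dept = d1 has 3 rows.
  Row 2: 2 rows have sal <= 2000, so 2/3 = 0.6666666666666666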
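The ratios in this explanation can be recomputed with a few lines of plain Python. The function name `cume_dist` mirrors the SQL function but is just a sketch of the stated formula applied to the cookie3 rows.

```python
# (dept, userid, sal) rows from cookie3.txt
rows = [
    ("d1", "user1", 1000), ("d1", "user2", 2000), ("d1", "user3", 3000),
    ("d2", "user4", 4000), ("d2", "user5", 5000),
]


def cume_dist(sal, partition):
    """Rows with a value <= sal, divided by the number of rows in the partition."""
    return sum(1 for other in partition if other <= sal) / len(partition)


all_sal = [sal for _, _, sal in rows]
rn1 = [cume_dist(sal, all_sal) for _, _, sal in rows]  # no PARTITION BY
rn2 = [
    cume_dist(sal, [s for d, _, s in rows if d == dept])  # PARTITION BY dept
    for dept, _, sal in rows
]

for (dept, userid, sal), a, b in zip(rows, rn1, rn2):
    print(dept, userid, sal, a, b)
```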
PERCENT_RANK
Explanation
PERCENT_RANK: (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)
Query
select dept,
       userid,
       sal,
       percent_rank() over (order by sal) as rn1, -- within the partition
       rank() over (order by sal) as rn11, -- RANK within the partition
       sum(1) over (partition by null) as rn12, -- total rows in the partition
       percent_rank() over (partition by dept order by sal) as rn2,
       rank() over (partition by dept order by sal) as rn21,
       sum(1) over (partition by dept) as rn22
from cookie.cookie3;
Query result
Result explanation
rn1 == (rn11 - 1) / (rn12 - 1)
rn2 == (rn21 - 1) / (rn22 - 1)
rn1: no PARTITION BY, so the total number of rows is 5.
  Row 1: (1-1)/(5-1) = 0/4 = 0
  Row 2: (2-1)/(5-1) = 1/4 = 0.25
  Row 4: (4-1)/(5-1) = 3/4 = 0.75
rn2: partitioned by dept; dept = d1 has 3 rows.
  Row 1: (1-1)/(3-1) = 0
  Row 3: (3-1)/(3-1) = 1
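The formula can be applied to the cookie3 rows in plain Python as a cross-check. The helper `percent_rank` is a sketch of the stated definition; returning 0.0 for a single-row partition is an assumption of this sketch that avoids dividing by zero.

```python
# (dept, userid, sal) rows from cookie3.txt
rows = [
    ("d1", "user1", 1000), ("d1", "user2", 2000), ("d1", "user3", 3000),
    ("d2", "user4", 4000), ("d2", "user5", 5000),
]


def percent_rank(sal, partition):
    """(RANK - 1) / (rows in partition - 1), with RANK taken in ascending sal order."""
    rank = 1 + sum(1 for other in partition if other < sal)
    if len(partition) == 1:
        return 0.0  # assumption: a single-row partition yields 0
    return (rank - 1) / (len(partition) - 1)


all_sal = [sal for _, _, sal in rows]
rn1 = [percent_rank(sal, all_sal) for _, _, sal in rows]  # no PARTITION BY
rn2 = [
    percent_rank(sal, [s for d, _, s in rows if d == dept])  # PARTITION BY dept
    for dept, _, sal in rows
]

for (dept, userid, sal), a, b in zip(rows, rn1, rn2):
    print(dept, userid, sal, a, b)
```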
4. Data preparation
cookie4.txt
cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,1url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,1url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55
Create the table
use cookie;
drop table if exists cookie4;
create table cookie4(cookieid string, createtime string, url string)
row format delimited fields terminated by ',';
load data local inpath "/home/hadoop/cookie4.txt" into table cookie4;
select * from cookie4;
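LAG
Explanation
LAG(col, n, DEFAULT) returns the value of col from the n-th row above the current row within the window.
The first argument is the column name.
The second argument is the number of rows to look upward (optional, default 1).
The third argument is the default value, used when there is no n-th row above and the lookup would yield NULL (if not specified, the result is NULL).
Query
select cookieid,
       createtime,
       url,
       row_number() over (partition by cookieid order by createtime) as rn,
       LAG(createtime, 1, '1970-01-01 00:00:00') over (partition by cookieid order by createtime) as last_1_time,
       LAG(createtime, 2) over (partition by cookieid order by createtime) as last_2_time
from cookie.cookie4;
Query result
Result explanation
last_1_time: the value 1 row above, with default '1970-01-01 00:00:00'
  cookie1 row 1: there is no row above, so the default 1970-01-01 00:00:00 is used
  cookie1 row 3: the value 1 row above is row 2's value, 2015-04-10 10:00:02
  cookie1 row 6: the value 1 row above is row 5's value, 2015-04-10 10:50:01
last_2_time: the value 2 rows above, with no default specified
  cookie1 row 1: no row 2 rows above, so NULL
  cookie1 row 2: no row 2 rows above, so NULL
  cookie1 row 4: the value 2 rows above is row 2's value, 2015-04-10 10:00:02
  cookie1 row 7: the value 2 rows above is row 5's value, 2015-04-10 10:50:01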
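LEAD
Explanation
The opposite of LAG.
LEAD(col, n, DEFAULT) returns the value of col from the n-th row below the current row within the window.
The first argument is the column name.
The second argument is the number of rows to look downward (optional, default 1).
The third argument is the default value, used when there is no n-th row below and the lookup would yield NULL (if not specified, the result is NULL).
Query
select cookieid,
       createtime,
       url,
       row_number() over (partition by cookieid order by createtime) as rn,
       LEAD(createtime, 1, '1970-01-01 00:00:00') over (partition by cookieid order by createtime) as next_1_time,
       LEAD(createtime, 2) over (partition by cookieid order by createtime) as next_2_time
from cookie.cookie4;
Query result
Result explanation
-- The logic is the same as LAG, except that LAG looks upward and LEAD looks downward.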
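Both functions amount to reading a row at a fixed offset from the current one. The sketch below imitates that in plain Python on the cookie1 timestamps; the helper `shift` and its sign convention (negative for LAG, positive for LEAD) are assumptions for illustration, and Python's None stands in for NULL.

```python
# createtime values for cookie1, sorted as ORDER BY createtime would sort them
times = sorted([
    "2015-04-10 10:00:02", "2015-04-10 10:00:00", "2015-04-10 10:03:04",
    "2015-04-10 10:50:05", "2015-04-10 11:00:00", "2015-04-10 10:10:00",
    "2015-04-10 10:50:01",
])


def shift(values, offset, default=None):
    """Value at row i + offset for every row i, or default if that row does not exist.

    offset < 0 looks upward like LAG, offset > 0 looks downward like LEAD.
    """
    out = []
    for i in range(len(values)):
        j = i + offset
        out.append(values[j] if 0 <= j < len(values) else default)
    return out


last_1_time = shift(times, -1, "1970-01-01 00:00:00")  # LAG(createtime, 1, default)
last_2_time = shift(times, -2)                         # LAG(createtime, 2)
next_1_time = shift(times, 1, "1970-01-01 00:00:00")   # LEAD(createtime, 1, default)
next_2_time = shift(times, 2)                          # LEAD(createtime, 2)

for rn, row in enumerate(zip(times, last_1_time, last_2_time, next_1_time, next_2_time), start=1):
    print(rn, *row)
```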
FIRST_VALUE
Explanation
Returns the first value within the partition, after ordering, up to the current row.
Query
select cookieid,
       createtime,
       url,
       row_number() over (partition by cookieid order by createtime) as rn,
       first_value(url) over (partition by cookieid order by createtime) as first1
from cookie.cookie4;
Query result
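LAST_VALUE
Explanation
Returns the last value within the partition, after ordering, up to the current row.
Query
select cookieid,
       createtime,
       url,
       row_number() over (partition by cookieid order by createtime) as rn,
       last_value(url) over (partition by cookieid order by createtime) as last1
from cookie.cookie4;
Query result
If ORDER BY is not specified, rows are ordered by default according to their offset in the file, which produces wrong results.
To get the last value of the whole partition after ordering, a workaround is needed: take FIRST_VALUE over the descending order.
Query
select cookieid,
       createtime,
       url,
       row_number() over (partition by cookieid order by createtime) as rn,
       LAST_VALUE(url) over (partition by cookieid order by createtime) as last1,
       FIRST_VALUE(url) over (partition by cookieid order by createtime desc) as last2
from cookie.cookie4
order by cookieid, createtime;
Query result
Tip: when using analytic functions, pay special attention to the ORDER BY clause; if it is used inappropriately, the result will not be what you expect.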
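Why last1 and last2 differ becomes clearer when the frame is written out. The plain-Python sketch below assumes the default frame (partition start up to the current row) and uses the cookie1 rows; the variable names mirror the query aliases, and none of this is Hive code.

```python
# (createtime, url) rows for cookie1, sorted as ORDER BY createtime would sort them
rows = sorted([
    ("2015-04-10 10:00:02", "url2"), ("2015-04-10 10:00:00", "url1"),
    ("2015-04-10 10:03:04", "1url3"), ("2015-04-10 10:50:05", "url6"),
    ("2015-04-10 11:00:00", "url7"), ("2015-04-10 10:10:00", "url4"),
    ("2015-04-10 10:50:01", "url5"),
])
urls = [url for _, url in rows]
n = len(urls)

# Ascending order: the frame of row i is urls[0 .. i].
first1 = [urls[:i + 1][0] for i in range(n)]   # always the first row of the partition
last1 = [urls[:i + 1][-1] for i in range(n)]   # the frame ends at the current row, so this is the row itself

# Descending order: row i sits at position n-1-i, so its frame is desc[0 .. n-1-i].
desc = urls[::-1]
last2 = [desc[:n - i][0] for i in range(n)]    # the first value in descending order is the true last value

for rn, row in enumerate(zip(urls, first1, last1, last2), start=1):
    print(rn, *row)
```

last1 simply repeats each row's own url, while last2 is url7 on every row, which is the value the workaround is after.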
5. Data preparation
GROUPING SETS, GROUPING__ID, CUBE, ROLLUP
These analytic functions are typically used in OLAP for metrics that cannot simply be added up and must be computed while drilling up and down across different dimensions, for example UV counts by hour, day and month.
2015-03,2015-03-10,cookie1
2015-03,2015-03-10,cookie5
2015-03,2015-03-12,cookie7
2015-04,2015-04-12,cookie3
2015-04,2015-04-13,cookie2
2015-04,2015-04-13,cookie4
2015-04,2015-04-16,cookie4
2015-03,2015-03-10,cookie2
2015-03,2015-03-10,cookie3
2015-04,2015-04-12,cookie5
2015-04,2015-04-13,cookie6
2015-04,2015-04-15,cookie3
2015-04,2015-04-15,cookie2
2015-04,2015-04-16,cookie1
Create the table
use cookie;
drop table if exists cookie5;
create table cookie5(month string, day string, cookieid string)
row format delimited fields terminated by ',';
load data local inpath "/home/hadoop/cookie5.txt" into table cookie5;
select * from cookie5;
GROUPING SETS and GROUPING__ID
Explanation
Within a single GROUP BY query, aggregates are computed for several combinations of dimensions; this is equivalent to a UNION ALL of the GROUP BY results for each combination.
GROUPING__ID indicates which grouping set a result row belongs to.
Query
select month,
       day,
       count(distinct cookieid) as uv,
       GROUPING__ID
from cookie.cookie5
group by month, day
grouping sets (month, day)
order by GROUPING__ID;
Equivalent to
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM cookie5 GROUP BY day
Query result
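Result explanation
The first column is the month the rows are grouped by.
The second column is the day the rows are grouped by.
The third column is the number of distinct cookieid values in the group, whether it is grouped by month or by day.
The fourth column, GROUPING__ID, shows which grouping set the row belongs to; with the grouping sets (month, day), 1 stands for month and 2 stands for day. (This numbering matches older Hive releases; Hive 2.3.0 changed how GROUPING__ID is computed, so newer versions report different values.)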
Another example
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS uv,
       GROUPING__ID
FROM cookie5
GROUP BY month, day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID;
Equivalent to
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM cookie5 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM cookie5 GROUP BY month, day
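The UNION ALL reading of GROUPING SETS can be imitated in plain Python by running the same distinct count once per grouping set. The helper `uv_by` is hypothetical; it takes the set of column positions to group on and puts None where SQL would show NULL. The rows are copied from the sample data above.

```python
# (month, day, cookieid) rows from the cookie5 sample data
rows = [
    ("2015-03", "2015-03-10", "cookie1"), ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"), ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"), ("2015-04", "2015-04-13", "cookie4"),
    ("2015-04", "2015-04-16", "cookie4"), ("2015-03", "2015-03-10", "cookie2"),
    ("2015-03", "2015-03-10", "cookie3"), ("2015-04", "2015-04-12", "cookie5"),
    ("2015-04", "2015-04-13", "cookie6"), ("2015-04", "2015-04-15", "cookie3"),
    ("2015-04", "2015-04-15", "cookie2"), ("2015-04", "2015-04-16", "cookie1"),
]


def uv_by(rows, dims):
    """COUNT(DISTINCT cookieid) per group; dims holds the grouped column positions (0 = month, 1 = day)."""
    groups = {}
    for row in rows:
        key = tuple(row[i] if i in dims else None for i in (0, 1))
        groups.setdefault(key, set()).add(row[2])
    return {key: len(cookies) for key, cookies in groups.items()}


by_month = uv_by(rows, {0})         # grouping set: month
by_day = uv_by(rows, {1})           # grouping set: day
by_month_day = uv_by(rows, {0, 1})  # grouping set: (month, day)

# One result list, as the UNION ALL of the three branches would give
for part in (by_month, by_day, by_month_day):
    for (month, day), uv in sorted(part.items(), key=lambda kv: str(kv[0])):
        print(month, day, uv)
```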
CUBE
Explanation
Aggregates over every combination of the GROUP BY dimensions.
Query
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS uv,
       GROUPING__ID
FROM cookie5
GROUP BY month, day WITH CUBE
ORDER BY GROUPING__ID;
Equivalent to
SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM cookie5
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM cookie5 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM cookie5 GROUP BY month, day
Query result
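"Every combination of the dimensions" means every subset, including the empty one that gives the grand total. A short plain-Python sketch lists them; the function name `cube` is only an illustration of that enumeration, not a Hive API.

```python
from itertools import combinations


def cube(dims):
    """All grouping sets implied by WITH CUBE: every subset of dims, smallest first."""
    return [
        combo
        for size in range(len(dims) + 1)
        for combo in combinations(dims, size)
    ]


grouping_sets = cube(("month", "day"))
print(grouping_sets)
```

For two dimensions this yields four grouping sets, one per branch of the UNION ALL above; with k dimensions the count grows to 2^k, which is worth keeping in mind before cubing many columns.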
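ROLLUP
Explanation
A subset of CUBE: aggregation is hierarchical, led by the leftmost dimension.
Query
-- For example, hierarchical aggregation led by the month dimension
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS uv,
       GROUPING__ID
FROM cookie5
GROUP BY month, day WITH ROLLUP
ORDER BY GROUPING__ID;
This implements the following drill-up:
UV by month and day -> UV by month -> total UV
-- Swapping the order of month and day makes the aggregation hierarchical on the day dimension:
This implements the following drill-up:
UV by day and month -> UV by day -> total UV
(Here, aggregating by day and month gives the same result as aggregating by day alone, because the two have a parent-child relationship; with other combinations of dimensions the results would differ.)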
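The drill-up levels and the parent-child remark can both be checked with a plain-Python sketch. `rollup` lists the prefixes of the dimension list, which is how the hierarchy above reads, and `uv` is a hypothetical helper that counts distinct cookies per group on the sample rows.

```python
# (month, day, cookieid) rows from the cookie5 sample data
rows = [
    ("2015-03", "2015-03-10", "cookie1"), ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"), ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"), ("2015-04", "2015-04-13", "cookie4"),
    ("2015-04", "2015-04-16", "cookie4"), ("2015-03", "2015-03-10", "cookie2"),
    ("2015-03", "2015-03-10", "cookie3"), ("2015-04", "2015-04-12", "cookie5"),
    ("2015-04", "2015-04-13", "cookie6"), ("2015-04", "2015-04-15", "cookie3"),
    ("2015-04", "2015-04-15", "cookie2"), ("2015-04", "2015-04-16", "cookie1"),
]
COLUMN = {"month": 0, "day": 1}


def rollup(dims):
    """Grouping sets implied by WITH ROLLUP: every prefix of dims, longest first."""
    return [tuple(dims[:k]) for k in range(len(dims), -1, -1)]


def uv(rows, dims):
    """COUNT(DISTINCT cookieid) per group for the given dimension names."""
    groups = {}
    for row in rows:
        key = tuple(row[COLUMN[d]] for d in dims)
        groups.setdefault(key, set()).add(row[2])
    return {key: len(cookies) for key, cookies in groups.items()}


levels = rollup(["day", "month"])       # day+month -> day -> total
by_day_month = uv(rows, ("day", "month"))
by_day = uv(rows, ("day",))
total = uv(rows, ())

print(levels)
print(sorted(by_day_month.values()), sorted(by_day.values()), total[()])
```

Because every day falls in exactly one month, the (day, month) groups and the day groups carry the same UV values, which is the point made in the closing remark.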