Spark SQL: window functions in Spark SQL and the corresponding API

Reposted article, originally published 2019-05-23 11:09. [Big Data: Spark series]

1. Types of window functions

ranking: ranking functions
analytic: analytic functions
aggregate: aggregate functions

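Function Type	SQL	DataFrame API	Description
Ranking	rank	rank	Rank within the partition; values can have gaps after ties
Ranking	dense_rank	dense_rank (denseRank in Spark 1.x)	Rank within the partition; values are always consecutive
Ranking	percent_rank	percent_rank (percentRank in Spark 1.x)	Within a partition: (rank - 1) / (count(score) - 1)
Ranking	ntile	ntile	Splits the ordered rows of a partition into n buckets of near-equal size and returns the bucket index, starting at 1
Ranking	row_number	row_number (rowNumber in Spark 1.x)	Plain sequential row number, like the row numbers in Excel
Analytic	cume_dist	cume_dist (cumeDist in Spark 1.x)	Fraction of the partition's rows ordered at or before the current row
Analytic	first_value	first	First value in the window frame (the partition minimum only when ordering ascending by that column)
Analytic	last_value	last	Last value in the window frame (with order by and the default frame, the value at the current row, not the partition maximum)
Analytic	lag	lag	Value from the row n rows before the current row
Analytic	lead	lead	Value from the row n rows after the current row
Aggregate	min	min	Minimum
Aggregate	max	max	Maximum
Aggregate	sum	sum	Sum
Aggregate	avg	avg	Average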
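
To make the ranking definitions in the table concrete, here is a minimal pure-Python sketch that computes them for a single partition. It does not use Spark: `ranking_columns` is a hypothetical helper written for illustration, and it assumes the scores are already in window order.

```python
from typing import Dict, List


def ranking_columns(scores: List[int], buckets: int) -> List[Dict[str, float]]:
    """Ranking values for ONE partition (hypothetical helper, not a Spark API).

    `scores` must already be in window order, e.g. sorted descending for
    `order by score desc`. `buckets` is the n of ntile(n) and must be >= 1.
    """
    n = len(scores)
    # ntile: the first `extra` buckets hold base + 1 rows, the rest hold base.
    base, extra = divmod(n, buckets)
    rows = []
    rank = dense_rank = 0
    for i, score in enumerate(scores):
        if i == 0 or score != scores[i - 1]:
            rank = i + 1       # jumps to the row position after a run of ties
            dense_rank += 1    # increases by exactly one, so no gaps
        # Rows ordered at or before the current row, counting its ties
        # (equal scores are adjacent because the input is sorted).
        at_or_before = n - scores[::-1].index(score)
        if i < extra * (base + 1):
            ntile = i // (base + 1) + 1
        else:
            ntile = extra + (i - extra * (base + 1)) // base + 1
        rows.append({
            "row_number": i + 1,
            "rank": rank,
            "dense_rank": dense_rank,
            "percent_rank": (rank - 1) / (n - 1) if n > 1 else 0.0,
            "cume_dist": at_or_before / n,
            "ntile": ntile,
        })
    return rows


for row in ranking_columns([100, 99, 99, 98], buckets=3):
    print(row)
```

The input `[100, 99, 99, 98]` is one lesson's scores in descending order, so the two tied 99s share a rank and the next rank skips to 4, while dense_rank continues with 3.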

2. Usage

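count(...) over(partition by ... order by ...) -- number of rows in the partition.
sum(...) over(partition by ... order by ...) -- sum over the partition.
max(...) over(partition by ... order by ...) -- maximum over the partition.
min(...) over(partition by ... order by ...) -- minimum over the partition.
avg(...) over(partition by ... order by ...) -- average over the partition.
rank() over(partition by ... order by ...) -- rank values can have gaps after ties.
dense_rank() over(partition by ... order by ...) -- rank values are always consecutive.
first_value(...) over(partition by ... order by ...) -- first value in the window frame.
last_value(...) over(partition by ... order by ...) -- last value in the window frame.
lag(..., n) over(partition by ... order by ...) -- value from the row n rows before the current row.
lead(..., n) over(partition by ... order by ...) -- value from the row n rows after the current row.
ratio_to_report(...) over(partition by ...) -- the argument is the numerator and its sum over the over() window is the denominator. This is an Oracle function and does not appear to be a Spark SQL built-in; in Spark the same ratio can be written as x / sum(x) over(partition by ...).
percent_rank() over(partition by ... order by ...) -- (rank - 1) / (number of rows in the partition - 1).

Note: for the aggregates and for last_value, adding order by without an explicit frame makes the window end at the current row (ties included), so the result is cumulative. Drop order by, or specify the frame explicitly, to cover the whole partition.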
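
The effect of order by on the window frame is easy to miss, so here is a small runnable check. As a loudly stated assumption, it uses Python's built-in `sqlite3` module (SQLite 3.25 or later) as a stand-in for a Spark session, because both follow the standard default frame of "unbounded preceding to current row" when order by is present. The table and rows are the Math scores from the example later in this article; the column aliases are made up for illustration.

```python
import sqlite3

# SQLite stands in for Spark SQL here (assumption: same default window frame).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE score (name TEXT, lesson TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO score VALUES (?, ?, ?)",
    [("A", "Math", 100), ("B", "Math", 100), ("C", "Math", 99), ("D", "Math", 98)],
)

query = """
SELECT name, score,
       -- no ORDER BY: the frame is the whole partition
       sum(score) OVER (PARTITION BY lesson) AS total,
       -- ORDER BY: the frame ends at the current row and its ties
       sum(score) OVER (PARTITION BY lesson ORDER BY score DESC) AS running,
       last_value(score) OVER (PARTITION BY lesson ORDER BY score DESC)
           AS last_default,
       last_value(score) OVER (PARTITION BY lesson ORDER BY score DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
           AS last_full,
       -- name is added as a tie-breaker so lag/lead are deterministic
       lag(score, 1) OVER (PARTITION BY lesson ORDER BY score DESC, name)
           AS prev_score,
       lead(score, 1) OVER (PARTITION BY lesson ORDER BY score DESC, name)
           AS next_score,
       -- portable replacement for ratio_to_report(score)
       1.0 * score / sum(score) OVER (PARTITION BY lesson) AS ratio
FROM score
ORDER BY score DESC, name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In this run `running` is 200 for both A and B because tied rows fall into the same frame, and `last_default` follows the current row's score while `last_full` is 98 everywhere.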

3. Worked example

Sample data: /root/score.json/score.json, containing student name, lesson, and score.

{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}

select
name,lesson,score,
ntile(2) over (partition by lesson order by score desc ) as ntile_2,
ntile(3) over (partition by lesson order by score desc ) as ntile_3,
row_number() over (partition by lesson order by score desc ) as row_number,
rank() over (partition by lesson order by score desc ) as rank,
dense_rank() over (partition by lesson order by score desc ) as dense_rank, 
percent_rank() over (partition by lesson order by score desc ) as percent_rank 
from score 
order by lesson,name,score

The query returns the result shown in the table below.

name	lesson	score	ntile_2	ntile_3	row_number	rank	dense_rank	percent_rank
A	E	100	1	1	1	1	1	0.0
B	E	99	1	1	2	2	2	0.3333333333333333
C	E	99	2	2	3	2	2	0.3333333333333333
D	E	98	2	3	4	4	3	1.0
A	Math	100	1	1	1	1	1	0.0
B	Math	100	1	1	2	1	1	0.0
C	Math	99	2	2	3	3	2	0.6666666666666666
D	Math	98	2	3	4	4	3	1.0
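
The tie-independent columns of the table above can be cross-checked without a cluster. The sketch below is not the author's setup: it assumes Python's built-in `sqlite3` (SQLite 3.25 or later) as a stand-in engine, loads the same eight records, and runs the ranking part of the query. row_number and ntile are left out on purpose, because for tied scores (B and C in E, A and B in Math) their assignment depends on an arbitrary row order unless a tie-breaker such as name is added to order by.

```python
import json
import sqlite3

# The article's sample records, one JSON object per line.
lines = """\
{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}
"""
records = [json.loads(line) for line in lines.splitlines()]

# SQLite is an assumption standing in for Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE score (name TEXT, lesson TEXT, score INTEGER)")
conn.executemany("INSERT INTO score VALUES (:name, :lesson, :score)", records)

query = """
SELECT name, lesson, score,
       rank()         OVER (PARTITION BY lesson ORDER BY score DESC) AS rk,
       dense_rank()   OVER (PARTITION BY lesson ORDER BY score DESC) AS dense_rk,
       percent_rank() OVER (PARTITION BY lesson ORDER BY score DESC) AS pct_rk
FROM score
ORDER BY lesson, name, score
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The aliases `rk`, `dense_rk`, and `pct_rk` are illustrative names chosen to avoid reusing function names as column names.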


Original article by 程序媛說事兒: https://www.cnblogs.com/abc8023/p/10910741.html
