Spark SQL: Window Functions in Spark SQL and the Corresponding APIs


I. Types of window functions

  1. ranking
  2. analytic
  3. aggregate
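Function type  SQL           DataFrame API  Description
Ranking        rank          rank           Rank values may have gaps: tied rows share a rank and the ranks after them are skipped
Ranking        dense_rank    denseRank      Rank values are always consecutive, even after ties
Ranking        percent_rank  percentRank    Within a partition: (rank - 1) / (number of rows in the partition - 1)
Ranking        ntile         ntile          Splits the ordered rows of a partition into n buckets as evenly as possible and returns the bucket index, starting from 1
Ranking        row_number    rowNumber      Plain sequential row number within the partition, like the row numbers in Excel
Analytic       cume_dist     cumeDist       Within a partition: (rows ordered at or before the current row, ties included) / (number of rows in the partition)
Analytic       first_value   firstValue     First value in the window frame, following the window ordering (the minimum only if the window is ordered ascending by that column)
Analytic       last_value    lastValue      Last value in the window frame, following the window ordering (the maximum only if the window is ordered ascending by that column)
Analytic       lag           lag            Value from the row n rows before the current row
Analytic       lead          lead           Value from the row n rows after the current row
Aggregate      min           min            Minimum
Aggregate      max           max            Maximum
Aggregate      sum           sum            Sum
Aggregate      avg           avg            Average

Note: the camelCase DataFrame names above (denseRank, percentRank, rowNumber, cumeDist) are the Spark 1.4/1.5 spellings. From Spark 1.6 on, the DataFrame functions use the same names as SQL (dense_rank, percent_rank, row_number, cume_dist). The firstValue and lastValue names should be checked against the API documentation of the Spark version in use.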
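The ntile and cume_dist descriptions are easier to follow with numbers. Below is a minimal plain-Python sketch of their semantics; it is not Spark code, and the helper names ntile and cume_dist are only illustrative. It assumes the input is already sorted in the window's order.

```python
def ntile(num_rows, n):
    """Bucket index (starting at 1) for each of num_rows ordered rows."""
    base, extra = divmod(num_rows, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets each hold one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


def cume_dist(sorted_values):
    """Fraction of rows ordered at or before each row, ties included."""
    total = len(sorted_values)
    result = []
    for value in sorted_values:
        # Index of the last row that ties with the current one.
        last_peer = max(i for i, v in enumerate(sorted_values) if v == value)
        result.append((last_peer + 1) / total)
    return result


scores_desc = [100, 99, 99, 98]  # one partition, ordered by score desc
print(ntile(len(scores_desc), 2))
print(ntile(len(scores_desc), 3))
print(cume_dist(scores_desc))
```

Note that the buckets are contiguous runs of the ordered rows, and when the rows do not divide evenly, the earlier buckets receive the extra rows.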

II. Usage

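count(...) over(partition by ... order by ...) -- row count within the partition.
sum(...) over(partition by ... order by ...) -- sum within the partition.
max(...) over(partition by ... order by ...) -- maximum within the partition.
min(...) over(partition by ... order by ...) -- minimum within the partition.
avg(...) over(partition by ... order by ...) -- average within the partition.
rank() over(partition by ... order by ...) -- rank values may have gaps.
dense_rank() over(partition by ... order by ...) -- rank values are consecutive.
first_value(...) over(partition by ... order by ...) -- first value in the window frame.
last_value(...) over(partition by ... order by ...) -- last value in the window frame.
lag(..., n) over(partition by ... order by ...) -- value from the row n rows before the current row.
lead(..., n) over(partition by ... order by ...) -- value from the row n rows after the current row.
ratio_to_report(...) over(partition by ...) -- the expression in the parentheses is the numerator, and its sum over the window defined by over() is the denominator. This is an Oracle analytic function; Spark SQL does not appear to provide it, and the equivalent there is x / sum(x) over(partition by ...).
percent_rank() over(partition by ... order by ...) -- (rank - 1) / (number of rows in the partition - 1).

Note: when order by is present and no frame is specified, the default frame runs from the start of the partition to the current row (including rows tied with it). The aggregates above are then running values, and last_value returns the value of the current row's last peer. To aggregate over the whole partition, omit order by or specify the frame explicitly.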
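The difference between a running aggregate and a whole-partition aggregate, and the row offsets used by lag and lead, can be sketched in plain Python. This is only an illustration of the semantics under the default frame, not Spark code; the function names are hypothetical, and ratio_to_total stands in for the x / sum(x) over(partition by ...) form.

```python
def running_sum(sorted_values):
    """sum(x) over (partition by ... order by x) with the default frame."""
    result = []
    for value in sorted_values:
        # The default frame ends at the last row that ties with the current one.
        last_peer = max(i for i, v in enumerate(sorted_values) if v == value)
        result.append(sum(sorted_values[: last_peer + 1]))
    return result


def partition_sum(values):
    """sum(x) over (partition by ...): every row sees the whole partition."""
    return [sum(values)] * len(values)


def lag(values, n=1, default=None):
    """Value n rows before the current row, or default if there is none."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows after the current row, or default if there is none."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


def ratio_to_total(values):
    """x / sum(x) over (partition by ...)."""
    total = sum(values)
    return [v / total for v in values]


scores_desc = [100, 99, 99, 98]  # one partition, ordered by score desc
print(running_sum(scores_desc))
print(partition_sum(scores_desc))
print(lag(scores_desc), lead(scores_desc))
print(ratio_to_total(scores_desc))
```

The two rows with score 99 get the same running sum because tied rows belong to the same frame.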

III. A worked example

Sample data: /root/score.json/score.json, containing student name, lesson, and score

{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}

select
  name, lesson, score,
  ntile(2) over (partition by lesson order by score desc) as ntile_2,
  ntile(3) over (partition by lesson order by score desc) as ntile_3,
  row_number() over (partition by lesson order by score desc) as row_number,
  rank() over (partition by lesson order by score desc) as rank,
  dense_rank() over (partition by lesson order by score desc) as dense_rank,
  percent_rank() over (partition by lesson order by score desc) as percent_rank
from score
order by lesson, name, score

The query output is shown in the table below.

name  lesson  score  ntile_2  ntile_3  row_number  rank  dense_rank  percent_rank
A     E       100    1        1        1           1     1           0.0
B     E       99     1        1        2           2     2           0.3333333333333333
C     E       99     2        2        3           2     2           0.3333333333333333
D     E       98     2        3        4           4     3           1.0
A     Math    100    1        1        1           1     1           0.0
B     Math    100    1        1        2           1     1           0.0
C     Math    99     2        2        3           3     2           0.6666666666666666
D     Math    98     2        3        4           4     3           1.0


=================================================================================

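Original article. When reposting, please place the following notice at the top of the post (keeping the hyperlink).
This article is reposted from 程序媛說事兒; original link: https://www.cnblogs.com/abc8023/p/10910741.html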

=================================================================================

