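In Hive, when you need the top N rows, the three functions row_number(), rank() and dense_rank() come in handy.
In short, all three assign a ranking, but there are subtle differences between them.
Running a query and looking at the result makes the differences clear.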
Sample data (id, name, sal):
1 a 10
2 a 12
3 b 13
4 b 12
5 a 14
6 a 15
7 a 13
8 b 11
9 a 16
10 b 17
11 a 14
SQL statement:
select id,
       name,
       sal,
       rank() over (partition by name order by sal desc) rp,
       dense_rank() over (partition by name order by sal desc) drp,
       row_number() over (partition by name order by sal desc) rmp
from f_test
Result:
id name sal rp drp rmp
10 b 17 1 1 1
3 b 13 2 2 2
4 b 12 3 3 3
8 b 11 4 4 4
9 a 16 1 1 1
6 a 15 2 2 2
11 a 14 3 3 3
5 a 14 3 3 4
7 a 13 5 4 5
2 a 12 6 5 6
1 a 10 7 6 7
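The result shows:
rank(): rows with equal values share the same rank, and the ranks after a tie are skipped, so the last rank still equals the number of rows.
dense_rank(): rows with equal values share the same rank and no ranks are skipped, so the last rank is smaller than the number of rows whenever there are ties.
row_number(): rows are numbered consecutively in sort order, so tied rows still get different numbers.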
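To make the three rules concrete, here is a minimal Python sketch that recomputes the rp, drp and rmp columns for the sample data. It is not how Hive implements these functions. The helper names rank_partition and rank_all are made up for illustration, and the order of tied rows (ids 5 and 11) is an assumption, since the query itself does not pin it down.

```python
from itertools import groupby

# Sample rows from the post: (id, name, sal)
ROWS = [
    (1, "a", 10), (2, "a", 12), (3, "b", 13), (4, "b", 12),
    (5, "a", 14), (6, "a", 15), (7, "a", 13), (8, "b", 11),
    (9, "a", 16), (10, "b", 17), (11, "a", 14),
]


def rank_partition(rows):
    """Return (id, name, sal, rank, dense_rank, row_number) for one
    partition, ordered by sal descending."""
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    out = []
    rank = dense = 0
    prev_sal = None
    for row_number, (id_, name, sal) in enumerate(ordered, start=1):
        if sal != prev_sal:
            rank = row_number  # rank jumps to the current position after a tie
            dense += 1         # dense_rank only ever advances by one
            prev_sal = sal
        out.append((id_, name, sal, rank, dense, row_number))
    return out


def rank_all(rows):
    """Apply rank_partition to every name group (the PARTITION BY step)."""
    by_name = sorted(rows, key=lambda r: r[1])
    result = []
    for _, group in groupby(by_name, key=lambda r: r[1]):
        result.extend(rank_partition(list(group)))
    return result


if __name__ == "__main__":
    for row in rank_all(ROWS):
        print(*row)
```

The two rows with sal 14 get the same rank and dense_rank but different row numbers, and only rank leaves a gap (3, 3, 5) after the tie.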
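I happened to hear about a requirement: find the people whose sal is in the top 50% within each name group.
I wrote it with the functions above: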
select *
from (
    select id,
           name,
           sal,
           rank() over (partition by name order by sal desc) rp,
           dense_rank() over (partition by name order by sal desc) drp,
           row_number() over (partition by name order by sal desc) rmp,
           count(*) over (partition by name) * 0.5 as half_cnt
    from f_test
) t
where t.rp <= t.half_cnt;
This works, but it feels a bit complicated. Is there a better way to do it?
NTILE
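NTILE(n) splits the ordered rows of each partition into n buckets and returns the bucket number of the current row.
NTILE does not support ROWS BETWEEN, so a call such as NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.
If the rows cannot be divided evenly, the extra rows go to the leading buckets, so with NTILE(2) the first bucket gets one more row.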
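The bucket-sizing rule can be sketched in a few lines of Python. This is an illustration of the rule described above, not Hive's code. The function name ntile and its arguments are hypothetical, and it takes only a row count because the bucket number depends on a row's position, not on its values.

```python
def ntile(row_count, n):
    """Return the 1-based bucket number for each of row_count ordered rows,
    following the NTILE(n) rule that leading buckets absorb the remainder."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets each hold one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


if __name__ == "__main__":
    print(ntile(7, 2))  # partition "a" in the sample data has 7 rows
    print(ntile(4, 2))  # partition "b" has 4 rows
```

For the 7 rows of partition a, NTILE(2) puts 4 rows in bucket 1 and 3 rows in bucket 2. The 4 rows of partition b split evenly.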
The top 50% example above is a good fit for NTILE:
select *
from (
    select id,
           name,
           sal,
           NTILE(2) over (partition by name order by sal desc) rn
    from f_test
) t
where t.rn = 1
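As a sanity check, the sketch below compares the two approaches on the sample data in plain Python. It assumes the rank-based filter keeps rows whose rank is at most half of the partition's row count. The helper names are made up for illustration and nothing here is Hive API.

```python
from collections import defaultdict

# Sample rows from the post: (id, name, sal)
ROWS = [
    (1, "a", 10), (2, "a", 12), (3, "b", 13), (4, "b", 12),
    (5, "a", 14), (6, "a", 15), (7, "a", 13), (8, "b", 11),
    (9, "a", 16), (10, "b", 17), (11, "a", 14),
]


def partitions(rows):
    """Group rows by name and order each group by sal descending."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return {name: sorted(g, key=lambda r: r[2], reverse=True)
            for name, g in groups.items()}


def top_half_by_rank(rows):
    """Ids of rows whose rank() is at most half the partition's row count."""
    kept = []
    for ordered in partitions(rows).values():
        half = len(ordered) * 0.5
        rank, prev_sal = 0, None
        for pos, row in enumerate(ordered, start=1):
            if row[2] != prev_sal:
                rank, prev_sal = pos, row[2]
            if rank <= half:
                kept.append(row[0])
    return sorted(kept)


def top_half_by_ntile(rows):
    """Ids of rows that land in bucket 1 of NTILE(2)."""
    kept = []
    for ordered in partitions(rows).values():
        first_bucket = (len(ordered) + 1) // 2  # bucket 1 takes the extra row
        kept.extend(row[0] for row in ordered[:first_bucket])
    return sorted(kept)


if __name__ == "__main__":
    print(top_half_by_rank(ROWS))
    print(top_half_by_ntile(ROWS))
```

Both functions return the same ids for this data. They can disagree when tied sal values straddle the midpoint: the rank-based filter keeps or drops all tied rows together, while NTILE cuts purely by position.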