hive中一般取top n时,row_number(),rank,dense_ran()这三个函数就派上用场了,
先简单说下这三函数都是排名的,不过呢还有点细微的区别。
通过代码运行结果一看就明白了。
示例数据:
1
2
3
4
5
6
7
8
9
10
11
|
1 a 10
2 a 12
3 b 13
4 b 12
5 a 14
6 a 15
7 a 13
8 b 11
9 a 16
10 b 17
11 a 14
|
sql语句
1
2
3
4
5
6
7
|
select
id,
name
,
sal,
rank()over(partition
by
name
order
by
sal
desc
) rp,
dense_rank() over(partition
by
name
order
by
sal
desc
) drp,
row_number()over(partition
by
name
order
by
sal
desc
) rmp
from
f_test
|
结果
10 b 17 1 1 1 3 b 13 2 2 2 4 b 12 3 3 3 8 b 11 4 4 4 9 a 16 1 1 1 6 a 15 2 2 2 11 a 14 3 3 3 5 a 14 3 3 4 7 a 13 5 4 5 2 a 12 6 5 6 1 a 10 7 6 7
从结果看出
rank() 排序相同时会重复,总数不会变
dense_rank()排序相同时会重复,总数会减少
row_number() 会根据顺序计算
正好听到一个需求,求sal前50%的人
用这个写了一下,
1
2
3
4
5
6
7
8
9
10
|
select
*
from
(
select
id,
name
,
sal,
rank()over(partition
by
name
order
by
sal
desc
) rp,
dense_rank() over(partition
by
name
order
by
sal
desc
) drp,
row_number()over(partition
by
name
order
by
sal
desc
) rmp,
count
(*)over(partition
by
name
) *0.5
as
count
from
f_test
) t
where
t.rp <t.
count
;
|
感觉虽然可以实现,但是有点复杂,有没有更好的方法实现呢
NTILE
NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
如果切片不均匀,默认增加第一个切片的分布
上面那个例子 正好可以用到这个
1
2
3
4
5
6
7
|
select
*
from
(
select
id,
name
,
sal,
NTILE(2) over(partition
by
name
order
by
sal
desc
) rn
from
f_test
) t
where
t.rn=1
|