Suppose we have the following table:
select 'a' as category, 19 as duration
union all
select 'b' as category, 15 as duration
union all
select 'c' as category, 12 as duration
union all
select 'd' as category, 53 as duration
union all
select 'e' as category, 27 as duration
union all
select 'f' as category, 9 as duration;
category | duration
b | 15
f | 9
e | 27
c | 12
d | 53
a | 19
We want to collapse these rows into a single row, ordered by duration from largest to smallest, to get: d,e,a,b,c,f
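First, rank the rows: row_number() over (order by cast(duration as int) desc) duration_rank (there is no partition by category here, because each category appears only once and every row would get rank 1), then concatenate: concat_ws(',',collect_set(category)). The result, however, comes back in arbitrary order. The root cause lies in MapReduce: collect_set makes no promise about element order, and once more than one mapper/reducer is started to process the data, the order of the selected rows is almost certainly different from the original order.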
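For reference, the first attempt described above might look like the following when written out as a complete query. This is only a sketch assembled from the fragments in the text; the subquery aliases t and T are borrowed from the final query further down. The order of the letters in its output is not guaranteed.

```sql
-- Naive attempt: the rank is computed but never used by the aggregation,
-- so the concatenated string may come back in any order.
select
  concat_ws(',', collect_set(category))
from
  (select
     category
     ,row_number() over (order by cast(duration as int) desc) duration_rank
   from
     (select 'a' as category, 19 as duration
      union all
      select 'b' as category, 15 as duration
      union all
      select 'c' as category, 12 as duration
      union all
      select 'd' as category, 53 as duration
      union all
      select 'e' as category, 27 as duration
      union all
      select 'f' as category, 9 as duration) t
  ) T;
```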
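One fix is to pin the number of mappers to 1. Another is to carry the rank into the collected values, sort once more on it, and strip the rank after concatenation: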
select
  regexp_replace(
    concat_ws(',',
      sort_array(
        collect_list(
          -- prefix each category with its zero-padded rank, e.g. 00001:d
          concat_ws(':', lpad(cast(duration_rank as string), 5, '0'), cast(category as string))
        )
      )
    ),
    '\\d+:', '')  -- after sorting, strip every "rank:" prefix
from
  (select
     category
     ,row_number() over (order by cast(duration as int) desc) duration_rank
   from
     (select 'a' as category, 19 as duration
      union all
      select 'b' as category, 15 as duration
      union all
      select 'c' as category, 12 as duration
      union all
      select 'd' as category, 53 as duration
      union all
      select 'e' as category, 27 as duration
      union all
      select 'f' as category, 9 as duration) t
  ) T;
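duration_rank must be left-padded with enough zeros that every value has the same width, because sort_array is sorting strings here, not numbers. Without the padding, lexicographic order gives 1, 10, 11, 12, 13, 2, 3, 4..., which is wrong again. Once the sorted values are concatenated, regexp_replace strips each colon together with the digits in front of it, and the job is done.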
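The effect of the padding can be checked in isolation. The sketch below uses made-up rank values (1, 2, 10, 11) rather than data from the table above, and compares sort_array on the raw strings with sort_array on the same strings padded to width 5.

```sql
-- Illustrative values only: ranks 1, 2, 10, 11 as strings.
select
  -- string comparison puts '10' and '11' before '2'
  sort_array(array('1', '2', '10', '11')) as unpadded,
  -- equal-width strings sort in the same order as the numbers
  sort_array(array(lpad('1', 5, '0'), lpad('2', 5, '0'),
                   lpad('10', 5, '0'), lpad('11', 5, '0'))) as padded;
-- unpadded: ["1","10","11","2"]
-- padded:   ["00001","00002","00010","00011"]
```

A width of 5 covers ranks up to 99999. If the row count could exceed that, the width passed to lpad would need to grow accordingly.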