Hive: sorting before collect_set


Suppose we have the following table:

select 'a' as category, 19 as duration
union all
select 'b' as category, 15 as duration
union all
select 'c' as category, 12 as duration
union all
select 'd' as category, 53 as duration
union all
select 'e' as category, 27 as duration
union all
select 'f' as category, 9  as duration;

 category | duration 
 b        |       15 
 f        |        9 
 e        |       27 
 c        |       12 
 d        |       53 
 a        |       19 

We want to collapse the rows into a single row, ordered by duration from largest to smallest, so that the result is d,e,a,b,c,f.

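The obvious approach is to rank the rows first with row_number() over (order by cast(duration as int) desc) as duration_rank, and then concatenate with concat_ws(',',collect_set(category)). Note that the window must not be partitioned by category: each category appears only once here, so partitioning by it would give every row rank 1. Even with a correct rank, though, the concatenated result comes out in arbitrary order. The root cause lies in MapReduce: collect_set makes no guarantee about element order, and once more than one mapper/reducer is started to process the data, the rows almost certainly reach the aggregation in a different order from the one the subquery produced.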
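For reference, the naive attempt described above might look like the sketch below. The aliases categories and ranked are illustrative names, not from the original. No expected output is shown because the element order is exactly what Hive does not guarantee here.

```sql
-- Naive attempt: rank the rows, then aggregate without carrying the rank along.
-- The order of the elements returned by collect_set is NOT guaranteed.
select concat_ws(',', collect_set(category)) as categories
from (
  select
    category,
    row_number() over (order by cast(duration as int) desc) as duration_rank
  from (
    select 'a' as category, 19 as duration
    union all
    select 'b' as category, 15 as duration
    union all
    select 'c' as category, 12 as duration
    union all
    select 'd' as category, 53 as duration
    union all
    select 'e' as category, 27 as duration
    union all
    select 'f' as category, 9 as duration
  ) t
) ranked;
```

The duration_rank column is computed but never used by the outer aggregation, which is why the sort has no effect on the final string.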

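One fix is to pin the number of mappers to 1. Another is to carry the rank into the aggregation, sort on it a second time, and strip it off once the values are concatenated: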

select
  regexp_replace(
    concat_ws(',',
      sort_array(
        collect_list(
          -- prefix each category with its zero-padded rank, e.g. '00001:d'
          concat_ws(':', lpad(cast(duration_rank as string), 5, '0'), cast(category as string))
        )
      )
    ),
    -- strip every 'rank:' prefix once the elements are sorted and joined
    '\\d+:', ''
  ) as categories
from (
  select
    category,
    -- global rank by duration, largest first (no partition by)
    row_number() over (order by cast(duration as int) desc) as duration_rank
  from (
    select 'a' as category, 19 as duration
    union all
    select 'b' as category, 15 as duration
    union all
    select 'c' as category, 12 as duration
    union all
    select 'd' as category, 53 as duration
    union all
    select 'e' as category, 27 as duration
    union all
    select 'f' as category, 9 as duration
  ) t
) ranked;

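duration_rank must be left-padded with enough zeros that every rank has the same width, because sort_array is comparing strings here, not numbers. Without the padding, lexicographic order gives 1, 10, 11, 12, 13, 2, 3, 4, ..., which is wrong again. A width of 5 covers ranks up to 99999, so widen it for larger inputs. Once the sorted elements are joined, regexp_replace removes each colon together with the digits in front of it, and we are done.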
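The padding issue can be seen in isolation with a small sketch. The literal values and the column aliases unpadded and padded are made up for illustration.

```sql
-- sort_array on strings compares character by character,
-- so '10' sorts before '2' unless the values share a common width.
select
  sort_array(array('1', '10', '2')) as unpadded,
  sort_array(array(lpad('1', 5, '0'), lpad('10', 5, '0'), lpad('2', 5, '0'))) as padded;
-- unpadded: ["1","10","2"]
-- padded:   ["00001","00002","00010"]
```

With equal-width strings, lexicographic order and numeric order coincide, which is what the main query relies on.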

