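Several ways to implement row_number() over (partition by) in ClickHouse
Hive provides the row_number() over (partition by ...) window function, which produces the desired ranking in a single SQL statement. ClickHouse offers several ways to get the same result; this article walks through a few of them.
Contents
1. Ranking with row_number
2. Ranking with row_number and keeping only rank = 1
3. A special case
1. Ranking with row_number
Hive version: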
select number,
row_number() over (partition by number order by time desc) as rank
from table a
ClickHouse version:
select number,
groupArray(time) AS arr_val,
arrayEnumerate(arr_val) as row_number
from (select distinct orderid as number,
toDate(operatetime) as time
from table
order by time desc
) a
GROUP BY number
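The query above returns one row per number, with the row numbers packed into an array. If one row per original record is needed, as row_number() gives in Hive, the arrays can be expanded with ARRAY JOIN. The following is a minimal sketch under assumptions: `orders` is a hypothetical table name standing in for the real table, and the array is sorted explicitly with arrayReverseSort instead of relying on the order of the subquery.

```sql
-- Sketch only: `orders` is a hypothetical table name; orderid and operatetime
-- are the column names used in the query above.
select number,
       time,
       row_number
from (select orderid as number,
             -- groupUniqArray mirrors the DISTINCT above;
             -- arrayReverseSort orders the dates descending inside the array
             arrayReverseSort(groupUniqArray(toDate(operatetime))) as arr_val
      from orders
      group by orderid
     ) b
-- expand both arrays in parallel: one output row per array element
array join arr_val as time,
           arrayEnumerate(arr_val) as row_number
```

Each output row then carries the number, one date, and that date's position within its group.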
2. Ranking with row_number and keeping only rank = 1
Hive version:
select orderid
from (select orderid,
row_number() over(partition by orderid order by datachange_lasttime desc) as row_num
from table
where d = '${CurrentDate}'
) a
where row_num = 1;
ClickHouse version:
Method 1: use groupArray
select orderid,
groupArray(1)(datachange_lasttime) as dates
from (select orderid,
datachange_lasttime
from table
ORDER BY orderid, datachange_lasttime desc
) a
group by orderid
Method 2: use max to get the first value in descending order; for ascending order, use min instead
select orderid,
max(datachange_lasttime) as datachange_lasttime
from table
group by orderid
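If another column from the same latest row is needed alongside the maximum, ClickHouse's argMax aggregate function can return it within the same GROUP BY. A minimal sketch, assuming a hypothetical table named `orders` with a status column; the alias latest_time is also illustrative, chosen to differ from the column name so that the argument of argMax still refers to the raw column.

```sql
-- Sketch only: `orders` and the alias latest_time are hypothetical names.
select orderid,
       max(datachange_lasttime) as latest_time,
       -- status taken from the row with the largest datachange_lasttime;
       -- when several rows tie on that value, any one of them may be used
       argMax(status, datachange_lasttime) as status
from orders
group by orderid
```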
Method 3: use the rowNumberInAllBlocks function
select orderid, status
from (select orderid, status, rowNumberInAllBlocks() as rank
from (select orderid, status, datachange_lasttime
from table
order by orderid, datachange_lasttime desc
) a
) b LIMIT 1 BY orderid
Method 4: use the arrayEnumerate function
select orderid
from (select orderid,
groupArray(datachange_lasttime) AS arr_val,
arrayEnumerate(arr_val) as row_number
from (select orderid, datachange_lasttime
from table
order by datachange_lasttime desc
) a
GROUP BY orderid
) b
-- expand the index array into rows before filtering on it
ARRAY JOIN row_number
where row_number = 1
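Newer ClickHouse releases also support window functions natively, so on those versions the Hive query can be carried over almost unchanged. A minimal sketch, assuming a hypothetical table named `orders` and a server version recent enough to ship window functions (some older releases gated them behind an experimental setting):

```sql
-- Sketch only: `orders` is a hypothetical table name;
-- requires a ClickHouse version with window function support.
select orderid
from (select orderid,
             row_number() over (partition by orderid
                                order by datachange_lasttime desc) as row_num
      from orders
     ) a
where row_num = 1
```

The array-based and LIMIT BY approaches remain useful on older servers where this syntax is not available.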
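3. A special case
Requirement:
In the following scenario, rows must be grouped by orderid, sorted by date in descending order, and only the latest row kept. If several rows share the same date, any one of them may be returned.
Hive version: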
select orderid
from (select orderid,
status,
row_number() over(partition by orderid order by datachange_lasttime desc) as row_num
from table
where d = '${CurrentDate}'
) as b
where row_num = 1
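ClickHouse version:
From the examples above, a natural idea is to treat the earlier result as a subquery and join it back to the original table. One possible way to write such a join: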
select a.orderid as orderid_a,
a.status as status
from olap_htlmaindb.tmp_ord_orders_status_s_pre a
inner join (select orderid,
groupArray(1)(datachange_lasttime) as dates
from (select orderid,
datachange_lasttime
from table
ORDER BY orderid, datachange_lasttime desc
) a
group by orderid
) b
on a.orderid = b.orderid
and cast(a.datachange_lasttime as String) = cast(b.dates[1] as String)
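Here we first extract the orderid and timestamp that satisfy the requirement, then join back to the original table to fetch the other columns we need. The join is necessary because these functions share one drawback: they can only carry the partition by column and the sort column, not any other columns. However, when several rows of the same orderid share the latest timestamp, the join matches all of them instead of one. So none of the four methods above, combined with an inner join on the original table, solves this case.
This is where LIMIT 1 BY comes in. It is in fact the most effective approach: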
select orderid, status, datachange_lasttime
from table
order by orderid, datachange_lasttime desc
LIMIT 1 BY orderid
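To see how LIMIT 1 BY handles ties, the query can be tried on a few inline rows. This is a small sketch with made-up sample data (the order ids, statuses and timestamps are illustrative only, not taken from any real table):

```sql
-- Sketch only: the inline rows are invented sample data.
select orderid, status, datachange_lasttime
from (
    select 1 as orderid, 'paid' as status,
           toDateTime('2021-01-02 00:00:00') as datachange_lasttime
    union all
    select 1, 'shipped', toDateTime('2021-01-02 00:00:00')  -- ties with the row above
    union all
    select 1, 'created', toDateTime('2021-01-01 00:00:00')
    union all
    select 2, 'created', toDateTime('2021-01-01 00:00:00')
) t
order by orderid, datachange_lasttime desc
limit 1 by orderid
```

The result has exactly one row per orderid. For orderid 1, either of the two tied rows may be returned, which matches the requirement that any one of them is acceptable.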