Implementing the window functions ROW_NUMBER / RANK and LAG / LEAD in ClickHouse


Implementing ROW_NUMBER

This section shows how to write ClickHouse queries equivalent to ROW_NUMBER OVER and DENSE_RANK OVER, which some other databases provide for ranking rows.

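At the time of writing, ClickHouse did not directly provide the corresponding window functions (newer releases support the OVER clause natively), so they have to be emulated with a few special functions. The main ones are the following array functions: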

arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq

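Each of these functions takes an array as input and returns an array of the same length with one number per element: arrayEnumerate returns the element's position (1, 2, 3, ...), arrayEnumerateDense returns a number identifying where the element's value first appeared, and arrayEnumerateUniq returns how many times the element's value has occurred so far. For example: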

SELECT    
 arrayEnumerate([10, 20, 30, 10, 40]) AS row_number,     
 arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank,    
 arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank
 
┌─row_number──┬─dense_rank──┬─uniq_rank───┐
│ [1,2,3,4,5] │ [1,2,3,1,4] │ [1,1,1,2,1] │
└─────────────┴─────────────┴─────────────┘
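
As a quick check of how these functions relate to ranking, the sketch below sorts a small literal array first and then numbers it. The array values are made up for illustration. On sorted input, equal values end up adjacent, so arrayEnumerateDense gives them the same number, which is the DENSE_RANK behavior the rest of this post relies on.

```sql
-- Illustrative literal array; sorting first makes equal values adjacent.
SELECT
    arraySort([30, 10, 20, 10]) AS sorted,      -- [10,10,20,30]
    arrayEnumerate(sorted)      AS row_number,  -- [1,2,3,4]
    arrayEnumerateDense(sorted) AS dense_rank,  -- [1,1,2,3]
    arrayEnumerateUniq(sorted)  AS uniq_rank    -- [1,2,1,1]
```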

 

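Data format: the queries below read the columns customer_id, loan_id, loan_dt, end_date and due_days from the table res_report.xxx_loan.

Our goal is to implement the following window queries: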

ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerate
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateDense
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateUniq
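
For comparison, here is a minimal sketch of the first two target queries written with native window functions. It assumes a ClickHouse release that supports the OVER clause, which is not the setup this post was written for. The column names id and val follow the notation above, and the inline rows are invented for illustration.

```sql
-- Assumes native window function support; the inline data is hypothetical.
SELECT
    id,
    val,
    row_number() OVER (PARTITION BY id ORDER BY val) AS row_num,
    dense_rank() OVER (PARTITION BY id ORDER BY val) AS dense_rnk
FROM VALUES('id UInt32, val UInt32', (1, 10), (1, 10), (1, 20), (2, 5), (2, 7))
ORDER BY id, val, row_num
-- Expected rows (id, val, row_num, dense_rnk):
-- (1,10,1,1), (1,10,2,1), (1,20,3,2), (2,5,1,1), (2,7,2,2)
```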

The code using the array functions is as follows:

SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
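
The query above relies on groupArray seeing the rows in the order produced by the inner ORDER BY. A minimal self-contained sketch of the same grouping step follows; it sorts inside the aggregation with arraySort instead, so the result does not depend on that ordering. The names id and val and the inline rows are assumptions for illustration, not the real loan table.

```sql
-- Hypothetical inline data; arraySort fixes the order within each group.
SELECT
    id,
    arraySort(groupArray(val)) AS vals,        -- id 1: [10,10,20]  id 2: [5,7]
    arrayEnumerate(vals)       AS row_number,  -- id 1: [1,2,3]     id 2: [1,2]
    arrayEnumerateDense(vals)  AS dense_rank   -- id 1: [1,1,2]     id 2: [1,2]
FROM VALUES('id UInt32, val UInt32', (1, 20), (1, 10), (2, 7), (1, 10), (2, 5))
GROUP BY id
ORDER BY id
```

Sorting a single array works here because only one column is collected; with several parallel arrays, as in the loan query, all of them would need to be reordered consistently.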

 

Expanding the arrays: use ARRAY JOIN to expand the arrays back into rows, sorted by customer_id and row_number:

SELECT 
customer_id
,loan_id
,loan_dt
,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
,due_days
,row_number 
,dense_rank
,uniq_rank
from 
(
SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
)
ARRAY JOIN  
    loan_dt,
    loan_id, 
    end_date,
    due_days,
    row_number,     
    dense_rank,     
    uniq_rank
ORDER BY     
customer_id ASC,     
row_number ASC ,     
dense_rank ASC

 

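Tip: end_date may be NULL, and groupArray skips NULL values, so the end_date array would come out shorter than the others and ARRAY JOIN would raise an error. Fill the NULLs with a sentinel value (2099-12-31 here) and convert it back to NULL at the end.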
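
The length mismatch is easy to reproduce in isolation. The sketch below uses three made-up values, one of them NULL, and compares the array length with and without the sentinel fill.

```sql
-- Illustrative values only; the middle one is NULL.
SELECT
    length(groupArray(x)) AS len_without_fill,                            -- 2: the NULL is skipped
    length(groupArray(ifNull(x, toDate('2099-12-31')))) AS len_with_fill  -- 3: one entry per row
FROM
(
    SELECT arrayJoin([toDate('2024-01-01'), NULL, toDate('2024-03-01')]) AS x
)
```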

Implementing lag/lead:

neighbor(column, offset[, default_value])
 
The result of the function depends on the affected data blocks and the order of data in the block.
If you make a subquery with ORDER BY and call the function from outside the subquery, you can get the expected result.
 
Parameters
 
column — A column name or scalar expression.
offset — The number of rows forwards or backwards from the current row of column. Int64.
default_value — Optional. The value to be returned if offset goes beyond the scope of the block. Type of data blocks affected.
Returned values
 
Value for column in offset distance from current row if offset value is not outside block bounds.
Default value for column if offset value is outside block bounds. If default_value is given, then it will be used.
Type: type of data blocks affected or default value type.
 
 
Reference:
https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/
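
To see the behavior described above on a tiny input, here is a minimal sketch using the built-in numbers table function. It assumes all four rows arrive in a single block, and 999 is an arbitrary default chosen for illustration. Newer ClickHouse releases treat neighbor as deprecated in favor of window functions, so it may need to be enabled explicitly there.

```sql
-- numbers(4) yields 0, 1, 2, 3 in one block for an input this small.
SELECT
    number,
    neighbor(number, 1)       AS lead_value,  -- 1, 2, 3, 0   (0 is the UInt64 default)
    neighbor(number, -1, 999) AS lag_value    -- 999, 0, 1, 2 (999 is the supplied default)
FROM numbers(4)
```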

The code is as follows:

SELECT 
customer_id
,loan_id
,loan_dt
,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
,due_days
,row_number 
,dense_rank
,uniq_rank
,if(neighbor(row_number , 1)<>1,neighbor(loan_dt , 1),null)  as lead_loan_dt
,if(row_number<>1,neighbor(end_dt, -1),null)  as lag_end_dt
from 
(
SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
groupArray(loan_tot) AS loan_tot, 
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
)
ARRAY JOIN  
    loan_dt,
    loan_id, 
    end_date,
    due_days,
    row_number,     
    dense_rank,     
    uniq_rank
ORDER BY     
customer_id ASC,     
row_number ASC ,     
dense_rank ASC

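There is one problem: on the last row, the lead column shows the anomalous value 1970-01-01. When the offset goes past the end of the data, neighbor returns the type's default value, so neighbor(row_number, 1) is 0 and neighbor(loan_dt, 1) is 1970-01-01, and the <>1 check lets that row through. Changing the condition to >1 fixes it: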

SELECT 
customer_id
,loan_id
,loan_dt
,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
,due_days
,row_number as row_num
,dense_rank
,uniq_rank
,if(neighbor(row_num , 1)>1,neighbor(loan_dt , 1),null)  as lead_loan_dt
,if(row_num<>1,neighbor(end_dt, -1),null)  as lag_end_dt
from 
(
SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.ipeso_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
)
ARRAY JOIN  
    loan_dt,
    loan_id, 
    end_date,
    due_days,
    row_number,     
    dense_rank,     
    uniq_rank
ORDER BY     
customer_id ASC,     
row_number ASC ,     
dense_rank ASC
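
An alternative to the >1 check for the end-of-data case is to pass an explicit default to neighbor and turn it back into NULL, in the same spirit as the 2099-12-31 sentinel used for end_date. The sketch below is a small illustration on generated dates, not a drop-in replacement for the query above.

```sql
-- Illustrative data: three consecutive dates generated from numbers(3).
SELECT
    d,
    neighbor(d, 1) AS lead_plain,  -- last row falls back to 1970-01-01
    nullIf(neighbor(d, 1, toDate('2099-12-31')), toDate('2099-12-31')) AS lead_or_null  -- last row becomes NULL
FROM
(
    SELECT toDate('2024-01-01') + number AS d
    FROM numbers(3)
)
```

Note that the default only applies when the offset runs past the end of the data; at the boundary between two customers neighbor still returns the next customer's value, so the row number check remains necessary there.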

 

