Implementing ROW_NUMBER
How can queries equivalent to ROW_NUMBER OVER and DENSE_RANK OVER be implemented in ClickHouse? In some other databases these are available for rank-style ordering.
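ClickHouse (CH) did not directly provide the corresponding window functions when this was written (newer releases support them natively), so the same effect has to be achieved indirectly with a few special functions. The main ones are the following array functions: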
arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq
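Each of these functions takes an array as its input argument and returns an array of the same length holding a position or rank for every element, for example: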
SELECT
    arrayEnumerate([10, 20, 30, 10, 40]) AS row_number,
    arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank,
    arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank

┌─row_number──┬─dense_rank──┬─uniq_rank───┐
│ [1,2,3,4,5] │ [1,2,3,1,4] │ [1,1,1,2,1] │
└─────────────┴─────────────┴─────────────┘
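To make the difference between the three functions concrete, here is a minimal pure-Python model of their behaviour. It is only a sketch for reasoning about the output shown above, not ClickHouse code; the helper names `array_enumerate`, `array_enumerate_dense` and `array_enumerate_uniq` are made up for this illustration.

```python
def array_enumerate(arr):
    """Model of arrayEnumerate: the 1-based position of each element."""
    return list(range(1, len(arr) + 1))


def array_enumerate_dense(arr):
    """Model of arrayEnumerateDense: number each distinct value
    in order of its first appearance."""
    first_seen = {}
    out = []
    for x in arr:
        if x not in first_seen:
            first_seen[x] = len(first_seen) + 1
        out.append(first_seen[x])
    return out


def array_enumerate_uniq(arr):
    """Model of arrayEnumerateUniq: how many times the value has
    occurred so far, counting the current element."""
    counts = {}
    out = []
    for x in arr:
        counts[x] = counts.get(x, 0) + 1
        out.append(counts[x])
    return out


values = [10, 20, 30, 10, 40]
print(array_enumerate(values))        # [1, 2, 3, 4, 5]
print(array_enumerate_dense(values))  # [1, 2, 3, 1, 4]
print(array_enumerate_uniq(values))   # [1, 1, 1, 2, 1]
```

Note that the dense numbering follows the order of first appearance, not the size of the values, so it only behaves like DENSE_RANK when the array is already sorted by the ranking key.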
Data format:
Our goal is to implement the following window queries:
ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerate
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateDense
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateUniq
The code is as follows:
SELECT
    customer_id,
    groupArray(loan_dt) AS loan_dt,
    groupArray(ifnull(end_date, toDate('2099-12-31'))) AS end_date,
    groupArray(due_days) AS due_days,
    groupArray(loan_id) AS loan_id,
    arrayEnumerate(loan_id) AS row_number,
    arrayEnumerateDense(loan_id) AS dense_rank,
    arrayEnumerateUniq(loan_id) AS uniq_rank
FROM
(
    SELECT *
    FROM res_report.xxx_loan
    ORDER BY loan_dt, loan_id
)
GROUP BY customer_id
Array expansion: use ARRAY JOIN to expand the arrays back into rows, then sort the result by customer_id and the generated row_number:
SELECT
    customer_id,
    loan_id,
    loan_dt,
    IF(end_date = toDate('2099-12-31'), null, end_date) AS end_dt,
    due_days,
    row_number,
    dense_rank,
    uniq_rank
FROM
(
    SELECT
        customer_id,
        groupArray(loan_dt) AS loan_dt,
        groupArray(ifnull(end_date, toDate('2099-12-31'))) AS end_date,
        groupArray(due_days) AS due_days,
        groupArray(loan_id) AS loan_id,
        arrayEnumerate(loan_id) AS row_number,
        arrayEnumerateDense(loan_id) AS dense_rank,
        arrayEnumerateUniq(loan_id) AS uniq_rank
    FROM
    (
        SELECT *
        FROM res_report.xxx_loan
        ORDER BY loan_dt, loan_id
    )
    GROUP BY customer_id
)
ARRAY JOIN loan_dt, loan_id, end_date, due_days, row_number, dense_rank, uniq_rank
ORDER BY customer_id ASC, row_number ASC, dense_rank ASC
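The pipeline above (sort, collect per customer, enumerate, expand again) can be sketched in plain Python to show why it yields a per-partition row number. This is an illustrative model only: the sample rows and the function name `row_number_per_customer` are assumptions invented for the example, not data from the real table.

```python
# Hypothetical sample rows: (customer_id, loan_id, loan_dt as ISO string)
rows = [
    ("c1", 102, "2020-03-01"),
    ("c2", 201, "2020-01-15"),
    ("c1", 101, "2020-01-01"),
    ("c1", 103, "2020-05-01"),
]


def row_number_per_customer(rows):
    # Inner subquery: ORDER BY loan_dt, loan_id
    ordered = sorted(rows, key=lambda r: (r[2], r[1]))

    # GROUP BY customer_id with groupArray: one list per customer,
    # elements kept in the order produced by the inner sort
    groups = {}
    for customer_id, loan_id, loan_dt in ordered:
        groups.setdefault(customer_id, []).append((loan_id, loan_dt))

    # arrayEnumerate + ARRAY JOIN: number the elements of each list
    # from 1 and emit one output row per element
    result = []
    for customer_id in sorted(groups):
        for rn, (loan_id, loan_dt) in enumerate(groups[customer_id], start=1):
            result.append((customer_id, loan_id, loan_dt, rn))
    return result


for row in row_number_per_customer(rows):
    print(row)
# ('c1', 101, '2020-01-01', 1)
# ('c1', 102, '2020-03-01', 2)
# ('c1', 103, '2020-05-01', 3)
# ('c2', 201, '2020-01-15', 1)
```

The numbering restarts at 1 for each customer because each customer has its own array, which is what PARTITION BY customer_id would do in a window query.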
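Tip: end_date can be NULL, and groupArray skips NULL values, so the end_date array ends up shorter than the other arrays and ARRAY JOIN raises an error. Fill the NULLs with a special sentinel value before aggregating and convert it back to NULL at the very end.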
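The sentinel trick can be illustrated with a small Python model. It assumes, as the queries here do, that 2099-12-31 never occurs as a real end date; `group_array` is a made-up helper that mimics an aggregate which drops NULLs, and the sample values are hypothetical.

```python
from datetime import date

# Sentinel used by the queries in this post; assumed never to be a real end date
SENTINEL = date(2099, 12, 31)


def group_array(values):
    """Model of an aggregate that collects values but skips NULL (None)."""
    return [v for v in values if v is not None]


# Hypothetical loans of one customer; the second loan is still open
loan_ids = [101, 102, 103]
end_dates = [date(2020, 2, 1), None, date(2020, 6, 1)]

# Without the fill, the two arrays no longer line up
naive = group_array(end_dates)
print(len(group_array(loan_ids)), len(naive))  # 3 2

# Fill before aggregating, restore after expanding
filled = group_array([SENTINEL if d is None else d for d in end_dates])
restored = [None if d == SENTINEL else d for d in filled]
print(len(filled))            # 3
print(restored == end_dates)  # True
```

The obvious limitation is that a genuine end date equal to the sentinel would also be turned back into NULL, so the sentinel has to lie outside the range of real data.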
Implementing lag/lead:
neighbor(column, offset[, default_value])

The result of the function depends on the affected data blocks and the order of data in the block. If you make a subquery with ORDER BY and call the function from outside the subquery, you can get the expected result.

Parameters
column: A column name or scalar expression.
offset: The number of rows forwards or backwards from the current row of column. Int64.
default_value: Optional. The value to be returned if offset goes beyond the scope of the block. Type of data blocks affected.

Returned values
Value for column in offset distance from current row if offset value is not outside block bounds.
Default value for column if offset value is outside block bounds. If default_value is given, then it will be used.
Type: type of data blocks affected or default value type.

Reference: https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/
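As a rough mental model of the description above, neighbor can be pictured as a shifted read of the column within a single block. The Python sketch below makes two simplifying assumptions: the whole column is one block, and the default is always passed explicitly (in ClickHouse, omitting default_value yields the type's default, such as 0 for numbers).

```python
def neighbor(column, offset, default):
    """Model of neighbor() over one block: for each row, the value
    `offset` rows away, or `default` when that row is out of bounds."""
    n = len(column)
    return [
        column[i + offset] if 0 <= i + offset < n else default
        for i in range(n)
    ]


column = [10, 20, 30]
print(neighbor(column, 1, 0))   # lead by one row: [20, 30, 0]
print(neighbor(column, -1, 0))  # lag by one row:  [0, 10, 20]
```

A positive offset therefore plays the role of lead and a negative offset the role of lag, but without any notion of partitions: the shift runs across the whole block.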
The code is as follows:
SELECT
    customer_id,
    loan_id,
    loan_dt,
    IF(end_date = toDate('2099-12-31'), null, end_date) AS end_dt,
    due_days,
    row_number,
    dense_rank,
    uniq_rank,
    if(neighbor(row_number, 1) <> 1, neighbor(loan_dt, 1), null) AS lead_loan_dt,
    if(row_number <> 1, neighbor(end_dt, -1), null) AS lag_end_dt
FROM
(
    SELECT
        customer_id,
        groupArray(loan_dt) AS loan_dt,
        groupArray(ifnull(end_date, toDate('2099-12-31'))) AS end_date,
        groupArray(due_days) AS due_days,
        groupArray(loan_id) AS loan_id,
        groupArray(loan_tot) AS loan_tot,
        arrayEnumerate(loan_id) AS row_number,
        arrayEnumerateDense(loan_id) AS dense_rank,
        arrayEnumerateUniq(loan_id) AS uniq_rank
    FROM
    (
        SELECT *
        FROM res_report.xxx_loan
        ORDER BY loan_dt, loan_id
    )
    GROUP BY customer_id
)
ARRAY JOIN loan_dt, loan_id, end_date, due_days, row_number, dense_rank, uniq_rank
ORDER BY customer_id ASC, row_number ASC, dense_rank ASC
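One problem shows up: on the very last row, the lead column returns the anomalous date 1970-01-01. Past the end of the block, neighbor(row_number, 1) returns the default value 0, which still satisfies <> 1, so the default date from neighbor(loan_dt, 1) leaks through. Testing for > 1 instead fixes this: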
SELECT
    customer_id,
    loan_id,
    loan_dt,
    IF(end_date = toDate('2099-12-31'), null, end_date) AS end_dt,
    due_days,
    row_number AS row_num,
    dense_rank,
    uniq_rank,
    if(neighbor(row_num, 1) > 1, neighbor(loan_dt, 1), null) AS lead_loan_dt,
    if(row_num <> 1, neighbor(end_dt, -1), null) AS lag_end_dt
FROM
(
    SELECT
        customer_id,
        groupArray(loan_dt) AS loan_dt,
        groupArray(ifnull(end_date, toDate('2099-12-31'))) AS end_date,
        groupArray(due_days) AS due_days,
        groupArray(loan_id) AS loan_id,
        arrayEnumerate(loan_id) AS row_number,
        arrayEnumerateDense(loan_id) AS dense_rank,
        arrayEnumerateUniq(loan_id) AS uniq_rank
    FROM
    (
        SELECT *
        FROM res_report.ipeso_loan
        ORDER BY loan_dt, loan_id
    )
    GROUP BY customer_id
)
ARRAY JOIN loan_dt, loan_id, end_date, due_days, row_number, dense_rank, uniq_rank
ORDER BY customer_id ASC, row_number ASC, dense_rank ASC
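The difference between the `<> 1` and `> 1` guards can be traced with a small Python model. It is a sketch under stated assumptions: the data is hypothetical (customer c1 with three loans followed by customer c2 with one), the rows form a single block, and the `neighbor` helper is a simplified stand-in with explicit defaults of 0 and 1970-01-01.

```python
from datetime import date

EPOCH = date(1970, 1, 1)  # default value of a Date column


def neighbor(column, offset, default):
    """Simplified single-block model of neighbor()."""
    n = len(column)
    return [
        column[i + offset] if 0 <= i + offset < n else default
        for i in range(n)
    ]


# Hypothetical expanded rows, already ordered by customer and row number
row_num = [1, 2, 3, 1]
loan_dt = [date(2020, 1, 1), date(2020, 3, 1), date(2020, 5, 1), date(2020, 1, 15)]

next_rn = neighbor(row_num, 1, 0)      # row number of the following row: [2, 3, 1, 0]
next_dt = neighbor(loan_dt, 1, EPOCH)  # loan date of the following row

# First attempt: "next row is not a partition start" tested with <> 1
lead_buggy = [d if rn != 1 else None for rn, d in zip(next_rn, next_dt)]

# Corrected guard: > 1 also rejects the default 0 after the last row
lead_fixed = [d if rn > 1 else None for rn, d in zip(next_rn, next_dt)]

print(lead_buggy[-1])  # 1970-01-01
print(lead_fixed[-1])  # None
```

In both versions the third row correctly gets no lead value, because the following row starts a new customer; only the corrected guard also blanks out the final row.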