Original source: ClickHouse的秘密基地 (chcave); author: 凱朱
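How can you write ClickHouse queries that behave like ROW_NUMBER OVER and DENSE_RANK OVER, which some other databases provide for rank-style ordering?
As before, the ClickHouse (CH) version used for this article does not directly provide the corresponding window functions (newer releases have since added native window functions), so they have to be emulated with a few special functions. The main ones are the following array functions: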
arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq
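Each of these functions takes an array as its input and returns an array of the same length that holds one number per element, for example: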
ch7.nauu.com :) SELECT arrayEnumerate([10,20,30,10,40]) AS row_number, arrayEnumerateDense([10,20,30,10,40]) AS dense_rank, arrayEnumerateUniq([10,20,30,10,40]) AS uniq_rank

SELECT
    arrayEnumerate([10, 20, 30, 10, 40]) AS row_number,
    arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank,
    arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank

┌─row_number──┬─dense_rank──┬─uniq_rank───┐
│ [1,2,3,4,5] │ [1,2,3,1,4] │ [1,1,1,2,1] │
└─────────────┴─────────────┴─────────────┘

1 rows in set. Elapsed: 0.005 sec.
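Readers who know window functions will see the correspondence right away:
arrayEnumerate has the same effect as ROW_NUMBER
arrayEnumerateDense has the same effect as DENSE_RANK when the array is already sorted (it numbers the distinct values in order of first appearance)
arrayEnumerateUniq is a special case: for each element it returns how many times that value has occurred so far, that is, the element's position among the elements with the same value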
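To make the three behaviors concrete, here is a minimal Python sketch that models them. It is not ClickHouse code: the function names array_enumerate, array_enumerate_dense and array_enumerate_uniq are invented for this illustration, and the logic is inferred from the sample output above.

```python
def array_enumerate(arr):
    """Model of arrayEnumerate: the 1-based position of each element."""
    return list(range(1, len(arr) + 1))


def array_enumerate_dense(arr):
    """Model of arrayEnumerateDense: the index of each distinct value,
    numbered in order of first appearance."""
    first_seen = {}
    result = []
    for x in arr:
        if x not in first_seen:
            first_seen[x] = len(first_seen) + 1
        result.append(first_seen[x])
    return result


def array_enumerate_uniq(arr):
    """Model of arrayEnumerateUniq: a running occurrence count per value."""
    counts = {}
    result = []
    for x in arr:
        counts[x] = counts.get(x, 0) + 1
        result.append(counts[x])
    return result


sample = [10, 20, 30, 10, 40]
print(array_enumerate(sample))        # [1, 2, 3, 4, 5]
print(array_enumerate_dense(sample))  # [1, 2, 3, 1, 4]
print(array_enumerate_uniq(sample))   # [1, 1, 1, 2, 1]
```

The second 10 in the sample shows the difference: the dense variant reuses index 1, while the uniq variant reports that this is the second occurrence of 10.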
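Now that we know what these functions do, I'll walk through a concrete example and build up the final query step by step.
First, prepare a test data set by creating a test table: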
CREATE TABLE test_data engine = Memory AS
WITH (
    SELECT ['A','A','A','A','B','B','B','B','B','A','59','90','80','80','65','75','78','88','99','70']
) AS dict
SELECT
    dict[number % 10 + 1] AS id,
    dict[number + 11] AS val
FROM system.numbers
LIMIT 10
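This is a typical score table (note that val is stored as a String, because it comes from an array of strings):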
ch7.nauu.com :) SELECT * FROM test_data

SELECT *
FROM test_data

┌─id─┬─val─┐
│ A  │ 59  │
│ A  │ 90  │
│ A  │ 80  │
│ A  │ 80  │
│ B  │ 65  │
│ B  │ 75  │
│ B  │ 78  │
│ B  │ 88  │
│ B  │ 99  │
│ A  │ 70  │
└────┴─────┘

10 rows in set. Elapsed: 0.002 sec.
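The CREATE TABLE statement packs both columns into one 20-element array and picks them apart by index. The Python sketch below mirrors that indexing to show how each row is derived. DICT and build_rows are names invented for this illustration; the only assumptions are that system.numbers yields 0, 1, 2, ... and that ClickHouse array indexes start at 1.

```python
# The 20-element array from the CREATE TABLE statement:
# entries 1-10 are the ids, entries 11-20 are the scores (kept as strings).
DICT = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A',
        '59', '90', '80', '80', '65', '75', '78', '88', '99', '70']


def build_rows(limit=10):
    """Reproduce `SELECT dict[number%10+1], dict[number+11] ... LIMIT 10`."""
    rows = []
    for number in range(limit):  # system.numbers yields 0, 1, 2, ...
        # ClickHouse arrays are 1-based and Python lists are 0-based,
        # so dict[k] in the SQL corresponds to DICT[k - 1] here.
        row_id = DICT[(number % 10 + 1) - 1]
        val = DICT[(number + 11) - 1]
        rows.append((row_id, val))
    return rows


for row in build_rows():
    print(row)
```

Row 0 therefore pairs the 1st entry ('A') with the 11th ('59'), and row 9 pairs the 10th ('A') with the 20th ('70'), which matches the table shown above.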
Our goal is to implement queries with the following semantics:
ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )
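That is, partition the rows by id, order them by val, and compute the rank. (UNIQ_RANK is not a standard window function; it is used here only as shorthand for what arrayEnumerateUniq computes.)
Step 1: sort by val, because the condition is ORDER BY val: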
SELECT * FROM test_data ORDER BY val
(All columns need to be returned, so * can be used here.)
Step 2: group by id, because the condition is PARTITION BY id:
SELECT id
FROM
(
    SELECT *
    FROM test_data
    ORDER BY val ASC
)
GROUP BY id

┌─id─┐
│ B  │
│ A  │
└────┘

2 rows in set. Elapsed: 0.006 sec.
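Step 3: compute the rank of val with the arrayEnumerate* functions introduced earlier. Because they take an array as input, first use groupArray to collect val into an array. This relies on groupArray receiving the rows in the order produced by the sorted subquery, which works for a small result like this one; ClickHouse does not guarantee that order in general.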
SELECT
    id,
    groupArray(val) AS arr_val,
    arrayEnumerate(arr_val) AS row_number,
    arrayEnumerateDense(arr_val) AS dense_rank,
    arrayEnumerateUniq(arr_val) AS uniq_rank
FROM
(
    SELECT *
    FROM test_data
    ORDER BY val ASC
)
GROUP BY id

┌─id─┬─arr_val────────────────────┬─row_number──┬─dense_rank──┬─uniq_rank───┐
│ B  │ ['65','75','78','88','99'] │ [1,2,3,4,5] │ [1,2,3,4,5] │ [1,1,1,1,1] │
│ A  │ ['59','70','80','80','90'] │ [1,2,3,4,5] │ [1,2,3,3,4] │ [1,1,1,2,1] │
└────┴────────────────────────────┴─────────────┴─────────────┴─────────────┘
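As you can see, every rank variant has already been computed at this point.
Step 4: expand the arrays. Use ARRAY JOIN to unfold the arrays back into rows, then sort by id and the rank columns: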
SELECT
    id,
    val,
    row_number,
    dense_rank,
    uniq_rank
FROM
(
    SELECT
        id,
        groupArray(val) AS arr_val,
        arrayEnumerate(arr_val) AS row_number,
        arrayEnumerateDense(arr_val) AS dense_rank,
        arrayEnumerateUniq(arr_val) AS uniq_rank
    FROM
    (
        SELECT *
        FROM test_data
        ORDER BY val ASC
    )
    GROUP BY id
)
ARRAY JOIN
    arr_val AS val,
    row_number,
    dense_rank,
    uniq_rank
ORDER BY
    id ASC,
    row_number ASC,
    dense_rank ASC

┌─id─┬─val─┬─row_number─┬─dense_rank─┬─uniq_rank─┐
│ A  │ 59  │          1 │          1 │         1 │
│ A  │ 70  │          2 │          2 │         1 │
│ A  │ 80  │          3 │          3 │         1 │
│ A  │ 80  │          4 │          3 │         2 │
│ A  │ 90  │          5 │          4 │         1 │
│ B  │ 65  │          1 │          1 │         1 │
│ B  │ 75  │          2 │          2 │         1 │
│ B  │ 78  │          3 │          3 │         1 │
│ B  │ 88  │          4 │          4 │         1 │
│ B  │ 99  │          5 │          5 │         1 │
└────┴─────┴────────────┴────────────┴───────────┘

10 rows in set. Elapsed: 0.004 sec.
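The four steps can also be traced outside the database. The following is a minimal Python sketch of the same pipeline, not ClickHouse code: ROWS and rank_rows are names invented for this illustration, the rows are copied from the test table, and val is compared as a string, as it is in the table.

```python
# Rows of test_data as (id, val) pairs, copied from the article.
ROWS = [
    ('A', '59'), ('A', '90'), ('A', '80'), ('A', '80'), ('B', '65'),
    ('B', '75'), ('B', '78'), ('B', '88'), ('B', '99'), ('A', '70'),
]


def rank_rows(rows):
    """Return (id, val, row_number, dense_rank, uniq_rank) tuples."""
    # Step 1: ORDER BY val (string comparison, since val is a String).
    ordered = sorted(rows, key=lambda r: r[1])

    # Steps 2 and 3: GROUP BY id, collecting val like groupArray(val).
    groups = {}
    for row_id, val in ordered:
        groups.setdefault(row_id, []).append(val)

    # Step 4: enumerate each array and unfold it back into rows,
    # ordered by id and then by position (like ARRAY JOIN + ORDER BY).
    result = []
    for row_id in sorted(groups):
        dense_index = {}  # value -> dense rank (arrayEnumerateDense)
        occurrences = {}  # value -> running count (arrayEnumerateUniq)
        for row_number, val in enumerate(groups[row_id], start=1):
            dense = dense_index.setdefault(val, len(dense_index) + 1)
            occurrences[val] = occurrences.get(val, 0) + 1
            result.append((row_id, val, row_number, dense, occurrences[val]))
    return result


for row in rank_rows(ROWS):
    print(row)
```

For id A the two '80' rows come out as (3, 3, 1) and (4, 3, 2): the row number keeps counting, the dense rank is shared, and the uniq rank marks the second occurrence.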
With that, the query is complete. We have implemented queries with the following three semantics:
ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )
What other questions can rank ordering answer?
Top N per group: for example, group by id and query the top 3 scores:
SELECT
    id,
    val,
    dense_rank
FROM
(
    SELECT
        id,
        val,
        dense_rank
    FROM
    (
        SELECT
            id,
            groupArray(val) AS arr_val,
            arrayEnumerateDense(arr_val) AS dense_rank
        FROM
        (
            SELECT DISTINCT
                val,
                id
            FROM test_data
            ORDER BY val DESC
        )
        GROUP BY id
    )
    ARRAY JOIN
        arr_val AS val,
        dense_rank
    ORDER BY
        id ASC,
        dense_rank ASC
)
WHERE dense_rank <= 3

┌─id─┬─val─┬─dense_rank─┐
│ A  │ 90  │          1 │
│ A  │ 80  │          2 │
│ A  │ 70  │          3 │
│ B  │ 99  │          1 │
│ B  │ 88  │          2 │
│ B  │ 78  │          3 │
└────┴─────┴────────────┘

6 rows in set. Elapsed: 0.008 sec.
Because val contains duplicate scores, DISTINCT is used here so that each score appears only once per id in the result.
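The top-N logic can be sketched in Python as well. This is an illustration under the same assumptions as the query, not ClickHouse code: ROWS and top_n_per_group are invented names, and the scores are compared as strings.

```python
# Rows of test_data as (id, val) pairs, copied from the article.
ROWS = [
    ('A', '59'), ('A', '90'), ('A', '80'), ('A', '80'), ('B', '65'),
    ('B', '75'), ('B', '78'), ('B', '88'), ('B', '99'), ('A', '70'),
]


def top_n_per_group(rows, n=3):
    """Return (id, val, dense_rank) for the n highest distinct scores per id."""
    # SELECT DISTINCT val, id ... ORDER BY val DESC
    distinct = sorted(set(rows), key=lambda r: r[1], reverse=True)

    # GROUP BY id with groupArray(val)
    groups = {}
    for row_id, val in distinct:
        groups.setdefault(row_id, []).append(val)

    result = []
    for row_id in sorted(groups):
        # After DISTINCT every value in a group is unique, so the dense
        # rank is simply the 1-based position in the sorted array.
        for dense_rank, val in enumerate(groups[row_id], start=1):
            if dense_rank <= n:  # WHERE dense_rank <= 3
                result.append((row_id, val, dense_rank))
    return result


for row in top_n_per_group(ROWS):
    print(row)
```

Without the deduplication step, the two '80' rows of id A would both be returned with the same dense rank.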
Rank of a given score for a given id: look up the rank of id = A, val = 70:
SELECT
    id,
    val,
    dense_rank
FROM
(
    SELECT
        id,
        val,
        dense_rank
    FROM
    (
        SELECT
            id,
            groupArray(val) AS arr_val,
            arrayEnumerateDense(arr_val) AS dense_rank
        FROM
        (
            SELECT DISTINCT
                val,
                id
            FROM test_data
            ORDER BY val DESC
        )
        GROUP BY id
    )
    ARRAY JOIN
        arr_val AS val,
        dense_rank
    ORDER BY
        id ASC,
        dense_rank ASC
)
WHERE id = 'A' AND val = '70'

┌─id─┬─val─┬─dense_rank─┐
│ A  │ 70  │          3 │
└────┴─────┴────────────┘

1 rows in set. Elapsed: 0.006 sec.
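A single-score lookup reduces to finding the position of the score among the distinct scores of its id. The Python sketch below shows that idea; ROWS and dense_rank_of are names invented for this illustration. Unlike the query, it filters by id before ranking, which gives the same answer because each id is ranked independently.

```python
# Rows of test_data as (id, val) pairs, copied from the article.
ROWS = [
    ('A', '59'), ('A', '90'), ('A', '80'), ('A', '80'), ('B', '65'),
    ('B', '75'), ('B', '78'), ('B', '88'), ('B', '99'), ('A', '70'),
]


def dense_rank_of(rows, target_id, target_val):
    """Dense rank of target_val within target_id, highest score first.

    Returns None when the id has no such score.
    """
    # Distinct scores of the requested id, sorted descending as strings.
    scores = sorted({val for row_id, val in rows if row_id == target_id},
                    reverse=True)
    if target_val not in scores:
        return None
    return scores.index(target_val) + 1


print(dense_rank_of(ROWS, 'A', '70'))  # 3
```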