Ranking and year-over-year / month-over-month comparisons in ClickHouse: old versus new versions


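I. Preparing the test data

First, create the test table. The first ten elements of the dict array are the ids and the last ten are the values, so val is stored as a String:

CREATE TABLE test_data ENGINE = Memory AS
WITH (
    SELECT ['A','A','A','A','B','B','B','B','B','A','59','90','80','80','65','75','78','88','99','70']
) AS dict
SELECT
    dict[number % 10 + 1] AS id,
    dict[number + 11] AS val
FROM system.numbers
LIMIT 10;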

Check the data:

SELECT * FROM test_data;

Query id: 88a6a086-6400-4aeb-9997-2baa7fd2a320

┌─id─┬─val─┐
│ A  │ 59  │
│ A  │ 90  │
│ A  │ 80  │
│ A  │ 80  │
│ B  │ 65  │
│ B  │ 75  │
│ B  │ 78  │
│ B  │ 88  │
│ B  │ 99  │
│ A  │ 70  │
└────┴─────┘

 

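II. Old-version implementation (versions before 21, not including 21)

The version used here is 20.9.3.45.

1) Ranking

These ClickHouse versions do not provide window functions directly, so ranking has to be emulated with other functions. The main building blocks are the following array functions: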

arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq

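Each of these functions takes an array as input and returns an array of the same length. arrayEnumerate returns the positions 1..N, arrayEnumerateDense numbers the distinct values in order of first appearance, and arrayEnumerateUniq gives, for each element, how many times that value has occurred so far. For example: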

SELECT    
 arrayEnumerate([10, 20, 30, 10, 40]) AS row_number,     
 arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank,    
 arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank
 
┌─row_number──┬─dense_rank──┬─uniq_rank───┐
│ [1,2,3,4,5] │ [1,2,3,1,4] │ [1,1,1,2,1] │
└─────────────┴─────────────┴─────────────┘
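
The indexes above only behave like ranks when the input array is already sorted. As a minimal sketch (the literal array is made up for illustration), sorting first with arraySort shows the effect:

```sql
-- Illustrative values only; arraySort puts equal values next to each other,
-- so arrayEnumerateDense then behaves like a dense rank.
SELECT
    arraySort([80, 59, 90, 80, 70]) AS sorted,          -- [59,70,80,80,90]
    arrayEnumerate(sorted)          AS row_number,      -- [1,2,3,4,5]
    arrayEnumerateDense(sorted)     AS dense_rank,      -- [1,2,3,3,4]
    arrayEnumerateUniq(sorted)      AS uniq_rank;       -- [1,1,1,2,1]
```

This is why the per-group query further down sorts the values before collecting them into an array.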


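The goal is to emulate the following window queries. Once the values are sorted within each group, the array functions map to them as shown. UNIQ_RANK is not a standard SQL function; it is used here only as a label for the output of arrayEnumerateUniq, which numbers repeated occurrences of the same value:

ROW_NUMBER() OVER (PARTITION BY id ORDER BY val) ==> arrayEnumerate
DENSE_RANK() OVER (PARTITION BY id ORDER BY val) ==> arrayEnumerateDense
UNIQ_RANK() OVER (PARTITION BY id ORDER BY val) ==> arrayEnumerateUniq

The query is as follows: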

SELECT
    id,
    val,
    row_number,
    dense_rank,
    uniq_rank
FROM
(
    SELECT
        id,
        groupArray(val) AS arr_val,
        arrayEnumerate(arr_val) AS row_number,
        arrayEnumerateDense(arr_val) AS dense_rank,
        arrayEnumerateUniq(arr_val) AS uniq_rank
    FROM
    (
        SELECT *
        FROM test_data
        ORDER BY val ASC
    )
    GROUP BY id
)
ARRAY JOIN
    arr_val AS val,
    row_number,
    dense_rank,
    uniq_rank
ORDER BY
    id ASC,
    row_number ASC,
    dense_rank ASC

Query id: ae812342-2d60-4d6e-9e4e-4a5f97e0670f

┌─id─┬─val─┬─row_number─┬─dense_rank─┬─uniq_rank─┐
│ A  │ 59  │          1 │          1 │         1 │
│ A  │ 70  │          2 │          2 │         1 │
│ A  │ 80  │          3 │          3 │         1 │
│ A  │ 80  │          4 │          3 │         2 │
│ A  │ 90  │          5 │          4 │         1 │
│ B  │ 65  │          1 │          1 │         1 │
│ B  │ 75  │          2 │          2 │         1 │
│ B  │ 78  │          3 │          3 │         1 │
│ B  │ 88  │          4 │          4 │         1 │
│ B  │ 99  │          5 │          5 │         1 │
└────┴─────┴────────────┴────────────┴───────────┘

10 rows in set. Elapsed: 0.005 sec. 
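
The query above relies on the ORDER BY in the innermost subquery being preserved by groupArray, which usually holds but is not something the aggregation promises. A possible variant, sketched here under the assumption that the same test_data table exists, sorts inside the array with arraySort instead. It also adds a rank with gaps via indexOf; the column name rnk is chosen for this example only. Note that val is a String, so the sort is lexicographic, which is fine for these two-digit values.

```sql
-- Sketch: sort inside the array instead of relying on subquery order.
SELECT id, val, row_number, dense_rank, uniq_rank, rnk
FROM
(
    SELECT
        id,
        arraySort(groupArray(val))                   AS arr_val,
        arrayEnumerate(arr_val)                      AS row_number,
        arrayEnumerateDense(arr_val)                 AS dense_rank,
        arrayEnumerateUniq(arr_val)                  AS uniq_rank,
        -- position of the first equal value in the sorted array = rank with gaps
        arrayMap(x -> indexOf(arr_val, x), arr_val)  AS rnk
    FROM test_data
    GROUP BY id
)
ARRAY JOIN
    arr_val AS val,
    row_number,
    dense_rank,
    uniq_rank,
    rnk
ORDER BY
    id ASC,
    row_number ASC;
```

For id A the sorted values are 59, 70, 80, 80, 90, so rnk should come out as 1, 2, 3, 3, 5, matching what RANK() would give.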

 

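2) Year-over-year / month-over-month comparison (lag/lead)

neighbor(column, offset[, default_value])

The result of the function depends on the affected data blocks and on the order of the data within each block.

If you run a subquery with ORDER BY and call the function from outside the subquery, you get the expected result.

Parameters

column: a column name or scalar expression.

offset: the number of rows forward or backward from the current row of column. Int64.

default_value: optional. The value returned if the offset goes beyond the block boundaries. It has the type of the affected data blocks.

Returned values

The value of column at offset distance from the current row, if the offset stays within the block boundaries.

The default value for column, if the offset goes beyond the block boundaries. If default_value is given, it is used instead.

Type: the type of the affected data blocks or the type of the default value.

Implementation using neighbor: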

WITH toDate('2019-01-01') AS start_date
SELECT
    toStartOfMonth(start_date + (number * 32)) AS date_time,
    (number + 1) * 100 AS money,
    neighbor(money, -12) AS prev_year,
    neighbor(money, -1) AS prev_month
FROM numbers(16)

Query id: 228b6b8b-a4a8-42e5-a976-3cd67f8587cd

┌──date_time─┬─money─┬─prev_year─┬─prev_month─┐
│ 2019-01-01 │   100 │         0 │          0 │
│ 2019-02-01 │   200 │         0 │        100 │
│ 2019-03-01 │   300 │         0 │        200 │
│ 2019-04-01 │   400 │         0 │        300 │
│ 2019-05-01 │   500 │         0 │        400 │
│ 2019-06-01 │   600 │         0 │        500 │
│ 2019-07-01 │   700 │         0 │        600 │
│ 2019-08-01 │   800 │         0 │        700 │
│ 2019-09-01 │   900 │         0 │        800 │
│ 2019-10-01 │  1000 │         0 │        900 │
│ 2019-11-01 │  1100 │         0 │       1000 │
│ 2019-12-01 │  1200 │         0 │       1100 │
│ 2020-01-01 │  1300 │       100 │       1200 │
│ 2020-02-01 │  1400 │       200 │       1300 │
│ 2020-03-01 │  1500 │       300 │       1400 │
│ 2020-04-01 │  1600 │       400 │       1500 │
└────────────┴───────┴───────────┴────────────┘

16 rows in set. Elapsed: 0.003 sec. 
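
The previous-period value is usually only an intermediate step toward a growth rate. The following sketch, using the same generated data, sorts in a subquery and calls neighbor from the outer query as the function description recommends. The column name mom_growth and the choice to return NULL when there is no previous month are assumptions made for this example.

```sql
-- Sketch: month-over-month growth rate on top of neighbor().
SELECT
    date_time,
    money,
    neighbor(money, -1) AS prev_month,
    -- no previous row (or a zero base) yields NULL instead of a division by zero
    if(prev_month = 0, NULL, round((money - prev_month) / prev_month, 4)) AS mom_growth
FROM
(
    SELECT
        toStartOfMonth(toDate('2019-01-01') + number * 32) AS date_time,
        (number + 1) * 100 AS money
    FROM numbers(16)
    ORDER BY date_time ASC   -- neighbor() depends on row order, so sort first
);
```

The same pattern with an offset of -12 would give a year-over-year rate.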


Extended use case: count the active IPs that, within the past 10 minutes, appeared in at least 3 consecutive minutes.

SELECT
    count(DISTINCT s_ip)
FROM
(
    SELECT
        s_ip
    FROM
    (
        SELECT
            s_ip,
            dt2
        FROM
        (
            SELECT
                s_ip,
                groupArray(dt) AS dt_arr,
                arrayEnumerate(dt_arr) AS dt_arr_enum,
                -- timestamps one minute apart collapse to the same value
                arrayMap((x, y) -> (x - y * 60), dt_arr, dt_arr_enum) AS map_arr
            FROM
            (
                SELECT s_ip, found_time AS dt
                FROM s_ip_group_view2
                WHERE found_time >= 1620583215 AND found_time <= 1620583915
                ORDER BY s_ip, dt
            )
            GROUP BY s_ip
        )
        ARRAY JOIN map_arr AS dt2
    )
    GROUP BY
        s_ip,
        dt2
    HAVING count() >= 3
);
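
The query above groups consecutive minutes by subtracting position times 60 from each timestamp. A small sketch on a made-up literal array shows why that works; it assumes one timestamp per minute, spaced exactly 60 seconds apart, which the real found_time column may or may not satisfy.

```sql
-- Illustrative timestamps (seconds): three consecutive minutes, a gap, then two more.
WITH [600, 660, 720, 900, 960] AS dt_arr
SELECT
    arrayEnumerate(dt_arr)                       AS idx,        -- [1,2,3,4,5]
    arrayMap((x, y) -> x - y * 60, dt_arr, idx)  AS group_key;  -- [540,540,540,660,660]
```

Rows that share a group_key form one unbroken run, so counting rows per (s_ip, group_key) and keeping counts of 3 or more finds the IPs with at least three consecutive minutes. If the raw timestamps are not aligned to the minute, they would need to be truncated to minute precision first.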

 

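III. New-version implementation (version 21 and later)

1) Ranking and related functions

    rank() OVER (): ties share a rank and leave gaps, e.g. 1 2 2 4 5
    dense_rank() OVER (): ties share a rank without gaps, e.g. 1 2 2 3 4
    row_number() OVER (): ties still get consecutive numbers, e.g. 1 2 3 4 5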

SELECT
    id,
    val,
    rank() OVER w AS rank,
    dense_rank() OVER w AS dense_rank,
    row_number() OVER w AS row_number,
    count(*) OVER w AS count,
    sum(toInt32(val)) OVER w AS sum_v,
    avg(toInt32(val)) OVER w AS avg_v,
    max(toInt32(val)) OVER w AS max_v
FROM test_data
WINDOW  w AS (PARTITION BY  id ORDER BY  val ASC range unbounded preceding)
ORDER BY id ASC
SETTINGS allow_experimental_window_functions = 1;

┌─id─┬─val─┬─rank─┬─dense_rank─┬─row_number─┬─count─┬─sum_v─┬─────────────avg_v─┬─max_v─┐
│ A  │ 59  │    1 │          1 │          1 │     1 │    59 │                59 │    59 │
│ A  │ 70  │    2 │          2 │          2 │     2 │   129 │              64.5 │    70 │
│ A  │ 80  │    3 │          3 │          3 │     4 │   289 │             72.25 │    80 │
│ A  │ 80  │    3 │          3 │          4 │     4 │   289 │             72.25 │    80 │
│ A  │ 90  │    5 │          4 │          5 │     5 │   379 │              75.8 │    90 │
│ B  │ 65  │    1 │          1 │          1 │     1 │    65 │                65 │    65 │
│ B  │ 75  │    2 │          2 │          2 │     2 │   140 │                70 │    75 │
│ B  │ 78  │    3 │          3 │          3 │     3 │   218 │ 72.66666666666667 │    78 │
│ B  │ 88  │    4 │          4 │          4 │     4 │   306 │              76.5 │    88 │
│ B  │ 99  │    5 │          5 │          5 │     5 │   405 │                81 │    99 │
└────┴─────┴──────┴────────────┴────────────┴───────┴───────┴───────────────────┴───────┘
 

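As the output shows, ClickHouse now natively supports:

the analytic functions rank(), dense_rank(), and row_number();

the OVER() window clause, including the partitioning clause PARTITION BY, the ordering clause ORDER BY, and the frame clause RANGE / ROWS.

Because the default frame is RANGE, the following two window definitions are equivalent: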

PARTITION BY  id ORDER BY  val ASC range unbounded preceding

--

PARTITION BY  id ORDER BY  val ASC

SELECT
    id,
    val,
    rank() OVER w AS rank,
    dense_rank() OVER w AS dense_rank,
    row_number() OVER w AS row_number,
    count(*) OVER w AS count,
    sum(toInt32(val)) OVER w AS sum_v,
    avg(toInt32(val)) OVER w AS avg_v,
    max(toInt32(val)) OVER w AS max_v
FROM test_data
WINDOW  w AS (PARTITION BY  id ORDER BY  val ASC)
ORDER BY id ASC
SETTINGS allow_experimental_window_functions = 1;

┌─id─┬─val─┬─rank─┬─dense_rank─┬─row_number─┬─count─┬─sum_v─┬─────────────avg_v─┬─max_v─┐
│ A  │ 59  │    1 │          1 │          1 │     1 │    59 │                59 │    59 │
│ A  │ 70  │    2 │          2 │          2 │     2 │   129 │              64.5 │    70 │
│ A  │ 80  │    3 │          3 │          3 │     4 │   289 │             72.25 │    80 │
│ A  │ 80  │    3 │          3 │          4 │     4 │   289 │             72.25 │    80 │
│ A  │ 90  │    5 │          4 │          5 │     5 │   379 │              75.8 │    90 │
│ B  │ 65  │    1 │          1 │          1 │     1 │    65 │                65 │    65 │
│ B  │ 75  │    2 │          2 │          2 │     2 │   140 │                70 │    75 │
│ B  │ 78  │    3 │          3 │          3 │     3 │   218 │ 72.66666666666667 │    78 │
│ B  │ 88  │    4 │          4 │          4 │     4 │   306 │              76.5 │    88 │
│ B  │ 99  │    5 │          5 │          5 │     5 │   405 │                81 │    99 │
└────┴─────┴──────┴────────────┴────────────┴───────┴───────┴───────────────────┴───────┘
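
In the output above, both rows with val 80 show count 4 and sum 289. That is a consequence of the RANGE frame, which treats rows with equal ORDER BY values as peers and includes all of them. The following sketch, assuming the same test_data table, puts a RANGE frame and a ROWS frame side by side; the column names count_range and count_rows are chosen for this example.

```sql
-- Sketch: RANGE includes all peer rows (ties), ROWS counts physical rows only.
SELECT
    id,
    val,
    count(*) OVER (PARTITION BY id ORDER BY val ASC
                   RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS count_range,
    count(*) OVER (PARTITION BY id ORDER BY val ASC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS count_rows
FROM test_data
ORDER BY id ASC, val ASC
SETTINGS allow_experimental_window_functions = 1;
```

For id A, count_range should be 4 on both rows with val 80, while count_rows should be 3 on one and 4 on the other. A ROWS frame is therefore the better fit when a true running total is wanted.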

 

Extended use case: count the active IPs that, within the past 10 minutes, appeared in at least 3 consecutive minutes.

SELECT
    count( DISTINCT s_ip ) 
FROM
    (
    SELECT
        s_ip,
        dt - rank * 60  flat_date   
    FROM                            
        (                          
        SELECT
            s_ip,
            found_time dt,
            rank() over ( PARTITION BY s_ip ORDER BY dt ) rank 
        FROM
            s_ip_group_view2 
        WHERE
            found_time >= 1620583215 
            AND found_time <= 1620583915 
        ORDER BY
            s_ip,
            dt 
        ) 
    GROUP BY
        s_ip,
        flat_date 
    HAVING
    count()>= 3 
    );

 

 

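2) Year-over-year / month-over-month comparison

In the new versions, lead/lag functions were still not implemented at the time of writing, but the same effect can be achieved through the frame clause of a window function: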

SELECT
    date_time,
    money,
    -- order by the time column, so the offset means "N months earlier"
    any(money) OVER (ORDER BY date_time ASC ROWS BETWEEN 12 PRECEDING AND 12 PRECEDING) AS prev_year,
    any(money) OVER (ORDER BY date_time ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev_month
FROM
(
    WITH toDate('2019-01-01') AS start_date
    SELECT
        toStartOfMonth(start_date + (number * 32)) AS date_time,
        (number + 1) * 100 AS money
    FROM numbers(16)
)
SETTINGS allow_experimental_window_functions = 1

Query id: 12ca2353-cb6e-4218-be1f-85ef666577ec

┌──date_time─┬─money─┬─prev_year─┬─prev_month─┐
│ 2019-01-01 │   100 │         0 │          0 │
│ 2019-02-01 │   200 │         0 │        100 │
│ 2019-03-01 │   300 │         0 │        200 │
│ 2019-04-01 │   400 │         0 │        300 │
│ 2019-05-01 │   500 │         0 │        400 │
│ 2019-06-01 │   600 │         0 │        500 │
│ 2019-07-01 │   700 │         0 │        600 │
│ 2019-08-01 │   800 │         0 │        700 │
│ 2019-09-01 │   900 │         0 │        800 │
│ 2019-10-01 │  1000 │         0 │        900 │
│ 2019-11-01 │  1100 │         0 │       1000 │
│ 2019-12-01 │  1200 │         0 │       1100 │
│ 2020-01-01 │  1300 │       100 │       1200 │
│ 2020-02-01 │  1400 │       200 │       1300 │
│ 2020-03-01 │  1500 │       300 │       1400 │
│ 2020-04-01 │  1600 │       400 │       1500 │
└────────────┴───────┴───────────┴────────────┘

 

-- Using the frame clause with ROWS instead of RANGE, the general pattern is:
any(value) OVER (... ROWS BETWEEN <offset> PRECEDING AND <offset> PRECEDING), or FOLLOWING instead of PRECEDING for lead
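
For completeness, the lead direction of the pattern can be sketched the same way on the generated data. The column name next_month is an assumption for this example; as with the PRECEDING form, a row with no matching neighbor should come back as the type's default value, 0 here.

```sql
-- Sketch: lead emulated with a one-row FOLLOWING frame.
SELECT
    date_time,
    money,
    any(money) OVER (ORDER BY date_time ASC
                     ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS next_month
FROM
(
    SELECT
        toStartOfMonth(toDate('2019-01-01') + number * 32) AS date_time,
        (number + 1) * 100 AS money
    FROM numbers(16)
)
ORDER BY date_time ASC
SETTINGS allow_experimental_window_functions = 1;
```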

 

