An Overview of the JOIN Types Supported by ClickHouse

This article is a repost. 2020-04-16 09:03

According to the comments in the source file Join.h, ClickHouse supports 14 types of JOIN, listed below:

* JOIN-s could be of these types:
* - ALL × LEFT/INNER/RIGHT/FULL
* - ANY × LEFT/INNER/RIGHT
* - SEMI/ANTI x LEFT/RIGHT
* - ASOF x LEFT/INNER
* - CROSS

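The official documentation describes the difference between ALL and ANY as follows:

ANY and ALL

When a JOIN is qualified with ALL and the right table contains several rows that match a row of the left table, all of those matching right-table rows are returned in the result. This is the same as standard SQL JOIN behavior.
When a JOIN is qualified with ANY and the right table contains several rows that match a row of the left table, only the first match found is returned. If every left-table row has at most one matching row in the right table, ANY and ALL produce the same result.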

The difference between ANY and ALL is illustrated here with INNER JOIN. First, prepare the data:

1. Create the join_test database

create database join_test engine=Ordinary;

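2. Create the left_t1 and right_t1 tables (the statements use the legacy MergeTree(date column, primary key, index granularity) engine syntax)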

create table left_t1(a UInt16,b UInt16,create_date date)Engine=MergeTree(create_date,a,8192);

create table right_t1(a UInt16,b UInt16,create_date date)Engine=MergeTree(create_date,a,8192);

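3. Insert data (note that the date values are not quoted, so the stored date is not 2020-03-20; every result below shows 1975-06-21 instead. Write '2020-03-20' to store the intended date.)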

insert into left_t1 values(1,11,2020-3-20);

insert into left_t1 values(2,22,2020-3-20);

insert into left_t1 values(3,22,2020-3-20);

insert into right_t1 values(1,111,2020-3-20);

insert into right_t1 values(2,222,2020-3-20);

insert into right_t1 values(2,2222,2020-3-20);
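
The result sets further down show 1975-06-21 rather than 2020-03-20. A likely explanation is that the unquoted value is evaluated as the integer expression 2020 - 3 - 20 and the result is then taken as a day number. The Python sketch below only checks that arithmetic; it assumes a Date counts days from 1970-01-01 and runs nothing against ClickHouse.

```python
from datetime import date, timedelta

# An unquoted 2020-3-20 is integer arithmetic, not a date literal.
day_number = 2020 - 3 - 20

# Assumption: a Date value is a count of days since 1970-01-01.
stored = date(1970, 1, 1) + timedelta(days=day_number)

print(day_number, stored)
```

The computed date matches the create_date values in the transcripts, which supports this reading.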

4. Compare how ANY and ALL affect the output of INNER JOIN

ALL INNER JOIN

select * from left_t1 all inner join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL INNER JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

3 rows in set. Elapsed: 0.019 sec.

The right table right_t1 contains two rows that match the left_t1 row with a = 2, and both are returned.

ANY INNER JOIN

select * from left_t1 any inner join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ANY INNER JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │          2 │       2222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

2 rows in set. Elapsed: 0.023 sec.

The right table right_t1 contains two rows that match the left_t1 row with a = 2, but only one of them is returned.
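
The same difference can be mimicked outside ClickHouse. The following Python sketch is an illustration only, not ClickHouse code: it joins the (a, b) columns of the sample rows on a, and inner_join and its strictness argument are names made up for this example. Which of several matching rows ANY keeps is an assumption here (the first one seen); the transcript above shows ClickHouse returning a different one (2222).

```python
from collections import defaultdict

# (a, b) columns of the sample tables; create_date is left out for brevity.
left_t1 = [(1, 11), (2, 22), (3, 22)]
right_t1 = [(1, 111), (2, 222), (2, 2222)]


def inner_join(left, right, strictness):
    """Inner-join two row lists on their first column.

    strictness is "ALL" (keep every match) or "ANY" (keep one match).
    """
    matches = defaultdict(list)
    for row in right:
        matches[row[0]].append(row)

    result = []
    for row in left:
        found = matches.get(row[0], [])
        if strictness == "ANY":
            found = found[:1]  # assumption: keep the first match seen
        result.extend(row + match for match in found)
    return result


print(len(inner_join(left_t1, right_t1, "ALL")))  # 3
print(len(inner_join(left_t1, right_t1, "ANY")))  # 2
```

The row counts agree with the two transcripts: three rows for ALL, two for ANY.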

INNER JOIN

An inner join connects every pair of records from left_t1 and right_t1 that satisfies left_t1.a = right_t1.a, as shown below:

select * from left_t1 all inner join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL INNER JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

3 rows in set. Elapsed: 0.134 sec.

LEFT JOIN

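A left join extends the inner join: records of left_t1 that have no matching record in right_t1 are also returned, with the right_t1 columns filled with NULL or 0, as shown below: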

select * from left_t1 all left join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL LEFT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │ 1975-06-21 │          0 │          0 │           0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

4 rows in set. Elapsed: 0.013 sec.

RIGHT JOIN

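A right join extends the inner join: records of right_t1 that have no matching record in left_t1 are also returned, with the left_t1 columns filled with NULL or 0. In the sample data every right_t1 row has a match, so the result equals the inner join, as shown below: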

select * from left_t1 all right join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL RIGHT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

3 rows in set. Elapsed: 0.021 sec.

FULL JOIN

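A full join extends the inner join: records of right_t1 with no match in left_t1 and records of left_t1 with no match in right_t1 are all returned, with the columns of the missing side filled with NULL or 0, as shown below: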

select * from left_t1 all full join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ALL FULL OUTER JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │          2 │       2222 │           1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │          2 │        222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │ 1975-06-21 │          0 │          0 │           0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

4 rows in set. Elapsed: 0.046 sec.
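
The three outer variants differ only in which unmatched rows they keep. The Python sketch below models that on the (a, b) columns of the sample rows. It is an illustration under assumptions: outer_join and DEFAULTS are hypothetical names, and unmatched columns are filled with 0, as in the transcripts above.

```python
left_t1 = [(1, 11), (2, 22), (3, 22)]
right_t1 = [(1, 111), (2, 222), (2, 2222)]
DEFAULTS = (0, 0)  # assumed default values for two UInt16 columns


def outer_join(left, right, kind):
    """kind is "LEFT", "RIGHT" or "FULL"; rows are joined on their first column."""
    result = []
    matched_right = set()
    for left_row in left:
        hits = [i for i, right_row in enumerate(right) if right_row[0] == left_row[0]]
        matched_right.update(hits)
        if hits:
            result.extend(left_row + right[i] for i in hits)
        elif kind in ("LEFT", "FULL"):
            result.append(left_row + DEFAULTS)  # unmatched left row
    if kind in ("RIGHT", "FULL"):
        for i, right_row in enumerate(right):
            if i not in matched_right:
                result.append(DEFAULTS + right_row)  # unmatched right row
    return result


for kind in ("LEFT", "RIGHT", "FULL"):
    print(kind, len(outer_join(left_t1, right_t1, kind)))
```

On the sample data this yields 4, 3 and 4 rows, the same counts as the LEFT, RIGHT and FULL transcripts.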

SEMI LEFT JOIN and SEMI RIGHT JOIN, ANTI LEFT JOIN and ANTI RIGHT JOIN

Join.h explains them as follows:

* SEMI JOIN filter left table by keys that are present in right table for LEFT JOIN, and filter right table by keys from left table
* for RIGHT JOIN. In other words SEMI JOIN returns only rows which joining keys present in another table.
* ANTI JOIN is the same as SEMI JOIN but returns rows with joining keys that are NOT present in another table.
* SEMI/ANTI JOINs allow to get values from both tables. For filter table it gets any row with joining same key. For ANTI JOIN it returns
* defaults other table columns.

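That is: with SEMI LEFT JOIN, the left table is filtered by the keys present in the right table; a left-table row is output if its key also exists in the right table.

With SEMI RIGHT JOIN, the right table is filtered by the keys present in the left table; a right-table row is output if its key also exists in the left table.

In other words, SEMI JOIN returns the rows whose key exists in the other table.

ANTI JOIN is the opposite of SEMI JOIN: it returns the rows whose key does not exist in the other table.

Both SEMI JOIN and ANTI JOIN can return columns from both tables. From the table used as the filter, any one row with the same key is taken. For ANTI JOIN, the columns of the other table are filled with default values, such as NULL or 0.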

select * from left_t1 semi left join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
SEMI LEFT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │          2 │       2222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │          1 │        111 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

2 rows in set. Elapsed: 0.052 sec.

select * from left_t1 semi right join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
SEMI RIGHT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │          1 │        111 │           1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │          2 │        222 │           1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │          2 │       2222 │           1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

3 rows in set. Elapsed: 1.327 sec.

select * from left_t1 anti left join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ANTI LEFT JOIN right_t1 ON left_t1.a = right_t1.a

┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │ 1975-06-21 │          3 │          0 │           0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘

1 rows in set. Elapsed: 0.061 sec.

select * from left_t1 anti right join right_t1 on left_t1.a=right_t1.a;

SELECT *
FROM left_t1
ANTI RIGHT JOIN right_t1 ON left_t1.a = right_t1.a

Ok.

0 rows in set. Elapsed: 0.024 sec.
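
The four results above come down to filtering one table by the keys of the other. The Python sketch below models only that filtering on the (a, b) columns of the sample rows and leaves out the columns attached from the other table. semi_join and anti_join are hypothetical helper names, not ClickHouse functions.

```python
left_t1 = [(1, 11), (2, 22), (3, 22)]
right_t1 = [(1, 111), (2, 222), (2, 2222)]


def semi_join(rows, other):
    """Keep the rows whose key (first column) is present in `other`."""
    keys = {row[0] for row in other}
    return [row for row in rows if row[0] in keys]


def anti_join(rows, other):
    """Keep the rows whose key (first column) is NOT present in `other`."""
    keys = {row[0] for row in other}
    return [row for row in rows if row[0] not in keys]


print(semi_join(left_t1, right_t1))   # models SEMI LEFT JOIN
print(semi_join(right_t1, left_t1))   # models SEMI RIGHT JOIN
print(anti_join(left_t1, right_t1))   # models ANTI LEFT JOIN
print(anti_join(right_t1, left_t1))   # models ANTI RIGHT JOIN
```

The row counts are 2, 3, 1 and 0, matching the four transcripts. The LEFT variants filter the left table and the RIGHT variants filter the right table, which is why SEMI RIGHT JOIN returns both right_t1 rows with a = 2.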

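ASOF LEFT and ASOF INNER: I did not find a specific syntax for these two. I tried to find out from the trace log which join type gets chosen, but the approach below did not show it, so for now I do not know how to reach the code paths for these two types. (The query below is a comma join filtered only by inequalities, so it would not be expected to exercise ASOF. ClickHouse takes the strictness from an explicit keyword in the query, for example ASOF LEFT JOIN with an ON clause that combines an equality on the key with one inequality on the closest-match column.)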

clickhouse-client --send_logs_level=trace <<< 'select * from join_test.left_t1,join_test.right_t1 where join_test.left_t1.a<>1 and join_test.right_t1.a<>1' > /dev/null
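
An ASOF join pairs each left row with the nearest right row for the same key rather than with an exactly equal one. The Python sketch below is only a model of that idea. The function asof_join, the (key, ts, payload) row layout and the sample rows are hypothetical, and the direction of the inequality (right ts <= left ts) is an assumption chosen for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(left, right, kind="LEFT"):
    """Rows are (key, ts, payload). For each left row, pick the right row with
    the same key and the greatest ts that is <= the left ts."""
    by_key = defaultdict(list)
    for key, ts, payload in right:
        by_key[key].append((ts, payload))
    for rows in by_key.values():
        rows.sort(key=lambda item: item[0])

    result = []
    for key, ts, payload in left:
        rows = by_key.get(key, [])
        pos = bisect_right([item[0] for item in rows], ts)
        if pos:
            result.append((key, ts, payload) + rows[pos - 1])
        elif kind == "LEFT":
            result.append((key, ts, payload, None, None))  # no match: placeholders
    return result


# Hypothetical sample rows.
trades = [("A", 10, "t1"), ("A", 3, "t2"), ("B", 5, "t3")]
quotes = [("A", 5, "q1"), ("A", 9, "q2")]
print(asof_join(trades, quotes, "LEFT"))
print(asof_join(trades, quotes, "INNER"))
```

With kind="INNER" the rows without a match are dropped; with kind="LEFT" they are kept with placeholder values.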

The description in Join.h is as follows:

* ASOF JOIN is not-equi join. For one key column it finds nearest value to join according to join inequality.
* It's expected that ANY|SEMI LEFT JOIN is more efficient that ALL one.
*
* If INNER is specified - leave only rows that have matching rows from "right" table.
* If LEFT is specified - in case when there is no matching row in "right" table, fill it with default values instead.
* If RIGHT is specified - first process as INNER, but track what rows from the right table was joined,
* and at the end, add rows from right table that was not joined and substitute default values for columns of left table.
* If FULL is specified - first process as LEFT, but track what rows from the right table was joined,
* and at the end, add rows from right table that was not joined and substitute default values for columns of left table.
*
* Thus, LEFT and RIGHT JOINs are not symmetric in terms of implementation.
*
* All JOINs (except CROSS) are done by equality condition on keys (equijoin).
* Non-equality and other conditions are not supported.

In other words, apart from CROSS, only equi-joins are supported; joins on inequality or other conditions are not.

* Implementation:
*
* 1. Build hash table in memory from "right" table.
* This hash table is in form of keys -> row in case of ANY or keys -> [rows...] in case of ALL.
* This is done in insertFromBlock method.
* 2. Process "left" table and join corresponding rows from "right" table by lookups in the map.
* This is done in joinBlock methods.
* In case of ANY LEFT JOIN - form new columns with found values or default values.
* This is the most simple. Number of rows in left table does not change.
* In case of ANY INNER JOIN - form new columns with found values,
* and also build a filter - in what rows nothing was found.
* Then filter columns of "left" table.
* In case of ALL ... JOIN - form new columns with all found rows,
* and also fill 'offsets' array, describing how many times we need to replicate values of "left" table.
* Then replicate columns of "left" table.

In short, the smaller table is normally used as the right table: a hash table is built from it in memory (insertFromBlock), and the left table is then scanned, joining the matching rows through lookups in that map (joinBlock). ANY LEFT JOIN leaves the number of left-table rows unchanged and fills the new columns with the matched values or with defaults. ANY INNER JOIN builds the new columns from the matched values, builds a filter marking the rows without a match, and then filters the left table with it. ALL ... JOIN gathers all matched rows into the new columns and fills the offsets array, which describes how many times each left-table value must be replicated, then replicates the left-table columns.
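
The build and probe steps described in the Join.h comment can be sketched on plain Python lists. This is a simplified model, not the ClickHouse implementation: the function names are made up for this example, columns are Python lists, the filter is modelled as a keep mask (True where a match was found), and offsets are assumed to be cumulative end positions.

```python
def build(right_keys, strictness):
    """Step 1: map each key to a right-row index (ANY) or to a list of indexes (ALL)."""
    table = {}
    for i, key in enumerate(right_keys):
        if strictness == "ANY":
            table.setdefault(key, i)           # keys -> row
        else:
            table.setdefault(key, []).append(i)  # keys -> [rows...]
    return table


def probe_any_inner(left_keys, table):
    """Step 2, ANY INNER: a keep mask for the left rows plus the matched right rows."""
    keep = [key in table for key in left_keys]
    right_idx = [table[key] for key in left_keys if key in table]
    return keep, right_idx


def probe_all(left_keys, table):
    """Step 2, ALL: cumulative offsets plus every matched right row."""
    offsets, right_idx, total = [], [], 0
    for key in left_keys:
        hits = table.get(key, [])
        right_idx.extend(hits)
        total += len(hits)
        offsets.append(total)
    return offsets, right_idx


def replicate(column, offsets):
    """Repeat each left value as many times as its offset range is long."""
    out, prev = [], 0
    for value, end in zip(column, offsets):
        out.extend([value] * (end - prev))
        prev = end
    return out


left_a, right_a = [1, 2, 3], [1, 2, 2]
offsets, right_idx = probe_all(left_a, build(right_a, "ALL"))
print(offsets, right_idx, replicate(left_a, offsets))
print(probe_any_inner(left_a, build(right_a, "ANY")))
```

For the sample keys, the ALL probe replicates the left value 2 twice and drops 3, while the ANY INNER probe keeps the first two left rows and pairs each with a single right row.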
* How Nullable keys are processed:
* NULLs never join to anything, even to each other.
* During building of map, we just skip keys with NULL value of any component.
* During joining, we simply treat rows with any NULLs in key as non joined.

That is, a NULL never joins with any value, not even with another NULL: keys containing a NULL are skipped while the hash table is built, and rows whose key contains a NULL are treated as not joined during the join.
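
The NULL rule can be modelled by representing NULL as Python's None. The sketch below is an illustration under that assumption; build_skipping_nulls and lookup are hypothetical names, and a key is taken to be the first key_len columns of a row.

```python
def build_skipping_nulls(right_rows, key_len):
    """Build keys -> [rows...], skipping any key that has a NULL (None) component."""
    table = {}
    for row in right_rows:
        key = row[:key_len]
        if any(part is None for part in key):
            continue  # a key with a NULL component never enters the map
        table.setdefault(key, []).append(row)
    return table


def lookup(table, key):
    """Return the matching right rows; a key containing NULL matches nothing."""
    if any(part is None for part in key):
        return []  # treated as non-joined
    return table.get(key, [])


right_rows = [(1, None, "x"), (1, 2, "y"), (None, None, "z")]
table = build_skipping_nulls(right_rows, key_len=2)
print(table)
print(lookup(table, (None, None)), lookup(table, (1, 2)))
```

Only the row with the fully non-NULL key (1, 2) is reachable, and a left key of (None, None) finds nothing even though the right table also contains a (None, None) row.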
* Default values for outer joins (LEFT, RIGHT, FULL):
* Behaviour is controlled by 'join_use_nulls' settings.
* If it is false, we substitute (global) default value for the data type, for non-joined rows
* (zero, empty string, etc. and NULL for Nullable data types).
* If it is true, we always generate Nullable column and substitute NULLs for non-joined rows,
* as in standard SQL.

The behavior is controlled by the join_use_nulls setting, and there are two cases: when join_use_nulls is false, non-joined rows are filled with the default value of the data type; when join_use_nulls is true, they are filled with NULL.
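
The two cases can be written down as a small function. The Python sketch below is a model of the rule quoted above, not ClickHouse code: fill_unmatched and TYPE_DEFAULTS are hypothetical names, NULL is represented by None, and only two non-Nullable types are listed as examples.

```python
# Assumed per-type defaults for the example (zero, empty string).
TYPE_DEFAULTS = {"UInt16": 0, "String": ""}


def fill_unmatched(column_types, join_use_nulls):
    """Return the values used for the columns of a non-joined row."""
    if join_use_nulls:
        # Every column becomes Nullable and is filled with NULL.
        return [None for _ in column_types]
    # Otherwise use the type default; Nullable types default to NULL.
    return [
        None if type_name.startswith("Nullable(") else TYPE_DEFAULTS[type_name]
        for type_name in column_types
    ]


types = ["UInt16", "String", "Nullable(UInt16)"]
print(fill_unmatched(types, join_use_nulls=False))
print(fill_unmatched(types, join_use_nulls=True))
```

The zeros in the LEFT JOIN and FULL JOIN transcripts earlier correspond to the first case, where unmatched UInt16 columns receive their type default.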

