ClickHouse join types explained
According to the comments in Join.h, ClickHouse supports 14 kinds of JOIN, listed below:
* JOIN-s could be of these types:
* - ALL × LEFT/INNER/RIGHT/FULL
* - ANY × LEFT/INNER/RIGHT
* - SEMI/ANTI x LEFT/RIGHT
* - ASOF x LEFT/INNER
* - CROSS
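The difference between ALL and ANY, as described in the official documentation:
ANY and ALL
With the ALL modifier, if the right table contains several rows that match a row of the left table, every matching right-table row is returned in the result. This is the same as standard SQL JOIN behavior.
With the ANY modifier, if the right table contains several rows that match a row of the left table, only the first matching row found is returned. If the rows of the left and right tables correspond one to one, with no extra rows, ANY and ALL give the same result.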
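The difference can be modeled outside ClickHouse with a few lines of Python. This is only a sketch of the semantics, not of the ClickHouse implementation: the function name inner_join and the tuple layout (a, b) are assumptions made for illustration, and the rows are the (a, b) values of the sample tables used in this article. Which of several matching rows ANY keeps is an implementation detail; the sketch simply keeps the first one in list order.

```python
from collections import defaultdict

# (a, b) pairs mirroring the sample tables of this article
left_t1 = [(1, 11), (2, 22), (3, 22)]
right_t1 = [(1, 111), (2, 222), (2, 2222)]


def inner_join(left, right, strictness):
    """Inner join on the first field; strictness is "ALL" or "ANY"."""
    by_key = defaultdict(list)
    for row in right:
        by_key[row[0]].append(row)

    result = []
    for lrow in left:
        matches = by_key.get(lrow[0], [])
        if strictness == "ANY":
            matches = matches[:1]  # keep a single match per left row
        for rrow in matches:
            result.append(lrow + rrow)
    return result


all_rows = inner_join(left_t1, right_t1, "ALL")
any_rows = inner_join(left_t1, right_t1, "ANY")
print(len(all_rows), len(any_rows))
```

With ALL the key a = 2 contributes two output rows, with ANY only one, so the two results differ by exactly one row.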
The following uses INNER JOIN to show the difference between ANY and ALL. First prepare the data:
1. Create the join_test database
create database join_test engine=Ordinary;
2. Create the left_t1 and right_t1 tables
create table left_t1(a UInt16,b UInt16,create_date date)Engine=MergeTree(create_date,a,8192);
create table right_t1(a UInt16,b UInt16,create_date date)Engine=MergeTree(create_date,a,8192);
3. Insert the data
insert into left_t1 values(1,11,2020-3-20);
insert into left_t1 values(2,22,2020-3-20);
insert into left_t1 values(3,22,2020-3-20);
insert into right_t1 values(1,111,2020-3-20);
insert into right_t1 values(2,222,2020-3-20);
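insert into right_t1 values(2,2222,2020-3-20);
Note that the date values are not quoted, so 2020-3-20 is evaluated as the arithmetic expression 2020 - 3 - 20 = 1997 and stored as day 1997 after 1970-01-01. That is why create_date appears as 1975-06-21 in every result below. To store the intended date, write it as a quoted string, '2020-03-20'.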
4. Compare how ANY and ALL affect the output of INNER JOIN
ALL INNER JOIN
select * from left_t1 all inner join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL INNER JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │ 2 │ 2222 │ 1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │ 2 │ 222 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │ 1 │ 111 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
3 rows in set. Elapsed: 0.019 sec.
The right table right_t1 has two rows that match the left table left_t1 on a = 2, and both are returned.
ANY INNER JOIN
select * from left_t1 any inner join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ANY INNER JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │ 2 │ 2222 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │ 1 │ 111 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
2 rows in set. Elapsed: 0.023 sec.
The right table right_t1 has two rows that match the left table left_t1 on a = 2, but only one of them is returned.
INNER JOIN
Inner join: joins every pair of records from left_t1 and right_t1 that satisfies left_t1.a=right_t1.a, as shown below:
select * from left_t1 all inner join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL INNER JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │ 1 │ 111 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │ 2 │ 2222 │ 1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │ 2 │ 222 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
3 rows in set. Elapsed: 0.134 sec.
LEFT JOIN
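Left join: on top of the inner join, records of left_t1 that have no matching record in right_t1 are also returned, with the right_t1 columns filled with NULL or 0, as shown below: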
select * from left_t1 all left join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL LEFT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │ 1975-06-21 │ 0 │ 0 │ 0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │ 1 │ 111 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │ 2 │ 2222 │ 1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │ 2 │ 222 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
4 rows in set. Elapsed: 0.013 sec.
RIGHT JOIN
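Right join: on top of the inner join, records of right_t1 that have no matching record in left_t1 are also returned, with the left_t1 columns filled with NULL or 0. With this data every right_t1 record has a match, so the result is the same as for the inner join, as shown below: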
select * from left_t1 all right join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL RIGHT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │ 2 │ 2222 │ 1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │ 2 │ 222 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │ 1 │ 111 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
3 rows in set. Elapsed: 0.021 sec.
FULL JOIN
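Full join: on top of the inner join, both the right_t1 records with no match in left_t1 and the left_t1 records with no match in right_t1 are returned, with the columns of the missing side filled with NULL or 0, as shown below: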
select * from left_t1 all full join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL FULL OUTER JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │ 2 │ 2222 │ 1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │ 2 │ 222 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │ 1 │ 111 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │ 1975-06-21 │ 0 │ 0 │ 0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
4 rows in set. Elapsed: 0.046 sec.
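The Join.h comment quoted later in this article describes FULL as "first process as LEFT, but track what rows from the right table was joined", then append the right rows that were never joined. A minimal Python sketch of that idea follows. The function name full_join, the (a, b) tuple layout and the default tuples are assumptions for illustration; this is not the ClickHouse code.

```python
def full_join(left, right, left_defaults, right_defaults):
    """Full outer join on the first field, tracking which right rows were used."""
    by_key = {}
    for i, row in enumerate(right):
        by_key.setdefault(row[0], []).append(i)

    used = [False] * len(right)
    result = []

    # LEFT part: every left row appears at least once.
    for lrow in left:
        indexes = by_key.get(lrow[0], [])
        if not indexes:
            result.append(lrow + right_defaults)
        for i in indexes:
            used[i] = True
            result.append(lrow + right[i])

    # Extra step for FULL (and RIGHT): right rows that were never joined.
    for i, rrow in enumerate(right):
        if not used[i]:
            result.append(left_defaults + rrow)
    return result


left_t1 = [(1, 11), (2, 22), (3, 22)]
right_t1 = [(1, 111), (2, 222), (2, 2222)]
rows = full_join(left_t1, right_t1, (0, 0), (0, 0))
print(len(rows))
```

On the article's data this gives four rows, the same count as the query above. A right row with a key that is absent from left_t1, for example a hypothetical (4, 444), would come out as (0, 0, 4, 444).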
SEMI LEFT JOIN and SEMI RIGHT JOIN, ANTI LEFT JOIN and ANTI RIGHT JOIN
Join.h explains them as follows:
* SEMI JOIN filter left table by keys that are present in right table for LEFT JOIN, and filter right table by keys from left table
* for RIGHT JOIN. In other words SEMI JOIN returns only rows which joining keys present in another table.
* ANTI JOIN is the same as SEMI JOIN but returns rows with joining keys that are NOT present in another table.
* SEMI/ANTI JOINs allow to get values from both tables. For filter table it gets any row with joining same key. For ANTI JOIN it returns
* defaults other table columns.
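In other words: with SEMI LEFT JOIN, the keys present in the right table are used to filter the left table, and a left-table row is output if its key also exists in the right table.
With SEMI RIGHT JOIN, the keys present in the left table are used to filter the right table, and a right-table row is output if its key also exists in the left table.
Put differently, SEMI JOIN returns the rows whose key exists in the other table.
ANTI JOIN is the opposite of SEMI JOIN: it returns the rows whose key does not exist in the other table.
Both SEMI JOIN and ANTI JOIN can return columns from both tables. For SEMI JOIN, the columns of the filtering table come from any one row with the same key. For ANTI JOIN, the columns of the other table are filled with default values such as NULL or 0.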
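The LEFT variants of the two filters can be sketched in Python as follows. The names semi_left, anti_left and RIGHT_DEFAULTS are made up for this illustration, rows are (a, b) tuples taken from the sample tables, and the sketch fills every right-side column with a default for ANTI, which is a simplification of what ClickHouse prints.

```python
left_t1 = [(1, 11), (2, 22), (3, 22)]
right_t1 = [(1, 111), (2, 222), (2, 2222)]

# Assumed defaults for the two UInt16 columns of the right table
RIGHT_DEFAULTS = (0, 0)


def semi_left(left, right):
    """Keep left rows whose key exists on the right; attach any one right row."""
    any_row = {}
    for row in right:
        any_row.setdefault(row[0], row)  # remember one row per key
    return [lrow + any_row[lrow[0]] for lrow in left if lrow[0] in any_row]


def anti_left(left, right):
    """Keep left rows whose key does NOT exist on the right; pad with defaults."""
    right_keys = {row[0] for row in right}
    return [lrow + RIGHT_DEFAULTS for lrow in left if lrow[0] not in right_keys]


semi_rows = semi_left(left_t1, right_t1)
anti_rows = anti_left(left_t1, right_t1)
print(len(semi_rows), len(anti_rows))
```

The SEMI result never contains more rows than the left table, however many duplicates the right table has, and the SEMI and ANTI results together cover each left row exactly once.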
select * from left_t1 semi left join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
SEMI LEFT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 2 │ 22 │ 1975-06-21 │ 2 │ 2222 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │ 1 │ 111 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
2 rows in set. Elapsed: 0.052 sec.
select * from left_t1 semi right join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
SEMI RIGHT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 1 │ 11 │ 1975-06-21 │ 1 │ 111 │ 1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │ 2 │ 222 │ 1975-06-21 │
│ 2 │ 22 │ 1975-06-21 │ 2 │ 2222 │ 1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
3 rows in set. Elapsed: 1.327 sec.
select * from left_t1 anti left join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ANTI LEFT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─right_t1.create_date─┐
│ 3 │ 22 │ 1975-06-21 │ 3 │ 0 │ 0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
1 rows in set. Elapsed: 0.061 sec.
select * from left_t1 anti right join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ANTI RIGHT JOIN right_t1 ON left_t1.a = right_t1.a
Ok.
0 rows in set. Elapsed: 0.024 sec.
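ASOF LEFT and ASOF INNER: I did not find the syntax for these at first and tried to identify them from the execution trace instead, but the command below does not reveal which join kind is chosen, most likely because a comma join with only a WHERE filter is handled as a cross join. In ClickHouse SQL these two kinds are written as ASOF JOIN (inner) and ASOF LEFT JOIN, with an ON clause that combines equality conditions with exactly one inequality condition on the closest-match column.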
clickhouse-client --send_logs_level=trace <<< 'select * from join_test.left_t1,join_test.right_t1 where join_test.left_t1.a<>1 and join_test.right_t1.a<>1' > /dev/null
Below is the description from Join.h:
* ASOF JOIN is not-equi join. For one key column it finds nearest value to join according to join inequality.
* It's expected that ANY|SEMI LEFT JOIN is more efficient that ALL one.
*
* If INNER is specified - leave only rows that have matching rows from "right" table.
* If LEFT is specified - in case when there is no matching row in "right" table, fill it with default values instead.
* If RIGHT is specified - first process as INNER, but track what rows from the right table was joined,
* and at the end, add rows from right table that was not joined and substitute default values for columns of left table.
* If FULL is specified - first process as LEFT, but track what rows from the right table was joined,
* and at the end, add rows from right table that was not joined and substitute default values for columns of left table.
*
* Thus, LEFT and RIGHT JOINs are not symmetric in terms of implementation.
*
* All JOINs (except CROSS) are done by equality condition on keys (equijoin).
* Non-equality and other conditions are not supported.
Only joins on equality conditions are supported; non-equality and other conditions are not.
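The comment above describes ASOF JOIN as finding, for one key column, the nearest value according to the join inequality. A Python sketch of an ASOF LEFT join under the condition left.t >= right.t is given here. Everything in it is hypothetical (the function asof_left_join, the (key, t, payload) row layout, the trades and quotes data) and it only models the matching rule, not how ClickHouse implements it.

```python
import bisect
from collections import defaultdict


def asof_left_join(left, right):
    """Rows are (key, t, payload). For each left row, pick the right row with
    the same key and the largest t that is <= the left t, or None if absent."""
    rows_by_key = defaultdict(list)
    for key, t, payload in right:
        rows_by_key[key].append((t, payload))

    times_by_key = {}
    for key, rows in rows_by_key.items():
        rows.sort()  # order by t so that binary search works
        times_by_key[key] = [t for t, _ in rows]

    result = []
    for key, t, payload in left:
        times = times_by_key.get(key, [])
        i = bisect.bisect_right(times, t)  # number of right rows with t <= left t
        match = rows_by_key[key][i - 1] if i > 0 else None
        result.append((key, t, payload, match))
    return result


trades = [("A", 10, "t1"), ("A", 3, "t2"), ("B", 7, "t3")]
quotes = [("A", 5, "q1"), ("A", 9, "q2"), ("A", 12, "q3")]
joined = asof_left_join(trades, quotes)
print(joined)
```

An ASOF INNER join would be the same list with the rows whose match is None dropped.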
*
* Implementation:
*
* 1. Build hash table in memory from "right" table.
* This hash table is in form of keys -> row in case of ANY or keys -> [rows...] in case of ALL.
* This is done in insertFromBlock method.
* The smaller table is normally used as the right table, and the hash table is built in memory from it. This is done in insertFromBlock.
* 2. Process "left" table and join corresponding rows from "right" table by lookups in the map.
* This is done in joinBlock methods.
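* The left table is scanned and the corresponding rows are joined by lookups in the in-memory map built from the right table. This is done in joinBlock.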
* In case of ANY LEFT JOIN - form new columns with found values or default values.
* This is the most simple. Number of rows in left table does not change.
* For ANY LEFT JOIN the number of left-table rows does not change; the new columns are filled with the matched values or with defaults.
* In case of ANY INNER JOIN - form new columns with found values,
* and also build a filter - in what rows nothing was found.
* Then filter columns of "left" table.
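* ANY INNER JOIN builds the new columns from the matched values, builds a filter marking the rows where nothing was found, and then applies the filter to the columns of the left table.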
* In case of ALL ... JOIN - form new columns with all found rows,
* and also fill 'offsets' array, describing how many times we need to replicate values of "left" table.
* Then replicate columns of "left" table.
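* ALL ... JOIN builds the new columns from all matched rows and fills the offsets array, which describes how many times each left-table value must be replicated; the columns of the left table are then replicated accordingly.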
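The two phases described above can be sketched in Python. This is a simplified model under stated assumptions, not the code in Join.h: the helper names build_map, probe_any_inner and probe_all_inner are invented, tables are reduced to lists of key values, and the offsets array is represented as cumulative output-row counts.

```python
from collections import defaultdict


def build_map(right_keys, strictness):
    """Phase 1: key -> row index for ANY, key -> [row indexes] for ALL."""
    if strictness == "ANY":
        table = {}
        for i, key in enumerate(right_keys):
            table.setdefault(key, i)  # keep the first row seen per key
        return table
    table = defaultdict(list)
    for i, key in enumerate(right_keys):
        table[key].append(i)
    return table


def probe_any_inner(left_keys, table):
    """Phase 2 for ANY INNER: a filter over left rows plus the matched right rows."""
    row_filter = [key in table for key in left_keys]
    picked = [table[key] for key in left_keys if key in table]
    return row_filter, picked


def probe_all_inner(left_keys, table):
    """Phase 2 for ALL INNER: cumulative offsets plus all matched right rows."""
    offsets, picked, total = [], [], 0
    for key in left_keys:
        rows = table.get(key, [])
        picked.extend(rows)
        total += len(rows)
        offsets.append(total)  # output rows produced up to and including this row
    return offsets, picked


left_keys = [1, 2, 3]   # column a of left_t1
right_keys = [1, 2, 2]  # column a of right_t1
any_filter, any_picked = probe_any_inner(left_keys, build_map(right_keys, "ANY"))
all_offsets, all_picked = probe_all_inner(left_keys, build_map(right_keys, "ALL"))
print(any_filter, any_picked, all_offsets, all_picked)
```

The filter removes left row a = 3 for ANY INNER, while the offsets say that left row a = 2 has to be replicated twice for ALL INNER.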
* How Nullable keys are processed:
* NULLs never join to anything, even to each other.
NULL never joins to any value, not even to another NULL.
* During building of map, we just skip keys with NULL value of any component.
While the hash table is built, any key that contains a NULL is skipped.
* During joining, we simply treat rows with any NULLs in key as non joined.
* During the join, rows with a NULL in the key are treated as not joined.
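A small Python sketch of this rule, with None standing in for NULL. The helper names build_map_skipping_nulls and probe and the row layout (key columns first) are assumptions for illustration only.

```python
def build_map_skipping_nulls(right_rows, key_len):
    """Build key -> [rows], skipping keys with a NULL (None) in any component."""
    table = {}
    for row in right_rows:
        key = row[:key_len]
        if any(part is None for part in key):
            continue  # a NULL key is never inserted into the map
        table.setdefault(key, []).append(row)
    return table


def probe(left_row, table, key_len):
    """Return matching right rows; a NULL in the left key means no match."""
    key = left_row[:key_len]
    if any(part is None for part in key):
        return []
    return table.get(key, [])


right_rows = [(1, "x"), (None, "y")]
table = build_map_skipping_nulls(right_rows, key_len=1)
print(probe((1, "l1"), table, 1), probe((None, "l2"), table, 1))
```

Because NULL keys are dropped on both sides, a left row with a NULL key can never meet the right row with a NULL key.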
* Default values for outer joins (LEFT, RIGHT, FULL):
* Behaviour is controlled by 'join_use_nulls' settings.
* If it is false, we substitute (global) default value for the data type, for non-joined rows
* (zero, empty string, etc. and NULL for Nullable data types).
* If it is true, we always generate Nullable column and substitute NULLs for non-joined rows,
* as in standard SQL.
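There are two cases: when join_use_nulls is false, the columns of non-joined rows are filled with the default value of the data type; when join_use_nulls is true, they are filled with NULL.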
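The two cases can be illustrated with a short Python sketch. The mapping TYPE_DEFAULTS and the function fill_non_joined are hypothetical; only the defaults "zero" and "empty string" come from the Join.h comment, and None stands in for NULL.

```python
# Assumed per-type defaults, following "zero, empty string, etc." in Join.h
TYPE_DEFAULTS = {"UInt16": 0, "String": ""}


def fill_non_joined(column_types, join_use_nulls):
    """Values substituted for the missing side of a non-joined row."""
    if join_use_nulls:
        # Columns become Nullable and non-joined rows get NULL, as in standard SQL.
        return tuple(None for _ in column_types)
    return tuple(TYPE_DEFAULTS[t] for t in column_types)


with_defaults = fill_non_joined(("UInt16", "String"), join_use_nulls=False)
with_nulls = fill_non_joined(("UInt16", "String"), join_use_nulls=True)
print(with_defaults, with_nulls)
```

With join_use_nulls set to false, a 0 in an outer-join result is ambiguous: it can be a real value or a placeholder for a missing row, which is what happens with the right_t1 columns in the LEFT JOIN output earlier in this article.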
ANTI RIGHT JOIN