Differences between LEFT SEMI JOIN and INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN in Hive SQL

Reposted article, 2020-12-10 16:15. Tag: hive

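Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL query facility; SQL statements are converted into MapReduce jobs for execution.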

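SQL join queries come in five forms: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, and LEFT SEMI JOIN. All of them match rows of two tables on a join condition; they differ in which unmatched rows, and which columns, appear in the result.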

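(1) The most important point first: the difference between UNION and FULL JOIN ... ON. FULL JOIN ... ON combines columns, while UNION combines rows:

1) With an ON condition, select * over a FULL JOIN concatenates the columns of the two tables (a left table with m columns and p rows, a right table with n columns and q rows) into a result table with m+n columns.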

select * from table1 full join table2 on(table1.student_no=table2.student_no);

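2) UNION, by contrast, concatenates the rows of two query results (a left result with m columns and p rows, a right result with n columns and q rows). UNION removes duplicate rows, so the result has at most p+q rows; UNION ALL keeps duplicates and returns exactly p+q rows.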

select student_no tb1_student_no,student_name from table1 union select student_no as tb2_student_no,class_no from table2;

Note that the left and right query results must have the same number of columns (m = n); otherwise Hive reports the following error:

hive> select student_no tb1_student_no,student_name from table1 union select class_no from table2;
FAILED: SemanticException Schema of both sides of union should match.
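
The column-versus-row distinction can be checked outside Hive. The following is a minimal sketch that uses Python's built-in sqlite3 module instead of Hive, loaded with the sample rows this article uses for table1 and table2; the helper name make_db is hypothetical. It uses LEFT JOIN rather than FULL JOIN only because older SQLite builds lack FULL JOIN (an assumption about the reader's environment); the m+n column layout is the same for every join type.

```python
import sqlite3

# Sample rows from this article's test tables.
TABLE1 = [(1, "name1"), (2, "name2"), (3, "name3"),
          (4, "name4"), (5, "name5"), (6, "name6")]
TABLE2 = [(1, 11), (1, 12), (1, 13), (2, 11), (2, 14), (3, 12), (3, 15),
          (4, 12), (4, 13), (5, 14), (5, 16), (7, 13), (7, 15)]


def make_db():
    """Hypothetical helper: build an in-memory SQLite copy of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (student_no INTEGER, student_name TEXT)")
    conn.execute("CREATE TABLE table2 (student_no INTEGER, class_no INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", TABLE1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", TABLE2)
    return conn


conn = make_db()

# A join concatenates columns: 2 + 2 = 4 columns in the result.
join_cur = conn.execute(
    "SELECT * FROM table1 LEFT JOIN table2 "
    "ON table1.student_no = table2.student_no")
join_columns = len(join_cur.description)

# UNION stacks rows (removing duplicates) and keeps the column count at 2.
union_cur = conn.execute(
    "SELECT student_no, student_name FROM table1 "
    "UNION SELECT student_no, class_no FROM table2")
union_columns = len(union_cur.description)
union_rows = union_cur.fetchall()

# A UNION with mismatched column counts is rejected; this is SQLite's
# counterpart of the Hive schema error, not the Hive message itself.
try:
    conn.execute(
        "SELECT student_no, student_name FROM table1 "
        "UNION SELECT class_no FROM table2")
    mismatch_rejected = False
except sqlite3.Error:
    mismatch_rejected = True

print(join_columns, union_columns, len(union_rows), mismatch_rejected)
```

With this data no duplicate rows exist across the two queries, so the UNION keeps all 6 + 13 rows.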

(2) INNER JOIN is the simplest join.

1) With an ON condition, only the rows whose keys match on both sides are kept (the intersection).

select * from table1 join table2 on table1.student_no=table2.student_no ;

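2) Cartesian product: without an ON condition, every row of the left table is paired with every row of the right table. Both statements below compute the Cartesian product: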

select * from table1 join table2;

select * from table1 inner join table2;

For example, if table1 has m rows and table2 has n rows, the result has m*n rows.
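
The m*n row count can be verified with a small sketch. It runs on Python's built-in sqlite3 module, not on Hive, and uses the sample rows that this article later loads into table1 and table2; the helper name make_db is hypothetical.

```python
import sqlite3

# Sample rows from this article's test tables.
TABLE1 = [(1, "name1"), (2, "name2"), (3, "name3"),
          (4, "name4"), (5, "name5"), (6, "name6")]
TABLE2 = [(1, 11), (1, 12), (1, 13), (2, 11), (2, 14), (3, 12), (3, 15),
          (4, 12), (4, 13), (5, 14), (5, 16), (7, 13), (7, 15)]


def make_db():
    """Hypothetical helper: build an in-memory SQLite copy of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (student_no INTEGER, student_name TEXT)")
    conn.execute("CREATE TABLE table2 (student_no INTEGER, class_no INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", TABLE1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", TABLE2)
    return conn


conn = make_db()

# No ON condition: every left row is paired with every right row.
cross_rows = conn.execute("SELECT * FROM table1 JOIN table2").fetchall()

# With an ON condition only the pairs with equal keys remain.
inner_rows = conn.execute(
    "SELECT * FROM table1 JOIN table2 "
    "ON table1.student_no = table2.student_no").fetchall()

print(len(cross_rows))  # 6 * 13 = 78
print(len(inner_rows))  # 11
```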

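(3) OUTER JOIN comes in three forms: LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN.

LEFT OUTER JOIN is driven by the left table: for keys that do not exist in the right table, the right table's columns are set to NULL.

RIGHT OUTER JOIN is driven by the right table: for keys that do not exist in the left table, the left table's columns are set to NULL.

FULL OUTER JOIN joins both tables in full: its result is the union of the LEFT OUTER JOIN and RIGHT OUTER JOIN result sets, and columns from either side may be NULL. With an ON condition this is not a Cartesian product of the two tables.

If FULL JOIN is used without an ON condition, however, the result is again the Cartesian product: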

select * from table1 a full join table2 b ;
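
The three outer joins can be modeled in a few lines of plain Python. This is only an illustrative sketch of the semantics described above, not how Hive executes them; it assumes non-empty tables whose first column is the join key and contains no NULL keys, and all function names are hypothetical.

```python
# Sample rows from this article's test tables; column 0 is student_no.
TABLE1 = [(1, "name1"), (2, "name2"), (3, "name3"),
          (4, "name4"), (5, "name5"), (6, "name6")]
TABLE2 = [(1, 11), (1, 12), (1, 13), (2, 11), (2, 14), (3, 12), (3, 15),
          (4, 12), (4, 13), (5, 14), (5, 16), (7, 13), (7, 15)]


def inner_join(left, right):
    """Pair each left row with every right row that has the same key."""
    return [lrow + rrow for lrow in left for rrow in right
            if lrow[0] == rrow[0]]


def unmatched(rows, other):
    """Rows whose key does not appear in the other table."""
    other_keys = {row[0] for row in other}
    return [row for row in rows if row[0] not in other_keys]


def left_outer_join(left, right):
    # Matched pairs, plus left-only rows padded with NULL (None) on the right.
    pad = (None,) * len(right[0])
    return inner_join(left, right) + [r + pad for r in unmatched(left, right)]


def right_outer_join(left, right):
    # Matched pairs, plus right-only rows padded with NULL (None) on the left.
    pad = (None,) * len(left[0])
    return inner_join(left, right) + [pad + r for r in unmatched(right, left)]


def full_outer_join(left, right):
    # Matched pairs once, plus the unmatched rows of both sides.
    left_pad = (None,) * len(left[0])
    right_pad = (None,) * len(right[0])
    return (inner_join(left, right)
            + [r + right_pad for r in unmatched(left, right)]
            + [left_pad + r for r in unmatched(right, left)])


left_rows = left_outer_join(TABLE1, TABLE2)
right_rows = right_outer_join(TABLE1, TABLE2)
full_rows = full_outer_join(TABLE1, TABLE2)

print(len(left_rows), len(right_rows), len(full_rows))  # 12 13 14
```

For this data the full outer join equals the set union of the left and right outer join results: 11 matched rows, 1 left-only row (student 6), and 2 right-only rows (student 7).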

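(4) LEFT SEMI JOIN

LEFT SEMI JOIN (often just called a semi join) is mainly used in place of IN/EXISTS subqueries; it is a more efficient implementation of them.

Note that Hive 2.1.1 supports subqueries with the IN and NOT IN keywords. Both of the following statements are valid: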

SELECT * FROM TABLE1 WHERE table1.student_no NOT IN (SELECT table2.student_no FROM TABLE2);

SELECT * FROM TABLE1 WHERE table1.student_no IN (SELECT table2.student_no FROM TABLE2);
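
What these two subqueries return for the article's sample data can be previewed with a short sketch. It runs on Python's built-in sqlite3 module rather than Hive, so it illustrates the standard SQL semantics only; the helper name make_db is hypothetical, and the correlated EXISTS form is included purely for comparison.

```python
import sqlite3

# Sample rows from this article's test tables.
TABLE1 = [(1, "name1"), (2, "name2"), (3, "name3"),
          (4, "name4"), (5, "name5"), (6, "name6")]
TABLE2 = [(1, 11), (1, 12), (1, 13), (2, 11), (2, 14), (3, 12), (3, 15),
          (4, 12), (4, 13), (5, 14), (5, 16), (7, 13), (7, 15)]


def make_db():
    """Hypothetical helper: build an in-memory SQLite copy of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (student_no INTEGER, student_name TEXT)")
    conn.execute("CREATE TABLE table2 (student_no INTEGER, class_no INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", TABLE1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", TABLE2)
    return conn


def student_numbers(conn, where_clause):
    """Return the sorted student_no values of table1 that pass the filter."""
    sql = ("SELECT student_no FROM table1 WHERE " + where_clause +
           " ORDER BY student_no")
    return [row[0] for row in conn.execute(sql)]


conn = make_db()

in_result = student_numbers(
    conn, "student_no IN (SELECT student_no FROM table2)")
not_in_result = student_numbers(
    conn, "student_no NOT IN (SELECT student_no FROM table2)")
exists_result = student_numbers(
    conn,
    "EXISTS (SELECT 1 FROM table2 "
    "WHERE table2.student_no = table1.student_no)")

print(in_result)      # [1, 2, 3, 4, 5]
print(not_in_result)  # [6]
print(exists_result)  # [1, 2, 3, 4, 5]
```

Each student appears once in the IN result even though table2 holds several rows per student.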

The DDL for the two test tables is as follows:

use test;

DROP TABLE IF EXISTS table1;
create table table1(
  student_no bigint comment 'student number',
  student_name string comment 'name'
)
COMMENT 'test student info'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

DROP TABLE IF EXISTS table2;
create table table2(
  student_no bigint comment 'student number',
  class_no bigint comment 'course number'
)
COMMENT 'test student course enrollment info'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

The test data is as follows:

[work@ha6-prd-dx rk]$ more data_table1.txt
1 name1
2 name2
3 name3
4 name4
5 name5
6 name6

[work@ha6-prd-dx rk]$ more data_table2.txt
1 11
1 12
1 13
2 11
2 14
3 12
3 15
4 12
4 13
5 14
5 16
7 13
7 15

The load commands are as follows:

load data local inpath '/home/work/yyz_work/data_table1.txt' overwrite into table table1 ;

load data local inpath '/home/work/yyz_work/data_table2.txt' overwrite into table table2 ;

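Test 1: subqueries. This shows that Hive 2.1.1 supports IN and NOT IN subqueries in the WHERE clause. Hive has also supported EXISTS and NOT EXISTS subqueries since version 0.13, provided the subquery is correlated with the outer query.

SELECT table1.student_no, table1.student_name
FROM table1
WHERE table1.student_no in (SELECT table2.student_no FROM table2);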

Test 2: LEFT SEMI JOIN

An IN/EXISTS subquery in the WHERE clause can also be written as a LEFT SEMI JOIN. Replacing IN/EXISTS is the main use case of the semi join, and it is a more efficient implementation.

SQL1:

SELECT table1.student_no, table1.student_name FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table2.student_no);

SQL2:

SELECT * FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table2.student_no);

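SQL1 and SQL2 are equivalent: both return only the columns of the left table. With the test data above, the rows for student_no 1 through 5 are returned.

Only the left table's columns are printed. The rule is: a left row is output if its join key exists in the right table, and filtered out otherwise. The two tests above show that the blog post at https://blog.csdn.net/AntKengElephant/article/details/83029573 contains errors.

Also note that only LEFT SEMI JOIN exists; there is no SEMI JOIN or RIGHT SEMI JOIN. In the statement below, SEMI is presumably parsed as an alias for table1, which would explain why the name table1 can no longer be resolved: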

hive> SELECT table1.student_no, table1.student_name FROM table1 SEMI JOIN table2 on (table1.student_no=table2.student_no);
FAILED: SemanticException [Error 10009]: Line 1:79 Invalid table alias 'table1'

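Also note the following points:

1. The restriction of LEFT SEMI JOIN is that the right-hand table may only be referenced in the ON clause. Filtering on it in the WHERE clause, the SELECT clause, or anywhere else is not allowed.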

hive> SELECT * FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table2.student_no) where table2.student_no>3;
FAILED: SemanticException [Error 10004]: Line 1:92 Invalid table alias or column reference 'table2': (possible column names are: student_no, student_name)

Filter conditions on the right table can only be written in the ON clause:
hive> SELECT * FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table2.student_no and table2.student_no>3);

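2. LEFT SEMI JOIN passes only the table's join key to the map stage, so the final select list of a LEFT SEMI JOIN may contain only columns of the left table (compare SQL1 and SQL2).

3. Because LEFT SEMI JOIN has IN (keySet) semantics, a left row is emitted once and further duplicate matches in the right table are skipped, whereas JOIN keeps iterating over them. So when the right table contains duplicate keys, LEFT SEMI JOIN produces one row per left row while JOIN produces several, which also makes LEFT SEMI JOIN faster.

Reference: https://blog.csdn.net/happyrocking/article/details/79885071 (P.S. the last example given there is wrong: a semi join must not include columns of the right table.)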
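
The three points above can be modeled in plain Python. This is a rough sketch of the IN (keySet) semantics, not Hive's actual execution plan; the function names and the right_filter parameter (standing in for a right-table condition placed in the ON clause) are hypothetical.

```python
# Sample rows from this article's test tables; column 0 is student_no.
TABLE1 = [(1, "name1"), (2, "name2"), (3, "name3"),
          (4, "name4"), (5, "name5"), (6, "name6")]
TABLE2 = [(1, 11), (1, 12), (1, 13), (2, 11), (2, 14), (3, 12), (3, 15),
          (4, 12), (4, 13), (5, 14), (5, 16), (7, 13), (7, 15)]


def left_semi_join(left, right, right_filter=lambda row: True):
    """Keep each left row whose key occurs in the (filtered) right table."""
    # Only the keys of the right table are used; duplicates collapse in the
    # set, so a left row can never be emitted more than once.
    right_keys = {row[0] for row in right if right_filter(row)}
    # The output rows are left rows only: no right-table columns exist here.
    return [row for row in left if row[0] in right_keys]


def inner_join(left, right):
    """Pair each left row with every right row that has the same key."""
    return [lrow + rrow for lrow in left for rrow in right
            if lrow[0] == rrow[0]]


semi_rows = left_semi_join(TABLE1, TABLE2)
inner_rows = inner_join(TABLE1, TABLE2)
# Counterpart of: ON (table1.student_no = table2.student_no
#                     AND table2.student_no > 3)
filtered_rows = left_semi_join(TABLE1, TABLE2, lambda row: row[0] > 3)

print(len(semi_rows), len(inner_rows))  # 5 11
print(filtered_rows)                    # [(4, 'name4'), (5, 'name5')]
```

The semi join returns 5 two-column rows, while the inner join over the same keys returns 11 four-column rows because of the duplicate keys in table2.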

Test 3: INNER JOIN is equivalent to JOIN. When two tables are joined, only rows that match in both tables are kept. There is no INNER OUTER JOIN.

select * from table1 inner join table2 on table1.student_no=table2.student_no;

select * from table1 join table2 on table1.student_no=table2.student_no;

Test 4: LEFT (OUTER) JOIN. When two tables are joined, all rows of the left table are returned, even those with no matching record in the right table.

select * from table1 left join table2 on(table1.student_no=table2.student_no);

The Hive version I use is hive-2.1.1, which supports the plain LEFT JOIN spelling.

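Test 5: rows unique to the left table. Unmatched right-side columns must be tested with IS NULL; a comparison such as b.key <> NULL evaluates to NULL and never matches any row.

SELECT a.key, a.value FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE b.key IS NULL;

select * from table1 left outer join table2 on table1.student_no=table2.student_no where table2.student_no is null;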
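
The choice of NULL test decides what a left-only query returns. The sketch below compares three WHERE conditions on the same LEFT JOIN; it uses Python's built-in sqlite3 module rather than Hive, on the assumption that both follow standard SQL NULL semantics here, and the helper names make_db and left_join_where are hypothetical.

```python
import sqlite3

# Sample rows from this article's test tables.
TABLE1 = [(1, "name1"), (2, "name2"), (3, "name3"),
          (4, "name4"), (5, "name5"), (6, "name6")]
TABLE2 = [(1, 11), (1, 12), (1, 13), (2, 11), (2, 14), (3, 12), (3, 15),
          (4, 12), (4, 13), (5, 14), (5, 16), (7, 13), (7, 15)]


def make_db():
    """Hypothetical helper: build an in-memory SQLite copy of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (student_no INTEGER, student_name TEXT)")
    conn.execute("CREATE TABLE table2 (student_no INTEGER, class_no INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", TABLE1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", TABLE2)
    return conn


def left_join_where(conn, condition):
    """Run the LEFT JOIN from the test with the given WHERE condition."""
    sql = ("SELECT * FROM table1 LEFT JOIN table2 "
           "ON table1.student_no = table2.student_no WHERE " + condition)
    return conn.execute(sql).fetchall()


conn = make_db()

# Comparing with NULL yields NULL, which is never true: no rows at all.
compared_with_null = left_join_where(conn, "table2.student_no <> NULL")

# IS NULL selects the rows that found no partner: the left-only rows.
left_only = left_join_where(conn, "table2.student_no IS NULL")

# IS NOT NULL keeps the matched rows, i.e. the inner join result.
matched = left_join_where(conn, "table2.student_no IS NOT NULL")

print(compared_with_null)  # []
print(left_only)           # [(6, 'name6', None, None)]
print(len(matched))        # 11
```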

Test 6: LEFT (OUTER) JOIN. When two tables are joined, all rows of the left table are returned, even those with no matching record in the right table.

select * from table1 left outer join table2 on(table1.student_no=table2.student_no);

select * from table1 left join table2 on(table1.student_no=table2.student_no);

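As shown, LEFT OUTER JOIN lists every row of the left table; where the right table has no matching row, its columns are filled with NULL.

Note also that if a left key matches N rows on the right, the result contains N rows for that key. Here, for example, the key 1 appears with its 3 right-side rows.

Test 7: RIGHT (OUTER) JOIN. When two tables are joined, all rows of the right table are returned, even those with no matching record in the left table.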

select * from table1 right join table2 on(table1.student_no=table2.student_no);

select * from table1 right outer join table2 on(table1.student_no=table2.student_no);

Test 8: rows unique to the right table

select * from table1 right join table2 on(table1.student_no=table2.student_no) where table1.student_no is null;

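Test 9: UNION merges two query results into one table.

UNION merges by rows. For example, if the first query returns 6 records and the second returns 13, the UNION result has 19 records, provided the two results share no duplicate rows.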

hive> select student_no tb1_student_no,student_name from table1 union select student_no as tb2_student_no,class_no from table2;

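Test 10: FULL JOIN. When two tables are joined, all rows of both the left and right tables are returned, whether or not they match.

The result is the union of LEFT JOIN and RIGHT JOIN.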

select * from table1 full join table2 on(table1.student_no=table2.student_no);

Test: a table can also be full-joined with itself, as long as aliases are used:

select * from table1 a full join table1 b on(a.student_no=b.student_no);

If FULL JOIN is used without an ON condition, the result is again the Cartesian product:

select * from table1 a full join table2 b ;

Test 11: union minus intersection

hive> select * from table1 left outer join table2 on table1.student_no=table2.student_no where table2.student_no is null;

hive> select * from table1 RIGHT outer join table2 on table1.student_no=table2.student_no where table1.student_no is null;

hive> select * from table1 left outer join table2 on table1.student_no=table2.student_no where table2.student_no is null
> UNION
> select * from table1 RIGHT outer join table2 on table1.student_no=table2.student_no where table1.student_no is null;
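
The expected result of this union-minus-intersection query can be worked out with a short sketch on Python's built-in sqlite3 module (not Hive). Because older SQLite builds lack RIGHT JOIN, the right-only half is written as a LEFT JOIN with the tables swapped, which is an adaptation of the query above rather than the same statement; the helper name make_db is hypothetical.

```python
import sqlite3

# Sample rows from this article's test tables.
TABLE1 = [(1, "name1"), (2, "name2"), (3, "name3"),
          (4, "name4"), (5, "name5"), (6, "name6")]
TABLE2 = [(1, 11), (1, 12), (1, 13), (2, 11), (2, 14), (3, 12), (3, 15),
          (4, 12), (4, 13), (5, 14), (5, 16), (7, 13), (7, 15)]


def make_db():
    """Hypothetical helper: build an in-memory SQLite copy of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (student_no INTEGER, student_name TEXT)")
    conn.execute("CREATE TABLE table2 (student_no INTEGER, class_no INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", TABLE1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", TABLE2)
    return conn


# The same four output columns in both halves, left table first.
COLUMNS = ("table1.student_no, table1.student_name, "
           "table2.student_no, table2.class_no")

# Left-only rows: left join, keep rows with no partner in table2.
LEFT_ONLY_SQL = (
    "SELECT " + COLUMNS + " FROM table1 LEFT JOIN table2 "
    "ON table1.student_no = table2.student_no "
    "WHERE table2.student_no IS NULL")

# Right-only rows: the tables are swapped so that LEFT JOIN preserves table2.
RIGHT_ONLY_SQL = (
    "SELECT " + COLUMNS + " FROM table2 LEFT JOIN table1 "
    "ON table1.student_no = table2.student_no "
    "WHERE table1.student_no IS NULL")

conn = make_db()
left_only = conn.execute(LEFT_ONLY_SQL).fetchall()
right_only = conn.execute(RIGHT_ONLY_SQL).fetchall()
sym_diff = conn.execute(LEFT_ONLY_SQL + " UNION " + RIGHT_ONLY_SQL).fetchall()

print(left_only)      # [(6, 'name6', None, None)]
print(len(right_only))  # 2
print(len(sym_diff))    # 3
```

Student 6 exists only in table1 and student 7 only in table2, so the combined result holds one left-only row and two right-only rows.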

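Conclusions:

Hive 2.1.1 supports the plain LEFT JOIN spelling.
Hive LEFT OUTER JOIN: if several right rows match a left row, one output row is produced per match; if no right row matches, the left row is output with the right table's columns set to NULL.
Hive LEFT SEMI JOIN: equivalent to the SQL IN predicate. The statements in Test 2 above are equivalent to "select * from table1 where table1.student_no in (select student_no from table2)". Note that the result contains no columns of the right table.

Correction: commands run on hive-2.1.1 confirm that parts of the following articles are wrong: http://www.crazyant.net/1470.html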

https://blog.csdn.net/lukabruce/article/details/80568796

http://www.w3school.com.cn/sql/sql_join_full.asp

Original article: https://blog.csdn.net/helloxiaozhe/article/details/87910386
