ON vs. WHERE in Hive Joins


This post was written by cnblogs author 一寸HUI; personal blog: https://www.cnblogs.com/zsql/

If you first learned MySQL's logical evaluation order (from -> on -> join -> where -> group by -> having -> select -> distinct -> order by -> limit), you may carry some misconceptions about how ON and WHERE behave in Hive. Plenty of blog posts already cover this, but I wanted to verify it myself. This article distinguishes ON from WHERE under inner join, left/right join, and full join, with and without partition filters. Understanding these differences really helps when writing HQL: your queries become simpler, or faster, or both. Let's get started.

Version: tested on CDH 6.3.2, with Hive 2.1.1+cdh6.3.2.

1. Test Tables and Data

1.1 Creating the tables

create table `user` (        -- user table, partitioned
  department_id int,
  age int,
  sex string,
  name string
)
PARTITIONED BY (`date` string)
row format delimited fields terminated by ','
STORED AS TEXTFILE;

create table department (    -- department table
  id int,
  name string,
  count int
)
row format delimited fields terminated by ','
STORED AS TEXTFILE;

1.2 Sample data

-- /data/hive/user1.txt
1,34,male,zhangsan
1,31,female,lili
3,14,female,liushen
3,24,female,sasa
4,54,male,liubei
4,36,female,yuji
4,25,male,zhaoyun
8,18,male,zhangfei

-- /data/hive/user2.txt
3,37,male,wangwu
4,38,female,lisi
3,19,female,caocao
2,22,female,guanyu
1,51,male,wzj
6,31,female,zhenji
6,25,male,sunwukong
6,17,male,tangsz

-- /data/hive/department.txt
1,dashuju,8
2,kaifa,9
3,ui,10
4,hr,3
5,shouxiao,12
6,zongjian,3

1.3 Loading the data

load data local inpath '/data/hive/user1.txt' into table `user` partition (`date`='2020-12-24');
load data local inpath '/data/hive/user2.txt' into table `user` partition (`date`='2020-12-25');
load data local inpath '/data/hive/department.txt' into table `department`;

1.4 Querying the data

SELECT * from `user`;

SELECT * from department;

 

1.5 Analyzing the tables

ANALYZE TABLE `user` partition(`date`='2020-12-24') COMPUTE STATISTICS;
ANALYZE TABLE `user` partition(`date`='2020-12-25') COMPUTE STATISTICS;
ANALYZE TABLE department COMPUTE STATISTICS;

Without this step, the experiments below go wrong: the statistics are unrealistic and the effects are invisible. Always run the ANALYZE statements above before EXPLAIN so the row counts are accurate. (I just hit this pitfall myself: every EXPLAIN reported only 1 row because the statistics were broken.)
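To confirm the statistics were actually collected (a side note of mine, not from the original post), you can inspect the table/partition metadata; the `numRows` parameter should match the loaded data:

```sql
-- Look for numRows and totalSize in the Table/Partition Parameters section
DESCRIBE FORMATTED `user` PARTITION (`date`='2020-12-24');
DESCRIBE FORMATTED department;
```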

2. ON and WHERE under INNER JOIN

2.1 Without a partition filter

1. First, an inner join with no conditions at all:

SELECT * from `user` u inner join department d on d.id=u.department_id;

The execution plan:

explain SELECT * from `user` u inner join department d on d.id=u.department_id;

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: department_id is not null (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: department_id is not null (type: boolean)
              Statistics: Num rows: 16 Data size: 2944 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

This statement has no filter conditions and no partition predicate:

  • Hive first scans table d (department) in full, reading 6 rows, and automatically adds an `id is not null` filter over them (an automatic optimization: if you need to keep rows whose join key is NULL, you must handle them explicitly beforehand, otherwise they are silently filtered out and data is lost). All 6 rows survive the filter and participate in the inner join.
  • It then scans table u (user) in full, reading 16 rows, likewise adding the implicit filter `department_id is not null`; all 16 rows participate in the inner join.
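As the first bullet notes, an inner join silently drops rows whose join key is NULL. Two common workarounds (my own sketch, not from the original post; column counts and CASTs may need adjusting to your schema):

```sql
-- Option 1: NULL-safe equality (<=>) pairs NULL keys on both sides
SELECT * FROM `user` u INNER JOIN department d
  ON u.department_id <=> d.id;

-- Option 2: route NULL-key user rows around the join and union them back
SELECT u.*, d.id, d.name, d.count
FROM `user` u INNER JOIN department d ON u.department_id = d.id
UNION ALL
SELECT u.*, CAST(NULL AS INT), CAST(NULL AS STRING), CAST(NULL AS INT)
FROM `user` u WHERE u.department_id IS NULL;
```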

 

2. Next, a query with both WHERE and ON conditions:

SELECT * from `user` u inner join department d on d.id=u.department_id and d.count > 9 and u.age > 20
where u.age < 30;

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: ((count > 9) and id is not null) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((count > 9) and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
              Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The plan shows:

  • Table d (department) is scanned in full (6 rows) and filtered by ((count > 9) and id is not null), leaving 2 rows for the inner join.
  • Table u (user) is also scanned in full (16 rows) and filtered by ((age > 20) and department_id is not null and (age < 30)), leaving 1 row for the inner join (this estimate is off: the real number is 4 rows; Hive's execution plan reports approximate statistics, not exact counts).

Takeaway: without a partition filter, an inner join scans both tables in full, automatically adds `is not null` filters on the join keys (on d.id=u.department_id), and then applies the ON and WHERE conditions. For an inner join, WHERE and ON filter at the same time.
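A practical consequence (my own illustration): because ON and WHERE filter together under an inner join, moving a single-table predicate between the two clauses does not change the result, so these two queries should be equivalent:

```sql
-- Filter in ON ...
SELECT * FROM `user` u INNER JOIN department d
  ON d.id = u.department_id AND d.count > 9;

-- ... or in WHERE: same rows for an inner join
SELECT * FROM `user` u INNER JOIN department d
  ON d.id = u.department_id
WHERE d.count > 9;
```

As the later sections show, this equivalence does NOT hold for outer joins.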

2.2 With a partition filter

1. First, a query with only a partition filter:

SELECT * from `user` u inner join department d on d.id=u.department_id where u.`date`='2020-12-25';

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: (department_id is not null and (date = '2020-12-25')) (type: boolean)
            Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE   -- note: a single partition holds only 8 rows
            Filter Operator
              predicate: department_id is not null (type: boolean)
              Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-25' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Compared with the no-partition case, adding the partition predicate means table u (user) is no longer scanned in full, which improves efficiency: each partition is stored as a directory, so after partition pruning Hive scans only the specified partition rather than the whole table. In other words: a partitioned table is pruned first, the selected partition is then scanned in full, the ON and WHERE conditions plus the automatically added `is not null` filters are applied, and only the surviving rows enter the inner join.

2. Now a query with both filter conditions and a partition filter:

SELECT  * from `user` u inner join department d on d.id=u.department_id and d.count > 9 and u.age > 20
where u.age < 30  and u.`date`='2020-12-24';

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: ((count > 9) and id is not null) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((count > 9) and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
            Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
              Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Comparing this with the earlier case (ON and WHERE filters but no partition filter), the only difference is the scope of the scan: without partition pruning Hive scans the whole table; with it, only the specified partition.

2.3 Summary: ON and WHERE under INNER JOIN

Under an inner join, if the WHERE clause contains a partition filter, Hive scans only that partition's data, then filters with the WHERE and ON conditions plus the automatically added `is not null` predicates on the join keys (on d.id=u.department_id). Note: if rows with NULL join keys must be kept, they need special handling, otherwise they are filtered out and lost. ON and WHERE filter together; there is no ordering between them.

3. ON and WHERE under LEFT/RIGHT JOIN

Since left join and right join are symmetric, this article experiments with left join only.

3.1 Filtering the non-preserved table in ON vs. WHERE

First, a query with the filter on the non-preserved table in the ON clause (convention: in `u left join d`, u is the preserved/driving table and d the non-preserved table):

SELECT * from `user` u
left join department d
on d.id=u.department_id
and d.count > 9;

Result:

Now the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: (count > 9) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (count > 9) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As shown above, with `u (preserved) left join d (non-preserved)`, the ON condition d.count > 9 on the non-preserved table takes effect before the join: all 6 rows of d are scanned, the filter leaves 2 rows, and those are left joined against a full scan of the preserved table. Also note that under a left join neither join key (on d.id=u.department_id) gets an automatic `is not null` filter, so watch out for NULL join keys; the NULL case is an opportunity for optimization.

 

Now the same filter on the non-preserved table, but in the WHERE clause (same convention: u (preserved) left join d (non-preserved)):

SELECT * from `user` u left join department d on d.id=u.department_id where d.count > 9;

The result differs from putting the same condition on the non-preserved table in the ON clause:

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Filter Operator
                predicate: (_col10 > 9) (type: boolean)
                Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

This plan shows that both u (user) and d (department) are scanned in full with no pre-filtering, the left join runs first, and only afterwards is the WHERE condition on the non-preserved table, d.count > 9, applied to the join result.

Takeaway: for a left join, filtering the non-preserved table in ON vs. WHERE gives different results. A condition in ON filters that table's rows first and then joins; a condition in WHERE joins first and then filters the joined result. A left join adds no implicit `is not null` filter on the join keys.
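To filter the non-preserved side while still keeping every row of the preserved side, a common pattern (my own sketch, not from the original post) is either the ON form, or a WHERE clause that explicitly admits the NULL rows the join produced:

```sql
-- ON form: every u row survives; d columns are NULL when d fails the filter
SELECT * FROM `user` u LEFT JOIN department d
  ON d.id = u.department_id AND d.count > 9;

-- WHERE variant: keeps rows where d passed the filter OR never matched at all.
-- Note this is NOT identical to the ON form: a u row whose department matched
-- but failed count > 9 is dropped here, while the ON form keeps it with NULLs.
SELECT * FROM `user` u LEFT JOIN department d ON d.id = u.department_id
WHERE d.count > 9 OR d.id IS NULL;
```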

3.2 Filtering the preserved table in ON vs. WHERE

First, the filter on the preserved table in the ON clause (convention: u (preserved) left join d (non-preserved)):

SELECT * from `user` u left join department d on d.id=u.department_id and u.age > 20;

The result still contains all 16 rows (every row of the user table), but rows with age <= 20 behave as if they never participated in the join: all of their department columns are NULL.

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              filter predicates:
                0 {(age > 20)}
                1
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              filter predicates:
                0 {(age > 20)}
                1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

While processing d (department), Hive scans all 6 rows and records age > 20 as a join-time filter predicate. It then scans u (user) in full and left joins every row; during the join, the columns coming from d are kept only for rows satisfying age > 20. If a u row has age <= 20, every column sourced from d is set to NULL. (This is my own understanding and may not be exact. In short: the condition is marked first, the left join runs over all rows, and rows failing the condition keep the preserved table's data while the non-preserved table's columns are discarded, i.e. set to NULL.) An ON condition on the preserved table therefore never changes the preserved table's row count; the filtering happens during the join itself. It's a bit convoluted; compare with the actual output to convince yourself.

 

Now the same filter on the preserved table, but in the WHERE clause (u (preserved) left join d (non-preserved)):

SELECT * from `user` u left join department d on d.id=u.department_id where u.age > 20;

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: (age > 20) (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (age > 20) (type: boolean)
              Statistics: Num rows: 5 Data size: 920 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

This clearly shows that a WHERE filter on the preserved table is applied before the join: 16 rows are scanned, the filter leaves 5 (per the plan's estimate), and only then does the left join produce the final result.

Takeaway: for a left join, filtering the preserved table in ON vs. WHERE also differs. A WHERE condition filters the preserved table first and then joins. An ON condition on the preserved table is recorded as a join-time predicate: the full left join still runs, and for rows failing the condition the columns sourced from the non-preserved table are set to NULL.
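With the sample data, the two placements give visibly different outputs (my own side-by-side illustration, based on the plans above):

```sql
-- ON: all 16 user rows come back; department columns are NULL wherever age <= 20
SELECT * FROM `user` u LEFT JOIN department d
  ON d.id = u.department_id AND u.age > 20;

-- WHERE: user rows with age <= 20 are removed before the join and never appear
SELECT * FROM `user` u LEFT JOIN department d
  ON d.id = u.department_id
WHERE u.age > 20;
```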

3.3 LEFT/RIGHT JOIN with a partition filter

Consider this statement:

SELECT  * from `user` u left join department d on d.id=u.department_id where u.age > 20 and u.`date` = '2020-12-24';

Result:

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and (date = '2020-12-24')) (type: boolean)
            Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (age > 20) (type: boolean)
              Statistics: Num rows: 2 Data size: 34 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The plan shows that the partition filter prevents a full table scan: only one partition is read, hence only 8 rows.

3.4 Summary: ON and WHERE under LEFT/RIGHT JOIN

  1. A partition filter prevents full table scans and is applied first.
  2. Filtering the preserved table in ON vs. WHERE gives different results: WHERE filters first, then joins; an ON condition is recorded as a join-time predicate, and for rows failing it the columns from the non-preserved table are set to NULL while the preserved table's data is kept.
  3. Filtering the non-preserved table in ON vs. WHERE also differs: an ON condition filters that table's rows before the join; a WHERE condition filters the joined result afterwards.
  4. Left/right joins do not add an implicit `is not null` filter on the join keys, so watch out for NULL join keys; they are a place where you can optimize.
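One classic optimization for point 4 (my own sketch, not from the original post): in a common (reduce-side) join, preserved-side rows with a NULL join key can never match, yet they are all shuffled to the same reducer and can cause skew. Splitting them out avoids that:

```sql
-- Join only the rows that can actually match ...
SELECT u.*, d.name AS dept_name
FROM `user` u LEFT JOIN department d ON u.department_id = d.id
WHERE u.department_id IS NOT NULL

UNION ALL

-- ... and pass NULL-key rows straight through without joining
SELECT u.*, CAST(NULL AS STRING) AS dept_name
FROM `user` u
WHERE u.department_id IS NULL;
```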

4. ON and WHERE under FULL JOIN

4.1 FULL JOIN with no filter conditions

Straight to a full join with no conditions at all:

SELECT * from `user` u full join department d on d.id=u.department_id;

The execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The plan sorts each table's rows ascending by the join key (sort order: +), applies no automatic optimization (no NULL filtering on the join keys), and performs the full join over both complete tables.

4.2 FULL JOIN with WHERE conditions

SELECT * from `user` u full join department d on d.id=u.department_id where u.age > 20 and d.count > 9;

The result:

The execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: ((_col10 > 9) and (_col1 > 20)) (type: boolean)
            Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The plan shows that under a full join, WHERE filtering happens last: both tables are scanned in full, the full join runs, and only then are the WHERE conditions applied to the join result.

4.3 FULL JOIN with ON conditions (open question)

SELECT * from `user` u full join department d on d.id=u.department_id and u.age > 20 and d.count > 9;

The result may be surprising (my ability here is limited; knowledgeable readers are welcome to explain). My reading is that it works like the ON condition on the preserved table of a left join: the filtering happens during the full join itself, all rows of both tables are kept, and NULLs appear except where both conditions hold. (If the ON semantics of a full join are unclear to you, just prefer WHERE conditions.)

The execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 > 20)}
            1 {(VALUE._col1 > 9)}
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The question stands; if any reader understands this well, please leave a comment below.
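For what it's worth, one consistent reading (my interpretation, not from the original post) follows from standard outer-join semantics: in an outer join, the ON clause only decides whether a pair of rows *matches*; it never removes rows from a preserved side, and a full join preserves both sides. So adding u.age > 20 and d.count > 9 shrinks the set of matching pairs, and every row that ends up with no match is still emitted, padded with NULLs from the other side. This is exactly what the `filter predicates` entries in the plan express. A small probe makes the matched pairs visible:

```sql
-- Only pairs that satisfied the full ON condition have both sides non-NULL
SELECT *
FROM `user` u FULL JOIN department d
  ON d.id = u.department_id AND u.age > 20 AND d.count > 9
WHERE u.name IS NOT NULL AND d.id IS NOT NULL;
```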

4.4 FULL JOIN with a partition filter

SELECT  * from department d full join `user` u on d.id=u.department_id where u.`date`= '2020-12-24';

 

The execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 id (type: int)
            1 department_id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (_col10 = '2020-12-24') (type: boolean)
            Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: string), _col9 (type: string), '2020-12-24' (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The plan shows that under a full join, even a partition filter placed in WHERE is applied after the join: the full join runs first over both complete tables, and the condition filters the result afterwards. So prefer pre-filtering in a subquery before the full join.
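The recommended rewrite sketched out: push the partition filter into a subquery so pruning happens before the join rather than on the join result:

```sql
SELECT *
FROM department d
FULL JOIN (
  SELECT * FROM `user` WHERE `date` = '2020-12-24'   -- partition pruned here
) u
ON d.id = u.department_id;
```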

4.5 Summary: ON and WHERE under FULL JOIN

  1. The behavior of ON conditions remains an open question (see 4.3).
  2. WHERE conditions, partition filters included, are applied after the full join, so pre-filter in subqueries and then full join.
  3. No implicit `is not null` filter is added on the join keys.

5. Overall Summary

1. INNER JOIN

  • An inner join can be pruned by partition filters, preventing full table scans.
  • It automatically adds `is not null` filters on the join keys (on d.id=u.department_id).
  • ON and WHERE conditions filter together; there is no ordering difference between them.

2. LEFT/RIGHT JOIN

  • Partition filters prevent full table scans and apply first.
  • Filtering the preserved table in ON vs. WHERE gives different results: WHERE filters the preserved table first, then joins; an ON condition is evaluated during the join, and rows failing it have their non-preserved-table columns set to NULL.
  • Filtering the non-preserved table in ON vs. WHERE also differs: an ON condition filters that table before the join; a WHERE condition filters the join result afterwards. So filter the non-preserved table via ON, which saves you a subquery.
  • No implicit `is not null` filters are added on the join keys, so mind NULL join keys; they are an optimization opportunity.

3. FULL JOIN

  • ON-condition filtering remains an open question, still to be explored.
  • WHERE filtering, partition filters included, happens after the full join; pre-filter in subqueries, then full join.
  • No implicit `is not null` filters are added on the join keys.

