大數據SQL中的Join謂詞下推，真的那么難懂？

本文轉載自查看原文 2021-11-09 22:22 305 Hive

聽到謂詞下推這個詞，是不是覺得很高大上，找點資料看了半天才能搞懂概念和思想，借這個機會好好學習一下吧。

引用范欣欣大佬的博客中寫道，以前經常滿大街聽到謂詞下推，然而對謂詞下推卻總感覺懵懵懂懂，並不明白的很真切。這里拿出來和大家交流交流。個人認為謂詞下推有兩個層面的理解：

其一是邏輯執行計划優化層面的說法，比如SQL語句：select * from order ,item where item.id = order.item_id and item.category = ‘book’，正常情況語法解析之后應該是先執行Join操作，再執行Filter操作。通過謂詞下推，可以將Filter操作下推到Join操作之前執行。即將where item.category = ‘book’下推到 item.id = order.item_id之前先行執行。
其二是真正實現層面的說法，謂詞下推是將過濾條件從計算進程下推到存儲進程先行執行，注意這里有兩種類型進程：計算進程以及存儲進程。計算與存儲分離思想，這在大數據領域相當常見，比如最常見的計算進程有SparkSQL、Hive、impala等，負責SQL解析優化、數據計算聚合等，存儲進程有HDFS（DataNode）、Kudu、HBase，負責數據存儲。正常情況下應該是將所有數據從存儲進程加載到計算進程，再進行過濾計算。謂詞下推是說將一些過濾條件下推到存儲進程，直接讓存儲進程將數據過濾掉。這樣的好處顯而易見，過濾的越早，數據量越少，序列化開銷、網絡開銷、計算開銷這一系列都會減少，性能自然會提高。

謂詞下推 Predicate Pushdown（PPD）：簡而言之，就是在不影響結果的情況下，盡量將過濾條件提前執行。謂詞下推后，過濾條件在map端執行，減少了map端的輸出，降低了數據在集群上傳輸的量，節約了集群的資源，也提升了任務的性能。

PPD 配置
PPD控制參數：hive.optimize.ppd，默認值：true

PPD規則：

	Preserved Row tables	Null Supplying tables
Join Predicate	Case J1: Not Pushed	Case J2: Pushed
Where Predicate	Case W1: Pushed	Case W2: Not Pushed

Push：謂詞下推，可以理解為被優化
Not Push：謂詞沒有下推，可以理解為沒有被優化

實驗

實驗結果列表形式：

Pushed or Not	SQL
Pushed	select ename,dept_name from E join D on ( E.dept_id = D.dept_id and E.eid='HZ001');
Pushed	select ename,dept_name from E join D on E.dept_id = D.dept_id where E.eid='HZ001';
Pushed	select ename,dept_name from E join D on ( E.dept_id = D.dept_id and D.dept_id='D001');
Pushed	select ename,dept_name from E join D on E.dept_id = D.dept_id where D.dept_id='D001';
Not Pushed	select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001');
Pushed	select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where E.eid='HZ001';
Pushed	select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001');
Not Pushed	select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where D.dept_id='D001';
Pushed	select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001');
Not Pushed	select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where E.eid='HZ001';
Not Pushed	select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001');
Pushed	select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where D.dept_id='D001';
Not Pushed	select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001');
Not Pushed	select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where E.eid='HZ001';
Not Pushed	select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001');
Not Pushed	select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where D.dept_id='D001';

實驗結果表格形式：

predicate

此表實際上就是上述PPD規則表。

結論

1、對於Join(Inner Join)、Full outer Join，條件寫在on后面，還是where后面，性能上面沒有區別；
2、對於Left outer Join ，右側的表寫在on后面、左側的表寫在where后面，性能上有提高；
3、對於Right outer Join，左側的表寫在on后面、右側的表寫在where后面，性能上有提高；
4、當條件分散在兩個表時，謂詞下推可按上述結論2和3自由組合，情況如下：

SQL	過濾時機
`select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001' and D.dept_id = 'D001');`	dept_id在map端過濾，eid在reduce端過濾
`select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id = 'D001') where E.eid='HZ001';`	dept_id，eid都在map端過濾
`select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') where D.dept_id = 'D001';`	dept_id，eid都在reduce端過濾
`select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id ) where E.eid='HZ001' and D.dept_id = 'D001';`	dept_id在reduce端過濾，eid在map端過濾

注意：如果在表達式中含有不確定函數，整個表達式的謂詞將不會被pushed，例如

select a.* 
from a join b on a.id = b.id
where a.ds = '2019-10-09' and a.create_time = unix_timestamp();

因為unix_timestamp是不確定函數，在編譯的時候無法得知，所以，整個表達式不會被pushed，即ds='2019-10-09'也不會被提前過濾。類似的不確定函數還有rand()等。

參考文獻：
[1] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

引用：https://blog.csdn.net/strongyoung88/article/details/81156271

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 【大數據】SparkSql 連接查詢中的謂詞下推處理 (一) sql中join的on和where操作引發的謂詞下推優化 Spark之謂詞下推 Hive優化之謂詞下推 MySQL之謂詞下推 SQL大數據優化上聊聊謂詞下推的事大數據技術 - 為什么是SQL Hive中的Predicate Pushdown Rules（謂詞下推規則) 大數據中必須要掌握的 Flink SQL 詳細剖析