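Does the term "predicate pushdown" sound intimidating when you hear it? It took me a fair amount of reading to grasp the concept and the idea behind it, so let's take this opportunity to study it properly.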
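To quote Fan Xinxin's blog: I used to hear "predicate pushdown" everywhere, yet my understanding of it always felt hazy and never quite solid. I am sharing my view here for discussion. As I see it, predicate pushdown can be understood at two levels: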
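- The first is at the level of logical execution plan optimization. Take the SQL statement: select * from order, item where item.id = order.item_id and item.category = 'book'. After parsing, the default plan runs the Join first and the Filter second. With predicate pushdown, the Filter is moved below the Join, that is, item.category = 'book' is evaluated before the join on item.id = order.item_id.
- The second is at the level of the actual implementation: predicate pushdown moves filter conditions from the compute process down to the storage process, which evaluates them first. Note the two kinds of processes involved. Separating compute from storage is very common in big data: typical compute processes are SparkSQL, Hive, and Impala, which handle SQL parsing, optimization, and data computation and aggregation, while storage processes such as HDFS (DataNode), Kudu, and HBase handle data storage. Without pushdown, all data is loaded from the storage process into the compute process and filtered there. With pushdown, some filter conditions are handed to the storage process, which discards the non-matching data itself. The benefit is clear: the earlier the filter runs, the less data remains, so serialization, network, and compute costs all drop and performance improves.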
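The two levels above can be illustrated with a small in-memory sketch. Everything here is hypothetical: `STORAGE` stands in for a storage process, `scan` for a read that optionally accepts a pushed-down predicate, and `inner_join` for the compute-side join. It is not how any particular engine implements pushdown; it only shows that filtering before an inner join gives the same result while shipping fewer rows.

```python
from typing import Callable, Optional

# Hypothetical "storage process": two tiny tables held in memory.
STORAGE = {
    "item": [
        {"id": 1, "category": "book"},
        {"id": 2, "category": "toy"},
        {"id": 3, "category": "book"},
    ],
    "order": [
        {"order_id": 10, "item_id": 1},
        {"order_id": 11, "item_id": 2},
        {"order_id": 12, "item_id": 3},
    ],
}


def scan(table: str, predicate: Optional[Callable[[dict], bool]] = None) -> list:
    """Read a table; a pushed-down predicate is applied at the source."""
    rows = STORAGE[table]
    if predicate is None:
        return list(rows)
    return [row for row in rows if predicate(row)]


def inner_join(items: list, orders: list) -> list:
    """Nested-loop inner join on item.id = order.item_id."""
    return [{**i, **o} for i in items for o in orders if i["id"] == o["item_id"]]


def is_book(row: dict) -> bool:
    return row["category"] == "book"


# Without pushdown: ship every item row, join, then filter.
items_all = scan("item")
late = [row for row in inner_join(items_all, scan("order")) if is_book(row)]

# With pushdown: the storage side filters first, so fewer rows reach the join.
items_books = scan("item", is_book)
early = inner_join(items_books, scan("order"))

print(len(items_all), len(items_books), early == late)
```

For an inner join the two plans are interchangeable; the outer-join cases are where the order of filtering starts to matter.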
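Predicate Pushdown (PPD): in short, apply filter conditions as early as possible, provided the result is not affected. Once a predicate is pushed down, the filter runs on the map side, which reduces the map output, lowers the volume of data transferred across the cluster, saves cluster resources, and improves job performance.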
PPD configuration
PPD control parameter: hive.optimize.ppd, default value: true
PPD rules:
Predicate type | Preserved Row tables | Null Supplying tables |
---|---|---|
Join Predicate | Case J1: Not Pushed | Case J2: Pushed |
Where Predicate | Case W1: Pushed | Case W2: Not Pushed |
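Pushed: the predicate is pushed down, which can be read as "optimized".
Not Pushed: the predicate is not pushed down, which can be read as "not optimized".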
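Why the four cases come out this way can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 module rather than Hive, with made-up rows for tables E and D, and a left outer join (E is the preserved-row table, D the null-supplying table). For each case it compares the original query with a version where the filter has been applied to the table before the join; pushdown is only safe when both return the same rows. This illustrates the join semantics only and says nothing about how Hive plans the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE E (eid TEXT, ename TEXT, dept_id TEXT);
    CREATE TABLE D (dept_id TEXT, dept_name TEXT);
    INSERT INTO E VALUES ('HZ001', 'Alice', 'D001'), ('HZ002', 'Bob', 'D002');
    INSERT INTO D VALUES ('D001', 'Sales'), ('D002', 'Eng');
""")


def run(from_clause: str) -> list:
    """Run the query for a given FROM/ON/WHERE fragment, in a stable order."""
    sql = f"SELECT ename, dept_name FROM {from_clause} ORDER BY ename"
    return conn.execute(sql).fetchall()


JOIN = "LEFT OUTER JOIN"
KEY = "e.dept_id = d.dept_id"
# The same filters, applied to each table before the join ("pushed down").
E_FILTERED = "(SELECT * FROM E WHERE eid = 'HZ001') e"
D_FILTERED = "(SELECT * FROM D WHERE dept_id = 'D001') d"

# case name -> (original query, query with the filter applied before the join)
cases = {
    "J1": (f"E e {JOIN} D d ON {KEY} AND e.eid = 'HZ001'",
           f"{E_FILTERED} {JOIN} D d ON {KEY}"),
    "J2": (f"E e {JOIN} D d ON {KEY} AND d.dept_id = 'D001'",
           f"E e {JOIN} {D_FILTERED} ON {KEY}"),
    "W1": (f"E e {JOIN} D d ON {KEY} WHERE e.eid = 'HZ001'",
           f"{E_FILTERED} {JOIN} D d ON {KEY}"),
    "W2": (f"E e {JOIN} D d ON {KEY} WHERE d.dept_id = 'D001'",
           f"E e {JOIN} {D_FILTERED} ON {KEY}"),
}

safe_to_push = {
    name: run(original) == run(pre_filtered)
    for name, (original, pre_filtered) in cases.items()
}
print(safe_to_push)
```

In case J1, filtering E early drops Bob entirely, whereas the original query must still return him with a NULL dept_name. In case W2, filtering D early lets Bob survive with a NULL, whereas the original WHERE clause removes him.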
Experiment
Results in list form:
Pushed or Not | SQL |
---|---|
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
Not Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
Not Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
Not Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
Not Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001'); |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where E.eid='HZ001'; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001'); |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where D.dept_id='D001'; |
Results in table form:
This table is effectively the PPD rules table above.
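The correspondence between the rules table and the sixteen results can be sketched as a small lookup. The function name `is_pushed` and the "left"/"right" side labels are illustrative choices; the role assignments follow the definitions in the Hive OuterJoinBehavior page (for a full outer join both tables are preserved-row and null-supplying, for an inner join neither is).

```python
# Which side(s) of the join play each role, per join type.
PRESERVED = {
    "inner": set(),
    "left": {"left"},
    "right": {"right"},
    "full": {"left", "right"},
}
NULL_SUPPLYING = {
    "inner": set(),
    "left": {"right"},
    "right": {"left"},
    "full": {"left", "right"},
}


def is_pushed(join_type: str, clause: str, side: str) -> bool:
    """Apply the rules table.

    clause: "on" (join predicate) or "where" (where predicate)
    side:   which table the predicate refers to, "left" (E) or "right" (D)
    """
    if clause == "on":
        # Case J1 / J2: join predicates are not pushed to preserved-row tables.
        return side not in PRESERVED[join_type]
    if clause == "where":
        # Case W1 / W2: where predicates are not pushed to null-supplying tables.
        return side not in NULL_SUPPLYING[join_type]
    raise ValueError(f"unknown clause: {clause}")


# Enumerate in the same order as the experiment list:
# per join type: ON E, WHERE E, ON D, WHERE D.
results = [
    is_pushed(join_type, clause, side)
    for join_type in ("inner", "left", "right", "full")
    for side in ("left", "right")
    for clause in ("on", "where")
]
print(results)
```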
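Conclusions
1. For Join (Inner Join) and Full Outer Join, it makes no performance difference whether a condition is written in the ON clause or the WHERE clause: for an inner join the predicate is pushed down either way, and for a full outer join it is pushed down in neither case.
2. For Left Outer Join, performance improves when conditions on the right table are written in ON and conditions on the left table are written in WHERE.
3. For Right Outer Join, performance improves when conditions on the left table are written in ON and conditions on the right table are written in WHERE.
Note that for outer joins, moving a condition between ON and WHERE can change the query result, so conclusions 2 and 3 only apply when the rewritten query still returns the rows you need.
4. When the conditions are spread across both tables, pushdown follows conclusions 2 and 3 independently for each condition, as follows: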
SQL | Filter timing |
---|---|
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001' and D.dept_id = 'D001'); | dept_id is filtered on the map side, eid on the reduce side |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id = 'D001') where E.eid='HZ001'; | dept_id and eid are both filtered on the map side |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') where D.dept_id = 'D001'; | dept_id and eid are both filtered on the reduce side |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id ) where E.eid='HZ001' and D.dept_id = 'D001'; | dept_id is filtered on the reduce side, eid on the map side |
Note: if an expression contains a nondeterministic function, none of the predicates in that expression will be pushed down. For example:
select a.*
from a join b on a.id = b.id
where a.ds = '2019-10-09' and a.create_time = unix_timestamp();
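Because unix_timestamp is a nondeterministic function whose value cannot be known at compile time, the whole expression is not pushed down, which means ds='2019-10-09' is not filtered early either. Other nondeterministic functions include rand().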
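The behaviour described above can be modelled with a toy check. This is not Hive's optimizer code: the function names, the `NONDETERMINISTIC` set (deliberately limited to the two functions mentioned here), and the naive regex-based splitting are all assumptions made for illustration. It only captures the stated rule that one nondeterministic call blocks pushdown for every conjunct in the expression.

```python
import re

# Assumption: an illustrative, incomplete set of nondeterministic functions.
NONDETERMINISTIC = {"unix_timestamp", "rand"}


def split_conjuncts(where: str) -> list:
    """Naively split a WHERE expression on AND.

    Limitation: does not handle AND inside string literals or parentheses.
    """
    return [part.strip() for part in re.split(r"\s+and\s+", where, flags=re.IGNORECASE)]


def calls_nondeterministic(expr: str) -> bool:
    """True if the expression calls any function in NONDETERMINISTIC."""
    called = re.findall(r"([A-Za-z_]\w*)\s*\(", expr)
    return any(name.lower() in NONDETERMINISTIC for name in called)


def pushable_conjuncts(where: str) -> list:
    """Return the conjuncts that may be pushed down under the rule above."""
    conjuncts = split_conjuncts(where)
    if any(calls_nondeterministic(c) for c in conjuncts):
        # One nondeterministic call keeps the whole expression from being pushed.
        return []
    return conjuncts


blocked = pushable_conjuncts("a.ds = '2019-10-09' and a.create_time = unix_timestamp()")
allowed = pushable_conjuncts("a.ds = '2019-10-09'")
print(blocked, allowed)
```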
References:
[1] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior
Source: https://blog.csdn.net/strongyoung88/article/details/81156271