Source: https://blog.csdn.net/strongyoung88/article/details/81156271
The Concept of Predicate Pushdown
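Predicate Pushdown (PPD): in short, it means applying filter conditions as early as possible without changing the query result. Once a predicate is pushed down, the filter runs on the map side, which reduces the map output, cuts the amount of data transferred across the cluster, saves cluster resources, and improves job performance.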
PPD Configuration
PPD control parameter: hive.optimize.ppd
- Default Value: true
- Added In: Hive 0.4.0
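Pushed: the predicate is pushed down, which can be read as "optimized".
Not Pushed: the predicate is not pushed down, which can be read as "not optimized".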
Experiments
Experiment results as a list:
Pushed or Not | SQL |
---|---|
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Not Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Not Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Not Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Not Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Experiment results as a table:
Predicate Type | Inner Join, Left Table | Inner Join, Right Table | Left Outer Join, Left Table | Left Outer Join, Right Table | Right Outer Join, Left Table | Right Outer Join, Right Table | Full Outer Join, Left Table | Full Outer Join, Right Table |
---|---|---|---|---|---|---|---|---|
Join Predicate | Pushed | Pushed | Not Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Not Pushed |
Where Predicate | Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Pushed | Not Pushed | Not Pushed |
This table is, in effect, the PPD rule table.
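One way to see why the Not Pushed cells exist is to check whether filtering a table before the join returns the same rows as the original query. The sketch below runs the four left outer join cases against SQLite through Python's sqlite3 module, not against Hive. The sample rows in E and D are invented for illustration, and the check covers result equivalence only; it says nothing about the plan Hive actually produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE E (eid TEXT, ename TEXT, dept_id TEXT);
    CREATE TABLE D (dept_id TEXT, dept_name TEXT);
    -- Hypothetical sample rows, invented for this illustration.
    INSERT INTO E VALUES ('HZ001', 'Alice', 'D001'), ('HZ002', 'Bob', 'D002');
    INSERT INTO D VALUES ('D001', 'Sales'), ('D002', 'Ops');
""")

def rows(sql):
    """Run a query and return its rows in a stable order (None-safe sort)."""
    return sorted(conn.execute(sql).fetchall(), key=repr)

SELECT = "select ename, dept_name from "

# Left table, ON clause (Not Pushed): filtering E first drops Bob's row,
# which the outer join must preserve with a NULL dept_name.
on_left = rows(SELECT + "E left outer join D "
               "on (E.dept_id = D.dept_id and E.eid = 'HZ001')")
on_left_early = rows(SELECT + "(select * from E where eid = 'HZ001') E "
                     "left outer join D on E.dept_id = D.dept_id")

# Left table, WHERE clause (Pushed): filtering E first gives the same rows.
where_left = rows(SELECT + "E left outer join D "
                  "on E.dept_id = D.dept_id where E.eid = 'HZ001'")

# Right table, ON clause (Pushed): filtering D first gives the same rows.
on_right = rows(SELECT + "E left outer join D "
                "on (E.dept_id = D.dept_id and D.dept_id = 'D001')")
on_right_early = rows(SELECT + "E left outer join "
                      "(select * from D where dept_id = 'D001') D "
                      "on E.dept_id = D.dept_id")

# Right table, WHERE clause (Not Pushed): the filter runs on joined rows, so
# replacing it with an early filter on D would let Bob's NULL row through.
where_right = rows(SELECT + "E left outer join D "
                   "on E.dept_id = D.dept_id where D.dept_id = 'D001'")

print(on_left == on_left_early)       # False
print(where_left == on_left_early)    # True
print(on_right == on_right_early)     # True
print(where_right == on_right_early)  # False
```

The two False results line up with the two Not Pushed cells in the Left Outer Join columns. They also show that, for an outer join, moving a condition between ON and WHERE changes the rows returned, not only the performance.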
Conclusions
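1. For Join (Inner Join) and Full Outer Join, it makes no performance difference whether a condition is written after ON or after WHERE: for an inner join the predicate is pushed down either way, and for a full outer join it is not pushed down either way.
2. For Left Outer Join, writing conditions on the right table after ON and conditions on the left table after WHERE improves performance, because those predicates are pushed down.
3. For Right Outer Join, writing conditions on the left table after ON and conditions on the right table after WHERE improves performance, because those predicates are pushed down.
4. When the conditions are spread across both tables, pushdown follows conclusions 2 and 3 for each condition independently, as shown below: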
SQL | Filter Timing |
---|---|
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001' and D.dept_id = 'D001'); | dept_id is filtered on the map side, eid on the reduce side |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id = 'D001') where E.eid='HZ001'; | dept_id and eid are both filtered on the map side |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') where D.dept_id = 'D001'; | dept_id and eid are both filtered on the reduce side |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id ) where E.eid='HZ001' and D.dept_id = 'D001'; | dept_id is filtered on the reduce side, eid on the map side |
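The four rows above follow mechanically from the rule table. As a rough sketch, the lookup below encodes that table as a Python dictionary and reports where each predicate would be filtered. The names PPD_RULES, filter_side, and left_outer_timing are made up for this illustration and are not part of Hive.

```python
# join type -> clause -> table side -> True if the predicate is pushed down.
# Values are copied from the experiment table; all names are hypothetical.
PPD_RULES = {
    "inner":       {"on":    {"left": True,  "right": True},
                    "where": {"left": True,  "right": True}},
    "left_outer":  {"on":    {"left": False, "right": True},
                    "where": {"left": True,  "right": False}},
    "right_outer": {"on":    {"left": True,  "right": False},
                    "where": {"left": False, "right": True}},
    "full_outer":  {"on":    {"left": False, "right": False},
                    "where": {"left": False, "right": False}},
}

def filter_side(join_type, clause, table_side):
    """Return 'map' if the predicate is pushed down, otherwise 'reduce'."""
    return "map" if PPD_RULES[join_type][clause][table_side] else "reduce"

def left_outer_timing(eid_clause, dept_clause):
    """Filter timing for E left outer join D.

    E (column eid) is the left table; D (column dept_id) is the right table.
    Each argument is 'on' or 'where', the clause holding that condition.
    """
    return {
        "eid": filter_side("left_outer", eid_clause, "left"),
        "dept_id": filter_side("left_outer", dept_clause, "right"),
    }

# The four combinations, in the same order as the table.
for eid_clause, dept_clause in [("on", "on"), ("where", "on"),
                                ("on", "where"), ("where", "where")]:
    print(eid_clause, dept_clause, left_outer_timing(eid_clause, dept_clause))
```

Swapping "left_outer" for "right_outer" in the same lookup gives the mirrored behavior described in conclusion 3.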
Note: if an expression contains a nondeterministic function, none of the predicates in that expression are pushed down. For example:
select a.* from a join b on a.id = b.id where a.ds = '2019-10-09' and a.create_time = unix_timestamp();
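Because unix_timestamp() is a nondeterministic function whose value cannot be known at compile time, the whole expression is not pushed down, so ds = '2019-10-09' is not filtered early either. Other nondeterministic functions, such as rand(), behave the same way.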