解決spark sql關聯(join)查詢使用“or“緩慢的問題


1.需求描述

將a表的數據與b表的兩個字段進行關聯,輸出結果

a表數據約24億條

b表數據約30萬條

2.優化效果

優化后執行時間從數天減少到數分鍾

3.資源配置

spark 1.4.1

200core,600G RAM

4.代碼簡化版(優化前)

sqlContext.sql("name,ip1,ip2 as ip from table_A where name is not null and ip2 is not null or ip2 is not null) group by name,ip1,ip2").registerTempTable("a") sqlContext.read.parquet("table_B").registerTempTable("b") sqlContext.sql(''' select ip, count(1) as cnt from (select bb.ip as ip, aa.name as name from (select * from b where ip != '')bb left join (select * from a)aa on (bb.ip=aa.ip2 or bb.ip=aa.ip1) group by bb.ip, aa.name) group by ip ''').write.json("result")

 

 

5.代碼簡化版(優化后)

后來經過排查發現是使用or語句導致的運行緩慢,於是將兩個條件查詢注冊成兩張表,然后union成一張表,union操作其實只是合並兩個rdd的分區,基本沒有什么開銷。然后在對這張表進行關聯操作

代碼如下:

//查詢出需要的字段並進行緩存,因為下面要查詢2次
sqlContext.sql("CACHE TABLE all AS select name,ip1,ip2 from table_A where name is not null and (ip1 is not null or ip2 is not null) group by name,ip1,ip2") sqlContext.sql("select name,ip1 from all group by name,ip1").registerTempTable("temp1") sqlContext.sql("select name,ip2 from all group by name,ip2").registerTempTable("temp2") sqlContext.sql("select name,ip from (select * from temp1 union all select * from temp2)a group by name,ip").registerTempTable("a") sqlContext.read.parquet("table_B").registerTempTable("b") sqlContext.sql(''' select ip, count(1) as cnt from (select bb.ip as ip, aa.name as name from (select * from b where ip != '')bb left join (select * from a)aa on bb.ip=aa.ip group by bb.ip, aa.name) group by ip ''').write.json("result")


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM