1. Requirement
Join the data in table a against two columns of table b and output the result.
Table a holds about 2.4 billion rows.
Table b holds about 300,000 rows.
2. Optimization result
After the optimization, the job's execution time dropped from several days to several minutes.
3. Resources
Spark 1.4.1
200 cores, 600 GB RAM
4. Simplified code (before optimization)
# register the deduplicated rows of table_A as temp table "a"
sqlContext.sql("select name, ip1, ip2 from table_A where name is not null and (ip1 is not null or ip2 is not null) group by name, ip1, ip2").registerTempTable("a")
sqlContext.read.parquet("table_B").registerTempTable("b")
sqlContext.sql('''
select ip, count(1) as cnt from
    (select bb.ip as ip, aa.name as name from
        (select * from b where ip != '') bb
        left join
        (select * from a) aa
        on (bb.ip = aa.ip2 or bb.ip = aa.ip1)
     group by bb.ip, aa.name)
group by ip
''').write.json("result")
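Why is this version slow? An OR across two different columns is not an equi-join condition, so the engine cannot answer it with a hash lookup and falls back to testing every row pair. A toy sketch in plain Python (the sample rows and IPs are hypothetical, not from the real tables):

```python
# Toy illustration (hypothetical data): a join condition with OR across two
# columns forces a nested-loop join -- O(|a| * |b|) comparisons.
a_rows = [("n1", "1.1.1.1", "2.2.2.2"), ("n2", "3.3.3.3", "4.4.4.4")]
b_ips = ["2.2.2.2", "3.3.3.3", "9.9.9.9"]

comparisons = 0
matches = []
for ip in b_ips:                      # nested-loop join
    for name, ip1, ip2 in a_rows:
        comparisons += 1
        if ip == ip1 or ip == ip2:    # the OR predicate
            matches.append((ip, name))

print(comparisons)       # every pair is tested: len(b_ips) * len(a_rows)
print(sorted(matches))
```

At the real scale (2.4 billion x 300,000) that pairwise comparison count is what turns the job into a multi-day run.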
5. Simplified code (after optimization)
Investigation showed that the OR in the join condition was what made the job slow. The fix: register each of the two conditions as its own table, then union them into one table. A union merely concatenates the partitions of the two underlying RDDs, so it costs almost nothing. The join is then performed against the unioned table.
The code is as follows:
# select the needed columns and cache them, because they are queried twice below
sqlContext.sql("CACHE TABLE all AS select name, ip1, ip2 from table_A where name is not null and (ip1 is not null or ip2 is not null) group by name, ip1, ip2")
sqlContext.sql("select name, ip1 from all group by name, ip1").registerTempTable("temp1")
sqlContext.sql("select name, ip2 from all group by name, ip2").registerTempTable("temp2")
sqlContext.sql("select name, ip from (select name, ip1 as ip from temp1 union all select name, ip2 as ip from temp2) a group by name, ip").registerTempTable("a")
sqlContext.read.parquet("table_B").registerTempTable("b")
sqlContext.sql('''
select ip, count(1) as cnt from
    (select bb.ip as ip, aa.name as name from
        (select * from b where ip != '') bb
        left join
        (select * from a) aa
        on bb.ip = aa.ip
     group by bb.ip, aa.name)
group by ip
''').write.json("result")
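The rewrite is safe because matching b.ip against (a.ip1 OR a.ip2) produces exactly the same (ip, name) pairs as unioning ip1 and ip2 into one ip column and doing a single equi-join, which the engine can execute with a hash table. A sketch of the equivalence in plain Python (the sample rows and IPs are hypothetical):

```python
# Toy illustration (hypothetical data): unioning ip1 and ip2 into one column
# turns the OR join into a plain equi-join that a hash index can answer.
a_rows = [("n1", "1.1.1.1", "2.2.2.2"), ("n2", "3.3.3.3", "4.4.4.4")]
b_ips = ["2.2.2.2", "3.3.3.3", "9.9.9.9"]

# "union all" of (name, ip1) and (name, ip2), deduplicated like the GROUP BY
unioned = {(name, ip1) for name, ip1, _ in a_rows} | \
          {(name, ip2) for name, _, ip2 in a_rows}

# hash index on the single ip column: O(|a|) to build, O(1) per probe
index = {}
for name, ip in unioned:
    index.setdefault(ip, []).append(name)

matches = sorted((ip, name) for ip in b_ips for name in index.get(ip, []))

# identical result to the nested-loop OR join
or_join = sorted((ip, name) for ip in b_ips
                 for name, ip1, ip2 in a_rows if ip in (ip1, ip2))
assert matches == or_join
print(matches)
```

The build-plus-probe cost is linear in the table sizes instead of quadratic, which is why the Spark job drops from days to minutes.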