1. Requirement
Join the data in table a against two columns of table b and output the result.
Table a holds about 2.4 billion rows.
Table b holds about 300,000 rows.
2. Optimization result
After the optimization, the job's execution time dropped from several days to several minutes.
3. Resources
Spark 1.4.1
200 cores, 600 GB RAM
4. Simplified code (before optimization)
# register the deduplicated rows of table_A as temp table "a"
sqlContext.sql("select name, ip1, ip2 from table_A where name is not null and (ip1 is not null or ip2 is not null) group by name, ip1, ip2").registerTempTable("a")
sqlContext.read.parquet("table_B").registerTempTable("b")
sqlContext.sql('''
select ip, count(1) as cnt from
    (select bb.ip as ip, aa.name as name from
        (select * from b where ip != '') bb
        left join
        (select * from a) aa
        on (bb.ip = aa.ip2 or bb.ip = aa.ip1)
     group by bb.ip, aa.name)
group by ip
''').write.json("result")
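Why is this version slow? An OR across two different columns is not an equi-join condition, so the engine cannot answer it with a hash lookup and falls back to testing every row pair. A toy sketch in plain Python (the sample rows and IPs are hypothetical, not from the real tables):

```python
# Toy illustration (hypothetical data): a join condition with OR across two
# columns forces a nested-loop join -- O(|a| * |b|) comparisons.
a_rows = [("n1", "1.1.1.1", "2.2.2.2"), ("n2", "3.3.3.3", "4.4.4.4")]
b_ips = ["2.2.2.2", "3.3.3.3", "9.9.9.9"]

comparisons = 0
matches = []
for ip in b_ips:                      # nested-loop join
    for name, ip1, ip2 in a_rows:
        comparisons += 1
        if ip == ip1 or ip == ip2:    # the OR predicate
            matches.append((ip, name))

print(comparisons)       # every pair is tested: len(b_ips) * len(a_rows)
print(sorted(matches))
```

At the real scale (2.4 billion x 300,000) that pairwise comparison count is what turns the job into a multi-day run.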
5. Simplified code (after optimization)
Investigation showed that the OR in the join condition was what made the job slow. The fix: register each of the two conditions as its own table, then union them into one table. A union merely concatenates the partitions of the two underlying RDDs, so it costs almost nothing. The join is then performed against the unioned table.
The code is as follows:
# select the needed columns and cache them, because they are queried twice below
sqlContext.sql("CACHE TABLE all AS select name, ip1, ip2 from table_A where name is not null and (ip1 is not null or ip2 is not null) group by name, ip1, ip2")
sqlContext.sql("select name, ip1 from all group by name, ip1").registerTempTable("temp1")
sqlContext.sql("select name, ip2 from all group by name, ip2").registerTempTable("temp2")
sqlContext.sql("select name, ip from (select name, ip1 as ip from temp1 union all select name, ip2 as ip from temp2) a group by name, ip").registerTempTable("a")
sqlContext.read.parquet("table_B").registerTempTable("b")
sqlContext.sql('''
select ip, count(1) as cnt from
    (select bb.ip as ip, aa.name as name from
        (select * from b where ip != '') bb
        left join
        (select * from a) aa
        on bb.ip = aa.ip
     group by bb.ip, aa.name)
group by ip
''').write.json("result")
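The rewrite is safe because matching b.ip against (a.ip1 OR a.ip2) produces exactly the same (ip, name) pairs as unioning ip1 and ip2 into one ip column and doing a single equi-join, which the engine can execute with a hash table. A sketch of the equivalence in plain Python (the sample rows and IPs are hypothetical):

```python
# Toy illustration (hypothetical data): unioning ip1 and ip2 into one column
# turns the OR join into a plain equi-join that a hash index can answer.
a_rows = [("n1", "1.1.1.1", "2.2.2.2"), ("n2", "3.3.3.3", "4.4.4.4")]
b_ips = ["2.2.2.2", "3.3.3.3", "9.9.9.9"]

# "union all" of (name, ip1) and (name, ip2), deduplicated like the GROUP BY
unioned = {(name, ip1) for name, ip1, _ in a_rows} | \
          {(name, ip2) for name, _, ip2 in a_rows}

# hash index on the single ip column: O(|a|) to build, O(1) per probe
index = {}
for name, ip in unioned:
    index.setdefault(ip, []).append(name)

matches = sorted((ip, name) for ip in b_ips for name in index.get(ip, []))

# identical result to the nested-loop OR join
or_join = sorted((ip, name) for ip in b_ips
                 for name, ip1, ip2 in a_rows if ip in (ip1, ip2))
assert matches == or_join
print(matches)
```

The build-plus-probe cost is linear in the table sizes instead of quadratic, which is why the Spark job drops from days to minutes.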