Etl之HiveSql調優(left join where的位置)


一、前言

公司實用Hadoop構建數據倉庫,期間不可避免的實用HiveSql,在Etl過程中,速度成了避無可避的問題。本人有過幾個數據表關聯跑1個小時的經歷,你可能覺得無所謂,可是多次Etl就要多個小時,非常浪費時間,所以HiveSql優化不可避免。

注:本文只是從sql層面介紹一下日常需要注意的點,不涉及Hadoop、MapReduce等層面,關於Hive的編譯過程,請參考文章:http://tech.meituan.com/hive-sql-to-mapreduce.html

二、准備數據

假設咱們有兩張數據表。

景區表:sight,12W條記錄,數據表結構:

hive> desc sight;
OK
area                    string                  None                
city                    string                  None                
country                 string                  None                
county                  string                  None                
id                      string                  None                
name                    string                  None                
region                  string                  None     

景區訂單明細表:order_sight,1040W條記錄,數據表結構:

hive> desc order_sight;    
OK
create_time             string                  None                
id                      string                  None                
order_id                string                  None                
sight_id                bigint                  None  

 三、分析

3.1 where條件

那么咱們希望看見景區id是9718,日期是2015-10-10的所有訂單id,那么sql需要如下書寫:

hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';         
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562174, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562174/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3562174
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
2015-10-12 22:58:22,706 Stage-1 map = 0%,  reduce = 0%
2015-10-12 22:58:29,882 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.73 sec
2015-10-12 22:58:30,907 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.73 sec
2015-10-12 22:58:31,933 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:32,968 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:33,995 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:35,020 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:36,046 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:37,070 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:38,096 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:39,121 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:40,153 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:41,182 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:42,209 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:43,236 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:44,263 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:45,289 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:46,316 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:47,344 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:48,370 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:49,397 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:50,424 Stage-1 map = 50%,  reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:51,452 Stage-1 map = 83%,  reduce = 17%, Cumulative CPU 37.62 sec
2015-10-12 22:58:52,478 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 38.06 sec
2015-10-12 22:58:53,506 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 38.06 sec
2015-10-12 22:58:54,534 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:55,560 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:56,587 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:57,615 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:58:58,642 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:58:59,674 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:59:00,708 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:01,736 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:02,763 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:03,791 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 38.41 sec
2015-10-12 22:59:04,817 Stage-1 map = 96%,  reduce = 29%, Cumulative CPU 49.13 sec
2015-10-12 22:59:05,843 Stage-1 map = 100%,  reduce = 29%, Cumulative CPU 49.59 sec
2015-10-12 22:59:06,870 Stage-1 map = 100%,  reduce = 41%, Cumulative CPU 49.76 sec
2015-10-12 22:59:07,897 Stage-1 map = 100%,  reduce = 41%, Cumulative CPU 49.76 sec
2015-10-12 22:59:08,922 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 52.79 sec
2015-10-12 22:59:09,947 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 52.79 sec
MapReduce Total cumulative CPU time: 52 seconds 790 msec
Ended Job = job_1434099279301_3562174
MapReduce Jobs Launched: 
Job 0: Map: 8  Reduce: 1   Cumulative CPU: 52.79 sec   HDFS Read: 371210469 HDFS Write: 330 SUCCESS
Total MapReduce CPU Time Spent: 52 seconds 790 msec
OK
9718    210294949
9718    210294421
9718    210296438
9718    210295344
9718    210297567
9718    210296076
9718    210295525
9718    210298219
9718    210295840
9718    210301363
9718    210297733
9718    210298066
9718    210295239
9718    210298328
9718    210298008
9718    210299712
9718    210295586
9718    210295050
9718    210295566
9718    210299105
9718    210296318
9718    210295277
Time taken: 52.068 seconds, Fetched: 22 row(s)

 可見需要的時間是52秒,如果咱們換一個sql的書寫方式:

hive> select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562218, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562218/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3562218
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
2015-10-12 23:03:54,926 Stage-1 map = 0%,  reduce = 0%
2015-10-12 23:04:01,075 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:02,101 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:03,126 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:04,151 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 5.53 sec
2015-10-12 23:04:05,176 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 5.53 sec
2015-10-12 23:04:06,201 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 14.62 sec
2015-10-12 23:04:07,226 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:08,250 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:09,275 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:10,300 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:11,324 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:12,356 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:13,384 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:14,410 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:15,437 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:16,463 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:17,487 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:18,514 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:19,538 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:20,569 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:21,595 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:22,620 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:23,646 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:24,673 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:25,698 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:26,723 Stage-1 map = 63%,  reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:27,748 Stage-1 map = 75%,  reduce = 21%, Cumulative CPU 23.32 sec
2015-10-12 23:04:28,774 Stage-1 map = 88%,  reduce = 21%, Cumulative CPU 27.27 sec
2015-10-12 23:04:29,799 Stage-1 map = 100%,  reduce = 21%, Cumulative CPU 32.82 sec
2015-10-12 23:04:30,823 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 34.35 sec
2015-10-12 23:04:31,846 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 34.35 sec
MapReduce Total cumulative CPU time: 34 seconds 350 msec
Ended Job = job_1434099279301_3562218
MapReduce Jobs Launched: 
Job 0: Map: 8  Reduce: 1   Cumulative CPU: 34.35 sec   HDFS Read: 371210469 HDFS Write: 330 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 350 msec
OK
9718    210297733
9718    210298066
9718    210295239
9718    210298328
9718    210298008
9718    210299712
9718    210297567
9718    210296076
9718    210295525
9718    210298219
9718    210295840
9718    210301363
9718    210295586
9718    210295050
9718    210295566
9718    210299105
9718    210296318
9718    210295277
9718    210294949
9718    210294421
9718    210296438
9718    210295344
Time taken: 43.709 seconds, Fetched: 22 row(s)

實用43秒,快了一些。當然咱們並不是僅僅分析說快了20%(我還多次測試,這次的差距最小),而是分析原因!

單從兩個sql的寫法上看的出來,特別是第二條的紅色部分,我將left的條件寫到里面了。那么執行的結果隨之不一樣,第二條的Reduce時間明顯小於第一條的Reduce時間。

原因是這兩個sql都分解成8個Map任務和1個Reduce任務,如果left的條件寫在后面,那么這些關聯操作會放在Reduce階段,1個Reduce操作的時間必然大於8個Map的執行時間,造成執行時間超長。

結論:當使用外關聯時,如果將副表的過濾條件寫在Where后面,那么就會先全表關聯,之后再過濾


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM