1. Introduction
Our company uses Hadoop to build its data warehouse, which inevitably means writing Hive SQL, and during ETL, speed becomes a problem you cannot avoid. I have personally watched a join across a few tables run for an hour. That may sound tolerable once, but an ETL pipeline runs many such steps, so the hours add up and a great deal of time is wasted. Optimizing Hive SQL is therefore unavoidable.
Note: this article only covers points worth watching at the SQL level in day-to-day work; it does not go into the Hadoop or MapReduce layers. For how Hive compiles SQL into MapReduce, see: http://tech.meituan.com/hive-sql-to-mapreduce.html
2. Preparing the Data
Suppose we have two tables.
Sight (scenic spot) table: sight, 120,000 rows, with the following structure:
hive> desc sight;
OK
area     string  None
city     string  None
country  string  None
county   string  None
id       string  None
name     string  None
region   string  None
Sight order detail table: order_sight, 10.4 million rows, with the following structure:
hive> desc order_sight;
OK
create_time  string  None
id           string  None
order_id     string  None
sight_id     bigint  None
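For readers who want to reproduce the setup, the two tables might be declared roughly as follows. This is a sketch reconstructed from the `desc` output above; the original DDL is not shown in this article, so the storage format and any table properties are assumptions.

```sql
-- Hypothetical DDL inferred from `desc sight` / `desc order_sight`;
-- storage format, delimiters, and partitioning are NOT from the original.
CREATE TABLE sight (
  area    string,
  city    string,
  country string,
  county  string,
  id      string,
  name    string,
  region  string
);

CREATE TABLE order_sight (
  create_time string,
  id          string,
  order_id    string,
  sight_id    bigint
);
```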
3. Analysis
3.1 WHERE conditions
Say we want every order id for the sight whose id is 9718 on 2015-10-10. The SQL can be written like this:
hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562174, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562174/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3562174
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
2015-10-12 22:58:22,706 Stage-1 map = 0%, reduce = 0%
2015-10-12 22:58:29,882 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 4.73 sec
2015-10-12 22:58:30,907 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 4.73 sec
2015-10-12 22:58:31,933 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:32,968 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:33,995 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:35,020 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:36,046 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:37,070 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:38,096 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:39,121 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:40,153 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.87 sec
2015-10-12 22:58:41,182 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:42,209 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:43,236 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.22 sec
2015-10-12 22:58:44,263 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:45,289 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:46,316 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 15.3 sec
2015-10-12 22:58:47,344 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:48,370 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:49,397 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:50,424 Stage-1 map = 50%, reduce = 17%, Cumulative CPU 21.85 sec
2015-10-12 22:58:51,452 Stage-1 map = 83%, reduce = 17%, Cumulative CPU 37.62 sec
2015-10-12 22:58:52,478 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 38.06 sec
2015-10-12 22:58:53,506 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 38.06 sec
2015-10-12 22:58:54,534 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:55,560 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:56,587 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.17 sec
2015-10-12 22:58:57,615 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:58:58,642 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:58:59,674 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.25 sec
2015-10-12 22:59:00,708 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:01,736 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:02,763 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.32 sec
2015-10-12 22:59:03,791 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 38.41 sec
2015-10-12 22:59:04,817 Stage-1 map = 96%, reduce = 29%, Cumulative CPU 49.13 sec
2015-10-12 22:59:05,843 Stage-1 map = 100%, reduce = 29%, Cumulative CPU 49.59 sec
2015-10-12 22:59:06,870 Stage-1 map = 100%, reduce = 41%, Cumulative CPU 49.76 sec
2015-10-12 22:59:07,897 Stage-1 map = 100%, reduce = 41%, Cumulative CPU 49.76 sec
2015-10-12 22:59:08,922 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 52.79 sec
2015-10-12 22:59:09,947 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 52.79 sec
MapReduce Total cumulative CPU time: 52 seconds 790 msec
Ended Job = job_1434099279301_3562174
MapReduce Jobs Launched:
Job 0: Map: 8  Reduce: 1  Cumulative CPU: 52.79 sec  HDFS Read: 371210469  HDFS Write: 330  SUCCESS
Total MapReduce CPU Time Spent: 52 seconds 790 msec
OK
9718  210294949
9718  210294421
9718  210296438
9718  210295344
9718  210297567
9718  210296076
9718  210295525
9718  210298219
9718  210295840
9718  210301363
9718  210297733
9718  210298066
9718  210295239
9718  210298328
9718  210298008
9718  210299712
9718  210295586
9718  210295050
9718  210295566
9718  210299105
9718  210296318
9718  210295277
Time taken: 52.068 seconds, Fetched: 22 row(s)
As you can see, it takes 52 seconds. Now let's rewrite the SQL a different way:
hive> select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562218, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562218/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3562218
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
2015-10-12 23:03:54,926 Stage-1 map = 0%, reduce = 0%
2015-10-12 23:04:01,075 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:02,101 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:03,126 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 2.24 sec
2015-10-12 23:04:04,151 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 5.53 sec
2015-10-12 23:04:05,176 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 5.53 sec
2015-10-12 23:04:06,201 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 14.62 sec
2015-10-12 23:04:07,226 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:08,250 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:09,275 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:10,300 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:11,324 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 18.66 sec
2015-10-12 23:04:12,356 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:13,384 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:14,410 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.09 sec
2015-10-12 23:04:15,437 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:16,463 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:17,487 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.22 sec
2015-10-12 23:04:18,514 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:19,538 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:20,569 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.35 sec
2015-10-12 23:04:21,595 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:22,620 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:23,646 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.54 sec
2015-10-12 23:04:24,673 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:25,698 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:26,723 Stage-1 map = 63%, reduce = 21%, Cumulative CPU 19.64 sec
2015-10-12 23:04:27,748 Stage-1 map = 75%, reduce = 21%, Cumulative CPU 23.32 sec
2015-10-12 23:04:28,774 Stage-1 map = 88%, reduce = 21%, Cumulative CPU 27.27 sec
2015-10-12 23:04:29,799 Stage-1 map = 100%, reduce = 21%, Cumulative CPU 32.82 sec
2015-10-12 23:04:30,823 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 34.35 sec
2015-10-12 23:04:31,846 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 34.35 sec
MapReduce Total cumulative CPU time: 34 seconds 350 msec
Ended Job = job_1434099279301_3562218
MapReduce Jobs Launched:
Job 0: Map: 8  Reduce: 1  Cumulative CPU: 34.35 sec  HDFS Read: 371210469  HDFS Write: 330  SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 350 msec
OK
9718  210297733
9718  210298066
9718  210295239
9718  210298328
9718  210298008
9718  210299712
9718  210297567
9718  210296076
9718  210295525
9718  210298219
9718  210295840
9718  210301363
9718  210295586
9718  210295050
9718  210295566
9718  210299105
9718  210296318
9718  210295277
9718  210294949
9718  210294421
9718  210296438
9718  210295344
Time taken: 43.709 seconds, Fetched: 22 row(s)
This version takes about 43 seconds, somewhat faster. The point is not merely that it is roughly 20% faster (I ran both queries several times; this run actually showed the smallest gap), but to understand why.
Comparing the two statements, the only difference is that in the second one the filter on order_sight (the create_time condition) is pushed down into a subquery that runs before the join. The execution profiles differ accordingly: the reduce phase of the second query is clearly shorter than that of the first.
The reason is that both queries break down into 8 map tasks and 1 reduce task. When the filter on the joined table is written in the outer WHERE clause, the join is performed in the single reduce task over the full 10.4M-row order table, and the create_time condition is only applied afterwards. That lone reducer handles far more data than any of the 8 parallel mappers, so it dominates the total running time. Filtering in the map-side subquery first means far fewer rows are shuffled to the reducer.
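You can check where Hive places the filter yourself with the standard EXPLAIN statement. Roughly speaking, in the slow query's plan the create_time predicate shows up in a filter operator after the join in the reduce stage, while in the fast query it is applied at the map-stage scan of order_sight. The exact operator tree varies by Hive version, so treat this as a sketch rather than guaranteed output:

```sql
-- Inspect the execution plan instead of guessing; look for where the
-- create_time predicate appears relative to the Join Operator.
EXPLAIN
select s.id, o.order_id
from sight s
left join order_sight o on o.sight_id = s.id
where s.id = 9718 and o.create_time = '2015-10-10';
```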
Conclusion: in an outer join, if the filter condition on the secondary table is written in the WHERE clause, Hive joins the full tables first and only filters afterwards. Push such predicates into a subquery on that table so the filtering happens before the join.
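The general pattern, using the example tables from above, looks like this:

```sql
-- Slow: the filter on the right-hand table sits in the outer WHERE,
-- so the full 10.4M-row join runs in the single reducer first and
-- the create_time condition is applied only afterwards.
select s.id, o.order_id
from sight s
left join order_sight o on o.sight_id = s.id
where s.id = 9718 and o.create_time = '2015-10-10';

-- Faster: filter order_sight in a subquery so only the matching day's
-- rows are shuffled to the reducer that performs the join.
select s.id, o.order_id
from sight s
left join (
  select order_id, sight_id
  from order_sight
  where create_time = '2015-10-10'
) o on o.sight_id = s.id
where s.id = 9718;
```

One caveat worth noting: in standard SQL, a WHERE filter on the right-hand table of a left join also discards left-side rows that found no match (their right-side columns are NULL), effectively turning the outer join into an inner join; the subquery form keeps true left-join semantics. In this example every matched row satisfies the filter, so both queries happen to return the same 22 rows.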