ETL: HiveSQL Tuning (Setting the Number of Map and Reduce Tasks)


Preface:

Recently I noticed HiveSQL queries running particularly slowly. Previous posts covered optimizing left joins and unions; this one looks at raising or lowering the number of map and reduce tasks to speed HiveSQL up.

Reference: http://www.cnblogs.com/liqiu/p/4873238.html

Analysis:

select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10'; 

As the previous post showed, this query needs 8 mappers and 1 reducer and runs in 52 seconds. For the detailed record, see: http://www.cnblogs.com/liqiu/p/4873238.html

Increasing the Number of Reducers:

First, how Hive decides the default number of reducers: hive.exec.reducers.bytes.per.reducer is the amount of data each reducer handles (default 1000^3 bytes, about 1G), and hive.exec.reducers.max caps the number of reducers per job (default 999). Both can be inspected from the CLI, as shown below.
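The bare set <name>; form echoes a parameter's current value (the same idiom this post uses later for dfs.block.size); the values echoed here assume the defaults just described:

hive> set hive.exec.reducers.bytes.per.reducer;
hive.exec.reducers.bytes.per.reducer=1000000000
hive> set hive.exec.reducers.max;
hive.exec.reducers.max=999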

In other words, if the total reducer input (the map output) is no more than 1G, only one reduce task is launched.

Since the table b2c_money_trace is about 2.4G, the estimate works out to 3 reducers. For example:

hive> select count(1) from b2c_money_trace where operate_time = '2015-10-10' group by operate_time;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3623421, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3623421/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3623421
Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 3
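The line "Number of reduce tasks not specified. Estimated from input data size: 3" reflects the rule above. As a hedged sketch of the estimate (exact behavior varies by Hive version):

    reducers = min(hive.exec.reducers.max, ceil(input_size / hive.exec.reducers.bytes.per.reducer))
             = min(999, ceil(2.4G / 1G)) = 3

Lowering the per-reducer load is the gentler way to get more reducers without pinning an exact count; the 500M value below is an illustration, not a setting from the original run:

-- roughly 500M per reducer, so about 5 reducers for this 2.4G input
set hive.exec.reducers.bytes.per.reducer=500000000;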

那么繼續說最開始的例子,例如:

set mapred.reduce.tasks = 8; 

The execution result:

hive> set mapred.reduce.tasks = 8;                                                                                                    
hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 8
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Cannot run job locally: Input Size (= 380265495) is larger than hive.exec.mode.local.auto.inputbytes.max (= 50000000)
Starting Job = job_1434099279301_3618454, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3618454/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3618454
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 8
2015-10-14 15:31:55,570 Stage-1 map = 0%,  reduce = 0%
2015-10-14 15:32:01,734 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 4.63 sec
2015-10-14 15:32:02,760 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 10.93 sec
2015-10-14 15:32:03,786 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 10.93 sec
2015-10-14 15:32:04,812 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:05,837 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:06,892 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:07,947 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:08,983 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:10,039 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:11,088 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:12,114 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 21.94 sec
2015-10-14 15:32:13,143 Stage-1 map = 75%,  reduce = 19%, Cumulative CPU 24.28 sec
2015-10-14 15:32:14,170 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 27.94 sec
2015-10-14 15:32:15,197 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 27.94 sec
2015-10-14 15:32:16,224 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 28.58 sec
2015-10-14 15:32:17,250 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 28.95 sec
2015-10-14 15:32:18,277 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 37.02 sec
2015-10-14 15:32:19,305 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 48.93 sec
2015-10-14 15:32:20,332 Stage-1 map = 75%,  reduce = 25%, Cumulative CPU 49.31 sec
2015-10-14 15:32:21,359 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 57.99 sec
2015-10-14 15:32:22,385 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 61.88 sec
2015-10-14 15:32:23,411 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 71.56 sec
2015-10-14 15:32:24,435 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 71.56 sec
MapReduce Total cumulative CPU time: 1 minutes 11 seconds 560 msec
Ended Job = job_1434099279301_3618454
MapReduce Jobs Launched: 
Job 0: Map: 8  Reduce: 8   Cumulative CPU: 71.56 sec   HDFS Read: 380267639 HDFS Write: 330 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 11 seconds 560 msec
OK
9718    210296076
9718    210299105
9718    210295344
9718    210295277
9718    210295586
9718    210295050
9718    210301363
9718    210297733
9718    210298066
9718    210295566
9718    210298219
9718    210296438
9718    210298328
9718    210298008
9718    210299712
9718    210295239
9718    210297567
9718    210295525
9718    210294949
9718    210296318
9718    210294421
9718    210295840
Time taken: 36.978 seconds, Fetched: 22 row(s)

With 8 reducers the query finished in about 37 seconds instead of 52; the reduce phase is clearly faster.
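One caveat: mapred.reduce.tasks pins the reducer count for every subsequent job in the session, so it is worth reverting after the experiment. A minimal sketch; -1 means "not specified" and restores Hive's automatic estimate:

set mapred.reduce.tasks=-1;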

Increasing the Number of Mappers:

Table size:

The mapper count cannot be illustrated with the query above, so consider this table instead:

hive> dfs -ls -h /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace;
Found 4 items
-rw-r--r--   3 ticketdev ticketdev    600.0 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/24f19a74-ca91-4fb2-9b79-1b1235f1c6f8
-rw-r--r--   3 ticketdev ticketdev    597.2 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/34ca13a3-de44-402e-9548-e6b9f92fde67
-rw-r--r--   3 ticketdev ticketdev    590.6 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/ac249f44-60eb-4bf7-9c1a-6f643873b823
-rw-r--r--   3 ticketdev ticketdev    606.5 M 2015-10-14 02:13 /user/ticketdev/hive/warehouse/business_mirror.db/b2c_money_trace/f587fec9-60da-4f18-8b47-406999d95fd1

About 2.4G in total (600.0 + 597.2 + 590.6 + 606.5 M).

Block size:

hive> set dfs.block.size;
dfs.block.size=134217728

Note: 134217728 bytes is 128 * 1024 * 1024, i.e. 128M.

Number of mappers:

The four files are about 600M each and the block size is 128M, so each file splits into ceil(600/128) = 5 blocks, giving 5 * 4 = 20 mappers. A quick arithmetic check follows, and the job output after it confirms the count.
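A sanity check of that arithmetic in the CLI itself (a FROM-less SELECT needs Hive 0.13 or later, which is an assumption about this cluster):

hive> select ceil(600/128) * 4;
OK
20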

hive> select count(1) from b2c_money_trace;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3620170, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3620170/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3620170
Hadoop job information for Stage-1: number of mappers: 20; number of reducers: 1

Note the job information line above: the number of mappers is 20, matching the estimate.

那么設置划分map的文件大小

set mapred.max.split.size=50000000;
set mapred.min.split.size.per.node=50000000;
set mapred.min.split.size.per.rack=50000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

A brief explanation:

50000000 bytes is roughly 50M.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; merges small files into larger splits before execution; it has no real effect here because these input files are already large.

The other three parameters tell Hive to carve the input into roughly 50M splits. The settings can be double-checked before running the query, as shown below.
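As before, the bare set <name>; form echoes each value, so the session overrides can be verified (the echoed values are simply the ones just set):

hive> set mapred.max.split.size;
mapred.max.split.size=50000000
hive> set mapred.min.split.size.per.node;
mapred.min.split.size.per.node=50000000
hive> set mapred.min.split.size.per.rack;
mapred.min.split.size.per.rack=50000000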

The execution result:

hive> select count(1) from b2c_money_trace;       
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3620223, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3620223/
Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1434099279301_3620223
Hadoop job information for Stage-1: number of mappers: 36; number of reducers: 1

With ~50M splits, each ~600M file yields about ceil(600/50) = 12 raw splits, roughly 48 in total; CombineHiveInputFormat then recombines leftovers by node and rack, and the job ended up with 36 mappers (see "number of mappers: 36" above), up from 20.

Conclusion:

More mappers and reducers are not automatically better: every extra task consumes scheduler slots, memory, and startup overhead, and the runtime does not necessarily improve. Tune both counts to a sensible value for the actual data volume and cluster. For reference, all the knobs touched in this post are collected below.
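The values are the ones demonstrated above (or the stated defaults); they are examples to adapt, not recommendations:

-- Reducer side: data volume per reducer (default ~1G), the reducer cap
-- (default 999), and an explicit count (-1 restores the automatic estimate).
set hive.exec.reducers.bytes.per.reducer=1000000000;
set hive.exec.reducers.max=999;
set mapred.reduce.tasks=8;

-- Mapper side: carve the input into ~50M splits and combine small files.
set mapred.max.split.size=50000000;
set mapred.min.split.size.per.node=50000000;
set mapred.min.split.size.per.rack=50000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;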

Reference

http://lxw1234.com/archives/2015/04/15.htm

 

