1. Adjusting the number of reducers (method 1)
-- Amount of data processed by each reducer (default 256MB)
set hive.exec.reducers.bytes.per.reducer=256000000;
-- Maximum number of reducers allowed per job
set hive.exec.reducers.max=1009;
-- Formula used to compute the number of reducers
number of reducers = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer)
Note: this formula only takes effect when mapreduce.job.reduces=-1
Test 1: 1 input file, file size 34.8MB, 20MB of data per reducer
set mapreduce.job.reduces=-1;
set hive.exec.reducers.bytes.per.reducer=20971520;
-- Merge small files when the job finishes
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=100;
set yarn.scheduler.maximum-allocation-mb=118784;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set yarn.nodemanager.vmem-pmem-ratio=4.2;

create table mergeTab4 as
select substr(uploader,0,1), count(1) from gulivideo_user_ori group by substr(uploader,0,1);

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
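Applying the formula from method 1 to this test (assuming the single 34.8MB file is the job's entire input):

number of reducers = min(1009, 34.8MB / 20MB) = min(1009, ceil(1.74)) = 2

which matches the 2 reducers reported above.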
2. Adjusting the number of reducers (method 2)
-- Set the number of reduce tasks per job (can also be configured in mapred-default.xml)
set mapreduce.job.reduces=3;
Test 2:
set mapreduce.job.reduces=3;
set yarn.scheduler.maximum-allocation-mb=118784;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set yarn.nodemanager.vmem-pmem-ratio=4.2;

create table mergeTab7 as
select substr(uploader,0,1), count(1) from gulivideo_user_ori group by substr(uploader,0,1);

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
162 486 /user/hive/warehouse/home.db/mergetab7/000000_0
157 471 /user/hive/warehouse/home.db/mergetab7/000001_0
163 489 /user/hive/warehouse/home.db/mergetab7/000002_0
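Each reducer writes one output file, which is why mergetab7 contains three files; setting mapreduce.job.reduces explicitly overrides the size-based calculation from method 1.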
3. Question: are more reducers always better?
No.
1. Setting too many reduce tasks means their startup and initialization time far exceeds each task's actual processing time, wasting resources and time.
2. Too many reduce tasks also produce too many small output files.
4. How many reduce tasks are appropriate?
Enough that each single reduce task processes a suitable amount of data; see the sketch below for one way to size this.
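A minimal sketch of one way to do this, assuming a job whose total input is roughly 10GB and a target of about 1GB per reducer (the table name sizedTab and the 10GB figure are illustrative assumptions, not taken from the tests above):

-- Let Hive derive the reducer count from data size instead of hard-coding it
set mapreduce.job.reduces=-1;
-- Target roughly 1GB of input per reducer (1073741824 bytes); with ~10GB of input
-- the formula gives min(1009, ceil(10GB / 1GB)) = 10 reducers
set hive.exec.reducers.bytes.per.reducer=1073741824;
-- Leave the per-job cap at its default so it only limits extreme cases
set hive.exec.reducers.max=1009;

-- Hypothetical query over the same source table used in the tests above
create table sizedTab as
select substr(uploader,0,1), count(1)
from gulivideo_user_ori
group by substr(uploader,0,1);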