Hadoop Enterprise Development Scenario Case Study
1 Case Requirements
(1) Requirement: count the number of occurrences of each word in 1 GB of data. The cluster has 3 servers, each with 4 GB of memory and a 4-core, 4-thread CPU.
(2) Requirement analysis:
1 GB / 128 MB = 8 MapTasks; 1 ReduceTask; 1 MrAppMaster
That is 10 tasks across 3 nodes ≈ 3 tasks per node (distributed as 4, 3, 3)
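The sizing above can be reproduced with a quick calculation (a sketch; the 128 MB block size and the 4/3/3 spread are the case study's own assumptions):

```python
import math

DATA_SIZE_MB = 1024        # 1 GB of input data
BLOCK_SIZE_MB = 128        # default HDFS block size; one split per block

# One MapTask per input split
map_tasks = math.ceil(DATA_SIZE_MB / BLOCK_SIZE_MB)

# 8 MapTasks + 1 ReduceTask + 1 MrAppMaster = 10 containers in total
total_tasks = map_tasks + 1 + 1

# Spread over 3 nodes: roughly 3 tasks per node, e.g. a 4/3/3 split
nodes = 3
print(map_tasks, total_tasks, round(total_tasks / nodes, 1))  # -> 8 10 3.3
```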
2 HDFS Parameter Tuning
(1) Edit hadoop-env.sh:
export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS -Xmx1024m"
export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx1024m"
(2) Edit hdfs-site.xml:
<!-- The NameNode has a pool of worker threads; the default size is 10 -->
<property>
<name>dfs.namenode.handler.count</name>
<value>21</value>
</property>
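The value 21 is not arbitrary: a common sizing heuristic (which this case study appears to follow; treat the formula as guidance rather than an official requirement) sets dfs.namenode.handler.count to 20 × ln(cluster size):

```python
import math

def namenode_handler_count(cluster_size):
    """Heuristic: dfs.namenode.handler.count ~= 20 * ln(cluster size)."""
    return int(20 * math.log(cluster_size))

# For this 3-node cluster the heuristic yields the 21 used above
print(namenode_handler_count(3))  # -> 21
```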
(3) Edit core-site.xml:
<!-- Set the trash retention time to 60 minutes -->
<property>
<name>fs.trash.interval</name>
<value>60</value>
</property>
(4) Distribute the configuration to all three servers:
rsync -av <file to distribute> <user>@<hostname>:<destination directory>
3 MapReduce Parameter Tuning
(1) Edit mapred-site.xml:
<!-- Size of the circular sort buffer; default 100 MB -->
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>100</value>
</property>
<!-- Spill threshold of the circular buffer; default 0.8 -->
<property>
<name>mapreduce.map.sort.spill.percent</name>
<value>0.80</value>
</property>
<!-- Number of spill files merged at once; default 10 -->
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>10</value>
</property>
<!-- MapTask memory; default 1 GB. The MapTask heap size (mapreduce.map.java.opts) defaults to match this value -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>-1</value>
<description>
The amount of memory to request from the scheduler for each map task. If this is not specified or is non-positive, it is inferred from mapreduce.map.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
</description>
</property>
<!-- Number of CPU vcores per MapTask; default 1 -->
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>1</value>
</property>
<!-- Maximum number of retries for a failed MapTask; default 4 -->
<property>
<name>mapreduce.map.maxattempts</name>
<value>4</value>
</property>
<!-- Number of parallel copiers each ReduceTask uses to fetch map output; default 5 -->
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>5</value>
</property>
<!-- Fraction of ReduceTask memory used as the shuffle buffer; default 0.7 -->
<property>
<name>mapreduce.reduce.shuffle.input.buffer.percent</name>
<value>0.70</value>
</property>
<!-- Fraction of the buffer that must fill before data is spilled to disk; default 0.66 -->
<property>
<name>mapreduce.reduce.shuffle.merge.percent</name>
<value>0.66</value>
</property>
<!-- ReduceTask memory; default 1 GB. The ReduceTask heap size (mapreduce.reduce.java.opts) defaults to match this value -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>-1</value>
<description>The amount of memory to request from the scheduler for each reduce task. If this is not specified or is non-positive, it is inferred from mapreduce.reduce.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
</description>
</property>
<!-- Number of CPU vcores per ReduceTask; default 1, raised to 2 here -->
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>2</value>
</property>
<!-- Maximum number of retries for a failed ReduceTask; default 4 -->
<property>
<name>mapreduce.reduce.maxattempts</name>
<value>4</value>
</property>
<!-- Fraction of MapTasks that must finish before resources are requested for ReduceTasks; default 0.05 -->
<property>
<name>mapreduce.job.reduce.slowstart.completedmaps</name>
<value>0.05</value>
</property>
<!-- A task that reports no progress within this period (default 10 minutes = 600000 ms) is forcibly terminated -->
<property>
<name>mapreduce.task.timeout</name>
<value>600000</value>
</property>
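The thresholds these settings imply can be worked out directly (a sketch; the 1024 MB task memory is the fallback default mentioned in the descriptions above):

```python
# Map side: spilling starts once the circular buffer passes its threshold
SORT_MB = 100                # mapreduce.task.io.sort.mb
SPILL_PERCENT = 0.80         # mapreduce.map.sort.spill.percent
map_spill_at_mb = SORT_MB * SPILL_PERCENT            # 80 MB

# Reduce side: the shuffle buffer is a fraction of the ReduceTask memory
REDUCE_MEMORY_MB = 1024      # effective default when memory.mb is unset
INPUT_BUFFER_PCT = 0.70      # mapreduce.reduce.shuffle.input.buffer.percent
MERGE_PCT = 0.66             # mapreduce.reduce.shuffle.merge.percent

shuffle_buffer_mb = REDUCE_MEMORY_MB * INPUT_BUFFER_PCT   # ~717 MB
merge_starts_at_mb = shuffle_buffer_mb * MERGE_PCT        # ~473 MB

print(round(map_spill_at_mb), round(shuffle_buffer_mb), round(merge_starts_at_mb))
```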
(2) Distribute the configuration file to the servers:
rsync -av <file to distribute> <user>@<hostname>:<destination directory>
4 YARN Parameter Tuning
(1) Edit yarn-site.xml:
<!-- Scheduler choice; the Capacity Scheduler is the default -->
<property>
<description>The class to use as the resource scheduler.</description>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<!-- Number of threads the ResourceManager uses to handle scheduler requests; default 50. Increase it when more than 50 jobs are submitted concurrently, but stay below 3 nodes * 4 threads = 12 threads (in practice no more than 8, to leave headroom for other processes) -->
<property>
<description>Number of threads to handle scheduler interface.</description>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>8</value>
</property>
<!-- Whether YARN auto-detects hardware for its configuration; default false. Configure manually if the node runs many other applications; on a dedicated node, auto-detection is fine -->
<property>
<description>Enable auto-detection of node capabilities such as memory and CPU.</description>
<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
<value>false</value>
</property>
<!-- Whether to count logical processors (hyperthreads) as cores; default false, i.e. use the physical core count -->
<property>
<description>Flag to determine if logical processors(such as hyperthreads) should be counted as cores. Only applicable on Linux when yarn.nodemanager.resource.cpu-vcores is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true.
</description>
<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
<value>false</value>
</property>
<!-- Multiplier from physical cores to vcores; default 1.0 -->
<property>
<description>Multiplier to determine how to convert physical cores to vcores. This value is used if yarn.nodemanager.resource.cpu-vcores is set to -1(which implies auto-calculate vcores) and yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be calculated as number of CPUs * multiplier.
</description>
<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
<value>1.0</value>
</property>
<!-- NodeManager memory; default 8 GB, changed to 4 GB here -->
<property>
<description>Amount of physical memory, in MB, that can be allocated for containers. If set to -1 and
yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically calculated(in case of Windows and Linux). In other cases, the default is 8192MB.
</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- NodeManager CPU vcores; default is 8 when not auto-detected from hardware, changed to 4 here -->
<property>
<description>Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of CPUs used by YARN containers. If it is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
automatically determined from the hardware in case of Windows and Linux. In other cases, number of vcores is 8 by default.
</description>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<!-- Minimum container memory; default 1 GB -->
<property>
<description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have
less memory than this value will be shut down by the resource manager.
</description>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- Maximum container memory; default 8 GB, changed to 2 GB here -->
<property>
<description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.
</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- Minimum container CPU vcores; default 1 -->
<property>
<description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the
resource manager.
</description>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Maximum container CPU vcores; default 4, changed to 2 here -->
<property>
<description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.
</description>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<!-- Virtual-memory check; enabled by default, disabled here -->
<property>
<description>Whether virtual memory limits will be enforced for containers.</description>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Ratio of virtual memory to physical memory; default 2.1 -->
<property>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.
</description>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
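These limits interact in a few ways worth checking (a sketch of the arithmetic, using only the values set in this section):

```python
NM_MEMORY_MB = 4096        # yarn.nodemanager.resource.memory-mb
MIN_ALLOC_MB = 1024        # yarn.scheduler.minimum-allocation-mb
MAX_ALLOC_MB = 2048        # yarn.scheduler.maximum-allocation-mb
VMEM_PMEM_RATIO = 2.1      # yarn.nodemanager.vmem-pmem-ratio

# Containers of minimum / maximum size that fit on one NodeManager
max_small_containers = NM_MEMORY_MB // MIN_ALLOC_MB   # 4
max_large_containers = NM_MEMORY_MB // MAX_ALLOC_MB   # 2

# Virtual memory a maximum-size container could use before the
# (now disabled) vmem check would have killed it
vmem_limit_mb = MAX_ALLOC_MB * VMEM_PMEM_RATIO        # ~4300 MB

print(max_small_containers, max_large_containers, round(vmem_limit_mb))
```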
(2) Distribute the configuration file to the servers:
rsync -av <file to distribute> <user>@<hostname>:<destination directory>
5 Running the Job
(1) Restart the cluster:
sbin/stop-yarn.sh
sbin/start-yarn.sh
(2) Run the WordCount example:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /wcinput /wcoutput
Note: run the command from the Hadoop installation directory. /wcinput is the directory containing the 1 GB of data to be counted, and /wcoutput is the directory where the results are written.
(3) Watch the job on the YARN web UI
URL: hadoop103:8088
(4) Results
Original contents of /wcinput/work.txt:
Result: the output directory /wcoutput is generated.