Hadoop: Hadoop Cluster Configuration Files


Hadoop Configuration Files

Hadoop's configuration files:

  • Read-only default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml
  • Site-specific configuration files: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml, and etc/hadoop/mapred-site.xml
  • Hadoop environment scripts: etc/hadoop/hadoop-env.sh, etc/hadoop/mapred-env.sh, and etc/hadoop/yarn-env.sh

Administrators can customize site-specific settings by editing the etc/hadoop/hadoop-env.sh, etc/hadoop/mapred-env.sh, and etc/hadoop/yarn-env.sh scripts. Editing these scripts means configuring the environment variables used by the Hadoop daemons, for example JAVA_HOME.

By setting the following environment variables, administrators can configure the individual Hadoop daemons:

Daemon                          Environment Variable
NameNode                        HADOOP_NAMENODE_OPTS
DataNode                        HADOOP_DATANODE_OPTS
Secondary NameNode              HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager                 YARN_RESOURCEMANAGER_OPTS
NodeManager                     YARN_NODEMANAGER_OPTS
WebAppProxy                     YARN_PROXYSERVER_OPTS
MapReduce Job History Server    HADOOP_JOB_HISTORYSERVER_OPTS
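For example, these daemon-specific variables can be exported in etc/hadoop/hadoop-env.sh. A minimal sketch follows; the JDK path, log path, and JVM options are illustrative assumptions, not required values:

```shell
# etc/hadoop/hadoop-env.sh (sketch; paths and option values are examples)

# JDK used by all Hadoop daemons
export JAVA_HOME=/usr/java/default

# Extra JVM options for the NameNode only: parallel GC plus a GC log
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC -Xloggc:/var/log/hadoop/namenode-gc.log"

# Extra JVM options for the DataNode
export HADOOP_DATANODE_OPTS="-Xmx1024m"
```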

Other useful configuration parameters:

  • HADOOP_PID_DIR: the directory where the daemons' process-ID files are stored.
  • HADOOP_LOG_DIR: the directory where the daemons' log files are stored; it is created automatically if it does not exist.
  • HADOOP_HEAPSIZE / YARN_HEAPSIZE: the maximum heap size to use, in MB. The default is 1000, i.e. 1000 MB. These can be used to set the heap size of the Hadoop daemons on an individual node.

In most cases you should configure HADOOP_PID_DIR and HADOOP_LOG_DIR, because the user that runs the Hadoop daemons needs write permission to these directories.
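In hadoop-env.sh these might be set as follows; the directories and heap size here are example values, not defaults:

```shell
# etc/hadoop/hadoop-env.sh (sketch; paths and size are examples)
export HADOOP_PID_DIR=/var/run/hadoop
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_HEAPSIZE=2000   # in MB
```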

3. Hadoop Daemons and Their Configuration

  • HDFS daemons: NameNode, SecondaryNameNode, DataNode
  • YARN daemons: ResourceManager, NodeManager, WebAppProxy
  • MapReduce daemon: MapReduce Job History Server

The important parameters in each configuration file are described below.

1. etc/hadoop/core-site.xml

  • fs.defaultFS: the NameNode URI, e.g. hdfs://host:port/.
  • io.file.buffer.size: 131072 — buffer size used when reading and writing files, in bytes.
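A minimal core-site.xml using these two parameters might look like the following; the host name server1 and port 9000 are example values:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://server1:9000/</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
```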

2. etc/hadoop/hdfs-site.xml

NameNode configuration parameters:

  • dfs.namenode.name.dir: the directories on the local filesystem where the NameNode stores the namespace and transaction logs. If multiple comma-separated directories are given, a redundant copy is stored in each.
  • dfs.hosts: whitelist of permitted DataNodes. If not set, all DataNodes may join.
  • dfs.hosts.exclude: blacklist of DataNodes that are not allowed to join. If not set, no DataNodes are excluded.
  • dfs.blocksize: 268435456 — the HDFS block size in bytes (the default is 128 MB in Hadoop 2.x; the 256 MB shown here suits very large files).
  • dfs.namenode.handler.count: 100 — number of NameNode server threads that handle RPCs from the DataNodes.


DataNode configuration parameters:

  • dfs.datanode.data.dir: comma-separated list of directories on the local filesystem where the DataNode stores its blocks. When multiple directories are given, data is stored across all of them, typically on different devices.
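Putting the NameNode and DataNode directory settings together, a sketch of hdfs-site.xml; the directory paths are example values:

```xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data1/hdfs/name,/data2/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data1/hdfs/data,/data2/hdfs/data</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
</configuration>
```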


3. etc/hadoop/yarn-site.xml

Configuration common to the ResourceManager and NodeManager:

  • yarn.acl.enable: true / false — whether ACLs are enabled; default false.
  • yarn.admin.acl: Admin ACL — ACL that determines who the administrators of the cluster are (see the documentation on Linux ACLs for details). Defaults to *, which means anyone; a value of just a space means no one has access.
  • yarn.log-aggregation-enable: false — whether log aggregation is enabled; default false.
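For example, to turn on ACLs and log aggregation in yarn-site.xml; the admin user list is an example value:

```xml
<configuration>
  <property>
    <name>yarn.acl.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.admin.acl</name>
    <value>hdfs,yarn</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
```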


ResourceManager configuration parameters:

  • yarn.resourcemanager.address: host:port that clients use to connect and submit jobs. If set, this overrides the host set in yarn.resourcemanager.hostname.
  • yarn.resourcemanager.scheduler.address: host:port that ApplicationMasters use to talk to the Scheduler and obtain resources. If set, overrides yarn.resourcemanager.hostname.
  • yarn.resourcemanager.resource-tracker.address: host:port that NodeManagers use to connect to the ResourceManager. If set, overrides yarn.resourcemanager.hostname.
  • yarn.resourcemanager.admin.address: host:port that administrative commands use to connect to the ResourceManager. If set, overrides yarn.resourcemanager.hostname.
  • yarn.resourcemanager.webapp.address: host:port of the ResourceManager web UI. If set, overrides yarn.resourcemanager.hostname.
  • yarn.resourcemanager.hostname: host name of the ResourceManager.
  • yarn.resourcemanager.scheduler.class: Java class used as the ResourceManager scheduler. The default is org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler; the options are CapacityScheduler (recommended), FairScheduler (recommended), or FifoScheduler.
  • yarn.scheduler.minimum-allocation-mb: minimum amount of memory, in MB, allocated for each resource request.
  • yarn.scheduler.maximum-allocation-mb: maximum amount of memory, in MB, allocated for each resource request.
  • yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path: paths to the files listing the permitted and excluded NodeManagers.
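For instance, to pin the scheduler and the per-request memory bounds in yarn-site.xml; the 1024/8192 MB bounds are example values:

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>
```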


NodeManager configuration parameters:

  • yarn.nodemanager.resource.memory-mb: physical memory, in MB, that the NodeManager may use for containers. Related to yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb.
  • yarn.nodemanager.vmem-pmem-ratio: maximum ratio by which the virtual memory usage of tasks may exceed their physical memory limit; the total virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
  • yarn.nodemanager.local-dirs: comma-separated list of local directories where intermediate data is written; multiple directories spread disk I/O.
  • yarn.nodemanager.log-dirs: comma-separated list of local directories where logs are written; multiple directories spread disk I/O.
  • yarn.nodemanager.log.retain-seconds: 10800 — default time, in seconds, to retain log files on the NodeManager. Only applicable if log aggregation is disabled.
  • yarn.nodemanager.remote-app-log-dir: /logs — HDFS directory to which application logs are moved on application completion; appropriate permissions must be set. Only applicable if log aggregation is enabled.
  • yarn.nodemanager.remote-app-log-dir-suffix: logs — suffix appended to the remote log directory; logs are aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log aggregation is enabled.
  • yarn.nodemanager.aux-services: mapreduce_shuffle — the shuffle service that must be set for MapReduce applications.
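Combining a few of the ResourceManager and NodeManager parameters above, a sketch of yarn-site.xml; the host name, directories, and memory size are example values:

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>server1</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data1/yarn/local,/data2/yarn/local</value>
  </property>
</configuration>
```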


History Server configuration parameters:

  • yarn.log-aggregation.retain-seconds: -1 — how long to keep aggregated logs before deleting them; -1 disables deletion. Be careful: setting this too small will spam the NameNode.
  • yarn.log-aggregation.retain-check-interval-seconds: -1 — time between checks for aggregated-log retention; 0 or a negative value means one-tenth of the log-retention time. Be careful: setting this too small will spam the NameNode.


4. etc/hadoop/mapred-site.xml

Configuration for MapReduce applications:

  • mapreduce.framework.name: yarn — sets the execution framework to Hadoop YARN.
  • mapreduce.map.memory.mb: 1536 — larger resource limit for maps.
  • mapreduce.map.java.opts: -Xmx1024M — larger heap size for the child JVMs of maps.
  • mapreduce.reduce.memory.mb: 3072 — larger resource limit for reduces.
  • mapreduce.reduce.java.opts: -Xmx2560M — larger heap size for the child JVMs of reduces.
  • mapreduce.task.io.sort.mb: 512 — higher memory limit while sorting data, for efficiency.
  • mapreduce.task.io.sort.factor: 100 — more streams merged at once while sorting files.
  • mapreduce.reduce.shuffle.parallelcopies: 50 — higher number of parallel copies run by reduces to fetch outputs from a very large number of maps.
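These parameters go in etc/hadoop/mapred-site.xml; a sketch using the values from the table above:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3072</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2560M</value>
  </property>
</configuration>
```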


MapReduce JobHistory Server configuration:

  • mapreduce.jobhistory.address: MapReduce JobHistory Server host:port; the default port is 10020.
  • mapreduce.jobhistory.webapp.address: MapReduce JobHistory Server web UI host:port; the default port is 19888.
  • mapreduce.jobhistory.intermediate-done-dir: /mr-history/tmp — directory where history files are written by MapReduce jobs.
  • mapreduce.jobhistory.done-dir: /mr-history/done — directory where history files are managed by the MR JobHistory Server.
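The JobHistory Server addresses are also set in mapred-site.xml; for example, with server1 as an assumed host name:

```xml
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>server1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>server1:19888</value>
  </property>
</configuration>
```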


Health Monitoring of NodeManagers

Hadoop provides a mechanism by which administrators can configure the NodeManager to run a supplied script periodically to determine whether a node is healthy. If the script detects that the node is in an unhealthy state, it must print a line starting with ERROR to standard output. The NodeManager runs the script periodically and checks its output; if the output contains an ERROR line, the node is reported as unhealthy and is blacklisted by the ResourceManager, so no further tasks are assigned to it. The NodeManager keeps running the script, and once the node becomes healthy again it is automatically removed from the blacklist and tasks are assigned to it again.

The health-monitoring script is configured with the following parameters in etc/hadoop/yarn-site.xml:

  • yarn.nodemanager.health-checker.script.path: path of the script that checks the node's health status.
  • yarn.nodemanager.health-checker.script.opts: options passed to the health-check script.
  • yarn.nodemanager.health-checker.script.interval-ms: time interval at which the health script is run.
  • yarn.nodemanager.health-checker.script.timeout-ms: timeout for the execution of the health script.
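A minimal health script might look like the following sketch. The disk-usage check and the 90% threshold are illustrative assumptions, not something Hadoop prescribes; the only contract is that an unhealthy node must produce a line starting with ERROR:

```shell
#!/usr/bin/env bash
# Sketch of a node health script: prints a line starting with ERROR when
# root-filesystem usage crosses a threshold, otherwise reports OK.
check_node_health() {
  local threshold=90
  local usage
  # Percentage of the root filesystem used, without the trailing '%'
  usage=$(df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  if [ "${usage:-0}" -ge "$threshold" ]; then
    # Any output line beginning with ERROR marks this node unhealthy
    echo "ERROR: disk usage ${usage}% >= ${threshold}%"
  else
    echo "OK: disk usage ${usage}%"
  fi
}
check_node_health
```

Point yarn.nodemanager.health-checker.script.path at a file like this and make it executable by the user that runs the NodeManager.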


The NodeManager also periodically checks the health of its local disks itself, specifically the directories configured in nodemanager-local-dirs and nodemanager-log-dirs. Only when the number of bad directories reaches the threshold set via yarn.nodemanager.disk-health-checker.min-healthy-disks is the node marked unhealthy.

Slaves File

List the hostnames or IP addresses of all DataNode machines in the etc/hadoop/slaves file, one per line. The helper scripts (described below) use etc/hadoop/slaves to run commands on many hosts at once. It is not used for any of the Java-based Hadoop configuration. To use this functionality, SSH trust must be established between the Hadoop nodes.
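For the two-machine setup used in the SSH example below, etc/hadoop/slaves would simply contain the two hosts, assuming both run DataNodes (host names are example values; IP addresses work equally well):

```
server1
server2
```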

Configuring SSH

Passphraseless SSH is used mainly by the start-dfs.sh and start-yarn.sh scripts to start the HDFS and YARN daemons in bulk.

The following sets up passphraseless SSH for the hdfs user, as used by start-dfs.sh.

1. Check whether SSH to the local host requires a password:

[hdfs@server1 ~]$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is dd:f4:7e:77:28:03:a1:3d:d2:f1:1d:d0:fe:70:a3:dc.
Are you sure you want to continue connecting (yes/no)?

2. Set up passphraseless SSH for the hdfs user:

server1:192.168.100.51

server2:192.168.100.52

# Passphraseless SSH to the local machine
[hdfs@server1 ~]$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
[hdfs@server1 .ssh]$ cat id_rsa.pub >> authorized_keys
[hdfs@server1 .ssh]$ chmod 0600 authorized_keys

# Allow server1 to ssh to 192.168.100.52 without a password:
# copy server1's public key to server2
[hdfs@server1 ~]$ scp .ssh/id_rsa.pub hdfs@192.168.100.52:~

# Log in to 192.168.100.52 and append the key
[hdfs@server2 ~]$ ls
id_rsa.pub
[hdfs@server2 ~]$ cat id_rsa.pub >> .ssh/authorized_keys

# Verify: ssh from server1 to 192.168.100.52
[hdfs@server1 ~]$ ssh 192.168.100.52
Last login: Tue Sep  6 14:36:15 2016 from localhost
[hdfs@server2 ~]$ logout

3. Repeat the same steps in the opposite direction so that each server can ssh to the other without a password.
