Hadoop Configuration Files
Hadoop's configuration files fall into three groups:
- Read-only default configuration: core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml
- Site-specific configuration: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml, and etc/hadoop/mapred-site.xml
- Hadoop environment scripts: etc/hadoop/hadoop-env.sh, etc/hadoop/mapred-env.sh, and etc/hadoop/yarn-env.sh
Administrators customize site-specific settings by editing the etc/hadoop/hadoop-env.sh, etc/hadoop/mapred-env.sh, and etc/hadoop/yarn-env.sh scripts. These scripts set the environment variables used by the Hadoop daemons, such as JAVA_HOME.
Administrators can also configure individual daemons through the following environment variables:
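A minimal sketch of such additions to etc/hadoop/hadoop-env.sh; every path below is a placeholder, so substitute the JDK location and directories used on your site:

```shell
# Sketch of site-specific additions to etc/hadoop/hadoop-env.sh.
# All paths are illustrative placeholders, not defaults.
export JAVA_HOME=/usr/java/latest           # JDK used by all Hadoop daemons
export HADOOP_PID_DIR=/var/run/hadoop       # where daemon pid files are kept
export HADOOP_LOG_DIR=/var/log/hadoop       # where daemon logs are written
```

Because these scripts are sourced by the daemon start-up scripts, plain `export` lines are all that is needed.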
Daemon | Environment Variable |
---|---|
NameNode | HADOOP_NAMENODE_OPTS |
DataNode | HADOOP_DATANODE_OPTS |
Secondary NameNode | HADOOP_SECONDARYNAMENODE_OPTS |
ResourceManager | YARN_RESOURCEMANAGER_OPTS |
NodeManager | YARN_NODEMANAGER_OPTS |
WebAppProxy | YARN_PROXYSERVER_OPTS |
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_OPTS |
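For example, extra JVM options can be passed to a single daemon through its variable. The values below are purely illustrative, not recommendations:

```shell
# Illustrative: give only the NameNode JVM a larger heap and a heap dump
# on OutOfMemoryError. Appending the existing value preserves any options
# set elsewhere.
export HADOOP_NAMENODE_OPTS="-Xmx4g -XX:+HeapDumpOnOutOfMemoryError ${HADOOP_NAMENODE_OPTS}"
```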
Other useful configuration parameters:
- HADOOP_PID_DIR: the directory where the daemons' process-id files are stored.
- HADOOP_LOG_DIR: the directory where the daemons' log files are stored; it is created automatically if it does not exist.
- HADOOP_HEAPSIZE / YARN_HEAPSIZE: the maximum heap size, in MB. The default is 1000 (i.e., 1000 MB). Use it to set the heap size of the Hadoop daemons on an individual node.
In most cases you should set HADOOP_PID_DIR and HADOOP_LOG_DIR explicitly, because the users running the Hadoop daemons need write access to these directories.
3. Hadoop Daemons and Their Configuration
- HDFS daemons: NameNode, SecondaryNameNode, DataNode
- YARN daemons: ResourceManager, NodeManager, WebAppProxy
- MapReduce daemon: MapReduce Job History Server
The important parameters in each configuration file are described below.
1. etc/hadoop/core-site.xml
Parameter | Value | Notes |
---|---|---|
fs.defaultFS | NameNode URI | hdfs://host:port/ |
io.file.buffer.size | 131072 | Read/write buffer size, in bytes |
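Putting the two parameters above into a file gives a minimal core-site.xml sketch; the host and port are examples only:

```xml
<!-- etc/hadoop/core-site.xml (sketch; server1:9000 is an example URI) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://server1:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
```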
2. etc/hadoop/hdfs-site.xml
NameNode configuration:
Parameter | Value | Notes |
---|---|---|
dfs.namenode.name.dir | Path(s) on the local filesystem where the NameNode stores the namespace and transaction logs | If this is a comma-separated list of directories, the name table is replicated in all of them, for redundancy |
dfs.hosts | Whitelist of permitted DataNodes | If empty, all DataNodes are permitted |
dfs.hosts.exclude | Blacklist of excluded DataNodes | If empty, no DataNodes are excluded |
dfs.blocksize | 268435456 | HDFS block size, in bytes; 268435456 is 256 MB, a reasonable choice for very large files (the default is 128 MB in Hadoop 2.x) |
dfs.namenode.handler.count | 100 | Number of NameNode server threads handling RPCs from a large number of DataNodes |
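The NameNode rows above can be sketched as an hdfs-site.xml fragment; the directory paths are placeholders:

```xml
<!-- etc/hadoop/hdfs-site.xml, NameNode section (sketch; paths are examples) -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- two directories: the namespace is replicated in both -->
    <value>/data1/hdfs/name,/data2/hdfs/name</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
</configuration>
```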
DataNode configuration:
Parameter | Value | Notes |
---|---|---|
dfs.datanode.data.dir | Comma-separated list of paths on the local filesystem where the DataNode stores its blocks | With multiple directories, blocks are spread across all of them, typically on different devices; unlike dfs.namenode.name.dir, the directories are not mirrors of each other |
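The corresponding DataNode property, again with placeholder paths pointing at separate disks:

```xml
<!-- etc/hadoop/hdfs-site.xml, DataNode section (sketch; paths are examples) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data1/hdfs/data,/data2/hdfs/data</value>
</property>
```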
3. etc/hadoop/yarn-site.xml
Settings common to the ResourceManager and NodeManager:
Parameter | Value | Notes |
---|---|---|
yarn.acl.enable | true / false | Whether to enable ACLs; the default is false |
yarn.admin.acl | Admin ACL | ACL naming the administrators on the cluster, as a list of users and groups. The default of * means everyone is an admin; a blank value means no one is |
yarn.log-aggregation-enable | false | Whether to enable log aggregation; the default is false |
ResourceManager configuration:
Parameter | Value | Notes |
---|---|---|
yarn.resourcemanager.address | host:port for clients to connect to and submit jobs | If set, this overrides the hostname set in yarn.resourcemanager.hostname |
yarn.resourcemanager.scheduler.address | host:port for ApplicationMasters to talk to the scheduler and obtain resources | If set, this overrides the hostname set in yarn.resourcemanager.hostname |
yarn.resourcemanager.resource-tracker.address | host:port for NodeManagers to connect to the ResourceManager | If set, this overrides the hostname set in yarn.resourcemanager.hostname |
yarn.resourcemanager.admin.address | host:port for administrative commands | If set, this overrides the hostname set in yarn.resourcemanager.hostname |
yarn.resourcemanager.webapp.address | host:port of the ResourceManager web UI | If set, this overrides the hostname set in yarn.resourcemanager.hostname |
yarn.resourcemanager.hostname | ResourceManager hostname | Setting this single hostname gives all the addresses above their default ports |
yarn.resourcemanager.scheduler.class | ResourceManager scheduler class; the default is org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler |
yarn.scheduler.minimum-allocation-mb | Minimum memory allocated to each container request | In MB |
yarn.scheduler.maximum-allocation-mb | Maximum memory allocated to each container request | In MB |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | Whitelist / blacklist of NodeManagers | Analogous to dfs.hosts / dfs.hosts.exclude in etc/hadoop/hdfs-site.xml |
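A yarn-site.xml sketch covering the main ResourceManager parameters above; the hostname and memory values are examples:

```xml
<!-- etc/hadoop/yarn-site.xml, ResourceManager section (sketch) -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>server1</value> <!-- example host; all RM addresses derive from it -->
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>
```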
NodeManager configuration:
Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.resource.memory-mb | Physical memory, in MB, available for containers on the NodeManager | Together with yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb, this determines how many containers each node can run |
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which the virtual memory usage of tasks may exceed physical memory | The virtual memory usage of each task may exceed its physical memory limit by this ratio, and the total virtual memory used by tasks on the NodeManager may exceed its physical memory usage by the same ratio |
yarn.nodemanager.local-dirs | Comma-separated list of local directories for intermediate data | Multiple directories spread disk I/O |
yarn.nodemanager.log-dirs | Comma-separated list of local directories for logs | Multiple directories spread disk I/O |
yarn.nodemanager.log.retain-seconds | 10800 | Default time, in seconds, to retain log files on the NodeManager. Only applies when log aggregation is disabled |
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory to which application logs are moved on application completion; appropriate permissions must be set. Only applies when log aggregation is enabled |
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir; logs are aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applies when log aggregation is enabled |
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for Map Reduce applications. |
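The NodeManager rows above as a yarn-site.xml fragment; the memory size and paths are placeholders:

```xml
<!-- etc/hadoop/yarn-site.xml, NodeManager section (sketch; values are examples) -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data1/yarn/local,/data2/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/data1/yarn/log,/data2/yarn/log</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value> <!-- required for MapReduce applications -->
  </property>
</configuration>
```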
History Server configuration (relevant when log aggregation is enabled):
Parameter | Value | Notes |
---|---|---|
yarn.log-aggregation.retain-seconds | -1 | How long to keep aggregated logs before deleting them; -1 disables deletion. Be careful: setting this too small will spam the NameNode |
yarn.log-aggregation.retain-check-interval-seconds | -1 | Time between checks for aggregated-log retention; if 0 or negative, it is computed as one-tenth of the aggregated-log retention time. Be careful: setting this too small will spam the NameNode |
4. etc/hadoop/mapred-site.xml
Configuration for MapReduce applications:
Parameter | Value | Notes |
---|---|---|
mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN. |
mapreduce.map.memory.mb | 1536 | Larger resource limit for maps. |
mapreduce.map.java.opts | -Xmx1024M | Larger heap-size for child jvms of maps. |
mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces. |
mapreduce.reduce.java.opts | -Xmx2560M | Larger heap-size for child jvms of reduces. |
mapreduce.task.io.sort.mb | 512 | Higher memory-limit while sorting data for efficiency. |
mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files. |
mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from very large number of maps. |
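A mapred-site.xml sketch of the rows above; all the sizes are the illustrative values from the table, not universal recommendations:

```xml
<!-- etc/hadoop/mapred-site.xml, application section (sketch) -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value> <!-- run MapReduce on YARN -->
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value> <!-- JVM heap smaller than the container limit -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3072</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2560M</value>
  </property>
</configuration>
```

Note that each java.opts heap is deliberately smaller than the matching memory.mb container limit, leaving headroom for non-heap JVM memory.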
Configuration for the MapReduce JobHistory Server:
Parameter | Value | Notes |
---|---|---|
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020. |
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888. |
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server. |
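The JobHistory Server rows as a fragment; the hostname is an example, the ports are the defaults from the table:

```xml
<!-- etc/hadoop/mapred-site.xml, JobHistory Server section (sketch) -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>server1:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>server1:19888</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/mr-history/tmp</value>
</property>
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/mr-history/done</value>
</property>
```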
Health Monitoring of NodeManagers
Hadoop provides a mechanism by which an administrator can configure each NodeManager to run a supplied script periodically to determine whether its node is healthy. If the script detects a problem, it must print a line beginning with the string ERROR to standard output. The NodeManager checks the script's output after each run; if it finds an ERROR line, it reports the node as unhealthy, the ResourceManager blacklists the node, and no further tasks are assigned to it. The NodeManager keeps running the script, so once the node recovers it is removed from the blacklist automatically and resumes receiving tasks.
The health-monitoring script is configured with the following parameters in etc/hadoop/yarn-site.xml:
Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.health-checker.script.path | Node health script | Script to check for node’s health status. |
yarn.nodemanager.health-checker.script.opts | Node health script options | Options for script to check for node’s health status. |
yarn.nodemanager.health-checker.script.interval-ms | Node health script interval | Time interval for running health script. |
yarn.nodemanager.health-checker.script.timeout-ms | Node health script timeout interval | Timeout for health script execution. |
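A minimal health script might look like the sketch below. The disk-usage check and the 90% threshold are illustrative; the only contract is that an unhealthy node must produce a line starting with ERROR on standard output:

```shell
#!/bin/bash
# Sketch of a node health script (check and threshold are illustrative).
# The NodeManager marks the node unhealthy if any output line starts
# with the string ERROR.
THRESHOLD=90
# Percentage of the root filesystem in use, e.g. "42"
usage=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$usage" -gt "$THRESHOLD" ]; then
  echo "ERROR: disk usage on / is ${usage}%"
else
  echo "OK: disk usage on / is ${usage}%"
fi
```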
The NodeManager also checks the health of the local disks on its own, specifically the directories named in yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. A node is marked unhealthy once the number of bad directories pushes the fraction of healthy disks below the threshold set by yarn.nodemanager.disk-health-checker.min-healthy-disks.
Slaves File
List the hostname or IP address of every DataNode in the etc/hadoop/slaves file, one per line. The helper scripts (described below) use this file to run commands on many hosts at once; it is not used for any of the Java-based Hadoop configuration. To use this functionality, passwordless ssh trust must be established between the Hadoop nodes for the account used to run Hadoop.
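For the two-node setup used in the ssh example below, the etc/hadoop/slaves file would simply contain (hostnames are examples):

```
server1
server2
```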
Configuring SSH
Passwordless ssh is used mainly by the start-dfs.sh and start-yarn.sh scripts to start the HDFS and YARN daemons across the cluster.
The following sets up passwordless ssh for the hdfs user, as used by start-dfs.sh.
1. Check whether ssh to the local machine requires a password
[hdfs@server1 ~]$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is dd:f4:7e:77:28:03:a1:3d:d2:f1:1d:d0:fe:70:a3:dc.
Are you sure you want to continue connecting (yes/no)?
2. Set up passwordless ssh for the hdfs user
server1: 192.168.100.51
server2: 192.168.100.52
# enable passwordless ssh to the local machine
[hdfs@server1 ~]$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
[hdfs@server1 .ssh]$ cat id_rsa.pub >> authorized_keys
[hdfs@server1 .ssh]$ chmod 0600 authorized_keys
# enable passwordless ssh from this machine (server1) to 192.168.100.52:
# copy server1's public key to server2
[hdfs@server1 ~]$ scp .ssh/id_rsa.pub hdfs@192.168.100.52:~
# log in to 192.168.100.52 and authorize the key
[hdfs@server2 ~]$ ls
id_rsa.pub
[hdfs@server2 ~]$ cat id_rsa.pub >> .ssh/authorized_keys
# verify: ssh from server1 to 192.168.100.52 no longer asks for a password
[hdfs@server1 ~]$ ssh 192.168.100.52
Last login: Tue Sep 6 14:36:15 2016 from localhost
[hdfs@server2 ~]$ logout
3. Repeat the same steps in the reverse direction, so that server1 and server2 can each ssh to the other without a password.