ResourceManager High Availability

- Introduction
- Architecture
  - RM Failover
  - Recovering previous active-RM’s state
- Deployment
  - Configurations
  - Admin commands
  - ResourceManager Web UI services
  - Web Services

Introduction

This guide provides an overview of High Availability of YARN’s ResourceManager and details how to configure and use the feature. The ResourceManager (RM) is responsible for tracking the resources in a cluster and scheduling applications (e.g., MapReduce jobs). Prior to Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster. The High Availability feature adds redundancy in the form of an Active/Standby ResourceManager pair to remove this single point of failure.

Architecture

RM Failover

ResourceManager HA is realized through an Active/Standby architecture: at any point in time, one of the RMs is Active, and one or more RMs are in Standby mode, waiting to take over should anything happen to the Active. The trigger to transition to Active comes either from the admin (through the CLI) or from the integrated failover controller when automatic failover is enabled.

Manual transitions and failover

When automatic failover is not enabled, admins have to manually transition one of the RMs to Active. To fail over from one RM to the other, they are expected to first transition the Active RM to Standby and then transition a Standby RM to Active. All of this can be done using the “yarn rmadmin” CLI.

Automatic failover

The RMs have an option to embed the ZooKeeper-based ActiveStandbyElector to decide which RM should be the Active. When the Active goes down or becomes unresponsive, another RM is automatically elected to be the Active, which then takes over. Note that there is no need to run a separate ZKFC daemon, as is the case for HDFS, because the ActiveStandbyElector embedded in the RMs acts as both failure detector and leader elector.
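As a rough sketch, automatic failover with the embedded elector could be switched on in yarn-site.xml using the properties listed under Configurations. The ZooKeeper host names are placeholders, and both failover flags are already on by default once HA is enabled, so they are spelled out here only for clarity.

```xml
<!-- Sketch only: zk1..zk3 are placeholder host names. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<!-- Enabled by default when HA is enabled; shown explicitly. -->
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<!-- Use the elector embedded in the RM instead of a separate ZKFC daemon. -->
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
  <value>true</value>
</property>
<!-- ZooKeeper quorum used for leader election. -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```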

Client, ApplicationMaster and NodeManager on RM failover

When there are multiple RMs, the configuration (yarn-site.xml) used by clients and nodes is expected to list all of them. Clients, ApplicationMasters (AMs) and NodeManagers (NMs) try connecting to the RMs in a round-robin fashion until they hit the Active RM. If the Active goes down, they resume the round-robin polling until they hit the “new” Active. This default retry logic is implemented as org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider. You can override the logic by implementing org.apache.hadoop.yarn.client.RMFailoverProxyProvider and setting the value of yarn.client.failover-proxy-provider to the class name.
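The override described above might look like the following in the yarn-site.xml used by clients and nodes. This is only a sketch: the first property simply names the default provider explicitly, and com.example.yarn.MyRMFailoverProxyProvider is a hypothetical class name standing in for your own RMFailoverProxyProvider implementation.

```xml
<!-- Default behaviour, stated explicitly: round-robin over the configured RMs. -->
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>

<!--
  To plug in custom logic, replace the value above with your own class.
  The class name below is hypothetical and must be on the classpath:

  <property>
    <name>yarn.client.failover-proxy-provider</name>
    <value>com.example.yarn.MyRMFailoverProxyProvider</value>
  </property>
-->
```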

Recovering previous active-RM’s state

With ResourceManager Restart enabled, the RM being promoted to the active state loads the RM internal state and continues to operate from where the previous active left off, as far as the RM restart feature allows. A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing any work. The state-store must be visible from both the Active and the Standby RMs.

Currently, there are two RMStateStore implementations for persistence: FileSystemRMStateStore and ZKRMStateStore. The ZKRMStateStore implicitly allows write access to a single RM at any point in time, and hence is the recommended store to use in an HA cluster. When using the ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation in which multiple RMs assume the Active role. It is also advisable NOT to set the “zookeeper.DigestAuthenticationProvider.superDigest” property on the ZooKeeper cluster, so that the ZooKeeper admin does not have access to YARN application/user credential information.
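A minimal sketch of selecting the ZKRMStateStore follows. The two property names yarn.resourcemanager.recovery.enabled and yarn.resourcemanager.store.class are not listed in this guide; they are taken from the ResourceManager Restart document and should be checked against yarn-default.xml for your Hadoop version. The ZooKeeper host names are placeholders.

```xml
<!-- Assumption: property names come from the ResourceManager Restart document. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- ZooKeeper-backed store, recommended for HA clusters. -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<!-- Placeholder quorum; shared by the state-store and the embedded elector. -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```

Because the same quorum address serves both the state-store and leader election, both RMs should carry identical values for these properties.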

Deployment

Configurations

Most of the failover functionality is tunable using various configuration properties. The following is a list of the required and important ones. yarn-default.xml carries the full list of knobs, including their default values. See also the ResourceManager Restart document for instructions on setting up the state-store.

Configuration Properties	Description
`yarn.resourcemanager.zk-address`	Address of the ZK-quorum. Used both for the state-store and embedded leader-election.
`yarn.resourcemanager.ha.enabled`	Enable RM HA.
`yarn.resourcemanager.ha.rm-ids`	List of logical IDs for the RMs, e.g., “rm1,rm2”.
`yarn.resourcemanager.hostname.`rm-id	For each rm-id, specify the hostname the RM corresponds to. Alternatively, one could set each of the RM’s service addresses.
`yarn.resourcemanager.address.`rm-id	For each rm-id, specify host:port for clients to submit jobs. If set, overrides the hostname set in `yarn.resourcemanager.hostname.`rm-id.
`yarn.resourcemanager.scheduler.address.`rm-id	For each rm-id, specify scheduler host:port for ApplicationMasters to obtain resources. If set, overrides the hostname set in `yarn.resourcemanager.hostname.`rm-id.
`yarn.resourcemanager.resource-tracker.address.`rm-id	For each rm-id, specify host:port for NodeManagers to connect. If set, overrides the hostname set in `yarn.resourcemanager.hostname.`rm-id.
`yarn.resourcemanager.admin.address.`rm-id	For each rm-id, specify host:port for administrative commands. If set, overrides the hostname set in `yarn.resourcemanager.hostname.`rm-id.
`yarn.resourcemanager.webapp.address.`rm-id	For each rm-id, specify the host:port the RM web application corresponds to. You do not need this if you set `yarn.http.policy` to `HTTPS_ONLY`. If set, overrides the hostname set in `yarn.resourcemanager.hostname.`rm-id.
`yarn.resourcemanager.webapp.https.address.`rm-id	For each rm-id, specify the host:port the RM https web application corresponds to. You do not need this if you set `yarn.http.policy` to `HTTP_ONLY`. If set, overrides the hostname set in `yarn.resourcemanager.hostname.`rm-id.
`yarn.resourcemanager.ha.id`	Identifies the RM in the ensemble. This is optional; however, if set, admins have to ensure that all the RMs have their own IDs in the config.
`yarn.resourcemanager.ha.automatic-failover.enabled`	Enable automatic failover. By default, it is enabled only when HA is enabled.
`yarn.resourcemanager.ha.automatic-failover.embedded`	Use the embedded leader-elector to pick the Active RM when automatic failover is enabled. By default, it is enabled only when HA is enabled.
`yarn.resourcemanager.cluster-id`	Identifies the cluster. Used by the elector to ensure an RM doesn’t take over as Active for another cluster.
`yarn.client.failover-proxy-provider`	The class to be used by Clients, AMs and NMs to fail over to the Active RM.
`yarn.client.failover-max-attempts`	The maximum number of times FailoverProxyProvider should attempt failover.
`yarn.client.failover-sleep-base-ms`	The sleep base (in milliseconds) to be used for calculating the exponential delay between failovers.
`yarn.client.failover-sleep-max-ms`	The maximum sleep time (in milliseconds) between failovers.
`yarn.client.failover-retries`	The number of retries per attempt to connect to a ResourceManager.
`yarn.client.failover-retries-on-socket-timeouts`	The number of retries per attempt to connect to a ResourceManager on socket timeouts.

Sample configurations

Here is a sample minimal setup for RM failover.

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>master1:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>master2:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
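As the Configurations table notes, each RM’s service addresses can be set individually instead of relying on the hostname property. The fragment below sketches this for rm1 only; the host name master1 follows the sample above, and the port numbers are assumed to be the usual defaults, so verify them against yarn-default.xml before use. An equivalent set of properties would be needed for rm2.

```xml
<!-- Sketch for rm1; ports are assumed defaults, check yarn-default.xml. -->
<!-- Clients submit jobs here. -->
<property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>master1:8032</value>
</property>
<!-- ApplicationMasters request resources here. -->
<property>
  <name>yarn.resourcemanager.scheduler.address.rm1</name>
  <value>master1:8030</value>
</property>
<!-- NodeManagers connect here. -->
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
  <value>master1:8031</value>
</property>
<!-- Administrative commands (yarn rmadmin) connect here. -->
<property>
  <name>yarn.resourcemanager.admin.address.rm1</name>
  <value>master1:8033</value>
</property>
```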

Admin commands

yarn rmadmin has a few HA-specific command options to check the health/state of an RM and to transition it to Active/Standby. The HA commands take the service ID of an RM, as set by yarn.resourcemanager.ha.rm-ids, as their argument.

 $ yarn rmadmin -getServiceState rm1
 active
 
 $ yarn rmadmin -getServiceState rm2
 standby

If automatic failover is enabled, you cannot use the manual transition commands. You can override this with the --forcemanual flag, but do so with caution.

 $ yarn rmadmin -transitionToStandby rm1
 Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@1d8299fd
 Refusing to manually manage HA state, since it may cause
 a split-brain scenario or other incorrect state.
 If you are very sure you know what you are doing, please
 specify the forcemanual flag.

See YarnCommands for more details.

ResourceManager Web UI services

Assuming a standby RM is up and running, the Standby automatically redirects all web requests to the Active, except for the “About” page.

Web Services

Assuming a standby RM is up and running, the RM web services described in ResourceManager REST APIs, when invoked on a standby RM, are automatically redirected to the Active RM.

