2021-11-15 18:52:15,361 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2021-11-15 18:52:15,372 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8032: starting
2021-11-15 18:52:15,421 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8030: starting
2021-11-15 18:52:19,190 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to active state
2021-11-15 18:52:19,193 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to standby state
2021-11-15 18:52:19,222 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=refreshQueues TARGET=AdminService RESULT=FAILURE DESCRIPTION=ResourceManager is not active. Can not refresh queues. PERMISSIONS=
2021-11-15 18:52:19,222 ERROR org.apache.hadoop.yarn.server.resourcemanager.AdminService: RefreshAll failed so firing fatal event
org.apache.hadoop.ha.ServiceFailedException: ResourceManager rm1 is not Active!
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:609)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:813)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2021-11-15 18:52:19,223 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:813)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transistion to Active
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:321)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: ResourceManager rm1 is not Active!
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:609)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
... 5 more
2021-11-15 18:52:19,223 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2021-11-15 18:52:19,232 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED. Cause:
org.apache.hadoop.ha.ServiceFailedException: ResourceManager rm1 is not Active!
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:609)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:813)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2021-11-15 18:52:19,234 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
獲取rm的狀態
命令:
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
不能獲取rm1,rm2的狀態。
https://issues.apache.org/jira/browse/YARN-2588
https://issues.apache.org/jira/browse/YARN-2019
https://issues.apache.org/jira/browse/YARN-2010
Recovery失敗導致導致RM無法啟動:非常嚴重的bug。
根源在於RM HA啟動時會去zk讀取之前的狀態(recovery過程),如果zk中的數據有問題,recovery過程會拋出異常,
這個異常沒處理好,會直接拋到上層,導致RM進入STOPPED狀態。
進入STOPPED狀態后,就不能再變成其他狀態了。
解決方法:
不改代碼的話,只能手動清除zk中的數據。可以手動刪除zk中的/rmstore路徑:
setAcl /rmstore/ZKRMStateRoot world:anyone:cdrwa
rmr /rmstore
或者在yarn-site.xml中設置yarn.resourcemanager.zk-state-store.parent-path屬性,比如/rmstore2,將數據存到另外一個路徑。