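Author: luxianghao
Source: http://www.cnblogs.com/luxianghao/p/7886850.html Please credit the source when reposting. Thank you for your cooperation.
Disclaimer: This article represents only my personal views. If anything is inaccurate, corrections are welcome.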
---
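I Introduction
As a cluster management tool, Ambari naturally needs to be able to restart processes automatically. The typical scenarios are:
1 When a process dies unexpectedly, Ambari starts it again and restores the service without human intervention;
2 After a machine boots, you no longer have to click through the services one by one to start them, which saves tedious manual work;
......
In short, it works hard to bring things to the state you expect ^-^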
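II Version history
Ambari has had this feature since its early versions, and it was refined across releases such as 2.2, 2.3 and 2.4 to make it more complete and easier to use. Early on, the related settings lived in ambari.properties. Because properties configured this way are static, changing them required restarting both Ambari Server and Ambari Agent. The settings were later moved into cluster-env.xml and stored in the database, with support added in the Web UI, so changing them no longer requires restarting any service: the changes are sent from Ambari Server to Ambari Agent along with the heartbeat messages. There is a cluster-level master switch as well as per-component switches. The related configuration properties are:
recovery_enabled: the cluster-level switch for the auto-start feature
recovery_type: the type of recovery; different types follow different execution logic, see the table below
recovery_lifetime_max_count: the maximum number of auto-start attempts over the agent's lifetime; this value is reset when Ambari Agent restarts
recovery_max_count: the maximum number of auto-start attempts within one time window; this value is reset when Ambari Agent restarts
recovery_window_in_minutes: the length of the time window for the auto-start feature
recovery_retry_interval: the interval between two retries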
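How the four numeric properties above might interact can be sketched in a few lines of Python. This is only an illustration, not the code in RecoveryManager.py: the class `RecoveryThrottle`, its method names, the numeric values and the unit of recovery_retry_interval (minutes) are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RecoveryThrottle:
    """Hypothetical helper; fields mirror the cluster-env.xml properties."""
    max_count: int = 6               # recovery_max_count (illustrative value)
    window_in_minutes: int = 60      # recovery_window_in_minutes (illustrative value)
    retry_interval_minutes: int = 5  # recovery_retry_interval (unit assumed: minutes)
    lifetime_max_count: int = 12     # recovery_lifetime_max_count (illustrative value)
    # Counters are kept in memory only, so restarting the agent resets them.
    attempts: list = field(default_factory=list)
    lifetime_count: int = 0

    def may_attempt(self, now=None):
        """Return True if another auto-start attempt is allowed at time `now`."""
        now = time.time() if now is None else now
        # Hard cap over the whole lifetime of the agent process.
        if self.lifetime_count >= self.lifetime_max_count:
            return False
        # Cap on the number of attempts inside the sliding time window.
        window_start = now - self.window_in_minutes * 60
        recent = [t for t in self.attempts if t > window_start]
        if len(recent) >= self.max_count:
            return False
        # Minimum gap between two consecutive attempts.
        if recent and now - recent[-1] < self.retry_interval_minutes * 60:
            return False
        return True

    def record_attempt(self, now=None):
        """Remember that an attempt was made at time `now`."""
        now = time.time() if now is None else now
        self.attempts.append(now)
        self.lifetime_count += 1
```

A caller would check `may_attempt()` before queuing a recovery command and call `record_attempt()` once the command has been queued.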
| Attribute: recovery_type | Commands | State Transitions |
| AUTO_START | Start | INSTALLED → STARTED |
| FULL | Install, Start, Restart, Stop | INIT → INSTALLED, INIT → STARTED, INSTALLED → STARTED, STARTED → STARTED, STARTED → INSTALLED |
| DEFAULT | None | Auto start feature disabled |
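The table above can be read as a lookup from (current state, desired state) to a command. The sketch below encodes it that way. The set of transitions per recovery_type is copied from the table, but the pairing of each transition with a specific command is my own inference from the command names, and `recovery_command` is a hypothetical function, not an Ambari API.

```python
# Transitions allowed for each recovery_type, copied from the table above.
ALLOWED_TRANSITIONS = {
    "AUTO_START": {("INSTALLED", "STARTED")},
    "FULL": {
        ("INIT", "INSTALLED"),
        ("INIT", "STARTED"),
        ("INSTALLED", "STARTED"),
        ("STARTED", "STARTED"),
        ("STARTED", "INSTALLED"),
    },
    "DEFAULT": set(),  # auto start feature disabled
}

# ASSUMPTION: which command serves which transition is inferred, not documented here.
COMMAND_FOR_TRANSITION = {
    ("INIT", "INSTALLED"): "INSTALL",
    ("INIT", "STARTED"): "INSTALL",
    ("INSTALLED", "STARTED"): "START",
    ("STARTED", "STARTED"): "RESTART",
    ("STARTED", "INSTALLED"): "STOP",
}


def recovery_command(recovery_type, current_state, desired_state):
    """Return the command for this transition, or None if it is not allowed."""
    transition = (current_state, desired_state)
    if transition not in ALLOWED_TRANSITIONS.get(recovery_type, set()):
        return None
    return COMMAND_FOR_TRANSITION[transition]
```

With AUTO_START only the INSTALLED → STARTED pair yields a command; every other pair returns None, which matches the single row for that type in the table.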
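III Feature overview && code logic
From the Ambari Server architecture diagram in the Ambari overview we can see that Ambari Server maintains an FSM (finite state machine) that records the desired state of every component (the state Ambari Server expects the component to be in). Ambari Agent checks in real time the current state of the services on its own host. When the desired state and the current state differ, recovery is triggered; the possible state transitions are those listed in the table above. In version 2.4 we generally set recovery_type to AUTO_START, and the most common scenario is the INSTALLED → STARTED transition. The logic of that event is as follows: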

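Note: a component's state is STARTED while it is running normally, and INSTALLED after it has crashed or been stopped normally.
The transition above takes place only if two switches are turned on, as shown in the figure below
1 recovery_enabled = True
2 enable components includes Service A
3 When we do not want to turn off the two switches above but still want to disable auto-start for a component on a particular node, we can use Maintenance mode. Any of the following puts the component in Maintenance mode
a) the component itself is put into Maintenance mode
b) the host the component runs on is put into Maintenance mode
c) the service the component belongs to is put into Maintenance mode
d) the cluster that the component's host belongs to is put into Maintenance mode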
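The conditions listed above can be combined into a single check, sketched below. `ComponentView`, `in_maintenance` and `needs_recovery` are hypothetical names invented for this example; they are not taken from RecoveryManager.py, and the real agent may evaluate the conditions in a different order.

```python
from dataclasses import dataclass


@dataclass
class ComponentView:
    """Hypothetical snapshot of one component on one host."""
    name: str
    current_state: str
    desired_state: str
    component_maintenance: bool = False
    host_maintenance: bool = False
    service_maintenance: bool = False
    cluster_maintenance: bool = False


def in_maintenance(c):
    """Cases a) to d): any level in Maintenance mode covers the component."""
    return (c.component_maintenance or c.host_maintenance
            or c.service_maintenance or c.cluster_maintenance)


def needs_recovery(c, recovery_enabled, enabled_components):
    """Return True if auto-start should act on component `c`."""
    if not recovery_enabled:                # switch 1: cluster-level switch
        return False
    if c.name not in enabled_components:    # switch 2: per-component switch
        return False
    if in_maintenance(c):                   # item 3: Maintenance mode opts out
        return False
    # Recovery is triggered only when current and desired state differ.
    return c.current_state != c.desired_state
```

For example, a component that is INSTALLED with desired state STARTED needs recovery only while both switches are on and none of the four Maintenance flags is set.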

Related source files
1 AmbariManagementControllerImpl.java
2 ServiceComponentDesiredStateEntity.java
3 ServiceComponentRecoveryChangedEvent.java
4 RecoveryConfigHelper.java
5 RecoveryManager.py
6 Controller.py
...
Related service logs
INFO 2017-11-21 12:16:24,210 RecoveryManager.py:243 - Service A needs recovery.
INFO 2017-11-21 12:16:24,209 Controller.py:265 - Heartbeat response received (id = 15)
INFO 2017-11-21 12:16:24,210 RecoveryManager.py:243 - Service A needs recovery.
INFO 2017-11-21 12:16:24,210 RecoveryManager.py:798 - START command cannot be computed as details are not received from Server.
INFO 2017-11-21 12:16:34,210 Heartbeat.py:82 - Building Heartbeat: {responseId = 15, timestamp = 1511237794210, commandsInProgress = False, componentsMapped = True,recoveryTimestamp = 1511237693282}
INFO 2017-11-21 12:16:54,588 Controller.py:310 - Adding recovery command START for component Service A
INFO 2017-11-21 12:16:54,589 ActionQueue.py:117 - Adding AUTO_EXECUTION_COMMAND for role Service A of cluster DRUID to the queue.
INFO 2017-11-21 12:16:54,604 ActionQueue.py:238 - Executing command with id = 1-0 for role = Service A of cluster DRUID.
INFO 2017-11-21 12:16:54,705 Heartbeat.py:82 - Building Heartbeat: {responseId = 18, timestamp = 1511237814704, commandsInProgress = False, componentsMapped = True,recoveryTimestamp = 1511237693282}
INFO 2017-11-21 12:16:54,854 Controller.py:265 - Heartbeat response received (id = 19)
INFO 2017-11-21 12:16:58,982 ActionQueue.py:341 - After EXECUTION_COMMAND (START), current state of Service A to STARTED
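When following a recovery in the agent log, it can help to split each line into its parts and keep only the recovery-related entries. The sketch below does that for lines shaped like the excerpt above. The regular expression is derived only from these sample lines, and `parse_agent_log_line`, `recovery_events` and the keyword list are hypothetical helpers, not part of Ambari.

```python
import re
from datetime import datetime

# Layout seen in the excerpt: LEVEL DATE TIME,MILLIS File.py:LINE - message
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<source>[\w.]+):(?P<line>\d+) - (?P<message>.*)$"
)

# ASSUMPTION: these keywords are picked from the sample lines only.
RECOVERY_KEYWORDS = ("recovery", "AUTO_EXECUTION_COMMAND")


def parse_agent_log_line(text):
    """Split one log line into a dict, or return None if it does not match."""
    m = LOG_LINE.match(text.strip())
    if m is None:
        return None
    return {
        "level": m.group("level"),
        "timestamp": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        "source": m.group("source"),
        "line": int(m.group("line")),
        "message": m.group("message"),
    }


def recovery_events(lines):
    """Return the parsed recovery-related entries, sorted by timestamp."""
    entries = [parse_agent_log_line(l) for l in lines]
    hits = [e for e in entries
            if e and any(k in e["message"] for k in RECOVERY_KEYWORDS)]
    return sorted(hits, key=lambda e: e["timestamp"])
```

Sorting by timestamp is useful because, as the excerpt shows, lines from different threads are not always written in strict time order.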
Related patches
AMBARI-15077: Auto-start services: Backend API and DB changes for component auto start
AMBARI-14983: Auto-start services: Show list of Services/Component with status indicator
AMBARI-14023: Agents should not ask for auto-start command details if it has the details (smohanty)
AMBARI-13463: Auto start should allow selection of components that can be auto-started (smohanty)
AMBARI-13434: Expose Alert Grace Period Setting in Agents (aonishuk)
AMBARI-13954: Enable auto-start with alerting for AMS (dsen)
AMBARI-14182: Recovery alerts do not go away
AMBARI-14865: Auto start - Maintenance mode of components should be respected when handling agent registration
AMBARI-15141: Start all services request aborts in the middle and hosts go into heartbeat-lost state
AMBARI-15230: Auto-start services: Move default values in ambari.properties to cluster-env.xml
AMBARI-15474: Listen for changes to auto-start configuration and send them to the agent during heartbeats.
AMBARI-12517: Don't send install_packages command to hosts without versionable components
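IV Similar tools
Automatic process restart can also be handled by process supervision tools such as Supervisor or God. The difference is that those two fork the managed processes as children of their own daemon and learn the process state by monitoring the children, whereas Ambari learns the process state by checking the pid or monitoring the port.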
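The pid and port style of checking mentioned above can be sketched with the Python standard library alone. The functions below are a generic illustration of the technique, not Ambari's actual status scripts; the names `pid_from_file`, `pid_alive` and `port_open` are made up for this example, and `pid_alive` assumes a POSIX system.

```python
import errno
import os
import socket


def pid_from_file(path):
    """Read a pid from a pid file; return None if it is missing or malformed."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def pid_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check, nothing is sent
    except OSError as exc:
        # EPERM means the process exists but is owned by another user.
        return exc.errno == errno.EPERM
    return True


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Unlike a supervising parent, a checker of this kind has no direct link to the process, so it only learns about a crash the next time it polls.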
V Related links
WIKI: https://cwiki.apache.org/confluence/display/AMBARI/Recovery%3A+auto+start+components
