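What should you do if a permanent RIT (Region-In-Transition) appears while merging Regions? I ran into exactly this in production: during a batch Region merge, some Regions became permanently stuck in the MERGING_NEW state. This does not affect the cluster's ability to serve existing traffic, but if a node in the cluster restarts, the Regions on that RegionServer may not be rebalanced afterwards. The reason is that HBase does not run Region load balancing while any Region is in transition, and even running the balancer command manually has no effect.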
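The balancing behaviour described above can be pictured with a toy model. This is a minimal Java sketch, not HBase's actual code: the class and method names (`BalancerGuardSketch`, `markInTransition`, `clearTransition`, `balance`) are hypothetical and only illustrate why a single stuck Region is enough to block every balancing pass.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model: a balancing pass is skipped whenever any region is in transition.
class BalancerGuardSketch {
    // Names of regions currently in transition (hypothetical in-memory view).
    private final Set<String> regionsInTransition = new HashSet<>();

    public void markInTransition(String regionName) {
        regionsInTransition.add(regionName);
    }

    public void clearTransition(String regionName) {
        regionsInTransition.remove(regionName);
    }

    /** Returns true only if a balancing pass would be allowed to run. */
    public boolean balance() {
        // One stuck region is enough to skip the whole pass.
        return regionsInTransition.isEmpty();
    }

    public static void main(String[] args) {
        BalancerGuardSketch master = new BalancerGuardSketch();
        master.markInTransition("mergedRegion");
        System.out.println("balancer ran: " + master.balance());
        master.clearTransition("mergedRegion");
        System.out.println("balancer ran: " + master.balance());
    }
}
```

In this model a manual balancer request goes through the same `balance()` check, which is why triggering it by hand changes nothing while the stuck entry remains.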
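If this RIT is not resolved, then as further HBase nodes restart one after another, Region distribution across the cluster becomes severely unbalanced. That is critical and will badly hurt cluster performance. Searching the HBase JIRA showed that this permanent MERGING_NEW RIT is caused by the bug tracked in HBASE-17682, and the patch for that issue is needed to fix it. In essence, the HBase source does not check for the MERGING_NEW state in this piece of branch logic, so the state falls straight through to the else branch. The source code is as follows: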
for (RegionState state : regionsInTransition.values()) {
  HRegionInfo hri = state.getRegion();
  if (assignedRegions.contains(hri)) {
    // Region is open on this region server, but in transition.
    // This region must be moving away from this server, or splitting/merging.
    // SSH will handle it, either skip assigning, or re-assign.
    LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
  } else if (sn.equals(state.getServerName())) {
    // Region is in transition on this region server, and this
    // region is not open on this server. So the region must be
    // moving to this server from another one (i.e. opening or
    // pending open on this server, was open on another one.
    // Offline state is also kind of pending open if the region is in
    // transition. The region could be in failed_close state too if we have
    // tried several times to open it while this region server is not reachable)
    if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
      LOG.info("Found region in " + state + " to be reassigned by ServerCrashProcedure for " + sn);
      rits.add(hri);
    } else if(state.isSplittingNew()) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else {
      LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
    }
  }
}
The code after the fix:
for (RegionState state : regionsInTransition.values()) {
  HRegionInfo hri = state.getRegion();
  if (assignedRegions.contains(hri)) {
    // Region is open on this region server, but in transition.
    // This region must be moving away from this server, or splitting/merging.
    // SSH will handle it, either skip assigning, or re-assign.
    LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
  } else if (sn.equals(state.getServerName())) {
    // Region is in transition on this region server, and this
    // region is not open on this server. So the region must be
    // moving to this server from another one (i.e. opening or
    // pending open on this server, was open on another one.
    // Offline state is also kind of pending open if the region is in
    // transition. The region could be in failed_close state too if we have
    // tried several times to open it while this region server is not reachable)
    if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
      LOG.info("Found region in " + state + " to be reassigned by ServerCrashProcedure for " + sn);
      rits.add(hri);
    } else if(state.isSplittingNew()) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else if (isOneOfStates(state, State.SPLITTING_NEW, State.MERGING_NEW)) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else {
      LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
    }
  }
}
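To make the difference between the two versions concrete, here is a small self-contained Java sketch of just the inner branch. It is an illustration under stated assumptions, not HBase code: `RitTriageSketch`, its reduced `State` enum, the `Action` enum, and both `triage...` methods are hypothetical names that mirror the conditions in the listings above.

```java
import java.util.EnumSet;

// Toy model of the inner branch: what happens to a region in transition on
// the crashed server, before and after the MERGING_NEW check was added.
class RitTriageSketch {
    // Reduced set of region states, enough to exercise each branch.
    enum State { OFFLINE, PENDING_OPEN, OPENING, FAILED_CLOSE, SPLITTING_NEW, MERGING_NEW, MERGING }

    // Outcome of the branch (hypothetical labels for the three code paths).
    enum Action { REASSIGN, CLEAN_IF_NO_META_ENTRY, UNEXPECTED }

    private static final EnumSet<State> REASSIGNABLE =
        EnumSet.of(State.PENDING_OPEN, State.OPENING, State.FAILED_CLOSE, State.OFFLINE);

    /** Mirrors the original logic: only SPLITTING_NEW is cleaned up. */
    public static Action triageBeforePatch(State s) {
        if (REASSIGNABLE.contains(s)) {
            return Action.REASSIGN;
        } else if (s == State.SPLITTING_NEW) {
            return Action.CLEAN_IF_NO_META_ENTRY;
        }
        return Action.UNEXPECTED; // MERGING_NEW lands here and stays in transition
    }

    /** Mirrors the fixed logic: MERGING_NEW is cleaned up as well. */
    public static Action triageAfterPatch(State s) {
        if (REASSIGNABLE.contains(s)) {
            return Action.REASSIGN;
        } else if (s == State.SPLITTING_NEW || s == State.MERGING_NEW) {
            return Action.CLEAN_IF_NO_META_ENTRY;
        }
        return Action.UNEXPECTED;
    }

    public static void main(String[] args) {
        System.out.println("before: " + triageBeforePatch(State.MERGING_NEW)); // before: UNEXPECTED
        System.out.println("after:  " + triageAfterPatch(State.MERGING_NEW));  // after:  CLEAN_IF_NO_META_ENTRY
    }
}
```

In this model every other state is handled the same way in both versions; only MERGING_NEW moves from the warning path to the cleanup path.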
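There is a problem, though: the JIRA issue only says that the bug must be fixed by applying the patch. In real production, you cannot stop the cluster for a long time and disrupt application reads and writes in order to deal with this RIT. So is there a temporary workaround that clears the current permanent MERGING_NEW RIT first, leaving the HBase version upgrade for later?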
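There is. After analyzing the merge flow, I found that when HBase merges Regions, it first creates the new Region in an initial MERGING_NEW state. The overall Region merge flow is as follows: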
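As the flow chart shows, MERGING_NEW is an initial state held in the active Master's memory, and the backup Master's memory has no MERGING_NEW state for the new Region. An active/backup Master switchover can therefore clear this permanent RIT temporarily. Because HBase is a highly available cluster, the switchover is transparent to user applications. So a Master switchover can serve as a temporary workaround for a permanent MERGING_NEW RIT, after which the bug can be fixed properly by applying the patch and upgrading HBase.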
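The reasoning behind the switchover can be sketched as a toy model in Java. Everything here is hypothetical and simplified (`MasterFailoverSketch`, `Meta`, `Master`, `startFrom`, `beginMerge`): it assumes, as described above, that the MERGING_NEW entry exists only in the active Master's memory and that a newly promoted Master builds its view from the persisted Region list.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: in-memory transition state is lost on Master switchover,
// while the persisted region list survives.
class MasterFailoverSketch {
    /** Stand-in for the persisted catalog of regions. */
    static final class Meta {
        public final Set<String> regions = new HashSet<>();
    }

    /** Stand-in for a Master process and its in-memory state. */
    static final class Master {
        public final Set<String> knownRegions = new HashSet<>();
        public final Map<String, String> regionsInTransition = new HashMap<>();

        /** A Master that becomes active builds its view from the catalog only. */
        public static Master startFrom(Meta meta) {
            Master m = new Master();
            m.knownRegions.addAll(meta.regions);
            return m;
        }

        /** Starting a merge records the new region in memory, not in the catalog. */
        public void beginMerge(String mergedRegion) {
            regionsInTransition.put(mergedRegion, "MERGING_NEW");
        }
    }

    public static void main(String[] args) {
        Meta meta = new Meta();
        meta.regions.add("regionA");
        meta.regions.add("regionB");

        Master active = Master.startFrom(meta);
        active.beginMerge("regionAB"); // the merge then gets stuck
        System.out.println("active RIT:   " + active.regionsInTransition);

        // Switchover: the backup becomes active and never saw the in-memory entry.
        Master promoted = Master.startFrom(meta);
        System.out.println("promoted RIT: " + promoted.regionsInTransition);
    }
}
```

The model only covers the state bookkeeping. It says nothing about how the switchover is triggered, and the underlying bug remains until the patch is applied.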