Notes on an HBase NotServingRegionException Incident


1. The Problem

On the HBase test cluster, reads against several HBase tables all failed with errors like the one below, indicating that the region was not available:

Wed Oct 28 14:00:56 CST 2020, RpcRetryingCaller{globalStartTime=1603864856185, pause=100, maxAttempts=4}, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: tbl_ishis_hb_tips_levy_inf,,1578557543052.0ca2591b1209239ee963893abdfa4b53. is not online on vm3928.hadoop.com,16020,1603800273275
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3325)
......

2. Investigation

My first suspicion was that the cluster was in the middle of a region split or a region balance between RegionServers, during which the affected regions are temporarily unavailable. But splits and rebalancing normally finish quickly, while this problem had already been present since the previous day. So I checked the HMaster log, which had been reporting errors continuously; the key lines are:

2020-10-28 14:00:17,941 WARN org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fvm3928.hadoop.com%2C16020%2C1598158651562-splitting%2Fvm3928.hadoop.com%252C16020%252C1598158651562.vm3928.hadoop.com%252C16020%252C1598158651562.regiongroup-0.1603692260004
2020-10-28 14:00:17,942 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://nameservice1/hbase/WALs/vm3928.hadoop.com,16020,1598158651562-splitting] installed = 1 but only 0 done
2020-10-28 14:00:17,942 WARN org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Failed state=SERVER_CRASH_SPLIT_LOGS, retry pid=1, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; ServerCrashProcedure server=vm3928.hadoop.com,16020,1598158651562, splitWal=true, meta=true; cycles=3455
java.io.IOException: error or interrupted while splitting logs in [hdfs://nameservice1/hbase/WALs/vm3928.hadoop.com,16020,1598158651562-splitting] Task = installed = 1 done = 0 error = 1
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:271)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:401)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:386)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:283)
        ......

From this we can infer that WAL splitting failed or was interrupted, and that this failure is what is blocking region recovery.

The following background on the WAL and log splitting is paraphrased from the HBase reference guide (Chinese translation at https://www.w3cschool.cn/hbase_doc/):

Write Ahead Log

The Write Ahead Log (WAL) records all changes to HBase data in file-based storage. Under normal operation the WAL is not needed, because data changes eventually move from the MemStore to StoreFiles. But if a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes can be replayed. If a write to the WAL fails, the entire operation that modified the data fails. The WAL lives in HDFS under the /hbase/WALs/ directory, with a subdirectory per RegionServer.
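As an aside, if you ever need to look at what is actually inside a WAL file, HBase 2.x ships a WAL pretty-printer reachable through the hbase launcher script; a sketch (the path below is a placeholder, not a file from this incident):

# -p also prints the cell values of each edit; <host>,<port>,<startcode> and the file name are placeholders.
hbase wal -p /hbase/WALs/<host>,<port>,<startcode>/<wal-file-name>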

WAL Splitting

A RegionServer serves many regions. All the regions on a RegionServer share the same set of active WAL files. Each edit in a WAL file carries information about which region it belongs to. When a region is opened, the edits in the WAL files that belong to that region need to be replayed. Therefore, edits in WAL files must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping WAL edits by region is called log splitting. It is a critical process for recovering data if a RegionServer fails.
Log splitting is done by the HMaster during cluster startup, or by the ServerShutdownHandler when a RegionServer shuts down. To guarantee consistency, affected regions are unavailable until data is recovered. All WAL edits need to be recovered and replayed before a given region can become available again, so regions affected by log splitting stay unavailable until the process completes.

Procedure: log splitting, step by step

1. The WAL directory /hbase/WALs/<host>,<port>,<startcode> is renamed. Renaming the directory is important because a RegionServer may still be up and accepting requests even though the HMaster considers it down. If the RegionServer does not respond immediately and does not heartbeat its ZooKeeper session, the HMaster may interpret this as a RegionServer failure. Renaming the log directory ensures that existing, valid WAL files still in use by an alive but busy RegionServer are not written to by accident. The new directory is named according to the following pattern:
/hbase/WALs/<host>,<port>,<startcode>-splitting
An example of such a renamed directory might look like:
/hbase/WALs/srv.example.com,60020,1254173957298-splitting

2. Each log file is split, one at a time. The log splitter reads the log file one edit entry at a time and puts each edit into the buffer corresponding to that edit's region. At the same time, the splitter starts several writer threads. The writer threads pick up the corresponding buffers and write their edit entries to temporary recovered-edits files, stored on disk with the following naming pattern:
/hbase/<table_name>/<region_id>/recovered.edits/.temp
This file holds all of the edits from the WAL for that region. After log splitting completes, the .temp file is renamed to the sequence ID of the first log entry written to the file. To determine whether all edits have been written, this sequence ID is compared with the sequence of the last edit written to the HFile. If the sequence of the last edit is greater than or equal to the sequence ID contained in the file name, all writes from the edit file are known to be complete.

3. After log splitting is complete, each affected region is assigned to a RegionServer. When the region is opened, its recovered.edits folder is checked for recovered-edits files. If any such files are present, they are replayed by reading the edits and saving them to the MemStore. After all edit files are replayed, the contents of the MemStore are written to disk (HFile) and the edit files are deleted.
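Note that on current HBase layouts the recovered-edits files live under the data directory rather than the older /hbase/<table_name> path quoted above; to check whether a region still has pending recovered edits, something along these lines can be used (namespace, table, and region are placeholders):

# List any pending recovered-edits files for a region in the default namespace.
hadoop fs -ls /hbase/data/default/<table_name>/<region_id>/recovered.edits/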

Back to the investigation: listing the WAL directory on HDFS shows that a -splitting directory is still present. If WAL splitting had finished normally, this directory would not remain:

hadoop fs -ls /hbase/WALs
Found 4 items
drwxr-xr-x   - hbase hbase          0 2020-10-26 18:40 /hbase/WALs/vm3928.hadoop.com,16020,1598158651562-splitting
drwxr-xr-x   - hbase hbase          0 2020-10-28 15:50 /hbase/WALs/vm3928.hadoop.com,16020,1603800273275
drwxr-xr-x   - hbase hbase          0 2020-10-28 15:49 /hbase/WALs/vm3929.hadoop.com,16020,1603800273410
drwxr-xr-x   - hbase hbase          0 2020-10-28 15:49 /hbase/WALs/vm3930.hadoop.com,16020,1603800273039
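Listing the splitting directory itself shows which WAL file the master keeps failing to split; the file name should match the one in the HMaster error above:

hadoop fs -ls /hbase/WALs/vm3928.hadoop.com,16020,1598158651562-splitting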

Given that the HMaster had been logging WAL-split failures continuously, the rough picture is that the cluster kept retrying a WAL splitting task it could never complete, so all regions on the affected RegionServer stayed unavailable, which is why HBase queries failed.
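On HBase 2.x the stuck ServerCrashProcedure can also be observed from the shell (assuming the list_procedures command is available in this version); a quick check looks like:

# The ServerCrashProcedure for vm3928 should show up here, stuck retrying SERVER_CRASH_SPLIT_LOGS.
echo "list_procedures" | hbase shell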

3. Initial Fix

Several write-ups online suggest restoring service first by moving the -splitting directory somewhere else, so I ran:

hadoop fs -mv /hbase/WALs/vm3928.hadoop.com,16020,1598158651562-splitting /hbase/bk

After this, the CDH management console showed the number of online regions on the vm3928 server recovering from 1 back to 244.

At this point, reading the first few rows of several HBase tables worked again (though some problems remained, as described below).
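Concretely, the sanity check was just a bounded read from the shell; something like the following (using the table from the error in section 1) now returns a row instead of NotServingRegionException:

echo "scan 'tbl_ishis_hb_tips_levy_inf', {LIMIT => 1}" | hbase shell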

4. Analyzing Why WAL Splitting Failed

To dig into why WAL splitting kept failing, I looked at part of the content of the file under the splitting directory:

hadoop fs -cat /hbase/WALs/vm3928.hadoop.com,16020,1598158651562-splitting/vm3928.hadoop.com%2C16020%2C1598158651562.vm3928.hadoop.com%2C16020%2C1598158651562.regiongroup-0.1603692260004
20/10/28 11:00:10 WARN hdfs.DFSClient: Found Checksum error for BP-1135703146-10.200.39.28-1574043673848:blk_1084998682_11258280 from DatanodeInfoWithStorage[10.200.39.30:1004,DS-cfda5a02-d48f-4269-896e-083d55fdd33f,DISK] at 512
20/10/28 11:00:10 WARN hdfs.DFSClient: No live nodes contain block BP-1135703146-10.200.39.28-1574043673848:blk_1084998682_11258280 after checking nodes = [DatanodeInfoWithStorage[10.200.39.30:1004,DS-cfda5a02-d48f-4269-896e-083d55fdd33f,DISK], DatanodeInfoWithStorage[10.200.39.29:1004,DS-dffdf8e8-0f2a-4a43-b39d-b70b50263e0c,DISK], DatanodeInfoWithStorage[10.200.39.28:1004,DS-21d3927c-e42d-45ce-8e79-25c14202cbe5,DISK]], ignoredNodes = null
......

So HDFS blocks backing HBase files are missing. HBase stores its data on HDFS, which keeps three replicas by default; a block only goes into the corrupt state when all replicas are lost, and the log above shows that none of the cluster's three DataNodes has a copy of this block. The next step was to run a check on the HBase data directory; the result was:

hdfs fsck /hbase/data
Connecting to namenode via http://vm3928.hadoop.com:9870/fsck?ugi=hbase&path=%2Fhbase%2Fdata

/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/36cafb11c3a6455aaba1605b72bbb7c7: CORRUPT blockpool BP-1135703146-10.200.39.28-1574043673848 block blk_1084998838

/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/36cafb11c3a6455aaba1605b72bbb7c7: CORRUPT 1 blocks of total size 2559961 B.
/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/756ffcf3efd24971abe0195f8de5af1e:  Under replicated BP-1135703146-10.200.39.28-1574043673848:blk_1084998837_11258152. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).

/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas_idx_mchnt_cd/72c126f1fa9cef3a0fdd8fcb5e49501e/d/a64f19e4dbfc47a991d919a5595b4e29:  Under replicated BP-1135703146-10.200.39.28-1574043673848:blk_1084998835_11258150. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).

Status: CORRUPT
 Number of data-nodes:  3
 Number of racks:               1
 Total dirs:                    3016
 Total symlinks:                0

Replicated Blocks:
 Total size:    114299895515 B
 Total files:   2386
 Total blocks (validated):      2406 (avg. block size 47506190 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:      1 (0.04156276 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        1
  CORRUPT BLOCKS:       1
  CORRUPT SIZE:         2559961 B
  ********************************
 Minimally replicated blocks:   2405 (99.958435 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       2 (0.08312552 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.9970906
 Missing blocks:                0
 Corrupt blocks:                1
 Missing replicas:              4 (0.055417012 %)
 Blocks queued for replication: 0
The filesystem under path '/hbase/data' is CORRUPT
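To pin down exactly which DataNodes are supposed to hold the replicas of a suspect file, fsck can also be pointed at a single file; a sketch using the corrupt file reported above:

# -files/-blocks/-locations print the per-block replica locations for the given path.
hdfs fsck /hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/36cafb11c3a6455aaba1605b72bbb7c7 -files -blocks -locations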

The key line from the fsck report is:

/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/36cafb11c3a6455aaba1605b72bbb7c7: CORRUPT 1 blocks of total size 2559961 B.

So for table tbl_glhis_hb_ct_settle_dtl_bas, region 42a907b4ffafb72205d9e1d1fa5e8301, one data file (36cafb11c3a6455aaba1605b72bbb7c7) of column family d has a missing block. The expectation, then, is that any access touching region 42a907b4ffafb72205d9e1d1fa5e8301 of tbl_glhis_hb_ct_settle_dtl_bas will still fail. Sure enough, a count run directly in the hbase shell got as far as 13,224,000 rows and then, once it needed to read region 42a907b4ffafb72205d9e1d1fa5e8301, started throwing NotServingRegionException, as shown below, exactly as expected.
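(For reference, the count was issued roughly like this; the CACHE value is illustrative, not the one actually used.)

# Full-table count; CACHE batches rows per scanner call so the count runs faster.
count 'tbl_glhis_hb_ct_settle_dtl_bas', CACHE => 10000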

Current count: 13224000, row: 4995600052377126-79799371-79799880787440-0119212559-269394-03014510-00092900-X-Q1#010                                                  
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=8, exceptions:
Wed Oct 28 17:21:50 CST 2020, RpcRetryingCaller{globalStartTime=1603876909641, pause=100, maxAttempts=8}, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: tbl_glhis_hb_ct_settle_dtl_bas,5,1578557420156.42a907b4ffafb72205d9e1d1fa5e8301. is not online on vm3928.hadoop.com,16020,1603800273039

At this point the overall picture is roughly: missing HDFS blocks underneath one HBase table caused the cluster's WAL splitting task to fail over and over and never complete, which in turn left every region on the affected RegionServer unavailable (not the transient unavailability caused by region splits or rebalancing). Moving the splitting directory out of the way on HDFS restores access to the tables whose blocks are intact, but the table that actually lost blocks still has regions that cannot be served. (Why HDFS blocks went missing in this test environment is still unclear.)

5. Dealing with the Unhealthy Data

Check the current state of the HBase data:

hdfs fsck /hbase/data
Connecting to namenode via http://vm3928.hadoop.com:9870/fsck?ugi=hbase&path=%2Fhbase%2Fdata

/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/36cafb11c3a6455aaba1605b72bbb7c7: CORRUPT blockpool BP-1135703146-10.200.39.28-1574043673848 block blk_1084998838

/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/36cafb11c3a6455aaba1605b72bbb7c7: CORRUPT 1 blocks of total size 2559961 B.
/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/756ffcf3efd24971abe0195f8de5af1e: CORRUPT blockpool BP-1135703146-10.200.39.28-1574043673848 block blk_1084998837

/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/756ffcf3efd24971abe0195f8de5af1e: CORRUPT 1 blocks of total size 40464 B.
/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas_idx_mchnt_cd/72c126f1fa9cef3a0fdd8fcb5e49501e/d/a64f19e4dbfc47a991d919a5595b4e29:  Under replicated BP-1135703146-10.200.39.28-1574043673848:blk_1084998835_11258150. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).

Status: CORRUPT
 Number of data-nodes:  3
 Number of racks:               1
 Total dirs:                    2884
 Total symlinks:                0

Replicated Blocks:
 Total size:    122742037956 B
 Total files:   2440
 Total blocks (validated):      2519 (avg. block size 48726493 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:      2 (0.07939658 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        2
  CORRUPT BLOCKS:       2
  CORRUPT SIZE:         2600425 B
  ********************************
 Minimally replicated blocks:   2517 (99.9206 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       1 (0.03969829 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.996824
 Missing blocks:                0
 Corrupt blocks:                2
 Missing replicas:              2 (0.02646553 %)
 Blocks queued for replication: 0
The filesystem under path '/hbase/data' is CORRUPT

Table tbl_glhis_hb_ct_settle_dtl_bas now shows two lost blocks, with the affected files listed, and table tbl_glhis_hb_ct_settle_dtl_bas_idx_mchnt_cd has a block with only one live replica instead of the default three, which is also unhealthy. Meanwhile the HBase web UI kept showing "HBase region in transition" entries, and the regions involved were exactly these unhealthy ones; any access that hit them also failed with NotServingRegionException.

What is the RIT (Region in Transition) state?

As regions are managed by the master and region servers to, for example, balance the load across servers, they go through short phases of transition. This applies to opening, closing, and splitting a region. Before the operation is performed, the region is added to the “Regions in Transition” list on the WEB UI, and once the operation is complete, it is removed.

In other words, RIT is the state a region passes through around operations such as open, close, split, or rebalance. In our environment regions were stuck in this list permanently, which I suspected was related to the corrupt blocks.
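The RIT list can also be read from the shell instead of the web UI; depending on the shell version, one of the following should work (the rit command only exists in newer 2.x shells):

# Newer 2.x shells: list regions currently in transition.
echo "rit" | hbase shell
# Older shells: 'status detailed' includes a regionsInTransition section.
echo "status 'detailed'" | hbase shell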

First, delete the files whose blocks are missing (commands below). After running them, the affected regions of tbl_glhis_hb_ct_settle_dtl_bas were no longer in the RIT list, and re-checking /hbase/data showed no more missing blocks and a HEALTHY status:

hdfs fsck /hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/36cafb11c3a6455aaba1605b72bbb7c7 -delete
hdfs fsck /hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas/42a907b4ffafb72205d9e1d1fa5e8301/d/756ffcf3efd24971abe0195f8de5af1e -delete

hdfs fsck /hbase/data
Connecting to namenode via http://vm3928.hadoop.com:9870/fsck?ugi=hbase&path=%2Fhbase%2Fdata
/hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas_idx_mchnt_cd/72c126f1fa9cef3a0fdd8fcb5e49501e/d/a64f19e4dbfc47a991d919a5595b4e29:  Under replicated BP-1135703146-10.200.39.28-1574043673848:blk_1084998835_11258150. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).

Status: HEALTHY
 Number of data-nodes:  3
 Number of racks:               1
 Total dirs:                    2787
 Total symlinks:                0

Replicated Blocks:
 Total size:    122670925814 B
 Total files:   2437
 Total blocks (validated):      2512 (avg. block size 48833967 B)
 Minimally replicated blocks:   2512 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       1 (0.039808918 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.999204
 Missing blocks:                0
 Corrupt blocks:                0
 Missing replicas:              2 (0.026539277 %)
 Blocks queued for replication: 0
The filesystem under path '/hbase/data' is HEALTHY

Next, repair the remaining under-replicated file by setting its replication factor back to 3:

hadoop fs -setrep -w 3 /hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas_idx_mchnt_cd/72c126f1fa9cef3a0fdd8fcb5e49501e/d/a64f19e4dbfc47a991d919a5595b4e29

In our environment, however, this command stayed in a waiting state indefinitely. The logs on the NameNode host (below) kept reporting java.io.EOFException, which looks like replica transfer between the DataNodes failing; this part has not been resolved yet...

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: vm3928.hadoop.com:1004:DataXceiver error processing READ_BLOCK operation  src: /10.200.39.29:57166 dst: /10.200.39.28:1004
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:315)
        at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:139)
        at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:153)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.loadLastPartialChunkChecksum(FsVolumeImpl.java:1137)
        at org.apache.hadoop.hdfs.server.datanode.FinalizedReplica.loadLastPartialChunkChecksum(FinalizedReplica.java:157)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.getPartialChunkChecksumForFinalized(BlockSender.java:444)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:265)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:598)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
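One note on the hanging command: the -w flag makes setrep block until replication actually completes, so as a stopgap the request can be submitted without -w and the file polled with fsck instead (this does not fix the underlying EOFException, it only avoids the blocked client):

# Submit the replication change without waiting for it to complete...
hadoop fs -setrep 3 /hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas_idx_mchnt_cd/72c126f1fa9cef3a0fdd8fcb5e49501e/d/a64f19e4dbfc47a991d919a5595b4e29
# ...then poll the replica locations until three live replicas show up.
hdfs fsck /hbase/data/default/tbl_glhis_hb_ct_settle_dtl_bas_idx_mchnt_cd/72c126f1fa9cef3a0fdd8fcb5e49501e/d/a64f19e4dbfc47a991d919a5595b4e29 -files -blocks -locations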

