Troubleshooting an HDFS DataNode stuck at startup


When starting HDFS, one DataNode would not come up. Its log showed messages like "Time to add replicas to map for block pool", which means the DataNode was scanning its data disks to build the replica map it reports to the NameNode. The node has three data disks, but only two of the scans ever completed; startup was stuck on the scan of the third disk.
 
2021-07-24 11:37:50,731 INFO  checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data01/hadoop/hdfs/data/current
2021-07-24 11:37:50,738 INFO  checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data01/hadoop/hdfs/data/current
2021-07-24 11:37:50,739 INFO  checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data02/hadoop/hdfs/data/current
2021-07-24 11:37:50,740 INFO  checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data02/hadoop/hdfs/data/current
2021-07-24 11:37:50,740 INFO  checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data03/hadoop/hdfs/data/current
2021-07-24 11:37:50,740 INFO  checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data03/hadoop/hdfs/data/current
2021-07-24 11:37:50,740 INFO  impl.FsDatasetImpl (FsDatasetImpl.java:addBlockPool(2635)) - Adding block pool BP-1188018203-192.168.50.56-1615989660288
2021-07-24 11:37:50,741 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current...
2021-07-24 11:37:50,741 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data02/hadoop/hdfs/data/current...
2021-07-24 11:37:50,741 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current...
2021-07-24 11:37:50,757 INFO  impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data01/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5512085090304
2021-07-24 11:37:50,757 INFO  impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data03/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5497424637952
2021-07-24 11:37:50,760 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data01/hadoop/hdfs/data/current: 19ms
2021-07-24 11:37:50,760 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data03/hadoop/hdfs/data/current: 19ms
2021-07-24 11:37:50,773 INFO  impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data02/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5591791809340
2021-07-24 11:37:50,773 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data02/hadoop/hdfs/data/current: 32ms
2021-07-24 11:37:50,774 INFO  impl.FsDatasetImpl (FsVolumeList.java:addBlockPool(423)) - Total time to scan all replicas for block pool BP-1188018203-192.168.50.56-1615989660288: 33ms
2021-07-24 11:37:50,776 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current...
2021-07-24 11:37:50,776 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data02/hadoop/hdfs/data/current...
2021-07-24 11:37:50,776 INFO  impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data01/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
2021-07-24 11:37:50,776 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current...
2021-07-24 11:37:50,776 INFO  impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data02/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
2021-07-24 11:37:50,777 INFO  impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data03/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
2021-07-24 11:39:00,774 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current: 69998ms
2021-07-24 11:39:02,396 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current: 71620ms

The natural next suspect was disk I/O. Running `iostat -kxd 2` showed the third data disk's utilization pinned between 98% and 100%.
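That check can be sketched as a one-liner over the `%util` column (the last field of `iostat -kxd` output). The sample data below is fabricated for illustration; on a live host you would pipe real `iostat` output instead:

```shell
# Fabricated iostat-style sample; on a live host you would run:
#   iostat -kxd 2
# and watch the %util column (last field) for each device.
sample='Device  r/s  w/s  rkB/s  wkB/s  %util
sda     12.0  3.0  480.0  96.0   4.20
sdb     10.0  2.0  400.0  80.0   3.90
sdc      5.0  1.0  200.0  40.0  99.80'

# Flag any device whose %util exceeds 90 -- a disk pinned near 100%
# cannot keep up with the replica scan.
echo "$sample" | awk 'NR > 1 && $NF + 0 > 90 { print $1, $NF }'
# → sdc 99.80
```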

 

 

 
`iotop -oP` then pointed at the processes consuming the most I/O, and sure enough, three separate `du` processes were all walking that same data path at once; no wonder it hung. The likely cause was my own repeated DataNode restarts just before: each restart spawned a new `du` before the previous one had been killed, so they piled up. I killed the three `du` processes by hand, restarted the DataNode, and this time it came up successfully.
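What I did by hand can be sketched like this; the path is this cluster's third data disk, and `pgrep`/`pkill` with `-f` match against the full command line rather than just the process name:

```shell
# The stuck data disk on this cluster; adjust for your layout.
DATA_DIR=/cloud/data03/hadoop/hdfs/data

# List any du processes still walking that directory (-f matches the
# full command line, -a prints it). Falls back to a message when clean.
pgrep -af "du.*${DATA_DIR}" || echo "no stray du processes"

# After confirming the PIDs, kill them (left commented out so the
# sketch is safe to run as-is):
# pkill -f "du.*${DATA_DIR}"
```

The `du` runs themselves come from the DataNode's disk-usage accounting. As an aside, on Hadoop versions that ship `org.apache.hadoop.fs.DFCachingGetSpaceUsed`, setting `fs.getspaceused.classname` to that class replaces the periodic `du` tree walk with a much cheaper `df`, which avoids this kind of pile-up.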

 

 

 
That said, across later restarts the `du` on the third data disk was sometimes fast and sometimes slow, and occasionally hung outright, so I now suspect the disk hardware itself; the next step is to take it up with the operations engineers.
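One way to put numbers on that suspicion before escalating: time a full `du` walk of each disk and compare across a few runs. The paths below are this cluster's three data disks; a healthy disk should give stable times, while the flaky one will swing or stall:

```shell
# Time a tree walk of each data disk and print elapsed seconds.
for d in /cloud/data01 /cloud/data02 /cloud/data03; do
  start=$(date +%s)
  du -s "$d" >/dev/null 2>&1 || true   # ignore errors if a path is absent
  echo "$d: $(( $(date +%s) - start ))s"
done
```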

