While starting HDFS, I found that one DataNode would not come up. Its logs showed messages like "Time to add replicas to map for block pool", which means the DataNode is scanning its data disks and gathering the block file names to report to the NameNode. The node has three data disks, but only two had finished scanning; startup was stuck on the third.
2021-07-24 11:37:50,731 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data01/hadoop/hdfs/data/current
2021-07-24 11:37:50,738 INFO checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data01/hadoop/hdfs/data/current
2021-07-24 11:37:50,739 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data02/hadoop/hdfs/data/current
2021-07-24 11:37:50,740 INFO checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data02/hadoop/hdfs/data/current
2021-07-24 11:37:50,740 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data03/hadoop/hdfs/data/current
2021-07-24 11:37:50,740 INFO checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data03/hadoop/hdfs/data/current
2021-07-24 11:37:50,740 INFO impl.FsDatasetImpl (FsDatasetImpl.java:addBlockPool(2635)) - Adding block pool BP-1188018203-192.168.50.56-1615989660288
2021-07-24 11:37:50,741 INFO impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current...
2021-07-24 11:37:50,741 INFO impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data02/hadoop/hdfs/data/current...
2021-07-24 11:37:50,741 INFO impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current...
2021-07-24 11:37:50,757 INFO impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data01/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5512085090304
2021-07-24 11:37:50,757 INFO impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data03/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5497424637952
2021-07-24 11:37:50,760 INFO impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data01/hadoop/hdfs/data/current: 19ms
2021-07-24 11:37:50,760 INFO impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data03/hadoop/hdfs/data/current: 19ms
2021-07-24 11:37:50,773 INFO impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data02/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5591791809340
2021-07-24 11:37:50,773 INFO impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data02/hadoop/hdfs/data/current: 32ms
2021-07-24 11:37:50,774 INFO impl.FsDatasetImpl (FsVolumeList.java:addBlockPool(423)) - Total time to scan all replicas for block pool BP-1188018203-192.168.50.56-1615989660288: 33ms
2021-07-24 11:37:50,776 INFO impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current...
2021-07-24 11:37:50,776 INFO impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data02/hadoop/hdfs/data/current...
2021-07-24 11:37:50,776 INFO impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data01/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
2021-07-24 11:37:50,776 INFO impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current...
2021-07-24 11:37:50,776 INFO impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data02/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
2021-07-24 11:37:50,777 INFO impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data03/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
2021-07-24 11:39:00,774 INFO impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current: 69998ms
2021-07-24 11:39:02,396 INFO impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current: 71620ms
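For context, the "add replicas to map" phase walks every block file on a volume, so its duration scales with the replica count. As a rough, illustrative sketch (the volume paths are copied from the log above; adjust them to your own layout), you can gauge the per-volume workload by counting block files:

```shell
#!/bin/sh
# Count block files per volume to estimate the replica-map scan workload.
# The volume paths mirror the log excerpt; adjust them to your deployment.
for vol in /cloud/data01 /cloud/data02 /cloud/data03; do
  dir="$vol/hadoop/hdfs/data/current"
  # Each replica is a blk_<id> data file plus a blk_<id>_<gs>.meta file;
  # count only the data files.
  n=$(find "$dir" -name 'blk_*' ! -name '*.meta' 2>/dev/null | wc -l)
  echo "$vol: $n block files"
done
```

A volume with far more block files than its siblings will naturally take longer in this phase even on healthy disks.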
The natural next suspect was disk I/O. Running `iostat -kxd 2` showed the third data disk's utilization pinned between 98% and 100%.
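For illustration, a minimal sketch of filtering iostat's extended output for saturated devices. The sample report and device names below are made up, and the %util column position can vary between sysstat versions; in practice you would pipe live `iostat -kxd 2` output through the same awk filter:

```shell
#!/bin/sh
# Flag devices whose %util (last column in this sample layout) exceeds 90.
# The heredoc stands in for live output of: iostat -kxd 2
awk '$1 ~ /^sd/ && $NF+0 > 90 { print $1, "util="$NF"%" }' <<'EOF'
Device  r/s   w/s   rkB/s  wkB/s  await  %util
sda     12.0  30.5  512.0  980.0  4.20   35.70
sdb     10.1  28.9  480.0  900.0  3.90   32.10
sdc     95.4  88.7  1200.0 600.0  350.8  99.60
EOF
```

Here only `sdc` would be reported, matching the pattern seen on the third data disk.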

`iotop -oP` then revealed the top I/O consumers: to my surprise, three `du` processes were all walking this same data path, so no wonder the disk was wedged. The likely cause was my repeated DataNode restarts just before: each restart spawned a new `du` before the previous one had finished or been killed. I killed the three processes by hand, restarted the DataNode, and it came up successfully.
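As background, Hadoop DataNodes of this era periodically shell out to `du` to compute per-volume usage (the "Cached dfsUsed found" lines in the log are that cache), which is why stray `du` processes can pile up across rapid restarts. A cautious cleanup sketch, with the path taken from the log excerpt; always review the pgrep match list before killing anything:

```shell
#!/bin/sh
# Find lingering du processes on the suspect volume before killing them.
# VOL comes from the log excerpt; adjust it to your mount point.
VOL=/cloud/data03
# List matches first (-a prints the full command line) to sanity-check
# exactly what would be killed.
pgrep -af "du .*$VOL" || echo "no stray du on $VOL"
# SIGTERM first; escalate to SIGKILL only if a process ignores it.
pkill -f "du .*$VOL" || true
```

The `|| true` keeps the script's exit status clean when no stray `du` is found.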

However, during occasional later restarts I noticed that `du` on the third data disk was sometimes fast, sometimes slow, and occasionally hung outright, so I suspect the hardware itself has a problem and plan to follow up with the operations engineers.
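Before handing this off, a crude sequential-read comparison across the volumes can corroborate the suspicion: a disk that is consistently slower than its siblings under the same read pattern deserves a proper SMART check. A rough sketch only (volume paths assumed from above; this measures reads and is no substitute for real diagnostics):

```shell
#!/bin/sh
# Crude per-volume sequential-read timing: read 100 MB from an existing
# large file on each volume and let dd report throughput.
for vol in /cloud/data01 /cloud/data02 /cloud/data03; do
  # Pick any file >100 MB already on the volume as the read source.
  f=$(find "$vol" -type f -size +100M 2>/dev/null | head -1)
  [ -n "$f" ] || continue
  echo "== $vol =="
  # iflag=direct bypasses the page cache so we measure the disk, not RAM.
  dd if="$f" of=/dev/null bs=1M count=100 iflag=direct 2>&1 | tail -1
done
```

Running this a few times per volume smooths out one-off latency spikes; a disk that always lags is the one to escalate.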