HDFS corrupt block prevents a DataNode from reporting its blocks normally


On a production cluster, disk usage on one DataNode kept climbing: five of its disks had already reached 100% capacity, and the rest were above 90%. Running the balancer had no effect and the data kept growing rapidly; the balancer log suggested it was not moving any significant number of blocks.
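For reference, a minimal sketch of the checks run at this stage; the host names, mount points, and the balancer threshold below are placeholders I am assuming, not values from the original cluster:

# Per-DataNode capacity and DFS usage as seen by the NameNode
hdfs dfsadmin -report

# Disk usage on the affected DataNode itself
# (mount points are assumed; adjust to the actual dfs.datanode.data.dir layout)
df -h /data*/hdfs

# Run the balancer with an explicit threshold (allowed deviation, in percentage
# points, from the cluster-average utilization); moved blocks show up in the balancer log
hdfs balancer -threshold 10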

Symptoms:

2020-05-25 22:13:47,459 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x1f57d7768e5e5fbf, containing 11 storage report(s), of which we sent 2. The reports had 2790431 total blocks and used 2 RPC(s). This took 511 msec to generate and 511 msecs for RPC and NN processing. Got back no commands
2020-05-25 22:13:47,459 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService
org.apache.hadoop.ipc.RemoteException(java.io.IOException): java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.runBlockOp(BlockManager.java:3935)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1423)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:179)
        at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28423)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1481)
        at org.apache.hadoop.ipc.Client.call(Client.java:1427)
        at org.apache.hadoop.ipc.Client.call(Client.java:1337)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy15.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:203)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:371)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:629)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:771)
        at java.lang.Thread.run(Thread.java:748)
2020-05-25 22:13:48,829 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /xxx:41340, dest: /xxx:50010, bytes: 41295660, op: HDFS_WRITE, cliID: libhdfs3_client_random_1943875928_count_329741_pid_31778_tid_139650573580032, offset: 0, srvID: 66666ee2-f0b1-472f-ae97-adb2418b61b7, blockid: BP-106388200-xxx-1508315348381:blk_1104911380_31178890, duration: 8417209028
2020-05-25 22:13:48,829 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-106388200-xxx-1508315348381 (Datanode Uuid 66666ee2-f0b1-472f-ae97-adb2418b61b7) service to xxx1/xxx:9000
java.util.ConcurrentModificationException: modification=2962864 != iterModification = 2962863
        at org.apache.hadoop.util.LightWeightGSet$SetIterator.ensureNext(LightWeightGSet.java:305)
        at org.apache.hadoop.util.LightWeightGSet$SetIterator.hasNext(LightWeightGSet.java:322)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1813)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:335)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:629)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:771)
        at java.lang.Thread.run(Thread.java:748)
2020-05-25 22:13:48,829 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-106388200-xxx-1508315348381:blk_1104911380_31178890, type=LAST_IN_PIPELINE terminating

 

From the symptoms above it is clear that a block report storm was underway. The DataNode periodically sends its block information to the NameNode, but because of the report storm there was no telling when this node's blocks would ever be fully reported (the node holds just under 50 TB of data). Changing the dfs.blockreport.split.threshold parameter did not appear to help either, so the problem was stuck at this point.
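For context, dfs.blockreport.split.threshold controls whether a DataNode sends one combined block report or one report per storage (volume). A rough sketch of checking and adjusting it follows; the value and the DataNode address are illustrative assumptions, not the settings used on this cluster:

# Check the configured value (reads the local hdfs-site.xml)
hdfs getconf -confKey dfs.blockreport.split.threshold

# In hdfs-site.xml on the DataNode (illustrative value): if the number of blocks
# on the node is below this threshold, all storages go into a single report RPC;
# otherwise one RPC is sent per storage
#   <property>
#     <name>dfs.blockreport.split.threshold</name>
#     <value>5000000</value>
#   </property>

# Ask the DataNode to send a full block report immediately
# (datanode-host:50020 is a placeholder for its IPC address)
hdfs dfsadmin -triggerBlockReport datanode-host:50020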

 

Continuing with the NameNode log:

        at java.lang.Thread.run(Thread.java:748)
2020-05-27 17:19:45,424 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: libhdfs3_client_random_1124367707_count_1513119_pid_195169_tid_140150889772800, pending creates: 1] has expired hard limit
2020-05-27 17:19:45,424 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: libhdfs3_client_random_1124367707_count_1513119_pid_195169_tid_140150889772800, pending creates: 1], src=/user/xxx/9.log.ok.ok.ok.ok.gz

The NameNode log was flooded with lease-hard-limit-expired messages for this block; almost the entire log file consisted of these few lines. Checking the block with fsck failed and returned a NullPointerException, which was maddening. To guard against data loss I then tried to copy the data to local disk, and that too mercilessly returned a NullPointerException.
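The two checks just described look roughly like this; the path is taken from the NameNode log above, and in a healthy cluster both commands would succeed rather than fail with an NPE:

# Inspect the file that owns the expired lease; normally this prints its blocks,
# their locations, and a health summary, but here it died with a NullPointerException
hdfs fsck /user/xxx/9.log.ok.ok.ok.ok.gz -files -blocks -locations

# Try to save a local copy of the data before doing anything destructive;
# for this file the copy also failed with a NullPointerException
hdfs dfs -get /user/xxx/9.log.ok.ok.ok.ok.gz /tmp/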

Analysis:

My guess is that while the NameNode was trying to assign nodes for this corrupt block, it tied up a large number of handler threads, so the DataNode could not report its blocks normally, which caused the block report storm. Because the NameNode could not obtain the node's block information, many data blocks that should have been deleted were not removed in time, while the NameNode still kept placing new replicas on this node, which is why the node's disk usage kept climbing so fast.
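One way to sanity-check this theory is to look at the NameNode's RPC metrics over JMX; the host, HTTP port, and RPC port below are placeholders matching a typical Hadoop 2.x layout (the RPC port 9000 comes from the log above), not values confirmed for this cluster:

# RPC metrics for the NameNode's 9000 port; a persistently large CallQueueLength
# and a rising RpcQueueTimeAvgTime suggest the handler threads are saturated
curl -s 'http://namenode-host:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort9000'

# Number of handler threads serving that call queue (reads the local hdfs-site.xml)
hdfs getconf -confKey dfs.namenode.handler.count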

 

Solution:

In the end I deleted the corrupt block's file. Watching the DataNode's log afterwards, it was executing a large number of delete operations; combined with running the balancer, the cluster eventually reached a balanced state.
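Roughly, the cleanup amounts to the following; the path comes from the NameNode log, while the choice of -rm -skipTrash versus fsck -delete and the balancer threshold are my assumptions rather than details quoted from the original:

# Remove the file whose block could not be recovered (its data is lost)
hdfs dfs -rm -skipTrash /user/xxx/9.log.ok.ok.ok.ok.gz

# Alternatively, fsck can list and delete corrupt files under a path
# hdfs fsck /user/xxx -list-corruptfileblocks
# hdfs fsck /user/xxx -delete

# Watch the DataNode log for the resulting block deletions, then rebalance until
# the node's usage comes back in line with the cluster average
hdfs balancer -threshold 10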

