Summary of problems encountered while operating a Hadoop cluster

Reposted article, 2017-12-29 19:27. Tags: zookeeper / kafka / hadoop / hbase

1. ZooKeeper error

2017-12-13 16:47:55,968 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)

2017-12-13 16:47:55,968 [myid:] - WARN [main-SendThread(localhost:2181):ClientCnxn$SendThread@1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect

java.net.ConnectException: Connection refused
     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Cause: the ZooKeeper node was down. Starting it again resolves the error.
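Before restarting anything, it can help to confirm that nothing is listening on the ZooKeeper client port. The following is a minimal sketch in Python, not part of the original notes; the function name zk_port_open is made up for illustration, and the defaults (localhost, port 2181) are taken from the log above. It only shows whether the TCP port accepts connections, not whether the ensemble is healthy.

```python
import socket


def zk_port_open(host: str = "localhost", port: int = 2181, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper: a refused connection here corresponds to the
    "java.net.ConnectException: Connection refused" seen in the client log.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError and socket timeouts are both OSError subclasses.
        return False
```

Calling zk_port_open() on the affected host should return False while the ZooKeeper process is down and True once it has been started again.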

2. Kafka consumer error: Job aborted due to stage failure: kafka.common.OffsetOutOfRangeException

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): kafka.common.OffsetOutOfRangeException

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

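Kafka message retention period: log.retention.hours=168 (7 days)

Solution: the cause is that the offset committed by the consumer group is older than the earliest message Kafka still stores. The blog posts listed under References explain this in more detail.

Get the minimum (earliest) offset of the topic mysqlslowlog: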

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -2

Get the maximum (latest) offset of the topic mysqlslowlog:

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -1

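Update the topic partition's offset in ZooKeeper:

# Read the offset currently stored for partition 0

get /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0

# Set partition 0 to the minimum offset

set /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0 3546232

Alternatively, the following tool can reset all partitions to the minimum offset in one go (it also expects a consumer properties file and the topic name as further arguments):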

./kafka-run-class.sh kafka.tools.UpdateOffsetsInZK earliest
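The check behind this fix can be sketched in Python: parse the GetOffsetShell output for the earliest and latest offsets, then find the partitions whose committed offset falls outside that range. This is an illustration only. It assumes the tool prints one topic:partition:offset line per partition; the helper names parse_offsets and partitions_to_reset are hypothetical, and the committed offsets would have to be read from ZooKeeper separately.

```python
import re
from typing import Dict

# Assumed output format of GetOffsetShell: "topic:partition:offset" per line.
_OFFSET_LINE = re.compile(r"^([\w.\-]+):(\d+):(\d+)$")


def parse_offsets(output: str) -> Dict[int, int]:
    """Parse GetOffsetShell-style output into {partition: offset}."""
    offsets = {}
    for line in output.splitlines():
        match = _OFFSET_LINE.match(line.strip())
        if match:  # ignore blank lines and anything that is not an offset line
            offsets[int(match.group(2))] = int(match.group(3))
    return offsets


def partitions_to_reset(committed: Dict[int, int],
                        earliest: Dict[int, int],
                        latest: Dict[int, int]) -> Dict[int, int]:
    """Return {partition: earliest_offset} for committed offsets out of range."""
    reset = {}
    for partition, offset in committed.items():
        if offset < earliest[partition] or offset > latest[partition]:
            # This is the situation that raises OffsetOutOfRangeException.
            reset[partition] = earliest[partition]
    return reset
```

For example, if partition 0 has an earliest offset of 3546232 and the consumer group last committed a smaller value, partitions_to_reset returns {0: 3546232}, which is the value written with set in the ZooKeeper step above.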

References:

http://blog.csdn.net/xueba207/article/details/51135423
http://blog.csdn.net/xueba207/article/details/51174818

3. Error when restarting an HBase regionserver node:

Server ...,1514436003346 has been rejected; Reported time is too far out of sync with master. Time difference of 136758ms > max allowed of 30000ms

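This usually happens because the system clocks on the HMaster node and the regionserver node are out of sync. Synchronize the clocks and restart the node.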
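To see how far apart the clocks are, the two numbers can be pulled out of the rejection message. The snippet below is a small sketch, assuming the message keeps the wording shown above; parse_clock_skew is a hypothetical helper name. The 30000 ms limit in the message appears to correspond to the HBase setting hbase.master.maxclockskew, which is worth verifying against your version's configuration reference.

```python
import re
from typing import Optional, Tuple

# Matches the tail of the rejection message quoted in this section.
_SKEW = re.compile(r"Time difference of (\d+)ms > max allowed of (\d+)ms")


def parse_clock_skew(message: str) -> Optional[Tuple[int, int]]:
    """Return (actual_skew_ms, max_allowed_ms), or None if the text does not match."""
    match = _SKEW.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Applied to the message above, this yields 136758 and 30000, so the regionserver clock was about 137 seconds off against a 30-second limit.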

4. When decommissioning an HDFS datanode, the datanode stays in the Decommission In Progress state

Check through the web UI:

# Blocks with fewer replicas than required
Under replicated blocks: 2979
# Blocks with no live replicas
Blocks with no live replicas: 0
# Under-replicated blocks in files that are still being written
Under Replicated Blocks In files under construction: 1

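Alternatively, check the datanode status with the ../bin/hadoop dfsadmin -report command.

The replication factor here is 2. The Under replicated blocks count keeps dropping, and once it reaches 0 the node should be fully decommissioned.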
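Watching the count drop can be automated by parsing the report text. The sketch below assumes the report contains an "Under replicated blocks:" line and, per datanode, "Name:" and "Decommission Status :" lines, as in Hadoop 2.x; the function names are hypothetical, and capturing the command output (for example with subprocess) is left out.

```python
import re
from typing import Dict, Optional


def under_replicated(report: str) -> Optional[int]:
    """Extract the 'Under replicated blocks' count from dfsadmin -report text."""
    match = re.search(r"^Under replicated blocks:\s*(\d+)", report, re.MULTILINE)
    return int(match.group(1)) if match else None


def decommission_status(report: str) -> Dict[str, str]:
    """Map each datanode 'Name:' entry to its 'Decommission Status' value."""
    status = {}
    current = None
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Name:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Decommission Status") and current is not None:
            status[current] = line.split(":", 1)[1].strip()
            current = None
    return status
```

A polling loop could call both functions every few minutes and stop once under_replicated returns 0 and the node no longer reports a status of Decommission in progress.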

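In addition, because another datanode in the same rack usually holds a replica of each block, the datanode can be taken offline quickly by changing the replication factor.

# Check the cluster status

./bin/hadoop fsck / -blocks -locations -files

# Change the replication factor (only do this when Blocks with no live replicas is 0)

./bin/hadoop fs -setrep -R 1 /

# Stop the datanode

./sbin/hadoop-daemon.sh stop datanode

# Remove the node from the slaves list and the rack list

# Run refreshNodes, or restart the namenodes one at a time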

./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes

5. Removing an HDFS datanode

Failed to add xxxxxxxx:50010: You cannot have a rack and a non-rack node at the same level of the network topology.

Solution:

View the rack list with ./bin/hdfs dfsadmin -printTopology

Refresh:

./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes

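This did not help:
(1) the web page still showed the datanode with status dead,
(2) the error You cannot have a rack and a non-rack node at the same level of the network topology. was still reported.

Restarting the namenodes one at a time made the change take effect: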

./sbin/hadoop-daemon.sh stop namenode
./sbin/hadoop-daemon.sh start namenode

Check the rack information with

./bin/hdfs dfsadmin -printTopology

The node that was supposed to be removed is no longer listed.
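The topology listing can also be checked programmatically, both to confirm that the removed node is gone and to spot rack paths of different depth, which is the kind of mismatch the error message complains about. This sketch assumes the output consists of "Rack: /path" lines, each followed by indented node lines; parse_topology and rack_depths are hypothetical helper names.

```python
from typing import Dict, List, Set


def parse_topology(output: str) -> Dict[str, List[str]]:
    """Parse printTopology-style text into {rack_path: [node lines]}."""
    racks = {}
    current = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Rack:"):
            current = line.split(":", 1)[1].strip()
            racks.setdefault(current, [])
        elif line and current is not None:
            racks[current].append(line)
    return racks


def rack_depths(racks: Dict[str, List[str]]) -> Set[int]:
    """Return the distinct rack path depths, e.g. '/dc1/rack1' has depth 2."""
    return {len([part for part in rack.split("/") if part]) for rack in racks}
```

If rack_depths returns more than one value, for example {1, 2} for /default-rack next to /dc1/rack1, some nodes sit at a different level of the topology than others, and the rack mapping is worth reviewing.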
