Summary of problems encountered while operating a Hadoop cluster


1. ZooKeeper error

2017-12-13 16:47:55,968 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2017-12-13 16:47:55,968 [myid:] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Cause: the ZooKeeper node was down. Starting it again resolves the error.
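
Before restarting, it can help to confirm that nothing is listening on the ZooKeeper client port. The following is a minimal sketch, assuming the port 2181 shown in the log above; the helper name is_port_open is made up for illustration and is not part of any ZooKeeper tooling.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (for example ConnectionRefusedError)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling is_port_open("localhost", 2181) returning False matches the "Connection refused" in the stack trace; True after the restart suggests the node is accepting connections again.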

 

2. Kafka consumer error: Job aborted due to stage failure: kafka.common.OffsetOutOfRangeException

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): kafka.common.OffsetOutOfRangeException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

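The Kafka message retention period is log.retention.hours=168.

Solution: the cause is that the offset recorded for the consumer group is older than the earliest message Kafka still retains. The blog posts listed under References explain this in more detail.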

Get the minimum (earliest) offset of the topic mysqlslowlog:

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -2

Get the maximum (latest) offset of the topic mysqlslowlog:

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -1

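Update the topic partition offset in ZooKeeper:

# Look up the offset recorded for partition 0

get /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0

# Set the offset of partition 0 to the minimum value

set /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0 3546232

Alternatively, the following command updates the offsets to the minimum value in bulk: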

./kafka-run-class.sh kafka.tools.UpdateOffsetsInZK earliest 
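
The manual steps above boil down to reading the earliest and latest offsets and moving the stored consumer offset back into that range. A rough sketch of that logic follows, assuming GetOffsetShell prints one topic:partition:offset line per partition; the function names parse_offsets and clamp_offset are hypothetical and the sample numbers other than 3546232 are invented.

```python
from typing import Dict


def parse_offsets(output: str) -> Dict[int, int]:
    """Parse 'topic:partition:offset' lines into {partition: offset}."""
    offsets: Dict[int, int] = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so only the last two fields are interpreted.
        _topic, partition, offset = line.rsplit(":", 2)
        offsets[int(partition)] = int(offset)
    return offsets


def clamp_offset(stored: int, earliest: int, latest: int) -> int:
    """Move a stored consumer offset into the range the broker still holds."""
    return max(earliest, min(stored, latest))


# Invented sample: the consumer is at 100, but the broker starts at 3546232.
earliest = parse_offsets("mysqlslowlog:0:3546232")
latest = parse_offsets("mysqlslowlog:0:4000000")
new_offset = clamp_offset(100, earliest[0], latest[0])
```

Here new_offset is the value that would be written with the ZooKeeper set command for partition 0.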

 

References:

http://blog.csdn.net/xueba207/article/details/51135423
http://blog.csdn.net/xueba207/article/details/51174818

 

3. Error when restarting an HBase RegionServer node:

Server ...,1514436003346 has been rejected; Reported time is too far out of sync with master.  Time difference of 136758ms > max allowed of 30000ms

This is usually caused by the clocks of the HMaster node and the RegionServer node being out of sync. Synchronize the time and restart the node.
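
To see how far apart the clocks were, the two numbers can be pulled out of the rejection message. This is only an illustrative sketch; the pattern is written against the message quoted above, and the name parse_clock_skew is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the tail of the rejection message quoted in this section.
SKEW_PATTERN = re.compile(r"Time difference of (\d+)ms > max allowed of (\d+)ms")


def parse_clock_skew(log_line: str) -> Optional[Tuple[int, int]]:
    """Return (observed_ms, allowed_ms) or None if the line does not match."""
    match = SKEW_PATTERN.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("Reported time is too far out of sync with master.  "
        "Time difference of 136758ms > max allowed of 30000ms")
skew = parse_clock_skew(line)
```

For the message in this section the observed difference is 136758 ms, roughly 137 seconds, against an allowed 30000 ms.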

 

4. When decommissioning an HDFS DataNode, the DataNode stays in the Decommission In Progress state

Check through the web UI:

# Blocks below the required number of replicas
Under replicated blocks :2979
# Blocks with no replicas
Blocks with no live replicas: 0
# Blocks below the required number of replicas that are still being written
Under Replicated Blocks In files under construction:1

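Alternatively, check the DataNode status with the ./bin/hadoop dfsadmin -report command.

The replication factor is 2. The Under replicated blocks count keeps dropping, and once it reaches 0 the node should be fully decommissioned.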
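
The check described above can be expressed as a small parser over the counter lines copied from the web UI. This is a sketch under the assumption that the text has the name:value shape shown in this section; parse_block_counters and decommission_finished are made-up helper names.

```python
from typing import Dict


def parse_block_counters(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines into a dict keyed by lowercased name."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counters[name.strip().lower()] = int(value)
    return counters


def decommission_finished(counters: Dict[str, int]) -> bool:
    """True once no under-replicated blocks remain (missing key counts as not done)."""
    return counters.get("under replicated blocks", -1) == 0


SAMPLE = """
Under replicated blocks :2979
Blocks with no live replicas: 0
Under Replicated Blocks In files under construction:1
"""
counters = parse_block_counters(SAMPLE)
```

With the sample values from this section, decommission_finished(counters) is False because 2979 blocks are still under-replicated.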

In addition, because another DataNode in the same rack usually holds a replica, lowering the replication factor is a quick way to take the DataNode offline.

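# Check the cluster status

./bin/hadoop fsck / -blocks -locations -files

# Change the replication factor (only when Blocks with no live replicas is 0)

./bin/hadoop fs -setrep -R 1 /

# Stop the DataNode

./sbin/hadoop-daemon.sh stop datanode

# Remove the node from the slaves list and the rack list

 

# Run refreshNodes, or restart the NameNodes one at a time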

./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes

5. Error when removing an HDFS DataNode

Failed to add xxxxxxxx:50010: You cannot have a rack and a non-rack node at the same level of the network topology.

Solution:

Check the rack list with ./bin/hdfs dfsadmin -printTopology.

Refresh:

./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes

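This did not help:
(1) the web page still showed the DataNode with status dead,
(2) the error You cannot have a rack and a non-rack node at the same level of the network topology. was still reported.

Restarting the NameNodes one at a time took effect: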

./sbin/hadoop-daemon.sh stop namenode
./sbin/hadoop-daemon.sh start namenode

Then check the rack information with

./bin/hdfs dfsadmin -printTopology

The node that should have been removed no longer appears.
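
The -printTopology output can also be checked programmatically. The sketch below assumes the output consists of "Rack: <path>" lines followed by indented "ip:port (hostname)" lines; the addresses, rack names and helper names are invented for illustration. It also reports the depth of each rack path, on the assumption (not stated in the original notes) that rack paths of mixed depth are one possible trigger for the error above.

```python
from typing import Dict, List, Set


def parse_topology(output: str) -> Dict[str, List[str]]:
    """Map each rack path to the list of 'ip:port' addresses under it."""
    racks: Dict[str, List[str]] = {}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Rack:"):
            current = stripped[len("Rack:"):].strip()
            racks[current] = []
        elif stripped and current is not None:
            # First token is the address; the rest is the hostname in parentheses.
            racks[current].append(stripped.split()[0])
    return racks


def node_present(racks: Dict[str, List[str]], address: str) -> bool:
    """True if the address still appears under any rack."""
    return any(address in nodes for nodes in racks.values())


def rack_depths(racks: Dict[str, List[str]]) -> Set[int]:
    """Number of path components per rack; more than one value means mixed depths."""
    return {rack.count("/") for rack in racks}


SAMPLE = """Rack: /default-rack
   10.0.0.3:50010 (node3)

Rack: /dc1/rack1
   10.0.0.1:50010 (node1)
   10.0.0.2:50010 (node2)
"""
racks = parse_topology(SAMPLE)
```

After the NameNode restart, node_present(racks, address) should be False for the removed node, and rack_depths(racks) containing a single value indicates all racks sit at the same level.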

 

