1. ZooKeeper error
2017-12-13 16:47:55,968 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2017-12-13 16:47:55,968 [myid:] - WARN [main-SendThread(localhost:2181):ClientCnxn$SendThread@1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
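Cause: the ZooKeeper node is down, so the client connection to localhost:2181 is refused. Starting the node again resolves it.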
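Before restarting, it can help to confirm that nothing is listening on the client port. The following is only a sketch, not part of the original notes: the function name `zk_port_open` is hypothetical, and it assumes that `bash` (for its `/dev/tcp` pseudo-device) and the coreutils `timeout` command are installed.

```shell
# Hypothetical helper: exits 0 if a TCP connection to HOST:PORT can be opened.
# Assumes bash and coreutils `timeout` are available on the machine.
zk_port_open() {
    # bash opens a TCP socket when a redirection targets /dev/tcp/HOST/PORT;
    # the socket is closed again when the child bash exits.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example (not executed here):
#   zk_port_open localhost 2181 || echo "ZooKeeper is not reachable on 2181"
```

If the check fails, the node is typically started with `bin/zkServer.sh start` from the ZooKeeper installation directory.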
2. Kafka consumer error: Job aborted due to stage failure: kafka.common.OffsetOutOfRangeException
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): kafka.common.OffsetOutOfRangeException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
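Kafka message retention: log.retention.hours=168
Solution: the offset the consumer group wants to read is older than the earliest message Kafka still stores, because the retention period has already removed those messages. The blog posts listed under References explain this in more detail.
Get the minimum offset of topic mysqlslowlog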
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -2
Get the maximum offset of topic mysqlslowlog
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -1
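The two commands above give the valid offset range for each partition. As a rough sketch of how that output could be used, the helpers below extract one partition's offset and test whether a committed offset still lies inside the range. The function names are hypothetical, and the sketch assumes GetOffsetShell prints one `topic:partition:offset` line per partition. The upper bound in the usage example is an illustrative number, not a value from these notes.

```shell
# Hypothetical helpers for interpreting GetOffsetShell output.
# Assumption: each output line has the form topic:partition:offset.

# Print the offset of partition $1; reads GetOffsetShell output on stdin.
partition_offset() {
    awk -F: -v p="$1" '$2 == p { print $3 }'
}

# Exit 0 if earliest <= committed <= latest.
# Arguments: committed earliest latest
offset_in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Example (not executed here):
#   earliest=$(./kafka-run-class.sh kafka.tools.GetOffsetShell \
#       --broker-list=node:9092 --topic mysqlslowlog --time -2 | partition_offset 0)
#   offset_in_range "$committed" "$earliest" "$latest" || echo "offset out of range"
```

A committed offset below the earliest value is exactly the situation that raises OffsetOutOfRangeException.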
Update the topic partition offset in ZooKeeper
# Query the offset stored for partition 0
get /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0
# Set the offset of partition 0 to the minimum value
set /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0 3546232
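Alternatively, the following command resets all partitions to the minimum offset in one pass (the tool's usage string is `[earliest | latest] consumer.properties topic`, so append the consumer properties file and the topic name)
./kafka-run-class.sh kafka.tools.UpdateOffsetsInZK earliest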
References:
http://blog.csdn.net/xueba207/article/details/51135423
http://blog.csdn.net/xueba207/article/details/51174818
3. Error when restarting an HBase RegionServer node:
Server ...,1514436003346 has been rejected; Reported time is too far out of sync with master. Time difference of 136758ms > max allowed of 30000ms
This usually means the clocks on the HMaster node and the RegionServer node disagree. Synchronize the clocks, then restart the node.
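The check the master applies is simple arithmetic on the two clocks. The sketch below reproduces it with hypothetical function names. The 30000 ms default is taken from the log above (to my knowledge the limit is configured through `hbase.master.maxclockskew`), and the second timestamp in the example is an illustrative value chosen so that the difference matches the logged 136758 ms.

```shell
# Hypothetical helper: absolute difference between two epoch timestamps (ms).
clock_skew_ms() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Exit 0 if the skew exceeds the allowed maximum.
# Arguments: time_a_ms time_b_ms [max_ms]; max_ms defaults to 30000 as in the log.
skew_too_large() {
    [ "$(clock_skew_ms "$1" "$2")" -gt "${3:-30000}" ]
}

# Example (not executed here):
#   skew_too_large 1514436003346 1514436140104 && echo "sync the clocks first"
```

Collecting `date +%s%3N` (GNU date) on both hosts gives the two inputs. After fixing the skew, for example with NTP, the RegionServer should register normally.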
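4. Decommissioning an HDFS DataNode: the DataNode stays in the Decommission In Progress state
Check through the web UI:
# Blocks with fewer replicas than required
Under replicated blocks :2979
# Blocks with no replica at all
Blocks with no live replicas: 0
# Blocks with fewer replicas than required that are still being written
Under Replicated Blocks In files under construction:1
Alternatively, check the DataNode status with ../bin/hadoop dfsadmin -report.
The replication factor here is 2. Under replicated blocks keeps decreasing, and once it reaches 0 the node should be fully decommissioned.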
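To watch that counter without reloading the web UI, the value can be pulled out of the report text. This is a sketch with a hypothetical function name. It assumes the report contains a line starting with `Under replicated blocks` followed by a colon and the count, as in the web UI output quoted above.

```shell
# Hypothetical helper: print the under-replicated block count found in
# `hadoop dfsadmin -report` output supplied on stdin.
# Assumption: the report has a line like "Under replicated blocks: 2979".
under_replicated_count() {
    awk -F: '/^Under replicated blocks/ {
        gsub(/[^0-9]/, "", $2)   # keep only the digits of the value
        print $2
        exit                     # the first (cluster-wide) line is enough
    }'
}

# Example (not executed here): poll until the counter reaches 0.
#   while [ "$(../bin/hadoop dfsadmin -report | under_replicated_count)" -gt 0 ]; do
#       sleep 60
#   done
```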
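In addition, because the DataNodes in the same rack generally already hold one replica, the DataNode can be taken offline faster by lowering the replication factor
# Check the cluster state
./bin/hadoop fsck / -blocks -locations -files
# Change the replication factor (only when Blocks with no live replicas is 0); this applies to every file under /
./bin/hadoop fs -setrep -R 1 /
# Stop the DataNode
./sbin/hadoop-daemon.sh stop datanode
# Remove the node from the slaves list and the rack list
# Refresh the node lists, or restart the NameNodes one at a time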
./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes
5. Error while removing an HDFS DataNode
Failed to add xxxxxxxx:50010: You cannot have a rack and a non-rack node at the same level of the network topology.
Solution:
Check the rack list with ./bin/hdfs dfsadmin -printTopology
Refresh the node lists
./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes
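This did not help:
(1) the web page still shows the DataNode with status dead,
(2) the error You cannot have a rack and a non-rack node at the same level of the network topology. is still reported.
Restarting the NameNodes one at a time made the change take effect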
./sbin/hadoop-daemon.sh stop namenode
./sbin/hadoop-daemon.sh start namenode
Check the rack information with
./bin/hdfs dfsadmin -printTopology
The node that was removed should no longer be listed.
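The error in this section is raised when a node's rack path has a different depth from the paths already in the topology, for example `/default-rack` next to `/dc1/rack1`. The sketch below, with hypothetical function names, lists the depths found in the -printTopology output and computes the depth of a single path for comparison. It assumes rack lines have the form `Rack: /path`. A rejected node never appears in that output, so the useful comparison is between the depths printed here and the path your topology mapping assigns to the failing node.

```shell
# Hypothetical helper: print the distinct rack-path depths found in
# `hdfs dfsadmin -printTopology` output supplied on stdin.
# Assumption: rack lines look like "Rack: /default-rack".
rack_depths() {
    awk '/^Rack:/ { print split($2, parts, "/") - 1 }' | sort -un
}

# Hypothetical helper: depth of one rack path, e.g. /dc1/rack1 has depth 2.
path_depth() {
    printf '%s\n' "$1" | awk -F/ '{ print NF - 1 }'
}

# Example (not executed here):
#   ./bin/hdfs dfsadmin -printTopology | rack_depths
#   path_depth /default-rack
```

If `rack_depths` prints more than one number, or `path_depth` for the failing node differs from it, the rack mapping is inconsistent.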