Background
The disks filled up and every Kafka broker went down. After restarting the Kafka cluster the services came back up, but a number of errors were logged:
broker1:
broker2:
broker3: kept printing the following error over and over
Despite these errors, Kafka started normally, and a command-line test confirmed that the cluster could produce and consume messages. The kafka-manager UI, however, showed partitions with unassigned (under-replicated) replicas:
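The same under-replicated state can also be checked from the command line. A minimal sketch, with the ZooKeeper addresses masked the same way as elsewhere in this post:
bin/kafka-topics.sh --zookeeper 10.0.xx.x:2181,10.0.xx.x:2181,10.0.xx.x:2181 --describe --under-replicated-partitions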
Checking the applications that consume these topics confirmed that consumption was indeed failing; they kept throwing the following exception:
Note: the IP shown in the screenshot is the broker3 node.
At this point it was clear that something was wrong with broker3 and that this was why the consumer applications could not connect. Oddly, though, when a test topic was created from the command line, messages could still be consumed from broker3.
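For reference, the connectivity test was roughly of the following shape. This is only a sketch: the topic name conn-test, port 9092 and the masked IPs are placeholders, not the exact values used at the time.
# create a small test topic
bin/kafka-topics.sh --zookeeper 10.0.xx.x:2181 --create --topic conn-test --partitions 1 --replication-factor 1
# produce a few messages against broker3, then consume them back
bin/kafka-console-producer.sh --broker-list 10.0.xx.x:9092 --topic conn-test
bin/kafka-console-consumer.sh --bootstrap-server 10.0.xx.x:9092 --topic conn-test --from-beginning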
Further analysis of the broker3 logs showed the reason for the error: the cluster requires 2 replicas, but only 1 could be found.
Describing the affected topics confirmed that replicas were indeed missing from the ISR list.
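The check itself is a plain describe; a sketch with the ZooKeeper address masked:
bin/kafka-topics.sh --zookeeper 10.0.xx.x:2181 --describe --topic __consumer_offsets
For the affected partitions, the Isr column in the output listed fewer brokers than the Replicas column.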
My guess was that after the crash some replicas had fallen too far behind the leader and had not yet caught up, which is why they were dropped from the ISR list, so I decided to wait for them to catch up on their own.
By the next day nothing had changed; they still had not caught up. So I restarted the Kafka cluster. Some partitions expanded back to 2 replicas automatically, but the problematic partitions still did not recover...
Next I tried reassigning the partitions with explicitly specified replicas, to see whether that would force the replicas to be rebuilt, using the following command:
bin/kafka-reassign-partitions.sh --zookeeper 10.0.xx.x:2181,10.0.xx.x:2181,10.0.xx.x:2181 --reassignment-json-file reassign.json --execute
Contents of reassign.json:
{"version":1, "partitions":[ {"topic":"__consumer_offsets","partition":0,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":1,"replicas":[3,1]}, {"topic":"__consumer_offsets","partition":2,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":3,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":4,"replicas":[3,2]}, {"topic":"__consumer_offsets","partition":5,"replicas":[1,3]}, {"topic":"__consumer_offsets","partition":6,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":7,"replicas":[3,1]}, {"topic":"__consumer_offsets","partition":8,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":9,"replicas":[2,1]}, {"topic":"__consumer_offsets","partition":10,"replicas":[3,2]}, {"topic":"__consumer_offsets","partition":11,"replicas":[1,3]}, {"topic":"__consumer_offsets","partition":12,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":13,"replicas":[3,1]}, {"topic":"__consumer_offsets","partition":14,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":15,"replicas":[2,1]}, {"topic":"__consumer_offsets","partition":16,"replicas":[3,2]}, {"topic":"__consumer_offsets","partition":17,"replicas":[1,3]}, {"topic":"__consumer_offsets","partition":18,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":19,"replicas":[3,1]}, {"topic":"__consumer_offsets","partition":20,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":21,"replicas":[2,1]}, {"topic":"__consumer_offsets","partition":22,"replicas":[3,2]}, {"topic":"__consumer_offsets","partition":23,"replicas":[1,3]}, {"topic":"__consumer_offsets","partition":24,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":25,"replicas":[3,1]}, {"topic":"__consumer_offsets","partition":26,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":27,"replicas":[2,1]}, {"topic":"__consumer_offsets","partition":28,"replicas":[3,2]}, {"topic":"__consumer_offsets","partition":29,"replicas":[1,3]}, {"topic":"__consumer_offsets","partition":30,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":31,"replicas":[3,1]}, {"topic":"__consumer_offsets","partition":32,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":33,"replicas":[2,1]}, {"topic":"__consumer_offsets","partition":34,"replicas":[3,2]}, {"topic":"__consumer_offsets","partition":35,"replicas":[1,3]}, {"topic":"__consumer_offsets","partition":36,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":37,"replicas":[3,1]}, {"topic":"__consumer_offsets","partition":38,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":39,"replicas":[2,1]}, {"topic":"__consumer_offsets","partition":40,"replicas":[3,2]}, {"topic":"__consumer_offsets","partition":41,"replicas":[1,3]}, {"topic":"__consumer_offsets","partition":42,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":43,"replicas":[3,1]}, {"topic":"__consumer_offsets","partition":44,"replicas":[1,2]}, {"topic":"__consumer_offsets","partition":45,"replicas":[2,1]}, {"topic":"__consumer_offsets","partition":46,"replicas":[3,2]}, {"topic":"__consumer_offsets","partition":47,"replicas":[1,3]}, {"topic":"__consumer_offsets","partition":48,"replicas":[2,3]}, {"topic":"__consumer_offsets","partition":49,"replicas":[3,1]} ]}`
Reassigning partitions with explicit replicas did not help either, so I changed the Kafka configuration and lowered the replica count required by the cluster to 1:
vi server.properties
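The exact property is not shown here; based on the "requires 2 replicas but only found 1" error, a plausible change (an assumption, not a verified excerpt from the actual config) would be:
# assumption: lower the minimum number of in-sync replicas required for writes from 2 to 1
min.insync.replicas=1
# if the error concerned the internal offsets topic, this setting (also an assumption) may be the relevant one instead:
# offsets.topic.replication.factor=1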
After restarting the Kafka cluster, broker3 no longer reported the error, and once the consumer applications were restarted they could connect to Kafka and consume normally again.
Summary:
After the Kafka outage, some replicas dropped out of the ISR list (they had fallen too far behind the leader). Normally they would gradually catch up and rejoin the ISR automatically, but in my case they still had not after 20 hours, and restarting the Kafka cluster did not bring them back either, which left the services in a broken state.
The temporary workaround is to lower the required replica count to 1, let the cluster run for a while, and see whether the replicas recover; if they do, the setting will be raised back to 2.
Problems:
1. The root cause has not yet been found;
2. With this adjustment Kafka can lose data (the data produced during the outage window was lost).