What should you do when a broker in a Kafka cluster goes down and cannot be recovered?



I was asked this question in an interview today. According to what I found online, adding a new broker does not automatically sync the old data to it.

The brute-force approach

Environment

A three-broker cluster, with ZooKeeper and Kafka running on the same hosts:

| broker | IP | broker.id |
|---------|---------------|-----------|
| broker1 | 172.18.12.211 | 211 |
| broker2 | 172.18.12.212 | 212 |
| broker3 | 172.18.12.213 | 213 |

Create a test topic

#./bin/kafka-topics.sh --zookeeper 172.18.12.212:2181 --create --topic test1 --replication-factor 3 --partitions 1
Created topic "test1".

Describe it

#./bin/kafka-topics.sh --zookeeper 172.18.12.212:2181 --describe --topic test1
Topic:test1 PartitionCount:1 ReplicationFactor:3 Configs:
        Topic: test1 Partition: 0 Leader: 213 Replicas: 213,212,211 Isr: 213,212,211

Note the current state: Replicas: 213,212,211 and Isr: 213,212,211.

Produce some messages

#./bin/kafka-console-producer.sh --broker-list 172.18.12.212:9092 --topic test1
>1
>2
>3

kill broker2

[root@node024212 ~]# ps -ef| grep kafka
root 17633 1 1 Feb17 ? 00:55:18 /usr/local/java/bin/java -server -Xmx2g - ...
[root@node024212 ~]# kill -9 17633
[root@node024212 ~]# ps -ef| grep kafka
root 21875 21651 0 11:27 pts/2 00:00:00 grep --color=auto kafka

Wait a moment, then describe test1 again

#./bin/kafka-topics.sh --zookeeper 172.18.12.212:2181 --describe --topic test1
Topic:test1 PartitionCount:1 ReplicationFactor:3 Configs:
        Topic: test1 Partition: 0 Leader: 213 Replicas: 213,212,211 Isr: 213,211

The replica list is unchanged (Replicas: 213,212,211), but the ISR has shrunk to Isr: 213,211.

Start a new broker on 212

Create a new configuration file and start a new broker with it:

# cp server.properties server2.properties 
# vim server2.properties 
Only these two parameters are changed:
broker.id=218
log.dirs=/DATA21/kafka/kafka-logs,/DATA22/kafka/kafka-logs,/DATA23/kafka/kafka-logs,/DATA24/kafka/kafka-logs

Create the corresponding directories

mkdir -p /DATA21/kafka/kafka-logs
mkdir -p /DATA22/kafka/kafka-logs
mkdir -p /DATA23/kafka/kafka-logs
mkdir -p /DATA24/kafka/kafka-logs

Start the new broker

./bin/kafka-server-start.sh -daemon config/server2.properties 

Wait a moment, then check the status of test1

#./bin/kafka-topics.sh --zookeeper 172.18.12.212:2181 --describe --topic test1
Topic:test1 PartitionCount:1 ReplicationFactor:3 Configs:
        Topic: test1 Partition: 0 Leader: 213 Replicas: 213,212,211 Isr: 213,218,211

The replica list for test1 is still Replicas: 213,212,211, while the ISR is now Isr: 213,218,211. In other words, the missing replica is not automatically migrated to the new broker.
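As a side note, spotting under-replicated partitions by eye gets tedious once there are many topics. A minimal sketch (not from the original post) that compares the Replicas and Isr fields of a describe output line in the format shown above:

```shell
# A minimal sketch: flag a partition as under-replicated when its Isr list
# is shorter than its Replicas list. The sample line follows the
# kafka-topics.sh --describe output format shown above.
line='Topic: test1 Partition: 0 Leader: 213 Replicas: 213,212,211 Isr: 213,211'
report=$(echo "$line" | awk '{
  nr = split($8, r, ",")    # number of assigned replicas
  ni = split($10, s, ",")   # number of in-sync replicas
  if (ni < nr) print $2 "-" $4 ": under-replicated (Isr " $10 " vs Replicas " $8 ")"
}')
echo "$report"
```

In a real cluster you would pipe the full `--describe` output through the awk filter instead of a hard-coded sample line.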

Reassign the partition with kafka-reassign-partitions.sh

Remove 212 from the replica list and add 218:

[root@node024211 12:04:48 /usr/local/kafka]
#echo '{"version":1,"partitions":[{"topic":"test1","partition":0,"replicas":[211,213,218]}]}' > increase-replication-factor.json

[root@node024211 12:58:30 /usr/local/kafka]
#./bin/kafka-reassign-partitions.sh --zookeeper 172.18.12.211:2181 --reassignment-json-file increase-replication-factor.json --execute
Current partition replica assignment

{"version":1,"partitions":[{"topic":"test1","partition":0,"replicas":[213,212,211],"log_dirs":["any","any","any"]}]}

Save this to use as the --reassignment-json-file option during rollback
Successfully started reassignment of partitions.

[root@node024211 12:58:49 /usr/local/kafka]
#./bin/kafka-reassign-partitions.sh --zookeeper 172.18.12.211:2181 --reassignment-json-file increase-replication-factor.json --verify
Status of partition reassignment: 
Reassignment of partition test1-0 completed successfully
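For a topic with many partitions, writing the plan JSON by hand is error-prone. A minimal sketch, assuming the current-assignment JSON that `--execute` printed above: it swaps every occurrence of the dead broker id for the replacement. This naive textual substitution is safe here only because "212" appears solely as a broker id in this JSON; a real script should rewrite the replica arrays properly.

```shell
# A minimal sketch (not part of the original walkthrough): derive the new
# reassignment plan from the current assignment, replacing dead broker 212
# with replacement broker 218 via plain text substitution.
current='{"version":1,"partitions":[{"topic":"test1","partition":0,"replicas":[213,212,211],"log_dirs":["any","any","any"]}]}'
plan=$(echo "$current" | sed 's/212/218/g')
echo "$plan" > increase-replication-factor.json
echo "$plan"
```

The resulting replica order ([213,218,211]) differs from the hand-written plan above; the order only affects which replica is the preferred leader.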

Check the topic info

#./bin/kafka-topics.sh --zookeeper 172.18.12.212:2181 --describe --topic test1
Topic:test1 PartitionCount:1 ReplicationFactor:3 Configs:
        Topic: test1 Partition: 0 Leader: 213 Replicas: 211,213,218 Isr: 213,211,218

Verify that 218 has all the data

Although 218 now appears in the replica info, does it actually contain the old messages?
My approach: kill 211 and 213, then consume from 218 with --from-beginning. In practice this works:

#./bin/kafka-console-consumer.sh --bootstrap-server 172.18.12.212:9092 --topic test1 --from-beginning
1
2
3
4
5
6
7
8
9
10
11
11

The log files on 211 and 218 are also the same size:

[2019-02-21 13:29:19]#ls -l /DATA22/kafka/kafka-logs/test1-0/
[2019-02-21 13:29:19]total 8
[2019-02-21 13:29:19]-rw-r--r--. 1 root root 10485760 Feb 21 12:58 00000000000000000000.index
[2019-02-21 13:29:19]-rw-r--r--. 1 root root 381 Feb 21 13:00 00000000000000000000.log
[2019-02-21 13:29:19]-rw-r--r--. 1 root root 10485756 Feb 21 12:58 00000000000000000000.timeindex
[2019-02-21 13:29:19]-rw-r--r--. 1 root root 16 Feb 21 13:00 leader-epoch-checkpoint

A simpler approach

Reading the documentation, I came across this FAQ entry:
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Howtoreplaceafailedbroker

How to replace a failed broker?
When a broker fails, Kafka doesn’t automatically re-replicate the data on the failed broker to other brokers. This is because in the common case, one brings down a broker to apply code or config changes, and will bring up the broker quickly afterward. Re-replicating the data in this case will be wasteful. In the rarer case that a broker fails completely, one will need to bring up another broker with the same broker id on a new server. The new broker will automatically replicate the missing data.

In other words, if the server is truly dead, you only need to start a new broker with broker.id set to the failed broker's id, and the missing data will be replicated over automatically.

I tested this, and the data is indeed recovered.
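The FAQ's approach can be sketched as follows; the path and the two properties are illustrative, not a complete broker configuration:

```shell
# A minimal sketch of the FAQ's approach: on the replacement server, reuse
# the failed broker's id (212 in this walkthrough) so that Kafka
# re-replicates the missing data automatically. Paths are illustrative.
mkdir -p /tmp/kafka-demo
printf 'broker.id=212\nlog.dirs=/tmp/kafka-demo/kafka-logs\n' \
  > /tmp/kafka-demo/server-replacement.properties
grep '^broker.id' /tmp/kafka-demo/server-replacement.properties
# On the real replacement server you would then start the broker:
#   ./bin/kafka-server-start.sh -daemon config/server-replacement.properties
```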

