Recovering a Failed etcd Node


Problem: a routine inspection found the etcd cluster behind a Kubernetes cluster in a bad state, with one node reporting unhealthy:

[root@k8s-master1 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                  ERROR
controller-manager   Healthy     ok                                       
scheduler            Healthy     ok                                       
etcd-1               Healthy     {"health":"true"}                        
etcd-0               Healthy     {"health":"true"}                        
etcd-2               Unhealthy   HTTP probe failed with statuscode: 503

The etcd logs showed nothing obviously wrong, the system time and certificates checked out, and there was no firewall problem either, so the following recovery steps were taken.
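Before touching the cluster, it can help to probe each endpoint individually with `etcdctl endpoint health`, which pinpoints the bad node more directly than the aggregated kubectl view. The sketch below is a dry run (each command is echoed, not executed) so it can be pasted anywhere; remove the `echo` to run it for real on a host that has the certificates in place:

```shell
# Dry-run sketch: print the per-endpoint health-check command for each
# etcd member. Endpoints and cert paths are taken from the commands in
# this article; drop the leading echo to actually execute the probes.
for ep in https://172.16.23.120:2379 https://172.16.23.121:2379 https://172.16.23.122:2379; do
  echo /opt/etcd/bin/etcdctl \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="$ep" endpoint health
done
```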

1. Remove the faulty etcd node from the cluster:

[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member list
20fd79755169a89, started, etcd-3, https://172.16.23.122:2380, https://172.16.23.122:2379, false
39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false

The listing above shows that the faulty node reported as etcd-2 by kubectl corresponds to the member named etcd-3, i.e. the 172.16.23.122 machine. Note that the index kubectl assigns to an etcd endpoint does not necessarily match the etcd member name.
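Picking the right member ID out of `member list` output can be scripted. The snippet below is fed from the captured listing above rather than a live cluster, so it is self-contained:

```shell
# Sketch: extract the member ID for a given peer IP from `member list`
# output. Fields are comma-space separated: ID, status, name, peer URL,
# client URL, learner flag.
MEMBERS='20fd79755169a89, started, etcd-3, https://172.16.23.122:2380, https://172.16.23.122:2379, false
39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false'
echo "$MEMBERS" | awk -F', ' '$4 ~ /172\.16\.23\.122/ {print $1}'
# prints 20fd79755169a89, the ID passed to `member remove`
```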

[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member remove 20fd79755169a89
Member  20fd79755169a89 removed from cluster ad1f122f981ee2bf
[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member list
39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false

2. With the faulty member removed from the cluster in step 1, work on the etcd-3 node (172.16.23.122): delete its etcd data, then change the cluster state in its config file from new to existing.

# rm -rf /var/lib/etcd/default.etcd/member/
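Instead of deleting the stale data outright, a safer variant is to move it aside so the old state can still be inspected if the rejoin goes wrong. The sketch below demonstrates this against a temporary directory (so it is safe to paste); on the real node the path would be /var/lib/etcd/default.etcd as in the rm command above:

```shell
# Sketch: rename the stale member/ directory instead of rm -rf, keeping a
# timestamped backup. Demonstrated on a temp dir; substitute the real
# data dir on the affected node.
DATA_DIR=$(mktemp -d)/default.etcd
mkdir -p "$DATA_DIR/member"
BACKUP="$DATA_DIR/member.bak.$(date +%Y%m%d%H%M%S)"
mv "$DATA_DIR/member" "$BACKUP"
ls "$DATA_DIR"
```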

Edit the etcd config file, changing the line below:

Before:

ETCD_INITIAL_CLUSTER_STATE="new"

After:

ETCD_INITIAL_CLUSTER_STATE="existing"
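The new-to-existing flip can also be done with sed. The config file path varies by deployment (for a binary install like this one it is often something like /opt/etcd/cfg/etcd.conf, but that path is an assumption), so the demo below works on a temporary copy:

```shell
# Sketch: flip ETCD_INITIAL_CLUSTER_STATE from "new" to "existing" in
# place with sed. CFG stands in for the real config file path, which
# depends on your deployment.
CFG=$(mktemp)
printf 'ETCD_INITIAL_CLUSTER_STATE="new"\n' > "$CFG"
sed -i 's/^ETCD_INITIAL_CLUSTER_STATE="new"$/ETCD_INITIAL_CLUSTER_STATE="existing"/' "$CFG"
cat "$CFG"
# prints ETCD_INITIAL_CLUSTER_STATE="existing"
```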

3. Then add the etcd-3 node back into the cluster:

[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member add etcd-2 --peer-urls=https://172.16.23.122:2380
Member a98137c10970d43c added to cluster ad1f122f981ee2bf

Then check the member list (the new member shows as unstarted, with an empty name, until its etcd process joins):

[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member list
39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false
a98137c10970d43c, unstarted, , https://172.16.23.122:2380, , false

4. Restart etcd on the recovered node:

[root@k8s-master3 ~]# systemctl start etcd
[root@k8s-master3 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2021-02-28 22:04:34 CST; 4s ago

Finally, check the Kubernetes cluster's etcd status again:

[root@k8s-master1 ~]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok                  
controller-manager   Healthy   ok                  
etcd-2               Healthy   {"health":"true"}   
etcd-0               Healthy   {"health":"true"}   
etcd-1               Healthy   {"health":"true"}

 

