Resolving an etcd Failure in a Kubernetes Cluster

While creating an instance in a Kubernetes cluster, I found that the etcd cluster was reporting connection failures, which caused the instance creation to fail, so I dug into the cause.

The Problem

Here is the etcd cluster health status:

[root@docker01 ~]# cd /opt/kubernetes/ssl/
[root@docker01 ssl]# /opt/kubernetes/bin/etcdctl \
> --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
> --endpoints="https://10.0.0.99:2379,https://10.0.0.100:2379,https://10.0.0.111:2379" \
> cluster-health
member 1bd4d12de986e887 is healthy: got healthy result from https://10.0.0.99:2379
member 45396926a395958b is healthy: got healthy result from https://10.0.0.100:2379
failed to check the health of member c2c5804bd87e2884 on https://10.0.0.111:2379: Get https://10.0.0.111:2379/health: net/http: TLS handshake timeout
member c2c5804bd87e2884 is unreachable: [https://10.0.0.111:2379] are all unreachable
cluster is healthy
[root@docker01 ssl]#

You can clearly see that etcd node 03 (10.0.0.111) is the one in trouble.

So log in to node 03 and try restarting the etcd service:

[root@docker03 ~]# systemctl restart etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@docker03 ~]# journalctl -xe
Mar 24 22:24:32 docker03 etcd[1895]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Mar 24 22:24:32 docker03 etcd[1895]: the server is already initialized as member before, starting as etcd member...
Mar 24 22:24:32 docker03 etcd[1895]: peerTLS: cert = /opt/kubernetes/ssl/server.pem, key = /opt/kubernetes/ssl/server-key.pem, ca = , trusted-ca = /opt/kubernetes/ssl
Mar 24 22:24:32 docker03 etcd[1895]: listening for peers on https://10.0.0.111:2380
Mar 24 22:24:32 docker03 etcd[1895]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 24 22:24:32 docker03 etcd[1895]: listening for client requests on 127.0.0.1:2379
Mar 24 22:24:32 docker03 etcd[1895]: listening for client requests on 10.0.0.111:2379
Mar 24 22:24:32 docker03 etcd[1895]: member c2c5804bd87e2884 has already been bootstrapped
Mar 24 22:24:32 docker03 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Mar 24 22:24:32 docker03 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd.service has failed.
--
-- The result is failed.
Mar 24 22:24:32 docker03 systemd[1]: Unit etcd.service entered failed state.
Mar 24 22:24:32 docker03 systemd[1]: etcd.service failed.
Mar 24 22:24:33 docker03 systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 24 22:24:33 docker03 systemd[1]: start request repeated too quickly for etcd.service
Mar 24 22:24:33 docker03 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd.service has failed.
--
-- The result is failed.
Mar 24 22:24:33 docker03 systemd[1]: Unit etcd.service entered failed state.
Mar 24 22:24:33 docker03 systemd[1]: etcd.service failed.

The service still fails to start. The key message is: member c2c5804bd87e2884 has already been bootstrapped.

The references explain it like this:

One of the members was bootstrapped via the discovery service. You must remove the previous data-dir to clean up the member information. Otherwise the member will ignore the new configuration and start with the old configuration. That is why you see the mismatch.

With that, the problem is clear: the startup fails because the information recorded in the data-dir (/var/lib/etcd/default.etcd) no longer matches the information given by etcd's startup options.
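As a quick sanity check (a minimal sketch, assuming the default data-dir path /var/lib/etcd/default.etcd mentioned above), you can confirm that this node already carries member data from the earlier bootstrap:

# On docker03: an already-bootstrapped member keeps its Raft snapshots and WAL under the data-dir
ls /var/lib/etcd/default.etcd/member/
# expect to see the snap and wal directories if the member was bootstrapped before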

The Fix

The first option is to fix the error by changing a startup parameter. Since the member information is already recorded in the data-dir, there is no need to pass the initial bootstrap configuration again; specifically, change the --initial-cluster-state parameter:

[root@docker03 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=-/opt/kubernetes/cfg/etcd
ExecStart=/opt/kubernetes/bin/etcd \
--name=${ETCD_NAME} \
--data-dir=${ETCD_DATA_DIR} \
--listen-peer-urls=${ETCD_LISTEN_PEER_URLS} \
--listen-client-urls=${ETCD_LISTEN_CLIENT_URLS},http://127.0.0.1:2379 \
--advertise-client-urls=${ETCD_ADVERTISE_CLIENT_URLS} \
--initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
--initial-cluster=${ETCD_INITIAL_CLUSTER} \
--initial-cluster-token=${ETCD_INITIAL_CLUSTER_TOKEN} \
--initial-cluster-state=existing \   # changed from "new" to "existing"; the service then starts normally
--cert-file=/opt/kubernetes/ssl/server.pem \
--key-file=/opt/kubernetes/ssl/server-key.pem \
--peer-cert-file=/opt/kubernetes/ssl/server.pem \
--peer-key-file=/opt/kubernetes/ssl/server-key.pem \
--trusted-ca-file=/opt/kubernetes/ssl/ca.pem \
--peer-trusted-ca-file=/opt/kubernetes/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Change --initial-cluster-state=new to --initial-cluster-state=existing and restart the service, and the node comes back up.
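After editing the unit file, reload systemd, restart etcd, and then re-run the health check from the first node. A minimal sequence, reusing the endpoints and certificate paths shown earlier:

# On docker03: pick up the changed unit file and restart the service
systemctl daemon-reload
systemctl restart etcd
systemctl status etcd

# On docker01: verify that all three members report healthy again
cd /opt/kubernetes/ssl/
/opt/kubernetes/bin/etcdctl \
  --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
  --endpoints="https://10.0.0.99:2379,https://10.0.0.100:2379,https://10.0.0.111:2379" \
  cluster-health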

The second option is to delete the data-dir on every etcd node (it can also work without deleting it) and restart the etcd service on each node. Each node's data-dir is then rebuilt and the failure above no longer occurs; a rough sketch follows.
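A minimal sketch of this option, assuming the data-dir path /var/lib/etcd/default.etcd from above; moving the directory aside instead of removing it keeps a backup:

# Run on each etcd node (docker01/02/03). Wiping the data-dir discards that member's
# locally stored state, so treat this as a rebuild and keep a copy just in case.
systemctl stop etcd
mv /var/lib/etcd/default.etcd /var/lib/etcd/default.etcd.bak   # or rm -rf to delete outright
systemctl start etcd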

The third option is to copy the contents of another node's data-dir, use that as the basis to force up a one-member cluster with --force-new-cluster, and then restore the cluster by adding the remaining nodes back as new members; see the sketch below.
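A hedged sketch of that recovery path. The member name (etcd01) is hypothetical; the hosts, paths, and peer URLs reuse the values from this post, so adjust them to your environment:

# 1. On the node being rebuilt: stop etcd and seed it with a healthy member's data-dir
systemctl stop etcd
scp -r root@10.0.0.99:/var/lib/etcd/default.etcd /var/lib/etcd/

# 2. Temporarily add --force-new-cluster to the ExecStart line in
#    /usr/lib/systemd/system/etcd.service and start the node as a one-member cluster
systemctl daemon-reload && systemctl start etcd

# 3. Remove --force-new-cluster again, restart, then re-add the other nodes as new members
#    (each of them starts with an empty data-dir and --initial-cluster-state=existing)
/opt/kubernetes/bin/etcdctl \
  --ca-file=/opt/kubernetes/ssl/ca.pem \
  --cert-file=/opt/kubernetes/ssl/server.pem \
  --key-file=/opt/kubernetes/ssl/server-key.pem \
  --endpoints="https://10.0.0.111:2379" \
  member add etcd01 https://10.0.0.99:2380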

Those are the fixes available at present.

