etcd Backup and Restore in a Kubernetes Cluster [Repost]


etcd is a critical component of a Kubernetes cluster: it stores all of the cluster's data, such as the state of Namespaces, Pods, Services, routes, and so on. If the etcd cluster suffers a disaster or its data is lost, recovering the Kubernetes cluster data becomes a problem. Backing up the etcd data is therefore essential for building a disaster-recovery setup for Kubernetes.

 

1. etcd Cluster Backup

The etcdctl command differs somewhat between etcd versions, but the overall workflow is the same; here the backup is taken with "snapshot save".
A few points to note:
  • The backup only needs to be run on one node of the etcd cluster.
  • The etcd v3 API is used here. Starting with Kubernetes 1.13, k8s no longer supports etcd v2, so all cluster data lives in the v3 store. Consequently, only data written through the v3 API is backed up; data written through the v2 API is not.
  • This example uses a binary-deployed k8s v1.18.6 + Calico environment ("ETCDCTL_API=3 etcdctl" in the commands below is equivalent to plain "etcdctl"). A quick check of the etcdctl version in use is shown right after this list.
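Since the cluster data lives in the etcd v3 store, it can be useful to confirm which etcdctl binary and API version you are working with before backing up; a simple sanity check (the binary path is the one used by the backup script later in this article):

[root@k8s-master01 ~]# ETCDCTL_API=3 /opt/k8s/bin/etcdctl version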
 
1) Before taking a backup, have a look at the etcd data on disk
# etcd data directory
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_DATA_DIR="
export ETCD_DATA_DIR="/data/k8s/etcd/data"

# etcd WAL directory
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_WAL_DIR="
export ETCD_WAL_DIR="/data/k8s/etcd/wal"

[root@k8s-master01 ~]# ls /data/k8s/etcd/data/
member
[root@k8s-master01 ~]# ls /data/k8s/etcd/data/member/
snap
[root@k8s-master01 ~]# ls /data/k8s/etcd/wal/
0000000000000000-0000000000000000.wal  0.tmp

  

2) Take the etcd cluster backup
Run the backup on one node of the etcd cluster, then copy the backup file to the other nodes.
 
First, create the backup directory on every etcd node:
# mkdir -p /data/etcd_backup_dir

Then run the backup on one of the etcd nodes (here, k8s-master01):

[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints=https://172.16.60.231:2379 snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
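Optionally, verify the integrity of the snapshot file before distributing it; "snapshot status" reports the hash, revision, key count and size of the snapshot (a quick check using the same etcdctl binary):

[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl snapshot status /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db -w table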

Copy the backup file to the other etcd nodes:
[root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
[root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/

  

The backup command run on k8s-master01 above can be put into a script and scheduled with crontab:

[root@k8s-master01 ~]# cat /data/etcd_backup_dir/etcd_backup.sh
#!/usr/bin/bash

date
CACERT="/etc/kubernetes/cert/ca.pem"
CERT="/etc/etcd/cert/etcd.pem"
KEY="/etc/etcd/cert/etcd-key.pem"
ENDPOINTS="https://172.16.60.231:2379"

ETCDCTL_API=3 /opt/k8s/bin/etcdctl \
--cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
--endpoints=${ENDPOINTS} \
snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db

# keep backups for 30 days
find /data/etcd_backup_dir/ -name "*.db" -mtime +30 -exec rm -f {} \;

# sync to the other two etcd nodes
/bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master02:/data/etcd_backup_dir/
/bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master03:/data/etcd_backup_dir/

  

Set up a crontab entry to run the backup at 05:00 every day:
[root@k8s-master01 ~]# chmod 755 /data/etcd_backup_dir/etcd_backup.sh
[root@k8s-master01 ~]# crontab -l
# etcd cluster data backup
0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh > /dev/null 2>&1
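Before relying on the cron job, it can help to run the script once by hand and confirm that the snapshot file is produced and synced (a simple sanity check):

[root@k8s-master01 ~]# /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh
[root@k8s-master01 ~]# ls -lh /data/etcd_backup_dir/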

2. etcd Cluster Restore

The etcd backup only needs to be taken on one etcd node and then copied to the others,
but the restore has to be performed on every etcd node!

1) Simulate losing the etcd cluster data
Delete the contents of the data directory on all three etcd nodes (or delete the data directory itself):

# rm -rf /data/k8s/etcd/data/*

 

Check the Kubernetes cluster status:

[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                         ERROR
etcd-2               Unhealthy   Get https://172.16.60.233:2379/health: dial tcp 172.16.60.233:2379: connect: connection refused
etcd-1               Unhealthy   Get https://172.16.60.232:2379/health: dial tcp 172.16.60.232:2379: connect: connection refused
etcd-0               Unhealthy   Get https://172.16.60.231:2379/health: dial tcp 172.16.60.231:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok

  

Because the etcd service is still running on all three nodes, the cluster status returns to healthy after a short while:
21
[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true"}
etcd-2               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}

[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 9.918673ms
https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 10.985279ms
https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 13.422545ms

[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem member list --write-out=table
+------------------+---------+------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |    NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+------------+----------------------------+----------------------------+------------+
| 1d1d7edbba38c293 | started | k8s-etcd03 | https://172.16.60.233:2380 | https://172.16.60.233:2379 |      false |
| 4c0cfad24e92e45f | started | k8s-etcd02 | https://172.16.60.232:2380 | https://172.16.60.232:2379 |      false |
| 79cf4f0a8c3da54b | started | k8s-etcd01 | https://172.16.60.231:2380 | https://172.16.60.231:2379 |      false |
+------------------+---------+------------+----------------------------+----------------------------+------------+

  

As shown above, none of the three etcd nodes is acting as leader, i.e. no leader has been elected. Restart the etcd service on all three nodes:
# systemctl restart etcd

  

After the restart, the etcd cluster has elected a leader again and the cluster status is back to normal:
[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  1.6 MB |      true |      false |         5 |      24658 |              24658 |        |
| https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  1.6 MB |     false |      false |         5 |      24658 |              24658 |        |
| https://172.16.60.233:2379 | 1d1d7edbba38c293 |   3.4.9 |  1.7 MB |     false |      false |         5 |      24658 |              24658 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

  

However, the Kubernetes cluster data has in fact been lost: the Pods and other resources in the namespaces are gone. The data now has to be recovered from the etcd backup, i.e. from the snapshot file taken earlier.
[root@k8s-master01 ~]# kubectl get ns
NAME              STATUS   AGE
default           Active   9m47s
kube-node-lease   Active   9m39s
kube-public       Active   9m39s
kube-system       Active   9m47s
[root@k8s-master01 ~]# kubectl get pods -n kube-system
No resources found in kube-system namespace.
[root@k8s-master01 ~]# kubectl get pods --all-namespaces
No resources found

  

2) Restore the etcd cluster data, i.e. recover the Kubernetes cluster data
Before restoring etcd, stop the kube-apiserver service on all master nodes and then the etcd service on all etcd nodes:
# systemctl stop kube-apiserver
# systemctl stop etcd
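
If the nodes are reachable over SSH with key-based authentication, the same stop commands can be issued from a single machine; a minimal sketch, assuming the hostnames k8s-master01..03 resolve and each of them runs both kube-apiserver and etcd:

for node in k8s-master01 k8s-master02 k8s-master03; do
    ssh root@${node} "systemctl stop kube-apiserver && systemctl stop etcd"
done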

  

Important: before restoring the etcd cluster data, the old data and WAL working directories (here /data/k8s/etcd/data and /data/k8s/etcd/wal) must be removed on every etcd node; otherwise the restore may fail, with the restore command complaining that the data directory already exists.
# rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal
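
Instead of deleting the old directories outright, they can also be moved aside first so that they remain available for inspection; a possible variant of the step above:

# mv /data/k8s/etcd/data /data/k8s/etcd/data.bak-$(date +%Y%m%d) && mv /data/k8s/etcd/wal /data/k8s/etcd/wal.bak-$(date +%Y%m%d)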

  

Run the restore on each etcd node:
Node 172.16.60.231 (k8s-etcd01)
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd01 \
--endpoints="https://172.16.60.231:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.231:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db


Node 172.16.60.232 (k8s-etcd02)
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd02 \
--endpoints="https://172.16.60.232:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.232:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db


Node 172.16.60.233 (k8s-etcd03)
-------------------------------------------------------
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd03 \
--endpoints="https://172.16.60.233:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.233:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
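
Since the three commands differ only in --name, --endpoints and --initial-advertise-peer-urls, the restore can also be written as one small per-node script; a sketch, assuming it is run locally on each etcd node with NAME and IP adjusted for that node:

#!/usr/bin/bash
# adjust per node: k8s-etcd01/172.16.60.231, k8s-etcd02/172.16.60.232, k8s-etcd03/172.16.60.233
NAME=k8s-etcd01
IP=172.16.60.231
SNAPSHOT=/data/etcd_backup_dir/etcd-snapshot-20200820.db

ETCDCTL_API=3 etcdctl \
  --name=${NAME} \
  --endpoints="https://${IP}:2379" \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --cacert=/etc/kubernetes/cert/ca.pem \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-advertise-peer-urls=https://${IP}:2380 \
  --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
  --data-dir=/data/k8s/etcd/data \
  --wal-dir=/data/k8s/etcd/wal \
  snapshot restore ${SNAPSHOT}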

  

Start the etcd service on each etcd node and check that it is running:
# systemctl start etcd
# systemctl status etcd

  

Check the etcd cluster status (as shown below, a leader has been elected successfully):
[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 12.837393ms
https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 13.306671ms
https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 13.602805ms

[root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
| https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  9.0 MB |      true |      false |         2 |         13 |                 13 |        |
| https://172.16.60.233:2379 | 5f70664d346a6ebd |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

  

Next, start the kube-apiserver service on each master node:
# systemctl start kube-apiserver
# systemctl status kube-apiserver

  

Check the Kubernetes cluster status:
[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                  ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
etcd-0               Unhealthy   HTTP probe failed with statuscode: 503

Because the etcd service has only just been restarted, refresh the status a few times and it becomes healthy:
[root@k8s-master01 ~]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-2               Healthy   {"health":"true"}
etcd-0               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}

  

Check the Kubernetes resources:
[root@k8s-master01 ~]# kubectl get ns
NAME              STATUS   AGE
default           Active   7d4h
kevin             Active   5d18h
kube-node-lease   Active   7d4h
kube-public       Active   7d4h
kube-system       Active   7d4h

[root@k8s-master01 ~]# kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE
default       dnsutils-ds-22q87                          0/1     ContainerCreating   171        7d3h
default       dnsutils-ds-bp8tm                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-bzzqg                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-jcvng                          1/1     Running             171        7d3h
default       dnsutils-ds-xrl2x                          0/1     ContainerCreating   138        5d18h
default       dnsutils-ds-zjg5l                          1/1     Running             0          7d3h
default       kevin-t-84cdd49d65-ck47f                   0/1     ContainerCreating   0          2d2h
default       nginx-ds-98rm2                             1/1     Running             2          7d3h
default       nginx-ds-bbx68                             1/1     Running             0          7d3h
default       nginx-ds-kfctv                             0/1     ContainerCreating   1          5d18h
default       nginx-ds-mdcd9                             0/1     ContainerCreating   1          5d18h
default       nginx-ds-ngqcm                             1/1     Running             0          7d3h
default       nginx-ds-tpcxs                             0/1     ContainerCreating   1          5d18h
kevin         nginx-ingress-controller-797ffb479-vrq6w   0/1     ContainerCreating   0          5d18h
kevin         test-nginx-7d4f96b486-qd4fl                0/1     ContainerCreating   0          2d1h
kevin         test-nginx-7d4f96b486-qfddd                0/1     Running             0          2d1h
kube-system   calico-kube-controllers-578894d4cd-9rp4c   1/1     Running             1          7d3h
kube-system   calico-node-d7wq8                          0/1     PodInitializing     1          7d3h

After the etcd cluster data has been restored, the Pod containers gradually come back to the Running state. At this point the whole Kubernetes cluster has been recovered from the etcd backup.
 

3. Summary

Backing up a Kubernetes cluster is essentially backing up the etcd cluster. When restoring, the main thing to get right is the overall order:
stop kube-apiserver --> stop etcd --> restore the data --> start etcd --> start kube-apiserver
 
Key points:
  • When backing up, only one etcd node's data needs to be backed up; the backup file is then synced to the other nodes.
  • When restoring, that same backup file is used to restore every etcd node.
The whole restore order is summarized in the sketch below.
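
A minimal driver sketch of the order above, assuming password-less SSH from k8s-master01 to all three nodes and that each node carries a local per-node restore script at /data/etcd_backup_dir/etcd_restore.sh (a hypothetical path, e.g. the per-node script shown in section 2):

#!/usr/bin/bash
# illustrates the restore order only: stop apiserver -> stop etcd -> restore -> start etcd -> start apiserver
NODES="k8s-master01 k8s-master02 k8s-master03"

for node in ${NODES}; do ssh root@${node} "systemctl stop kube-apiserver"; done
for node in ${NODES}; do ssh root@${node} "systemctl stop etcd"; done
for node in ${NODES}; do ssh root@${node} "rm -rf /data/k8s/etcd/data /data/k8s/etcd/wal && /bin/bash /data/etcd_backup_dir/etcd_restore.sh"; done
for node in ${NODES}; do ssh root@${node} "systemctl start etcd"; done
for node in ${NODES}; do ssh root@${node} "systemctl start kube-apiserver"; done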

Reposted from:

K8S集群災備環境部署 - 散盡浮華 - 博客園
https://www.cnblogs.com/kevingrace/p/14616824.html

