1 CoreDNS CrashLoopBackOff Preventing Worker Nodes from Being Added
1.1 Problem description
On a k8s 1.14.1 cluster installed with kubeadm, after the cluster had been running for a while, the docker and kubelet services on the k8s-master-15-81 machine were restarted, and afterwards CoreDNS stopped working:
[root@k8s-master-15-81 k8s_config]# kubectl get pod -n kube-system
NAME                                        READY   STATUS             RESTARTS   AGE
coredns-8686dcc4fd-4qswq                    0/1     CrashLoopBackOff   15         40d
coredns-8686dcc4fd-769bs                    0/1     CrashLoopBackOff   15         40d
kube-apiserver-k8s-master-15-81             1/1     Running            4          40d
kube-apiserver-k8s-master-15-82             1/1     Running            0          40d
kube-apiserver-k8s-master-15-83             1/1     Running            0          40d
kube-controller-manager-k8s-master-15-81    1/1     Running            5          40d
kube-controller-manager-k8s-master-15-82    1/1     Running            1          40d
kube-controller-manager-k8s-master-15-83    1/1     Running            1          40d
kube-flannel-ds-amd64-4fg7t                 1/1     Running            0          40d
kube-flannel-ds-amd64-bcl4j                 1/1     Running            0          40d
kube-flannel-ds-amd64-k6vp2                 1/1     Running            0          40d
kube-flannel-ds-amd64-lkjlz                 1/1     Running            2          40d
kube-flannel-ds-amd64-mb2lg                 1/1     Running            0          40d
kube-flannel-ds-amd64-nl9pn                 1/1     Running            5          40d
kube-proxy-4sbms                            1/1     Running            2          40d
kube-proxy-9v6fm                            1/1     Running            0          40d
kube-proxy-jsnkk                            1/1     Running            5          40d
kube-proxy-rvkmh                            1/1     Running            0          40d
kube-proxy-s4dfv                            1/1     Running            0          40d
kube-proxy-s8lws                            1/1     Running            0          40d
kube-scheduler-k8s-master-15-81             1/1     Running            5          40d
kube-scheduler-k8s-master-15-82             1/1     Running            1          40d
kube-scheduler-k8s-master-15-83             1/1     Running            1          40d
kubernetes-dashboard-5f7b999d65-d7fpp       0/1     Terminating        0          18m
kubernetes-dashboard-5f7b999d65-k759t       0/1     Terminating        0          21m
kubernetes-dashboard-5f7b999d65-pmvkk       0/1     CrashLoopBackOff   2          43s
[root@k8s-master-15-81 k8s_config]#
At this point, all the other nodes were in the NotReady state:
[root@k8s-master-15-81 k8s_config]# kubectl get no
NAME               STATUS     ROLES    AGE   VERSION
k8s-master-15-81   Ready      master   40d   v1.14.1
k8s-master-15-82   NotReady   master   40d   v1.14.1
k8s-master-15-83   NotReady   master   40d   v1.14.1
k8s-node-15-84     NotReady   <none>   40d   v1.14.1
k8s-node-15-85     NotReady   <none>   40d   v1.14.1
k8s-node-15-86     NotReady   <none>   40d   v1.14.1
[root@k8s-master-15-81 k8s_config]#
The initial diagnosis was a crashing container, so the next step was to look at its logs with "kubectl logs". This time we got the following concrete error:
[root@k8s-master-15-81 ~]# kubectl -n kube-system logs coredns-8686dcc4fd-7fwcz   # this is the key log
E1028 06:36:35.489403       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1028 06:36:35.489403       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-8686dcc4fd-7fwcz.unknownuser.log.ERROR.20191028-063635.1: no such file or directory
[root@k8s-master-15-81 ~]# kubectl -n kube-system describe pod coredns-8686dcc4fd-4j5gv   # this output is not very useful
Name:               coredns-8686dcc4fd-4j5gv
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               k8s-master-15-81/192.168.15.81
Start Time:         Mon, 28 Oct 2019 14:15:16 +0800
Labels:             k8s-app=kube-dns
                    pod-template-hash=8686dcc4fd
Annotations:        <none>
Status:             Running
IP:                 10.244.0.30
Controlled By:      ReplicaSet/coredns-8686dcc4fd
Containers:
  coredns:
    Container ID:  docker://5473c887d6858f364e8fc4c8001e41b2c5e612ce55d7c409df69788abf6585ed
    Image:         registry.aliyuncs.com/google_containers/coredns:1.3.1
    Image ID:      docker-pullable://registry.aliyuncs.com/google_containers/coredns@sha256:638adb0319813f2479ba3642bbe37136db8cf363b48fb3eb7dc8db634d8d5a5b
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 28 Oct 2019 14:15:39 +0800
      Finished:     Mon, 28 Oct 2019 14:15:40 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 28 Oct 2019 14:15:20 +0800
      Finished:     Mon, 28 Oct 2019 14:15:21 +0800
    Ready:          False
    Restart Count:  2
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-ltkvt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-ltkvt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-ltkvt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age               From                       Message
  ----     ------     ----              ----                       -------
  Normal   Scheduled  27s               default-scheduler          Successfully assigned kube-system/coredns-8686dcc4fd-4j5gv to k8s-master-15-81
  Normal   Pulled     4s (x3 over 26s)  kubelet, k8s-master-15-81  Container image "registry.aliyuncs.com/google_containers/coredns:1.3.1" already present on machine
  Normal   Created    4s (x3 over 26s)  kubelet, k8s-master-15-81  Created container coredns
  Normal   Started    4s (x3 over 25s)  kubelet, k8s-master-15-81  Started container coredns
  Warning  BackOff    1s (x4 over 21s)  kubelet, k8s-master-15-81  Back-off restarting failed container
[root@k8s-master-15-81 ~]# kubectl -n kube-system describe pod coredns-8686dcc4fd-5p6tp
Name:               coredns-8686dcc4fd-5p6tp
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               k8s-master-15-81/192.168.15.81
Start Time:         Mon, 28 Oct 2019 14:15:15 +0800
Labels:             k8s-app=kube-dns
                    pod-template-hash=8686dcc4fd
Annotations:        <none>
Status:             Running
IP:                 10.244.0.29
Controlled By:      ReplicaSet/coredns-8686dcc4fd
Containers:
  coredns:
    Container ID:  docker://4b19e53c68188faa107c310e75c6927bb0e280be042019b2805ef050fcd9aaaf
    Image:         registry.aliyuncs.com/google_containers/coredns:1.3.1
    Image ID:      docker-pullable://registry.aliyuncs.com/google_containers/coredns@sha256:638adb0319813f2479ba3642bbe37136db8cf363b48fb3eb7dc8db634d8d5a5b
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 28 Oct 2019 14:16:09 +0800
      Finished:     Mon, 28 Oct 2019 14:16:10 +0800
    Ready:          False
    Restart Count:  3
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-ltkvt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-ltkvt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-ltkvt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From                       Message
  ----     ------     ----               ----                       -------
  Normal   Scheduled  90s                default-scheduler          Successfully assigned kube-system/coredns-8686dcc4fd-5p6tp to k8s-master-15-81
  Warning  Unhealthy  85s                kubelet, k8s-master-15-81  Readiness probe failed: HTTP probe failed with statuscode: 503
  Normal   Pulled     36s (x4 over 89s)  kubelet, k8s-master-15-81  Container image "registry.aliyuncs.com/google_containers/coredns:1.3.1" already present on machine
  Normal   Created    36s (x4 over 88s)  kubelet, k8s-master-15-81  Created container coredns
  Normal   Started    36s (x4 over 88s)  kubelet, k8s-master-15-81  Started container coredns
  Warning  BackOff    4s (x11 over 84s)  kubelet, k8s-master-15-81  Back-off restarting failed container
[root@k8s-master-15-81 ~]#
Force-deleting the coredns pods did not help:
[root@k8s-master-15-81 ~]# kubectl delete pod coredns-8686dcc4fd-4j5gv --grace-period=0 --force -n kube-system
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "coredns-8686dcc4fd-4j5gv" force deleted
[root@k8s-master-15-81 ~]# kubectl delete pod coredns-8686dcc4fd-5p6tp --grace-period=0 --force -n kube-system
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "coredns-8686dcc4fd-5p6tp" force deleted
[root@k8s-master-15-81 ~]#
The local DNS configuration on the host was fine:
[root@k8s-master-15-81 k8s_config]# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 10.68.8.65
nameserver 10.68.8.66
[root@k8s-master-15-81 k8s_config]#
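Since the log complains about "no route to host" when reaching 10.96.0.1:443 (the ClusterIP of the kubernetes service), it is worth checking from the affected node whether that service IP is reachable at all and whether kube-proxy has programmed NAT rules for it. A minimal sketch, assuming kube-proxy runs in its default iptables mode:

# expect some TLS/HTTP response from the apiserver, not "no route to host"
curl -k https://10.96.0.1:443/version

# check that the service IP appears in kube-proxy's NAT rules
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.1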
1.2 Solution
This problem is most likely caused by corrupted or stale firewall (iptables) rules left behind when the services were restarted. It can be resolved by running the following commands in order:
systemctl stop kubelet
systemctl stop docker
iptables --flush
iptables -t nat --flush
systemctl start kubelet
systemctl start docker
After running the commands above, the problem was resolved.
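To confirm the recovery, check that the CoreDNS pods come back to Running and that the other nodes return to Ready (the k8s-app=kube-dns label is the one shown in the pod description above):

kubectl get pod -n kube-system -l k8s-app=kube-dns
kubectl get no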
2 Token Expired When Adding a Worker Node
The cluster join token is only valid for 24 hours. If worker nodes are not added shortly after the cluster is created, a new token has to be generated. The relevant commands are shown below:
# generate a new token
kubeadm token generate
# print the join command for that token
kubeadm token create <token> --print-join-command --ttl=0

Then simply copy the printed command and run it on the worker node.
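For reference, the printed command has roughly the following shape; the API server address, token, and CA hash below are placeholders, not values from this cluster:

# run on the worker node to join it to the cluster (placeholder values)
kubeadm join <apiserver-address>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>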
3 kubectl Reports "The connection to the server localhost:8080 was refused"
kubectl is at the core of cluster administration, yet on a worker node it may fail right away with exactly this error.

The reason is that kubectl needs to run as the kubernetes-admin identity, and the credentials file "/etc/kubernetes/admin.conf" for that identity is generated on the master during the "kubeadm init" step.
Therefore, the fix is to copy the /etc/kubernetes/admin.conf file from the master node to the same path on each worker node:
# copy admin.conf; run these commands on the master node
scp /etc/kubernetes/admin.conf 172.16.2.202:/etc/kubernetes/admin.conf
scp /etc/kubernetes/admin.conf 172.16.2.203:/etc/kubernetes/admin.conf

Then configure the environment variable on each worker node:
# point kubectl at the kubeconfig file
export KUBECONFIG=/etc/kubernetes/admin.conf
echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile
After that, kubectl on the worker node works normally.

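A quick way to verify is to run any read-only query from the worker node once KUBECONFIG is set, for example:

kubectl get no
kubectl cluster-info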
4 The flannel Network Component Fails to Initialize
After the flannel network component was installed, its pods stayed stuck in the initialization state. The following commands were used to check pod status and read the logs:
kubectl get pods -n kube-system -o wide
kubectl logs -f kube-flannel-ds-amd64-hl89n -n kube-system

The specific error in the log was:
Error from server: Get https://172.16.2.203:10250/containerLogs/kube-system/kube-flannel-ds-amd64-hl89n/kube-flannel?follow=true: dial tcp 172.16.2.203:10250: connect: no route to host
At this point, we can log in to the server hosting that node and inspect the kubelet logs on it with:
journalctl -u kubelet -f
Note: journalctl can be used to view all logs, including kernel logs and application logs.
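When the kubelet log is noisy, a couple of standard journalctl filters help narrow things down:

# kubelet messages from the last 30 minutes only
journalctl -u kubelet --since "30 min ago"
# kubelet messages at error priority or worse
journalctl -u kubelet -p err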

The logs showed that this was an image pull problem. It can be solved by pulling the image in the way described earlier and re-tagging it with the name the cluster expects, or alternatively by configuring a proxy for the container runtime.
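As a rough illustration of the pull-and-retag approach, assuming the flannel DaemonSet references quay.io/coreos/flannel:v0.11.0-amd64 and that some mirror registry is reachable (both the mirror host and the tag below are assumptions; substitute the image your DaemonSet actually uses):

# pull from a reachable mirror (hypothetical mirror host), then retag to the expected name
docker pull <mirror-registry>/coreos/flannel:v0.11.0-amd64
docker tag <mirror-registry>/coreos/flannel:v0.11.0-amd64 quay.io/coreos/flannel:v0.11.0-amd64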
5 Pods Fail to Start on Some Nodes
Sometimes, after deploying an application, we find that pods cannot start on some of the worker nodes and stay stuck in the ContainerCreating state.

Digging through the logs eventually surfaced the key message:
NetworkPlugin cni failed to set up pod "demo-deployment-675b5f9477-hdcwg_default" network: failed to set bridge addr: "cni0" already has an IP address different from 10.0.2.1/24
This happens when the node has been registered repeatedly, which leaves the flannel network in a broken state: the old cni0 bridge keeps an IP address from a previous subnet allocation. It can be fixed by resetting the node and removing the flannel network artifacts with the following commands, run in order:
# reset the node
kubeadm reset
systemctl stop kubelet && systemctl stop docker && \
  rm -rf /var/lib/cni/ && rm -rf /var/lib/kubelet/* && rm -rf /var/lib/etcd && rm -rf /etc/cni/ && \
  ifconfig cni0 down && ifconfig flannel.1 down && ifconfig docker0 down && \
  ip link delete cni0 && ip link delete flannel.1
systemctl start docker
Once this is done, regenerate the token and register the node again, as described in section 2 above.
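Putting it together, the rejoin usually looks like this; the join parameters are placeholders and come from the master as in section 2:

# on the master: print a fresh join command
kubeadm token create --print-join-command
# on the reset worker node: run the printed command (placeholder values)
kubeadm join <apiserver-address>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>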
Reference:
https://www.cnblogs.com/codelove/p/11466217.html
