1. Check whether CoreDNS started correctly
kubectl -n kube-system get po | grep core
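The grep above matches on the Pod name; as a sketch (assuming the standard `k8s-app=kube-dns` label that CoreDNS Deployments carry), the same check can be scripted by counting Pods that are not in the `Running` state:

```shell
# Count pods that are not in the Running state, reading
# `kubectl get pods` output from stdin (the header row is skipped).
not_running() { awk 'NR>1 && $3!="Running" {n++} END {print n+0}'; }

# Usage (requires cluster access):
# kubectl -n kube-system get pods -l k8s-app=kube-dns | not_running
```

A result of 0 means every CoreDNS Pod is Running; anything else points at steps 2–6 below.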
2. If it is not running and the YAML configuration is confirmed correct, try adding nodeName to the Deployment YAML to schedule the CoreDNS Pod onto another node and see whether the original node is the cause, or go straight to step 4 to troubleshoot the problem node.
spec:
  nodeName: 1.1.1.1   # must match a Node name from `kubectl get nodes` (in this cluster the nodes appear to be named by IP)
  containers:
  - name: xxx
    image: xxx
    ports:
    - containerPort: 8080
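Instead of editing the YAML by hand, the same pin can be applied with `kubectl patch`. A minimal sketch, assuming the Deployment is named `coredns` and that `node-2` is a placeholder for a real name from `kubectl get nodes`:

```shell
# Hypothetical target node; must be a Node name from `kubectl get nodes`.
NODE="node-2"

# Strategic-merge patch that sets spec.template.spec.nodeName on the Pod template.
PATCH="{\"spec\":{\"template\":{\"spec\":{\"nodeName\":\"$NODE\"}}}}"
echo "$PATCH"

# Apply it (requires cluster access):
# kubectl -n kube-system patch deployment coredns -p "$PATCH"
```

Remove the patch (or set nodeName back) once the test is done, otherwise the Pod stays pinned.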
3. If the CoreDNS Pod runs successfully once pinned with nodeName, verify CoreDNS availability from any node by resolving a Service name, for example:
# CoreDNS Service address: 10.96.0.10
# Service name of any Service, here tiller-deploy
nslookup tiller-deploy.kube-system.svc.cluster.local 10.96.0.10
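To run that check from a script on each node, a small helper can classify the nslookup output (a sketch: a successful lookup prints a `Name:` line for the queried record, while failures print NXDOMAIN or a timeout message instead):

```shell
# Succeed only if the nslookup output (on stdin) contains a resolved Name: line.
resolved() { grep -q '^Name:'; }

# Usage on a node (10.96.0.10 is the CoreDNS Service IP from step 3):
# nslookup tiller-deploy.kube-system.svc.cluster.local 10.96.0.10 | resolved \
#   && echo "DNS OK" || echo "DNS FAIL"
```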
4. If resolution fails on one particular node, test whether that node can reach the CoreDNS Pod:
# Pod IP: 10.53.5.165
ping 10.53.5.165
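For scripted use, the ping result can be reduced to a single loss percentage (a sketch that parses the summary line of iputils ping output):

```shell
# Print the packet-loss percentage from `ping` summary output on stdin.
loss_pct() { grep -oE '[0-9]+(\.[0-9]+)?% packet loss' | cut -d% -f1; }

# Usage (10.53.5.165 is the Pod IP from step 4):
# ping -c 3 -W 2 10.53.5.165 | loss_pct
```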
5. If the connectivity test fails, check whether flannel is running on the problem node; if it is running, continue troubleshooting:
# 1. Check the flannel container on the problem node
docker ps | grep flannel
# 2. Check the state of the flannel.1 interface
ifconfig flannel.1
# 3. Compare the routing table against a healthy node and confirm no routes are missing, for example:
route -n
10.244.1.0 10.244.1.0 255.255.255.0 UG 0 0 0 flannel.1
10.244.2.0 10.244.2.0 255.255.255.0 UG 0 0 0 flannel.1
10.244.3.0 10.244.3.0 255.255.255.0 UG 0 0 0 flannel.1
10.244.4.0 10.244.4.0 255.255.255.0 UG 0 0 0 flannel.1
10.244.5.0 10.244.5.0 255.255.255.0 UG 0 0 0 flannel.1
10.244.6.0 10.244.6.0 255.255.255.0 UG 0 0 0 flannel.1
10.244.7.0 10.244.7.0 255.255.255.0 UG 0 0 0 flannel.1
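Comparing routing tables by eye is error-prone. As a sketch, the flannel.1 destinations can be extracted on each node and then diffed between the problem node and a healthy one:

```shell
# Print the sorted destination subnets of all flannel.1 routes, reading
# `route -n` output from stdin (the interface is the last column).
flannel_routes() { awk '$NF=="flannel.1" {print $1}' | sort; }

# On each node, then diff the resulting files:
# route -n | flannel_routes > /tmp/routes-$(hostname)
# diff /tmp/routes-node1 /tmp/routes-node2
```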
# If any of the checks above show a problem, see whether the NetworkManager service is running; it can interfere with flannel
# Check its status
systemctl status NetworkManager
# Stop and disable it
systemctl stop NetworkManager && systemctl disable NetworkManager
# If this service is the cause, log entries like the following will appear:
device (flannel.1): state change: unmanaged -> unavailable (reason 'connection-assumed')
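If NetworkManager must stay enabled on the node, an alternative to stopping it is to mark the flannel/CNI interfaces as unmanaged via a keyfile drop-in (a sketch; verify the drop-in path and interface names for your distribution and CNI setup):

```shell
# Tell NetworkManager to leave the flannel and CNI interfaces alone,
# then restart it to pick up the new configuration.
cat >/etc/NetworkManager/conf.d/flannel.conf <<'EOF'
[keyfile]
unmanaged-devices=interface-name:flannel.1;interface-name:cni0
EOF
systemctl restart NetworkManager
```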
# Also check the node's resolver configuration
cat /etc/resolv.conf
# Remove the flannel container so that kubelet pulls it back up
docker rm -f $(docker ps -q -f name=flannel)
6. If flannel is not running at all:
# 1. Check whether kubelet is running
netstat -tnlp | grep kubelet
# 2. If kubelet is not running, check whether swap is enabled; an enabled swap partition is the usual reason kubelet fails to start
# Disable swap temporarily (does not survive a reboot)
swapoff -a
# Disable swap permanently
sed -ri 's/.*swap.*/#&/' /etc/fstab
# Confirm with free that swap is off
free -m
              total        used        free      shared  buff/cache   available
Swap:             0           0           0
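As a sketch, the swap state can also be checked from a script by parsing the `free -m` output:

```shell
# Succeed only if the Swap total in `free -m` output (on stdin) is zero.
swap_is_off() { awk '/^Swap:/ {exit ($2 != 0)}'; }

# Usage:
# free -m | swap_is_off && echo "swap off" || echo "swap still enabled"
```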
# Start kubelet, which will bring flannel back up
systemctl start kubelet
# Once kubelet and flannel are running, repeat the checks in step 5 and test service availability
