This article is intended for colleagues on the business middle-platform teams doing day-to-day troubleshooting; platform-level gurus can safely skip it.
Problem 1: K8S cluster service access fails?
curl: (60) Peer's Certificate issuer is not recognized.
More details here: http://curl.haxx.se/docs/sslcerts.html
curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.
Cause analysis: the certificate is not recognized, typically because it is a custom/self-signed certificate or has expired.
Solution: update the certificate.
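While the certificate is being fixed, curl's own hints above offer two stop-gaps (the URL and CA path below are placeholders, not taken from this cluster):
# trust an explicit CA bundle
curl --cacert /path/to/ca.crt https://my-service.example.com/
# or, for debugging only, skip verification entirely
curl -k https://my-service.example.com/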
Problem 2: K8S cluster service access fails?
curl: (7) Failed connect to 10.103.22.158:3000; Connection refused
Cause analysis: the port mapping is wrong; the workload itself runs fine, but the Service cannot serve traffic.
Solution: delete the svc and map the port again.
kubectl delete svc nginx-deployment
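A minimal sketch of re-exposing the deployment after the delete (the port numbers are assumptions; match them to what the container actually listens on):
kubectl expose deployment nginx-deployment --port=3000 --target-port=80
kubectl get svc nginx-deployment   # confirm the new mapping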
Problem 3: exposing a K8S cluster service fails?
Error from server (AlreadyExists): services "nginx-deployment" already exists
Cause analysis: a Service with this name has already been created for the container.
Solution: as in Problem 2, delete the svc and map the port again.
Problem 4: the service cannot be accessed from outside the cluster?
Cause analysis: the Service type is ClusterIP, which is reachable only inside the cluster; the service was never exposed externally.
Solution: change the Service type to NodePort, after which the service is reachable through any node of the K8S cluster.
kubectl edit svc nginx-deployment
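If you prefer a non-interactive change over kubectl edit, a patch does the same thing (a sketch; the svc name follows the example above):
kubectl patch svc nginx-deployment -p '{"spec":{"type":"NodePort"}}'
kubectl get svc nginx-deployment   # note the allocated nodePort (3xxxx range)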
Problem 5: pod status is ErrImagePull?
readiness-httpget-pod 0/1 ErrImagePull 0 10s
Cause analysis: the image cannot be pulled;
Warning Failed 59m (x4 over 61m) kubelet, k8s-node01 Error: ErrImagePull
Solution: switch to a pullable image.
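Before editing the manifest, it can help to confirm on the node that the reference actually resolves (the image name below is an assumed example, not taken from the event above):
docker pull hub.atguigu.com/library/myapp:v1   # should fail the same way if the reference is bad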
Problem 6: after creating a pod with init containers, its status is abnormal?
NAME READY STATUS RESTARTS AGE
myapp-pod 0/1 Init:0/2 0 20s
Cause analysis: the logs show the pod stuck in initialization; the pod's detailed description pinpoints the failure: the init containers have not finished, because the name they wait on does not resolve.
Error from server (BadRequest): container "myapp-container" in pod "myapp-pod" is waiting to start: PodInitializing
waiting for myservice
Server: 10.96.0.10
Address: 10.96.0.10:53
** server can't find myservice.default.svc.cluster.local: NXDOMAIN
*** Can't find myservice.svc.cluster.local: No answer
*** Can't find myservice.cluster.local: No answer
*** Can't find myservice.default.svc.cluster.local: No answer
*** Can't find myservice.svc.cluster.local: No answer
*** Can't find myservice.cluster.local: No answer
Solution: create the missing Service so that its name is registered with the cluster's CoreDNS; CoreDNS can then resolve the name the pod's init container looks up (a manifest sketch follows the output below).
kubectl apply -f myservice.yaml
NAME READY STATUS RESTARTS AGE
myapp-pod 0/1 Init:1/2 0 27m
myapp-pod 0/1 PodInitializing 0 28m
myapp-pod 1/1 Running 0 28m
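For reference, a minimal Service that satisfies the init container's DNS check might look like this (the ports are assumptions; only the name matters for resolution):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  ports:
  - port: 80
    targetPort: 9376
EOF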
Problem 7: probed pod's status is CrashLoopBackOff?
readiness-httpget-pod 0/1 CrashLoopBackOff 1 13s
readiness-httpget-pod 0/1 Completed 2 20s
readiness-httpget-pod 0/1 CrashLoopBackOff 2 31s
readiness-httpget-pod 0/1 Completed 3 42s
readiness-httpget-pod 0/1 CrashLoopBackOff 3 53s
Cause analysis: an image problem causes the container to fail and restart repeatedly.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 56m kubelet, k8s-node01 Pulling image "hub.atguigu.com/library/mylandmarktech/myapp:v1"
Normal Pulled 56m kubelet, k8s-node01 Successfully pulled image "hub.atguigu.com/library/mylandmarktech/myapp:v1"
Normal Created 56m (x3 over 56m) kubelet, k8s-node01 Created container readiness-httpget-container
Normal Started 56m (x3 over 56m) kubelet, k8s-node01 Started container readiness-httpget-container
Normal Pulled 56m (x2 over 56m) kubelet, k8s-node01 Container image "hub.atguigu.com/library/mylandmarktech/myapp:v1" already present on machine
Warning Unhealthy 56m kubelet, k8s-node01 Readiness probe failed: Get http://10.244.2.22:80/index1.html: dial tcp 10.244.2.22:80: connect: connection refused
Warning BackOff 56m (x4 over 56m) kubelet, k8s-node01 Back-off restarting failed container
Normal Scheduled 50s default-scheduler Successfully assigned default/readiness-httpget-pod to k8s-node01
Solution: replace the image.
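When a container crash-loops like this, the previous attempt's logs usually show why it exited (pod name as in the listing above):
kubectl logs readiness-httpget-pod --previous   # logs of the last failed run
kubectl describe pod readiness-httpget-pod      # check Events and Last State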
Problem 8: pod creation fails?
readiness-httpget-pod 0/1 Pending 0 0s
readiness-httpget-pod 0/1 Pending 0 0s
readiness-httpget-pod 0/1 ContainerCreating 0 0s
readiness-httpget-pod 0/1 Error 0 2s
readiness-httpget-pod 0/1 Error 1 3s
readiness-httpget-pod 0/1 CrashLoopBackOff 1 4s
readiness-httpget-pod 0/1 Error 2 15s
readiness-httpget-pod 0/1 CrashLoopBackOff 2 26s
readiness-httpget-pod 0/1 Error 3 37s
readiness-httpget-pod 0/1 CrashLoopBackOff 3 52s
readiness-httpget-pod 0/1 Error 4 82s
Cause analysis: an image problem keeps the container from starting; its logs show the application crashing on boot:
[root@k8s-master01 ~]# kubectl logs readiness-httpget-pod
url.js:106
throw new errors.TypeError('ERR_INVALID_ARG_TYPE', 'url', 'string', url);
^
TypeError [ERR_INVALID_ARG_TYPE]: The "url" argument must be of type string. Received type undefined
at Url.parse (url.js:106:11)
at Object.urlParse [as parse] (url.js:100:13)
at module.exports (/myapp/node_modules/mongodb/lib/url_parser.js:17:23)
at connect (/myapp/node_modules/mongodb/lib/mongo_client.js:159:16)
at Function.MongoClient.connect (/myapp/node_modules/mongodb/lib/mongo_client.js:110:3)
at Object.<anonymous> (/myapp/app.js:12:13)
at Module._compile (module.js:641:30)
at Object.Module._extensions..js (module.js:652:10)
at Module.load (module.js:560:32)
at tryModuleLoad (module.js:503:12)
at Function.Module._load (module.js:495:3)
at Function.Module.runMain (module.js:682:10)
at startup (bootstrap_node.js:191:16)
at bootstrap_node.js:613:3
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 58m (x5 over 59m) kubelet, k8s-node01 Container image "hub.atguigu.com/library/myapp:v1" already present on machine
Normal Created 58m (x5 over 59m) kubelet, k8s-node01 Created container readiness-httpget-container
Normal Started 58m (x5 over 59m) kubelet, k8s-node01 Started container readiness-httpget-container
Warning BackOff 57m (x10 over 59m) kubelet, k8s-node01 Back-off restarting failed container
Normal Scheduled 3m35s default-scheduler Successfully assigned default/readiness-httpget-pod to k8s-node01
Solution: replace the image.
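Judging from the stack trace, the app crashes because the MongoDB connection URL it reads is undefined; if the image itself is sound, injecting the value may avoid a rebuild (the workload name, variable name, and URL here are all assumptions about this app):
kubectl set env deployment/myapp MONGO_URL=mongodb://mongo:27017/myapp   # hypothetical env var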
Problem 9: pod never enters the Ready state?
readiness-httpget-pod 0/1 Running 0 116s
Cause analysis: the probe's request fails because the resource it asks for does not exist in the container.
Error from server (NotFound): pods "pod" not found
2021/06/11 07:10:14 [error] 30#30: *1 open() "/usr/share/nginx/html/index1.html" failed (2: No such file or directory), client: 10.244.2.1, server: localhost, request: "GET /index1.html HTTP/1.1", host: "10.244.2.25:80"
10.244.2.1 - - [11/Jun/2021:07:10:14 +0000] "GET /index1.html HTTP/1.1" 404 153 "-" "kube-probe/1.15" "-"
10.244.2.1 - - [11/Jun/2021:07:10:17 +0000] "GET /index1.html HTTP/1.1" 404 153 "-" "kube-probe/1.15" "-"
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 64m kubelet, k8s-node01 Container image "hub.atguigu.com/library/nginx" already present on machine
Normal Created 64m kubelet, k8s-node01 Created container readiness-httpget-container
Normal Started 64m kubelet, k8s-node01 Started container readiness-httpget-container
Warning Unhealthy 59m (x101 over 64m) kubelet, k8s-node01 Readiness probe failed: HTTP probe failed with statuscode: 404
Normal Scheduled 8m16s default-scheduler Successfully assigned default/readiness-httpget-pod to k8s-node01
Solution: exec into the container and create the resource the yaml's probe expects.
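A sketch of that fix, using the pod and file names from the logs above:
kubectl exec -it readiness-httpget-pod -- /bin/sh
# inside the container:
echo ok > /usr/share/nginx/html/index1.html   # the readiness probe then returns 200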
Problem 10: pod creation fails?
error: error validating "myregistry-secret.yml": error validating data: ValidationError(Pod.spec.imagePullSecrets[0]): invalid type for io.k8s.api.core.v1.LocalObjectReference: got "string", expected "map"; if you choose to ignore these errors, turn validation off with --validate=false
Cause analysis: the yml file content is wrong (Chinese/full-width characters were used), so imagePullSecrets[0] is parsed as a string where a map is expected;
Solution: fix the myregistrykey entry.
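For reference, a sketch of the expected shape: each imagePullSecrets entry is a map with a name key, not a bare string (the pod and image names here are made up):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: registry-test
spec:
  containers:
  - name: app
    image: 10.0.0.81:5000/nginx:alpine
  imagePullSecrets:
  - name: myregistrykey   # a map entry ("name: ..."), not a plain string
EOF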
Problem 11: the kube-flannel-ds-amd64-ndsf7 plugin pod's status is Init:0/1?
Troubleshooting: kubectl -n kube-system describe pod kube-flannel-ds-amd64-ndsf7 # inspect the pod's description;
Cause analysis: the k8s-slave1 node failed to pull the image.
Solution: log in to k8s-slave1, restart the docker service, and pull the image by hand; then, on the k8s-master node, reinstall the plugin:
kubectl create -f kube-flannel.yml; kubectl get nodes
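For the node-side step above, a sketch (the flannel image tag is an assumption; use whatever kube-flannel.yml actually references):
systemctl restart docker
docker pull quay.io/coreos/flannel:v0.14.0-amd64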
Problem 12: created workload's status is ErrImagePull?
Troubleshooting: kubectl describe pod test-nginx
Cause analysis: the image name is wrong.
Solution: delete the broken pod, then pull the correct image:
kubectl delete pod test-nginx; kubectl run test-nginx --image=10.0.0.81:5000/nginx:alpine
Problem 13: cannot enter the specified container?
Error from server (BadRequest): container volume-test-container is not valid for pod volume-test-pod
Cause analysis: the containers field was duplicated in the yml file, so the pod does not actually have that container.
Solution: remove the redundant containers field from the yml file and regenerate the pod.
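To check which containers a pod really carries before exec'ing in (pod name as in the error above; the -c value is a placeholder for whatever the first command prints):
kubectl get pod volume-test-pod -o jsonpath='{.spec.containers[*].name}'
kubectl exec -it volume-test-pod -c <real-container-name> -- /bin/sh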
Problem 14: creating a PV fails?
persistentvolume/nfspv1 unchanged
persistentvolume/nfspv01 created
Error from server (Invalid): error when applying patch:
{"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"v1\",\"kind\":\"PersistentVolume\",\"metadata\":{\"annotations\":{},\"name\":\"nfspv01\"},\"spec\":{\"accessModes\":[\"ReadWriteOnce\"],\"capacity\":{\"storage\":\"5Gi\"},\"nfs\":{\"path\":\"/nfs2\",\"server\":\"192.168.66.100\"},\"persistentVolumeReclaimPolicy\":\"Retain\",\"storageClassName\":\"nfs\"}}\n"}},"spec":{"nfs":{"path":"/nfs2"}}}
to:
Resource: "/v1, Resource=persistentvolumes", GroupVersionKind: "/v1, Kind=PersistentVolume"
Name: "nfspv01", Namespace: ""
Object: &{map["apiVersion":"v1" "kind":"PersistentVolume" "metadata":map["annotations":map["kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"v1\",\"kind\":\"PersistentVolume\",\"metadata\":{\"annotations\":{},\"name\":\"nfspv01\"},\"spec\":{\"accessModes\":[\"ReadWriteOnce\"],\"capacity\":{\"storage\":\"5Gi\"},\"nfs\":{\"path\":\"/nfs1\",\"server\":\"192.168.66.100\"},\"persistentVolumeReclaimPolicy\":\"Retain\",\"storageClassName\":\"nfs\"}}\n"] "creationTimestamp":"2021-06-25T01:54:24Z" "finalizers":["kubernetes.io/pv-protection"] "name":"nfspv01" "resourceVersion":"325674" "selfLink":"/api/v1/persistentvolumes/nfspv01" "uid":"89cb1d15-8012-47f0-aee6-6507bb624387"] "spec":map["accessModes":["ReadWriteOnce"] "capacity":map["storage":"5Gi"] "nfs":map["path":"/nfs1" "server":"192.168.66.100"] "persistentVolumeReclaimPolicy":"Retain" "storageClassName":"nfs" "volumeMode":"Filesystem"] "status":map["phase":"Available"]]}
for: "PV.yml": PersistentVolume "nfspv01" is invalid: spec.persistentvolumesource: Forbidden: is immutable after creation
Cause analysis: the PV name is reused. A PV named nfspv01 already exists with path /nfs1, and spec.persistentvolumesource is immutable after creation, so the apply that switches the path to /nfs2 is rejected.
Solution: give the new PV a different name.
Problem 15: pod cannot mount its PVC?
Cause analysis: the PVC remains unbound:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 60s default-scheduler pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
The requested accessModes match no usable PV. Since only PVs larger than 1Gi with accessModes RWO can be mounted here, only one pod is created successfully; the second pod stays Pending, and under ordered creation the third pod is never created.
Solution: align the accessModes in the yml file with the accessModes of the PVs (or adjust the PVs).
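A sketch of checking the match and claiming a compatible PV (the size and storage class are assumptions based on the PVs above):
kubectl get pv   # compare CAPACITY and ACCESS MODES with the claim below
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
  - ReadWriteOnce        # must match an available PV
  storageClassName: nfs
  resources:
    requests:
      storage: 1Gi       # must fit within the PV's capacity
EOF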
Problem 16: after a pod uses a PV, its contents cannot be accessed?
Cause analysis: the nfs volume has no files, or their permissions are wrong.
Solution: create the files on the nfs volume and grant the proper permissions (sketched below).
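A sketch of the server-side fix (the export path and server follow the PV above; the mode is an assumption, tighten it as policy requires):
# on the NFS server (192.168.66.100)
echo hello > /nfs1/index.html
chmod -R 755 /nfs1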
Problem 17: viewing node status fails?
Error from server (NotFound): the server could not find the requested resource (get services http:heapster:)
Cause analysis: there is no heapster service, which this command still looks for.
Solution: install the prometheus monitoring component.
Problem 18: pod stuck in Pending?
Cause analysis: pods were already published with the same image, so no node is left that matches the scheduler's node selector.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 9s (x13 over 14m) default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
Solution: delete all the existing pods, then deploy the pod again.
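When FailedScheduling blames the node selector, comparing the pod's selector against the node labels usually pinpoints the mismatch (the pod name is a placeholder):
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels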
Problem 19: helm fails to install a chart?
[root@k8s-master01 hello-world]# helm install
Error: This command needs 1 argument: chart name
[root@k8s-master01 hello-world]# helm install ./
Error: no Chart.yaml exists in directory "/root/hello-world"
Cause analysis: the chart file's name is wrong; Chart.yaml is case-sensitive.
Solution: mv chart.yaml Chart.yaml
Problem 20: helm fails to upgrade a release?
[root@k8s-master01 hello-world]# helm upgrade joyous-wasp ./
UPGRADE FAILED
ROLLING BACK
Error: render error in "hello-world/templates/deployment.yaml": template: hello-world/templates/deployment.yaml:14:35: executing "hello-world/templates/deployment.yaml" at <.values.image.reposi...>: can't evaluate field image in type interface {}
Error: UPGRADE FAILED: render error in "hello-world/templates/deployment.yaml": template: hello-world/templates/deployment.yaml:14:35: executing "hello-world/templates/deployment.yaml" at <.values.image.reposi...>: can't evaluate field image in type interface {}
Cause analysis: a template error in the yaml file: the chart references .values.image..., but Helm's built-in values object is capitalized (.Values), so the field lookup fails.
Solution: fix the template in the yaml file (see the sketch below).
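A sketch of the likely one-line fix in hello-world/templates/deployment.yaml (assuming the chart uses the standard values layout):
# before: image: {{ .values.image.repository }}:{{ .values.image.tag }}
image: {{ .Values.image.repository }}:{{ .Values.image.tag }}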
Problem 21: etcd fails to start?
[root@k8s-master01 ~]# systemctl enable --now etcd
Created symlink from /etc/systemd/system/etcd3.service to /usr/lib/systemd/system/etcd.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/etcd.service to /usr/lib/systemd/system/etcd.service.
Job for etcd.service failed because a timeout was exceeded. See "systemctl status etcd.service" and "journalctl -xe" for details.
Cause analysis: the TLS failures could come from certificates, configuration, or ports. The configuration met the requirements of this etcd version and the certificate generation was valid; in the end, the port being occupied by another process was what made authentication fail.
[root@k8s-master01 ~]# systemctl status etcd
● etcd.service - Etcd.service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2021-07-14 09:53:03 CST; 1min 6s ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 39692 (etcd)
   CGroup: /system.slice/etcd.service
           └─39692 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
Jul 14 09:54:09 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46168" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:09 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46166" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:09 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46170" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:09 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46172" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:09 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46176" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:09 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46174" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:09 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46178" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:09 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46180" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:10 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46182" (error "remote error: tls: bad certificate", ServerName "")
Jul 14 09:54:10 k8s-master01 etcd[39692]: rejected connection from "192.168.0.108:46186" (error "remote error: tls: bad certificate", ServerName "")
Solution: kill the process occupying port 2379, then restart etcd.
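A sketch of clearing the conflict (2379 is etcd's default client port; the PID is whatever ss reports as holding it):
ss -lntp | grep 2379    # identify the PID listening on the port
kill <pid>              # <pid> from the previous command's output
systemctl restart etcd
systemctl status etcd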