Linux Operations and Architecture: K8s Troubleshooting


I. Kubernetes Troubleshooting

1. Application troubleshooting

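This mainly concerns Pod-level problems.

When a Pod is not in the Running state, use describe to view the Pod's events and track down the problem. describe can also show events for other resource objects, such as Deployments and Services.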

 kubectl describe TYPE/NAME

[root@k8s-master ~]# kubectl describe pod web 
Name:         web
Namespace:    default
Priority:     0
Node:         k8s-node1/192.168.56.62
Start Time:   Wed, 16 Dec 2020 14:43:55 +0800
Labels:       <none>
Annotations:  cni.projectcalico.org/podIP: 10.244.36.81/32
              cni.projectcalico.org/podIPs: 10.244.36.81/32
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nginx:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c87dr (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-c87dr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c87dr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From                Message
  ----    ------     ----       ----                -------
  Normal  Scheduled  <unknown>  default-scheduler   Successfully assigned default/web to k8s-node1
  Normal  Pulling    11s        kubelet, k8s-node1  Pulling image "nginx"
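
When the describe output is long, the events can also be listed on their own. The following is a minimal sketch, assuming a Pod named web in the default namespace as in the transcript above; pod_events is a hypothetical helper name, not a kubectl subcommand.

```shell
# pod_events POD [NAMESPACE]: list the events recorded for one Pod,
# oldest first. pod_events is a hypothetical helper name.
pod_events() {
  pod="$1"
  ns="${2:-default}"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.metadata.creationTimestamp
}

# Usage (needs access to a cluster):
#   pod_events web
#   pod_events coredns-7ff77c879f-dqk8t kube-system
```

Events are only retained for a limited time, so an old Pod may return an empty list.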

kubectl logs TYPE/NAME [-c CONTAINER]: the API server fetches the logs by calling the kubelet's API.

[root@k8s-master ~]# kubectl logs web 
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Configuration complete; ready for start up

kubectl exec POD [-c CONTAINER] -- COMMAND [args...]: when a Pod has more than one container, use -c to specify the container name.

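② Possible reasons a Pod is stuck in Pending

  • The image is still being pulled
  • The nodes may not have enough resources
  • No node matches the Pod's node selector labels
  • The nodes carry taints that the Pod does not tolerate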
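
Each of these causes usually leaves a trace in the Events section of kubectl describe pod. The sketch below maps a few message fragments to the causes listed above. The patterns are assumptions based on common kubelet and scheduler wording, which changes between Kubernetes versions, and classify_pending is a hypothetical helper name.

```shell
# classify_pending: read `kubectl describe pod` output on stdin and print
# which of the causes above the events hint at. The patterns are
# heuristics; event wording differs between Kubernetes versions.
classify_pending() {
  text="$(cat)"
  matched=no
  if printf '%s\n' "$text" | grep -q 'Pulling image'; then
    echo "image is still being pulled"; matched=yes
  fi
  if printf '%s\n' "$text" | grep -q 'Insufficient'; then
    echo "insufficient node resources"; matched=yes
  fi
  if printf '%s\n' "$text" | grep -Eq "didn't match.*(selector|affinity)"; then
    echo "no node matches the node selector"; matched=yes
  fi
  if printf '%s\n' "$text" | grep -q 'taint'; then
    echo "node taint not tolerated"; matched=yes
  fi
  [ "$matched" = yes ] || echo "no known hint found"
}

# Usage (needs access to a cluster):
#   kubectl describe pod web | classify_pending
```

To confirm a hint, kubectl describe node NAME shows a node's allocated resources and taints, and kubectl get nodes --show-labels shows the node labels.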

2. Control plane node troubleshooting

[Figure: cluster architecture diagram]

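① kubeadm deployment

Apart from the kubelet, which runs as a systemd service, the control plane components (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) are started as static Pods.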

[root@k8s-master ~]# kubectl get pods -n kube-system 
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-59877c7fb4-z2bms   1/1     Running   2          105d
calico-node-pnjxq                          1/1     Running   1          105d
calico-node-v48jq                          1/1     Running   1          105d
coredns-7ff77c879f-dqk8t                   1/1     Running   1          105d
coredns-7ff77c879f-j8zsp                   1/1     Running   1          105d
etcd-k8s-master                            1/1     Running   1          105d
kube-apiserver-k8s-master                  1/1     Running   1          105d
kube-controller-manager-k8s-master         1/1     Running   6          105d
kube-proxy-ck88h                           1/1     Running   1          105d
kube-proxy-hkb9f                           1/1     Running   1          105d
kube-scheduler-k8s-master                  1/1     Running   6          105d
metrics-server-8fcfb55ff-wlw5s             1/1     Running   3          104d

Manifest path for the other components: /etc/kubernetes/manifests

[root@k8s-master ~]# ll /etc/kubernetes/manifests/
total 16
-rw------- 1 root root 1887 Sep  1 17:04 etcd.yaml
-rw------- 1 root root 2738 Sep  1 17:04 kube-apiserver.yaml
-rw------- 1 root root 2594 Sep  1 17:04 kube-controller-manager.yaml
-rw------- 1 root root 1149 Sep  1 17:04 kube-scheduler.yaml

Telling how a cluster was deployed from its component services, processes, and certificate paths

[root@k8s-master ~]# systemctl status kube-apiserver.service
Unit kube-apiserver.service could not be found.    # so this is not a binary deployment
[root@k8s-master ~]# ps aux|grep apiserver         # kubeadm deployments keep certificates under a fixed path (/etc/kubernetes/pki)
root       1696  6.1 19.0 635004 386360 ?       Ssl  10:01  30:04 kube-apiserver --advertise-address=192.168.56.61 --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-cluster-ip-range=10.96.0.0/12 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
1001       3837  0.0  1.2 138732 26048 ?        Ssl  10:04   0:17 /dashboard --insecure-bind-address=0.0.0.0 --bind-address=0.0.0.0 --auto-generate-certificates --namespace=kubernetes-dashboard --tls-key-file=apiserver.key --tls-cert-file=apiserver.crt
root      87035  0.0  0.0 112724   980 pts/1    S+   18:09   0:00 grep --color=auto apiserver
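
The checks above can be combined into one helper. This is a heuristic sketch, assuming the unit name kube-apiserver.service and the default kubeadm manifest directory shown in this article; detect_deploy_mode is a hypothetical name.

```shell
# detect_deploy_mode [MANIFEST_DIR]: guess how the control plane was
# deployed. Prints "binary" when a kube-apiserver systemd unit exists,
# "kubeadm" when a static Pod manifest exists, otherwise "unknown".
detect_deploy_mode() {
  manifest_dir="${1:-/etc/kubernetes/manifests}"
  if systemctl cat kube-apiserver.service >/dev/null 2>&1; then
    echo "binary"
  elif [ -f "$manifest_dir/kube-apiserver.yaml" ]; then
    echo "kubeadm"
  else
    echo "unknown"
  fi
}

# Usage on a control plane node:
#   detect_deploy_mode
```

A binary deployment may use a different unit name, so an "unknown" result only means that neither convention from this article was found.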

To change the static Pod manifest path, edit staticPodPath in the kubelet configuration file:

[root@k8s-master ~]# tail /var/lib/kubelet/config.yaml 
imageMinimumGCAge: 0s
kind: KubeletConfiguration
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
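
A minimal sketch for reading the current value before changing it, assuming the kubeadm default config location shown above; static_pod_path is a hypothetical helper name.

```shell
# static_pod_path [CONFIG]: print the staticPodPath value from a kubelet
# configuration file (default: /var/lib/kubelet/config.yaml).
static_pod_path() {
  config="${1:-/var/lib/kubelet/config.yaml}"
  awk '$1 == "staticPodPath:" { print $2 }' "$config"
}

# Usage on the node:
#   static_pod_path
# After editing the value, restart the kubelet so it re-reads the file:
#   systemctl restart kubelet
```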

② Binary deployment

All components are managed by systemd.

[root@k8s-node1 ~]# systemctl status kube-apiserver.service 
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-04-20 15:26:41 CST; 7 months 27 days ago
     Docs: https://github.com/kubernetes/kubernetes
 Main PID: 17587 (kube-apiserver)
    Tasks: 36
   Memory: 356.5M
   CGroup: /system.slice/kube-apiserver.service
           └─17587 /app/kubernetes/bin/kube-apiserver --logtostderr=false --v=2 --log-dir=/app/kubernetes/logs --etcd-...

Dec 16 16:22:11 k8s-node1 kube-apiserver[17587]: E1216 16:22:11.216916   17587 watcher.go:214] watch chan error: ...acted
Dec 16 16:38:14 k8s-node1 kube-apiserver[17587]: E1216 16:38:14.231035   17587 watcher.go:214] watch chan error: ...acted
Dec 16 16:51:27 k8s-node1 kube-apiserver[17587]: E1216 16:51:27.296324   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:04:51 k8s-node1 kube-apiserver[17587]: E1216 17:04:51.356825   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:20:04 k8s-node1 kube-apiserver[17587]: E1216 17:20:04.464772   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:28:03 k8s-node1 kube-apiserver[17587]: E1216 17:28:03.551942   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:38:01 k8s-node1 kube-apiserver[17587]: E1216 17:38:01.568538   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:52:41 k8s-node1 kube-apiserver[17587]: E1216 17:52:41.593466   17587 watcher.go:214] watch chan error: ...acted
Dec 16 18:01:48 k8s-node1 kube-apiserver[17587]: E1216 18:01:48.620521   17587 watcher.go:214] watch chan error: ...acted
Dec 16 18:16:43 k8s-node1 kube-apiserver[17587]: E1216 18:16:43.655648   17587 watcher.go:214] watch chan error: ...acted
Hint: Some lines were ellipsized, use -l to show in full.

Service unit file path: /usr/lib/systemd/system

③ Control plane components

  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler

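3. Worker node troubleshooting

① Worker node components

  • kubelet        # Calls the container runtime's interface to manage containers and reports container status to the apiserver.
  • kube-proxy     # Implements Service load balancing and discovery for Pods, forwarding each incoming request to one of the Service's backend Pods.

② Possible reasons a node is NotReady

  • The kubelet service failed to start
  • The kubelet cannot reach the apiserver over the network
  • The certificate the kubelet presents is invalid, for example expired
  • The node's disk is full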
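
The disk-space cause is quick to rule out on the node itself. The sketch below filters df -P output; the 85% threshold and the /var/lib/docker path (Docker's default data directory) are assumptions, and full_filesystems is a hypothetical helper name.

```shell
# full_filesystems [THRESHOLD]: read `df -P` output on stdin and print
# each mount point whose usage is at or above THRESHOLD percent
# (default 85, an arbitrary value chosen for this sketch).
full_filesystems() {
  limit="${1:-85}"
  awk -v limit="$limit" '
    NR > 1 {
      use = $5
      sub(/%$/, "", use)                 # "95%" -> "95"
      if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Usage on the affected node (paths are assumptions):
#   df -P /var/lib/kubelet /var/lib/docker | full_filesystems 85
```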

If the kubelet service has not been started

systemctl start kubelet && systemctl enable kubelet

If the kubelet service fails to start

journalctl -u kubelet  # read the log to find the cause
journalctl -u kubelet.service >kubelet.log  # write the log to a file for inspection
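
The full journal can be long, so it helps to keep only error and fatal entries. Kubernetes components log in the klog format, where each message starts with a severity letter and the date (for example E1216 in the kube-apiserver output earlier). The filter below is a sketch that assumes this header format; klog_errors is a hypothetical helper name.

```shell
# klog_errors: read journal text on stdin and keep only klog entries
# with severity E (error) or F (fatal), e.g. "E1216 16:22:11.216916 ...".
klog_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+ '
}

# Usage on the affected node:
#   journalctl -u kubelet --since "1 hour ago" --no-pager | klog_errors | tail -n 20
```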

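4. Service access troubleshooting

① How a user reaches a Service through a NodePort

client -> port opened on the node by kube-proxy, where incoming traffic is handled by iptables/ipvs rules -> a group of Pods (spread across the nodes)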

[root@k8s-node1 ~]# kubectl get svc -n kube-system 
NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
grafana                   NodePort    10.0.0.202   <none>        3000:9006/TCP                258d

[root@k8s-node1 ~]# iptables-save |grep 9006
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-system/grafana:" -m tcp --dport 9006 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-system/grafana:" -m tcp --dport 9006 -j KUBE-SVC-3QDDWNGGGXWDZXKH
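
To follow the rule chain further, the KUBE-SVC chain name can be extracted and searched for in turn. This is a sketch that assumes kube-proxy runs in iptables mode and writes rules in the form shown above (newer versions may use different chain names); nodeport_chain is a hypothetical helper name.

```shell
# nodeport_chain PORT: read `iptables-save` output on stdin and print the
# KUBE-SVC-* chain that traffic to NodePort PORT jumps to.
nodeport_chain() {
  awk -v p="$1" '
    {
      dport = ""; target = ""
      for (i = 1; i < NF; i++) {
        if ($i == "--dport") dport = $(i + 1)
        if ($i == "-j")      target = $(i + 1)
      }
      if (dport == p && target ~ /^KUBE-SVC-/) print target
    }'
}

# Usage on a node (as root):
#   chain="$(iptables-save -t nat | nodeport_chain 9006)"
#   iptables-save -t nat | grep -e "-A $chain"    # rules in that chain
```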

② Check that the Pods and the Service are running normally

[root@k8s-master ~]# kubectl get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
web-5dcb957ccc-96nbn   1/1     Running   0          10m   10.244.36.93   k8s-node1   <none>           <none>
web-5dcb957ccc-j5sz7   1/1     Running   0          10m   10.244.36.66   k8s-node1   <none>           <none>
[root@k8s-master ~]# kubectl get svc
NAME          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
kubernetes    ClusterIP   10.96.0.1      <none>        443/TCP        106d
web-service   NodePort    10.99.239.53   <none>        80:31100/TCP   10m

③ Check that the Service is associated with the Pods

[root@k8s-master ~]# kubectl get ep
NAME          ENDPOINTS                         AGE
kubernetes    192.168.56.61:6443                106d
web-service   10.244.36.66:80,10.244.36.93:80   9m43s
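
An empty ENDPOINTS column usually means the Service selector does not match any ready Pod. The check can be scripted as below; this is a sketch and service_endpoints is a hypothetical helper name.

```shell
# service_endpoints SERVICE [NAMESPACE]: print the Pod IPs behind a
# Service, or a hint and a non-zero exit status when there are none.
service_endpoints() {
  svc="$1"
  ns="${2:-default}"
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  if [ -n "$ips" ]; then
    echo "$svc -> $ips"
  else
    echo "$svc has no ready endpoints; compare the Service selector with the Pod labels"
    return 1
  fi
}

# Usage (needs access to a cluster):
#   service_endpoints web-service
# To compare selector and labels by hand:
#   kubectl get svc web-service -o jsonpath='{.spec.selector}'
#   kubectl get pods --show-labels
```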

④ Check that the Service's targetPort is correct

[root@k8s-master ~]# kubectl exec  -it web-5dcb957ccc-96nbn -- netstat -lntp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1/nginx: master pro 
tcp6       0      0 :::80                   :::*                    LISTEN      1/nginx: master pro
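
The port the container listens on has to match the Service's targetPort. A sketch of that comparison follows, assuming a numeric targetPort (a named port would need to be resolved against the Pod spec first) and that netstat is available in the image; listens_on is a hypothetical helper name.

```shell
# listens_on PORT: read `netstat -lnt` output on stdin and succeed when
# some socket is in LISTEN state on PORT.
listens_on() {
  awk -v p="$1" '
    $6 == "LISTEN" {
      n = split($4, parts, ":")          # "0.0.0.0:80" or ":::80"
      if (parts[n] == p) found = 1
    }
    END { exit found ? 0 : 1 }'
}

# Usage (needs access to a cluster):
#   port="$(kubectl get svc web-service -o jsonpath='{.spec.ports[0].targetPort}')"
#   kubectl exec web-5dcb957ccc-96nbn -- netstat -lnt | listens_on "$port" \
#     && echo "targetPort $port is open" || echo "nothing listens on $port"
```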

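⑤ Other reasons a Service may be unreachable

  • Does the Service resolve through DNS?
  • Is kube-proxy running normally?
  • Is kube-proxy writing its iptables rules correctly?
  • Is the CNI network plugin working normally?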
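
Each question above maps to a quick command. The helpers below are a sketch under several assumptions: a kubeadm cluster where kube-proxy carries the label k8s-app=kube-proxy, Calico as the CNI plugin (label k8s-app=calico-node), kube-proxy in iptables mode, and the busybox:1.28 image being pullable. All function names are hypothetical.

```shell
# Resolve a Service name from inside the cluster with a throwaway Pod.
svc_check_dns() {
  kubectl run dns-test --rm -i --restart=Never --image=busybox:1.28 \
    -- nslookup "$1"
}

# List the kube-proxy Pods and the nodes they run on.
svc_check_proxy() {
  kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
}

# Count the KUBE-SVC rules kube-proxy has written on this node (run as root).
svc_check_rules() {
  iptables-save -t nat | grep -c 'KUBE-SVC-'
}

# List the CNI plugin Pods (Calico assumed).
svc_check_cni() {
  kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
}

# Usage (needs access to a cluster):
#   svc_check_dns web-service
#   svc_check_proxy && svc_check_cni
```

In a binary deployment kube-proxy typically runs as a systemd service instead, so systemctl status kube-proxy and journalctl -u kube-proxy would replace svc_check_proxy there.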

