Some Thoughts on Pod Conditions


We use Kubernetes' scheduler-extender mechanism to keep a node from receiving new workloads while a basic monitoring Pod on it is not Ready. For example, the service inside a DaemonSet Pod checks whether the node's CNI network plugin is working; if it is not, the Pod's readiness probe fails and the Pod becomes not Ready, and the scheduler extender then avoids placing business Pods on that node. While implementing this I noticed that a Pod's Status contains two related condition types, Ready and ContainersReady. Below is a quick look at the source code to see how these two conditions relate to each other.

 

The Status field of a healthy Pod looks like this:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2020-08-28T02:58:50Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    status: "True"
    type: ContainersReady
  containerStatuses:
  - containerID: docker://e9875eb8bfae241f61a3139b8f70fd5a65f23687cbc3267bf2a364126ac1a20a
    image: docker.io/grafana/grafana:6.4.3
    imageID: docker-pullable://docker.io/grafana/grafana@sha256:bd55ea2bad17f5016431734b42fdfc202ebdc7d08b6c4ad35ebb03d06efdff69
    lastState: {}
    name: grafana
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2020-08-28T08:38:00Z
  hostIP: 172.16.0.2
  phase: Running
  podIP: 10.244.0.84
  qosClass: Burstable
  startTime: 2020-08-28T08:37:57Z

The descriptions of the PodReady and ContainersReady condition types in a Pod's status are as follows:

PodReady indicates whether the Pod is able to serve requests that arrive through a Service. When it is True, the endpoints controller in kube-controller-manager adds the Pod to the corresponding endpoints list; kube-proxy on each node (the sdn pod on OpenShift) watches that change and installs the corresponding iptables NAT rules for the Service.

ContainersReady indicates whether all containers in the Pod are Ready (the n/m READY column in `kc get pod`, with n <= m); whether each container is Ready is determined by the result of the readiness probe configured for that container.
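As an illustration only (not from the article), here is what such a readiness probe looks like when written with the Go API types; the container name, image and probe parameters below are made up, and the embedded Handler field follows the 1.11-era k8s.io/api (it was later renamed ProbeHandler):

package example

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// exampleContainer returns a container whose readiness probe drives its
// ContainerStatus.Ready flag and, through that, the Pod's ContainersReady
// condition. All names and parameters here are illustrative.
func exampleContainer() corev1.Container {
    return corev1.Container{
        Name:  "cni-checker",
        Image: "example.com/cni-checker:latest",
        ReadinessProbe: &corev1.Probe{
            Handler: corev1.Handler{ // renamed to ProbeHandler in newer API versions
                HTTPGet: &corev1.HTTPGetAction{
                    Path: "/healthz",
                    Port: intstr.FromInt(8080),
                },
            },
            PeriodSeconds:    10,
            FailureThreshold: 3,
        },
    }
}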

// These are valid conditions of pod.
const (
    // PodReady means the pod is able to service requests and should be added to the
    // load balancing pools of all matching services.
    PodReady PodConditionType = "Ready"
    // ContainersReady indicates whether all containers in the pod are ready.
    ContainersReady PodConditionType = "ContainersReady"
)
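For the scheduler-extender scenario above, all the extender needs to do is read these conditions from the Pod status. A minimal sketch of such helpers over the k8s.io/api types (these are my own helpers, not code from the Kubernetes tree):

package podcond

import (
    corev1 "k8s.io/api/core/v1"
)

// getCondition returns the condition of the given type from a Pod's status,
// or nil if it is not present.
func getCondition(pod *corev1.Pod, t corev1.PodConditionType) *corev1.PodCondition {
    for i := range pod.Status.Conditions {
        if pod.Status.Conditions[i].Type == t {
            return &pod.Status.Conditions[i]
        }
    }
    return nil
}

// isPodReady reports whether the Pod has Ready=True; this is what the
// scheduler extender checks before allowing business Pods onto the node.
func isPodReady(pod *corev1.Pod) bool {
    c := getCondition(pod, corev1.PodReady)
    return c != nil && c.Status == corev1.ConditionTrue
}

// areContainersReady reports whether the Pod has ContainersReady=True.
func areContainersReady(pod *corev1.Pod) bool {
    c := getCondition(pod, corev1.ContainersReady)
    return c != nil && c.Status == corev1.ConditionTrue
}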

 

We know that the ContainersReady condition is determined by the readiness probes, so what can change the value of PodReady?

The source code below is based on Kubernetes 1.11: https://github.com/kubernetes/kubernetes/tree/release-1.11

1. If a node's status becomes NotReady, the node controller (nodelifecycle controller) calls MarkAllPodsNotReady to set the PodReady condition of every Pod on that node to False, as follows:

#pkg/controller/nodelifecycle/node_lifecycle_controller.go
if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
    nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
    if err = nodeutil.MarkAllPodsNotReady(nc.kubeClient, node); err != nil {
        utilruntime.HandleError(fmt.Errorf("Unable to mark all pods NotReady on node %v: %v", node.Name, err))
    }
}


#pkg/controller/util/node/controller_utils.go — this function updates the Pod status directly through the clientset
func MarkAllPodsNotReady(kubeClient clientset.Interface, node *v1.Node) error
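A simplified paraphrase of what MarkAllPodsNotReady does in the 1.11 tree (logging, event recording and error aggregation omitted; the client-go calls use the 1.11-era signatures without a context argument): list the Pods bound to the node and flip each Pod's Ready condition to False through the status subresource.

package nodeutil

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/fields"
    clientset "k8s.io/client-go/kubernetes"
)

// markAllPodsNotReady is a simplified paraphrase of MarkAllPodsNotReady:
// it lists all Pods scheduled to the node and sets their Ready condition
// to False via the Pod status subresource.
func markAllPodsNotReady(kubeClient clientset.Interface, node *v1.Node) error {
    opts := metav1.ListOptions{
        // "spec.nodeName" is the field selector the real code builds via api.PodHostField.
        FieldSelector: fields.OneTermEqualSelector("spec.nodeName", node.Name).String(),
    }
    pods, err := kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(opts)
    if err != nil {
        return err
    }
    for i := range pods.Items {
        pod := &pods.Items[i]
        for j := range pod.Status.Conditions {
            if pod.Status.Conditions[j].Type == v1.PodReady {
                pod.Status.Conditions[j].Status = v1.ConditionFalse
                if _, err := kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
                    return err
                }
                break
            }
        }
    }
    return nil
}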

 

2. The kubelet's status manager updates the Pod's status in the API server (and hence in etcd) based on container state. As shown below, the Start method in status_manager.go reads status changes from podStatusChannel, and syncPod merges each change and writes it back through the API server.

 

#pkg/kubelet/status/status_manager.go Start()

// receive Pod status changes from the channel
go wait.Forever(func() {
    for {
        select {
        case syncRequest := <-m.podStatusChannel:
            klog.V(5).Infof("Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel",
                syncRequest.podUID, syncRequest.status.version, syncRequest.status.status)
            m.syncPod(syncRequest.podUID, syncRequest.status)
        case <-syncTicker:
            ......
        }
    }
}, 0)



// syncPod merges the new status via mergePodStatus and patches it back through the API server
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    ...
    pod, err := m.kubeClient.CoreV1().Pods(status.podNamespace).Get(context.TODO(), status.podName, metav1.GetOptions{})
    oldStatus := pod.Status.DeepCopy()
    newPod, patchBytes, unchanged, err := statusutil.PatchPodStatus(m.kubeClient, pod.Namespace, pod.Name, pod.UID, *oldStatus, mergePodStatus(*oldStatus, status.status))
    ...
}
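statusutil.PatchPodStatus is essentially a two-way strategic-merge patch against the Pod's status subresource. A rough sketch of that idea (not the verbatim kubelet helper; the function below is my own and uses the pre-context client-go signatures):

package statusutil

import (
    "encoding/json"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/apimachinery/pkg/util/strategicpatch"
    clientset "k8s.io/client-go/kubernetes"
)

// patchPodStatus diffs the Pod with its old status against the Pod with the
// new status, builds a strategic merge patch, and applies it to the "status"
// subresource.
func patchPodStatus(c clientset.Interface, pod *v1.Pod, newStatus v1.PodStatus) (*v1.Pod, error) {
    oldData, err := json.Marshal(pod)
    if err != nil {
        return nil, err
    }
    updated := pod.DeepCopy()
    updated.Status = newStatus
    newData, err := json.Marshal(updated)
    if err != nil {
        return nil, err
    }
    patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, v1.Pod{})
    if err != nil {
        return nil, err
    }
    return c.CoreV1().Pods(pod.Namespace).Patch(pod.Name, types.StrategicMergePatchType, patchBytes, "status")
}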

The status values placed on podStatusChannel above are produced by updateStatusInternal(), and the status that updateStatusInternal() puts on the channel is constructed in SetContainerReadiness(), as follows:

// updateStatusInternal updates the internal status cache, and queues an update to the api server if
// necessary. Returns whether an update was triggered.
// This method IS NOT THREAD SAFE and must be called from a locked function.
func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {

    normalizeStatus(pod, &status)

    newStatus := versionedPodStatus{
        status:       status,
        version:      cachedStatus.version + 1,
        podName:      pod.Name,
        podNamespace: pod.Namespace,
    }
    m.podStatuses[pod.UID] = newStatus

    select {
    case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
        glog.V(5).Infof("Status Manager: adding pod: %q, with status: (%q, %v) to podStatusChannel",
            pod.UID, newStatus.version, newStatus.status)
        return true
    default:
                ......
    }
}

func (m *manager) SetContainerReadiness(podUID types.UID, containerID kubecontainer.ContainerID, ready bool) {

    pod, ok := m.podManager.GetPodByUID(podUID)
    oldStatus, found := m.podStatuses[pod.UID]

    // Find the container to update.
    containerStatus, _, ok := findContainerStatus(&oldStatus.status, containerID.String())
    // Check whether the container's ready state in the cached Pod status already matches the ready argument
    if containerStatus.Ready == ready {
        glog.V(4).Infof("Container readiness unchanged (%v): %q - %q", ready,
            format.Pod(pod), containerID.String())
        return
    }

    // Make sure we're not updating the cached version.
    // Do not mutate the cached ContainerStatus directly, because this update may never be committed to the apiserver
    status := *oldStatus.status.DeepCopy()
    containerStatus, _, _ = findContainerStatus(&status, containerID.String())
    containerStatus.Ready = ready

    // updateConditionFunc updates the corresponding type of condition
    updateConditionFunc := func(conditionType v1.PodConditionType, condition v1.PodCondition) {
                ......         
            status.Conditions[conditionIndex] = condition
            ......
    }
    // GeneratePodReadyCondition() builds the Pod conditions here: when every entry in
    // status.ContainerStatuses is Ready, it returns a PodReady condition with Status=True
    updateConditionFunc(v1.PodReady, GeneratePodReadyCondition(&pod.Spec, status.Conditions, status.ContainerStatuses, status.Phase))
    updateConditionFunc(v1.ContainersReady, GenerateContainersReadyCondition(&pod.Spec, status.ContainerStatuses, status.Phase))
    m.updateStatusInternal(pod, status, false)
}
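The aggregation described in the comment above boils down to the following check (a simplification of GenerateContainersReadyCondition / GeneratePodReadyCondition that ignores readiness gates and the Pod phase): PodReady can only become True when every container declared in the spec reports Ready.

package status

import (
    v1 "k8s.io/api/core/v1"
)

// allContainersReady is a simplified version of the check behind
// GenerateContainersReadyCondition: every container declared in the Pod spec
// must have a ContainerStatus with Ready=true. Readiness gates and the
// terminal-phase special cases are ignored here.
func allContainersReady(spec *v1.PodSpec, statuses []v1.ContainerStatus) bool {
    ready := make(map[string]bool, len(statuses))
    for _, s := range statuses {
        ready[s.Name] = s.Ready
    }
    for _, c := range spec.Containers {
        if !ready[c.Name] {
            return false
        }
    }
    return true
}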

The status manager's SetContainerReadiness() method is only called from the kubelet's prober module (prober_manager), as follows:

// The prober_manager module calls the status manager's SetContainerReadiness with the readiness probe result to update the container's ready flag
// pkg/kubelet/prober/prober_manager.go
func (m *manager) updateReadiness() {
    update := <-m.readinessManager.Updates()

    ready := update.Result == results.Success
    m.statusManager.SetContainerReadiness(update.PodUID, update.ContainerID, ready)
}

 

Even though the READY column of `kc get pod` shows 1/1 (i.e. ContainersReady=True), that does not mean the Pod is receiving requests from its Service. This is easy to reproduce: stop the kubelet on the node; after about 40 seconds (set by controller-manager flags) the node controller marks the node NotReady and updates the PodReady condition of the Pods on it to False, and once the endpoints controller watches that Pod status change it removes the Pod from the Service's endpoints list.

[root@k8s-master kubelet]# kc get pod -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE
grafana-b5c674bc4-8xmzb      1/1     Running   0          4d    10.244.0.84   k8s-master.com
prometheus-9d44889cc-6jm2h   1/1     Running   0          4d    10.244.0.91   k8s-master.com

### The Pods above show 1/1, but the Service has no endpoints ###
[root@k8s-master kubelet]# kc describe svc grafana
Name:                     grafana
Namespace:                istio-system
Labels:                   app=grafana
                          release=istio
Annotations:              kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"grafana","release":"istio"},"name":"grafana","namespace"...
Selector:                 app=grafana
Type:                     NodePort
IP:                       10.96.188.25
Port:                     http  3000/TCP
TargetPort:               3000/TCP
NodePort:                 http  31652/TCP
Endpoints:
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
[root@k8s-master kubelet]#

 

