Some Thoughts on Pod Conditions


We use Kubernetes' scheduler-extender mechanism to keep a node from receiving new workloads while a basic monitoring Pod on it is not Ready. For example, the service inside a DaemonSet Pod checks whether the node's CNI network plugin is working; if it is not, the Pod's readiness probe fails and the Pod becomes not Ready, and the scheduler extender then avoids placing business Pods on that node. While implementing this I noticed that a Pod's Status contains two related condition types, Ready and ContainersReady. Below is a quick look at the source code to see how these two conditions relate to each other.

 

The Status field of a healthy Pod looks like this:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2020-08-28T02:58:50Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    status: "True"
    type: ContainersReady
  containerStatuses:
  - containerID: docker://e9875eb8bfae241f61a3139b8f70fd5a65f23687cbc3267bf2a364126ac1a20a
    image: docker.io/grafana/grafana:6.4.3
    imageID: docker-pullable://docker.io/grafana/grafana@sha256:bd55ea2bad17f5016431734b42fdfc202ebdc7d08b6c4ad35ebb03d06efdff69
    lastState: {}
    name: grafana
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2020-08-28T08:38:00Z
  hostIP: 172.16.0.2
  phase: Running
  podIP: 10.244.0.84
  qosClass: Burstable
  startTime: 2020-08-28T08:37:57Z

The descriptions of the PodReady and ContainersReady condition types in a Pod's status are as follows:

PodReady indicates whether the Pod is able to serve requests that arrive through a Service. When it is True, the endpoints controller in kube-controller-manager adds the Pod to the corresponding endpoints list; kube-proxy on each node (the sdn pod on OpenShift) watches that change and installs the corresponding iptables NAT rules for the Service.

ContainersReady indicates whether all containers in the Pod are Ready (the n/m READY column in `kc get pod`, with n <= m); whether each container is Ready is determined by the result of the readiness probe configured for that container.
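As an illustration only (not from the article), here is what such a readiness probe looks like when written with the Go API types; the container name, image and probe parameters below are made up, and the embedded Handler field follows the 1.11-era k8s.io/api (it was later renamed ProbeHandler):

package example

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// exampleContainer returns a container whose readiness probe drives its
// ContainerStatus.Ready flag and, through that, the Pod's ContainersReady
// condition. All names and parameters here are illustrative.
func exampleContainer() corev1.Container {
    return corev1.Container{
        Name:  "cni-checker",
        Image: "example.com/cni-checker:latest",
        ReadinessProbe: &corev1.Probe{
            Handler: corev1.Handler{ // renamed to ProbeHandler in newer API versions
                HTTPGet: &corev1.HTTPGetAction{
                    Path: "/healthz",
                    Port: intstr.FromInt(8080),
                },
            },
            PeriodSeconds:    10,
            FailureThreshold: 3,
        },
    }
}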

// These are valid conditions of pod.
const (
    // PodReady means the pod is able to service requests and should be added to the
    // load balancing pools of all matching services.
    PodReady PodConditionType = "Ready"
    // ContainersReady indicates whether all containers in the pod are ready.
    ContainersReady PodConditionType = "ContainersReady"
)
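For the scheduler-extender scenario above, all the extender needs to do is read these conditions from the Pod status. A minimal sketch of such helpers over the k8s.io/api types (these are my own helpers, not code from the Kubernetes tree):

package podcond

import (
    corev1 "k8s.io/api/core/v1"
)

// getCondition returns the condition of the given type from a Pod's status,
// or nil if it is not present.
func getCondition(pod *corev1.Pod, t corev1.PodConditionType) *corev1.PodCondition {
    for i := range pod.Status.Conditions {
        if pod.Status.Conditions[i].Type == t {
            return &pod.Status.Conditions[i]
        }
    }
    return nil
}

// isPodReady reports whether the Pod has Ready=True; this is what the
// scheduler extender checks before allowing business Pods onto the node.
func isPodReady(pod *corev1.Pod) bool {
    c := getCondition(pod, corev1.PodReady)
    return c != nil && c.Status == corev1.ConditionTrue
}

// areContainersReady reports whether the Pod has ContainersReady=True.
func areContainersReady(pod *corev1.Pod) bool {
    c := getCondition(pod, corev1.ContainersReady)
    return c != nil && c.Status == corev1.ConditionTrue
}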

 

We know that the ContainersReady condition is determined by the readiness probes, so what can change the value of PodReady?

The source code below is based on Kubernetes 1.11: https://github.com/kubernetes/kubernetes/tree/release-1.11

1. If a node's status becomes NotReady, the node controller (nodelifecycle controller) calls MarkAllPodsNotReady to set the PodReady condition of every Pod on that node to False, as follows:

#pkg/controller/nodelifecycle/node_lifecycle_controller.go
if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
    nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
    if err = nodeutil.MarkAllPodsNotReady(nc.kubeClient, node); err != nil {
        utilruntime.HandleError(fmt.Errorf("Unable to mark all pods NotReady on node %v: %v", node.Name, err))
    }
}


#pkg/controller/util/node/controller_utils.go — this function updates the Pod status directly through the clientset
func MarkAllPodsNotReady(kubeClient clientset.Interface, node *v1.Node) error
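A simplified paraphrase of what MarkAllPodsNotReady does in the 1.11 tree (logging, event recording and error aggregation omitted; the client-go calls use the 1.11-era signatures without a context argument): list the Pods bound to the node and flip each Pod's Ready condition to False through the status subresource.

package nodeutil

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/fields"
    clientset "k8s.io/client-go/kubernetes"
)

// markAllPodsNotReady is a simplified paraphrase of MarkAllPodsNotReady:
// it lists all Pods scheduled to the node and sets their Ready condition
// to False via the Pod status subresource.
func markAllPodsNotReady(kubeClient clientset.Interface, node *v1.Node) error {
    opts := metav1.ListOptions{
        // "spec.nodeName" is the field selector the real code builds via api.PodHostField.
        FieldSelector: fields.OneTermEqualSelector("spec.nodeName", node.Name).String(),
    }
    pods, err := kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(opts)
    if err != nil {
        return err
    }
    for i := range pods.Items {
        pod := &pods.Items[i]
        for j := range pod.Status.Conditions {
            if pod.Status.Conditions[j].Type == v1.PodReady {
                pod.Status.Conditions[j].Status = v1.ConditionFalse
                if _, err := kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
                    return err
                }
                break
            }
        }
    }
    return nil
}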

 

2. The kubelet's status manager updates the Pod's status in the API server (and hence in etcd) based on container state. As shown below, the Start method in status_manager.go reads status changes from podStatusChannel, and syncPod merges each change and writes it back through the API server.

 

#pkg/kubelet/status/status_manager.go Start()

// receive Pod status changes from the channel
go wait.Forever(func() {
    for {
        select {
        case syncRequest := <-m.podStatusChannel:
            klog.V(5).Infof("Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel",
                syncRequest.podUID, syncRequest.status.version, syncRequest.status.status)
            m.syncPod(syncRequest.podUID, syncRequest.status)
        case <-syncTicker:
            ......
        }
    }
}, 0)



// syncPod merges the new status via mergePodStatus and patches it back through the API server
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    ...
    pod, err := m.kubeClient.CoreV1().Pods(status.podNamespace).Get(context.TODO(), status.podName, metav1.GetOptions{})
    oldStatus := pod.Status.DeepCopy()
    newPod, patchBytes, unchanged, err := statusutil.PatchPodStatus(m.kubeClient, pod.Namespace, pod.Name, pod.UID, *oldStatus, mergePodStatus(*oldStatus, status.status))
    ...
}
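statusutil.PatchPodStatus is essentially a two-way strategic-merge patch against the Pod's status subresource. A rough sketch of that idea (not the verbatim kubelet helper; the function below is my own and uses the pre-context client-go signatures):

package statusutil

import (
    "encoding/json"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/apimachinery/pkg/util/strategicpatch"
    clientset "k8s.io/client-go/kubernetes"
)

// patchPodStatus diffs the Pod with its old status against the Pod with the
// new status, builds a strategic merge patch, and applies it to the "status"
// subresource.
func patchPodStatus(c clientset.Interface, pod *v1.Pod, newStatus v1.PodStatus) (*v1.Pod, error) {
    oldData, err := json.Marshal(pod)
    if err != nil {
        return nil, err
    }
    updated := pod.DeepCopy()
    updated.Status = newStatus
    newData, err := json.Marshal(updated)
    if err != nil {
        return nil, err
    }
    patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, v1.Pod{})
    if err != nil {
        return nil, err
    }
    return c.CoreV1().Pods(pod.Namespace).Patch(pod.Name, types.StrategicMergePatchType, patchBytes, "status")
}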

The status values placed on podStatusChannel above are produced by updateStatusInternal(), and the status that updateStatusInternal() puts on the channel is constructed in SetContainerReadiness(), as follows:

// updateStatusInternal updates the internal status cache, and queues an update to the api server if
// necessary. Returns whether an update was triggered.
// This method IS NOT THREAD SAFE and must be called from a locked function.
func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {

    normalizeStatus(pod, &status)

    newStatus := versionedPodStatus{
        status:       status,
        version:      cachedStatus.version + 1,
        podName:      pod.Name,
        podNamespace: pod.Namespace,
    }
    m.podStatuses[pod.UID] = newStatus

    select {
    case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
        glog.V(5).Infof("Status Manager: adding pod: %q, with status: (%q, %v) to podStatusChannel",
            pod.UID, newStatus.version, newStatus.status)
        return true
    default:
                ......
    }
}

func (m *manager) SetContainerReadiness(podUID types.UID, containerID kubecontainer.ContainerID, ready bool) {

    pod, ok := m.podManager.GetPodByUID(podUID)
    oldStatus, found := m.podStatuses[pod.UID]

    // Find the container to update.
    containerStatus, _, ok := findContainerStatus(&oldStatus.status, containerID.String())
    // Check whether the container's ready state in the cached Pod status already matches the ready argument
    if containerStatus.Ready == ready {
        glog.V(4).Infof("Container readiness unchanged (%v): %q - %q", ready,
            format.Pod(pod), containerID.String())
        return
    }

    // Make sure we're not updating the cached version.
    // Do not mutate the cached ContainerStatus directly, because this update may never be committed to the apiserver
    status := *oldStatus.status.DeepCopy()
    containerStatus, _, _ = findContainerStatus(&status, containerID.String())
    containerStatus.Ready = ready

    // updateConditionFunc updates the corresponding type of condition
    updateConditionFunc := func(conditionType v1.PodConditionType, condition v1.PodCondition) {
                ......         
            status.Conditions[conditionIndex] = condition
            ......
    }
    // GeneratePodReadyCondition() builds the Pod conditions here: when every entry in
    // status.ContainerStatuses is Ready, it returns a PodReady condition with Status=True
    updateConditionFunc(v1.PodReady, GeneratePodReadyCondition(&pod.Spec, status.Conditions, status.ContainerStatuses, status.Phase))
    updateConditionFunc(v1.ContainersReady, GenerateContainersReadyCondition(&pod.Spec, status.ContainerStatuses, status.Phase))
    m.updateStatusInternal(pod, status, false)
}
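The aggregation described in the comment above boils down to the following check (a simplification of GenerateContainersReadyCondition / GeneratePodReadyCondition that ignores readiness gates and the Pod phase): PodReady can only become True when every container declared in the spec reports Ready.

package status

import (
    v1 "k8s.io/api/core/v1"
)

// allContainersReady is a simplified version of the check behind
// GenerateContainersReadyCondition: every container declared in the Pod spec
// must have a ContainerStatus with Ready=true. Readiness gates and the
// terminal-phase special cases are ignored here.
func allContainersReady(spec *v1.PodSpec, statuses []v1.ContainerStatus) bool {
    ready := make(map[string]bool, len(statuses))
    for _, s := range statuses {
        ready[s.Name] = s.Ready
    }
    for _, c := range spec.Containers {
        if !ready[c.Name] {
            return false
        }
    }
    return true
}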

The status manager's SetContainerReadiness() method is only called from the kubelet's prober module (prober_manager), as follows:

// The prober_manager module calls the status manager's SetContainerReadiness with the readiness probe result to update the container's ready flag
// pkg/kubelet/prober/prober_manager.go
func (m *manager) updateReadiness() {
    update := <-m.readinessManager.Updates()

    ready := update.Result == results.Success
    m.statusManager.SetContainerReadiness(update.PodUID, update.ContainerID, ready)
}

 

Even though the READY column of `kc get pod` shows 1/1 (i.e. ContainersReady=True), that does not mean the Pod is receiving requests from its Service. This is easy to reproduce: stop the kubelet on the node; after about 40 seconds (set by controller-manager flags) the node controller marks the node NotReady and updates the PodReady condition of the Pods on it to False, and once the endpoints controller watches that Pod status change it removes the Pod from the Service's endpoints list.

[root@k8s-master kubelet]# kc get pod -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE
grafana-b5c674bc4-8xmzb      1/1     Running   0          4d    10.244.0.84   k8s-master.com
prometheus-9d44889cc-6jm2h   1/1     Running   0          4d    10.244.0.91   k8s-master.com

### The Pods above show 1/1, but the Service has no endpoints ###
[root@k8s-master kubelet]# kc describe svc grafana
Name:                     grafana
Namespace:                istio-system
Labels:                   app=grafana
                          release=istio
Annotations:              kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"grafana","release":"istio"},"name":"grafana","namespace"...
Selector:                 app=grafana
Type:                     NodePort
IP:                       10.96.188.25
Port:                     http  3000/TCP
TargetPort:               3000/TCP
NodePort:                 http  31652/TCP
Endpoints:
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
[root@k8s-master kubelet]#

 

