kubernetes pod termination pending

Reposted article. 2018-07-06 22:48. Tag: kubernetes

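After upgrading k8s from 1.7.9 to 1.10.2, we found that deleted pods stayed in the Terminating state. Investigation showed that every pod that could not be deleted had one thing in common: the command section of its pod YAML was wrong, as shown below: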

apiVersion: v1
kind: Pod
metadata:
  name: bad-pod-termation-test
spec:
  containers:
    - image: nginx
      command:
      - xxxx
      name: pad-pod-test

As you can see, the pod's command is one that does not exist. After the pod is created from this YAML, it reports the following status:

% kubectl get pods 
NAME                     READY     STATUS              RESTARTS   AGE
bad-pod-termation-test   0/1       RunContainerError   0          20s

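On the host, docker ps -a shows the corresponding Docker containers in the Created state (that is, they could not be started). Because the pod fails to start and is retried, there are several container instances: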

CONTAINER ID        IMAGE                              COMMAND                  CREATED              STATUS              PORTS               NAMES
b66c1a3de3ae        nginx                              "xxxx"                   9 seconds ago        Created                                 k8s_pad-pod-test_bad-pod-termation-test_default_7786ffea-7de9-11e8-9754-509a4c2d27d1_3
148a312b89cf        nginx                              "xxxx"                   43 seconds ago       Created                                 k8s_pad-pod-test_bad-pod-termation-test_default_7786ffea-7de9-11e8-9754-509a4c2d27d1_2
6414f874ffe0        k8s.gcr.io/pause-amd64:3.1         "/pause"                 About a minute ago   Up About a minute                       k8s_POD_bad-pod-termation-test_default_7786ffea-7de9-11e8-9754-509a4c2d27d1_0
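
The container names in this listing can be used to map a Docker container back to its pod. Judging from the listing alone, they follow the pattern k8s_<container>_<pod>_<namespace>_<podUID>_<attempt>. The sketch below splits such a name into its parts. The containerRef type and parseContainerName function are hypothetical helpers written for illustration, and reading the trailing number as an attempt counter is an assumption based on the output above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// containerRef holds the fields that appear to be encoded in a
// kubelet-managed Docker container name (hypothetical helper type).
type containerRef struct {
	Container string // container name from the pod spec; "POD" for the pause container
	Pod       string
	Namespace string
	PodUID    string
	Attempt   int // assumed to be the start attempt counter
}

// parseContainerName splits a name of the assumed form
// k8s_<container>_<pod>_<namespace>_<podUID>_<attempt>.
func parseContainerName(name string) (containerRef, error) {
	parts := strings.Split(name, "_")
	if len(parts) != 6 || parts[0] != "k8s" {
		return containerRef{}, fmt.Errorf("unexpected container name %q", name)
	}
	attempt, err := strconv.Atoi(parts[5])
	if err != nil {
		return containerRef{}, fmt.Errorf("bad attempt in %q: %v", name, err)
	}
	return containerRef{
		Container: parts[1],
		Pod:       parts[2],
		Namespace: parts[3],
		PodUID:    parts[4],
		Attempt:   attempt,
	}, nil
}

func main() {
	ref, err := parseContainerName("k8s_pad-pod-test_bad-pod-termation-test_default_7786ffea-7de9-11e8-9754-509a4c2d27d1_3")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ref)
}
```

Splitting on the underscore is safe here only because pod, container, and namespace names cannot contain underscores.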

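At this point, deleting the pod leaves it stuck in the Terminating state. The only way out is to force the deletion with kubectl delete pods bad-pod-termation-test --grace-period=0 --force. Force deletion is discouraged by the official documentation because it can leak resources, so it is not a long-term solution.
After raising the kubelet log level and looking closely, we found that the kubelet keeps printing one suspicious log line: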

I0702 19:26:43.712496   26521 kubelet_pods.go:942] Pod "bad-pod-termation-test_default(9eae939b-7dea-11e8-9754-509a4c2d27d1)" is terminated, but some containers have not been cleaned up: [0xc4218d1260 0xc4228ae540]

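In other words, the containers have not been fully removed, and the kubelet is waiting for them to be deleted. The log prints pointers, that is, the addresses of the variables holding the container information, but it is a fair guess that they are the pod's containers. After manually running docker rm on the two containers in the Created state shown above, the pod was deleted immediately and disappeared. This suggested a kubelet bug that prevents some resources of Created containers from being released. Why would that happen?
Reading the code shows that the kubelet keeps a PodCache holding the information for all pods. An entry is added whenever a pod is created, and the entry is cleared only when the pod's containers are deleted. The pod can be deleted only after its cache entry has been cleared.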

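In our earlier environment, to make debugging easier, we kept the corpses of exited containers by setting the kubelet flag --minimum-container-ttl-duration=36h. That flag is already deprecated, and the official recommendation is to use --eviction-hard or --eviction-soft instead. Because minimum-container-ttl-duration still worked in 1.7.9, we paid no attention to the deprecation warning and set the same flag in 1.10.2. As a result, the cache could not be cleared and the pod could not be deleted.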

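By the analysis above, a pod can be deleted only after its containers are deleted. Would setting minimum-container-ttl-duration to retain containers then not make every pod undeletable? Why could normal pods be deleted before? Were the container corpses of normal pods being removed anyway? A test confirmed it: after a normal pod is deleted, its exited containers are removed right away, so minimum-container-ttl-duration has no effect at all. For the broken pod created from the YAML above, however, the flag does take effect: the containers in the Created state are not removed until minimum-container-ttl-duration has elapsed.
This is odd, but any problem with a deprecated flag is forgivable, since upstream has stated clearly that it should not be used. The only option is to drop the flag to avoid the problem. We then built the latest code from the master branch (versions below) and tried again: whether or not the flag is set, the pod's containers are deleted immediately, so the problem of pods stuck in Terminating no longer occurs.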

% kubectl version                                             
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.0-alpha.0.1939+5114d4e0b040d2", GitCommit:"5114d4e0b040d2f00417e0fd3e11204afd30f63c", GitTreeState:"clean", BuildDate:"2018-07-07T02:54:29Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

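That leaves one question: why could pods be deleted normally in 1.7.9 with minimum-container-ttl-duration set? How could it both keep the container corpses and delete the pod? The source shows that k8s calls PodResourcesAreReclaimed to decide whether a pod's resources have been reclaimed, and a pod can be deleted only once all of them have been. The 1.7.9 implementation is shown below. It checks, in order, whether any containers are still running, whether the volumes have been cleaned up, and whether the pod cgroup sandbox has been cleaned up: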

func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
	if !notRunning(status.ContainerStatuses) {
		// We shouldnt delete pods that still have running containers
		glog.V(3).Infof("Pod %q is terminated, but some containers are still running", format.Pod(pod))
		return false
	}
	if kl.podVolumesExist(pod.UID) && !kl.kubeletConfiguration.KeepTerminatedPodVolumes {
		// We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes
		glog.V(3).Infof("Pod %q is terminated, but some volumes have not been cleaned up", format.Pod(pod))
		return false
	}
	if kl.kubeletConfiguration.CgroupsPerQOS {
		pcm := kl.containerManager.NewPodContainerManager()
		if pcm.Exists(pod) {
			glog.V(3).Infof("Pod %q is terminated, but pod cgroup sandbox has not been cleaned up", format.Pod(pod))
			return false
		}
	}
	return true
}

The implementation in v1.10.2 is as follows:

func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
	if !notRunning(status.ContainerStatuses) {
		// We shouldnt delete pods that still have running containers
		glog.V(3).Infof("Pod %q is terminated, but some containers are still running", format.Pod(pod))
		return false
	}
	// pod's containers should be deleted
	runtimeStatus, err := kl.podCache.Get(pod.UID)
	if err != nil {
		glog.V(3).Infof("Pod %q is terminated, Error getting runtimeStatus from the podCache: %s", format.Pod(pod), err)
		return false
	}
	if len(runtimeStatus.ContainerStatuses) > 0 {
		glog.V(3).Infof("Pod %q is terminated, but some containers have not been cleaned up: %+v", format.Pod(pod), runtimeStatus.ContainerStatuses)
		return false
	}
	if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {
		// We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes
		glog.V(3).Infof("Pod %q is terminated, but some volumes have not been cleaned up", format.Pod(pod))
		return false
	}
	if kl.kubeletConfiguration.CgroupsPerQOS {
		pcm := kl.containerManager.NewPodContainerManager()
		if pcm.Exists(pod) {
			glog.V(3).Infof("Pod %q is terminated, but pod cgroup sandbox has not been cleaned up", format.Pod(pod))
			return false
		}
	}
	return true
}

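The resource reclamation logic in 1.7.9 differs from that in 1.10.2: v1.10.2 adds a check that the cache is empty. As noted above, the cache is cleared only after the containers are deleted. In 1.7.9, with minimum-container-ttl-duration set, exited containers are not cleaned up, so the cache is not cleared either, which means resources are in fact leaked in that case. To verify this conclusion, we added the same cache-empty check to the PodResourcesAreReclaimed method in 1.7.9, and pods did indeed get stuck in Terminating.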
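
The difference between the two versions can be reduced to a small model. The sketch below is not kubelet code: podState, reclaimedV17, and reclaimedV110 are hypothetical names, and the KeepTerminatedPodVolumes and CgroupsPerQOS switches are left out for brevity. It only shows why a pod with retained, non-running containers passes the 1.7.9 check but fails the 1.10.2 one.

```go
package main

import "fmt"

// podState is a hypothetical, simplified snapshot of what the kubelet
// knows about a terminated pod. It is not a kubelet type.
type podState struct {
	RunningContainers int  // containers still reported as running
	CachedContainers  int  // container records still in the pod cache, in any state
	VolumesExist      bool // pod volumes not yet cleaned up
	CgroupExists      bool // pod-level cgroup not yet removed
}

// reclaimedV17 models the 1.7.9 check: running containers, volumes, cgroup.
func reclaimedV17(p podState) bool {
	return p.RunningContainers == 0 && !p.VolumesExist && !p.CgroupExists
}

// reclaimedV110 models the 1.10.2 check, which additionally requires
// that the cache holds no container records for the pod.
func reclaimedV110(p podState) bool {
	if p.CachedContainers > 0 {
		return false
	}
	return reclaimedV17(p)
}

func main() {
	// A pod whose containers never started: two retained containers, none running.
	stuck := podState{CachedContainers: 2}
	fmt.Println(reclaimedV17(stuck), reclaimedV110(stuck)) // true false

	// The same pod after the retained containers are removed, e.g. with docker rm.
	stuck.CachedContainers = 0
	fmt.Println(reclaimedV17(stuck), reclaimedV110(stuck)) // true true
}
```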

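Back to why we set the minimum-container-ttl-duration flag in the first place: to keep information after a container exits, for debugging and tracing state. So what should we do without this flag? Where do we find the information of the departed? The official documentation describes minimum-container-ttl-duration as "deprecated once old logs are stored outside of container's context", so logs may be stored outside the container in the future, but that is clearly not implemented yet.

In addition, several experiments showed that as long as the pod is not deleted manually, its exited containers are retained. If there are several exited instances, not every one is kept, but at least one is, and it can be used for debugging. Conversely, keeping every exited instance would preserve each container's entire runtime context. If a container writes a lot of data to its writable layer, that would tie up a large amount of disk space that cannot be released. So avoid keeping too many exited instances. The number retained by default is generally enough for debugging, and anything beyond that has to be preserved through remote backup.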
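
To make the retention trade-off concrete, here is a minimal sketch of one policy that matches the observation above: keep the most recent exited instances of a container and evict the rest. This is an illustration under assumptions, not the kubelet's garbage collector. The exitedInstance type, the evictable function, and the keep parameter are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// exitedInstance is a hypothetical record of one exited container instance.
type exitedInstance struct {
	ID      string
	Attempt int // higher means more recent
}

// evictable returns the instances to remove so that only the `keep`
// most recent ones remain. The input slice is left untouched.
func evictable(instances []exitedInstance, keep int) []exitedInstance {
	sorted := append([]exitedInstance(nil), instances...)
	// Newest first.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Attempt > sorted[j].Attempt })
	if keep < 0 {
		keep = 0
	}
	if len(sorted) <= keep {
		return nil
	}
	return sorted[keep:]
}

func main() {
	instances := []exitedInstance{
		{ID: "c1", Attempt: 1},
		{ID: "c2", Attempt: 2},
		{ID: "c3", Attempt: 3},
	}
	// Keep only the newest instance for debugging; evict the older ones.
	for _, inst := range evictable(instances, 1) {
		fmt.Println("evict", inst.ID)
	}
}
```

A larger keep value retains more debugging context at the cost of the disk space held by each instance's writable layer.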
