I. Kubernetes scheduling workflow
1. (Predicates) First, filter out the nodes that cannot meet the Pod's requirements at all.
2. (Priorities) Score the remaining nodes with a set of priority functions; if a single node has the highest score, it is selected directly.
3. If several nodes tie for the highest score, one of them is picked at random.
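The filter, score, and tie-break steps above can be sketched in a few lines of Python. This is a minimal illustration, not the real scheduler: the node fields (`cpu_free`, `cpu_capacity`) and the toy predicate/priority functions are made up for the example.

```python
import random

def schedule(pod, nodes, predicates, priorities):
    """Pick a node for `pod`: filter with predicates, score with
    priority functions, break ties at random."""
    # Predicate phase: drop nodes that fail any hard requirement.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        return None  # no feasible node: the Pod stays Pending

    # Priority phase: each priority function contributes to the node's score.
    scores = {n["name"]: sum(f(pod, n) for f in priorities) for n in feasible}
    best = max(scores.values())

    # Tie-break: choose randomly among the top-scoring nodes.
    winners = [name for name, s in scores.items() if s == best]
    return random.choice(winners)

# Toy predicate and priority functions (illustrative only).
fits_cpu = lambda pod, n: n["cpu_free"] >= pod["cpu"]
least_requested = lambda pod, n: n["cpu_free"] * 10 // n["cpu_capacity"]

nodes = [
    {"name": "node1", "cpu_free": 3, "cpu_capacity": 4},
    {"name": "node2", "cpu_free": 1, "cpu_capacity": 4},
]
pod = {"cpu": 2}
print(schedule(pod, nodes, [fits_cpu], [least_requested]))  # node1 (node2 is filtered out)
```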
II. Scheduling mechanisms
1. Node selection: constrain which nodes the Pod may run on.
2. Pod selection: run a Pod in the same place as another Pod (pod affinity), or keep it away from a given Pod (pod anti-affinity).
3. Taints: a Pod can be scheduled onto a tainted node only if it tolerates the taint; if it cannot, it is not scheduled there, and for NoExecute taints a Pod already running on the node is evicted. A toleration period (tolerationSeconds) can also be defined.
III. Common predicate (filtering) policies
Scheduler predicate policies (a subset):
CheckNodeCondition:              # check whether the node itself is healthy (network, disk, etc.)
GeneralPredicates:
  HostName:                      # if pod.spec.hostname is set, check whether the node's hostname matches
  PodFitsHostPorts:              # host ports requested via pods.spec.containers.ports.hostPort must be free on the node
  MatchNodeSelector:             # check the node's labels against pods.spec.nodeSelector
  PodFitsResources:              # check whether the node can satisfy the Pod's resource requests
NoDiskConflict:                  # check whether the volumes the Pod depends on can be satisfied (not enabled by default)
PodToleratesNodeTaints:          # check whether the Pod's spec.tolerations cover all of the node's taints
PodToleratesNodeNoExecuteTaints: # same, but for NoExecute taints (not enabled by default)
CheckNodeLabelPresence:          # check whether the specified labels exist on the node
CheckServiceAffinity:            # try to place Pods belonging to the same Service together (not enabled by default)
MaxEBSVolumeCount:               # cap on AWS EBS volumes attached to the node
MaxGCEPDVolumeCount:             # cap on GCE PD volumes
MaxAzureDiskVolumeCount:         # cap on Azure Disk volumes
CheckVolumeBinding:              # check the node's bound and unbound PVCs
NoVolumeZoneConflict:            # check for zone conflicts between the Pod's volumes and the node
CheckNodeMemoryPressure:         # check whether the node is under memory pressure
CheckNodePIDPressure:            # check whether the node is under PID pressure
CheckNodeDiskPressure:           # check whether the node is under disk pressure
MatchInterPodAffinity:           # check whether the node satisfies the Pod's affinity/anti-affinity rules
IV. Common priority (scoring) functions
LeastRequested:                # more free capacity scores higher:
                               # (cpu((capacity-sum(requested))*10/capacity) + memory((capacity-sum(requested))*10/capacity)) / 2
BalancedResourceAllocation:    # nodes whose CPU and memory utilization rates are closest to each other win
NodePreferAvoidPods:           # driven by the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods"
TaintToleration:               # match the Pod's spec.tolerations against the node's taints; the more matching entries, the lower the score
SelectorSpreading:             # label-selector spreading: nodes already running more Pods matched by the same selectors as the current Pod score lower
InterPodAffinity:              # walk the Pod's affinity terms; the more terms satisfied, the higher the score
NodeAffinity:                  # node affinity
MostRequested:                 # the opposite of LeastRequested: less free capacity scores higher (not enabled by default)
NodeLabel:                     # whether the node carries a given label (not enabled by default)
ImageLocality:                 # based on the total size of the Pod's images already present on the node (not enabled by default)
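The LeastRequested formula above can be checked with a small worked example. The capacities and requests below are made up for illustration:

```python
def least_requested_score(cpu_capacity, cpu_requested, mem_capacity, mem_requested):
    """LeastRequested: average of the CPU and memory free ratios, scaled to 0-10.
    (cpu((capacity-sum(requested))*10/capacity)
     + memory((capacity-sum(requested))*10/capacity)) / 2"""
    cpu_score = (cpu_capacity - cpu_requested) * 10 / cpu_capacity
    mem_score = (mem_capacity - mem_requested) * 10 / mem_capacity
    return (cpu_score + mem_score) / 2

# Node with 4000m CPU (1000m requested) and 8Gi memory (2Gi requested):
# cpu: (4000-1000)*10/4000 = 7.5, memory: (8-2)*10/8 = 7.5, average 7.5
print(least_requested_score(4000, 1000, 8, 2))  # 7.5
```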
V. Advanced scheduling configuration
1. nodeSelector
# Show node labels
[root@k8s-m ~]# kubectl get nodes --show-labels
NAME    STATUS  ROLES   AGE   VERSION  LABELS
k8s-m   Ready   master  120d  v1.11.2  beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=k8s-m,node-role.kubernetes.io/master=
node1   Ready   <none>  120d  v1.11.2  app=myapp,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disk=ssd,disktype=ssd,kubernetes.io/hostname=node1,test_node=k8s-node1

# Use nodeSelector to target nodes labelled disk=ssd
[root@k8s-m schedule]# cat my-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: my-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  nodeSelector:
    disk: ssd

# Verify
[root@k8s-m schedule]# kubectl get pod -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE
nginx-pod   1/1     Running   0          49s   10.244.1.92   node1   <none>

# If no node carries every label listed in nodeSelector, the Pod stays Pending (the predicate phase fails)
2. affinity
2.1 nodeAffinity with preferredDuringSchedulingIgnoredDuringExecution (soft affinity: prefer the node that matches the most terms, but schedule the Pod even if no node matches)
# Example
[root@k8s-m schedule]# cat my-affinity-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: affinity-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: test_node1          # label key
            operator: In             # In: the label value must be one of the values below
            values:
            - k8s-node1              # candidate value for the test_node1 label
            - test1                  # candidate value for the test_node1 label
        weight: 60                   # weight of this nodeSelectorTerm, 1-100

# Verify (no node has this label, yet the Pod is still created and running)
[root@k8s-m schedule]# kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
affinity-pod   1/1     Running   0          16s
2.2 requiredDuringSchedulingIgnoredDuringExecution (hard affinity, similar to nodeSelector: a hard requirement; if no node satisfies the terms, the Pod is not scheduled and stays Pending)
[root@k8s-m schedule]# cat my-affinity-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: affinity-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: test_node1          # label key
            operator: In             # In: the label value must be one of the values below
            values:
            - k8s-node1              # candidate value for the test_node1 label
            - test1                  # candidate value for the test_node1 label

# Verify (no node has the test_node1 label, so the Pod stays Pending)
[root@k8s-m schedule]# kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
affinity-pod   0/1     Pending   0          4s
VI. Pod affinity and anti-affinity
1. podAffinity: place a Pod in the same location as another Pod ("same location" does not necessarily mean the same node; it depends on the topologyKey label you choose)
# Example: make affinity-pod run in the same place as my-pod1
[root@k8s-m schedule]# cat my-affinity-pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod1
  labels:
    app1: my-pod1
spec:
  containers:
  - name: my-pod1
    image: nginx
    ports:
    - name: http
      containerPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: affinity-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app1                # label key defined on the Pod above
            operator: In
            values:
            - my-pod1                # value of the app1 label
        topologyKey: kubernetes.io/hostname   # nodes with the same value for this label count as the same location

# "Co-located" means: running on a node whose value for the topologyKey label matches
# that of a node already running a Pod selected by the labelSelector.

# Verify: both Pods land on node1
[root@k8s-m schedule]# kubectl get pod -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE
affinity-pod   1/1     Running   0          54s   10.244.1.98   node1   <none>
my-pod1        1/1     Running   0          54s   10.244.1.97   node1   <none>
2. podAntiAffinity: keep a Pod away from the location where a given Pod runs (the opposite of the above)
[root@k8s-m schedule]# cat my-affinity-pod3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod1
  labels:
    app1: my-pod1
spec:
  containers:
  - name: my-pod1
    image: nginx
    ports:
    - name: http
      containerPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: affinity-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  affinity:
    podAntiAffinity:               # the only change from the previous example
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app1              # label key defined on the Pod above
            operator: In
            values:
            - my-pod1              # value of the app1 label
        topologyKey: kubernetes.io/hostname   # the Pods must NOT share a location (same value of this label)

# Verify (there is only one worker node, so affinity-pod stays Pending)
[root@k8s-m schedule]# kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
affinity-pod   0/1     Pending   0          1m
my-pod1        1/1     Running   0          1m
VII. Taint-based scheduling
A taint's effect defines how it repels Pods:
NoSchedule:        # affects scheduling only; Pods already running on the node are untouched
NoExecute:         # affects both scheduling and running Pods; running Pods that do not tolerate the taint are evicted
PreferNoSchedule:  # a soft NoSchedule: the scheduler avoids the node, but will still use it when no better place exists
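The taint/toleration matching described above can be sketched as follows. This is a simplified model (not the real PodToleratesNodeTaints code): operator Exists matches on key alone, Equal also requires the value, and an empty effect in a toleration is taken to match any effect.

```python
def tolerates(toleration, taint):
    """True if a single toleration covers a single taint (simplified model)."""
    if toleration["key"] != taint["key"]:
        return False
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration["operator"] == "Exists":
        return True  # key (and effect) match is enough
    return toleration.get("value") == taint["value"]  # operator Equal

def pod_fits_node_taints(tolerations, taints):
    """A Pod fits the node only if every taint is covered by some toleration."""
    return all(any(tolerates(tol, t) for tol in tolerations) for t in taints)

taints = [{"key": "node-type", "value": "PreferNoSchedule", "effect": "NoSchedule"}]
ok = [{"key": "node-type", "operator": "Equal",
       "value": "PreferNoSchedule", "effect": "NoSchedule"}]
print(pod_fits_node_taints(ok, taints))  # True: the toleration covers the taint
print(pod_fits_node_taints([], taints))  # False: the Pod would stay Pending
```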
1. Viewing and managing taints
# Show node taints
[root@k8s-m schedule]# kubectl describe node k8s-m | grep Taints
Taints:             node-role.kubernetes.io/master:NoSchedule
[root@k8s-m schedule]# kubectl describe node node1 | grep Taints
Taints:             <none>

# Taint management help
kubectl taint node -h

# Add a taint (works much like labelling a node)
kubectl taint node node1 node-type=PreferNoSchedule:NoSchedule

# Verify
[root@k8s-m schedule]# kubectl describe node node1 | grep Taints
Taints:             node-type=PreferNoSchedule:NoSchedule

# Remove the taint
[root@k8s-m ~]# kubectl taint node node1 node-type-
node/node1 untainted

# Verify
[root@k8s-m ~]# kubectl describe node node1 | grep Taints
Taints:             <none>
2. Tolerating taints
# Create a Pod without any tolerations
[root@k8s-m ~]# cat mypod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: my-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80

# The Pod stays Pending
[root@k8s-m ~]# kubectl get pod
NAME        READY   STATUS    RESTARTS   AGE
nginx-pod   0/1     Pending   0          32s

# It cannot tolerate the taint
[root@k8s-m ~]# kubectl describe pod nginx-pod | tail -1
  Warning  FailedScheduling  3s (x22 over 1m)  default-scheduler  0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.

# Add a toleration
[root@k8s-m ~]# cat mypod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  labels:
    app: my-pod
spec:
  containers:
  - name: my-pod
    image: nginx
    ports:
    - name: http
      containerPort: 80
  tolerations:                     # taints this Pod tolerates
  - key: "node-type"               # the taint key defined earlier
    operator: "Equal"              # Equal: key and value must match exactly; Exists: key alone is enough
    value: "PreferNoSchedule"      # the taint value
    effect: "NoSchedule"           # the taint effect
    # tolerationSeconds: 3600      # how long the Pod may keep running before eviction; only valid when effect is NoExecute

# Verify: the Pod has been scheduled
[root@k8s-m ~]# kubectl get pod -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE
nginx-pod   1/1     Running   0          3m    10.244.1.100   node1   <none>