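The Kubernetes control plane consists of apiserver, controller-manager, scheduler, and etcd. When building a highly available cluster, some of these components need leader election. etcd stores all of the cluster's state; it handles reads and writes and replicates data across multiple etcd members, so it has strict consistency requirements and uses the relatively complex Raft algorithm to elect the leader node that commits data. The apiserver is the entry point to the cluster and is itself a stateless web server, so requests can simply be load-balanced across multiple apiserver instances and no leader election is needed.

controller-manager and scheduler are task-style components. The controllers built into controller-manager watch the apiserver for the latest change events on resource objects and reconcile the actual state toward the desired state, while the scheduler watches for pods that are not yet bound to a node and picks a node for them. Running several copies of these tasks at the same time is clearly unnecessary, so controller-manager and scheduler also need leader election. Their election logic differs from etcd's, though: it only has to ensure that one instance out of the controller-manager replicas, and one out of the scheduler replicas, enters the working state, without worrying about data consistency or synchronization between them.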
Leader election flags in kube-scheduler
/ # kube-scheduler -h 2>&1 | grep -i leader
--leader-elect    Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability. (default true)
--leader-elect-lease-duration duration    The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 15s)
--leader-elect-renew-deadline duration    The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. (default 10s)
--leader-elect-resource-lock endpoints    The type of resource object that is used for locking during leader election. Supported options are endpoints (default) and `configmaps`. (default "endpoints")
--leader-elect-retry-period duration    The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 2s)
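The three duration flags only make sense in a certain order relative to one another. The following is a minimal sketch of that relationship, not code from Kubernetes: the `electionTimings` type and its methods are hypothetical, the first check comes from the help text above ("must be less than or equal to the lease duration"), and the second check (retry period shorter than renew deadline) is an assumption made here so that at least one retry fits into a renew window.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// electionTimings is a hypothetical holder for the three duration flags.
type electionTimings struct {
	LeaseDuration time.Duration // --leader-elect-lease-duration
	RenewDeadline time.Duration // --leader-elect-renew-deadline
	RetryPeriod   time.Duration // --leader-elect-retry-period
}

// defaultTimings returns the defaults printed by the help output above.
func defaultTimings() electionTimings {
	return electionTimings{
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
	}
}

// validate checks the ordering implied by the flag descriptions.
func (t electionTimings) validate() error {
	if t.RenewDeadline > t.LeaseDuration {
		return errors.New("renew deadline must not exceed lease duration")
	}
	// Assumption of this sketch: a retry must fit inside the renew deadline.
	if t.RetryPeriod >= t.RenewDeadline {
		return errors.New("retry period should be shorter than renew deadline")
	}
	return nil
}

// maxRenewAttempts estimates how many retries fit into one renew deadline.
func (t electionTimings) maxRenewAttempts() int {
	return int(t.RenewDeadline / t.RetryPeriod)
}

func main() {
	d := defaultTimings()
	fmt.Println("valid:", d.validate() == nil)
	fmt.Println("renew attempts per deadline:", d.maxRenewAttempts())
}
```

With the defaults, a leader gets roughly five renewal attempts before it gives up, and a candidate waits a full 15s lease before trying to take over.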
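The analysis below is based on the Kubernetes 1.11 source code, with Endpoints as the lock resource.
1. On startup the scheduler first runs leader election, then calls back into the scheduler's run function to enter the scheduling logic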
// https://sourcegraph.com/github.com/kubernetes/kubernetes@release-1.11/-/blob/cmd/kube-scheduler/app/server.go
func Run(c schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}) error {
    ......
    // Prepare a reusable run function.
    run := func(stopCh <-chan struct{}) {
        sched.Run()
        <-stopCh
    }

    // If leader election is enabled, run via LeaderElector until done and exit.
    if c.LeaderElection != nil {
        c.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
            OnStartedLeading: run,
            OnStoppedLeading: func() {
                utilruntime.HandleError(fmt.Errorf("lost master"))
            },
        }
        // Handling of err and the return statements are omitted from this excerpt.
        leaderElector, err := leaderelection.NewLeaderElector(*c.LeaderElection)
        leaderElector.Run()
    }
}
2. Run calls the acquire method directly to campaign for leadership
// Run starts the leader election loop
func (le *LeaderElector) Run() {
    defer func() {
        runtime.HandleCrash()
        le.config.Callbacks.OnStoppedLeading()
    }()
    le.acquire()
    stop := make(chan struct{})
    go le.config.Callbacks.OnStartedLeading(stop)
    le.renew()
    close(stop)
}
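To make the ordering in Run easier to follow, here is a small self-contained sketch of the same control flow that runs without a cluster. Everything in it is hypothetical: `toyElector` stands in for LeaderElector, and `acquire` and `renew` are injected stubs rather than the real implementations. Unlike the scheduler's OnStoppedLeading, which only reports an error, the stub here waits for the worker to exit, purely so that the event order is deterministic.

```go
package main

import "fmt"

// callbacks mirrors the two hooks used in the excerpt above.
type callbacks struct {
	OnStartedLeading func(stop <-chan struct{})
	OnStoppedLeading func()
}

// toyElector is a hypothetical stand-in for LeaderElector.
type toyElector struct {
	acquire func() // blocks until leadership is won
	renew   func() // blocks until leadership is lost
	cb      callbacks
}

// Run follows the same order as the excerpt: acquire, start work, renew, stop.
func (e *toyElector) Run() {
	defer e.cb.OnStoppedLeading()
	e.acquire()
	stop := make(chan struct{})
	go e.cb.OnStartedLeading(stop)
	e.renew()
	close(stop)
}

// simulate runs one leadership term and returns the events in order.
func simulate() []string {
	events := make(chan string, 8)
	started := make(chan struct{})
	workDone := make(chan struct{})
	e := &toyElector{
		acquire: func() { events <- "acquired" },
		renew: func() {
			<-started // wait until the main loop runs, then pretend renewal failed
			events <- "renew failed"
		},
		cb: callbacks{
			OnStartedLeading: func(stop <-chan struct{}) {
				events <- "leading"
				close(started)
				<-stop // the main loop would run here until told to stop
				events <- "work stopped"
				close(workDone)
			},
			OnStoppedLeading: func() {
				<-workDone // sketch-only: wait so the order is deterministic
				events <- "stopped leading"
			},
		},
	}
	e.Run()
	close(events)
	var out []string
	for ev := range events {
		out = append(out, ev)
	}
	return out
}

func main() {
	for _, ev := range simulate() {
		fmt.Println(ev)
	}
}
```

The point to take away is that the main loop runs in its own goroutine only between a successful acquire and the moment renew returns; closing the stop channel is what tells it to exit.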
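3. The acquire method calls tryAcquireOrRenew in a loop, at the interval given by leader-elect-retry-period. Here le.config.Lock is of type EndpointsLock; EndpointsLock.Identity() returns this instance's identity (its hostname plus a unique suffix), and EndpointsLock.Get asks the apiserver for the election record stored in etcd.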
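If fetching the Endpoints election record from the apiserver fails, the instance tries to become leader itself (in the upstream code this path applies when the record does not exist yet, in which case the record is created; other errors make the attempt fail)
Judged by the time at which it last observed a change, if the lease (15s by default) has not expired and the instance is not the leader, it must not preempt the current leader, so there is nothing more for it to do in this round
If the instance is already the leader, it tries to renew using the current time regardless of whether the lease has expired, keeping acquireTime and the leader transition count unchanged; otherwise the transition count is incremented by 1
It sends a request to the apiserver to update the Endpoints election record. The apiserver guarantees that updates from multiple clients are atomic by comparing resourceVersion (which corresponds to the modification index of the key in etcd), so only one client's write succeeds and the others receive 409 Conflict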
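The retry interval and the lease check described above can be tried out with a simulated clock. The sketch below is an assumption-laden simplification, not the client-go code: `record`, `next`, `scenario`, and `firstTakeover` are hypothetical names, the record drops fields that do not matter for the decision, and the lease is fixed at the default 15s. The scenario assumes leader "a" was last observed at time zero and never renews afterwards.

```go
package main

import (
	"fmt"
	"time"
)

const lease = 15 * time.Second // default --leader-elect-lease-duration

// record is a simplified, hypothetical version of the election record.
type record struct {
	Holder      string
	AcquireTime time.Time
	RenewTime   time.Time
	Transitions int
}

// next returns the record a candidate would try to write, or false if the
// lease is still held by someone else and the candidate has to back off.
func next(old record, observedAt, now time.Time, self string) (record, bool) {
	if observedAt.Add(lease).After(now) && old.Holder != self {
		return record{}, false
	}
	rec := record{Holder: self, AcquireTime: now, RenewTime: now}
	if old.Holder == self {
		// Renewal: keep the original acquire time and transition count.
		rec.AcquireTime = old.AcquireTime
		rec.Transitions = old.Transitions
	} else {
		// Takeover: leadership changes hands.
		rec.Transitions = old.Transitions + 1
	}
	return rec, true
}

// scenario: "a" acquired and was observed at t0 with 2 transitions so far;
// the candidate self makes one attempt elapsedSeconds later.
func scenario(self string, elapsedSeconds int) (record, bool) {
	t0 := time.Unix(0, 0)
	old := record{Holder: "a", AcquireTime: t0, RenewTime: t0, Transitions: 2}
	now := t0.Add(time.Duration(elapsedSeconds) * time.Second)
	return next(old, t0, now, self)
}

// firstTakeover polls every retrySeconds of simulated time (must be > 0) and
// returns the elapsed seconds at which "b" is first allowed to take over.
func firstTakeover(retrySeconds int) int {
	for elapsed := 0; ; elapsed += retrySeconds {
		if _, ok := scenario("b", elapsed); ok {
			return elapsed
		}
	}
}

func main() {
	_, ok := scenario("b", 5)
	fmt.Println("b may write after 5s:", ok)
	rec, _ := scenario("a", 5)
	fmt.Println("a renews, transitions stay at:", rec.Transitions)
	fmt.Println("b first takes over after seconds:", firstTakeover(2))
}
```

With a 2s retry period, candidate "b" backs off on every attempt up to 14s and is first allowed to write at 16s, which is why a stopped leader is replaced slightly later than the nominal lease duration.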
Lock is initialized as an EndpointsLock:

type EndpointsLock struct {
    // EndpointsMeta should contain a Name and a Namespace of an
    // Endpoints object that the LeaderElector will attempt to lead.
    EndpointsMeta metav1.ObjectMeta
    Client        corev1client.EndpointsGetter
    LockConfig    ResourceLockConfig
    e             *v1.Endpoints
}

// Get returns the election record from a Endpoints Annotation
func (el *EndpointsLock) Get() (*LeaderElectionRecord, error) {
    var record LeaderElectionRecord
    var err error
    el.e, err = el.Client.Endpoints(el.EndpointsMeta.Namespace).Get(el.EndpointsMeta.Name, metav1.GetOptions{})
    if err != nil {
        return nil, err
    }
    if recordBytes, found := el.e.Annotations[LeaderElectionRecordAnnotationKey]; found {
        if err := json.Unmarshal([]byte(recordBytes), &record); err != nil {
            return nil, err
        }
    }
    return &record, nil
}

// If this instance is not the leader, try to become leader; if it already is
// the leader, try to renew the lease.
// tryAcquireOrRenew tries to acquire a leader lease if it is not already acquired,
// else it tries to renew the lease if it has already been acquired. Returns true
// on success else returns false.
func (le *LeaderElector) tryAcquireOrRenew() bool {
    now := metav1.Now()
    // Identity() returns this instance's hostname + "_" + string(uuid.NewUUID()).
    // Prepare a leaderElectionRecord naming this instance as leader, to be
    // used if the acquire succeeds.
    leaderElectionRecord := rl.LeaderElectionRecord{
        HolderIdentity:       le.config.Lock.Identity(),
        LeaseDurationSeconds: int(le.config.LeaseDuration / time.Second),
        RenewTime:            now,
        AcquireTime:          now,
    }

    // 1. obtain or create the ElectionRecord
    oldLeaderElectionRecord, err := le.config.Lock.Get()
    // If fetching the Endpoints object from the apiserver fails, try to become
    // leader. (Condensed in this excerpt: upstream creates the record via
    // Lock.Create when it is not found and returns false on other errors.)
    if err != nil {
        le.observedRecord = leaderElectionRecord
        le.observedTime = le.clock.Now()
        return true
    }

    // 2. Record obtained, check the Identity & Time
    // The record in the apiserver differs from the locally observed one:
    // update the local copy and the observation time.
    if !reflect.DeepEqual(le.observedRecord, *oldLeaderElectionRecord) {
        le.observedRecord = *oldLeaderElectionRecord
        le.observedTime = le.clock.Now()
    }
    // Judged by the local observation time, if the lease (15s) has not expired
    // and this instance is not the leader, there is nothing more to do.
    if le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
        oldLeaderElectionRecord.HolderIdentity != le.config.Lock.Identity() {
        return false
    }

    // 3. We're going to try to update. The leaderElectionRecord is set to it's default
    // here. Let's correct it before updating.
    // Reaching this point means one of:
    //   1) this instance is not the leader, but the lease has expired
    //   2) this instance is the leader and the lease has not expired
    //   3) this instance is the leader and the lease has expired
    // In cases 2 and 3 the instance renews with the current time regardless of
    // lease expiry, keeping acquireTime and the transition count; in case 1
    // the transition count is incremented by 1.
    if oldLeaderElectionRecord.HolderIdentity == le.config.Lock.Identity() {
        leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
        leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
    } else {
        leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
    }

    // update the lock itself
    // Send the Endpoints update to the apiserver, which guarantees atomic
    // updates across clients: its resourceVersion check lets only one client's
    // write succeed.
    if err = le.config.Lock.Update(leaderElectionRecord); err != nil {
        glog.Errorf("Failed to update lock: %v", err)
        return false
    }
    le.observedRecord = leaderElectionRecord
    le.observedTime = le.clock.Now()
    return true
}
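The final Update call relies on the apiserver rejecting writes that are based on a stale resourceVersion. The following sketch imitates that behaviour with an in-memory store so the race can be reproduced locally. It is an illustration under stated assumptions, not the apiserver's implementation: `store`, `lockObject`, `race`, and `errConflict` are hypothetical, and the integer version counter merely stands in for the opaque resourceVersion string.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConflict plays the role of the HTTP 409 returned by the apiserver.
var errConflict = errors.New("409 conflict: resourceVersion mismatch")

// lockObject is a hypothetical stand-in for the Endpoints object that
// carries the election record.
type lockObject struct {
	ResourceVersion int
	Holder          string
}

// store is an in-memory substitute for the apiserver plus etcd.
type store struct {
	mu  sync.Mutex
	obj lockObject
}

// Get returns a copy of the current object, including its version.
func (s *store) Get() lockObject {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

// Update applies the write only if it is based on the current version.
func (s *store) Update(updated lockObject) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if updated.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	updated.ResourceVersion++ // every successful write bumps the version
	s.obj = updated
	return nil
}

// race lets two candidates read the same version and then write in turn.
func race(s *store, first, second string) (errFirst, errSecond error) {
	a, b := s.Get(), s.Get()
	a.Holder, b.Holder = first, second
	errFirst = s.Update(a)
	errSecond = s.Update(b)
	return errFirst, errSecond
}

func main() {
	s := &store{}
	errA, errB := race(s, "scheduler-a", "scheduler-b")
	fmt.Println("scheduler-a:", errA)
	fmt.Println("scheduler-b:", errB)
	fmt.Println("holder:", s.Get().Holder)
}
```

Because both candidates start from the same version, the second write is rejected and the loser simply tries again on its next retry period, by which time it will observe a fresh record and back off.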