kube-scheduler Source Code Analysis (1): Initialization and Startup

Reposted article, 2022-02-20 10:47. Category: kubernetes source code analysis / kube-scheduler


Introduction to kube-scheduler

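kube-scheduler is one of the core components of Kubernetes. It is responsible for scheduling pod objects: using its scheduling algorithms (predicates, which filter nodes, and priorities, which score them), it assigns each unscheduled pod to the most suitable node.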

kube-scheduler architecture diagram

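The overall composition and processing flow of kube-scheduler are as follows. kube-scheduler list/watches pod, node, and other objects. Through informers, it places unscheduled pods into the pending-pod queue and builds the scheduler cache (used for fast lookup of nodes and other objects). The sched.scheduleOne method holds the core pod-scheduling logic: it takes one pod from the unscheduled-pod queue, runs the predicate and priority algorithms, and selects the best node. It then updates the cache and asynchronously performs the bind operation, that is, it sets the pod's nodeName field. At that point the scheduling of the pod is complete.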
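The flow described above (take a pod from the queue, filter nodes, score the survivors, bind) can be sketched in a few lines of Go. This is a toy model under stated assumptions, not the scheduler's implementation: the `Pod` and `Node` types, the CPU-only fit check, and the "most free CPU" score are all hypothetical simplifications.

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the real API objects (hypothetical).
type Pod struct {
	Name     string
	CPU      int    // requested CPU, arbitrary units
	NodeName string // empty until the pod is bound
}

type Node struct {
	Name    string
	FreeCPU int
}

// filter plays the role of the predicates: keep only nodes that can fit the pod.
func filter(p Pod, nodes []Node) []Node {
	var fit []Node
	for _, n := range nodes {
		if n.FreeCPU >= p.CPU {
			fit = append(fit, n)
		}
	}
	return fit
}

// score plays the role of the priorities: more CPU left after placement is better.
func score(p Pod, n Node) int { return n.FreeCPU - p.CPU }

// scheduleOne picks the best node for one pod, or reports false if no node fits.
func scheduleOne(p Pod, nodes []Node) (string, bool) {
	best, bestScore, found := "", -1, false
	for _, n := range filter(p, nodes) {
		if s := score(p, n); s > bestScore {
			best, bestScore, found = n.Name, s, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{{"node-a", 2}, {"node-b", 8}}
	queue := []Pod{{Name: "web", CPU: 4}, {Name: "huge", CPU: 16}}
	for _, p := range queue {
		if name, ok := scheduleOne(p, nodes); ok {
			p.NodeName = name // "bind": record the chosen node on the pod
			fmt.Printf("%s -> %s\n", p.Name, p.NodeName)
		} else {
			fmt.Printf("%s is unschedulable\n", p.Name)
		}
	}
}
```

The real scheduler additionally updates its cache before binding and performs the bind asynchronously; the sketch leaves both out to keep the filter/score split visible.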

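The analysis of kube-scheduler is split into two parts:
(1) kube-scheduler initialization and startup;
(2) kube-scheduler core processing logic.

This article covers initialization and startup; the next one covers the core processing logic.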

1. kube-scheduler initialization and startup analysis

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

We start from the NewSchedulerCommand function, which serves as the entry point for the initialization and startup analysis.

NewSchedulerCommand

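Main logic of NewSchedulerCommand:
(1) initialize the component's default startup options;
(2) define the command's run method, which calls runCommand (runCommand eventually calls Run to start kube-scheduler; Run is analyzed below);
(3) parse the component's command-line flags.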

// cmd/kube-scheduler/app/server.go
func NewSchedulerCommand(registryOptions ...Option) *cobra.Command {
	// 1. Initialize the component's default startup options
	opts, err := options.NewOptions()
	if err != nil {
		klog.Fatalf("unable to initialize command options: %v", err)
	}
	
	// 2. Define the command's run method, which calls runCommand
	cmd := &cobra.Command{
		Use: "kube-scheduler",
		Long: `The Kubernetes scheduler is a policy-rich, topology-aware,
workload-specific function that significantly impacts availability, performance,
and capacity. The scheduler needs to take into account individual and collective
resource requirements, quality of service requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality, inter-workload
interference, deadlines, and so on. Workload-specific requirements will be exposed
through the API as necessary.`,
		Run: func(cmd *cobra.Command, args []string) {
			if err := runCommand(cmd, args, opts, registryOptions...); err != nil {
				fmt.Fprintf(os.Stderr, "%v\n", err)
				os.Exit(1)
			}
		},
	}
	
	// 3. Parse the component's command-line flags
	fs := cmd.Flags()
	namedFlagSets := opts.Flags()
	verflag.AddFlags(namedFlagSets.FlagSet("global"))
	globalflag.AddGlobalFlags(namedFlagSets.FlagSet("global"), cmd.Name())
	for _, f := range namedFlagSets.FlagSets {
		fs.AddFlagSet(f)
	}
	...
}

runCommand

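runCommand is the function that runs the kube-scheduler command. Two pieces of its logic matter here:
(1) it calls algorithmprovider.ApplyFeatureGates, which registers additional predicates and priorities depending on which feature gates are enabled;
(2) it calls Run to start kube-scheduler.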

// cmd/kube-scheduler/app/server.go
// runCommand runs the scheduler.
func runCommand(cmd *cobra.Command, args []string, opts *options.Options, registryOptions ...Option) error {
	...

	// Apply algorithms based on feature gates.
	// TODO: make configurable?
	algorithmprovider.ApplyFeatureGates()

	// Configz registration.
	if cz, err := configz.New("componentconfig"); err == nil {
		cz.Set(cc.ComponentConfig)
	} else {
		return fmt.Errorf("unable to register configz: %s", err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	return Run(ctx, cc, registryOptions...)
}

1.1 algorithmprovider.ApplyFeatureGates

Depending on which feature gates are enabled, this function registers additional predicates and priorities.

// pkg/scheduler/algorithmprovider/plugins.go
import (
	"k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults"
)

func ApplyFeatureGates() func() {
	return defaults.ApplyFeatureGates()
}

1.1.1 init

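plugins.go imports the defaults package, so before looking at defaults.ApplyFeatureGates we first look at the init functions of the defaults package. They register the built-in scheduling algorithms, both predicates and priorities.

(1) First, the init function in defaults.go of the defaults package.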

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func init() {
	registerAlgorithmProvider(defaultPredicates(), defaultPriorities())
}

Predicates:

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func defaultPredicates() sets.String {
	return sets.NewString(
		predicates.NoVolumeZoneConflictPred,
		predicates.MaxEBSVolumeCountPred,
		predicates.MaxGCEPDVolumeCountPred,
		predicates.MaxAzureDiskVolumeCountPred,
		predicates.MaxCSIVolumeCountPred,
		predicates.MatchInterPodAffinityPred,
		predicates.NoDiskConflictPred,
		predicates.GeneralPred,
		predicates.PodToleratesNodeTaintsPred,
		predicates.CheckVolumeBindingPred,
		predicates.CheckNodeUnschedulablePred,
	)
}

Priorities:

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func defaultPriorities() sets.String {
	return sets.NewString(
		priorities.SelectorSpreadPriority,
		priorities.InterPodAffinityPriority,
		priorities.LeastRequestedPriority,
		priorities.BalancedResourceAllocation,
		priorities.NodePreferAvoidPodsPriority,
		priorities.NodeAffinityPriority,
		priorities.TaintTolerationPriority,
		priorities.ImageLocalityPriority,
	)
}

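registerAlgorithmProvider registers the algorithm providers. An algorithm provider stores the list of scheduling algorithms it uses, both predicates and priorities (only the algorithm keys are stored, not the algorithms themselves).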

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func registerAlgorithmProvider(predSet, priSet sets.String) {
	// Registers algorithm providers. By default we use 'DefaultProvider', but user can specify one to be used
	// by specifying flag.
	scheduler.RegisterAlgorithmProvider(scheduler.DefaultProvider, predSet, priSet)
	// Cluster autoscaler friendly scheduling algorithm.
	scheduler.RegisterAlgorithmProvider(ClusterAutoscalerProvider, predSet,
		copyAndReplace(priSet, priorities.LeastRequestedPriority, priorities.MostRequestedPriority))
}
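The ClusterAutoscalerProvider above reuses the default priority set with one key swapped by copyAndReplace, whose body is not shown in this article. The following is a minimal sketch of the behavior its name and call site suggest, assuming it copies the set and replaces a key only when that key is present. The `stringSet` type is a hypothetical stand-in for `sets.String`, and the key strings are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// stringSet is a hypothetical, minimal stand-in for sets.String.
type stringSet map[string]struct{}

func newStringSet(keys ...string) stringSet {
	s := make(stringSet, len(keys))
	for _, k := range keys {
		s[k] = struct{}{}
	}
	return s
}

// copyAndReplace returns a copy of set in which replaceWhat, if present,
// is swapped for replaceWith. The input set is left untouched.
func copyAndReplace(set stringSet, replaceWhat, replaceWith string) stringSet {
	result := make(stringSet, len(set))
	for k := range set {
		result[k] = struct{}{}
	}
	if _, ok := result[replaceWhat]; ok {
		delete(result, replaceWhat)
		result[replaceWith] = struct{}{}
	}
	return result
}

// sortedKeys returns the set's keys in a stable order for printing.
func sortedKeys(s stringSet) []string {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	defaults := newStringSet("LeastRequestedPriority", "NodeAffinityPriority")
	autoscaler := copyAndReplace(defaults, "LeastRequestedPriority", "MostRequestedPriority")
	fmt.Println("default:   ", sortedKeys(defaults))
	fmt.Println("autoscaler:", sortedKeys(autoscaler))
}
```

Copying before replacing matters because both providers are registered from the same priSet; mutating it in place would change the default provider as well.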

The registered algorithm providers end up in the variable algorithmProviderMap (which stores the algorithm key lists of every provider). It is a package-level global variable.

// pkg/scheduler/algorithm_factory.go
// RegisterAlgorithmProvider registers a new algorithm provider with the algorithm registry.
func RegisterAlgorithmProvider(name string, predicateKeys, priorityKeys sets.String) string {
	schedulerFactoryMutex.Lock()
	defer schedulerFactoryMutex.Unlock()
	validateAlgorithmNameOrDie(name)
	algorithmProviderMap[name] = AlgorithmProviderConfig{
		FitPredicateKeys:     predicateKeys,
		PriorityFunctionKeys: priorityKeys,
	}
	return name
}

// pkg/scheduler/algorithm_factory.go
var (
	...
	algorithmProviderMap   = make(map[string]AlgorithmProviderConfig)
	...
)

(2) Next, the init function in register_predicates.go of the defaults package, which registers the predicates.

// pkg/scheduler/algorithmprovider/defaults/register_predicates.go
func init() {
	...
	// Fit is defined based on the absence of port conflicts.
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	scheduler.RegisterFitPredicate(predicates.PodFitsHostPortsPred, predicates.PodFitsHostPorts)
	// Fit is determined by resource availability.
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	scheduler.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
	...
}

(3) Finally, the init function in register_priorities.go of the defaults package, which registers the priorities.

// pkg/scheduler/algorithmprovider/defaults/register_priorities.go
func init() {
	...
	// Prioritize nodes by least requested utilization.
	scheduler.RegisterPriorityMapReduceFunction(priorities.LeastRequestedPriority, priorities.LeastRequestedPriorityMap, nil, 1)

	// Prioritizes nodes to help achieve balanced resource usage
	scheduler.RegisterPriorityMapReduceFunction(priorities.BalancedResourceAllocation, priorities.BalancedResourceAllocationMap, nil, 1)
	...
}

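Registering predicates and priorities ultimately means storing them in global variables: registered predicates go into fitPredicateMap, and registered priorities go into priorityFunctionMap.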

// pkg/scheduler/algorithm_factory.go
var (
	...
	fitPredicateMap        = make(map[string]FitPredicateFactory)
	...
	priorityFunctionMap    = make(map[string]PriorityConfigFactory)
	...
)
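The split between the provider map (keys only) and the per-algorithm maps (implementations) can be illustrated with a small registry. This is a sketch under assumptions, not the code in algorithm_factory.go: the `fitPredicate` signature, the helper names, and the "FitsCPU" key are hypothetical, and locking is omitted for brevity.

```go
package main

import "fmt"

// fitPredicate is a simplified predicate: does a pod needing podCPU fit on a
// node with nodeFreeCPU available? (hypothetical signature)
type fitPredicate func(podCPU, nodeFreeCPU int) bool

var (
	// predicateFuncs maps an algorithm key to its implementation (compare fitPredicateMap).
	predicateFuncs = map[string]fitPredicate{}
	// providers maps a provider name to the keys it enables (compare algorithmProviderMap).
	providers = map[string][]string{}
)

func registerPredicate(key string, fn fitPredicate) string {
	predicateFuncs[key] = fn
	return key
}

func registerProvider(name string, keys ...string) {
	providers[name] = keys
}

// predicatesFor resolves a provider's keys into the registered implementations.
func predicatesFor(provider string) ([]fitPredicate, error) {
	keys, ok := providers[provider]
	if !ok {
		return nil, fmt.Errorf("provider %q is not registered", provider)
	}
	fns := make([]fitPredicate, 0, len(keys))
	for _, k := range keys {
		fn, ok := predicateFuncs[k]
		if !ok {
			return nil, fmt.Errorf("predicate %q is not registered", k)
		}
		fns = append(fns, fn)
	}
	return fns, nil
}

// init runs when the package is loaded, mirroring how the defaults package
// registers the built-in algorithms as a side effect of being imported.
func init() {
	registerPredicate("FitsCPU", func(podCPU, nodeFreeCPU int) bool { return nodeFreeCPU >= podCPU })
	registerProvider("DefaultProvider", "FitsCPU")
}

func main() {
	fns, err := predicatesFor("DefaultProvider")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("predicates:", len(fns), "fits:", fns[0](2, 4))
}
```

Storing only keys in the provider lets several providers share one implementation and lets later steps add or swap keys without touching the functions themselves.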

1.1.2 defaults.ApplyFeatureGates

This function checks whether specific feature gates are enabled and, if so, registers the corresponding additional predicates and priorities.

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func ApplyFeatureGates() (restore func()) {
	...

	// Only register EvenPodsSpread predicate & priority if the feature is enabled
	if utilfeature.DefaultFeatureGate.Enabled(features.EvenPodsSpread) {
		klog.Infof("Registering EvenPodsSpread predicate and priority function")
		// register predicate
		scheduler.InsertPredicateKeyToAlgorithmProviderMap(predicates.EvenPodsSpreadPred)
		scheduler.RegisterFitPredicate(predicates.EvenPodsSpreadPred, predicates.EvenPodsSpreadPredicate)
		// register priority
		scheduler.InsertPriorityKeyToAlgorithmProviderMap(priorities.EvenPodsSpreadPriority)
		scheduler.RegisterPriorityMapReduceFunction(
			priorities.EvenPodsSpreadPriority,
			priorities.CalculateEvenPodsSpreadPriorityMap,
			priorities.CalculateEvenPodsSpreadPriorityReduce,
			1,
		)
	}

	// Prioritizes nodes that satisfy pod's resource limits
	if utilfeature.DefaultFeatureGate.Enabled(features.ResourceLimitsPriorityFunction) {
		klog.Infof("Registering resourcelimits priority function")
		scheduler.RegisterPriorityMapReduceFunction(priorities.ResourceLimitsPriority, priorities.ResourceLimitsPriorityMap, nil, 1)
		// Register the priority function to specific provider too.
		scheduler.InsertPriorityKeyToAlgorithmProviderMap(scheduler.RegisterPriorityMapReduceFunction(priorities.ResourceLimitsPriority, priorities.ResourceLimitsPriorityMap, nil, 1))
	}

	...
}
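ApplyFeatureGates is declared as returning a `restore func()`, but the elided parts of the excerpt do not show how that function is built. The sketch below assumes the common pattern: snapshot the registry, apply the gate-dependent changes, and return a closure that puts the snapshot back. The maps and key names are hypothetical.

```go
package main

import "fmt"

var (
	// featureGates is a hypothetical stand-in for the real feature gate lookup.
	featureGates = map[string]bool{"EvenPodsSpread": false}
	// providerKeys is a hypothetical stand-in for the registered algorithm keys.
	providerKeys = map[string]bool{"GeneralPredicates": true}
)

// applyFeatureGates adds gate-dependent keys and returns a function that
// reverts the registry to its state before the call.
func applyFeatureGates() (restore func()) {
	snapshot := make(map[string]bool, len(providerKeys))
	for k, v := range providerKeys {
		snapshot[k] = v
	}
	if featureGates["EvenPodsSpread"] {
		providerKeys["EvenPodsSpread"] = true
	}
	return func() { providerKeys = snapshot }
}

func main() {
	featureGates["EvenPodsSpread"] = true
	restore := applyFeatureGates()
	fmt.Println("after apply:  ", providerKeys["EvenPodsSpread"])
	restore()
	fmt.Println("after restore:", providerKeys["EvenPodsSpread"])
}
```

A restore function of this kind is mostly useful in tests, where global registries must be reset between cases.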

1.2 Run

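Run starts kube-scheduler according to the completed configuration. Its core logic is:
(1) prepare the event client used to report the events produced by kube-scheduler to the API server;
(2) call scheduler.New to instantiate the scheduler object;
(3) start the event broadcaster;
(4) set up the health checks for kube-scheduler and start the HTTP servers for healthz and metrics;
(5) start the informers of all previously registered objects so that they begin syncing resources;
(6) call WaitForCacheSync to wait until all informers have finished syncing, so that the local cache is consistent with the data in etcd;
(7) decide, based on the startup options, whether to enable leader election;
(8) call sched.Run to start the scheduler (sched.Run is the entry point for the core processing logic analysis in the next article).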

// cmd/kube-scheduler/app/server.go
func Run(ctx context.Context, cc schedulerserverconfig.CompletedConfig, outOfTreeRegistryOptions ...Option) error {
	// To help debugging, immediately log version
	klog.V(1).Infof("Starting Kubernetes Scheduler version %+v", version.Get())

	outOfTreeRegistry := make(framework.Registry)
	for _, option := range outOfTreeRegistryOptions {
		if err := option(outOfTreeRegistry); err != nil {
			return err
		}
	}
    
	// 1. Prepare the event client used to report kube-scheduler events to the API server
	// Prepare event clients.
	if _, err := cc.Client.Discovery().ServerResourcesForGroupVersion(eventsv1beta1.SchemeGroupVersion.String()); err == nil {
		cc.Broadcaster = events.NewBroadcaster(&events.EventSinkImpl{Interface: cc.EventClient.Events("")})
		cc.Recorder = cc.Broadcaster.NewRecorder(scheme.Scheme, cc.ComponentConfig.SchedulerName)
	} else {
		recorder := cc.CoreBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: cc.ComponentConfig.SchedulerName})
		cc.Recorder = record.NewEventRecorderAdapter(recorder)
	}
    
	// 2. Call scheduler.New to instantiate the scheduler object
	// Create the scheduler.
	sched, err := scheduler.New(cc.Client,
		cc.InformerFactory,
		cc.PodInformer,
		cc.Recorder,
		ctx.Done(),
		scheduler.WithName(cc.ComponentConfig.SchedulerName),
		scheduler.WithAlgorithmSource(cc.ComponentConfig.AlgorithmSource),
		scheduler.WithHardPodAffinitySymmetricWeight(cc.ComponentConfig.HardPodAffinitySymmetricWeight),
		scheduler.WithPreemptionDisabled(cc.ComponentConfig.DisablePreemption),
		scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),
		scheduler.WithBindTimeoutSeconds(cc.ComponentConfig.BindTimeoutSeconds),
		scheduler.WithFrameworkOutOfTreeRegistry(outOfTreeRegistry),
		scheduler.WithFrameworkPlugins(cc.ComponentConfig.Plugins),
		scheduler.WithFrameworkPluginConfig(cc.ComponentConfig.PluginConfig),
		scheduler.WithPodMaxBackoffSeconds(cc.ComponentConfig.PodMaxBackoffSeconds),
		scheduler.WithPodInitialBackoffSeconds(cc.ComponentConfig.PodInitialBackoffSeconds),
	)
	if err != nil {
		return err
	}
    
	// 3. Start the event broadcaster
	// Prepare the event broadcaster.
	if cc.Broadcaster != nil && cc.EventClient != nil {
		cc.Broadcaster.StartRecordingToSink(ctx.Done())
	}
	if cc.CoreBroadcaster != nil && cc.CoreEventClient != nil {
		cc.CoreBroadcaster.StartRecordingToSink(&corev1.EventSinkImpl{Interface: cc.CoreEventClient.Events("")})
	}
	
	// 4. Set up health checks and start the HTTP servers for healthz and metrics
	// Setup healthz checks.
	var checks []healthz.HealthChecker
	if cc.ComponentConfig.LeaderElection.LeaderElect {
		checks = append(checks, cc.LeaderElection.WatchDog)
	}

	// Start up the healthz server.
	if cc.InsecureServing != nil {
		separateMetrics := cc.InsecureMetricsServing != nil
		handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, separateMetrics, checks...), nil, nil)
		if err := cc.InsecureServing.Serve(handler, 0, ctx.Done()); err != nil {
			return fmt.Errorf("failed to start healthz server: %v", err)
		}
	}
	if cc.InsecureMetricsServing != nil {
		handler := buildHandlerChain(newMetricsHandler(&cc.ComponentConfig), nil, nil)
		if err := cc.InsecureMetricsServing.Serve(handler, 0, ctx.Done()); err != nil {
			return fmt.Errorf("failed to start metrics server: %v", err)
		}
	}
	if cc.SecureServing != nil {
		handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, false, checks...), cc.Authentication.Authenticator, cc.Authorization.Authorizer)
		// TODO: handle stoppedCh returned by c.SecureServing.Serve
		if _, err := cc.SecureServing.Serve(handler, 0, ctx.Done()); err != nil {
			// fail early for secure handlers, removing the old error loop from above
			return fmt.Errorf("failed to start secure server: %v", err)
		}
	}
    
	// 5. Start the informers of all previously registered objects and begin syncing resources
	// Start all informers.
	go cc.PodInformer.Informer().Run(ctx.Done())
	cc.InformerFactory.Start(ctx.Done())
    
	// 6. Wait until all informers have synced, so the local cache is consistent with the data in etcd
	// Wait for all caches to sync before scheduling.
	cc.InformerFactory.WaitForCacheSync(ctx.Done())
    
	// 7. Decide, based on the startup options, whether to enable leader election
	// If leader election is enabled, runCommand via LeaderElector until done and exit.
	if cc.LeaderElection != nil {
		cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
			OnStartedLeading: sched.Run,
			OnStoppedLeading: func() {
				klog.Fatalf("leaderelection lost")
			},
		}
		leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
		if err != nil {
			return fmt.Errorf("couldn't create leader elector: %v", err)
		}

		leaderElector.Run(ctx)

		return fmt.Errorf("lost lease")
	}
    
	// 8. Call sched.Run to start the scheduler
	// Leader election is disabled, so runCommand inline until done.
	sched.Run(ctx)
	return fmt.Errorf("finished without leader elect")
}

1.2.1 scheduler.New

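Instantiating the scheduler object consists of three parts:
(1) instantiate the informers for pod, node, PVC, PV, and other objects;
(2) call configurator.CreateFromProvider or configurator.CreateFromConfig to instantiate the scheduler from the built-in algorithms registered earlier (or from a user-supplied scheduling policy);
(3) register event handlers on the informers.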

// pkg/scheduler/scheduler.go
func New(client clientset.Interface,
	informerFactory informers.SharedInformerFactory,
	podInformer coreinformers.PodInformer,
	recorder events.EventRecorder,
	stopCh <-chan struct{},
	opts ...Option) (*Scheduler, error) {

	stopEverything := stopCh
	if stopEverything == nil {
		stopEverything = wait.NeverStop
	}

	options := defaultSchedulerOptions
	for _, opt := range opts {
		opt(&options)
	}
    
	// 1. Instantiate the informers for node, PVC, PV, and other objects
	schedulerCache := internalcache.New(30*time.Second, stopEverything)
	volumeBinder := volumebinder.NewVolumeBinder(
		client,
		informerFactory.Core().V1().Nodes(),
		informerFactory.Storage().V1().CSINodes(),
		informerFactory.Core().V1().PersistentVolumeClaims(),
		informerFactory.Core().V1().PersistentVolumes(),
		informerFactory.Storage().V1().StorageClasses(),
		time.Duration(options.bindTimeoutSeconds)*time.Second,
	)

	registry := options.frameworkDefaultRegistry
	if registry == nil {
		registry = frameworkplugins.NewDefaultRegistry(&frameworkplugins.RegistryArgs{
			VolumeBinder: volumeBinder,
		})
	}
	registry.Merge(options.frameworkOutOfTreeRegistry)

	snapshot := nodeinfosnapshot.NewEmptySnapshot()

	configurator := &Configurator{
		client:                         client,
		informerFactory:                informerFactory,
		podInformer:                    podInformer,
		volumeBinder:                   volumeBinder,
		schedulerCache:                 schedulerCache,
		StopEverything:                 stopEverything,
		hardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,
		disablePreemption:              options.disablePreemption,
		percentageOfNodesToScore:       options.percentageOfNodesToScore,
		bindTimeoutSeconds:             options.bindTimeoutSeconds,
		podInitialBackoffSeconds:       options.podInitialBackoffSeconds,
		podMaxBackoffSeconds:           options.podMaxBackoffSeconds,
		enableNonPreempting:            utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NonPreemptingPriority),
		registry:                       registry,
		plugins:                        options.frameworkPlugins,
		pluginConfig:                   options.frameworkPluginConfig,
		pluginConfigProducerRegistry:   options.frameworkConfigProducerRegistry,
		nodeInfoSnapshot:               snapshot,
		algorithmFactoryArgs: AlgorithmFactoryArgs{
			SharedLister:                   snapshot,
			InformerFactory:                informerFactory,
			VolumeBinder:                   volumeBinder,
			HardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,
		},
		configProducerArgs: &frameworkplugins.ConfigProducerArgs{},
	}

	metrics.Register()
    
	// 2. Instantiate the scheduler from a registered algorithm provider or from a user-supplied policy
	var sched *Scheduler
	source := options.schedulerAlgorithmSource
	switch {
	case source.Provider != nil:
		// Create the config from a named algorithm provider.
		sc, err := configurator.CreateFromProvider(*source.Provider)
		if err != nil {
			return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
		}
		sched = sc
	case source.Policy != nil:
		// Create the config from a user specified policy source.
		policy := &schedulerapi.Policy{}
		switch {
		case source.Policy.File != nil:
			if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
				return nil, err
			}
		case source.Policy.ConfigMap != nil:
			if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
				return nil, err
			}
		}
		sc, err := configurator.CreateFromConfig(*policy)
		if err != nil {
			return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
		}
		sched = sc
	default:
		return nil, fmt.Errorf("unsupported algorithm source: %v", source)
	}
	// Additional tweaks to the config produced by the configurator.
	sched.Recorder = recorder
	sched.DisablePreemption = options.disablePreemption
	sched.StopEverything = stopEverything
	sched.podConditionUpdater = &podConditionUpdaterImpl{client}
	sched.podPreemptor = &podPreemptorImpl{client}
	sched.scheduledPodsHasSynced = podInformer.Informer().HasSynced
    
	// 3. Register event handlers on the informers
	AddAllEventHandlers(sched, options.schedulerName, informerFactory, podInformer)
	return sched, nil
}
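scheduler.New receives its tunables as `opts ...Option` and applies each one to a copy of defaultSchedulerOptions, which is the functional options pattern. A minimal self-contained sketch of that pattern follows; the struct fields, option constructors, and default values are illustrative assumptions, not the scheduler's actual definitions.

```go
package main

import "fmt"

// schedulerOptions holds the tunables (hypothetical subset).
type schedulerOptions struct {
	name               string
	disablePreemption  bool
	bindTimeoutSeconds int64
}

// Option mutates the options struct; callers compose as many as they need.
type Option func(*schedulerOptions)

func WithName(name string) Option {
	return func(o *schedulerOptions) { o.name = name }
}

func WithPreemptionDisabled(disabled bool) Option {
	return func(o *schedulerOptions) { o.disablePreemption = disabled }
}

// defaultOptions uses illustrative values, not the real defaults.
var defaultOptions = schedulerOptions{
	name:               "default-scheduler",
	bindTimeoutSeconds: 600,
}

// buildOptions starts from a copy of the defaults and applies each option in order.
func buildOptions(opts ...Option) schedulerOptions {
	options := defaultOptions // struct copy, so the defaults stay untouched
	for _, opt := range opts {
		opt(&options)
	}
	return options
}

func main() {
	o := buildOptions(WithName("custom-scheduler"), WithPreemptionDisabled(true))
	fmt.Printf("%+v\n", o)
}
```

The pattern lets New keep a short fixed parameter list while new settings are added as extra option constructors, without breaking existing callers.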

Summary

Introduction to kube-scheduler

kube-scheduler architecture diagram

Flow chart of kube-scheduler initialization and startup
