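CPU core counts have grown from single-core to dual-core, then to 4, 8, and even 10 cores. On Android devices, multi-core SoCs are generally split into big and little cores, and the newest designs add a prime ("super big") core on top of those.
Big and little cores are distinguished because they differ in performance (compute capacity) and in power consumption. They are also grouped into clusters (little cores in one cluster, big cores in another), and at present all CPUs in the same cluster have their frequency scaled together.
Task scheduling therefore has to treat the cores differently in order to balance performance against power.
To reflect the CPU topology, the kernel builds scheduling domains and scheduling groups. Taking an 8-core CPU as an example: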
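- At the DIE level: cpu 0-7
- At the MC level: cpu 0-3 form one group and cpu 4-7 form another
- With SMT (simultaneous multithreading) there would be one more level below MC: 01, 23, 45, 67. This can be ignored for now because the ARM platform discussed here does not support SMT.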
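The two levels above can be sketched as bitmask helpers. This is only an illustration for the 8-CPU, two-cluster layout; `mc_span` and `die_span` are hypothetical names that do not exist in the kernel, and the fixed cluster size is an assumption.

```c
#define NR_CPUS          8
#define CPUS_PER_CLUSTER 4   /* assumption: two clusters of four CPUs each */

/* Bitmask of the CPUs that share an MC-level domain with @cpu (its cluster). */
unsigned int mc_span(int cpu)
{
    int first = (cpu / CPUS_PER_CLUSTER) * CPUS_PER_CLUSTER;

    return ((1u << CPUS_PER_CLUSTER) - 1) << first;
}

/* Bitmask of the CPUs in the DIE-level domain: every CPU in the system. */
unsigned int die_span(int cpu)
{
    (void)cpu;   /* the DIE span is the same for every CPU */
    return (1u << NR_CPUS) - 1;
}
```

With this layout `mc_span(5)` covers cpu 4-7 and `die_span()` covers cpu 0-7 for any CPU.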
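Building the CPU Topology
The kernel has CPU topology code that builds this structure. The structure itself is defined in the DTS and differs from platform to platform. The relevant DTS for the MTK platform I am working on is shown below (I am not using a qcom platform here because, for now, MTK seems to be the only one available at my company, so some details may differ slightly):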
cpu0: cpu@000 { device_type = "cpu"; compatible = "arm,cortex-a53"; reg = <0x000>; enable-method = "psci"; clock-frequency = <2301000000>; operating-points-v2 = <&cluster0_opp>; dynamic-power-coefficient = <275>; capacity-dmips-mhz = <1024>; cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>, <&SODI &SODI3 &DPIDLE &SUSPEND>; }; cpu1: cpu@001 { device_type = "cpu"; compatible = "arm,cortex-a53"; reg = <0x001>; enable-method = "psci"; clock-frequency = <2301000000>; operating-points-v2 = <&cluster0_opp>; dynamic-power-coefficient = <275>; capacity-dmips-mhz = <1024>; cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>, <&SODI &SODI3 &DPIDLE &SUSPEND>; }; cpu2: cpu@002 { device_type = "cpu"; compatible = "arm,cortex-a53"; reg = <0x002>; enable-method = "psci"; clock-frequency = <2301000000>; operating-points-v2 = <&cluster0_opp>; dynamic-power-coefficient = <275>; capacity-dmips-mhz = <1024>; cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>, <&SODI &SODI3 &DPIDLE &SUSPEND>; }; cpu3: cpu@003 { device_type = "cpu"; compatible = "arm,cortex-a53"; reg = <0x003>; enable-method = "psci"; clock-frequency = <2301000000>; operating-points-v2 = <&cluster0_opp>; dynamic-power-coefficient = <275>; capacity-dmips-mhz = <1024>; cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>, <&SODI &SODI3 &DPIDLE &SUSPEND>; }; cpu4: cpu@100 { device_type = "cpu"; compatible = "arm,cortex-a53"; reg = <0x100>; enable-method = "psci"; clock-frequency = <1800000000>; operating-points-v2 = <&cluster1_opp>; dynamic-power-coefficient = <85>; capacity-dmips-mhz = <801>; cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>, <&SODI &SODI3 &DPIDLE &SUSPEND>; }; cpu5: cpu@101 { device_type = "cpu"; compatible = "arm,cortex-a53"; reg = <0x101>; enable-method = "psci"; clock-frequency = <1800000000>; operating-points-v2 = <&cluster1_opp>; dynamic-power-coefficient = <85>; capacity-dmips-mhz = <801>; cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>, <&SODI &SODI3 &DPIDLE &SUSPEND>; }; cpu6: 
cpu@102 { device_type = "cpu"; compatible = "arm,cortex-a53"; reg = <0x102>; enable-method = "psci"; clock-frequency = <1800000000>; operating-points-v2 = <&cluster1_opp>; dynamic-power-coefficient = <85>; capacity-dmips-mhz = <801>; cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>, <&SODI &SODI3 &DPIDLE &SUSPEND>; }; cpu7: cpu@103 { device_type = "cpu"; compatible = "arm,cortex-a53"; reg = <0x103>; enable-method = "psci"; clock-frequency = <1800000000>; operating-points-v2 = <&cluster1_opp>; dynamic-power-coefficient = <85>; capacity-dmips-mhz = <801>; cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>, <&SODI &SODI3 &DPIDLE &SUSPEND>; }; cpu-map { cluster0 { core0 { cpu = <&cpu0>; }; core1 { cpu = <&cpu1>; }; core2 { cpu = <&cpu2>; }; core3 { cpu = <&cpu3>; }; }; cluster1 { core0 { cpu = <&cpu4>; }; core1 { cpu = <&cpu5>; }; core2 { cpu = <&cpu6>; }; core3 { cpu = <&cpu7>; }; }; };
Code paths: drivers/base/arch_topology.c and arch/arm64/kernel/topology.c. The code in this article is taken from the CAF kernel msm-5.4.
Part 1: parse the DTS and store package_id, core_id, and cpu_scale (cpu_capacity_orig) for each CPU
kernel_init() -> kernel_init_freeable() -> smp_prepare_cpus() -> init_cpu_topology() -> parse_dt_topology()
The code parses the "cpus" node in the DTS and then the "cpu-map" node inside it:
- first it parses the structure of the cluster nodes;
- then it normalizes the CPU capacities.
    static int __init parse_dt_topology(void)
    {
        struct device_node *cn, *map;
        int ret = 0;
        int cpu;

        cn = of_find_node_by_path("/cpus");          // look up the /cpus node in the DTS
        if (!cn) {
            pr_err("No CPU information found in DT\n");
            return 0;
        }

        /*
         * When topology is provided cpu-map is essentially a root
         * cluster with restricted subnodes.
         */
        map = of_get_child_by_name(cn, "cpu-map");   // look up the cpu-map node under /cpus
        if (!map)
            goto out;

        ret = parse_cluster(map, 0);                 // (1) parse the cluster structure
        if (ret != 0)
            goto out_map;

        topology_normalize_cpu_scale();              // (2) normalize the CPU capacities

        /*
         * Check that all cores are in the topology; the SMP code will
         * only mark cores described in the DT as possible.
         */
        for_each_possible_cpu(cpu)
            if (cpu_topology[cpu].package_id == -1)
                ret = -EINVAL;

    out_map:
        of_node_put(map);
    out:
        of_node_put(cn);
        return ret;
    }
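(1) Parse the cluster structure
- The first do-while loop looks for "cluster" + index nodes; on this platform it finds cluster0 and cluster1. For each one the function calls itself, reusing the same code to parse the "core" nodes inside that cluster.
- Inside the recursive call, the second do-while loop looks for "core" + index nodes. On this platform each cluster contains core0-core3 (8 cores in total), and each one is handed to parse_core.
- The actual parsing order is therefore: cluster0 with core0,1,2,3 (cpu0-3), then cluster1 with core0,1,2,3 (cpu4-7).
- Once all cores of a cluster have been parsed and its do-while loop exits, package_id is incremented. package_id therefore corresponds to the cluster id, while core_id restarts from 0 in every cluster.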
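The id assignment described above can be modelled in user space as follows. This is a simplified sketch, not the kernel code: `assign_ids`, `struct topo_entry`, and the flattened `cores_per_cluster` input are hypothetical, and numbering CPUs in traversal order is an assumption that happens to match the DTS shown earlier.

```c
#include <stddef.h>

#define MAX_CPUS 8

struct topo_entry {
    int package_id;   /* index of the cluster the CPU belongs to */
    int core_id;      /* index of the core inside its cluster */
};

/*
 * cores_per_cluster[c] is the number of coreN children of cluster c.
 * Returns the number of CPUs filled in, or -1 if @out is too small.
 */
int assign_ids(const int *cores_per_cluster, size_t nr_clusters,
               struct topo_entry *out, size_t out_len)
{
    int package_id = 0;
    size_t cpu = 0;

    for (size_t c = 0; c < nr_clusters; c++) {
        int core_id = 0;   /* local counter: restarts for every cluster */

        for (int k = 0; k < cores_per_cluster[c]; k++) {
            if (cpu >= out_len)
                return -1;
            out[cpu].package_id = package_id;
            out[cpu].core_id = core_id++;
            cpu++;
        }
        package_id++;      /* one leaf cluster finished */
    }
    return (int)cpu;
}
```

For two clusters of four cores, cpu7 ends up with package_id 1 and core_id 3, which is what the sysfs output later in this article reports.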
    static int __init parse_cluster(struct device_node *cluster, int depth)
    {
        char name[20];
        bool leaf = true;
        bool has_cores = false;
        struct device_node *c;
        static int package_id __initdata;
        int core_id = 0;
        int i, ret;

        /*
         * First check for child clusters; we currently ignore any
         * information about the nesting of clusters and present the
         * scheduler with a flat list of them.
         */
        i = 0;
        do {
            snprintf(name, sizeof(name), "cluster%d", i);   // try cluster0, cluster1, ...; this platform only has cluster0/1
            c = of_get_child_by_name(cluster, name);        // does this node have a clusterN child?
            if (c) {
                leaf = false;
                ret = parse_cluster(c, depth + 1);          // recurse into the cluster; the same function then parses its cores
                of_node_put(c);
                if (ret != 0)
                    return ret;
            }
            i++;
        } while (c);

        /* Now check for cores */
        i = 0;
        do {
            snprintf(name, sizeof(name), "core%d", i);      // try core0, core1, ...; each cluster has core0-3 on this platform
            c = of_get_child_by_name(cluster, name);        // does this cluster have a coreN child?
            if (c) {
                has_cores = true;

                if (depth == 0) {                           // note: cores are only valid when reached through the depth + 1 call above
                    pr_err("%pOF: cpu-map children should be clusters\n", c);   // a core directly under cpu-map (depth == 0) is an error
                    of_node_put(c);
                    return -EINVAL;
                }

                if (leaf) {                                 // with depth + 1 and leaf == true we are at the core level
                    ret = parse_core(c, package_id, core_id++);   // (1-1) parse the core
                } else {
                    pr_err("%pOF: Non-leaf cluster with core %s\n",
                           cluster, name);
                    ret = -EINVAL;
                }

                of_node_put(c);
                if (ret != 0)
                    return ret;
            }
            i++;
        } while (c);

        if (leaf && !has_cores)
            pr_warn("%pOF: empty cluster\n", cluster);

        if (leaf)           // all cores of this cluster are parsed; the next cluster gets the next package_id
            package_id++;   // so package_id corresponds to the cluster id

        return 0;
    }
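(1-1) Parse the core structure
- This platform does not support SMT, so there are no "thread" + index nodes below the "core" + index nodes.
- Parse all the information in the cpu node.
- Update cpu_topology[cpu].package_id and core_id, which record which core of which cluster the CPU is.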
    static int __init parse_core(struct device_node *core, int package_id,
                                 int core_id)
    {
        char name[20];
        bool leaf = true;
        int i = 0;
        int cpu;
        struct device_node *t;

        do {
            snprintf(name, sizeof(name), "thread%d", i);   // no SMT here, so the DTS has no threadN nodes under a core
            t = of_get_child_by_name(core, name);
            if (t) {
                leaf = false;
                cpu = get_cpu_for_node(t);
                if (cpu >= 0) {
                    cpu_topology[cpu].package_id = package_id;
                    cpu_topology[cpu].core_id = core_id;
                    cpu_topology[cpu].thread_id = i;
                } else {
                    pr_err("%pOF: Can't get CPU for thread\n", t);
                    of_node_put(t);
                    return -EINVAL;
                }
                of_node_put(t);
            }
            i++;
        } while (t);

        cpu = get_cpu_for_node(core);                      // (1-1-1) resolve the cpu node referenced by this core
        if (cpu >= 0) {
            if (!leaf) {
                pr_err("%pOF: Core has both threads and CPU\n", core);
                return -EINVAL;
            }

            cpu_topology[cpu].package_id = package_id;     // store package_id (the cluster id) in the cpu_topology array
            cpu_topology[cpu].core_id = core_id;           // store core_id: the index inside the cluster (0-3 here), not the CPU number
        } else if (leaf) {
            pr_err("%pOF: Can't get CPU for leaf core\n", core);
            return -EINVAL;
        }

        return 0;
    }
(1-1-1) Resolve the cpu node from the core
- Look up the cpu node referenced by the core node and map it to its CPU id.
- Then parse that CPU core's capacity.
    static int __init get_cpu_for_node(struct device_node *node)
    {
        struct device_node *cpu_node;
        int cpu;

        cpu_node = of_parse_phandle(node, "cpu", 0);          // follow the "cpu" phandle of the core node
        if (!cpu_node)
            return -1;

        cpu = of_cpu_node_to_id(cpu_node);                    // logical CPU id of that cpu node: 0, 1, ...
        if (cpu >= 0)
            topology_parse_cpu_capacity(cpu_node, cpu);       // (1-1-1-1) parse the capacity of this CPU core
        else
            pr_crit("Unable to find CPU node for %pOF\n", cpu_node);

        of_node_put(cpu_node);
        return cpu;
    }
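(1-1-1-1) Parse the capacity of each CPU core
- First read capacity-dmips-mhz as the CPU's raw_capacity. This property represents the CPU's compute capacity: the larger the number, the more capable the core. (The MTK DTS above is clearly a big.LITTLE design, but unusually cpu0-3 are the big cores and cpu4-7 the little ones; on qcom platforms it is typically the other way round, with cpu0-3 little and cpu4-7 big.)
- The raw_capacity values on this platform are 1024 for cpu0-3 and 801 for cpu4-7.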
    bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
    {
        static bool cap_parsing_failed;
        int ret;
        u32 cpu_capacity;

        if (cap_parsing_failed)
            return false;

        ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",   // CPU compute capacity; set in the DTS of the 4.19+ kernels discussed here
                                   &cpu_capacity);
        if (!ret) {
            if (!raw_capacity) {
                raw_capacity = kcalloc(num_possible_cpus(),          // allocate the raw_capacity array for all possible CPUs
                                       sizeof(*raw_capacity),
                                       GFP_KERNEL);
                if (!raw_capacity) {
                    cap_parsing_failed = true;
                    return false;
                }
            }
            capacity_scale = max(cpu_capacity, capacity_scale);      // remember the largest capacity as the scale
            raw_capacity[cpu] = cpu_capacity;                        // raw capacity is the dmips value from the DTS
            pr_debug("cpu_capacity: %pOF cpu_capacity=%u (raw)\n",
                     cpu_node, raw_capacity[cpu]);
        } else {
            if (raw_capacity) {
                pr_err("cpu_capacity: missing %pOF raw capacity\n",
                       cpu_node);
                pr_err("cpu_capacity: partial information: fallback to 1024 for all CPUs\n");
            }
            cap_parsing_failed = true;
            free_raw_capacity();
        }

        return !ret;
    }
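(2) Normalize the CPU raw_capacity values
- Every CPU core is normalized in turn: the largest value is mapped to 1024, and a smaller value with ratio n to the largest becomes n*1024.
- The computation is raw_capacity * 1024 / capacity_scale, where capacity_scale is the largest raw_capacity seen. On this platform that is already 1024, so the values stay at 1024 and 801.
- The normalized capacity is stored in the per-CPU variable cpu_scale. cpu_capacity_orig and cpu_capacity, which the scheduler uses all the time, are both derived from it.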
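The normalization step can be checked with a few lines of standalone C. This is a minimal sketch of the same arithmetic; `normalize_capacity` is a hypothetical name, and only the shift value is taken from the kernel.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10   /* 1 << 10 == 1024, as in the kernel */

/* Scale @raw so that a CPU whose raw value equals @scale ends up at 1024. */
uint64_t normalize_capacity(uint32_t raw, uint32_t scale)
{
    return ((uint64_t)raw << SCHED_CAPACITY_SHIFT) / scale;
}
```

On this platform the scale is 1024, so 1024 and 801 are unchanged. If a DTS instead listed, say, 500 and 1000, the results would be 512 and 1024.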
    void topology_normalize_cpu_scale(void)
    {
        u64 capacity;
        int cpu;

        if (!raw_capacity)
            return;

        pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
        for_each_possible_cpu(cpu) {
            pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
                     cpu, raw_capacity[cpu]);
            capacity = (raw_capacity[cpu] << SCHED_CAPACITY_SHIFT)   // normalize so that the max CPU capacity (100%) equals 1024
                / capacity_scale;
            topology_set_cpu_scale(cpu, capacity);                   // store the normalized value in per-CPU cpu_scale (cpu_capacity_orig)
            pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
                     cpu, topology_get_cpu_scale(cpu));
        }
    }
Part 2: update the sibling masks
The call path for cpu0 is:
kernel_init -> kernel_init_freeable -> smp_prepare_cpus -> store_cpu_topology
The call path for cpu1-7 is:
secondary_start_kernel -> store_cpu_topology
    void store_cpu_topology(unsigned int cpuid)
    {
        struct cpu_topology *cpuid_topo = &cpu_topology[cpuid];
        u64 mpidr;

        if (cpuid_topo->package_id != -1)   // package_id was already parsed from the DTS, so the MPIDR register path below is skipped
            goto topology_populated;

        mpidr = read_cpuid_mpidr();

        /* Uniprocessor systems can rely on default topology values */
        if (mpidr & MPIDR_UP_BITMASK)
            return;

        /*
         * This would be the place to create cpu topology based on MPIDR.
         *
         * However, it cannot be trusted to depict the actual topology; some
         * pieces of the architecture enforce an artificial cap on Aff0 values
         * (e.g. GICv3's ICC_SGI1R_EL1 limits it to 15), leading to an
         * artificial cycling of Aff1, Aff2 and Aff3 values. IOW, these end up
         * having absolutely no relationship to the actual underlying system
         * topology, and cannot be reasonably used as core / package ID.
         *
         * If the MT bit is set, Aff0 *could* be used to define a thread ID, but
         * we still wouldn't be able to obtain a sane core ID. This means we
         * need to entirely ignore MPIDR for any topology deduction.
         */
        cpuid_topo->thread_id  = -1;
        cpuid_topo->core_id    = cpuid;
        cpuid_topo->package_id = cpu_to_node(cpuid);

        pr_debug("CPU%u: cluster %d core %d thread %d mpidr %#016llx\n",
                 cpuid, cpuid_topo->package_id, cpuid_topo->core_id,
                 cpuid_topo->thread_id, mpidr);

    topology_populated:
        update_siblings_masks(cpuid);       // (1) update the sibling masks of this CPU
    }
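(1) Update the sibling masks of the current CPU
- The matching rule: two CPUs with the same package_id (that is, in the same cluster) are siblings of each other, and each is set in the other's core_sibling mask.
- This platform does not support SMT, so no CPU has a thread sibling other than itself.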
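The matching rule can be reproduced in user space with plain bitmasks. This is a simplified sketch: `struct cpu_topo`, `update_siblings`, and `build_example` are hypothetical names, the 2x4 layout is hard-coded, and the sketch marks a CPU as online before updating its masks, which is an assumption made to keep the example short.

```c
#define NR_CPUS 8

struct cpu_topo {
    int package_id;
    int core_id;
    unsigned int core_sibling;     /* bitmask of CPUs in the same cluster */
    unsigned int thread_sibling;   /* bitmask of CPUs in the same core */
};

/* Update the masks of @cpuid against every CPU set in @online_mask. */
void update_siblings(struct cpu_topo *topo, unsigned int online_mask, int cpuid)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(online_mask & (1u << cpu)))
            continue;
        if (topo[cpu].package_id != topo[cpuid].package_id)
            continue;   /* different cluster: not siblings */

        topo[cpu].core_sibling |= 1u << cpuid;
        topo[cpuid].core_sibling |= 1u << cpu;

        if (topo[cpu].core_id != topo[cpuid].core_id)
            continue;   /* different core: not thread siblings */

        topo[cpu].thread_sibling |= 1u << cpuid;
        topo[cpuid].thread_sibling |= 1u << cpu;
    }
}

/* Fill in the 2-cluster x 4-core layout and bring the CPUs up one by one. */
void build_example(struct cpu_topo *topo)
{
    unsigned int online = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        topo[cpu].package_id = cpu / 4;
        topo[cpu].core_id = cpu % 4;
        topo[cpu].core_sibling = 0;
        topo[cpu].thread_sibling = 0;
    }
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        online |= 1u << cpu;
        update_siblings(topo, online, cpu);
    }
}
```

After `build_example()`, cpu0 has core_sibling 0x0f and cpu7 has core_sibling 0xf0 and thread_sibling 0x80.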
    void update_siblings_masks(unsigned int cpuid)
    {
        struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
        int cpu;

        /* update core and thread sibling masks */
        for_each_online_cpu(cpu) {
            cpu_topo = &cpu_topology[cpu];

            if (cpuid_topo->llc_id == cpu_topo->llc_id) {            // no ACPI on this platform, so every llc_id is -1 and this always matches
                cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
                cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
            }

            if (cpuid_topo->package_id != cpu_topo->package_id)      // only CPUs in the same cluster can be core or thread siblings
                continue;

            cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);         // set each CPU's bit in the other's core_sibling mask
            cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

            if (cpuid_topo->core_id != cpu_topo->core_id)            // only CPUs in the same core can be thread siblings
                continue;

            cpumask_set_cpu(cpuid, &cpu_topo->thread_sibling);       // set each CPU's bit in the other's thread_sibling mask
            cpumask_set_cpu(cpu, &cpuid_topo->thread_sibling);
        }
    }
Finally, the topology can be confirmed over adb by reading the CPU nodes in sysfs:
TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # ls
core_id core_siblings core_siblings_list physical_package_id thread_siblings thread_siblings_list
cpu0:
    TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat core_id
    0
    TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat core_siblings
    0f
    TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat core_siblings_list
    0-3
    TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat physical_package_id
    0
    TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat thread_siblings
    01
    TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat thread_siblings_list
    0
cpu1:
    TECNO-KF6p:/sys/devices/system/cpu/cpu1/topology # cat *
    1
    0f
    0-3
    0
    02     //thread_siblings
    1      //thread_siblings_list
cpu7:
    TECNO-KF6p:/sys/devices/system/cpu/cpu7/topology # cat *
    3      //core_id (cpu4-7 have core ids 0,1,2,3: counting restarts inside the second cluster)
    f0     //core_siblings
    4-7    //core_siblings_list (the sibling cores)
    1      //physical_package_id (the cluster id)
    80     //thread_siblings
    7      //thread_siblings_list
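The hex masks and the `_list` files above carry the same information. As a small sketch of the relationship, the hypothetical helper `mask_to_list` below converts a mask string into the range notation. It assumes a single hex word with no commas, which is enough for an 8-CPU system but not for the comma-separated masks of very large machines.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Convert a hex CPU mask such as "f0" into a range list such as "4-7".
 * Returns 0 on success, -1 if @buf is too small.
 */
int mask_to_list(const char *hex, char *buf, size_t len)
{
    unsigned long mask = strtoul(hex, NULL, 16);
    const int nbits = (int)(sizeof(mask) * 8);
    size_t used = 0;

    if (len == 0)
        return -1;
    buf[0] = '\0';

    for (int bit = 0; bit < nbits; bit++) {
        int start, n;

        if (!(mask & (1UL << bit)))
            continue;

        /* extend the run while the next bit is also set */
        start = bit;
        while (bit + 1 < nbits && (mask & (1UL << (bit + 1))))
            bit++;

        if (start == bit)
            n = snprintf(buf + used, len - used, "%s%d",
                         used ? "," : "", start);
        else
            n = snprintf(buf + used, len - used, "%s%d-%d",
                         used ? "," : "", start, bit);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

Feeding it the `core_siblings` value of cpu7 (`f0`) gives the same text as `core_siblings_list`.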
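That is the whole flow for building the CPU topology, and it is fairly straightforward.
Building the Scheduling Domains (sd) and Scheduling Groups (sg)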
CPU MASK
- cpu_possible_mask: has bit 'cpu' set iff cpu is populatable (every CPU the system could ever have)
- cpu_present_mask: has bit 'cpu' set iff cpu is populated (CPUs actually present; changes with hotplug, always <= possible)
- cpu_online_mask: has bit 'cpu' set iff cpu available to scheduler (CPUs that are online, whether busy or idle)
- cpu_active_mask: has bit 'cpu' set iff cpu available to migration (CPUs that tasks may be migrated to; normally the same as online except while a CPU is going through hotplug)
- cpu_isolated_mask: has bit 'cpu' set iff cpu isolated (isolated CPUs are not given tasks to run, but they are not powered down)

Notes:
1. Without CONFIG_HOTPLUG_CPU, present == possible and active == online.
2. With CPU hotplug configured, present changes dynamically with the hotplug state.
Scheduling domains and scheduling groups are first built during kernel initialization, through this call path:
kernel_init() -> kernel_init_freeable() -> sched_init_smp() -> sched_init_domains()
The cpu_map passed in is cpu_active_mask, that is, the CPUs that are currently active. The domains are built from it:
    /* Current sched domains: */
    static cpumask_var_t *doms_cur;

    /* Number of sched domains in 'doms_cur': */
    static int ndoms_cur;

    /*
     * Set up scheduler domains and groups.  For now this just excludes isolated
     * CPUs, but could be used to exclude other special cases in the future.
     */
    int sched_init_domains(const struct cpumask *cpu_map)
    {
        int err;

        zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
        zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
        zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);

        arch_update_cpu_topology();                 // (1) fill the cpu_core_map array
        ndoms_cur = 1;                              // number of sched domain sets, initialized to 1
        doms_cur = alloc_sched_domains(ndoms_cur);  // allocate the cpumask array for the domains
        if (!doms_cur)
            doms_cur = &fallback_doms;
        cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_FLAG_DOMAIN));   // keep only the non-isolated CPUs of cpu_map (presumably none are isolated at init)
        err = build_sched_domains(doms_cur[0], NULL);   // (2) build the sched domains for this set of CPUs
        register_sched_domain_sysctl();             // (3) register /proc/sys/kernel/sched_domain and its sysctl entries

        return err;
    }
(1) Fill the cpu_core_map array for every CPU in cpu_possible_mask
    int arch_update_cpu_topology(void)
    {
        unsigned int cpu;

        for_each_possible_cpu(cpu)                       // walk cpu_possible_mask, i.e. every physical CPU core
            cpu_core_map[cpu] = cpu_coregroup_map(cpu);

        return 0;
    }
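(2) Build the sched domains for the CPUs provided (the active CPUs minus any isolated ones)
- (2-1) Build the CPU topology levels from the default topology (MC, DIE); allocate the sched_domain objects and the per-CPU variables; allocate and initialize the root domain.
- (2-2) Determine the platform type (big.LITTLE) and find the lowest level at which CPUs of different capacity are visible: DIE.
- (2-3) For each CPU and topology level, set up the MC and DIE sched domains and link them as child and parent; initialize the domain flags and load-balance parameters. Idle balance remains enabled on both MC and DIE.
- (2-4) Set up the sched groups, initialize their cpumask and capacity, and link the groups of each MC and DIE domain into a circular list; tie sd, sg, and sgc together.
- (2-5) Claim the sd, sg, and sgc objects that are now in use so that the final __free_domain_allocs() call in (2-8), whose sa_sd_storage step frees whatever is still referenced from sd_data, leaves them alone.
- (2-6) Update cpu_capacity_orig/cpu_capacity and related fields for each sg at the MC level (each sg is a single CPU), then for each sg at the DIE level (each sg is all the CPUs of one cluster).
- For each CPU in cpu_map:
  - find the CPUs with the largest and smallest cpu_capacity_orig (cpu_scale) and record them in the WALT root domain structure;
  - bind the newly built MC-level sd, the root domain, and the CPU's rq together;
  - (2-7) attach each new MC-level sd and the new rd to the CPU's rq; the old sd and old rd are destroyed.
- Walk cpu_map again to find a CPU whose cpu_capacity_orig lies between the two extremes (relevant for systems with three core types; this platform only has big and little cores and no prime core, so nothing is found). The largest cpu_capacity_orig found above and its CPU are also recorded in the rd.
- Use the static-key mechanism to switch the code paths that depend on whether the system has CPUs of different capacity.
- Depending on how far topology construction and root-domain allocation got, release the structures that are no longer needed (error handling and cleanup).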
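The max/min/mid search from the WALT-specific steps above can be sketched in standalone C. `struct cap_cpus` and `find_cap_cpus` are hypothetical names, and the capacities are passed in as a plain array instead of being read through arch_scale_cpu_capacity().

```c
struct cap_cpus {
    int max_cpu;   /* first CPU with the largest capacity, -1 if none */
    int min_cpu;   /* first CPU with the smallest capacity, -1 if none */
    int mid_cpu;   /* first CPU strictly between the two, -1 if none */
};

struct cap_cpus find_cap_cpus(const unsigned long *cap, int nr_cpus)
{
    struct cap_cpus r = { -1, -1, -1 };

    /* first pass: track the CPUs with the largest and smallest capacity */
    for (int i = 0; i < nr_cpus; i++) {
        if (r.max_cpu < 0 || cap[i] > cap[r.max_cpu])
            r.max_cpu = i;
        if (r.min_cpu < 0 || cap[i] < cap[r.min_cpu])
            r.min_cpu = i;
    }

    /* second pass: a CPU that matches neither extreme is the "mid" CPU */
    for (int i = 0; i < nr_cpus; i++) {
        if (cap[i] != cap[r.min_cpu] && cap[i] != cap[r.max_cpu]) {
            r.mid_cpu = i;
            break;
        }
    }
    return r;
}
```

With the capacities of this platform (1024 for cpu0-3, 801 for cpu4-7) the result is max_cpu 0, min_cpu 4, and mid_cpu -1, since there are only two distinct values.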
    /*
     * Build sched domains for a given set of CPUs and attach the sched domains
     * to the individual CPUs
     */
    static int
    build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
    {
        enum s_alloc alloc_state = sa_none;
        struct sched_domain *sd;
        struct s_data d;
        int i, ret = -ENOMEM;
        struct sched_domain_topology_level *tl_asym;
        bool has_asym = false;

        if (WARN_ON(cpumask_empty(cpu_map)))                         // reject an empty cpu_map
            goto error;

        alloc_state = __visit_domain_allocation_hell(&d, cpu_map);   // (2-1) set up the MC/DIE topology data; initialize the root domain
        if (alloc_state != sa_rootdomain)
            goto error;

        tl_asym = asym_cpu_capacity_level(cpu_map);                  // (2-2) lowest level where the max-capacity CPU is visible: DIE

        /* Set up domains for CPUs specified by the cpu_map: */
        for_each_cpu(i, cpu_map) {                                   // every CPU in cpu_map: 0-7
            struct sched_domain_topology_level *tl;

            sd = NULL;
            for_each_sd_topology(tl) {                               // walk the MC and DIE levels
                int dflags = 0;

                if (tl == tl_asym) {                                 // the DIE level gets SD_ASYM_CPUCAPACITY and sets has_asym
                    dflags |= SD_ASYM_CPUCAPACITY;
                    has_asym = true;
                }

                if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
                    goto error;

                sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);   // (2-3) build the MC/DIE sched domain

                if (tl == sched_domain_topology)                     // remember the lowest-level sd (MC here) in the per-CPU d.sd
                    *per_cpu_ptr(d.sd, i) = sd;
                if (tl->flags & SDTL_OVERLAP)                        // overlapping domains; not the case on this platform
                    sd->flags |= SD_OVERLAP;
                if (cpumask_equal(cpu_map, sched_domain_span(sd)))   // this sd already spans all of cpu_map, so no higher level is needed
                    break;
            }
        }

        /* Build the groups for the domains */
        for_each_cpu(i, cpu_map) {                                   // every CPU in cpu_map
            for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {  // from the lowest sd upwards: MC -> DIE
                sd->span_weight = cpumask_weight(sched_domain_span(sd));   // number of CPUs in this sd
                if (sd->flags & SD_OVERLAP) {                        // only NUMA-style topologies set this flag
                    if (build_overlap_sched_groups(sd, i))           // overlapping case (not this platform, not covered here)
                        goto error;
                } else {
                    if (build_sched_groups(sd, i))                   // (2-4) no overlap here, so the groups are built this way
                        goto error;
                }
            }
        }

        /* Calculate CPU capacity for physical packages and nodes */
        for (i = nr_cpumask_bits-1; i >= 0; i--) {                   // every CPU, highest first: cpu7, 6, ... 0
            if (!cpumask_test_cpu(i, cpu_map))                       // CPU not in cpu_map, presumably hotplugged out
                continue;

            for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {  // MC level, then DIE level
                claim_allocations(i, sd);                            // (2-5) claim the per-CPU sdd pointers in use so __free_domain_allocs() skips them
                init_sched_groups_capacity(i, sd);                   // (2-6) initialize the capacity of the groups
            }
        }

        /* Attach the domains */
        rcu_read_lock();
        for_each_cpu(i, cpu_map) {                                   // every CPU in cpu_map
    #ifdef CONFIG_SCHED_WALT
            int max_cpu = READ_ONCE(d.rd->wrd.max_cap_orig_cpu);     // CPU with the largest orig capacity recorded so far
            int min_cpu = READ_ONCE(d.rd->wrd.min_cap_orig_cpu);     // CPU with the smallest orig capacity recorded so far
    #endif

            sd = *per_cpu_ptr(d.sd, i);                              // lowest-level sd saved above (MC on this platform)

    #ifdef CONFIG_SCHED_WALT                                         // over the loop, find the CPUs with the largest/smallest orig capacity
            if ((max_cpu < 0) || (arch_scale_cpu_capacity(i) >
                    arch_scale_cpu_capacity(max_cpu)))
                WRITE_ONCE(d.rd->wrd.max_cap_orig_cpu, i);

            if ((min_cpu < 0) || (arch_scale_cpu_capacity(i) <
                    arch_scale_cpu_capacity(min_cpu)))
                WRITE_ONCE(d.rd->wrd.min_cap_orig_cpu, i);
    #endif

            cpu_attach_domain(sd, d.rd, i);                          // (2-7) attach sd and rd to the CPU's rq
        }

    #ifdef CONFIG_SCHED_WALT
        /* set the mid capacity cpu (assumes only 3 capacities) */
        for_each_cpu(i, cpu_map) {
            int max_cpu = READ_ONCE(d.rd->wrd.max_cap_orig_cpu);     // first CPU with the largest orig capacity
            int min_cpu = READ_ONCE(d.rd->wrd.min_cap_orig_cpu);     // first CPU with the smallest orig capacity

            if ((arch_scale_cpu_capacity(i)                          // a CPU whose orig capacity lies between the two
                    != arch_scale_cpu_capacity(min_cpu)) &&
                    (arch_scale_cpu_capacity(i)
                    != arch_scale_cpu_capacity(max_cpu))) {
                WRITE_ONCE(d.rd->wrd.mid_cap_orig_cpu, i);           // only two capacity values here, so no mid CPU is found
                break;
            }
        }

        /*
         * The max_cpu_capacity reflect the original capacity which does not
         * change dynamically. So update the max cap CPU and its capacity
         * here.
         */
        if (d.rd->wrd.max_cap_orig_cpu != -1) {
            d.rd->max_cpu_capacity.cpu = d.rd->wrd.max_cap_orig_cpu;   // record the max-capacity CPU in the rd (a different field from max_cap_orig_cpu)
            d.rd->max_cpu_capacity.val = arch_scale_cpu_capacity(      // and its orig capacity value
                            d.rd->wrd.max_cap_orig_cpu);
        }
    #endif

        rcu_read_unlock();

        if (has_asym)                                                // true on this big.LITTLE platform
            static_branch_inc_cpuslocked(&sched_asym_cpucapacity);   // flip the sched_asym_cpucapacity static key (runtime-patched branch, similar in purpose to likely/unlikely)

        ret = 0;
    error:
        __free_domain_allocs(&d, alloc_state, cpu_map);              // (2-8) free whatever is unused, depending on how far allocation got

        return ret;
    }
(2-1) Set up the MC and DIE topology data; initialize the root domain
    static enum s_alloc
    __visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
    {
        memset(d, 0, sizeof(*d));

        if (__sdt_alloc(cpu_map))                        // (2-1-1) allocate the per-level data for MC and DIE
            return sa_sd_storage;
        d->sd = alloc_percpu(struct sched_domain *);     // allocate the per-CPU d->sd pointers
        if (!d->sd)
            return sa_sd_storage;
        d->rd = alloc_rootdomain();                      // (2-1-2) allocate and initialize the root domain
        if (!d->rd)
            return sa_sd;

        return sa_rootdomain;
    }
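(2-1-1) Allocate the per-level data for MC and DIE
The CPU topology table is shown below. Because this platform does not support SMT, the levels from bottom to top are MC and DIE. The table is walked by the loop in __sdt_alloc().
    /*
     * Topology list, bottom-up.
     */
    static struct sched_domain_topology_level default_topology[] = {
    #ifdef CONFIG_SCHED_SMT
        { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
    #endif
    #ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
    #endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
    };
- The MC level is set up first:
  - the four per-CPU pointers of its sd_data structure (&tl->data) are allocated: sdd->sd, sdd->sds, sdd->sg, sdd->sgc;
  - while iterating over cpu0-7, an sd, sds, sg, and sgc object is created for each CPU and stored in the corresponding per-CPU slot.
- The DIE level is set up next, in exactly the same way:
  - the four per-CPU pointers of its sd_data structure (&tl->data) are allocated: sdd->sd, sdd->sds, sdd->sg, sdd->sgc;
  - while iterating over cpu0-7, an sd, sds, sg, and sgc object is created for each CPU and stored in the corresponding per-CPU slot.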
static int __sdt_alloc(const struct cpumask *cpu_map) { struct sched_domain_topology_level *tl; int j; for_each_sd_topology(tl) { //依次遍歷MC、DIE結構 struct sd_data *sdd = &tl->data; //如下是為MC、DIE level的percpu變量sd_data,申請空間 sdd->sd = alloc_percpu(struct sched_domain *); //sched_domain if (!sdd->sd) return -ENOMEM; sdd->sds = alloc_percpu(struct sched_domain_shared *); //sched_domain_shared if (!sdd->sds) return -ENOMEM; sdd->sg = alloc_percpu(struct sched_group *); //sched_group if (!sdd->sg) return -ENOMEM; sdd->sgc = alloc_percpu(struct sched_group_capacity *); //sched_group_capacity if (!sdd->sgc) return -ENOMEM; for_each_cpu(j, cpu_map) { //遍歷了cpu_map中所有cpu,當前平台為8核:cpu0-7 struct sched_domain *sd; struct sched_domain_shared *sds; struct sched_group *sg; struct sched_group_capacity *sgc; sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(), //申請sd + cpumask的空間 GFP_KERNEL, cpu_to_node(j)); //cpu_to_node應該是選擇cpu所在本地的內存node,UMA架構僅有一個node if (!sd) return -ENOMEM; *per_cpu_ptr(sdd->sd, j) = sd; //將cpu[j]的調度域sd綁定到sdd->sd上 sds = kzalloc_node(sizeof(struct sched_domain_shared), //類似申請sds空間,並綁定到sdd->sds GFP_KERNEL, cpu_to_node(j)); if (!sds) return -ENOMEM; *per_cpu_ptr(sdd->sds, j) = sds; sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), //類似申請sg + cpumask空間,並綁定到sdd->sg GFP_KERNEL, cpu_to_node(j)); if (!sg) return -ENOMEM; sg->next = sg; //初始化時,sg的鏈表並未真正建立 *per_cpu_ptr(sdd->sg, j) = sg; sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),//類似申請sgc + cpumask空間,並綁定到sdd->sgc GFP_KERNEL, cpu_to_node(j)); if (!sgc) return -ENOMEM; #ifdef CONFIG_SCHED_DEBUG sgc->id = j; //將cpu編號綁定到sgc->id #endif *per_cpu_ptr(sdd->sgc, j) = sgc; } } return 0; }
(2-1-2) Allocate and initialize the root domain
    static struct root_domain *alloc_rootdomain(void)
    {
        struct root_domain *rd;

        rd = kzalloc(sizeof(*rd), GFP_KERNEL);
        if (!rd)
            return NULL;

        if (init_rootdomain(rd) != 0) {   // (2-1-2-1) initialize the root domain
            kfree(rd);
            return NULL;
        }

        return rd;
    }
(2-1-2-1) Initialize the root domain
static int init_rootdomain(struct root_domain *rd) { if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL)) //申請4個cpu mask的空間 goto out; if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL)) goto free_span; if (!zalloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL)) goto free_online; if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL)) goto free_dlo_mask; #ifdef HAVE_RT_PUSH_IPI rd->rto_cpu = -1; //初始化rto相關參數和隊列,針對IPI pull的請求,在rto_mask中loop,暫時沒理解? raw_spin_lock_init(&rd->rto_lock); init_irq_work(&rd->rto_push_work, rto_push_irq_work_func); #endif init_dl_bw(&rd->dl_bw); //初始化deadline bandwidth if (cpudl_init(&rd->cpudl) != 0) //初始化cpudl結構體 goto free_rto_mask; if (cpupri_init(&rd->cpupri) != 0) //初始化cpupri結構體 goto free_cpudl; #ifdef CONFIG_SCHED_WALT rd->wrd.max_cap_orig_cpu = rd->wrd.min_cap_orig_cpu = -1; //初始化walt_root_domain rd->wrd.mid_cap_orig_cpu = -1; #endif init_max_cpu_capacity(&rd->max_cpu_capacity); //初始化max_cpu_capacity ->val=0、->cpu=-1 return 0; free_cpudl: cpudl_cleanup(&rd->cpudl); free_rto_mask: free_cpumask_var(rd->rto_mask); free_dlo_mask: free_cpumask_var(rd->dlo_mask); free_online: free_cpumask_var(rd->online); free_span: free_cpumask_var(rd->span); out: return -ENOMEM; }
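(2-2) Find the lowest level at which the max-capacity CPU is visible: the DIE level
- Decide whether this is a big.LITTLE system, that is, whether the CPUs in cpu_map have different capacities.
- Walk cpu_map and the CPU topology levels to find the largest CPU capacity.
- Record the level at which CPUs of different capacity become visible: the DIE level.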
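The search described above can be modelled with two fixed levels and plain bitmasks. This is a sketch under stated assumptions: `level_span` and `asym_level` are hypothetical names, level 0 stands for MC and level 1 for DIE, and the cluster layout is hard-coded to two clusters of four CPUs.

```c
#define NR_CPUS   8
#define NR_LEVELS 2   /* 0 = MC, 1 = DIE */

/* CPUs visible to @cpu at @level (assumption: two clusters of four). */
unsigned int level_span(int level, int cpu)
{
    if (level == 0)
        return 0x0fu << ((cpu / 4) * 4);
    return 0xffu;
}

/*
 * Return the lowest level at which every CPU can see a CPU of the highest
 * capacity, or -1 if all CPUs in @cpu_map have the same capacity.
 */
int asym_level(const unsigned long *cap, unsigned int cpu_map)
{
    int first = -1, asym = 0, asym_lvl = 0, result = -1;

    /* Is there any asymmetry at all? */
    for (int i = 0; i < NR_CPUS; i++) {
        if (!(cpu_map & (1u << i)))
            continue;
        if (first < 0) {
            first = i;
        } else if (cap[i] != cap[first]) {
            asym = 1;
            break;
        }
    }
    if (!asym)
        return -1;

    /* From each CPU's point of view, find where a bigger CPU shows up. */
    for (int i = 0; i < NR_CPUS; i++) {
        unsigned long max_cap;

        if (!(cpu_map & (1u << i)))
            continue;
        max_cap = cap[i];

        /* levels below the current answer cannot change it, so skip them */
        for (int lvl = asym_lvl; lvl < NR_LEVELS; lvl++) {
            unsigned int span = level_span(lvl, i) & cpu_map;

            for (int j = 0; j < NR_CPUS; j++) {
                if (!(span & (1u << j)) || cap[j] <= max_cap)
                    continue;
                max_cap = cap[j];
                asym_lvl = lvl;
                result = lvl;
            }
        }
    }
    return result;
}
```

For this platform the first bigger CPU is only found once cpu4 looks at the DIE level, so the function returns 1.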
    /*
     * Find the sched_domain_topology_level where all CPU capacities are visible
     * for all CPUs.
     */
    static struct sched_domain_topology_level
    *asym_cpu_capacity_level(const struct cpumask *cpu_map)
    {
        int i, j, asym_level = 0;
        bool asym = false;
        struct sched_domain_topology_level *tl, *asym_tl = NULL;
        unsigned long cap;

        /* Is there any asymmetry? */
        cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));   // capacity of the first CPU in cpu_map (cpu0)

        for_each_cpu(i, cpu_map) {                               // any CPU with a different capacity means big.LITTLE
            if (arch_scale_cpu_capacity(i) != cap) {             // true on this platform
                asym = true;
                break;
            }
        }

        if (!asym)
            return NULL;

        /*
         * Examine topology from all CPU's point of views to detect the lowest
         * sched_domain_topology_level where a highest capacity CPU is visible
         * to everyone.
         */
        for_each_cpu(i, cpu_map) {                               // every CPU in cpu_map: 0-7
            unsigned long max_capacity = arch_scale_cpu_capacity(i);
            int tl_id = 0;

            for_each_sd_topology(tl) {                           // walk the MC and DIE levels
                if (tl_id < asym_level)
                    goto next_level;

                for_each_cpu_and(j, tl->mask(i), cpu_map) {      // (2-2-1) MC: cpu0-3 or cpu4-7; DIE: cpu0-7
                    unsigned long capacity;

                    capacity = arch_scale_cpu_capacity(j);       // cpu_capacity_orig of cpu j

                    if (capacity <= max_capacity)
                        continue;

                    max_capacity = capacity;                     // largest capacity seen so far
                    asym_level = tl_id;                          // remember the level id: 1
                    asym_tl = tl;                                // remember the level with different capacities: DIE
                }
    next_level:
                tl_id++;
            }
        }

        return asym_tl;
    }
(2-2-1) A closer look at tl->mask(i)
- tl points into default_topology, so tl->mask is cpu_coregroup_mask at the MC level and cpu_cpu_mask at the DIE level.
- At the MC level the mask returned is therefore the core_sibling mask; at the DIE level it is the mask of all physical CPUs.
    const struct cpumask *cpu_coregroup_mask(int cpu)
    {
        const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

        /* Find the smaller of NUMA, core or LLC siblings */
        if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
            /* not numa in package, lets use the package siblings */
            core_mask = &cpu_topology[cpu].core_sibling;
        }
        if (cpu_topology[cpu].llc_id != -1) {
            if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
                core_mask = &cpu_topology[cpu].llc_sibling;
        }

        return core_mask;
    }

    static inline const struct cpumask *cpu_cpu_mask(int cpu)
    {
        return cpumask_of_node(cpu_to_node(cpu));
    }

    /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
    static inline const struct cpumask *cpumask_of_node(int node)
    {
        if (node == NUMA_NO_NODE)   // this platform is UMA, not NUMA, so there is only one node
            return cpu_all_mask;

        return node_to_cpumask_map[node];
    }
(2-3) Build the MC and DIE sched domains
    static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
            const struct cpumask *cpu_map, struct sched_domain_attr *attr,
            struct sched_domain *child, int dflags, int cpu)
    {
        struct sched_domain *sd = sd_init(tl, cpu_map, child, dflags, cpu);   // (2-3-1) initialize the sched_domain for this level

        if (child) {                                    // child is NULL at the MC level, so this only runs for DIE
            sd->level = child->level + 1;               // the DIE level is the child level + 1
            sched_domain_level_max = max(sched_domain_level_max, sd->level);   // track the highest sd level
            child->parent = sd;                         // the DIE sd becomes the parent of the MC sd

            if (!cpumask_subset(sched_domain_span(child),
                                sched_domain_span(sd))) {
                pr_err("BUG: arch topology borken\n");
    #ifdef CONFIG_SCHED_DEBUG
                pr_err("     the %s domain not a subset of the %s domain\n",
                       child->name, sd->name);
    #endif
                /* Fixup, ensure @sd has at least @child CPUs. */
                cpumask_or(sched_domain_span(sd),
                           sched_domain_span(sd),
                           sched_domain_span(child));
            }

        }
        set_domain_attribute(sd, attr);                 // (2-3-2) attr is NULL here; idle balance stays enabled

        return sd;
    }
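(2-3-1) Initialize the sched_domain
- Initialize sd.flags. Reading the code below, the flags after sd_init() come out as:
  - MC: SD_LOAD_BALANCE, SD_BALANCE_NEWIDLE, SD_BALANCE_EXEC, SD_BALANCE_FORK, SD_WAKE_AFFINE, SD_SHARE_PKG_RESOURCES (SD_PREFER_SIBLING is cleared once the DIE parent is initialized)
  - DIE: SD_LOAD_BALANCE, SD_BALANCE_NEWIDLE, SD_BALANCE_EXEC, SD_BALANCE_FORK, SD_WAKE_AFFINE, SD_PREFER_SIBLING, SD_ASYM_CPUCAPACITY
- Initialize the other important fields of the sd structure:
  - set the cpumask sched_domain_span(sd): at the MC level this is the cluster, at the DIE level it is all physical CPUs;
  - through the loop in the caller, link the MC and DIE domains as child and parent;
  - idle load balance (SD_BALANCE_NEWIDLE) is enabled on both the MC and DIE levels.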
    static struct sched_domain *
    sd_init(struct sched_domain_topology_level *tl,
            const struct cpumask *cpu_map,
            struct sched_domain *child, int dflags, int cpu)
    {
        struct sd_data *sdd = &tl->data;
        struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);    // the sd object of this CPU at this level
        int sd_id, sd_weight, sd_flags = 0;

    #ifdef CONFIG_NUMA
        /*
         * Ugly hack to pass state to sd_numa_mask()...
         */
        sched_domains_curr_level = tl->numa_level;
    #endif

        sd_weight = cpumask_weight(tl->mask(cpu));               // number of CPUs at this level: 4 for MC, 8 for DIE

        if (tl->sd_flags)                                        // only the MC level provides sd_flags
            sd_flags = (*tl->sd_flags)();                        // MC: SD_SHARE_PKG_RESOURCES; DIE: none
        if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,             // check for flags outside the allowed set
                "wrong sd_flags in topology description\n"))
            sd_flags &= TOPOLOGY_SD_FLAGS;                       // and drop them if there are any

        /* Apply detected topology flags */
        sd_flags |= dflags;                                      // the DIE level is passed SD_ASYM_CPUCAPACITY

        *sd = (struct sched_domain){                             // initialize the sd structure
            .min_interval           = sd_weight,                 // MC: 4, DIE: 8
            .max_interval           = 2*sd_weight,               // MC: 8, DIE: 16
            .busy_factor            = 32,
            .imbalance_pct          = 125,                       // used by load balance

            .cache_nice_tries       = 0,

            .flags                  = 1*SD_LOAD_BALANCE
                                    | 1*SD_BALANCE_NEWIDLE
                                    | 1*SD_BALANCE_EXEC
                                    | 1*SD_BALANCE_FORK
                                    | 0*SD_BALANCE_WAKE
                                    | 1*SD_WAKE_AFFINE
                                    | 0*SD_SHARE_CPUCAPACITY
                                    | 0*SD_SHARE_PKG_RESOURCES
                                    | 0*SD_SERIALIZE
                                    | 1*SD_PREFER_SIBLING
                                    | 0*SD_NUMA
                                    | sd_flags                   // MC: SD_SHARE_PKG_RESOURCES, DIE: SD_ASYM_CPUCAPACITY
                                    ,

            .last_balance           = jiffies,                   // timestamp of the last load balance
            .balance_interval       = sd_weight,                 // load balance interval, MC: 4, DIE: 8
            .max_newidle_lb_cost    = 0,                         // cost of newidle load balance
            .next_decay_max_lb_cost = jiffies,                   // used by idle balance; the exact cost it decays is not yet clear to me
            .child                  = child,                     // MC: child is NULL; DIE: child is the MC sd
    #ifdef CONFIG_SCHED_DEBUG
            .name                   = tl->name,
    #endif
        };

        cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));   // sd span = level mask (MC: core_siblings, DIE: all physical CPUs) AND cpu_map
        sd_id = cpumask_first(sched_domain_span(sd));            // the first CPU in the span serves as sd_id

        /*
         * Convert topological properties into behaviour.
         */

        /* Don't attempt to spread across CPUs of different capacities. */
        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)      // only the DIE level matches on this platform
            sd->child->flags &= ~SD_PREFER_SIBLING;              // so SD_PREFER_SIBLING is cleared on the child, i.e. the MC sd
                                                                 // result: the DIE sd has SD_PREFER_SIBLING, the MC sd does not

        if (sd->flags & SD_SHARE_CPUCAPACITY) {                  // this flag belongs to SMT domains
            sd->imbalance_pct = 110;

        } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {         // set at the MC level via cpu_core_flags(), see (*tl->sd_flags)() above
            sd->imbalance_pct = 117;                             // imbalance percentage for the MC sd
            sd->cache_nice_tries = 1;                            // MC level: cache_nice_tries = 1

    #ifdef CONFIG_NUMA                                           // not used here (the platform is UMA)
        } else if (sd->flags & SD_NUMA) {
            sd->cache_nice_tries = 2;

            sd->flags &= ~SD_PREFER_SIBLING;
            sd->flags |= SD_SERIALIZE;
            if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
                sd->flags &= ~(SD_BALANCE_EXEC |
                               SD_BALANCE_FORK |
                               SD_WAKE_AFFINE);
            }

    #endif
        } else {                                                 // DIE level: cache_nice_tries = 1
            sd->cache_nice_tries = 1;
        }

        sd->shared = *per_cpu_ptr(sdd->sds, sd_id);              // the sds of the first CPU in the span: cpu0's or cpu4's at MC, cpu0's at DIE
        atomic_inc(&sd->shared->ref);                            // take a reference on sd->shared

        if (sd->flags & SD_SHARE_PKG_RESOURCES)                  // true at the MC level
            atomic_set(&sd->shared->nr_busy_cpus, sd_weight);    // sd->shared->nr_busy_cpus = 4

        sd->private = sdd;                                       // sd->private points at &tl->data of the sd's own level

        return sd;
    }
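(2-3-2) attr is passed as NULL, so this function only decides whether idle balance should be turned on for the sd. On this platform idle balance ends up enabled at both the MC and DIE levels.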
static void set_domain_attribute(struct sched_domain *sd,
                                 struct sched_domain_attr *attr)
{
    int request;

    if (!attr || attr->relax_domain_level < 0) {
        if (default_relax_domain_level < 0)
            return;
        else
            request = default_relax_domain_level;
    } else
        request = attr->relax_domain_level;
    if (request < sd->level) {
        /* Turn off idle balance on this domain: */
        sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
    } else {
        /* Turn on idle balance on this domain: */
        sd->flags |= (SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
    }
}
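(2-4) This platform has no overlapping sd, so build_sched_groups is called to build the scheduling groups (sg) step by step.
- There are two outer loops: the first over cpu_map (cpu0-7), the second over the sd levels (MC, DIE). Inside the function there is one more loop that starts at the current CPU and walks every CPU in the sd. This looks somewhat redundant: at the DIE level, for example, cpu0 and cpu4 are each the first CPU of their child sd and get initialised, while for cpu1-3 and cpu5-7 get_group simply returns early.
- get_group initialises the sg (sg/sgc->cpumask and sgc->capacity) and links sd, sg and sgc together.
- At both the MC and DIE levels, the sgs are linked into a circular list.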
/*
 * build_sched_groups will build a circular linked list of the groups
 * covered by the given span, will set each group's ->cpumask correctly,
 * and will initialize their ->sgc.
 *
 * Assumes the sched_domain tree is fully constructed
 */
static int
build_sched_groups(struct sched_domain *sd, int cpu)
{
    struct sched_group *first = NULL, *last = NULL;
    struct sd_data *sdd = sd->private;
    const struct cpumask *span = sched_domain_span(sd);  // span of the current sd: core_siblings for MC, all physical CPUs for DIE
    struct cpumask *covered;
    int i;

    lockdep_assert_held(&sched_domains_mutex);
    covered = sched_domains_tmpmask;

    cpumask_clear(covered);              // the covered mask is cleared for every new sd or CPU of the outer loops

    for_each_cpu_wrap(i, span, cpu) {    // walk the whole sd span, starting at the current CPU
        struct sched_group *sg;

        if (cpumask_test_cpu(i, covered))    // CPUs already in the covered mask need no further work
            continue;

        sg = get_group(i, sdd);          // (2-4-1) initialise the sg of CPU i

        cpumask_or(covered, covered, sched_group_span(sg));  // covered |= span of sg

        if (!first)                      // remember the first sg for this CPU and level
            first = sg;
        if (last)
            last->next = sg;             // each sg's next points to the following sg
        last = sg;
    }
    last->next = first;                  // close the list so the sgs form a ring
    sd->groups = first;                  // sd->groups points only at the first sg

    return 0;
}
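To make the interplay of the covered mask and the wrap-around walk concrete, here is a minimal user-space sketch. It assumes an 8-CPU system stored as bits of an `unsigned`, and the array `group_span_of[]` is a hypothetical stand-in for what get_group() would return for each CPU; none of these names exist in the kernel.

```c
/* Number of CPUs in this sketch; each CPU is one bit of an unsigned mask. */
#define DEMO_NR_CPUS 8

/*
 * Walk `span` starting at `start` and wrapping around, in the spirit of
 * for_each_cpu_wrap(), and record one group per not-yet-covered CPU.
 * group_span_of[i] is the span the group of CPU i would have
 * (hypothetical input standing in for get_group()).
 * Returns the number of groups; out[] holds their spans in list order.
 */
int demo_build_groups(unsigned span, int start,
                      const unsigned group_span_of[DEMO_NR_CPUS],
                      unsigned out[DEMO_NR_CPUS])
{
    unsigned covered = 0;
    int n = 0;

    for (int k = 0; k < DEMO_NR_CPUS; k++) {
        int i = (start + k) % DEMO_NR_CPUS;

        if (!(span & (1u << i)))      /* CPU is not part of this domain */
            continue;
        if (covered & (1u << i))      /* already inside an earlier group */
            continue;

        covered |= group_span_of[i];
        out[n++] = group_span_of[i];  /* appended after the previous group */
    }
    return n;   /* conceptually, the last group links back to out[0] */
}
```

With DIE-level inputs (group spans 0x0F for cpu0-3 and 0xF0 for cpu4-7) and a start at cpu5, the walk yields two groups, 0xF0 first and 0x0F second, which matches the idea that each CPU's list starts with the group containing that CPU.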
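(2-4-1) Initialise the scheduling group (sg) of CPU i
- If the sd is at the DIE level, only the sg of the first CPU in the cluster is initialised and returned. This is very important!
- Link sd to sg, and sg to sgc.
- Initialise sg->cpumask and sg->sgc->cpumask: at the DIE level this is the span of the child sd; at the MC level it is a single CPU. Note the difference from sched_domain_span(sd): the span of an sg is one level below the span of its sd. In one sentence: the first sched group of each sched domain is that sd's child sched domain.
- Initialise sgc->capacity (number of CPUs in the child sd * 1024); the maximum and minimum capacity are both 1024. These are only initial values and not yet accurate; they are updated later.
The sd and sg initialisation above finally produces the relationships shown in the figure below. At the DIE level, only the sched group of the first CPU of each cluster is initialised (the ones drawn with dashed lines are not linked from the per_cpu variables).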
/*
 * Package topology (also see the load-balance blurb in fair.c)
 *
 * The scheduler builds a tree structure to represent a number of important
 * topology features. By default (default_topology[]) these include:
 *
 *  - Simultaneous multithreading (SMT)
 *  - Multi-Core Cache (MC)
 *  - Package (DIE)
 *
 * Where the last one more or less denotes everything up to a NUMA node.
 *
 * The tree consists of 3 primary data structures:
 *
 *      sched_domain -> sched_group -> sched_group_capacity
 *          ^ ^             ^ ^
 *          `-'             `-'
 *
 * The sched_domains are per-CPU and have a two way link (parent & child) and
 * denote the ever growing mask of CPUs belonging to that level of topology.
 *
 * Each sched_domain has a circular (double) linked list of sched_group's, each
 * denoting the domains of the level below (or individual CPUs in case of the
 * first domain level). The sched_group linked by a sched_domain includes the
 * CPU of that sched_domain [*].
 *
 * Take for instance a 2 threaded, 2 core, 2 cache cluster part:
 *
 * CPU   0   1   2   3   4   5   6   7
 *
 * DIE  [                             ]
 * MC   [             ] [             ]
 * SMT  [     ] [     ] [     ] [     ]
 *
 *  - or -
 *
 * DIE  0-7 0-7 0-7 0-7 0-7 0-7 0-7 0-7
 * MC   0-3 0-3 0-3 0-3 4-7 4-7 4-7 4-7
 * SMT  0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7
 *
 * CPU   0   1   2   3   4   5   6   7
 *
 * One way to think about it is: sched_domain moves you up and down among these
 * topology levels, while sched_group moves you sideways through it, at child
 * domain granularity.
 *
 * sched_group_capacity ensures each unique sched_group has shared storage.
 *
 * There are two related construction problems, both require a CPU that
 * uniquely identify each group (for a given domain):
 *
 *  - The first is the balance_cpu (see should_we_balance() and the
 *    load-balance blub in fair.c); for each group we only want 1 CPU to
 *    continue balancing at a higher domain.
 *
 *  - The second is the sched_group_capacity; we want all identical groups
 *    to share a single sched_group_capacity.
 *
 * Since these topologies are exclusive by construction. That is, its
 * impossible for an SMT thread to belong to multiple cores, and cores to
 * be part of multiple caches. There is a very clear and unique location
 * for each CPU in the hierarchy.
 *
 * Therefore computing a unique CPU for each group is trivial (the iteration
 * mask is redundant and set all 1s; all CPUs in a group will end up at _that_
 * group), we can simply pick the first CPU in each group.
 *
 *
 * [*] in other words, the first group of each domain is its child domain.
 */
static struct sched_group *get_group(int cpu, struct sd_data *sdd)
{
    struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
    struct sched_domain *child = sd->child;
    struct sched_group *sg;
    bool already_visited;

    if (child)                                          // a child sd exists, so this is the DIE-level sd
        cpu = cpumask_first(sched_domain_span(child));  // take the first CPU of the MC-level child sd; at the DIE level the sg used below is always that of the first CPU of each cluster

    sg = *per_cpu_ptr(sdd->sg, cpu);                    // associate sd with sg
    sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);              // associate sg with sgc

    /* Increase refcounts for claim_allocations: */     // bump the sg reference count
    already_visited = atomic_inc_return(&sg->ref) > 1;
    /* sgc visits should follow a similar trend as sg */
    WARN_ON(already_visited != (atomic_inc_return(&sg->sgc->ref) > 1));

    /* If we have already visited that group, it's already initialized. */
    // Skip sgs that are already initialised. At the DIE level build_sched_groups walks all physical CPUs,
    // but this function only initialises the sg of the first CPU of the child sd. So cpu0/cpu4 run the
    // code below, while cpu1-3/cpu5-7 are filtered out here.
    if (already_visited)
        return sg;

    if (child) {                                        // DIE level
        cpumask_copy(sched_group_span(sg), sched_domain_span(child));    // the sg span (sg->cpumask) is the span of the child sd
        cpumask_copy(group_balance_mask(sg), sched_group_span(sg));      // sg->sgc->cpumask is also the span of the child sd
    } else {                                            // MC level
        cpumask_set_cpu(cpu, sched_group_span(sg));     // the sg span (sg->cpumask) is its own single CPU
        cpumask_set_cpu(cpu, group_balance_mask(sg));   // sg->sgc->cpumask is also that single CPU
    }

    sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));  // rough total capacity from the number of CPUs in the sg span
    sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;       // initial minimum capacity
    sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;       // initial maximum capacity

    return sg;
}
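The initial capacity values set at the end of get_group() amount to a simple calculation. The sketch below assumes CPU masks are plain `unsigned` bitmasks; `demo_mask_weight`, `demo_sgc` and `demo_initial_sgc` are illustrative names, not kernel symbols.

```c
#define DEMO_CAPACITY_SCALE 1024UL   /* same value as SCHED_CAPACITY_SCALE */

/* Count the set bits of a CPU mask, standing in for cpumask_weight(). */
unsigned demo_mask_weight(unsigned mask)
{
    unsigned n = 0;

    while (mask) {
        n += mask & 1u;
        mask >>= 1;
    }
    return n;
}

struct demo_sgc {
    unsigned long capacity;
    unsigned long min_capacity;
    unsigned long max_capacity;
};

/* Initial, deliberately rough values: 1024 per CPU in the group span. */
struct demo_sgc demo_initial_sgc(unsigned group_span)
{
    struct demo_sgc sgc;

    sgc.capacity = DEMO_CAPACITY_SCALE * demo_mask_weight(group_span);
    sgc.min_capacity = DEMO_CAPACITY_SCALE;
    sgc.max_capacity = DEMO_CAPACITY_SCALE;
    return sgc;
}
```

For a DIE-level group covering one 4-CPU cluster (mask 0x0F) this gives a capacity of 4096, and for an MC-level group of a single CPU it gives 1024, before any of the later updates take the real per-CPU capacity into account.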
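(2-5) Set the per_cpu pointers (sdd) that were used to build the sd and sg to NULL, so that the subsequent __free_domain_allocs() does not free them. Read together with (2-8), this is presumably what keeps the sd_data that is in use from being freed when the sa_sd_storage stage runs, for example after an error.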
/*
 * NULL the sd_data elements we've used to build the sched_domain and
 * sched_group structure so that the subsequent __free_domain_allocs()
 * will not free the data we're using.
 */
static void claim_allocations(int cpu, struct sched_domain *sd)
{
    struct sd_data *sdd = sd->private;

    WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);    // check the per_cpu pointers of sd, sds, sg and sgc in turn and set them to NULL
    *per_cpu_ptr(sdd->sd, cpu) = NULL;

    if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
        *per_cpu_ptr(sdd->sds, cpu) = NULL;

    if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
        *per_cpu_ptr(sdd->sg, cpu) = NULL;

    if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
        *per_cpu_ptr(sdd->sgc, cpu) = NULL;
}
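(2-6) Initialise the cpu_capacity of the sg
- The do-while loop initialises group_weight for every sg: at the MC level the sg spans its own CPU; at the DIE level it spans the cluster (with WALT enabled, isolated CPUs are excluded).
- For the first CPU in the sg (MC level: the cpumask is the sg's own CPU; DIE level: the first CPU of the cluster), update the group capacity.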
/*
 * Initialize sched groups cpu_capacity.
 *
 * cpu_capacity indicates the capacity of sched group, which is used while
 * distributing the load between different sched groups in a sched domain.
 * Typically cpu_capacity for all the groups in a sched domain will be same
 * unless there are asymmetries in the topology. If there are asymmetries,
 * group having more cpu_capacity will pickup more load compared to the
 * group having less cpu_capacity.
 */
void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
{
    struct sched_group *sg = sd->groups;    // the sg list of this sd
#ifdef CONFIG_SCHED_WALT
    cpumask_t avail_mask;
#endif

    WARN_ON(!sg);

    do {                                    // initialise ->group_weight for every sg in the ring
        int cpu, max_cpu = -1;

#ifdef CONFIG_SCHED_WALT
        // With WALT, group_weight = sg span minus isolated CPUs.
        // MC level: the sg spans its own CPU; DIE level: the sg spans the cluster.
        cpumask_andnot(&avail_mask, sched_group_span(sg),
                       cpu_isolated_mask);
        sg->group_weight = cpumask_weight(&avail_mask);
#else
        // Without WALT, group_weight = number of CPUs in the sg span.
        sg->group_weight = cpumask_weight(sched_group_span(sg));
#endif

        if (!(sd->flags & SD_ASYM_PACKING))    // not set on this platform; the flag appears to indicate asymmetric SMT packing
            goto next;

        for_each_cpu(cpu, sched_group_span(sg)) {
            if (max_cpu < 0)
                max_cpu = cpu;
            else if (sched_asym_prefer(cpu, max_cpu))
                max_cpu = cpu;
        }
        sg->asym_prefer_cpu = max_cpu;

next:
        sg = sg->next;
    } while (sg != sd->groups);

    // Only the first CPU in sg->sgc->cpumask goes on to update the group capacity.
    // MC level: that is the sg's own CPU; DIE level: the first CPU of the cluster.
    if (cpu != group_balance_cpu(sg))
        return;

    update_group_capacity(sd, cpu);         // (2-6-1) update the capacity of the group
}
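(2-6-1) Update the capacity of the corresponding group. This function is also called from the load balance path.
- Group capacity updates are rate limited: the interval is clamped to [1, 25] ticks.
- When the sd is at the MC level, its sg contains only one CPU, so only the CPU capacity needs updating; at the DIE level further updating and calculation is required.
- MC-level sd: update rq->cpu_capacity_orig/cpu_capacity and sgc->capacity/min_capacity/max_capacity.
- DIE-level sd: starting from the child sd->groups pointer, walk the MC-level sg ring and visit each sgc->capacity (skipping isolated CPUs). During the walk the largest max_capacity and the smallest min_capacity become the DIE-level sd->sg->sgc->max_capacity/min_capacity, and the sum of sgc->capacity over all non-isolated CPUs becomes the DIE-level sd->sg->sgc->capacity.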
void update_group_capacity(struct sched_domain *sd, int cpu)
{
    struct sched_domain *child = sd->child;
    struct sched_group *group, *sdg = sd->groups;
    unsigned long capacity, min_capacity, max_capacity;
    unsigned long interval;

    interval = msecs_to_jiffies(sd->balance_interval);   // sgc updates are rate limited: 1 ~ HZ/10
    interval = clamp(interval, 1UL, max_load_balance_interval);
    sdg->sgc->next_update = jiffies + interval;

    if (!child) {                        // MC-level sd: its sg holds a single CPU, so only the CPU capacity is updated
        update_cpu_capacity(sd, cpu);    // (2-6-1-1) update the CPU capacity
        return;
    }

    capacity = 0;
    min_capacity = ULONG_MAX;
    max_capacity = 0;

    if (child->flags & SD_OVERLAP) {     // overlapping sd case; this platform has none
        /*
         * SD_OVERLAP domains cannot assume that child groups
         * span the current group.
         */

        for_each_cpu(cpu, sched_group_span(sdg)) {
            struct sched_group_capacity *sgc;
            struct rq *rq = cpu_rq(cpu);

            if (cpu_isolated(cpu))
                continue;
            /*
             * build_sched_domains() -> init_sched_groups_capacity()
             * gets here before we've attached the domains to the
             * runqueues.
             *
             * Use capacity_of(), which is set irrespective of domains
             * in update_cpu_capacity().
             *
             * This avoids capacity from being 0 and
             * causing divide-by-zero issues on boot.
             */
            if (unlikely(!rq->sd)) {
                capacity += capacity_of(cpu);
            } else {
                sgc = rq->sd->groups->sgc;
                capacity += sgc->capacity;
            }

            min_capacity = min(capacity, min_capacity);
            max_capacity = max(capacity, max_capacity);
        }
    } else {
        /*
         * !SD_OVERLAP domains can assume that child groups
         * span the current group.
         */
        // With no overlapping sd, the groups of the child sd taken together are exactly the current group.

        group = child->groups;
        do {                             // walk the sg ring of the child sd; on this platform we are at the DIE level, so these are the MC-level groups
            struct sched_group_capacity *sgc = group->sgc;               // the sgc of this group
            __maybe_unused cpumask_t *cpus = sched_group_span(group);    // the group is at the MC level, so its span is its single CPU

            if (!cpu_isolated(cpumask_first(cpus))) {    // skip isolated CPUs
                capacity += sgc->capacity;               // accumulate the capacity of each sgc (CPU)
                min_capacity = min(sgc->min_capacity,    // track the smallest min_capacity
                                   min_capacity);
                max_capacity = max(sgc->max_capacity,    // track the largest max_capacity
                                   max_capacity);
            }
            group = group->next;
        } while (group != child->groups);
    }

    sdg->sgc->capacity = capacity;           // the sum of the MC-level sgc->capacity values becomes the DIE-level group capacity
    sdg->sgc->min_capacity = min_capacity;   // also store the minimum and maximum capacity
    sdg->sgc->max_capacity = max_capacity;
}
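The non-overlapping (DIE-level) branch is a plain sum/min/max reduction over the child groups. The sketch below assumes the child groups are given as an array rather than a ring, and that isolation is a per-entry boolean; `demo_group_cap` and `demo_aggregate` are hypothetical names used only here.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_group_cap {
    unsigned long capacity;
    unsigned long min_capacity;
    unsigned long max_capacity;
};

/*
 * Combine the per-CPU (MC-level) capacities into one DIE-level group:
 * capacities are summed, min/max are tracked, isolated CPUs are skipped.
 */
struct demo_group_cap demo_aggregate(const struct demo_group_cap *child,
                                     const bool *isolated, size_t n)
{
    struct demo_group_cap sum = { 0, ULONG_MAX, 0 };

    for (size_t i = 0; i < n; i++) {
        if (isolated[i])
            continue;
        sum.capacity += child[i].capacity;
        if (child[i].min_capacity < sum.min_capacity)
            sum.min_capacity = child[i].min_capacity;
        if (child[i].max_capacity > sum.max_capacity)
            sum.max_capacity = child[i].max_capacity;
    }
    return sum;
}
```

For example, four CPUs with remaining capacities 900, 1024, 1000 and 950, where the third is isolated, aggregate to a group capacity of 2874 with min_capacity 900 and max_capacity 1024.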
(2-6-1-1) Update the CPU capacity
- Read the CPU's orig_capacity (i.e. cpu_scale) and max_freq_scale. The latter changes with the policy on every cpufreq frequency change; it is updated with the formula below.
max_freq_scale = policy_max_freq * 1024 / (original max freq). cpufreq sets policy_max_freq according to the policy; max_freq_scale is initialised to 1024 at boot and stored as a per_cpu variable.
- Combine cpu_scale and max_freq_scale, apply the thermal limit, and store the result as cpu_capacity_orig of the current CPU's rq, using this formula:
rq->cpu_capacity_orig = min(cpu_scale * max_freq_scale / 1024, maximum CPU capacity allowed by thermal)
- A dedicated formula then removes the util of irq, rt tasks and dl tasks to get the remaining CPU capacity, which is stored in rq->cpu_capacity and sgc->capacity/min_capacity/max_capacity.
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
    unsigned long capacity = arch_scale_cpu_capacity(cpu);   // read the per_cpu variable cpu_scale
    struct sched_group *sdg = sd->groups;

    capacity *= arch_scale_max_freq_capacity(sd, cpu);       // read the per_cpu variable max_freq_scale and factor it in
    capacity >>= SCHED_CAPACITY_SHIFT;                       // together: cpu_scale * max_freq_scale / 1024

    capacity = min(capacity, thermal_cap(cpu));              // the result may not exceed the CPU capacity allowed by thermal
    cpu_rq(cpu)->cpu_capacity_orig = capacity;               // store it as cpu_capacity_orig of this CPU's rq

    capacity = scale_rt_capacity(cpu, capacity);             // (2-6-1-1-1) compute the CPU capacity left for the cfs rq

    if (!capacity)                                           // if nothing is left for cfs, force the value to 1
        capacity = 1;

    cpu_rq(cpu)->cpu_capacity = capacity;                    // update cpu_capacity of the rq and the sgc capacity/min/max
    sdg->sgc->capacity = capacity;
    sdg->sgc->min_capacity = capacity;
    sdg->sgc->max_capacity = capacity;
}
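The arithmetic in update_cpu_capacity() up to cpu_capacity_orig can be tried out in user space. This is a sketch under assumptions: frequencies are in kHz, the denominator of max_freq_scale is taken to be the CPU's maximum frequency, and `demo_max_freq_scale` / `demo_capacity_orig` are made-up names rather than kernel functions.

```c
#define DEMO_CAPACITY_SHIFT 10   /* 1 << 10 == 1024, as SCHED_CAPACITY_SHIFT */

/* max_freq_scale = policy_max_freq * 1024 / cpu_max_freq (assumed form). */
unsigned long demo_max_freq_scale(unsigned long policy_max_khz,
                                  unsigned long cpu_max_khz)
{
    return (policy_max_khz << DEMO_CAPACITY_SHIFT) / cpu_max_khz;
}

/* cpu_capacity_orig = min(cpu_scale * max_freq_scale / 1024, thermal cap). */
unsigned long demo_capacity_orig(unsigned long cpu_scale,
                                 unsigned long max_freq_scale,
                                 unsigned long thermal_cap)
{
    unsigned long capacity =
        (cpu_scale * max_freq_scale) >> DEMO_CAPACITY_SHIFT;

    return capacity < thermal_cap ? capacity : thermal_cap;
}
```

With the DTS values quoted earlier, a CPU with capacity-dmips-mhz 801 and an unrestricted policy keeps a cpu_capacity_orig of 801, while capping a 1024-capacity CPU at half its maximum frequency gives `demo_capacity_orig(1024, 512, 1024)`, i.e. 512.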
(2-6-1-1-1) Compute the CPU capacity left for the cfs rq
- Read the irq util. If it reaches or exceeds the original CPU capacity, no CPU capacity is left.
- Read the util of rt tasks and of dl tasks and add them. If the sum reaches or exceeds the original CPU capacity, no CPU capacity is left either.
- If capacity remains after both checks, the remaining CPU capacity is computed as follows:
remaining cpu capacity = (max - avg_rt.util_avg - avg_dl.util_avg) * (max - avg_irq.util_avg) / max, where max = rq->cpu_capacity_orig (the result computed above)
static unsigned long scale_rt_capacity(int cpu, unsigned long max)
{
    struct rq *rq = cpu_rq(cpu);
    unsigned long used, free;
    unsigned long irq;

    irq = cpu_util_irq(rq);                      // avg_irq.util_avg of this CPU's rq

    if (unlikely(irq >= max))                    // irq util alone already uses up max
        return 1;

    used = READ_ONCE(rq->avg_rt.util_avg);       // util_avg of the rt rq
    used += READ_ONCE(rq->avg_dl.util_avg);      // plus util_avg of the dl rq

    if (unlikely(used >= max))                   // rt + dl util already uses up max
        return 1;

    free = max - used;                           // free = max capacity - rt util_avg - dl util_avg

    return scale_irq_capacity(free, irq, max);   // (max - rt util_avg - dl util_avg) * (max - irq) / max
}
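The same calculation can be written as a stand-alone function that takes the utilisation values as arguments instead of reading them from a run queue. `demo_scale_rt_capacity` is an illustrative name, and folding scale_irq_capacity() into the last line is a simplification made for this sketch.

```c
/*
 * Remaining capacity for CFS given the original capacity `max` and the
 * irq, rt and dl utilisation. Returns 1 when nothing is left, matching
 * the early returns of scale_rt_capacity().
 */
unsigned long demo_scale_rt_capacity(unsigned long max, unsigned long irq,
                                     unsigned long rt, unsigned long dl)
{
    unsigned long used, free;

    if (irq >= max)            /* irq alone consumes everything */
        return 1;

    used = rt + dl;
    if (used >= max)           /* rt + dl consume everything */
        return 1;

    free = max - used;
    return free * (max - irq) / max;   /* scale the rest by the non-irq share */
}
```

As a worked example, with max = 1024 and irq, rt and dl each at 256, free is 512 and the result is 512 * 768 / 1024 = 384.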
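(2-7) Attach the sd and rd to the CPU's rq
- The for loop walks from the current sd up through its parents and stops at the DIE-level sd, which has no parent. On this platform that means only the MC-level sd is examined.
- First decide whether the parent sd should be destroyed. Two checks are involved: whether the parent sd on its own already qualifies for destruction, and whether, compared with the child sd, keeping the parent sd is still necessary.
- Take the parent out first (unlink it from the chain by connecting parent->parent and the child), and if the parent sd has SD_PREFER_SIBLING, pass that flag down to the child sd. Neither of these two steps applies on this platform, so only the link between the parent sd and the child sd would be severed.
- Destroy the parent sd, see (2-7-2).
- Then apply the same destruction check, and destruction, to the child sd.
All of the above "prunes" the new sd hierarchy by removing levels that do not affect scheduling. After that the new sd is attached to the rd:
- (2-7-3) Attach the new root domain to cpu_rq; the old rd is freed.
- Update rq->sd to the new sd, and add the current CPU to the sd_sysctl_cpus cpu mask.
- (2-7-5) Destroy the previous tmp sd.
- Finally update the per_cpu sd variables.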
/*
 * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
 * hold the hotplug lock.
 */
static void
cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
{
    struct rq *rq = cpu_rq(cpu);
    struct sched_domain *tmp;

    /* Remove the sched domains which do not contribute to scheduling. */
    for (tmp = sd; tmp; ) {
        struct sched_domain *parent = tmp->parent;   // an sd without a parent (the DIE-level sd) ends the walk
        if (!parent)
            break;

        if (sd_parent_degenerate(tmp, parent)) {     // (2-7-1) should the parent sd be degenerated?
            tmp->parent = parent->parent;            // first unlink the parent sd from the chain
            if (parent->parent)                      // if parent->parent exists, point its child at tmp
                parent->parent->child = tmp;
            /*
             * Transfer SD_PREFER_SIBLING down in case of a
             * degenerate parent; the spans match for this
             * so the property transfers.
             */
            if (parent->flags & SD_PREFER_SIBLING)   // on this platform sd_parent_degenerate() returns 0 for the MC/DIE pair, so this branch is not reached
                tmp->flags |= SD_PREFER_SIBLING;
            destroy_sched_domain(parent);            // (2-7-2) destroy the parent sd
        } else                                       // no degenerate needed
            tmp = tmp->parent;                       // move tmp up to visit the next level
    }

    if (sd && sd_degenerate(sd)) {                   // does sd itself need to be degenerated?
        tmp = sd;
        sd = sd->parent;
        destroy_sched_domain(tmp);                   // destroy sd, as above
        if (sd)                                      // if the destroyed sd had a parent, clear that parent's ->child
            sd->child = NULL;
    }

    sched_domain_debug(sd, cpu);                     // print debug information about the attached sd

    rq_attach_root(rq, rd);                          // (2-7-3) attach the new root domain to the CPU's rq
    tmp = rq->sd;
    rcu_assign_pointer(rq->sd, sd);                  // publish the new sd as rq->sd
    dirty_sched_domain_sysctl(cpu);                  // (2-7-4) update the sd_sysctl_cpus cpu mask
    destroy_sched_domains(tmp);                      // (2-7-5) destroy the previous sd (tmp)

    update_top_cache_domain(cpu);                    // (2-7-6) update the per_cpu sd variables of this CPU
}
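(2-7-1) Decide whether the parent sd should be degenerated. For sd_parent_degenerate, return 1 means destroy it and return 0 means keep it.
- First check whether the parent sd can be destroyed on its own merits, using the criteria in (2-7-1-1). If sd_degenerate returns 1, the parent sd is to be destroyed; otherwise the checks continue.
- If the child sd and the parent sd have different spans, return 0; otherwise keep checking.
- If the parent sd has only one sg, clear the flags that need multiple groups. Then compare flags: if the parent still carries a flag that the child lacks, return 0; otherwise return 1.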
static int
sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
{
    unsigned long cflags = sd->flags, pflags = parent->flags;

    if (sd_degenerate(parent))           // (2-7-1-1) can the parent sd be degenerated on its own?
        return 1;

    if (!cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))    // do the MC and DIE sd have the same span? Not on this platform
        return 0;                        // so this is normally where the function returns 0

    /* Flags needing groups don't count if only 1 group in parent */
    if (parent->groups == parent->groups->next) {    // with a single sg in the parent, the flags below no longer matter
        pflags &= ~(SD_LOAD_BALANCE |
                    SD_BALANCE_NEWIDLE |
                    SD_BALANCE_FORK |
                    SD_BALANCE_EXEC |
                    SD_ASYM_CPUCAPACITY |
                    SD_SHARE_CPUCAPACITY |
                    SD_SHARE_PKG_RESOURCES |
                    SD_PREFER_SIBLING |
                    SD_SHARE_POWERDOMAIN);
        if (nr_node_ids == 1)
            pflags &= ~SD_SERIALIZE;
    }
    if (~cflags & pflags)                // does the (trimmed) parent still carry a flag the child lacks?
        return 0;                        // if so, keep the parent

    return 1;                            // otherwise the parent adds nothing and can go
}
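(2-7-1-1) Check whether the parent sd can be degenerated on its own
- If the DIE-level sd spans only 1 CPU, the sd is to be destroyed. One way to see it: once even the DIE level is down to a single CPU, the extra level no longer serves a purpose.
- If the sd carries certain flags and has at least 2 sgs, it must not be destroyed. On this platform the DIE level has 2 sgs, so return 0 is taken here (considering only the case where all cores are online).
- If the sd has SD_WAKE_AFFINE (meaning: on wakeup, place the task on a nearby CPU), return 0. Every sd on this platform has this flag.
- If none of the three conditions above applies, return 1.
return 0 means the sd does not need to be destroyed; return 1 means it does.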
static int sd_degenerate(struct sched_domain *sd)
{
    if (cpumask_weight(sched_domain_span(sd)) == 1)  // return 1 if the DIE-level sd holds a single CPU (this platform has 8)
        return 1;

    /* Following flags need at least 2 groups */
    if (sd->flags & (SD_LOAD_BALANCE |               // the DIE level has some of these flags and also 2 sgs, so it returns 0
                     SD_BALANCE_NEWIDLE |
                     SD_BALANCE_FORK |
                     SD_BALANCE_EXEC |
                     SD_SHARE_CPUCAPACITY |
                     SD_ASYM_CPUCAPACITY |
                     SD_SHARE_PKG_RESOURCES |
                     SD_SHARE_POWERDOMAIN)) {
        if (sd->groups != sd->groups->next)
            return 0;
    }

    /* Following flags don't use groups */
    if (sd->flags & (SD_WAKE_AFFINE))                // every sd has this flag, so this always returns 0
        return 0;

    return 1;
}
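The two degenerate checks can be modelled with a very small domain descriptor. In this sketch a domain is reduced to a span bitmask, a flag word and a group count; `DEMO_NEEDS_GROUPS` stands for "any of the flags that need at least 2 groups" and `DEMO_WAKE_AFFINE` for SD_WAKE_AFFINE. All names and bit values are assumptions for illustration.

```c
/* Hypothetical flag bits for this sketch; not the kernel's values. */
enum {
    DEMO_NEEDS_GROUPS = 1 << 0,   /* any flag that needs >= 2 groups */
    DEMO_WAKE_AFFINE  = 1 << 1,   /* stands for SD_WAKE_AFFINE */
};

struct demo_domain {
    unsigned span;      /* CPU bitmask */
    unsigned flags;
    int nr_groups;      /* length of the group ring */
};

static unsigned demo_weight(unsigned mask)
{
    unsigned n = 0;

    for (; mask; mask >>= 1)
        n += mask & 1u;
    return n;
}

/* 1: the domain contributes nothing and can be removed; 0: keep it. */
int demo_sd_degenerate(const struct demo_domain *sd)
{
    if (demo_weight(sd->span) == 1)
        return 1;
    if ((sd->flags & DEMO_NEEDS_GROUPS) && sd->nr_groups > 1)
        return 0;
    if (sd->flags & DEMO_WAKE_AFFINE)
        return 0;
    return 1;
}

/* 1: the parent is redundant with respect to the child; 0: keep it. */
int demo_sd_parent_degenerate(const struct demo_domain *child,
                              const struct demo_domain *parent)
{
    unsigned cflags = child->flags, pflags = parent->flags;

    if (demo_sd_degenerate(parent))
        return 1;
    if (child->span != parent->span)
        return 0;
    if (parent->nr_groups == 1)          /* group flags are moot with 1 group */
        pflags &= ~(unsigned)DEMO_NEEDS_GROUPS;
    if (~cflags & pflags)                /* parent has a flag the child lacks */
        return 0;
    return 1;
}
```

With all 8 CPUs online, MC (span 0x0F, 4 groups) and DIE (span 0xFF, 2 groups) differ in span, so the DIE level is kept. If only one cluster were present, the DIE domain would have the same span as MC and a single group, and the sketch reports it as redundant.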
(2-7-2) Destroy the 'parent' sd: this mainly releases the allocated memory in order: sd->groups, sd->shared, and the sd itself.
static void destroy_sched_domain(struct sched_domain *sd)
{
    /*
     * A normal sched domain may have multiple group references, an
     * overlapping domain, having private groups, only one. Iterate,
     * dropping group/capacity references, freeing where none remain.
     */
    free_sched_groups(sd->groups, 1);    // (2-7-2-1) free the sg structures of this sd

    if (sd->shared && atomic_dec_and_test(&sd->shared->ref))  // if sds exists and this was its last reference, free it
        kfree(sd->shared);
    kfree(sd);                           // free the sd structure itself
}
(2-7-2-1) Free the sg structures of the sd: the do-while loop walks the sg ring, releases sg->sgc as well, and finally releases the sg itself.
static void free_sched_groups(struct sched_group *sg, int free_sgc)
{
    struct sched_group *tmp, *first;

    if (!sg)
        return;

    first = sg;
    do {                                 // visit every element of the sg ring exactly once
        tmp = sg->next;

        if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))   // drop one sgc reference; free it if that was the last one
            kfree(sg->sgc);              // free the sgc structure

        if (atomic_dec_and_test(&sg->ref))                    // drop one sg reference; free it if that was the last one
            kfree(sg);                   // free the sg structure
        sg = tmp;
    } while (sg != first);
}
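The reference-counted walk over the ring can be reproduced with malloc/free. This is a sketch, not the kernel routine: `demo_group`, `demo_make_ring` and `demo_put_ring` are made-up names, plain ints replace atomics, and the ring length is counted up front so the loop never compares against a pointer that has already been freed.

```c
#include <stdlib.h>

struct demo_group {
    struct demo_group *next;
    int ref;                      /* plain int standing in for atomic_t */
};

/* Build a ring of n groups, each starting with `ref` references. */
struct demo_group *demo_make_ring(int n, int ref)
{
    struct demo_group *first = NULL, *last = NULL;

    for (int i = 0; i < n; i++) {
        struct demo_group *g = malloc(sizeof(*g));

        if (!g)
            abort();
        g->ref = ref;
        g->next = NULL;
        if (!first)
            first = g;
        if (last)
            last->next = g;
        last = g;
    }
    if (last)
        last->next = first;       /* close the ring */
    return first;
}

/* Drop one reference on every group; free those that reach zero.
 * Returns how many groups were freed. */
int demo_put_ring(struct demo_group *sg)
{
    struct demo_group *g, *tmp;
    int n = 0, freed = 0;

    if (!sg)
        return 0;

    g = sg;
    do {                          /* count the ring length first */
        n++;
        g = g->next;
    } while (g != sg);

    for (int i = 0; i < n; i++) {
        tmp = sg->next;           /* read next before sg may be freed */
        if (--sg->ref == 0) {
            free(sg);
            freed++;
        }
        sg = tmp;
    }
    return freed;
}
```

A ring whose groups are shared by two domains (ref 2) survives the first `demo_put_ring()` call and is only released by the second, which is the behaviour the "multiple group references" comment in destroy_sched_domain() describes.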
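(2-7-3) Attach the new root domain to the CPU's rq
- If the CPU's rq was previously attached to a root domain, keep that one as the old rd.
- If the old rd shows that the rq is still online, take the rq offline first.
- Remove the current rq's CPU from the old rd. If the old rd is still referenced elsewhere, old_rd is set to NULL so that it is not freed; if its reference count drops to 0, it is freed later. This step detaches the rq from the old rd.
- Assign the new rd to rq->rd and add the rq's CPU to the span of the rd. This completes the root domain update.
- If rq->cpu is active, bring the rq online.
- Finally free the old rd if required.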
void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
    struct root_domain *old_rd = NULL;
    unsigned long flags;

    raw_spin_lock_irqsave(&rq->lock, flags);

    if (rq->rd) {
        old_rd = rq->rd;                                 // remember the previous rd

        if (cpumask_test_cpu(rq->cpu, old_rd->online))   // if this CPU is still marked online in the old rd
            set_rq_offline(rq);                          // (2-7-3-1) take the rq offline first

        cpumask_clear_cpu(rq->cpu, old_rd->span);        // remove the rq's CPU from the old rd

        /*
         * If we dont want to free the old_rd yet then
         * set old_rd to NULL to skip the freeing later
         * in this function:
         */
        if (!atomic_dec_and_test(&old_rd->refcount))     // did the old rd's refcount drop to 0 (i.e. must it be freed)?
            old_rd = NULL;                               // if not, set it to NULL so the code below does not free it
    }

    atomic_inc(&rd->refcount);                           // take a reference on the new rd
    rq->rd = rd;                                         // rq->rd now points at the new rd

    cpumask_set_cpu(rq->cpu, rd->span);                  // add rq->cpu to the span of the new rd
    if (cpumask_test_cpu(rq->cpu, cpu_active_mask))      // if rq->cpu is active
        set_rq_online(rq);                               // (2-7-3-2) set the rq online

    raw_spin_unlock_irqrestore(&rq->lock, flags);

    if (old_rd)                                          // free the old rd unless it was set to NULL above
        call_rcu(&old_rd->rcu, free_rootdomain);         // (2-7-3-3) free the old rd
}
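The reference counting around the old and new root domain is easy to get backwards, so here is a minimal model of it. The structs are reduced to a refcount and a span bitmask, locking and the online/offline callbacks are left out, and `demo_attach_root` returns the old rd only when the caller would have to free it. All names are hypothetical.

```c
#include <stddef.h>

struct demo_rd {
    int refcount;       /* plain int standing in for atomic_t */
    unsigned span;      /* CPUs attached to this root domain */
};

struct demo_rq {
    int cpu;
    struct demo_rd *rd;
};

/*
 * Move rq from its current root domain to `rd`.
 * Returns the old root domain if its last reference was dropped
 * (the caller would free it), or NULL if nothing needs freeing.
 */
struct demo_rd *demo_attach_root(struct demo_rq *rq, struct demo_rd *rd)
{
    struct demo_rd *old_rd = NULL;

    if (rq->rd) {
        old_rd = rq->rd;
        old_rd->span &= ~(1u << rq->cpu);   /* leave the old span */
        if (--old_rd->refcount != 0)        /* still used by other rqs */
            old_rd = NULL;                  /* so do not free it */
    }

    rd->refcount++;
    rq->rd = rd;
    rd->span |= 1u << rq->cpu;              /* join the new span */
    return old_rd;
}
```

If two run queues share an old rd, moving the first one returns NULL because the old rd is still in use; moving the second one returns the old rd, signalling that it can now be released.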
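(2-7-3-1) Take the rq offline: call the rq_offline hook of every scheduling class on the rq in turn, remove the CPU from the rd's online mask, and then set rq->online to 0.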
void set_rq_offline(struct rq *rq)
{
    if (rq->online) {                        // only if the rq is currently online
        const struct sched_class *class;

        for_each_class(class) {              // walk all scheduling classes
            if (class->rq_offline)           // does this class provide rq_offline?
                class->rq_offline(rq);       // (2-7-3-1-1) call the class's rq_offline; CFS is used as the example here
        }

        cpumask_clear_cpu(rq->cpu, rq->rd->online);  // remove this rq's CPU from the rd's online mask
        rq->online = 0;                      // mark the rq offline
    }
}
(2-7-3-1-1) Call the class's rq_offline; CFS is used as the example here
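static void rq_offline_fair(struct rq *rq)
{
    update_sysctl();                     // refresh the sysctl parameters

    /* Ensure any throttled groups are reachable by pick_next_task */
    unthrottle_offline_cfs_rqs(rq);      // (2-7-3-1-1-1) lift the bandwidth limit on every cfs_rq of this rq
}
(2-7-3-1-1-1) Lift the bandwidth limit on every cfs_rq of the rq (this really belongs to CFS bandwidth control and is not analysed in depth here; I have read that code before but did not take notes, and will write it up when time allows)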
/* cpu offline callback */
static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
{
    struct task_group *tg;

    lockdep_assert_held(&rq->lock);

    rcu_read_lock();
    list_for_each_entry_rcu(tg, &task_groups, list) {        // walk all task groups
        struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];      // the cfs_rq of this tg on this CPU

        if (!cfs_rq->runtime_enabled)                        // skip cfs_rqs whose bandwidth control is already off
            continue;

        /*
         * clock_task is not advancing so we just need to make sure
         * there's some valid quota amount
         */
        cfs_rq->runtime_remaining = 1;                       // make sure there is a valid quota value
        /*
         * Offline rq is schedulable till CPU is completely disabled
         * in take_cpu_down(), so we prevent new cfs throttling here.
         */
        cfs_rq->runtime_enabled = 0;                         // turn off bandwidth control for this cfs_rq

        if (cfs_rq_throttled(cfs_rq))                        // is the cfs_rq currently throttled?
            unthrottle_cfs_rq(cfs_rq);                       // if so, unthrottle it
    }
    rcu_read_unlock();
}
(2-7-3-2) Set the rq online, which is simply the reverse of setting it offline
void set_rq_online(struct rq *rq)
{
    if (!rq->online) {
        const struct sched_class *class;

        cpumask_set_cpu(rq->cpu, rq->rd->online);    // add rq->cpu to rq->rd->online, i.e. the rd gains an online CPU
        rq->online = 1;                              // mark the rq online

        for_each_class(class) {                      // walk all scheduling classes
            if (class->rq_online)                    // does this class provide rq_online?
                class->rq_online(rq);                // (2-7-3-2-1) call the class's rq_online; CFS is used as the example here
        }
    }
}
(2-7-3-2-1) Call the class's rq_online; CFS is used as the example here
static void rq_online_fair(struct rq *rq)
{
    update_sysctl();                 // refresh the sysctl parameters

    update_runtime_enabled(rq);      // (2-7-3-2-1-1) update the CFS bandwidth control switch and settings
}
(2-7-3-2-1-1) Update the CFS bandwidth control switch and settings
/*
 * Both these CPU hotplug callbacks race against unregister_fair_sched_group()
 *
 * The race is harmless, since modifying bandwidth settings of unhooked group
 * bits doesn't do much.
 */

/* cpu online calback */
static void __maybe_unused update_runtime_enabled(struct rq *rq)
{
    struct task_group *tg;

    lockdep_assert_held(&rq->lock);

    rcu_read_lock();
    list_for_each_entry_rcu(tg, &task_groups, list) {        // walk all task groups
        struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;    // the CFS bandwidth structure of this tg
        struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];      // the cfs_rq of this tg on this CPU

        raw_spin_lock(&cfs_b->lock);
        cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;   // bandwidth control is on only if the quota is not unlimited
        raw_spin_unlock(&cfs_b->lock);
    }
    rcu_read_unlock();
}
(2-7-3-3) Free the old rd: free the members of the rd structure one by one, then free the rd itself
static void free_rootdomain(struct rcu_head *rcu)
{
    struct root_domain *rd = container_of(rcu, struct root_domain, rcu);    // recover the rd to free from its rcu head

    cpupri_cleanup(&rd->cpupri);         // free the members of the rd structure
    cpudl_cleanup(&rd->cpudl);
    free_cpumask_var(rd->dlo_mask);
    free_cpumask_var(rd->rto_mask);
    free_cpumask_var(rd->online);
    free_cpumask_var(rd->span);
    free_pd(rd->pd);
    kfree(rd);                           // finally free the rd itself
}
(2-7-4) Update the sd_sysctl_cpus cpu mask by adding the CPU to it (what this cpu mask is used for is not yet clear to me)
void dirty_sched_domain_sysctl(int cpu)
{
    if (cpumask_available(sd_sysctl_cpus))
        __cpumask_set_cpu(cpu, sd_sysctl_cpus);
}
(2-7-5) Destroy the tmp sd: walk from the sd up through its parents and destroy each level in turn
static void destroy_sched_domains(struct sched_domain *sd)
{
    if (sd)
        call_rcu(&sd->rcu, destroy_sched_domains_rcu);   // if the sd exists, destroy it via an RCU callback
}
static void destroy_sched_domains_rcu(struct rcu_head *rcu)
{
    struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);

    while (sd) {
        struct sched_domain *parent = sd->parent;    // walk the sd chain from MC up to DIE
        destroy_sched_domain(sd);                    // and destroy each sd, see (2-7-2)
        sd = parent;
    }
}
(2-7-6) Update the per_cpu sd variables of the CPU
static void update_top_cache_domain(int cpu)
{
    struct sched_domain_shared *sds = NULL;
    struct sched_domain *sd;
    int id = cpu;
    int size = 1;

    sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);   // the highest domain of this CPU that has this flag, i.e. the MC level
    if (sd) {
        id = cpumask_first(sched_domain_span(sd));           // first CPU of that sd
        size = cpumask_weight(sched_domain_span(sd));        // number of CPUs in that sd
        sds = sd->shared;                                    // the sds of that sd
    }

    rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);            // sd_llc = sd
    per_cpu(sd_llc_size, cpu) = size;                        // sd_llc_size = number of CPUs in sd
    per_cpu(sd_llc_id, cpu) = id;                            // sd_llc_id = first CPU in sd
    rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);    // sd_llc_shared = sd->shared

    sd = lowest_flag_domain(cpu, SD_NUMA);                   // this platform has no NUMA
    rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);           // no domain has the flag, so the lookup should yield NULL; sd_numa carries no meaning here

    sd = highest_flag_domain(cpu, SD_ASYM_PACKING);          // this platform has no SMT
    rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);   // likewise this should be NULL and carries no meaning here

    sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);       // the lowest domain of this CPU that has this flag, i.e. the DIE level
    rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
}
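How the two lookup helpers pick a level can be sketched with the domain chain flattened into an array ordered from the lowest level (MC) to the highest (DIE). This assumes the helpers behave as in mainline kernels of this era: highest_flag_domain stops at the first level lacking the flag, lowest_flag_domain returns the first level that has it. The function names and flag bits below are illustrative, and an index of -1 plays the role of a NULL sd.

```c
/* Hypothetical flag bits for this sketch; not the kernel's values. */
enum {
    DEMO_SHARE_PKG_RESOURCES = 1 << 0,
    DEMO_ASYM_CPUCAPACITY    = 1 << 1,
    DEMO_NUMA                = 1 << 2,
    DEMO_ASYM_PACKING        = 1 << 3,
};

/*
 * levels[0] is the lowest domain, levels[n-1] the highest.
 * Returns the highest level reachable from the bottom through levels that
 * all carry `flag`, or -1 if the lowest level already lacks it.
 */
int demo_highest_flag_level(const unsigned *levels, int n, unsigned flag)
{
    int found = -1;

    for (int i = 0; i < n; i++) {
        if (!(levels[i] & flag))
            break;
        found = i;
    }
    return found;
}

/* Returns the first (lowest) level carrying `flag`, or -1 if none does. */
int demo_lowest_flag_level(const unsigned *levels, int n, unsigned flag)
{
    for (int i = 0; i < n; i++) {
        if (levels[i] & flag)
            return i;
    }
    return -1;
}
```

With `levels = { DEMO_SHARE_PKG_RESOURCES, DEMO_ASYM_CPUCAPACITY }` modelling MC and DIE on this platform, the LLC lookup lands on level 0 (MC), the asymmetric-capacity lookup on level 1 (DIE), and both the NUMA and ASYM_PACKING lookups come back as -1.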
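(2-8) Depending on the outcome of building the topology and allocating the root domain at the top of the function, release the corresponding memory
static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
                                 const struct cpumask *cpu_map)
{
    switch (what) {
    case sa_rootdomain:                      // the branch taken when everything went well
        if (!atomic_read(&d->rd->refcount))  // the rd's refcount decides whether the root domain must be kept
            free_rootdomain(&d->rd->rcu);    // if it is not needed, free it
        /* Fall through */
    case sa_sd:                              // taken when allocating the root domain failed
        free_percpu(d->sd);                  // free d->sd
        /* Fall through */
    case sa_sd_storage:                      // taken when building the topology or allocating d->sd failed
        __sdt_free(cpu_map);                 // free the topology data of every CPU in cpu_map, including all per_cpu sdd->* entries
        /* Fall through */
    case sa_none:
        break;
    }
}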
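The staged cleanup relies on switch fall-through: entering at a later stage releases that stage and everything allocated before it. The sketch below records what would be freed as a bitmask instead of freeing anything; the enum values and `demo_free_allocs` are illustrative stand-ins for the kernel's s_alloc stages.

```c
enum demo_alloc {
    DEMO_SA_ROOTDOMAIN,
    DEMO_SA_SD,
    DEMO_SA_SD_STORAGE,
    DEMO_SA_NONE,
};

enum {
    DEMO_FREED_RD      = 1 << 0,
    DEMO_FREED_SD      = 1 << 1,
    DEMO_FREED_STORAGE = 1 << 2,
};

/* Return a bitmask of the resources that would be released for `what`. */
unsigned demo_free_allocs(enum demo_alloc what, int rd_refcount)
{
    unsigned freed = 0;

    switch (what) {
    case DEMO_SA_ROOTDOMAIN:
        if (rd_refcount == 0)         /* nobody attached to the rd */
            freed |= DEMO_FREED_RD;
        /* fall through */
    case DEMO_SA_SD:
        freed |= DEMO_FREED_SD;
        /* fall through */
    case DEMO_SA_SD_STORAGE:
        freed |= DEMO_FREED_STORAGE;
        /* fall through */
    case DEMO_SA_NONE:
        break;
    }
    return freed;
}
```

On the success path (`DEMO_SA_ROOTDOMAIN` with a non-zero refcount) the root domain is kept while the temporary per-CPU storage is still released, which is why claim_allocations() has to NULL the pointers that are in use beforehand.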
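register_sched_domain_sysctl();    // (3) register the /proc/sys/kernel/sched_domain directory and fill in its sysctl control parameters
I am not going to analyse this part for now; it consists only of sysctl interfaces. If you are interested, see this blogger's post: https://blog.csdn.net/wukongmingjing/article/details/100043644
Reference: https://blog.csdn.net/wukongmingjing/article/details/82426568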