CPU拓撲結構和調度域/組


如今CPU的核數從單核、雙核,一路發展到4核、8核甚至10核。而Android常見的多核架構基本都是大小核設計,最新的平台在大小核之外還會再加一個超大核。

區分大小核,是因為它們的性能(算力)和功耗不同,而且它們以cluster來劃分(小核在一個cluster,大核在另一個cluster);同時,目前同一個cluster內的cpu freq是同步調節的。

所以,在對CPU的任務調度中,需要對其同樣進行區分,來確保性能和功耗的平衡。

因此,內核會根據CPU的拓撲結構,建立不同層級的調度域、調度組來體現。以某8核cpu為例,層級如下(列表后附一段簡單的sysfs驗證示意):

  1. 在DIE level,cpu 0-7
  2. 在MC level,cpu 0-3在一組,而cpu4-7在另一組
  3. *SMT超線程技術,會在MC level以下,再進行一次區分:01、23、45、67(這里可以暫不考慮,因為當前ARM平台並未支持SMT)
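
針對上面的分層,這里附一段簡單的用戶態驗證程序(僅為示意:假設是8核,且sysfs節點路徑與后文adb輸出一致),讀取每個cpu的physical_package_id和core_siblings_list,就能直接看出MC/DIE兩級分組:

/* 示意程序:讀取sysfs topology節點,打印每個cpu所屬cluster及其兄弟cpu */
#include <stdio.h>

int main(void)
{
    char path[128], buf[64];
    FILE *fp;

    for (int cpu = 0; cpu < 8; cpu++) {        /* 假設8核,按需調整 */
        int pkg = -1;

        snprintf(path, sizeof(path),
            "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
        fp = fopen(path, "r");
        if (!fp)
            continue;                           /* cpu可能offline或不存在 */
        if (fscanf(fp, "%d", &pkg) != 1)
            pkg = -1;
        fclose(fp);

        snprintf(path, sizeof(path),
            "/sys/devices/system/cpu/cpu%d/topology/core_siblings_list", cpu);
        fp = fopen(path, "r");
        if (!fp)
            continue;
        if (!fgets(buf, sizeof(buf), fp))
            buf[0] = '\0';
        fclose(fp);

        printf("cpu%d: cluster=%d, core_siblings=%s", cpu, pkg, buf);
    }
    return 0;
}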

 

CPU Topology建立

在kernel中,由CPU Topology的相關代碼來建立這樣的結構,而結構本身定義在dts文件中,不同平台會有差異。我當前這個mtk平台的DTS相關信息如下(這里沒有用qcom平台,是因為手頭暫時只有mtk平台,細節上可能略有差別):

        cpu0: cpu@000 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x000>;
            enable-method = "psci";
            clock-frequency = <2301000000>;
            operating-points-v2 = <&cluster0_opp>;
            dynamic-power-coefficient = <275>;
            capacity-dmips-mhz = <1024>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu1: cpu@001 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x001>;
            enable-method = "psci";
            clock-frequency = <2301000000>;
            operating-points-v2 = <&cluster0_opp>;
            dynamic-power-coefficient = <275>;
            capacity-dmips-mhz = <1024>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu2: cpu@002 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x002>;
            enable-method = "psci";
            clock-frequency = <2301000000>;
            operating-points-v2 = <&cluster0_opp>;
            dynamic-power-coefficient = <275>;
            capacity-dmips-mhz = <1024>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu3: cpu@003 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x003>;
            enable-method = "psci";
            clock-frequency = <2301000000>;
            operating-points-v2 = <&cluster0_opp>;
            dynamic-power-coefficient = <275>;
            capacity-dmips-mhz = <1024>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu4: cpu@100 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x100>;
            enable-method = "psci";
            clock-frequency = <1800000000>;
            operating-points-v2 = <&cluster1_opp>;
            dynamic-power-coefficient = <85>;
            capacity-dmips-mhz = <801>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu5: cpu@101 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x101>;
            enable-method = "psci";
            clock-frequency = <1800000000>;
            operating-points-v2 = <&cluster1_opp>;
            dynamic-power-coefficient = <85>;
            capacity-dmips-mhz = <801>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu6: cpu@102 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x102>;
            enable-method = "psci";
            clock-frequency = <1800000000>;
            operating-points-v2 = <&cluster1_opp>;
            dynamic-power-coefficient = <85>;
            capacity-dmips-mhz = <801>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu7: cpu@103 {
            device_type = "cpu";
            compatible = "arm,cortex-a53";
            reg = <0x103>;
            enable-method = "psci";
            clock-frequency = <1800000000>;
            operating-points-v2 = <&cluster1_opp>;
            dynamic-power-coefficient = <85>;
            capacity-dmips-mhz = <801>;
            cpu-idle-states = <&STANDBY &MCDI_CPU &MCDI_CLUSTER>,
                <&SODI &SODI3 &DPIDLE &SUSPEND>;
        };

        cpu-map {
            cluster0 {
                core0 {
                    cpu = <&cpu0>;
                };
                core1 {
                    cpu = <&cpu1>;
                };
                core2 {
                    cpu = <&cpu2>;
                };
                core3 {
                    cpu = <&cpu3>;
                };
            };

            cluster1 {
                core0 {
                    cpu = <&cpu4>;
                };
                core1 {
                    cpu = <&cpu5>;
                };
                core2 {
                    cpu = <&cpu6>;
                };
                core3 {
                    cpu = <&cpu7>;
                };
            };
        };

代碼路徑:drivers/base/arch_topology.c、arch/arm64/kernel/topology.c,本文代碼以CAF Kernel msm-5.4為例。

第一部分,這里解析DTS,並保存cpu_topology的package_id、core_id,以及cpu_scale(cpu_capacity_orig);調用路徑如下,路徑之后附上cpu_topology結構體定義作為參考

kernel_init()
    -> kernel_init_freeable()
        -> smp_prepare_cpus()
            -> init_cpu_topology()
                -> parse_dt_topology()
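
作為參考,cpu_topology結構體及per-cpu數組在5.4內核中的定義大致如下(include/linux/arch_topology.h,不同內核版本字段可能略有差異),下面解析出來的package_id/core_id/sibling mask都保存在這里:

struct cpu_topology {
    int thread_id;
    int core_id;
    int package_id;
    int llc_id;
    cpumask_t thread_sibling;
    cpumask_t core_sibling;
    cpumask_t llc_sibling;
};

extern struct cpu_topology cpu_topology[NR_CPUS];

#define topology_physical_package_id(cpu)    (cpu_topology[cpu].package_id)
#define topology_core_id(cpu)                (cpu_topology[cpu].core_id)
#define topology_core_cpumask(cpu)           (&cpu_topology[cpu].core_sibling)
#define topology_sibling_cpumask(cpu)        (&cpu_topology[cpu].thread_sibling)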

 

針對dts,依次解析"cpus"節點,以及其中的"cpu-map"節點;

  1. 先解析其中cluster節點的內容結構
  2. 再對cpu capacity進行歸一化
static int __init parse_dt_topology(void)
{
    struct device_node *cn, *map;
    int ret = 0;
    int cpu;

    cn = of_find_node_by_path("/cpus");    //查找dts中 /cpus的節點
    if (!cn) {
        pr_err("No CPU information found in DT\n");
        return 0;
    }

    /*
     * When topology is provided cpu-map is essentially a root
     * cluster with restricted subnodes.
     */
    map = of_get_child_by_name(cn, "cpu-map");    //查找/cpus節點下,cpu-map節點
    if (!map)
        goto out;

    ret = parse_cluster(map, 0);    //(1)解析cluster結構
    if (ret != 0)
        goto out_map;

    topology_normalize_cpu_scale();    //(2)將cpu capacity歸一化

    /*
     * Check that all cores are in the topology; the SMP code will
     * only mark cores described in the DT as possible.
     */
    for_each_possible_cpu(cpu)
        if (cpu_topology[cpu].package_id == -1)
            ret = -EINVAL;

out_map:
    of_node_put(map);
out:
    of_node_put(cn);
    return ret;
}

 

(1)解析cluster結構

  1. 通過第一個do-while循環,進行"cluster+序號"節點的解析:當前平台分別解析cluster0、1。然后遞歸調用自身(代碼復用),進一步解析其中的“core”結構
  2. 在進一步解析core結構時,同樣通過第二個do-while循環,進行"core+序號"節點的解析:當前平台支持core0,1...7,共8個核,通過parse_core函數進一步解析
  3. 所以實際解析執行順序是:cluster0的core0,1,2,3;再cluster1的core0,1,2,3(對應cpu4-7)
  4. 每個cluster中的所有core都解析完、跳出其do-while循環時,package_id就會遞增,說明package_id就對應了cluster的id(代碼后附一段簡化的編號模擬)
static int __init parse_cluster(struct device_node *cluster, int depth)
{
    char name[20];
    bool leaf = true;
    bool has_cores = false;
    struct device_node *c;
    static int package_id __initdata;
    int core_id = 0;
    int i, ret;

    /*
     * First check for child clusters; we currently ignore any
     * information about the nesting of clusters and present the
     * scheduler with a flat list of them.
     */
    i = 0;
    do {
        snprintf(name, sizeof(name), "cluster%d", i);    //依次解析cluster0,1... 當前平台只有cluster0/1
        c = of_get_child_by_name(cluster, name);  //檢查cpu-map下,是否有cluster結構
        if (c) {
            leaf = false;
            ret = parse_cluster(c, depth + 1);     //如果有cluster結構,會繼續解析更深層次的core結構。(這里通過代碼復用,接着解析core結構)
            of_node_put(c);
            if (ret != 0)
                return ret;
        }
        i++;
    } while (c);

    /* Now check for cores */
    i = 0;
    do {
        snprintf(name, sizeof(name), "core%d", i);    //依次解析core0,1... 當前平台有8個core
        c = of_get_child_by_name(cluster, name);    //檢查cluster下,是否有core結構
        if (c) {
            has_cores = true;

            if (depth == 0) {                                            //這里要注意,是因為上面depth+1的調用才會走下去
                pr_err("%pOF: cpu-map children should be clusters\n",    //如果cpu-map下沒有cluster結構的(depth==0),就會報錯
                       c);
                of_node_put(c);
                return -EINVAL;
            }

            if (leaf) {                                            //在depth+1的情況下,leaf == true,說明是core level了
                ret = parse_core(c, package_id, core_id++);     //(1-1)解析core結構
            } else {
                pr_err("%pOF: Non-leaf cluster with core %s\n",
                       cluster, name);
                ret = -EINVAL;
            }

            of_node_put(c);
            if (ret != 0)
                return ret;
        }
        i++;
    } while (c);

    if (leaf && !has_cores)
        pr_warn("%pOF: empty cluster\n", cluster);

    if (leaf)            //在core level遍歷完成:說明1個cluster解析完成,要解析下一個cluster了,package id要遞增了
        package_id++;    //所以package id就對應了cluster id

    return 0;
}
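
按照上面的遍歷順序,可以用一段用戶態小程序把編號邏輯模擬出來(僅為示意,省略了device tree解析,直接假設cluster0/1各4個core),輸出的package_id/core_id和后文adb看到的結果一致:

/* 模擬parse_cluster/parse_core的編號邏輯:每個cluster內core_id從0重新計數 */
#include <stdio.h>

int main(void)
{
    int package_id = 0;                 /* 對應parse_cluster中的static package_id */
    int cpu = 0;
    int cores_per_cluster[] = { 4, 4 }; /* 假設cpu-map: cluster0/1各4個core */

    for (int cl = 0; cl < 2; cl++) {
        for (int core_id = 0; core_id < cores_per_cluster[cl]; core_id++) {
            /* 對應parse_core():cpu_topology[cpu].package_id/core_id */
            printf("cpu%d: package_id=%d core_id=%d\n", cpu, package_id, core_id);
            cpu++;
        }
        package_id++;                   /* 一個leaf cluster解析完,package_id遞增 */
    }
    return 0;
}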

 

(1-1)解析core結構

  1. 因為當前平台不支持超線程,所以core+序號節點下面,沒有thread+序號的節點了
  2. 解析cpu節點中的所有信息
  3. 更新cpu_topology[cpu].package_id、core_id,分別對應了哪個cluster的哪個core
static int __init parse_core(struct device_node *core, int package_id,
                 int core_id)
{
    char name[20];
    bool leaf = true;
    int i = 0;
    int cpu;
    struct device_node *t;

    do {
        snprintf(name, sizeof(name), "thread%d", i);    //不支持SMT,所以dts沒有在core下面配置超線程
        t = of_get_child_by_name(core, name);
        if (t) {
            leaf = false;
            cpu = get_cpu_for_node(t);
            if (cpu >= 0) {
                cpu_topology[cpu].package_id = package_id;
                cpu_topology[cpu].core_id = core_id;
                cpu_topology[cpu].thread_id = i;
            } else {
                pr_err("%pOF: Can't get CPU for thread\n",
                       t);
                of_node_put(t);
                return -EINVAL;
            }
            of_node_put(t);
        }
        i++;
    } while (t);

    cpu = get_cpu_for_node(core);    //(1-1-1)從core中解析cpu節點
    if (cpu >= 0) {
        if (!leaf) {
            pr_err("%pOF: Core has both threads and CPU\n",
                   core);
            return -EINVAL;
        }

        cpu_topology[cpu].package_id = package_id;    //保存package id(cluster id)到cpu_topology結構體的數組
        cpu_topology[cpu].core_id = core_id;        //保存core id到cpu_topology結構體的數組; core id對應cpu號:0,1...7
    } else if (leaf) {
        pr_err("%pOF: Can't get CPU for leaf core\n", core);
        return -EINVAL;
    }

    return 0;
}

 

(1-1-1)從core中解析cpu節點

  1. 從core節點中查找cpu節點,並對應好cpu id
  2. 再解析cpu core的capacity
static int __init get_cpu_for_node(struct device_node *node)
{
    struct device_node *cpu_node;
    int cpu;

    cpu_node = of_parse_phandle(node, "cpu", 0);    //獲取core節點中cpu節點信息
    if (!cpu_node)
        return -1;

    cpu = of_cpu_node_to_id(cpu_node);    //獲取cpu節點對應的cpu core id:cpu-0,1...
    if (cpu >= 0)
        topology_parse_cpu_capacity(cpu_node, cpu);    //(1-1-1-1)解析每個cpu core的capacity
    else
        pr_crit("Unable to find CPU node for %pOF\n", cpu_node);

    of_node_put(cpu_node);
    return cpu;
}

 

(1-1-1-1)解析每個cpu core的capacity

  1. 先解析capacity-dmips-mhz值作為cpu raw_capacity,這個參數就是對應了cpu的算力,數字越大,算力越強(可以對照上面mtk平台dts,明顯是大小核架構;但不同的是,它cpu0-3都是大核,cpu4-7是小核,這個與一般的配置不太一樣,一般qcom平台是反過來,cpu0-3是小核,4-7是大核)
  2. 當前raw_capacity:cpu0-3為1024,cpu4-7為801
bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
{
    static bool cap_parsing_failed;
    int ret;
    u32 cpu_capacity;

    if (cap_parsing_failed)
        return false;

    ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",    //解析cpu core算力,kernel4.19后配置該參數
                   &cpu_capacity);
    if (!ret) {
        if (!raw_capacity) {
            raw_capacity = kcalloc(num_possible_cpus(),        //為所有cpu raw_capacity變量都申請空間
                           sizeof(*raw_capacity),
                           GFP_KERNEL);
            if (!raw_capacity) {
                cap_parsing_failed = true;
                return false;
            }
        }
        capacity_scale = max(cpu_capacity, capacity_scale);    //記錄最大cpu capacity值作為scale
        raw_capacity[cpu] = cpu_capacity;                    //raw capacity就是dts中dmips值
        pr_debug("cpu_capacity: %pOF cpu_capacity=%u (raw)\n",
            cpu_node, raw_capacity[cpu]);
    } else {
        if (raw_capacity) {
            pr_err("cpu_capacity: missing %pOF raw capacity\n",
                cpu_node);
            pr_err("cpu_capacity: partial information: fallback to 1024 for all CPUs\n");
        }
        cap_parsing_failed = true;
        free_raw_capacity();
    }

    return !ret;
}

 

(2)將cpu raw_capacity進行歸一化

  1. 遍歷每個cpu core進行歸一化:將最大值映射為1024,其余值按原比例縮放
  2. 歸一化公式:cpu_scale = raw_capacity * 1024 / capacity_scale,其中capacity_scale是所有raw_capacity的最大值(當前平台就是1024);代碼后附一段數值示意
  3. 歸一化結果保存到per_cpu變量cpu_scale中,內核調度中經常使用的cpu_capacity_orig、cpu_capacity參數的計算都依賴它
void topology_normalize_cpu_scale(void)
{
    u64 capacity;
    int cpu;

    if (!raw_capacity)
        return;

    pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
    for_each_possible_cpu(cpu) {
        pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
             cpu, raw_capacity[cpu]);
        capacity = (raw_capacity[cpu] << SCHED_CAPACITY_SHIFT)        //就是按照max cpu capacity的100% = 1024的方式歸一化capacity
            / capacity_scale;
        topology_set_cpu_scale(cpu, capacity);                    //更新per_cpu變量cpu_scale(cpu_capacity_orig)為各自的cpu raw capacity
        pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
            cpu, topology_get_cpu_scale(cpu));
    }
}
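
以當前平台的dmips值代入,可以驗證歸一化前后數值不變(因為capacity_scale恰好就是1024)。下面是一段示意計算;如果內核導出了/sys/devices/system/cpu/cpuX/cpu_capacity節點(arch_topology.c中的只讀屬性),也可以直接cat出來對照:

/* 示意:topology_normalize_cpu_scale()中的歸一化計算 */
#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10     /* 1 << 10 = 1024 */

int main(void)
{
    unsigned int raw_capacity[8] = { 1024, 1024, 1024, 1024, 801, 801, 801, 801 };
    unsigned int capacity_scale = 0;

    for (int cpu = 0; cpu < 8; cpu++)
        if (raw_capacity[cpu] > capacity_scale)
            capacity_scale = raw_capacity[cpu];     /* 取最大dmips作為scale */

    for (int cpu = 0; cpu < 8; cpu++) {
        unsigned long capacity =
            ((unsigned long)raw_capacity[cpu] << SCHED_CAPACITY_SHIFT) / capacity_scale;
        printf("cpu%d: cpu_scale=%lu\n", cpu, capacity);    /* 輸出1024或801 */
    }
    return 0;
}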

第二部分,更新sibling_mask

cpu0的調用路徑如下:

kernel_init
    -> kernel_init_freeable
        -> smp_prepare_cpus
            -> store_cpu_topology

cpu1-7的調用路徑如下:

secondary_start_kernel
    -> store_cpu_topology

 

void store_cpu_topology(unsigned int cpuid)
{
    struct cpu_topology *cpuid_topo = &cpu_topology[cpuid];
    u64 mpidr;

    if (cpuid_topo->package_id != -1)  //這里因為已經解析過package_id了,所以直接就不會走讀協處理器寄存器等相關步驟了
        goto topology_populated;

    mpidr = read_cpuid_mpidr();

    /* Uniprocessor systems can rely on default topology values */
    if (mpidr & MPIDR_UP_BITMASK)
        return;

    /*
     * This would be the place to create cpu topology based on MPIDR.
     *
     * However, it cannot be trusted to depict the actual topology; some
     * pieces of the architecture enforce an artificial cap on Aff0 values
     * (e.g. GICv3's ICC_SGI1R_EL1 limits it to 15), leading to an
     * artificial cycling of Aff1, Aff2 and Aff3 values. IOW, these end up
     * having absolutely no relationship to the actual underlying system
     * topology, and cannot be reasonably used as core / package ID.
     *
     * If the MT bit is set, Aff0 *could* be used to define a thread ID, but
     * we still wouldn't be able to obtain a sane core ID. This means we
     * need to entirely ignore MPIDR for any topology deduction.
     */
    cpuid_topo->thread_id  = -1;
    cpuid_topo->core_id    = cpuid;
    cpuid_topo->package_id = cpu_to_node(cpuid);

    pr_debug("CPU%u: cluster %d core %d thread %d mpidr %#016llx\n",
         cpuid, cpuid_topo->package_id, cpuid_topo->core_id,
         cpuid_topo->thread_id, mpidr);

topology_populated:
    update_siblings_masks(cpuid);    //(1)更新當前cpu的sibling_mask
}

 

(1)更新當前cpu的sibling_mask

  1. 匹配規則就是如果是同一個package id(同一個cluster內),那么就互為sibling,並設置core_sibling的mask
  2. 當前平台不支持超線程,所以沒有thread_sibling
void update_siblings_masks(unsigned int cpuid)
{
    struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
    int cpu;

    /* update core and thread sibling masks */
    for_each_online_cpu(cpu) {
        cpu_topo = &cpu_topology[cpu];

        if (cpuid_topo->llc_id == cpu_topo->llc_id) {        //當前平台不支持acpi,所以所有cpu的llc_id都是-1。這里都會滿足
            cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
            cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
        }

        if (cpuid_topo->package_id != cpu_topo->package_id)    //只有當在同一個cluster內時,才可能成為core_sibling/thread_sibling(當前平台不支持線程sibling)
            continue;

        cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);    //互相設置各自cpu topo結構體的core_sibling mask中添加對方的cpu bit
        cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

        if (cpuid_topo->core_id != cpu_topo->core_id)    //只有在同一個core內時,才有可能成為thread_sibling
            continue;

        cpumask_set_cpu(cpuid, &cpu_topo->thread_sibling);    //互相設置thread_sibling mask中的thread bit
        cpumask_set_cpu(cpu, &cpuid_topo->thread_sibling);
    }
}
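
用一段簡化模擬可以直觀看到core_sibling mask的形成過程(僅為示意,用8bit位圖代替cpumask),結果與下面adb讀到的core_siblings(0f/f0)一致:

/* 模擬update_siblings_masks():同package_id的cpu互為core_sibling */
#include <stdio.h>

int main(void)
{
    int package_id[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };
    unsigned int core_sibling[8] = { 0 };

    for (int cpuid = 0; cpuid < 8; cpuid++) {
        for (int cpu = 0; cpu < 8; cpu++) {
            if (package_id[cpuid] != package_id[cpu])
                continue;
            core_sibling[cpu]   |= 1u << cpuid;   /* 互相在對方mask中置上自己的bit */
            core_sibling[cpuid] |= 1u << cpu;
        }
    }

    for (int cpu = 0; cpu < 8; cpu++)
        printf("cpu%d: core_siblings=%02x\n", cpu, core_sibling[cpu]);  /* 0f或f0 */
    return 0;
}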

最終我們可以通過adb查看cpu相關節點信息來確認上面的cpu topology信息:

TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # ls
core_id  core_siblings  core_siblings_list  physical_package_id  thread_siblings  thread_siblings_list

cpu0:

TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat core_id
0
TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat core_siblings
0f
TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat core_siblings_list
0-3
TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat physical_package_id
0
TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat thread_siblings
01
TECNO-KF6p:/sys/devices/system/cpu/cpu0/topology # cat thread_siblings_list
0

cpu1:

TECNO-KF6p:/sys/devices/system/cpu/cpu1/topology # cat *
1
0f
0-3
0
02  //thread_siblings
1   //thread_siblings_list

cpu7:

TECNO-KF6p:/sys/devices/system/cpu/cpu7/topology # cat *
3   //core_id(cpu4-7的core id分別為0,1,2,3。相當於另一個cluster內重新開始計數)
f0  //core_siblings
4-7  //core_siblings_list(兄弟姐妹core列表)
1   //physical_package_id(就是cluster id)
80  //thread_siblings
7   //thread_siblings_list

 

以上就是CPU topology建立的相關流程了,還是比較清晰的。

sd調度域和sg調度組建立

 CPU MASK

 *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable //系統所有cpu
 *     cpu_present_mask - has bit 'cpu' set iff cpu is populated //存在的所有cpu,根據hotplug變化, <= possible
 *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler //處於online的cpu,即active cpu + idle cpu
 *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration //處於active的cpu,區別與idle cpu
 *     cpu_isolated_mask- has bit 'cpu' set iff cpu isolated //處於isolate的cpu,隔離的cpu不會被分配task運行,但是沒有下電
 *
1、如果沒有CONFIG_HOTPLUG_CPU,那么 present == possible, active == online。
2、配置了cpu hotplug的情況下,present會根據hotplug狀態,動態變化。
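
這幾個mask也可以直接從sysfs讀出來對照,示意程序如下(其中isolated節點在沒有隔離cpu時內容為空):

/* 示意:讀取possible/present/online/isolated幾個cpu mask的sysfs接口 */
#include <stdio.h>

static void dump_mask(const char *name)
{
    char path[96], buf[128];
    FILE *fp;

    snprintf(path, sizeof(path), "/sys/devices/system/cpu/%s", name);
    fp = fopen(path, "r");
    if (!fp) {
        printf("%-9s: <n/a>\n", name);
        return;
    }
    if (!fgets(buf, sizeof(buf), fp))
        buf[0] = '\0';
    fclose(fp);
    printf("%-9s: %s", name, buf);
}

int main(void)
{
    dump_mask("possible");
    dump_mask("present");
    dump_mask("online");
    dump_mask("isolated");
    return 0;
}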

 調度域和調度組是在kernel初始化時開始建立的,調用路徑如下:

kernel_init()
    -> kernel_init_freeable()
        -> sched_init_smp()
            -> sched_init_domains()

 傳入的cpu_map是cpu_active_mask,即活動狀態的cpu,建立調度域:

/* Current sched domains: */
static cpumask_var_t            *doms_cur;

/* Number of sched domains in 'doms_cur': */
static int                ndoms_cur;

 

/*
 * Set up scheduler domains and groups.  For now this just excludes isolated
 * CPUs, but could be used to exclude other special cases in the future.
 */
int sched_init_domains(const struct cpumask *cpu_map)
{
    int err;

    zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
    zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
    zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);

    arch_update_cpu_topology();        //(1)填充cpu_core_map數組
    ndoms_cur = 1;                    //記錄調度域數量的變量,當前初始化為1
    doms_cur = alloc_sched_domains(ndoms_cur);    //alloc調度域相關結構體內存空間
    if (!doms_cur)
        doms_cur = &fallback_doms;
    cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_FLAG_DOMAIN));    //從cpu_map中去掉通過isolcpus=等方式從調度域中隔離出去的cpu;默認沒有該配置時,housekeeping mask就是所有cpu
    err = build_sched_domains(doms_cur[0], NULL);    //(2)根據提供的一組cpu,建立調度域
    register_sched_domain_sysctl(); //(3)注冊proc/sys/kernel/sched_domain目錄,並完善其中相關sysctl控制參數

    return err;
}
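
調度域建立完成后,在打開了CONFIG_SCHED_DEBUG的內核上,可以通過register_sched_domain_sysctl()注冊出來的/proc/sys/kernel/sched_domain/接口確認每個cpu的域層級。下面是一段簡單的讀取示意(路徑為該接口的常見形式,具體以實際內核為准),預期輸出MC、DIE兩級:

/* 示意:讀取/proc/sys/kernel/sched_domain/cpu0/domainX/name */
#include <stdio.h>

int main(void)
{
    for (int d = 0; d < 4; d++) {
        char path[96], buf[32];
        FILE *fp;

        snprintf(path, sizeof(path),
            "/proc/sys/kernel/sched_domain/cpu0/domain%d/name", d);
        fp = fopen(path, "r");
        if (!fp)
            break;                      /* 沒有更高層級的domain了 */
        if (fgets(buf, sizeof(buf), fp))
            printf("cpu0 domain%d: %s", d, buf);
        fclose(fp);
    }
    return 0;
}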

(1)用cpu_possible_mask填充cpu_core_map數組

int arch_update_cpu_topology(void)
{
    unsigned int cpu;

    for_each_possible_cpu(cpu)        //遍歷每個cpu
        cpu_core_map[cpu] = cpu_coregroup_map(cpu);    //利用cpu_possible_mask,也就是物理上所有的cpu core

    return 0;
}

(2)根據提供的可用cpu(active的cpu中去掉isolate cpu),建立調度域

  1. (2-1)根據配置的default topology建立其CPU拓撲結構(MC、DIE);alloc sched_domain以及per_cpu私有變量;alloc root domain空間並初始化
  2. (2-2)判斷當前平台類型:大小核;獲取擁有不同cpu capacity的最淺level:DIE
  3. (2-3)根據平台cpu和topology結構,申請MC、DIE level調度域,並建立其child-parent關系;初始化調度域flag和load balance參數;使能MC、DIE的idle balance
  4. (2-4)申請sched group並初始化cpu mask以及capacity,建立sg在MC、DIE上的內部環形鏈表關系;建立sd、sg、sgc的關聯;
  5. (2-5)在出現錯誤(sa_sd_storage)的情況下,防止正在使用的sd_data在(2-8)中被free
  6. (2-6)更新MC level下每個sg(其實就是每個cpu)的cpu_capacity_orig/cpu_capacity等,再更新DIE level下每個sg(其實就是每個cluster內所有cpu)的cpu_capacity_orig/cpu_capacity
  7. 遍歷cpu_map中每個cpu,
    • 找到擁有最大/最小 cpu_capacity_orig(即cpu_scale)的cpu,並保存到walt root domain結構體中
    • 將新建立的MC level的sd、root domain、cpu_rq三者綁定起來
    • (2-7)將每個新的MC level的sd與對應cpu rq綁定,將每個新的rd與cpu rq綁定;舊的sd、舊的rd都進行銷毀
  8. 遍歷cpu_map,找到cpu_capacity_orig處於中間值的cpu(適用於有3種不同cpu core類型的情況,當前平台只有大小核,沒有超大核,所以這里找不到);上一步中找到的最大/最小 cpu_capacity_orig(即cpu_scale)以及其對應的cpu,都將更新到rd中
  9. 使用static-key機制來修改當前調度域是否有不同cpu capacity的代碼路徑;
  10. 根據上述建立cpu拓撲、申請root domain的正常/異常情況,進行錯誤處理(釋放必要結構體等)
/*
 * Build sched domains for a given set of CPUs and attach the sched domains
 * to the individual CPUs
 */
static int
build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
{
    enum s_alloc alloc_state = sa_none;
    struct sched_domain *sd;
    struct s_data d;
    int i, ret = -ENOMEM;
    struct sched_domain_topology_level *tl_asym;
    bool has_asym = false;

    if (WARN_ON(cpumask_empty(cpu_map)))    //過濾cpu_map為空的情況
        goto error;

    alloc_state = __visit_domain_allocation_hell(&d, cpu_map);    //(2-1)建立MC、DIE的拓撲結構;初始化root domain
    if (alloc_state != sa_rootdomain)
        goto error;

    tl_asym = asym_cpu_capacity_level(cpu_map);        //(2-2)獲取包含max cpu capacity的最淺level:DIE level

    /* Set up domains for CPUs specified by the cpu_map: */  //根據cpu map建立調度域
    for_each_cpu(i, cpu_map) {                        //遍歷每個cpu map中的cpu:0-7
        struct sched_domain_topology_level *tl;

        sd = NULL;
        for_each_sd_topology(tl) {                //遍歷MC、DIE level
            int dflags = 0;

            if (tl == tl_asym) {                //DIE level會帶有:SD_ASYM_CPUCAPACITY flag,並設has_asym = true
                dflags |= SD_ASYM_CPUCAPACITY;
                has_asym = true;
            }

            if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
                goto error;

            sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i); //(2-3)建立MC、DIE level的調度域

            if (tl == sched_domain_topology)    //將最低層級的sd保存到s_data.sd的per_cpu變量中,當前平台為MC level的sd
                *per_cpu_ptr(d.sd, i) = sd;
            if (tl->flags & SDTL_OVERLAP)        //判斷是否sd有重疊,當前平台沒有重疊
                sd->flags |= SD_OVERLAP;
            if (cpumask_equal(cpu_map, sched_domain_span(sd)))    //判斷cpu_map和當前sd->span是否一致;一致說明這一層sd已經覆蓋了所有cpu,再往上建的sd span也只會相同(退化域),所以直接break,不再建立更高層級
                break;
        }
    }

    /* Build the groups for the domains */
    for_each_cpu(i, cpu_map) {                        //遍歷cpu_map中每個cpu
        for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {            //從cpu的最低層級sd開始向上遍歷,當前平台遍歷順序是:MC->DIE
            sd->span_weight = cpumask_weight(sched_domain_span(sd));    //獲取當前sd范圍內的cpu數量
            if (sd->flags & SD_OVERLAP) {                //根據是否有重疊的sd,建立調度組sg(NUMA架構才會有這個flag)
                if (build_overlap_sched_groups(sd, i))    //重疊sd情況下,建立sg(非當前平台,暫不展開)
                    goto error;
            } else {
                if (build_sched_groups(sd, i))            //(2-4)因為當前平台沒有重疊sd,所以走這里建立調度組sg
                    goto error;
            }
        }
    }

    /* Calculate CPU capacity for physical packages and nodes */
    for (i = nr_cpumask_bits-1; i >= 0; i--) {                //遍歷所有cpu,當前平台遍歷順序是cpu7,6...0
        if (!cpumask_test_cpu(i, cpu_map))            //如果cpu不在cpu map中,應該是hotplug的情況
            continue;

        for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {    //依次遍歷MC level和DIE level
            claim_allocations(i, sd);                //(2-5)將已被sd、sg等引用的per_cpu指針(sdd)置為NULL,防止隨后的__free_domain_allocs()將其free
            init_sched_groups_capacity(i, sd);        //(2-6)初始化sg的cpu_capacity
        }
    }

    /* Attach the domains */
    rcu_read_lock();
    for_each_cpu(i, cpu_map) {                    //遍歷cpu map中所有cpu
#ifdef CONFIG_SCHED_WALT
        int max_cpu = READ_ONCE(d.rd->wrd.max_cap_orig_cpu);    //獲取walt_root_domain中保存最大orig_capacity cpu的變量
        int min_cpu = READ_ONCE(d.rd->wrd.min_cap_orig_cpu);    //獲取walt_root_domain中保存最小orig_capacity cpu的變量
#endif

        sd = *per_cpu_ptr(d.sd, i);                //最低層級的sd在上面的流程中被保存到per_cpu變量中,當前平台為MC level

#ifdef CONFIG_SCHED_WALT                                    //通過遍歷循環,找出最大和最小orig_capacity的cpu
        if ((max_cpu < 0) || (arch_scale_cpu_capacity(i) >
                arch_scale_cpu_capacity(max_cpu)))
            WRITE_ONCE(d.rd->wrd.max_cap_orig_cpu, i);

        if ((min_cpu < 0) || (arch_scale_cpu_capacity(i) <
                arch_scale_cpu_capacity(min_cpu)))
            WRITE_ONCE(d.rd->wrd.min_cap_orig_cpu, i);
#endif

        cpu_attach_domain(sd, d.rd, i);        //(2-7)將sd、rd與cpu rq綁定起來
    }

#ifdef CONFIG_SCHED_WALT
    /* set the mid capacity cpu (assumes only 3 capacities) */
    for_each_cpu(i, cpu_map) {
        int max_cpu = READ_ONCE(d.rd->wrd.max_cap_orig_cpu);        //獲取擁有最大orig cpu capacity的第一個cpu
        int min_cpu = READ_ONCE(d.rd->wrd.min_cap_orig_cpu);        //獲取擁有最小orig cpu capacity的第一個cpu

        if ((arch_scale_cpu_capacity(i)                                //找到orig cpu capacity在最大和最小之間的cpu
                != arch_scale_cpu_capacity(min_cpu)) &&
                (arch_scale_cpu_capacity(i)
                != arch_scale_cpu_capacity(max_cpu))) {
            WRITE_ONCE(d.rd->wrd.mid_cap_orig_cpu, i);                //當前平台只有2個值orig cpu capacity,所以這里找不到mid值的cpu
            break;
        }
    }

    /*
     * The max_cpu_capacity reflect the original capacity which does not
     * change dynamically. So update the max cap CPU and its capacity
     * here.
     */
    if (d.rd->wrd.max_cap_orig_cpu != -1) {
        d.rd->max_cpu_capacity.cpu = d.rd->wrd.max_cap_orig_cpu;    //更新rd中的擁有最大orig cpu capacity的cpu(注意變量與max_cap_orig_cpu不同)
        d.rd->max_cpu_capacity.val = arch_scale_cpu_capacity(        //並更新該cpu的orig cpu capacity值
                        d.rd->wrd.max_cap_orig_cpu);
    }
#endif

    rcu_read_unlock();

    if (has_asym)                                                    //當前平台為大小核架構,所以為true
        static_branch_inc_cpuslocked(&sched_asym_cpucapacity);        //針對sched_asym_cpucapacity的變量判斷分支做更改(static key機制用來優化指令預取,類似likely/unlikely)

    ret = 0;
error:
    __free_domain_allocs(&d, alloc_state, cpu_map);        //(2-8)根據函數最上面建立拓撲、以及申請root domain結果,釋放相應的空間

    return ret;
}

(2-1)建立MC、DIE的拓撲結構;初始化root domain

static enum s_alloc
__visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
{
    memset(d, 0, sizeof(*d));

    if (__sdt_alloc(cpu_map))    //(2-1-1)初始化MC、DIE的拓撲結構
        return sa_sd_storage;
    d->sd = alloc_percpu(struct sched_domain *);    //申請d->sd空間
    if (!d->sd)
        return sa_sd_storage;
    d->rd = alloc_rootdomain();    //(2-1-2)申請root domain並初始化
    if (!d->rd)
        return sa_sd;

    return sa_rootdomain;
}

 (2-1-1)初始化MC、DIE的拓撲結構

CPU topology結構如下,因為當前平台不支持SMT,所以從下到上,分別是MC level、DIE level。在sdt_alloc()中的循環中會使用到。

/*
 * Topology list, bottom-up.
 */
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
    { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
    { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
    { cpu_cpu_mask, SD_INIT_NAME(DIE) },
    { NULL, },
};
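
另外,default_topology並不是寫死的:內核提供了set_sched_topology()接口(kernel/sched/topology.c),架構/平台代碼可以用自定義的topology table覆蓋它。下面是一個假設性的示意(my_topology、my_smp_init_topology都是本文虛構的名字),僅說明用法,並非本平台實際代碼:

/* 假設性示意:用自定義的topology table覆蓋default_topology */
static struct sched_domain_topology_level my_topology[] = {
#ifdef CONFIG_SCHED_MC
    { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
    { cpu_cpu_mask, SD_INIT_NAME(DIE) },
    { NULL, },
};

static void __init my_smp_init_topology(void)
{
    /* 必須在sched_init_domains()構建調度域之前調用才會生效 */
    set_sched_topology(my_topology);
}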
  依次對MC、DIE兩個level做同樣的事情:
  1. alloc sd_data結構體(&tl->data)的4個per_cpu指針:sdd->sd、sdd->sds、sdd->sg、sdd->sgc
  2. 再遍歷cpu0-7,為每個cpu分別kzalloc出sd、sds、sg、sgc,並保存到對應的per_cpu變量中

static int __sdt_alloc(const struct cpumask *cpu_map)
{
    struct sched_domain_topology_level *tl;
    int j;

    for_each_sd_topology(tl) {            //依次遍歷MC、DIE結構
        struct sd_data *sdd = &tl->data;        //如下是為MC、DIE level的percpu變量sd_data,申請空間

        sdd->sd = alloc_percpu(struct sched_domain *);        //sched_domain
        if (!sdd->sd)
            return -ENOMEM;

        sdd->sds = alloc_percpu(struct sched_domain_shared *);    //sched_domain_shared
        if (!sdd->sds)
            return -ENOMEM;

        sdd->sg = alloc_percpu(struct sched_group *);        //sched_group
        if (!sdd->sg)
            return -ENOMEM;

        sdd->sgc = alloc_percpu(struct sched_group_capacity *);    //sched_group_capacity
        if (!sdd->sgc)
            return -ENOMEM;

        for_each_cpu(j, cpu_map) {            //遍歷了cpu_map中所有cpu,當前平台為8核:cpu0-7
            struct sched_domain *sd;
            struct sched_domain_shared *sds;
            struct sched_group *sg;
            struct sched_group_capacity *sgc;

            sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),    //申請sd + cpumask的空間
                    GFP_KERNEL, cpu_to_node(j));    //cpu_to_node應該是選擇cpu所在本地的內存node,UMA架構僅有一個node
            if (!sd)
                return -ENOMEM;

            *per_cpu_ptr(sdd->sd, j) = sd;    //將cpu[j]的調度域sd綁定到sdd->sd上

            sds = kzalloc_node(sizeof(struct sched_domain_shared),    //類似申請sds空間,並綁定到sdd->sds
                    GFP_KERNEL, cpu_to_node(j));
            if (!sds)
                return -ENOMEM;

            *per_cpu_ptr(sdd->sds, j) = sds;

            sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), //類似申請sg + cpumask空間,並綁定到sdd->sg
                    GFP_KERNEL, cpu_to_node(j));
            if (!sg)
                return -ENOMEM;

            sg->next = sg;    //初始化時,sg的鏈表並未真正建立

            *per_cpu_ptr(sdd->sg, j) = sg;

            sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),//類似申請sgc + cpumask空間,並綁定到sdd->sgc
                    GFP_KERNEL, cpu_to_node(j));
            if (!sgc)
                return -ENOMEM;

#ifdef CONFIG_SCHED_DEBUG
            sgc->id = j;    //將cpu編號綁定到sgc->id
#endif

            *per_cpu_ptr(sdd->sgc, j) = sgc;
        }
    }

    return 0;
}

(2-1-2) 申請root domain並初始化

static struct root_domain *alloc_rootdomain(void)
{
    struct root_domain *rd;

    rd = kzalloc(sizeof(*rd), GFP_KERNEL);
    if (!rd)
        return NULL;

    if (init_rootdomain(rd) != 0) {    //(2-1-2-1)初始化root domain
        kfree(rd);
        return NULL;
    }

    return rd;
}

(2-1-2-1)初始化root domain

static int init_rootdomain(struct root_domain *rd)
{
    if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))        //申請4個cpu mask的空間
        goto out;
    if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
        goto free_span;
    if (!zalloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
        goto free_online;
    if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
        goto free_dlo_mask;

#ifdef HAVE_RT_PUSH_IPI
    rd->rto_cpu = -1;                            //初始化rto(RT overload)相關參數和irq work:需要pull RT任務時不直接去搶遠端rq鎖,而是向rto_mask中的cpu依次發IPI讓它們主動push,rto_cpu記錄當前循環到哪個cpu
    raw_spin_lock_init(&rd->rto_lock);
    init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
#endif

    init_dl_bw(&rd->dl_bw);                //初始化deadline bandwidth
    if (cpudl_init(&rd->cpudl) != 0)    //初始化cpudl結構體
        goto free_rto_mask;

    if (cpupri_init(&rd->cpupri) != 0)    //初始化cpupri結構體
        goto free_cpudl;

#ifdef CONFIG_SCHED_WALT
    rd->wrd.max_cap_orig_cpu = rd->wrd.min_cap_orig_cpu = -1;    //初始化walt_root_domain
    rd->wrd.mid_cap_orig_cpu = -1;
#endif

    init_max_cpu_capacity(&rd->max_cpu_capacity);    //初始化max_cpu_capacity ->val=0、->cpu=-1

    return 0;

free_cpudl:
    cpudl_cleanup(&rd->cpudl);
free_rto_mask:
    free_cpumask_var(rd->rto_mask);
free_dlo_mask:
    free_cpumask_var(rd->dlo_mask);
free_online:
    free_cpumask_var(rd->online);
free_span:
    free_cpumask_var(rd->span);
out:
    return -ENOMEM;
}

 (2-2)獲取包含max cpu capacity的最淺level:DIE level

  1. 判斷當前是否是大小核架構(即是否存在不同的cpu capacity)
  2. 遍歷cpu map和cpu topology,找到最大的cpu capacity
  3. 找到所有cpu都能"看到"最大cpu capacity的最淺level:DIE level
/*
 * Find the sched_domain_topology_level where all CPU capacities are visible
 * for all CPUs.
 */
static struct sched_domain_topology_level
*asym_cpu_capacity_level(const struct cpumask *cpu_map)
{
    int i, j, asym_level = 0;
    bool asym = false;
    struct sched_domain_topology_level *tl, *asym_tl = NULL;
    unsigned long cap;

    /* Is there any asymmetry? */
    cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));    //獲取cpu_map中第一個cpu,cpu0的capacity

    for_each_cpu(i, cpu_map) {                        //判斷是否有不同capacity的cpu,決定是否是大小核架構
        if (arch_scale_cpu_capacity(i) != cap) {    //當前平台是大小核有不同capacity
            asym = true;
            break;
        }
    }

    if (!asym)
        return NULL;

    /*
     * Examine topology from all CPU's point of views to detect the lowest
     * sched_domain_topology_level where a highest capacity CPU is visible
     * to everyone.
     */
    for_each_cpu(i, cpu_map) {                //遍歷cpu map中的每個cpu,cpu 0-7
        unsigned long max_capacity = arch_scale_cpu_capacity(i);
        int tl_id = 0;

        for_each_sd_topology(tl) {            //依次遍歷MC、DIE level
            if (tl_id < asym_level)
                goto next_level;

            for_each_cpu_and(j, tl->mask(i), cpu_map) {        //(2-2-1)在MC level時分別遍歷cpu0-3、cpu4-7;DIE level時遍歷cpu0-7
                unsigned long capacity;

                capacity = arch_scale_cpu_capacity(j);        //獲取cpu_capacity_orig

                if (capacity <= max_capacity)
                    continue;

                max_capacity = capacity;    //在所有cpu中找到最大的cpu capacity
                asym_level = tl_id;            //記錄level id:1
                asym_tl = tl;                //記錄有不同cpu capacity的cpu topology level: DIE
            }
next_level:
            tl_id++;
        }
    }

    return asym_tl;
}
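
以當前平台(cpu0-3:1024,cpu4-7:801)代入,把上面的雙重遍歷寫成一段用戶態模擬(僅為示意),可以看到asym_level最終落在DIE那一層(tl_id=1),即只有在DIE level所有cpu才都能"看到"最大capacity:

/* 模擬asym_cpu_capacity_level()在4+4大小核上的結果:asym level = DIE */
#include <stdio.h>

static const unsigned long cap[8] = { 1024, 1024, 1024, 1024, 801, 801, 801, 801 };

/* tl 0 = MC:同cluster的cpu;tl 1 = DIE:所有cpu */
static int in_tl_mask(int tl, int i, int j)
{
    if (tl == 0)
        return (i / 4) == (j / 4);
    return 1;
}

int main(void)
{
    int asym_level = 0, asym_tl = -1;

    for (int i = 0; i < 8; i++) {
        unsigned long max_capacity = cap[i];

        for (int tl = 0; tl < 2; tl++) {
            if (tl < asym_level)                    /* 對應goto next_level */
                continue;
            for (int j = 0; j < 8; j++) {
                if (!in_tl_mask(tl, i, j) || cap[j] <= max_capacity)
                    continue;
                max_capacity = cap[j];              /* 找到更大的capacity */
                asym_level = tl;
                asym_tl = tl;
            }
        }
    }
    printf("asym_tl = %s\n", asym_tl == 1 ? "DIE" : (asym_tl == 0 ? "MC" : "NULL"));
    return 0;
}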

(2-2-1)單獨分析下tl->mask(i)

  1. 因為tl實際就是default_topology的指針,所以tl->mask:在MC level下,就是cpu_coregroup_mask;在DIE level下,就是cpu_cpu_mask
  2. 所以MC level下,獲取的mask就是core_siblings mask;DIE level下,獲取的就是所有物理cpu的mask
const struct cpumask *cpu_coregroup_mask(int cpu)
{
    const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

    /* Find the smaller of NUMA, core or LLC siblings */
    if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
        /* not numa in package, lets use the package siblings */
        core_mask = &cpu_topology[cpu].core_sibling;
    }
    if (cpu_topology[cpu].llc_id != -1) {
        if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
            core_mask = &cpu_topology[cpu].llc_sibling;
    }

    return core_mask;
}
static inline const struct cpumask *cpu_cpu_mask(int cpu)
{
    return cpumask_of_node(cpu_to_node(cpu));
}

/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
static inline const struct cpumask *cpumask_of_node(int node)
{
    if (node == NUMA_NO_NODE)  //當前平台是UMA架構,非NUMA結構,所以只有一個node
        return cpu_all_mask;

    return node_to_cpumask_map[node];
}

 (2-3)建立MC、DIE level的調度域

static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
        const struct cpumask *cpu_map, struct sched_domain_attr *attr,
        struct sched_domain *child, int dflags, int cpu)
{
    struct sched_domain *sd = sd_init(tl, cpu_map, child, dflags, cpu); //(2-3-1)初始化sched_domain,填充sd結構體,根據tl level構建sd父子關系等

    if (child) {                                                        //MC level的child為NULL;所以下面只針對DIE level
        sd->level = child->level + 1;                                    //DIE level值為child level+1
        sched_domain_level_max = max(sched_domain_level_max, sd->level);//記錄sd最大level
        child->parent = sd;                                                //將MC level sd的parent設置為DIE level的sd

        if (!cpumask_subset(sched_domain_span(child),
                    sched_domain_span(sd))) {
            pr_err("BUG: arch topology borken\n");
#ifdef CONFIG_SCHED_DEBUG
            pr_err("     the %s domain not a subset of the %s domain\n",
                    child->name, sd->name);
#endif
            /* Fixup, ensure @sd has at least @child CPUs. */
            cpumask_or(sched_domain_span(sd),
                   sched_domain_span(sd),
                   sched_domain_span(child));
        }

    }
    set_domain_attribute(sd, attr);  //(2-3-2)這里attr為NULL,實際不會改動flags,詳見下文

    return sd;
}

(2-3-1)初始化sched_domain

  1. 初始化sd->flags。結合下面代碼,最后MC、DIE level的flags大致為:兩者共有SD_LOAD_BALANCE、SD_BALANCE_NEWIDLE、SD_BALANCE_EXEC、SD_BALANCE_FORK、SD_WAKE_AFFINE;MC level另有SD_SHARE_PKG_RESOURCES(其SD_PREFER_SIBLING會被DIE level清掉);DIE level另有SD_ASYM_CPUCAPACITY和SD_PREFER_SIBLING
  2. 初始化sd結構體中其他重要參數:min_interval/max_interval(MC:4/8,DIE:8/16)、balance_interval(MC:4,DIE:8)、imbalance_pct(MC:117,DIE:125)、cache_nice_tries(MC/DIE均為1)、busy_factor等
  3. 設置cpu mask:sched_domain_span(sd)。在MC level,就是cluster的範圍;在DIE level,就是所有物理cpu
  4. 通過外面的遍歷循環,將MC、DIE建立child-parent的鏈接關系
  5. MC、DIE level的SD_BALANCE_NEWIDLE在這里已經置位,即idle(newidle) balance默認是打開的

static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl,
    const struct cpumask *cpu_map,
    struct sched_domain *child, int dflags, int cpu)
{
    struct sd_data *sdd = &tl->data;
    struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);    //獲取當前cpu的sd結構體
    int sd_id, sd_weight, sd_flags = 0;

#ifdef CONFIG_NUMA
    /*
     * Ugly hack to pass state to sd_numa_mask()...
     */
    sched_domains_curr_level = tl->numa_level;
#endif

    sd_weight = cpumask_weight(tl->mask(cpu));        //獲取MC/DIE level下的sd_weight(就是topology level下的cpu個數,當前平台:MC為4,DIE為8)

    if (tl->sd_flags)                                //只有MC level有配置
        sd_flags = (*tl->sd_flags)();                // MC level的sd_flags:SD_SHARE_PKG_RESOURCES;DIE level則沒有
    if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,            //僅僅是判斷下是否有相關bit位是否越界
            "wrong sd_flags in topology description\n"))
        sd_flags &= TOPOLOGY_SD_FLAGS;                        //然后越界的話,清零下

    /* Apply detected topology flags */
    sd_flags |= dflags;                        //DIE level會傳入 SD_ASYM_CPUCAPACITY flag

    *sd = (struct sched_domain){            //初始化sd結構體
        .min_interval        = sd_weight,    //MC:4,DIE:8
        .max_interval        = 2*sd_weight,    //MC:8,DIE:16
        .busy_factor        = 32,            
        .imbalance_pct        = 125,            //用於load balance

        .cache_nice_tries    = 0,

        .flags            = 1*SD_LOAD_BALANCE
                    | 1*SD_BALANCE_NEWIDLE
                    | 1*SD_BALANCE_EXEC
                    | 1*SD_BALANCE_FORK
                    | 0*SD_BALANCE_WAKE
                    | 1*SD_WAKE_AFFINE
                    | 0*SD_SHARE_CPUCAPACITY
                    | 0*SD_SHARE_PKG_RESOURCES
                    | 0*SD_SERIALIZE
                    | 1*SD_PREFER_SIBLING
                    | 0*SD_NUMA
                    | sd_flags                //MC:SD_SHARE_PKG_RESOURCES,DIE:SD_ASYM_CPUCAPACITY
                    ,

        .last_balance        = jiffies,        //初始化load balance的時間戳
        .balance_interval    = sd_weight,    //load balance的間隔,MC:4,DIE:8
        .max_newidle_lb_cost    = 0,        //newidle load balance的cost
        .next_decay_max_lb_cost    = jiffies,    //max_newidle_lb_cost會隨時間周期性衰減,這里記錄下一次衰減的時間點
        .child            = child,            //MC level:sd的child為NULL;而DIE level:sd的child是MC level的sd
#ifdef CONFIG_SCHED_DEBUG
        .name            = tl->name,
#endif
    };

    cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));        //將對應tl mask(MC:core_siblings, DIE:所有物理cpu),與cpu map進行“位與”,作為sd的范圍
    sd_id = cpumask_first(sched_domain_span(sd));            //拿取sd范圍內的第一個cpu,作為sd_id

    /*
     * Convert topological properties into behaviour.
     */

    /* Don't attempt to spread across CPUs of different capacities. */
    if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)        //當前平台只有DIE level滿足條件
        sd->child->flags &= ~SD_PREFER_SIBLING;                //所以DIE level的child,就是MC level的sd,其flags會去掉清掉SD_PREFER_SIBLING
                                                            //所以,DIE sd的flag有SD_PREFER_SIBLING;而MC sd沒有此flag
    if (sd->flags & SD_SHARE_CPUCAPACITY) {                //這個flag應該是超線程sd支持的flag
        sd->imbalance_pct = 110;

    } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {    //從上面*tl->sd_flags()調用了MC level函數:cpu_core_flags(),所有這個flag在MC level是為1的
        sd->imbalance_pct = 117;                        //修改MC level的sd不平衡百分比
        sd->cache_nice_tries = 1;                        //修改MC level的cache_nice_tries = 1(load balance失敗次數超過它之后,才允許遷移cache-hot任務)

#ifdef CONFIG_NUMA                                    //當前平台不支持NUMA(平台為UMA架構)
    } else if (sd->flags & SD_NUMA) {
        sd->cache_nice_tries = 2;

        sd->flags &= ~SD_PREFER_SIBLING;
        sd->flags |= SD_SERIALIZE;
        if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
            sd->flags &= ~(SD_BALANCE_EXEC |
                       SD_BALANCE_FORK |
                       SD_WAKE_AFFINE);
        }

#endif
    } else {                                //DIE level的cache_nice_tries = 1
        sd->cache_nice_tries = 1;
    }

    sd->shared = *per_cpu_ptr(sdd->sds, sd_id);         //取sd span中第一個cpu的sds:MC level下,同cluster的cpu共用cpu0/cpu4的sds;DIE level下,所有cpu共用cpu0的sds
    atomic_inc(&sd->shared->ref);            //對sd->shared的引用計數+1

    if (sd->flags & SD_SHARE_PKG_RESOURCES)                    //MC leve滿足
        atomic_set(&sd->shared->nr_busy_cpus, sd_weight);    //設置sd->shared->nr_busy_cpus = 4

    sd->private = sdd;            //sd->private指向&tl->data,MC/DIE level的cpu sd都指向對應level的tl->data的結構

    return sd;
}

 (2-3-2)傳參attr為NULL,而default_relax_domain_level默認是-1,所以set_domain_attribute()在這里會直接return,不改動flags;也就是說MC、DIE level保持sd_init中的設置:SD_BALANCE_NEWIDLE(newidle/idle balance)打開,SD_BALANCE_WAKE關閉

static void set_domain_attribute(struct sched_domain *sd,
                 struct sched_domain_attr *attr)
{
    int request;

    if (!attr || attr->relax_domain_level < 0) {
        if (default_relax_domain_level < 0)
            return;
        else
            request = default_relax_domain_level;
    } else
        request = attr->relax_domain_level;
    if (request < sd->level) {
        /* Turn off idle balance on this domain: */
        sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
    } else {
        /* Turn on idle balance on this domain: */
        sd->flags |= (SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
    }
}

(2-4)當前平台沒有重疊sd,所以會調用函數build_sched_groups逐步建立調度組sg

  1. 外部有2層循環,第一層為cpu_map:cpu0-7,第二層為sd:MC、DIE;函數內部有1層循環:從當前cpu開始,在sd內遍歷所有cpu--------感覺有點多余:比如DIE level時,cpu0/cpu4各是child sd中的第一個cpu,就會進行初始化;而cpu1-3/cpu5-7時,就會在get_group中直接過濾return。
  2. 通過get_group進行sg初始化:sg/sgc->cpumask,sgc->capacity;並把sd、sg、sgc 3者關聯起來
  3. MC、DIE level下將每個sg用環形鏈表關聯起來
/*
 * build_sched_groups will build a circular linked list of the groups
 * covered by the given span, will set each group's ->cpumask correctly,
 * and will initialize their ->sgc.
 *
 * Assumes the sched_domain tree is fully constructed
 */
static int
build_sched_groups(struct sched_domain *sd, int cpu)
{
    struct sched_group *first = NULL, *last = NULL;
    struct sd_data *sdd = sd->private;
    const struct cpumask *span = sched_domain_span(sd);        //獲取當前sd的范圍,MC是core_siblings,DIE是所有物理cpu
    struct cpumask *covered;
    int i;

    lockdep_assert_held(&sched_domains_mutex);
    covered = sched_domains_tmpmask;

    cpumask_clear(covered);                //每次外面大循環新的sd或者cpu,就會清空covered mask

    for_each_cpu_wrap(i, span, cpu) {    //從當前cpu開始遍歷整個sd span
        struct sched_group *sg;

        if (cpumask_test_cpu(i, covered))    //已經在covered mask中的cpu,不需要再進行下面工作
            continue;

        sg = get_group(i, sdd);            //(2-4-1)初始化cpu i的調度組sg

        cpumask_or(covered, covered, sched_group_span(sg));     //將covered = covered | sg的span

        if (!first)                //每個cpu、每個level進來記錄第一個sg
            first = sg;
        if (last)
            last->next = sg;    //每個sg的next都指向下一個sg
        last = sg;
    }
    last->next = first;        //將所有sg->next形成環形鏈表
    sd->groups = first;        //sd->groups只指向第一個sg

    return 0;
}

(2-4-1)初始化cpu i的調度組sg

  1. 如果sd是DIE level的,那么就只會初始化並返回cluster中第1個cpu的sg-----非常重要!!!
  2. 將sd與sg、sg與sgc關聯起來
  3. 初始化sg->cpumask和sg->sgc->cpumask:DIE level,為child sd的范圍;MC level,為單個cpu。----------這里區別於sched_domain_span(sd),sg的范圍會比sd的范圍降一級!!!用一句話說就是:每個sched domain的第一個sched group就是sd對應的child sched domain。
  4. 初始化sgc->capacity(等於child sd中cpu個數 * 1024),最大和最小capacity都是1024---------這個當前還不准確,僅僅是初始化,后面還會再修改

  通過上述sd和sg的初始化建立,最終形成如下圖關系。而其中DIE level上,只會初始化每個cluster的第一個cpu的sched group調度組(圖中虛線表示的都沒有關聯到per_cpu變量中)

 

 

 

/*
 * Package topology (also see the load-balance blurb in fair.c)
 *
 * The scheduler builds a tree structure to represent a number of important
 * topology features. By default (default_topology[]) these include:
 *
 *  - Simultaneous multithreading (SMT)
 *  - Multi-Core Cache (MC)
 *  - Package (DIE)
 *
 * Where the last one more or less denotes everything up to a NUMA node.
 *
 * The tree consists of 3 primary data structures:
 *
 *    sched_domain -> sched_group -> sched_group_capacity
 *        ^ ^             ^ ^
 *          `-'             `-'
 *
 * The sched_domains are per-CPU and have a two way link (parent & child) and
 * denote the ever growing mask of CPUs belonging to that level of topology.
 *
 * Each sched_domain has a circular (double) linked list of sched_group's, each
 * denoting the domains of the level below (or individual CPUs in case of the
 * first domain level). The sched_group linked by a sched_domain includes the
 * CPU of that sched_domain [*].
 *
 * Take for instance a 2 threaded, 2 core, 2 cache cluster part:
 *
 * CPU   0   1   2   3   4   5   6   7
 *
 * DIE  [                             ]
 * MC   [             ] [             ]
 * SMT  [     ] [     ] [     ] [     ]
 *
 *  - or -
 *
 * DIE  0-7 0-7 0-7 0-7 0-7 0-7 0-7 0-7
 * MC    0-3 0-3 0-3 0-3 4-7 4-7 4-7 4-7
 * SMT  0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7
 *
 * CPU   0   1   2   3   4   5   6   7
 *
 * One way to think about it is: sched_domain moves you up and down among these
 * topology levels, while sched_group moves you sideways through it, at child
 * domain granularity.
 *
 * sched_group_capacity ensures each unique sched_group has shared storage.
 *
 * There are two related construction problems, both require a CPU that
 * uniquely identify each group (for a given domain):
 *
 *  - The first is the balance_cpu (see should_we_balance() and the
 *    load-balance blub in fair.c); for each group we only want 1 CPU to
 *    continue balancing at a higher domain.
 *
 *  - The second is the sched_group_capacity; we want all identical groups
 *    to share a single sched_group_capacity.
 *
 * Since these topologies are exclusive by construction. That is, its
 * impossible for an SMT thread to belong to multiple cores, and cores to
 * be part of multiple caches. There is a very clear and unique location
 * for each CPU in the hierarchy.
 *
 * Therefore computing a unique CPU for each group is trivial (the iteration
 * mask is redundant and set all 1s; all CPUs in a group will end up at _that_
 * group), we can simply pick the first CPU in each group.
 *
 *
 * [*] in other words, the first group of each domain is its child domain.
 */

static struct sched_group *get_group(int cpu, struct sd_data *sdd)
{
    struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
    struct sched_domain *child = sd->child;
    struct sched_group *sg;
    bool already_visited;

    if (child)                    //child sd存在,說明當前是DIE level的sd
        cpu = cpumask_first(sched_domain_span(child));    //那么取出MC level中child sd的第一個cpu;DIE level時,下面用的sg都是每個cluster的第一個cpu的sg

    sg = *per_cpu_ptr(sdd->sg, cpu);      //綁定sd和sg
    sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);    //綁定sg和sgc

    /* Increase refcounts for claim_allocations: */                //計算sg的引用計數
    already_visited = atomic_inc_return(&sg->ref) > 1;
    /* sgc visits should follow a similar trend as sg */
    WARN_ON(already_visited != (atomic_inc_return(&sg->sgc->ref) > 1));

    /* If we have already visited that group, it's already initialized. */    //過濾已經初始化過的sg:在DIE level時,build_sched_groups函數遍歷所有物理cpu,但是當前函數僅初始化child sd中的第一個cpu。所以當遍歷cpu0/4,會實際執行下去,而cpu1-3/cpu5-7時,就會在這里過濾
    if (already_visited)
        return sg;

    if (child) {                                                        //如果是DIE level
        cpumask_copy(sched_group_span(sg), sched_domain_span(child));    //sg的范圍(sg->cpumask)是child sd的范圍
        cpumask_copy(group_balance_mask(sg), sched_group_span(sg));        //sg->sgc->cpumask也是child sd的范圍
    } else {                                                        //如果是MC level
        cpumask_set_cpu(cpu, sched_group_span(sg));                    //那么sg的范圍(sg->cpumask)就是自己對應的單個cpu
        cpumask_set_cpu(cpu, group_balance_mask(sg));                //sg->sgc->cpumask也是自己對應的單個cpu
    }

    sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));    //根據sg范圍內有幾個cpu,來簡單計算總capacity
    sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;    //初始化最小capacity
    sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;    //初始化最大capacity

    return sg;
}
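
把get_group()/build_sched_groups()在當前平台上的結果用一段模擬程序串起來看(僅為示意,結構體做了極度簡化):MC level上每個cluster內是4個sg的環,sg的span是單個cpu;DIE level上只有cpu0、cpu4的sg被初始化,形成2個sg的環,sg的span等於child sd(整個cluster):

/* 模擬build_sched_groups()的環形鏈表:以DIE level、從cpu0開始遍歷為例 */
#include <stdio.h>

struct sg { int first_cpu; int weight; struct sg *next; };

int main(void)
{
    /* DIE level:get_group()只會返回每個child cluster第一個cpu的sg */
    struct sg sg0 = { .first_cpu = 0, .weight = 4 };   /* span = cpu0-3 */
    struct sg sg4 = { .first_cpu = 4, .weight = 4 };   /* span = cpu4-7 */

    /* build_sched_groups():last->next = sg,最后last->next = first閉環 */
    sg0.next = &sg4;
    sg4.next = &sg0;

    /* sd->groups = first,沿着環走一圈 */
    struct sg *first = &sg0, *g = first;
    do {
        printf("DIE sg: first_cpu=%d span_weight=%d sgc->capacity初值=%d\n",
               g->first_cpu, g->weight, 1024 * g->weight);
        g = g->next;
    } while (g != first);

    return 0;
}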

 (2-5)將用於建立sd、sg的per_cpu指針(sdd)置NULL,防止隨后的__free_domain_allocs()將其free-----結合(2-8)來看,應該是在出現錯誤(sa_sd_storage)的情況下,防止正在使用的sd_data被free

/*
 * NULL the sd_data elements we've used to build the sched_domain and
 * sched_group structure so that the subsequent __free_domain_allocs()
 * will not free the data we're using.
 */
static void claim_allocations(int cpu, struct sched_domain *sd)
{
    struct sd_data *sdd = sd->private;

    WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);                //依次判斷sd、sds、sg、sgc的per_cpu指針,並置為NULL
    *per_cpu_ptr(sdd->sd, cpu) = NULL;

    if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
        *per_cpu_ptr(sdd->sds, cpu) = NULL;

    if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
        *per_cpu_ptr(sdd->sg, cpu) = NULL;

    if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
        *per_cpu_ptr(sdd->sgc, cpu) = NULL;
}

 (2-6)初始化sg的cpu_capacity

  1. do-while循環中對每個sg->group_weight進行初始化:MC level,sg范圍是對應cpu;DIE level,sg范圍是cluster的范圍(如果支持WALT,還要去掉isolate cpu)
  2. 對sg中的第一個cpu(MC level,cpumask為sg對應cpu;DIE level,cpumask為cluster中第一個cpu),更新group capacity;
/*
 * Initialize sched groups cpu_capacity.
 *
 * cpu_capacity indicates the capacity of sched group, which is used while
 * distributing the load between different sched groups in a sched domain.
 * Typically cpu_capacity for all the groups in a sched domain will be same
 * unless there are asymmetries in the topology. If there are asymmetries,
 * group having more cpu_capacity will pickup more load compared to the
 * group having less cpu_capacity.
 */
void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
{
    struct sched_group *sg = sd->groups;        //獲取sd對應的sg
#ifdef CONFIG_SCHED_WALT
    cpumask_t avail_mask;
#endif

    WARN_ON(!sg);

    do {                                //do-while循環中,對sg環形鏈表中的所有sg的->group_weight進行初始化
        int cpu, max_cpu = -1;

#ifdef CONFIG_SCHED_WALT
        cpumask_andnot(&avail_mask, sched_group_span(sg),    //如果支持WALT,那么group_weight = sg的范圍中去掉isolate cpu;MC level,sg范圍是對應cpu;DIE level,sg范圍是cluster的范圍
                            cpu_isolated_mask);
        sg->group_weight = cpumask_weight(&avail_mask);
#else
        sg->group_weight = cpumask_weight(sched_group_span(sg));    //如果不支持WALT,那么group_weight = sg的范圍;MC level,sg范圍是對應cpu;DIE level,sg范圍是cluster的范圍
#endif

        if (!(sd->flags & SD_ASYM_PACKING))        //當前平台沒有這個flag。這個flag應該表示支持非對稱SMT調度
            goto next;

        for_each_cpu(cpu, sched_group_span(sg)) {
            if (max_cpu < 0)
                max_cpu = cpu;
            else if (sched_asym_prefer(cpu, max_cpu))
                max_cpu = cpu;
        }
        sg->asym_prefer_cpu = max_cpu;

next:
        sg = sg->next;
    } while (sg != sd->groups);

    if (cpu != group_balance_cpu(sg))        //僅對sg->sgc->cpumask中第一個cpu,進行下一步更新group capacity。MC level,cpumask為sg對應cpu;DIE level,cpumask為cluster中第一個cpu
        return;

    update_group_capacity(sd, cpu);            //(2-6-1)更新對應group的capacity
}

 (2-6-1)更新對應group的capacity-----這個函數在load balance的流程中也會被調用到

  1. 更新group capacity有時間間隔要求,間隔限制在[1,25]個tick之間
  2. 當sd為MC level時,因為其對應sg只有自身一個cpu,所以僅僅只需更新cpu capacity;而如果是DIE level,則需要進一步更新和計算
    • 當sd為MC level,更新rq->cpu_capacity_orig/cpu_capacity、sgc->capacity/min_capacity/max_capacity
    • 當sd為DIE level,那麼通過child sd->groups指針,以及MC level的sg環形鏈表,依次遍歷每個sgc->capacity(但會排除isolate狀態的cpu):遍歷時,其中最大、最小的capacity分別作為DIE level sd->sg->sgc->max_capacity/min_capacity,最后把所有非isolate cpu的sgc->capacity累加起來,作為這個DIE level sd->sg->sgc->capacity
void update_group_capacity(struct sched_domain *sd, int cpu)
{
    struct sched_domain *child = sd->child;
    struct sched_group *group, *sdg = sd->groups;
    unsigned long capacity, min_capacity, max_capacity;
    unsigned long interval;

    interval = msecs_to_jiffies(sd->balance_interval);                //sgc的更新有間隔限制:1 ~ HZ/10
    interval = clamp(interval, 1UL, max_load_balance_interval);
    sdg->sgc->next_update = jiffies + interval;

    if (!child) {                        //如果是MC level的sd更新sgc,那么就只要更新cpu capacity,因為MC level的sg只有單個cpu在內
        update_cpu_capacity(sd, cpu);    //(2-6-1-1)更新cpu capacity
        return;
    }

    capacity = 0;
    min_capacity = ULONG_MAX;
    max_capacity = 0;

    if (child->flags & SD_OVERLAP) {            //這個是sd有重疊的情況,當前平台沒有sd重疊
        /*
         * SD_OVERLAP domains cannot assume that child groups
         * span the current group.
         */

        for_each_cpu(cpu, sched_group_span(sdg)) {
            struct sched_group_capacity *sgc;
            struct rq *rq = cpu_rq(cpu);

            if (cpu_isolated(cpu))
                continue;

            /*
             * build_sched_domains() -> init_sched_groups_capacity()
             * gets here before we've attached the domains to the
             * runqueues.
             *
             * Use capacity_of(), which is set irrespective of domains
             * in update_cpu_capacity().
             *
             * This avoids capacity from being 0 and
             * causing divide-by-zero issues on boot.
             */
            if (unlikely(!rq->sd)) {
                capacity += capacity_of(cpu);
            } else {
                sgc = rq->sd->groups->sgc;
                capacity += sgc->capacity;
            }

            min_capacity = min(capacity, min_capacity);
            max_capacity = max(capacity, max_capacity);
        }
    } else  {
        /*
         * !SD_OVERLAP domains can assume that child groups        因為沒有sd重疊,那么所有child sd的groups合在一起,就是當前的group
         * span the current group.
         */

        group = child->groups;
        do {                                                    //do-while遍歷child sd的sg環形鏈表;當前平台為例,走到這里是DIE level,那么child sd就是MC level的groups
            struct sched_group_capacity *sgc = group->sgc;        //獲取對應sgc
            __maybe_unused cpumask_t *cpus =
                    sched_group_span(group);                    //因為group是處於MC level,所以范圍就是sg對應的cpu

            if (!cpu_isolated(cpumask_first(cpus))) {        //排除isolate狀態的cpu
                capacity += sgc->capacity;                    //將每個sgc(cpu)的capacity累加起來
                min_capacity = min(sgc->min_capacity,        //保存最小的sgc->capacity
                            min_capacity);
                max_capacity = max(sgc->max_capacity,        //保存最大的sgc->capacity
                            max_capacity);
            }
            group = group->next;
        } while (group != child->groups);
    }

    sdg->sgc->capacity = capacity;                //將MC level中每個sgc->capacity累加起來,其總和作為DIE level中group capacity
    sdg->sgc->min_capacity = min_capacity;        //並保存最大、最小capacity
    sdg->sgc->max_capacity = max_capacity;
}

 (2-6-1-1)更新cpu capacity

  1. 獲取cpu的orig_capacity(也就是cpu_scale),並獲取max_freq_scale(cpufreq每次設置policy的max freq時都會更新,公式如下;max_freq_scale開機時初始化為1024,作為per_cpu變量保存)
                          policy_max_freq * 1024
    max_freq_scale = ———————————————————————————————— ,其中分母是該cpu支持的最大頻率(cpuinfo中的max freq);在cpufreq中會根據policy設置policy_max_freq
                       cpu硬件最大頻率(cpuinfo max freq)
  2. 再通過cpu_scale和max_freq_scale計算,並考慮thermal限制,最終計算結果更新為當前cpu rq的cpu_capacity_orig,公式如下:
    rq->cpu_capacity_orig = min(cpu_scale * max_freq_scale / 1024, thermal限制的最大cpu capacity)
  3. 通過特定的計算公式,計算得出去掉irq、rt進程、dl進程util之后的剩余cpu capacity。之后將其更新為rq->cpu_capacity、sgc->capacity/min_capacity/max_capacity
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
    unsigned long capacity = arch_scale_cpu_capacity(cpu);    //read the per_cpu variable cpu_scale
    struct sched_group *sdg = sd->groups;

    capacity *= arch_scale_max_freq_capacity(sd, cpu);        //read the per_cpu variable max_freq_scale and factor it in
    capacity >>= SCHED_CAPACITY_SHIFT;                        //these two steps compute cpu_scale * max_freq_scale / 1024

    capacity = min(capacity, thermal_cap(cpu));                //the result must not exceed the thermal-limited cpu capacity
    cpu_rq(cpu)->cpu_capacity_orig = capacity;                //store the result as this cpu rq's cpu_capacity_orig

    capacity = scale_rt_capacity(cpu, capacity);        //(2-6-1-1-1) compute the cpu capacity left for the cfs rq

    if (!capacity)            //if nothing is left for cfs, force it to 1
        capacity = 1;

    cpu_rq(cpu)->cpu_capacity = capacity;        //update the related capacities: the rq's cpu_capacity and the sgc's capacity/min/max
    sdg->sgc->capacity = capacity;
    sdg->sgc->min_capacity = capacity;
    sdg->sgc->max_capacity = capacity;
}
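
A small worked example of the arithmetic above (userspace C, all numbers invented): cpu_scale would come from capacity-dmips-mhz, max_freq_scale from the cpufreq policy, and the thermal cap from the thermal governor.

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10

int main(void)
{
    unsigned long cpu_scale      = 1024;  /* capacity-dmips-mhz of a big core      */
    unsigned long max_freq_scale = 768;   /* policy max is 75% of the hardware max */
    unsigned long thermal_cap    = 900;   /* current thermal limit (invented)      */

    unsigned long capacity = cpu_scale * max_freq_scale;
    capacity >>= SCHED_CAPACITY_SHIFT;            /* 1024 * 768 / 1024 = 768 */
    if (capacity > thermal_cap)
        capacity = thermal_cap;                   /* clamp by the thermal limit */

    printf("cpu_capacity_orig = %lu\n", capacity); /* prints 768 */
    return 0;
}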

(2-6-1-1-1) Compute the cpu capacity left for the cfs rq

  1. Get the irq util; if it already exceeds the orig cpu capacity, there is no cpu capacity left at all.
  2. Get the rt util and the dl util and add them up; if the sum exceeds the orig cpu capacity, there is also nothing left.
  3. If both checks still leave headroom, compute the remaining cpu capacity as follows (a worked numeric example follows the function below):

    remaining cpu capacity = (max - avg_rt.util_avg - avg_dl.util_avg) * (max - avg_irq.util_avg) / max,  where max = rq->cpu_capacity_orig (computed above)
static unsigned long scale_rt_capacity(int cpu, unsigned long max)
{
    struct rq *rq = cpu_rq(cpu);
    unsigned long used, free;
    unsigned long irq;

    irq = cpu_util_irq(rq);            //read the rq's avg_irq.util_avg

    if (unlikely(irq >= max))        //irq util alone already consumes the whole capacity, nothing left
        return 1;

    used = READ_ONCE(rq->avg_rt.util_avg);        //read the rt rq's util_avg
    used += READ_ONCE(rq->avg_dl.util_avg);        //add the dl rq's util_avg

    if (unlikely(used >= max))        //rt + dl util already consumes the whole capacity, nothing left
        return 1;

    free = max - used;        //free util = max capacity - rt util_avg - dl util_avg

    return scale_irq_capacity(free, irq, max);    //(max - rt util_avg - dl util_avg) * (max - irq) / max
}
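
And a worked example of the leftover-capacity formula, again with invented utilisation numbers:

#include <stdio.h>

int main(void)
{
    unsigned long max      = 768;  /* cpu_capacity_orig from the previous step */
    unsigned long util_irq = 50;
    unsigned long util_rt  = 100;
    unsigned long util_dl  = 20;

    unsigned long free = max - util_rt - util_dl;          /* 768 - 120 = 648          */
    unsigned long cap  = free * (max - util_irq) / max;    /* 648 * 718 / 768 = 605    */

    printf("cfs capacity left = %lu\n", cap);              /* integer division -> 605  */
    return 0;
}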

 (2-7) Bind the sd and rd to the cpu rq

  1. A for loop walks from the given sd up through its parents, stopping at the sd that has no parent (the DIE level) ---------- so on this platform only the MC-level sd is examined
  2. First decide whether the parent sd should be destroyed. There are two layers to this: one checks whether the parent sd itself meets the degenerate conditions, the other compares it with the child sd to see whether keeping the parent adds anything
  3. Detach the parent first (unlink it from the list and connect parent->parent to the child); if the parent sd carries SD_PREFER_SIBLING, propagate that flag down to the child sd ---------- on this platform neither case applies, so at most the parent sd is unlinked from the child sd
  4. Destroy the parent sd, see (2-7-2)
  5. Run the same degenerate check on the child sd itself and destroy it if needed

All of the above "trims" the new sd, removing levels that do not affect scheduling; after that the new sd is attached to the rd:

  1. (2-7-3) Bind the new root domain to the cpu_rq; the old rd is freed
  2. Update rq->sd to the new sd; add the current cpu to the sd_sysctl_cpus cpu mask
  3. (2-7-5) Destroy the old tmp sd
  4. Finally update the sd-related per_cpu variables
/*
 * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
 * hold the hotplug lock.
 */
static void
cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
{
    struct rq *rq = cpu_rq(cpu);
    struct sched_domain *tmp;

    /* Remove the sched domains which do not contribute to scheduling. */
    for (tmp = sd; tmp; ) {
        struct sched_domain *parent = tmp->parent;        //stop at the sd with no parent, i.e. the DIE-level sd
        if (!parent)
            break;

        if (sd_parent_degenerate(tmp, parent)) {    //(2-7-1) decide whether the parent sd should be degenerated
            tmp->parent = parent->parent;            //first unlink the parent sd from the list
            if (parent->parent)                        //if parent->parent exists, point its child back at tmp
                parent->parent->child = tmp;
            /*
             * Transfer SD_PREFER_SIBLING down in case of a
             * degenerate parent; the spans match for this
             * so the property transfers.
             */
            if (parent->flags & SD_PREFER_SIBLING)    //on this platform only the DIE level carries this flag, and sd_parent_degenerate() does not return 1 here (see (2-7-1)), so this branch is never taken
                tmp->flags |= SD_PREFER_SIBLING;
            destroy_sched_domain(parent);            //(2-7-2) destroy the 'parent' sd
        } else                            //no degenerate needed
            tmp = tmp->parent;            //simply move tmp up and examine the next level
    }

    if (sd && sd_degenerate(sd)) {        //check whether sd itself should be degenerated
        tmp = sd;
        sd = sd->parent;
        destroy_sched_domain(tmp);        //destroy sd, same as above
        if (sd)                            //if the destroyed sd had a parent, clear that parent's ->child
            sd->child = NULL;
    }

    sched_domain_debug(sd, cpu);        //print debug info about the sd being attached

    rq_attach_root(rq, rd);            //(2-7-3) bind the new root domain to the cpu rq
    tmp = rq->sd;
    rcu_assign_pointer(rq->sd, sd);        //point rq->sd at the new sd
    dirty_sched_domain_sysctl(cpu);    //(2-7-4) update the sd_sysctl_cpus cpu mask
    destroy_sched_domains(tmp);        //(2-7-5) destroy the old tmp sd

    update_top_cache_domain(cpu);    //(2-7-6) update the cpu's sd-related per_cpu variables
}

(2-7-1) Should the parent sd be degenerated? sd_parent_degenerate(): return 1 means destroy it; return 0 means keep it

  1. First check whether the parent sd on its own qualifies for degeneration, using the criteria in (2-7-1-1). If sd_degenerate() returns 1, the parent sd should be destroyed; otherwise continue.
  2. If the child sd and the parent sd do not span the same cpus, return 0; otherwise keep checking.
  3. If the parent sd has only one sg, first clear the flags that only make sense with multiple groups. Then check whether every remaining parent flag is also present in the child: if so, return 1; if the parent still carries extra flags, return 0 (a tiny standalone demo of this flag check follows the function below).
static int
sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
{
    unsigned long cflags = sd->flags, pflags = parent->flags;

    if (sd_degenerate(parent))        //(2-7-1-1) check whether the parent sd on its own already qualifies for degeneration
        return 1;

    if (!cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))        //do the MC and DIE level sds span the same cpus? on this platform they do not,
        return 0;                                                                //so normally we return 0 here

    /* Flags needing groups don't count if only 1 group in parent */
    if (parent->groups == parent->groups->next) {                        //if the parent has only one sg, the flags below are meaningless and are cleared
        pflags &= ~(SD_LOAD_BALANCE |
                SD_BALANCE_NEWIDLE |
                SD_BALANCE_FORK |
                SD_BALANCE_EXEC |
                SD_ASYM_CPUCAPACITY |
                SD_SHARE_CPUCAPACITY |
                SD_SHARE_PKG_RESOURCES |
                SD_PREFER_SIBLING |
                SD_SHARE_POWERDOMAIN);
        if (nr_node_ids == 1)
            pflags &= ~SD_SERIALIZE;
    }
    if (~cflags & pflags)        //does the parent (DIE, after the adjustment above) carry any flag the child (MC) lacks?
        return 0;                //if yes, keep the parent (return 0); otherwise fall through and return 1

    return 1;
}
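
The `~cflags & pflags` test above is easy to misread, so here is a tiny standalone demo (the flag bits are arbitrary, not the real SD_* values): the expression is non-zero exactly when the parent carries at least one flag the child does not, i.e. the parent is not a subset of the child.

#include <stdio.h>

#define F_LOAD_BALANCE   (1 << 0)
#define F_BALANCE_FORK   (1 << 1)
#define F_PREFER_SIBLING (1 << 2)

int main(void)
{
    unsigned long cflags = F_LOAD_BALANCE | F_BALANCE_FORK;                    /* child flags  */
    unsigned long pflags = F_LOAD_BALANCE | F_BALANCE_FORK | F_PREFER_SIBLING; /* parent flags */

    if (~cflags & pflags)
        printf("parent has a flag the child lacks -> keep it (return 0)\n");
    else
        printf("parent adds nothing over the child -> degenerate it (return 1)\n");
    return 0;
}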

(2-7-1-1) Check whether the parent sd itself qualifies for degeneration

  1. If the sd spans only 1 cpu, it should be destroyed ------- intuition: a domain that covers a single cpu has nothing to balance, so there is no point keeping it
  2. If the sd carries certain flags and has at least 2 sgs, it must not be destroyed ------- on this platform the DIE level has 2 sgs, so this returns 0 (considering only the case where all cores are online)
  3. If the sd carries SD_WAKE_AFFINE (meaning: place a waking task on a nearby cpu), return 0 ------- on this platform every sd has this flag
  4. If none of the three conditions above applies, return 1

  return 0 means the sd must not be destroyed; return 1 means it should be destroyed.

static int sd_degenerate(struct sched_domain *sd)
{
    if (cpumask_weight(sched_domain_span(sd)) == 1)        //if the sd spans only one cpu, return 1 (the DIE-level sd on this platform spans 8)
        return 1;

    /* Following flags need at least 2 groups */
    if (sd->flags & (SD_LOAD_BALANCE |                    //the DIE level here carries some of these flags and has 2 sgs, so this returns 0
             SD_BALANCE_NEWIDLE |
             SD_BALANCE_FORK |
             SD_BALANCE_EXEC |
             SD_SHARE_CPUCAPACITY |
             SD_ASYM_CPUCAPACITY |
             SD_SHARE_PKG_RESOURCES |
             SD_SHARE_POWERDOMAIN)) {
        if (sd->groups != sd->groups->next)
            return 0;
    }

    /* Following flags don't use groups */
    if (sd->flags & (SD_WAKE_AFFINE))             //every sd carries this flag, so this always returns 0
        return 0;
        return 0;

    return 1;
}

(2-7-2) Destroy the 'parent' sd: essentially free the allocated objects one by one: sd->groups, sd->shared, and the sd itself.

static void destroy_sched_domain(struct sched_domain *sd)
{
    /*
     * A normal sched domain may have multiple group references, an
     * overlapping domain, having private groups, only one.  Iterate,
     * dropping group/capacity references, freeing where none remain.
     */
    free_sched_groups(sd->groups, 1);        //(2-7-2-1) free the sg structures belonging to this sd

    if (sd->shared && atomic_dec_and_test(&sd->shared->ref))    //if an sds exists and this was the last reference, free the sds
        kfree(sd->shared);
    kfree(sd);                //free the sd structure itself
}

(2-7-2-1) Free the sg structures of the sd: a do-while loop walks the circular sg list, freeing each sg->sgc as well, and finally the sg itself (a minimal userspace model of this traversal follows the function below)

static void free_sched_groups(struct sched_group *sg, int free_sgc)
{
    struct sched_group *tmp, *first;

    if (!sg)
        return;

    first = sg;
    do {                            //do-while: walk the circular sg list exactly once
        tmp = sg->next;

        if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))        //if this drops the last reference, free the sgc
            kfree(sg->sgc);                                        //free the sgc structure

        if (atomic_dec_and_test(&sg->ref))            //check whether this drops the last reference on the sg
            kfree(sg);                                //free the sg structure
        sg = tmp;
    } while (sg != first);
}
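
Since the sg list is circular, the do-while above is the usual way to visit every element exactly once while freeing it. A minimal userspace model of that traversal (refcounting dropped for brevity; like the kernel, the loop only uses the value of the first pointer for comparison after it has been freed):

#include <stdio.h>
#include <stdlib.h>

struct node { int id; struct node *next; };

int main(void)
{
    /* build a 3-element circular list: 0 -> 1 -> 2 -> 0 */
    struct node *n[3];
    for (int i = 0; i < 3; i++) { n[i] = malloc(sizeof(struct node)); n[i]->id = i; }
    for (int i = 0; i < 3; i++) n[i]->next = n[(i + 1) % 3];

    struct node *first = n[0], *cur = first, *tmp;
    do {                                 /* visit and free every element exactly once */
        tmp = cur->next;                 /* remember the next node before freeing     */
        printf("freeing group %d\n", cur->id);
        free(cur);
        cur = tmp;
    } while (cur != first);              /* stop once we are back at the start        */
    return 0;
}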

(2-7-3) Bind the new root domain to the cpu rq

  1. If the cpu rq was already bound to a root domain, keep it as old_rd
  2. If, judging by old_rd, the rq is still online, take the rq offline first
  3. Clear this rq's cpu from old_rd and drop one reference on it; if old_rd is still referenced elsewhere, set the local pointer to NULL so it is not freed, otherwise keep it so it gets freed at the end ------- this step detaches the old rd
  4. Assign the new rd to rq->rd and add the rq's cpu to the rd's span -------- this completes the root-domain switch
  5. If rq->cpu is active, bring the rq back online
  6. Finally free old_rd if required (a small sketch of this refcount pattern follows the function below)
void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
    struct root_domain *old_rd = NULL;
    unsigned long flags;

    raw_spin_lock_irqsave(&rq->lock, flags);

    if (rq->rd) {
        old_rd = rq->rd;                                //remember the previous rd

        if (cpumask_test_cpu(rq->cpu, old_rd->online))    //if the rq is still online in the old rd
            set_rq_offline(rq);                            //(2-7-3-1) take the rq offline first

        cpumask_clear_cpu(rq->cpu, old_rd->span);        //remove this rq's cpu from the old rd's span

        /*
         * If we dont want to free the old_rd yet then
         * set old_rd to NULL to skip the freeing later
         * in this function:
         */
        if (!atomic_dec_and_test(&old_rd->refcount))    //drop one reference; if others still hold the old rd
            old_rd = NULL;                                //set it to NULL so the code below does not free it
    }

    atomic_inc(&rd->refcount);        //take a reference on the new rd
    rq->rd = rd;                    //point rq->rd at the new rd

    cpumask_set_cpu(rq->cpu, rd->span);                //add rq->cpu to the new rd's span
    if (cpumask_test_cpu(rq->cpu, cpu_active_mask))    //if rq->cpu is active
        set_rq_online(rq);                            //(2-7-3-2) bring the rq online

    raw_spin_unlock_irqrestore(&rq->lock, flags);

    if (old_rd)                                        //non-NULL means the reference dropped above was the last one
        call_rcu(&old_rd->rcu, free_rootdomain);    //(2-7-3-3) free the old rd after the RCU grace period
}
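
The old_rd handling is the usual "drop a reference, free only on the last put" pattern. A minimal userspace sketch (plain counter instead of atomics, no RCU or rq lock; all values invented):

#include <stdio.h>
#include <stdlib.h>

struct rd_demo { int refcount; };

static void put_rd(struct rd_demo *rd)
{
    if (--rd->refcount == 0) {           /* mirrors atomic_dec_and_test()          */
        printf("last user gone, freeing rd\n");
        free(rd);
    } else {
        printf("rd still referenced (%d left)\n", rd->refcount);
    }
}

int main(void)
{
    struct rd_demo *old_rd = malloc(sizeof(*old_rd));
    old_rd->refcount = 2;                /* two runqueues still point at it        */

    put_rd(old_rd);                      /* first rq switches to the new rd        */
    put_rd(old_rd);                      /* second rq switches -> old rd is freed  */
    return 0;
}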

(2-7-3-1) Take the rq offline: call each scheduling class's rq_offline hook in turn, remove the cpu from the rd's online mask, then set rq->online to 0

void set_rq_offline(struct rq *rq)
{
    if (rq->online) {                        //only if the rq is currently online
        const struct sched_class *class;

        for_each_class(class) {            //iterate over all scheduling classes
            if (class->rq_offline)        //if the class provides an rq_offline hook
                class->rq_offline(rq);    //(2-7-3-1-1) call it; cfs is used as the example below
        }

        cpumask_clear_cpu(rq->cpu, rq->rd->online);        //remove this rq's cpu from the rd's online mask
        rq->online = 0;            //mark the rq offline
    }
}

(2-7-3-1-1) The class rq_offline hook, taking the cfs rq as the example

static void rq_offline_fair(struct rq *rq)
{
    update_sysctl();            //refresh the sysctl scheduling parameters

    /* Ensure any throttled groups are reachable by pick_next_task */
    unthrottle_offline_cfs_rqs(rq);    //(2-7-3-1-1-1) lift the bandwidth throttle on every cfs_rq of this rq
}

 

(2-7-3-1-1-1) Lift the bandwidth throttle on every cfs_rq of this rq (this really belongs to CFS bandwidth control, which is not analysed in depth here; I have read that code before but never wrote it up, maybe another time)

/* cpu offline callback */
static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
{
    struct task_group *tg;

    lockdep_assert_held(&rq->lock);

    rcu_read_lock();
    list_for_each_entry_rcu(tg, &task_groups, list) {        //walk every task group
        struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];        //take the task group's cfs_rq for this cpu

        if (!cfs_rq->runtime_enabled)            //skip cfs_rqs whose bandwidth control is disabled
            continue;

        /*
         * clock_task is not advancing so we just need to make sure
         * there's some valid quota amount
         */
        cfs_rq->runtime_remaining = 1;            //make sure there is a valid quota value
        /*
         * Offline rq is schedulable till CPU is completely disabled
         * in take_cpu_down(), so we prevent new cfs throttling here.
         */
        cfs_rq->runtime_enabled = 0;        //turn off bandwidth control for this cfs_rq

        if (cfs_rq_throttled(cfs_rq))        //if the cfs_rq is currently throttled
            unthrottle_cfs_rq(cfs_rq);        //unthrottle it
    }
    rcu_read_unlock();
}

(2-7-3-2) Bring the rq online: essentially the reverse of taking it offline

void set_rq_online(struct rq *rq)
{
    if (!rq->online) {
        const struct sched_class *class;

        cpumask_set_cpu(rq->cpu, rq->rd->online);    //add rq->cpu to rq->rd->online, i.e. the rd gains one online cpu
        rq->online = 1;            //mark the rq online

        for_each_class(class) {                //iterate over all scheduling classes
            if (class->rq_online)            //if the class provides an rq_online hook
                class->rq_online(rq);        //(2-7-3-2-1) call it; cfs is used as the example below
        }
    }
}

(2-7-3-2-1) The class rq_online hook, taking the cfs rq as the example

static void rq_online_fair(struct rq *rq)
{
    update_sysctl();            //refresh the sysctl scheduling parameters

    update_runtime_enabled(rq);    //(2-7-3-2-1-1) refresh the cfs bandwidth-control switch and settings
}

 (2-7-3-2-1-1) Refresh the cfs bandwidth-control switch and settings

/*
 * Both these CPU hotplug callbacks race against unregister_fair_sched_group()
 *
 * The race is harmless, since modifying bandwidth settings of unhooked group
 * bits doesn't do much.
 */

/* cpu online calback */
static void __maybe_unused update_runtime_enabled(struct rq *rq)
{
    struct task_group *tg;

    lockdep_assert_held(&rq->lock);

    rcu_read_lock();
    list_for_each_entry_rcu(tg, &task_groups, list) {        //walk every task group
        struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;    //the task group's cfs bandwidth-control structure
        struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];        //take the task group's cfs_rq for this cpu

        raw_spin_lock(&cfs_b->lock);
        cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;    //enable bandwidth control on the cfs_rq only if the quota is actually limited
        raw_spin_unlock(&cfs_b->lock);
    }
    rcu_read_unlock();
}

(2-7-3-3) Free the old rd: free each member of the rd structure in turn, then the rd itself

static void free_rootdomain(struct rcu_head *rcu)
{
    struct root_domain *rd = container_of(rcu, struct root_domain, rcu);    //recover the rd to free from the rcu head

    cpupri_cleanup(&rd->cpupri);            //free the members of the rd structure
    cpudl_cleanup(&rd->cpudl);
    free_cpumask_var(rd->dlo_mask);
    free_cpumask_var(rd->rto_mask);
    free_cpumask_var(rd->online);
    free_cpumask_var(rd->span);
    free_pd(rd->pd);
    kfree(rd);                                //finally free the rd itself
}

 

 (2-7-4) Update the sd_sysctl_cpus cpu mask: add this cpu (this mask appears to be consumed later by register_sched_domain_sysctl() to rebuild the /proc/sys/kernel/sched_domain entries of the marked cpus)

void dirty_sched_domain_sysctl(int cpu)
{
    if (cpumask_available(sd_sysctl_cpus))
        __cpumask_set_cpu(cpu, sd_sysctl_cpus);
}

 (2-7-5) Destroy the old tmp sd: walk from the sd up through its parents, destroying each level

static void destroy_sched_domains(struct sched_domain *sd)
{
    if (sd)
        call_rcu(&sd->rcu, destroy_sched_domains_rcu);        //if the sd exists, destroy it after the RCU grace period
}
static void destroy_sched_domains_rcu(struct rcu_head *rcu)
{
    struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);

    while (sd) {
        struct sched_domain *parent = sd->parent;        //walk the sds from MC up to DIE
        destroy_sched_domain(sd);                        //and destroy each one, see (2-7-2)
        sd = parent;
    }
}

 

(2-7-6) Update the cpu's sd-related per_cpu variables (a toy model of highest_flag_domain()/lowest_flag_domain() follows the function below)

static void update_top_cache_domain(int cpu)
{
    struct sched_domain_shared *sds = NULL;
    struct sched_domain *sd;
    int id = cpu;
    int size = 1;

    sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);    //find the highest domain of this cpu carrying the flag, i.e. the MC level
    if (sd) {
        id = cpumask_first(sched_domain_span(sd));            //first cpu in that sd
        size = cpumask_weight(sched_domain_span(sd));        //number of cpus in that sd
        sds = sd->shared;                                    //the sd's sds
    }

    rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);            //sd_llc = the sd found above
    per_cpu(sd_llc_size, cpu) = size;                        //sd_llc_size = number of cpus in that sd
    per_cpu(sd_llc_id, cpu) = id;                            //sd_llc_id = first cpu of that sd
    rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);    //sd_llc_shared = sd->shared

    sd = lowest_flag_domain(cpu, SD_NUMA);                //this platform has no NUMA,
    rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);        //so the value stored here is not meaningful on this platform

    sd = highest_flag_domain(cpu, SD_ASYM_PACKING);                //this platform has no SMT,
    rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);        //so this is likewise not meaningful here

    sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);            //the lowest domain of this cpu carrying the flag, i.e. the DIE level
    rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
}
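
highest_flag_domain() and lowest_flag_domain() are not shown above; they simply walk the sd->parent chain looking for a flag. Below is a toy model with two levels and invented flag bits; it mirrors the walking logic but is not the kernel code.

#include <stdio.h>

#define F_SHARE_PKG_RESOURCES (1 << 0)
#define F_ASYM_CPUCAPACITY    (1 << 1)

struct sd_demo { const char *name; unsigned long flags; struct sd_demo *parent; };

/* highest (closest to DIE) domain that still carries the flag */
static struct sd_demo *highest_flag_domain(struct sd_demo *sd, unsigned long flag)
{
    struct sd_demo *hsd = NULL;
    for (; sd; sd = sd->parent) {
        if (!(sd->flags & flag))
            break;
        hsd = sd;
    }
    return hsd;
}

/* lowest (closest to MC) domain that carries the flag */
static struct sd_demo *lowest_flag_domain(struct sd_demo *sd, unsigned long flag)
{
    for (; sd; sd = sd->parent)
        if (sd->flags & flag)
            break;
    return sd;
}

int main(void)
{
    struct sd_demo die = { "DIE", F_ASYM_CPUCAPACITY, NULL };
    struct sd_demo mc  = { "MC",  F_SHARE_PKG_RESOURCES, &die };

    struct sd_demo *llc  = highest_flag_domain(&mc, F_SHARE_PKG_RESOURCES);
    struct sd_demo *asym = lowest_flag_domain(&mc, F_ASYM_CPUCAPACITY);

    printf("sd_llc level: %s\n", llc ? llc->name : "none");                  /* MC  */
    printf("sd_asym_cpucapacity level: %s\n", asym ? asym->name : "none");   /* DIE */
    return 0;
}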

(2-8) Free the allocations made at the top of the function (the topology structures and the root domain), depending on how far the setup got

static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
                 const struct cpumask *cpu_map)
{
    switch (what) {
    case sa_rootdomain:                            //the normal path when everything succeeded
        if (!atomic_read(&d->rd->refcount))        //keep the root domain only if something still references it
            free_rootdomain(&d->rd->rcu);        //otherwise free it
        /* Fall through */
    case sa_sd:                            //reached when allocating the root domain failed
        free_percpu(d->sd);                //free d->sd
        /* Fall through */
    case sa_sd_storage:                    //reached when building the topology or allocating d->sd failed
        __sdt_free(cpu_map);            //free the topology data of every cpu in cpu_map, including all the per_cpu sdd->* allocations
        /* Fall through */
    case sa_none:
        break;
    }
}

 

register_sched_domain_sysctl(); //(3) register the proc/sys/kernel/sched_domain directory and populate its sysctl entries

-----I am not going to analyse this part; it is just a set of sysctl/procfs interfaces. If interested, see this blog: https://blog.csdn.net/wukongmingjing/article/details/100043644

 

 

Reference: https://blog.csdn.net/wukongmingjing/article/details/82426568

 

