A CPU's most important attribute is its compute performance. From earlier material we have seen how the Linux kernel models this quantity: cpu capacity.
1. The construction of the CPU topology and of the sched_domain/sched_group hierarchy includes the initial setup of cpu capacity.
2. A CPU's compute power is closely tied to the frequency it runs at, so cpufreq actions change cpu capacity as well.
3. A CPU's capacity determines how much task work it can handle in time; when placing CFS tasks, the scheduler therefore consults the CPU's remaining capacity (after subtracting the occupancy of irq and of dl/rt tasks).
Let us walk through the cpu capacity values in the flows above, step by step in the code (based on the CAF kernel, msm-5.4):
During boot, while the CPU topology is built, the kernel parses the parameters the CPU vendor configured in the DTS and uses them as the CPU's capacity:

1. First, the DTS value is read as the raw capacity
```
cpu0: cpu@000 {
	device_type = "cpu";
	...
	capacity-dmips-mhz = <1024>;
	...
};

cpu7: cpu@103 {
	device_type = "cpu";
	...
	capacity-dmips-mhz = <801>;
	...
};
```

```c
bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
{
	...
	ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",	/* parse the per-core capacity; configured since kernel 4.19 */
				   &cpu_capacity);
	...
	capacity_scale = max(cpu_capacity, capacity_scale);	/* record the largest capacity as the scale; no cpu may exceed it,
								   so big cores end up at 1024 and little cores below 1024 */
	raw_capacity[cpu] = cpu_capacity;	/* the raw capacity is the dmips value from the DTS; this is essentially
						   the cpu_capacity_orig the scheduler uses everywhere */
	pr_debug("cpu_capacity: %pOF cpu_capacity=%u (raw)\n",
		 cpu_node, raw_capacity[cpu]);
	...
	return !ret;
}
```
2. The raw capacities are then normalized: the largest raw capacity maps to 1024, and smaller CPUs are scaled to a proportional fraction of 1024 (big core 1024, little core below 1024).
The normalized values are stored in the per-cpu variable cpu_scale.
```c
void topology_normalize_cpu_scale(void)
{
	u64 capacity;
	int cpu;

	if (!raw_capacity)
		return;

	pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
	for_each_possible_cpu(cpu) {
		pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
			 cpu, raw_capacity[cpu]);
		capacity = (raw_capacity[cpu] << SCHED_CAPACITY_SHIFT)	/* normalize so that the max cpu capacity == 100% == 1024 */
			/ capacity_scale;
		topology_set_cpu_scale(cpu, capacity);	/* store each cpu's normalized capacity in the per-cpu variable cpu_scale */
		pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
			 cpu, topology_get_cpu_scale(cpu));
	}
}
```
3. update_cpu_capacity() is the main function that updates a CPU's remaining capacity. Each step of its computation also reveals how the various capacity values relate to one another.
- arch_scale_cpu_capacity() fetches cpu_scale
- Expanding arch_scale_max_freq_capacity():
- It simply reads the per-cpu variable max_freq_scale; so how is that value computed?
```c
/* Replace task scheduler's default max-frequency-invariant accounting */
#define arch_scale_max_freq_capacity topology_get_max_freq_scale

static inline unsigned long
topology_get_max_freq_scale(struct sched_domain *sd, int cpu)
{
	return per_cpu(max_freq_scale, cpu);
}

void arch_set_max_freq_scale(struct cpumask *cpus,
			     unsigned long policy_max_freq)
{
	unsigned long scale, max_freq;
	int cpu = cpumask_first(cpus);

	if (cpu > nr_cpu_ids)
		return;

	max_freq = per_cpu(max_cpu_freq, cpu);
	if (!max_freq)
		return;

	scale = (policy_max_freq << SCHED_CAPACITY_SHIFT) / max_freq;

	trace_android_vh_arch_set_freq_scale(cpus, policy_max_freq, max_freq,
					     &scale);

	for_each_cpu(cpu, cpus)
		per_cpu(max_freq_scale, cpu) = scale;
}
```
- As arch_set_max_freq_scale() shows, max_cpu_freq is read first, and the scale then follows from this formula:
① max_freq_scale = policy_max_freq * 1024 / max_cpu_freq, where policy_max_freq is the maximum freq currently allowed by the cpufreq governor (policy), obtained by aggregating all userspace requests through PM QoS.
And how is max_cpu_freq itself determined? It is simply the maximum freq the CPU supports:
```c
{
	...
	arch_set_freq_scale(policy->related_cpus, new_freq,
			    policy->cpuinfo.max_freq);
	...
}

void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
			 unsigned long max_freq)
{
	unsigned long scale;
	int i;

	scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;

	trace_android_vh_arch_set_freq_scale(cpus, cur_freq, max_freq, &scale);

	for_each_cpu(i, cpus) {
		per_cpu(freq_scale, i) = scale;
		per_cpu(max_cpu_freq, i) = max_freq;
	}
}
```
The formula is:
② max_cpu_freq = max_freq = policy->cpuinfo.max_freq, where policy->cpuinfo.max_freq is the maximum frequency the CPU hardware supports.
- The fetched max_freq_scale then enters the capacity calculation, with the thermal limit taken into account: the result may not exceed the thermally limited maximum capacity:
min(cpu_scale * max_freq_scale / 1024, thermal_cap)
- The result of this calculation becomes cpu_capacity_orig
```c
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
	unsigned long capacity = arch_scale_cpu_capacity(cpu);	/* read the per-cpu variable cpu_scale */
	struct sched_group *sdg = sd->groups;

	capacity *= arch_scale_max_freq_capacity(sd, cpu);	/* factor in the per-cpu variable max_freq_scale */
	capacity >>= SCHED_CAPACITY_SHIFT;	/* these two steps compute cpu_scale * max_freq_scale / 1024 */

	capacity = min(capacity, thermal_cap(cpu));	/* the result may not exceed the thermally limited cpu capacity */

	cpu_rq(cpu)->cpu_capacity_orig = capacity;	/* this becomes the cpu rq's cpu_capacity_orig */

	capacity = scale_rt_capacity(cpu, capacity);	/* compute the cpu capacity left over for the cfs rq */

	if (!capacity)	/* if nothing is left for cfs, force it to 1 */
		capacity = 1;

	cpu_rq(cpu)->cpu_capacity = capacity;	/* update the rq's cpu_capacity and the sgc min/max capacity */
	sdg->sgc->capacity = capacity;
	sdg->sgc->min_capacity = capacity;
	sdg->sgc->max_capacity = capacity;
}
```
- scale_rt_capacity() then subtracts the util_avg of irq, of the dl class and of the rt class from cpu_capacity_orig; the remaining cpu capacity is what is left for cfs tasks. The formula is:
cpu_capacity = (cpu_capacity_orig - avg_rt.util_avg - avg_dl.util_avg) * (cpu_capacity_orig - avg_irq.util_avg) / cpu_capacity_orig
```c
static unsigned long scale_rt_capacity(int cpu, unsigned long max)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long used, free;
	unsigned long irq;

	irq = cpu_util_irq(rq);	/* irq util_avg of the cpu rq */

	if (unlikely(irq >= max))	/* irq alone already consumes the whole capacity */
		return 1;

	used = READ_ONCE(rq->avg_rt.util_avg);	/* rt rq util_avg */
	used += READ_ONCE(rq->avg_dl.util_avg);	/* plus the dl rq util_avg */

	if (unlikely(used >= max))	/* rt + dl already consume the whole capacity */
		return 1;

	free = max - used;	/* free = max capacity - rt util_avg - dl util_avg */

	return scale_irq_capacity(free, irq, max);	/* (max - rt - dl) * (max - irq) / max */
}
```
- The final result (the leftover) becomes cpu_capacity, as well as sgc->capacity
The overall dependency graph is as follows:
Summary:
- max_cpu_freq: equals policy->cpuinfo.max_freq, i.e. the maximum freq the CPU supports. Because that hardware maximum is a fixed value, max_cpu_freq does not depend on the CPU's current freq.
- max_freq_scale: computed and normalized from max_cpu_freq and the maximum freq allowed by the current cpufreq governor policy; it therefore tracks the policy's freq cap.
- cpu_scale: determined by the capacity value configured in the DTS. It reflects the CPU's intrinsic compute power; it is generally decided and configured by the CPU vendor, and so is normally fixed.
- cpu_capacity_orig: computed from max_freq_scale and cpu_scale, then capped by the thermal limit. It represents the maximum compute power the CPU can deliver under the current policy.
- cpu_capacity: what remains of cpu_capacity_orig after removing the occupancy of irq, rt and dl. Irq handling consumes CPU time (and if irqs fire too often, system performance suffers), and rt/dl tasks always have priority over cfs tasks; subtracting their shares from the CPU's current capacity leaves the compute power available to cfs tasks.