I previously ran into a power-consumption issue that turned out to be caused by the /proc/sys/kernel/sched_boost node being set abnormally: it was stuck in the boost state, raising power consumption in every scenario.
This article summarizes what I have learned about sched_boost.
Sched_Boost
sched_boost boosts performance mainly by influencing task placement. It is part of QTI's EAS (Energy Aware Scheduling).
Default task placement policy
The scheduler computes each CPU's load and places the task on the least-loaded CPU. If several CPUs have the same load (typically when they are all idle), the task is placed on the CPU with the largest capacity in the system.
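The tie-breaking rule above can be sketched as a tiny standalone function. The struct and names here are illustrative, not the kernel's:

```c
#include <stddef.h>

/* Hypothetical per-CPU stats for illustration only. */
struct cpu_stat {
	unsigned long load;
	unsigned long capacity;
};

/* Pick the least-loaded CPU; on a load tie, prefer the CPU
 * with the largest capacity, as described above. */
static int pick_cpu(const struct cpu_stat *cpus, int nr_cpus)
{
	int best = 0;
	int i;

	for (i = 1; i < nr_cpus; i++) {
		if (cpus[i].load < cpus[best].load ||
		    (cpus[i].load == cpus[best].load &&
		     cpus[i].capacity > cpus[best].capacity))
			best = i;
	}
	return best;
}
```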
Setting sched_boost
sched_boost can be enabled either by writing the /proc/sys/kernel/sched_boost node or by calling sched_set_boost() from kernel code; while boost is active, energy cost is ignored during task placement.
Once boost is set, it must be explicitly disabled by writing 0. Multiple clients may request boost at the same time; the highest-priority requested level takes effect, and boost is truly disabled only after every client has released its request.
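This refcount-plus-priority aggregation can be modeled with a small standalone sketch. The enum mirrors the kernel's boost types, but the helpers boost_set/boost_clear/effective_boost are illustrative, not kernel APIs:

```c
/* Toy model of the boost aggregation described above: each client's
 * set() bumps a per-type refcount, clear() drops it, and the effective
 * boost is the highest-priority type whose refcount is non-zero. */
enum {
	NO_BOOST,
	FULL_THROTTLE_BOOST,
	CONSERVATIVE_BOOST,
	RESTRAINED_BOOST,
	NR_BOOST,
};

static int refcount[NR_BOOST];

static void boost_set(int type)
{
	refcount[type]++;
}

static void boost_clear(int type)
{
	if (refcount[type] > 0)
		refcount[type]--;
}

/* Types are sorted in descending priority, so the first non-zero
 * refcount wins (this mirrors sched_effective_boost()). */
static int effective_boost(void)
{
	int i;

	for (i = FULL_THROTTLE_BOOST; i < NR_BOOST; i++)
		if (refcount[i] > 0)
			return i;
	return NO_BOOST;
}
```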
Boost levels
sched_boost has four levels. 0 turns boost off; the other three control different trade-offs between power and performance (1 = Full Throttle, 2 = Conservative, 3 = Restrained; writing a negative value -N drops one reference to boost type N, as _sched_set_boost() shows).
Writing the node invokes sched_boost_handler:
{
	.procname	= "sched_boost",
	.data		= &sysctl_sched_boost,
	.maxlen		= sizeof(unsigned int),
	.mode		= 0644,
	.proc_handler	= sched_boost_handler,
	.extra1		= &neg_three,
	.extra2		= &three,
},
After the value passes verification, _sched_set_boost() is called to apply the boost.
int sched_boost_handler(struct ctl_table *table, int write,
		void __user *buffer, size_t *lenp, loff_t *ppos)
{
	int ret;
	unsigned int *data = (unsigned int *)table->data;

	mutex_lock(&boost_mutex);

	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
	if (ret || !write)
		goto done;

	if (verify_boost_params(*data))
		_sched_set_boost(*data);
	else
		ret = -EINVAL;

done:
	mutex_unlock(&boost_mutex);
	return ret;
}
Kernel callers go through sched_set_boost(), which likewise ends up calling _sched_set_boost():
int sched_set_boost(int type)
{
	int ret = 0;

	mutex_lock(&boost_mutex);

	if (verify_boost_params(type))
		_sched_set_boost(type);
	else
		ret = -EINVAL;

	mutex_unlock(&boost_mutex);
	return ret;
}
Now let's look at the key function, _sched_set_boost():
static void _sched_set_boost(int type)
{
	if (type == 0)				/* the type argument decides enable vs. disable */
		sched_boost_disable_all();	/* (1) disable all boosts */
	else if (type > 0)
		sched_boost_enable(type);	/* (2) enable a boost */
	else
		sched_boost_disable(-type);	/* (3) disable a boost */

	/*
	 * sysctl_sched_boost holds the boost request from
	 * user space which could be different from the
	 * effectively enabled boost. Update the effective
	 * boost here.
	 */
	sched_boost_type = sched_effective_boost();
	sysctl_sched_boost = sched_boost_type;
	set_boost_policy(sysctl_sched_boost);	/* (4) set the boost policy */
	trace_sched_set_boost(sysctl_sched_boost);
}
First, look at the four per-type control structures of sched_boost:
refcount records how many times a type has been requested; enter() is the action taken when switching into that boost configuration, and exit() the action taken when leaving it.
static struct sched_boost_data sched_boosts[] = {
	[NO_BOOST] = {
		.refcount = 0,
		.enter = sched_no_boost_nop,
		.exit = sched_no_boost_nop,
	},
	[FULL_THROTTLE_BOOST] = {
		.refcount = 0,
		.enter = sched_full_throttle_boost_enter,
		.exit = sched_full_throttle_boost_exit,
	},
	[CONSERVATIVE_BOOST] = {
		.refcount = 0,
		.enter = sched_conservative_boost_enter,
		.exit = sched_conservative_boost_exit,
	},
	[RESTRAINED_BOOST] = {
		.refcount = 0,
		.enter = sched_restrained_boost_enter,
		.exit = sched_restrained_boost_exit,
	},
};
(1) Disable all boosts
Call exit() for every boost type except NO_BOOST that has a non-zero refcount, and reset their refcounts to 0.
#define SCHED_BOOST_START	FULL_THROTTLE_BOOST
#define SCHED_BOOST_END		(RESTRAINED_BOOST + 1)

static void sched_boost_disable_all(void)
{
	int i;

	for (i = SCHED_BOOST_START; i < SCHED_BOOST_END; i++) {
		if (sched_boosts[i].refcount > 0) {
			sched_boosts[i].exit();
			sched_boosts[i].refcount = 0;
		}
	}
}
(2) Enable a boost
refcount is incremented to record the request.
Because sched_boost supports concurrent requests from multiple clients, the currently effective boost configuration must be checked before applying a new one.
Among the enabled boosts, the priority order is Full Throttle > Conservative > Restrained; NO_BOOST applies only when nothing is enabled.
static void sched_boost_enable(int type)
{
	struct sched_boost_data *sb = &sched_boosts[type];
	int next_boost, prev_boost = sched_boost_type;

	sb->refcount++;			/* record the request */

	if (sb->refcount != 1)
		return;

	/*
	 * This boost enable request did not come before.
	 * Take this new request and find the next boost
	 * by aggregating all the enabled boosts. If there
	 * is a change, disable the previous boost and enable
	 * the next boost.
	 */
	next_boost = sched_effective_boost();	/* check the effective boost first */
	if (next_boost == prev_boost)
		return;

	sched_boosts[prev_boost].exit();	/* leave the previous boost state */
	sched_boosts[next_boost].enter();	/* enter the new boost state */
}
The currently effective boost is determined by scanning the refcounts:
static int sched_effective_boost(void)
{
	int i;

	/*
	 * The boosts are sorted in descending order by
	 * priority.
	 */
	for (i = SCHED_BOOST_START; i < SCHED_BOOST_END; i++) {
		if (sched_boosts[i].refcount >= 1)
			return i;
	}

	return NO_BOOST;
}
(3) Disable a boost
Disabling a boost decrements its refcount; when the count reaches zero, that type's exit() is called to leave the boost.
Because multiple boost types can be active at once, after disabling one type the effective boost is re-evaluated so that the highest-priority remaining type takes over.
static void sched_boost_disable(int type)
{
	struct sched_boost_data *sb = &sched_boosts[type];
	int next_boost;

	if (sb->refcount <= 0)
		return;

	sb->refcount--;

	if (sb->refcount)
		return;

	/*
	 * This boost's refcount becomes zero, so it must
	 * be disabled. Disable it first and then apply
	 * the next boost.
	 */
	sb->exit();

	next_boost = sched_effective_boost();
	sched_boosts[next_boost].enter();
}
(4) Set the boost policy
The final step sets the policy, which determines whether tasks should be up-migrated.
The up-migration policies corresponding to the sched_boost levels are:
Full Throttle and Conservative: SCHED_BOOST_ON_BIG --- during task placement, only the largest-capacity CPU cores are considered
No level directly (device tree or symmetric systems only): SCHED_BOOST_ON_ALL --- during task placement, only the smallest-capacity CPU cores are excluded
No Boost and Restrained: SCHED_BOOST_NONE --- normal EAS
/*
 * Scheduler boost type and boost policy might at first seem unrelated,
 * however, there exists a connection between them that will allow us
 * to use them interchangeably during placement decisions. We'll explain
 * the connection here in one possible way so that the implications are
 * clear when looking at placement policies.
 *
 * When policy = SCHED_BOOST_NONE, type is either none or RESTRAINED
 * When policy = SCHED_BOOST_ON_ALL or SCHED_BOOST_ON_BIG, type can
 * neither be none nor RESTRAINED.
 */
static void set_boost_policy(int type)
{
	/* only Conservative and Full Throttle up-migrate */
	if (type == NO_BOOST || type == RESTRAINED_BOOST) {
		boost_policy = SCHED_BOOST_NONE;
		return;
	}

	if (boost_policy_dt) {
		boost_policy = boost_policy_dt;
		return;
	}

	/* min vs. max CPU efficiency differ; should always hold on big.LITTLE */
	if (min_possible_efficiency != max_possible_efficiency) {
		boost_policy = SCHED_BOOST_ON_BIG;
		return;
	}

	boost_policy = SCHED_BOOST_ON_ALL;
}
Next, a detailed look at how each of the three boost modes works:
Full Throttle
In full-throttle mode, sched_boost performs two main actions:
(1) core control
(2) frequency aggregation
static void sched_full_throttle_boost_enter(void)
{
	core_ctl_set_boost(true);			/* (1) core control */
	walt_enable_frequency_aggregation(true);	/* (2) freq aggregation */
}
(1) Core control: isolate/unisolate CPU cores. When boost is enabled, all CPU cores are brought online (unisolated).
int core_ctl_set_boost(bool boost)
{
	unsigned int index = 0;
	struct cluster_data *cluster;
	unsigned long flags;
	int ret = 0;
	bool boost_state_changed = false;

	if (unlikely(!initialized))
		return 0;

	spin_lock_irqsave(&state_lock, flags);
	for_each_cluster(cluster, index) {	/* update and record each cluster's boost state */
		if (boost) {
			boost_state_changed = !cluster->boost;
			++cluster->boost;
		} else {
			if (!cluster->boost) {
				ret = -EINVAL;
				break;
			} else {
				--cluster->boost;
				boost_state_changed = !cluster->boost;
			}
		}
	}
	spin_unlock_irqrestore(&state_lock, flags);

	if (boost_state_changed) {
		index = 0;
		for_each_cluster(cluster, index)	/* apply the boost setting per cluster */
			apply_need(cluster);
	}

	trace_core_ctl_set_boost(cluster->boost, ret);

	return ret;
}
EXPORT_SYMBOL(core_ctl_set_boost);
static void apply_need(struct cluster_data *cluster)
{
	if (eval_need(cluster))				/* is a change needed? */
		wake_up_core_ctl_thread(cluster);	/* wake the cluster's core control thread */
}
How the decision is made:
When enabling boost: check whether any CPU needs to be unisolated.
When disabling boost: check whether need_cpus < active_cpus holds.
In both cases, the time elapsed since the last update must also exceed the delay time.
static bool eval_need(struct cluster_data *cluster)
{
	unsigned long flags;
	struct cpu_data *c;
	unsigned int need_cpus = 0, last_need, thres_idx;
	int ret = 0;
	bool need_flag = false;
	unsigned int new_need;
	s64 now, elapsed;

	if (unlikely(!cluster->inited))
		return 0;

	spin_lock_irqsave(&state_lock, flags);

	if (cluster->boost || !cluster->enable) {
		/* boost enabled: need all CPUs */
		need_cpus = cluster->max_cpus;
	} else {
		/* boost disabled: first count the active CPUs */
		cluster->active_cpus = get_active_cpu_count(cluster);
		thres_idx = cluster->active_cpus ? cluster->active_cpus - 1 : 0;
		list_for_each_entry(c, &cluster->lru, sib) {
			bool old_is_busy = c->is_busy;

			if (c->busy >= cluster->busy_up_thres[thres_idx] ||
			    sched_cpu_high_irqload(c->cpu))
				c->is_busy = true;
			else if (c->busy < cluster->busy_down_thres[thres_idx])
				c->is_busy = false;

			trace_core_ctl_set_busy(c->cpu, c->busy, old_is_busy,
						c->is_busy);
			need_cpus += c->is_busy;
		}
		/* compute need_cpus from task demand */
		need_cpus = apply_task_need(cluster, need_cpus);
	}
	/* clamp: cluster->min_cpus <= need_cpus <= cluster->max_cpus */
	new_need = apply_limits(cluster, need_cpus);
	/*
	 * (*) boost enabled: does any CPU need to be unisolated?
	 *     boost disabled: is need_cpus < active_cpus?
	 */
	need_flag = adjustment_possible(cluster, new_need);

	last_need = cluster->need_cpus;
	now = ktime_to_ms(ktime_get());

	if (new_need > cluster->active_cpus) {
		ret = 1;	/* boost enabled */
	} else {
		/*
		 * When there is no change in need and there are no more
		 * active CPUs than currently needed, just update the
		 * need time stamp and return.
		 */
		if (new_need == last_need && new_need == cluster->active_cpus) {
			cluster->need_ts = now;
			spin_unlock_irqrestore(&state_lock, flags);
			return 0;
		}

		elapsed = now - cluster->need_ts;
		/* the change only counts once it has persisted past the delay */
		ret = elapsed >= cluster->offline_delay_ms;
	}

	if (ret) {
		cluster->need_ts = now;		/* update the timestamp and need_cpus */
		cluster->need_cpus = new_need;
	}
	trace_core_ctl_eval_need(cluster->first_cpu, last_need, new_need,
				 ret && need_flag);
	spin_unlock_irqrestore(&state_lock, flags);

	return ret && need_flag;
}
Once the update conditions are satisfied, the core control thread is woken up:
static void wake_up_core_ctl_thread(struct cluster_data *cluster)
{
	unsigned long flags;

	spin_lock_irqsave(&cluster->pending_lock, flags);
	cluster->pending = true;
	spin_unlock_irqrestore(&cluster->pending_lock, flags);

	wake_up_process(cluster->core_ctl_thread);
}
A check of the pending flag guards against redundant reentry: if there is no pending request, the thread calls schedule() and is moved off the runqueue.
static int __ref try_core_ctl(void *data)
{
	struct cluster_data *cluster = data;
	unsigned long flags;

	while (1) {
		/*
		 * Leave RUNNING and become TASK_INTERRUPTIBLE before calling
		 * schedule(); since the task is no longer TASK_RUNNING it is
		 * dequeued and another task becomes rq->curr (the classic
		 * "leave the runqueue" pattern).
		 */
		set_current_state(TASK_INTERRUPTIBLE);

		spin_lock_irqsave(&cluster->pending_lock, flags);
		if (!cluster->pending) {
			/* nothing pending: sleep via schedule(), maybe exit */
			spin_unlock_irqrestore(&cluster->pending_lock, flags);
			schedule();
			if (kthread_should_stop())
				break;
			spin_lock_irqsave(&cluster->pending_lock, flags);
		}
		set_current_state(TASK_RUNNING);
		cluster->pending = false;
		spin_unlock_irqrestore(&cluster->pending_lock, flags);

		do_core_ctl(cluster);	/* do the work */
	}

	return 0;
}
static void __ref do_core_ctl(struct cluster_data *cluster)
{
	unsigned int need;

	need = apply_limits(cluster, cluster->need_cpus);	/* re-check that need_cpus is valid */
	if (adjustment_possible(cluster, need)) {		/* re-check that a change is needed */
		pr_debug("Trying to adjust group %u from %u to %u\n",
				cluster->first_cpu, cluster->active_cpus, need);

		if (cluster->active_cpus > need)	/* isolate/unisolate CPUs per need */
			try_to_isolate(cluster, need);
		else if (cluster->active_cpus < need)
			try_to_unisolate(cluster, need);
	}
}
(2) Frequency aggregation: raise the CPU frequency.
It simply sets a flag:
static inline void walt_enable_frequency_aggregation(bool enable)
{
	sched_freq_aggr_en = enable;
}
This flag subsequently influences frequency selection in one place:
- When computing the frequency-policy load, the calculation depends on whether frequency aggregation is enabled.
With the flag set, aggr_grp_load is used in place of rq->grp_time.prev_runnable_sum.
aggr_grp_load is the sum of rq->grp_time.prev_runnable_sum over all CPU cores in the current cluster.
Enabling the flag therefore inflates the computed load, which pushes the CPU frequency up (or triggers migration).
static inline u64 freq_policy_load(struct rq *rq)
{
	...
	if (sched_freq_aggr_en)
		load = rq->prev_runnable_sum + aggr_grp_load;
	else
		load = rq->prev_runnable_sum + rq->grp_time.prev_runnable_sum;
	...
}
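A minimal standalone sketch of the difference, assuming aggr_grp_load is the cluster-wide sum described above (toy_rq and toy_freq_policy_load are illustrative stand-ins, not kernel code):

```c
/* Toy model: with frequency aggregation enabled, a CPU's policy load
 * adds the group load summed over the whole cluster instead of only
 * its own rq's group contribution, so the reported load (and hence
 * the chosen frequency) can only go up. */
struct toy_rq {
	unsigned long long prev_runnable_sum;
	unsigned long long grp_prev_runnable_sum;	/* this rq's group contribution */
};

static unsigned long long toy_freq_policy_load(const struct toy_rq *rq,
					       unsigned long long aggr_grp_load,
					       int freq_aggr_en)
{
	if (freq_aggr_en)
		return rq->prev_runnable_sum + aggr_grp_load;	/* cluster-wide sum */
	return rq->prev_runnable_sum + rq->grp_prev_runnable_sum;
}
```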
Conservative
static void sched_conservative_boost_enter(void)
{
	update_cgroup_boost_settings();	/* (1) update the cgroup boost settings */
	sched_task_filter_util = sysctl_sched_min_task_util_for_boost;	/* (2) adjust the task-util threshold */
}
(1) Iterate over the boost groups, which typically are: background, foreground, top-app and rt.
sched_boost_enabled is set to false for every group except those with no_override set. From init.target.rc we can see that top-app and foreground set this flag (below), so in the end only these two groups keep boost enabled:
write /dev/stune/foreground/schedtune.sched_boost_no_override 1
write /dev/stune/top-app/schedtune.sched_boost_no_override 1
void update_cgroup_boost_settings(void)
{
	int i;

	for (i = 0; i < BOOSTGROUPS_COUNT; i++) {
		if (!allocated_group[i])
			break;

		if (allocated_group[i]->sched_boost_no_override)
			continue;

		allocated_group[i]->sched_boost_enabled = false;
	}
}
(2) Raise the min_task_util threshold
/* 1ms default for 20ms window size scaled to 1024 */
unsigned int sysctl_sched_min_task_util_for_boost = 51;		/* conservative */
/* 0.68ms default for 20ms window size scaled to 1024 */
unsigned int sysctl_sched_min_task_util_for_colocation = 35;	/* normal */
So far I have found this value used in only two places:
1. Picking out the tasks whose sched_boost_enabled is still true, and disabling boost for tasks with a small task_util.
The function below returns a task's boost policy, i.e. its placement boost. Its main logic: if the task's group has sched_boost_enabled set and sched_boost is non-zero, then:
if sched_boost = 1 (Full Throttle), the policy is SCHED_BOOST_ON_BIG;
if sched_boost = 2 (Conservative), task util is additionally compared against sched_task_filter_util: above the threshold, policy = SCHED_BOOST_ON_BIG; otherwise, policy = SCHED_BOOST_NONE.
As shown above, only the top-app and foreground groups have sched_boost_enabled set.
So the effect is to pick out the tasks in top-app and foreground with a relatively large task_util and keep boost enabled for them, while boost is disabled for all other tasks.
static inline enum sched_boost_policy task_boost_policy(struct task_struct *p)
{
	enum sched_boost_policy policy = task_sched_boost(p) ?
						sched_boost_policy() :
						SCHED_BOOST_NONE;

	if (policy == SCHED_BOOST_ON_BIG) {
		/*
		 * Filter out tasks less than min task util threshold
		 * under conservative boost.
		 */
		if (sched_boost() == CONSERVATIVE_BOOST &&
		    task_util(p) <= sched_task_filter_util)	/* the raised threshold */
			policy = SCHED_BOOST_NONE;
	}

	return policy;
}
2. When updating the WALT load.
The threshold governing the unfilter counter is raised to 51, i.e. 1ms of a 20ms window scaled to 1024 (the default is 35, i.e. 0.68ms). Once demand_scaled exceeds this threshold, a grace period of nr_windows is set; the counter is decremented by 1 on every update_history() call until it reaches 0.
static void update_history(struct rq *rq, struct task_struct *p,
			   u32 runtime, int samples, int event)
{
	...
	if (demand_scaled > sched_task_filter_util)
		p->unfilter = sysctl_sched_task_unfilter_nr_windows;
	else if (p->unfilter)
		p->unfilter = p->unfilter - 1;
	...
}
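The hysteresis this creates can be modeled with a standalone sketch; the threshold and window count below stand in for the sysctls, and the names are illustrative:

```c
/* Toy model of the unfilter window counter: when a task's scaled
 * demand exceeds the boost threshold it is marked "unfiltered" for a
 * number of windows; each later window update decrements the counter
 * until it hits zero. */
#define TOY_FILTER_UTIL	51	/* sysctl_sched_min_task_util_for_boost under Conservative */
#define TOY_NR_WINDOWS	10	/* stands in for sysctl_sched_task_unfilter_nr_windows */

struct toy_task {
	unsigned int unfilter;
};

static void toy_update_history(struct toy_task *p, unsigned int demand_scaled)
{
	if (demand_scaled > TOY_FILTER_UTIL)
		p->unfilter = TOY_NR_WINDOWS;	/* re-arm the grace period */
	else if (p->unfilter)
		p->unfilter--;			/* decay toward filtered again */
}
```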
p->unfilter is then consulted in the following two places:
The first is in fair.c, but since the mode here is CONSERVATIVE_BOOST, this simply returns false, so we will not follow this path further.
static inline bool task_skip_min_cpu(struct task_struct *p)
{
	return sched_boost() != CONSERVATIVE_BOOST &&
		get_rtg_status(p) && p->unfilter;
}
The other place is in walt.h: since p->unfilter is non-zero, it checks whether the current CPU is a little core, and if so returns true.
static inline bool walt_should_kick_upmigrate(struct task_struct *p, int cpu)
{
	struct related_thread_group *rtg = p->grp;

	if (is_suh_max() && rtg && rtg->id == DEFAULT_CGROUP_COLOC_ID &&
			    rtg->skip_min && p->unfilter)
		return is_min_capacity_cpu(cpu);

	return false;
}
This return value propagates into the function below, which then also returns false.
The system is thus told that the task's load does not fit the current CPU's capacity and the task needs to migrate.
The same function is also called during migration to rule out little cores as the target CPU.
In short: such tasks are migrated from little cores to big cores.
static inline bool task_fits_max(struct task_struct *p, int cpu)
{
	unsigned long capacity = capacity_orig_of(cpu);
	unsigned long max_capacity = cpu_rq(cpu)->rd->max_cpu_capacity.val;
	unsigned long task_boost = per_task_boost(p);

	if (capacity == max_capacity)
		return true;

	if (is_min_capacity_cpu(cpu)) {
		if (task_boost_policy(p) == SCHED_BOOST_ON_BIG ||
			task_boost > 0 ||
			schedtune_task_boost(p) > 0 ||
			walt_should_kick_upmigrate(p, cpu))	/* condition is true here */
			return false;
	} else { /* mid cap cpu */
		if (task_boost > TASK_BOOST_ON_MID)
			return false;
	}

	return task_fits_capacity(p, capacity, cpu);
}
Restrained
In this mode, judging from the code, only frequency aggregation is turned on. It works the same way as in full-throttle mode above, so it is not repeated here.
static void sched_restrained_boost_enter(void)
{
	walt_enable_frequency_aggregation(true);
}
Summary of the effects of each boost mode
Full Throttle:
1. Through core control, all CPUs are unisolated.
2. Through frequency aggregation, the computed load is inflated, which triggers frequency increases or migration.
3. With boost policy = SCHED_BOOST_ON_BIG, only big cores are considered when choosing a migration target CPU.
The net effect is that tasks run on the big cores as much as possible (unless restricted by cpuset).
Conservative:
1. Through the updated group boost settings, only the top-app and foreground groups get task-placement boost.
2. The min_task_util threshold is raised, making the up-migration condition stricter: only tasks with a relatively large load (> 1ms) are up-migrated.
3. Likewise, after the min_task_util threshold change, the system is told that such tasks are a misfit for their CPU and need migrating.
4. With boost policy = SCHED_BOOST_ON_BIG, only big cores are considered when choosing a migration target CPU.
Net effect: some top-app and foreground tasks migrate to the big cores.
Restrained:
1. Through frequency aggregation, the computed load is inflated, which triggers frequency increases or migration.
Even with the inflated load, normal EAS rules still apply; whether the frequency rises or a migration happens depends on the situation.
Note: for a brief description of the early detection part, see the CPU freq tuning section of my earlier article: https://www.cnblogs.com/lingjiajun/p/12317090.html. That said, I did not find where it actually takes effect in this code. Since I have not yet studied the EAS code in detail, there may be omissions or errors; feedback is welcome.