I. The select_task_rq_fair() function
All CFS task placement ultimately goes through select_task_rq_fair(). The three CFS placement paths are:
try_to_wake_up      //core.c  select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);  // wakeup placement path
wake_up_new_task    //core.c  select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);           // fork placement path
sched_exec          //core.c  select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);           // exec placement path
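For context, all three call sites funnel through the common select_task_rq() wrapper in core.c. A condensed sketch (based on the v5.10-era kernels this walkthrough quotes; the is_cpu_allowed() fallback handling is elided), showing that tasks pinned to a single CPU skip class-specific selection entirely:

/* Condensed from kernel/sched/core.c (v5.10-era); fallback handling elided. */
static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
    lockdep_assert_held(&p->pi_lock);

    if (p->nr_cpus_allowed > 1)
        cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
    else
        cpu = cpumask_any(p->cpus_ptr);

    /* ... select_fallback_rq() handling when cpu is not allowed ... */
    return cpu;
}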
Comparing the three call sites, only wakeup placement passes wake_flags, so only it can request a sync wakeup. Whether a sched_domain's flags include a given SD_BALANCE_XX flag can be checked like this:
/proc/sys/kernel/sched_domain # find ./ -name flags | xargs cat
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES                      //MC
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_PREFER_SIBLING      //DIE
As the output shows, on an embedded big.LITTLE Arm64 system neither the MC nor the DIE level carries the SD_BALANCE_WAKE flag.
Annotated walkthrough of the function:
/*
 * select_task_rq_fair: Select target runqueue for the waking task in domains
 * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
 * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
 *
 * Balances load by selecting the idlest CPU in the idlest group, or under
 * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
 *
 * Returns the target CPU number.
 *
 * preempt must be disabled.
 */
/*
 * Parameters:
 * p: the task being placed
 * prev_cpu: the CPU the task last ran on
 * sd_flag: select a CPU within the sched domains carrying this flag; one of
 *     SD_BALANCE_WAKE (wakeup placement), SD_BALANCE_FORK (newly forked task),
 *     SD_BALANCE_EXEC (exec placement)
 * wake_flags: may be WF_SYNC, WF_FORK, WF_MIGRATED, WF_ON_CPU, WF_ANDROID_VENDOR,
 *     but only WF_SYNC (synchronous wakeup) gets special handling here; the
 *     other flags are ignored by this function.
 *
 * Return value: the chosen target CPU
 *
 * Context: preemption must be disabled.
 */
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
    struct sched_domain *tmp, *sd = NULL;
    int cpu = smp_processor_id();
    int new_cpu = prev_cpu; /* the target defaults to prev_cpu */
    int want_affine = 0;
    int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
    int target_cpu = -1;

    /* if this hook is registered and this is not a fork placement, update the task's load first */
    if (trace_android_rvh_select_task_rq_fair_enabled() &&
        !(sd_flag & SD_BALANCE_FORK))
        sync_entity_load_avg(&p->se);

    /* a vendor hook may override the whole selection; if it picked a CPU, return it directly */
    trace_android_rvh_select_task_rq_fair(p, prev_cpu, sd_flag,
            wake_flags, &target_cpu);
    if (target_cpu >= 0)
        return target_cpu;

    /* only wakeup placement may take the EAS path */
    if (sd_flag & SD_BALANCE_WAKE) {
        /* record whether current and the woken task p have a fixed waker/wakee relationship */
        record_wakee(p);

        /* global switch /proc/sys/kernel/sched_energy_aware controls whether EAS is enabled */
        if (sched_energy_enabled()) {
            /*
             * Try EAS placement; on success return immediately. Internally it
             * checks rd->overutilized and bails out right away when the system
             * is overutilized.
             */
            new_cpu = find_energy_efficient_cpu(p, prev_cpu, sync);
            if (new_cpu >= 0)
                return new_cpu;
            /* EAS found nothing: fall back to the initial value and take the traditional path */
            new_cpu = prev_cpu;
        }

        /* want_affine: p and current are closely related and p is allowed on this CPU */
        want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
    }

    rcu_read_lock();
    for_each_domain(cpu, tmp) { /* MC --> DIE */
        /*
         * If both 'cpu' and 'prev_cpu' are part of this domain,
         * cpu is a valid SD_WAKE_AFFINE target.
         */
        /*
         * SD_WAKE_AFFINE is set on both MC and DIE, so that test always holds;
         * in other words, once want_affine is set we are guaranteed to enter here.
         */
        if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
            cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
            /*
             * If the current CPU is already the task's prev_cpu, there is no
             * need to call wake_affine(): a wakeup then goes straight to
             * picking an idle sibling, other placements simply keep prev_cpu.
             */
            if (cpu != prev_cpu)
                new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

            /* if neither the fast nor the slow path runs below, new_cpu is returned as-is */
            sd = NULL; /* Prefer wake_affine over balance flags */
            break;
        }

        /*
         * Note that tmp->flags never contains SD_BALANCE_WAKE, since neither MC
         * nor DIE sets it; only SD_BALANCE_FORK/SD_BALANCE_EXEC placement can
         * assign sd here. For want_affine == 0, after this loop sd is the
         * current CPU's DIE-level sd, i.e. the highest-level sd supporting sd_flag.
         */
        if (tmp->flags & sd_flag)
            sd = tmp;
        else if (!want_affine)
            break;
    }

    if (unlikely(sd)) {
        /* SD_BALANCE_WAKE placement can never get here, because sd is then always NULL */
        /* Slow path */
        new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
    } else if (sd_flag & SD_BALANCE_WAKE) { /* placing a woken task */
        /* Fast path */
        new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

        if (want_affine)
            current->recent_used_cpu = cpu;
    }
    rcu_read_unlock();

    return new_cpu;
}
1. Main logic
(1) A vendor may register a hook that overrides the stock selection flow; if the hook is registered and picks a CPU, the stock flow is skipped, otherwise it runs as usual.
(2) Decide whether to attempt EAS placement; if EAS picks a CPU, return it immediately, otherwise fall through to the rest of the flow. EAS placement requires all of the following:
a. The placement must be for a wakeup; fork and exec placements never take the EAS path.
b. The global EAS switch /proc/sys/kernel/sched_energy_aware must be enabled.
c. The system must not be overutilized; if it is, EAS placement bails out immediately.
(3) A CPU chosen by wake_affine is only a candidate; in the wakeup case the fast path still runs afterwards.
(4) Only the fork and exec placement flows take the slow path; wakeup placement never does. Moreover, the sd passed to the slow path can only be the current CPU's DIE-level sd.
(5) Only the wakeup placement flow takes the fast path; fork and exec placements never do. Although the function is named select_idle_sibling, when no idle CPU is found within the same cluster it still searches across all clusters. A small runnable model of this decision order follows the list.
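Tying the five points together, here is a small runnable model of the decision order (pick_path() is a hypothetical stub taking boolean predicates, so it only traces control flow; the real logic is the function quoted above):

/* Toy model of select_task_rq_fair()'s decision order; predicates are stubs. */
#include <stdio.h>
#include <stdbool.h>

static const char *pick_path(bool is_wakeup, bool eas_enabled, bool eas_found)
{
    if (is_wakeup && eas_enabled && eas_found)
        return "EAS: find_energy_efficient_cpu()";
    if (!is_wakeup)
        return "slow path: find_idlest_cpu()";   /* fork/exec */
    /* wakeup: wake_affine() may refine the candidate first */
    return "fast path: select_idle_sibling()";
}

int main(void)
{
    printf("%s\n", pick_path(true,  true,  true));   /* wakeup, EAS hit */
    printf("%s\n", pick_path(true,  true,  false));  /* wakeup, EAS miss */
    printf("%s\n", pick_path(false, true,  true));   /* fork/exec never use EAS */
    return 0;
}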
II. The wake affine attribute
Wake affine only applies to wakeup placement. It judges whether the current task and the woken task have a fairly fixed wakeup relationship, and uses that to propose a candidate CPU.
1. Three task_struct members record a thread's wakeup information:
struct task_struct {
    ...
    unsigned int wakee_flips;          /* how often this thread wakes a different wakee; the larger,
                                          the more it wakes many distinct threads, and the more frequently */
    unsigned long wakee_flip_decay_ts; /* when wakee_flips last decayed; it halves once per second */
    struct task_struct *last_wakee;    /* the thread this thread woke last time */
    ...
};
2. record_wakee()
/*
 * If current has a fairly fixed wakeup relationship with p, current->wakee_flips
 * stays very small; if current keeps waking many different threads, it grows large.
 */
static void record_wakee(struct task_struct *p)
{
    /*
     * Only decay a single time; tasks that have less then 1 wakeup per
     * jiffy will not have built up many flips.
     */
    /* after one second, decay to half */
    if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
        current->wakee_flips >>= 1;
        current->wakee_flip_decay_ts = jiffies;
    }

    /* in current's context: if the thread woken last time is not p, bump current's wakee_flips */
    if (current->last_wakee != p) {
        current->last_wakee = p;
        current->wakee_flips++;
    }
}
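To see how these rules behave, here is a runnable userspace toy (the toy_task type is hypothetical, and HZ is modeled as 100) that replays record_wakee()'s decay and increment rules through a 1:N phase and then a 1:1 phase:

/* Toy replay of record_wakee(): flips grow in 1:N mode, only decay in 1:1 mode. */
#include <stdio.h>

struct toy_task { unsigned int wakee_flips; long flip_decay_ts; int last_wakee; };

static void record_wakee_toy(struct toy_task *cur, int wakee, long now_jiffies, long hz)
{
    if (now_jiffies > cur->flip_decay_ts + hz) {    /* decay at most once per second */
        cur->wakee_flips >>= 1;
        cur->flip_decay_ts = now_jiffies;
    }
    if (cur->last_wakee != wakee) {                 /* switched to a different wakee */
        cur->last_wakee = wakee;
        cur->wakee_flips++;
    }
}

int main(void)
{
    struct toy_task waker = { 0, 0, -1 };
    long hz = 100, t;

    /* 1:N behaviour: wake a different thread every jiffy for 3 seconds */
    for (t = 1; t <= 3 * hz; t++)
        record_wakee_toy(&waker, (int)t, t, hz);
    printf("after 3s of 1:N: flips = %u\n", waker.wakee_flips);

    /* 1:1 behaviour: keep waking the same thread; flips only decay */
    for (; t <= 6 * hz; t++)
        record_wakee_toy(&waker, 7, t, hz);
    printf("after 3s of 1:1: flips = %u\n", waker.wakee_flips);
    return 0;
}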
In a waker-wakes-wakee scenario there are two placement strategies. One is consolidation: keep the waker and wakee as close as possible to improve cache hit rates. The other is spreading: distribute the load as evenly as possible across CPUs. Different wakeup models call for different placement policies. Two simple wakeup models:
(1) In the 1:N model, one server keeps waking many different clients.
(2) In the 1:1 model, thread A and thread B keep waking each other.
In the 1:N model, if N is large, keeping waker and wakee close leads to severe load imbalance: the waker's sd ends up carrying far too many tasks, hurting performance.
In the 1:1 model, keeping waker and wakee close has no such problem and can even improve performance.
In real programs the wakeup relationships may not be that simple: a wakee may itself be the waker in another relationship, so the interaction can be M:N. Consider this scenario: a waker pulls its wakee close, but the wakee's own wakee_flips is large; since the wakee also acts as a waker, ever more threads get pulled into the waker's sd, further intensifying contention for CPU resources. Therefore neither the waker's nor the wakee's wakee_flips may be too large; when they are, wake affine should be disabled. The kernel uses wake_wide() to make that call.
3. wake_wide()
/*
 * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
 *
 * A waker of many should wake a different task than the one last awakened
 * at a frequency roughly N times higher than the waked task.
 *
 * In order to determine whether we should let the load spread vs consolidating
 * to shared cache, we look for a minimum 'flip' frequency of llc_size in one
 * partner, and a factor of lls_size higher frequency in the other.
 *
 * With both conditions met, we can be relatively sure that the relationship is
 * non-monogamous, with partner count exceeding socket size.
 *
 * Waker/wakee being client/server, worker/dispatcher, interrupt source or
 * whatever is irrelevant, spread criteria is apparent partner count exceeds
 * socket size.
 */
/* Returns 0 for a fairly fixed wakeup relationship, 1 otherwise. */
static int wake_wide(struct task_struct *p)
{
    unsigned int master = current->wakee_flips;
    unsigned int slave = p->wakee_flips;
    int factor = __this_cpu_read(sd_llc_size); /* number of CPUs in this CPU's cluster */

    if (master < slave)
        swap(master, slave);
    /* note this is an OR */
    if (slave < factor || master < slave * factor)
        return 0;
    return 1;
}
When this function returns 0 and the current CPU is in task p's affinity mask, the wakeup is treated as want_affine. The OR logic here is questionable: it suffices for either master's or slave's wakee_flips to be small for wake affine to be taken, which makes it far too easy for tasks to pile up in one LLC domain. In the 1:N model (for example, during a phone screen rotation one thread wakes a very large number of threads to handle the configChange message), master's wakee_flips is huge while slave's is tiny, yet wake affine is still taken, which is unreasonable. The toy below makes this concrete.
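A runnable toy of the wake_wide() predicate (all numbers hypothetical; factor models sd_llc_size = 4), showing exactly the case the paragraph above complains about:

/* Toy check of the wake_wide() predicate. */
#include <stdio.h>

static int wake_wide_toy(unsigned int master, unsigned int slave, unsigned int factor)
{
    if (master < slave) { unsigned int t = master; master = slave; slave = t; }
    if (slave < factor || master < slave * factor)
        return 0;       /* keep wake affine */
    return 1;           /* spread */
}

int main(void)
{
    /* 1:N storm: waker flips huge, wakee flips tiny -- still wake-affine! */
    printf("%d\n", wake_wide_toy(1000, 2, 4));  /* prints 0 */
    /* both partners promiscuous: spread */
    printf("%d\n", wake_wide_toy(64, 16, 4));   /* 16 >= 4 and 64 >= 16*4: prints 1 */
    return 0;
}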
4. wake_affine()
/*
 * select_task_rq_fair passes: sd, the current CPU's MC- or DIE-level sd; p, the task being
 * placed; this_cpu, the CPU we are executing on; prev_cpu, the CPU the task last ran on;
 * sync, whether the WF_SYNC flag was set for this wakeup.
 */
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
               int this_cpu, int prev_cpu, int sync)
{
    int target = nr_cpumask_bits;

    if (sched_feat(WA_IDLE))
        target = wake_affine_idle(this_cpu, prev_cpu, sync);

    if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits) /* no CPU picked above */
        target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

    schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts); /* bump the statistics counter */
    if (target == nr_cpumask_bits)
        return prev_cpu; /* still no CPU picked: fall back to prev_cpu */

    schedstat_inc(sd->ttwu_move_affine);
    schedstat_inc(p->se.statistics.nr_wakeups_affine);
    return target;
}

/*
 * The purpose of wake_affine() is to quickly determine on which CPU we can run
 * soonest. For the purpose of speed we only consider the waking and previous CPU.
 *
 * wake_affine_idle() - only considers 'now', it check if the waking CPU is
 *            cache-affine and is (or will be) idle.
 *
 * wake_affine_weight() - considers the weight to reflect the average
 *              scheduling latency of the CPUs. This seems to work
 *              for the overloaded case.
 *
 * wake_affine passes: this_cpu, the CPU we are executing on; prev_cpu, the CPU the
 * task last ran on; sync, whether the WF_SYNC synchronous-wakeup flag was set.
 */
static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
    /*
     * If this_cpu is idle, it implies the wakeup is from interrupt
     * context. Only allow the move if cache is shared. Otherwise an
     * interrupt intensive workload could force all tasks onto one
     * node depending on the IO topology or IRQ affinity settings.
     *
     * If the prev_cpu is idle and cache affine then avoid a migration.
     * There is no guarantee that the cache hot data from an interrupt
     * is more important than cache hot data on the prev_cpu and from
     * a cpufreq perspective, it's better to have higher utilisation
     * on one CPU.
     */
    /* this_cpu is idle (so the wakeup came from an interrupt) and shares a cluster with prev_cpu */
    if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
        /* prefer prev_cpu if it is also idle, otherwise take the idle this_cpu */
        return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

    /*
     * Sync wakeup with exactly one task (the waker) running on this_cpu: pick
     * this_cpu, since on a sync wakeup the waker is about to sleep anyway.
     */
    if (sync && cpu_rq(this_cpu)->nr_running == 1)
        return this_cpu;

    return nr_cpumask_bits; /* no CPU picked */
}

/*
 * wake_affine passes: sd, the running CPU's MC- or DIE-level sd; p, the task being
 * placed; this_cpu, the CPU we are executing on; prev_cpu, the CPU the task last ran
 * on; sync, whether the WF_SYNC flag was set.
 */
static int wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
                  int this_cpu, int prev_cpu, int sync)
{
    s64 this_eff_load, prev_eff_load;
    unsigned long task_load;

    /* load of the CFS tasks on the currently running CPU */
    this_eff_load = cpu_load(cpu_rq(this_cpu)); /* return cfs_rq->avg.load_avg; */

    if (sync) {
        unsigned long current_load = task_h_load(current); /* roughly current->se.avg.load_avg */

        /* current's load larger than the whole CPU's CFS load? presumably a safety check */
        if (current_load > this_eff_load)
            return this_cpu;

        this_eff_load -= current_load;
    }

    task_load = task_h_load(p); /* roughly p->se.avg.load_avg */

    /* total CFS load on this_cpu if p lands here (with current removed when sync) */
    this_eff_load += task_load;
    if (sched_feat(WA_BIAS))
        this_eff_load *= 100; /* multiply by 100 */
    /*
     * Multiply by the capacity available to CFS tasks (after thermal/irq/rt/dl) --
     * of prev_cpu! The capacities are deliberately cross-multiplied so that the
     * per-capacity loads can be compared without divisions.
     */
    this_eff_load *= capacity_of(prev_cpu);

    prev_eff_load = cpu_load(cpu_rq(prev_cpu)); /* prev_cpu: rq.cfs_rq->avg.load_avg */
    /* p last ran on prev_cpu, so its contribution is still in prev_cpu's load_avg; remove it */
    prev_eff_load -= task_load;
    if (sched_feat(WA_BIAS))
        prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2; /* MC and DIE: 100 + (117-100)/2 = 108 */
    prev_eff_load *= capacity_of(this_cpu); /* and this side uses this_cpu's CFS capacity */

    /*
     * If sync, adjust the weight of prev_eff_load such that if
     * prev_eff == this_eff that select_idle_sibling() will consider
     * stacking the wakee on top of the waker if no other CPU is
     * idle.
     */
    /*
     * Both sides were scaled by 100 and by a capacity, so the +1 merely breaks an
     * exact tie: on equality a sync wakeup then prefers stacking on the waker's CPU.
     */
    if (sync)
        prev_eff_load += 1;

    /*
     * Rearranged, the condition is:
     *   this_cpu_load / this_cpu_cap < 108% * prev_cpu_load / prev_cpu_cap
     * (108, not 108.5: the /2 above is integer division), and when it holds,
     * this_cpu is returned.
     */
    return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}
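A worked numeric example of the WA_BIAS comparison (all numbers made up; the sync branch is not exercised), showing the cross-multiplied form that is actually computed:

/* Worked example of the wake_affine_weight() comparison with hypothetical numbers. */
#include <stdio.h>

int main(void)
{
    unsigned long this_load = 300, prev_load = 280;   /* cfs_rq load_avg, hypothetical */
    unsigned long this_cap = 1024, prev_cap = 512;    /* capacity_of(), hypothetical */
    unsigned long task_load = 80;                     /* task_h_load(p), hypothetical */
    unsigned int imbalance_pct = 117;                 /* the MC/DIE value quoted above */

    /* this side: (300 + 80) * 100 * 512 = 19,456,000 */
    unsigned long long this_eff = (this_load + task_load) * 100ULL * prev_cap;
    /* prev side: (280 - 80) * 108 * 1024 = 22,118,400 */
    unsigned long long prev_eff = (prev_load - task_load) *
            (100ULL + (imbalance_pct - 100) / 2) * this_cap;

    /* equivalent to: this_load/this_cap < 108% * prev_load/prev_cap */
    printf("%s\n", this_eff < prev_eff ? "this_cpu" : "no pick");  /* prints this_cpu */
    return 0;
}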
III. The slow path
Only fork- and exec-type placement takes the slow path, whose entry is find_idlest_cpu(); the sd passed in can only be the DIE level.
/*
 * Only fork or exec placement can reach this function.
 *
 * Parameters: sd is the current CPU's DIE-level sd (it can only be DIE level); p is the
 * task being placed; cpu is the current CPU; prev_cpu is the CPU p last ran on; sd_flag
 * indicates the placement type.
 *
 * Purpose: find the idlest CPU in the idlest sg; if none is found, return the current CPU.
 */
static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
                  int cpu, int prev_cpu, int sd_flag)
{
    int new_cpu = cpu;

    /* none of the CPUs in this sd is in task p's affinity mask */
    if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr))
        return prev_cpu;

    /*
     * We need task's util for cpu_util_without, sync it up to
     * prev_cpu's last_update_time.
     */
    /* update the task's load only for exec-type placement */
    if (!(sd_flag & SD_BALANCE_FORK))
        sync_entity_load_avg(&p->se);

    while (sd) {
        struct sched_group *group;
        struct sched_domain *tmp;
        int weight;

        if (!(sd->flags & sd_flag)) { /* never true: MC and DIE both support fork/exec placement */
            sd = sd->child;
            continue;
        }

        /* a NULL return means the current CPU is chosen */
        group = find_idlest_group(sd, p, cpu);
        if (!group) {
            sd = sd->child;
            continue;
        }

        /*
         * Having found the idlest sg in sd, pick the CPU with the smallest idle
         * exit latency that entered idle most recently; with no idle CPU, pick
         * the least loaded one.
         */
        new_cpu = find_idlest_group_cpu(group, p, cpu);
        if (new_cpu == cpu) {
            /* Now try balancing at a lower domain level of 'cpu': */
            sd = sd->child;
            continue;
        }

        /* Now try balancing at a lower domain level of 'new_cpu': */
        cpu = new_cpu;
        weight = sd->span_weight;
        sd = NULL;
        /*
         * Re-scan from new_cpu: any level whose span is smaller than 'weight'
         * and supports sd_flag is taken, so after a DIE-level pick this selects
         * new_cpu's MC-level sd and the while loop descends one level.
         */
        for_each_domain(cpu, tmp) {
            if (weight <= tmp->span_weight)
                break;
            if (tmp->flags & sd_flag)
                sd = tmp;
        }
    }

    return new_cpu;
}
1. find_idlest_group()
Find the idlest sg within the given sd:
/*
 * find_idlest_group() finds and returns the least busy CPU group within the domain.
 *
 * Assumes p is allowed on at least one CPU in sd.
 *
 * find_idlest_cpu passes: sd, the current CPU's DIE-level sd; p, the task being placed;
 * this_cpu, the current CPU.
 */
static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
{
    struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups;
    struct sg_lb_stats local_sgs, tmp_sgs;
    struct sg_lb_stats *sgs;
    unsigned long imbalance;
    struct sg_lb_stats idlest_sgs = {
            .avg_load = UINT_MAX,
            .group_type = group_overloaded, /* initialize to the most-loaded type */
    };

    /* 1024 * (sd->imbalance_pct - 100) / 100 = 1024 * 17 / 100, about 174 */
    imbalance = scale_load_down(NICE_0_LOAD) * (sd->imbalance_pct-100) / 100;

    /* walk each sg of this sd */
    do {
        int local_group;

        /* Skip over this group if it has no CPUs allowed */
        /* skip sgs containing no CPU allowed by p's affinity */
        if (!cpumask_intersects(sched_group_span(group), p->cpus_ptr))
            continue;

        /* is the currently running CPU in this group? */
        local_group = cpumask_test_cpu(this_cpu, sched_group_span(group));

        if (local_group) {
            sgs = &local_sgs;
            local = group;
        } else {
            sgs = &tmp_sgs;
        }

        update_sg_wakeup_stats(sd, group, sgs, p);

        /* compare group against idlest: is group the idler of the two? */
        if (!local_group && update_pick_idlest(idlest, &idlest_sgs, group, sgs)) {
            idlest = group;
            idlest_sgs = *sgs;
        }

    } while (group = group->next, group != sd->groups);

    /* There is no idlest group to push tasks to */
    /* no idlest sg found: pick the current CPU */
    if (!idlest)
        return NULL;

    /* The local group has been skipped because of CPU affinity */
    /*
     * If this sd has no local group, return idlest directly; otherwise idlest
     * still has to beat the local group. In this walkthrough the local group
     * always exists.
     */
    if (!local)
        return idlest;

    /*
     * If the local group is idler than the selected idlest group
     * don't try and push the task.
     */
    /* the local sg is idler: pick the current CPU */
    if (local_sgs.group_type < idlest_sgs.group_type)
        return NULL;

    /*
     * If the local group is busier than the selected idlest group
     * try and push the task.
     */
    if (local_sgs.group_type > idlest_sgs.group_type)
        return idlest;

    /* from here on, the local group and the idlest group are equally idle: */
    switch (local_sgs.group_type) {
    case group_overloaded:
    case group_fully_busy:
        /*
         * When comparing groups across NUMA domains, it's possible for
         * the local domain to be very lightly loaded relative to the
         * remote domains but "imbalance" skews the comparison making
         * remote CPUs look much more favourable. When considering
         * cross-domain, add imbalance to the load on the remote node
         * and consider staying local.
         */
        /* big.LITTLE systems have no SD_NUMA flag, so this never runs */
        if ((sd->flags & SD_NUMA) &&
            ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
            return NULL;

        /*
         * If the local group is less loaded than the selected
         * idlest group don't try and push any tasks.
         */
        /* idlest is not lower than local by at least imbalance (1024*17/100 ~ 174): pick the current CPU */
        if (idlest_sgs.avg_load >= (local_sgs.avg_load + imbalance))
            return NULL;

        /* local_avg_load / idlest_avg_load <= sd->imbalance_pct %: pick the current CPU */
        if (100 * local_sgs.avg_load <= sd->imbalance_pct * idlest_sgs.avg_load)
            return NULL;
        break;

    case group_imbalanced:
    case group_asym_packing:
        /* Those type are not used in the slow wakeup path */
        return NULL; /* pick the current CPU */

    case group_misfit_task:
        /* Select group with the highest max capacity */
        if (local->sgc->max_capacity >= idlest->sgc->max_capacity)
            return NULL; /* pick the current CPU */
        break;

    case group_has_spare:
        if (sd->flags & SD_NUMA) {
#ifdef CONFIG_NUMA_BALANCING /* not enabled here, so this block does not run */
            int idlest_cpu;
            /*
             * If there is spare capacity at NUMA, try to select
             * the preferred node
             */
            if (cpu_to_node(this_cpu) == p->numa_preferred_nid)
                return NULL;

            idlest_cpu = cpumask_first(sched_group_span(idlest));
            if (cpu_to_node(idlest_cpu) == p->numa_preferred_nid)
                return idlest;
#endif
            /*
             * Otherwise, keep the task on this node to stay close
             * its wakeup source and improve locality. If there is
             * a real need of migration, periodic load balance will
             * take care of it.
             */
            if (local_sgs.idle_cpus) /* the local sg has an idle CPU: pick the current CPU */
                return NULL;
        }

        /*
         * Select group with highest number of idle CPUs. We could also
         * compare the utilization which is more stable but it can end
         * up that the group has less spare capacity but finally more
         * idle CPUs which means more opportunity to run task.
         */
        /* the local sg has at least as many idle CPUs as idlest: pick the current CPU */
        if (local_sgs.idle_cpus >= idlest_sgs.idle_cpus)
            return NULL;
        break;
    }

    return idlest;
}
update_sg_wakeup_stats() fills in the sgs:
/*
 * update_sg_wakeup_stats - Update sched_group's statistics for wakeup.
 * @sd: The sched_domain level to look for idlest group.
 * @group: sched_group whose statistics are to be updated.
 * @sgs: variable to hold the statistics for this group.
 * @p: The task for which we look for the idlest group/CPU.
 *
 * find_idlest_group passes: sd, the sd level currently being searched; group, one sg of
 * that sd; sgs, the output statistics (uninitialized on entry); p, the task being placed.
 */
static inline void update_sg_wakeup_stats(struct sched_domain *sd,
                      struct sched_group *group,
                      struct sg_lb_stats *sgs,
                      struct task_struct *p)
{
    int i, nr_running;

    memset(sgs, 0, sizeof(*sgs));

    /* an MC-level sg holds a single CPU */
    for_each_cpu(i, sched_group_span(group)) {
        struct rq *rq = cpu_rq(i);
        unsigned int local;

        sgs->group_load += cpu_load_without(rq, p);
        sgs->group_util += cpu_util_without(i, p);
        sgs->group_runnable += cpu_runnable_without(rq, p);
        /* is p queued on this CPU's rq? in this case it should return 0 */
        local = task_running_on_cpu(i, p);
        sgs->sum_h_nr_running += rq->cfs.h_nr_running - local;

        nr_running = rq->nr_running - local;
        sgs->sum_nr_running += nr_running;

        /*
         * No need to call idle_cpu_without() if nr_running is not 0
         */
        if (!nr_running && idle_cpu_without(i, p))
            sgs->idle_cpus++;
    }

    /* Check if task fits in the group */
    /* only the DIE level carries this flag, so this can only trigger there */
    if (sd->flags & SD_ASYM_CPUCAPACITY &&
        !task_fits_capacity(p, group->sgc->max_capacity)) {
        sgs->group_misfit_task_load = 1;
    }

    sgs->group_capacity = group->sgc->capacity;

    sgs->group_weight = group->group_weight;

    sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);

    /*
     * Computing avg_load makes sense only when group is fully busy or
     * overloaded
     */
    if (sgs->group_type == group_fully_busy ||
        sgs->group_type == group_overloaded)
        sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
                sgs->group_capacity;
}

/* derive the group_type from imbalance_pct and the sgs */
static inline enum group_type
group_classify(unsigned int imbalance_pct, struct sched_group *group,
           struct sg_lb_stats *sgs)
{
    /*
     * group_is_overloaded() returns false while sgs->sum_nr_running <= sgs->group_weight,
     * i.e. a group is never overloaded with at least as many CPUs as tasks; otherwise it
     * returns true when group_util/group_capacity > imbalance_pct/100 (117%) or
     * group_runnable/group_capacity > imbalance_pct/100 (117%).
     */
    if (group_is_overloaded(imbalance_pct, sgs))
        return group_overloaded;

    if (sg_imbalanced(group)) /* return group->sgc->imbalance; */
        return group_imbalanced;

    if (sgs->group_asym_packing)
        return group_asym_packing;

    if (sgs->group_misfit_task_load)
        return group_misfit_task;

    if (!group_has_capacity(imbalance_pct, sgs))
        return group_fully_busy;

    return group_has_spare;
}
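A runnable toy of the group_is_overloaded() thresholds with imbalance_pct = 117 (inputs are hypothetical; the other group_classify() cases are omitted for brevity):

/* Toy version of the overload test used by group_classify(). */
#include <stdio.h>
#include <stdbool.h>

struct toy_sgs {
    unsigned int sum_nr_running, group_weight;
    unsigned long group_util, group_runnable, group_capacity;
};

static bool overloaded(const struct toy_sgs *s, unsigned int pct)
{
    if (s->sum_nr_running <= s->group_weight)
        return false;   /* never overloaded with as many CPUs as tasks */
    return s->group_capacity * 100 < s->group_util * pct ||
           s->group_capacity * pct < s->group_runnable * 100;
}

int main(void)
{
    struct toy_sgs s = {
        .sum_nr_running = 3, .group_weight = 1,
        .group_util = 900, .group_runnable = 950, .group_capacity = 1024,
    };
    /* 1024*100 = 102400 < 900*117 = 105300 -> overloaded */
    printf("%s\n", overloaded(&s, 117) ? "group_overloaded" : "not overloaded");
    return 0;
}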
update_pick_idlest() then compares group against the current idlest to see whether group is idler:
static bool update_pick_idlest(struct sched_group *idlest,
                   struct sg_lb_stats *idlest_sgs,
                   struct sched_group *group,
                   struct sg_lb_stats *sgs)
{
    if (sgs->group_type < idlest_sgs->group_type)
        return true;

    if (sgs->group_type > idlest_sgs->group_type)
        return false;

    /*
     * The candidate and the current idlest group are the same type of
     * group. Let check which one is the idlest according to the type.
     */
    /* with equal group_type, compare which one is idler */
    switch (sgs->group_type) {
    case group_overloaded:
    case group_fully_busy:
        /* the sg with the lower avg_load is idler */
        /* Select the group with lowest avg_load. */
        if (idlest_sgs->avg_load <= sgs->avg_load)
            return false;
        break;

    case group_imbalanced:
    case group_asym_packing:
        /* Those types are not used in the slow wakeup path */
        return false;

    case group_misfit_task:
        /* the sg with the higher max capacity is idler */
        /* Select group with the highest max capacity */
        if (idlest->sgc->max_capacity >= group->sgc->max_capacity)
            return false;
        break;

    case group_has_spare:
        /* the sg with more idle CPUs is idler */
        /* Select group with most idle CPUs */
        if (idlest_sgs->idle_cpus > sgs->idle_cpus)
            return false;

        /* Select group with lowest group_util */
        /* on an idle-CPU tie, the sg with the lower group_util is idler */
        if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
            idlest_sgs->group_util <= sgs->group_util)
            return false;
        break;
    }

    return true;
}
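A runnable toy of the group_has_spare tie-break (sgs values hypothetical): more idle CPUs wins; on a tie, lower group_util wins:

/* Toy of update_pick_idlest()'s group_has_spare case. */
#include <stdio.h>

struct toy_sgs { unsigned int idle_cpus; unsigned long group_util; };

static int picks_new(struct toy_sgs idlest, struct toy_sgs cand)
{
    if (idlest.idle_cpus > cand.idle_cpus)
        return 0;
    if (idlest.idle_cpus == cand.idle_cpus && idlest.group_util <= cand.group_util)
        return 0;
    return 1;
}

int main(void)
{
    struct toy_sgs a = { .idle_cpus = 2, .group_util = 400 };
    struct toy_sgs b = { .idle_cpus = 2, .group_util = 350 };
    printf("%d\n", picks_new(a, b));    /* 1: equal idle CPUs, b has lower util */

    struct toy_sgs c = { .idle_cpus = 1, .group_util = 100 };
    struct toy_sgs d = { .idle_cpus = 2, .group_util = 900 };
    printf("%d\n", picks_new(c, d));    /* 1: more idle CPUs wins despite higher util */
    return 0;
}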
2. find_idlest_group_cpu()
/*
 * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
 *
 * find_idlest_cpu passes: group, the idlest sg found in the sd being searched;
 * p, the task being placed; this_cpu, the CPU we are executing on.
 */
static int
find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
    unsigned long load, min_load = ULONG_MAX;
    unsigned int min_exit_latency = UINT_MAX;
    u64 latest_idle_timestamp = 0;
    int least_loaded_cpu = this_cpu;
    int shallowest_idle_cpu = -1;
    int i;

    /* Check if we have any choice: */
    /* an MC-level sg returns right here, since it holds a single CPU; DIE-level sgs continue */
    if (group->group_weight == 1)
        return cpumask_first(sched_group_span(group));

    /* Traverse only the allowed CPUs */
    /* DIE-level sg: walk the CPUs of this cluster allowed by p's affinity */
    for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
        if (sched_idle_cpu(i)) /* this CPU's rq holds only SCHED_IDLE tasks: pick it */
            return i;

        /* idle CPU test: rq->curr == rq->idle && rq->nr_running == 0 && rq->ttwu_pending == 0 */
        if (available_idle_cpu(i)) {
            struct rq *rq = cpu_rq(i);
            struct cpuidle_state *idle = idle_get_state(rq); /* return rq->idle_state; */
            if (idle && idle->exit_latency < min_exit_latency) {
                /*
                 * We give priority to a CPU whose idle state has the smallest
                 * exit latency irrespective of any idle timestamp.
                 */
                min_exit_latency = idle->exit_latency; /* track the smallest exit latency */
                latest_idle_timestamp = rq->idle_stamp;
                shallowest_idle_cpu = i; /* and the CPU that has it */
            } else if ((!idle || idle->exit_latency == min_exit_latency) &&
                   rq->idle_stamp > latest_idle_timestamp) {
                /* equal exit latency, but this CPU entered idle later */
                /*
                 * If equal or no active idle state, then the most recently
                 * idled CPU might have a warmer cache.
                 */
                latest_idle_timestamp = rq->idle_stamp;
                shallowest_idle_cpu = i;
            }
        } else if (shallowest_idle_cpu == -1) { /* non-idle CPU, and no idle CPU seen so far */
            load = cpu_load(cpu_rq(i));
            if (load < min_load) {
                min_load = load; /* track the least loaded CPU */
                least_loaded_cpu = i;
            }
        }
    }

    /* pick an idle CPU if one was found, otherwise the least loaded CPU */
    return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
}
Much like load_balance() looking for the busiest CPU, the slow path here looks for the idlest one: first the idlest sg within the sd, then the idlest CPU within that sg. "Idlest" means: within the chosen cluster, an idle CPU wins if one exists; with no idle CPU, the least loaded CPU is chosen.
IV. The fast path
Only the wakeup placement flow takes the fast path. Although the function name says it picks a sibling CPU, when no idle CPU is found it may still pick a CPU in another cluster.
/*
 * Try and locate an idle core/thread in the LLC cache domain.
 *
 * select_task_rq_fair passes: p, the task being placed; prev, the CPU the task last ran
 * on; target, the candidate target CPU -- either prev_cpu or the CPU picked by wake_affine.
 */
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
    struct sched_domain *sd;
    unsigned long task_util;
    int i, recent_used_cpu;

    /*
     * On asymmetric system, update task utilization because we will check
     * that the task fits with cpu's capacity.
     */
    if (static_branch_unlikely(&sched_asym_cpucapacity)) {
        sync_entity_load_avg(&p->se);
        task_util = uclamp_task_util(p); /* clamp util_est between the uclamp MIN and MAX */
    }

    /*
     * If the incoming target is an idle CPU that can hold the task's uclamped
     * util, pick it directly. In the wake-affine case target came from
     * wake_affine(); in every other case target is simply prev_cpu.
     */
    if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
        asym_fits_capacity(task_util, target))
        return target;

    /*
     * If the previous CPU is cache affine and idle, don't be stupid:
     *
     * i.e. target was picked by wake_affine(), prev shares a cluster (L2 cache)
     * with it, prev is idle, and prev can hold p's uclamped util.
     */
    if (prev != target && cpus_share_cache(prev, target) &&
        (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
        asym_fits_capacity(task_util, prev))
        return prev;

    /*
     * Allow a per-cpu kthread to stack with the wakee if the kworker thread
     * and the tasks previous CPUs are the same. The assumption is that the
     * wakee queued work for the per-cpu kthread that is now complete and the
     * wakeup is essentially a sync wakeup. An obvious example of this pattern
     * is IO completions.
     *
     * i.e.: current is a per-cpu kthread, p's prev_cpu is this very CPU, and at
     * most one task is running here: pick prev.
     */
    if (is_per_cpu_kthread(current) &&
        prev == smp_processor_id() &&
        this_rq()->nr_running <= 1) {
        return prev;
    }

    /* Check a recently used CPU as a potential idle candidate: */
    /*
     * recent_used_cpu is updated in:
     *   select_task_rq_fair: sets current->recent_used_cpu to the running CPU
     *   select_idle_sibling: sets the placed task's p->recent_used_cpu to prev_cpu
     *
     * If p->recent_used_cpu is neither prev nor target, but shares a cluster
     * with target, is idle, is allowed by p's affinity, and has enough
     * capacity, pick it.
     */
    recent_used_cpu = p->recent_used_cpu;
    if (recent_used_cpu != prev &&
        recent_used_cpu != target &&
        cpus_share_cache(recent_used_cpu, target) &&
        (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
        cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
        asym_fits_capacity(task_util, recent_used_cpu)) {
        /*
         * Replace recent_used_cpu with prev as it is a potential
         * candidate for the next wake:
         */
        p->recent_used_cpu = prev;
        return recent_used_cpu; /* put p back on its recently used CPU */
    }

    /*
     * For asymmetric CPU capacity systems, our domain of interest is
     * sd_asym_cpucapacity rather than sd_llc.
     */
    if (static_branch_unlikely(&sched_asym_cpucapacity)) {
        sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target)); /* target's DIE-level sd */
        /*
         * On an asymmetric CPU capacity system where an exclusive cpuset
         * defines a symmetric island (i.e. one unique capacity_orig value
         * through the cpuset), the key will be set but the CPUs within that
         * cpuset will not have a domain with SD_ASYM_CPUCAPACITY. These
         * should follow the usual symmetric capacity path.
         */
        if (sd) {
            /*
             * select_idle_capacity() may return -1, but the (unsigned)i cast
             * below maps -1 to UINT_MAX, so the check falls back to target and
             * the -1 case is covered.
             */
            i = select_idle_capacity(p, sd, target);
            return ((unsigned)i < nr_cpumask_bits) ? i : target;
            /* returns here on big.LITTLE; the code below is not reached */
        }
    }

    /* from here on is the path taken by non-big.LITTLE systems */
    sd = rcu_dereference(per_cpu(sd_llc, target)); /* target's MC-level sd */
    if (!sd)
        return target;

    i = select_idle_core(p, sd, target); /* empty stub without CONFIG_SCHED_SMT */
    if ((unsigned)i < nr_cpumask_bits)
        return i;

    i = select_idle_cpu(p, sd, target); /* scans the LLC domain for an idle CPU */
    if ((unsigned)i < nr_cpumask_bits)
        return i;

    i = select_idle_smt(p, sd, target); /* empty stub without CONFIG_SCHED_SMT */
    if ((unsigned)i < nr_cpumask_bits)
        return i;

    return target;
}

/*
 * select_idle_sibling passes: p, the task being placed; sd, the DIE-level sd of target;
 * target, the candidate target CPU.
 *
 * Purpose: starting from target, walk the domain looking for an idle CPU whose capacity
 * can hold p; if no idle CPU fits, return the largest-capacity idle CPU; with no idle
 * CPU at all, return -1.
 */
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
    unsigned long task_util, best_cap = 0;
    int cpu, best_cpu = -1;
    struct cpumask *cpus;

    cpus = this_cpu_cpumask_var_ptr(select_idle_mask); /* assign before use */
    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

    task_util = uclamp_task_util(p);

    /* start the walk at target, reusing the earlier result that target's capacity fits p */
    for_each_cpu_wrap(cpu, cpus, target) {
        unsigned long cpu_cap = capacity_of(cpu);

        if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu)) /* skip non-idle CPUs */
            continue;
        if (fits_capacity(task_util, cpu_cap)) /* first idle CPU that can hold the task */
            return cpu;

        if (cpu_cap > best_cap) { /* otherwise remember the largest-capacity idle CPU */
            best_cap = cpu_cap;
            best_cpu = cpu;
        }
    }

    return best_cpu;
}
1. The function uses p's util, so the task's load is synced first.
2. The fast path checks, in order:
(1) If the incoming target is an idle CPU and can hold the task's uclamped util, pick this target_cpu directly.
(2) If target was picked by wake_affine, shares a cluster (L2 cache) with prev_cpu, prev_cpu is idle, and prev_cpu can hold the task's uclamped util, pick prev_cpu.
(3) If current is a per-cpu kernel thread, the task's prev_cpu is this very CPU, and at most one task is running here, pick prev_cpu. A typical example is IO completion.
(4) If p->recent_used_cpu is neither prev_cpu nor target_cpu, but shares a cluster with target_cpu, is idle, is allowed by the task's affinity, and has enough capacity, pick p->recent_used_cpu.
(5) Starting from target_cpu, walk the DIE domain for an idle CPU that can hold task p; if no idle CPU can hold it, return the largest-capacity idle CPU; with no idle CPU at all, select_idle_capacity() returns -1.
3. Note: select_idle_capacity() can return -1, but the caller's ((unsigned)i < nr_cpumask_bits) check maps -1 to UINT_MAX and falls back to target, so the -1 case is in fact handled; it is not a bug. The snippet below demonstrates the cast.
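A two-line demonstration of why the cast absorbs the -1 (nr_cpumask_bits is stubbed with a small hypothetical value):

/* (unsigned)(-1) == UINT_MAX, which is never below nr_cpumask_bits. */
#include <stdio.h>

int main(void)
{
    int i = -1;
    unsigned int nr_cpumask_bits = 8;                   /* hypothetical */
    printf("%d\n", (unsigned)i < nr_cpumask_bits);      /* prints 0: caller falls back to target */
    return 0;
}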
V. Summary
1. There are three task placement paths, with sd_flag being SD_BALANCE_WAKE (wakeup), SD_BALANCE_FORK (newly forked task), and SD_BALANCE_EXEC (exec). EAS placement and wake-affine apply only to the wakeup case. When EAS picks a CPU, selection ends there; a wake-affine pick, however, still goes through the fast path, which tries to find an idle CPU in the candidate's cluster or fall back to prev_cpu and, failing that, ends up considering all clusters. Fork and exec placements take the slow path, which hunts down the idlest CPU in the system as the target.
Reference: https://blog.csdn.net/feelabclihu/article/details/122007603?spm=1001.2014.3001.5501