The notion of load inside the scheduler is not the same thing as the familiar CPU utilization; the two differ substantially. This article analyzes CPU load and system load, not CPU usage. The code is based on CAF SM8250, kernel 4.19.
Load accounting breaks down into three levels, from smallest scope to largest:
1. Scheduling-entity load: update_load_avg() -- PELT
2. CPU load: cpu_load_update_active()
3. System load: calc_global_load_tick()
Each of the three reflects the current load from a different perspective.
A previous post analyzed load at the sched entity level, which is tracked by the PELT mechanism: it accounts, and continuously updates, the load each scheduling entity puts on a CPU. (On review, that analysis was not detailed or precise enough and deserves a follow-up.)
This post covers the remaining two: how CPU load and system load are computed.
CPU load calculation
CPU load reflects how loaded with work, and therefore how busy, a CPU currently is. It is derived from the average time tasks on the CPU's rq spend runnable (roughly runnable_load_avg = runnable_load_sum / LOAD_AVG_MAX), tracked over several window lengths to produce a set of moving averages that show the trend of CPU load.
PELT already tracks the average runnable time of each individual task. This analysis therefore focuses on how those per-task statistics are aggregated into moving averages over different periods to represent CPU load.
The code path is:
scheduler_tick()
-> cpu_load_update_active()
```
void cpu_load_update_active(struct rq *this_rq)
{
	unsigned long load = weighted_cpuload(this_rq);	/* load = cfs_rq->avg.runnable_load_avg */

	if (tick_nohz_tick_stopped())
		cpu_load_update_nohz(this_rq, READ_ONCE(jiffies), load);	/* (1) */
	else
		cpu_load_update_periodic(this_rq, load);			/* (2) */
}
```
Depending on whether NO_HZ is configured and the tick has been stopped, one of two branches is taken:
(1) cpu_load_update_nohz(this_rq, READ_ONCE(jiffies), load) -- update CPU load for the NO_HZ (tickless) case
(2) cpu_load_update_periodic() -- update CPU load periodically, for the normal (ticking) case
In the NO_HZ case the first branch is taken. pending_updates is the number of jiffies, i.e. ticks, elapsed since the last update. Elapsed seconds = jiffies / HZ (this platform runs HZ = 250), so one tick is 1/HZ = 4 ms.
```
/*
 * There is no sane way to deal with nohz on smp when using jiffies because the
 * CPU doing the jiffies update might drift wrt the CPU doing the jiffy reading
 * causing off-by-one errors in observed deltas; {0,2} instead of {1,1}.
 *
 * Therefore we need to avoid the delta approach from the regular tick when
 * possible since that would seriously skew the load calculation. This is why we
 * use cpu_load_update_periodic() for CPUs out of nohz. However we'll rely on
 * jiffies deltas for updates happening while in nohz mode (idle ticks, idle
 * loop exit, nohz_idle_balance, nohz full exit...)
 *
 * This means we might still be one tick off for nohz periods.
 */
static void cpu_load_update_nohz(struct rq *this_rq,
				 unsigned long curr_jiffies,
				 unsigned long load)
{
	unsigned long pending_updates;

	pending_updates = curr_jiffies - this_rq->last_load_update_tick;	/* ticks missed since last update */
	if (pending_updates) {
		this_rq->last_load_update_tick = curr_jiffies;	/* refresh the timestamp */
		/*
		 * In the regular NOHZ case, we were idle, this means load 0.
		 * In the NOHZ_FULL case, we were non-idle, we should consider
		 * its weighted load.
		 */
		cpu_load_update(this_rq, load, pending_updates);	/* (2-1) update rq->cpu_load[] */
	}
}
```
Now look at branch (2):
```
static void cpu_load_update_periodic(struct rq *this_rq, unsigned long load)
{
#ifdef CONFIG_NO_HZ_COMMON
	/* See the mess around cpu_load_update_nohz(). */
	this_rq->last_load_update_tick = READ_ONCE(jiffies);	/* record the update timestamp */
#endif
	cpu_load_update(this_rq, load, 1);	/* (2-1) update rq->cpu_load[] */
}
```
Both paths end up calling cpu_load_update() to update the rq's cpu_load[] data.
(2-1) Updating the rq's cpu_load[] data
```
/**
 * __cpu_load_update - update the rq->cpu_load[] statistics
 * @this_rq: The rq to update statistics for
 * @this_load: The current load
 * @pending_updates: The number of missed updates
 *
 * Update rq->cpu_load[] statistics. This function is usually called every
 * scheduler tick (TICK_NSEC).
 *
 * This function computes a decaying average:
 *
 *   load[i]' = (1 - 1/2^i) * load[i] + (1/2^i) * load
 *
 * Because of NOHZ it might not get called on every tick which gives need for
 * the @pending_updates argument.
 *
 *   load[i]_n = (1 - 1/2^i) * load[i]_n-1 + (1/2^i) * load_n-1
 *             = A * load[i]_n-1 + B ; A := (1 - 1/2^i), B := (1/2^i) * load
 *             = A * (A * load[i]_n-2 + B) + B
 *             = A * (A * (A * load[i]_n-3 + B) + B) + B
 *             = A^3 * load[i]_n-3 + (A^2 + A + 1) * B
 *             = A^n * load[i]_0 + (A^(n-1) + A^(n-2) + ... + 1) * B
 *             = A^n * load[i]_0 + ((1 - A^n) / (1 - A)) * B
 *             = (1 - 1/2^i)^n * (load[i]_0 - load) + load
 *
 * In the above we've assumed load_n := load, which is true for NOHZ_FULL as
 * any change in load would have resulted in the tick being turned back on.
 *
 * For regular NOHZ, this reduces to:
 *
 *   load[i]_n = (1 - 1/2^i)^n * load[i]_0
 *
 * see decay_load_misses(). For NOHZ_FULL we get to subtract and add the extra
 * term.
 */
static void cpu_load_update(struct rq *this_rq, unsigned long this_load,
			    unsigned long pending_updates)
{
	unsigned long __maybe_unused tickless_load = this_rq->cpu_load[0];	/* previous cpu_load[0] value */
	int i, scale;

	this_rq->nr_load_updates++;	/* count cpu_load updates */

	/* Update our load: */
	this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */	/* update cpu_load[0] */
	for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {	/* update cpu_load[1..4] */
		unsigned long old_load, new_load;

		/* scale is effectively 1 << i now, and >> i divides by scale */

		old_load = this_rq->cpu_load[i];
#ifdef CONFIG_NO_HZ_COMMON
		/* (2-1-1) age the old load; if pending_updates == 1, no aging is needed */
		old_load = decay_load_missed(old_load, pending_updates - 1, i);
		if (tickless_load) {	/* the previous cpu_load[0] was non-zero */
			/* also age tickless_load (why it is added at all is discussed below) */
			old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
			/*
			 * old_load can never be a negative value because a
			 * decayed tickless_load cannot be greater than the
			 * original tickless_load.
			 */
			old_load += tickless_load;
		}
#endif
		new_load = this_load;
		/*
		 * Round up the averaging division if load is increasing. This
		 * prevents us from getting stuck on 9 if the load is 10, for
		 * example.
		 */
		if (new_load > old_load)	/* compensation: without it, when old_load < new_load the average could never reach the maximum */
			new_load += scale - 1;

		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;	/* compute the new cpu_load value */
	}
}
```
The final step updates five values, one moving average per window length:

| i | field | window (ticks) | update formula |
|---|-------|----------------|----------------|
| 0 | cpu_load[0] | 0 | cpu_load = cur_load |
| 1 | cpu_load[1] | 8 | cpu_load = 1/2 * load + 1/2 * cur_load |
| 2 | cpu_load[2] | 32 | cpu_load = 3/4 * load + 1/4 * cur_load |
| 3 | cpu_load[3] | 64 | cpu_load = 7/8 * load + 1/8 * cur_load |
| 4 | cpu_load[4] | 128 | cpu_load = 15/16 * load + 1/16 * cur_load |

To unpack the table: CPU load is tracked as five moving averages with periods {0, 8, 32, 64, 128}, in ticks. Each can be read as the load over the last period's worth of ticks counted back from the current one; if there has been no load within the period, the average decays to 0.
The update uses the formula: load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load, where idx is the i in the table, load is the old value and cur_load the new one.
Rearranged: new load = decay_coefficient * old load + (1 - decay_coefficient) * new load, where decay_coefficient = (2^idx - 1) / 2^idx; this is the cpu_load formula shown per row in the table.
Now consider a tickless (NO_HZ) system, which ours is: when ticks have been missed, the old load must first be aged, and only then is the new load folded in to recompute cpu_load[1..4]. A lookup table is used to keep the computation cheap.
PS: NO_HZ generally means the CPU was sleeping, during which the load is taken to be 0, so in principle only the decay of the old load needs to be applied.
(2-1-1) Aging the old load; if pending_updates == 1, no aging is needed
```
/*
 * The exact cpuload calculated at every tick would be:
 *
 *   load' = (1 - 1/2^i) * load + (1/2^i) * cur_load
 *
 * If a CPU misses updates for n ticks (as it was idle) and update gets
 * called on the n+1-th tick when CPU may be busy, then we have:
 *
 *   load_n = (1 - 1/2^i)^n * load_0
 *   load_n+1 = (1 - 1/2^i) * load_n + (1/2^i) * cur_load
 *
 * decay_load_missed() below does efficient calculation of
 *
 *   load' = (1 - 1/2^i)^n * load
 *
 * Because x^(n+m) := x^n * x^m we can decompose any x^n in power-of-2 factors.
 * This allows us to precompute the above in said factors, thereby allowing the
 * reduction of an arbitrary n in O(log_2 n) steps. (See also
 * fixed_power_int())
 *
 * The calculation is approximated on a 128 point scale.
 */
#define DEGRADE_SHIFT		7

static const u8 degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128};
static const u8 degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = {
	{   0,   0,  0,  0,  0,  0, 0, 0 },
	{  64,  32,  8,  0,  0,  0, 0, 0 },
	{  96,  72, 40, 12,  1,  0, 0, 0 },
	{ 112,  98, 75, 43, 15,  1, 0, 0 },
	{ 120, 112, 98, 76, 45, 16, 2, 0 }
};

/*
 * Update cpu_load for any missed ticks, due to tickless idle. The backlog
 * would be when CPU is idle and so we just decay the old load without
 * adding any new load.
 */
static unsigned long
decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
{
	int j = 0;

	if (!missed_updates)
		return load;

	/* past a whole window for this line, the system has slept that long, so the old load is cleared */
	if (missed_updates >= degrade_zero_ticks[idx])
		return 0;

	/* for idx == 1 the old/new weights are 1/2 each, so decay is just a shift by missed_updates */
	if (idx == 1)
		return load >> missed_updates;

	while (missed_updates) {
		if (missed_updates % 2)
			load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT;

		missed_updates >>= 1;
		j++;
	}
	return load;
}
```
For every window length, aging uses the same formula:

load_n = (1 - 1/2^i)^n * load_0

where i is idx and n is pending_updates - 1, i.e. missed_updates.
tickless_load
A few words on tickless_load: I noticed it does not exist in earlier kernel versions, meaning the aging of CPU load used to follow the two formulas above exactly.
With tickless_load added, the cpu_load[0] value from the last update tick is now also folded into the old load (old_load).
I believe the motivation is a scenario like this:
1. At the previous update of a given average the load was small -- this part is the old load.
2. A short burst of load followed -- this part is tickless_load.
3. Before the next update of that average, the CPU went to sleep.
4. At wakeup the load was also small.
5. At this update of the average the load is still small (same as at wakeup).
Without tickless_load, this update would only decay the old load and then mix in the current load via the formula. The result would be far from reality: the longer the window, the smaller the weight of the new load and the larger the weight of the old, so the final value would deviate from the true average. The short burst would be lost entirely, and the computed value could not reflect the CPU's real average load.
With tickless_load, the load from just before the sleep is folded into the old-load term, so that short burst is no longer dropped from the calculation.
So with tickless_load taken into account, the final aging formula becomes:

load_n = (1 - 1/2^i)^n * load_0 + [1 - (1 - 1/2^i)^n] * tickless_load

The result load_n is the aged old_load; the per-window cpu_load formula from the table above is then applied to it to compute the latest CPU load value.
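Writing d = (1 - 1/2^i) for the per-tick decay factor, the three steps in cpu_load_update() (decay old_load, subtract the decayed tickless_load, add back the full tickless_load) collapse to exactly this closed form:

```latex
\begin{aligned}
\mathrm{old\_load}' &= d^{\,n}\,\mathrm{load}_0 \;-\; d^{\,n}\,T \;+\; T
  && \text{(what the code computes, with } T = \mathrm{tickless\_load}\text{)}\\
&= d^{\,n}\,\mathrm{load}_0 \;+\; \bigl(1 - d^{\,n}\bigr)\,T
  && \text{(the aging formula above)}
\end{aligned}
```

In other words, the code decays only the difference (load_0 - tickless_load) and keeps tickless_load itself undecayed, which is algebraically the same as weighting it by [1 - (1 - 1/2^i)^n].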
This reading of tickless_load is my own interpretation; it is best verified against the commit messages in the kernel history. If anything here is wrong, please point it out.
Why is CPU load designed this way?
1. The averages over different windows reflect load over different time horizons; they mainly serve load_balance() as comparison baselines when deciding whether to balance in different scenarios, with cpu_load[0] and cpu_load[1] the most commonly used (to be checked in detail when analyzing load_balance later).
2. Using several window lengths smooths out sample jitter and makes the direction of the trend clear.
System load calculation
The system-wide load average can be viewed with any of uptime, top, or cat /proc/loadavg:
```
$ uptime
 16:48:24 up 4:11, 1 user, load average: 25.25, 23.40, 23.46
$ top
top - 16:48:42 up 4:12, 1 user, load average: 25.25, 23.14, 23.37
$ cat /proc/loadavg
25.72 23.19 23.35 42/3411 43603
```
The three numbers after "load average:" are the 1-minute, 5-minute, and 15-minute load averages. They can be read as follows:
If the averages are 0.0, then your system is idle.
If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
If they are higher than your CPU count, then you might have a performance problem (it depends).
The earliest system load average counted only runnable tasks, but Linux later concluded that this did not represent the system's real load.
An example: after the system is fitted with a slower disk, its runnable load can come out even lower than with the faster disk. Linux therefore treats uninterruptible sleep (TASK_UNINTERRUPTIBLE) as a form of load too: the system is failing to serve those tasks because the I/O or peripheral load is too heavy.
Accordingly, the system-load accounting function calc_global_load_tick() counts (this_rq->nr_running + this_rq->nr_uninterruptible) toward the load.
The code walkthrough follows:
scheduler_tick()
-> calc_global_load_tick()
```
/*
 * Called from scheduler_tick() to periodically update this CPU's
 * active count.
 */
void calc_global_load_tick(struct rq *this_rq)
{
	long delta;

	if (time_before(jiffies, this_rq->calc_load_update))	/* skip if the load was already updated in this window */
		return;

	delta = calc_load_fold_active(this_rq, 0);	/* (1) update the nr_running + uninterruptible task count */
	if (delta)
		atomic_long_add(delta, &calc_load_tasks);	/* fold the count into the global calc_load_tasks */

	this_rq->calc_load_update += LOAD_FREQ;	/* next system-load update time: now + 5 s */
}
```
(1) Updating the count of nr_running + uninterruptible tasks
```
long calc_load_fold_active(struct rq *this_rq, long adjust)
{
	long nr_active, delta = 0;

	nr_active = this_rq->nr_running - adjust;
	nr_active += (long)this_rq->nr_uninterruptible;	/* count all nr_running and uninterruptible tasks */

	if (nr_active != this_rq->calc_load_active) {	/* calc_load_active holds the previously reported count */
		delta = nr_active - this_rq->calc_load_active;	/* compute the difference */
		this_rq->calc_load_active = nr_active;	/* store the new count */
	}

	return delta;
}
```
This part runs at intervals of 5 s or more per CPU, updating the count of nr_running + uninterruptible tasks and folding the change into the global variable calc_load_tasks.
In addition, the actual computation of the load averages is triggered when the system's jiffies value is updated.
The kernel registers tick_setup_sched_timer, which emulates the tick with a high-resolution timer, updates jiffies, and so on; the system load is also computed from there.
When the timer fires, tick_sched_timer() runs; if the current CPU is tick_do_timer_cpu, it performs the system load calculation. The computation is therefore done by exactly one CPU: tick_do_timer_cpu.
tick_sched_timer() -> tick_sched_do_timer() -> tick_do_update_jiffies64() -> do_timer() -> calc_global_load()
```
/*
 * calc_load - update the avenrun load estimates 10 ticks after the
 * CPUs have updated calc_load_tasks.
 *
 * Called from the global timer code.
 */
void calc_global_load(unsigned long ticks)
{
	unsigned long sample_window;
	long active, delta;

	sample_window = READ_ONCE(calc_load_update);	/* timestamp advanced from scheduler_tick */
	/*
	 * Wait 10 ticks past the window so that every CPU has folded its
	 * calc_load_tasks delta (so the full interval is 5 s + 10 ticks).
	 */
	if (time_before(jiffies, sample_window + 10))
		return;

	/*
	 * Fold the 'old' NO_HZ-delta to include all NO_HZ CPUs.
	 */
	delta = calc_load_nohz_fold();	/* (2) collect task counts from CPUs that were in NO_HZ */
	if (delta)
		atomic_long_add(delta, &calc_load_tasks);	/* update the global nr_running + uninterruptible count */

	active = atomic_long_read(&calc_load_tasks);	/* read the global nr_running + uninterruptible count */
	active = active > 0 ? active * FIXED_1 : 0;	/* scale by FIXED_1 */

	avenrun[0] = calc_load(avenrun[0], EXP_1, active);	/* (3) 1-minute system load */
	avenrun[1] = calc_load(avenrun[1], EXP_5, active);	/* 5-minute system load */
	avenrun[2] = calc_load(avenrun[2], EXP_15, active);	/* 15-minute system load */

	WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ);	/* advance the timestamp */

	/*
	 * In case we went to NO_HZ for multiple LOAD_FREQ intervals
	 * catch up in bulk.
	 */
	calc_global_nohz();	/* (4) */
}
```
(2) Collect the task counts of CPUs that were in NO_HZ (presumably because, being idle, they missed their per-tick accounting, so it is caught up here)
```
static long calc_load_nohz_fold(void)
{
	int idx = calc_load_read_idx();
	long delta = 0;

	if (atomic_long_read(&calc_load_nohz[idx]))
		delta = atomic_long_xchg(&calc_load_nohz[idx], 0);

	return delta;
}
```
```
/*
 * Handle NO_HZ for the global load-average.
 *
 * Since the above described distributed algorithm to compute the global
 * load-average relies on per-CPU sampling from the tick, it is affected by
 * NO_HZ.
 *
 * The basic idea is to fold the nr_active delta into a global NO_HZ-delta upon
 * entering NO_HZ state such that we can include this as an 'extra' CPU delta
 * when we read the global state.
 *
 * Obviously reality has to ruin such a delightfully simple scheme:
 *
 *  - When we go NO_HZ idle during the window, we can negate our sample
 *    contribution, causing under-accounting.
 *
 *    We avoid this by keeping two NO_HZ-delta counters and flipping them
 *    when the window starts, thus separating old and new NO_HZ load.
 *
 *    The only trick is the slight shift in index flip for read vs write.
 *
 *        0s            5s            10s           15s
 *          +10           +10           +10           +10
 *        |-|-----------|-|-----------|-|-----------|-|
 *    r:0 0 1           1 0           0 1           1 0
 *    w:0 1 1           0 0           1 1           0 0
 *
 *    This ensures we'll fold the old NO_HZ contribution in this window while
 *    accumulating the new one.
 *
 *  - When we wake up from NO_HZ during the window, we push up our
 *    contribution, since we effectively move our sample point to a known
 *    busy state.
 *
 *    This is solved by pushing the window forward, and thus skipping the
 *    sample, for this CPU (effectively using the NO_HZ-delta for this CPU which
 *    was in effect at the time the window opened). This also solves the issue
 *    of having to deal with a CPU having been in NO_HZ for multiple LOAD_FREQ
 *    intervals.
 *
 * When making the ILB scale, we should try to pull this in as well.
 */
static atomic_long_t calc_load_nohz[2];
static int calc_load_idx;
```
(3) Computing the 1-minute system load; the 5- and 15-minute loads are computed the same way
```
/*
 * a1 = a0 * e + a * (1 - e)
 */
static inline unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	unsigned long newload;

	newload = load * exp + active * (FIXED_1 - exp);
	if (active >= load)
		newload += FIXED_1 - 1;

	return newload / FIXED_1;
}

#define FSHIFT		11		/* nr of bits of precision */
#define FIXED_1		(1 << FSHIFT)	/* 1.0 as fixed-point */
#define LOAD_FREQ	(5*HZ+1)	/* 5 sec intervals */
#define EXP_1		1884		/* 1/exp(5sec/1min) as fixed-point */
#define EXP_5		2014		/* 1/exp(5sec/5min) */
#define EXP_15		2037		/* 1/exp(5sec/15min) */
```
The core idea of calc_load() is: old_load * decay_coefficient + new_load * (1 - decay_coefficient).
The 1-minute load is therefore computed as:

old_load * (EXP_1 / FIXED_1) + new_load * (1 - EXP_1 / FIXED_1)

where:

FIXED_1 = 2^11 = 2048
EXP_1 = 1884
EXP_5 = 2014
EXP_15 = 2037

For the 5- and 15-minute loads, simply replace EXP_1 in the formula with EXP_5 or EXP_15.
Seen from the computation, the system load is in essence a statistic over the count of nr_running + uninterruptible tasks.
(4) Because of NO_HZ, several whole sample windows (LOAD_FREQ intervals) may have been missed; these must be taken into account too. The load is recomputed based on the exact number of windows missed.
```
/*
 * NO_HZ can leave us missing all per-CPU ticks calling
 * calc_load_fold_active(), but since a NO_HZ CPU folds its delta into
 * calc_load_nohz per calc_load_nohz_start(), all we need to do is fold
 * in the pending NO_HZ delta if our NO_HZ period crossed a load cycle boundary.
 *
 * Once we've updated the global active value, we need to apply the exponential
 * weights adjusted to the number of cycles missed.
 */
static void calc_global_nohz(void)
{
	unsigned long sample_window;
	long delta, active, n;

	sample_window = READ_ONCE(calc_load_update);
	if (!time_before(jiffies, sample_window + 10)) {
		/*
		 * Catch-up, fold however many we are behind still
		 */
		delta = jiffies - sample_window - 10;
		n = 1 + (delta / LOAD_FREQ);

		active = atomic_long_read(&calc_load_tasks);
		active = active > 0 ? active * FIXED_1 : 0;

		avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
		avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
		avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

		WRITE_ONCE(calc_load_update, sample_window + n * LOAD_FREQ);
	}

	/*
	 * Flip the NO_HZ index...
	 *
	 * Make sure we first write the new time then flip the index, so that
	 * calc_load_write_idx() will see the new time when it reads the new
	 * index, this avoids a double flip messing things up.
	 */
	smp_wmb();
	calc_load_idx++;
}
```
Finally, the resulting system load can be read with cat /proc/loadavg.
The code:
```
static int loadavg_proc_show(struct seq_file *m, void *v)
{
	unsigned long avnrun[3];

	get_avenrun(avnrun, FIXED_1/200, 0);

	seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
		LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
		LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
		LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
		nr_running(), nr_threads,
		idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
	return 0;
}
```
Summary
1. CPU load is computed on every scheduler_tick().
The statistic is the cpu rq's runnable_load_avg, and the formula used is: new load = decay_coefficient * old load + (1 - decay_coefficient) * new load, with decay_coefficient = (2^idx - 1) / 2^idx, idx = 0, 1, 2, 3, 4.
Five CPU-load moving averages over different windows are maintained; they serve as load indicators and comparison baselines in different scenarios and show the load trend.
2. System load accounting counts runnable + uninterruptible tasks in scheduler_tick() (sampled every 5 s); the load averages themselves are computed when the sched tick timer fires (every 5 s + 10 ticks).
The statistic is the count of runnable + uninterruptible tasks, and the formula used is: old load * decay_coefficient + new load * (1 - decay_coefficient).
The 1-minute, 5-minute, and 15-minute system load averages are maintained and can be read from the /proc node.
PS: at first glance the two formulas look almost identical, but the loads they track are entirely different; keep the distinction in mind!
References:
https://blog.csdn.net/pwl999/article/details/78817902
https://blog.csdn.net/wukongmingjing/article/details/82531950