CPU Load Balancing: Learning WALT [Repost]

Reposted 2020-10-30 11:01 [Linux kernel]

Reposted from: https://blog.csdn.net/xiaoqiaoq0/article/details/107135747/

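Preface

This article continues my notes on WALT in the CPU scheduler and covers the following:

What is WALT?
How is WALT calculated?
How is the data calculated by WALT used?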

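1. What is WALT?

WALT stands for Window-Assisted Load Tracking:
- Taken literally, it tracks CPU load with the help of windows;
- In essence it is a calculation method that expresses the current CPU load as data, which task scheduling, migration, load balancing and related functions then consume;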

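1.1 Why is WALT needed?

A new technique, and especially a new way of calculating something, is introduced either because the existing approach no longer meets current needs or because the new one saves people effort;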

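1.1.1 Shortcomings of the PELT calculation

When PELT was introduced, Linux was used mainly on servers, where raw performance mattered most and power consumption was not yet a priority. With the rise of mobile devices, power consumption and responsiveness became things users perceive directly, and they are now the main considerations:

On today's mobile devices, UI handling must respond as quickly as possible, otherwise the user clearly notices jank;
On today's mobile devices, power consumption cannot be ignored: a phone that needs frequent recharging will not sell well;
Whether a task counts as heavy depends on the user scenario. For example, the importance of a task varies with the content being displayed, so even tasks of the same class need to be treated dynamically;

PELT-based scheduling (built on decay) reflects continuous trends well, but it handles sudden changes poorly:

It responds slowly to rapid rises and falls: because of the decay calculation, an actual rise or drop in load shows up in the data only after a number of periods;
Because of the decay mechanism, the computed load of a task shrinks after the task sleeps for a while. If the task is something like a network transfer that periodically needs CPU and frequency capacity, PELT cannot respond quickly (its calculation reflects trends and averages);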
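To make the slow ramp concrete, here is a minimal user-space sketch, not kernel code. It assumes the commonly documented PELT parameters (periods of roughly 1 ms, with a decay factor y such that y^32 = 0.5) and a 20 ms window for the windowed tracker. The helper names `pelt_periods_to_reach` and `walt_ms_to_reach_full` are made up for this illustration.

```c
/* Per-period PELT decay factor, chosen so that PELT_Y^32 is about 0.5. */
#define PELT_Y 0.97857206

/*
 * Hypothetical helper: number of ~1 ms periods a long-idle task must run
 * continuously before its PELT-style utilization (as a fraction of the
 * maximum) reaches `target`, where 0 < target < 1.
 */
static int pelt_periods_to_reach(double target)
{
	double util = 0.0;
	int periods = 0;

	while (util < target) {
		/* Decay the old contribution, then add one fully busy period. */
		util = util * PELT_Y + (1.0 - PELT_Y);
		periods++;
	}
	return periods;
}

/*
 * Hypothetical helper: a windowed tracker can report a continuously
 * running task as fully busy once one complete window has elapsed.
 */
static int walt_ms_to_reach_full(int window_ms)
{
	return window_ms;
}
```

Under these assumptions, `pelt_periods_to_reach(0.8)` needs about 75 periods before the signal shows 80% utilization, while a 20 ms window can already report a fully busy window after 20 ms.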

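1.2 How WALT handles this

For the reasons above, what is needed is a calculation that builds on PELT (keeping its benefits) and fits current requirements better:

Data is reported more promptly;
Data directly reflects the current state;
The computational cost does not increase;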

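1.2.1 What WALT does

Here is my summary of what WALT can (and needs to) achieve:

Keep tracking every task entity;
On top of the existing usage (load), record demand, which is used for later prediction;
The overall load of each CPU's runqueue is still the sum over all of its tasks;
The core difference is the calculation, which switches from decay to windows, so the collected data reflects actual changes faster (compared with PELT's trend). Some statements from the Linux material:
1. A task’s demand is the maximum of its contribution to the most recently completed window and its average demand over the past N windows.
2. WALT “forgets” blocked time entirely: only runnable and running time are counted, so a task's actual time consumption is measured more accurately and can be predicted from demand;
3. CPU busy time - The sum of execution times of all tasks in the most recently completed window;
4. WALT “forgets” cpu utilization as soon as tasks are taken off of the runqueue;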

1.2.2 Where it is used

Collecting CPU and task load statistics before a task is placed;
Task migration;
Placement on big or little cores;
EAS placement;

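1.3 When it was introduced

Linux: introduced after 4.8.2 (however, browsing the code on bootlin, even the latest 5.8 has no corresponding file, which suggests it was not merged into mainline);
Android: introduced after 4.4 (the Android kernel 4.9 does contain this code);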

2. How to enable WALT in the kernel

The Android kernel code already integrates WALT, but depending on the vendor it may not be enabled:

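Enable it through the config option:
1. menuconfig ==> General setup ==> CPU/Task time and stats accounting ==> support window based load tracking
2. [Figure: menuconfig screenshot]
Or edit the defconfig directly:
1. Add CONFIG_SCHED_WALT=y to kernel/arch/arm64/configs/defconfig
Build the image and verify that the change took effect:
demo:/sys/kernel/tracing # zcat /proc/config.gz | grep WALT

CONFIG_SCHED_WALT=y
CONFIG_HID_WALTOP=y
(CONFIG_HID_WALTOP is an unrelated HID driver option that merely matches the grep pattern.)
Testing
So far I can only confirm in ftrace that WALT data is being collected. I have no real workload to show whether it brings an improvement, and no other measurements (the Linux material includes some numbers, but they were not measured locally);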

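3. WALT calculation

This section explains, from both the principle and the code, how WALT calculates:

How are windows divided?
How are tasks classified, and how is each class handled?
How is the WALT data updated?
How is the updated WALT data used by the scheduler and EAS?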

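3.1 Window division

First, how is the auxiliary item, the window, divided?
Put simply, the time since boot is divided into periods of a fixed length; the load of each task is counted per period and updated into the runqueue;

What else needs to be considered?

How long should one period (window) be? This is tuned per project; the kernel currently uses 20 ms as the standard value;
How many windows of load are tracked? This is also adjusted per project; the kernel currently uses 5 windows;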

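So a task and a window can relate in the following ways:
[Figure: position of a task relative to the window boundaries]
Note: ms = mark_start (start of the task event), ws = window_start (start of the current window), wc = wallclock (current system time)

The task started in this window and is still in this window when accounting happens, i.e. the task stays within one window;
The task started in the previous window and accounting happens in the current window, i.e. the task spans two windows;
The task started in some earlier window and accounting happens in the current window, i.e. the task spans several complete windows;

These three cases are the only ways a task can be divided across windows, and all calculations are based on them;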
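The three cases can be told apart with two comparisons. The following is a small user-space sketch, assuming all timestamps use the same unit; the enum and the function `classify_window_case` are names invented for this illustration and do not exist in the kernel.

```c
#include <stdint.h>

enum window_case {
	CASE_SAME_WINDOW,      /* ms is not before ws: event stays in the current window */
	CASE_TWO_WINDOWS,      /* ms < ws and no full window lies in between */
	CASE_MULTIPLE_WINDOWS, /* ms < ws and at least one full window lies in between */
};

/* Classify an event by mark_start (ms), window_start (ws) and window size. */
static enum window_case classify_window_case(uint64_t mark_start,
					     uint64_t window_start,
					     uint64_t window_size)
{
	if (mark_start >= window_start)
		return CASE_SAME_WINDOW;
	if ((window_start - mark_start) / window_size == 0)
		return CASE_TWO_WINDOWS;
	return CASE_MULTIPLE_WINDOWS;
}
```

With a 20 ms window starting at t = 100 ms, an event marked at 105 ms stays in one window, one marked at 95 ms spans two windows, and one marked at 50 ms spans several.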

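3.2 Task classification

As one would expect, the calculation differs for tasks of different classes or in different states. WALT distinguishes the following classes of task events:
[Figure: task classification]
The figure also lists the function that raises each task event;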

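3.2.1 Deciding whether to update demand

When updating demand, WALT first decides from the task event whether an update is needed:
[Figure: how demand handling differs by event]
The corresponding function: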

static int account_busy_for_task_demand(struct task_struct *p, int event)
{
	/* No need to bother updating task demand for exiting tasks
	 * or the idle task. */
	// Nothing to account if the task has exited or is the idle task
	if (exiting_task(p) || is_idle_task(p))
		return 0;

	/* When a task is waking up it is completing a segment of non-busy
	 * time. Likewise, if wait time is not treated as busy time, then
	 * when a task begins to run or is migrated, it is not running and
	 * is completing a segment of non-busy time. */
	// walt_account_wait_time defaults to 1, so by default only TASK_WAKE returns 0 here
	if (event == TASK_WAKE || (!walt_account_wait_time &&
			 (event == PICK_NEXT_TASK || event == TASK_MIGRATE)))
		return 0;

	return 1;
}

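3.2.2 Deciding whether to update CPU busy time

When updating CPU busy time, WALT first decides from the task event whether an update is needed:
[Figure: how busy time handling differs by event]
The corresponding function: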

static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p,
				     u64 irqtime, int event)
{
	// Is this the idle task or some other task?
	if (is_idle_task(p)) {
		/* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
		// schedule() picked the idle task as the next task
		if (event == PICK_NEXT_TASK)
			return 0;

		/* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
		// An idle task with irq time, or on a CPU waiting on IO, still counts as busy time
		return irqtime || cpu_is_waiting_on_io(rq);
	}

	// Wakeups are not accounted
	if (event == TASK_WAKE)
		return 0;

	// For non-idle tasks, the following event types are accounted
	if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
					 event == TASK_UPDATE)
		return 1;

	/* Only TASK_MIGRATE && PICK_NEXT_TASK left */
	// Defaults to 0
	return walt_freq_account_wait_time;
}

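3.3 How is the data updated? (call logic)

The previous two subsections covered how a task is accounted across windows and how WALT decides which data to update for which task event. This subsection looks at the core call logic, starting with a diagram:
[Figure: WALT call flow]
The diagram was exported from XMind and may not be zoomable, so the flow is described here:

The entry function walt_update_task_ravg
The demand update function
The CPU busy time update function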

3.3.1 The entry function

walt_update_task_ravg
The corresponding function:

/* Reflect task activity on its demand and cpu's busy time statistics */
void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
	     int event, u64 wallclock, u64 irqtime)
{
	// Early return
	if (walt_disabled || !rq->window_start)
		return;

	lockdep_assert_held(&rq->lock);

	// Update window_start and cum_window_demand
	update_window_start(rq, wallclock);

	if (!p->ravg.mark_start)
		goto done;

	// Update the data: demand and busy time
	update_task_demand(p, rq, event, wallclock);
	update_cpu_busy_time(p, rq, event, wallclock, irqtime);

done:
	// trace
	trace_walt_update_task_ravg(p, rq, event, wallclock, irqtime);

	// Update mark_start
	p->ravg.mark_start = wallclock;
}

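The function does three main things:

It updates the current window start time in preparation for the data updates that follow;
It updates the demand value of the task; note that this also updates the corresponding data in the rq;
It updates the CPU busy time contributed by the task;

This function is the main entry point of the WALT calculation. It is called from many places (the leftmost part of the diagram above). In short, load is updated on interrupts, wakeups, migration and scheduling; the call sites are not all described in detail here: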

task awakened
task starts executing
task stops executing
task exit
window rollover
interrupt
scheduler_tick
task migration
freq change

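3.3.2 Updating window start

This step updates window_start before the calculation so that the window start value of the rq is accurate:
[Figure: update_window_start flow]
The corresponding function: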

static void update_window_start(struct rq *rq, u64 wallclock)
{
	s64 delta;
	int nr_windows;

	// Elapsed time since the window started
	delta = wallclock - rq->window_start;
	/* If the MPM global timer is cleared, set delta as 0 to avoid kernel BUG happening */
	if (delta < 0) {
		delta = 0;
		/*
		 * WARN_ONCE(1,
		 * "WALT wallclock appears to have gone backwards or reset\n");
		 */
	}

	if (delta < walt_ravg_window) // Less than one window has elapsed: return
		return;

	nr_windows = div64_u64(delta, walt_ravg_window); // Number of elapsed windows
	rq->window_start += (u64)nr_windows * (u64)walt_ravg_window; // Advance window_start

	rq->cum_window_demand = rq->cumulative_runnable_avg; // In effect this still uses cumulative_runnable_avg
}
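The arithmetic in update_window_start can be checked with a small user-space model. This is only a sketch: `advance_window_start` is a hypothetical name, and it models just the window_start update, not the cum_window_demand reset.

```c
#include <stdint.h>

/*
 * Hypothetical model of the window_start update: advance window_start by
 * the number of whole windows that fit between it and wallclock.
 */
static uint64_t advance_window_start(uint64_t window_start, uint64_t wallclock,
				     uint64_t window_size)
{
	uint64_t nr_windows;

	/* Clock appears to have gone backwards: treat the delta as 0. */
	if (wallclock < window_start)
		return window_start;

	/* Still inside the current window: nothing to do. */
	if (wallclock - window_start < window_size)
		return window_start;

	nr_windows = (wallclock - window_start) / window_size;
	return window_start + nr_windows * window_size;
}
```

For example, with a 20 ms window that started at 100 ms, a wallclock of 167 ms means three whole windows have elapsed, so window_start moves to 160 ms and the remaining 7 ms belong to the new current window.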

3.3.3 Updating demand

3.3.3.1 Main demand logic:

[Figure: update_task_demand flow]
The corresponding function:

/*
 * Account cpu demand of task and/or update task's cpu demand history
 *
 * ms = p->ravg.mark_start;
 * wc = wallclock
 * ws = rq->window_start
 *
 * Three possibilities:
 *
 *   a) Task event is contained within one window.
 *      window_start < mark_start < wallclock
 *
 *      ws   ms  wc
 *      |    |   |
 *      V    V   V
 *      |---------------|
 *
 *   In this case, p->ravg.sum is updated *iff* event is appropriate
 *   (ex: event == PUT_PREV_TASK)
 *
 *   b) Task event spans two windows.
 *      mark_start < window_start < wallclock
 *
 *      ms   ws   wc
 *      |    |    |
 *      V    V    V
 *      -----|-------------------
 *
 *   In this case, p->ravg.sum is updated with (ws - ms) *iff* event
 *   is appropriate, then a new window sample is recorded followed
 *   by p->ravg.sum being set to (wc - ws) *iff* event is appropriate.
 *
 *   c) Task event spans more than two windows.
 *
 *      ms ws_tmp                          ws  wc
 *      |  |                               |   |
 *      V  V                               V   V
 *      ---|-------|-------|-------|-------|------
 *         |                               |
 *         |<------ nr_full_windows ------>|
 *
 *   In this case, p->ravg.sum is updated with (ws_tmp - ms) first *iff*
 *   event is appropriate, window sample of p->ravg.sum is recorded,
 *   'nr_full_window' samples of window_size is also recorded *iff*
 *   event is appropriate and finally p->ravg.sum is set to (wc - ws)
 *   *iff* event is appropriate.
 *
 * IMPORTANT : Leave p->ravg.mark_start unchanged, as update_cpu_busy_time()
 * depends on it!
 */
static void update_task_demand(struct task_struct *p, struct rq *rq,
	     int event, u64 wallclock)
{
	u64 mark_start = p->ravg.mark_start; // mark_start comes from the task
	u64 delta, window_start = rq->window_start; // window_start comes from the rq
	int new_window, nr_full_windows;
	u32 window_size = walt_ravg_window;

	// First check: compare ms and ws, i.e. did the task event start before the current window?
	new_window = mark_start < window_start;
	if (!account_busy_for_task_demand(p, event)) {
		if (new_window)
			/* If the time accounted isn't being accounted as
			 * busy time, and a new window started, only the
			 * previous window need be closed out with the
			 * pre-existing demand. Multiple windows may have
			 * elapsed, but since empty windows are dropped,
			 * it is not necessary to account those. */
			update_history(rq, p, p->ravg.sum, 1, event);
		return;
	}

	// If ms > ws this is case a: add wc - ms, the actual execution time within this window
	if (!new_window) {
		/* The simple case - busy time contained within the existing
		 * window. */
		add_to_task_demand(rq, p, wallclock - mark_start);
		return;
	}

	// More than one window is involved
	/* Busy time spans at least two windows. Temporarily rewind
	 * window_start to first window boundary after mark_start. */
	// Time from ms to ws, which may contain several full windows
	delta = window_start - mark_start;
	nr_full_windows = div64_u64(delta, window_size);
	window_start -= (u64)nr_full_windows * (u64)window_size;

	// window_start now points at ws_tmp
	/* Process (window_start - mark_start) first */
	// First add the demand of the initial partial window
	add_to_task_demand(rq, p, window_start - mark_start);

	/* Push new sample(s) into task's demand history */
	// Update the history
	update_history(rq, p, p->ravg.sum, 1, event);
	if (nr_full_windows)
		update_history(rq, p, scale_exec_time(window_size, rq),
			       nr_full_windows, event);

	/* Roll window_start back to current to process any remainder
	 * in current window. */
	// Restore window_start
	window_start += (u64)nr_full_windows * (u64)window_size;

	/* Process (wallclock - window_start) next */
	// Account the final partial window. Overall this resembles the PELT calculation, with the history handling added
	mark_start = window_start;
	add_to_task_demand(rq, p, wallclock - mark_start);
}

// Demand accumulation:
static void add_to_task_demand(struct rq *rq, struct task_struct *p,
				u64 delta)
{
	// The delta is converted from actual run time into a share of CPU capacity, typically by taking the CPU's capcurr and dividing by 1024
	delta = scale_exec_time(delta, rq);
	p->ravg.sum += delta;
	// Clamp sum so that it never exceeds the window size
	if (unlikely(p->ravg.sum > walt_ravg_window))
		p->ravg.sum = walt_ravg_window;
}
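As a worked example of the window arithmetic in update_task_demand, the sketch below splits the interval from mark_start to wallclock into the pieces the function accounts separately. It is a user-space model that ignores frequency scaling and the event check; `struct demand_split` and `split_busy_time` are hypothetical names.

```c
#include <stdint.h>

struct demand_split {
	uint64_t first_partial;   /* ws_tmp - ms: closes the window the event began in */
	uint64_t nr_full_windows; /* complete windows between ws_tmp and ws */
	uint64_t last_partial;    /* time accounted to the current window */
};

/* Split [mark_start, wallclock] along the window boundaries. */
static struct demand_split split_busy_time(uint64_t mark_start,
					   uint64_t window_start,
					   uint64_t wallclock,
					   uint64_t window_size)
{
	struct demand_split s = {0, 0, 0};
	uint64_t delta, ws_tmp;

	if (mark_start >= window_start) {
		/* Case a: everything falls into the current window. */
		s.last_partial = wallclock - mark_start;
		return s;
	}

	/* Cases b and c: rewind to the first boundary after mark_start. */
	delta = window_start - mark_start;
	s.nr_full_windows = delta / window_size;
	ws_tmp = window_start - s.nr_full_windows * window_size;

	s.first_partial = ws_tmp - mark_start;
	s.last_partial = wallclock - window_start;
	return s;
}
```

With ms = 33, ws = 100, wc = 107 and a window of 20 (all in ms), the split is 7 ms to close the first window, 3 full windows, and 7 ms in the current window, which adds up to the 74 ms between ms and wc.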

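3.3.3.2 update_history logic:

Notes on update_history:

The function is called when a task enters a new window;
It updates the demand of the task based on the last few windows;
It also updates the usage in the rq from the newly calculated demand;

The corresponding function: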

/*
 * Called when new window is starting for a task, to record cpu usage over
 * recently concluded window(s). Normally 'samples' should be 1. It can be > 1
 * when, say, a real-time task runs without preemption for several windows at a
 * stretch.
 */
static void update_history(struct rq *rq, struct task_struct *p,
			 u32 runtime, int samples, int event)
{
	u32 *hist = &p->ravg.sum_history[0]; // pointer to the task's window history
	int ridx, widx;
	u32 max = 0, avg, demand;
	u64 sum = 0;

	/* Ignore windows where task had no activity */
	if (!runtime || is_idle_task(p) || exiting_task(p) || !samples)
		goto done;

	/* Push new 'runtime' value onto stack */
	widx = walt_ravg_hist_size - 1; // index of the oldest history slot
	ridx = widx - samples; // read index: the oldest 'samples' entries are dropped
	// The two loops below push the new window(s) into the history array and update sum and max
	for (; ridx >= 0; --widx, --ridx) {
		hist[widx] = hist[ridx];
		sum += hist[widx];
		if (hist[widx] > max)
			max = hist[widx];
	}

	for (widx = 0; widx < samples && widx < walt_ravg_hist_size; widx++) {
		hist[widx] = runtime;
		sum += hist[widx];
		if (hist[widx] > max)
			max = hist[widx];
	}

	// Reset the task's sum
	p->ravg.sum = 0;

	// demand is derived from the history according to the policy; our default is policy 2, WINDOW_STATS_MAX_RECENT_AVG, which takes the larger of the historical average and the current value
	if (walt_window_stats_policy == WINDOW_STATS_RECENT) {
		demand = runtime;
	} else if (walt_window_stats_policy == WINDOW_STATS_MAX) {
		demand = max;
	} else {
		avg = div64_u64(sum, walt_ravg_hist_size);
		if (walt_window_stats_policy == WINDOW_STATS_AVG)
			demand = avg;
		else
			demand = max(avg, runtime);
	}

	/*
	 * A throttled deadline sched class task gets dequeued without
	 * changing p->on_rq. Since the dequeue decrements hmp stats
	 * avoid decrementing it here again.
	 *
	 * When window is rolled over, the cumulative window demand
	 * is reset to the cumulative runnable average (contribution from
	 * the tasks on the runqueue). If the current task is dequeued
	 * already, it's demand is not included in the cumulative runnable
	 * average. So add the task demand separately to cumulative window
	 * demand.
	 */
	// Fix up the runnable_avg values, unless this is a throttled deadline task
	if (!task_has_dl_policy(p) || !p->dl.dl_throttled) {
		if (task_on_rq_queued(p)) // p is queued on the runqueue
			fixup_cumulative_runnable_avg(rq, p, demand); // add (new demand - demand stored in the task) to rq->cumulative_runnable_avg
		else if (rq->curr == p) // p is the current task but has already been dequeued
			fixup_cum_window_demand(rq, demand); // add demand to rq->cum_window_demand
	}

	// Finally store the computed demand in the task
	p->ravg.demand = demand;

done:
	trace_walt_update_history(rq, p, runtime, samples, event);
	return;
}

// Update the value of cumulative_runnable_avg
static void
fixup_cumulative_runnable_avg(struct rq *rq,
			      struct task_struct *p, u64 new_task_load)
{
	// Difference between the new demand and the demand stored in p (may be negative)
	s64 task_load_delta = (s64)new_task_load - task_load(p);

	// Add it to the rq
	rq->cumulative_runnable_avg += task_load_delta;
	if ((s64)rq->cumulative_runnable_avg < 0)
		panic("cra less than zero: tld: %lld, task_load(p) = %u\n",
			task_load_delta, task_load(p));

	fixup_cum_window_demand(rq, task_load_delta);
}

// Update cum_window_demand by adding the value passed in
static inline void fixup_cum_window_demand(struct rq *rq, s64 delta)
{
	rq->cum_window_demand += delta;
	if (unlikely((s64)rq->cum_window_demand < 0))
		rq->cum_window_demand = 0;
}

// So what actually gets updated here is cum_window_demand and cumulative_runnable_avg.
// Both are also updated in the following functions: one adds, the other subtracts.
void
walt_inc_cumulative_runnable_avg(struct rq *rq,
				 struct task_struct *p)
{
	rq->cumulative_runnable_avg += p->ravg.demand;

	/*
	 * Add a task's contribution to the cumulative window demand when
	 *
	 * (1) task is enqueued with on_rq = 1 i.e migration,
	 *     prio/cgroup/class change.
	 * (2) task is waking for the first time in this window.
	 */
	if (p->on_rq || (p->last_sleep_ts < rq->window_start))
		fixup_cum_window_demand(rq, p->ravg.demand);
}

void
walt_dec_cumulative_runnable_avg(struct rq *rq,
				 struct task_struct *p)
{
	rq->cumulative_runnable_avg -= p->ravg.demand;
	BUG_ON((s64)rq->cumulative_runnable_avg < 0);

	/*
	 * on_rq will be 1 for sleeping tasks. So check if the task
	 * is migrating or dequeuing in RUNNING state to change the
	 * prio/cgroup/class.
	 */
	if (task_on_rq_migrating(p) || p->state == TASK_RUNNING)
		fixup_cum_window_demand(rq, -(s64)p->ravg.demand);
}
// I searched the code for callers of these two functions:
// fair, dl, rt and stop_task each call inc on enqueue and dec on dequeue;
// this accounting runs before nr_running in the rq is updated.

Notes on the functions are included as comments in the code; questions are welcome;
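The history push and the demand policies are easier to follow with concrete numbers. The sketch below is a user-space model of that part of update_history only. It assumes a history of 5 windows and skips the early exit for empty windows and all rq fixups; `push_history` and the `STATS_*` names are invented for this illustration.

```c
#include <stdint.h>

#define HIST_SIZE 5 /* assumed history depth, matching the 5 windows above */

enum stats_policy {
	STATS_RECENT,         /* demand = most recent window */
	STATS_MAX,            /* demand = largest window in the history */
	STATS_MAX_RECENT_AVG, /* demand = max(average, most recent window) */
	STATS_AVG,            /* demand = average over the history */
};

/*
 * Push `samples` copies of `runtime` onto the front of hist[] (newest
 * entry first) and return the demand selected by `policy`.
 */
static uint32_t push_history(uint32_t hist[HIST_SIZE], uint32_t runtime,
			     int samples, enum stats_policy policy)
{
	uint64_t sum = 0;
	uint32_t max = 0, avg;
	int widx = HIST_SIZE - 1;
	int ridx = widx - samples;

	/* Shift the surviving entries towards the old end of the array. */
	for (; ridx >= 0; --widx, --ridx) {
		hist[widx] = hist[ridx];
		sum += hist[widx];
		if (hist[widx] > max)
			max = hist[widx];
	}

	/* Fill the freed slots at the front with the new sample. */
	for (widx = 0; widx < samples && widx < HIST_SIZE; widx++) {
		hist[widx] = runtime;
		sum += hist[widx];
		if (hist[widx] > max)
			max = hist[widx];
	}

	avg = (uint32_t)(sum / HIST_SIZE);

	switch (policy) {
	case STATS_RECENT:
		return runtime;
	case STATS_MAX:
		return max;
	case STATS_AVG:
		return avg;
	default:
		return avg > runtime ? avg : runtime;
	}
}
```

Starting from a history of {10, 8, 6, 4, 2}, pushing one sample of 20 yields {20, 10, 8, 6, 4}; the average is 9, so the max(average, recent) policy returns 20. Pushing two samples of 5 next yields {5, 5, 20, 10, 8}, and the same policy returns the average, 9, because it now exceeds the recent value.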

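3.3.3.3 Summary of the demand update function:

The demand update does the following:

The calculation covers a task that stays within one window, spans two windows, or spans several windows, which is exactly the window division described earlier;
Note that window_start and mark_start are local variables in this function; the values stored in the task and the rq are not changed, because the busy time calculation that follows still needs them;
What the demand update actually changes is ravg.sum in the task (plus ravg.sum_history and ravg.demand through update_history), and cumulative_runnable_avg and cum_window_demand in the rq;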
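The scaling step mentioned in the comments of add_to_task_demand can be sketched as follows. This is an assumption based on the author's note that the run time is scaled by the CPU's current capacity relative to 1024; the helper names `scale_exec_time_model` and `add_to_demand_model` are hypothetical, and the real scale_exec_time may differ between kernel trees.

```c
#include <stdint.h>

#define CAPACITY_SHIFT 10 /* assumed: full capacity is 1 << 10 = 1024 */

/* Scale a run time by the CPU's current capacity (0..1024). */
static uint64_t scale_exec_time_model(uint64_t delta_ns, unsigned long capcurr)
{
	return (delta_ns * capcurr) >> CAPACITY_SHIFT;
}

/* Add scaled run time to a window sum, clamped to the window size. */
static uint32_t add_to_demand_model(uint32_t sum, uint64_t delta_ns,
				    unsigned long capcurr, uint32_t window_ns)
{
	uint64_t total = sum + scale_exec_time_model(delta_ns, capcurr);

	return total > window_ns ? window_ns : (uint32_t)total;
}
```

Under this model, 10 ms of run time on a CPU currently at half capacity (512) counts as 5 ms of demand, and a sum can never exceed one window (20 ms here).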

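3.3.4 Updating CPU busy time

The flow chart for this function is even larger. The function handles a number of different cases; the division is the window division described earlier, but the accounted values differ slightly from case to case:
[Figure: update_cpu_busy_time flow]
The corresponding function: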

/*
 * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum)
 */
static void update_cpu_busy_time(struct task_struct *p, struct rq *rq,
	     int event, u64 wallclock, u64 irqtime)
{
	int new_window, nr_full_windows = 0;
	int p_is_curr_task = (p == rq->curr);
	u64 mark_start = p->ravg.mark_start; // ms
	u64 window_start = rq->window_start; // ws
	u32 window_size = walt_ravg_window; // window size
	u64 delta;

	// Initialize the local state
	new_window = mark_start < window_start; // is task period in a new window?
	if (new_window) {
		// update nr_full_windows
		nr_full_windows = div64_u64((window_start - mark_start),
						window_size);
		if (p->ravg.active_windows < USHRT_MAX)
			p->ravg.active_windows++;
	}

	/* Handle per-task window rollover. We don't care about the idle
	 * task or exiting tasks. */
	if (new_window && !is_idle_task(p) && !exiting_task(p)) {
		u32 curr_window = 0;

		if (!nr_full_windows)
			curr_window = p->ravg.curr_window;
		// update prev
		p->ravg.prev_window = curr_window;
		p->ravg.curr_window = 0;
	}

	// Decide from event and irqtime whether this input contributes busy time; if not, return early
	if (!account_busy_for_cpu_time(rq, p, irqtime, event)) {
		/* account_busy_for_cpu_time() = 0, so no update to the
		 * task's current window needs to be made. This could be
		 * for example
		 *
		 *   - a wakeup event on a task within the current
		 *     window (!new_window below, no action required),
		 *   - switching to a new task from idle (PICK_NEXT_TASK)
		 *     in a new window where irqtime is 0 and we aren't
		 *     waiting on IO */

		if (!new_window)
			return;

		/* A new window has started. The RQ demand must be rolled
		 * over if p is the current task. */
		if (p_is_curr_task) {
			u64 prev_sum = 0;

			/* p is either idle task or an exiting task */
			if (!nr_full_windows) {
				prev_sum = rq->curr_runnable_sum;
			}

			rq->prev_runnable_sum = prev_sum;
			rq->curr_runnable_sum = 0;
		}

		return;
	}

	// The task event started within the current window: check the type (this is the core), then compute and add the time
	if (!new_window) {
		/* account_busy_for_cpu_time() = 1 so busy time needs
		 * to be accounted to the current window. No rollover
		 * since we didn't start a new window. An example of this is
		 * when a task starts execution and then sleeps within the
		 * same window. */

		// Condition: no irq time, or not the idle task, or the CPU is waiting on IO
		if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq))
			delta = wallclock - mark_start;
		else
			delta = irqtime;
		// Scale the time and add it to curr
		delta = scale_exec_time(delta, rq);
		rq->curr_runnable_sum += delta;
		if (!is_idle_task(p) && !exiting_task(p))
			p->ravg.curr_window += delta;

		return;
	}

	// Busy time must be accounted in a new window, but p is not the task currently running on this rq
	if (!p_is_curr_task) {
		/* account_busy_for_cpu_time() = 1 so busy time needs
		 * to be accounted to the current window. A new window
		 * has also started, but p is not the current task, so the
		 * window is not rolled over - just split up and account
		 * as necessary into curr and prev. The window is only
		 * rolled over when a new window is processed for the current
		 * task.
		 *
		 * Irqtime can't be accounted by a task that isn't the
		 * currently running task. */

		// Split the accounting into two parts: prev and curr
		if (!nr_full_windows) {
			/* A full window hasn't elapsed, account partial
			 * contribution to previous completed window. */
			delta = scale_exec_time(window_start - mark_start, rq);
			if (!exiting_task(p))
				p->ravg.prev_window += delta;
		} else {
			/* Since at least one full window has elapsed,
			 * the contribution to the previous window is the
			 * full window (window_size). */
			delta = scale_exec_time(window_size, rq);
			if (!exiting_task(p))
				p->ravg.prev_window = delta;
		}
		rq->prev_runnable_sum += delta;

		/* Account piece of busy time in the current window. */
		delta = scale_exec_time(wallclock - window_start, rq);
		rq->curr_runnable_sum += delta;
		if (!exiting_task(p))
			p->ravg.curr_window = delta;

		return;
	}

	// p is the running task
	if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) {
		/* account_busy_for_cpu_time() = 1 so busy time needs
		 * to be accounted to the current window. A new window
		 * has started and p is the current task so rollover is
		 * needed. If any of these three above conditions are true
		 * then this busy time can't be accounted as irqtime.
		 *
		 * Busy time for the idle task or exiting tasks need not
		 * be accounted.
		 *
		 * An example of this would be a task that starts execution
		 * and then sleeps once a new window has begun. */

		if (!nr_full_windows) {
			/* A full window hasn't elapsed, account partial
			 * contribution to previous completed window. */
			delta = scale_exec_time(window_start - mark_start, rq);
			if (!is_idle_task(p) && !exiting_task(p))
				p->ravg.prev_window += delta;

			delta += rq->curr_runnable_sum;
		} else {
			/* Since at least one full window has elapsed,
			 * the contribution to the previous window is the
			 * full window (window_size). */
			delta = scale_exec_time(window_size, rq);
			if (!is_idle_task(p) && !exiting_task(p))
				p->ravg.prev_window = delta;
		}
		/*
		 * Rollover for normal runnable sum is done here by overwriting
		 * the values in prev_runnable_sum and curr_runnable_sum.
		 * Rollover for new task runnable sum has completed by previous
		 * if-else statement.
		 */
		rq->prev_runnable_sum = delta;

		/* Account piece of busy time in the current window. */
		delta = scale_exec_time(wallclock - window_start, rq);
		rq->curr_runnable_sum = delta;
		if (!is_idle_task(p) && !exiting_task(p))
			p->ravg.curr_window = delta;

		return;
	}

	// Interrupt
	if (irqtime) {
		/* account_busy_for_cpu_time() = 1 so busy time needs
		 * to be accounted to the current window. A new window
		 * has started and p is the current task so rollover is
		 * needed. The current task must be the idle task because
		 * irqtime is not accounted for any other task.
		 *
		 * Irqtime will be accounted each time we process IRQ activity
		 * after a period of idleness, so we know the IRQ busy time
		 * started at wallclock - irqtime. */

		BUG_ON(!is_idle_task(p));
		mark_start = wallclock - irqtime;

		/* Roll window over. If IRQ busy time was just in the current
		 * window then that is all that need be accounted. */
		rq->prev_runnable_sum = rq->curr_runnable_sum;
		if (mark_start > window_start) {
			rq->curr_runnable_sum = scale_exec_time(irqtime, rq);
			return;
		}

		/* The IRQ busy time spanned multiple windows. Process the
		 * busy time preceding the current window start first. */
		delta = window_start - mark_start;
		if (delta > window_size)
			delta = window_size;
		delta = scale_exec_time(delta, rq);
		rq->prev_runnable_sum += delta;

		/* Process the remaining IRQ busy time in the current window. */
		delta = wallclock - window_start;
		rq->curr_runnable_sum = scale_exec_time(delta, rq);

		return;
	}

	BUG();
}

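The details are commented in the function; in short:

Busy time is calculated differently for different task types;
The core calculation is the same in every case; only the values differ;
The data updated is:
prev_window and curr_window in the task
prev_runnable_sum and curr_runnable_sum in the rq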
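The rollover of the two rq counters can be illustrated with a reduced model. The sketch covers only one branch of update_cpu_busy_time: p is the current task, the event counts as busy time, and no irq time is involved. It also ignores frequency scaling and the per-task counters. `struct rq_model` and `rollover_busy_time` are hypothetical names.

```c
#include <stdint.h>

struct rq_model {
	uint64_t curr_runnable_sum; /* busy time in the current window */
	uint64_t prev_runnable_sum; /* busy time in the previous window */
};

/* Account [mark_start, wallclock] for the current task on this rq. */
static void rollover_busy_time(struct rq_model *rq, uint64_t mark_start,
			       uint64_t window_start, uint64_t wallclock,
			       uint64_t window_size)
{
	uint64_t nr_full_windows, prev;

	if (mark_start >= window_start) {
		/* Same window: just accumulate, no rollover. */
		rq->curr_runnable_sum += wallclock - mark_start;
		return;
	}

	nr_full_windows = (window_start - mark_start) / window_size;
	if (!nr_full_windows)
		/* Old curr becomes prev, plus the tail of the old window. */
		prev = (window_start - mark_start) + rq->curr_runnable_sum;
	else
		/* The task ran through at least one entire window. */
		prev = window_size;

	rq->prev_runnable_sum = prev;
	rq->curr_runnable_sum = wallclock - window_start;
}
```

For example, with curr = 6 and a window of 20 starting at 100, an event with ms = 95 and wc = 107 rolls the window over: prev becomes 5 + 6 = 11 and curr becomes 7.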

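3.3.5 irq load accounting

3.3.5.1 Three irq-related variables:

cur_irqload: the irq load currently being accumulated in the rq, i.e. irq execution time;
avg_irqload: the average irqload of the rq; it depends on interrupt frequency, decays step by step, and is an accumulated value;
u64 irqload_ts: the time WALT irqload was last accounted; it is used to determine how frequently interrupts occur;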

3.3.5.2 Call logic

All three values are set to 0 in sched_init. As covered earlier, this accounting is invoked on interrupts. In detail:

void walt_account_irqtime(int cpu, struct task_struct *curr,
				 u64 delta, u64 wallclock)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags, nr_windows;
	u64 cur_jiffies_ts;

	raw_spin_lock_irqsave(&rq->lock, flags);

	/*
	 * cputime (wallclock) uses sched_clock so use the same here for
	 * consistency.
	 */
	// Correct delta by the time that passed between reading wallclock and reaching this point.
	// The caller passes delta = sched_clock_cpu - irq_start_time,
	// so delta is the irq execution time.
	delta += sched_clock() - wallclock;
	cur_jiffies_ts = get_jiffies_64();

	// For the idle task, update the WALT statistics; the current time is used as wallclock and delta is the irq execution time
	if (is_idle_task(curr))
		walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(),
				 delta);

	// Time between two irq accounting calls; nr_windows here counts ticks
	nr_windows = cur_jiffies_ts - rq->irqload_ts;

	// This reflects how often irqs fire on this CPU, judged against 10 ticks; with HZ set to 250, one tick is 4 ms
	if (nr_windows) {
		if (nr_windows < 10) { // Fewer than 10 ticks elapsed: avg_irqload decays to 3/4 of its value
			/* Decay CPU's irqload by 3/4 for each window. */
			rq->avg_irqload *= (3 * nr_windows);
			rq->avg_irqload = div64_u64(rq->avg_irqload,
						    4 * nr_windows);
		} else { // 10 or more ticks elapsed: avg_irqload is discarded and set to 0
			rq->avg_irqload = 0;
		}
		// Add the current irqload
		rq->avg_irqload += rq->cur_irqload;
		rq->cur_irqload = 0;
	}

	rq->cur_irqload += delta;
	// Record the current time; irqload_ts is only updated and read in these two places, so it is the time of the previous irq accounting
	rq->irqload_ts = cur_jiffies_ts;
	raw_spin_unlock_irqrestore(&rq->lock, flags);
}

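account_irq_enter_time/account_irq_exit_time ==> irqtime_account_irq ==> walt_account_irqtime
The flow is fairly simple:

Data is accounted both on interrupt entry and on interrupt exit;
The data accounted is the interrupt execution time;
How much accumulates in the rq depends on how frequently interrupts arrive;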
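The decay of avg_irqload can be traced with a small user-space model of the accounting step. It is a sketch with jiffies passed in as plain integers; `struct irq_model` and `account_irq` are hypothetical names. One detail worth noticing: as the code is written, the factor nr_windows cancels out, so the value decays to 3/4 once per update regardless of how many ticks (fewer than 10) have elapsed.

```c
#include <stdint.h>

struct irq_model {
	uint64_t cur_irqload; /* irq time accumulated since the last roll */
	uint64_t avg_irqload; /* decayed running total */
	uint64_t irqload_ts;  /* jiffies of the previous accounting call */
};

/* Account `delta_ns` of irq time observed at `now_jiffies`. */
static void account_irq(struct irq_model *rq, uint64_t now_jiffies,
			uint64_t delta_ns)
{
	uint64_t elapsed = now_jiffies - rq->irqload_ts;

	if (elapsed) {
		if (elapsed < 10)
			/* (avg * 3n) / (4n): always 3/4 of the old value. */
			rq->avg_irqload = rq->avg_irqload * (3 * elapsed) /
					  (4 * elapsed);
		else
			/* No irq for 10 or more ticks: drop the history. */
			rq->avg_irqload = 0;

		rq->avg_irqload += rq->cur_irqload;
		rq->cur_irqload = 0;
	}

	rq->cur_irqload += delta_ns;
	rq->irqload_ts = now_jiffies;
}
```

Starting with avg = 1000 and cur = 500, an interrupt arriving 4 ticks later turns avg into 1000 * 3/4 + 500 = 1250, while an interrupt arriving 15 ticks later would reset avg to just the pending cur value.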

3.3.5.3 The first use of irqload

Checking the irq load of a CPU; straight to the code:

#define WALT_HIGH_IRQ_TIMEOUT 3

u64 walt_irqload(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	s64 delta;

	delta = get_jiffies_64() - rq->irqload_ts;

	/*
	 * Current context can be preempted by irq and rq->irqload_ts can be
	 * updated by irq context so that delta can be negative.
	 * But this is okay and we can safely return as this means there
	 * was recent irq occurrence.
	 */
	// This check guards against delta changing after being preempted by an irq; why the threshold is 3 is still unclear to me
	if (delta < WALT_HIGH_IRQ_TIMEOUT)
		return rq->avg_irqload;
	else
		return 0;
}

// Called from find_best_target, i.e. to check the load when choosing the next CPU during migration
int walt_cpu_high_irqload(int cpu)
{
	return walt_irqload(cpu) >= sysctl_sched_walt_cpu_high_irqload; // The default threshold is 10 ms
}

3.4 Key structures

rq // statistics added to the runqueue
task_struct // corresponding fields added to task_struct
ravg // the structure used by this calculation

3.4.1 rq

[Figure: WALT fields in rq]
The corresponding structure definition:

struct rq {
	...
#ifdef CONFIG_SCHED_WALT
	u64 cumulative_runnable_avg;
	u64 window_start;
	u64 curr_runnable_sum;
	u64 prev_runnable_sum;
	u64 nt_curr_runnable_sum;
	u64 nt_prev_runnable_sum;
	u64 cur_irqload;
	u64 avg_irqload;
	u64 irqload_ts;
	u64 cum_window_demand;
#endif /* CONFIG_SCHED_WALT */
	...
};

3.4.2 task_struct

[Figure: WALT fields in task_struct]

struct task_struct {
	...
#ifdef CONFIG_SCHED_WALT
	struct ravg ravg;
	/*
	 * 'init_load_pct' represents the initial task load assigned to children
	 * of this task
	 */
	u32 init_load_pct;
	u64 last_sleep_ts;
#endif
	...
};

#ifdef CONFIG_SCHED_WALT
/* ravg represents frequency scaled cpu-demand of tasks */
struct ravg {
	/*
	 * 'mark_start' marks the beginning of an event (task waking up, task
	 * starting to execute, task being preempted) within a window
	 *
	 * 'sum' represents how runnable a task has been within current
	 * window. It incorporates both running time and wait time and is
	 * frequency scaled.
	 *
	 * 'sum_history' keeps track of history of 'sum' seen over previous
	 * RAVG_HIST_SIZE windows. Windows where task was entirely sleeping are
	 * ignored.
	 *
	 * 'demand' represents maximum sum seen over previous
	 * sysctl_sched_ravg_hist_size windows. 'demand' could drive frequency
	 * demand for tasks.
	 *
	 * 'curr_window' represents task's contribution to cpu busy time
	 * statistics (rq->curr_runnable_sum) in current window
	 *
	 * 'prev_window' represents task's contribution to cpu busy time
	 * statistics (rq->prev_runnable_sum) in previous window
	 */
	u64 mark_start; // marks the beginning of an event (task waking up, task starting to execute, task being preempted) within a window
	u32 sum, demand; // sum: how runnable a task has been within the current window; demand: as described in the comment above
	u32 sum_history[RAVG_HIST_SIZE_MAX];
	u32 curr_window, prev_window;
	u16 active_windows;
};
#endif

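4. Appendix

4.1 How Linux scheduling has evolved

Runqueues divided by priority, with active and expired arrays, for faster scheduling;
CFS introduced the concept of virtual time, which maps to different amounts of physical time depending on priority;
CFS + PELT: tasks are placed and migrated more sensibly;
CFS + WALT: responds faster and suits devices such as phones better, striking a good balance between performance and power consumption;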

4.2 To be added

update history code [done]
irq call flow [done]
How the updated data is used ==> I plan to trace the top flow and hope to have a first pass done tomorrow
