CPU Load Balancing: Learning WALT [Repost]


Reposted from: https://blog.csdn.net/xiaoqiaoq0/article/details/107135747/

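Preface

This article continues my notes on CPU scheduling, this time covering WALT. It addresses the following questions:

  1. What is WALT?
  2. How is WALT calculated?
  3. How is the data calculated by WALT used?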

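1. What is WALT?

WALT stands for Window-Assisted Load Tracking:
- Literally, it tracks CPU load with the help of time windows;
- In essence, it is a calculation method that expresses the current CPU load as numbers, which are then used for task scheduling, migration, load balancing, and similar decisions;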

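1.1 Why is WALT needed?

A new technique, and especially a new way of calculating something, is usually introduced because the existing one no longer fits current needs, or because the new one lets people be lazier;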

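1.1.1 Where does the PELT calculation fall short?

When PELT was introduced, Linux was used mainly on servers. The focus was on raw performance, and power consumption was not yet a priority. With the rise of mobile devices, power consumption and responsiveness became things users perceive directly, and so became the main considerations:

  1. On a mobile device, UI handling has to respond quickly, otherwise the user clearly notices jank;
  2. On a mobile device, power consumption cannot be ignored: a phone that needs frequent charging will not sell well;
  3. Whether a task counts as heavy depends on the user scenario. For example, a task's importance varies with the content being displayed, so tasks of the same class still need to be treated dynamically;

PELT is based on decay, so it reflects sustained trends well but handles abrupt changes poorly:

  1. It responds slowly to rapid rises and drops. Because of the decay calculation, an actual rise or drop in load only shows up in the data after a number of periods;
  2. Because of the decay, a task's load shrinks after it sleeps for a while. If that task is something like network transfer, which periodically needs CPU capacity and frequency, PELT cannot respond quickly (the calculation reflects trends and averages)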
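To give a feel for how slowly a decayed average ramps up, here is a minimal sketch. It assumes a simplified floating-point model of PELT in which the gap to the maximum shrinks by a factor y every period, with y chosen so that y^32 = 0.5 (the half-life PELT uses, with periods of about 1 ms). The function name and the constant are illustrative and are not kernel code.

```c
#include <assert.h>

/* Approximation of y where y^32 = 0.5 (illustrative constant). */
#define PELT_DECAY_Y 0.97857206

/*
 * Count how many fully busy periods a task starting from zero load needs
 * before the decayed average reaches `target` (a fraction of the maximum).
 * Simplified model only; the kernel uses fixed-point arithmetic.
 */
static int pelt_periods_to_reach(double target)
{
    double remaining = 1.0; /* gap between the signal and its maximum */
    int periods = 0;

    assert(target > 0.0 && target < 1.0);
    while (remaining > 1.0 - target) {
        remaining *= PELT_DECAY_Y; /* the gap shrinks by y each period */
        periods++;
    }
    return periods;
}
```

In this model, `pelt_periods_to_reach(0.5)` is 32 and `pelt_periods_to_reach(0.9)` is 107, so a task that suddenly becomes fully busy needs on the order of 100 ms before the signal shows 90% load. A window-based signal can report the same task as fully busy once a single window has closed.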

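1.2 How WALT addresses this

Given the reasons above, what is needed is a calculation that builds on PELT (keeping its benefits) but suits current requirements better:

  1. Data is reported more promptly;
  2. Data directly reflects the current state;
  3. No additional computational cost;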

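1.2.1 What WALT does

Here is my summary of what WALT can (and needs to) achieve:

  1. Keep tracking every task entity;
  2. On top of the existing usage (load), record a demand value that is used for later prediction;
  3. The overall load of each CPU's runqueue is still the sum over all of its tasks;
  4. The core difference is in the calculation: the decay approach is replaced by dividing time into windows, so the collected data reflects actual changes faster (compared with PELT's trend). Some statements from the Linux material on WALT:
    1. A task’s demand is the maximum of its contribution to the most recently completed window and its average demand over the past N windows.
    2. WALT “forgets” blocked time entirely: only runnable and running time is counted, which gives a more accurate picture of the time a task actually needs and allows prediction from demand;
    3. CPU busy time - The sum of execution times of all tasks in the most recently completed window;
    4. WALT “forgets” cpu utilization as soon as tasks are taken off of the runqueue;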

1.2.2 Where it is used

  1. Collecting per-CPU and per-task load before a task is placed;
  2. Task migration;
  3. Allocation between big and little cores;
  4. EAS placement;

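1.3 Versions that include it

  1. Linux: introduced after 4.8.2 (but browsing the code on bootlin, even the latest 5.8 tree still has no corresponding file, so it does not appear to be in mainline)
  2. Android: introduced from 4.4 onward (the Android 4.9 kernel does contain this code)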

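2. How to enable WALT in the kernel

The Android kernel code already integrates WALT, but depending on the vendor it may not be enabled:

  1. Enable the config option for testing:
    1. menuconfig ==> General setup ==> CPU/Task time and stats accounting ==> support window based load tracking
    2. [Figure: kernel config]
  2. Edit the defconfig directly
    1. Add CONFIG_SCHED_WALT=y to kernel/arch/arm64/configs/defconfig
  3. Build the image and verify that the change took effect:
    demo:/sys/kernel/tracing # zcat /proc/config.gz | grep WALT

    CONFIG_SCHED_WALT=y
    CONFIG_HID_WALTOP=y

  4. Testing
    So far I can only see in ftrace that WALT statistics are indeed being collected. I have no real workload that confirms whether anything improves, and no other measurements (the Linux material does include some figures, but they were not measured locally);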

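3. The WALT calculation

This section explains the calculation WALT uses, from both the principle and the code:

  1. How are windows divided?
  2. How are tasks classified, and how is each class handled?
  3. How is the WALT data updated?
  4. How is the updated data used by the scheduler and EAS?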

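3.1 Window division

First, how is the helper quantity, the window, divided?
Put simply, time since boot is cut into periods of a fixed length. The load of each task is accounted per period and then propagated to the runqueue;

What else has to be decided?

  1. How long should one period (window) be? This is tuned per project; the kernel's current default is 20 ms;
  2. How many windows of load history are kept? This is also tuned per project; the kernel's current setting is 5 windows;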

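So for a given task and window, the following situations can occur:
[Figure: the three task/window cases]
Note: ms = mark_start (start of the task's current segment), ws = window_start (start of the current window), wc = wallclock (current system time)

  1. The task started in this window and is still in it when accounted, i.e. the task is within one window;
  2. The task started in the previous window and is accounted in the current window, i.e. the task spans two windows;
  3. The task started in some earlier window and is accounted in the current window, i.e. the task spans several complete windows;
    [Figure: task spanning multiple windows]
    These three cases are the only ways a task can relate to the windows, and all calculations are based on them;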
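The three cases can be told apart from ms, ws, and the window size alone. The following is a minimal sketch; the enum and function names are made up for illustration, and the comparison mirrors the `mark_start < window_start` and `nr_full_windows` checks that appear in the kernel functions listed later.

```c
#include <assert.h>
#include <stdint.h>

enum span_case {
    SPAN_ONE_WINDOW,   /* case 1: ms lies inside the current window */
    SPAN_TWO_WINDOWS,  /* case 2: ms lies in the window just before ws */
    SPAN_MANY_WINDOWS  /* case 3: at least one full window between ms and ws */
};

/* Classify a task segment [ms, wc] against the current window start ws. */
static enum span_case classify_span(uint64_t ms, uint64_t ws, uint64_t wc,
                                    uint64_t window_size)
{
    assert(window_size > 0 && ws <= wc && ms <= wc);

    if (ms >= ws)
        return SPAN_ONE_WINDOW;
    if ((ws - ms) / window_size == 0)
        return SPAN_TWO_WINDOWS;
    return SPAN_MANY_WINDOWS;
}
```

With a window size of 20 and ws = 100, wc = 110: ms = 105 falls in one window, ms = 95 spans two windows, and ms = 50 spans several windows.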

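3.2 Task classification

As you might expect, the formula differs for different task classes and task states. WALT divides tasks into the following categories:
[Figure: task classification]
The figure lists the calling function for each task event;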

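3.2.1 Deciding whether to update demand

When updating demand, the task event is checked first to decide whether an update is needed:
[Figure: how demand accounting differs by event]
The corresponding function: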

static int account_busy_for_task_demand(struct task_struct *p, int event)
{
    /* No need to bother updating task demand for exiting tasks
     * or the idle task. */
    // The task has exited or is the idle task: nothing to account.
    if (exiting_task(p) || is_idle_task(p))
        return 0;

    /* When a task is waking up it is completing a segment of non-busy
     * time. Likewise, if wait time is not treated as busy time, then
     * when a task begins to run or is migrated, it is not running and
     * is completing a segment of non-busy time. */
    // walt_account_wait_time defaults to 1, so only TASK_WAKE matches here.
    if (event == TASK_WAKE || (!walt_account_wait_time &&
            (event == PICK_NEXT_TASK || event == TASK_MIGRATE)))
        return 0;

    return 1;
}

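3.2.2 Deciding whether to update CPU busy time

When updating CPU busy time, the task event is checked first to decide whether an update is needed:
[Figure: how busy time accounting differs by event]
The corresponding function: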

static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p,
                                     u64 irqtime, int event)
{
    // Is this the idle task or some other task?
    if (is_idle_task(p)) {
        /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
        // schedule() picked the idle task as the next task.
        if (event == PICK_NEXT_TASK)
            return 0;

        /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
        // An idle task with IRQ time, or one waiting on IO, does count as busy.
        return irqtime || cpu_is_waiting_on_io(rq);
    }

    // A wakeup does not need accounting.
    if (event == TASK_WAKE)
        return 0;

    // For non-idle tasks, the following events are accounted.
    if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
        event == TASK_UPDATE)
        return 1;

    /* Only TASK_MIGRATE && PICK_NEXT_TASK left */
    // Defaults to 0.
    return walt_freq_account_wait_time;
}

 

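3.3 How is the data updated? (call flow)

The previous two subsections covered how a task is accounted against windows and how the event decides which data is updated. This subsection looks at the core call flow, starting with a diagram:
[Figure: WALT call structure]
This diagram was exported from XMind, and I am not sure whether it can be enlarged for viewing. The flow consists of:

  1. The entry function walt_update_task_ravg
  2. The demand update function
  3. The cpu busy time update function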

3.3.1 The entry function

walt_update_task_ravg
The corresponding function:

/* Reflect task activity on its demand and cpu's busy time statistics */
void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
                           int event, u64 wallclock, u64 irqtime)
{
    // Early-return checks.
    if (walt_disabled || !rq->window_start)
        return;

    lockdep_assert_held(&rq->lock);

    // Update window_start and cum_window_demand.
    update_window_start(rq, wallclock);

    if (!p->ravg.mark_start)
        goto done;

    // Update the data: demand and busy time.
    update_task_demand(p, rq, event, wallclock);
    update_cpu_busy_time(p, rq, event, wallclock, irqtime);

done:
    // trace
    trace_walt_update_task_ravg(p, rq, event, wallclock, irqtime);

    // Update mark_start.
    p->ravg.mark_start = wallclock;
}

 

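The function does three main things:

  1. Updates the current window start time in preparation for the data updates that follow;
  2. Updates the task's demand value; note that this also updates the corresponding data in the rq;
  3. Updates the task's contribution to cpu busy time;

This function is the main entry point of the WALT calculation. It is called from many places (the leftmost part of the diagram above). Put simply, load is updated on interrupts, wakeups, migrations, and scheduling. I will not go through each caller in detail:

  1. task awakened
  2. task starts executing
  3. task stops executing
  4. task exit
  5. window rollover
  6. interrupt
  7. scheduler_tick
  8. task migration
  9. freq change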

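3.3.2 Updating window start

Before anything is calculated, window_start is updated so that the rq's window start is accurate:
[Figure: update_window_start logic]
The corresponding function: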

static void update_window_start(struct rq *rq, u64 wallclock)
{
    s64 delta;
    int nr_windows;

    // Time elapsed since the start of the current window.
    delta = wallclock - rq->window_start;
    /* If the MPM global timer is cleared, set delta as 0 to avoid kernel BUG happening */
    if (delta < 0) {
        delta = 0;
        /*
         * WARN_ONCE(1,
         * "WALT wallclock appears to have gone backwards or reset\n");
         */
    }

    if (delta < walt_ravg_window) // less than one window has elapsed: return
        return;

    nr_windows = div64_u64(delta, walt_ravg_window); // number of elapsed windows
    rq->window_start += (u64)nr_windows * (u64)walt_ravg_window; // advance window_start

    rq->cum_window_demand = rq->cumulative_runnable_avg; // reset from cumulative_runnable_avg
}
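As a worked example of the rollover above, here is a minimal user-space sketch of the same arithmetic. The function name is hypothetical, and it deals only with window_start, not with cum_window_demand.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the new window_start after advancing it by every whole window
 * that has elapsed up to wallclock. Mirrors the arithmetic of
 * update_window_start(), without the rq bookkeeping.
 */
static uint64_t advance_window_start(uint64_t window_start, uint64_t wallclock,
                                     uint64_t window_size)
{
    uint64_t delta, nr_windows;

    assert(window_size > 0);
    if (wallclock < window_start) /* clock went backwards: no progress */
        return window_start;

    delta = wallclock - window_start;
    if (delta < window_size)      /* still inside the current window */
        return window_start;

    nr_windows = delta / window_size;
    return window_start + nr_windows * window_size;
}
```

With a 20 ms window (values in ms for readability), a window_start of 0 and a wallclock of 45 give a new window_start of 40: two whole windows have elapsed and the remaining 5 ms belong to the current window.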

 

3.3.3 Updating demand

3.3.3.1 Main demand logic:

[Figure: update_task_demand logic]
The corresponding function:

/*
 * Account cpu demand of task and/or update task's cpu demand history
 *
 * ms = p->ravg.mark_start;
 * wc = wallclock
 * ws = rq->window_start
 *
 * Three possibilities:
 *
 *  a) Task event is contained within one window.
 *      window_start < mark_start < wallclock
 *
 *      ws   ms  wc
 *      |    |   |
 *      V    V   V
 *      |---------------|
 *
 *  In this case, p->ravg.sum is updated *iff* event is appropriate
 *  (ex: event == PUT_PREV_TASK)
 *
 *  b) Task event spans two windows.
 *      mark_start < window_start < wallclock
 *
 *      ms   ws   wc
 *      |    |    |
 *      V    V    V
 *      -----|-------------------
 *
 *  In this case, p->ravg.sum is updated with (ws - ms) *iff* event
 *  is appropriate, then a new window sample is recorded followed
 *  by p->ravg.sum being set to (wc - ws) *iff* event is appropriate.
 *
 *  c) Task event spans more than two windows.
 *
 *      ms ws_tmp                          ws  wc
 *      |  |                               |   |
 *      V  V                               V   V
 *      ---|-------|-------|-------|-------|------
 *         |                               |
 *         |<------ nr_full_windows ------>|
 *
 *  In this case, p->ravg.sum is updated with (ws_tmp - ms) first *iff*
 *  event is appropriate, window sample of p->ravg.sum is recorded,
 *  'nr_full_window' samples of window_size is also recorded *iff*
 *  event is appropriate and finally p->ravg.sum is set to (wc - ws)
 *  *iff* event is appropriate.
 *
 * IMPORTANT : Leave p->ravg.mark_start unchanged, as update_cpu_busy_time()
 * depends on it!
 */
static void update_task_demand(struct task_struct *p, struct rq *rq,
                               int event, u64 wallclock)
{
    u64 mark_start = p->ravg.mark_start;        // mark_start belongs to the task
    u64 delta, window_start = rq->window_start; // window_start belongs to the rq
    int new_window, nr_full_windows;
    u32 window_size = walt_ravg_window;

    // First check: compare ms and ws, i.e. did the task's segment start
    // before the current window?
    new_window = mark_start < window_start;
    if (!account_busy_for_task_demand(p, event)) {
        if (new_window)
            /* If the time accounted isn't being accounted as
             * busy time, and a new window started, only the
             * previous window need be closed out with the
             * pre-existing demand. Multiple windows may have
             * elapsed, but since empty windows are dropped,
             * it is not necessary to account those. */
            update_history(rq, p, p->ravg.sum, 1, event);
        return;
    }

    // If ms >= ws this is case a: add wc - ms, the time actually
    // executed within this window.
    if (!new_window) {
        /* The simple case - busy time contained within the existing
         * window. */
        add_to_task_demand(rq, p, wallclock - mark_start);
        return;
    }

    // The segment spans more than one window.
    /* Busy time spans at least two windows. Temporarily rewind
     * window_start to first window boundary after mark_start. */
    // Time from ms to ws, which may contain several complete windows.
    delta = window_start - mark_start;
    nr_full_windows = div64_u64(delta, window_size);
    window_start -= (u64)nr_full_windows * (u64)window_size;

    // ws now points at ws_tmp.
    /* Process (window_start - mark_start) first */
    // First add the demand of the leading partial window.
    add_to_task_demand(rq, p, window_start - mark_start);

    /* Push new sample(s) into task's demand history */
    // Update the history.
    update_history(rq, p, p->ravg.sum, 1, event);
    if (nr_full_windows)
        update_history(rq, p, scale_exec_time(window_size, rq),
                       nr_full_windows, event);

    /* Roll window_start back to current to process any remainder
     * in current window. */
    // Restore window_start.
    window_start += (u64)nr_full_windows * (u64)window_size;

    /* Process (wallclock - window_start) next */
    // Account the final partial window. Overall this resembles the PELT
    // calculation, with the history handling added.
    mark_start = window_start;
    add_to_task_demand(rq, p, wallclock - mark_start);
}

// Accumulating demand:
static void add_to_task_demand(struct rq *rq, struct task_struct *p,
                               u64 delta)
{
    // The delta is converted from actual run time into a share of CPU
    // capacity: typically take the CPU's capcurr and divide by 1024.
    delta = scale_exec_time(delta, rq);
    p->ravg.sum += delta;
    // Clamp sum once it exceeds the window size.
    if (unlikely(p->ravg.sum > walt_ravg_window))
        p->ravg.sum = walt_ravg_window;
}
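The splitting done by update_task_demand() can be followed with plain numbers. The sketch below is a simplified, user-space illustration with hypothetical names. It only computes how a busy segment from ms to wc is divided, and it leaves out frequency scaling, the clamp to the window size, and the history update.

```c
#include <assert.h>
#include <stdint.h>

struct busy_split {
    uint64_t head;            /* ms up to the first window boundary after ms */
    uint64_t nr_full_windows; /* complete windows between that boundary and ws */
    uint64_t tail;            /* busy time that falls in the current window */
};

/* Split the busy segment [ms, wc] around the current window start ws. */
static struct busy_split split_busy_time(uint64_t ms, uint64_t ws, uint64_t wc,
                                         uint64_t window_size)
{
    struct busy_split s = {0, 0, 0};

    assert(window_size > 0 && ms <= wc && ws <= wc);

    if (ms >= ws) { /* case a: everything is inside the current window */
        s.tail = wc - ms;
        return s;
    }

    s.nr_full_windows = (ws - ms) / window_size;
    /* ws_tmp = ws - nr_full_windows * window_size */
    s.head = (ws - s.nr_full_windows * window_size) - ms;
    s.tail = wc - ws;
    return s;
}
```

For ms = 50, ws = 100, wc = 110 and a window size of 20, this gives head = 10, nr_full_windows = 2, tail = 10. The three parts always add up to wc - ms, which is a useful sanity check when reading the kernel function.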
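
3.3.3.2 update history logic:

Notes on update_history:

  1. This function is called when a task enters a new window;
  2. It updates the task's demand based on the previous few windows;
  3. It also updates the usage in the rq, based on the newly computed demand;
    [Figure: update_history logic]
    The corresponding function: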
/*
 * Called when new window is starting for a task, to record cpu usage over
 * recently concluded window(s). Normally 'samples' should be 1. It can be > 1
 * when, say, a real-time task runs without preemption for several windows at a
 * stretch.
 */
static void update_history(struct rq *rq, struct task_struct *p,
                           u32 runtime, int samples, int event)
{
    u32 *hist = &p->ravg.sum_history[0]; // the task's per-window history array
    int ridx, widx;
    u32 max = 0, avg, demand;
    u64 sum = 0;

    /* Ignore windows where task had no activity */
    if (!runtime || is_idle_task(p) || exiting_task(p) || !samples)
        goto done;

    /* Push new 'runtime' value onto stack */
    widx = walt_ravg_hist_size - 1; // last (oldest) slot of the history
    ridx = widx - samples;          // entries after this index are dropped

    // The two loops below shift the history, insert the new window(s)
    // at the front, and update sum and max.
    for (; ridx >= 0; --widx, --ridx) {
        hist[widx] = hist[ridx];
        sum += hist[widx];
        if (hist[widx] > max)
            max = hist[widx];
    }

    for (widx = 0; widx < samples && widx < walt_ravg_hist_size; widx++) {
        hist[widx] = runtime;
        sum += hist[widx];
        if (hist[widx] > max)
            max = hist[widx];
    }

    // Reset the task's sum.
    p->ravg.sum = 0;

    // demand is derived from the history windows according to the policy.
    // Our default is policy 2, WINDOW_STATS_MAX_RECENT_AVG: the larger of
    // the historical average and the most recent value.
    if (walt_window_stats_policy == WINDOW_STATS_RECENT) {
        demand = runtime;
    } else if (walt_window_stats_policy == WINDOW_STATS_MAX) {
        demand = max;
    } else {
        avg = div64_u64(sum, walt_ravg_hist_size);
        if (walt_window_stats_policy == WINDOW_STATS_AVG)
            demand = avg;
        else
            demand = max(avg, runtime);
    }

    /*
     * A throttled deadline sched class task gets dequeued without
     * changing p->on_rq. Since the dequeue decrements hmp stats
     * avoid decrementing it here again.
     *
     * When window is rolled over, the cumulative window demand
     * is reset to the cumulative runnable average (contribution from
     * the tasks on the runqueue). If the current task is dequeued
     * already, it's demand is not included in the cumulative runnable
     * average. So add the task demand separately to cumulative window
     * demand.
     */
    // Correct the runnable_avg values, unless this is a throttled
    // deadline task.
    if (!task_has_dl_policy(p) || !p->dl.dl_throttled) {
        if (task_on_rq_queued(p)) // queued on the runqueue
            // Add the difference between the new demand and the demand
            // recorded in the task to cumulative_runnable_avg.
            fixup_cumulative_runnable_avg(rq, p, demand);
        else if (rq->curr == p)   // this task is the one currently running
            // Add the demand to the rq.
            fixup_cum_window_demand(rq, demand);
    }

    // Finally store the computed demand in the task.
    p->ravg.demand = demand;

done:
    trace_walt_update_history(rq, p, runtime, samples, event);
    return;
}

// Update cumulative_runnable_avg:
static void fixup_cumulative_runnable_avg(struct rq *rq,
                                          struct task_struct *p,
                                          u64 new_task_load)
{
    // Difference between the new demand and the demand recorded in p
    // (may be negative).
    s64 task_load_delta = (s64)new_task_load - task_load(p);

    // Apply it to the rq.
    rq->cumulative_runnable_avg += task_load_delta;
    if ((s64)rq->cumulative_runnable_avg < 0)
        panic("cra less than zero: tld: %lld, task_load(p) = %u\n",
              task_load_delta, task_load(p));

    fixup_cum_window_demand(rq, task_load_delta);
}

// Update cum_window_demand by adding the value passed in:
static inline void fixup_cum_window_demand(struct rq *rq, s64 delta)
{
    rq->cum_window_demand += delta;
    if (unlikely((s64)rq->cum_window_demand < 0))
        rq->cum_window_demand = 0;
}

// So what is actually updated here is cum_window_demand and
// cumulative_runnable_avg. Both are also updated in the following two
// functions, one adding and one subtracting:
void walt_inc_cumulative_runnable_avg(struct rq *rq, struct task_struct *p)
{
    rq->cumulative_runnable_avg += p->ravg.demand;

    /*
     * Add a task's contribution to the cumulative window demand when
     *
     * (1) task is enqueued with on_rq = 1 i.e migration,
     *     prio/cgroup/class change.
     * (2) task is waking for the first time in this window.
     */
    if (p->on_rq || (p->last_sleep_ts < rq->window_start))
        fixup_cum_window_demand(rq, p->ravg.demand);
}

void walt_dec_cumulative_runnable_avg(struct rq *rq, struct task_struct *p)
{
    rq->cumulative_runnable_avg -= p->ravg.demand;
    BUG_ON((s64)rq->cumulative_runnable_avg < 0);

    /*
     * on_rq will be 1 for sleeping tasks. So check if the task
     * is migrating or dequeuing in RUNNING state to change the
     * prio/cgroup/class.
     */
    if (task_on_rq_migrating(p) || p->state == TASK_RUNNING)
        fixup_cum_window_demand(rq, -(s64)p->ravg.demand);
}
// I searched the code for callers of these two functions:
// the fair, dl, rt and stop_task classes call inc on enqueue and dec on dequeue;
// this accounting happens before the rq's nr_running is updated.

 

My notes on these functions are included as comments in the code; questions are welcome;
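To see how the history shift and the WINDOW_STATS_MAX_RECENT_AVG policy interact, here is a minimal user-space sketch. `HIST_SIZE` and `push_history` are illustrative names. The history depth of 5 is taken from the setting mentioned in section 3.1, and the sketch omits the kernel's checks for idle or exiting tasks and for empty windows.

```c
#include <assert.h>
#include <stdint.h>

#define HIST_SIZE 5 /* assumed history depth (5 windows) */

/*
 * Push `samples` copies of `runtime` to the front of the history and
 * return the demand under the "max of average and most recent" policy.
 */
static uint32_t push_history(uint32_t hist[HIST_SIZE], uint32_t runtime,
                             int samples)
{
    uint64_t sum = 0;
    uint32_t avg;
    int widx = HIST_SIZE - 1;
    int ridx = widx - samples;

    assert(samples > 0);

    /* Shift older samples towards the end; the oldest ones fall off. */
    for (; ridx >= 0; --widx, --ridx) {
        hist[widx] = hist[ridx];
        sum += hist[widx];
    }

    /* Fill the freed slots at the front with the new sample. */
    for (widx = 0; widx < samples && widx < HIST_SIZE; widx++) {
        hist[widx] = runtime;
        sum += hist[widx];
    }

    avg = (uint32_t)(sum / HIST_SIZE);
    return avg > runtime ? avg : runtime;
}
```

Starting from an all-zero history, pushing 10, then 5, then 1 yields demands of 10, 5, and 3. The last value shows the averaging at work: the most recent window contributed only 1, but the average over the five slots (16 / 5 = 3) is larger, so the demand does not drop all the way down.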

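3.3.3.3 Summary of the demand update functions:

The demand update does the following:

  1. The calculation covers a task segment within one window as well as one spanning several windows; in essence it follows the window division described earlier;
  2. Note that window_start and mark_start are local variables in this function. The values stored in the task are not updated here, because the busy time calculation that follows still needs them;
  3. What the demand update actually modifies is ravg.sum in the task, and cumulative_runnable_avg and cum_window_demand in the rq;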
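The scaling step inside add_to_task_demand() can be illustrated with a small sketch. It assumes, following the comment in the listing above, that the raw time is multiplied by the CPU's current capacity and divided by 1024. The function names and the explicit `cap_curr` parameter are illustrative; the kernel obtains the capacity from the rq.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10 /* full capacity = 1 << 10 = 1024 */

/* Scale a raw execution time by the CPU's current capacity (0..1024). */
static uint64_t scale_exec_time_sketch(uint64_t delta_ns, uint64_t cap_curr)
{
    assert(cap_curr <= (1u << CAPACITY_SHIFT));
    return (delta_ns * cap_curr) >> CAPACITY_SHIFT;
}

/* Add the scaled time to the window sum, clamped to the window size. */
static uint32_t add_to_sum(uint32_t sum, uint64_t delta_ns, uint64_t cap_curr,
                           uint32_t window_ns)
{
    uint64_t total = sum + scale_exec_time_sketch(delta_ns, cap_curr);

    return total > window_ns ? window_ns : (uint32_t)total;
}
```

For example, 20 ms of run time on a CPU running at half capacity (512) counts as 10 ms of demand, and a sum that would exceed the 20 ms window is clamped to 20 ms.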

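3.3.4 Updating cpu busy time

Drawn out, the logic of this function is even larger. It mainly handles different cases separately. The division is the same window division described earlier, but the exact values accounted differ slightly:
[Figure: update_cpu_busy_time logic]
The corresponding function: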

/*
 * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum)
 */
static void update_cpu_busy_time(struct task_struct *p, struct rq *rq,
                                 int event, u64 wallclock, u64 irqtime)
{
    int new_window, nr_full_windows = 0;
    int p_is_curr_task = (p == rq->curr);
    u64 mark_start = p->ravg.mark_start;  // ms
    u64 window_start = rq->window_start;  // ws
    u32 window_size = walt_ravg_window;   // window size
    u64 delta;

    // Initialize the local state.
    new_window = mark_start < window_start; // is task period in a new window?
    if (new_window) {
        // update nr_full_windows
        nr_full_windows = div64_u64((window_start - mark_start),
                                    window_size);
        if (p->ravg.active_windows < USHRT_MAX)
            p->ravg.active_windows++;
    }

    /* Handle per-task window rollover. We don't care about the idle
     * task or exiting tasks. */
    if (new_window && !is_idle_task(p) && !exiting_task(p)) {
        u32 curr_window = 0;

        if (!nr_full_windows)
            curr_window = p->ravg.curr_window;

        // update prev
        p->ravg.prev_window = curr_window;
        p->ravg.curr_window = 0;
    }

    // Check the event and irqtime: if this update contributes no busy
    // time, return early.
    if (!account_busy_for_cpu_time(rq, p, irqtime, event)) {
        /* account_busy_for_cpu_time() = 0, so no update to the
         * task's current window needs to be made. This could be
         * for example
         *
         *   - a wakeup event on a task within the current
         *     window (!new_window below, no action required),
         *   - switching to a new task from idle (PICK_NEXT_TASK)
         *     in a new window where irqtime is 0 and we aren't
         *     waiting on IO */

        if (!new_window)
            return;

        /* A new window has started. The RQ demand must be rolled
         * over if p is the current task. */
        if (p_is_curr_task) {
            u64 prev_sum = 0;

            /* p is either idle task or an exiting task */
            if (!nr_full_windows) {
                prev_sum = rq->curr_runnable_sum;
            }

            rq->prev_runnable_sum = prev_sum;
            rq->curr_runnable_sum = 0;
        }

        return;
    }

    // The segment started inside the current window. Decide which time
    // to use (this is the core), then account it.
    if (!new_window) {
        /* account_busy_for_cpu_time() = 1 so busy time needs
         * to be accounted to the current window. No rollover
         * since we didn't start a new window. An example of this is
         * when a task starts execution and then sleeps within the
         * same window. */

        // Condition: no irqtime, or not the idle task, or waiting on IO.
        if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq))
            delta = wallclock - mark_start;
        else
            delta = irqtime;
        // Scale the time and add it to the current window.
        delta = scale_exec_time(delta, rq);
        rq->curr_runnable_sum += delta;
        if (!is_idle_task(p) && !exiting_task(p))
            p->ravg.curr_window += delta;

        return;
    }

    // A new window has started, but p is not the rq's current task.
    if (!p_is_curr_task) {
        /* account_busy_for_cpu_time() = 1 so busy time needs
         * to be accounted to the current window. A new window
         * has also started, but p is not the current task, so the
         * window is not rolled over - just split up and account
         * as necessary into curr and prev. The window is only
         * rolled over when a new window is processed for the current
         * task.
         *
         * Irqtime can't be accounted by a task that isn't the
         * currently running task. */

        // Accounted in two steps: prev and curr.
        if (!nr_full_windows) {
            /* A full window hasn't elapsed, account partial
             * contribution to previous completed window. */
            delta = scale_exec_time(window_start - mark_start, rq);
            if (!exiting_task(p))
                p->ravg.prev_window += delta;
        } else {
            /* Since at least one full window has elapsed,
             * the contribution to the previous window is the
             * full window (window_size). */
            delta = scale_exec_time(window_size, rq);
            if (!exiting_task(p))
                p->ravg.prev_window = delta;
        }
        rq->prev_runnable_sum += delta;

        /* Account piece of busy time in the current window. */
        delta = scale_exec_time(wallclock - window_start, rq);
        rq->curr_runnable_sum += delta;
        if (!exiting_task(p))
            p->ravg.curr_window = delta;

        return;
    }

    // p is the current task and this is not idle-task IRQ time.
    if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) {
        /* account_busy_for_cpu_time() = 1 so busy time needs
         * to be accounted to the current window. A new window
         * has started and p is the current task so rollover is
         * needed. If any of these three above conditions are true
         * then this busy time can't be accounted as irqtime.
         *
         * Busy time for the idle task or exiting tasks need not
         * be accounted.
         *
         * An example of this would be a task that starts execution
         * and then sleeps once a new window has begun. */

        if (!nr_full_windows) {
            /* A full window hasn't elapsed, account partial
             * contribution to previous completed window. */
            delta = scale_exec_time(window_start - mark_start, rq);
            if (!is_idle_task(p) && !exiting_task(p))
                p->ravg.prev_window += delta;

            delta += rq->curr_runnable_sum;
        } else {
            /* Since at least one full window has elapsed,
             * the contribution to the previous window is the
             * full window (window_size). */
            delta = scale_exec_time(window_size, rq);
            if (!is_idle_task(p) && !exiting_task(p))
                p->ravg.prev_window = delta;
        }
        /*
         * Rollover for normal runnable sum is done here by overwriting
         * the values in prev_runnable_sum and curr_runnable_sum.
         * Rollover for new task runnable sum has completed by previous
         * if-else statement.
         */
        rq->prev_runnable_sum = delta;

        /* Account piece of busy time in the current window. */
        delta = scale_exec_time(wallclock - window_start, rq);
        rq->curr_runnable_sum = delta;
        if (!is_idle_task(p) && !exiting_task(p))
            p->ravg.curr_window = delta;

        return;
    }

    // IRQ time.
    if (irqtime) {
        /* account_busy_for_cpu_time() = 1 so busy time needs
         * to be accounted to the current window. A new window
         * has started and p is the current task so rollover is
         * needed. The current task must be the idle task because
         * irqtime is not accounted for any other task.
         *
         * Irqtime will be accounted each time we process IRQ activity
         * after a period of idleness, so we know the IRQ busy time
         * started at wallclock - irqtime. */

        BUG_ON(!is_idle_task(p));
        mark_start = wallclock - irqtime;

        /* Roll window over. If IRQ busy time was just in the current
         * window then that is all that need be accounted. */
        rq->prev_runnable_sum = rq->curr_runnable_sum;
        if (mark_start > window_start) {
            rq->curr_runnable_sum = scale_exec_time(irqtime, rq);
            return;
        }

        /* The IRQ busy time spanned multiple windows. Process the
         * busy time preceding the current window start first. */
        delta = window_start - mark_start;
        if (delta > window_size)
            delta = window_size;
        delta = scale_exec_time(delta, rq);
        rq->prev_runnable_sum += delta;

        /* Process the remaining IRQ busy time in the current window. */
        delta = wallclock - window_start;
        rq->curr_runnable_sum = scale_exec_time(delta, rq);

        return;
    }

    BUG();
}

 

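The details are annotated in the function itself. A brief summary:

  1. Busy time is calculated differently depending on the type of task;
  2. The core calculation is the same in every case; only the specific values differ;
  3. The data updated is
    prev_window and curr_window in the task
    prev_runnable_sum and curr_runnable_sum in the rq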
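The rollover of the rq counters for the currently running task can be sketched as follows. This is a simplified user-space illustration with hypothetical names. It covers only the case where p is the current task and no IRQ time is involved, and it leaves out frequency scaling and the per-task curr_window/prev_window fields.

```c
#include <assert.h>
#include <stdint.h>

struct rq_sums {
    uint64_t prev_runnable_sum; /* busy time in the previous window */
    uint64_t curr_runnable_sum; /* busy time in the current window */
};

/* Account the busy segment [ms, wc] of the current task into the rq sums. */
static void account_curr_task(struct rq_sums *rq, uint64_t ms, uint64_t ws,
                              uint64_t wc, uint64_t window_size)
{
    uint64_t nr_full_windows;

    assert(window_size > 0 && ms <= wc && ws <= wc);

    if (ms >= ws) { /* no new window: just accumulate */
        rq->curr_runnable_sum += wc - ms;
        return;
    }

    nr_full_windows = (ws - ms) / window_size;
    if (!nr_full_windows)
        /* the old current window becomes the previous one, plus the
         * part of this segment that ran before ws */
        rq->prev_runnable_sum = rq->curr_runnable_sum + (ws - ms);
    else
        /* the task ran through at least one full window */
        rq->prev_runnable_sum = window_size;

    rq->curr_runnable_sum = wc - ws;
}
```

The design point to notice is that the rollover overwrites prev_runnable_sum rather than adding to it: once a new window starts for the current task, whatever the previous window held is replaced.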

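3.3.5 irq load accounting and its callers

3.3.5.1 Three variables related to irq:

cur_irqload: the IRQ execution time accumulated on the rq since the last update
avg_irqload: the rq's average irqload; it depends on how frequent interrupts are, decays gradually, and is a running sum;
u64 irqload_ts: the time at which WALT last accounted irqload; it is used to determine how frequent interrupts are;

3.3.5.2 Call flow

All three values are set to 0 in sched_init. As noted earlier, this accounting is invoked on interrupts. In detail: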

void walt_account_irqtime(int cpu, struct task_struct *curr,
                          u64 delta, u64 wallclock)
{
    struct rq *rq = cpu_rq(cpu);
    unsigned long flags, nr_windows;
    u64 cur_jiffies_ts;

    raw_spin_lock_irqsave(&rq->lock, flags);

    /*
     * cputime (wallclock) uses sched_clock so use the same here for
     * consistency.
     */
    // Add the time elapsed between reading wallclock and reaching this
    // point, as a correction.
    // Tracing the caller, delta is passed in as
    // sched_clock_cpu - irq_start_time, i.e. the IRQ execution time.
    delta += sched_clock() - wallclock;
    cur_jiffies_ts = get_jiffies_64();

    // For the idle task, run the WALT update. The current time is used as
    // wallclock, and delta is the IRQ execution time.
    if (is_idle_task(curr))
        walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(),
                              delta);

    // Time between two IRQ accounting calls; nr_windows here is a number
    // of ticks (jiffies).
    nr_windows = cur_jiffies_ts - rq->irqload_ts;

    // This reflects how often interrupts fire on this CPU, judged against
    // 10 ticks. With HZ set to 250, one tick is 4 ms.
    if (nr_windows) {
        if (nr_windows < 10) { // fewer than 10 ticks: avg_irqload decays to 3/4
            /* Decay CPU's irqload by 3/4 for each window. */
            rq->avg_irqload *= (3 * nr_windows);
            rq->avg_irqload = div64_u64(rq->avg_irqload,
                                        4 * nr_windows);
        } else { // 10 ticks or more: avg_irqload is discarded and set to 0
            rq->avg_irqload = 0;
        }
        // Add the current irqload.
        rq->avg_irqload += rq->cur_irqload;
        rq->cur_irqload = 0;
    }

    rq->cur_irqload += delta;
    // irqload_ts is set to the current time. Searching the code, irqload_ts
    // is only used in these two places, so it is the time of the last IRQ
    // accounting.
    rq->irqload_ts = cur_jiffies_ts;
    raw_spin_unlock_irqrestore(&rq->lock, flags);
}

 

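account_irq_enter_time/account_irq_exit_time ==> irq_account_irq ==> walt_account_irqtime
This flow is fairly simple:

  1. Data is accounted both when an interrupt is entered and when it exits;
  2. The data accounted is the interrupt execution time;
  3. How the rq's value accumulates depends on how frequently interrupts arrive;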
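The decay arithmetic is easier to follow with concrete numbers. Below is a minimal user-space sketch of the averaging step, with hypothetical struct and function names; the expressions are copied from walt_account_irqtime() as listed above. One thing the numbers make visible: because the code multiplies by 3 * n and divides by 4 * n, the n cancels, so in this listing the decay factor is 3/4 regardless of how many ticks (fewer than 10) have passed.

```c
#include <assert.h>
#include <stdint.h>

struct irq_stats {
    uint64_t cur_irqload; /* IRQ time accumulated since the last fold */
    uint64_t avg_irqload; /* decayed running sum */
    uint64_t irqload_ts;  /* jiffies of the last accounting */
};

/* Account delta_ns of IRQ time observed at time now_jiffies. */
static void account_irq(struct irq_stats *s, uint64_t delta_ns,
                        uint64_t now_jiffies)
{
    uint64_t elapsed;

    assert(now_jiffies >= s->irqload_ts);
    elapsed = now_jiffies - s->irqload_ts;

    if (elapsed) {
        if (elapsed < 10) {
            /* (avg * 3 * n) / (4 * n): n cancels, so this is 3/4 */
            s->avg_irqload = (s->avg_irqload * 3 * elapsed) / (4 * elapsed);
        } else {
            s->avg_irqload = 0; /* interrupts are rare: drop the history */
        }
        s->avg_irqload += s->cur_irqload; /* fold in the pending load */
        s->cur_irqload = 0;
    }

    s->cur_irqload += delta_ns;
    s->irqload_ts = now_jiffies;
}
```

Starting from avg_irqload = 400 and cur_irqload = 100, an update one tick later gives 400 * 3/4 + 100 = 400. An update within the same tick only adds to cur_irqload, and an update 10 or more ticks later resets the average before the pending load is folded in.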

3.3.5.3 The first place irqload is used

Checking a CPU's irq load. Straight to the code:

#define WALT_HIGH_IRQ_TIMEOUT 3
u64 walt_irqload(int cpu)
{
    struct rq *rq = cpu_rq(cpu);
    s64 delta;

    delta = get_jiffies_64() - rq->irqload_ts;

    /*
     * Current context can be preempted by irq and rq->irqload_ts can be
     * updated by irq context so that delta can be negative.
     * But this is okay and we can safely return as this means there
     * was recent irq occurrence.
     */
    // This check guards against delta changing after the context is
    // preempted by an IRQ. Why the threshold is 3 is still unclear to me.
    if (delta < WALT_HIGH_IRQ_TIMEOUT)
        return rq->avg_irqload;
    else
        return 0;
}

// This function is used in find_best_target, i.e. to check the load when
// looking for the next CPU during migration.
int walt_cpu_high_irqload(int cpu)
{
    return walt_irqload(cpu) >= sysctl_sched_walt_cpu_high_irqload; // defaults to 10 ms
}

 

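3.4 Key structures

  1. rq // statistics added to the runqueue
  2. task_struct // corresponding variables added to task_struct
  3. ravg // the structure this calculation is built around

3.4.1 rq

[Figure: WALT fields in rq]
The corresponding structure definition: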

struct rq {
    ...
#ifdef CONFIG_SCHED_WALT
    u64 cumulative_runnable_avg;
    u64 window_start;
    u64 curr_runnable_sum;
    u64 prev_runnable_sum;
    u64 nt_curr_runnable_sum;
    u64 nt_prev_runnable_sum;
    u64 cur_irqload;
    u64 avg_irqload;
    u64 irqload_ts;
    u64 cum_window_demand;
#endif /* CONFIG_SCHED_WALT */
    ...
};

 

3.4.2 task_struct

[Figure: WALT fields in task_struct]

struct task_struct {
    ...
#ifdef CONFIG_SCHED_WALT
    struct ravg ravg;
    /*
     * 'init_load_pct' represents the initial task load assigned to children
     * of this task
     */
    u32 init_load_pct;
    u64 last_sleep_ts;
#endif
    ...
};

#ifdef CONFIG_SCHED_WALT
/* ravg represents frequency scaled cpu-demand of tasks */
struct ravg {
    /*
     * 'mark_start' marks the beginning of an event (task waking up, task
     * starting to execute, task being preempted) within a window
     *
     * 'sum' represents how runnable a task has been within current
     * window. It incorporates both running time and wait time and is
     * frequency scaled.
     *
     * 'sum_history' keeps track of history of 'sum' seen over previous
     * RAVG_HIST_SIZE windows. Windows where task was entirely sleeping are
     * ignored.
     *
     * 'demand' represents maximum sum seen over previous
     * sysctl_sched_ravg_hist_size windows. 'demand' could drive frequency
     * demand for tasks.
     *
     * 'curr_window' represents task's contribution to cpu busy time
     * statistics (rq->curr_runnable_sum) in current window
     *
     * 'prev_window' represents task's contribution to cpu busy time
     * statistics (rq->prev_runnable_sum) in previous window
     */
    u64 mark_start;  // beginning of an event (task waking up, starting to execute, being preempted) within a window
    u32 sum, demand; // sum: how runnable the task has been within the current window
    u32 sum_history[RAVG_HIST_SIZE_MAX];
    u32 curr_window, prev_window;
    u16 active_windows;
};
#endif

 

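4. Appendix

4.1 How Linux scheduling has evolved

  1. Runqueues divided by priority into active and expired arrays, for faster scheduling;
  2. CFS introduces the concept of virtual time, converting priority into different amounts of physical time;
  3. CFS + PELT: tasks are placed and migrated more sensibly;
  4. CFS + WALT: faster response, better suited to devices such as phones, with a good balance between performance and power consumption;

4.2 Still to be added

  1. update history code [done]
  2. irq call flow [done]
  3. How the updated data is used ==> I plan to trace the top flow, and hope to have a first pass done tomorrow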

