Linux Process Management (2): The CFS Scheduler

This article is a repost. Posted 2018-06-12 21:00, under Linux process management.

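Linux Process Management (1): The Birth of a Process

Linux Process Management (2): The CFS Scheduler

Linux Process Management (3): SMP Load Balancing

Linux Process Management (4): The HMP Scheduler

Linux Process Management (5): The NUMA Scheduler

Linux Process Management (6): EAS, the Energy-Saving Scheduler

Linux Process Management (7): Real-Time Scheduling

Linux Process Management (8): Latest Updates and Outlook

Linux Process Management (Extra): Kernel Threads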

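Processes can be classified by their behavior into interactive processes, batch processes, and real-time processes.

The O(N) scheduler compares the priorities of all processes on the run queue and picks the highest-priority one to run next. Every process has a fixed time slice; when a process uses up its slice, the scheduler picks the next process, and once every process has had a turn the time slices are handed out again. Because the scheduler has to walk the whole run queue before each pick, selection takes O(N) time.

The O(1) scheduler shortens the time needed to pick the next process. It maintains a set of priority queues for each CPU, one queue per priority level, so picking the next process only requires checking the bitmap that goes with the priority queues to see which queue holds ready processes. The lookup time is a constant, O(1).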

Linux defines five scheduler classes, corresponding to stop, deadline, realtime, cfs, and idle. They are chained together through the next pointer.

const struct sched_class stop_sched_class = {
    .next            = &dl_sched_class,
...
};

const struct sched_class dl_sched_class = {
    .next            = &rt_sched_class,
...
};

const struct sched_class rt_sched_class = {
    .next            = &fair_sched_class,
...
};

const struct sched_class fair_sched_class = {
    .next            = &idle_sched_class,
...
};

const struct sched_class idle_sched_class = {
    /* .next is NULL */
...
};

/*
 * Scheduling policies
 */
#define SCHED_NORMAL        0
#define SCHED_FIFO        1
#define SCHED_RR        2
#define SCHED_BATCH        3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE        5
#define SCHED_DEADLINE        6

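Linux also defines six scheduling policies and three kinds of scheduling entity. The table below shows how they relate. Note that SCHED_IDLE tasks are handled by fair_sched_class; idle_sched_class serves only the per-CPU idle task.

Scheduler class	Scheduling policy	Scheduling entity	Priority
stop_sched_class
dl_sched_class	SCHED_DEADLINE	sched_dl_entity	< 0
rt_sched_class	SCHED_FIFO, SCHED_RR	sched_rt_entity	[0, 100)
fair_sched_class	SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE	sched_entity	>= 100
idle_sched_class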

For the real-time scheduling class, see "Analysis of the Real-Time Scheduling Class, with a FIFO vs. RR Comparison Experiment".

1. Weight calculation

1.1 Computing priority

Before computing priority, it helps to understand what each priority-related member of struct task_struct means.

struct task_struct {
...
    int prio, static_prio, normal_prio;
    unsigned int rt_priority;
...
    unsigned int policy;
...
};

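prio: the dynamic priority of the process. The system chooses the scheduler class according to prio, and in some situations the priority has to be raised temporarily.

static_prio: the static priority, assigned when the process starts. The kernel does not store the nice value; it derives it from task_struct->static_prio with PRIO_TO_NICE. The value can be changed with nice/renice or setpriority().

normal_prio: the priority computed from static_prio and the scheduling policy. A newly created process inherits its parent's normal_prio. For a normal process, normal_prio equals static_prio; for a real-time process, normal_prio is recomputed from rt_priority.

rt_priority: the priority of a real-time process, equivalent to the sched_param.sched_priority parameter set for the process.

The nice and renice commands (through the nice() and setpriority() system calls) change static_prio.

rt_priority is 0 for a normal process and ranges from 1 to 99 for a real-time process.

For a normal process normal_prio equals static_prio; for a real-time process normal_prio = 99 - rt_priority.

normal_prio is obtained with normal_prio():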

static inline int __normal_prio(struct task_struct *p)
{
    return p->static_prio;
}

static inline int normal_prio(struct task_struct *p)
{
    int prio;

    if (task_has_dl_policy(p))
        prio = MAX_DL_PRIO-1;    /* DEADLINE tasks: fixed value of -1 */
    else if (task_has_rt_policy(p))
        prio = MAX_RT_PRIO-1 - p->rt_priority;    /* real-time tasks: normal_prio = 100 - 1 - rt_priority */
    else
        prio = __normal_prio(p);    /* normal tasks: normal_prio = static_prio */
    return prio;
}

For a normal process prio equals static_prio; for a real-time process prio and rt_priority satisfy prio + rt_priority = 99.

prio is obtained with effective_prio().

static int effective_prio(struct task_struct *p)
{
    p->normal_prio = normal_prio(p);
    /*
     * If we are RT tasks or we were boosted to RT priority,
     * keep the priority unchanged. Otherwise, update priority
     * to the normal priority:
     */
    if (!rt_prio(p->prio))    /* prio greater than 99: a normal task, so prio = normal_prio = static_prio */
        return p->normal_prio;
    return p->prio;
}

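Normal process: static_prio = prio = normal_prio; rt_priority = 0.

Real-time process: prio = normal_prio = 99 - rt_priority; rt_priority = sched_param.sched_priority, in the range [1, 99]; static_prio keeps its default value.

Relationship between static_prio and nice

The kernel represents priority with the values 0 to 139; a lower value means a higher priority. Values 0 to 99 are used by real-time processes and 100 to 139 by normal processes (SCHED_NORMAL/SCHED_BATCH).

The nice value passed in from user space maps onto the normal priority range, 100 to 139.

To convert between nice and prio, the kernel provides two macros, NICE_TO_PRIO and PRIO_TO_NICE.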

#define MAX_USER_RT_PRIO    100
#define MAX_RT_PRIO        MAX_USER_RT_PRIO

#define MAX_PRIO        (MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO        (MAX_RT_PRIO + NICE_WIDTH / 2)

/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)    ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)    ((prio) - DEFAULT_PRIO)

/*
 * 'User priority' is the nice value converted to something we
 * can work with better when scaling various scheduler parameters,
 * it's a [ 0 ... 39 ] range.
 */
#define USER_PRIO(p)        ((p)-MAX_RT_PRIO)
#define TASK_USER_PRIO(p)    USER_PRIO((p)->static_prio)
#define MAX_USER_PRIO        (USER_PRIO(MAX_PRIO))

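1.2 Computing weight

The kernel uses struct load_weight to record the weight of a scheduling entity.

The weight is computed from the priority, and a process's weight is available through task_struct->se.load.

Weights apply only to normal processes, whose nice values range from -20 to 19.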

struct task_struct {
...
    struct sched_entity se;
...
};

struct sched_entity {
    struct load_weight    load;        /* for load-balancing */
...
};

struct load_weight {
    unsigned long weight;    /* weight of the scheduling entity */
    u32 inv_weight;          /* inverse weight, an intermediate result used in weight calculations */
};

set_load_weight() sets the weight of a process, looking it up in prio_to_weight[] and prio_to_wmult[] by task_struct->static_prio.

static void set_load_weight(struct task_struct *p)
{
    int prio = p->static_prio - MAX_RT_PRIO;    /* the weight depends on static_prio; subtracting 100 (not 120) gives the index into the arrays below */
    struct load_weight *load = &p->se.load;

    /*
     * SCHED_IDLE tasks get minimal weight:
     */
    if (p->policy == SCHED_IDLE) {
        load->weight = scale_load(WEIGHT_IDLEPRIO);    /* SCHED_IDLE tasks get a fixed weight, 1/5 of the lowest normal-priority weight */
        load->inv_weight = WMULT_IDLEPRIO;             /* 5 times the inverse weight of the lowest normal priority */
        return;
    }

    load->weight = scale_load(prio_to_weight[prio]);
    load->inv_weight = prio_to_wmult[prio];
}

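nice ranges from -20 to 19, 40 levels in total; the higher the nice value, the lower the priority.

Each step up in priority gives a process 10% more CPU time while the other process gets 10% less, so the ratio between them changes from 1:1 to 1.1:0.9 ≈ 1.22.

The kernel expresses the ratio between adjacent priority levels with a coefficient of about 1.25.

Suppose processes A and B both have nice 0, so each has weight 1024.

Now A's nice becomes 1 while B's is unchanged. B then gets 55% of the run time and A gets 45%. A's weight must satisfy A/(A+1024) = 9/(9+11), so A = 1024*9/11 ≈ 838.

Linux does not apply the 1.22 coefficient strictly, though; it uses approximately 1.25.

A's weight therefore becomes 1024/1.25 ≈ 820.

prio_to_weight[] takes nice 0 as the baseline weight of 1024 and holds the precomputed weight for every nice value from -20 to 19, so set_load_weight() can look up a process's weight from its priority.

prio_to_wmult[] holds precomputed results that make the vruntime calculation cheaper.

inv_weight = 2³²/weight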

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

static const u32 prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

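1.2.1 Experiment: priority versus weight

Below we write a CPU-intensive process, run copies of it at different priorities, and use top to see how much CPU time each one actually gets.

This lets us verify the relationship between priority and weight.

The processes first have to be pinned to a single CPU; then their priorities are adjusted.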

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>

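int main(void)
{
  pid_t pid;
  cpu_set_t mask;

  /* Pin this process to CPU 0 so that all test processes compete for one CPU. */
  CPU_ZERO(&mask);
  CPU_SET(0, &mask);
  if(sched_setaffinity(0, sizeof(cpu_set_t), &mask) == -1)
  {
    perror("sched_setaffinity");
    exit(EXIT_FAILURE);
  }

  /* Set the nice value (-20 here; change it for each test process).
     Negative nice values normally require root privileges. */
  pid = getpid();
  if(setpriority(PRIO_PROCESS, pid, -20) == -1)
  {
    perror("setpriority");
    exit(EXIT_FAILURE);
  }

  /* Busy loop: stay CPU-bound forever. */
  while(1)
  {
  }

  return 0;
}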

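1.2.1.1 Nice -20 versus -19

In theory the CPU shares of -20 and -19 should be 88761:71755 = 1.24:1 = 55.3/44.7.

Looking at the actual run, the result matches expectations.

Viewing the two processes in kernelshark shows a regular pattern between them.

-20 ran for 2+3 ticks and -19 for 2+2 ticks; the ratio between them is also close to 1.25, as expected.

1.2.1.2 What about -20, -19, and -18 together?

The ratio among the three should be 88761:71755:56483 = 1.57:1.27:1.

The measured result is 41.2:32.9:25.9 = 1.59:1.27:1, as expected.

1.2.1.3 What about -20, -19, -18, and 0 together?

88761:71755:56483:1024 = 1.57:1.27:1:0.018

The measured result is 40.8:33.0:25.8:0.4 = 1.58:1.28:1:0.016, which is roughly as expected.

Why not use nice 0 as the baseline? First, with -20, -19, and -18 present, the measurement error for nice 0 is relatively large; second, the system also runs many other nice 0 processes.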

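1.3 Computing vruntime

The "fair" in CFS refers to equality of vruntime, not equality of actual run time.

CFS abandons the earlier algorithms based on fixed time slices and fixed scheduling periods; instead it quantifies and computes actual run time from each process's share of the total weight.

It introduces the notion of a virtual clock: a process's virtual time is its actual run time scaled by the ratio of the nice 0 weight to its own weight.

A process with a small nice value has a high priority and a large weight; its virtual clock runs slower than the real clock, so it can obtain more actual run time.

Conversely, a process with a large nice value has a low priority and a small weight, and obtains less actual run time.

CFS picks the process whose virtual clock is furthest behind, not the process with the least actual run time.

vruntime = delta_exec * nice_0_weight / weight

vruntime is the virtual run time of the process, delta_exec is its actual run time, nice_0_weight is the weight for nice 0, and weight is the weight of the process, which can be obtained from prio_to_weight[].

vruntime = (delta_exec * nice_0_weight * 2³² / weight) >> 32

Here 2³²/weight can be represented by inv_weight, which is available from prio_to_wmult[].

vruntime = (delta_exec * nice_0_weight * inv_weight) >> 32

calc_delta_fair() is the function that computes virtual time; its return value is the virtual time.

__calc_delta() is the core of the vruntime calculation: delta_exec is the actual run time of the process, weight is nice_0_weight, and lw is the process's load_weight, which contains its inv_weight.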

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
    if (unlikely(se->load.weight != NICE_0_LOAD))    /* if the weight equals NICE_0_LOAD, the virtual time is simply delta and __calc_delta() is not needed */
        delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

    return delta;
}

static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
    u64 fact = scale_load_down(weight);    /* fact equals weight */
    int shift = WMULT_SHIFT;               /* WMULT_SHIFT is 32 */

    __update_inv_weight(lw);    /* refresh load_weight->inv_weight; normally already set, so nothing to do */

    if (unlikely(fact >> 32)) {    /* fact >> 32 is normally 0, so this is skipped */
        while (fact >> 32) {
            fact >>= 1;
            shift--;
        }
    }

    /* hint to use a 32x32->64 mul */
    fact = (u64)(u32)fact * lw->inv_weight;    /* equivalent to nice_0_weight * inv_weight */

    while (fact >> 32) {
        fact >>= 1;
        shift--;
    }

    return mul_u64_u32_shr(delta_exec, fact, shift);    /* equivalent to (delta_exec * (nice_0_weight * inv_weight)) >> 32 */
}

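The lower a process's priority, the larger its inv_weight; nice_0_weight and the shift amount are the same for every process.

So for the same delta_exec, the lower the priority, the larger the vruntime.

CFS always picks the process with the smallest vruntime in the red-black tree. For the same actual run time, a high-priority process has the smallest vruntime, so it tends to be picked first. As its vruntime grows, however, lower-priority processes also get a chance to run.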

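1.4 Load calculation

The kernel computes CPU load with PELT (Per-Entity Load Tracking), which considers not only process weight but also tracks the load of every scheduling entity.

struct sched_entity contains a struct sched_avg that describes the load of the process.

runnable_sum: the total time the scheduling entity has been runnable on the run queue (sched_entity->on_rq == 1). It has two parts: time actually running, and time spent waiting on the run queue. When a process enters the run queue (enqueue_entity() is called), on_rq is set to 1; when it leaves the run queue, for example because it sleeps (dequeue_entity() is called), on_rq is cleared to 0. runnable_sum therefore measures the time the process spends on the run queue.

runnable_period: can be understood as the total time the scheduling entity has existed in the system; one period is 1024us. Once a process has been forked, runnable_period keeps increasing whether or not the process is on the run queue.

runnable_avg_sum: takes the influence of history on load into account by using a decay coefficient to compute the average load. It is the decayed, accumulated time the entity has been runnable on the run queue.

runnable_avg_period: the decayed, accumulated total time of the scheduling entity in the system.

last_runnable_update: the time of the most recent load update, used to compute the elapsed interval.

load_avg_contrib: the process's contribution to the average load.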

struct sched_avg {
    /*
     * These sums represent an infinite geometric series and so are bound
     * above by 1024/(1-y).  Thus we only need a u32 to store them for all
     * choices of y < 1-2^(-32)*1024.
     */
    u32 runnable_avg_sum, runnable_avg_period;
    u64 last_runnable_update;
    s64 decay_count;
    unsigned long load_avg_contrib;
};

struct sched_entity {
...
#ifdef CONFIG_SMP
    /* Per entity load average tracking */
    struct sched_avg  avg;
#endif
};

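1.4.1 Decay factor

A span of 1024us counts as one period, abbreviated PI.

Within one PI, the contribution to system load depends not only on the weight but also on the time the entity was runnable during that PI, that is, time spent running or waiting for the CPU.

An ideal approach is to track several actual PIs and use a decay coefficient to compute the contribution of past PIs to the load.

Let Li be the load contribution of a scheduling entity in the i-th period. The total load of the entity is then:

L = L₀ + L₁*y + L₂*y² + L₃*y³ + ... + L₃₂*y³² + ...

The load of a scheduling entity has to take time into account: not just the current load, but also its behavior over the recent past.

The usual convention is that the load from 32 periods ago counts for half, so y³² = 0.5, which gives a decay factor y of about 0.978.

The kernel also does not need an array to store the load contributions of past periods; it only needs to multiply the accumulated contribution of past periods by the decay coefficient y and add the current load L0.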

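The table below stores the decay factors multiplied by 2³²; after the multiplication the result is shifted right by 32 bits. This turns the floating-point decay computation into a multiplication and a shift:

L*yⁿ = (L*yⁿ*2³²)>>32 = (L*(0.978)ⁿ*2³²)>>32 = (L*runnable_avg_yN_inv[n])>>32

runnable_avg_yN_inv[n] gives the decay for the n-th period; in practice the accumulated load contribution over n periods is also needed.

runnable_avg_yN_sum[n] = 1024*(y + y² + y³ + ... + yⁿ)

The factor 1024 is used because one period is 1024 microseconds.

Both arrays below serve the accumulated load calculation, but runnable_avg_yN_inv[] gives the contribution of a single period, while runnable_avg_yN_sum[n] gives the contribution of n periods.

runnable_avg_yN_inv[] has 32 entries and runnable_avg_yN_sum[] has 33.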

static const u32 runnable_avg_yN_inv[] = {
    0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
    0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
    0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
    0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
    0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
    0x85aac367, 0x82cd8698,
};


static const u32 runnable_avg_yN_sum[] = {
        0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
     9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
    17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
};

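These two tables are used by decay_load() and __compute_runnable_contrib() respectively.

decay_load() takes a load value and a period count n and returns the decayed load value.

__compute_runnable_contrib() takes a single argument, the number of past periods, and returns the accumulated decayed load.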

1.4.2 update_entity_load_avg()

update_entity_load_avg() mainly updates the members of struct sched_avg: __update_entity_runnable_avg() updates last_runnable_update, runnable_avg_sum, and runnable_avg_period;

__update_entity_load_avg_contrib() updates load_avg_contrib; finally, cfs_rq->runnable_load_avg is updated as well.

static inline void update_entity_load_avg(struct sched_entity *se,
                      int update_cfs_rq)
{
    struct cfs_rq *cfs_rq = cfs_rq_of(se);
    long contrib_delta;
    u64 now;

    /*
     * For a group entity we need to use their owned cfs_rq_clock_task() in
     * case they are the parent of a throttled hierarchy.
     */
    if (entity_is_task(se))
        now = cfs_rq_clock_task(cfs_rq);
    else
        now = cfs_rq_clock_task(group_cfs_rq(se));

    /*
     * Updates last_runnable_update, runnable_avg_sum and runnable_avg_period.
     * If the update does not cross a 1024us period boundary, nothing is
     * decayed and the load contribution is not recomputed.
     */
    if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
        return;

    contrib_delta = __update_entity_load_avg_contrib(se);    /* compute this update's contribution and store it in load_avg_contrib */

    if (!update_cfs_rq)
        return;

    if (se->on_rq)
        cfs_rq->runnable_load_avg += contrib_delta;    /* accumulate into cfs_rq->runnable_load_avg */
    else
        subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
}


static __always_inline int __update_entity_runnable_avg(u64 now,
                            struct sched_avg *sa,
                            int runnable)    /* runnable: whether the task is on the run queue and eligible to be scheduled */
{
    u64 delta, periods;
    u32 runnable_contrib;
    int delta_w, decayed = 0;

    delta = now - sa->last_runnable_update;    /* time since the previous load update, in ns */
    /*
     * This should only happen when time goes backwards, which it
     * unfortunately does during sched clock init when we swap over to TSC.
     */
    if ((s64)delta < 0) {
        sa->last_runnable_update = now;
        return 0;
    }

    /*
     * Use 1024ns as the unit of measurement since it's a reasonable
     * approximation of 1us and fast to compute.
     */
    delta >>= 10;    /* delta is now in units of approximately 1 microsecond */
    if (!delta)
        return 0;
    sa->last_runnable_update = now;

    /* delta_w is the amount already accumulated against our next period */
    delta_w = sa->runnable_avg_period % 1024;    /* runnable_avg_period holds the total accumulated at the last update; delta_w is the part of it that did not fill a whole period, in microseconds */
    if (delta + delta_w >= 1024) {    /* a period boundary is crossed, so decay must be applied */
        /* period roll-over */
        decayed = 1;

        /*
         * Now that we know we're crossing a period boundary, figure
         * out how much from delta we need to complete the current
         * period and accrue it.
         */
        delta_w = 1024 - delta_w;
        if (runnable)
            sa->runnable_avg_sum += delta_w;
        sa->runnable_avg_period += delta_w;

        delta -= delta_w;

        /* Figure out how many additional periods this update spans */
        periods = delta / 1024;    /* number of full periods between the last update and this one */
        delta %= 1024;

        /* decay runnable_avg_sum and runnable_avg_period by periods + 1 periods */
        sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
                          periods + 1);
        sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
                             periods + 1);

        /* Efficiently calculate \sum (1..n_period) 1024*y^i */
        runnable_contrib = __compute_runnable_contrib(periods);    /* accumulated, decayed contribution of the past `periods` periods */
        if (runnable)
            sa->runnable_avg_sum += runnable_contrib;
        sa->runnable_avg_period += runnable_contrib;
    }

    /* Remainder of delta accrued against u_0` */
    /* The part that does not fill a whole period is added directly. */
    if (runnable)
        sa->runnable_avg_sum += delta;
    sa->runnable_avg_period += delta;

    return decayed;    /* decayed tells the caller whether decay was applied */
}

/*
 * val is the load value from n periods ago. The result is val * y^n,
 * computed by table lookup as (val * runnable_avg_yN_inv[n]) >> 32.
 */
static __always_inline u64 decay_load(u64 val, u64 n)
{
    unsigned int local_n;

    if (!n)
        return val;    /* n = 0: the current period, no decay */
    else if (unlikely(n > LOAD_AVG_PERIOD * 63))    /* n > 2016: LOAD_AVG_PERIOD is 32; beyond 2016 periods the value is treated as fully decayed */
        return 0;

    /* after bounds checking we can collapse to 32-bit */
    local_n = n;

    /*
     * As y^PERIOD = 1/2, we can combine
     *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
     * With a look-up table which covers y^n (n<PERIOD)
     *
     * To achieve constant time decay_load.
     */
    if (unlikely(local_n >= LOAD_AVG_PERIOD)) {    /* 32 <= n <= 2016: every 32 periods halves val (one right shift); the remaining period count stays in local_n */
        val >>= local_n / LOAD_AVG_PERIOD;
        local_n %= LOAD_AVG_PERIOD;
    }

    val *= runnable_avg_yN_inv[local_n];    /* 0 < n < 32: look up the decay factor for local_n */
    /* We don't use SRR here since we always want to round down. */
    return val >> 32;    /* shift right by 32 bits to normalize */
}

static u32 __compute_runnable_contrib(u64 n)
{
    u32 contrib = 0;

    if (likely(n <= LOAD_AVG_PERIOD))    /* n <= 32: look the result up directly */
        return runnable_avg_yN_sum[n];
    else if (unlikely(n >= LOAD_AVG_MAX_N))    /* n >= 345: return the maximum, 47742, which is the accumulated decayed sum over 345 periods */
        return LOAD_AVG_MAX;

    /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
    do {    /* step by LOAD_AVG_PERIOD: accumulate the decayed sum of the past n/32 blocks of 32 periods */
        contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
        contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];    /* always the n = 32 entry */

        n -= LOAD_AVG_PERIOD;
    } while (n > LOAD_AVG_PERIOD);

    contrib = decay_load(contrib, n);    /* the blocks are a further n periods old, so decay them by n */
    return contrib + runnable_avg_yN_sum[n];    /* add the remaining periods that do not fill a block of 32 */
}

static long __update_entity_load_avg_contrib(struct sched_entity *se)
{
    long old_contrib = se->avg.load_avg_contrib;

    if (entity_is_task(se)) {
        __update_task_entity_contrib(se);
    } else {
        __update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
        __update_group_entity_contrib(se);
    }

    return se->avg.load_avg_contrib - old_contrib;
}

static inline void __update_task_entity_contrib(struct sched_entity *se)
{
    u32 contrib;

    /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
    contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
    contrib /= (se->avg.runnable_avg_period + 1);
    se->avg.load_avg_contrib = scale_load(contrib);    /* update sched_avg->load_avg_contrib */
}

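load_avg_contrib = (runnable_avg_sum * weight) / (runnable_avg_period + 1)

So the average load of a scheduling entity depends on three factors:

the weight of the scheduling entity, weight
the decayed, accumulated time the entity has been runnable, runnable_avg_sum
the decayed, accumulated total time of the entity in the scheduler, runnable_avg_period

The closer runnable_avg_sum is to runnable_avg_period, the larger the average load, which indicates that the scheduling entity is occupying the CPU almost all the time.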

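2. Process creation

2.1 sched_entity, rq, cfs_rq

struct sched_entity is embedded in task_struct and is called the scheduling entity; it holds all the information a process needs in order to take part in scheduling as a scheduling entity.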

struct sched_entity {
    struct load_weight    load;        /* for load-balancing; the weight of the scheduling entity */
    struct rb_node        run_node;    /* this entity's node in the red-black tree */
    struct list_head    group_node;
    unsigned int        on_rq;         /* whether this entity is on the run queue and eligible to be scheduled */

    u64            exec_start;
    u64            sum_exec_runtime;
    u64            vruntime;    /* virtual run time of this scheduling entity */
    u64            prev_sum_exec_runtime;

    u64            nr_migrations;

#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    int            depth;
    struct sched_entity    *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq        *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq        *my_q;    /* a non-NULL my_q means this entity is a scheduling group rather than a single task */
#endif

#ifdef CONFIG_SMP
    /* Per-entity load-tracking */
    struct sched_avg    avg;    /* average load information of the scheduling entity */
#endif
};

struct sched_entity is per-task; struct rq is per-CPU.

Every CPU in the system has one struct rq; this_rq() returns the struct rq of the current CPU.

struct rq is the generic per-CPU run queue. It records everything a run queue needs, including a CFS run queue (struct cfs_rq), a real-time run queue (struct rt_rq), and a deadline run queue (struct dl_rq).

struct rq {
    /* runqueue lock: */
    raw_spinlock_t lock;

    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned int nr_running;    /* number of runnable tasks */
#ifdef CONFIG_NUMA_BALANCING
    unsigned int nr_numa_running;
    unsigned int nr_preferred_running;
#endif
    #define CPU_LOAD_IDX_MAX 5
    unsigned long cpu_load[CPU_LOAD_IDX_MAX];
    unsigned long last_load_update_tick;
#ifdef CONFIG_NO_HZ_COMMON
    u64 nohz_stamp;
    unsigned long nohz_flags;
#endif
#ifdef CONFIG_NO_HZ_FULL
    unsigned long last_sched_tick;
#endif
    /* capture load from *all* tasks on this cpu: */
    struct load_weight load;    /* weight of the run queue */
    unsigned long nr_load_updates;
    u64 nr_switches;

    struct cfs_rq cfs;    /* CFS run queue */
    struct rt_rq rt;      /* RT run queue */
    struct dl_rq dl;      /* deadline run queue */

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* list of leaf cfs_rq on this cpu: */
    struct list_head leaf_cfs_rq_list;

    struct sched_avg avg;
#endif /* CONFIG_FAIR_GROUP_SCHED */

    /*
     * This is part of a global counter where only the total sum
     * over all CPUs matters. A task can increase this counter on
     * one CPU and if it got migrated afterwards it may decrease
     * it on another CPU. Always updated under the runqueue lock:
     */
    unsigned long nr_uninterruptible;

    struct task_struct *curr, *idle, *stop;
    unsigned long next_balance;
    struct mm_struct *prev_mm;

    unsigned int clock_skip_update;
    u64 clock;
    u64 clock_task;

    atomic_t nr_iowait;

#ifdef CONFIG_SMP
    struct root_domain *rd;
    struct sched_domain *sd;

    unsigned long cpu_capacity;

    unsigned char idle_balance;
    /* For active balancing */
    int post_schedule;
    int active_balance;
    int push_cpu;
    struct cpu_stop_work active_balance_work;
    /* cpu of this runqueue: */
    int cpu;
    int online;

    struct list_head cfs_tasks;

    u64 rt_avg;
    u64 age_stamp;
    u64 idle_stamp;
    u64 avg_idle;

    /* This is used to determine avg_idle's max value */
    u64 max_idle_balance_cost;
#endif

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
    u64 prev_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
    u64 prev_steal_time;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
    u64 prev_steal_time_rq;
#endif

    /* calc_load related fields */
    unsigned long calc_load_update;
    long calc_load_active;

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
    int hrtick_csd_pending;
    struct call_single_data hrtick_csd;
#endif
    struct hrtimer hrtick_timer;
#endif

#ifdef CONFIG_SCHEDSTATS
    /* latency stats */
    struct sched_info rq_sched_info;
    unsigned long long rq_cpu_time;
    /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

    /* sys_sched_yield() stats */
    unsigned int yld_count;

    /* schedule() stats */
    unsigned int sched_count;
    unsigned int sched_goidle;

    /* try_to_wake_up() stats */
    unsigned int ttwu_count;
    unsigned int ttwu_local;
#endif

#ifdef CONFIG_SMP
    struct llist_head wake_list;
#endif

#ifdef CONFIG_CPU_IDLE
    /* Must be inspected within a rcu lock section */
    struct cpuidle_state *idle_state;
#endif
};

struct cfs_rq {
    struct load_weight load;    /* weight of the CFS run queue */
    unsigned int nr_running, h_nr_running;

    u64 exec_clock;
    u64 min_vruntime;    /* tracks the smallest vruntime in this run queue's red-black tree */
#ifndef CONFIG_64BIT
    u64 min_vruntime_copy;
#endif

    struct rb_root tasks_timeline;    /* root of the run queue's red-black tree */
    struct rb_node *rb_leftmost;      /* leftmost node of the tree, i.e. the entity with the smallest vruntime; this is the one chosen when picking the next process to run */

    /*
     * 'curr' points to currently running entity on this cfs_rq.
     * It is set to NULL otherwise (i.e when none are currently running).
     */
    struct sched_entity *curr, *next, *last, *skip;

#ifdef    CONFIG_SCHED_DEBUG
    unsigned int nr_spread_over;
#endif

#ifdef CONFIG_SMP
    /*
     * CFS Load tracking
     * Under CFS, load is tracked on a per-entity basis and aggregated up.
     * This allows for the description of both thread and group usage (in
     * the FAIR_GROUP_SCHED case).
     */
    unsigned long runnable_load_avg, blocked_load_avg;    /* runnable_load_avg tracks the total average load of this run queue */
    atomic64_t decay_counter;
    u64 last_decay;
    atomic_long_t removed_load;

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* Required to track per-cpu representation of a task_group */
    u32 tg_runnable_contrib;
    unsigned long tg_load_contrib;

    /*
     *   h_load = weight * f(tg)
     *
     * Where f(tg) is the recursive weight fraction assigned to
     * this group.
     */
    unsigned long h_load;
    u64 last_h_load_update;
    struct sched_entity *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct rq *rq;    /* cpu runqueue to which this cfs_rq is attached */

    /*
     * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
     * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
     * (like users, containers etc.)
     *
     * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
     * list is used during load balance.
     */
    int on_list;
    struct list_head leaf_cfs_rq_list;
    struct task_group *tg;    /* group that "owns" this runqueue (the group-scheduling data structure) */

#ifdef CONFIG_CFS_BANDWIDTH
    int runtime_enabled;
    u64 runtime_expires;
    s64 runtime_remaining;

    u64 throttled_clock, throttled_clock_task;
    u64 throttled_clock_task_time;
    int throttled, throttle_count;
    struct list_head throttled_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};

The corresponding cfs_rq can be found from a task_struct: task_thread_info() takes a struct task_struct to its thread_info;

struct thread_info gives the CPU, cpu_rq() gives that CPU's struct rq, and from there the corresponding struct cfs_rq.

#define task_thread_info(task)    ((struct thread_info *)(task)->stack)

static inline unsigned int task_cpu(const struct task_struct *p)
{
    return task_thread_info(p)->cpu;
}

DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

#define cpu_rq(cpu)        (&per_cpu(runqueues, (cpu)))
#define this_rq()        this_cpu_ptr(&runqueues)    /* struct rq of the current CPU */
#define task_rq(p)        cpu_rq(task_cpu(p))    /* struct rq of the CPU that task p is on */

static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
{
    return &task_rq(p)->cfs;
}

2.2 fair_sched_class

struct sched_class holds the operations of a scheduler class; fair_sched_class, the class of the CFS scheduler, defines the CFS operations.

Each of these methods is described below.

const struct sched_class fair_sched_class = {
    .next            = &idle_sched_class,
    .enqueue_task        = enqueue_task_fair,
    .dequeue_task        = dequeue_task_fair,
    .yield_task        = yield_task_fair,
    .yield_to_task        = yield_to_task_fair,

    .check_preempt_curr    = check_preempt_wakeup,

    .pick_next_task        = pick_next_task_fair,
    .put_prev_task        = put_prev_task_fair,

#ifdef CONFIG_SMP
    .select_task_rq        = select_task_rq_fair,
    .migrate_task_rq    = migrate_task_rq_fair,

    .rq_online        = rq_online_fair,
    .rq_offline        = rq_offline_fair,

    .task_waking        = task_waking_fair,
#endif

    .set_curr_task          = set_curr_task_fair,
    .task_tick        = task_tick_fair,
    .task_fork        = task_fork_fair,

    .prio_changed        = prio_changed_fair,
    .switched_from        = switched_from_fair,
    .switched_to        = switched_to_fair,

    .get_rr_interval    = get_rr_interval_fair,

    .update_curr        = update_curr_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
    .task_move_group    = task_move_group_fair,
#endif
};

2.3 Process creation

Process creation is carried out by do_fork(); the do_fork() --> copy_process() path performs the scheduler-related initialization, after which do_fork() wakes the new task.

do_fork()
  copy_process()
    sched_fork()
      __sched_fork()
      fair_sched_class->task_fork()->task_fork_fair()
        __set_task_cpu()
        update_curr()
        place_entity()
  wake_up_new_task()
    activate_task()
      enqueue_task()
        fair_sched_class->enqueue_task-->enqueue_task_fair()

2.3.1 sched_fork()

sched_fork() calls __sched_fork() to initialize the struct task_struct.

int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    unsigned long flags;
    int cpu = get_cpu();    /* disable preemption and get the current CPU number */

    __sched_fork(clone_flags, p);
    p->state = TASK_RUNNING;    /* not actually running yet; the task has not been added to the scheduler */
    p->prio = current->normal_prio;

    /*
     * Revert to default priority/policy on fork if requested.
     */
    if (unlikely(p->sched_reset_on_fork)) {
        /* sched_reset_on_fork is set: reset policy, static_prio, prio, weight, inv_weight, etc. */
        if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
            p->policy = SCHED_NORMAL;
            p->static_prio = NICE_TO_PRIO(0);
            p->rt_priority = 0;
        } else if (PRIO_TO_NICE(p->static_prio) < 0)
            p->static_prio = NICE_TO_PRIO(0);

        p->prio = p->normal_prio = __normal_prio(p);
        set_load_weight(p);
        p->sched_reset_on_fork = 0;
    }

    if (dl_prio(p->prio)) {
        put_cpu();
        return -EAGAIN;
    } else if (rt_prio(p->prio)) {
        p->sched_class = &rt_sched_class;
    } else {
        p->sched_class = &fair_sched_class;    /* the scheduler class is chosen according to task_struct->prio */
    }

    if (p->sched_class->task_fork)
        p->sched_class->task_fork(p);    /* call the class's task_fork method; for CFS this is task_fork_fair() */

    raw_spin_lock_irqsave(&p->pi_lock, flags);
    set_task_cpu(p, cpu);    /* assign p to cpu; if the CPU recorded in p's thread_info (task_struct->stack->cpu) differs from cpu, the CPU-related state is switched over to the new CPU */
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
    if (likely(sched_info_on()))
        memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
    p->on_cpu = 0;
#endif
    init_task_preempt_count(p);    /* initialize preempt_count */
#ifdef CONFIG_SMP
    plist_node_init(&p->pushable_tasks, MAX_PRIO);
    RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif

    put_cpu();    /* re-enable preemption */
    return 0;
}
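
The class selection at the end of sched_fork() depends only on the numeric priority. The following is a minimal user-space sketch of that decision, assuming the ranges from the table at the top of this article: deadline priorities are negative and real-time priorities lie in [0, 100). The enum and the helper name are illustrative and are not kernel identifiers.

```c
#include <errno.h>

#define SKETCH_MAX_RT_PRIO 100   /* assumed: RT priorities occupy [0, 100) */

enum sketch_sched_kind { SKETCH_KIND_RT, SKETCH_KIND_FAIR };

/*
 * Mirrors the if/else chain in sched_fork(): returns 0 and stores the
 * chosen class, or -EAGAIN when the inherited priority is a deadline one.
 */
static int sketch_pick_kind(int prio, enum sketch_sched_kind *out)
{
    if (prio < 0)                      /* corresponds to dl_prio() */
        return -EAGAIN;
    if (prio < SKETCH_MAX_RT_PRIO)     /* corresponds to rt_prio() */
        *out = SKETCH_KIND_RT;
    else
        *out = SKETCH_KIND_FAIR;       /* everything else goes to CFS */
    return 0;
}
```

A nice-0 process has prio 120, so it falls through to the fair class.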

__sched_fork() sets the initial values of the task_struct data structure.

static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    p->on_rq            = 0;

    p->se.on_rq            = 0;
    p->se.exec_start        = 0;
    p->se.sum_exec_runtime        = 0;
    p->se.prev_sum_exec_runtime    = 0;
    p->se.nr_migrations        = 0;
    p->se.vruntime            = 0;
#ifdef CONFIG_SMP
    p->se.avg.decay_count        = 0;
#endif
    INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_SCHEDSTATS
    memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif

    RB_CLEAR_NODE(&p->dl.rb_node);
    init_dl_task_timer(&p->dl);
    __dl_clear_params(p);

    INIT_LIST_HEAD(&p->rt.run_list);

#ifdef CONFIG_PREEMPT_NOTIFIERS
    INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
...
}

task_fork_fair() takes the newly created process as its parameter.

static void task_fork_fair(struct task_struct *p)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se, *curr;
    int this_cpu = smp_processor_id();--------------------------Get the current CPU id.
    struct rq *rq = this_rq();
    unsigned long flags;

    raw_spin_lock_irqsave(&rq->lock, flags);

    update_rq_clock(rq);

    cfs_rq = task_cfs_rq(current);------------------------------Get the cfs_rq of the CPU that the current process runs on.
    curr = cfs_rq->curr;

    /*
     * Not only the cpu but also the task_group of the parent might have
     * been changed after parent->se.parent,cfs_rq were copied to
     * child->se.parent,cfs_rq. So call __set_task_cpu() to make those
     * of child point to valid ones.
     */
    rcu_read_lock();
    __set_task_cpu(p, this_cpu);--------------------------------Bind process p to the current CPU. p->wake_cpu, which is set here, is used later when the process is woken up.
    rcu_read_unlock();

    update_curr(cfs_rq);----------------------------------------Update the runtime statistics of cfs_rq->curr, the current scheduling entity.

    if (curr)
        se->vruntime = curr->vruntime;
    place_entity(cfs_rq, se, 1);--------------------------------cfs_rq is the parent's CFS run queue, se is the scheduling entity of process p, and initial is 1.

    if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
        /*
         * Upon rescheduling, sched_class::put_prev_task() will place
         * 'current' within the tree based on its new key value.
         */
        swap(curr->vruntime, se->vruntime);
        resched_curr(rq);
    }

    se->vruntime -= cfs_rq->min_vruntime;

    raw_spin_unlock_irqrestore(&rq->lock, flags);
}

set_task_cpu() binds a process to the specified CPU.

update_curr() is a core function of the CFS scheduler. It updates cfs_rq->curr, the current scheduling entity.

The fields it updates include the entity's vruntime, sum_exec_runtime, and exec_start.

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
    if (task_cpu(p) != new_cpu) {
        if (p->sched_class->migrate_task_rq)
            p->sched_class->migrate_task_rq(p);
        p->se.nr_migrations++;
        perf_event_task_migrate(p);
    }

    __set_task_cpu(p, new_cpu);
}


static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
    set_task_rq(p, cpu);
#ifdef CONFIG_SMP
    /*
     * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
     * successfuly executed on another CPU. We must ensure that updates of
     * per-task data have been completed by this moment.
     */
    smp_wmb();
    task_thread_info(p)->cpu = cpu;
    p->wake_cpu = cpu;
#endif
}

static void update_curr(struct cfs_rq *cfs_rq)
{
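    struct sched_entity *curr = cfs_rq->curr;----------------------------------curr is the currently running scheduling entity; on the fork path this is the parent process.
    u64 now = rq_clock_task(rq_of(cfs_rq));------------------------------------Read the rq->clock_task value saved in the run queue; it is refreshed by update_rq_clock(), for example on every timer tick.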
    u64 delta_exec;

    if (unlikely(!curr))
        return;

    delta_exec = now - curr->exec_start;----------------------------------------delta_exec is the time elapsed since the previous call to update_curr() for this process.
    if (unlikely((s64)delta_exec <= 0))
        return;

    curr->exec_start = now;

    schedstat_set(curr->statistics.exec_max,
              max(delta_exec, curr->statistics.exec_max));

    curr->sum_exec_runtime += delta_exec;---------------------------------------sum_exec_runtime is increased by delta_exec directly.
    schedstat_add(cfs_rq, exec_clock, delta_exec);

    curr->vruntime += calc_delta_fair(delta_exec, curr);------------------------Compute the process's virtual runtime curr->vruntime from delta_exec and curr->load.
    update_min_vruntime(cfs_rq);------------------------------------------------Update cfs_rq->min_vruntime.

    if (entity_is_task(curr)) {-------------------------------------------------If curr->my_q is NULL, the current scheduling entity is a task.
        struct task_struct *curtask = task_of(curr);

        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }

    account_cfs_rq_runtime(cfs_rq, delta_exec);
}
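
The accounting in update_curr() can be reproduced in a few lines of user-space C. This is a minimal sketch, not the kernel implementation: plain integer division stands in for the fixed-point arithmetic inside calc_delta_fair(), and the struct and function names are made up for illustration. It assumes a nice-0 weight of 1024.

```c
#include <stdint.h>

#define SKETCH_NICE_0_LOAD 1024ULL  /* assumed weight of a nice-0 task */

struct sketch_entity {
    uint64_t exec_start;        /* timestamp of the last accounting, ns */
    uint64_t sum_exec_runtime;  /* total real runtime, ns */
    uint64_t vruntime;          /* weighted virtual runtime, ns */
    uint64_t weight;            /* load weight derived from the nice value */
};

/*
 * Account the time since exec_start, following the order of operations
 * in update_curr(): bail out on a non-positive delta, move exec_start
 * forward, add real time, then add weighted virtual time.
 */
static void sketch_update_curr(struct sketch_entity *se, uint64_t now)
{
    int64_t delta_exec = (int64_t)(now - se->exec_start);

    if (delta_exec <= 0)
        return;

    se->exec_start = now;
    se->sum_exec_runtime += (uint64_t)delta_exec;
    se->vruntime += (uint64_t)delta_exec * SKETCH_NICE_0_LOAD / se->weight;
}
```

With a hypothetical weight of 2048, twice the nice-0 weight, 1 ms of real runtime advances vruntime by only 0.5 ms, which is how heavier entities end up with more CPU time.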

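On the fork path, place_entity() is called with cfs_rq set to the CFS run queue of the new process's parent, se set to the new process's scheduling entity, and initial set to 1.

place_entity() takes the total weight of the cfs_rq that se belongs to into account and then updates se->vruntime.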

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;-----------------------------min_vruntime increases monotonically and tracks the smallest vruntime in the red-black tree of the whole CFS run queue.

    if (initial && sched_feat(START_DEBIT))--------------------------If a new process is being forked, its vruntime is penalized, because creating it changes the total weight of the CFS run queue.
        vruntime += sched_vslice(cfs_rq, se);------------------------sched_vslice() returns a virtual time that is added to vruntime as the penalty.

    /* sleeps up to a single latency don't count. */
    if (!initial) {
        unsigned long thresh = sysctl_sched_latency;

        if (sched_feat(GENTLE_FAIR_SLEEPERS))
            thresh >>= 1;

        vruntime -= thresh;
    }
    se->vruntime = max_vruntime(se->vruntime, vruntime);--------------Take the larger of se->vruntime and the adjusted vruntime, to prevent vruntime from going backwards.
}
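
The placement rule is easier to see with concrete numbers. The sketch below is a simplified user-space model of place_entity(), assuming both START_DEBIT and GENTLE_FAIR_SLEEPERS are enabled and that the caller supplies the virtual slice. All names are illustrative, not kernel identifiers.

```c
#include <stdint.h>

#define SKETCH_SCHED_LATENCY_NS 6000000ULL  /* sysctl_sched_latency value from the listing */

/* Wrap-safe maximum of two vruntime values. */
static uint64_t sketch_max_vruntime(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) > 0 ? b : a;
}

/*
 * initial != 0: new task, charged one virtual slice on top of min_vruntime.
 * initial == 0: woken task, credited half the scheduling latency.
 * The result never moves the entity's own vruntime backwards.
 */
static uint64_t sketch_place(uint64_t min_vruntime, uint64_t se_vruntime,
                             int initial, uint64_t vslice)
{
    uint64_t vruntime = min_vruntime;

    if (initial)
        vruntime += vslice;
    else
        vruntime -= SKETCH_SCHED_LATENCY_NS / 2;

    return sketch_max_vruntime(se_vruntime, vruntime);
}
```

A task that slept for a long time is therefore placed 3 ms before min_vruntime at most, while a task that slept only briefly keeps its own, larger vruntime.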

static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    return calc_delta_fair(sched_slice(cfs_rq, se), se);---------------Convert the execution time returned by sched_slice() into virtual time using the weight of se.
}

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
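    u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);-------Compute the length of the run queue's scheduling period from the number of runnable processes.

    for_each_sched_entity(se) {----------------------------------------Walk se and its parent entities up the group hierarchy; without CONFIG_FAIR_GROUP_SCHED the loop runs once.
        struct load_weight *load;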
        struct load_weight lw;

        cfs_rq = cfs_rq_of(se);----------------------------------------Find the cfs_rq that the sched_entity belongs to, and from it cfs_rq->load.
        load = &cfs_rq->load;

        if (unlikely(!se->on_rq)) {
            lw = cfs_rq->load;

            update_load_add(&lw, se->load.weight);
            load = &lw;
        }
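        slice = __calc_delta(slice, se->load.weight, load);------------Compute the share of the scheduling period that this entity gets, in proportion to its weight within the total weight of the CFS run queue.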
    }
    return slice;
}
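
The relationship between sched_slice() and sched_vslice() can be checked with a small calculation. The sketch below assumes a single-level hierarchy and uses plain integer division instead of the kernel's __calc_delta(). The weights 1024 and 3072 used in the note afterwards are chosen only to keep the arithmetic exact and are not taken from the kernel's weight table, apart from 1024 for nice 0.

```c
#include <stdint.h>

#define SKETCH_NICE_0_LOAD 1024ULL  /* assumed weight of a nice-0 task */

/* Real-time share of the period: period * weight / total_weight. */
static uint64_t sketch_slice(uint64_t period_ns, uint64_t weight,
                             uint64_t total_weight)
{
    return period_ns * weight / total_weight;
}

/* Virtual-time equivalent of that share: slice * NICE_0_LOAD / weight. */
static uint64_t sketch_vslice(uint64_t period_ns, uint64_t weight,
                              uint64_t total_weight)
{
    return sketch_slice(period_ns, weight, total_weight)
           * SKETCH_NICE_0_LOAD / weight;
}
```

For a 6 ms period shared by weights 1024 and 3072, the real slices are 1.5 ms and 4.5 ms, yet both virtual slices are 1.5 ms. Equal progress in virtual time is what CFS treats as fair.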

unsigned int sysctl_sched_latency = 6000000ULL;

static unsigned int sched_nr_latency = 8;

unsigned int sysctl_sched_min_granularity = 750000ULL;


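static u64 __sched_period(unsigned long nr_running)------------------Compute the length of one scheduling period of the CFS run queue, based on the number of currently runnable processes.
{
    u64 period = sysctl_sched_latency;---------------------------------The default CFS scheduling period is 6 ms.
    unsigned long nr_latency = sched_nr_latency;-----------------------Threshold for the number of runnable processes.

    if (unlikely(nr_running > nr_latency)) {---------------------------If more than 8 processes are runnable, the period is the minimum granularity of 0.75 ms per process multiplied by the number of processes.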
        period = sysctl_sched_min_granularity;
        period *= nr_running;
    }

    return period;
}
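
A stand-alone version of the same rule makes the threshold behaviour easy to try out. The constants are the defaults shown in the listing above; the names carry a sketch_ prefix to make clear that they are not the kernel symbols.

```c
#include <stdint.h>

static const uint64_t sketch_latency_ns = 6000000ULL;   /* 6 ms */
static const uint64_t sketch_min_gran_ns = 750000ULL;   /* 0.75 ms */
static const unsigned long sketch_nr_latency = 8;

/* Up to 8 runnable tasks share 6 ms; beyond that each gets 0.75 ms. */
static uint64_t sketch_sched_period(unsigned long nr_running)
{
    if (nr_running > sketch_nr_latency)
        return sketch_min_gran_ns * nr_running;
    return sketch_latency_ns;
}
```

Note that 8 * 0.75 ms equals 6 ms, so the period grows continuously at the threshold: 9 runnable tasks give 6.75 ms and 16 give 12 ms.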

2.3.2 wake_up_new_task()

void wake_up_new_task(struct task_struct *p)
{
    unsigned long flags;
    struct rq *rq;

    raw_spin_lock_irqsave(&p->pi_lock, flags);
#ifdef CONFIG_SMP
    set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));------------Select the CPU again: cpus_allowed may have changed during fork, or the previously selected CPU may have gone offline.
#endif

    /* Initialize new task's runnable average */
    init_task_runnable_average(p);
    rq = __task_rq_lock(p);
    activate_task(rq, p, 0);--------------------------------------------------------Eventually calls enqueue_task_fair() to add process p to the CFS run queue.
    p->on_rq = TASK_ON_RQ_QUEUED;
    trace_sched_wakeup_new(p, true);
    check_preempt_curr(rq, p, WF_FORK);---------------------------------------------Check whether the new process can preempt the currently running one.
#ifdef CONFIG_SMP
    if (p->sched_class->task_woken)
        p->sched_class->task_woken(rq, p);
#endif
    task_rq_unlock(rq, p, &flags);
}

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
    if (task_contributes_to_load(p))
        rq->nr_uninterruptible--;

    enqueue_task(rq, p, flags);
}


static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
    update_rq_clock(rq);
    sched_info_queued(rq, p);
    p->sched_class->enqueue_task(rq, p, flags);
}

enqueue_task_fair() puts the new process p onto the CFS run queue of rq.

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;

    for_each_sched_entity(se) {--------------------------Without CONFIG_FAIR_GROUP_SCHED the for loop runs exactly once, because se is the only scheduling entity.
        if (se->on_rq)
            break;
        cfs_rq = cfs_rq_of(se);
        enqueue_entity(cfs_rq, se, flags);---------------Add scheduling entity se to the cfs_rq run queue.

        if (cfs_rq_throttled(cfs_rq))
            break;
        cfs_rq->h_nr_running++;

        flags = ENQUEUE_WAKEUP;
    }

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        cfs_rq->h_nr_running++;

        if (cfs_rq_throttled(cfs_rq))
            break;

        update_cfs_shares(cfs_rq);
        update_entity_load_avg(se, 1);----------------------------------------Update the entity's load load_avg_contrib and the run queue's load runnable_load_avg.
    }

    if (!se) {
        update_rq_runnable_avg(rq, rq->nr_running);
        add_nr_running(rq, 1);
    }
    hrtick_update(rq);
}

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through calling update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);-------------------------------------------------------Update the current process's vruntime and the min_vruntime of this CFS run queue.
    enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);---------------Compute load_avg_contrib of entity se and add it to cfs_rq->runnable_load_avg, the total average load of the whole CFS run queue.
    account_entity_enqueue(cfs_rq, se);
    update_cfs_shares(cfs_rq);

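    if (flags & ENQUEUE_WAKEUP) {-----------------------------------------------Handle a process that has just been woken up.
        place_entity(cfs_rq, se, 0);--------------------------------------------A woken process is compensated by at most half a scheduling period: its vruntime may be placed up to half the scheduling latency before min_vruntime.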
        enqueue_sleeper(cfs_rq, se);
    }

    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);
    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);-------------------------------------------Insert scheduling entity se into the red-black tree of the CFS run queue.
    se->on_rq = 1;--------------------------------------------------------------Mark the scheduling entity as being on the CFS run queue.

    if (cfs_rq->nr_running == 1) {
        list_add_leaf_cfs_rq(cfs_rq);
        check_enqueue_throttle(cfs_rq);
    }
}

check_preempt_curr() checks whether the new process should preempt the current process.

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
    const struct sched_class *class;

    if (p->sched_class == rq->curr->sched_class) {
        rq->curr->sched_class->check_preempt_curr(rq, p, flags);
    } else {
        for_each_class(class) {
            if (class == rq->curr->sched_class)
                break;
            if (class == p->sched_class) {
                resched_curr(rq);
                break;
            }
        }
    }

    /*
     * A queue event has occurred, and we're going to schedule.  In
     * this case, we can save a useless back to back clock update.
     */
    if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
        rq_clock_skip_update(rq, true);
}

static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
    struct task_struct *curr = rq->curr;
    struct sched_entity *se = &curr->se, *pse = &p->se;
    struct cfs_rq *cfs_rq = task_cfs_rq(curr);
    int scale = cfs_rq->nr_running >= sched_nr_latency;
    int next_buddy_marked = 0;

    if (unlikely(se == pse))
        return;

    /*
     * This is possible from callers such as attach_tasks(), in which we
     * unconditionally check_prempt_curr() after an enqueue (which may have
     * lead to a throttle).  This both saves work and prevents false
     * next-buddy nomination below.
     */
    if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
        return;

    if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
        set_next_buddy(pse);
        next_buddy_marked = 1;
    }

    /*
     * We can come here with TIF_NEED_RESCHED already set from new task
     * wake up path.
     *
     * Note: this also catches the edge-case of curr being in a throttled
     * group (e.g. via set_curr_task), since update_curr() (in the
     * enqueue of curr) will have resulted in resched being set.  This
     * prevents us from potentially nominating it as a false LAST_BUDDY
     * below.
     */
    if (test_tsk_need_resched(curr))
        return;

    /* Idle tasks are by definition preempted by non-idle tasks. */
    if (unlikely(curr->policy == SCHED_IDLE) &&
        likely(p->policy != SCHED_IDLE))
        goto preempt;

    /*
     * Batch and idle tasks do not preempt non-idle tasks (their preemption
     * is driven by the tick):
     */
    if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
        return;

    find_matching_se(&se, &pse);
    update_curr(cfs_rq_of(se));
    BUG_ON(!pse);
    if (wakeup_preempt_entity(se, pse) == 1) {
        /*
         * Bias pick_next to pick the sched entity that is
         * triggering this preemption.
         */
        if (!next_buddy_marked)
            set_next_buddy(pse);
        goto preempt;
    }

    return;

preempt:
    resched_curr(rq);
    /*
     * Only set the backward buddy when the current task is still
     * on the rq. This can happen when a wakeup gets interleaved
     * with schedule on the ->pre_schedule() or idle_balance()
     * point, either of which can drop the rq lock.
     *
     * Also, during early boot the idle thread is in the fair class,
     * for obvious reasons its a bad idea to schedule back to it.
     */
    if (unlikely(!se->on_rq || curr == rq->idle))
        return;

    if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
        set_last_buddy(se);
}

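3. Process Scheduling

__schedule() is the core function of the scheduler. Its job is to select a suitable process and switch to it. The call path is as follows: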

__schedule()
  ->pick_next_task()
    ->pick_next_task_fair()
  ->context_switch()
    ->switch_mm()
      ->cpu_v7_switch_mm()
    ->switch_to()
      ->__switch_to

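3.1 When Scheduling Happens

Scheduling is triggered in the following three situations:

1. Blocking operations: mutex, semaphore, waitqueue, and so on.

2. Before returning from an interrupt and when a system call returns to user space, the TIF_NEED_RESCHED flag is checked to decide whether to schedule.

3. A process that is about to be woken does not call schedule() immediately. Instead it is added to the CFS run queue and the TIF_NEED_RESCHED flag is set. When the woken process actually gets scheduled depends on whether the kernel is preemptible (CONFIG_PREEMPT=y), which gives two cases.

(a) If the kernel is preemptible:

If the wakeup happens in system call or exception handling context, the next call to preempt_enable() checks whether a preemptive reschedule is needed.
If the wakeup happens in hard interrupt context, the check for preempting the current process is made just before the interrupt handler returns, regardless of whether the interrupt arrived in kernel space or user space.

(b) If the kernel is not preemptible:

When the current process calls cond_resched(), it checks whether to schedule.
When a process schedules voluntarily by calling schedule().
When a system call or exception handler returns to user space.
When interrupt handling finishes and returns to user space (the check is made only if the interrupt arrived in user space).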

3.2 preempt_schedule()

asmlinkage __visible void __sched notrace preempt_schedule(void)
{
    /*
     * If there is a non-zero preempt_count or interrupts are disabled,
     * we do not want to preempt the current task. Just return..
     */
    if (likely(!preemptible()))
        return;

    preempt_schedule_common();
}

static void __sched notrace preempt_schedule_common(void)
{
    do {
        __preempt_count_add(PREEMPT_ACTIVE);
        __schedule();
        __preempt_count_sub(PREEMPT_ACTIVE);

        /*
         * Check again in case we missed a preemption opportunity
         * between schedule and now.
         */
        barrier();
    } while (need_resched());
}

3.3 __schedule()

__schedule() calls pick_next_task() so that the scheduler selects the most suitable process, next, from the run queue, and then calls context_switch() to switch to next.

static void __sched __schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    rcu_note_context_switch();
    prev = rq->curr;

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    smp_mb__before_spinlock();
    raw_spin_lock_irq(&rq->lock);

    rq->clock_skip_update <<= 1; /* promote REQ to ACT */

    switch_count = &prev->nivcsw;
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {--------------prev is not in the TASK_RUNNING state and this is not a kernel preemption.
        if (unlikely(signal_pending_state(prev->state, prev))) {
            prev->state = TASK_RUNNING;
        } else {
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
            prev->on_rq = 0;

            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;

                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }
        }
        switch_count = &prev->nvcsw;
    }

    if (task_on_rq_queued(prev))
        update_rq_clock(rq);

    next = pick_next_task(rq, prev);---------------------------------------For CFS this ends up in pick_next_task_fair(), which selects a suitable process from run queue rq and returns it as next.
    clear_tsk_need_resched(prev);
    clear_preempt_need_resched();
    rq->clock_skip_update = 0;

    if (likely(prev != next)) {--------------------------------------------If the incoming process next differs from the outgoing process prev, call context_switch() to perform the context switch.
        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;

        rq = context_switch(rq, prev, next); /* unlocks the rq */
        cpu = cpu_of(rq);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    sched_preempt_enable_no_resched();
}

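The following focuses on two parts: pick_next_task(), which selects the process to switch in, and context_switch(), which performs the switch.

3.3.1 pick_next_task()

pick_next_task() is a wrapper around the pick_next_task() method of the scheduler classes. Here it mainly corresponds to pick_next_task_fair() of the CFS class.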

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
    const struct sched_class *class = &fair_sched_class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(prev->sched_class == class &&
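           rq->nr_running == rq->cfs.h_nr_running)) {------------------------If the scheduler class of prev is CFS and the number of processes on the CPU run queue equals the number on the CFS run queue, the CPU run queue holds only normal processes and none from other scheduler classes.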
        p = fair_sched_class.pick_next_task(rq, prev);
        if (unlikely(p == RETRY_TASK))
            goto again;

        /* assumes fair_sched_class->next == idle_sched_class */
        if (unlikely(!p))
            p = idle_sched_class.pick_next_task(rq, prev);

        return p;
    }

again:
    for_each_class(class) {--------------------------------------------------Otherwise all scheduler classes are walked in priority order stop->deadline->realtime->cfs->idle. This also shows the relative priority of the scheduling policies.
        p = class->pick_next_task(rq, prev);
        if (p) {
            if (unlikely(p == RETRY_TASK))
                goto again;
            return p;
        }
    }

    BUG(); /* the idle class will always have a runnable task */
}
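
The slow path of pick_next_task() is a walk along the next pointers shown at the top of this article. The sketch below models that walk in user-space C. It is an illustration under simplifying assumptions: a per-class runnable counter stands in for each class's pick_next_task() method, and the struct and function names are made up.

```c
#include <stddef.h>

struct sketch_class_node {
    const char *name;
    const struct sketch_class_node *next;  /* next lower-priority class, NULL at the end */
    int nr_runnable;                       /* stand-in for the class's run queue */
};

/*
 * Walk the chain from the highest-priority class and return the first
 * class that has something to run, like the for_each_class() loop.
 */
static const struct sketch_class_node *
sketch_pick_class(const struct sketch_class_node *highest)
{
    const struct sketch_class_node *c;

    for (c = highest; c; c = c->next)
        if (c->nr_runnable > 0)
            return c;
    return NULL;  /* not expected: the idle class always has a task */
}
```

Because the walk stops at the first class with a runnable task, a single runnable real-time task always wins over any number of CFS tasks.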

static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev)
{
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;
    struct task_struct *p;
    int new_tasks;

again:
#ifdef CONFIG_FAIR_GROUP_SCHED
...
#endif

    if (!cfs_rq->nr_running)--------------------------------If there is no process on the CFS run queue, go on to pick the idle process.
        goto idle;

    put_prev_task(rq, prev);

    do {
        se = pick_next_entity(cfs_rq, NULL);----------------Select the leftmost entity in the red-black tree of the CFS run queue.
        set_next_entity(cfs_rq, se);
        cfs_rq = group_cfs_rq(se);--------------------------With CONFIG_FAIR_GROUP_SCHED, continue with the run queue owned by the group entity; without it, this returns NULL.
    } while (cfs_rq);

    p = task_of(se);

    if (hrtick_enabled(rq))
        hrtick_start_fair(rq, p);

    return p;

idle:
    new_tasks = idle_balance(rq);
    /*
     * Because idle_balance() releases (and re-acquires) rq->lock, it is
     * possible for any higher priority task to appear. In that case we
     * must re-start the pick_next_entity() loop.
     */
    if (new_tasks < 0)
        return RETRY_TASK;

    if (new_tasks > 0)
        goto again;

    return NULL;
}

Without CONFIG_FAIR_GROUP_SCHED, pick_next_entity() is called with curr set to NULL, which means it takes the cfs_rq->rb_leftmost node first.

set_next_entity() points cfs_rq->curr at se and updates the exec_start and prev_sum_exec_runtime of se.

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    struct sched_entity *left = __pick_first_entity(cfs_rq);
    struct sched_entity *se;

    /*
     * If curr is set we have to see if its left of the leftmost entity
     * still in the tree, provided there was anything in the tree at all.
     */
    if (!left || (curr && entity_before(curr, left)))-----------------If left does not exist, left becomes curr; if left exists, curr is not NULL, and curr's vruntime is smaller than left's, left also becomes curr.
        left = curr;

    se = left; /* ideally we run the leftmost entity */---------------When curr is NULL, se is the leftmost node of cfs_rq.
...
    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)----If cfs_rq->last exists and left is not entitled to preempt it (wakeup_preempt_entity() returns less than 1), se becomes cfs_rq->last.
        se = cfs_rq->last;

    /*
     * Someone really wants this to run. If it's not unfair, run it.
     */
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)----Same as for cfs_rq->last: if cfs_rq->next exists and left is not entitled to preempt it, se becomes cfs_rq->next.
        se = cfs_rq->next;

    clear_buddies(cfs_rq, se);

    return se;
}
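
wakeup_preempt_entity() is not listed in this article, so the buddy checks above are easier to follow with a model of its three-way result. The sketch below is an assumption based on how the function is used here: it returns -1 when the first entity's vruntime is not ahead of the second's, 0 when it is ahead by at most a granularity, and 1 when it is ahead by more than that. The granularity is passed in as a plain parameter, and the part of pick_next_entity() elided with "..." above is ignored. All names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* -1: curr is not ahead of se; 0: ahead by at most gran; 1: ahead by more. */
static int sketch_wakeup_preempt(uint64_t curr_vruntime, uint64_t se_vruntime,
                                 int64_t gran)
{
    int64_t vdiff = (int64_t)(curr_vruntime - se_vruntime);

    if (vdiff <= 0)
        return -1;
    if (vdiff > gran)
        return 1;
    return 0;
}

enum sketch_choice { SKETCH_LEFT, SKETCH_LAST, SKETCH_NEXT };

/*
 * Buddy preference: a buddy replaces the leftmost entity unless the
 * leftmost would be entitled to preempt it. The next buddy is checked
 * after the last buddy, so it wins when both qualify.
 */
static enum sketch_choice
sketch_pick_entity(uint64_t left, const uint64_t *last, const uint64_t *next,
                   int64_t gran)
{
    enum sketch_choice pick = SKETCH_LEFT;

    if (last && sketch_wakeup_preempt(*last, left, gran) < 1)
        pick = SKETCH_LAST;
    if (next && sketch_wakeup_preempt(*next, left, gran) < 1)
        pick = SKETCH_NEXT;
    return pick;
}
```

The granularity is what keeps this fair: a buddy whose vruntime is too far ahead of the leftmost entity is not chosen.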

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /* 'current' is not kept within the tree. */
    if (se->on_rq) {-----------------------------------------------------If the entity is on the run queue, remove it from the tree.
        /*
         * Any task has to be enqueued before it get to execute on
         * a CPU. So account for the time it spent waiting on the
         * runqueue.
         */
        update_stats_wait_end(cfs_rq, se);
        __dequeue_entity(cfs_rq, se);
    }

    update_stats_curr_start(cfs_rq, se);
    cfs_rq->curr = se;
#ifdef CONFIG_SCHEDSTATS
    /*
     * Track our maximum slice length, if the CPU's load is at
     * least twice that of our own weight (i.e. dont track it
     * when there are only lesser-weight tasks around):
     */
    if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
        se->statistics.slice_max = max(se->statistics.slice_max,
            se->sum_exec_runtime - se->prev_sum_exec_runtime);
    }
#endif
    se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

3.3.2 context_switch()

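context_switch() takes three parameters: rq is the run queue on which the switch takes place, prev is the process being switched out, and next is the process being switched in.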

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
           struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    prepare_task_switch(rq, prev, next);-----------Paired with finish_task_switch(); sets next->on_cpu to 1.

    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);

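    if (!mm) {-------------------------------------A kernel thread has no process address space.
        next->active_mm = oldmm;-------------------Scheduling still needs an address space, so the kernel thread borrows one; this is why the active_mm member exists. Why not use prev->mm? Because prev may itself be a kernel thread.
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);----------------For a normal process, switch_mm() is called to handle the address space switch.

    if (!prev->mm) {-------------------------------If prev is a kernel thread, prev->active_mm is set to NULL and rq->prev_mm records the mm that prev had borrowed.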
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }
    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);

    context_tracking_task_switch(prev, next);
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);-------------------Switch from prev to next. When this completes, the CPU runs next and prev has been scheduled out, informally "put to sleep".
    barrier();

    return finish_task_switch(prev);---------------Cleanup after the switch: prev->on_cpu is set to 0 and oldmm->mm_count is decremented. The next process cleans up after prev.
}

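switch_mm() and switch_to() are both closely tied to the architecture.

switch_mm() loads the base address of the new process's page table into the page table base register.

switch_mm() first sets the current CPU in the cpumask bitmap of the next process, then calls check_and_switch_context() to perform the ARM-specific hardware setup, such as flushing the TLB.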

/*
 * This is the actual mm switch as far as the scheduler
 * is concerned.  No registers are touched.  We avoid
 * calling the CPU specific function when the mm hasn't
 * actually changed.
 */
static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
      struct task_struct *tsk)
{
#ifdef CONFIG_MMU
    unsigned int cpu = smp_processor_id();

    /*
     * __sync_icache_dcache doesn't broadcast the I-cache invalidation,
     * so check for possible thread migration and invalidate the I-cache
     * if we're new to this CPU.
     */
    if (cache_ops_need_broadcast() &&
        !cpumask_empty(mm_cpumask(next)) &&
        !cpumask_test_cpu(cpu, mm_cpumask(next)))
        __flush_icache_all();

    if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)) || prev != next) {
        check_and_switch_context(next, tsk);
        if (cache_is_vivt())
            cpumask_clear_cpu(cpu, mm_cpumask(prev));
    }
#endif
}

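switch_to() ultimately calls the assembly function __switch_to().

__switch_to() takes three arguments: r0 is the task_struct of the outgoing process (prev), r1 is the thread_info of the outgoing process (task_thread_info(prev)), and r2 is the thread_info of the incoming process (task_thread_info(next)).

It saves the register context of prev into that process's thread_info->cpu_context structure, then loads the values from the thread_info->cpu_context structure of next into the physical CPU registers, which switches the process stack.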

#define switch_to(prev,next,last)                    \
do {                                    \
    last = __switch_to(prev,task_thread_info(prev), task_thread_info(next));    \
} while (0)

/*
 * Register switch for ARMv3 and ARMv4 processors
 * r0 = previous task_struct, r1 = previous thread_info, r2 = next thread_info
 * previous and next are guaranteed not to be the same.
 */
ENTRY(__switch_to)
 UNWIND(.fnstart    )
 UNWIND(.cantunwind    )
    add    ip, r1, #TI_CPU_SAVE
 ARM(    stmia    ip!, {r4 - sl, fp, sp, lr} )    @ Store most regs on stack
 THUMB(    stmia    ip!, {r4 - sl, fp}       )    @ Store most regs on stack
 THUMB(    str    sp, [ip], #4           )
 THUMB(    str    lr, [ip], #4           )
    ldr    r4, [r2, #TI_TP_VALUE]
    ldr    r5, [r2, #TI_TP_VALUE + 4]
#ifdef CONFIG_CPU_USE_DOMAINS
    ldr    r6, [r2, #TI_CPU_DOMAIN]
#endif
    switch_tls r1, r4, r5, r3, r7
#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP)
    ldr    r7, [r2, #TI_TASK]
    ldr    r8, =__stack_chk_guard
    ldr    r7, [r7, #TSK_STACK_CANARY]
#endif
#ifdef CONFIG_CPU_USE_DOMAINS
    mcr    p15, 0, r6, c3, c0, 0        @ Set domain register
#endif
    mov    r5, r0
    add    r4, r2, #TI_CPU_SAVE
    ldr    r0, =thread_notify_head
    mov    r1, #THREAD_NOTIFY_SWITCH
    bl    atomic_notifier_call_chain
#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP)
    str    r7, [r8]
#endif
 THUMB(    mov    ip, r4               )
    mov    r0, r5
 ARM(    ldmia    r4, {r4 - sl, fp, sp, pc}  )    @ Load all regs saved previously
 THUMB(    ldmia    ip!, {r4 - sl, fp}       )    @ Load all regs saved previously
 THUMB(    ldr    sp, [ip], #4           )
 THUMB(    ldr    pc, [ip]           )
 UNWIND(.fnend        )
ENDPROC(__switch_to)

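3.4 Red-Black Tree Operations on sched_entity

CFS uses a red-black tree to manage scheduling entities. The key of the tree is sched_entity->vruntime.

__enqueue_entity() inserts scheduling entity se into the cfs_rq run queue, specifically into the red-black tree at cfs_rq->tasks_timeline.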

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;----------------------Start from the root node of the cfs_rq->tasks_timeline tree; note that this is not necessarily the leftmost node.
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) {---------------------------------------------------------------Starting from the root, descend the cfs_rq red-black tree until an empty slot for the insertion is found.
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);------------------Get the scheduling entity that corresponds to parent.
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (entity_before(se, entry)) {-------------------------------------------True if se->vruntime < entry->vruntime; the insertion point moves to the left child of entry.
            link = &parent->rb_left;
        } else {------------------------------------------------------------------Otherwise the insertion point moves to the right child of entry and leftmost becomes 0.
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    if (leftmost)----------------------------------------------------------------If the newly inserted node is the leftmost node, cfs_rq->rb_leftmost must be updated.
        cfs_rq->rb_leftmost = &se->run_node;

    rb_link_node(&se->run_node, parent, link);-----------------------------------Point link at se->run_node.
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);---------------------After se->run_node is inserted, rebalance the tree.
}
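
The leftmost cache is the part of __enqueue_entity() that is independent of the red-black tree itself. The sketch below shows the same descent on a plain binary search tree without rebalancing, so it is a simplification and not a red-black tree. The node and tree types are hypothetical stand-ins for rb_node and cfs_rq.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_node {
    uint64_t key;                       /* plays the role of vruntime */
    struct sketch_node *left, *right;
};

struct sketch_tree {
    struct sketch_node *root;
    struct sketch_node *leftmost;       /* cached smallest-key node */
};

/*
 * Insert without rebalancing. Equal keys go to the right, as in
 * __enqueue_entity(). The cache is updated only if the descent never
 * turned right, which means the new node is the new minimum.
 */
static void sketch_insert(struct sketch_tree *t, struct sketch_node *n)
{
    struct sketch_node **link = &t->root;
    int leftmost = 1;

    n->left = n->right = NULL;
    while (*link) {
        struct sketch_node *parent = *link;

        if ((int64_t)(n->key - parent->key) < 0) {
            link = &parent->left;
        } else {
            link = &parent->right;
            leftmost = 0;
        }
    }
    if (leftmost)
        t->leftmost = n;
    *link = n;
}
```

Thanks to the cache, picking the next entity is a pointer read instead of a walk down the left edge of the tree.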

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {---------------------------------If the node being removed is cfs_rq->rb_leftmost, update cfs_rq->rb_leftmost first and then erase the node.
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);---------------------------Remove node se->run_node from cfs_rq->tasks_timeline.
}

struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)-----------------Get the scheduling entity that corresponds to cfs_rq->rb_leftmost.
{
    struct rb_node *left = cfs_rq->rb_leftmost;

    if (!left)
        return NULL;

    return rb_entry(left, struct sched_entity, run_node);
}

static struct sched_entity *__pick_next_entity(struct sched_entity *se)----------Get the scheduling entity immediately to the right of the given one.
{
    struct rb_node *next = rb_next(&se->run_node);

    if (!next)
        return NULL;

    return rb_entry(next, struct sched_entity, run_node);
}

struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)------------------Get the rightmost scheduling entity of cfs_rq.
{
    struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);--------------------rb_last() keeps following right children in cfs_rq->tasks_timeline until it reaches the last one.

    if (!last)
        return NULL;

    return rb_entry(last, struct sched_entity, run_node);
}

static inline int entity_before(struct sched_entity *a, struct sched_entity *b)
{
    return (s64)(a->vruntime - b->vruntime) < 0;----------------------------Compare a->vruntime and b->vruntime; returns true if a is before b.
}
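
The cast to a signed type in entity_before() is what keeps the comparison correct when the 64-bit vruntime counter wraps around. A minimal user-space sketch, using uint64_t and int64_t in place of the kernel's u64 and s64 (the function name is illustrative):

```c
#include <stdint.h>

/* Wrap-safe "a is before b": compare the signed difference with zero. */
static int sketch_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}
```

For a value just below the wrap point and a value just past it, a direct `a < b` comparison gives the wrong order, while the signed difference still reports that the first value comes before the second.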

4. schedule tick

Clock event devices are either periodic or one-shot, and are registered through clockevents_register_device().

Broadcast and non-broadcast clocks differ in how the device's clock_event_device->cpumask is set.

clockevents_register_device()
  ->tick_check_new_device()
    ->tick_setup_device()
      ->tick_setup_periodic()-----------------------------If tick_device->mode is TICKDEV_MODE_PERIODIC, the device is registered as a periodic clock.
        ->tick_set_periodic_handler()
          ->tick_handle_periodic()------------------------Periodic clock
          ->tick_handle_periodic_broadcast()--------------Periodic broadcast clock
      ->tick_setup_oneshot()------------------------------If tick_device->mode is TICKDEV_MODE_ONESHOT, the device is a one-shot clock.

tick_set_periodic_handler() sets the event_handler of struct clock_event_device to tick_handle_periodic().

The above covers clock registration. The clock is driven by interrupts, and the interrupt handler calls clock_event_device->event_handler().

For a periodic clock the call path is tick_handle_periodic()-->tick_periodic()-->update_process_times()-->scheduler_tick().

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;

    sched_clock_tick();

    raw_spin_lock(&rq->lock);
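    update_rq_clock(rq);--------------------------Update the clock counters clock and clock_task in the current CPU's run queue rq.
    curr->sched_class->task_tick(rq, curr, 0);----The task_tick method of the scheduler class; for CFS this is task_tick_fair(), which handles scheduler work when a timer tick arrives.
    update_cpu_load_active(rq);-------------------Update cpu_load[] in the run queue.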
    raw_spin_unlock(&rq->lock);

    perf_event_task_tick();

#ifdef CONFIG_SMP
    rq->idle_balance = idle_cpu(cpu);
    trigger_load_balance(rq);
#endif
    rq_last_tick_reset(rq);
}

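task_tick_fair() is the CFS implementation of task_tick(). It first calls entity_tick() to check whether a reschedule is needed, then calls update_rq_runnable_avg() to update the statistics of the runqueue.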

/*
 * scheduler tick hitting a task of our scheduling class:
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {
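        /* Find the cfs_rq of se. Without group scheduling this goes from the sched_entity to its task_struct, then to the runqueue, then to its cfs_rq. */
        cfs_rq = cfs_rq_of(se);
        /* Besides updating statistics of se and cfs_rq, this calls check_preempt_tick() to check whether a reschedule is needed. */
        entity_tick(cfs_rq, se, queued);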
    }

    if (numabalancing_enabled)
        task_tick_numa(rq, curr);

    update_rq_runnable_avg(rq, 1);
}

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);    /* Update vruntime, exec_start, etc. of the current task and min_vruntime of cfs_rq. */

    /*
     * Ensure that runnable average is periodically updated.
     */
    update_entity_load_avg(curr, 1);    /* Update curr's sched_avg fields such as load_avg_contrib. */
    update_cfs_rq_blocked_load(cfs_rq, 1);
    update_cfs_shares(cfs_rq);
...
    if (cfs_rq->nr_running > 1)
        check_preempt_tick(cfs_rq, curr);    /* If nr_running on this cfs_rq is greater than 1, check whether the current task should give up the CPU. */
}

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;

    /* Real run time the task is entitled to in one scheduling period according to its weight; sched_vslice() returns the virtual-time counterpart. */
    ideal_runtime = sched_slice(cfs_rq, curr);
    /* delta_exec is the real time the task has already run. */
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    /* If the real run time exceeds the ideal allocation, the task must be scheduled out: set TIF_NEED_RESCHED in its thread_info. */
    if (delta_exec > ideal_runtime) {
        resched_curr(rq_of(cfs_rq));
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        clear_buddies(cfs_rq, curr);
        return;
    }
    /* If the task has run for less than sysctl_sched_min_granularity (0.75ms), no reschedule is needed either. */
    if (delta_exec < sysctl_sched_min_granularity)
        return;

    se = __pick_first_entity(cfs_rq);    /* Pick the leftmost scheduling entity of this cfs_rq. */
    delta = curr->vruntime - se->vruntime;

    /* If curr->vruntime is smaller than the leftmost entity's vruntime, no reschedule is needed either. */
    if (delta < 0)
        return;

    /* Why compare these two? delta is a virtual-time difference, while ideal_runtime is a real-time duration. */
    if (delta > ideal_runtime)
        resched_curr(rq_of(cfs_rq));
}
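The checks in check_preempt_tick() can be read as a pure decision function. The following is a minimal user-space sketch, not kernel code: should_resched() and MIN_GRANULARITY_NS are names made up for illustration, and the parameters are plain numbers standing in for sched_slice(), the run time since the task was picked, and the two vruntime values.

```c
#include <stdint.h>

/* Hypothetical constant: the 0.75 ms default quoted in the text, in ns. */
#define MIN_GRANULARITY_NS 750000ULL

/* Returns 1 if the current task should be rescheduled, 0 otherwise.
 *   ideal_runtime  - slice the current task is entitled to (ns)
 *   delta_exec     - time it has run since it was last picked (ns)
 *   curr_vruntime  - vruntime of the current task
 *   left_vruntime  - vruntime of the leftmost entity in the tree */
static int should_resched(uint64_t ideal_runtime, uint64_t delta_exec,
                          uint64_t curr_vruntime, uint64_t left_vruntime)
{
    int64_t delta;

    if (delta_exec > ideal_runtime)
        return 1;               /* the slice is used up */
    if (delta_exec < MIN_GRANULARITY_NS)
        return 0;               /* ran too briefly to be preempted */

    delta = (int64_t)(curr_vruntime - left_vruntime);
    if (delta < 0)
        return 0;               /* current is still behind the leftmost */

    /* Preempt only when current is ahead of the leftmost by more than a slice. */
    return (uint64_t)delta > ideal_runtime;
}
```

For example, with a 3 ms slice, a task that has run 1 ms and is 4 ms of vruntime ahead of the leftmost entity is rescheduled, while the same task only 1 ms ahead is left running.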

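5. Group scheduling

CFS schedules at the granularity of a task, but in some scenarios it is desirable to schedule at the granularity of a group.

Groups are treated fairly relative to each other, and the scheduling entities inside a group are in turn treated fairly. Group scheduling exists to meet this need.

CFS abstracts group scheduling with the data structure struct task_group.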

/* task group related information */
struct task_group {
    struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* schedulable entities of this group on each cpu */
    struct sched_entity **se;
    /* runqueue "owned" by this group on each cpu */
    struct cfs_rq **cfs_rq;
    unsigned long shares;

#ifdef    CONFIG_SMP
    atomic_long_t load_avg;
    atomic_t runnable_avg;
#endif
#endif

#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity **rt_se;
    struct rt_rq **rt_rq;

    struct rt_bandwidth rt_bandwidth;
#endif

    struct rcu_head rcu;
    struct list_head list;

    struct task_group *parent;
    struct list_head siblings;
    struct list_head children;

#ifdef CONFIG_SCHED_AUTOGROUP
    struct autogroup *autogroup;
#endif

    struct cfs_bandwidth cfs_bandwidth;
};

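5.1 Creating a task group

Group scheduling belongs to the cpu subsystem of the cgroup framework; CONFIG_CGROUP_SCHED and CONFIG_FAIR_GROUP_SCHED must be enabled in the kernel configuration.

The interface for creating a task group is sched_create_group().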

/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(struct task_group *parent)    /* parent is the task group one level up; the root of all groups is root_task_group. */
{
    struct task_group *tg;

    tg = kzalloc(sizeof(*tg), GFP_KERNEL);    /* Allocate the task_group structure. */
    if (!tg)
        return ERR_PTR(-ENOMEM);

    if (!alloc_fair_sched_group(tg, parent))    /* Create the group data structures needed by CFS. */
        goto err;

    if (!alloc_rt_sched_group(tg, parent))    /* Create the group data structures needed by the RT scheduler. */
        goto err;

    return tg;

err:
    free_sched_group(tg);
    return ERR_PTR(-ENOMEM);
}

alloc_fair_sched_group() creates the group scheduling data structures needed by CFS.

int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se;
    int i;

    /* Allocate an array of nr_cpu_ids pointers; sizeof(cfs_rq) is the size of a pointer, not of struct cfs_rq. */
    tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL);
    if (!tg->cfs_rq)
        goto err;
    /* Likewise an array of nr_cpu_ids pointers; sizeof(se) is not sizeof(struct sched_entity). */
    tg->se = kzalloc(sizeof(se) * nr_cpu_ids, GFP_KERNEL);
    if (!tg->se)
        goto err;

    tg->shares = NICE_0_LOAD;    /* The group's weight is initialized to the weight of nice 0. */

    init_cfs_bandwidth(tg_cfs_bandwidth(tg));

    /* Walk every possible CPU and allocate one struct cfs_rq runqueue and one struct sched_entity for each. */
    for_each_possible_cpu(i) {
        /* The earlier allocations were pointer arrays; here the actual structures are allocated per CPU. */
        cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
                      GFP_KERNEL, cpu_to_node(i));
        if (!cfs_rq)
            goto err;

        se = kzalloc_node(sizeof(struct sched_entity),
                  GFP_KERNEL, cpu_to_node(i));
        if (!se)
            goto err_free_rq;

        init_cfs_rq(cfs_rq);    /* Initialize tasks_timeline, min_vruntime, etc. of the cfs_rq. */
        init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);    /* Key function that builds the group scheduling structure. */
    }

    return 1;

err_free_rq:
    kfree(cfs_rq);
err:
    return 0;
}
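The two-stage allocation in alloc_fair_sched_group(), first an array of pointers and then one structure per CPU, can be sketched in user space as follows. This is only an illustration under stated assumptions: struct cfs_rq_model, struct group_model, group_alloc() and group_free() are hypothetical names, calloc() replaces kzalloc(), and unlike the function above the sketch frees everything itself when an allocation fails.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct cfs_rq; the padding only makes it
 * visibly larger than a pointer. */
struct cfs_rq_model { unsigned long nr_running; char pad[256]; };

struct group_model {
    struct cfs_rq_model **cfs_rq;   /* array of per-CPU pointers */
    int nr_cpus;
};

/* Release everything allocated so far; safe on a partly built group. */
static void group_free(struct group_model *g)
{
    int i;

    if (g->cfs_rq) {
        for (i = 0; i < g->nr_cpus; i++)
            free(g->cfs_rq[i]);     /* free(NULL) is a no-op */
        free(g->cfs_rq);
    }
    g->cfs_rq = NULL;
}

/* Returns 1 on success and 0 on failure, like alloc_fair_sched_group(). */
static int group_alloc(struct group_model *g, int nr_cpus)
{
    struct cfs_rq_model *cfs_rq;
    int i;

    g->nr_cpus = nr_cpus;
    /* Stage 1: sizeof(cfs_rq) is the size of one pointer. */
    g->cfs_rq = calloc((size_t)nr_cpus, sizeof(cfs_rq));
    if (!g->cfs_rq)
        return 0;

    /* Stage 2: one full structure per CPU, stored in the pointer array. */
    for (i = 0; i < nr_cpus; i++) {
        cfs_rq = calloc(1, sizeof(struct cfs_rq_model));
        if (!cfs_rq) {
            group_free(g);
            return 0;
        }
        g->cfs_rq[i] = cfs_rq;
    }
    return 1;
}
```

The point of the sketch is the difference between sizeof(cfs_rq), which sizes the pointer array, and sizeof(struct cfs_rq_model), which sizes each per-CPU object.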

init_cfs_rq() initializes the tasks_timeline red-black tree and min_vruntime of a cfs_rq.

init_tg_cfs_entry() is the key function for building the group scheduling structure; it links tg with cfs_rq.

void init_cfs_rq(struct cfs_rq *cfs_rq)
{
    cfs_rq->tasks_timeline = RB_ROOT;
    cfs_rq->min_vruntime = (u64)(-(1LL << 20));
#ifndef CONFIG_64BIT
    cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
#ifdef CONFIG_SMP
    atomic64_set(&cfs_rq->decay_counter, 1);
    atomic_long_set(&cfs_rq->removed_load, 0);
#endif
}

void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
            struct sched_entity *se, int cpu,
            struct sched_entity *parent)
{
    struct rq *rq = cpu_rq(cpu);

    cfs_rq->tg = tg;
    cfs_rq->rq = rq;
    init_cfs_rq_runtime(cfs_rq);

    tg->cfs_rq[cpu] = cfs_rq;    /* Link the pointer arrays allocated in alloc_fair_sched_group() to the actual structures. */
    tg->se[cpu] = se;

    /* se could be NULL for root_task_group */
    if (!se)
        return;

    if (!parent) {
        se->cfs_rq = &rq->cfs;
        se->depth = 0;
    } else {
        se->cfs_rq = parent->my_q;
        se->depth = parent->depth + 1;
    }

    se->my_q = cfs_rq;    /* my_q exists only for group scheduling entities. */
    /* guarantee group entities always have weight */
    update_load_set(&se->load, NICE_0_LOAD);
    se->parent = parent;
}
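The linking rules in init_tg_cfs_entry() decide which queue a group entity sits on and how deep it is in the hierarchy. A minimal user-space sketch for a single CPU follows; struct entity_model, struct rq_model and link_group_entity() are hypothetical names, and the weight and bandwidth setup is left out.

```c
#include <stddef.h>

struct rq_model { int id; };        /* hypothetical stand-in for a cfs_rq */

struct entity_model {
    struct rq_model *cfs_rq;        /* queue this entity is queued on */
    struct rq_model *my_q;          /* queue owned by a group entity */
    struct entity_model *parent;
    int depth;
};

/* Mirrors the linking rules of init_tg_cfs_entry() for one CPU:
 * a top-level group entity sits on the root queue at depth 0, a nested
 * one sits on its parent's own queue, one level deeper. */
static void link_group_entity(struct entity_model *se, struct rq_model *own_q,
                              struct entity_model *parent,
                              struct rq_model *root_q)
{
    if (!parent) {
        se->cfs_rq = root_q;
        se->depth = 0;
    } else {
        se->cfs_rq = parent->my_q;
        se->depth = parent->depth + 1;
    }
    se->my_q = own_q;
    se->parent = parent;
}
```

Linking a group A with no parent and then a group B under A places A on the root queue at depth 0 and B on A's own queue at depth 1.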

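5.1.1 Relationship between task_group, cfs_rq, sched_entity and task_struct on a dual-core system

5.2 Adding a task to a task group

A task is added to a task group through cpu_cgroup_attach(), an interface function of cpu_cgrp_subsys.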

struct cgroup_subsys cpu_cgrp_subsys = {
...
    .attach        = cpu_cgroup_attach,
    .exit        = cpu_cgroup_exit,
    .legacy_cftypes    = cpu_files,
    .early_init    = 1,
};

static void cpu_cgroup_attach(struct cgroup_subsys_state *css,
                  struct cgroup_taskset *tset)
{
    struct task_struct *task;

    cgroup_taskset_for_each(task, tset)    /* Walk the list of tasks contained in tset. */
        sched_move_task(task);             /* Move the task into the task group. */
}

void sched_move_task(struct task_struct *tsk)
{
    struct task_group *tg;
    int queued, running;
    unsigned long flags;
    struct rq *rq;

    rq = task_rq_lock(tsk, &flags);

    running = task_current(rq, tsk);    /* Is tsk currently running? */
    /* Is tsk on the runqueue? tsk->on_rq == TASK_ON_RQ_QUEUED means it is. */
    queued = task_on_rq_queued(tsk);

    if (queued)
        dequeue_task(rq, tsk, 0);    /* If the task is on the runqueue, take it off for now. */
    /* If the task is running, it was just removed by dequeue_task() above; put_prev_task() is called here and the task is added back to the runqueue further down. */
    if (unlikely(running))
        put_prev_task(rq, tsk);

    /*
     * All callers are synchronized by task_rq_lock(); we do not use RCU
     * which is pointless here. Thus, we pass "true" to task_css_check()
     * to prevent lockdep warnings.
     */
    tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
              struct task_group, css);
    tg = autogroup_task_group(tsk, tg);
    tsk->sched_task_group = tg;

#ifdef CONFIG_FAIR_GROUP_SCHED
    if (tsk->sched_class->task_move_group)
        tsk->sched_class->task_move_group(tsk, queued);
    else
#endif
        set_task_rq(tsk, task_cpu(tsk));    /* Link the cfs_rq and parent of tsk's scheduling entity to the group's cfs_rq and se for the current CPU. */

    if (unlikely(running))
        tsk->sched_class->set_curr_task(rq);
    if (queued)
        enqueue_task(rq, tsk, 0);

    task_rq_unlock(rq, tsk, &flags);
}

static void task_move_group_fair(struct task_struct *p, int queued)
{
    struct sched_entity *se = &p->se;
    struct cfs_rq *cfs_rq;

    if (!queued && (!se->sum_exec_runtime || p->state == TASK_WAKING))
        queued = 1;

    if (!queued)
        se->vruntime -= cfs_rq_of(se)->min_vruntime;
    set_task_rq(p, task_cpu(p));
    se->depth = se->parent ? se->parent->depth + 1 : 0;
    if (!queued) {
        cfs_rq = cfs_rq_of(se);
        se->vruntime += cfs_rq->min_vruntime;
#ifdef CONFIG_SMP
        se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
        cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
#endif
    }
}
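For a task that is not queued, task_move_group_fair() first makes vruntime relative to the old queue's min_vruntime and then rebases it onto the new queue's min_vruntime. The arithmetic can be sketched as below; renormalize_vruntime() is a hypothetical helper that collapses the two steps, which in the function above happen before and after set_task_rq().

```c
#include <stdint.h>

/* Hypothetical helper: carry a task's vruntime from one queue to another
 * by preserving its offset from each queue's min_vruntime. Unsigned
 * arithmetic makes the result correct even when the offset is negative. */
static uint64_t renormalize_vruntime(uint64_t vruntime,
                                     uint64_t old_min_vruntime,
                                     uint64_t new_min_vruntime)
{
    vruntime -= old_min_vruntime;   /* relative to the old queue */
    vruntime += new_min_vruntime;   /* rebased onto the new queue */
    return vruntime;
}
```

A task that was 500 units ahead of the old queue's minimum ends up 500 units ahead of the new queue's minimum, so it neither gains nor loses priority by moving between groups.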

static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
    struct task_group *tg = task_group(p);    /* Get the task_group the task belongs to. */
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    p->se.cfs_rq = tg->cfs_rq[cpu];    /* Set the scheduling entity's cfs_rq and parent. */
    p->se.parent = tg->se[cpu];
#endif
...
}

static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
    update_rq_clock(rq);
    sched_info_queued(rq, p);
    p->sched_class->enqueue_task(rq, p, flags);
}

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;

    /* With CONFIG_FAIR_GROUP_SCHED enabled, walk the task's scheduling entity and the entities above it: the first iteration handles p->se, the second handles the group entity tg->se[]. */
    for_each_sched_entity(se) {
        if (se->on_rq)
            break;
        cfs_rq = cfs_rq_of(se);
        enqueue_entity(cfs_rq, se, flags);

        /*
         * end evaluation on encountering a throttled cfs_rq
         *
         * note: in the case of encountering a throttled cfs_rq we will
         * post the final h_nr_running increment below.
        */
        if (cfs_rq_throttled(cfs_rq))
            break;
        cfs_rq->h_nr_running++;

        flags = ENQUEUE_WAKEUP;
    }

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        cfs_rq->h_nr_running++;

        if (cfs_rq_throttled(cfs_rq))
            break;

        update_cfs_shares(cfs_rq);
        update_entity_load_avg(se, 1);
    }

    if (!se) {
        update_rq_runnable_avg(rq, rq->nr_running);
        add_nr_running(rq, 1);
    }
    hrtick_update(rq);
}
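The first loop of enqueue_task_fair() walks from the task's entity up through its parents and stops at the first ancestor that is already queued. A minimal sketch of just that walk, with throttling and load tracking left out; struct se_model and enqueue_up() are hypothetical names.

```c
#include <stddef.h>

struct se_model {
    struct se_model *parent;   /* NULL at the top of the hierarchy */
    int on_rq;                 /* 1 if already on its parent's queue */
};

/* Enqueue se and every ancestor that is not yet queued. Returns how many
 * entities were newly enqueued. */
static int enqueue_up(struct se_model *se)
{
    int enqueued = 0;

    for (; se; se = se->parent) {
        if (se->on_rq)
            break;             /* ancestors above are already queued */
        se->on_rq = 1;
        enqueued++;
    }
    return enqueued;
}
```

The first task that becomes runnable in an idle group enqueues both itself and the group entity; a second task in the same group only enqueues itself, because the group entity is already on the parent queue.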

static void set_curr_task_fair(struct rq *rq)
{
    struct sched_entity *se = &rq->curr->se;

    for_each_sched_entity(se) {
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        set_next_entity(cfs_rq, se);
        /* ensure bandwidth has been allocated on our new cfs_rq */
        account_cfs_rq_runtime(cfs_rq, 0);
    }
}

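The basic strategy of group scheduling is as follows:

When a task group tg is created, a cfs_rq runqueue for the group's internal use is created for every CPU.
The task group joins the system's CFS runqueue rq->cfs as a single scheduling entity.
Once a task joins a group, it leaves the system's CFS runqueue and joins the group's CFS runqueue tg->cfs_rq[].
When the next task is picked, selection starts from the system's CFS runqueue; if the chosen scheduling entity is a task group tg, the search continues into tg's runqueue and picks a task from there to run.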
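The last point, descending from the system runqueue into a group's own queue, can be sketched as a short loop. This is a simplified model rather than the kernel's pick path: struct queue_model, struct sched_entity_model and pick_next_model() are hypothetical, and a single leftmost pointer stands in for the red-black tree.

```c
#include <stddef.h>

struct queue_model;

struct sched_entity_model {
    const char *name;
    struct queue_model *my_q;   /* non-NULL only for a group entity */
};

struct queue_model {
    struct sched_entity_model *leftmost;   /* stands in for the tree's leftmost node */
};

/* Start at the root queue and keep descending while the chosen entity is
 * a group. Assumes a queued group entity always owns a non-empty queue. */
static struct sched_entity_model *pick_next_model(struct queue_model *root)
{
    struct queue_model *q = root;
    struct sched_entity_model *se = NULL;

    while (q && q->leftmost) {
        se = q->leftmost;
        q = se->my_q;           /* NULL for a task, which ends the loop */
    }
    return se;
}
```

If the root queue's leftmost entity is a group whose own queue holds one task, the function returns that task rather than the group entity.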

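5.3 Group scheduling experiments

6. Improvements to the PELT algorithm

The PELT (Per-Entity Load Tracking) algorithm has an important variable, runnable_load_avg, which describes the average load of a runqueue computed from the total decayed accumulated time in the runnable state (runnable time) and the weight.

In Linux 4.0, only one scheduling entity's load is updated at a time; the load changes of all scheduling entities on the cfs_rq are not updated together.

Linux 4.3 optimized this: every update of the average load now updates the average load of the whole cfs_rq.

struct cfs_rq gained a struct sched_avg member, and struct sched_avg itself was changed as well.

The former load_avg_contrib became load_avg, which is the average load of a scheduling entity based on its runnable time, and it also takes CPU frequency into account.

util_avg is the average load of a scheduling entity based on its execution time. For a runqueue, both members cover running time and blocked time. Of the two definitions below, the first is the Linux 4.0 version and the second is the Linux 4.3 version.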

struct sched_avg {
    /*
     * These sums represent an infinite geometric series and so are bound
     * above by 1024/(1-y).  Thus we only need a u32 to store them for all
     * choices of y < 1-2^(-32)*1024.
     */
    u32 runnable_avg_sum, runnable_avg_period;
    u64 last_runnable_update;
    s64 decay_count;
    unsigned long load_avg_contrib;
};


/*
 * The load_avg/util_avg accumulates an infinite geometric series.
 * 1) load_avg factors the amount of time that a sched_entity is
 * runnable on a rq into its weight. For cfs_rq, it is the aggregated
 * such weights of all runnable and blocked sched_entities.
 * 2) util_avg factors frequency scaling into the amount of time
 * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
 * For cfs_rq, it is the aggregated such times of all runnable and
 * blocked sched_entities.
 * The 64 bit load_sum can:
 * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
 * the highest weight (=88761) always runnable, we should not overflow
 * 2) for entity, support any load.weight always runnable
 */
struct sched_avg {
    u64 last_update_time, load_sum;
    u32 util_sum, period_contrib;
    unsigned long load_avg, util_avg;
};
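The comments in both structures refer to an infinite geometric series bounded by 1024/(1-y). A small floating-point sketch makes the bound concrete. It assumes y is chosen so that y^32 = 0.5, meaning a contribution halves after 32 periods; PELT_Y, pelt_sum() and pelt_decay() are hypothetical names and the literal for y is an approximation.

```c
/* Assumed decay factor: y such that y^32 = 0.5 (approximate literal). */
#define PELT_Y 0.97857206

/* Accumulated sum for an entity that is runnable in each of n periods:
 * 1024 * (1 + y + y^2 + ... + y^(n-1)). */
static double pelt_sum(int n)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum = sum * PELT_Y + 1024.0;
    return sum;
}

/* What remains of a contribution after n periods of decay. */
static double pelt_decay(double load, int n)
{
    while (n-- > 0)
        load *= PELT_Y;
    return load;
}
```

With these assumptions a contribution of 1024 decays to about 512 after 32 periods, and the sum for an always-runnable entity levels off a little below 47800. That is close to the 47742 used in the second structure's comment; the small gap is presumably due to the kernel's integer arithmetic.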

7. Summary
