A Brief Analysis of cgroup Internals: Process Scheduling

Reposted article, 2017-05-26 20:39. Tags: linux, cgroup

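This article looks at how cgroup limits CPU usage. As mentioned in the previous article, cgroup enforces CPU limits through the process scheduling subsystem, so we first need to understand that subsystem.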

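Because this is an article about cgroup, it covers only the parts of process scheduling that are closely tied to cgroup. For the complete implementation, see dedicated material on process scheduling.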

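The article has three parts. It first introduces the scheduling algorithms, then builds group scheduling on top of them, and finally, drawing on the earlier article (cgroup internals: the VFS file system), explains how user-level operations such as echo pid >> tasks and echo n > cpu.shares affect how the scheduler treats processes and thereby control their CPU usage. (Kernel source version: 3.10.)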

-------------------------------------------------------------

1. Process scheduling

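A process on Linux can be in one of several states. Runnable processes sit on the run queue (struct rq); when the CPU needs a new task to run, the scheduler picks the best candidate from the run queue.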

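By what rule does the scheduler pick that best process? This brings in the notion of process priority. Roughly speaking, Linux divides processes into two broad classes, normal and real-time, and represents priority as a value in the range 0-139, where a smaller value means a higher priority. Values 0-99 denote real-time processes and 100-139 (corresponding to user-level nice values -20 to 19) denote normal processes. A real-time process always has a higher priority than a normal process.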
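The mapping between nice values and the 0-139 range can be written down in a few lines. This is a minimal sketch based only on the ranges stated above; the function names are illustrative and the constants mirror the description rather than quoting kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Values follow the ranges described in the text. */
#define MAX_RT_PRIO 100                  /* priorities 0..99 are real-time */
#define MAX_PRIO    (MAX_RT_PRIO + 40)   /* priorities 100..139 are normal */

/* Map a user-level nice value (-20..19) onto the 100..139 range. */
int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* A priority below MAX_RT_PRIO belongs to a real-time process. */
bool is_rt_prio(int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO);
    return prio < MAX_RT_PRIO;
}
```

With this mapping, nice 0 lands at priority 120, the middle of the normal range, and every real-time priority compares as "smaller, hence higher" than any normal one.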

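Processes of different priority classes naturally use different scheduling policies: normal processes use completely fair scheduling (cfs) and real-time processes use real-time scheduling (rt). The kernel implements this in an object-oriented-like way: a scheduling class (struct sched_class) declares a uniform set of hook functions, and the completely fair instance (fair_sched_class) and the real-time instance (rt_sched_class) each implement those hooks.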
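The hook-table idea can be illustrated with a heavily reduced model. Everything prefixed with toy_ below is hypothetical and exists only for illustration; the real struct sched_class has many more hooks and different signatures.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task;

/* Hypothetical, heavily reduced stand-in for struct sched_class. */
struct toy_sched_class {
    const char *name;
    void (*task_tick)(struct toy_task *p);   /* hook called on every tick */
};

struct toy_task {
    const struct toy_sched_class *sched_class;
    int fair_ticks;   /* ticks handled by the fair class */
    int rt_ticks;     /* ticks handled by the rt class */
};

void toy_tick_fair(struct toy_task *p) { p->fair_ticks++; }
void toy_tick_rt(struct toy_task *p)   { p->rt_ticks++; }

const struct toy_sched_class toy_fair_class = { "fair", toy_tick_fair };
const struct toy_sched_class toy_rt_class   = { "rt",   toy_tick_rt };

/* The "core scheduler": it only knows the hook, not the implementation. */
void toy_scheduler_tick(struct toy_task *curr)
{
    assert(curr != NULL && curr->sched_class != NULL);
    curr->sched_class->task_tick(curr);
}
```

A task whose sched_class points at toy_rt_class is ticked by toy_tick_rt without the caller ever naming that function, which is the property the kernel relies on.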

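Correspondingly, the run queue is split into two sub-queues: one holds normal processes and is called the cfs queue (cfs_rq), the other holds real-time processes and is called the rt queue (rt_rq).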

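Also, although what the scheduler ultimately schedules is a process, the code uses a scheduling entity (struct sched_entity or struct sched_rt_entity) to represent an object to be scheduled.
For ordinary process scheduling, the scheduling entity is embedded in task_struct; for group scheduling (cgroup), it belongs to a task_group, which references one entity per CPU.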

The cfs queue organizes its scheduling entities in a red-black tree, and the cfs algorithm always picks the leftmost entity.

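The rt queue is structured like rt_queue[100][list]: the 100 slots correspond to the real-time priorities 0-99, and the second dimension is a linked list, so each real-time process is linked onto the list that matches its priority.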

The rt algorithm always picks the scheduling entity with the highest priority first.
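The "array of lists plus bitmap" selection can be sketched as follows. This is a simplified model, not the kernel code: a per-priority counter stands in for each list_head, a bool array stands in for the bitmap, and all toy_ names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_RT_PRIO 100

/* Hypothetical simplification of the two-dimensional rt queue. */
struct toy_rt_prio_array {
    bool nonempty[TOY_MAX_RT_PRIO];  /* "bitmap": is queue[prio] non-empty? */
    int  queued[TOY_MAX_RT_PRIO];    /* number of entities at each priority */
};

void toy_rt_enqueue(struct toy_rt_prio_array *a, int prio)
{
    assert(prio >= 0 && prio < TOY_MAX_RT_PRIO);
    a->queued[prio]++;
    a->nonempty[prio] = true;
}

void toy_rt_dequeue(struct toy_rt_prio_array *a, int prio)
{
    assert(prio >= 0 && prio < TOY_MAX_RT_PRIO && a->queued[prio] > 0);
    if (--a->queued[prio] == 0)
        a->nonempty[prio] = false;
}

/* Return the highest priority (smallest value) that has a queued entity,
 * or -1 if the array is empty. */
int toy_rt_pick_prio(const struct toy_rt_prio_array *a)
{
    for (int prio = 0; prio < TOY_MAX_RT_PRIO; prio++)
        if (a->nonempty[prio])
            return prio;
    return -1;
}
```

The scan over nonempty[] plays the role of a find-first-set-bit operation on the bitmap: the first non-empty slot is by construction the highest priority present.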

Here are the structs behind these concepts; Figure 1 shows how they relate to each other.

struct rq {
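    unsigned int nr_running;    // total number of runnable tasks
    struct load_weight load;    // total load of this queue
    struct cfs_rq cfs;          // completely fair queue
    struct rt_rq rt;            // real-time queue
    struct task_struct *curr, *idle, *stop; // curr points to the task currently running
    u64 clock;                  // this queue's own clock (real time; the scheduling algorithm also has a notion of virtual time)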
    ...
};

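This is the run queue. Every CPU has its own run queue; to keep the description simple, the rest of the article assumes a system with a single CPU core.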

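struct cfs_rq {    // abridged
    struct load_weight load;    // total weight of this queue
    unsigned int nr_running, h_nr_running;    // number of tasks on this queue
    u64 min_vruntime;    // a virtual time, explained in detail later
    struct rb_root tasks_timeline;  // red-black tree of this cfs queue; all entities are organized in it
    struct rb_node *rb_leftmost;    // leftmost node of the tree, i.e. the entity the scheduler will load next
    struct sched_entity *curr, *next, *last, *skip; // curr points to the entity currently running
    struct rq *rq;          // the rq this queue belongs to
    struct task_group *tg;  // the task_group that owns this cfs queue (task_group is the basis of cgroup; more later)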
    ...
};

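This is the cfs run queue, organized as a red-black tree. It introduces the notion of virtual time (vruntime), which ensures that all processes are scheduled fairly while higher-priority processes still receive more CPU.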

struct rt_prio_array {  // the real-time queue organizes tasks with this two-dimensional list
    DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1);
    struct list_head queue[MAX_RT_PRIO];    // MAX_RT_PRIO = 100, as explained above
};

struct rt_rq {
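    struct rt_prio_array active;    // holds all runnable entities
    unsigned int rt_nr_running;     // number of runnable tasks
    int rt_throttled;               // throttle flag: scheduling from this queue is suspended
    u64 rt_time;                    // run time accumulated by this queue in the current period
    u64 rt_runtime;                 // maximum run time of this queue per period
    unsigned long rt_nr_boosted;
    struct rq *rq;                  // owning rq
    struct task_group *tg;          // owning task_group (cgroup)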
    ...
};

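This is the real-time run queue, organized as a two-dimensional list.
Because real-time processes always have a higher priority than normal processes and do not use the completely fair algorithm, in the extreme case they could hold the CPU forever and normal processes would never get any CPU time.
The implementation therefore uses rt_time (the time consumed so far) and rt_runtime (the budget per period) to limit the CPU used by real-time processes. For example, take a period of 100 and rt_runtime = 95; the period and rt_runtime correspond to cpu.rt_period_us and cpu.rt_runtime_us in the cgroup.
On that rq, all real-time processes together can then use at most 95% of the CPU, and the remaining 5% is left for normal processes.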
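The budget check can be modelled in a few lines. This is only a sketch under simplifying assumptions: the toy_ names are invented, time is in abstract units, and the start of a new period simply clears the counters, which is coarser than what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of the rt_rq accounting fields. */
struct toy_rt_rq {
    uint64_t rt_time;      /* time consumed in the current period */
    uint64_t rt_runtime;   /* budget per period */
    bool     rt_throttled; /* set once the budget is exhausted */
};

/* Charge delta units of execution; return true if the queue is throttled. */
bool toy_rt_account(struct toy_rt_rq *rq, uint64_t delta)
{
    assert(rq != NULL);
    if (rq->rt_throttled)
        return true;
    rq->rt_time += delta;
    if (rq->rt_time > rq->rt_runtime)
        rq->rt_throttled = true;
    return rq->rt_throttled;
}

/* Start of a new period (simplified): clear consumption and the flag. */
void toy_rt_new_period(struct toy_rt_rq *rq)
{
    assert(rq != NULL);
    rq->rt_time = 0;
    rq->rt_throttled = false;
}
```

With rt_runtime = 95 and a period of 100 units, the queue runs freely until its consumption exceeds 95, stays throttled for the rest of the period, and becomes runnable again when the next period begins.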

struct sched_entity {
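    struct load_weight load;    // weight of this entity (the key to the cfs algorithm, and hence to how cgroup limits cpu)
    struct rb_node run_node;    // tree node, used for ordering in the red-black tree
    u64 exec_start;             // when the scheduler last updated this entity (real time)
    u64 sum_exec_runtime;       // total run time since the process started (real time)
    u64 vruntime;               // virtual time this entity has run
    u64 prev_sum_exec_runtime;  // total run time as of the last time the entity was taken off the cpu (real time)
    struct sched_entity *parent;// parent scheduling entity
    struct cfs_rq *cfs_rq;      // the cfs run queue this entity is on
    struct cfs_rq *my_q;        // child cfs queue, used for group scheduling; NULL if the entity represents a plain process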
    ...
};

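Normal processes use the completely fair algorithm, which gives every process on the queue a fair chance to be scheduled while still letting higher-priority processes take more CPU.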

struct sched_rt_entity {
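    struct list_head run_list;  // list node, used to link real-time entities
    struct sched_rt_entity  *parent;  // parent scheduling entity
    struct rt_rq        *rt_rq; // the rt run queue this entity is on
    struct rt_rq        *my_q;  // child rt queue, used for group scheduling; NULL if the entity represents a single process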
    ...
};

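The scheduling algorithm for real-time processes is simple: a higher-priority process preempts a lower-priority one immediately and then runs "forever" unless it gives up the CPU itself. ("Forever" is in quotes because the rt_runtime limit described above still applies.)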

struct sched_class {
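    const struct sched_class *next; // links the scheduling classes (cfs, rt, idle) into a chain
    // everything below is a function pointer: each class provides its own implementation, and the core scheduler calls them uniformly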
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task) (struct rq *rq);
    bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);
    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork) (struct task_struct *p);
    void (*switched_from) (struct rq *this_rq, struct task_struct *task);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,int oldprio);
    unsigned int (*get_rr_interval) (struct rq *rq, struct task_struct *task);
    void (*task_move_group) (struct task_struct *p, int on_rq);
};

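struct sched_class only declares a set of functions to the scheduler; the implementations are provided by the individual scheduling classes (cfs, rt). The core scheduler only cares about when to call which function pointer, not about how it is implemented.
Figure 1

Figure 1 shows how the structures relate to each other; note the scheduling group marked in it.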

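To stay focused, the discussion below ignores new processes being enqueued and processes leaving the run queue because they wait for other resources (IO). It uses only periodic scheduling as the example: at regular intervals the system raises a tick interrupt, which gives it the chance to update some statistics and to check whether the current process has run long enough that a reschedule is needed.
Real-time processes preempt normal processes: when a real-time process enters the run queue, the normal process quickly loses the CPU to it. Preemption of normal processes by real-time processes is thus triggered immediately at enqueue time, so periodic scheduling does not have to consider it.
That simplifies matters. Periodic scheduling only needs to call sched_class.task_tick of the current task: if the task is a real-time process, this resolves to the task_tick implementation of the real-time class, and likewise for a normal process.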

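------cfs scheduling algorithm
The main benefit of the completely fair scheduling algorithm is that it lets higher-priority processes take more CPU time while still giving every process a fair chance to be scheduled.
The notion of virtual time (vruntime) was mentioned above: a higher-priority process receives more CPU time (real time, runtime), but measured in virtual time (vruntime) every process runs for the same length.
How should this be understood? Here is a not-so-vivid example:
Think of processes A, B and C as three old ladies and of the scheduler as a helpful schoolchild whose goal is to get all three across the road at the same time (completely fair, after all).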

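The cfs algorithm resembles this example. Think of the distance the child has walked each old lady as vruntime: after one period he has been fair to A, B and C, having walked each of them the same distance (the same vruntime). Now suppose granny A is very unsteady on her feet (A has a high priority). Although A covers the same distance as B and C, A is slow, so walking A clearly takes more time (real time).
So the secret of the cfs algorithm is that the vruntime of a high-priority process grows slowly. A, B and C may each have run for 10 units of vruntime, but B and C each spent 10 units of runtime while A (high priority) spent 20.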

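How is this implemented?

runtime = period * (se->load.weight / cfs_rq->load.weight)    (Formula 1)

period is one scheduling period; how it is computed does not matter here. The larger se->load.weight is, the more real time the entity receives within one scheduling period.

vruntime = runtime * (orig_load_value / se->load.weight)    (Formula 2)

orig_load_value is a constant (1024). Clearly, the higher the priority of a process (the larger se->load.weight), the more slowly its vruntime advances for the same runtime.
Substituting Formula 1 into Formula 2 gives Formula 3:

vruntime = period * orig_load_value / cfs_rq->load.weight    (Formula 3)

The vruntime of a process therefore does not depend on its priority: real time is handed out according to priority, while equal vruntime expresses fairness.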
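The three formulas can be checked with concrete numbers. The helper functions below are a sketch written for this purpose (their names are not kernel names), using integer arithmetic and the constant 1024 from Formula 2.

```c
#include <assert.h>
#include <stdint.h>

#define ORIG_LOAD_VALUE 1024u  /* the constant from Formula 2 */

/* Formula 1: real time granted to one entity within a period. */
uint64_t toy_runtime(uint64_t period, uint32_t se_weight, uint32_t rq_weight)
{
    assert(rq_weight != 0);
    return period * se_weight / rq_weight;
}

/* Formula 2: virtual time that corresponds to a stretch of real time. */
uint64_t toy_vruntime(uint64_t runtime, uint32_t se_weight)
{
    assert(se_weight != 0);
    return runtime * ORIG_LOAD_VALUE / se_weight;
}
```

Take a period of 40 and weights A = 2048, B = C = 1024, so the queue weight is 4096. Formula 1 gives A a runtime of 20 and B and C a runtime of 10 each; Formula 2 turns all three into a vruntime of 10, which is exactly period * 1024 / 4096 from Formula 3 and matches the 20-versus-10 example in the text.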

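cfs_rq.min_vruntime above always holds the smallest vruntime among the scheduling entities on the cfs queue (this is not strictly accurate, but close enough here).
The sort key of each sched_entity in the red-black tree is (sched_entity->vruntime - cfs_rq.min_vruntime), so the higher the priority, the smaller sched_entity->vruntime is for the same amount of real time.
A smaller key means a position further to the left of the tree, and the cfs algorithm simply picks the leftmost sched_entity and loads it onto the CPU.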
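A small simulation shows why always running the entity with the smallest vruntime yields weight-proportional CPU time. It is a sketch under stated assumptions: a linear scan over an array replaces the red-black tree, every tick has the same fixed length, and the toy_ names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_se {
    uint32_t weight;    /* load.weight */
    uint64_t vruntime;  /* virtual time consumed so far */
    uint64_t runtime;   /* real time consumed so far */
};

/* Stand-in for "leftmost node of the red-black tree": a linear scan for
 * the smallest vruntime (ties go to the lowest index). */
size_t toy_pick_leftmost(const struct toy_se *se, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (se[i].vruntime < se[best].vruntime)
            best = i;
    return best;
}

/* Run `ticks` slices of `delta` real time each, always on the leftmost
 * entity, advancing its vruntime according to Formula 2. */
void toy_run(struct toy_se *se, size_t n, unsigned ticks, uint64_t delta)
{
    for (unsigned t = 0; t < ticks; t++) {
        struct toy_se *curr = &se[toy_pick_leftmost(se, n)];
        curr->runtime  += delta;
        curr->vruntime += delta * 1024 / curr->weight;
    }
}
```

With weights 2048, 1024 and 1024 and 400 ticks of 1000 units each, the first entity ends up with twice the real time of each of the other two, while all three finish with the same vruntime.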

Let's walk through the code from the top. When a tick interrupt occurs, the core scheduler calls the task_tick function of cfs, task_tick_fair:

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {     // group scheduling
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);    // entity_tick updates the time statistics of the cfs queue and the current entity, and decides whether to reschedule
    }
    ....
}

static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    ...
    update_curr(cfs_rq);    // update statistics of the entity and the cfs queue
    if (cfs_rq->nr_running > 1)
        check_preempt_tick(cfs_rq, curr);   // decide whether to reschedule
    ...
}

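entity_tick does two main things. It first calls update_curr() to update the runtime, vruntime and related statistics of the sched_entity and the cfs queue.
It then calls check_preempt_tick() to check whether this cfs queue needs to reschedule. Let's look at these two functions.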

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_of(cfs_rq)->clock_task;    // current time; this is real time
    unsigned long delta_exec;

    if (unlikely(!curr))
        return;
    delta_exec = (unsigned long)(now - curr->exec_start); // time elapsed since the last update, i.e. this entity's runtime increment
    if (!delta_exec)
        return;

    __update_curr(cfs_rq, curr, delta_exec);        // does the actual update
    curr->exec_start = now;    // set exec_start to now for the next computation
    ...
    account_cfs_rq_runtime(cfs_rq, delta_exec);     // related to cpu.cfs_quota_us and cpu.cfs_period_us
}

static inline void __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;
    curr->sum_exec_runtime += delta_exec;   // sum_exec_runtime holds the accumulated real run time of this entity
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);    // convert runtime (delta_exec) to vruntime (delta_exec_weighted) using Formula 2
    curr->vruntime += delta_exec_weighted;  // accumulate vruntime
    update_min_vruntime(cfs_rq);    // update min_vruntime of this cfs queue
}

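These few lines are fairly clear: vruntime is computed from runtime with Formula 2 and added to the vruntime of the sched_entity.
Because the vruntime of this sched_entity has grown, the previously stored cfs_rq->min_vruntime may be stale; update_min_vruntime() checks and updates it.
For the same reason, the vruntime of this sched_entity may no longer be the smallest in the cfs red-black tree, which is why check_preempt_tick() is called above to check whether a reschedule is needed.
Here is check_preempt_tick():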

static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;

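    ideal_runtime = sched_slice(cfs_rq, curr);  // maximum runtime of this entity in the current period, from its weight and Formula 1
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;  // runtime already used in this period
    if (delta_exec > ideal_runtime) {       // exceeded: call resched_task to set the reschedule flag (the cpu.cfs_quota_us / cpu.cfs_period_us limit is enforced separately, via account_cfs_rq_runtime in update_curr)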
        resched_task(rq_of(cfs_rq)->curr);
        return;
    }


    if (delta_exec < sysctl_sched_min_granularity)
        return;

    se = __pick_first_entity(cfs_rq);       // leftmost sched_entity of the red-black tree
    delta = curr->vruntime - se->vruntime;  // compare the vruntime of the two entities

    if (delta < 0)
        return;

    if (delta > ideal_runtime)      // this looks like a bug: a vruntime is compared with a runtime; it should read if (delta > calc_delta_fair(ideal_runtime, curr))
        resched_task(rq_of(cfs_rq)->curr);
}

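check_preempt_tick is also fairly clear. First, the real time a sched_entity occupies the CPU must not exceed the share computed from its weight.
Second, if the leftmost sched_entity in the red-black tree has a sufficiently smaller vruntime (delta > ideal_runtime * (orig_load_value / se->load.weight), see Formula 2), a reschedule is requested.
The check is not simply delta > 0, presumably to avoid rescheduling too frequently. That concludes the cfs scheduling algorithm.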
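The decision logic of check_preempt_tick() can be restated as a pure function, which makes the order of the checks easy to see. This is a sketch that mirrors the code as quoted (including the direct comparison of delta with ideal_runtime); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened inputs to the decision in check_preempt_tick(). */
struct toy_tick_state {
    uint64_t ideal_runtime;    /* slice from Formula 1 */
    uint64_t delta_exec;       /* real time run since the entity was picked */
    uint64_t min_granularity;  /* minimum run time before preemption */
    uint64_t curr_vruntime;    /* vruntime of the running entity */
    uint64_t left_vruntime;    /* vruntime of the leftmost entity */
};

/* Return true when the current entity should be rescheduled. */
bool toy_should_resched(const struct toy_tick_state *s)
{
    assert(s != NULL);
    if (s->delta_exec > s->ideal_runtime)
        return true;                 /* slice used up */
    if (s->delta_exec < s->min_granularity)
        return false;                /* too early to preempt */

    int64_t delta = (int64_t)(s->curr_vruntime - s->left_vruntime);
    if (delta < 0)
        return false;                /* current entity is still leftmost */
    return (uint64_t)delta > s->ideal_runtime;  /* lead is large enough */
}
```

The signed subtraction follows the kernel's habit of comparing vruntime values by their difference, so the comparison stays meaningful even if the counters wrap around.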

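------real-time scheduling algorithm
There is little to say about the principle of the real-time scheduling algorithm, so let's go straight to the code.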

static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
    struct sched_rt_entity *rt_se = &p->rt;

    update_curr_rt(rq);     // update runtime statistics; set the reschedule flag if the real-time share has hit its limit

    if (p->policy != SCHED_RR)  // not the SCHED_RR policy: return directly
        return;

    if (--p->rt.time_slice)     // decremented by 1 on every tick
        return;

    p->rt.time_slice = sched_rr_timeslice;  // time_slice reached 0: time to reschedule; reset it and move the task to the tail of its queue

    for_each_sched_rt_entity(rt_se) {
        if (rt_se->run_list.prev != rt_se->run_list.next) {
            requeue_task_rt(rq, p, 0);          // move to the tail of the queue
            set_tsk_need_resched(p);        // set the reschedule flag
            return;
        }
    }
}

static void update_curr_rt(struct rq *rq)
{
    struct task_struct *curr = rq->curr;
    struct sched_rt_entity *rt_se = &curr->rt;
    struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
    u64 delta_exec;

    delta_exec = rq->clock_task - curr->se.exec_start;  // runtime increment

    curr->se.sum_exec_runtime += delta_exec;    // accumulate on the task_struct

    curr->se.exec_start = rq->clock_task;   // set to the current time for the next delta

    for_each_sched_rt_entity(rt_se) {
        rt_rq = rt_rq_of_se(rt_se);
        if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {   // a runtime limit is set (group scheduling?)
            rt_rq->rt_time += delta_exec;   // accumulate rt_time of this rt_rq
            if (sched_rt_runtime_exceeded(rt_rq))   // has this rt_rq exceeded its runtime limit?
                resched_task(curr);     // if so, reschedule
        }
    }
}

The check inside sched_rt_runtime_exceeded is where the user-level cgroup limit on the CPU share of real-time processes (cpu.rt_period_us, cpu.rt_runtime_us) comes into play. Here it is:

static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
{
    u64 runtime = sched_rt_runtime(rt_rq);

    if (rt_rq->rt_throttled)
        return rt_rq_throttled(rt_rq);

    if (runtime >= sched_rt_period(rt_rq))
        return 0;

    balance_runtime(rt_rq);     // borrow some time from other cpus; not explored here (cpu.rt_runtime_us / cpu.rt_period_us * cpus is the real maximum share)
    runtime = sched_rt_runtime(rt_rq);  // maximum run time for this period
    if (runtime == RUNTIME_INF)
        return 0;

    if (rt_rq->rt_time > runtime) {     // run time already exceeded: must give up the cpu
        struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);

        /*
         * Don't actually throttle groups that have no runtime assigned
         * but accrue some time due to boosting.
         */
        if (likely(rt_b->rt_runtime)) {

            rt_rq->rt_throttled = 1;    // set rt_throttled to 1; it is reset to 0 when a new period starts

        } else {
            rt_rq->rt_time = 0;
        }

        if (rt_rq_throttled(rt_rq)) {
            sched_rt_rq_dequeue(rt_rq); // dequeue rt_rq --> task_group (covered later) --> sched_rt_entity
            return 1;
        }
    }

    return 0;
}

That concludes real-time scheduling; next comes group scheduling.
-------------------------------------------------------------
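2. Group scheduling
The discussion so far has used the term scheduling entity rather than process. The reason is that a scheduling entity may correspond to a task_struct or to a task_group. Here are the structs: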

struct task_struct {
    int on_rq;
    int prio, static_prio, normal_prio;
    unsigned int rt_priority;
    const struct sched_class *sched_class;  // scheduling class
    struct sched_entity se;     // cfs scheduling entity
    struct sched_rt_entity rt;  // rt scheduling entity
    struct task_group *sched_task_group;
    ...
};

struct task_group {
    struct cgroup_subsys_state css;

    struct sched_entity **se;   // cfs scheduling entities (one per cpu)
    struct cfs_rq **cfs_rq;     // child cfs queues (one per cpu)
    unsigned long shares;       // value set through cpu.shares

    atomic_t load_weight;

    struct sched_rt_entity **rt_se; // rt scheduling entities (one per cpu)
    struct rt_rq **rt_rq;       // child rt queues (one per cpu)

    struct rt_bandwidth rt_bandwidth;   // stores cpu.rt_period_us, cpu.rt_runtime_us
    struct cfs_bandwidth cfs_bandwidth; // stores cpu.cfs_quota_us, cpu.cfs_period_us
};

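As shown, both task_group and task_struct carry scheduling entities. Unlike a task_struct, a task_group is not itself a unit of execution; it represents all tasks in the group when the total CPU resources are divided at the level above.
Figure 2 shows how task_group relates to the other structures.
Figure 2

Together with Figure 1: when the scheduling algorithm picks an entity that corresponds to a group, it picks again within that group's child rq, and repeats this until it arrives at an actual process. Let's look at the code: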

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {     // group scheduling
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);    // entity_tick updates the time statistics of the cfs queue and the current entity, and decides whether to reschedule
    }
    ....
}

#define for_each_sched_entity(se) \
        for (; se; se = se->parent)

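This is the code from above, now with the for_each_sched_entity macro expanded.
The statistics are updated in a loop: if the scheduling entity is a process that is not in any group, se->parent is NULL; if it is a process inside a group, the loop walks upward and updates the parent groups as well.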
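The upward walk can be reproduced with a two-field structure. The sketch below uses invented toy_ names and charges the same amount of run time to a task's entity and to every group entity above it, which is the essence of the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_entity {
    struct toy_entity *parent;   /* NULL for an entity on the root queue */
    uint64_t sum_exec_runtime;   /* real time charged to this entity */
};

#define toy_for_each_entity(se) for (; se; se = se->parent)

/* Charge delta to the task's entity and to every group entity above it. */
void toy_charge(struct toy_entity *se, uint64_t delta)
{
    assert(se != NULL);
    toy_for_each_entity(se)
        se->sum_exec_runtime += delta;
}
```

If a task inside a nested group runs for 5 units and a sibling attached directly to the outer group runs for 3, the inner group is charged 5 and the outer group 8, so each group entity always reflects the total consumption of everything below it.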

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct task_struct *p;
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;

    if (!cfs_rq->nr_running)
        return NULL;

    do {        // loop: descend through group queues
        se = pick_next_entity(cfs_rq);
        set_next_entity(cfs_rq, se);
        cfs_rq = group_cfs_rq(se);
    } while (cfs_rq);

    p = task_of(se);
    if (hrtick_enabled(rq))
        hrtick_start_fair(rq, p);

    return p;
}

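When the scheduler chooses the next process to load, pick_next_entity picks the leftmost sched_entity; if that entity corresponds to a group, selection continues in the group's child cfs_rq until a process is reached.
Real-time group scheduling works similarly. The difference is that the priority of a real-time group is the priority of the highest-priority process in the group, because among real-time processes a higher priority preempts a lower one.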
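The descent through group queues can be sketched with a small model. As before, this is an illustration under assumptions: an array scanned for the smallest vruntime replaces the red-black tree, and the toy_ types and the pid field are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_cfs_rq;

struct toy_se {
    uint64_t vruntime;
    struct toy_cfs_rq *my_q;  /* child queue for a group, NULL for a task */
    int pid;                  /* meaningful only when my_q == NULL */
};

struct toy_cfs_rq {
    struct toy_se **entities; /* stand-in for the red-black tree */
    size_t nr;
};

/* Leftmost entity of one queue: smallest vruntime via linear scan. */
struct toy_se *toy_pick_entity(const struct toy_cfs_rq *q)
{
    assert(q != NULL && q->nr > 0);
    struct toy_se *best = q->entities[0];
    for (size_t i = 1; i < q->nr; i++)
        if (q->entities[i]->vruntime < best->vruntime)
            best = q->entities[i];
    return best;
}

/* Descend from the root queue until the picked entity is a real task. */
int toy_pick_next_pid(const struct toy_cfs_rq *root)
{
    const struct toy_cfs_rq *q = root;
    struct toy_se *se;
    do {
        se = toy_pick_entity(q);
        q = se->my_q;
    } while (q);
    return se->pid;
}
```

Note that the group competes on the root queue with its own vruntime: a task inside the group is only considered once the group entity itself is the leftmost entity one level up.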
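Next, building on the earlier article (cgroup internals: the VFS file system), let's see how writing to a cgroup's control files with echo limits the processes in that cgroup.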
-------------------------------------------------------------
eg: echo 2048 >> cpu.shares
The VFS call path was covered in the previous article and is not repeated here. We go straight to cpu_shares_write_u64, which is only a wrapper; the real work is done by sched_group_set_shares().

int sched_group_set_shares(struct task_group *tg, unsigned long shares)
{
    ....
    tg->shares = shares;            // store shares in tg->shares
    for_each_possible_cpu(i) {      // a cgroup has one scheduling entity on every cpu
        struct rq *rq = cpu_rq(i);
        struct sched_entity *se;
        se = tg->se[i];
        for_each_sched_entity(se)
            update_cfs_shares(group_cfs_rq(se));    
    }
    ...
    return 0;
}
Moving on to update_cfs_shares():
static void update_cfs_shares(struct cfs_rq *cfs_rq)
{
    struct task_group *tg;
    struct sched_entity *se;
    long shares;

    tg = cfs_rq->tg;
    se = tg->se[cpu_of(rq_of(cfs_rq))];
    if (!se || throttled_hierarchy(cfs_rq))
        return;
#ifndef CONFIG_SMP
    if (likely(se->load.weight == tg->shares))
        return;
#endif
    shares = calc_cfs_shares(cfs_rq, tg);       // can be read as shares = tg->shares

    reweight_entity(cfs_rq_of(se), se, shares); // set load.weight of the sched_entity that belongs to this task_group
}
After a few simple checks, reweight_entity is called to set the weight:
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, unsigned long weight)
{
    ...
    update_load_set(&se->load, weight);
    ...
}
static inline void update_load_set(struct load_weight *lw, unsigned long w)
{
    lw->weight = w;
    lw->inv_weight = 0;
}

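echo 2048 >> cpu.shares therefore sets the value behind cgroup->task_group->sched_entity.load.weight.
By Formula 1, changing the weight ultimately changes the real CPU time this sched_entity receives.
Note that this sched_entity belongs not to a process but to the task_group (cgroup). The end result is that the total CPU used by the processes in the cgroup equals the CPU time computed from this sched_entity.load.weight via Formula 1.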
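A worked example makes the effect of cpu.shares concrete. The helpers below are a sketch, not kernel code: they apply Formula 1 with a period of 1000 so that the result reads as parts per thousand, and they assume that every entity on a queue is always runnable.

```c
#include <assert.h>
#include <stdint.h>

/* CPU share, in parts per thousand, of an entity with weight `weight`
 * on a queue whose total weight is `total` (Formula 1, period = 1000). */
uint32_t toy_share_permille(uint32_t weight, uint32_t total)
{
    assert(total != 0 && weight <= total);
    return (uint32_t)((uint64_t)1000 * weight / total);
}

/* Share of one task inside a group: the group's share of the parent
 * queue, multiplied by the task's share of the group's own queue. */
uint32_t toy_task_share_permille(uint32_t group_shares, uint32_t parent_total,
                                 uint32_t task_weight, uint32_t group_total)
{
    uint32_t group = toy_share_permille(group_shares, parent_total);
    uint32_t inner = toy_share_permille(task_weight, group_total);
    return group * inner / 1000;
}
```

Suppose two sibling cgroups have cpu.shares of 2048 and 1024. The first group receives about two thirds of the CPU (666 per thousand with integer truncation) and the second about one third. If the first group contains two equally weighted tasks, each of them ends up with about one third of the CPU, the same as the whole second group.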

eg: echo 1000000 >> cpu.rt_period_us; echo 950000 >> cpu.rt_runtime_us
This example uses real-time processes; cpu.cfs_quota_us and cpu.cfs_period_us work similarly and are not covered here.
vfs-->cpu_rt_period_write_uint, which wraps sched_group_set_rt_period().

static int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
{
    u64 rt_runtime, rt_period;

    rt_period = (u64)rt_period_us * NSEC_PER_USEC;  // compute the new period
    rt_runtime = tg->rt_bandwidth.rt_runtime;       // existing quota

    if (rt_period == 0)
        return -EINVAL;

    return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);  // apply
}


vfs-->cpu_rt_runtime_write, which wraps sched_group_set_rt_runtime():
static int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
{
    u64 rt_runtime, rt_period;

    rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);    // existing period
    rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;        // new quota
    if (rt_runtime_us < 0)
        rt_runtime = RUNTIME_INF;

    return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);  // apply
}

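So whenever cpu.rt_period_us or cpu.rt_runtime_us is written, the newly set value is combined with the existing value of the other parameter, and both are passed to tg_set_rt_bandwidth: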
static int tg_set_rt_bandwidth(struct task_group *tg, u64 rt_period, u64 rt_runtime)
{
    ...
    tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);    // store quota and period in the task_group
    tg->rt_bandwidth.rt_runtime = rt_runtime;

    for_each_possible_cpu(i) {
        struct rt_rq *rt_rq = tg->rt_rq[i];

        raw_spin_lock(&rt_rq->rt_runtime_lock);
        rt_rq->rt_runtime = rt_runtime;         // propagate to every rt_rq->rt_runtime
        raw_spin_unlock(&rt_rq->rt_runtime_lock);
    }
    raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
unlock:
    read_unlock(&tasklist_lock);
    mutex_unlock(&rt_constraints_mutex);

    return err;
}

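Combined with the real-time scheduling algorithm described above, setting rt_rq->rt_runtime amounts to setting the maximum fraction of each period during which real-time processes may occupy the CPU.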
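The "new value plus the other existing value" pattern of the two setters can be condensed into a standalone sketch. The toy_ names are hypothetical, the schedulability checks and locking of the real tg_set_rt_bandwidth are left out, and UINT64_MAX is used here as the "no limit" marker.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NSEC_PER_USEC 1000ULL
#define TOY_RUNTIME_INF   UINT64_MAX   /* "no limit" marker */

struct toy_rt_bandwidth {
    uint64_t rt_period;   /* ns */
    uint64_t rt_runtime;  /* ns, or TOY_RUNTIME_INF */
};

/* Common setter: both values are always written together. */
int toy_set_rt_bandwidth(struct toy_rt_bandwidth *b,
                         uint64_t period, uint64_t runtime)
{
    assert(b != NULL);
    b->rt_period = period;
    b->rt_runtime = runtime;
    return 0;
}

/* Write to cpu.rt_period_us: new period, existing runtime. */
int toy_set_rt_period(struct toy_rt_bandwidth *b, long period_us)
{
    uint64_t period = (uint64_t)period_us * TOY_NSEC_PER_USEC;
    if (period == 0)
        return -EINVAL;
    return toy_set_rt_bandwidth(b, period, b->rt_runtime);
}

/* Write to cpu.rt_runtime_us: existing period, new runtime. */
int toy_set_rt_runtime(struct toy_rt_bandwidth *b, long runtime_us)
{
    uint64_t runtime = (uint64_t)runtime_us * TOY_NSEC_PER_USEC;
    if (runtime_us < 0)
        runtime = TOY_RUNTIME_INF;
    return toy_set_rt_bandwidth(b, b->rt_period, runtime);
}
```

Writing 1000000 to the period and 950000 to the runtime leaves the structure holding 1 s and 0.95 s in nanoseconds, which is the 95% limit from the earlier example; a negative runtime removes the limit.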
-------------------------------------------------------------
eg: echo pid >> tasks
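This operation has two steps. First, the cgroup the process belongs to is changed, as described earlier. Second, the attach hook of each subsystem is called, which here amounts to changing the run queue the process is on.
Put simply, the values of a few pointers are changed: the process is detached from its original run queue and from then on runs on the cgroup's run queue.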
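The pointer change can be pictured with a deliberately tiny model. The types and the nr_tasks counter below are invented for illustration; the real attach path also has to dequeue and re-enqueue the task and adjust its vruntime.

```c
#include <assert.h>
#include <stddef.h>

struct toy_group {
    const char *name;
    int nr_tasks;              /* tasks currently attached to this group */
};

struct toy_task {
    int pid;
    struct toy_group *group;   /* which group's queue the task runs on */
};

/* Move a task to another group: detach, repoint, attach. */
void toy_move_task(struct toy_task *p, struct toy_group *to)
{
    assert(p != NULL && to != NULL);
    if (p->group == to)
        return;
    if (p->group)
        p->group->nr_tasks--;  /* leave the old queue */
    p->group = to;             /* the "few pointers" that change */
    to->nr_tasks++;            /* join the new queue */
}
```

After the move, every scheduling decision for the task goes through the new group, so the group's shares and bandwidth limits apply to it from that point on.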
If you find any mistakes, please point them out.
