I previously ran into a power-consumption issue that turned out to be caused by the /proc/sys/kernel/sched_boost node being set abnormally: it was stuck in the boost state, raising power consumption in every scenario.
This article summarizes what I have learned about sched_boost.
Sched_Boost
sched_boost boosts performance mainly by influencing task placement. It is part of QTI's EAS (Energy Aware Scheduling).
Default task placement policy
The scheduler computes each CPU's load and places the task on the least-loaded CPU. If several CPUs have the same load (typically when they are all idle), the task is placed on the CPU with the largest capacity in the system.
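The tie-breaking rule above can be sketched as a tiny standalone function. The struct and names here are illustrative, not the kernel's:

```c
#include <stddef.h>

/* Hypothetical per-CPU stats for illustration only. */
struct cpu_stat {
	unsigned long load;
	unsigned long capacity;
};

/* Pick the least-loaded CPU; on a load tie, prefer the CPU
 * with the largest capacity, as described above. */
static int pick_cpu(const struct cpu_stat *cpus, int nr_cpus)
{
	int best = 0;
	int i;

	for (i = 1; i < nr_cpus; i++) {
		if (cpus[i].load < cpus[best].load ||
		    (cpus[i].load == cpus[best].load &&
		     cpus[i].capacity > cpus[best].capacity))
			best = i;
	}
	return best;
}
```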
Setting sched_boost
sched_boost can be enabled either by writing the /proc/sys/kernel/sched_boost node or by calling sched_set_boost() from kernel code; while boost is active, energy cost is ignored during task placement.
Once boost is set, it must be explicitly disabled by writing 0. Multiple clients may request boost at the same time; the highest-priority requested level takes effect, and boost is truly disabled only after every client has released its request.
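This refcount-plus-priority aggregation can be modeled with a small standalone sketch. The enum mirrors the kernel's boost types, but the helpers boost_set/boost_clear/effective_boost are illustrative, not kernel APIs:

```c
/* Toy model of the boost aggregation described above: each client's
 * set() bumps a per-type refcount, clear() drops it, and the effective
 * boost is the highest-priority type whose refcount is non-zero. */
enum {
	NO_BOOST,
	FULL_THROTTLE_BOOST,
	CONSERVATIVE_BOOST,
	RESTRAINED_BOOST,
	NR_BOOST,
};

static int refcount[NR_BOOST];

static void boost_set(int type)
{
	refcount[type]++;
}

static void boost_clear(int type)
{
	if (refcount[type] > 0)
		refcount[type]--;
}

/* Types are sorted in descending priority, so the first non-zero
 * refcount wins (this mirrors sched_effective_boost()). */
static int effective_boost(void)
{
	int i;

	for (i = FULL_THROTTLE_BOOST; i < NR_BOOST; i++)
		if (refcount[i] > 0)
			return i;
	return NO_BOOST;
}
```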
Boost levels
sched_boost has four levels. 0 turns boost off; the other three control different trade-offs between power and performance (1 = Full Throttle, 2 = Conservative, 3 = Restrained; writing a negative value -N drops one reference to boost type N, as _sched_set_boost() shows).
Writing the node invokes sched_boost_handler:
{
	.procname	= "sched_boost",
	.data		= &sysctl_sched_boost,
	.maxlen		= sizeof(unsigned int),
	.mode		= 0644,
	.proc_handler	= sched_boost_handler,
	.extra1		= &neg_three,
	.extra2		= &three,
},
After the value passes verification, _sched_set_boost() is called to apply the boost.
int sched_boost_handler(struct ctl_table *table, int write,
		void __user *buffer, size_t *lenp, loff_t *ppos)
{
	int ret;
	unsigned int *data = (unsigned int *)table->data;

	mutex_lock(&boost_mutex);

	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
	if (ret || !write)
		goto done;

	if (verify_boost_params(*data))
		_sched_set_boost(*data);
	else
		ret = -EINVAL;

done:
	mutex_unlock(&boost_mutex);
	return ret;
}
Kernel callers go through sched_set_boost(), which likewise ends up calling _sched_set_boost():
int sched_set_boost(int type)
{
	int ret = 0;

	mutex_lock(&boost_mutex);

	if (verify_boost_params(type))
		_sched_set_boost(type);
	else
		ret = -EINVAL;

	mutex_unlock(&boost_mutex);
	return ret;
}
Now let's look at the key function, _sched_set_boost():
static void _sched_set_boost(int type)
{
	if (type == 0)				/* the type argument decides enable vs. disable */
		sched_boost_disable_all();	/* (1) disable all boosts */
	else if (type > 0)
		sched_boost_enable(type);	/* (2) enable a boost */
	else
		sched_boost_disable(-type);	/* (3) disable a boost */

	/*
	 * sysctl_sched_boost holds the boost request from
	 * user space which could be different from the
	 * effectively enabled boost. Update the effective
	 * boost here.
	 */
	sched_boost_type = sched_effective_boost();
	sysctl_sched_boost = sched_boost_type;
	set_boost_policy(sysctl_sched_boost);	/* (4) set the boost policy */
	trace_sched_set_boost(sysctl_sched_boost);
}
First, look at the four per-type control structures of sched_boost:
refcount records how many times a type has been requested; enter() is the action taken when switching into that boost configuration, and exit() the action taken when leaving it.
static struct sched_boost_data sched_boosts[] = {
	[NO_BOOST] = {
		.refcount = 0,
		.enter = sched_no_boost_nop,
		.exit = sched_no_boost_nop,
	},
	[FULL_THROTTLE_BOOST] = {
		.refcount = 0,
		.enter = sched_full_throttle_boost_enter,
		.exit = sched_full_throttle_boost_exit,
	},
	[CONSERVATIVE_BOOST] = {
		.refcount = 0,
		.enter = sched_conservative_boost_enter,
		.exit = sched_conservative_boost_exit,
	},
	[RESTRAINED_BOOST] = {
		.refcount = 0,
		.enter = sched_restrained_boost_enter,
		.exit = sched_restrained_boost_exit,
	},
};
(1) Disable all boosts
Call exit() for every boost type except NO_BOOST that has a non-zero refcount, and reset their refcounts to 0.
#define SCHED_BOOST_START	FULL_THROTTLE_BOOST
#define SCHED_BOOST_END		(RESTRAINED_BOOST + 1)

static void sched_boost_disable_all(void)
{
	int i;

	for (i = SCHED_BOOST_START; i < SCHED_BOOST_END; i++) {
		if (sched_boosts[i].refcount > 0) {
			sched_boosts[i].exit();
			sched_boosts[i].refcount = 0;
		}
	}
}
(2) Enable a boost
refcount is incremented to record the request.
Because sched_boost supports concurrent requests from multiple clients, the currently effective boost configuration must be checked before applying a new one.
Among the enabled boosts, the priority order is Full Throttle > Conservative > Restrained; NO_BOOST applies only when nothing is enabled.
static void sched_boost_enable(int type)
{
	struct sched_boost_data *sb = &sched_boosts[type];
	int next_boost, prev_boost = sched_boost_type;

	sb->refcount++;			/* record the request */

	if (sb->refcount != 1)
		return;

	/*
	 * This boost enable request did not come before.
	 * Take this new request and find the next boost
	 * by aggregating all the enabled boosts. If there
	 * is a change, disable the previous boost and enable
	 * the next boost.
	 */
	next_boost = sched_effective_boost();	/* check the effective boost first */
	if (next_boost == prev_boost)
		return;

	sched_boosts[prev_boost].exit();	/* leave the previous boost state */
	sched_boosts[next_boost].enter();	/* enter the new boost state */
}
The currently effective boost is determined by scanning the refcounts:
static int sched_effective_boost(void)
{
	int i;

	/*
	 * The boosts are sorted in descending order by
	 * priority.
	 */
	for (i = SCHED_BOOST_START; i < SCHED_BOOST_END; i++) {
		if (sched_boosts[i].refcount >= 1)
			return i;
	}

	return NO_BOOST;
}
(3) Disable a boost
Disabling a boost decrements its refcount; when the count reaches zero, that type's exit() is called to leave the boost.
Because multiple boost types can be active at once, after disabling one type the effective boost is re-evaluated so that the highest-priority remaining type takes over.
static void sched_boost_disable(int type)
{
	struct sched_boost_data *sb = &sched_boosts[type];
	int next_boost;

	if (sb->refcount <= 0)
		return;

	sb->refcount--;

	if (sb->refcount)
		return;

	/*
	 * This boost's refcount becomes zero, so it must
	 * be disabled. Disable it first and then apply
	 * the next boost.
	 */
	sb->exit();

	next_boost = sched_effective_boost();
	sched_boosts[next_boost].enter();
}
(4) Set the boost policy
The final step sets the policy, which determines whether tasks should be up-migrated.
The up-migration policies corresponding to the sched_boost levels are:
Full Throttle and Conservative: SCHED_BOOST_ON_BIG --- during task placement, only the largest-capacity CPU cores are considered
No level directly (device tree or symmetric systems only): SCHED_BOOST_ON_ALL --- during task placement, only the smallest-capacity CPU cores are excluded
No Boost and Restrained: SCHED_BOOST_NONE --- normal EAS
/*
 * Scheduler boost type and boost policy might at first seem unrelated,
 * however, there exists a connection between them that will allow us
 * to use them interchangeably during placement decisions. We'll explain
 * the connection here in one possible way so that the implications are
 * clear when looking at placement policies.
 *
 * When policy = SCHED_BOOST_NONE, type is either none or RESTRAINED
 * When policy = SCHED_BOOST_ON_ALL or SCHED_BOOST_ON_BIG, type can
 * neither be none nor RESTRAINED.
 */
static void set_boost_policy(int type)
{
	/* only Conservative and Full Throttle up-migrate */
	if (type == NO_BOOST || type == RESTRAINED_BOOST) {
		boost_policy = SCHED_BOOST_NONE;
		return;
	}

	if (boost_policy_dt) {
		boost_policy = boost_policy_dt;
		return;
	}

	/* min vs. max CPU efficiency differ; should always hold on big.LITTLE */
	if (min_possible_efficiency != max_possible_efficiency) {
		boost_policy = SCHED_BOOST_ON_BIG;
		return;
	}

	boost_policy = SCHED_BOOST_ON_ALL;
}
Next, a detailed look at how each of the three boost modes works:
Full Throttle
In full-throttle mode, sched_boost performs two main actions:
(1) core control
(2) frequency aggregation
static void sched_full_throttle_boost_enter(void)
{
	core_ctl_set_boost(true);			/* (1) core control */
	walt_enable_frequency_aggregation(true);	/* (2) freq aggregation */
}
(1) Core control: isolate/unisolate CPU cores. When boost is enabled, all CPU cores are brought online (unisolated).
int core_ctl_set_boost(bool boost)
{
	unsigned int index = 0;
	struct cluster_data *cluster;
	unsigned long flags;
	int ret = 0;
	bool boost_state_changed = false;

	if (unlikely(!initialized))
		return 0;

	spin_lock_irqsave(&state_lock, flags);
	for_each_cluster(cluster, index) {	/* update and record each cluster's boost state */
		if (boost) {
			boost_state_changed = !cluster->boost;
			++cluster->boost;
		} else {
			if (!cluster->boost) {
				ret = -EINVAL;
				break;
			} else {
				--cluster->boost;
				boost_state_changed = !cluster->boost;
			}
		}
	}
	spin_unlock_irqrestore(&state_lock, flags);

	if (boost_state_changed) {
		index = 0;
		for_each_cluster(cluster, index)	/* apply the boost setting per cluster */
			apply_need(cluster);
	}

	trace_core_ctl_set_boost(cluster->boost, ret);

	return ret;
}
EXPORT_SYMBOL(core_ctl_set_boost);
static void apply_need(struct cluster_data *cluster)
{
	if (eval_need(cluster))				/* is a change needed? */
		wake_up_core_ctl_thread(cluster);	/* wake the cluster's core control thread */
}
How the decision is made:
When enabling boost: check whether any CPU needs to be unisolated.
When disabling boost: check whether need_cpus < active_cpus holds.
In both cases, the time elapsed since the last update must also exceed the delay time.
static bool eval_need(struct cluster_data *cluster)
{
	unsigned long flags;
	struct cpu_data *c;
	unsigned int need_cpus = 0, last_need, thres_idx;
	int ret = 0;
	bool need_flag = false;
	unsigned int new_need;
	s64 now, elapsed;

	if (unlikely(!cluster->inited))
		return 0;

	spin_lock_irqsave(&state_lock, flags);

	if (cluster->boost || !cluster->enable) {
		/* boost enabled: need all CPUs */
		need_cpus = cluster->max_cpus;
	} else {
		/* boost disabled: first count the active CPUs */
		cluster->active_cpus = get_active_cpu_count(cluster);
		thres_idx = cluster->active_cpus ? cluster->active_cpus - 1 : 0;
		list_for_each_entry(c, &cluster->lru, sib) {
			bool old_is_busy = c->is_busy;

			if (c->busy >= cluster->busy_up_thres[thres_idx] ||
			    sched_cpu_high_irqload(c->cpu))
				c->is_busy = true;
			else if (c->busy < cluster->busy_down_thres[thres_idx])
				c->is_busy = false;

			trace_core_ctl_set_busy(c->cpu, c->busy, old_is_busy,
						c->is_busy);
			need_cpus += c->is_busy;
		}
		/* compute need_cpus from task demand */
		need_cpus = apply_task_need(cluster, need_cpus);
	}
	/* clamp: cluster->min_cpus <= need_cpus <= cluster->max_cpus */
	new_need = apply_limits(cluster, need_cpus);
	/*
	 * (*) boost enabled: does any CPU need to be unisolated?
	 *     boost disabled: is need_cpus < active_cpus?
	 */
	need_flag = adjustment_possible(cluster, new_need);

	last_need = cluster->need_cpus;
	now = ktime_to_ms(ktime_get());

	if (new_need > cluster->active_cpus) {
		ret = 1;	/* boost enabled */
	} else {
		/*
		 * When there is no change in need and there are no more
		 * active CPUs than currently needed, just update the
		 * need time stamp and return.
		 */
		if (new_need == last_need && new_need == cluster->active_cpus) {
			cluster->need_ts = now;
			spin_unlock_irqrestore(&state_lock, flags);
			return 0;
		}

		elapsed = now - cluster->need_ts;
		/* the change only counts once it has persisted past the delay */
		ret = elapsed >= cluster->offline_delay_ms;
	}

	if (ret) {
		cluster->need_ts = now;		/* update the timestamp and need_cpus */
		cluster->need_cpus = new_need;
	}
	trace_core_ctl_eval_need(cluster->first_cpu, last_need, new_need,
				 ret && need_flag);
	spin_unlock_irqrestore(&state_lock, flags);

	return ret && need_flag;
}
Once the update conditions are satisfied, the core control thread is woken up:
static void wake_up_core_ctl_thread(struct cluster_data *cluster)
{
	unsigned long flags;

	spin_lock_irqsave(&cluster->pending_lock, flags);
	cluster->pending = true;
	spin_unlock_irqrestore(&cluster->pending_lock, flags);

	wake_up_process(cluster->core_ctl_thread);
}
A check of the pending flag guards against redundant reentry: if there is no pending request, the thread calls schedule() and is moved off the runqueue.
static int __ref try_core_ctl(void *data)
{
	struct cluster_data *cluster = data;
	unsigned long flags;

	while (1) {
		/*
		 * Leave RUNNING and become TASK_INTERRUPTIBLE before calling
		 * schedule(); since the task is no longer TASK_RUNNING it is
		 * dequeued and another task becomes rq->curr (the classic
		 * "leave the runqueue" pattern).
		 */
		set_current_state(TASK_INTERRUPTIBLE);

		spin_lock_irqsave(&cluster->pending_lock, flags);
		if (!cluster->pending) {
			/* nothing pending: sleep via schedule(), maybe exit */
			spin_unlock_irqrestore(&cluster->pending_lock, flags);
			schedule();
			if (kthread_should_stop())
				break;
			spin_lock_irqsave(&cluster->pending_lock, flags);
		}
		set_current_state(TASK_RUNNING);
		cluster->pending = false;
		spin_unlock_irqrestore(&cluster->pending_lock, flags);

		do_core_ctl(cluster);	/* do the work */
	}

	return 0;
}
static void __ref do_core_ctl(struct cluster_data *cluster)
{
	unsigned int need;

	need = apply_limits(cluster, cluster->need_cpus);	/* re-check that need_cpus is valid */
	if (adjustment_possible(cluster, need)) {		/* re-check that a change is needed */
		pr_debug("Trying to adjust group %u from %u to %u\n",
				cluster->first_cpu, cluster->active_cpus, need);

		if (cluster->active_cpus > need)	/* isolate/unisolate CPUs per need */
			try_to_isolate(cluster, need);
		else if (cluster->active_cpus < need)
			try_to_unisolate(cluster, need);
	}
}
(2) Frequency aggregation: raise the CPU frequency.
It simply sets a flag:
static inline void walt_enable_frequency_aggregation(bool enable)
{
	sched_freq_aggr_en = enable;
}
This flag subsequently influences frequency selection in one place:
- When computing the frequency-policy load, the calculation depends on whether frequency aggregation is enabled.
With the flag set, aggr_grp_load is used in place of rq->grp_time.prev_runnable_sum.
aggr_grp_load is the sum of rq->grp_time.prev_runnable_sum over all CPU cores in the current cluster.
Enabling the flag therefore inflates the computed load, which pushes the CPU frequency up (or triggers migration).
static inline u64 freq_policy_load(struct rq *rq)
{
	...
	if (sched_freq_aggr_en)
		load = rq->prev_runnable_sum + aggr_grp_load;
	else
		load = rq->prev_runnable_sum + rq->grp_time.prev_runnable_sum;
	...
}
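A minimal standalone sketch of the difference, assuming aggr_grp_load is the cluster-wide sum described above (toy_rq and toy_freq_policy_load are illustrative stand-ins, not kernel code):

```c
/* Toy model: with frequency aggregation enabled, a CPU's policy load
 * adds the group load summed over the whole cluster instead of only
 * its own rq's group contribution, so the reported load (and hence
 * the chosen frequency) can only go up. */
struct toy_rq {
	unsigned long long prev_runnable_sum;
	unsigned long long grp_prev_runnable_sum;	/* this rq's group contribution */
};

static unsigned long long toy_freq_policy_load(const struct toy_rq *rq,
					       unsigned long long aggr_grp_load,
					       int freq_aggr_en)
{
	if (freq_aggr_en)
		return rq->prev_runnable_sum + aggr_grp_load;	/* cluster-wide sum */
	return rq->prev_runnable_sum + rq->grp_prev_runnable_sum;
}
```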
Conservative
static void sched_conservative_boost_enter(void)
{
	update_cgroup_boost_settings();	/* (1) update the cgroup boost settings */
	sched_task_filter_util = sysctl_sched_min_task_util_for_boost;	/* (2) adjust the task-util threshold */
}
(1) Iterate over the boost groups, which typically are: background, foreground, top-app and rt.
sched_boost_enabled is set to false for every group except those with no_override set. From init.target.rc we can see that top-app and foreground set this flag (below), so in the end only these two groups keep boost enabled:
write /dev/stune/foreground/schedtune.sched_boost_no_override 1
write /dev/stune/top-app/schedtune.sched_boost_no_override 1
void update_cgroup_boost_settings(void)
{
	int i;

	for (i = 0; i < BOOSTGROUPS_COUNT; i++) {
		if (!allocated_group[i])
			break;

		if (allocated_group[i]->sched_boost_no_override)
			continue;

		allocated_group[i]->sched_boost_enabled = false;
	}
}
(2) Raise the min_task_util threshold
/* 1ms default for 20ms window size scaled to 1024 */
unsigned int sysctl_sched_min_task_util_for_boost = 51;		/* conservative */
/* 0.68ms default for 20ms window size scaled to 1024 */
unsigned int sysctl_sched_min_task_util_for_colocation = 35;	/* normal */
So far I have found this value used in only two places:
1. Picking out the tasks whose sched_boost_enabled is still true, and disabling boost for tasks with a small task_util.
The function below returns a task's boost policy, i.e. its placement boost. Its main logic: if the task's group has sched_boost_enabled set and sched_boost is non-zero, then:
if sched_boost = 1 (Full Throttle), the policy is SCHED_BOOST_ON_BIG;
if sched_boost = 2 (Conservative), task util is additionally compared against sched_task_filter_util: above the threshold, policy = SCHED_BOOST_ON_BIG; otherwise, policy = SCHED_BOOST_NONE.
As shown above, only the top-app and foreground groups have sched_boost_enabled set.
So the effect is to pick out the tasks in top-app and foreground with a relatively large task_util and keep boost enabled for them, while boost is disabled for all other tasks.
static inline enum sched_boost_policy task_boost_policy(struct task_struct *p)
{
	enum sched_boost_policy policy = task_sched_boost(p) ?
						sched_boost_policy() :
						SCHED_BOOST_NONE;

	if (policy == SCHED_BOOST_ON_BIG) {
		/*
		 * Filter out tasks less than min task util threshold
		 * under conservative boost.
		 */
		if (sched_boost() == CONSERVATIVE_BOOST &&
		    task_util(p) <= sched_task_filter_util)	/* the raised threshold */
			policy = SCHED_BOOST_NONE;
	}

	return policy;
}
2. When updating the WALT load.
The threshold governing the unfilter counter is raised to 51, i.e. 1ms of a 20ms window scaled to 1024 (the default is 35, i.e. 0.68ms). Once demand_scaled exceeds this threshold, a grace period of nr_windows is set; the counter is decremented by 1 on every update_history() call until it reaches 0.
static void update_history(struct rq *rq, struct task_struct *p,
			   u32 runtime, int samples, int event)
{
	...
	if (demand_scaled > sched_task_filter_util)
		p->unfilter = sysctl_sched_task_unfilter_nr_windows;
	else if (p->unfilter)
		p->unfilter = p->unfilter - 1;
	...
}
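The hysteresis this creates can be modeled with a standalone sketch; the threshold and window count below stand in for the sysctls, and the names are illustrative:

```c
/* Toy model of the unfilter window counter: when a task's scaled
 * demand exceeds the boost threshold it is marked "unfiltered" for a
 * number of windows; each later window update decrements the counter
 * until it hits zero. */
#define TOY_FILTER_UTIL	51	/* sysctl_sched_min_task_util_for_boost under Conservative */
#define TOY_NR_WINDOWS	10	/* stands in for sysctl_sched_task_unfilter_nr_windows */

struct toy_task {
	unsigned int unfilter;
};

static void toy_update_history(struct toy_task *p, unsigned int demand_scaled)
{
	if (demand_scaled > TOY_FILTER_UTIL)
		p->unfilter = TOY_NR_WINDOWS;	/* re-arm the grace period */
	else if (p->unfilter)
		p->unfilter--;			/* decay toward filtered again */
}
```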
p->unfilter is then consulted in the following two places:
The first is in fair.c, but since the mode here is CONSERVATIVE_BOOST, this simply returns false, so we will not follow this path further.
static inline bool task_skip_min_cpu(struct task_struct *p)
{
	return sched_boost() != CONSERVATIVE_BOOST &&
		get_rtg_status(p) && p->unfilter;
}
The other place is in walt.h: since p->unfilter is non-zero, it checks whether the current CPU is a little core, and if so returns true.
static inline bool walt_should_kick_upmigrate(struct task_struct *p, int cpu)
{
	struct related_thread_group *rtg = p->grp;

	if (is_suh_max() && rtg && rtg->id == DEFAULT_CGROUP_COLOC_ID &&
			    rtg->skip_min && p->unfilter)
		return is_min_capacity_cpu(cpu);

	return false;
}
This return value propagates into the function below, which then also returns false.
The system is thus told that the task's load does not fit the current CPU's capacity and the task needs to migrate.
The same function is also called during migration to rule out little cores as the target CPU.
In short: such tasks are migrated from little cores to big cores.
static inline bool task_fits_max(struct task_struct *p, int cpu)
{
	unsigned long capacity = capacity_orig_of(cpu);
	unsigned long max_capacity = cpu_rq(cpu)->rd->max_cpu_capacity.val;
	unsigned long task_boost = per_task_boost(p);

	if (capacity == max_capacity)
		return true;

	if (is_min_capacity_cpu(cpu)) {
		if (task_boost_policy(p) == SCHED_BOOST_ON_BIG ||
			task_boost > 0 ||
			schedtune_task_boost(p) > 0 ||
			walt_should_kick_upmigrate(p, cpu))	/* condition is true here */
			return false;
	} else { /* mid cap cpu */
		if (task_boost > TASK_BOOST_ON_MID)
			return false;
	}

	return task_fits_capacity(p, capacity, cpu);
}
Restrained
In this mode, judging from the code, only frequency aggregation is turned on. It works the same way as in full-throttle mode above, so it is not repeated here.
static void sched_restrained_boost_enter(void)
{
	walt_enable_frequency_aggregation(true);
}
Summary of the effects of each boost mode
Full Throttle:
1. Through core control, all CPUs are unisolated.
2. Through frequency aggregation, the computed load is inflated, which triggers frequency increases or migration.
3. With boost policy = SCHED_BOOST_ON_BIG, only big cores are considered when choosing a migration target CPU.
The net effect is that tasks run on the big cores as much as possible (unless restricted by cpuset).
Conservative:
1. Through the updated group boost settings, only the top-app and foreground groups get task-placement boost.
2. The min_task_util threshold is raised, making the up-migration condition stricter: only tasks with a relatively large load (> 1ms) are up-migrated.
3. Likewise, after the min_task_util threshold change, the system is told that such tasks are a misfit for their CPU and need migrating.
4. With boost policy = SCHED_BOOST_ON_BIG, only big cores are considered when choosing a migration target CPU.
Net effect: some top-app and foreground tasks migrate to the big cores.
Restrained:
1. Through frequency aggregation, the computed load is inflated, which triggers frequency increases or migration.
Even with the inflated load, normal EAS rules still apply; whether the frequency rises or a migration happens depends on the situation.
Note: for a brief description of the early detection part, see the CPU freq tuning section of my earlier article: https://www.cnblogs.com/lingjiajun/p/12317090.html. That said, I did not find where it actually takes effect in this code. Since I have not yet studied the EAS code in detail, there may be omissions or errors; feedback is welcome.