A Brief Analysis of Linux Kernel Scheduling

This article is a repost. 2015-10-26 15:50 linux_kernel

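1. Scheduler overview

Multitasking operating systems are either cooperative (non-preemptive) or preemptive. Like most modern operating systems, Linux uses preemptive multitasking. This means that how long a process occupies the CPU is decided by the operating system, specifically by its scheduler. The scheduler decides when to stop a running process so that other processes get a chance to run, and it picks which process runs next.

2. Scheduling policies

On Linux, the scheduling policy determines how and when the scheduler selects a new process to run. The policy depends on the type of process. The kernel currently defines the following policies: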

#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5

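0: SCHED_NORMAL, the default policy, used for normal processes.
1: SCHED_FIFO, first-in-first-out scheduling for real-time processes. Suited to processes that are time-critical but run only briefly each time.
2: SCHED_RR, round-robin (time-slice) scheduling for real-time processes. Suited to processes that run for longer each time.
3: SCHED_BATCH, scheduling for batch processes. Suited to non-interactive, CPU-intensive processes.
SCHED_ISO: a value reserved by the kernel and not yet implemented.
5: SCHED_IDLE, for low-priority background processes.
Note: each process's scheduling policy is stored in the policy field of its process descriptor, task_struct.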

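3. Mechanisms in the scheduler

The kernel introduces the scheduling class (struct sched_class) to describe what a scheduler must be able to do. Each group of related policies is served by one instance of this structure: fair_sched_class implements fair scheduling for normal processes, and rt_sched_class implements real-time scheduling. Each instance is the concrete implementation of its policies, so the scheduling class encapsulates the policy-specific details and hides them from the rest of the kernel.
The core scheduler function schedule() only calls the interfaces of the scheduling class to carry out scheduling and does not need to know how a particular policy is implemented. The scheduling class is the link between the scheduling function and the concrete policies.

Wu Te (武特), a senior classmate, summed up sched_class and sched_entity neatly:

A scheduling class represents a scheduling policy, and a scheduling entity is the unit being scheduled. The entity is usually a process, but since cgroups were introduced it may be a group rather than a single process.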

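4. The schedule() function

Linux supports two types of process scheduling: real-time processes and normal processes. Real-time processes use the SCHED_FIFO and SCHED_RR policies; normal processes use SCHED_NORMAL. The main steps of schedule() are:
preempt_disable(): disables kernel preemption.
cpu_rq(): gets the run queue of the current CPU.
prev = rq->curr; gets the descriptor of the current process, prev.
switch_count = &prev->nivcsw; points at the current process's context-switch counter (the involuntary one by default).
update_rq_clock(): updates the clock of the run queue.
clear_tsk_need_resched(): clears the reschedule flag of the current process prev.
deactivate_task(): removes the current process from the run queue if it is no longer runnable.
put_prev_task(): puts the current process back into the run queue through its scheduling class.
pick_next_task(): picks the next process to run from the run queue.
context_switch(): switches from prev to next. The actual switching code is architecture-specific and is implemented in assembly in switch_to().
post_schedule(): performs post-processing after the switch.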

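5. The pick_next_task function

Choosing the next process to run is clearly a key step, so let us look at how the kernel implements it.
Some notes on the code below:
1. When the number of runnable tasks on rq (nr_running) equals cfs.nr_running, every runnable task is a normal process, so the CFS pick function is called directly (fair_sched_class.pick_next_task, which is pick_next_task_fair). When they differ, the search starts from sched_class_highest, a macro that points to the real-time class. Inside the for(;;) loop the real-time class is asked first (p = class->pick_next_task(rq);). If it returns nothing, class = class->next; moves on to the next scheduling class. The class list has three members, linked in the order rt, fair, idle.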

static inline struct task_struct *
pick_next_task(struct rq *rq)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: we know that if all tasks are in
	 * the fair class we can call that function directly:
	 */
//fast path: all runnable tasks are normal processes under fair scheduling
	if (likely(rq->nr_running == rq->cfs.nr_running)) {
		p = fair_sched_class.pick_next_task(rq);
		if (likely(p))
			return p;
	}
//otherwise walk the classes, starting with the real-time class
	class = sched_class_highest;
	for ( ; ; ) {
		p = class->pick_next_task(rq);  //ask the current class (rt first)
		if (p)
			return p;
		/*
		 * Will never be NULL as the idle class always
		 * returns a non-NULL p:
		 */
		class = class->next;  //rt->next = fair;  fair->next = idle
	}
}

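This code reflects the two kinds of processes Linux supports, real-time and normal. To recap: real-time processes use SCHED_FIFO or SCHED_RR, and normal processes use SCHED_NORMAL.
First, a word on struct rq. It is the main data structure the scheduler uses to manage runnable processes, and every CPU has its own run queue. As pick_next_task shows, choosing the next process comes down to either normal-process scheduling or real-time scheduling on struct rq. How is each done? For real-time scheduling, to get an O(1) algorithm the kernel keeps one run queue per priority plus a bitmap declared with DECLARE_BITMAP. The kernel uses the bits of that bitmap to find the index of the highest-priority non-empty queue and then takes a process from that queue to run.
Here is the kernel's definition: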

struct rt_prio_array {
	DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
	struct list_head queue[MAX_RT_PRIO];
};

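queue[i] holds the list head of the queue of processes with priority i. struct rt_prio_array also contains an important member declared with DECLARE_BITMAP, which the kernel defines as follows:

#define DECLARE_BITMAP(name,bits) \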
	unsigned long name[BITS_TO_LONGS(bits)]

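5.1 The O(1) algorithm for real-time processes

The bitmap serves as an index into the process queues queue[MAX_RT_PRIO]. Each bit corresponds to one queue[i]: the bit is 1 when queue[i] is non-empty and 0 otherwise. Finding the highest priority present in the run queue then only requires a bit-search instruction that scans from the highest priority toward the lowest for the first set bit (this is what sched_find_first_bit() does). queue[index].next is then the candidate process.
[Figure placeholder: two diagrams of the per-priority task queues and the bitmap]

Note: tasks on each queue are generally scheduled first-in-first-out (and under SCHED_RR each process is also given a time slice).
The kernel implements this as: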

static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
						   struct rt_rq *rt_rq)
{
	struct rt_prio_array *array = &rt_rq->active;
	struct sched_rt_entity *next = NULL;
	struct list_head *queue;
	int idx;

	idx = sched_find_first_bit(array->bitmap); //find the first set bit, i.e. the highest priority
	BUG_ON(idx >= MAX_RT_PRIO);

	queue = array->queue + idx; //address of the list head for that priority
	next = list_entry(queue->next, struct sched_rt_entity, run_list);  //take the first entity in the list (FIFO order)

	return next;
}

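[Figure: tasks queued at the same priority] The sentence introducing this figure is cut off in the source: "When many tasks share the same priority, the kernel ..."
[Figure: bitmap] The bit for each non-empty queue is set to 1, and each lookup takes the first set bit, which marks the highest priority present.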

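5.2 The CFS algorithm for normal processes:

For normal processes, the next process to schedule is chosen by pick_next_task_fair, which works in units of scheduling entities. Its key helper is pick_next_entity, which calls wakeup_preempt_entity. That function takes the virtual runtimes of two entities and a weight-adjusted granularity and decides whether one should preempt the other. Here is the kernel code: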

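static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;//difference between the two virtual runtimes
//if se's vruntime is not smaller than curr's, curr should keep running and no preemption is needed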
	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(curr, se);
	if (vdiff > gran)
		return 1;

	return 0;
}

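gran is the minimum lead required for preemption: se preempts only when the virtual-runtime difference exceeds gran, which avoids preempting too often. wakeup_gran computes it:

static unsigned long
wakeup_gran(struct sched_entity *curr, struct sched_entity *se)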
{
	unsigned long gran = sysctl_sched_wakeup_granularity;

	if (cfs_rq_of(curr)->curr && sched_feat(ADAPTIVE_GRAN))
		gran = adaptive_gran(curr, se);

	/*
	 * Since its curr running now, convert the gran from real-time
	 * to virtual-time in his units.
	 */
	if (sched_feat(ASYM_GRAN)) {
		/*
		 * By using 'se' instead of 'curr' we penalize light tasks, so
		 * they get preempted easier. That is, if 'se' < 'curr' then
		 * the resulting gran will be larger, therefore penalizing the
		 * lighter, if otoh 'se' > 'curr' then the resulting gran will
		 * be smaller, again penalizing the lighter task.
		 *
		 * This is especially important for buddies when the leftmost
		 * task is higher priority than the buddy.
		 */
		if (unlikely(se->load.weight != NICE_0_LOAD))
			gran = calc_delta_fair(gran, se);
	} else {
		if (unlikely(curr->load.weight != NICE_0_LOAD))
			gran = calc_delta_fair(gran, curr);
	}

	return gran;
}

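6. Nice values in scheduling

First, be clear that a nice value is not the same thing as a process priority, although a process's nice value does affect its priority.

The command ps -el shows each process's nice value in the NI column. PRI is the process priority, which is just an integer and is the basis on which the scheduler chooses what to run.
Normal processes have a static priority and a dynamic priority.
Static priority: so called because it does not change over time. The kernel does not modify it; it can only be changed through the nice system call. It is stored in static_prio in the process descriptor. In kernel/sched.c, nice and static priority are related as follows: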

#define NICE_TO_PRIO(nice)	(MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio)	((prio) - MAX_RT_PRIO - 20)

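Dynamic priority: the scheduler raises or lowers a process's static priority value to reward I/O-bound processes or penalize CPU-bound ones. The adjusted value is the dynamic priority, stored in prio in the process descriptor. When people say "priority" they usually mean the dynamic priority.
From the above, the nice system call can be used to change a process's priority. The test program below calls nice() and then reports how the process was scheduled: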

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <unistd.h>
#include <sys/time.h>

#define JMAX (400*100000)

#define GET_ELAPSED_TIME(tv1,tv2) ( \
  (double)( (tv2.tv_sec - tv1.tv_sec) \
            + .000001 * (tv2.tv_usec - tv1.tv_usec)))
//a CPU-bound busy loop; returns how long it took
double do_something (void)
{
    int j;
    double x = 0.0;
    struct timeval tv1, tv2;
    gettimeofday (&tv1, NULL);//record the start time
    for (j = 0; j < JMAX; j++)
        x += 1.0 / (exp ((1 + x * x) / (2 + x * x)));
    gettimeofday (&tv2, NULL);
    return GET_ELAPSED_TIME (tv1, tv2);//elapsed time in seconds
}

int main (int argc, char *argv[])
{
    int niceval = 0, nsched;
    /* for kernels less than 2.6.21, this is HZ
       for tickless kernels this must be the MHZ rate
       e.g, for 2.6 GZ scale = 2600000000 */
    long scale = 1000;

    long ticks_cpu, ticks_sleep;
    pid_t pid;
    FILE *fp;
    char fname[256];
    double elapsed_time, timeslice, t_cpu, t_sleep;

    if (argc > 1)
        niceval = atoi (argv[1]);
    pid = getpid ();

    if (argc > 2)
        scale = atoi (argv[2]);

    /* give a chance for other tasks to queue up */
    sleep (3);

    sprintf (fname, "/proc/%d/schedstat", pid);//file holding this process's scheduling statistics
	/*
		What do the numbers in schedstat mean? See the notes after the program.
	*/
    /*    printf ("Fname = %s\n", fname); */

    if (!(fp = fopen (fname, "r"))) {
        printf ("Failed to open stat file\n");
        exit (-1);
    }
	//the nice system call
    if (nice (niceval) == -1 && niceval != -1) {
        printf ("Failed to set nice to %d\n", niceval);
        exit (-1);
    }
    elapsed_time = do_something ();//how long the for loop took

    fscanf (fp, "%ld %ld %d", &ticks_cpu, &ticks_sleep, &nsched);//nsched is the number of times scheduled
    t_cpu = (float)ticks_cpu / scale;//ticks divided by scale gives time
    t_sleep = (float)ticks_sleep / scale;
    timeslice = t_cpu / (double)nsched;//CPU time per scheduling, i.e. the average time slice
    printf ("\nnice=%3d time=%8g secs pid=%5d"
            "  t_cpu=%8g  t_sleep=%8g  nsched=%5d"
            "  avg timeslice = %8g\n",
            niceval, elapsed_time, pid, t_cpu, t_sleep, nsched, timeslice);
    fclose (fp);

    exit (0);
}

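Notes: /proc/[pid]/schedstat holds three values:

First: the CPU time the process has used.

Second: the time it spent waiting on a run queue (what the program calls sleep time).

Third: the number of times it was scheduled.

According to the results, the smaller the nice value, the shorter the waiting time, which shows that the priority has gone up.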

7. System calls for getting and setting the scheduling policy and priority: sched_getscheduler() and sched_setscheduler()

#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>

#define DEATH(mess) { perror(mess); exit(errno); }

void printpolicy (int policy)
{

    /* SCHED_NORMAL = SCHED_OTHER in user-space */

    if (policy == SCHED_OTHER)
        printf ("policy = SCHED_OTHER = %d\n", policy);
    if (policy == SCHED_FIFO)
        printf ("policy = SCHED_FIFO = %d\n", policy);
    if (policy == SCHED_RR)
        printf ("policy = SCHED_RR = %d\n", policy);
}

int main (int argc, char **argv)
{
    int policy;
    struct sched_param p;

    /* obtain current scheduling policy for this process */
	//get the scheduling policy of this process
    policy = sched_getscheduler (0);
    printpolicy (policy);

    /* reset scheduling policy */

    printf ("\nTrying sched_setscheduler...\n");
    policy = SCHED_FIFO;
    printpolicy (policy);
    p.sched_priority = 50;
	//apply the SCHED_FIFO policy with priority 50
    if (sched_setscheduler (0, policy, &p))
        DEATH ("sched_setscheduler:");
    printf ("p.sched_priority = %d\n", p.sched_priority);
    exit (0);
}

Output:

[root@wang schedule]# ./get_schedule_policy 
policy = SCHED_OTHER = 0

Trying sched_setscheduler...
policy = SCHED_FIFO = 1
p.sched_priority = 50

The call succeeded, so the process's scheduling policy and priority have been changed.
