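Everyone knows what a timer is, right? I'd argue it is one of the most central features of Linux. For example, when a thread calls sleep(5000) and is woken up to continue 5 s later, how does the CPU know the 5 s are up? Likewise, a reverse proxy such as nginx has to check at regular intervals whether each client is still there; if a client has gone offline, there is no point in spending resources on maintaining its connection. So how is the timing mechanism behind such fixed-interval heartbeat checks implemented?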
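1. (1) The most central time-related variable in Linux is jiffies. include/linux/raid/pq.h contains the definition below. Note that this is the fallback branch used when the RAID-6 code is built as a userspace test program; inside the kernel proper, jiffies is a global counter declared in include/linux/jiffies.h that advances by one on every timer tick.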
    # define jiffies raid6_jiffies()
    #define HZ 1000

    /* Returns the current time in milliseconds, i.e. a timestamp */
    static inline uint32_t raid6_jiffies(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec*1000       /* tv_sec is in seconds; x1000 gives ms */
            + tv.tv_usec/1000;      /* tv_usec is in microseconds; /1000 gives ms */
    }
This snippet carries a lot of information. The two key points are:
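- There is a macro HZ with the value 1000. Since it sits next to time-related code, it is tempting to guess that it is about milliseconds, because 1 s = 1000 ms. The formal definition: HZ is the number of timer interrupts generated per second. Here that is 1000 interrupts per second, i.e. one every 1 ms. (In a real kernel build HZ comes from CONFIG_HZ; common values are 100, 250, 300 and 1000.)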
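- gettimeofday stores the current timestamp in a timeval structure whose two fields are in seconds and microseconds; raid6_jiffies then converts both to milliseconds and returns the sum. So in this variant jiffies is simply a millisecond timestamp. In the kernel itself it is a tick counter, which at HZ=1000 also advances once per millisecond.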
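To make the relationship between HZ, ticks and milliseconds concrete, here is a small userspace sketch. `SKETCH_HZ`, `ms_to_ticks` and `ticks_to_ms` are hypothetical names invented for this illustration (the kernel's own helpers are `msecs_to_jiffies()` and `jiffies_to_msecs()`), and the tick rate of 250 is an assumption chosen to show rounding.

```c
#include <stdint.h>

/* Hypothetical tick rate for this sketch; in the kernel, HZ comes from CONFIG_HZ. */
#define SKETCH_HZ 250u

/* Convert milliseconds to ticks, rounding up so a timer never fires early. */
static uint64_t ms_to_ticks(uint64_t ms)
{
    return (ms * SKETCH_HZ + 999u) / 1000u;
}

/* Convert ticks back to milliseconds (truncating). */
static uint64_t ticks_to_ms(uint64_t ticks)
{
    return ticks * 1000u / SKETCH_HZ;
}
```

At 250 ticks per second one tick lasts 4 ms, so a 5 s delay is 1250 ticks, and a 1 ms request is rounded up to one full tick.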
Digging further into gettimeofday, the current time is ultimately obtained through the system-call path, and (on x86) the core data structure is vsyscall_gtod_data.
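On a 32-bit system jiffies can hold at most 2^32 - 1; beyond that it overflows and wraps around to 0 (at HZ=1000 this happens roughly every 49.7 days). To handle wraparound, the kernel provides dedicated comparison macros: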
    #define time_after(a,b)         \
        (typecheck(unsigned long, a) && \
         typecheck(unsigned long, b) && \
         ((long)(b) - (long)(a) < 0))
    #define time_before(a,b)    time_after(b,a)

    #define time_after_eq(a,b)      \
        (typecheck(unsigned long, a) && \
         typecheck(unsigned long, b) && \
         ((long)(a) - (long)(b) >= 0))
    #define time_before_eq(a,b) time_after_eq(b,a)
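Why a signed difference survives wraparound is easier to see in a userspace sketch. `tick_after` and `tick_before` below are hypothetical stand-ins for the kernel macros, fixed at 32 bits; unlike the kernel version, the subtraction is done on unsigned values before the cast, which avoids signed overflow in portable C. The result is only meaningful while the two values are less than 2^31 ticks apart.

```c
#include <stdint.h>

/* Returns nonzero if tick value a is later than b, even across a wrap. */
static int tick_after(uint32_t a, uint32_t b)
{
    /* Unsigned subtraction wraps modulo 2^32; reinterpreting the result as
     * a signed 32-bit value tells us which of the two is ahead. */
    return (int32_t)(b - a) < 0;
}

/* Returns nonzero if tick value a is earlier than b. */
static int tick_before(uint32_t a, uint32_t b)
{
    return tick_after(b, a);
}
```

Suppose a timeout is computed as 0xFFFFFFF0 + 0x20, which wraps to 0x10. At tick 0xFFFFFFF8 a naive `now > timeout` comparison already reports the timeout as passed, whereas `tick_after(now, timeout)` correctly reports that it has not.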
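(2) A timer is, in essence, doing a specific job at a specific time. Think of an office-worker programmer who gets up at 7:20, catches the shuttle at 7:50, and opens the laptop at 9:00 to start grinding. How does the Linux kernel tie a specific time to a specific action? In C++ you could write a class whose member variable is the time and whose member function is the action. The kernel is written in C, and a struct does the same job. All of this is implemented by struct timer_list: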
    struct timer_list {
        /*
         * All fields that change during normal runtime grouped to the
         * same cacheline
         */
        struct hlist_node   entry;      /* list node that links timer_list entries together */
        unsigned long       expires;    /* expiry time, typically jiffies + 5*HZ:
                                           run the callback 5 seconds from now */
        void                (*function)(unsigned long); /* callback run at expiry */
        unsigned long       data;       /* argument passed to the callback */
        u32                 flags;

    #ifdef CONFIG_TIMER_STATS
        int                 start_pid;
        void                *start_site;
        char                start_comm[16];
    #endif
    #ifdef CONFIG_LOCKDEP
        struct lockdep_map  lockdep_map;
    #endif
    };
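With the struct defined, how is it used? Here is a demo that gives a feel for the timer API. Note that it uses the newer timer_setup() interface (kernel 4.15 and later), in which the callback receives a pointer to the timer itself rather than an unsigned long data argument: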
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/timer.h>
    #include <linux/jiffies.h>

    static void time_pre(struct timer_list *timer);
    static struct timer_list mytimer;
    /* Alternative: DEFINE_TIMER(mytimer, time_pre); */

    static void time_pre(struct timer_list *timer)
    {
        printk("%s\n", __func__);
        /* 2.2 For a periodic task, re-arm the timer from its own callback:
         * fire again in 500 ms. mod_timer() updates timer->expires itself. */
        mod_timer(timer, jiffies + 500 * HZ / 1000);
    }

    /* Module entry point */
    static int __init chr_init(void)
    {
        timer_setup(&mytimer, time_pre, 0);             /* 1. initialize */
        mytimer.expires = jiffies + 500 * HZ / 1000;    /* first expiry in 0.5 s */
        add_timer(&mytimer);                            /* 2.1 register the timer with the kernel */
        printk("init success\n");
        return 0;
    }

    static void __exit chr_exit(void)
    {
        /* 3. Remove the timer. The callback re-arms itself, so a
         * timer_pending() check followed by del_timer() is racy; the _sync
         * variant also waits for a running callback to finish. */
        del_timer_sync(&mytimer);
        printk("exit Success \n");
    }

    module_init(chr_init);
    module_exit(chr_exit);
    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("XXX");
    MODULE_DESCRIPTION("a simple char device example");
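The general idea: create a timer_list and initialize it, register it with add_timer, and once the expires time arrives the callback is invoked. Simple enough. The core functions here are add_timer and mod_timer, and both end up calling __mod_timer. Judging from the source, these timers are organized in lists (queues). So where is the fabled red-black tree?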
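The flow just described (initialize, register, run the callback at expiry, optionally re-arm from inside the callback) can be sketched in userspace. This is not how the kernel stores timers; `sketch_timer`, `sketch_add_timer`, `sketch_run_timers` and `periodic_cb` are hypothetical names, the pending timers sit in a plain unsorted list, and "time" is just a tick number passed in by the caller.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_timer {
    uint64_t expires;                         /* absolute expiry, in ticks */
    void (*function)(struct sketch_timer *);  /* callback run at expiry */
    struct sketch_timer *next;                /* link in the pending list */
    int fired;                                /* how many times the callback ran */
};

static struct sketch_timer *pending;          /* unsorted list of armed timers */

/* Arm a timer: conceptually what add_timer() does. */
static void sketch_add_timer(struct sketch_timer *t)
{
    t->next = pending;
    pending = t;
}

/* Called once per tick: detach every expired timer, then run its callback.
 * Detaching first lets a callback safely re-arm its own timer. */
static void sketch_run_timers(uint64_t now)
{
    struct sketch_timer **pp = &pending;
    struct sketch_timer *expired = NULL;

    while (*pp) {
        struct sketch_timer *t = *pp;
        if (t->expires <= now) {
            *pp = t->next;        /* unlink from the pending list */
            t->next = expired;    /* collect on a private list */
            expired = t;
        } else {
            pp = &t->next;
        }
    }
    while (expired) {
        struct sketch_timer *t = expired;
        expired = t->next;
        t->next = NULL;
        t->function(t);
    }
}

/* Periodic callback: count the expiry and re-arm 5 ticks later. */
static void periodic_cb(struct sketch_timer *t)
{
    t->fired++;
    t->expires += 5;
    sketch_add_timer(t);
}
```

Scanning the whole list on every tick costs O(N) per tick, which is exactly the cost the timer wheel described next is designed to avoid.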
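All of the timers above depend on HZ. They are called low-resolution timers, and the name already says that their precision is limited. Low-resolution timers are managed with the (classic, cascading) timer wheel. Under this scheme the timers are not kept in one single list. For efficient access with minimal CPU cost, Linux uses five arrays of lists (the idea resembles the O(1) scheduler, which splits one queue into several by priority), and the five arrays carry over into one another like the second, minute and hour hands of a watch. tv1 holds timers expiring between timer_jiffies and timer_jiffies+255, i.e. offsets of 0 to 255 ticks, one slot per tick. Several timers may expire on the same tick, so each slot is a linked list of the timers that share that expiry, and they are processed together when the tick arrives. tv2 has 64 slots, each covering 256 ticks, so tv2 covers offsets 256 to 256*64-1 (2^14-1). The same pattern continues for tv3, tv4 and tv5. The ranges are: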
| Array | idx range |
|---|---|
| tv1 | 0 to 2^8-1 |
| tv2 | 2^8 to 2^14-1 |
| tv3 | 2^14 to 2^20-1 |
| tv4 | 2^20 to 2^26-1 |
| tv5 | 2^26 to 2^32-1 |
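The table can be turned into a small slot calculation. The sketch below is a userspace illustration of how a timer's array and slot follow from its expiry; `wheel_pos` and `wheel_locate` are hypothetical names, the bit widths (8 bits for tv1, 6 bits for each of tv2 to tv5) are taken from the ranges in the table, and timers that have already expired are not handled.

```c
#include <stdint.h>

#define TVR_BITS 8                        /* tv1: 2^8 = 256 slots */
#define TVN_BITS 6                        /* tv2..tv5: 2^6 = 64 slots each */
#define TVR_MASK ((1u << TVR_BITS) - 1)
#define TVN_MASK ((1u << TVN_BITS) - 1)

struct wheel_pos {
    int level;        /* 1..5, meaning tv1..tv5 */
    unsigned slot;    /* index inside that array */
};

/* Pick the array from the distance to expiry, and the slot from the bits of
 * the absolute expiry value that belong to that array. */
static struct wheel_pos wheel_locate(uint32_t expires, uint32_t timer_jiffies)
{
    uint32_t idx = expires - timer_jiffies;   /* ticks until expiry */
    struct wheel_pos pos;

    if (idx < (1u << TVR_BITS)) {
        pos.level = 1;
        pos.slot = expires & TVR_MASK;
    } else if (idx < (1u << (TVR_BITS + TVN_BITS))) {
        pos.level = 2;
        pos.slot = (expires >> TVR_BITS) & TVN_MASK;
    } else if (idx < (1u << (TVR_BITS + 2 * TVN_BITS))) {
        pos.level = 3;
        pos.slot = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
    } else if (idx < (1u << (TVR_BITS + 3 * TVN_BITS))) {
        pos.level = 4;
        pos.slot = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
    } else {
        pos.level = 5;
        pos.slot = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
    }
    return pos;
}
```

For example, with timer_jiffies at 0, a timer due in 100 ticks lands in tv1 slot 100, while one due in 300 ticks lands in tv2 slot 1 and is moved down into tv1 only when tv1 wraps around. That move is the "carry" between arrays.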
The whole timer wheel is illustrated below:
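2. (1) Low-resolution timers are no longer adequate for some demanding scenarios (for example the watchdog, USB, Ethernet, block device and KVM subsystems). To improve precision while staying compatible with older kernels, a new timer had to be designed, and it was named hrtimer. Several structures go together with hrtimer; to give an intuitive view of how they relate, they are shown in the figure below: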
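(2) struct hrtimer resembles timer_list: both have an expiry field and a callback field. The relationships among the structures are fairly intricate, and some are even nested in one another. How does the kernel use such complicated structures?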
Whatever the kind of timer, an instance has to be created first, recording mainly the expiry time and the callback. So the timer is first initialized with __hrtimer_init:
    static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
                               enum hrtimer_mode mode)
    {
        struct hrtimer_cpu_base *cpu_base;
        int base;

        memset(timer, 0, sizeof(struct hrtimer));

        /* Initialize the hrtimer's base field */
        cpu_base = raw_cpu_ptr(&hrtimer_bases);

        if (clock_id == CLOCK_REALTIME && mode != HRTIMER_MODE_ABS)
            clock_id = CLOCK_MONOTONIC;

        base = hrtimer_clockid_to_base(clock_id);
        timer->base = &cpu_base->clock_base[base];
        /* Initialize the red-black tree node */
        timerqueue_init(&timer->node);

    #ifdef CONFIG_TIMER_STATS
        timer->start_site = NULL;
        timer->start_pid = -1;
        memset(timer->start_comm, 0, TASK_COMM_LEN);
    #endif
    }
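Its core job is to set the base field and then initialize the red-black tree node. The node is subsequently inserted into the red-black tree, so that timers can later be added, removed, modified and looked up quickly. The function that inserts into the tree is hrtimer_start_range_ns: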
    /**
     * hrtimer_start_range_ns - (re)start an hrtimer on the current CPU
     * @timer:      the timer to be added
     * @tim:        expiry time
     * @delta_ns:   "slack" range for the timer
     * @mode:       expiry mode: absolute (HRTIMER_MODE_ABS) or
     *              relative (HRTIMER_MODE_REL)
     */
    void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
                                u64 delta_ns, const enum hrtimer_mode mode)
    {
        struct hrtimer_clock_base *base, *new_base;
        unsigned long flags;
        int leftmost;

        base = lock_hrtimer_base(timer, &flags);

        /* Remove an active timer from the queue:
         * this ends up calling timerqueue_del to remove it from the red-black tree */
        remove_hrtimer(timer, base, true);

        /* A relative time has to be added to the current time,
         * because absolute time is used internally */
        if (mode & HRTIMER_MODE_REL)
            tim = ktime_add_safe(tim, base->get_time());

        tim = hrtimer_update_lowres(timer, tim, mode);

        /* Set the expiry time range */
        hrtimer_set_expires_range_ns(timer, tim, delta_ns);

        /* Switch the timer base, if necessary: */
        new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);

        timer_stats_hrtimer_set_start_info(timer);

        /* Insert the hrtimer, ordered by expiry time, into the red-black tree
         * of its clock base. Returns true if this timer is now the earliest
         * to expire. This ends up calling timerqueue_add. */
        leftmost = enqueue_hrtimer(timer, new_base);
        if (!leftmost)
            goto unlock;

        if (!hrtimer_is_hres_active(timer)) {
            /*
             * Kick to reschedule the next tick to handle the new timer
             * on dynticks target.
             */
            if (new_base->cpu_base->nohz_active)
                wake_up_nohz_cpu(new_base->cpu_base->cpu);
        } else {
            hrtimer_reprogram(timer, new_base);
        }
    unlock:
        unlock_hrtimer_base(timer, &flags);
    }
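(3) Once the red-black tree of timers is built, how is it used? Anything related to time leads to one unavoidable mechanism: the timer interrupt. A dedicated piece of hardware on the mainboard sends a pulse to the CPU at fixed intervals, acting as the computer's "pulse". The CPU can react to that signal, and this mechanism is the timer interrupt. Its best-known use is process switching. Beyond that, it has another very important role: triggering and managing timers. Looking back at the structures and the usage flow above, several processes may need a timer to fire at the same moment, as the figure shows: at second 1, processes A and B both have timers due; at second 7, processes A, E and F do as well. How does the operating system know which timers to fire, and when?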
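This is where the red-black tree shows its value. Each time the timer interrupt fires, besides any necessary process/thread switching, the kernel checks the leftmost node of the tree, which is the timer with the earliest expiry. If it has not expired yet, nothing is run and the check waits for the next interrupt. If it has expired, the node is removed and its callback is executed, and the same check is applied to the new leftmost node until no expired timer remains. Afterwards the clock event device is programmed for the next pending expiry via tick_program_event. All of this happens in hrtimer_interrupt: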
    /*
     * High resolution timer interrupt
     * Called with interrupts disabled
     */
    void hrtimer_interrupt(struct clock_event_device *dev)
    {
        struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
        ktime_t expires_next, now, entry_time, delta;
        int retries = 0;

        BUG_ON(!cpu_base->hres_active);
        cpu_base->nr_events++;
        dev->next_event.tv64 = KTIME_MAX;

        raw_spin_lock(&cpu_base->lock);
        entry_time = now = hrtimer_update_base(cpu_base);
    retry:
        cpu_base->in_hrtirq = 1;
        /*
         * We set expires_next to KTIME_MAX here with cpu_base->lock
         * held to prevent that a timer is enqueued in our queue via
         * the migration code. This does not affect enqueueing of
         * timers which run their callback and need to be requeued on
         * this CPU.
         */
        cpu_base->expires_next.tv64 = KTIME_MAX;
        /* Look at the leftmost node of the red-black tree; if it has expired,
         * run its callback and remove the node */
        __hrtimer_run_queues(cpu_base, now);

        /* Reevaluate the clock bases for the next expiry
         * (find the next timer due to expire) */
        expires_next = __hrtimer_get_next_event(cpu_base);
        /*
         * Store the new expiry value so the migration code can verify
         * against it.
         */
        cpu_base->expires_next = expires_next;
        cpu_base->in_hrtirq = 0;
        raw_spin_unlock(&cpu_base->lock);

        /* Reprogramming necessary ? */
        if (!tick_program_event(expires_next, 0)) {
            cpu_base->hang_detected = 0;
            return;
        }

        /*
         * The next timer was already expired due to:
         * - tracing
         * - long lasting callbacks
         * - being scheduled away when running in a VM
         *
         * We need to prevent that we loop forever in the hrtimer
         * interrupt routine. We give it 3 attempts to avoid
         * overreacting on some spurious event.
         *
         * Acquire base lock for updating the offsets and retrieving
         * the current time.
         */
        raw_spin_lock(&cpu_base->lock);
        now = hrtimer_update_base(cpu_base);
        cpu_base->nr_retries++;
        if (++retries < 3)
            goto retry;
        /*
         * Give the system a chance to do something else than looping
         * here. We stored the entry time, so we know exactly how long
         * we spent here. We schedule the next event this amount of
         * time away.
         */
        cpu_base->nr_hangs++;
        cpu_base->hang_detected = 1;
        raw_spin_unlock(&cpu_base->lock);
        delta = ktime_sub(now, entry_time);
        if ((unsigned int)delta.tv64 > cpu_base->max_hang_time)
            cpu_base->max_hang_time = (unsigned int) delta.tv64;
        /*
         * Limit it to a sensible value as we enforce a longer
         * delay. Give the CPU at least 100ms to catch up.
         */
        if (delta.tv64 > 100 * NSEC_PER_MSEC)
            expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
        else
            expires_next = ktime_add(now, delta);
        tick_program_event(expires_next, 1);
        printk_once(KERN_WARNING "hrtimer: interrupt took %llu ns\n",
                    ktime_to_ns(delta));
    }
The most important pieces are __hrtimer_run_queues and the __run_hrtimer function it calls. The code follows, with extra comments on the key parts:
    /*
     * The write_seqcount_barrier()s in __run_hrtimer() split the thing into 3
     * distinct sections:
     *
     *  - queued:   the timer is queued
     *  - callback: the timer is being ran
     *  - post:     the timer is inactive or (re)queued
     *
     * On the read side we ensure we observe timer->state and cpu_base->running
     * from the same section, if anything changed while we looked at it, we retry.
     * This includes timer->base changing because sequence numbers alone are
     * insufficient for that.
     *
     * The sequence numbers are required because otherwise we could still observe
     * a false negative if the read side got smeared over multiple consequtive
     * __run_hrtimer() invocations.
     */
    static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
                              struct hrtimer_clock_base *base,
                              struct hrtimer *timer, ktime_t *now)
    {
        enum hrtimer_restart (*fn)(struct hrtimer *);
        int restart;

        lockdep_assert_held(&cpu_base->lock);

        debug_deactivate(timer);
        cpu_base->running = timer;

        /*
         * Separate the ->running assignment from the ->state assignment.
         *
         * As with a regular write barrier, this ensures the read side in
         * hrtimer_active() cannot observe cpu_base->running == NULL &&
         * timer->state == INACTIVE.
         */
        raw_write_seqcount_barrier(&cpu_base->seq);
        /* Calls timerqueue_del to remove the node from the red-black tree */
        __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
        timer_stats_account_hrtimer(timer);
        /* The all-important callback */
        fn = timer->function;

        /*
         * Clear the 'is relative' flag for the TIME_LOW_RES case. If the
         * timer is restarted with a period then it becomes an absolute
         * timer. If its not restarted it does not matter.
         */
        if (IS_ENABLED(CONFIG_TIME_LOW_RES))
            timer->is_rel = false;

        /*
         * Because we run timers from hardirq context, there is no chance
         * they get migrated to another cpu, therefore its safe to unlock
         * the timer base.
         * (The timer is triggered by the hardware timer interrupt, so the
         * callback is guaranteed to run on the current CPU.)
         */
        raw_spin_unlock(&cpu_base->lock);
        trace_hrtimer_expire_entry(timer, now);
        /* Finally, the callback is executed */
        restart = fn(timer);
        trace_hrtimer_expire_exit(timer);
        raw_spin_lock(&cpu_base->lock);

        /*
         * Note: We clear the running state after enqueue_hrtimer and
         * we do not reprogram the event hardware. Happens either in
         * hrtimer_start_range_ns() or in hrtimer_interrupt()
         *
         * Note: Because we dropped the cpu_base->lock above,
         * hrtimer_start_range_ns() can have popped in and enqueued the timer
         * for us already.
         */
        if (restart != HRTIMER_NORESTART &&
            !(timer->state & HRTIMER_STATE_ENQUEUED))
            enqueue_hrtimer(timer, base);   /* calls timerqueue_add to put the timer back into the tree */

        /*
         * Separate the ->running assignment from the ->state assignment.
         *
         * As with a regular write barrier, this ensures the read side in
         * hrtimer_active() cannot observe cpu_base->running == NULL &&
         * timer->state == INACTIVE.
         */
        raw_write_seqcount_barrier(&cpu_base->seq);

        WARN_ON_ONCE(cpu_base->running != timer);
        cpu_base->running = NULL;
    }

    static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now)
    {
        struct hrtimer_clock_base *base = cpu_base->clock_base;
        unsigned int active = cpu_base->active_bases;

        /* Walk the clock bases. For each hrtimer_clock_base, look at the
         * leftmost node of its red-black tree and check whether it has
         * expired. If so, __run_hrtimer handles the expired timer: it removes
         * the timer from the tree, calls its callback, and restarts the timer
         * or not depending on the callback's return value. */
        for (; active; base++, active >>= 1) {
            struct timerqueue_node *node;
            ktime_t basenow;

            if (!(active & 0x01))
                continue;

            basenow = ktime_add(now, base->offset);
            /* timerqueue_getnext returns the leftmost node of the tree. It
             * can be used in this while loop because, when __run_hrtimer
             * removes the old leftmost node, the new leftmost node is stored
             * in base->active->next, so the loop keeps going until no expired
             * timer is left. */
            while ((node = timerqueue_getnext(&base->active))) {
                struct hrtimer *timer;

                timer = container_of(node, struct hrtimer, node);

                /*
                 * The immediate goal for using the softexpires is
                 * minimizing wakeups, not running timers at the
                 * earliest interrupt after their soft expiration.
                 * This allows us to avoid using a Priority Search
                 * Tree, which can answer a stabbing querry for
                 * overlapping intervals and instead use the simple
                 * BST we already have.
                 * We don't add extra wakeups by delaying timers that
                 * are right-of a not yet expired timer, because that
                 * timer will have to trigger a wakeup anyway.
                 */
                if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer))
                    break;

                __run_hrtimer(cpu_base, base, timer, &basenow);
            }
        }
    }
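The shape of this expiry loop is easier to see without the locking and tracing. The userspace sketch below swaps the red-black tree for a sorted singly linked list (a deliberate simplification: insertion is O(N) here instead of O(log N)), whose head plays the role of the cached leftmost node. All names (`sketch_hrtimer`, `sketch_enqueue`, `sketch_run_queue`, and the two callbacks) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum sketch_restart { SKETCH_NORESTART, SKETCH_RESTART };

struct sketch_hrtimer {
    int64_t expires;                                   /* absolute expiry, ns */
    enum sketch_restart (*function)(struct sketch_hrtimer *);
    struct sketch_hrtimer *next;
    int runs;                                          /* callback invocations */
};

/* List kept sorted by expiry; the head stands in for the leftmost tree node. */
static struct sketch_hrtimer *queue_head;

/* Insert sorted by expiry. Returns 1 if the new timer is now the earliest. */
static int sketch_enqueue(struct sketch_hrtimer *t)
{
    struct sketch_hrtimer **pp = &queue_head;

    while (*pp && (*pp)->expires <= t->expires)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
    return queue_head == t;
}

/* Run every timer whose expiry is <= now. Returns the next expiry that would
 * be programmed into the clock event device, or INT64_MAX if none is left. */
static int64_t sketch_run_queue(int64_t now)
{
    struct sketch_hrtimer *t;

    while ((t = queue_head) != NULL && t->expires <= now) {
        queue_head = t->next;          /* dequeue before running the callback */
        t->next = NULL;
        t->runs++;
        if (t->function(t) == SKETCH_RESTART)
            sketch_enqueue(t);         /* the callback moved expires forward */
    }
    return queue_head ? queue_head->expires : INT64_MAX;
}

/* Periodic callback: push the expiry 1000 ns ahead and ask to be restarted. */
static enum sketch_restart every_1000ns(struct sketch_hrtimer *t)
{
    t->expires += 1000;
    return SKETCH_RESTART;
}

/* One-shot callback: do not restart. */
static enum sketch_restart one_shot(struct sketch_hrtimer *t)
{
    (void)t;
    return SKETCH_NORESTART;
}
```

As in the kernel code, the loop stops at the first timer that has not expired, because everything behind it in expiry order cannot have expired either.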
lib/timerqueue.c contains three important helper functions, all of them routine red-black tree operations:
    /**
     * timerqueue_add - Adds timer to timerqueue.
     *
     * @head: head of timerqueue
     * @node: timer node to be added
     *
     * Adds the timer node to the timerqueue, sorted by the
     * node's expires value.
     */
    bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node)
    {
        struct rb_node **p = &head->head.rb_node;
        struct rb_node *parent = NULL;
        struct timerqueue_node *ptr;

        /* Make sure we don't add nodes that are already added */
        WARN_ON_ONCE(!RB_EMPTY_NODE(&node->node));

        while (*p) {
            parent = *p;
            ptr = rb_entry(parent, struct timerqueue_node, node);
            if (node->expires.tv64 < ptr->expires.tv64)
                p = &(*p)->rb_left;
            else
                p = &(*p)->rb_right;
        }
        rb_link_node(&node->node, parent, p);
        rb_insert_color(&node->node, &head->head);

        if (!head->next || node->expires.tv64 < head->next->expires.tv64) {
            head->next = node;
            return true;
        }
        return false;
    }
    EXPORT_SYMBOL_GPL(timerqueue_add);

    /**
     * timerqueue_del - Removes a timer from the timerqueue.
     *
     * @head: head of timerqueue
     * @node: timer node to be removed
     *
     * Removes the timer node from the timerqueue.
     */
    bool timerqueue_del(struct timerqueue_head *head, struct timerqueue_node *node)
    {
        WARN_ON_ONCE(RB_EMPTY_NODE(&node->node));

        /* update next pointer */
        if (head->next == node) {
            struct rb_node *rbn = rb_next(&node->node);

            head->next = rbn ?
                rb_entry(rbn, struct timerqueue_node, node) : NULL;
        }
        rb_erase(&node->node, &head->head);
        RB_CLEAR_NODE(&node->node);
        return head->next != NULL;
    }
    EXPORT_SYMBOL_GPL(timerqueue_del);

    /**
     * timerqueue_iterate_next - Returns the timer after the provided timer
     *
     * @node: Pointer to a timer.
     *
     * Provides the timer that is after the given node. This is used, when
     * necessary, to iterate through the list of timers in a timer list
     * without modifying the list.
     */
    struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node)
    {
        struct rb_node *next;

        if (!node)
            return NULL;
        next = rb_next(&node->node);
        if (!next)
            return NULL;
        return container_of(next, struct timerqueue_node, node);
    }
    EXPORT_SYMBOL_GPL(timerqueue_iterate_next);
The overall call chain: tick_program_event (programs the clock_event_device) -> hrtimer_interrupt -> __hrtimer_run_queues -> __run_hrtimer
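(4) The most important thing a timer does is run its callback. The developers designed all these elaborate flows and structures for one purpose: to run the right callback at the right time. Because the callback runs asynchronously, somewhat like a "software interrupt", and outside of process context, three points need attention: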
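- There is no usable current pointer and user space must not be accessed. There is no process context, so the code has nothing to do with whichever process happened to be interrupted.
- The callback must not sleep (or call anything that may sleep) and must not schedule.
- Every data structure it touches must be protected against concurrent access to prevent race conditions.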
3. Why is hrtimer more precise than timer_list?
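(1) A low-resolution timer counts time in jiffies, so its resolution is only 1/HZ. If the kernel is configured with HZ=1000, the resolution of low-resolution timers is 1 ms. Then why not simply raise HZ, say to 10000 or 1000000? Because a higher timer interrupt frequency has side effects: the more frequent the interrupts, the heavier the load on the system. The processor has to spend time running the interrupt handler, and the more CPU time the handler takes, the less is left for other work. It also disturbs the processor caches (because of the resulting process switches). The tick frequency therefore has to be a compromise among several factors.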
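A rough back-of-the-envelope calculation shows why raising HZ does not scale. The per-interrupt cost $t_{\mathrm{irq}} = 2\,\mu\mathrm{s}$ used below is purely an assumed figure for illustration, not a measurement; $\Delta t$ is the timer resolution and $f$ is the fraction of CPU time spent handling ticks.

```latex
\begin{gathered}
\Delta t = \frac{1}{HZ}, \qquad f = HZ \cdot t_{\mathrm{irq}}, \qquad t_{\mathrm{irq}} = 2\,\mu\mathrm{s}\ \text{(assumed)} \\[4pt]
HZ = 10^{3}:\quad \Delta t = 1\,\mathrm{ms}, \quad f = 10^{3} \times 2 \times 10^{-6} = 0.2\% \\
HZ = 10^{4}:\quad \Delta t = 0.1\,\mathrm{ms}, \quad f = 10^{4} \times 2 \times 10^{-6} = 2\% \\
HZ = 10^{6}:\quad \Delta t = 1\,\mu\mathrm{s}, \quad f = 10^{6} \times 2 \times 10^{-6} = 200\%
\end{gathered}
```

Under this assumption, microsecond resolution through a periodic tick would need more than the whole CPU just for tick handling, which is why a different design is needed to go below 1/HZ.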
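(2) Most of the time the timer wheel achieves O(1) time complexity. But when a carry occurs, the unpredictable O(N) cascade that migrates timers between arrays badly hurts timer precision. A red-black tree keeps insertion, deletion, update and lookup at O(log N), and together with advances in hardware this allows precision to be held at the nanosecond level.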
References:
1. https://blog.csdn.net/droidphone/article/details/8051405 How low-resolution timers work
2. https://blog.csdn.net/hongzg1982/article/details/54881361 How the high-resolution timer hrtimer works
3. https://cloud.tencent.com/developer/article/1603333?from=15425 Implementation of the kernel's low-resolution timers
4. https://zhuanlan.zhihu.com/p/83078387 A brief introduction to high-resolution timers