While building process-safety monitoring, we made an off-the-cuff decision: if a process stays in D state, i.e. TASK_UNINTERRUPTIBLE (uninterruptible sleep), for more than 8 minutes, panic the system. As it happened, the DB team's logging buffered the entire log in memory and flushed it to disk at the end; the flush kept the process in D state for a long time, and the system duly panicked. This touched on Linux's mechanisms for writing dirty cache pages back to disk and their tuning knobs, so here is a summary.
Under the current mechanism, dirty pages are flushed back to disk in the following situations:
- The dirty page cache takes up too much memory and free memory runs short;
- A dirty page has stayed modified for too long, reaching the time threshold, and must be flushed promptly to keep memory and on-disk data consistent;
- An external command forces dirty pages to be flushed to disk;
- A write() to disk checks the dirty state and triggers a flush.
The kernel flushes dirty pages to disk with pdflush threads; their count is kept between 2 and 8 and can be read directly from /proc/sys/vm/nr_pdflush_threads. For the exact policy, see the kernel function __pdflush.
1. Forced flushing from other kernel modules
Start with the first and third cases: when memory runs short or a flush is forced from outside, dirty pages are flushed by calling wakeup_pdflush; its callers include do_sync, free_more_memory, and try_to_free_pages. wakeup_pdflush does its work through the background_writeout function:
```c
static void background_writeout(unsigned long _min_pages)
{
	long min_pages = _min_pages;
	struct writeback_control wbc = {
		.bdi		= NULL,
		.sync_mode	= WB_SYNC_NONE,
		.older_than_this = NULL,
		.nr_to_write	= 0,
		.nonblocking	= 1,
	};

	for ( ; ; ) {
		struct writeback_state wbs;
		long background_thresh;
		long dirty_thresh;

		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
		if (wbs.nr_dirty + wbs.nr_unstable < background_thresh
				&& min_pages <= 0)
			break;
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		wbc.pages_skipped = 0;
		writeback_inodes(&wbc);
		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
			/* Wrote less than expected */
			blk_congestion_wait(WRITE, HZ/10);
			if (!wbc.encountered_congestion)
				break;
		}
	}
}
```
background_writeout enters an endless loop and uses get_dirty_limits to obtain background_thresh, the threshold at which flushing starts. It corresponds to dirty_background_ratio, a percentage of total memory pages, tunable through the proc interface /proc/sys/vm/dirty_background_ratio (default 10). When dirty pages exceed the threshold, writeback_inodes writes out up to MAX_WRITEBACK_PAGES (1024) pages per pass until the dirty-page count drops below the threshold.
2. Flushing started by the kernel timer
At boot, the kernel initializes the wb_timer timer in page_writeback_init, with a timeout of dirty_writeback_centisecs in units of 0.01 s, tunable via /proc/sys/vm/dirty_writeback_centisecs. The timer fires wb_timer_fn, which ultimately does its work through wb_kupdate.
```c
static void wb_kupdate(unsigned long arg)
{
	sync_supers();

	get_writeback_state(&wbs);
	oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100;
	start_jif = jiffies;
	next_jif = start_jif + (dirty_writeback_centisecs * HZ) / 100;
	nr_to_write = wbs.nr_dirty + wbs.nr_unstable +
			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
	while (nr_to_write > 0) {
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			if (wbc.encountered_congestion)
				blk_congestion_wait(WRITE, HZ/10);
			else
				break;	/* All the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}
	if (time_before(next_jif, jiffies + HZ))
		next_jif = jiffies + HZ;
	if (dirty_writeback_centisecs)
		mod_timer(&wb_timer, next_jif);
}
```
The snippet above is abridged (local declarations and the wbc setup are omitted). The kernel first flushes superblock information to the filesystems, then computes oldest_jif and passes it through wbc so that only pages whose modification time is older than dirty_expire_centisecs get flushed; that parameter is tunable via /proc/sys/vm/dirty_expire_centisecs.
3. Cache flushing during write()
A user-space write() can also end up flushing dirty pages: after marking the written pages dirty, generic_file_buffered_write may flush to disk, depending on conditions, to rebalance the current dirty-page ratio; see balance_dirty_pages_ratelimited:
```c
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
	static DEFINE_PER_CPU(int, ratelimits) = 0;
	long ratelimit;

	ratelimit = ratelimit_pages;
	if (dirty_exceeded)
		ratelimit = 8;

	/*
	 * Check the rate limiting. Also, we do not want to throttle real-time
	 * tasks in balance_dirty_pages(). Period.
	 */
	if (get_cpu_var(ratelimits)++ >= ratelimit) {
		__get_cpu_var(ratelimits) = 0;
		put_cpu_var(ratelimits);
		balance_dirty_pages(mapping);
		return;
	}
	put_cpu_var(ratelimits);
}
```
balance_dirty_pages_ratelimited uses ratelimit_pages to throttle how often the flush (the call to balance_dirty_pages) runs: only roughly every ratelimit_pages-th call on a CPU actually flushes (every 8th once dirty_exceeded is set). The flushing itself is in balance_dirty_pages:
```c
static void balance_dirty_pages(struct address_space *mapping)
{
	struct writeback_state wbs;
	long nr_reclaimable;
	long background_thresh;
	long dirty_thresh;
	unsigned long pages_written = 0;
	unsigned long write_chunk = sync_writeback_pages();
	struct backing_dev_info *bdi = mapping->backing_dev_info;

	for (;;) {
		struct writeback_control wbc = {
			.bdi		= bdi,
			.sync_mode	= WB_SYNC_NONE,
			.older_than_this = NULL,
			.nr_to_write	= write_chunk,
		};

		get_dirty_limits(&wbs, &background_thresh,
					&dirty_thresh, mapping);
		nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
		if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
			break;

		if (!dirty_exceeded)
			dirty_exceeded = 1;

		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
		 * Unstable writes are a feature of certain networked
		 * filesystems (i.e. NFS) in which data may have been
		 * written to the server's write cache, but has not yet
		 * been flushed to permanent storage.
		 */
		if (nr_reclaimable) {
			writeback_inodes(&wbc);
			get_dirty_limits(&wbs, &background_thresh,
					&dirty_thresh, mapping);
			nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
			if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
				break;
			pages_written += write_chunk - wbc.nr_to_write;
			if (pages_written >= write_chunk)
				break;		/* We've done our duty */
		}
		blk_congestion_wait(WRITE, HZ/10);
	}

	if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh && dirty_exceeded)
		dirty_exceeded = 0;

	if (writeback_in_progress(bdi))
		return;		/* pdflush is already working this queue */

	/*
	 * In laptop mode, we wait until hitting the higher threshold before
	 * starting background writeout, and then write out all the way down
	 * to the lower threshold.  So slow writers cause minimal disk activity.
	 *
	 * In normal mode, we start background writeout at the lower
	 * background_thresh, to keep the amount of dirty memory low.
	 */
	if ((laptop_mode && pages_written) ||
	     (!laptop_mode && (nr_reclaimable > background_thresh)))
		pdflush_operation(background_writeout, 0);
}
```
The function enters an endless loop and uses get_dirty_limits to obtain the page counts corresponding to dirty_background_ratio and dirty_ratio. At the threshold check, if dirty pages exceed dirty_thresh, writeback_inodes starts flushing the cache to disk; if one pass does not bring the dirty ratio below dirty_ratio, blk_congestion_wait blocks the writer and the loop repeats until the ratio falls below dirty_ratio. Once below dirty_ratio but still above dirty_background_ratio, pdflush_operation kicks off background_writeout; pdflush_operation is non-blocking, it wakes pdflush and returns immediately, and background_writeout then runs in a pdflush thread.
From this we can see: during a write(), if the cache exceeds dirty_ratio, the write blocks and dirty pages are written back until the cache drops below dirty_ratio; if the cache is above dirty_background_ratio (but below dirty_ratio), the write wakes a pdflush thread to flush dirty pages without blocking the write.
4. Summary of the problem
Most of the D-state stalls come from the third and fourth cases: under heavy writing, the cache is managed by Linux, and once enough dirty pages have accumulated, either continuing to write or forcing a flush with fsync will put the process into D state.