Linux Kernel Filesystem Write I/O Code Analysis (Part 2): bdi_writeback


In the previous article, Linux Kernel Filesystem Write I/O Code Analysis (Part 1), we saw that for buffered I/O a write returns as soon as the data has been copied into the page cache. This article looks at how those dirty pages are flushed to disk.

Overview

Because of the kernel page cache, writes are actually deferred. A page whose data has been written by user space but not yet flushed to disk is a dirty page. (The block-device page cache also uses buffer_head structures, a legacy of mechanical disks that read and write in sector units: each 4 KB page is further divided into 8 buffers, each managed by a buffer_head, so it is possible for only some of a page's buffer_heads to be marked dirty.)
Dirty pages are written back to disk in the following situations:

  • A dirty page has been sitting in memory longer than a threshold.
  • The system is under memory pressure: when free memory falls below a threshold, dirty pages must be written back.
  • User space explicitly asks for a flush, e.g. via the sync(), fsync() or close() system calls (see the sketch after this list).
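
As a minimal user-space sketch of the third case (an illustration, not code from the article; the file path is made up): write() returns as soon as the data has been copied into the page cache, i.e. it merely dirties pages, while fsync() blocks until the dirty pages of that file have been written back to the device.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char buf[] = "hello, writeback\n";
	int fd = open("/tmp/wb-demo.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Returns once the data sits in the page cache as dirty pages. */
	if (write(fd, buf, strlen(buf)) < 0)
		perror("write");

	/* Forces the dirty pages (and inode metadata) of this file to disk. */
	if (fsync(fd) < 0)
		perror("fsync");

	close(fd);
	return 0;
}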

Older kernels managed dirty-page writeback with the pdflush mechanism, but because the pdflush threads handled the pages/buffer_heads of all disks together, it became a serious performance bottleneck. Since Linux 2.6.32, writeback has been handled by the bdi_writeback mechanism, which creates a flusher thread per disk dedicated to flushing that disk's page cache / buffer cache data, improving I/O performance.

The BDI subsystem

BDI is short for backing device info; it describes a backing storage device such as a disk. Compared with memory, I/O to backing storage is slow, so writes are buffered in the page cache and flushed to disk later.
In the original BDI subsystem, a bdi-default thread was created at module init, and a flush-x:y thread (x:y being the major:minor device numbers) was created for each registered device to write back its dirty data. From Linux 3.10.0 onward, the BDI subsystem uses the workqueue mechanism instead of dedicated threads: when writeback is needed, a flush work item is queued on a workqueue and is eventually executed by a generic [kworker] thread. The BDI subsystem is initialized as follows:

static int __init default_bdi_init(void)
{
	int err;

	bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_FREEZABLE |
					      WQ_UNBOUND | WQ_SYSFS, 0);
	if (!bdi_wq)
		return -ENOMEM;

	err = bdi_init(&default_backing_dev_info);
	if (!err)
		bdi_register(&default_backing_dev_info, NULL, "default");
	err = bdi_init(&noop_backing_dev_info);

	return err;
}
subsys_initcall(default_bdi_init);

Device registration

During mount, the underlying filesystem fills in its own struct backing_dev_info and registers it with the BDI subsystem. The FUSE code below is an example:

static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
{
	int err;

	fc->bdi.name = "fuse";
	fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
	/* fuse does it's own writeback accounting */
	fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;

	err = bdi_init(&fc->bdi);
	if (err)
		return err;

	fc->bdi_initialized = 1;

	if (sb->s_bdev) {
		err =  bdi_register(&fc->bdi, NULL, "%u:%u-fuseblk",
				    MAJOR(fc->dev), MINOR(fc->dev));
	} else {
		err = bdi_register_dev(&fc->bdi, fc->dev);
	}

	if (err)
		return err;

	/*
	 *    /sys/class/bdi/<bdi>/max_ratio
	 */
	bdi_set_max_ratio(&fc->bdi, 1);

	return 0;
}

This function first initializes the struct backing_dev_info with bdi_init(), then registers it with the BDI subsystem via bdi_register().
bdi_init() in turn calls bdi_wb_init() to initialize the embedded struct bdi_writeback:

static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
	memset(wb, 0, sizeof(*wb));

	wb->bdi = bdi;
	wb->last_old_flush = jiffies;
	INIT_LIST_HEAD(&wb->b_dirty);
	INIT_LIST_HEAD(&wb->b_io);
	INIT_LIST_HEAD(&wb->b_more_io);
	spin_lock_init(&wb->list_lock);
	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
}

Among other things, bdi_wb_init() sets up a delayed work whose default handler is bdi_writeback_workfn(); this work item performs the actual writeback.
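
How this delayed work is re-armed for periodic flushing is worth making explicit. Below is a simplified sketch modeled on bdi_wakeup_thread_delayed(), not verbatim kernel code; bdi_wq, dirty_writeback_interval (in centiseconds, default 500, i.e. 5 seconds) and queue_delayed_work() are real kernel symbols, while the function name example_arm_periodic_flush is made up for illustration.

/*
 * Simplified sketch (not verbatim kernel code): re-arm the per-bdi
 * delayed work so that bdi_writeback_workfn() runs again later.
 */
static void example_arm_periodic_flush(struct backing_dev_info *bdi)
{
	/* dirty_writeback_interval is in centiseconds, so *10 gives msecs */
	unsigned long timeout = msecs_to_jiffies(dirty_writeback_interval * 10);

	/* after `timeout` jiffies, bdi_wq executes bdi->wb.dwork */
	queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
}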

Data writeback

Building on the diagram from the previous article, the figure below adds the bdi writeback path:
[Figure: data written into the page cache and written back to disk]

bdi_queue_work

The BDI subsystem performs writeback through the workqueue mechanism. The submission interface is bdi_queue_work(), which puts the writeback request (a struct wb_writeback_work) for a particular bdi on that bdi's work_list and schedules its delayed work on bdi_wq. The code is as follows:

static void bdi_queue_work(struct backing_dev_info *bdi,
			   struct wb_writeback_work *work)
{
	trace_writeback_queue(bdi, work);

	spin_lock_bh(&bdi->wb_lock);
	if (!test_bit(BDI_registered, &bdi->state)) {
		if (work->done)
			complete(work->done);
		goto out_unlock;
	}
	list_add_tail(&work->list, &bdi->work_list);
	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
out_unlock:
	spin_unlock_bh(&bdi->wb_lock);
}

This function is called from, among others (a caller-side sketch follows the list):

  • sync_inodes_sb(): write back all dirty inodes of a super block.
  • writeback_inodes_sb_nr(): write back dirty inodes of a super block, limited to a given number of pages.
  • __bdi_start_writeback(): called periodically, or when pages must be freed, or when more memory is needed.
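
To make the hand-off concrete, here is a simplified caller-side sketch, not verbatim kernel code (the function name example_queue_sync_writeback is made up; the struct fields and WB_REASON_SYNC are real): the caller fills in a struct wb_writeback_work, queues it with bdi_queue_work(), and for a synchronous request waits on the completion that wb_do_writeback() signals once the work is done.

/* Simplified caller-side sketch (not verbatim kernel code). */
static void example_queue_sync_writeback(struct backing_dev_info *bdi,
					 struct super_block *sb,
					 long nr_pages)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct wb_writeback_work work = {
		.sb		= sb,
		.sync_mode	= WB_SYNC_NONE,
		.nr_pages	= nr_pages,
		.reason		= WB_REASON_SYNC,
		.done		= &done,	/* ask to be notified on completion */
	};

	bdi_queue_work(bdi, &work);	/* hand the request to the bdi worker */
	wait_for_completion(&done);	/* wb_do_writeback() completes it */
}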

bdi_writeback_workfn

bdi_queue_work() submits the work to bdi_wq, where it is handled by the bdi's processing function, by default bdi_writeback_workfn():

void bdi_writeback_workfn(struct work_struct *work)
{
	struct bdi_writeback *wb = container_of(to_delayed_work(work),
						struct bdi_writeback, dwork);
	struct backing_dev_info *bdi = wb->bdi;
	long pages_written;

	set_worker_desc("flush-%s", dev_name(bdi->dev));
	current->flags |= PF_SWAPWRITE;

	if (likely(!current_is_workqueue_rescuer() ||
		   !test_bit(BDI_registered, &bdi->state))) {
		/*
		 * The normal path.  Keep writing back @bdi until its
		 * work_list is empty.  Note that this path is also taken
		 * if @bdi is shutting down even when we're running off the
		 * rescuer as work_list needs to be drained.
		 */
		do {
			pages_written = wb_do_writeback(wb);
			trace_writeback_pages_written(pages_written);
		} while (!list_empty(&bdi->work_list));
	} else {
		/*
		 * bdi_wq can't get enough workers and we're running off
		 * the emergency worker.  Don't hog it.  Hopefully, 1024 is
		 * enough for efficient IO.
		 */
		pages_written = writeback_inodes_wb(&bdi->wb, 1024,
						    WB_REASON_FORKER_THREAD);
		trace_writeback_pages_written(pages_written);
	}

	if (!list_empty(&bdi->work_list))
		mod_delayed_work(bdi_wq, &wb->dwork, 0);
	else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
		bdi_wakeup_thread_delayed(bdi);

	current->flags &= ~PF_SWAPWRITE;
}

The function first checks whether it is running on a regular workqueue worker or on the emergency rescuer. On the normal path (enough workers available, or the bdi is shutting down and its work_list must be drained) it keeps calling wb_do_writeback() until the bdi's work_list is empty. When running on the rescuer it must not hog it, so it only writes back up to 1024 pages via writeback_inodes_wb().
On the normal path, the actual writeback is done by wb_do_writeback().

wb_do_writeback

The function, shown below, walks all pending work items on the bdi and writes data back by calling wb_writeback() for each; it also handles periodic (kupdate-style) and background flushing.

static long wb_do_writeback(struct bdi_writeback *wb)
{
	struct backing_dev_info *bdi = wb->bdi;
	struct wb_writeback_work *work;
	long wrote = 0;

	set_bit(BDI_writeback_running, &wb->bdi->state);
	while ((work = get_next_work_item(bdi)) != NULL) {

		trace_writeback_exec(bdi, work);

		wrote += wb_writeback(wb, work);

		/*
		 * Notify the caller of completion if this is a synchronous
		 * work item, otherwise just free it.
		 */
		if (work->done)
			complete(work->done);
		else
			kfree(work);
	}

	/*
	 * Check for periodic writeback, kupdated() style
	 */
	wrote += wb_check_old_data_flush(wb);
	wrote += wb_check_background_flush(wb);
	clear_bit(BDI_writeback_running, &wb->bdi->state);

	return wrote;
}

wb_writeback() eventually calls __writeback_single_inode() to flush the dirty pages of a single inode back to disk.
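
For orientation, the call chain from the work item down to the page writes looks roughly like this (a simplified sketch; error handling, retries and the background/kupdate paths are omitted):

/*
 * Simplified call chain (details omitted):
 *
 *   bdi_writeback_workfn()
 *     wb_do_writeback()
 *       wb_writeback()
 *         queue_io()                  move expired inodes from b_dirty to b_io
 *         writeback_sb_inodes()       walk b_io; for each dirty inode:
 *           __writeback_single_inode()
 *             do_writepages()         write the inode's dirty data pages
 *             write_inode()           write the inode metadata itself
 */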

__writeback_single_inode

The code of __writeback_single_inode() is shown below; the actual page writes happen through do_writepages():

static int
__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
	struct address_space *mapping = inode->i_mapping;
	long nr_to_write = wbc->nr_to_write;
	unsigned dirty;
	int ret;

	WARN_ON(!(inode->i_state & I_SYNC));

	trace_writeback_single_inode_start(inode, wbc, nr_to_write);

	ret = do_writepages(mapping, wbc);

	/*
	 * Make sure to wait on the data before writing out the metadata.
	 * This is important for filesystems that modify metadata on data
	 * I/O completion. We don't do it for sync(2) writeback because it has a
	 * separate, external IO completion path and ->sync_fs for guaranteeing
	 * inode metadata is written back correctly.
	 */
	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
		int err = filemap_fdatawait(mapping);
		if (ret == 0)
			ret = err;
	}

	/*
	 * Some filesystems may redirty the inode during the writeback
	 * due to delalloc, clear dirty metadata flags right before
	 * write_inode()
	 */
	spin_lock(&inode->i_lock);
	/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		inode->i_state &= ~I_DIRTY_PAGES;
	dirty = inode->i_state & I_DIRTY;
	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
	spin_unlock(&inode->i_lock);
	/* Don't write the inode if only I_DIRTY_PAGES was set */
	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
		int err = write_inode(inode, wbc);
		if (ret == 0)
			ret = err;
	}
	trace_writeback_single_inode(inode, wbc, nr_to_write);
	return ret;
}

do_writepages

do_writepages() was covered in the previous article: it invokes the underlying filesystem's a_ops->writepages to write the pages out to backing storage.
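
For reference, do_writepages() itself is short. The sketch below is a close paraphrase of the mm/page-writeback.c implementation in this kernel series (quoted from memory, so minor details may differ): it prefers the filesystem's own a_ops->writepages and otherwise falls back to generic_writepages(), which walks the dirty pages and calls a_ops->writepage on each.

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
	int ret;

	if (wbc->nr_to_write <= 0)
		return 0;

	if (mapping->a_ops->writepages)
		/* filesystem-specific multi-page writeback */
		ret = mapping->a_ops->writepages(mapping, wbc);
	else
		/* generic path: walk dirty pages and call ->writepage() */
		ret = generic_writepages(mapping, wbc);

	return ret;
}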

