Analysis of the Linux fsync and fdatasync System Call Implementation (Ext4 File System)


Reposted from: https://blog.csdn.net/luckyapple1028/article/details/61413724

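On Linux, reads and writes to files on a file system normally go through the page cache (direct I/O excepted). This design defers disk I/O, which reduces the number of disk accesses and improves I/O performance. Performance and reliability tend to pull against each other, though: the kernel has a workqueue that writes dirty pages back to keep them in sync with the on-disk file, but data can still be lost in extreme cases such as a power failure. The kernel therefore provides the sync, fsync, fdatasync and msync system calls for synchronization. sync flushes every file system and block device on the system, fsync and fdatasync act on a single file, and msync synchronizes a file mapped with mmap. With these calls, a user can force newly written data back to disk right after writing an important file, reducing the chance of data loss as far as possible. This article introduces what fsync and fdatasync do, how they differ, and, taking Ext4 as the example, how they get data to disk.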

Kernel version: Linux 4.10.1

Kernel files: fs/sync.c, fs/ext4/fsync.c, fs/ext4/inode.c, mm/filemap.c

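1. Overview
When a user writes a file that was opened without O_SYNC or O_DIRECT, the newly written data is held temporarily in the page cache and the affected pages become dirty pages; the data is not written to disk immediately. The kernel maintains a workqueue, bdi_wq, and a set of writeback workers that are woken once certain conditions are met (the delay expires (5s by default), memory runs low, the number of dirty pages crosses a threshold, and so on) to write the dirty pages back. Only then does the newly written data reach the disk. The window between write and writeback is fairly short (Ext4 enables delayed allocation by default, which stretches it to 30s), but if the device loses power or the system crashes inside that window, the user's data is lost. So when a single file needs better reliability, call fsync or fdatasync after writing to synchronize the file (data).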

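The fsync system call synchronizes everything belonging to the file referred to by fd, both data and metadata, and blocks until writeback finishes. fdatasync is similar, but it does not write back modified metadata unless that metadata is needed to retrieve the data correctly afterwards. For example, if only the last access time (st_atime) or last modification time (st_mtime) changed, the metadata need not be synchronized, because it does not affect how the file's data blocks are located. If the file size (st_size) changed, the metadata clearly must be synchronized; otherwise the modified data might be unreachable after a crash. Given this difference, applications that do not need file metadata written back can use fdatasync for better performance.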

Finally, see the fsync manual page for the full description of fsync and fdatasync.
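From the application side, the usage described above can be sketched as follows. This is a minimal illustration rather than code from the original article: the helper name write_durably and its data_only parameter are hypothetical, and error handling is kept deliberately simple (for instance, EINTR is not retried).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes to `path`, then force them to
 * stable storage. With data_only != 0 it calls fdatasync(), otherwise
 * fsync(). Returns 0 on success and -1 on any failure.
 */
int write_durably(const char *path, const void *buf, size_t len, int data_only)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        /* A successful write() only places the data in the page cache. */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Block until the kernel reports that writeback has completed. */
    int rc = data_only ? fdatasync(fd) : fsync(fd);
    if (rc < 0) {
        close(fd);
        return -1;
    }
    return close(fd) < 0 ? -1 : 0;
}
```

A caller that appends to a preallocated file and does not care about timestamps would pass data_only = 1; a caller that needs all metadata on disk would pass 0.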

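2. Implementation Analysis
The call flow of fsync and fdatasync is not complicated, but it touches many details of the file system journal, block allocation and the page writeback mechanism, and fully mastering it requires that background. This article follows the main call path on Ext4 in detail without going deep into the file system or other modules. We also assume Ext4's default options: the ordered journaling mode, with special features such as inline data and encryption disabled. The main call relationships are as follows: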

sys_fsync/sys_fdatasync
  ---> do_fsync
    ---> vfs_fsync
      ---> vfs_fsync_range
        ---> mark_inode_dirty_sync
        |      ---> ext4_dirty_inode
        |      |      ---> ext4_journal_start
        |      |      ---> ext4_mark_inode_dirty
        |      |      ---> ext4_journal_stop
        |      ---> inode_io_list_move_locked
        |      ---> wb_wakeup_delayed
        ---> ext4_sync_file
               ---> filemap_write_and_wait_range
               |      ---> __filemap_fdatawrite_range
               |      |      ---> do_writepages
               |      |             ---> ext4_writepages
               |      ---> filemap_fdatawait_range
               ---> jbd2_complete_transaction
               ---> blkdev_issue_flush

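fsync and fdatasync follow the same code path. do_fsync looks up the struct file that corresponds to the fd argument. In vfs_fsync_range, the fdatasync flow skips the mark_inode_dirty_sync branch. The fsync flow checks whether the file's access or modification timestamps have changed; if so, it takes the mark_inode_dirty_sync branch, which updates the metadata, marks it dirty, and adds the dirtied metadata buffer to the appropriate list of the jbd2 journal, where it waits for the journal commit thread to write it out. Next, ext4_sync_file calls filemap_write_and_wait_range to synchronize the file's dirty page cache; this submits bios to the block layer and waits for writeback to complete. It then calls jbd2_complete_transaction to trigger the metadata writeback (if the metadata is not dirty, no metadata related to this file is written). Finally, if Ext4 has the barrier feature enabled and the write cache needs flushing, it calls blkdev_issue_flush to send a flush command down the stack, which makes the disk write its cache to the medium (so that, under normal circumstances, the data is guaranteed to be on disk).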

The detailed execution flow is shown in the following figure:

[Figure: flowchart of the fsync and fdatasync system calls]

Now follow the source of the fsync and fdatasync system calls to see how file data synchronization is actually implemented:

SYSCALL_DEFINE1(fsync, unsigned int, fd)
{
	return do_fsync(fd, 0);
}

SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
{
	return do_fsync(fd, 1);
}

Both system calls take a single argument, the already opened file descriptor fd, and call do_fsync directly; only the second argument, the datasync flag, differs.

static int do_fsync(unsigned int fd, int datasync)
{
	struct fd f = fdget(fd);
	int ret = -EBADF;

	if (f.file) {
		ret = vfs_fsync(f.file, datasync);
		fdput(f);
	}
	return ret;
}

do_fsync first calls fdget to look up the struct fd for fd in the current process's fdtable. What is actually used is the struct file instance inside it (created dynamically when the file is opened, bound to the fd, and kept with the process's task_struct). It then calls the generic function vfs_fsync.

/**
* vfs_fsync - perform a fsync or fdatasync on a file
* @file: file to sync
* @datasync: only perform a fdatasync operation
*
* Write back data and metadata for @file to disk. If @datasync is
* set only metadata needed to access modified file data is written.
*/
int vfs_fsync(struct file *file, int datasync)
{
	return vfs_fsync_range(file, 0, LLONG_MAX, datasync);
}
EXPORT_SYMBOL(vfs_fsync);

vfs_fsync simply forwards to vfs_fsync_range. The second and third arguments are the start and end byte offsets of the range to synchronize; 0 and LLONG_MAX are passed here, meaning all of the file's data.

/**
* vfs_fsync_range - helper to sync a range of data & metadata to disk
* @file: file to sync
* @start: offset in bytes of the beginning of data range to sync
* @end: offset in bytes of the end of data range (inclusive)
* @datasync: perform only datasync
*
* Write back data in range @start..@end and metadata for @file to disk. If
* @datasync is set only metadata needed to access modified file data is
* written.
*/
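int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
	struct inode *inode = file->f_mapping->host;

	if (!file->f_op->fsync)
		return -EINVAL;
	if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
		spin_lock(&inode->i_lock);
		inode->i_state &= ~I_DIRTY_TIME;
		spin_unlock(&inode->i_lock);
		mark_inode_dirty_sync(inode);
	}
	return file->f_op->fsync(file, start, end, datasync);
}
EXPORT_SYMBOL(vfs_fsync_range);

vfs_fsync_range first finds the file's inode through the address_space of the file structure (the address_space is set up at open time in the sys_open->do_dentry_open path and holds the file's page cache pages, the underlying block device and the corresponding operations). It then checks whether the file system's file_operations implement the fsync method and returns -EINVAL if not.

Next, in the non-datasync (fsync) case it tests the inode's I_DIRTY_TIME flag. If the flag is set (the file's timestamps have been updated but not yet written to disk), it clears the flag and calls mark_inode_dirty_sync to set I_DIRTY_SYNC, indicating that a sync is required. That function handles the inode according to its current state and also puts the inode on the dirty list of the backing bdi. The bdi writeback task walks that list to do its work, but, and this is easy to misread, the writeback in the current flow is not performed by the bdi writeback worker; it is carried out directly within this call path.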

static inline void mark_inode_dirty_sync(struct inode *inode)
{
__mark_inode_dirty(inode, I_DIRTY_SYNC);
}
void __mark_inode_dirty(struct inode *inode, int flags)
{
#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
struct super_block *sb = inode->i_sb;
int dirtytime;

trace_writeback_mark_inode_dirty(inode, flags);

/*
* Don't do this for I_DIRTY_PAGES - that doesn't actually
* dirty the inode itself
*/
if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME)) {
trace_writeback_dirty_inode_start(inode, flags);

if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode, flags);

trace_writeback_dirty_inode(inode, flags);
}
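Because the flag passed in here is I_DIRTY_SYNC (the inode is dirty but need not be synchronized by fdatasync; typically used when timestamps such as i_atime change; defined in include/linux/fs.h), __mark_inode_dirty calls the file system's dirty_inode method, which for ext4 is ext4_dirty_inode.

void ext4_dirty_inode(struct inode *inode, int flags)
{
	handle_t *handle;

	if (flags == I_DIRTY_TIME)
		return;
	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
	if (IS_ERR(handle))
		goto out;

	ext4_mark_inode_dirty(handle, inode);

	ext4_journal_stop(handle);
out:
	return;
}

ext4_dirty_inode involves the jbd2 journaling module used by ext4. It starts a new journal handle (an atomic journal operation) and hands the inode metadata block that needs syncing to a jbd2 transaction (note that nothing is written to the journal or to disk yet). ext4_journal_start does a quick check of the ext4 journal state and then calls jbd2__journal_start to start the handle. ext4_mark_inode_dirty calls ext4_get_inode_loc to get the buffer head mapping the block that holds the inode, then follows the standard journaling sequence: jbd2_journal_get_write_access (obtain write access) -> update the raw_inode metadata -> jbd2_journal_dirty_metadata (mark the metadata dirty and add it to the appropriate list of the transaction). Finally, ext4_journal_stop->jbd2_journal_stop ends the handle. The journal commit thread will later commit these metadata blocks (note that the commit thread is not woken immediately here, except in the case where the large_file feature has to be enabled).

Back in __mark_inode_dirty, continuing: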

if (flags & I_DIRTY_INODE)
flags &= ~I_DIRTY_TIME;
dirtytime = flags & I_DIRTY_TIME;

/*
* Paired with smp_mb() in __writeback_single_inode() for the
* following lockless i_state test. See there for details.
*/
smp_mb();

if (((inode->i_state & flags) == flags) ||
(dirtytime && (inode->i_state & I_DIRTY_INODE)))
return;
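If the inode's current state already contains all the flags being set, or if dirtytime is being set and the inode is already dirty, the function returns right away; there is no need to set flags or add the inode to a dirty list again.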
if (unlikely(block_dump))
block_dump___mark_inode_dirty(inode);

spin_lock(&inode->i_lock);
if (dirtytime && (inode->i_state & I_DIRTY_INODE))
goto out_unlock_inode;
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;

inode_attach_wb(inode, NULL);

if (flags & I_DIRTY_INODE)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;

/*
* If the inode is being synced, just update its dirty state.
* The unlocker will place the inode on the appropriate
* superblock list, based upon its state.
*/
if (inode->i_state & I_SYNC)
goto out_unlock_inode;

/*
* Only add valid (hashed) inodes to the superblock's
* dirty list. Add blockdev inodes as well.
*/
if (!S_ISBLK(inode->i_mode)) {
if (inode_unhashed(inode))
goto out_unlock_inode;
}
if (inode->i_state & I_FREEING)
goto out_unlock_inode;
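First, as a debugging aid, when block_dump is set the function calls block_dump___mark_inode_dirty to print the inode number, file name and device name of the dirty inode.

It then locks the inode for the final steps. The flags are added to i_state (here, I_DIRTY_SYNC), which completes the inode's state update. Next it checks whether the inode is already being synced (I_SYNC is set, which the writeback worker does in writeback_sb_inodes) or is in the middle of being freed; in either case it exits without going any further.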

/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
struct bdi_writeback *wb;
struct list_head *dirty_list;
bool wakeup_bdi = false;

wb = locked_inode_to_wb_and_lock_list(inode);

WARN(bdi_cap_writeback_dirty(wb->bdi) &&
!test_bit(WB_registered, &wb->state),
"bdi-%s not registered\n", wb->bdi->name);

inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;

if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES))
dirty_list = &wb->b_dirty;
else
dirty_list = &wb->b_dirty_time;

wakeup_bdi = inode_io_list_move_locked(inode, wb,
dirty_list);

spin_unlock(&wb->list_lock);
trace_writeback_dirty_inode_enqueue(inode);

/*
* If this is the first dirty inode for this bdi,
* we have to wake-up the corresponding bdi thread
* to make sure background write-back happens
* later.
*/
if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi)
wb_wakeup_delayed(wb);
return;
}
}
out_unlock_inode:
spin_unlock(&inode->i_lock);

#undef I_DIRTY_INODE
}
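Finally, if the inode was not already dirty, the function records the inode's dirtied time and adds it to the matching dirty list of its bdi_writeback (wb->b_dirty in the current context). If the bdi had no dirty I/O pending (the b_dirty, b_io and b_more_io lists were all empty) and supports writeback, it calls wb_wakeup_delayed to queue a delayed writeback work item on the background writeback workqueue. The delay is set by the global variable dirty_writeback_interval, 5s by default.

Of course, although this arranges for the writeback worker to wake up 5s later, the current fsync flow never waits 5s for the writeback worker to do the writing (that part belongs to the background bdi dirty page writeback mechanism).

Back in vfs_fsync_range: the branch on !datasync && (inode->i_state & I_DIRTY_TIME) already shows one difference between fsync and fdatasync. For an inode whose timestamps changed, fsync marks the inode dirty, which causes the later steps to write the file's metadata back; fdatasync does not.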
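The timestamp branch can be modelled in user space as below. This is only a sketch under stated assumptions: the flag values M_DIRTY_SYNC and M_DIRTY_TIME and the struct model_inode are illustrative stand-ins rather than the kernel's definitions, and setting M_DIRTY_SYNC stands in for the whole mark_inode_dirty_sync call.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values for this model only; the real I_DIRTY_* values
 * are defined in include/linux/fs.h and may differ. */
#define M_DIRTY_SYNC  (1u << 0)
#define M_DIRTY_TIME  (1u << 1)

struct model_inode {
    unsigned int i_state;
};

/*
 * Mirrors the timestamp branch of vfs_fsync_range(): only a full fsync
 * (datasync == 0) turns a lazily dirtied timestamp into a dirty inode.
 * Returns 1 if the inode was promoted to dirty, 0 otherwise.
 */
int model_promote_dirty_time(struct model_inode *inode, int datasync)
{
    assert(inode != NULL);

    if (!datasync && (inode->i_state & M_DIRTY_TIME)) {
        inode->i_state &= ~M_DIRTY_TIME;
        /* Stand-in for mark_inode_dirty_sync(inode). */
        inode->i_state |= M_DIRTY_SYNC;
        return 1;
    }
    return 0;
}
```

With an inode whose only state is M_DIRTY_TIME, a datasync call leaves the state untouched, while a full fsync call replaces M_DIRTY_TIME with M_DIRTY_SYNC.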

Continuing, vfs_fsync_range finally calls the fsync method registered in file_operations. For ext4 this is ext4_sync_file, which performs the synchronization of file data and metadata.

/*
* akpm: A new design for ext4_sync_file().
*
* This is only called from sys_fsync(), sys_fdatasync() and sys_msync().
* There cannot be a transaction open by this task.
* Another task could have dirtied this inode. Its data can be in any
* state in the journalling system.
*
* What we do is just kick off a commit and wait on it. This will snapshot the
* inode to disk.
*/

int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
struct inode *inode = file->f_mapping->host;
struct ext4_inode_info *ei = EXT4_I(inode);
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
int ret = 0, err;
tid_t commit_tid;
bool needs_barrier = false;

J_ASSERT(ext4_journal_current_handle() == NULL);

trace_ext4_sync_file_enter(file, datasync);

if (inode->i_sb->s_flags & MS_RDONLY) {
/* Make sure that we read updated s_mount_flags value */
smp_rmb();
if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
ret = -EROFS;
goto out;
}
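Let's take ext4_sync_file in sections. First, two local variables: (1) commit_tid is the transaction id of the journal transaction to commit, used to tell transactions apart; (2) needs_barrier indicates whether a cache flush command must be sent to the underlying block device, a means of protecting data consistency. How they are used will become clear below.

ext4_sync_file first handles a read-only file system. A file system mounted read-only never writes files, so fsync/fdatasync has nothing to do and can return success immediately. But a file system may also be read-only because an error occurred, so there is a check: if the file system has aborted (hit a fatal error), -EROFS is returned instead of success, so that the application does not wrongly believe the file was synchronized.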

if (!journal) {
ret = __generic_file_fsync(file, start, end, datasync);
if (!ret)
ret = ext4_sync_parent(inode);
if (test_opt(inode->i_sb, BARRIER))
goto issue_flush;
goto out;
}
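Next comes the case with no journal. Here the generic __generic_file_fsync does the file synchronization, and then ext4_sync_parent synchronizes the file's parent directory. The parent is synchronized because, without a journal, if the file being synced was newly created, there is a long window before the parent directory's entry is written out by background writeback; a power failure or crash in that window loses the data. Synchronizing the parent directory entry promptly shrinks that window considerably and improves data safety. ext4_sync_parent walks up through the parent directories and synchronizes each one that is itself newly created.

Since journaling is enabled by default (the jbd2 journal is initialized at mount time in the ext4_fill_super->ext4_load_journal path), this branch is not analyzed further. Back to ext4_sync_file and the default, journaled case.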
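The same concern, a new file whose directory entry is not yet on disk, is often handled explicitly by applications. The sketch below shows one way to do that from user space; it is not part of the kernel code discussed here, and the helper name fsync_parent_dir is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync the directory that contains `path`, so that
 * the directory entry of a newly created file is also pushed to disk.
 * Returns 0 on success, -1 on failure.
 */
int fsync_parent_dir(const char *path)
{
    assert(path != NULL);

    /* dirname() may modify its argument, so work on a private copy. */
    char *copy = strdup(path);
    if (copy == NULL)
        return -1;

    int dirfd = open(dirname(copy), O_RDONLY | O_DIRECTORY);
    free(copy);
    if (dirfd < 0)
        return -1;

    int rc = fsync(dirfd);
    close(dirfd);
    return rc < 0 ? -1 : 0;
}
```

A typical sequence would be: create and write the file, fsync the file, then call fsync_parent_dir on its path.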

ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
if (ret)
return ret;
Next, filemap_write_and_wait_range writes back the dirty file data between start and end and waits for the writeback to complete.
/**
* filemap_write_and_wait_range - write out & wait on a file range
* @mapping: the address_space for the pages
* @lstart: offset in bytes where the range starts
* @lend: offset in bytes where the range ends (inclusive)
*
* Write out and wait upon file offsets lstart->lend, inclusive.
*
* Note that `lend' is inclusive (describes the last byte to be written) so
* that this function can be used to write to the very end-of-file (end = -1).
*/
int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
int err = 0;

if ((!dax_mapping(mapping) && mapping->nrpages) ||
(dax_mapping(mapping) && mapping->nrexceptional)) {
err = __filemap_fdatawrite_range(mapping, lstart, lend,
WB_SYNC_ALL);
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
int err2 = filemap_fdatawait_range(mapping,
lstart, lend);
if (!err)
err = err2;
}
} else {
err = filemap_check_errors(mapping);
}
return err;
}
EXPORT_SYMBOL(filemap_write_and_wait_range);
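filemap_write_and_wait_range first decides whether any writeback is needed. Without DAX, the address space must contain at least one cached page (the page cache is what gets written back, after all). Otherwise it calls filemap_check_errors to report any previously recorded errors. Let's look at that function first: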
int filemap_check_errors(struct address_space *mapping)
{
	int ret = 0;
	/* Check for outstanding write errors */
	if (test_bit(AS_ENOSPC, &mapping->flags) &&
	    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_bit(AS_EIO, &mapping->flags) &&
	    test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;
	return ret;
}
EXPORT_SYMBOL(filemap_check_errors);
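filemap_check_errors checks the AS_EIO and AS_ENOSPC flags of the address space (defined in include/linux/pagemap.h); the former indicates an I/O error, the latter that space ran out. If either is set, the function clears the flag and returns the matching error code.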
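One consequence of the test-and-clear pattern is that a recorded error is reported only once. The user-space model below illustrates this; the bit values M_AS_EIO and M_AS_ENOSPC are assumptions made for the sketch, not the kernel's AS_* definitions, and a plain unsigned int stands in for mapping->flags.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative bit values for this model; not the kernel's AS_* values. */
#define M_AS_EIO     (1u << 0)
#define M_AS_ENOSPC  (1u << 1)

/*
 * Model of filemap_check_errors(): report a recorded writeback error and
 * clear it, so a later call does not report the same error again.
 */
int model_check_errors(unsigned int *flags)
{
    int ret = 0;

    assert(flags != NULL);
    if (*flags & M_AS_ENOSPC) {
        *flags &= ~M_AS_ENOSPC;
        ret = -ENOSPC;
    }
    if (*flags & M_AS_EIO) {
        *flags &= ~M_AS_EIO;
        ret = -EIO;   /* EIO overrides ENOSPC when both were recorded */
    }
    return ret;
}
```

Calling the model twice on a value with only M_AS_ENOSPC set yields -ENOSPC the first time and 0 the second time.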
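If there are cached pages to write back, __filemap_fdatawrite_range performs the writeback. Note the last argument, WB_SYNC_ALL, which marks this as a data integrity writeback in which pages must be waited on rather than skipped: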

/**
* __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
* @mapping: address space structure to write
* @start: offset in bytes where the range starts
* @end: offset in bytes where the range ends (inclusive)
* @sync_mode: enable synchronous operation
*
* Start writeback against all of a mapping's dirty pages that lie
* within the byte offsets <start, end> inclusive.
*
* If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
* opposed to a regular memory cleansing writeback. The difference between
* these two operations is that if a dirty page/buffer is encountered, it must
* be waited upon, and not just skipped over.
*/
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end, int sync_mode)
{
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
.nr_to_write = LONG_MAX,
.range_start = start,
.range_end = end,
};

if (!mapping_cap_writeback_dirty(mapping))
return 0;

wbc_attach_fdatawrite_inode(&wbc, mapping->host);
ret = do_writepages(mapping, &wbc);
wbc_detach_inode(&wbc);
return ret;
}
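As the function comment says, __filemap_fdatawrite_range writes back the dirty pages in <start, end>. It first builds and initializes a struct writeback_control, which controls the writeback operation. sync_mode is the synchronization mode, with two options: WB_SYNC_NONE does not wait for writeback to finish and is generally used for periodic writeback, while WB_SYNC_ALL waits and is used for forced writeback such as sync. nr_to_write is the number of pages to write (LONG_MAX here, that is, no limit). range_start and range_end are the start and end offsets to write back, in bytes.

Next, mapping_cap_writeback_dirty checks whether the file's bdi supports writeback; if not, the function returns 0 at once (nothing was written). Then wbc_attach_fdatawrite_inode associates the wbc with the inode's bdi (this requires cgroup writeback support in the kernel, otherwise it is a no-op), do_writepages performs the writeback, and afterwards wbc_detach_inode detaches the wbc from the inode.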

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
	int ret;

	if (wbc->nr_to_write <= 0)
		return 0;
	if (mapping->a_ops->writepages)
		ret = mapping->a_ops->writepages(mapping, wbc);
	else
		ret = generic_writepages(mapping, wbc);
	return ret;
}
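do_writepages prefers the writepages method registered in the address space's a_ops, which ext4 implements as ext4_writepages, and falls back to the generic generic_writepages when none is registered (the background dirty page writeback worker, wb_workfn, also ends up in do_writepages to write pages back).

Now a brief look at how ext4_writepages writes pages back (the function is long, so it is taken in sections):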

static int ext4_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
pgoff_t writeback_index = 0;
long nr_to_write = wbc->nr_to_write;
int range_whole = 0;
int cycled = 1;
handle_t *handle = NULL;
struct mpage_da_data mpd;
struct inode *inode = mapping->host;
int needed_blocks, rsv_blocks = 0, ret = 0;
struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
bool done;
struct blk_plug plug;
bool give_up_on_write = false;

percpu_down_read(&sbi->s_journal_flag_rwsem);
trace_ext4_writepages(inode, wbc);

if (dax_mapping(mapping)) {
ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
wbc);
goto out_writepages;
}

/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
* because that could violate lock ordering on umount
*/
if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
goto out_writepages;

if (ext4_should_journal_data(inode)) {
struct blk_plug plug;

blk_start_plug(&plug);
ret = write_cache_pages(mapping, wbc, __writepage, mapping);
blk_finish_plug(&plug);
goto out_writepages;
}
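ext4_writepages first handles the dax_mapping branch, where writeback of data pages is delegated to dax_writeback_mapping_range. It then checks whether anything needs writing: if the address space has no pages, or the radix tree carries no PAGECACHE_TAG_DIRTY tag (no dirty pages; the tag is set on dirtied pages in __set_page_dirty), it simply exits.

Then it checks the journaling mode. In journal mode (both data blocks and metadata blocks go through the jbd2 journal), writeback is handed to write_cache_pages. Since the default is ordered mode, we skip this and continue.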

/*
* If the filesystem has aborted, it is read-only, so return
* right away instead of dumping stack traces later on that
* will obscure the real source of the problem. We test
* EXT4_MF_FS_ABORTED instead of sb->s_flag's MS_RDONLY because
* the latter could be true if the filesystem is mounted
* read-only, and in that case, ext4_writepages should
* *never* be called, so if that ever happens, we would want
* the stack trace.
*/
if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED)) {
ret = -EROFS;
goto out_writepages;
}

if (ext4_should_dioread_nolock(inode)) {
/*
* We may need to convert up to one extent per block in
* the page and we may dirty the inode.
*/
rsv_blocks = 1 + (PAGE_SIZE >> inode->i_blkbits);
}
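Next comes the dioread_nolock feature. With it, uninitialized extents are allocated before a buffered write and are only converted to initialized once the write I/O completes, which avoids taking and releasing the inode mutex and so improves I/O performance. The feature only applies to extent-mapped files and does not support journal mode. When it is enabled, reserved blocks must be set aside in the journal; with the default 4KB file system block size (and 4KB pages), the reservation here is 2 blocks.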
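The reservation arithmetic can be checked with a small stand-alone function. This is only a worked example of the formula rsv_blocks = 1 + (PAGE_SIZE >> inode->i_blkbits); the function name model_rsv_blocks and its parameters are made up for the illustration.

```c
#include <assert.h>

/*
 * Model of the reservation formula in ext4_writepages():
 *   rsv_blocks = 1 + (PAGE_SIZE >> inode->i_blkbits)
 * page_size is in bytes; blkbits is log2 of the file system block size.
 */
int model_rsv_blocks(unsigned long page_size, unsigned int blkbits)
{
    assert(blkbits < 32);
    return 1 + (int)(page_size >> blkbits);
}
```

For a 4096-byte page and 4096-byte blocks (blkbits = 12) the result is 1 + 1 = 2; with 1024-byte blocks (blkbits = 10) a page spans four blocks and the result is 1 + 4 = 5.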
/*
* If we have inline data and arrive here, it means that
* we will soon create the block for the 1st page, so
* we'd better clear the inline data here.
*/
if (ext4_has_inline_data(inode)) {
/* Just inode will be modified... */
handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out_writepages;
}
BUG_ON(ext4_test_inode_state(inode,
EXT4_STATE_MAY_INLINE_DATA));
ext4_destroy_inline_data(handle, inode);
ext4_journal_stop(handle);
}
Next comes the inline data feature, which targets small files whose contents fit inside the inode itself. We skip this feature as well.

if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;

if (wbc->range_cyclic) {
writeback_index = mapping->writeback_index;
if (writeback_index)
cycled = 0;
mpd.first_page = writeback_index;
mpd.last_page = -1;
} else {
mpd.first_page = wbc->range_start >> PAGE_SHIFT;
mpd.last_page = wbc->range_end >> PAGE_SHIFT;
}

mpd.inode = inode;
mpd.wbc = wbc;
ext4_io_submit_init(&mpd.io_submit, wbc);
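Next come some flag checks; range_whole set means the whole file is being written. Then the struct mpage_da_data mpd is initialized: in the current non-cyclic case, first_page and last_page are set to the first and last pages to write, followed by the inode, wbc and io_submit fields. Execution then continues at the retry label.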
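The conversion from byte offsets to page indexes in the non-cyclic branch can be sketched as follows. The model assumes 4 KiB pages (MODEL_PAGE_SHIFT = 12); struct model_range and model_page_range are hypothetical names used only here.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

struct model_range {
    unsigned long long first_page;
    unsigned long long last_page;
    int range_whole;
};

/* Model of the non-cyclic branch: convert a byte range to page indexes. */
struct model_range model_page_range(long long start, long long end)
{
    struct model_range r;

    assert(start >= 0 && end >= start);
    r.first_page = (unsigned long long)start >> MODEL_PAGE_SHIFT;
    r.last_page = (unsigned long long)end >> MODEL_PAGE_SHIFT;
    /* fsync passes 0..LLONG_MAX, which marks a whole-file writeback. */
    r.range_whole = (start == 0 && end == LLONG_MAX);
    return r;
}
```

For the fsync case (0, LLONG_MAX) the model reports range_whole = 1 and first_page = 0; for the byte range 4096..8191 both indexes are page 1.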
retry:
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
done = false;
blk_start_plug(&plug);
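tag_pages_for_writeback deserves attention. It walks the address_space's radix tree and adds the PAGECACHE_TAG_TOWRITE tag to every entry already tagged PAGECACHE_TAG_DIRTY, marking the set of pages this writeback pass will handle; the page lookup in the loop below searches by this tag. Then comes a big loop that processes the pages to be written back.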
while (!done && mpd.first_page <= mpd.last_page) {
/* For each extent of pages we use new io_end */
mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
if (!mpd.io_submit.io_end) {
ret = -ENOMEM;
break;
}

/*
* We have two constraints: We find one extent to map and we
* must always write out whole page (makes a difference when
* blocksize < pagesize) so that we don't block on IO when we
* try to write out the rest of the page. Journalled mode is
* not supported by delalloc.
*/
BUG_ON(ext4_should_journal_data(inode));
needed_blocks = ext4_da_writepages_trans_blocks(inode);

/* start a new transaction */
handle = ext4_journal_start_with_reserve(inode,
EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
"%ld pages, ino %lu; err %d", __func__,
wbc->nr_to_write, inode->i_ino, ret);
/* Release allocated io_end */
ext4_put_io_end(mpd.io_submit.io_end);
break;
}

trace_ext4_da_write_pages(inode, mpd.first_page, mpd.wbc);
ret = mpage_prepare_extent_to_map(&mpd);
if (!ret) {
if (mpd.map.m_len)
ret = mpage_map_and_submit_extent(handle, &mpd,
&give_up_on_write);
else {
/*
* We scanned the whole range (or exhausted
* nr_to_write), submitted what was mapped and
* didn't find anything needing mapping. We are
* done.
*/
done = true;
}
}
/*
* Caution: If the handle is synchronous,
* ext4_journal_stop() can wait for transaction commit
* to finish which may depend on writeback of pages to
* complete or on page lock to be released. In that
* case, we have to wait until after after we have
* submitted all the IO, released page locks we hold,
* and dropped io_end reference (for extent conversion
* to be able to complete) before stopping the handle.
*/
if (!ext4_handle_valid(handle) || handle->h_sync == 0) {
ext4_journal_stop(handle);
handle = NULL;
}
/* Submit prepared bio */
ext4_io_submit(&mpd.io_submit);
/* Unlock pages we didn't use */
mpage_release_unused_pages(&mpd, give_up_on_write);
/*
* Drop our io_end reference we got from init. We have
* to be careful and use deferred io_end finishing if
* we are still holding the transaction as we can
* release the last reference to io_end which may end
* up doing unwritten extent conversion.
*/
if (handle) {
ext4_put_io_end_defer(mpd.io_submit.io_end);
ext4_journal_stop(handle);
} else
ext4_put_io_end(mpd.io_submit.io_end);

if (ret == -ENOSPC && sbi->s_journal) {
/*
* Commit the transaction which would
* free blocks released in the transaction
* and try again
*/
jbd2_journal_force_commit_nested(sbi->s_journal);
ret = 0;
continue;
}
/* Fatal error - ENOMEM, EIO... */
if (ret)
break;
}
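The call flow inside this loop is very involved, so only a brief description follows. ext4_da_writepages_trans_blocks computes the number of blocks needed for the extent, and ext4_journal_start_with_reserve starts a new journal handle, with the needed and reserved block counts given by needed_blocks and rsv_blocks. mpage_prepare_extent_to_map and mpage_map_and_submit_extent then scan the mapping for entries tagged PAGECACHE_TAG_TOWRITE, submit I/O directly for dirty pages that are already mapped, and for unmapped ones work out the extent to map and map it. ext4_io_submit then submits the bio, and finally ext4_journal_stop ends the handle.

Back in filemap_write_and_wait_range: if __filemap_fdatawrite_range did not return an I/O error (-EIO), filemap_fdatawait_range is called to wait for the writeback to finish.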

int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
{
int ret, ret2;

ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
ret2 = filemap_check_errors(mapping);
if (!ret)
ret = ret2;

return ret;
}
EXPORT_SYMBOL(filemap_fdatawait_range);
filemap_fdatawait_range does two things: it calls __filemap_fdatawait_range to wait until writeback of <start_byte, end_byte> completes, and it calls filemap_check_errors to pick up any recorded errors.
static int __filemap_fdatawait_range(struct address_space *mapping,
loff_t start_byte, loff_t end_byte)
{
pgoff_t index = start_byte >> PAGE_SHIFT;
pgoff_t end = end_byte >> PAGE_SHIFT;
struct pagevec pvec;
int nr_pages;
int ret = 0;

if (end_byte < start_byte)
goto out;

pagevec_init(&pvec, 0);
while ((index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_WRITEBACK,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
unsigned i;

for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];

/* until radix tree lookup accepts end_index */
if (page->index > end)
continue;

wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
}
pagevec_release(&pvec);
cond_resched();
}
out:
return ret;
}
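__filemap_fdatawait_range is one big loop. In it, pagevec_lookup_tag finds the radix tree entries tagged PAGECACHE_TAG_WRITEBACK (pages for which writeback has been started), and wait_on_page_writeback sleeps on a wait queue until the page's PG_writeback flag is cleared, which signals the end of writeback. The process is in D state while it waits. If a page has its error flag set, the function returns -EIO; back in filemap_fdatawait_range, filemap_check_errors is then called in any case to check the mapping's error flags.

The filemap_write_and_wait_range call above shows that the file's writeback is not performed by the background bdi writeback worker; fsync and fdatasync do it in the context of the calling process.

At this point the file's data has been written back, while the metadata is still in a journal transaction waiting to be committed. Back to the outermost function, ext4_sync_file, to commit the remaining metadata blocks.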

/*
* data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
* Metadata is in the journal, we wait for proper transaction to
* commit here.
*
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
* ext4_force_commit will write the file data into the journal and
* will wait on that.
* filemap_fdatawait() will encounter a ton of newly-dirtied pages
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
if (ext4_should_journal_data(inode)) {
ret = ext4_force_commit(inode->i_sb);
goto out;
}

commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
if (journal->j_flags & JBD2_BARRIER &&
!jbd2_trans_will_send_data_barrier(journal, commit_tid))
needs_barrier = true;
ret = jbd2_complete_transaction(journal, commit_tid);
if (needs_barrier) {
issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (!ret)
ret = err;
}
out:
trace_ext4_sync_file_exit(inode, ret);
return ret;
}
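As the comment explains, in the default ordered mode filemap_write_and_wait_range has already synchronized the file's data blocks, while the metadata blocks may still be in the journal. The rest of the function finds the right transaction and commits the journal.

First there is a check: if the file system's barrier feature is enabled, jbd2_trans_will_send_data_barrier is called to decide whether a flush command must be sent to the block device. Note the commit_tid argument: for fdatasync it is ei->i_datasync_tid, otherwise ei->i_sync_tid; it identifies the transaction that contains the metadata of the file we care about.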

int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid)
{
int ret = 0;
transaction_t *commit_trans;

if (!(journal->j_flags & JBD2_BARRIER))
return 0;
read_lock(&journal->j_state_lock);
/* Transaction already committed? */
if (tid_geq(journal->j_commit_sequence, tid))
goto out;
commit_trans = journal->j_committing_transaction;
if (!commit_trans || commit_trans->t_tid != tid) {
ret = 1;
goto out;
}
/*
* Transaction is being committed and we already proceeded to
* submitting a flush to fs partition?
*/
if (journal->j_fs_dev != journal->j_dev) {
if (!commit_trans->t_need_data_flush ||
commit_trans->t_state >= T_COMMIT_DFLUSH)
goto out;
} else {
if (commit_trans->t_state >= T_COMMIT_JFLUSH)
goto out;
}
ret = 1;
out:
read_unlock(&journal->j_state_lock);
return ret;
}
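jbd2_trans_will_send_data_barrier checks the journal state and answers whether committing the given transaction will itself send a flush. A return of 1 means the transaction has not been committed yet, so ext4_sync_file does not issue a flush; a return of 0 means the transaction may already be committed, so ext4_sync_file must issue the flush itself. In detail:

(1) The journal does not have barriers enabled: return 0. This does not actually trigger a flush in ext4_sync_file, because the caller tests JBD2_BARRIER before calling this function and leaves needs_barrier false.

(2) The transaction id is compared with journal->j_commit_sequence; if j_commit_sequence has reached or passed the id, the transaction has already been committed: return 0.

(3) If there is no committing transaction, or the committing transaction is not this one, this transaction has not yet been picked up by the journal commit thread: return 1.

(4) If this transaction is being committed and the commit has already reached the relevant flush stage (T_COMMIT_JFLUSH when the journal is on the file system device; T_COMMIT_DFLUSH, or no data flush needed, when the journal is on a separate device), the metadata journal write has passed the point of flushing: return 0.

(5) Finally, if this transaction is being committed but has not reached that stage: return 1.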
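The five cases, together with the caller's check in ext4_sync_file, can be modelled in user space as below. This is a simplified sketch: struct model_journal flattens the journal and committing-transaction fields into one structure, and enum model_state is a reduced, assumed ordering of the commit phases rather than the kernel's full t_state enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced commit phases, ordered for this model only. */
enum model_state { M_RUNNING, M_COMMIT, M_COMMIT_DFLUSH, M_COMMIT_JFLUSH, M_FINISHED };

struct model_journal {
    bool barrier;                 /* JBD2_BARRIER set in j_flags */
    bool external_journal;        /* j_fs_dev != j_dev */
    unsigned int commit_sequence; /* id of the last committed transaction */
    bool has_committing;          /* j_committing_transaction != NULL */
    unsigned int committing_tid;
    enum model_state committing_state;
    bool need_data_flush;         /* t_need_data_flush */
};

/* Wraparound-tolerant comparison in the spirit of the kernel's tid_geq(). */
static bool model_tid_geq(unsigned int x, unsigned int y)
{
    return (int)(x - y) >= 0;
}

/* Returns 1 when the commit of `tid` is still going to send the flush. */
int model_will_send_data_barrier(const struct model_journal *j, unsigned int tid)
{
    assert(j != NULL);
    if (!j->barrier)
        return 0;                                   /* case (1) */
    if (model_tid_geq(j->commit_sequence, tid))
        return 0;                                   /* case (2) */
    if (!j->has_committing || j->committing_tid != tid)
        return 1;                                   /* case (3) */
    if (j->external_journal) {
        if (!j->need_data_flush || j->committing_state >= M_COMMIT_DFLUSH)
            return 0;                               /* case (4), external journal */
    } else if (j->committing_state >= M_COMMIT_JFLUSH) {
        return 0;                                   /* case (4), internal journal */
    }
    return 1;                                       /* case (5) */
}

/* The caller's decision in ext4_sync_file(). */
bool model_needs_barrier(const struct model_journal *j, unsigned int tid)
{
    return j->barrier && !model_will_send_data_barrier(j, tid);
}
```

In this model, a transaction that is already committed makes model_needs_barrier return true, while a journal without barriers never requests a flush, which matches the short-circuit in the caller.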

Back in ext4_sync_file, jbd2_complete_transaction then carries out the journal commit:

int jbd2_complete_transaction(journal_t *journal, tid_t tid)
{
int need_to_wait = 1;

read_lock(&journal->j_state_lock);
if (journal->j_running_transaction &&
journal->j_running_transaction->t_tid == tid) {
if (journal->j_commit_request != tid) {
/* transaction not yet started, so request it */
read_unlock(&journal->j_state_lock);
jbd2_log_start_commit(journal, tid);
goto wait_commit;
}
} else if (!(journal->j_committing_transaction &&
journal->j_committing_transaction->t_tid == tid))
need_to_wait = 0;
read_unlock(&journal->j_state_lock);
if (!need_to_wait)
return 0;
wait_commit:
return jbd2_log_wait_commit(journal, tid);
}
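If the transaction we care about is the currently running transaction (handles may still be open on it, or the commit thread has not started committing it) and no commit has been requested for it yet, jbd2_log_start_commit is called right away to wake the journal commit thread so that it writes out the pending metadata, and jbd2_log_wait_commit then waits until the metadata has been written to the journal. (Note: this does not guarantee that the metadata has reached its final location, but once the journal write is complete the file's metadata can be recovered even if the system crashes at that moment, so the rest of this article does not draw the distinction.) This is the usual path. If a commit has already been requested, or the transaction is already being committed, it is enough to wait. And if there is no committing transaction, or the committing transaction is not the one being waited for, the transaction has already been written out and the function returns immediately.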
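The three outcomes can be captured in a small decision function. As before, this is a user-space model under assumptions: struct model_jstate and enum model_action are invented for the sketch, and the locking and the actual waiting are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_action { M_RETURN_NOW, M_WAIT_ONLY, M_START_AND_WAIT };

struct model_jstate {
    bool has_running;             /* j_running_transaction != NULL */
    unsigned int running_tid;
    bool has_committing;          /* j_committing_transaction != NULL */
    unsigned int committing_tid;
    unsigned int commit_request;  /* j_commit_request */
};

/* Model of the decision made by jbd2_complete_transaction(). */
enum model_action model_complete_transaction(const struct model_jstate *j,
                                             unsigned int tid)
{
    assert(j != NULL);

    if (j->has_running && j->running_tid == tid) {
        /* Still running: request a commit unless one is already requested. */
        if (j->commit_request != tid)
            return M_START_AND_WAIT;
        return M_WAIT_ONLY;
    }
    /* Already being committed: just wait for it. */
    if (j->has_committing && j->committing_tid == tid)
        return M_WAIT_ONLY;
    /* Neither running nor committing: nothing to wait for. */
    return M_RETURN_NOW;
}
```

M_START_AND_WAIT corresponds to jbd2_log_start_commit followed by jbd2_log_wait_commit, M_WAIT_ONLY to jbd2_log_wait_commit alone, and M_RETURN_NOW to the early return 0.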

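Back in ext4_sync_file. Finally, based on the earlier decision, if a flush is needed, blkdev_issue_flush sends a flush command to the block device. It ultimately reaches the physical disk as a cache flush command (a SCSI command, for a SCSI disk), the disk writes out its write cache, and only then is the data truly on the medium and truly safe. (A question arises here: why is the flush decision made before jbd2_complete_transaction rather than after it? After jbd2_complete_transaction the transaction is guaranteed to be committed, so checking whether the journal supports barriers would seem to be enough. Doesn't the current ordering skip the flush in cases 3 and 5 above? A plausible reading, going by the name of jbd2_trans_will_send_data_barrier, is that in those two cases the commit itself sends the flush, and since jbd2_complete_transaction waits for that commit, a second flush from ext4_sync_file would be redundant.)

Finally, let's look back at where ei->i_datasync_tid and ei->i_sync_tid, the values assigned to commit_tid, come from. This is the key to whether fsync and fdatasync write metadata back.

These two values are set and updated in the ext4_mark_iloc_dirty->ext4_do_update_inode->ext4_update_inode_fsync_trans path: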

static int ext4_do_update_inode(handle_t *handle,
struct inode *inode,
struct ext4_iloc *iloc)
{
...
if (ei->i_disksize != ext4_isize(raw_inode)) {
ext4_isize_set(raw_inode, ei->i_disksize);
need_datasync = 1;
}
...
ext4_update_inode_fsync_trans(handle, inode, need_datasync);
...
}
static inline void ext4_update_inode_fsync_trans(handle_t *handle,
						 struct inode *inode,
						 int datasync)
{
	struct ext4_inode_info *ei = EXT4_I(inode);

	if (ext4_handle_valid(handle)) {
		ei->i_sync_tid = handle->h_transaction->t_tid;
		if (datasync)
			ei->i_datasync_tid = handle->h_transaction->t_tid;
	}
}
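This call chain shows that i_sync_tid is updated only when the inode's metadata becomes dirty, and only then does the end of ext4_sync_file trigger the commit of the journal transaction and the metadata writeback. If the metadata is not dirty when fsync is called, no metadata writeback is needed (and no journal commit is triggered). In addition, if the file size also changed when the metadata became dirty, i_datasync_tid is updated as well, so that fdatasync triggers the metadata writeback too, consistent with the overview at the start of this article.

To summarize: when fsync or fdatasync is called, clean metadata is not written back. If the metadata is dirty but the size is unchanged, fdatasync does not write the metadata back while fsync does. If the metadata is dirty and the size changed, both fdatasync and fsync write it back.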
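The bookkeeping behind this summary can be sketched as a pair of small functions. The names struct model_ei, model_update_fsync_trans and model_commit_tid are hypothetical, the size_changed parameter plays the role of need_datasync, and the model assumes a valid journal handle.

```c
#include <assert.h>
#include <stddef.h>

struct model_ei {
    unsigned int i_sync_tid;      /* last transaction that dirtied the inode */
    unsigned int i_datasync_tid;  /* last transaction that changed the size */
};

/* Model of ext4_update_inode_fsync_trans() for a valid handle. */
void model_update_fsync_trans(struct model_ei *ei, unsigned int tid,
                              int size_changed)
{
    assert(ei != NULL);
    ei->i_sync_tid = tid;
    if (size_changed)
        ei->i_datasync_tid = tid;
}

/* Model of the commit_tid selection in ext4_sync_file(). */
unsigned int model_commit_tid(const struct model_ei *ei, int datasync)
{
    assert(ei != NULL);
    return datasync ? ei->i_datasync_tid : ei->i_sync_tid;
}
```

If transaction 5 only updates timestamps, fsync waits for transaction 5 while fdatasync still refers to the older transaction recorded in i_datasync_tid, which has normally been committed already, so no new commit is triggered.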

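This completes the fsync and fdatasync call flow.

3. Summary
The fsync and fdatasync system calls synchronize the dirty page cache of a specific file's data and metadata by writing it back to disk, which matters a great deal for reducing the loss of important data. This article described what fsync and fdatasync do and how they differ, and analyzed their implementation in the Linux kernel using Ext4. Because many modules are involved and the overall logic is complex, a good deal remains to be analyzed and considered.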

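----------------
Copyright notice: This is an original article by CSDN blogger "luckyapple1028", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original and this notice.
Original link: https://blog.csdn.net/luckyapple1028/article/details/61413724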
