Linux Kernel Study: The Read/Write Path of the EXT4 File System in the Linux Kernel [Repost]


Reposted from: https://blog.csdn.net/qq_32473685/article/details/103494398

Contents

1 Overview

2 The Virtual File System and the Ext4 File System

2.1 Tracing the sys_write( ) code

2.2 Analysis of the sys_write( ) flow

2.3 vfs_write( ), the core of sys_write( )

2.4 ext4_file_write( )

2.4.1 Extents in the ext4 file system

2.4.2 ext4_file_write( )

2.5 generic_file_write_iter( )

2.6 __generic_file_write_iter( )

2.7 generic_perform_write( )

2.7.1 address_space_operations in the ext4 file system

2.7.2 The delayed allocation mechanism in the ext4 file system

2.7.3 After generic_perform_write( ) completes


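1 Overview

A user process calls write() to write data to disk, but is the data actually on disk as soon as write() returns? When the kernel reads file data it uses "read-ahead"; when it writes data it uses "delayed write". That is, when write() returns, the request has not yet been placed on the block device driver's request queue, and the data has not yet been written to the disk.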
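
The consequence of delayed write can be seen from user space. The following is a minimal sketch, assuming a POSIX system; write_durably is a hypothetical helper name, not a kernel or libc function. It shows that a caller who needs the data on the device must ask for it explicitly with fsync() after write() has returned.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to path, then force them to stable storage.
 * Returns the number of bytes written, or -1 on any failure.
 * (write_durably is a hypothetical helper, used only for illustration.)
 */
ssize_t write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* On success the data is in the page cache; it may not be on disk yet. */
    ssize_t n = write(fd, buf, len);

    /* fsync() returns only after the file's dirty data has been sent to the device. */
    if (n < 0 || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd) < 0 ? -1 : n;
}
```

Without the fsync() call the function would still return the same byte count; the only difference is when the data leaves the page cache.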

To trace the path, calls to dump_stack were inserted, the Linux kernel was rebuilt, and the sequence of function calls was followed.

2 The Virtual File System and the Ext4 File System

A file write enters the kernel through sys_write( ).

2.1 Tracing the sys_write( ) code

sys_write( ) is declared in include/linux/syscalls.h:

asmlinkage long sys_write(unsigned int fd, const char __user *buf, size_t count);

The implementation of sys_write( ) is in fs/read_write.c:

    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
                    size_t, count)
    {
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;

        if (f.file) {
            loff_t pos = file_pos_read(f.file);
            ret = vfs_write(f.file, buf, count, &pos);
            if (ret >= 0)
                file_pos_write(f.file, pos);
            fdput_pos(f);
        }
        return ret;
    }

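2.2 Analysis of the sys_write( ) flow

sys_write( ) is implemented in the following steps:

1) Look up the open file's file structure from the file descriptor fd:

struct fd f = fdget_pos(fd);

2) Read the file's current read/write position:

loff_t pos = file_pos_read(f.file);

3) Perform the write:

ret = vfs_write(f.file, buf, count, &pos);

4) Update the file's read/write position according to the result of the write:

file_pos_write(f.file, pos);

Steps 2) and 4) are the matching operations before and after the write: one reads the file's current position, and the other updates it based on the result of the write. Both helpers are also in fs/read_write.c: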

    static inline loff_t file_pos_read(struct file *file)
    {
        return file->f_pos;
    }

    static inline void file_pos_write(struct file *file, loff_t pos)
    {
        file->f_pos = pos;
    }
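
The effect of steps 2) and 4) is visible from user space: after a successful write the file position has advanced by the number of bytes written. A minimal sketch, assuming a POSIX system (offset_after_write is a hypothetical helper name):

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to a freshly truncated file and report the file offset
 * afterwards. Returns -1 on failure.
 * (offset_after_write is a hypothetical helper, used only for illustration.)
 */
off_t offset_after_write(const char *path, const char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    off_t result = -1;
    if (write(fd, buf, len) == (ssize_t)len)
        result = lseek(fd, 0, SEEK_CUR);  /* position stored after the write */

    close(fd);
    return result;
}
```

For a new file the returned offset equals len, which is the user-space view of file_pos_write( ) storing the updated pos.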

Step 3) is the most important part of sys_write( ); the next section looks at this function in detail.

2.3 vfs_write( ), the core of sys_write( )

    ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
    {
        ssize_t ret;

        if (!(file->f_mode & FMODE_WRITE))
            return -EBADF;
        if (!(file->f_mode & FMODE_CAN_WRITE))
            return -EINVAL;
        if (unlikely(!access_ok(VERIFY_READ, buf, count)))
            return -EFAULT;
        ret = rw_verify_area(WRITE, file, pos, count);
        if (ret >= 0) {
            count = ret;
            file_start_write(file);
            if (file->f_op->write)
                ret = file->f_op->write(file, buf, count, pos);
            else if (file->f_op->aio_write)
                ret = do_sync_write(file, buf, count, pos);
            else
                ret = new_sync_write(file, buf, count, pos);
            if (ret > 0) {
                fsnotify_modify(file);
                add_wchar(current, ret);
            }
            inc_syscw(current);
            file_end_write(file);
        }

        return ret;
    }

The function first calls rw_verify_area(WRITE, file, pos, count) to check whether a mandatory lock covers the count bytes of the file starting at position pos.
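
Before rw_verify_area( ) is reached, the listing above rejects any file that was not opened for writing (the FMODE_WRITE test) with -EBADF. That check can be observed from user space. A minimal sketch, assuming a POSIX system (errno_of_write_on_readonly_fd is a hypothetical helper name):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open path read-only, attempt a write, and return the resulting errno.
 * Returns 0 if the write unexpectedly succeeded, -1 if the file could
 * not be created or opened.
 */
int errno_of_write_on_readonly_fd(const char *path)
{
    assert(path != NULL);

    /* Make sure the file exists first. */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    errno = 0;
    ssize_t n = write(fd, "x", 1);
    int err = (n < 0) ? errno : 0;  /* save errno before close() can change it */

    close(fd);
    return err;
}
```

The result is EBADF, matching the first return statement in vfs_write( ).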

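Once the validity checks pass, the write method of the specific file system's file_operations is called. For ext4, the file_operations are defined in fs/ext4/file.c. The original walkthrough follows the path on which the write is carried out by do_sync_write( ), which in turn calls aio_write. Note that the structure listed below instead sets .write to new_sync_write and provides .write_iter = ext4_file_write_iter, so the listings in this article appear to come from more than one kernel version.

The ext4 file operations structure is: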

    const struct file_operations ext4_file_operations = {
        .llseek         = ext4_llseek,
        .read           = new_sync_read,
        .write          = new_sync_write,
        .read_iter      = generic_file_read_iter,
        .write_iter     = ext4_file_write_iter,
        .unlocked_ioctl = ext4_ioctl,
    #ifdef CONFIG_COMPAT
        .compat_ioctl   = ext4_compat_ioctl,
    #endif
        .mmap           = ext4_file_mmap,
        .open           = ext4_file_open,
        .release        = ext4_release_file,
        .fsync          = ext4_sync_file,
        .splice_read    = generic_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .fallocate      = ext4_fallocate,
    };

The code of do_sync_write( ), also in fs/read_write.c, is:

    ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
    {
        struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
        struct kiocb kiocb;
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);
        kiocb.ki_pos = *ppos;
        kiocb.ki_nbytes = len;
        ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
        if (-EIOCBQUEUED == ret)
            ret = wait_on_sync_kiocb(&kiocb);
        *ppos = kiocb.ki_pos;
        return ret;
    }
    EXPORT_SYMBOL(do_sync_write);

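Asynchronous I/O lets user space initiate operations without waiting for them to complete, so an application can do other processing while its I/O is in flight.

Block and network drivers are fully asynchronous at all times, so only char drivers are candidates for explicit asynchronous I/O support. The file_operations methods that implement asynchronous I/O all use an I/O control block (struct kiocb), which is defined in include/linux/aio.h.

do_sync_write( ) defines a temporary variable iov that records the user-space buffer address buf and the number of bytes to write, len. After initializing the asynchronous I/O structure, it calls the aio_write method of file_operations. For an ext4 file on this code path, that method is ext4_file_write( ) in the ext4_file_operations structure.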
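
The same iovec layout (a buffer address plus a length) is exposed to user space through writev(). The following is a minimal sketch, assuming a POSIX system; write_two_segments is a hypothetical helper name. It builds two iovec entries the way do_sync_write( ) builds its single one.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write two NUL-terminated strings with a single writev() call.
 * Returns the total number of bytes written, or -1 on failure.
 */
ssize_t write_two_segments(const char *path, const char *a, const char *b)
{
    assert(path != NULL && a != NULL && b != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Each iovec records a buffer address and a length, like iov in do_sync_write(). */
    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = strlen(a) },
        { .iov_base = (void *)b, .iov_len = strlen(b) },
    };

    ssize_t n = writev(fd, iov, 2);
    close(fd);
    return n;
}
```

do_sync_write( ) is the degenerate case of this pattern: one segment, passed to aio_write with a segment count of 1.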

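The rest of this walkthrough is specific to the ext4 file system; the function examined next is the ext4 continuation of aio_write( ).

2.4 ext4_file_write( )

2.4.1 Extents in the ext4 file system

Older Linux file systems such as Ext2/3 use indirect block mapping: every block of a file has to be recorded individually, which makes operations on large files (deletion, for example) inefficient. Ext4 introduces extents to replace the traditional block mapping used by Ext2/3. With 4KB blocks, a single ext4 extent can map up to 128MB of contiguous physical storage.

For example, with a 4KB block size, Ext3 has to build a mapping for 25600 data blocks to describe a 100MB file. With extents, where each extent is a run of contiguous data blocks, the same file can be described as "the data is stored in the next 25600 data blocks", which is far more efficient.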
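
The arithmetic behind these figures can be checked with a short sketch. The helper names are hypothetical, and the limit of 32768 blocks per extent is derived here from the 128MB figure at a 4KB block size rather than read from kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size blocks needed to hold file_bytes (rounded up). */
uint64_t blocks_needed(uint64_t file_bytes, uint32_t block_size)
{
    assert(block_size > 0);
    return (file_bytes + block_size - 1) / block_size;
}

/* Fewest extents that can describe nblocks contiguous blocks. */
uint64_t min_extents(uint64_t nblocks, uint32_t max_blocks_per_extent)
{
    assert(max_blocks_per_extent > 0);
    return (nblocks + max_blocks_per_extent - 1) / max_blocks_per_extent;
}
```

A 100MB file at 4KB per block needs 25600 block entries under block mapping, but a single extent if the blocks are contiguous, since 25600 is below 32768.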

The main data structures of extent mode are ext4_extent, ext4_extent_idx, and ext4_extent_header, all defined in fs/ext4/ext4_extents.h.

    /*
     * This is the extent on-disk structure.
     * It's used at the bottom of the tree.
     */
    struct ext4_extent {
        __le32 ee_block;    /* first logical block extent covers */
        __le16 ee_len;      /* number of blocks covered by extent */
        __le16 ee_start_hi; /* high 16 bits of physical block */
        __le32 ee_start_lo; /* low 32 bits of physical block */
    };

    /*
     * This is index on-disk structure.
     * It's used at all the levels except the bottom.
     */
    struct ext4_extent_idx {
        __le32 ei_block;   /* index covers logical blocks from 'block' */
        __le32 ei_leaf_lo; /* pointer to the physical block of the next *
                            * level. leaf or next index could be there */
        __le16 ei_leaf_hi; /* high 16 bits of physical block */
        __u16  ei_unused;
    };

    /*
     * Each block (leaves and indexes), even inode-stored has header.
     */
    struct ext4_extent_header {
        __le16 eh_magic;      /* probably will support different formats */
        __le16 eh_entries;    /* number of valid entries */
        __le16 eh_max;        /* capacity of store in entries */
        __le16 eh_depth;      /* has tree real underlying blocks? */
        __le32 eh_generation; /* generation of the tree */
    };
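
The ee_start_hi and ee_start_lo fields together hold one 48-bit physical block number. A minimal user-space sketch of how the two halves combine, assuming the values have already been converted from little-endian to host byte order (toy_extent_pblock is a hypothetical name, not the kernel helper):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the high 16 bits and low 32 bits of an extent's starting
 * physical block into one 48-bit block number. Both inputs are assumed
 * to be in host byte order already.
 */
uint64_t toy_extent_pblock(uint16_t ee_start_hi, uint32_t ee_start_lo)
{
    uint64_t pblock = ((uint64_t)ee_start_hi << 32) | ee_start_lo;

    assert(pblock < (1ULL << 48));  /* 16 + 32 bits never exceed 48 bits */
    return pblock;
}
```

The ei_leaf_hi and ei_leaf_lo fields of ext4_extent_idx split the block number of the next tree level in the same way.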

2.4.2 ext4_file_write( ) 

    static ssize_t
    ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
    {
        struct file *file = iocb->ki_filp;
        struct inode *inode = file_inode(iocb->ki_filp);
        struct mutex *aio_mutex = NULL;
        struct blk_plug plug;
        int o_direct = io_is_direct(file);
        int overwrite = 0;
        size_t length = iov_iter_count(from);
        ssize_t ret;
        loff_t pos = iocb->ki_pos;

        /*
         * Unaligned direct AIO must be serialized; see comment above
         * In the case of O_APPEND, assume that we must always serialize
         */
        if (o_direct &&
            ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
            !is_sync_kiocb(iocb) &&
            (file->f_flags & O_APPEND ||
             ext4_unaligned_aio(inode, from, pos))) {
            aio_mutex = ext4_aio_mutex(inode);
            mutex_lock(aio_mutex);
            ext4_unwritten_wait(inode);
        }
        mutex_lock(&inode->i_mutex);
        if (file->f_flags & O_APPEND)
            iocb->ki_pos = pos = i_size_read(inode);

        /*
         * If we have encountered a bitmap-format file, the size limit
         * is smaller than s_maxbytes, which is for extent-mapped files.
         */
        if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
            struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

            if ((pos > sbi->s_bitmap_maxbytes) ||
                (pos == sbi->s_bitmap_maxbytes && length > 0)) {
                mutex_unlock(&inode->i_mutex);
                ret = -EFBIG;
                goto errout;
            }

            if (pos + length > sbi->s_bitmap_maxbytes)
                iov_iter_truncate(from, sbi->s_bitmap_maxbytes - pos);
        }

        iocb->private = &overwrite;
        if (o_direct) {
            blk_start_plug(&plug);
            /* check whether we do a DIO overwrite or not */
            if (ext4_should_dioread_nolock(inode) && !aio_mutex &&
                !file->f_mapping->nrpages && pos + length <= i_size_read(inode)) {
                struct ext4_map_blocks map;
                unsigned int blkbits = inode->i_blkbits;
                int err, len;

                map.m_lblk = pos >> blkbits;
                map.m_len = (EXT4_BLOCK_ALIGN(pos + length, blkbits) >> blkbits)
                            - map.m_lblk;
                len = map.m_len;

                err = ext4_map_blocks(NULL, inode, &map, 0);
                /*
                 * 'err==len' means that all of blocks has
                 * been preallocated no matter they are
                 * initialized or not. For excluding
                 * unwritten extents, we need to check
                 * m_flags. There are two conditions that
                 * indicate for initialized extents. 1) If we
                 * hit extent cache, EXT4_MAP_MAPPED flag is
                 * returned; 2) If we do a real lookup,
                 * non-flags are returned. So we should check
                 * these two conditions.
                 */
                if (err == len && (map.m_flags & EXT4_MAP_MAPPED))
                    overwrite = 1;
            }
        }

        ret = __generic_file_write_iter(iocb, from);
        mutex_unlock(&inode->i_mutex);

        if (ret > 0) {
            ssize_t err;
            err = generic_write_sync(file, iocb->ki_pos - ret, ret);
            if (err < 0)
                ret = err;
        }
        if (o_direct)
            blk_finish_plug(&plug);

    errout:
        if (aio_mutex)
            mutex_unlock(aio_mutex);
        return ret;
    }
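
One line in the listing above, the O_APPEND branch that resets pos to i_size_read(inode), has a directly observable effect: a write on an O_APPEND descriptor lands at end of file regardless of the current file position. A minimal sketch, assuming a POSIX system (size_after_append_at_zero is a hypothetical helper name):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a 5-byte file, reopen it with O_APPEND, seek back to offset 0,
 * write 3 more bytes, and return the resulting file size (-1 on failure).
 */
off_t size_after_append_at_zero(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "12345", 5) != 5) {
        close(fd);
        return -1;
    }
    close(fd);

    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
        return -1;

    off_t size = -1;
    /* The seek succeeds, but O_APPEND makes the write start at end of file. */
    if (lseek(fd, 0, SEEK_SET) == 0 && write(fd, "abc", 3) == 3)
        size = lseek(fd, 0, SEEK_END);

    close(fd);
    return size;
}
```

If the write had honored the seek, the first three bytes would have been overwritten and the size would still be 5; with O_APPEND the file grows to 8 bytes.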

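The function first checks whether the file uses ext4's extent mode. If it uses the traditional block mapping instead, the file size is checked: a write that starts at or beyond sbi->s_bitmap_maxbytes fails with -EFBIG, and a write whose end (current position plus the length of the data) would pass the limit is truncated so that the final file size does not exceed sbi->s_bitmap_maxbytes.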

    if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
        struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

        if ((pos > sbi->s_bitmap_maxbytes) ||
            (pos == sbi->s_bitmap_maxbytes && length > 0)) {
            mutex_unlock(&inode->i_mutex);
            ret = -EFBIG;
            goto errout;
        }

        if (pos + length > sbi->s_bitmap_maxbytes)
            iov_iter_truncate(from, sbi->s_bitmap_maxbytes - pos);
    }
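
The size check in this excerpt can be modeled as a pure function. The sketch below is a user-space model under stated assumptions: clamp_to_bitmap_limit is a hypothetical name, maxbytes stands in for sbi->s_bitmap_maxbytes, and the inputs are assumed small enough that pos + length does not overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Model of the bitmap-format size check: returns the number of bytes
 * that may be written starting at pos, or -EFBIG if the write would
 * begin at or beyond the limit.
 */
int64_t clamp_to_bitmap_limit(int64_t pos, int64_t length, int64_t maxbytes)
{
    assert(pos >= 0 && length >= 0 && maxbytes >= 0);

    if (pos > maxbytes || (pos == maxbytes && length > 0))
        return -EFBIG;             /* nothing can be written here */
    if (pos + length > maxbytes)
        return maxbytes - pos;     /* same effect as iov_iter_truncate() */
    return length;                 /* the whole write fits */
}
```

Note the edge case the kernel code handles explicitly: a zero-length write exactly at the limit is allowed, while any non-empty write there is rejected.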

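__generic_file_write_iter( ) (generic_file_aio_write( ) on the older code path) does the main work of ext4_file_write( ). If the I/O was not block-aligned, i_aio_mutex must also be unlocked after the write completes.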

    ret = __generic_file_write_iter(iocb, from);
    mutex_unlock(&inode->i_mutex);

    if (ret > 0) {
        ssize_t err;
        err = generic_write_sync(file, iocb->ki_pos - ret, ret);
        if (err < 0)
            ret = err;
    }
    if (o_direct)
        blk_finish_plug(&plug);

2.5 generic_file_write_iter( )

The source of generic_file_write_iter( ) (the successor of generic_file_aio_write( )) is:

    /**
     * generic_file_write_iter - write data to a file
     * @iocb: IO state structure
     * @from: iov_iter with data to write
     *
     * This is a wrapper around __generic_file_write_iter() to be used by most
     * filesystems. It takes care of syncing the file in case of O_SYNC file
     * and acquires i_mutex as needed.
     */
    ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
    {
        struct file *file = iocb->ki_filp;
        struct inode *inode = file->f_mapping->host;
        ssize_t ret;

        mutex_lock(&inode->i_mutex);
        ret = __generic_file_write_iter(iocb, from);
        mutex_unlock(&inode->i_mutex);
        if (ret > 0) {
            ssize_t err;

            err = generic_write_sync(file, iocb->ki_pos - ret, ret);
            if (err < 0)
                ret = err;
        }
        return ret;
    }
    EXPORT_SYMBOL(generic_file_write_iter);

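do_sync_write() has already recorded the starting position of the write in iocb->ki_pos. The function then runs the main routine __generic_file_write_iter( ), taking the lock before the write and releasing it afterwards. If the write succeeds, the return value is the number of bytes written, which is greater than 0; if the write fails, the corresponding error code is returned. After a successful write, generic_write_sync() is called; as the comment in the listing notes, this takes care of syncing the data to disk for O_SYNC files.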

2.6 __generic_file_write_iter( )

    /**
     * __generic_file_write_iter - write data to a file
     * @iocb: IO state structure (file, offset, etc.)
     * @from: iov_iter with data to write
     *
     * This function does all the work needed for actually writing data to a
     * file. It does all basic checks, removes SUID from the file, updates
     * modification times and calls proper subroutines depending on whether we
     * do direct IO or a standard buffered write.
     *
     * It expects i_mutex to be grabbed unless we work on a block device or similar
     * object which does not need locking at all.
     *
     * This function does *not* take care of syncing data in case of O_SYNC write.
     * A caller has to handle it. This is mainly due to the fact that we want to
     * avoid syncing under i_mutex.
     */
    ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
    {
        struct file *file = iocb->ki_filp;
        struct address_space *mapping = file->f_mapping;
        struct inode *inode = mapping->host;
        loff_t pos = iocb->ki_pos;
        ssize_t written = 0;
        ssize_t err;
        ssize_t status;
        size_t count = iov_iter_count(from);

        /* We can write back this queue in page reclaim */
        current->backing_dev_info = inode_to_bdi(inode);
        err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
        if (err)
            goto out;

        if (count == 0)
            goto out;
        iov_iter_truncate(from, count);

        err = file_remove_suid(file);
        if (err)
            goto out;

        err = file_update_time(file);
        if (err)
            goto out;

        if (io_is_direct(file)) {
            loff_t endbyte;

            written = generic_file_direct_write(iocb, from, pos);
            /*
             * If the write stopped short of completing, fall back to
             * buffered writes. Some filesystems do this for writes to
             * holes, for example. For DAX files, a buffered write will
             * not succeed (even if it did, DAX does not handle dirty
             * page-cache pages correctly).
             */
            if (written < 0 || written == count || IS_DAX(inode))
                goto out;

            pos += written;
            count -= written;

            status = generic_perform_write(file, from, pos);
            /*
             * If generic_perform_write() returned a synchronous error
             * then we want to return the number of bytes which were
             * direct-written, or the error code if that was zero. Note
             * that this differs from normal direct-io semantics, which
             * will return -EFOO even if some bytes were written.
             */
            if (unlikely(status < 0)) {
                err = status;
                goto out;
            }
            iocb->ki_pos = pos + status;
            /*
             * We need to ensure that the page cache pages are written to
             * disk and invalidated to preserve the expected O_DIRECT
             * semantics.
             */
            endbyte = pos + status - 1;
            err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
            if (err == 0) {
                written += status;
                invalidate_mapping_pages(mapping,
                                         pos >> PAGE_CACHE_SHIFT,
                                         endbyte >> PAGE_CACHE_SHIFT);
            } else {
                /*
                 * We don't know how much we wrote, so just return
                 * the number of bytes which were direct-written
                 */
            }
        } else {
            written = generic_perform_write(file, from, pos);
            if (likely(written >= 0))
                iocb->ki_pos = pos + written;
        }
    out:
        current->backing_dev_info = NULL;
        return written ? written : err;
    }
    EXPORT_SYMBOL(__generic_file_write_iter);

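generic_write_checks( ) updates count to the amount of data that can actually be written (in most cases it is unchanged; count only changes when the data to be written would exceed the file size limit).

generic_write_checks( ) checks whether the write is permitted for this file, which depends on whether the system imposes a file size limit. file_remove_suid( ) then clears the file's suid flag, and the sgid flag as well if the file is executable. Because writing modifies (or creates) the file, file_update_time( ) records the modification time in the inode and marks the inode dirty so that it is written back to disk.

For Direct I/O, generic_file_direct_write( ) is tried first. Everything before that point is validity checking and recording the change to the file and its modification time; the main work of a buffered write is done by generic_perform_write( ).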

2.7 generic_perform_write( )

    ssize_t generic_perform_write(struct file *file,
                                  struct iov_iter *i, loff_t pos)
    {
        struct address_space *mapping = file->f_mapping;
        const struct address_space_operations *a_ops = mapping->a_ops;
        long status = 0;
        ssize_t written = 0;
        unsigned int flags = 0;

        /*
         * Copies from kernel address space cannot fail (NFSD is a big user).
         */
        if (!iter_is_iovec(i))
            flags |= AOP_FLAG_UNINTERRUPTIBLE;

        // An I/O operation issued from inside the kernel cannot be interrupted
        // (user-space I/O can), so AOP_FLAG_UNINTERRUPTIBLE is set for it.
        do {
            struct page *page;
            unsigned long offset; /* Offset into pagecache page */
            unsigned long bytes;  /* Bytes to write to page */
            size_t copied;        /* Bytes copied from user */
            void *fsdata;

            offset = (pos & (PAGE_CACHE_SIZE - 1));
            bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
                          iov_iter_count(i));
            // index:  page-cache index of pos, in units of pages
            //         (not a separate variable in this listing)
            // offset: offset within the page
            // bytes:  amount of data to copy from user space
    again:
            /*
             * Bring in the user page that we will copy from _first_.
             * Otherwise there's a nasty deadlock on copying from the
             * same page as we're writing to, without it being marked
             * up-to-date.
             *
             * Not only is this an optimisation, but it is also required
             * to check that the address is actually valid, when atomic
             * usercopies are used, below.
             */
            if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
                status = -EFAULT;
                break;
            }

            // Call the write_begin method of the address_space object
            // (file->f_mapping). write_begin allocates and initializes the
            // buffer heads for the page. The ext4 implementation under
            // delayed allocation is ext4_da_write_begin().
            status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                        &page, &fsdata);
            if (unlikely(status < 0))
                break;

            if (mapping_writably_mapped(mapping))
                flush_dcache_page(page);

            // mapping->i_mmap_writable counts VM_SHARED mappings. If
            // mapping_writably_mapped() is non-zero the page is shared, so
            // flush_dcache_page() is called to write the page's data from
            // the dcache to memory and keep the two consistent. On x86,
            // flush_dcache_page() is empty and does nothing.
            copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
            flush_dcache_page(page);

            status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                      page, fsdata);
            if (unlikely(status < 0))
                break;
            copied = status;

            cond_resched();

            // After the data has been copied into kernel space, the write_end
            // method of ext4's address_space_operations is called. ext4 has
            // four modes: writeback, ordered, journalled and delay allocation.
            // An ext4 partition is mounted with delay allocation by default,
            // and the matching write_end method is ext4_da_write_end().
            // cond_resched() checks the TIF_NEED_RESCHED flag of the current
            // process; if it is set, schedule() is called to run another task.
            iov_iter_advance(i, copied);
            if (unlikely(copied == 0)) {
                /*
                 * If we were unable to copy any data at all, we must
                 * fall back to a single segment length write.
                 *
                 * If we didn't fallback here, we could livelock
                 * because not all segments in the iov can be copied at
                 * once without a pagefault.
                 */
                bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
                              iov_iter_single_seg_count(i));
                goto again;
            }
            // Once a_ops->write_end() returns, the write is complete (the data
            // is not necessarily on disk yet, because most writes are
            // asynchronous). The iov_iter is then advanced: file position,
            // amount written, and data location. If copied is 0, nothing was
            // copied from user space to the kernel, so the write is retried.
            pos += copied;
            written += copied;
            // Update the file position pos and the number of bytes written.
            balance_dirty_pages_ratelimited(mapping);
            if (fatal_signal_pending(current)) {
                status = -EINTR;
                break;
            }
        } while (iov_iter_count(i));
        // balance_dirty_pages_ratelimited() checks whether the proportion of
        // dirty pages in the page cache exceeds a threshold (the original text
        // cites 40% of system pages; the value is set by the vm.dirty_ratio
        // tunable). Above the threshold, writeback_inodes() is called to flush
        // some tens of pages to disk.
        return written ? written : status;
    }
    EXPORT_SYMBOL(generic_perform_write);
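
The offset and bytes computation at the top of the loop decides how a write is cut into per-page copies. The sketch below is a user-space model of that arithmetic only, assuming a 4096-byte page; TOY_PAGE_SIZE and count_page_chunks are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size for this illustration */

/* Count the per-page copy steps needed for a write of len bytes at pos. */
size_t count_page_chunks(uint64_t pos, size_t len)
{
    /* The mask trick below only works for a power-of-two page size. */
    assert((TOY_PAGE_SIZE & (TOY_PAGE_SIZE - 1)) == 0);

    size_t chunks = 0;
    while (len > 0) {
        size_t offset = (size_t)(pos & (TOY_PAGE_SIZE - 1)); /* offset inside the page */
        size_t bytes = TOY_PAGE_SIZE - offset;               /* room left in this page */
        if (bytes > len)
            bytes = len;                                     /* last, partial chunk */

        pos += bytes;
        len -= bytes;
        chunks++;
    }
    return chunks;
}
```

For example, a 5000-byte write at position 4000 is split into chunks of 96, 4096, and 808 bytes, so the loop body of generic_perform_write( ) would run three times, with one write_begin/write_end pair per page.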

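2.7.1 address_space_operations in the ext4 file system

2.7.2 The delayed allocation mechanism in the ext4 file system

Delayed allocation, also known as allocate-on-flush, can improve file system performance. It applies only to buffered I/O: the disk block allocation that every write would otherwise involve is postponed until writeback, so the file system allocates blocks only when the data is actually about to be written to disk. This differs from file systems that allocate the necessary blocks early. Because ext4 can then base its allocation decisions on the real file size, the approach also reduces fragmentation.

In an ordinary buffered write, the system only allocates memory pages (page cache) for the data and stores the data there, then waits for the user to force a flush with fsync or a similar call, or for the system to trigger periodic writeback. Without delayed allocation, the system allocates the corresponding disk blocks while the data is being copied into the page cache.

With delalloc (delayed allocation) the flow is slightly different. On each buffered write the data is stored in the page cache, but the system does not allocate disk blocks for it; it only checks whether blocks have already been allocated for the data, to decide whether allocation will be needed later. When the user calls fsync or the system triggers writeback, the system tries to allocate blocks for the data marked as needing them. This lets the file system allocate disk space that is as contiguous as possible for data belonging to the same file, which improves later access to the file.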
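
The difference between the two flows can be illustrated with a toy model. This is not ext4 code: every name below (toy_file, toy_write, toy_writeback, toy_alloc_calls) is hypothetical, and the model only counts how often an allocator would be invoked, assuming one call can cover a whole contiguous run of blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLOCKS 64

struct toy_file {
    bool dirty[TOY_BLOCKS];   /* data is sitting in the "page cache" */
    bool mapped[TOY_BLOCKS];  /* a disk block has been assigned */
    int alloc_calls;          /* how many times the allocator ran */
};

/* Buffered write of one logical block; allocates at once unless delalloc. */
void toy_write(struct toy_file *f, size_t lblk, bool delalloc)
{
    assert(lblk < TOY_BLOCKS);
    f->dirty[lblk] = true;
    if (!delalloc && !f->mapped[lblk]) {
        f->mapped[lblk] = true;
        f->alloc_calls++;
    }
}

/* Writeback: one allocator call per contiguous run of unmapped dirty blocks. */
void toy_writeback(struct toy_file *f)
{
    size_t i = 0;
    while (i < TOY_BLOCKS) {
        if (f->dirty[i] && !f->mapped[i]) {
            f->alloc_calls++;
            while (i < TOY_BLOCKS && f->dirty[i] && !f->mapped[i]) {
                f->mapped[i] = true;
                f->dirty[i] = false;
                i++;
            }
        } else {
            f->dirty[i] = false;
            i++;
        }
    }
}

/* Write nblocks consecutive blocks, flush, and report allocator calls. */
int toy_alloc_calls(bool delalloc, size_t nblocks)
{
    struct toy_file f = {0};
    for (size_t i = 0; i < nblocks; i++)
        toy_write(&f, i, delalloc);
    toy_writeback(&f);
    return f.alloc_calls;
}
```

In this model, eight consecutive one-block writes cost eight allocator calls when blocks are assigned at write time, but a single call when allocation is deferred to writeback, which is also why the deferred allocation has a better chance of being contiguous.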

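2.7.3 After generic_perform_write( ) completes

Once generic_perform_write() has finished, three things are true:
(1) The data has been copied from user space into the page cache (kernel space);
(2) The data pages are marked dirty;
(3) The data has not yet been written to disk; this is the "delayed write" technique. Later we will analyze when, where, and how the data is written to disk.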

