Virtualization consists of three main areas: CPU virtualization, memory virtualization, and device virtualization. This series of articles focuses on how disk devices are virtualized, following a read I/O request through its entire journey, from the guest VM all the way to the point where it is finally handled. The Linux kernel code referenced in this series is version 3.7.10; the virtualization platform is KVM, and the QEMU version is 1.6.1.
To access an I/O device, a user program has to go through the interface provided by the operating system, i.e. a system call. When the program issues a read, the arguments of the read are saved first, and then int 0x80 (or possibly sysenter) is executed to trap into kernel space; inside the kernel, the read logic is implemented by the sys_read function.
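As a concrete illustration, the minimal user-space program below triggers exactly this path; the file name /tmp/example.txt is made up for the example, and the libc read() wrapper is what eventually lands in sys_read.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* open() and read() are thin wrappers around the corresponding system
     * calls; the read() below enters the kernel and ends up in sys_read. */
    int fd = open("/tmp/example.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
```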
Before walking through the implementation of sys_read, let us look at the layers a read operation has to traverse in kernel space. A read first passes through the virtual file system (VFS) layer, then the concrete file system layer, the page cache layer, the generic block layer, the I/O scheduler layer, the block device driver layer, and finally the block device layer itself; a sketch mapping these layers to the kernel functions covered in this article follows the list below.
- Virtual file system layer: hides the concrete operations of the layers below and offers a uniform interface to the layers above, such as vfs_read and vfs_write, which do their work by calling into the corresponding interface of the concrete file system underneath.
- Concrete file system layer: provides the operations and implementation specific to each type of file system; this is where the file-system-specific processing logic lives.
- Page cache layer: caches data fetched from block devices. The point of this layer is to avoid frequent block device accesses: if the data an I/O request needs is already in the page cache, it can be returned directly without touching the block device.
- Generic block layer: receives I/O requests from the layers above and ultimately issues them downward, shielding the upper layers from the peculiarities of the underlying devices.
- I/O scheduler layer: receives the I/O requests issued by the generic block layer, queues them and tries to merge adjacent requests (when their data is contiguous on disk), and then, according to the configured scheduling algorithm, calls back into the request handlers provided by the driver layer to process the individual I/O requests.
- Block device driver layer: takes requests from the layer above and, based on their parameters, operates the actual device.
- Block device layer: the physical device itself.
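To tie these layers to the code that follows, the sketch below lists the functions this article visits at each layer for a buffered read on ext4. It is only a summary of the call chain traced in the rest of the article, not an exhaustive call graph.

```c
/*
 * read()                            user space (libc wrapper around the syscall)
 *   sys_read()                      system call entry, fs/read_write.c
 *     vfs_read()                    VFS layer
 *       do_sync_read()              generic synchronous wrapper
 *         generic_file_aio_read()   ext4's .aio_read
 *           do_generic_file_read()  page cache layer
 *             ext4_readpage() -> mpage_readpage()   concrete fs builds a bio
 *               submit_bio()        generic block layer
 *                 -> I/O scheduler -> block device driver -> disk
 */
```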
With this layering in mind, let us look at how the read path is implemented in code.
sys_read is declared in include/linux/syscalls.h, and its implementation is the SYSCALL_DEFINE3(read, ...) definition in fs/read_write.c:
```c
/* declaration, include/linux/syscalls.h */
asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count);

/* definition, fs/read_write.c */
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos = file_pos_read(f.file);
        ret = vfs_read(f.file, buf, count, &pos); /* call the read operation of the VFS layer */
        file_pos_write(f.file, pos);              /* update the current file position */
        fdput(f);
    }
    return ret;
}
```
vfs_read belongs to the VFS layer and is defined in fs/read_write.c. Its main job is to invoke the read operation of the concrete file system; if the concrete file system does not provide a read operation, the default do_sync_read function is used instead.
```c
ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
    ssize_t ret;

    if (!(file->f_mode & FMODE_READ))
        return -EBADF;
    if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
        return -EINVAL;
    if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
        return -EFAULT;

    ret = rw_verify_area(READ, file, pos, count);
    if (ret >= 0) {
        count = ret;
        if (file->f_op->read) {
            ret = file->f_op->read(file, buf, count, pos); /* read supplied by the concrete file system */
        } else
            ret = do_sync_read(file, buf, count, pos);     /* the kernel's default read implementation */
        if (ret > 0) {
            fsnotify_access(file);
            add_rchar(current, ret);
        }
        inc_syscr(current);
    }

    return ret;
}
```
file->f_op has the type struct file_operations, which defines a set of function pointers covering file operations; for different file systems these pointers refer to different implementations. Taking ext4 as an example, the structure is initialized in fs/ext4/file.c, and from this initialization we can see that ext4's read operation uses the kernel's own do_sync_read() function.
```c
const struct file_operations ext4_file_operations = {
    .llseek         = ext4_llseek,
    .read           = do_sync_read,
    .write          = do_sync_write,
    .aio_read       = generic_file_aio_read,
    .aio_write      = ext4_file_write,
    .unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl   = ext4_compat_ioctl,
#endif
    .mmap           = ext4_file_mmap,
    .open           = ext4_file_open,
    .release        = ext4_release_file,
    .fsync          = ext4_sync_file,
    .splice_read    = generic_file_splice_read,
    .splice_write   = generic_file_splice_write,
    .fallocate      = ext4_fallocate,
};
```
do_sync_read() is defined in fs/read_write.c:
```c
ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct kiocb kiocb;
    ssize_t ret;

    init_sync_kiocb(&kiocb, filp); /* initialize the kiocb, which tracks the completion state of the I/O operation */
    kiocb.ki_pos = *ppos;
    kiocb.ki_left = len;
    kiocb.ki_nbytes = len;

    for (;;) {
        /* call the function that actually performs the read; for ext4 this is
         * the .aio_read pointer configured in fs/ext4/file.c */
        ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
        if (ret != -EIOCBRETRY)
            break;
        wait_on_retry_sync_kiocb(&kiocb);
    }

    if (-EIOCBQUEUED == ret)
        ret = wait_on_sync_kiocb(&kiocb);
    *ppos = kiocb.ki_pos;
    return ret;
}
```
In ext4 the filp->f_op->aio_read pointer refers to generic_file_aio_read, defined in mm/filemap.c. This function has two execution paths: if the file was opened with O_DIRECT, the read bypasses the page cache and goes straight to disk; otherwise it calls do_generic_file_read and tries to satisfy the request from the page cache.
```c
ssize_t
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
        unsigned long nr_segs, loff_t pos)
{
    struct file *filp = iocb->ki_filp;
    ssize_t retval;
    unsigned long seg = 0;
    size_t count;
    loff_t *ppos = &iocb->ki_pos;

    count = 0;
    retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
    if (retval)
        return retval;

    /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
    if (filp->f_flags & O_DIRECT) {
        loff_t size;
        struct address_space *mapping;
        struct inode *inode;
        struct timex txc;

        do_gettimeofday(&(txc.time)); /* note: txc is not used below */
        mapping = filp->f_mapping;
        inode = mapping->host;
        if (!count)
            goto out; /* skip atime */
        size = i_size_read(inode);
        if (pos < size) {
            retval = filemap_write_and_wait_range(mapping, pos,
                    pos + iov_length(iov, nr_segs) - 1);
            if (!retval) {
                /* bypass the page cache: let the fs do direct I/O */
                retval = mapping->a_ops->direct_IO(READ, iocb,
                        iov, pos, nr_segs);
            }
            if (retval > 0) {
                *ppos = pos + retval;
                count -= retval;
            }

            /*
             * Btrfs can have a short DIO read if we encounter
             * compressed extents, so if there was an error, or if
             * we've already read everything we wanted to, or if
             * there was a short read because we hit EOF, go ahead
             * and return. Otherwise fallthrough to buffered io for
             * the rest of the read.
             */
            if (retval < 0 || !count || *ppos >= size) {
                file_accessed(filp);
                goto out;
            }
        }
    }

    count = retval;
    for (seg = 0; seg < nr_segs; seg++) {
        read_descriptor_t desc;
        loff_t offset = 0;

        /*
         * If we did a short DIO read we need to skip the section of the
         * iov that we've already read data into.
         */
        if (count) {
            if (count > iov[seg].iov_len) {
                count -= iov[seg].iov_len;
                continue;
            }
            offset = count;
            count = 0;
        }

        desc.written = 0;
        desc.arg.buf = iov[seg].iov_base + offset;
        desc.count = iov[seg].iov_len - offset;
        if (desc.count == 0)
            continue;
        desc.error = 0;
        /* buffered path: read through the page cache */
        do_generic_file_read(filp, ppos, &desc, file_read_actor);
        retval += desc.written;
        if (desc.error) {
            retval = retval ?: desc.error;
            break;
        }
        if (desc.count > 0)
            break;
    }
out:
    return retval;
}
```
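The O_DIRECT branch above goes straight to mapping->a_ops->direct_IO and bypasses the page cache. From user space this path is selected simply by opening the file with O_DIRECT; note that direct I/O generally requires the buffer, offset, and length to be suitably aligned (4096 bytes is assumed in this sketch, and the file path is made up; the real alignment requirement depends on the device and file system):

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;

    /* direct I/O needs an aligned buffer; 4096 covers most devices */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    /* O_DIRECT makes generic_file_aio_read skip the page cache */
    int fd = open("/tmp/example.txt", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        free(buf);
        return 1;
    }

    ssize_t n = read(fd, buf, 4096);
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes directly from disk\n", n);

    close(fd);
    free(buf);
    return 0;
}
```

For the common buffered case, the function falls through to do_generic_file_read, described next.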
do_generic_file_read is defined in mm/filemap.c and works through the page cache layer. If the requested data is already in the page cache and the cached page is up to date, the data is copied out and returned directly; if the page is missing, or not yet up to date, the page cache triggers a read from disk. That disk read is not limited to the blocks holding the requested data: a readahead mechanism fetches additional data in order to raise the cache hit rate and reduce the number of disk accesses.
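The core of this buffered path can be summarized by the heavily simplified sketch below. It uses helper names from this kernel era (find_get_page, PageUptodate, the readpage callback) but omits readahead, fine-grained locking, error handling, and partial-page bookkeeping, so treat it as an outline of what do_generic_file_read does rather than its actual code:

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/* simplified outline of the buffered read loop (not the real kernel code) */
static void buffered_read_sketch(struct file *filp, pgoff_t index,
                                 read_descriptor_t *desc)
{
    struct address_space *mapping = filp->f_mapping;
    struct page *page;

    page = find_get_page(mapping, index);          /* look up the page in the page cache */
    if (!page) {
        page = page_cache_alloc_cold(mapping);     /* cache miss: allocate a fresh page ... */
        if (!page)
            return;
        if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
            page_cache_release(page);
            return;
        }
        mapping->a_ops->readpage(filp, page);      /* ... and ask the file system to fill it from disk */
    } else if (!PageUptodate(page)) {
        lock_page(page);
        mapping->a_ops->readpage(filp, page);      /* cached but not up to date: read it again */
    }

    wait_on_page_locked(page);                     /* readpage unlocks the page when the I/O completes */
    if (PageUptodate(page))
        file_read_actor(desc, page, 0, PAGE_CACHE_SIZE); /* copy the data to user space */
    page_cache_release(page);
}
```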
The functions in the page cache layer that actually read from disk belong to the readpage family. For ext4, the concrete implementations behind the readpage pointers are set up in fs/ext4/inode.c, which defines several struct address_space_operations instances corresponding to the different journaling modes. We follow Linux's default ordered journaling mode to describe the I/O flow; the readpage operations for ordered mode are shown below.
```c
static const struct address_space_operations ext4_ordered_aops = {
    .readpage               = ext4_readpage,
    .readpages              = ext4_readpages,
    .writepage              = ext4_writepage,
    .write_begin            = ext4_write_begin,
    .write_end              = ext4_ordered_write_end,
    .bmap                   = ext4_bmap,
    .invalidatepage         = ext4_invalidatepage,
    .releasepage            = ext4_releasepage,
    .direct_IO              = ext4_direct_IO,
    .migratepage            = buffer_migrate_page,
    .is_partially_uptodate  = block_is_partially_uptodate,
    .error_remove_page      = generic_error_remove_page,
};
```
To keep the flow simple we take the simplest of these, ext4_readpage, which is implemented in fs/ext4/inode.c and does little more than call mpage_readpage. mpage_readpage lives in fs/mpage.c; it builds an I/O request (a bio) and submits it to the generic block layer.
```c
int mpage_readpage(struct page *page, get_block_t get_block)
{
    struct bio *bio = NULL;
    sector_t last_block_in_bio = 0;
    struct buffer_head map_bh;
    unsigned long first_logical_block = 0;

    map_bh.b_state = 0;
    map_bh.b_size = 0;
    /* build a bio describing the disk blocks that back this page */
    bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
            &map_bh, &first_logical_block, get_block);
    if (bio)
        mpage_bio_submit(READ, bio); /* hand the bio to the generic block layer */
    return 0;
}
```
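For completeness, mpage_bio_submit (also in fs/mpage.c) is only a thin wrapper: it installs the completion callback and passes the bio down via submit_bio, the entry point of the generic block layer. The snippet below is reproduced from memory of this kernel era, so treat it as illustrative rather than verbatim:

```c
static struct bio *mpage_bio_submit(int rw, struct bio *bio)
{
    bio->bi_end_io = mpage_end_io; /* called when the I/O completes */
    submit_bio(rw, bio);           /* enter the generic block layer */
    return NULL;
}
```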
The generic block layer dispatches the request onto the I/O queue of the target device, and the I/O scheduler eventually calls the corresponding driver interface to fetch the requested data.
This completes the I/O path inside the guest VM. The following articles will describe how an I/O operation traps from the guest VM into KVM, and how the I/O device is emulated in QEMU.
Reposted from: http://blog.csdn.net/dashulu/article/details/16820281