https://kernelgo.org/virtio-overview.html
http://lihanlu.cn/virtio-frontend-kick/
QEMU vhost block architecture analysis
https://blog.csdn.net/u012377031/article/details/38186329
The virtio guest notifier and host notifier use different fds: the host notifier uses an ioeventfd, the guest notifier an irqfd.
1. When a kick is generated: if the eventfd is monitored by the QEMU side, the corresponding QEMU function kvm_handle_io() runs; if vhost monitors it, vhost->handle_kick() runs directly in the vhost kernel module.
1. The usual use of eventfd is for userspace to notify userspace, or the kernel to notify userspace. For example, in the virtio-net implementation, when the guest kicks the host the eventfd notifies QEMU, and QEMU then processes the packets in userspace. vhost-net, however, processes packets in the kernel, so when the guest kicks the host the eventfd notifies the kernel thread vhost_worker. That usage is a bit different from the conventional eventfd pattern.
Data exchange between the guest and the host goes through vrings; alongside that there is an I/O notification mechanism:
- The host notifies the guest by injecting an interrupt: the virtual device hangs off an emulated interrupt controller and has its own interrupt line, and for a PCI device the interrupt information is written into the device's configuration space.
- The guest notifies the host through virtio reads/writes of device memory or I/O ports.
The second case splits into two classes: MMIO and PIO. With MMIO the virtual device is read and written like memory (mapped via mmap()); with PIO (ordinary I/O ports) the hypervisor traps the device I/O to virtualize it. The difference: MMIO is trapped through memory faults, PIO through interception of the I/O instruction.

virtio is the I/O virtualization framework used in KVM environments; devices that use it so far include block, net and scsi. In essence it is an efficient communication mechanism between the guest OS and the host OS. This article analyzes the eventfd that virtio uses for that communication. For brevity, notifying the host from the guest side is called a kick, and notifying the guest from the host side is called a call.
References:
1. virtio spec
2. Virtio basic concepts and device operations
3. note
1. What is eventfd?
An eventfd is a file that exists only in memory; a new one is created with the eventfd system call (sys_eventfd). It can be used for communication between threads or between processes, whether in kernel space or userspace. The implementation is not complicated; see fs/eventfd.c in the kernel source tree and the definition of struct eventfd_ctx:
struct eventfd_ctx {
	struct kref kref;
	wait_queue_head_t wqh;
	/*
	 * Every time that a write(2) is performed on an eventfd, the
	 * value of the __u64 being written is added to "count" and a
	 * wakeup is performed on "wqh". A read(2) will return the "count"
	 * value to userspace, and will reset "count" to zero. The kernel
	 * side eventfd_signal() also, adds to the "count" counter and
	 * issue a wakeup.
	 */
	__u64 count;
	unsigned int flags;
};
The eventfd "signal" is really just the count above: a write adds the written value to it and a read clears it (not strictly accurate in every mode; with EFD_SEMAPHORE a read only decrements it by one). The wait queue wqh holds the processes sleeping on the eventfd: whenever a process epolls/selects it and there is no pending signal (count == 0), it sleeps on wqh; when some process then writes the eventfd, the sleepers on wqh are woken. See eventfd_read() and eventfd_write() for the details.
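As a concrete illustration of the count/wqh semantics, here is a minimal userspace sketch (not taken from this article); it relies only on the eventfd(2) syscall described above:

#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* counter starts at 0; EFD_NONBLOCK makes read() fail instead of sleeping */
    int efd = eventfd(0, EFD_NONBLOCK);
    uint64_t val;

    val = 3;
    write(efd, &val, sizeof(val));   /* count += 3, wakes any sleeper on wqh */
    val = 2;
    write(efd, &val, sizeof(val));   /* count += 2 */

    read(efd, &val, sizeof(val));    /* returns 5 and resets count to 0 */
    printf("read %llu\n", (unsigned long long)val);

    if (read(efd, &val, sizeof(val)) < 0)
        perror("second read");       /* EAGAIN: count is 0, nothing to read */

    close(efd);
    return 0;
}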
The eventfd used by virtio is actually KVM's ioeventfd mechanism, another layer wrapped around eventfd (eventfd + iodevice). Rather than describing that mechanism in the abstract, the analysis below follows how virtio's kick uses it, in two parts:
1. Setup: how is this eventfd negotiated?
2. Kick generation: who is responsible for the write that produces the kick signal?
2. How is it set up?
That is, how does the QEMU userspace process negotiate with kvm.ko which eventfd to use for kick communication?
This relies on a KVM ioctl: ioctl(KVM_IOEVENTFD, struct kvm_ioeventfd). Searching the QEMU code for the path that issues this ioctl:
memory_region_transaction_commit() {
    address_space_update_ioeventfds() {
        address_space_add_del_ioeventfds() {
            MEMORY_LISTENER_CALL(eventfd_add, Reverse, &section,
                                 fd->match_data, fd->data, fd->e);
        }
    }
}
The eventfd_add above has two possible implementations:
1. mmio (memory-mapped I/O): kvm_mem_ioeventfd_add()
2. pio (port I/O): kvm_io_ioeventfd_add()
Static code analysis only uncovers half of that call path, so next gdb is used to see how memory_region_transaction_commit() can actually be reached. Using vhost_blk (a feature added in QEMU userspace, sitting at a level similar to qemu-virtio or dataplane in the I/O path), the path that sets up the eventfd looks like this:
#0  memory_region_transaction_commit () at /home/gavin4code/qemu-2-1-2/memory.c:799
#1  0x0000000000462475 in memory_region_add_eventfd (mr=0x1256068, addr=16, size=2, match_data=true, data=0, e=0x1253760) at /home/gavin4code/qemu-2-1-2/memory.c:1588
#2  0x00000000006d483e in virtio_pci_set_host_notifier_internal (proxy=0x1255820, n=0, assign=true, set_handler=false) at hw/virtio/virtio-pci.c:200
#3  0x00000000006d6361 in virtio_pci_set_host_notifier (d=0x1255820, n=0, assign=true) at hw/virtio/virtio-pci.c:884    <---- also called on the non-vhost path
#4  0x00000000004adb90 in vhost_dev_enable_notifiers (hdev=0x12e6b30, vdev=0x12561f8) at /home/gavin4code/qemu-2-1-2/hw/virtio/vhost.c:932
#5  0x00000000004764db in vhost_blk_start (vb=0x1256368) at /home/gavin4code/qemu-2-1-2/hw/block/vhost-blk.c:189
#6  0x00000000004740e5 in virtio_blk_handle_output (vdev=0x12561f8, vq=0x1253710) at /home/gavin4code/qemu-2-1-2/hw/block/virtio-blk.c:456
#7  0x00000000004a729e in virtio_queue_notify_vq (vq=0x1253710) at /home/gavin4code/qemu-2-1-2/hw/virtio/virtio.c:774
#8  0x00000000004a9196 in virtio_queue_host_notifier_read (n=0x1253760) at /home/gavin4code/qemu-2-1-2/hw/virtio/virtio.c:1265
#9  0x000000000073d23e in qemu_iohandler_poll (pollfds=0x119e4c0, ret=1) at iohandler.c:143
#10 0x000000000073ce41 in main_loop_wait (nonblocking=0) at main-loop.c:485
#11 0x000000000055524a in main_loop () at vl.c:2031
#12 0x000000000055c407 in main (argc=48, argv=0x7ffff99985c8, envp=0x7ffff9998750) at vl.c:4592
The overall path is quite clear: the vhost module's vhost_dev_enable_notifiers() tells KVM which eventfd will be used.
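For reference, the ioctl that this whole path boils down to can be sketched from plain userspace roughly as below. This is an illustrative assumption, not QEMU's code (QEMU's kvm_set_ioeventfd_pio does the real work); the helper name, the port address and the 2-byte length are made up:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Ask KVM to signal 'efd' whenever the guest does a 2-byte write to PIO
 * port 'addr' (e.g. the queue notify register), instead of exiting to
 * userspace. vm_fd is the fd returned by KVM_CREATE_VM. */
static int register_kick_eventfd(int vm_fd, int efd, uint64_t addr)
{
    struct kvm_ioeventfd kick = {
        .addr  = addr,
        .len   = 2,
        .fd    = efd,
        .flags = KVM_IOEVENTFD_FLAG_PIO,   /* port I/O, no datamatch */
    };

    return ioctl(vm_fd, KVM_IOEVENTFD, &kick);
}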
3. How is the kick signal generated?
Broadly: when the guest OS decides the host should process the requests on a virtqueue, it executes vp_notify(), which amounts to one port I/O (or MMIO) access, and the VM exits guest mode. Assuming Intel VMX, the detected PIO or MMIO sets exit_reason in the VMCS; in the host kernel vmx_handle_exit() checks exit_reason and runs the corresponding handler (handle_io() / kernel_pio()). The overall path is:
vmx_handle_exit() {
    /* kvm_vmx_exit_handlers[exit_reason](vcpu); */
    handle_io() {
        kvm_emulate_pio() {
            kernel_pio() {
                if (read) {
                    kvm_io_bus_read() {
                    }
                } else {
                    kvm_io_bus_write() {
                        ioeventfd_write();   /* vhost path */
                    }
                }
            }
        }
    }
}
Eventually ioeventfd_write() runs, which produces one kick signal.
If the eventfd is monitored by the QEMU side, the corresponding QEMU function kvm_handle_io() runs; if vhost monitors it, vhost->handle_kick() runs directly in the vhost kernel module.
The QEMU call stack for kvm_handle_io() looks like this:
Breakpoint 1, virtio_ioport_write (opaque=0x1606400, addr=18, val=0) at hw/virtio/virtio-pci.c:270
270     {
(gdb) t
[Current thread is 4 (Thread 0x414e7940 (LWP 29695))]
(gdb) bt
#0  virtio_ioport_write (opaque=0x1606400, addr=18, val=0) at hw/virtio/virtio-pci.c:270
#1  0x00000000006d4218 in virtio_pci_config_write (opaque=0x1606400, addr=18, val=0, size=1) at hw/virtio/virtio-pci.c:435
#2  0x000000000045c716 in memory_region_write_accessor (mr=0x1606c48, addr=18, value=0x414e6da8, size=1, shift=0, mask=255) at /home/gavin4code/qemu/memory.c:444
#3  0x000000000045c856 in access_with_adjusted_size (addr=18, value=0x414e6da8, size=1, access_size_min=1, access_size_max=4, access=0x45c689 <memory_region_write_accessor>, mr=0x1606c48) at /home/gavin4code/qemu/memory.c:481
#4  0x000000000045f84f in memory_region_dispatch_write (mr=0x1606c48, addr=18, data=0, size=1) at /home/gavin4code/qemu/memory.c:1138
#5  0x00000000004630be in io_mem_write (mr=0x1606c48, addr=18, val=0, size=1) at /home/gavin4code/qemu/memory.c:1976
#6  0x000000000040f030 in address_space_rw (as=0xd05d00 <address_space_io>, addr=49170, buf=0x7f4994f6b000 "", len=1, is_write=true) at /home/gavin4code/qemu/exec.c:2114
#7  0x0000000000458f62 in kvm_handle_io (port=49170, data=0x7f4994f6b000, direction=1, size=1, count=1) at /home/gavin4code/qemu/kvm-all.c:1674    <---- non-vhost path
#8  0x00000000004594c6 in kvm_cpu_exec (cpu=0x157ec50) at /home/gavin4code/qemu/kvm-all.c:1811
#9  0x0000000000440364 in qemu_kvm_cpu_thread_fn (arg=0x157ec50) at /home/gavin4code/qemu/cpus.c:930
#10 0x0000003705e0677d in start_thread () from /lib64/libpthread.so.0
#11 0x00000037056d49ad in clone () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
This concludes the analysis of virtio's eventfd mechanism.
[root@bogon qemu]# grep virtio_pci_config_write -rn *
hw/virtio/virtio-pci.c:443:static void virtio_pci_config_write(void *opaque, hwaddr addr,
hw/virtio/virtio-pci.c:479:    .write = virtio_pci_config_write,
[root@bogon qemu]#
https://kernelgo.org/virtio-overview.html
Sending the event notification to vhost and waking it up:
kvm_vmx_exit_handlers[EXIT_REASON_IO_INSTRUCTION] = handle_io,
  static int handle_io(struct kvm_vcpu *vcpu)
    return kvm_fast_pio_out(vcpu, size, port);
      static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt, int size,
                                           unsigned short port, const void *val,
                                           unsigned int count)
        emulator_pio_in_out
          kernel_pio(struct kvm_vcpu *vcpu, void *pd)
            int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
              __kvm_io_bus_write
                dev->ops->write(vcpu, dev, addr, l, v)    // dev->ops is ioeventfd_ops, i.e. ioeventfd_write
                  eventfd_signal(p->eventfd, 1);
                    wake_up_locked_poll(&ctx->wqh, POLLIN);
kvm_vm_ioctl is called to register the eventfd used to deliver the event notification to QEMU:
virtio_pci_config_write
  --> virtio_ioport_write
  --> virtio_pci_start_ioeventfd
  --> virtio_bus_set_host_notifier
  --> virtio_bus_start_ioeventfd
  --> virtio_device_start_ioeventfd_impl
  --> virtio_bus_set_host_notifier
  --> virtio_pci_ioeventfd_assign
  --> memory_region_add_eventfd
  --> memory_region_transaction_commit
  --> address_space_update_ioeventfds
  --> address_space_add_del_ioeventfds
  --> kvm_io_ioeventfd_add / vhost_eventfd_add
  --> kvm_set_ioeventfd_pio
  --> kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &kick)
accel/kvm/kvm-all.c
static MemoryListener kvm_io_listener = {
    .eventfd_add = kvm_io_ioeventfd_add,
    .eventfd_del = kvm_io_ioeventfd_del,
    .priority = 10,
};

int kvm_set_irq(KVMState *s, int irq, int level)
{
    struct kvm_irq_level event;
    int ret;

    assert(kvm_async_interrupts_enabled());

    event.level = level;
    event.irq = irq;
    ret = kvm_vm_ioctl(s, s->irq_set_ioctl, &event);
    if (ret < 0) {
        perror("kvm_set_irq");
        abort();
    }

    return (s->irq_set_ioctl == KVM_IRQ_LINE) ? 1 : event.status;
}
This is in fact QEMU's fast MMIO mechanism. QEMU registers an ioeventfd for the MemoryRegion backing each device MMIO area, and finally issues a KVM_IOEVENTFD ioctl into the KVM kernel, which registers the MMIO's (gpa, len, eventfd) tuple on KVM_FAST_MMIO_BUS. When the guest's access to that MMIO range causes an exit (EPT misconfig), KVM checks whether the faulting GPA falls inside a registered MMIO range; if so, it writes the eventfd directly to tell QEMU, and QEMU picks the MMIO request up from the coalesced mmio ring page (note: the pio page and mmio page are shared memory pages between QEMU and the KVM kernel, mmap'd in advance).
# KVM kernel code, virt/kvm/eventfd.c
kvm_vm_ioctl(KVM_IOEVENTFD)
  --> kvm_ioeventfd
  --> kvm_assign_ioeventfd
  --> kvm_assign_ioeventfd_idx

# In the MMIO handling path (handle_ept_misconfig), ioeventfd_write is eventually called to notify QEMU.

/* MMIO/PIO writes trigger an event if the addr/val match */
static int ioeventfd_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
                           int len, const void *val)
{
    struct _ioeventfd *p = to_ioeventfd(this);

    if (!ioeventfd_in_range(p, addr, len, val))
        return -EOPNOTSUPP;

    eventfd_signal(p->eventfd, 1);
    return 0;
}
virtio_bus_set_host_notifier
hw/block/dataplane/virtio-blk.c:200:        r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, true);
hw/block/dataplane/virtio-blk.c:204:            virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
hw/block/dataplane/virtio-blk.c:290:        virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
hw/virtio/virtio.c:2634:        r = virtio_bus_set_host_notifier(qbus, n, true);
hw/virtio/virtio.c:2663:        r = virtio_bus_set_host_notifier(qbus, n, false);
hw/virtio/virtio.c:2698:        r = virtio_bus_set_host_notifier(qbus, n, false);
hw/virtio/virtio-bus.c:263:int virtio_bus_set_host_notifier(VirtioBusState *bus, int n, bool assign)
hw/virtio/vhost.c:1345:        r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
hw/virtio/vhost.c:1356:        e = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
hw/virtio/vhost.c:1380:        r = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), hdev->vq_index + i,
hw/scsi/virtio-scsi-dataplane.c:98:        rc = virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), n, true);
hw/scsi/virtio-scsi-dataplane.c:178:        virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
hw/scsi/virtio-scsi-dataplane.c:217:        virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
include/hw/virtio/virtio-bus.h:152:int virtio_bus_set_host_notifier(VirtioBusState *bus, int n, bool assign);
int virtio_bus_set_host_notifier(VirtioBusState *bus, int n, bool assign)
{
    VirtIODevice *vdev = virtio_bus_get_device(bus);
    VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(bus);
    DeviceState *proxy = DEVICE(BUS(bus)->parent);
    VirtQueue *vq = virtio_get_queue(vdev, n);
    EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
    int r = 0;

    if (!k->ioeventfd_assign) {
        return -ENOSYS;
    }

    if (assign) {
        r = event_notifier_init(notifier, 1);
        if (r < 0) {
            error_report("%s: unable to init event notifier: %s (%d)",
                         __func__, strerror(-r), r);
            return r;
        }
        r = k->ioeventfd_assign(proxy, notifier, n, true);
        if (r < 0) {
            error_report("%s: unable to assign ioeventfd: %d", __func__, r);
            virtio_bus_cleanup_host_notifier(bus, n);
        }
    } else {
        k->ioeventfd_assign(proxy, notifier, n, false);
    }

    return r;
}
- First, the EventNotifier structure. It holds two file descriptors, one for reading and one for writing; in fact the two descriptor values are identical and naturally map to the same struct file in the process's fd table. rfd is simply the one QEMU userspace reads, while wfd is the one QEMU userspace or the kernel writes.
struct EventNotifier {
    int rfd;
    int wfd;
};
- Now look at how the EventNotifier is initialized in virtio_bus_set_host_notifier:
int event_notifier_init(EventNotifier *e, int active)
{
    ......
    ret = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);   /* 1 */
    e->rfd = e->wfd = ret;                          /* 2 */
    ......
}
1. The eventfd system call creates the eventfd, returning a descriptor and initializing the kernel counter to 0.
2. The returned descriptor is stored into both fields of the EventNotifier.
- Once the EventNotifier is initialized, its rfd is polled by QEMU's main event loop, and QEMU attaches a hook function to rfd becoming readable. Any QEMU thread, or kernel module, that wants to trigger that hook can notify the main event loop simply by writing wfd, which makes for efficient communication.
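A minimal standalone sketch of that rfd/wfd pattern (assumed example code, not QEMU's event loop): the main thread polls rfd and runs its "hook" when the fd becomes readable, while a second thread kicks it by writing wfd.

#include <sys/eventfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int rfd, wfd;                 /* same eventfd, as in EventNotifier */

static void *kicker(void *arg)
{
    uint64_t one = 1;
    sleep(1);
    write(wfd, &one, sizeof(one));   /* the "kick": wakes the poller */
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct pollfd pfd;
    uint64_t cnt;

    rfd = wfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    pthread_create(&t, NULL, kicker, NULL);

    pfd.fd = rfd;
    pfd.events = POLLIN;
    poll(&pfd, 1, -1);               /* "main loop" sleeps until a kick arrives */
    read(rfd, &cnt, sizeof(cnt));    /* clear the counter, then run the hook */
    printf("hook ran, count was %llu\n", (unsigned long long)cnt);

    pthread_join(t, NULL);
    return 0;
}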

virtio_bus_set_host_notifier is what lets KVM and QEMU share the notifier.
[root@bogon virtio]# grep ioeventfd_assign -rn *
virtio-bus.c:181:    if (!k->ioeventfd_assign) {
virtio-bus.c:214:    if (!k->ioeventfd_assign || !k->ioeventfd_enabled(proxy)) {
virtio-bus.c:256:    return k->ioeventfd_assign && k->ioeventfd_enabled(proxy);
virtio-bus.c:272:    if (!k->ioeventfd_assign) {
virtio-bus.c:283:        r = k->ioeventfd_assign(proxy, notifier, n, true);
virtio-bus.c:289:        k->ioeventfd_assign(proxy, notifier, n, false);
virtio-mmio.c:79:static int virtio_mmio_ioeventfd_assign(DeviceState *d,
virtio-mmio.c:514:    k->ioeventfd_assign = virtio_mmio_ioeventfd_assign;
virtio-pci.c:223:static int virtio_pci_ioeventfd_assign(DeviceState *d, EventNotifier *notifier,
virtio-pci.c:2055:    k->ioeventfd_assign = virtio_pci_ioeventfd_assign;
[root@bogon virtio]#


virtio_device_start_ioeventfd_impl installs the fd event read/write handlers on the QEMU side.
vdc->start_ioeventfd = virtio_device_start_ioeventfd_impl;
hw/block/virtio-blk.c:804:        virtio_device_start_ioeventfd(vdev);
hw/block/virtio-blk.c:1263:    vdc->start_ioeventfd = virtio_blk_data_plane_start;
hw/s390x/virtio-ccw.c:131:static void virtio_ccw_start_ioeventfd(VirtioCcwDevice *dev)
hw/s390x/virtio-ccw.c:133:    virtio_bus_start_ioeventfd(&dev->bus);
hw/s390x/virtio-ccw.c:514:        virtio_ccw_start_ioeventfd(dev);
hw/s390x/virtio-ccw.c:885:        virtio_ccw_start_ioeventfd(dev);
hw/virtio/virtio.c:2623:static int virtio_device_start_ioeventfd_impl(VirtIODevice *vdev)
hw/virtio/virtio.c:2677:int virtio_device_start_ioeventfd(VirtIODevice *vdev)
hw/virtio/virtio.c:2682:    return virtio_bus_start_ioeventfd(vbus);
hw/virtio/virtio.c:2745:    vdc->start_ioeventfd = virtio_device_start_ioeventfd_impl;
hw/virtio/virtio-bus.c:200:    /* Force virtio_bus_start_ioeventfd to act. */
hw/virtio/virtio-bus.c:202:    virtio_bus_start_ioeventfd(bus);
hw/virtio/virtio-bus.c:206:int virtio_bus_start_ioeventfd(VirtioBusState *bus)
hw/virtio/virtio-bus.c:223:    r = vdc->start_ioeventfd(vdev);
hw/virtio/virtio-mmio.c:95:static void virtio_mmio_start_ioeventfd(VirtIOMMIOProxy *proxy)
hw/virtio/virtio-mmio.c:97:    virtio_bus_start_ioeventfd(&proxy->bus);
hw/virtio/virtio-mmio.c:292:        virtio_mmio_start_ioeventfd(proxy);
hw/virtio/virtio-pci.c:280:static void virtio_pci_start_ioeventfd(VirtIOPCIProxy *proxy)
hw/virtio/virtio-pci.c:282:    virtio_bus_start_ioeventfd(&proxy->bus);
hw/virtio/virtio-pci.c:329:        virtio_pci_start_ioeventfd(proxy);
hw/virtio/virtio-pci.c:1070:        virtio_pci_start_ioeventfd(proxy);
hw/virtio/virtio-pci.c:1252:        virtio_pci_start_ioeventfd(proxy);
hw/scsi/virtio-scsi.c:449:        virtio_device_start_ioeventfd(vdev);
hw/scsi/virtio-scsi.c:636:        virtio_device_start_ioeventfd(vdev);
hw/scsi/virtio-scsi.c:768:        virtio_device_start_ioeventfd(vdev);
hw/scsi/virtio-scsi.c:987:    vdc->start_ioeventfd = virtio_scsi_dataplane_start;
include/hw/virtio/virtio.h:151:    int (*start_ioeventfd)(VirtIODevice *vdev);
include/hw/virtio/virtio.h:300:int virtio_device_start_ioeventfd(VirtIODevice *vdev);
include/hw/virtio/virtio-bus.h:144:int virtio_bus_start_ioeventfd(VirtioBusState *bus);
vhost_user_fs_pci_class_init does not set start_ioeventfd.

QEMU virtio-net transmit path:
virtio_queue_host_notifier_read
  -> virtio_queue_notify_vq
  -> vq->handle_output
  -> virtio_net_handle_tx_bh      // callback registered when the queue is set up
  -> qemu_bh_schedule
  -> virtio_net_tx_bh
  -> virtio_net_flush_tx
  -> virtqueue_pop
  -> qemu_sendv_packet_async      // put the packet on the send queue and write the tap device fd to send it
  -> tap_receive_iov
  -> tap_write_packet             // finally tap_write_packet hands the packet to the tap device
*************** Message passing: Guest -> KVM
Recall from the analysis of the virtio network transmit path that the driver calls virtqueue_kick at the end of start_xmit to notify the device; the analysis starts there. As the code below shows, the function first calls virtqueue_kick_prepare to decide whether a kick is needed right now, and if so calls virtqueue_notify.
bool virtqueue_kick(struct virtqueue *vq)
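The body was trimmed in the original post; in mainline Linux (drivers/virtio/virtio_ring.c) it is essentially these two steps:

bool virtqueue_kick(struct virtqueue *vq)
{
    if (virtqueue_kick_prepare(vq))
        return virtqueue_notify(vq);
    return true;
}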
First, virtqueue_kick_prepare: it computes old (vring.avail->idx as of the last kick) and new (the current idx), with vq->num_added holding the difference between them. If vq->event == 1, vring_need_event decides.
bool virtqueue_kick_prepare(struct virtqueue *_vq)
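The trimmed body, sketched from the mainline split-ring implementation (simplified: debug macros and endianness conversions are omitted, and field names vary a little across kernel versions):

bool virtqueue_kick_prepare(struct virtqueue *_vq)
{
    struct vring_virtqueue *vq = to_vvq(_vq);
    u16 new, old;
    bool needs_kick;

    /* Expose the new avail entries before reading the device's event index. */
    virtio_mb(vq->weak_barriers);

    old = vq->vring.avail->idx - vq->num_added;   /* idx as of the last kick */
    new = vq->vring.avail->idx;                   /* current idx */
    vq->num_added = 0;

    if (vq->event)
        /* EVENT_IDX negotiated: let vring_need_event() decide */
        needs_kick = vring_need_event(vring_avail_event(&vq->vring), new, old);
    else
        /* otherwise kick unless the device set VRING_USED_F_NO_NOTIFY */
        needs_kick = !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY);

    return needs_kick;
}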
The value of vq->event depends on VIRTIO_RING_F_EVENT_IDX (the field is defined in the Virtio PCI Card Specification Version 0.9.5).
In the Virtual I/O Device (VIRTIO) Version 1.0 specification, VIRTIO_RING_F_EVENT_IDX becomes VIRTIO_F_EVENT_IDX; the 1.0 spec describes this field in more detail, see section 2.4.9 of the VirtIO spec.
Now vring_need_event. First, the meaning of the event_idx parameter: it is actually the value stored in the last element of the used ring. As the code below shows, if (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old) holds, the backend is processing fast enough, so return true to kick it; otherwise the backend's current position event_idx lags behind old, i.e. it is processing slowly, so return false and fold this notification into the next kick.
In the VirtIO scheme the backend keeps consuming the avail ring and the frontend keeps consuming the used ring, so the backend uses the last entry of the used ring to tell the frontend how far it has processed.
static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
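The elided body is a one-liner in include/uapi/linux/virtio_ring.h and is exactly the condition quoted above:

static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
{
    /* all three values are free-running 16-bit ring indices */
    return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
}

For example, with old=10, new=12 and event_idx=10, (u16)(12-10-1)=1 < (u16)(12-10)=2, so the frontend kicks; with event_idx=5 (the backend far behind old), (u16)(12-5-1)=6 is not less than 2, so no kick is sent this round.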
Back in virtqueue_kick, the next step is virtqueue_notify, which calls vq->notify(_vq). notify is defined in struct vring_virtqueue, and which function it actually points to is bound in setup_vq when the virtqueue is created.
bool (*notify)(struct virtqueue *vq);
bool virtqueue_notify(struct virtqueue *_vq)
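The trimmed body just dispatches through the bound callback and marks the queue broken if the notify fails; roughly, from mainline:

bool virtqueue_notify(struct virtqueue *_vq)
{
    struct vring_virtqueue *vq = to_vvq(_vq);

    if (unlikely(vq->broken))
        return false;

    /* Prod other side to tell it about changes. */
    if (!vq->notify(_vq)) {
        vq->broken = true;
        return false;
    }
    return true;
}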
Next, setup_vq shows what notify really is: vring_create_virtqueue binds notify to vp_notify. Note the value stored in vq->priv; the analysis below relies on it.
static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
As can be seen, vp_notify writes VIRTIO_PCI_QUEUE_NOTIFY (a 16-bit r/w queue notifier register) to perform the notification.
bool vp_notify(struct virtqueue *vq)
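The trimmed body of vp_notify, sketched from drivers/virtio/virtio_pci_common.c: it writes the queue index to the notify register whose mapped address setup_vq stored in vq->priv (for the legacy device this is the BAR address plus VIRTIO_PCI_QUEUE_NOTIFY):

/* the notify function used when creating a virt queue */
bool vp_notify(struct virtqueue *vq)
{
    /* we write the queue's selector into the notification register
     * to signal the other end */
    iowrite16(vq->index, (void __iomem *)vq->priv);
    return true;
}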
Summary of this section
To summarize: virtqueue_kick_prepare decides, based on the feature bits and how far the backend has got, whether a notification is needed; if so, vp_notify is called, which does an iowrite to VIRTIO_PCI_QUEUE_NOTIFY.

KVM->QEMU
The iowrite to VIRTIO_PCI_QUEUE_NOTIFY causes a VM exit. KVM examines exit_reason, and the handler for I/O operations is handle_io(). That code is fairly long, so it is not analyzed in detail here; the call flow is the handle_io chain listed earlier, and it ends with eventfd_signal waking up the poll in QEMU.

Backend
Finally, the backend side. How does the backend tie VIRTIO_PCI_QUEUE_NOTIFY to the eventfd in the first place? As described above, memory_region_add_eventfd negotiates with KVM which eventfd to use for kick communication, and event_notifier_set_handler adds the IOHandler virtio_queue_host_notifier_read together with the host_notifier to the AioHandler list. In other words, when the frontend kicks, the backend runs virtio_queue_host_notifier_read, which connects back to the transmit path in the previous article.
kvm --> guest
1. Registering the interrupt fd
While the frontend driver loads (probe), it initializes the virtqueues; at this point it requests the MSI-X interrupts and registers the interrupt handlers:
virtnet_probe
  --> init_vqs
  --> virtnet_find_vqs
  --> vi->vdev->config->find_vqs [vp_modern_find_vqs]
  --> vp_find_vqs
  --> vp_find_vqs_msix                 // request one MSI-X interrupt per virtqueue, typically one TX and one RX queue
  --> vp_request_msix_vectors          // the main MSI-X request logic lives in this function
  --> pci_alloc_irq_vectors_affinity   // allocate the MSI-X interrupt descriptors (__pci_enable_msix_range)
  --> request_irq                      // register the interrupt handler

// A virtio-net NIC requests at least 3 MSI-X interrupts:
// one configuration-change interrupt (QEMU notifies the frontend after the config space changes),
// one MSI-X interrupt for the TX queue and one for the RX queue.
On the QEMU/KVM side, MSI-X interrupt emulation is set up roughly as follows:
virtio_pci_config_write
  --> virtio_ioport_write
  --> virtio_set_status
  --> virtio_net_vhost_status
  --> vhost_net_start
  --> virtio_pci_set_guest_notifiers
  --> kvm_virtio_pci_vector_use
        |--> kvm_irqchip_add_msi_route   // update the interrupt routing table
        |--> kvm_virtio_pci_irqfd_use    // enable the MSI interrupt
               --> kvm_irqchip_add_irqfd_notifier_gsi
               --> kvm_irqchip_assign_irqfd

# When the MSI-X interrupt is requested, a gsi is allocated for it, an irqfd is bound to
# that gsi, and the binding is registered in the kernel with the KVM_IRQFD ioctl.
static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
                                    bool assign)
{
    struct kvm_irqfd irqfd = {
        .fd = fd,
        .gsi = virq,
        .flags = assign ? 0 : KVM_IRQFD_FLAG_DEASSIGN,
    };

    if (rfd != -1) {
        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
        irqfd.resamplefd = rfd;
    }

    if (!kvm_irqfds_enabled()) {
        return -ENOSYS;
    }

    return kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
}

# KVM kernel code, virt/kvm/eventfd.c
kvm_vm_ioctl(s, KVM_IRQFD, &irqfd)
  --> kvm_irqfd_assign
  --> vfs_poll(f.file, &irqfd->pt)   // poll this irqfd inside the kernel
As the flow above shows, QEMU/KVM emulates MSI-X interrupts with the irqfd mechanism: when the device requests an MSI-X interrupt, a gsi is allocated for it (refreshing the irq routing table) and an irqfd is bound to that gsi, which the kernel then polls. When QEMU finishes handling the I/O, it writes the irqfd for that MSI-X vector, injecting an MSI-X interrupt that tells the frontend the I/O is done and the result can be fetched.

2. QEMU sends the interrupt event to the KVM fd
virtio_notify_irqfd
For example, after virtio-scsi fetches an I/O request from the frontend it performs DMA (DMA is asynchronous and handled in a QEMU coroutine). Once the DMA finishes, QEMU has to tell the frontend the I/O request is complete, so how is that MSI-X interrupt delivered? By calling virtio_notify_irqfd to inject one.
#0  0x00005604798d569b in virtio_notify_irqfd (vdev=0x56047d12d670, vq=0x7fab10006110) at hw/virtio/virtio.c:1684
#1  0x00005604798adea4 in virtio_scsi_complete_req (req=0x56047d09fa70) at hw/scsi/virtio-scsi.c:76
#2  0x00005604798aecfb in virtio_scsi_complete_cmd_req (req=0x56047d09fa70) at hw/scsi/virtio-scsi.c:468
#3  0x00005604798aee9d in virtio_scsi_command_complete (r=0x56047ccb0be0, status=0, resid=0) at hw/scsi/virtio-scsi.c:495
#4  0x0000560479b397cf in scsi_req_complete (req=0x56047ccb0be0, status=0) at hw/scsi/scsi-bus.c:1404
#5  0x0000560479b2b503 in scsi_dma_complete_noio (r=0x56047ccb0be0, ret=0) at hw/scsi/scsi-disk.c:279
#6  0x0000560479b2b610 in scsi_dma_complete (opaque=0x56047ccb0be0, ret=0) at hw/scsi/scsi-disk.c:300
#7  0x00005604799b89e3 in dma_complete (dbs=0x56047c6e9ab0, ret=0) at dma-helpers.c:118
#8  0x00005604799b8a90 in dma_blk_cb (opaque=0x56047c6e9ab0, ret=0) at dma-helpers.c:136
#9  0x0000560479cf5220 in blk_aio_complete (acb=0x56047cd77d40) at block/block-backend.c:1327
#10 0x0000560479cf5470 in blk_aio_read_entry (opaque=0x56047cd77d40) at block/block-backend.c:1387
#11 0x0000560479df49c4 in coroutine_trampoline (i0=2095821104, i1=22020) at util/coroutine-ucontext.c:115
#12 0x00007fab214d82c0 in __start_context () at /usr/lib64/libc.so.6
In virtio_notify_irqfd, the irqfd is written, sending a signal into the kernel.
void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
{
    bool should_notify;
    rcu_read_lock();
    should_notify = virtio_should_notify(vdev, vq);
    rcu_read_unlock();

    if (!should_notify) {
        return;
    }

    trace_virtio_notify_irqfd(vdev, vq);

    /*
     * virtio spec 1.0 says ISR bit 0 should be ignored with MSI, but
     * windows drivers included in virtio-win 1.8.0 (circa 2015) are
     * incorrectly polling this bit during crashdump and hibernation
     * in MSI mode, causing a hang if this bit is never updated.
     * Recent releases of Windows do not really shut down, but rather
     * log out and hibernate to make the next startup faster. Hence,
     * this manifested as a more serious hang during shutdown with
     *
     * Next driver release from 2016 fixed this problem, so working around it
     * is not a must, but it's easy to do so let's do it here.
     *
     * Note: it's safe to update ISR from any thread as it was switched
     * to an atomic operation.
     */
    virtio_set_isr(vq->vdev, 0x1);
    event_notifier_set(&vq->guest_notifier);
}
Once QEMU writes this irqfd, the irqfd poll in the KVM kernel module receives a POLL_IN event and the MSI-X interrupt is delivered automatically to the corresponding LAPIC. Roughly: POLL_IN -> kvm_arch_set_irq_inatomic -> kvm_set_msi_irq, kvm_irq_delivery_to_apic_fast.
static int irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
    ...
    if (flags & EPOLLIN) {
        idx = srcu_read_lock(&kvm->irq_srcu);
        do {
            seq = read_seqcount_begin(&irqfd->irq_entry_sc);
            irq = irqfd->irq_entry;
        } while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));

        /* An event has been signaled, inject an interrupt */
        if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
                                      false) == -EWOULDBLOCK)
            schedule_work(&irqfd->inject);
        srcu_read_unlock(&kvm->irq_srcu, idx);
    }
    ...

