Original post: http://blog.vmsplice.net/2011/03/qemu-internals-overall-architecture-and.html
Originally published: March 5, 2011
About the author: Stefan Hajnoczi works on the virtualization team at Red Hat, where he develops and maintains QEMU's block layer, network subsystem, and tracing subsystem.
He currently works on multi-core device emulation in QEMU and host/guest file sharing using vsock; in the past he has worked on disk image formats, storage migration, and I/O performance optimization.
QEMU Internals: Overall Architecture and Threading Model
This is the first post in a series on QEMU internals. Its goal is to explain how QEMU works and to make it easier for new contributors to become familiar with the QEMU code base.
Running a VM involves executing the VM's code, handling timers, processing I/O, and responding to monitor commands (commands the user sends to QEMU). Doing all of this at once requires an architecture that can mediate resources safely without pausing VM execution when a disk I/O operation or monitor command takes a long time to complete. There are two popular programming architectures for responding to events from multiple sources:
1. Parallel architecture: split the work into processes or threads that execute simultaneously; I will call this the threaded architecture.
2. Event-driven architecture: react to events by running a main loop that dispatches events to handler functions. This is usually implemented with system calls such as select or poll waiting on multiple file descriptors.
QEMU actually uses a hybrid architecture that combines event-driven programming with threads. It does this because a single-threaded event-driven model cannot take advantage of multiple CPU cores, and because in some cases writing a dedicated thread for a task is simpler than integrating it into the event-driven core. Nevertheless, the core of QEMU is event-driven, and most of its code runs in that environment.
The event-driven core of QEMU
An event-driven architecture is centered around an event loop that dispatches events to handler functions. QEMU's main event loop is main_loop_wait(), which performs the following tasks (a small sketch of the registration calls appears after the list):
1. Wait for file descriptors to become readable or writable. File descriptors play a critical role because files, sockets, pipes, and many other resources are all file descriptors. File descriptors can be added with qemu_set_fd_handler().
2. Run expired timers. Timers are added with qemu_mod_timer().
3. Run bottom halves (BHs), which are like timers that expire immediately. BHs are used to avoid reentrancy and call-stack overflow; they are scheduled with qemu_bh_schedule().
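To make the registration model concrete, here is a minimal sketch of how a subsystem might hook all three event sources into the main loop. It assumes the 2011-era API signatures implied by the function names above (qemu_set_fd_handler(), qemu_new_timer()/qemu_mod_timer(), qemu_bh_new()/qemu_bh_schedule()); the header names, the clock helpers and their units, and the callback bodies are my own assumptions for illustration, not code from the QEMU tree.

#include "qemu-common.h"   /* assumed era-appropriate headers */
#include "qemu-timer.h"

/* Invoked by the event loop when fd becomes readable. */
static void my_fd_read(void *opaque)
{
    /* non-blocking work only: read what is available and return */
}

/* Invoked by the event loop once the timer expires. */
static void my_timer_cb(void *opaque)
{
}

/* Invoked on the next event loop iteration after being scheduled. */
static void my_bh_cb(void *opaque)
{
}

static void register_event_sources(int fd)
{
    QEMUTimer *timer;
    QEMUBH *bh;

    /* 1. file descriptor: read handler only, no write handler */
    qemu_set_fd_handler(fd, my_fd_read, NULL, NULL);

    /* 2. timer: fire roughly 100ms from now on the realtime clock
     *    (clock choice and units are assumptions) */
    timer = qemu_new_timer(rt_clock, my_timer_cb, NULL);
    qemu_mod_timer(timer, qemu_get_clock(rt_clock) + 100);

    /* 3. bottom half: run my_bh_cb as soon as possible */
    bh = qemu_bh_new(my_bh_cb, NULL);
    qemu_bh_schedule(bh);
}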
When a file descriptor becomes ready, a timer expires, or a BH is scheduled, the event loop invokes a callback to respond to the event. Callbacks follow two simple rules about their environment:
1. No other core code runs at the same time, so no synchronization is needed. With respect to other core code, callbacks execute sequentially and atomically; at any given time only one thread of control is executing core code.
2. Do not perform any blocking system calls or long-running computations. Because the event loop must wait for the callback to finish before it can respond to other events, callbacks should avoid taking too long. Breaking this rule causes the VM to pause and the monitor to become unresponsive.
The second rule is sometimes hard to honor, and there is blocking code in QEMU. In fact, qemu_aio_wait() even contains a nested event loop that waits on a subset of the file descriptors handled by the global event loop. Hopefully future refactoring will remove these violations. New code almost never has a reason to break the rule; if a blocking task really must be performed, one solution is to hand it off to a dedicated worker thread.
Offloading specific tasks to worker threads
Although many I/O operations can be performed in a non-blocking fashion, some system calls have no non-blocking equivalent. In addition, some long-running computations keep hogging the CPU and are hard to break up into callbacks. In these cases, dedicated worker threads can be used to take such tasks out of QEMU's core code path.
One example of worker threads is posix-aio-compat.c, which implements asynchronous file I/O. When core QEMU issues an aio request, the request is placed on a queue. Worker threads take requests off the queue and process them outside of core QEMU. Because they run in their own threads, they can safely perform blocking operations. The implementation also takes care of the necessary synchronization and communication between the worker threads and core QEMU.
Another example is ui/vnc-jobs-async.c, which uses worker threads for CPU-intensive tasks such as image compression and encoding.
Most of QEMU's core code is not thread-safe, so worker threads cannot call into it directly. Simple utility functions such as qemu_malloc() are thread-safe, but they are the exception rather than the rule. This thread-unsafety means that worker threads cannot deliver events back to core QEMU directly.
When a worker thread needs to notify core QEMU, a pipe or a qemu_eventfd() file descriptor is added to the event loop. The worker thread writes to the file descriptor, and the event loop invokes a callback when the descriptor becomes readable. In addition, a signal must be used to ensure that the event loop gets a chance to run under all circumstances. posix-aio-compat.c uses this approach; it will make more sense (especially the use of signals) after the next section explains how VM code is executed.
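As a rough illustration of this notification pattern, the sketch below has a worker thread perform a blocking operation and then write a byte to a pipe whose read end is registered with the event loop. Everything except qemu_set_fd_handler() is a hypothetical name made up for this example, and the signal step mentioned above is omitted for brevity.

#include <pthread.h>
#include <unistd.h>

/* qemu_set_fd_handler() is declared by QEMU's event loop headers. */

static int notify_pipe[2];               /* [0] read end, [1] write end */

/* Runs in the event loop when the read end becomes readable;
 * here it is safe to touch core QEMU state. */
static void completion_read(void *opaque)
{
    char byte;
    read(notify_pipe[0], &byte, 1);       /* drain the notification */
    /* ... deliver the result of the worker's job to core QEMU ... */
}

/* Runs in the worker thread: blocking work, then notify. */
static void *worker_fn(void *opaque)
{
    /* ... blocking system call or long-running computation ... */
    char byte = 0;
    write(notify_pipe[1], &byte, 1);      /* wake up the event loop */
    return NULL;
}

static void start_worker(void)
{
    pthread_t tid;

    pipe(notify_pipe);
    qemu_set_fd_handler(notify_pipe[0], completion_read, NULL, NULL);
    pthread_create(&tid, NULL, worker_fn, NULL);
}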
Executing VM code
So far we have mainly looked at QEMU's event loop, but QEMU's more important job is to execute VM code.
There are two mechanisms for executing VM code: the Tiny Code Generator (TCG) and KVM. TCG emulates the VM using dynamic binary translation, also known as Just-in-Time (JIT) compilation. KVM uses the hardware virtualization extensions of modern CPUs to execute VM code safely and directly on the host CPU. The specific techniques are not the focus of this post; what matters is that both allow QEMU to jump into VM code and execute it.
Once QEMU jumps into VM code, control passes to the VM. A thread that is running VM code cannot be in the event loop at the same time, because the VM has control of the CPU. Typically the time spent in VM code is limited, because reads and writes to emulated device registers and other exceptions interrupt the VM code and hand control back to QEMU. In extreme cases, a VM can hog the CPU for too long without giving up control, and QEMU then becomes unresponsive.
To solve the problem of VM code hogging QEMU's thread of control, signals are used to break out of the VM. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler. This lets QEMU interrupt the execution of VM code, return to the main event loop, and start processing pending events.
A consequence of this mechanism is that QEMU cannot detect and handle new events immediately while it is executing VM code. Most of the time QEMU eventually gets to them, but the extra latency is itself a performance problem. For this reason, timers, I/O completion, and worker-thread notifications to core QEMU use signals to make sure the event loop runs right away.
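A hedged sketch of the KVM side of this mechanism: delivering a signal to the vcpu thread makes the KVM_RUN ioctl return to userspace, at which point QEMU can process pending events before re-entering the guest. The signal choice, handler, and loop structure below are illustrative assumptions, not QEMU's actual code.

#include <signal.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define SIG_IPI SIGUSR1                    /* assumed choice of signal */

static void sig_ipi_handler(int sig)
{
    /* Intentionally empty: merely delivering the signal forces the
     * KVM_RUN ioctl below to return to userspace with EINTR. */
}

static void *vcpu_thread_fn(void *opaque)
{
    int vcpu_fd = *(int *)opaque;
    struct sigaction sa = { .sa_handler = sig_ipi_handler };

    sigaction(SIG_IPI, &sa, NULL);

    for (;;) {
        /* Enter guest code; returns on MMIO/PIO exits or when interrupted. */
        ioctl(vcpu_fd, KVM_RUN, 0);
        /* ... handle the exit reason, give the event loop a chance to run,
         *     then dive back into guest code ... */
    }
    return NULL;
}

/* Called from another thread (for example when a timer fires) to kick
 * the vcpu out of guest code. */
static void kick_vcpu(pthread_t vcpu_thread)
{
    pthread_kill(vcpu_thread, SIG_IPI);
}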
You may now be wondering what the overall picture looks like once we put together the event loop and an SMP VM with multiple vcpus. Now that we have covered QEMU's threading model and how VM code is executed, the next section describes the overall architecture.
iothread and non-iothread architectures
The traditional QEMU architecture executes VM code and the event loop in a single thread. This is the default and is known as the non-iothread architecture; it is what you get from the default build, ./configure && make. The QEMU thread stops executing VM code and regains control when it receives a signal or encounters an exception. It then runs one iteration of the event loop with a non-blocking select, goes back to executing VM code, and repeats.
If the VM is started with -smp 2, QEMU still does not create any additional threads. It remains a single thread that multiplexes between the two vcpus executing VM code and the event loop. The non-iothread architecture therefore cannot exploit multicore hosts, and performance for SMP guests is poor.
Note that even though QEMU has only one core thread, it may have zero or more worker threads, which can be temporary or permanent. These threads only perform specific tasks and never execute VM code or process events. I emphasize this because people are often confused by the worker threads and mistake them for threads started because the VM has multiple vcpus. Remember: the non-iothread architecture starts only one core QEMU thread.
The newer architecture is one QEMU thread per vcpu, plus a dedicated event loop thread. It is known as the iothread architecture and is enabled at build time with ./configure --enable-io-thread. Each vcpu thread can execute VM code in parallel, providing true SMP support, while the iothread runs the event loop. The rule that core QEMU code never runs concurrently is maintained by a global mutex shared by the vcpu threads and the iothread. Most of the time the vcpu threads are executing VM code and do not hold the global mutex; likewise, most of the time the iothread is blocked in select and does not hold it either.
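The locking discipline can be sketched as follows. qemu_mutex_lock_iothread() and qemu_mutex_unlock_iothread() are the names QEMU uses for its global-mutex helpers; the surrounding vcpu loop and the run_guest_code_until_exit() placeholder are simplified assumptions for illustration.

static void *vcpu_thread_loop(void *opaque)
{
    for (;;) {
        /* Core QEMU state may only be touched while holding the global
         * mutex, e.g. completing MMIO accesses or injecting interrupts. */
        qemu_mutex_lock_iothread();
        /* ... core QEMU work for this vcpu ... */
        qemu_mutex_unlock_iothread();

        /* Guest code runs without the mutex, so other vcpu threads and
         * the iothread can make progress in parallel. */
        run_guest_code_until_exit();      /* hypothetical placeholder */
    }
    return NULL;
}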
Note that TCG is not thread-safe, so even under the iothread architecture QEMU still multiplexes the vcpus onto a single thread. Only KVM can take advantage of per-vcpu threads.
Conclusion and words about the future
Hopefully this post helps communicate QEMU's overall architecture (which KVM also uses).
The details described here are likely to change, and I hope the default will move from non-iothread to iothread, with the non-iothread architecture perhaps even being removed.
I will try to update this post as the qemu project changes.
The original English text follows:
QEMU Internals: Overall architecture and threading model
Running a guest involves executing guest code, handling timers, processing I/O, and responding to monitor commands. Doing all these things at once requires an architecture capable of mediating resources in a safe way without pausing guest execution if a disk I/O or monitor command takes a long time to complete. There are two popular architectures for programs that need to respond to events from multiple sources:
- Parallel architecture splits work into processes or threads that can execute simultaneously. I will call this threaded architecture.
- Event-driven architecture reacts to events by running a main loop that dispatches to event handlers. This is commonly implemented using the select(2) or poll(2) family of system calls to wait on multiple file descriptors.
QEMU actually uses a hybrid architecture that combines event-driven programming with threads. It makes sense to do this because an event loop cannot take advantage of multiple cores since it only has a single thread of execution. In addition, sometimes it is simpler to write a dedicated thread to offload one specific task rather than integrate it into an event-driven architecture. Nevertheless, the core of QEMU is event-driven and most code executes in that environment.
The event-driven core of QEMU
An event-driven architecture is centered around the event loop which dispatches events to handler functions. QEMU's main event loop is main_loop_wait() and it performs the following tasks:
- Waits for file descriptors to become readable or writable. File descriptors play a critical role because files, sockets, pipes, and various other resources are all file descriptors. File descriptors can be added using qemu_set_fd_handler().
- Runs expired timers. Timers can be added using qemu_mod_timer().
- Runs bottom-halves (BHs), which are like timers that expire immediately. BHs are used to avoid reentrancy and overflowing the call stack. BHs can be added using qemu_bh_schedule().
When a file descriptor becomes ready, a timer expires, or a BH is scheduled, the event loop invokes a callback that responds to the event. Callbacks have two simple rules about their environment:
- No other core code is executing at the same time so synchronization is not necessary. Callbacks execute sequentially and atomically with respect to other core code. There is only one thread of control executing core code at any given time.
- No blocking system calls or long-running computations should be performed. Since the event loop waits for the callback to return before continuing with other events, it is important to avoid spending an unbounded amount of time in a callback. Breaking this rule causes the guest to pause and the monitor to become unresponsive.
This second rule is sometimes hard to honor and there is code in QEMU which blocks. In fact there is even a nested event loop in qemu_aio_wait() that waits on a subset of the events that the top-level event loop handles. Hopefully these violations will be removed in the future by restructuring the code. New code almost never has a legitimate reason to block and one solution is to use dedicated worker threads to offload long-running or blocking code.
Offloading specific tasks to worker threads
Although many I/O operations can be performed in a non-blocking fashion, there are system calls which have no non-blocking equivalent. Furthermore, sometimes long-running computations simply hog the CPU and are difficult to break up into callbacks. In these cases dedicated worker threads can be used to carefully move these tasks out of core QEMU.
One example user of worker threads is posix-aio-compat.c, an asynchronous file I/O implementation. When core QEMU issues an aio request it is placed on a queue. Worker threads take requests off the queue and execute them outside of core QEMU. They may perform blocking operations since they execute in their own threads and do not block the rest of QEMU. The implementation takes care to perform necessary synchronization and communication between worker threads and core QEMU.
Another example is ui/vnc-jobs-async.c which performs compute-intensive image compression and encoding in worker threads.
Since the majority of core QEMU code is not thread-safe, worker threads cannot call into core QEMU code. Simple utilities like qemu_malloc() are thread-safe but that is the exception rather than the rule. This poses a problem for communicating worker thread events back to core QEMU.
When a worker thread needs to notify core QEMU, a pipe or a qemu_eventfd() file descriptor is added to the event loop. The worker thread can write to the file descriptor and the callback will be invoked by the event loop when the file descriptor becomes readable. In addition, a signal must be used to ensure that the event loop is able to run under all circumstances. This approach is used by posix-aio-compat.c and makes more sense (especially the use of signals) after understanding how guest code is executed.
Executing guest code
So far we have mainly looked at the event loop and its central role in QEMU. Equally as important is the ability to execute guest code, without which QEMU could respond to events but would not be very useful.
There are two mechanisms for executing guest code: Tiny Code Generator (TCG) and KVM. TCG emulates the guest using dynamic binary translation, also known as Just-in-Time (JIT) compilation. KVM takes advantage of hardware virtualization extensions present in modern Intel and AMD CPUs for safely executing guest code directly on the host CPU. For the purposes of this post the actual techniques do not matter but what matters is that both TCG and KVM allow us to jump into guest code and execute it.
Jumping into guest code takes away our control of execution and gives control to the guest. While a thread is running guest code it cannot simultaneously be in the event loop because the guest has (safe) control of the CPU. Typically the amount of time spent in guest code is limited because reads and writes to emulated device registers and other exceptions cause us to leave the guest and give control back to QEMU. In extreme cases a guest can spend an unbounded amount of time without giving up control and this would make QEMU unresponsive.
In order to solve the problem of guest code hogging QEMU's thread of control signals are used to break out of the guest. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler function. This allows QEMU to take steps to leave guest code and return to its main loop where the event loop can get a chance to process pending events.
The upshot of this is that new events may not be detected immediately if QEMU is currently in guest code. Most of the time QEMU eventually gets around to processing events but this additional latency is a performance problem in itself. For this reason timers, I/O completion, and notifications from worker threads to core QEMU use signals to ensure that the event loop will be run immediately.
You might be wondering what the overall picture between the event loop and an SMP guest with multiple vcpus looks like. Now that the threading model and guest code has been covered we can discuss the overall architecture.
iothread and non-iothread architecture
The traditional architecture is a single QEMU thread that executes guest code and the event loop. This model is also known as non-iothread or !CONFIG_IOTHREAD and is the default when QEMU is built with ./configure && make. The QEMU thread executes guest code until an exception or signal yields back control. Then it runs one iteration of the event loop without blocking in select(2). Afterwards it dives back into guest code and repeats until QEMU is shut down.
If the guest is started with multiple vcpus using -smp 2, for example, no additional QEMU threads will be created. Instead the single QEMU thread multiplexes between two vcpus executing guest code and the event loop. Therefore non-iothread fails to exploit multicore hosts and can result in poor performance for SMP guests.
Note that despite there being only one QEMU thread there may be zero or more worker threads. These threads may be temporary or permanent. Remember that they perform specialized tasks and do not execute guest code or process events. I wanted to emphasise this because it is easy to be confused by worker threads when monitoring the host and interpret them as vcpu threads. Remember that non-iothread only ever has one QEMU thread.
The newer architecture is one QEMU thread per vcpu plus a dedicated event loop thread. This model is known as iothread or CONFIG_IOTHREAD and can be enabled with ./configure --enable-io-thread at build time. Each vcpu thread can execute guest code in parallel, offering true SMP support, while the iothread runs the event loop. The rule that core QEMU code never runs simultaneously is maintained through a global mutex that synchronizes core QEMU code across the vcpus and iothread. Most of the time vcpus will be executing guest code and do not need to hold the global mutex. Most of the time the iothread is blocked in select(2) and does not need to hold the global mutex.
Note that TCG is not thread-safe so even under the iothread model it multiplexes vcpus across a single QEMU thread. Only KVM can take advantage of per-vcpu threads.
Conclusion and words about the future
Hopefully this helps communicate the overall architecture of QEMU (which KVM inherits). Feel free to leave questions in the comments below.
In the future the details are likely to change and I hope we will see a move to CONFIG_IOTHREAD by default and maybe even a removal of !CONFIG_IOTHREAD.
I will try to update this post as qemu.git changes.