說明: 之所以要翻譯這篇論文,是因為參考此論文可以很好地理解SPDK/NVMe的設計思想。
NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs
NVMeDirect: 一種面向NVMe固態硬盤上特定應用優化的用戶空間I/O框架
Hyeong-Jun Kim, Sungkyunkwan University, hjkim@csl.skku.edu Young-Sik Lee, KAIST, yslee@calab.kaist.ac.kr Jin-Soo Kim, Sungkyunkwan University, jinsookim@skku.edu
Abstract | 摘要
The performance of storage devices has increased significantly due to emerging technologies such as Solid State Drives (SSDs) and the Non-Volatile Memory Express (NVMe) interface. However, the complex I/O stack of the kernel impedes utilizing the full performance of NVMe SSDs. Application-specific optimization is also difficult in the kernel because the kernel must provide generality and fairness.
新興技術如固態硬盤(SSD)和NVMe接口的發展,使存儲設備的性能得到大幅度的提升。但是,內核中復雜的I/O棧阻礙了NVMe固態硬盤發揮其全部性能。在內核中進行面向特定應用的優化也很困難,因為內核必須提供通用性和公平性。
In this paper, we propose a user-level I/O framework which improves the performance by allowing user applications to access commercial NVMe SSDs directly without any hardware modification. Moreover, the proposed framework provides flexibility where user applications can select their own I/O policies including I/O completion method, caching, and I/O scheduling. Our evaluation results show that the proposed framework outperforms the kernel-based I/O by up to 30% on microbenchmarks and by up to 15% on Redis.
在本論文中,我們提出了一個用戶級的I/O框架,允許用戶應用程序直接訪問商用的NVMe固態硬盤,而無需做任何硬件修改。而且,該框架給用戶提供了靈活性,用戶應用程序可以選擇自己的I/O策略,包括I/O完成方法、緩存和I/O調度。評估結果表明,我們提出的框架優於基於內核的I/O,在微基准測試中性能提升高達30%,應用到Redis上性能提升高達15%。
1 Introduction | 概述
Emerging technologies are making remarkable progress in the performance of storage devices. NAND flash-based Solid State Drives (SSDs) are being widely adopted in place of hard disk drives (HDDs). Next-generation non-volatile memory such as 3D XPoint [8] promises the next step for storage devices. In accordance with the improvement in storage performance, the new NVMe (Non-Volatile Memory Express) interface has been standardized to support high-performance storage based on the PCI Express (PCIe) interconnect.
新興技術使得存儲設備在性能上取得了顯著的進步。基於NAND閃存的固態硬盤(SSD)正在廣泛取代硬盤驅動器(HDD)。以3D XPoint [8]為代表的新一代非易失性存儲器預示着存儲設備的下一步發展。隨着存儲性能的提升,新的NVMe接口已經標准化,用來支持基於PCIe互連的高性能存儲設備。
As storage devices are getting faster, the overhead of the legacy kernel I/O stack becomes noticeable since it has been optimized for slow HDDs. To overcome this problem, many researchers have tried to reduce the kernel overhead by using the polling mechanism[5, 13] and eliminating unnecessary context switching[11, 14].
隨着存儲設備越來越快,傳統內核I/O棧的開銷變得十分明顯,因為它是針對慢速硬盤(HDD)優化的。為了解決這一問題,許多研究人員試圖通過使用輪詢機制[5, 13]和消除不必要的上下文切換[11, 14]來減少內核開銷。
However, kernel-level I/O optimizations have several limitations to satisfy the requirements of user applications. First, the kernel should be general because it provides an abstraction layer for applications, managing all the hardware resources. Thus, it is hard to optimize the kernel without loss of generality. Second, the kernel cannot implement any policy that favors a certain application because it should provide fairness among applications. Lastly, the frequent update of the kernel requires a constant effort to port such application-specific optimizations.
然而,內核級的I/O優化在滿足用戶應用需求方面存在一些限制。首先,內核必須是通用的,因為它為應用提供了一個抽象層,負責管理所有硬件資源。因此,在不損失通用性的前提下對內核進行優化是很困難的。其次,內核不能實現任何偏袒某個應用的策略,因為內核必須在應用之間保證公平性。最后,內核的頻繁更新意味着需要持續不斷的投入來移植這類面向特定應用的優化。
In this sense, it would be desirable if a user-space I/O framework is supported for high-performance storage devices which enables the optimization of the I/O stack in the user space without any kernel intervention. In particular, such a user-space I/O framework can have a great impact on modern data-intensive applications such as distributed data processing platforms, NoSQL systems, and database systems, where the I/O performance plays an important role in the overall performance. Recently, Intel released a set of tools and libraries for accessing NVMe SSDs in the user space, called SPDK [7]. However, SPDK only works for a single user and application because it moves the whole NVMe driver from the kernel to the user space.
從這個意義上講,如果能有一個用戶空間I/O框架來支持高性能存儲設備,使得I/O棧可以在用戶空間優化而無需任何內核干預,那將是非常理想的。特別地,這樣一個用戶空間I/O框架能夠對現代數據密集型應用(例如分布式數據處理平台、NoSQL系統和數據庫系統)產生重大的影響,因為在這些應用中,I/O性能在整體性能中扮演着重要的角色。近年來,英特爾發布了稱之為SPDK [7]的一套工具和庫,用來在用戶空間訪問NVMe固態硬盤。然而,SPDK只適用於單個用戶和單一應用,因為它把整個NVMe驅動從內核空間移動到了用戶空間。
2 Background | 背景
NVM Express (NVMe) is a high performance and scalable host controller interface for PCIe-based SSDs [1]. The notable feature of NVMe is to offer multiple queues to process I/O commands. Each I/O queue can manage up to 64K commands and a single NVMe device supports up to 64K I/O queues. When issuing an I/O command, the host system places the command into the submission queue and notifies the NVMe device using the doorbell register. After the NVMe device processes the I/O command, it writes the results to the completion queue and raises an interrupt to the host system. NVMe enhances the performance of interrupt processing by MSI/MSI-X and interrupt aggregation. In the current Linux kernel, the NVMe driver creates a submission queue and a completion queue per core in the host system to avoid locking and cache collision.
NVMe是一個針對基於PCIe的固態硬盤的高性能的、可擴展的主機控制器接口。NVMe的顯著特征是提供多個隊列來處理I/O命令。單個NVMe設備支持多達64K個I/O隊列,每個I/O隊列可以管理多達64K個命令。當發出一個I/O命令的時候,主機系統將命令放置到提交隊列(SQ),然后使用門鈴寄存器(DB)通知NVMe設備。當NVMe設備處理完I/O命令之后,設備將處理結果寫入到完成隊列(CQ),並引發一個中斷通知主機系統。NVMe使用MSI/MSI-X和中斷聚合來提高中斷處理的性能。在當前的Linux內核中,NVMe驅動程序在每一個CPU核(core)上創建一個提交隊列(SQ)和完成隊列(CQ),以避免鎖和緩存沖突。
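To make the queue/doorbell mechanics above concrete, the following C sketch models, in host memory only, how a command is placed into a submission queue and the doorbell is rung. It is a conceptual model: a plain variable stands in for the mapped doorbell register, and the structure layout is simplified rather than the 64-byte command format defined by the NVMe specification.

```c
/* Conceptual model only: a real driver maps the device's SQ memory and
 * doorbell register; here a plain uint32_t stands in for the doorbell. */
#include <stdint.h>
#include <stdio.h>

#define SQ_DEPTH 64

struct nvme_cmd_sketch {      /* simplified stand-in for the 64-byte NVMe command */
    uint8_t  opcode;          /* 0x02 = read, 0x01 = write                        */
    uint16_t cid;             /* command identifier, echoed back in the completion */
    uint64_t slba;            /* starting logical block address                   */
    uint16_t nlb;             /* number of logical blocks, 0-based                */
};

struct sq_sketch {
    struct nvme_cmd_sketch ring[SQ_DEPTH];
    uint32_t tail;                    /* host-side tail index              */
    volatile uint32_t *doorbell;      /* would be a mapped device register */
};

static void submit(struct sq_sketch *sq, const struct nvme_cmd_sketch *cmd)
{
    sq->ring[sq->tail] = *cmd;              /* 1. place the command in the SQ     */
    sq->tail = (sq->tail + 1) % SQ_DEPTH;   /* 2. advance the tail                */
    *sq->doorbell = sq->tail;               /* 3. doorbell write notifies the SSD */
}

int main(void)
{
    uint32_t fake_doorbell = 0;             /* stands in for the mapped register  */
    struct sq_sketch sq = { .tail = 0, .doorbell = &fake_doorbell };
    struct nvme_cmd_sketch read_cmd = { .opcode = 0x02, .cid = 1, .slba = 0, .nlb = 7 };

    submit(&sq, &read_cmd);   /* the device would then execute the command, post an
                                 entry to the completion queue, and raise an interrupt */
    printf("SQ tail doorbell = %u\n", (unsigned)fake_doorbell);
    return 0;
}
```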
3 NVMeDirect I/O Framework | NVMeDirect I/O框架
3.1 Design | 設計
We develop a user-space I/O framework called NVMeDirect to fully utilize the performance of NVMe SSDs while meeting the diverse requirements from user applications. Figure 1 illustrates the overall architecture of our NVMeDirect framework.
我們開發了一個叫做NVMeDirect的用戶空間I/O框架,以便在充分發揮NVMe固態硬盤性能的同時,滿足用戶應用多樣化的需求。圖1描述了NVMeDirect框架的總體架構。
The Admin tool controls the kernel driver with the root privilege to manage the access permission of I/O queues. When an application requests an I/O queue from NVMeDirect, the user-level library calls the kernel driver. The kernel first checks whether the application is allowed to perform user-level I/O, and then creates the required submission queue and completion queue and maps their memory regions and the associated doorbell registers into the user-space memory region of the application. After this initialization process, the application can issue I/O commands directly to the NVMe SSD without any hardware modification or help from the kernel I/O stack. NVMeDirect offers the notion of I/O handles to send I/O requests to NVMe queues. A thread can create one or more I/O handles to access the queues, and each handle can be bound to a dedicated queue or a shared queue. According to the characteristics of the workloads, each handle can be configured to use different features such as caching, I/O scheduling, and I/O completion. The major APIs supported by the NVMeDirect framework are summarized in Table 1. NVMeDirect also provides various wrapper APIs that correspond to NVMe commands such as read, write, flush, and discard.
管理工具(Admin tool)使用root權限控制內核驅動,從而管理I/O隊列的訪問權限。當應用向NVMeDirect請求一個I/O隊列的時候,用戶級庫會調用內核驅動。內核首先檢查該應用是否被允許執行用戶級I/O,然后創建所需的提交隊列(SQ)和完成隊列(CQ),並將它們的內存區域和相關的門鈴寄存器映射到應用程序的用戶空間內存區域。初始化過程完成后,應用程序可以直接給NVMe固態硬盤發I/O命令,無需修改任何硬件或借助內核I/O棧。NVMeDirect提供了I/O句柄的概念,用來給NVMe隊列發送I/O請求。線程可以創建一個或多個I/O句柄來訪問隊列,每個句柄都可以綁定到一個專用隊列或共享隊列上。根據工作負載的特性,可以為每個句柄配置不同的特性,例如緩存、I/O調度和I/O完成方式。表1總結了NVMeDirect框架支持的主要API。NVMeDirect還針對NVMe命令提供各種包裝API,例如read、write、flush和discard。
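Since Table 1 is not reproduced in this translation, the sketch below only illustrates the shape of the workflow described above. The type and function names (nvmed_open(), nvmed_queue_create(), nvmed_handle_create(), nvmed_read()) and their signatures are assumptions for illustration, not the framework's authoritative API.

```c
/* Assumed declarations standing in for Table 1; signatures are illustrative. */
#include <sys/types.h>

typedef struct nvmed        NVMED;         /* per-device context    */
typedef struct nvmed_queue  NVMED_QUEUE;   /* one SQ/CQ pair        */
typedef struct nvmed_handle NVMED_HANDLE;  /* per-thread I/O handle */

NVMED        *nvmed_open(const char *dev_path, int flags);
NVMED_QUEUE  *nvmed_queue_create(NVMED *dev, int flags);
NVMED_HANDLE *nvmed_handle_create(NVMED_QUEUE *q, int flags);
ssize_t       nvmed_read(NVMED_HANDLE *h, void *buf, off_t offset, size_t len);
void          nvmed_handle_destroy(NVMED_HANDLE *h);
void          nvmed_queue_destroy(NVMED_QUEUE *q);
void          nvmed_close(NVMED *dev);

/* Read the first 4 KB of the device through a dedicated queue and handle. */
int read_first_4k(void *buf)
{
    NVMED        *dev = nvmed_open("/dev/nvme0n1", 0); /* example device path     */
    NVMED_QUEUE  *q   = nvmed_queue_create(dev, 0);    /* kernel checks permission,
                                                          maps SQ/CQ and doorbell */
    NVMED_HANDLE *h   = nvmed_handle_create(q, 0);     /* bind the handle to it   */

    ssize_t n = nvmed_read(h, buf, 0, 4096);           /* issued from user space  */

    nvmed_handle_destroy(h);
    nvmed_queue_destroy(q);
    nvmed_close(dev);
    return n == 4096 ? 0 : -1;
}
```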
Separating handles from queues enables flexible grouped I/O policies among multiple threads and makes it easy to implement differentiated I/O services. Basically, a single I/O queue is bound to a single handle as Thread A in Figure 1. If a thread needs to separate read requests from write requests to avoid read starvation due to bulk writes, it can bind multiple queues to a single handle as Thread B in Figure 1. It is also possible to share a single queue among multiple threads as Thread C and D in Figure 1.
從隊列中分離出句柄可以在多個線程之間實現靈活的分組I/O策略,並易於實現差異化的I/O服務。總的來說,單個的I/O隊列綁定到單個句柄上,例如圖1中的線程A。如果一個線程需要將讀請求與寫請求分開,以避免由於批量寫入而導致的讀飢餓問題,那么可以將多個隊列綁定到一個單獨的句柄上,例如圖1中的線程B。還可以在多個線程之間共享單個隊列,例如圖1中的線程C和D。
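Continuing with the assumed declarations from the previous sketch, the three bindings of Figure 1 could look roughly as follows; nvmed_handle_create_mq() is an illustrative name for "bind several queues to one handle" and is not confirmed by the text.

```c
/* Reuses NVMED, NVMED_QUEUE, NVMED_HANDLE and the creation calls assumed above. */
NVMED_HANDLE *nvmed_handle_create_mq(NVMED_QUEUE **queues, int nr_queues, int flags);

void bind_examples(NVMED *dev)
{
    /* Thread A: one dedicated queue bound to one handle (the basic case). */
    NVMED_QUEUE  *qa = nvmed_queue_create(dev, 0);
    NVMED_HANDLE *ha = nvmed_handle_create(qa, 0);

    /* Thread B: one handle over two queues, so bulk writes queued separately
       cannot starve reads. */
    NVMED_QUEUE  *qr = nvmed_queue_create(dev, 0);
    NVMED_QUEUE  *qw = nvmed_queue_create(dev, 0);
    NVMED_QUEUE  *pair[2] = { qr, qw };
    NVMED_HANDLE *hb = nvmed_handle_create_mq(pair, 2, 0);

    /* Threads C and D: two handles sharing a single queue. */
    NVMED_QUEUE  *qs = nvmed_queue_create(dev, 0);
    NVMED_HANDLE *hc = nvmed_handle_create(qs, 0);
    NVMED_HANDLE *hd = nvmed_handle_create(qs, 0);

    (void)ha; (void)hb; (void)hc; (void)hd;   /* binding sketch only */
}
```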
NVMeDirect also offers block cache, I/O scheduler, and I/O completion thread components for supporting diverse I/O policies. Applications can mix and match these components depending on their needs. Block cache manipulates the memory for I/O in 4KB unit size similar to the page cache in the kernel. Since the NVMe interface uses the physical memory addresses for I/O commands, the memory in the block cache utilizes pretranslated physical addresses to avoid address translation overhead. I/O scheduler issues I/O commands for asynchronous I/O operations or when an I/O request is dispatched from the block cache. Since the interrupt-based I/O completion incurs context switching and additional software overhead, it is not suitable for high performance and low latency I/O devices [13]. However, if an application is sensitive to bandwidth rather than to latency, polling is not efficient due to the significant increase in the CPU load. Thus, NVMeDirect utilizes a dedicated polling thread with dynamic polling period control based on the I/O size or a hint from applications to avoid unnecessary CPU usage.
NVMeDirect還提供了塊緩存(block cache)、I/O調度器和I/O完成線程等組件,以支持多種I/O策略。應用程序可以根據自己的需要混合搭配這些組件。與內核的頁面緩存類似,塊緩存以4KB為單位管理I/O內存。由於NVMe接口在I/O命令中使用物理內存地址,塊緩存中的內存采用預先轉換好的物理地址,以避免地址轉換開銷。當執行異步I/O操作或塊緩存下發I/O請求時,由I/O調度器負責發出I/O命令。由於基於中斷的I/O完成會導致上下文切換和額外的軟件開銷,因此不適合高性能、低延遲的I/O設備[13]。但是,如果應用對帶寬敏感而對時延不敏感,輪詢的效率就不高了,因為它會顯著增加CPU負載。因此,NVMeDirect使用一個專用的輪詢線程,並根據I/O大小或應用提供的提示動態調整輪詢周期,以避免不必要的CPU占用。
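As one way to picture the block cache described above, the sketch below shows a 4 KB cache entry that carries a pre-translated physical address alongside the virtual address; the field names are illustrative and not the framework's actual layout.

```c
#include <stdint.h>

#define NVMED_CACHE_BLOCK 4096   /* the cache manages memory in 4 KB units */

/* Illustrative cache-entry layout: because NVMe commands carry physical
 * addresses, translating once at allocation time lets the submission path
 * fill the command's data-pointer field without a per-I/O translation. */
struct nvmed_cache_entry_sketch {
    uint64_t lba;      /* which 4 KB block of the device this entry caches  */
    void    *vaddr;    /* virtual address the application reads and writes  */
    uint64_t paddr;    /* physical address, translated once when allocated  */
    int      dirty;    /* set on write; written back via the I/O scheduler  */
};
```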
3.2 Implementation | 實現
The NVMeDirect framework is composed of three components: queue management, admin tool, and user-level library. The queue management and the admin tool are mostly implemented in the NVMe kernel driver, and user-level library is developed as a shared library.
NVMeDirect框架由三個組件構成:隊列管理,管理工具和用戶級別的函數庫。其中,隊列管理和管理工具大部分實現都在NVMe內核驅動中,而用戶級別的函數庫是一個共享函數庫。
We implement the queue management module in the NVMe driver of the Linux kernel 4.3.3. At the initialization stage, the admin tool notifies the access privileges and the number of available queues to the queue management module with ioctl(). When an application requests to create a queue, the queue management module checks the permission and creates a pair of submission and completion queues. The module maps the kernel memory addresses of the created queues and the doorbell to the user's address space using dma_common_mmap() to make them visible for the user-level library. The module also exports the memory addresses via the proc file system. Then, the user-level library can issue I/O commands by accessing the memory addresses of queues and doorbell registers.
我們在Linux內核4.3.3的NVMe驅動中實現了隊列管理模塊。在初始化階段,管理工具通過ioctl()把訪問權限和可用隊列數通知給隊列管理模塊。當應用請求創建一個隊列的時候,隊列管理模塊檢查權限,然后創建一對提交隊列(SQ)和完成隊列(CQ)。該模塊使用dma_common_mmap()將所創建隊列的內核內存地址和門鈴寄存器映射到用戶地址空間,使它們對用戶級庫可見。該模塊還通過proc文件系統導出這些內存地址。此后,用戶級庫就可以通過訪問隊列和門鈴寄存器的內存地址來發送I/O命令。
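From the application's point of view, the initialization path described above might look like the sketch below. The device node name, ioctl request code, and mmap offsets are placeholders (the text does not specify them), while open(), ioctl(), and mmap() are the ordinary system calls the user-level library would rely on, with dma_common_mmap() doing the mapping work on the kernel side.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define NVMED_IOCTL_QUEUE_CREATE 0   /* placeholder request code, not the real one */

int map_queue_pair(void)
{
    int fd = open("/dev/nvme0", O_RDWR);                /* example driver node */
    if (fd < 0)
        return -1;

    /* The kernel module checks the permission granted by the admin tool and
       allocates the SQ/CQ pair. */
    if (ioctl(fd, NVMED_IOCTL_QUEUE_CREATE, 0) < 0) {
        close(fd);
        return -1;
    }

    /* The module's mmap handler (dma_common_mmap() on the kernel side) exposes
       the queue memory and the doorbell page; the offsets here are made up. */
    void *sq = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *cq = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 4096);
    void *db = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 8192);
    if (sq == MAP_FAILED || cq == MAP_FAILED || db == MAP_FAILED) {
        close(fd);
        return -1;
    }

    printf("SQ=%p CQ=%p doorbell=%p\n", sq, cq, db);
    return fd;                                          /* keep fd for later I/O */
}
```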
The I/O completion thread is implemented as a standalone thread to check the completion of I/O using polling. Multiple completion queues can share a single I/O completion thread or a single completion queue can use a dedicated thread to check the I/O completion. The polling period can be adjusted dynamically depending on the I/O characteristics of applications. Also, an application can explicitly set the polling period of the specific queue using nvmed_set_param(). The I/O completion thread uses usleep() to adjust the polling period.
I/O完成線程是作為一個獨立的線程實現的,它使用輪詢來檢查I/O的完成情況。多個完成隊列(CQ)可以共享一個I/O完成線程,單個完成隊列也可以使用專用的完成線程來檢查I/O完成情況。輪詢周期可以根據應用程序的I/O特性動態地調整。此外,應用程序可以調用nvmed_set_param()顯式地設置特定隊列的輪詢周期。I/O完成線程使用usleep()來調整輪詢周期。
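A minimal sketch of such a completion thread is shown below, assuming a callback that reaps completion-queue entries; the framework's internals are not shown in the text, so this only illustrates the usleep()-paced polling loop with an adjustable period.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

struct poll_ctx {
    atomic_uint period_us;               /* 0 = poll continuously, no sleep          */
    atomic_bool stop;
    bool (*reap_completions)(void *cq);  /* checks CQ entries, rings the CQ doorbell */
    void *cq;                            /* the completion queue(s) being watched    */
};

static void *completion_thread(void *arg)
{
    struct poll_ctx *ctx = arg;
    while (!atomic_load(&ctx->stop)) {
        ctx->reap_completions(ctx->cq);
        unsigned p = atomic_load(&ctx->period_us);   /* period can be changed at   */
        if (p)                                       /* runtime, e.g. in response  */
            usleep(p);                               /* to nvmed_set_param()       */
    }
    return NULL;
}

static pthread_t start_completion_thread(struct poll_ctx *ctx)
{
    pthread_t tid;
    pthread_create(&tid, NULL, completion_thread, ctx);  /* error handling omitted */
    return tid;
}
```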
4 Evaluation | 評估
We compare the I/O performance of NVMeDirect with the original kernel-based I/O with asynchronous I/O support (Kernel I/O) and SPDK using the Flexible IO Tester (fio) benchmark [3]. For all the experiments, we use a Linux machine equipped with a 3.3GHz Intel Core i7 CPU and 64GB of memory running Ubuntu 14.04. All the performance evaluations are performed on a commercial Intel 750 Series 400GB NVMe SSD.
通過使用fio benchmark測試,我們比較了NVMeDirect, Kernel I/O和SPDK的I/O性能。在所有的試驗中,我們使用的都是運行Ubuntu 14.04的Linux機器(CPU: 3.3GHz Intel Core i7, 內存:64GB)。所有的性能評估都是在商用的Intel 750系列的NVMe固態硬盤(400GB)上完成的。
4.1 Baseline Performance | 性能基線
Figure 2 depicts the IOPS of random reads (Figure 2a) and random writes (Figure 2b) on NVMeDirect, SPDK, and Kernel I/O, varying the queue depth with a single thread. When the queue depth is sufficiently large, the performance of random reads and writes meets or exceeds the performance specification of the device on NVMeDirect, SPDK, and Kernel I/O alike. However, NVMeDirect achieves higher IOPS than Kernel I/O until the IOPS is saturated. This is because NVMeDirect avoids the overhead of the kernel I/O stack by supporting direct accesses between the user application and the NVMe SSD. As shown in Figure 2a, our framework reaches the near-maximum performance of the device at a queue depth of 64 for random reads, while Kernel I/O delivers 12% less IOPS in the same configuration. In Figure 2b, when NVMeDirect achieves the maximum device performance, Kernel I/O shows 20% less IOPS. SPDK shows the same trend as NVMeDirect because it also accesses the NVMe SSD directly in the user space.
圖2描述了在單線程、不同隊列深度下,NVMeDirect、SPDK和Kernel I/O的隨機讀(圖2a)和隨機寫(圖2b)的IOPS。當隊列深度足夠大時,無論是NVMeDirect、SPDK還是Kernel I/O,隨機讀寫的性能都達到或超過了設備的標稱性能指標。然而,在IOPS達到飽和之前,NVMeDirect始終比Kernel I/O獲得更高的IOPS。這是因為NVMeDirect支持用戶應用直接訪問NVMe固態硬盤,避免了內核I/O棧的開銷。如圖2a所示,當隊列深度為64的時候,NVMeDirect框架在隨機讀上達到了接近設備最大的性能,而在同樣的配置下,Kernel I/O的IOPS少了12%。在圖2b中,當NVMeDirect達到設備最大性能時,Kernel I/O的IOPS少了20%。SPDK顯示的趨勢跟NVMeDirect是一樣的,因為它也在用戶空間直接訪問NVMe固態硬盤。
4.2 Impact of the Polling Period | 輪詢周期產生的影響
Figure 3 shows the trends of IOPS (denoted by lines) and CPU utilization (denoted by bars) as we vary the polling period for each I/O size. The result is normalized to the IOPS achieved when polling is performed without delay for each I/O size. We can notice that a significant performance degradation occurs at a certain point for each I/O size. For instance, if the I/O size is 4KB, it is better to shorten the polling period as much as possible because the I/O is completed very quickly. In the case of 8KB and 16KB I/O sizes, there is no significant slow-down even though the polling is performed only once every 70μs and 200μs, respectively. At the same time, we can reduce the CPU utilization due to the polling thread to 4% for 8KB I/Os and 1% for 16KB I/Os. As mentioned in Section 3.1, we use adaptive polling period control based on this result to reduce the CPU utilization associated with polling.
圖3顯示了針對每種I/O大小改變輪詢周期時,IOPS的變化趨勢(線狀圖)和CPU的利用率(柱狀圖)。結果以每種I/O大小在無延遲輪詢時取得的IOPS為基准做了歸一化。我們不難注意到,對每種I/O大小來說,都存在一個使性能顯著下降的臨界點。例如,當I/O大小為4KB的時候,最好盡可能地縮短輪詢周期,因為I/O能夠非常快速地完成。當I/O大小為8KB和16KB的時候,即使輪詢周期分別放寬到每70μs和每200μs輪詢一次,性能也沒有明顯的下降。與此同時,輪詢線程帶來的CPU占用可以分別降到4%(8KB I/O)和1%(16KB I/O)。正如第3.1節所提到的,我們基於這一結果使用自適應的輪詢周期控制機制,來降低與輪詢相關的CPU占用。
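One straightforward way to encode this measurement as a period table is sketched below; the thresholds come directly from the numbers above (no delay for 4 KB, about 70 µs for 8 KB, about 200 µs for 16 KB), but the actual policy inside NVMeDirect is not spelled out in the text, so treat this as an illustration.

```c
#include <stddef.h>

/* Map an I/O size to a polling period, following the measurements in Figure 3. */
static unsigned polling_period_us(size_t io_size)
{
    if (io_size <= 4096)
        return 0;      /* 4 KB completes too quickly to afford any sleep             */
    if (io_size <= 8192)
        return 70;     /* ~70 us keeps full IOPS and cuts the poll thread to ~4% CPU */
    if (io_size <= 16384)
        return 200;    /* ~200 us keeps full IOPS and cuts the poll thread to ~1% CPU */
    return 200;        /* larger I/Os should tolerate at least the 16 KB period      */
}
```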
4.3 Latency Sensitive Application | 時延敏感型應用
Low latency is particularly effective for latency-sensitive applications such as key-value stores. We evaluate NVMeDirect on one such application, Redis, an in-memory key-value store mainly used as a database, cache, and message broker [2]. Although Redis is an in-memory database, it writes logs for all write commands to provide persistence. This makes the write latency critical to the performance of Redis. To run Redis on NVMeDirect, we added 6 LOC (lines of code) for initialization and modified 12 LOC to replace the POSIX I/O interface with the corresponding NVMeDirect interface with the block cache. We use a total of 10 clients with workload A of YCSB [6], which is an update-heavy workload.
低延遲對諸如鍵值存儲之類的時延敏感型應用尤其有效。我們在其中一種時延敏感型應用Redis上對NVMeDirect進行了評估。Redis是一種內存鍵值存儲,主要用作數據庫、緩存和消息代理[2]。雖然Redis是一種內存數據庫,但它會把所有寫命令記入日志以提供持久性,這使得寫時延對Redis的性能至關重要。為了在NVMeDirect上運行Redis,我們為初始化增加了6行代碼,並修改了12行代碼,用帶塊緩存的NVMeDirect接口替換了對應的POSIX I/O接口。我們使用YCSB [6]中的workload A,共10個客戶端,這是一個以更新操作為主的工作負載。
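The actual 12-line Redis patch is not shown in the paper; the fragment below is only a hypothetical illustration of the kind of substitution described, reusing the assumed NVMED_HANDLE type from the Section 3.1 sketch and an assumed nvmed_cache_write() entry point for the block-cache-backed write path.

```c
#include <sys/types.h>

/* Assumed cached-write entry point of the NVMeDirect user-level library. */
ssize_t nvmed_cache_write(NVMED_HANDLE *h, const void *buf, off_t off, size_t len);

/* Hypothetical append-only-file write path after the change. */
ssize_t aof_append(NVMED_HANDLE *h, off_t aof_off, const char *buf, size_t len)
{
    /* before: return write(aof_fd, buf, len);               (POSIX I/O)   */
    /* after:  go through the NVMeDirect block cache instead (user space)  */
    return nvmed_cache_write(h, buf, aof_off, len);
}
```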
Table 2 shows the throughput and latency of Redis on Kernel I/O and NVMeDirect. NVMeDirect improves the overall throughput by about 15% and decreases the average latency by 13% on read and by 20% on update operations. This is because NVMeDirect reduces the latency by eliminating the software overhead of the kernel.
表2顯示了在Kernel I/O和NVMeDirect上Redis的吞吐量和時延。NVMeDirect提高了總的吞吐量約15%,降低了讀平均時延13%和更新操作時延20%。這是因為NVMeDirect通過消除內核軟件開銷從而降低了時延。
4.4 Differentiated I/O Service | 差異化的I/O服務
Classifying I/Os and boosting prioritized I/Os are important for improving application performance, for example when writing logs in database systems [9, 10]. NVMeDirect can provide differentiated I/O services easily because the framework can apply different I/O policies to applications using I/O handles and multiple queues.
I/O分類和增強I/O優先級對於提高應用程序性能非常重要,例如在數據庫系統中寫入日志。NVMeDirect可以很容易地提供差異化的I/O服務,因為此框架可以通過使用I/O句柄和多個隊列來將不同的I/O策略應用於不同的應用程序。
We perform an experiment to demonstrate a possible I/O boosting scheme in NVMeDirect. To boost specific I/Os, we assign a prioritized thread to a dedicated queue while the other threads share a single queue. In the non-boosting mode, each thread has its own queue in the framework. Figure 4 illustrates the IOPS of Kernel I/O and the two I/O boosting modes of NVMeDirect while running the fio benchmark. The benchmark runs four threads including one prioritized thread, and each thread performs random writes with a queue depth of 4. As shown in the result, the prioritized thread with a dedicated queue on NVMeDirect outperforms the other threads remarkably. In the case of Kernel I/O, all threads have similar performance because there is no mechanism to support prioritized I/Os. This result shows that NVMeDirect can provide the differentiated I/O service without any software overhead.
我們做了一個實驗來展示NVMeDirect中可行的I/O增強(boosting)方案。為了增強特定的I/O,我們給優先級高的線程分配一個專用隊列,而其他線程則共享一個隊列。在非增強模式下,每個線程在框架中都有自己的隊列。圖4給出了運行fio基准測試時,Kernel I/O以及NVMeDirect兩種I/O增強模式下的IOPS。基准測試運行四個線程,其中包括一個高優先級線程,每個線程以隊列深度4執行隨機寫。結果表明,在NVMeDirect上使用專用隊列的高優先級線程的性能明顯優於其他線程。在Kernel I/O的情況下,所有線程都具有相似的性能,因為沒有支持優先級I/O的機制。這一結果表明,NVMeDirect可以在沒有額外軟件開銷的情況下提供差異化的I/O服務。
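Using the assumed declarations from the Section 3.1 sketch, the boosted configuration of this experiment could be set up along the following lines: the prioritized thread gets its own queue while the other three threads share one.

```c
/* Boosted setup for the fio experiment: one dedicated queue plus one shared queue. */
void setup_boosting(NVMED *dev, NVMED_HANDLE **prio, NVMED_HANDLE *others[3])
{
    NVMED_QUEUE  *q_prio   = nvmed_queue_create(dev, 0);    /* dedicated to the     */
    NVMED_HANDLE *h_prio   = nvmed_handle_create(q_prio, 0);/* prioritized thread   */

    NVMED_QUEUE  *q_shared = nvmed_queue_create(dev, 0);    /* shared by the others */
    for (int i = 0; i < 3; i++)
        others[i] = nvmed_handle_create(q_shared, 0);

    *prio = h_prio;
    /* In the non-boosting mode each of the four threads would instead create
       its own queue, as in the Thread A case of Figure 1. */
}
```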
5 Related Work | 相關工作
There have been several studies on optimizing the storage stack as the hardware latency has decreased to tens of microseconds. Shin et al. [11] present a low-latency I/O completion scheme based on an optimized low-level hardware abstraction layer. In particular, optimizing the I/O path minimizes the scheduling delay caused by context switches. Yu et al. [14] propose several optimization schemes to fully exploit the performance of fast storage devices, including polling for I/O completion, eliminating context switches, merging I/O, and double buffering. Yang et al. [13] also show that polling for I/O completion delivers higher performance than the traditional interrupt-driven I/O.
由於硬件時延已降低到幾十微秒,業界開展了多項針對存儲棧優化的研究。Shin et al. [11]提出了一種基於優化的底層硬件抽象層的低時延I/O完成方案,特別是通過優化I/O路徑,將上下文切換引起的調度時延降至最低。Yu et al. [14]提出了幾種優化方案,以充分發揮快速存儲設備的性能,包括輪詢I/O完成、消除上下文切換、合並I/O和雙緩沖。Yang et al. [13]還指出,輪詢式的I/O完成方法比傳統的中斷驅動I/O性能更高。
Since the kernel still has overhead in spite of several optimizations, researchers have tried to utilize direct access to the storage devices without involving the kernel. Caulfield et al. [5] present user-space software stacks to further reduce storage access latencies based on their special storage device, Moneta [4]. Likewise, Volos et al. [12] propose a flexible file-system architecture that exposes the storage-class memory to user applications to access storage without kernel interaction. These approaches are similar to the proposed NVMeDirect I/O framework. However, their studies require special hardware while our framework can run on any commercial NVMe SSDs. In addition, they still have complex modules to provide general file system layer which is not necessary for all applications.
盡管有一些優化,但是內核開銷依然存在,於是研究人員試圖直接訪問存儲設備而不經過內核。Caulfield et al. [5]提出用戶空間軟件棧以進一步降低存儲訪問的時延,基於他們的特殊存儲設備(Moneta)。類似地,Volos et al. [12]提出了一個靈活的文件系統架構,它將存儲類內存暴露給用戶應用,在沒有內核交互的情況下直接訪問存儲。這些方法與我們提出的NVMeDirect I/O框架類似。然而,他們的研究需要特定的硬件,而我們的框架可以運行在任何商用的NVMe固態硬盤上。此外,他們仍然有復雜的模塊來提供通用文件系統層,而文件系統層對於所有應用程序來說都不是必需的。
NVMeDirect is a research outcome independent of SPDK [7] released by Intel. Although NVMeDirect is conceptually similar to SPDK, it has the following differences. First, NVMeDirect leverages the kernel NVMe driver for control-plane operations, so existing applications and NVMeDirect-enabled applications can share the same NVMe SSD. In SPDK, however, the whole NVMe SSD is dedicated to a single process that contains all the user-level driver code. Second, NVMeDirect is not intended to be just a set of mechanisms that allow user-level direct accesses to NVMe SSDs. Instead, it also aims to provide a large set of I/O policies to optimize various data-intensive applications according to their characteristics and requirements.
NVMeDirect是一項獨立於英特爾發布的SPDK的研究成果。雖然NVMeDirect在概念上與SPDK相似,但它有以下不同。首先,NVMeDirect利用內核NVMe驅動執行控制平面的操作,因此現有的應用和啟用了NVMeDirect的應用可以共享同一塊NVMe固態硬盤;而在SPDK中,整塊NVMe固態硬盤專屬於某一個進程,該進程包含全部用戶級驅動代碼。其次,NVMeDirect不只是一套允許用戶級直接訪問NVMe固態硬盤的機制,它還旨在根據應用自身的特點和需求,提供大量的I/O策略來優化各種數據密集型應用。
6 Conclusion | 結論
We propose a user-space I/O framework, NVMeDirect, to enable the application-specific optimization on NVMe SSDs. Using NVMeDirect, user-space applications can access NVMe SSDs directly without any overhead of the kernel I/O stack. In addition, the framework provides several I/O policies which can be used selectively by the demand of applications. The evaluation results show that NVMeDirect is a promising approach to improve application performance using several user-level I/O optimization schemes.
我們提出了一個用戶空間I/O框架NVMeDirect,以便在NVMe固態硬盤上實現面向特定應用的優化。使用NVMeDirect,用戶態的應用程序能夠直接訪問NVMe固態硬盤,而沒有任何內核I/O棧的開銷。此外,該框架還提供了可根據應用需求有選擇地使用的多種I/O策略。評估結果表明,借助多種用戶級I/O優化方案,NVMeDirect是一種很有前景的提高應用程序性能的方法。
Since NVMeDirect does not interact with the kernel during the I/Os, it cannot provide enough protection normally enforced by the file system layer. In spite of this, we believe NVMeDirect is still useful for many data-intensive applications which are deployed in a trusted environment. As future work, we plan to investigate ways to protect the system against illegal memory and storage accesses. We are also planning to provide user-level file systems to support more diverse application scenarios. NVMeDirect is available as opensource at https://github.com/nvmedirect.
因為NVMeDirect在I/O過程中不與內核交互,所以它不能提供通常由文件系統層提供的足夠的訪問保護。雖然如此,我們依然相信NVMeDirect對很多數據密集型應用是有用的,這些應用部署在可信任的環境中。未來我們計划研究保護系統免受非法內存/存儲訪問的方法。我們還計划提供用戶級別的文件系統,以支持更加多樣化的應用場景。關於NVMeDirect的源代碼,請訪問https://github.com/nvmedirect。
Acknowledgements | 鳴謝
We would like to thank the anonymous reviewers and our shepherd, Nisha Talagala, for their valuable comments. This work was supported by Samsung Research Funding Center of Samsung Electronics under Project Number SRFC-TB1503-03.
請允許我們對匿名審稿人和指導老師Nisha Talagala表示感謝,感謝他們提出的寶貴意見。三星電子的三星研究基金中心為這項工作提供了支持和幫助,項目編號是SRFC-TB1503-03。
References | 參考文獻
[1] NVM Express Overview. http://www.nvmexpress.org/about/nvm-express-overview/.
[2] Redis. http://redis.io/.
[3] AXBOE, J. Flexible IO tester. http://git.kernel.dk/?p=fio.git;a=summary.
[4] CAULFIELD, A. M., DE, A., COBURN, J., MOLLOV, T. I., GUPTA, R. K., AND SWANSON, S. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proc. MICRO (2010), pp. 385-395.
[5] CAULFIELD, A. M., MOLLOV, T. I., EISNER, L. A., DE, A., COBURN, J., AND SWANSON, S. Providing safe, user space access to fast, solid state disks. In Proc. ASPLOS (2012), pp. 387-400.
[6] COOPER, B. F., SILBERSTEIN, A., TAM, E., RAMAKRISHNAN, R., AND SEARS, R. Benchmarking cloud serving systems with YCSB. In Proc. SOCC (2010), pp. 143-154.
[7] INTEL. Storage performance development kit. https://01.org/spdk.
[8] INTEL AND MICRON. Intel and Micron produce breakthrough memory technology. http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology, 2015.
[9] KIM, S., KIM, H., KIM, S.-H., LEE, J., AND JEONG, J. Request-oriented durable write caching for application performance. In Proc. USENIX ATC (2015), pp. 193-206.
[10] LEE, S.-W., MOON, B., PARK, C., KIM, J.-M., AND KIM, S.-W. A case for flash memory SSD in enterprise database applications. In Proc. SIGMOD (2008), pp. 1075-1086.
[11] SHIN, W., CHEN, Q., OH, M., EOM, H., AND YEOM, H. Y. OS I/O path optimizations for flash solid-state drives. In Proc. USENIX ATC (2014), pp. 483-488.
[12] VOLOS, H., NALLI, S., PANNEERSELVAM, S., VARADARAJAN, V., SAXENA, P., AND SWIFT, M. M. Aerie: Flexible file-system interfaces to storage-class memory. In Proc. EuroSys (2014).
[13] YANG, J., MINTURN, D. B., AND HADY, F. When poll is better than interrupt. In Proc. FAST (2012), p. 3.
[14] YU, Y. J., SHIN, D. I., SHIN, W., SONG, N. Y., CHOI, J. W., KIM, H. S., EOM, H., AND YEOM, H. Y. Optimizing the block I/O subsystem for fast storage devices. ACM Transactions on Computer Systems (TOCS) 32, 2 (2014), 6.