[SPDK/NVMe Storage Technology Analysis] 002 - Official SPDK Introduction


Introduction to the Storage Performance Development Kit (SPDK)

By Jonathan S. (Intel), Updated December 5, 2016

Solid-state storage media is in the process of taking over the data center. Current-generation flash storage enjoys significant advantages in performance, power consumption, and rack density over rotational media. These advantages will continue to grow as next-generation media enter the marketplace.

Customers integrating current solid-state media, such as the Intel® SSD DC P3700 Series Non-Volatile Memory Express* (NVMe*) drive, face a major challenge: because the throughput and latency performance are so much better than that of a spinning disk, the storage software now consumes a larger percentage of the total transaction time. In other words, the performance and efficiency of the storage software stack is increasingly critical to the overall storage system. As storage media continues to evolve, it risks outstripping the software architectures that use it, and in coming years the storage media landscape will continue evolving at an incredible pace.

To help storage OEMs and ISVs integrate this hardware, Intel has created a set of drivers and a complete, end-to-end reference storage architecture called the Storage Performance Development Kit (SPDK). The goal of SPDK is to highlight the outstanding efficiency and performance enabled by using Intel’s networking, processing, and storage technologies together. By running software designed from the silicon up, SPDK has demonstrated that millions of I/Os per second are easily attainable by using a few processor cores and a few NVMe drives for storage with no additional offload hardware. Intel provides the entire Linux* reference architecture source code under the broad and permissive BSD license, distributed to the community through GitHub*. A blog, mailing list, and additional documentation can be found at spdk.io.

Software Architectural Overview

How does SPDK work? The extremely high performance is achieved by combining two key techniques: running at user level and using Poll Mode Drivers (PMDs). Let’s take a closer look at these two software engineering terms.

First, running our device driver code at user level means that, by definition, driver code does not run in the kernel. Avoiding the kernel context switches and interrupts saves a significant amount of processing overhead, allowing more cycles to be spent doing the actual storing of the data. Regardless of the complexity of the storage algorithms (deduplication, encryption, compression, or plain block storage), fewer wasted cycles means better performance and latency. This is not to say that the kernel is adding unnecessary overhead; rather, the kernel adds overhead relevant to general-purpose computing use cases that may not be applicable to a dedicated storage stack. The guiding principle of SPDK is to provide the lowest latency and highest efficiency by eliminating every source of additional software overhead.

Second, PMDs change the basic model for an I/O. In the traditional I/O model, the application submits a request for a read or a write, and then sleeps while awaiting an interrupt to wake it up once the I/O has been completed. PMDs work differently; an application submits the request for a read or write, and then goes off to do other work, checking back at some interval to see if the I/O has yet been completed. This avoids the latency and overhead of using interrupts and allows the application to improve I/O efficiency. In the era of spinning media (tape and HDDs), the overhead of an interrupt was a small percentage of the overall I/O time, so interrupt-driven I/O was a tremendous efficiency boost to the system. However, as the age of solid-state media continues to introduce lower-latency persistent media, interrupt overhead has become a non-trivial portion of the overall I/O time. This challenge will only become more glaring with lower latency media. Systems are already able to process many millions of I/Os per second, so the elimination of this overhead for millions of transactions compounds quickly into multiple cores being saved. Packets and blocks are dispatched immediately and time spent waiting is minimized, resulting in lower latency, more consistent latency (less jitter), and improved throughput.
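To make the polling model concrete, here is a minimal sketch of a poll-mode read using the SPDK NVMe driver API, assuming a namespace and I/O queue pair have already been set up (device attachment is sketched below under Hardware Drivers). Function names follow include/spdk/nvme.h; error handling is omitted for brevity.

#include <stdbool.h>
#include "spdk/nvme.h"
#include "spdk/env.h"

struct io_status {
    bool done;
};

static void
read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
    struct io_status *status = arg;

    status->done = true;
}

static void
poll_mode_read(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair)
{
    struct io_status status = { .done = false };
    /* 4 KiB DMA-able buffer; the alignment here is a typical choice, not a requirement of the article. */
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);

    /* Submit a one-LBA read at LBA 0; the call queues the command and returns immediately. */
    spdk_nvme_ns_cmd_read(ns, qpair, buf, 0, 1, read_complete, &status, 0);

    /* No interrupt, no sleep: the application keeps the CPU and polls the completion queue.
     * A real application would interleave other useful work inside this loop. */
    while (!status.done) {
        spdk_nvme_qpair_process_completions(qpair, 0);
    }

    spdk_dma_free(buf);
}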

SPDK is composed of numerous subcomponents, interlinked and sharing the common elements of user-level and poll-mode operation. Each of these components was created to overcome a specific performance bottleneck encountered while creating the end-to-end SPDK architecture. However, each of these components can also be integrated into non-SPDK architectures, allowing customers to leverage the experience and techniques used within SPDK to accelerate their own software.

Starting at the bottom and building up:

Hardware Drivers

NVMe driver: The foundational component for SPDK, this highly optimized, lockless driver provides unparalleled scalability, efficiency, and performance.
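As a rough illustration, the sketch below enumerates and attaches local NVMe controllers through the user-level driver. It assumes the devices have been unbound from the kernel NVMe driver (for example, via SPDK's setup script) and follows the probe/attach callback API in include/spdk/nvme.h.

#include <stdbool.h>
#include <stdio.h>
#include "spdk/nvme.h"
#include "spdk/env.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
         struct spdk_nvme_ctrlr_opts *opts)
{
    printf("Probing %s\n", trid->traddr);
    return true; /* attach to every controller found */
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
          struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
    printf("Attached %s, %u namespace(s)\n",
           trid->traddr, spdk_nvme_ctrlr_get_num_ns(ctrlr));
}

int
main(void)
{
    struct spdk_env_opts opts;

    /* Initialize the SPDK environment layer (hugepages, PCI access, etc.). */
    spdk_env_opts_init(&opts);
    opts.name = "probe_example";
    if (spdk_env_init(&opts) < 0) {
        return 1;
    }

    /* A NULL transport ID means: probe all local PCIe NVMe controllers. */
    if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0) {
        return 1;
    }
    return 0;
}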

Intel® QuickData Technology: Also known as Intel® I/O Acceleration Technology (Intel® IOAT), this is a copy offload engine built into the Intel® Xeon® processor-based platform. By providing user space access, the threshold for DMA data movement is reduced, allowing greater utilization for small-size I/Os or NTB.
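A hedged sketch of what a user-space copy offload looks like with the SPDK I/OAT driver API (include/spdk/ioat.h): the CPU submits a descriptor and polls for completion while the DMA engine moves the data. Channel acquisition via spdk_ioat_probe() is elided, and the buffers are assumed to be DMA-safe memory (e.g., from spdk_dma_zmalloc()).

#include <stdbool.h>
#include "spdk/ioat.h"
#include "spdk/env.h"

static void
copy_done(void *arg)
{
    *(bool *)arg = true;
}

static void
offload_copy(struct spdk_ioat_chan *chan, void *dst, const void *src, uint64_t len)
{
    bool done = false;

    /* Hand the copy to the DMA engine; the call returns without blocking. */
    spdk_ioat_submit_copy(chan, &done, copy_done, dst, src, len);

    /* Poll the channel; a real application could do other work between polls. */
    while (!done) {
        spdk_ioat_process_events(chan);
    }
}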

Back-End Block Devices

NVMe over Fabrics (NVMe-oF) initiator: From a programmer’s perspective, the local SPDK NVMe driver and the NVMe-oF initiator share a common set of API commands. This means that local/remote replication, for example, is extraordinarily easy to enable.
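For illustration, this sketch targets a remote NVMe-oF subsystem with the same spdk_nvme_probe() call used for local devices; only the transport ID differs. The address, service ID, and subsystem NQN below are placeholders, and probe_cb()/attach_cb() are the callbacks from the enumeration sketch above.

#include <stdbool.h>
#include <stdio.h>
#include "spdk/nvme.h"

/* Same callbacks as in the local enumeration sketch under Hardware Drivers. */
bool probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
              struct spdk_nvme_ctrlr_opts *opts);
void attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
               struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts);

static void
probe_remote_subsystem(void)
{
    struct spdk_nvme_transport_id trid = {0};

    trid.trtype = SPDK_NVME_TRANSPORT_RDMA;
    trid.adrfam = SPDK_NVMF_ADRFAM_IPV4;
    snprintf(trid.traddr, sizeof(trid.traddr), "192.168.1.10");  /* placeholder address */
    snprintf(trid.trsvcid, sizeof(trid.trsvcid), "4420");        /* conventional NVMe-oF port */
    snprintf(trid.subnqn, sizeof(trid.subnqn),
             "nqn.2016-06.io.spdk:example");                     /* placeholder NQN */

    /* Identical call and callbacks as for a local PCIe controller. */
    spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL);
}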

Ceph* RADOS Block Device (RBD): Enables Ceph as a back-end device for SPDK. This might allow Ceph to be used as another storage tier, for example.

Blobstore Block Device: A block device allocated by the SPDK Blobstore, this is a virtual device that VMs or databases could interact with. These devices enjoy the benefits of the SPDK infrastructure, meaning zero locks and incredibly scalable performance.

Linux* Asynchronous I/O (AIO): Allows SPDK to interact with kernel devices like HDDs.

Storage Services

Block device abstraction layer (bdev): This generic block device abstraction is the glue that connects the storage protocols to the various device drivers and block devices. It also provides flexible APIs for additional customer functionality (RAID, compression, deduplication, and so on) in the block layer.
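A minimal sketch of consuming a block device through the bdev API, assuming it runs on an SPDK thread inside a started application and that a bdev named "Nvme0n1" exists (the name is a placeholder). Calls follow include/spdk/bdev.h in recent SPDK releases.

#include <stdbool.h>
#include "spdk/bdev.h"
#include "spdk/env.h"

static void
bdev_event_cb(enum spdk_bdev_event_type type, struct spdk_bdev *bdev, void *ctx)
{
    /* A real application handles hot-remove and resize events here. */
}

static void
read_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    spdk_bdev_free_io(bdev_io);
}

static void
read_first_block(void)
{
    struct spdk_bdev_desc *desc;
    struct spdk_io_channel *ch;
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);

    /* The same calls work whether "Nvme0n1" is backed by NVMe, Linux AIO, Ceph RBD, ... */
    spdk_bdev_open_ext("Nvme0n1", false, bdev_event_cb, NULL, &desc);
    ch = spdk_bdev_get_io_channel(desc);

    spdk_bdev_read(desc, ch, buf, 0, 4096, read_done, NULL);
    /* The completion is delivered by the SPDK thread's poller, not by an interrupt. */
}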

Blobstore: Implements a highly streamlined file-like semantic (non-POSIX*) for SPDK. This can provide high-performance underpinnings for databases, containers, virtual machines (VMs), or other workloads that do not depend on much of a POSIX file system’s feature set, such as user access control.
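Like the rest of SPDK, the Blobstore API is asynchronous and callback-driven. A rough sketch, assuming an existing spdk_bs_dev backend and an SPDK thread context (include/spdk/blob.h): initialize a blobstore on the backend, then create a blob in the init completion callback.

#include "spdk/blob.h"

static void
blob_created(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
    /* The new blob can now be opened, resized, and written. */
}

static void
bs_initialized(void *cb_arg, struct spdk_blob_store *bs, int bserrno)
{
    if (bserrno == 0) {
        spdk_bs_create_blob(bs, blob_created, NULL);
    }
}

static void
start_blobstore(struct spdk_bs_dev *bs_dev)
{
    /* NULL opts selects the defaults (cluster size, etc.). */
    spdk_bs_init(bs_dev, NULL, bs_initialized, NULL);
}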

Storage Protocols

iSCSI target: Implementation of the established specification for block traffic over Ethernet; about twice as efficient as kernel LIO. The current version uses the kernel TCP/IP stack by default.

NVMe-oF target: Implements the new NVMe-oF specification. Though it depends on RDMA hardware, the NVMe-oF target can serve up to 40 Gbps of traffic per CPU core.

vhost-scsi target: A feature for KVM/QEMU that utilizes the SPDK NVMe driver, giving guest VMs lower latency access to the storage media and reducing the overall CPU load for I/O intensive workloads.


SPDK does not fit every storage architecture. Here are a few questions that might help you determine whether SPDK components are a good fit for your architecture.

》 Is the storage system based on Linux or FreeBSD*?

SPDK is primarily tested and supported on Linux. The hardware drivers are supported on both FreeBSD and Linux.

》 Is the hardware platform for the storage system Intel® architecture?

SPDK is designed to take full advantage of Intel® platform characteristics and is tested and tuned for Intel® chips and systems.

》 Does the performance path of the storage system currently run in user mode?

SPDK is able to improve performance and efficiency by running more of the performance path in user space. By combining applications with SPDK features like the NVMe-oF target, initiator, or Blobstore, the entire data path may be able to run in user space, offering substantial efficiencies.

》 Can the system architecture incorporate lockless PMDs into its threading model?

Since PMDs continually run on their threads (instead of sleeping or ceding the processor when unused), they have specific thread model requirements.
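In SPDK's own event framework, for example, a PMD's polling work is registered as a poller that the framework's reactor threads invoke continuously, rather than as a thread that sleeps. A rough sketch, assuming a queue pair that is already set up and API names from recent SPDK releases (include/spdk/thread.h):

#include "spdk/thread.h"
#include "spdk/nvme.h"

static int
nvme_poll(void *ctx)
{
    struct spdk_nvme_qpair *qpair = ctx;

    /* Drain any pending completions; the reactor calls this function in a tight loop. */
    int rc = spdk_nvme_qpair_process_completions(qpair, 0);

    return rc > 0 ? SPDK_POLLER_BUSY : SPDK_POLLER_IDLE;
}

static struct spdk_poller *
start_polling(struct spdk_nvme_qpair *qpair)
{
    /* A period of 0 means "run on every reactor iteration". */
    return spdk_poller_register(nvme_poll, qpair, 0);
}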

》 Does the system currently use the Data Plane Development Kit (DPDK) to handle network packet workloads?

SPDK shares primitives and programming models with DPDK, so customers currently using DPDK will likely find the close integration with SPDK useful. Similarly, if customers are using SPDK, adding DPDK features for network processing may present a significant opportunity.

》 Does the development team have the expertise to understand and troubleshoot problems themselves?

Intel shall have no support obligations for this reference software. While Intel and the open source community around SPDK will use commercially reasonable efforts to investigate potential errata of unmodified released software, under no circumstances will Intel have any obligation to customers with respect to providing any maintenance or support of the software.


If you’d like to find out more about SPDK, please fill out the contact request form or check out SPDK.io for access to the mailing list, documentation, and blogs.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Performance testing configuration:

    2S Intel® Xeon® processor E5-2699 v3: 18 C, 2.3 GHz (hyper-threading off)
        Note: Single socket was used while performance testing
    32 GB DDR4 2133 MT/s
        4 Memory Channel per CPU
            1x 4GB 2R DIMM per channel
    Ubuntu* (Linux) Server 14.10
    3.16.0-30-generic kernel
    Intel® Ethernet Controller XL710 for 40GbE
    8x P3700 NVMe drives for storage

    NVMe configuration
        Total: 8 PCIe* Gen 3 x4 NVMe drives
            4 NVMe drives from the first x16 slot (bifurcated to 4 x4s in the BIOS)
            Another 4 NVMe drives from the second x16 slot (bifurcated to 4 x4s in the BIOS)
        Intel® SSD DC P3700 Series 800 GB
        Firmware: 8DV10102

    FIO Benchmark Configuration
        Direct: Yes
        Queue depth:
            4KB random I/O: 32 outstanding I/Os
            64KB sequential I/O: 8 outstanding I/Os
        Ramp Time: 30 seconds
        Run Time: 180 seconds
        Norandommap: 1
        I/O Engine: libaio
        Numjobs: 1

    BIOS Configuration
        Speed Step: Disabled
        Turbo Boost: Disabled
        CPU Power and Performance Policy: Performance

For more information go to http://www.intel.com/performance.
