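Source: https://zcopy.wordpress.com/ Note: This article is not a word-for-word translation of the original post. It excerpts the core parts in order to introduce the RDMA Send operation (wherever RDMA send is mentioned below, it corresponds to the send operation in IBA). The example given is very easy to follow and well worth reading.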
1. What is RDMA
RDMA (Remote Direct Memory Access) is a way of moving buffers between two applications across a network. RDMA differs from traditional network interfaces in that it bypasses the operating system. This allows programs that use RDMA to achieve:
- The lowest latency
- The highest throughput
- The smallest CPU footprint (that is, the work that requires CPU involvement is minimized)
2. How Can We Use It
To make use of RDMA we need a network interface card that implements an RDMA engine.
We call this card an HCA (Host Channel Adapter). The adapter creates a channel from its RDMA engine through the PCI Express bus to the application memory. A good HCA implements in hardware all the logic needed to execute the RDMA protocol over the wire. This includes segmentation and reassembly as well as flow control and reliability. So from the application's perspective, we deal only with whole buffers.
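To make "segmentation and reassembly" concrete, here is a minimal Python sketch of the idea. It is a toy model and not how any particular HCA works: the function names `segment` and `reassemble`, the sequence-number scheme, and the 4096-byte MTU are all assumptions chosen for illustration.

```python
from __future__ import annotations


def segment(buffer: bytes, mtu: int) -> list[tuple[int, bytes]]:
    """Split one buffer into (sequence number, payload) packets of at most mtu bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [(seq, buffer[off:off + mtu])
            for seq, off in enumerate(range(0, len(buffer), mtu))]


def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the original buffer by ordering the packets by sequence number."""
    return b"".join(payload for _, payload in sorted(packets))


message = bytes(range(256)) * 40          # a 10240-byte application buffer
packets = segment(message, mtu=4096)      # 4096 is an assumed MTU, for illustration only
received = reassemble(list(reversed(packets)))  # shuffle the order to exercise the sort
print(len(packets), received == message)
```

The application only ever sees `message` and `received`, which is the sense in which it "deals with whole buffers"; the packets in between are the adapter's concern.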
In RDMA we set up data channels using a kernel driver. We call this path the command channel. We use the command channel to establish data channels, which allow us to move data while bypassing the kernel entirely. Once we have established these data channels we can read and write buffers directly.
The API for establishing these data channels is called "verbs". The verbs API is maintained in an open source Linux project called the OpenFabrics Enterprise Distribution (OFED) (www.openfabrics.org). There is an equivalent project for Windows, WinOF, located at the same site.
The verbs API is different from the sockets programming API you might be used to. But once you learn a few concepts, it is actually a lot easier to use and makes your programs much simpler to design.
3. Queue Pairs
RDMA operations start by "pinning" memory. When you pin memory you are telling the kernel that this memory is owned by the application and must stay resident in physical memory. Next we tell the HCA to address the memory and prepare a channel from the card to the memory. We refer to this as registering a Memory Region (MR). We can now use the registered memory in any RDMA operation we want to perform. The diagram below shows the registered region and the buffers within that region that are in use by the communication queues.
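The relationship between a registered region and the buffers inside it can be sketched with a small Python model. This is not the verbs API (there, registration is done with `ibv_reg_mr`); the classes `MemoryRegion` and `ToyHCA` and their methods are hypothetical names, and memory is modelled as plain integer offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    """Toy stand-in for a registered region: a start offset and a length."""
    start: int
    length: int

    def contains(self, offset: int, size: int) -> bool:
        """True if [offset, offset + size) lies entirely inside the region."""
        return (size >= 0 and self.start <= offset
                and offset + size <= self.start + self.length)


class ToyHCA:
    """Hypothetical adapter that only touches buffers inside registered regions."""

    def __init__(self):
        self.regions = []

    def register(self, start: int, length: int) -> MemoryRegion:
        mr = MemoryRegion(start, length)
        self.regions.append(mr)
        return mr

    def check_buffer(self, offset: int, size: int) -> bool:
        return any(mr.contains(offset, size) for mr in self.regions)


hca = ToyHCA()
hca.register(start=1024, length=4096)       # region covers offsets 1024..5119
inside = hca.check_buffer(2048, 512)        # buffer fully inside the region
outside = hca.check_buffer(4096, 2048)      # buffer runs past the end of the region
print(inside, outside)
```

The point of the model is only that every buffer later named in a work request has to fall inside memory that was registered beforehand.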
RDMA communication is based on a set of three queues: the Send Queue (SQ), the Receive Queue (RQ), and the Completion Queue (CQ). The send queue and receive queue are responsible for scheduling work. They are always created in pairs and are referred to as a Queue Pair (QP). The Completion Queue is used to notify us when the instructions placed on the work queues have been completed.
A user places instructions on its work queues that tell the HCA which buffers it wants to send or receive. These instructions are small structs called Work Requests (WR) or Work Queue Elements (WQE). WQE is pronounced "WOOKIE", like the creature from Star Wars. A WQE primarily contains a pointer to a buffer. A WQE placed on the send queue contains a pointer to the message to be sent. A WQE placed on the receive queue contains a pointer to a buffer where an incoming message from the wire can be placed.
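The following Python sketch illustrates the claim that a WQE carries a pointer to a buffer rather than the data itself. It is a simplified model, not the verbs data structures: the `WQE` and `QueuePair` classes are hypothetical, a "pointer" is modelled as an offset and length into a `bytearray`, and the `wr_id` field merely echoes the identifier that real work requests carry.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WQE:
    wr_id: int     # caller-chosen identifier for this work request
    offset: int    # where the buffer starts in (toy) registered memory
    length: int    # how many bytes the buffer spans


class QueuePair:
    """A send queue and a receive queue, always created together."""

    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()

    def post_send(self, wqe: WQE) -> None:
        self.send_queue.append(wqe)

    def post_recv(self, wqe: WQE) -> None:
        self.recv_queue.append(wqe)


memory = bytearray(b"hello RDMA") + bytearray(22)   # 32 bytes of toy registered memory
qp = QueuePair()
send_wqe = WQE(wr_id=1, offset=0, length=10)
qp.post_send(send_wqe)                               # message to be sent
qp.post_recv(WQE(wr_id=2, offset=16, length=16))     # empty space for an incoming message

# The WQE holds no copy of the data: changing the buffer after posting
# changes what the WQE points at.
memory[0:5] = b"HELLO"
pointed_at = bytes(memory[send_wqe.offset:send_wqe.offset + send_wqe.length])
print(pointed_at)
```

One practical consequence, under this model, is that a buffer should not be modified between posting its WQE and seeing the matching completion.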
RDMA is an asynchronous transport mechanism, so we can queue a number of send or receive WQEs at a time. The HCA will process these WQEs in order, as fast as it can. When a WQE is processed, the data is moved. Once the transaction completes, the HCA creates a Completion Queue Element (CQE) and places it on the Completion Queue (CQ). CQE is pronounced "COOKIE".
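The asynchronous behaviour described above (post several WQEs, let the adapter work through them in order, then collect CQEs) can be sketched as follows. This is a single-threaded toy, assuming a hypothetical `ToyHCA` class whose `process` method stands in for the hardware. In the verbs API the corresponding calls are `ibv_post_send` and `ibv_poll_cq`, but nothing here reproduces their signatures.

```python
from collections import deque


class ToyHCA:
    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()
        self.wire = []                      # payloads "transmitted" so far

    def post_send(self, wr_id, payload):
        """Queue the work and return immediately; nothing is transmitted yet."""
        self.send_queue.append((wr_id, payload))

    def process(self):
        """Drain the send queue in FIFO order, emitting one CQE per WQE."""
        while self.send_queue:
            wr_id, payload = self.send_queue.popleft()
            self.wire.append(payload)
            self.completion_queue.append({"wr_id": wr_id, "status": "success"})

    def poll_cq(self, max_entries):
        """Return up to max_entries CQEs without blocking."""
        entries = []
        while self.completion_queue and len(entries) < max_entries:
            entries.append(self.completion_queue.popleft())
        return entries


hca = ToyHCA()
for wr_id, payload in enumerate([b"first", b"second", b"third"], start=1):
    hca.post_send(wr_id, payload)

before = hca.poll_cq(8)       # nothing has completed yet
hca.process()                 # stands in for the hardware doing its work
first_batch = hca.poll_cq(2)
second_batch = hca.poll_cq(2)
print(before, [c["wr_id"] for c in first_batch], [c["wr_id"] for c in second_batch])
```

Posting returns before any data moves, and the application learns about completion only by looking at the CQ, which is what makes the model asynchronous.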
4. A Simple Example
Let's look at a simple example. In this example we will move a buffer from the memory of System A to the memory of System B. This is what we call message passing semantics. The operation is a SEND, which is the most basic form of RDMA.
Step 1: Systems A and B have each created their QPs and Completion Queues and registered regions in memory for RDMA to take place. System A identifies a buffer that it wants to move to System B. System B has an empty buffer allocated for the data to be placed in.
Step 2: System B creates a WQE ("WOOKIE") and places it on its Receive Queue. This WQE contains a pointer to the memory buffer where the data will be placed. System A also creates a WQE, which points to the buffer in its memory that will be transmitted, and places it on its Send Queue.
Step 3: The HCA is always working in hardware, looking for WQEs on the send queue. The HCA on System A consumes the WQE from System A's send queue and begins streaming the data from the memory region to System B. When data begins arriving at System B, the HCA there consumes the WQE on the receive queue to learn where it should place the data. The data streams over a high-speed channel, bypassing the kernel.
Step 4: When the data movement completes, the HCA creates a CQE ("COOKIE"). It is placed on the Completion Queue and indicates that the transaction has completed. For every WQE consumed, a CQE is generated. So a CQE is created on System A's CQ, indicating that the send completed, and another on System B's CQ, indicating that the receive completed. A CQE is always generated, even if there was an error. The CQE contains a field indicating the status of the transaction.
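The four steps can be walked through end to end in a small Python simulation. It is a sketch under loud assumptions: the `System` class and `hca_transfer` function are hypothetical, both "systems" live in one process, a `bytearray` stands in for registered memory, and the error strings are invented for illustration rather than taken from any real status code.

```python
from collections import deque


class System:
    def __init__(self, name, size):
        self.name = name
        self.memory = bytearray(size)     # stands in for the registered memory region
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr_id, offset, length):
        self.send_queue.append((wr_id, offset, length))

    def post_recv(self, wr_id, offset, length):
        self.recv_queue.append((wr_id, offset, length))


def hca_transfer(sender, receiver):
    """Steps 3 and 4: consume one WQE on each side, move the data, emit CQEs."""
    s_id, s_off, s_len = sender.send_queue.popleft()
    if not receiver.recv_queue:
        # A CQE is produced even on failure; only its status differs.
        sender.completion_queue.append((s_id, "error: no receive posted"))
        return
    r_id, r_off, r_len = receiver.recv_queue.popleft()
    if s_len > r_len:
        status = "error: receive buffer too small"
    else:
        receiver.memory[r_off:r_off + s_len] = sender.memory[s_off:s_off + s_len]
        status = "success"
    sender.completion_queue.append((s_id, status))
    receiver.completion_queue.append((r_id, status))


# Step 1: both systems exist with their queues and memory; A has data to send.
a, b = System("A", 64), System("B", 64)
a.memory[0:11] = b"hello, RDMA"

# Step 2: B posts a receive WQE, A posts a send WQE.
b.post_recv(wr_id=7, offset=16, length=32)
a.post_send(wr_id=1, offset=0, length=11)

# Steps 3 and 4: the (toy) HCAs move the data and report completion on both sides.
hca_transfer(a, b)
print(bytes(b.memory[16:27]), a.completion_queue[0], b.completion_queue[0])
```

Note that the receive WQE is posted before the send is processed. In this toy, a send with no matching receive completes with an error status instead of silently dropping the data, which mirrors the text's point that the status field, not the presence of a CQE, tells you whether the transfer worked.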
The transaction we just demonstrated is an RDMA SEND operation. On InfiniBand or RoCE, the total time for a relatively small buffer would be about 1.3 µs. By creating many WQEs at once, we could move millions of buffers every second.
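A quick back-of-the-envelope calculation shows why queuing many WQEs at once matters for the "millions of buffers" figure. The sketch below assumes an idealized model in which every in-flight message takes the quoted 1.3 µs and messages overlap perfectly. It ignores link bandwidth and adapter message-rate limits, so the numbers are an upper bound and not a measurement, and the function names are made up for this example.

```python
import math

LATENCY_S = 1.3e-6   # per-message time quoted in the text


def messages_per_second(in_flight: int, latency_s: float = LATENCY_S) -> float:
    """Idealized rate when `in_flight` messages overlap perfectly."""
    return in_flight / latency_s


def in_flight_needed(target_rate: float, latency_s: float = LATENCY_S) -> int:
    """Smallest number of overlapping messages that reaches target_rate."""
    return math.ceil(target_rate * latency_s)


one_at_a_time = round(messages_per_second(1))
for_one_million = in_flight_needed(1_000_000)
print(one_at_a_time, for_one_million)
```

Under these assumptions, sending one buffer at a time and waiting for each completion stays below one million messages per second, so reaching millions per second depends on keeping several WQEs outstanding.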
5. Summary
This lesson showed you how and where to get the software so you can start using the RDMA verbs API. It also introduced the queuing concept that is the basis of the RDMA programming paradigm. Finally, we showed how a buffer is moved between two systems, demonstrating an RDMA SEND operation.
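Note: The original author explains the RDMA Send operation very clearly. It is a pity that the post does not go on to cover the RDMA read and RDMA write operations.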