本文是對論文Dissecting a Small InfiniBand Application Using the Verbs API所做的中英文對照翻譯
Dissecting a Small InfiniBand Application Using the Verbs API
Gregory Kerr∗ College of Computer and Information Science Northeastern University Boston, MA kerrg@ccs.neu.edu
Abstract | 摘要
InfiniBand is a switched fabric interconnect. The InfiniBand specification does not define an API. However the OFED package, libibverbs, has become the default API on Linux and Solaris systems. Sparse documentation exists for the verbs API. The simplest InfiniBand program provided by OFED, ibv_rc_pingpong, is about 800 lines long. The semantics of using the verbs API for this program is not obvious to the first time reader. This paper will dissect the ibv_rc_pingpong program in an attempt to make clear to users how to interact with verbs. This work was motivated by an ongoing project to include direct InfiniBand support for the DMTCP checkpointing package.
InfiniBand是一種基於交換結構的網絡互連方式。InfiniBand標准並沒有定義API。然而,OFED軟件包(libibverbs)已經成為了Linux和Solaris系統默認的API。有關verbs API的文檔很稀少。OFED提供了一個最簡單的IB程序(ibv_rc_pingpong),約800行代碼。對於初次接觸verbs的讀者來說,這個程序里verbs API的用法語義並不直觀。本文將剖析ibv_rc_pingpong程序,以幫助讀者弄清楚如何與verbs打交道。之所以寫作本文,是因為一個正在進行的項目,該項目將在DMTCP檢查點軟件包中包含對IB的直接支持。
1 Introduction | 概述
The program ibv_rc_pingpong can be found at openfabrics.org, under the "examples/" directory of the OFED tarball. The source code used for this document is from version 1.1.4. The ibv_rc_pingpong program sets up a connection between two nodes running InfiniBand adapters and transfers data. Let's begin by looking at the program in action. In this paper, I will refer to two nodes: client and server. There are various command line flags that may be set when running the program. It is important to note that the information contained within this document is based on the assumption that the program has been run with no command line flags configured. Configuring these flags will alter much of the program's behavior.
程序ibv_rc_pingpong可以在openfabrics.org上找到,位於OFED壓縮包的"examples/"目錄下面。本文使用的源代碼版本是1.1.4。程序ibv_rc_pingpong在兩個具有IB適配器的節點之間建立連接並進行數據傳輸。讓我們先來看看程序的實際運行情況。在本文中,我將提及兩個結點:client和server。當運行程序的時候,可以設置各種命令行標志。值得注意的是,本文中包含的信息基於一個假定,即該程序在運行時沒有配置命令行標志。如果配置這些標志的話,將改變程序的許多行為。
Since both nodes run the same executable, the "client" is the instance that is launched with a hostname as an argument. The LID, QPN, and PSN will be explained later.
由於兩個節點都運行相同的可執行文件,"client"是以主機名作為參數啟動的那個實例。有關LID、QPN和PSN將稍后做解釋。
[user@server]$ ibv_rc_pingpong
  local address:  LID 0x0008, QPN 0x580048, PSN 0x2a166f, GID ::
  remote address: LID 0x0003, QPN 0x580048, PSN 0x5c3f21, GID ::
8192000 bytes in 0.01 seconds = 5167.64 Mbit/sec
1000 iters in 0.01 seconds = 12.68 usec/iter

[user@client]$ ibv_rc_pingpong server
  local address:  LID 0x0003, QPN 0x580048, PSN 0x5c3f21, GID ::
  remote address: LID 0x0008, QPN 0x580048, PSN 0x2a166f, GID ::
8192000 bytes in 0.01 seconds = 5217.83 Mbit/sec
1000 iters in 0.01 seconds = 12.56 usec/iter
Before we delve into the actual code, please look at a list of all verbs API functions which will be used for our purposes. I encourage the reader to pause and read the man page for each of these.
在深入研究實際代碼之前,請看一下我們將用到的verbs API列表。讀者朋友不妨暫停一下,先閱讀一下這些API的man頁面。
1  ibv_get_device_list(3)
2  ibv_open_device(3)
3  ibv_alloc_pd(3)
4  ibv_reg_mr(3)
5  ibv_create_cq(3)
6  ibv_create_qp(3)
7  ibv_modify_qp(3)
8  ibv_post_recv(3)
9  ibv_post_send(3)
10 ibv_poll_cq(3)
11 ibv_ack_cq_events(3)
2 Layers | 分層
There are multiple drivers, existing in kernel and userspace, involved in a connection. See Figure 2a. To explain it simply, much of the connection setup work goes through the kernel driver, as speed is not a critical concern in that area.
在內核和用戶空間,有多個驅動參與連接。請參見圖2a。簡單地說,大部分連接安裝工作都是通過內核驅動完成,因為速度對建立連接來說並不關鍵。
The user space drivers are involved in function calls such as ibv_post_send and ibv_post_recv. Instead of going through kernel space, they interact directly with the hardware by writing to a segment of mapped memory. Avoiding kernel traps is one way to decrease the overall latency of each operation.
用戶空間設備驅動參與到函數調用之中,例如ibv_post_send和ibv_post_recv。它們不經過內核空間,而是通過對內存映射段執行寫操作來直接與硬件打交道。避免陷入內核是降低單個操作的總時延的一種方法。
3 Remote Direct Memory Access | RDMA(遠程直接內存訪問)
One of the key concepts in InfiniBand is Remote Direct Memory Access (RDMA). This allows a node to directly access the memory of another node on the subnet, without involving the remote CPU or software layers.
IB的核心概念之一就是RDMA(遠程直接內存訪問)。這允許一個結點直接訪問子網內的另一個結點的系統內存,而不需要遠端CPU或者軟件層的干預。
Remember the key concepts of Direct Memory Access (DMA) as illustrated by Figure 2b.
圖2b演示了DMA(直接內存訪問)的核心概念。
In the DMA, the CPU sends a command to the hardware to begin a DMA operation. When the operation finishes, the DMA hardware raises an interrupt with the CPU, signaling completion. The RDMA concept used in InfiniBand is similar to DMA, except with two nodes accessing each other's memory; one node is the sender and one is the receiver.
在DMA中,CPU給硬件發送一個開始DMA操作的命令。當操作完成后,DMA硬件引發一個中斷告訴CPU,DMA操作已經完成。InfiniBand中使用的RDMA概念與DMA類似,只不過是兩個節點相互訪問對方的系統內存:一個結點為sender(發送方),另一個結點為receiver(接收方)。
Figure 3 illustrates an InfiniBand connection. In this case the DMA Hardware is the Host Channel Adapter (HCA), and the two HCAs are connected, through a switch, to each other. The HCA is InfiniBand's version of a network card; it is the hardware local to each node that facilitates communications. This allows an HCA on one node to use another HCA to perform DMA operations on a remote node.
圖3演示了IB連接。在這個示例中,DMA硬件是HCA(主機通道適配器)。兩個結點各有一個HCA,通過交換機互聯。HCA就是InfiniBand的網卡。它是每個結點用來通信的硬件。這允許一個節點上的HCA使用另一個結點上的HCA來在遠程節點上執行DMA操作。
4 Overview | 概要
The ibv_rc_pingpong program does the following.
程序ibv_rc_pingpong做了如下6件事情。
- Reserves memory from the operating system for sending and receiving data 在操作系統中申請內存,為發送和接收數據做准備
- Allocates resources from the verbs API 從verbs API中申請資源
- Uses a TCP socket to exchange InfiniBand connection information 使用TCP socket交換IB連接信息
- Creates a connection between two InfiniBand ports 在兩個IB端口中創建一個連接
- Transfers data over the connection 在連接上傳輸數據
- Acknowledges the successful completion of the transfer 在傳輸成功完成后給一個ACK
5 Data Transfer Modes | 數據傳輸模型
The InfiniBand specification states four different connection types: Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD). This program, ibv_rc_pingpong uses a simple RC model. RD is not supported by current hardware.
InfiniBand技術規范定義了四種不同的連接類型:可靠連接(RC)、不可靠連接(UC)、可靠數據報(RD)和不可靠數據報(UD)。ibv_rc_pingpong程序使用的是簡單的可靠連接(RC)。目前硬件不支持可靠數據報(RD)。
The difference between reliable and unreliable is predictable -- in a reliable connection data is transferred in order and guaranteed to arrive. In an unreliable connection neither of those guarantees is made.
可靠與不可靠的區別不難預料 -- 在可靠連接中,數據按順序傳輸並且保證能夠到達。在不可靠連接中,這兩項都不能保證。
A connection type is an association strictly between two hosts. In a datagram, a host is free to communicate with any other host on the subnet.
連接類型是嚴格限定在兩台主機之間的關聯。在數據報模式下,主機可以自由地與子網上的任何其他主機進行通信。
6 Queue Based Model | 基於隊列的模型
The InfiniBand hardware processes requests from the client software through work requests, which are placed into queues. To send messages between nodes, each node must have at minimum three queues: a Send Queue (SQ), Receive Queue (RQ), and Completion Queue (CQ).
IB硬件處理來自client的請求是通過把請求放置到隊列上。在兩個結點之間發送數據,每一個結點至少包含3個隊列:發送隊列(SQ)、接收隊列(RQ)和完成隊列(CQ)。
In a reliable connection, used in the ibv_rc_pingpong program, queue pairs on two distinct hosts comprise an end-to-end context. They send messages to each other, and only each other. This paper restricts itself to this mode.
在ibv_rc_pingpong程序所使用的可靠連接中,兩台不同主機上的隊列對(QP)構成一個端到端的上下文。它們互相傳遞消息,而且只在彼此之間傳遞。本文的討論僅限於這一模式。
The queues themselves exist on the HCA. However the libibverbs will return to the user a data structure which corresponds with the QP. While the library will create the QP, the user assumes the responsibility of "connecting" the QP with the remote node. This is generally done by opening an out-of-band socket connection, trading the identification numbers for the queues, and then updating the hardware with the information.
隊列本身存在於HCA硬件上。不過,libibverbs會返回一個與QP相對應的數據結構給用戶。QP的創建由libibverbs函數庫負責,但把QP與遠端結點"連接"起來則由用戶自己負責。通常的做法是打開一個帶外socket連接,交換各個隊列的標識符,然后把這些信息更新到硬件上。
More recently, librdma_cm (an OFED library for connection management) allows a user to create and connect QPs through library calls reminiscent of POSIX sockets. Those calls are outside the scope of this document.
最近,librdma_cm(OFED針對連接管理發布的函數庫)允許用戶通過類似POSIX socket的庫函數調用來創建和連接QP。這些調用不在本文的討論范圍之內。
6.1 Posting Work Requests to Queues | 給隊列里放置工作請求
To send and receive data in the InfiniBand connection (end-to-end context), work requests, which become Work Queue Entries (WQE, pronounced "wookie") are posted to the appropriate queue. These work requests point to lists of scatter/gather elements (each element has an address and size associated with it). This is a means of writing to and reading from buffers which are non-contiguous in memory.
在InfiniBand連接(即是端到端的上下文)中發送和接收數據,工作請求(即工作隊列元素,WQE,發音為wookie)被放置到相應的隊列中。這些工作請求指向一個分散/聚合元素(SGE)的列表(每個元素都有一個虛擬內存地址和與之相關聯的緩沖區大小)。這是一種在非連續的內存緩沖區中寫入和讀取數據的方法。
The memory buffers must be registered with the hardware; that process is explained later. Memory buffers must be posted to the receive queue before the remote host can post any sends. The ibv_rc_pingpong program posts numerous buffers to the receive queue at the beginning of execution, and then repopulates the queue as necessary. A receive queue entry is processed when the remote host posts a send operation.
內存緩沖區必須在硬件中注冊,注冊過程稍后予以解釋。在遠程主機放置任何發送請求之前,內存緩沖區必須先放置到接收隊列上。ibv_rc_pingpong程序在開始執行的時候放置大量的緩沖區到接收隊列,然后在必要的時候重新填充隊列。當遠程主機放置一個發送操作的時候,一個接收隊列條目將被處理。
When the hardware processes the work request, a Completion Queue Entry (CQE, pronounced "cookie") is placed on the CQ. There is a sample of code showing how to handle completion events in ibv_ack_cq_events(3).
當硬件處理一個工作請求(WR)的時候,一個完成隊列條目(CQE,發音為cookie)被放置到完成隊列(CQ)中。在ibv_ack_cq_events(3)中,有樣本代碼演示如何處理完成事件。
7 Connecting the Calls | 連接使用的函數調用
The table below lists the function calls used in ibv_rc_pingpong to create a connection, and the order in which they are called.
在下表中,列出了ibv_rc_pingpong創建一個連接使用的函數以及調用順序。
This table introduces the resources which are allocated in the process of creating a connection. These resources will be explained in detail later.
該表格同時介紹了在創建連接過程中分配的資源,這些資源稍后將做詳細解釋。
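As a rough sketch of the allocation half of that sequence (the names and the reduced error handling here are illustrative, not ibv_rc_pingpong's actual code), the calls can be strung together as follows; the state transitions, posting, polling, and acknowledgements that complete the connection are covered in the sections below.
下面用一段粗略的示意代碼串起這一調用順序中的資源分配部分(其中的變量名和簡化的錯誤處理僅作說明之用,並非ibv_rc_pingpong的真實代碼);完成連接所需的狀態遷移、放置請求、輪詢和確認將在后面的章節中介紹。

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0, size = 4096, rx_depth = 500;

    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "No InfiniBand devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);      /* device context    */
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);                        /* protection domain */
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    void *buf = malloc(size);                                     /* data buffer       */
    if (!buf) { fprintf(stderr, "malloc failed\n"); return 1; }

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,                 /* pin + register    */
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr) { fprintf(stderr, "ibv_reg_mr failed\n"); return 1; }

    struct ibv_cq *cq = ibv_create_cq(ctx, rx_depth + 1, NULL, NULL, 0);
    if (!cq) { fprintf(stderr, "ibv_create_cq failed\n"); return 1; }

    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr  = 1, .max_recv_wr  = rx_depth,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);              /* queue pair        */
    if (!qp) { fprintf(stderr, "ibv_create_qp failed\n"); return 1; }

    /* Next steps (sections below): move the QP through INIT/RTR/RTS, post receives,
     * exchange LID/QPN/PSN over TCP, post sends, poll the CQ and ack CQ events.
     * Resources are released in reverse order when the program ends. */
    return 0;
}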
8 Allocating Resources | 分配資源
8.1 Creating a Context | 創建上下文
The first function call to the verbs API made by the ibv_rc_pingpong source code is here:
619 dev_list = ibv_get_device_list(NULL);
As the man page states, this function returns a list of available HCAs.
ibv_rc_pingpong源代碼調用的第一個verbs API就是ibv_get_device_list(), 該函數在手冊里有明確說明,返回一個可用的HCA列表。
The argument to the function is an optional pointer to an int, which the library uses to specify the size of the list.
該函數的參數是一個可選的指向int的指針,庫用它來給出列表的長度。
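A minimal, hypothetical sketch of that call: pass the address of an int to receive the count, print each device name, and free the list afterwards.
下面是對該調用的一個最小化的假設性示意:傳入一個int的地址來接收設備個數,打印每個設備的名字,最后釋放列表。

#include <stdio.h>
#include <infiniband/verbs.h>

/* Hypothetical helper: the optional int pointer receives the number of devices,
 * so the list can be walked without looking for its NULL terminator. */
static void list_devices(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list) {
        perror("ibv_get_device_list");
        return;
    }
    for (int i = 0; i < num_devices; ++i)
        printf("HCA %d: %s\n", i, ibv_get_device_name(dev_list[i]));
    ibv_free_device_list(dev_list);
}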
Next it populates the pingpong_context structure with the function pp_init_ctx.
接下來,程序通過函數pp_init_ctx填充pingpong_context結構體。
The pingpong_context structure wraps all the resources associated with a connection into one unit.
pingpong_context結構體包裝了與連接相關聯的所有分配的資源。
/* Listing 1: struct pingpong context */
 59 struct pingpong_context {
 60     struct ibv_context      *context;
 61     struct ibv_comp_channel *channel;
 62     struct ibv_pd           *pd;
 63     struct ibv_mr           *mr;
 64     struct ibv_cq           *cq;
 65     struct ibv_qp           *qp;
 66     void                    *buf;
 67     int                      size;
 68     int                      rx_depth;
 69     int                      pending;
 70     struct ibv_port_attr     portinfo;
 71 };
/* Listing 2: Initializing the struct pingpong context */
643     ctx = pp_init_ctx(ib_dev, size, rx_depth, ib_port, use_event, !servername);
The ib_dev argument is a struct ibv_device * and comes from dev_list. The argument size specifies the size of the message to be sent (4096 bytes by default), rx_depth sets the number of receives to post at a time, ib_port is the port of the HCA and use_event specifies whether to sleep on CQ events or poll for them.
參數ib_dev來自於dev_list, 類型為struct ibv_device *。參數size指定發送的消息長度(默認4096字節),rx_depth設置一次放置到接收隊列的接收請求的個數(即接收隊列深度),ib_port是HCA的port,use_event指定是在CQ事件上睡眠還是進行輪詢。
The function pp_init_ctx first allocates a buffer of memory, which will be used to send and receive data. Note that the buffer is memalign-ed to a page, since it is pinned (see section 8.3 for a definition of pinning).
函數pp_init_ctx首先分配一塊內存緩沖區,用於發送和接收數據。注意該緩沖區通過memalign按頁對齊,因為它需要被鎖定(pin)(有關pinning的定義,請參見8.3節)。
/* Listing 3: Allocating a Buffer */
320     ctx->buf = memalign(page_size, size);
321     if (!ctx->buf) {
322         fprintf(stderr, "Couldn't allocate work buf.\n");
323         return NULL;
324     }
325
326     memset(ctx->buf, 0x7b + is_server, size);
Next the ibv_context pointer is populated with a call to ibv_open_device. The ibv_context is a structure which encapsulates information about the device opened for the connection.
接下來,通過調用ibv_open_device()來填充ibv_context指針。ibv_context是一個結構體,封裝了為建立連接而打開的設備信息。
/* Listing 4: Opening a Context */
328     ctx->context = ibv_open_device(ib_dev);
329     if (!ctx->context) {
330         fprintf(stderr, "Couldn't get context for %s\n",
331                 ibv_get_device_name(ib_dev));
332         return NULL;
333     }
From the "infiniband/verbs.h" header, the struct ibv_context is as follows:
結構體ibv_context來自頭文件"infiniband/verbs.h",如下所示:
/* Listing 5: struct ibv_context */
766 struct ibv_context {
767     struct ibv_device      *device;
768     struct ibv_context_ops  ops;
769     int                     cmd_fd;
770     int                     async_fd;
771     int                     num_comp_vectors;
772     pthread_mutex_t         mutex;
773     void                   *abi_compat;
774     struct ibv_more_ops    *more_ops;
775 };
The struct ibv_device * is a pointer to the device opened for this connection. The struct ibv_context_ops ops field contains function pointers to driver specific functions, which the user need not access directly.
結構ibv_device *是一個指向為建立連接而打開的設備的指針。結構ibv_context_ops ops域包含了一系列函數指針,它們指向驅動程序的特定函數,用戶不需要直接訪問這些函數。
8.2 Protection Domain | 保護域
After the device is opened and the context is created, the program allocates a protection domain.
在設備被打開和上下文創建完畢之后,程序分配一個保護域(PD)。
/* Listing 6: Opening a Protection Domain */
344     ctx->pd = ibv_alloc_pd(ctx->context);
345     if (!ctx->pd) {
346         fprintf(stderr, "Couldn't allocate PD\n");
347         return NULL;
348     }
A protection domain, according to the InfiniBand specification, allows the client to control which remote computers can access its memory regions during InfiniBand sends and receives.
根據IB的技術規范,保護域(PD)允許client控制在IB發送和接收過程中,哪些遠程計算機可以訪問它的內存區域。
The protection domain mostly exists on the hardware itself. Its user-space data structure is sparse:
保護域(PD)主要存在於硬件本身。其用戶空間數據結構實為稀疏:
/* Listing 7: struct ibv_pd from "infiniband/verbs.h" */
308 struct ibv_pd {
309     struct ibv_context *context;
310     uint32_t            handle;
311 };
8.3 Memory Region | 內存區域
The ibv_rc_pingpong program next registers one memory region with the hardware.
接下來,ibv_rc_pingpong程序向硬件注冊一塊內存區域(MR)。
When the memory region is registered, two things happen. The memory is pinned by the kernel, which prevents the physical address from being swapped to disk. On Linux operating systems, a call to mlock is used to perform this operation. In addition, a translation of the virtual address to the physical address is given to the HCA.
當內存區域(MR)注冊了,有兩件事情將發生。(首先,) 內存被內核鎖定,防止物理地址(內存里存放的數據)被交換到硬盤上。在Linux操作系統中,使用mlock調用來執行這一操作。其次,(HCA驅動)將虛擬內存地址轉換為物理內存地址,然后將這一對應關系交給HCA硬件去使用。
/* Listing 8: Registering a Memory Region */
350     ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size, IBV_ACCESS_LOCAL_WRITE);
351     if (!ctx->mr) {
352         fprintf(stderr, "Couldn't register MR\n");
353         return NULL;
354     }
The arguments are the protection domain with which to associate the memory region, the address of the region itself, the size, and the flags. The options for the flags are defined in "infiniband/verbs.h".
參數是與MR相關聯的PD, MR本身的地址,大小和標志。標志選項的定義在文件"infiniband/verbs.h"里。
/* Listing 9: Access Flags */
300 enum ibv_access_flags {
301     IBV_ACCESS_LOCAL_WRITE   = 1,
302     IBV_ACCESS_REMOTE_WRITE  = (1<<1),
303     IBV_ACCESS_REMOTE_READ   = (1<<2),
304     IBV_ACCESS_REMOTE_ATOMIC = (1<<3),
305     IBV_ACCESS_MW_BIND       = (1<<4)
306 };
When the memory registration is complete, an lkey field or Local Key is created. According to the InfiniBand Technical Specification the lkey is used to identify the appropriate memory addresses and provide authorization to access them.
當內存注冊(MR)完成的時候,lkey(或Local Key)字段就創建好了。根據IB的技術規范,lkey用來確定合適的內存地址和提供訪問授權。
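Since registration pins the buffer, it is subject to the kernel's locked-memory limit; a failing ibv_reg_mr is often caused by a low RLIMIT_MEMLOCK (ulimit -l). The small, hypothetical helper below only prints that limit to make such failures easier to diagnose.
由於注冊會鎖定緩沖區,它要受內核鎖定內存上限的約束;ibv_reg_mr失敗常常是因為RLIMIT_MEMLOCK(ulimit -l)太小。下面這個小的假設性輔助函數只是把該上限打印出來,便於診斷這類失敗。

#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical diagnostic: print the locked-memory limit that pinning is charged
 * against. If it is smaller than the buffers being registered, ibv_reg_mr can fail. */
static void print_memlock_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        printf("locked-memory limit: soft=%llu hard=%llu bytes\n",
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);
}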
8.4 Completion Queue | 完成隊列
The next part of the connection is the completion queue (CQ), where work completion queue entries are posted. Please note that you must create the CQ before the QP. As stated previously, ibv_ack_cq_events(3) has helpful examples of how to manage completion events.
建立連接的下一步是創建完成隊列(CQ),工作完成條目(CQE)會被放置到完成隊列(CQ)上。請注意,必須在創建QP之前創建CQ。如前所述,如何管理完成事件,可以參考ibv_ack_cq_events(3)中的例子。
/* Listing 10: Creating a CQ */
356     ctx->cq = ibv_create_cq(ctx->context, rx_depth + 1, NULL,
357                             ctx->channel, 0);
358     if (!ctx->cq) {
359         fprintf(stderr, "Couldn't create CQ\n");
360         return NULL;
361     }
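When the program is run in its event-driven mode (use_event), the channel passed to ibv_create_cq lets it sleep instead of poll. A hedged sketch of that path, assuming ctx->channel was created earlier with ibv_create_comp_channel(ctx->context):
當程序以事件驅動模式(use_event)運行時,傳給ibv_create_cq的channel可以讓它睡眠等待而不是輪詢。下面是這條路徑的示意(假定ctx->channel此前已通過ibv_create_comp_channel(ctx->context)創建):

/* Hypothetical sketch of the sleep-on-events path. */
struct ibv_cq *ev_cq;
void *ev_ctx;

/* Arm the CQ: request a notification for the next completion. */
if (ibv_req_notify_cq(ctx->cq, 0))
    return 1;

/* Block until the channel delivers an event, then ack it and re-arm the CQ. */
if (ibv_get_cq_event(ctx->channel, &ev_cq, &ev_ctx))
    return 1;
ibv_ack_cq_events(ev_cq, 1);
if (ibv_req_notify_cq(ev_cq, 0))
    return 1;

/* Completions are still harvested with ibv_poll_cq after waking up. */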
8.5 Queue Pairs | 隊列對
Communication in InfiniBand is based on the concept of queue pairs. Each queue pair contains a send queue and a receive queue, and must be associated with at least one completion queue. The queues themselves exist on the HCA. A data structure containing a reference to the hardware queue pair resources is returned to the user.
在IB中,通信是基於隊列對(QP)的概念而實現。每一個隊列對(QP)包含有一個發送隊列(SQ)和接收隊列(RQ),並且必須與至少一個完成隊列(CQ)相關聯(換言之,一個QP中的SQ和RQ可以關聯到同一個CQ上)。這些隊列(SQ, RQ和CQ)都存在於HCA硬件上。一個包含有對硬件QP資源的引用的數據結構,被返回給用戶使用。
First, look at the code to create a QP.
首先,讓我們看看創建一個QP的源代碼。
/* Listing 11: Creating a QP */
364     struct ibv_qp_init_attr attr = {
365         .send_cq = ctx->cq,
366         .recv_cq = ctx->cq,
367         .cap     = {
368             .max_send_wr  = 1,
369             .max_recv_wr  = rx_depth,
370             .max_send_sge = 1,
371             .max_recv_sge = 1
372         },
373         .qp_type = IBV_QPT_RC
374     };
375
376     ctx->qp = ibv_create_qp(ctx->pd, &attr);
377     if (!ctx->qp) {
378         fprintf(stderr, "Couldn't create QP\n");
379         return NULL;
380     }
Notice that a data structure which defines the initial attributes of the QP must be given as an argument. There are a few other elements in the data structure, which are optional to define.
注意,定義QP初始屬性的數據結構必須作為一個參數傳遞。在這個數據結構中,有一些元素是可選的。
The first two elements, send_cq and recv_cq, associate the QP with a CQ as stated earlier. The send and receive queue may be associated with the same completion queue.
前兩個元素是send_cq和recv_cq,(如前面所說)是與QP相關聯的完成隊列(CQ)。發送隊列和接收隊列可能關聯到同一個完成隊列上。
The cap field points to a struct ibv_qp_cap and specifies how many send and receive work requests the queues can hold. The max_{send, recv}_sge field specifies the maximum number of scatter/gather elements that each work request will be able to hold. A scatter gather element is used in a direct memory access (DMA) operation, and each SGE points to a buffer in memory to be used in the read or write. In this case, the attributes state that only one buffer may be pointed to at any given time.
字段cap指向一個結構體struct ibv_qp_cap, 指定隊列可以容納的發送和接收工作請求的個數。max_{send, recv}_sge字段指定了每一個工作請求能夠容納的最大SGE數目。SGE用於直接內存訪問(DMA)操作,每一個SGE指向一個用於讀/寫的內存緩沖區。在本例中,這些屬性說明在任何給定時間每個工作請求只能指向一個緩沖區。
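To make the scatter/gather idea concrete, here is a hypothetical SGE array describing two non-contiguous buffers; it assumes a QP created with max_recv_sge of at least 2 (unlike the pingpong QP above) and that buf1/buf2 lie inside registered memory regions mr1/mr2.
為了把分散/聚合的概念說得更具體,下面是一個假設性的SGE數組,描述兩個不連續的緩沖區;這里假定QP創建時max_recv_sge至少為2(與上面pingpong的QP不同),並且buf1/buf2位於已注冊的內存區域mr1/mr2之內。

/* Hypothetical sketch: one receive work request scattering incoming data across
 * two separate buffers. buf1/len1/mr1 and buf2/len2/mr2 are assumed to exist. */
struct ibv_sge sge[2] = {
    { .addr = (uintptr_t) buf1, .length = len1, .lkey = mr1->lkey },
    { .addr = (uintptr_t) buf2, .length = len2, .lkey = mr2->lkey },
};
struct ibv_recv_wr wr = {
    .wr_id   = 1,
    .sg_list = sge,
    .num_sge = 2,     /* two scatter/gather elements in this single request */
};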
The qp_type field specifies what type of connection is to be used, in this case a reliable connection.
qp_type字段指定了使用的連接類型,在這里是可靠連接(RC)。
Now the queue pair has been created. It must be moved into the initialized state, which involves a library call. In the initialized state, the QP will silently drop any incoming packets and no work requests can be posted to the send queue.
現在QP已經創建好了。必須通過庫函數調用將它的狀態設置為初始化狀態。在初始化狀態下,任何傳入的數據包將被QP悄悄地丟棄,並且工作請求不能夠被放置到發送隊列上。
/* Listing 12: Setting QP to INIT */
384     struct ibv_qp_attr attr = {
385         .qp_state        = IBV_QPS_INIT,
386         .pkey_index      = 0,
387         .port_num        = port,
388         .qp_access_flags = 0
389     };
390
391     if (ibv_modify_qp(ctx->qp, &attr,
392                       IBV_QP_STATE      |
393                       IBV_QP_PKEY_INDEX |
394                       IBV_QP_PORT       |
395                       IBV_QP_ACCESS_FLAGS)) {
396         fprintf(stderr, "Failed to modify QP to INIT\n");
397         return NULL;
398     }
The third argument to ibv_modify_qp is a bitmask stating which options should be configured. The flags are specified in enum ibv_qp_attr_mask in "infiniband/verbs.h".
ibv_modify_qp()的第3個參數是一個位掩碼,說明應該配置的選項。flags在頭文件"infiniband/verbs.h"的枚舉體ibv_qp_attr_mask中定義。
At this point the ibv_rc_pingpong program posts a receive work request to the QP.
在這里,ibv_rc_pingpong程序放置一個接收工作請求到QP上。
650 routs = pp_post_recv(ctx, ctx->rx_depth);
Look at the definition of pp_post_recv.
看一下pp_post_recv的定義。
/* Listing 13: Posting Recv Requests */
444 static int pp_post_recv(struct pingpong_context *ctx, int n)
445 {
446     struct ibv_sge list = {
447         .addr   = (uintptr_t) ctx->buf,
448         .length = ctx->size,
449         .lkey   = ctx->mr->lkey
450     };
451     struct ibv_recv_wr wr = {
452         .wr_id   = PINGPONG_RECV_WRID,
453         .sg_list = &list,
454         .num_sge = 1,
455     };
456     struct ibv_recv_wr *bad_wr;
457     int i;
458
459     for (i = 0; i < n; ++i)
460         if (ibv_post_recv(ctx->qp, &wr, &bad_wr))
461             break;
462
463     return i;
464 }
The ibv_sge list is the list pointing to the scatter/gather elements (in this case, a list of size 1). To review, the SGE is a pointer to a memory region which the HCA can read from or write to.
ibv_sge列表是指向SGE數組的列表(在這里,列表長度為1)。SGE指向一個內存區域,該區域能被HCA讀寫。
Next is the ibv_recv_wr structure. The first field, wr_id, is a field set by the program to identify the work request. This is needed when checking the completion queue elements; it specifies which work request completed.
下一個結構體是ibv_recv_wr。第一個字段wr_id由應用程序設置,以標識對應的WR。在檢查完成隊列元素時需要wr_id,它指定了哪一個WR已經完成了。
The work request given to ibv_post_recv is actually a linked list, of length 1.
傳給ibv_post_recv()的WR實際上是一個長度為1的鏈表。
/* Listing 14: Linked List */
451     struct ibv_recv_wr wr = {
452         .wr_id   = PINGPONG_RECV_WRID,
453         .sg_list = &list,
454         .num_sge = 1,
455     };
If one of the work requests fails, the library will set the bad_wr pointer to the failed wr in the linked list.
如果一個WR執行失敗了,那么庫函數就將bad_wr指向在此鏈表中失敗的那個wr。
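A hedged sketch of how a caller might use bad_wr; wr here stands for the head of a chain of receive work requests, as in the listing above.
下面示意調用者可以如何使用bad_wr;這里的wr表示一條接收工作請求鏈表的表頭,與上面的代碼清單相同。

/* Hypothetical sketch: identify which work request in a chain failed to post. */
struct ibv_recv_wr *bad_wr = NULL;
int rc = ibv_post_recv(ctx->qp, &wr, &bad_wr);
if (rc) {
    /* Everything from bad_wr onward was not posted. */
    fprintf(stderr, "ibv_post_recv failed (%d) at wr_id %llu\n",
            rc, (unsigned long long) bad_wr->wr_id);
}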
Receive buffers must be posted before any sends. It is common practice to loop over the ibv_post_recv call to post numerous buffers at the beginning of execution. Eventually these buffers will be used up; internal flow control must be implemented by the applications to ensure that sends are not posted without corresponding receives.
接收緩沖區必須在任何發送操作之前放置到接收隊列上。通常的做法是在執行開始的時候循環調用ibv_post_recv(),將多個緩沖區放置到接收隊列上。這些緩沖區最終會被全部消耗掉;應用程序必須實現內部的流量控制,以確保不會在沒有對應接收的情況下放置發送請求。
8.6 Connecting | 連接
The next step occurs in pp_client_exch_dest and pp_server_exch_dest. The QPs need to be configured to point to a matching QP on a remote node. However, the QPs currently have no means of locating each other. The processes open an out-of-band TCP socket and transmit the needed information. That information, once manually communicated, is given to the driver and then each side's QP is configured to point at the other. (The OFED librdma_cm library is an alternative to explicit out-of-band TCP.)
下一步發生在pp_client_exch_dest和pp_server_exch_dest中。QP需要配置為指向遠端結點上與之匹配的QP。然而,目前QP沒有定位對方的方法。進程打開帶外TCP socket並傳輸所需要的信息。這些信息一經交換,就被交給驅動,然后每一方的QP就被配置為指向另一方的QP。(OFED的librdma_cm庫可以用來替代顯式的帶外TCP連接。)
So what information needs to be exchanged/configured? Mainly the LID, QPN, and PSN. The LID is the "Local Identifier" and it is a unique number given to each port when it becomes active. The QPN is the Queue Pair Number, and it is the identifier assigned to each queue on the HCA. This is used to specify to what queue messages should be sent. Finally, the destinations must share their PSNs.
那么,哪些信息需要交換和配置? 主要是LID、QPN和PSN。LID是本地標識符(Local Identifier)的縮寫,當一個port變成活躍狀態的時候,就會被分配一個獨一無二的數字。QPN是QP Number的縮寫,是在HCA上分配給每一個隊列的標識符,用來指定消息應該發送到哪個隊列上去。最后,雙方還必須共享它們的PSN。
The PSN stands for Packet Sequence Number. In a reliable connection it is used by the HCA to verify that packets are coming in order and that packets are not missing. The initial PSN, for the first packet, must be specified by the user code. If it is too similar to a recently used PSN, the hardware will assume that the incoming packets are stale packets from an old connection and reject them.
PSN代表的是包序列號(Packet Sequence Number)。在可靠連接中,HCA用PSN來驗證數據包是按序到達的而且沒有丟包。第一個數據包的初始PSN必須由用戶代碼指定。如果它與最近使用過的某個PSN太接近,硬件就會認為傳入的數據包是來自舊連接的過期數據包,從而予以拒絕。
The GID, seen in the code sample below, is a 128-bit unicast or multicast identifier used to identify an endport. The link layer specifies which interconnect the software is running on; there are other interconnects that OFED supports, though that is not within the scope of this paper.
在下面的代碼示例中,GID是一個128位的單播或者多播的標識符,用來標識一個終端端口。鏈路層指定了軟件在哪種互聯協議上運行。OFED還支持除IB之外的其他互聯協議,那些協議不在本論文的討論范圍之內。
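Putting those pieces together, the out-of-band exchange only needs to move a handful of fields. The structure and helper below are hypothetical (ibv_rc_pingpong itself formats roughly the same values as text before sending them) and assume sockfd is an already-connected TCP socket.
把這些內容放在一起看,帶外交換只需要傳遞少數幾個字段。下面的結構體和輔助函數是假設性的(ibv_rc_pingpong本身大致是把同樣的值格式化成文本后再發送),並假定sockfd是一個已經建立連接的TCP socket。

#include <stdint.h>
#include <unistd.h>
#include <infiniband/verbs.h>

/* Hypothetical wire format for the out-of-band exchange. */
struct conn_info {
    uint16_t      lid;   /* local identifier of the port      */
    uint32_t      qpn;   /* queue pair number                 */
    uint32_t      psn;   /* initial packet sequence number    */
    union ibv_gid gid;   /* zero when routing stays on-subnet */
};

/* Send our identifiers and receive the peer's over the TCP socket. A real
 * implementation should also agree on byte order and handle short reads/writes. */
static int exchange_conn_info(int sockfd, const struct conn_info *local,
                              struct conn_info *remote)
{
    if (write(sockfd, local, sizeof *local) != (ssize_t) sizeof *local)
        return -1;
    if (read(sockfd, remote, sizeof *remote) != (ssize_t) sizeof *remote)
        return -1;
    return 0;
}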
Within pp_connect_ctx the information, once transmitted, is used to connect the QPs into an end-to-end context.
在pp_connect_ctx函數中,這些信息一旦傳輸完成,就被用來把兩端的QP連接成一個端到端的上下文。
/* Listing 15: Setting Up Destination Information */
665     my_dest.lid = ctx->portinfo.lid;
666     if (ctx->portinfo.link_layer == IBV_LINK_LAYER_INFINIBAND && !my_dest.lid) {
667         fprintf(stderr, "Couldn't get local LID\n");
668         return 1;
669     }
670
671     if (gidx >= 0) {
672         if (ibv_query_gid(ctx->context, ib_port, gidx, &my_dest.gid)) {
673             fprintf(stderr, "Could not get local gid for gid index %d\n", gidx);
674             return 1;
675         }
676     } else
677         memset(&my_dest.gid, 0, sizeof my_dest.gid);
678
679     my_dest.qpn = ctx->qp->qp_num;
680     my_dest.psn = lrand48() & 0xffffff;
The my_dest data structure is filled and then transmitted via TCP. Figure 4 illustrates this data transfer.
數據結構my_dest被填充,然后通過TCP傳送。圖4顯示了這一數據傳送。
8.6.1 Modifying QPs | 修改QP
Look at the attributes given to the ibv_modify_qp call.
讓我們看一看傳遞給ibv_modify_qp()調用的屬性。
/* Listing 16: Moving QP to Ready to Recv */
 84     struct ibv_qp_attr attr = {
 85         .qp_state           = IBV_QPS_RTR,
 86         .path_mtu           = mtu,
 87         .dest_qp_num        = dest->qpn,
 88         .rq_psn             = dest->psn,
 89         .max_dest_rd_atomic = 1,
 90         .min_rnr_timer      = 12,
 91         .ah_attr            = {
 92             .is_global     = 0,
 93             .dlid          = dest->lid,
 94             .sl            = sl,
 95             .src_path_bits = 0,
 96             .port_num      = port
 97         }
 98     };
...
106     if (ibv_modify_qp(ctx->qp, &attr,
107                       IBV_QP_STATE              |
108                       IBV_QP_AV                 |
109                       IBV_QP_PATH_MTU           |
110                       IBV_QP_DEST_QPN           |
111                       IBV_QP_RQ_PSN             |
112                       IBV_QP_MAX_DEST_RD_ATOMIC |
113                       IBV_QP_MIN_RNR_TIMER)) {
114         fprintf(stderr, "Failed to modify QP to RTR\n");
115         return 1;
116     }
As you can see .qp_state is set to IBV_QPS_RTR, or Ready-To-Receive. The three fields swapped over TCP, the PSN, QPN, and LID, are now given to the hardware. With this information, the QPs are registered with each other by the hardware, but are not ready to begin exchanging messages. The min_rnr_timer is the time, in seconds, between retries before a timeout occurs.
正如你所看到的,.qp_state被設置為IBV_QPS_RTR,即Ready-To-Receive(接收就緒)。通過TCP交換得到的三個字段(PSN, QPN和LID)現在被交給硬件。有了這些信息,硬件將彼此的QP相互注冊,但還沒有為開始交換消息做好准備。min_rnr_timer是發生超時之前兩次重試之間的時間間隔(以秒為單位)。
The QP must be moved into the Ready-To-Send state before the "connection" process is complete.
在連接過程完成之前,QP狀態必須被改變到Ready-To-Send(發送就緒)狀態。
/* Listing 17: Moving QP to Ready to Send */
118     attr.qp_state      = IBV_QPS_RTS;
119     attr.timeout       = 14;
120     attr.retry_cnt     = 7;
121     attr.rnr_retry     = 7;
122     attr.sq_psn        = my_psn;
123     attr.max_rd_atomic = 1;
124     if (ibv_modify_qp(ctx->qp, &attr,
125                       IBV_QP_STATE     |
126                       IBV_QP_TIMEOUT   |
127                       IBV_QP_RETRY_CNT |
128                       IBV_QP_RNR_RETRY |
129                       IBV_QP_SQ_PSN    |
130                       IBV_QP_MAX_QP_RD_ATOMIC)) {
131         fprintf(stderr, "Failed to modify QP to RTS\n");
132         return 1;
133     }
The attr used to move the QP into IBV_QPS_RTS is the same attr used in the previous call. There is no need to zero out the structure because the bitmask, given as the third argument, specifies which fields should be set.
用來將QP狀態改變到IBV_QPS_RTS的attr跟前面調用使用的attr是一模一樣的。沒有必要將數據結構初始化為0, 因為第三個參數bitmask指定了哪些字段需要被設置。
After the QP is moved into the Ready-To-Send state, the connection (end-to-end context) is ready.
當QP狀態處於Ready-to-Send的時候,端到端的連接就准備好了。
8.7 Sending Data | 數據發送
Since the server already posted receive buffers, the client will now post a "send" work request.
既然服務器端已經放置了接收緩沖區,那么客戶端即將放置一個“發送”工作請求。
/* Listing 18: Client Posting Send */
468     struct ibv_sge list = {
469         .addr   = (uintptr_t) ctx->buf,
470         .length = ctx->size,
471         .lkey   = ctx->mr->lkey
472     };
473     struct ibv_send_wr wr = {
474         .wr_id      = PINGPONG_SEND_WRID,
475         .sg_list    = &list,
476         .num_sge    = 1,
477         .opcode     = IBV_WR_SEND,
478         .send_flags = IBV_SEND_SIGNALED,
479     };
480     struct ibv_send_wr *bad_wr;
481
482     return ibv_post_send(ctx->qp, &wr, &bad_wr);
The wr_id is an ID specified by the programmer to identify the completion notification corresponding with this work request. In addition, the flag IBV_SEND_SIGNALED sets the completion notification indicator. According to ibv_post_send(3), it is only relevant if the QP is created with sq_sig_all = 0.
wr_id是由程序員指定的ID,用來識別與這個WR相對應的完成通知。另外,標志IBV_SEND_SIGNALED設置了完成通知指示符。根據ibv_post_send(3),只有當QP創建時sq_sig_all被設置為0,這個標志才有意義。
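For comparison, a hypothetical QP created with sq_sig_all = 1 would generate a completion for every send, so the per-request IBV_SEND_SIGNALED flag would no longer matter; ibv_rc_pingpong leaves the field at 0 and flags each send instead.
作為對比,假如創建QP時把sq_sig_all設置為1,那么每個發送請求都會產生完成事件,按請求設置的IBV_SEND_SIGNALED標志也就不再起作用;ibv_rc_pingpong把該字段保留為0,轉而為每個發送請求打上標志。

/* Hypothetical variant: every send work request on this QP is signaled. */
struct ibv_qp_init_attr attr = {
    .send_cq    = ctx->cq,
    .recv_cq    = ctx->cq,
    .cap        = { .max_send_wr  = 1, .max_recv_wr  = rx_depth,
                    .max_send_sge = 1, .max_recv_sge = 1 },
    .qp_type    = IBV_QPT_RC,
    .sq_sig_all = 1,   /* 0 (the pingpong default) signals only flagged requests */
};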
8.8 Flow Control | 流量控制
Programmers must implement their own flow control when working with the verbs API. Let us examine the flow control used in ibv_rc_pingpong. Remember from earlier that a client cannot post a send if its remote node does not have a buffer waiting to receive the data.
當使用verbs API的時候,程序員必須自己實現流量控制。讓我們看看ibv_rc_pingpong里的流量控制。記住我們在前面強調的,如果遠端結點沒有准備好等待接收數據的緩沖區,client就不能夠發送數據。
Flow control must be used to ensure that receivers do not exhaust their supply of posted receives. Furthermore, the CQ must not overflow. If the client does not pull CQEs off the queue fast enough, the CQ is thrown into an error state, and can no longer be used.
使用流量控制是必須的,以確保接收端不會耗盡它所提供的接收資源。此外,CQ必須不能夠溢出。如果client不能足夠快地把CQE從CQ上拉取下來的話,那么CQ就會陷入錯誤狀態以致於不能再被使用。
You can see at the top of the loop, which will send/recv the data, that ibv_rc_pingpong tracks the send and recv count.
可以看到,在發送/接收數據的循環頂部,ibv_rc_pingpong對發送和接收進行了計數。
/* Listing 19: Flow Control */
717     rcnt = scnt = 0;
718     while (rcnt < iters || scnt < iters) {
Now the code will poll the CQ for two completions; a send completion and a receive completion.
現在,代碼將輪詢CQ,等待兩個完成事件:一個發送完成和一個接收完成。
/* Listing 20: Polling the CQ */
745         do {
746             ne = ibv_poll_cq(ctx->cq, 2, wc);
747             if (ne < 0) {
748                 fprintf(stderr, "poll CQ failed %d\n", ne);
749                 return 1;
750             }
751
752         } while (!use_event && ne < 1);
The use_event variable specifies whether or not the program should sleep on CQ events. By default, ibv_rc_pingpong will poll. Hence the while-loop. On success, ibv_poll_cq returns the number of completions found.
變量use_event指定程序是否在CQ事件上進行睡眠。默認情況下,ibv_rc_pingpong進行輪詢,因此使用了while循環。成功時,ibv_poll_cq()返回找到的完成事件的數量。
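Before interpreting wr_id, each returned completion's status should also be checked; a hedged sketch (ne and wc come from the polling loop above):
在解讀wr_id之前,還應當檢查每個返回的完成事件的狀態;下面是一個示意(ne和wc來自上面的輪詢循環):

/* Hypothetical sketch: verify each completion succeeded before using its wr_id. */
for (int i = 0; i < ne; ++i) {
    if (wc[i].status != IBV_WC_SUCCESS) {
        fprintf(stderr, "failed completion: %s (wr_id %llu)\n",
                ibv_wc_status_str(wc[i].status),
                (unsigned long long) wc[i].wr_id);
        return 1;
    }
}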
Next, the program must account for how many sends and receives have been posted.
接下來,程序必須統計已經放置的發送請求和接收請求的個數。
/* Listing 21: Flow Control Accounting */
762             switch ((int) wc[i].wr_id) {
763             case PINGPONG_SEND_WRID:
764                 ++scnt;
765                 break;
766
767             case PINGPONG_RECV_WRID:
768                 if (--routs <= 1) {
769                     routs += pp_post_recv(ctx, ctx->rx_depth - routs);
770                     if (routs < ctx->rx_depth) {
771                         fprintf(stderr,
772                                 "Couldn't post receive (%d)\n",
773                                 routs);
774                         return 1;
775                     }
776                 }
777
778                 ++rcnt;
779                 break;
780
781             default:
782                 fprintf(stderr, "Completion for unknown wr_id %d\n",
783                         (int) wc[i].wr_id);
784                 return 1;
785             }
The ID given to the work request is also given to its associated work completion, so that the client knows what WQE the CQE is associated with. In this case, if it finds a completion for a send event, it increments the send counter and moves on.
給WR使用的ID也會被賦給與之相關聯的工作完成(WC),於是client就知道CQE關聯到哪一個WQE上。在這個案例中,如果為一個發送事件找到了一個工作完成,就增加發送計數器然后繼續。
The case for PINGPONG_RECV_WRID is more interesting, because it must make sure that receive buffers are always available. In this case the routs variable indicates how many recv buffers are available. So if only one buffer remains available, ibv_rc_pingpong will post more recv buffers. In this case, it calls pp_post_recv again, which will post another 500 (by default). After that it increments the recv counter.
PINGPONG_RECV_WRID的案例更有趣,因為它必須確保接收緩沖區總是可用的。在這種情況下,變量routs表示有多少接收緩沖區可用。所以,如果只剩一個緩沖區可用,ibv_rc_pingpong將會放置更多的接收緩沖區:它再次調用pp_post_recv,默認將再放置500個接收緩沖區,然后增加recv計數器。
Finally, if more sends need to be posted, the program will post another send before continuing the loop.
最后,如果需要放置更多的發送請求,程序將在繼續循環之前放置另一個發送請求。
/* Listing 22: Posting Another Send */
787             ctx->pending &= ~(int) wc[i].wr_id;
788             if (scnt < iters && !ctx->pending) {
789                 if (pp_post_send(ctx)) {
790                     fprintf(stderr, "Couldn't post send\n");
791                     return 1;
792                 }
793                 ctx->pending = PINGPONG_RECV_WRID |
794                                PINGPONG_SEND_WRID;
795             }
8.9 ACK
The ibv_rc_pingpong program will now ack the completion events with a call to ibv_ack_cq_events. To avoid races, the CQ destroy operation will wait for all completion events returned by ibv_get_cq_event to be acknowledged. The call to ibv_ack_cq_events must take a mutex internally, so it is best to ack multiple events at once.
程序ibv_rc_pingpong將調用ibv_ack_cq_events()對完成事件進行確認。為了避免競態,CQ的destroy操作會等待所有由ibv_get_cq_event返回的完成事件都被確認。ibv_ack_cq_events()在內部需要獲取互斥鎖,所以最好是一次確認多個事件。
816 ibv_ack_cq_events(ctx->cq, num_cq_events);
As a reminder, ibv_ack_cq_events(3) has helpful sample code.
友情提醒一下,ibv_ack_cq_events(3)有提供幫助性的示例代碼。
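Once the events are acknowledged and the loop has finished, the resources can be torn down roughly in the reverse order of their creation; ibv_rc_pingpong does the equivalent in its pp_close_ctx helper. A hypothetical sketch:
當事件得到確認、循環結束之后,就可以大致按照創建順序的逆序來釋放資源;ibv_rc_pingpong在其pp_close_ctx輔助函數中做的就是等價的事情。下面是一個假設性的示意:

/* Hypothetical cleanup sketch: release resources in reverse order of creation.
 * The QP must go before the CQ it uses; the MR before its PD; the CQ before
 * the completion channel it is bound to. */
ibv_destroy_qp(ctx->qp);
ibv_destroy_cq(ctx->cq);
if (ctx->channel)
    ibv_destroy_comp_channel(ctx->channel);
ibv_dereg_mr(ctx->mr);
ibv_dealloc_pd(ctx->pd);
ibv_close_device(ctx->context);
free(ctx->buf);
free(ctx);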
9 Conclusion | 結束語
InfiniBand is the growing standard for supercomputer interconnects, even appearing in departmental clusters. The API is complicated and sparsely documented, and the sample program provided by OFED, ibv_rc_pingpong, does not fully explain the functionality of the verbs. This paper will hopefully enable the reader to better understand the verbs interface.
IB正在成為超級計算機互連的主流標准,甚至已經出現在部門級集群中。verbs API較為復雜,而且文檔比較稀少,OFED提供的示例程序ibv_rc_pingpong也不能完全解釋verbs的功能。希望本文能夠幫助讀者更好地理解verbs接口。
10 Acknowledgements | 致謝
Gene Cooperman (Northeastern University) and Jeff Squyres (Cisco) contributed substantially to the organization, structure, and content of this document. Jeff also took the time to discuss the details of InfiniBand with me. Josh Hursey (Oak Ridge National Laboratory) shared his knowledge of InfiniBand with me along the way. Roland Dreier (PureStorage) pointed out, and corrected, a mistake in my explanation of acks.
東北大學的Gene Cooperman和思科的Jeff Squyres對此文檔的組織、結構和內容做出了重大貢獻。Jeff還花了不少時間和我討論IB的細節。橡樹嶺國家實驗室的Josh Hursey跟我分享了有關IB的知識。PureStorage的Roland Dreier指出並糾正了我在解釋ACK時存在的錯誤。
References | 參考文獻
- [1] Jason Ansel, Gene Cooperman, and Kapil Arya. DMTCP: Scalable user-level transparent checkpointing for cluster computations and the desktop. In Proc. of IEEE International Parallel and Distributed Processing Symposium (IPDPS-09, systems track). IEEE Press, 2009. published on CD; software available at http://dmtcp.sourceforge.net.
- [2] InfiniBand Trade Association. InfiniBand Architecture Specification Volume 1, Release 1.2.1, November 2007. http://www.infinibandta.org/content/pages.php?pg=technology_download.