[SPDK/NVMe Storage Technology Analysis] 010 - Understanding SGL

Reposted article. 2017-12-05 16:10. Tag: NVMe

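In NVMe over PCIe, I/O commands support both SGL (Scatter Gather List) and PRP (Physical Region Page), while admin commands support only PRP. In NVMe over Fabrics, both admin commands and I/O commands support only SGL. NVMe over Fabrics runs over FC networks as well as RDMA networks. In RDMA programming, the SGL (Scatter/Gather List) is the most basic form of data organization. An SGL is an array whose elements are called SGEs (Scatter/Gather Elements), and each SGE describes one data segment. The SGE is defined as follows (see verbs.h):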

struct ibv_sge {
        uint64_t        addr;
        uint32_t        length;
        uint32_t        lkey;
};

addr: starting virtual address of the data segment (i.e. the buffer)
length: length of the data segment in bytes
lkey: L_Key of the data segment (key of the local Memory Region)
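
As a rough illustration of the three fields above, the sketch below fills a three-element SGL that describes three separate buffers and sums their lengths. To stay compilable without libibverbs it uses `struct demo_sge`, a hypothetical stand-in that mirrors the layout of `struct ibv_sge`; the helpers `demo_fill_sge()` and `demo_sgl_total_length()` are likewise invented for this example. In real code the lkey would come from the `struct ibv_mr` returned by memory registration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring struct ibv_sge from <infiniband/verbs.h>. */
struct demo_sge {
        uint64_t        addr;   /* start address of the local buffer */
        uint32_t        length; /* length of the buffer in bytes */
        uint32_t        lkey;   /* key of the local Memory Region */
};

/* Describe one buffer as a single SGE. */
void demo_fill_sge(struct demo_sge *sge, const void *buf,
                   uint32_t len, uint32_t lkey)
{
        assert(sge != NULL && buf != NULL);
        sge->addr   = (uintptr_t)buf;
        sge->length = len;
        sge->lkey   = lkey;
}

/* Sum the lengths of all SGEs: the number of bytes the SGL describes. */
uint64_t demo_sgl_total_length(const struct demo_sge *sgl, int num_sge)
{
        uint64_t total = 0;

        for (int i = 0; i < num_sge; i++)
                total += sgl[i].length;
        return total;
}
```

Calling `demo_fill_sge()` once per buffer on the elements of a `struct demo_sge sgl[3]` array yields an SGL in the same shape that `sg_list` and `num_sge` expect.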

For data transfer, the Verbs APIs used to send and receive are:

ibv_post_send() - post a list of work requests (WRs) to a send queue
ibv_post_recv() - post a list of work requests (WRs) to a receive queue

The rest of this article uses ibv_post_send() as an example to show how an SGL ends up on the wire of the RDMA hardware.

Function prototype of ibv_post_send()

#include <infiniband/verbs.h>

int ibv_post_send(struct ibv_qp *qp, 
                  struct ibv_send_wr *wr,
                  struct ibv_send_wr **bad_wr);

ibv_post_send() posts the linked list of work requests (WRs) starting with wr to the send queue of the queue pair qp. It stops processing WRs from this list at the first failure (that can be detected immediately while requests are being posted), and returns this failing WR through bad_wr.

The argument wr is an ibv_send_wr struct, as defined in <infiniband/verbs.h>.

struct ibv_send_wr {
        uint64_t                wr_id;                  /* User defined WR ID */
        struct ibv_send_wr     *next;                   /* Pointer to next WR in list, NULL if last WR */
        struct ibv_sge         *sg_list;                /* Pointer to the s/g array */
        int                     num_sge;                /* Size of the s/g array */
        enum ibv_wr_opcode      opcode;                 /* Operation type */
        int                     send_flags;             /* Flags of the WR properties */
        uint32_t                imm_data;               /* Immediate data (in network byte order) */
        union {
                struct {
                        uint64_t        remote_addr;    /* Start address of remote memory buffer */
                        uint32_t        rkey;           /* Key of the remote Memory Region */
                } rdma;
                struct {
                        uint64_t        remote_addr;    /* Start address of remote memory buffer */
                        uint64_t        compare_add;    /* Compare operand */
                        uint64_t        swap;           /* Swap operand */
                        uint32_t        rkey;           /* Key of the remote Memory Region */
                } atomic;
                struct {
                        struct ibv_ah  *ah;             /* Address handle (AH) for the remote node address */
                        uint32_t        remote_qpn;     /* QP number of the destination QP */
                        uint32_t        remote_qkey;    /* Q_Key number of the destination QP */
                } ud;
        } wr;
};

struct ibv_sge {
        uint64_t        addr;   /* Start address of the local memory buffer */
        uint32_t        length; /* Length of the buffer */
        uint32_t        lkey;   /* Key of the local Memory Region */
};

Before calling ibv_post_send(), the wr data structure must be filled in. wr is a linked list, and each node carries an sg_list (i.e. an SGL: an array of one or more SGEs) whose length is num_sge.
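
The list structure, and the way a failing WR is reported through bad_wr, can be sketched in plain C. This is a software model under stated assumptions, not the verbs implementation: `struct demo_sge`, `struct demo_send_wr` and `demo_post_send()` are hypothetical names that mirror only a subset of the real fields, and the single failure rule used here (a WR whose num_sge exceeds a `max_sge` limit) is chosen for demonstration. A real provider applies its own checks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring a subset of the verbs structures. */
struct demo_sge {
        uint64_t        addr;
        uint32_t        length;
        uint32_t        lkey;
};

struct demo_send_wr {
        uint64_t                wr_id;
        struct demo_send_wr    *next;     /* NULL if last WR */
        struct demo_sge        *sg_list;  /* the SGL of this WR */
        int                     num_sge;  /* number of SGEs in sg_list */
};

/*
 * Simulated post: walk the WR list, accept each WR whose SGL fits within
 * max_sge, stop at the first WR that does not, and report it via bad_wr.
 * Returns 0 on success or an errno value on failure. The number of
 * accepted WRs is stored in *posted.
 */
int demo_post_send(struct demo_send_wr *wr, int max_sge,
                   struct demo_send_wr **bad_wr, int *posted)
{
        assert(bad_wr != NULL && posted != NULL);
        *bad_wr = NULL;
        *posted = 0;

        for (; wr != NULL; wr = wr->next) {
                if (wr->num_sge < 0 || wr->num_sge > max_sge) {
                        *bad_wr = wr;
                        return EINVAL;
                }
                (*posted)++;
        }
        return 0;
}
```

With three chained WRs whose SGLs hold 1, 3 and 1 elements, a limit of 4 accepts all three, while a limit of 2 stops at the second WR and leaves the first one posted.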

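The figures below illustrate how SGLs map onto the WR list, and how the multiple data segments in one SGL (struct ibv_sge *sg_list) are gathered by the RDMA hardware into a single contiguous data segment.

01 - Create the SGL

As the figure for this step shows, every node in the wr list contains an SGL, and an SGL is an array of one or more SGEs.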

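02 - Use a PD for memory protection

An SGL is protected by at least one MR (Memory Region), and multiple MRs can belong to the same PD (Protection Domain).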
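
The relationship between SGEs, MRs and a PD can be sketched as the check below. It is a simplified model, not the verbs API: `struct demo_mr`, `struct demo_sge` and `demo_sge_is_covered()` are hypothetical names, and the PD is reduced to an integer handle. The model assumes that an SGE is acceptable when its lkey names an MR in the expected PD and its byte range lies entirely inside that MR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a registered Memory Region. */
struct demo_mr {
        uint64_t        addr;    /* start of the registered region */
        uint64_t        length;  /* size of the registered region */
        uint32_t        lkey;    /* local key naming this MR */
        int             pd;      /* handle of the PD that owns this MR */
};

/* Hypothetical stand-in mirroring struct ibv_sge. */
struct demo_sge {
        uint64_t        addr;
        uint32_t        length;
        uint32_t        lkey;
};

/* Return true if some MR in the table, owned by pd, fully covers the SGE. */
bool demo_sge_is_covered(const struct demo_sge *sge,
                         const struct demo_mr *mrs, int num_mr, int pd)
{
        assert(sge != NULL && mrs != NULL);
        for (int i = 0; i < num_mr; i++) {
                const struct demo_mr *mr = &mrs[i];

                if (mr->lkey != sge->lkey || mr->pd != pd)
                        continue;
                /* Overflow-safe form of: addr + length <= mr end. */
                if (sge->addr >= mr->addr &&
                    sge->length <= mr->length &&
                    sge->addr - mr->addr <= mr->length - sge->length)
                        return true;
        }
        return false;
}
```

Because each SGE carries its own lkey, the SGEs of one SGL may point into different MRs, as long as those MRs sit in the same PD.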

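03 - Call ibv_post_send() to send the SGL onto the wire

In the figure for this step, one SGL array contains 3 SGEs whose lengths are N1, N2 and N3 bytes. The 3 buffers are not contiguous; they are scattered across memory. After the RDMA hardware reads the SGL, it performs the gather operation, so what appears on the wire is N1+N2+N3 contiguous bytes. In other words, an SGL lets us hand multiple non-contiguous data segments scattered in memory to the RDMA hardware, which gathers them into one contiguous data segment.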
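
The gather step can be imitated in software to make the byte layout concrete. The sketch below is only a model of what the hardware does: `struct demo_sge` is a hypothetical stand-in for `struct ibv_sge`, `demo_gather()` is an invented helper, and the "wire" is an ordinary destination buffer. It assumes the segments are emitted in SGL array order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in mirroring struct ibv_sge. */
struct demo_sge {
        uint64_t        addr;
        uint32_t        length;
        uint32_t        lkey;   /* unused in this software model */
};

/*
 * Copy the buffers described by the SGL, in array order, into one
 * contiguous destination. Returns the number of bytes written, or 0 if
 * the destination is too small.
 */
size_t demo_gather(const struct demo_sge *sgl, int num_sge,
                   void *dst, size_t dst_size)
{
        size_t off = 0;

        assert(sgl != NULL && dst != NULL);
        for (int i = 0; i < num_sge; i++) {
                if (sgl[i].length > dst_size - off)
                        return 0;
                memcpy((char *)dst + off,
                       (const void *)(uintptr_t)sgl[i].addr, sgl[i].length);
                off += sgl[i].length;
        }
        return off;
}
```

Given three SGEs pointing at the separate buffers "scat", "ter/" and "gather", the destination ends up holding the 14 contiguous bytes "scatter/gather".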

Finally, since I prefer code to armchair theory, here is a short snippet showing how to prepare the SGL and the WR for a call to ibv_post_send().

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>
#include <rdma/rdma_cma.h>

#define BUFFER_SIZE     1024

struct connection {
        struct rdma_cm_id       *id;
        struct ibv_qp           *qp;

        struct ibv_mr           *recv_mr;
        struct ibv_mr           *send_mr;

        char                    recv_region[BUFFER_SIZE];
        char                    send_region[BUFFER_SIZE];

        int                     num_completions;
};

void foo_send(void *context)
{
        struct connection *conn = (struct connection *)context;

        /* 1. Fill the SGL array, which has only one element */
        struct ibv_sge sge;

        memset(&sge, 0, sizeof(sge));
        sge.addr        = (uintptr_t)conn->send_region;
        sge.length      = BUFFER_SIZE;
        sge.lkey        = conn->send_mr->lkey;

        /* 2. Fill the singly-linked WR list, which has only one node */
        struct ibv_send_wr wr;
        struct ibv_send_wr *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));     /* also sets wr.next to NULL */
        wr.wr_id        = (uintptr_t)conn;
        wr.opcode       = IBV_WR_SEND;
        wr.sg_list      = &sge;
        wr.num_sge      = 1;
        wr.send_flags   = IBV_SEND_SIGNALED;

        /* 3. Now send; ibv_post_send() returns 0 on success or an errno value */
        int ret = ibv_post_send(conn->qp, &wr, &bad_wr);
        if (ret)
                fprintf(stderr, "ibv_post_send failed: %s\n", strerror(ret));

        /* ...<snip>... */
}

Appendix 1: OFED Verbs

A great ship asks deep waters. | A big ship has to sail in deep water, and a dragon has to swim in the open sea.
