Building a user-space storage powerhouse: Blobstore & BlobFS, storage engines based on SPDK
https://community.mellanox.com/s/article/howto-configure-nvme-over-fabrics
Thin-provisioned logical volumes in SPDK with construct_lvol_bdev: https://www.sdnlab.com/21098.html
SPDK block device layer (bdev): https://www.cnblogs.com/whl320124/p/10064878.html https://spdk.io/doc/bdev.html#bdev_ug_introduction
Reposted from "NVMe-oF source code in SPDK": https://mp.weixin.qq.com/s/ohPaxAwmhGtuQQWz--J6WA
https://spdk.io/doc/nvmf_tgt_pg.html
NVMe-oF specification: http://nvmexpress.org/resources/specifications/
General overview plus fairly detailed configuration of different bdev types: https://www.jianshu.com/p/b11948e55d80
https://www.gitmemory.com/issue/spdk/spdk/627/478899588
Introduction to the SPDK NVMe-oF target multipath feature: http://syswift.com/508.html
NVMe-oF source code walkthrough, part 1
The NVMe-oF library in SPDK lives in lib/nvmf; app/nvmf_tgt.c simply uses the APIs exposed by this library.
Basic data types
The library exposes a number of primitives - basic objects that the user creates and interacts with. They are:
struct spdk_nvmf_tgt:
An NVMe-oF target. This concept, surprisingly, does not appear in the NVMe-oF specification. SPDK defines this to mean the collection of subsystems with the associated namespaces, plus the set of transports and their associated network connections. This will be referred to throughout this guide as a target.
struct spdk_nvmf_subsystem:
An NVMe-oF subsystem, as defined by the NVMe-oF specification. Subsystems contain namespaces and controllers and perform access control. This will be referred to throughout this guide as a subsystem.
struct spdk_nvmf_ns:
An NVMe-oF namespace, as defined by the NVMe-oF specification. Namespaces are bdevs. See Block Device User Guide for an explanation of the SPDK bdev layer. This will be referred to throughout this guide as a namespace.
struct spdk_nvmf_qpair:
An NVMe-oF queue pair, as defined by the NVMe-oF specification. These map 1:1 to network connections. This will be referred to throughout this guide as a qpair.
struct spdk_nvmf_transport:
An abstraction for a network fabric, as defined by the NVMe-oF specification. The specification is designed to allow for many different network fabrics, so the code mirrors that and implements a plugin system. Currently, only the RDMA transport is available. This will be referred to throughout this guide as a transport.
struct spdk_nvmf_poll_group:
An abstraction for a collection of network connections that can be polled as a unit. This is an SPDK-defined concept that does not appear in the NVMe-oF specification. Often, network transports have facilities to check for incoming data on groups of connections more efficiently than checking each one individually (e.g. epoll), so poll groups provide a generic abstraction for that. This will be referred to throughout this guide as a poll group.
struct spdk_nvmf_listener:
A network address at which the target will accept new connections.
struct spdk_nvmf_host:
An NVMe-oF NQN representing a host (initiator) system. This is used for access control.
Basic APIs
The Basics
A user of the NVMe-oF target library begins by creating a target using spdk_nvmf_tgt_create(), setting up a set of addresses on which to accept connections by calling spdk_nvmf_tgt_listen(), then creating a subsystem using spdk_nvmf_subsystem_create().
Subsystems begin in an inactive state and must be activated by calling spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only when in the paused or inactive state. A running subsystem may be paused by calling spdk_nvmf_subsystem_pause() and resumed by calling spdk_nvmf_subsystem_resume().
Namespaces may be added to the subsystem by calling spdk_nvmf_subsystem_add_ns() when the subsystem is inactive or paused. Namespaces are bdevs. See Block Device User Guide for more information about the SPDK bdev layer. A bdev may be obtained by calling spdk_bdev_get_by_name().
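As a rough illustration, a minimal C sketch of this sequence might look as follows. It is based on the older API variants described in this guide (exact signatures differ between SPDK releases), and the transport ID values, the subsystem NQN, and the bdev name "Malloc0" are placeholder assumptions:

#include <stdio.h>
#include "spdk/nvmf.h"
#include "spdk/bdev.h"

static void
subsystem_start_done(struct spdk_nvmf_subsystem *subsystem, void *cb_arg, int status)
{
    /* The subsystem is now active and able to serve I/O. */
}

static void
nvmf_tgt_setup(void)
{
    struct spdk_nvmf_tgt *tgt;
    struct spdk_nvmf_subsystem *subsystem;
    struct spdk_nvme_transport_id trid = {};
    struct spdk_bdev *bdev;

    tgt = spdk_nvmf_tgt_create(NULL /* default options */);

    /* Example listen address: RDMA on 192.168.0.1:4420.
     * Some releases take a completion callback here as well. */
    trid.trtype = SPDK_NVME_TRANSPORT_RDMA;
    trid.adrfam = SPDK_NVMF_ADRFAM_IPV4;
    snprintf(trid.traddr, sizeof(trid.traddr), "192.168.0.1");
    snprintf(trid.trsvcid, sizeof(trid.trsvcid), "4420");
    spdk_nvmf_tgt_listen(tgt, &trid);

    subsystem = spdk_nvmf_subsystem_create(tgt, "nqn.2016-06.io.spdk:cnode1",
                                           SPDK_NVMF_SUBTYPE_NVME, 1 /* max namespaces */);

    /* Namespaces are bdevs; "Malloc0" is assumed to exist already.
     * Later releases add further parameters to spdk_nvmf_subsystem_add_ns(). */
    bdev = spdk_bdev_get_by_name("Malloc0");
    spdk_nvmf_subsystem_add_ns(subsystem, bdev, NULL, 0);

    /* Subsystems begin inactive and must be activated asynchronously. */
    spdk_nvmf_subsystem_start(subsystem, subsystem_start_done, NULL);
}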
Once a subsystem exists and the target is listening on an address, new connections may be accepted by polling spdk_nvmf_tgt_accept().
All I/O to a subsystem is driven by a poll group, which polls for incoming network I/O. Poll groups may be created by calling spdk_nvmf_poll_group_create(). They automatically request to begin polling upon creation on the thread from which they were created. Most importantly, a poll group may only be accessed from the thread on which it was created.
When spdk_nvmf_tgt_accept() detects a new connection, it will construct a new struct spdk_nvmf_qpair object and call the user provided new_qpair_fn callback for each new qpair. In response to this callback, the user must assign the qpair to a poll group by calling spdk_nvmf_poll_group_add(). Remember, a poll group may only be accessed from the thread on which it was created, so making a call to spdk_nvmf_poll_group_add() may require passing a message to the appropriate thread.
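A sketch of this accept path, assuming a single poll group whose owning thread is tracked in a global variable; poller registration and error handling are omitted, and the accept callback gained an extra context argument in later releases:

#include "spdk/nvmf.h"
#include "spdk/thread.h"

static struct spdk_nvmf_poll_group *g_poll_group;   /* created on g_poll_group_thread */
static struct spdk_thread *g_poll_group_thread;

static void
_poll_group_add_msg(void *ctx)
{
    struct spdk_nvmf_qpair *qpair = ctx;

    /* Runs on the poll group's own thread, as required. */
    spdk_nvmf_poll_group_add(g_poll_group, qpair);
}

static void
new_qpair_cb(struct spdk_nvmf_qpair *qpair)
{
    /* The qpair may only be added from the thread that owns the poll group,
     * so forward it there as a message. */
    spdk_thread_send_msg(g_poll_group_thread, _poll_group_add_msg, qpair);
}

static int
acceptor_poll(void *arg)
{
    struct spdk_nvmf_tgt *tgt = arg;

    /* Typically registered as a periodic poller on one thread. */
    spdk_nvmf_tgt_accept(tgt, new_qpair_cb);
    return 0;
}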
Access Control
Access control is performed at the subsystem level by adding allowed listen addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept connections from any host or over any established listen address. Listeners and hosts may only be added to inactive or paused subsystems.
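A small hedged example, assuming the subsystem is currently paused or inactive and reusing the transport ID that was passed to spdk_nvmf_tgt_listen(); the host NQN is a placeholder, and newer releases make spdk_nvmf_subsystem_add_listener() asynchronous:

/* Allow exactly one listen address and one host to connect to this subsystem. */
spdk_nvmf_subsystem_add_listener(subsystem, &trid);
spdk_nvmf_subsystem_add_host(subsystem, "nqn.2014-08.org.nvmexpress:uuid:example-host");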
Discovery Subsystems
A discovery subsystem, as defined by the NVMe-oF specification, is automatically created for each NVMe-oF target constructed. Connections to the discovery subsystem are handled in the same way as any other subsystem - new qpairs are created in response to spdk_nvmf_tgt_accept() and they must be assigned to a poll group.
Transports
The NVMe-oF specification defines multiple network transports (the "Fabrics" in NVMe over Fabrics) and has an extensible system for adding new fabrics in the future. The SPDK NVMe-oF target library implements a plugin system for network transports to mirror the specification. The API a new transport must implement is located in lib/nvmf/transport.h. As of this writing, only an RDMA transport has been implemented.
The SPDK NVMe-oF target is designed to be able to process I/O from multiple fabrics simultaneously.
Choosing a Threading Model
The SPDK NVMe-oF target library does not strictly dictate threading model, but poll groups do all of their polling and I/O processing on the thread they are created on. Given that, it almost always makes sense to create one poll group per thread used in the application. New qpairs created in response to spdk_nvmf_tgt_accept() can be handed out round-robin to the poll groups. This is how the SPDK NVMe-oF target application currently functions.
More advanced algorithms for distributing qpairs to poll groups are possible. For instance, a NUMA-aware algorithm would be an improvement over basic round-robin, where NUMA-aware means assigning qpairs to poll groups running on CPU cores that are on the same NUMA node as the network adapter and storage device. Load-aware algorithms also may have benefits.
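As a sketch only, round-robin handout could look like the following; the fixed group count and the per-group arrays are assumptions made for illustration, not part of the SPDK API:

#include <stdlib.h>
#include "spdk/nvmf.h"
#include "spdk/thread.h"

#define NUM_POLL_GROUPS 4

static struct spdk_nvmf_poll_group *g_groups[NUM_POLL_GROUPS];   /* one per thread, filled at startup */
static struct spdk_thread *g_group_threads[NUM_POLL_GROUPS];

struct add_qpair_ctx {
    struct spdk_nvmf_poll_group *group;
    struct spdk_nvmf_qpair *qpair;
};

static void
_do_poll_group_add(void *arg)
{
    struct add_qpair_ctx *ctx = arg;

    spdk_nvmf_poll_group_add(ctx->group, ctx->qpair);
    free(ctx);
}

static void
new_qpair_round_robin(struct spdk_nvmf_qpair *qpair)
{
    static uint32_t next;
    uint32_t i = next++ % NUM_POLL_GROUPS;
    struct add_qpair_ctx *ctx = calloc(1, sizeof(*ctx));

    if (ctx == NULL) {
        return;
    }
    ctx->group = g_groups[i];
    ctx->qpair = qpair;
    /* A NUMA-aware policy would choose i based on the NIC's and SSD's NUMA node instead. */
    spdk_thread_send_msg(g_group_threads[i], _do_poll_group_add, ctx);
}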
Scaling Across CPU Cores
Incoming I/O requests are picked up by the poll group polling their assigned qpair. For regular NVMe commands such as READ and WRITE, the I/O request is processed on the initial thread from start to the point where it is submitted to the backing storage device, without interruption. Completions are discovered by polling the backing storage device and also processed to completion on the polling thread. Regular NVMe commands (READ, WRITE, etc.) do not require any cross-thread coordination, and therefore take no locks.
NVMe ADMIN commands, which are used for managing the NVMe device itself, may modify global state in the subsystem. For instance, an NVMe ADMIN command may perform namespace management, such as shrinking a namespace. For these commands, the subsystem will temporarily enter a paused state by sending a message to each thread in the system. All new incoming I/O on any thread targeting the subsystem will be queued during this time. Once the subsystem is fully paused, the state change will occur, and messages will be sent to each thread to release queued I/O and resume. Management commands are rare, so this style of coordination is preferable to forcing all commands to take locks in the I/O path.
Zero Copy Support
For the RDMA transport, data is transferred from the RDMA NIC to host memory and then host memory to the SSD (or vice versa), without any intermediate copies. Data is never moved from one location in host memory to another. Other transports in the future may require data copies.
RDMA
The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and rdmacm libraries, which are packaged and available on most Linux distributions. It does not use a user-space RDMA driver stack through DPDK.
In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA transport allocates a single RDMA completion queue per poll group. All new qpairs assigned to the poll group are given their own RDMA send and receive queues, but share this common completion queue. This allows the poll group to poll a single queue for incoming messages instead of iterating through each one.
Each RDMA request is handled by a state machine that walks the request through a number of states. This keeps the code organized and makes all of the corner cases much more obvious.
RDMA SEND, READ, and WRITE operations are ordered with respect to one another, but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For instance, it is possible to detect an incoming RDMA RECV message containing a new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND containing an NVMe completion. This is problematic at full queue depth because there may not yet be a free request structure. To handle this, the RDMA request structure is broken into two parts - an rdma_recv and an rdma_request. New RDMA RECVs will always grab a free rdma_recv, but may need to wait in a queue for a SEND acknowledgement before they can acquire a full rdma_request object.
Further, RDMA NICs expose different queue depths for READ/WRITE operations than they do for SEND/RECV operations. The RDMA transport reports available queue depth based on SEND/RECV operation limits and will queue in software as necessary to accommodate (usually lower) limits on READ/WRITE operations.
The block device (bdev) layer
Introduction
A block device is a storage device that supports reading and writing data in fixed-size blocks, typically 512 or 4096 bytes. A device may be a logical construct in software, or it may correspond to a physical device such as an NVMe SSD.
The block device layer consists of a single generic library, lib/bdev, plus a number of optional modules (built as separate libraries) that implement various types of block devices. The public header for the generic library is bdev.h, which is the entire API required to interact with any type of block device.

The following covers how to use this API to interact with bdevs. For a guide to implementing a bdev module, see Writing a Custom Block Device Module.
In addition to providing a common abstraction for all block devices, the bdev layer provides a number of useful features:
- Automatic queueing of I/O requests in response to queue-full or out-of-memory conditions
- Hot-remove support, even while I/O traffic is occurring
- I/O statistics such as bandwidth and latency
- Device reset support and I/O timeout tracking
Basic primitives
Users of the bdev API interact with a number of basic objects.
struct spdk_bdev, referred to in this guide as a bdev, represents a generic block device. struct spdk_bdev_desc, referred to as a descriptor, represents a handle to a given block device. Descriptors are used to establish and track permissions to use the underlying block device, much like a file descriptor on UNIX systems. Requests to a block device are asynchronous and are represented by spdk_bdev_io objects. Requests must be submitted on an associated I/O channel. The motivation and design of I/O channels is covered in Message Passing and Concurrency.
Bdevs can be layered, such that some bdevs service I/O by routing requests to other bdevs. This can be used to implement caching, RAID, logical volume management, and more. Bdevs that route I/O to other bdevs are often referred to as virtual bdevs, or vbdevs for short.
Initializing the library
The bdev layer depends on the generic message-passing infrastructure abstracted by the header file include/spdk/thread.h. See Message Passing and Concurrency for a full description. Most importantly, the bdev library may only be called from threads that have been allocated with SPDK by calling spdk_allocate_thread().
From an allocated thread, the bdev library may be initialized by calling spdk_bdev_initialize(), which is an asynchronous operation. No other bdev library functions may be called until the completion callback has been invoked. Similarly, to tear down the bdev library, call spdk_bdev_finish().
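A minimal sketch of this init/teardown sequence, with the callback signatures used in the releases this note covers:

#include "spdk/bdev.h"

static void
bdev_init_done(void *cb_arg, int rc)
{
    if (rc != 0) {
        return;   /* initialization failed */
    }
    /* Other bdev library calls are allowed from this point on. */
}

static void
bdev_fini_done(void *cb_arg)
{
    /* The library is fully torn down. */
}

static void
start_bdev_layer(void)
{
    spdk_bdev_initialize(bdev_init_done, NULL);
}

static void
stop_bdev_layer(void)
{
    spdk_bdev_finish(bdev_fini_done, NULL);
}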
Discovering block devices
All block devices have a simple string name. At any time, a pointer to the device object can be obtained by calling spdk_bdev_get_by_name(), or the entire set of bdevs can be iterated using spdk_bdev_first() and spdk_bdev_next() and their variants.
Some block devices may also be given aliases, which are also string names. Aliases behave like symlinks: they can be used interchangeably with the real name to look up the block device.
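For example, lookup and iteration might look like this (the bdev name "Malloc0" is just a placeholder):

#include <stdio.h>
#include "spdk/bdev.h"

static void
list_bdevs(void)
{
    struct spdk_bdev *bdev;

    /* Direct lookup by name (real name or alias). */
    bdev = spdk_bdev_get_by_name("Malloc0");
    if (bdev == NULL) {
        printf("Malloc0 not found\n");
    }

    /* Iterate over every registered bdev. */
    for (bdev = spdk_bdev_first(); bdev != NULL; bdev = spdk_bdev_next(bdev)) {
        printf("bdev: %s, block size: %u\n",
               spdk_bdev_get_name(bdev), spdk_bdev_get_block_size(bdev));
    }
}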
Preparing to use a block device
In order to send I/O requests to a block device, it must first be opened by calling spdk_bdev_open(). This returns a descriptor. Multiple users may have a bdev open at the same time, and coordination of reads and writes between those users must be handled by some higher-level mechanism outside of the bdev layer. Opening a bdev with write permission may fail if a virtual bdev module has claimed it. Virtual bdev modules implement logic such as RAID or logical volume management and forward their I/O to lower-level bdevs, so they mark those lower-level bdevs as claimed to prevent outside users from issuing writes.
When a block device is opened, an optional callback and context can be provided that will be called if the underlying storage servicing the block device is removed. For example, when an NVMe SSD is hot-unplugged, the remove callback is invoked on each open descriptor of every bdev backed by that physical NVMe SSD. The callback can be thought of as a request to close the open descriptor so that other memory can be freed. A bdev cannot be torn down while open descriptors exist, so providing the callback is strongly recommended.
When users are done with a descriptor, they can release it by calling spdk_bdev_close().
Descriptors may be passed to and used from multiple threads simultaneously. However, for each thread a separate I/O channel must be obtained by calling spdk_bdev_get_io_channel(). This allocates the per-thread resources needed to submit I/O requests to the bdev without taking locks. To release a channel, call spdk_put_io_channel(). A descriptor cannot be closed until all associated channels have been destroyed.
SPDK's I/O path is lock-free. When multiple threads operate on the same SPDK user-space block device (bdev), SPDK provides the concept of an I/O channel (a mapping between a thread and a device). Different threads operating on the same device should each use their own I/O channel, and because each I/O channel uses its own independent resources on the I/O path, resource contention is avoided and no locks are needed. See "Efficient communication between SPDK processes" for details.
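A hedged sketch of the open/channel/close lifecycle described above, using the older spdk_bdev_open() API (newer releases replace it with spdk_bdev_open_ext()):

#include <errno.h>
#include "spdk/bdev.h"

static void
hot_remove_cb(void *remove_ctx)
{
    /* The backing device was removed; close the open descriptor as soon as possible. */
}

static int
open_bdev(const char *name, struct spdk_bdev_desc **desc, struct spdk_io_channel **ch)
{
    struct spdk_bdev *bdev = spdk_bdev_get_by_name(name);
    int rc;

    if (bdev == NULL) {
        return -ENODEV;
    }

    rc = spdk_bdev_open(bdev, true /* write access */, hot_remove_cb, NULL, desc);
    if (rc != 0) {
        return rc;   /* e.g. the bdev is claimed by a virtual bdev module */
    }

    /* One channel per thread; this one is only valid on the calling thread. */
    *ch = spdk_bdev_get_io_channel(*desc);
    return 0;
}

static void
close_bdev(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch)
{
    spdk_put_io_channel(ch);   /* release the channel before closing the descriptor */
    spdk_bdev_close(desc);
}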
Submitting I/O
Once a descriptor and a channel have been obtained, I/O can be submitted by calling the various I/O functions such as spdk_bdev_read(). These calls each take a callback as an argument, which will be invoked later with a handle to an spdk_bdev_io object. In response to that completion, the user must call spdk_bdev_free_io() to release the resources. Within this callback, the user may also use the functions spdk_bdev_io_get_nvme_status() and spdk_bdev_io_get_scsi_status() to obtain error information in the format of their choice.
I/O is submitted by calling functions such as spdk_bdev_read() or spdk_bdev_write(). These functions take as an argument a pointer to a region of memory, or a scatter-gather list describing the memory, that will be transferred to the block device. This memory must be allocated through spdk_dma_malloc() or one of its variants. For a full explanation of why the memory must come from a special allocation pool, see Memory Management for User Space Drivers. Where possible, the data in memory is transferred to the block device directly via direct memory access, which means it is not copied.
All I/O submission functions are asynchronous and non-blocking; they will not block or stall the thread for any reason. However, an I/O submission function may fail in one of two ways. First, it may fail immediately and return an error code, in which case the provided callback is not invoked. Second, it may fail asynchronously, in which case the associated spdk_bdev_io is passed to the callback and reports the error information.
Some I/O request types are optional and may not be supported by a given bdev. To query a bdev for the I/O request types it supports, call spdk_bdev_io_type_supported().
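Putting the pieces together, a single read might look like the sketch below; the buffer size, alignment, and offset are arbitrary example values:

#include <stdio.h>
#include "spdk/bdev.h"
#include "spdk/env.h"

static void
read_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    void *buf = cb_arg;

    printf("read %s\n", success ? "succeeded" : "failed");
    spdk_bdev_free_io(bdev_io);   /* always release the spdk_bdev_io */
    spdk_dma_free(buf);
}

static void
submit_read(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch)
{
    uint64_t len = 4096;
    void *buf = spdk_dma_malloc(len, 0x1000 /* alignment */, NULL);
    int rc;

    if (buf == NULL) {
        return;
    }
    rc = spdk_bdev_read(desc, ch, buf, 0 /* offset */, len, read_done, buf);
    if (rc != 0) {
        /* Immediate failure: the callback will not be invoked. */
        spdk_dma_free(buf);
    }
}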
Resetting a block device
To handle unexpected failure conditions, the bdev library provides a mechanism to perform a device reset by calling spdk_bdev_reset(). This passes a message to every other thread on which the bdev has an I/O channel, pauses it, then forwards the reset request to the underlying bdev module and waits for completion. Upon completion, the I/O channels resume and the reset completes. The specific behavior inside the bdev module is module-specific; for example, NVMe devices delete all queue pairs, perform an NVMe-level reset, then recreate the queue pairs and continue. Most importantly, regardless of the device type, all I/O outstanding against the block device will be completed before the reset completes.
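A minimal sketch of issuing a reset:

#include "spdk/bdev.h"

static void
reset_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    /* All I/O that was outstanding before the reset has completed by now. */
    spdk_bdev_free_io(bdev_io);
}

static void
reset_device(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch)
{
    spdk_bdev_reset(desc, ch, reset_done, NULL);
}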
RPC-to-bdev flow
./scripts/rpc.py construct_malloc_bdev -b Malloc0 64 512:
--------[rpc.py] p = subparsers.add_parser('construct_malloc_bdev', ...)
--------[rpc.py] def construct_malloc_bdev(args) calls rpc.bdev.construct_malloc_bdev
----------------[scripts/rpc/bdev.py] def construct_malloc_bdev(client, num_blocks, block_size, name=None, uuid=None) calls client.call('construct_malloc_bdev', params)
------------------------[lib/bdev/malloc/bdev_malloc_rpc.c] SPDK_RPC_REGISTER("construct_malloc_bdev", spdk_rpc_construct_malloc_bdev, ...)
------------------------spdk_rpc_construct_malloc_bdev()
--------------------------------[lib/bdev/malloc/bdev_malloc.c] create_malloc_disk()
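For orientation, the C end of this chain follows the usual SPDK JSON-RPC pattern, roughly as sketched below. The decoder table and response handling are abridged, and details such as the state-mask flag vary by release, so treat this as illustrative rather than a copy of bdev_malloc_rpc.c:

#include "spdk/rpc.h"
#include "spdk/util.h"

struct rpc_construct_malloc {
    char *name;
    uint64_t num_blocks;
    uint32_t block_size;
};

static const struct spdk_json_object_decoder rpc_construct_malloc_decoders[] = {
    {"name", offsetof(struct rpc_construct_malloc, name), spdk_json_decode_string, true},
    {"num_blocks", offsetof(struct rpc_construct_malloc, num_blocks), spdk_json_decode_uint64},
    {"block_size", offsetof(struct rpc_construct_malloc, block_size), spdk_json_decode_uint32},
};

static void
spdk_rpc_construct_malloc_bdev(struct spdk_jsonrpc_request *request,
                               const struct spdk_json_val *params)
{
    struct rpc_construct_malloc req = {};

    if (spdk_json_decode_object(params, rpc_construct_malloc_decoders,
                                SPDK_COUNTOF(rpc_construct_malloc_decoders), &req)) {
        spdk_jsonrpc_send_error_response(request, SPDK_JSONRPC_ERROR_INVALID_PARAMS,
                                         "Invalid parameters");
        return;
    }

    /* ... call create_malloc_disk(...) in bdev_malloc.c and write the new
     * bdev's name into the JSON-RPC result ... */
}
SPDK_RPC_REGISTER("construct_malloc_bdev", spdk_rpc_construct_malloc_bdev, SPDK_RPC_RUNTIME)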

