Purpose:
Performs data movement and data transformation.
It replaces Intel® QuickData Technology, which is part of Intel® I/O Acceleration Technology (I/OAT: QD, DCA, RSS, low-latency interrupts, …). The kernel patch was merged in 2007.
https://www.intel.com/content/www/us/en/wireless-network/accel-technology.html (I/OAT)
https://software.intel.com/en-us/articles/fast-memcpy-using-spdk-and-ioat-dma-engine (Fast memcpy with SPDK and Intel® I/OAT DMA Engine)
QD (QuickData): asynchronous DMA
Use cases:
• Datacenter:
As a data movement offload engine to reduce datacenter tax for memory copying, zeroing, etc. to free up CPU cycles from mundane infrastructure work.
In short: data movement offload.
• Storage:
Storage appliances use data movement (including CRC generation and Data Integrity Field (DIF) generation) within the node and across nodes using Non-Transparent Bridge (NTB).
In short: storage data is moved within a node, or between nodes through a Non-Transparent Bridge.
• Networking:
Packet processing pipelines use Intel DSA for data copy. An example usage is virtual switch (vSwitch) offload for inter-VM packet switching.
In short: data copies inside the network packet processing pipeline. This is fairly abstract; how is it done concretely?
• Deduplication:
Memory deduplication requires comparing memory pages for equality, which can be done using Intel DSA memory compare operations.
In short: memory compare, used to find and remove duplicates.
• VM Migration and Fast Checkpointing:
VM fast checkpointing and VM migration flows require the VMM to identify a VM’s dirty pages and send them efficiently to the destination machine (with minimal network traffic and latency). Intel DSA delta operations generate diffs of pages, enabling the VMM to send only the delta record to the destination, reducing network bandwidth.
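In short: VM live migration, similar to deduplication.
A detailed look at the Intel® DSA features is still needed (mainly Chapter 8, which lists the descriptor types) to understand which mechanisms the scenarios above rely on.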
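As a way to make the deduplication and delta-record scenarios concrete, here is a minimal software sketch of what a compare and a delta operation compute. It is a toy model, not the device's record format: the 8-byte `CHUNK` granularity, the `(offset, chunk)` entry layout, and the function names `create_delta` and `apply_delta` are assumptions made for illustration.

```python
CHUNK = 8  # assumed comparison granularity in bytes (illustrative)


def create_delta(original, modified):
    """Return a list of (offset, chunk) for every CHUNK-sized block that differs."""
    if len(original) != len(modified) or len(original) % CHUNK:
        raise ValueError("buffers must have equal length, a multiple of CHUNK")
    delta = []
    for off in range(0, len(original), CHUNK):
        new_chunk = modified[off:off + CHUNK]
        if original[off:off + CHUNK] != new_chunk:
            delta.append((off, bytes(new_chunk)))
    return delta


def apply_delta(original, delta):
    """Rebuild the modified page on the destination from the old page plus the delta."""
    page = bytearray(original)
    for off, chunk in delta:
        page[off:off + CHUNK] = chunk
    return bytes(page)


# A 4 KiB page in which only two 8-byte chunks are dirtied.
old_page = bytes(4096)
new_page = bytearray(old_page)
new_page[100] = 0xFF
new_page[4000:4003] = b"abc"

delta = create_delta(old_page, new_page)
restored = apply_delta(old_page, delta)
```

In this model an empty delta is exactly the page-equality answer that deduplication needs, and for migration only the two differing chunks would have to cross the network instead of the whole page.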
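Architecture diagram:
1. Memory-mapped registers control the device's operation.
These registers cover capabilities, configuration, and the portal (the work submission register, which submits descriptors to a queue). They sit in a 4K page and are defined through BAR0 and BAR1 (see section 9.1.1).
2. General-purpose descriptors describe the work to be run.
Descriptors can be processed as a batch (held in memory) or individually (in a queue).
A descriptor usually contains the address of a completion record and a valid bit. There are many descriptor types, for example Drain, Memory Move, Compare, Delta, CRC, etc.
A batch descriptor contains the address and length of an array of work descriptors held in host memory. This makes many small transfers more efficient.
The device reads the work descriptor array from host memory and can be configured to execute it out of order. The batch descriptor and each work descriptor inside it have their own completion record address and completion interrupt. Batches cannot be nested.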
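3. A work queue is a block of storage on the device that holds the descriptors submitted to the device.
Priority and queue size are configurable. A scheduling algorithm handles the priorities and ensures that high-priority work does not starve low-priority work.
Portals are either unlimited (kernel space) or limited.
Queues are either shared or dedicated.
See Chapter 9 for the configuration details.
4. The engine is the actual execution unit. It contains a work descriptor processing unit.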
The work descriptor processing unit uses the Address Translation Cache and IOMMU for completion record, source, and destination address translations; reads source data; performs the specified operation; and writes the destination data back to memory. When the operation is complete, the engine writes the completion record to the pre-translated completion address and generates an interrupt, if requested by the work descriptor.
5. Engines and queues can be grouped, presumably as an N:M mapping.
6. Descriptor Completion
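This describes how a descriptor's completion is reported; completion can be signaled by an interrupt. Two interrupt sources are supported: the MSI-X table and the device-specific Interrupt Message Storage (IMS) table (see the SIOV description).
It can be thought of as a synchronization mechanism.
The completion record also reports completion progress. If an error occurs, software can fix the error, resubmit the remaining work with a new descriptor, or finish the work in software.
7. Descriptor ordering and fencing
Wait for the completion record or the interrupt.
Use a Drain descriptor or Drain command before submitting the next descriptor.
Use the Fence flag inside a batch.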
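The ordering rules in item 7 can be sketched with a small toy model. It assumes that descriptors in a batch may complete in any order unless one carries the Fence flag, in which case that descriptor does not start until every earlier descriptor in the batch has completed. The `Descriptor` class, `run_batch`, and the status value 1 are hypothetical names and values for illustration; the reverse completion order is chosen only to make the missing ordering visible.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    name: str
    fence: bool = False  # Fence flag: wait for earlier descriptors in the batch
    status: int = 0      # stand-in for the completion record status byte


def run_batch(batch):
    """Return descriptor names in the order in which they complete."""
    completed = []
    in_flight = []

    def drain():
        # Complete everything in flight; popping from the end deliberately
        # completes descriptors out of submission order.
        while in_flight:
            desc = in_flight.pop()
            desc.status = 1  # nonzero means the device has written the record
            completed.append(desc.name)

    for desc in batch:
        if desc.fence:
            drain()  # earlier descriptors must finish before this one starts
        in_flight.append(desc)
    drain()
    return completed


batch = [Descriptor("A"), Descriptor("B"),
         Descriptor("C", fence=True), Descriptor("D")]
order = run_batch(batch)
```

In this model `C` always completes after `A` and `B`, while `A` and `B` themselves are unordered. The `drain()` helper plays the same role as waiting for completions or issuing a Drain before the next submission.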
Questions:
What is the relationship between a portal and a descriptor?
A descriptor is submitted to a queue through a special register, the portal.
Readback
Presumably this means the host reading from the device.
Shared Work Queue (SWQ)
Dedicated Work Queue (DWQ)
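If a user-mode client uses a limited portal, it can ask the kernel-mode driver to submit descriptors on its behalf through the unlimited portal. This helps avoid denial of service and provides a forward progress guarantee.
Descriptors are submitted to a work queue through special registers called portals.
Concepts: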
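The limited/unlimited portal behaviour described above can be sketched as a toy model. It assumes that a limited portal only accepts a descriptor while queue occupancy is below some threshold, whereas the unlimited portal accepts until the queue is full. The class `SharedWorkQueue`, the `threshold` parameter, and the function `user_submit` are hypothetical names, not driver or hardware interfaces.

```python
from collections import deque


class SharedWorkQueue:
    def __init__(self, size, threshold):
        self.size = size            # total number of entries in the queue
        self.threshold = threshold  # assumed cap for limited-portal submissions
        self.entries = deque()

    def submit_unlimited(self, desc):
        """Unlimited portal (kernel space): accepted unless the queue is full."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append(desc)
        return True

    def submit_limited(self, desc):
        """Limited portal: refused once occupancy reaches the threshold."""
        if len(self.entries) >= self.threshold:
            return False
        self.entries.append(desc)
        return True


def user_submit(wq, desc):
    """User-mode client: try the limited portal, then fall back to the kernel."""
    if wq.submit_limited(desc):
        return "limited"
    # Stand-in for asking the kernel-mode driver to submit on the client's behalf.
    if wq.submit_unlimited(desc):
        return "kernel"
    return "retry"


wq = SharedWorkQueue(size=4, threshold=2)
results = [user_submit(wq, f"desc{i}") for i in range(5)]
```

With these numbers the first two submissions go through the limited portal, the next two only succeed through the kernel path, and the fifth has to be retried. The reserved headroom is what lets the kernel keep making progress even when user-mode clients flood the queue.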
completion record:
When the operation is complete, the engine writes the completion record to the pre-translated completion address and generates an interrupt, if requested by the work descriptor.
The completion attributes specify the address to write the completion record and optionally the information needed to generate a completion interrupt. The completion record contains the descriptor's completion status or error information.
The first byte of the completion record is the status byte. Status values written by the device are all nonzero. Software should initialize the status field of the completion record to 0 before submitting the descriptor to be able to tell when the device has written to the completion record. (Initializing the completion record also ensures that it is mapped, so the device is less likely to encounter a page fault when accessing it.)
The Request Completion Record flag indicates to the device that it should write the completion record even if the operation completed successfully. If this flag is not set, the device writes the completion record only if there is an error.
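The status-byte protocol above can be sketched in software as follows. This is a simulation only: the 32-byte record size, the status values, and the function names `device_complete` and `wait_for_completion` are assumptions for illustration, and a timer thread stands in for the device.

```python
import threading
import time

STATUS_SUCCESS = 1  # assumed value; the notes only say device-written values are nonzero
STATUS_ERROR = 2    # hypothetical error code


def device_complete(record, ok, request_completion_record):
    """Model of the device side: write the status byte when required."""
    if ok and not request_completion_record:
        return  # success without the flag: the record is left untouched
    record[0] = STATUS_SUCCESS if ok else STATUS_ERROR


def wait_for_completion(record, timeout=1.0):
    """Poll the status byte until it becomes nonzero or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if record[0] != 0:
            return record[0]
        time.sleep(0.001)
    return None


record = bytearray(32)  # assumed completion record size
record[0] = 0           # software clears the status byte before submitting
threading.Timer(0.01, device_complete, args=(record, True, True)).start()
status = wait_for_completion(record)
```

Clearing the status byte first is what makes the poll meaningful: any nonzero value seen afterwards must have come from the device. Without the Request Completion Record flag, a successful operation leaves the byte at 0, so software in that mode has to learn about success some other way, such as an interrupt.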
References:
Intel® Scalable I/O Virtualization Technical Specification
All Intel® 64 and IA-32 Architectures Software Developer Manuals
Kernel documentation: Complete virtual memory map with 4-level page tables
Wikipedia: Input-output memory management unit
kevin Intel® Scalable I/O Virtualization
Intel® I/O Acceleration Technology
Fast memcpy with SPDK and Intel® I/OAT DMA Engine
idxd driver for Intel Data Streaming Accelerator
INTRODUCING THE INTEL® DATA STREAMING ACCELERATOR (INTEL® DSA)