Reference book: https://files.cnblogs.com/files/codestack/OReilly-Linux-Observability-with-BPF-2019.rar
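The overall XDP design consists of the following parts:
- XDP driver hook: a hook in the NIC driver where the XDP program runs. An XDP program can parse a packet layer by layer, filter it against rules, encapsulate or decapsulate it, or modify header fields and forward it;
- eBPF virtual machine: once loaded into the kernel, the XDP bytecode runs on the eBPF virtual machine;
- BPF maps: key-value stores that serve as the communication channel between user-space programs and in-kernel XDP programs, and between in-kernel XDP programs themselves, similar to shared memory in inter-process communication;
- eBPF verifier: XDP programs are written by users, so how do we make sure that a program loaded into the kernel cannot crash it or cause other security problems? The verifier runs safety checks on the XDP bytecode at load time, before the program is allowed to run: for example, whether it contains loops (kernels since 5.3 accept loops the verifier can prove are bounded), whether the program exceeds the length limit, whether any memory access is out of bounds, and whether the program contains unreachable instructions.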
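To make the verifier checks listed above concrete, here is a toy sketch in plain C. It is not the kernel verifier (which also tracks register and memory state); the instruction set, the `MAX_INSNS` limit, and all names are hypothetical and exist only for illustration. It rejects programs that are too long, jump backwards (a loop), jump out of range, run off the end, or contain unreachable instructions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSNS 64  /* hypothetical length limit for this toy */

/* Hypothetical four-opcode instruction set, not real eBPF. */
enum op { OP_ALU, OP_JMP, OP_JCOND, OP_EXIT };

/* off: jump offset relative to the next instruction (used by jumps only). */
struct insn { enum op op; int off; };

enum verr {
    VERR_OK, VERR_BAD_LENGTH, VERR_BAD_JUMP,
    VERR_LOOP, VERR_FALLS_OFF, VERR_UNREACHABLE
};

enum verr toy_verify(const struct insn *prog, size_t n)
{
    unsigned char reach[MAX_INSNS] = {0};

    if (n == 0 || n > MAX_INSNS)
        return VERR_BAD_LENGTH;

    reach[0] = 1;
    /* Only forward jumps are allowed, so when we arrive at instruction i
     * every possible predecessor has already been visited and reach[i]
     * is final. One pass is therefore enough. */
    for (size_t i = 0; i < n; i++) {
        const struct insn *in = &prog[i];

        if (!reach[i])
            return VERR_UNREACHABLE;

        if (in->op == OP_JMP || in->op == OP_JCOND) {
            if (in->off < 0)
                return VERR_LOOP;        /* backward jump */
            size_t target = i + 1 + (size_t)in->off;
            if (target >= n)
                return VERR_BAD_JUMP;    /* jump past the end */
            reach[target] = 1;
        }
        if (in->op == OP_ALU || in->op == OP_JCOND) {
            if (i + 1 >= n)
                return VERR_FALLS_OFF;   /* no EXIT on this path */
            reach[i + 1] = 1;            /* fall through */
        }
    }
    return VERR_OK;
}
```

For example, `{OP_JMP, 1}, {OP_ALU, 0}, {OP_EXIT, 0}` is rejected as `VERR_UNREACHABLE`, because the unconditional jump skips the second instruction and nothing else leads to it.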
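When a packet arrives at the NIC, the XDP program runs before the kernel network stack allocates a buffer and stores the packet in an sk_buff structure. It reads the packet-processing rules that the user-space control plane has written into BPF maps and acts on the packet accordingly: for example, it can drop the packet outright, send it back out of the NIC it arrived on, or redirect it through the special AF_XDP socket straight to an upper-layer application.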
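The decision flow just described can be sketched in user-space C. This is a simulation, not a loadable XDP program: the `verdict` enum stands in for the kernel's XDP return codes, the `rule` array stands in for a BPF map written by the control plane, and every name here is an assumption made for the example. What it does share with real XDP code is the habit of checking the packet bounds before every read, which is what the verifier insists on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's XDP verdicts. */
enum verdict { V_DROP, V_PASS, V_TX };

/* Hypothetical rule table standing in for a BPF map:
 * key = IPv4 source address as raw network-order bytes, value = verdict. */
struct rule { uint32_t saddr; enum verdict action; };

#define ETH_HDR_LEN      14
#define ETHERTYPE_IPV4   0x0800
#define IPV4_MIN_HDR_LEN 20

static enum verdict lookup(const struct rule *rules, size_t n, uint32_t saddr)
{
    for (size_t i = 0; i < n; i++)
        if (rules[i].saddr == saddr)
            return rules[i].action;
    return V_PASS;  /* no matching rule: leave it to the normal stack */
}

/* Decide what to do with one frame. data/data_end mirror the two packet
 * pointers an XDP program receives; an XDP program would usually write
 * the checks as `data + N > data_end`. Dropping truncated frames is a
 * choice made for this sketch. */
enum verdict filter_frame(const uint8_t *data, const uint8_t *data_end,
                          const struct rule *rules, size_t n)
{
    size_t len = (size_t)(data_end - data);

    if (len < ETH_HDR_LEN)
        return V_DROP;                       /* truncated Ethernet header */

    uint16_t ethertype = (uint16_t)(data[12] << 8 | data[13]);
    if (ethertype != ETHERTYPE_IPV4)
        return V_PASS;                       /* only IPv4 is filtered here */

    if (len < ETH_HDR_LEN + IPV4_MIN_HDR_LEN)
        return V_DROP;                       /* truncated IPv4 header */

    uint32_t saddr;
    memcpy(&saddr, data + ETH_HDR_LEN + 12, sizeof saddr);  /* source address */
    return lookup(rules, n, saddr);
}
```

In a real deployment the control plane would update the map from user space while the XDP program keeps running; here that corresponds to editing the `rules` array between calls.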
XDP can (as of November 2019):
- Fast incoming packet filtering. XDP can inspect fields in incoming packets and take simple actions: DROP, TX to send the packet back out of the interface it was received on, REDIRECT to another interface, or PASS to the kernel stack for processing. XDP can also alter packet data, for example swap MAC addresses, change IP addresses, ports, or ICMP type, recalculate checksums, etc. Obvious uses are implementing:
- Firewalls (DROP)
- L2/L3 lookup & forward
- NAT: it is possible to implement static NAT indirectly (two XDP programs, each attached to its own interface, processing the traffic and forwarding it out via the other interface). Connection tracking is possible, but more complicated, because session-related data has to be preserved and exchanged in tables (BPF maps).
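The packet edits mentioned in the list above (swapping MAC addresses, changing an IP address, recalculating the checksum) are plain byte manipulation. Below is a minimal sketch in ordinary C, operating on a byte buffer rather than inside a real XDP program; the function names are made up for this example, and the checksum is recomputed from scratch, whereas a production program would more likely update it incrementally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Swap destination (bytes 0-5) and source (bytes 6-11) MAC addresses. */
void swap_macs(uint8_t *eth)
{
    uint8_t tmp[6];
    memcpy(tmp, eth, 6);
    memcpy(eth, eth + 6, 6);
    memcpy(eth + 6, tmp, 6);
}

/* Internet checksum (one's-complement sum of 16-bit words) over an IPv4
 * header of `len` bytes. The header length is always even. */
uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    while (sum >> 16)                     /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Rewrite the IPv4 source address (static NAT style) and refresh the
 * header checksum. `ip` points at the start of the IPv4 header. */
void rewrite_saddr(uint8_t *ip, const uint8_t new_saddr[4])
{
    size_t hlen = (size_t)(ip[0] & 0x0f) * 4;  /* IHL field, in bytes */
    uint16_t csum;

    memcpy(ip + 12, new_saddr, 4);
    ip[10] = 0;                           /* checksum field must be zero */
    ip[11] = 0;                           /* while the sum is computed   */
    csum = ipv4_checksum(ip, hlen);
    ip[10] = (uint8_t)(csum >> 8);
    ip[11] = (uint8_t)(csum & 0xff);
}
```

A header with a correct checksum sums to zero when `ipv4_checksum` is run over it with the checksum field included, which gives a quick sanity check after any rewrite.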
AF_XDP
- Unlike AF_PACKET, AF_XDP moves frames directly to user space without passing them through the whole kernel network stack, so they arrive with minimal delay. AF_XDP does not bypass the kernel; it creates an in-kernel fast path.
- It also offers zero-copy between kernel space and user space, and, on hardware that supports it, the XDP bytecode can be offloaded to the NIC. AF_XDP can run in interrupt mode as well as polling mode, whereas DPDK poll-mode drivers always poll, which means they use 100% of the CPU cores assigned to them.
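AF_XDP socket setup, in pseudocode (the option names follow the mainline kernel headers; the arguments are abbreviated):

    sfd = socket(PF_XDP, SOCK_RAW, 0);
    buffs = calloc(num_buffs, FRAME_SIZE);
    setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, buffs);   /* register the UMEM */
    setsockopt(sfd, SOL_XDP, XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING, ring_size);
    mmap(..., sfd, ......);                          /* map kernel rings */
    bind(sfd, "/dev/eth0", queue_id,....);           /* one device queue per socket */
    for (;;) {
        read_process_send_messages(sfd);
    }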
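So an AF_XDP socket is created with the familiar socket() system call from ordinary network programming; only the arguments are specific. Once set up, each socket has its own RX ring and TX ring. Both rings hold descriptors, each of which points to the address in the UMEM where the frame is actually stored.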
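The UMEM also has two queues: the FILL ring and the COMPLETION ring. They closely resemble the receive and transmit rings that are filled in for DMA in traditional network I/O. On the RX path, the user application places addresses for incoming packets in the FILL ring; the kernel's XDP path stores each received packet at one of those addresses and posts the corresponding descriptor on the socket's RX ring.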
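The COMPLETION ring, however, does not hold the addresses of frames the application is about to send; it holds the addresses of frames whose transmission has completed. Those frames can then be reused for sending or receiving. The addresses of frames that are about to be sent go on the socket's TX ring, again filled in by the user application. The relationship between RX/TX rings and FILL/COMPLETION rings is many-to-one (n:1): multiple sockets, each with its own RX/TX rings, can share one UMEM and its FILL/COMPLETION rings.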
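The ownership hand-off between the four rings is easier to follow in code. The following is a user-space simulation only, with no kernel involved: the struct layouts, sizes, and the `kernel_receive` / `kernel_transmit` helpers are assumptions invented for this sketch, and the real rings are lock-free shared-memory queues with memory barriers that are left out here. For simplicity all four rings carry the same descriptor type, although the real FILL and COMPLETION rings carry bare addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE  8      /* power of two, so `index & (RING_SIZE - 1)` wraps */
#define FRAME_SIZE 2048
#define NUM_FRAMES 4

struct desc { uint64_t addr; uint32_t len; };  /* UMEM offset + frame length */

struct ring {
    struct desc slots[RING_SIZE];
    uint32_t prod, cons;                       /* free-running counters */
};

int ring_push(struct ring *r, struct desc d)
{
    if (r->prod - r->cons == RING_SIZE)
        return -1;                             /* ring full */
    r->slots[r->prod++ & (RING_SIZE - 1)] = d;
    return 0;
}

int ring_pop(struct ring *r, struct desc *d)
{
    if (r->prod == r->cons)
        return -1;                             /* ring empty */
    *d = r->slots[r->cons++ & (RING_SIZE - 1)];
    return 0;
}

/* One UMEM owns the frame memory plus the FILL and COMPLETION rings. */
struct umem {
    uint8_t frames[NUM_FRAMES * FRAME_SIZE];
    struct ring fill, completion;
};

/* Each socket has its own RX/TX rings; several sockets may point at the
 * same UMEM (the n:1 relationship). */
struct xsk { struct umem *umem; struct ring rx, tx; };

/* Simulated kernel RX: take a free frame address from FILL, copy the
 * packet into the UMEM there, and post the descriptor on RX. */
int kernel_receive(struct xsk *s, const void *pkt, uint32_t len)
{
    struct desc d;

    if (len > FRAME_SIZE || ring_pop(&s->umem->fill, &d) < 0)
        return -1;                             /* no buffer: packet lost */
    memcpy(s->umem->frames + d.addr, pkt, len);
    d.len = len;
    return ring_push(&s->rx, d);
}

/* Simulated kernel TX: take a descriptor from TX, "send" it, and return
 * the frame to user space through COMPLETION. */
int kernel_transmit(struct xsk *s)
{
    struct desc d;

    if (ring_pop(&s->tx, &d) < 0)
        return -1;                             /* nothing queued */
    /* a real driver would hand the frame to the NIC here */
    return ring_push(&s->umem->completion, d);
}
```

The application side is then: push free frame addresses onto `fill`, pop received descriptors from `rx`, push frames to send onto `tx`, and pop `completion` to learn which frames are free for reuse.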
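XDP, eBPF, maps, AF_PACKET, whatever else: in the end it all comes down to user space and the kernel sharing the same piece of memory, with both sides helping themselves to it!
What is communication? It is data exchange, which is to say, shuffling memory around!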
References:
https://www.kernel.org/doc/html/v4.18/networking/af_xdp.html
https://www.dpdk.org/wp-content/uploads/sites/35/2018/10/pm-06-DPDK-PMD-for-AF_XDP.pdf
https://pantheon.tech/what-is-af_xdp/
https://www.kernel.org/doc/html/latest/networking/af_xdp.html