VNF網絡性能提升解決方案及實踐
2016年7月
作者: 王智民
貢獻者:
創建時間: 2016-7-20
穩定程度: 初稿
修改歷史
版本 | 日期 | 修訂人 | 說明
1.0 | 2016-7-20 | 王智民 | 初稿
-
引言
-
編寫目的
-
本文檔系統性地整理和研究VNF性能提升的技術方案。
-
背景
SDN、NFV成為雲計算時代下網絡重構的關鍵思路和技術路線,已經成為不可逆的趨勢。經過幾年的摸索和嘗試后,大家逐步感覺到VNF的性能成為制約SDN普及和推廣的瓶頸,於是出現了各種提升VNF性能的方案。
這些紛繁復雜的方案,有獨立的項目,也有Intel、Cisco等巨頭的助力,所以整理和深入研究這些方案就顯得非常必要。
-
SDN
-
SDN核心思想
-
詳細的有關SDN的概念及技術細節可以參考我寫的《SDN市場與技術預研報告》。這里只是簡單闡述SDN的核心思想:
1.通過類似openflow協議實現網絡流量"傻瓜式"轉發與處理
SDN在斯坦福大學被提出的同時,也提出了OpenFlow協議來解決兩個問題:
- Controller與worker之間的流量轉發控制與通信
- Worker實現傻瓜式的所謂的"流表轉發"
雖然說SDN並沒有要求一定支持OpenFlow協議(可以通過其他協議來實現SDN的目的),但是鑒於網絡的開放性特點,行業需要有標准的協議, OpenFlow協議是SDN概念提出之時隨之而來的協議。下圖來自OpenFlow官方白皮書,描述了OpenFlow協議在SDN框架下的位置和作用:
2.實現controller與worker的松耦合
控制與轉發的分離其實並不是SDN的獨創,也不是現在才提出的概念。早在機架式設備、NP處理器、多核處理器的時代就已經提出了控制與數據處理的分離,只不過之前更多是針對一台設備內部而言。比如機架式設備的主控板和接口板:主控板負責路由計算、安全處理等復雜業務,接口板收上來的報文會先到主控板,經過主控板"決策"后再前往某個接口板,通過查找路由表或轉發表從某個接口發送出去。
SDN則要求控制與數據轉發實現物理隔離,通過網絡連接,實現controller與worker的松耦合。
3.實現虛擬或物理網絡的靈活互聯
由於虛擬機的出現,使得目前很多的交換機、路由器都實現了虛擬化,即交換機、路由器運行在虛擬機內部,多個虛擬機實際上構成了所謂的"虛擬網絡",但本質上仍然是一台交換機或路由器,所以SDN必須確保這些虛擬的網絡之間,虛擬網絡與物理網絡之間要實現無縫的連接和通信。
4.支持在Controller上靈活編程
這也是SDN發起的初衷之一。
5.實現網絡虛擬化
所謂的虛擬化,核心上就是"切片slice和分布式調度",或者叫"資源池化和資源調度"。涉及到"一虛多和多虛一"。
網絡虛擬化又分為"網絡資源虛擬化"和"網絡功能虛擬化"。后者即是本文要詳細探討的NFV(Network Function Virtualization)。
所謂的網絡虛擬化,即:Similar to computer virtualization, network virtualization promises to improve resource allocation, permits operators to check-point their network before changes, and allows competing customers to share the same equipment in a controlled and isolated fashion. Critically, virtual networks also promise to provide a safe and realistic environment to deploy and evaluate experimental "clean slate" protocols in production networks.
Thus, by analogy, the network itself should have a hardware abstraction layer. This layer should be easy to slice so that multiple wildly different networks can run simultaneously on top without interfering with each other, on a variety of different hardware, including switches, routers, access points, and so on.
Above the hardware abstraction layer, we want new protocols and addressing formats to run independently in their own isolated slice of the same physical network, enabling networks optimized for the applications running on them, or customized for the operator who owns them. Below the virtualization layer, new hardware can be developed for different environments with different speed, media (wireline and wireless), power or fanout requirements.
While individual technologies can slice particular hardware resources (e.g., MPLS can virtualize forwarding tables) and layers (e.g., WDM slices the physical layer, VLANs slice the link layer), there is currently no one single technology or clean abstraction that will virtualize the network as a whole.
網絡虛擬化對應於計算虛擬化。隨着計算虛擬化取得實質性突破,雲計算應用的矛盾更多集中在網絡層面,急需網絡能夠像計算資源那樣"隨心所欲"地調度和管理。
網絡虛擬化,本質上要求"網絡轉發路徑"實現資源池化,而網絡轉發路徑更具體一點講,包括入接口、出接口,至少要實現網絡接口的資源池化。
需要注意的是,網絡虛擬化與虛擬網絡是兩個不同的概念。
-
網絡虛擬化
網絡虛擬化又分為"網絡資源虛擬化"和"網絡功能虛擬化"。如果類比計算虛擬化,則網絡資源虛擬化有點類似於KVM、XEN等hypervisor所做的事情,網絡功能虛擬化則類似於VDI。
網絡資源虛擬化,本質上要求"網絡轉發路徑"實現資源池化,而網絡轉發路徑的屬性,至少包括路由表、擴展上講還包括網絡帶寬、延遲、優先級等等。所以需要實現這些屬性的資源池化和管理。
網絡功能虛擬化NFV簡單理解就是將之前的路由器、交換機、防火牆、負載均衡、應用交付、信令網關等設備做"切片",實現資源池化和管理。
下圖是用計算虛擬化來類比網絡虛擬化的框圖:openflow相當於x86的指令,這里的FlowVisor類似KVM,實現網絡資源的虛擬化,NOX則對應Windows等桌面操作系統(Guest OS)。
所以,如果按照這個思路來進行網絡虛擬化,則有三大艱巨任務:openflow、Flowvisor、NOX。相對來說NOX可能比較簡單一些,因為大多數網絡功能設備都有自己成熟的操作系統了。
但是隨着大家的實踐和技術的發展,網絡虛擬化在實現上離不開SDN的技術路線,當前SDN出現了幾大流派:
-
以斯坦福等學院派為代表的openflow技術
借用BigSwitch的框圖:
其核心的思想是controller與dataplane之間采用openflow協議,Dataplane的轉發采用openflow流表。
-
以vmware為代表的主機overlay技術
下圖是vmware的NSX框圖:
Controller與Dataplane的控制與管理沒有統一的協議,各廠商有自己的私有協議,或者采用已有的BGP、NETCONF、XMPP、SNMP、CAPWAP等;在Dataplane轉發有采用openflow流表,但大多還是采用FIB路由轉發。
主機Overlay隧道技術大都采用VXLAN,當然也有NVGRE、MPLS over GRE、MPLS over UDP等。STT則是Nicira(后被VMware收購)提出的overlay標准。
-
以Cisco/華為為代表的網絡overlay技術
下圖是H3C的一個網絡Overlay實現框架,為了實現物理網絡對虛機的感知,還需VEPA技術。
-
NFV
-
OPNFV
-
網絡功能虛擬化NFV是網絡虛擬化的一部分。
我們可以通過OPNFV的框架圖可以看出兩者的關系和位置:
當前在網絡資源虛擬化和網絡功能虛擬化方面的實踐如下:
-
VNF性能提升方案
由於虛擬化層的加入,VNF的網絡IO性能下降非常大:在沒有優化的情況下,20G硬件平台下跑8個虛擬防火牆vFW,每個vFW的網絡IO吞吐大包大概在100Mbps左右,小字節也只有20Mbps左右。
A和B的性能難以滿足實際需求,這個通道的瓶頸在0.3Mpps左右,C和D是不錯的選擇,其實D性能更好。E,F直接使用物理網卡,中間沒有使用虛擬交換機,像vxlan封裝這樣的事情需要vnf或物理交換機來做。
要提升NFV的性能主要有三個方面:OVS、VNF網絡協議棧、VNF與OVS的通道。
提升OVS本身性能的技術方案當前主要有:Intel協助的DPDK、OpenFlow流表轉發。
在VNF網絡協議棧優化改造方面的技術方案主要有:DPDK、ODP、VPP等。
在VNF與OVS通道優化改造方面的技術方案主要有:virtio(vhost)、IVSHMEM、SR-IOV(PMD)、SR-IOV(Driver)。
-
虛擬通道性能提升
-
virtio-net
-
一個數據包從虛機到物理網卡的完整路徑圖如下:
從這個路徑圖可以看出,優化VNF與OVS通道的點有三個:虛擬網卡、虛擬化層和內核網橋。
虛擬網卡分為兩種:全虛擬化和半虛擬化。所謂全虛擬化,即VM感知不到自己使用的網卡實際上是由hypervisor模擬出來的,比如e1000網卡,這個虛擬網卡完全由hypervisor(比如KVM-QEMU)軟件模擬;半虛擬化則由hypervisor做一部分工作,VM的guest os也需要做一部分修改。
Virtio是IO(注意不只是網卡)的一個半虛擬化解決方案。這里只是關注virtio-net,即virtio網卡。
guest發出中斷信號退出kvm,從kvm退出到用戶空間的qemu進程。然后由qemu開始對tap設備進行讀寫。 可以看到這里從用戶態進入內核,再從內核切換到用戶態,進行了2次切換。
-
vhost-net
vhost技術對virtio-net進行了優化,在內核中加入了vhost-net.ko模塊,使得網絡數據可以在內核態得到處理。
guest發出中斷信號退出kvm,kvm直接和vhost-net.ko通信,然后由vhost-net.ko訪問tap設備。 這樣網絡數據只需要經過從用戶態到內核態的一次切換,就可以完成數據的傳輸。大大提高了虛擬網卡的性能。
下面是host、virtio、vhost-net三者在IO延遲方面的對比數據:
-
vhost-user
vhost-user是qemu新版本提供的一個特性。其實質是提供了一種用戶態進程網絡通信的機制,利用共享內存、eventfd、irqfd等技術。
Vhost-user相對於vhost-net,改進點在於數據報文無需進入host kernel,直接在用戶態進行傳輸,可以減少一次數據拷貝等開銷。
正是vhost-user提供了用戶態進程之間高效的網絡通信機制,往往被用來與其他優化技術,比如DPDK、ODP、snabb switch等一起使用。
-
ivshmem
ivshmem是Qemu提供的guest<->guest和guest<->host之間零拷貝的通信機制。其實現原理是通過共享內存模擬一個pci設備。這樣通信雙方就可以像操作pci設備一樣來進行數據交換和通信了。與vhost-user的區別?
Qemu 1.5.x版本以上即支持ivshmem機制。
Guest與guest之間通過ivshmem通信有中斷和非中斷兩種模式,guest與host之間只有非中斷模式。
【ivshmem pci BARs】
BAR是PCI配置空間中從0x10到0x24的6個register,用來定義PCI設備所需地址空間的大小以及配置PCI設備占用的地址空間。X86中地址空間分為MEM和IO兩類,因此PCI的BAR用bit0來表示該設備是映射到memory還是IO:bar的bit0是只讀的,bit1是保留位,bit2為0表示32位地址空間、為1表示64位地址空間,其余的bit用來表示設備需要占用的地址空間大小與設備起始地址。
ivshmem設備支持3個PCI基地址寄存器BAR0、BAR1和BAR2。
BAR0是1k的MMIO區域,支持寄存器。根據計算可以得到,設備當前支持3個32bits寄存器,還有一個寄存器為每個guest都有,最多支持253個guest(一共256*4=1kbyte),實際默認為16。
BAR1用於MSI-X。
BAR2用來從host中映射共享內存體。BAR2的大小通過命令行指定,必須是2的次方。
ivshmem設備共有4種類型的寄存器,寄存器用於guest之間共享內存的同步,mask和status在pin中斷下使用,msi下不使用:
enum ivshmem_registers {
INTRMASK = 0,
INTRSTATUS = 4,
IVPOSITION = 8,
DOORBELL = 12,
};
-
Mask寄存器
與中斷狀態按位與,如果非0則觸發一個中斷。因此可以通過設置mask的第一bit為0來屏蔽中斷。
-
Status寄存器
當中斷發生時(pin中斷下doorbell被設置),目前qemu所實現的設備會將status寄存器設置為1。由於status只會被設為1,所以mask也只有第一個bit會起作用,筆者理解可通過修改驅動代碼實現status含義的多元化。
-
IVPosition寄存器
IVPosition是只讀的,報告guest id號碼。Guest id是非負整數,只會在設備就緒時被設置。如果設備沒有准備好,IVPosition返回-1。應用程序必須確保自己拿到有效的id后才開始使用共享內存。
-
Doorbell寄存器
通過寫自己的doorbell寄存器可以向其它guest發送中斷。
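下面給出一個極簡的示意程序(僅為草圖):guest用戶態通過mmap ivshmem設備的BAR0,讀取IVPosition並寫Doorbell向對端發中斷。其中PCI地址0000:00:04.0僅為舉例,doorbell的寫入格式(高16位為目標guest id、低16位為中斷向量號)以所用qemu版本的ivshmem說明為準:
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

enum ivshmem_registers { INTRMASK = 0, INTRSTATUS = 4, IVPOSITION = 8, DOORBELL = 12 };

int main(void)
{
    /* PCI地址僅為舉例,實際以lspci中ivshmem設備的位置為準 */
    int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource0", O_RDWR);
    if (fd < 0) { perror("open BAR0"); return 1; }

    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t my_id = regs[IVPOSITION / 4];          /* 讀出本guest的id */
    uint32_t peer_id = 1, vector = 0;               /* 舉例:向id為1的guest的0號向量發中斷 */
    regs[DOORBELL / 4] = (peer_id << 16) | vector;  /* 寫doorbell觸發對端中斷 */

    printf("my id = %u, doorbell sent to %u\n", my_id, peer_id);
    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}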
【共享內存服務】
共享內存server是在host上運行的一個應用程序,每啟動一個vm,server會指派給vm一個id號,並且將vm的id號和分配的eventfd文件描述符一起發給qemu進程。Id號在收發數據時用來標識vm,guests之間通過eventfd來通知中斷。每個guest都在與自己id所綁定的eventfd上偵聽,並且使用其它eventfd來向其它guest發送中斷。
共享內存服務者代碼在nahanni的ivshmem_server.c。
【ivshmem中斷模式】
非中斷模式直接把虛擬pci設備當做一個共享內存進行操作,中斷模式則會操作虛擬pci的寄存器進行通信,數據的傳輸都會觸發一次虛擬pci中斷並觸發中斷回調,使接收方顯式感知到數據的到來,而不是一直阻塞在read。
ivshmem中斷模式分為Pin-based 中斷和msi中斷。兩種中斷模式的區別請參見附錄"MSI與MSI-X"。
【原理與實現】
1.共享內存體建立
host上Linux內核可以通過將tmpfs掛載到/dev/shm,從而通過/dev/shm來提供共享內存作為bar2映射的共享內存體。
mount tmpfs /dev/shm -t tmpfs -o size=32m
也可通過shm_open+ftruncate創建一個共享內存文件/tmp/nahanni。
(nahanni是一個為KVM/Qemu提供共享內存設備的項目)。
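下面是host側建立共享內存體的一個極簡示意(基於標准POSIX接口;/nahanni這個名字沿用上文示例,大小與qemu的size參數保持一致即可):
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 32 * 1024 * 1024;                   /* 與qemu ivshmem的size=32m一致 */
    int fd = shm_open("/nahanni", O_CREAT | O_RDWR, 0666);  /* 實際落在/dev/shm/nahanni */
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* 此后host側進程可直接讀寫base指向的內存,guest經由ivshmem的BAR2看到同一塊內存 */
    munmap(base, size);
    close(fd);
    return 0;
}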
2.ivshmem server建立
ivshmem_server啟動方式如下:
./ivshmem_server -m 64 -p /tmp/nahanni &
其中-m所帶參數為總共享內存大小(單位為MB),-p指定共享內存體,-n指定msi模式下的中斷向量個數。
3.qemu建立eventfd字符設備、ivshmem設備、共享內存映射
在中斷模式下會使用eventfd字符設備來模擬guest之間的中斷。
在中斷模式下,qemu啟動前會啟動nahanni的ivshmem_server進程,該進程守候等待qemu的連接,待socket連接建立后,通過socket指派給每個vm一個id號(posn變量),並且將id號同eventfd文件描述符一起發給qemu進程(一個efd表示一個中斷向量,msi下有多個efd)。
在qemu一端通過-chardev socket建立socket連接,並通過-device ivshmem建立共享內存設備。
./qemu-system-x86_64 -hda Img -L /pc-bios/ --smp 4
-chardev socket,path=/tmp/nahanni,id=nahanni
-device ivshmem,chardev=nahanni,size=32m,msi=off
-serial telnet:0.0.0.0:4000,server,nowait,nodelay -enable-kvm &
Server端通過select偵聽qemu發起的socket連接。Qemu端啟動時需要設置-chardev socket,path=/tmp/nahanni,id=nahanni,qemu通過查找chardev注冊類型register_types會調用qemu_chr_open_socket->unix_connect_opts,實現與server之間建立socket連接;server的add_new_guest會指派給每個vm一個id號,並且將id號同一系列eventfd文件描述符一起發給qemu進程。
在非中斷模式中,無需通過-chardev socket建立連接,但同樣需要通過-device ivshmem建立共享內存:
./qemu-system-x86_64-dyn -hda Img -L /pc-bios/ --smp 4
-device ivshmem,shm=nahanni,size=32m
-serial telnet:0.0.0.0:4001,server,nowait,nodelay&
ivshmem設備類型的注冊:ivshmem_class_init->pci_ivshmem_init
4.Guest OS中ivshmem驅動
首先注冊kvm_ivshmem這個字符設備,得到主設備號,並實現此pci字符設備的文件操作:
register_chrdev(0, "kvm_ivshmem", &kvm_ivshmem_ops);
static const struct file_operations kvm_ivshmem_ops = {
    .owner   = THIS_MODULE,
    .open    = kvm_ivshmem_open,
    .mmap    = kvm_ivshmem_mmap,
    .read    = kvm_ivshmem_read,
    .ioctl   = kvm_ivshmem_ioctl,
    .write   = kvm_ivshmem_write,
    .llseek  = kvm_ivshmem_lseek,
    .release = kvm_ivshmem_release,
};
注冊kvm_ivshmem這個字符設備之后,通過pci_register_driver(&kvm_ivshmem_pci_driver)實現字符型pci設備注冊,隨后由pci_driver數據結構中的probe函數指針所指向的偵測函數來初始化該PCI設備:
現在linux版本有ivshmem設備的驅動uio_ivshmem.ko。
5.Guest OS中如何使用ivshmem機制與host os以及其他guest os通信
分為中斷模式和非中斷模式。
非中斷模式,將/dev/ivshmem當做文件進行操作即可。
中斷模式下,Guest應用一端調用ioctl發起寫數據請求,另一端ioctl返回准備讀數據。
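以非中斷模式為例,下面是guest內把ivshmem設備當作文件mmap后直接讀寫的極簡示意(設備節點名沿用上文的/dev/ivshmem,實際名字取決於所用的guest驅動):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 32 * 1024 * 1024;        /* 與BAR2共享內存大小一致 */
    int fd = open("/dev/ivshmem", O_RDWR);       /* 設備節點名取決於guest驅動 */
    if (fd < 0) { perror("open"); return 1; }

    char *shm = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(shm, "hello from guest");             /* 寫入內容對host及其他guest立即可見 */
    printf("shared memory says: %.64s\n", shm);

    munmap(shm, size);
    close(fd);
    return 0;
}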
【ivshmem與virtio+vhost的性能對比】
在此博客http://blog.csdn.net/u014358116/article/details/22753423中將這兩種方案做了一個性能對比:
1)ivshmem在傳輸1M以上數據時,性能基本穩定,數據量與時間基本成線性關系
2)1M以上數據傳輸性能,ivshmem性能是virtio性能10~20倍
3)小數據量傳輸ivshmem性能比virtio有明顯優勢
-
macvtap
一個數據包從虛機到物理網卡的完整路徑圖如下:
從這個路徑圖可以看出,優化VNF與OVS通道的點有三個:虛擬網卡、虛擬化層和內核網橋。
由於linux內核網橋轉發性能通常情況下非常糟糕,如果能夠不經過內核網橋,則性能會提升很多。
我們先看看內核網橋的網絡協議棧結構:
這里涉及到用戶態與內核態交互的"網絡接口",當前有下面幾類:
-
TUN 設備
TUN 設備是一種虛擬網絡設備,通過此設備,程序可以方便地模擬網絡行為。
先來看看物理設備是如何工作的:
所有物理網卡收到的包會交給內核的 Network Stack 處理,然后通過 Socket API 通知給用戶程序。下面看看 TUN 的工作方式:
普通的網卡通過網線收發數據包,但是 TUN 設備通過一個文件收發數據包。所有對這個文件的寫操作會通過 TUN 設備轉換成一個數據包送給內核;當內核發送一個包給 TUN 設備時,通過讀這個文件可以拿到包的內容。
-
TAP 設備
TAP 設備與 TUN 設備工作方式完全相同,區別在於:
TUN 設備的 /dev/tunX 文件收發的是 IP 層數據包,只能工作在 IP 層,無法與物理網卡做 bridge,但是可以通過三層交換(如 ip_forward)與物理網卡連通。
TAP 設備的 /dev/tapX 文件收發的是 MAC 層數據包,擁有 MAC 層功能,可以與物理網卡做 bridge,支持 MAC 層廣播。
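下面給出一個創建TAP設備的極簡C示意(基於標准的/dev/net/tun接口,把IFF_TAP換成IFF_TUN即得到TUN設備;接口名tap0僅為示例):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int tap_alloc(char *dev)                        /* dev可傳入"tap0"之類的名字,也可傳空串由內核分配 */
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) return -1;
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;        /* TAP:收發完整的二層幀;IFF_NO_PI:不附加包信息頭 */
    strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) { close(fd); return -1; }
    strcpy(dev, ifr.ifr_name);                  /* 返回內核實際分配的接口名 */
    return fd;                                   /* 對該fd的read/write即是收發以太網幀 */
}

int main(void)
{
    char name[IFNAMSIZ] = "tap0";
    int fd = tap_alloc(name);
    if (fd < 0) { perror("tap_alloc"); return 1; }
    printf("created %s\n", name);
    close(fd);
    return 0;
}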
-
MACVLAN
有時我們可能需要一塊物理網卡綁定多個 IP 以及多個 MAC 地址,雖然綁定多個 IP 很容易,但是這些 IP 會共享物理網卡的 MAC 地址,可能無法滿足我們的設計需求,所以有了 MACVLAN 設備,其工作方式如下:
MACVLAN 會根據收到包的目的 MAC 地址判斷這個包需要交給哪個虛擬網卡。單獨使用 MACVLAN 好像毫無意義,但是配合之前介紹的 network namespace 使用,我們可以構建這樣的網絡:
由於 macvlan 與 eth0 處於不同的 namespace,擁有不同的 network stack,這樣無需建立 bridge 就可以在 virtual namespace 里面使用網絡。
-
MACVTAP
MACVTAP 是對 MACVLAN 的改進,把 MACVLAN 與 TAP 設備的特點綜合一下:使用 MACVLAN 的方式收發數據包,但是收到的包不交給 network stack 處理,而是生成一個 /dev/tapX 文件,把包交給這個文件:
由於 MACVLAN 是工作在 MAC 層的,所以 MACVTAP 也只能工作在 MAC 層,不會有 MACVTUN 這樣的設備。
可以看到MACVTAP直接繞過內核協議棧,可以有效提高數據從物理網卡到虛機之間的性能。
-
SR-IOV
SR-IOV涉及內存虛擬化技術。
內存虛擬化:用硬件實現從GVA(Guest Virtual Address)→GPA(Guest Physical Address)→HPA(Host Physical Address)的兩次地址轉換。傳統非虛擬化的操作系統只通過硬件MMU完成一次從虛擬地址到物理地址的轉換。
I/O虛擬化,從模型上分為軟件I/O虛擬化和硬件輔助I/O虛擬化。
軟件I/O虛擬化有三種模型:Split I/O、Direct I/O和Passthrough I/O。
硬件輔助I/O虛擬化,從實現技術上有Intel的VT-d(virtualization technology for Directed I/O)和AMD的IOMMU(I/O Memory Management Unit)。從實現規范和框架上有SR-IOV,PCI-SIG國際組織專門針對PCIe設備而制定的規范。
Split I/O虛擬化模型的核心宗旨是所有來自guest的IO都由一個IO驅動程序(在Dom0區域,具有IO權限)來代理與物理IO交互。需要Guest OS做修改,也是XEN虛擬化引擎的實現方案,稱之為半虛擬化解決方案,如下圖所示:
Direct I/O虛擬化模型的核心思想是無需Guest OS做修改,直接使用設備的驅動程序,來自Guest的IO都由VMM來代理與物理設備交互。稱之為完全虛擬化解決方案,如下圖所示:
Passthrough I/O虛擬化模型允許guest直接操作物理IO設備,無需或少量經過VMM的干預。優點是性能高,缺點是物理IO設備只能給指定的guest使用,無法實現共享。如下圖所示:
【SR-IOV規范】
從軟件角度,CPU與IO設備通信有三種方式:中斷、寄存器、共享內存。IO設備通過中斷通知CPU,CPU通過寄存器對IO設備做控制,共享內存則通過DMA使得CPU與IO設備之間可以進行大規模數據通信。
SR-IOV是PCI-SIG組織推出的為了減少VMM對虛擬IO的干擾以提升IO虛擬化性能的規范,繼承Passthrough IO技術,通過IOMMU減少地址轉換和地址空間保護的開銷。
具有SR-IOV功能的PCIe設備遵循PCIe規范,其結構如下圖所示:
一個PF(Physical Function)管理多個VF(Virtual Function)。一個PF是具有完整PCIe功能的物理設備,具有唯一的VID,包括IO空間、存儲空間和配置空間。每個VF有自己獨立的配置空間,但是共享PF的IO空間和存儲空間,是一個輕量級的PCIe設備。每個VF有自己唯一的RID,唯一標識交換源,同時還用於搜索IOMMU頁表,使得每個虛機可以使用單獨的IOMMU頁表。
PCIe switch下面掛多個具有SR-IOV的PCIe設備,如果要實現PCIe Switch的虛擬化,則需要MR-IOV,當前還沒有廠商實現MR-IOV。
【SR-IOV實現模型】
PF驅動管理所有的PF資源,並負責配置和管理它所擁有的VF。
PF 驅動是一個專門管理SR-IOV設備全局功能的驅動,而且還要配置相關共享資源。PF 驅動隨着VMM的不同而不同,一般需要具有比普通虛擬機更高的權限才能對其進行操作。PF驅動包含了所有傳統驅動的功能,使得VMM能夠訪問設備I/O資源。也可以通過調用PF驅動執行相關操作從而影響整個設備。PF驅動必須在VF驅動之前加載,而且需要等VF驅動卸載之后才能卸載。
VF驅動就如普通的PCIe設備驅動運行在Guest OS中。
標准設備驅動(驅動不會意識到自己所處的虛擬化環境)期望獲知如何控制設備和設備是如何工作的。在虛擬化環境下,一個標准的驅動一般與一個軟件間接層進行交互,這個軟件間接層模擬了底層的物理硬件設備。大多數情況下,該驅動不會意識到這個間接層的存在。
VF接口並沒有包含完整的PCIe控制功能,而且一般不能直接控制共享設備和資源,比如設置以太網連接速率。這時候有必要告知VF驅動,使其意識到自己所處的虛擬環境。當VF驅動意識到自己所處的虛擬化環境之后,就能夠直接與硬件進行數據交互。同時這些VF驅動也能夠知道這些設備都是共享設備,依賴於PF驅動提供的服務來進行操作,這些操作都具有全局效果,比如初始化和下一級結構的控制。
PF驅動與VF驅動之間需要有通信機制,比如配置、管理、事件通知等。具體的通信機制因底層的硬件平台而有所差異,比如通常采用郵箱和門鈴機制:發送方將消息寫入郵箱,然后按門鈴,接收方聽到門鈴(其實就是收到中斷),從郵箱中讀出信息,並設置共享寄存器表示已收到。如下圖所示:
IOVM為每個VF分配完整的配置空間,並分配給指定的guest,使得運行在guest的VF驅動可以像訪問普通的PCIe設備一樣去訪問和配置VF。
IOMMU將PCI物理地址映射到客戶機中。傳統的IOMMU機制是采用集中方式統一管理所有的DMA。除了這種傳統的IOMMU機制外,還有AGP、GART、TPT、TCP/IP等特殊的DMA,通過內存地址來區分設備。
IO設備會產生非常多的中斷,DMA必須要能夠將中斷正確的路由到客戶機中。一般有兩種方法,一種是采用中斷路由控制器進行路由,一種是通過DMA寫請求發出MSI消息(Message Signaled Interrupt)。DMA寫請求中會包含目的地址,所以DMA需要訪問所有的內存空間,無法實現中斷隔離。
無論是傳統的IOMMU還是特殊DMA,都無法實現DMA區域的隔離。
Intel VT-d通過改造IOMMU,實現多個DMA保護區域,最終實現DMA虛擬化,也稱之為DMA重映射(DMA Remapping)。通過重新定義MSI消息格式,不再嵌入目的地址,而是采用消息ID來區分不同的DMA保護區域。
這套實現模型的處理流程如下:
首先,DMA重映射完成DMA的虛擬化,即一方面完成PCIe設備的物理地址到客戶機的物理地址的轉換(客戶機物理地址與虛擬地址的轉換是由客戶機里面的VF驅動來完成的),另外一方面需要建立MSI消息ID與客戶機的一一對應關系;
當DMA將數據傳輸到接收方的緩沖區中,並產生MSI或MSI-x消息;
此時,VMM捕獲到此中斷,並根據MSI消息ID通知客戶機;
然后,客戶機中的VF驅動處理中斷,並從本地DMA緩沖區中讀出數據。
下圖是一個實現的案例:
第1、2步:包到達,被送往L2 進行分類和交換;
第3步:根據目的MAC地址進行分類,這時候,該包與緩衝池1匹配;
第4步:網卡發起DMA操作,將一個包傳遞給一個VM;
第5步:DMA操作到達Intel 芯片集,在這里VT-d(由Hypervisor 配置)進行DMA操作所需的地址翻譯;結果使得該包被直接送入到虛擬機的VF驅動緩沖里面;
第6步:網卡發起了MSI-X中斷,表明該接收操作已經完成。該中斷由Hypervisor接收;
第7步:Hypervisor向虛擬機注入一個虛擬中斷表明傳輸已經結束,這時候虛擬機的VF驅動就開始處理該包。
【SR-IOV帶來的優勢】
優勢1:IOMMU將GPA(客戶機物理地址)轉換為宿主機物理地址,從而使得客戶機IO性能基本接近物理IO的性能;相對於軟件模擬IO,則節省了客戶機的CPU消耗
優勢2:運行在客戶機里面的VF驅動可以直接訪問PCIe物理設備的寄存器,無需陷入和切換
優勢3:DMA重映射使得物理中斷和虛擬中斷之間的延遲大大減少,而且中斷之間實現隔離,在安全性方面得到提升
優勢4:實現PCIe設備共享,不像Passthrough方案,只能給特定的客戶機使用
優勢5:VF驅動就是普通的PCIe驅動,無需為SR-IOV安裝特殊的前端驅動,也無需VMM維護特殊的后端驅動(注意不是指IOVM),通用性較好
優勢6:不依賴於宿主機的IO操作,所以不增加宿主機的負擔
【SR-IOV相關驅動】
Intel SR-IOV 驅動(也即PF驅動)包含了所有 Intel 以太網卡的功能,並且還具備下列使用SR-IOV時特有的功能:
·給每個VF生成一個MAC地址
·通過信箱系統與VF驅動通信
·通過VF驅動配置VLAN過濾器
·通過VF驅動配置多播地址
·通過VF驅動配置最大包長
·處理VF驅動資源復位請求
Intel VF驅動實例代碼是標准 Intel ixgbe 10 Gigabit Ethernet 驅動的一個修改后的版本。通過設備ID來加載。Intel VF有一個設備ID表明它們是一個VF, 這樣VF驅動就可以被加載。
Intel VF 驅動可以被分割為三個部分:
·操作系統接口——虛擬機操作系統可以通過該接口調用各種API
·I/O操作——使用SR-IOV 功能來進行I/O操作,而避免Hypervisor的干預
·配置任務——配置像VLAN過濾器等需要與PF驅動進行通信的任務
從Intel官網上可以下載PF驅動和VF驅動:
PF驅動代碼:
ixgbe-4.3.15.tar.gz
支持SR-IOV的代碼在ixgbe_sriov.c和ixgbe_sriov.h。
比如給每個VF設置MAC地址的代碼:
文件:ixgbe_sriov.c
函數:ixgbe_vf_configuration
VF驅動代碼:
ixgbevf-3.1.2.tar.gz
比如VF驅動從硬件VF接收一個報文:
文件:ixgbevf_main.c
函數:ixgbevf_receive_skb
如果VMM是vmware的ESXi,則PF驅動需要Intel提供單獨的安裝包:
ixgbe-3.7.13-535467.zip
VF驅動與Guest OS有關系,所以在vmware上面的虛機里面運行的VF驅動無需特別提供。
-
VMDq
VMDq是Intel推出的無需VMM干預虛機報文收發的一種技術,與SR-IOV的區別如下圖所示:
VMDq
VMM在服務器的物理網卡中為每個虛機分配一個獨立的隊列,這樣虛機出來的流量可以直接經過軟件交換機發送到指定隊列上,軟件交換機無需進行排序和路由操作。但是,VMM和虛擬交換機仍然需要將網絡流量在VMDq和虛機之間進行復制。
VMDq需要物理網卡支持VMDq特性,即物理網卡必須要具備報文分類和分流器件,物理網卡的驅動必須支持VMDq;VMM要能夠支持多隊列管理和路由到指定的VM中。
查詢當前intel官網,支持VMDq的intel網卡不是太多。一般支持VMDq的網卡都支持RSS(反過來未必)。
SR-IOV
對於SR-IOV來說,則更加徹底,它通過創建不同虛擬功能(VF)的方式,呈現給虛擬機的就是獨立的網卡,因此,虛擬機直接跟網卡通信,不需要經過軟件交換機。VF和VM之間通過DMA進行高速數據傳輸。
SR-IOV的性能是最好的,但是需要一系列的支持,包括網卡、處理器、芯片組等。
要支持VMDq,需要網卡支持,同時還需要改造VMM或vSwitch。Linux KVM當前版本似乎不支持VMDq,需要經過改造。
網上有一篇文章設計了一套改造方案:
[RFC] Virtual Machine Device Queues (VMDq) support on KVM
Network adapter with VMDq technology presents multiple pairs of tx/rx queues,
and renders network L2 sorting mechanism based on MAC addresses and VLAN tags
for each tx/rx queue pair. Here we present a generic framework, in which network
traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without
any software copy.
Actually this framework can apply to traditional network adapters which have
just one tx/rx queue pair. And applications using the same user/kernel interface
can utilize this framework to send/receive network traffic directly thru a tx/rx
queue pair in a network adapter.
We use virtio-net architecture to illustrate the framework.
The basic idea is to utilize the kernel Asynchronous I/O combined with Direct
I/O to implements copy-less TUN/TAP device. AIO and Direct I/O is not new to
kernel, we still can see it in SCSI tape driver.
With traditional file operations, a copying of payload contents from/to the
kernel DMA address to/from a user buffer is needed. That's what the copying we
want to save.
The proposed framework is like this:
A TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair in
host side. KVM virtio-net Backend service, the user space program submits
asynchronous read/write I/O requests to the host kernel through TUN/TAP device.
The requests are corresponding to the vqueue elements include both transmission
& receive. They can be queued in one AIO request and later, the completion will
be notified through the underlying packets tx/rx processing of the rx/tx queue
pair.
Detailed path:
To guest Virtio-net driver, packets receive corresponding to asynchronous read
I/O requests of Backend service.
1) Guest Virtio-net driver provides header and payload address through the
receive vqueue to Virtio-net backend service.
2) Virtio-net backend service encapsulates multiple vqueue elements into
multiple AIO control blocks and composes them into one AIO read request.
3) Virtio-net backend service uses io_submit() syscall to pass the request to
the TUN/TAP device.
4) Virtio-net backend service uses io_getevents() syscall to check the
completion of the request.
5) The TUN/TAP driver receives packets from the queue pair of NIC, and prepares
for Direct I/O.
A modified NIC driver may render a skb which header is allocated in host
kernel, but the payload buffer is directly mapped from user space buffer which
are rendered through the AIO request by the Backend service. get_user_pages()
may do this. For one AIO read request, the TUN/TAP driver maintains a list for
the directly mapped buffers, and a NIC driver tries to get the buffers as
payload buffer to compose the new skbs. Of course, if getting the buffers
fails, then kernel allocated buffers are used.
6) Modern NIC cards now mostly have the header split feature. The NIC queue
pair then may directly DMA the payload into the user spaces mapped payload
buffers.
Thus a zero-copy for payload is implemented in packet receiving.
7) The TUN/TAP driver manually copy the host header to space user mapped.
8) aio_complete() to notify the Virtio-net backend service for io_getevents().
To guest Virtio-net driver, packets send corresponding to asynchronous write
I/O requests of backend. The path is similar to packet receive.
1) Guest Virtio-net driver provides header and payload address filled with
contents through the transmit vqueue to Virtio-net backed service.
2) Virtio-net backend service encapsulates the vqueue elements into multiple
AIO control blocks and composes them into one AIO write request.
3) Virtio-net backend service uses the io_submit() syscall to pass the
requests to the TUN/TAP device.
4) Virtio-net backend service uses io_getevents() syscall to check the request
completion.
5) The TUN/TAP driver gets the write requests and allocates skbs for it. The
header contents are copied into the skb header. The directly mapped user space
buffer is easily hooked into skb. Thus a zero copy for payload is implemented
in packet sending.
6) aio_complete() to notify the Virtio-net backend service for io_getevents().
The proposed framework is described as above.
Consider the modifications to the kernel and qemu:
To kernel:
1) The TUN/TAP driver may be modified a lot to implement AIO device operations
and to implement directly user space mapping into kernel. Code to maintain the
directly mapped user buffers should be in. It's just a modification for driver.
2) The NIC driver may be modified to compose skb differently and slightly data
structure change to add user directly mapped buffer pointer.
Here, maybe it's better for a NIC driver to present an interface for an rx/tx
queue pair instance which will also apply to traditional hardware, the kernel
interface should not be changed to make the other components happy.
The abstraction is useful, though it is not needed immediately here.
3) The skb shared info structure may be modified a little to contain the user
directly mapped info.
To Qemu:
1) The Virtio-net backend service may be modified to handle AIO read/write
requests from the vqueues.
2) Maybe a separate pthread to handle the AIO request triggering is needed.
-
multiqueue virtio-net
virtio-net當前支持多隊列,其目的是實現並發的報文收發。
To make sure the whole stack could be worked in parallel, the parallelism of not only the front-end (guest driver) but also the back-end (vhost and tap/macvtap) must be explored. This is done by:
-
Allowing multiple sockets to be attached to tap/macvtap
-
Using multiple threaded vhost to serve as the backend of a multiqueue capable virtio-net adapter
-
Use a multi-queue aware virtio-net driver to send and receive packets to/from each queue
The main goal of multiqueue is to explore the parallelism of each module that is involved in the packet transmission and reception:
-
macvtap/tap: For single queue virtio-net, one socket of macvtap/tap was abstracted as a queue for both tx and rx. We can reuse and extend this abstraction to allow macvtap/tap to dequeue and enqueue packets from multiple sockets. Then each socket can be treated as a tx/rx queue, and macvtap/tap is in fact a multi-queue device in the host. The host network code can then transmit and receive packets in parallel.
-
vhost(注意這里指vhost-net,不是vhost-user): The parallelism could be done through using multiple vhost threads to handle multiple sockets. Currently, there's two choices in design.
-
1:1 mapping between vhost threads and sockets. This method does not need vhost changes and just launches the same number of vhost threads as queues. Each vhost thread is just used to handle one tx ring and rx ring just as they are used for single queue virtio-net.
-
M:N mapping between vhost threads and sockets. This method allows a single vhost thread to poll more than one tx/rx rings and sockets and use separate threads to handle tx and rx requests.
-
-
qemu: qemu is in charge of the following things
-
allow multiple tap file descriptors to be used for a single emulated nic
-
userspace multiqueue virtio-net implementation which is used for maintaining compatibility, doing management and migration
-
control the vhost based on the userspace multiqueue virtio-net
-
-
guest driver
-
Allocate multiple rx/tx queues
-
Assign each queue a MSI-X vector in order to parallelize the packet processing in guest stack
-
具體說明請參見文檔:http://www.linux-kvm.org/page/Multiqueue
【Guest virtio-net驅動代碼分析】
在virtio-net多隊列機制中,前端和后端通過virtqueue來進行數據交換,virtqueue的初始化通過config->find_vqs來進行:
網卡上報中斷模式分為msix模式和非msix模式。Intel VT-d技術必須是msix模式。
1. 如果沒有開啟msix模式,則調用vp_request_intx申請一個中斷,中斷處理函數是vp_interrupt。vp_interrupt實際調用的是vp_vring_interrupt(配置變更的中斷除外)。
vp_vring_interrupt會遍歷virtio_pci_device的所有virtqueue(多個隊列的設備),調用中斷處理函數vring_interrupt,最終調用virtqueue注冊的callback函數完成中斷處理。
2. 開啟了msix模式,還要區分不同的模式,要么是所有virtqueue共享一個中斷,要么是每個virtqueue獨立一個中斷,無論是哪種模式,都需要調用vp_request_msix_vectors去申請irq中斷資源。還要對每個virtqueue,調用setup_vq來完成初始化。
-
虛擬交換機性能提升
-
OVS
一個數據包從虛機到物理網卡的完整路徑圖如下:
從這個路徑圖可以看出,優化VNF與OVS通道的點有三個:虛擬網卡、虛擬化層和內核網橋。
Host os默認采用linux bridge來進行網絡轉發,但由於linux bridge支持的功能有限,不易擴展,於是Open vSwitch(OVS)作為linux bridge的替代者出現。如下圖所示:
在OVS里面有br和port的概念,br對應linux bridge的br.101、br.102等邏輯接口,port對應linux的eth0、eth1等物理網卡接口。
但是我們可能會發現當前采用開源社區的OVS的性能實際要比linux bridge的二層轉發性能要低。之所以還要用OVS,主要在於OVS的擴展性。
如何提高OVS的報文轉發性能,這也是提高NFV性能的關鍵點之一。
一種方法是將OVS從內核空間移到用戶態,以減少數據報文的一次拷貝。當前OVS已經支持用戶態,但還是實驗版本;此外還有原生用戶態的虛擬交換機snabb switch。
另一種方法是同時改造guest os和OVS,實現用戶態與內核態之間的零拷貝機制,比如Intel主導的DPDK機制。
-
snabb
snabb當前主要有兩個開源項目:snabb switch和snabb NFV。
Snabb is written using these main techniques:
-
Lua, a high-level programming language that is easy to learn.
-
LuaJIT, a just-in-time compiler that is competitive with C.
-
Ethernet I/O with no kernel overhead ("kernel bypass" mode).
Snabb Switch是一個用戶態的虛擬交換機。
Problem: Virtual machines need networking that is both fast and flexible. Hardware NIC virtualization is fast, software networking is flexible, but neither is both.
Solution: Snabb NFV provides both performance and flexibility. The secret sauce is a best-of-both-worlds design that combines SR-IOV hardware virtualization with a feature-rich software layer based on Virtio.
Snabb NFV is deployed as an OpenStack ML2 mechanism driver.
The operator configures Neutron using the standard commands and API. Snabb NFV then implements the Neutron configuration using its own fast data-plane and robust control-plane.
Snabb NFV supports these Neutron extensions: Provider Networks, Security Groups, Port Filtering, QoS, and L2-over-L3 tunnels (aka softwires).
Snabb NFV is distributed based on OpenStack Icehouse. The distribution includes NFV-oriented updates to QEMU, Libvirt, Nova, and Neutron.
Snabb是userspace virtio app機制的一個應用。
所謂的userspace virtio app機制是C/S框架的通信機制,也就是vhost-user機制。
snabb switch的工作原理:
-
Snabb用vhost-user(QEMU 2.1的新特性)去和VM通信(所以不再需要tap設備,也不需要Kernel)
打開qemu的vhost-user特性, 如下:
qemu -m 1024 -mem-path /bak/shared,prealloc=on,share=on -netdev type=vhost-user,id=net0,file=/path/to/socket -device virtio-net-pci,netdev=net0
-mem-path選項支持為一個虛機分配可與其他進程共享的guest內存,vring(虛機網卡數據的緩存)就位於這塊內存中;qemu再通過unix socket將vring的文件描述符、中斷號、IO事件等傳給同在用戶空間的snabb switch進程,snabb switch進程就可以直接通過文件描述符去vring中取數據了。
-
Snabb用intel10g.lua驅動或OpenOnload提供的驅動(libcuil.so)去和硬件網卡打交道
-
Lagopus
Lagopus is a vSwitch that provides high-performance packet processing.
Lagopus software switch is yet another OpenFlow 1.3 software switch implementation. Lagopus software switch is designed to leverage multi-core CPUs for high-performance packet processing and forwarding with DPDK. Many network protocol formats are supported, such as Ethernet, VLAN, QinQ, MAC-in-MAC, MPLS and PBB. In addition, tunnel protocol processing is supported for overlay-type networking with GRE, VxLAN and GTP.
Lagopus switch支持兩個版本:raw-socket和DPDK supported。
具體請參考:http://www.lagopus.org/lagopus-book/en/html/
-
VNF網絡協議棧性能提升
VNF的性能與VNF運行模式有較大的關系。
實現一個網絡功能的系統大致分為三層:驅動層、TCP/IP協議棧、網絡功能。
一般來說,驅動和TCP/IP協議棧都處於內核態,網絡功能放在了用戶態,帶來的代價是性能損耗:內核態與用戶態通信導致的內存拷貝、調度不及時等。
程序放在用戶態有其優勢:隔離性較好、易調試等。所以當前出現很多項目試圖將驅動和TCP/IP層都放在用戶態,這樣避免了內核與用戶交互帶來的性能損耗,同時也可以充分利用用戶態程序的隔離性和可調試性優點。
-
Snabb NFV
具體請參見:https://github.com/snabbco/snabb/blob/master/README.md
-
libuinet
UINET:User INET
This is a user-space port of the FreeBSD TCP/IP stack, begun with the
FreeBSD 9.1-RELEASE sources and many pieces of Kip Macy's user-space
port of an earlier version of the FreeBSD stack, libplebnet.
Unlike the stock FreeBSD TCP/IP stack, this stack can initiate and
terminate arbitrary TCP/IP connections, including those on
arbitrarily-nested VLANs. Listen sockets can be bound to a wildcard
IP address (across everything on the wire, not just local interfaces),
wildcard port, and specific VLAN tag stacks. L2 information for
accepted connections is available to the application. Outbound
connections can be bound to any IP and port, as well as any MAC
address and VLAN tag stack.
This stack can also passively reconstruct TCP streams using a copy of
those streams' bidirectional packet flow. Reconstruction can continue
even in the face of packet loss (in which case zero-filled holes in
the affected streams are reported to the application).
Packet I/O is currently accomplished via netmap or libpcap (although
the latter interface is relatively new and untested).
-
mTCP
mTCP is a highly scalable user-level TCP stack for multicore systems.
mTCP source code is distributed under the Modified BSD License.
mTCP有三個版本:PSIO VERSION、DPDK VERSION、NETMAP VERSION
https://github.com/eunyoung14/mtcp/blob/master/README
-
NUSE
Linux has also been ported to itself. You can now run the kernel as a userspace application - this is called UserMode Linux (UML).
This is a library operating system (LibOS) version of Linux kernel, which will benefit in the couple of situations like:
-
operating system personalization
-
full featured network stack for kernel-bypass technology (a.k.a. a high-speed packet I/O mechanism) like Intel DPDK, netmap, etc
-
testing network stack in a complex scenario.
Right now, we have 2 sub-projects of this LibOS.
-
Network Stack in Userspace (NUSE) NUSE allows us to use Linux network stack as a library which any applications can directly use by linking the library. Each application has its own network stack so it provides an instant virtualized environment apart from a host operating system.
-
Direct Code Execution (DCE) DCE provides network simulator integration with Linux kernel so that any Linux implemented network protocols are available for a protocol under investigation.
https://github.com/libos-nuse/net-next-nuse
-
OpenDP
Open data plane on DPDK:TCP/IP stack for DPDK.
ANS (accelerated network stack) provides a userspace TCP/IP stack for use with DPDK. ANS is a static library which can be compiled with your App. You can add or delete ether interface, IP address and static routing from ANS. Your App can forward traffic to ANS.
https://github.com/opendp/dpdk-ans/wiki
-
OpenOnLoad
OpenOnload runs on Linux and supports TCP/UDP/IP network protocols with the standard BSD sockets API, and requires no modifications to applications to use.
It achieves performance improvements in part by performing network processing at user-level, bypassing the OS kernel entirely on the data path.
Networking performance is improved without sacrificing the security and multiplexing functions that the OS kernel normally provides.
OpenOnload comprises a user-level shared library that intercepts network-related system calls and implements the protocol stack, and supporting kernel modules.
To accelerate an application with the Onload user-level transport, simply
invoke the application on the command line as normal, prepended with
"onload".
ie. Instead of:
netperf -t TCP_RR -H myserver
do this:
onload netperf -t TCP_RR -H myserver
and tuned for best latency:
onload -p latency netperf -t TCP_RR -H myserver
-
Rump kernel
NUSE將整個Linux kernel編譯成一個用戶態動態庫,供應用程序使用,OpenDP為用戶態應用程序提供加速的TCP/IP協議棧,OpenOnLoad則無需修改應用程序即可使用TCP/IP協議棧以加速應用程序的網絡處理性能。
Rump kernel的思路是讓內核態的驅動程序可以不用修改即可運行在用戶態。驅動程序包括設備驅動程序、文件系統、TCP/IP協議棧等。
The NetBSD rump kernel is the first implementation of the "anykernel" concept where drivers either can be compiled into and/or run in the monolithic kernel or in user space on top of a light-weight rump kernel.
The NetBSD drivers can be used on top of the rump kernel on a wide range of POSIX operating systems, such as the Linux, NetBSD.
The rump kernels can also run without POSIX directly on top of the Xen hypervisor, the L4 microkernel in Genode OS or even on "OS-less" bare metal.
Rump kernel與DPDK結合后的體系層次如下:
有關操作系統的類型定義:
An anykernel is different in concept from microkernels, exokernels, partitioned kernels or hybrid kernels in that it tries to preserve the advantages of a monolithic kernel, while still enabling the faster driver development and added security in user space.
The "anykernel" concept refers to an architecture-agnostic approach to drivers where drivers can either be compiled into the monolithic kernel or be run as a userspace process, microkernel-style, without code changes.
With drivers, a wider concept is considered where not only device drivers are included but also file systems and the networking stack.
-
KVMforNFV
https://git.opnfv.org/cgit/kvmfornfv/
© 2015 Open Platform for NFV Project, Inc., a Linux Foundation Collaborative Project.
-
VPP
VPP是Cisco主導的一個NFV協議棧加速方案,具體請參見后面的VPP介紹。
-
Intel硬件輔助虛擬化
-
處理器輔助虛擬化VT-x
-
英特爾處理器內更出色的虛擬化支持:英特爾VT-x 有助於提高基於軟件的虛擬化解決方案的靈活性與穩定性。通過按照純軟件虛擬化的要求消除虛擬機監視器(VMM)代表客戶操作系統來聽取、中斷與執行特定指令的需要,不僅能夠有效減少 VMM 干預,還為 VMM 與客戶操作系統之間的傳輸平台控制提供了有力的硬件支持,這樣在需要 VMM干預時,將實現更加快速、可靠和安全的切換。此外,英特爾VT-x 具備的虛擬機遷移特性還可為您的 IT 投資提供有力保護,並進一步提高故障切換、負載均衡、災難恢復和維護的靈活性:
--英特爾VT FlexPriority:當處理器執行任務時,往往會收到需要注意的其它設備或應用發出的請求或"中斷"命令。為了最大程度減少對性能的影響,處理器內的一個專用寄存器(APIC任務優先級寄存器,或 TPR)將對任務優先級進行監控。如此一來,只有優先級高於當前運行任務的中斷才會被及時關注。
英特爾FlexPriority 可創建 TPR 的一個虛擬副本,該虛擬副本可讀取,在某些情況下,如在無需干預時,還可由客戶操作系統進行更改。上述舉措可以使頻繁使用 TPR 的 32 位操作系統獲得顯著的性能提升。(例如,能夠將在 Windows Server* 2000上運行的應用的性能提高 35%。)
--英特爾虛擬化靈活遷移技術(Intel VT FlexMigration):虛擬化的一個重要優勢是能夠在無需停機的情況下,將運行中的應用在物理服務器之間進行遷移。英特爾虛擬化靈活遷移技術(Intel VT FlexMigration)旨在實現基於英特爾處理器的當前服務器與未來服務器之間的無縫遷移,即使新的系統可能包括增強的指令集也不例外。借助此項技術,管理程序能夠在遷移池內的所有服務器中建立一套一致的指令,實現工作負載的無縫遷移。這便生成了可在多代硬件中無縫運行的更加靈活、統一的服務器資源池。
-
芯片組輔助虛擬化VT-d
英特爾芯片組內更出色的虛擬化支持:由於每台服務器上整合了更多的客戶操作系統,數據進出系統的傳輸量(I/O 流量)有所增加並且更趨復雜。如果沒有硬件輔助,虛擬機監視器(VMM)必須直接參與每項 I/O 交易。這不僅會減緩數據傳輸速度,還會由於更頻繁的 VMM 活動而增大服務器處理器的負載。這就如同在一個繁忙的購物中心,每位顧客都不得不通過一個門進出該中心,並且只能從中心經理那里得到指示。這樣不僅會耽誤顧客的時間,也會使經理無法處理其它緊急事件。
英特爾VT-d 通過減少 VMM 參與管理 I/O 流量的需求,不但加速了數據傳輸,而且消除了大部分的性能開銷。這是通過使 VMM將特定 I/O 設備安全分配給特定客戶操作系統來實現的。每個設備在系統內存中都有一個專用區域,只有該設備及其分配的客戶操作系統才能對該區域進行訪問。
完成初始分配之后,數據即可直接在客戶操作系統與為其分配的設備之間進行傳輸。這樣,I/O 流量的流動將更加迅速,而減少的 VMM 活動則會進一步縮減服務器處理器的負載。此外,由於用於特定設備或客戶操作系統的 I/O 數據不能被其它任何硬件或客戶軟件組件訪問,系統的安全性與可用性也得到了進一步增強。
-
網卡輔助虛擬化VT-c
英特爾I/O 設備內更出色的虛擬化支持:隨着企業在虛擬化環境中部署越來越多的應用,並利用實時遷移來節省功率或提升可用性,對虛擬化 I/O 設備的要求也在顯著提高。通過將廣泛的硬件輔助特性集成到 I/O 設備(該設備用於保持服務器與數據中心網絡、存儲基礎設施及其它外部設備的連接)中,英特爾VT-c 可針對虛擬化進一步優化網絡。從本質上來說,這套技術組合的功能與郵局非常相似:將收到的信件、包裹及信封分門別類,然後投遞到各自的目的地。通過在專用網絡芯片上執行這些功能,英特爾VT-c 大幅提高了交付速度,減少了 VMM 與服務器處理器的負載。英特爾VT-c 包括以下兩項關鍵技術(當前所有的英特爾萬兆位服務器網卡及選定的英特爾千兆位服務器網卡均可支持):
--借助虛擬機設備隊列(VMDq)最大限度提高 I/O 吞吐率:在傳統服務器虛擬化環境中,VMM 必須對每個單獨的數據包進行分類,並將其發送到為其分配的虛擬機。這樣會占用大量的處理器周期。而借助 VMDq,該分類功能可由英特爾服務器網卡內的專用硬件來執行,VMM 只需負責將預分類的數據包組發送到適當的客戶操作系統。這將減緩 I/O 延遲,使處理器獲得更多的可用周期來處理業務應用。英特爾VT-c可將 I/O 吞吐量提高一倍以上,使虛擬化應用達到接近本機的吞吐率。每台服務器將整合更多應用,而 I/O 瓶頸則會更少。
--借助虛擬機直接互連(VMDc)大幅提升虛擬化性能:借助PCI-SIG 單根 I/O 虛擬化(SR-IOV)標准,虛擬機直接互連(VMDc)支持虛擬機直接訪問網絡 I/O 硬件,從而顯著提升虛擬性能。如前所述,英特爾VT-d 支持客戶操作系統與設備I/O 端口之間的直接通信信道。通過支持每個 I/O 端口的多條直接通信信道,SR-IOV 可對此進行擴展。例如,通過單個英特爾萬兆位服務器網卡,可為 10 個客戶操作系統中的每個操作系統分配一個受保護的、1 Gb/秒的專用鏈路。這些直接通信鏈路繞過了 VMM 交換機,可進一步提升 I/O 性能並減少服務器處理器的負載。
-
Intel ONP
-
ONP
-
https://01.org/zh/packet-processing/intel®-onp
Intel® Open Network Platform (Intel ONP) is a reference architecture that provides engineering guidance and ecosystem enablement support to encourage widespread adoption of SDN and NFV solutions in Telco, Enterprise and Cloud. It is not a commercial product, but a pre-production reference that drives development and showcase SDN/NFV solution capabilities. Intel ONP reference architecture brings together Industry Standard High Volume Servers (SHVS) based on Intel® Architecture (IA) and a software stack composed of open source, open standard software ingredients.
One of the key objectives of Intel ONP is to align and optimize key Open Community software ingredients for architects and engineers targeting high performing SDN and NFV open source based solutions. Primary Intel ONP Software Ingredients included are: DPDK for accelerated packet processing; Open vSwitch* (OVS) including support for OVS with DPDK which enables much better performance of the data plane when using DPDK libraries; OpenDaylight* (ODL) controller; and OpenStack orchestrator.
-
DPDK
Intel將DPDK運用到了NFV領域,分為DPDK Switch和DPDK NFV。其中DPDK NFV中包含協議棧優化和IVSHMEM優化通道。
DPDK is a set of software libraries and Ethernet drivers (native and virtualized) that run in Linux user space to boost packet processing throughput on Intel® architecture.
DPDK library components include:
-
Environment Abstraction Layer - abstracts huge-page file system, provides multi-thread and multi-process support.
-
Memory Manager - allocates pools of objects in memory. A pool is created in huge page memory space and uses a ring to store free objects. It also provides an alignment helper to make sure that objects are padded, to spread them equally on all DRAM channels.
-
Buffer Manager – reduces, by a significant amount, the time the operating system spends allocating and de-allocating buffers. The DPDK library pre-allocates fixed size buffers which are stored in memory pools.
-
Queue Manager - implements safe lockless queues, instead of using spinlocks, that allow different software components to process packets, while avoiding unnecessary wait times.
-
Flow Classification - provides an efficient mechanism which incorporates Intel® Streaming SIMD Extensions (Intel® SSE) to produce a hash based on tuple information, so that packets may be placed into flows quickly for processing, greatly improving throughput.
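結合上述組件,下面給出一個建立mbuf內存池的極簡示意(僅演示Memory/Buffer Manager的基本用法,省略了端口配置等步驟,池名mbuf_pool為舉例):
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NB_MBUF    8192
#define CACHE_SIZE 256

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    /* 內存池建立在hugepage中,按NUMA節點就近分配;cache用於減少多核競爭 */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("mbuf_pool", NB_MBUF,
            CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        return -1;

    struct rte_mbuf *m = rte_pktmbuf_alloc(pool);   /* 從池中取一個報文緩衝區 */
    if (m != NULL)
        rte_pktmbuf_free(m);                        /* 用完歸還,不經過OS的malloc/free */
    return 0;
}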
DPDK的性能優化方法大致如下,具體詳細解釋請參考:
http://openvswitch.org/support/dist-docs-2.5/INSTALL.DPDK.md.html
-
PMD affinitization
-
Multiple poll mode driver threads
-
DPDK port Rx Queues
-
Exact Match Cache
-
Compiler options
-
Simultaneous Multithreading (SMT)
-
The isolcpus kernel boot parameter
isolcpus can be used on the kernel bootline to isolate cores from the kernel scheduler and hence dedicate them to OVS or other packet forwarding related workloads.
-
NUMA/Cluster On Die
-
Rx Mergeable buffers
-
Packet processing in the guest
-
DPDK virtio pmd in the guest
-
DPDK with IVSHMEM
DPDK關於ivshmem的實現代碼:
從前面分析可知,ivshmem機制,一是qemu本身支持,二是guest os需要ivshmem驅動。那么DPDK針對ivshmem做了哪些改善呢?
The DPDK IVSHMEM library facilitates fast zero-copy data sharing among virtual machines (host-to-guest or guest-to-guest) by means of QEMU's IVSHMEM mechanism.
-
The library works by providing a command line for QEMU to map several hugepages into a single IVSHMEM device.
-
For the guest to know what is inside any given IVSHMEM device (and to distinguish between DPDK and non-DPDK IVSHMEM devices), a metadata file is also mapped into the IVSHMEM segment.
-
No work needs to be done by the guest application to map IVSHMEM devices into memory; they are automatically recognized by the DPDK Environment Abstraction Layer (EAL).
A typical DPDK IVSHMEM use case looks like the following:
所以這個所謂的metadata file記錄了在ivshmem這個設備里面的信息。
如何創建這個metadata file呢?DPDK提供的API:rte_ivshmem_metadata_create() to create a new metadata file
ivshmem設備里面可以有哪些信息呢?通過DPDK提供的ivshmem相關API可以得知:
-
rte_ivshmem_metadata_add_memzone() to add rte_memzone to metadata file
-
rte_ivshmem_metadata_add_ring() to add rte_ring to metadata file
-
rte_ivshmem_metadata_add_mempool() to add rte_mempool to metadata file
Guest或host的應用程序則可以通過讀取metadata file獲得memzone、ring、mempool等相關的信息。
Guest應用程序如何操作ivshmem設備呢?DPDK通過
rte_ivshmem_metadata_cmdline_generate() to generate the command line for QEMU
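依據上述API,下面是host側DPDK進程的一個用法草圖(函數參數形式以所用DPDK版本的rte_ivshmem.h為準,vm0_ring、vm0_metadata等名字均為舉例):
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_ring.h>
#include <rte_ivshmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char cmdline[2048];

    if (rte_eal_init(argc, argv) < 0)
        return -1;

    /* 建立一個供host與VM共享的無鎖隊列 */
    struct rte_ring *r = rte_ring_create("vm0_ring", 1024, rte_socket_id(), 0);
    if (r == NULL)
        return -1;

    /* 創建metadata文件並把ring登記進去,guest側EAL啟動時會自動識別 */
    rte_ivshmem_metadata_create("vm0_metadata");
    rte_ivshmem_metadata_add_ring(r, "vm0_metadata");

    /* 生成傳給QEMU的ivshmem相關命令行片段 */
    rte_ivshmem_metadata_cmdline_generate(cmdline, sizeof(cmdline), "vm0_metadata");
    printf("qemu cmdline: %s\n", cmdline);
    return 0;
}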
【IVSHMEM Environment Configuration】
-
Compile a special version of QEMU from sources.
The source code can be found on the QEMU website (currently, version 1.4.x is supported, but version 1.5.x is known to work also), however, the source code will need to be patched to support using regular files as the IVSHMEM memory backend. The patch is not included in the DPDK package, but is available on the Intel®DPDK-vswitch project webpage (either separately or in a DPDK vSwitch package).
-
Enable IVSHMEM library in the DPDK build configuration.
In the default configuration, IVSHMEM library is not compiled. To compile the IVSHMEM library, one has to either use one of the provided IVSHMEM targets (for example, x86_64-ivshmem-linuxapp-gcc), or set CONFIG_RTE_LIBRTE_IVSHMEM to "y" in the build configuration.
-
Set up hugepage memory on the virtual machine.
The guest applications run as regular DPDK (primary) processes and thus need their own hugepage memory set up inside the VM.
詳細參考:http://dpdk.org/doc/guides/prog_guide/ivshmem_lib.html
如何在DPDK的基礎上編寫一個基於ivshmem的應用程序呢?在DPDK 2.0的軟件包中未找到基於ivshmem的樣例應用程序。但是官方給出了編寫ivshmem應用程序的一些最佳實踐:
When considering the use of IVSHMEM for sharing memory, security implications need to be carefully evaluated. IVSHMEM is not suitable for untrusted guests, as IVSHMEM is essentially a window into the host process memory. This also has implications for the multiple VM scenarios. While the IVSHMEM library tries to share as little memory as possible, it is quite probable that data designated for one VM might also be present in an IVSHMEM device designated for another VM. Consequently, any shared memory corruption will affect both host and all VMs sharing that particular memory.
IVSHMEM applications essentially behave like multi-process applications, so it is important to implement access serialization to data and thread safety. DPDK ring structures are already thread-safe, however, any custom data structures that the user might need would have to be thread-safe also.
Similar to regular DPDK multi-process applications, it is not recommended to use function pointers as functions might have different memory addresses in different processes.
It is best to avoid freeing the rte_mbuf structure on a different machine from where it was allocated, that is, if the mbuf was allocated on the host, the host should free it. Consequently, any packet transmission and reception should also happen on the same machine (whether virtual or physical). Failing to do so may lead to data corruption in the mempool cache.
-
DPDK with ring
ring是一種無鎖的隊列管理方式。在linux平台上,不僅有對應的ring庫,而且還有將這種機制抽象成的pf_ring套接字,用於用戶態與內核態之間的零拷貝通信。
那么,DPDK用ring做什么呢?對ring有改進嗎?
DPDK用ring主要用來軟件模擬以太網網卡以實現VM to VM 或VM to host的報文通信。也就是虛機之間或虛機與主機之間可以通過這個模擬的以太網網卡進行報文收發以達到交換信息和數據的目的。
DPDK實現的ring機制與ivshmem機制非常相似,只不過前者是基於共享隊列來實現的,后者基於共享內存文件來實現的。
DPDK提供了PMD on ring,但是未提供PMD on ivshmem。
ring既然用作兩個應用之間通信隊列,那么這個隊列則需要在兩個應用之間必須共享。怎么做到的呢?
ring隊列都是處於物理上連續的內存區域,所以不同應用如果能夠看到這塊內存,自然就可以實現共享。
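下面是對端(以secondary方式啟動的DPDK進程或VM內的DPDK應用)按名字找到同一個ring並收發對象指針的極簡示意(ring名字vm0_ring為舉例,須與創建方一致):
#include <rte_eal.h>
#include <rte_ring.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    void *obj = NULL;

    /* 以 --proc-type=secondary 啟動,映射primary進程建立的hugepage內存 */
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    struct rte_ring *r = rte_ring_lookup("vm0_ring");   /* 名字須與創建方一致 */
    if (r == NULL) {
        printf("ring not found\n");
        return -1;
    }

    /* 入隊/出隊的只是指針,數據本身位於雙方都映射的共享內存中,因此是零拷貝 */
    if (rte_ring_dequeue(r, &obj) == 0)
        rte_ring_enqueue(r, obj);       /* 簡單地把取到的對象回射給對端 */
    return 0;
}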
-
DPDK with vhost-user
DPDK中rte_vhost庫是vhost-user機制的一個具體實現,使得guest發送報文不必陷入主機內核,而是以共享內存的方式將數據直接交給用戶態的vhost后端(例如OVS with DPDK),由后端將報文發送出去。
DPDK通過vhost-net-user來與qemu中的vhost模塊交互以獲得virtio vring的相關信息,具體對virtio網卡的操作通過virtio-net-user來實現。
另外DPDK vhost還實現了另外一種機制叫vhost-cuse。不過intel宣稱后續重點支持vhost-user機制。
【原理與接口】
要實現guest發送報文不必陷入主機內核、以共享內存的方式將數據交給用戶態后端,則需要:
1)qemu與guest應用進程(虛機)之間可以共享內存
For QEMU, this is done by using the -object memory-backend-file,share=on,... option. Which means QEMU will create a file to serve as the guest RAM. The share=on option allows another process to map that file, which means it can access the guest RAM.
2)qemu與guest都需要能夠獲得vring信息
所以qemu和guest都需要實現vhost的模塊,通常是C/S模式通信機制。
Qemu支持vhost已經在1.7版本以上支持,而DPDK作為guest,為支持vhost實現了兩個模塊,vhost-net-user與qemu對應的vhost模塊通信,virtio-net-user則根據vhost-net-user的指令具體操作virtio設備。
Vhost-net-user與qemu通信支持兩種模式,一種是vhost-net-user充當server,一種是充當client。
【vhost-net-user源代碼分析】
1.vhost-net-user驅動注冊
vhost-net-user相對於虛機系統來說,當做一個驅動來看待。
rte_vhost_driver_register(path, flags)
This function registers a vhost driver into the system. For vhost-user server mode, a Unix domain socket file path will be created.
Currently two flags are supported (these are valid for vhost-user only):
-
RTE_VHOST_USER_CLIENT
DPDK vhost-user will act as the client when this flag is given.
-
RTE_VHOST_USER_NO_RECONNECT
When DPDK vhost-user acts as the client it will keep trying to reconnect to the server (QEMU) until it succeeds. This is useful in two cases:
-
When QEMU is not started yet.
-
When QEMU restarts (for example due to a guest OS reboot).
This reconnect option is enabled by default. However, it can be turned off by setting this flag.
2.消息分發
提供rte_vhost_driver_session_start()用於vhost message的分發。
3.virtio設備狀態管理
通過rte_vhost_driver_callback_register()注冊virtio設備狀態變更處理函數。
-
new_device(int vid)
This callback is invoked when a virtio net device becomes ready. vid is the virtio net device ID.
-
destroy_device(int vid)
This callback is invoked when a virtio net device shuts down (or when the vhost connection is broken).
-
vring_state_changed(int vid, uint16_t queue_id, int enable)
This callback is invoked when a specific queue's state is changed, for example to enabled or disabled.
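綜合上述接口,下面給出一個把DPDK進程注冊為vhost-user后端的極簡草圖(頭文件名與回調簽名以所用DPDK版本為準,此處假定為rte_virtio_net.h;socket路徑/tmp/vhost-user-1為舉例):
#include <rte_eal.h>
#include <rte_virtio_net.h>
#include <stdio.h>

static int new_device(int vid)
{
    printf("virtio device %d is ready\n", vid);    /* 此時可開始對該設備做收發 */
    return 0;
}

static void destroy_device(int vid)
{
    printf("virtio device %d is removed\n", vid);
}

static const struct virtio_net_device_ops ops = {
    .new_device     = new_device,
    .destroy_device = destroy_device,
};

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    /* 不帶RTE_VHOST_USER_CLIENT標志時,DPDK作為server,在該路徑創建unix socket供QEMU連接 */
    rte_vhost_driver_register("/tmp/vhost-user-1", 0);
    rte_vhost_driver_callback_register(&ops);

    /* 進入vhost消息分發循環(阻塞) */
    rte_vhost_driver_session_start();
    return 0;
}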
-
DPDK with PMD
本節討論的是DPDK的網絡收發的機制PMD(Poll-Mode Driver)輪詢模式。
DPDK NFV的網卡有virtio、IVSHMEM(基於內存文件模擬的PCI設備)、VF(SR-IOV)、ring(基於內存隊列模擬的以太網網卡)、pcap(基於磁盤文件模擬的以太網網卡)5類;DPDK vSwitch的網卡有物理千兆、萬兆、40G網卡外,還有tap、ivshmem、VF、ring、pcap等虛擬網卡設備。
DPDK包括千兆,萬兆,40G網卡、虛擬IO網卡(比如virtio-net,對應的庫為librte_pmd_virtio.so)、軟件模擬IO網卡(比如ring、pcap等)設備的輪詢驅動。
DPDK訪問網卡的驅動有兩類:uio和vfio。前者是DPDK在用戶態直接接管物理IO,后者則借助IOMMU把設備安全地映射給用戶態直接訪問。詳細請參見我寫的《》。
DPDK的uio驅動在報文收發方面采用所謂的PMD(Poll-Mode Driver)輪詢模式。
系統在進行報文處理的方法一般有四種:中斷、輪詢、NAPI(中斷與輪詢結合)、PMD(不采用中斷)
PMD由運行在用戶態驅動提供的API組成。
PMD不通過任何中斷來直接訪問RX和TX描述符(除了網卡連接狀態改變的中斷之外)實現在應用程序中快速收包,處理和轉發。
對於包處理程序DPDK允許兩種模式,run-to-completion和pipe-line:
-
run-to-completion模式下,使用API來輪詢指定的網卡收包RX描述符隊列。然后報文也就在這個核上處理,然后通過發送API將報文放入網卡TX描述符隊列中。
-
pipe-line模式下,一個core通過API輪詢一個或者是多個端口的RX描述符隊列。報文收下來之后通過ring傳遞給其它core。其它的core處理報文,處理完后通過發送API將報文放到端口的TX描述符ring中。
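以run-to-completion模式為例,下面是單核收發循環的一個極簡示意(省略了EAL初始化與端口配置,端口號、隊列號均為舉例):
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void fwd_loop(uint8_t rx_port, uint8_t tx_port)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* 輪詢RX描述符隊列,一次最多取BURST_SIZE個報文,全程沒有中斷參與 */
        uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* 在同一個核上處理后直接放入TX描述符隊列 */
        uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);

        /* 發送隊列滿時釋放未發出的mbuf,避免內存泄漏 */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}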
DPDK PMD支持硬件負載卸載Hardware Offload。
依賴於驅動的能力(可通過rte_eth_dev_info_get()獲取),PMD可能支持多種硬件卸載,像校驗和、TCP分片或VLAN插入。
對上述負載特性的支持意味着在rte_mbuf結構體中添加額外的狀態位。
輪詢驅動API默認所有PMD接口都是無鎖的,即假定不會在不同的邏輯核之間並行訪問同樣的對象。例如,PMD收包函數不會被不同的邏輯核調用去輪詢同一個網卡的相同隊列;當然,這個函數可以被不同的邏輯核並行調用,從不同的rx收包隊列中取包。上層應用必須遵守這個規則。
除了物理和虛擬硬件的輪詢驅動,DPDK還包括一套軟件庫,提供允許將多個物理端口聚合成一個邏輯接口的PMD(即鏈路聚合PMD)。
鏈路聚合PMD庫(librte_pmd_bond)支持聚合一組相同速度和雙工的rte_eth_dev端口,提供和linux上bond驅動的相同功能,允許主機和交換機之間多個(slave)網卡聚合成一個邏輯接口。新的聚合PMD將按照指定的操作模式處理底層網卡,例如鏈路主備,故障容錯,負載均衡。
Librte_pmd_bond庫提供API來創建bond設備,包括配置和管理bond設備和它的slave設備。注意:librte_pmd_bond庫默認是開啟的,關閉可以將CONFIG_RTE_LIBRTE_PMD_BOND設置為n再重新編譯。
聚合PMD支持6種負載均衡模式:
-
Round-Robin(Mode 0)
-
Active Backup(Mode 1)
-
Balance XOR(Mode 2)
默認策略(2層)使用一個簡單的計算,基於報文流源和目的mac地址計以及Bond設備中可用的活動slave設備數目的結果標記報文使用指定的slave傳輸。2層和3層支持輪詢傳輸策略,使用ip源地址和目的地址來計算出使用的傳輸slave端口,最后支持的3+4層使用源端口和目的端口,就和源ip、目的ip方式一樣。
-
Broadcast(Mode 3)
-
Link Aggregation 802.3AD(Mode 4)
-
Transmit Load Balancing(Mode 5)
這種模式提供了自適應的傳輸負載均衡。它動態的改變傳輸slave,依據據算的負載。每100ms收集一次統計值並每10ms計算調度一次。
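下面給出一個用librte_pmd_bond創建主備(Active Backup)模式bond設備的簡單草圖(函數與宏名取自rte_eth_bond.h,具體簽名以所用DPDK版本為準,slave端口0和1為舉例):
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_eth_bond.h>

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    /* 創建bond設備,返回其port id */
    int bond_port = rte_eth_bond_create("bond0", BONDING_MODE_ACTIVE_BACKUP, 0);
    if (bond_port < 0)
        return -1;

    /* 把物理端口0、1作為slave加入,並指定端口0為主用 */
    rte_eth_bond_slave_add(bond_port, 0);
    rte_eth_bond_slave_add(bond_port, 1);
    rte_eth_bond_primary_set(bond_port, 0);

    /* 之后像普通端口一樣對bond_port做rte_eth_dev_configure/start及收發即可 */
    return 0;
}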
源代碼:
librte_pmd_af_packet:從packet socket接收發送報文
af_packet與AF_INET同屬於協議族,前者表示raw_socket,即可以訪問二層以上的報文信息,而后者只能獲得傳輸層及之上的信息。
Packet sockets are used to receive or send raw packets at the device driver (OSI Layer 2) level. They allow the user to implement protocol modules in user space on top of the physical layer.
【PMD on virtio】
對應Guest OS的驅動librte_pmd_virtio.so
源代碼:
virtio_ethdev.c:主要是virtio網卡設備相關的操作函數定義
virtio_pci.c:virtio作為pci設備的相關操作函數定義
virtqueue.c:virtio實現的基礎
virtio_rxtx.c:這是PMD on virtio的關鍵實現
相對於virtio虛擬網卡的PMD驅動體現在哪些方面呢?
1.首先定義virtio的PMD驅動結構
2.virtio PMD驅動注冊
3.virtio 作為pci設備的初始化
rte_eal_pci_probe()->pci_probe_all_drivers()->rte_eal_pci_probe_one_driver()
哪些驅動需要從內核態映射到用戶態呢?全局搜索一下即可知曉:
可見,virtio沒有設置這個標志,說明virtio無需做映射,是因為virtio設備是一個虛擬設備,本身就是qemu模擬出來的一個用戶態的虛擬設備。
另外,RTE_PCI_DRV_INTR_LSC 這個標志的意思是設備的link狀態通過中斷來通知。從下面可知virtio設備的link狀態是通過中斷來通知的。
為了了解pci設備如何從內核態映射到用戶態,下面看一下這個映射函數:
4.virtio 中斷處理
5.virtio報文收發
下面是virtio驅動收報文的函數(需要理解virtio收發報文的機制,參考前面的multiqueue virtio-net章節的介紹)
PMD實質體現在網絡設備的報文收發方面,物理網卡和虛擬網卡在報文收發方面有些差異,但是總體的思路是報文收發不通過中斷通知機制,而是收發線程去輪詢網絡設備的緩沖區。
【PMD on VF】
在SR-IOV模式下,VF驅動可能是igb,也可能是ixgbe,也就是說與物理網卡的驅動是一樣的。
【PMD on ring】
1.ring設備驅動注冊
2.ring設備的初始化
3.ring設備的報文收發
從代碼上可以看出,ring設備的報文收發其實就是對ring隊列的出隊入隊的操作。
【PMD on pNIC】
-
OVS with DPDK
-
DPDK對OVS的改進點
-
DPDK Switch則是利用vhost-user技術,改造OVS實現類似於snabb switch一樣性能強勁的用戶態交換機。
下圖表達了DPDK針對OVS的改造點:
-
將OVS改造成用戶態的vSwitch
-
支持vhost-user通信機制,使得VM與OVS之間、VM與VM之間支持vhost-user通信機制,只要VM的guest os支持virtio即可
-
支持IVSHMEM通信機制,使得VM與OVS之間、VM與VM之間支持shmem通信機制,但是VM的guest os必須是基於DPDK改造過以支持shmem
-
支持直接操作物理網卡,提供uio和vfio兩種驅動模式
UIO把硬件設備映射給用戶態后,內核不會再參與其調度,完全交給用戶自定義使用。VFIO則在映射后,內核仍舊會為其提供部分規范方法的支持(如中斷,iommu等)。
-
overlay隧道的優化
下面是DPDK針對OVS為了性能所做的整改:
-
性能測試
那么DPDK對OVS的改造效果如何呢?
【IBM測試實驗】
據IBM實驗(來源於SDNLAB 唐剛)
a) CPU:2 sockets Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 10 cores per socket
b) RAM:64 Gbytes
c) NIC:Intel 10-Gigabit X540-AT2
d) Kernel:Linux 3.13.0-53-generic
網絡拓撲1:測試IVSHMEM通道的性能
物理網卡----OVS with DPDK----VM---OVS with DPDK-----物理網卡。
OVS with DPDK與虛機VM之間的通道采用IVSHMEM(其實嚴格意義上應該是采用ring機制,只不過ring機制與ivshmem機制相似,都是基於共享內存的零拷貝機制,而在DPDK 2.0軟件包中找不到基於ivshmem的樣例應用程序),虛機VM里面跑的轉發程序使用OVS安裝目錄下自帶的ring client。
在這種模式下,可以達到10Mpps(64字節 10Gbps轉發速率對應14.88Mpps)
網絡拓撲2:測試vhost-user通道的性能
注意,當OVS使用vhost-user機制時,虛機的虛擬網卡需要使用virtio。
網絡拓撲3:測試vhost(vhost-net)通道的性能
注意,這時候使用的OVS是原生態的,即未支持DPDK,所以運行在內核態中,OVS與虛機的通道接口仍然是TAP,虛機的虛擬網卡是virtio。不過,虛機里面跑的報文轉發VNF是經過DPDK優化過的程序。
這三種情況下得到的性能數據如下:
如果將測試工具由DPDK-Pktgen(用於測試pps性能)換成netperf(用於測試bps),對應的數據如下:
可見,
-
內核態的OVS+vhost優化的性能要遠遠低於OVS with DPDK。64字節的吞吐前者不到2G,后者9G左右
-
OVS with DPDK with IVSHMEM通道在應用層新建性能方面又優於OVS with DPDK with vhost-user。新建速率前者在1.3萬,后者在5000左右
-
OVS with DPDK with IVSHMEM通道在3層吞吐方面與OVS with DPDK with vhost-user相當,在9G左右
【Intel測試實驗】
詳細請參考《Intel_ONP_Release_2.1_Performance_Test_Report_Rev1.0.pdf》。
網絡拓撲1:測試Host性能
可見,主機host的轉發性能,64字節可以達到28Gbps左右。
網絡拓撲2:測試虛擬交換機OVS native與OVS with DPDK性能
可見,64字節吞吐,OVS with DPDK 是OVS Native的10倍,達到7Gbps左右。
網絡拓撲3:測試一個虛機經過OVS Native和OVS with DPDK下的性能
可見,虛機交換機上掛一個虛機,64字節吞吐,OVS with DPDK是OVS Native的6~7倍,達到2Gbps左右。另外經過虛機后,性能要比單純經過虛機交換機的性能(7G)下降很多。
網絡拓撲4:測試虛擬交換機下掛1個和2個虛機下的性能
可見,64字節吞吐,經過一個VM是6Gbps,經過2個VM是2Gbps,下降到大約原來的1/3。
網絡拓撲5:經過虛機的虛擬交換機交換性能
可見,虛擬交換機的轉發性能與服務器的CPU個數基本成直線關系,在8 cores的情況下256字節吞吐可以達到40Gbps。同時轉發流程的不同會有稍微差異,但是不大。
網絡拓撲6:測試VXLAN性能
可見,VXLAN加解封裝,114字節吞吐(64+50),OVS with DPDK能夠接近4Gbps。同時與CPU個數基本成線性關系,但是OVS Native則基本不隨CPU個數的增加而增加。
-
Run OVS with DPDK
詳細請參考:http://openvswitch.org/support/dist-docs-2.5/INSTALL.DPDK.md.html
【當前狀態】
Open vSwitch can use Intel(R) DPDK lib to operate entirely in userspace.
The DPDK support of Open vSwitch is considered experimental. It has not been thoroughly tested.
【前提條件】
DPDK:>= 2.2
OVS: >= 2.5
Linux kernel:>= 2.6.34
【編譯步驟】
第一步:Configure build & install DPDK
1.Set $DPDK_DIR
export DPDK_DIR=/usr/src/dpdk-2.2
cd $DPDK_DIR
2.Update config/common_linuxapp so that DPDK generate single lib file. (modification also required for IVSHMEM build)
CONFIG_RTE_BUILD_COMBINE_LIBS=y
Then run make install to build and install the library.
For default install without IVSHMEM:
make install T=x86_64-native-linuxapp-gcc
To include IVSHMEM (shared memory):
make install T=x86_64-ivshmem-linuxapp-gcc
第二步:Configure & build the Linux kernel
DPDK對Linux kernel的要求如下:
-
Kernel version >= 2.6.34
-
glibc >= 2.7 (for features related to cpuset)
-
UIO support
-
HUGETLBFS
-
PROC_PAGE_MONITOR support
-
HPET and HPET_MMAP configuration options should also be enabled if HPET(High Precision Event Timer) support is required.
第三步:Configure & build OVS
-
Non IVSHMEM:
export DPDK_BUILD=$DPDK_DIR/x86_64-native-linuxapp-gcc/
-
IVSHMEM:
export DPDK_BUILD=$DPDK_DIR/x86_64-ivshmem-linuxapp-gcc/
cd $(OVS_DIR)/
./boot.sh
./configure --with-dpdk=$DPDK_BUILD [CFLAGS="-g -O2 -Wno-cast-align"]
make
【運行步驟】
第一步:設置Linux kernel
1.Setup system boot Add the following options to the kernel bootline:
default_hugepagesz=1GB hugepagesz=1G hugepages=1
2.Mount the hugetable filesystem
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages
第二步:設置DPDK
DPDK devices can be setup using either the VFIO (for DPDK 1.7+) or UIO modules.
UIO:
UIO requires inserting an out of tree driver igb_uio.ko that is available in DPDK:
-
insert uio.ko: modprobe uio
-
insert igb_uio.ko: insmod $DPDK_BUILD/kmod/igb_uio.ko
-
Bind network device to igb_uio: $DPDK_DIR/tools/dpdk_nic_bind.py --bind=igb_uio eth1
VFIO:
VFIO needs to be supported in the kernel and the BIOS.
-
Insert vfio-pci.ko: modprobe vfio-pci
-
Set correct permissions on vfio device:
sudo /usr/bin/chmod a+x /dev/vfio
sudo /usr/bin/chmod 0666 /dev/vfio/*
-
Bind network device to vfio-pci:
$DPDK_DIR/tools/dpdk_nic_bind.py --bind=vfio-pci eth1
第三步:安裝用戶態進程和工具
-
First time only db creation (or clearing):
mkdir -p /usr/local/etc/openvswitch
mkdir -p /usr/local/var/run/openvswitch
rm /usr/local/etc/openvswitch/conf.db
ovsdb-tool create /usr/local/etc/openvswitch/conf.db \
    /usr/local/share/openvswitch/vswitch.ovsschema
-
Start ovsdb-server
ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \
    --remote=db:Open_vSwitch,Open_vSwitch,manager_options \
    --private-key=db:Open_vSwitch,SSL,private_key \
    --certificate=db:Open_vSwitch,SSL,certificate \
    --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach
-
First time after db creation, initialize:
ovs-vsctl --no-wait init
第四步:Start vswitchd
export DB_SOCK=/usr/local/var/run/openvswitch/db.sock
ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach
If allocated more than one GB hugepage (as for IVSHMEM), set amount and use NUMA node 0 memory:
ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 -- unix:$DB_SOCK --pidfile --detach
第五步:給vswitch添加bridge,向bridge添加dpdk的端口
To use ovs-vswitchd with DPDK, create a bridge with datapath_type "netdev" in the configuration database. For example:
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
Now you can add dpdk devices. OVS expects DPDK device names to start with "dpdk" and end with a portid. vswitchd should print (in the log file) the number of dpdk devices found.
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk
Once first DPDK port is added to vswitchd, it creates a Polling thread and polls dpdk device in continuous loop. Therefore CPU utilization for that thread is always 100%.
Note: creating bonds of DPDK interfaces is slightly different to creating bonds of system interfaces. For DPDK, the interface type must be explicitly set, for example:
ovs-vsctl add-bond br0 dpdkbond dpdk0 dpdk1 -- set Interface dpdk0 type=dpdk -- set Interface dpdk1 type=dpdk
第六步:測試
cd /usr/src/ovs/utilities/
./ovs-ofctl del-flows br0
Add flows between port 1 (dpdk0) to port 2 (dpdk1)
./ovs-ofctl add-flow br0 in_port=1,action=output:2
./ovs-ofctl add-flow br0 in_port=2,action=output:1
【使用DPDK rings】
OVS運行在Host OS上,OVS with DPDK則可以使用DPDK ring機制。
OVS with DPDK套件提供了ring client,其運行分為兩種情形:ring client運行在host OS和運行在虛機VM里面。
運行在Host OS:
Following the steps above to create a bridge, you can now add dpdk rings as a port to the vswitch. OVS will expect the DPDK ring device name to start with dpdkr and end with a portid.
ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr
DPDK rings client test application
Included in the test directory is a sample DPDK application for testing the rings. This is from the base dpdk directory and modified to work with the ring naming used within ovs.
location tests/ovs_client
To run the client :
cd /usr/src/ovs/tests/
ovsclient -c 1 -n 4 --proc-type=secondary -- -n "port id you gave dpdkr"
In the case of the dpdkr example above the "port id you gave dpdkr" is 0.
The application simply receives an mbuf on the receive queue of the ethernet ring and then places that same mbuf on the transmit ring of the ethernet ring.
運行在VM:
In addition to executing the client in the host, you can execute it within a guest VM. To do so you will need a patched qemu(支持IVSHMEM).
類似下面的場景:
【使用DPDK vhost-user】
第一步:設置OVS with DPDK
Following the steps above to create a bridge, you can now add DPDK vhost-user as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names, except that forward and backward slashes are prohibited in the names.
For vhost-user, the name of the port type is dpdkvhostuser
ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 type=dpdkvhostuser
This action creates a socket located at /usr/local/var/run/openvswitch/vhost-user-1, which you must provide to your VM on the QEMU command line.
第二步:設置Qemu
-
Configure sockets.
Pass the following parameters to QEMU to attach a vhost-user device:
-chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
where vhost-user-1 is the name of the vhost-user port added to the switch. Repeat the above parameters for multiple devices, changing the chardev path and id as necessary. Note that a separate and different chardev path needs to be specified for each vhost-user device. For example you have a second vhost-user port named 'vhost-user-2', you append your QEMU command line with an additional set of parameters:
-chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
2.Configure huge pages.
QEMU must allocate the VM's memory on hugetlbfs. Vhost-user ports access a virtio-net device's virtual rings and packet buffers mapping the VM's physical memory on hugetlbfs. To enable vhost-user ports to map the VM's memory into their process address space, pass the following paramters to QEMU:
-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc
3.Optional: Enable multiqueue support. QEMU needs to be configured with multiple queues and the number of queues must be less than or equal to Open vSwitch other_config:n-dpdk-rxqs. The $q below is the number of queues. The $v is the number of vectors, which is '$q x 2 + 2'.
-chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v
第三步:設置guest os的網卡驅動
If one wishes to use multiple queues for an interface in the guest, the driver in the guest operating system must be configured to do so. It is recommended that the number of queues configured be equal to '$q'.
For example, this can be done for the Linux kernel virtio-net driver with:
ethtool -L <interface> combined <$q>
-L: Changes the numbers of channels of the specified network device
combined: Changes the number of multi-purpose channels.
-
改進點源碼分析
【報文收發】
從上圖可以看出,DPDK接管了OVS從底層收發報文的部分。
原生OVS是內核態的,與虛機連接的接口是橋下綁定多個TAP設備;與物理網卡連接是通過vlan子接口。
OVS with DPDK搬到用戶態了,此時與虛機連接的接口和與物理網卡連接的接口都被DPDK所接管,與內核接口則有TAP和Socket了。
OVS with DPDK與虛機連接的接口和與物理網卡連接的接口都被DPDK所接管,其核心實現代碼:openvswitch-2.5.0/lib/netdev-dpdk.c
OVS with DPDK 支持的3類網絡接口:
啟動ovs-vswitchd服務的時候添加了--dpdk選項:
openvswitch-2.5.0/vswitchd/ovs-vswitchd.c
至於Intel提到對數據平面轉發、Tunnel等部分的優化,實際在ovs 2.5版本中未找到具體實現。
-
FD.io
-
FD.io
-
FD.io (Fast data - Input/Output) is a collection of several projects and libraries to amplify the transformation to support flexible, programmable and composable services on a generic hardware platform. FD.io offers the Software Defined Infrastructure developer community a landing site with multiple projects fostering innovations in software-based packet processing towards the creation of high-throughput, low-latency and resource-efficient IO services suitable to many architectures (x86, ARM, and PowerPC) and deployment environments (bare metal, VM, container).
At the heart of fd.io is Vector Packet Processing (VPP) technology.
-
VPP
-
VPP Overview
-
In development since 2002, VPP is production code currently running in shipping products.
It runs in user space on multiple architectures including x86, ARM, and Power architectures on both x86 servers and embedded devices. The design of VPP is hardware, kernel, and deployment (bare metal, VM, container) agnostic.
It runs completely in userspace.
VPP-powered FD.io is two orders of magnitude faster than currently available technologies.
The primary problem Cisco set out to solve with the development of vector packet processing (VPP) in 2002 was "Accelerating the NFV Data Plane".
The VPP Technology also provides a very high performance low level API. The API works via a shared memory message bus. The messages passed along the bus are specified in a simple IDL (Interface Definition Language) which is used to create C client libraries and Java client libraries. Support for generating bindings in additional languages could be added. These client libraries make it very easy to write external applications that programmatically control VPP. The shared memory message bus approach is very high performance.
Remote programmability can be achieved by having as your external app a Management Agent that exposes your SDN protocol of choice.
A 'Honeycomb' agent is available at launch which currently exposes YANG models for VPP functionality via NETCONF and RESTCONF. A controller which supports NETCONF/YANG, such as OpenDaylight, can 'mount' the Honeycomb management agent.
Primary Characteristics Of VPP
- Improved fault-tolerance and ISSU when compared to running similar packet processing in the kernel:
  - crashes seldom require more than a process restart
  - software updates do not require system reboots
  - development environment is easier to use and perform debug than similar kernel code
  - user-space debug tools (gdb, valgrind, wireshark)
  - leverages widely-available kernel modules (uio, igb_uio): DMA-safe memory
- Runs as a Linux user-space process:
  - same image works in a VM, in a Linux container, or over a host kernel
  - KVM and ESXi: NICs via PCI direct-map
  - vhost-user, netmap, virtio paravirtualized NICs
  - tun/tap drivers
  - DPDK poll-mode device drivers
- Integrated with the Intel DPDK, VPP supports existing NIC devices including:
  - Intel i40e, Intel ixgbe physical and virtual functions, Intel e1000, virtio, vhost-user, Linux TAP
  - HP rebranded Intel Niantic MAC/PHY
  - Cisco VIC
- Security issues considered:
  - Extensive white-box testing by Cisco's security team
  - Image segment base address randomization
  - Shared-memory segment base address randomization
  - Stack bounds checking
  - Debug CLI "chroot"
- The vector method of packet processing has been proven as the primary punt/inject path on major architectures.
- Supported Architectures
  - x86/64
- Supported Packaging Models
- The VPP platform supports package installation on the following operating systems:
  - Debian
  - Ubuntu 14.04
-
How VPP Works
【cache thrashing】
Supported by the data plane development kit's poll mode drivers (PMD) and ring buffer libraries, VPP aims to increase forwarding plane throughput by reducing the number of misses in flow/forwarding table caches while replacing standard serial packet lookups with a parallel approach.
Short-lived flows or high-entropy packet fields -- those likely to have differing values from packet to packet -- kill caches, hence the introduction of the megaflow (aggregate) cache into OVS in an attempt to turn mice (flows) into elephants.
Let's start from scratch: An empty cache is "cold," resulting in misses on all queries, of course. The cache "warms up" as those misses are subsequently used to populate it, at which point the cache is said to be "warm."
A warm cache should result in an appropriate number of query "hits", with either simple first-in, first-out (FIFO) methodologies or, more appropriately, least recently used (LRU) or least frequently used (LFU) algorithms deciding how old entries should be replaced by new ones. This replacement policy is more critical than it may appear, as a high cache churn rate could be due to a poor replacement algorithm. A more likely culprit, though, is those pesky short-lived flows, with each new addition resulting in a miss and the possible replacement of a long-term flow entry in the cache with one that is never seen again. An incessant eviction of useful data is (aptly) referred to as "cache thrashing."
In CPU parlance, this functionality leverages the instruction cache (i-cache), supported by the data cache (d-cache), which stores pre-fetched data needed to feed the i-cache. VPP primarily targets i-cache efficiency, although it gains some d-cache efficiencies as well.
【forwarding graph and graph node】
The forwarding graph, which essentially defines the forwarding operations for each given packet, comprises a number of "graph nodes," each with a different role to play in processing or directing the packet.
VPP technology is highly modular, allowing new graph nodes to be easily "plugged in" without changes to the underlying code base or kernel. This gives developers the potential to easily build any number of packet-processing devices with varying forwarding graphs, including not just switches and routers but also intrusion detection and prevention systems, firewalls and load balancers.
【scalar processing and vector processing】
With the bottleneck being the cache, even in the most highly tuned, all-user-space, DPDK-accelerated environments, the switch pipeline operates in a serial mode, handling one packet at a time. Even if there is a nice big DPDK FIFO chock-full of packets, they are sent through the forwarding graph individually. In computing parlance, this is called scalar processing.
Rather than working on single packets in a serial manner, VPP operates simultaneously on an array, or collection, of packets; this is called vector processing.
Rather than just grabbing the packet at the front of the line, the VPP engine takes a chunk of packets, up to a predetermined maximum of, say, 256. Naturally, the vector itself doesn't contain the actual packets but pointers to their locations in a buffer.
The "superframe" of N packets, as it has been referred to, proceeds to the first graph node, where the Ethernet header is decoded and the EtherType is identified. While temporal locality suggests that the EtherType will be identical across the vector (e.g. IPv4), naturally there is a chance that a group of different packets (e.g. IPv6) made it into the superframe. If so, the forwarding graph forks and the superframe is partitioned, with each partition proceeding to its own distinct next-hop graph node.
The problem with that traditional scalar packet processing is:
- thrashing occurs in the I-cache
- each packet incurs an identical set of I-cache misses
- no workaround to the above except to provide larger caches
By contrast, vector processing processes more than one packet at a time.
One of the benefits of the vector approach is that it fixes the I-cache thrashing problem. It also mitigates the dependent read latency problem (pre-fetching eliminates the latency).
This approach fixes the issues related to stack depth / D-cache misses on stack addresses. It improves "circuit time". The "circuit" is the cycle of grabbing all available packets from the device RX ring, forming a "frame" (vector) that consists of packet indices in RX order, running the packets through a directed graph of nodes, and returning to the RX ring. As processing of packets continues, the circuit time reaches a stable equilibrium based on the offered load.
As the vector size increases, processing cost per packet decreases because you are amortizing the I-cache misses over a larger N.
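To make the contrast concrete, here is a minimal, purely illustrative sketch (not VPP or OVS code) of scalar versus vector processing; pkt_t, rx_burst, process_ethernet and process_ipv4 are hypothetical stand-ins for graph-node work:

/* Illustrative pseudo-C only: scalar vs. vector packet processing. */
#define BATCH 256

typedef struct pkt pkt_t;
extern int  rx_burst(pkt_t **pkts, int max);   /* fills the array with packet pointers */
extern void process_ethernet(pkt_t *p);
extern void process_ipv4(pkt_t *p);

/* Scalar: each packet walks the whole pipeline alone, so every stage's
 * instructions are re-fetched per packet and the i-cache keeps thrashing. */
static void scalar_loop(void)
{
    pkt_t *p;
    while (rx_burst(&p, 1) == 1) {
        process_ethernet(p);
        process_ipv4(p);
    }
}

/* Vector: a whole batch passes through one stage before the next, so each
 * stage's instructions are fetched once and amortized over up to BATCH packets. */
static void vector_loop(void)
{
    pkt_t *v[BATCH];
    int n, i;
    while ((n = rx_burst(v, BATCH)) > 0) {
        for (i = 0; i < n; i++) process_ethernet(v[i]);
        for (i = 0; i < n; i++) process_ipv4(v[i]);
    }
}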
【temporal locality】
VPP operates on a simple principle with a (typically) scientific name: temporal locality -- or locality in time. In terms of application flows, this phenomenon notes the relationship between packets sampled within a short period of time and the strong likelihood that they are similar, if not identical, in nature. Packets with such attributes would reuse the same resources and will be accessing the same (cache) memory locations.
-
VPP Performance
The data below comes from the official wiki: https://wiki.fd.io/view/VPP/What_is_VPP?
One of the benefits of this implementation of VPP is its high performance on relatively low-power computing. This high level of performance is based on the following highlights:
- High-performance user-space network stack for commodity hardware
- The same code for host, VMs, Linux containers
- Integrated vhost-user virtio backend for high speed VM-to-VM connectivity
- L2 and L3 functionality, multiple encapsulations
- Leverages best-of-breed open source driver technology: Intel DPDK
- Extensible by use of plugins
- Control-plane / orchestration-plane via standards-based APIs
The rates reflect VPP and OVS-DPDK performance tested on a Haswell x86 platform with E5-2698v3 2x16C 2.3GHz. The graphs show NDR rates for 12-port 10GE, 16 cores, IPv4.
-
VPP Source Code Analysis
VPP implements complete IPv4 and IPv6 protocol stacks.
VPP has two defining characteristics: it describes multiple packets with a vector, and it processes multiple packets at a time.
So the first thing to understand is how VPP describes multiple packets with a vector. Note that a vector here is a dynamically sized array (in the packet path, an array of buffer indices), not a bitmap.
Three key data structures:
vlib_main_t: VPP's main control structure, which includes the graph-node data structures;
vlib_node_runtime_t: describes the runtime state of a graph node;
vlib_frame_t: a frame, i.e. the data structure carrying the vector of multiple packets;
An entry point for packet processing:
static uword ip4_input (vlib_main_t * vm,
vlib_node_runtime_t * node,
vlib_frame_t * frame)
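As a hand-written sketch (not the actual ip4_input body), this is roughly how a graph node walks the frame's packet vector; the node name and the per-packet work are placeholders:

/* Minimal sketch of a VPP 16.06 graph-node body: the frame carries buffer
 * indices, not the packets themselves. */
#include <vlib/vlib.h>

static uword
example_node_fn (vlib_main_t * vm,
                 vlib_node_runtime_t * node,
                 vlib_frame_t * frame)
{
  u32 *from = vlib_frame_vector_args (frame); /* vector of buffer indices */
  u32 n_left = frame->n_vectors;              /* number of packets in the frame */

  while (n_left > 0)
    {
      vlib_buffer_t *b = vlib_get_buffer (vm, from[0]);
      /* ... inspect the packet in b, pick the next node, enqueue ... */
      from += 1;
      n_left -= 1;
    }
  return frame->n_vectors;
}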
【VPP entry point】
vpp-16.06/vlib/vlib/main.c
【Registering a node】
Take the ARP processing node as an example.
Registering a node:
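A hedged sketch of what that registration looks like, written from memory of the pattern used by arp_input_node in vnet/vnet/ethernet/arp.c; the field values are indicative rather than copied from the 16.06 tree:

/* Sketch of the VLIB_REGISTER_NODE pattern. */
VLIB_REGISTER_NODE (arp_input_node, static) = {
  .function = arp_input,              /* the node's packet-processing function */
  .name = "arp-input",                /* name used when wiring up the graph */
  .vector_size = sizeof (u32),        /* each vector element is a buffer index */
  .n_next_nodes = ARP_INPUT_N_NEXT,   /* number of possible next nodes */
  .next_nodes = {
    [ARP_INPUT_NEXT_DROP] = "error-drop",
  },
};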
【Redirecting packets to a user-defined node】
int vnet_hw_interface_rx_redirect_to_node (vnet_main_t * vnm, u32 hw_if_index,
u32 node_index)
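A hedged usage sketch (my_node is assumed to be a node registered as above; hw_if_index identifies the interface):

/* Steer all received packets of one hardware interface into a custom node. */
#include <vnet/vnet.h>

static void
redirect_example (u32 hw_if_index, vlib_node_registration_t *my_node)
{
  vnet_main_t *vnm = vnet_get_main ();
  vnet_hw_interface_rx_redirect_to_node (vnm, hw_if_index, my_node->index);
  /* per my reading of the code, passing ~0 as node_index cancels the redirect:
   * vnet_hw_interface_rx_redirect_to_node (vnm, hw_if_index, ~0); */
}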
-
SDN, NFV, and OpenStack
The three are related but distinct, and each plays its own role with its own responsibilities.
The OPNFV framework diagram shows two layers:
One layer is network resources:
OpenStack acts as the orchestration system for virtualized network resources, OpenDaylight and similar controllers handle the scheduling and control of those resources, and KVM + OVS implement the virtualization (slicing) of the network resources themselves.
The other layer is network functions:
There is not yet a mature open-source project for orchestrating virtualized network functions. OpenContrail provides scheduling and control of virtualized network functions, and the core way to realize network function virtualization is to recombine virtualized network resources into VNFs:
- OVS
- OpenContrail's vRouter
- OpenStack's DVR
Current practice in network resource virtualization and network function virtualization is as follows:
How do OpenDaylight and OpenContrail fit into OpenStack, the resource orchestration system?
Basically, both rely on the plugin mechanism of OpenStack Neutron:
A third-party networking project registers an ML2 (Modular Layer 2) plugin with Neutron that acts as its proxy for interacting with the Neutron API service.
At the traffic-forwarding level, OpenDaylight interacts with OVS through Neutron, whereas OpenContrail implements its own vRouter and has its own controller talk to that vRouter.
-
VNF Performance Improvement Practice
Taking vFW, a VNF with very demanding network I/O requirements, we experimented at all three optimization points and obtained good results.
The figure below shows that by optimizing the VM-to-host-OS channel with SR-IOV, vFW's I/O throughput improved by tens of times compared with virtio:
On the vFW DataPath, since our DataPath mechanism is already similar to DPDK's, we only borrowed DPDK's hugepage memory management. After reworking vFW's memory management to place packet data structures in hugepage-backed memory, 64-byte packet performance improved by only about 1-3%, which is not a dramatic gain.
-
Summary
- OpenStack is the orchestration system for virtual network resources; OpenDaylight, OpenContrail and the like are the scheduling and control systems for those resources; DVR or OVS is the carrier of virtual network resources and the executor of virtual network functions.
- NFV describes the implementation level of network function virtualization, and VNFs are NFV's key output. OpenContrail can be regarded as a system that combines VNF orchestration and control.
- DPDK was not invented from scratch by Intel; its framework grew out of earlier NP and multicore practice, and, as the earlier analysis shows, it does not go beyond the solution frameworks proposed back then by RMI and Cavium. RMI implemented the data plane in kernel space while Cavium's framework ran in user space; influenced by the many vendors building on Cavium chips, Intel chose the user-space approach. So far the user-space approach has proven much better than the kernel-space one in terms of debuggability and portability.
- VNFs are realized through NFV, and NFV is currently implemented mainly in two ways: virtual machines and containers.
- Since a VNF is, after all, produced by software virtualization, its network I/O throughput suffers considerably. How can VNF network I/O performance be improved? From the VM-based implementation point of view, there are three main optimization points: the VNF's DataPath, the virtual channel between the VM and the host OS, and the channel between the host OS and the physical NIC.
- The main DataPath optimization options are OpenDP, Snabb, libuinet, mTCP, NUSE, OpenOnload, VPP and so on; the main VM-to-host-OS channel options are virtio-net, vhost-net, vhost-user, ivshmem, macvtap, multiqueue virtio-net, SR-IOV, VMDq and so on; the main host-OS-to-physical-NIC channel options are Snabb, Lagopus, OVS, Rump Kernel and KVM4NFV.
- Intel's DPDK has its own optimizations at all three points: DataPath optimization mainly in the PMD drivers, VM-to-host-OS channel optimization mainly in ivshmem, vhost-user, SR-IOV, etc., and host-OS-to-NIC channel optimization mainly in its rework of OVS.
-
Appendix
-
PCIe Specification
-
In a computer system, a PCI device has attributes such as bus number, device number, function number, VID, DID, RID, SID/SVID, interrupt number and device name.
●VID: Vendor Identification, or Vendor ID — the identifier of the vendor that designed the device, i.e. the commonly mentioned vendor ID. It is assigned by the PCI-SIG and is a unique vendor identifier; duplicates are not allowed. For example, ATI's VID is 0x1002 and NVIDIA's VID is 0x10DE.
●DID: Device Identification, or Device ID — the code identifying the device itself. It mainly distinguishes different models of the same kind of device and is generally assigned by the designing vendor according to the PCI specification; devices from different vendors may share the same DID (since each vendor has a unique VID, there is no ambiguity). For example, the ATI 9800 and ATI 9800XT both use the R350 design yet have different device IDs: 0x4E48 and 0x4E4A respectively.
●SID: Subsystem Identification, or Sub-ID — the subsystem (secondary device) identifier, a secondary code for devices built under license. Unlike the DID, this code identifies not the original vendor's device but the device as built by a second-tier manufacturer. If the original vendor manufactures the device itself, this code may be the same as the DID.
●SVID: Subsystem Vendor Identification, or Sub-Vendor ID — the identifier of the second-tier manufacturer, certified by the PCI-SIG; it is likewise a unique vendor identifier that may not be duplicated, although a vendor's VID and SVID may be the same. The SID and SVID are usually combined as the device's secondary identification IDs. For example, the subsystem IDs of an 8139 NIC combine to 0x813910EC: 8139 is the SID and 10EC is the SVID.
●RID: Revision ID, i.e. the commonly mentioned revision number, REV.X.
●CC: Class Code — the code distinguishing devices of different classes, or different variants within the same class. Each kind of device (display, sound card, SCSI, USB, etc.) has its own standard classification. For example, Class Code 000c0300 denotes a UHCI USB controller, while Class Code 000c0310 denotes an OHCI USB controller.
Usually, the VID and DID are referred to as the design vendor ID and design device ID, while the SVID and SID are referred to as the manufacturing vendor ID and manufacturing device ID.
-
pf_ring and vring
pf_ring is a new protocol family introduced on Linux to achieve zero-copy data processing between user space and kernel space; it is a new kind of socket mechanism.
vring is one implementation of virtio's virtqueue. A virtqueue is simply a queue between guest and host; different I/O devices use different numbers of virtqueues — for network I/O, for example, there is at least one pair of RX and TX queues.
The two are similar in design: both user space and kernel space can read and write a shared data buffer. pf_ring uses a ring buffer, while vring is a FIFO queue.
The figure below illustrates how pf_ring works:
- Introduces a new ring-buffer-based socket, the pf_ring socket
- Each pf_ring socket created is allocated its own ring buffer
- When the socket is bound to a NIC, packets received by the NIC are written by DMA directly into that socket's ring buffer
- The application can read packet data directly from the ring buffer (a minimal usage sketch follows this list)
- When new packets arrive, they may overwrite buffer slots that the application has already read
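For reference, a minimal receive loop using the userspace libpfring API might look like the sketch below; it is written from memory of the classic API, with "eth0" and the snaplen as assumptions, so exact flags and fields may differ between PF_RING versions:

/* Hedged sketch of a minimal libpfring capture loop. */
#include <pfring.h>

int main(void)
{
    struct pfring_pkthdr hdr;
    u_char *pkt;
    pfring *ring = pfring_open("eth0", 1518 /* snaplen */, PF_RING_PROMISC);

    if (ring == NULL)
        return 1;
    pfring_enable_ring(ring);

    for (;;) {
        /* zero-copy: 'pkt' points into the ring buffer shared with the kernel */
        if (pfring_recv(ring, &pkt, 0, &hdr, 1 /* wait */) > 0) {
            /* ... inspect hdr.len / hdr.caplen and process the packet ... */
        }
    }
}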
Note: Unix-like systems have a packet receive mechanism called device polling; in Linux the corresponding API is usually called NAPI. It works as follows:
- When the NIC receives a packet, it raises an interrupt to the system
- On receiving the interrupt, the system disables the NIC's interrupts, activates a polling routine that pulls packets from the NIC, and then re-enables the NIC's interrupts
The figure below shows how vring works:
A vring consists of three parts (a struct-level sketch follows this list):
- the packet descriptors (the descriptor table)
- the used ring and the available ring, whose entries refer to entries in the descriptor table
- the virtqueue itself, which is in effect the queue used to transfer data between the guest virtio driver and the virtio PCI device
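For concreteness, the three parts map onto the layout defined by the virtio specification (mirrored in Linux's include/uapi/linux/virtio_ring.h); a condensed sketch:

/* Condensed from the virtio spec: the three parts of a vring.
 * All of them live in guest memory shared with the host. */
#include <stdint.h>

struct vring_desc {          /* descriptor table entry: one buffer */
    uint64_t addr;           /* guest-physical address of the buffer */
    uint32_t len;
    uint16_t flags;          /* e.g. NEXT (chained), WRITE (device writes) */
    uint16_t next;           /* index of the next descriptor in a chain */
};

struct vring_avail {         /* "available" ring: driver -> device */
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[];         /* indices into the descriptor table */
};

struct vring_used_elem {
    uint32_t id;             /* head descriptor index that was consumed */
    uint32_t len;            /* bytes written by the device */
};

struct vring_used {          /* "used" ring: device -> driver */
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[];
};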
-
MSI and MSI-X
MSI stands for Message Signaled Interrupt.
MSI appears in the PCI 2.2 and PCIe specifications as an in-band interrupt signaling mechanism. Traditional interrupts each have a dedicated interrupt pin: when an interrupt fires, the pin's level changes (it is usually pulled low). INTx is that traditional externally triggered interrupt mechanism, which uses dedicated lines to carry the control information. PCIe, however, has no separate interrupt pins, so it uses special messages to emulate the assertion and de-assertion of an interrupt pin. With MSI, a device writes a small amount of data to a designated MMIO address range, and the chipset turns that write into the corresponding interrupt to the CPU.
Electrically and mechanically, MSI removes the need for interrupt pins and increases the number of available interrupts: traditional PCI allows each device at most 4 interrupt lines, and because those lines are shared most devices end up with only one, whereas MSI allows a device to have 1, 2, 4, 8, 16 or even 32 interrupts.
Using MSI also brings a small performance advantage. With traditional pin-based interrupts, software may race with the hardware when it reads memory after an interrupt arrives: the device's data is transferred mainly by DMA, and the pin interrupt can arrive before the DMA transfer has completed, in which case the CPU cannot get the data yet and simply spins. MSI does not have this problem, because the MSI write is issued only after the DMA transfer has completed.
-
References
- Software package: Intel DPDK 2.0
- Software package: OVS 2.5.0
- Software package: VPP 16.06
- 《Intel® Open Network Platform Release 2.1 Reference Architecture Guide》