1.1 Overview of offload techniques
First, the term "offload" itself: offloading means taking a function that would otherwise be implemented in software and implementing it in hardware. Packet-processing work that would normally be done in the operating system (such as segmentation and reassembly) can then be done on the NIC, lowering CPU consumption while improving processing performance. In Neutron, VXLAN-based network virtualization puts an extra burden on the server CPU for encapsulation, decapsulation and checksum calculation, and both VXLAN encapsulation and decapsulation are performed by OVS. With VXLAN offload, encapsulation and decapsulation are handed to the NIC or to a hardware switch; NIC VXLAN offload, then, simply means using the NIC to perform VXLAN encapsulation and decapsulation.
Next, the main techniques that implement offload (a quick way to inspect them with ethtool is sketched after the list):
- LSO (Large Send Offload): the protocol stack hands a large buffer straight to the NIC, which performs the segmentation
- LRO (Large Receive Offload): the NIC coalesces scattered small packets and returns a single large packet to the protocol stack
- GSO (Generic Segmentation Offload): with LSO the user has to know whether the NIC supports it; GSO detects this automatically, using LSO when available and falling back to software segmentation otherwise
- GRO (Generic Receive Offload): with LRO the user has to know whether the NIC supports it; GRO detects this automatically, using LRO when available and falling back to software aggregation otherwise
- TSO (TCP Segmentation Offload): segmentation offload specifically for TCP; similar to LSO/GSO, but explicitly limited to TCP
- UFO (UDP Fragmentation Offload): the UDP counterpart; since UDP has no segmentation of its own, this is generally IP-level fragmentation
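These features can be inspected and toggled per interface with ethtool; a minimal sketch, assuming a hypothetical interface named eth0 (feature names are as reported by ethtool -k):

# Show the current state of the offload features discussed above
ethtool -k eth0 | grep -E 'tcp-segmentation-offload|udp-fragmentation-offload|generic-segmentation-offload|generic-receive-offload|large-receive-offload'
# -K (capital) changes a feature, -k (lowercase) only lists them
ethtool -K eth0 gro on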
Comparison of these techniques
(1) LSO vs LRO
These two correspond to sending and receiving respectively. Data travels across a network as discrete packets, each limited by the MTU, so when a large amount of data is sent, the OS protocol stack has to split it into packets no bigger than the MTU; doing all of this on the CPU drives utilization up. With LSO, when the data to send exceeds the MTU, the OS submits a single request to the NIC; the NIC pulls the data, splits it and encapsulates it so that no transmitted packet exceeds the MTU. LRO works in the opposite direction: when the NIC receives a burst of small fragments, it reassembles them into a larger chunk and hands it to the OS in one go. Both techniques mainly target TCP traffic.
(2) TSO vs UFO
These correspond to TCP and UDP traffic respectively. TSO pushes part of the TCP processing down to the NIC to reduce the protocol stack's CPU load. The Ethernet MTU is usually 1500 bytes; after subtracting the IP header (20 bytes in the standard case) and the TCP header (20 bytes in the standard case), the TCP MSS (Max Segment Size) is 1460 bytes. When the application hands down more data than the MSS, the protocol stack segments the payload so that no generated packet exceeds the MTU. For a NIC that supports TSO/GSO this is unnecessary: a payload of up to 64 KB can be passed down in one piece, the IP layer does not fragment it, and it reaches the NIC driver intact; a TSO/GSO-capable NIC then generates the TCP/IP headers and frame headers itself. This offloads a lot of work that the CPU would otherwise do in the protocol stack, such as memory operations and checksum calculation, to the NIC.
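The 64 KB figure corresponds to the kernel's per-device GSO limit, which can be read from sysfs; a small sketch, again assuming a hypothetical interface eth0:

# Largest buffer the stack will hand down as a single GSO/TSO "super-packet" (typically 65536)
cat /sys/class/net/eth0/gso_max_size
# Largest number of segments the device will accept per super-packet
cat /sys/class/net/eth0/gso_max_segs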
1.2 NIC VXLAN offload
The previous section covered the basic offload concepts; this section looks at how VXLAN offload works in detail. Several tunneling technologies are used for network overlays in virtualized environments, mainly VXLAN, NVGRE and STT. Taking VXLAN as the example, it uses MAC-in-UDP encapsulation to carry traffic. Besides the UDP header, VXLAN introduces additional packet processing: adding and removing the extra headers means the CPU has to do more work per packet. The Linux drivers for hardware NICs already support VXLAN offload, and the feature can be configured with the ethtool commands below:
ethtool -k ethX                                    # list ethX's offload features and their current state
ethtool -K ethX tx-udp_tnl-segmentation [off|on]   # enable or disable VXLAN (UDP tunnel) segmentation offload in Linux
With NIC VXLAN offload, virtual network performance in overlay scenarios improves substantially. In essence, the VXLAN encapsulation format is similar to an L2VPN: Layer 2 Ethernet frames are encapsulated in UDP packets so that they can cross an underlay L3 network and interconnect different servers or even different data centers.
With VXLAN, the packets generated or received by a VM are carried inside an outer UDP packet, so features such as TCP segmentation offload and TCP checksum offload no longer apply to the inner TCP traffic of the VM. This significantly hurts VM-to-VM communication performance and leads to a poor user experience. To solve this, NIC vendors introduced NIC VXLAN offload.
NIC VXLAN offload enhances the NIC and, together with the driver, lets the NIC locate the inner Ethernet frame inside the VXLAN encapsulation, so that TSO and TCP checksum offload can operate on the inner frame and TCP performance is restored.
VXLAN is currently the mainstream choice for deploying virtual networks. When encapsulation and decapsulation are done by the CPU, they consume a large amount of CPU and other system resources; doing VXLAN encapsulation and decapsulation on a general-purpose x86 CPU can drive CPU usage to around 50%, so NICs that support VXLAN offload are worth considering to reduce this overhead. Broadcom, Intel, Mellanox, QLogic and other NIC vendors all support VXLAN offload. Although the products differ, there is a standard VXLAN offload interface, so the feature can be enabled without code changes and adds no work at the code level.
tx-udp_tnl-segmentation
Overlay networks such as VxLAN are used more and more. An overlay lets users create, configure and manage the virtual network connectivity they need without being constrained by the physical network, and it lets multiple tenants share a single physical network, improving utilization. There are many kinds of overlay network, but the most representative one is VxLAN, which is a MAC-in-UDP design.
The VxLAN format reveals two performance problems with overlay networks. The first is the added overhead: VxLAN wraps the original Ethernet frame in an extra Ethernet + IP + UDP + VXLAN header, so every frame carries about 50 additional bytes, which means an overlay network is inevitably less efficient than the underlay. The second problem is worse than transferring 50 extra bytes: those 50 bytes have to be processed. They contain four headers, and each one involves copying and computation that costs CPU cycles, at a time when the CPU already has less and less time available to spend on each packet.
The 50 bytes of VxLAN overhead cannot be avoided, so the best we can do is reduce their impact. The Jumbo Frames idea applies here as well: since the 50 bytes are fixed, the larger the packet, the smaller their relative cost.
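The test configuration later in this article follows exactly this idea; a minimal sketch of the arithmetic, assuming a jumbo-frame underlay MTU of 9000 and placeholder interface names:

# VXLAN overhead: 14 (outer Ethernet) + 20 (outer IP) + 8 (UDP) + 8 (VXLAN) = 50 bytes,
# so overlay-facing interfaces get the underlay MTU minus 50
ip link set dev eth0 mtu 9000           # underlay / physical NIC
ip link set dev host-veth0 mtu 8950     # overlay-facing interface: 9000 - 50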
Let's first look at how a VM is connected to the network: the VM connects through QEMU to a TAP device on the host, then goes through the virtual switch to the VTEP (VxLAN Tunnel End Point), which adds the VxLAN encapsulation and sends the packet out through the host NIC.
Ideally, a large chunk of VxLAN data would be handed straight to the NIC, and the NIC would do the remaining segmentation and fragmentation, add the VxLAN encapsulation to each resulting small packet, and compute the checksums. That would reduce VxLAN's impact on VM networking to a minimum. In practice this is possible, but only under a set of preconditions.
First, the VM has to send large packets to the host. The VM runs its own operating system with its own TCP/IP stack, so it is perfectly capable of splitting large packets into small ones itself. As described earlier, only TSO actually delivers a large packet all the way to the NIC; with GSO, the large packet is split into smaller ones just before it enters the driver. So the requirements here are: the VM's NIC supports TSO (virtio does by default), TSO is enabled (it is by default), and the VM is sending TCP data. A quick check from inside the guest is sketched below.
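A minimal check from inside the guest, assuming the virtio interface is named eth0:

# Inside the VM: confirm TSO (and GSO) are enabled on the virtio NIC
ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'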
Next, after passing through QEMU, being forwarded by the virtual switch and encapsulated by the VTEP, the large TCP buffer has become a VxLAN packet: 50 bytes of VxLAN headers have been added to it. Now the problem: this is really TCP data, but because of the VxLAN encapsulation it now looks like UDP. If the operating system does nothing special, then, as described earlier, it goes through GSO and IP fragmentation and is split into small packets just before reaching the NIC. The NIC's TSO, even if supported, goes unused, and worse, no TCP segmentation has been performed at all; the TCP segmentation whose necessity the previous article spent so much space on is lost here as well.
Modern NICs therefore expose, besides TSO and GSO, an additional feature: tx-udp_tnl-segmentation. When it is enabled, the operating system recognizes that the UDP data carrying the VxLAN encapsulation is tunnel traffic and hands the whole large VxLAN buffer to the NIC. The NIC performs TCP segmentation on the inner TCP data and then adds the 50-byte VxLAN encapsulation to each TCP segment. This reduces the impact of VxLAN encapsulation on VM networking to a minimum.
To achieve this, the host NIC must support both TSO and tx-udp_tnl-segmentation. If either is missing, the kernel falls back to GSO: the large VxLAN-encapsulated TCP buffer is TCP-segmented in software just before it is handed to the driver, and each TCP segment gets its own VxLAN encapsulation. The host-side check is sketched after this paragraph.
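A minimal host-side check, assuming the physical interface is eth0 (names vary per system):

# On the host: both features must be present and enabled for full hardware offload
ethtool -k eth0 | grep -E 'tcp-segmentation-offload|tx-udp_tnl-segmentation'
# "off [fixed]" means the driver/firmware cannot offload tunneled traffic at all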
If TSO is disabled inside the VM, or the VM is sending UDP data, then the VM's TCP/IP stack invokes GSO and completes segmentation and fragmentation just before handing packets to the VM's NIC driver. What the VM ultimately sends to QEMU is already a series of small packets, and no matter how the host is configured it has to process each of these small packets and VxLAN-encapsulate them individually.
VXLAN hardware offload
The Intel X540 supports VXLAN offload by default:
tx-udp_tnl-segmentation: on

static int __devinit ixgbe_probe(struct pci_dev *pdev,
                                 const struct pci_device_id __always_unused *ent)
{
    ...
#ifdef HAVE_ENCAP_TSO_OFFLOAD
    netdev->features |= NETIF_F_GSO_UDP_TUNNEL; /* UDP tunnel offload */
#endif
    ...
}

static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
    ...
    [NETIF_F_GSO_UDP_TUNNEL_BIT] = "tx-udp_tnl-segmentation",
    ...
};
On a NIC whose driver does not implement the feature, it is reported as fixed off and cannot be turned on (note also that changing a feature requires -K; -k only lists them):
[root@bogon ~]# ethtool -k enahisic2i0 tx-udp_tnl-segmentation on
ethtool: bad command line argument(s)
For more information run ethtool -h
[root@bogon ~]# ethtool -k enahisic2i3 tx-udp_tnl-segmentation on
ethtool: bad command line argument(s)
For more information run ethtool -h
[root@bogon ~]# ethtool -k enahisic2i0 | grep tx-udp
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
On the transmit side, the kernel's VXLAN device registers vxlan_xmit as its transmit handler:
static const struct net_device_ops vxlan_netdev_ops = {
    .ndo_init            = vxlan_init,
    .ndo_uninit          = vxlan_uninit,
    .ndo_open            = vxlan_open,
    .ndo_stop            = vxlan_stop,
    .ndo_start_xmit      = vxlan_xmit,
    .ndo_get_stats64     = ip_tunnel_get_stats64,
    .ndo_set_rx_mode     = vxlan_set_multicast_list,
    .ndo_change_mtu      = vxlan_change_mtu,
    .ndo_validate_addr   = eth_validate_addr,
    .ndo_set_mac_address = eth_mac_addr,
    .ndo_fdb_add         = vxlan_fdb_add,
    .ndo_fdb_del         = vxlan_fdb_delete,
    .ndo_fdb_dump        = vxlan_fdb_dump,
};
The transmit call chain is vxlan_xmit --> vxlan_xmit_one --> vxlan_xmit_skb:
int vxlan_xmit_skb(struct vxlan_sock *vs,
                   struct rtable *rt, struct sk_buff *skb,
                   __be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
                   __be16 src_port, __be16 dst_port, __be32 vni)
{
    struct vxlanhdr *vxh;
    struct udphdr *uh;
    int min_headroom;
    int err;

    if (!skb->encapsulation) {
        skb_reset_inner_headers(skb);
        skb->encapsulation = 1;
    }

    min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len
                   + VXLAN_HLEN + sizeof(struct iphdr)
                   + (vlan_tx_tag_present(skb) ? VLAN_HLEN : 0);

    /* Need space for new headers (invalidates iph ptr) */
    err = skb_cow_head(skb, min_headroom);
    if (unlikely(err))
        return err;

    if (vlan_tx_tag_present(skb)) {
        if (WARN_ON(!__vlan_put_tag(skb, skb->vlan_proto,
                                    vlan_tx_tag_get(skb))))
            return -ENOMEM;

        skb->vlan_tci = 0;
    }

    vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
    vxh->vx_flags = htonl(VXLAN_FLAGS);
    vxh->vx_vni = vni;

    __skb_push(skb, sizeof(*uh));
    skb_reset_transport_header(skb);
    uh = udp_hdr(skb);

    uh->dest = dst_port;
    uh->source = src_port;

    uh->len = htons(skb->len);
    uh->check = 0;

    err = handle_offloads(skb);
    if (err)
        return err;

    return iptunnel_xmit(rt, skb, src, dst, IPPROTO_UDP, tos, ttl, df, false);
}
SKB_GSO_UDP_TUNNEL
When transmitting, the VXLAN device sets SKB_GSO_UDP_TUNNEL:
static int handle_offloads(struct sk_buff *skb)
{
    if (skb_is_gso(skb)) {
        int err = skb_unclone(skb, GFP_ATOMIC);
        if (unlikely(err))
            return err;

        skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
    } else if (skb->ip_summed != CHECKSUM_PARTIAL)
        skb->ip_summed = CHECKSUM_NONE;

    return 0;
}
Note that this feature only matters when the inner packet is TCP. As discussed earlier, ixgbe does not support UFO, so UDP packets end up being segmented in software by GSO when they are pushed to the physical NIC (in dev_hard_start_xmit).
static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
                                         netdev_features_t features)
{
    struct sk_buff *segs = ERR_PTR(-EINVAL);
    unsigned int mss;
    __wsum csum;
    struct udphdr *uh;
    struct iphdr *iph;

    if (skb->encapsulation &&
        (skb_shinfo(skb)->gso_type &
         (SKB_GSO_UDP_TUNNEL | SKB_GSO_UDP_TUNNEL_CSUM))) {
        segs = skb_udp_tunnel_segment(skb, features, false); /* segment the tunneled skb */
        goto out;
    }
    ...
}
Testing
VXLAN OFFLOAD
It’s been a while since I last posted due to work and life in general. I’ve been working on several NFV projects and thought I’d share some recent testing that I’ve been doing…so here we go :)
Let me offload that for ya!
In a multi-tenancy environment (OpenStack, Docker, LXC, etc), VXLAN solves the limitation of 4094 VLANs/networks, but introduces a few caveats:
- Performance impact on the data path due to encapsulation overhead
- Inability to use hardware offload capabilities on the inner packet
These extra packet processing workloads are handled by the host operating system in software, which can result in increased overall CPU utilization and reduced packet throughput.
Network interface cards like the Intel X710 and Emulex OneConnect can relieve some of this CPU load by processing the encapsulation workload in the physical NIC.
Below is a simple lab setup to test the VXLAN data path with offload-capable hardware. The main focus is to compare the effect of VXLAN offloading when traffic runs directly over a physical interface versus over a bridge interface.
Test Configuration
The lab topology consists of the following:
- Nexus 9372 (tenant leaf) w/ jumbo frames enabled
- Cisco UCS 220 M4S (client)
- Cisco UCS 240 M4X (server)
- Emulex OneConnect NICs
(Figures: hardware specs and lab topology)
Four types of traffic flows have been used to compare the impact of Emulex’s VXLAN offload when the feature has been enabled or disabled:
ethtool -K <eth0/eth1> tx-udp_tnl-segmentation <on/off>
Tools
Netperf was used to generate TCP traffic between the client and the server. It is a lightweight user-level tool widely used for network measurement and consists of two binaries (a minimal invocation is sketched after the list):
- netperf - user-level process that connects to the server and generates traffic
- netserver - user-level process that listens and accepts connection requests
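A sketch of how the two binaries are typically invoked; the 192.168.10.20 address is the server's tenant address from the configuration below, and the exact options used for the published runs are in the linked scripts, not reproduced here:

# On the server: start the listener (default control port 12865)
netserver
# On the client: 10-second TCP stream test, reporting local and remote CPU utilization
netperf -H 192.168.10.20 -l 10 -t TCP_STREAM -c -C -- -m 16384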
MTU considerations: VXLAN tunneling adds 50 bytes (14 eth + 20 ip + 8 udp + 8 vxlan) to the VM Ethernet frame. Make sure the MTU of the NIC that sends the packets accounts for the tunneling overhead (the configuration below shows the MTU adjustment); a quick way to verify the adjusted path MTU is sketched below.
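A minimal verification sketch, assuming the 8950-byte overlay MTU used below (ICMP adds 28 bytes of IP + ICMP header, so the payload is 8950 - 28 = 8922):

# From inside the tenant1 namespace, send a non-fragmentable packet at the full overlay MTU
ip netns exec tenant1 ping -M do -s 8922 -c 3 192.168.10.20
# If this fails with "message too long", an MTU somewhere along the path is still at its default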
Client Configuration
# Update system
yum update -y
# Install and start OpenvSwitch
yum install -y openvswitch
service openvswitch start
# Create bridge
ovs-vsctl add-br br-int
# Create VXLAN interface and set destination VTEP
ovs-vsctl add-port br-int vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=<server ip> options:key=10 options:dst_port=4789
# Create tenant namespaces
ip netns add tenant1
# Create veth pairs
ip link add host-veth0 type veth peer name host-veth1
ip link add tenant1-veth0 type veth peer name tenant1-veth1
# Link primary veth interfaces to namespaces
ip link set tenant1-veth0 netns tenant1
# Add IP addresses
ip a add dev host-veth0 192.168.0.10/24
ip netns exec tenant1 ip a add dev tenant1-veth0 192.168.10.10/24
# Bring up loopback interfaces
ip netns exec tenant1 ip link set dev lo up
# Set MTU to account for VXLAN overhead
ip link set dev host-veth0 mtu 8950
ip netns exec tenant1 ip link set dev tenant1-veth0 mtu 8950
# Bring up veth interfaces
ip link set dev host-veth0 up
ip netns exec tenant1 ip link set dev tenant1-veth0 up
# Bring up host interfaces and set MTU
ip link set dev host-veth1 up
ip link set dev host-veth1 mtu 8950
ip link set dev tenant1-veth1 up
ip link set dev tenant1-veth1 mtu 8950
# Attach ports to OpenvSwitch
ovs-vsctl add-port br-int host-veth1
ovs-vsctl add-port br-int tenant1-veth1
# Enable VXLAN offload
ethtool -K eth0 tx-udp_tnl-segmentation on
ethtool -K eth1 tx-udp_tnl-segmentation on
Server Configuration
# Update system
yum update -y
# Install and start OpenvSwitch
yum install -y openvswitch
service openvswitch start
# Create bridge
ovs-vsctl add-br br-int
# Create VXLAN interface and set destination VTEP
ovs-vsctl add-port br-int vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=<client ip> options:key=10 options:dst_port=4789
# Create tenant namespaces
ip netns add tenant1
# Create veth pairs
ip link add host-veth0 type veth peer name host-veth1
ip link add tenant1-veth0 type veth peer name tenant1-veth1
# Link primary veth interfaces to namespaces
ip link set tenant1-veth0 netns tenant1
# Add IP addresses
ip a add dev host-veth0 192.168.0.20/24
ip netns exec tenant1 ip a add dev tenant1-veth0 192.168.10.20/24
# Bring up loopback interfaces
ip netns exec tenant1 ip link set dev lo up
# Set MTU to account for VXLAN overhead
ip link set dev host-veth0 mtu 8950
ip netns exec tenant1 ip link set dev tenant1-veth0 mtu 8950
# Bring up veth interfaces
ip link set dev host-veth0 up
ip netns exec tenant1 ip link set dev tenant1-veth0 up
# Bring up host interfaces and set MTU
ip link set dev host-veth1 up
ip link set dev host-veth1 mtu 8950
ip link set dev tenant1-veth1 up
ip link set dev tenant1-veth1 mtu 8950
# Attach ports to OpenvSwitch
ovs-vsctl add-port br-int host-veth1
ovs-vsctl add-port br-int tenant1-veth1
# Enable VXLAN offload
ethtool -K eth0 tx-udp_tnl-segmentation on
ethtool -K eth1 tx-udp_tnl-segmentation on
Offload verification
[root@client ~]# dmesg | grep VxLAN
[ 6829.318535] be2net 0000:05:00.0: Enabled VxLAN offloads for UDP port 4789
[ 6829.324162] be2net 0000:05:00.1: Enabled VxLAN offloads for UDP port 4789
[ 6829.329787] be2net 0000:05:00.2: Enabled VxLAN offloads for UDP port 4789
[ 6829.335418] be2net 0000:05:00.3: Enabled VxLAN offloads for UDP port 4789
[root@client ~]# ethtool -k eth0 | grep tx-udp
tx-udp_tnl-segmentation: on
[root@server ~]# dmesg | grep VxLAN
[ 6829.318535] be2net 0000:05:00.0: Enabled VxLAN offloads for UDP port 4789
[ 6829.324162] be2net 0000:05:00.1: Enabled VxLAN offloads for UDP port 4789
[ 6829.329787] be2net 0000:05:00.2: Enabled VxLAN offloads for UDP port 4789
[ 6829.335418] be2net 0000:05:00.3: Enabled VxLAN offloads for UDP port 4789
[root@server ~]# ethtool -k eth0 | grep tx-udp
tx-udp_tnl-segmentation: on
Testing
As stated before, Netperf was used to measure throughput and CPU utilization on both the server and the client side. The tests were run over the bridged interface in the tenant1 namespace, first with VXLAN offload off and then with it on; a sketch of the kind of command line that produces the raw results below follows.
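A hedged sketch of a run matching the format of the raw output below; the loop, the namespace invocation and the message sizes are inferred from the configuration and results above rather than copied from the original scripts, and the "+/-2.500% @ 99% conf" banner corresponds to netperf's standard confidence-interval options (-I sets the confidence level and interval width, -i bounds the iterations):

# From inside the tenant1 namespace on the client, sweep message sizes with CPU reporting
for msg in 4096 8192 16384 32768 65536; do
    ip netns exec tenant1 netperf -H 192.168.10.20 -l 10 -t TCP_STREAM \
        -I 99,5 -i 10,3 -c -C -- -m $msg
done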
Copies of the netperf scripts can be found here:
TCP stream testing UDP stream testing
(Charts: throughput, % CPU utilization on the server side, and % CPU utilization on the client side.)
I conducted several TCP stream tests and saw the following results with different buffer/socket sizes (charts for socket sizes of 128K, 32K and 4K, sender and receiver).
NETPERF Raw Results:
Offload Off:
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 6.663%
!!! Local CPU util : 14.049%
!!! Remote CPU util : 13.944%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 4096 10.00 9591.78 1.18 0.93 0.483 0.383
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 4.763%
!!! Local CPU util : 7.529%
!!! Remote CPU util : 10.146%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 8192 10.00 9200.11 0.94 0.90 0.402 0.386
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 4.469%
!!! Local CPU util : 8.006%
!!! Remote CPU util : 8.229%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 32768 10.00 9590.11 0.65 0.90 0.268 0.367
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 7.053%
!!! Local CPU util : 12.213%
!!! Remote CPU util : 13.209%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 10.00 9412.99 0.76 0.85 0.316 0.357
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 1.537%
!!! Local CPU util : 12.137%
!!! Remote CPU util : 15.495%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 65536 10.00 9106.93 0.59 0.85 0.253 0.369
Offload ON:
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 5.995%
!!! Local CPU util : 8.044%
!!! Remote CPU util : 7.965%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 4096 10.00 9632.98 1.08 0.91 0.440 0.371
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 0.031%
!!! Local CPU util : 6.747%
!!! Remote CPU util : 5.451%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 8192 10.00 9837.25 0.91 0.91 0.362 0.363
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 0.099%
!!! Local CPU util : 7.835%
!!! Remote CPU util : 13.783%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 10.00 9837.17 0.65 0.89 0.261 0.354
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 0.092%
!!! Local CPU util : 7.445%
!!! Remote CPU util : 8.866%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 32768 10.00 9834.57 0.53 0.88 0.212 0.353
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET : +/-2.500% @ 99% conf.
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 5.255%
!!! Local CPU util : 7.245%
!!! Remote CPU util : 8.528%
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 65536 10.00 9465.12 0.52 0.90 0.214 0.375
Test results
Line rate speeds were achieved in almost all traffic flow type tests, with the exception of VXLAN over bridge. For the VXLAN-over-physical flow test, CPU utilization was much the same as the baseline physical flow test, since no encapsulation took place. When offload was disabled, CPU usage increased by over 50%.
Given that Netperf is single-threaded and forwarding is pinned to one CPU core, the CPU utilization shown is extrapolated from what the tool reported across N*8 cores. This also shows how throughput is affected by available CPU, as seen in the VXLAN-over-bridge test. In addition, smaller socket sizes produced higher CPU utilization even with offload on, due to handling more, smaller packets and their additional overhead.
These tests were completed with the stock supported kernel in RHEL 7.1. The 4.x kernels add further networking improvements that, in separate testing, increased performance by over 3x, although the existing results are already very promising.
Overall, VXLAN offloading is useful for getting past specific network limitations and achieving scalable east-west expansion.
Code
https://github.com/therandomsecurityguy/benchmarking-tools/tree/main/netperf
Packet fragmentation and segmentation offload in UDP and VXLAN
https://hustcat.github.io/udp-and-vxlan-fragment/
Network segmentation offload in Linux: GSO/TSO/UFO/LRO/GRO
https://rtoax.blog.csdn.net/article/details/108748689
A brief look at common network acceleration techniques (Part 2)
https://zhuanlan.zhihu.com/p/44683790