Understanding the Linux Network Stack (3): Segmentation Offloading in a QEMU/KVM + VxLAN Environment (Sender Side)

This article is a repost, dated 2016-03-02 09:25. Tags: KVM / networking / fundamentals / principles

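This series of articles summarizes the Linux network stack, including:

(1) A summary of the Linux network protocol stack

(2) Segmentation offloading techniques in non-virtualized Linux environments: GSO/TSO/UFO/LRO/GRO

(3) Segmentation offloading in a QEMU/KVM + VxLAN environment (sender side)

(4) Segmentation offloading in a QEMU/KVM + VxLAN environment (receiver side)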

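1. Test environment

1.1 Overall environment

Host: Ubuntu Linux/KVM + VxLAN + Linux bridge, NIC MTU 9000
Guest: Ubuntu Linux + virtio-net NIC, NIC MTU 1500; VMs are managed by OpenStack Kilo

(Diagram of the sender-side host and guest)

On the sender side, for a VM to send a packet, it first passes the packet through the virtio device to a Linux bridge on the host. The bridge forwards it to the vxlan virtual interface, which wraps it into a VxLAN UDP frame and hands it to the UDP layer. The packet is then routed a second time through the host's TCP/IP stack, this time out to the physical network, and eventually reaches the receiver.

1.2 The virtio-net virtual NIC used in the guest

The guest's virtio-net paravirtualized NIC is in fact a PCI device implemented with interrupts and DMA, so its information can be viewed with the lspci command: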

00:03.0 Ethernet controller [0200]: Red Hat, Inc Virtio network device [1af4:1000]

By default, the segmentation offloading features of this NIC (TSO, UFO, GSO) are all enabled:

root@sammyubuntu1:~# ethtool -k eth0
Features for eth0:
rx-checksumming: on [fixed]
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]

2. Sender-side experiments and implementation

2.1 Experiments

2.1.1 Experiment with GSO/TSO/UFO all enabled on both the guest and host NICs

(1) iperf reports an MSS of 1448 bytes

root@sammyubuntu1:~# iperf -c 20.0.0.103 -l 65550 -m -M 400000
WARNING: attempt to set TCP maxmimum segment size to 400000 failed.
Setting the MSS may not be implemented on this OS.
------------------------------------------------------------
Client connecting to 20.0.0.103, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 20.0.0.150 port 56228 connected with 20.0.0.103 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.06 GBytes   908 Mbits/sec
[  3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

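The experiment shows that the TCP MSS in the guest depends only on the MTU of the guest NIC and not on other factors such as the host NIC's MTU. Moreover, the MSS values for the two directions of a TCP connection can differ. Because the MSS is independent in this way, it can lead to different consequences, which are discussed below.

Another interesting result is that the guest NIC's MTU has little effect on network performance. In the tests below, raising the guest NIC MTU from 1500 to 8000 improved throughput by only about 7.5% (from 909 to 977 Mbits/sec).

<Guest NIC MTU 1500, host NIC MTU 9000>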
root@sammyubuntu1:~# iperf -c 20.0.0.103   -m
------------------------------------------------------------
Client connecting to 20.0.0.103, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 20.0.0.150 port 56275 connected with 20.0.0.103 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.06 GBytes   909 Mbits/sec
[  3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

<Guest NIC MTU 8000, host NIC MTU 9000>
root@sammyubuntu1:~# iperf -c 20.0.0.103   -m
------------------------------------------------------------
Client connecting to 20.0.0.103, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 20.0.0.150 port 56274 connected with 20.0.0.103 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.14 GBytes   977 Mbits/sec
[  3] MSS size 7948 bytes (MTU 7988 bytes, unknown interface)
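
The MSS values that iperf reports in these two runs follow from the guest MTU minus the per-segment header bytes. The sketch below assumes an IPv4 header without options, a 20-byte base TCP header and the 12-byte timestamp option visible in the tcpdump output (nop,nop,TS); the function name is made up for illustration.

```python
IPV4_HEADER = 20         # bytes, assuming no IP options
TCP_HEADER = 20          # bytes, base TCP header
TCP_TIMESTAMP_OPT = 12   # nop,nop,TS val/ecr, as seen in the captures


def effective_mss(mtu: int, timestamps: bool = True) -> int:
    """Payload bytes per TCP segment for a given interface MTU."""
    overhead = IPV4_HEADER + TCP_HEADER
    if timestamps:
        overhead += TCP_TIMESTAMP_OPT
    return mtu - overhead


print(effective_mss(1500))  # 1448
print(effective_mss(8000))  # 7948
```

Both results match the "MSS size" lines printed by iperf above.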

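(2) Guest NIC: the frame size clearly exceeds the MSS. This shows that with GSO/TSO enabled, as long as a packet does not exceed the maximum IP packet size of 64 KB, the virtio-net NIC can pass it straight through the virtqueue to the backend in QEMU. tcpdump flags the checksum as incorrect, which indicates that checksum computation has been offloaded: the capture sees the packet before the checksum has been filled in, so this is not a real error.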

07:56:04.857427 IP (tos 0x0, ttl 64, id 43384, offset 0, flags [DF], proto TCP (6), length 1500)
20.0.0.150.56230 > 20.0.0.103.5001: Flags [.], cksum 0x2ecb (incorrect -> 0xc064), seq 438742392:438743840, ack 1, win 229, options [nop,nop,TS val 7353870 ecr 564507], length 1448

07:56:10.861709 IP (tos 0x0, ttl 64, id 56589, offset 0, flags [DF], proto TCP (6), length 65212)
20.0.0.150.56230 > 20.0.0.103.5001: Flags [.], cksum 0x27ac (incorrect -> 0x74c2), seq 1119980056:1120045216, ack 1, win 229, options [nop,nop,TS val 7355371 ecr 566008], length 65160
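
To relate the large frame above to the MSS, here is a minimal sketch of how a GSO payload would later be cut into MSS-sized pieces. The helper name is hypothetical; it only does the arithmetic and does not touch real packets.

```python
from typing import List


def gso_segments(payload_len: int, gso_size: int) -> List[int]:
    """Sizes of the segments a GSO payload is eventually cut into."""
    full, rest = divmod(payload_len, gso_size)
    return [gso_size] * full + ([rest] if rest else [])


segs = gso_segments(65160, 1448)
print(len(segs), set(segs))  # 45 {1448}
```

The 65160-byte payload in the capture is therefore exactly 45 segments of 1448 bytes.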

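(3) The tap device on the host that corresponds to the guest NIC: the frames look the same as those seen on the guest NIC, which shows that both the virtio queue backend in QEMU and the tap network device on the host forward these packets as-is, without any splitting.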

08:12:23.443278 fa:16:3e:a3:a0:55 > fa:16:3e:1e:d9:f4, ethertype IPv4 (0x0800), length 61322: (tos 0x0, ttl 64, id 52865, offset 0, flags [DF], proto TCP (6), length 61308)
    20.0.0.150.56238 > 20.0.0.103.5001: Flags [.], cksum 0x186c (incorrect -> 0xad42), seq 896238816:896300072, ack 1, win 157, options [nop,nop,TS val 7598521 ecr 809156], length 61256
08:12:23.443355 fa:16:3e:a3:a0:55 > fa:16:3e:1e:d9:f4, ethertype IPv4 (0x0800), length 3030: (tos 0x0, ttl 64, id 52927, offset 0, flags [DF], proto TCP (6), length 3016)
    20.0.0.150.56238 > 20.0.0.103.5001: Flags [.], cksum 0x34b7 (incorrect -> 0x97b3), seq 896300072:896303036, ack 1, win 157, options [nop,nop,TS val 7598521 ecr 809156], length 2964

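(4) The Linux bridge on the host that connects the tap device and the vxlan interface: frame sizes exceed its MTU, which shows that it forwards the large, not-yet-segmented GSO/TSO frames unchanged.

root@hkg02kvm004ccz023:~# ifconfig brq137db7ce-a4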
brq137db7ce-a4 Link encap:Ethernet  HWaddr 36:7e:1f:8e:65:a0
          UP BROADCAST RUNNING MULTICAST  MTU:8950  Metric:1
          RX packets:594948 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1058392082 (1.0 GB)  TX bytes:0 (0.0 B)

08:18:39.574679 fa:16:3e:a3:a0:55 > fa:16:3e:1e:d9:f4, ethertype IPv4 (0x0800), length 52430: (tos 0x0, ttl 64, id 1585, offset 0, flags [DF], proto TCP (6), length 52416)
    20.0.0.150.56238 > 20.0.0.103.5001: Flags [.], cksum 0xf5af (incorrect -> 0x47b1), seq 7027645:7080009, ack 0, win 157, options [nop,nop,TS val 7692554 ecr 903188], length 52364
08:18:39.574784 fa:16:3e:a3:a0:55 > fa:16:3e:1e:d9:f4, ethertype IPv4 (0x0800), length 52430: (tos 0x0, ttl 64, id 1638, offset 0, flags [DF], proto TCP (6), length 52416)

(5) The vxlan interface on the host: frame sizes exceed its MTU, which shows that it also forwards the large GSO/TSO frames unchanged.

root@hkg02kvm004ccz023:~# ifconfig vxlan-97
vxlan-97  Link encap:Ethernet  HWaddr d6:7e:83:70:40:b2
          UP BROADCAST RUNNING MULTICAST  MTU:8950  Metric:1
          RX packets:44754025 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5179964 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2381558146 (2.3 GB)  TX bytes:89661075401 (89.6 GB)

08:16:54.685914 fa:16:3e:a3:a0:55 > fa:16:3e:1e:d9:f4, ethertype IPv4 (0x0800), length 1054: (tos 0x0, ttl 64, id 14493, offset 0, flags [DF], proto TCP (6), length 1040)
    20.0.0.150.56238 > 20.0.0.103.5001: Flags [.], cksum 0x2cff (incorrect -> 0x3143), seq 60291712:60292700, ack 1, win 157, options [nop,nop,TS val 7666332 ecr 876964], length 988
08:16:54.685936 fa:16:3e:a3:a0:55 > fa:16:3e:1e:d9:f4, ethertype IPv4 (0x0800), length 24766: (tos 0x0, ttl 64, id 14494, offset 0, flags [DF], proto TCP (6), length 24752)
    20.0.0.150.56238 > 20.0.0.103.5001: Flags [.], cksum 0x899f (incorrect -> 0xf01d), seq 60292700:60317400, ack 1, win 157, options [nop,nop,TS val 7666332 ecr 876964], length 24700

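(6) The physical NIC to which the host's VxLAN UDP socket is bound: three cases were tested, and they show that the frames sent here are cut according to the TCP MSS.

When the MSS of the TCP connection is 988: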
08:19:34.286094 IP 10.110.156.43.33980 > 10.110.156.42.4789: VXLAN, flags [I] (0x08), vni 97
IP 20.0.0.150.56238 > 20.0.0.103.5001: Flags [.], seq 96306288:96307276, ack 1, win 157, options [nop,nop,TS val 7706232 ecr 916866], length 988

When the MSS of the TCP connection is 1448:
07:45:24.008510 IP 10.110.156.43.44429 > 10.110.156.42.4789: VXLAN, flags [I] (0x08), vni 97
IP 20.0.0.150.56228 > 20.0.0.103.5001: Flags [.], seq 126000:127448, ack 1, win 229, options [nop,nop,TS val 7193663 ecr 404323], length 1448

If the NIC MTU is lowered below the MSS (1448), for example to 1400, IP fragmentation occurs. This shows that GSO still segments by MSS rather than by the MTU or by the smaller of MTU and MSS:
08:39:35.743048 IP 10.110.156.43.41903 > 10.110.156.42.4789: VXLAN, flags [I] (0x08), vni 97
IP truncated-ip - 154 bytes missing! 20.0.0.150.56246 > 20.0.0.103.5001: Flags [.], seq 12179152:12180600, ack 1, win 188, options [nop,nop,TS val 8006596 ecr 1217232], length 1448
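
The "154 bytes missing" reported by tcpdump can be reproduced with a little arithmetic. The sketch below assumes an outer IPv4 header without options, no VLAN tag, and the usual VXLAN encapsulation of outer UDP (8 bytes), VXLAN header (8 bytes) and inner Ethernet header (14 bytes); the function name is invented for this example.

```python
OUTER_IPV4 = 20   # outer IP header, assuming no options
OUTER_UDP = 8
VXLAN_HDR = 8
INNER_ETH = 14    # inner Ethernet header, assuming no VLAN tag


def first_fragment_missing(inner_ip_len: int, phys_mtu: int) -> int:
    """Bytes of the inner IP packet that do not fit in the first outer fragment."""
    encap = OUTER_UDP + VXLAN_HDR + INNER_ETH
    if OUTER_IPV4 + encap + inner_ip_len <= phys_mtu:
        return 0  # the whole tunnel packet fits, no fragmentation
    # Every fragment except the last carries a multiple of 8 payload bytes.
    first_payload = (phys_mtu - OUTER_IPV4) // 8 * 8
    inner_bytes_carried = first_payload - encap
    return inner_ip_len - inner_bytes_carried


print(first_fragment_missing(1500, 1400))  # 154
print(first_fragment_missing(1500, 9000))  # 0
```

An MSS of 1448 gives a 1500-byte inner IP packet and a 1550-byte outer IP packet, which no longer fits into an MTU of 1400, so the first fragment seen by tcpdump lacks the last 154 bytes of the inner packet.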

2.1.2 GSO/TSO/UFO disabled in the guest

root@sammyubuntu1:~# iperf -c 20.0.0.103 -l 65550  -m
------------------------------------------------------------
Client connecting to 20.0.0.103, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 20.0.0.150 port 56269 connected with 20.0.0.103 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.00 GBytes   859 Mbits/sec
[  3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

Guest NIC: TCP segmentation is performed on the guest CPU
10:02:41.474750 IP (tos 0x0, ttl 64, id 59282, offset 0, flags [DF], proto TCP (6), length 1500)
    20.0.0.150.56269 > 20.0.0.103.5001: Flags [.], cksum 0x2ecb (incorrect -> 0x1924), seq 1795544:1796992, ack 1, win 229, options [nop,nop,TS val 9253025 ecr 2463660], length 1448

This shows that TCP segmentation took place in the guest, which costs some network performance.

2.1.3 TSO/GSO enabled in the VM and GSO/TSO disabled on the host

Guest NIC: transmits large GSO frames
09:47:35.894971 IP (tos 0x0, ttl 64, id 5902, offset 0, flags [DF], proto TCP (6), length 65212)
    20.0.0.150.56263 > 20.0.0.103.5001: Flags [.], cksum 0x27ac (incorrect -> 0xdeb7), seq 893020126:893085286, ack 1, win 229, options [nop,nop,TS val 9026630 ecr 2237265], length 65160

vxlan interface device: forwards them unchanged
09:53:15.457088 fa:16:3e:a3:a0:55 > fa:16:3e:1e:d9:f4, ethertype IPv4 (0x0800), length 65226: (tos 0x0, ttl 64, id 43361, offset 0, flags [DF], proto TCP (6), length 65212)
    20.0.0.150.56267 > 20.0.0.103.5001: Flags [.], cksum 0x27ac (incorrect -> 0x2e18), seq 1127931208:1127996368, ack 1, win 229, options [nop,nop,TS val 9111525 ecr 2322160], length 65160

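Host physical NIC: sends small frames of TCP MSS size
09:50:41.251821 IP 10.110.156.43.12914 > 10.110.156.42.4789: VXLAN, flags [I] (0x08), vni 97
IP 20.0.0.150.56265 > 20.0.0.103.5001: Flags [.], seq 80044016:80045464, ack 1, win 229, options [nop,nop,TS val 9072973 ecr 2283609], length 1448

This shows that the large UDP-encapsulated frame was split into MSS-sized pieces on the host, in software, before it reached the physical NIC.

2.1.4 TCP throughput comparison

GSO/TSO/UFO enabled on both guest and host > enabled on guest and disabled on host > disabled on guest.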

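2.2 Linux kernel support for GSO on UDP tunnels: "udp: Generalize GSO for UDP tunnels"

This support was only added to the Linux kernel in 2014; for more information see the original article at https://lwn.net/Articles/613999/. The patch adds support for UDP tunneling to GSO.

Before an skb is handed to the UDP stack, a new option, inner_protocol, must be set on it, using either skb_set_inner_ipproto or skb_set_inner_protocol. The relevant code in the vxlan driver is skb_set_inner_protocol(skb, htons(ETH_P_TEB));

The function skb_udp_tunnel_segment checks this option before performing segmentation.

Several encapsulation types are supported, including SKB_GSO_UDP_TUNNEL{_CSUM}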

2.3 VxLAN implementation

The code is here: https://github.com/torvalds/linux/blob/master/drivers/net/vxlan.c

2.3.1 VxLAN interface

As its name suggests, a VxLAN interface is used as a network device interface, and vxlan.c implements its driver logic. It sits in the device driver layer of the data link layer. As with other interfaces, ethtool can be used to view its segmentation offloading capabilities:

root@hkg02kvm004ccz023:~# ethtool -k vxlan-4 | grep offload
tcp-segmentation-offload: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: on
l2-fwd-offload: off [fixed]

The driver sets up a net_device_ops structure that defines the key functions for operating on the net_device. The vxlan driver fills in these functions as needed, chiefly those that handle sending and receiving packets.

static const struct net_device_ops vxlan_netdev_ops = {
    .ndo_init        = vxlan_init,
    .ndo_uninit        = vxlan_uninit,
    .ndo_open        = vxlan_open,
    .ndo_stop        = vxlan_stop,
    .ndo_start_xmit        = vxlan_xmit,  /* transmit a packet through the vxlan interface */
    ...
};

Now let's look at the code:

(1) First, the function static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev). Its inputs are the sk_buff for the packet to transmit and the vxlan interface dev it passes through.

Its main logic is to obtain the vxlan dev and then call vxlan_xmit_skb for each skb to be sent.

/* Main logic: compute tos, ttl, df, src_port, dst_port, md, flags and so on, then call vxlan_xmit_skb. */
err = vxlan_xmit_skb(rt, sk, skb, fl4.saddr,
                     dst->sin.sin_addr.s_addr, tos, ttl, df,
                     src_port, dst_port, htonl(vni << 8), md,
                     !net_eq(vxlan->net, dev_net(vxlan->dev)),
                     flags);

(2) vxlan_xmit_skb modifies the skb: it adds the VxLAN header and sets the GSO parameters.

static int vxlan_xmit_skb(struct rtable *rt, struct sock *sk, struct sk_buff *skb,
              __be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
              __be16 src_port, __be16 dst_port, __be32 vni,
              struct vxlan_metadata *md, bool xnet, u32 vxflags)
{
    ...
    /* Choose the UDP tunnel GSO type so that the kernel's GSO for UDP tunnels can be used */
    int type = udp_sum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
    u16 hdrlen = sizeof(struct vxlanhdr); /* length of the vxlan header */
    ...
    /* Compute the headroom the skb now needs, including the VXLAN header */
    min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len + VXLAN_HLEN + sizeof(struct iphdr)
            + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);

    /* Need space for new headers (invalidates iph ptr) */
    err = skb_cow_head(skb, min_headroom); /* make the skb head writable */
    ...
    skb = vlan_hwaccel_push_inside(skb); /* handle VLAN tagging */
    ...
    skb = iptunnel_handle_offloads(skb, udp_sum, type); /* set up checksum handling and the GSO type */
    ...
    vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh)); /* extend the skb data area to hold the vxlan header */
    vxh->vx_flags = htonl(VXLAN_HF_VNI);
    vxh->vx_vni = vni;
    ...
    if (vxflags & VXLAN_F_GBP)
        vxlan_build_gbp_hdr(vxh, vxflags, md);

    skb_set_inner_protocol(skb, htons(ETH_P_TEB)); /* set the inner Ethernet protocol; required for GSO over a UDP tunnel */

    /* Call into the Linux network stack: hand the skb to the UDP tunnel code for further processing */
    udp_tunnel_xmit_skb(rt, sk, skb, src, dst, tos, ttl, df,
                src_port, dst_port, xnet, !(vxflags & VXLAN_F_UDP_CSUM));
    return 0;
}
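
To make the header that vxlan_xmit_skb pushes onto the skb more concrete, here is a small user-space sketch that builds the same 8 bytes. It assumes the basic VXLAN header layout (a 32-bit flags word followed by a 32-bit word holding the 24-bit VNI in its upper bits) and ignores the GBP extension; build_vxlan_header is a name invented for this example.

```python
import struct

VXLAN_HF_VNI = 0x08000000  # "I" flag: the VNI field is valid


def build_vxlan_header(vni: int) -> bytes:
    """Pack an 8-byte VXLAN header in network byte order."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    # Mirrors vx_flags = htonl(VXLAN_HF_VNI) and vx_vni = htonl(vni << 8).
    return struct.pack("!II", VXLAN_HF_VNI, vni << 8)


hdr = build_vxlan_header(97)
print(hdr.hex())  # 0800000000006100
```

The leading 0x08 byte and the VNI value 97 (0x61) correspond to the "flags [I] (0x08), vni 97" that tcpdump prints for the captured tunnel packets.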

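(3) Next the packet enters the Linux TCP/IP stack, starting at the UDP layer and then moving to the IP layer. If the NIC supports the offload, segmentation is left to the hardware; if not, the kernel calls its UDP GSO segmentation function before the packet enters the device driver queue. The packet then continues down to the NIC.

Finally, in the function ip_finish_output_gso, the GSO segmentation function is called first, and then IP fragmentation is performed if needed: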

static int ip_finish_output_gso(struct net *net, struct sock *sk,
                                 struct sk_buff *skb, unsigned int mtu)
 {
         netdev_features_t features;
         struct sk_buff *segs;
         int ret = 0;
 
         /* Slowpath -  GSO segment length is exceeding the dst MTU.
          *
          * This can happen in two cases:
          * 1) TCP GRO packet, DF bit not set
          * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
          * from host network stack.
          */
         features = netif_skb_features(skb);
         /* This eventually reaches UDP's gso_segment callback, which performs UDP GSO segmentation */
         segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
         if (IS_ERR_OR_NULL(segs)) {
                 kfree_skb(skb);
                 return -ENOMEM;
         }
 
         consume_skb(skb);
 
         do {
                 struct sk_buff *nskb = segs->next;
                 int err;
 
                 segs->next = NULL;
                 /* Do IP fragmentation if needed: UDP GSO segments by MSS, and the result can still exceed the MTU of the host physical NIC */
                 err = ip_fragment(net, sk, segs, mtu, ip_finish_output2);
                 if (err && ret == 0)
                         ret = err;
                 segs = nskb;
         } while (segs);
 
         return ret;
 }
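
The two-stage logic of ip_finish_output_gso (segment by gso_size first, then fragment each segment only if it is still larger than the MTU) can be modeled outside the kernel. This is a simplified sketch, not the kernel algorithm: it tracks packet lengths only, and headers_len is an assumed total of 102 bytes (outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14 + inner IPv4 20 + TCP with timestamps 32).

```python
from typing import List

IP_HEADER = 20  # outer IPv4 header, assuming no options


def ip_fragment(ip_len: int, mtu: int) -> List[int]:
    """Lengths of the IPv4 packets produced from one packet of length ip_len."""
    if ip_len <= mtu:
        return [ip_len]
    data = ip_len - IP_HEADER
    chunk = (mtu - IP_HEADER) // 8 * 8  # fragment payloads are 8-byte aligned
    frags = []
    while data > 0:
        take = min(chunk, data)
        frags.append(IP_HEADER + take)
        data -= take
    return frags


def finish_output_gso(payload_len: int, gso_size: int,
                      headers_len: int, mtu: int) -> List[int]:
    """Segment by gso_size, then fragment each segment if it exceeds the MTU."""
    wire = []
    remaining = payload_len
    while remaining > 0:
        seg = min(gso_size, remaining)
        wire.extend(ip_fragment(headers_len + seg, mtu))
        remaining -= seg
    return wire


print(len(finish_output_gso(65160, 1448, 102, 9000)))  # 45
print(finish_output_gso(1448, 1448, 102, 1400))        # [1396, 174]
```

With a 9000-byte MTU each MSS-sized segment leaves as one 1550-byte packet; with a 1400-byte MTU every segment is additionally split into two IP fragments.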

These are the GSO callbacks registered by the UDP layer:

static const struct net_offload udpv4_offload = {
    .callbacks = {
        .gso_segment = udp4_ufo_fragment,
        .gro_receive  =    udp4_gro_receive,
        .gro_complete =    udp4_gro_complete,
    },
};

For tunneled packets, the segmentation is implemented here:

static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb, netdev_features_t features,
    struct sk_buff *(*gso_inner_segment)(struct sk_buff *skb, netdev_features_t features), __be16 new_protocol)
{
    ...
    /* segment inner packet: call the inner-layer segmentation function first */
    enc_features = skb->dev->hw_enc_features & netif_skb_features(skb);
    segs = gso_inner_segment(skb, enc_features);
    ...
    skb = segs;
    do { /* perform the UDP GSO step: restore the outer headers on each segment */
        struct udphdr *uh;
        int len;

        skb_reset_inner_headers(skb);
        skb->encapsulation = 1;

        skb->mac_len = mac_len;

        skb_push(skb, outer_hlen);
        skb_reset_mac_header(skb);
        skb_set_network_header(skb, mac_len);
        skb_set_transport_header(skb, udp_offset);
        len = skb->len - udp_offset;
        uh = udp_hdr(skb);
        uh->len = htons(len);
        ...
        skb->protocol = protocol;
    } while ((skb = skb->next));
out:
    return segs;
}

struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb, netdev_features_t features, bool is_ipv6)
{
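    ...
    switch (skb->inner_protocol_type) { /* pick the inner segmentation function */
    case ENCAP_TYPE_ETHER:
        /* VxLAN GSO presumably takes this branch: the layer-2 frame encapsulated by VXLAN is treated as the payload to segment, rather than the part that contains the VXLAN header */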
        protocol = skb->inner_protocol;
        gso_inner_segment = skb_mac_gso_segment;
        break;
    case ENCAP_TYPE_IPPROTO:
        offloads = is_ipv6 ? inet6_offloads : inet_offloads;
        ops = rcu_dereference(offloads[skb->inner_ipproto]);
        if (!ops || !ops->callbacks.gso_segment)
            goto out_unlock;
        gso_inner_segment = ops->callbacks.gso_segment;
        break;
    default:
        goto out_unlock;
    }

    segs = __skb_udp_tunnel_segment(skb, features, gso_inner_segment,
                    protocol);
    ...
    return segs; /* return the list of segments */
}

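One open question is that VXLAN does not define its own gso_segment callback, which raises the concern that the segments produced by UDP GSO might not each carry a complete VXLAN header. This needs further study. The likely explanation lies in the inner segmentation step: the layer-2 frame encapsulated in UDP is treated as the payload to segment, so the VXLAN header is preserved in every segment.

(4) As we can see, throughout the whole process skb_shinfo(skb)->gso_size, which was set by the TCP layer in the guest, stays unchanged at the MSS. Therefore, the GSO segmentation finally applied to the UDP GSO packet at the NIC still uses skb_shinfo(skb)->gso_size, that is, the TCP MSS, as the segment length.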
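
The idea that the inner frame is segmented and the outer headers are then put back on every piece can be illustrated with a toy model. This is only a sketch under strong simplifications: headers are opaque placeholder bytes, and the length, checksum and sequence-number fix-ups done by the kernel are omitted. The function name mirrors the kernel one for readability but is plain Python.

```python
from typing import List


def udp_tunnel_segment(outer: bytes, inner_hdrs: bytes,
                       payload: bytes, gso_size: int) -> List[bytes]:
    """Toy model: segment the inner payload, then re-add the outer headers."""
    # Step 1: inner segmentation (the role of gso_inner_segment).
    inner_segs = [inner_hdrs + payload[off:off + gso_size]
                  for off in range(0, len(payload), gso_size)]
    # Step 2: prepend the saved outer headers (the role of skb_push(skb, outer_hlen)).
    return [outer + seg for seg in inner_segs]


outer = b"I" * 20 + b"U" * 8 + b"V" * 8    # placeholder outer IPv4 + UDP + VXLAN
inner = b"e" * 14 + b"i" * 20 + b"t" * 32  # placeholder inner Ethernet + IPv4 + TCP
segs = udp_tunnel_segment(outer, inner, bytes(65160), 1448)
print(len(segs), len(segs[0]))                 # 45 1550
print(all(s.startswith(outer) for s in segs))  # True
```

In this model every segment begins with the same outer block, including the VXLAN bytes, and the segment size is governed solely by gso_size, which matches the behavior described in (4).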

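3. Problems with the current sender-side approach and possible improvements

3.1 The problem

From the process described above, the current VXLAN protocol and implementation have a problem: the fragmentation finally performed on the host is driven by the MSS of the TCP connection in the guest. In real deployments, however, host NICs often use jumbo frames with an MTU of 9000 bytes, while guest NICs usually keep the default MTU of 1500 bytes. When the receiver is a node of the same type, the current way of splitting wastes a lot of resources.

The ideal behavior falls into several cases:

(1) If the receiver is also a node with a VXLAN VTEP, IP fragmentation should follow the physical NIC's MTU; after IP reassembly at the peer, the VXLAN VTEP hands the packet directly to the VM.
(2) If the peer is a VXLAN gateway node that connects to an external network, then after the peer VXLAN VTEP receives the large packet reassembled from IP fragments, it splits the packet itself or with the Linux network stack's segmentation offloading, and then forwards it to the external network.
(3) If the peer is an ordinary network node (in which case the sender is a VXLAN gateway node), the current mode applies: UDP/IP fragmentation based on the TCP MSS before sending.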
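
A rough way to see the cost difference between the current behavior and case (1) is to count the packets needed for one 64 KB-class burst. The numbers below are a back-of-the-envelope sketch with assumed header sizes (inner Ethernet 14 + IPv4 20 + TCP with timestamps 32, outer UDP 8 + VXLAN 8, outer IPv4 20) and hypothetical function names; it is not a measurement.

```python
import math

INNER_HEADERS = 14 + 20 + 32  # inner Ethernet + IPv4 + TCP with timestamps
TUNNEL_HEADERS = 8 + 8        # outer UDP + VXLAN
OUTER_IPV4 = 20


def packets_mss_segmentation(payload: int, mss: int) -> int:
    """Current behavior: one tunnel packet per MSS-sized segment."""
    return math.ceil(payload / mss)


def packets_mtu_fragmentation(payload: int, mtu: int) -> int:
    """Case (1): one large tunnel datagram, IP-fragmented at the physical MTU."""
    udp_datagram = TUNNEL_HEADERS + INNER_HEADERS + payload
    per_fragment = (mtu - OUTER_IPV4) // 8 * 8  # 8-byte aligned fragment payload
    return math.ceil(udp_datagram / per_fragment)


print(packets_mss_segmentation(65160, 1448))   # 45
print(packets_mtu_fragmentation(65160, 9000))  # 8
```

Under these assumptions, the same 65160 bytes of TCP payload need 45 packets on the physical network today but only 8 if fragmentation followed the 9000-byte MTU.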

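3.2 One implementation proposal

The article Segmentation Offloading Extension for VXLAN proposes a VXLAN Segmentation Offloading Extension (VXLAN-soe). It extends the VXLAN header with several new fields: an S flag that indicates whether the technique is in use, and Overlay MSS Hi and Lo, which hold the MSS for TCP and the MTU for UDP.

The sender VXLAN VTEP, based on configuration or other conditions:

Sets S = 1 to offload segmentation to the VXLAN VTEP on the peer node, and sets Overlay MSS Hi and Lo
Sets S = 0 to skip remote segmentation offloading and process the packet normally

The receiver VXLAN hypervisor VTEP checks the S flag:

If it is 1, it skips the MTU check and hands the packet directly to the VM
If it is 0, it follows the normal processing path

The receiver VXLAN gateway VTEP:

Checks the S flag; if it is 1, it performs the segmentation itself or by using segmentation offloading

The drawback of this proposal is that UDP/IP fragmentation occurs when S = 1.

Another article, MTU and Fragmentation Issues with In-the-Network Tunneling, also discusses the fragmentation options available when tunnels are used. The main questions it covers are:

Whether to fragment at the sender
Whether to reassemble at the receiver, and how
Whether to use PMTU, and if not, how to avoid it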
