[root@localhost ~]# ovs-ofctl dump-flows br0
 cookie=0x0, duration=706691.312s, table=0, n_packets=84180530, n_bytes=7456796909, priority=0 actions=NORMAL
 cookie=0x0, duration=706444.352s, table=1, n_packets=0, n_bytes=0, ip,nw_dst=192.168.2.210 actions=mod_dl_dst:2e:a9:be:9e:4d:07
[root@localhost ~]# ovs-ofctl del-flows br0
[root@localhost ~]# ovs-ofctl del-flows br0
[root@localhost ~]# ovs-ofctl dump-flows br0
[root@localhost ~]#
Virtual machine
qemu-system-aarch64 -name vm2 -nographic \
  -enable-kvm -M virt,usb=off -cpu host -smp 2 -m 4096 \
  -global virtio-blk-device.scsi=off \
  -device virtio-scsi-device,id=scsi \
  -kernel vmlinuz-4.18 --append "console=ttyAMA0 root=UUID=6a09973e-e8fd-4a6d-a8c0-1deb9556f477" \
  -initrd initramfs-4.18 \
  -drive file=vhuser-test1.qcow2 \
  -m 2048M -numa node,memdev=mem -mem-prealloc \
  -object memory-backend-file,id=mem,size=2048M,mem-path=/dev/hugepages,share=on \
  -chardev socket,id=char1,path=$VHOST_SOCK_DIR/vhost-user1 \
  -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce \
  -device virtio-net-pci,netdev=mynet1,mac=00:00:00:00:00:01,mrg_rxbuf=off \
[root@localhost ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:00:00:00:00:01 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::200:ff:fe00:1/64 scope link
       valid_lft forever preferred_lft forever
[root@localhost ~]# ip a add 10.10.103.229/24 dev eth0
[root@localhost ~]# ping 10.10.103.81
PING 10.10.103.81 (10.10.103.81) 56(84) bytes of data.
64 bytes from 10.10.103.81: icmp_seq=1 ttl=64 time=0.374 ms
64 bytes from 10.10.103.81: icmp_seq=2 ttl=64 time=0.144 ms
--- 10.10.103.81 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.144/0.259/0.374/0.115 ms
(gdb) b dp_netdev_upcall
Breakpoint 1 at 0x96c638: file lib/dpif-netdev.c, line 6434.
(gdb) c
Continuing.
[Switching to Thread 0xfffd53ffd510 (LWP 117402)]

Breakpoint 1, dp_netdev_upcall (pmd=pmd@entry=0x18f60810, packet_=packet_@entry=0x16e0f4b80, flow=flow@entry=0xfffd53ff48d8, wc=wc@entry=0xfffd53ff4b78, ufid=ufid@entry=0xfffd53ff4328, type=type@entry=DPIF_UC_MISS, userdata=userdata@entry=0x0, actions=actions@entry=0xfffd53ff4348, put_actions=0xfffd53ff4388) at lib/dpif-netdev.c:6434
6434    {
(gdb) bt
#0  dp_netdev_upcall (pmd=pmd@entry=0x18f60810, packet_=packet_@entry=0x16e0f4b80, flow=flow@entry=0xfffd53ff48d8, wc=wc@entry=0xfffd53ff4b78, ufid=ufid@entry=0xfffd53ff4328, type=type@entry=DPIF_UC_MISS, userdata=userdata@entry=0x0, actions=actions@entry=0xfffd53ff4348, put_actions=0xfffd53ff4388) at lib/dpif-netdev.c:6434
#1  0x0000000000976558 in handle_packet_upcall (put_actions=0xfffd53ff4388, actions=0xfffd53ff4348, key=0xfffd53ff9350, packet=0x16e0f4b80, pmd=0x18f60810) at lib/dpif-netdev.c:6823
#2  fast_path_processing (pmd=pmd@entry=0x18f60810, packets_=packets_@entry=0xfffd53ffca80, keys=keys@entry=0xfffd53ff7d60, flow_map=flow_map@entry=0xfffd53ff5640, index_map=index_map@entry=0xfffd53ff5620 "", in_port=<optimized out>) at lib/dpif-netdev.c:6942
#3  0x0000000000977f58 in dp_netdev_input__ (pmd=pmd@entry=0x18f60810, packets=packets@entry=0xfffd53ffca80, md_is_valid=md_is_valid@entry=false, port_no=port_no@entry=2) at lib/dpif-netdev.c:7031
#4  0x0000000000978680 in dp_netdev_input (port_no=2, packets=0xfffd53ffca80, pmd=0x18f60810) at lib/dpif-netdev.c:7069
#5  dp_netdev_process_rxq_port (pmd=pmd@entry=0x18f60810, rxq=0x1721f440, port_no=2) at lib/dpif-netdev.c:4480
#6  0x0000000000978a24 in pmd_thread_main (f_=0x18f60810) at lib/dpif-netdev.c:5731
#7  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#8  0x0000ffffb19f7d38 in start_thread (arg=0xfffd53ffd510) at pthread_create.c:309
#9  0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
ovs
Open vSwitch currently consists of three main components:
- ovsdb-server: OpenFlow itself was designed as a pipeline for processing network packets; it does not cover the configuration of a software switch, such as setting up QoS or connecting to an SDN controller. ovsdb-server complements Open vSwitch's OpenFlow implementation: it is Open vSwitch's configuration database and holds Open vSwitch's persistent data.
- ovs-vswitchd: the forwarding daemon that runs in user space. It receives the OpenFlow rules pushed down by the SDN controller and tells the OVS kernel module how to handle network packets.
- OVS kernel module: the forwarding component that runs in kernel space and processes network packets according to the instructions from ovs-vswitchd.
Open vSwitch has a fast path and a slow path: ovs-vswitchd is the slow path, and the OVS kernel module is the fast path. The OpenFlow tables live in the slow path, but for fast forwarding, packets should be handled in the fast path whenever possible. Open vSwitch therefore forwards packets with the following logic.
When the first packet of a connection is sent, the OVS kernel module receives it first. The kernel module does not yet know how to handle it, because all the OpenFlow rules live in ovs-vswitchd, so its default behavior is to pass the packet up to ovs-vswitchd. ovs-vswitchd runs the packet through the OpenFlow pipeline and hands it back to the kernel module; at the same time, it generates a list of datapath actions, which are similar to OpenFlow actions but simpler. These datapath actions are sent down to the kernel module along with the packet. Because every packet of the same connection shares the same characteristics (IP addresses, MAC addresses, port numbers), the kernel module can apply the datapath actions directly when it receives the remaining packets.
This solves the earlier problems. First, the kernel module does not need to track OpenFlow changes or Open vSwitch code updates; all of that happens in ovs-vswitchd. Second, for a given connection only the first packet goes through the OpenFlow pipeline; the remaining packets are matched on their header fields in the kernel module and the datapath actions are applied directly. The OVS kernel module keeps these datapath actions in a cache. Early kernel modules implemented a cache called the microflow cache, which is simply a hash map: the key is a hash over everything OpenFlow could match on, including the L2-L4 headers and some metadata such as in_port, and the value is the list of datapath actions. A hash map gives O(1) lookups, so the kernel module can look up and forward packets efficiently.
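To make the microflow cache concrete, here is a minimal, self-contained C sketch of the idea (not the kernel code; mf_cache, mf_entry and the toy hash are invented for illustration): a hash over the packet's header fields indexes a table whose value is the cached list of datapath actions, and a miss plays the role of the upcall to ovs-vswitchd.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for the kernel's microflow cache: a fixed-size
 * hash table mapping a packet's header hash to cached datapath actions. */
#define MF_CACHE_SIZE 1024            /* must be a power of two */

struct mf_key {                       /* simplified "flow key": 5-tuple only */
    uint32_t ip_src, ip_dst;
    uint16_t tp_src, tp_dst;
    uint8_t  proto;
};

struct mf_entry {
    struct mf_key key;
    int valid;
    const char *actions;              /* placeholder for datapath actions */
};

static struct mf_entry mf_cache[MF_CACHE_SIZE];

static uint32_t mf_hash(const struct mf_key *k)
{
    /* Toy hash; the real datapath uses jhash-style hashing. */
    uint32_t h = k->ip_src ^ (k->ip_dst * 2654435761u);
    h ^= ((uint32_t) k->tp_src << 16) | k->tp_dst;
    return (h ^ k->proto) & (MF_CACHE_SIZE - 1);
}

/* First packet of a flow misses -> pretend "upcall" installs the actions. */
static const char *mf_lookup_or_upcall(const struct mf_key *k)
{
    struct mf_entry *e = &mf_cache[mf_hash(k)];
    if (e->valid && !memcmp(&e->key, k, sizeof *k)) {
        return e->actions;            /* fast path: cached datapath actions */
    }
    e->key = *k;                      /* slow path: ask "ovs-vswitchd", cache it */
    e->actions = "output:2";          /* pretend translation result */
    e->valid = 1;
    return e->actions;
}

int main(void)
{
    struct mf_key k = { 0x0a0a67e5, 0x0a0a6751, 1234, 80, 6 };
    printf("first packet : %s (via upcall)\n", mf_lookup_or_upcall(&k));
    printf("later packets: %s (cached)\n", mf_lookup_or_upcall(&k));
    return 0;
}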
Kernel receive path
Received packets are handled through the callback registered on the vport: netdev_frame_hook() -> netdev_port_receive() -> ovs_vport_receive(). ovs_flow_key_extract() builds the flow key used for the upcoming flow-table lookup, and finally ovs_dp_process_packet() enters the actual OVS packet processing. The code flow is as follows:

Action processing
The OVS action types are listed below. nla_type() is used to read the nl_type value, and the entry point is do_execute_actions() (a simplified sketch of the dispatch loop follows the handler list below).
enum ovs_action_attr {
    OVS_ACTION_ATTR_UNSPEC,
    OVS_ACTION_ATTR_OUTPUT,        /* u32 port number. */
    OVS_ACTION_ATTR_USERSPACE,     /* Nested OVS_USERSPACE_ATTR_*. */
    OVS_ACTION_ATTR_SET,           /* One nested OVS_KEY_ATTR_*. */
    OVS_ACTION_ATTR_PUSH_VLAN,     /* struct ovs_action_push_vlan. */
    OVS_ACTION_ATTR_POP_VLAN,      /* No argument. */
    OVS_ACTION_ATTR_SAMPLE,        /* Nested OVS_SAMPLE_ATTR_*. */
    OVS_ACTION_ATTR_RECIRC,        /* u32 recirc_id. */
    OVS_ACTION_ATTR_HASH,          /* struct ovs_action_hash. */
    OVS_ACTION_ATTR_PUSH_MPLS,     /* struct ovs_action_push_mpls. */
    OVS_ACTION_ATTR_POP_MPLS,      /* __be16 ethertype. */
    OVS_ACTION_ATTR_SET_MASKED,    /* One nested OVS_KEY_ATTR_* including
                                    * data immediately followed by a mask.
                                    * The data must be zero for the unmasked
                                    * bits. */
    OVS_ACTION_ATTR_CT,            /* Nested OVS_CT_ATTR_*. */
    OVS_ACTION_ATTR_TRUNC,         /* u32 struct ovs_action_trunc. */
    OVS_ACTION_ATTR_PUSH_ETH,      /* struct ovs_action_push_eth. */
    OVS_ACTION_ATTR_POP_ETH,       /* No argument. */
    OVS_ACTION_ATTR_CT_CLEAR,      /* No argument. */
    OVS_ACTION_ATTR_PUSH_NSH,      /* Nested OVS_NSH_KEY_ATTR_*. */
    OVS_ACTION_ATTR_POP_NSH,       /* No argument. */
    OVS_ACTION_ATTR_METER,         /* u32 meter ID. */
    OVS_ACTION_ATTR_CLONE,         /* Nested OVS_CLONE_ATTR_*. */
    OVS_ACTION_ATTR_CHECK_PKT_LEN, /* Nested OVS_CHECK_PKT_LEN_ATTR_*. */

    __OVS_ACTION_ATTR_MAX,         /* Nothing past this will be accepted
                                    * from userspace. */

#ifdef __KERNEL__
    OVS_ACTION_ATTR_SET_TO_MASKED, /* Kernel module internal masked
                                    * set action converted from
                                    * OVS_ACTION_ATTR_SET. */
#endif
};
OVS_ACTION_ATTR_OUTPUT: read the port number and call do_output() to send the packet out of that port;
OVS_ACTION_ATTR_USERSPACE: call output_userspace() to send the packet to user space;
OVS_ACTION_ATTR_HASH: call execute_hash() to compute the skb hash and store it in ovs_flow_hash;
OVS_ACTION_ATTR_PUSH_VLAN: call push_vlan() to add a VLAN header.
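A simplified, self-contained sketch of the dispatch pattern used by do_execute_actions(): iterate over the attributes and switch on the type (the types, handlers and attribute walk below are invented stand-ins, not the kernel structures).

#include <stdint.h>
#include <stdio.h>

/* A few action types, mirroring the spirit of enum ovs_action_attr. */
enum demo_action { ACT_OUTPUT, ACT_USERSPACE, ACT_PUSH_VLAN, ACT_HASH };

struct demo_attr {                  /* simplified stand-in for struct nlattr */
    enum demo_action type;
    uint32_t arg;                   /* e.g. output port number or VLAN TCI */
};

/* Handlers are placeholders for do_output(), output_userspace(), ... */
static void do_output_demo(uint32_t port) { printf("output to port %u\n", (unsigned) port); }
static void output_userspace_demo(void)   { printf("send to userspace\n"); }
static void push_vlan_demo(uint32_t tci)  { printf("push vlan tci 0x%x\n", (unsigned) tci); }
static void execute_hash_demo(void)       { printf("compute skb hash\n"); }

/* The shape of do_execute_actions(): walk the attributes, switch on type. */
static void demo_execute_actions(const struct demo_attr *a, int n)
{
    for (int i = 0; i < n; i++) {
        switch (a[i].type) {
        case ACT_OUTPUT:    do_output_demo(a[i].arg);  break;
        case ACT_USERSPACE: output_userspace_demo();   break;
        case ACT_PUSH_VLAN: push_vlan_demo(a[i].arg);  break;
        case ACT_HASH:      execute_hash_demo();       break;
        }
    }
}

int main(void)
{
    struct demo_attr acts[] = { { ACT_PUSH_VLAN, 100 }, { ACT_OUTPUT, 2 } };
    demo_execute_actions(acts, 2);
    return 0;
}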
Upcall processing
When no matching flow is found, the kernel sends the packet to user space over netlink. The entry point is ovs_dp_upcall(), which calls queue_userspace_packet() to build the skb destined for user space and send it through the netlink mechanism. The main data layout that is assembled is as follows:
output_userspace
  ovs_dp_upcall
    queue_userspace_packet
      skb_zerocopy(user_skb, skb, skb->len, hlen);   /* append the packet to the upcall message */
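Roughly, the upcall built by queue_userspace_packet() is an OVS_PACKET_CMD_MISS generic netlink message; the attribute names below follow include/uapi/linux/openvswitch.h, and the exact attribute set and ordering depend on the kernel version.

/* Approximate layout of the OVS_PACKET_CMD_MISS upcall message
 * (illustrative only):
 *
 *   struct genlmsghdr          cmd = OVS_PACKET_CMD_MISS
 *   struct ovs_header          dp_ifindex of the datapath
 *   OVS_PACKET_ATTR_KEY        flow key built by ovs_flow_key_extract()
 *   OVS_PACKET_ATTR_USERDATA   (optional, for userspace actions)
 *   OVS_PACKET_ATTR_MRU        (optional, set if the packet was reassembled)
 *   OVS_PACKET_ATTR_PACKET     packet data, appended via skb_zerocopy()
 */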
void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
{
    const struct vport *p = OVS_CB(skb)->input_vport;
    struct datapath *dp = p->dp;
    struct sw_flow *flow;
    struct sw_flow_actions *sf_acts;
    struct dp_stats_percpu *stats;
    u64 *stats_counter;
    u32 n_mask_hit;

    stats = this_cpu_ptr(dp->stats_percpu);

    /* Look up flow. */
    flow = ovs_flow_tbl_lookup_stats(&dp->table, key, &n_mask_hit);
    if (unlikely(!flow)) {
        struct dp_upcall_info upcall;
        int error;

        memset(&upcall, 0, sizeof(upcall));
        upcall.cmd = OVS_PACKET_CMD_MISS;
        upcall.portid = ovs_vport_find_upcall_portid(p, skb);
        upcall.mru = OVS_CB(skb)->mru;

        error = ovs_dp_upcall(dp, skb, key, &upcall, 0);
        if (unlikely(error))
            kfree_skb(skb);
        else
            consume_skb(skb);

        stats_counter = &stats->n_missed;
        goto out;
    }

    ovs_flow_stats_update(flow, key->tp.flags, skb);
    sf_acts = rcu_dereference(flow->sf_acts);
    ovs_execute_actions(dp, skb, sf_acts, key);

    stats_counter = &stats->n_hit;

out:
    /* Update datapath statistics. */
    u64_stats_update_begin(&stats->syncp);
    (*stats_counter)++;
    stats->n_mask_hit += n_mask_hit;
    u64_stats_update_end(&stats->syncp);
}
action
All actions are iterated over. The supported types are: OFPACT_OUTPUT, OFPACT_GROUP, OFPACT_CONTROLLER, OFPACT_ENQUEUE, OFPACT_SET_VLAN_VID, OFPACT_SET_VLAN_PCP, OFPACT_STRIP_VLAN, OFPACT_PUSH_VLAN, OFPACT_SET_ETH_SRC, OFPACT_SET_ETH_DST, OFPACT_SET_IPV4_SRC, OFPACT_SET_IPV4_DST, OFPACT_SET_IP_DSCP, OFPACT_SET_IP_ECN, OFPACT_SET_IP_TTL, OFPACT_SET_L4_SRC_PORT, OFPACT_SET_L4_DST_PORT, OFPACT_RESUBMIT, OFPACT_SET_TUNNEL, OFPACT_SET_QUEUE, OFPACT_POP_QUEUE, OFPACT_REG_MOVE, OFPACT_SET_FIELD, OFPACT_STACK_PUSH, OFPACT_STACK_POP, OFPACT_PUSH_MPLS, OFPACT_POP_MPLS, OFPACT_SET_MPLS_LABEL, OFPACT_SET_MPLS_TC, OFPACT_SET_MPLS_TTL, OFPACT_DEC_MPLS_TTL, OFPACT_DEC_TTL, OFPACT_NOTE, OFPACT_MULTIPATH, OFPACT_BUNDLE, OFPACT_OUTPUT_REG, OFPACT_OUTPUT_TRUNC, OFPACT_LEARN, OFPACT_CONJUNCTION, OFPACT_EXIT, OFPACT_UNROLL_XLATE, OFPACT_FIN_TIMEOUT, OFPACT_CLEAR_ACTIONS, OFPACT_WRITE_ACTIONS, OFPACT_WRITE_METADATA, OFPACT_METER, OFPACT_GOTO_TABLE, OFPACT_SAMPLE, OFPACT_CLONE, OFPACT_CT, OFPACT_CT_CLEAR, OFPACT_NAT, OFPACT_DEBUG_RECIRC.
There are many actions, so let's first go over a few common ones.
OFPACT_OUTPUT: xlate_output_action performs the appropriate handling depending on the port; we will not go into the details here.
OFPACT_CONTROLLER: execute_controller_action builds a packet-in message and sends it.
OFPACT_SET_ETH_SRC, OFPACT_SET_ETH_DST, OFPACT_SET_IPV4_SRC, OFPACT_SET_IPV4_DST, OFPACT_SET_IP_DSCP, OFPACT_SET_IP_ECN, OFPACT_SET_IP_TTL, OFPACT_SET_L4_SRC_PORT, OFPACT_SET_L4_DST_PORT: rewrite the source/destination MAC and IP addresses, as well as DSCP, ECN, TTL and the L4 source/destination ports.
OFPACT_RESUBMIT: xlate_ofpact_resubmit continues the lookup in the specified flow table.
OFPACT_SET_TUNNEL: set the tunnel id.
OFPACT_CT: compose_conntrack_action applies the conntrack settings and then calls back into do_xlate_actions to execute the remaining actions.
packet-in packets
ovs-ofctl add-flow br0 in_port=1,actions=controller
OFPP_CONTROLLER

xlate_actions
 ┣━ do_xlate_actions(ofpacts, ofpacts_len, &ctx, true, false);
 ┣━ xlate_output_action

do_xlate_actions
{
    switch (a->type) {
    case OFPACT_OUTPUT:
        xlate_output_action(ctx, ofpact_get_OUTPUT(a)->port,
                            ofpact_get_OUTPUT(a)->max_len, true, last,
                            false, group_bucket_action);
        break;
    }

xlate_output_action(struct xlate_ctx *ctx, ofp_port_t port,
                    uint16_t controller_len, bool may_packet_in,
                    bool is_last_action, bool truncate,
                    bool group_bucket_action)
{
    case OFPP_CONTROLLER:
        xlate_controller_action(ctx, controller_len,
                                (ctx->in_packet_out ? OFPR_PACKET_OUT
                                 : group_bucket_action ? OFPR_GROUP
                                 : ctx->in_action_set ? OFPR_ACTION_SET
                                 : OFPR_ACTION),
                                0, UINT32_MAX, NULL, 0);
}
packet-out packets
Handling the packet-out messages sent by the controller (a simplified sketch follows the call chain):
ofconn_run
|
handle_openflow
|
handle_packet_out
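As a rough, self-contained sketch of what the packet-out handler has to do (the structures and names below are simplified stand-ins, not the ofproto data types): decode the controller's message into packet data, a nominal input port and an action list, then hand the actions to the translation/execution path (in real OVS, roughly ofputil_decode_packet_out() followed by xlate_actions()).

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for a decoded OFPT_PACKET_OUT message. */
struct demo_packet_out {
    const uint8_t *packet;   /* packet data supplied by the controller */
    size_t len;
    uint32_t in_port;        /* nominal ingress port */
    const char *actions;     /* action list to apply, e.g. "output:2" */
};

/* Shape of the handler: validate, translate the actions, execute them. */
static int demo_handle_packet_out(const struct demo_packet_out *po)
{
    if (!po->packet || !po->len) {
        return -1;                        /* malformed message */
    }
    printf("packet-out: %zu bytes, in_port=%u, actions=%s\n",
           po->len, (unsigned) po->in_port, po->actions);
    /* real OVS: decode, run the actions through xlate/execute */
    return 0;
}

int main(void)
{
    uint8_t frame[64] = { 0 };
    struct demo_packet_out po = { frame, sizeof frame, 1, "output:2" };
    return demo_handle_packet_out(&po);
}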
dpdk
(gdb) b odp_execute_actions
Breakpoint 4 at 0x9ac7c0: file lib/odp-execute.c, line 849.
(gdb) c
Continuing.
[Switching to Thread 0xfffd797ad510 (LWP 117399)]

Breakpoint 4, odp_execute_actions (dp=dp@entry=0xfffd797a4338, batch=batch@entry=0xfffd797a43c8, steal=steal@entry=true, actions=0xfffd797a44e0, actions_len=8, dp_execute_action=dp_execute_action@entry=0x9799b4 <dp_execute_cb>) at lib/odp-execute.c:849
849     {
(gdb) bt
#0  odp_execute_actions (dp=dp@entry=0xfffd797a4338, batch=batch@entry=0xfffd797a43c8, steal=steal@entry=true, actions=0xfffd797a44e0, actions_len=8, dp_execute_action=dp_execute_action@entry=0x9799b4 <dp_execute_cb>) at lib/odp-execute.c:849
#1  0x00000000009765c0 in dp_netdev_execute_actions (actions_len=<optimized out>, actions=<optimized out>, flow=0xfffd797a48d8, should_steal=true, packets=0xfffd797a43c8, pmd=0x17dcfc70) at lib/dpif-netdev.c:7596
#2  handle_packet_upcall (put_actions=0xfffd797a4388, actions=0xfffd797a4348, key=0xfffd797a80d0, packet=0x21ed94b480, pmd=0x17dcfc70) at lib/dpif-netdev.c:6846
#3  fast_path_processing (pmd=pmd@entry=0x17dcfc70, packets_=packets_@entry=0xfffd797aca80, keys=keys@entry=0xfffd797a7d60, flow_map=flow_map@entry=0xfffd797a5640, index_map=index_map@entry=0xfffd797a5620 "", in_port=<optimized out>) at lib/dpif-netdev.c:6942
#4  0x0000000000977f58 in dp_netdev_input__ (pmd=pmd@entry=0x17dcfc70, packets=packets@entry=0xfffd797aca80, md_is_valid=md_is_valid@entry=false, port_no=port_no@entry=3) at lib/dpif-netdev.c:7031
#5  0x0000000000978680 in dp_netdev_input (port_no=3, packets=0xfffd797aca80, pmd=0x17dcfc70) at lib/dpif-netdev.c:7069
#6  dp_netdev_process_rxq_port (pmd=pmd@entry=0x17dcfc70, rxq=0x17432950, port_no=3) at lib/dpif-netdev.c:4480
#7  0x0000000000978a24 in pmd_thread_main (f_=0x17dcfc70) at lib/dpif-netdev.c:5731
#8  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#9  0x0000ffffb19f7d38 in start_thread (arg=0xfffd797ad510) at pthread_create.c:309
#10 0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb) quit
A debugging session is active.

        Inferior 1 [process 116553] will be detached.
dp_execute_action
(gdb) b dp_execute_cb
Breakpoint 2 at 0x9799b4: file lib/dpif-netdev.c, line 7237.
(gdb) c
Continuing.

Breakpoint 2, dp_execute_cb (aux_=0xfffd53ff4338, packets_=0xfffd53ff43c8, a=0xfffd53ff44e0, should_steal=false) at lib/dpif-netdev.c:7237
7237    {
(gdb) bt
#0  dp_execute_cb (aux_=0xfffd53ff4338, packets_=0xfffd53ff43c8, a=0xfffd53ff44e0, should_steal=false) at lib/dpif-netdev.c:7237
#1  0x00000000009ac8ac in odp_execute_actions (dp=dp@entry=0xfffd53ff4338, batch=batch@entry=0xfffd53ff43c8, steal=steal@entry=true, actions=<optimized out>, actions_len=<optimized out>, dp_execute_action=dp_execute_action@entry=0x9799b4 <dp_execute_cb>) at lib/odp-execute.c:864
#2  0x00000000009765c0 in dp_netdev_execute_actions (actions_len=<optimized out>, actions=<optimized out>, flow=0xfffd53ff48d8, should_steal=true, packets=0xfffd53ff43c8, pmd=0x18f60810) at lib/dpif-netdev.c:7596
#3  handle_packet_upcall (put_actions=0xfffd53ff4388, actions=0xfffd53ff4348, key=0xfffd53ff8570, packet=0x16e027400, pmd=0x18f60810) at lib/dpif-netdev.c:6846
#4  fast_path_processing (pmd=pmd@entry=0x18f60810, packets_=packets_@entry=0xfffd53ffca80, keys=keys@entry=0xfffd53ff7d60, flow_map=flow_map@entry=0xfffd53ff5640, index_map=index_map@entry=0xfffd53ff5620 "", in_port=<optimized out>) at lib/dpif-netdev.c:6942
#5  0x0000000000977f58 in dp_netdev_input__ (pmd=pmd@entry=0x18f60810, packets=packets@entry=0xfffd53ffca80, md_is_valid=md_is_valid@entry=false, port_no=port_no@entry=2) at lib/dpif-netdev.c:7031
#6  0x0000000000978680 in dp_netdev_input (port_no=2, packets=0xfffd53ffca80, pmd=0x18f60810) at lib/dpif-netdev.c:7069
#7  dp_netdev_process_rxq_port (pmd=pmd@entry=0x18f60810, rxq=0x1721f440, port_no=2) at lib/dpif-netdev.c:4480
#8  0x0000000000978a24 in pmd_thread_main (f_=0x18f60810) at lib/dpif-netdev.c:5731
#9  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#10 0x0000ffffb19f7d38 in start_thread (arg=0xfffd53ffd510) at pthread_create.c:309
#11 0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
dp_netdev_process_rxq_port
dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
                           struct netdev_rxq *rx,
                           odp_port_t port_no)
{
    struct dp_packet_batch batch;
    int error;

    dp_packet_batch_init(&batch);
    cycles_count_start(pmd);
    /* Receive packets from rx into the batch via netdev_class->rxq_recv. */
    error = netdev_rxq_recv(rx, &batch);
    cycles_count_end(pmd, PMD_CYCLES_POLLING);
    if (!error) {
        *recirc_depth_get() = 0;

        cycles_count_start(pmd);
        /* Hand the batch over to the datapath for processing. */
        dp_netdev_input(pmd, &batch, port_no);
        cycles_count_end(pmd, PMD_CYCLES_PROCESSING);
    }
    ...
}
netdev_dpdk_rxq_recv
netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch)
{
    /* Receive via the DPDK API rte_eth_rx_burst(), at most 32 packets per call. */
    nb_rx = rte_eth_rx_burst(rx->port_id, rxq->queue_id,
                             (struct rte_mbuf **) batch->packets,
                             NETDEV_MAX_BURST);
}
netdev_send
netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
            bool concurrent_txq)
{
    int error;

    netdev_send_prepare_batch(netdev, batch);
    if (OVS_UNLIKELY(dp_packet_batch_is_empty(batch))) {
        return 0;
    }

    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
    if (!error) {
        COVERAGE_INC(netdev_sent);
    }
    return error;
}
lib/netdev-dpdk.c:5454:static const struct netdev_class dpdk_class = {
lib/netdev-dpdk.c:5462:static const struct netdev_class dpdk_ring_class = {
lib/netdev-dpdk.c:5470:static const struct netdev_class dpdk_vhost_class = {
lib/netdev-dpdk.c:5485:static const struct netdev_class dpdk_vhost_client_class = {
lib/netdev-dummy.c:1567:static const struct netdev_class dummy_class = {
lib/netdev-dummy.c:1572:static const struct netdev_class dummy_internal_class = {
lib/netdev-dummy.c:1577:static const struct netdev_class dummy_pmd_class = {
lib/netdev-linux.c:3552:const struct netdev_class netdev_linux_class = {
lib/netdev-linux.c:3568:const struct netdev_class netdev_tap_class = {
lib/netdev-linux.c:3583:const struct netdev_class netdev_internal_class = {
lib/netdev-linux.c:3598:const struct netdev_class netdev_afxdp_class = {
static const struct netdev_class dpdk_class = {
    .type = "dpdk",
    NETDEV_DPDK_CLASS_BASE,
    .construct = netdev_dpdk_construct,
    .set_config = netdev_dpdk_set_config,
    .send = netdev_dpdk_eth_send,
};

static const struct netdev_class dpdk_ring_class = {
    .type = "dpdkr",
    NETDEV_DPDK_CLASS_BASE,
    .construct = netdev_dpdk_ring_construct,
    .set_config = netdev_dpdk_ring_set_config,
    .send = netdev_dpdk_ring_send,
};

static const struct netdev_class dpdk_vhost_class = {
    .type = "dpdkvhostuser",
    NETDEV_DPDK_CLASS_COMMON,
    .construct = netdev_dpdk_vhost_construct,
    .destruct = netdev_dpdk_vhost_destruct,
    .send = netdev_dpdk_vhost_send,
    .get_carrier = netdev_dpdk_vhost_get_carrier,
    .get_stats = netdev_dpdk_vhost_get_stats,
    .get_custom_stats = netdev_dpdk_get_sw_custom_stats,
    .get_status = netdev_dpdk_vhost_user_get_status,
    .reconfigure = netdev_dpdk_vhost_reconfigure,
    .rxq_recv = netdev_dpdk_vhost_rxq_recv,
    .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
};
(gdb) s
[Switching to Thread 0xfffd797ad510 (LWP 117399)]

Breakpoint 5, netdev_send (netdev=0x1745d240, qid=qid@entry=0, batch=batch@entry=0xfffd4c004ea0, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:893
893         error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
(gdb) bt
#0  netdev_send (netdev=0x1745d240, qid=qid@entry=0, batch=batch@entry=0xfffd4c004ea0, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:893
#1  0x000000000096fae8 in dp_netdev_pmd_flush_output_on_port (pmd=pmd@entry=0x17dcfc70, p=p@entry=0xfffd4c004e70) at lib/dpif-netdev.c:4391
#2  0x000000000096fdfc in dp_netdev_pmd_flush_output_packets (pmd=pmd@entry=0x17dcfc70, force=force@entry=false) at lib/dpif-netdev.c:4431
#3  0x00000000009787d8 in dp_netdev_pmd_flush_output_packets (force=false, pmd=0x17dcfc70) at lib/dpif-netdev.c:4501
#4  dp_netdev_process_rxq_port (pmd=pmd@entry=0x17dcfc70, rxq=0x17432950, port_no=3) at lib/dpif-netdev.c:4486
#5  0x0000000000978a24 in pmd_thread_main (f_=0x17dcfc70) at lib/dpif-netdev.c:5731
#6  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#7  0x0000ffffb19f7d38 in start_thread (arg=0xfffd797ad510) at pthread_create.c:309
#8  0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb) s
netdev_linux_send (netdev_=0x1745d240, qid=0, batch=0xfffd4c004ea0, concurrent_txq=true) at lib/netdev-linux.c:1696
1696    {
(gdb) bt
#0  netdev_linux_send (netdev_=0x1745d240, qid=0, batch=0xfffd4c004ea0, concurrent_txq=true) at lib/netdev-linux.c:1696
#1  0x00000000009a478c in netdev_send (netdev=0x1745d240, qid=qid@entry=0, batch=batch@entry=0xfffd4c004ea0, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:893
#2  0x000000000096fae8 in dp_netdev_pmd_flush_output_on_port (pmd=pmd@entry=0x17dcfc70, p=p@entry=0xfffd4c004e70) at lib/dpif-netdev.c:4391
#3  0x000000000096fdfc in dp_netdev_pmd_flush_output_packets (pmd=pmd@entry=0x17dcfc70, force=force@entry=false) at lib/dpif-netdev.c:4431
#4  0x00000000009787d8 in dp_netdev_pmd_flush_output_packets (force=false, pmd=0x17dcfc70) at lib/dpif-netdev.c:4501
#5  dp_netdev_process_rxq_port (pmd=pmd@entry=0x17dcfc70, rxq=0x17432950, port_no=3) at lib/dpif-netdev.c:4486
#6  0x0000000000978a24 in pmd_thread_main (f_=0x17dcfc70) at lib/dpif-netdev.c:5731
#7  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#8  0x0000ffffb19f7d38 in start_thread (arg=0xfffd797ad510) at pthread_create.c:309
#9  0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
lib/netdev.c:883:netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
(gdb) b lib/netdev.c:883
Note: breakpoint 3 also set at pc 0x9a4568.
Breakpoint 4 at 0x9a4568: file lib/netdev.c, line 883.
(gdb) c
Continuing.
[New Thread 0xffffac55f910 (LWP 98082)]
[Switching to Thread 0xfffd53ffd510 (LWP 117402)]

Breakpoint 3, netdev_send (netdev=0x1745d240, qid=qid@entry=0, batch=batch@entry=0xfffd4408da40, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:885
885     {
(gdb) bt
#0  netdev_send (netdev=0x1745d240, qid=qid@entry=0, batch=batch@entry=0xfffd4408da40, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:885
#1  0x000000000096fae8 in dp_netdev_pmd_flush_output_on_port (pmd=pmd@entry=0x18f60810, p=p@entry=0xfffd4408da10) at lib/dpif-netdev.c:4391
#2  0x000000000096fdfc in dp_netdev_pmd_flush_output_packets (pmd=pmd@entry=0x18f60810, force=force@entry=false) at lib/dpif-netdev.c:4431
#3  0x00000000009787d8 in dp_netdev_pmd_flush_output_packets (force=false, pmd=0x18f60810) at lib/dpif-netdev.c:4501
#4  dp_netdev_process_rxq_port (pmd=pmd@entry=0x18f60810, rxq=0x1721f440, port_no=2) at lib/dpif-netdev.c:4486
#5  0x0000000000978a24 in pmd_thread_main (f_=0x18f60810) at lib/dpif-netdev.c:5731
#6  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#7  0x0000ffffb19f7d38 in start_thread (arg=0xfffd53ffd510) at pthread_create.c:309
#8  0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb) c
Continuing.
[Switching to Thread 0xfffd797ad510 (LWP 117399)]

Breakpoint 3, netdev_send (netdev=0x1745d240, qid=qid@entry=0, batch=batch@entry=0xfffd4c004ea0, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:885
885     {
(gdb) bt
#0  netdev_send (netdev=0x1745d240, qid=qid@entry=0, batch=batch@entry=0xfffd4c004ea0, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:885
#1  0x000000000096fae8 in dp_netdev_pmd_flush_output_on_port (pmd=pmd@entry=0x17dcfc70, p=p@entry=0xfffd4c004e70) at lib/dpif-netdev.c:4391
#2  0x000000000096fdfc in dp_netdev_pmd_flush_output_packets (pmd=pmd@entry=0x17dcfc70, force=force@entry=false) at lib/dpif-netdev.c:4431
#3  0x00000000009787d8 in dp_netdev_pmd_flush_output_packets (force=false, pmd=0x17dcfc70) at lib/dpif-netdev.c:4501
#4  dp_netdev_process_rxq_port (pmd=pmd@entry=0x17dcfc70, rxq=0x17432950, port_no=3) at lib/dpif-netdev.c:4486
#5  0x0000000000978a24 in pmd_thread_main (f_=0x17dcfc70) at lib/dpif-netdev.c:5731
#6  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#7  0x0000ffffb19f7d38 in start_thread (arg=0xfffd797ad510) at pthread_create.c:309
#8  0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
netdev_dpdk_vhost
(gdb) bt
#0  netdev_dpdk_vhost_send (netdev=0x16d6e3dc0, qid=0, batch=0xfffd440a1a20, concurrent_txq=true) at lib/netdev-dpdk.c:2887
#1  0x00000000009a478c in netdev_send (netdev=0x16d6e3dc0, qid=qid@entry=0, batch=batch@entry=0xfffd440a1a20, concurrent_txq=concurrent_txq@entry=true) at lib/netdev.c:893
#2  0x000000000096fae8 in dp_netdev_pmd_flush_output_on_port (pmd=pmd@entry=0x18f60810, p=p@entry=0xfffd440a19f0) at lib/dpif-netdev.c:4391
#3  0x000000000096fdfc in dp_netdev_pmd_flush_output_packets (pmd=pmd@entry=0x18f60810, force=force@entry=false) at lib/dpif-netdev.c:4431
#4  0x00000000009787d8 in dp_netdev_pmd_flush_output_packets (force=false, pmd=0x18f60810) at lib/dpif-netdev.c:4501
#5  dp_netdev_process_rxq_port (pmd=pmd@entry=0x18f60810, rxq=0x1721f440, port_no=2) at lib/dpif-netdev.c:4486
#6  0x0000000000978a24 in pmd_thread_main (f_=0x18f60810) at lib/dpif-netdev.c:5731
#7  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#8  0x0000ffffb19f7d38 in start_thread (arg=0xfffd53ffd510) at pthread_create.c:309
#9  0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
netdev_dpdk_eth_send
(gdb) bt
#0  netdev_dpdk_eth_send (netdev=0x19feba4c0, qid=1, batch=0xfffd4c0050f0, concurrent_txq=false) at lib/netdev-dpdk.c:2960
#1  0x00000000009a478c in netdev_send (netdev=0x19feba4c0, qid=qid@entry=1, batch=batch@entry=0xfffd4c0050f0, concurrent_txq=concurrent_txq@entry=false) at lib/netdev.c:893
#2  0x000000000096fae8 in dp_netdev_pmd_flush_output_on_port (pmd=pmd@entry=0x17dcfc70, p=p@entry=0xfffd4c0050c0) at lib/dpif-netdev.c:4391
#3  0x000000000096fdfc in dp_netdev_pmd_flush_output_packets (pmd=pmd@entry=0x17dcfc70, force=force@entry=false) at lib/dpif-netdev.c:4431
#4  0x00000000009787d8 in dp_netdev_pmd_flush_output_packets (force=false, pmd=0x17dcfc70) at lib/dpif-netdev.c:4501
#5  dp_netdev_process_rxq_port (pmd=pmd@entry=0x17dcfc70, rxq=0x17432950, port_no=3) at lib/dpif-netdev.c:4486
#6  0x0000000000978a24 in pmd_thread_main (f_=0x17dcfc70) at lib/dpif-netdev.c:5731
#7  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#8  0x0000ffffb19f7d38 in start_thread (arg=0xfffd797ad510) at pthread_create.c:309
#9  0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
EMC + dpcls + ofproto classifier
After a packet enters OVS-DPDK from a physical or virtual interface, a unique identifier or hash is derived from its header fields. This identifier is matched against entries in three lookup tables: the exact match cache (EMC), the datapath classifier (dpcls), and the ofproto classifier. The packet traverses these three tables in order until an entry matches; it then executes all the actions indicated by the match and is forwarded.
The EMC provides fast processing for a limited number of entries; a packet's identifier must match an EMC entry exactly on the IP 5-tuple. If the EMC misses, the packet goes to the dpcls. The dpcls maintains many more entries across multiple subtables and can match the packet identifier with wildcards. When a packet matches in the dpcls, a corresponding entry is installed in the EMC so that later packets with the same identifier can be handled quickly by the EMC. If the dpcls also misses, the packet goes to the ofproto classifier and is handled according to the OpenFlow controller's rules. If a matching entry is found in the ofproto classifier, that entry is pushed down to the faster tables so that later packets of the same flow are processed quickly. (Translated from https://software.intel.com/en-us/articles/open-vswitch-with-dpdk-overview)
Note: the EMC is scoped per PMD thread, so each PMD has its own EMC; the dpcls is scoped per port, so each port has its own dpcls; the ofproto classifier is scoped per bridge, so each bridge has its own ofproto classifier.
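The three-tier order can be summarized with a small, self-contained control-flow sketch (the *_demo functions are placeholders for emc_lookup(), dpcls_lookup() and the upcall path; they are not the real APIs):

#include <stdio.h>

struct demo_flow { const char *actions; };

/* Stand-ins for the three lookup tiers. */
static struct demo_flow *emc_lookup_demo(void)   { return NULL; }  /* miss */
static struct demo_flow *dpcls_lookup_demo(void) { return NULL; }  /* miss */
static struct demo_flow *ofproto_upcall_demo(void)
{
    static struct demo_flow f = { "output:2" };
    return &f;   /* the slow path always produces a flow (or a drop) */
}

/* Control flow of per-packet classification in OVS-DPDK. */
static const char *classify_demo(void)
{
    struct demo_flow *flow = emc_lookup_demo();       /* 1. EMC (per PMD) */
    if (!flow) {
        flow = dpcls_lookup_demo();                   /* 2. dpcls (per port) */
        /* on a dpcls hit, the entry is also inserted into the EMC */
    }
    if (!flow) {
        flow = ofproto_upcall_demo();                 /* 3. ofproto classifier */
        /* on return, a megaflow is installed into the dpcls (and EMC) */
    }
    return flow->actions;
}

int main(void)
{
    printf("actions: %s\n", classify_demo());
    return 0;
}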
dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
                  struct dp_packet_batch *packets,
                  bool md_is_valid, odp_port_t port_no)
{
    int cnt = packets->count;
#if !defined(__CHECKER__) && !defined(_WIN32)
    const size_t PKT_ARRAY_SIZE = cnt;
#else
    /* Sparse or MSVC doesn't like variable length array. */
    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
#endif
    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct netdev_flow_key keys[PKT_ARRAY_SIZE];
    struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
    long long now = time_msec();
    size_t newcnt, n_batches, i;
    odp_port_t in_port;

    n_batches = 0;
    /* Run every packet in the dp_packet_batch through the EMC (pmd->flow_cache).
     * Returns the number of packets that must go on to fast_path_processing().
     * If the metadata is not yet valid (md_is_valid == false), it also
     * initializes the metadata from port_no. */
    newcnt = emc_processing(pmd, packets, keys, batches, &n_batches,
                            md_is_valid, port_no);
    if (OVS_UNLIKELY(newcnt)) {
        packets->count = newcnt;
        /* Get ingress port from first packet's metadata. */
        in_port = packets->packets[0]->md.in_port.odp_port;
        fast_path_processing(pmd, packets, keys, batches, &n_batches,
                             in_port, now);
    }
}
packet_batch_per_flow_execute
(gdb) bt
#0 dp_netdev_execute_actions (actions_len=16, actions=0xfffd4c004ab4, flow=0xfffd4c005620, should_steal=true, packets=0xfffd797a5960, pmd=0x17dcfc70) at lib/dpif-netdev.c:7594
#1 packet_batch_per_flow_execute (pmd=0x17dcfc70, batch=0xfffd797a5950) at lib/dpif-netdev.c:6551
#2 dp_netdev_input__ (pmd=pmd@entry=0x17dcfc70, packets=packets@entry=0xfffd797aca80, md_is_valid=md_is_valid@entry=false, port_no=port_no@entry=3) at lib/dpif-netdev.c:7060
#3 0x0000000000978680 in dp_netdev_input (port_no=3, packets=0xfffd797aca80, pmd=0x17dcfc70) at lib/dpif-netdev.c:7069
#4 dp_netdev_process_rxq_port (pmd=pmd@entry=0x17dcfc70, rxq=0x17432950, port_no=3) at lib/dpif-netdev.c:4480
#5 0x0000000000978a24 in pmd_thread_main (f_=0x17dcfc70) at lib/dpif-netdev.c:5731
#6 0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#7 0x0000ffffb19f7d38 in start_thread (arg=0xfffd797ad510) at pthread_create.c:309
#8 0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
(gdb) bt
#0  packet_batch_per_flow_execute (pmd=0x17dcfc70, batch=0xfffd797a5950) at lib/dpif-netdev.c:6543
#1  dp_netdev_input__ (pmd=pmd@entry=0x17dcfc70, packets=packets@entry=0xfffd797aca80, md_is_valid=md_is_valid@entry=false, port_no=port_no@entry=3) at lib/dpif-netdev.c:7060
#2  0x0000000000978680 in dp_netdev_input (port_no=3, packets=0xfffd797aca80, pmd=0x17dcfc70) at lib/dpif-netdev.c:7069
#3  dp_netdev_process_rxq_port (pmd=pmd@entry=0x17dcfc70, rxq=0x17432950, port_no=3) at lib/dpif-netdev.c:4480
#4  0x0000000000978a24 in pmd_thread_main (f_=0x17dcfc70) at lib/dpif-netdev.c:5731
#5  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#6  0x0000ffffb19f7d38 in start_thread (arg=0xfffd797ad510) at pthread_create.c:309
#7  0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
upcall_receive
Breakpoint 5, upcall_receive (upcall=upcall@entry=0xfffd797aa730, backer=0x1746b1e0, packet=0x21ede43280, type=DPIF_UC_MISS, userdata=0x0, flow=flow@entry=0xfffd797ab768, mru=mru@entry=0, ufid=0xfffd797ab1b8, pmd_id=3) at ofproto/ofproto-dpif-upcall.c:1114
1114    {
(gdb) bt
#0  upcall_receive (upcall=upcall@entry=0xfffd797aa730, backer=0x1746b1e0, packet=0x21ede43280, type=DPIF_UC_MISS, userdata=0x0, flow=flow@entry=0xfffd797ab768, mru=mru@entry=0, ufid=0xfffd797ab1b8, pmd_id=3) at ofproto/ofproto-dpif-upcall.c:1114
#1  0x0000000000943620 in upcall_cb (packet=<optimized out>, flow=0xfffd797ab768, ufid=<optimized out>, pmd_id=<optimized out>, type=<optimized out>, userdata=<optimized out>, actions=0xfffd797ab1d8, wc=0xfffd797aba08, put_actions=0xfffd797ab218, aux=0x17463ac0) at ofproto/ofproto-dpif-upcall.c:1315
#2  0x000000000096c6e8 in dp_netdev_upcall (pmd=pmd@entry=0x17dcfc70, packet_=packet_@entry=0x21ede43280, flow=flow@entry=0xfffd797ab768, wc=wc@entry=0xfffd797aba08, ufid=ufid@entry=0xfffd797ab1b8, type=type@entry=DPIF_UC_MISS, userdata=userdata@entry=0x0, actions=actions@entry=0xfffd797ab1d8, put_actions=0xfffd797ab218) at lib/dpif-netdev.c:6466
#3  0x0000000000976558 in handle_packet_upcall (put_actions=0xfffd797ab218, actions=0xfffd797ab1d8, key=0xfffd797ac640, packet=0x21ede43280, pmd=0x17dcfc70) at lib/dpif-netdev.c:6823
#4  fast_path_processing (pmd=pmd@entry=0x17dcfc70, packets_=packets_@entry=0xfffd797aca80, keys=keys@entry=0xfffd797ac610, flow_map=flow_map@entry=0xfffd797ac4c0, index_map=index_map@entry=0xfffd797ac4b0 "", in_port=<optimized out>) at lib/dpif-netdev.c:6942
#5  0x0000000000977f58 in dp_netdev_input__ (pmd=pmd@entry=0x17dcfc70, packets=packets@entry=0xfffd797aca80, md_is_valid=md_is_valid@entry=false, port_no=port_no@entry=3) at lib/dpif-netdev.c:7031
#6  0x0000000000978680 in dp_netdev_input (port_no=3, packets=0xfffd797aca80, pmd=0x17dcfc70) at lib/dpif-netdev.c:7069
#7  dp_netdev_process_rxq_port (pmd=pmd@entry=0x17dcfc70, rxq=0x17432950, port_no=3) at lib/dpif-netdev.c:4480
#8  0x0000000000978a24 in pmd_thread_main (f_=0x17dcfc70) at lib/dpif-netdev.c:5731
#9  0x00000000009fc5dc in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:383
#10 0x0000ffffb19f7d38 in start_thread (arg=0xfffd797ad510) at pthread_create.c:309
#11 0x0000ffffb16df690 in thread_start () from /lib64/libc.so.6
(gdb)
handle_packet_upcall
fast_path_processing calls handle_packet_upcall, which continues the lookup into the ofproto classifier flow tables:

dp_netdev_input__ --> handle_packet_upcall
 ┣━ dp_netdev_upcall
 |   ┣━ upcall_cb
 |       ┣━ upcall_receive
 |       ┣━ process_upcall
 ┣━ dp_netdev_execute_actions
 ┣━ dp_netdev_flow_add
udpif_upcall_handler
udpif_upcall_handler
┣━ recv_upcalls
┣━dpif_recv
┣━upcall_receive
┣━process_upcall
▏   ┣━ upcall_xlate
▏   ┣━ xlate_actions
┣━handle_upcalls
┣━dpif_operate
┣━dpif->dpif_class->operate( dpif_netlink_operate)
- When the kernel cannot match a flow, it sends the packet to user space over netlink; the user-space udpif_upcall_handler thread receives the upcall information sent by the kernel.
- xlate_actions is called to translate the actions. It calls rule_dpif_lookup_from_table to look up a rule in the flow tables; once a rule is found, do_xlate_actions is called to perform the operation corresponding to each action type.
- handle_upcalls calls dpif_netlink_operate to update the datapath flow rules in the kernel.
dpif_recv(struct dpif *dpif, uint32_t handler_id, struct dpif_upcall *upcall,
          struct ofpbuf *buf)
{
    int error = EAGAIN;

    if (dpif->dpif_class->recv) {
        error = dpif->dpif_class->recv(dpif, handler_id, upcall, buf);
        if (!error) {
            dpif_print_packet(dpif, upcall);
        } else if (error != EAGAIN) {
            log_operation(dpif, "recv", error);
        }
    }
    return error;
}
const struct dpif_class dpif_netlink_class = {
    "system",
    false,              /* cleanup_required */
    NULL,               /* init */
    dpif_netlink_recv,
dpif_netlink_operate
dpif_netlink_operate
 ┣━ try_send_to_netdev --> parse_flow_put --> netdev_flow_put --> netdev_tc_flow_put
 ┣━ dpif_netlink_operate_chunks
 ┣━ dpif_netlink_operate__
process_upcall is then called to handle the upcall. The upcall types are:
enum upcall_type {
    BAD_UPCALL,                 /* Some kind of bug somewhere. */
    MISS_UPCALL,                /* A flow miss. */
    SFLOW_UPCALL,               /* sFlow sample. */
    FLOW_SAMPLE_UPCALL,         /* Per-flow sampling. */
    IPFIX_UPCALL                /* Per-bridge sampling. */
};
The type used when the kernel datapath finds no match is MISS_UPCALL.
OVS send path
kernel
netdev_frame_hook-->netdev_port_receive-->ovs_vport_receive-->ovs_dp_process_packet-->ovs_execute_actions-->do_execute_actions-->do_output
Here the device MTU is compared against the packet's MRU to decide whether fragmentation is needed. Normally the packet's MRU is 0 and no fragmentation is needed, but the reassembly function handle_fragments sets the packet's MRU when it reassembles fragments, so the fragmentation decision is also based on this field; ovs_fragment is called to do the fragmentation.
After fragmentation, the packet is sent out via ovs_vport_send --> dev_queue_xmit --> dev_hard_start_xmit --> ops->ndo_start_xmit --> bond_start_xmit.
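A rough, self-contained sketch of the decision described above (simplified; the real check in do_output() also accounts for the link-layer header length and the ethertype):

#include <stdint.h>
#include <stdio.h>

enum demo_verdict { DEMO_SEND, DEMO_FRAGMENT, DEMO_DROP };

/* Roughly: if the packet still fits within the MRU recorded at reassembly
 * time (or was never reassembled, mru == 0), send it as-is; otherwise
 * fragment it, provided fragments of size mru fit the egress MTU;
 * otherwise drop it. */
static enum demo_verdict demo_output_decision(uint32_t pkt_len, uint16_t mru,
                                              uint16_t dev_mtu)
{
    if (mru == 0 || pkt_len <= mru) {
        return DEMO_SEND;
    }
    return mru <= dev_mtu ? DEMO_FRAGMENT : DEMO_DROP;
}

int main(void)
{
    printf("len=1400, mru=0,    mtu=1500 -> %d\n", demo_output_decision(1400, 0, 1500));
    printf("len=3000, mru=1500, mtu=1500 -> %d\n", demo_output_decision(3000, 1500, 1500));
    printf("len=3000, mru=9000, mtu=1500 -> %d\n", demo_output_decision(3000, 9000, 1500));
    return 0;
}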
The other kind of fragmentation occurs when the VM enables features such as TSO, GSO, or UFO, which produce large packets; in that case the offload flags determine whether segmentation is needed.
netdev_frame_hook-->netdev_port_receive-->ovs_vport_receive-->ovs_dp_process_packet-->ovs_execute_actions-->do_execute_actions-->do_output-->ovs_vport_send-->dev_queue_xmit-->dev_hard_start_xmit
Here the packet is segmented according to GSO; after segmentation, ops->ndo_start_xmit --> bond_start_xmit is called to send the packet. In practice some other features are also consulted at this point; for example, with TSO the segmentation is usually left to the NIC.
OVS-DPDK
pmd_thread_main-->dp_netdev_process_rxq_port-->dp_netdev_input-->dp_netdev_input__-->packet_batch_per_flow_execute-->dp_netdev_execute_actions-->dp_execute_cb-->netdev_send-->netdev_dpdk_eth_send
Whether to segment is generally decided based on the NIC's features.
PKT_RX_FDIR_ID flag
/* Bit masks for the 'ol_flags' member of the 'dp_packet' structure. */
enum dp_packet_offload_mask {
    /* Value 0 is not used. */
    /* Is the 'rss_hash' valid? */
    DEF_OL_FLAG(DP_PACKET_OL_RSS_HASH, PKT_RX_RSS_HASH, 0x1),
    /* Is the 'flow_mark' valid? */
    DEF_OL_FLAG(DP_PACKET_OL_FLOW_MARK, PKT_RX_FDIR_ID, 0x2),
    /* Bad L4 checksum in the packet. */
    DEF_OL_FLAG(DP_PACKET_OL_RX_L4_CKSUM_BAD, PKT_RX_L4_CKSUM_BAD, 0x4),
    /* Bad IP checksum in the packet. */
    DEF_OL_FLAG(DP_PACKET_OL_RX_IP_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD, 0x8),
    /* Valid L4 checksum in the packet. */
    DEF_OL_FLAG(DP_PACKET_OL_RX_L4_CKSUM_GOOD, PKT_RX_L4_CKSUM_GOOD, 0x10),
    /* Valid IP checksum in the packet. */
    DEF_OL_FLAG(DP_PACKET_OL_RX_IP_CKSUM_GOOD, PKT_RX_IP_CKSUM_GOOD, 0x20),
    /* TCP Segmentation Offload. */
    DEF_OL_FLAG(DP_PACKET_OL_TX_TCP_SEG, PKT_TX_TCP_SEG, 0x40),
    /* Offloaded packet is IPv4. */
    DEF_OL_FLAG(DP_PACKET_OL_TX_IPV4, PKT_TX_IPV4, 0x80),
    /* Offloaded packet is IPv6. */
    DEF_OL_FLAG(DP_PACKET_OL_TX_IPV6, PKT_TX_IPV6, 0x100),
    /* Offload TCP checksum. */
    DEF_OL_FLAG(DP_PACKET_OL_TX_TCP_CKSUM, PKT_TX_TCP_CKSUM, 0x200),
    /* Offload UDP checksum. */
    DEF_OL_FLAG(DP_PACKET_OL_TX_UDP_CKSUM, PKT_TX_UDP_CKSUM, 0x400),
    /* Offload SCTP checksum. */
    DEF_OL_FLAG(DP_PACKET_OL_TX_SCTP_CKSUM, PKT_TX_SCTP_CKSUM, 0x800),
    /* Adding new field requires adding to DP_PACKET_OL_SUPPORTED_MASK. */
};
After the DPDK PMD receives a packet, it detects whether rte_mbuf.ol_flags contains the PKT_RX_FDIR_ID flag:
dp_netdev_input__
dfc_processing
dp_packet_has_flow_mark(struct dp_packet *p OVS_UNUSED, uint32_t *mark OVS_UNUSED)
{
#ifdef DPDK_NETDEV
    if (p->mbuf.ol_flags & PKT_RX_FDIR_ID) {
        *mark = p->mbuf.hash.fdir.hi;
        return true;
    }
#endif
    return false;
}
dp_packet_has_flow_mark(const struct dp_packet *p, uint32_t *mark)
{
    if (*dp_packet_ol_flags_ptr(p) & DP_PACKET_OL_FLOW_MARK) {
        *mark = *dp_packet_flow_mark_ptr(p);
        return true;
    }
    return false;
}
if ((*recirc_depth_get() == 0) && dp_packet_has_flow_mark(packet, &mark)) {
    flow = mark_to_flow_find(pmd, mark);
    if (OVS_LIKELY(flow)) {
        tcp_flags = parse_tcp_flags(packet);
        if (OVS_LIKELY(batch_enable)) {
            dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
                                    n_batches);
        } else {
            /* Flow batching should be performed only after fast-path
             * processing is also completed for packets with emc miss
             * or else it will result in reordering of packets with
             * same datapath flows. */
            packet_enqueue_to_flow_map(packet, flow, tcp_flags,
                                       flow_map, map_cnt++);
        }
        continue;
    }
}
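Conceptually, mark_to_flow_find() is a hash lookup keyed by the 32-bit mark that the NIC wrote into the mbuf. A self-contained sketch of that association (names invented; the real code keeps a per-datapath mark-to-flow map, populated when the flow is offloaded to the NIC):

#include <stdint.h>
#include <stdio.h>

struct demo_flow { uint32_t mark; const char *actions; };

#define MARK_TABLE_SIZE 256   /* power of two for cheap masking */
static struct demo_flow *mark_table[MARK_TABLE_SIZE];

/* Offload install path: remember mark -> flow. */
static void demo_mark_associate(struct demo_flow *f)
{
    mark_table[f->mark & (MARK_TABLE_SIZE - 1)] = f;
}

/* Datapath: PKT_RX_FDIR_ID is set, mbuf.hash.fdir.hi carries the mark. */
static struct demo_flow *demo_mark_to_flow_find(uint32_t mark)
{
    struct demo_flow *f = mark_table[mark & (MARK_TABLE_SIZE - 1)];
    return (f && f->mark == mark) ? f : NULL;
}

int main(void)
{
    struct demo_flow f = { 42, "output:1" };
    demo_mark_associate(&f);
    struct demo_flow *hit = demo_mark_to_flow_find(42);
    printf("mark 42 -> %s\n", hit ? hit->actions : "miss (full lookup)");
    return 0;
}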
struct rte_flow_action_mark
netdev_dpdk_add_rte_flow_offload(struct netdev *netdev,
                                 const struct match *match,
                                 struct nlattr *nl_actions OVS_UNUSED,
                                 size_t actions_len OVS_UNUSED,
                                 const ovs_u128 *ufid,
                                 struct offload_info *info)
{
    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
    const struct rte_flow_attr flow_attr = {
        .group = 0,
        .priority = 0,
        .ingress = 1,
        .egress = 0
    };
    struct flow_patterns patterns = { .items = NULL, .cnt = 0 };
    struct flow_actions actions = { .actions = NULL, .cnt = 0 };
    struct rte_flow *flow;
    struct rte_flow_error error;
    uint8_t *ipv4_next_proto_mask = NULL;
    int ret = 0;

    /* Eth */
    struct rte_flow_item_eth eth_spec;
    struct rte_flow_item_eth eth_mask;
    memset(&eth_spec, 0, sizeof(eth_spec));
    memset(&eth_mask, 0, sizeof(eth_mask));
    if (!eth_addr_is_zero(match->wc.masks.dl_src) ||
        !eth_addr_is_zero(match->wc.masks.dl_dst)) {
        rte_memcpy(&eth_spec.dst, &match->flow.dl_dst, sizeof(eth_spec.dst));
        rte_memcpy(&eth_spec.src, &match->flow.dl_src, sizeof(eth_spec.src));
        eth_spec.type = match->flow.dl_type;

        rte_memcpy(&eth_mask.dst, &match->wc.masks.dl_dst, sizeof(eth_mask.dst));
        rte_memcpy(&eth_mask.src, &match->wc.masks.dl_src, sizeof(eth_mask.src));
        eth_mask.type = match->wc.masks.dl_type;

        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_ETH,
                         &eth_spec, &eth_mask);
    } else {
        /*
         * If user specifies a flow (like UDP flow) without L2 patterns,
         * OVS will at least set the dl_type. Normally, it's enough to
         * create an eth pattern just with it. Unluckily, some Intel's
         * NIC (such as XL710) doesn't support that. Below is a workaround,
         * which simply matches any L2 pkts.
         */
        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_ETH, NULL, NULL);
    }

    /* VLAN */
    struct rte_flow_item_vlan vlan_spec;
    struct rte_flow_item_vlan vlan_mask;
    memset(&vlan_spec, 0, sizeof(vlan_spec));
    memset(&vlan_mask, 0, sizeof(vlan_mask));
    if (match->wc.masks.vlans[0].tci && match->flow.vlans[0].tci) {
        vlan_spec.tci = match->flow.vlans[0].tci & ~htons(VLAN_CFI);
        vlan_mask.tci = match->wc.masks.vlans[0].tci & ~htons(VLAN_CFI);

        /* match any protocols */
        vlan_mask.tpid = 0;

        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_VLAN,
                         &vlan_spec, &vlan_mask);
    }

    /* IP v4 */
    uint8_t proto = 0;
    struct rte_flow_item_ipv4 ipv4_spec;
    struct rte_flow_item_ipv4 ipv4_mask;
    memset(&ipv4_spec, 0, sizeof(ipv4_spec));
    memset(&ipv4_mask, 0, sizeof(ipv4_mask));
    if (match->flow.dl_type == htons(ETH_TYPE_IP)) {
        ipv4_spec.hdr.type_of_service = match->flow.nw_tos;
        ipv4_spec.hdr.time_to_live    = match->flow.nw_ttl;
        ipv4_spec.hdr.next_proto_id   = match->flow.nw_proto;
        ipv4_spec.hdr.src_addr        = match->flow.nw_src;
        ipv4_spec.hdr.dst_addr        = match->flow.nw_dst;

        ipv4_mask.hdr.type_of_service = match->wc.masks.nw_tos;
        ipv4_mask.hdr.time_to_live    = match->wc.masks.nw_ttl;
        ipv4_mask.hdr.next_proto_id   = match->wc.masks.nw_proto;
        ipv4_mask.hdr.src_addr        = match->wc.masks.nw_src;
        ipv4_mask.hdr.dst_addr        = match->wc.masks.nw_dst;

        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_IPV4,
                         &ipv4_spec, &ipv4_mask);

        /* Save proto for L4 protocol setup */
        proto = ipv4_spec.hdr.next_proto_id & ipv4_mask.hdr.next_proto_id;

        /* Remember proto mask address for later modification */
        ipv4_next_proto_mask = &ipv4_mask.hdr.next_proto_id;
    }

    if (proto != IPPROTO_ICMP && proto != IPPROTO_UDP &&
        proto != IPPROTO_SCTP && proto != IPPROTO_TCP &&
        (match->wc.masks.tp_src ||
         match->wc.masks.tp_dst ||
         match->wc.masks.tcp_flags)) {
        VLOG_DBG("L4 Protocol (%u) not supported", proto);
        ret = -1;
        goto out;
    }

    if ((match->wc.masks.tp_src && match->wc.masks.tp_src != OVS_BE16_MAX) ||
        (match->wc.masks.tp_dst && match->wc.masks.tp_dst != OVS_BE16_MAX)) {
        ret = -1;
        goto out;
    }

    struct rte_flow_item_tcp tcp_spec;
    struct rte_flow_item_tcp tcp_mask;
    memset(&tcp_spec, 0, sizeof(tcp_spec));
    memset(&tcp_mask, 0, sizeof(tcp_mask));
    if (proto == IPPROTO_TCP) {
        tcp_spec.hdr.src_port  = match->flow.tp_src;
        tcp_spec.hdr.dst_port  = match->flow.tp_dst;
        tcp_spec.hdr.data_off  = ntohs(match->flow.tcp_flags) >> 8;
        tcp_spec.hdr.tcp_flags = ntohs(match->flow.tcp_flags) & 0xff;

        tcp_mask.hdr.src_port  = match->wc.masks.tp_src;
        tcp_mask.hdr.dst_port  = match->wc.masks.tp_dst;
        tcp_mask.hdr.data_off  = ntohs(match->wc.masks.tcp_flags) >> 8;
        tcp_mask.hdr.tcp_flags = ntohs(match->wc.masks.tcp_flags) & 0xff;

        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_TCP,
                         &tcp_spec, &tcp_mask);

        /* proto == TCP and ITEM_TYPE_TCP, thus no need for proto match */
        if (ipv4_next_proto_mask) {
            *ipv4_next_proto_mask = 0;
        }
        goto end_proto_check;
    }

    struct rte_flow_item_udp udp_spec;
    struct rte_flow_item_udp udp_mask;
    memset(&udp_spec, 0, sizeof(udp_spec));
    memset(&udp_mask, 0, sizeof(udp_mask));
    if (proto == IPPROTO_UDP) {
        udp_spec.hdr.src_port = match->flow.tp_src;
        udp_spec.hdr.dst_port = match->flow.tp_dst;

        udp_mask.hdr.src_port = match->wc.masks.tp_src;
        udp_mask.hdr.dst_port = match->wc.masks.tp_dst;

        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_UDP,
                         &udp_spec, &udp_mask);

        /* proto == UDP and ITEM_TYPE_UDP, thus no need for proto match */
        if (ipv4_next_proto_mask) {
            *ipv4_next_proto_mask = 0;
        }
        goto end_proto_check;
    }

    struct rte_flow_item_sctp sctp_spec;
    struct rte_flow_item_sctp sctp_mask;
    memset(&sctp_spec, 0, sizeof(sctp_spec));
    memset(&sctp_mask, 0, sizeof(sctp_mask));
    if (proto == IPPROTO_SCTP) {
        sctp_spec.hdr.src_port = match->flow.tp_src;
        sctp_spec.hdr.dst_port = match->flow.tp_dst;

        sctp_mask.hdr.src_port = match->wc.masks.tp_src;
        sctp_mask.hdr.dst_port = match->wc.masks.tp_dst;

        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_SCTP,
                         &sctp_spec, &sctp_mask);

        /* proto == SCTP and ITEM_TYPE_SCTP, thus no need for proto match */
        if (ipv4_next_proto_mask) {
            *ipv4_next_proto_mask = 0;
        }
        goto end_proto_check;
    }

    struct rte_flow_item_icmp icmp_spec;
    struct rte_flow_item_icmp icmp_mask;
    memset(&icmp_spec, 0, sizeof(icmp_spec));
    memset(&icmp_mask, 0, sizeof(icmp_mask));
    if (proto == IPPROTO_ICMP) {
        icmp_spec.hdr.icmp_type = (uint8_t)ntohs(match->flow.tp_src);
        icmp_spec.hdr.icmp_code = (uint8_t)ntohs(match->flow.tp_dst);

        icmp_mask.hdr.icmp_type = (uint8_t)ntohs(match->wc.masks.tp_src);
        icmp_mask.hdr.icmp_code = (uint8_t)ntohs(match->wc.masks.tp_dst);

        add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_ICMP,
                         &icmp_spec, &icmp_mask);

        /* proto == ICMP and ITEM_TYPE_ICMP, thus no need for proto match */
        if (ipv4_next_proto_mask) {
            *ipv4_next_proto_mask = 0;
        }
        goto end_proto_check;
    }

end_proto_check:
    add_flow_pattern(&patterns, RTE_FLOW_ITEM_TYPE_END, NULL, NULL);

    struct rte_flow_action_mark mark;
    mark.id = info->flow_mark;
    add_flow_action(&actions, RTE_FLOW_ACTION_TYPE_MARK, &mark);

    struct rte_flow_action_rss *rss;
    rss = add_flow_rss_action(&actions, netdev);
    add_flow_action(&actions, RTE_FLOW_ACTION_TYPE_END, NULL);

    flow = rte_flow_create(dev->port_id, &flow_attr, patterns.items,
                           actions.actions, &error);
    free(rss);

    if (!flow) {
        VLOG_ERR("rte flow creat error: %u : message : %s\n",
                 error.type, error.message);
        ret = -1;
        goto out;
    }

    ufid_to_rte_flow_associate(ufid, flow);
    VLOG_DBG("installed flow %p by ufid "UUID_FMT"\n",
             flow, UUID_ARGS((struct uuid *)ufid));
Overview
This article describes the design and implementation of Open vSwitch* (OvS) with the Data Plane Development Kit (DPDK) (OvS-DPDK) datapath classifier, also known as dpcls. We recommend that you first read the introductory article where we introduce the high-level dpcls design, the Tuple-Space search (TSS) implementation and its performance optimization strategy, and a few other scenarios. Part 2 (this article) focuses on call graphs and the subroutines involved.
OvS-DPDK has three-tier look-up tables/caches. Incoming packets are first matched against Exact Match Cache (EMC) and in case of a miss are sent to the dpcls (megaflow cache). The dpcls is implemented as a tuple space search (TSS) that supports arbitrary bitwise matching on packet header fields. Packets that miss the dpcls are sent to the OpenFlow* pipeline, also known as ofproto classifier, configured by an SDN controller as depicted in Figure 1.
Figure 1. Open vSwitch* three-tier cache architecture.
Packet Reception Path
The poll mode driver (PMD) thread continuously polls the input ports in its poll list. From each port it can retrieve a burst of packets not exceeding 32 (NETDEV_MAX_BURST). Each input packet then can be classified based on the set of active flow rules. The purpose of classification is to find a matching flow so that the packet can be processed accordingly. Packets are grouped per flow and each group will be processed to execute the specified actions. Figure 2 shows the stages of packet processing.
Figure 2. Packet-processing stages.
Flows are defined by the dp_netdev_flow data structure (not to be confused with struct flow) and are stored in a hash table called flow_table. Some of the information stored in a flow includes:
- Rule
- Actions
- Statistics
- Batch (queue for processing the packets that matched this flow)
- Thread ID (owning this flow)
- Reference count
Often the words “flows” and “rules” are used interchangeably; however, note that the rule is part of the flow.
The entries stored in the dpcls subtables are {rule, flow pointer} couples. The dpcls classification can then be summarized as a two-step process where:
- A matching rule is found from the subtables after lookup.
- The actions of the corresponding flow (from the flow pointer) are then executed.
Figure 3 shows a call graph for packet classification and action execution.
Figure 3. Lookup, batching, and action execution call graph.
When a packet is received, the packet header fields get extracted and stored into a “miniflow” data structure as part of EMC processing. The miniflow is a compressed representation of the packet that helps in reducing the memory footprint and cache lines.
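A miniflow can be pictured as a presence bitmap plus a packed array of only the populated values. Below is a toy version of that idea (not the real struct miniflow layout; demo_miniflow is an invented name and the code uses the GCC/Clang __builtin_popcountll):

#include <stdint.h>
#include <stdio.h>

/* Toy "miniflow": a presence bitmap plus only the populated 32-bit words,
 * instead of a full, mostly-zero flow structure. */
struct demo_miniflow {
    uint64_t map;          /* bit i set => word i of the full flow is present */
    uint32_t values[8];    /* packed non-zero words, in bit order */
};

/* Read word 'idx' of the (conceptual) full flow from the compressed form. */
static uint32_t demo_miniflow_get(const struct demo_miniflow *mf, unsigned idx)
{
    if (!(mf->map & (UINT64_C(1) << idx))) {
        return 0;                              /* absent => zero */
    }
    /* position in 'values' = number of set bits below idx */
    unsigned pos = __builtin_popcountll(mf->map & ((UINT64_C(1) << idx) - 1));
    return mf->values[pos];
}

int main(void)
{
    /* only words 1 and 4 are populated */
    struct demo_miniflow mf = { (1u << 1) | (1u << 4), { 0xc0a80001, 0x00500035 } };
    printf("word 4 = 0x%08x, word 2 = 0x%08x\n",
           (unsigned) demo_miniflow_get(&mf, 4),
           (unsigned) demo_miniflow_get(&mf, 2));
    return 0;
}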
Various actions can be executed on the classified packet, such as forwarding it to a certain port. Other actions are available, for example adding a VLAN tag, dropping the packet, or sending the packet to the Connection Tracker module.
EMC Call Graph
EMC is implemented as a hash table where the match must occur exactly on the whole miniflow structure. The hash value can be pre-computed by the network interface card (NIC) when RSS mode is enabled; otherwise the hash is computed in software using miniflow_hash_5tuple().
Figure 4. EMC processing call graph.
In emc_processing(), after the miniflow extraction and hash computation, a lookup is performed in EMC for a match. When a valid entry is found, an exact match check will be carried out to compare that entry with the miniflow extracted from the packet. In case of a miss, the check is repeated on the next linked entry, if any. If no match is found, the packet will be later checked against the dpcls, the second level cache. Figure 4 depicts the call graph of packet processing in exact match cache.
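A self-contained sketch of the EMC lookup logic described above (simplified to a flat table with two hash probes in the spirit of EM_FLOW_HASH_SEGS; the entry layout and the way the second probe is derived are invented for illustration):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy EMC: each packet hash probes two positions, and a hit requires an
 * exact compare of the stored key against the packet's key. */
#define EMC_ENTRIES 8192
#define EMC_WAYS    2

struct demo_key { uint32_t hash; uint8_t bytes[16]; };   /* stand-in miniflow */
struct demo_emc_entry { bool used; struct demo_key key; const char *flow; };

static struct demo_emc_entry emc[EMC_ENTRIES];

static const char *demo_emc_lookup(const struct demo_key *k)
{
    uint32_t h = k->hash;
    for (int seg = 0; seg < EMC_WAYS; seg++) {
        struct demo_emc_entry *e = &emc[h & (EMC_ENTRIES - 1)];
        if (e->used && e->key.hash == k->hash
            && !memcmp(e->key.bytes, k->bytes, sizeof k->bytes)) {
            return e->flow;                    /* exact match */
        }
        h >>= 13;                              /* toy "next hash segment" */
    }
    return NULL;                               /* miss: fall through to dpcls */
}

int main(void)
{
    struct demo_key k = { 0xdeadbeef, { 1, 2, 3 } };
    emc[k.hash & (EMC_ENTRIES - 1)] = (struct demo_emc_entry) { true, k, "flow A" };
    const char *hit = demo_emc_lookup(&k);
    printf("lookup: %s\n", hit ? hit : "miss");
    return 0;
}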
Datapath Classifier Call Graph
Figure 5 shows the main dpcls functions involved in the overall classification and batching processes.
Figure 5. dpcls call graph.
As described in our introductory article, a dpcls lookup is performed by traversing all of its subtables. For each subtable, the miniflow extracted from the packet is used to derive a search key in conjunction with the subtable mask. See the netdev_flow_key_hash_in_mask() function in Figure 6, where the hash functions are also shown.
Figure 6. dpcls lookup call graph.
A dpcls lookup involves an iterative search over multiple subtables until a match is found. On a miss, the ofproto table will be consulted. On a hit, the retrieved value is a pointer to the matching flow, which determines the actions to be performed for the packet. A lookup on a subtable involves the following operations (a short sketch follows the list):
- The subtable mask will determine the subset of the packet miniflow fields to consider. The mask is then applied on the subset of fields to obtain a search-key.
- A hash value is computed on the search-key to point to a subtable entry.
- If an entry is found, due to the possibility of collisions (as is the behavior in hash table data structures), a check has to be performed to determine whether or not that entry matches with the search key. If it doesn’t match, the check is repeated on the chained list of entries, if any.
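The three steps reduce to: key = fields & subtable_mask, hash the key into the subtable, then confirm the masked match. A self-contained sketch with one subtable and a single 32-bit field (the linear scan stands in for the real hash-indexed cmap; all names are invented):

#include <stdint.h>
#include <stdio.h>

/* Toy wildcard rule: all rules in one "subtable" share the same mask. */
struct demo_rule { uint32_t match_bits; const char *flow; };
struct demo_subtable {
    uint32_t mask;                 /* shared by every rule in this subtable */
    struct demo_rule rules[4];
    int n_rules;
};

static const char *demo_subtable_lookup(const struct demo_subtable *st,
                                        uint32_t pkt_src_ip)
{
    uint32_t key = pkt_src_ip & st->mask;       /* 1. apply the subtable mask */
    /* 2. the real dpcls hashes 'key' into the subtable's cmap; we just scan */
    for (int i = 0; i < st->n_rules; i++) {
        if (st->rules[i].match_bits == key) {   /* 3. confirm the masked match */
            return st->rules[i].flow;
        }
    }
    return NULL;                                /* try the next subtable */
}

int main(void)
{
    /* "Src IP = 21.2.10.*"  ->  mask 255.255.255.0 */
    struct demo_subtable st = { 0xffffff00, { { 0x15020a00, "megaflow A" } }, 1 };
    const char *hit  = demo_subtable_lookup(&st, 0x15020a58);  /* 21.2.10.88 */
    const char *miss = demo_subtable_lookup(&st, 0x15020b01);  /* 21.2.11.1  */
    printf("21.2.10.88 -> %s\n", hit ? hit : "miss");
    printf("21.2.11.1  -> %s\n", miss ? miss : "miss");
    return 0;
}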
Due to their wildcarded nature, the entries stored in dpcls are also referred to as “megaflows,” because an entry can be matched by “several” packets. For example with “Src IP = 21.2.10.*”, incoming packets with Src IPs “21.2.10.1” or “21.2.10.2” or “21.2.10.88” will match the above rule. On the other hand, the entries stored in EMC are referred to as “microflows” (not to be confused with the miniflow data structure).
Note that the term “fast path” comes from the vanilla OvS kernel datapath implementation where the two-level caches are located in kernel space. “Slow path” refers to the user-space datapath that involves the ofproto table and requires a context switch.
Subtables Creation and Destruction
The order of subtables in the datapath classifier is random, and the subtables can be created and destroyed at runtime. Each subtable can collect rules with a specific predefined mask, see the introductory article for more details. When a new rule has to be inserted into the classifier, all the existing subtables are traversed until a suitable subtable matching the rule mask is found. Otherwise a new subtable will be created to store the new rule. The subtable creation call graph is depicted in Figure 7.
Figure 7. Rule insertion and subtable creation call graph.
When the last rule in a subtable is deleted, the subtable becomes empty and can be destroyed. The subtable deletion call graph is depicted in Figure 8.
Figure 8. Subtable deletion call graph.
Slow Path Call Graph
On a dpcls lookup miss, the packet will be classified by the ofproto table, see Figure 9.
An “upcall” is triggered by means of dp_netdev_upcall(). The reply from the ofproto layer contains all the information about the packet classification. In addition to the execution of actions, a learning mechanism is activated: a new flow is stored, a new wildcarded rule is inserted into the dpcls, and an exact-match rule is inserted into the EMC. This way the two-layer caches can directly handle similar packets in the future.
Figure 9. Upcall processing call graph.
Packet Batching
Packet classification categorizes packets with the active flows. The set of received packets is divided into groups depending on the flows that were matched. Each group is enqueued into a batch specific to the flow, as depicted in Figure 10.
Figure 10. Packets are grouped depending on the matching flow.
All packets enqueued into the same batch will be processed with the list of actions defined by that flow. To improve the packet forwarding performance, the packets belonging to the same flow are batched and processed together.
In some cases there could be very few packets in a batch. In a worst case scenario each packet of the fetched set is hitting a distinct flow, so each batch will enqueue only one packet. That becomes particularly inefficient when it comes to transmitting fewer packets over the DPDK interfaces as it incurs expensive MMIO writes. In order to optimize the MMIO transactions and improve the performance, an intermediate queue is implemented. Enqueued packets are transmitted when their count is greater than or equal to 32. Figure 11 depicts how the packet batching is triggered from emc_processing() and fast_path_processing().
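A sketch of the output-batching idea: packets accumulate in a per-port intermediate queue and are flushed to the device once the count reaches the burst size (32 here, mirroring NETDEV_MAX_BURST), with a forced flush picking up the leftovers. The queue structure below is invented for illustration; the real flush ultimately calls rte_eth_tx_burst().

#include <stdio.h>

#define DEMO_BURST 32   /* mirrors NETDEV_MAX_BURST */

/* Toy per-output-port TX queue: buffer packets, flush in bursts to amortize
 * the expensive MMIO doorbell write on the DPDK device. */
struct demo_txq { int count; };

static void demo_flush(struct demo_txq *q, const char *why)
{
    if (q->count) {
        printf("flush %d packets (%s)\n", q->count, why);
        q->count = 0;               /* real code: transmit the burst here */
    }
}

static void demo_enqueue(struct demo_txq *q)
{
    if (++q->count >= DEMO_BURST) {
        demo_flush(q, "burst full");
    }
}

int main(void)
{
    struct demo_txq q = { 0 };
    for (int i = 0; i < 70; i++) {  /* 70 packets -> two full bursts + leftovers */
        demo_enqueue(&q);
    }
    demo_flush(&q, "end of poll iteration");
    return 0;
}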
Figure 11. Packet batching call graph.
Action Execution Call Graph
The action execution is done on a batch basis. A typical example of an action can be to forward the packets on the output interface. Figure 12 depicts the case where the packets are sent on the output port by calling netdev_send().
Figure 12. Action execution call graph.
Example of Packet Forwarding to a Physical Port
Figure 13. Overall call graph in case of a forwarding to a physical port.
The call graph in Figure 13 depicts an overall picture of the packet path from ingress until the action execution stage, wherein the packet of interest is forwarded to a certain output physical port.
Commands:
ovs-ofctl dump-flows br0
ovs-ofctl del-flows br0
ovs-ofctl -O openflow13 dump-flows s1
ovs-ofctl add-flow br0 "priority=0,actions=NORMAL"
https://software.intel.com/content/www/us/en/develop/articles/ovs-dpdk-datapath-classifier-part-2.html
OVS-DPDK DataPath Classifier Reverse Engineering
https://www.sdnlab.com/20901.html
A Brief Look at the Open vSwitch Implementation (Part 1)
https://zhuanlan.zhihu.com/p/66216907
Open vSwitch Source Code Reading Notes
https://www.sdnlab.com/18668.html