[dpdk] Getting Familiar with the SDK and Initial Usage (3) (IP Fragmentation Source Code Analysis)


Getting familiar with the IP Fragmentation example: using it and analyzing its source code.

Functionality:

  This example does two things:

    1. IP fragmentation: splits IP packets into fragments.

    2. Packet forwarding according to a routing table. The routing table is as follows:

IP_FRAG: Socket 1: adding route 100.10.0.0/16 (port 0)
IP_FRAG: Socket 1: adding route 100.20.0.0/16 (port 1)
IP_FRAG: Socket 1: adding route 100.30.0.0/16 (port 2)
IP_FRAG: Socket 1: adding route 100.40.0.0/16 (port 3)
IP_FRAG: Socket 1: adding route 100.50.0.0/16 (port 4)
IP_FRAG: Socket 1: adding route 100.60.0.0/16 (port 5)
IP_FRAG: Socket 1: adding route 100.70.0.0/16 (port 6)
IP_FRAG: Socket 1: adding route 100.80.0.0/16 (port 7)
IP_FRAG: Socket 1: adding route 0101:0101:0101:0101:0101:0101:0101:0101/48 (port 0)
IP_FRAG: Socket 1: adding route 0201:0101:0101:0101:0101:0101:0101:0101/48 (port 1)
IP_FRAG: Socket 1: adding route 0301:0101:0101:0101:0101:0101:0101:0101/48 (port 2)
IP_FRAG: Socket 1: adding route 0401:0101:0101:0101:0101:0101:0101:0101/48 (port 3)
IP_FRAG: Socket 1: adding route 0501:0101:0101:0101:0101:0101:0101:0101/48 (port 4)
IP_FRAG: Socket 1: adding route 0601:0101:0101:0101:0101:0101:0101:0101/48 (port 5)
IP_FRAG: Socket 1: adding route 0701:0101:0101:0101:0101:0101:0101:0101/48 (port 6)
IP_FRAG: Socket 1: adding route 0801:0101:0101:0101:0101:0101:0101:0101/48 (port 7)
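
  These routes are installed into LPM (longest-prefix-match) tables (the startup log later shows "Creating LPM table" / "Creating LPM6 table"), and each packet's destination address is looked up there to pick the output port. Below is a minimal sketch of such an IPv4 lookup, my own simplification rather than the example's exact code; API names are from DPDK 16.07 (rte_lpm_lookup's next-hop type differs between DPDK versions), and the miss case falls back to the input port, which is the example's default behaviour discussed in Question 5.

/*
 * Minimal sketch (my own, not the example's exact code): choose an output
 * port for an IPv4 packet by longest-prefix-match on its destination address.
 */
#include <rte_byteorder.h>
#include <rte_ip.h>
#include <rte_lpm.h>

static inline uint8_t
route_ipv4(struct rte_lpm *lpm, struct ipv4_hdr *ip_hdr, uint8_t port_in)
{
        uint32_t next_hop;

        /* dst_addr is in network byte order; the LPM table stores host order */
        if (rte_lpm_lookup(lpm, rte_be_to_cpu_32(ip_hdr->dst_addr), &next_hop) == 0)
                return (uint8_t)next_hop;  /* matched route: next hop is the port */

        return port_in;                    /* no route: send back out the input port */
}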

 

 

Question 1:

  The main() function looks roughly like this; the remote-launch call and the wait-lcore loop are what the discussion below is about:

int
main(int argc, char **argv)
{
        ... ...
        /* init EAL */
        ret = rte_eal_init(argc, argv);
        if (ret < 0)
                rte_exit(EXIT_FAILURE, "rte_eal_init failed");
        ... ...

        /* launch per-lcore init on every lcore */
        rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
        RTE_LCORE_FOREACH_SLAVE(lcore_id) {
                if (rte_eal_wait_lcore(lcore_id) < 0)
                        return -1;
        }

        return 0;
}

  The implementation of rte_eal_wait_lcore() is as follows:

/*
 * Wait until a lcore finished its job.                                                                                                                                
 */                           
int
rte_eal_wait_lcore(unsigned slave_id)                                                                                                                                  
{
        if (lcore_config[slave_id].state == WAIT)
                return 0;     

        while (lcore_config[slave_id].state != WAIT &&
               lcore_config[slave_id].state != FINISHED);                                                                                                            

        rte_rmb();

        /* we are in finished state, go to wait state */
        lcore_config[slave_id].state = WAIT;
        return lcore_config[slave_id].ret;                                                                                                                             
}

  Look at the while loop: it is clearly a busy-wait loop! Taken literally, after main() has done the remote launch, the master lcore waits in this function for the slave lcores to finish. But wouldn't waiting in a busy loop like this be a problem? So I wanted to debug it and see what actually happens. To get there, I went through Questions 2, 3 and 4 below, and finally managed to debug it. The answer follows:

  The answer turns out to be simple: just read the code of rte_eal_mp_remote_launch(). Here it is:

 66 /*
 67  * Check that every SLAVE lcores are in WAIT state, then call
 68  * rte_eal_remote_launch() for all of them. If call_master is true
 69  * (set to CALL_MASTER), also call the function on the master lcore.
 70  */
 71 int
 72 rte_eal_mp_remote_launch(int (*f)(void *), void *arg,
 73                          enum rte_rmt_call_master_t call_master)
 74 {
 75         int lcore_id;
 76         int master = rte_get_master_lcore();
 77 
 78         /* check state of lcores */
 79         RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 80                 if (lcore_config[lcore_id].state != WAIT)
 81                         return -EBUSY;
 82         }
 83 
 84         /* send messages to cores */
 85         RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 86                 rte_eal_remote_launch(f, arg, lcore_id);
 87         }
 88 
 89         if (call_master == CALL_MASTER) {
 90                 lcore_config[master].ret = f(arg);
 91                 lcore_config[master].state = FINISHED;
 92         }
 93 
 94         return 0;
 95 }

  Line 90 is the key. The master lcore enters the business logic right there, so it has no chance to reach the busy-wait loop until the program is about to exit. In other words, by the time the master does enter that loop, the other lcores are already about to finish, so there is no long-term CPU spinning. Still, what if the business logic is buggy and a slave lcore does not exit as expected? Would the master then spin in that loop? I will leave that question open for now.

  Another thing worth writing down: everything that is needed, in particular the per-lcore threads, is already created inside rte_eal_init(); rte_eal_mp_remote_launch() merely sends each of the other lcores a message telling it to start running. I have not looked into the exact content of that message yet.
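
  To make that picture concrete, here is a conceptual sketch, entirely my own and with invented names, of the pattern "create the worker threads at init time, then just send them a start message": the worker blocks on a pipe, and launching is nothing more than writing the function pointer and its argument into that pipe. DPDK's real mechanism lives inside the EAL and differs in detail.

/*
 * Conceptual sketch only: my own model of "pass a start message to a
 * pre-created worker thread". NOT the actual EAL implementation; all names
 * below are invented for illustration. Build with: gcc -pthread.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct launch_request {
        int (*fn)(void *);
        void *arg;
};

static int pipefd[2];                    /* launcher -> worker channel */

static void *
worker_loop(void *unused)
{
        struct launch_request req;

        (void)unused;
        /* created at init time, the worker sleeps here until a message arrives */
        while (read(pipefd[0], &req, sizeof(req)) == (ssize_t)sizeof(req))
                req.fn(req.arg);         /* run the job the launcher asked for */
        return NULL;
}

static void
remote_launch(int (*fn)(void *), void *arg)
{
        struct launch_request req = { fn, arg };

        /* the "message" is nothing more than the function and its argument */
        if (write(pipefd[1], &req, sizeof(req)) != (ssize_t)sizeof(req))
                perror("write");
}

static int
hello_job(void *arg)
{
        (void)arg;
        printf("worker: running the launched job\n");
        return 0;
}

int
main(void)
{
        pthread_t worker;

        if (pipe(pipefd) != 0)
                return 1;
        pthread_create(&worker, NULL, worker_loop, NULL);  /* "init" step */
        remote_launch(hello_job, NULL);                    /* "remote launch" step */
        sleep(1);                                          /* let the worker print */
        close(pipefd[1]);                                  /* makes worker_loop's read return 0 */
        pthread_join(worker, NULL);
        return 0;
}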

 

Question 2:

  The example would not run, so I enabled DEBUG and traced it with gdb.

The makefile system is also quite hard to use. After poking around, a few relatively useful commands are:

[root@dpdk dpdk]# make help
[root@dpdk dpdk]# make V=yes D=yes

 The commands above did not actually help. What worked was going into each module's Makefile, changing -O3 to -g by hand, and rebuilding.
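
  (As far as I know, the DPDK build system also accepts extra flags on the make command line, e.g. make EXTRA_CFLAGS='-O0 -g', which should achieve the same thing without editing each Makefile; I did not verify this here.)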

 

Question 3:

  gdb showed that the failure to start was related to NIC features.

  a. The default port configuration enables offload features such as hardware checksum. Since my emulated NIC does not support them, they had to be turned off:

static const struct rte_eth_conf port_conf = { 
        .rxmode = {
                .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE,
                .split_hdr_size = 0,           
                .header_split   = 0, /**< Header Split disabled */
                .hw_ip_checksum = 0, /**< IP checksum offload disabled */
                .hw_vlan_filter = 0, /**< VLAN filtering disabled */
                .jumbo_frame    = 1, /**< Jumbo Frame Support enabled */
                .hw_strip_crc   = 0, /**< CRC stripping disabled */
        },                    
        .txmode = {           
                .mq_mode = ETH_MQ_TX_NONE,                                                                                                                             
        },
};

  b. Another modification, in the per-(lcore, port) TX queue setup: override txq_flags so that TX checksum offloads are disabled:

               /* init one TX queue per couple (lcore,port) */
                queueid = 0;
                for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
                        if (rte_lcore_is_enabled(lcore_id) == 0)
                                continue;

                        socket = (int) rte_lcore_to_socket_id(lcore_id);
                        printf("txq=%u,%d ", lcore_id, queueid);
                        fflush(stdout);

                        rte_eth_dev_info_get(portid, &dev_info);
                        txconf = &dev_info.default_txconf;
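                        /* disable all TX checksum offloads on this queue;
                         * the emulated NIC cannot provide them in hardware */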
                        txconf->txq_flags = 0 | ETH_TXQ_FLAGS_NOXSUMS;
                        ret = rte_eth_tx_queue_setup(portid, queueid, nb_txd,
                                                     socket, txconf);
                        if (ret < 0) {
                                printf("\n");
                                rte_exit(EXIT_FAILURE, "rte_eth_tx_queue_setup: "
                                        "err=%d, port=%d\n", ret, portid);
                        }

                        qconf = &lcore_queue_conf[lcore_id];
                        qconf->tx_queue_id[portid] = queueid;
                        queueid++;
                }

  c. The NIC I had been emulating did not support multiple queues. After some study I got qemu/kvm to emulate a multi-queue NIC, and wrote that up in a separate post:

    [Virtualization][qemu][kvm][virtio] Emulating a multi-queue NIC with QEMU/KVM

  Successful startup:

[root@dpdk build]# ./ip_fragmentation -l 6,7 -- -p 3
EAL: Detected 8 lcore(s)
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
PMD: bnxt_rte_pmd_init() called for (null)
EAL: PCI device 0000:00:03.0 on NUMA socket -1
EAL:   probe driver: 1af4:1000 rte_virtio_pmd
EAL: PCI device 0000:00:04.0 on NUMA socket -1
EAL:   probe driver: 1af4:1000 rte_virtio_pmd
EAL: PCI device 0000:00:05.0 on NUMA socket -1
EAL:   probe driver: 1af4:1000 rte_virtio_pmd
IP_FRAG: Creating direct mempool on socket 1
IP_FRAG: Creating indirect mempool on socket 1
IP_FRAG: Creating LPM table on socket 1
IP_FRAG: Creating LPM6 table on socket 1
Initializing port 0 on lcore 6... Address:00:00:00:01:00:01
txq=6,0 txq=7,1 
Initializing port 1 on lcore 7... Address:00:00:00:01:00:02
txq=6,0 txq=7,1 

IP_FRAG: Socket 1: adding route 100.10.0.0/16 (port 0)
IP_FRAG: Socket 1: adding route 100.20.0.0/16 (port 1)
IP_FRAG: Socket 1: adding route 100.30.0.0/16 (port 2)
IP_FRAG: Socket 1: adding route 100.40.0.0/16 (port 3)
IP_FRAG: Socket 1: adding route 100.50.0.0/16 (port 4)
IP_FRAG: Socket 1: adding route 100.60.0.0/16 (port 5)
IP_FRAG: Socket 1: adding route 100.70.0.0/16 (port 6)
IP_FRAG: Socket 1: adding route 100.80.0.0/16 (port 7)
IP_FRAG: Socket 1: adding route 0101:0101:0101:0101:0101:0101:0101:0101/48 (port 0)
IP_FRAG: Socket 1: adding route 0201:0101:0101:0101:0101:0101:0101:0101/48 (port 1)
IP_FRAG: Socket 1: adding route 0301:0101:0101:0101:0101:0101:0101:0101/48 (port 2)
IP_FRAG: Socket 1: adding route 0401:0101:0101:0101:0101:0101:0101:0101/48 (port 3)
IP_FRAG: Socket 1: adding route 0501:0101:0101:0101:0101:0101:0101:0101/48 (port 4)
IP_FRAG: Socket 1: adding route 0601:0101:0101:0101:0101:0101:0101:0101/48 (port 5)
IP_FRAG: Socket 1: adding route 0701:0101:0101:0101:0101:0101:0101:0101/48 (port 6)
IP_FRAG: Socket 1: adding route 0801:0101:0101:0101:0101:0101:0101:0101/48 (port 7)

Checking link status
done
Port 0 Link Up - speed 10000 Mbps - full-duplex
Port 1 Link Up - speed 10000 Mbps - full-duplex
IP_FRAG: entering main loop on lcore 7
IP_FRAG:  -- lcoreid=7 portid=1
IP_FRAG: entering main loop on lcore 6
IP_FRAG:  -- lcoreid=6 portid=0

 

Question 4:

  How do you see the compiler options and the static libraries being used, change the compile options, enable debugging, and so on? The only way is through the makefiles. Their structure is fairly clear, but they still take quite a long time to read.

  To print the compile commands:

  Modify the C_TO_O_DO variable in mk/internal/rte.compile-pre.mk (under /sdk/@dpdk/dpdk-stable-16.07.1); line 101 is the newly added line.

 99 C_TO_O_DO = @set -e; \     
100         echo $(C_TO_O_DISP); \
101         echo $(C_TO_O); \
102         $(C_TO_O) && \     
103         $(PMDINFO_TO_O) && \
104         echo $(C_TO_O_CMD) > $(call obj2cmd,$(@)) && \
105         sed 's,'$@':,dep_'$@' =,' $(call obj2dep,$(@)).tmp > $(call obj2dep,$(@)) && \                                                                             
106         rm -f $(call obj2dep,$(@)).tmp
107 

  To print the link command:

  Modify the O_TO_EXE_DO variable in mk/rte.app.mk (under /sdk/@dpdk/dpdk-stable-16.07.1); line 209 is the newly added line.

207 O_TO_EXE_DO = @set -e; \
208         echo $(O_TO_EXE_DISP); \
209         echo $(O_TO_EXE); \
210         $(O_TO_EXE) && \
211         echo $(O_TO_EXE_CMD) > $(call exe2cmd,$(@))
212 

  The result looks like this:

[root@dpdk ip_fragmentation]# make
echo "xxxxccccxxxx"
xxxxccccxxxx
  CC main.o
gcc -Wp,-MD,./.main.o.d.tmp -m64 -pthread -march=native -DRTE_MACHINE_CPUFLAG_SSE -DRTE_MACHINE_CPUFLAG_SSE2 -DRTE_MACHINE_CPUFLAG_SSE3 -DRTE_MACHINE_CPUFLAG_SSSE3 -DRTE_MACHINE_CPUFLAG_SSE4_1 -DRTE_MACHINE_CPUFLAG_SSE4_2 -I/root/src/sdk/@dpdk/dpdk-stable-16.07.1/examples/ip_fragmentation/build/include -I/root/dpdk//x86_64-native-linuxapp-gcc/include -include /root/dpdk//x86_64-native-linuxapp-gcc/include/rte_config.h -g -W -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wpointer-arith -Wcast-align -Wnested-externs -Wcast-qual -Wformat-nonliteral -Wformat-security -Wundef -Wwrite-strings -Wno-return-type -o main.o -c /root/src/sdk/@dpdk/dpdk-stable-16.07.1/examples/ip_fragmentation/main.c
  LD ip_fragmentation
gcc -o ip_fragmentation -m64 -pthread -march=native -DRTE_MACHINE_CPUFLAG_SSE -DRTE_MACHINE_CPUFLAG_SSE2 -DRTE_MACHINE_CPUFLAG_SSE3 -DRTE_MACHINE_CPUFLAG_SSSE3 -DRTE_MACHINE_CPUFLAG_SSE4_1 -DRTE_MACHINE_CPUFLAG_SSE4_2 -I/root/src/sdk/@dpdk/dpdk-stable-16.07.1/examples/ip_fragmentation/build/include -I/root/dpdk//x86_64-native-linuxapp-gcc/include -include /root/dpdk//x86_64-native-linuxapp-gcc/include/rte_config.h -g -W -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wold-style-definition -Wpointer-arith -Wcast-align -Wnested-externs -Wcast-qual -Wformat-nonliteral -Wformat-security -Wundef -Wwrite-strings main.o -L/root/dpdk//x86_64-native-linuxapp-gcc/lib -Wl,-lrte_kni -Wl,-lrte_pipeline -Wl,-lrte_table -Wl,-lrte_port -Wl,-lrte_pdump -Wl,-lrte_distributor -Wl,-lrte_reorder -Wl,-lrte_ip_frag -Wl,-lrte_meter -Wl,-lrte_sched -Wl,-lrte_lpm -Wl,--whole-archive -Wl,-lrte_acl -Wl,--no-whole-archive -Wl,-lrte_jobstats -Wl,-lrte_power -Wl,--whole-archive -Wl,-lrte_timer -Wl,-lrte_hash -Wl,-lrte_vhost -Wl,-lrte_kvargs -Wl,-lrte_mbuf -Wl,-lethdev -Wl,-lrte_cryptodev -Wl,-lrte_mempool -Wl,-lrte_ring -Wl,-lrte_eal -Wl,-lrte_cmdline -Wl,-lrte_cfgfile -Wl,-lrte_pmd_bond -Wl,-lrte_pmd_af_packet -Wl,-lrte_pmd_bnxt -Wl,-lrte_pmd_cxgbe -Wl,-lrte_pmd_e1000 -Wl,-lrte_pmd_ena -Wl,-lrte_pmd_enic -Wl,-lrte_pmd_fm10k -Wl,-lrte_pmd_i40e -Wl,-lrte_pmd_ixgbe -Wl,-lrte_pmd_null -Wl,-lrte_pmd_ring -Wl,-lrte_pmd_virtio -Wl,-lrte_pmd_vhost -Wl,-lrte_pmd_vmxnet3_uio -Wl,-lrte_pmd_null_crypto -Wl,--no-whole-archive -Wl,-lrt -Wl,-lm -Wl,-ldl -Wl,-export-dynamic -Wl,-export-dynamic -L/root/src/sdk/@dpdk/dpdk-stable-16.07.1/examples/ip_fragmentation/build/lib -L/root/dpdk//x86_64-native-linuxapp-gcc/lib -Wl,--as-needed -Wl,-Map=ip_fragmentation.map -Wl,--cref
  INSTALL-APP ip_fragmentation
  INSTALL-MAP ip_fragmentation.map
[root@dpdk ip_fragmentation]# 

 

 Question 5:

  After starting the example, I ran a packet-sending test and found that every packet was forwarded back out of the port it came in on. The code shows that this is indeed the default routing behaviour: when no route matches, the packet is sent back out of the source port.

  a. How to send packets conveniently? Besides the familiar tcpreplay, which replays packets onto a port unchanged, tcpreplay-edit can rewrite parts of a packet before sending it. That is how I tested the example's routing table:

/home/tong/Data [tong@T7] [11:24]
> sudo tcpreplay-edit -i tap-dpdk-1 -D 0.0.0.0/0:100.20.0.0/16 --enet-dmac=00:00:00:01:00:01 -L1 oicq-bak.pcap
Actual: 2 packets (162 bytes) sent in 30.01 seconds.
Rated: 5.3 Bps, 0.000 Mbps, 0.06 pps
Flows: 1 flows, 0.03 fps, 2 flow packets, 0 non-flow
Statistics for network device: tap-dpdk-1
        Attempted packets:         2
        Successful packets:        2
        Failed packets:            0
        Truncated packets:         0
        Retried packets (ENOBUFS): 0
        Retried packets (EAGAIN):  0

  b. Debugging shows that a received mbuf looks like this (see the gdb dump below).

  The packet_type field (with its l2_type/l3_type/l4_type sub-fields) describes the packet type. Clearly, at this point the program has not recognized whether the packet is IPv4 or IPv6. The forwarding logic later checks the IP type; with these values it decides the packet is neither v4 nor v6, falls through to the default route, and sends it back out of the source port.

  So far I do not know why the packet type was not recognized. There are three possibilities: 1. the code relies on the hardware to classify packets, and my emulated NIC cannot do it; 2. the PMD is supposed to do it, but because I am running in a virtual machine it does not; 3. the example code is wrong and should be calling some classification function somewhere.

  In any case, until I have fully understood the normal packet-processing flow and logic, this question is set aside.

(gdb) p *m
$26 = {cacheline0 = 0x7fffd4dd4300, buf_addr = 0x7fffd4dd4380, buf_physaddr = 2059223936, buf_len = 2176, rearm_data = 0x7fffd4dd4312 "\216", data_off = 142, {
    refcnt_atomic = {cnt = 1}, refcnt = 1}, nb_segs = 1 '\001', port = 0 '\000', ol_flags = 0, rx_descriptor_fields1 = 0x7fffd4dd4320, {packet_type = 0, {
      l2_type = 0, l3_type = 0, l4_type = 0, tun_type = 0, inner_l2_type = 0, inner_l3_type = 0, inner_l4_type = 0}}, pkt_len = 67, data_len = 67, vlan_tci = 0, 
  hash = {rss = 0, fdir = {{{hash = 0, id = 0}, lo = 0}, hi = 0}, sched = {lo = 0, hi = 0}, usr = 0}, seqn = 0, vlan_tci_outer = 0, cacheline1 = 0x7fffd4dd4340, {
    userdata = 0x0, udata64 = 0}, pool = 0x7fffd60436c0, next = 0x0, {tx_offload = 0, {l2_len = 0, l3_len = 0, l4_len = 0, tso_segsz = 0, outer_l3_len = 0, 
      outer_l2_len = 0}}, priv_size = 0, timesync = 0}
(gdb) 
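
  For reference, the example's forwarding path decides between v4 and v6 by testing m->packet_type (the RTE_ETH_IS_IPV4_HDR / RTE_ETH_IS_IPV6_HDR macros, if I read it correctly), so if the PMD leaves the field at 0 neither branch matches. A minimal software fallback I sketched for myself, assuming DPDK 16.07 names and definitely not the example's own fix, would classify from the Ethernet header instead:

/*
 * My own fallback sketch, not the example's code: derive the L3 type from the
 * Ethernet header when the PMD leaves m->packet_type at 0. Names are from
 * DPDK 16.07 headers (rte_ether.h / rte_mbuf.h).
 */
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

static inline void
fixup_packet_type(struct rte_mbuf *m)
{
        struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);

        if (m->packet_type != RTE_PTYPE_UNKNOWN)
                return;                               /* PMD already classified it */

        switch (rte_be_to_cpu_16(eth->ether_type)) {
        case ETHER_TYPE_IPv4:
                m->packet_type = RTE_PTYPE_L3_IPV4;
                break;
        case ETHER_TYPE_IPv6:
                m->packet_type = RTE_PTYPE_L3_IPV6;
                break;
        default:
                break;                                /* leave it unknown */
        }
}

  Whether calling something like this right after rte_eth_rx_burst() is the right fix, or whether the virtio PMD is simply expected to fill packet_type in itself, is exactly the open question above.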

 

  c. I modified qemu to emulate hardware checksum, but it still did not work. Going through the code in the debugger, it looks as if the virtio PMD driver simply does not support hardware checksum offload. The new qemu command is below (note the csum=on,guest_csum=on options on the virtio-net-pci devices). I have not found a way to confirm from inside the guest, just by querying, whether the NIC supports checksum offload.

/home/tong/VM/dpdk [tong@T7] [19:06]
> cat start-multiqueue.sh 
sudo qemu-system-x86_64 -nographic -vnc 127.0.0.1:1 -enable-kvm \
        -m 2G -cpu Nehalem -smp cores=2,threads=2,sockets=2 \
        -numa node,mem=1G,cpus=0-3,nodeid=0 \
        -numa node,mem=1G,cpus=4-7,nodeid=1 \
        -drive file=disk.img,if=virtio \
        -net nic,vlan=0,model=virtio,macaddr='00:00:00:01:00:00' \
        -device virtio-net-pci,netdev=dev1,mac='00:00:00:01:00:01',vectors=34,mq=on,csum=on,guest_csum=on \
        -device virtio-net-pci,netdev=dev2,mac='00:00:00:01:00:02',vectors=34,mq=on,csum=on,guest_csum=on \
        -device virtio-net-pci,netdev=dev3,mac='00:00:00:01:00:03',vectors=34,mq=on,csum=on,guest_csum=on \
        -net tap,vlan=0,ifname=tap-dpdk-ctrl \
        -netdev tap,ifname=tap-dpdk-1,script=no,downscript=no,vhost=on,queues=16,id=dev1 \
        -netdev tap,ifname=tap-dpdk-2,script=no,downscript=no,vhost=on,queues=16,id=dev2 \
        -netdev tap,ifname=tap-dpdk-3,script=no,downscript=no,vhost=on,queues=16,id=dev3 &
#       -device vfio-pci,host='0000:00:19.0' \
#ne2k_pci,i82551,i82557b,i82559er,rtl8139,e1000,pcnet,virtio

/home/tong/VM/dpdk [tong@T7] [19:06]
> 
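
  As a possible way to answer that last question, here is a sketch of my own (not something the example does): rte_eth_dev_info_get() reports per-port offload capability masks, so checking DEV_RX_OFFLOAD_IPV4_CKSUM / DEV_TX_OFFLOAD_IPV4_CKSUM in them should show, from inside the guest, whether the PMD claims checksum offload support. Flag and field names are from DPDK 16.07; I have not verified this against the virtio PMD.

/*
 * My own sketch (not from the example): query a port's offload capabilities
 * to see whether the PMD reports checksum offload support.
 */
#include <stdio.h>
#include <rte_ethdev.h>

static void
print_csum_offload_capa(uint8_t portid)
{
        struct rte_eth_dev_info dev_info;

        rte_eth_dev_info_get(portid, &dev_info);
        printf("port %u: rx ipv4 csum offload: %s, tx ipv4 csum offload: %s\n",
               (unsigned)portid,
               (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_IPV4_CKSUM) ? "yes" : "no",
               (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_IPV4_CKSUM) ? "yes" : "no");
}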

 

