Linux Source Code Walkthrough (32): DPDK Core Source Code Analysis (Part 2)

Reposted article. 2022-03-24 11:21

DPDK is a network programming framework whose development is led by Intel. It has many advantages; how are they actually implemented?

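1. How UIO works: DPDK bypasses the operating system kernel and takes over the NIC directly, so a user program can read and write NIC data from ring 3 (user mode). Two key techniques are involved: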

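Address mapping: how does a ring-3 program locate where the NIC's data is stored?
Intercepting hardware interrupts: in the traditional path, the NIC raises a hardware interrupt to tell the CPU that data has arrived. The ring-3 side has to intercept this interrupt and fetch data by polling instead. How is that done?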

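(1) Address mapping: what a ring-3 program works with most is memory addresses, 32 or 64 bits wide, and C/C++ code can read and write them directly through pointers. Besides RAM, many devices also need to exchange data with the CPU. Take a display: the content to show is chosen by the program, which writes it to a designated location, and the display then draws it. To make this easy, the hardware maps part of the device's storage into the physical address space so that it is addressed the same way as RAM (memory-mapped I/O); software can then write to it through a pointer (a plain mov instruction at the assembly level). A NIC works the same way: it sits in a PCI slot, and its register and memory regions (the PCI BARs) are mapped into the physical address space, so reading or writing those physical addresses is equivalent to reading or writing the NIC's own storage. In the driver code, the physical memory and I/O port regions reserved for the PCI NIC are recorded in a uio device, which effectively exposes them through that device; the application then simply accesses the uio device. The key functions are as follows: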

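The physical memory regions and I/O port regions of the PCI NIC are stored in the mem and port members of the uio device structure struct uio_info, so the uio device knows where they are. When the application accesses the memory and I/O regions of the uio device, it is in effect accessing those of the PCI device; in essence, the NIC's regions are exposed through the uio device.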

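int igbuio_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    /* Abridged: allocation of udev, declarations and error handling are omitted. */
    // Map the PCI memory and I/O port regions into the uio device
    struct rte_uio_pci_dev *udev;
    err = igbuio_setup_bars(dev, &udev->info);
}
static int igbuio_setup_bars(struct pci_dev *dev, struct uio_info *info)
{
    /* Abridged: the bar_names[] table ("BAR0", "BAR1", ...) is omitted. */
    int i, iom = 0, iop = 0, ret = 0;
    unsigned long flags;

    // Walk the BARs and expose each memory or I/O port region to the uio device
    for (i = 0; i != sizeof(bar_names) / sizeof(bar_names[0]); i++) 
    {
        if (pci_resource_len(dev, i) != 0 && pci_resource_start(dev, i) != 0) 
        {
            flags = pci_resource_flags(dev, i);
            if (flags & IORESOURCE_MEM) 
            {
                // Expose the PCI memory region to the uio device
                ret = igbuio_pci_setup_iomem(dev, info, iom,  i, bar_names[i]);
                if (ret != 0)
                    return ret;
                iom++;
            } 
            else if (flags & IORESOURCE_IO) 
            {
                // Expose the PCI I/O port region to the uio device
                ret = igbuio_pci_setup_ioport(dev, info, iop,  i, bar_names[i]);
                if (ret != 0)
                    return ret;
                iop++;
            }
        }
    }
    // Fail if the device exposed no usable region at all
    return (iom != 0 || iop != 0) ? ret : -ENOENT;
}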
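On the user-space side, the generic Linux UIO convention is that a process opens the uio device file and calls mmap on it, selecting memory region N with an offset of N times the page size. The sketch below illustrates only that convention; it is not DPDK's own mapping code, and the helper names map_uio_region and unmap_uio_region are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map memory region `index` of an already opened
 * UIO device. By UIO convention, region N is selected by passing an
 * offset of N * page size to mmap. Returns NULL on failure. */
static volatile void *map_uio_region(int fd, int index, size_t len)
{
    assert(index >= 0);
    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)index * page);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: release a mapping made by map_uio_region. */
static int unmap_uio_region(volatile void *addr, size_t len)
{
    return munmap((void *)addr, len);
}
```

A caller would typically open a device such as /dev/uio0 (the exact name depends on the system) with O_RDWR, map region 0, and then access device registers through the returned volatile pointer.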

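(2) Intercepting hardware interrupts: to cut the redundant processing in the kernel, the NIC's interrupt has to be hooked so that it is taken over at the source. Once hooked, a hardware interrupt no longer drives the kernel's normal network interrupt path; the igb_uio handler only acknowledges it and wakes the user-space process, which is how interrupt handling ends up in the application. Note that the interrupt discussed here is only the control interrupt, not the data interrupt for packet RX/TX. Data interrupts never reach this path, because when the PMD enables interrupts it does not set the RX/TX bits in the interrupt mask and only registers for NIC status changes. The hook code is as follows: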

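int igbuio_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    /* Abridged: only the uio_info setup is shown. */
    // Fill in the uio information
    udev->info.name = "igb_uio";
    udev->info.version = "0.1";
    udev->info.handler = igbuio_pci_irqhandler;        // entry point for the hardware control interrupt; takes over the NIC's interrupt
    udev->info.irqcontrol = igbuio_pci_irqcontrol;    // called when the application enables or disables the interrupt
}
static irqreturn_t igbuio_pci_irqhandler(int irq, struct uio_info *info)
{
    struct rte_uio_pci_dev *udev = info->priv;

    // Legacy (INTx) mode: the interrupt has to be masked in hardware
    if (udev->mode == RTE_INTR_MODE_LEGACY && !pci_check_and_mask_intx(udev->pdev))
    {
        return IRQ_NONE;
    }
    // Returning IRQ_HANDLED makes the Linux uio framework wake the processes waiting
    // on the uio interrupt, so a uio fd registered with epoll becomes ready
    /* Message signal mode, no share IRQ and automasked */
    return IRQ_HANDLED;
}
static int igbuio_pci_irqcontrol(struct uio_info *info, s32 irq_state)
{
    /* Abridged: the declaration of desc and config-space locking are omitted. */
    struct rte_uio_pci_dev *udev = info->priv;
    struct pci_dev *pdev = udev->pdev;

    // Use kernel APIs to turn the interrupt on or off
    if (udev->mode == RTE_INTR_MODE_LEGACY)
    {
        pci_intx(pdev, !!irq_state);
    }
    else if (udev->mode == RTE_INTR_MODE_MSIX)
    {
        list_for_each_entry(desc, &pdev->msi_list, list)
            igbuio_msix_mask_irq(desc, irq_state);
    }
    return 0;
}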
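From user space, the generic UIO interface exposes these two callbacks through ordinary file operations: a blocking read of 4 bytes returns the interrupt count once the handler has reported IRQ_HANDLED, and a write of a 32-bit 1 or 0 reaches the driver's irqcontrol callback. The following is a minimal sketch of that interface, assuming an already opened uio file descriptor; uio_wait_irq and uio_irq_control are hypothetical helper names, not DPDK functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: block until the UIO device reports an interrupt.
 * A read on a UIO fd yields a 4-byte interrupt counter. */
static int uio_wait_irq(int fd, uint32_t *count)
{
    assert(count != NULL);
    ssize_t n = read(fd, count, sizeof(*count));
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}

/* Hypothetical helper: ask the driver to enable (1) or disable (0) the
 * interrupt. The 32-bit value written is passed to irqcontrol. */
static int uio_irq_control(int fd, int enable)
{
    int32_t v = enable ? 1 : 0;
    ssize_t n = write(fd, &v, sizeof(v));
    return n == (ssize_t)sizeof(v) ? 0 : -1;
}
```

Because the fd is an ordinary file descriptor, it can also be registered with epoll so that one thread waits on several devices, which matches the epoll remark in the handler above.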

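2. Memory pool: a traditional application that needs memory usually calls malloc and lets the C library and the operating system allocate it on the heap. This has two drawbacks:

Whenever the allocator has to obtain memory from the kernel (brk/mmap and the page faults that follow), execution enters the kernel and pays for the mode switch
The kernel then has to find suitable free memory through its buddy and slab allocators

So calling malloc frequently on the packet path severely hurts efficiency. If malloc is not called frequently, how are the packets that are constantly received and sent handled? DPDK uses a memory pool: a large contiguous buffer is carved out of huge-page memory and used as the pool. rte_mempool_get obtains space from the pool, and rte_mempool_put returns space that is no longer needed. In other words, DPDK maintains a large block of huge-page memory for the application, which no longer needs system calls to request memory from the operating system on the data path.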

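(1) A pool is created by rte_mempool_create. It reserves a large contiguous buffer in huge-page memory and partitions it. At the head is the pool structure struct rte_mempool. Next comes the pool's private data area, whose size is set by the application, so each pool an application creates can specify a different private structure. Last come the object elements, laid out contiguously and all belonging to the same pool. Each element consists of an object header, the actual data area, and an object trailer. The object elements are the real data space the application wants, for example its own struct variables. In essence DPDK implements its own memory management scheme, whose role is much the same as that of the Linux buddy and slab allocators. [Figure: layout of the memory pool]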

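Every time a pool is created, a list node is also created and inserted into a global list, so the list records all the pools that currently exist in the system. The core code is as follows: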

// Create the list node for this pool
te = rte_zmalloc("MEMPOOL_TAILQ_ENTRY", sizeof(*te), 0);
// Insert the node into the list of pools
te->data = (void *) mp;
RTE_EAL_TAILQ_INSERT_TAIL(RTE_TAILQ_MEMPOOL, rte_mempool_list, te);

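So there may be more than one pool. Inside a pool the memory is divided into many objects. When the application asks for memory, how does it know which objects are free and which are in use? After the object elements are initialized, their pointers are put into a ring queue, so every pointer in the ring refers to an object that is free to use. To allocate, the application calls rte_mempool_get, which takes pointers from the ring (a dequeue). When it is done, rte_mempool_put returns the object by putting its pointer back into the ring (an enqueue).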
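The idea that "allocate is a dequeue and free is an enqueue" can be illustrated with a small stand-alone sketch. It is not DPDK code: struct toy_pool, its sizes and its functions are hypothetical, and it is single-threaded for clarity.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_OBJS 4   /* hypothetical number of objects in the pool */
#define OBJ_SIZE  64  /* hypothetical object size in bytes */

struct toy_pool {
    unsigned char storage[POOL_OBJS][OBJ_SIZE]; /* contiguous object area */
    void *ring[POOL_OBJS]; /* circular queue of pointers to free objects */
    unsigned head;         /* next slot to dequeue from */
    unsigned tail;         /* next slot to enqueue into */
    unsigned count;        /* number of free pointers currently queued */
};

/* Put a pointer to every object into the ring: all objects start free. */
static void toy_pool_init(struct toy_pool *p)
{
    assert(p != NULL);
    p->head = p->tail = p->count = 0;
    for (unsigned i = 0; i < POOL_OBJS; i++) {
        p->ring[p->tail] = p->storage[i];
        p->tail = (p->tail + 1) % POOL_OBJS;
        p->count++;
    }
}

/* Allocate: dequeue one free pointer. Returns -1 when the pool is empty. */
static int toy_pool_get(struct toy_pool *p, void **obj)
{
    if (p->count == 0)
        return -1;
    *obj = p->ring[p->head];
    p->head = (p->head + 1) % POOL_OBJS;
    p->count--;
    return 0;
}

/* Free: enqueue the pointer again. Returns -1 if the ring is already full. */
static int toy_pool_put(struct toy_pool *p, void *obj)
{
    if (p->count == POOL_OBJS)
        return -1;
    p->ring[p->tail] = obj;
    p->tail = (p->tail + 1) % POOL_OBJS;
    p->count++;
    return 0;
}
```

No search for free memory is needed at allocation time: whatever pointer sits at the head of the ring is by construction a free object.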

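(2) The steps for allocating memory:

Modern CPUs are almost all multi-core. When several cores allocate from the same pool at the same time, some form of synchronization is unavoidable, which hurts allocation efficiency to some degree. So each core (lcore) has its own private reserve, a per-lcore local cache, and allocates from it first;
Only when the local cache is exhausted does it go on to the shared pool. The core code is as follows: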

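int rte_mempool_get(struct rte_mempool *mp, void **obj_table, unsigned n)
{
    int ret;
#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
    // Abridged: declarations are omitted, as is the code that refills the cache from
    // the ring when it holds fewer than n objects and the fallback to the ring path below.
    // Take the objects from the current lcore's local cache
    cache = &mp->local_cache[lcore_id];
    cache_objs = cache->objs;
    for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
    {
        *obj_table = cache_objs[len];
    }
    cache->len -= n;    // the cache now holds n fewer objects
    return 0;
#endif
    /* get remaining objects from ring */
    // Take the objects straight from the ring
    ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);
    return ret;
}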

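Freeing follows similar steps:

The freed object pointers are first placed in the core's local cache;
Once the local cache has grown past its flush threshold, the excess pointers are returned to the shared pool. The core code is as follows: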

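int rte_mempool_put(struct rte_mempool *mp, void **obj_table, unsigned n)
{
#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
    // Abridged: declarations and the check that bypasses the cache are omitted.
    // Put the objects back into the current lcore's local cache first
    cache = &mp->local_cache[lcore_id];
    cache_objs = &cache->objs[cache->len];
    for (index = 0; index < n; ++index, obj_table++)
    {
        cache_objs[index] = *obj_table;
    }
    cache->len += n;    // the cache now holds n more objects
    // Once the cache reaches the flush threshold, return everything above cache_size to the ring
    if (cache->len >= flushthresh) 
    {
        rte_ring_mp_enqueue_bulk(mp->ring, &cache->objs[cache_size], cache->len - cache_size);
        cache->len = cache_size;
    }
    return 0;
#endif
    // Put the objects straight back into the ring
    rte_ring_sp_enqueue_bulk(mp->ring, obj_table, n);
    return 0;
}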
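The interplay between the per-core cache and the shared ring can be tried out in isolation. The sketch below is a simplified model, not the DPDK implementation: the types, the sizes CACHE_SIZE and FLUSH_THRESH, and the function names are all hypothetical, the shared ring is replaced by a plain array, and the refill-on-miss step of the real code is left out.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE   4                  /* hypothetical target fill level */
#define FLUSH_THRESH 6                  /* flush once the cache reaches this */
#define CACHE_CAP    (FLUSH_THRESH + 4) /* room for one put of up to 4 objects */
#define RING_CAP     64

struct toy_ring  { void *objs[RING_CAP];  unsigned len; }; /* stand-in for the shared ring */
struct toy_cache { void *objs[CACHE_CAP]; unsigned len; }; /* one per core */

static int ring_put_bulk(struct toy_ring *r, void **objs, unsigned n)
{
    if (r->len + n > RING_CAP)
        return -1;
    for (unsigned i = 0; i < n; i++)
        r->objs[r->len++] = objs[i];
    return 0;
}

static int ring_get_bulk(struct toy_ring *r, void **objs, unsigned n)
{
    if (r->len < n)
        return -1;
    for (unsigned i = 0; i < n; i++)
        objs[i] = r->objs[--r->len];
    return 0;
}

/* Get n objects: from the local cache if it holds enough, else from the ring. */
static int toy_get(struct toy_cache *c, struct toy_ring *r, void **objs, unsigned n)
{
    if (c->len >= n) {
        for (unsigned i = 0; i < n; i++)
            objs[i] = c->objs[--c->len];
        return 0;
    }
    return ring_get_bulk(r, objs, n);
}

/* Put n objects (at most 4 per call): always into the cache first, then
 * flush everything above CACHE_SIZE once FLUSH_THRESH is reached. */
static int toy_put(struct toy_cache *c, struct toy_ring *r, void **objs, unsigned n)
{
    assert(n <= CACHE_CAP - FLUSH_THRESH);
    for (unsigned i = 0; i < n; i++)
        c->objs[c->len++] = objs[i];
    if (c->len >= FLUSH_THRESH) {
        if (ring_put_bulk(r, &c->objs[CACHE_SIZE], c->len - CACHE_SIZE) != 0)
            return -1;
        c->len = CACHE_SIZE;
    }
    return 0;
}
```

The point of the threshold is that the shared ring, which needs synchronization between cores, is touched only occasionally and in bulk, while the common case stays within memory owned by one core.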

Note: the ring here is a lock-free circular queue.
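To give a feel for what "lock-free" means for a ring, here is a minimal single-producer/single-consumer sketch using C11 atomics. It is a deliberate simplification: DPDK's rte_ring also supports multiple producers and consumers by means of compare-and-swap, which this sketch does not attempt. The struct, the size and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8u /* hypothetical size; must be a power of two */

static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

struct spsc_ring {
    void *slots[RING_SIZE];
    _Atomic unsigned head; /* advanced only by the producer */
    _Atomic unsigned tail; /* advanced only by the consumer */
};

/* Producer side: returns -1 when the ring is full. */
static int spsc_enqueue(struct spsc_ring *r, void *obj)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return -1;
    r->slots[head & (RING_SIZE - 1)] = obj;
    /* release: the slot write becomes visible before the new head */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns -1 when the ring is empty. */
static int spsc_dequeue(struct spsc_ring *r, void **obj)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return -1;
    *obj = r->slots[tail & (RING_SIZE - 1)];
    /* release: the slot read completes before the slot is handed back */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}
```

Neither side ever takes a lock or waits for the other; each only publishes its own index, and a full or empty ring is reported to the caller immediately.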

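3. Poll mode driver: whatever the form of I/O, the receiver can get its data in one of two ways:

Being woken by an interrupt: typically the NIC receives data and raises a hardware interrupt to notify the operating system, which handles the data and then wakes the sleeping process to continue processing
Polling: run a busy loop that keeps checking a memory location to see whether new data has arrived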

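On x86, handling an interrupt means saving the CPU state registers on the stack, running the interrupt handler, and finally restoring the saved state from the stack; the whole process takes at least 300 processor clock cycles. DPDK therefore abandons interrupts on the data path and polls instead. The flow is roughly this: the kernel-mode UIO driver hooks the interrupt signal raised by the NIC, and the user-mode PMD actively polls. Apart from link-status notification, which still has to use an interrupt (it is the NIC's hardware interrupt that triggers the hooked handler), the NIC's receive and transmit queues are operated directly without interrupts. In more detail: UIO exposes the NIC's regions through a /dev/uioX device file, the application uses mmap to map them into its own virtual address space, and the PMD then polls the receive descriptors in a loop to see whether new packets have arrived. Because the application works on the NIC's queues and DMA buffers directly, packet data does not have to be copied between kernel and user space.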

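In short, with UIO + PMD, the former bypasses the kernel and the latter polls actively to avoid hardware interrupts, so DPDK can receive and send packets in user space. This brings zero copy and no system calls on the data path. It also avoids the asynchronous processing of softirqs and reduces the cache misses caused by context switches. The core code for polled receive is as follows: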

/* PMD polled receive of packets */
uint16_t
eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
        uint16_t nb_pkts)
{
    /* volatile stops the compiler from caching these in registers: every access is re-read from memory */
    volatile struct e1000_rx_desc *rx_ring;
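    volatile struct e1000_rx_desc *rxdp; // points to one e1000_rx_desc descriptor in the rx ring
    struct em_rx_queue *rxq; // the whole receive queue
    struct em_rx_entry *sw_ring; // start of the software ring that parallels the descriptor ring; indexed from rx tail
    struct em_rx_entry *rxe; // points to a specific entry in the sw ring
    struct rte_mbuf *rxm; // the rte_mbuf held by that entry
    /* nmb is the newly allocated mbuf: once rxm is taken out of the ring, nmb is hooked in
       its place and the matching rx ring and sw ring slots are updated, ready for the next receive */
    struct rte_mbuf *nmb;
    struct e1000_rx_desc rxd; // a by-value copy of the descriptor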
    uint64_t dma_addr;
    uint16_t pkt_len;
    uint16_t rx_id;
    uint16_t nb_rx;
    uint16_t nb_hold;
    uint8_t status;

    rxq = rx_queue;

    nb_rx = 0;
    nb_hold = 0;
    // Initialize the local variables before walking the queue
    rx_id = rxq->rx_tail;
    rx_ring = rxq->rx_ring;
    sw_ring = rxq->sw_ring;
    /* receive at most nb_pkts packets in one call (for example 32) */
    while (nb_rx < nb_pkts) {
        /*
         * The order of operations here is important as the DD status
         * bit must not be read after any other descriptor fields.
         * rx_ring and rxdp are pointing to volatile data so the order
         * of accesses cannot be reordered by the compiler. If they were
         * not volatile, they could be reordered which could lead to
         * using invalid descriptor fields when read from rxd.
         */
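        /* descriptor of the current packet */
        rxdp = &rx_ring[rx_id];
        status = rxdp->status; /* completion flag; must be read first */
        /* check the DD (descriptor done) bit; if it is not set, the NIC has not written a packet into this descriptor yet, so stop */
        if (! (status & E1000_RXD_STAT_DD))
            break;
        rxd = *rxdp; /* take a copy */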

        /*
         * End of packet.
         *
         * If the E1000_RXD_STAT_EOP flag is not set, the RX packet is
         * likely to be invalid and to be dropped by the various
         * validation checks performed by the network stack.
         *
         * Allocate a new mbuf to replenish the RX ring descriptor.
         * If the allocation fails:
         *    - arrange for that RX descriptor to be the first one
         *      being parsed the next time the receive function is
         *      invoked [on the same queue].
         *
         *    - Stop parsing the RX ring and return immediately.
         *
         * This policy do not drop the packet received in the RX
         * descriptor for which the allocation of a new mbuf failed.
         * Thus, it allows that packet to be later retrieved if
         * mbuf have been freed in the mean time.
         * As a side effect, holding RX descriptors instead of
         * systematically giving them back to the NIC may lead to
         * RX ring exhaustion situations.
         * However, the NIC can gracefully prevent such situations
         * to happen by sending specific "back-pressure" flow control
         * frames to its peer(s).
         */
        PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
               "status=0x%x pkt_len=%u",
               (unsigned) rxq->port_id, (unsigned) rxq->queue_id,
               (unsigned) rx_id, (unsigned) status,
               (unsigned) rte_le_to_cpu_16(rxd.length));

        nmb = rte_mbuf_raw_alloc(rxq->mb_pool);
        if (nmb == NULL) {
            PMD_RX_LOG(DEBUG, "RX mbuf alloc failed port_id=%u "
                   "queue_id=%u",
                   (unsigned) rxq->port_id,
                   (unsigned) rxq->queue_id);
            rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed++;
            break;
        }
        
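        /* this descriptor is now held by software */
        nb_hold++;
        /* the sw ring entry whose mbuf holds the packet just received */
        rxe = &sw_ring[rx_id];
        /* advance the receive position; wrap around at the end of the ring */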
        rx_id++;
        if (rx_id == rxq->nb_rx_desc)
            rx_id = 0;

        /* pull the next mbuf into the cache for the next iteration */
        /* Prefetch next mbuf while processing current one. */
        rte_em_prefetch(sw_ring[rx_id].mbuf);

        /*
         * When next RX descriptor is on a cache-line boundary,
         * prefetch the next 4 RX descriptors and the next 8 pointers
         * to mbufs.
         */
         /* prefetch the next descriptors and mbuf pointers for the following iterations */
        /* one 64-byte cache line holds 4 descriptors */
        if ((rx_id & 0x3) == 0) {
            rte_em_prefetch(&rx_ring[rx_id]);
            rte_em_prefetch(&sw_ring[rx_id]);
        }

        /* Rearm RXD: attach new mbuf and reset status to zero. */

        rxm = rxe->mbuf;
        rxe->mbuf = nmb;
        dma_addr =
            rte_cpu_to_le_64(rte_mbuf_data_iova_default(nmb));
        rxdp->buffer_addr = dma_addr;
        rxdp->status = 0;/* reset the status of this descriptor */

        /*
         * Initialize the returned mbuf.
         * 1) setup generic mbuf fields:
         *    - number of segments,
         *    - next segment,
         *    - packet length,
         *    - RX port identifier.
         * 2) integrate hardware offload data, if any:
         *    - RSS flag & hash,
         *    - IP checksum flag,
         *    - VLAN TCI, if any,
         *    - error flags.
         */
        pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.length) -
                rxq->crc_len);
        rxm->data_off = RTE_PKTMBUF_HEADROOM;
        rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
        rxm->nb_segs = 1;
        rxm->next = NULL;
        rxm->pkt_len = pkt_len;
        rxm->data_len = pkt_len;
        rxm->port = rxq->port_id;

        rxm->ol_flags = rx_desc_status_to_pkt_flags(status);
        rxm->ol_flags = rxm->ol_flags |
                rx_desc_error_to_pkt_flags(rxd.errors);

        /* Only valid if PKT_RX_VLAN set in pkt_flags */
        rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);

        /*
         * Store the mbuf address into the next entry of the array
         * of returned packets.
         */
          /* hand the received mbuf back to the caller */
        rx_pkts[nb_rx++] = rxm;
    }
     /* update the receive position */
    rxq->rx_tail = rx_id;

    /*
     * If the number of free RX descriptors is greater than the RX free
     * threshold of the queue, advance the Receive Descriptor Tail (RDT)
     * register.
     * Update the RDT with the value of the last processed RX descriptor
     * minus 1, to guarantee that the RDT register is never equal to the
     * RDH register, which creates a "full" ring situation from the
     * hardware point of view...
     */
    nb_hold = (uint16_t) (nb_hold + rxq->nb_rx_hold);
    if (nb_hold > rxq->rx_free_thresh) {
        PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
               "nb_hold=%u nb_rx=%u",
               (unsigned) rxq->port_id, (unsigned) rxq->queue_id,
               (unsigned) rx_id, (unsigned) nb_hold,
               (unsigned) nb_rx);
        rx_id = (uint16_t) ((rx_id == 0) ?
            (rxq->nb_rx_desc - 1) : (rx_id - 1));
        E1000_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
        nb_hold = 0;
    }
    rxq->nb_rx_hold = nb_hold;
    return nb_rx;
}

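The overall packet receive flow can be summarized as follows:

The NIC's DMA engine writes incoming packets one by one into the buffers whose I/O addresses are stored in the receive descriptors of the rx ring; the memory behind each address is the data area of an mbuf;
The receive function uses the rx tail variable to keep reading descriptors from the rx ring and the matching mbufs from the sw ring, allocates a new mbuf for each slot in the sw ring, and updates the buffer addr in the rx ring descriptor;
Finally the mbufs that were read are returned to the application.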
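The polling pattern in the receive function (read the status first, stop at the first descriptor without the DD bit, wrap the index, and compute the tail register value as "last processed descriptor") can be reproduced with a toy ring that needs no hardware. Everything below is a hypothetical model for illustration: the descriptor layout, NB_DESC, STAT_DD and the function names are not taken from the e1000 driver, and mbuf replacement is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NB_DESC 8u    /* hypothetical number of descriptors in the ring */
#define STAT_DD 0x01u /* "descriptor done" bit, set by the simulated NIC */

struct toy_desc { uint16_t length; uint8_t status; };

struct toy_rxq {
    volatile struct toy_desc ring[NB_DESC];
    uint16_t rx_tail; /* next descriptor software will examine */
};

/* Collect up to nb_pkts completed descriptors, store their packet
 * lengths in lens[], and return how many were found. Never blocks. */
static uint16_t toy_recv_burst(struct toy_rxq *q, uint16_t *lens, uint16_t nb_pkts)
{
    assert(q != NULL && lens != NULL);
    uint16_t nb_rx = 0;
    uint16_t id = q->rx_tail;

    while (nb_rx < nb_pkts) {
        volatile struct toy_desc *d = &q->ring[id];
        uint8_t status = d->status;          /* read the status first */
        if (!(status & STAT_DD))
            break;                           /* nothing new: return at once */
        lens[nb_rx++] = d->length;
        d->status = 0;                       /* give the descriptor back */
        id = (uint16_t)((id + 1) % NB_DESC); /* wrap around the ring */
    }
    q->rx_tail = id;
    return nb_rx;
}

/* Value for the tail register: the last processed descriptor, that is
 * rx_tail minus 1 with wrap-around, so it never equals the head. */
static uint16_t toy_rdt_value(uint16_t rx_tail)
{
    return (uint16_t)(rx_tail == 0 ? NB_DESC - 1 : rx_tail - 1);
}
```

An application would call such a burst function in a tight loop on a dedicated core; a return value of 0 simply means "no packets this time", and the loop tries again.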

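4. Thread affinity

Several threads can run on one CPU, and the Linux kernel schedules them. When the kernel switches threads it performs a context switch, saving the thread's registers, stack pointer and related state so that the thread resumes from the same point the next time it is scheduled. Context switches cost CPU time. On a multi-core CPU, threads that bounce between physical cores lower the L1/L2 cache hit rate. On a NUMA system, if the scheduler moves a thread across NUMA nodes, much of the L3 cache content is lost as well. Linux supports thread affinity: if a thread is bound to a CPU, it keeps running on that CPU and is not moved elsewhere by the scheduler, so threads work independently without disturbing one another, and the time the operating system would spend migrating them is saved. DPDK binds its threads to CPUs in this way to avoid the cost of switching tasks across cores.

The function that binds a thread to a CPU core is as follows: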

/* set affinity for current EAL thread */
static int
eal_thread_set_affinity(void)
{
    unsigned lcore_id = rte_lcore_id();

    /* acquire system unique id  */
    rte_gettid();

    /* update EAL thread core affinity */
    return rte_thread_set_affinity(&lcore_config[lcore_id].cpuset);
}

Following the call chain:

/*
    Bind the calling thread according to the given rte_cpuset_t,
    store the thread-local socket_id,
    store the thread-local rte_cpuset_t
*/
int
rte_thread_set_affinity(rte_cpuset_t *cpusetp)
{
    int s;
    unsigned lcore_id;
    pthread_t tid;

    tid = pthread_self(); // handle of the calling thread
    // bind the thread to the CPUs in cpusetp
    s = pthread_setaffinity_np(tid, sizeof(rte_cpuset_t), cpusetp);
    if (s != 0) {
        RTE_LOG(ERR, EAL, "pthread_setaffinity_np failed\n");
        return -1;
    }

    /* store socket_id in TLS for quick access */
    // keep the socket id in thread-local storage for fast reads
    RTE_PER_LCORE(_socket_id) =
        eal_cpuset_socket_id(cpusetp);

    /* store cpuset in TLS for quick access */
    // keep the cpuset in thread-local storage for fast reads
    memmove(&RTE_PER_LCORE(_cpuset), cpusetp,
        sizeof(rte_cpuset_t));

    lcore_id = rte_lcore_id(); // lcore id of the calling thread
    if (lcore_id != (unsigned)LCORE_ID_ANY) { // an EAL thread (id is not LCORE_ID_ANY): update the lcore config as well
        /* EAL thread will update lcore_config */
        lcore_config[lcore_id].socket_id = RTE_PER_LCORE(_socket_id);
        memmove(&lcore_config[lcore_id].cpuset, cpusetp,
            sizeof(rte_cpuset_t));
    }

    return 0;
}
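Stripped of the DPDK bookkeeping, the binding itself is a single pthread_setaffinity_np call with a CPU set that contains one CPU. A minimal stand-alone sketch for Linux with glibc follows; pin_self_to_cpu and first_allowed_cpu are hypothetical helper names, and the CPU is chosen from the set the thread is already allowed to use so that the call can succeed inside restricted environments.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: first CPU the calling thread may currently run
 * on, or -1 if the affinity mask cannot be read. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Hypothetical helper: bind the calling thread to exactly one CPU.
 * Returns 0 on success, an error number otherwise. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

After a successful call, reading the mask back with pthread_getaffinity_np should report a set containing only the chosen CPU, which is the state DPDK establishes for each of its EAL threads.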

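Following the call further: in DPDK's performance-thread (lthread) example, pthread_setaffinity_np is wrapped by a pthread shim. When the override is active, the call is routed to the lthread scheduler; otherwise it falls through to the real pthread function: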

int
pthread_setaffinity_np(pthread_t thread, size_t cpusetsize,
               const rte_cpuset_t *cpuset)
{
    if (override) {
        /* we only allow affinity with a single CPU */
        if (CPU_COUNT(cpuset) != 1)
            return POSIX_ERRNO(EINVAL);

        /* we only allow the current thread to sets its own affinity */
        struct lthread *lt = (struct lthread *)thread;

        if (lthread_current() != lt)
            return POSIX_ERRNO(EINVAL);

        /* determine the CPU being requested */
        int i;

        for (i = 0; i < LTHREAD_MAX_LCORES; i++) {
            if (!CPU_ISSET(i, cpuset))
                continue;
            break;
        }
        /* check requested core is allowed */
        if (i == LTHREAD_MAX_LCORES)
            return POSIX_ERRNO(EINVAL);

        /* finally we can set affinity to the requested lcore
        (after all the checks above, this is where the binding actually happens)
        */
        lthread_set_affinity(i);
        return 0;
    }
    return _sys_pthread_funcs.f_pthread_setaffinity_np(thread, cpusetsize,
                               cpuset);
}

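Binding to a CPU is simple here: in essence it is a context switch. The lthread records the destination scheduler and switches back to the current scheduler, which can then hand it over to the scheduler running on the target lcore (the ctx_switch shown is the ARM64 version):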

/*
 * migrate the current thread to another scheduler running
 * on the specified lcore.
 */
int lthread_set_affinity(unsigned lcoreid)
{
    struct lthread *lt = THIS_LTHREAD;
    struct lthread_sched *dest_sched;

    if (unlikely(lcoreid >= LTHREAD_MAX_LCORES))
        return POSIX_ERRNO(EINVAL);

    DIAG_EVENT(lt, LT_DIAG_LTHREAD_AFFINITY, lcoreid, 0);

    dest_sched = schedcore[lcoreid];

    if (unlikely(dest_sched == NULL))
        return POSIX_ERRNO(EINVAL);

    if (likely(dest_sched != THIS_SCHED)) {
        lt->sched = dest_sched;
        lt->pending_wr_queue = dest_sched->pready;
        // this is what actually moves the thread over to run on the target CPU
        _affinitize();
        return 0;
    }
    return 0;
}
static __rte_always_inline void
_affinitize(void);
static inline void
_affinitize(void)
{
    struct lthread *lt = THIS_LTHREAD;

    DIAG_EVENT(lt, LT_DIAG_LTHREAD_SUSPENDED, 0, 0);
    ctx_switch(&(THIS_SCHED)->ctx, &lt->ctx);
}
void
ctx_switch(struct ctx *new_ctx __rte_unused, struct ctx *curr_ctx __rte_unused)
{
    /* SAVE CURRENT CONTEXT */
    asm volatile (
        /* Save SP */
        "mov x3, sp\n"
        "str x3, [x1, #0]\n"

        /* Save FP and LR */
        "stp x29, x30, [x1, #8]\n"

        /* Save Callee Saved Regs x19 - x28 */
        "stp x19, x20, [x1, #24]\n"
        "stp x21, x22, [x1, #40]\n"
        "stp x23, x24, [x1, #56]\n"
        "stp x25, x26, [x1, #72]\n"
        "stp x27, x28, [x1, #88]\n"

        /*
         * Save bottom 64-bits of Callee Saved
         * SIMD Regs v8 - v15
         */
        "stp d8, d9, [x1, #104]\n"
        "stp d10, d11, [x1, #120]\n"
        "stp d12, d13, [x1, #136]\n"
        "stp d14, d15, [x1, #152]\n"
    );

    /* RESTORE NEW CONTEXT */
    asm volatile (
        /* Restore SP */
        "ldr x3, [x0, #0]\n"
        "mov sp, x3\n"

        /* Restore FP and LR */
        "ldp x29, x30, [x0, #8]\n"

        /* Restore Callee Saved Regs x19 - x28 */
        "ldp x19, x20, [x0, #24]\n"
        "ldp x21, x22, [x0, #40]\n"
        "ldp x23, x24, [x0, #56]\n"
        "ldp x25, x26, [x0, #72]\n"
        "ldp x27, x28, [x0, #88]\n"

        /*
         * Restore bottom 64-bits of Callee Saved
         * SIMD Regs v8 - v15
         */
        "ldp d8, d9, [x0, #104]\n"
        "ldp d10, d11, [x0, #120]\n"
        "ldp d12, d13, [x0, #136]\n"
        "ldp d14, d15, [x0, #152]\n"
    );
}
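The assembly above saves the stack pointer and the callee-saved registers of the current context and loads those of the new one. The same mechanism can be tried portably with the POSIX ucontext functions, as in the sketch below. This is only an analogy for illustration, not how lthread is implemented: a "scheduler" context and one "task" context hand control back and forth, and the names sched_ctx, task_ctx, task_body and run_demo are hypothetical.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t sched_ctx, task_ctx; /* "scheduler" and "task" contexts */
static int trace[4], trace_len;        /* records the order of execution */
static char task_stack[64 * 1024];     /* private stack for the task */

static void task_body(void)
{
    trace[trace_len++] = 1;             /* first slice of the task */
    swapcontext(&task_ctx, &sched_ctx); /* yield: save task, resume scheduler */
    trace[trace_len++] = 3;             /* runs after the task is resumed */
}

static void run_demo(void)
{
    trace_len = 0;
    assert(getcontext(&task_ctx) == 0);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof(task_stack);
    task_ctx.uc_link = &sched_ctx;      /* resume here when task_body returns */
    makecontext(&task_ctx, task_body, 0);

    swapcontext(&sched_ctx, &task_ctx); /* run the task until it yields */
    trace[trace_len++] = 2;             /* the scheduler runs in between */
    swapcontext(&sched_ctx, &task_ctx); /* resume the task until it returns */
    trace[trace_len++] = 4;
}
```

In this sketch the same scheduler resumes the task. In the lthread case described above, the difference is that after the switch back to the scheduler, the suspended lthread is queued to the scheduler of another lcore, which resumes it there.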

