Solutions to the DMA Cache Coherency Problem

Reposted article. 2021-12-08 09:53. os

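DMA and Cache Coherency

The coherency problem

Suppose memory holds a packet. The CPU reads it into the cache; when the CPU reads the same packet again, it gets a cache hit and takes the value from the cache.

Now suppose the peripheral is a network card that writes new data into memory by DMA, turning the block that was red (old data) green (new data). Memory is already green, but the CPU still reads red from its cache, so memory and cache are inconsistent.
Likewise, when the CPU writes to the red region, the write lands in the cache. If the cache has not been synchronized with memory, the data the DMA engine fetches from memory is stale, and the packet that is sent is wrong.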
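The two failure cases above can be modeled in a few lines of user-space C. This is only a toy sketch: the struct, the `toy_` function names and the `TOY_RED`/`TOY_GREEN` values are hypothetical and stand in for one memory word and its cached copy; no real cache hardware or kernel API is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_RED = 1, TOY_GREEN = 2 };

/* One memory word and the CPU's cached copy of it. */
struct toy_line {
    uint32_t mem;         /* contents of main memory */
    uint32_t cache;       /* contents of the cache line */
    int      cache_valid; /* 1 if the cache holds a copy */
};

/* CPU read: served from the cache on a hit, filled from memory on a miss. */
uint32_t toy_line_cpu_read(struct toy_line *l)
{
    assert(l != NULL);
    if (!l->cache_valid) {
        l->cache = l->mem;
        l->cache_valid = 1;
    }
    return l->cache;
}

/* CPU write in write-back style: only the cache is updated. */
void toy_line_cpu_write(struct toy_line *l, uint32_t v)
{
    assert(l != NULL);
    l->cache = v;
    l->cache_valid = 1;
}

/* Device DMA goes straight to memory and never looks at the cache. */
void toy_line_dma_write(struct toy_line *l, uint32_t v)
{
    l->mem = v;
}

uint32_t toy_line_dma_read(const struct toy_line *l)
{
    return l->mem;
}
```

In this model, a CPU read after `toy_line_dma_write` still returns the old cached value, and a DMA read after `toy_line_cpu_write` still returns the old memory value, which are exactly the receive-side and transmit-side problems described above.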
Solutions:

Coherent DMA buffers
Streaming DMA mapping

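Coherent DMA buffers

This is for weak hardware, where the hardware gives no help with the coherency problem.
dma_alloc_coherent allocates a region of memory that you request yourself when writing a driver:
- CPU reads and writes bypass the cache
- DMA reads and writes bypass the cache as well
  With this, no coherency problem can arise. However, dma_alloc_coherent only works for memory you allocate yourself in your own driver, and in many cases that is not possible.
  For example, in the TCP/IP stack the socket buffer is not memory the driver author allocated, so dma_alloc_coherent cannot be used for it.

Streaming DMA mapping
- Transmit
  Use dma_map_single and dma_unmap_single. For memory that was not allocated through the DMA API, this API flushes the cached contents so that they are written back to memory.
- Receive
  The corresponding cache lines are marked invalid; see the discussion of MESI coherency in the cache line chapter for details. The CPU can control the state flags of a cache line, but it cannot address an individual byte inside a particular cache line.
  There are also dma_map_sg and dma_unmap_sg. Some DMA engines are more capable and support scatter-gather: they transfer n buffers automatically, moving on to the second once the first is done, so the memory used for DMA need not be contiguous. These two APIs let several non-contiguous buffers be transferred automatically (details left for later study).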
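The transmit and receive behaviour of a streaming mapping can be sketched as a user-space model. Everything here is an assumption made for illustration: `toy_map_single` and `toy_unmap_single` are hypothetical stand-ins, not the kernel's dma_map_single and dma_unmap_single, and the "cache" is just a second array. The sketch only shows the idea that mapping for transmit writes dirty cache contents back to memory, while mapping for receive invalidates the cached copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUF_SIZE 8

enum toy_dir { TOY_TO_DEVICE, TOY_FROM_DEVICE };

/* A buffer as seen by memory and by the CPU cache. */
struct toy_buf {
    unsigned char mem[TOY_BUF_SIZE];   /* what the device sees */
    unsigned char cache[TOY_BUF_SIZE]; /* what the CPU sees on a hit */
    int cache_valid;                   /* cache holds a copy */
    int cache_dirty;                   /* cached copy is newer than memory */
};

/* CPU write: lands in the cache only (write-back behaviour). */
void toy_cpu_write(struct toy_buf *b, const unsigned char *src)
{
    memcpy(b->cache, src, TOY_BUF_SIZE);
    b->cache_valid = 1;
    b->cache_dirty = 1;
}

/* CPU read: cache hit if valid, otherwise refill from memory. */
const unsigned char *toy_cpu_read(struct toy_buf *b)
{
    if (!b->cache_valid) {
        memcpy(b->cache, b->mem, TOY_BUF_SIZE);
        b->cache_valid = 1;
        b->cache_dirty = 0;
    }
    return b->cache;
}

/* Hand the buffer to the device. */
void toy_map_single(struct toy_buf *b, enum toy_dir dir)
{
    assert(b != NULL);
    if (dir == TOY_TO_DEVICE) {
        if (b->cache_valid && b->cache_dirty) {
            memcpy(b->mem, b->cache, TOY_BUF_SIZE); /* flush to memory */
            b->cache_dirty = 0;
        }
    } else {
        b->cache_valid = 0;                         /* invalidate */
        b->cache_dirty = 0;
    }
}

/* Hand the buffer back to the CPU. */
void toy_unmap_single(struct toy_buf *b, enum toy_dir dir)
{
    assert(b != NULL);
    if (dir == TOY_FROM_DEVICE) {
        b->cache_valid = 0; /* make sure the next read refetches */
        b->cache_dirty = 0;
    }
}
```

A driver-like sequence in this model is: `toy_cpu_write`, then `toy_map_single(..., TOY_TO_DEVICE)` before the device reads `mem`; and for receive, `toy_map_single(..., TOY_FROM_DEVICE)`, let the device fill `mem`, `toy_unmap_single`, then `toy_cpu_read`.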

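The exception for dma_alloc_coherent

Normally the memory returned by this API is uncached (green). But when the CPU supports a cache coherent interconnect, the CPU cache can observe accesses from external devices and the hardware keeps things in sync (the same kind of synchronization as MESI). In that case memory allocated with dma_alloc_coherent can be cached.
On the surface the DMA APIs are the same as those above, but the back-end implementation may differ from platform to platform.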

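SMMU | IOMMU

Here the DMA engine comes with its own MMU. A DMA engine behind an SMMU therefore does not care whether the allocated memory is physically contiguous, because the SMMU maps the physical pages into a contiguous virtual (device-side) address range. Memory is still allocated with dma_alloc_coherent. For the DMA engines without an MMU discussed above, the memory is contiguous memory obtained from CMA.
With an MMU, the memory allocated for DMA can be physically non-contiguous.
As this shows, the more work the hardware does for you, the less you have to worry about.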
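How an IOMMU makes scattered physical pages look contiguous to the device can be illustrated with a toy translation table. This is a sketch under stated assumptions: a 4 KiB page size, a four-entry table, and the names `toy_iommu` and `toy_iova_to_phys` are all made up for the example and do not correspond to any real SMMU driver interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1u << TOY_PAGE_SHIFT)
#define TOY_NPAGES     4

/* Entry i maps device page i to a physical page frame number. */
struct toy_iommu {
    uint64_t pfn[TOY_NPAGES];
};

/* Translate a device-visible address (IOVA) into a physical address. */
uint64_t toy_iova_to_phys(const struct toy_iommu *mmu, uint64_t iova)
{
    uint64_t page   = iova >> TOY_PAGE_SHIFT;
    uint64_t offset = iova & (TOY_PAGE_SIZE - 1);

    assert(mmu != NULL && page < TOY_NPAGES);
    return (mmu->pfn[page] << TOY_PAGE_SHIFT) | offset;
}
```

With a table such as `{ 0x100, 0x37, 0x2a0, 0x5 }`, the device sees one contiguous 16 KiB range starting at IOVA 0, while the four backing pages sit at unrelated physical addresses.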

Analysis of DMA and cache in the Linux kernel, covering the following functions

dma_alloc_coherent
dma_map_single
dma_alloc_writecombine
pgprot_noncached
remap_pfn_range

Linux Kernel: 4.9.22
Arch: arm

arm

arch/arm/mm/dma-mapping.c
include/linux/dma-mapping.h

Key variables and functions

atomic_pool_init and DEFAULT_DMA_COHERENT_POOL_SIZE
dma zone, dma pool, setup_dma_zone and CONFIG_ZONE_DMA
coherent_dma_mask and dma_zone_size

DMA ZONE

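The DMA zone exists because the DMA engines of some hardware cannot reach all of memory. With a DMA zone in place, memory allocated with GFP_DMA is restricted to the DMA zone, and such hardware must use GFP_DMA to obtain memory it can DMA to.
If every device in the system can address all of memory, the DMA zone covers all of memory.
How large the DMA zone is, and whether it exists at all, depends on your actual hardware.
Depending on the design and on how the hardware is used, the DMA zone may not exist.

Most SoCs today are capable enough that their DMA engines have essentially no addressing limitations, so there is rarely any need to specify a DMA zone size, and the zone is largely unnecessary. Wherever memory is allocated, the DMA engine can access it.

Can DMA zone memory only be used for DMA?

DMA zone memory can be used for anything. The zone only ensures that drivers for peripherals with limited DMA engines allocate their DMA buffers from this region; it is not reserved for them. Memory for anyone else (applications and the kernel included) can come from it too.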

Definitions of dma_mask and coherent_dma_mask

include/linux/device.h

struct device {
    ...
    u64     *dma_mask;  /* dma mask (if dma'able device) */
    u64     coherent_dma_mask;/* Like dma_mask, but for
                        alloc_coherent mappings as
                        not all hardware supports
                        64 bit addresses for consistent
                        allocations such descriptors. */
    unsigned long   dma_pfn_offset;

    struct device_dma_parameters *dma_parms;

    struct list_head    dma_pools;  /* dma pools (if dma'ble) */

    struct dma_coherent_mem *dma_mem; /* internal for coherent mem override */
    ...
};

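The dma_mask and coherent_dma_mask fields describe the range of physical addresses the device can address; the kernel uses them to allocate suitable physical memory for the device. dma_mask is the range the device's DMA can access, while coherent_dma_mask applies to coherent DMA buffer allocations, because not all hardware supports 64-bit addresses. If addr_phy is a physical address and (u64)addr_phy <= *dev->dma_mask, the device can address it. If the device can only address 32 bits, the mask should be 0xffffffff, and so on.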
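The addressability rule just stated is a single comparison, which a small user-space sketch makes concrete. `TOY_DMA_BIT_MASK` is a local macro modeled on the kernel's DMA_BIT_MASK(n), and `toy_dma_capable` is a hypothetical helper name; neither is taken from the code quoted in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low n bits set; modeled on the kernel's DMA_BIT_MASK(n). */
#define TOY_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* True if a device with this mask can address the physical address. */
bool toy_dma_capable(uint64_t addr_phy, uint64_t mask)
{
    /* A zero mask would mean the device cannot do DMA at all. */
    assert(mask != 0);
    return addr_phy <= mask;
}
```

For a 32-bit device, `toy_dma_capable(0xffffffff, TOY_DMA_BIT_MASK(32))` holds, while the first address above 4 GB, 0x100000000, is rejected.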

For example, in the kernel source arch/arm/mm/dma-mapping.c:

static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
             gfp_t gfp, pgprot_t prot, bool is_coherent,
             unsigned long attrs, const void *caller)
{
    u64 mask = get_coherent_dma_mask(dev);
    struct page *page = NULL;
    void *addr;
    bool allowblock, cma;
    struct arm_dma_buffer *buf;
    struct arm_dma_alloc_args args = {
        .dev = dev,
        .size = PAGE_ALIGN(size),
        .gfp = gfp,
        .prot = prot,
        .caller = caller,
        .want_vaddr = ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) == 0),
        .coherent_flag = is_coherent ? COHERENT : NORMAL,
    };

#ifdef CONFIG_DMA_API_DEBUG
    u64 limit = (mask + 1) & ~mask;
    if (limit && size >= limit) {
        dev_warn(dev, "coherent allocation too big (requested %#x mask %#llx)\n",
            size, mask);
        return NULL;
    }
#endif
...
}

limit is the device's maximum addressable range, computed from mask.
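To see what the expression `(mask + 1) & ~mask` yields, it can be evaluated in isolation. The sketch below reuses the expression and the size check from the excerpt above; the function names are hypothetical, and the masks in the example are assumed to be contiguous low-bit masks, which is the usual case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * For a mask of contiguous low bits this is mask + 1, the size of the
 * addressable range. For an all-ones 64-bit mask, mask + 1 wraps to 0,
 * which the caller treats as "no limit".
 */
uint64_t toy_coherent_limit(uint64_t mask)
{
    return (mask + 1) & ~mask;
}

/* Mirrors the debug check: reject requests at least as large as the limit. */
bool toy_alloc_too_big(uint64_t mask, size_t size)
{
    uint64_t limit = toy_coherent_limit(mask);

    return limit && size >= limit;
}

/* Usage example: a 24-bit mask gives a 16 MiB limit. */
void toy_limit_example(void)
{
    assert(toy_coherent_limit(0x00ffffffULL) == 0x1000000ULL);
    assert(toy_alloc_too_big(0x00ffffffULL, 0x1000000));
}
```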

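Is memory allocated by dma_alloc_coherent always inside the DMA zone?

Where dma_alloc_coherent() gets its memory has nothing to do with the dma_ prefix in its name; it depends on which DMA hardware is involved. In the vast majority of cases the memory is not in the DMA zone, as the following code shows.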

dma_alloc_coherent -> dma_alloc_attrs

static inline void *dma_alloc_attrs(struct device *dev, size_t size,
                       dma_addr_t *dma_handle, gfp_t flag,
                       unsigned long attrs)
{
    struct dma_map_ops *ops = get_dma_ops(dev);
    void *cpu_addr;

    BUG_ON(!ops);

    if (dma_alloc_from_coherent(dev, size, dma_handle, &cpu_addr))
        return cpu_addr;

    if (!arch_dma_alloc_attrs(&dev, &flag))
        return NULL;
    if (!ops->alloc)
        return NULL;

    cpu_addr = ops->alloc(dev, size, dma_handle, flag, attrs);
    debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr);
    return cpu_addr;
}

dma_alloc_attrs first tries dma_alloc_from_coherent to allocate from the device's own DMA memory; if there is none, it allocates through ops->alloc. On arm:

static struct dma_map_ops *arm_get_dma_map_ops(bool coherent)
{
    return coherent ? &arm_coherent_dma_ops : &arm_dma_ops;
}

struct dma_map_ops arm_coherent_dma_ops = {
    .alloc          = arm_coherent_dma_alloc,
    .free           = arm_coherent_dma_free,
    .mmap           = arm_coherent_dma_mmap,
    .get_sgtable        = arm_dma_get_sgtable,
    .map_page       = arm_coherent_dma_map_page,
    .map_sg         = arm_dma_map_sg,
};
EXPORT_SYMBOL(arm_coherent_dma_ops);

static void *arm_coherent_dma_alloc(struct device *dev, size_t size,
    dma_addr_t *handle, gfp_t gfp, unsigned long attrs)
{
    return __dma_alloc(dev, size, handle, gfp, PAGE_KERNEL, true,
               attrs, __builtin_return_address(0));
}

static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
             gfp_t gfp, pgprot_t prot, bool is_coherent,
             unsigned long attrs, const void *caller)
{
    u64 mask = get_coherent_dma_mask(dev);
    struct page *page = NULL;
    void *addr;
    bool allowblock, cma;
    struct arm_dma_buffer *buf;
    struct arm_dma_alloc_args args = {
        .dev = dev,
        .size = PAGE_ALIGN(size),
        .gfp = gfp,
        .prot = prot,
        .caller = caller,
        .want_vaddr = ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) == 0),
        .coherent_flag = is_coherent ? COHERENT : NORMAL,
    };

#ifdef CONFIG_DMA_API_DEBUG
    u64 limit = (mask + 1) & ~mask;
    if (limit && size >= limit) {
        dev_warn(dev, "coherent allocation too big (requested %#x mask %#llx)\n",
            size, mask);
        return NULL;
    }
#endif

    if (!mask)
        return NULL;

    buf = kzalloc(sizeof(*buf),
              gfp & ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM));
    if (!buf)
        return NULL;

    if (mask < 0xffffffffULL)
        gfp |= GFP_DMA;

    /*
     * Following is a work-around (a.k.a. hack) to prevent pages
     * with __GFP_COMP being passed to split_page() which cannot
     * handle them.  The real problem is that this flag probably
     * should be 0 on ARM as it is not supported on this
     * platform; see CONFIG_HUGETLBFS.
     */
    gfp &= ~(__GFP_COMP);
    args.gfp = gfp;

    *handle = DMA_ERROR_CODE;
    allowblock = gfpflags_allow_blocking(gfp); // gfp
    cma = allowblock ? dev_get_cma_area(dev) : false;
    /* choose an allocator depending on cma, coherency and allowblock */
    if (cma)
        buf->allocator = &cma_allocator;
    else if (nommu() || is_coherent)
        buf->allocator = &simple_allocator;
    else if (allowblock)
        buf->allocator = &remap_allocator;
    else
        buf->allocator = &pool_allocator;

    addr = buf->allocator->alloc(&args, &page);

    if (page) {
        unsigned long flags;

        *handle = pfn_to_dma(dev, page_to_pfn(page));
        buf->virt = args.want_vaddr ? addr : page;

        spin_lock_irqsave(&arm_dma_bufs_lock, flags);
        list_add(&buf->list, &arm_dma_bufs);
        spin_unlock_irqrestore(&arm_dma_bufs_lock, flags);
    } else {
        kfree(buf);
    }

    return args.want_vaddr ? addr : page;
}

&pool_allocator allocates from the DMA pool, which is created by atomic_pool_init.
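The allocator choice made in __dma_alloc above can be restated as a small pure function, which makes the priority order easier to read. This is a sketch: the enum labels and the boolean `has_cma_area` (standing in for a non-NULL dev_get_cma_area(dev)) are hypothetical simplifications of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Labels for the four allocators named in __dma_alloc. */
enum toy_allocator { TOY_CMA, TOY_SIMPLE, TOY_REMAP, TOY_POOL };

/* CMA is only considered when the caller is allowed to block. */
enum toy_allocator toy_pick_allocator(bool allowblock, bool has_cma_area,
                                      bool nommu, bool is_coherent)
{
    bool cma = allowblock ? has_cma_area : false;

    if (cma)
        return TOY_CMA;
    if (nommu || is_coherent)
        return TOY_SIMPLE;
    if (allowblock)
        return TOY_REMAP;
    return TOY_POOL;
}

/*
 * Usage example: an atomic (non-blocking) request from a non-coherent
 * device on a system with an MMU ends up in the atomic pool.
 */
void toy_pick_allocator_example(void)
{
    assert(toy_pick_allocator(false, true, false, false) == TOY_POOL);
}
```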

Code fragment:

   if (mask < 0xffffffffULL)
        gfp |= GFP_DMA;

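Here GFP_DMA is set only when the mask is below 0xffffffff, directing the kernel to allocate from the DMA zone. When the mask covers the full 4 GB, memory obtained from dma_alloc_coherent() does not have to come from the DMA zone.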
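The effect of this check for a few typical masks can be tried out in user space. In the sketch below `TOY_GFP_DMA` is a made-up flag value standing in for the kernel's GFP_DMA, and `toy_adjust_gfp` is a hypothetical name; only the comparison itself comes from the fragment above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for the kernel's GFP_DMA. */
#define TOY_GFP_DMA 0x1u

/* Add the DMA-zone flag only for devices that cannot reach a full 4 GB. */
unsigned int toy_adjust_gfp(unsigned int gfp, uint64_t coherent_mask)
{
    if (coherent_mask < 0xffffffffULL)
        gfp |= TOY_GFP_DMA;
    return gfp;
}

/* Usage example: a 24-bit device gets the flag, a 32-bit device does not. */
void toy_adjust_gfp_example(void)
{
    assert(toy_adjust_gfp(0, 0x00ffffffULL) & TOY_GFP_DMA);
    assert(!(toy_adjust_gfp(0, 0xffffffffULL) & TOY_GFP_DMA));
}
```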

static void *pool_allocator_alloc(struct arm_dma_alloc_args *args,
                  struct page **ret_page)
{
    return __alloc_from_pool(args->size, ret_page);
}

static void pool_allocator_free(struct arm_dma_free_args *args)
{
    __free_from_pool(args->cpu_addr, args->size);
}

static struct arm_dma_allocator pool_allocator = {
    .alloc = pool_allocator_alloc,
    .free = pool_allocator_free,
};

static void *__alloc_from_pool(size_t size, struct page **ret_page)
{
    unsigned long val;
    void *ptr = NULL;

    if (!atomic_pool) {
        WARN(1, "coherent pool not initialised!\n");
        return NULL;
    }

    val = gen_pool_alloc(atomic_pool, size);
    if (val) {
        phys_addr_t phys = gen_pool_virt_to_phys(atomic_pool, val);

        *ret_page = phys_to_page(phys);
        ptr = (void *)val;
    }

    return ptr;
}

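Is memory allocated by dma_alloc_coherent() uncached?

By default, memory allocated by dma_alloc_coherent() is configured as uncached. However, a modern SoC may override the kernel's generic implementation so that memory from dma_alloc_coherent() can be cached as well.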

static struct dma_map_ops *arm_get_dma_map_ops(bool coherent)
{
    return coherent ? &arm_coherent_dma_ops : &arm_dma_ops;
}

struct dma_map_ops arm_coherent_dma_ops = {
    .alloc          = arm_coherent_dma_alloc,
    .free           = arm_coherent_dma_free,
    .mmap           = arm_coherent_dma_mmap,
    .get_sgtable        = arm_dma_get_sgtable,
    .map_page       = arm_coherent_dma_map_page,
    .map_sg         = arm_dma_map_sg,
};
EXPORT_SYMBOL(arm_coherent_dma_ops);

static int macb_alloc_consistent(struct macb *bp)
{
    struct macb_queue *queue;
    unsigned int q;
    int size;

    for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
        size = TX_RING_BYTES(bp) + bp->tx_bd_rd_prefetch;
        queue->tx_ring = dma_alloc_coherent(&bp->pdev->dev, size,
                            &queue->tx_ring_dma,
                            GFP_KERNEL);
        if (!queue->tx_ring)
            goto out_err;
        netdev_dbg(bp->dev,
               "Allocated TX ring for queue %u of %d bytes at %08lx (mapped %p)\n",
               q, size, (unsigned long)queue->tx_ring_dma,
               queue->tx_ring);

        size = bp->tx_ring_size * sizeof(struct macb_tx_skb);
        queue->tx_skb = kmalloc(size, GFP_KERNEL);
        if (!queue->tx_skb)
            goto out_err;

        size = RX_RING_BYTES(bp) + bp->rx_bd_rd_prefetch;
        queue->rx_ring = dma_alloc_coherent(&bp->pdev->dev, size,
                         &queue->rx_ring_dma, GFP_KERNEL);
        if (!queue->rx_ring)
            goto out_err;
        netdev_dbg(bp->dev,
               "Allocated RX ring of %d bytes at %08lx (mapped %p)\n",
               size, (unsigned long)queue->rx_ring_dma, queue->rx_ring);
    }
    if (bp->macbgem_ops.mog_alloc_rx_buffers(bp))
        goto out_err;

    return 0;

out_err:
    macb_free_consistent(bp);
    return -ENOMEM;
}

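On the arm platform, dma_alloc_coherent clears both the C (Cacheable) and B (Bufferable) fields in the page table entry,
while dma_alloc_writecombine clears only the C (Cacheable) field.

C indicates whether the cache is used, and B indicates whether the write buffer is used.

So memory from dma_alloc_writecombine bypasses the cache but still uses the write buffer, whereas memory from dma_alloc_coherent uses neither.
Meaning of the C and B bits:
C B  Meaning
0 0  No cache, no write buffer. Every memory read and write goes to the bus, and the CPU has to wait for each memory operation.
0 1  No cache, write buffer. Reads go directly to the bus. On a write, the CPU places the data in the write buffer and continues; the write buffer performs the write to memory.
1 0  Cache, write-through. Reads check for a cache hit first. Writes go straight to the write buffer, and on a cache hit the cache is updated as well.
1 1  Cache, write-back. Reads check for a cache hit first, and so do writes.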
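The table above is a two-bit lookup, which can be written down directly. The enum and function names in this sketch are hypothetical; by the description in this section, dma_alloc_coherent memory corresponds to C=0, B=0 and dma_alloc_writecombine memory to C=0, B=1.

```c
#include <assert.h>

/* The four memory types, numbered so that the value is (C << 1) | B. */
enum toy_mem_type {
    TOY_UNCACHED_UNBUFFERED, /* C=0 B=0 */
    TOY_UNCACHED_BUFFERED,   /* C=0 B=1 */
    TOY_WRITE_THROUGH,       /* C=1 B=0 */
    TOY_WRITE_BACK           /* C=1 B=1 */
};

/* Decode the C and B bits of a page table entry into a memory type. */
enum toy_mem_type toy_decode_cb(unsigned int c, unsigned int b)
{
    assert(c <= 1 && b <= 1);
    return (enum toy_mem_type)((c << 1) | b);
}

/* Short description matching the table rows. */
const char *toy_mem_type_name(enum toy_mem_type t)
{
    static const char *const names[] = {
        "no cache, no write buffer",
        "no cache, write buffer",
        "cache, write-through",
        "cache, write-back",
    };
    return names[t];
}
```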

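Write-back is the most efficient, followed by write-through, then write buffer only; uncached (coherent) access is the slowest.

In fact, the write buffer is itself a very simple kind of cache. Why?

DDR is read and written in bursts: each access actually transfers one burst length on the bus, and that length usually equals one cache line.

With a 32-byte cache line, reading even a single byte transfers 32 bytes and throws 31 of them away.

The write buffer works in units of a cache line, so writes become much more efficient.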
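The burst arithmetic above is easy to check with a short calculation. The sketch assumes the request starts at a burst-aligned address and that every bus transfer is a whole burst; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes actually moved on the bus for a request of n bytes. */
size_t toy_bus_bytes(size_t n, size_t burst_len)
{
    assert(burst_len > 0);
    return (n + burst_len - 1) / burst_len * burst_len;
}

/* Bytes transferred but not asked for. */
size_t toy_wasted_bytes(size_t n, size_t burst_len)
{
    return toy_bus_bytes(n, burst_len) - n;
}
```

With a 32-byte burst, `toy_bus_bytes(1, 32)` is 32 and `toy_wasted_bytes(1, 32)` is 31, matching the figures in the text. Collecting 32 single-byte writes into one line first means one burst instead of 32, which is the gain the write buffer provides.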

