Physical Memory Management in Linux (2): Managing slab Caches

Reposted on 2017-03-21 10:25. Categories: Linux kernel source analysis / memory management

2017-03-02

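The previous article, Physical Memory Management in Linux, gave a general introduction to the SLAB mechanism and to the structures that manage slab objects, but it did not cover the allocation of small memory blocks. This section focuses on how slab manages small memory blocks.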

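The kernel organizes caches of different sizes in a global array of kmem_cache pointers, kmalloc_caches. Each cache is described by a kmem_cache structure. For the lookup, request sizes are handled in 8-byte steps: a request of fewer than 8 bytes counts as 8 bytes, and larger requests are rounded up in the same way before being mapped to a cache. The kernel has two ways of obtaining the order (the index into kmalloc_caches) for a given size: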

1. A size_index array stores the order for each block size. This array is used when size is at most 192.

static s8 size_index[24] = {
    3,    /* 8 */
    4,    /* 16 */
    5,    /* 24 */
    5,    /* 32 */
    6,    /* 40 */
    6,    /* 48 */
    6,    /* 56 */
    6,    /* 64 */
    1,    /* 72 */
    1,    /* 80 */
    1,    /* 88 */
    1,    /* 96 */
    7,    /* 104 */
    7,    /* 112 */
    7,    /* 120 */
    7,    /* 128 */
    2,    /* 136 */
    2,    /* 144 */
    2,    /* 152 */
    2,    /* 160 */
    2,    /* 168 */
    2,    /* 176 */
    2,    /* 184 */
    2    /* 192 */
};

2. When size is greater than 192, the fls function is used.

[Figure: overall allocation architecture for small memory blocks]

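Given the requested size, size_index_elem(size) yields the position in size_index, and the value stored at that position is the index into kmalloc_caches. As described in the previous article, block sizes are computed as powers of two, so an index n in kmalloc_caches generally stands for a block size of 2^n bytes. Judging from the table above, indices 1 and 2 are the exceptions: they serve the 96-byte and 192-byte blocks.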
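The two lookup paths can be tried out in user space. The following is a minimal sketch, not kernel code: every name ending in `_sketch` is hypothetical, `fls_sketch` is a portable stand-in for the kernel's fls, and the 96/192-byte mapping of indices 1 and 2 is read off the table comments above (it may differ on configurations with a larger minimum kmalloc size).

```c
#include <assert.h>
#include <stddef.h>

/* Same values as size_index[] above: one entry per 8-byte step up to 192. */
static const signed char size_index_sketch[24] = {
    3, 4, 5, 5, 6, 6, 6, 6,   /*   8 ..  64 */
    1, 1, 1, 1, 7, 7, 7, 7,   /*  72 .. 128 */
    2, 2, 2, 2, 2, 2, 2, 2    /* 136 .. 192 */
};

/* Stand-in for fls(): 1-based position of the highest set bit, 0 for x == 0. */
static int fls_sketch(unsigned int x)
{
    int bit = 0;

    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Map a non-zero request size to an index into kmalloc_caches. */
static int kmalloc_index_sketch(size_t size)
{
    assert(size > 0);
    if (size <= 192)
        return size_index_sketch[(size - 1) / 8];  /* table lookup */
    return fls_sketch((unsigned int)(size - 1));   /* next power of two */
}

/* Block size served by an index: 96 and 192 are the two
 * non-power-of-two caches, everything else is 2^index bytes. */
static size_t kmalloc_size_sketch(int index)
{
    if (index == 1)
        return 96;
    if (index == 2)
        return 192;
    return (size_t)1 << index;
}
```

For example, a 100-byte request falls into slot (100 - 1) / 8 = 12 of the table, which holds 7, so it would be served from the 128-byte cache; a 193-byte request takes the fls path and lands on index 8, the 256-byte cache.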

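Next we follow the execution flow of the kernel's kmalloc function to see how a small memory block is actually allocated. In the normal case the call reaches __kmalloc, whose body has two parts: finding the kmem_cache that matches size, and taking an object from that cache.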

The former is done by kmalloc_slab, the latter by slab_alloc. We look at both in detail, starting here:

struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
{
    int index;
    /* Return NULL if size exceeds the maximum */

    if (size > KMALLOC_MAX_SIZE) {
        WARN_ON_ONCE(!(flags & __GFP_NOWARN));
        return NULL;
    }

    if (size <= 192) {
        if (!size)
            return ZERO_SIZE_PTR;
        
        index = size_index[size_index_elem(size)];
    } else
        index = fls(size - 1);

#ifdef CONFIG_ZONE_DMA
    if (unlikely((flags & GFP_DMA)))
        return kmalloc_dma_caches[index];

#endif
    return kmalloc_caches[index];
}

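The kmalloc_slab code itself is fairly simple. It first validates size. If size is less than or equal to 192, it reads the order from the size_index array; otherwise it calls fls to compute the order. It then uses this order as the index to fetch the corresponding kmem_cache object. The core of the allocation is in slab_alloc, which calls slab_alloc_node directly: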

static __always_inline void *
slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
           unsigned long caller)
{
    unsigned long save_flags;
    void *ptr;
    int slab_node = numa_mem_id();

    flags &= gfp_allowed_mask;

    lockdep_trace_alloc(flags);

    if (slab_should_failslab(cachep, flags))
        return NULL;

    cachep = memcg_kmem_get_cache(cachep, flags);

    cache_alloc_debugcheck_before(cachep, flags);
    local_irq_save(save_flags);

    if (nodeid == NUMA_NO_NODE)
        nodeid = slab_node;

    if (unlikely(!cachep->node[nodeid])) {
        /* Node not bootstrapped yet */
        ptr = fallback_alloc(cachep, flags);
        goto out;
    }

    if (nodeid == slab_node) {
        /*
         * Use the locally cached objects if possible.
         * However ____cache_alloc does not allow fallback
         * to other nodes. It may fail while we still have
         * objects on other nodes available.
         */
        ptr = ____cache_alloc(cachep, flags);
        if (ptr)
            goto out;
    }
    /* ___cache_alloc_node can fall back to other nodes */
    ptr = ____cache_alloc_node(cachep, flags, nodeid);
  out:
    local_irq_restore(save_flags);
    ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
    kmemleak_alloc_recursive(ptr, cachep->object_size, 1, cachep->flags,
                 flags);

    if (likely(ptr))
        kmemcheck_slab_alloc(cachep, flags, ptr, cachep->object_size);

    if (unlikely((flags & __GFP_ZERO) && ptr))
        memset(ptr, 0, cachep->object_size);

    return ptr;
}

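The function first obtains the node ID, slab_node. If the caller did not specify a node, allocation starts from the current node by default. The middle part consists of fairly involved checks, while the substantive work is done in ____cache_alloc: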

static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
    void *objp;
    struct array_cache *ac;
    bool force_refill = false;

    check_irq_off();
    /* Try the per-CPU cache first */
    ac = cpu_cache_get(cachep);
    if (likely(ac->avail)) {
        ac->touched = 1;
        objp = ac_get_obj(cachep, ac, flags, false);

        /*
         * Allow for the possibility all avail objects are not allowed
         * by the current flags
         */
        if (objp) {
            STATS_INC_ALLOCHIT(cachep);
            goto out;
        }
        force_refill = true;
    }

    STATS_INC_ALLOCMISS(cachep);
    objp = cache_alloc_refill(cachep, flags, force_refill);
    /*
     * the 'ac' may be updated by cache_alloc_refill(),
     * and kmemleak_erase() requires its correct value.
     */
    ac = cpu_cache_get(cachep);

out:
    /*
     * To avoid a false negative, if an object that is in one of the
     * per-CPU caches is leaked, we need to make sure kmemleak doesn't
     * treat the array pointers as a reference to the object.
     */
    if (objp)
        kmemleak_erase(&ac->entry[ac->avail]);
    return objp;
}

As we can see, the function first uses the kmem_cache argument to get the array_cache of the current CPU. The array_cache structure is as follows:

struct array_cache {
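    unsigned int avail;      /* number of available objects */
    unsigned int limit;      /* maximum number of objects the cache may hold */
    unsigned int batchcount; /* number of objects to fetch on a refill */
    unsigned int touched;    /* activity flag */
    spinlock_t lock;
    void *entry[];    /* Mainly used to access the objects that follow.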
             * Must have this definition in here for the proper
             * alignment of array_cache. Also simplifies accessing
             * the entries.
             *
             * Entries should not be directly dereferenced as
             * entries belonging to slabs marked pfmemalloc will
             * have the lower bits set SLAB_OBJ_PFMEMALLOC
             */
};

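This structure is the per-CPU cache: every CPU has its own. avail is the number of objects currently available in the cache, limit is the maximum number of objects it may hold, and batchcount is the number of objects to fill in when the cache is empty. touched is an activity flag that records whether the cache has been used recently, which helps the later shrinking pass. entry is the array of cached objects; it stores object addresses, which here are the addresses of the cached blocks. Note that objects are not handed out from the start of the array but from its end: avail serves as the index into entry, so the array behaves like a stack. It is a neat design.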
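The stack-like use of entry and avail can be shown with a small user-space sketch. The struct and function names below are hypothetical and the capacity of 4 is chosen only for illustration; the real array_cache has more fields and a flexible array member.

```c
#include <assert.h>
#include <stddef.h>

#define AC_LIMIT_SKETCH 4  /* hypothetical capacity, for illustration only */

struct array_cache_sketch {
    unsigned int avail;            /* cached objects; also the first empty slot */
    unsigned int limit;            /* capacity of entry[] */
    void *entry[AC_LIMIT_SKETCH];  /* addresses of cached objects */
};

/* Return an object to the cache by storing it in the first empty slot. */
static int ac_put_sketch(struct array_cache_sketch *ac, void *objp)
{
    assert(ac != NULL);
    if (ac->avail >= ac->limit)
        return -1;                 /* full: the kernel would flush a batch */
    ac->entry[ac->avail++] = objp;
    return 0;
}

/* Take the most recently stored object (last in, first out). */
static void *ac_get_sketch(struct array_cache_sketch *ac)
{
    assert(ac != NULL);
    if (ac->avail == 0)
        return NULL;               /* miss: the kernel would refill here */
    return ac->entry[--ac->avail];
}
```

Handing out the most recently freed object first is a common choice for such caches, since that object is the one most likely to still be in the CPU's hardware cache.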

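Back in ____cache_alloc: if the current CPU's cache has available objects, the function first sets the activity flag and then calls ac_get_obj to take an object from the cache. If no object is available, it calls cache_alloc_refill, which refills the cache and returns an object; cpu_cache_get is then called again because cache_alloc_refill may have updated ac. Once an object has been obtained, kmemleak_erase is called to set the corresponding pointer in the entry array to NULL. So there are two main operations: getting an object and refilling the cache.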

First, getting an object with ac_get_obj:

static inline void *ac_get_obj(struct kmem_cache *cachep,
            struct array_cache *ac, gfp_t flags, bool force_refill)
{
    void *objp;

    if (unlikely(sk_memalloc_socks()))
        objp = __ac_get_obj(cachep, ac, flags, force_refill);
    else
        objp = ac->entry[--ac->avail];

    return objp;
}

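sk_memalloc_socks is not analyzed here; it appears to be related to the pfmemalloc handling mentioned in the comment on entry[] above. As the unlikely hint shows, in most cases the object can be taken straight from entry, so getting an object is simple: avail is used to read one object address from the entry array. Note that avail should point at the first NULL entry.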

If the cache holds no objects, the per-CPU cache has to be refilled. This is the job of cache_alloc_refill:

static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
                            bool force_refill)
{
    int batchcount;
    struct kmem_cache_node *n;
    struct array_cache *ac;
    int node;

    check_irq_off();
    /* Get the ID of the current NUMA node */
    node = numa_mem_id();
    if (unlikely(force_refill))
        goto force_grow;
retry:
    ac = cpu_cache_get(cachep);
    /* Number of objects to refill */
    batchcount = ac->batchcount;
    /* If the cache has seen little recent use and the batch is large, refill fewer objects */
    if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
        /*
         * If there was little recent activity on this cache, then
         * perform only a partial refill.  Otherwise we could generate
         * refill bouncing.
         */
        batchcount = BATCHREFILL_LIMIT;
    }
    /* Get the kmem_cache_node of this node */
    n = cachep->node[node];

    BUG_ON(ac->avail > 0 || !n);
    /* Take the list lock */
    spin_lock(&n->list_lock);

    /* See if we can refill from the shared array */
    if (n->shared && transfer_objects(ac, n->shared, batchcount)) {
        n->shared->touched = 1;
        goto alloc_done;
    }

    while (batchcount > 0) {
        struct list_head *entry;
        struct slab *slabp;
        /* Get slab alloc is to come from. */
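        /* Try the slabs_partial list first */
        entry = n->slabs_partial.next;
        /* If the list of partially used slabs is empty */
        if (entry == &n->slabs_partial) {
            n->free_touched = 1;
            /* Allocate from the free list */
            entry = n->slabs_free.next;
            /* If the free list is empty as well, the cache must grow */
            if (entry == &n->slabs_free)
                goto must_grow;
        }
        /* Locate the slab from its list entry */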
        slabp = list_entry(entry, struct slab, list);
        check_slabp(cachep, slabp);
        check_spinlock_acquired(cachep);

        /*
         * The slab was either on partial or free list so
         * there must be at least one object available for
         * allocation.
         */
        BUG_ON(slabp->inuse >= cachep->num);
        /* Take objects from the slab in a loop */
        while (slabp->inuse < cachep->num && batchcount--) {
            STATS_INC_ALLOCED(cachep);
            STATS_INC_ACTIVE(cachep);
            STATS_SET_HIGH(cachep);
            /* Fill the per-CPU cache */
            ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
                                    node));
        }
        check_slabp(cachep, slabp);

        /* move slabp to correct slabp list: */
        /* Remove the slab from its current list */
        list_del(&slabp->list);
        /* Add it to the list that matches its state */
        if (slabp->free == BUFCTL_END)
            list_add(&slabp->list, &n->slabs_full);
        else
            list_add(&slabp->list, &n->slabs_partial);
    }

must_grow:
    n->free_objects -= ac->avail;
alloc_done:
    spin_unlock(&n->list_lock);

    if (unlikely(!ac->avail)) {
        int x;
force_grow:
        x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);

        /* cache_grow can reenable interrupts, then ac could change. */
        ac = cpu_cache_get(cachep);
        node = numa_mem_id();

        /* no objects in sight? abort */
        if (!x && (ac->avail == 0 || force_refill))
            return NULL;

        if (!ac->avail)        /* objects refilled by interrupt? */
            goto retry;
    }
    ac->touched = 1;

    return ac_get_obj(cachep, ac, flags, force_refill);
}

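The code is not hard to follow. It first gets the ID of the current NUMA node, which is later used to obtain the corresponding kmem_cache_node structure; that structure holds the slabs associated with the node. It then gets the current CPU's cache, the array_cache. The touched field tells whether the cache is active: if it is not, and the number of objects to refill is large, the refill count is reduced to BATCHREFILL_LIMIT. Next, the kmem_cache_node is fetched from the node array of the kmem_cache using the NUMA node ID. The function first tries to refill from the shared array, which effectively adds another layer of caching. If the shared array is NULL, or nothing could be transferred from it, the regular path is taken: allocating from the slab lists, in the while loop.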
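The batch sizing and the refill from the shared array can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: the `_sketch` names and the fixed capacity are hypothetical, the value 16 for the limit is assumed, and the transfer helper only approximates what transfer_objects is described as doing (moving a batch of pointers from one array to another).

```c
#include <assert.h>
#include <string.h>

#define BATCHREFILL_LIMIT_SKETCH 16  /* assumed value, for illustration */
#define AC_CAP_SKETCH 32             /* hypothetical capacity of entry[] */

struct ac_sketch {
    unsigned int avail;              /* objects currently held */
    unsigned int batchcount;         /* preferred refill size */
    unsigned int touched;            /* set when the cache was used recently */
    void *entry[AC_CAP_SKETCH];
};

/* Decide how many objects a refill should fetch. */
static unsigned int refill_count_sketch(const struct ac_sketch *ac)
{
    unsigned int n = ac->batchcount;

    /* An idle cache gets only a partial refill. */
    if (!ac->touched && n > BATCHREFILL_LIMIT_SKETCH)
        n = BATCHREFILL_LIMIT_SKETCH;
    return n;
}

/* Move up to max objects from the top of 'from' onto the top of 'to'.
 * Returns the number of objects moved (0 means the caller must fall
 * back to the slab lists). */
static unsigned int transfer_sketch(struct ac_sketch *to,
                                    struct ac_sketch *from,
                                    unsigned int max)
{
    unsigned int nr = from->avail;

    assert(to->avail <= AC_CAP_SKETCH);
    if (nr > max)
        nr = max;
    if (nr > AC_CAP_SKETCH - to->avail)
        nr = AC_CAP_SKETCH - to->avail;
    if (nr == 0)
        return 0;

    memcpy(to->entry + to->avail, from->entry + from->avail - nr,
           nr * sizeof(void *));
    from->avail -= nr;
    to->avail += nr;
    return nr;
}
```

With a batchcount of 24, an idle cache would ask for 16 objects and a recently used one for the full 24; a shared array holding 20 objects could satisfy the first request in a single copy.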

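As mentioned earlier, under NUMA each node has three slab lists, holding full, partial and free slabs. The code first tries the list of partially used slabs. If that list is empty, it allocates from the free list. If the free list is empty too, it has to take the must_grow path and request new pages from the buddy system. Otherwise a slab has been obtained, and objects are allocated from it in a second, inner while loop. The key steps there are to take an object from the slab with slab_get_obj and then put it into the CPU's cache array with ac_put_obj. Here are the two functions: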
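The order in which the lists are consulted, and the list a slab returns to afterwards, can be summarized in a few lines. This is a simplified user-space sketch: the enum, the counters and both function names are hypothetical, and the real code walks list_head lists under a spinlock instead of looking at counters.

```c
#include <assert.h>

enum slab_list_sketch { LIST_PARTIAL, LIST_FREE, LIST_FULL, LIST_NONE };

/* Hypothetical per-node bookkeeping: how many slabs sit on each list. */
struct node_counts_sketch {
    unsigned int nr_partial;  /* slabs with some objects in use */
    unsigned int nr_free;     /* slabs with no objects in use */
};

/* Pick the list to allocate from: partial first, then free.
 * LIST_NONE means both are empty and the cache must grow. */
static enum slab_list_sketch pick_list_sketch(const struct node_counts_sketch *n)
{
    if (n->nr_partial > 0)
        return LIST_PARTIAL;
    if (n->nr_free > 0)
        return LIST_FREE;
    return LIST_NONE;
}

/* After objects were taken from a slab, choose the list it belongs on:
 * full if every object is in use, partial otherwise. */
static enum slab_list_sketch relink_target_sketch(unsigned int inuse,
                                                  unsigned int num)
{
    assert(num > 0 && inuse <= num);
    return inuse == num ? LIST_FULL : LIST_PARTIAL;
}
```

Preferring partial slabs keeps completely free slabs untouched for as long as possible, which makes it easier to give their pages back later.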

static void *slab_get_obj(struct kmem_cache *cachep, struct slab *slabp,
                int nodeid)
{
    void *objp = index_to_obj(cachep, slabp, slabp->free);
    kmem_bufctl_t next;

    slabp->inuse++;
    next = slab_bufctl(slabp)[slabp->free];
#if DEBUG
    slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
    WARN_ON(slabp->nodeid != nodeid);
#endif
    slabp->free = next;

    return objp;
}

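It first takes an object from the slab and updates the counter, then looks up the index of the next available object, stores it in the slab's free field, and returns.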
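The index-based free list behind slab_get_obj can be modelled in user space. The sketch below makes several assumptions for illustration: four objects of 32 bytes per slab, a bufctl array embedded in the struct, and hypothetical `_sketch` names; in the kernel the bufctl array and the objects are laid out according to the cache's parameters.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OBJS_SKETCH 4               /* hypothetical objects per slab */
#define OBJ_SIZE_SKETCH 32              /* hypothetical object size */
#define BUFCTL_END_SKETCH (~0U)         /* marks the end of the free chain */

struct bufctl_slab_sketch {
    unsigned int inuse;                       /* objects handed out so far */
    unsigned int free;                        /* index of the first free object */
    unsigned int bufctl[NUM_OBJS_SKETCH];     /* bufctl[i]: next free index after i */
    unsigned char mem[NUM_OBJS_SKETCH * OBJ_SIZE_SKETCH];
};

/* Chain all objects together: 0 -> 1 -> 2 -> 3 -> end. */
static void slab_init_sketch(struct bufctl_slab_sketch *s)
{
    unsigned int i;

    s->inuse = 0;
    s->free = 0;
    for (i = 0; i < NUM_OBJS_SKETCH; i++)
        s->bufctl[i] = i + 1;
    s->bufctl[NUM_OBJS_SKETCH - 1] = BUFCTL_END_SKETCH;
}

/* Hand out the object at index 'free' and advance 'free' along the chain. */
static void *slab_get_obj_sketch(struct bufctl_slab_sketch *s)
{
    void *objp;

    if (s->free == BUFCTL_END_SKETCH)
        return NULL;                          /* no free object left */
    objp = s->mem + (size_t)s->free * OBJ_SIZE_SKETCH;
    s->inuse++;
    s->free = s->bufctl[s->free];
    return objp;
}

/* A slab whose free index reached the end marker is full. */
static int slab_is_full_sketch(const struct bufctl_slab_sketch *s)
{
    return s->free == BUFCTL_END_SKETCH;
}
```

Because the chain stores indices instead of pointers, the bookkeeping costs one small integer per object, and the object address is recovered by multiplying the index by the object size.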

static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
                                void *objp)
{
    if (unlikely(sk_memalloc_socks()))
        objp = __ac_put_obj(cachep, ac, objp);

    ac->entry[ac->avail++] = objp;
}

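The put function is even simpler: it just stores the object address in the entry array. Here too, avail is the index of the first NULL entry. When the inner while loop ends, the work of refilling the per-CPU cache is done. What remains is to take the slab off its current list: if all of its objects have been used up it is added to the full list, otherwise to the slabs_partial list.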

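Finally, ac_get_obj takes one object from the entry array of the current CPU's cache and returns it. At the very end, the slot in entry that held the allocated object is set to NULL. With that, the allocation of a small cache block is complete.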

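Summary:

Small cache blocks are managed in essentially the same way as ordinary objects: both are stored as objects. The difference is that the size of ordinary objects is not fixed in advance, whereas the sizes of the small cache blocks are set by the system at initialization; and caches for ordinary objects can be created by the user, whereas the small-block caches are maintained by the system. The allocation flow, storage layout and so on all use the same API.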
