Linux Source Code Walkthrough (25): How mmap Works and How It Is Implemented


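  As everyone knows, the Linux philosophy is that everything is a file, so file operations are everywhere. The common ones, such as open, read and write, are familiar to all of us. Besides these there is a less common one: mmap. In the file_operations struct it appears as the mmap member, a function pointer that takes a struct file * and a struct vm_area_struct * (it is called later in this article as file->f_op->mmap(file, vma)). So what does this function do?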

      

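      1. For reading and writing files, the classic API works like this: open the file to get its fd, then call read or write. The file lives on disk, and a ring-3 app has no privilege to touch the disk directly, so it has to enter the kernel through a system call, and the kernel reads or writes the disk through the installed driver. As a result the data ends up in both kernel space and user space: the same data sits in two places in memory and is copied twice (from disk into a kernel buffer, then from the kernel buffer into the user buffer). The whole process "burns both diesel and the motor", that is, it costs both space and time.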
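The traditional path described above can be sketched as follows. This is a minimal illustration rather than code from the kernel or from the original article, and the helper name read_via_syscall is hypothetical. Each read call enters the kernel, which copies the data it holds into the caller's buffer; that is the second copy.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to cap bytes of a file with open/read/close.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_via_syscall(const char *path, char *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < cap) {
        /* Each read() is a system call; the kernel copies data into buf. */
        ssize_t n = read(fd, buf + total, cap - total);
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0) /* end of file */
            break;
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}
```

A caller passes a path and a buffer it owns, for example read_via_syscall("hello", buf, sizeof buf), and gets back the number of bytes that were copied into that buffer.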

    

    This wastes both memory and copy time. How can it be optimized?

   2. The crux of the problem above is that the same data is copied twice. Can it be copied only once? Yes, and that is exactly what mmap does!

  (1) First, look at an mmap example to get a feel for how it is used:

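#include <stdio.h>      /* perror */
#include <stdlib.h>     /* exit */
#include <fcntl.h>      /* open */
#include <unistd.h>     /* close */
#include <sys/mman.h>   /* mmap, munmap */

int main(void)
{
    int *p;
    /* "hello" must already exist and must not be empty: touching a mapped
       page that lies entirely beyond end of file raises SIGBUS */
    int fd = open("hello", O_RDWR);
    if (fd < 0)
    {
        perror("open hello");
        exit(1);
    }
    p = mmap(NULL, 6, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
    {
        perror("mmap"); /* getting here means mmap failed */
        exit(1);
    }
    close(fd);          /* the mapping stays valid after the fd is closed */
    p[0] = 0x30313233;  /* little-endian on x86: the file now starts with "3210" */
    munmap(p, 6);
    return 0;
}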

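  Simple, isn't it? First call open to get the file's fd, then call mmap to map the file into memory. mmap returns p, the address of the mapping, and from then on the file's data is read and written through the pointer p. The logic is simple enough for any programmer to follow. So how is mmap, which is this easy to use and also more efficient (only one copy), implemented?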

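  (2) First the principle. The reason mmap copies only once is simple: it reserves a range in the process's virtual address space, associates that range with the file, and maps it onto the physical pages in which the kernel holds the file's contents (the page cache). The contents are read from disk into those pages, normally on first access rather than at mmap time. After that, the ring-3 app no longer issues read or write calls; it reads and writes the mapped memory directly. For a shared mapping the kernel writes modified pages back to the file on disk: in the background, when msync is called, and in any case after the process exits or crashes.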
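The principle above can be tried out with a small sketch: modify a file through a shared mapping with an ordinary store, then confirm the change with a normal read. The helper name patch_byte_via_mmap is hypothetical, and the msync call is only there to request write-back immediately instead of waiting for the kernel to do it later.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite one byte of a file through a shared mapping.
 * Returns 0 on success, -1 on error. */
static int patch_byte_via_mmap(const char *path, off_t pos, char value)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || pos < 0 || pos >= st.st_size) {
        close(fd);
        return -1;
    }

    /* Map the whole file; offset 0 is always page aligned. */
    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    p[pos] = value;                  /* a plain store, no write() call */
    int rc = msync(p, len, MS_SYNC); /* ask for write-back right now */
    munmap(p, len);
    return rc;
}
```

After patch_byte_via_mmap("hello", 2, 'X') succeeds, a later open and read of the same file sees 'X' at offset 2, because the mapping and the read path share the same kernel pages.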

      

       The principle of mmap is not complicated. How is it done at the code level?

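    (3) As the demo shows, a ring-3 application calls the mmap function directly. Since this involves file and disk I/O it clearly needs operating system support, so mmap has to enter the kernel through a system call. Inside the kernel the system call ends up in do_mmap, in mm/mmap.c: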

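/*
 * The caller must hold down_write(&current->mm->mmap_sem).
 * Runs a series of checks on the user-supplied arguments, then derives the
 * vm_flags of the vm_area_struct from them. The file is tied to the vma
 * later, in mmap_region(), by vma->vm_file = get_file(file).
 */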
unsigned long do_mmap(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot,
            unsigned long flags, vm_flags_t vm_flags,
            unsigned long pgoff, unsigned long *populate)
{
    struct mm_struct *mm = current->mm; /* memory descriptor of the current process */
    int pkey = 0;

    *populate = 0;

    if (!len)
        return -EINVAL;

    /*
     * Does the application expect PROT_READ to imply PROT_EXEC?
     *
     * (the exception is when the underlying filesystem is noexec
     *  mounted, in which case we dont add PROT_EXEC.)
     */
    if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
        if (!(file && path_noexec(&file->f_path)))
            prot |= PROT_EXEC;
    /* Without MAP_FIXED the kernel is free to adjust addr: a hint below
       mmap_min_addr is raised to the page-aligned mmap_min_addr */
    if (!(flags & MAP_FIXED))
        addr = round_hint_to_min(addr);

    /* Careful about overflows.. */
    /* Round len up to a page multiple, since mappings are page granular;
       if len wraps to 0 here, the request overflowed */
    len = PAGE_ALIGN(len);
    if (!len)
        return -ENOMEM;

    /* offset overflow? */
    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        return -EOVERFLOW;

    /* Too many mappings? */
    /* Has this process already exceeded the limit on the number of VMAs? */
    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

    /* Obtain the address to map to. we verify (or select) it and ensure
     * that it represents a valid section of the address space.
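     Get the start address of an unmapped range in the current process's
     user address space; this is where the red-black tree of VMAs comes in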
     */
    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    if (offset_in_page(addr)) /* not page aligned: addr holds an error code, return it */
        return addr;

    if (prot == PROT_EXEC) {
        pkey = execute_only_pkey(mm);
        if (pkey < 0)
            pkey = 0;
    }

    /* Do simple checking here so the lower-level routines won't have
     * to. we assume access permissions have been handled by the open
     * of the memory object, so we don't do any here.
     */
    vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
    /* With MAP_LOCKED the range is locked in memory, much like mlock();
       check whether this caller is allowed to lock */
    if (flags & MAP_LOCKED)
        if (!can_do_mlock())
            return -EPERM;

    if (mlock_future_check(mm, vm_flags, len))
        return -EAGAIN;

    if (file) {
        struct inode *inode = file_inode(file);
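        /* Take the file's access mode into account according to the mapping type.
           For a shared writable mapping, check that the file was opened for
           writing and is not an append-only file, and that no mandatory lock
           is held on it.
           For every kind of mapping, check that the file was opened for reading.
        */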

        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
                return -EACCES;

            /*
             * Make sure we don't allow writing to an append-only
             * file..
             */
            if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
                return -EACCES;

            /*
             * Make sure there are no mandatory locks on the file.
             */
            if (locks_verify_locked(file))
                return -EAGAIN;

            vm_flags |= VM_SHARED | VM_MAYSHARE;
            if (!(file->f_mode & FMODE_WRITE))
                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

            /* fall through */
        case MAP_PRIVATE:
            if (!(file->f_mode & FMODE_READ))
                return -EACCES;
            if (path_noexec(&file->f_path)) {
                if (vm_flags & VM_EXEC)
                    return -EPERM;
                vm_flags &= ~VM_MAYEXEC;
            }

            if (!file->f_op->mmap)
                return -ENODEV;
            if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                return -EINVAL;
            break;

        default:
            return -EINVAL;
        }
    } else {
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                return -EINVAL;
            /*
             * Ignore pgoff.
             */
            pgoff = 0;
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            break;
        case MAP_PRIVATE:
            /*
             * Set pgoff according to addr for anon_vma.
             */
            pgoff = addr >> PAGE_SHIFT;
            break;
        default:
            return -EINVAL;
        }
    }

    /*
     * Set 'VM_NORESERVE' if we should not account for the
     * memory use of this mapping.
     */
    if (flags & MAP_NORESERVE) {
        /* We honor MAP_NORESERVE if allowed to overcommit */
        if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
            vm_flags |= VM_NORESERVE;

        /* hugetlb applies strict overcommit unless MAP_NORESERVE */
        if (file && is_file_hugepages(file))
            vm_flags |= VM_NORESERVE;
    }
    /* Create and initialise the VMA and link it into the red-black tree */
    addr = mmap_region(file, addr, len, vm_flags, pgoff);
    if (!IS_ERR_VALUE(addr) &&
        ((vm_flags & VM_LOCKED) ||
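        /*
        Without MAP_POPULATE the kernel does not allocate physical memory for the
        process at mmap() time. The first access to an address whose page is not
        in physical memory triggers a page fault, which brings the missing page
        in.
        */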
         (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
        *populate = len;
    return addr;
}
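Several of the checks in do_mmap above can be observed from user space through the errno that mmap reports. The sketch below is an illustration under the assumption of a Linux system; the helper name try_mmap is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: attempt a mapping and report the errno it fails with.
 * Returns 0 if the mapping succeeded (it is unmapped again right away). */
static int try_mmap(int fd, size_t len, int prot, int flags)
{
    void *p = mmap(NULL, len, prot, flags, fd, 0);
    if (p == MAP_FAILED)
        return errno;

    int rc = munmap(p, len);
    assert(rc == 0); /* unmapping a range we just mapped should not fail */
    (void)rc;
    return 0;
}
```

With a descriptor ro opened O_RDONLY and a descriptor wo opened O_WRONLY, one would expect on Linux: try_mmap(ro, 0, PROT_READ, MAP_PRIVATE) gives EINVAL (the !len check); try_mmap(ro, 4096, PROT_READ | PROT_WRITE, MAP_SHARED) gives EACCES (the FMODE_WRITE check), while the same call with MAP_PRIVATE succeeds; try_mmap(wo, 4096, PROT_READ, MAP_PRIVATE) gives EACCES (the FMODE_READ check); and flags containing neither MAP_SHARED nor MAP_PRIVATE give EINVAL (the default branch of the switch).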

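  There is a lot of code, but the core job is not complicated: find a free range of virtual addresses, and set the vm flags according to how the file was opened and how the mapping was requested. At the end, the function calls mmap_region, whose core job is to create and initialise the VMA and link it into the red-black tree: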

/* Create and initialise the VMA and link it into the red-black tree */
unsigned long mmap_region(struct file *file, unsigned long addr,
        unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma, *prev;
    int error;
    struct rb_node **rb_link, *rb_parent;
    unsigned long charged = 0;

    /* Check against address space limit.
       Would the requested range push the process over its limit? */
    if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
        unsigned long nr_pages;

        /*
         * MAP_FIXED may remove pages of mappings that intersects with
         * requested mapping. Account for the pages it would unmap.
         */
        nr_pages = count_vma_pages_range(mm, addr, addr + len);

        if (!may_expand_vm(mm, vm_flags,
                    (len >> PAGE_SHIFT) - nr_pages))
            return -ENOMEM;
    }

    /* Clear old maps
       If [addr, addr+len) overlaps existing mappings, unmap them first */
    while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
                  &rb_parent)) {
        if (do_munmap(mm, addr, len))
            return -ENOMEM;
    }

    /*
     * Private writable mapping: check memory availability
     */
    if (accountable_mapping(file, vm_flags)) {
        charged = len >> PAGE_SHIFT;
        if (security_vm_enough_memory_mm(mm, charged))
            return -ENOMEM;
        vm_flags |= VM_ACCOUNT;
    }

    /*
     * Can we just expand an old mapping?
     Check whether [addr, addr+len) can be merged into a neighbouring vma
     */
    vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
            NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
    if (vma) /* merged: use the merged vma and jump to out */
        goto out;

    /*
     * Determine the object being mapped and call the appropriate
     * specific mapper. the address has already been validated, but
     * not unmapped, but the maps are removed from the list.
     If the range cannot be merged with an existing vma, allocate a new vma
     from the vm_area_cachep slab cache
     */
    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    if (!vma) {
        error = -ENOMEM;
        goto unacct_error;
    }

    vma->vm_mm = mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags;
    vma->vm_page_prot = vm_get_page_prot(vm_flags);
    vma->vm_pgoff = pgoff;
    INIT_LIST_HEAD(&vma->anon_vma_chain); /* initialise the list head that links this vma to its anon_vma structures */
    
    /* File-backed mapping */
    if (file) {
         /* The mapping forbids writes to the file: deny_write_access(file) blocks ordinary writers */
        if (vm_flags & VM_DENYWRITE) {
            error = deny_write_access(file);
            if (error)
                goto free_vma;
        }
        if (vm_flags & VM_SHARED) { /* shared mapping, visible to other processes: count it as a writable mapping of the file */
            error = mapping_map_writable(file->f_mapping);
            if (error)
                goto allow_write_and_free_vma;
        }

        /* ->mmap() can change vma->vm_file, but must guarantee that
         * vma_link() below can deny write-access if VM_DENYWRITE is set
         * and map writably if VM_SHARED is set. This usually means the
         * new file must not have been exposed to user-space, yet.
         */
        vma->vm_file = get_file(file); /* take a reference on the file and store it in the vma */
        error = file->f_op->mmap(file, vma); /* call the mmap handler supplied by the filesystem or driver */
        if (error)
            goto unmap_and_free_vma;

        /* Can addr have changed??
         *
         * Answer: Yes, several device drivers can do it in their
         *         f_op->mmap method. -DaveM
         * Bug: If addr is changed, prev, rb_link, rb_parent should
         *      be updated for vma_link()
         */
        WARN_ON_ONCE(addr != vma->vm_start);

        addr = vma->vm_start;
        vm_flags = vma->vm_flags;
        /* VM_SHARED without a backing file: call shmem_zero_setup(), which
           backs the mapping with a shared-memory file, as if /dev/zero had been mapped
        */
    } else if (vm_flags & VM_SHARED) {
        error = shmem_zero_setup(vma);
        if (error)
            goto free_vma;
    }
    /* Link the newly allocated vma into the red-black tree */
    vma_link(mm, vma, prev, rb_link, rb_parent);
    /* Once vma denies write, undo our temporary denial count */
    if (file) {
        if (vm_flags & VM_SHARED)
            mapping_unmap_writable(file->f_mapping);
        if (vm_flags & VM_DENYWRITE)
            allow_write_access(file);
    }
    file = vma->vm_file;
out:
    perf_event_mmap(vma);
    /* Update the accounting of the process's address space (mm) */
    vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
    if (vm_flags & VM_LOCKED) {
        if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
                    vma == get_gate_vma(current->mm)))
            mm->locked_vm += (len >> PAGE_SHIFT);
        else
            vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
    }

    if (file)
        uprobe_mmap(vma);

    /*
     * New (or expanded) vma always get soft dirty status.
     * Otherwise user-space soft-dirty page tracker won't
     * be able to distinguish situation when vma area unmapped,
     * then new mapped in-place (which must be aimed as
     * a completely new data area).
     */
    vma->vm_flags |= VM_SOFTDIRTY;

    vma_set_page_prot(vma);

    return addr;

unmap_and_free_vma:
    vma->vm_file = NULL;
    fput(file);

    /* Undo any partial mapping done by a device driver. */
    unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
    charged = 0;
    if (vm_flags & VM_SHARED)
        mapping_unmap_writable(file->f_mapping);
allow_write_and_free_vma:
    if (vm_flags & VM_DENYWRITE)
        allow_write_access(file);
free_vma:
    kmem_cache_free(vm_area_cachep, vma);
unacct_error:
    if (charged)
        vm_unacct_memory(charged);
    return error;
}
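The VMAs that mmap_region creates and links can be inspected from user space: on Linux, /proc/self/maps prints one line per VMA in the form "start-end perms ...". The sketch below is an illustration, not kernel code, and both helper names (map_anon_rw and vma_perms) are hypothetical. It maps one anonymous range and then looks up the VMA that covers it.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: map len bytes of anonymous, private, read-write memory. */
static void *map_anon_rw(size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Hypothetical helper: find the /proc/self/maps entry that covers addr and
 * copy its permission field ("rw-p", "r-xp", ...) into perms.
 * Returns 1 if a covering VMA was found, 0 if not, -1 on error. */
static int vma_perms(const void *addr, char perms[5])
{
    assert(perms != NULL);

    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    unsigned long target = (unsigned long)addr;
    char line[1024];
    int found = 0;
    while (fgets(line, sizeof line, f)) {
        unsigned long start, end;
        char p[5];
        /* Each line starts with "start-end perms" in hexadecimal. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, p) != 3)
            continue;
        if (target >= start && target < end) {
            strcpy(perms, p);
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

For a range returned by map_anon_rw(4096), vma_perms should report "rw-p": readable, writable, not executable, private. The covering line may span more than the requested range, because the vma_merge step shown above can fold the new range into a neighbouring VMA with the same attributes.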

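  The core job of the two functions above is to find, allocate and initialise a free vma, link it into the list and the red-black tree, and set its flags for later management. That raises a question: data ultimately has to live in physical memory, yet everything so far has only touched virtual memory, so where do these VMAs get mapped to physical memory? For an ordinary file mapping this normally happens lazily, in the page-fault path on first access. A driver's mmap handler can instead build the mapping up front, and the key function for that is remap_pfn_range, in mm/memory.c: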

/**
 * remap_pfn_range - remap kernel memory to userspace
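 *   Maps kernel memory into user space; put differently, maps a user-space
 *   vma onto a run of contiguous physical pages starting at pfn.
 * @vma: user vma to map to: the vma that receives the physical mapping
 * @addr: target user address to start at: first user-space virtual address
 * @pfn: physical address of kernel memory: page frame number of the first physical page
 * @size: size of map area
 * @prot: page protection flags for this mapping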
 *
 *  Note: this is only safe if the mm semaphore is held when called.
 */
int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
            unsigned long pfn, unsigned long size, pgprot_t prot)
{
    pgd_t *pgd;
    unsigned long next;
    /* End of the virtual range to map; size is page aligned because the CPU manages memory in pages */
    unsigned long end = addr + PAGE_ALIGN(size);
    struct mm_struct *mm = vma->vm_mm;
    unsigned long remap_pfn = pfn;
    int err;

    /*
     * Physically remapped pages are special. Tell the
     * rest of the world about it:
     *   VM_IO tells people not to look at these pages
     *    (accesses can have side effects).
     *   VM_PFNMAP tells the core MM that the base pages are just
     *    raw PFN mappings, and do not have a "struct page" associated
     *    with them.
     *   VM_DONTEXPAND
     *      Disable vma merging and expanding with mremap().
     *   VM_DONTDUMP
     *      Omit vma from core dump, even when VM_IO turned off.
     *
     * There's a horrible special case to handle copy-on-write
     * behaviour that some programs depend on. We mark the "original"
     * un-COW'ed pages by matching them up with "vma->vm_pgoff".
     * See vm_normal_page() for details.
     */
    if (is_cow_mapping(vma->vm_flags)) {
        if (addr != vma->vm_start || end != vma->vm_end)
            return -EINVAL;
        vma->vm_pgoff = pfn;
    }

    err = track_pfn_remap(vma, &prot, remap_pfn, addr, PAGE_ALIGN(size));
    if (err)
        return -EINVAL;
    /* Mark the vma as a raw PFN mapping */
    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

    BUG_ON(addr >= end);
    pfn -= addr >> PAGE_SHIFT;
    /*
     * To find an entry in a generic PGD. The macros expand to:
     * #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
     * #define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
     * Look up the first-level (PGD) entry that covers addr.
     */
    pgd = pgd_offset(mm, addr);
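    /* Flush the CPU cache for this virtual range before the page tables
       change. This matters on architectures with virtually indexed caches
       and is a no-op on x86. Note that this is not a TLB flush: the TLB
       (Translation Lookaside Buffer) is a separate, very fast cache inside
       the CPU that holds recent virtual-to-physical translations, so that
       the page tables need not be walked in memory on every access. */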
    flush_cache_range(vma, addr, end);
    do {
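        /*
        Compute where this iteration stops: the next pgd boundary, or end if
        the rest of [addr, end) fits under the current pgd entry
        */
        next = pgd_addr_end(addr, end);
        /* Map virtual to physical memory, which amounts to filling in the page
        tables that CR3 points to (on x86), one level at a time: level 1 is the
        pgd, located above; level 2 is the pud, level 3 the pmd, level 4 the pte
        */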
        err = remap_pud_range(mm, pgd, addr, next,
                pfn + (addr >> PAGE_SHIFT), prot);
        if (err)
            break;
    } while (pgd++, addr = next, addr != end);

    if (err)
        untrack_pfn(vma, remap_pfn, PAGE_ALIGN(size));

    return err;
}

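  The heart of it is remap_pud_range: starting from this function, the page-table entries are filled in level by level. Before reading the code it helps to recall how 4-level translation works: the virtual address is cut into four table indexes (pgd, pud, pmd, pte) plus an offset within the page, and each index selects the entry that points to the next-level table. On x86-64 with 4 KiB pages each index is 9 bits wide and the offset is 12 bits.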
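The index split used by 4-level translation can be worked out with plain shifts and masks. The sketch below assumes the x86-64 layout with 4 KiB pages (shifts 39, 30, 21 and 12, nine bits per index); the struct and function names are hypothetical and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* The five pieces a 48-bit virtual address is cut into. */
struct va_parts {
    unsigned pgd;    /* bits 47..39: index into the page global directory */
    unsigned pud;    /* bits 38..30: index into the page upper directory */
    unsigned pmd;    /* bits 29..21: index into the page middle directory */
    unsigned pte;    /* bits 20..12: index into the page table */
    unsigned offset; /* bits 11..0 : byte offset inside the 4 KiB page */
};

static struct va_parts split_va(uint64_t va)
{
    /* Canonical x86-64 addresses have bits 63..47 all equal. */
    assert((va >> 47) == 0 || (va >> 47) == 0x1ffff);

    struct va_parts p;
    p.pgd    = (unsigned)((va >> 39) & 0x1ff);
    p.pud    = (unsigned)((va >> 30) & 0x1ff);
    p.pmd    = (unsigned)((va >> 21) & 0x1ff);
    p.pte    = (unsigned)((va >> 12) & 0x1ff);
    p.offset = (unsigned)(va & 0xfff);
    return p;
}
```

For example, an address built as (3 << 39) | (5 << 30) | (7 << 21) | (9 << 12) | 0x123 splits back into the indexes 3, 5, 7, 9 and the offset 0x123.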

     

 

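      The code follows. The three functions have the same structure and go one level deeper each time, down to the last level, the pte. remap_pte_range calls set_pte_at, which finally ties the physical address to the virtual address.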

/*
 * maps a range of physical memory into the requested pages. the old
 * mappings are removed. any references to nonexistent pages results
 * in null mappings (currently treated as "copy-on-access")
 */
static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
            unsigned long addr, unsigned long end,
            unsigned long pfn, pgprot_t prot)
{
    pte_t *pte;
    spinlock_t *ptl;

    pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
    if (!pte)
        return -ENOMEM;
    arch_enter_lazy_mmu_mode();
    do {
        BUG_ON(!pte_none(*pte));
        /* Last level of the mapping: write the physical page frame into the pte */
        set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
        pfn++;
    } while (pte++, addr += PAGE_SIZE, addr != end);
    arch_leave_lazy_mmu_mode();
    pte_unmap_unlock(pte - 1, ptl);
    return 0;
}

static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
            unsigned long addr, unsigned long end,
            unsigned long pfn, pgprot_t prot)
{
    pmd_t *pmd;
    unsigned long next;

    pfn -= addr >> PAGE_SHIFT;
    pmd = pmd_alloc(mm, pud, addr);
    if (!pmd)
        return -ENOMEM;
    VM_BUG_ON(pmd_trans_huge(*pmd));
    do {
        next = pmd_addr_end(addr, end);
        if (remap_pte_range(mm, pmd, addr, next,
                pfn + (addr >> PAGE_SHIFT), prot))
            return -ENOMEM;
    } while (pmd++, addr = next, addr != end);
    return 0;
}

static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
            unsigned long addr, unsigned long end,
            unsigned long pfn, pgprot_t prot)
{
    pud_t *pud;
    unsigned long next;

    pfn -= addr >> PAGE_SHIFT;
    /* Allocate the pud table if needed and return the pud entry for addr */
    pud = pud_alloc(mm, pgd, addr);
    if (!pud)
        return -ENOMEM;
    do {
        next = pud_addr_end(addr, end);
        if (remap_pmd_range(mm, pud, addr, next,
                pfn + (addr >> PAGE_SHIFT), prot))
            return -ENOMEM;
    } while (pud++, addr = next, addr != end);
    return 0;
}
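All three functions above share one loop shape: compute next with the level's addr_end helper, handle [addr, next), then continue from next until end is reached. The user-space sketch below mimics that shape to show how a range is split at table boundaries. The names level_addr_end and count_chunks are hypothetical; the boundary arithmetic follows the kernel's generic pmd_addr_end-style macros.

```c
#include <assert.h>
#include <stdint.h>

/* Return the next boundary of size (1 << shift) above addr, or end if the
 * range stops before that boundary. */
static uint64_t level_addr_end(uint64_t addr, uint64_t end, unsigned shift)
{
    uint64_t size = 1ULL << shift;
    uint64_t boundary = (addr + size) & ~(size - 1);
    /* The "- 1" keeps the comparison correct if boundary wraps to 0. */
    return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count how many iterations a do/while loop of the kernel's shape runs
 * for the range [addr, end) at the given level. */
static unsigned count_chunks(uint64_t addr, uint64_t end, unsigned shift)
{
    assert(addr < end);

    unsigned n = 0;
    uint64_t next;
    do {
        next = level_addr_end(addr, end, shift);
        n++; /* the kernel would map [addr, next) one level down here */
    } while (addr = next, addr != end);
    return n;
}
```

With shift 21 (one pmd entry covers 2 MiB), the range [0x1ff000, 0x201000) straddles a 2 MiB boundary and is handled in two iterations, while [0x1000, 0x3000) fits under one entry and takes a single iteration.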

 

 

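Notes and summary:

1. When unpacking a protected binary, pay attention whenever you run into mmap: it may be about to load the packer's shell file!

2. The page-alignment code can also be borrowed to align to other sizes: replace PAGE_SIZE with another value, which must be a power of two for the mask trick to work.

#define PAGE_MASK (~(PAGE_SIZE-1))
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

3. Core principle: only one block of physical memory is allocated, and the process's virtual addresses are mapped onto it, so a read or write reaches the data in one step!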
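The alignment trick from note 2 generalises to any power-of-two size. A minimal sketch follows; align_up is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of align; align must be a power of two. */
static size_t align_up(size_t x, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Adding align - 1 pushes x past the next boundary unless it is already
     * on one; the mask then clears the low bits. */
    return (x + align - 1) & ~(align - 1);
}
```

For instance, align_up(4097, 4096) gives 8192, align_up(4096, 4096) stays 4096, and align_up(13, 8) gives 16.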

 

References:

1. https://mp.weixin.qq.com/s/y4LT5rtLZXXSvk66w3tVcQ  Three ways to implement mmap

2. https://www.bilibili.com/video/BV1XK411A7q2  The Linux mmap mechanism

3. https://www.bilibili.com/video/BV1mk4y1C76p  The mmap mechanism

4. https://www.leviathan.vip/2019/01/13/mmap%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90/  mmap source code analysis

5. https://www.cnblogs.com/pengdonglin137/p/8150981.html  remap_pfn_range source code analysis

6. https://zhuanlan.zhihu.com/p/79607142  The TLB cache

