Linux Kernel Memory Management: Memory Access and Page Faults [Repost]


Reposted from: https://yq.aliyun.com/articles/5865


This article briefly describes how the linear address spaces of user processes and kernel threads relate to physical memory in the Linux kernel on the 32-bit x86 architecture, explains why high memory was introduced, and analyzes the page fault handling path in detail. It first covers the execution flow of a user-mode process, then contrasts it with kernel threads, introduces the concept of high memory, and finally walks through the page fault handling flow.

  • User processes 
    After fork, a user-mode process already has the data structures it needs (task_struct, thread_info, mm_struct, and so on), and the address regions of the compiled and linked executable have been mapped onto the memory regions recorded in the process structure. When execution starts and the process touches a user address that is not yet backed by a physical page, a page fault is raised. The corresponding kernel-mode fault handler allocates a page and associates it with the faulting user address, then checks whether the same program file has already been buffered, since another process running the same executable may already have read it in. If not, the relevant part of the program is read from disk into the buffer; because a buffer head is normally tied to an allocated page frame, the data actually ends up in the physical memory represented by that page. Execution then returns to the faulting user-mode address and resumes: this time the MMU successfully translates the user address to the page and its physical address, and the instruction is fetched and executed. At any point in this process, memory exhaustion or a permission violation may cause -ENOMEM to be returned or a segmentation fault (SIGSEGV) to be delivered to the user process.
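
    The demand-paging behaviour described above can be observed directly from user space. The following is a minimal sketch (an illustration added here, not part of the original article): it creates an anonymous mapping with mmap(), touches every page so that each first access raises a page fault that the kernel resolves by allocating a physical page, and then reads the process's minor/major fault counters via getrusage().

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;
    size_t len = 16 * 4096;                    /* 16 pages */

    getrusage(RUSAGE_SELF, &before);

    /* MAP_ANONYMOUS pages are zero-filled and only backed by physical
     * memory when first touched. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memset(buf, 0xaa, len);                    /* the first write faults in each page */

    getrusage(RUSAGE_SELF, &after);
    printf("minor faults: %ld, major faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);

    munmap(buf, len);
    return 0;
}

    Because the mapping is anonymous, these accesses show up as minor faults (no disk I/O); first-time accesses to a freshly mapped program file would instead be counted as major faults.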

  • Kernel threads 
    A kernel thread has no mm structure of its own; all kernel threads share one kernel address space and the kernel page table. To make system calls and similar transitions convenient, the upper 1 GB of every user process's linear address space is reserved for the kernel, while the lower 3 GB is available to user space. This kernel portion overlaps the linear address space of every user process, and after MMU translation it resolves to the same physical addresses, essentially the first 1 GB of physical memory (strictly speaking, the mappings for parts of the last 128 MB of that window can change). Kernel threads also go through the MMU when accessing memory, so they borrow the page tables of user processes: although the kernel has its own master page table, it is not used directly for translation (presumably to avoid the cost of switching page tables between user mode and kernel mode). The top 1 GB of every process's page table simply shares the kernel's mappings, so accesses to linear addresses in the 3 GB-4 GB range reach physical addresses in the low 1 GB. Moreover, from the kernel's point of view only the 3 GB-4 GB window is usable: the kernel must not directly dereference linear addresses in the 0-3 GB range, because that range belongs to the current user process. Reading or writing it directly could corrupt the process's data, and since different user processes map the same linear address to different physical pages, a linear address that the kernel resolves to one physical page through the current process's page table may resolve to a completely different physical page a moment later, once another process's page table is in use.
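
    The fact that a kernel thread has no address space of its own can be seen from its task_struct. The following kernel-module sketch (an illustration added here, not part of the original article) starts a kernel thread with kthread_run() and prints current->mm, which is NULL for a pure kernel thread, together with current->active_mm, the address space "borrowed" from whichever user process last ran on the CPU.

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    while (!kthread_should_stop()) {
        /* mm is NULL for a pure kernel thread; active_mm points at the
         * mm_struct borrowed from the previously running user process. */
        pr_info("kthread: mm=%p active_mm=%p\n",
                current->mm, current->active_mm);
        msleep(1000);
    }
    return 0;
}

static int __init mm_demo_init(void)
{
    worker = kthread_run(worker_fn, NULL, "mm-demo");
    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit mm_demo_exit(void)
{
    kthread_stop(worker);
}

module_init(mm_demo_init);
module_exit(mm_demo_exit);
MODULE_LICENSE("GPL");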

  • High memory 
    How, then, can the kernel reach physical memory beyond the first 1 GB? This is where high memory comes in. The basic idea is to take part of the kernel's 1 GB of linear address space at 3 GB-4 GB (the range that, under the fixed direct mapping, corresponds to the low 1 GB of physical memory) and reuse it: instead of being permanently mapped, that part is remapped on demand onto physical memory above 1 GB. Concretely, on 32-bit x86 Linux splits the 3 GB-4 GB linear range into a 0-896 MB part and an 896 MB-1 GB part. The first part is the fixed direct mapping, so when the kernel accesses linear addresses between 3 GB and 3 GB+896 MB through the process page tables, no page fault occurs. When it accesses linear addresses above 3 GB+896 MB, however, the kernel master page table may have been updated while the current process's page table has not yet been synchronized with it; in that case a kernel-space page fault occurs, and the handler copies the kernel page-table entries into the current process's page table. Note that when memory is allocated with vmalloc, the kernel page table may already be fully set up at allocation time; the synchronization with the current process's page table is only triggered later, when that kernel-space address is next accessed through the process page table and faults. (A minimal kmap() sketch for using high memory follows the figure below.) 
    [Figure: linear address space and physical address space of Linux on 32-bit x86 (image from Understanding the Linux Virtual Memory Manager)]
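
    To actually use a page that lives above the 896 MB boundary, the kernel must first establish a temporary mapping for it. The sketch below (an illustration added here, not part of the original article; kmap()/kunmap() is the classic 32-bit highmem API of this kernel generation) allocates a page that may come from the highmem zone, maps it into the kernel's kmap window, writes to it, and then tears the mapping down again.

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/errno.h>

static int touch_highmem_page(void)
{
    struct page *page;
    void *vaddr;

    /* GFP_HIGHUSER allows the allocator to return a page above 896 MB,
     * i.e. one without a permanent kernel mapping. */
    page = alloc_page(GFP_HIGHUSER);
    if (!page)
        return -ENOMEM;

    vaddr = kmap(page);        /* create a temporary kernel mapping */
    memset(vaddr, 0, PAGE_SIZE);
    kunmap(page);              /* drop the temporary mapping again */

    __free_page(page);
    return 0;
}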

  • Page faults 
    Page fault handling proceeds as follows. A faulting linear address can be touched from either user context or kernel context, although some of the combinations have no real practical significance.

    • If the faulting address lies in the kernel linear address space 
      • If it lies in the vmalloc area, synchronize the kernel page table into the user process page table; otherwise the kernel dies. Note that no distinction is made here as to the specific context of the fault.
    • If the fault happens in interrupt context or with !mm (no user address space), check the exception table; if there is no fixup entry, the kernel dies.
    • If the faulting address lies in the user process linear address space 
      • If the fault was taken in kernel context, look up the exception table; if there is no fixup entry, the kernel dies. This case has little practical significance.
      • If the fault was taken in user process context 
        • Look up the vma. If one is found, first decide whether the stack needs to be expanded; otherwise continue with the normal handling path.
        • If no vma is found, this is a bad area, which usually results in a segmentation fault (SIGSEGV).

    The detailed page fault handling flow chart and the code are shown below: 
    [Figure: page fault handling flow chart (image from Understanding the Linux Virtual Memory Manager)]

(Linux 3.19.3, arch/x86/mm/fault.c, line 1044)
/*
 * This routine handles page faults.  It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 *
 * This function must have noinline because both callers
 * {,trace_}do_page_fault() have notrace on. Having this an actual function
 * guarantees there's a function trace entry.
 */

// Handles a page fault.
// Arguments: register state, error code, faulting linear address.
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
        unsigned long address)
{
    struct vm_area_struct *vma;
    struct task_struct *tsk;
    struct mm_struct *mm;
    int fault, major = 0;
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

    tsk = current;
    mm = tsk->mm;

    /*
     * Detect and handle instructions that would cause a page fault for
     * both a tracked kernel page and a userspace page.
     */
    if (kmemcheck_active(regs))
        kmemcheck_hide(regs);
    prefetchw(&mm->mmap_sem);

    if (unlikely(kmmio_fault(regs, address)))
        return;

    /*
     * We fault-in kernel-space virtual memory on-demand. The
     * 'reference' page table is init_mm.pgd.
     *
     * NOTE! We MUST NOT take any locks for this case. We may
     * be in an interrupt or a critical region, and should
     * only copy the information from the master page table,
     * nothing more.
     *
     * This verifies that the fault happens in kernel space
     * (error_code & 4) == 0, and that the fault was not a
     * protection error (error_code & 9) == 0.
     */

    // the faulting address lies in kernel space
    if (unlikely(fault_in_kernel_space(address))) {
        if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) { // fault came from kernel mode and is not a protection/reserved-bit error
            if (vmalloc_fault(address) >= 0) // address in the vmalloc area: vmalloc_fault()/vmalloc_sync_one() copies the kernel page-table entries into the process page table
                return;

            if (kmemcheck_fault(regs, address, error_code))
                return;
        }

        /* Can handle a stale RO->RW TLB: */
        if (spurious_fault(error_code, address))
            return;

        /* kprobes don't want to hook the spurious faults: */
        if (kprobes_fault(regs))
            return;
        /*
         * Don't take the mm semaphore here. If we fixup a prefetch
         * fault we could otherwise deadlock:
         */
        bad_area_nosemaphore(regs, error_code, address);

        return;
    }



    /* kprobes don't want to hook the spurious faults: */
    if (unlikely(kprobes_fault(regs)))
        return;

    if (unlikely(error_code & PF_RSVD))
        pgtable_bad(regs, error_code, address);

    if (unlikely(smap_violation(error_code, regs))) {
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }

    /*
     * If we're in an interrupt, have no user context or are running
     * in an atomic region then we must not take the fault:
     */

    // in an atomic/interrupt context, or no mm (kernel thread): treat as an error
    if (unlikely(in_atomic() || !mm)) {
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }

    /*
     * It's safe to allow irq's after cr2 has been saved and the
     * vmalloc fault has been handled.
     *
     * User-mode registers count as a user access even for any
     * potential system fault or CPU buglet:
     */
    if (user_mode_vm(regs)) {
        local_irq_enable();
        error_code |= PF_USER;
        flags |= FAULT_FLAG_USER;
    } else {
        if (regs->flags & X86_EFLAGS_IF)
            local_irq_enable();
    }

    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

    if (error_code & PF_WRITE)
        flags |= FAULT_FLAG_WRITE;

    /*
     * When running in the kernel we expect faults to occur only to
     * addresses in user space.  All other faults represent errors in
     * the kernel and should generate an OOPS.  Unfortunately, in the
     * case of an erroneous fault occurring in a code path which already
     * holds mmap_sem we will deadlock attempting to validate the fault
     * against the address space.  Luckily the kernel only validly
     * references user space from well defined areas of code, which are
     * listed in the exceptions table.
     *
     * As the vast majority of faults will be valid we will only perform
     * the source reference check when there is a possibility of a
     * deadlock. Attempt to lock the address space, if we cannot we then
     * validate the source. If this is invalid we can skip the address
     * space check, thus avoiding the deadlock:
     */
    if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
        if ((error_code & PF_USER) == 0 &&
            !search_exception_tables(regs->ip)) {
            bad_area_nosemaphore(regs, error_code, address);
            return;
        }
retry:
        down_read(&mm->mmap_sem);
    } else {
        /*
         * The above down_read_trylock() might have succeeded in
         * which case we'll have missed the might_sleep() from
         * down_read():
         */
        might_sleep();
    }


    // the faulting address lies in user space
    // look up the vma covering it
    vma = find_vma(mm, address);

    // no vma found: error
    if (unlikely(!vma)) {
        bad_area(regs, error_code, address);
        return;
    }

    // check whether the address actually falls inside the vma
    if (likely(vma->vm_start <= address))
        goto good_area;

    if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
        bad_area(regs, error_code, address);
        return;
    }

    // the fault came from user mode
    if (error_code & PF_USER) {
        /*
         * Accessing the stack below %sp is always a bug.
         * The large cushion allows instructions like enter
         * and pusha to work. ("enter $65535, $31" pushes
         * 32 pointers and then decrements %sp by 65535.)
         */
        if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
            bad_area(regs, error_code, address);
            return;
        }
    }

    // try to grow the stack downwards
    if (unlikely(expand_stack(vma, address))) {
        bad_area(regs, error_code, address);
        return;
    }

    /*
     * Ok, we have a good vm_area for this memory access, so
     * we can handle it..
     */

    // the vma is valid for this access
good_area:
    if (unlikely(access_error(error_code, vma))) {
        bad_area_access_error(regs, error_code, address);
        return;
    }

    /*
     * If for any reason at all we couldn't handle the fault,
     * make sure we exit gracefully rather than endlessly redo
     * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
     * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
     */

    // hand off to the generic, architecture-independent fault handler
    fault = handle_mm_fault(mm, vma, address, flags);
    major |= fault & VM_FAULT_MAJOR;

    /*
     * If we need to retry the mmap_sem has already been released,
     * and if there is a fatal signal pending there is no guarantee
     * that we made any progress. Handle this case first.
     */
    if (unlikely(fault & VM_FAULT_RETRY)) {
        /* Retry at most once */
        if (flags & FAULT_FLAG_ALLOW_RETRY) {
            flags &= ~FAULT_FLAG_ALLOW_RETRY;
            flags |= FAULT_FLAG_TRIED;
            if (!fatal_signal_pending(tsk))
                goto retry;
        }

        /* User mode? Just return to handle the fatal exception */
        if (flags & FAULT_FLAG_USER)
            return;

        /* Not returning to user mode? Handle exceptions or die: */
        no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
        return;
    }

    up_read(&mm->mmap_sem);
    if (unlikely(fault & VM_FAULT_ERROR)) {
        mm_fault_error(regs, error_code, address, fault);
        return;
    }

    /*
     * Major/minor page fault accounting. If any of the events
     * returned VM_FAULT_MAJOR, we account it as a major fault.
     */
    if (major) {
        tsk->maj_flt++;
        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
    } else {
        tsk->min_flt++;
        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
    }

    check_v8086_mode(regs, address, tsk);
}
NOKPROBE_SYMBOL(__do_page_fault);
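
From user space, the bad_area path of this function is observed simply as SIGSEGV. The short program below (an illustration added here, not part of the original article) installs a SIGSEGV handler with SA_SIGINFO and then dereferences an address that no vma covers; the handler reports the faulting linear address from siginfo's si_addr field, which the kernel fills in from the same address that was passed to __do_page_fault().

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    /* si_addr is the linear address that caused the fault */
    fprintf(stderr, "SIGSEGV at address %p\n", info->si_addr);
    _exit(1);
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* Dereference an address that no vma covers: bad_area -> SIGSEGV. */
    volatile int *p = (volatile int *)0x10;
    *p = 42;

    return 0;
}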
