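Linux Memory Management (17): KSM


Series: Linux Memory Management

Keywords: KSM, anonymous pages, COW, madvise, MERGEABLE, UNMERGEABLE

KSM stands for Kernel Samepage Merging; it merges pages whose contents are identical.

In a virtualized environment, one host often runs many instances of the same OS and applications, so a lot of pages are likely to have exactly the same contents. Those pages can be merged, which frees memory for other applications.

KSM can merge anonymous pages with identical contents, whether they belong to one process or to different processes, and this is transparent to the applications.

The identical pages are merged into a single read-only page so that the duplicate physical pages can be freed. When an application needs to modify the page, a Copy-On-Write (COW) is triggered.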

 

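1. KSM implementation

The core design idea of KSM builds on the Copy-On-Write (COW) mechanism: pages with identical contents are merged into one read-only page, which frees up physical pages.

The implementation can be split into two parts: first, the kernel thread ksmd is started and waits to be woken up to scan and merge pages; second, madvise wakes up ksmd.

KSM only handles user address-space memory that has been explicitly registered through the madvise system call, so a user who wants this feature must explicitly call madvise(addr, length, MADV_MERGEABLE).

To turn merging off again for a range of a process's address space, the user likewise has to call madvise(addr, length, MADV_UNMERGEABLE) explicitly.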
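As a rough user-space illustration of the two madvise calls, the sketch below maps an anonymous region, fills it with identical bytes, opts it in to KSM and then opts it out again. The helper name ksm_region_demo and its return convention are made up for this example; only mmap, madvise, munmap and the two MADV_* flags come from the real API.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS and MADV_MERGEABLE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory, fill them with
 * identical content and mark the range mergeable, then unmergeable again.
 * Returns 0 on success, 1 if the kernel rejects the advice (for example
 * because CONFIG_KSM is off), and -1 if mmap() itself fails.
 */
int ksm_region_demo(size_t len)
{
    assert(len > 0);

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    memset(addr, 0x5a, len);        /* identical pages are merge candidates */

    int rc = 0;
    if (madvise(addr, len, MADV_MERGEABLE) != 0)
        rc = 1;                     /* advice refused by this kernel */
    else if (madvise(addr, len, MADV_UNMERGEABLE) != 0)
        rc = 1;                     /* opting out again failed */

    munmap(addr, len);
    return rc;
}
```

Calling ksm_region_demo(16 * 4096) only registers the range; whether ksmd actually merges anything still depends on the run setting under /sys/kernel/mm/ksm.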

TODO: a flow chart of the KSM implementation is still missing here.

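What kinds of pages does KSM merge?

A typical application's memory consists of the following five parts:

  • Memory mappings of the executable file (page cache)
  • Anonymous pages allocated by the program
  • Mappings of files the process has opened
  • Cache produced when the process accesses the file system
  • Kernel buffers produced when the process calls into the kernel (such as slab)

KSM only considers the anonymous pages allocated by the process.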

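How are two identical pages found and compared?

KSM uses red-black trees to build two trees: the stable tree and the unstable tree.

KSM also uses a checksum of each page to tell whether a page headed for the unstable tree has been modified recently.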
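The checksum test can be pictured with a small user-space sketch. It assumes 4 KiB pages and uses 32-bit FNV-1a purely as a stand-in hash; the kernel has its own checksum routine, and all demo_* names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u    /* assumption: 4 KiB pages */

/* Stand-in checksum (32-bit FNV-1a), not the hash the kernel uses. */
uint32_t demo_checksum(const unsigned char *page)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < DEMO_PAGE_SIZE; i++) {
        h ^= page[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Returns 1 if the page has the same checksum as on the previous scan, so
 * it may be tried against the unstable tree. Otherwise the new checksum is
 * recorded and 0 is returned, which skips the page for this round.
 */
int demo_page_is_settled(const unsigned char *page, uint32_t *oldchecksum)
{
    assert(page && oldchecksum);

    uint32_t checksum = demo_checksum(page);
    if (*oldchecksum != checksum) {
        *oldchecksum = checksum;
        return 0;
    }
    return 1;
}
```

A page therefore needs at least two scans with unchanged contents before it becomes a candidate, which filters out pages that are being written to frequently.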

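How is memory saved?

Pages come in two kinds, physical pages and virtual pages. Several virtual pages can map the same physical page at the same time, so the physical page is only truly freed once every PTE that maps it has been removed.

There are currently two approaches:

The first is to scan the VMAs of each process, walk the MMU page tables from a VMA's virtual address to find the corresponding struct page, and from there the user PTE.

The page is then compared against KSM's stable tree and unstable tree. If a page with identical contents is found, the PTE is made COW (write-protected) and remapped to the KSM page. This releases one user PTE of the original page, not necessarily the physical page itself.

Only if that physical page was mapped by a single PTE is the page itself freed.

The second is to scan the physical pages in the system directly and use reverse mapping to remove all user PTEs of a page, freeing the physical page in one go.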

1.1 Enabling KSM

Many kernels do not enable KSM by default; CONFIG_KSM=y has to be set to turn it on.

With make menuconfig the option is found here:

Processor type and features
    [*]Enable KSM for page merging

 

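1.2 KSM data structures

KSM has three core data structures: struct rmap_item, struct mm_slot and struct ksm_scan.

TODO: a diagram of how the three relate is still missing here.

rmap_item describes a reverse-mapping entry for one virtual address.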

/**
 * struct rmap_item - reverse mapping item for virtual addresses
 * @rmap_list: next rmap_item in mm_slot's singly-linked rmap_list
 * @anon_vma: pointer to anon_vma for this mm,address, when in stable tree
 * @nid: NUMA node id of unstable tree in which linked (may not match page)
 * @mm: the memory structure this rmap_item is pointing into
 * @address: the virtual address this rmap_item tracks (+ flags in low bits)
 * @oldchecksum: previous checksum of the page at that virtual address
 * @node: rb node of this rmap_item in the unstable tree
 * @head: pointer to stable_node heading this list in the stable tree
 * @hlist: link into hlist of rmap_items hanging off that stable_node
 */
struct rmap_item {
    /* links all rmap_items of one mm into a singly-linked list; the list head is mm_slot->rmap_list */
    struct rmap_item *rmap_list;
    union {
        /* when stable: once the rmap_item is in the stable tree, points to the anon_vma of the VMA */
        struct anon_vma *anon_vma;
#ifdef CONFIG_NUMA
        int nid;        /* when node of unstable tree */
#endif
    };
    struct mm_struct *mm;           /* the process's struct mm_struct */
    unsigned long address;          /* user virtual address tracked by this rmap_item (+ low bits used for flags below) */
    unsigned int oldchecksum;       /* when unstable: previous checksum of the physical page behind that address */
    union {
        struct rb_node node;        /* when node of unstable tree: links the rmap_item into the unstable red-black tree */
        struct {        /* when listed from stable tree */
            struct stable_node *head;   /* the stable_node in the stable red-black tree this rmap_item hangs off */
            struct hlist_node hlist;    /* link into that stable_node's hlist */
        };
    };
};

 

mm_slot describes the mm_struct of a process that has been added to KSM and is to be scanned.

/*
 * A few notes about the KSM scanning process,
 * to make it easier to understand the data structures below:
 *
 * In order to reduce excessive scanning, KSM sorts the memory pages by their
 * contents into a data structure that holds pointers to the pages' locations.
 *
 * Since the contents of the pages may change at any moment, KSM cannot just
 * insert the pages into a normal sorted tree and expect it to find anything.
 * Therefore KSM uses two data structures - the stable and the unstable tree.
 *
 * The stable tree holds pointers to all the merged pages (ksm pages), sorted
 * by their contents.  Because each such page is write-protected, searching on
 * this tree is fully assured to be working (except when pages are unmapped),
 * and therefore this tree is called the stable tree.
 *
 * In addition to the stable tree, KSM uses a second data structure called the
 * unstable tree: this tree holds pointers to pages which have been found to
 * be "unchanged for a period of time".  The unstable tree sorts these pages
 * by their contents, but since they are not write-protected, KSM cannot rely
 * upon the unstable tree to work correctly - the unstable tree is liable to
 * be corrupted as its contents are modified, and so it is called unstable.
 *
 * KSM solves this problem by several techniques:
 *
 * 1) The unstable tree is flushed every time KSM completes scanning all
 *    memory areas, and then the tree is rebuilt again from the beginning.
 * 2) KSM will only insert into the unstable tree, pages whose hash value
 *    has not changed since the previous scan of all memory areas.
 * 3) The unstable tree is a RedBlack Tree - so its balancing is based on the
 *    colors of the nodes and not on their contents, assuring that even when
 *    the tree gets "corrupted" it won't get out of balance, so scanning time
 *    remains the same (also, searching and inserting nodes in an rbtree uses
 *    the same algorithm, so we have no overhead when we flush and rebuild).
 * 4) KSM never flushes the stable tree, which means that even if it were to
 *    take 10 attempts to find a page in the unstable tree, once it is found,
 *    it is secured in the stable tree.  (When we scan a new page, we first
 *    compare it against the stable tree, and then against the unstable tree.)
 *
 * If the merge_across_nodes tunable is unset, then KSM maintains multiple
 * stable trees and multiple unstable trees: one of each for each NUMA node.
 */

/**
 * struct mm_slot - ksm information per mm that is being scanned
 * @link: link to the mm_slots hash list
 * @mm_list: link into the mm_slots list, rooted in ksm_mm_head
 * @rmap_list: head for this mm_slot's singly-linked list of rmap_items
 * @mm: the mm that this information is valid for
 */
struct mm_slot {
    struct hlist_node link;         /* links the mm_slot into the mm_slots hash table */
    struct list_head mm_list;       /* links the mm_slot into the mm_slot list, whose head is ksm_mm_head */
    struct rmap_item *rmap_list;    /* head of this mm_slot's rmap_item list */
    struct mm_struct *mm;           /* the process's mm_struct */
};

 

ksm_scan describes the current scanning state.

/**
 * struct ksm_scan - cursor for scanning
 * @mm_slot: the current mm_slot we are scanning
 * @address: the next address inside that to be scanned
 * @rmap_list: link to the next rmap to be scanned in the rmap_list
 * @seqnr: count of completed full scans (needed when removing unstable node)
 *
 * There is only the one ksm_scan instance of this cursor structure.
 */
struct ksm_scan {
    struct mm_slot *mm_slot;        /* the mm_slot currently being scanned */
    unsigned long address;          /* the next address to scan */
    struct rmap_item **rmap_list;   /* pointer to the next rmap_item to scan */
    unsigned long seqnr;            /* incremented after each completed full scan; used when removing unstable nodes */
};

 

 

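1.3 madvise wakes up the KSM kernel thread

madvise gives the kernel advice about how to handle paging I/O for a memory range; the advice values relevant to KSM are MADV_MERGEABLE and MADV_UNMERGEABLE.

MADV_MERGEABLE explicitly enables merging for a range of a user process's address space.

MADV_UNMERGEABLE explicitly cancels merging for a range of a user process's address space.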

madvise-------------------------------------madvise system call
    madvise_vma
        madvise_behavior
            ksm_madvise---------------------handles MADV_MERGEABLE/MADV_UNMERGEABLE
                __ksm_enter-----------------MADV_MERGEABLE: wakes up the ksmd thread
                unmerge_ksm_pages-----------MADV_UNMERGEABLE

 

__ksm_enter first creates an mm_slot and then wakes up the ksmd kernel thread to do the KSM work.

int __ksm_enter(struct mm_struct *mm)
{
    struct mm_slot *mm_slot;
    int needs_wakeup;

    mm_slot = alloc_mm_slot();    /* allocate an mm_slot that represents the current process's mm_struct */
    if (!mm_slot)
        return -ENOMEM;

    /* Check ksm_run too?  Would need tighter locking */
    needs_wakeup = list_empty(&ksm_mm_head.mm_list);    /* empty means no mm_slot is currently queued for scanning */

    spin_lock(&ksm_mmlist_lock);
    insert_to_mm_slots_hash(mm, mm_slot);    /* sets mm_slot->mm to the current mm and adds the mm_slot to the hash table */
    /*
     * When KSM_RUN_MERGE (or KSM_RUN_STOP),
     * insert just behind the scanning cursor, to let the area settle
     * down a little; when fork is followed by immediate exec, we don't
     * want ksmd to waste time setting up and tearing down an rmap_list.
     *
     * But when KSM_RUN_UNMERGE, it's important to insert ahead of its
     * scanning cursor, otherwise KSM pages in newly forked mms will be
     * missed: then we might as well insert at the end of the list.
     */
    if (ksm_run & KSM_RUN_UNMERGE)
        list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list);
    else
        list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);    /* add the mm_slot to the list just behind the scan cursor ksm_scan.mm_slot */
    spin_unlock(&ksm_mmlist_lock);

    set_bit(MMF_VM_MERGEABLE, &mm->flags);    /* marks this process as registered with KSM */
    atomic_inc(&mm->mm_count);

    if (needs_wakeup)
        wake_up_interruptible(&ksm_thread_wait);    /* the list was empty before, so wake up the ksmd kernel thread */

    return 0;
}

  

1.4 The ksmd kernel thread

static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);

This defines the ksm_thread_wait wait queue. The ksmd kernel thread blocks on it and is woken up by madvise.

ksm_init creates the ksmd kernel thread:

static int __init ksm_init(void)
{
    struct task_struct *ksm_thread;
    int err;

    err = ksm_slab_init();    /* create the three slab caches ksm_rmap_item, ksm_stable_node and ksm_mm_slot */
    if (err)
        goto out;

    ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");    /* create the ksmd kernel thread; its thread function is ksm_scan_thread */
    if (IS_ERR(ksm_thread)) {
        pr_err("ksm: creating kthread failed\n");
        err = PTR_ERR(ksm_thread);
        goto out_free;
    }

#ifdef CONFIG_SYSFS
    err = sysfs_create_group(mm_kobj, &ksm_attr_group);    /* create the ksm nodes under /sys/kernel/mm */
    if (err) {
        pr_err("ksm: register sysfs failed\n");
        kthread_stop(ksm_thread);
        goto out_free;
    }
#else
    ksm_run = KSM_RUN_MERGE;    /* no way for user to start it */

#endif /* CONFIG_SYSFS */

#ifdef CONFIG_MEMORY_HOTREMOVE
    /* There is no significance to this priority 100 */
    hotplug_memory_notifier(ksm_memory_callback, 100);
#endif
    return 0;

out_free:
    ksm_slab_free();
out:
    return err;
}

The sysfs hierarchy is kernel_kobj-->mm_kobj-->ksm_attr_group, where the .attrs member of ksm_attr_group is ksm_attrs.

The nodes below are created under /sys/kernel/mm/ksm; they let the user control KSM.

Node name, value and meaning:

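Node              Value   Meaning
full_scans        0       read-only; number of completed full scans of all mergeable areas
pages_shared      0       ksm_pages_shared; number of nodes in the stable tree
pages_sharing     0       ksm_pages_sharing; number of additional mappings sharing those pages, i.e. how many pages are saved
pages_to_scan     100     ksm_thread_pages_to_scan; number of pages handled by one ksm_do_scan
pages_unshared    0       ksm_pages_unshared; number of nodes in the unstable tree
pages_volatile    0       number of pages that change too frequently
run               0       0: stop ksmd but keep merged pages as they are; 1: run ksmd; 2: stop ksmd and unmerge all merged pages
sleep_millisecs   20      ksm_thread_sleep_millisecs; sleep time between two KSM scans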

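What do the ratios between pages_shared, pages_sharing and pages_unshared tell us?

A higher pages_sharing/pages_shared ratio means pages are being shared well; a higher pages_unshared/pages_sharing ratio means KSM is achieving little for the effort spent.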
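The two ratios can be computed from the sysfs counters with a few lines of C. This is only a sketch: demo_read_counter and demo_ratio are hypothetical helpers, and the paths passed in are assumed to be the nodes listed above.

```c
#include <assert.h>
#include <stdio.h>

/* Reads one unsigned counter from a sysfs-style text file; 0 on success, -1 on failure. */
int demo_read_counter(const char *path, unsigned long *out)
{
    assert(path && out);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%lu", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* num / den as a double; defined as 0.0 when den is 0 (nothing merged yet). */
double demo_ratio(unsigned long num, unsigned long den)
{
    return den ? (double)num / (double)den : 0.0;
}
```

For example, after reading /sys/kernel/mm/ksm/pages_sharing and /sys/kernel/mm/ksm/pages_shared into two variables, demo_ratio(sharing, shared) gives the sharing ratio, and demo_ratio(unshared, sharing) gives the wasted-effort ratio.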

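ksm_scan_thread() is the main loop of the ksmd kernel thread. Each iteration calls ksm_do_scan() to scan and merge pages_to_scan pages and then sleeps for sleep_millisecs milliseconds.

If there is nothing to do, it waits on ksm_thread_wait until madvise wakes it up.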

static int ksm_scan_thread(void *nothing)
{
    set_freezable();
    set_user_nice(current, 5);

    while (!kthread_should_stop()) {
        mutex_lock(&ksm_thread_mutex);
        wait_while_offlining();
        if (ksmd_should_run())
            ksm_do_scan(ksm_thread_pages_to_scan);
        mutex_unlock(&ksm_thread_mutex);

        try_to_freeze();

        if (ksmd_should_run()) {
            schedule_timeout_interruptible(
                msecs_to_jiffies(ksm_thread_sleep_millisecs));
        } else {
            wait_event_freezable(ksm_thread_wait,
                ksmd_should_run() || kthread_should_stop());
        }
    }
    return 0;
}

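ksm_do_scan does the actual work of the ksmd thread. It follows the flow below to turn a page into a KSM page: it searches the stable tree and the unstable tree and then performs the merge.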

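ksm_do_scan
    scan_get_next_rmap_item---------------------pick a suitable anonymous page
    cmp_and_merge_page--------------------------compare the page with the pages in root_stable_tree/root_unstable_tree and decide whether it can be merged
        stable_tree_search----------------------search the stable red-black tree for a node whose contents match the page
        try_to_merge_with_ksm_page--------------try to merge the candidate page into the KSM page
        stable_tree_append
        unstable_tree_search_insert-------------search the unstable red-black tree for a node whose contents match the page
        try_to_merge_two_pages------------------if a match was found in the unstable tree, try to merge the two pages into one KSM page
        stable_tree_append----------------------add the rmap_items of the two merged pages to the stable node's hlist
        break_cow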

The ksm_do_scan() function looks like this:

static void ksm_do_scan(unsigned int scan_npages)
{
    struct rmap_item *rmap_item;
    struct page *uninitialized_var(page);

    while (scan_npages-- && likely(!freezing(current))) {    /* the loop tries to merge scan_npages pages */
        cond_resched();
        rmap_item = scan_get_next_rmap_item(&page);    /* get a suitable anonymous page */
        if (!rmap_item)
            return;
        /* look in KSM's stable and unstable trees for a page to merge with, and try to merge them */
        cmp_and_merge_page(page, rmap_item);
        put_page(page);
    }
}

 

scan_get_next_rmap_item() walks the ksm_mm_head list, then walks every VMA in each process's address space, and returns an rmap_item obtained through get_next_rmap_item().

static struct rmap_item *scan_get_next_rmap_item(struct page **page)
{
    struct mm_struct *mm;
    struct mm_slot *slot;
    struct vm_area_struct *vma;
    struct rmap_item *rmap_item;
    int nid;

    if (list_empty(&ksm_mm_head.mm_list))    /* empty means there is no mm_slot, so there is nothing to do */
        return NULL;

    slot = ksm_scan.mm_slot;
    if (slot == &ksm_mm_head) {    /* cursor is at the list head: ksmd runs for the first time or starts a new full scan, so do some initialization */
...
        if (!ksm_merge_across_nodes) {
            struct stable_node *stable_node;
            struct list_head *this, *next;
            struct page *page;

            list_for_each_safe(this, next, &migrate_nodes) {
                stable_node = list_entry(this,
                        struct stable_node, list);
                page = get_ksm_page(stable_node, false);
                if (page)
                    put_page(page);
                cond_resched();
            }
        }

        for (nid = 0; nid < ksm_nr_node_ids; nid++)
            root_unstable_tree[nid] = RB_ROOT;    /* reset the unstable tree */

        spin_lock(&ksm_mmlist_lock);
        slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
        ksm_scan.mm_slot = slot;
        spin_unlock(&ksm_mmlist_lock);
        /*
         * Although we tested list_empty() above, a racing __ksm_exit
         * of the last mm on the list may have removed it since then.
         */
        if (slot == &ksm_mm_head)
            return NULL;
next_mm:
        ksm_scan.address = 0;
        ksm_scan.rmap_list = &slot->rmap_list;
    }

    mm = slot->mm;
    down_read(&mm->mmap_sem);
    if (ksm_test_exit(mm))
        vma = NULL;
    else
        vma = find_vma(mm, ksm_scan.address);

    for (; vma; vma = vma->vm_next) {    /* walk all VMAs */
        if (!(vma->vm_flags & VM_MERGEABLE))
            continue;
        if (ksm_scan.address < vma->vm_start)
            ksm_scan.address = vma->vm_start;
        if (!vma->anon_vma)
            ksm_scan.address = vma->vm_end;

        while (ksm_scan.address < vma->vm_end) {    /* scan every virtual page in the VMA */
            if (ksm_test_exit(mm))
                break;
            /* follow_page looks up the struct page of the normal-mapping page behind the virtual address */
            *page = follow_page(vma, ksm_scan.address, FOLL_GET);
            if (IS_ERR_OR_NULL(*page)) {
                ksm_scan.address += PAGE_SIZE;
                cond_resched();
                continue;
            }
            if (PageAnon(*page) ||
                page_trans_compound_anon(*page)) {    /* only anonymous pages are handled */
                flush_anon_page(vma, *page, ksm_scan.address);
                flush_dcache_page(*page);
                /* look on mm_slot->rmap_list for the rmap_item of this virtual address; create one if none is found */
                rmap_item = get_next_rmap_item(slot,
                    ksm_scan.rmap_list, ksm_scan.address);
                if (rmap_item) {
                    ksm_scan.rmap_list =
                            &rmap_item->rmap_list;
                    ksm_scan.address += PAGE_SIZE;
                } else
                    put_page(*page);
                up_read(&mm->mmap_sem);
                return rmap_item;
            }
            put_page(*page);
            ksm_scan.address += PAGE_SIZE;
            cond_resched();
        }
    }

    /* reaching this point means the for loop went through the remaining VMAs of this process without finding a suitable anonymous page */
    if (ksm_test_exit(mm)) {
        ksm_scan.address = 0;
        ksm_scan.rmap_list = &slot->rmap_list;
    }
    /*
     * Nuke all the rmap_items that are above this current rmap:
     * because there were no VM_MERGEABLE vmas with such addresses.
     */
    /* no suitable anonymous page was found beyond this point, so the remaining rmap_items need not take up space and are deleted */
    remove_trailing_rmap_items(slot, ksm_scan.rmap_list);

    spin_lock(&ksm_mmlist_lock);
    ksm_scan.mm_slot = list_entry(slot->mm_list.next,    /* move on to the next mm_slot */
                        struct mm_slot, mm_list);
    /* the process is being destroyed: remove the mm_slot from the ksm_mm_head list, free it and clear the MMF_VM_MERGEABLE flag */
    if (ksm_scan.address == 0) {
        /*
         * We've completed a full scan of all vmas, holding mmap_sem
         * throughout, and found no VM_MERGEABLE: so do the same as
         * __ksm_exit does to remove this mm from all our lists now.
         * This applies either when cleaning up after __ksm_exit
         * (but beware: we can reach here even before __ksm_exit),
         * or when all VM_MERGEABLE areas have been unmapped (and
         * mmap_sem then protects against race with MADV_MERGEABLE).
         */
        hash_del(&slot->link);
        list_del(&slot->mm_list);
        spin_unlock(&ksm_mmlist_lock);

        free_mm_slot(slot);
        clear_bit(MMF_VM_MERGEABLE, &mm->flags);
        up_read(&mm->mmap_sem);
        mmdrop(mm);
    } else {
        spin_unlock(&ksm_mmlist_lock);
        up_read(&mm->mmap_sem);
    }

    /* Repeat until we've completed scanning the whole list */
    slot = ksm_scan.mm_slot;
    if (slot != &ksm_mm_head)
        goto next_mm;    /* continue with the next mm_slot */

    ksm_scan.seqnr++;    /* one full round over all mm_slots is done: increase the counter */
    return NULL;
}

 

cmp_and_merge_page() takes two parameters: page is the eligible anonymous page just found while scanning the mm_slot, and rmap_item is the rmap_item that corresponds to that page.

static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
{
    struct rmap_item *tree_rmap_item;
    struct page *tree_page = NULL;
    struct stable_node *stable_node;
    struct page *kpage;
    unsigned int checksum;
    int err;

    stable_node = page_stable_node(page);
    if (stable_node) {
        if (stable_node->head != &migrate_nodes &&
            get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
            rb_erase(&stable_node->node,
                 root_stable_tree + NUMA(stable_node->nid));
            stable_node->head = &migrate_nodes;
            list_add(&stable_node->list, stable_node->head);
        }
        if (stable_node->head != &migrate_nodes &&
            rmap_item->head == stable_node)
            return;
    }

    /* We first start with searching the page inside the stable tree */
    kpage = stable_tree_search(page);    /* search root_stable_tree for a stable page with the same contents as page */
    if (kpage == page && rmap_item->head == stable_node) {    /* kpage and page are the same page: it is already a KSM page, nothing more to do */
        put_page(kpage);
        return;
    }

    remove_rmap_item_from_tree(rmap_item);

    if (kpage) {
        /* a node with identical contents was found in the stable tree: try to merge this page into it */
        err = try_to_merge_with_ksm_page(rmap_item, page, kpage);
        if (!err) {
            /*
             * The page was successfully merged:
             * add its rmap_item to the stable tree.
             */
            lock_page(kpage);
            stable_tree_append(rmap_item, page_stable_node(kpage));    /* after a successful merge, add the rmap_item to the stable_node->hlist list */
            unlock_page(kpage);
        }
        put_page(kpage);
        return;
    }
    /* ==================== dividing line between stable and unstable handling ==================== */
    /*
     * If the hash value of the page has changed from the last time
     * we calculated it, this page is changing frequently: therefore we
     * don't want to insert it in the unstable tree, and we don't want
     * to waste our time searching for something identical to it there.
     */
    checksum = calc_checksum(page);    /* recompute the checksum; if it differs, the page changes often and should not go into the unstable tree */
    if (rmap_item->oldchecksum != checksum) {
        rmap_item->oldchecksum = checksum;
        return;
    }

    /* search root_unstable_tree for a node with the same contents as this page */
    tree_rmap_item =
        unstable_tree_search_insert(rmap_item, page, &tree_page);
    if (tree_rmap_item) {
        /* try to merge page and tree_page into one KSM page, kpage */
        kpage = try_to_merge_two_pages(rmap_item, page,
                        tree_rmap_item, tree_page);
        put_page(tree_page);
        if (kpage) {
            /*
             * The pages were successfully merged: insert new
             * node in the stable tree and add both rmap_items.
             */
            lock_page(kpage);
            stable_node = stable_tree_insert(kpage);    /* add kpage to root_stable_tree, creating a new stable_node */
            if (stable_node) {
                stable_tree_append(tree_rmap_item, stable_node);
                stable_tree_append(rmap_item, stable_node);
            }
            unlock_page(kpage);

            /*
             * If we fail to insert the page into the stable tree,
             * we will have 2 virtual addresses that are pointing
             * to a ksm page left outside the stable tree,
             * in which case we need to break_cow on both.
             */
            /* inserting into the stable tree failed: call break_cow() to deliberately trigger a page fault that splits the KSM page again */
            if (!stable_node) {
                break_cow(tree_rmap_item);
                break_cow(rmap_item);
            }
        }
    }
}
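The search-or-insert step against a tree ordered by page contents can be modelled in user space as follows. This is a simplified sketch: it uses a plain, unbalanced binary search tree instead of the kernel's red-black tree, assumes 4 KiB pages, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u    /* assumption: 4 KiB pages */

struct demo_node {
    const unsigned char *page;      /* contents this node stands for */
    struct demo_node *left, *right;
};

/*
 * Walk a tree that is ordered by memcmp() of the page contents.
 * If a node with identical contents exists, return it (a merge candidate).
 * Otherwise link `fresh` in where the search ended and return NULL.
 */
struct demo_node *demo_search_insert(struct demo_node **root,
                                     struct demo_node *fresh)
{
    assert(root && fresh && fresh->page);

    struct demo_node **link = root;
    while (*link) {
        int cmp = memcmp(fresh->page, (*link)->page, DEMO_PAGE_SIZE);
        if (cmp == 0)
            return *link;
        link = cmp < 0 ? &(*link)->left : &(*link)->right;
    }
    fresh->left = fresh->right = NULL;
    *link = fresh;
    return NULL;
}
```

Because the ordering key is the page contents themselves, a page that is modified after insertion can end up in the wrong place, which is the reason the unstable tree is rebuilt on every full scan.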

 

  

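2. Differences between anonymous pages and KSM pages

2.1 How to tell an anonymous page from a KSM page?

If a struct page is an anonymous page mapped into user virtual address space, its mapping member points to an anon_vma.

The low 2 bits of mapping indicate whether the page is an anonymous page or a KSM page.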

/*
...
 *
 * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.
 * (Note: so a KSM page is always an anonymous page; KSM pages are a subset of anonymous pages.)
...
*/
#define PAGE_MAPPING_ANON 1
#define PAGE_MAPPING_KSM 2
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

 

The kernel provides two functions for checking whether a page is a KSM page or an anonymous page:

static inline int PageAnon(struct page *page)
{
    return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}

/* this also shows that a KSM page must have both the ANON and the KSM bit set */
static inline int PageKsm(struct page *page)
{
    return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
                (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
}
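The trick of keeping flags in the low bits of an aligned pointer can be tried out in user space. The sketch below mirrors the two checks above with hypothetical demo_* names; it assumes the tagged object is aligned to at least 4 bytes so that the two low bits of its address are always zero.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAPPING_ANON  1u
#define DEMO_MAPPING_KSM   2u
#define DEMO_MAPPING_FLAGS (DEMO_MAPPING_ANON | DEMO_MAPPING_KSM)

/* Store flag bits in the two low bits of a suitably aligned pointer. */
uintptr_t demo_tag(const void *ptr, unsigned flags)
{
    uintptr_t raw = (uintptr_t)ptr;
    assert((raw & DEMO_MAPPING_FLAGS) == 0);        /* low bits must be free */
    assert((flags & ~DEMO_MAPPING_FLAGS) == 0);     /* only the two known flags */
    return raw | flags;
}

/* Recover the original pointer by masking the flag bits off. */
void *demo_untag(uintptr_t mapping)
{
    return (void *)(mapping & ~(uintptr_t)DEMO_MAPPING_FLAGS);
}

/* Anonymous: the ANON bit is set (KSM pages pass this test too). */
int demo_is_anon(uintptr_t mapping)
{
    return (mapping & DEMO_MAPPING_ANON) != 0;
}

/* KSM: both the ANON and the KSM bit are set. */
int demo_is_ksm(uintptr_t mapping)
{
    return (mapping & DEMO_MAPPING_FLAGS) == DEMO_MAPPING_FLAGS;
}
```

A pointer tagged with only DEMO_MAPPING_ANON passes demo_is_anon but not demo_is_ksm, while one tagged with both bits passes both, which matches the subset relationship described above.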

 

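2.2 Differences between KSM pages and anonymous pages

There are two cases: first, VMAs of a parent and a child process share the same anonymous page; second, VMAs of unrelated processes share the same anonymous page.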

 

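When the parent process maps an anonymous page in a VMA, it sets up the reverse-mapping (RMAP) infrastructure that belongs to this VMA; __page_set_anon_rmap() sets page->index to the offset of the virtual address within the VMA.

When the child is forked, the contents of the parent's VMAs are copied into the child's VMAs and the parent's page tables are copied into the child, so page->index is the same for parent and child.

When all virtual addresses that map a page have to be found from the page, rmap_walk_anon() uses page->index in both parent and child to compute the virtual address within the VMA.

In rmap_walk(), an anonymous page that is not a KSM page is looked up in reverse with rmap_walk_anon().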

static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
{
...
    pgoff = page_to_pgoff(page);
    anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
        struct vm_area_struct *vma = avc->vma;
        unsigned long address = vma_address(page, vma);    /* compute the virtual address from page->index */

        if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
            continue;

        ret = rwc->rmap_one(page, vma, address, rwc->arg);
        if (ret != SWAP_AGAIN)
            break;
        if (rwc->done && rwc->done(page))
            break;
    }
    anon_vma_unlock_read(anon_vma);
    return ret;
}
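How a virtual address is derived from page->index can be shown with a small stand-alone calculation. The struct and function below are simplified, hypothetical stand-ins for vm_area_struct and vma_address(), assuming 4 KiB pages.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12      /* assumption: 4 KiB pages */

/* Simplified stand-in for the fields of a VMA that the calculation needs. */
struct demo_vma {
    unsigned long vm_start;     /* first virtual address of the area */
    unsigned long vm_end;       /* first address past the area */
    unsigned long vm_pgoff;     /* page offset that corresponds to vm_start */
};

/* Virtual address of the page with index `page_index` inside `vma`. */
unsigned long demo_vma_address(unsigned long page_index,
                               const struct demo_vma *vma)
{
    assert(vma && page_index >= vma->vm_pgoff);
    return vma->vm_start + ((page_index - vma->vm_pgoff) << DEMO_PAGE_SHIFT);
}
```

Since a forked child inherits the parent's VMA layout, the same page_index yields the same virtual address in both processes, which is why a single page->index is enough for ordinary anonymous pages.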

 

 

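A KSM page is the result of merging two pages with identical contents; they may come from VMAs of different processes or from VMAs of a parent and a child process.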

int rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc)
{
...
again:
    hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
        struct anon_vma *anon_vma = rmap_item->anon_vma;
        struct anon_vma_chain *vmac;
        struct vm_area_struct *vma;

        anon_vma_lock_read(anon_vma);
        anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
                           0, ULONG_MAX) {
...
            ret = rwc->rmap_one(page, vma,
                    rmap_item->address, rwc->arg);    /* use rmap_item->address to get the virtual address in each VMA */
...
            }
        }
        anon_vma_unlock_read(anon_vma);
    }
...
}

 

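So for a KSM page, page->index equals the offset within the VMA that first mapped the page. It cannot describe the other mappings, which is why rmap_walk_ksm() takes the address from each rmap_item instead.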

 

