Topic: Linux Memory Management Series
Keywords: KSM, anonymous pages, COW, madvise, MERGEABLE, UNMERGEABLE.
KSM stands for Kernel Samepage Merging and is used to merge pages with identical content.
In a virtualization environment, the same host runs many identical operating systems and applications, so many pages may have exactly the same content; these can be merged, freeing memory for other applications.
KSM merges anonymous pages with identical content within one process or across different processes, and this is invisible to the applications.
The identical pages are merged into a single read-only page, freeing the duplicate physical pages; when an application later needs to modify the page, a Copy-On-Write (COW) fault occurs.
1. KSM Implementation
The core design idea of KSM is built on the Copy-On-Write (COW) mechanism: pages with identical content are merged into a single read-only page, thereby freeing physical pages.
The implementation has two parts: first, the ksmd kernel thread is started and waits to be woken up to scan and merge pages; second, madvise wakes up the ksmd kernel thread.
KSM only processes user address space regions that have been explicitly registered through the madvise system call, so a user who wants this feature must explicitly call madvise(addr, length, MADV_MERGEABLE).
To take a region of a process address space out of KSM merging again, the user must likewise explicitly call madvise(addr, length, MADV_UNMERGEABLE), as sketched below.
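Below is a minimal userspace sketch (not taken from the original text); the region size, fill pattern and error handling are arbitrary choices for illustration. It maps an anonymous region, registers it with KSM via MADV_MERGEABLE, and later takes it back out with MADV_UNMERGEABLE.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t length = 64 * page_size;		/* 64 pages, purely illustrative */

	/* Anonymous private pages: exactly the kind of memory KSM can merge. */
	void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Give every page identical content so ksmd has something to merge. */
	memset(addr, 0x5a, length);

	/* Register the range with KSM; ksmd will scan and merge it later. */
	if (madvise(addr, length, MADV_MERGEABLE) != 0)
		perror("madvise(MADV_MERGEABLE)");	/* e.g. kernel built without CONFIG_KSM */

	/* ... run the workload; writing to a merged page triggers COW ... */

	/* Explicitly take the range back out of KSM. */
	if (madvise(addr, length, MADV_UNMERGEABLE) != 0)
		perror("madvise(MADV_UNMERGEABLE)");

	munmap(addr, length);
	return 0;
}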
(A flow diagram of the KSM implementation is still to be added.)
What kinds of pages does KSM merge?
A typical application consists of the following five kinds of memory:
- the memory mapping of the executable file (page cache)
- anonymous pages allocated and used by the program
- file mappings opened by the process
- cache produced when the process accesses the file system
- kernel buffers produced when the process enters the kernel (e.g. slab)
KSM only considers the anonymous pages allocated and used by the process.
How are two identical pages found and compared?
KSM makes clever use of red-black trees and maintains two of them: the stable tree and the unstable tree.
KSM also uses a checksum of the page to detect whether a candidate for the unstable tree has been modified recently.
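To make the checksum idea concrete, here is a minimal sketch (not the kernel's code; the kernel's calc_checksum() hashes the whole page, and the hash below is only a stand-in): a page is admitted to the unstable tree only if its checksum has not changed since the previous scan; otherwise the stored value is refreshed and the page is skipped this round.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Stand-in for calc_checksum(); any reasonable hash works for the sketch. */
static uint32_t page_checksum(const uint8_t *page)
{
	uint32_t h = 2166136261u;		/* FNV-1a, for illustration only */
	for (size_t i = 0; i < PAGE_SIZE; i++) {
		h ^= page[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Mirrors the checksum test in cmp_and_merge_page(): return true only if the
 * page content is unchanged since the last scan; otherwise remember the new
 * checksum and report the page as too volatile for the unstable tree.
 */
static bool stable_enough_for_unstable_tree(const uint8_t *page,
					    uint32_t *oldchecksum)
{
	uint32_t checksum = page_checksum(page);

	if (*oldchecksum != checksum) {
		*oldchecksum = checksum;
		return false;
	}
	return true;
}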
How does this save memory?
Pages come in physical pages and virtual pages, and several virtual pages can map to the same physical page, so a physical page is only truly freed after every PTE mapping it has been torn down.
There are currently two approaches:
The first is to scan the VMAs of each process, use the VMA's virtual addresses to walk the MMU page tables to the corresponding page data structure, and from there to the user PTE.
The page is then compared against the stable tree and the unstable tree in KSM; if a page with identical content is found, the PTE is made write-protected (COW) and remapped to the KSM page. This releases one user PTE, not yet the physical page itself.
If the physical page is mapped by only one PTE, then freeing that mapping frees the page.
The second is to scan the physical pages in the system directly and use reverse mapping to tear down all user PTEs of a page, thereby freeing the physical page in one go.
1.1 Enabling KSM
Many kernels do not enable KSM by default; it requires CONFIG_KSM=y.
It can be configured through make menuconfig as follows:
Processor type and features
    [*] Enable KSM for page merging
1.2 KSM Data Structures
KSM has three core data structures: struct rmap_item, struct mm_slot and struct ksm_scan.
(A diagram of the relationship between the three is still to be added.)
rmap_item describes a reverse-mapping item for a virtual address.
/**
 * struct rmap_item - reverse mapping item for virtual addresses
 * @rmap_list: next rmap_item in mm_slot's singly-linked rmap_list
 * @anon_vma: pointer to anon_vma for this mm,address, when in stable tree
 * @nid: NUMA node id of unstable tree in which linked (may not match page)
 * @mm: the memory structure this rmap_item is pointing into
 * @address: the virtual address this rmap_item tracks (+ flags in low bits)
 * @oldchecksum: previous checksum of the page at that virtual address
 * @node: rb node of this rmap_item in the unstable tree
 * @head: pointer to stable_node heading this list in the stable tree
 * @hlist: link into hlist of rmap_items hanging off that stable_node
 */
struct rmap_item {
	struct rmap_item *rmap_list;-----------------------------------------all rmap_items of an mm_slot are linked into a singly-linked list; the scan cursor is ksm_scan.rmap_list.
	union {
		struct anon_vma *anon_vma;	/* when stable */------------------when the rmap_item is in the stable tree, points to the VMA's anon_vma data structure.
#ifdef CONFIG_NUMA
		int nid;		/* when node of unstable tree */
#endif
	};
	struct mm_struct *mm;------------------------------------------------the process's struct mm_struct data structure
	unsigned long address;		/* + low bits used for flags below */--the user virtual address tracked by this rmap_item
	unsigned int oldchecksum;	/* when unstable */---------------------the old checksum of the physical page at that virtual address
	union {
		struct rb_node node;	/* when node of unstable tree */---------node used when the rmap_item is inserted into the unstable red-black tree
		struct {		/* when listed from stable tree */
			struct stable_node *head;------------------------------------the stable_node heading this list in the stable tree
			struct hlist_node hlist;-------------------------------------link into the stable_node's hlist
		};
	};
};
mm_slot describes the mm_struct of a process that has been added to the KSM system and will be scanned later.
/*
 * A few notes about the KSM scanning process,
 * to make it easier to understand the data structures below:
 *
 * In order to reduce excessive scanning, KSM sorts the memory pages by their
 * contents into a data structure that holds pointers to the pages' locations.
 *
 * Since the contents of the pages may change at any moment, KSM cannot just
 * insert the pages into a normal sorted tree and expect it to find anything.
 * Therefore KSM uses two data structures - the stable and the unstable tree.
 *
 * The stable tree holds pointers to all the merged pages (ksm pages), sorted
 * by their contents.  Because each such page is write-protected, searching on
 * this tree is fully assured to be working (except when pages are unmapped),
 * and therefore this tree is called the stable tree.
 *
 * In addition to the stable tree, KSM uses a second data structure called the
 * unstable tree: this tree holds pointers to pages which have been found to
 * be "unchanged for a period of time".  The unstable tree sorts these pages
 * by their contents, but since they are not write-protected, KSM cannot rely
 * upon the unstable tree to work correctly - the unstable tree is liable to
 * be corrupted as its contents are modified, and so it is called unstable.
 *
 * KSM solves this problem by several techniques:
 *
 * 1) The unstable tree is flushed every time KSM completes scanning all
 *    memory areas, and then the tree is rebuilt again from the beginning.
 * 2) KSM will only insert into the unstable tree, pages whose hash value
 *    has not changed since the previous scan of all memory areas.
 * 3) The unstable tree is a RedBlack Tree - so its balancing is based on the
 *    colors of the nodes and not on their contents, assuring that even when
 *    the tree gets "corrupted" it won't get out of balance, so scanning time
 *    remains the same (also, searching and inserting nodes in an rbtree uses
 *    the same algorithm, so we have no overhead when we flush and rebuild).
 * 4) KSM never flushes the stable tree, which means that even if it were to
 *    take 10 attempts to find a page in the unstable tree, once it is found,
 *    it is secured in the stable tree.  (When we scan a new page, we first
 *    compare it against the stable tree, and then against the unstable tree.)
 *
 * If the merge_across_nodes tunable is unset, then KSM maintains multiple
 * stable trees and multiple unstable trees: one of each for each NUMA node.
 */

/**
 * struct mm_slot - ksm information per mm that is being scanned
 * @link: link to the mm_slots hash list
 * @mm_list: link into the mm_slots list, rooted in ksm_mm_head
 * @rmap_list: head for this mm_slot's singly-linked list of rmap_items
 * @mm: the mm that this information is valid for
 */
struct mm_slot {
	struct hlist_node link;------------------used to add this mm_slot to the mm_slots hash table.
	struct list_head mm_list;----------------used to add this mm_slot to the mm_slots list, whose head is ksm_mm_head.
	struct rmap_item *rmap_list;-------------head of this mm_slot's rmap_item list
	struct mm_struct *mm;--------------------the process's mm_struct data structure
};
ksm_scan records the current scan state.
/**
 * struct ksm_scan - cursor for scanning
 * @mm_slot: the current mm_slot we are scanning
 * @address: the next address inside that to be scanned
 * @rmap_list: link to the next rmap to be scanned in the rmap_list
 * @seqnr: count of completed full scans (needed when removing unstable node)
 *
 * There is only the one ksm_scan instance of this cursor structure.
 */
struct ksm_scan {
	struct mm_slot *mm_slot;---------------------the mm_slot currently being scanned
	unsigned long address;-----------------------the next address to be scanned
	struct rmap_item **rmap_list;----------------pointer to the next rmap_item to be scanned
	unsigned long seqnr;-------------------------incremented after each completed full scan; used when removing unstable nodes.
};
1.3 madvise Triggers and Wakes Up the KSM Kernel Thread
madvise gives the kernel advice about memory paging I/O; the KSM-related advice values are MADV_MERGEABLE and MADV_UNMERGEABLE.
MADV_MERGEABLE explicitly enables merging for a range of the process address space.
MADV_UNMERGEABLE explicitly cancels merging for a range of the process address space.
madvise-------------------------------------madvise system call
    madvise_vma
        madvise_behavior
            ksm_madvise---------------------handles the MADV_MERGEABLE/MADV_UNMERGEABLE cases
                __ksm_enter-----------------MADV_MERGEABLE: wakes up the ksmd thread
                unmerge_ksm_pages-----------MADV_UNMERGEABLE
__ksm_enter() first creates an mm_slot and then wakes up the ksmd kernel thread to do the KSM work.
int __ksm_enter(struct mm_struct *mm)
{
	struct mm_slot *mm_slot;
	int needs_wakeup;

	mm_slot = alloc_mm_slot();-----------------------------------------------allocate an mm_slot representing the current process's mm_struct
	if (!mm_slot)
		return -ENOMEM;

	/* Check ksm_run too?  Would need tighter locking */
	needs_wakeup = list_empty(&ksm_mm_head.mm_list);--------------------------empty means no mm_slot is currently being scanned

	spin_lock(&ksm_mmlist_lock);
	insert_to_mm_slots_hash(mm, mm_slot);-------------------------------------assign the current process's mm to mm_slot->mm and insert it into the hash
	/*
	 * When KSM_RUN_MERGE (or KSM_RUN_STOP),
	 * insert just behind the scanning cursor, to let the area settle
	 * down a little; when fork is followed by immediate exec, we don't
	 * want ksmd to waste time setting up and tearing down an rmap_list.
	 *
	 * But when KSM_RUN_UNMERGE, it's important to insert ahead of its
	 * scanning cursor, otherwise KSM pages in newly forked mms will be
	 * missed: then we might as well insert at the end of the list.
	 */
	if (ksm_run & KSM_RUN_UNMERGE)
		list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list);
	else
		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);---------add the mm_slot to the ksm_scan.mm_slot->mm_list list
	spin_unlock(&ksm_mmlist_lock);

	set_bit(MMF_VM_MERGEABLE, &mm->flags);------------------------------------marks that this process has been added to the KSM system
	atomic_inc(&mm->mm_count);

	if (needs_wakeup)
		wake_up_interruptible(&ksm_thread_wait);------------------------------if the list was empty before, wake up the ksmd kernel thread.

	return 0;
}
1.4 The ksmd Kernel Thread
static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
This defines the ksm_thread_wait wait queue; the ksmd kernel thread blocks on it and is woken up by madvise.
ksm_init() creates the ksmd kernel thread:
static int __init ksm_init(void)
{
	struct task_struct *ksm_thread;
	int err;

	err = ksm_slab_init();-----------------------------------------------creates the ksm_rmap_item, ksm_stable_node and ksm_mm_slot slab caches.
	if (err)
		goto out;

	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");------------creates the ksmd kernel thread; its thread function is ksm_scan_thread.
	if (IS_ERR(ksm_thread)) {
		pr_err("ksm: creating kthread failed\n");
		err = PTR_ERR(ksm_thread);
		goto out_free;
	}

#ifdef CONFIG_SYSFS
	err = sysfs_create_group(mm_kobj, &ksm_attr_group);-----------------creates the ksm-related nodes under /sys/kernel/mm
	if (err) {
		pr_err("ksm: register sysfs failed\n");
		kthread_stop(ksm_thread);
		goto out_free;
	}
#else
	ksm_run = KSM_RUN_MERGE;	/* no way for user to start it */
#endif /* CONFIG_SYSFS */

#ifdef CONFIG_MEMORY_HOTREMOVE
	/* There is no significance to this priority 100 */
	hotplug_memory_notifier(ksm_memory_callback, 100);
#endif
	return 0;

out_free:
	ksm_slab_free();
out:
	return err;
}
The sysfs hierarchy is kernel_obj --> mm_obj --> ksm_attr_group, where ksm_attr_group's .attrs is ksm_attrs.
This creates the /sys/kernel/mm/ksm nodes listed below; they let the user control KSM.
Each entry below lists the node name, its value, and a description:
- full_scans (0): read-only; number of times all mergeable areas have been fully scanned.
- pages_shared (0): ksm_pages_shared, number of nodes in the stable tree (shared KSM pages).
- pages_sharing (0): ksm_pages_sharing, number of additional sites sharing those KSM pages.
- pages_to_scan (100): ksm_thread_pages_to_scan, number of pages scanned per ksm_do_scan() call.
- pages_unshared (0): ksm_pages_unshared, number of nodes in the unstable tree.
- pages_volatile (0): number of pages that change too frequently to be merged.
- run (0): 0 - stop ksmd but keep merged pages; 1 - run ksmd; 2 - stop ksmd and unmerge all merged pages.
- sleep_millisecs (20): ksm_thread_sleep_millisecs, interval in milliseconds between KSM scan batches.
What do the ratios between pages_shared, pages_sharing and pages_unshared tell us?
A higher pages_sharing/pages_shared ratio means pages are being shared more effectively; a higher pages_unshared/pages_sharing ratio means KSM is yielding little benefit.
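As a small, hedged example (node names taken from the list above; the helper and its error handling are illustrative), the counters can be read from /sys/kernel/mm/ksm and the two ratios computed directly:

#include <stdio.h>

/* Read one value from a node under /sys/kernel/mm/ksm/. */
static long read_ksm_value(const char *name)
{
	char path[128];
	long value = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

int main(void)
{
	long shared   = read_ksm_value("pages_shared");
	long sharing  = read_ksm_value("pages_sharing");
	long unshared = read_ksm_value("pages_unshared");

	printf("pages_shared=%ld pages_sharing=%ld pages_unshared=%ld\n",
	       shared, sharing, unshared);

	if (shared > 0)
		printf("sharing/shared   = %.2f  (higher = better sharing)\n",
		       (double)sharing / shared);
	if (sharing > 0)
		printf("unshared/sharing = %.2f  (higher = KSM paying off less)\n",
		       (double)unshared / sharing);
	return 0;
}

Note that the counters only move once ksmd is running (run set to 1) and only for regions registered with MADV_MERGEABLE.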
ksm_scan_thread() is the main loop of the ksmd kernel thread: each iteration calls ksm_do_scan() to scan and merge pages_to_scan pages, then sleeps for sleep_millisecs milliseconds.
If there is nothing to do, it waits on ksm_thread_wait until madvise wakes it up.
static int ksm_scan_thread(void *nothing)
{
	set_freezable();
	set_user_nice(current, 5);

	while (!kthread_should_stop()) {
		mutex_lock(&ksm_thread_mutex);
		wait_while_offlining();
		if (ksmd_should_run())
			ksm_do_scan(ksm_thread_pages_to_scan);
		mutex_unlock(&ksm_thread_mutex);

		try_to_freeze();

		if (ksmd_should_run()) {
			schedule_timeout_interruptible(
				msecs_to_jiffies(ksm_thread_sleep_millisecs));
		} else {
			wait_event_freezable(ksm_thread_wait,
				ksmd_should_run() || kthread_should_stop());
		}
	}
	return 0;
}
ksm_do_scan() does the real work of the ksmd thread. Its flow, shown below, merges a page into a KSM page: it searches the stable and unstable trees and then performs the merge.
ksm_do_scan
    scan_get_next_rmap_item---------------------select a suitable anonymous page
    cmp_and_merge_page--------------------------compare the page against the pages in root_stable_tree/root_unstable_tree and decide whether it can be merged
        stable_tree_search----------------------search the stable red-black tree for a node whose page content matches this page
        try_to_merge_with_ksm_page--------------try to merge the candidate page into the KSM page
        stable_tree_append
        unstable_tree_search_insert-------------search the unstable red-black tree for a node with the same content as this page
        try_to_merge_two_pages------------------if a node with identical content is found in the unstable tree, try to merge the two pages into one KSM page
        stable_tree_append----------------------add the rmap_items of the two merged pages to the stable node's hash list
        break_cow
The ksm_do_scan() function is as follows:
static void ksm_do_scan(unsigned int scan_npages)
{
	struct rmap_item *rmap_item;
	struct page *uninitialized_var(page);

	while (scan_npages-- && likely(!freezing(current))) {----------------the loop tries to merge scan_npages pages
		cond_resched();
		rmap_item = scan_get_next_rmap_item(&page);----------------------get a suitable anonymous page
		if (!rmap_item)
			return;
		cmp_and_merge_page(page, rmap_item);-----------------------------look in KSM's stable and unstable trees for a merge candidate for this page and try to merge them.
		put_page(page);
	}
}
scan_get_next_rmap_item() walks ksm_mm_head, then walks every VMA of the process address space, and returns an rmap_item via get_next_rmap_item().
static struct rmap_item *scan_get_next_rmap_item(struct page **page)
{
	struct mm_struct *mm;
	struct mm_slot *slot;
	struct vm_area_struct *vma;
	struct rmap_item *rmap_item;
	int nid;

	if (list_empty(&ksm_mm_head.mm_list))---------------------------------------empty means there is no mm_slot, so there is nothing to do
		return NULL;

	slot = ksm_scan.mm_slot;
	if (slot == &ksm_mm_head) {-------------------------------------------------ksmd is starting a full scan: do some initialization work.
...
		if (!ksm_merge_across_nodes) {
			struct stable_node *stable_node;
			struct list_head *this, *next;
			struct page *page;

			list_for_each_safe(this, next, &migrate_nodes) {
				stable_node = list_entry(this,
						struct stable_node, list);
				page = get_ksm_page(stable_node, false);
				if (page)
					put_page(page);
				cond_resched();
			}
		}

		for (nid = 0; nid < ksm_nr_node_ids; nid++)
			root_unstable_tree[nid] = RB_ROOT;----------------------------------initialize the unstable trees

		spin_lock(&ksm_mmlist_lock);
		slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
		ksm_scan.mm_slot = slot;
		spin_unlock(&ksm_mmlist_lock);
		/*
		 * Although we tested list_empty() above, a racing __ksm_exit
		 * of the last mm on the list may have removed it since then.
		 */
		if (slot == &ksm_mm_head)
			return NULL;
next_mm:
		ksm_scan.address = 0;
		ksm_scan.rmap_list = &slot->rmap_list;
	}

	mm = slot->mm;
	down_read(&mm->mmap_sem);
	if (ksm_test_exit(mm))
		vma = NULL;
	else
		vma = find_vma(mm, ksm_scan.address);

	for (; vma; vma = vma->vm_next) {-------------------------------------------the for loop walks all VMAs
		if (!(vma->vm_flags & VM_MERGEABLE))
			continue;
		if (ksm_scan.address < vma->vm_start)
			ksm_scan.address = vma->vm_start;
		if (!vma->anon_vma)
			ksm_scan.address = vma->vm_end;

		while (ksm_scan.address < vma->vm_end) {-------------------------------scan every virtual page in the VMA
			if (ksm_test_exit(mm))
				break;
			*page = follow_page(vma, ksm_scan.address, FOLL_GET);--------------follow_page() walks from the virtual address to the struct page of the normal-mapping page
			if (IS_ERR_OR_NULL(*page)) {
				ksm_scan.address += PAGE_SIZE;
				cond_resched();
				continue;
			}
			if (PageAnon(*page) ||
			    page_trans_compound_anon(*page)) {-----------------------------only anonymous pages are handled
				flush_anon_page(vma, *page, ksm_scan.address);
				flush_dcache_page(*page);
				rmap_item = get_next_rmap_item(slot,---------------------------look on the mm_slot->rmap_list list for an rmap_item for this virtual address; if none is found, a new one is created.
					ksm_scan.rmap_list, ksm_scan.address);
				if (rmap_item) {
					ksm_scan.rmap_list =
							&rmap_item->rmap_list;
					ksm_scan.address += PAGE_SIZE;
				} else
					put_page(*page);
				up_read(&mm->mmap_sem);
				return rmap_item;
			}
			put_page(*page);
			ksm_scan.address += PAGE_SIZE;
			cond_resched();
		}
	}

	if (ksm_test_exit(mm)) {-------------------------------------------------the for loop scanned all VMAs of this process without finding a suitable anonymous page
		ksm_scan.address = 0;
		ksm_scan.rmap_list = &slot->rmap_list;
	}
	/*
	 * Nuke all the rmap_items that are above this current rmap:
	 * because there were no VM_MERGEABLE vmas with such addresses.
	 */
	remove_trailing_rmap_items(slot, ksm_scan.rmap_list);--------------------when no suitable anonymous page was found in this process, the remaining rmap_items no longer need to take up space and are simply removed.

	spin_lock(&ksm_mmlist_lock);
	ksm_scan.mm_slot = list_entry(slot->mm_list.next,------------------------take the next mm_slot
						struct mm_slot, mm_list);
	if (ksm_scan.address == 0) {---------------------------------------------handles the case where the process has gone away: remove the mm_slot from the ksm_mm_head list, free the mm_slot data structure, and clear the MMF_VM_MERGEABLE flag.
		/*
		 * We've completed a full scan of all vmas, holding mmap_sem
		 * throughout, and found no VM_MERGEABLE: so do the same as
		 * __ksm_exit does to remove this mm from all our lists now.
		 * This applies either when cleaning up after __ksm_exit
		 * (but beware: we can reach here even before __ksm_exit),
		 * or when all VM_MERGEABLE areas have been unmapped (and
		 * mmap_sem then protects against race with MADV_MERGEABLE).
		 */
		hash_del(&slot->link);
		list_del(&slot->mm_list);
		spin_unlock(&ksm_mmlist_lock);

		free_mm_slot(slot);
		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
		up_read(&mm->mmap_sem);
		mmdrop(mm);
	} else {
		spin_unlock(&ksm_mmlist_lock);
		up_read(&mm->mmap_sem);
	}

	/* Repeat until we've completed scanning the whole list */
	slot = ksm_scan.mm_slot;
	if (slot != &ksm_mm_head)
		goto next_mm;----------------------------------------continue with the next mm_slot

	ksm_scan.seqnr++;----------------------------------------one full pass over all mm_slots completed; bump the counter
	return NULL;
}
cmp_and_merge_page() takes two parameters: page is the qualifying anonymous page just found while scanning the mm_slot, and rmap_item is the rmap_item data structure for that page.
static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
{
	struct rmap_item *tree_rmap_item;
	struct page *tree_page = NULL;
	struct stable_node *stable_node;
	struct page *kpage;
	unsigned int checksum;
	int err;

	stable_node = page_stable_node(page);
	if (stable_node) {
		if (stable_node->head != &migrate_nodes &&
		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
			rb_erase(&stable_node->node,
				 root_stable_tree + NUMA(stable_node->nid));
			stable_node->head = &migrate_nodes;
			list_add(&stable_node->list, stable_node->head);
		}
		if (stable_node->head != &migrate_nodes &&
		    rmap_item->head == stable_node)
			return;
	}

	/* We first start with searching the page inside the stable tree */
	kpage = stable_tree_search(page);------------------------------------------look in root_stable_tree for a stable page whose content matches page.
	if (kpage == page && rmap_item->head == stable_node) {---------------------kpage and page are the same page, so the page is already a KSM page and needs no further processing.
		put_page(kpage);
		return;
	}

	remove_rmap_item_from_tree(rmap_item);

	if (kpage) {
		err = try_to_merge_with_ksm_page(rmap_item, page, kpage);--------------a node with identical content was found in the stable tree, so try to merge this page into that node
		if (!err) {
			/*
			 * The page was successfully merged:
			 * add its rmap_item to the stable tree.
			 */
			lock_page(kpage);
			stable_tree_append(rmap_item, page_stable_node(kpage));------------after a successful merge, add the rmap_item to the stable_node->hlist list.
			unlock_page(kpage);
		}
		put_page(kpage);
		return;
	}

============================================================dividing line: stable tree handling above, unstable tree handling below========================================================================

	/*
	 * If the hash value of the page has changed from the last time
	 * we calculated it, this page is changing frequently: therefore we
	 * don't want to insert it in the unstable tree, and we don't want
	 * to waste our time searching for something identical to it there.
	 */
	checksum = calc_checksum(page);--------------------------------------------recompute the checksum; if it differs, the page changes too often and is not suitable for the unstable red-black tree.
	if (rmap_item->oldchecksum != checksum) {
		rmap_item->oldchecksum = checksum;
		return;
	}

	tree_rmap_item =
		unstable_tree_search_insert(rmap_item, page, &tree_page);--------------search root_unstable_tree for a node with the same content as this page.
	if (tree_rmap_item) {
		kpage = try_to_merge_two_pages(rmap_item, page,
						tree_rmap_item, tree_page);----------------------------try to merge page and tree_page into one KSM page, kpage.
		put_page(tree_page);
		if (kpage) {
			/*
			 * The pages were successfully merged: insert new
			 * node in the stable tree and add both rmap_items.
			 */
			lock_page(kpage);
			stable_node = stable_tree_insert(kpage);---------------------------add kpage to root_stable_tree, creating a new stable_node.
			if (stable_node) {
				stable_tree_append(tree_rmap_item, stable_node);
				stable_tree_append(rmap_item, stable_node);
			}
			unlock_page(kpage);

			/*
			 * If we fail to insert the page into the stable tree,
			 * we will have 2 virtual addresses that are pointing
			 * to a ksm page left outside the stable tree,
			 * in which case we need to break_cow on both.
			 */
			if (!stable_node) {------------------------------------------------if inserting the stable_node into the stable tree failed, break_cow() deliberately triggers a page fault to split this KSM page again.
				break_cow(tree_rmap_item);
				break_cow(rmap_item);
			}
		}
	}
}
2. Differences Between Anonymous Pages and KSM Pages
2.1 How to Distinguish Anonymous Pages from KSM Pages?
If a struct page refers to an anonymous page mapped into user virtual address space, its mapping member points to an anon_vma.
The low 2 bits of mapping encode whether the page is an anonymous page or a KSM page.
/*
 ...
 * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.----------------so a KSM page is necessarily an anonymous page: KSM pages are a subset of anonymous pages.
 ...
 */
#define PAGE_MAPPING_ANON	1
#define PAGE_MAPPING_KSM	2
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
The kernel provides two functions to test whether a page is a KSM page or an anonymous page:
static inline int PageAnon(struct page *page)
{
	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}

static inline int PageKsm(struct page *page)-----------------------------------------this also shows that a KSM page must have both the ANON and KSM bits set.
{
	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
}
2.2 How KSM Pages Differ from Anonymous Pages
There are two cases: parent and child processes whose VMAs share the same anonymous page, and unrelated processes whose VMAs share the same anonymous page.
When the parent process maps an anonymous page into a VMA, the reverse-mapping (RMAP) machinery for that VMA is set up; __page_set_anon_rmap() sets page->index to the page's offset within the VMA.
When the child is forked, the parent's VMA contents are copied into the child's VMA and the parent's page tables are copied into the child, so page->index is the same for parent and child.
When all virtual addresses mapping a page need to be found, rmap_walk_anon() uses page->index to compute the virtual address within each VMA, for both parent and child.
In rmap_walk(), a non-KSM anonymous page is reverse-looked-up with rmap_walk_anon(); a sketch of the address calculation follows the listing below.
static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
{
...
	pgoff = page_to_pgoff(page);
	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);----------------------computes the virtual address from page->index

		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
			continue;

		ret = rwc->rmap_one(page, vma, address, rwc->arg);
		if (ret != SWAP_AGAIN)
			break;
		if (rwc->done && rwc->done(page))
			break;
	}
	anon_vma_unlock_read(anon_vma);
	return ret;
}
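The address calculation in the listing above reduces to page->index and the VMA's page offset. Below is a self-contained sketch of what vma_address() computes (the struct and names are simplified stand-ins, not the kernel's definitions):

#include <stdio.h>

#define PAGE_SHIFT 12	/* assumes 4KB pages, for illustration */

/* Only the two VMA fields the calculation needs (simplified stand-in). */
struct vma_sketch {
	unsigned long vm_start;		/* start virtual address of the VMA */
	unsigned long vm_pgoff;		/* offset of the VMA, in pages */
};

/*
 * page->index was set at fault time to the page's offset within the VMA
 * (__page_set_anon_rmap()), so any VMA attached to the same anon_vma can
 * recompute the page's virtual address from it.
 */
static unsigned long vma_address_sketch(unsigned long page_index,
					const struct vma_sketch *vma)
{
	return vma->vm_start + ((page_index - vma->vm_pgoff) << PAGE_SHIFT);
}

int main(void)
{
	struct vma_sketch vma = { .vm_start = 0x400000, .vm_pgoff = 0 };

	/* The page at index 3 maps back to vm_start + 3 * PAGE_SIZE. */
	printf("0x%lx\n", vma_address_sketch(3, &vma));
	return 0;
}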
A KSM page is formed by merging two pages with identical content; they may come from VMAs of different processes or from parent and child VMAs.
int rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc)
{
...
again:
	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
		struct anon_vma *anon_vma = rmap_item->anon_vma;
		struct anon_vma_chain *vmac;
		struct vm_area_struct *vma;

		anon_vma_lock_read(anon_vma);
		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
					       0, ULONG_MAX) {
...
			ret = rwc->rmap_one(page, vma, rmap_item->address, rwc->arg);---------------------------uses rmap_item->address to obtain the virtual address in each VMA
...
		}
		anon_vma_unlock_read(anon_vma);
	}
...
}
Therefore, for a KSM page, page->index equals the offset within the VMA that first mapped the page.