1、linux內核中利用紅黑樹增刪改查快速、穩定的特性來管理的還有另一個非常重要的功能:虛擬內存管理!前面介紹了buddy和slab算法是用來管理物理頁面的。由於早期物理頁面遠比虛擬頁面小很多,而且只需要分配和回收合並,所以也沒用樹形結構來組織,簡單粗暴地用鏈表來管理!但是虛擬內存不一樣了:以32位的系統為例,虛擬內存有4GB,能划分的虛擬內存塊有很多,划分后需要快速增刪查虛擬內存塊(需要頻繁地讀取代碼、讀寫數據、加載動態鏈接庫等),此時用紅黑樹就很合適了!老規矩,先上結構體:
task_struct中嵌套了一個mm_struct結構體指針,這個結構體大有乾坤:
struct task_struct { ....... struct mm_struct *mm; ....... }
繼續深入:又發現了紅黑樹的根節點!另外兩個vm_area_struct結構體又是干嘛的了?
struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb;/*又是紅黑樹的根節點*/ struct vm_area_struct * mmap_cache; /* last find_vma result */ ....... }
繼續深入:
- 發現rb_node結構體了么?這明顯是紅黑樹的節點呀!和上面的rb_root不是剛好組成紅黑樹么?
-
進程的虛擬內存空間會被分成不同的若干區域,每個區域都有其相關的屬性和用途; 一個合法的地址總是落在某個區域當中的,這些區域也不會重疊。在linux內核中,這樣的區域被稱之為虛擬內存區域(virtual memory areas),簡稱 VMA。 一個vma就是一塊連續的線性地址空間的抽象,它擁有自身的權限(可讀,可寫,可執行等等) ;
struct vm_area_struct { struct mm_struct * vm_mm; /* 所屬的內存描述符 */ unsigned long vm_start; /* vma的起始地址 */ unsigned long vm_end; /* vma的結束地址 */ /* 該vma的在一個進程的vma鏈表中的前驅vma和后驅vma指針,鏈表中的vma都是按地址來排序的*/ struct vm_area_struct *vm_next, *vm_prev; pgprot_t vm_page_prot; /* vma的訪問權限 */ unsigned long vm_flags; /* 標識集 */ struct rb_node vm_rb; /* 紅黑樹中對應的節點 */ ............... }
為了直觀展示這些結構體之間的關系,我畫了一張圖供大家參考,要點說明如下:
- vm_area_struct有vm_start和vm_end兩個字段,分別指向虛擬內存區域的開始和結束地址;
- vm_area_struct的vm_rb又組成了紅黑樹:便於根據特定的條件快速查找目標區域;
- vm_area_struct實例之間也用鏈表連接:主要用來遍歷節點(不需要像樹形結構一樣前序、中序、后序等方式遍歷,速度快一些;樹形結構遍歷需要遞歸或借助棧/隊列等結構,空間復雜度是O(N))
2、(1)結構體定義好了,下一步就是操作這些結構體了。既然用到了紅黑樹管理VMA,第一件事肯定是建樹、建鏈表啦(所有的操作都在mm\mmap.c文件里),最直接的api是__vma_link函數,如下:
/*將新創建的vm_area_struct掛在mm_struct中管理的紅黑樹上*/ static void __vma_link(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, struct rb_node **rb_link, struct rb_node *rb_parent) { __vma_link_list(mm, vma, prev, rb_parent); __vma_link_rb(mm, vma, rb_link, rb_parent); }
調用了兩個函數,從名稱看就知道一個是建立鏈表,另一個是建立紅黑樹;先看第一個_vma_link_list函數,如下:就是把vma實例加入到mm的mmap字段,然后讓vma的next指針指向下一個vma實例,完成鏈表的建立!
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, struct rb_node *rb_parent) { struct vm_area_struct *next; vma->vm_prev = prev; if (prev) { next = prev->vm_next; prev->vm_next = vma; } else { mm->mmap = vma; if (rb_parent) next = rb_entry(rb_parent, struct vm_area_struct, vm_rb); else next = NULL; } vma->vm_next = next; if (next) next->vm_prev = vma; }
另一個是__vma_link_rb函數在紅黑樹中插入節點,如下:
void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma, struct rb_node **rb_link, struct rb_node *rb_parent) { /* Update tracking information for the gap following the new vma. */ if (vma->vm_next) vma_gap_update(vma->vm_next); else mm->highest_vm_end = vma->vm_end; /* * vma->vm_prev wasn't known when we followed the rbtree to find the * correct insertion point for that vma. As a result, we could not * update the vma vm_rb parents rb_subtree_gap values on the way down. * So, we first insert the vma with a zero rb_subtree_gap value * (to be consistent with what we did on the way down), and then * immediately update the gap to the correct value. Finally we * rebalance the rbtree after all augmented values have been set. */ rb_link_node(&vma->vm_rb, rb_parent, rb_link); vma->rb_subtree_gap = 0; vma_gap_update(vma); /*傳入的vma插入紅黑樹*/ vma_rb_insert(vma, &mm->mm_rb); }
(2)上述代碼執行完畢,紅黑樹也就建好了!接下來就是查詢了;因為紅黑樹是用來管理vma的,建樹的時候所用的key值就是vma的線性地址了,所以紅黑樹節點左孩vma的地址都比當前節點小,右孩vma的地址都比當前節點大;代碼如下:
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) { struct rb_node *rb_node; struct vm_area_struct *vma; /* Check the cache first. */ vma = vmacache_find(mm, addr); if (likely(vma)) return vma; /*紅黑樹的根節點*/ rb_node = mm->mm_rb.rb_node; while (rb_node) { struct vm_area_struct *tmp; tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb); if (tmp->vm_end > addr) { vma = tmp; /*要查找的addr剛好在這個vma實例的vm_start和vm_end之間,說明找到了*/ if (tmp->vm_start <= addr) break; /*addr比vm_start都大,繼續從左子樹查找*/ rb_node = rb_node->rb_left; } else /*addr比vm_end小,繼續從右子樹查找*/ rb_node = rb_node->rb_right; } if (vma) vmacache_update(addr, vma); return vma; }
(3)上面是根據用戶指定的線性地址查找第一個符合的vma實例,用戶實際使用時,還需要查找空閑未使用的虛擬內存塊,用來存儲重要的數據,這個又該怎么實現了?linux的實現方法為:arch_get_unmapped_area函數,如下:最核心的代碼加了中文注釋;思路就是根據addr查找vma;如果vma為空,並且大小、邊界等條件也都滿足,這塊內存就可以拿來用了!
unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; struct vm_unmapped_area_info info; if (len > TASK_SIZE - mmap_min_addr) return -ENOMEM; if (flags & MAP_FIXED) return addr; if (addr) { addr = PAGE_ALIGN(addr);//將addr調整成頁大小的倍數 /*通過addr查找對應的vma是否為空;如果是空,說明該區域還未被使用 ,如果其他條件也滿足,就直接使用這塊地址了*/ vma = find_vma(mm, addr); if (TASK_SIZE - len >= addr && addr >= mmap_min_addr && (!vma || addr + len <= vma->vm_start)) return addr; } info.flags = 0; info.length = len; info.low_limit = mm->mmap_base; info.high_limit = TASK_SIZE; info.align_mask = 0; return vm_unmapped_area(&info); }
(4)內存用完后就要釋放,為了避免碎片,肯定是要和現有的空閑內存合並的;由於空閑內存沒有用紅黑樹組織,所以此步驟也不涉及紅黑樹的操作,具體的思路為:先檢查釋放區域之前的prve區域終止地址是否與釋放區域起始地址重合,或釋放區域的結束地址是否與其之后的next區域起始地址重合;接着再檢查將要合並的區域是否有相同的標志。如果合並區域均映射了磁盤文件,則還要檢查其映射文件是否相同,以及文件內的偏移量是否連續。思路並不復雜,代碼如下:
/* * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out * whether that can be merged with its predecessor or its successor. * Or both (it neatly fills a hole). * * In most cases - when called for mmap, brk or mremap - [addr,end) is * certain not to be mapped by the time vma_merge is called; but when * called for mprotect, it is certain to be already mapped (either at * an offset within prev, or at the start of next), and the flags of * this area are about to be changed to vm_flags - and the no-change * case has already been eliminated. * * The following mprotect cases have to be considered, where AAAA is * the area passed down from mprotect_fixup, never extending beyond one * vma, PPPPPP is the prev vma specified, and NNNNNN the next vma after: * * AAAA AAAA AAAA AAAA * PPPPPPNNNNNN PPPPPPNNNNNN PPPPPPNNNNNN PPPPNNNNXXXX * cannot merge might become might become might become * PPNNNNNNNNNN PPPPPPPPPPNN PPPPPPPPPPPP 6 or * mmap, brk or case 4 below case 5 below PPPPPPPPXXXX 7 or * mremap move: PPPPXXXXXXXX 8 * AAAA * PPPP NNNN PPPPPPPPPPPP PPPPPPPPNNNN PPPPNNNNNNNN * might become case 1 below case 2 below case 3 below * * It is important for case 8 that the the vma NNNN overlapping the * region AAAA is never going to extended over XXXX. Instead XXXX must * be extended in region AAAA and NNNN must be removed. This way in * all cases where vma_merge succeeds, the moment vma_adjust drops the * rmap_locks, the properties of the merged vma will be already * correct for the whole merged range. Some of those properties like * vm_page_prot/vm_flags may be accessed by rmap_walks and they must * be correct for the whole merged range immediately after the * rmap_locks are released. Otherwise if XXXX would be removed and * NNNN would be extended over the XXXX range, remove_migration_ptes * or other rmap walkers (if working on addresses beyond the "end" * parameter) may establish ptes with the wrong permissions of NNNN * instead of the right permissions of XXXX. */ struct vm_area_struct *vma_merge(struct mm_struct *mm, struct vm_area_struct *prev, unsigned long addr, unsigned long end, unsigned long vm_flags, struct anon_vma *anon_vma, struct file *file, pgoff_t pgoff, struct mempolicy *policy, struct vm_userfaultfd_ctx vm_userfaultfd_ctx) { pgoff_t pglen = (end - addr) >> PAGE_SHIFT; struct vm_area_struct *area, *next; int err; /* * We later require that vma->vm_flags == vm_flags, * so this tests vma->vm_flags & VM_SPECIAL, too. */ if (vm_flags & VM_SPECIAL) return NULL; if (prev) next = prev->vm_next; else next = mm->mmap; area = next; if (area && area->vm_end == end) /* cases 6, 7, 8 */ next = next->vm_next; /* verify some invariant that must be enforced by the caller */ VM_WARN_ON(prev && addr <= prev->vm_start); VM_WARN_ON(area && end > area->vm_end); VM_WARN_ON(addr >= end); /* * Can it merge with the predecessor? */ if (prev && prev->vm_end == addr && mpol_equal(vma_policy(prev), policy) && can_vma_merge_after(prev, vm_flags, anon_vma, file, pgoff, vm_userfaultfd_ctx)) { /* * OK, it can. Can we now merge in the successor as well? */ if (next && end == next->vm_start && mpol_equal(policy, vma_policy(next)) && can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen, vm_userfaultfd_ctx) && is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) { /* cases 1, 6 */ err = __vma_adjust(prev, prev->vm_start, next->vm_end, prev->vm_pgoff, NULL, prev); } else /* cases 2, 5, 7 */ err = __vma_adjust(prev, prev->vm_start, end, prev->vm_pgoff, NULL, prev); if (err) return NULL; khugepaged_enter_vma_merge(prev, vm_flags); return prev; } /* * Can this new request be merged in front of next? */ if (next && end == next->vm_start && mpol_equal(policy, vma_policy(next)) && can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen, vm_userfaultfd_ctx)) { if (prev && addr < prev->vm_end) /* case 4 */ err = __vma_adjust(prev, prev->vm_start, addr, prev->vm_pgoff, NULL, next); else { /* cases 3, 8 */ err = __vma_adjust(area, addr, next->vm_end, next->vm_pgoff - pglen, NULL, next); /* * In case 3 area is already equal to next and * this is a noop, but in case 8 "area" has * been removed and next was expanded over it. */ area = next; } if (err) return NULL; khugepaged_enter_vma_merge(area, vm_flags); return area; } return NULL; }
(5)既然釋放內存,相應的vma肯定也要從紅黑樹刪除的,這個功能在detach_vmas_to_be_unmapped中實現的:
/* * Create a list of vma's touched by the unmap, removing them from the mm's * vma list as we go.. Remove the vma's, and unmap the actual pages */ static void detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, unsigned long end) { struct vm_area_struct **insertion_point; struct vm_area_struct *tail_vma = NULL; insertion_point = (prev ? &prev->vm_next : &mm->mmap); vma->vm_prev = NULL; do { vma_rb_erase(vma, &mm->mm_rb);//從紅黑樹刪除vma mm->map_count--; tail_vma = vma; vma = vma->vm_next; } while (vma && vma->vm_start < end); *insertion_point = vma; if (vma) { vma->vm_prev = prev; vma_gap_update(vma); } else mm->highest_vm_end = prev ? prev->vm_end : 0; tail_vma->vm_next = NULL; /* Kill the cache */ vmacache_invalidate(mm); }
總結:
1、這里有虛擬內存和物理內存、進程內存和操作系統內存的映射圖示,方便各位理解
2、用rb_node關鍵詞搜索,我一共找到了3千多個:可見紅黑樹在linux內核使用的范圍之廣!
3、AVL樹和紅黑樹很類似,但是AVL樹的刪除和插入需要多次旋轉操作以及不斷向根節點回溯,所以在大量刪除和插入操作的情況下, AVL樹的效率較低。紅黑樹是一種查找效率僅次於AVL樹的不完全平衡二叉樹,它摒棄了AVL要求的強平衡約束,能夠以O(logn)的時間復雜度進行插入、刪除操作;而且其插入和刪除最多需要兩次或者三次旋轉即可保持樹的平衡。雖然二者的算法復雜度相同,但在最壞情況下,紅黑樹提供了更加快速刪除和插入一個節點的算法,顯然紅黑樹的高效操作更適合Linux內核中大量VMA的添加、刪除和查找!
參考:
1、https://stephenzhou.blog.csdn.net/article/details/89501437?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1.pc_relevant_default&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1.pc_relevant_default&utm_relevant_index=2 虛擬內存VMA淺析
2、http://edsionte.com/techblog/archives/3564 虛擬內存操作
3、http://edsionte.com/techblog/archives/3586 虛擬內存操作