A Brief Analysis of the Linux Hugetlbfs Kernel Source (1): Hugetlbfs Initialization

Reposted article. Originally published 2014-10-30 11:10. Category: Linux Kernel

I. Introduction

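To implement virtual memory, the operating system manages memory in pages. The default page size has traditionally been 4096 bytes (4 KB), and although the page size is configurable in principle, most operating systems still use 4 KB pages by default. When an application needs several gigabytes or even tens of gigabytes of memory, 4 KB pages can become a serious performance bottleneck.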

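The CPU has a dedicated cache for address translations, the TLB (Translation Lookaside Buffer), and its capacity is limited. With the default 4 KB page size, a large working set needs far more translations than the TLB can hold, which leads to frequent TLB misses and page faults and hurts application performance. When the operating system uses 2 MB or larger pages instead, the number of TLB misses and page faults drops sharply and performance improves noticeably. This is the direct reason the Linux kernel introduced huge page support. The benefit is easy to see. Suppose an application needs 2 MB of memory. With 4 KB pages this takes 512 pages, hence 512 TLB entries and 512 page-table entries, and the operating system goes through at least 512 TLB misses and 512 page faults before the whole 2 MB is mapped to physical memory. With 2 MB pages, a single TLB miss and a single page fault establish the mapping for the whole 2 MB, and no further TLB misses or page faults occur while the program runs (assuming the TLB entry is not evicted and nothing is swapped out).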
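
The arithmetic in this example can be checked with a short sketch. The helper name pages_needed is hypothetical and not part of any kernel API; it only counts how many pages a region of a given size needs, which is also the number of page-table entries and first-touch page faults in the scenario above.

```c
#include <stddef.h>

/* Hypothetical helper: number of pages of size page_size needed to
 * cover `bytes` bytes, rounded up. In the scenario described above,
 * each page costs one page-table entry and, on first touch, at least
 * one TLB miss and one page fault. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

For a 2 MB region, pages_needed() with a 4 KB page size gives 512, and with a 2 MB page size it gives 1.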

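To support huge pages at minimal cost, Linux provides 2 MB huge pages through a special file system, hugetlbfs. Exposing huge pages through a file system lets each application choose the page size it needs rather than being forced to use 2 MB pages.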

II. Using Huge Pages

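The example in this article is taken from the documentation shipped with the Linux kernel source (Documentation/vm/hugetlbpage.txt). Before hugetlbfs can be used, the kernel must be built (make menuconfig) with the CONFIG_HUGETLB_PAGE and CONFIG_HUGETLBFS options enabled; both are found in the File systems configuration menu.

Once the kernel has been built and booted, mount the hugetlbfs special file system on a directory in the root file system so that it becomes accessible. The command is: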

　　mount none /mnt/huge -t hugetlbfs

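From then on, any file created under /mnt/huge/ is mapped into memory using 2 MB pages. Note that files in hugetlbfs do not support the write() system call (the kernel documentation states that read() is supported but write() is not), so they are normally accessed through memory mapping. To show how huge pages are used in practice, the following example, also taken from the kernel documentation mentioned above and slightly simplified, maps a hugetlbfs file.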

Listing 1. Example application using huge pages

#include <fcntl.h>
#include <stdio.h>      /* perror() */
#include <sys/mman.h>
#include <unistd.h>     /* close(), unlink() */

#define MAP_LENGTH      (10*1024*1024)

int main(void)
{
    int fd;
    void *addr;

    /* create a file in hugetlbfs; O_CREAT requires a mode argument */
    fd = open("/mnt/huge/test", O_CREAT | O_RDWR, 0755);
    if (fd < 0) {
        perror("Err: ");
        return -1;
    }

    /* map the file into the address space of the current process */
    addr = mmap(0, MAP_LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("Err: ");
        close(fd);
        unlink("/mnt/huge/test");
        return -1;
    }

    /* from now on, you can store application data on huge pages via addr */

    munmap(addr, MAP_LENGTH);
    close(fd);
    unlink("/mnt/huge/test");
    return 0;
}
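
The mapping length in the listing (10 MB) is a multiple of the 2 MB huge page size. As far as I know, kernels of this era reject a hugetlbfs mapping whose length is not huge-page aligned, so an application with an arbitrary size would round it up first. The helper below is a minimal sketch of that rounding; the name round_up_hugepage is hypothetical, and the huge page size is passed in by the caller rather than queried from the system.

```c
#include <stddef.h>

/* Round len up to the next multiple of hpage_size.
 * Assumes hpage_size is a power of two (for example 2 MB). */
size_t round_up_hugepage(size_t len, size_t hpage_size)
{
    return (len + hpage_size - 1) & ~(hpage_size - 1);
}
```

With a 2 MB huge page, a 3 MB request becomes 4 MB, while the 10 MB used in the listing is left unchanged.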

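Huge page statistics are available in the proc file system (/proc). For example, /proc/sys/vm/nr_hugepages reports the number of huge pages currently configured in the kernel, and the same file can be written to change that number: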

　　echo 20 > /proc/sys/vm/nr_hugepages
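
Besides /proc/sys/vm/nr_hugepages, the pool state is also reported in /proc/meminfo through fields such as HugePages_Total, HugePages_Free and Hugepagesize. The following is a minimal sketch of extracting one such field from a buffer that holds the file's text. The function name meminfo_field is hypothetical, and reading the file into the buffer is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Look for a line of the form "<key>: <number>" in a buffer holding
 * /proc/meminfo-style text. Returns 1 and stores the number in *out
 * on success, 0 if the key is not found or has no numeric value. */
int meminfo_field(const char *text, const char *key, unsigned long *out)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return sscanf(line + klen + 1, "%lu", out) == 1;
        line = strchr(line, '\n');
        if (line)
            line++;            /* step past the newline */
    }
    return 0;
}
```

A caller would typically read /proc/meminfo with fopen() and fread() into a NUL-terminated buffer and then ask for "HugePages_Total" or "HugePages_Free".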

III. Hugetlbfs Initialization (based on Linux 3.4.51)

1. hugetlb initialization

hugetlb is initialized by hugetlb_init(), which mainly sets up the global hstates[HUGE_MAX_HSTATE] array and creates the related sysfs directories and files.

static int __init hugetlb_init(void)
{
    /* Some platform decide whether they support huge pages at boot
     * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
     * there is no such support
     */
    if (HPAGE_SHIFT == 0)
        return 0;

    if (!size_to_hstate(default_hstate_size)) {
        default_hstate_size = HPAGE_SIZE;    /* 2 MB by default here */
        if (!size_to_hstate(default_hstate_size))
            /* Register the default size in hstates[HUGE_MAX_HSTATE];
             * if no hugepagesz= parameter was given this is the only entry.
             * HUGETLB_PAGE_ORDER = 9 for 2 MB pages, i.e. h->order = 9.
             */
            hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
    }
    /* With a single entry in hstates[], default_hstate_idx = 0 */
    default_hstate_idx = size_to_hstate(default_hstate_size) - hstates;
    /* default_hstate_max_huge_pages is 0 unless hugepages= was given */
    if (default_hstate_max_huge_pages)
        default_hstate.max_huge_pages = default_hstate_max_huge_pages;

    /* Allocate max_huge_pages pages for every hstate with
     * order < MAX_ORDER; nothing is allocated when the count is 0 */
    hugetlb_init_hstates();
    /* Move huge pages that were pre-allocated from bootmem during early
     * boot (hstates with order >= MAX_ORDER) into their hstate pools */
    gather_bootmem_prealloc();
    /* Print the state after initialization */
    report_hugepages();
    /* Create the directories and files under /sys/kernel/mm/hugepages */
    hugetlb_sysfs_init();
    /* Create the per-node directories and files under
     * /sys/devices/system/node/nodeN/hugepages */
    hugetlb_register_all_nodes();
    return 0;
}
module_init(hugetlb_init);
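
The lookup that hugetlb_init() relies on, finding the hstate for a given size and deriving its index by pointer subtraction, can be illustrated with a small user-space model. This is a sketch only: every model_ name and MODEL_ constant is hypothetical, the structure keeps just two fields, and 4 KB base pages are assumed.

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT   12   /* assumption: 4 KB base pages */
#define MODEL_MAX_HSTATE   2    /* stand-in for HUGE_MAX_HSTATE */

struct model_hstate {
    unsigned int  order;          /* huge page = 2^order base pages */
    unsigned long max_huge_pages; /* requested pool size */
};

struct model_hstate model_hstates[MODEL_MAX_HSTATE];
int model_max_hstate;             /* number of registered entries */

/* Register a huge page size; returns the new entry or NULL if full. */
struct model_hstate *model_add_hstate(unsigned int order)
{
    struct model_hstate *h;

    if (model_max_hstate >= MODEL_MAX_HSTATE)
        return NULL;
    h = &model_hstates[model_max_hstate++];
    h->order = order;
    h->max_huge_pages = 0;
    return h;
}

/* Find the entry whose huge page size equals `size` bytes. */
struct model_hstate *model_size_to_hstate(unsigned long size)
{
    int i;

    for (i = 0; i < model_max_hstate; i++)
        if ((1UL << (model_hstates[i].order + MODEL_PAGE_SHIFT)) == size)
            return &model_hstates[i];
    return NULL;
}
```

After model_add_hstate(9), looking up 2 MB returns the first array element, so subtracting the array base yields index 0, which mirrors how default_hstate_idx is computed in the listing.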

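The default huge page size can also be set with the kernel boot parameter "default_hugepagesz". For example, default_hugepagesz=4M sets default_hstate_size to 4 MB; if no hstate of that size exists, hugetlb_init() falls back to HPAGE_SIZE. The kernel implements the parameter as follows: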

static int __init hugetlb_default_setup(char *s)
{
    default_hstate_size = memparse(s, &s);
    return 1;
}
__setup("default_hugepagesz=", hugetlb_default_setup);
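
memparse() turns a string such as "4M" into a byte count. The following user-space sketch shows the idea using strtoull(); it is not the kernel implementation, the name parse_size is hypothetical, and only the K, M and G suffixes are handled.

```c
#include <stdlib.h>

/* Parse a number with an optional K/M/G suffix (case-insensitive),
 * in the spirit of the kernel's memparse(). User-space sketch only. */
unsigned long long parse_size(const char *s)
{
    char *end;
    unsigned long long v = strtoull(s, &end, 0);

    switch (*end) {
    case 'G': case 'g':
        v <<= 10;
        /* fall through */
    case 'M': case 'm':
        v <<= 10;
        /* fall through */
    case 'K': case 'k':
        v <<= 10;
        break;
    default:
        break;
    }
    return v;
}
```

With this sketch, "4M" yields 4194304 and "2048K" yields the same value as "2M".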

A huge page is implemented as a compound page made up of N physically contiguous 4 KB pages.
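
N is always a power of two, and the kernel stores its base-2 logarithm as the order of the hstate. A minimal sketch of that relationship follows; the function name hugepage_order is hypothetical and both sizes are assumed to be powers of two.

```c
/* Order of a huge page: log2(huge_size / base_size).
 * Both sizes are assumed to be powers of two. */
unsigned int hugepage_order(unsigned long huge_size, unsigned long base_size)
{
    unsigned int order = 0;

    while ((base_size << order) < huge_size)
        order++;
    return order;
}
```

A 2 MB huge page on 4 KB base pages has order 9 and a 4 MB huge page has order 10. Both are below the MAX_ORDER value of 11 quoted in this article, which is why the code paths below can take such pages from the buddy allocator.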

The number of huge pages can also be set with the kernel boot parameter "hugepages", for example hugepages=1024. The kernel implements it as follows:

static int __init hugetlb_nrpages_setup(char *s)
{
    unsigned long *mhp;
    static unsigned long *last_mhp;
    /*
     * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
     * so this hugepages= parameter goes to the "default hstate".
     */
    if (!max_hstate)
        mhp = &default_hstate_max_huge_pages;
    else
        mhp = &parsed_hstate->max_huge_pages;
    if (mhp == last_mhp) {
        printk(KERN_WARNING "hugepages= specified twice without "
            "interleaving hugepagesz=, ignoring\n");
        return 1;
    }
    if (sscanf(s, "%lu", mhp) <= 0)
        *mhp = 0;
    /*
     * Global state is always initialized later in hugetlb_init.
     * But we need to allocate >= MAX_ORDER hstates here early to still
     * use the bootmem allocator.
     */
    /* For 2 MB pages parsed_hstate->order = 9 and MAX_ORDER = 11, so
     * hugetlb_hstate_alloc_pages() is not called here. The pages
     * requested on the command line are allocated later, when
     * hugetlb_init() calls hugetlb_init_hstates().
     */
    if (max_hstate && parsed_hstate->order >= MAX_ORDER)
        hugetlb_hstate_alloc_pages(parsed_hstate);
    last_mhp = mhp;
    return 1;
}
__setup("hugepages=", hugetlb_nrpages_setup);

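The number of huge pages can also be changed at run time, for example with echo 20 > /proc/sys/vm/nr_hugepages. In this case the write to the proc file is dispatched to the sysctl handler shown below: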

int hugetlb_sysctl_handler(struct ctl_table *table, int write,
              void __user *buffer, size_t *length, loff_t *ppos)
{
    return hugetlb_sysctl_handler_common(false, table, write,
                            buffer, length, ppos);
}

static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
             struct ctl_table *table, int write,
             void __user *buffer, size_t *length, loff_t *ppos)
{
    struct hstate *h = &default_hstate;
    unsigned long tmp;
    int ret;
    tmp = h->max_huge_pages;
    if (write && h->order >= MAX_ORDER)
        return -EINVAL;
    table->data = &tmp;
    table->maxlen = sizeof(unsigned long);
    /* Copy the value from user space into table->data (i.e. tmp),
     * with the usual range checks */
    ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
    if (ret)
        goto out;
    if (write) {
        NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
        if (!(obey_mempolicy &&
                   init_nodemask_of_mempolicy(nodes_allowed))) {
            NODEMASK_FREE(nodes_allowed);
            nodes_allowed = &node_states[N_HIGH_MEMORY];
        }
        /* Set the new pool size and allocate (or free) the pages */
        h->max_huge_pages = set_max_huge_pages(h, tmp, nodes_allowed);
        if (nodes_allowed != &node_states[N_HIGH_MEMORY])
            NODEMASK_FREE(nodes_allowed);
    }
out:
    return ret;
}

static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
                        nodemask_t *nodes_allowed)
{
    unsigned long min_count, ret;
    if (h->order >= MAX_ORDER)
        return h->max_huge_pages;
    /*
     * Increase the pool size
     * First take pages out of surplus state.  Then make up the
     * remaining difference by allocating fresh huge pages.
     *
     * We might race with alloc_buddy_huge_page() here and be unable
     * to convert a surplus huge page to a normal huge page. That is
     * not critical, though, it just means the overall size of the
     * pool might be one hugepage larger than it needs to be, but
     * within all the constraints specified by the sysctls.
     */
    spin_lock(&hugetlb_lock);
    while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
        if (!adjust_pool_surplus(h, nodes_allowed, -1))
            break;
    }
    while (count > persistent_huge_pages(h)) {
        /*
         * If this allocation races such that we no longer need the
         * page, free_huge_page will handle it by freeing the page
         * and reducing the surplus.
         */
        spin_unlock(&hugetlb_lock);
        /* Allocate one huge page */
        ret = alloc_fresh_huge_page(h, nodes_allowed);
        spin_lock(&hugetlb_lock);
        if (!ret)
            goto out;
        /* Bail for signals. Probably ctrl-c from user */
        if (signal_pending(current))
            goto out;
    }
    /*
     * Decrease the pool size
     * First return free pages to the buddy allocator (being careful
     * to keep enough around to satisfy reservations).  Then place
     * pages into surplus state as needed so the pool will shrink
     * to the desired size as pages become free.
     *
     * By placing pages into the surplus state independent of the
     * overcommit value, we are allowing the surplus pool size to
     * exceed overcommit. There are few sane options here. Since
     * alloc_buddy_huge_page() is checking the global counter,
     * though, we'll note that we're not allowed to exceed surplus
     * and won't grow the pool anywhere else. Not until one of the
     * sysctls are changed, or the surplus pages go out of use.
     */
    min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
    min_count = max(count, min_count);
    try_to_free_low(h, min_count, nodes_allowed);
    while (min_count < persistent_huge_pages(h)) {
        if (!free_pool_huge_page(h, nodes_allowed, 0))
            break;
    }
    while (count < persistent_huge_pages(h)) {
        if (!adjust_pool_surplus(h, nodes_allowed, 1))
            break;
    }
out:
    ret = persistent_huge_pages(h);
    spin_unlock(&hugetlb_lock);
    return ret;
}
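
The min_count computation in the "Decrease the pool size" branch decides how far the pool can be shrunk immediately: pages in use and reserved pages must stay. The following is a minimal sketch of that calculation alone. The struct and function names are hypothetical; the three counters correspond to the hstate fields of the same name.

```c
struct pool_counts {
    unsigned long nr_huge_pages;    /* total pages in the pool */
    unsigned long free_huge_pages;  /* pages not currently in use */
    unsigned long resv_huge_pages;  /* free pages promised to mappings */
};

/* Size the pool can be reduced to right away when `count` pages are
 * requested: in-use plus reserved pages must be kept, so the result
 * is max(count, resv + nr - free). */
unsigned long shrink_target(const struct pool_counts *p, unsigned long count)
{
    unsigned long min_count =
        p->resv_huge_pages + p->nr_huge_pages - p->free_huge_pages;

    return count > min_count ? count : min_count;
}
```

For a pool of 20 pages with 15 free and 2 reserved, a request to shrink to 4 pages can only free pages down to 7 at once. In the kernel listing the remaining difference is handled by moving pages into the surplus state so that the pool shrinks further as pages are released.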

static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
    struct page *page;
    int start_nid;
    int next_nid;
    int ret = 0;
    start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
    next_nid = start_nid;
    do {
        /* Allocate 2^h->order 4 KB pages from the zonelist of this
         * memory node and return the address of the first page;
         * if the allocation fails, try the next memory node.
         */
        page = alloc_fresh_huge_page_node(h, next_nid);
        if (page) {
            ret = 1;
            break;
        }
        next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
    } while (next_nid != start_nid);
    if (ret)
        count_vm_event(HTLB_BUDDY_PGALLOC);
    else
        count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
    return ret;
}
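
The loop above visits the allowed nodes in round-robin order and stops at the first node that can supply a huge page. The following user-space model sketches that policy. It is a simplification with hypothetical names: the node mask is a plain bitmask, each node has a counter of available blocks, and the bookkeeping of the next starting node differs in detail from hstate_next_node_to_alloc().

```c
#define MODEL_NODES 4

struct model_pool {
    unsigned int allowed_mask;        /* bit n set: node n may be used */
    int next_nid;                     /* where the next search starts */
    int free_blocks[MODEL_NODES];     /* > 0: node can supply a page */
};

/* Return the next allowed node after `nid`, wrapping around.
 * Assumes allowed_mask has at least one bit set. */
static int next_allowed(const struct model_pool *p, int nid)
{
    do {
        nid = (nid + 1) % MODEL_NODES;
    } while (!(p->allowed_mask & (1u << nid)));
    return nid;
}

/* Try each allowed node once, starting at next_nid.
 * Returns the node that supplied the page, or -1 if none could. */
int model_alloc_fresh(struct model_pool *p)
{
    int start, nid;

    if (!(p->allowed_mask & (1u << p->next_nid)))
        p->next_nid = next_allowed(p, p->next_nid);
    start = nid = p->next_nid;
    do {
        if (p->free_blocks[nid] > 0) {
            p->free_blocks[nid]--;
            p->next_nid = next_allowed(p, nid);  /* spread later requests */
            return nid;
        }
        nid = next_allowed(p, nid);
    } while (nid != start);
    return -1;
}
```

Starting the next search at the node after the last successful one is what spreads huge pages evenly across the allowed nodes.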

static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
    struct page *page;
    if (h->order >= MAX_ORDER)
        return NULL;
    /* __GFP_COMP: allocate 2^h->order contiguous 4 KB pages, return the
     * address of the first page and mark the group as a compound page */
    page = alloc_pages_exact_node(nid,
        htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
                        __GFP_REPEAT|__GFP_NOWARN,
        huge_page_order(h));
    if (page) {
        if (arch_prepare_hugepage(page)) {
            __free_pages(page, huge_page_order(h));
            return NULL;
        }
        /* 1. Point lru.next of the second page of the 2^h->order
         *    allocated pages at the function free_huge_page();
         * 2. put_page() then ends up calling free_huge_page() -->
         *    enqueue_huge_page(), which adds the page to the
         *    h->hugepage_freelists[nid] list.
         */
        prep_new_huge_page(h, page, nid);
    }
    return page;
}
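
The two steps described in the comment before prep_new_huge_page(), installing a destructor and then dropping the allocation reference so that the destructor parks the page in the pool, can be modelled in user space as follows. This is a sketch under simplifying assumptions: all model_ names are hypothetical, the destructor pointer lives in an ordinary struct field, and a single global free list stands in for the per-node lists.

```c
#include <stddef.h>

struct model_page;
typedef void (*page_dtor_t)(struct model_page *);

struct model_page {
    int refcount;
    page_dtor_t dtor;             /* the article's kernel keeps this in page[1].lru.next */
    struct model_page *next_free;
};

struct model_page *model_free_list;   /* stand-in for the per-node free list */
unsigned long model_free_count;       /* stand-in for h->free_huge_pages */

/* Destructor: instead of returning memory to the general allocator,
 * park the huge page on the pool's free list. */
void model_free_huge_page(struct model_page *page)
{
    page->next_free = model_free_list;
    model_free_list = page;
    model_free_count++;
}

/* Drop one reference; the last reference triggers the destructor. */
void model_put_page(struct model_page *page)
{
    if (--page->refcount == 0 && page->dtor)
        page->dtor(page);
}

/* Install the destructor, then drop the allocation reference so the
 * freshly allocated page lands in the pool. */
void model_prep_new_huge_page(struct model_page *page)
{
    page->refcount = 1;
    page->dtor = model_free_huge_page;
    page->next_free = NULL;
    model_put_page(page);
}
```

The design point is that a freshly allocated huge page enters the pool through the same path as a page released by its last user, so there is only one place where free-list bookkeeping happens.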

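2. hugetlbfs initialization

Creating hugetlbfs mainly means setting up the VFS-level super_block, dentry and inode and the links between them. These are also tied to the hstates[] array initialized in hugetlb_init(), and therefore to the huge pages that were allocated. The original post showed this in a (somewhat cluttered) figure; the initialization function is: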

static int __init init_hugetlbfs_fs(void)
{
    int error;
    struct vfsmount *vfsmount;

    /* Initialize the backing_dev_info (writeback data structure) of hugetlbfs */
    error = bdi_init(&hugetlbfs_backing_dev_info);
    if (error)
        return error;

    error = -ENOMEM;
    /* Create the slab cache hugetlbfs_inode_cachep; hugetlbfs inodes
     * are allocated from it later */
    hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
                    sizeof(struct hugetlbfs_inode_info),
                    0, 0, init_once);
    if (hugetlbfs_inode_cachep == NULL)
        goto out2;

    /* Add hugetlbfs_fs_type to the global file_systems list */
    error = register_filesystem(&hugetlbfs_fs_type);
    if (error)
        goto out;

    /* Create the super_block, dentry and inode of hugetlbfs, link them
     * to each other, and link them to hugetlbfs_fs_type, default_hstate
     * and hugetlbfs_inode_cachep
     */
    vfsmount = kern_mount(&hugetlbfs_fs_type);

    if (!IS_ERR(vfsmount)) {
        hugetlbfs_vfsmount = vfsmount;
        return 0;
    }

    error = PTR_ERR(vfsmount);

 out:
    kmem_cache_destroy(hugetlbfs_inode_cachep);
 out2:
    bdi_destroy(&hugetlbfs_backing_dev_info);
    return error;
}
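
init_hugetlbfs_fs() uses the common kernel idiom of unwinding with goto labels: each label undoes one earlier step, and a failure jumps to the label that matches how far initialization got. The following user-space sketch reproduces only the control flow. All names and error values are hypothetical, and the real setup calls are replaced by a parameter that says which step should fail.

```c
enum step { STEP_BDI, STEP_CACHE, STEP_REGISTER, STEP_MOUNT, STEP_NONE };

/* Bitmask recording which cleanups ran, for illustration only. */
#define UNDO_CACHE 0x1
#define UNDO_BDI   0x2

/* Run four setup steps in order; `fail_at` names the step that fails
 * (STEP_NONE for success). On failure, undo the earlier steps in
 * reverse order via goto labels and report them in *undone. */
int model_init(enum step fail_at, unsigned int *undone)
{
    int error;

    *undone = 0;
    if (fail_at == STEP_BDI)
        return -1;             /* nothing to undo yet */

    error = -2;
    if (fail_at == STEP_CACHE)
        goto out2;

    error = -3;
    if (fail_at == STEP_REGISTER)
        goto out;

    error = -4;
    if (fail_at == STEP_MOUNT)
        goto out;

    return 0;

out:
    *undone |= UNDO_CACHE;     /* kmem_cache_destroy() in the listing */
out2:
    *undone |= UNDO_BDI;       /* bdi_destroy() in the listing */
    return error;
}
```

A failure while creating the cache undoes only the bdi step, whereas a failure at registration or mount falls through both labels and undoes the cache and the bdi in that order.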

Corrections and suggestions are welcome.

References:

http://www.ibm.com/developerworks/cn/linux/l-cn-hugetlb/
