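I. Introduction
To implement virtual memory, the operating system manages memory in pages. Ever since paging was introduced, the default page size has been 4096 bytes (4 KB). The page size is configurable in principle, but the vast majority of operating system implementations still use the default 4 KB page. When an application needs several gigabytes or even tens of gigabytes of memory, 4 KB pages can seriously limit its performance.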
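The CPU has a dedicated cache, the TLB (Translation Lookaside Buffer), that holds recently used virtual-to-physical address translations, but its capacity is limited. With the default 4 KB page size, a large working set needs many TLB entries, which leads to more TLB misses and page faults and noticeably hurts application performance. When the operating system pages in units of 2 MB or larger, the number of TLB misses and page faults drops sharply and performance improves accordingly. This is the direct reason the Linux kernel introduced huge page support. The benefit is easy to see. Suppose an application needs 2 MB of memory. With 4 KB pages, that takes 512 pages, hence 512 TLB entries and 512 page table entries, and the operating system goes through at least 512 TLB misses and 512 page faults before the whole 2 MB of application space is mapped to physical memory. With 2 MB pages, a single TLB miss and a single page fault establish the mapping for the whole 2 MB, and no further TLB misses or page faults occur at run time (assuming no TLB entry replacement and no swapping).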
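The 512-versus-1 arithmetic can be checked with a few lines of C. This is only a sketch: the helper name pages_needed is made up for illustration and is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of pages (and therefore page table entries)
 * needed to map `bytes` with pages of `page_size` bytes. A partial last
 * page still occupies a whole page, so the division rounds up. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

For a 2 MB region, pages_needed(2 * 1024 * 1024, 4096) gives 512, while the same call with a 2 MB page size gives 1.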
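To support huge pages at minimal cost, Linux provides 2 MB huge pages through a special file system, hugetlbfs. Exposing huge pages through a special file system lets applications choose the page size they need instead of being forced to use 2 MB pages everywhere.
II. Using HugePages
The examples in this article are taken from the documentation shipped with the Linux kernel source (Documentation/vm/hugetlbpage.txt). Before using hugetlbfs, enable the CONFIG_HUGETLB_PAGE and CONFIG_HUGETLBFS options when configuring the kernel (make menuconfig); both are found under the File systems menu.
After building and booting the kernel, mount the hugetlbfs special file system on a directory in the root file system so that it becomes accessible: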
mount none /mnt/huge -t hugetlbfs
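From then on, any file created under /mnt/huge/ is mapped into memory with 2 MB pages. Note that files on hugetlbfs do not support the write() system call (the kernel documentation states that read() is supported while write() is not), so they are normally accessed through memory mappings. To show how huge pages are used in practice, here is an example, also taken from the kernel documentation mentioned above and slightly simplified.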
Listing 1. Linux huge page example
#include <stdio.h>      /* perror() */
#include <unistd.h>     /* close(), unlink() */
#include <fcntl.h>
#include <sys/mman.h>
#include <errno.h>

#define MAP_LENGTH (10*1024*1024)

int main(void)
{
    int fd;
    void *addr;

    /* create a file in hugetlbfs; O_CREAT requires an explicit mode */
    fd = open("/mnt/huge/test", O_CREAT | O_RDWR, 0755);
    if (fd < 0) {
        perror("Err: ");
        return -1;
    }

    /* map the file into the address space of the current process */
    addr = mmap(0, MAP_LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("Err: ");
        close(fd);
        unlink("/mnt/huge/test");
        return -1;
    }

    /* from now on, you can store application data on huge pages via addr */

    munmap(addr, MAP_LENGTH);
    close(fd);
    unlink("/mnt/huge/test");
    return 0;
}
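MAP_LENGTH in the listing is 10 MB, which happens to be a multiple of the 2 MB huge page size. A mapping of a hugetlbfs file is expected to have a length that is a multiple of the huge page size, so an arbitrary length usually has to be rounded up first. A minimal sketch, assuming a power-of-two huge page size; the helper name round_up_to_huge_page is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round `len` up to the next multiple of
 * `huge_page_size`, which must be a power of two (2 MB here). */
size_t round_up_to_huge_page(size_t len, size_t huge_page_size)
{
    /* a power of two has exactly one bit set */
    assert(huge_page_size != 0 &&
           (huge_page_size & (huge_page_size - 1)) == 0);
    return (len + huge_page_size - 1) & ~(huge_page_size - 1);
}
```

With a 2 MB huge page, a 3 MB request becomes 4 MB and a 10 MB request stays at 10 MB.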
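Huge page statistics are available under the proc file system (/proc). For example, /proc/sys/vm/nr_hugepages reports the number of huge pages currently configured in the kernel, and writing to the same file changes that number: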
echo 20 > /proc/sys/vm/nr_hugepages
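The effect of such a write can be checked in /proc/meminfo, which reports fields such as HugePages_Total, HugePages_Free and Hugepagesize. The sketch below parses one field out of a meminfo-style text buffer. The function name meminfo_field is an assumption made for this example, and reading the actual file into the buffer is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: look up "<key>: <number>" in a /proc/meminfo-style
 * text buffer. Returns 0 and stores the number in *out on success,
 * -1 if the key is missing or has no numeric value. */
int meminfo_field(const char *text, const char *key, unsigned long *out)
{
    size_t klen = strlen(key);
    const char *line = text;

    assert(out != NULL);
    while (line && *line) {
        /* the key must be followed directly by ':' to match whole names */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return sscanf(line + klen + 1, "%lu", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}
```

Given a buffer containing the line "HugePages_Total:      20", a lookup of "HugePages_Total" stores 20 in *out.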
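III. Hugetlbfs initialization (based on Linux-3.4.51)
1. hugetlb initialization
hugetlb is initialized by hugetlb_init(), which mainly sets up the global hstates[HUGE_MAX_HSTATE] array and creates the related sysfs directories and files.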
static int __init hugetlb_init(void)
{
    /* Some platform decide whether they support huge pages at boot
     * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
     * there is no such support
     */
    if (HPAGE_SHIFT == 0)
        return 0;

    if (!size_to_hstate(default_hstate_size)) {
        default_hstate_size = HPAGE_SIZE; /* the default size is 2 MB */
        if (!size_to_hstate(default_hstate_size))
            /* Register the default hstate in hstates[]; without a
             * hugepagesz= parameter the array has a single entry.
             * HUGETLB_PAGE_ORDER = 9, so h->order = 9.
             */
            hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
    }
    /* with a single entry in hstates[], default_hstate_idx = 0 */
    default_hstate_idx = size_to_hstate(default_hstate_size) - hstates;
    /* default_hstate_max_huge_pages is 0 unless hugepages= was given */
    if (default_hstate_max_huge_pages)
        default_hstate.max_huge_pages = default_hstate_max_huge_pages;

    /* Allocate max_huge_pages pages for every hstate whose order is below
     * MAX_ORDER; with a maximum of 0 nothing is allocated here.
     */
    hugetlb_init_hstates();
    /* Take the gigantic pages (order >= MAX_ORDER) that were reserved from
     * bootmem during early boot and set them up as huge pages.
     */
    gather_bootmem_prealloc();
    /* print the state after initialization */
    report_hugepages();
    /* create the directories and files under /sys/kernel/mm/hugepages */
    hugetlb_sysfs_init();
    /* create the per-node hugepages directories and files under
     * /sys/devices/system/node/nodeN/
     */
    hugetlb_register_all_nodes();
    return 0;
}
module_init(hugetlb_init);
The default huge page size can also be set with the kernel boot parameter "default_hugepagesz". For example, default_hugepagesz=4M sets default_hstate_size to 4 MB. The kernel implements it as follows:
static int __init hugetlb_default_setup(char *s)
{
    default_hstate_size = memparse(s, &s);
    return 1;
}
__setup("default_hugepagesz=", hugetlb_default_setup);
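memparse() turns a string such as "4M" into a byte count. The following is a userspace approximation for illustration only, not the kernel's implementation: the name parse_size is made up, only the K, M and G suffixes are handled, and overflow is ignored.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for memparse(): parse a number in any
 * base accepted by strtoull() and apply an optional K/M/G suffix. */
unsigned long long parse_size(const char *s)
{
    char *end;
    unsigned long long value;

    assert(s != NULL);
    value = strtoull(s, &end, 0);   /* base 0: decimal, 0x hex, 0 octal */
    switch (*end) {
    case 'G': case 'g':
        value <<= 30;
        break;
    case 'M': case 'm':
        value <<= 20;
        break;
    case 'K': case 'k':
        value <<= 10;
        break;
    default:                        /* no suffix: plain bytes */
        break;
    }
    return value;
}
```

Under these assumptions, parse_size("4M") yields 4194304 and parse_size("2M") yields 2097152, the value HPAGE_SIZE has with 2 MB huge pages.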
A huge page is implemented as a compound page built from N physically contiguous 4 KB pages.
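N is always a power of two, 2^order, which is where the order of 9 for a 2 MB huge page comes from (2^9 * 4 KB = 2 MB). A small sketch of that relationship, with a hypothetical helper name:

```c
#include <assert.h>

/* Hypothetical helper: smallest order such that
 * base_size * 2^order >= huge_size. */
unsigned int huge_page_order_for(unsigned long huge_size,
                                 unsigned long base_size)
{
    unsigned int order = 0;

    assert(base_size > 0 && huge_size >= base_size);
    while ((base_size << order) < huge_size)
        order++;
    return order;
}
```

For a 2 MB huge page on 4 KB base pages this returns 9, so N = 1 << 9 = 512 base pages.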
The number of huge pages can also be set with the kernel boot parameter "hugepages", for example hugepages=1024. The kernel implements it as follows:
static int __init hugetlb_nrpages_setup(char *s)
{
    unsigned long *mhp;
    static unsigned long *last_mhp;

    /*
     * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
     * so this hugepages= parameter goes to the "default hstate".
     */
    if (!max_hstate)
        mhp = &default_hstate_max_huge_pages;
    else
        mhp = &parsed_hstate->max_huge_pages;

    if (mhp == last_mhp) {
        printk(KERN_WARNING "hugepages= specified twice without "
            "interleaving hugepagesz=, ignoring\n");
        return 1;
    }

    if (sscanf(s, "%lu", mhp) <= 0)
        *mhp = 0;

    /*
     * Global state is always initialized later in hugetlb_init.
     * But we need to allocate >= MAX_ORDER hstates here early to still
     * use the bootmem allocator.
     */
    /* With parsed_hstate->order = 9 and MAX_ORDER = 11,
     * hugetlb_hstate_alloc_pages() is not called here. For such hstates
     * the pages requested on the command line are allocated later, when
     * hugetlb_init() calls hugetlb_init_hstates().
     */
    if (max_hstate && parsed_hstate->order >= MAX_ORDER)
        hugetlb_hstate_alloc_pages(parsed_hstate);

    last_mhp = mhp;

    return 1;
}
__setup("hugepages=", hugetlb_nrpages_setup);
The number of huge pages can also be changed at run time with echo 20 > /proc/sys/vm/nr_hugepages. In this case the write to the proc file is dispatched to the sysctl handler shown below:
int hugetlb_sysctl_handler(struct ctl_table *table, int write,
              void __user *buffer, size_t *length, loff_t *ppos)
{
    return hugetlb_sysctl_handler_common(false, table, write,
                            buffer, length, ppos);
}
static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
             struct ctl_table *table, int write,
             void __user *buffer, size_t *length, loff_t *ppos)
{
    struct hstate *h = &default_hstate;
    unsigned long tmp;
    int ret;

    tmp = h->max_huge_pages;

    if (write && h->order >= MAX_ORDER)
        return -EINVAL;

    table->data = &tmp;
    table->maxlen = sizeof(unsigned long);
    /* copy the value from user space into table->data (that is, tmp)
     * and validate it
     */
    ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
    if (ret)
        goto out;

    if (write) {
        NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
        if (!(obey_mempolicy &&
                   init_nodemask_of_mempolicy(nodes_allowed))) {
            NODEMASK_FREE(nodes_allowed);
            nodes_allowed = &node_states[N_HIGH_MEMORY];
        }
        /* set the maximum page count and allocate the actual pages */
        h->max_huge_pages = set_max_huge_pages(h, tmp, nodes_allowed);

        if (nodes_allowed != &node_states[N_HIGH_MEMORY])
            NODEMASK_FREE(nodes_allowed);
    }
out:
    return ret;
}
static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
                        nodemask_t *nodes_allowed)
{
    unsigned long min_count, ret;

    if (h->order >= MAX_ORDER)
        return h->max_huge_pages;

    /*
     * Increase the pool size
     * First take pages out of surplus state. Then make up the
     * remaining difference by allocating fresh huge pages.
     *
     * We might race with alloc_buddy_huge_page() here and be unable
     * to convert a surplus huge page to a normal huge page. That is
     * not critical, though, it just means the overall size of the
     * pool might be one hugepage larger than it needs to be, but
     * within all the constraints specified by the sysctls.
     */
    spin_lock(&hugetlb_lock);
    while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
        if (!adjust_pool_surplus(h, nodes_allowed, -1))
            break;
    }

    while (count > persistent_huge_pages(h)) {
        /*
         * If this allocation races such that we no longer need the
         * page, free_huge_page will handle it by freeing the page
         * and reducing the surplus.
         */
        spin_unlock(&hugetlb_lock);
        /* allocate a huge page */
        ret = alloc_fresh_huge_page(h, nodes_allowed);
        spin_lock(&hugetlb_lock);
        if (!ret)
            goto out;

        /* Bail for signals. Probably ctrl-c from user */
        if (signal_pending(current))
            goto out;
    }

    /*
     * Decrease the pool size
     * First return free pages to the buddy allocator (being careful
     * to keep enough around to satisfy reservations). Then place
     * pages into surplus state as needed so the pool will shrink
     * to the desired size as pages become free.
     *
     * By placing pages into the surplus state independent of the
     * overcommit value, we are allowing the surplus pool size to
     * exceed overcommit. There are few sane options here. Since
     * alloc_buddy_huge_page() is checking the global counter,
     * though, we'll note that we're not allowed to exceed surplus
     * and won't grow the pool anywhere else. Not until one of the
     * sysctls are changed, or the surplus pages go out of use.
     */
    min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
    min_count = max(count, min_count);
    try_to_free_low(h, min_count, nodes_allowed);
    while (min_count < persistent_huge_pages(h)) {
        if (!free_pool_huge_page(h, nodes_allowed, 0))
            break;
    }
    while (count < persistent_huge_pages(h)) {
        if (!adjust_pool_surplus(h, nodes_allowed, 1))
            break;
    }
out:
    ret = persistent_huge_pages(h);
    spin_unlock(&hugetlb_lock);
    return ret;
}
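The grow and shrink logic of set_max_huge_pages() is easier to follow on a model that keeps only the counters. The sketch below is a simplified userspace simulation, not kernel code: struct pool_model, model_set_max and the budget parameter (how many fresh pages the allocator can still hand out) are all assumptions made for illustration, and locking, NUMA nodes and signals are left out.

```c
#include <assert.h>

/* Hypothetical model of the counters kept in struct hstate. */
struct pool_model {
    unsigned long nr;       /* nr_huge_pages: every page in the pool */
    unsigned long free;     /* free_huge_pages */
    unsigned long resv;     /* resv_huge_pages: reserved, not yet used */
    unsigned long surplus;  /* surplus_huge_pages */
};

/* Same definition as persistent_huge_pages() in the kernel. */
unsigned long persistent(const struct pool_model *p)
{
    return p->nr - p->surplus;
}

/* Resize the pool to `count` persistent pages and return the size that
 * was actually reached. */
unsigned long model_set_max(struct pool_model *p, unsigned long count,
                            unsigned long budget)
{
    unsigned long min_count;

    assert(p->surplus <= p->nr && p->free <= p->nr);

    /* Grow, step 1: turn surplus pages back into persistent pages. */
    while (p->surplus && count > persistent(p))
        p->surplus--;
    /* Grow, step 2: allocate fresh pages; each one lands on the free list. */
    while (count > persistent(p) && budget) {
        budget--;
        p->nr++;
        p->free++;
    }
    /* Shrink, step 1: give back free pages, but never go below the pages
     * that are in use or reserved. */
    min_count = p->resv + p->nr - p->free;
    if (min_count < count)
        min_count = count;
    while (min_count < persistent(p) && p->free) {
        p->free--;
        p->nr--;
    }
    /* Shrink, step 2: mark the remaining excess as surplus, so those
     * pages leave the pool once their users free them. */
    while (count < persistent(p))
        p->surplus++;
    return persistent(p);
}
```

In this model, shrinking a pool of 10 pages with 5 in use down to 2 releases the 5 free pages immediately and marks 3 of the in-use pages as surplus, so the persistent size is 2 while nr stays at 5 until the pages are freed.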
static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
    struct page *page;
    int start_nid;
    int next_nid;
    int ret = 0;

    start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
    next_nid = start_nid;

    do {
        /* Allocate 2^h->order 4 KB pages from the zonelist of this
         * memory node and return the first page; if the allocation
         * fails, try the next memory node.
         */
        page = alloc_fresh_huge_page_node(h, next_nid);
        if (page) {
            ret = 1;
            break;
        }
        next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
    } while (next_nid != start_nid);

    if (ret)
        count_vm_event(HTLB_BUDDY_PGALLOC);
    else
        count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);

    return ret;
}
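The loop above walks the allowed nodes round-robin: each call starts where the previous one left off and stops once it is back at its starting node. A sketch of the same idea on a plain array, where node_has_memory[i] is an assumed stand-in for "alloc_fresh_huge_page_node() succeeds on node i" and the function name is hypothetical:

```c
#include <assert.h>

/* Hypothetical model: try nodes in round-robin order starting at
 * *next_nid. Returns the node that satisfied the request, or -1 if
 * every node failed. *next_nid is advanced past each node that was
 * tried, so the following call starts on the next node. */
int alloc_round_robin(const int *node_has_memory, int nr_nodes,
                      int *next_nid)
{
    int start, nid;

    assert(nr_nodes > 0 && *next_nid >= 0 && *next_nid < nr_nodes);
    start = nid = *next_nid;
    do {
        *next_nid = (nid + 1) % nr_nodes;   /* remember where to resume */
        if (node_has_memory[nid])
            return nid;
        nid = *next_nid;
    } while (nid != start);
    return -1;
}
```

Spreading successive allocations over the nodes this way keeps the huge page pool roughly balanced across NUMA nodes instead of filling the first node before touching the others.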
static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
    struct page *page;

    if (h->order >= MAX_ORDER)
        return NULL;

    /* __GFP_COMP: allocate 2^h->order contiguous 4 KB pages, return the
     * first page, and set the PG_compound flag
     */
    page = alloc_pages_exact_node(nid,
        htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
                        __GFP_REPEAT|__GFP_NOWARN,
        huge_page_order(h));
    if (page) {
        if (arch_prepare_hugepage(page)) {
            __free_pages(page, huge_page_order(h));
            return NULL;
        }
        /* 1. Point lru.next of the second of the 2^h->order pages at
         *    free_huge_page(), which makes it the compound page
         *    destructor.
         * 2. put_page() then ends up calling free_huge_page() -->
         *    enqueue_huge_page(), which adds the page to the
         *    h->hugepage_freelists[nid] list.
         */
        prep_new_huge_page(h, page, nid);
    }

    return page;
}
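The two steps performed by prep_new_huge_page() amount to a destructor callback: a freshly allocated page is "freed" once, and the destructor puts it on the pool's free list. The model below illustrates only that pattern. All names are hypothetical, a single global list replaces the per-node lists, and the destructor is stored in its own field rather than in lru.next of the second page.

```c
#include <assert.h>
#include <stddef.h>

struct model_page;
typedef void (*page_dtor_t)(struct model_page *);

/* Hypothetical stand-in for a compound huge page. */
struct model_page {
    int refcount;
    page_dtor_t dtor;          /* destructor run when refcount hits 0 */
    struct model_page *next;   /* link on the free list */
};

struct model_page *free_list;  /* stands in for the per-node free lists */
unsigned long free_count;

/* Destructor: instead of returning memory to the buddy allocator,
 * queue the page on the huge page free list. */
void model_free_huge_page(struct model_page *page)
{
    page->next = free_list;
    free_list = page;
    free_count++;
}

/* Drop one reference; the last one triggers the destructor. */
void model_put_page(struct model_page *page)
{
    assert(page->refcount > 0);
    if (--page->refcount == 0 && page->dtor)
        page->dtor(page);
}

/* Install the destructor, then drop the allocation reference so the
 * new page lands on the free list. */
void model_prep_new_huge_page(struct model_page *page)
{
    page->refcount = 1;
    page->dtor = model_free_huge_page;
    page->next = NULL;
    model_put_page(page);
}
```

Reusing the normal free path for the initial enqueue means there is a single place where a huge page enters the free list, whether it is brand new or has just been released by a user.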
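2. hugetlbfs initialization
Creating hugetlbfs mainly means setting up the mappings among the VFS-layer super_block, dentry and inode. These are also linked to the hstates[] array initialized in hugetlb_init(), and therefore to the huge pages that have been allocated. The figure below shows the relationships (it is a little cluttered):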
static int __init init_hugetlbfs_fs(void)
{
    int error;
    struct vfsmount *vfsmount;

    /* initialize the hugetlbfs writeback data structure */
    error = bdi_init(&hugetlbfs_backing_dev_info);
    if (error)
        return error;

    error = -ENOMEM;
    /* create the hugetlbfs_inode_cachep slab cache; hugetlbfs inodes
     * are allocated from it later
     */
    hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
                    sizeof(struct hugetlbfs_inode_info),
                    0, 0, init_once);
    if (hugetlbfs_inode_cachep == NULL)
        goto out2;

    /* add hugetlbfs_fs_type to the global file_systems list */
    error = register_filesystem(&hugetlbfs_fs_type);
    if (error)
        goto out;

    /* Create the hugetlbfs super_block, dentry and inode, link them to
     * one another, and link them to hugetlbfs_fs_type, default_hstate
     * and hugetlbfs_inode_cachep.
     */
    vfsmount = kern_mount(&hugetlbfs_fs_type);

    if (!IS_ERR(vfsmount)) {
        hugetlbfs_vfsmount = vfsmount;
        return 0;
    }

    error = PTR_ERR(vfsmount);

 out:
    kmem_cache_destroy(hugetlbfs_inode_cachep);
 out2:
    bdi_destroy(&hugetlbfs_backing_dev_info);
    return error;
}
Corrections and suggestions are welcome.
References:
http://www.ibm.com/developerworks/cn/linux/l-cn-hugetlb/