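Summary:
The process died with signal 11 (segmentation fault) and dumped core. The kernel printed the following in dmesg: *****(program name)[31255]: segfault at 7fff6d99febc ip 0000003688644323 sp 00007fff6d99fd30 error 7 in libc.so.6[3688600000+175000]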
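The "error 7" field in that line is the x86 page-fault error code, which is a bit mask. As a minimal sketch of how to read it (the helper names describe_pf_error and pf_is_user_write_protection are hypothetical; the three bit definitions mirror the kernel's PF_PROT, PF_WRITE and PF_USER flags):

```c
#include <stdio.h>

/* x86 page-fault error code bits (same meaning as the kernel's PF_* flags). */
#define PF_PROT  (1u << 0) /* 1: protection violation, 0: page not present */
#define PF_WRITE (1u << 1) /* 1: write access, 0: read access */
#define PF_USER  (1u << 2) /* 1: the fault happened in user mode */

/* Hypothetical helper: 1 if the code means "user-mode write to a present page". */
int pf_is_user_write_protection(unsigned int code)
{
    unsigned int mask = PF_PROT | PF_WRITE | PF_USER;
    return (code & mask) == mask;
}

/*
 * Hypothetical helper: write a human-readable description into buf.
 * Returns what snprintf returns (the length of the full description).
 */
int describe_pf_error(unsigned int code, char *buf, size_t len)
{
    return snprintf(buf, len, "%s %s on a %s page",
                    (code & PF_USER) ? "user-mode" : "kernel-mode",
                    (code & PF_WRITE) ? "write" : "read",
                    (code & PF_PROT) ? "present, protected" : "not-present");
}
```

For the log line above, error 7 has all three bits set: a user-mode write to a page that is present but not writable. That already hints at a read-only mapping and not at an unmapped address.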
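Direct cause:
As the stack grew downward, sp crossed into an address range that had been mmap'ed read-only, so the write faulted. do_page_fault could not resolve the fault and sent signal 11 to the process.
Environment: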
root ~ # uname -a
Linux localhost 2.6.30-gentoo-r8 #47 SMP Mon Sep 21 03:44:07 EDT 2015 x86_64 GNU/Linux
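Root cause:
The process address space is laid out roughly as follows (this is the layout in my environment):
(Figure taken from 《深入Linux內核架構》, Professional Linux Kernel Architecture.)
Key points relevant to the problem
① The stack grows downward, so the system must reserve enough address space below it. Directly below the stack lies the mmap region, whose upper bound is mm->mmap_base.
② mm->mmap_base is the highest address available to mmap. The search for free space proceeds downward, so every region returned by the mmap system call lies below mm->mmap_base.
Combining ① and ②: mm->mmap_base must be below the stack and far enough away from it. The 2.6 kernel in this environment has a bug and does not always satisfy this condition.
mmap_base assignment analysis:
mm->mmap_base is assigned in load_elf_binary()->arch_pick_mmap_layout()->mmap_base():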
static unsigned long mmap_base(void)
{
	/*
	 * Stack size limit; system-configurable (default 1 MB).
	 * It is 8 MB in the affected environment.
	 */
	unsigned long gap = current->signal->rlim[RLIMIT_STACK].rlim_cur;

	if (gap < MIN_GAP)
		gap = MIN_GAP;	/* this branch is taken: MIN_GAP is 128 MB, so gap becomes 128 MB */
	else if (gap > MAX_GAP)
		gap = MAX_GAP;

	/*
	 * 7FFFFFFFF000 (TASK_SIZE) - 8000000 (gap) = 7FFFF7FFF000
	 * mmap_rnd() returns a value in [0, FFFFFFF000]
	 * so the returned mmap_base lies in [7EFFF8000000, 7FFFF7FFF000]
	 */
	return PAGE_ALIGN(TASK_SIZE - gap - mmap_rnd());
}
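As the code shows, mmap_base accounts only for the stack size limit. It ignores the random offset applied to the stack's position, so it cannot guarantee enough distance from the stack.
In the extreme case where mmap_rnd() returns 0, mmap_base is set to TASK_SIZE - 128 MB, i.e. 7FFFF7FFF000.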
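The arithmetic above can be checked with a small user-space model. This is only a sketch: model_mmap_base is a hypothetical name, the constants are the x86_64 values quoted in the text, and the MAX_GAP definition (TASK_SIZE/6*5) is an assumption about this kernel that does not affect the result here.

```c
#include <stdint.h>

/* x86_64 values with 4 KB pages, as quoted in the text. */
#define TASK_SIZE 0x7FFFFFFFF000ULL
#define PAGE_SIZE 0x1000ULL
#define MIN_GAP   (128ULL * 1024 * 1024)
#define MAX_GAP   (TASK_SIZE / 6 * 5)   /* assumed definition */
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/*
 * User-space model of mmap_base(): rlim_stack stands in for the
 * RLIMIT_STACK soft limit and rnd for the value returned by mmap_rnd().
 */
uint64_t model_mmap_base(uint64_t rlim_stack, uint64_t rnd)
{
    uint64_t gap = rlim_stack;

    if (gap < MIN_GAP)
        gap = MIN_GAP;
    else if (gap > MAX_GAP)
        gap = MAX_GAP;

    return PAGE_ALIGN(TASK_SIZE - gap - rnd);
}
```

With an 8 MB stack limit, model_mmap_base(8 MB, 0) gives 7FFFF7FFF000 and model_mmap_base(8 MB, FFFFFFF000) gives 7EFFF8000000, matching the range in the code comment.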
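Stack base assignment analysis:
The stack base is STACK_TOP - random_variable, which by calculation lies in [7FFC00000000, 7FFFFFFFF000]. (Because random_variable is declared unsigned int in the code below, the shift by PAGE_SHIFT wraps at 32 bits, so the offset actually applied is at most FFFFF000. Either way it is far larger than the 128 MB gap.)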
#define STACK_TOP TASK_SIZE

retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
			 executable_stack);

static unsigned long randomize_stack_top(unsigned long stack_top)
{
	unsigned int random_variable = 0;

	if ((current->flags & PF_RANDOMIZE) &&
	    !(current->personality & ADDR_NO_RANDOMIZE)) {
		random_variable = get_random_int() & STACK_RND_MASK;
		random_variable <<= PAGE_SHIFT;	/* [0, 3fffff000] */
	}
#ifdef CONFIG_STACK_GROWSUP
	return PAGE_ALIGN(stack_top) + random_variable;
#else
	return PAGE_ALIGN(stack_top) - random_variable;	/* [7FFC00000000, 7FFFFFFFF000] */
#endif
}
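To see how the two ranges collide, here is a minimal user-space sketch. The names model_stack_top and stack_room are hypothetical, STACK_RND_MASK = 0x3FFFFF is the assumed x86_64 value, and the model uses 64-bit arithmetic, so it reproduces the nominal range quoted above.

```c
#include <stdint.h>

#define TASK_SIZE      0x7FFFFFFFF000ULL
#define PAGE_SHIFT     12
#define STACK_RND_MASK 0x3FFFFFULL   /* assumed x86_64 value, counted in pages */

/* Model of randomize_stack_top() for a downward-growing stack. */
uint64_t model_stack_top(uint64_t random_pages)
{
    uint64_t offset = (random_pages & STACK_RND_MASK) << PAGE_SHIFT;
    return TASK_SIZE - offset;
}

/*
 * Bytes between mmap_base and the stack top. A negative value means
 * mmap_base sits above the stack, so mappings can be placed right
 * under the stack with no room left for it to grow.
 */
int64_t stack_room(uint64_t stack_top, uint64_t mmap_base)
{
    return (int64_t)stack_top - (int64_t)mmap_base;
}
```

With the largest stack offset (stack top 7FFC00000000) and the highest possible mmap_base (7FFFF7FFF000), stack_room is negative: mmap_base ends up above the stack, and a mapping can be placed directly beneath it.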
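In summary, this confirms a bug in this 2.6 kernel: mmap_base can land inside the range where the stack may be placed. The fix is to lower mmap_base by at least the maximum stack randomization as well, i.e. STACK_RND_MASK << PAGE_SHIFT.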
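A minimal sketch of that change, written as a user-space model and not as the actual kernel patch. The name model_fixed_mmap_base is hypothetical, and STACK_RND_MASK and MAX_GAP use assumed x86_64 values.

```c
#include <stdint.h>

#define TASK_SIZE       0x7FFFFFFFF000ULL
#define PAGE_SIZE       0x1000ULL
#define PAGE_SHIFT      12
#define STACK_RND_MASK  0x3FFFFFULL                    /* assumed x86_64 value */
#define STACK_MAXRANDOM (STACK_RND_MASK << PAGE_SHIFT) /* largest stack offset */
#define MIN_GAP_FIXED   (128ULL * 1024 * 1024 + STACK_MAXRANDOM)
#define MAX_GAP         (TASK_SIZE / 6 * 5)            /* assumed definition */
#define PAGE_ALIGN(x)   (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Model of mmap_base() with the minimum gap enlarged by the stack randomization. */
uint64_t model_fixed_mmap_base(uint64_t rlim_stack, uint64_t rnd)
{
    uint64_t gap = rlim_stack;

    if (gap < MIN_GAP_FIXED)
        gap = MIN_GAP_FIXED;
    else if (gap > MAX_GAP)
        gap = MAX_GAP;

    return PAGE_ALIGN(TASK_SIZE - gap - rnd);
}
```

In this model the highest possible mmap_base becomes 7FFBF8000000, which is exactly 128 MB below the lowest possible stack top (7FFC00000000), so the original 128 MB of headroom holds even in the worst case.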
The kernel first fixed this problem in 2009:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/x86/mm/mmap.c?id=80938332d8cf652f6b16e0788cf0ca136befe0b5
Implementation in newer kernels:
https://code.woboq.org/linux/linux/arch/x86/mm/mmap.c.html#mmap_base
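Troubleshooting process:
① Analyzed the core file thoroughly and confirmed that the stack was growing normally. The faulting address was also legitimate: it was where an argument was being pushed during an ordinary function call.
② Suspected a kernel problem (a failed page fault?). Located every place in the kernel that prints "segfault at", analyzed each one, and added debug output on the likely paths. This pinned the failure to line 1130 of do_page_fault, where the check that the vma is writable fails.
③ Normally this vma would be the stack's vm_area_struct, which always has write permission. Suspecting the vma itself, I added debug output to print the vm_area_struct when the fault occurs, and used the file pointer in the vma to find the name of the mmap'ed file (a character device in my environment).
This confirmed the cause: the growing stack ran into the character device's mmap region, which is mapped read-only. The debugging code is as follows: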
/*
 * This routine handles page faults.  It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 */
dotraplinkage void __kprobes
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	struct vm_area_struct *vma;
	struct task_struct *tsk;
	unsigned long address;
	struct mm_struct *mm;
	int write;
	int fault;

	tsk = current;
	mm = tsk->mm;

	prefetchw(&mm->mmap_sem);

	/* Get the faulting address: */
	address = read_cr2();
	...
	if (unlikely(expand_stack(vma, address))) {
		bad_area(regs, error_code, address);
		return;
	}

	/*
	 * Ok, we have a good vm_area for this memory access, so
	 * we can handle it..
	 */
good_area:
	write = error_code & PF_WRITE;

	if (unlikely(access_error(error_code, write, vma))) {
		/* Debug output added for this investigation. */
		if (vma->vm_next)	/* vm_next is NULL for the last vma */
			printk("svking vm_next info :start:%lx end:%lx flags:%lx\n",
			       vma->vm_next->vm_start, vma->vm_next->vm_end,
			       vma->vm_next->vm_flags);
		printk("svking vm info :start:%lx end:%lx flags:%lx\n",
		       vma->vm_start, vma->vm_end, vma->vm_flags);
		printk("svking vm info :vm_file: %p\n", vma->vm_file);
		if (vma->vm_file) {
			printk("svking file name :%s, %s\n",
			       vma->vm_file->f_dentry->d_iname,
			       vma->vm_file->f_dentry->d_name.name);
		}
		bad_area_access_error(regs, error_code, address);
		return;
	}

	/*
	 * If for any reason at all we couldn't handle the fault,
	 * make sure we exit gracefully rather than endlessly redo
	 * the fault:
	 */
	fault = handle_mm_fault(mm, vma, address, write);
	...
}
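④ Analyzed the mmap mechanism and confirmed that the addresses mmap returns are bounded by the process's mm->mmap_base, and that mm->mmap_base is fixed at exec time when the process starts. So I suspected a problem in how mm->mmap_base is assigned.
⑤ Analyzed the process address space and the mmap_base assignment function and confirmed the root cause: the condition "mm->mmap_base must be below the stack and far enough away from it" was not met.
Appendix:
Debug log (*** masks business-specific keywords):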
root ~ # tail -f /var/log/messages | grep -E "svking|segfault"
Jul 21 16:44:16 localhost kernel: [ 5489.150802] svking vm_next info :start:7fff13d58000 end:7fff13d6d000 flags:100177
Jul 21 16:44:16 localhost kernel: [ 5489.150990] svking vm info :start:7ffdcb558000 end:7fff13d58000 flags:400844fd
Jul 21 16:44:16 localhost kernel: [ 5489.151155] svking vm info :vm_file: ffff880082dbbcc0
Jul 21 16:44:16 localhost kernel: [ 5489.151260] svking file name :******, ******
Jul 21 16:44:16 localhost kernel: [ 5489.151388] ******[22867]: segfault at 7fff13d57ffc ip 00007fff6469ae99 sp 00007fff13d57ff0 error 7 in lib****.so[7fff64681000+2a000]
Jul 21 16:48:42 localhost kernel: [ 5754.180545] svking vm_next info :start:7fffb5202000 end:7fffb5217000 flags:100177
Jul 21 16:48:42 localhost kernel: [ 5754.180772] svking vm info :start:7ffe6ca02000 end:7fffb5202000 flags:400844fd
Jul 21 16:48:42 localhost kernel: [ 5754.180971] svking vm info :vm_file: ffff880082f74240
Jul 21 16:48:42 localhost kernel: [ 5754.181136] svking file name :******, ******
Jul 21 16:48:42 localhost kernel: [ 5754.181277] ******[32049]: segfault at 7fffb5201ff8 ip 00007ffff5b744ae sp 00007fffb5202000 error 7
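The flags values in these log lines can be decoded with the low vm_flags bits from include/linux/mm.h. The sketch below is a user-space illustration only: the helper names are hypothetical and just the bits needed here are defined.

```c
#include <stdint.h>

/* Low vm_flags bits, as defined in include/linux/mm.h. */
#define VM_READ      0x0001ULL
#define VM_WRITE     0x0002ULL
#define VM_EXEC      0x0004ULL
#define VM_SHARED    0x0008ULL
#define VM_GROWSDOWN 0x0100ULL   /* set on the stack vma */

/* 1 if the vma permits writes. */
int vma_is_writable(uint64_t flags)
{
    return (flags & VM_WRITE) != 0;
}

/* 1 if the vma grows downward, i.e. it is a stack. */
int vma_is_stack_like(uint64_t flags)
{
    return (flags & VM_GROWSDOWN) != 0;
}

/* 1 if addr falls inside the half-open range [start, end). */
int vma_contains(uint64_t start, uint64_t end, uint64_t addr)
{
    return addr >= start && addr < end;
}
```

Applied to the first entry: flags 100177 (vm_next) has both VM_WRITE and VM_GROWSDOWN set, so vm_next is the stack, while flags 400844fd (the faulting vma) has neither, so it is the non-writable device mapping. The faulting address 7fff13d57ffc lies inside that mapping, 4 bytes below the stack's start at 7fff13d58000.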
P.S. These are personal notes. If you spot any mistakes, please point them out, and please be kind.