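Overview
Use cases
- Reconstructing the scene of a failure for post-mortem analysis and fault localization
- Investigating errors that are hard to reproduce
- Parsing ramdumps (memory dumps); the dump mechanisms include LKCD, Diskdump, Netdump, Kdump, and MKdump
Prerequisites
- A physical Linux system or a VMware virtual machine; a VirtualBox virtual machine was tested and did not work
Environment setup
Native setup on the host
Used to analyze dumps produced by the host itself
Required files:
- The crash utility
- A vmlinux with debug information
- The kernel dump file
Install the crash utility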
sudo apt-get install crash
Prepare a vmlinux with debug information
Download the debug-info package that matches the running kernel version from:
http://ddebs.ubuntu.com/pool/main/l/linux/
Check the kernel version:
wsk@wsk:~$ uname -r
4.15.0-156-generic
Check the hardware platform:
wsk@wsk:~$ uname -m
x86_64
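See also: "x86, i386, i486, i586, i686 and x86_64", 默默淡然, cnblogs.com
With this information you can find the matching debug-info package. Here we choose the package for 4.15.0-156-generic and amd64 (the Debian/Ubuntu architecture name for x86_64). Download and install it: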
wget http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-unsigned-4.15.0-156-generic-dbgsym_4.15.0-156.163_amd64.ddeb
sudo dpkg -i linux-image-unsigned-4.15.0-156-generic-dbgsym_4.15.0-156.163_amd64.ddeb
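The package file name used above can be assembled from the values that uname reports. The following is a minimal sketch, not an official tool: ddeb_name and deb_arch are hypothetical helper names, and the sketch assumes the naming pattern seen in this example, where the upload number 163 matches the "#163-Ubuntu" tag in the kernel version string. Other kernel flavours may not follow this pattern, so check the listing on ddebs.ubuntu.com before downloading.

```shell
# Hypothetical helper: build the dbgsym package file name used in this example.
# Arguments: kernel release (uname -r), upload number, Debian architecture.
ddeb_name() {
    local release="$1" upload="$2" arch="$3"
    # "4.15.0-156-generic" -> "4.15.0-156" (drop the flavour suffix)
    local abi="${release%-*}"
    printf 'linux-image-unsigned-%s-dbgsym_%s.%s_%s.ddeb\n' \
        "$release" "$abi" "$upload" "$arch"
}

# Hypothetical helper: map the machine name from `uname -m` to the Debian
# architecture name. Only two mappings are listed; others pass through as-is.
deb_arch() {
    case "$1" in
        x86_64)  echo amd64 ;;
        aarch64) echo arm64 ;;
        *)       echo "$1" ;;
    esac
}

# Usage:
#   ddeb_name "$(uname -r)" 163 "$(deb_arch "$(uname -m)")"
```

On a Debian or Ubuntu system, dpkg --print-architecture reports the architecture name directly and is a more reliable source than the small mapping above.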
List the package contents to see where the files are installed. The file we need is vmlinux-4.15.0-156-generic:
wsk@wsk:~/crash$ dpkg --contents linux-image-unsigned-4.15.0-156-generic-dbgsym_4.15.0-156.163_amd64.ddeb
drwxr-xr-x root/root 0 2021-08-19 22:30 ./
drwxr-xr-x root/root 0 2021-08-19 22:30 ./usr/
drwxr-xr-x root/root 0 2021-08-19 22:30 ./usr/lib/
drwxr-xr-x root/root 0 2021-08-19 22:30 ./usr/lib/debug/
drwxr-xr-x root/root 0 2021-08-19 22:30 ./usr/lib/debug/boot/
-rw-r--r-- root/root 596730416 2021-08-19 22:30 ./usr/lib/debug/boot/vmlinux-4.15.0-156-generic
Check whether the file carries debug information. When it does, the .debug_* sections are listed:
wsk@wsk:/usr/lib/debug/boot$ readelf -S vmlinux-4.15.0-156-generic | grep "debug"
[61] .debug_aranges PROGBITS 0000000000000000 01cef030
[62] .rela.debug_arang RELA 0000000000000000 10f44138
[63] .debug_info PROGBITS 0000000000000000 01d19170
[64] .rela.debug_info RELA 0000000000000000 10f74c18
[65] .debug_abbrev PROGBITS 0000000000000000 0d56091e
[66] .debug_line PROGBITS 0000000000000000 0da9a7bd
[67] .rela.debug_line RELA 0000000000000000 21f8e008
[68] .debug_frame PROGBITS 0000000000000000 0e6fff18
[69] .rela.debug_frame RELA 0000000000000000 21fa4fb8
[70] .debug_str PROGBITS 0000000000000000 0e963de8
[71] .debug_loc PROGBITS 0000000000000000 0ec6111c
[72] .rela.debug_loc RELA 0000000000000000 221b7148
[73] .debug_ranges PROGBITS 0000000000000000 0f89cb70
[74] .rela.debug_range RELA 0000000000000000 2318bde8
Path of the debug vmlinux: /usr/lib/debug/boot/vmlinux-4.15.0-156-generic
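The readelf check above can be turned into a yes/no test for scripts. This is a small sketch under the assumption that the presence of a .debug_info section is a good enough indicator; has_debug_info is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if `readelf -S` output on stdin lists a
# .debug_info section. The leading whitespace in the pattern keeps
# ".rela.debug_info" from matching on its own.
has_debug_info() {
    grep -q '[[:space:]]\.debug_info[[:space:]]'
}

# Usage (path from this article):
#   readelf -S /usr/lib/debug/boot/vmlinux-4.15.0-156-generic | has_debug_info \
#       && echo "debug information present"
```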
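Cross-compilation environment
The target differs from the environment used for building and debugging. For example, building and debugging are done on an Ubuntu system while the dump being analyzed comes from an ARM board.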
----------todo------------
Usage
Generating a dump file
Install kdump
sudo apt-get install linux-crashdump
After the installation completes, reboot:
sudo reboot
Check the installation status
wsk@wsk:~$ dmesg | grep crashkernel
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.15.0-162-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity crashkernel=512M-:192M
[ 0.000000] Reserving 192MB of memory at 576MB for crashkernel (System RAM: 2217MB) // memory reserved for the crash kernel
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.15.0-162-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity crashkernel=512M-:192M
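To check the reservation from a script, the "Reserving ... for crashkernel" line can be parsed. The sketch below assumes the message wording of the 4.15 kernel shown above; other kernel versions may word it differently. crashkernel_reserved_mb is a hypothetical helper name.

```shell
# Hypothetical helper: print the reserved size in MB from dmesg text on stdin,
# or return non-zero if no crashkernel reservation line is found.
crashkernel_reserved_mb() {
    local mb
    mb=$(sed -n 's/.*Reserving \([0-9][0-9]*\)MB of memory at [0-9][0-9]*MB for crashkernel.*/\1/p' | head -n 1)
    [ -n "$mb" ] && printf '%s\n' "$mb"
}

# Usage (dmesg may require root on some systems):
#   dmesg | crashkernel_reserved_mb || echo "no crashkernel memory reserved"
```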
Show the kdump configuration
wsk@wsk:~$ kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr: 0x
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-4.15.0-162-generic
kdump initrd:
/var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.15.0-162-generic
current state: ready to kdump
kexec command:
/sbin/kexec -p --command-line="BOOT_IMAGE=/vmlinuz-4.15.0-162-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
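The "current state" line in this output is the one to look at before forcing a crash. A minimal sketch of an automated check, assuming the "ready to kdump" wording shown above; kdump_ready is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if `kdump-config show` output on stdin reports
# the "ready to kdump" state shown in this article.
kdump_ready() {
    grep -q '^[[:space:]]*current state:[[:space:]]*ready to kdump'
}

# Usage:
#   kdump-config show | kdump_ready || echo "kdump is not armed"
```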
Show the kdump service status
wsk@wsk:~$ service --status-all
[ + ] kdump-tools
[ + ] kexec
[ + ] kexec-load
[ - ] keyboard-setup.sh
Force a kernel crash
Set the root password
$ sudo passwd root
$ ...
Switch to the root account and run the following command
$ su root
$ ...
$ echo c > /proc/sysrq-trigger
Show the crash files
wsk@wsk:/var/crash$ sudo tree
.
├── 202111192354
│ ├── dmesg.202111192354
│ └── dump.202111192354
├── kdump_lock
├── kexec_cmd
└── linux-image-4.15.0-156-generic-202111192354.crash
1 directory, 5 files
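Each crash gets its own timestamped directory, so the newest dump can be located by name. The following sketch assumes the layout shown above (a directory named with digits that contains dump.<same name>); latest_dump is a hypothetical helper name.

```shell
# Hypothetical helper: print the path of the dump file in the newest
# timestamped directory under the given root (default /var/crash).
latest_dump() {
    local root="${1:-/var/crash}" latest="" dir
    for dir in "$root"/[0-9]*/; do
        [ -d "$dir" ] || continue
        latest="${dir%/}"    # globs expand in sorted order, so the last one wins
    done
    [ -n "$latest" ] || return 1
    printf '%s/dump.%s\n' "$latest" "${latest##*/}"
}

# Usage (reading /var/crash may require root):
#   latest_dump /var/crash
```

Because the directory names are fixed-width timestamps, lexical order and chronological order agree, which is why no modification-time lookup is needed.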
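Post-mortem analysis
dmesg analysis
Open the following file directly with vim:
├── 202111192354/dmesg.202111192354
The system's last words: this output gives a rough indication of where the system crashed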
[ 224.951948] Oops: 0002 [#1] SMP PTI
[ 224.951962] Modules linked in: vmw_vsock_vmci_transport vsock intel_rapl_perf vmw_balloon joydev input_leds serio_raw btusb btrtl btbcm btintel bluetooth ecdh_generic snd_ens1371 snd_ac97_codec gameport snd_rawmidi snd_seq_device ac97_bus snd_pcm snd_timer snd soundcore shpchp vmw_vmci mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd vmwgfx ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci psmouse libahci mptspi mptscsih
[ 224.952266] mptbase e1000 scsi_transport_spi drm pata_acpi i2c_piix4
[ 224.952294] CPU: 1 PID: 1741 Comm: bash Not tainted 4.15.0-156-generic #163-Ubuntu
[ 224.952322] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 224.952364] RIP: 0010:sysrq_handle_crash+0x16/0x20
[ 224.952382] RSP: 0018:ffffb0cac1a0fe30 EFLAGS: 00010286
[ 224.952403] RAX: ffffffffa6800d60 RBX: 0000000000000063 RCX: 0000000000000006
[ 224.952430] RDX: 0000000000000000 RSI: ffff8fb7fb656498 RDI: 0000000000000063
[ 224.952456] RBP: ffffb0cac1a0fe30 R08: 000000000000064c R09: 0000000000000082
[ 224.952504] R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000004
[ 224.952545] R13: 0000000000000000 R14: ffffffffa7788c20 R15: ffff8fb7f62f4200
[ 224.952572] FS: 00007fe1d5a96740(0000) GS:ffff8fb7fb640000(0000) knlGS:0000000000000000
[ 224.952602] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 224.952624] CR2: 0000000000000000 CR3: 000000007601c005 CR4: 00000000003606e0
[ 224.952671] Call Trace:
[ 224.952686] __handle_sysrq+0x80/0x140
[ 224.952702] write_sysrq_trigger+0x2f/0x40
[ 224.952720] proc_reg_write+0x45/0x70
[ 224.952735] __vfs_write+0x1b/0x40
[ 224.952749] vfs_write+0xb1/0x1a0
[ 224.953469] SyS_write+0x5c/0xe0
[ 224.954177] do_syscall_64+0x73/0x130
[ 224.954883] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[ 224.955595] RIP: 0033:0x7fe1d5169224
[ 224.956278] RSP: 002b:00007ffe7a8ac998 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 224.956988] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe1d5169224
[ 224.957596] RDX: 0000000000000002 RSI: 0000558006985450 RDI: 0000000000000001
[ 224.958094] RBP: 0000558006985450 R08: 000000000000000a R09: 0000000000000001
[ 224.958590] R10: 000000000000000a R11: 0000000000000246 R12: 00007fe1d5445760
[ 224.959064] R13: 0000000000000002 R14: 00007fe1d54412a0 R15: 00007fe1d5440760
[ 224.959544] Code: e7 e8 7f fb ff ff e9 c0 fe ff ff 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 c7 05 95 b3 35 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 0f 1f 44 00 00 55 c7 05 30 2d e6
[ 224.960917] RIP: sysrq_handle_crash+0x16/0x20 RSP: ffffb0cac1a0fe30
[ 224.961370] CR2: 0000000000000000
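The most useful single line in this output is the kernel-mode RIP, printed as symbol+offset/size: here the fault is at offset 0x16 into sysrq_handle_crash, a function 0x20 bytes long. A small sketch for pulling that line out of a saved dmesg file; oops_rip is a hypothetical helper name, and the pattern assumes the x86_64 format shown above, where the kernel code segment is printed as 0010.

```shell
# Hypothetical helper: print the kernel-mode faulting location from oops text
# on stdin, e.g. "sysrq_handle_crash+0x16/0x20". User-mode RIP lines
# (segment 0033) are ignored.
oops_rip() {
    sed -n 's/.*RIP: 0010:\([^ ]*\).*/\1/p' | head -n 1
}

# Usage (file name from this article):
#   oops_rip < 202111192354/dmesg.202111192354
```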
Dump analysis
Load the dump
wsk@wsk:~/crash$ crash /usr/lib/debug/boot/vmlinux-4.15.0-156-generic 202111192354/dump.202111192354
crash 7.2.8
Copyright (C) 2002-2020 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [594MB]: patching 100292 gdb minimal_symbol values
please wait... (patching 100292 gdb minimal_symbol values)
KERNEL: /usr/lib/debug/boot/vmlinux-4.15.0-156-generic
DUMPFILE: 202111192354/dump.202111192354 [PARTIAL DUMP]
CPUS: 2
DATE: Fri Nov 19 23:54:13 2021
UPTIME: 00:04:41
LOAD AVERAGE: 0.00, 0.02, 0.00
TASKS: 211
NODENAME: wsk
RELEASE: 4.15.0-156-generic
VERSION: #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021
MACHINE: x86_64 (2904 Mhz)
MEMORY: 2 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000000"
PID: 1741
COMMAND: "bash"
TASK: ffff8fb7f6335c00 [THREAD_INFO: ffff8fb7f6335c00]
CPU: 1
STATE: TASK_RUNNING (PANIC)
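KERNEL: the kernel file that was running when the system crashed
DUMPFILE: the kernel dump file
CPUS: the number of CPUs in the machine
DATE: the time of the crash
TASKS: the number of tasks in memory at the time of the crash
NODENAME: the host name of the crashed system
RELEASE and VERSION: the kernel release and version
MACHINE: the CPU architecture
MEMORY: the physical memory of the crashed host
PANIC: the crash type. Common crash types include:
1. SysRq (System Request): a crash triggered through the magic SysRq key combination, usually for testing. Running echo c > /proc/sysrq-trigger triggers such a crash.
2. oops: can be regarded as the kernel-level counterpart of a segmentation fault. When an application accesses memory illegally or executes an illegal instruction, it receives a signal such as SIGSEGV; the default action is to dump core, although the application can also catch the signal and handle it itself. When the kernel itself makes this kind of mistake, it prints an oops message.
Show the kernel stack backtrace; part of this output matches the dmesg content: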
crash> bt
PID: 1741 TASK: ffff8fb7f6335c00 CPU: 1 COMMAND: "bash"
#0 [ffffb0cac1a0fab0] machine_kexec at ffffffffa62649d3
#1 [ffffb0cac1a0fb10] __crash_kexec at ffffffffa6330139
#2 [ffffb0cac1a0fbd8] crash_kexec at ffffffffa6330f41
#3 [ffffb0cac1a0fbf8] oops_end at ffffffffa6231318
#4 [ffffb0cac1a0fc20] no_context at ffffffffa6275e9c
#5 [ffffb0cac1a0fc88] __bad_area_nosemaphore at ffffffffa6276253
#6 [ffffb0cac1a0fcc8] bad_area_nosemaphore at ffffffffa6276324
#7 [ffffb0cac1a0fcd8] __do_page_fault at ffffffffa6276bfb
#8 [ffffb0cac1a0fd50] do_page_fault at ffffffffa6276ffe
#9 [ffffb0cac1a0fd80] page_fault at ffffffffa6c01615
[exception RIP: sysrq_handle_crash+22]
RIP: ffffffffa6800d76 RSP: ffffb0cac1a0fe30 RFLAGS: 00010286
RAX: ffffffffa6800d60 RBX: 0000000000000063 RCX: 0000000000000006
RDX: 0000000000000000 RSI: ffff8fb7fb656498 RDI: 0000000000000063
RBP: ffffb0cac1a0fe30 R8: 000000000000064c R9: 0000000000000082
R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000004
R13: 0000000000000000 R14: ffffffffa7788c20 R15: ffff8fb7f62f4200
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffffb0cac1a0fe38] __handle_sysrq at ffffffffa6801520
#11 [ffffb0cac1a0fe68] write_sysrq_trigger at ffffffffa6801a0f
#12 [ffffb0cac1a0fe80] proc_reg_write at ffffffffa64fe8a5
#13 [ffffb0cac1a0fea0] __vfs_write at ffffffffa6483afb
#14 [ffffb0cac1a0feb0] vfs_write at ffffffffa6483cc1
#15 [ffffb0cac1a0fee8] sys_write at ffffffffa6483f3c
#16 [ffffb0cac1a0ff30] do_syscall_64 at ffffffffa6203a43
#17 [ffffb0cac1a0ff50] entry_SYSCALL_64_after_hwframe at ffffffffa6c00085
RIP: 00007fe1d5169224 RSP: 00007ffe7a8ac998 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe1d5169224
RDX: 0000000000000002 RSI: 0000558006985450 RDI: 0000000000000001
RBP: 0000558006985450 R8: 000000000000000a R9: 0000000000000001
R10: 000000000000000a R11: 0000000000000246 R12: 00007fe1d5445760
R13: 0000000000000002 R14: 00007fe1d54412a0 R15: 00007fe1d5440760
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
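When comparing this backtrace with the dmesg output, note that the two use different notations for the same location: dmesg prints sysrq_handle_crash+0x16/0x20 (hexadecimal offset and function length), while bt prints sysrq_handle_crash+22 (decimal offset). The sketch below converts one to the other; oops_to_bt_notation is a hypothetical helper name.

```shell
# Hypothetical helper: convert "symbol+0xOFF/0xLEN" (oops notation) into
# "symbol+DEC" (the notation used in the exception line of crash's bt output).
oops_to_bt_notation() {
    local loc="$1" sym off
    sym="${loc%%+*}"      # text before the first '+'
    off="${loc#*+}"       # text after the first '+'
    off="${off%%/*}"      # drop the "/0xLEN" function-length suffix
    printf '%s+%d\n' "$sym" "$off"
}

# Usage:
#   oops_to_bt_notation 'sysrq_handle_crash+0x16/0x20'
```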
Commands supported by crash
crash> help
* extend log rd task
alias files mach repeat timer
ascii foreach mod runq tree
bpf fuser mount search union
bt gdb net set vm
btop help p sig vtop
dev ipcs ps struct waitq
dis irq pte swap whatis
eval kmem ptob sym wr
exit list ptov sys q
crash version: 7.2.8 gdb version: 7.6
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".
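Approach to locating a crash
The general approach:
- Use the stack to see where the dump happened: this sets the direction of the investigation, i.e. which function was executing when the crash occurred
- Determine under what conditions the function crashed: find the function's input arguments and the global and local variables it used
- Determine the root cause: substitute the arguments and the global and local variables into the code to establish why the dump happened
Note: the location of the crash is not necessarily where the fault originated; in most cases it is caused by something elsewhere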
bt
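Show the stack and related information from just before the crash. The crash occurred on the call path through write_sysrq_trigger (the exception RIP is in sysrq_handle_crash, reached via __handle_sysrq). Information about the task that triggered the crash: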
PID: 1741 TASK: ffff8fb7f6335c00 CPU: 1 COMMAND: "bash"
crash> bt
PID: 1741 TASK: ffff8fb7f6335c00 CPU: 1 COMMAND: "bash"
#0 [ffffb0cac1a0fab0] machine_kexec at ffffffffa62649d3
#1 [ffffb0cac1a0fb10] __crash_kexec at ffffffffa6330139
#2 [ffffb0cac1a0fbd8] crash_kexec at ffffffffa6330f41
#3 [ffffb0cac1a0fbf8] oops_end at ffffffffa6231318
#4 [ffffb0cac1a0fc20] no_context at ffffffffa6275e9c
#5 [ffffb0cac1a0fc88] __bad_area_nosemaphore at ffffffffa6276253
#6 [ffffb0cac1a0fcc8] bad_area_nosemaphore at ffffffffa6276324
#7 [ffffb0cac1a0fcd8] __do_page_fault at ffffffffa6276bfb
#8 [ffffb0cac1a0fd50] do_page_fault at ffffffffa6276ffe
#9 [ffffb0cac1a0fd80] page_fault at ffffffffa6c01615
[exception RIP: sysrq_handle_crash+22]
RIP: ffffffffa6800d76 RSP: ffffb0cac1a0fe30 RFLAGS: 00010286
RAX: ffffffffa6800d60 RBX: 0000000000000063 RCX: 0000000000000006
RDX: 0000000000000000 RSI: ffff8fb7fb656498 RDI: 0000000000000063
RBP: ffffb0cac1a0fe30 R8: 000000000000064c R9: 0000000000000082
R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000004
R13: 0000000000000000 R14: ffffffffa7788c20 R15: ffff8fb7f62f4200
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffffb0cac1a0fe38] __handle_sysrq at ffffffffa6801520
#11 [ffffb0cac1a0fe68] write_sysrq_trigger at ffffffffa6801a0f
#12 [ffffb0cac1a0fe80] proc_reg_write at ffffffffa64fe8a5
#13 [ffffb0cac1a0fea0] __vfs_write at ffffffffa6483afb
#14 [ffffb0cac1a0feb0] vfs_write at ffffffffa6483cc1
#15 [ffffb0cac1a0fee8] sys_write at ffffffffa6483f3c
#16 [ffffb0cac1a0ff30] do_syscall_64 at ffffffffa6203a43
#17 [ffffb0cac1a0ff50] entry_SYSCALL_64_after_hwframe at ffffffffa6c00085
RIP: 00007fe1d5169224 RSP: 00007ffe7a8ac998 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe1d5169224
RDX: 0000000000000002 RSI: 0000558006985450 RDI: 0000000000000001
RBP: 0000558006985450 R8: 000000000000000a R9: 0000000000000001
R10: 000000000000000a R11: 0000000000000246 R12: 00007fe1d5445760
R13: 0000000000000002 R14: 00007fe1d54412a0 R15: 00007fe1d5440760
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
Pass more detailed options to get preliminary information about each function:
crash> bt -slf
.......
ffffb0cac1a0fe38: ffffffffa6801520
#10 [ffffb0cac1a0fe38] __handle_sysrq+128 at ffffffffa6801520
/build/linux-WRF4xN/linux-4.15.0/drivers/tty/sysrq.c: 583
ffffb0cac1a0fe40: 0000000000000002 fffffffffffffffb
ffffb0cac1a0fe50: ffffb0cac1a0fef8 0000558006985450
ffffb0cac1a0fe60: ffffb0cac1a0fe78 ffffffffa6801a0f
#11 [ffffb0cac1a0fe68] write_sysrq_trigger+47 at ffffffffa6801a0f // location of the interface
/build/linux-WRF4xN/linux-4.15.0/drivers/tty/sysrq.c: 1108
ffffb0cac1a0fe70: ffff8fb7b41a1a40 ffffb0cac1a0fe98
ffffb0cac1a0fe80: ffffffffa64fe8a5
#12 [ffffb0cac1a0fe80] proc_reg_write+69 at ffffffffa64fe8a5
/build/linux-WRF4xN/linux-4.15.0/fs/proc/inode.c: 231
ffffb0cac1a0fe88: 0000000000000002 0000000000000000
......
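Look at the function definition to decide what to examine next: it reads one character from the buf argument and passes it to __handle_sysrq. So the question is: what did buf actually contain?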
/*
 * writing 'C' to /proc/sysrq-trigger is like sysrq-C
 */
static ssize_t write_sysrq_trigger(struct file *file, const char __user *buf,
                                   size_t count, loff_t *ppos)
{
    if (count) {
        char c;

        if (get_user(c, buf))
            return -EFAULT;
        __handle_sysrq(c, false);
    }

    return count;
}
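One way to get a first answer without reading user memory is to look at the registers in the exception frame. On x86_64 the first integer argument of a function is passed in RDI, and the bt output shows RDI: 0000000000000063 at the point where sysrq_handle_crash faulted. Assuming RDI still held the key argument at that moment (an assumption, since a register can be overwritten inside the function), decoding the value as ASCII gives the character that was read from buf. reg_to_char is a hypothetical helper name.

```shell
# Hypothetical helper: print the ASCII character held in the low byte of a
# register value given in hex (with or without a 0x prefix).
reg_to_char() {
    local value=$(( 0x${1#0x} & 0xff ))
    # %b interprets the \0NNN octal escape built from the byte value.
    printf '%b\n' "\\0$(printf '%03o' "$value")"
}

# Usage:
#   reg_to_char 0000000000000063
```

For the register value in this dump the result is the letter c, which agrees with the echo c > /proc/sysrq-trigger command used to force the crash.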
For how to resolve local variables, see:
"Crash tool in practice: resolving variables [repost]", sky-heaven, cnblogs.com
Common crash commands
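- bt -slf: show the stack backtrace with symbol offsets (-s), source file and line numbers (-l), and the raw stack data of each frame (-f)
- dis: disassemble the code at the given address or symbol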
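These commands can also be run non-interactively, which is convenient for saving a report next to the dump. The sketch below writes a command file and shows, in comments, how it might be passed to crash using the -i (read commands from a file) and -s (suppress the startup banner) options. write_crash_cmds is a hypothetical helper name, and the selection of commands is only an example.

```shell
# Hypothetical helper: write a crash command file to the given path.
# The final "exit" makes crash quit once the commands have run.
write_crash_cmds() {
    cat > "$1" <<'EOF'
bt -slf
dis sysrq_handle_crash
log
exit
EOF
}

# Usage (paths from this article):
#   write_crash_cmds /tmp/crash.cmds
#   crash -s -i /tmp/crash.cmds \
#       /usr/lib/debug/boot/vmlinux-4.15.0-156-generic \
#       202111192354/dump.202111192354 > crash-report.txt
```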
FAQ
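No crash file is generated
- kdump is currently known not to work on an Ubuntu virtual machine under VirtualBox; in testing, a crash file was produced under VMware
- kdump also does not work under Hyper-V, although there appears to be a workaround; see the references
References
- "Analyzing Linux dump files with the crash tool", xyyylx, ChinaUnix blog
- "System crashes: an introduction to the crash tool", jianshu.com | reference for locating crashes
- "Cannot use kdump or kexec for Linux virtual machines on Hyper-V", Windows Client, Microsoft Docs
- "Analyzing Linux kernel crash dump (vmcore) files with crash", yg@hunter, CSDN blog
- "A collection of Linux kernel debugging methods and tools", daimajiaoliu.com
- "A collection of Linux kernel debugging methods and tools", rayylee, CSDN blog | recommended reference for installing the debug-info package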