The crash tool


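Overview

Use cases

  • Reconstructing the scene of a failure for post-mortem analysis and fault localization
  • Errors that are hard to reproduce
  • Parsing a ramdump (memory dump); the supported dump mechanisms are LKCD, Diskdump, Netdump, Kdump, and MKdump

Prerequisites

  • A real Linux system or a VMware virtual machine; a VirtualBox virtual machine was tested and did not work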

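Environment setup

Native build and installation on the host

Used to analyze dump problems of the host itself.
Required files:

  • The crash software
  • A vmlinux with debug information
  • The kernel dump file

Install the crash software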

sudo apt-get install crash

Prepare a vmlinux with debug information

Download the debug-info package that matches your kernel version from
http://ddebs.ubuntu.com/pool/main/l/linux/
Check your kernel version:

wsk@wsk:~$ uname -r
4.15.0-156-generic

Check your hardware platform:

wsk@wsk:~$ uname -m
x86_64

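See also: x86, i386, i486, i586, i686 and x86_64 - 默默淡然 - cnblogs (cnblogs.com)

With this information you can locate the matching debug-info package. Here we pick the package for 4.15.0-156-generic and amd64 (the Debian/Ubuntu name for x86_64). Download and install it: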

wget http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-unsigned-4.15.0-156-generic-dbgsym_4.15.0-156.163_amd64.ddeb
  
sudo dpkg -i linux-image-unsigned-4.15.0-156-generic-dbgsym_4.15.0-156.163_amd64.ddeb
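
The package name above can be assembled from the kernel release, the upload number, and the architecture. The following is a minimal sketch: the helper name ddeb_url is hypothetical, and the file-name pattern is inferred only from the single example in this article, so it may not hold for other kernel flavours or Ubuntu releases.

```shell
#!/bin/sh
# Build the ddebs.ubuntu.com URL of a kernel debug-symbol package.
# Assumption: the naming pattern is taken from the one example in this
# article; verify the result against the directory listing before use.
ddeb_url() {
    release=$1          # e.g. 4.15.0-156-generic (output of `uname -r`)
    upload=$2           # e.g. 163 (the number after '#' in `uname -v`)
    arch=$3             # e.g. amd64 (output of `dpkg --print-architecture`)
    abi=${release%-*}   # drop the flavour suffix: 4.15.0-156
    printf 'http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-unsigned-%s-dbgsym_%s.%s_%s.ddeb\n' \
        "$release" "$abi" "$upload" "$arch"
}

# Reproduces the URL used in this article.
ddeb_url 4.15.0-156-generic 163 amd64
```

If the printed URL does not exist on the server, browse the directory listing and pick the file by hand.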

List the package contents to see where the files are installed; the file we need is vmlinux-4.15.0-156-generic:

wsk@wsk:~/crash$ dpkg --contents linux-image-unsigned-4.15.0-156-generic-dbgsym_4.15.0-156.163_amd64.ddeb
drwxr-xr-x root/root         0 2021-08-19 22:30 ./
drwxr-xr-x root/root         0 2021-08-19 22:30 ./usr/
drwxr-xr-x root/root         0 2021-08-19 22:30 ./usr/lib/
drwxr-xr-x root/root         0 2021-08-19 22:30 ./usr/lib/debug/
drwxr-xr-x root/root         0 2021-08-19 22:30 ./usr/lib/debug/boot/
-rw-r--r-- root/root 596730416 2021-08-19 22:30 ./usr/lib/debug/boot/vmlinux-4.15.0-156-generic

Check whether the file carries debug information.
When it does, the section list contains .debug_* sections:

wsk@wsk:/usr/lib/debug/boot$ readelf -S vmlinux-4.15.0-156-generic | grep "debug"
  [61] .debug_aranges    PROGBITS         0000000000000000  01cef030
  [62] .rela.debug_arang RELA             0000000000000000  10f44138
  [63] .debug_info       PROGBITS         0000000000000000  01d19170
  [64] .rela.debug_info  RELA             0000000000000000  10f74c18
  [65] .debug_abbrev     PROGBITS         0000000000000000  0d56091e
  [66] .debug_line       PROGBITS         0000000000000000  0da9a7bd
  [67] .rela.debug_line  RELA             0000000000000000  21f8e008
  [68] .debug_frame      PROGBITS         0000000000000000  0e6fff18
  [69] .rela.debug_frame RELA             0000000000000000  21fa4fb8
  [70] .debug_str        PROGBITS         0000000000000000  0e963de8
  [71] .debug_loc        PROGBITS         0000000000000000  0ec6111c
  [72] .rela.debug_loc   RELA             0000000000000000  221b7148
  [73] .debug_ranges     PROGBITS         0000000000000000  0f89cb70
  [74] .rela.debug_range RELA             0000000000000000  2318bde8

Path of the debug vmlinux: /usr/lib/debug/boot/vmlinux-4.15.0-156-generic
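
The readelf check can be wrapped into a yes/no test for use in scripts. This is a sketch only: both function names are hypothetical, and treating the presence of a .debug_info section as "has debug information" is a simplifying assumption.

```shell
#!/bin/sh
# Succeeds if a section listing read from stdin names a .debug_info section.
has_debug_info_section() {
    grep -q '\.debug_info'
}

# Succeeds if the given ELF file lists a .debug_info section.
vmlinux_has_debug_info() {
    readelf -S "$1" 2>/dev/null | has_debug_info_section
}

# Usage:
#   vmlinux_has_debug_info /usr/lib/debug/boot/vmlinux-4.15.0-156-generic \
#       && echo "debug info present"
```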

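Cross-compilation environment

The target and the build/debug host are not the same environment. For example, building and debugging happen on an Ubuntu system while the dump being analyzed comes from an ARM board.
----------todo------------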

Usage

Generating a dump file

Install kdump

sudo apt-get install linux-crashdump

After the installation completes, reboot:
sudo reboot

Check the installation status

wsk@wsk:~$  dmesg | grep crashkernel
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.15.0-162-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity crashkernel=512M-:192M
[    0.000000] Reserving 192MB of memory at 576MB for crashkernel (System RAM: 2217MB) // memory reserved for the crash kernel
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.15.0-162-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity crashkernel=512M-:192M
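
The value crashkernel=512M-:192M uses the kernel's range syntax, start-[end]:size. It reads: if system RAM is at least 512M (no upper bound), reserve 192M. The sketch below evaluates such a specification for a given RAM size. The function names are hypothetical, and it handles only the range form with M and G suffixes, not the plain crashkernel=size form.

```shell
#!/bin/sh
# Convert a size such as 192M or 2G to megabytes.
to_mb() {
    case $1 in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  return 1 ;;
    esac
}

# Print how many MB a range-style crashkernel= value reserves
# for a machine with the given amount of RAM (in MB).
crashkernel_reserved_mb() {
    spec=$1
    ram_mb=$2
    old_ifs=$IFS; IFS=,
    set -- $spec            # split "range:size,range:size" on commas
    IFS=$old_ifs
    for entry in "$@"; do
        range=${entry%%:*}
        size=${entry#*:}
        size=${size%%@*}    # ignore an optional @offset
        start=${range%%-*}
        end=${range#*-}     # empty means "no upper bound"
        start_mb=$(to_mb "$start") || return 1
        [ "$ram_mb" -lt "$start_mb" ] && continue
        if [ -n "$end" ]; then
            end_mb=$(to_mb "$end") || return 1
            [ "$ram_mb" -ge "$end_mb" ] && continue
        fi
        to_mb "$size"
        return 0
    done
    echo 0
}

# The machine in this article: 2217 MB of RAM.
crashkernel_reserved_mb "512M-:192M" 2217
```

For the machine above this agrees with the "Reserving 192MB" line in the dmesg output.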

Show the kdump configuration

wsk@wsk:~$ kdump-config show
DUMP_MODE:        kdump
USE_KDUMP:        1
KDUMP_SYSCTL:     kernel.panic_on_oops=1
KDUMP_COREDIR:    /var/crash
crashkernel addr: 0x
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-4.15.0-162-generic
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.15.0-162-generic
current state:    ready to kdump

kexec command:
  /sbin/kexec -p --command-line="BOOT_IMAGE=/vmlinuz-4.15.0-162-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

Show the kdump service status

wsk@wsk:~$ service --status-all
 [ + ]  kdump-tools
 [ + ]  kexec
 [ + ]  kexec-load
 [ - ]  keyboard-setup.sh
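
The checks above can also be done without the kdump tools by reading two kernel files. A minimal sketch, assuming that /proc/cmdline carries the crashkernel= parameter and that /sys/kernel/kexec_crash_loaded reads 1 once a crash kernel is loaded; the function name kdump_armed is hypothetical, and both paths can be overridden, which keeps the function testable.

```shell
#!/bin/sh
# Succeeds when memory is reserved for a crash kernel and one is loaded.
kdump_armed() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}
    # The boot command line must carry a crashkernel= reservation.
    grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null || return 1
    # The kernel reports 1 here once a crash kernel has been loaded.
    [ "$(cat "$loaded_file" 2>/dev/null)" = "1" ]
}

# Usage:
#   kdump_armed && echo "ready to kdump" || echo "kdump is not armed"
```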

Force a kernel crash


Set a password for root
$ sudo passwd root
$ ...

Switch to the root account and run the command below
$ su root
$ ...

$ echo c > /proc/sysrq-trigger
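
If you would rather not set a root password, the same write can go through sudo. Note that sudo echo c > /proc/sysrq-trigger does not work, because the redirection is performed by the unprivileged shell; piping into sudo tee avoids that. The wrapper below is a sketch: the function name and the confirmation flag are hypothetical, and the flag exists only to prevent an accidental crash.

```shell
#!/bin/sh
# Crash the running kernel on purpose. This takes the machine down at once.
trigger_test_crash() {
    if [ "${1:-}" != "--yes-crash-this-machine" ]; then
        echo "refusing: pass --yes-crash-this-machine to really crash the kernel" >&2
        return 1
    fi
    sync    # flush file-system buffers before the kernel goes down
    echo c | sudo tee /proc/sysrq-trigger >/dev/null
}

# Usage (on a test machine only):
#   trigger_test_crash --yes-crash-this-machine
```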

Show the crash files

wsk@wsk:/var/crash$ sudo tree
.
├── 202111192354
│   ├── dmesg.202111192354
│   └── dump.202111192354
├── kdump_lock
├── kexec_cmd
└── linux-image-4.15.0-156-generic-202111192354.crash

1 directory, 5 file
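
When several crashes have been collected, the newest dump can be picked by name, because the directory names are timestamps. A minimal sketch, assuming the directory layout shown in the tree output above; the function name latest_dump is hypothetical, and reading /var/crash may require root.

```shell
#!/bin/sh
# Print the path of the newest dump.* file below a kdump crash directory.
latest_dump() {
    crash_dir=${1:-/var/crash}
    # Directory names look like 202111192354, so lexical order is chronological.
    for f in "$crash_dir"/*/dump.*; do
        [ -f "$f" ] && printf '%s\n' "$f"
    done | sort | tail -n 1
}

# Usage:
#   sudo sh -c '. ./latest_dump.sh; latest_dump /var/crash'
```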

Crash-scene analysis

dmesg analysis

Open the following file directly with vim:

├── 202111192354/dmesg.202111192354

The system's last words: this information gives a rough idea of where the system died

[  224.951948] Oops: 0002 [#1] SMP PTI
[  224.951962] Modules linked in: vmw_vsock_vmci_transport vsock intel_rapl_perf vmw_balloon joydev input_leds serio_raw btusb btrtl btbcm btintel bluetooth ecdh_generic snd_ens1371 snd_ac97_codec gameport snd_rawmidi snd_seq_device ac97_bus snd_pcm snd_timer snd soundcore shpchp vmw_vmci mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd vmwgfx ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci psmouse libahci mptspi mptscsih
[  224.952266]  mptbase e1000 scsi_transport_spi drm pata_acpi i2c_piix4
[  224.952294] CPU: 1 PID: 1741 Comm: bash Not tainted 4.15.0-156-generic #163-Ubuntu
[  224.952322] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[  224.952364] RIP: 0010:sysrq_handle_crash+0x16/0x20
[  224.952382] RSP: 0018:ffffb0cac1a0fe30 EFLAGS: 00010286
[  224.952403] RAX: ffffffffa6800d60 RBX: 0000000000000063 RCX: 0000000000000006
[  224.952430] RDX: 0000000000000000 RSI: ffff8fb7fb656498 RDI: 0000000000000063
[  224.952456] RBP: ffffb0cac1a0fe30 R08: 000000000000064c R09: 0000000000000082
[  224.952504] R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000004
[  224.952545] R13: 0000000000000000 R14: ffffffffa7788c20 R15: ffff8fb7f62f4200
[  224.952572] FS:  00007fe1d5a96740(0000) GS:ffff8fb7fb640000(0000) knlGS:0000000000000000
[  224.952602] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  224.952624] CR2: 0000000000000000 CR3: 000000007601c005 CR4: 00000000003606e0
[  224.952671] Call Trace:
[  224.952686]  __handle_sysrq+0x80/0x140
[  224.952702]  write_sysrq_trigger+0x2f/0x40
[  224.952720]  proc_reg_write+0x45/0x70
[  224.952735]  __vfs_write+0x1b/0x40
[  224.952749]  vfs_write+0xb1/0x1a0
[  224.953469]  SyS_write+0x5c/0xe0
[  224.954177]  do_syscall_64+0x73/0x130
[  224.954883]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
[  224.955595] RIP: 0033:0x7fe1d5169224
[  224.956278] RSP: 002b:00007ffe7a8ac998 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  224.956988] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe1d5169224
[  224.957596] RDX: 0000000000000002 RSI: 0000558006985450 RDI: 0000000000000001
[  224.958094] RBP: 0000558006985450 R08: 000000000000000a R09: 0000000000000001
[  224.958590] R10: 000000000000000a R11: 0000000000000246 R12: 00007fe1d5445760
[  224.959064] R13: 0000000000000002 R14: 00007fe1d54412a0 R15: 00007fe1d5440760
[  224.959544] Code: e7 e8 7f fb ff ff e9 c0 fe ff ff 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 c7 05 95 b3 35 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 0f 1f 44 00 00 55 c7 05 30 2d e6
[  224.960917] RIP: sysrq_handle_crash+0x16/0x20 RSP: ffffb0cac1a0fe30
[  224.961370] CR2: 0000000000000000
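
The most useful line in this output is the kernel-mode RIP, which names the faulting function. The sketch below pulls it out of a saved dmesg file. It assumes x86_64, where the kernel code segment selector is 0010, as seen in the "RIP: 0010:" line above; the function name oops_kernel_rip is hypothetical.

```shell
#!/bin/sh
# Read a dmesg log on stdin and print the first kernel-mode RIP entry,
# in the form symbol+offset/size.
oops_kernel_rip() {
    sed -n 's/^.*RIP: 0010:\([^ ]*\).*$/\1/p' | head -n 1
}

# Usage:
#   oops_kernel_rip < 202111192354/dmesg.202111192354
```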

Dump analysis

Load the dump

wsk@wsk:~/crash$ crash /usr/lib/debug/boot/vmlinux-4.15.0-156-generic 202111192354/dump.202111192354

crash 7.2.8
Copyright (C) 2002-2020  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [594MB]: patching 100292 gdb minimal_symbol values

please wait... (patching 100292 gdb minimal_symbol values)
      KERNEL: /usr/lib/debug/boot/vmlinux-4.15.0-156-generic
    DUMPFILE: 202111192354/dump.202111192354  [PARTIAL DUMP]
        CPUS: 2
        DATE: Fri Nov 19 23:54:13 2021
      UPTIME: 00:04:41
LOAD AVERAGE: 0.00, 0.02, 0.00
       TASKS: 211
    NODENAME: wsk
     RELEASE: 4.15.0-156-generic
     VERSION: #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021
     MACHINE: x86_64  (2904 Mhz)
      MEMORY: 2 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000000"
         PID: 1741
     COMMAND: "bash"
        TASK: ffff8fb7f6335c00  [THREAD_INFO: ffff8fb7f6335c00]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)
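
KERNEL: the kernel file that was running when the system crashed  
DUMPFILE: the kernel dump file  
CPUS: number of CPUs on the machine  
DATE: time of the crash  
TASKS: number of tasks in memory at the time of the crash  
NODENAME: host name of the crashed system  
RELEASE and VERSION: kernel version  
MACHINE: CPU architecture  
MEMORY: physical memory of the crashed host  
PANIC: type of crash  

Common crash types include:
1. SysRq (System Request): a crash triggered through the magic SysRq key combination, usually for testing. Running echo c > /proc/sysrq-trigger triggers a system crash.
2. oops: can be seen as a kernel-level segmentation fault. When an application performs an illegal memory access or executes an illegal instruction, it receives a signal such as SIGSEGV; the default behavior is a core dump, although the application may catch the signal and handle it itself. When the kernel itself makes such a mistake, it prints an oops message.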

Show the kernel stack backtrace; part of it matches the dmesg content:

crash> bt
PID: 1741   TASK: ffff8fb7f6335c00  CPU: 1   COMMAND: "bash"
 #0 [ffffb0cac1a0fab0] machine_kexec at ffffffffa62649d3
 #1 [ffffb0cac1a0fb10] __crash_kexec at ffffffffa6330139
 #2 [ffffb0cac1a0fbd8] crash_kexec at ffffffffa6330f41
 #3 [ffffb0cac1a0fbf8] oops_end at ffffffffa6231318
 #4 [ffffb0cac1a0fc20] no_context at ffffffffa6275e9c
 #5 [ffffb0cac1a0fc88] __bad_area_nosemaphore at ffffffffa6276253
 #6 [ffffb0cac1a0fcc8] bad_area_nosemaphore at ffffffffa6276324
 #7 [ffffb0cac1a0fcd8] __do_page_fault at ffffffffa6276bfb
 #8 [ffffb0cac1a0fd50] do_page_fault at ffffffffa6276ffe
 #9 [ffffb0cac1a0fd80] page_fault at ffffffffa6c01615
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffffa6800d76  RSP: ffffb0cac1a0fe30  RFLAGS: 00010286
    RAX: ffffffffa6800d60  RBX: 0000000000000063  RCX: 0000000000000006
    RDX: 0000000000000000  RSI: ffff8fb7fb656498  RDI: 0000000000000063
    RBP: ffffb0cac1a0fe30   R8: 000000000000064c   R9: 0000000000000082
    R10: 0000000000000001  R11: 00000000ffffffff  R12: 0000000000000004
    R13: 0000000000000000  R14: ffffffffa7788c20  R15: ffff8fb7f62f4200
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffffb0cac1a0fe38] __handle_sysrq at ffffffffa6801520
#11 [ffffb0cac1a0fe68] write_sysrq_trigger at ffffffffa6801a0f
#12 [ffffb0cac1a0fe80] proc_reg_write at ffffffffa64fe8a5
#13 [ffffb0cac1a0fea0] __vfs_write at ffffffffa6483afb
#14 [ffffb0cac1a0feb0] vfs_write at ffffffffa6483cc1
#15 [ffffb0cac1a0fee8] sys_write at ffffffffa6483f3c
#16 [ffffb0cac1a0ff30] do_syscall_64 at ffffffffa6203a43
#17 [ffffb0cac1a0ff50] entry_SYSCALL_64_after_hwframe at ffffffffa6c00085
    RIP: 00007fe1d5169224  RSP: 00007ffe7a8ac998  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000002  RCX: 00007fe1d5169224
    RDX: 0000000000000002  RSI: 0000558006985450  RDI: 0000000000000001
    RBP: 0000558006985450   R8: 000000000000000a   R9: 0000000000000001
    R10: 000000000000000a  R11: 0000000000000246  R12: 00007fe1d5445760
    R13: 0000000000000002  R14: 00007fe1d54412a0  R15: 00007fe1d5440760
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
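
One detail to keep in mind when comparing the two outputs: dmesg prints the faulting location with a hexadecimal offset and the function size (sysrq_handle_crash+0x16/0x20), while crash prints a decimal offset (sysrq_handle_crash+22). The sketch below converts the former to the latter; the function name sym_offset_dec is hypothetical.

```shell
#!/bin/sh
# Convert "symbol+0xOFF/0xSIZE" (dmesg style) to "symbol+DEC" (crash bt style).
sym_offset_dec() {
    entry=$1
    sym=${entry%%+*}     # text before the first '+'
    rest=${entry#*+}     # "0x16/0x20"
    off=${rest%%/*}      # "0x16"
    printf '%s+%d\n' "$sym" "$off"
}

sym_offset_dec 'sysrq_handle_crash+0x16/0x20'
```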

Commands supported by crash

crash> help

*              extend         log            rd             task
alias          files          mach           repeat         timer
ascii          foreach        mod            runq           tree
bpf            fuser          mount          search         union
bt             gdb            net            set            vm
btop           help           p              sig            vtop
dev            ipcs           ps             struct         waitq
dis            irq            pte            swap           whatis
eval           kmem           ptob           sym            wr
exit           list           ptov           sys            q

crash version: 7.2.8    gdb version: 7.6
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".

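How to approach a crash

  • Use the stack to see where the dump happened: this sets the direction of the investigation, i.e. which function call crashed
  • Determine the conditions under which the function dumped: find the function's input arguments and the relevant global and local variables
  • Determine the root cause: with the arguments and the global and local variables plugged in, work out why it dumped

Note: the place where the crash occurs is not necessarily the original scene of the fault; more often it is triggered by something elsewhere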

bt

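Show the stack and related information from just before the crash. The crash occurs on the call path through write_sysrq_trigger (the exception RIP is sysrq_handle_crash+22). The task that triggered the crash: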
PID: 1741 TASK: ffff8fb7f6335c00 CPU: 1 COMMAND: "bash"

crash> bt
PID: 1741   TASK: ffff8fb7f6335c00  CPU: 1   COMMAND: "bash"
 #0 [ffffb0cac1a0fab0] machine_kexec at ffffffffa62649d3
 #1 [ffffb0cac1a0fb10] __crash_kexec at ffffffffa6330139
 #2 [ffffb0cac1a0fbd8] crash_kexec at ffffffffa6330f41
 #3 [ffffb0cac1a0fbf8] oops_end at ffffffffa6231318
 #4 [ffffb0cac1a0fc20] no_context at ffffffffa6275e9c
 #5 [ffffb0cac1a0fc88] __bad_area_nosemaphore at ffffffffa6276253
 #6 [ffffb0cac1a0fcc8] bad_area_nosemaphore at ffffffffa6276324
 #7 [ffffb0cac1a0fcd8] __do_page_fault at ffffffffa6276bfb
 #8 [ffffb0cac1a0fd50] do_page_fault at ffffffffa6276ffe
 #9 [ffffb0cac1a0fd80] page_fault at ffffffffa6c01615
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffffa6800d76  RSP: ffffb0cac1a0fe30  RFLAGS: 00010286
    RAX: ffffffffa6800d60  RBX: 0000000000000063  RCX: 0000000000000006
    RDX: 0000000000000000  RSI: ffff8fb7fb656498  RDI: 0000000000000063
    RBP: ffffb0cac1a0fe30   R8: 000000000000064c   R9: 0000000000000082
    R10: 0000000000000001  R11: 00000000ffffffff  R12: 0000000000000004
    R13: 0000000000000000  R14: ffffffffa7788c20  R15: ffff8fb7f62f4200
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffffb0cac1a0fe38] __handle_sysrq at ffffffffa6801520
#11 [ffffb0cac1a0fe68] write_sysrq_trigger at ffffffffa6801a0f
#12 [ffffb0cac1a0fe80] proc_reg_write at ffffffffa64fe8a5
#13 [ffffb0cac1a0fea0] __vfs_write at ffffffffa6483afb
#14 [ffffb0cac1a0feb0] vfs_write at ffffffffa6483cc1
#15 [ffffb0cac1a0fee8] sys_write at ffffffffa6483f3c
#16 [ffffb0cac1a0ff30] do_syscall_64 at ffffffffa6203a43
#17 [ffffb0cac1a0ff50] entry_SYSCALL_64_after_hwframe at ffffffffa6c00085
    RIP: 00007fe1d5169224  RSP: 00007ffe7a8ac998  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000002  RCX: 00007fe1d5169224
    RDX: 0000000000000002  RSI: 0000558006985450  RDI: 0000000000000001
    RBP: 0000558006985450   R8: 000000000000000a   R9: 0000000000000001
    R10: 000000000000000a  R11: 0000000000000246  R12: 00007fe1d5445760
    R13: 0000000000000002  R14: 00007fe1d54412a0  R15: 00007fe1d5440760
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Pass detailed options to get a first look at each function:

crash> bt -slf
.......
    ffffb0cac1a0fe38: ffffffffa6801520
#10 [ffffb0cac1a0fe38] __handle_sysrq+128 at ffffffffa6801520
    /build/linux-WRF4xN/linux-4.15.0/drivers/tty/sysrq.c: 583
    ffffb0cac1a0fe40: 0000000000000002 fffffffffffffffb
    ffffb0cac1a0fe50: ffffb0cac1a0fef8 0000558006985450
    ffffb0cac1a0fe60: ffffb0cac1a0fe78 ffffffffa6801a0f
#11 [ffffb0cac1a0fe68] write_sysrq_trigger+47 at ffffffffa6801a0f        // location of the interface
    /build/linux-WRF4xN/linux-4.15.0/drivers/tty/sysrq.c: 1108
    ffffb0cac1a0fe70: ffff8fb7b41a1a40 ffffb0cac1a0fe98
    ffffb0cac1a0fe80: ffffffffa64fe8a5
#12 [ffffb0cac1a0fe80] proc_reg_write+69 at ffffffffa64fe8a5
    /build/linux-WRF4xN/linux-4.15.0/fs/proc/inode.c: 231
    ffffb0cac1a0fe88: 0000000000000002 0000000000000000
......

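Look at the function definition to decide what to examine next: it reads one character from the argument buf and passes it to __handle_sysrq. So the question is: what exactly was passed in buf?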

/*
 * writing 'C' to /proc/sysrq-trigger is like sysrq-C
 */
static ssize_t write_sysrq_trigger(struct file *file, const char __user *buf,
				   size_t count, loff_t *ppos)
{
	if (count) {
		char c;

		if (get_user(c, buf))
			return -EFAULT;
		__handle_sysrq(c, false);
	}

	return count;
}

For how to decode local variables, see:
Crash tool in practice: decoding variables [repost] - sky-heaven - cnblogs (cnblogs.com)

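Common crash commands

  • bt -slf: show stack information (-s prints symbol plus offset, -l prints source file and line, -f prints the raw stack data of each frame)
  • dis: disassemble the code at the given address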
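
These commands can also be run without an interactive session. The sketch below writes the commands to a file and hands it to crash with -i (execute the commands in a file) and -s (skip the startup banner). Both function names are hypothetical, and the dis -l sysrq_handle_crash line is only an example chosen from the backtrace above.

```shell
#!/bin/sh
# Write one crash command per line to a file and end with "exit",
# so that crash terminates after running them.
write_crash_cmds() {
    out=$1; shift
    : > "$out"
    for c in "$@"; do
        printf '%s\n' "$c" >> "$out"
    done
    printf 'exit\n' >> "$out"
}

# Run crash non-interactively against a vmlinux and a dump file.
run_crash_batch() {
    vmlinux=$1
    dump=$2
    cmds=$3
    crash -s -i "$cmds" "$vmlinux" "$dump"
}

# Usage:
#   write_crash_cmds /tmp/cmds.txt 'bt -slf' 'dis -l sysrq_handle_crash'
#   run_crash_batch /usr/lib/debug/boot/vmlinux-4.15.0-156-generic \
#       202111192354/dump.202111192354 /tmp/cmds.txt > report.txt
```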

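FAQ

No crash file was generated

  • An Ubuntu virtual machine on VirtualBox is currently known to be unable to use kdump properly; in actual testing, a crash file was produced on VMware
  • Hyper-V cannot use kdump either, although a workaround seems to exist; see the reference links

References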
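
Before spending time on kdump, it can help to check which hypervisor the system runs on. The sketch below maps the identifiers printed by systemd-detect-virt (vmware, oracle for VirtualBox, microsoft for Hyper-V, none for bare metal) to the observations above. The function name is hypothetical, and the messages only restate this article's test results; they are not a general compatibility statement.

```shell
#!/bin/sh
# Map a systemd-detect-virt identifier to this article's kdump observations.
kdump_note_for_virt() {
    case $1 in
        vmware)    echo "VMware: a crash file was produced in testing" ;;
        oracle)    echo "VirtualBox: kdump did not work in testing" ;;
        microsoft) echo "Hyper-V: kdump reported as not working; a workaround may exist" ;;
        none)      echo "bare metal: a real Linux system is the supported case" ;;
        *)         echo "untested platform: $1" ;;
    esac
}

# Usage:
#   kdump_note_for_virt "$(systemd-detect-virt)"
```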
