How to manually trigger a panic on a physical machine and generate a vmcore


1. Configure kdump

1.1 el6

On CentOS 6, edit /boot/grub/grub.conf and add crashkernel=auto to the kernel parameters, for example:

kernel /vmlinuz-2.6.32-xxx.el6.x86_64 ro root=LABEL=/ crashkernel=auto ...

1.2 el7

On CentOS 7, edit /etc/default/grub and add crashkernel=auto to the GRUB_CMDLINE_LINUX line, for example:

GRUB_CMDLINE_LINUX="crashkernel=auto ..."

1.3 Whenever the GRUB_CMDLINE_LINUX line in /etc/default/grub is modified, the grub configuration must be regenerated:

grub2-mkconfig -o /boot/grub2/grub.cfg

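If the server is highly unstable and you need the system to crash and produce a kernel core dump as soon as it hangs, use the NMI watchdog: add nmi_watchdog=1 to the kernel parameters as well to activate the watchdog (that way you do not have to watch the server constantly and trigger the core by hand).

Reboot the server for the settings above to take effect.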
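
After the reboot it is worth confirming that the parameter actually reached the running kernel. The following is a minimal sketch, not part of the original procedure: the helper name has_crashkernel is made up for illustration, and it only inspects a command-line string that you pass in (normally the contents of /proc/cmdline).

```shell
# Hypothetical helper: return 0 if the given kernel command line
# contains a crashkernel= parameter, 1 otherwise.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use after the reboot (left as a comment so that sourcing this
# snippet has no side effects):
#   has_crashkernel "$(cat /proc/cmdline)" && echo "crashkernel is set"
```

If the check fails, the usual suspects are an edit to the wrong grub file or a grub configuration that was not regenerated.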

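2. sysctl settings related to kdump

Many online write-ups about kdump also adjust some settings in sysctl.conf when configuring kdump. They are listed here; change them as your own situation requires.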

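The settings below are written in sysctl form (for /etc/sysctl.conf). Some of them can also be set as kernel boot parameters on the GRUB_CMDLINE_LINUX line in /etc/default/grub, but the boot-parameter spelling drops the kernel. prefix (for example unknown_nmi_panic and softlockup_panic=1).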

kernel.sysrq=1
kernel.unknown_nmi_panic=1
kernel.softlockup_panic=1

2.1 sysrq

kernel.sysrq=1: when configured through /proc, this is equivalent to echo 1 > /proc/sys/kernel/sysrq.

The default SysRq value (/proc/sys/kernel/sysrq) is 16. Change it to 1 to enable SysRq, then trigger a core dump:

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

The out-of-band console then shows:

[1290981.013642] SysRq : Trigger a crash
[1290981.018405] BUG: unable to handle kernel NULL pointer dereference at           (null)
[1290981.028007] IP: [<ffffffff813ed756>] sysrq_handle_crash+0x16/0x20

The system then switches to the kdump kernel and saves the vmcore:

         Starting Kdump Vmcore Save Service...
[    7.200189] BTRFS info (device sda4): disk space caching is enabled

kdump: saving to /kdumproot/data//127.0.0.1-2017-01-18-23:33:42/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
Copying data                       : [100.0 %] \
kdump: saving vmcore complete

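Once the SysRq key is enabled, any user with console access gains some special capabilities. When the system hangs, or when you are diagnosing a kernel-related problem, these key combinations print kernel information immediately.

For that reason, leave this feature off unless you are debugging or troubleshooting. If you must turn it on, make sure access to the console is secured.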
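
The value in /proc/sys/kernel/sysrq is a bitmask, which is why the default of 16 mentioned above is more restrictive than 1. The sketch below decodes a value into the function groups it enables. The bit meanings are taken from the kernel's SysRq documentation as I understand it; the function name sysrq_decode and the short labels are my own and purely illustrative.

```shell
# Hypothetical helper: print which SysRq function groups a bitmask enables.
# 0 disables SysRq entirely and 1 enables everything; other values are
# a bitwise OR of the groups listed below.
sysrq_decode() {
    value="$1"
    case "$value" in
        0) echo "disabled"; return 0 ;;
        1) echo "all functions enabled"; return 0 ;;
    esac
    out=""
    for entry in "2:console-loglevel" "4:keyboard" "8:debug-dumps" \
                 "16:sync" "32:remount-ro" "64:signal-processes" \
                 "128:reboot-poweroff" "256:rt-nice"; do
        bit="${entry%%:*}"      # numeric bit value before the colon
        name="${entry#*:}"      # label after the colon
        if [ $(( value & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "$out"
}

# Example (comment only): sysrq_decode "$(cat /proc/sys/kernel/sysrq)"
```

Decoding the current value before changing it makes it easy to restore the original setting once debugging is finished.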

2.2 kernel.unknown_nmi_panic

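kernel.unknown_nmi_panic=1: if the system is already hung, the NMI button can be used to trigger kdump.

To enable this option: echo 1 > /proc/sys/kernel/unknown_nmi_panic. Note that with this feature enabled you must not enable NMI_WATCHDOG at the same time, otherwise the system will panic!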
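
Because the two settings must not be combined, a quick check before enabling either one can avoid an unintended panic. This is only a sketch: nmi_settings_conflict is a hypothetical name, it works on the two values you pass in, and treating any non-zero nmi_watchdog value as "enabled" is my assumption.

```shell
# Hypothetical helper: return 0 (conflict) when unknown_nmi_panic is 1
# and the NMI watchdog is enabled at the same time.
#   $1 = value of kernel.unknown_nmi_panic
#   $2 = value of kernel.nmi_watchdog
nmi_settings_conflict() {
    unknown_nmi_panic="$1"
    nmi_watchdog="$2"
    [ "$unknown_nmi_panic" = "1" ] && [ "$nmi_watchdog" != "0" ]
}

# Example (comment only):
#   nmi_settings_conflict "$(cat /proc/sys/kernel/unknown_nmi_panic)" \
#                         "$(cat /proc/sys/kernel/nmi_watchdog)" \
#       && echo "warning: unknown_nmi_panic and nmi_watchdog are both on"
```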

2.3 kernel.softlockup_panic

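kernel.softlockup_panic=1 corresponds to the value of /proc/sys/kernel/softlockup_panic. A value of 1 makes the kernel panic and reboot when it hits a deadlock or an infinite loop (a soft lockup). If kdump is installed on the machine, you get a kernel core file after the reboot, which makes tracking down the problem much easier, and you no longer have to reboot the machine by hand. On a mainline kernel the timeout threshold can be changed through /proc/sys/kernel/softlockup_thresh; on a CentOS kernel the corresponding file is /proc/sys/kernel/watchdog_thresh.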
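
Since the threshold file has a different name depending on the kernel, a script that has to run on both can probe for whichever one exists. A minimal sketch follows; lockup_thresh_file is a hypothetical helper, and the optional directory argument exists only so the function can be exercised outside /proc.

```shell
# Hypothetical helper: print the path of the lockup threshold file that
# exists on this kernel, or return 1 if neither is present.
#   $1 = directory to look in (defaults to /proc/sys/kernel)
lockup_thresh_file() {
    base="${1:-/proc/sys/kernel}"
    for name in watchdog_thresh softlockup_thresh; do
        if [ -e "$base/$name" ]; then
            printf '%s\n' "$base/$name"
            return 0
        fi
    done
    return 1
}

# Example (comment only): show the current threshold in seconds
#   cat "$(lockup_thresh_file)"
```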

2.4 nmi watchdog

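2.4.1 Build-time support

Much x86/x86-64 hardware provides a feature for activating watchdog NMI interrupts (NMI: a non-maskable interrupt, which still gets executed even when the system is in serious trouble). This can be used to debug kernel lockups. By executing NMI interrupts periodically, the kernel can monitor whether any CPU has locked up and print the corresponding debug information.

To use the NMI watchdog, APIC support must be enabled in the kernel. For SMP kernels, APIC support is compiled in automatically. For uniprocessor kernels, enable one of the following kernel configuration options: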

CONFIG_X86_UP_APIC (Processor type and features -> Local APIC support on uniprocessors) 

CONFIG_X86_UP_IOAPIC (Processor type and features -> IO-APIC support on uniprocessors) 

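2.4.2 Enabling

On x86, nmi_watchdog is disabled by default; you have to activate it with a boot parameter.

Runtime change:

The NMI watchdog can also be disabled at runtime by writing 0 to /proc/sys/kernel/nmi_watchdog. Writing 1 to the same file re-enables it.

Persistent change:

The NMI watchdog must be activated at boot with the nmi_watchdog=X parameter; otherwise it cannot be changed at runtime.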

#grep NMI /proc/interrupts
 NMI:          2          1          0          5          1          0          1          1          0          1          2          0          0         12          1          1          0          0          0          5          1          1          0          0   Non-maskable interrupts
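
One way to read the /proc/interrupts line above is to sum the per-CPU counters and compare two readings taken a few seconds apart: if the total keeps growing, NMIs are being delivered. The following sketch does the summing; nmi_total is a hypothetical helper that reads /proc/interrupts-style text on standard input.

```shell
# Hypothetical helper: sum the per-CPU NMI counters from text in the
# format of /proc/interrupts (read from stdin). Prints 0 if there is
# no NMI line.
nmi_total() {
    awk '$1 == "NMI:" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) sum += $i
         }
         END { print sum + 0 }'
}

# Example (comment only): compare two readings ten seconds apart
#   a=$(nmi_total < /proc/interrupts); sleep 10
#   b=$(nmi_total < /proc/interrupts); echo "NMIs in 10s: $((b - a))"
```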

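2.4.3 Triggering a kernel crash via the NMI watchdog

When the system hangs in a failure where normal interrupts are disabled, a non-maskable interrupt (NMI) can be used to trigger a panic and obtain a crash dump. There are two ways to trigger an NMI (the NMI watchdog, or an NMI button combined with unknown_nmi_panic), but the two cannot be used at the same time.

3. Two ways to trigger a panic manually:

3.1 Method 1: use /proc/sysrq-trigger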

echo 1 | sudo tee /proc/sys/kernel/sysrq
echo c | sudo tee /proc/sysrq-trigger
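
Triggering a crash without a loaded capture kernel just reboots or hangs the machine without producing a vmcore, so a guard in front of the trigger is cheap insurance. This is a sketch under assumptions: capture_kernel_loaded is a hypothetical helper, and it relies on /sys/kernel/kexec_crash_loaded reporting 1 when the kdump kernel is loaded.

```shell
# Hypothetical helper: return 0 if the kdump capture kernel is loaded.
#   $1 = flag file to read (defaults to /sys/kernel/kexec_crash_loaded)
capture_kernel_loaded() {
    flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Example (comment only, because the last line crashes the machine):
#   capture_kernel_loaded || { echo "kdump not ready, aborting"; exit 1; }
#   echo 1 | sudo tee /proc/sys/kernel/sysrq
#   echo c | sudo tee /proc/sysrq-trigger
```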

3.2 Method 2: enable NMI

Suppose machine A should panic when it receives an NMI.

Log in to A:

echo 1 | sudo tee /proc/sys/kernel/unknown_nmi_panic

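A's OOB IP: 100.126.xx.xx

Log in to the OOB jump host in A's data center:

Next, use IPMI over the network to send an NMI to the server remotely; with unknown_nmi_panic set, this triggers a kernel core dump: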

[root@oob1.xxx /root]
#ipmitool -I lanplus -U name -P xxxxx -H 100.126.xx.xx chassis power diag
Chassis Power Control: Diag
rt2m09613 login: 
[  287.765130] Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[  287.765130] 1. Integrated Management Log (IML)
[  287.765130] 2. OA Syslog
[  287.765130] 3. OA Forward Progress Log
[  287.765130] 4. iLO Event Log
[  287.926060] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE  ------------   3.10.0-327.xxxxx.x86_64 #1
[  287.990149] Hardware name: HP ProLiant DL380e Gen8, BIOS P73 08/20/2012
[  288.029748]  ffffffffa049e4b0 2df877b6d8f92a53 ffff88183f405de0 ffffffff81631816
[  288.074609]  ffff88183f405e60 ffffffff8162b0ef 0000000000000008 ffff88183f405e70
[  288.118449]  ffff88183f405e10 2df877b6d8f92a53 0000000000000000 ffffc9000c0d2072
[  288.162844] Call Trace:
[  288.177585]  <NMI>  [<ffffffff81631816>] dump_stack+0x19/0x1b
[  288.212435]  [<ffffffff8162b0ef>] panic+0xd8/0x1e7
[  288.240375]  [<ffffffffa049d8ed>] hpwdt_pretimeout+0xdd/0xe0 [hpwdt]
[  288.279340]  [<ffffffff8163a9f9>] nmi_handle.isra.0+0x69/0xb0
[  288.314476]  [<ffffffff8163ab66>] do_nmi+0x126/0x340
[  288.341792]  [<ffffffff81639e31>] end_repeat_nmi+0x1e/0x2e
[  288.374724]  [<ffffffff81058e96>] ? native_safe_halt+0x6/0x10
[  288.409004]  [<ffffffff81058e96>] ? native_safe_halt+0x6/0x10
[  288.442540]  [<ffffffff81058e96>] ? native_safe_halt+0x6/0x10
[  288.476401]  <<EOE>>  [<ffffffff8101dd5f>] default_idle+0x1f/0xc0
[  288.510993]  [<ffffffff8101e666>] arch_cpu_idle+0x26/0x30
[  288.541880]  [<ffffffff810d3185>] cpu_startup_entry+0x245/0x290
[  288.577386]  [<ffffffff81621587>] rest_init+0x77/0x80
[  288.607035]  [<ffffffff81a89057>] start_kernel+0x429/0x44a
[  288.640036]  [<ffffffff81a88a37>] ? repair_env_string+0x5c/0x5c
[  288.675191]  [<ffffffff81a88120>] ? early_idt_handlers+0x120/0x120
[  288.713429]  [<ffffffff81a885ee>] x86_64_start_reservations+0x2a/0x2c
[  288.753833]  [<ffffffff81a88742>] x86_64_start_kernel+0x152/0x175
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu

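4. Analyzing the result with crash

The crash package has to be installed separately with yum -y install crash. The crash command also depends on the kernel-debuginfo package (which in turn depends on kernel-debuginfo-common). Download location: http://debuginfo.centos.org/6/x86_64/

Before downloading, confirm the kernel version of your host. On my test machine I ran the following: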

el6

# uname -r
2.6.32-431.17.1.el6.x86_64
# wget http://debuginfo.centos.org/6/x86_64/kernel-debuginfo-common-x86_64-2.6.32-431.17.1.el6.x86_64.rpm
# wget http://debuginfo.centos.org/6/x86_64/kernel-debuginfo-2.6.32-431.17.1.el6.x86_64.rpm

el7

http://debuginfo.centos.org/7/x86_64/
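
The two package URLs follow a fixed pattern built from the kernel release, so they can be generated instead of typed by hand. A small sketch, assuming the el7 directory uses the same file naming as the el6 example above; debuginfo_urls is a hypothetical helper name.

```shell
# Hypothetical helper: print the kernel-debuginfo-common and
# kernel-debuginfo download URLs for a kernel release.
#   $1 = kernel release, e.g. the output of `uname -r`
#   $2 = CentOS major version (6 or 7)
debuginfo_urls() {
    release="$1"
    major="$2"
    base="http://debuginfo.centos.org/${major}/x86_64"
    printf '%s/kernel-debuginfo-common-x86_64-%s.rpm\n' "$base" "$release"
    printf '%s/kernel-debuginfo-%s.rpm\n' "$base" "$release"
}

# Example (comment only): download both packages for the running kernel
#   debuginfo_urls "$(uname -r)" 7 | xargs -n 1 wget
```

The release passed in must be that of the kernel that crashed, which is not necessarily the kernel running on the machine where you do the analysis.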

crash analysis example 1

After downloading, install both packages with rpm -ivh. Then run the crash analysis with the following commands:

# pwd
/var/crash/127.0.0.1-2014-09-16-14:35:49
# crash /usr/lib/debug/lib/modules/2.6.32-431.17.1.el6.x86_64/vmlinux vmcore
crash 6.1.0-5.el6
Copyright (C) 2002-2012  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.17.1.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 1
        DATE: Tue Sep 16 22:35:49 2014
      UPTIME: 00:05:33
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 175
    NODENAME: localhost.localdomain
     RELEASE: 2.6.32-431.17.1.el6.x86_64
     VERSION: #1 SMP Wed May 7 23:32:49 UTC 2014
     MACHINE: x86_64  (3398 Mhz)
      MEMORY: 1 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 1412
     COMMAND: "bash"
        TASK: ffff88003d0b2040  [THREAD_INFO: ffff88003c33c000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 1412   TASK: ffff88003d0b2040  CPU: 0   COMMAND: "bash"
 #0 [ffff88003c33d9e0] machine_kexec at ffffffff81038f3b
 #1 [ffff88003c33da40] crash_kexec at ffffffff810c59f2
 #2 [ffff88003c33db10] oops_end at ffffffff8152b7f0
 #3 [ffff88003c33db40] no_context at ffffffff8104a00b
 #4 [ffff88003c33db90] __bad_area_nosemaphore at ffffffff8104a295
 #5 [ffff88003c33dbe0] bad_area at ffffffff8104a3be
 #6 [ffff88003c33dc10] __do_page_fault at ffffffff8104ab6f
 #7 [ffff88003c33dd30] do_page_fault at ffffffff8152d73e
 #8 [ffff88003c33dd60] page_fault at ffffffff8152aaf5
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff8134b516  RSP: ffff88003c33de18  RFLAGS: 00010096
    RAX: 0000000000000010  RBX: 0000000000000063  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000063
    RBP: ffff88003c33de18   R8: 0000000000000000   R9: ffffffff81645da0
    R10: 0000000000000001  R11: 0000000000000000  R12: 0000000000000000
    R13: ffffffff81b01a40  R14: 0000000000000286  R15: 0000000000000004
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88003c33de20] __handle_sysrq at ffffffff8134b7d2
#10 [ffff88003c33de70] write_sysrq_trigger at ffffffff8134b88e
#11 [ffff88003c33dea0] proc_reg_write at ffffffff811f2f1e
#12 [ffff88003c33def0] vfs_write at ffffffff81188c38
#13 [ffff88003c33df30] sys_write at ffffffff81189531
#14 [ffff88003c33df80] system_call_fastpath at ffffffff8100b072
    RIP: 00000036e3adb7a0  RSP: 00007fff22936c10  RFLAGS: 00010206
    RAX: 0000000000000001  RBX: ffffffff8100b072  RCX: 0000000000000400
    RDX: 0000000000000002  RSI: 00007fab7908b000  RDI: 0000000000000001
    RBP: 00007fab7908b000   R8: 000000000000000a   R9: 00007fab79084700
    R10: 00000000ffffffff  R11: 0000000000000246  R12: 0000000000000002
    R13: 00000036e3d8e780  R14: 0000000000000002  R15: 00000036e3d8e780
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
crash> 

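The output above simply prints the stack trace, showing that the dump was generated while the bash process with PID 1412 was running. It also shows that the write_sysrq_trigger function was involved. To help locate the cause of a problem, crash provides the following commands: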

crash> ?
*              files          mach           repeat         timer
alias          foreach        mod            runq           tree
ascii          fuser          mount          search         union
bt             gdb            net            set            vm
btop           help           p              sig            vtop
dev            ipcs           ps             struct         waitq
dis            irq            pte            swap           whatis
eval           kmem           ptob           sym            wr
exit           list           ptov           sys            q
extend         log            rd             task
crash version: 6.1.0-5.el6   gdb version: 7.3.1
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".
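
For a first look at a dump, a handful of these commands can be collected in a file and replayed non-interactively. The sketch below only writes such a file; write_crash_cmds is a hypothetical helper, the selection of commands is my own, and the usage line assumes crash's -i option for reading commands from a file.

```shell
# Hypothetical helper: write a small crash command file to the given path.
# All commands come from the crash command list shown above.
write_crash_cmds() {
    cmd_file="$1"
    cat > "$cmd_file" <<'EOF'
sys
bt
log
ps
exit
EOF
}

# Example (comment only): produce a text report from a vmcore
#   write_crash_cmds /tmp/crash.cmds
#   crash -i /tmp/crash.cmds \
#       /usr/lib/debug/lib/modules/2.6.32-431.17.1.el6.x86_64/vmlinux \
#       vmcore > report.txt
```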

crash analysis example 2

2017-04-08 03:34:51    [29201245.714153] io-error-guard: catch 1 continuous bio error.
2017-04-08 03:34:51    [29201245.722046] Buffer I/O error on device sdc, logical block 0
...
2017-04-08 03:34:51    [29201245.781675] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
2017-04-08 03:34:51    [29201245.790380] IP: [] netoops+0x125/0x2a0
2017-04-08 03:34:51    [29201245.796262] PGD 371206d067 PUD 1c17f34067 PMD 0 
2017-04-08 03:34:51    [29201245.801494] Oops: 0000 [#1] SMP 
...
2017-04-08 03:34:51    [29201245.941654] Pid: 21606, comm: sh Tainted: GF          ---------------    2.6.32-358.23.2.ali1233.el5.x86_64 #1 Inspur SA5212M4/YZMB-00370-109
2017-04-08 03:34:51    [29201245.955245] RIP: 0010:[]  [] netoops+0x125/0x2a0
...
2017-04-08 03:34:51    [29201246.085431] Call Trace:
2017-04-08 03:34:51    [29201246.088422]   
2017-04-08 03:34:51    [29201246.091092]  [] kmsg_dump+0x113/0x180
2017-04-08 03:34:51    [29201246.096875]  [] bio_endio+0x12a/0x1b0
2017-04-08 03:34:51    [29201246.102652]  [] req_bio_endio+0x90/0xc0
2017-04-08 03:34:51    [29201246.108601]  [] blk_update_request+0x262/0x480
2017-04-08 03:34:51    [29201246.115165]  [] blk_update_bidi_request+0x27/0x80
2017-04-08 03:34:51    [29201246.121984]  [] blk_end_bidi_request+0x2f/0x80
2017-04-08 03:34:51    [29201246.128542]  [] blk_end_request+0x10/0x20
2017-04-08 03:34:51    [29201246.134665]  [] blk_end_request_err+0x33/0x60
2017-04-08 03:34:51    [29201246.141140]  [] scsi_io_completion+0x2db/0x5b0
2017-04-08 03:34:51    [29201246.147699]  [] scsi_finish_command+0xc3/0x120
2017-04-08 03:34:51    [29201246.154258]  [] scsi_softirq_done+0x101/0x170
2017-04-08 03:34:51    [29201246.160734]  [] blk_done_softirq+0x83/0xa0
2017-04-08 03:34:51    [29201246.166944]  [] __do_softirq+0xbf/0x220
2017-04-08 03:34:51    [29201246.172895]  [] call_softirq+0x1c/0x30
2017-04-08 03:34:51    [29201246.178757]  [] do_softirq+0x65/0xa0
2017-04-08 03:34:51    [29201246.184451]  [] irq_exit+0x7c/0x90
2017-04-08 03:34:51    [29201246.189968]  [] smp_call_function_single_interrupt+0x34/0x40
2017-04-08 03:34:51    [29201246.198017]  [] call_function_single_interrupt+0x13/0x20
2017-04-08 03:34:51    [29201246.205438]   
2017-04-08 03:34:51    [29201246.208106]  [] ? wait_for_rqlock+0x24/0x40
2017-04-08 03:34:51    [29201246.214403]  [] do_exit+0x5e6/0x8d0
2017-04-08 03:34:51    [29201246.220003]  [] do_group_exit+0x41/0xb0
2017-04-08 03:34:51    [29201246.225957]  [] sys_exit_group+0x17/0x20
2017-04-08 03:34:51    [29201246.231996]  [] system_call_fastpath+0x16/0x1b

Find the corresponding vmcore:

cd tmp/linux-2.6.32-358.23.2.el5

gdb vmlinux

// Translate a function from the Call Trace in the Oops into its source code location

(gdb) l *bio_endio+0x12a

/// This pinpoints the source code location where the failure occurred

0xffffffff811b292a is in bio_endio (fs/bio.c:1474).
1469                }
1470            }
1471        }
1472        spin_unlock(&eio->lock);
1473        if (sysctl_enable_bio_netoops)
1474            kmsg_dump(KMSG_DUMP_SOFT, NULL);
1475    }
1476
1477    /**
1478     * bio_endio - end I/O on a bio
(gdb)
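
When several frames need to be looked up, converting Call Trace entries into gdb expressions by hand gets tedious. The sketch below strips the function length from an entry such as bio_endio+0x12a/0x1b0 and produces the expression used with the l command above. trace_to_gdb_expr is a hypothetical helper name, and the usage line relies on gdb's -batch and -ex options.

```shell
# Hypothetical helper: turn a Call Trace entry like "func+0x12a/0x1b0"
# into the gdb expression "*func+0x12a" (the part after "/" is the
# function length and is not needed for the lookup).
trace_to_gdb_expr() {
    entry="$1"
    printf '*%s\n' "${entry%%/*}"
}

# Example (comment only): resolve one frame without an interactive session
#   gdb -batch -ex "list $(trace_to_gdb_expr 'bio_endio+0x12a/0x1b0')" vmlinux
```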

