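1 Background
A physical server was recently put into production. IT installed the following operating system on it:
CentOS Linux release 7.3.1611 (Core)
Kernel version:
3.10.0-514.el7.x86_64
The machine runs Docker. The Docker version is:
Docker version 19.03.6
While in service, the machine crashed and rebooted unexpectedly.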
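2 Fault analysis
2.1 Approach
(1) Start with the operating system log, /var/log/messages, and look for clues.
(2) Suspect a hardware compatibility problem: ask the hardware vendor to check firmware and compatibility.
(3) Suspect an operating system bug: check whether kdump is enabled and, if it is, whether a kernel dump (vmcore) was saved at the time of the crash.
2.2 Analysis in practice
(1) Check the operating system log /var/log/messages
The log shows that the system went down at 18:19:01 on 2020-04-01 and restarted at 18:23:19. Apart from those timestamps, it contains nothing that helps the analysis.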
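The gap between the last entry before the crash and the first entry after the reboot can be located mechanically. The sketch below assumes that every boot writes a "kernel: Linux version ..." line to the log, which is what I would expect on CentOS 7 but is worth confirming on your own system; the function name find_reboot_gaps is made up for this example.

```shell
# Print the last line logged before each boot, followed by the boot line itself.
# Assumption: each boot logs a line containing "kernel: Linux version".
find_reboot_gaps() {
    awk '
        /kernel: Linux version/ {
            if (prev != "") print "before: " prev
            print "boot:   " $0
        }
        { prev = $0 }   # remember the current line for the next iteration
    ' "$1"
}

# Usage (as root):
#   find_reboot_gaps /var/log/messages
```

A large jump between the "before" and "boot" timestamps, with no shutdown messages in between, suggests the machine went down without a clean shutdown.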
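(2) Check for hardware compatibility problems
In parallel, the hardware information collected from iDRAC was sent to the hardware vendor for investigation.
(3) Analyze with kdump
a. Check whether kdump is installed and running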
# systemctl status kdump.service
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Thu 2020-04-02 09:01:47 CST; 4h 0min ago
 Main PID: 284294 (code=exited, status=0/SUCCESS)
    Tasks: 0
   Memory: 0B
   CGroup: /system.slice/kdump.service
Note: see section 3 for how to install the kdump tools.
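Besides the service status, two kernel-side facts are worth checking: that memory was reserved for the crash kernel at boot (a crashkernel= parameter on the kernel command line) and that a crash kernel is actually loaded (/sys/kernel/kexec_crash_loaded reads 1). A minimal sketch, where the helper name has_crashkernel is only an illustration:

```shell
# Succeed if the given kernel command line contains a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running system; both files are skipped if they are not readable.
if [ -r /proc/cmdline ]; then
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= is set on the kernel command line"
    else
        echo "crashkernel= is missing: no memory is reserved for kdump"
    fi
fi
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "kexec_crash_loaded = $(cat /sys/kernel/kexec_crash_loaded)"
fi
```

If the parameter is missing, the kdump service cannot arm a crash kernel no matter what systemctl is asked to do.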
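b. Analyze with the crash command
Once the tools from section 3 are installed, analyze the vmcore with the following command (on my machine kdump had already been enabled by default, so a dump was available).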
# crash /var/crash/127.0.0.1-2020-04-01-18\:19\:32/vmcore /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux

crash 7.2.3-10.el7
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore  [PARTIAL DUMP]
        CPUS: 72
        DATE: Wed Apr  1 18:19:27 2020
      UPTIME: 19 days, 08:32:38
LOAD AVERAGE: 0.29, 0.32, 0.29
       TASKS: 4177
    NODENAME:
     RELEASE: 3.10.0-514.el7.x86_64
     VERSION: #1 SMP Tue Nov 22 16:42:41 UTC 2016
     MACHINE: x86_64  (2600 Mhz)
      MEMORY: 127.5 GB
       PANIC: "kernel BUG at fs/xfs/xfs_aops.c:1062!"
         PID: 92639
     COMMAND: "kworker/u898:3"
        TASK: ffff8810f827bec0  [THREAD_INFO: ffff880106fa4000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 92639  TASK: ffff8810f827bec0  CPU: 1  COMMAND: "kworker/u898:3"
 #0 [ffff880106fa75f0] machine_kexec at ffffffff81059cdb
 #1 [ffff880106fa7650] __crash_kexec at ffffffff81105182
 #2 [ffff880106fa7720] crash_kexec at ffffffff81105270
 #3 [ffff880106fa7738] oops_end at ffffffff8168ee88
 #4 [ffff880106fa7760] die at ffffffff8102e93b
 #5 [ffff880106fa7790] do_trap at ffffffff8168e540
 #6 [ffff880106fa77e0] do_invalid_op at ffffffff8102b144
 #7 [ffff880106fa7890] invalid_op at ffffffff81697e5e
    [exception RIP: xfs_vm_writepage+1419]
    RIP: ffffffffa052b2fb  RSP: ffff880106fa7948  RFLAGS: 00010246
    RAX: 006fffff00040009  RBX: ffff8813abed8fc8  RCX: 000000000000000c
    RDX: 0000000000000008  RSI: ffff880106fa7c40  RDI: ffffea006be56c00
    RBP: ffff880106fa79f0   R8: ffffffffffffffd8   R9: 000000000001a100
    R10: ffff88207ffd7000  R11: 0000000000000000  R12: ffff8813abed8fc8
    R13: ffff880106fa7c40  R14: ffff8813abed8e78  R15: ffffea006be56c00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff880106fa7990] find_get_pages_tag at ffffffff81180981
 #9 [ffff880106fa79f8] __writepage at ffffffff8118b3b3
#10 [ffff880106fa7a10] write_cache_pages at ffffffff8118bed1
#11 [ffff880106fa7b28] generic_writepages at ffffffff8118c19d
#12 [ffff880106fa7b88] xfs_vm_writepages at ffffffffa052a063 [xfs]
#13 [ffff880106fa7bb8] do_writepages at ffffffff8118d24e
#14 [ffff880106fa7bc8] __writeback_single_inode at ffffffff81228730
#15 [ffff880106fa7c08] writeback_sb_inodes at ffffffff8122941e
#16 [ffff880106fa7cb0] __writeback_inodes_wb at ffffffff8122967f
#17 [ffff880106fa7cf8] wb_writeback at ffffffff81229ec3
#18 [ffff880106fa7d70] bdi_writeback_workfn at ffffffff8122bd05
#19 [ffff880106fa7e20] process_one_work at ffffffff810a7f3b
#20 [ffff880106fa7e68] worker_thread at ffffffff810a8d76
#21 [ffff880106fa7ec8] kthread at ffffffff810b052f
#22 [ffff880106fa7f50] ret_from_fork at ffffffff81696518
crash>
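In a long session like this, the two strings most useful as search terms are the PANIC message and the exception RIP. If the session output has been saved to a text file (the file name crash-session.txt below is hypothetical, as is the helper name crash_summary), they can be pulled out with sed. This is only a convenience sketch and relies on the output format shown above.

```shell
# Extract the panic message and the faulting symbol+offset from saved crash output.
crash_summary() {
    sed -n \
        -e 's/.*PANIC: "\(.*\)".*/panic: \1/p' \
        -e 's/.*\[exception RIP: \([^]]*\)].*/rip: \1/p' \
        "$1"
}

# Usage:
#   crash_summary crash-session.txt
```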
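c. The key line is exception RIP: xfs_vm_writepage+1419, which places the failure in the XFS writeback path and agrees with the PANIC string "kernel BUG at fs/xfs/xfs_aops.c:1062!". Search for it on Google.
The following Red Hat article describes something very close to what I am seeing:
https://access.redhat.com/solutions/2779111
The symptoms appear to match, so the plan is to schedule downtime, upgrade the kernel as the article recommends, and then watch whether the crash recurs.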
3 Installing the kdump tools
3.1 Install kexec-tools
yum search kexec-tools
yum install kexec-tools crash
3.2 Configure the kdump service
vim /etc/kdump.conf
# change the directory where the core file is saved
path /var/crash
systemctl start kdump
systemctl enable kdump.service
Reference: https://www.linuxtechi.com/how-to-enable-kdump-on-rhel-7-and-centos-7/
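To find out where dumps will actually land, the path directive can be read back from the configuration file. The sketch below assumes the usual kdump.conf layout of one "directive value" pair per line, and falls back to /var/crash when no path line is present, which is the documented default; the function name kdump_path is made up for this example.

```shell
# Print the dump directory configured in a kdump.conf-style file.
# Commented-out lines ("#path ...") are ignored; the last active "path" wins.
kdump_path() {
    awk '
        $1 == "path" { p = $2 }
        END { print (p == "" ? "/var/crash" : p) }
    ' "$1"
}

# Usage:
#   ls -d "$(kdump_path /etc/kdump.conf)"/*/vmcore
```

On this machine the result corresponds to the /var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore file analyzed in section 2.2.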
3.3 Install kernel-debuginfo
(1) Download the packages
Search http://debuginfo.centos.org/7/x86_64/ for the rpm packages that match the kernel version exactly:
kernel-debuginfo-3.10.0-514.el7.x86_64.rpm
kernel-debuginfo-common-x86_64-3.10.0-514.el7.x86_64.rpm
(2) Install them, the common package first because kernel-debuginfo depends on it
rpm -ivh kernel-debuginfo-common-x86_64-3.10.0-514.el7.x86_64.rpm
rpm -ivh kernel-debuginfo-3.10.0-514.el7.x86_64.rpm
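crash needs a vmlinux whose version matches the kernel that produced the dump, so it is worth confirming that the debuginfo file exists before starting the analysis. A small sketch, using the same /usr/lib/debug/lib/modules/<release>/vmlinux location as the crash command in section 2.2; the helper name debug_vmlinux is only for illustration, and the check uses the running kernel, which is only the right one if the kernel has not been upgraded since the crash.

```shell
# Build the expected path of the debug vmlinux for a given kernel release.
debug_vmlinux() {
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

# Check the running kernel.
vmlinux=$(debug_vmlinux "$(uname -r)")
if [ -f "$vmlinux" ]; then
    echo "found $vmlinux"
else
    echo "missing $vmlinux: install the matching kernel-debuginfo packages"
fi
```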