Tracking down the cause of an automatic reboot on a CentOS 7 host


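1 Background
A physical server recently went into production. The operating system installed by IT is:
CentOS Linux release 7.3.1611 (Core)

Kernel version:
3.10.0-514.el7.x86_64

The system runs Docker; the Docker version is: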
Docker version 19.03.6
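While in operation, the host crashed and rebooted unexpectedly.

2 Fault analysis
2.1 Approach
(1) Start with the OS log /var/log/messages and see whether it offers any clues.
(2) Suspect a hardware compatibility problem, and ask the hardware vendor to check firmware and compatibility.
(3) Suspect an operating system bug. Check whether Linux kdump is enabled; if it is, look for a kernel dump file from the time of the crash.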

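2.2 Analysis in practice
(1) Check the OS log /var/log/messages

The log shows that the system went down at 18:19:01 on 2020-04-01 and came back up at 18:23:19. Beyond that, it contained nothing else useful for the analysis.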
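One way to locate the reboot in a large log is to search for the kernel boot banner and print the lines that precede it. The sketch below assumes that each boot logs a "kernel: Linux version" line to /var/log/messages; the function name boot_context is made up for illustration.

```shell
# Show every kernel boot banner in a syslog file, with line numbers and
# a few lines of context before it (the last messages before the reboot).
# Assumption: a fresh boot is marked by a "kernel: Linux version" line.
boot_context() {
    # $1 = log file, $2 = lines of context before each banner (default 20)
    grep -n -B "${2:-20}" 'kernel: Linux version' "$1"
}
```

For example, `boot_context /var/log/messages 30` prints the 30 lines logged before each boot. `last -x reboot shutdown` gives a second view of boot and shutdown times from wtmp.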

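(2) Check for hardware compatibility problems
In parallel, send the hardware information collected from iDRAC to the hardware vendor for investigation.
(3) Analyze with kdump
a. Check whether kdump is installed and running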

# systemctl status kdump.service 
● kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
Active: active (exited) since Thu 2020-04-02 09:01:47 CST; 4h 0min ago
Main PID: 284294 (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CGroup: /system.slice/kdump.service

Note: see Section 3 for installing the kdump tools.
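An active kdump.service is only part of the picture: the crash kernel also needs memory reserved at boot through the crashkernel= kernel parameter. The helper below is a small sketch for checking that; the name crashkernel_param is hypothetical.

```shell
# Print the value of crashkernel= from a kernel command line.
# Returns non-zero when the parameter is absent.
crashkernel_param() {
    # $1 = kernel command line, e.g. the contents of /proc/cmdline
    val=$(printf '%s\n' "$1" | tr ' ' '\n' | grep '^crashkernel=' | head -n 1 | cut -d= -f2-)
    [ -n "$val" ] && printf '%s\n' "$val"
}
```

On the host itself this would be called as `crashkernel_param "$(cat /proc/cmdline)"`. In addition, `cat /sys/kernel/kexec_crash_loaded` prints 1 when a crash kernel is currently loaded.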
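b. Analyze with the crash command
Once the tools from Section 3 are installed, analyze the vmcore with the following command (on my host kdump had already been enabled by default):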

# crash /var/crash/127.0.0.1-2020-04-01-18\:19\:32/vmcore /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
crash 7.2.3-10.el7
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore [PARTIAL DUMP]
CPUS: 72
DATE: Wed Apr 1 18:19:27 2020
UPTIME: 19 days, 08:32:38
LOAD AVERAGE: 0.29, 0.32, 0.29
TASKS: 4177
NODENAME: 
RELEASE: 3.10.0-514.el7.x86_64
VERSION: #1 SMP Tue Nov 22 16:42:41 UTC 2016
MACHINE: x86_64 (2600 Mhz)
MEMORY: 127.5 GB
PANIC: "kernel BUG at fs/xfs/xfs_aops.c:1062!"
PID: 92639
COMMAND: "kworker/u898:3"
TASK: ffff8810f827bec0 [THREAD_INFO: ffff880106fa4000]
CPU: 1
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 92639 TASK: ffff8810f827bec0 CPU: 1 COMMAND: "kworker/u898:3"
#0 [ffff880106fa75f0] machine_kexec at ffffffff81059cdb
#1 [ffff880106fa7650] __crash_kexec at ffffffff81105182
#2 [ffff880106fa7720] crash_kexec at ffffffff81105270
#3 [ffff880106fa7738] oops_end at ffffffff8168ee88
#4 [ffff880106fa7760] die at ffffffff8102e93b
#5 [ffff880106fa7790] do_trap at ffffffff8168e540
#6 [ffff880106fa77e0] do_invalid_op at ffffffff8102b144
#7 [ffff880106fa7890] invalid_op at ffffffff81697e5e
[exception RIP: xfs_vm_writepage+1419]
RIP: ffffffffa052b2fb RSP: ffff880106fa7948 RFLAGS: 00010246
RAX: 006fffff00040009 RBX: ffff8813abed8fc8 RCX: 000000000000000c
RDX: 0000000000000008 RSI: ffff880106fa7c40 RDI: ffffea006be56c00
RBP: ffff880106fa79f0 R8: ffffffffffffffd8 R9: 000000000001a100
R10: ffff88207ffd7000 R11: 0000000000000000 R12: ffff8813abed8fc8
R13: ffff880106fa7c40 R14: ffff8813abed8e78 R15: ffffea006be56c00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff880106fa7990] find_get_pages_tag at ffffffff81180981
#9 [ffff880106fa79f8] __writepage at ffffffff8118b3b3
#10 [ffff880106fa7a10] write_cache_pages at ffffffff8118bed1
#11 [ffff880106fa7b28] generic_writepages at ffffffff8118c19d
#12 [ffff880106fa7b88] xfs_vm_writepages at ffffffffa052a063 [xfs]
#13 [ffff880106fa7bb8] do_writepages at ffffffff8118d24e
#14 [ffff880106fa7bc8] __writeback_single_inode at ffffffff81228730
#15 [ffff880106fa7c08] writeback_sb_inodes at ffffffff8122941e
#16 [ffff880106fa7cb0] __writeback_inodes_wb at ffffffff8122967f
#17 [ffff880106fa7cf8] wb_writeback at ffffffff81229ec3
#18 [ffff880106fa7d70] bdi_writeback_workfn at ffffffff8122bd05
#19 [ffff880106fa7e20] process_one_work at ffffffff810a7f3b
#20 [ffff880106fa7e68] worker_thread at ffffffff810a8d76
#21 [ffff880106fa7ec8] kthread at ffffffff810b052f
#22 [ffff880106fa7f50] ret_from_fork at ffffffff81696518
crash>
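If the debuginfo packages are not installed yet, the panic reason can often be read without crash. This sketch assumes that kdump also wrote a plain-text vmcore-dmesg.txt next to the vmcore, which is worth verifying on your own host; panic_summary is a made-up helper name.

```shell
# Print the first few lines of a kernel log that usually identify a panic.
panic_summary() {
    # $1 = text kernel log (for example vmcore-dmesg.txt), $2 = max lines (default 5)
    grep -E 'kernel BUG at|Kernel panic|BUG: unable to handle|RIP:' "$1" | head -n "${2:-5}"
}
```

A call would look like `panic_summary '/var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore-dmesg.txt'`. Inside a crash session, the `log` command prints the same kernel ring buffer.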

c. The backtrace shows exception RIP: xfs_vm_writepage+1419, so search Google for it.

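This result looks very similar to what I observed:
https://access.redhat.com/solutions/2779111
The symptoms appear to match. The plan is to schedule downtime, upgrade the kernel as the article recommends, and then keep watching to see whether the crash recurs.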
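After the upgrade it is worth confirming that the host actually booted into the new kernel. The following sketch compares the running release with the newest installed one; it assumes GNU sort with the -V option, and both function names are invented for this example.

```shell
# Read kernel release strings (one per line) on stdin, print the newest.
newest_kernel() {
    sort -V | tail -n 1
}

# Succeed only if the given running release is the newest one on stdin.
kernel_is_current() {
    # $1 = running release, as printed by `uname -r`
    [ "$1" = "$(newest_kernel)" ]
}
```

On an RPM-based host the installed list could come from `rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n'`, piped into `kernel_is_current "$(uname -r)"`. A non-zero exit status suggests the machine still needs a reboot into the upgraded kernel.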


3 Installing the kdump tools
3.1 Install kexec-tools

yum install kexec-tools
yum install crash

3.2 Configure the kdump service

vim /etc/kdump.conf
# Set the directory where the core file is written
path /var/crash

systemctl start kdump
systemctl enable kdump.service

Reference: https://www.linuxtechi.com/how-to-enable-kdump-on-rhel-7-and-centos-7/
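To confirm where dumps will land, the path directive can be read back from the configuration file. This is a minimal sketch that falls back to /var/crash when no path line is set; kdump_dump_dir is a hypothetical name, and the function ignores commented-out lines.

```shell
# Print the dump directory configured in a kdump.conf-style file.
kdump_dump_dir() {
    # $1 = path to the config file; the last uncommented "path" line wins
    dir=$(awk '$1 == "path" { p = $2 } END { if (p != "") print p }' "$1")
    printf '%s\n' "${dir:-/var/crash}"
}
```

For example, `ls -l "$(kdump_dump_dir /etc/kdump.conf)"` lists the dumps captured so far. To prove the setup end to end, `echo c > /proc/sysrq-trigger` forces a kernel crash, so run it only on a host you can afford to take down.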
3.3 Install the kernel-debuginfo packages
(1) Download the packages
Search http://debuginfo.centos.org/7/x86_64/ for the RPM packages that match the kernel version:

kernel-debuginfo-3.10.0-514.el7.x86_64.rpm 
kernel-debuginfo-common-x86_64-3.10.0-514.el7.x86_64.rpm

(2) Install

rpm -ivh kernel-debuginfo-common-x86_64-3.10.0-514.el7.x86_64.rpm
rpm -ivh kernel-debuginfo-3.10.0-514.el7.x86_64.rpm
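The crash command needs the vmlinux that matches the kernel which crashed, so a quick check after installation can save a confusing error later. The sketch below only builds the expected path, following the layout seen in the crash command earlier; debug_vmlinux_path is an invented helper name.

```shell
# Print the path where kernel-debuginfo places vmlinux for a given release.
debug_vmlinux_path() {
    # $1 = kernel release string, as printed by `uname -r`
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}
```

A check such as `[ -f "$(debug_vmlinux_path 3.10.0-514.el7.x86_64)" ] && echo found` confirms the file is in place. Pass the release of the kernel that produced the vmcore, which differs from `uname -r` once the host has been upgraded.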

