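Configuration:
cat /proc/sys/kernel/hung_task_panic   Hung-task detection: whether the kernel panics when a task stays blocked for too long. The soft lock_up case (a process in the kernel is stuck in an endless loop and cannot finish, or runs for too long) has its own switch, /proc/sys/kernel/softlockup_panic.
cat /proc/sys/kernel/nmi_watchdog   Hard lock_up detection (NMI watchdog).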
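The same settings can be read from a program. The following is only a minimal sketch: read_sysctl_int and print_watchdog_config are hypothetical helper names, and softlockup_panic and watchdog_thresh are added here on the assumption that the running kernel exposes them (a missing file is reported rather than treated as an error).

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer from a procfs file such as
 * /proc/sys/kernel/nmi_watchdog. Returns 0 and stores the value in
 * *out on success, -1 if the file is missing or cannot be parsed. */
int read_sysctl_int(const char *path, long *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the watchdog-related settings. The last two paths are an
 * assumption: they may not exist on every kernel version. */
void print_watchdog_config(void)
{
    static const char *const paths[] = {
        "/proc/sys/kernel/hung_task_panic",
        "/proc/sys/kernel/nmi_watchdog",
        "/proc/sys/kernel/softlockup_panic",
        "/proc/sys/kernel/watchdog_thresh",
    };
    for (size_t i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
        long v;
        if (read_sysctl_int(paths[i], &v) == 0)
            printf("%s = %ld\n", paths[i], v);
        else
            printf("%s: not available\n", paths[i]);
    }
}
```

Calling print_watchdog_config() from main() prints one line per setting.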
PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2"
The message reports a hard LOCKUP.
The stack trace is as follows:
PID: 16104 TASK: ffff880118b8e080 CPU: 2 COMMAND: "bwc_shrinker"
#0 [ffff88002c287b50] machine_kexec at ffffffff81038fa9
#1 [ffff88002c287bb0] crash_kexec at ffffffff810c5992
#2 [ffff88002c287c80] panic at ffffffff81511980
#3 [ffff88002c287d00] watchdog_overflow_callback at ffffffff810e646d
#4 [ffff88002c287d20] __perf_event_overflow at ffffffff8111c357
#5 [ffff88002c287da0] perf_event_overflow at ffffffff8111c924
#6 [ffff88002c287db0] intel_pmu_handle_irq at ffffffff81022e07
#7 [ffff88002c287e90] perf_event_nmi_handler at ffffffff815161a9
#8 [ffff88002c287ea0] notifier_call_chain at ffffffff81517c65
#9 [ffff88002c287ee0] atomic_notifier_call_chain at ffffffff81517cca
#10 [ffff88002c287ef0] notify_die at ffffffff810a12ae
#11 [ffff88002c287f20] do_nmi at ffffffff8151595b
#12 [ffff88002c287f50] nmi at ffffffff81515220
[exception RIP: __read_lock_failed+8]
RIP: ffffffff8128c6a8 RSP: ffff880107f11c08 RFLAGS: 00000097
RAX: 0000000000000297 RBX: 0000000000000077 RCX: 0000000000000082
RDX: 0000000000000297 RSI: 0000000000000202 RDI: ffff88020b7633f4
RBP: ffff880107f11c10 R8: 0000000000000000 R9: 0000000000000000
R10: 00000000ffffffff R11: ffff88006e94a268 R12: ffffffffa0952840
R13: 0000000000000008 R14: ffff88006bd677c8 R15: 0000000000000202
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffff880107f11c08] __read_lock_failed at ffffffff8128c6a8
#14 [ffff880107f11c08] _read_lock_irqsave at ffffffff815149f5
#15 [ffff880107f11c18] combo_clean_blocks_compress at ffffffffa0938126 [bwc]
#16 [ffff880107f11cb8] combo_blocks_compress at ffffffffa093899e [bwc]
#17 [ffff880107f11d58] pool_shrink at ffffffffa09344b0 [bwc]
#18 [ffff880107f11db8] sys_shrink at ffffffffa091b245 [bwc]
#19 [ffff880107f11e18] shrink_helper_main at ffffffffa091b2da [bwc]
#20 [ffff880107f11e38] worker_thread at ffffffff81094b00
#21 [ffff880107f11ee8] kthread at ffffffff8109acd6
#22 [ffff880107f11f48] kernel_thread at ffffffff8100c20a
==========================================================
soft lock_up
The panic information from that earlier case has been deleted; roughly, the stack showed the thread being interrupted by the NMI watchdog.
==========================================================
How to analyze this kind of problem:
First determine whether it is a hard lock_up or a soft lock_up.
----------------------------------------------------------------------------------------------------------
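A hard lock_up happens because interrupts were disabled and not re-enabled for a long time (about 5 s by default as far as I recall; on mainline kernels the threshold is kernel.watchdog_thresh, 10 s by default). The NMI (non-maskable interrupt) then interrupts the current process, and the task is to find out why that process kept interrupts off for so long. So far I have run into one case:
1. Another process took the lock and forgot to release it, and another thread calls spin_lock_irqsave(). That function disables interrupts first and then tries to take the lock; if it never gets the lock, the NMI fires. (In the trace above the same thing happens through _read_lock_irqsave.)
To analyze this, run foreach bt in the crash tool to print the stack of every process and check whether any process is inside the lock. If one is, work out why it has held the lock for so long.
If no process is inside the lock, go through the code, find every place that uses the lock, and look for a path that forgets to release it.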
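The "forgot to release the lock" path is usually an early return. The sketch below is a user-space analogy only, not the bwc module's code: toy_lock, buggy_update, fixed_update and spin_with_budget are hypothetical names, and a C11 atomic_flag stands in for the kernel lock. spin_with_budget gives up after a fixed number of attempts; the real spin_lock_irqsave() has no such limit, which is why the NMI eventually fires.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy user-space lock standing in for a kernel spinlock or rwlock. */
typedef struct {
    atomic_flag held;
} toy_lock;

bool toy_trylock(toy_lock *l)
{
    /* test_and_set returns the previous state: false means we got it. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void toy_unlock(toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* BUG: the error path returns while still holding the lock. */
int buggy_update(toy_lock *l, int *shared, int value)
{
    while (!toy_trylock(l))
        ;                       /* spin until acquired */
    if (value < 0)
        return -1;              /* toy_unlock(l) is missing here */
    *shared = value;
    toy_unlock(l);
    return 0;
}

/* Fixed: every path leaves through the single unlock. */
int fixed_update(toy_lock *l, int *shared, int value)
{
    int ret = 0;
    while (!toy_trylock(l))
        ;
    if (value < 0)
        ret = -1;
    else
        *shared = value;
    toy_unlock(l);
    return ret;
}

/* Try to take the lock at most max_spins times; true if acquired.
 * The bound exists only so the demonstration terminates. */
bool spin_with_budget(toy_lock *l, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++)
        if (toy_trylock(l))
            return true;
    return false;
}
```

After buggy_update() fails once, every later spin_with_budget() on the same lock returns false, which is the user-space counterpart of the CPU spinning in __read_lock_failed with interrupts off. When reading the real code, each return statement between the lock and unlock calls is a candidate for this bug.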
-------------------------------------------------------------------------------------------------------------
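A soft lock_up is caused by preemption being disabled for a long time, so the system cannot schedule other processes normally. The watchdog then flags the offending thread (as far as I know, the soft lock_up check runs from the watchdog's timer interrupt rather than from an NMI).
For this kind of problem, read the code and find out why it loops for so long; one possibility is a loop condition that stays true forever.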
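One way a loop condition can stay true is an exact-equality test on a counter that steps past the target. This is a hypothetical user-space illustration (drain_buggy and drain_fixed are made-up names, not taken from any real module); the guard parameter exists only so the buggy version stops in a demonstration.

```c
/* BUG: if step does not divide remaining, the unsigned value skips
 * over zero, wraps around, and the condition stays true. Returns the
 * number of iterations run, capped at guard. */
unsigned long drain_buggy(unsigned long remaining, unsigned long step,
                          unsigned long guard)
{
    unsigned long iters = 0;
    while (remaining != 0) {
        remaining -= step;      /* may wrap past zero */
        if (++iters >= guard)
            break;              /* demo-only bound; real code has none */
    }
    return iters;
}

/* Fixed: clamp the subtraction so remaining always reaches zero.
 * Assumes step > 0. */
unsigned long drain_fixed(unsigned long remaining, unsigned long step)
{
    unsigned long iters = 0;
    while (remaining > 0) {
        remaining -= (step < remaining) ? step : remaining;
        iters++;
    }
    return iters;
}
```

With remaining = 10 and step = 3, drain_fixed finishes in 4 iterations, while drain_buggy runs until the guard stops it. In kernel code, a loop that is legitimately long would normally also call cond_resched() so that other tasks can run.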