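1. Introduction
The OOM Killer is a mechanism inside the Linux kernel. When the system runs critically short of memory, it applies its own heuristic to pick a process and kill it. This situation can arise because the kernel overcommits memory: it grants an allocation request without reserving physical pages for it right away, since a process that asks for memory often does not use all of it immediately, if at all. The total amount promised to processes can therefore exceed what the machine can actually back. For example, the kernel may approve requests that add up to 2.5 GB even though only 2 GB can really be provided. Usually this causes no problems. Once many processes start using the memory they were promised, however, demand outstrips supply and memory runs out quickly. That is when the OOM Killer is triggered: it selectively kills a process so that the system as a whole can keep running.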
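To see how far the kernel has gone in promising memory, the overcommit policy and the commit counters can be read from procfs. The following is a minimal sketch: the function name `overcommit_report` and its optional argument (the procfs mount point, assumed to default to /proc) are my own, not part of any tool.

```shell
# Print the overcommit policy and the commit counters.
# $1 (optional, hypothetical parameter): procfs mount point, default /proc.
overcommit_report() {
    proc_root="${1:-/proc}"
    mode=$(cat "$proc_root/sys/vm/overcommit_memory")
    case "$mode" in
        0) echo "policy: heuristic overcommit (0)" ;;
        1) echo "policy: always overcommit (1)" ;;
        2) echo "policy: strict accounting, no overcommit (2)" ;;
        *) echo "policy: unknown ($mode)" ;;
    esac
    # /proc/meminfo reports CommitLimit and Committed_AS in kB.
    awk '/^(CommitLimit|Committed_AS):/ { printf "%s %d kB\n", $1, $2 }' \
        "$proc_root/meminfo"
}
```

Calling `overcommit_report` with no argument reads the live system. Committed_AS is the memory currently promised to processes; CommitLimit is only enforced when the policy is 2.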
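2. Which process does the OOM Killer kill?
The OOM Killer examines all running processes and gives each one a badness score according to its own heuristic. When memory runs out, the process with the highest badness score is killed. The scoring works roughly as follows:
- A process that, together with all of its children, uses a lot of memory gets a high score;
- The process with the smallest PID is preferred;
- Kernel processes and other relatively important processes get comparatively low scores.
The score the OOM Killer gives each process is exposed in /proc/{pid}/oom_score. There are actually three related files:
oom_score, oom_adj and oom_score_adj. According to the official Linux documentation:
oom_score holds the final score, that is, the badness score; the process with the highest one gets killed. From man proc: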
/proc/[pid]/oom_score (since Linux 2.6.11)
This file displays the current score that the kernel gives to this process for the purpose of selecting a process for the OOM-killer. A higher score means that the process is more likely to be selected by the OOM-killer. The basis for this score is the amount of memory used by the process, with increases (+) or decreases (-) for factors including:
* whether the process creates a lot of children using fork(2) (+);
* whether the process has been running a long time, or has used a lot of CPU time (-);
* whether the process has a low nice value (i.e., > 0) (+);
* whether the process is privileged (-); and
* whether the process is making direct hardware access (-).
The oom_score also reflects the adjustment specified by the oom_score_adj or oom_adj setting for the process.
oom_adj is obsolete; it still exists only for compatibility with older kernels. Again from man proc:
/proc/[pid]/oom_adj (since Linux 2.6.11)
This file can be used to adjust the score used to select which process should be killed in an out-of-memory (OOM) situation. The kernel uses this value for a bit-shift operation of the process's oom_score value: valid values are in the range -16 to +15, plus the special value -17, which disables OOM-killing altogether for this process. A positive score increases the likelihood of this process being killed by the OOM-killer; a negative score decreases the likelihood.
The default value for this file is 0; a new process inherits its parent's oom_adj setting. A process must be privileged (CAP_SYS_RESOURCE) to update this file.
Since Linux 2.6.36, use of this file is deprecated in favor of /proc/[pid]/oom_score_adj.
oom_score_adj is the file that newer kernels officially recommend. Its description reads:
/proc/[pid]/oom_score_adj (since Linux 2.6.36)
This file can be used to adjust the badness heuristic used to select which process gets killed in out-of-memory conditions.
The badness heuristic assigns a value to each candidate task ranging from 0 (never kill) to 1000 (always kill) to determine which process is targeted. The units are roughly a proportion along that range of allowed memory the process may allocate from, based on an estimation of its current memory and swap use. For example, if a task is using all allowed memory, its badness score will be 1000. If it is using half of its allowed memory, its score will be 500.
There is an additional factor included in the badness score: root processes are given 3% extra memory over other tasks.
The amount of "allowed" memory depends on the context in which the OOM-killer was called. If it is due to the memory assigned to the allocating task's cpuset being exhausted, the allowed memory represents the set of mems assigned to that cpuset (see cpuset(7)). If it is due to a mempolicy's node(s) being exhausted, the allowed memory represents the set of mempolicy nodes. If it is due to a memory limit (or swap limit) being reached, the allowed memory is that configured limit. Finally, if it is due to the entire system being out of memory, the allowed memory represents all allocatable resources.
The value of oom_score_adj is added to the badness score before it is used to determine which task to kill. Acceptable values range from -1000 (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows user space to control the preference for OOM-killing, ranging from always preferring a certain task or completely disabling it from OOM-killing. The lowest possible value, -1000, is equivalent to disabling OOM-killing entirely for that task, since it will always report a badness score of 0.
Consequently, it is very simple for user space to define the amount of memory to consider for each task. Setting a oom_score_adj value of +500, for example, is roughly equivalent to allowing the remainder of tasks sharing the same system, cpuset, mempolicy, or memory controller resources to use at least 50% more memory. A value of -500, on the other hand, would be roughly equivalent to discounting 50% of the task's allowed memory from being considered as scoring against the task.
For backward compatibility with previous kernels, /proc/[pid]/oom_adj can still be used to tune the badness score. Its value is scaled linearly with oom_score_adj.
Writing to /proc/[pid]/oom_score_adj or /proc/[pid]/oom_adj will change the other with its scaled value.
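The last sentence means that, for compatibility with older kernels, changing either oom_score_adj or oom_adj automatically changes the other one as well.
That is enough about these three files for now; they come up again later.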
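The linear scaling between the two files can be sketched with integer arithmetic. This is an assumption based on my reading of how the kernel converts the values in fs/proc/base.c (scale by 1000/17, with the two maxima mapped onto each other); the function names are hypothetical and the kernel source remains the authority.

```shell
# Convert a legacy oom_adj value (-17..15) to the oom_score_adj scale.
# Assumption: mirrors the kernel's integer arithmetic, not an official API.
oom_adj_to_score_adj() {
    adj=$1
    if [ "$adj" -eq 15 ]; then
        echo 1000                      # OOM_ADJUST_MAX maps to OOM_SCORE_ADJ_MAX
    else
        echo $(( $adj * 1000 / 17 ))   # division truncates toward zero
    fi
}

# Convert an oom_score_adj value (-1000..1000) back to the oom_adj scale.
oom_score_adj_to_adj() {
    score_adj=$1
    if [ "$score_adj" -eq 1000 ]; then
        echo 15
    else
        echo $(( $score_adj * 17 / 1000 ))
    fi
}
```

Under these assumptions `oom_adj_to_score_adj -17` prints -1000, which matches the documented fact that both values disable OOM killing for the process.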
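A quick way to look at all three files for one process is a small helper like the one below. It is only a sketch: `show_oom_values` is a name I made up, and the second argument (the procfs mount point, defaulting to /proc) is a hypothetical convenience parameter.

```shell
# Print oom_score, oom_score_adj and oom_adj for one PID.
# $1: PID; $2 (optional, hypothetical): procfs mount point, default /proc.
show_oom_values() {
    pid=$1
    proc_root="${2:-/proc}"
    for name in oom_score oom_score_adj oom_adj; do
        # Skip files that are missing or unreadable (e.g. the PID has exited).
        if [ -r "$proc_root/$pid/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$proc_root/$pid/$name")"
        fi
    done
}
```

For example, `show_oom_values $$` prints the values for the current shell.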
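3. How to tell that a process was killed by the OOM Killer
The simplest way is to look at the kernel log with dmesg. On Red Hat family systems:
dmesg | grep -i 'killed process'
The output may look like this (from a test on my own machine):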
host kernel: Out of Memory: Killed process 13482 (mysql).
Or search the log files directly:
grep -i 'killed process' /var/log/messages*
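When there are many such log lines, it can help to reduce them to just the PID and command name. The filter below is a sketch that assumes the message contains the phrase "Killed process <pid> (<name>)" as in the sample output above; `extract_oom_victims` is a hypothetical name, and other kernel versions may word the message differently.

```shell
# Read log lines on stdin and print "<pid> <command>" for each OOM kill.
# Assumption: the line contains "Killed process <pid> (<name>)".
extract_oom_victims() {
    sed -n -E 's/.*[Kk]illed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}
```

Usage: `dmesg | extract_oom_victims`, or feed it the contents of /var/log/messages*.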
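4. How to keep important processes from being killed by the OOM Killer
The OOM Killer normally takes the oom_score_adj value (described above) into account when computing the final oom_score that decides which process gets killed. So before changing that value, let's look up the range the kernel defines for it. I am reading version 4.13.16 here.
The source file is oom_kill.c, https://elixir.bootlin.com/linux/v4.13.16/source/mm/oom_kill.c, which includes the header
#include <linux/oom.h>
and that oom.h in turn includes
uapi/linux/oom.h
Looking at that header,
the ranges defined by the kernel are: https://elixir.bootlin.com/linux/v4.13.16/source/include/uapi/linux/oom.h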
#ifndef _UAPI__INCLUDE_LINUX_OOM_H
#define _UAPI__INCLUDE_LINUX_OOM_H

/*
 * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
 * pid.
 */
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * /proc/<pid>/oom_adj set to -17 protects from the oom killer for legacy
 * purposes.
 */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

#endif /* _UAPI__INCLUDE_LINUX_OOM_H */
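This means we can protect a process by setting its oom_score_adj to a suitably low negative value, or by setting its oom_adj to -17. Both files were described above. The write itself has to run with root privileges, which is why the commands below pipe into sudo tee instead of using a shell redirection.
echo -200 | sudo tee /proc/{pid}/oom_score_adj (if this process still ends up with the highest score of all processes when memory runs out, the OOM Killer will kill it anyway)
or
echo -17 | sudo tee /proc/{pid}/oom_adj (the process will not be killed by the OOM Killer)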
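Since the kernel only accepts values between OOM_SCORE_ADJ_MIN and OOM_SCORE_ADJ_MAX, a small wrapper can check the range before writing. This is a sketch, not an established tool: `set_oom_score_adj` and its optional third argument (the procfs mount point) are names I introduced for illustration, and it has to be run from a root shell to lower the value of a real process.

```shell
# Write an oom_score_adj value for a PID after validating it.
# $1: PID; $2: value; $3 (optional, hypothetical): procfs mount point.
set_oom_score_adj() {
    pid=$1
    value=$2
    proc_root="${3:-/proc}"
    # Accept only an optional leading minus sign followed by digits.
    digits=${value#-}
    case "$digits" in
        ''|*[!0-9]*)
            echo "not an integer: $value" >&2
            return 2 ;;
    esac
    # Range taken from OOM_SCORE_ADJ_MIN / OOM_SCORE_ADJ_MAX in uapi/linux/oom.h.
    if [ "$value" -lt -1000 ] || [ "$value" -gt 1000 ]; then
        echo "out of range (-1000..1000): $value" >&2
        return 2
    fi
    printf '%s\n' "$value" > "$proc_root/$pid/oom_score_adj"
}
```

A value of -1000 exempts the process from OOM killing entirely, so it is best reserved for the few processes the machine cannot work without.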
5. How to view the badness score of every running process
Here I borrow a script written by Raunak Ramakrishnan:
#!/bin/bash
# Displays running processes in descending order of OOM score
printf 'PID\tOOM Score\tOOM Adj\tCommand\n'
while read -r pid comm; do
    # Skip processes that have already exited or whose score is 0
    [ -f "/proc/$pid/oom_score" ] && [ "$(cat "/proc/$pid/oom_score")" != 0 ] &&
        printf '%d\t%d\t\t%d\t%s\n' "$pid" "$(cat "/proc/$pid/oom_score")" "$(cat "/proc/$pid/oom_score_adj")" "$comm"
done < <(ps -e -o pid= -o comm=) | sort -k 2nr
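6. How to force the OOM Killer to run
The official kernel documentation has an article on this:
https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html
It describes in detail the operations available through /proc/sysrq-trigger and what each of them does.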
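According to that document, the SysRq command `f` calls the OOM killer to kill a memory hog process. A minimal sketch of sending it follows; the function name `trigger_oom_killer` and its optional path argument are hypothetical. This really does kill a process, so it should only be run as root on a test machine.

```shell
# Ask the kernel to run the OOM killer once via the SysRq 'f' command.
# $1 (optional, hypothetical): trigger file, default /proc/sysrq-trigger.
trigger_oom_killer() {
    trigger="${1:-/proc/sysrq-trigger}"
    if [ ! -w "$trigger" ]; then
        echo "cannot write to $trigger (root privileges needed?)" >&2
        return 1
    fi
    printf 'f' > "$trigger"
}
```

Afterwards, the kernel log (dmesg) shows which process was chosen.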
7. References
- Official Linux kernel documentation: <https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html>
- Blog post: <https://github.com/lorenzo-stoakes/linux-vm-notes/blob/master/sections/oom.md>
- Official Oracle documentation: <https://www.oracle.com/technical-resources/articles/it-infrastructure/dev-oom-killer.html>
- Manually triggering the OOM Killer: <https://www.lynxbee.com/how-to-invoke-oom-killer-manually-for-understanding-which-process-gets-killed-first/>
- Raunak Ramakrishnan's blog: <https://dev.to/rrampage/surviving-the-linux-oom-killer-2ki9>