The Linux OOM Killer Mechanism


1. Introduction

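The OOM Killer is a mechanism inside the kernel (a kernel routine, not a separate process). When the system runs critically short of memory, it applies its own algorithm to select a process and kill it. This situation arises because Linux overcommits memory: when a process requests memory, the kernel grants the request without immediately backing it with physical pages, since a process does not necessarily use memory right after requesting it. As a result, the total memory promised to all processes can exceed what the machine actually has. For example, on a machine with 2 GB of memory the kernel may still grant requests that add up to 2.5 GB. Usually this causes no problems. However, once many processes start actually using the memory they were promised, demand outstrips supply and memory is quickly exhausted. That is when the OOM Killer is triggered: it selectively kills a process so that the rest of the system can keep running.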
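
To get a feel for how much memory has been promised compared with what physically exists, the Committed_AS and MemTotal fields of /proc/meminfo can be compared. The following is only a rough sketch: the function name commit_ratio and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare the memory promised to processes (Committed_AS) with the
# physical RAM (MemTotal). Both fields are reported in kB in /proc/meminfo.
# commit_ratio and its optional file argument are illustrative names.
commit_ratio() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/     { total = $2 }
         /^Committed_AS:/ { committed = $2 }
         END {
             # Print nothing if the fields were not found
             if (total > 0)
                 printf "committed=%d kB total=%d kB ratio=%d%%\n", committed, total, committed * 100 / total
         }' "$meminfo"
}
```

Calling commit_ratio with no argument reads the live /proc/meminfo. A ratio above 100% means more memory has been promised than the machine has RAM, which is normal on many systems.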

 

2. Which process does the OOM Killer choose to kill?

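The OOM Killer examines all running processes and assigns each one a badness score according to its own algorithm. The process with the highest badness score is the one killed when memory runs out. The scoring works as follows: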

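  • A process that, together with all of its child processes, uses a lot of memory gets a high score;

  • Processes with a lower PID are preferred;

  • Kernel processes and other relatively important processes get a comparatively low score.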

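The score the OOM Killer gives each process is stored in the file /proc/{pid}/oom_score. There are in fact three related files: oom_score, oom_adj, and oom_score_adj. The official Linux documentation describes them as follows.

oom_score holds the final score, that is, the badness score; the process with the highest value is the one killed. From man proc: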

/proc/[pid]/oom_score (since Linux 2.6.11)
              This  file displays the current score that the kernel gives to this process for the
              purpose of selecting a process for the OOM-killer.  A higher score means  that  the
              process  is more likely to be selected by the OOM-killer.  The basis for this score
              is the amount of memory used by the process, with increases (+)  or  decreases  (-)
              for factors including:
​
              * whether the process creates a lot of children using fork(2) (+);
​
              * whether  the  process has been running a long time, or has used a lot of CPU time
                (-);
​
              * whether the process has a low nice value (i.e., > 0) (+);
​
              * whether the process is privileged (-); and
​
              * whether the process is making direct hardware access (-).
​
              The oom_score also reflects  the  adjustment  specified  by  the  oom_score_adj  or
              oom_adj setting for the process.
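
The three files can be read with ordinary shell tools. A minimal sketch, assuming a Linux /proc filesystem; show_oom_files is an illustrative helper name, not a standard command:

```shell
#!/bin/sh
# Sketch: print the three OOM-related files for one process.
# show_oom_files is an illustrative helper name, not a standard tool.
show_oom_files() {
    local pid="${1:-$$}" name
    for name in oom_score oom_adj oom_score_adj; do
        # A file may be missing if the process has exited or the kernel lacks it
        if [ -r "/proc/$pid/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "/proc/$pid/$name")"
        fi
    done
}
```

Called without an argument it reports on the current shell; called with a PID it reports on that process.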

 

oom_adj is deprecated and exists only for compatibility with older kernels. Again from man proc:

/proc/[pid]/oom_adj (since Linux 2.6.11)
              This file can be used to adjust the score used to select which  process  should  be
              killed  in an out-of-memory (OOM) situation.  The kernel uses this value for a bit-
              shift operation of the process's oom_score value: valid values are in the range -16
              to  +15, plus the special value -17, which disables OOM-killing altogether for this
              process.  A positive score increases the likelihood of this process being killed by
              the OOM-killer; a negative score decreases the likelihood.
​
              The  default  value for this file is 0; a new process inherits its parent's oom_adj
              setting.  A process must be privileged (CAP_SYS_RESOURCE) to update this file.
​
              Since  Linux   2.6.36,   use   of   this   file   is   deprecated   in   favor   of
              /proc/[pid]/oom_score_adj.

oom_score_adj is the file recommended for current kernels. Its description reads:

/proc/[pid]/oom_score_adj (since Linux 2.6.36)
              This  file can be used to adjust the badness heuristic used to select which process
              gets killed in out-of-memory conditions.
​
              The badness heuristic assigns a value to each candidate task ranging from 0  (never
              kill)  to 1000 (always kill) to determine which process is targeted.  The units are
              roughly a proportion along that range of allowed memory the  process  may  allocate
              from, based on an estimation of its current memory and swap use.  For example, if a
              task is using all allowed memory, its badness score will be 1000.  If it  is  using
              half of its allowed memory, its score will be 500.
​
              There  is  an  additional  factor included in the badness score: root processes are
              given 3% extra memory over other tasks.
​
              The amount of "allowed" memory depends on the context in which the  OOM-killer  was
              called.   If it is due to the memory assigned to the allocating task's cpuset being
              exhausted, the allowed memory represents the set of mems assigned  to  that  cpuset
              (see  cpuset(7)).   If  it  is  due  to  a mempolicy's node(s) being exhausted, the
              allowed memory represents the set of mempolicy nodes.  If it is  due  to  a  memory
              limit  (or  swap limit) being reached, the allowed memory is that configured limit.
              Finally, if it is due to the entire system being out of memory, the allowed  memory
              represents all allocatable resources.
              
              The  value  of  oom_score_adj  is  added  to the badness score before it is used to
              determine   which   task   to   kill.    Acceptable   values   range   from   -1000
              (OOM_SCORE_ADJ_MIN)  to  +1000 (OOM_SCORE_ADJ_MAX).  This allows user space to control
              the preference for OOM-killing, ranging from always preferring a certain  task
              or  completely disabling it from OOM-killing.  The lowest possible value, -1000, is
              equivalent to disabling OOM-killing entirely for that task, since  it  will  always
              report a badness score of 0.
​
              Consequently,  it  is  very simple for user space to define the amount of memory to
              consider for each task.  Setting a oom_score_adj value of  +500,  for  example,  is
              roughly  equivalent  to  allowing  the  remainder of tasks sharing the same system,
              cpuset, mempolicy, or memory controller resources to use at least 50% more  memory.
              A  value of -500, on the other hand, would be roughly equivalent to discounting 50%
              of the task's allowed memory from being considered as scoring against the task.
​
              For backward compatibility with previous kernels, /proc/[pid]/oom_adj can still  be
              used to tune the badness score.  Its value is scaled linearly with oom_score_adj.
​
              Writing  to  /proc/[pid]/oom_score_adj or /proc/[pid]/oom_adj will change the other
              with its scaled value.

 

The last sentence means that, for compatibility with older kernels, changing either oom_score_adj or oom_adj automatically updates the other.
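
The linear scaling between the two files can be sketched in shell arithmetic. This assumes the legacy range ends at -17 (disable) and 15 (maximum), that oom_score_adj spans -1000 to 1000 as the man page states, and that the division truncates toward zero with the maximum mapped directly. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch of the linear scaling from the legacy oom_adj value to oom_score_adj.
# Assumptions: integer division truncating toward zero, and the maximum
# oom_adj (15) mapped directly to the maximum oom_score_adj (1000).
# oom_adj_to_score_adj is an illustrative name.
oom_adj_to_score_adj() {
    local adj="$1"
    if [ "$adj" -eq 15 ]; then
        echo 1000
    else
        # Scale by 1000/17, so that -17 (disable) becomes -1000
        echo $(( adj * 1000 / 17 ))
    fi
}
```

Under these assumptions, an oom_adj of -17 corresponds to an oom_score_adj of -1000, and 0 stays 0.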

That is enough about these three files for now; they are used again below.
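
The arithmetic in the oom_score_adj description can be reproduced in a few lines of shell. This only sketches what the man page says (a proportion of allowed memory on a 0 to 1000 scale, plus oom_score_adj); it is not the kernel's actual code. The function name is hypothetical, and clamping negative results to 0 is an assumption.

```shell
#!/bin/sh
# Sketch: estimate a badness score from the man page description.
# used and allowed must be in the same unit (for example kB).
# estimate_badness is an illustrative name; the real kernel code differs.
estimate_badness() {
    local used="$1" allowed="$2" adj="${3:-0}" score
    # Guard against division by zero
    [ "$allowed" -gt 0 ] || return 1
    # Proportion of allowed memory on a 0..1000 scale, plus the adjustment
    score=$(( used * 1000 / allowed + adj ))
    # Assumption: a negative result is reported as 0
    if [ "$score" -lt 0 ]; then score=0; fi
    echo "$score"
}
```

For instance, a task using half of its allowed memory with no adjustment gets 500, and the same task with an adjustment of -500 ends up at 0.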

 

3. How to tell whether a process was killed by the OOM Killer

The simplest way is to check the kernel log with dmesg. On Red Hat family systems:

dmesg | egrep -i 'killed process'

 

For example, the system might print (this is from my local test):

host kernel: Out of Memory: Killed process 13482 (mysql).

 

Or search the log files directly:

egrep -i 'killed process' /var/log/messages*
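
To pull just the PID and the command name out of such log lines, a small sed filter is enough. This is a sketch that assumes the line contains "Killed process <pid> (<name>)" as in the sample output above; other kernel versions may word the message differently. parse_oom_kills is an illustrative name.

```shell
#!/bin/sh
# Sketch: read kernel log lines on stdin and print "<pid> <name>" for each
# OOM kill. Assumes the "Killed process <pid> (<name>)" wording shown in the
# sample output; parse_oom_kills is an illustrative helper name.
parse_oom_kills() {
    sed -n -E 's/.*[Kk]illed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}
```

It can be fed from either source, for example dmesg | parse_oom_kills.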

 

4. How to protect important processes from being killed by the OOM Killer

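The OOM Killer normally takes the oom_score_adj value (described above) into account when computing the final oom_score that decides which process is killed. So let us look up the range the kernel defines for this value before changing it. I am looking at version 4.13.16 here.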

The source file is oom_kill.c https://elixir.bootlin.com/linux/v4.13.16/source/mm/oom_kill.c, which includes the header

#include <linux/oom.h>

and this oom.h in turn includes

uapi/linux/oom.h

Looking at that header shows the range of values the kernel defines: https://elixir.bootlin.com/linux/v4.13.16/source/include/uapi/linux/oom.h

#ifndef _UAPI__INCLUDE_LINUX_OOM_H
#define _UAPI__INCLUDE_LINUX_OOM_H

/*
 * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
 * pid.
 */
#define OOM_SCORE_ADJ_MIN   (-1000)
#define OOM_SCORE_ADJ_MAX   1000

/*
 * /proc/<pid>/oom_adj set to -17 protects from the oom killer for legacy
 * purposes.
 */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

#endif /* _UAPI__INCLUDE_LINUX_OOM_H */

 

This means we can set the oom_score_adj of the process we want to protect to a suitably low negative value, or set its oom_adj to -17. Both files were described above.

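echo -200 | sudo tee /proc/{pid}/oom_score_adj  (this only lowers the score: if the process still has the highest score of all processes when memory runs out, the OOM Killer kills it anyway)
or
echo -17 | sudo tee /proc/{pid}/oom_adj  (the process will not be killed by the OOM Killer)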
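
Note that a redirection after sudo echo is performed by the calling, unprivileged shell, so the write has to happen inside a privileged process. The helper below is a minimal sketch that also checks the value against the kernel's range before writing. The name set_oom_score_adj and the PROC_ROOT override are made up for illustration; PROC_ROOT exists only so the helper can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch: validate and write an oom_score_adj value for one process.
# set_oom_score_adj and PROC_ROOT are illustrative names, not standard ones.
set_oom_score_adj() {
    local pid="$1" value="$2"
    local file="${PROC_ROOT:-/proc}/$pid/oom_score_adj"
    # Reject empty input and anything containing characters other than digits and '-'
    case "$value" in
        ''|*[!0-9-]*) echo "oom_score_adj must be an integer" >&2; return 1 ;;
    esac
    # Reject anything outside OOM_SCORE_ADJ_MIN..OOM_SCORE_ADJ_MAX
    if [ "$value" -lt -1000 ] || [ "$value" -gt 1000 ]; then
        echo "oom_score_adj must be between -1000 and 1000" >&2
        return 1
    fi
    # Lowering the value requires CAP_SYS_RESOURCE, so run this as root
    printf '%s\n' "$value" > "$file"
}
```

Run from a root shell, set_oom_score_adj {pid} -200 has the same effect as the first command above.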

 

5. How to view the badness score of all running processes

Here I borrow a script written by Raunak Ramakrishnan:

#!/bin/bash
# Displays running processes in descending order of OOM score
printf 'PID\tOOM Score\tOOM Adj\tCommand\n'
while read -r pid comm; do
    # Skip processes that have already exited or whose score is 0
    score=$(cat "/proc/$pid/oom_score" 2>/dev/null) || continue
    [ "$score" != 0 ] || continue
    printf '%d\t%d\t\t%d\t%s\n' "$pid" "$score" "$(cat "/proc/$pid/oom_score_adj" 2>/dev/null)" "$comm"
done < <(ps -e -o pid= -o comm=) | sort -k 2nr

 

6. How to force the OOM Killer to run

The official kernel documentation has an article:

https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html

It explains in detail the operations available through /proc/sysrq-trigger and what each one does.
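
In that document, the SysRq command f is the one that calls the OOM Killer to kill a memory hog process. A minimal sketch of using it follows. trigger_oom_killer is an illustrative name, and the optional path argument is there only so the helper can be dry-run against a scratch file. On the real trigger file this really does kill a process, so try it only on a test machine.

```shell
#!/bin/sh
# Sketch: ask the kernel to run the OOM Killer once via the SysRq 'f' command.
# trigger_oom_killer is an illustrative helper name. The optional argument
# (default /proc/sysrq-trigger) allows a dry run against a scratch file.
# WARNING: against the real trigger file this kills a process.
trigger_oom_killer() {
    local trigger="${1:-/proc/sysrq-trigger}"
    # Writing to the real trigger file requires root
    printf 'f' > "$trigger"
}
```

The result can then be confirmed with the dmesg check from section 3.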

 


