Background
Runtime environment: Kubernetes + Docker; the application is a Java program.
Problem Description
- First, the alerts from the Kubernetes event center are shown below. They are routine cluster alert events, and from these routine alerts alone it is impossible to tell what the actual fault is.
Initially we suspected a problem with the Docker service, so we logged on to the node and checked the docker & kubelet logs, as follows.
The kubelet log shows that kubelet could not create new OS threads and that the process limit of the user it runs as needs to be raised; in short, ulimit -u needs to be adjusted (the detailed analysis follows later, here we only describe the problem).
<root@PROD-BE-K8S-WN8 ~># journalctl -u "kubelet" --no-pager --follow-- Logs begin at Wed 2019-12-25 11:30:13 CST. --Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: encoding/json.(*decodeState).unmarshal(0xc000204580, 0xcafe00, 0xc00048f440, 0xc0002045a8, 0x0)Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/encoding/json/decode.go:180 +0x1eaDec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: encoding/json.Unmarshal(0xc00025e000, 0x9d38, 0xfe00, 0xcafe00, 0xc00048f440, 0x0, 0x0)Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/encoding/json/decode.go:107 +0x112Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: github.com/go-openapi/spec.Swagger20Schema(0xc000439680, 0x0, 0x0)Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/go-openapi/spec/spec.go:82 +0xb8Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: github.com/go-openapi/spec.MustLoadSwagger20Schema(...)Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/go-openapi/spec/spec.go:66Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: github.com/go-openapi/spec.init.4()Dec 22 14:21:51 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/go-openapi/spec/spec.go:38 +0x57Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime: failed to create new OS thread (have 15 already; errno=11)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime: may need to increase max user processes (ulimit -u)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: fatal error: newosprocDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: goroutine 1 [running, locked to thread]:Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.throw(0xcbf07e, 0x9)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc00099fe20 sp=0xc00099fdf0 pc=0x4376d2Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.newosproc(0xc000600800)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/os_linux.go:161 +0x1c5 fp=0xc00099fe80 sp=0xc00099fe20 pc=0x433be5Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.newm1(0xc000600800)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/proc.go:1843 +0xdd fp=0xc00099fec0 sp=0xc00099fe80 pc=0x43dcbdDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.newm(0xcf1010, 0x0, 0xffffffffffffffff)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/proc.go:1822 +0x9b fp=0xc00099fef8 sp=0xc00099fec0 pc=0x43db3bDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.startTemplateThread()Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/proc.go:1863 +0xb2 fp=0xc00099ff28 sp=0xc00099fef8 pc=0x43ddb2Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.LockOSThread()Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/proc.go:3845 +0x6b fp=0xc00099ff48 sp=0xc00099ff28 pc=0x44300bDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: main.init.0()Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/plugins/cilium-cni/cilium-cni.go:66 +0x30 fp=0xc00099ff58 sp=0xc00099ff48 pc=0xb2fa50Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.doInit(0x11c73a0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/proc.go:5652 +0x8a fp=0xc00099ff88 sp=0xc00099ff58 pc=0x44720aDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.main()Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/proc.go:191 +0x1c5 fp=0xc00099ffe0 sp=0xc00099ff88 
pc=0x439e85Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: runtime.goexit()Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc00099ffe8 sp=0xc00099ffe0 pc=0x46fc81Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: goroutine 11 [chan receive]:Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: k8s.io/klog/v2.(*loggingT).flushDaemon(0x121fc40)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/k8s.io/klog/v2/klog.go:1131 +0x8bDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: created by k8s.io/klog/v2.init.0Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/k8s.io/klog/v2/klog.go:416 +0xd8Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: goroutine 12 [select]:Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: io.(*pipe).Read(0xc000422780, 0xc00034b000, 0x1000, 0x1000, 0xba4480, 0x1, 0xc00034b000)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/io/pipe.go:57 +0xe7Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: io.(*PipeReader).Read(0xc00000e380, 0xc00034b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/io/pipe.go:134 +0x4cDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: bufio.(*Scanner).Scan(0xc00052ef38, 0x0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/bufio/scan.go:214 +0xa9Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: github.com/sirupsen/logrus.(*Entry).writerScanner(0xc00016e1c0, 0xc00000e380, 0xc000516300)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/writer.go:59 +0xb4Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: created by github.com/sirupsen/logrus.(*Entry).WriterLevelDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/writer.go:51 +0x1b7Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: goroutine 13 [select]:Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: io.(*pipe).Read(0xc0004227e0, 0xc000180000, 0x1000, 0x1000, 0xba4480, 0x1, 0xc000180000)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/io/pipe.go:57 +0xe7Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: io.(*PipeReader).Read(0xc00000e390, 0xc000180000, 0x1000, 0x1000, 0x0, 0x0, 0x0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/io/pipe.go:134 +0x4cDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: bufio.(*Scanner).Scan(0xc00020af38, 0x0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/bufio/scan.go:214 +0xa9Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: github.com/sirupsen/logrus.(*Entry).writerScanner(0xc00016e1c0, 0xc00000e390, 0xc000516320)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/writer.go:59 +0xb4Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: created by github.com/sirupsen/logrus.(*Entry).WriterLevelDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/writer.go:51 +0x1b7Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: goroutine 14 [select]:Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: io.(*pipe).Read(0xc000422840, 0xc0004c2000, 0x1000, 0x1000, 0xba4480, 0x1, 0xc0004c2000)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/io/pipe.go:57 +0xe7Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: io.(*PipeReader).Read(0xc00000e3a0, 0xc0004c2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: 
/usr/local/go/src/io/pipe.go:134 +0x4cDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: bufio.(*Scanner).Scan(0xc00052af38, 0x0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/bufio/scan.go:214 +0xa9Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: github.com/sirupsen/logrus.(*Entry).writerScanner(0xc00016e1c0, 0xc00000e3a0, 0xc000516340)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/writer.go:59 +0xb4Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: created by github.com/sirupsen/logrus.(*Entry).WriterLevelDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/writer.go:51 +0x1b7Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: goroutine 15 [select]:Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: io.(*pipe).Read(0xc0004228a0, 0xc000532000, 0x1000, 0x1000, 0xba4480, 0x1, 0xc000532000)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/io/pipe.go:57 +0xe7Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: io.(*PipeReader).Read(0xc00000e3b0, 0xc000532000, 0x1000, 0x1000, 0x0, 0x0, 0x0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/io/pipe.go:134 +0x4cDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: bufio.(*Scanner).Scan(0xc00052ff38, 0x0)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /usr/local/go/src/bufio/scan.go:214 +0xa9Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: github.com/sirupsen/logrus.(*Entry).writerScanner(0xc00016e1c0, 0xc00000e3b0, 0xc000516360)Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/writer.go:59 +0xb4Dec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: created by github.com/sirupsen/logrus.(*Entry).WriterLevelDec 22 14:22:06 PROD-BE-K8S-WN8 kubelet[3124]: /go/src/github.com/cilium/cilium/vendor/github.com/sirupsen/logrus/writer.go:51 +0x1b7
- So we checked the system logs, as follows. (We originally wanted to trace the system logs around that time, but the system was responding extremely slowly, even though the system load was low and CPU & memory utilization were not high either; why the system was so slow is analyzed later as "Problem 1".)
When running a command to inspect the system, the shell reported that it could not create a process:
<root@PROD-BE-K8S-WN8 ~># dmesg -TL
-bash: fork: retry: No child processes
[Fri Sep 17 18:25:53 2021] Linux version 5.11.1-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2), GNU ld version 2.32-16.el7) #1 SMP Mon Feb 22 17:30:33 EST 2021
[Fri Sep 17 18:25:53 2021] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.1-1.el7.elrepo.x86_64 root=UUID=8770013a-4455-4a77-b023-04d04fa388c8 ro crashkernel=auto spectre_v2=retpoline net.ifnames=0 console=tty0 console=ttyS0,115200n8 noibrs
[Fri Sep 17 18:25:53 2021] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[Fri Sep 17 18:25:53 2021] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[Fri Sep 17 18:25:53 2021] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[Fri Sep 17 18:25:53 2021] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[Fri Sep 17 18:25:53 2021] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[Fri Sep 17 18:25:53 2021] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[Fri Sep 17 18:25:53 2021] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
- We tried to start a new container on this node, as follows.
Thread creation failed due to insufficient resources:
<root@PROD-BE-K8S-WN8 ~># docker run -it --rm tomcat bash
runtime/cgo: runtime/cgo: pthread_create failed: Resource temporarily unavailable
pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f34d16023d7 m=3 sigcode=18446744073709551610
goroutine 0 [idle]:
runtime: unknown pc 0x7f34d16023d7
stack: frame={sp:0x7f34cebb8988, fp:0x0} stack=[0x7f34ce3b92a8,0x7f34cebb8ea8)
00007f34cebb8888: 000055f2b345a7bf <runtime.(*mheap).scavengeLocked+559> 00007f34cebb88c0
00007f34cebb8898: 000055f2b3450e0e <runtime.(*mTreap).end+78> 0000000000000000
Fault Analysis
From the initial analysis of the symptoms above, the first reaction was that ulimit -u was too small and had been hit (i.e., the upper bound of that parameter had been reached), so we checked ulimit -u for each user; it is officially described as max user processes. We initially assumed its upper bound should stay below user.max_pid_namespaces, a value assigned by the kernel at initialization; this assumption is revisited below.
Monitoring Information
- Check the user's max processes limit, as follows:
<root@PROD-BE-K8S-WN8 ~># ulimit -u
249047
<root@PROD-BE-K8S-WN8 ~># ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 249047
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 249047
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
- Because ulimit is configured per user, we also verified each user's limits configuration, as follows.
Judging from the configuration below, the configured limits had not been exceeded (a way to double-check the limits actually in effect for the already-running daemons is sketched after the config).
# Default /etc/security/limits.conf configuration
# End of file
root soft nofile 65535
root hard nofile 65535
* soft nofile 65535
* hard nofile 65535

# limits.d/20-nproc.conf configuration, as follows
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.
* soft nproc 65536
root soft nproc unlimited
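Because limits.conf only applies to new sessions, the limits actually in effect for long-running daemons can differ from the file above. The following is a minimal sketch of our own (not taken from the original session); the service names kubelet and dockerd are assumptions about what runs on the node.

```bash
# Check the limits actually applied to the running daemons, not just the
# values configured in limits.conf (which only affect new sessions).
for svc in kubelet dockerd; do           # service names assumed for this node
    pid=$(pgrep -o -x "$svc")            # oldest PID with an exact name match
    if [ -n "$pid" ]; then
        echo "== $svc (pid $pid) =="
        grep -E 'Max processes|Max open files' "/proc/${pid}/limits"
    fi
done
```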
- Check the processes running on the node
From the monitoring data, at most 457 processes were running during the incident.
- Check the state of the processes on the system, as follows
Although there was one process in Z (zombie) state, it did not affect the running system.
- Check the number of threads created on the system, as follows
According to the monitoring chart below, the maximum thread count at the time was 32616 (a minimal way to sample this number directly on the node is sketched below).
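For reference, the same figure can be sampled directly on the node; this is an illustrative sketch of our own rather than output captured during the incident.

```bash
# Count all threads (lightweight processes) currently present on the system.
ps -eLf --no-headers | wc -l

# Equivalent view from procfs: the 4th field of /proc/loadavg is
# "runnable/total" kernel scheduling entities (i.e. tasks, threads included).
awk '{split($4, a, "/"); print "runnable:", a[1], "total:", a[2]}' /proc/loadavg
```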
Analysis
- From the monitoring data above, the number of threads running during the failure window was fairly high, around 31616, but that value did not exceed the current user's ulimit -u, so this lead was tentatively ruled out.
- Based on the error thrown by the system, we googled "fork: Resource temporarily unavailable":
https://github.com/awslabs/amazon-eks-ami/issues/239
One reply in that thread gives the following hint:
One possible cause is running out of process IDs. Check you don't have 40.000 defunct processes or similar on nodes with problems
- Following that lead, we went through the Linux kernel documentation, searched for PID-related fields, and found the following PID-related parameters.
- kernel.core_uses_pid = 1
Official documentation:
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#core-uses-pid
This parameter only controls the naming of coredump files: when enabled, the generated file is named "core.PID". It can therefore be ruled out as a cause.
- kernel.ns_last_pid = 23068
Official documentation:
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#ns-last-pid
This parameter records the last PID allocated in the current (PID) namespace; when the kernel forks the next task, it allocates that task's PID starting from this value.
- kernel.pid_max = 32768
Official documentation:
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#pid-max
This is the maximum PID value the kernel is allowed to allocate on this system. When a fork would go past this value, the kernel wraps around to the kernel-defined minimum PID; in other words, no PID greater than this setting can be handed out. This boundary is global, i.e. a system-wide limit.
With this description, the problem is roughly located: on Linux the creation of both threads and processes is bound by this parameter, because threads and processes use essentially the same task structure and each needs its own PID to identify it.
- user.max_pid_namespaces = 253093
Official documentation:
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/user.html#max-pid-namespaces
According to the documentation, this parameter is the maximum number of PID namespaces that may be created in the current user namespace, not the maximum number of processes a user may run. (The per-user process limit reported by ulimit -u is instead initialized by the kernel, roughly as init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2.)
- kernel.cad_pid = 1
Official documentation:
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#cad-pid
This parameter holds the PID of the process that receives the signal generated by Ctrl+Alt+Del (the reboot key sequence); it is not relevant here and needs no further attention.
- Check the kernel parameter kernel.pid_max, as follows.
How the initial value of this parameter is calculated is analyzed further below; a quick sanity check of the current task count against this limit follows the output.
<root@PROD-BE-K8S-WN8 ~># sysctl -a | grep pid_max
kernel.pid_max = 32768
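A simple way to see how close a node is to this boundary is to compare the number of existing tasks (processes plus threads) with kernel.pid_max. This is an illustrative sketch of our own; the 90% warning threshold is an arbitrary example value.

```bash
# Compare the number of existing kernel tasks (processes + threads)
# against kernel.pid_max; warn when we are within 10% of the limit.
pid_max=$(cat /proc/sys/kernel/pid_max)
tasks=$(awk '{split($4, a, "/"); print a[2]}' /proc/loadavg)
echo "tasks=${tasks} pid_max=${pid_max}"
if [ "$tasks" -ge $((pid_max * 90 / 100)) ]; then
    echo "WARNING: task count is within 10% of kernel.pid_max"
fi
```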
- Back on the system, we needed to identify which application had created so many threads; a command sketch follows (installing a monitoring system that records this kind of data over time is also recommended).
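In the absence of a monitoring system, something like the following can point at the culprit; this is a generic sketch, not the exact command used during the incident.

```bash
# Top 20 processes by thread count (nlwp = number of lightweight processes),
# useful to spot which workload is creating the threads.
ps -eo nlwp,pid,user,comm --sort=-nlwp | head -n 20
```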
- Tutorials found online usually just blindly bump the corresponding kernel parameters. In my view such operations are purely reactive and do not cure the root cause; the problem should be fixed at the source, ideally handed back to the developers so that the application creates an appropriate number of threads at initialization.
How to tune the kernel parameters properly is analyzed below.
Parameter Analysis
Detailed descriptions of the relevant kernel parameters follow: how to adjust them, how they relate to each other, how they are calculated, and what their boundaries are.
kernel.pid_max
The concept was covered above: roughly, it is the largest PID the system may allocate. Strictly speaking it is the maximum task identifier; since every process (and thread) needs one, it effectively also bounds the number of tasks.
Without further ado, let's look at the source code; corrections are welcome if anything below is mistaken.
- Source location:
int pid_max = PID_MAX_DEFAULT;

#define RESERVED_PIDS 300

int pid_max_min = RESERVED_PIDS + 1;
int pid_max_max = PID_MAX_LIMIT;
The code above shows that pid_max defaults to PID_MAX_DEFAULT, while the smallest allocatable PID is RESERVED_PIDS + 1, i.e. 301; values up to 300 are reserved by the kernel (presumably for special use).
So how is PID_MAX_DEFAULT calculated, how is it defined at initialization, and what are its default, maximum, and LIMIT values?
The definition of PID_MAX_DEFAULT is in the following file:
linux/threads.h at v5.11-rc1 · torvalds/linux · GitHub
/*
 * This controls the default maximum pid allocated to a process
 */
#define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)

/*
 * A maximum of 4 million PIDs should be enough for a while.
 * [NOTE: PID/TIDs are limited to 2^30 ~= 1 billion, see FUTEX_TID_MASK.]
 */
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
	(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

In short: if CONFIG_BASE_SMALL is set when the kernel is built, PID_MAX_DEFAULT is 0x1000 (4096); otherwise it is 0x8000 (32768). For PID_MAX_LIMIT, with CONFIG_BASE_SMALL set the limit is PAGE_SIZE * 8 = 32768; otherwise, if long is wider than 4 bytes (i.e. a 64-bit kernel) the limit is 4 * 1024 * 1024 = 4,194,304 (about 4 million); failing that it falls back to PID_MAX_DEFAULT. man proc states the same thing explicitly: on 64-bit systems pid_max can be raised to any value up to 2^22, i.e. 4,194,304, while on 32-bit platforms 32768 is the maximum.
To check the value of CONFIG_BASE_SMALL:
<root@HK-K8S-WN1 ~># cat /boot/config-5.11.1-1.el7.elrepo.x86_64 | grep CONFIG_BASE_SMALL
CONFIG_BASE_SMALL=0
0 means it is not set.
kernel.threads-max
Documentation for /proc/sys/kernel/ — The Linux Kernel documentation
- Parameter explanation:
Roughly, this parameter is the maximum number of threads that fork() may create system-wide. It is set at kernel initialization such that thread structures can only consume a fraction of the available memory pages, about 1/8 (note: available memory pages); if a configured value would exceed that fraction, threads-max is reduced accordingly.
At kernel initialization the minimum is MIN_THREADS = 20 and the upper bound MAX_THREADS is constrained by FUTEX_TID_MASK; the actual kernel.threads-max value is computed from the machine's physical memory, as in the following code.
linux/fork.c at v5.16-rc1 · torvalds/linux · GitHub
/*
 * set_max_threads
 */
static void set_max_threads(unsigned int max_threads_suggested)
{
	u64 threads;
	unsigned long nr_pages = totalram_pages();

	/*
	 * The number of threads shall be limited such that the thread
	 * structures may only consume a small part of the available memory.
	 */
	if (fls64(nr_pages) + fls64(PAGE_SIZE) > 64)
		threads = MAX_THREADS;
	else
		threads = div64_u64((u64) nr_pages * (u64) PAGE_SIZE,
				    (u64) THREAD_SIZE * 8UL);

	if (threads > max_threads_suggested)
		threads = max_threads_suggested;

	max_threads = clamp_t(u64, threads, MIN_THREADS, MAX_THREADS);
}
kernel.threads-max generally does not need to be changed manually, since the kernel has already computed it from the available memory; modifying it is not recommended.
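To see how the computed value relates to the machine's memory, the following rough estimate mirrors set_max_threads() in shell. It is only an approximation under stated assumptions: PAGE_SIZE = 4096 and THREAD_SIZE = 16384 as on a typical x86_64 build, and MemTotal is used in place of totalram_pages(), so the result will not match the real value exactly.

```bash
# Rough shell re-computation of set_max_threads() for x86_64.
# Assumptions: PAGE_SIZE=4096, THREAD_SIZE=16384, MemTotal ~ totalram_pages().
page_size=4096
thread_size=16384
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
nr_pages=$(( mem_total_kb * 1024 / page_size ))
estimate=$(( nr_pages * page_size / (thread_size * 8) ))
echo "estimated threads-max ~= ${estimate}"
echo "actual    threads-max  = $(cat /proc/sys/kernel/threads-max)"
```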
So kernel.threads-max is bounded by the constant FUTEX_TID_MASK; its concrete value is shown below (0x3fffffff = 2^30 - 1, roughly 1.07 billion).
linux/futex.h at v5.16-rc1 · torvalds/linux · GitHub
#define FUTEX_TID_MASK 0x3fffffff
vm.max_map_count
Documentation for /proc/sys/vm/ — The Linux Kernel documentation
- Parameter explanation
Roughly, this is the maximum number of memory map areas a process may have. Most applications need fewer than a thousand maps, but certain programs, particularly malloc debuggers, may consume far more, up to one or two maps per allocation. The default value is 65530.
Setting this value limits the number of VMAs (virtual memory areas) a process may use. A VMA is a contiguous region of virtual address space; VMAs are created over the lifetime of a process whenever the program maps a file into memory, links to a shared memory segment, or allocates heap space. Tuning this value down limits how many VMAs a process can own. Hitting the limit can make an application fail: when a process reaches the VMA cap but can only free a small amount of memory for other kernel processes, the operating system throws an out-of-memory error. If your system uses only a small amount of memory in the NORMAL zone, lowering this value can help free memory for the kernel. That is roughly what this parameter does.
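As an illustration (our own sketch, with placeholder values), the consumption of a single process can be compared against the limit by counting its map entries in procfs; the PID and the 262144 figure below are arbitrary examples.

```bash
# Compare one process's map usage with vm.max_map_count.
PID=12345                                   # placeholder PID
echo "vm.max_map_count = $(cat /proc/sys/vm/max_map_count)"
echo "maps in use      = $(wc -l < /proc/${PID}/maps)"

# If it really needs raising (example value only):
# sysctl -w vm.max_map_count=262144
# echo 'vm.max_map_count = 262144' > /etc/sysctl.d/99-maxmap.conf && sysctl --system
```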
To summarize, it is appropriate to increase it in situations such as:
- Stress testing, to probe the maximum number of threads the application can create
- Highly concurrent applications where a single process has very high concurrency
References
http://www.kegel.com/c10k.html#limits.threads
https://listman.redhat.com/archives/phil-list/2003-August/msg00005.html
https://bugzilla.redhat.com/show_bug.cgi?id=1459891
Configuration Recommendations
Parameter scope
| Parameter | Scope |
| --- | --- |
| kernel.pid_max | system-wide limit |
| kernel.threads-max | system-wide limit |
| vm.max_map_count | per-process limit |
| /etc/security/limits.conf | per-user limit |
Summary and Recommendations
- kernel.pid_max bounds the total number of processes and threads the whole system can create; neither threads nor processes may hit this value. Two kinds of errors show up: as the limit is approached, creation fails with "Resource temporarily unavailable", and once it is exceeded, "No more processes..." is thrown. Adjust it according to the actual workload and platform, for example a Kubernetes node that may run hundreds of container instances, or highly concurrent, heavily multi-threaded applications.
- kernel.threads-max caps the number of threads created by all users on the system combined, so it is larger than the ulimit -u values configured for individual users; it is worth working out the ulimit -u values precisely. (Manually changing this parameter is not recommended: it is computed by the kernel at initialization from the available memory, and enlarging it may lead to memory exhaustion. The default value is normally never hit.)
- vm.max_map_count limits the number of VMA regions a single process may be allocated. Under stress testing two kinds of failures can be thrown (errno 11 = EAGAIN, "no more threads allowed"; errno 12 = ENOMEM, out of memory). Do not set it too large either, as that increases memory overhead and slows reclaim; tune it according to actual stress-test results (typically, how many threads can be created before saturation).
- limits.conf works at the user level. When setting it, take the global parameters above into account: a user's totals (whether nproc or nofile) must not exceed the corresponding kernel.pid_max & kernel.threads-max & fs.file-max. A consolidated tuning sketch follows this list.
- Linux usually does not impose a per-CPU cap on the number of threads created; that would be overly complex, presumably because the memory cost is hard to compute precisely.
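For completeness, the following sketch shows where these knobs live and how they could be persisted together. The numbers are purely illustrative assumptions, not values recommended by this incident analysis; size them against the node's memory and workload, and remember that kernel.threads-max is best left at its kernel-computed default.

```bash
# Illustrative only: persist the system-wide knobs discussed above.
cat > /etc/sysctl.d/99-task-limits.conf <<'EOF'
kernel.pid_max = 131072
vm.max_map_count = 262144
EOF
sysctl --system

# Per-user limits: keep them below the global kernel limits above.
cat > /etc/security/limits.d/30-nproc-example.conf <<'EOF'
*       soft    nproc   65536
*       hard    nproc   65536
EOF
```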