Linux Performance Optimization: Understanding CPU Usage and Load Average

Reposted article, 2019-04-02 14:08

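1 Understanding the CPU

The CPU (Central Processing Unit) is the computing and control core of a computer system and the unit that ultimately processes information and executes programs. It is effectively the system's "brain".

When the CPU is too busy, it behaves like a human brain handling too many things at once: efficiency drops, and in severe cases the system can crash or go down. Understanding how the CPU works and keeping resource usage under control is therefore an important part of keeping a system running stably.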

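1.1 Multiple CPUs and multi-core CPUs

With multiple physical CPUs, the CPUs communicate with each other over a bus, which is relatively slow:

(Figure: motherboard for dual-socket Xeon Scalable processors, Supermicro X11DAi-N)

In a multi-core CPU, the cores communicate through on-chip shared cache (L2 on some designs, L3 on others), while memory and peripherals communicate with the CPU over the bus:

(Figure: motherboard for a single-socket Xeon-W processor, Supermicro X11SRA)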

2 Querying CPU information

[root@localhost ~]# cat /proc/cpuinfo | grep 'physical id' | sort | uniq | wc -l    # number of physical CPUs (sockets)
2
[root@localhost ~]# cat /proc/cpuinfo | grep 'cpu cores' | sort | uniq    # physical cores per physical CPU
cpu cores    : 3
[root@localhost ~]# cat /proc/cpuinfo | grep 'siblings' | sort | uniq    # logical CPUs per physical CPU
siblings    : 3
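
The three queries above can be combined into one pass over the file. The following is a minimal sketch, assuming the usual x86 /proc/cpuinfo layout with `processor`, `physical id`, `cpu cores`, and `siblings` fields. The function name `cpu_topology` and the sample data are hypothetical and exist only for illustration.

```shell
# cpu_topology [FILE]: summarize CPU topology from a cpuinfo-style file.
# Defaults to /proc/cpuinfo. (Hypothetical helper, not a standard tool.)
cpu_topology() {
    awk -F':' '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim whitespace around the value
        }
        key == "processor"   { logical++ }          # one entry per logical CPU
        key == "physical id" { sockets[val] = 1 }   # distinct socket IDs
        key == "cpu cores"   { cores = val }        # physical cores per socket
        key == "siblings"    { siblings = val }     # logical CPUs per socket
        END {
            n = 0
            for (s in sockets) n++
            printf "sockets=%d cores_per_socket=%d siblings_per_socket=%d logical_cpus=%d\n",
                   n, cores, siblings, logical
        }' "${1:-/proc/cpuinfo}"
}

# Demo on made-up data for a two-socket machine, so the result does not
# depend on the host this runs on.
sample=$(mktemp)
cat > "$sample" <<'EOF'
processor   : 0
physical id : 0
siblings    : 2
cpu cores   : 2
processor   : 1
physical id : 0
siblings    : 2
cpu cores   : 2
processor   : 2
physical id : 1
siblings    : 2
cpu cores   : 2
processor   : 3
physical id : 1
siblings    : 2
cpu cores   : 2
EOF
cpu_topology "$sample"
rm -f "$sample"
```

On a real Linux host, calling `cpu_topology` with no argument reads /proc/cpuinfo. When `siblings` is larger than `cpu cores`, hardware multithreading is enabled.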

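3 What load average means

When a system becomes slow, we usually run top or uptime to check its load average.
Correct definition: the average number of processes, per unit of time, that are in the runnable state (R, Running/Runnable) or the uninterruptible sleep state (D, Disk Sleep).
Wrong definition: the CPU usage per unit of time.
Runnable processes: processes that are using the CPU or waiting for it, that is, processes whose STAT column in ps aux shows R.
Uninterruptible processes: processes in a critical kernel-mode code path that cannot be interrupted, for example while waiting for a hardware device to respond to I/O. ps shows them in state D.
Ideal state: every CPU has exactly one active process, so the load average equals the number of CPUs.
Rule of thumb for overload: the load average exceeds 70% of the number of CPUs.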

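Suppose a single-CPU system shows load averages of 1.73 0.60 7.98. The system was overloaded by 73% over the past minute and by 698% over the past 15 minutes, while the 5-minute value of 0.60 is below the CPU count and indicates no overload.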
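
The arithmetic behind these percentages is (load / CPU count - 1) × 100. A minimal sketch follows; the helper names `load_overload_pct` and `load_is_high` are hypothetical, and the 70% threshold is the rule of thumb quoted above rather than a fixed standard.

```shell
# load_overload_pct LOAD NCPU
# Prints by how much LOAD exceeds NCPU, as a percentage of NCPU.
# A negative result means spare capacity.
load_overload_pct() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.0f\n", (load / ncpu - 1) * 100 }'
}

# load_is_high LOAD NCPU
# Succeeds (exit status 0) when LOAD is above 70% of NCPU.
load_is_high() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { exit !(load > 0.7 * ncpu) }'
}

# The three values from the single-CPU example.
for load in 1.73 0.60 7.98; do
    echo "load=$load overload=$(load_overload_pct "$load" 1)%"
done
# prints:
# load=1.73 overload=73%
# load=0.60 overload=-40%
# load=7.98 overload=698%
```

On a live system the same functions could be fed from /proc/loadavg and `nproc`, for example `read -r one five fifteen rest < /proc/loadavg; load_is_high "$one" "$(nproc)"`.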

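Note: an uninterruptible process is in a critical kernel code path that must not be interrupted. For example, if a process were interrupted while writing data to disk, the data on disk could end up inconsistent with the data held by the process.

The uninterruptible state is essentially a mechanism by which the system protects processes and hardware devices.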
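
To see which processes currently count toward the load, the R and D states can be tallied from ps. The sketch below is only an instantaneous snapshot, not the time-averaged value the kernel reports, and `count_load_states` is a hypothetical helper name. It assumes one state code per line on standard input, as produced by `ps -eo stat=`.

```shell
# count_load_states: read process state codes on stdin (one per line) and
# count runnable (R) and uninterruptible (D) processes.
count_load_states() {
    awk '
        $1 ~ /^R/ { r++ }   # running or waiting for a CPU
        $1 ~ /^D/ { d++ }   # uninterruptible sleep, usually waiting on I/O
        END { printf "R=%d D=%d total=%d\n", r, d, r + d }'
}

# Demo on made-up state codes.
printf '%s\n' 'Ss' 'R+' 'D' 'S<' 'Rl' 'I' | count_load_states
# prints: R=2 D=1 total=3

# On a live Linux system:
#   ps -eo stat= | count_load_states
```

Note that the ps process itself is running while it takes the snapshot, so it normally appears in the R count.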

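4 Load average case studies

System environment and tools

Environment and configuration: CentOS 7 64-bit, 4 GB memory, 2 CPUs

Tools: stress and sysstat.
stress: a stress-testing tool for Linux systems.
sysstat: tools for monitoring and analyzing system performance, including mpstat (detailed CPU statistics, per CPU or combined), pidstat (per-process performance analysis), iostat, and others.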

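Install the tools (stress comes from the EPEL repository, so epel-release is installed first): yum install -y epel-release && yum install -y stress sysstat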

Open three terminals on the same Linux system.

Scenario 1: a CPU-intensive process

In the first terminal, run stress to simulate one CPU at 100% usage.

[root@localhost ~]# stress --cpu 1 --timeout 300
stress: info: [9716] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

In the second terminal, run uptime to watch how the load average changes.

# The -d option highlights the parts of the output that change

[root@localhost ~]# watch -d uptime

In the third terminal, run mpstat to watch how CPU usage changes.

# -P ALL monitors all CPUs; 5 prints one set of data every 5 seconds

[root@localhost zhiwenwei]# mpstat -P ALL 5
Linux 3.10.0-957.10.1.el7.x86_64 (localhost.localdomain)     2019-04-01     _x86_64_    (2 CPU)

22:33:45      CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
22:33:50      all   50.95    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   48.75
22:33:50        0    1.80    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00   97.60
22:33:50        1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

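Conclusion: in terminal 2, the 1-minute load average slowly climbs to 1.00. In terminal 3, exactly one CPU is at 100% usage while its %iowait is 0. The rise in load average is therefore caused by the rise in CPU usage.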
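
Reading the per-CPU rows by eye works for two CPUs but gets tedious on larger machines. The sketch below picks the CPU with the lowest %idle from `mpstat -P ALL` output. The helper name `busiest_cpu` is hypothetical, and the code assumes the 11-column layout (CPU through %idle) printed by the sysstat version used in this article. Fields are counted from the end of each line so that the timestamp format does not matter.

```shell
# busiest_cpu: read `mpstat -P ALL` output on stdin and report the CPU with
# the lowest %idle together with its %usr, %sys and %iowait.
busiest_cpu() {
    awk '
        # Per-CPU rows only: numeric CPU id, numeric %idle.
        NF >= 11 && $(NF-10) ~ /^[0-9]+$/ && $NF ~ /^[0-9.]+$/ {
            if (!seen || $NF + 0 < min) {
                seen = 1; min = $NF + 0
                cpu = $(NF-10); usr = $(NF-9); sys = $(NF-7); iow = $(NF-6)
            }
        }
        END {
            if (seen)
                printf "cpu=%s usr=%s sys=%s iowait=%s idle=%.2f\n", cpu, usr, sys, iow, min
        }'
}

# Demo on the rows captured in scenario 1.
busiest_cpu <<'EOF'
22:33:50      all   50.95    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   48.75
22:33:50        0    1.80    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00   97.60
22:33:50        1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
EOF
# prints: cpu=1 usr=100.00 sys=0.00 iowait=0.00 idle=0.00
```

A live invocation could look like `mpstat -P ALL 5 1 | busiest_cpu`.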

To find out which process is driving CPU usage to 100%, use top or pidstat.

[root@localhost zhiwenwei]# pidstat -u 5 1
Linux 3.10.0-957.10.1.el7.x86_64 (localhost.localdomain)     2019-04-01     _x86_64_    (2 CPU)

22:37:09       UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
22:37:14         0      2334    0.00    0.20    0.00    0.00    0.20     0  xfsaild/dm-0
22:37:14         0      4805    1.20    0.20    0.00    0.00    1.40     0  mono
22:37:14         0      4808    0.00    0.20    0.00    0.00    0.20     0  rsyslogd
22:37:14         0      9783    0.00    0.20    0.00    0.00    0.20     0  watch
22:37:14         0     10078  100.20    0.00    0.00    0.00  100.20     1  stress
22:37:14         0     10087    0.20    0.20    0.00    0.00    0.40     0  pidstat

Average:       UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:         0      2334    0.00    0.20    0.00    0.00    0.20     -  xfsaild/dm-0
Average:         0      4805    1.20    0.20    0.00    0.00    1.40     -  mono
Average:         0      4808    0.00    0.20    0.00    0.00    0.20     -  rsyslogd
Average:         0      9783    0.00    0.20    0.00    0.00    0.20     -  watch
Average:         0     10078  100.20    0.00    0.00    0.00  100.20     -  stress
Average:         0     10087    0.20    0.20    0.00    0.00    0.40     -  pidstat
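
In the pidstat output above, the stress process (PID 10078) stands out at 100.20% CPU. Picking the top consumer can also be scripted. This is a minimal sketch that assumes the column layout shown here (ending in %wait, %CPU, CPU, Command) and a command name without spaces. `top_cpu_process` is a hypothetical helper name.

```shell
# top_cpu_process: read `pidstat -u` output on stdin and print the process
# with the highest %CPU value. Fields are counted from the end of the line.
top_cpu_process() {
    awk '
        # Data rows only: numeric PID and numeric %CPU.
        NF >= 9 && $(NF-7) ~ /^[0-9]+$/ && $(NF-2) ~ /^[0-9.]+$/ {
            if ($(NF-2) + 0 > max) {
                max = $(NF-2) + 0; pid = $(NF-7); cmd = $NF
            }
        }
        END { if (pid != "") printf "pid=%s cmd=%s cpu=%.2f\n", pid, cmd, max }'
}

# Demo on three of the rows captured in scenario 1.
top_cpu_process <<'EOF'
22:37:14         0      4805    1.20    0.20    0.00    0.00    1.40     0  mono
22:37:14         0     10078  100.20    0.00    0.00    0.00  100.20     1  stress
22:37:14         0     10087    0.20    0.20    0.00    0.00    0.40     0  pidstat
EOF
# prints: pid=10078 cmd=stress cpu=100.20
```

A live invocation could look like `pidstat -u 5 1 | top_cpu_process`.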

Scenario 2: an I/O-intensive process

In the first terminal, run stress to simulate I/O pressure.

[root@localhost ~]# stress -i 1 --timeout 600

In the second terminal, run uptime to watch how the load average changes.

[root@localhost ~]# watch -d uptime

In the third terminal, run mpstat to watch how CPU usage changes.

[zhiwenwei@localhost tmp]$ mpstat -P ALL 5 1
Linux 3.10.0-957.10.1.el7.x86_64 (localhost.localdomain)     2019-04-02     _x86_64_    (2 CPU)

13:44:56      CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:45:01      all    1.41    0.00   47.43    0.00    0.00    0.00    0.00    0.00    0.00   51.16
13:45:01        0    1.61    0.00   89.52    0.00    0.00    0.00    0.00    0.00    0.00    8.87
13:45:01        1    1.21    0.00    5.43    0.00    0.00    0.00    0.00    0.00    0.00   93.36

Average:      CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:      all    1.41    0.00   47.43    0.00    0.00    0.00    0.00    0.00    0.00   51.16
Average:        0    1.61    0.00   89.52    0.00    0.00    0.00    0.00    0.00    0.00    8.87
Average:        1    1.21    0.00    5.43    0.00    0.00    0.00    0.00    0.00    0.00   93.36

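Conclusion: the 1-minute load average slowly climbs to about 1. In the mpstat output, CPU 0 spends 89.52% of its time in %sys, the average across both CPUs is 47.43% %sys with 51.16% %idle, and %iowait stays at 0.00. In this run the rise in load average therefore comes from kernel-mode (system) CPU time, because stress -i spawns workers that spin on sync(), and not from user-mode computation. When more dirty data has to be flushed to disk, the same kind of load may show up as %iowait instead.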
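
Whether a load increase is dominated by user time, system time, or I/O wait can be read off the "all" row of mpstat. The following is a minimal sketch with a hypothetical helper name, `dominant_cpu_state`. It assumes the 11-column mpstat layout used in this article and only compares the %usr, %sys, and %iowait columns.

```shell
# dominant_cpu_state: read mpstat output on stdin and print which of
# %usr, %sys, %iowait is largest on the first "all" row, with its value.
dominant_cpu_state() {
    awk '
        NF >= 11 && $(NF-10) == "all" && !shown {
            usr = $(NF-9) + 0; sys = $(NF-7) + 0; iow = $(NF-6) + 0
            state = "usr"; top = usr
            if (sys > top) { state = "sys";    top = sys }
            if (iow > top) { state = "iowait"; top = iow }
            printf "%s %.2f\n", state, top
            shown = 1   # ignore later "all" rows such as the Average line
        }'
}

# Demo on the "all" rows from scenario 1 and scenario 2.
echo '22:33:50      all   50.95    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   48.75' | dominant_cpu_state
# prints: usr 50.95
echo '13:45:01      all    1.41    0.00   47.43    0.00    0.00    0.00    0.00    0.00    0.00   51.16' | dominant_cpu_state
# prints: sys 47.43
```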
Find the process responsible:

[zhiwenwei@localhost tmp]$ pidstat -u 5 1
Linux 3.10.0-957.10.1.el7.x86_64 (localhost.localdomain)     2019-04-02     _x86_64_    (2 CPU)

13:50:46       UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:50:51         0      4273    0.00    8.58    0.00    1.60    8.58     0  kworker/u4:0
13:50:51         0      4805    1.40    0.00    0.00    0.00    1.40     1  mono
13:50:51         0      4815    1.00   62.08    0.00    3.59   63.07     0  stress
13:50:51         0      4816    0.00   14.77    0.00    1.60   14.77     1  kworker/u4:1
13:50:51      1007      4819    0.00    0.20    0.00    0.00    0.20     1  pidstat

Average:       UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:         0      4273    0.00    8.58    0.00    1.60    8.58     -  kworker/u4:0
Average:         0      4805    1.40    0.00    0.00    0.00    1.40     -  mono
Average:         0      4815    1.00   62.08    0.00    3.59   63.07     -  stress
Average:         0      4816    0.00   14.77    0.00    1.60   14.77     -  kworker/u4:1
Average:      1007      4819    0.00    0.20    0.00    0.00    0.20     -  pidstat

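The stress process turns out to be responsible: it spends 62.08% in %system, and the kworker kernel threads account for most of the remaining system time.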

Scenario 3: a large number of processes

In the first terminal, use stress to simulate 10 CPU-bound processes.

[root@localhost ~]# stress -c 10 --timeout 600

In the second terminal, run uptime to watch how the load average changes.

[root@localhost ~]# watch -d uptime

In the third terminal, run pidstat to inspect the processes.

[root@localhost ~]# pidstat -u 5 1
Linux 3.10.0-957.10.1.el7.x86_64 (localhost.localdomain)     2019-04-02     _x86_64_    (2 CPU)

13:55:59       UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:56:04         0      4805    1.38    0.20    0.00    0.00    1.58     0  mono
13:56:04         0      4828   19.53    0.00    0.00   79.49   19.53     0  stress
13:56:04         0      4829   19.72    0.00    0.00   79.49   19.72     0  stress
13:56:04         0      4830   19.72    0.00    0.00   79.68   19.72     1  stress
13:56:04         0      4831   19.72    0.00    0.00   79.68   19.72     0  stress
13:56:04         0      4832   19.53    0.00    0.00   79.09   19.53     0  stress
13:56:04         0      4833   19.72    0.00    0.00   79.29   19.72     1  stress
13:56:04         0      4834   19.53    0.00    0.00   78.90   19.53     1  stress
13:56:04         0      4835   19.72    0.00    0.00   80.08   19.72     1  stress
13:56:04         0      4836   19.53    0.00    0.00   79.09   19.53     0  stress
13:56:04         0      4837   19.72    0.00    0.00   79.29   19.72     1  stress
13:56:04         0      4848    0.00    0.20    0.00    0.39    0.20     0  pidstat

Average:       UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:         0      4805    1.38    0.20    0.00    0.00    1.58     -  mono
Average:         0      4828   19.53    0.00    0.00   79.49   19.53     -  stress
Average:         0      4829   19.72    0.00    0.00   79.49   19.72     -  stress
Average:         0      4830   19.72    0.00    0.00   79.68   19.72     -  stress
Average:         0      4831   19.72    0.00    0.00   79.68   19.72     -  stress
Average:         0      4832   19.53    0.00    0.00   79.09   19.53     -  stress
Average:         0      4833   19.72    0.00    0.00   79.29   19.72     -  stress
Average:         0      4834   19.53    0.00    0.00   78.90   19.53     -  stress
Average:         0      4835   19.72    0.00    0.00   80.08   19.72     -  stress
Average:         0      4836   19.53    0.00    0.00   79.09   19.53     -  stress
Average:         0      4837   19.72    0.00    0.00   79.29   19.72     -  stress
Average:         0      4848    0.00    0.20    0.00    0.39    0.20     -  pidstat
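
The values near 20% %CPU and 80% %wait in the output above match a simple back-of-the-envelope estimate: 10 CPU-bound processes sharing 2 CPUs evenly get 2/10 of a CPU each. The sketch below works through that arithmetic. It is a simplified model that assumes perfectly even sharing, not a description of how the kernel scheduler actually decides, and `expected_share` is a hypothetical helper name.

```shell
# expected_share NPROC NCPU
# Ideal per-process %CPU when NPROC CPU-bound processes share NCPU CPUs
# evenly, and the remaining share of time spent waiting for a CPU.
expected_share() {
    awk -v np="$1" -v nc="$2" 'BEGIN {
        share = (np > nc) ? nc / np * 100 : 100   # a process cannot exceed one full CPU
        printf "cpu=%.1f%% waiting=%.1f%%\n", share, 100 - share
    }'
}

# Scenario 3: 10 stress workers on a 2-CPU machine.
expected_share 10 2
# prints: cpu=20.0% waiting=80.0%
```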

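Summary:
A high load average may be caused by CPU-intensive processes.
A high load average does not necessarily mean high CPU usage; it may also mean that I/O is busier.
A high load average may also mean that more processes are competing for the CPU than there are CPUs: in scenario 3, each stress process gets about 20% of a CPU and spends about 80% of its time waiting (%wait).
When the load is high, tools such as mpstat and pidstat help track down the root cause.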

References

https://time.geekbang.org/column/article/69618
