Diagnosing Memory Errors on Linux


Some concepts first

DRAM (Dynamic Random Access Memory) is the most common type of system memory. ECC stands for "Error Checking and Correcting"; ECC memory is a memory module that implements this error checking and correcting technology. EDAC stands for Error Detection And Correction.

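Memory errors come in two types, CE and UE. CE stands for Correctable Error and UE for Uncorrectable Error. A CE is a recoverable error that does not immediately affect normal operation, so the module can be replaced at a convenient maintenance window. A UE is an unrecoverable memory error and usually brings the machine down.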

System messages log

[root@my-host mg4a]# grep kernel /var/log/messages
Jan 14 19:01:11 my-host kernel: mce: [Hardware Error]: Machine check events logged
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:0 ha:1 channel_mask:2 rank:0)
[root@my-host mg4a]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:1
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch5_ce_count:0
[root@my-host mg4a]# dmidecode -t 1
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0044, DMI type 1, 27 bytes
System Information
Manufacturer: LENOVO
Product Name: Lenovo System x3750 M4 -[8753IH5]-
Version: 03
Serial Number: 06FF367
UUID: C4EF8080-7926-11E5-8B14-6C0B849B418E
Wake-up Type: Other
SKU Number: XxXxXxX
Family: System X
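
The EDAC line in the log above packs several fields into one message. As a minimal sketch, the following parser pulls out the controller, error type, DIMM label, channel, slot, page and offset. The regular expression is modelled only on the single line shown here and is an assumption; other EDAC drivers and kernel versions word the message differently. The names `EDAC_LINE`, `parse_edac_line` and `SAMPLE` are made up for illustration.

```python
import re

# Pattern modelled on the single sb_edac-style line in the log above.
# Other EDAC drivers and kernel versions word this message differently.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+)"
)


def parse_edac_line(line):
    """Return the fields of one EDAC message as a dict, or None if it does not match."""
    m = EDAC_LINE.search(line)
    if m is None:
        return None
    return {
        "mc": int(m.group("mc")),
        "count": int(m.group("count")),
        "kind": m.group("kind"),          # "CE" or "UE"
        "label": m.group("label"),        # DIMM label reported by the driver
        "channel": int(m.group("channel")),
        "slot": int(m.group("slot")),
        "page": int(m.group("page"), 16),
        "offset": int(m.group("offset"), 16),
    }


# The EDAC line copied from the messages log above.
SAMPLE = (
    "Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on "
    "CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 "
    "offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 "
    "socket:0 ha:1 channel_mask:2 rank:0)"
)

if __name__ == "__main__":
    print(parse_edac_line(SAMPLE))
```

On this machine the parsed values (mc 0, channel 5, slot 0) line up with the one non-zero counter in the listing, mc0/csrow0/ch5_ce_count.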

The messages log from another machine

Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080a13
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008de3b1960
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 19:09:23 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 23:59:21 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 02:15:55 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d9ea5960
Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008d9ea5960
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 03:08:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8ded39960
Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008ded39960
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:45:13 irora30 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Jun 28 04:44:25 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 09:34:22 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 10:02:30 irora30 ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=df -hl /var|awk 'NR>1 && int($5) > 80' removes=None creates=None chdir=None
Jun 28 14:23:49 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 19:09:25 irora30 auditd[5571]: Audit daemon rotating log files
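
In the log above, each ERROR_ADDRESS equals the reported page shifted left by 12 bits plus the offset, for example 0x8de3b1 and 0x960 give 0x8de3b1960. The sketch below checks that arithmetic and counts CE events per (mc, row, channel). It assumes 4 KiB pages and the amd64_edac message wording shown here; `CE_LINE`, `physical_address` and `summarize` are illustrative names, not part of any tool.

```python
import re
from collections import Counter

# Matches the "EDAC MCn: CE page ..." lines from the amd64_edac log above.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), .*?"
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def physical_address(page, offset):
    """Combine an EDAC page number and in-page offset into one address."""
    return (page << PAGE_SHIFT) | offset


def summarize(lines):
    """Return (Counter keyed by (mc, row, channel), list of error addresses)."""
    counts = Counter()
    addresses = []
    for line in lines:
        m = CE_LINE.search(line)
        if m is None:
            continue  # auditd, rhsmd and other unrelated lines are skipped
        key = (int(m.group("mc")), int(m.group("row")), int(m.group("channel")))
        counts[key] += 1
        addresses.append(
            physical_address(int(m.group("page"), 16), int(m.group("offset"), 16))
        )
    return counts, addresses


# Three CE lines copied from the log above, plus one unrelated line.
LOG = [
    'Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac',
    "Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files",
    'Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac',
    'Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac',
]

if __name__ == "__main__":
    counts, addresses = summarize(LOG)
    print(dict(counts), [hex(a) for a in addresses])
```

All three events land on the same controller, row and channel (MC2, row 5, channel 0), which points at a single DIMM rather than scattered faults.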

Confirming the fault and locating the faulty memory slot

[root@irora30 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow5/ch0_ce_count:0
[root@irora30 ~]#

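  • count: a line with a non-zero value means memory errors have been recorded.
  • mc: the memory controller, which here corresponds to a CPU (socket or node).
  • csrow: the chip-select row. In the edac-util listings later in this article it matches the DIMM index within a channel (csrow1 pairs with DIMM#1).
  • ch*: the memory channel.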
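
With many controllers the counter listing gets long, so it helps to print only the non-zero entries. The sketch below parses text in the `path:count` form produced by the grep command above; `COUNT_LINE` and `nonzero_ce_counts` are illustrative names, and the sample text is a subset of the listing above.

```python
import re

# Extracts mc, csrow, channel and count from one line of the grep output.
COUNT_LINE = re.compile(r"/mc(\d+)/csrow(\d+)/ch(\d+)_ce_count:(\d+)$")


def nonzero_ce_counts(grep_output):
    """Return a list of (mc, csrow, channel, count) tuples with count > 0."""
    results = []
    for line in grep_output.splitlines():
        m = COUNT_LINE.search(line.strip())
        if m is None:
            continue  # blank lines and shell prompts are ignored
        mc, csrow, channel, count = (int(g) for g in m.groups())
        if count > 0:
            results.append((mc, csrow, channel, count))
    return results


# A subset of the listing above.
SAMPLE_OUTPUT = """\
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
"""

if __name__ == "__main__":
    for mc, csrow, channel, count in nonzero_ce_counts(SAMPLE_OUTPUT):
        print(f"mc{mc} csrow{csrow} ch{channel}: {count} corrected errors")
```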

Installed memory

Memory Component    Status
Proc 1 DIMM 1A      16384 MB 1333 MHz
Proc 1 DIMM 2I      Not installed
Proc 1 DIMM 3E      Not installed
Proc 1 DIMM 4C      Not installed
Proc 1 DIMM 5K      Not installed
Proc 1 DIMM 6G      Not installed
Proc 1 DIMM 7B      16384 MB 1333 MHz
Proc 1 DIMM 8J      Not installed
Proc 1 DIMM 9F      Not installed
Proc 1 DIMM 10D     Not installed
Proc 1 DIMM 11L     Not installed
Proc 1 DIMM 12H     Not installed
Proc 2 DIMM 1A      16384 MB 1333 MHz
Proc 2 DIMM 2I      Not installed
Proc 2 DIMM 3E      Not installed
Proc 2 DIMM 4C      Not installed
Proc 2 DIMM 5K      Not installed
Proc 2 DIMM 6G      Not installed
Proc 2 DIMM 7B      16384 MB 1333 MHz
Proc 2 DIMM 8J      Not installed
Proc 2 DIMM 9F      Not installed
Proc 2 DIMM 10D     Not installed
Proc 2 DIMM 11L     Not installed
Proc 2 DIMM 12H     Not installed
Proc 3 DIMM 1A      16384 MB 1333 MHz
Proc 3 DIMM 2I      Not installed
Proc 3 DIMM 3E      Not installed
Proc 3 DIMM 4C      Not installed
Proc 3 DIMM 5K      Not installed
Proc 3 DIMM 6G      Not installed
Proc 3 DIMM 7B      16384 MB 1333 MHz
Proc 3 DIMM 8J      Not installed
Proc 3 DIMM 9F      Not installed
Proc 3 DIMM 10D     Not installed
Proc 3 DIMM 11L     Not installed
Proc 3 DIMM 12H     Not installed
Proc 4 DIMM 1A      16384 MB 1333 MHz
Proc 4 DIMM 2I      Not installed
Proc 4 DIMM 3E      Not installed
Proc 4 DIMM 4C      Not installed
Proc 4 DIMM 5K      Not installed
Proc 4 DIMM 6G      Not installed
Proc 4 DIMM 7B      16384 MB 1333 MHz
Proc 4 DIMM 8J      Not installed
Proc 4 DIMM 9F      Not installed
Proc 4 DIMM 10D     Not installed
Proc 4 DIMM 11L     Not installed
Proc 4 DIMM 12H     Not installed
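
An inventory like the one above can be summarised mechanically. This is a small sketch, assuming rows of the form "Proc N DIMM slot size MB speed MHz" as in the table; `ROW`, `installed_dimms` and `total_mb` are illustrative names and the sample is only a few of the rows.

```python
import re

# Matches populated rows only; "Not installed" rows have no "<n> MB" field.
ROW = re.compile(r"Proc (\d+) DIMM (\w+)\s+(\d+) MB (\d+) MHz")


def installed_dimms(table_text):
    """Return (processor, slot, size_mb, speed_mhz) for each populated slot."""
    return [
        (int(proc), slot, int(size_mb), int(speed))
        for proc, slot, size_mb, speed in ROW.findall(table_text)
    ]


def total_mb(dimms):
    """Sum the sizes of the populated DIMMs in MB."""
    return sum(size_mb for _, _, size_mb, _ in dimms)


# A few rows taken from the table above.
SAMPLE_TABLE = """\
Proc 1 DIMM 1A     16384 MB 1333 MHz
Proc 1 DIMM 2I     Not installed Not installed
Proc 1 DIMM 7B     16384 MB 1333 MHz
Proc 2 DIMM 1A     16384 MB 1333 MHz
"""

if __name__ == "__main__":
    dimms = installed_dimms(SAMPLE_TABLE)
    print(len(dimms), "DIMMs,", total_mb(dimms), "MB")
```

Run over the full table, the same function would find the eight populated slots (1A and 7B on each of the four processors).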

Using the EDAC tools to detect server memory faults

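With the spread of virtualization and in-memory stores such as Redis and BDB, more and more servers are fitted with large amounts of memory. Take the DELL R620: with two CPUs installed, its 24 memory slots support up to 960GB of memory. For ECC and registered (REG) memory with error correction, detecting faults is a headache: a failing module may keep running for months or even years, but with bad luck it can go down at any moment. Fortunately Linux provides edac-utils, a diagnostic tool for memory error correction, which can be used to check a server for latent memory faults.
The following uses CentOS as an example to show how to use edac-utils.
Before using edac-utils you need to understand the server's hardware architecture. Taking the DELL R620 as an example (other models such as the HP DL360P G8 and IBM X3650 M4 also use E5-2600 series CPUs and the C600 series chipset, so they are broadly similar), the mapping between CPU memory controllers, channels and memory slots is as follows.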

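Processor 0 (one memory controller)
Channel 0: memory slots A1, A5 and A9
Channel 1: memory slots A2, A6 and A10
Channel 2: memory slots A3, A7 and A11
Channel 3: memory slots A4, A8 and A12

Processor 1 (one memory controller)
Channel 0: memory slots B1, B5 and B9
Channel 1: memory slots B2, B6 and B10
Channel 2: memory slots B3, B7 and B11
Channel 3: memory slots B4, B8 and B12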
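
The slot numbering above follows a regular pattern: within a processor, the slot number is the DIMM position on the channel times four, plus the channel number, plus one. A minimal sketch of that rule follows; it applies only to the R620-style layout listed here (other boards label slots differently), and `slot_label` is an illustrative name.

```python
def slot_label(processor, channel, dimm, channels_per_processor=4):
    """Map (processor, channel, DIMM position on the channel) to a slot name.

    Follows the layout listed above: processor 0 uses letter A, processor 1
    uses letter B, and each channel holds slots n, n+4 and n+8.
    """
    letter = chr(ord("A") + processor)
    number = dimm * channels_per_processor + channel + 1
    return f"{letter}{number}"


if __name__ == "__main__":
    # Print the full mapping for two processors, four channels, three DIMMs each.
    for processor in range(2):
        for channel in range(4):
            slots = [slot_label(processor, channel, dimm) for dimm in range(3)]
            print(f"processor {processor} channel {channel}: {', '.join(slots)}")
```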

1. Install the edac-utils tool

yum install -y libsysfs edac-utils

2. Run the check command; the correction report looks like this

edac-util -v
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#2: B9
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#2: B10
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#2: B11
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#2: B12

Where:

mc0 denotes memory controller 0;
CPU_SrcID#0 denotes source CPU 0;
Chan#0 denotes channel 0;
DIMM#0 denotes memory slot 0;
Corrected Errors is the number of errors corrected so far.

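Using the CPU channel and memory slot mapping listed earlier, each line returned by edac-utils can be labelled with its slot.
In the run this was taken from, that gave 6312 corrections on slot A1, 6459 on slot B1 and 535 on slot B3: three DIMMs with latent faults. The next step is to contact the vendor for replacements.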
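
Reading the counts by eye is error-prone on a fully populated server. The sketch below filters `edac-util -v` style text down to the DIMMs with a non-zero corrected count. It assumes the "mcN: csrowN: label: count Corrected Errors" line shape that appears in this article's listings; `DIMM_LINE` and `dimms_with_errors` are illustrative names, and the sample lines are written in that same shape.

```python
import re

# One per-DIMM line: "mc0: csrow2: <label>: 11 Corrected Errors".
# Summary lines such as "mc0: csrow2: 0 Uncorrected Errors" do not match.
DIMM_LINE = re.compile(r"^mc(\d+): csrow(\d+): (\S+): (\d+) Corrected Errors")


def dimms_with_errors(report):
    """Return (mc, csrow, label, corrected_count) for DIMMs with count > 0."""
    found = []
    for line in report.splitlines():
        m = DIMM_LINE.match(line.strip())
        if m is None:
            continue
        count = int(m.group(4))
        if count > 0:
            found.append((int(m.group(1)), int(m.group(2)), m.group(3), count))
    return found


# Sample text in the line shape used by the listings in this article.
SAMPLE_REPORT = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors
"""

if __name__ == "__main__":
    for mc, csrow, label, count in dimms_with_errors(SAMPLE_REPORT):
        print(f"mc{mc} csrow{csrow} {label}: {count} corrected errors")
```

The label returned for each hit still has to be translated to a physical slot using the mapping for the specific server model.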

Mapping for 12 DIMMs

mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6

Mapping for 20 DIMMs

mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

4x16 mapping
mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6G
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h

References:
https://www.cnblogs.com/luckyall/p/11225772.html
http://www.voidcn.com/article/p-gvfvakvy-btw.html

