Using the EDAC tools to detect server memory faults


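With the spread of virtualization and in-memory data stores such as Redis and BDB, more and more servers are configured with large amounts of memory. Take the DELL R620: with two CPUs installed it has 24 memory slots and supports up to 960GB of memory. Diagnosing faults in error-correcting memory (ECC, REG) is a real headache: a module that has started to fail can keep running for months or even years, yet with bad luck it can take the machine down at any moment. Fortunately, Linux provides edac-utils, a memory error-correction diagnostic tool that can be used to check a server's memory for latent faults.
The following uses CentOS as an example to introduce the edac-utils tool.
Before using edac-utils, you need to understand the server's hardware architecture. Taking the DELL R620 as an example (other models such as the HP DL360P G8 and IBM X3650 M4 also use E5-2600 series CPUs and the C600 series chipset, so they are broadly similar), the relationship between each CPU's memory controller, its channels, and the memory slots is as follows.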

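Processor 0 (corresponds to one memory controller)
Channel 0: memory slots A1, A5 and A9
Channel 1: memory slots A2, A6 and A10
Channel 2: memory slots A3, A7 and A11
Channel 3: memory slots A4, A8 and A12

Processor 1 (corresponds to one memory controller)
Channel 0: memory slots B1, B5 and B9
Channel 1: memory slots B2, B6 and B10
Channel 2: memory slots B3, B7 and B11
Channel 3: memory slots B4, B8 and B12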
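
The slot layout above follows a regular pattern: within each processor, slot numbers advance by one per channel and by four per DIMM position on a channel. A minimal Python sketch of that rule follows. The function name `slot_label` and its `channels` parameter are illustrative assumptions, not part of edac-utils, and the rule is only inferred from the table above.

```python
def slot_label(cpu: int, channel: int, dimm: int, channels: int = 4) -> str:
    """Return the slot label (e.g. 'A5') for a CPU/channel/DIMM position.

    Assumes the layout in the table above: processor 0 uses the 'A' slots,
    processor 1 uses the 'B' slots, and numbering runs across the channels
    first, then moves on to the next DIMM position.
    """
    if cpu not in (0, 1):
        raise ValueError("only two processors (0 and 1) are modelled here")
    if not 0 <= channel < channels:
        raise ValueError(f"channel must be between 0 and {channels - 1}")
    if dimm < 0:
        raise ValueError("dimm must be non-negative")
    bank = "AB"[cpu]
    return f"{bank}{dimm * channels + channel + 1}"


if __name__ == "__main__":
    # Reprint the table above from the rule.
    for cpu in (0, 1):
        for channel in range(4):
            slots = [slot_label(cpu, channel, dimm) for dimm in range(3)]
            print(f"Processor {cpu}, channel {channel}: {', '.join(slots)}")
```

The `channels` argument is there only because some boards populate three channels per processor instead of four; the labelling on any given machine should be checked against the vendor's documentation.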

1. Install the edac-utils tool

yum install -y libsysfs edac-utils
2. Run the check command to view the error-correction report, as follows

edac-util -v

mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#2: B9
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#2: B10
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#2: B11
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#2: B12

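Here mc0 is memory controller 0, CPU_SrcID#0 is source CPU 0, Chan#0 is channel 0, and DIMM#0 is memory slot 0 on that channel. Corrected Errors is the number of errors that have already been corrected. Using the CPU channel to memory slot mapping listed earlier, each line returned by edac-utils can be labelled with its slot number.
On the server examined here, that gave 6312 corrected errors for slot A1, 6459 for slot B1, and 535 for slot B3: three DIMMs with latent faults. The next step is to contact the supplier to replace them.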
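
The manual labelling step can be automated. The sketch below parses per-DIMM lines in the `CPU_SrcID#..._Chan#..._DIMM#...: N Corrected Errors` form shown in this article and maps each one to a slot, assuming the R620 layout described earlier (four channels per processor, 'A' slots on CPU 0 and 'B' slots on CPU 1). The function names, the `channels` parameter and the `threshold` parameter are hypothetical, and the line format may differ between drivers and edac-utils versions.

```python
import re

# Matches per-DIMM lines such as
# "mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors"
LINE_RE = re.compile(
    r"^mc(?P<mc>\d+): csrow(?P<csrow>\d+): "
    r"CPU_SrcID#(?P<cpu>\d+)_Ha#\d+_Chan#(?P<chan>\d+)_DIMM#(?P<dimm>\d+): "
    r"(?P<count>\d+) Corrected Errors"
)


def corrected_errors(output, channels=4):
    """Map slot label -> corrected-error count for every per-DIMM line."""
    counts = {}
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # summary lines and "Uncorrected Errors" lines are skipped
        cpu, chan, dimm = int(m["cpu"]), int(m["chan"]), int(m["dimm"])
        # Assumed layout: CPU 0 -> 'A', CPU 1 -> 'B'; numbering runs across channels.
        slot = f"{chr(ord('A') + cpu)}{dimm * channels + chan + 1}"
        counts[slot] = int(m["count"])
    return counts


def suspect_slots(output, threshold=0, channels=4):
    """Return only the slots whose corrected-error count exceeds the threshold."""
    return {
        slot: count
        for slot, count in corrected_errors(output, channels).items()
        if count > threshold
    }


SAMPLE = """\
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors
"""

if __name__ == "__main__":
    print(suspect_slots(SAMPLE))  # {'A10': 11}
```

On a live system the text passed in would be the captured output of `edac-util -v`. What count is worth acting on is a judgment call, so `threshold` is left to the operator.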

Mapping for 12 DIMMs
mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6

Mapping for 20 DIMMs
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors 
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors 
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
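
The per-controller totals in a listing like the one above (corrected, uncorrected, and the "no DIMM info" variants) are also exposed by the kernel's EDAC driver through sysfs. The sketch below reads them directly, which may be convenient for periodic monitoring. It assumes the usual `/sys/devices/system/edac/mc/mcN/` layout with `ce_count`, `ue_count`, `ce_noinfo_count` and `ue_noinfo_count` files. The set of files can vary with kernel version and driver, so missing files are skipped. The function and constant names are illustrative.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller tree in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")
COUNTERS = ("ce_count", "ue_count", "ce_noinfo_count", "ue_noinfo_count")


def controller_totals(root=EDAC_ROOT):
    """Return {controller name: {counter name: value}} for each mcN directory."""
    totals = {}
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        counters = {}
        for name in COUNTERS:
            path = mc_dir / name
            if path.is_file():  # skip counters this kernel/driver does not expose
                counters[name] = int(path.read_text().strip())
        totals[mc_dir.name] = counters
    return totals


if __name__ == "__main__":
    for mc, counters in controller_totals().items():
        print(mc, counters)
```

The `root` parameter exists so the function can be pointed at a copy of the tree, for example one collected from another machine.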

4x16 mapping
mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6g
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h

