Locating hard disk faults with storcli64 and smartctl
Section One : Introduction
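storcli is the successor to megacli. On Dell servers the equivalent tool is perccli, and its usage is identical.
smartctl reads the SMART information kept by a disk's controller chip.
lsscsi lists the system's SCSI devices; its data comes from /proc/scsi/scsi and related sources. It is not covered in detail in this document.
All of these are common tools for inspecting disks, and they help when troubleshooting disk state and RAID card problems.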
Section Two : Install package
Install storcli or perccli, then symlink the binary into /usr/bin/ so the command is easy to call:
ln -s /opt/MegaRAID/storcli/storcli64 /usr/bin/
ln -s /opt/MegaRAID/perccli/perccli64 /usr/bin/
Section Three : Step
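To map the system device /dev/sdf to its physical drive bay, proceed as follows:
Run perccli64 /c0/eall/sall show to list the physical drives.
[Figure: output of /c0/eall/sall] The output shows four JBOD disks. From experience, JBOD disks are usually assigned device names before RAID volumes, so the JBOD disks take /dev/sda through /dev/sdd and the RAID volumes start at /dev/sde.
DG stands for drive group and reflects the order in which groups were created when the RAID was configured. In the output, 32:4 and 32:5 belong to the same drive group.
Run perccli64 /c0/vall show to see how DG and VD correspond.
[Figure: output of /c0/vall] DG/VD gives the correspondence between the RAID drive group and the order of the volume as seen by the operating system. If the server has only RAID volumes, VD0 is usually /dev/sda in the operating system, VD1 is /dev/sdb, and so on. If the server also has JBOD disks, the RAID volumes are numbered after the JBOD disks. In this example VD0 = /dev/sde, so /dev/sdf is VD 1, which corresponds to DG 1.
Back in the /c0/eall/sall output, the drive with DG 1 has DID = 6. DID is the device ID, which is needed again later. Its slot number (Slt) is also 6, which is the seventh bay on the server (slots are counted from 0). This is the physical location of /dev/sdf.
Conversely, when a fault LED is lit on a drive in the server, the same steps can be followed in reverse to find the corresponding system device name.
Note:
If the server has no JBOD disks and everything is RAID, the correspondence shown by /c0/vall is enough to establish the mapping.
In practice, perccli64 /c0/e32/s6 start locate and perccli64 /c0/e32/s6 stop locate turn the drive's locate LED on and off, which confirms whether the right drive was identified.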
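The mapping walked through above can be sketched in a few lines of Python. This is only a sketch under the assumptions stated in this section: JBOD disks are enumerated before RAID virtual drives, device names have a single-letter suffix (sda to sdz), and DG n backs VD n as in this article's configuration. The names device_index, vd_for_device, drives_in_dg and PD_ROWS are illustrative; the rows are copied from the drive list shown in Section Four.

```python
# (EID, Slt, DID, DG) rows copied from the drive list in this article.
# DG is None for JBOD disks. Only the first eight drives are listed here.
PD_ROWS = [
    (32, 0, 0, None), (32, 1, 1, None), (32, 2, 2, None), (32, 3, 3, None),
    (32, 4, 4, 0), (32, 5, 5, 0), (32, 6, 6, 1), (32, 7, 7, 2),
]


def device_index(dev):
    """Convert /dev/sda -> 0, /dev/sdb -> 1, ... (single-letter suffix only)."""
    name = dev.rsplit("/", 1)[-1]
    if len(name) != 3 or not name.startswith("sd") or not "a" <= name[2] <= "z":
        raise ValueError("unsupported device name: %s" % dev)
    return ord(name[2]) - ord("a")


def vd_for_device(dev, jbod_count):
    """Return the VD number, assuming JBOD disks are enumerated first."""
    index = device_index(dev)
    if index < jbod_count:
        raise ValueError("%s is expected to be a JBOD disk, not a VD" % dev)
    return index - jbod_count


def drives_in_dg(rows, dg):
    """Return (EID, Slt, DID) for every physical drive in drive group dg."""
    return [(eid, slt, did) for eid, slt, did, row_dg in rows if row_dg == dg]


jbod_count = sum(1 for row in PD_ROWS if row[3] is None)
vd = vd_for_device("/dev/sdf", jbod_count)
# Assumption: DG n backs VD n (DG/VD pairs 0/0, 1/1, ... in this article).
print(vd, drives_in_dg(PD_ROWS, vd))  # prints: 1 [(32, 6, 6)]
```

The result still needs to be confirmed on the real machine, for example with the locate LED command mentioned in the note above, because the enumeration order is an empirical rule and not a guarantee.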
Section Four : storcli/perccli Usage
Viewing controller information
perccli64 show ctrlcount shows how many controllers, that is, how many RAID cards, are present.
perccli64 show displays the RAID card information.
[root@node-15 ~]# perccli64 show
Status Code = 0
Status = Success
Description = None
Number of Controllers = 1
Host Name = node-15.domain.tld
Operating System = Linux3.10.0-327.20.1.es2.el7.x86_64
System Overview :
===============
------------------------------------------------------------------------
Ctl Model        Ports PDs DGs DNOpt VDs VNOpt BBU sPR DS EHS ASOs Hlth
------------------------------------------------------------------------
  0 PERCH730Mini     8  16  11     0  11     0 Opt On  3  N      0 Opt
------------------------------------------------------------------------
Ctl=Controller Index|DGs=Drive groups|VDs=Virtual drives|Fld=Failed
PDs=Physical drives|DNOpt=DG NotOptimal|VNOpt=VD NotOptimal|Opt=Optimal
Msng=Missing|Dgd=Degraded|NdAtn=Need Attention|Unkwn=Unknown
sPR=Scheduled Patrol Read|DS=DimmerSwitch|EHS=Emergency Hot Spare
Y=Yes|N=No|ASOs=Advanced Software Options|BBU=Battery backup unit
Hlth=Health|Safe=Safe-mode boot
There is only one RAID card; controller 0 is addressed as /c0.
storcli64 /c0 show
[root@node-15 ~]# perccli64 /c0 show
Generating detailed summary of the adapter, it may take a while to complete.
Controller = 0
Status = Success
Description = None
Product Name = PERC H730 Mini
Serial Number = 663021Z
SAS Address = 51866da066153000
PCI Address = 00:03:00:00
System Time = 01/10/2019 20:48:38
Mfg. Date = 06/17/16
Controller Time = 01/10/2019 12:44:21
FW Package Build = 25.4.0.0017
BIOS Version = 6.29.00.0_4.16.07.00_0x06120100
FW Version = 4.260.00-6259
Driver Name = megaraid_sas
Driver Version = 06.807.10.00-rh1
Current Personality = RAID-Mode
Vendor Id = 0x1000
Device Id = 0x5D
SubVendor Id = 0x1028
SubDevice Id = 0x1F49
Host Interface = PCI-E
Device Interface = SAS-12G
Bus Number = 3
Device Number = 0
Function Number = 0
Drive Groups = 11
TOPOLOGY :
========
---------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT     Size PDC  PI SED DS3  FSpace TR
---------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  931.0 GB dflt N  N   dflt N      N
 0 0   -   -        -   RAID1 Optl  N  931.0 GB dflt N  N   dflt N      N
 0 0   0   32:4     4   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 0 0   1   32:5     5   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 1 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 1 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 1 0   0   32:6     6   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 2 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 2 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 2 0   0   32:7     7   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 3 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 3 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 3 0   0   32:8     8   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 4 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 4 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 4 0   0   32:9     9   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 5 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 5 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 5 0   0   32:10    10  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 6 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 6 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 6 0   0   32:11    11  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 7 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 7 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 7 0   0   32:12    12  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 8 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 8 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 8 0   0   32:13    13  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
 9 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 9 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
 9 0   0   32:14    14  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
10 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
10 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N
10 0   0   32:15    15  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N
---------------------------------------------------------------------------
DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Dgrd=Degraded
Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
TR=Transport Ready
Virtual Drives = 11
VD LIST :
=======
-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
-------------------------------------------------------------
0/0   RAID1 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
1/1   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
2/2   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
3/3   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
4/4   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
5/5   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
6/6   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
7/7   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
8/8   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
9/9   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
10/10 RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
-------------------------------------------------------------
Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
Physical Drives = 16
PD LIST :
=======
----------------------------------------------------------------------------
EID:Slt DID State DG      Size Intf Med SED PI SeSz Model                Sp
----------------------------------------------------------------------------
32:0      0 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:1      1 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:2      2 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:3      3 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:4      4 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:5      5 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:6      6 Onln  1   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:7      7 Onln  2   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:8      8 Onln  3   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:9      9 Onln  4   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:10    10 Onln  5   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:11    11 Onln  6   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:12    12 Onln  7   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:13    13 Onln  8   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:14    14 Onln  9   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:15    15 Onln 10   931.0 GB SATA HDD N   N  512B ST91000640NS         U
----------------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
BBU_Info :
========
----------------------------------------------
Model State   RetentionTime Temp Mode MfgDate
----------------------------------------------
BBU   Optimal 0 hour(s)     38C  -    0/00/00
----------------------------------------------
Viewing each disk's Device ID, Slot No. and DriveGroup
[root@node-15 ~]# perccli64 /c0/eall/sall show
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive Information :
=================
----------------------------------------------------------------------------
EID:Slt DID State DG      Size Intf Med SED PI SeSz Model                Sp
----------------------------------------------------------------------------
32:0      0 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:1      1 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:2      2 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:3      3 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:4      4 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:5      5 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:6      6 Onln  1   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:7      7 Onln  2   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:8      8 Onln  3   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:9      9 Onln  4   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:10    10 Onln  5   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:11    11 Onln  6   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:12    12 Onln  7   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:13    13 Onln  8   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:14    14 Onln  9   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:15    15 Onln 10   931.0 GB SATA HDD N   N  512B ST91000640NS         U
----------------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
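When many drives are involved it can help to turn this table into records. The following is a rough sketch, not an official parser: it assumes the column order shown above (EID:Slt, DID, State, DG, Size, Intf, Med, SED, PI, SeSz, Model, Sp) and that only the Model column may contain spaces. The function name parse_pd_list and the SAMPLE text are illustrative.

```python
import re

# A drive row starts with "EID:Slt", for example "32:6".
ROW_START = re.compile(r"^\d+:\d+\s")

SAMPLE = """\
EID:Slt DID State DG      Size Intf Med SED PI SeSz Model                Sp
32:0      0 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:6      6 Onln  1   931.0 GB SATA HDD N   N  512B ST91000640NS         U
"""


def parse_pd_list(text):
    """Parse the drive rows of a '/cX/eall/sall show' style table."""
    drives = []
    for line in text.splitlines():
        line = line.strip()
        if not ROW_START.match(line):
            continue  # header, separator or legend line
        tok = line.split()
        if len(tok) < 13:
            continue  # not a complete drive row
        eid, slot = tok[0].split(":")
        drives.append({
            "eid": int(eid),
            "slot": int(slot),
            "did": int(tok[1]),
            "state": tok[2],
            "dg": None if tok[3] == "-" else int(tok[3]),
            "size": tok[4] + " " + tok[5],
            "intf": tok[6],
            "med": tok[7],
            # Everything between SeSz and the final Sp column is the model.
            "model": " ".join(tok[11:-1]),
        })
    return drives


for drive in parse_pd_list(SAMPLE):
    print(drive["slot"], drive["did"], drive["state"], drive["dg"], drive["model"])
```

With the records in hand, finding the drive for a given DG becomes a simple filter on the "dg" key.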
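Note:
From experience, under the default udev rules on CentOS, JBOD disks are enumerated before RAID volumes (if the configuration is changed online, the JBOD disks move to the front after a reboot). lsscsi shows that, under the same RAID controller, JBOD disks have a lower channel value than RAID volumes. In the output below, the second number in the first field is 0 for JBOD and 2 for RAID.
[root@SZVPN-2 udev]# lsscsi
[0:0:24:0] disk IBM-ESXS MBF2600RC SB2C /dev/sda
[0:2:0:0] disk IBM ServeRAID M5110e 3.19 /dev/sdb
[0:2:1:0] disk IBM ServeRAID M5110e 3.19 /dev/sdc
In addition, the scsi_level that udev reports for a JBOD device is higher than the one reported for a RAID volume.
udevadm info -a -p /sys/class/block/sdx | grep scsi_level
In my tests the scsi_level of the JBOD disk was 7 and that of the RAID volume was 6.
The corresponding udev rule is in
/lib/udev/rules.d/60-persistent-storage.rules
scsi_level: ATTRS{scsi_level}=="[6-9]*"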
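The channel comparison described in this note can be sketched as follows. This is a minimal sketch that only splits the [host:channel:target:lun] field and the trailing device node out of lsscsi text; the function name parse_lsscsi is illustrative, and treating channel 2 as "RAID volume" is an assumption taken from the single example above, not a general rule.

```python
SAMPLE = """\
[0:0:24:0] disk IBM-ESXS MBF2600RC SB2C /dev/sda
[0:2:0:0] disk IBM ServeRAID M5110e 3.19 /dev/sdb
[0:2:1:0] disk IBM ServeRAID M5110e 3.19 /dev/sdc
"""


def parse_lsscsi(text):
    """Extract host, channel, target, lun and device node from lsscsi lines."""
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 3 or not tokens[0].startswith("["):
            continue
        try:
            host, channel, target, lun = (
                int(part) for part in tokens[0].strip("[]").split(":")
            )
        except ValueError:
            continue  # address is not four plain integers
        entries.append({
            "host": host,
            "channel": channel,
            "target": target,
            "lun": lun,
            "type": tokens[1],
            "device": tokens[-1],
        })
    return entries


for entry in parse_lsscsi(SAMPLE):
    # Assumption from the example above: channel 2 marks a RAID volume.
    kind = "raid" if entry["channel"] == 2 else "jbod"
    print(entry["device"], entry["channel"], kind)
```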
Viewing information for a specific disk
[root@node-15 ~]# perccli64 /c0/e32/s6 show all
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive /c0/e32/s6 :
================
-------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model        Sp
-------------------------------------------------------------------
32:6      6 Onln   1 931.0 GB SATA HDD N   N  512B ST91000640NS U
-------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
Drive /c0/e32/s6 - Detailed Information :
=======================================
Drive /c0/e32/s6 State :
======================
Shield Counter = 0
Media Error Count = 46431 *** clearly a problem: 46431 media errors have occurred ***
Other Error Count = 0
Drive Temperature = 31C (87.80 F)
Predictive Failure Count = 126 *** 126 predicted failures ***
S.M.A.R.T alert flagged by drive = Yes
Drive /c0/e32/s6 Device attributes :
==================================
SN = 9XGA228L
Manufacturer Id = ATA
Model Number = ST91000640NS
NAND Vendor = NA
WWN = 5000c500918f2f8a
Firmware Revision = AA63
Raw size = 931.512 GB [0x74706db0 Sectors]
Coerced size = 931.0 GB [0x74600000 Sectors]
Non Coerced size = 931.012 GB [0x74606db0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 12.0Gb/s
NCQ setting = N/A
Write Cache = Enabled
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 00
Drive /c0/e32/s6 Policies/Settings :
==================================
Drive position = DriveGroup:1, Span:0, Row:0
Enclosure position = 0
Connected Port Number = 0(path0)
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 95183 *** sequence number of the last predicted-failure event: 95183 ***
Successful diagnostics completion on = N/A
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = Yes
Wide Port Capable = No
Port Information :
================
-----------------------------------------
Port Status Linkspeed SAS address
-----------------------------------------
   0 Active 12.0Gb/s  0x500056b33fefe586
-----------------------------------------
Inquiry Data =
5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00 00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20 58 39 41 47 32 32 4c 38 00 00 00 00 04 00 20 20 20 20 41 41 33 36 54 53 31 39 30 30 36 30 30 34 53 4e 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80 00 40 00 2f 00 40 00 02 00 02 07 00 ff 3f 10 00 3f 00 10 fc fb 00 10 00 ff ff ff 0f 00 00 07 00
Note:
Looking at the details of this single drive reveals media errors, which indicates that the disk has a problem.
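The counters worth watching in the "show all" output can be pulled out with a small helper. This is a sketch that assumes the "Name = value" layout shown above and looks only at the three counter names that appear in this article's output; the function name drive_counters and the SAMPLE text are illustrative.

```python
import re

WANTED = ("Media Error Count", "Other Error Count", "Predictive Failure Count")

SAMPLE = """\
Shield Counter = 0
Media Error Count = 46431
Other Error Count = 0
Drive Temperature = 31C (87.80 F)
Predictive Failure Count = 126
S.M.A.R.T alert flagged by drive = Yes
"""


def drive_counters(text):
    """Return the error counters from 'Name = value' lines as integers."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in WANTED:
            continue
        match = re.match(r"\s*(\d+)", value)  # ignore anything after the number
        if match:
            counters[key] = int(match.group(1))
    return counters


counters = drive_counters(SAMPLE)
suspect = counters.get("Media Error Count", 0) > 0 or \
    counters.get("Predictive Failure Count", 0) > 0
print(counters, "suspect" if suspect else "ok")
```

Treating any non-zero media error or predictive failure count as suspicious is a deliberately strict choice for this sketch; a real check may want thresholds or trend comparison.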
Viewing the correspondence between disks and system devices
[root@node-15 ~]# perccli64 /c0/vall show
Controller = 0
Status = Success
Description = None
Virtual Drives :
==============
-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
-------------------------------------------------------------
0/0   RAID1 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
1/1   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
2/2   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
3/3   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
4/4   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
5/5   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
6/6   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
7/7   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
8/8   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
9/9   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
10/10 RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB
-------------------------------------------------------------
Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency
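Note:
VD: generally taken to be the order of the volume among the system's devices. If there are only RAID volumes, VD=0 is /dev/sda, VD=1 is /dev/sdb, and so on. If there are JBOD disks, they are enumerated first; for example, if the JBOD disks run up to /dev/sdc, VD0 is /dev/sdd, and so on.
DG: the order in which the drive groups were configured on the RAID card.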
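Commands for collecting RAID card logs
storcli64 /c0 show time
Shows the RAID controller's time.
storcli64 /c0 show alilog logfile=node-x.alilog
Collects the alilog, which includes all of the logs.
storcli64 /c0 show all logfile=node-x.all.log
Collects the RAID card information.
storcli64 /c0 show badblocks
Shows bad block information for the disks.
perccli64 /c0 show events filter=fatal
Shows events of severity fatal. This lists every catastrophic event and helps reveal disk faults or RAID card faults.
perccli64 /c0 show cc
Shows the consistency check settings. Multi-disk volumes at RAID1 and above need consistency checks, while a single-disk RAID0 probably does not. Whether the check affects performance is uncertain.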
Section Five : Smartctl Get Error info of Disks
Common Commands Usage Description
--scan
Scan for devices
--scan-open
Scan for devices and try to open each device
-x, --xall
Show all information for device
-a, --all
Show all SMART information for device
-i, --info
Show identity information for device
-d TYPE, --device=TYPE
Specify device type to one of: ata, scsi, nvme[,NSID], sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,p][,x][,N], usbprolific, usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, aacraid,H,L,ID, cciss,N, auto, test
-s VALUE, --smart=VALUE
Enable/disable SMART on device (on/off)
-o VALUE, --offlineauto=VALUE(ATA)
Enable/disable automatic offline testing on device (on/off)
-S VALUE, --saveauto=VALUE(ATA)
Enable/disable Attribute autosave on device (on/off)
-H, --health
Show device SMART health status
-c, --capabilities(ATA,NVMe)
Show device SMART capabilities
-A, --attributes
Show device SMART vendor-specific Attributes and values
-l TYPE, --log=TYPE
Show device log. TYPE: error, selftest, selective, directory[,g|s],
xerror[,N][,error], xselftest[,N][,selftest],
background, sasphy[,reset], sataphy[,reset],
scttemp[sts,hist], scttempint,N[,p],
scterc[,N,M], devstat[,N], ssd,
gplog,N[,RANGE], smartlog,N[,RANGE],
nvmelog,N,SIZE
-t TEST, --test=TEST
Run test. TEST: offline, short, long, conveyance, force, vendor,N,
select,M-N, pending,N, afterselect,[on|off]
-X, --abort
Abort any non-captive test on device
Get info for /dev/sdf
Listing all devices
[root@node-15 ~]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
/dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], SCSI device
/dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], SCSI device
/dev/bus/0 -d megaraid,15 # /dev/bus/0 [megaraid_disk_15], SCSI device
Note:
The earlier sections established that /dev/sdf has DID (device ID) 6 in perccli, which corresponds to /dev/bus/0 -d megaraid,6 here.
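The step from a perccli DID to a smartctl invocation is mechanical, so it can be wrapped in a tiny helper. This is a sketch: smartctl_cmd and run_smartctl are illustrative names, the default device /dev/sdf is simply the disk used in this article, and run_smartctl is defined but not called here because it needs smartctl and root privileges on a real host.

```python
import shlex
import subprocess


def smartctl_cmd(did, dev="/dev/sdf", option="-A"):
    """Build the smartctl argument list for a drive behind a MegaRAID card."""
    return ["smartctl", option, "-d", "megaraid,%d" % did, dev]


def run_smartctl(did, dev="/dev/sdf", option="-A"):
    """Run smartctl and return its text output (needs smartctl and root)."""
    result = subprocess.run(
        smartctl_cmd(did, dev, option),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    # smartctl uses a bit-mask exit status, so a non-zero code is not
    # necessarily a failure to run; the caller decides how to treat it.
    return result.stdout


# Show the command lines used in the rest of this section for DID 6.
for opt in ("-i", "-A", "-H"):
    print(" ".join(shlex.quote(arg) for arg in smartctl_cmd(6, option=opt)))
```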
Viewing disk information
[root@node-15 ~]# smartctl -i -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Constellation.2 (SATA)
Device Model: ST91000640NS
Serial Number: 9XGA228L
LU WWN Device Id: 5 000c50 0918f2f8a
Add. Product Id: DELL(tm)
Firmware Version: AA63
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jan 11 11:28:46 2019 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
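Viewing the disk's attributes
This is generally where the indicators of the disk's overall health are checked.
The fields in the output below have the following meanings:
- ID: the attribute ID, usually a decimal or hexadecimal number between 1 and 255.
- ATTRIBUTE_NAME: the attribute name defined by the disk manufacturer.
- FLAG: the attribute handling flags (can be ignored).
- VALUE: one of the most important columns in the table. It is the normalized value of the attribute, between 1 and 253, where 253 is the best case and 1 the worst. Depending on the attribute and the manufacturer, the initial VALUE may be set to 100 or 200.
- WORST: the lowest VALUE ever recorded.
- THRESH: the lowest value the attribute may reach before it is reported as failed. An attribute whose VALUE is at or below THRESH is failing now; one whose WORST is at or below THRESH failed at some point in the past.
- TYPE: the attribute type (Pre-fail or Old_age). A Pre-fail attribute is a critical one that takes part in the disk's overall SMART health assessment (PASSED/FAILED); if any Pre-fail attribute fails, the disk should be regarded as about to fail. An Old_age attribute is a non-critical one (for example normal wear) whose failure does not by itself mean the disk has failed.
- UPDATED: how often the attribute is updated. Offline means it is updated only when offline tests are run on the disk.
- WHEN_FAILED: set to "FAILING_NOW" if VALUE is less than or equal to THRESH, to "In_the_past" if WORST is less than or equal to THRESH, and to "-" otherwise. With "FAILING_NOW", back up important files as soon as possible, especially when the attribute is of type Pre-fail. "In_the_past" means the attribute failed at some point but was fine when the test ran. "-" means the attribute has never failed.
- RAW_VALUE: the manufacturer-defined raw value, from which VALUE is derived.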
[root@node-15 ~]# smartctl -A -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   081   038   044    Pre-fail  Always   In_the_past 151546765
  3 Spin_Up_Time            0x0103   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0133   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       338813105
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       18784
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       21
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1710
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   053   045    Old_age   Always       -       31 (Min/Max 24/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       852
194 Temperature_Celsius     0x0022   031   047   000    Old_age   Always       -       31 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   117   099   000    Old_age   Always       -       151546765
197 Current_Pending_Sector  0x0012   084   084   000    Old_age   Always       -       688
198 Offline_Uncorrectable   0x0010   084   084   000    Old_age   Offline      -       688
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       8093 (164 214 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1870535293
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1530387871
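The WHEN_FAILED rule from the field list can be checked against this table with a short script. This is a sketch assuming the ten-column layout shown above; parse_attributes and classify are illustrative names, and classify implements only the simple comparison described in this article (VALUE or WORST at or below THRESH), not smartctl's full internal logic.

```python
import re

# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)

SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f 081 038 044 Pre-fail Always In_the_past 151546765
  5 Reallocated_Sector_Ct   0x0133 100 100 036 Pre-fail Always - 0
187 Reported_Uncorrect      0x0032 001 001 000 Old_age  Always - 1710
190 Airflow_Temperature_Cel 0x0022 069 053 045 Old_age  Always - 31 (Min/Max 24/40)
"""


def parse_attributes(text):
    """Parse the attribute rows of 'smartctl -A' output into dictionaries."""
    attrs = []
    for line in text.splitlines():
        match = ATTR_RE.match(line)
        if not match:
            continue  # banner or header line
        attrs.append({
            "id": int(match.group(1)),
            "name": match.group(2),
            "value": int(match.group(4)),
            "worst": int(match.group(5)),
            "thresh": int(match.group(6)),
            "type": match.group(7),
            "when_failed": match.group(9),
            "raw": match.group(10).strip(),
        })
    return attrs


def classify(attr):
    """Apply the WHEN_FAILED rule described in the field list."""
    if attr["value"] <= attr["thresh"]:
        return "FAILING_NOW"
    if attr["worst"] <= attr["thresh"]:
        return "In_the_past"
    return "-"


for attr in parse_attributes(SAMPLE):
    print(attr["id"], attr["name"], attr["type"], classify(attr))
```

For the four sample rows the computed state agrees with the WHEN_FAILED column that smartctl printed, with attribute 1 being the only one flagged.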
Viewing the disk's health check status
Note:
In the result below the overall assessment is PASSED, meaning the disk can still be used. However, one marginal attribute is listed with WORST < THRESH, TYPE Pre-fail and WHEN_FAILED In_the_past, which suggests that the disk is likely to fail soon.
[root@node-15 ~]# smartctl -H -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   081   038   044    Pre-fail  Always   In_the_past 151546765
Viewing the disk's error log
[root@node-15 ~]# smartctl -l error -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 46431 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 46431 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 46d+15:15:32.968 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:29.901 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:26.825 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:23.965 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
Error 46430 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 46d+15:15:29.901 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:26.825 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:23.965 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:18.093 READ VERIFY SECTOR(S) EXT
Error 46429 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 46d+15:15:26.825 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:23.965 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:18.093 READ VERIFY SECTOR(S) EXT
b0 da 00 00 4f c2 00 00 46d+15:15:17.838 SMART RETURN STATUS
Error 46428 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 46d+15:15:23.965 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:18.093 READ VERIFY SECTOR(S) EXT
b0 da 00 00 4f c2 00 00 46d+15:15:17.838 SMART RETURN STATUS
2f 00 01 e0 00 00 40 00 46d+15:15:17.703 READ LOG EXT
Error 46427 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 ff ff ff 4f 00 46d+15:15:20.905 READ VERIFY SECTOR(S) EXT
42 00 00 ff ff ff 4f 00 46d+15:15:18.093 READ VERIFY SECTOR(S) EXT
b0 da 00 00 4f c2 00 00 46d+15:15:17.838 SMART RETURN STATUS
2f 00 01 e0 00 00 40 00 46d+15:15:17.703 READ LOG EXT
42 00 00 ff ff ff 4f 00 46d+15:15:15.276 READ VERIFY SECTOR(S) EXT
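Because the error log is long and repetitive, a summary is often enough. The sketch below assumes the wording shown in the log above ("ATA Error Count:", "Error N occurred at disk power-on lifetime: H hours", "Error: UNC at LBA = 0x...") and simply counts what it finds; summarize_error_log and the shortened SAMPLE text are illustrative.

```python
import re

SAMPLE = """\
ATA Error Count: 46431 (device log contains only the most recent five errors)
Error 46431 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Error 46430 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
"""


def summarize_error_log(text):
    """Summarize 'smartctl -l error' output for an ATA disk."""
    total = re.search(r"ATA Error Count:\s*(\d+)", text)
    logged = re.findall(
        r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours", text
    )
    failures = re.findall(r"Error: (\w+) at LBA = 0x([0-9a-fA-F]+)", text)
    return {
        "total_errors": int(total.group(1)) if total else 0,
        # (error number, power-on hours) for each logged entry
        "logged": [(int(num), int(hours)) for num, hours in logged],
        # (error type, LBA) for each register dump
        "failures": [(kind, int(lba, 16)) for kind, lba in failures],
    }


summary = summarize_error_log(SAMPLE)
print(summary["total_errors"], summary["logged"], summary["failures"])
```

In this article's log all five retained entries are UNC (uncorrectable) errors recorded at the same power-on hour, and the total matches the Media Error Count of 46431 that perccli reported for the drive.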
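Additional notes
- If SMART is not enabled on the disk, it can be turned on with smartctl -s on <device>.
- If smartctl -i returns little information even though SMART support is reported as available and enabled, a self-test may be needed first (-t short or -t long). The test does not destroy data on the disk, but offline tests are generally not suitable for storage systems.
- When collecting data, the -x and -a options give more complete disk information.
- smartmontools can also run as a service (smartd), configured in /etc/smartmontools/smartd.conf. I have not looked into this yet and will update this document when I have.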