Environment: AIX 6.1 + Oracle 11.2.0.4 RAC (2 nodes)
1. Symptoms
Running crsctl to check cluster resource status fails immediately on either node with CRS-4535 and CRS-4000, yet the database remains accessible at the same time. The symptoms in detail:
# Query on node 1
grid@bjdb1:/home/grid>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
# Query on node 2
root@bjdb2:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
Likewise, crs_stat -t fails, with error code CRS-0184:
root@bjdb1:/>crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
Node 2 behaves the same way.
Confirm that the database is still accessible at this point:
# Simulate a client login to the RAC cluster from node 2 via the SCAN IP; the database is reachable
oracle@bjdb2:/home/oracle>sqlplus jingyu/jingyu@192.168.103.31/bjdb
SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>
The relevant entries in /etc/hosts for this RAC environment:
#scan
192.168.103.31 scan-ip
2. Diagnosis
First, check the cluster logs on node 1. The Clusterware (GI) logs live under $GRID_HOME/log/<nodename>; the key background daemons css, crs, and evm write their logs to the cssd, crsd, and evmd subdirectories respectively. On node 1:
# Check the GI alert log. The most recent entries only warn that the storage holding GI is nearly full. It can be cleaned up later and there is still some space left, so this is clearly not the cause of the outage.
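The layout just described can be sketched as a small shell snippet. GRID_HOME is this environment's Grid home; the node name is hard-coded here purely for illustration:

```shell
# Sketch of the 11.2 Clusterware log layout on one node.
# GRID_HOME matches this environment; NODE is hard-coded for illustration.
GRID_HOME=/opt/u01/app/11.2.0/grid
NODE=bjdb1
LOGBASE="$GRID_HOME/log/$NODE"

# The GI alert log sits directly under the node directory:
echo "$LOGBASE/alert${NODE}.log"
# The css, crs, and evm daemons log into their own subdirectories:
for d in cssd crsd evmd; do
    echo "$LOGBASE/$d"
done
```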
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 14:18:26.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:23:31.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:28:36.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:33:41.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:38:46.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
# Since crsctl is unusable, check the crsd log next. There were already errors on the 3rd: the raw device could not be opened, so the OCR could not be initialized. Reading further, access to shared storage was failing at that moment. This points to a storage problem; the on-site staff need to confirm whether any storage work was being done at that time.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log
2016-10-03 18:04:40.248: [ OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.248: [ OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:40.252: [ OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:40.253: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [ CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [ CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:40.253: [ CRSD][1] Done.
Check the logs on node 2:
# The GI alert log on node 2 shows CRS-2409 from ctssd. Per MOS Doc ID 1135337.1, "This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode." One only needs to verify that the clocks on the two nodes match, and in fact they do:
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log
2016-10-10 12:29:22.145:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 12:59:38.799:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 13:34:11.402:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:12:44.168:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:44:04.824:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
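Verifying the clocks amounts to comparing epoch timestamps from both nodes. The helper below is a minimal sketch; in a real check the second timestamp would come from the remote node (e.g. via `ssh bjdb2 date +%s`, assuming ssh trust), or the comparison could be delegated to `cluvfy comp clocksync -n all`:

```shell
# Minimal sketch: absolute clock skew between two epoch timestamps.
# In practice t2 would be fetched from the other node.
clock_skew() {
    t1=$1; t2=$2
    if [ "$t1" -ge "$t2" ]; then
        echo $((t1 - t2))
    else
        echo $((t2 - t1))
    fi
}

# Illustration with two fixed timestamps 3 seconds apart:
clock_skew 1476080000 1476080003   # prints 3
```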
# The crsd log on node 2 shows that, at almost the same time as node 1, access to shared storage failed and the OCR could not be initialized
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2/crsd>tail -f crsd.log
2016-10-03 18:04:31.077: [ OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:31.077: [ OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:31.081: [ OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:31.081: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [ CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [ CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:31.082: [ CRSD][1] Done.
Now log in as the grid user and check the state of the ASM disk groups:
Run sqlplus / as sysasm and query v$asm_diskgroup directly:
select name, state, total_mb, free_mb from v$asm_diskgroup;
The OCR_VOTE1 disk group is not mounted on either ASM instance:
SQL> select instance_name from v$instance;
INSTANCE_NAME
------------------------------------------------
+ASM2
SQL> select name, state, total_mb, free_mb from v$asm_diskgroup;
NAME STATE TOTAL_MB FREE_MB
------------------------------ --------------------------------- ---------- ----------
DATA MOUNTED 737280 88152
FRA_ARCHIVE MOUNTED 10240 9287
OCR_VOTE1 DISMOUNTED 0 0
The other node shows the same.
On node 1, the OCR disk group is likewise dismounted before it is mounted back:
SQL> select name, state from v$asm_diskgroup;
NAME STATE
------------------------------ ---------------------------------
DATA MOUNTED
FRA_ARCHIVE MOUNTED
OCR_VOTE1 DISMOUNTED
Next, verify the core GI background daemons:
# crsd is not running; the query returns nothing
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
Node 2 likewise has no crsd process running.
To summarize: the failure occurred around 18:04 on October 3, when both nodes lost access to shared storage and OCR initialization failed as a result. The crsd process is currently not running on either node.
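The per-daemon checks above can be rolled into one sweep. The daemon binary names below are the standard 11.2 ones (ocssd.bin is cssd's binary); the matching is factored into a function so it can be fed any process listing:

```shell
# Report whether each core GI daemon appears in a process listing.
# The listing is passed in as a string to keep the check easy to exercise.
check_daemons() {
    listing=$1
    for proc in ohasd.bin ocssd.bin crsd.bin evmd.bin; do
        case $listing in
            *"$proc"*) echo "$proc: running" ;;
            *)         echo "$proc: NOT running" ;;
        esac
    done
}

# On a live node:
check_daemons "$(ps -ef)"
```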
3. Resolution
3.1 Manually mount the OCR disk group
SQL> alter diskgroup ocr_vote1 mount;
Diskgroup altered.
SQL> select name, state from v$asm_diskgroup;
NAME STATE
------------------------------ ---------------------------------
DATA MOUNTED
FRA_ARCHIVE MOUNTED
OCR_VOTE1 MOUNTED
3.2 Start CRS on node 1
At this point crsd is still not running:
# Confirm crsd.bin is not running
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
A normal CRS start on node 1 fails:
root@bjdb1:/>crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
A normal CRS stop on node 1 also fails:
root@bjdb1:/>crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
So what next? The decision was to force-stop CRS on node 1 and then start it again, which succeeded.
# Force-stop crs on node 1
root@bjdb1:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bjdb1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.crf' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.evmd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.asm' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'bjdb1'
CRS-2677: Stop of 'ora.evmd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'bjdb1' succeeded
CRS-5017: The resource action "ora.crf stop" encountered the following error:
action for daemon aborted. For details refer to "(:CLSN00108:)" in "/opt/u01/app/11.2.0/grid/log/bjdb1/agent/ohasd/orarootagent_root/orarootagent_root.log".
CRS-2675: Stop of 'ora.crf' on 'bjdb1' failed
CRS-2679: Attempting to clean 'ora.crf' on 'bjdb1'
CRS-2681: Clean of 'ora.crf' on 'bjdb1' succeeded
CRS-2675: Stop of 'ora.asm' on 'bjdb1' failed
CRS-2679: Attempting to clean 'ora.asm' on 'bjdb1'
CRS-2681: Clean of 'ora.asm' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bjdb1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'bjdb1'
CRS-2677: Stop of 'ora.cssd' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'bjdb1'
CRS-2677: Stop of 'ora.gipcd' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'bjdb1'
CRS-2677: Stop of 'ora.gpnpd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'bjdb1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bjdb1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
root@bjdb1:/>
# Check crsctl resource status; as expected, nothing is available yet
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
# The crsd.bin, cssd.bin, and evmd.bin processes are all gone
root@bjdb1:/>ps -ef|grep crsd.bin
root 8126466 25428158 0 15:52:50 pts/0 0:00 grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
root 8126470 25428158 0 15:53:01 pts/0 0:00 grep cssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
root 35520600 25428158 0 15:53:13 pts/0 0:00 grep evmd.bin
# The pmon processes are gone as well
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
root@bjdb1:/>
# Try starting crs again: success!
root@bjdb1:/>crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
# Querying crsctl resources still fails, meaning the stack is not fully up yet
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
# After waiting a while, the core GI background daemons start appearing
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
root@bjdb1:/>ps -ef|grep cssd.bin|grep -v grep
grid 10747994 26542192 0 15:55:03 - 0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep pmon
root 39387390 25428158 0 15:57:23 pts/0 0:00 grep pmon
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
grid 39911466 1 0 15:58:47 - 0:00 asm_pmon_+ASM2
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
root 37814470 25428158 0 15:59:27 pts/0 0:00 grep pmon
grid 39911466 1 0 15:58:47 - 0:00 asm_pmon_+ASM2
root@bjdb1:/>
root@bjdb1:/>ps -ef|grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
grid 10747994 26542192 0 15:55:03 - 0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
grid 40173760 1 0 15:57:10 - 0:00 /opt/u01/app/11.2.0/grid/bin/evmd.bin
root@bjdb1:/>ps -ef|grep crsd.bin
root 37683238 1 0 15:59:54 - 0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
root@bjdb1:/>
# Once the core daemons are up, crsctl works again and the resources are starting
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE bjdb1
ora.LISTENER.lsnr
ONLINE ONLINE bjdb1
ora.OCR_VOTE1.dg
ONLINE ONLINE bjdb1
ora.asm
ONLINE ONLINE bjdb1 Started
ora.gsd
OFFLINE OFFLINE bjdb1
ora.net1.network
ONLINE ONLINE bjdb1
ora.ons
ONLINE ONLINE bjdb1
ora.registry.acfs
ONLINE ONLINE bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE OFFLINE
ora.bjdb.db
1 ONLINE OFFLINE
2 ONLINE OFFLINE
ora.bjdb1.vip
1 ONLINE ONLINE bjdb1
ora.bjdb2.vip
1 ONLINE OFFLINE STARTING
ora.cvu
1 ONLINE ONLINE bjdb1
ora.oc4j
1 ONLINE ONLINE bjdb1
ora.scan1.vip
1 ONLINE OFFLINE STARTING
After waiting a while longer, another query shows all resources on node 1 back to normal.
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE bjdb1
ora.LISTENER.lsnr
ONLINE ONLINE bjdb1
ora.OCR_VOTE1.dg
ONLINE ONLINE bjdb1
ora.asm
ONLINE ONLINE bjdb1 Started
ora.gsd
OFFLINE OFFLINE bjdb1
ora.net1.network
ONLINE ONLINE bjdb1
ora.ons
ONLINE ONLINE bjdb1
ora.registry.acfs
ONLINE ONLINE bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE OFFLINE
ora.bjdb.db
1 ONLINE ONLINE bjdb1 Open
2 ONLINE OFFLINE
ora.bjdb1.vip
1 ONLINE ONLINE bjdb1
ora.bjdb2.vip
1 ONLINE OFFLINE
ora.cvu
1 ONLINE ONLINE bjdb1
ora.oc4j
1 ONLINE ONLINE bjdb1
ora.scan1.vip
1 ONLINE OFFLINE
3.3 Address the high storage usage of the GI Management Repository on node 1
Keep watching the node 1 log:
grid@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 16:03:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:08:30.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:09:50.796:
[ctssd(5046446)]CRS-2407:The new Cluster Time Synchronization Service reference node is host bjdb1.
2016-10-10 16:10:20.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:15:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
This warning was seen earlier: the /opt/u01 directory needs cleaning up. Find old log files that can safely be deleted; once the space is reclaimed, this message stops appearing.
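A cautious way to reclaim that space is to list candidates before deleting anything. The snippet below is a self-contained demo of the find pattern, using a scratch directory to stand in for the real $GRID_HOME/log tree; Clusterware's rotated logs carry .l01, .l02, ... suffixes:

```shell
# Demo: select rotated Clusterware logs older than 30 days.
# A scratch directory stands in for the real $GRID_HOME/log tree.
DEMO=$(mktemp -d)
touch "$DEMO/crsd.log" "$DEMO/crsd.l01"
touch -t 201609010000 "$DEMO/crsd.l01"   # backdate the rotated copy

# List (not delete) rotated logs older than 30 days; review before removing.
old=$(find "$DEMO" -name "*.l[0-9]*" -mtime +30 -type f)
echo "$old"

rm -rf "$DEMO"
```

On the real node the same find would be pointed at $GRID_HOME/log/<node>, with deletion added only after reviewing the list; the CRS-9520 messages above concern the CHM repository under $GRID_HOME/crf, whose size and location can also be managed with the oclumon utility.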
3.4 Mount the OCR disk group and restart CRS on node 2
With node 1 resolved, mount the OCR disk group on node 2 in the same way and restart CRS.
The method is identical, only run on node 2, so the details are not repeated here.
# Force-stop crs on node 2
root@bjdb2:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bjdb2'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.crf' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.ctssd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.evmd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.asm' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'bjdb2'
CRS-2677: Stop of 'ora.crf' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.evmd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'bjdb2' succeeded
CRS-2675: Stop of 'ora.asm' on 'bjdb2' failed
CRS-2679: Attempting to clean 'ora.asm' on 'bjdb2'
CRS-2681: Clean of 'ora.asm' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bjdb2'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'bjdb2'
CRS-2677: Stop of 'ora.cssd' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'bjdb2'
CRS-2677: Stop of 'ora.gipcd' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'bjdb2'
CRS-2677: Stop of 'ora.gpnpd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'bjdb2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bjdb2' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Then start it again:
crsctl start crs
Query again after waiting a while:
# The crsd process is now running
root@bjdb2:/>ps -ef|grep crsd.bin|grep -v grep
root 22610148 1 0 16:24:15 - 0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
# Finally, crsctl shows all resources back to normal
root@bjdb2:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.LISTENER.lsnr
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.OCR_VOTE1.dg
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.asm
ONLINE ONLINE bjdb1 Started
ONLINE ONLINE bjdb2 Started
ora.gsd
OFFLINE OFFLINE bjdb1
OFFLINE OFFLINE bjdb2
ora.net1.network
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.ons
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.registry.acfs
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE bjdb2
ora.bjdb.db
1 ONLINE ONLINE bjdb1 Open
2 ONLINE ONLINE bjdb2 Open
ora.bjdb1.vip
1 ONLINE ONLINE bjdb1
ora.bjdb2.vip
1 ONLINE ONLINE bjdb2
ora.cvu
1 ONLINE ONLINE bjdb1
ora.oc4j
1 ONLINE ONLINE bjdb1
ora.scan1.vip
1 ONLINE ONLINE bjdb2
# Check the listeners running on node 2; during the outage the SCAN listener was on node 2
root@bjdb2:/>ps -ef|grep tns
grid 5308430 1 0 Aug 17 - 5:05 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
grid 5505240 1 1 Aug 17 - 27:23 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
root@bjdb2:/>
That completes the handling of this RAC cluster's CRS-4535/CRS-4000 failure. Worth noting: from the moment the routine inspection found the fault through the entire troubleshooting process, the RAC database remained available to clients, which speaks to RAC's resilience.
