During a round of system maintenance I tried to start the RAC environment, but the RAC services did not come up, and I found this error under /tmp:
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such device or address] [6]
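Two days earlier, while checking the backup logs, I had noticed an error raised when a CHANNEL was released. A closer look showed that one DRIVE in the tape library had gone DOWN, so the backup could only run on a single channel, which is why the error appeared in the backup log. The log is as follows: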
bash-3.00$ more /data/backup/backup_tradedb_081101.out
Script. /data/backup/backup_tradedb.sh
==== started on Sat Nov 1 23:00:00 CST 2008 ====
RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> RMAN> 2> 3> 4> 5> 6> 7> 8> 9> RMAN> 2> 3> 4> RMAN>
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog
RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=112 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
allocated channel: C2
channel C2: sid=146 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
Starting backup at 01-NOV-08
input backupset count=842 stamp=669081253 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qaju2nl5_1_1
input backupset count=840 stamp=669080836 creation_time=25-OCT-08
channel C1: starting piece 1 at 01-NOV-08
channel C1: backup piece /data/backup/tradedb/q8ju2n84_1_1
piece handle=qaju2nl5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:03:35
deleted backup piece
backup piece handle=/data/backup/tradedb/qaju2nl5_1_1 recid=1446 stamp=669081254
input backupset count=841 stamp=669080836 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/q9ju2n84_1_1
piece handle=q9ju2n84_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:03:15
deleted backup piece
backup piece handle=/data/backup/tradedb/q9ju2n84_1_1 recid=1447 stamp=669080837
input backupset count=843 stamp=669081317 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qbju2nn5_1_1
piece handle=qbju2nn5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:11:46
deleted backup piece
backup piece handle=/data/backup/tradedb/qbju2nn5_1_1 recid=1448 stamp=669081317
input backupset count=844 stamp=669081317 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qcju2nn5_1_1
RMAN-03009: failure of backup command on C1 channel at 11/01/2008 23:27:19
ORA-19506: failed to create sequential file, name="q8ju2n84_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: network connection timed out
ORA-19600: input file is backup piece (/data/backup/tradedb/q8ju2n84_1_1)
ORA-19601: output file is backup piece (q8ju2n84_1_2)
channel C1 disabled, job failed on it will be run on another channel
piece handle=qcju2nn5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:21:41
deleted backup piece
backup piece handle=/data/backup/tradedb/qcju2nn5_1_1 recid=1449 stamp=669081322
input backupset count=840 stamp=669080836 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/q8ju2n84_1_1
piece handle=q8ju2n84_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:12:26
deleted backup piece
backup piece handle=/data/backup/tradedb/q8ju2n84_1_1 recid=1445 stamp=669080837
input backupset count=846 stamp=669083380 creation_time=26-OCT-08
.
.
.
channel C2: starting piece 1 at 02-NOV-08
channel C2: backup piece /data/backup/tradedb/qhju2q9f_1_1
piece handle=qhju2q9f_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 02-NOV-08
channel C2: backup set complete, elapsed time: 00:08:56
deleted backup piece
backup piece handle=/data/backup/tradedb/qhju2q9f_1_1 recid=1454 stamp=669083952
Finished backup at 02-NOV-08
released channel: C1
released channel: C2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of release command at 11/02/2008 00:44:39
RMAN-06012: channel: C1 not allocated
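A log like this is long, and the two facts that matter (C1 was disabled partway through, and the later release of C1 failed for that reason) are easy to miss. The following is a minimal sketch, not part of the original backup script, of how such a log could be summarized. The function name summarize_rman_log and the choice to skip the RMAN-00569/RMAN-00571 banner lines are my own assumptions, and the patterns are based only on the lines visible in the log above.

```python
import re
from collections import Counter

# Oracle-style error codes at the start of a line, e.g. RMAN-03009 or ORA-19506.
ERROR_CODE = re.compile(r"^(RMAN|ORA)-(\d{5}):")
# The line RMAN printed in the log above when it stopped using a channel.
DISABLED = re.compile(r"^channel (\S+) disabled")
# Codes used only for the "ERROR MESSAGE STACK FOLLOWS" banner (assumption: not worth counting).
BANNER_CODES = {"RMAN-00569", "RMAN-00571"}

def summarize_rman_log(text):
    """Return (Counter of error codes, list of channels that were disabled)."""
    codes = Counter()
    disabled = []
    for raw in text.splitlines():
        line = raw.strip()
        m = ERROR_CODE.match(line)
        if m:
            code = f"{m.group(1)}-{m.group(2)}"
            if code not in BANNER_CODES:
                codes[code] += 1
            continue
        m = DISABLED.match(line)
        if m:
            disabled.append(m.group(1))
    return codes, disabled

# A few lines copied from the log above.
sample = """\
RMAN-03009: failure of backup command on C1 channel at 11/01/2008 23:27:19
ORA-19506: failed to create sequential file, name="q8ju2n84_1_2", parms=""
channel C1 disabled, job failed on it will be run on another channel
RMAN-00571: ===========================================================
RMAN-03002: failure of release command at 11/02/2008 00:44:39
RMAN-06012: channel: C1 not allocated
"""

codes, disabled = summarize_rman_log(sample)
print("error codes:", dict(codes))
print("disabled channels:", disabled)
```

Run against the full file, this would show that C1 was the only disabled channel, which is consistent with the final RMAN-06012 on release.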
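Bringing the DRIVE up manually showed nothing abnormal, but as soon as a backup ran, the DRIVE went DOWN again. I tried changing the DRIVE's configuration and found that NetBackup could not load the DRIVE's original path at all, so a hardware fault looked like the probable cause.
The system maintenance staff therefore went on site and found that a Fibre Channel switch had failed, so they restarted it. The RAC environment also depends on that switch, but because RAC was configured with dual Fibre Channel switches, the RAC services were not stopped before the restart.
As a result of the switch restart, one RAC node server was temporarily unable to boot, and the other node server rebooted as well.
With the RAC environment completely down, I tried to start the RAC services on the node that could currently boot: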
# /etc/init.d/init.crs start
Startup will be queued to init within 30 seconds.
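Long after the start command there was still no response, and checking the background processes showed that no Oracle instance had started. Something seemed wrong, and a look in /tmp turned up the error message shown at the top: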
bash-3.00# cd /tmp
bash-3.00# ls
crsctl.4483 crsctl.4492 crsctl.4493 hsperfdata_noaccess hsperfdata_root ssh-sIvv2068
bash-3.00# ls -l
total 96
-rw-r--r-- 1 oracle oinstall 155 Nov 5 20:46 crsctl.4483
-rw-r--r-- 1 oracle oinstall 155 Nov 5 20:46 crsctl.4492
-rw-r--r-- 1 oracle oinstall 155 Nov 5 20:46 crsctl.4493
drwxr-xr-x 2 noaccess noaccess 178 Nov 5 19:53 hsperfdata_noaccess
drwxr-xr-x 2 root root 117 Nov 5 19:54 hsperfdata_root
drwx------ 2 root root 184 Nov 5 19:57 ssh-sIvv2068
bash-3.00# more crsctl.4483
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such device or address]
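The bracketed [6] in the first copy of this message is the operating system errno, and 6 is ENXIO ("No such device or address"). A device node can be visible in a directory listing and still fail with ENXIO when it is actually opened or read. The sketch below is one way to probe that from the OS side. It is my own illustration, not an Oracle or Veritas tool; probe_device is a hypothetical helper name and the path in the demo is a placeholder, not the real OCR device path. Raw devices may also need privileged access and aligned reads, which this sketch does not handle.

```python
import errno
import os

def describe(stage, exc):
    """Format an OSError as 'stage failed: errno N (NAME): message'."""
    name = errno.errorcode.get(exc.errno, "unknown")
    return f"{stage} failed: errno {exc.errno} ({name}): {os.strerror(exc.errno)}"

def probe_device(path, nbytes=512):
    """Open `path` read-only and read up to `nbytes`; return (ok, detail)."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return False, describe("open", exc)
    try:
        os.read(fd, nbytes)
    except OSError as exc:
        return False, describe("read", exc)
    finally:
        os.close(fd)
    return True, "readable"

# Placeholder path (assumption): substitute the OCR raw device path of your own cluster.
print(probe_device("/path/to/ocr_raw_device"))
```

If the probe reports ENXIO on a node where the device file is listed normally, the problem lies below the file system namespace, for example in the volume manager or the storage path, rather than in a missing device file.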
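Oracle's shared storage is managed by Veritas Cluster Volume Manager (CVM), and the node that was down is the CVM master node. On the current node, however, I could see the OCR raw device, the voting disk raw devices, and the raw devices for all control files, log files, data files and the parameter file, and the access paths of all these raw devices were normal. So why did this error still occur?
Searching MetaLink suggested that this may be the problem described in Bug No. 3613622: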
The problem here is that no node can rely on its perception of the network, since the network may be broken in an undetectable manner, so the node must have access to the voting disk. When access to the voting disk is lost, or the I/O takes 'too long', the node must fail.
When Veritas CVM runs with Vendor Clusterware, then the Vendor Clusterware is the primary driver of node reconfiguration, not the miss count setting of CSS. As John mentioned above, on Sun Cluster by default CSS tolerates up to almost 10 minutes of Veritas CVM I/O suspension. It is Veritas's problem to fix.
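So the problem was very likely caused by Veritas CVM. After some time, RAC on this node did in fact start. However, node 1 happened to boot correctly at about the same time, so I cannot tell whether the problem went away because the master node came back or because the wait exceeded the 10 minutes mentioned above.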
I am recording the problem here for now and will verify it if I get the chance.
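For such a verification, it would help to know exactly how long the storage stayed inaccessible, so that the time can be compared with the moment the master node rejoined. The following is a rough sketch under my own assumptions: wait_until is a hypothetical helper, the 600 second default simply mirrors the "almost 10 minutes" in the bug note, and the probe passed in would be whatever check you trust for OCR accessibility.

```python
import time

def wait_until(probe, timeout=600.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() every `interval` seconds.

    Returns the seconds elapsed when probe() first returned True,
    or None if `timeout` seconds passed without success.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)

# Trivial demo with a probe that succeeds immediately, so nothing is waited for.
elapsed = wait_until(lambda: True)
print("probe succeeded after", elapsed, "seconds")
```

Logging the returned value next to the master node's boot time would show which of the two events the recovery actually followed.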