ORACLE -- An incident caused by RAC failing to find the OCR


 

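During a system maintenance session I tried to start the RAC environment, but the RAC services did not come up, and I found this error in the /tmp directory: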

OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such device or address] [6]

 

 

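Two days earlier, while checking the backup logs, I had found an error reported when releasing a CHANNEL. A closer look showed that one DRIVE in the tape library had gone DOWN, so the backup could run on only one CHANNEL, which produced the following errors in the backup log: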

 

bash-3.00$ more /data/backup/backup_tradedb_081101.out

 

Script. /data/backup/backup_tradedb.sh
==== started on Sat Nov 1 23:00:00 CST 2008 ====


RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> RMAN> 2> 3> 4> 5> 6> 7> 8> 9> RMAN> 2> 3> 4> RMAN>
Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog

RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=112 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

allocated channel: C2
channel C2: sid=146 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

Starting backup at 01-NOV-08
input backupset count=842 stamp=669081253 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qaju2nl5_1_1
input backupset count=840 stamp=669080836 creation_time=25-OCT-08
channel C1: starting piece 1 at 01-NOV-08
channel C1: backup piece /data/backup/tradedb/q8ju2n84_1_1
piece handle=qaju2nl5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:03:35
deleted backup piece
backup piece handle=/data/backup/tradedb/qaju2nl5_1_1 recid=1446 stamp=669081254
input backupset count=841 stamp=669080836 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/q9ju2n84_1_1
piece handle=q9ju2n84_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:03:15
deleted backup piece
backup piece handle=/data/backup/tradedb/q9ju2n84_1_1 recid=1447 stamp=669080837
input backupset count=843 stamp=669081317 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qbju2nn5_1_1
piece handle=qbju2nn5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:11:46
deleted backup piece
backup piece handle=/data/backup/tradedb/qbju2nn5_1_1 recid=1448 stamp=669081317
input backupset count=844 stamp=669081317 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/qcju2nn5_1_1
RMAN-03009: failure of backup command on C1 channel at 11/01/2008 23:27:19
ORA-19506: failed to create sequential file, name="q8ju2n84_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  network connection timed out
ORA-19600: input file is backup piece  (/data/backup/tradedb/q8ju2n84_1_1)
ORA-19601: output file is backup piece  (q8ju2n84_1_2)
channel C1 disabled, job failed on it will be run on another channel
piece handle=qcju2nn5_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:21:41
deleted backup piece
backup piece handle=/data/backup/tradedb/qcju2nn5_1_1 recid=1449 stamp=669081322
input backupset count=840 stamp=669080836 creation_time=25-OCT-08
channel C2: starting piece 1 at 01-NOV-08
channel C2: backup piece /data/backup/tradedb/q8ju2n84_1_1
piece handle=q8ju2n84_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 01-NOV-08
channel C2: backup set complete, elapsed time: 00:12:26
deleted backup piece
backup piece handle=/data/backup/tradedb/q8ju2n84_1_1 recid=1445 stamp=669080837
input backupset count=846 stamp=669083380 creation_time=26-OCT-08
.
.
.
channel C2: starting piece 1 at 02-NOV-08
channel C2: backup piece /data/backup/tradedb/qhju2q9f_1_1
piece handle=qhju2q9f_1_2 comment=API Version 2.0,MMS Version 5.0.0.0
channel C2: finished piece 1 at 02-NOV-08
channel C2: backup set complete, elapsed time: 00:08:56
deleted backup piece
backup piece handle=/data/backup/tradedb/qhju2q9f_1_1 recid=1454 stamp=669083952
Finished backup at 02-NOV-08

released channel: C1
released channel: C2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of release command at 11/02/2008 00:44:39
RMAN-06012: channel: C1 not allocated
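A log like the one above is long enough that the one failing channel is easy to miss. The following is a minimal Python sketch, not part of the original investigation, that pulls the distinct RMAN-/ORA- codes and the failed channel names out of such a log. The function name `summarize_rman_log` and the regular expressions are illustrative assumptions based only on the message formats visible in this log.

```python
import re

# Matches Oracle-style error codes such as RMAN-03009 or ORA-19506.
ERROR_CODE = re.compile(r"\b(RMAN|ORA)-(\d{5})\b")
# Matches the channel named in lines like
# "failure of backup command on C1 channel at ...".
FAILED_CHANNEL = re.compile(r"failure of \w+ command on (\S+) channel")
# Codes that only decorate the error-stack banner and carry no information.
BANNER_CODES = {"RMAN-00571", "RMAN-00569"}


def summarize_rman_log(text):
    """Return (distinct error codes in order seen, channels that failed)."""
    codes = []
    failed_channels = []
    for line in text.splitlines():
        for match in ERROR_CODE.finditer(line):
            code = f"{match.group(1)}-{match.group(2)}"
            if code not in BANNER_CODES and code not in codes:
                codes.append(code)
        channel = FAILED_CHANNEL.search(line)
        if channel and channel.group(1) not in failed_channels:
            failed_channels.append(channel.group(1))
    return codes, failed_channels


if __name__ == "__main__":
    # A few lines copied from the log above, used as sample input.
    sample = (
        "RMAN-03009: failure of backup command on C1 channel at 11/01/2008 23:27:19\n"
        'ORA-19506: failed to create sequential file, name="q8ju2n84_1_2", parms=""\n'
        "ORA-19511: Error received from media manager layer, error text:\n"
        "RMAN-03002: failure of release command at 11/02/2008 00:44:39\n"
        "RMAN-06012: channel: C1 not allocated\n"
    )
    codes, channels = summarize_rman_log(sample)
    print("error codes:", ", ".join(codes))
    print("failed channels:", ", ".join(channels))
```

Run against the full log file, this would report C1 as the only failed channel, which matches the single DRIVE that went DOWN.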

 

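Bringing the DRIVE up manually showed nothing abnormal, but as soon as a backup ran, the DRIVE went DOWN again. When I tried to modify the DRIVE's configuration, I found that NETBACKUP could not load the DRIVE's original path at all, which pointed to a hardware problem as the likely cause.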

 

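The system maintenance staff therefore went on site, found that a fibre channel switch had failed, and restarted it. The RAC environment also depends on this fibre channel switch, but because it is configured with dual fibre channel switches, the RAC services were not stopped before the switch was restarted.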

 

As a result of the switch restart, one RAC node server was temporarily unable to start, and the other node server rebooted as well.

 

With the RAC environment completely DOWN, I tried to start the RAC services on the node that could currently boot:

 

# /etc/init.d/init.crs start

Startup will be queued to init within 30 seconds.

 

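Long after the service was started there was still no response, and a check of the background processes showed that no Oracle instance had started. Something was clearly wrong, and a look in the /tmp directory turned up the error message shown above: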

 

bash-3.00# cd /tmp
bash-3.00# ls
crsctl.4483          crsctl.4492          crsctl.4493          hsperfdata_noaccess  hsperfdata_root      ssh-sIvv2068
bash-3.00# ls -l
total 96
-rw-r--r--   1 oracle   oinstall     155 Nov  5 20:46 crsctl.4483
-rw-r--r--   1 oracle   oinstall     155 Nov  5 20:46 crsctl.4492
-rw-r--r--   1 oracle   oinstall     155 Nov  5 20:46 crsctl.4493
drwxr-xr-x   2 noaccess noaccess     178 Nov  5 19:53 hsperfdata_noaccess
drwxr-xr-x   2 root     root         117 Nov  5 19:54 hsperfdata_root
drwx------   2 root     root         184 Nov  5 19:57 ssh-sIvv2068
bash-3.00# more crsctl.4483
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such device or address]
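Since the startup command itself printed nothing useful, the only evidence was in the crsctl.* files under /tmp. The sketch below is one hedged way to collect those messages automatically: it scans a directory for files named `crsctl.*` (the naming seen in the listing above), extracts the PROC code and OS error text, and maps the bracketed error number, when present, to its errno symbol. The function names and the regular expression are assumptions made for illustration and are derived only from the message format shown in this post.

```python
import errno
import glob
import os
import re

# Captures the PROC code, the bracketed OS error text, and the optional
# bracketed OS error number that sometimes follows it.
PROC_LINE = re.compile(
    r"(PROC-\d+):.*?Operating System error \[([^\]]*)\](?: \[(\d+)\])?"
)


def parse_ocr_error(line):
    """Return the PROC code, OS error text, and errno symbol, or None."""
    match = PROC_LINE.search(line)
    if match is None:
        return None
    code, os_text, number = match.groups()
    # errno.errorcode maps a number such as 6 to a symbol such as "ENXIO".
    symbol = errno.errorcode.get(int(number)) if number else None
    return {"code": code, "os_text": os_text, "errno_symbol": symbol}


def scan_crsctl_files(directory="/tmp"):
    """Yield (path, parsed_error) for each crsctl.* file with a PROC error."""
    for path in sorted(glob.glob(os.path.join(directory, "crsctl.*"))):
        with open(path, errors="replace") as handle:
            for line in handle:
                parsed = parse_ocr_error(line)
                if parsed:
                    yield path, parsed


if __name__ == "__main__":
    for path, parsed in scan_crsctl_files():
        print(path, parsed)
```

For the message quoted at the top of this post, error number 6 maps to ENXIO, meaning the operating system could not reach the device behind the path at that moment.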

 

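Oracle's shared storage is managed by VERITAS Cluster Volume Manager (CVM), and the node that is currently DOWN is the CVM master node. On the current node, however, the OCR raw device, the voting disk raw device, and the raw devices for all control files, log files, data files and the parameter file are all visible, and their access paths all look normal. So why does this error still occur?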
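A device path that lists normally can still fail when it is actually opened or read, so a visible path alone does not rule out an I/O problem. The following is a minimal sketch of a read probe, assuming the devices can be read with an ordinary `open` and a 512-byte read; some raw devices require specific read sizes or alignment, so treat the result as a hint only. The function name `probe_device` and the device paths in the example are hypothetical, since the post does not give the real paths.

```python
import errno
import os


def probe_device(path, nbytes=512):
    """Try to open path and read nbytes; return "ok" or the errno symbol."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        os.read(fd, nbytes)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
    return "ok"


if __name__ == "__main__":
    # Hypothetical paths: substitute the real OCR and voting disk raw
    # devices of the cluster being checked.
    for device in ["/dev/example/ocr", "/dev/example/vote"]:
        print(device, probe_device(device))
```

A result of ENXIO from such a probe would agree with the PROC-26 message, while "ok" would suggest the failure was transient or specific to how the clusterware accesses the device.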

 

A search of METALINK suggested that this may be the problem described in Bug No. 3613622:

 

The problem here is that no node cannot rely on its perception of the network, since the network may be broken in an undetectable manner, so the node must have access to the voting disk.  When access to the voting disk is lost, or the I/O takes 'too long', the node must fail.

 

When Veritas CVM runs with Vendor Clusterware, then the Vendor Clusterware is the primary driver of node reconfiguration, not the miss count setting of CSS.  As John mentioned above, on Sun Cluster by default CSS tolerates up to almost 10 minutes of Veritas CVM I/O suspension.  It is Veritas's problem to fix.

 

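So the problem was very likely caused by VERITAS CVM, and after some time RAC on this node did in fact start. However, node 1 happened to boot correctly at around the same time, so it is hard to tell whether the problem went away because the master node came up or because the wait had exceeded 10 minutes.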

 

I am recording the problem here for now. If the opportunity arises, it still needs to be verified.

 

 

 

