Symptoms and Handling of Oracle RAC Failing to Start Due to OCR File Corruption


v$cluster_interconnects

The IP addresses used for communication between cluster nodes.

Error Information

  • The public network was used for the interconnect
SQL> select * from v$cluster_interconnects;

NAME IP_ADDRESS IS_ SOURCE CON_ID
eth0 192.168.1.70 OS dependent software 0
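
As a cross-check, the networks that GPnP has registered for each role can be listed with oifcfg; a healthy configuration should show a separate cluster_interconnect interface rather than only the public one (a sketch; the eth1 subnet shown is illustrative):

[oracle@test-rac1 ~]$ oifcfg getif
eth0  192.168.1.0  global  public
eth1  10.10.10.0  global  cluster_interconnect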

  • Log information
Filename=alert_+ASM1.log

~~~~~~~~~~~~~~~~Normal startup~~~~~~~~~~~~~~~~~~~~~~~~
Thu Jun 23 12:33:00 2016
**********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 4
Number of processor cores in the system is 16
Number of processor sockets in the system is 1
Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
[name='eth1:1', type=1, ip=169.254.31.89, mac=00-15-5d-75-0b-16, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Public Interface 'eth0' configured from GPnP for use as a public interface.
[name='eth0', type=1, ip=192.168.1.70, mac=00-15-5d-75-0b-15, net=192.168.1.0/24, mask=255.255.255.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u1/app/12.1.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
NOTE: remote asm mode is local (mode 0x301; from cluster type)
NOTE: Volume support enabled
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /u1/app/12.1.0/grid
System name: Linux
Node name: test-rac1.sf.net
Release: 2.6.32-358.el6.x86_64
Version: #1 SMP Fri Feb 22 13:35:02 PST 2013
Machine: x86_64
Using parameter settings in server-side spfile +CRSDG/test-cluster/ASMPARAMETERFILE/registry.253.893674255
System parameters with non-default values:
large_pool_size = 12M
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring = "/dev/asm*"
asm_diskgroups = "DATADG"
asm_diskgroups = "FRADG"
asm_power_limit = 1
NOTE: remote asm mode is local (mode 0x301; from cluster type)
Thu Jun 23 12:33:02 2016
Cluster communication is configured to use the following interface(s) for this instance
169.254.31.89 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

~~~~~~~~~~~~~~~~Abnormal startup~~~~~~~~~~~~~~~~~~~~~~~~
Sun Jul 31 10:30:00 2016
**********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 4
Number of processor cores in the system is 16
Number of processor sockets in the system is 1
WARNING: No cluster interconnect has been specified. Depending on <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
the communication driver configured Oracle cluster traffic
may be directed to the public interface of this machine.
Oracle recommends that RAC clustered databases be configured
with a private interconnect for enhanced security and
performance.
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u1/app/12.1.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
NOTE: remote asm mode is local (mode 0x301; from cluster type)
NOTE: Volume support enabled
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /u1/app/12.1.0/grid
System name: Linux
Node name: test-rac1.sf.net
Release: 2.6.32-358.el6.x86_64
Version: #1 SMP Fri Feb 22 13:35:02 PST 2013
Machine: x86_64
Using parameter settings in server-side spfile +CRSDG/test-cluster/ASMPARAMETERFILE/registry.253.893674255
System parameters with non-default values:
large_pool_size = 12M
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring = "/dev/asm*"
asm_diskgroups = "DATADG"
asm_diskgroups = "FRADG"
asm_power_limit = 1
NOTE: remote asm mode is local (mode 0x301; from cluster type)
Sun Jul 31 10:30:05 2016
Cluster communication is configured to use the following interface(s) for this instance
192.168.1.70 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
cluster interconnect IPC version: Oracle UDP/IP (generic) 
  • Cause
Filename=ohasd_orarootagent_root_278.trc

2016-07-31 08:31:10.211739 : USRTHRD:2180118272: {0:9:3} failed to receive ARP request
2016-07-31 08:31:10.211778 : USRTHRD:2180118272: {0:9:3} (null) category: -2, operation: read, loc: lnxrecv:2,os, OS error: 100, other: <<<<<<<<<<<<<<<NIC error
2016-07-31 08:31:10.712354 : USRTHRD:2180118272: {0:9:3} Failed to check 169.254.31.89 on eth1 <<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-07-31 08:31:10.712389 : USRTHRD:2180118272: {0:9:3} (null) category: 0, operation: , loc: , OS error: 0, other: <<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-07-31 08:31:10.712419 : USRTHRD:2180118272: {0:9:3} Assigned IP 169.254.31.89 no longer valid on inf eth1
2016-07-31 08:31:10.712436 : USRTHRD:2180118272: {0:9:3} Attempt to reassign the IP 169.254.31.89 on inf eth1
2016-07-31 08:31:10.712455 : USRTHRD:2180118272: {0:9:3} VipActions::startIp {
2016-07-31 08:31:11.039416 : AGFW:2191427328: {0:0:2} Agent received the message: AGENT_HB[Engine] ID 12293:1232478
2016-07-31 08:31:11.212863 : USRTHRD:2180118272: {0:9:3} Failed to check 169.254.31.89 on eth1
2016-07-31 08:31:11.212896 : USRTHRD:2180118272: {0:9:3} (null) category: 0, operation: , loc: , OS error: 0, other:
2016-07-31 08:31:11.213190 : USRTHRD:2180118272: {0:9:3} Adding 169.254.31.89 on eth1:1
2016-07-31 08:31:11.213400 : USRTHRD:2180118272: {0:9:3} Arp::sCreateSocket {
2016-07-31 08:31:11.227198 : USRTHRD:2180118272: {0:9:3} Arp::sCreateSocket }
2016-07-31 08:31:11.227227 : USRTHRD:2180118272: {0:9:3} Flushing neighbours ARP Cache
2016-07-31 08:31:11.227245 : USRTHRD:2180118272: {0:9:3} Arp::sFlushArpCache {
2016-07-31 08:31:11.227319 : USRTHRD:2180118272: {0:9:3} Arp::sSend: sending type 1
2016-07-31 08:31:11.227477 : USRTHRD:2180118272: {0:9:3} ignoring failure: failed to send arp
2016-07-31 08:31:12.312581 :CLSDYNAM:2169542400: [ora.ctssd]{0:9:3} [check] ClsdmClient::sendMessage clsdmc_respget return: status=0, ecode=0
2016-07-31 08:31:12.312645 :CLSDYNAM:2169542400: [ora.ctssd]{0:9:3} [check] translateReturnCodes, return = 0, state detail = OBSERVERCheckcb data [0x7fab4019bfe0]: mode[0xee] offset[0 ms].
2016-07-31 08:31:13.220363 : USRTHRD:2184320768: HAIP: event GIPCD_IF_UPDATE
2016-07-31 08:31:13.220588 : USRTHRD:2187224832: {0:9:3} dequeue change event 0x7fab58078a40, GIPCD_IF_UPDATE
2016-07-31 08:31:13.220651 : USRTHRD:2187224832: {0:9:3} HAIP: IF state gipcdadapterstateDown
2016-07-31 08:31:13.220681 : USRTHRD:2187224832: {0:9:3} It is non-cluster network, attr public
2016-07-31 08:31:13.220707 : USRTHRD:2187224832: {0:9:3} to verify routes
2016-07-31 08:31:13.220735 : USRTHRD:2187224832: {0:9:3} to verify start completion 1
2016-07-31 08:31:13.220983 : USRTHRD:2187224832: {0:9:3} HAIP: assigned ip '169.254.31.89'
2016-07-31 08:31:13.221001 : USRTHRD:2187224832: {0:9:3} HAIP: check ip '169.254.31.89'
2016-07-31 08:31:13.221017 : USRTHRD:2187224832: {0:9:3} Start: 1 HAIP assignment, 1, 1
2016-07-31 08:31:13.228064 : USRTHRD:2184320768: HAIP: event GIPCD_IF_UPDATE
2016-07-31 08:31:13.229735 : USRTHRD:2187224832: {0:9:3} dequeue change event 0x7fab58078fb0, GIPCD_IF_UPDATE
2016-07-31 08:31:13.229782 : USRTHRD:2187224832: {0:9:3} HAIP: IF state gipcdadapterstateDown
2016-07-31 08:31:13.229858 : USRTHRD:2187224832: {0:9:3} It is non-cluster network, attr public
2016-07-31 08:31:13.229884 : USRTHRD:2187224832: {0:9:3} to verify routes
2016-07-31 08:31:13.229909 : USRTHRD:2187224832: {0:9:3} to verify start completion 1
2016-07-31 08:31:13.230098 : USRTHRD:2187224832: {0:9:3} HAIP: assigned ip '169.254.31.89'
2016-07-31 08:31:13.230121 : USRTHRD:2187224832: {0:9:3} HAIP: check ip '169.254.31.89'
2016-07-31 08:31:13.230138 : USRTHRD:2187224832: {0:9:3} Start: 1 HAIP assignment, 1, 1
2016-07-31 08:31:13.231900 : USRTHRD:2184320768: HAIP: event GIPCD_IF_UPDATE
2016-07-31 08:31:13.232008 : USRTHRD:2187224832: {0:9:3} dequeue change event 0x7fab5808a7d0, GIPCD_IF_UPDATE
2016-07-31 08:31:13.232043 : USRTHRD:2187224832: {0:9:3} HAIP: IF state gipcdadapterstateDown
2016-07-31 08:31:13.232065 : USRTHRD:2187224832: {0:9:3} It is non-cluster network, attr public
2016-07-31 08:31:13.232130 : USRTHRD:2187224832: {0:9:3} to verify routes
2016-07-31 08:31:13.232151 : USRTHRD:2187224832: {0:9:3} to verify start completion 1
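
To confirm NIC trouble like the ARP failures above at the OS level, standard Linux checks can be run against the private interface (a sketch; eth1 is taken from the trace):

# link state and negotiated speed as seen by the driver
ethtool eth1
# RX/TX error counters on the interface
ip -s link show eth1
# kernel/driver messages around the failure window
dmesg | grep -i eth1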

Solution Provided by Oracle Support

As discussed on the phone, the problem is currently on node1. Between 08:30 and 15:00 on July 31, node1's private-network NIC reported errors, causing GI to disable the private network. During node1's restart at 10:30, the private-network information could not be found, so the public network was used as a temporary substitute.
As a result, when node2 started normally, it could not ping node1's private network.

The current solution is (a command sketch follows the list):

1. Shut down GI on both nodes
2. Reboot the node2 host
3. After node2 has started successfully, restart node1
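
A minimal command sketch of these three steps, assuming the GI home from the logs and root access:

# 1. stop GI on both nodes (run on each node as root)
/u1/app/12.1.0/grid/bin/crsctl stop crs
# 2. reboot the node2 host
reboot
# 3. once node2's stack is fully up, verify it, then restart node1
/u1/app/12.1.0/grid/bin/crsctl check crs   # run on node2 first
reboot                                     # then on node1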

OCR

OCR stands for "Oracle Cluster Registry".
The Oracle documentation describes it as follows:
Anything that Oracle Clusterware manages is known as a CRS resource. A CRS
resource can be a database, an instance, a service, a listener, a VIP address, or an application process. Oracle Clusterware manages CRS resources based on the resource's configuration information that is stored in the Oracle Cluster Registry (OCR).

Root Cause

Judging from the CRSD log, the OCR file is suspected to be corrupted.
~~~~~~~~~~~~~~~~~~~~~~~~
2016-08-04 12:41:16.034910 :UiServer:1767343872: {2:30043:2} Done for ctx=0x7fdb040316a0
2016-08-04 12:41:16.040546 :  OCRRAW:1777850112: rtnode:3: invalid tnode 73
2016-08-04 12:41:16.040567 :  OCRRAW:1777850112: propropen:0: could not read tnode addrd=0
2016-08-04 12:41:16.040624 :  OCRRAW:1777850112: proprseterror: Error in accessing physical storage [26] Marking context invalid.    <<<<<<<<<<<<<<<<<Error in accessing physical storage [26]
~~~~~~~~~~~~~~~~~~~~~~~~~

Resolution

It is recommended to restore a valid OCR backup using the steps below and then restart the crsd process (a command sketch follows the steps).
Note: before executing these steps, shut down the GI software on node1, then run the steps on node2.
1. Identify available OCR backups using 'ocrconfig -showbackup'
2. Dump the contents (as privileged user) of all available backups using
'ocrdump <outputfile> -backupfile <backupfile>'
3. Identify all the backup locations where 'ocrdump' successfully completed
4. Inspect the contents of the ocrdump output, and identify a suitable backup
5. Shutdown the CRS stack on all nodes in the cluster
6. Restore the OCR backup (as privileged user) using
'ocrconfig -restore <file>'
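
A hedged sketch of steps 1-6 as commands (run as a privileged user; the backup path is illustrative):

# 1. list available automatic backups
ocrconfig -showbackup
# 2-4. dump each candidate backup and inspect the text output
ocrdump /tmp/backup00.txt -backupfile /u1/app/12.1.0/grid/cdata/test-cluster/backup00.ocr
# 5. stop the CRS stack on every node
crsctl stop crs
# 6. restore the chosen backup
ocrconfig -restore /u1/app/12.1.0/grid/cdata/test-cluster/backup00.ocr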

While performing the steps above, the instance and listener on node2 do not need to be shut down first, but the operation should only be done after a maintenance window has been approved, since the instance may be affected during the process.
After crsd starts successfully, shut down the instance and local listener, then start them again with crsctl commands.

After all of this is complete, start node1.

Simulation in a Test Environment

Checking OCR Automatic Backups

By default, RAC backs up the OCR file every four hours. The command ocrconfig -showbackup shows the generation time and location of the backup files; each backup is written on only one node.

Output from the morning check:

[oracle@ol6-121-rac2 ~]$ ocrconfig -showbackup
ol6-121-rac1     2016/08/17 07:32:10     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup00.ocr     0
ol6-121-rac1     2016/08/17 03:32:08     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup01.ocr     0
ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup02.ocr     0
ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/day.ocr     0
ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/week.ocr     0
PROT-25: Manual backups for the Oracle Cluster Registry are not available

Output from the 4 PM check:

[oracle@ol6-121-rac2 ~]$ ocrconfig -showbackup
ol6-121-rac1     2016/08/17 15:32:15     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup00.ocr     0
ol6-121-rac1     2016/08/17 11:32:12     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup01.ocr     0
ol6-121-rac1     2016/08/17 07:32:10     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup02.ocr     0
ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/day.ocr     0
ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/week.ocr     0
PROT-25: Manual backups for the Oracle Cluster Registry are not available
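
The PROT-25 message at the end simply means that no manual backup has ever been taken on this cluster; one can be created on demand as root (a sketch):

/u01/app/12.1.0.2/grid/bin/ocrconfig -manualbackup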

The newest backup is always named backup00.ocr, and one backup from the previous day and one from the previous week are also kept. Because this cluster was first started on 08/16, week.ocr and day.ocr here are the same file.

  • OCR automatic backups are performed by the CRSD process on the master node, so once the CRSD process is gone, automatic OCR backups stop as well.

Viewing the Contents of the OCR File

The ocrdump command converts the contents of an OCR file into text form.

[oracle@ol6-121-rac1 ~]$ sudo /u01/app/12.1.0.2/grid/bin/ocrdump testbackup01.txt -backupfile /home/oracle/backup01.ocr

Logical OCR Backup

Oracle recommends backing up the OCR file before any cluster configuration change; at such times a manual logical backup can be taken with a single command, which must be run as root.

[oracle@ol6-121-rac1 ~]$ ocrconfig -export ocr_bak1.ocrdmp
PROT-20: Insufficient permission to proceed. Require privileged user

[oracle@ol6-121-rac1 ~]$ sudo -s /u01/app/12.1.0.2/grid/bin/ocrconfig -export ocr_bak1.ocrdmp

[oracle@ol6-121-rac1 ~]$ ls -lh *ocr*
-rw-------. 1 root   root      97K Aug 17 17:41 ocr_bak1.ocrdmp
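
For completeness, the counterpart for restoring from such a logical export is ocrconfig -import, run as root with the CRS stack down on all nodes (a sketch using the export created above):

/u01/app/12.1.0.2/grid/bin/ocrconfig -import /home/oracle/ocr_bak1.ocrdmp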

Simulating an OCR Corruption and Recovery

Location of the OCR File

Use the ocrcheck command.

[oracle@ol6-121-rac1 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          4
         Total space (kbytes)     :     409568
         Used space (kbytes)      :       1548
         Available space (kbytes) :     408020
         ID                       :  144424447
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
         Cluster registry integrity check succeeded
         Logical corruption check bypassed due to non-privileged user

Log in to ASMCMD; the OCRFILE directory can usually be found under the directory named after the cluster, and the OCR file in use by the system is stored there.

ASMCMD> pwd
+DATA/ol6-121-scan/OCRFILE
ASMCMD> ls -l
Type     Redund  Striped  Time             Sys  Name
OCRFILE  UNPROT  COARSE   AUG 18 10:00:00  Y    REGISTRY.255.919941339
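
On Linux, the configured OCR location can also be cross-checked outside the clusterware tools, since it is recorded in /etc/oracle/ocr.loc (a sketch; the contents shown match the +DATA location above):

[oracle@ol6-121-rac1 ~]$ cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA
local_only=FALSE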

Shutting Down CRS

[oracle@ol6-121-rac1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stop crs

[root@ol6-121-rac2 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stop crs

Simulating OCR File Corruption

Because files cannot be edited from ASMCMD, an OCR file was copied from another cluster to this one and then restored with the ocrconfig command.

Restoring the file directly fails, because ASM is not running while the target OCR file resides on an ASM disk.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -restore /home/oracle/backup00.ocr
PROT-35: The configured OCR locations are not accessible

[root@ol6-121-rac1 ~]# ps -ef | grep asm
root      2415  6471  0 11:47 pts/1    00:00:00 grep asm

The workaround is to start only cssd and then start ASM from SQL*Plus, while confirming that CRS is not running.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start crs -excl -cssonly
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'ol6-121-rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'ol6-121-rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'ol6-121-rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'ol6-121-rac1'
CRS-2676: Start of 'ora.diskmon' on 'ol6-121-rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'ol6-121-rac1' succeeded

[oracle@ol6-121-rac1 trace]$ sqlplus / as sysasm

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 18 14:03:38 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ASM instance started

Total System Global Area 1140850688 bytes
Fixed Size                  2933400 bytes
Variable Size            1112751464 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
ASM diskgroups volume enabled

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager

After that, the ocrconfig -restore command can be executed.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -restore /home/oracle/backup00.ocr
[root@ol6-121-rac1 ~]#

ASMCMD> ls -l
Type     Redund  Striped  Time             Sys  Name
OCRFILE  UNPROT  COARSE   AUG 18 14:00:00  Y    REGISTRY.255.919941339

Attempting to Start CRS

After the OCR has been altered, trying to restart CRS shows that CRS ultimately fails to start.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.asm' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.crf' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ol6-121-rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.asm' on 'ol6-121-rac1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'ol6-121-rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'ol6-121-rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'ol6-121-rac1'
CRS-2677: Stop of 'ora.cssd' on 'ol6-121-rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'ol6-121-rac1'
CRS-2677: Stop of 'ora.gipcd' on 'ol6-121-rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ol6-121-rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@ol6-121-rac1 ~]#

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -init -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details      
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       ol6-121-rac1             Started,STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.crf
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.ctssd
      1        ONLINE  ONLINE       ol6-121-rac1             ACTIVE:0,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.drivers.acfs
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.evmd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.gipcd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.gpnpd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.mdnsd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.storage
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
--------------------------------------------------------------------------------

Errors appear in the crsd.trc log; the first error is evidently due to the missing grid user, because GI in this test environment was not installed under a grid user.

2016-08-18 14:22:50.431205 :  CRSSEC:3000985344: {1:24681:2} Exception: ACL entry creation failed for: owner:grid:rwx
    CLSB:3000985344: Oracle Clusterware infrastructure error in CRSD (OS PID 2742): Fatal signal 6 has occurred in program crsd thread 3000985344; nested signal count is 1
Incident 81 created, dump file: /u01/app/oracle/diag/crs/ol6-121-rac1/crs/incident/incdir_81/crsd_i81.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []

At this point ASM is still running, so the database can be started with SQL*Plus to run on a single node.
However, the first startup attempt reports that the spfile does not exist, presumably an spfile change caused by the corrupted OCR file.

[oracle@ol6-121-rac1 trace]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 18 14:53:10 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DATA/test/spfiletest.ora'
ORA-17503: ksfdopn:2 Failed to open file +DATA/test/spfiletest.ora
ORA-15056: additional error message
ORA-17503: ksfdopn:2 Failed to open file +DATA/test/spfiletest.ora
ORA-15173: entry 'spfiletest.ora' does not exist in directory 'test'
ORA-06512: at line 4
SQL> exit
Disconnected

The database can be made to find the spfile at startup by adding an ASM alias for it.

ASMCMD [+DATA/test/parameterfile] > mkalias '+DATA/TEST/PARAMETERFILE/spfile.294.920047363' '+DATA/TEST/spfiletest.ora'
ASMCMD [+DATA/test/parameterfile] > cd ..
ASMCMD [+DATA/test] > ls
CONTROLFILE/
DATAFILE/
ONLINELOG/
PARAMETERFILE/
PASSWORD/
TEMPFILE/
spfiletest.ora
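
If creating an alias were not possible, an alternative would be a one-line pfile in $ORACLE_HOME/dbs that points at the spfile's real ASM path (a sketch using the file name from the listing above):

# $ORACLE_HOME/dbs/inittest1.ora
SPFILE='+DATA/TEST/PARAMETERFILE/spfile.294.920047363'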

Starting the database again works normally:

[oracle@ol6-121-rac1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 18 15:12:32 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area  838860800 bytes
Fixed Size                  2929936 bytes
Variable Size             616565488 bytes
Database Buffers          213909504 bytes
Redo Buffers                5455872 bytes
Database mounted.
Database opened.

SQL> select instance_name,status from v$instance;

INSTANCE_NAME    STATUS
---------------- ------------
test1            OPEN

The instance on the second node can be started in the same way, and transactional consistency is maintained between the instances.
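
A quick hypothetical check of that consistency, assuming a table scott.t exists: commit a row on one instance and read it back from the other.

-- on instance test1
SQL> insert into scott.t values (1);
SQL> commit;
-- on instance test2
SQL> select count(*) from scott.t;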

Restoring the OCR File

  • Locate the backup files
[oracle@ol6-121-rac1 ~]$ ocrconfig -showbackup
ol6-121-rac1     2016/08/18 07:32:31     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup00.ocr     0
ol6-121-rac1     2016/08/18 03:32:26     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup01.ocr     0
ol6-121-rac1     2016/08/17 23:32:22     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup02.ocr     0
ol6-121-rac1     2016/08/17 03:32:08     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/day.ocr     0
ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/week.ocr     0
PROT-25: Manual backups for the Oracle Cluster Registry are not available
  • Confirm that CRS is down
[oracle@ol6-121-rac1 ~]$ ps -ef | grep crs
oracle    2669  3366  0 15:46 pts/1    00:00:00 grep crs
[oracle@ol6-121-rac1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
  • Restore the OCR file
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -restore /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/week.ocr
  • Start CRS on node2: crsctl start crs reports an error because Oracle High Availability Services is already active, while crsctl start cluster succeeds. Throughout this period the instance can still be logged into and accept inserts.
[root@ol6-121-rac2 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
[root@ol6-121-rac2 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crsd' on 'ol6-121-rac2'
CRS-2676: Start of 'ora.crsd' on 'ol6-121-rac2' succeeded
[root@ol6-121-rac2 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details      
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.asm
               ONLINE  ONLINE       ol6-121-rac2             Started,STABLE
ora.net1.network
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.ons
               ONLINE  ONLINE       ol6-121-rac2             STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.MGMTLSNR
      1        ONLINE  ONLINE       ol6-121-rac2             169.254.42.254 192.1
                                                             68.1.102,STABLE
ora.cvu
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.mgmtdb
      1        ONLINE  OFFLINE      ol6-121-rac2             Instance Shutdown,ST
                                                             ARTING
ora.oc4j
      1        ONLINE  OFFLINE      ol6-121-rac2             STARTING
ora.ol6-121-rac1.vip
      1        ONLINE  INTERMEDIATE ol6-121-rac2             FAILED OVER,STABLE
ora.ol6-121-rac2.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.test.db
      1        ONLINE  OFFLINE                               STABLE
      2        ONLINE  ONLINE       ol6-121-rac2             Open,STABLE
--------------------------------------------------------------------------------
  • Start the cluster on node1 in the same way
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crsd' on 'ol6-121-rac1'
CRS-2676: Start of 'ora.crsd' on 'ol6-121-rac1' succeeded

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details      
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       ol6-121-rac1             STABLE
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       ol6-121-rac1             STABLE
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.asm
               ONLINE  ONLINE       ol6-121-rac1             Started,STABLE
               ONLINE  ONLINE       ol6-121-rac2             Started,STABLE
ora.net1.network
               ONLINE  ONLINE       ol6-121-rac1             STABLE
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.ons
               ONLINE  ONLINE       ol6-121-rac1             STABLE
               ONLINE  ONLINE       ol6-121-rac2             STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.MGMTLSNR
      1        ONLINE  ONLINE       ol6-121-rac2             169.254.42.254 192.1
                                                             68.1.102,STABLE
ora.cvu
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.mgmtdb
      1        ONLINE  ONLINE       ol6-121-rac2             Open,STABLE
ora.oc4j
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.ol6-121-rac1.vip
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.ol6-121-rac2.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.test.db
      1        ONLINE  OFFLINE                               Instance Shutdown,ST
                                                             ABLE
      2        ONLINE  ONLINE       ol6-121-rac2             Open,STABLE
--------------------------------------------------------------------------------
  • The database was found not to start automatically, so srvctl was used to start the instance. It could not be started with the cluster command, but it could still be started directly with sqlplus, and remote logins through the listener worked.
[oracle@ol6-121-rac1 ~]$ srvctl start instance -db test -node ol6-121-rac1
PRCR-1013 : Failed to start resource ora.test.db
PRCR-1064 : Failed to start resource ora.test.db on node ol6-121-rac1
CRS-2662: Resource 'ora.test.db' is disabled on server 'ol6-121-rac1'

[oracle@ol6-121-rac1 ~]$ srvctl enable database -db test
PRCC-1010 : test was already enabled
PRCR-1002 : Resource ora.test.db is already enabled
  • Shut down the cluster software on node1. The error was probably caused by starting the database directly with sqlplus, which left the database's auto-start entry disabled.
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -p

NAME=ora.test.db
TYPE=ora.database.type
ACL=owner:oracle:rwx,pgrp:oinstall:r--,other::r--,group:dba:r-x,group:oper:r-x,user:oracle:r-x
ACTIONS=startoption,group:"oinstall",user:"oracle",group:"dba",group:"oper"
ACTION_SCRIPT=
ACTION_START_OPTION=
ACTION_TIMEOUT=600
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX%
AUTO_START=restore
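
AUTO_START=restore means clusterware restores the resource to the state it was last in when the node went down; since the instance had been started and stopped outside clusterware with sqlplus, its recorded state no longer triggered an automatic start. A hedged way to inspect just the relevant attributes of this one resource:

/u01/app/12.1.0.2/grid/bin/crsctl stat res ora.test.db -p | grep -E 'AUTO_START|ENABLED'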

Following the reference document on handling the problem of Oracle 11gR2 RAC database resources failing to start automatically:

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/srvctl enable database -d test

PRCC-1010 : test was already enabled
PRCR-1002 : Resource ora.test.db is already enabled
[root@ol6-121-rac1 ~]#
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/srvctl enable instance -d test -i test1
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/srvctl enable instance -d test -i test2
  • Shut down the cluster services on node1, then reboot the server to check whether the database starts automatically.
    In this test, both the GI services and the database started normally after the reboot.

