Problem background: a customer reported that a node of their Oracle RAC cluster had gone down.
1. First, determine why the node went down: the archive log destination had filled up, which took the database down and forced a restart. The archive destination was USE_DB_RECOVERY_FILE_DEST (the default, i.e. the Fast Recovery Area); it had never been adjusted after installation, and a dedicated archive directory should be used instead. So the immediate fix is to clean up the archived logs and then change the archive destination.
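As a rough sketch of that cleanup (the dedicated archive directory /u01/arch and the one-day retention window are assumptions for illustration, not values from this case):

# remove old archived logs through RMAN so the control file stays consistent
rman target / <<'EOF'
crosscheck archivelog all;
delete noprompt archivelog until time 'SYSDATE-1';
EOF

# repoint archiving from the default FRA to a dedicated directory (hypothetical path)
sqlplus / as sysdba <<'EOF'
alter system set log_archive_dest_1='LOCATION=/u01/arch' scope=both sid='*';
archive log list
EOF

Deleting through RMAN rather than rm is deliberate: it keeps the control file's record of archived logs in sync with what is actually on disk.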
2. Node 1 started normally, but node 2 would not come up: no cluster services were running.
Check the cluster services. Checking their status on node rac2 returned an error:
[grid@rac2 ~]# /u01/app/11.2.0/grid/bin/crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
This error tells us that CRS itself is down.
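Before restarting anything, it can help to see which daemons are actually alive; a quick sketch:

# check the CSS and EVM layers individually
/u01/app/11.2.0/grid/bin/crsctl check css
/u01/app/11.2.0/grid/bin/crsctl check evm
# confirm which clusterware processes exist at the OS level
ps -ef | grep -E 'ohasd|ocssd|crsd|evmd' | grep -v grep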
Trying to start the CRS stack also returned errors (note: crsctl start crs must be run as root):
[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

Normally the command would return:

[root@rac2 bin]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

Checking CRS confirms the problem:

[grid@rac2 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
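CRS-4640 shows that ohasd itself is up while the layers above it are not, so a useful next view is the state of the lower-stack ("init") resources that ohasd manages; a sketch:

# show the state of the ohasd-managed resources (ora.cssd, ora.crsd, ...)
/u01/app/11.2.0/grid/bin/crsctl stat res -t -init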
Then we checked the IP configuration on node rac2: both the VIP and the SCAN IP were gone, which confirms that rac2 had dropped out of the cluster.
Check the node's interfaces with ifconfig -a.
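The same conclusion can be cross-checked from the clusterware side by querying the VIP and SCAN resources from a node where CRS still responds (rac1 here); a sketch:

# run on rac1, whose clusterware stack is healthy
srvctl status vip -n rac2
srvctl status scan
srvctl status nodeapps -n rac2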
3. Try to re-register node 2 with the cluster by re-running root.sh:
[root@rac2 ~]# sh /u01/app/11.2.0/grid/root.sh
Performing root user operation for Oracle 11g

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME= /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
User ignored Prerequisites during installation
Installing Trace File Analyzer
Configure Oracle Grid Infrastructure for a Cluster ... succeeded
4. The problem persisted, so the next step was to deconfigure node 2 and re-run root.sh.
[root@rac2 trace]$ /u01/app/11.2.0/grid/crs/install/rootcrs.pl -verbose -deconfig -force
[root@rac2 ~]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
[root@rac2 bin]# /u01/app/11.2.0/grid/root.sh

This hit an error:

[root@rac2 install]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
Can't locate Env.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 . /u01/app/11.2.0/grid/crs/install) at crsconfig_lib.pm line 703.
BEGIN failed--compilation aborted at crsconfig_lib.pm line 703.
Compilation failed in require at /u01/app/11.2.0/grid/crs/install/roothas.pl line 166.
BEGIN failed--compilation aborted at /u01/app/11.2.0/grid/crs/install/roothas.pl line 166.

A dependency package is missing; install it with yum install perl-Env:

Installed:
  perl-Env.noarch 0:1.04-2.el7
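A quick way to confirm the module is now visible to the system perl before re-running the script:

# should print OK instead of the earlier "Can't locate Env.pm" error
perl -MEnv -e 'print "Env.pm OK\n"'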
5. Clean up node 2's configuration:
[root@rac2 install]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Stop failed, or completed with errors.
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Delete failed, or completed with errors.
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac2'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac2'
CRS-2677: Stop of 'ora.mdnsd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'rac2'
CRS-2677: Stop of 'ora.crf' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac2'
CRS-2677: Stop of 'ora.gipcd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac2'
CRS-2677: Stop of 'ora.gpnpd' on 'rac2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac2' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully deconfigured Oracle Restart stack
6. Re-register node 2 with the cluster:
[root@rac2 install]# /u01/app/11.2.0/grid/root.sh
Performing root user operation for Oracle 11g
The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME= /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
User ignored Prerequisites during installation
Installing Trace File Analyzer
OLR initialization - successful
Adding Clusterware entries to inittab
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node rac1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Start of resource "ora.cssd" failed
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac2'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac2'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac2' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac2'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac2'
CRS-2676: Start of 'ora.diskmon' on 'rac2' succeeded
CRS-2674: Start of 'ora.cssd' on 'rac2' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'rac2'
CRS-2681: Clean of 'ora.cssd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac2'
CRS-2677: Stop of 'ora.gipcd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'rac2'
CRS-2677: Stop of 'ora.cssdmonitor' on 'rac2' succeeded
CRS-5804: Communication error with agent process
CRS-4000: Command Start failed, or completed with errors.
Failed to start Oracle Grid Infrastructure stack
Failed to start Cluster Synchorinisation Service in clustered mode at /u01/app/11.2.0/grid/crs/install/crsconfig_lib.pm line 1278.
/u01/app/11.2.0/grid/perl/bin/perl -I/u01/app/11.2.0/grid/perl/lib -I/u01/app/11.2.0/grid/crs/install /u01/app/11.2.0/grid/crs/install/rootcrs.pl execution failed

Still failing. Note CRS-4402: node 2's CSS daemon found the active cluster on rac1 but then could not join it (ora.cssd failed to start), which in hindsight already pointed at the private interconnect.
7. CSSD would not start on the second node. Look for the cssd log file under the $GRID_HOME/log/rac2 subdirectory and examine it:
/u01/app/11.2.0/grid/log/rac2/cssd
2019-10-12 15:41:19.013: [ CSSD][3199571712]clssgmDiscEndpcl: gipcDestroy 0x8a28
2019-10-12 15:41:19.064: [ CSSD][3181754112]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2019-10-12 15:41:19.844: [ CSSD][3186484992]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 464729747, wrtcnt, 8055111, LATS 336904, lastSeqNo 8055110, uniqueness 1569234927, timestamp 1570866136/3845241248
2019-10-12 15:41:20.064: [ CSSD][3181754112]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2019-10-12 15:41:20.845: [ CSSD][3186484992]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 464729747, wrtcnt, 8055112, LATS 337904, lastSeqNo 8055111, uniqueness 1569234927, timestamp 1570866137/3845242248
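The key lines are the repeated "has a disk HB, but no network HB": rac1 is visible through the voting disk but not over the interconnect. A sketch for pulling just those lines out of the log (the path follows the 11.2 layout shown above):

# scan the CSS log for heartbeat complaints about the other node
grep -i 'no network HB' /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log | tail -20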
8. Check node 2's heartbeat over the private interconnect:
[grid@rac2 /]$ ping 20.20.20.201    # node 1's private (interconnect) IP
PING 20.20.20.201 (20.20.20.201) 56(84) bytes of data.
From 20.20.20.202 icmp_seq=1 Destination Host Unreachable
From 20.20.20.202 icmp_seq=2 Destination Host Unreachable
From 20.20.20.202 icmp_seq=3 Destination Host Unreachable
From 20.20.20.202 icmp_seq=4 Destination Host Unreachable
The heartbeat network is unreachable... exasperating. According to the customer, node 1's heartbeat had already failed several times before, so a faulty NIC was the prime suspect.
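Before swapping hardware, it is worth confirming which interface clusterware actually registered for the interconnect and whether its link is up; a sketch (eth1 as the private NIC name is an assumption):

# show the public/cluster_interconnect interface registration (run as grid)
/u01/app/11.2.0/grid/bin/oifcfg getif
# check physical link state of the suspected private NIC
ethtool eth1 | grep 'Link detected'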
With the customer's consent, we first restarted node 1's NIC and then restarted the services; after that, the services on both node 1 and node 2 came up normally.
We advised the customer to replace the NIC afterwards to eliminate the underlying risk.
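For reference, the recovery amounted to something like the following sketch (again, eth1 as the private NIC name is an assumption):

# on rac1, as root: bounce the private NIC
ifdown eth1 && ifup eth1
# then restart clusterware on each node, as root
/u01/app/11.2.0/grid/bin/crsctl stop crs
/u01/app/11.2.0/grid/bin/crsctl start crs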
9. The whole detour came down to a heartbeat problem. Troubleshooting calls for bold hypotheses and careful verification: rule out the plausible causes one by one and follow the trail until you catch the root cause.