A reader asked me about a problem with his 11.2.0.2 RAC (on AIX), which had no patches or PSUs installed. One of the nodes could not come up normally after a reboot. The ocssd log looked like this:
2014-08-09 14:21:46.094: [ CSSD][5414]clssnmSendingThread: sent 4 join msgs to all nodes
2014-08-09 14:21:46.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:47.042: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958157, LATS 1518247992, lastSeqNo 255958154, uniqueness 1406064021, timestamp 1407565306/1501758072
2014-08-09 14:21:47.051: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958158, LATS 1518248002, lastSeqNo 255958155, uniqueness 1406064021, timestamp 1407565306/1501758190
2014-08-09 14:21:47.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:48.042: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958160, LATS 1518248993, lastSeqNo 255958157, uniqueness 1406064021, timestamp 1407565307/1501759080
2014-08-09 14:21:48.052: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958161, LATS 1518249002, lastSeqNo 255958158, uniqueness 1406064021, timestamp 1407565307/1501759191
2014-08-09 14:21:48.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:49.043: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958163, LATS 1518249993, lastSeqNo 255958160, uniqueness 1406064021, timestamp 1407565308/1501760082
2014-08-09 14:21:49.056: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958164, LATS 1518250007, lastSeqNo 255958161, uniqueness 1406064021, timestamp 1407565308/1501760193
2014-08-09 14:21:49.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:50.044: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958166, LATS 1518250994, lastSeqNo 255958163, uniqueness 1406064021, timestamp 1407565309/1501761090
2014-08-09 14:21:50.057: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958167, LATS 1518251007, lastSeqNo 255958164, uniqueness 1406064021, timestamp 1407565309/1501761195
2014-08-09 14:21:50.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:51.046: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958169, LATS 1518251996, lastSeqNo 255958166, uniqueness 1406064021, timestamp 1407565310/1501762100
2014-08-09 14:21:51.057: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958170, LATS 1518252008, lastSeqNo 255958167, uniqueness 1406064021, timestamp 1407565310/1501762205
2014-08-09 14:21:51.102: [ CSSD][5414]clssnmSendingThread: sending join msg to all nodes
2014-08-09 14:21:51.102: [ CSSD][5414]clssnmSendingThread: sent 5 join msgs to all nodes
2014-08-09 14:21:51.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:52.050: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958172, LATS 1518253000, lastSeqNo 255958169, uniqueness 1406064021, timestamp 1407565311/1501763110
2014-08-09 14:21:52.058: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958173, LATS 1518253008, lastSeqNo 255958170, uniqueness 1406064021, timestamp 1407565311/1501763230
2014-08-09 14:21:52.089: [ CSSD][5671]clssnmRcfgMgrThread: Local Join
2014-08-09 14:21:52.089: [ CSSD][5671]clssnmLocalJoinEvent: begin on node(2), waittime 193000
2014-08-09 14:21:52.089: [ CSSD][5671]clssnmLocalJoinEvent: set curtime (1518253039) for my node
2014-08-09 14:21:52.089: [ CSSD][5671]clssnmLocalJoinEvent: scanning 32 nodes
2014-08-09 14:21:52.089: [ CSSD][5671]clssnmLocalJoinEvent: Node rac01, number 1, is in an existing cluster with disk state 3
2014-08-09 14:21:52.090: [ CSSD][5671]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2014-08-09 14:21:52.431: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
From the information above, it is easy to get the impression that this is a heartbeat problem. That reading is not wrong, except that the heartbeat here is not the traditional heartbeat network we usually have in mind. I asked him to run the following query on a node where CRS was still healthy, and the cause became clear:
SQL> select name,ip_address from v$cluster_interconnects;

NAME            IP_ADDRESS
--------------- ----------------
en0             169.254.116.242
As you can see, the interconnect IP here is in the 169.254 range. That clearly does not match what is configured in /etc/hosts. Why?
This is where the HAIP feature introduced in Oracle 11gR2 comes in. Oracle introduced it to provide interconnect redundancy with its own technology, rather than relying on third-party mechanisms such as Linux bonding.
Before 11.2.0.2, if the interconnect NICs were bonded at the OS level, Oracle simply honored the OS bonding. Starting with 11.2.0.2, even when no interconnect redundancy is configured at the OS level, Oracle's own HAIP kicks in. So although you configured 192.168.1.100, what Oracle actually uses is a link-local address in the 169.254 range. You can confirm this for yourself in the alert log, so I will not dwell on it here.
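You can also cross-check the mismatch directly. A minimal sketch (the interface names and subnets below are illustrative, not taken from this system):

# illustrative: list the interfaces registered with the clusterware
$ oifcfg getif
en0  192.168.1.0  global  cluster_interconnect
en1  10.10.10.0   global  public

oifcfg still reports the configured 192.168.1.0 subnet, while v$cluster_interconnects, as queried above, reports the 169.254.x.x HAIP address that the instances actually use.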
We could see that the healthy node had an address in the 169.254 range on the interconnect NIC, while the problem node did not.
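The check itself is just a matter of looking for a 169.254 alias on the interconnect NIC. A sketch, assuming en0 is the interconnect interface (the address and netmask shown are illustrative):

# on the healthy node, the HAIP alias shows up on the interconnect NIC
$ ifconfig en0 | grep 169.254
inet 169.254.116.242 netmask 0xffff0000 broadcast 169.254.255.255
# on the problem node, the same command returns nothing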
Oracle MOS offers a workaround for this, which is to start the HAIP resource manually:
crsctl start res ora.cluster_interconnect.haip -init
We tested it, and even running the command as root did not help. For HAIP failing to start, the Oracle MOS documentation lists a few typical causes:
1) A faulty interconnect NIC
2) Multicast not working properly
3) A firewall or similar interference
4) An Oracle bug
A faulty interconnect NIC is easy to rule out: if there is only one interconnect NIC, simply pinging the other private IPs verifies it.
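For example, from the problem node (assuming 192.168.1.101 is the peer node's private address; substitute your own):

# verify basic connectivity over the private network
$ ping -c 3 192.168.1.101

If the private address answers normally, the NIC itself can be ruled out.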
For multicast, you can test with the mcasttest.pl script provided by Oracle (see "Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement" (Doc ID 1212703.1)). My check here returned the following:
$ ./mcasttest.pl -n rac02,rac01 -i en0
########### Setup for node rac02 ##########
Checking node access 'rac02'
Checking node login 'rac02'
Checking/Creating Directory /tmp/mcasttest for binary on node 'rac02'
Distributing mcast2 binary to node 'rac02'
########### Setup for node rac01 ##########
Checking node access 'rac01'
Checking node login 'rac01'
Checking/Creating Directory /tmp/mcasttest for binary on node 'rac01'
Distributing mcast2 binary to node 'rac01'
########### testing Multicast on all nodes ##########
Test for Multicast address 230.0.1.0
Aug 11 21:39:39 | Multicast Failed for en0 using address 230.0.1.0:42000
Test for Multicast address 224.0.0.251
Aug 11 21:40:09 | Multicast Failed for en0 using address 224.0.0.251:42001
$
Although the script shows that multicast fails for both the 230.0.1.0 and 224.0.0.251 addresses, that by itself does not prove multicast is the root cause. Searching ocssd.log for the keyword mcast does turn up the related messages, though.
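A quick way to pull those messages out, assuming the standard 11.2 log layout $GRID_HOME/log/&lt;hostname&gt;/cssd (rac02 stands in for the problem node's hostname):

# show the most recent multicast-related entries in ocssd.log
$ grep -i mcast $GRID_HOME/log/rac02/cssd/ocssd.log | tail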
In fact, in my own 11.2.0.3 Linux RAC environment, CRS starts normally even when the mcasttest.pl tests fail.
Since this environment is AIX, I ruled out the firewall. That left Bug 9974223 as the most likely suspect. Indeed, if you look up HAIP, you will find that the feature is associated with quite a few Oracle bugs.
"Known HAIP issues in 11gR2/12c Grid Infrastructure" (Doc ID 1640865.1) alone records 12 HAIP-related bugs.
Because node 1 of his cluster could not be touched, for safety's sake we could not experiment much further on that system.
As for HAIP itself: if you are not using multiple interconnect NICs, I believe it can be disabled entirely. The MOS documentation I checked yesterday says explicitly that it cannot be disabled, but my own test shows that it can. Here is my test run:
[root@rac1 bin]# ./crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
[root@rac1 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'rac1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.oc4j' on 'rac1'
CRS-2673: Attempting to stop 'ora.cvu' on 'rac1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2673: Attempting to stop 'ora.GRID.dg' on 'rac1'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.rac1.vip' on 'rac1'
CRS-2677: Stop of 'ora.rac1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.rac1.vip' on 'rac2'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac1'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac2'
CRS-2676: Start of 'ora.rac1.vip' on 'rac2' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2677: Stop of 'ora.registry.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.oc4j' on 'rac1' succeeded
CRS-2677: Stop of 'ora.cvu' on 'rac1' succeeded
CRS-2677: Stop of 'ora.GRID.dg' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.ons' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'rac1'
CRS-2677: Stop of 'ora.net1.network' on 'rac1' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'rac1' has completed
CRS-2677: Stop of 'ora.crsd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'rac1'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac1'
CRS-2677: Stop of 'ora.mdnsd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac1'
CRS-2677: Stop of 'ora.cssd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac1'
CRS-2677: Stop of 'ora.gipcd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac1'
CRS-2677: Stop of 'ora.gpnpd' on 'rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@rac1 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@rac1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@rac1 bin]# ./crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                     Started
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE
ora.crf
      1        ONLINE  ONLINE       rac1
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                     ACTIVE:0
ora.diskmon
      1        OFFLINE OFFLINE
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1
[root@rac1 bin]#
Note, however, that after making this change you must restart CRS on both nodes. Restarting CRS on only one node does not work: the ASM instance will fail to start.
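Once both nodes are back up, you can confirm that the instances have fallen back from HAIP to the configured private addresses. A hedged check (run from either instance; with HAIP disabled, the output should show the 192.168.x.x private addresses rather than 169.254.x.x):

-- compare the interconnect address used by every instance
SQL> select inst_id, name, ip_address from gv$cluster_interconnects;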
Why does a HAIP failure keep a node's CRS stack from starting normally? A look at the resource's attributes makes this clear.
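A profile like the one below can be dumped with a command along these lines, run as root from GRID_HOME/bin:

# print the full profile of the HAIP resource
[root@rac1 bin]# ./crsctl stat res ora.cluster_interconnect.haip -init -p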
NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:oracle:r-x
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
AUTO_START=always
CARDINALITY=1
CHECK_INTERVAL=30
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION="Resource type for a Highly Available network IP"
ENABLED=0
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=balanced
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_DEPENDENCIES=hard(ora.gpnpd,ora.cssd)pullup(ora.cssd)
START_TIMEOUT=60
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(ora.cssd)
STOP_TIMEOUT=0
UPTIME_THRESHOLD=1m
USR_ORA_AUTO=
USR_ORA_IF=
USR_ORA_IF_GROUP=cluster_interconnect
USR_ORA_IF_THRESHOLD=20
USR_ORA_NETMASK=
USR_ORA_SUBNET=
As you can see from START_DEPENDENCIES and STOP_DEPENDENCIES, the HAIP resource is bound to gpnpd and cssd through hard and pullup dependencies, so when HAIP is abnormal, gpnpd and cssd run into problems as well, and the stack cannot come up cleanly.
Note: in practice you can also avoid HAIP problems by specifying cluster_interconnects at the ASM level.
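A minimal sketch of that approach (the instance names and addresses are illustrative; each ASM instance gets its own private address, and CRS must be restarted for the change to take effect):

-- pin each ASM instance to a specific private address in the spfile
SQL> alter system set cluster_interconnects='192.168.1.100' scope=spfile sid='+ASM1';
SQL> alter system set cluster_interconnects='192.168.1.101' scope=spfile sid='+ASM2';

With cluster_interconnects set explicitly, ASM no longer picks up the address that HAIP publishes.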