Resolving a RAC node startup failure caused by a HAIP problem


A reader asked about a problem with his 11.2.0.2 RAC (on AIX), which had no patches or PSUs installed. One node could not start normally after a reboot. The ocssd log looked like this:

-08-09 14:21:46.094: [    CSSD][5414]clssnmSendingThread: sent 4 join msgs to all nodes
-08-09 14:21:46.421: [    CSSD][4900]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
-08-09 14:21:47.042: [    CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958157, LATS 1518247992, lastSeqNo 255958154, uniqueness 1406064021, timestamp 1407565306/1501758072
-08-09 14:21:47.051: [    CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958158, LATS 1518248002, lastSeqNo 255958155, uniqueness 1406064021, timestamp 1407565306/1501758190
-08-09 14:21:47.421: [    CSSD][4900]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
-08-09 14:21:48.042: [    CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958160, LATS 1518248993, lastSeqNo 255958157, uniqueness 1406064021, timestamp 1407565307/1501759080
-08-09 14:21:48.052: [    CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958161, LATS 1518249002, lastSeqNo 255958158, uniqueness 1406064021, timestamp 1407565307/1501759191
-08-09 14:21:48.421: [    CSSD][4900]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
-08-09 14:21:49.043: [    CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958163, LATS 1518249993, lastSeqNo 255958160, uniqueness 1406064021, timestamp 1407565308/1501760082
-08-09 14:21:49.056: [    CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958164, LATS 1518250007, lastSeqNo 255958161, uniqueness 1406064021, timestamp 1407565308/1501760193
-08-09 14:21:49.421: [    CSSD][4900]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
-08-09 14:21:50.044: [    CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958166, LATS 1518250994, lastSeqNo 255958163, uniqueness 1406064021, timestamp 1407565309/1501761090
-08-09 14:21:50.057: [    CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958167, LATS 1518251007, lastSeqNo 255958164, uniqueness 1406064021, timestamp 1407565309/1501761195
-08-09 14:21:50.421: [    CSSD][4900]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
-08-09 14:21:51.046: [    CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958169, LATS 1518251996, lastSeqNo 255958166, uniqueness 1406064021, timestamp 1407565310/1501762100
-08-09 14:21:51.057: [    CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958170, LATS 1518252008, lastSeqNo 255958167, uniqueness 1406064021, timestamp 1407565310/1501762205
-08-09 14:21:51.102: [    CSSD][5414]clssnmSendingThread: sending join msg to all nodes
-08-09 14:21:51.102: [    CSSD][5414]clssnmSendingThread: sent 5 join msgs to all nodes
-08-09 14:21:51.421: [    CSSD][4900]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
-08-09 14:21:52.050: [    CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958172, LATS 1518253000, lastSeqNo 255958169, uniqueness 1406064021, timestamp 1407565311/1501763110
-08-09 14:21:52.058: [    CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958173, LATS 1518253008, lastSeqNo 255958170, uniqueness 1406064021, timestamp 1407565311/1501763230
-08-09 14:21:52.089: [    CSSD][5671]clssnmRcfgMgrThread: Local Join
-08-09 14:21:52.089: [    CSSD][5671]clssnmLocalJoinEvent: begin on node(2), waittime 193000
-08-09 14:21:52.089: [    CSSD][5671]clssnmLocalJoinEvent: set curtime (1518253039) for my node
-08-09 14:21:52.089: [    CSSD][5671]clssnmLocalJoinEvent: scanning 32 nodes
-08-09 14:21:52.089: [    CSSD][5671]clssnmLocalJoinEvent: Node rac01, number 1, is in an existing cluster with disk state 3
-08-09 14:21:52.090: [    CSSD][5671]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
-08-09 14:21:52.431: [    CSSD][4900]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0

 


The messages above make it easy to conclude that this is a heartbeat problem. That reading is not wrong, but the heartbeat involved here is not the traditional heartbeat network we usually mean. I asked him to run the following query on a node where CRS was still healthy, and the cause became clear:
SQL> select name,ip_address from v$cluster_interconnects;

NAME            IP_ADDRESS
--------------- ----------------
en0             169.254.116.242

Why is the interconnect IP in the 169 range? It obviously does not match what is configured in /etc/hosts. Why?

This is where the HAIP feature introduced in Oracle 11gR2 comes in. Oracle introduced it to provide interconnect redundancy through its own technology, instead of relying on third-party mechanisms such as Linux bonding.

Before Oracle 11.2.0.2, if OS-level NIC bonding was configured for the interconnect, Oracle used the OS bond. Starting with 11.2.0.2, Oracle's own HAIP is enabled even when no OS-level redundancy has been configured. So although you configured 192.168.1.100, Oracle actually uses an address in the 169.254 range. You can confirm this in the alert log, which records it at startup; I will not go into detail here.
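The 169.254 range is the IPv4 link-local block (169.254.0.0/16, defined in RFC 3927), which is the block HAIP allocates from; that is why the address bears no relation to /etc/hosts. A minimal sketch of the prefix test, using the address returned by the query above:

```shell
# HAIP hands out addresses from the IPv4 link-local block
# 169.254.0.0/16, so a simple prefix test identifies them.
ip="169.254.116.242"   # value reported by v$cluster_interconnects
case "$ip" in
  169.254.*) echo "link-local: allocated by HAIP" ;;
  *)         echo "static interconnect address" ;;
esac
```

On the problem node this distinction is moot, since the HAIP resource never started and no such address exists at all.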

On the healthy node you can see an IP in the 169 range as shown above, while the problem node has no such address.
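No output for this comparison survived in the original post, so here is a hedged sketch of how the check looks. The interface listings below are fabricated samples (the 192.168.1.x and 169.254.x values are assumptions) so the filter can be demonstrated offline; on a live node you would pipe real `ifconfig -a` output instead:

```shell
# On each node you would run something like:
#   ifconfig -a | grep '169\.254\.'
# A healthy node shows a 169.254.x.x alias on the interconnect NIC;
# the problem node shows none. Sample captured output for illustration:
good_node="inet 192.168.1.100 netmask 0xffffff00
inet 169.254.116.242 netmask 0xffff0000"
bad_node="inet 192.168.1.101 netmask 0xffffff00"

printf '%s\n' "$good_node" | grep -c '169\.254\.'          # prints 1
printf '%s\n' "$bad_node"  | grep -c '169\.254\.' || true  # prints 0
```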

Oracle MOS suggests one way to bring HAIP back up:

crsctl start res ora.cluster_interconnect.haip -init

In my testing, this did not work even when run as root. For HAIP failing to start, the Oracle MOS notes generally list the following causes:

1) Interconnect NIC failure

2) Multicast not working properly

3) Firewall and similar causes

4) Oracle bugs

A failed interconnect NIC is easy to rule out: if there is only one interconnect NIC, simply pinging the other nodes' private addresses verifies it.
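That verification can be scripted. A minimal sketch, assuming hypothetical /etc/hosts entries rac01-priv and rac02-priv for the private interconnect addresses (ping option spellings vary slightly between Linux and AIX):

```shell
# Probe each peer's private-interconnect address; the host names
# rac01-priv and rac02-priv are hypothetical /etc/hosts entries.
check_priv() {
  # two echo requests, discard output, report by exit status
  if ping -c 2 "$1" >/dev/null 2>&1; then
    echo "$1 reachable"
  else
    echo "$1 UNREACHABLE"
  fi
}

for h in rac01-priv rac02-priv; do
  check_priv "$h"
done
```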

For multicast problems, Oracle provides the mcasttest.pl script (see "Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement" (Doc ID 1212703.1)). My check results were as follows:

$ ./mcasttest.pl -n rac02,rac01 -i en0
###########  Setup for node rac02  ##########
Checking node access 'rac02'
Checking node login 'rac02'
Checking/Creating Directory /tmp/mcasttest for binary on node 'rac02'
Distributing mcast2 binary to node 'rac02'
###########  Setup for node rac01  ##########
Checking node access 'rac01'
Checking node login 'rac01'
Checking/Creating Directory /tmp/mcasttest for binary on node 'rac01'
Distributing mcast2 binary to node 'rac01'
###########  testing Multicast on all nodes  ##########
Test for Multicast address 230.0.1.0
Aug 11 21:39:39 | Multicast Failed for en0 using address 230.0.1.0:42000
Test for Multicast address 224.0.0.251
Aug 11 21:40:09 | Multicast Failed for en0 using address 224.0.0.251:42001
$

The script reports that multicast fails for both 230.0.1.0 and 224.0.0.251, but that does not necessarily prove multicast is the root cause, even though searching ocssd.log for the keyword "mcast" does turn up related messages.

In fact, in my own 11.2.0.3 Linux RAC environment, CRS starts normally even when the mcasttest.pl checks fail.

Since this environment is AIX, I ruled out the firewall. That left Bug 9974223 as the most likely suspect. In fact, if you search for HAIP-related information, you will find the feature has quite a few associated Oracle bugs.

The note "Known HAIP issues in 11gR2/12c Grid Infrastructure" (Doc ID 1640865.1) alone records 12 HAIP-related bugs.

Because his first node was in production and could not be touched, we had to keep operations to a minimum for safety.

If you are not using multiple interconnect NICs, I think HAIP can be disabled entirely. The MOS documentation I checked yesterday says it cannot be disabled, but my own testing shows that it can. Here is my test:

[root@rac1 bin]# ./crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
[root@rac1 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'rac1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.oc4j' on 'rac1'
CRS-2673: Attempting to stop 'ora.cvu' on 'rac1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2673: Attempting to stop 'ora.GRID.dg' on 'rac1'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.rac1.vip' on 'rac1'
CRS-2677: Stop of 'ora.rac1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.rac1.vip' on 'rac2'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac1'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac2'
CRS-2676: Start of 'ora.rac1.vip' on 'rac2' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2677: Stop of 'ora.registry.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.oc4j' on 'rac1' succeeded
CRS-2677: Stop of 'ora.cvu' on 'rac1' succeeded
CRS-2677: Stop of 'ora.GRID.dg' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.ons' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'rac1'
CRS-2677: Stop of 'ora.net1.network' on 'rac1' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'rac1' has completed
CRS-2677: Stop of 'ora.crsd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'rac1'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac1'
CRS-2677: Stop of 'ora.mdnsd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac1'
CRS-2677: Stop of 'ora.cssd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac1'
CRS-2677: Stop of 'ora.gipcd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac1'
CRS-2677: Stop of 'ora.gpnpd' on 'rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@rac1 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@rac1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@rac1 bin]# ./crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
        ONLINE  ONLINE       rac1                     Started
ora.cluster_interconnect.haip
        ONLINE  OFFLINE
ora.crf
        ONLINE  ONLINE       rac1
ora.crsd
        ONLINE  ONLINE       rac1
ora.cssd
        ONLINE  ONLINE       rac1
ora.cssdmonitor
        ONLINE  ONLINE       rac1
ora.ctssd
        ONLINE  ONLINE       rac1                     ACTIVE:0
ora.diskmon
        OFFLINE OFFLINE
ora.drivers.acfs
        ONLINE  ONLINE       rac1
ora.evmd
        ONLINE  ONLINE       rac1
ora.gipcd
        ONLINE  ONLINE       rac1
ora.gpnpd
        ONLINE  ONLINE       rac1
ora.mdnsd
        ONLINE  ONLINE       rac1
[root@rac1 bin]#

 

Note: after making this change, CRS must be restarted on both nodes. Restarting CRS on only one node is not enough: the ASM instance will fail to start.

Why does a HAIP failure keep a node's CRS from starting normally? A look at the resource's attributes makes it clear:

NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:oracle:r-x
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
AUTO_START=always
CARDINALITY=1
CHECK_INTERVAL=30
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION="Resource type for a Highly Available network IP"
ENABLED=0
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=balanced
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_DEPENDENCIES=hard(ora.gpnpd,ora.cssd)pullup(ora.cssd)
START_TIMEOUT=60
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(ora.cssd)
STOP_TIMEOUT=0
UPTIME_THRESHOLD=1m
USR_ORA_AUTO=
USR_ORA_IF=
USR_ORA_IF_GROUP=cluster_interconnect
USR_ORA_IF_THRESHOLD=20
USR_ORA_NETMASK=
USR_ORA_SUBNET=

 

As the START_DEPENDENCIES and STOP_DEPENDENCIES attributes show, HAIP is wired into the stack through hard dependencies on ora.gpnpd and ora.cssd (with a pullup on ora.cssd), so when this resource misbehaves the startup sequence around gpnpd and cssd cannot complete normally.

Note: you can also avoid HAIP problems by explicitly setting the cluster_interconnects parameter at the ASM level.
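A hedged sketch of that workaround follows; the 192.168.1.x addresses and the +ASM1/+ASM2 SID names are assumptions, and the instances must be restarted for the change to take effect:

```sql
-- Sketch: pin a static interconnect address per ASM instance so the
-- instances no longer pick up the HAIP 169.254.x address.
-- The addresses and SID names below are assumptions.
ALTER SYSTEM SET cluster_interconnects = '192.168.1.100'
  SCOPE = SPFILE SID = '+ASM1';
ALTER SYSTEM SET cluster_interconnects = '192.168.1.101'
  SCOPE = SPFILE SID = '+ASM2';
```

The same parameter can also be set in the database instances. Keep in mind that pinning a single address gives up the failover HAIP was meant to provide, so this is a workaround rather than a fix.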

