A Slightly Embarrassing Cluster Startup Troubleshooting Session


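Because my job role has changed, it has been a long time since I troubleshot a failure hands-on. Today's case was not in production either: I needed a temporary 11g RAC environment to verify some tests, and to save time I simply copied a VirtualBox environment I had backed up earlier. After booting the machines, I found that the cluster would not start: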

[root@jystdrac1 ~]# su - grid
[grid@jystdrac1 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[grid@jystdrac1 ~]$ crsctl stat res -t -init
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

The cluster alert log shows the following errors:

[grid@jystdrac1 jystdrac1]$ pwd
/opt/app/11.2.0/grid/log/jystdrac1
[grid@jystdrac1 jystdrac1]$ tail -20f alertjystdrac1.log
2021-07-01 00:26:27.379:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4526)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:26:31.384:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:28:32.889:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4568)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:28:36.895:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:28:38.424:
[mdnsd(4644)]CRS-5602:mDNS service stopping by request.
2021-07-01 00:30:38.407:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4633)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:30:42.412:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:32:43.923:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4676)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:32:47.928:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:32:49.455:
[mdnsd(4822)]CRS-5602:mDNS service stopping by request.
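
The alert log repeats the same pattern every few minutes, so it can help to tally which resources are timing out before opening the individual daemon logs. The following is a minimal sketch, assuming the CRS-2757 message keeps the wording shown above; `count_start_timeouts` is a hypothetical helper name, not an Oracle utility.

```shell
# Count CRS-2757 start timeouts per resource in a clusterware alert log.
# count_start_timeouts is a hypothetical helper name, not an Oracle tool.
# Usage: count_start_timeouts /path/to/alert<node>.log
count_start_timeouts() {
    # Keep only the timeout lines, extract the quoted resource name,
    # then count how often each name occurs.
    grep 'CRS-2757' "$1" \
        | sed -n "s/.*from the resource '\([^']*\)'.*/\1/p" \
        | sort | uniq -c | sort -rn
}
```

Run against the excerpt above, this reports two timeouts each for ora.mdnsd and ora.gpnpd, which points at the daemons that ohasd starts first rather than at CRS itself.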

Next, look at the most recent errors in mdnsd.log (the gpnpd log is similar and is omitted here to save space):

[grid@jystdrac1 mdnsd]$ pwd
/opt/app/11.2.0/grid/log/jystdrac1/mdnsd
[grid@jystdrac1 mdnsd]$ tail -20 mdnsd.log
2021-06-30 22:50:59.275: [    MDNS][1534236416] mdnsd exit
2021-06-30 22:53:03.989: [ default][1342412544]

================================================================================
2021-06-30 22:53:03.989: [ default][1342412544]mdnsd START pid=2201
[  clsdmt][1335961344]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
2021-06-30 22:53:03.991: [  clsdmt][1335961344]PID for the Process [2201], connkey 9
2021-06-30 22:53:03.991: [  clsdmt][1335961344]Creating PID [2201] file for home /opt/app/11.2.0/grid host jystdrac1 bin mdns to /opt/app/11.2.0/grid/mdns/init/
2021-06-30 22:53:03.992: [  clsdmt][1335961344]Writing PID [2201] to the file [/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid]
2021-06-30 22:53:03.992: [  clsdmt][1335961344]Failed to record pid for MDNSD
2021-06-30 22:53:03.992: [  clsdmt][1335961344]Terminating process
2021-06-30 22:53:03.992: [    MDNS][1335961344] clsdm requested mdnsd exit
2021-06-30 22:53:03.992: [    MDNS][1335961344] mdnsd exit
2021-06-30 22:57:14.236: [ default][747345664]

================================================================================
2021-06-30 22:57:14.236: [ default][747345664]mdnsd START pid=2375
[  clsdmt][740894464]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
2021-06-30 22:57:14.239: [  clsdmt][740894464]PID for the Process [2375], connkey 9
2021-06-30 22:57:14.239: [  clsdmt][740894464]Cr[grid@jystdrac1 mdnsd]$
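
The telling line in this log is "Failed to record pid for MDNSD". To see whether other daemons hit the same failure, the log directory can be searched for that text. This is a rough sketch, assuming the other daemon logs use the same "Failed to record pid" wording (the post only says the gpnpd log is similar); `logs_with_pid_failure` is a hypothetical helper name.

```shell
# List every file under a log directory that contains the pid-write failure.
# logs_with_pid_failure is a hypothetical helper name.
# Usage: logs_with_pid_failure /opt/app/11.2.0/grid/log/jystdrac1
logs_with_pid_failure() {
    # -r: search recursively, -l: print only the names of matching files
    grep -rl 'Failed to record pid' "$1" | sort
}
```

If several unrelated daemons all fail at the same step, a cause shared by the whole file system becomes more likely than a problem with one particular pid file.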

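MOS also has an article that covers the five most common reasons a RAC cluster fails to start:

  • Top 5 Grid Infrastructure Startup Issues (Doc ID 1526147.1)

Issue 4 in that article, "Agents or mdnsd.bin, gpnpd.bin, gipcd.bin not running", closely matches the current symptoms.

The document describes the possible causes and the corresponding solutions:

Possible causes:

1. orarootagent is missing execute permission
2. The process's <node>.pid file is missing, or its ownership/permissions are wrong
3. The GRID_HOME ownership/permissions are wrong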

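Solutions:

1. Compare the ownership/permissions with a known-good GRID_HOME and correct them accordingly, or run the following as root:
   # cd <GRID_HOME>/crs/install
   # ./rootcrs.pl -unlock
   # ./rootcrs.pl -patch
This stops the clusterware, sets the ownership/permissions of the files that need it to root, and restarts the clusterware.
2. If the corresponding <node>.pid file does not exist, create it with touch and give it the appropriate ownership/permissions; otherwise, correct the ownership/permissions of the existing <node>.pid file as required. Then restart the clusterware.
The following <node>.pid files under <GRID_HOME> are owned by root:root with permissions 644: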
  ./ologgerd/init/<node>.pid
  ./osysmond/init/<node>.pid
  ./ctss/init/<node>.pid
  ./ohasd/init/<node>.pid
  ./crs/init/<node>.pid
The following are owned by <grid>:oinstall with permissions 644:
  ./mdns/init/<node>.pid  
  ./evm/init/<node>.pid
  ./gipc/init/<node>.pid
  ./gpnp/init/<node>.pid

3. For cause 3, see solution 1.
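
Checking nine pid files by hand is tedious, so the two lists from the MOS note can be walked with a small script. This is only a sketch under stated assumptions: it relies on GNU `stat` (the `-c` option, as on Linux), and the function names and the three arguments (Grid home, node name, Grid owner) are illustrative values to adapt, not part of any Oracle tooling.

```shell
# Compare a file's owner:group and octal mode with the expected values.
# Relies on GNU stat (-c), as found on Linux. Hypothetical helper.
check_pid_file() {
    local file="$1" want_owner="$2" want_mode="$3" actual
    if [ ! -e "$file" ]; then
        echo "MISSING  $file"
        return 1
    fi
    actual=$(stat -c '%U:%G %a' "$file")
    if [ "$actual" = "$want_owner $want_mode" ]; then
        echo "OK       $file"
    else
        echo "MISMATCH $file (have $actual, want $want_owner $want_mode)"
        return 1
    fi
}

# Walk both lists from the MOS note. All three arguments are assumptions
# that must match the local installation.
check_all_pid_files() {
    local grid_home="$1" node="$2" grid_owner="$3" d
    for d in ologgerd osysmond ctss ohasd crs; do
        check_pid_file "$grid_home/$d/init/$node.pid" root:root 644
    done
    for d in mdns evm gipc gpnp; do
        check_pid_file "$grid_home/$d/init/$node.pid" "$grid_owner:oinstall" 644
    done
}
```

For the environment in this post the call would be `check_all_pid_files /opt/app/11.2.0/grid jystdrac1 grid`. Note that this only verifies ownership and mode; it says nothing about whether a write to the file can actually succeed.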

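Yet I checked each of these in turn and found nothing wrong. Strange: if the permissions are all correct, why can't the file be written?

What happens if I try to edit the file manually with vi?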

[grid@jystdrac1 jystdrac1]$ vi /opt/app/11.2.0/grid/mdns/init/jystdrac1.pid
2201

Saving the file fails with an error:

"/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid"
"/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid" E514: write error (file system full?)
Press ENTER or type command to continue

What? The file system is full???

[grid@jystdrac1 jystdrac1]$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_linuxbase-lv_root   28G   27G     0 100% /
tmpfs                             1.5G     0  1.5G   0% /dev/shm
/dev/sda1                         485M   39M  421M   9% /boot

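Well, sure enough... How embarrassing: it turned out to be the most basic problem of all, disk space.
After freeing up some space, I restarted the cluster to see whether it would start normally.
It's OK!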
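
In hindsight, a quick free-space check before digging into permissions would have saved time. The sketch below shows one way to script that check and to find candidates for cleanup. The post does not say what was actually deleted, so the helper names and any thresholds are assumptions for illustration only.

```shell
# Print the space (in KiB) available to the current user on the file
# system that holds the given path. -P forces one line per file system.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least min_kb KiB are available at the path.
has_free_space() {
    local path="$1" min_kb="$2" avail
    avail=$(avail_kb "$path")
    [ -n "$avail" ] && [ "$avail" -ge "$min_kb" ]
}

# Show the n largest directories (sizes in KiB) under a starting
# directory, staying on one file system (-x) so other mounts are skipped.
largest_dirs() {
    du -xk "$1" 2>/dev/null | sort -rn | head -n "$2"
}
```

For example, `has_free_space /opt/app/11.2.0/grid 1048576 || echo "less than 1 GiB free"` could run before starting the clusterware, and `largest_dirs / 15` (as root) lists where the space has gone.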

