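Since my role changed, I have not done hands-on troubleshooting for quite a while. Today's case is not production either: it is an 11g RAC environment I put together temporarily to validate some tests. To save time I simply copied a VirtualBox environment I had backed up earlier, but after booting the machines I found that the cluster would not start: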
[root@jystdrac1 ~]# su - grid
[grid@jystdrac1 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[grid@jystdrac1 ~]$ crsctl stat res -t -init
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.
The cluster alert log shows these errors:
[grid@jystdrac1 jystdrac1]$ pwd
/opt/app/11.2.0/grid/log/jystdrac1
[grid@jystdrac1 jystdrac1]$ tail -20f alertjystdrac1.log
2021-07-01 00:26:27.379:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4526)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:26:31.384:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:28:32.889:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4568)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:28:36.895:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:28:38.424:
[mdnsd(4644)]CRS-5602:mDNS service stopping by request.
2021-07-01 00:30:38.407:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4633)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:30:42.412:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:32:43.923:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4676)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:32:47.928:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:32:49.455:
[mdnsd(4822)]CRS-5602:mDNS service stopping by request.
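Since the same pair of errors keeps repeating, it can help to tally which resources are timing out before digging into the per-daemon logs. A minimal sketch, assuming the CRS-2757 message format shown in the excerpt above; the function name summarize_start_timeouts is made up for illustration and is not part of any Oracle tooling:

```shell
# Hypothetical helper: count CRS-2757 start timeouts per resource in a
# Grid Infrastructure alert log. Assumes the message format shown above.
summarize_start_timeouts() {
    # $1 = path to the alert log, e.g. alertjystdrac1.log
    # Extract the resource name from each CRS-2757 line, then count repeats
    # and print them as "<resource> <count>".
    sed -n "s/.*CRS-2757:.*from the resource '\([^']*\)'.*/\1/p" "$1" \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Called as summarize_start_timeouts alertjystdrac1.log, it prints one line per resource with the number of timeouts seen, which makes it obvious that only the earliest daemons in the startup chain are failing.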
Digging further, here are the latest errors in mdnsd.log (the gpnpd log is similar, so it is omitted to save space):
[grid@jystdrac1 mdnsd]$ pwd
/opt/app/11.2.0/grid/log/jystdrac1/mdnsd
[grid@jystdrac1 mdnsd]$ tail -20 mdnsd.log
2021-06-30 22:50:59.275: [ MDNS][1534236416] mdnsd exit
2021-06-30 22:53:03.989: [ default][1342412544]
================================================================================
2021-06-30 22:53:03.989: [ default][1342412544]mdnsd START pid=2201
[ clsdmt][1335961344]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
2021-06-30 22:53:03.991: [ clsdmt][1335961344]PID for the Process [2201], connkey 9
2021-06-30 22:53:03.991: [ clsdmt][1335961344]Creating PID [2201] file for home /opt/app/11.2.0/grid host jystdrac1 bin mdns to /opt/app/11.2.0/grid/mdns/init/
2021-06-30 22:53:03.992: [ clsdmt][1335961344]Writing PID [2201] to the file [/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid]
2021-06-30 22:53:03.992: [ clsdmt][1335961344]Failed to record pid for MDNSD
2021-06-30 22:53:03.992: [ clsdmt][1335961344]Terminating process
2021-06-30 22:53:03.992: [ MDNS][1335961344] clsdm requested mdnsd exit
2021-06-30 22:53:03.992: [ MDNS][1335961344] mdnsd exit
2021-06-30 22:57:14.236: [ default][747345664]
================================================================================
2021-06-30 22:57:14.236: [ default][747345664]mdnsd START pid=2375
[ clsdmt][740894464]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
2021-06-30 22:57:14.239: [ clsdmt][740894464]PID for the Process [2375], connkey 9
2021-06-30 22:57:14.239: [ clsdmt][740894464]Cr[grid@jystdrac1 mdnsd]$
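The telling line in this log is "Failed to record pid for MDNSD". To see which other daemons hit the same failure, their logs can be searched for that message. This is only a sketch: find_pid_write_failures is a hypothetical name, and it assumes that the other daemon logs use the same "Failed to record pid" wording as mdnsd.log:

```shell
# Hypothetical helper: list the *.log files under a node's log directory
# that contain the "Failed to record pid" message seen in mdnsd.log.
find_pid_write_failures() {
    # $1 = node log directory, e.g. /opt/app/11.2.0/grid/log/jystdrac1
    find "$1" -type f -name '*.log' \
        -exec grep -l 'Failed to record pid' {} + | sort
}
```

For this environment the call would be find_pid_write_failures /opt/app/11.2.0/grid/log/jystdrac1.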
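MOS also has a note covering the top five problems that keep RAC from starting:
- Top 5 Grid Infrastructure Startup Issues (Doc ID 1526147.1)
Issue 4 in that note, "Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running", closely matches the symptoms here.
The note lists the possible causes and the corresponding solutions:
Possible causes:
1. orarootagent is missing execute permission
2. The process's <node>.pid file is missing, or its owner/permissions are wrong
3. The GRID_HOME owner/permissions are wrong
Solutions:
1. Compare the owner/permissions against a known-good GRID_HOME and correct them accordingly, or run the following as root: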
# cd <GRID_HOME>/crs/install
# ./rootcrs.pl -unlock
# ./rootcrs.pl -patch
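This stops the clusterware, sets the owner/permissions of the required files to root, and restarts the clusterware.
2. If the relevant <node>.pid file does not exist, create it with touch and give it the appropriate owner/permissions; otherwise correct the owner/permissions of the existing <node>.pid file as required, then restart the clusterware.
These are the <node>.pid files under <GRID_HOME> that should be owned by root:root with permissions 644: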
./ologgerd/init/<node>.pid
./osysmond/init/<node>.pid
./ctss/init/<node>.pid
./ohasd/init/<node>.pid
./crs/init/<node>.pid
These should be owned by <grid>:oinstall with permissions 644:
./mdns/init/<node>.pid
./evm/init/<node>.pid
./gipc/init/<node>.pid
./gpnp/init/<node>.pid
3. For cause 3, see solution 1.
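The owner and mode checks from the list above can be scripted rather than done by eye. The following is a rough sketch, not something taken from the MOS note: check_pid_file and check_grid_pid_files are hypothetical names, the script relies on GNU stat (the -c option), and a MISSING result only matters for daemons that actually exist in the installed version:

```shell
# Hypothetical helper: compare one file's owner:group and octal mode
# with the expected values. Requires GNU stat (-c).
check_pid_file() {
    # $1 = path, $2 = expected owner:group, $3 = expected octal mode
    if [ ! -e "$1" ]; then
        echo "MISSING  $1"
        return 1
    fi
    actual=$(stat -c '%U:%G %a' "$1")
    if [ "$actual" = "$2 $3" ]; then
        echo "OK       $1 ($actual)"
    else
        echo "MISMATCH $1 (have $actual, want $2 $3)"
        return 1
    fi
}

# Hypothetical wrapper: walk the two <node>.pid lists quoted above.
check_grid_pid_files() {
    # $1 = GRID_HOME, $2 = node name, $3 = grid software owner
    rc=0
    for d in ologgerd osysmond ctss ohasd crs; do
        check_pid_file "$1/$d/init/$2.pid" "root:root" 644 || rc=1
    done
    for d in mdns evm gipc gpnp; do
        check_pid_file "$1/$d/init/$2.pid" "$3:oinstall" 644 || rc=1
    done
    return $rc
}
```

On this host the call would be check_grid_pid_files /opt/app/11.2.0/grid jystdrac1 grid. Note that a check like this only looks at ownership and mode; it says nothing about whether the file can actually be written.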
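But working through each of these turned up nothing wrong. Strange: if every permission is correct, why can't the PID be written?
What happens if I write the file by hand with vi?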
[grid@jystdrac1 jystdrac1]$ vi /opt/app/11.2.0/grid/mdns/init/jystdrac1.pid
2201
Saving the file fails with an error:
"/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid"
"/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid" E514: write error (file system full?)
Press ENTER or type command to continue
What? The filesystem is full???
[grid@jystdrac1 jystdrac1]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_linuxbase-lv_root 28G 27G 0 100% /
tmpfs 1.5G 0 1.5G 0% /dev/shm
/dev/sda1 485M 39M 421M 9% /boot
Well, there it is... How embarrassing: it turned out to be the most basic problem of all, disk space.
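A small pre-flight check run before starting the stack would have caught this right away. The sketch below is only an illustration under a few assumptions: fs_use_percent and can_write_in are hypothetical names, the script should be run as the grid user (root may still be able to write when ordinary users cannot), and it assumes a shell such as bash whose echo reports write errors:

```shell
# Hypothetical helper: print the Use% figure (without the % sign) for the
# filesystem that holds a directory, using the portable df -P layout.
fs_use_percent() {
    # $1 = directory to check
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: attempt a small real write in a directory, since
# correct ownership and mode do not guarantee that a write will succeed.
can_write_in() {
    # $1 = directory to probe
    probe="$1/.write_probe.$$"
    if echo "probe" 2>/dev/null > "$probe"; then
        rm -f "$probe"
        return 0
    fi
    rm -f "$probe"
    return 1
}
```

For example, fs_use_percent /opt/app/11.2.0/grid reports how full the filesystem is, and can_write_in /opt/app/11.2.0/grid/mdns/init || echo "cannot write" exercises the same directory in which mdnsd failed to record its PID.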
I quickly freed up some space and restarted the cluster. Does it start normally now?
It's Ok!