RAC某節點v$asm_disk查詢hang分析處理

本文轉載自查看原文 2018-01-30 06:25 1374 enq: DD - contention/ GPnP Get Item/ A故障處理專題/ ASM file metadata operation/ v$asm_disk

主題：RAC某節點v$asm_disk查詢hang分析處理
環境：Oracle 11.2.0.3 RAC
故障描述：RAC環境2個節點，節點1查詢v$asm_disk正常返回結果，節點2查詢v$asm_disk就會一直hang，查詢會話對應event是ASM file metadata operation.

1.初步判斷
2.深入分析
3.峰回路轉
4.解決問題

1.初步判斷

首先連接節點1，查詢v$asm_disk沒有問題：

--節點1查詢v$asm_disk都沒有問題
SQL> show parameter name  
SQL> select path from v$asm_disk;

PATH
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/dev/oracleasm/disks/DATADISK
/dev/oracleasm/disks/FRADISK
/dev/oracleasm/disks/CRSDISK1
/dev/oracleasm/disks/CRSDISK3
/dev/oracleasm/disks/CRSDISK2

然后連接節點2，查詢v$asm_disk就一直hang

SQL> select path from v$asm_disk;
hang住一直無結果返回..

故障重現，確認客戶描述故障現場屬實。

既然是hang住，自然去查該會話的等待事件是什么？是否有阻塞會話？

SQL> select sid, serial#, inst_id, sql_id, event, p1,p2,p3, machine, username, blocking_session from gv$session where wait_class# <> 6;

       SID    SERIAL#    INST_ID SQL_ID        EVENT                                       P1         P2         P3 MACHINE              USERNAME                       BLOCKING_SESSION
---------- ---------- ---------- ------------- ----------------------------------- ---------- ---------- ---------- -------------------- ------------------------------ ----------------

      289      33911          2 b5tshv0auyqm0 ASM file metadata operation               6229          9          0 newdb2               SYS

發現等待事件是“ASM file metadata operation”，BLOCKING_SESSION為空，表明並沒有阻塞會話。

此時查了下MOS，初步懷疑是asm_diskstring配置有問題，或是對應的磁盤權限有什么異常。但實際排查並未發現異常。
兩個節點asm實例的參數設置都一樣；對應目錄下的磁盤等權限等也一致。

兩個節點asm實例的參數設置都一樣

--節點1：
SQL> show parameter asm

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string      DATADG, FRADG, CRSDG_NEW
asm_diskstring                       string      /dev/oracleasm/disks/*
asm_power_limit                      integer     1
asm_preferred_read_failure_groups    string
SQL> 

--節點2 一樣的：
SQL> show parameter asm

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string      DATADG, FRADG, CRSDG_NEW
asm_diskstring                       string      /dev/oracleasm/disks/*
asm_power_limit                      integer     1
asm_preferred_read_failure_groups    string

[grid@newdb1 ~]$ ls -l /dev/oracleasm/disks/
total 0
brw-rw---- 1 grid asmadmin 253, 11 Jan 29 15:05 CRSDISK1
brw-rw---- 1 grid asmadmin 253,  9 Jan 29 15:05 CRSDISK2
brw-rw---- 1 grid asmadmin 253, 10 Jan 29 15:05 CRSDISK3
brw-rw---- 1 grid asmadmin 253, 14 Jan 29 15:05 DATADISK
brw-rw---- 1 grid asmadmin 253, 13 Jan 29 15:05 FRADISK
[grid@newdb1 ~]$ 

[grid@newdb2 ~]$ ls -l /dev/oracleasm/disks/
total 0
brw-rw---- 1 grid asmadmin 253, 21 Jan 29 15:05 CRSDISK1
brw-rw---- 1 grid asmadmin 253, 18 Jan 29 15:05 CRSDISK2
brw-rw---- 1 grid asmadmin 253, 19 Jan 29 15:05 CRSDISK3
brw-rw---- 1 grid asmadmin 253, 22 Jan 29 15:05 DATADISK
brw-rw---- 1 grid asmadmin 253, 20 Jan 29 15:05 FRADISK

2.深入分析

在故障節點2上使用oradebug獲取short_stack以及SSD 266，然后進行深入分析。

首先獲取上面289,33911會話的spid：

select p.spid
from v$process p, v$session s
where p.addr = s.paddr
and s.sid = &sid
and s.serial# = &serial;

SQL> select p.spid
  2    from v$process p, v$session s
  3   where p.addr = s.paddr
  4     and s.sid = &sid
  5     and s.serial# = &serial;
Enter value for sid: 289
old   4:    and s.sid = &sid
new   4:    and s.sid = 289
Enter value for serial: 33911
old   5:    and s.serial# = &serial
new   5:    and s.serial# = 33911

SPID
------------------------
8763

SQL> !ps -ef|grep 8763
oracle    8763  8762  0 16:31 ?        00:00:00 oraclenewdb2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      9639     1  0 16:35 ?        00:00:00 oracle+ASM2_user8763_newdb2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle   12022  9750  0 16:43 pts/2    00:00:00 /bin/bash -c ps -ef|grep 8763
oracle   12024 12022  0 16:43 pts/2    00:00:00 grep 8763

然后使用oradebug 跟蹤8763進程，獲取short_stack和SSD 266：

SQL> oradebug setospid 8763
Oracle pid: 68, Unix process pid: 8763, image: oracle@newdb2 (TNS V1-V3)
SQL> oradebug unlimit
Statement processed.
SQL> oradebug short_stack
ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-read()+14<-ntpfprd()+115<-nsbasic_brc()+376<-nsbrecv()+69<-nioqrc()+485<-ttcdrv()+1461<-nioqwa()+61<-upirtrc()+1385<-kpurcsc()+98<-kpuexec()+10807<-OCIStmtExecute()+39<-kfdDskTableCb4Db()+4492<-kfdDskTableCbInternal()+301<-kfdDskTableCb()+56<-qerfxFetch()+2210<-rwsfcd()+103<-qerhjFetch()+3187<-opifch2()+2995<-kpoal8()+2939<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+252<-main()+201<-__libc_start_main()+253<-_start()+36
SQL> oradebug dump systemstate 266
Statement processed.
SQL> oradebug tracefile_name
/home/app/oracle/diag/rdbms/newdg/newdb2/trace/newdb2_ora_8763.trc
SQL>

從獲取到的trc文件中，並沒有找到有用的線索。至此問題陷入僵局。

3.峰回路轉

回過頭來想之前的解決過程，發現雖然查詢v$asm_disk這類操作是在db層，但是本質確實asm實例的管理資源，去查asm實例是否有阻塞呢？

--ASM INSTANCE
set lines 350 trimspool on pages 300

select sid, state, event, seconds_in_wait, blocking_session
from   v$session
where  blocking_session is not null
or sid in (select blocking_session 
           from   v$session 
           where  blocking_session is not null)
order by sid;

在故障節點2的asm實例上查詢：

SQL> set lines 350 trimspool on pages 300
SQL> 
SQL> select sid, state, event, seconds_in_wait, blocking_session
  2  from   v$session
  3  where  blocking_session is not null
  4  or sid in (select blocking_session 
  5             from   v$session 
  6             where  blocking_session is not null)
  7  order by sid;

       SID STATE               EVENT                                                            SECONDS_IN_WAIT BLOCKING_SESSION
---------- ------------------- ---------------------------------------------------------------- --------------- ----------------
       128 WAITING             enq: DD - contention                                                      542311             1702
       380 WAITING             enq: DD - contention                                                     2159938             1702
       442 WAITING             enq: DD - contention                                                     2159098             1702
       506 WAITING             enq: DD - contention                                                     2158258             1702
       569 WAITING             enq: DD - contention                                                      713509             1702
       632 WAITING             enq: DD - contention                                                     2156578             1702
       695 WAITING             enq: DD - contention                                                     2155738             1702
       758 WAITING             enq: DD - contention                                                     1108924             1702
       821 WAITING             enq: DD - contention                                                     2142990             1702
       884 WAITING             enq: DD - contention                                                     2139901             1702
       947 WAITING             enq: DD - contention                                                     2136286             1702
      1010 WAITING             enq: DD - contention                                                     2078621             1702
      1073 WAITING             enq: DD - contention                                                     1845020             1702
      1135 WAITING             GPnP Get Item                                                            2161258
      1136 WAITING             enq: DD - contention                                                     1844671             1702
      1199 WAITING             enq: DD - contention                                                     1844394             1702
      1263 WAITING             enq: DD - contention                                                     1844065             1702
      1325 WAITING             enq: DD - contention                                                     1843549             1702
      1388 WAITING             enq: DD - contention                                                     1471932             1702
      1451 WAITING             enq: DD - contention                                                     1641694             1702
      1514 WAITING             enq: DD - contention                                                     1641426             1702
      1577 WAITING             enq: DD - contention                                                     1572189             1702
      1640 WAITING             enq: DD - contention                                                     1501765             1702
      1702 WAITING             rdbms ipc reply                                                                0             1135
      1703 WAITING             enq: DD - contention                                                      896552             1702
      1765 WAITING             enq: DD - contention                                                     2161018             1702
      1766 WAITING             enq: DD - contention                                                      893474             1702
      1829 WAITING             enq: DD - contention                                                       11782             1702
      1892 WAITING             enq: DD - contention                                                        5041             1702
      1955 WAITING             enq: DD - contention                                                        3357             1702

30 rows selected.

果然看到很多阻塞，EVENT為"enq: DD - contention"的都是被1702會話阻塞，而1702會話又是被1135會話阻塞，1135會話的event是"GPnP Get Item"。
看兩個會話的詳細信息：

select sid,
       SERIAL#,
       sql_id,
       event,
       MACHINE,
       PROGRAM,
       username,
       blocking_session from   v$session where sid = 1702; 


select sid,
       SERIAL#,
       sql_id,
       event,
       MACHINE,
       PROGRAM,
       username,
       blocking_session from   v$session where sid = 1135;

結果如下：

SQL> select sid,
  2         SERIAL#,
  3         sql_id,
  4         event,
  5         MACHINE,
  6         PROGRAM,
  7         username,
  8         blocking_session from   v$session where sid = 1702; 

       SID    SERIAL# SQL_ID        EVENT                                                            MACHINE                                                          PROGRAM                                       USERNAME                BLOCKING_SESSION
---------- ---------- ------------- ---------------------------------------------------------------- ---------------------------------------------------------------- ------------------------------------------------ ------------------------------ ----------------
      1702      15127 b8a2pdbq3p2mj rdbms ipc reply                                                  newdb2                                                           oracle@newdb2 (TNS V1-V3)                     SYS                                         1135

SQL>  select sid,
  2         SERIAL#,
  3         sql_id,
  4         event,
  5         MACHINE,
  6         PROGRAM,
  7         username,
  8         blocking_session from   v$session where sid = 1135; 

       SID    SERIAL# SQL_ID        EVENT                                                            MACHINE                                                          PROGRAM                                       USERNAME                BLOCKING_SESSION
---------- ---------- ------------- ---------------------------------------------------------------- ---------------------------------------------------------------- ------------------------------------------------ ------------------------------ ----------------
      1135          1               GPnP Get Item                                                    newdb2                                                           oracle@newdb2 (RBAL)

找到最終的原因：GPnP Get Item .

4.解決問題

有了上面的信息，再次查詢MOS就可以匹配到 - Diskgroup Mount Hangs with RBAL Waiting on 'GPnP Get Item' and 'enq: DD - contention' (文檔 ID 1375505.1) - "crsctl check cluster -all" command gives CRS-4404, CRS-4405 errors (文檔 ID 1392934.1)

參照MOS解決方案：

Use the following command as root on Compute Node 3 to identify the current gpnpd.bin process:
ps -ef | grep gpnpd.bin

This will give you an output similar to:
ps -ef | grep pnp
oracle 31422 1 0 2011 ? 00:05:40 /u01/app/11.2.0.2/grid/bin/gpnpd.bin

Please note the process ID number of the current gpnpd.bin process. In this example, = 31422

Use the following command on Compute Node 3 to stop the current gpnpd.bin process:
kill -HUP
for example, kill -HUP 31422

After a few moments, gpnpd.bin should be automatically restarted by GI, run the command from step # 1 to verify that a new gpnpd.bin process exists

實際操作如下：

[root@newdb2 dev]# ps -ef | grep gpnpd.bin
grid      9835     1  0  2017 ?        00:23:19 /home/app/11.2.0/grid/bin/gpnpd.bin
root     26300 28772  0 17:43 pts/0    00:00:00 grep gpnpd.bin
[root@newdb2 dev]# kill -HUP  9835
You have mail in /var/spool/mail/root
[root@newdb2 dev]# 
[root@newdb2 dev]# ps -ef | grep gpnpd.bin
grid     26740     1  1 17:45 ?        00:00:00 /home/app/11.2.0/grid/bin/gpnpd.bin
root     26762 28772  0 17:45 pts/0    00:00:00 grep gpnpd.bin
[root@newdb2 dev]# su - grid
[grid@newdb2 ~]$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.3.0 Production on Mon Jan 29 17:46:02 2018

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> set lines 350 trimspool on pages 300
SQL> 
SQL> select sid, state, event, seconds_in_wait, blocking_session
  2  from   v$session
  3  where  blocking_session is not null
  4  or sid in (select blocking_session 
  5             from   v$session 
  6             where  blocking_session is not null)
  7  order by sid;

no rows selected

SQL> select path from v$asm_disk;

PATH
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/dev/oracleasm/disks/FRADISK
/dev/oracleasm/disks/DATADISK
/dev/oracleasm/disks/CRSDISK3
/dev/oracleasm/disks/CRSDISK2
/dev/oracleasm/disks/CRSDISK1

SQL>

至此，故障節點的ASM實例的阻塞消失，再次驗證查詢v$asm_disk已經恢復正常，故障算是完美解決。

后記：

1.實際上起初的解決思路出現偏差，ASM的問題就應該直接從ASM實例找原因，而之前都是在DB層分析，自然沒有得到好的結果。這一點很值得自己反思。
2.最開始接到這個case是建議找停機時間直接重啟節點2，這個方案簡單粗暴且不需要深入分析問題，從最終解決方案看到，這種方法也是可以解決問題的。但恰恰由於客戶堅持不能重啟，反倒因此而學到了更多的技能點。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 處理因ASM實例異常導致RAC第一節點實例異常終止故障 RAC一個節點自動重啟問題分析 RAC安裝gird,第一個節點執行root.sh報"The ora.asm resource is not ONLINE"錯誤基於CentOS與VmwareStation10搭建Oracle11G RAC 64集群環境：4.安裝Oracle RAC FAQ-4.2.Oracleasm Createdisk ASM磁盤失敗:Instantiating disk: failed 通過udev創建ASM共享磁盤（RAC） Oracle rac配置asm共享磁盤 Oracle RAC asm常用命令 RAC集群兩個節點只能啟動一個節點問題處理測試一體機ASM Disk online操作在Linux 5/6上使用UDEV SCSI規則配置ASM DISK