記錄一則數據庫死鎖故障的分析過程


1.背景概述

客戶的監控告警頻繁提示系統xx數據庫死鎖增長個數高於當前閾值_當前值1.00。 下面是詳細的故障分析診斷過程,以及詳細的解決方案描述。

2.故障分析

2.1 故障現象
登錄到系統,從數據庫到alert日志可以發現的確存在很多ORA-60的信息,截取部分如下:

2020-04-23T19:32:00.644961+08:00
XXXDB(4):ORA-00060: Global Enqueue Services Deadlock detected. See Note 60.1 at My Oracle for Troubleshooting ORA-60 Errors. More info in file
 /oracle/app/oracle/diag/rdbms/z1d1v19c/xxxxxxxx2/trace/xxxxxxxx2_ora_127408.trc.
2020-04-23T19:32:01.000382+08:00
Dumping diagnostic data in directory=[cdmp_20200423193200], requested by (instance=2, osid=127408), summary=[abnormal process termination].
2020-04-23T19:32:54.093147+08:00
XXXDB(4):ORA-00060: Global Enqueue Services Deadlock detected. See Note 60.1 at My Oracle for Troubleshooting ORA-60 Errors. More info in file
 /oracle/app/oracle/diag/rdbms/z1d1v19c/xxxxxxxx2/trace/xxxxxxxx2_ora_127383.trc.
2020-04-23T19:32:54.289460+08:00
Dumping diagnostic data in directory=[cdmp_20200423193254], requested by (instance=2, osid=127383), summary=[abnormal process termination].
2020-04-23T19:32:57.576079+08:00
XXXDB(4):ORA-00060: Global Enqueue Services Deadlock detected. See Note 60.1 at My Oracle for Troubleshooting ORA-60 Errors. More info in file
 /oracle/app/oracle/diag/rdbms/z1d1v19c/xxxxxxxx2/trace/xxxxxxxx2_ora_124482.trc.

2.2 故障根源
查看trc內容,發現是自身導致的阻塞,幾個trc都類似,下面截取其中一個trc跟蹤日志的關鍵內容:

*** 2020-04-23T19:32:00.644695+08:00 (XXXDB(4))
*** SESSION ID:(7989.26294) 2020-04-23T19:32:00.644756+08:00
*** CLIENT ID:() 2020-04-23T19:32:00.644762+08:00
*** SERVICE NAME:(XXXDB) 2020-04-23T19:32:00.644767+08:00
*** MODULE NAME:(oracle@xxxxxxxxdb2) 2020-04-23T19:32:00.644772+08:00
*** ACTION NAME:() 2020-04-23T19:32:00.644777+08:00
*** CLIENT DRIVER:() 2020-04-23T19:32:00.644781+08:00
*** CONTAINER ID:(4) 2020-04-23T19:32:00.644785+08:00
 
Single resource deadlock: blocking enqueue which blocks itself, f 0
Granted global enqueue 0xaf1cabb50
----------enqueue 0xaf1cabb50------------------------
lock version     : 137621
Owner inst       : 2
grant_level      : KJUSERNL
req_level        : KJUSERPW
bast_level       : KJUSERNL
notify_func      : none
resp             : 0x5bf1f1c38
procp            : 0xb069785a0
pid              : 127405
proc version     : 580
oprocp           : (nil)
opid             : 127405
group lock owner : 0xb5aaec910
possible pid     : 127405
xid              : C1000-0002-000FC6D3
dd_time          : 10.0 secs
dd_count         : 0
timeout          : 60.0 secs
On_timer_q?      : Y
On_dd_q?         : Y
sec since mv2grQ : N/A
lock_state       : OPENING CONVERTING 
ast_flag         : 0x0
flags            : 0x0
Open Options     : KJUSERDEADLOCK 
Convert options  : KJUSERGETVALUE 
History          : FREE > KJLALC > REF_RES > LOC_AST > CLOSE > FREE > KJLALC > REF_RES > GR2CVT
Msg_Seq          : 0x0
res_seq          : 10845078
valblk           : 0x00000000000000000000000000000000 .
user session for deadlock lock 0xaf1cabb50
  sid: 7867 ser: 47087 audsid: 210608 user: 103/ZJY
  pdb: 4/XXXDB
    flags: (0x41) USR/- flags2: (0x40009) -/-/INC
    flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
  pid: 193 O/S info: user: grid, term: UNKNOWN, ospid: 127405
    image: oracle@xxxxxxxxdb2
  client details:
    O/S info: user: , term: , ospid: 1234
    machine: ltyglapp01 program: oracle@xxxxxxxxdb2
    application name: oracle@xxxxxxxxdb2, hash value=420857658
  current SQL:
  delete from tx_xxxx_xxxxx_xxxx where 1=1  and FTOOL_CODE='X019541' and T8_SYS_ADTYPE_ID=1 and T8_SYS_PORTFOL_ID=100148 and SETTLE_DATE >='20200423' and isaccount != 1   and account_type=3 and (BOND_MARKET=1 or BOND_MARKET is null )
Requesting global enqueue 0xb1ceb9110
----------enqueue 0xb1ceb9110------------------------
lock version     : 534003
Owner inst       : 2
grant_level      : KJUSERCW
req_level        : KJUSERPW
bast_level       : KJUSERNL
notify_func      : none
resp             : 0x5bf1f1c38
procp            : 0xb069bd4b0
pid              : 127408
proc version     : 1615
oprocp           : (nil)
opid             : 127408
group lock owner : 0xb4aa785e8
possible pid     : 127408
xid              : C2000-0002-0002D8DD
dd_time          : 0.0 secs
dd_count         : 0
timeout          : -6670697.0 secs
On_timer_q?      : N
On_dd_q?         : N
sec since mv2grQ : 0
lock_state       : CONVERTING 
ast_flag         : 0x0
flags            : 0x1
Open Options     : KJUSERDEADLOCK 
Convert options  : KJUSERGETVALUE 
History          : REF_RES > LOC_AST > CLOSE > FREE > KJLALC > REF_RES > GR2CVT > LOC_AST_FG > LOC_AST
Msg_Seq          : 0x0
res_seq          : 10845078
valblk           : 0x00000000000000000000000000000000 .
user session for deadlock lock 0xb1ceb9110
  sid: 7989 ser: 26294 audsid: 210609 user: 103/ZJY
  pdb: 4/XXXDB
    flags: (0x41) USR/- flags2: (0x40009) -/-/INC
    flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
  pid: 194 O/S info: user: grid, term: UNKNOWN, ospid: 127408
    image: oracle@xxxxxxxxdb2
  client details:
    O/S info: user: , term: , ospid: 1234
    machine: ltyglapp01 program: oracle@xxxxxxxxdb2
    application name: oracle@xxxxxxxxdb2, hash value=420857658
  current SQL:
  delete from tx_xxxx_xxxxx_xxxx where 1=1  and FTOOL_CODE='112094455' and T8_SYS_ADTYPE_ID=24 and T8_SYS_PORTFOL_ID=100148 and SETTLE_DATE >='20200423' and isaccount != 1  and (BOND_MARKET=3 or BOND_MARKET is null ) and relation_no='100912'
----------resource 0x5bf1f1c38----------------------
resname       : [0x11ecf][0x0],[TM][ext 0xfffdbbf8,0x0][domid 0x4]

可以發現都是delete from tx_xxxx_xxxxx_xxxx這張表,死鎖現象是Single resource deadlock,而且TYPE是TM,綜合分析這種現象,大多就是外鍵沒有索引導致。

c. For single resource deadlock on TM enqueue, missing foreign key index is often the cause, please check case 3 for the solution.

進一步獲取建表語句查看約束部分,發現只有主鍵,沒有外鍵:

select dbms_metadata.get_ddl('TABLE','TX_XXXX_XXXXX_XXXX','ZJY') from dual;
…
      "TEMPACCRUBLN" NUMBER(22,4) DEFAULT 0,
         CONSTRAINT "PK_TX_XXXX_XXXXX_XXXX" PRIMARY KEY ("ID")
  USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS
  STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "ZJYDATA"  ENABLE
…

查詢ZJY用戶下所有存在外鍵的表:

STAT TABLE_NAME                     CON                  IDX
---- ------------------------------ -------------------- --------------------
ok   T8_ACC_XXXXXXX                 TX_XXXXX_XXXX_ID    TX_XXXXX_XXXX_ID
**** T8_SHIBOR_ACC_XXXXXXX          TX_XXXXX_XXXX_ID
**** T8_CUSTOM_ACC_XXXXXXX          TX_XXXXX_XXXX_ID
**** T8_IMPAWN_ACC_XXXXXXX          TX_XXXXX_XXXX_ID
**** T8_FUND_ACC_XXXXXXX            TX_XXXXX_XXXX_ID
**** T8_LENDING_ACC_XXXXXXX         TX_XXXXX_XXXX_ID
**** T8_BOND_ACC_XXXXXXX            TX_XXXXX_XXXX_ID
**** T8_BUYOUT_ACC_XXXXXXX          TX_XXXXX_XXXX_ID

進一步分析,其中T8_BOND_ACC_XXXXXXX的外鍵就是引用了TX_XXXX_XXXXX_XXXX的主鍵ID列:

select dbms_metadata.get_ddl('TABLE','T8_BOND_ACC_XXXXXXX','ZJY') from dual;
 REFERENCES "ZJY"."TX_XXXX_XXXXX_XXXX" ("ID") ON DELETE CASCADE ENABLE

2.3 故障處置
業務閑時,在T8_BOND_ACC_XXXXXXX表的外鍵列TX_XXXXX_XXXX_ID上創建索引,再觀察告警情況。

create index TX_XXXXX_XXXX_ID on T8_BOND_ACC_XXXXXXX(TX_XXXXX_XXXX_ID) online;

注:上面是示范創建語句,具體創建索引的語句,比如是否采用online,是否使用並行等參照行里運維具體規范。

3.根本解決方案及建議

雖然本次告警信息未發現其他表的全局死鎖問題,但從Oracle的設計層面來看,建議在在所有7個表的外鍵列上都創建對應的索引,防止后續在其他表大量DML時出現同類問題。

此外,復盤的時候發現,trc跟蹤日志的這一部分信息:

----------resource 0x5bf1f1c38----------------------
resname       : [0x11ecf][0x0],[TM][ext 0xfffdbbf8,0x0][domid 0x4]

這里的[0x11ecf]轉換為10進制,就是對應的外鍵無索引對象的object_id。


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM