RAC 單節點實例異常關閉,關鍵報錯ORA--29770


監控系統監控到RAC 的一個實例異常關閉 ,時間是凌晨1點多,還好沒有影響到業務

之后就是分析原因

這套RAC搭建在虛擬化環境OS SUSE11

查看oracel alert log信息

 1 Mon Dec 04 01:16:20 2017 
 2 Thread 1 advanced to log sequence 26371 (LGWR switch) 
 3 Current log# 2 seq# 26371 mem# 0: +DATA/mgr/onlinelog/group_2.258.887111011 
 4 Current log# 2 seq# 26371 mem# 1: +FRA/mgr/onlinelog/group_2.258.887111015 
 5 Mon Dec 04 01:16:20 2017 
 6 Archived Log entry 52713 added for thread 1 sequence 26370 ID 0x6c454da0 dest 1: 
 7 Mon Dec 04 01:24:14 2017 
 8 LMS0 (ospid: 17259) has not called a wait for 97 secs. <<<<<<<<<<<=========== 
 9 Mon Dec 04 01:24:29 2017 
10 Errors in file /oracle/product/diag/rdbms/mgr/mgr1/trace/mgr1_lmhb_17273.trc (incident=252137): 
11 ORA-29770: ?????? LMS0 (OSID 17259) ???????? 70 ? 
12 Incident details in: /oracle/product/diag/rdbms/mgr/mgr1/incident/incdir_252137/mgr1_lmhb_17273_i252137.trc <<<<<<<<<<<=========== 
13 Mon Dec 04 01:24:40 2017 
14 ERROR: Some process(s) is not making progress. <<<<<<<<<<<=========== 
15 LMHB (ospid: 17273) is terminating the instance. 
16 Please check LMHB trace file for more details. 
17 Please also check the CPU load, I/O load and other system properties for anomalous behavior <<<<<<<<<<<=========== 
18 ERROR: Some process(s) is not making progress. 
19 LMHB (ospid: 17273): terminating the instance due to error 29770 <<<<<<<<<<<=========== 
20 Mon Dec 04 01:24:41 2017 
21 System state dump requested by (instance=1, osid=17273 (LMHB)), summary=[abnormal instance termination]. 
22 System State dumped to trace file /oracle/product/diag/rdbms/mgr/mgr1/trace/mgr1_diag_17245.trc 
23 Mon Dec 04 01:24:41 2017 
24 ORA-1092 : opitsk aborting process 
25 Dumping diagnostic data in directory=[cdmp_20171204012441], requested by (instance=1, osid=17273 (LMHB)), summary=[abnormal instance termination]. 
26 Mon Dec 04 01:24:46 2017 
27 ORA-1092 : opitsk aborting process 
28 Termination issued to instance processes. Waiting for the processes to exit 
29 Mon Dec 04 01:24:53 2017 
30 ORA-1092 : opitsk aborting process 
31 Mon Dec 04 01:24:53 2017 
32 License high water mark = 43 
33 Mon Dec 04 01:24:54 2017 
34 Instance termination failed to kill one or more processes 
35 Instance terminated by LMHB, pid = 17273 
36 USER (ospid: 3513): terminating the instance 
37 Instance terminated by USER, pid = 3513 
38 Mon Dec 04 01:25:18 2017 
39 Starting ORACLE instance (normal) 

從日志信息來看,LMS0進程1:24:14時已經等待97s,之后生成trace文件,之后報錯ORA-29770,之后實例被終止。

科普一下LMS進程:

LMSn進程會維護在Global Resource Directory (GRD)中的數據文件以及每個cached block的狀態。LMSn用於在RAC的實例間進行message以及數據塊的傳輸,這個對應的服務也就是GCS(Global Cache Service),LMS是Cache Fusion的一個重要部分。LMS進程可以說是RAC上最活躍的后台進程,會消耗較多的CPU.一般每個實例會有多個LMS進程,每個Oracle版本的默認的LMS進程數目會有所不同,大部分版本的默認值是:MIN(CPU_COUNT/2, 2))

ORACLE報錯信息

ORA-29770: global enqueue process LMS0 (OSID 17259) is hung for more than 70 seconds

報錯生成了trace文件

Relevant Information Collection 
--------------------------------------- 
System name: Linux 
Node name: mgrdb01 
Release: 3.0.76-0.11-default 
Version: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) 
Machine: x86_64 
VM name: VMWare Version: 6 <<<<<<<============= 
Instance name: mgr1 
Redo thread mounted by this instance: 1 
Oracle process number: 17 
Unix process pid: 17273, image: oracle@mgrdb01 (LMHB) <<<<<<<============= 


*** 2017-12-04 01:23:14.315 
*** SESSION ID:(1599.1) 2017-12-04 01:23:14.315 
*** CLIENT ID:() 2017-12-04 01:23:14.315 
*** SERVICE NAME:(SYS$BACKGROUND) 2017-12-04 01:23:14.315 
*** MODULE NAME:() 2017-12-04 01:23:14.315 
*** ACTION NAME:() 2017-12-04 01:23:14.315 


*** TRACE FILE RECREATED AFTER BEING REMOVED *** 


*** 2017-12-04 01:23:14.314 
============================== 
LMS0 (ospid: 17259) has not moved for 41 sec (1512321794.1512321753) <<<<<<<============= 
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 2 
: Not in wait; last wait ended 37 secs ago. 
: last wait_id 570620887 at 'gcs remote message'. <<<<<<<============= 
============================== 
Dumping PROCESS LMS0 (ospid: 17259) States 
============================== 
===[ Callstack ]=== 

*** 2017-12-04 01:23:14.316 
Process diagnostic dump for oracle@mgrdb01 (LMS0), OS id=17259, 
pid: 13, proc_ser: 1, sid: 1223, sess_ser: 1 
------------------------------------------------------------------------------- 
os thread scheduling delay history: (sampling every 1.000000 secs) 
0.000000 secs at [ 01:23:13 ] 
NOTE: scheduling delay has not been sampled for 0.935154 secs 0.000000 secs from [ 01:23:09 - 01:23:14 ], 5 sec avg 
0.000000 secs from [ 01:22:15 - 01:23:14 ], 1 min avg 
0.000000 secs from [ 01:18:15 - 01:23:14 ], 5 min avg 
loadavg : 22.91 7.20 2.77 
Memory (Avail / Total) = 92815.82M / 129069.95M 
Swap (Avail / Total) = 16384.00M / 16384.00M 
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 
0 D oracle 17259 1 0 58 - - 12123770 sleep_ Sep25 ? 14:24:46 ora_lms0_mgr1 <<<<<<<============= lms0 is in status 'D'.uninterruptible I/O . 

*** 2017-12-04 01:23:19.322 
Short stack dump: ORA-32516: ?????? 'Unix process pid: 17259, image: oracle@mgrdb01 (LMS0)' ??? ORADEBUG ?? 'SHORT_STACK'; ?????? 4920 ?? 

------------------------------------------------------------------------------- 
Process diagnostic dump actual duration=5.010000 sec 
(max dump time=5.000000 sec) 

*** 2017-12-04 01:23:19.322 
kjgcr_SlaveReqBegin: message queued to slave 
kjgcr_Main: KJGCR_ACTION - id 3 
CPU is high. Top oracle users listed below: <<<<<<<============= CPU is high , but no oracle user consuming high CPU . 
Session Serial CPU 
1882 1 0 
1 28043 0 
95 1 0 
97 24343 0 
189 1 0 

*** 2017-12-04 01:23:24.390 
kjgcr_Main: Reset called for action high cpu, identify users, count 0 

*** 2017-12-04 01:23:24.391 
kjgcr_Main: Reset called for action high cpu, kill users, count 0 

*** 2017-12-04 01:23:24.391 
kjgcr_Main: Reset called for action high cpu, activate RM plan, count 0 

*** 2017-12-04 01:23:24.391 
kjgcr_Main: Reset called for action high cpu, set BG into RT, count 0 

*** 2017-12-04 01:23:34.391 
============================== 
LMS0 (ospid: 17259) has not moved for 61 sec (1512321814.1512321753) <<<<<<<============= 
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 2 
: Not in wait; last wait ended 57 secs ago. 
: last wait_id 570620887 at 'gcs remote message'. 
...... 

*** 2017-12-04 01:23:54.401 
Process diagnostic dump for oracle@mgrdb01 (LMS0), OS id=17259, 
pid: 13, proc_ser: 1, sid: 1223, sess_ser: 1 
------------------------------------------------------------------------------- 
os thread scheduling delay history: (sampling every 1.000000 secs) 
0.000000 secs at [ 01:23:53 ] 
NOTE: scheduling delay has not been sampled for 0.859464 secs 0.000000 secs from [ 01:23:49 - 01:23:54 ], 5 sec avg 
0.000000 secs from [ 01:22:55 - 01:23:54 ], 1 min avg 
0.000000 secs from [ 01:18:55 - 01:23:54 ], 5 min avg 
loadavg : 31.82 11.43 4.38 
Memory (Avail / Total) = 92795.89M / 129069.95M 
Swap (Avail / Total) = 16384.00M / 16384.00M 
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 
0 D oracle 17259 1 0 58 - - 12123770 sleep_ Sep25 ? 14:24:46 ora_lms0_mgr1 <<<<<<<============= still 'D' 

*** 2017-12-04 01:23:59.409 
Short stack dump: ORA-32515: ???? ORADEBUG ?? 'SHORT_STACK' ??? 'Unix process pid: 17259, image: oracle@mgrdb01 (LMS0)'; ???????????? 4930 ?? 

------------------------------------------------------------------------------- 
Process diagnostic dump actual duration=5.010000 sec 
(max dump time=5.000000 sec) 

*** 2017-12-04 01:23:59.410 
kjgcr_Main: KJGCR_ACTION - id 5 

*** 2017-12-04 01:24:14.409 
============================== 
LMS0 (ospid: 17259) has not moved for 101 sec (1512321854.1512321753) 
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 6 
================================================== 
=== LMS0 (ospid: 17259) Heartbeat Report 
================================================== 
LMS0 (ospid: 17259) has no heartbeats for 101 sec. (threshold 70 sec) 
: Not in wait; last wait ended 97 secs ago. 
: last wait_id 570620887 at 'gcs remote message'. 
============================== 
Dumping PROCESS LMS0 (ospid: 17259) States <<<<<<<============= 
============================== 
===[ System Load State ]=== 
CPU Total 32 Core 32 Socket 4 
Load normal: Cur 9144 Highmark 40960 (35.71 160.00) 
===[ Latch State ]=== 
Not in Latch Get 
===[ Session State Object ]=== 
---------------------------------------- 
SO: 0xb98ef85d8, type: 4, owner: 0xb88bf79d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 
proc=0xb88bf79d0, name=session, file=ksu.h LINE:12624, pg=0 
(session) sid: 1223 ser: 1 trans: (nil), creator: 0xb88bf79d0 
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/- 
flags2: (0x409) -/-/INC 
DID: , short-term DID: 
txn branch: (nil) 
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS 
ksuxds FALSE at location: 0 
service name: SYS$BACKGROUND 
Current Wait Stack: 
Not in wait; last wait ended 1 min 37 sec ago <<<<<<<============= not in wait. 
Wait State: 
fixed_waits=0 flags=0x21 boundary=(nil)/-1 
Session Wait History: 
elapsed time of 1 min 37 sec since last wait 
0: waited for 'gcs remote message'<<<<<<<============= 
waittime=0x1, poll=0x0, event=0x0 
wait_id=570620887 seq_num=7644 snap_id=1 
wait times: snap=0.001200 sec, exc=0.001200 sec, total=0.001200 sec 
wait times: max=0.010000 sec 
wait counts: calls=1 os=1 
occurred after 0.000218 sec of elapsed time 
1: waited for 'gcs remote message' 
waittime=0x1, poll=0x0, event=0x0 
wait_id=570620886 seq_num=7643 snap_id=1 
wait times: snap=0.006362 sec, exc=0.006362 sec, total=0.006362 sec 
wait times: max=0.010000 sec 
wait counts: calls=1 os=1 
occurred after 0.000022 sec of elapsed time 

通過trace文件可以看到,lms0 的狀態是 'D'

D是什么意思?

Linux進程狀態:D (TASK_UNINTERRUPTIBLE),不可中斷的睡眠狀態。

引起D狀態的根本原因是由於IO等待,若你對某個磁盤的IO操作特別頻繁,就會造成后續的IO操作處於等待狀態,即處於D狀態。

也就是說LMS0進程在不可中斷的睡眠狀態,同時CPU很高,但是不是ORACLE用戶占用的。

到這里就可以懷疑,問題可能出現在OS層面,去查看OSwatcher監控信息

zzz ***Mon Dec 4 01:21:02 CST 2017 
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------ 
r b swpd free buff cache si so bi bo in cs us sy id wa st 
5 0 0 95113492 1518640 31129980 0 0 3 5 0 0 1 0 99 0 0 
0 0 0 95107952 1518640 31130044 0 0 80 1 6505 8514 2 1 97 0 0 
0 0 0 95107652 1518640 31130020 0 0 80 34 4694 7057 1 0 99 0 0 
zzz ***Mon Dec 4 01:21:32 CST 2017 
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------ 
r b swpd free buff cache si so bi bo in cs us sy id wa st 
4 0 0 95120764 1518644 31130072 0 0 3 5 0 0 1 0 99 0 0 
1 0 0 95116052 1518644 31130136 0 0 80 1 6378 8855 2 1 97 0 0 
0 0 0 95115644 1518644 31130132 0 0 16 34 5174 7705 1 0 99 0 0 
zzz ***Mon Dec 4 01:24:53 CST 2017 
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------ 
r b swpd free buff cache si so bi bo in cs us sy id wa st 
8 6 0 96085040 1518752 31135792 0 0 3 5 0 0 1 0 99 0 0 
2 0 0 96155748 1518772 31120520 0 0 0 9642 11662 9591 9 5 84 3 0 
1 0 0 97170696 1518780 30175204 0 0 0 2749 5527 6026 3 2 95 0 0 
zzz ***Mon Dec 4 01:25:23 CST 2017 
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------ 
r b swpd free buff cache si so bi bo in cs us sy id wa st 
4 0 0 117056400 1518796 10318440 0 0 3 5 0 0 1 0 99 0 0 
0 0 0 117052664 1518796 10318456 0 0 0 230 3792 4556 0 1 99 0 0 
1 0 0 116783816 1518796 10587120 0 0 0 841 3856 4530 1 2 97 0 0 
vmstat 采樣間隔為30秒,但是從01:21:32 之后一直到01:24:53之間的記錄都沒有 。說明OS在此期間性能很差。 
性能差到什么程度,簡單的top,iostat,ps都無法生成,直接導致lms進程通信障礙。
問題定位到OS層面,很可惜我們拋開了oracle的嫌疑,但是OS工程師並沒有找到問題的根因。


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM