故障描述:與客戶溝通,初步確認故障范圍大概是在上午的8:30-10:30之間,反應故障現象是Tomcat的連接數滿導致應用無法連接,數據庫alert中無明顯報錯,需要協助排查原因。
1.導入包含故障時刻的數據
為了便於后續分析,我向客戶索要了從昨天下午13:00到今天18:00的awrdump,導入到自己的實驗環境進行分析。生產環境導出awrdump:
@?/rdbms/admin/awrextr
測試環境導入awrdump:
SYS@jyzhao1 >select * from dba_directories;
SYS@jyzhao1 >create directory jy as '/home/oracle/awrdump';
SYS@jyzhao1 >select * from dba_directories;
SYS@jyzhao1 >!mkdir -p /home/oracle/awrdump
SYS@jyzhao1 >@?/rdbms/admin/awrload
省略部分輸出..
... Dropping AWR_STAGE user
End of AWR Load
2.創建m_ash表,明確故障時刻
創建m_ash表:--create table
create table m_ash20180322 as select * from dba_hist_active_sess_history where dbid=&dbid;
輸入生產庫對應的dbid,完成創建分析表。
select to_char(sample_time, 'yyyy-mm-dd hh24:mi'), count(1)
FROM m_ash20180322
group by to_char(sample_time, 'yyyy-mm-dd hh24:mi')
order by 1;
根據生成的數據生成折線圖如下:
可以從圖中明確故障時刻,即在10:00、12:30、14:10這三個時刻會話都明顯上升(積壓),看來客戶的反饋時間點並沒有包含所有異常時刻。
另外,引用下maclean的診斷腳本,可以看到核心意思差不多,只是進一步將instance_number區分開細化:
--驗證導出的ASH時間范圍:
select
t.dbid, t.instance_number, min(sample_time), max(sample_time), count(*) session_count
from m_ash20180322 t
group by t.dbid, t.instance_number
order by dbid, instance_number;
--確認問題發生的精確時間范圍:
select
dbid, instance_number, sample_id, sample_time, count(*) session_count
from m_ash20180322 t
group by dbid, instance_number, sample_id, sample_time
order by dbid, instance_number, sample_time;
3.確定異常時刻的top n event
確定每個采樣點的top n event,下面也是參考maclean的腳本。 比如我這里以2018-03-22 09:59:00 - 2018-03-22 10:00:00為例:select t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.event,
t.session_state,
t.c session_count
from (select t.*,
rank() over(partition by dbid, instance_number, sample_time order by c desc) r
from (select /*+ parallel 8 */
t.*,
count(*) over(partition by dbid, instance_number, sample_time, event) c,
row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1
from dba_hist_active_sess_history t
where sample_time >
to_timestamp('2018-03-22 09:59:00',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2018-03-22 10:00:00',
'yyyy-mm-dd hh24:mi:ss')
) t
where r1 = 1) t
where r < 3
order by dbid, instance_number, sample_time, r;
其他異常時刻,輸入對應的變量值:
select t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.event,
t.session_state,
t.c session_count
from (select t.*,
rank() over(partition by dbid, instance_number, sample_time order by c desc) r
from (select /*+ parallel 8 */
t.*,
count(*) over(partition by dbid, instance_number, sample_time, event) c,
row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1
from dba_hist_active_sess_history t
where sample_time >
to_timestamp('&begin_sample_time',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('&end_sample_time',
'yyyy-mm-dd hh24:mi:ss')
) t
where r1 = 1) t
where r < 3
order by dbid, instance_number, sample_time, r;
2018-03-22 12:29:00
2018-03-22 12:30:00
2018-03-22 14:09:00
2018-03-22 14:10:00
綜上,3個連接數堆積的異常時刻TOP event都是 “enq: TX - row lock contention”。
4.確定最終的top holder
使用maclean的腳本,觀察每個采樣點的等待鏈:select
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.sql_id,
t.session_type,
t.event,
t.session_state,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_status
from m_ash20180322 t
where sample_time >
to_timestamp('2018-03-22 09:59:00',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2018-03-22 10:00:00',
'yyyy-mm-dd hh24:mi:ss')
start with blocking_session is not null
connect by nocycle
prior dbid = dbid
and prior sample_time = sample_time
/*and ((prior sample_time) - sample_time between interval '-1'
second and interval '1' second)*/
and prior blocking_inst_id = instance_number
and prior blocking_session = session_id
and prior blocking_session_serial# = session_serial#
order siblings by dbid, sample_time;
結果如下:
進一步篩選,將isleaf=1的葉(top holder)找出來:
--基於上一步的原理來找出每個采樣點的最終top holder:
select t.lv,
t.iscycle,
t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.sql_id,
t.session_type,
t.event,
t.seq#,
t.session_state,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_status,
t.c blocking_session_count
from (select t.*,
row_number() over(partition by dbid, instance_number, sample_time order by c desc) r
from (select t.*,
count(*) over(partition by dbid, instance_number, sample_time, session_id) c,
row_number() over(partition by dbid, instance_number, sample_time, session_id order by 1) r1
from (select /*+ parallel 8 */
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.*
from m_ash20180322 t
where sample_time >
to_timestamp('2018-03-22 09:59:00',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('2018-03-22 10:00:00',
'yyyy-mm-dd hh24:mi:ss')
start with blocking_session is not null
connect by nocycle
prior dbid = dbid
and prior sample_time = sample_time
/*and ((prior sample_time) - sample_time between interval '-1'
second and interval '1' second)*/
and prior blocking_inst_id = instance_number
and prior blocking_session = session_id
and prior
blocking_session_serial# = session_serial#) t
where t.isleaf = 1) t
where r1 = 1) t
where r < 3
order by dbid, sample_time, r;
對其他異常時段進行分析:
2018-03-22 12:29:00
2018-03-22 12:30:00
2018-03-22 14:09:00
2018-03-22 14:10:00
-- top holder: DIY sample_time
select t.lv,
t.iscycle,
t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.sql_id,
t.session_type,
t.event,
t.seq#,
t.session_state,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_status,
t.c blocking_session_count
from (select t.*,
row_number() over(partition by dbid, instance_number, sample_time order by c desc) r
from (select t.*,
count(*) over(partition by dbid, instance_number, sample_time, session_id) c,
row_number() over(partition by dbid, instance_number, sample_time, session_id order by 1) r1
from (select /*+ parallel 8 */
level lv,
connect_by_isleaf isleaf,
connect_by_iscycle iscycle,
t.*
from m_ash20180322 t
where sample_time >
to_timestamp('&begin_sample_time',
'yyyy-mm-dd hh24:mi:ss')
and sample_time <
to_timestamp('&end_sample_time',
'yyyy-mm-dd hh24:mi:ss')
start with blocking_session is not null
connect by nocycle
prior dbid = dbid
and prior sample_time = sample_time
/*and ((prior sample_time) - sample_time between interval '-1'
second and interval '1' second)*/
and prior blocking_inst_id = instance_number
and prior blocking_session = session_id
and prior
blocking_session_serial# = session_serial#) t
where t.isleaf = 1) t
where r1 = 1) t
where r < 3
order by dbid, sample_time, r;
發現所有的異常時刻最終阻塞都是實例1的sid為3548的session,不再贅述。
5.總結
從第四步可以看到,top holder都是實例1,會話3548. 比如可以看到實例1的481會話被實例2的6377會話阻塞,然后實例2的6377會話又被實例1的3548會話阻塞。 通過sql_id可以查詢到sql文本:select * from dba_hist_sqltext where sql_id = '&sql_id';
可以看到實例1的3548會話當前正在執行的SQL只是一個查詢語句,當前會話狀態是ON CPU,所以推測該會話之前有DML的事物未提交導致阻塞。
去查詢該會話的DML操作時,也有update和insert操作,但是update操作已經無法找到對應SQL文本。
select t.event, t.*
from m_ash20180322 t
where instance_number = 1
and session_id = 3548
and t.sql_opname <> 'SELECT';
其實從ash也可以看到關於3548阻塞的信息,甚至從addm的建議中也會有類似建議:
Rationale
The session with ID 3548 and serial number 8795 in instance number 1 was
the blocking session responsible for 52% of this recommendation's
benefit.
Rationale
The session with ID 6377 and serial number 30023 in instance number 2
was the blocking session responsible for 47% of this recommendation's
benefit.
只不過我們從底層查詢,可以看到6377實際也是被3548阻塞,找到最終阻塞者。
btw,從導入的awrdump中,除了可以取awr外,同樣可以支持取awrsqrpi和addmrpti以及ashrpti,非常方便:
SYS@jyzhao1 >@?/rdbms/admin/awrrpti
SYS@jyzhao1 >@?/rdbms/admin/awrsqrpi
SYS@jyzhao1 >@?/rdbms/admin/ashrpti
SYS@jyzhao1 >@?/rdbms/admin/addmrpti