Oracle: use ON COMMIT-refreshed materialized views with caution (Part 1)
Background
Late on Friday afternoon, 2020-08-07, just before the end of the workday, something happened: users reported that they could not log in to the xxxx platform. At first nobody suspected the database.
The application staff were guessing at this and that, wondering whether some IP range had been blocked again. After all of that was ruled out, they even considered restarting the Nginx server (not my area),
but the person who maintains Nginx happened to be on leave that day.
As it happened, about half an hour before the incident, a monitoring email had already reported an error.
At the time I took it to be nothing more than the archive destination getting close to full, so I simply backed up the archive logs and cleared out the surplus ones.
Investigation
Once the incident was raised, even though the database was not the prime suspect, I still took a look at it.
The alert log was warning that flash recovery area usage had exceeded its threshold. Something was wrong, definitely wrong.
Apart from the exploding volume of archived redo (which had also blown out the archive area of the recently built standby), there were no other errors.
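For completeness, the flash recovery area pressure itself is quick to confirm from SQL; a minimal check (the view names below are the standard 11g ones, adjust for your release):

-- Overall FRA size, usage and how much of it is reclaimable
SELECT name,
       ROUND(space_limit/1024/1024/1024, 2)       AS limit_gb,
       ROUND(space_used/1024/1024/1024, 2)        AS used_gb,
       ROUND(space_reclaimable/1024/1024/1024, 2) AS reclaimable_gb
  FROM v$recovery_file_dest;

-- Breakdown by file type (archived logs, flashback logs, backups, ...)
SELECT file_type, percent_space_used, percent_space_reclaimable, number_of_files
  FROM v$recovery_area_usage;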
First, look at the log switch frequency. Compared with the same window on a normal day, the archive volume was more than 100 times higher,
which at one point made me suspect that something in the application was driving a surge of DML.

SELECT TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'999') "00",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'999') "01",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'999') "02",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'999') "03",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'999') "04",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'999') "05",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'999') "06",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'999') "07",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'999') "08",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'999') "09",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'999') "10",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'999') "11",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'999') "12",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'999') "13",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'999') "14",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'999') "15",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'999') "16",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'999') "17",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'999') "18",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'999') "19",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'999') "20",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'999') "21",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'999') "22",
       TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'999') "23"
  FROM V$LOG_HISTORY
 GROUP BY TO_CHAR(FIRST_TIME,'YYYY-MM-DD')
 ORDER BY 1 DESC;
At this point I needed to pin down which SQL statement, or statements, was causing it.
The following query lists the objects with large numbers of changed data blocks,

SELECT TO_CHAR(BEGIN_INTERVAL_TIME, 'YYYY-MM-DD HH24') SNAP_TIME,
       DHSO.OBJECT_NAME,
       SUM(DB_BLOCK_CHANGES_DELTA) BLOCK_CHANGED
  FROM DBA_HIST_SEG_STAT DHSS,
       DBA_HIST_SEG_STAT_OBJ DHSO,
       DBA_HIST_SNAPSHOT DHS
 WHERE DHS.SNAP_ID = DHSS.SNAP_ID
   AND DHS.INSTANCE_NUMBER = DHSS.INSTANCE_NUMBER
   AND DHSS.OBJ# = DHSO.OBJ#
   AND DHSS.DATAOBJ# = DHSO.DATAOBJ#
   AND BEGIN_INTERVAL_TIME BETWEEN TO_DATE('2020-08-07 16:00', 'YYYY-MM-DD HH24:MI')
                               AND TO_DATE('2020-08-07 17:30', 'YYYY-MM-DD HH24:MI')
 GROUP BY TO_CHAR(BEGIN_INTERVAL_TIME, 'YYYY-MM-DD HH24'), DHSO.OBJECT_NAME
HAVING SUM(DB_BLOCK_CHANGES_DELTA) > 0
 ORDER BY SUM(DB_BLOCK_CHANGES_DELTA) ASC;
The key part of the output is below,
SNAP_TIME     |OBJECT_NAME       |BLOCK_CHANGED
--------------|------------------|-------------
............. |..............    |       ......
2020-08-07 16 |..............    |       100480
2020-08-07 16 |..............    |       129184
2020-08-07 16 |..............    |       137024
2020-08-07 17 |..............    |       143520
2020-08-07 17 |..............    |       145440
2020-08-07 16 |..............    |       228624
2020-08-07 16 |..............    |       256704
2020-08-07 17 |..............    |       548336
2020-08-07 17 |..............    |     55019536
2020-08-07 17 |..............    |    103429408
2020-08-07 16 |T_XXXXXXXXXX_INFO |    103997488
2020-08-07 16 |PK_X_XXXXXX_XXXX  |    195532752

210 rows selected.

Elapsed: 00:00:02.29
Within the 16:00 hour on 2020-08-07, two objects stood out with by far the most block changes:
T_XXXXXXXXXX_INFO
PK_X_XXXXXX_XXXX
PK_X_XXXXXX_XXXX was easy to resolve: it is a primary key index, and the table it belongs to is X_XXXXXX_XXXX.
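For reference, resolving an index name back to its table is a one-liner against DBA_INDEXES (the index name below is the masked one from the report):

-- Which table does this (masked) index belong to?
SELECT owner, index_name, index_type, table_owner, table_name
  FROM dba_indexes
 WHERE index_name = 'PK_X_XXXXXX_XXXX';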
The next step was to pull the AWR report for the problem window, look for SQL touching X_XXXXXX_XXXX and T_XXXXXXXXXX_INFO, and check the wait events while at it.
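If you want to reproduce that step, the report can be generated interactively with @?/rdbms/admin/awrrpt.sql, or directly from DBMS_WORKLOAD_REPOSITORY; a sketch with illustrative values:

-- Text AWR report for the incident window; dbid, instance and snap IDs are placeholders
SELECT output
  FROM TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_TEXT(
         1234567890,   -- dbid:     SELECT dbid FROM v$database
         1,            -- instance: SELECT instance_number FROM v$instance
         16585,        -- begin snapshot
         16588));      -- end snapshot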
One glance at the wait events showed the problem,
Event | Waits | Total Wait Time (sec) | Wait Avg(ms) | % DB time | Wait Class |
---|---|---|---|---|---|
enq: TX - row lock contention | 683 | 223.1K | 326667 | 39.0 | Application |
gc buffer busy acquire | 10,048 | 163.8K | 16302 | 28.6 | Cluster |
enq: JI - contention | 23,034 | 112.7K | 4892 | 19.7 | Other |
buffer busy waits | 1,770 | 20.9K | 11784 | 3.6 | Concurrency |
DB CPU | | 20K | | 3.5 | |
enq: TX - index contention | 591 | 6630.2 | 11219 | 1.2 | Concurrency |
log file switch (checkpoint incomplete) | 481 | 6268.1 | 13031 | 1.1 | Configuration |
db file scattered read | 482,552 | 4536 | 9 | .8 | User I/O |
gc current block busy | 10,845 | 3853.8 | 355 | .7 | Cluster |
direct path read | 832,639 | 3170 | 4 | .6 | User I/O |
This system normally never shows TX lock waits anywhere near this high, so the next question was which segment was suffering the row lock waits (a query for catching the blockers live is sketched after the table below),
Segments by Row Lock Waits
Owner | Tablespace Name | Object Name | Subobject Name | Obj. Type | Row Lock Waits | % of Capture |
---|---|---|---|---|---|---|
ABCD_EFG | ABCD | X_XXXXXX_XXXX | | TABLE | 681 | 53.71 |
ABCD_XXXXXXXXS | DYCK_FILETRANS | PK_XXXX_XXXXXXXX_TMP | | INDEX | 321 | 25.32 |
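When the incident is still in flight, the same picture can be pulled live from v$session rather than AWR; a minimal sketch:

-- Sessions currently stuck on TX row-lock waits, and who is blocking them
SELECT sid, serial#, sql_id, blocking_session, row_wait_obj#, seconds_in_wait
  FROM v$session
 WHERE event = 'enq: TX - row lock contention';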
The top segment by row lock waits, X_XXXXXX_XXXX, is exactly the same object that topped the block change list earlier. The block change section of the report tells the same story,
Segments by DB Blocks Changes
Owner | Tablespace Name | Object Name | Subobject Name | Obj. Type | DB Block Changes | % of Capture |
---|---|---|---|---|---|---|
XXXXX_DATA | USERS | PK_X_XXXXXX_XXXX | | INDEX | 161,129,664 | 64.72 |
XXXXX_DATA | USERS | T_XXXXXXXXXX_INFO | | TABLE | 85,726,272 | 34.43 |
Everything lines up!
Now search the AWR report for SQL that references the X_XXXXXX_XXXX table,
Elapsed Time (s) | Executions | Elapsed Time per Exec (s) | %Total | %CPU | %IO | SQL Id | SQL Module | SQL Text |
---|---|---|---|---|---|---|---|---|
226,451.96 | 1,075 | 210.65 | 39.58 | 0.01 | 0.00 | 95nn49n57dqza | JDBC Thin Client | update X_XXXXXX_XXXX set ID=:1... |
166,564.13 | 1,368 | 121.76 | 29.12 | 0.14 | 0.00 | 08x3dymqbdbd4 | JDBC Thin Client | ...................... |
20,730.38 | 294 | 70.51 | 3.62 | 0.01 | 0.03 | 4vs91dcv7u1p6 | JDBC Thin Client | ...................... |
5,395.58 | 0 | | 0.94 | 5.25 | 94.88 | c7sa42jj690jw | | ...................... |
5,395.54 | 7 | 770.79 | 0.94 | 5.25 | 94.88 | 1hsz4n5g5dz1j | | ...................... |
5,205.90 | 38 | 137.00 | 0.91 | 40.58 | 58.45 | 66u44fa687q74 | xxxxxxxxx.exe | ...................... |
4,619.25 | 17,384 | 0.27 | 0.81 | 0.05 | 0.08 | atwcm5hwwbtj3 | xxxxxxxxxxxxxxxxxx.exe | ...................... |
4,231.15 | 1 | 4,231.15 | 0.74 | 99.72 | 0.09 | 2d0r4v0y38vm1 | xxxxxxxxxxx.exe | ...................... |
4,049.48 | 47,215 | 0.09 | 0.71 | 0.14 | 0.20 | cqfs0h217uuc3 | xxxxxxxxxxxxxxxxxx.exe | ...................... |
2,707.50 | 1 | 2,707.50 | 0.47 | 99.87 | 0.10 | gjj8zbpgta9rv | xxxx.exe | ...................... |
Check the execution history of 95nn49n57dqza,

set lines 150 pages 150
col BEGIN_INTERVAL_TIME for a23
col PLAN_HASH_VALUE for 9999999999
col date_time for a30
col snap_id heading 'SnapId'
col executions_delta heading "No. of exec"
col sql_profile heading "SQL|Profile" for a7
col date_time heading 'Date time'
col avg_lio heading 'LIO/exec' for 99999999999.99
col avg_cputime heading 'CPUTIM/exec' for 9999999.99
col avg_etime heading 'ETIME/exec' for 9999999.99
col avg_pio heading 'PIO/exec' for 9999999.99
col avg_row heading 'ROWs/exec' for 9999999.99

SELECT distinct s.snap_id,
       PLAN_HASH_VALUE,
       to_char(s.BEGIN_INTERVAL_TIME,'yyyy-mm-dd hh24:mi:ss')||
       to_char(s.END_INTERVAL_TIME,'-hh24:mi:ss') Date_Time,
       SQL.executions_delta,
       SQL.buffer_gets_delta/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_lio,
       --SQL.ccwait_delta,
       (SQL.cpu_time_delta/1000000)/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_cputime,
       (SQL.elapsed_time_delta/1000000)/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_etime,
       SQL.DISK_READS_DELTA/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_pio,
       SQL.rows_processed_total/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_row
       --,SQL.sql_profile
  FROM dba_hist_sqlstat SQL, dba_hist_snapshot s
 WHERE SQL.instance_number = (select instance_number from v$instance)
   AND SQL.dbid = (select dbid from v$database)
   AND s.snap_id = SQL.snap_id
   AND sql_id in ('&SQLID')
 ORDER BY s.snap_id
/
The result was,
    SnapId|PLAN_HASH_VALUE|Date time                     |No. of exec|   LIO/exec|CPUTIM/exec| ETIME/exec|   PIO/exec|  ROWs/exec
----------|---------------|------------------------------|-----------|-----------|-----------|-----------|-----------|-----------
     16581|     2297901584|2020-08-07 13:30:02-14:00:09  |        370|       4.04|        .00|        .00|        .00|    5547.76
     16583|     2297901584|2020-08-07 14:30:16-15:00:24  |        563|       4.02|        .00|        .00|        .00|    3648.02
     16584|     2297901584|2020-08-07 15:00:24-15:30:31  |        567|       4.03|        .00|        .00|        .00|    3623.28
     16585|     2297901584|2020-08-07 15:30:31-16:00:39  |        442|       4.05|        .00|        .00|        .00|    4648.97
     16586|     2297901584|2020-08-07 16:00:39-16:30:47  |        429|       4.28|        .00|        .59|        .01|    4790.84
     16587|     2297901584|2020-08-07 16:30:47-17:01:31  |        366|       5.72|        .06|     169.81|        .07|    5616.50
     16587|     2297901584|2020-08-07 16:30:47-17:07:50  |        366|       5.72|        .06|     169.81|        .07|    5616.50
     16588|     2297901584|2020-08-07 17:01:31-17:30:32  |        278|      15.70|        .03|     423.38|        .15|    7395.38
     16588|     2297901584|2020-08-07 17:07:50-17:30:32  |        278|      15.70|        .03|     423.38|        .15|    7395.38
     16589|     2297901584|2020-08-07 17:30:32-18:00:41  |        369|      31.79|        .07|    1180.75|        .00|    5572.59
     16590|     2297901584|2020-08-07 18:00:41-18:30:51  |        535|      44.85|        .06|     909.64|        .00|    3844.53
     16591|     2297901584|2020-08-07 18:30:51-19:00:00  |        201|      55.29|        .01|     192.97|        .00|   10233.94
     16592|     2297901584|2020-08-07 19:00:00-19:30:08  |         26|       4.69|        .00|       1.26|        .00|   79117.23
History is available from about 13:00 on August 7. From roughly 16:30 onward the elapsed time per execution climbs sharply, yet the execution plan never changed.
And judging from the execution counts, there was no dramatic swing in how often the statement ran.
At this point there seemed to be no obvious way to dig further on my own. What was certain was that the problem statement was 95nn49n57dqza: the UPDATE was generating massive TX lock waits, which is why the xxxx platform either refused logins or took ages to render a page.
And the TX waits were so severe because a single update had gone from under 1 second to hundreds or even thousands of seconds.
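After the fact, the blocking relationships behind those TX waits can still be reconstructed from ASH history; a sketch over the incident window (the times are illustrative):

-- Which SQL was waiting on row locks during the incident, and which sessions blocked it
SELECT sql_id, blocking_session, COUNT(*) AS ash_samples
  FROM dba_hist_active_sess_history
 WHERE event = 'enq: TX - row lock contention'
   AND sample_time BETWEEN TIMESTAMP '2020-08-07 16:00:00'
                       AND TIMESTAMP '2020-08-07 19:00:00'
 GROUP BY sql_id, blocking_session
 ORDER BY ash_samples DESC;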
There is a gap in the reasoning here. Once the AWR report showed the severe TX waits, I focused so tightly on the SQL that I overlooked part of the earlier block change analysis: besides PK_X_XXXXXX_XXXX there was a second heavily changed object, T_XXXXXXXXXX_INFO.
If you ignore T_XXXXXXXXXX_INFO and keep following the original thread, correlating the performance views alone seems to go no further; at that point the options are a database-level systemstate dump (ssd) or a 10046 trace.
I would recommend 10046 and working backwards from its output: a 10046 trace would show that, besides executing 95nn49n57dqza, the sessions were also doing INSERTs into the T_XXXXXXXXXX_INFO table.
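If a 10046 trace is wanted, the least intrusive way is DBMS_MONITOR on one of the suspect sessions; the sid and serial# below are placeholders:

-- Level 12-equivalent trace (waits + binds) on a single session, then turn it off
EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 1234, serial_num => 56789, waits => TRUE, binds => TRUE);
-- ... let the session run through a few of the slow updates ...
EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 1234, serial_num => 56789);
-- then read the trace file from the diag trace directory, e.g. with tkprof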
Because 95nn49n57dqza was identified so quickly, I asked the application team straight away. They confirmed that around 16:00 they had created a materialized view, T_XXXXXXXXXX_INFO, whose base table is X_XXXXXX_XXXX, so the 10046 trace was never run.
Looking at the creation statement, it used ON COMMIT. After a quick check of the documentation it was all but certain that this was the cause, so the decision was made to drop the materialized view immediately.
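Whether an existing MV is refreshed on commit can also be confirmed from the dictionary; a minimal check (the MV name is the masked one used above):

-- REFRESH_MODE = 'COMMIT' means the MV is refreshed as part of every commit on the base table
SELECT owner, mview_name, refresh_mode, refresh_method, fast_refreshable
  FROM dba_mviews
 WHERE mview_name = 'T_XXXXXXXXXX_INFO';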
18:54:45 SYS@dyckdb1(2194)> drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO;
drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO
*
ERROR at line 1:
ORA-04021: timeout occurred while waiting to lock object

Elapsed: 00:00:49.05
--revoke the grant first, otherwise the materialized view cannot be dropped
18:57:45 SYS@dyckdb2(1269)> revoke select on dyck_pub.t_client_info from MACRO_DATA;
^Crevoke select on dyck_pub.t_client_info from MACRO_DATA
*
ERROR at line 1:
ORA-01013: user requested cancel of current operation

Elapsed: 00:02:49.57

19:02:04 SYS@dyckdb2(1269)> /

Revoke succeeded.

Elapsed: 00:00:09.58

19:02:46 SYS@dyckdb1(2194)> drop MATERIALIZED VIEW MACRO_DATA.T_ENTERPRISE_INFO;

Materialized view dropped.

Elapsed: 00:00:03.82
Once the materialized view was dropped, the problem went away.
For reference, here is another query that helps locate the DML (and therefore redo-generating) SQL against a given table within a specified time window:

SELECT TO_CHAR(BEGIN_INTERVAL_TIME,'YYYY_MM_DD HH24') WHEN,
       DBMS_LOB.SUBSTR(SQL_TEXT,4000,1) SQL,
       DHSS.INSTANCE_NUMBER INST_ID,
       DHSS.SQL_ID,
       EXECUTIONS_DELTA EXEC_DELTA,
       ROWS_PROCESSED_DELTA ROWS_PROC_DELTA
  FROM DBA_HIST_SQLSTAT DHSS,
       DBA_HIST_SNAPSHOT DHS,
       DBA_HIST_SQLTEXT DHST
 WHERE UPPER(DHST.SQL_TEXT) LIKE '%T_CLIENT_INFO%'
   AND LTRIM(UPPER(DHST.SQL_TEXT)) NOT LIKE 'SELECT%'
   AND DHSS.SNAP_ID = DHS.SNAP_ID
   AND DHSS.INSTANCE_NUMBER = DHS.INSTANCE_NUMBER
   AND DHSS.SQL_ID = DHST.SQL_ID
   AND BEGIN_INTERVAL_TIME BETWEEN TO_DATE('2020-08-07 16:00','YYYY-MM-DD HH24:MI')
                               AND TO_DATE('2020-08-07 17:00','YYYY-MM-DD HH24:MI')
 ORDER BY EXEC_DELTA;
With that query, you can indeed see both the UPDATE against the base table and the INSERT into the materialized view.
The next step is to simulate just how much ON COMMIT refresh hurts the UPDATE, and at the same time check whether the materialized view relationship is what drives the redo surge; a sketch of the test bed follows.
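A minimal, hypothetical test bed for that simulation could look like the following (the table, MV and column names are made up; the point is that every commit against t_base now also refreshes mv_base synchronously):

-- Base table plus an ON COMMIT fast-refresh materialized view on top of it
CREATE TABLE t_base (id NUMBER PRIMARY KEY, val VARCHAR2(100));

CREATE MATERIALIZED VIEW LOG ON t_base WITH PRIMARY KEY, ROWID (val) INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW mv_base
  REFRESH FAST ON COMMIT
  AS SELECT id, val FROM t_base;

-- Run the same batch of updates with and without the MV in place and compare
-- the session's 'redo size' statistic before and after each run
SELECT vs.name, ms.value
  FROM v$mystat ms
  JOIN v$statname vs ON vs.statistic# = ms.statistic#
 WHERE vs.name = 'redo size';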