監控工具DPA發現海外一台Oracle數據庫服務器DB Commit Time指標告警,超過紅色告警線(40毫秒左右,黃色告警是10毫秒,紅色告警線是20毫秒),如下截圖所示,生成了對應的時段的AWR報告,發現Top 5 Timed Events里面,log file sync等待事件的平均等待時間為37毫秒,log file parallel write等待事件的平均等待時間為40毫秒
如果對Tanel Poder::Understanding LGWR, Log File Sync Waits and Commit Performance這篇文章所講述的內容很熟悉的話(經典圖如下),那么通過等待事件log file sync與log file parallel write的Avg Wait(ms)指標: 37ms & 40ms,基本上可以判斷就是redo log所在的磁盤I/O出現了性能問題
然后在用lfsdiag.sql腳本分析一下詳細的統計數據,如下所示:
HISTOGRAM DATA FOR LFS AND OTHER RELATED WAITS:
APPROACH: Look at the wait distribution for log file sync waits
by looking at "wait_time_milli". Look at the high wait times then
see if you can correlate those with other related wait events.
INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
1 log file sync 1 4363
1 log file sync 2 835
1 log file sync 4 1650
1 log file sync 8 4937
1 log file sync 16 146252
1 log file sync 32 606674
1 log file sync 64 263377
1 log file sync 128 253254
1 log file sync 256 2
1 log file switch completion 1 124
1 log file switch completion 2 9
1 log file switch completion 4 19
1 log file switch completion 8 21
1 log file switch completion 16 35
1 log file switch completion 32 97
1 log file switch completion 64 133
1 log file switch completion 128 326
1 log file switch completion 256 1736
1 log file switch completion 512 3042
1 log file switch completion 1024 2020
1 log file parallel write 1 0
1 log file parallel write 2 0
1 log file parallel write 4 80
1 log file parallel write 8 2142
1 log file parallel write 16 170987
1 log file parallel write 32 779205
1 log file parallel write 64 311463
1 log file parallel write 128 79688
1 log file parallel write 256 42763
1 log file parallel write 512 13052
1 log file parallel write 1024 20468
1 log file parallel write 2048 14020
1 log file parallel write 4096 921
1 log file parallel write 8192 96
1 log file parallel write 16384 18
1 log file parallel write 32768 18
1 log file parallel write 65536 8
1 log file parallel write 131072 2
1 LGWR wait for redo copy 1 8516
1 LGWR wait for redo copy 2 20
1 LGWR wait for redo copy 4 19
1 LGWR wait for redo copy 8 20
1 LGWR wait for redo copy 16 11
ORDERED BY WAIT_TIME_MILLI
INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
1 log file sync 1 4363
1 log file switch completion 1 124
1 log file parallel write 1 0
1 LGWR wait for redo copy 1 8516
1 log file sync 2 835
1 log file switch completion 2 9
1 log file parallel write 2 0
1 LGWR wait for redo copy 2 20
1 log file sync 4 1650
1 log file switch completion 4 19
1 log file parallel write 4 80
1 LGWR wait for redo copy 4 19
1 log file sync 8 4937
1 log file switch completion 8 21
1 log file parallel write 8 2142
1 LGWR wait for redo copy 8 20
1 log file sync 16 146252
1 log file switch completion 16 35
1 log file parallel write 16 170987
1 LGWR wait for redo copy 16 11
1 log file sync 32 606674
1 log file switch completion 32 97
1 log file parallel write 32 779205
1 log file sync 64 263377
1 log file switch completion 64 133
1 log file parallel write 64 311463
1 log file sync 128 253254
1 log file switch completion 128 326
1 log file parallel write 128 79688
1 log file sync 256 2
1 log file switch completion 256 1736
1 log file parallel write 256 42763
1 log file switch completion 512 3042
1 log file parallel write 512 13052
1 log file switch completion 1024 2020
1 log file parallel write 1024 20468
1 log file parallel write 2048 14020
1 log file parallel write 4096 921
1 log file parallel write 8192 96
1 log file parallel write 16384 18
1 log file parallel write 32768 18
1 log file parallel write 65536 8
1 log file parallel write 131072 2
REDO WRITE STATS
"redo write time" in centiseconds (100 per second)
11.1: "redo write broadcast ack time" in centiseconds (100 per second)
11.2: "redo write broadcast ack time" in microseconds (1000 per millisecond)
VERSION INST_ID NAME VALUE MILLISECONDS
----------------- ---------- ---------------------------------------- --------------------- -------------------
10.2.0.5.0 1 redo write time 9551524 95515240.000
10.2.0.5.0 1 redo writer latching time 51
10.2.0.5.0 1 redo writes 1434931
AWR WORST AVG LOG FILE SYNC SNAPS:
上面數據可以看到,log file sync等待事件數量最多的是32ms這個區間的,log file parallel write等待事件發生最多的也是32ms這個區間的,其實這個值已經遠遠超過7ms,極其不正常。log file parallel write 事件是LGWR進程專屬的等待事件,發生在LGWR將log_buffer中的重做日志信息寫入聯機重做日志文件組的成員文件,LGWR在該事件上等待該寫入過程的完成。該事件等待時間過長,說明日志文件所在磁盤緩慢或存在爭用。log file sync和log file parallel write是相互關聯的。換句話講,假設log file parallel write的時間非常長,那么必定導致log file sync等待時間拉長。如果log file parallel write 等待非常高,那么可能一般是物理磁盤I/O的問題
另外,我們也檢查了一下redo log的切換頻率,如下所示,redolog_sitch_time.sql查看發現redo log切換的次數並不頻繁,生成的歸檔日志的量也並不大。大部分時候一小時切換零次或一次。
然后我們找了一台機器(上述指標正常的服務器)簡單對測試了一下IO的速度,這個方法極其簡單,就是看看生成一個大文件需要多長時間,簡單測試一下I/O性能(沒有考慮cache等,測試采樣也不詳盡),但是對比數據也基本能驗證、反饋磁盤IO存在問題)。
問題服務器:
# time dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct
2048+0 records in
2048+0 records out
1073741824 bytes (1.1 GB) copied, 88.271 seconds, 12.2 MB/s
real 1m28.273s
user 0m0.010s
sys 0m0.655s
對比服務器(正常的服務器):
# time dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct
2048+0 records in
2048+0 records out
1073741824 bytes (1.1 GB) copied, 2.48344 seconds, 432 MB/s
real 0m2.485s
user 0m0.004s
sys 0m0.386s
如上對比所示,兩台服務器生成同樣一個大小文件,耗費的時間,I/O性能差別非常大,完全驗證了告警的服務器所在的存儲I/O存在性能問題,但是公司分工非常明確,DBA也不清楚底層存儲出了什么問題,只能將這個問題反饋出來,等待海外負責維護系統和存儲的同事的回復。