reliable message
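1 Symptom
A colleague reported that a SQL*Loader process was loading data extremely slowly: a job that normally finishes in a few minutes had been running for three and a half hours and still had not completed. Querying the database sessions showed the following: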
 SID USER_NAME  EVENT
---- ---------- ----------------
1962 STG        reliable message
I took this as an opportunity to look into this wait event.
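The query behind the session listing is not shown in the post. A minimal sketch of one way to get a similar listing, assuming a RAC or single-instance database where gv$session is readable, is the following; the extra columns (p1raw, state, seconds_in_wait) are additions for context and were not part of the original output.

```sql
-- Sketch only: list sessions currently waiting on 'reliable message'.
-- The original post shows just SID, user name and event; the other
-- columns are added here as an assumption about what is useful.
SELECT inst_id,
       sid,
       username,
       event,
       p1raw,            -- first wait parameter, in hex
       state,
       seconds_in_wait
FROM   gv$session
WHERE  event = 'reliable message';
```

The p1raw value is what Method 2 in section 2.2 matches against x$ksrcctx.addr.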
2 Analysis of reliable message
2.1 Event description
MOS (My Oracle Support) explains reliable message as follows:
When a process sends a message using the 'KSR' intra-instance broadcast service, the message publisher waits on this wait-event until all subscribers have consumed the 'reliable message' just sent. The publisher waits on this wait-event for up to one second and then re-tests if all subscribers have consumed the message, or until posted. If the message is not fully consumed the wait recurs, repeating until either the message is consumed or until the waiter is interrupted.
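In other words, this wait is seen on the publishing side: the publisher waits on this event as long as the message it sent has not been consumed by all subscribers. According to the documentation, the wait event applies to many different channels. Each channel corresponds to a different situation and therefore has a different fix. Most cases are bugs that require a patch or an upgrade to a later version; the workaround is usually to restart the instance or to disable the related feature.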
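Because the diagnosis depends on which channel is involved, it helps to know what the wait parameters of this event stand for. A small sketch, assuming access to v$event_name, that prints the parameter names the instance itself reports for the event:

```sql
-- Sketch only: show how this instance labels P1/P2/P3 for the event.
SELECT name,
       parameter1,
       parameter2,
       parameter3,
       wait_class
FROM   v$event_name
WHERE  name = 'reliable message';
```

Method 2 in section 2.2 groups by p1, so the label reported for parameter1 tells you what that value refers to.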
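2.2 Identifying the channel
Query the gv$channel_waits view to find the channel with the most waits. Method 1 identifies the problematic channel or channels immediately. Method 2 also works, but is somewhat more cumbersome.
- Method 1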
SELECT CHANNEL, SUM(wait_count) sum_wait_count
FROM GV$CHANNEL_WAITS
GROUP BY CHANNEL
ORDER BY SUM(wait_count) DESC;
Example output:
CHANNEL                                                          SUM_WAIT_COUNT
---------------------------------------------------------------- --------------
Result Cache: Channel                                                  15436686
RBR channel                                                                9393
kxfp control signal channel                                                7357
MMON remote action broadcast channel                                       3070
obj broadcast channel                                                      1731
service operations - broadcast channel                                        2
kill job broadcast - broadcast channel                                        2
parameters to cluster db instances - broadcast channel                        2
quiesce channel                                                               2
From the result above, "Result Cache: Channel" is by far the channel with the most waits.
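Method 1 sums the counts over all instances. On a cluster it can also be useful to see which instance contributes the waits. A minimal sketch, assuming only the inst_id column that every GV$ view carries plus the channel and wait_count columns already used above:

```sql
-- Sketch only: per-instance breakdown of the Method 1 totals.
SELECT inst_id,
       channel,
       wait_count
FROM   gv$channel_waits
WHERE  wait_count > 0
ORDER  BY wait_count DESC;
```

If one instance accounts for nearly all the waits on a channel, that instance is the natural place to collect traces first.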
- Method 2
select to_char(p1, 'XXXXXXXXXXXXXXXX') event_param,
       count(*),
       sum(time_waited/1000000) time_waited
from gv$active_session_history
where event = 'reliable message'
group by to_char(p1, 'XXXXXXXXXXXXXXXX')
order by time_waited*count(*) desc;

-- Take the memory addresses with the greatest impact
select name_ksrcdes
from x$ksrcdes
where indx in (select name_ksrcctx from x$ksrcctx where addr in (&1));

Enter value for 1: '7ACD8AA60','7ACD8FA88'
old   3: where indx in (select name_ksrcctx from x$ksrcctx where addr in (&1))
new   3: where indx in (select name_ksrcctx from x$ksrcctx where addr in ('7ACD8AA60','7ACD8FA88'))

NAME_KSRCDES
----------------------------------------------------------------
Result Cache: Channel
RBR channel
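This result clearly pinpoints "Result Cache: Channel" as the problem. The query above passes two addresses only to illustrate looking up several channels at once; in this example, querying the first address alone, addr='7ACD8AA60', would have been enough.
3 Solutions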
3.1 Result Cache: Channel
Choose one of the following three options:
- Upgrade the database to 12.2 or to the 12.1.0.2.0 patch set
- Apply patch 18416368
- Workaround
SQL> alter system set result_cache_max_size=0 scope=both sid='*';
After changing the parameter, the instance must be restarted.
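Before and after applying the workaround, it is worth confirming the value each instance is actually running with. A small sketch, assuming gv$parameter is readable; a value of 0 means the result cache is disabled on that instance:

```sql
-- Sketch only: check the result cache size on every instance.
SELECT inst_id,
       name,
       value
FROM   gv$parameter
WHERE  name = 'result_cache_max_size';
```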
3.2 RBR channel
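Affected version: 11.2.0.3
Bug 15826962: High "reliable message" wait due to "RBR channel".
The safest approach is to obtain a process trace or a system trace and compare it against the MOS note, or to open an SR and have Oracle Support confirm the diagnosis.
The bug is fixed in the following versions and patches: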
- 11.2.0.4 (Server Patch Set)
- 11.2.0.3.12 (Oct 2014) Database Patch Set Update (DB PSU)
- 11.2.0.3 Bundle Patch 19 for Exadata Database
- 11.2.0.3 Patch 34 on Windows Platforms
- 11.2.0.3 Patch 23 on Windows Platforms
So the fix is to upgrade or to apply a patch.
3.3 kxfp control signal channel
- Affected version: 12.1.0.2
- Symptom analysis
In this case, this is not the only channel with heavy waits. For example:
select CHANNEL, sum(wait_count) sum_wait_count
from GV$CHANNEL_WAITS
group by CHANNEL
order by sum(wait_count);

CHANNEL                                                          SUM_WAIT_COUNT
---------------------------------------------------------------- --------------
Flashback RVWR init channel                                                   2
quiesce channel                                                               3
PMON actions channel                                                          6
Broker IQ Result Channel                                                     24
kill job broadcast - broadcast channel                                       54
parameters to cluster db instances - broadcast channel                      137
GEN0 ksbxic channel                                                        1035
Flashback Marker channel                                                   1546
LCK0 ksbxic channel                                                        2669
service operations - broadcast channel                                     7033
MMON remote action broadcast channel                                      78046
kxfp remote slave spawn channel                                          157850
Result Cache: Channel                                                    242303
RBR channel                                                             1595647
obj broadcast channel                                                   4105387
kxfp control signal channel                                             5582125
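Besides kxfp control signal channel, obj broadcast channel also stands out. Both have several times, or even tens of times, the wait counts of the other channels.
It is also advisable to run a hang analysis and check whether the trace file contains the following: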
service name: SYS$BACKGROUND
Current Wait Stack:
 1: waiting for 'CSS group membership query'
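If so, the CSS group membership query is blocked; normally it should complete very quickly.
Taken together, these two symptoms are largely enough to conclude that this is Oracle bug 20470877.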
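The post does not show how the hang analysis was taken. A minimal sketch of a commonly used sequence, assuming a SQL*Plus session connected as SYSDBA on a clustered database; the level 3 and the cluster-wide "-g all" option are assumptions, not values from the original:

```sql
-- Sketch only: SQL*Plus oradebug commands, run as SYSDBA.
oradebug setmypid
oradebug unlimit
-- Cluster-wide hang analysis; level 3 is an assumed, commonly used level.
oradebug -g all hanganalyze 3
-- Print the trace file of the current session.
oradebug tracefile_name
```

With the cluster-wide option the analysis is typically written to the DIAG process trace files rather than to the session's own trace, so the "Current Wait Stack" section may need to be searched for there.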
- Solution
The only fix is to apply the patch:
Patch 20470877: LONG WAITS FOR "RELIABLE MESSAGE" AFTER A FEW DAYS OF UPTIME
- Workaround
Restart the instance.
Created: 2019-12-26 Thu 13:39