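A few days ago I set up a simulated environment for a colleague to test an OGG issue.
Environment: Oracle 11.2.0.4 RAC + single-instance 11.2.0.4 ADG (also the OGG source, OGG version 19.1.0.0.4) + single-instance 19.3 multitenant (one PDB serves as the OGG target, OGG version 19.1.0.0.4)
Summary: the OGG process abended because the archive destination on the primary was full, even though an automatic archive cleanup script was already configured (it cleans up when archive space usage exceeds 90%). Digging further showed that the root cause was the cleanup itself failing with RMAN-08137. This had several knock-on effects: first, test data could no longer be written to the primary; second, the ADG standby fell behind; and finally the OGG extract on the source side timed out with OGG-02149 and abended.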
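1. Symptom: archive cleanup fails with RMAN-08137
The automatic archive cleanup stopped working and reported RMAN-08137; a manual cleanup behaved the same way: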
RMAN> DELETE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE-1/24';
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+FRA/crmdb/archivelog/2020_07_08/thread_1_seq_422.426.1045198149 thread=1 sequence=422
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+FRA/crmdb/archivelog/2020_07_08/thread_1_seq_423.424.1045198157 thread=1 sequence=423
...
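Because the scheduled cleanup failed without anyone noticing, it may help to have the cleanup job inspect its own RMAN log. The following is only a minimal sketch, assuming the RMAN output has been captured as text in the format shown above; the function name find_blocked_logs is hypothetical and not part of any Oracle tool.

```python
import re
from typing import List, Tuple

# Matches the detail line RMAN prints right after each RMAN-08137 warning.
_LOG_LINE = re.compile(
    r"archived log file name=(?P<name>\S+)\s+thread=(?P<thread>\d+)\s+sequence=(?P<seq>\d+)"
)


def find_blocked_logs(rman_output: str) -> List[Tuple[int, int, str]]:
    """Return (thread, sequence, file name) for every log RMAN refused to delete."""
    blocked = []
    lines = rman_output.splitlines()
    for i, line in enumerate(lines):
        if not line.lstrip().startswith("RMAN-08137"):
            continue
        # The file name, thread and sequence are on the line that follows the warning.
        if i + 1 < len(lines):
            m = _LOG_LINE.search(lines[i + 1])
            if m:
                blocked.append((int(m["thread"]), int(m["seq"]), m["name"]))
    return blocked


# Sample text copied from the RMAN output above.
sample = """\
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+FRA/crmdb/archivelog/2020_07_08/thread_1_seq_422.426.1045198149 thread=1 sequence=422
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+FRA/crmdb/archivelog/2020_07_08/thread_1_seq_423.424.1045198157 thread=1 sequence=423
"""

for thread, seq, name in find_blocked_logs(sample):
    print(f"thread {thread} sequence {seq} still held: {name}")
```

A cleanup script could raise an alert whenever this list is non-empty, instead of treating the RMAN run as successful.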
Look up the description of RMAN-08137 with oerr:
[oracle@jystdrac1 logs]$ oerr rman 8137
8137, 3, "WARNING: archived log not deleted, needed for standby or upstream capture process"
// *Cause: An archived log that should have been deleted was not as it was
// required by upstream capture process or Data Guard.
// The next message identifies the archived log.
// *Action: This is an informational message. The archived log can be
// deleted after it is no longer needed. See the
// documentation for Data Guard to alter the set of active
// Data Guard destinations. See the documentation for
// Streams to alter the set of active streams.
Check the RMAN configuration; nothing unusual is set:
RMAN> show all;
using target database control file instead of recovery catalog
RMAN configuration parameters for database with db_unique_name CRMDB are:
CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
CONFIGURE BACKUP OPTIMIZATION OFF; # default
CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default
CONFIGURE CONTROLFILE AUTOBACKUP OFF; # default
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default
CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO BACKUPSET; # default
CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE MAXSETSIZE TO UNLIMITED; # default
CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default
CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '+DATA/crmdb/snapcf_crmdb.f';
2. Solution: set the "_deferred_log_dest_is_valid" parameter
Searching further turned up a matching MOS note: RMAN-08137 on Primary Database although Archive Destination to Standby is deferred (Doc ID 1380368.1)
Its cause and solution are quoted below:
CAUSE
If we defer an Archive Destination to a Standby Database, the Primary Database will still consider the Standby Database as existing but temporary unavailable eg. for Maintenance. This can happen if you stop Log Transport Services from the Data Guard Broker or manually defer the State for the Archive Destination.
SOLUTION
As long as the Archive Destination (log_archive_dest_n) is still set, we consider the Standby Database as still existing and preserve the ArchiveLogs on the Primary Database to perform Gap Resolution when the Archive Destination is valid again.
There are Situations when this is not wanted, eg. the Standby Database was activated or removed but you still keep the Archive Destination because you want to rebuild the Standby Database later again. In this Case you can set the hidden Parameter "_deferred_log_dest_is_valid" to FALSE (default TRUE) which will consider deferred Archive Destinations as completely unavailable and will not preserve ArchiveLogs for those Destinations any more. It is a dynamic Parameter and can be set this Way:
SQL> alter system set "_deferred_log_dest_is_valid" = FALSE scope=both;
NOTE: This Parameter has been introduced with Oracle Database 11.2.0.x. In earlier Versions you have to unset the log_archive_dest_n-Parameter pointing to the remote Standby Database to make the Primary Database considering it as completely unavailable. There also exists a Patch on Top of 11.1.0.7 for some Platforms to include this Parameter in 11.1.0.7, too. This is Patch Number 8468117.
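The description above is clear. In this environment there were indeed other log_archive_dest_n destinations whose state was set to DEFER, and we no longer use them. The options are either to remove them completely or to set the hidden parameter as the MOS note describes. First check the current value of the hidden parameter; it is TRUE, the default: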
NAME DESCRIPTION VALUE
----------------------------------- ------------------------------------------------------------------ ------------------------------
_deferred_log_dest_is_valid consider deferred log dest as valid for log deletion (TRUE/FALSE) TRUE
This is a dynamic hidden parameter, so it can be changed to FALSE directly:
alter system set "_deferred_log_dest_is_valid" = FALSE;
-- Query again; the change has taken effect:
NAME DESCRIPTION VALUE
----------------------------------- ------------------------------------------------------------------ ------------------------------
_deferred_log_dest_is_valid consider deferred log dest as valid for log deletion (TRUE/FALSE) FALSE
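3. Recovery: archive cleanup works again, ADG is back in sync, OGG processes start normally
3.1 Archive cleanup works again
With the hidden parameter set as above, the archived logs can now be deleted normally: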
RMAN> DELETE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE-1/24';
...
archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_399.462.1045343023 RECID=1655 STAMP=1045343047
deleted archived log
archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_400.396.1045343069 RECID=1657 STAMP=1045343099
deleted archived log
archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_401.404.1045343111 RECID=1661 STAMP=1045343151
Deleted 87 objects
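3.2 ADG synchronization recovers
Switch logs on the primary, then check the ADG synchronization status again; it eventually returns to normal: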
SQL> @dg
NAME VALUE UNIT TIME_COMPUTED DATUM_TIME
------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
transport lag +06 03:04:01 day(2) to second(0) interval 07/16/2020 00:10:08 07/16/2020 00:10:08
apply lag +06 03:04:01 day(2) to second(0) interval 07/16/2020 00:10:08 07/16/2020 00:10:08
apply finish time +00 00:00:04.358 day(2) to second(3) interval 07/16/2020 00:10:08
estimated startup time 65 second 07/16/2020 00:10:08
SQL> /
NAME VALUE UNIT TIME_COMPUTED DATUM_TIME
------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
transport lag +00 00:00:00 day(2) to second(0) interval 07/16/2020 00:10:16 07/16/2020 00:10:15
apply lag +00 00:00:00 day(2) to second(0) interval 07/16/2020 00:10:16 07/16/2020 00:10:15
apply finish time +00 00:00:00.000 day(2) to second(3) interval 07/16/2020 00:10:16
estimated startup time 65 second 07/16/2020 00:10:16
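For monitoring, the lag values printed above can be turned into durations and compared against a limit. This is a rough sketch, assuming the VALUE column keeps the "+DD HH:MI:SS" interval format seen in the output; parse_lag and lag_exceeds are hypothetical helper names, and the 30,000-second limit is borrowed from the OGG-02149 message purely for illustration.

```python
import re
from datetime import timedelta

# "+06 03:04:01" or "+00 00:00:04.358": days, then hours:minutes:seconds.
_INTERVAL = re.compile(r"^\+(\d+) (\d+):(\d+):(\d+(?:\.\d+)?)$")


def parse_lag(value: str) -> timedelta:
    """Convert an interval string from the lag output into a timedelta."""
    m = _INTERVAL.match(value.strip())
    if m is None:
        raise ValueError(f"unrecognised interval: {value!r}")
    return timedelta(
        days=int(m[1]), hours=int(m[2]), minutes=int(m[3]), seconds=float(m[4])
    )


def lag_exceeds(value: str, limit: timedelta) -> bool:
    """True when the reported lag is strictly greater than the limit."""
    return parse_lag(value) > limit


# Illustrative limit only: the figure quoted in the OGG-02149 error text.
limit = timedelta(seconds=30000)

print(lag_exceeds("+06 03:04:01", limit))  # lag before the fix
print(lag_exceeds("+00 00:00:00", limit))  # lag after the fix
```

Comparing timedelta objects avoids string comparisons on the interval text, which would break as soon as the day count grows.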
3.3 OGG processes start normally
Then start the abended OGG process manually; it comes back up normally:
GGSCI (test03) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPAUDDG 00:00:00 00:00:10
EXTRACT ABENDED EXTAUDDG 00:00:00 146:59:54
GGSCI (test03) 2> view report EXTAUDDG
...
2020-07-09 21:11:37 ERROR OGG-02149 Standby database has made no progress for more than 30,000 seconds.
GGSCI (test03) 3> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPAUDDG 00:00:00 00:00:02
EXTRACT ABENDED EXTAUDDG 00:00:00 147:01:07
GGSCI (test03) 4> start EXTAUDDG
Sending START request to MANAGER ...
EXTRACT EXTAUDDG starting
GGSCI (test03) 5> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPAUDDG 00:00:00 00:00:00
EXTRACT RUNNING EXTAUDDG 00:00:00 00:00:00
-- The OGG target side is fine as well:
GGSCI (db19) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT RUNNING REPAUD1A 00:00:00 00:00:05
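The extract had been abended for more than 146 hours before anyone restarted it, so a small check on the "info all" output could surface this earlier. The sketch below assumes the output has been captured as text with the whitespace-separated layout shown above; abended_groups is a hypothetical helper, not a GoldenGate API.

```python
from typing import List, Tuple


def abended_groups(info_all_output: str) -> List[Tuple[str, str]]:
    """Return (program, group) for every process whose status is ABENDED."""
    found = []
    for line in info_all_output.splitlines():
        fields = line.split()
        # Process rows read: Program, Status, Group, Lag at Chkpt, Time Since Chkpt.
        # The MANAGER row has no group name, so it has only two fields and is skipped.
        if len(fields) >= 3 and fields[1] == "ABENDED":
            found.append((fields[0], fields[2]))
    return found


# Sample text copied from the first "info all" output above.
sample = """\
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPAUDDG 00:00:00 00:00:10
EXTRACT ABENDED EXTAUDDG 00:00:00 146:59:54
"""

for program, group in abended_groups(sample):
    print(f"{program} {group} is abended")
```

Matching on the second field rather than searching the whole line keeps a group whose name happens to contain "ABENDED" from being reported by mistake.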
At this point, every environment in the architecture is back to normal.