ARCH process hangs and fails to ship archive logs to standby database

Platform: Sun Solaris 10
Database: 10g R2 (10.2.0.4)

Problem Description:

At a client site, one of the ARCH processes on the primary database would hang intermittently while archiving a redo log sequence to a remote standby. This stalled all further shipping of archive logs to the standby database. Managed recovery on the standby then stalled for lack of new archive logs, compromising the availability of an up-to-date standby database in the event of a primary site failure. The error occurred about once a fortnight, on average.

Alert log shows:
ARC1: Evaluating archive log 1 thread 1 sequence 3087
ARC1: Beginning to archive log 1 thread 1 sequence 3087
Creating archive destination LOG_ARCHIVE_DEST_4: 'ABCD'
Fri Feb 19 07:14:32 2010
ARC0: Evaluating archive log 1 thread 1 sequence 3087
Fri Feb 19 07:14:32 2010
Thread 1 advanced to log sequence 3089
Fri Feb 19 07:14:32 2010
ARC0: Unable to archive log 1 thread 1 sequence 3087
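
Because the failure leaves a recognizable "Unable to archive log" line in the alert log, it could in principle be flagged by a script rather than by someone reading the log. The following is only a minimal sketch, assuming the alert log text has already been read into a string; the function name find_stuck_sequences and the sample text are illustrative and not part of any Oracle tooling.

```python
import re

# Matches lines such as "ARC0: Unable to archive log 1 thread 1 sequence 3087".
UNABLE_RE = re.compile(
    r"(ARC\w+): Unable to archive log (\d+) thread (\d+) sequence (\d+)"
)


def find_stuck_sequences(alert_log_text):
    """Return (arch_process, log, thread, sequence) for each failure line."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in UNABLE_RE.finditer(alert_log_text)
    ]


# Sample text copied from the alert log excerpt above.
sample = """ARC1: Beginning to archive log 1 thread 1 sequence 3087
Fri Feb 19 07:14:32 2010
ARC0: Unable to archive log 1 thread 1 sequence 3087"""

print(find_stuck_sequences(sample))  # [('ARC0', 1, 1, 3087)]
```

A scheduled job could run a check like this against the tail of the alert log and raise an alert when the list is non-empty; how the alert is delivered is left open here.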

Trace File shows:


*** SESSION ID:(770.1) 2010-02-19 02:38:24.668
Maximum redo generation record size = 156160 bytes
Maximum redo generation change vector size = 150676 bytes
tkcrrsarc: (WARN) Failed to find ARCH for message (message:0x10)
tkcrrpa: (WARN) Failed initial attempt to send ARCH message (message:0x10)
*** 2010-02-19 07:14:32.108
Warning: log write time 760ms, size 1KB
*** 2010-02-19 11:07:22.313
Warning: log write time 1100ms, size 1KB
*** 2010-02-19 11:21:02.449
Warning: log write time 1120ms, size 1KB
*** 2010-02-19 13:12:55.537
Warning: log write time 1270ms, size 1KB
*** 2010-02-19 13:15:50.042
Warning: log write time 1140ms, size 1KB

*** 2010-02-19 14:58:26.613

On research, it was found that Oracle suggests killing the hung ARCH process that holds the lock on the archive log file. A new ARCH process is then spawned automatically. This releases the lock held by ARC1 on redo log sequence 3087 and allows the log to be archived by another ARCH process. When implemented, this appeared to work: the newly spawned ARCH process resumed shipping archived logs to the standby database.

However, this approach had a serious shortcoming: it required manual intervention to detect and resolve the problem. If the error occurred during the night, it might go unnoticed until the next morning, seriously compromising the usefulness of the standby database. Further, Oracle advised this solution for 9.2.0.6, yet it also worked in our case on 10g.

Not satisfied with this approach, we dug further into the issue and analyzed the other trace files generated while the ARCH process was hung. We then observed the following entries in one of them:

Second Trace File shows:


*** 2010-02-19 23:13:13.179
ABC: tkrsf_al_read: No mirror copies to re-read data
*** 2010-02-19 23:28:13.344
ABC: tkrsf_al_read: No mirror copies to re-read data
*** 2010-02-19 23:43:13.018
ABC: tkrsf_al_read: No mirror copies to re-read data
*** 2010-02-19 23:58:12.920
ABC: tkrsf_al_read: No mirror copies to re-read data
*** 2010-02-20 00:13:13.292
ABC: tkrsf_al_read: No mirror copies to re-read data
*** 2010-02-20 00:28:12.883
ABC: tkrsf_al_read: No mirror copies to re-read data
*** 2010-02-20 00:43:12.854
ABC: tkrsf_al_read: No mirror copies to re-read data
*** 2010-02-20 00:58:12.780
..
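
The timestamps in this excerpt show the tkrsf_al_read message recurring about every 15 minutes. A short script can confirm that kind of regularity over a longer trace file. This is a hedged sketch only: the helper names reread_timestamps and gaps_in_minutes are made up for illustration, and the sample reuses the first three entries from the excerpt.

```python
import re
from datetime import datetime

# A "*** <timestamp>" header followed by the tkrsf_al_read message.
# \s+ lets the message sit on the same line or on the next one.
ENTRY_RE = re.compile(
    r"\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\s+"
    r"\w+: tkrsf_al_read: No mirror copies to re-read data"
)


def reread_timestamps(trace_text):
    """Return the timestamp (to the second) of every matching trace entry."""
    return [
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        for ts in ENTRY_RE.findall(trace_text)
    ]


def gaps_in_minutes(timestamps):
    """Return the gap between consecutive entries, rounded to whole minutes."""
    return [
        round((later - earlier).total_seconds() / 60)
        for earlier, later in zip(timestamps, timestamps[1:])
    ]


sample = (
    "*** 2010-02-19 23:13:13.179\n"
    "ABC: tkrsf_al_read: No mirror copies to re-read data\n"
    "*** 2010-02-19 23:28:13.344\n"
    "ABC: tkrsf_al_read: No mirror copies to re-read data\n"
    "*** 2010-02-19 23:43:13.018\n"
    "ABC: tkrsf_al_read: No mirror copies to re-read data\n"
)

print(gaps_in_minutes(reread_timestamps(sample)))  # [15, 15]
```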

A search on the tkrsf_al_read entries showed that this is an issue specific to version 10.2.0.4. However, the Oracle note did not mention the ARCH process hanging while shipping to a standby database; it only mentioned the above error entries appearing in one of the trace files. The suggested workaround was either to replace ARCH SYNC with LGWR ASYNC in the related log_archive_dest_n parameter, or to apply patch 7136489 through OPatch, if available for the platform. Since the patch was available for our platform (Solaris), we applied it on both the primary and standby nodes. More than 3 months after applying the patch, the problem has not been reported again.
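
We did not use the alternative workaround, but where the patch is unavailable it amounts to changing the transport attributes in the destination's value. The sketch below shows one way such a change could be prepared. It rests on assumptions: the attribute string 'SERVICE=ABCD ARCH SYNC' is invented for illustration (the alert log shows only the destination name 'ABCD'), and the helper names are hypothetical. The generated statement should be reviewed by a DBA before being run.

```python
# Transport attributes that the workaround replaces.
TRANSPORT_ATTRS = {"ARCH", "LGWR", "SYNC", "ASYNC"}


def switch_to_lgwr_async(dest_value):
    """Drop any existing transport attributes and append LGWR ASYNC."""
    kept = [
        token
        for token in dest_value.split()
        # Compare only the attribute name, so forms like "ASYNC=..." also match.
        if token.upper().split("=")[0] not in TRANSPORT_ATTRS
    ]
    return " ".join(kept + ["LGWR", "ASYNC"])


def alter_system_statement(dest_number, dest_value):
    """Build an ALTER SYSTEM statement for the rewritten destination value."""
    new_value = switch_to_lgwr_async(dest_value)
    return (
        f"ALTER SYSTEM SET log_archive_dest_{dest_number}="
        f"'{new_value}' SCOPE=BOTH;"
    )


# Assumed attribute string; the real value must come from the database.
print(alter_system_statement(4, "SERVICE=ABCD ARCH SYNC"))
```

The sketch only builds text and never connects to a database, so nothing changes until someone deliberately executes the result.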

Summary:

Our research led to the conclusion that both trace files were being generated at the same time. Although Oracle suggested the patch for a different purpose, the circumstances in our case encouraged us to try it. It worked perfectly, and the team was spared the effort of manually killing the ARCH process every time the problem occurred. Further, the solution ensures better availability of the standby database in case of a primary failure.

References:

Metalink Doc ID: 364342.1
Metalink Doc ID: 748425.1