You are on page 1of 4

Handling Long-running Rollbacks and Recovery in Adaptive Server White Paper-Tec...

Page 1 of 4


Long-running transactions can create situations where a rollback or recovery takes a long time. This
white paper describes what can be done to avoid this, or handle it if it is already occurring.

Handling Long-running Rollbacks and Recovery

By Bret Halford, Principal Product Support Engineer (ASE)

Key Points
1) Dont shutdown ASE to attempt to clear a process performing rollback.
2) Whenever possible, allow long-running transactions to completely commit or rollback
before shutting down ASE. Check syslogshold to verify there are no long-running
transactions before issuing a SHUTDOWN.
3) Proactively avoid and actively prevent overly long transactions during typical OLTP

General Overview
In terms of recovery, databases work best with small, quick transactions. These can be mixed
with longer transactions, but such long transactions are best limited to maintenance windows.
Long-running transactions may occur during typical OLTP production processing by mistake
or necessity and can result in difficult situations, primarily blocking and long-running session
rollbacks and database recoveries. These can be critical problems if they prevent a server from
being available during scheduled production time, and the options for dealing with them are
limited. They are best avoided by proactive planning and system design.

Handling Long-Running Transactions

Long-running transactions are usually noticed when locks they are holding block other
processes for long periods of time. In order to release the locks, the transaction must either
complete or be killed and rollback. Unfortunately, a long-running transaction can take a long
time to rollback. When a long-running transaction is determined to be a problem it is best to
try to determine if it will complete soon instead of just killing it. This can be difficult to do, but
historical records of the run-times of similar operations or consulting the user running the
query may provide an answer.
If it is necessary to kill the user process (spid) to force a rollback, be prepared for the rollback
process to take some time. As of 15.0.3, there isnt any command to estimate how long this
will take before killing the user process; but a long running-process that performed a lot of
work will take much longer than one that started a long time ago, did a little work, and simply
remained open.

Handling Long-Running Rollbacks

Once a process has been killed, it will show in sp_who / sysprocesses output as performing
ROLLBACK. An incrementing physical_io field in sysprocesses can provide a measure of
progress in some cases, but it can be difficult to predict how long the rollback will take. DBAs
often become concerned that the rollback is hanging, or that they cannot KILL a process that is
doing a rollback (the KILL does actually work, but the process must finish rolling back first
before the spid goes away).
In order to rollback a transaction, ASE performs a single backward scan of syslogs looking for
log records belonging to this transaction. As it does so, it generates log records called
compensation log records (CLR) that reverse (rollback) the original modification and applies


Handling Long-running Rollbacks and Recovery in Adaptive Server White Paper-Tec... Page 2 of 4

those changes to the data pages. Once the beginning of the transaction is reached, a final log
record indicating the end of the transaction is written to the log and the rollback completes.
As of ASE 12.5.2 and 15.0 GA, the progress of a rollback can be checked using the command
KILL <spid> WITH STATUSONLY. This command has no effect on a process that is not
performing rollback (i.e. you do not need to worry about accidentally killing an active process).
The output of the command looks like this:
1> kill 13 with statusonly
2> go
spid: 13 Transaction rollback in progress. Estimated
rollback completion: 17% Estimated time left: 13 seconds
As a long-running rollback will continue to hold locks and block other processes until it
completes, DBAs are sometimes tempted to reboot ASE in order to clear the rollback process.
This is almost always a mistake as the incomplete rollback of the transaction will have to be
completed as part of the recovery process for the entire database when ASE is rebooted. The
full database recovery process will take longer than allowing the rollback to complete, and all
users will be unable to access any part of the database until recovery completes. However, a
shutdown may make sense if the DBA believes it is better to load a previous dump of the
database than wait for the rollback/recovery to complete. This might be the case if it is critical
to resume production operations at a given time and preferable to be operating on older data
than not operating at all.

Avoiding Long-Running Transactions

Long-running transactions should be prevented except perhaps during defined periods of
maintenance. If they are used during maintenance periods, you should create a plan for what to
do if you find the transactions havent completed as the end of the maintenance window
approaches. You might, for instance, make it standard procedure to take a full database backup
prior to the use of the large transactions so you could fall back to it if necessary. Ideally there
would be no or minimal open transactions at the time of the dump to make the recovery time
predictable and minimal and your plan should allow for the length of time it will take to create
the database for load, load the dump, and complete recovery before it has to go live.
Most long-running transactions can be identified and even prevented by using the Resource
Governor to limit elapsed time. You can also identify the longest current transaction using the
master..syslogshold table. Eliminating them entirely may require development work and
coordination with other business units, but is usually worth the effort.

Long-Running Transaction / Rollback summary:

1) It is often better to allow a long-running transaction to complete than kill it as rollback may
take a long time. The decision is not easy as it is often not possible to determine how much
longer a transaction will run before it commits.
2) It is almost always better to let a long-running rollback complete than shutdown the server; a
full database recovery will always take longer than allowing the rollback to complete.
3) Shutting down ASE may be an option if you plan to fall back to a previous dump in order to
resume operation as quickly as possible or by some deadline.

Recovery Overview
When ASE is booted, each database goes through a recovery process that brings the database to
a transitionally-consistent state. This recovery process usually doesnt take very long, but
under some circumstances can take a long time. The difference is in how much work recovery
has to do. The smaller the active part of the transaction log is, the faster recovery will finish.
Recovery requires 3 scans of the log, where a run-time rollback of a sessions transaction only
does one. The first phase, analysis, is a forward scan of the log from the beginning of the
oldest open transaction at the time of the most recent checkpoint. It builds up a table of
information about transactions and the work that needs to be done. The second phase, redo,
scans the log forward and ensures that all the modifications in the log have been applied to
pages in the databases. The third phase, undo, scans the log backward and rolls back the
work done by each modification belonging to an incomplete transaction, generating CLR
(rollback) log records to account for this work.

The two modes of the SHUTDOWN command are the regular polite shutdown and the
abrupt SHUTDOWN WITH NOWAIT. Shutdown with nowait brings the server down very
quickly, without waiting for any user activity to complete. It does, however, cleanly close file


Handling Long-running Rollbacks and Recovery in Adaptive Server White Paper-Tec... Page 3 of 4

descriptors, etc., and so is preferable to a KILL 9 or turning the power off. A common
misconception about the polite shutdown is that it will wait for all currently open
transactions to complete; it actually (as documented) only waits for the currently executing
statement to complete, and the process will then be terminated. This logic allows atomic
statements to complete and commit, but multi-statement transactions will be terminated and
rolled back on recovery.

Handling Long-Running Recovery

If you find a database is taking a long time to recover, allowing recovery to complete is usually
the best option, even when it is painful to wait it out. For recovery done as part of a server
reboot, ASE automatically makes temporary changes to configuration to optimize recovery
speed. If recovery does seem unusually slow, it is a good idea to check with the System
Administrators that the disk i/o and network systems (for NFS-mounted devices) are healthy.
The DBCC SHOWRECOVERY command can be used to monitor the progress of a databases
The next best alternative to waiting out recovery is to abort, rebuild the database, and load a
prior dump. There is a very unsafe technique to bypass recovery called suiciding the log,
discussed later, that should only be considered in the most extreme circumstances.

Monitoring Recovery Progress

Beginning in 12.5.3 and with some enhancements in 12.5.4 ESD #7 and 15.0.2 ESD #2, there
is an internal (not fully supported but still useful) DBCC SHOWRECOVERY command that
displays recovery status. Running it at intervals will give an indication of rate of progress. The
amount of work required for the final undo phase is quite variable and cannot be determined
until the analysis and redo phases are complete.

Sample output:

dbcc traceon(3604)
dbcc showrecovery

Recovery Log Markers:

Checkpoint Log Record: (521145198, 21)
First Log Record: (516510375, 14)
Last Log Record: (521145198, 21)
Analysis Statistics:
Log I/O Size: 32768
Total records to scan: 149658428
Number of records in phase 1: 0 (0.0%)
Number of records to redo in phase 1: 3 (100.0%)
Number of records in phase 2: 149658428 (100.0%)
Number of records to redo in phase 2: 149658428 (100.0%)
Total number of log records to redo: 149658431 (100.0%)
Redo Statistics:
Total operations considered for redo: 5
Operations that did not need redo (PFTS check): 3 (60.0%)
Operations that might need redo (PFTS check): 0 (0.0%)
Operations that needed redo (timestamp check): 0 (0.0%)
Operations that did not need redo (timestamp check): 2 (40.0%)
Total number of log records to undo: 4

Aborting Recovery to Load an Older Dump

These are the typical steps to use to abort recovery and restore an older dump.
2) Reboot ASE with traceflag 3608 to recover only the master database, so that recovery of the
user databases wont be done. Do this by adding -T3608 to the list of dataserver
parameters in the RUN_SERVER file.
a) If DROP doesnt work, DBCC DBREPAIR (<dbname>, dropdb)
b) If DBREPAIR doesnt work, delete that dbid from sysdatabases and sysusages


Handling Long-running Rollbacks and Recovery in Adaptive Server White Paper-Tec... Page 4 of 4

5) Remove the traceflag 3608
6) Reboot ASE normally
9) Load transaction dumps, if applicable

Short-Circuiting Recovery (Suicide the Log)

DBAs are sometimes tempted to bypass recovery via a process commonly known as suiciding
the log, which fools the server into thinking there is no recovery needed. With very few
exceptions, this is a very bad idea, and the details on how to do it are deliberately not given
here if it is necessary to do this you should be working with someone in Tech Support who is
familiar with the process and its dangers.
An example of circumstances in which the author might consider using this technique:
recovery is failing with a fatal error that cannot be fixed and there are no backups available, on
the theory that transactionally inconsistent data would be an improvement over no data at all.
Carefully consider the dangers of inconsistent data in your application bad data in a medical
information system can be life threatening, for instance.
Suiciding the log throws away transactional consistency in the database. For example, a
transfer of funds from account x to account y may a) not occur at all, b) only remove funds
from x and not add them to y, c) add funds to y and not remove them from x.
The physical structures of the database may also be left in inconsistent states (broken page
linkages, missing index entries, etc.). If there isnt time to complete recovery, is there really
time to run full dbcc checks after this procedure to verify the database structures are not
corrupt? Even if there is, the transactional inconsistencies will not be caught unless you have
methods to do so, such as client-side transactional auditing. Suiciding the log breaks the
transaction log chain. You will need to take a full database dump after suiciding the log.

Long Running Recovery Summary

1) Use DBCC SHOWRECOVERY to monitor recovery progress.
2) Allowing recovery to complete is usually the best option. A possible alternative is to abort,
rebuild the database, and load a prior dump.

Copyright 2011, Sybase Inc. - v 7.1