Replication Troubleshooting

DECEMBER 21, 2010


Some info to help diagnose possible issues with replication.

How many undelivered commands are in the distribution database?

-- Run in the distribution database
SELECT
    ds.*
    , ma.article
    , da.publication
    , da.name
FROM dbo.MSdistribution_status ds
INNER JOIN dbo.MSdistribution_agents da ON da.id = ds.agent_id
-- article_id is only unique within one published database, so match on publisher_db as well
INNER JOIN dbo.MSarticles ma ON ma.publisher_id = da.publisher_id
    AND ma.publisher_db = da.publisher_db
    AND ma.article_id = ds.article_id
ORDER BY
    ds.UndelivCmdsInDistDB DESC
    , da.publication

Why is a complete snapshot being generated when a new article is added (SQL 2005)?

This is expected behaviour if you have a merge or snapshot publication. If you have a transactional publication, a snapshot of all
articles will always be generated if the immediate_sync publication property is set to true. Typically, the immediate_sync publication
property is set to true if you allowed anonymous subscriptions while creating the publication through the CreatePublication wizard. To
prevent the complete snapshot, run the script below:
EXEC sp_changepublication
@publication = 'MainPub',
@property = N'allow_anonymous',
@value = 'false'
GO
EXEC sp_changepublication
@publication = 'MainPub',
@property = N'immediate_sync',
@value = 'false'
GO

Query Timeout Expired


When applying a large snapshot to the subscriber, a 'Query Timeout Expired' message appears in Replication Monitor and the
snapshot stops processing. The query timeout usually happens after 30 minutes.
There is a manual workaround, which involves loading the snapshot file into the destination database with BCP.
To BCP in a file created by the snapshot job, use the bcp utility, whose usage is:

H:\>BCP -h
usage: BCP {dbtable | query} {in | out | queryout | format} datafile
  [-m maxerrors]            [-f formatfile]          [-e errfile]
  [-F firstrow]             [-L lastrow]             [-b batchsize]
  [-n native type]          [-c character type]      [-w wide character type]
  [-N keep non-text native] [-V file format version] [-q quoted identifier]
  [-C code page specifier]  [-t field terminator]    [-r row terminator]
  [-i inputfile]            [-o outfile]             [-a packetsize]
  [-S server name]          [-U username]            [-P password]
  [-T trusted connection]   [-v version]             [-R regional enable]
  [-k keep null values]     [-E keep identity values]
  [-h "load hints"]         [-x generate xml format file]


For example:

bcp "dbname"."dbo"."REP_IMAGE" in "F:\TMP_SHARE\REP_TABLE2_IMAGE_3.bcp" -e "F:\TMP_SHARE\errorfile.log" -t"\n<x$3>\n" -r"\n<,@g>\n" -Sservername -T -w

Before starting replication again, complete the following tasks:

1. Truncate the live MSrepl_commands table
2. Truncate the live MSrepl_transactions table
3. Ensure the replication procedures are on the subscriber database. These are the INSERT, DELETE and UPDATE procs for each
article in the subscription. Manually extract the INS, DEL and UPD procs for all articles from the snapshot folder and run them
on the subscriber.
4. Once this is done, restart the distribution agent
5. Attempt a dummy transaction from source to destination.

Log reader agent failed and its history shows message: “No such interface”
You need to re-register your log reader agent. Try regsvr32 logread.exe; you might also have to register the entire contents of
C:\Program Files\Microsoft SQL Server\90\Com (Hilary Cotter)

Not all my logreaders start up – what can I do?


Increase the max_worker_threads setting in the syssubsystems table of the msdb database.
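
As a rough sketch of that change, assuming it is the Log Reader subsystem that is running out of threads, the setting can be inspected and raised as below. The value 50 is purely illustrative, and SQL Server Agent normally has to be restarted before a new value is picked up.

```sql
-- Inspect the current per-subsystem thread limits.
SELECT subsystem, max_worker_threads
FROM msdb.dbo.syssubsystems
ORDER BY subsystem;

-- Raise the limit for the Log Reader subsystem.
-- 50 is an illustrative value; size it to the number of log readers you run.
UPDATE msdb.dbo.syssubsystems
SET max_worker_threads = 50
WHERE subsystem = N'LogReader';
```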

What are the Pros and cons of restarting the log reader agent?
Sometimes under extremely high load you will get deadlocking between the log reader agent and the distribution clean up agent. In this
case, stopping the log reader agent to let the distribution clean up agent do its job will alleviate the problem. It is recommended that in
this case you use a remote distributor. You can also bounce the log reader agent when you want to switch profiles.
If you stop it, the latency will increase, and if you stop it for a significant time, the age of the commands might exceed the retention period. Also,
the log can't be backed up fully (and therefore truncated) unless the log reader agent has marked it as read.

I receive the error 14100: Specify all articles when subscribing to a publication using concurrent snapshot processing
If you add a new table to an existing publication using sp_addarticle and then try to subscribe to that newly added article from an
existing subscription using sp_addsubscription, you may receive the error above. This applies when the existing publication was set up with
the concurrent snapshot option, and it means that you can't synchronize subscriptions for such publications without a complete resync.
There are 2 unofficial workarounds: (a) you can circumvent the check by specifying @reserved = 'internal' when you add the subscription
for the new article, after which the snapshot agent should generate a snapshot for the new article, and (b) you could change the
immediate_sync property in syspublications to 0 (see sp_changepublication).
Other, more official workarounds include changing the sync_method from 'concurrent' to either 'database snapshot' (enterprise edition
only) or 'native' (which locks the table during snapshot generation). Changing the sync_method will force a reinitialization of all your
subscriptions at this point. Alternatively you could create another publication and use this instead.

How to….. truncate the transaction log? After restoring a database to another server, when I subsequently try to shrink the
log I get the following error: “The log was not truncated because records at the beginning of the log are pending replication”
Before truncating the log, you can execute sp_repldone. In cases where this is not enough, you might have to set up this database as a
transactional publisher before executing sp_repldone, then remove the publication afterwards.
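
A minimal sketch of the sp_repldone call, assuming the restored copy will not be used as a publisher (marking everything as distributed would lose changes for real subscribers). It is run inside the restored database; the final query is just one way to check whether replication is still what holds the log.

```sql
-- Run in the restored database whose log is pinned by replication.
-- @reset = 1 marks every replicated transaction in the log as distributed.
-- Note: the parameter really is spelled @xact_segno in the procedure.
EXEC sp_repldone
    @xactid = NULL,
    @xact_segno = NULL,
    @numtrans = 0,
    @time = 0,
    @reset = 1;

-- Confirm whether replication is still what is holding the log
-- (log_reuse_wait_desc shows REPLICATION while it is).
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = DB_NAME();
```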

How to find out which commands are waiting to be replicated?


Use this to get the timestamp of the latest command to be replicated:
select transaction_timestamp
from subscriberdatabasename..MSreplication_subscriptions
Then run this in the distribution database (replace the value with the one returned from above):
exec sp_browsereplcmds @xact_seqno_start = '0x000000AF00000043000B00000001'

How to safely backup transactional replication?


Have a look in BOL for “Strategies for Backing Up and Restoring Transactional Replication” and “Backing Up and Restoring Replication
Databases”. The key for standard transactional replication is whether you use ‘sync with backup’ or not. If you do, you are ensured that
your distribution backup can never get ahead of the publisher backup (ie no transactions enter the msrepl_commands which haven’t
already been backed up on the publisher), and all will be well. However this will introduce latency (even using log shipping you can only
backup the logs once per minute at the highest frequency, and this is clearly not ideal). If you don’t use this option, after disaster
recovery you’ll have to ignore some transactions and treat errors manually (using -SKIPERRORS).
As for the subscriber backups which are to be restored, this is not usually seen as being so crucial. As long as they are restored to a
time before the distribution restore, then commands can be sent down by the distribution agent – details in BOL for this. Alternatively
you could of course reinitialize.
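
For reference, the 'sync with backup' option mentioned above can be switched on and checked roughly as follows. MyPublishedDb is a placeholder name, and the latency trade-off described above still applies.

```sql
-- Run at the Publisher; 'MyPublishedDb' is a placeholder database name.
EXEC sp_replicationdboption
    @dbname = N'MyPublishedDb',
    @optname = N'sync with backup',
    @value = N'true';

-- 1 means the option is enabled for the database.
SELECT DATABASEPROPERTYEX(N'MyPublishedDb', 'IsSyncWithBackup') AS IsSyncWithBackup;
```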

What happens when a transaction fails at the publisher – does it still run at the subscriber?
If you have a transaction on the publisher, you may check @@error and then call rollback but whether you rollback or not, the sp is still
executed on the subscriber. This situation is altered (no subscriber call) if you set the transaction isolation level to serializable. This is
important to do because even if you trap the same error in the transaction on the subscriber and rollback there, the error is registered
and the distribution agent will fail. SkipErrors would avoid this problem but ideally the call shouldn’t be sent from the publisher to the
subscriber if it has already failed once.

How to read the transactions for transactional replication (TR) in non-binary format?
These transactions exist in the transactions table MSrepl_commands: use sp_browsereplcmds to view them. In the case of a queue,
use sp_replqueuemonitor to read the MSreplication_queue table and sp_browsereplcmds to look at the compensating commands when
there is conflict resolution.

Replication Troubleshooting - How to deal with out of sync publications


Transactional Replication and nasty errors that cause out of sync publications.

The other day we had an issue on our distributor that caused deadlocks on the Distribution database.  Several
of the Log Reader Agents suffered fatal errors due to being chosen as the deadlock victim.  This caused the
following error to occur:
    The process could not execute 'sp_repldone/sp_replcounters' on 'MyPublisherServer'
When I drilled in to view the detail, I found this error:
    The specified LSN (%value) for repldone log scan occurs before the current start of replication in the log (%newervalue)

After much searching on the error, I came across several forum posts that indicated I was pretty well up a
creek.  I then found this post on SQLServerCentral.  Hilary Cotter's response was the most beneficial for
devising a recovery plan and Stephen Cassady's response helped me refine that plan.

Hilary Cotter (Blog) is an expert when it comes to SQL replication.  He certainly knows his stuff!

The Recovery Plan


Recovering from this issue involves several steps.  

For small databases or publications where the snapshot to reinitialize the publication will be small and push
quickly, it's simplest and best to just reinitialize the entire publication and generate/push a new snapshot.  

For larger publications (my publication contained almost 1,000 tables) and situations where pushing the
snapshot will take an inordinate amount of time (24+ hours in my case) the following process can be used to
skip the missing transactions and identify the tables that are now out of sync:
1. Recover the Log Reader Agent by telling it to skip the missing transactions
2. Recover the Distribution Agent by configuring it to ignore data consistency issues
3. Validate the publication to determine which tables are out of sync
4. Drop and republish out of sync tables

Log Reader Agent Recovery


The simplest way to recover the Log Reader Agent is to run the following command against the published
database:
    EXEC sp_replrestart
This effectively tells SQL to restart replication NOW, thus ignoring all transactions that have occurred between
the time of the failure and the time you run the command.  The longer you wait to run this command, the more
activity in the database that gets ignored, which likely results in more tables that fall out of sync.

Distribution Agent Recovery


Now that the Log Reader Agent is capturing transactions for replication, the Distribution Agent will likely get
upset because there are transactions missing.  I specifically received the following error:
    The row was not found at the Subscriber when applying the replicated command
This error causes the Distribution Agent to fail, but there is a system profile for the Distribution Agent that you
can select to bypass the data consistency errors.
1. Launch Replication Monitor.
2. In the left-hand column, expand the DB server that contains the published database and select the Publication.
3. In the right-hand pane, double-click the Subscription.
4. In the Subscription window, go to the Action menu and select Agent Profile.
5. Select the profile "Continue on data consistency errors." and click OK. Be sure to note which profile was selected
before changing it so that you can select the appropriate option once recovery is complete.
6. If the Distribution Agent is currently running (it's likely in a fail/retry loop), go to the Action menu and select
Stop Distribution Agent, then go to the Action menu and select Start Distribution Agent.
7. If there is more than one subscription, repeat these steps for any additional subscriptions.
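
If you prefer T-SQL to the Replication Monitor dialogs, the same switch can be sketched as below. This assumes the distribution database has the default name distribution, and the two ids passed to sp_update_agent_profile are placeholders that must be taken from the result sets of the first two queries. The "Continue on data consistency errors." profile works by passing -SkipErrors 2601:2627:20598 to the agent.

```sql
-- Run at the Distributor.
-- agent_type 3 is the Distribution Agent; def_profile = 1 marks the default.
SELECT profile_id, profile_name, def_profile
FROM msdb.dbo.MSagent_profiles
WHERE agent_type = 3
ORDER BY profile_id;

-- The ids of the distribution agents and the profile each one uses today.
-- Note the current profile_id so it can be restored after recovery.
SELECT id, name, publication, subscriber_db, profile_id
FROM distribution.dbo.MSdistribution_agents;

-- Assign a profile to one agent. Both ids below are placeholders;
-- take them from the two result sets above.
EXEC sp_update_agent_profile
    @agent_type = 3,
    @agent_id = 5,
    @profile_id = 4;
```

The agent still has to be stopped and started for the new profile to take effect.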

Subscription Validation
Validating the Subscription(s) is a fairly straightforward task.
1. Launch Replication Monitor.
2. In the left-hand column of Replication Monitor, expand the DB server that contains the published database.
3. Right-click the Publication and select Validate Subscriptions...
4. Verify Validate all SQL Server Subscriptions is selected.
5. Click the Validation Options... button and verify the validation options. I recommend selecting
the following options:
    • Compute a fast row count: if differences are found, compute an actual row count
    • Compare checksums to verify row data (this process can take a long time)
6. Once you are satisfied with the validation options, click OK and then click OK to actually queue
up the validation process. Please note: for large databases, this process may take a while (and the Validate
Subscriptions window may appear as Not Responding).
For my publications (~1,000 tables and DB was ~100GB) the validation process took about 20 minutes, but
individual results will vary.
If you wish to monitor the validation progress:
1. In the right-hand pane of Replication Monitor, double-click the Subscription.
2. In the Subscription window, go to the Action menu and select Auto Refresh.
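
The same validation can also be queued from T-SQL. This is a small sketch using the placeholder publication name 'My Publication'; the two option values are meant to mirror the validation options recommended above.

```sql
-- Run at the Publisher, in the published database.
EXEC sp_publication_validation
    @publication = N'My Publication',
    @rowcount_only = 2,   -- row count plus binary checksum
    @full_or_fast = 2;    -- fast count first, full count only if they differ
```

The results are written to the Distribution Agent history, the same place the Replication Monitor route reports them.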

Identify out of sync tables


I created the following script that will return the tables that failed validation:

-- This script will return out of sync tables after a Subscription validation has been performed
-- Set the isolation level to prevent any blocking/locking
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT
    mda.publication [PublicationName],
    mdh.start_time [SessionStartTime],
    mdh.comments [Comments]
FROM distribution.dbo.MSdistribution_agents mda
JOIN distribution.dbo.MSdistribution_history mdh ON mdh.agent_id = mda.id
-- Update Publication name as appropriate
WHERE mda.publication = 'My Publication'
AND mdh.comments LIKE '%might be out of%'
-- This next line restricts results to the past 24 hours.
AND mdh.start_time > (GETDATE() - 1)
-- Alternatively, you could specify a specific date/time: AND mdh.start_time > '2012-04-25 10:30'
-- View most recent results first
ORDER BY mdh.start_time DESC

The Comments column will contain the following message if a table is out of sync:

    Table 'MyTable' might be out of synchronization.  Rowcounts (actual: %value, expected: %value).  Checksum values  (actual: -%value, expected: -%value).

Make a list of all tables that are returned by the aforementioned script.

Now the determination needs to be made as to the level of impact.

• The Reinitialize All Subscriptions option should be used if the following is true:
    • Large number of tables affected (majority of published tables)
    • Unaffected tables are small in size (if the snapshot for the unaffected tables is going to be very
small, it's much easier to just reinitialize everything)
• Dropping and re-adding individual tables should be used if the following is true:
    • The number of tables affected is far less than the total number of tables
    • The tables that are unaffected are very large in size and will cause significant latency when
pushing the snapshot
The latter was the case in my scenario (about 100 out of 1,000 tables were out of sync, and the ~900 tables
that were in sync included some very large tables).

Reinitialize All Subscriptions


Follow this process if the determination has been made to use the Reinitialize All Subscriptions option:
1. In the left-hand column of Replication Monitor, expand the DB server that contains the published database.
2. Right-click the Publication and select Reinitialize All Subscriptions...
3. Verify Use a new snapshot is selected.
4. Verify Generate the new snapshot now is NOT selected.
5. Click the Mark For Reinitialization button. Please note: for large databases, this process may take a while (and
the Replication Monitor window may appear as Not Responding).
6. In the right-hand pane of Replication Monitor, select the Agents tab (in SQL 2005 select the Warnings and Agents tab).
7. Right-click the Snapshot Agent and select Start Agent. The reason for performing this manually is that sometimes when
you select the Generate the new snapshot now option, it kicks off the Snapshot Agent before the reinitialization is
complete, which causes blocking, deadlocks and major performance issues.

Recover out of sync tables


If the determination has been made to recover the individual tables, use the list of tables generated from the
validation process and follow this process:
1. In the left-hand column of Replication Monitor, expand the DB server that contains the published database.
2. Right-click the Publication and select Properties.
3. Select the Articles page in the left-hand column.
4. Once the center page has populated, expand each table published to determine if the table is
filtered (i.e. not all columns in the table are published). If tables are filtered, make a note of the columns
that are not pushed for each table.
5. Once review of the tables is complete, click Cancel. If you click OK after expanding tables, it will invalidate
the entire snapshot and you will end up reinitializing all articles in the publication.
6. Right-click the Publication and select Properties, then select the Articles page in the left-hand column.
7. Clear the check boxes for all out of sync tables and click OK.
8. Right-click the Publication and select Properties, then select the Articles page in the left-hand column.
9. Select the affected tables in the center pane. If any tables were not completely replicated, be sure to reference
your notes regarding which columns are replicated.
10. Click OK when table selection is complete. Note: If you receive an error that the entire snapshot will be invalidated,
close the Publication Properties window and try adding in a few tables at a time until all tables are selected.
11. In the right-hand pane of Replication Monitor, select the Agents tab (in SQL 2005 select the Warnings and Agents tab).
12. Right-click the Snapshot Agent and select Start Agent.
13. Double-click the Subscription, then go to the Action menu and select Auto Refresh.

Final cleanup
Once the snapshot has been delivered and replication has caught up on all queued transactions, perform the
following to return replication to a normally running state.
1. In the left-hand column of Replication Monitor, expand the DB server that contains the published database and
select the Publication.
2. In the right-hand pane of Replication Monitor, double-click the Subscription.
3. In the Subscription window, go to the Action menu and select Agent Profile.
4. Select the profile that was configured before you changed it (if unsure, the Default
agent profile is typically the default) and click OK.
5. If there is more than one subscription, repeat these steps for any additional subscriptions.

I hope this helps if you run into the same situation.  I would like to especially thank Hilary Cotter for sharing his
knowledge with the community as his forum and blog posts really helped me resolve the issue.

Example of troubleshooting Distribution Agent errors

I thought it would be helpful to post a Replication Distribution Agent troubleshooting case to show more about replication
components and troubleshooting approaches.

Problem:
The SQL Server Distribution Agent reported “Failed” in Replication Monitor. To capture the text of the message we added the
following parameters to the Distribution Agent job and restarted the job.  The -Output parameter writes the Agent log to a text file showing
each step the Distribution Agent was performing along with the actual text of the message.

-Output c:\temp\dist.log -CommitBatchSize 1

Output
2018-12-29 00:17:53.917 Agent message code 8144. Procedure or function sp_MSupd_dboAddress has too many arguments specified.
2018-12-29 00:17:53.932 ErrorId = 35, SourceTypeId = 0
ErrorCode = '8144'
ErrorText = 'Procedure or function sp_MSupd_dboAddress has too many arguments specified.'
2018-12-29 00:17:53.948 Adding alert to msdb..sysreplicationalerts: ErrorId = 35, Transaction Seqno = 000a261100001560013e00000000, Command ID = 1
Not needed for this case, but we often capture the Replication commands being executed via SQL Server Extended
Events or a Profiler Trace of the RPC and Batch events, along with errors and warnings.

Background:
The customer's Transactional Replication publication contained a couple of very large tables that generated timeouts when the initial snapshot
was being applied to the subscriber.  To get around this problem, the subscriber was set up using a Backup/Restore from the Publisher.
The step-by-step procedure is documented on docs.microsoft.com and at
https://repltalk.com/how-to-manually-synchronize-replication-subscriptions-by-using-backup-or-restore/ .
The Distribution Agent worked for a few minutes, then failed with the error above.

How it works:
In Transactional Replication, the Log Reader Agent picks up committed transactions from the published database's transaction log and
writes the Insert\Update\Delete commands to the Distribution database.  The Distribution Agent picks up those commands and
applies them on the Subscriber.  Changes are not stored in the distribution database as SQL, i.e. “update table set column a = 1 where
primarykey = ‘abc’ “, but instead as calls to stored procedures along with a parameter list.  The Distribution Agent calls these
Insert/Update/Delete stored procedures on the subscribers with the appropriate parameter list. These stored procedures are named
sp_MSins_<schema><tablename>, sp_MSupd_<schema><tablename> and sp_MSdel_<schema><tablename>, as in sp_MSins_dboAddress,
sp_MSupd_dboAddress, or sp_MSdel_dboAddress.

Approach:
The error indicates a mis-match in the number of parameters (columns) in the command stored in the Distribution DB compared to the
number of columns in the Replication created stored procedure.  We needed to see which one is correct.

First step was to compare the schema of the Published database to the Subscriber.  Since the Subscriber was a backup of the
Publisher, I expected them to be the same, but you never know unless you check. We executed the following command on both the
Pub and Sub and yes, they were identical.

sp_help Address

Next was to look at the actual command text stored in the Distribution database.  The text is stored as binary; however, using the
built-in Replication procedure sp_browsereplcmds and its optional parameters we could return the text of the transaction of interest.
We used the Transaction Seqno shown in the error message, dropping the trailing eight zeros.

Transaction Seqno = 000a261100001560013e00000000

sp_browsereplcmds  @xact_seqno_start = '0x000a261100001560013e',  @xact_seqno_end = '0x000a261100001560013e'
Output
{CALL [sp_MSupd_dboAddress] (,,,,,,,,,,2009-12-28 02:22:10.000,,,,,2009-12-27 02:22:13.683,,,620190,0x008400)}
We confirmed that the table schema is the same on the Publisher and the Subscriber, and we can see the command and parameters stored in the
Distribution database that generated the error. The next check is how these parameters match the Subscriber's stored procedure code. To
get that, we executed the command below on the Subscriber.

sp_helptext sp_MSupd_dboAddress
Output:
CREATE procedure [sp_MSupd_dboAddress]
@c1 int,@c2 int,@c3 char(1),@c4 varchar(40),@c5 varchar(40),@c6 varchar(30),@c7
varchar(30),@c8 varchar(30),@c9 varchar(30),@c10 varchar(10),@c11 datetime,@c12
varchar(7),@c13 varchar(7),@c14 char(1),@c15 varchar(50),@c16 datetime,@c17 varchar(4),@c18
varchar(2),@pkc1 int
as
begin
update [dbo].[Address] set
[person_id] = @c2
,[address_type] = @c3
. . . .
,[uc_code] = @c18
where [address_id] = @pkc1
 
 
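Looking at the code, I see the last “expected” value is the primary key used in the WHERE clause to update 1 row: the procedure
declares 19 parameters (@c1 to @c18 plus @pkc1).  However, the sp_MSupd_dboAddress call stored in the Distribution database passes
20 values, with the binary value 0x008400 following the primary key value 620190 as the last parameter.  Clearly the parameter list
doesn't match, but which is right?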
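
A quick way to confirm the mismatch without reading the whole procedure body is to count its declared parameters from the catalog. This sketch assumes the procedure lives in the dbo schema of the subscription database.

```sql
-- Run in the subscription database.
SELECT COUNT(*) AS declared_parameters
FROM sys.parameters
WHERE object_id = OBJECT_ID(N'dbo.sp_MSupd_dboAddress');

-- List them in order, to see whether a trailing bitmap parameter is present.
SELECT parameter_id, name, TYPE_NAME(user_type_id) AS type_name, max_length
FROM sys.parameters
WHERE object_id = OBJECT_ID(N'dbo.sp_MSupd_dboAddress')
ORDER BY parameter_id;
```

Comparing that count with the number of values in the stored {CALL ...} text shows directly whether the procedure is short by one parameter.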

What is the expected behavior?


When troubleshooting these problems, it is sometimes best to step back and determine what the “expected” behavior is.  Setting up a quick test
to learn what's expected may provide a clue as to what's broken. We do the same in performance troubleshooting when capturing “baseline”
data. Following this logic, my next step was to set up a simple Transactional Replication publication using the
sample AdventureWorks database.

Using the Replication Wizard to publish 1 table, then scripting out the subscriber stored procedure, I could clearly see a bitmap
parameter along with additional logic within the stored procedure. Since the “broken” subscriber didn't have this expected code to
handle the correct number of parameters, I knew the Replication-generated stored procedures on the Subscriber were incorrect and
needed to be updated.

How to correct the problem?


SQL Server includes the stored procedure sp_scriptpublicationcustomprocs to regenerate a new set of subscriber stored
procedures. Execute this command on the Publisher with output as text (Results to Text), then execute the output on the Subscriber.
Make sure to increase the text column width to 5000 in the query options or your stored procedure code will get truncated at
the default 256 characters.  Yeah, I found that one out the hard way.  Thankfully, the generated script includes the DROP and
CREATE statements for all the procedures needed by the Subscriber, so I just ran it a 2nd time with a wider text window.

sp_scriptpublicationcustomprocs  '<publication>'

Once the new create stored procedure scripts were executed on the subscriber, the Distribution Agent executed the
correct commands with a matching parameter list, no reinitialization required. This same command can be used
anytime the subscriber stored procedures are accidentally dropped.

Troubleshooting transactional replication latency issues in SQL Server

Problem

I have several clustered SQL Server 2012 instances installed and I am having issues with replication
latency. The environment has a dedicated SQL Server instance for the distributor. One instance has
publisher database(s) and another instance has subscriber database(s). It is reported that there is
high latency in replication most of the time. I also noticed that there is a lot of blocking on the
distribution server with big CPU spikes.

Solution

Fixing latency issues is not a straightforward process. You need to gather a lot of data, analyze the
data, make changes one at a time and then monitor to see if you have fixed the issue. This is a
continuous process until you get acceptable latency.

Understanding Data Flow in SQL Server Transactional Replication


Before we begin, it will help to understand the data flow of SQL Server transactional replication.
There are three main components:

1. Publisher - the database/server which needs to replicate data
2. Distributor - the database/server which stores the replicated data temporarily
3. Subscriber - the destination database/server which consumes the replicated data

Typically in a high use OLTP system, each component is a dedicated SQL Server to support high
availability.

Figure 1 shows the architecture of transactional replication.

Figure 1 - Replication Architecture (BOL: http://msdn.microsoft.com/en-us/library/ms151176.aspx)

Monitoring SQL Server Transactional Replication


It is necessary to implement a latency report to monitor and alert if latency is above a certain
threshold that you define. It could be 5 minutes, 10 minutes or even a few seconds depending on
your environment and service level agreement (SLA) with the users. This is really important when
troubleshooting latency issues. The latency report should have information about the total latency,
the latency between publisher and distributor, and the latency between distributor and subscriber, so that you
will know exactly which part of replication has issues.
Tracer tokens are commonly used to measure latency. You can use Replication Monitor (RM) to
insert a tracer token for each publication. Alternatively you could use T-SQL commands as well.

For more details about tracer tokens refer to BOL: http://technet.microsoft.com/en-us/library/ms151846(v=sql.105).aspx
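
As a sketch of the T-SQL route, a tracer token can be posted and read back as follows. 'My Publication' is a placeholder name and the 30 second wait is an arbitrary choice; on a slow system the token may need longer to arrive.

```sql
-- Run at the Publisher, in the published database.
DECLARE @token_id int;

EXEC sp_posttracertoken
    @publication = N'My Publication',
    @tracer_token_id = @token_id OUTPUT;

-- Give the token time to travel through the Distributor to the Subscriber.
WAITFOR DELAY '00:00:30';

-- Latency columns are in seconds; NULL means the token has not arrived yet.
EXEC sp_helptracertokenhistory
    @publication = N'My Publication',
    @tracer_id = @token_id;
```

The result separates the publisher-to-distributor latency from the distributor-to-subscriber latency, which is the split the latency report needs.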

How To Get Replication Latency


The following is output from sp_replcounters for a good performing environment.

database       replicated     replication rate   replication     replbeginlsn            replnextlsn
               transactions   trans/sec          latency (sec)
Publisher_db1  2587           1946.951           0.04            0x0008C11A00316D090001  0x0008C11A00316D090004
Publisher_db1  0              562.5              1.883           0x00000000000000000000  0x00000000000000000000

Table 1 - Sample output of sp_replcounters

• Database - publisher database
• Replicated transactions - Number of transactions in the log awaiting delivery to the distribution
database
• Replication rate trans/sec - Average number of transactions per second delivered to the distribution
database
• Replication latency - Average time, in seconds, that transactions were in the log before being
distributed
• Replbeginlsn - Log sequence number (LSN) of the current truncation point in the log
• Replnextlsn - LSN of the next commit record awaiting delivery to the distribution database

Using the above information you can determine how good the overall replication latency is. The
higher the value you see in "replication rate trans/sec", the better the data transfer speed for
replication. Low numbers in the "replication latency (sec)" column likewise indicate a healthy system.
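
To keep these numbers for a report rather than just reading them on screen, the output can be captured into a table. This is a sketch: the column names in the temp table are my own (INSERT ... EXEC maps columns by position), and the 60 second threshold is only an example.

```sql
-- Run at the Publisher.
CREATE TABLE #replcounters (
    database_name        sysname,
    replicated_trans     int,
    rate_trans_per_sec   float,
    latency_sec          float,
    replbeginlsn         binary(10),
    replnextlsn          binary(10)
);

-- Capture one row per published database.
INSERT INTO #replcounters
EXEC sp_replcounters;

-- Show only the databases whose log reader latency exceeds the threshold.
SELECT database_name, replicated_trans, rate_trans_per_sec, latency_sec
FROM #replcounters
WHERE latency_sec > 60          -- illustrative threshold
ORDER BY latency_sec DESC;

DROP TABLE #replcounters;
```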

Sample output of a poorly performing replication system is shown in Table 2.

database       replicated     replication rate   replication     replbeginlsn            replnextlsn
               transactions   trans/sec          latency (sec)
Publisher_db1  11170556       1612.123           9232.216        0x000998C5006A0E6C002   0x000998C5006A1C720001

Table 2 - Output of sp_replcounters against a poorly performing replication system

In this situation, you can see latency is over 2.5 hours (refer to the replication latency column: 9232
seconds). At the same time you can see the data transfer rate is fairly good (1612.123). So what may
be the problem? See the replicated transactions: there are more than 11 million, meaning there are over 11
million transactions waiting to be delivered to the distribution database. In other words, they are still in
the Transaction Log (T-Log) of the publisher database. So in this particular case, the latency is mainly
between the publisher and the distributor. If you configured the latency report, it would show a high
value of latency between the publisher and distributor.

If you see strangely high figures like those above (Table 2), this could be due to the following reasons:

• large transactions occurred in the publisher database
• slow performing network
• slow performing storage

If you see millions of waiting commands in the output and you figured it is not due to a slow network,
slow storage or unexpected OLTP operations at the publisher, then the issue is probably with the
configuration of T-Log of the publisher database.

Remember replication is one of the log-based operations in SQL Server, so the configuration of the
T-Log for the publisher database closely relates to the performance of replication. The Log Reader
program scans the T-Log to identify the commands to be replicated (refer to Figure 1). So in this
case, you need to pay attention to the T-Log size, whether it is properly sized according to the
transaction volume of the publisher, the number of VLFs of the T-Log and the size of the VLFs. For
replication, all these parameters matter. It is quite challenging to identify the "sweet spot" of the T-Log
in terms of number of VLFs. The links below might be helpful.

• http://www.mssqltips.com/sqlservertip/1225/how-to-determine-sql-server-database-transaction-log-usage/
• http://www.sqlskills.com/blogs/kimberly/transaction-log-vlfs-too-many-or-too-few/
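
As a starting point for that review, the log size and VLF count of the publisher database can be read with two DBCC commands. MyPublishedDb is a placeholder name, and DBCC LOGINFO is undocumented, so treat its output format as version dependent.

```sql
USE MyPublishedDb;   -- placeholder database name

-- Log size and percentage in use for every database on the instance.
DBCC SQLPERF(LOGSPACE);

-- One row per virtual log file; the row count is the VLF count.
-- Status = 2 marks VLFs that are still active (cannot be reused yet).
DBCC LOGINFO;
```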

SQL Server Log Reader Agent


The Log Reader is an executable that runs on the distributor and scans the T-Log of the publisher
database. There are two threads that do the work:

1. Reader Thread - Reads the T-Log via the stored procedure sp_replcmds. This scans the T-Log and
identifies the commands to be replicated, skipping commands that are not to be replicated.
2. Writer Thread - Writes the transactions identified by the reader thread into the distribution database
via sp_MSadd_replcmds.

Both of these are system stored procedures that are created when you configure
transactional replication. The Log Reader Agent profile has parameters that change the
behavior of the Log Reader, and therefore the behavior of replication. Taking a closer
look at the Log Reader parameter values is an essential part of troubleshooting replication issues,
including latency.

For more details: BOL: http://msdn.microsoft.com/en-us/library/ms146878.aspx

How To View Log Reader Agent Profile

In SSMS, connect to the distribution server. Right click on Replication and click on Properties. (Refer
Figure 2 and 3)
Figure 2 - Get distributor properties
Figure 3 - Distributor properties

Click on Profile Defaults in the Distributor Properties window shown in Figure 3. The Agent Profiles
window displays as shown in Figure 4.
Figure 4 - Agent Profiles

The right pane of the Agent Profiles window has all the replication agent profiles. Select Log Reader
Agents from the list and you will see the profiles for the Log Reader. The ticked profile is the one
currently in use; click the … button to see the configuration values for the Log Reader Agent Profile, as
shown in Figure 5 below.
Figure 5 - Profile Parameters
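
The same information can be read with T-SQL. The sketch below assumes the distribution database has the default name "distribution" and relies on the profile tables that replication keeps in msdb on the Distributor; verify the column names on your version before depending on them.

```sql
-- Sketch only: run on the Distributor.
-- Shows, for each Log Reader Agent, the profile it uses and that profile's parameters.
SELECT a.name            AS agent_name,
       a.publisher_db,
       p.profile_name,
       prm.parameter_name,
       prm.value
FROM distribution.dbo.MSlogreader_agents AS a
INNER JOIN msdb.dbo.MSagent_profiles AS p
        ON p.profile_id = a.profile_id
LEFT JOIN msdb.dbo.MSagent_parameters AS prm
        ON prm.profile_id = p.profile_id
ORDER BY a.name, prm.parameter_name;

-- Alternatively, list every Log Reader profile (agent_type 2 = Log Reader Agent).
EXEC msdb.dbo.sp_help_agent_profile @agent_type = 2;
```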

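Note: Changes to the Log Reader properties do not take effect until the agent is restarted. For an
agent that runs continuously, stop and start the Log Reader Agent job (restarting SQL Server Agent
has the same effect).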

Important Parameters of Log Reader Agent Profile

Certain parameters need adjusting as part of fine-tuning a transactional replication
system.

• -Continuous - Specifies whether the agent tries to poll replicated transactions continually. If specified,
the agent polls replicated transactions from the source at polling intervals even if there are no
transactions pending.
• -HistoryVerboseLevel [0|1|2] - Specifies the amount of history logged during a log reader operation.
You can minimize the performance effect of history logging by selecting 1.
• -MaxCmdsInTran - Specifies the maximum number of statements grouped into a transaction as the
Log Reader writes commands to the distribution database. Using this parameter allows the Log Reader
Agent and Distribution Agent to divide large transactions (consisting of many commands) at the
Publisher into several smaller transactions when applied at the Subscriber. Specifying this parameter
can reduce contention at the Distributor and reduce latency between the Publisher and Subscriber.
Because the original transaction is applied in smaller units, the Subscriber can access rows of a large
logical Publisher transaction prior to the end of the original transaction, breaking strict transactional
atomicity. The default is 0, which preserves the transaction boundaries of the Publisher.
• -PollingInterval - Is how often, in seconds, the log is queried for replicated transactions. The default is
5 seconds.
• -ReadBatchSize - Is the maximum number of transactions read out of the transaction log of the
publishing database per processing cycle, with a default of 500. The agent will continue to read
transactions in batches until all transactions are read from the log. This parameter is not supported for
Oracle Publishers.
• -ReadBatchThreshold - Is the number of replication commands to be read from the transaction log
before being issued to the Subscriber by the Distribution Agent. The default is 0. If this parameter is not
specified, the Log Reader Agent will read to the end of the log or to the number specified in
-ReadBatchSize (number of transactions).

How To Decide The Log Reader Agent Profile Settings

You can query the MSlogreader_history table in the distribution database to see the log reader
statistics. By analyzing this data, you can determine the performance of the log reader. You can
use the query below:

USE distribution
GO

SELECT time,
       CAST(comments AS XML) AS comments,
       runstatus,
       duration,
       xact_seqno,
       delivered_transactions,
       delivered_commands,
       average_commands,
       delivery_time,
       delivery_rate,
       delivery_latency / ( 1000 * 60 ) AS delivery_latency_Min
FROM dbo.MSlogreader_history WITH (nolock)
WHERE time > '2014-10-28 16:00:00.130'
ORDER BY time DESC

It is difficult to attach a sample output because the output is very wide. However, I would like to
highlight some of the columns.

Look at the values in the Comments column below. It contains XML segments with valuable
information about how the Log Reader is performing.
The table below shows six different sample records of actual data in a replication environment. Look
at rows 3, 4 and 6. They carry more information, in the form of state 2, 3 and 1 messages respectively.

If you see many messages like "Approximately 2500000 log records have been scanned in pass #
4, 0 of which were marked for replication.", the Log Reader Agent has found no records
to replicate. This essentially means that many operations on the publisher are not
marked for replication. Increasing the -ReadBatchSize parameter would be beneficial in this type of
situation. The default value of the parameter is 500, but you could increase it by several
thousand to scan more T-Log records per cycle, because most of what is scanned contains nothing to replicate.

Seq#  Comments

1     12 transaction(s) with 14 command(s) were delivered.

2     No replicated transactions are available.

3     Raised events that occur when an agent's reader thread waits longer than the agent's -messageinterval time. (By
      default, the time is 60 seconds.) If you notice State 2 events that are recorded for an agent, this indicates that the agent
      is taking a long time to write changes to the destination.

4     Raised events that are generated only by the Log Reader Agent when the writer thread waits longer than the
      -messageinterval time. If you notice State 3 events that are recorded for the Log Reader Agent, this indicates that the
      agent is taking a long time to scan the replicated changes from the transaction log.

5     Approximately 2500000 log records have been scanned in pass # 4, 0 of which were marked for replication.

6     Normal events that describe both the reader and writer thread performance.

See below for what these state values mean:

• state 1 - Normal activity. Nothing to worry about
• state 2 - Reader Thread has to WAIT for Writer Thread. Has some issues
• state 3 - Writer Thread has to WAIT for Reader Thread. Has some issues
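
To pull the state value out of the Comments column instead of reading the XML by eye, something like the following sketch may help. It assumes the statistics rows start with a stats element that carries a state attribute and has reader and writer child elements with a wait attribute; check a few raw rows on your system first, because this layout is an assumption rather than something shown in this article.

```sql
-- Sketch only: run on the Distributor.
USE distribution;
GO

SELECT TOP (100)
       h.time,
       s.n.value('@state', 'int')            AS state,        -- 1, 2 or 3 as described above
       s.n.value('(reader/@wait)[1]', 'int') AS reader_wait,  -- assumed attribute
       s.n.value('(writer/@wait)[1]', 'int') AS writer_wait   -- assumed attribute
FROM dbo.MSlogreader_history AS h WITH (NOLOCK)
-- Only cast rows that look like statistics XML; other comments are plain text.
CROSS APPLY (SELECT CASE WHEN h.comments LIKE N'<stats state=%'
                         THEN CAST(h.comments AS xml)
                    END AS doc) AS d
CROSS APPLY d.doc.nodes('/stats') AS s(n)
WHERE h.time > DATEADD(HOUR, -24, GETDATE())
ORDER BY h.time DESC;
```

Many state 2 rows point at the writer side (the distribution database), and many state 3 rows point at the reader side (the publisher T-Log).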

Using these messages you can narrow your analysis of Log Reader Agent performance down to Reader
or Writer Thread issues. Another important column is "xact_seqno", which is
the last processed transaction sequence number. Check whether that value is changing frequently.
If so, replicated commands are being processed quickly. Sometimes you may see the same value
in the xact_seqno column for a long time, maybe even for a few hours. That indicates a large transaction
in the publisher database, one that generated a large amount of DML activity. You can identify the actual
commands of the transaction using the code snippet below.

USE distribution
GO
EXEC sp_browsereplcmds
     @xact_seqno_start = '0x0008BF0F008A6D7F00AA',
     @xact_seqno_end = '0x0008BF0F008A6D7F00AA',
     @publisher_database_id = 10

@publisher_database_id is not necessarily the same as the database id on the publisher server, so you
need to find it before executing the code above. Use the code below to identify
the publisher_database_id.

USE distribution
GO

SELECT * FROM dbo.MSpublisher_databases

Or

USE distribution
GO
-- xact_seqno is a varbinary column, so compare it with a binary literal (no quotes)
SELECT TOP 1 publisher_database_id
FROM dbo.MSrepl_commands
WHERE xact_seqno = 0x0008BF0F008A6D7F00AA

Note: This publisher database id is different from the database id in sys.databases on the publisher
server.

Refer to the command column of the sp_browsereplcmds output to see the actual command being executed.
This gives you a better sense of what is happening at the moment when replication is
slow.

If the transaction contains millions of DML operations, the sp_browsereplcmds query takes time to run.
You can narrow the output by filtering on @article_id or @command_id, or both, as below:

USE distribution
GO
EXEC sp_browsereplcmds
     @xact_seqno_start = '0x0008BF0F008A6D7F00AA',
     @xact_seqno_end = '0x0008BF0F008A6D7F00AA',
     @publisher_database_id = 10,
     @article_id = 1335,
     @command_id = 1000000
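
Before browsing the commands, it can help to know how large the transaction is and which article dominates it. The sketch below reuses the sample sequence number and publisher_database_id from the snippets above; both are illustrative values and need replacing with your own.

```sql
-- Sketch only: run on the Distributor.
USE distribution;
GO

-- Commands per article inside one transaction, to pick a useful @article_id filter.
SELECT c.article_id,
       COUNT(*) AS command_count
FROM dbo.MSrepl_commands AS c WITH (NOLOCK)
WHERE c.publisher_database_id = 10
  AND c.xact_seqno = 0x0008BF0F008A6D7F00AA
GROUP BY c.article_id
ORDER BY command_count DESC;

-- The ten largest transactions currently held in the distribution database.
SELECT TOP (10)
       c.xact_seqno,
       COUNT(*) AS command_count
FROM dbo.MSrepl_commands AS c WITH (NOLOCK)
WHERE c.publisher_database_id = 10
GROUP BY c.xact_seqno
ORDER BY command_count DESC;
```

Both queries scan MSrepl_commands, so on a very large table they can take a while themselves.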

How Large are the Replication Specific Tables


The distribution database has many tables to support SQL Server replication. It is important to know
how big they are, or at least how big the most important ones are. This should be part of your
troubleshooting effort. I normally use the query below to see the row count of the tables most central
to transactional replication.

USE distribution
GO
SELECT Getdate() AS CaptureTime,
       t.name AS TableName,
       st.row_count,
       s.name AS SchemaName
FROM sys.dm_db_partition_stats st WITH (nolock)
INNER JOIN sys.tables t WITH (nolock)
        ON st.object_id = t.object_id
INNER JOIN sys.schemas s WITH (nolock)
        ON t.schema_id = s.schema_id
WHERE st.index_id < 2 -- heap or clustered index only, so rows are counted once
  AND t.name IN ('MSsubscriptions',
                 'MSdistribution_history',
                 'MSrepl_commands',
                 'MSrepl_transactions')
ORDER BY st.row_count DESC

Table Name              Description

MSsubscriptions         contains one row for each published article in a subscription
MSdistribution_history  contains history rows for the Distribution Agents associated with the local Distributor
MSrepl_commands         contains rows of replicated commands
MSrepl_transactions     contains one row for each replicated transaction

If you see a high row count (probably more than 1 or 2 million), there is some problem in
replication. It could be one of the reasons stated below:

1. The clean-up job (on the distribution server) is not running
2. It is taking a long time to deliver the commands to the subscriber
3. There may be blocking on the distribution server due to the clean-up job
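
To check the first reason, you can look at the recent history of the clean-up job in msdb. This is a sketch that assumes the job still has its default name, which starts with "Distribution clean up:"; adjust the filter if the job was renamed.

```sql
-- Sketch only: run on the Distributor.
SELECT TOP (20)
       j.name,
       j.enabled,
       h.run_date,      -- integer in yyyymmdd form
       h.run_time,      -- integer in hhmmss form
       h.run_duration,  -- integer in hhmmss form
       h.run_status,    -- 0 = failed, 1 = succeeded, 2 = retry, 3 = canceled
       h.message
FROM msdb.dbo.sysjobs AS j
INNER JOIN msdb.dbo.sysjobhistory AS h
        ON h.job_id = j.job_id
WHERE j.name LIKE N'Distribution clean up:%'
  AND h.step_id = 0     -- step 0 holds the overall job outcome
ORDER BY h.instance_id DESC;
```

No rows, a disabled job, or a run of failures would all explain commands piling up in MSrepl_commands.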

Use the query below to see what is currently running on the distribution server. (You can use the
same query on any server for the same purpose.)

SELECT r.session_id,
s.program_name,
s.login_name,
r.start_time,
r.status,
r.command,
Object_name(sqltxt.objectid, sqltxt.dbid) AS ObjectName,
Substring(sqltxt.text, ( r.statement_start_offset / 2 ) + 1, ( (
CASE r.statement_end_offset
WHEN -1 THEN
datalength(sqltxt.text)
ELSE r.statement_end_offset
END
- r.statement_start_offset ) / 2 ) + 1) AS active_statement,
r.percent_complete,
Db_name(r.database_id) AS DatabaseName,
r.blocking_session_id,
r.wait_time,
r.wait_type,
r.wait_resource,
r.open_transaction_count,
r.cpu_time,-- in milli sec
r.reads,
r.writes,
r.logical_reads,
r.row_count,
r.prev_error,
r.granted_query_memory,
Cast(sqlplan.query_plan AS XML) AS QueryPlan,
CASE r.transaction_isolation_level
WHEN 0 THEN 'Unspecified'
WHEN 1 THEN 'ReadUncommitted'
WHEN 2 THEN 'ReadCommitted'
WHEN 3 THEN 'Repeatable'
WHEN 4 THEN 'Serializable'
WHEN 5 THEN 'Snapshot'
END AS Isolation_Level,
r.sql_handle,
r.plan_handle
FROM sys.dm_exec_requests r WITH (nolock)
INNER JOIN sys.dm_exec_sessions s WITH (nolock)
ON r.session_id = s.session_id
CROSS apply sys.Dm_exec_sql_text(r.sql_handle) sqltxt
CROSS apply
sys.Dm_exec_text_query_plan(r.plan_handle, r.statement_start_offset,
r.statement_end_offset) sqlplan
WHERE r.status <> 'background'
ORDER BY r.session_id
go

If you see blocking with LCK_M_S waits, this is probably due to the Clean-up job. This job runs every
10 minutes and it clears the commands that have already been replicated. It is safe to stop and
disable the job for a couple of hours to clear the blocking.
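
Stopping and disabling the job can be scripted with the SQL Server Agent procedures in msdb. The job name below is the default one for a distribution database called "distribution", which is an assumption; confirm the actual name first.

```sql
-- Sketch only: run on the Distributor.
DECLARE @job sysname = N'Distribution clean up: distribution';  -- assumed default job name

-- Stop the current run. sp_stop_job raises an error if the job is not running,
-- which is harmless here.
EXEC msdb.dbo.sp_stop_job @job_name = @job;

-- Prevent the schedule from starting it again.
EXEC msdb.dbo.sp_update_job @job_name = @job, @enabled = 0;

-- Later, once the blocking has cleared, re-enable it:
-- EXEC msdb.dbo.sp_update_job @job_name = @job, @enabled = 1;
```

Remember to re-enable the job; while it is disabled, delivered commands are not removed and the distribution database keeps growing.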

Most often I have noticed that the root blocker is sp_MSsubscription_cleanup (a nested stored
procedure call from sp_MSdistribution_cleanup, which is what the "Distribution clean up" job runs). You
may also see this stored procedure with the CXPACKET wait type, and it blocks the following statement.

UPDATE msdistribution_history
SET runstatus = @runstatus,
time = @current_time,
duration = @duration,
comments = @comments,
xact_seqno = @xact_seqno,
updateable_row = @this_row_updateable,
error_id = CASE @error_id
WHEN 0 THEN error_id
ELSE @error_id
END
WHERE agent_id = @agent_id
AND timestamp = @lastrow_timestamp
AND ( runstatus = @runstatus
OR ( @update_existing_row = 1
AND runstatus IN ( @idle, @inprogress )
AND @runstatus IN ( @idle, @inprogress ) ) )

The wait type for the above statement is LCK_M_X and the wait resource
is the MSdistribution_history table. This table is used inside the head blocker stored procedure, which
has already acquired shared locks on most of its rows. I feel MS needs to optimize this code.
When I compared the clean-up job stored procedure between the 2008 and 2012 versions of SQL
Server, I noticed that the lines of code had doubled in the 2012 version.

At the same time, you may also notice high CPU on the distribution server, caused by the heavy
blocking behind the above head blocker. There is really nothing you can do except stop and disable
the clean-up job for some time. You may also try setting MAXDOP to 1 on the distribution server to
bring down the CPU usage.
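
A sketch of the MAXDOP change is shown below. Note that this is an instance-wide setting, so it affects every database on the distribution server, not only the distribution database; record the current value before changing it.

```sql
-- Sketch only: run on the distribution server.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Shows the current value so that it can be restored later.
EXEC sp_configure 'max degree of parallelism';

EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
```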

Improving the Latency Between Distributor And Subscriber


Again, the latency report helps here. If it shows that the replication latency is between the distributor
and the subscriber, then it is worth considering the points below.

Publishing Stored Procedure Execution


This is especially useful in cases where large batch operations (e.g.: DELETE) are performed on the
publisher. I have seen cases where a large batch delete affected millions of rows, and replication
then had to transfer one command per row to the distributor and on to the subscriber.
This slows replication and you can notice increased latency. With this method, the same large batch
operation executes at the subscriber instead of individual commands being passed via the distributor.
However, before implementing this solution you need to spend time doing some research and assess
how feasible it is for your environment. There are many factors that you need to be aware of.

For more detail: http://msdn.microsoft.com/en-us/library/ms152754.aspx
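
As an illustration, a stored procedure's execution is published by adding the procedure as an article with a "proc exec" type. In the sketch below the publication name MainPub is taken from the example earlier in this document, and usp_PurgeOldOrders is a hypothetical procedure name.

```sql
-- Sketch only: run at the Publisher, in the publication database.
EXEC sp_addarticle
     @publication   = N'MainPub',
     @article       = N'usp_PurgeOldOrders',       -- hypothetical procedure
     @source_owner  = N'dbo',
     @source_object = N'usp_PurgeOldOrders',
     -- 'serializable proc exec' replicates the call only when it runs inside a
     -- serializable transaction; 'proc exec' replicates every execution.
     @type          = N'serializable proc exec';
```

The procedure must also exist at the subscriber and must produce the same result there, which is one of the factors to assess before using this approach.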

Enable Multiple Streams for Subscriber


Enabling multiple streams for the subscriber can greatly improve aggregate transactional replication
throughput by applying the subscriber changes in parallel. Still there are many factors you need to
consider and you need to do some homework before getting this to production.

For more details: http://technet.microsoft.com/en-us/library/ms151762(v=sql.105).aspx

Maintain Indexes and Statistics in Distribution Database


The distribution database is categorized as a system database in SSMS. However, some level of DBA
intervention is needed to keep it in good shape. The distribution database has
tables, indexes and statistics like normal user databases. We know for a fact that indexes need to be
maintained (rebuild/reorganize) and statistics updated in user databases, so why not perform the
same operations in the distribution database? The clean-up stored procedures have their own statistics
update statements, but these do not cover all of the statistics. It is totally fine to have
index and statistics maintenance jobs deployed to the distribution database and scheduled to run at
off-peak times, as you do in user databases. I have done this in production environments as per MS
suggestion.
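
A minimal sketch of such maintenance follows. The 1000-page threshold and the choice of tables are illustrative assumptions, not recommendations from the article.

```sql
-- Sketch only: run on the Distributor at an off-peak time.
USE distribution;
GO

-- Fragmentation of the larger indexes in the distribution database.
SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
INNER JOIN sys.indexes AS i
        ON i.object_id = ips.object_id
       AND i.index_id  = ips.index_id
WHERE ips.page_count > 1000              -- illustrative threshold
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Refresh statistics on the two busiest replication tables.
UPDATE STATISTICS dbo.MSrepl_commands;
UPDATE STATISTICS dbo.MSrepl_transactions;
```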

Distribution Agent Performance


You can query the MSdistribution_history table to see how the Distribution Agent performs.

USE distribution
GO
SELECT TOP 100 time,
       Cast(comments AS XML) AS comments,
       runstatus,
       duration,
       xact_seqno,
       delivered_commands,
       average_commands,
       current_delivery_rate,
       delivered_transactions,
       error_id,
       delivery_latency
FROM dbo.MSdistribution_history WITH (nolock)
ORDER BY time DESC

The output of the above query is similar to the output of the Log Reader history table. Look at the
value of the Comments column. Messages with state 1 mean the Distribution Agent is
performing normally. Using xact_seqno you can identify the commands replicated. If you notice the
same value for xact_seqno for a long time, the agent is replicating a large transaction.
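
When there are several subscriptions, it is easier to look at the most recent history row for each Distribution Agent. The following is a sketch that joins the history table to MSdistribution_agents.

```sql
-- Sketch only: run on the Distributor.
USE distribution;
GO

SELECT da.name             AS agent_name,
       da.publication,
       da.subscriber_db,
       h.time,
       h.runstatus,        -- 1 start, 2 succeed, 3 in progress, 4 idle, 5 retry, 6 fail
       h.xact_seqno,
       h.delivered_commands,
       h.delivery_latency,
       h.error_id
FROM dbo.MSdistribution_agents AS da
-- Latest history row per agent.
CROSS APPLY (SELECT TOP (1) dh.*
             FROM dbo.MSdistribution_history AS dh WITH (NOLOCK)
             WHERE dh.agent_id = da.id
             ORDER BY dh.time DESC) AS h
ORDER BY h.delivery_latency DESC;
```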

Distribution Agent Profile


Like the Log Reader Agent Profile, there is a Distribution Agent Profile on the distribution server. If
you open the Agent Profiles window (refer to Figure 4), you can select Distribution
Agents from the right pane to see the profiles. You can tweak the parameter values of the agent to change
the replication behavior. You can do it at the publication level or apply it to all publications. The change
takes effect the next time the Distribution Agent starts; for a continuously running agent, stop and
restart the agent job (or restart SQL Server Agent on the distribution server).

Below are some parameters you may consider tweaking:

• -CommitBatchSize - Is the number of transactions to be issued to the Subscriber before a COMMIT
statement is issued. The default is 100.
• -CommitBatchThreshold - Is the number of replication commands to be issued to the Subscriber
before a COMMIT statement is issued. The default is 1000.
• -HistoryVerboseLevel [0|1|2|3] - Specifies the amount of history logged during a distribution
operation. You can minimize the performance effect of history logging by selecting 1.
• -MaxDeliveredTransactions - Is the maximum number of push or pull transactions applied to
Subscribers in one synchronization. A value of 0 indicates that the maximum is an infinite number of
transactions. Other values can be used by Subscribers to shorten the duration of a synchronization
being pulled from a Publisher.
• -PollingInterval - Is how often, in seconds, the distribution database is queried for replicated
transactions. The default is 5 seconds.
• -SubscriptionStreams [0|1|2|...64] - Is the number of connections allowed per Distribution Agent to
apply batches of changes in parallel to a Subscriber, while maintaining many of the transactional
characteristics present when using a single thread. For a SQL Server Publisher, a range of values from
1 to 64 is supported. This parameter is only supported when the Publisher and Distributor are running
on SQL Server 2005 or later versions. This parameter is not supported or must be 0 for non-SQL
Server Subscribers or peer-to-peer subscriptions.

For more details: BOL: http://msdn.microsoft.com/en-us/library/ms147328.aspx

Troubleshooter: Find errors with SQL Server transactional replication

Troubleshooting replication errors can be frustrating without a basic understanding of how
transactional replication works. The first step in creating a publication is having the Snapshot Agent
create the snapshot and save it to the snapshot folder. Next, the Distribution Agent applies the
snapshot to the subscriber.

This process creates the publication and puts it in the synchronizing state. Synchronization works in
three phases:

1. Transactions occur on objects that are replicated, and are marked "for replication" in the
transaction log.
2. The Log Reader Agent scans through the transaction log and looks for transactions that are
marked "for replication." These transactions are then saved to the distribution database.
3. The Distribution Agent scans through the distribution database by using the reader thread.
Then, by using the writer thread, this agent connects to the subscriber to apply those changes to
the subscriber.

Errors can occur in any step of this process. Finding those errors can be the most challenging aspect
of troubleshooting synchronization issues. Thankfully, the use of Replication Monitor makes this
process easy.

Note
• The purpose of this troubleshooting guide is to teach troubleshooting methodology. It's
designed not to solve your specific error, but to provide general guidance in finding errors with
replication. Some specific examples are provided, but the resolution to them can vary depending
on the environment.
• The errors that this guide provides as examples are based on the Configuring transactional
replication tutorial.

Troubleshooting methodology

Questions to ask

1. Where in the synchronization process is replication failing?
2. Which agent is experiencing an error?
3. When was the last time replication worked successfully? Has anything changed since then?

Steps to take

1. Use Replication Monitor to identify at which point replication is encountering the error (which
agent?):
o If errors are occurring in the Publisher to Distributor section, the issue is with the Log
Reader Agent.
o If errors are occurring in the Distributor to Subscriber section, the issue is with the
Distribution Agent.
2. Look through that agent's job history in Job Activity Monitor to identify details of the error. If
the job history is not showing enough details, you can enable verbose logging on that specific
agent.
3. Try to determine a solution for the error.
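
Alongside Replication Monitor and the job history, the agents also record errors in the distribution database. The sketch below lists the most recent ones; it assumes the distribution database has the default name "distribution".

```sql
-- Sketch only: run on the Distributor.
USE distribution;
GO

SELECT TOP (50)
       e.time,
       e.source_name,
       e.error_code,
       e.error_text,
       e.xact_seqno,    -- transaction being processed when the error occurred
       e.command_id
FROM dbo.MSrepl_errors AS e
ORDER BY e.time DESC;
```

The xact_seqno and command_id values can be passed to sp_browsereplcmds to see the command that failed.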

Find errors with the Snapshot Agent


The Snapshot Agent generates the snapshot and writes it to the specified snapshot folder.

1. View the status of your Snapshot Agent:

a. In Object Explorer, expand the Local Publication node under Replication.

b. Right-click your publication AdvWorksProductTrans > View Snapshot Agent Status.

2. If an error is reported in the Snapshot Agent status, you can find more details in the Snapshot
Agent job history:

a. Expand SQL Server Agent in Object Explorer and open Job Activity Monitor.

b. Sort by Category and identify the Snapshot Agent by the category REPL-Snapshot.

c. Right-click the Snapshot Agent and then select View History.


3. In the Snapshot Agent history, select the relevant log entry. This is usually a line or
two before the entry that's reporting the error. (A red X indicates errors.) Review the message
text in the box below the logs:

The replication agent had encountered an exception.
Exception Message: Access to path '\\node1\repldata.....' is denied.

If your Windows permissions are not configured correctly for your snapshot folder, you'll see an
"access is denied" error for the Snapshot Agent. You'll need to verify permissions to the folder where
your snapshot is stored, and make sure that the account used to run the Snapshot Agent has
permissions to access the share.
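
The Snapshot Agent messages can also be read from the distribution database, which is handy when the job history has already been purged. This is a sketch that assumes the default distribution database name.

```sql
-- Sketch only: run on the Distributor.
USE distribution;
GO

SELECT TOP (20)
       a.name        AS agent_name,
       a.publication,
       h.time,
       h.runstatus,  -- 1 start, 2 succeed, 3 in progress, 4 idle, 5 retry, 6 fail
       h.comments,
       h.error_id
FROM dbo.MSsnapshot_agents AS a
INNER JOIN dbo.MSsnapshot_history AS h
        ON h.agent_id = a.id
ORDER BY h.time DESC;
```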

Find errors with the Log Reader Agent


The Log Reader Agent connects to your publisher database and scans the transaction log for any
transactions that are marked "for replication." It then adds those transactions to the distribution
database.

1. Connect to the publisher in SQL Server Management Studio. Expand the server node, right-
click the Replication folder, and then select Launch Replication Monitor:

Replication Monitor opens: 

2. The red X indicates that the publication is not synchronizing. Expand My Publishers on the left
side, and then expand the relevant publisher server.
3. Select the AdvWorksProductTrans publication on the left, and then look for the red X on one
of the tabs to identify where the issue is. In this case, the red X is on the Agents tab, so one of
the agents is encountering an error:

4. Select the Agents tab to identify which agent is encountering the error:

5. This view shows you two agents, the Snapshot Agent and the Log Reader Agent. The one that's
encountering an error has the red X. In this case, it's the Log Reader Agent.
Double-click the line that's reporting the error to open the agent history for the Log Reader
Agent. This history provides more information about the error:

Status: 0, code: 20011, text: 'The process could not execute 'sp_replcmds' on
'NODE1\SQL2016'.'.
The process could not execute 'sp_replcmds' on 'NODE1\SQL2016'.
Status: 0, code: 15517, text: 'Cannot execute as the database principal because the principal
"dbo" does not exist, this type of principal cannot be impersonated, or you do not have
permission.'.
Status: 0, code: 22037, text: 'The process could not execute 'sp_replcmds' on
'NODE1\SQL2016'.'.

6. The error typically occurs when the owner of the publisher database is not set correctly. This
can happen when a database is restored. To verify this:

a. Expand Databases in Object Explorer.

b. Right-click AdventureWorks2012 > Properties.

c. Verify that an owner exists under the Files page. If this box is blank, this is the likely cause of
your issue.

7. If the owner is blank on the Files page, open a New Query window within the context of the
AdventureWorks2012 database. Run the following T-SQL code:

-- set the owner of the database to 'sa' or a specific user account, without the brackets.
EXECUTE sp_changedbowner '<useraccount>'
-- example for sa: exec sp_changedbowner 'sa'
-- example for user account: exec sp_changedbowner 'sqlrepro\administrator'

8. You might need to restart the Log Reader Agent:

a. Expand the SQL Server Agent node in Object Explorer and open Job Activity Monitor.

b. Sort by Category and identify the Log Reader Agent by the REPL-LogReader category.

c. Right-click the Log Reader Agent job and select Start Job at Step.


9. Validate that your publication is now synchronizing by opening Replication Monitor again. If
it's not already open, you can find it by right-clicking Replication in Object Explorer.
10. Select the AdvWorksProductTrans publication, select the Agents tab, and double-click the
Log Reader Agent to open the agent history. You should now see that the Log Reader Agent is
running and either is replicating commands or has "no replicated transactions":
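
The owner check from steps 6 and 7 can also be done in T-SQL. This is a sketch: sp_changedbowner still works but is deprecated, and ALTER AUTHORIZATION is its replacement. The choice of sa as the new owner mirrors the example above and may not suit every environment.

```sql
-- Sketch only: run on the Publisher.
-- A NULL owner_name means the owner SID no longer maps to a valid login.
SELECT d.name,
       SUSER_SNAME(d.owner_sid) AS owner_name
FROM sys.databases AS d
WHERE d.name = N'AdventureWorks2012';

-- Non-deprecated equivalent of sp_changedbowner 'sa':
ALTER AUTHORIZATION ON DATABASE::AdventureWorks2012 TO [sa];
```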

Find errors with the Distribution Agent


The Distribution Agent finds data in the distribution database and then applies it to the subscriber.

1. Connect to the publisher in SQL Server Management Studio. Expand the server node, right-
click the Replication folder, and then select Launch Replication Monitor.
2. In Replication Monitor, select the AdvWorksProductTrans publication, and select the All
Subscriptions tab. Right-click the subscription and select View Details:

3. The Distributor to Subscriber History dialog box opens and clarifies what error the agent is
encountering:

Error messages:
Agent 'NODE1\SQL2016-AdventureWorks2012-AdvWorksProductTrans-NODE2\SQL2016-7' is retrying
after an error. 89 retries attempted. See agent job history in the Jobs folder for more
details.

4. The error indicates that the Distribution Agent is retrying. To find more information, check the
job history for the Distribution Agent:

a. Expand SQL Server Agent in Object Explorer > Job Activity Monitor.

b. Sort the jobs by Category.

c. Identify the Distribution Agent by the category REPL-Distribution. Right-click the agent and
select View History.

5. Select one of the error entries and view the error text at the bottom of the window:
Message:
Unable to start execution of step 2 (reason: Error authenticating proxy
NODE1\repl_distribution, system error: The user name or password is incorrect.)

6. This error indicates that the password that the Distribution Agent used is incorrect. To resolve
it:

a. Expand the Replication node in Object Explorer.

b. Right-click the subscription > Properties.

c. Select the ellipsis (...) next to Agent Process Account and modify the password.

7. Check Replication Monitor again, by right-clicking Replication in Object Explorer. A red X
under All Subscriptions indicates that the Distribution Agent is still encountering an error.

Open the Distribution to Subscriber history by right-clicking the subscription in Replication
Monitor > View Details. Here, the error is now different:

Connecting to Subscriber 'NODE2\SQL2016'
Agent message code 20084. The process could not connect to Subscriber 'NODE2\SQL2016'.
Number: 18456
Message: Login failed for user 'NODE2\repl_distribution'.

8. This error indicates that the Distribution Agent could not connect to the subscriber, because
the login failed for user NODE2\repl_distribution. To investigate further, connect to the
subscriber and open the current SQL Server error log under the Management node in Object
Explorer:

If you're seeing this error, the login is missing on the subscriber. To resolve this error,
see Permissions for replication.

9. After the login error is resolved, check Replication Monitor again. If all issues have been
addressed, you should see a green arrow next to Publication Name and a status
of Running under All Subscriptions.
Right-click the subscription to open the Distributor To Subscriber history once more to verify
success. If this is the first time you're running the Distribution Agent, you'll see that the snapshot
has been bulk copied to the subscriber:
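
For the login failure in step 8, the check and a possible fix can be sketched in T-SQL. The login name comes from the error message above; ProductReplica is a hypothetical subscription database name, and db_owner membership reflects the usual requirement for the Distribution Agent account. Follow the Permissions for replication guidance for your environment.

```sql
-- Sketch only: run on the Subscriber.
-- Does the login exist, and is it enabled?
SELECT name, type_desc, is_disabled
FROM sys.server_principals
WHERE name = N'NODE2\repl_distribution';

-- Create it if it is missing.
IF SUSER_ID(N'NODE2\repl_distribution') IS NULL
    CREATE LOGIN [NODE2\repl_distribution] FROM WINDOWS;
GO

USE ProductReplica;  -- hypothetical subscription database
GO
IF USER_ID(N'NODE2\repl_distribution') IS NULL
    CREATE USER [NODE2\repl_distribution] FOR LOGIN [NODE2\repl_distribution];

ALTER ROLE db_owner ADD MEMBER [NODE2\repl_distribution];
```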

Enable verbose logging on any agent


You can use verbose logging to see more detailed information about errors occurring with any agent
in the replication topology. The steps are the same for each agent. Just make sure that you're
selecting the correct agent in Job Activity Monitor.

Note

The agents can be on either the publisher or the subscriber, depending on whether it's a pull or push
subscription. If you can't find the agent you're looking for on the server you're looking at, try checking
the other server.

1. Decide where you want the verbose logging to be saved, and ensure that the folder exists. This
example uses c:\temp.
2. Expand the SQL Server Agent node in Object Explorer and open Job Activity Monitor.

3. Sort by Category and identify the agent of interest. This example uses the Log Reader Agent.
Right-click the agent of interest > Properties.

4. Select the Steps page, and then highlight the Run agent step. Select Edit.

5. In the Command box, start a new line, enter the following text, and select OK:

-Output C:\Temp\OUTPUTFILE.txt -Outputverboselevel 3

You can modify the location and verbosity level according to your preference.

Note
These things might cause your agent to fail, or the output file to be missing, when you're adding
the verbose output parameter:

o There's a formatting issue where the dash became a hyphen.
o The location doesn't exist on disk, or the account that's running the agent lacks
permission to write to the specified location.
o There's a space missing between the last parameter and the -Output parameter.
o Different agents support different levels of verbosity. If you enable verbose logging but
your agent fails to start, try decreasing the specified verbosity level by 1.

6. Restart the Log Reader Agent by right-clicking the agent > Stop Job at Step. Refresh by
selecting the Refresh icon from the toolbar. Right-click the agent > Start Job at Step.
7. Review the output on disk.

8. To disable verbose logging, follow the same previous steps to remove the entire -Output line
that you added earlier.

For more information, see Enabling verbose logging for replication agents.
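
To confirm that the -Output parameter was saved in the right job step, or to find which agents already have verbose logging enabled, the job step commands can be read from msdb. This is a sketch; the subsystem names listed are the ones SQL Server Agent uses for replication agents.

```sql
-- Sketch only: run on the server that hosts the agent job.
SELECT j.name      AS job_name,
       c.name      AS category,
       s.step_id,
       s.step_name,
       s.subsystem,
       s.command   -- the agent command line, including any -Output parameter
FROM msdb.dbo.sysjobs AS j
INNER JOIN msdb.dbo.syscategories AS c
        ON c.category_id = j.category_id
INNER JOIN msdb.dbo.sysjobsteps AS s
        ON s.job_id = j.job_id
WHERE s.subsystem IN (N'Snapshot', N'LogReader', N'Distribution', N'Merge', N'QueueReader')
ORDER BY j.name, s.step_id;
```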


Troubleshooting SQL Server Replication
By: Ahmad Yaseen

In the previous article, Setting Up and Configuring SQL Server Replication, we discussed in-depth, the
SQL Server Replication concept, its components, types and how to configure the SQL Transactional
Replication step by step. It is highly recommended to go through the previous article and understand the
replication concept and its components before reading this article. In this article, we will see how to
troubleshoot an existing SQL Server Replication site.

Troubleshooting Overview
The main goal of SQL Server Replication is keeping the data in the Publisher and the Subscriber
synchronized. In the happy scenario, if a transaction is performed and committed at the publication
database, it will be copied to the distribution database, then synchronized and applied to all Subscribers
connected to that Publisher. If an issue occurs at any step of this process, the Publisher changes will not
be available at the Subscriber side. In this case, we need to troubleshoot and fix that issue as soon as
possible, before ending up with an expired SQL Replication site that must be synchronized again from
scratch, or with a database whose transaction log file runs out of free space, pausing all database transactions.
Identifying the step at which the replication synchronization is failing, and locating an indicative error
message that leads to a fix, is the most challenging part of the SQL Server Replication
troubleshooting process. Checking the last synchronization time, and what changes were made
at or after that time that may have caused the failure, can also help in troubleshooting the replication
synchronization failure.
Understanding the role of each SQL Server Replication agent will help in identifying the step at which the
synchronization fails. Recall that there are three replication agents that are common to most of the
SQL Server Replication types. The Snapshot Agent is responsible for creating the initial synchronization
snapshot. The Log Reader Agent is responsible for reading the changes from the database transaction
log file and copying them to the distribution database. Finally, the Distribution Agent is responsible for
synchronizing the changes to the Subscribers.
In this article, we will take advantage of the Replication Monitor and Job Activity Monitor windows in
monitoring the SQL Server Replication status and getting information about any synchronization failure
error.

Troubleshooting Scenarios
The most straightforward way to understand how to troubleshoot SQL Server Replication issues is to work
through practical scenarios and show how to fix each one. Let us discuss the scenarios one by one.

SQL Server Agent Service Issue


The SQL Server Agent service plays a vital role in the SQL Server Replication synchronization process,
because each replication agent runs as a SQL Server Agent job.

As a proactive database administrator, you need to check the SQL replication site status on a daily
basis. To check the replication site status, right-click the Publication under the Replication -> Local
Publications node and choose the Launch Replication Monitor option.

In the Replication Monitor window, you may see a warning that the replication will expire soon or has
already expired, without any indicative error message.

If the Replication Monitor window provides no useful information about why the replication site is
expiring soon, the next step is to check the Job Activity Monitor under the SQL Server Agent node.
Visiting the SQL Server Agent node, you will see immediately that the SQL Server Agent service is not
running (from the red circle beside it). If the SQL Server Agent service is not running, none of the jobs
created under that instance are running, including the replication agent jobs. As a result, the overall
replication site is not working.

To fix that issue, start the SQL Server Agent service, either directly from SQL Server Management Studio
or using SQL Server Configuration Manager (recommended).

After starting the SQL Server Agent service, check the Replication Monitor again and make sure that the
Subscriber status is Running and that all pending transactions are synchronized with the Subscriber
successfully. You can verify this step by step. First, check in the Publisher to Distributor section that
the records are copied from the Publisher to the Distributor. Then check that they are synchronized from
the Distributor to the Subscriber successfully. Finally, make sure from the last tab that there are no
undistributed transactions.

After that, make sure that the replication agent jobs are up and running with no issues. To check the SQL
Agent jobs, expand the SQL Server Agent node in the SSMS Object Explorer, open the Job Activity Monitor,
and check that the Log Reader Agent and Distribution Agent jobs are running. Keep in mind that the
Snapshot Agent runs only during the snapshot creation process.

You can also review the history of the replication agent jobs to find the reason for a previous failure,
by right-clicking the job and choosing the View History option. There you may find an indicative error
message that helps in preventing this issue in the future.
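
The same job history can also be read with T-SQL. The query below is a minimal sketch, assuming the
replication agent jobs use the standard REPL-* job categories in msdb; it lists the most recent failed
steps of replication agent jobs.

```sql
-- Run on the instance that hosts the replication agent jobs (usually the Distributor).
-- Assumption: the agent jobs are in the standard REPL-* job categories.
SELECT TOP (20)
       j.name AS job_name,
       h.step_id,
       h.step_name,
       h.run_date,              -- integer in YYYYMMDD form
       h.run_time,              -- integer in HHMMSS form
       h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs       AS j ON j.job_id = h.job_id
JOIN msdb.dbo.syscategories AS c ON c.category_id = j.category_id
WHERE c.name LIKE N'REPL-%'
  AND h.run_status = 0          -- 0 = Failed
ORDER BY h.instance_id DESC;
```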

To prevent this issue from recurring, change the SQL Server Agent service startup mode from Manual to
Automatic, so that the service starts automatically when the hosting server is rebooted.
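
The service state and startup mode can also be checked from a query window. This is a small sketch that
assumes SQL Server 2008 R2 SP1 or later (where sys.dm_server_services is available) and a login with
VIEW SERVER STATE permission.

```sql
-- Shows whether the SQL Server Agent service is running and how it is configured to start.
SELECT servicename,
       startup_type_desc,   -- e.g. Automatic, Manual, Disabled
       status_desc,         -- e.g. Running, Stopped
       last_startup_time
FROM sys.dm_server_services
WHERE servicename LIKE N'SQL Server Agent%';
```

Changing the startup mode itself is still done outside T-SQL, for example in SQL Server Configuration
Manager.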

Snapshot Agent Permission Issue

Assume that while checking the SQL Server Replication status using the Replication Monitor, you notice
a replication failure, indicated by the X sign inside a red circle. The Replication Monitor also shows
that the failure comes from one of the replication agents, from the X sign inside the red circle at the
top of the Agents tab.

To identify that replication failure, browse the Agents tab and check which agent is failing. On the
Agents page, you will see that the Snapshot Agent is the failing one. Double-click the Snapshot Agent and
review the error message below:

The replication agent has not logged a progress message in 10 minutes. This might indicate an
unresponsive agent or high system activity. Verify that records are being replicated to the destination and
that connections to the Subscriber, Publisher, and Distributor are still active.

Unfortunately, this error message is generic: it shows only that the Snapshot Agent is not working,
without specifying the reason.

We therefore need to look for useful information elsewhere, namely in the Snapshot Agent job. In the Job
Activity Monitor window, under the SQL Server Agent node, you can see that the Snapshot Agent job has
failed. The job history shows that it failed recently due to a proxy authentication problem. In other
words, the credentials for the account under which the Snapshot Agent runs are not correct.

To fix the Snapshot Agent credential issue, right-click the Publication under the Replication -> Local
Publications node and choose the Properties option. In the Publication Properties window, browse to the
Agent Security page and re-enter the credentials for the account under which the Snapshot Agent will run.

After refreshing the Snapshot Agent account credentials, start the Snapshot Agent job again from the Job
Activity Monitor window and make sure that the job is working fine. Also check that the Snapshot Agent is
working in the Replication Monitor and that the error message no longer appears.
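
To confirm the Snapshot Agent outcome without the GUI, the agent history kept in the distribution
database can be queried. The sketch below assumes the distribution database has the default name
distribution.

```sql
-- Run on the Distributor. Latest Snapshot Agent history rows, newest first.
-- runstatus: 1 = Start, 2 = Succeed, 3 = In progress, 4 = Idle, 5 = Retry, 6 = Fail
SELECT TOP (20)
       a.name        AS agent_name,
       a.publication,
       h.runstatus,
       h.start_time,
       h.[time]      AS logged_at,
       h.comments,
       h.error_id    -- non-zero values point to rows in MSrepl_errors
FROM distribution.dbo.MSsnapshot_history AS h
JOIN distribution.dbo.MSsnapshot_agents  AS a ON a.id = h.agent_id
ORDER BY h.[time] DESC;
```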

Snapshot Folder Permission Issue


Assume that, when trying to synchronize the Publisher and the Subscriber using the initial snapshot, or
to resynchronize the Snapshot replication site using a new snapshot, the snapshot creation process fails
with an access error message.

This error message shows that the account under which the Snapshot Agent is running does not have
permission to access the snapshot folder specified in the error message.

To fix that issue, check the account under which the Snapshot Agent is running, from the Agent Security
page of the Publication Properties window. Then browse to the snapshot folder specified in the error
message and make sure that this account has at least read and write permission on that folder. Run the
Snapshot Agent again and confirm that the synchronization snapshot is created successfully.
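
If the error message is not at hand, the default snapshot folder registered for each Publisher can be
looked up on the Distributor. This is only a sketch: a publication that is configured with an alternate
snapshot folder will not show that location here.

```sql
-- Run on the Distributor. working_directory is the default snapshot folder per Publisher.
SELECT name AS publisher,
       distribution_db,
       working_directory
FROM msdb.dbo.MSdistpublishers;
```

The folder permissions themselves are granted at the file system or share level, not from T-SQL.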

Subscriber Permission Issue

Assume that, while checking the SQL Server Replication site status using the Replication Monitor, you
see a failure at the Subscriber.

If you click the error icon, you will see that the failure occurred when trying to synchronize the
transactions from the Distributor to the Subscriber. From the error message, it is clear that the
Distributor is not able to connect to the Subscriber SQL Server instance due to a permission issue.

To fix that issue, check and refresh the credentials used to connect to the Subscriber instance.
Right-click the Subscription under the Replication -> Local Publications -> current Publication name
node and choose the Properties option. In the Subscriber Connection field of the Subscription Properties
window, refresh the credentials for the account that will be used to connect to the Subscriber instance.

After that, check the replication status again from the Replication Monitor. You will see that the
Subscriber connection issue no longer appears and the replication site is running normally.

Subscriber Not Reachable


Another SQL Server Replication failure you may face on the Subscriber side is that the Distributor is
not able to connect to the Subscriber. The Distributor to Subscriber page of the Replication Monitor
shows that it cannot open a connection to the Subscriber due to a “Network Related … ” connectivity error.

This error message indicates a connection issue between the Distributor instance and the Subscriber
instance. The first and most straightforward check is to make sure that the Subscriber SQL Server
instance is online, using SQL Server Configuration Manager on the Subscriber side. In our situation, the
SQL Server service at the Subscriber side is stopped. To fix that issue, start the SQL Server service and
check from the Replication Monitor that the replication site is synchronized again. For more advanced
SQL connectivity issues, check the Troubleshooting Connectivity MS document.

Subscriber Database Permission Issue

Assume that you are checking the SQL Server Replication synchronization status using the Replication
Monitor, and you find that the replication is failing while trying to replicate the changes from the
Distributor to the Subscriber. Clicking the Subscriber error, you will see that the Distributor is able to
reach the Subscriber and connect to it, but is not able to connect to the subscription database due to a
lack of permission.

To fix that issue, connect to the Subscriber and make sure that the account used to connect to the
subscription database is a member of the db_owner fixed database role.

After that, check the Replication Monitor again and make sure that the Distributor is able to reach the
subscription database and replicate the changes.
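
The role membership can also be granted with T-SQL. The sketch below uses hypothetical names
(SubscriptionDB for the subscription database and DOMAIN\repl_distribution for the account the
Distribution Agent connects with) and assumes SQL Server 2012 or later for ALTER ROLE ... ADD MEMBER.

```sql
-- Run on the Subscriber. Database and account names are placeholders.
USE [SubscriptionDB];
GO
-- Create the database user if the login is not yet mapped into this database.
IF NOT EXISTS (SELECT 1 FROM sys.database_principals
               WHERE name = N'DOMAIN\repl_distribution')
    CREATE USER [DOMAIN\repl_distribution] FOR LOGIN [DOMAIN\repl_distribution];
GO
ALTER ROLE db_owner ADD MEMBER [DOMAIN\repl_distribution];
-- On versions before SQL Server 2012, use instead:
-- EXEC sp_addrolemember N'db_owner', N'DOMAIN\repl_distribution';
GO
-- Verify: returns 1 when the user is a member of db_owner.
SELECT IS_ROLEMEMBER(N'db_owner', N'DOMAIN\repl_distribution') AS is_db_owner;
```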

Data Difference Issue

Assume that one of the database development teams reports that some changes performed on the Shifts
table on the Publisher (SQL1) are not reflected in the daily reports that run on the Subscriber instance
(SQL2), and provides a screenshot showing that the changes are not replicated.

The first step in checking a replication synchronization issue is to open the Replication Monitor and
find the step at which it is failing. From the Replication Monitor, you can see that the Log Reader Agent
is failing, so the changes never reach the distribution database and are therefore not replicated to the
Subscriber, but no clear message is returned from that agent.

As we cannot find a meaningful error message in the Replication Monitor, we check the history of the Log
Reader Agent job using the Job Activity Monitor. It shows that the credentials for the account under
which the Log Reader Agent is running are incorrect.

To fix the Log Reader Agent credentials issue, browse to the Agent Security page of the Publication
Properties window and replace the Log Reader Agent credentials with valid ones.

Checking the Replication Monitor again, you will see that the changes are replicated successfully and
that the data is updated with the new shift changes.

Row Not Found at Subscriber

Let us look at the issue from another side. Say a change is performed on the Shifts table at the
Publisher, but this change is not replicated to the Subscriber and the overall SQL Server Replication
site fails. From the Replication Monitor, you can see that it fails while trying to apply the change from
the Distributor to the Subscriber, because it cannot update the record with ID equal to 3: this record
does not exist in the Subscriber database table.

Checking for that record at the Subscriber side (SQL2), you will see that it is indeed missing. To
overcome this issue, insert that record again into the Subscriber database table and let the Distribution
Agent retry the update, which fixes the replication synchronization failure.

SQL Server also provides an option to let the replication site continue working even when a data
inconsistency is found, so that you can fix the inconsistency manually later. To do so, right-click the
Subscriber in the Replication Monitor and choose the Agent Profile option.

In the displayed window, you can switch the Distribution Agent to a profile that allows it to continue
replicating data changes when a data inconsistency is encountered.
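
The built-in Distribution Agent profile that does this is named Continue on data consistency errors. It
works by setting the -SkipErrors parameter to 2601:2627:20598 (duplicate key and row-not-found errors).
The query below is a sketch for listing which Distribution Agent profiles carry a -SkipErrors setting; it
assumes the profile tables are in msdb on the Distributor.

```sql
-- Run on the Distributor. agent_type 3 = Distribution Agent.
SELECT p.profile_id,
       p.profile_name,
       p.def_profile,          -- 1 = default profile for this agent type
       prm.parameter_name,
       prm.value
FROM msdb.dbo.MSagent_profiles   AS p
JOIN msdb.dbo.MSagent_parameters AS prm ON prm.profile_id = p.profile_id
WHERE p.agent_type = 3
  AND prm.parameter_name LIKE N'%SkipErrors';
```

Skipped errors mean skipped rows, so the Subscriber still has to be reconciled with the Publisher later.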

Uninitialized Subscription Issue

If the replication site is left unmonitored for a long time and a failure stays unfixed for more than
three days (the default maximum distribution retention period is 72 hours), the replication site will
expire and the Subscription will be marked as uninitialized, waiting to be reinitialized using a new
snapshot. The same state is seen when creating a new Subscription without initializing it.

To fix that issue, reinitialize the Subscription: expand the Publication under the Replication -> Local
Publications node, right-click the Subscription, and choose the Reinitialize option. This marks the
Subscription for initialization and makes it ready to receive a new snapshot.

If the Subscription status stays Uninitialized after reinitializing it, check the Snapshot Agent job using
the Job Activity Monitor window to see why it is failing. In this scenario, the Snapshot Agent job history
shows that the job failed due to an issue determining the owner of that agent job.

To overcome this issue, open the Snapshot Agent job and change the owner of the job to SA or any valid
administrator user, and the job will run successfully.
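
The job owner can also be inspected and changed with T-SQL. In this sketch the job name is a
placeholder; take the real name from the first query's output.

```sql
-- Run on the instance hosting the agent jobs. A NULL owner_login means the owner
-- SID can no longer be resolved to a login.
SELECT j.name                   AS job_name,
       SUSER_SNAME(j.owner_sid) AS owner_login
FROM msdb.dbo.sysjobs       AS j
JOIN msdb.dbo.syscategories AS c ON c.category_id = j.category_id
WHERE c.name LIKE N'REPL-%';

-- Reassign the owner of one job (job name below is hypothetical).
EXEC msdb.dbo.sp_update_job
     @job_name         = N'SQL1-PublicationDB-MainPub-1',
     @owner_login_name = N'sa';
```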

Now you will see that the Subscription status has changed to Running, meaning that it is waiting for the
initial snapshot to start the synchronization process.

To generate a new snapshot, right-click the Publication under the Replication -> Local Publications node
and select the View Snapshot Agent Status option. In the opened window, click the Start button to start
the snapshot creation process. When the snapshot that contains all the Publisher articles has been created
successfully, open the Replication Monitor again and check the status of the Subscription. You will see
that the snapshot is applied to the Subscriber and synchronized with the Publisher.
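
The reinitialize and snapshot steps have T-SQL equivalents. The sketch below assumes a push subscription
and uses placeholder names (PublicationDB, MainPub, SubscriptionDB); SQL2 is the Subscriber from the
scenario.

```sql
-- Run at the Publisher, in the publication database. Names are placeholders.
USE [PublicationDB];
GO
-- Mark the subscription for reinitialization.
EXEC sp_reinitsubscription
     @publication    = N'MainPub',
     @subscriber     = N'SQL2',
     @destination_db = N'SubscriptionDB';
GO
-- Start the Snapshot Agent job so that a new snapshot is generated.
EXEC sp_startpublication_snapshot
     @publication = N'MainPub';
GO
```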

Publisher Database Owner Issue

Assume also that, when checking the status of the SQL Server Replication site using the Replication
Monitor, the replication site has failed and the failure is detected at the Log Reader Agent. The error
message returned from that agent indicates an issue determining the current owner of the publication
database.

To fix that issue, update the publication database owner, replacing it with a valid login, using the
sp_changedbowner system stored procedure or simply from the database properties window. After that, run
the Log Reader Agent job again using the Job Activity Monitor window, then verify in the Replication
Monitor that the agent issue no longer appears.
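
As a sketch, with PublicationDB as a placeholder database name and sa as an example of a valid login,
the owner can be checked and changed as follows.

```sql
-- Run at the Publisher. A NULL owner_login means the owner SID does not resolve to a login.
SELECT name,
       SUSER_SNAME(owner_sid) AS owner_login
FROM sys.databases
WHERE name = N'PublicationDB';

-- Current syntax for changing the database owner.
ALTER AUTHORIZATION ON DATABASE::[PublicationDB] TO [sa];

-- Older equivalent, as mentioned in the text:
-- USE [PublicationDB];
-- EXEC sp_changedbowner @loginame = N'sa';
```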

Conclusion

In this article, we demonstrated different issues that you may face while using the SQL Server
Replication feature to copy data between different sites, and how to fix them.

It is highly recommended to keep the SQL Server Engine up to date with the latest SPs and CUs, so that
known bugs related to the SQL Server Replication features are fixed. Lastly, as a proactive SQL Server
database administrator, keep an eye on your replication site so that you can fix any issue early, before
it becomes larger and harder to fix.

Troubleshooting Transactional Replication in SQL Server
4th February 2016 By John McCormack

This might make me the odd one out but I actually really like replication. It took me a while
to get comfortable with it but when I did and when I learned how to troubleshoot
transactional replication confidently, I became a fan. Since I exclusively use transactional
replication and not snapshot replication or merge replication, this post is only about
transactional replication and in particular, how to troubleshoot transactional replication
errors.

In the production system I work on, replication is highly reliable and rarely if ever causes
the DBAs headaches. It can be less so in our plethora of dev and QA boxes, probably
down to the rate of change in these environments with regular refreshes. Due to this, I’ve had
to fix it many times. As I explain how I troubleshoot replication errors, I assume you know
the basics of how replication works. If you don’t, a really good place to start is Books
Online. It describes how replication uses a publishing metaphor and describes all the
component parts in detail.

If you don’t currently use replication and you want to set up for the first time, I recommend
having a playground instance, not related to any of your production, QA or dev instances
that you can play about with and try things out. As I say in a previous post, “This is an
ideal place to set up mirroring, replication and Always On Availability Groups.”

Look for errors here first


Replication errors are logged to the table MSRepl_errors in the distribution database.  Run
the query below to see what your problems are and start at the most recent.
[sql]

-- Most recent replication errors first
SELECT *
FROM distribution.dbo.MSrepl_errors
ORDER BY id DESC

[/sql]

If you have any errors, your output will be similar to below:

Regular Errors

Error code: 21074: The subscription(s) have been marked inactive. If this occurs, you
need to reinitialize the subscription. For beginners, you want to do this in SSMS. Expand
Replication -> Local publications. Then either right click on the publication name and
choose Reinitialize All Subscriptions or expand the publication and right click on the
subscription you want to reinitialize and choose Reinitialize. This can also be done in
replication monitor.

Error Code: 3729: Cannot DROP TABLE ‘dbo.X’ because it is being referenced by an
object ‘vw_x’ (This occurs when a table you are trying to drop and recreate via
reinitializing is referenced by a schema-bound view). This indexed view is getting in the way.
In a non-production environment, I would script out the view and indexes, drop them, then fix
replication and recreate the indexed view. If this happens in production, I’m open to
suggestions. You wouldn’t normally want to drop and recreate an indexed view without
having a solid understanding of the consequences.

Error Code: 1205: Message: Transaction (Process ID x) was deadlocked on lock
resources with another process… (Here you have the deadlock victim – unable to
complete because of the said deadlock). This is more likely when setting up replication
during an environment refresh and can be easily sorted by reinitializing.

I can’t find a reliable list of all of these errors. If someone could provide one in the
comments, I can edit the post.

XACT_SEQNO
You may have noticed the XACT_SEQNO column in the msrepl_errors table. This might
appear when you have an error such as ‘row not found at subscriber’. From the results
pane, copy the XACT_SEQNO value in question and add it to the query below.

[sql]

-- Find the article(s) touched by the failing transaction
SELECT *
FROM distribution.dbo.MSarticles
WHERE article_id IN
(
    SELECT article_id
    FROM distribution.dbo.MSrepl_commands
    WHERE xact_seqno = 0x000002B9000430B6000E00000000
)

[/sql]

To find which publication the article belongs to, run the query below in the publication
database on the Publisher, replacing the value of dest_table in the WHERE clause with the
article name from the query above:

[sql]

-- Run in the publication database on the Publisher
SELECT sp.name, sa.*
FROM sysarticles sa
JOIN syspublications sp
ON sa.pubid = sp.pubid
WHERE sa.dest_table = '<Name of table here>'

[/sql]
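
If you also want to see the actual commands inside the failing transaction, sp_browsereplcmds can decode
them. This is a sketch that reuses the example XACT_SEQNO from above; substitute the value from your own
msrepl_errors row.

```sql
-- Run on the Distributor, in the distribution database.
USE distribution;
GO
-- Using the same value for start and end returns only that one transaction.
EXEC dbo.sp_browsereplcmds
     @xact_seqno_start = '0x000002B9000430B6000E00000000',
     @xact_seqno_end   = '0x000002B9000430B6000E00000000';
```

Always pass a narrow range. Without filters the procedure tries to return every command in the
distribution database, which can be very slow on a busy system.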

Replication monitor – Green is good


It’s a great feeling when you can say, “replication is green now” after working on a
problem. Replication monitor gives you a nice visual interface and immediately tells you if
something is up by displaying a red X. There is also a warning icon, a yellow exclamation
mark or bang as I’ve heard it called from American DBAs. The exclamation mark is usually
an indicator of latency and can even be caused by the overhead of running replication
monitor so I wouldn’t worry too much about these, especially if they come and go with any
frequency in replication monitor.

Replication monitor also provides the visual interface for reinitializing subscriptions as well
as other useful features like tracer tokens.

Being proactive
Alerts
First of all, you want a process alerting you to errors in your replication set up. Whether
this be your own custom solution like an SQL agent job which regularly polls
Distribution..MSRepl_Errors or a 3rd party monitoring solution, you should have something
set up to alert you immediately to errors.
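
As a rough sketch of the custom approach, the step below could be the body of a SQL Agent job scheduled
every 15 minutes. The mail profile name and recipient address are placeholders, and it assumes Database
Mail is already configured.

```sql
-- Run on the Distributor. Sends one mail if any replication error was logged recently.
DECLARE @since datetime = DATEADD(MINUTE, -15, GETDATE());

IF EXISTS (SELECT 1
           FROM distribution.dbo.MSrepl_errors
           WHERE [time] >= @since)
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
         @profile_name = N'DBA Mail',                 -- placeholder profile name
         @recipients   = N'dba-team@example.com',     -- placeholder address
         @subject      = N'Replication errors logged in the last 15 minutes',
         @query        = N'SELECT TOP (20) id, [time], error_code, error_text
                           FROM distribution.dbo.MSrepl_errors
                           WHERE [time] >= DATEADD(MINUTE, -15, GETDATE())
                           ORDER BY id DESC;';
END
```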

Tracer tokens – are they any good?


Tracer tokens can be good for confirming everything is running well when they succeed.
They provide you with information on the full latency of a test transaction. The trouble is
when they don’t succeed. They just hang around and don’t report back until they have
ultimately failed (which won’t be anytime soon). So you can use tracer tokens with caution
but there are other ways. I read a good post by Kendra Little on Monitoring SQL Server
Transactional Replication and she describes the use of canary tables. I won’t explain it all
here as it would only be repeating someone else’s work but take a look and try them out in
your own dev environment.

Comparing row counts as a method of checking


Sometimes, replication can break or just stall and not report any errors. In this scenario,
we need another solution for alerting the DBAs to any potential problems. This can be
achieved by checking the row count on tables involved in replication at different time
intervals. You would take a count and then take another count, say 5 or 10 minutes later,
and compare the results. If the row difference per article has stayed the same or
increased, it can be a sign of an error. It’s not definitive but it can be an effective double
check.
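
One cheap way to take the counts is from partition metadata rather than COUNT(*). This is only a
sketch: the table names are placeholders, and you would run the same query in the publication database
and the subscription database, store the results, and compare them.

```sql
-- Approximate row counts for a list of replicated tables (names are hypothetical).
SELECT s.name           AS schema_name,
       t.name           AS table_name,
       SUM(p.row_count) AS row_count,
       GETDATE()        AS captured_at
FROM sys.dm_db_partition_stats AS p
JOIN sys.tables  AS t ON t.object_id = p.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE p.index_id IN (0, 1)                 -- heap or clustered index only, avoids double counting
  AND t.name IN (N'Orders', N'Customers')  -- placeholder table list
GROUP BY s.name, t.name;
```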

A View Into Replication Health


Replication Monitor is the primary GUI tool at your disposal for viewing replication performance and
diagnosing problems. Replication Monitor was included in Enterprise Manager in SQL Server 2000,
but in SQL Server 2005, Replication Monitor was separated from SQL Server Management Studio
(SSMS) into a standalone executable. Just like SSMS, Replication Monitor can be used to monitor
Publishers, Subscribers, and Distributors running previous versions of SQL Server, although features
not present in SQL Server 2005 won’t be displayed or otherwise available for use.

To launch Replication Monitor, open SSMS, connect to a Publisher in the Object Explorer, right-click
the Replication folder, and choose Launch Replication Monitor from the context menu. Figure 1
shows Replication Monitor with several registered Publishers added. Replication Monitor displays a
tree view in the left pane that lists Publishers that have been registered; the right pane’s contents
change depending on what’s selected in the tree view.

Figure 1: Replication Monitor with Registered Publishers Added

Selecting a Publisher in the tree view shows three tabbed views in the right pane: Publications, which
shows the name, current status, and number of Subscribers for each publication on the Publisher;
Subscription Watch List, which shows the status and estimated latency (i.e., time to deliver pending
commands) of all Subscriptions to the Publisher; and Agents, which shows the last start time and
current status of the Snapshot, Log Reader, and Queue Reader agents, as well as various automated
maintenance jobs created by SQL Server to keep replication healthy.

Expanding a Publisher node in the tree view shows its publications. Selecting a publication displays
four tabbed views in the right pane: All Subscriptions, which shows the current status and estimated
latency of the Distribution Agent for each Subscription; Tracer Tokens, which shows the status of
recent tracer tokens for the publication (I’ll discuss tracer tokens in more detail later); Agents, which
shows the last start time, run duration, and current status of the Snapshot and Log Reader agents
used by the publication; and Warnings, which shows the settings for all warnings that have been
configured for the publication.

Right-clicking any row (i.e., agent) in the Subscription Watch List, All Subscriptions, or Agents tabs
will display a context menu with options that include stopping and starting the agent, viewing the
agent’s profile, and viewing the agent’s job properties. Double-clicking an agent will open a new
window that shows specific details about the agent’s status.

Distribution Agent windows have three tabs: Publisher to Distributor History, which shows the status
and recent history of the Log Reader agent for the publication; Distributor to Subscriber History,
which shows the status and recent history of the Distribution Agent; and Undistributed Commands,
which shows the number of commands at the distribution database waiting to be applied to the
Subscriber and an estimate of how long it will take to apply them. Log Reader and Snapshot
Agent windows show only an Agent History tab, which displays the status and recent history of that
agent.
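
The figures on the Undistributed Commands tab can also be retrieved with T-SQL, which is handy for
scripted checks. In this sketch all server, database, and publication names are placeholders.

```sql
-- Run on the Distributor, in the distribution database.
USE distribution;
GO
-- Returns the pending command count and an estimated time to apply the commands.
EXEC sp_replmonitorsubscriptionpendingcmds
     @publisher         = N'SQL1',
     @publisher_db      = N'PublicationDB',
     @publication       = N'MainPub',
     @subscriber        = N'SQL2',
     @subscriber_db     = N'SubscriptionDB',
     @subscription_type = 0;   -- 0 = push, 1 = pull
```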

When a problem occurs with replication, such as when a Distribution Agent fails, the icons for the
Publisher, Publication, and agent will change depending on the type of problem. Icons overlaid by a
red circle with an X indicate an agent has failed, a white circle with a circular arrow indicates an agent
is retrying a command, and a yellow caution symbol indicates a warning. Identifying the problematic
agent is simply a matter of expanding in the tree view the Publishers and Publications that are
alerting to a condition, selecting the tabs in the right pane for the agent(s) with a problem, and
double-clicking the agent to view its status and information about the error.

Measuring the Flow of Data

Understanding how long it takes for data to move through each step is especially useful when
troubleshooting latency issues and will let you focus your attention on the specific segment that’s
problematic. Tracer tokens were added in SQL Server 2005 to measure the flow of data and actual
latency from a Publisher all the way through to Subscribers (the latency values shown for agents in
Replication Monitor are estimated). Creating a tracer token writes a special marker to the transaction
log of the Publication database that’s read by the Log Reader agent, written to the distribution
database, and sent through to all Subscribers. The time it takes for the token to move through each
step is saved in the Distribution database.

Tracer tokens can be used only if both the Publisher and Distributor are on SQL Server 2005 or later.
Subscriber statistics will be collected for push subscriptions if the Subscriber is running SQL Server
7.0 or later and for pull subscriptions if the Subscriber is running SQL Server 2005 or higher. For
Subscribers that don’t meet these criteria (non-SQL Server Subscribers, for example), statistics for
tracer tokens will still be gathered from the Publisher and Distributor. To add a tracer token, you
must be a member of the sysadmin fixed server role or db_owner fixed database role on the Publisher.

To add a new tracer token or view the status of existing tracer tokens, navigate to the Tracer Tokens
tab in Replication Monitor. Figure 2 shows an example of the Tracer Tokens tab showing latency
details for a previously inserted token. To add a new token, click Insert Tracer. Details for existing
tokens can be viewed by selecting from the drop-down list on the right.

Figure 2: Tracer Tokens Tab Showing Latency Details for a Token
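
Tracer tokens can also be posted and read with T-SQL. The following is a minimal sketch; the database
and publication names are placeholders, and the 10-second wait is an arbitrary choice that might be too
short on a system with real latency.

```sql
-- Run at the Publisher, in the publication database.
USE [PublicationDB];
GO
DECLARE @token_id int;

-- Write a tracer token to the publication database's transaction log.
EXEC sys.sp_posttracertoken
     @publication     = N'MainPub',
     @tracer_token_id = @token_id OUTPUT;

-- Give the Log Reader and Distribution Agents time to move the token.
WAITFOR DELAY '00:00:10';

-- Latency per Subscriber; NULL latency values mean the token has not arrived yet.
EXEC sys.sp_helptracertokenhistory
     @publication = N'MainPub',
     @tracer_id   = @token_id;
```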

Know When There Are Problems

Although Replication Monitor is useful for viewing replication health, it’s not likely (or even
reasonable) that you’ll keep it open all the time waiting for an error to occur. After all, as a busy DBA
you have more to do than watch a screen all day, and at some point you have to leave your desk.

However, SQL Server can be configured to raise alerts when specific replication problems occur.
When a Distributor is initially set up, a default group of alerts for replication-related events is created.
To view the list of alerts, open SSMS and make a connection to the Distributor in Object Explorer,
then expand the SQL Server Agent and Alerts nodes in the tree view. To view or configure an alert,
open the Alert properties window by double-clicking the alert or right-click the alert and choose the
Properties option from the context menu. Alternatively, alerts can be configured in Replication
Monitor by selecting a Publication in the left pane, viewing the Warnings tab in the right pane, and
clicking the Configure Alerts button. The options the Alert properties window offers for response
actions, notification, and so on are the same as an alert for a SQL Server agent job. Figure 3 shows an
example of the Warnings tab in Replication Monitor.

Figure 3: Replication Monitor’s Warnings Tab

There are three alerts that are of specific interest for transactional replication: Replication: Agent
failure; Replication: Agent retry; and Replication Warning: Transactional replication latency
(Threshold: latency). By default, only the latency threshold alerts are enabled (but aren’t configured to
notify an operator). The thresholds for latency alerts are configured in the Warnings tab for a
Publication in Replication Monitor. These thresholds will trigger an alert if exceeded and are used by
Replication Monitor to determine if an alert icon is displayed on the screen. In most cases, the default
values for latency alerts are sufficient, but you should review them to make sure they meet the SLAs
and SLEs you’re responsible for.

A typical replication alert response is to send a notification (e.g., an email message) to a member of
the DBA team. Because email alerts rely on Database Mail, you’ll need to configure that first if you
haven’t done so already. Also, to avoid getting inundated with alerts, you’ll want to change the delay
between responses to five minutes or more. Finally, be sure to enable the alert on the General page of
the Alert properties window.
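
The same settings can be scripted. This sketch assumes the default alert name Replication: agent failure
and an existing operator called DBA Team (a placeholder); list the alerts first to confirm the exact
names on your Distributor.

```sql
-- Run on the Distributor.
-- List the replication alerts and their current settings.
SELECT name, enabled, delay_between_responses
FROM msdb.dbo.sysalerts
WHERE name LIKE N'Replication%';

-- Enable one alert and wait at least five minutes (300 seconds) between responses.
EXEC msdb.dbo.sp_update_alert
     @name                    = N'Replication: agent failure',
     @enabled                 = 1,
     @delay_between_responses = 300;

-- Notify an operator by email (notification method 1 = email).
EXEC msdb.dbo.sp_add_notification
     @alert_name          = N'Replication: agent failure',
     @operator_name       = N'DBA Team',   -- placeholder operator
     @notification_method = 1;
```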

Changes to alerts are applied to the Distributor and affect all Publishers that use the Distributor.
Changes to alert thresholds are applied only to the selected Publication and can’t be applied on a
Subscriber-by-Subscriber basis.

Other Potential Problems to Keep an Eye On

Two other problems can creep up that neither alerts nor Replication Monitor will bring to your
attention: agents that are stopped, and unchecked growth of the distribution database on the
Distributor.

A common configuration option is to run agents continuously (or Start automatically when SQL
Server Agent starts). Occasionally, they might need to be stopped, but if they aren’t restarted, you can
end up with transactions that accumulate at the Distributor waiting to be applied to the Subscriber or,
if the log reader agent was stopped, transaction log growth at the Publisher. The estimated latency
values displayed in Replication Monitor are based on current performance if the agent is running, or
the agent’s most recent history if it’s stopped. If the agent was below the latency alert threshold at the
time it was stopped, then a latency alert won’t be triggered and Replication Monitor won’t show an
alert icon.
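
Because neither alerts nor Replication Monitor will flag a stopped agent, it can help to query SQL Server Agent directly. The following is a sketch only, assuming the agents run as SQL Server Agent jobs in the standard REPL-LogReader and REPL-Distribution job categories. It lists every such job that isn't executing, so agents that run on a schedule rather than continuously will also appear:

```sql
-- Run on the Distributor (and on Subscribers that host pull agents).
-- Lists enabled replication agent jobs with no execution in progress
-- in the current SQL Server Agent session.
SELECT j.name AS job_name,
       c.name AS agent_category
FROM msdb.dbo.sysjobs AS j
INNER JOIN msdb.dbo.syscategories AS c
    ON c.category_id = j.category_id
WHERE c.name IN (N'REPL-LogReader', N'REPL-Distribution')
  AND j.enabled = 1
  AND NOT EXISTS (
        SELECT 1
        FROM msdb.dbo.sysjobactivity AS a
        WHERE a.job_id = j.job_id
          AND a.session_id = (SELECT MAX(s.session_id)
                              FROM msdb.dbo.syssessions AS s)
          AND a.start_execution_date IS NOT NULL
          AND a.stop_execution_date IS NULL
      )
ORDER BY c.name, j.name;
```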

The dbo.Admin_Start_Idle_Repl_Agents stored procedure in Web Listing 1 can be applied to the
Distributor (and Subscribers with pull subscriptions) and used to restart replication agents that are
scheduled to run continuously but aren’t currently running. Scheduling this procedure to run
periodically (e.g., every six hours) will prevent idle agents from turning into bigger problems.

Unchecked growth of the distribution database on the Distributor can still occur when all agents are
running. Once commands have been delivered to all Subscribers, they need to be removed to free
space for new commands. When the Distributor is initially set up, a SQL Server Agent job
named Distribution clean up: distribution is created to remove commands that have been delivered
to all Subscribers. If the job is disabled or isn’t running properly (e.g., is blocked), commands won’t be
removed and the distribution database will grow. Reviewing this job’s history and the size of the
distribution database for every Distributor should be part of a DBA’s daily checklist.
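
Both checks can be scripted for the daily checklist. The following is a minimal sketch, assuming the distribution database uses the default name distribution (which is also what gives the cleanup job the name shown above):

```sql
-- Run at the Distributor.
-- Is the cleanup job enabled, and how did its most recent runs end?
SELECT TOP (10)
       j.name,
       j.enabled,
       h.run_date,      -- int, YYYYMMDD
       h.run_time,      -- int, HHMMSS
       h.run_status,    -- 0 = failed, 1 = succeeded, 2 = retry, 3 = canceled
       h.message
FROM msdb.dbo.sysjobs AS j
LEFT JOIN msdb.dbo.sysjobhistory AS h
    ON h.job_id = j.job_id
   AND h.step_id = 0    -- step 0 rows hold the overall job outcome
WHERE j.name = N'Distribution clean up: distribution'
ORDER BY h.run_date DESC, h.run_time DESC;

-- Current size of the distribution database files, in MB.
SELECT name,
       type_desc,
       size * 8 / 1024 AS size_mb   -- size is stored in 8 KB pages
FROM distribution.sys.database_files;
```

Recording the file sizes each day makes it easier to spot growth that the cleanup job isn't keeping up with.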

Common Problems and Solutions

Now that you have the tools in place to monitor performance and know when problems occur, let’s
take a look at three common transactional replication problems and how to fix them.

Distribution Agents fail with the error message The row was not found at the
Subscriber when applying the replicated command or Violation of PRIMARY KEY
constraint [Primary Key Name]. Cannot insert duplicate key in object [Object Name].

Cause: By default, replication delivers commands to Subscribers one row at a time (but as part of a
batch wrapped by a transaction) and uses @@rowcount to verify that only one row was affected. The
primary key is used to check for which row needs to be inserted, updated, or deleted; for inserts, if a
row with the primary key already exists at the Subscriber, the command will fail because of a primary
key constraint violation. For updates or deletes, if no matching primary key exists, @@rowcount
returns 0 and an error will be raised that causes the Distribution Agent to fail.
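
The row count check can be pictured with a simplified example. This is not the code that replication generates; it is an illustrative sketch in which the table dbo.Customer, the procedure name, and the parameters are all hypothetical:

```sql
-- Illustrative only: mimics how a replicated UPDATE is applied by primary key
-- and fails when no matching row exists at the Subscriber.
CREATE PROCEDURE dbo.usp_Demo_ApplyCustomerUpdate
    @CustomerId int,
    @Name       nvarchar(100)
AS
BEGIN
    UPDATE dbo.Customer
    SET Name = @Name
    WHERE CustomerId = @CustomerId;   -- primary key identifies the single target row

    -- Exactly one row should have been affected; zero means the row is missing.
    IF @@ROWCOUNT = 0
        RAISERROR(N'The row was not found when applying the command.', 16, 1);
END;
```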

Solution: If you don’t care which command is failing, you can simply change the Distribution
Agent’s profile to ignore the errors. To change the profile, navigate to the Publication in Replication
Monitor, right-click the problematic Subscriber in the All Subscriptions tab, and choose the Agent
Profile menu option. A new window will open that lets you change the selected agent profile; select
the check box for the Continue on data consistency errors profile, and then click OK. Figure 4 shows
an example of the Agent Profile window with this profile selected. The Distribution Agent needs to be
restarted for the new profile to take effect; to do so, right-click the Subscriber and choose the Stop
Synchronizing menu option. When the Subscriber’s status changes from Running to Not Running,
right-click the Subscriber again and select the Start Synchronizing menu option.

Figure 4: Continue on Data Consistency Errors Profile Selected in the Distribution Agent’s Profile

This profile is a system-created profile that will skip three specific errors: inserting a row with a
duplicate key, constraint violations, and rows missing from the Subscriber. If any of these errors occur
while using this profile, the Distribution Agent will move on to the next command rather than failing.
When choosing this profile, be aware that the data on the Subscriber is likely to become out of sync
with the Publisher.

If you want to know the specific command that’s failing, the sp_browsereplcmds stored procedure can
be executed at the Distributor. Three parameters are required: an ID for the Publisher database, a
transaction sequence number, and a command ID. To get the Publisher database ID, execute the code
in Listing 1 on your Distributor (filling in the appropriate values for Publisher, Subscriber, and
Publication).

-- Listing 1: Find the Publisher database ID (run at the Distributor)
SELECT DISTINCT
    subscriptions.publisher_database_id
FROM sys.servers AS [publishers]
INNER JOIN distribution.dbo.MSpublications AS [publications]
    ON publishers.server_id = publications.publisher_id
INNER JOIN distribution.dbo.MSarticles AS [articles]
    ON publications.publication_id = articles.publication_id
INNER JOIN distribution.dbo.MSsubscriptions AS [subscriptions]
    ON articles.article_id = subscriptions.article_id
    AND articles.publication_id = subscriptions.publication_id
    AND articles.publisher_db = subscriptions.publisher_db
    AND articles.publisher_id = subscriptions.publisher_id
INNER JOIN sys.servers AS [subscribers]
    ON subscriptions.subscriber_id = subscribers.server_id
WHERE publishers.name = 'MyPublisher'           -- replace with your Publisher
    AND publications.publication = 'MyPublication'  -- replace with your Publication
    AND subscribers.name = 'MySubscriber'       -- replace with your Subscriber

To get the transaction sequence number and command ID, navigate to the failing agent in Replication
Monitor, open its status window, select the Distributor to Subscriber History tab, and select the most
recent session with an Error status. The transaction sequence number and command ID are contained
in the error details message. Figure 5 shows an example of an error message containing these two
values.

Figure 5: An Error Message Containing the Transaction Sequence Number and Command ID

Finally, execute the code in Listing 2 using the values you just retrieved to show the command that’s
failing at the Subscriber. Once you know the command that’s failing, you can make changes at the
Subscriber for the command to apply successfully.

-- Listing 2: Show the failing command (run at the Distributor)
EXECUTE distribution.dbo.sp_browsereplcmds
    @xact_seqno_start = '0x0000001900001926000800000000',
    @xact_seqno_end = '0x0000001900001926000800000000',
    @publisher_database_id = 29,
    @command_id = 1

Distribution Agent fails with the error message Could not find stored procedure
'sp_MSins_'.

Cause: The Publication is configured to deliver INSERT, UPDATE, and DELETE commands using
stored procedures, and the procedures have been dropped from the Subscriber. Replication stored
procedures aren’t considered to be system stored procedures and can be included using schema
comparison tools. If the tools are used to move changes from a non-replicated version of a Subscriber
database to a replicated version (e.g., migrating schema changes from a local development
environment to a test environment), the procedures could be dropped because they don’t exist in the
non-replicated version.

Solution: This is an easy problem to fix. In the published database on the Publisher, execute the
sp_scriptpublicationcustomprocs stored procedure to generate the INSERT, UPDATE, and DELETE
stored procedures for the Publication. This procedure takes only one parameter, the name of the
Publication, and returns a single nvarchar(4000) column as the result set. When executing it in SSMS,
make sure to output results to text (press Ctrl+T or select Query, Results To, Results To
Text) and that the maximum number of characters for results to text is set to at least 8,000. You can
set this value by selecting Tools, Options, Query Results, Results to Text, Maximum number of
characters displayed in each column. After executing the stored procedure, copy the scripts that
were generated into a new query window and execute them in the subscribed database on the
Subscriber.
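
A minimal example of the call, assuming a Publication named MyPublication (a placeholder):

```sql
-- Run in the published database on the Publisher, with results sent to text.
-- The output is the script that re-creates the replication procedures;
-- run that output in the subscribed database on the Subscriber.
EXEC sp_scriptpublicationcustomprocs @publication = N'MyPublication';
```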

Distribution Agents won’t start or don’t appear to do anything.

Cause: This typically happens when a large number of Distribution Agents are running on the same
server at the same time; for example, on a Distributor that handles more than 50 Publications or
Subscriptions. Distribution Agents are independent executables that run outside of the SQL Server
process in a non-interactive fashion (i.e., no GUI). Windows Server uses a special area of memory
called the non-interactive desktop heap to run these kinds of processes. If Windows runs out of
available memory in this heap, Distribution Agents won’t be able to start.

Solution: Fixing the problem involves making a registry change to increase the size of the non-
interactive desktop heap on the server experiencing the problem (usually the Distributor) and
rebooting. However, it’s important to note that modifying the registry can result in serious problems if
it isn’t done correctly. Make sure to perform the following steps carefully and back up the registry
before you modify it:

1. Start the Registry Editor by typing regedt32.exe in a Run dialog box or command prompt.
2. Navigate to the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session
Manager\SubSystems key in the left pane.
3. In the right pane, double-click the Windows value to open the Edit String dialog box.
4. Locate the SharedSection parameter in the Value data input box. It has three values separated by
commas and should look like the following:

    SharedSection=1024,3072,512

The non-interactive desktop heap is the third value (512 in this example; the values are in kilobytes).
Increasing the value by 256 or 512 (i.e., making it a value of 768 or 1024) should be sufficient to resolve
the issue. Click OK after modifying the value. Rebooting will ensure that the new value is used by Windows.
For more information about the non-interactive desktop heap, see "Unexpected behavior occurs when you
run many processes on a computer that is running SQL Server."

Monitoring Your Replication Environment

When used together, Replication Monitor, tracer tokens, and alerts are a solid way for you to monitor
your replication topology and understand the source of problems when they occur. Although the
techniques outlined here offer guidance about how to resolve some of the more common issues that
occur with transactional replication, there simply isn’t enough room to cover all the known problems
in one article. For more tips about troubleshooting replication problems, visit the Microsoft SQL
Server Replication Support Team’s REPLTalk blog.

Monitoring SQL Server Transactional Replication

Tracer Tokens Aren’t Really Your Friend
“Tracer Tokens” were introduced in SQL Server 2005. They sound awfully good. Books Online explains that you can
automate them using sys.sp_posttracertoken and report on them using sp_helptracertokenhistory.

There’s a big problem: tracer tokens are too patient.

Let’s say my replication is incredibly overwhelmed and I send out a tracer token. I won’t hear back until it reaches its
destination or definitively fails. That could be a very, very long time. The fact that it’s potentially unknown means I
don’t want to rely heavily on it for monitoring.
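
For reference, here’s roughly what posting and checking a token looks like. This is just a sketch: MyPublication is a placeholder, and the 30-second wait is an arbitrary number I picked for illustration.

```sql
-- Run at the Publisher, in the publication database.
DECLARE @token_id int;

-- Post a token into the transaction log and capture its ID.
EXEC sys.sp_posttracertoken
    @publication = N'MyPublication',
    @tracer_token_id = @token_id OUTPUT;

-- Give the token some time to travel (arbitrary wait).
WAITFOR DELAY '00:00:30';

-- Latency columns stay NULL until the token has actually arrived.
EXEC sys.sp_helptracertokenhistory
    @publication = N'MyPublication',
    @tracer_id = @token_id;
```

When replication is badly behind, those latency columns can stay empty for a long time, which is exactly the patience problem described above.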

Don’t Rely Too Much on Replication Monitor (REPLMON.exe)
When replication is behind, it’s natural to turn to Replication Monitor. The first five links in “Monitoring Replication” in
Books Online point to it, after all.

Replication Monitor isn’t all bad. But don’t depend on it too much, either.

• Replication Monitor is a tool to help you answer the question “how are things doing right now?” It doesn’t
baseline or give the kind of historical info that your manager wants to see.
• Replication Monitor runs queries to count the number of undistributed commands. These can be slow and
resource intensive (particularly when things get backed up in the distributor).

I’ve personally seen some cases where running more than one instance of Replication Monitor while a publication
snapshot was being taken also caused blocking. Too many people checking to see “how much longer will this
take?” actually caused things to take longer. It’s not just me: Microsoft recommends you avoid running multiple
instances of Replication Monitor.

ReplMon protip: You can disable automatic refreshing for the Replication Monitor UI, and just refresh the data when
you need it. More info in Books Online here. (Thanks to John Samson for this tip.)

Replication Monitor is useful, but you’re better off if people can get information on replication health without
everyone having to run Replmon. You can do this fairly easily by using simpler tools to create dashboards to chart
replication latency.

Easy Replication Monitoring: Alert on Latency with Canary Tables
It’s easy to build your own system for tracking replication latency for each publication. Here are the ingredients for
the simplest version:

• Add a table named dbo.Canary_PubName to each publication
• dbo.Canary_PubName has a single row with a datetime column in it
• A SQL Server Agent job on the publisher updates the datetime to the current timestamp every minute
• A SQL Server Agent job on the subscriber checks dbo.Canary_PubName every minute and alerts if the
difference between the current time and the timestamp is greater than N minutes

It’s easy to extend this to a basic dashboard using a third party monitoring tool or SQL Server Reporting
Services: you simply poll all the dbo.Canary tables and report on the number of minutes of latency on each server.

This simple process gets around the weaknesses of tracer tokens, and also gives you immediate insight into how
much latency you have on each subscriber. Bonus: this exact same technique also works well with log shipping and
AlwaysOn Availability Groups. Tastes great, less filling.
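
Here’s a rough sketch of the ingredients in T-SQL. The column names, the constraint name, and the five-minute threshold are all my own placeholders, and I’m using UTC and assuming the publisher and subscriber clocks are reasonably in sync.

```sql
-- On the publisher: one-row canary table, added as an article to the publication.
-- Replicated tables need a primary key, hence the fixed ID column.
CREATE TABLE dbo.Canary_PubName (
    canary_id   int      NOT NULL CONSTRAINT PK_Canary_PubName PRIMARY KEY,
    last_update datetime NOT NULL
);
INSERT INTO dbo.Canary_PubName (canary_id, last_update)
VALUES (1, GETUTCDATE());
GO

-- Publisher job step, scheduled every minute.
UPDATE dbo.Canary_PubName
SET last_update = GETUTCDATE()
WHERE canary_id = 1;
GO

-- Subscriber job step, scheduled every minute.
DECLARE @max_minutes int = 5;      -- placeholder for "N minutes"
DECLARE @latency_minutes int;

SELECT @latency_minutes = DATEDIFF(MINUTE, last_update, GETUTCDATE())
FROM dbo.Canary_PubName;

IF @latency_minutes IS NULL
    RAISERROR(N'Replication canary row is missing.', 16, 1);
ELSE IF @latency_minutes > @max_minutes
    RAISERROR(N'Replication canary is stale: %d minutes behind.', 16, 1, @latency_minutes);
```

The error fails the job step, so the job’s own failure notification can do the alerting.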

Medium Replication Monitoring: Notify when Undistributed Commands Rise in the Distribution Database
The distribution database is a special place for Transactional Replication. The log reader agent pulls information on
what’s changed from the transaction log of the publication database and translates it into commands that hang out
in the distribution database before the changes go out to subscribers.

If you have a lot of data modification occurring on the publisher, you can get a big backlog of commands in the
distribution database.

If replication performance is important, set up a SQL Server Agent job on your distribution server to regularly check
the number of undistributed commands with a script like the one Robert Davis provides here. Have it alert you when the
commands go above a given threshold.
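
If you just want something quick to start with, here’s a sketch built on the dbo.MSdistribution_status view in the distribution database. The threshold is a made-up starting point; baseline your own environment before you trust it.

```sql
-- Run in the distribution database on the distributor.
DECLARE @threshold int = 500000;   -- placeholder; set from your own baseline

-- Undelivered commands per distribution agent, above the threshold only.
SELECT da.name AS agent_name,
       da.publication,
       SUM(ds.UndelivCmdsInDistDB) AS undelivered_commands
FROM dbo.MSdistribution_status AS ds
INNER JOIN dbo.MSdistribution_agents AS da
    ON da.id = ds.agent_id
GROUP BY da.name, da.publication
HAVING SUM(ds.UndelivCmdsInDistDB) > @threshold
ORDER BY undelivered_commands DESC;

-- Fail the job step (and trigger its notification) if anything was returned.
IF @@ROWCOUNT > 0
    RAISERROR(N'Undistributed commands are above the threshold.', 16, 1);
```

This view can be slow when the distributor is already backed up, so don’t schedule it too aggressively.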

Real world example: When I was the DBA for an environment with mission-critical replication, we would warn when
undistributed commands rose above 500K and create a severity-1 ticket when they rose above 1 million. We did this
after setting up dashboards to baseline replication latency and also baselining the amount of undistributed
commands in distribution, so that we knew what our infrastructure could recover from and what might need DBA
attention to recover in time.

Difficult Replication Monitoring: Alert When Individual Articles are Unhealthy
Here’s where things get tricky. It’s very difficult to prove that all articles in replication are healthy. The steps up to
this point have tracked latency for the entire publication and bottlenecks in the distribution database. Things get
pretty custom if you need to prove that individual tables are all up to date.

I once had a situation where a code release removed some articles from replication, modified the tables and data
significantly, then re-added the articles to replication.

There was an issue with the scripts and one of the articles didn’t get put back into replication properly at the end of
the process. Replication was working just fine. No script had explicitly dropped the table from the subscriber, so it
just hung out there with stale data. The problem wasn’t discovered for a few days, and it was a bit difficult to track
down. Unfortunately, the next week was kind of a downer because a lot of data had to be re-processed after that
article was fixed.

Here’s what’s tricky: typically some articles change much more often than others. Monitoring individual articles
typically requires baselining “normal” latency per article, then writing custom code that checks each article against
the allowed latency. This is significantly more difficult for any large articles that don’t have a “Last Modified Date”
style column.

(Disclaimer: in the case that you don’t have a “Last Modified” date on your subscriber, I do not suggest layering
Change Tracking on top of the replication subscriber. If you are tempted to do that, first read my post on
Performance Tuning Change Tracking, then go through all the steps that you would do if you needed to re-initialize
replication or make schema changes on articles. You’ll change your mind by the end.)

Special Cases: The “Desktop Heap” is Used Up


This is a special case for replication. If you have a large number of replication agents on a single server (such as
200 or more), you may run into issues where things just silently stop working due to desktop heap exhaustion. This
is an issue that can be hard to identify because the agents just stop working!

Canary tables can help monitor for this, but you’ll need a lot of them since this can happen on an agent-by-agent
basis. Read more about fixing the desktop heap problem in replication in KB 949296. (Thanks to Michael Bourgon for
suggesting we include this.)

Test Your Monitoring out in Staging


The #1 mistake I find with transactional replication is ignoring the staging environment. This is critical to supporting
replication and creating effective monitoring for it.

The staging environment isn’t the same thing as development or QA. It’s a place where you have the same number
of SQL Server instances as production, and the same replication setup as production. You test changes against
staging before they go to production. You can also use it to test replication changes.

Staging is also where you confirm that your replication monitoring works. Data probably doesn’t constantly change
in your staging environment, but that’s OK. Use canary tables and get creative to simulate load for test purposes.
