Every once in a while you run across a problem that shouldn’t be causing people as much trouble as it does, yet there it is, causing all sorts of problems for so long that being broken is considered normal.
In the case of one client I’ve been working with, that problem was massive delays in replication between two SQL Server 2008 R2 machines. And by massive delays I’m not talking about a few minutes, but hours, sometimes 10-12 hours of delay, and it would stay like that for days or weeks at a time.
The symptom in this case was that when you inserted a tracer token into the replication publication, the bulk of the delay was between the publisher and the distributor (there’s a sketch of posting a tracer token after the list below). That meant the odds were it was one of three things.
- The disk on the distribution database is crap
- There is something very wrong with the transaction log on the published database
- There is a network problem between the publisher and the distributor
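For reference, this is roughly how you post a tracer token and read back the latency. It’s a sketch, run at the publisher against the published database, and the publication name is just a placeholder.

```sql
-- Run at the publisher, in the published database.
-- 'MyPublication' is a placeholder publication name.
DECLARE @tokenID int;

EXEC sys.sp_posttracertoken
    @publication = N'MyPublication',
    @tracer_token_id = @tokenID OUTPUT;

-- Check how long the token took to reach the distributor and the subscriber.
EXEC sys.sp_helptracertokenhistory
    @publication = N'MyPublication',
    @tracer_id = @tokenID;
```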
Well, in this case #3 doesn’t apply, as the publisher and the distributor are on the same machine. Looking at the IO, the response times were fine, between 0.001 and 0.015 seconds according to perfmon, and there weren’t any major wait problems showing up in the wait stats either. That basically rules out #1. So by process of elimination that leaves us with #2: something is very wrong with the transaction log.
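Perfmon isn’t the only way to see those disk response times; a rough sketch using the DMVs looks something like this. The numbers are cumulative since the last restart, so treat them as a sanity check rather than a live counter.

```sql
-- Rough per-file IO latency since the last restart.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads END  AS avg_read_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id
ORDER BY avg_write_ms DESC;

-- Top waits, skipping a few of the usual idle waits.
SELECT TOP (10) wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'XE_TIMER_EVENT', N'REQUEST_FOR_DEADLOCK_SEARCH')
ORDER BY wait_time_ms DESC;
```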
At first look there’s nothing funky going on. The database is in simple recovery. The transaction log is huge, but with all these problems that’s to be expected. Running DBCC LOGINFO showed me some very useful information when it finally finished (how long it took to finish was my next clue that this was the problem). When it did finish, there were over 24,000 Virtual Log Files (VLFs) within the transaction log file.
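If you want to see that for yourself, DBCC LOGINFO returns one row per VLF, so a quick count is all you need. On 2008 R2 the usual trick is dumping the output into a temp table; the column list below matches the 2008 R2 output.

```sql
-- One row comes back per VLF; dump the output to a temp table and count it.
-- Column list matches SQL Server 2008 R2's DBCC LOGINFO output.
CREATE TABLE #loginfo
(
    FileId      int,
    FileSize    bigint,
    StartOffset bigint,
    FSeqNo      int,
    [Status]    int,
    Parity      tinyint,
    CreateLSN   numeric(25,0)
);

INSERT INTO #loginfo
EXEC ('DBCC LOGINFO');

SELECT COUNT(*) AS vlf_count FROM #loginfo;

DROP TABLE #loginfo;
```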
Ladies and gentlemen, we have a winner!
Now, fixing this on a machine where replication is as pissed off as it is on this one isn’t as simple as just shrinking the log file. Because replication is keeping a massive amount of transaction log space locked up, this required some careful planning. The first step is to give the transaction log some nice big VLFs to write data to, but we need to do this somewhere that isn’t going to get in the way of cleaning up the existing VLFs. The option here: add another transaction log file to the database. Grow it large enough to handle all the traffic for a day or two, and grow it out in 8000 MB chunks. In this case I grew it out to 56000 MB. Then I left the system to sit there for a day.
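Roughly, that looks like the following. The database name, logical file name, and path are all placeholders; the point is adding the file and then growing it in 8000 MB steps so each growth creates a small number of large VLFs.

```sql
-- Add a second log file to the published database ('PublishedDB' and the
-- path are placeholders). Start it at 8000 MB...
ALTER DATABASE PublishedDB
ADD LOG FILE
(
    NAME = PublishedDB_log2,
    FILENAME = N'L:\Logs\PublishedDB_log2.ldf',
    SIZE = 8000MB
);

-- ...then grow it out in 8000 MB steps so the new VLFs are nice and big.
ALTER DATABASE PublishedDB
MODIFY FILE (NAME = PublishedDB_log2, SIZE = 16000MB);
ALTER DATABASE PublishedDB
MODIFY FILE (NAME = PublishedDB_log2, SIZE = 24000MB);
-- ...repeat in 8000 MB steps, up to...
ALTER DATABASE PublishedDB
MODIFY FILE (NAME = PublishedDB_log2, SIZE = 56000MB);
```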
When I came back to look at the system the next day, replication performance was still awful, but now when I looked at DBCC LOGINFO all the VLFs that were in use were in the new log file. Every VLF in file #2 (the original transaction log file) was marked as status 0. A quick DBCC SHRINKFILE statement, then growing the original transaction log file back out to its needed size of 56000 MB, and suddenly we were down to ~250 VLFs, give or take.
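The shrink and regrow is roughly this; the logical name for file #2 is a placeholder, and the regrow is done in the same 8000 MB steps as before.

```sql
USE PublishedDB;

-- Shrink the original (now empty) transaction log file as far down as it
-- will go. 'PublishedDB_log' is a placeholder logical name for file #2.
DBCC SHRINKFILE (PublishedDB_log, 1);

-- Grow it back out to its working size in 8000 MB steps so it ends up with
-- a few hundred large VLFs instead of 24,000 tiny ones.
ALTER DATABASE PublishedDB
MODIFY FILE (NAME = PublishedDB_log, SIZE = 8000MB);
-- ...repeat in 8000 MB steps, up to...
ALTER DATABASE PublishedDB
MODIFY FILE (NAME = PublishedDB_log, SIZE = 56000MB);
```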
Within about 2 hours replication had completely caught up, writing all the pending transactions not only to the distribution database but to the subscriber as well.
Once replication was caught up, I shrank the new transaction log file I had created as small as it would go, so that the next VLF to be written to was back in the original transaction log file. Once all the VLFs in the new transaction log file were back to status 0, I simply dropped the new file.
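The cleanup is just the reverse of the setup; again, the logical file name is a placeholder.

```sql
USE PublishedDB;

-- Shrink the temporary second log file as small as it will go so new log
-- writes move back into the original file...
DBCC SHRINKFILE (PublishedDB_log2, 1);

-- ...and once DBCC LOGINFO shows every VLF in that file at status 0,
-- drop the file entirely.
ALTER DATABASE PublishedDB
REMOVE FILE PublishedDB_log2;
```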
Now replication is all happy. I’m no longer seeing messages like “The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 500000 log records have been scanned in pass # 3, 0 of which were marked for replication, elapsed time n (ms).” When I started all this, these messages were being thrown several times a second. Now they aren’t appearing at all, which is very good.
Replication is happy, the client is happy, and I’m happily working on their next problem.
Denny
P.S. While this may not be your problem, if you are seeing this sort of issue it’s VERY much worth checking out.