Last Saturday, I was woken up by a call from the office saying that replication was failing. In our implementation, we have around 7 transactional publications and 8 P2P publications.
Transactional publications have two subscribers each and P2P publications have three subscribers each.
The issue was that a few of the publication-subscription pairs were failing, all of them with the errors below in Replication Monitor.
Error messages: (unknown security error). The step failed.
The process could not allocate memory.
We tried to restart the individual jobs that were failing, but we were not able to stop them. Since the issue had started a few hours before we got involved, we wanted a speedy resolution, so we restarted the SQL Server Agent, which restarts all the distribution agents. That resolved the issue temporarily.
On further investigation, we found that the following errors were being raised at the time of the issue:
“Downgrading backup log buffers from 1024K to 64K”
“Error: 701, Severity: 17, State: 123.
There is insufficient system memory to run this query”
At this point it looks like non-buffer pool memory pressure was causing the issue. We have 16 GB of physical RAM on this 64-bit machine, and Max Server Memory was set to 14 GB. That leaves only about 2 GB for the operating system and everything else outside the buffer pool, which goes a long way toward explaining the contention: apart from sqlservr.exe, the server also runs around 2 Log Reader Agents, 38 Distribution Agents and a couple of third-party applications.
As a general rule of thumb, we should have left at least 4 GB of physical RAM for the operating system and set SQL Server Max Server Memory to 12 GB.
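The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is only an illustration: the function name and the 4 GB default are my own choices, and the right reserve for a given server depends on what else runs on it.

```python
def suggested_max_server_memory_gb(physical_ram_gb, os_reserve_gb=4):
    """Return a Max Server Memory suggestion (GB) that leaves os_reserve_gb
    for the operating system and everything else outside SQL Server."""
    if physical_ram_gb <= os_reserve_gb:
        raise ValueError("physical RAM must exceed the OS reserve")
    return physical_ram_gb - os_reserve_gb


# Our server: 16 GB of physical RAM, Max Server Memory set to 14 GB.
physical_ram_gb = 16
configured_max_gb = 14

headroom_gb = physical_ram_gb - configured_max_gb
suggested_gb = suggested_max_server_memory_gb(physical_ram_gb)

print("Headroom with the old setting:", headroom_gb, "GB")
print("Suggested Max Server Memory:", suggested_gb, "GB")
```

With our numbers, the old setting left 2 GB of headroom, while the rule of thumb suggests 12 GB for Max Server Memory.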
Digging further into this issue, we found related errors in the Application Event Log:
Event Type: Error
Event Source: SQLSERVERAGENT
Event Category: None
Event ID: 52
Microsoft SQL Server Replication : A replication agent encountered a fatal error and was shut down. A mini-dump has been generated at the following location:
c:\Program Files\Microsoft SQL Server\90\Shared\ErrorDumps\ReplAgent20110610220003_0.mdmp
In the location “c:\Program Files\Microsoft SQL Server\90\Shared\ErrorDumps” we found 4 mini dumps. We are still waiting for an RCA on these mini dumps from Microsoft, and I will post it here as soon as we get it.
An important lesson from this issue is that there was no information about these dumps in the SQL Server error logs, and the dumps are not generated in the normal LOG folder. So that is one more folder we need to keep track of if we are using replication. 🙂
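One way to keep track of that extra folder is a small scheduled script that lists recent mini dumps. The following is a minimal sketch in Python, not something we ran during the incident: the function name and the 24-hour window are assumptions of mine, and the path is simply the one from the event log entry above, so adjust it for your installation.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Path taken from the event log entry above; it may differ on your server.
DUMP_DIR = Path(r"c:\Program Files\Microsoft SQL Server\90\Shared\ErrorDumps")


def find_recent_dumps(dump_dir, max_age_hours=24):
    """Return .mdmp files in dump_dir modified within the last
    max_age_hours, newest first. A missing folder yields an empty list."""
    dump_dir = Path(dump_dir)
    if not dump_dir.is_dir():
        return []
    cutoff = datetime.now() - timedelta(hours=max_age_hours)
    recent = [
        p for p in dump_dir.glob("*.mdmp")
        if datetime.fromtimestamp(p.stat().st_mtime) >= cutoff
    ]
    return sorted(recent, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for dump in find_recent_dumps(DUMP_DIR):
        print(dump)
```

Run from a scheduled task, any output from this script would be a hint to go and look at the replication agents, since nothing about these dumps shows up in the SQL Server error log.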