Last week the SQL Server service restarted on one of our clusters. The SQL Server resources were active on Node A; the service restarted and came back online on Node A itself. Because this was a distributor server (used for replication), there was no direct impact on the system. The only effect was a small delay in data movement from the Publisher to the Subscriber.
Even though there was no direct customer impact, I still thought it was important to find the root cause of the restart. As always, I opened the SQL Server ERRORLOG first to see when the restart had happened. To my surprise, the ERRORLOG had the recent startup messages with the correct timings, and towards the end of the same ERRORLOG I could see the shutdown messages as well (which I had expected to find only in ERRORLOG.1). I confirmed that SQL Server was still running, which left me wondering why the previous ERRORLOG was being overwritten.
As I was running out of ideas about what was actually happening, I took a look at the Application Event Log and found an error event for SQL Server.
From the error it was clear that SQL Server had not been able to rename the previous ERRORLOG to ERRORLOG.1. I was under the impression that some window was holding a lock on the ERRORLOG, so I closed every open window in all the remote desktop sessions and restarted SQL Server once more. After the restart, the situation was unchanged and the same error was still logged in the Application Event Log.
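To make the failure mode concrete, here is a minimal Python sketch of this kind of numbered log rotation. It is only an illustration of the idea, not SQL Server's actual implementation; the function name `rotate_errorlogs` and the `keep` parameter are my own, with `keep=6` chosen to mirror the default of six archived error logs. On Windows, a rename raises an error when another process has the file open without allowing it to be renamed, which is the situation the event log was reporting.

```python
import os
import tempfile


def rotate_errorlogs(log_dir, keep=6):
    """Shift ERRORLOG -> ERRORLOG.1 -> ... -> ERRORLOG.<keep>.

    Returns True if every rename succeeded, False if the OS refused one
    (for example because another process holds the file open on Windows).
    """
    base = os.path.join(log_dir, "ERRORLOG")
    try:
        # Drop the oldest archive so the shift below never hits an existing name.
        oldest = f"{base}.{keep}"
        if os.path.exists(oldest):
            os.remove(oldest)
        # Walk from the highest number down so each target name is already free.
        for n in range(keep - 1, 0, -1):
            src = f"{base}.{n}"
            if os.path.exists(src):
                os.rename(src, f"{base}.{n + 1}")
        # Finally archive the current log; the caller then starts a fresh one.
        if os.path.exists(base):
            os.rename(base, f"{base}.1")
        return True
    except OSError as exc:
        print(f"rotation failed: {exc}")
        return False
```

When the function returns False, the current file keeps its name, which matches what I saw: the startup and shutdown messages ended up in the same ERRORLOG.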
At this point I downloaded the Process Explorer tool. I opened it and, from the Find menu, clicked “Find Handle or DLL”.
A search window opened. I typed in ERRORLOG, clicked Search, and got the output below.
From the output it was clear that a process named Koqerr.exe was holding a lock on the SQL Server ERRORLOG. Koqerr.exe is part of the Tivoli Monitoring Agent software for SQL Server. At this point all I had to do was stop that monitoring software and then restart SQL Server. Once restarted, SQL Server started writing to a new ERRORLOG and the old one was renamed to ERRORLOG.1!
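If you would rather script the handle search than click through Process Explorer, the same idea can be roughly sketched in Python. This is an assumption-laden sketch, not what I ran: `find_holders` and `psutil_snapshot` are hypothetical helper names, and the second one relies on the third-party `psutil` package being installed. `psutil` may also miss handles on Windows that Process Explorer can see, so treat an empty result with caution.

```python
def find_holders(snapshot, needle):
    """Return (pid, name, path) for each open file whose path contains needle.

    snapshot is an iterable of (pid, process_name, [open file paths]).
    The match is case-insensitive, like the Process Explorer search box.
    """
    needle = needle.lower()
    hits = []
    for pid, name, paths in snapshot:
        for path in paths:
            if needle in path.lower():
                hits.append((pid, name, path))
    return hits


def psutil_snapshot():
    """Build a snapshot of open files using psutil (assumed to be installed)."""
    import psutil  # imported lazily so find_holders works without it

    snapshot = []
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            paths = [f.path for f in proc.open_files()]
        except psutil.Error:
            # Access denied or the process exited; skip it.
            continue
        snapshot.append((proc.info["pid"], proc.info["name"], paths))
    return snapshot
```

Running `find_holders(psutil_snapshot(), "ERRORLOG")` from an elevated prompt would list candidate processes; in my case the interesting entry would have been Koqerr.exe.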