We all know SQL Server Clustering is a widely used High Availability solution. A challenge the SQL Server DBA’s face regularly is to find the root cause of a SQL Server failovers in cluster.
To understand whether it was a failover or a restart of SQL in the same node , take a look at my previous post here.
Now, In case of failovers it could be an automatic failover or it could be a manually initiated failover. An automatic failover is when SQL Server fails over to the passive node due to some issues with SQL Server (isAlive \ LooksAlive failure).
When a DBA or System administrator Moves the SQL Server group to another node using Cluster administrator or using Cluster command for some maintenance\deployment activities then it is called a manual failover.
When we do a Root Cause Analysis to find the reason for SQL Server failovers, its important to first find whether it is a manual or automatic failover. Below are the steps that you can take to find it out.
One easy way to find that out is to see the previous SQL Server Error logs. If you see that SQL Server restarted multiple times (actually 3 times) in the same node and then came online in the other node then it should ideally be an automatic failover.
In case of manual failover, you will see that the SQL Server is running on a different node after the first restart.
Now this behaviour totally depends on what action we have set for the failure of SQL Server resources in cluster administrator.
Another option is to check the cluster log, which I am detailing below. Windows cluster maintains a log in all the member nodes of the cluster. The general path is C:\Windows\Cluster\Cluster.log
Note: I have only tested this in Windows 2003 clusters. (Yet to test in windows 2008\R2)
Open the Cluster log and scroll down to the approximate time there was a failure. Now search for the occurence of these keywords “FmsMoveGroupRequest” and “FmpDoMoveGroup”
You would see entries similar to this.
INFO [FM] FmsMoveGroupRequest: To move group ‘a47f0c1a-fc1b-45ed-8a34-07bf59300c9c’
INFO [FM] FmpDoMoveGroup: Entry
INFO [FM] FmpMoveGroup: Entry
INFO [FM] FmpPickNodeFromPreferredListAtRandom: Picking node for group a47f0c1a-fc1b-45ed-8a34-07bf59300c9c [SQL2005], suggested node 1…
INFO [FM] FmpPickNodeFromPreferredListAtRandom: Node 1 for group a47f0c1a-fc1b-45ed-8a34-07bf59300c9c is user preferred…
INFO [FM] FmpPickNodeFromPreferredListAtRandom: Selected node 1 for group a47f0c1a-fc1b-45ed-8a34-07bf59300c9c…
INFO [FM] FmpMoveGroup: Moving group a47f0c1a-fc1b-45ed-8a34-07bf59300c9c to node 1 (1)
INFO [FM] FmpNotifyGroupStateChangeReason: Notifying group SQL2005 [a47f0c1a-fc1b-45ed-8a34-07bf59300c9c] of state change reason 1…
Here to find which group has got a Manual failover request, copy this GUID value ‘a47f0c1a-fc1b-45ed-8a34-07bf59300c9c’. Now open registry and traverse to this path
You will find information about all the groups that you have in the windows cluster there. Match the GUID value and look at the “name” field. In my case it gives the SQL Server Group name.
This will tell us that The SQL Server Group with GUID ‘a47f0c1a-fc1b-45ed-8a34-07bf59300c9c’ was successfully send a Move Group request (which is Manual Failover).
Now if you scroll through the cluster log, you will find the below entries, which says that the failover was successfully completed.
INFO [FM] FmpCompleteMoveGroup: Exit, status = 0
INFO [FM] FmpMovePendingThread Exit.
Happy RCA !!!