How SQL Server Clustering Works, Part 2

on January 27, 2013

Public and Private Networks

Each node of a cluster must have at least two network cards to be a fully supported installation. One network card is connected to the public network, and the other network card will be connected to a private cluster network. (Read Part 1 of this two-part article here.)

The public network is the network to which the client applications connect. This is how they communicate to a clustered SQL Server instance using the clustered IP address and clustered SQL Server name. It is recommended to have two teamed network cards for the public network for redundancy and to improve availability.
The private network is used solely for communications between the clustered nodes. It is used mainly for the heartbeat communication. Two forms of communications are executed:
- LooksAlive: Verifies that the SQL Server service runs on the online node every 5 seconds by default
- IsAlive: Verifies that SQL Server accepts connections by executing sp_server_diagnostics.

This health detection logic determines if a node is down and the passive node then takes over the production workload.

The SQL Server Instance

Surprisingly, SQL Server client applications don’t need to know how to switch communicating from a failed cluster node to the new active node or anything else about specific cluster nodes (such as the NETBIOS name or IP address of individual cluster nodes). This is because each clustered SQL Server instance is assigned a Network name and IP address, which client applications use to connect to the clustered SQL Server. In other words, client applications don’t connect to a node’s specific name or IP address but instead to the cluster SQL network name or cluster SQL IP address that stays consistent and fails over. Each clustered SQL Server will belong to a Failover Cluster Resource Group that contains the following resources that will fail together:

SQL Server Network Name
IP Address
One or more shared disks
SQL Server Database Engine service
SQL Server Agent
SQL Server Analysis Services, if installed in the same group
One file share resource, if the FILESTREAM feature is installed

How a Failover Works

Assume that a single SQL Server 2012 instance runs on the active node of a cluster and that a passive node is available to take over when needed. At this time, the active node communicates with both the database and the quorum on the shared disk array. Because only a single node at a time can access the shared disk array, the passive node does not access the database or the quorum. In addition, the active node sends out heartbeat signals over the private network, and the passive node monitors them, so it can take over if a failover occurs. Clients are also interacting with the active node via the clustered SQL Server name and IP address while running production workloads.

Now assume that the active node stops working because of a power failure. The passive node, which is monitoring the heartbeats from the active node, notices that the heartbeats stopped. After a predetermined delay, the passive node assumes that the active node has failed and initiates a failover. As part of the failover process, the passive node (now the active node) takes over control of the shared disk array and reads the quorum, looking for any unsynchronized configuration changes. It also takes over control of the clustered SQL Server name and IP address. In addition, as the node takes over the databases, it has to perform a SQL Server startup and recover the databases.

The time this takes depends on many factors, including the performance of the hardware and the number of transactions that might have to be rolled forward or back during the database recovery process. When the recovery process is complete, the new active node announces itself on the network with the clustered SQL Server name and IP address, which enables the client applications to reconnect and begin using the SQL Server 2012 instance after this minimal interruption.