What SQL Server Clustering Can and Cannot Do

By: Steven Wort, Ross LoForte, Brian Knight


Many DBAs seem to have difficulty understanding exactly what clustering is. Following is a good working definition:

Microsoft Windows Failover Clustering is a high-availability option designed to increase the uptime of SQL Server instances. A cluster includes two or more physical servers, called nodes; identical configuration is recommended. One is identified as the active node, on which a SQL Server instance is running the production workload, and the other is a passive node, on which SQL Server is installed but not running. If the SQL Server instance on the active node fails, the passive node becomes the active node and begins to run the SQL Server production workload with some minimal failover downtime. Additionally, you can deploy a Windows Failover Cluster to have both nodes active, which means running different SQL Server instances where any SQL Server instances can failover to the other node.

This definition is straightforward, but it has a lot of unclear implications, which is where many clustering misunderstandings arise. One of the best ways to more fully understand what clustering can and cannot do is to drill down into the details.

What Clustering Can Do

Clustering is designed to improve the availability of the physical server hardware, operating system, and SQL Server instances but excluding the shared storage. Should any of these aspects fail, the SQL Server instance fails over. The other node in a cluster automatically takes over the failed SQL Server instance to reduce downtime to a minimum.

Additionally, the use of a Windows Failover Cluster can help reduce downtime when you perform maintenance on cluster nodes. For example, if you need to update hardware on a physical server or install a new service pack on the operating system, you can do so one node at a time. To do so, follow these steps:

1.   First, you upgrade the passive node that is not running a SQL Server instance.

2.   Next, manually failover from the active node to the now upgraded node, which becomes the active node.

3.   Then upgrade the currently passive node.

4.   After it is upgraded, if you choose, you can fail back to the original node. This cluster feature helps to reduce the overall downtime caused by upgrades.

When running an upgrade, you need to ensure that you do not manually failover to a node that has not been upgraded because that would cause instability since the binary would not have been updated.

NOTE

A Windows 2003 Failover Cluster cannot be upgraded to a Windows 2008 Failover Cluster because architecturally the two versions are different. Instead, create a Windows 2008 Failover Cluster and migrate the databases.

What Clustering Cannot Do

The list of what clustering cannot do is much longer than the list of what it can do, and this is where the misunderstandings start for many people. Clustering is just one part of many important and required pieces in a puzzle to ensure high availability. Other aspects of high availability, such as ensuring redundancy in all hardware components, are just as important. Without hardware redundancy, the most sophisticated cluster solution in the world can fail. If all the pieces of that puzzle are not in place, spending a lot of money on clustering may not be a good investment. The section “Getting Prepared for Clustering” discusses this in further detail.

Some DBAs believe that clustering can reduce downtime to zero. This is not the case; clustering can mitigate downtime, but it can’t eliminate it. For example, the failover itself causes an outage lasting from seconds to a few minutes while the SQL Server services are stopped on one node then started on the other node and database recovery is performed.

Nor is clustering designed to intrinsically protect data as the shared storage is a single point of failover in clustering. This is a great surprise to many DBAs. Data must be protected using other options, such as backups, log shipping, or disk mirroring. In actuality, the same database drives are shared, albeit without being seen at the same time, by all servers in the cluster, so corruption in one would carry over to the others.

Clustering is not a solution for load balancing either. Load balancing is when many servers act as one, spreading your load across several servers simultaneously. Many DBAs, especially those who work for large commercial websites, may think that clustering provides load balancing between the cluster nodes. This is not the case; clustering helps improve only uptime of SQL Server instances. If you need load balancing, then you must look for a different solution. A possibility might be Peer-to-Peer Transactional Replication.

Clustering purchases require Enterprise or Datacenter versions of the Windows operating system and SQL Server Standard, Enterprise, or BI editions. These can get expensive and many organizations may not cost-justify this expense. Clustering is usually deployed within the confines of a data center, but can be used over geographic distances (geoclusters). To implement a geocluster, work with your storage vendor to enable the storage across the geographic distances to synchronize the disk arrays. SQL Server 2012 also supports another option: multi-site clustering across subnet. The same subnet restriction was eliminated with the release of SQL Server 2012.

NOTE

Clustering requires experienced DBAs to be highly trained in hardware and software, and DBAs with clustering experience command higher salaries.

Although SQL Server is cluster-aware, not all client applications that use SQL Server are cluster-aware. For example, even if the failover of a SQL Server instance is relatively seamless, a client application may not have the reconnect logic. Applications without reconnect logic require that users exit and then restart the client application after the SQL Server instance has failed over, then users may lose any data displayed on their current screen.

Choosing SQL Server 2012 Clustering for the Right Reasons

When it comes right down to it, the reason for a clustered SQL Server is to improve the high availability of the whole SQL Server instances which includes all user/system databases, logins, SQL Jobs but this justification makes sense only if the following are true:

  • You have experienced DBA staff to install, configure, and administer a clustered SQL Server.
  • The cost (and pain) resulting from downtime is more than the cost of purchasing the cluster hardware and software and maintaining it over time.
  • You have in place the capability to protect your storage redundancy. Remember that clusters don’t protect data.
  • For a geographically dispersed cluster across remote data centers, you have a Microsoft certified third-party hardware and software solution.
  • You have in place all the necessary peripherals required to support a highly available server environment (for example, backup power and so on).

If all these things are true, your organization is a good candidate for installing a clustered SQL Server, and you should proceed; but if your organization doesn’t meet these criteria, and you are not willing to implement them, you would probably be better with an alternative, high-availability option, such as one of those discussed next.

Comments

  1. Hi Adam,
    Wonderful article, really helpful. I have few questions, would be great if you could answer:
    Set up: we have 2 nodes, Windows Server 2008. On both the nodes we have 2 different SQL Server instances are installed. Both are SQL Server 2008 R2. Both the nodes are part of the cluster.

    Now we want to do in-place upgrade of SQL Server 2008R2 to SQL Server 2014.
    First thing we need to do is install the Windows Sp1, which is per-requisite to install SS 2014.
    1. How should we go about doing the upgrade? Your opinion.
    2. I’m not the DBA, but the lead of the development team. My suggestion was we broke the clustering and then do the sp1 installation and upgrade on one server at a time. My assumption is after removing clustering both the sql server instances will run normally as stand alone instance, after the upgrading both the server we will create the clustering again. Down time is not an issue.
    3. DBA is saying if we remove the clustering sql server instances won’t run. They won’t even start. And he is saying we do the upgrade with the clustering in place. However I feel it is more risky.

    Your help will be appreciated.
    Thanks
    Mangal

    • You really should not do an in-place upgrade. It’s much, much safer and simpler to stand up your new cluster and migrate your databases. In-place upgrades are extremely risky when you have just one server and one instance, a setup this complex only makes it that much worse.

  2. Hi,

    I will like to know if you could answer a question about deadlocks and clustering. When you manually do a failover to a passive node, do the deadlocks that the database had on the active node remain on the passive node when the failover process is completed?

    Is there any literature on the net about this?

    Best regards an thank you for any help you could give me.

    Humberto Lopez

    • Humberto,

      when fail-over occurs sql server services starts on passive node and becomes active, that means all the connection that were on active node already terminated and it will not be seen on server activity after fail-over. It is same as restarting stand-alone server.

  3. nice information.

    i have a question for sql clustering.
    is it possible to replicate a sql failover cluster to new sql server?

    thank you

Leave a Reply