Considerations for Multi Site Clusters in Windows Server 2012 (Part 1)

Published on 14 May 2013 / Last Updated on 22 Aug. 2013

This article series discusses architectural considerations that must be taken into account when building multi-site clusters that are based on Windows Server 2012.


Introduction

One of the features in Windows Server 2012 that has improved the most from previous versions is failover clustering. Windows Server 2012 allows you to build clusters that are more scalable than ever before, while at the same time giving administrators much more freedom to choose a cluster design that makes sense for their own organization, rather than being completely locked into a rigid set of requirements.

Although it was previously possible to build a multi-site failover cluster, Windows Server 2012 makes geo clustering much more practical. It is worth noting however, that even though Microsoft has gone to great lengths to make building clusters easier than it has ever been, good cluster design is essential. An improperly designed multi-site cluster will likely suffer from performance problems and may ultimately prove to be less than reliable. That being the case, I decided to write this article series as a way of providing you with some best practices for building multi-site clusters that are based on Windows Server 2012.

Quorum Considerations

I want to start out by talking about one of the aspects of multi-site clustering that has traditionally proven to be the most challenging. In order for a cluster to function, it has to maintain quorum. This is a fancy way of saying that a minimum number of cluster nodes must remain functional and accessible in order for the cluster to continue operating.

Windows Server clusters generally use a Majority Node Set quorum model. In a Majority Node Set cluster, a majority of the cluster nodes must be functional in order for the cluster to retain quorum. Microsoft defines the majority as more than half of the cluster nodes: half the total (rounded down), plus one. If, for example, a Majority Node Set cluster contained four cluster nodes, then Windows would define a node majority as three cluster nodes (half of the cluster nodes plus an additional node).

The majority node set requirement comes with a couple of implications. For starters, it means that smaller clusters can tolerate the failure of fewer nodes while still retaining quorum. For example, a four node cluster can only tolerate the failure of a single node. On the other hand, a cluster with ten nodes can retain quorum even if up to four nodes fail.
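The arithmetic behind these examples is simple enough to sketch in a few lines of Python. This is purely illustrative (the function names are my own, not anything from the Windows clustering APIs):

```python
def quorum_majority(total_nodes: int) -> int:
    """Votes needed for a Majority Node Set cluster to retain quorum:
    half of the nodes (rounded down), plus one."""
    return total_nodes // 2 + 1

def max_tolerable_failures(total_nodes: int) -> int:
    """Node failures the cluster can survive while still holding quorum."""
    return total_nodes - quorum_majority(total_nodes)

print(quorum_majority(4), max_tolerable_failures(4))    # 3 1
print(quorum_majority(10), max_tolerable_failures(10))  # 6 4
```

Note how the failure tolerance scales: a four node cluster survives only one failure, while a ten node cluster survives four.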

Although cluster node planning is really just a matter of basic math (at least in terms of calculating tolerance for node failures), things get a little bit more interesting when you bring a multi-site architecture into the picture.

Imagine for example that your organization has a primary data center and a disaster recovery data center. Now imagine that you decide to build a multi-site cluster to handle a mission-critical application. You want to be able to run that application in either data center, so you want to put plenty of cluster nodes in each location.

As previously mentioned, a Majority Node Set cluster with ten cluster nodes can survive the failure of up to four nodes. With that in mind, let’s pretend that we decided to place five nodes in each of the two data centers. That way, all but one of the cluster nodes could potentially fail in either one of the data centers and the cluster would still retain quorum.

Although this architecture may at first sound promising, there is a major problem. Imagine what would happen if the WAN link (or the Internet connection, if that’s what you are using) between the two sites failed. In this type of situation, the cluster nodes are not smart enough to tell the difference between a WAN link failure and a mass cluster node failure.

In this scenario, each datacenter would interpret the WAN link failure as if all of the cluster nodes in the opposite datacenter had failed. In other words, each datacenter thinks that five cluster nodes are down. Remember that in a ten node cluster, six cluster nodes have to remain online in order for the cluster to retain quorum. Each datacenter can only confirm the availability of five nodes, so neither datacenter is able to maintain quorum. Hence the clustered application fails, even though not a single cluster node has actually failed.
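You can check this outcome with the same majority arithmetic as before. A minimal sketch, assuming each partitioned site can only count the nodes it can still see:

```python
def site_retains_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """After a partition, a site retains quorum only if the nodes it can
    still see form a majority of the whole cluster."""
    return visible_nodes >= total_nodes // 2 + 1

TOTAL = 10
# WAN failure: each datacenter sees only its own five local nodes,
# but quorum still requires six of the ten votes.
print(site_retains_quorum(5, TOTAL))  # False -- true for BOTH sites
```

Since neither side can muster six votes, the whole cluster stops, even though all ten nodes are healthy.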

In this nightmare scenario, the WAN link is the cluster’s Achilles’ heel. It is the one single point of failure that has the potential to bring down the entire cluster. The question is how can you protect your cluster against this sort of thing?

There are a couple of different schools of thought on preventing a WAN link outage from bringing down the cluster. In the past, a popular option has been to stack the deck in favor of one datacenter or the other. To show you how this works, let’s go back to my earlier example of a ten node cluster that spans two datacenters.

If the goal is to prevent a WAN link failure from bringing down the cluster then you would need to place an uneven number of cluster nodes in each datacenter. A ten node cluster requires that six nodes remain online in order for the cluster to retain quorum. As such, placing six nodes in the primary datacenter and four nodes in the disaster recovery datacenter will insulate the cluster against a WAN link failure (assuming that all of the nodes in the primary datacenter are online at the time of the failure). The trade-off is that this design only protects the primary site: if the primary datacenter itself goes down, the four surviving nodes in the disaster recovery datacenter cannot form a majority on their own.
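The uneven placement works out like this (again a purely illustrative sketch of the vote counting):

```python
TOTAL = 10
PRIMARY, DR = 6, 4            # stacked placement across the two sites
majority = TOTAL // 2 + 1     # 6 votes needed to retain quorum

# After a WAN link failure, each site counts only its local nodes.
print(PRIMARY >= majority)  # True: the primary datacenter keeps quorum
print(DR >= majority)       # False: the DR site cannot reach quorum alone
```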

A second school of thought regarding protecting a Majority Node Set cluster against a WAN link failure is to make use of a third site. This architecture works by placing half of the cluster nodes in the organization’s primary datacenter and half of the cluster nodes in a disaster recovery datacenter. The third location doesn’t actually host a cluster node. Instead, it hosts a non-clustered server that acts as a file share witness.

A file share witness is a server that acts as a sort of referee in the event of a WAN link failure. To show you how this works, consider our earlier example in which an organization needs to build a multi-site cluster with ten cluster nodes. Now let’s suppose that we decided to put five cluster nodes in the primary datacenter and five cluster nodes in the disaster recovery datacenter.

In this arrangement all of the same rules apply. The cluster still requires six nodes to be available in order for the cluster to retain quorum. Now suppose that a WAN link failure occurs. Neither datacenter has enough nodes for the cluster to retain quorum. However, all of the cluster nodes know about the file share witness. Therefore, both datacenters will attempt to contact the file share witness server. The datacenter with the functioning WAN connection should be able to establish contact, while the datacenter with the failed connection should not be able to. When a datacenter does establish contact with the file share witness, that server takes the place of a sixth cluster node. In doing so, it allows the cluster to retain quorum in spite of the WAN link failure.
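Treating the file share witness as one extra vote (consistent with it "taking the place of a sixth cluster node"), the tie-break looks like this. The function name is my own invention for illustration:

```python
NODES_PER_SITE = 5
WITNESS_VOTES = 1
total_votes = 2 * NODES_PER_SITE + WITNESS_VOTES  # 11 votes in play
majority = total_votes // 2 + 1                   # 6 votes needed

def site_votes(local_nodes: int, reached_witness: bool) -> int:
    """Votes a partitioned site can muster: its local nodes, plus the
    witness vote if it can still reach the file share witness."""
    return local_nodes + (WITNESS_VOTES if reached_witness else 0)

print(site_votes(5, True) >= majority)   # True: the witness breaks the tie
print(site_votes(5, False) >= majority)  # False: the isolated site stops
```

Because only one side of the partition can reach the witness, exactly one site wins the tie-break, which is precisely the behavior you want.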

Conclusion

Although cluster quorum is an extremely important consideration for multi-site clusters, it is far from being the only consideration. Some of the other considerations that must be taken into account are node storage and the availability of cluster resources. I will discuss these issues and more as the series progresses.

