Hyper-V Windows Failover Cluster and IsAlive Operation (Part 1)

by Nirmal Sharma [Published on 2 Aug. 2016 / Last Updated on 2 Aug. 2016]

In Part 1 of this article series, we'll explain a Hyper-V cluster issue that was caused by a DNS registration failure during cluster operation.

One of our Hyper-V clusters running Windows Server 2012 suffered a major outage that took critical virtual machines down. During our investigation we noticed many error and warning messages in the event log of the cluster node. Because these messages related to DNS, Winlogon, Group Policy processing, NetLogon, and Kerberos, it was hard to tell whether they had anything to do with the failover cluster failure. It took us hours to investigate and fix the issue.

Issue Overview

After we fixed the issue with the Hyper-V cluster, our first question was: what caused the cluster to fail so suddenly? It turned out that a DNS issue alone had brought down the entire Hyper-V cluster. The likely cause we identified during troubleshooting is that, as part of its routine processing, the cluster initiated DNS registration with one of the DNS servers configured in the TCP/IP properties of the cluster nodes, and that registration timed out for reasons explained a little later in this article.

It is important to note that most operating system processes need to implement a timeout. Such a process must complete within a given time period; if it does not, it times out and terminates. An application may implement timeouts in the same way to ensure that a task completes within a given period, and each of its functions may have a different timeout value. For example, DNS clients expect a DNS lookup to be completed within a maximum of 10 seconds. Similarly, a few cluster operations expect a call to be completed within a few milliseconds.
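To make the timeout idea concrete, here is a minimal Python sketch. It is not how Windows or the cluster service implements timeouts; the function names, the address, and the durations are made up for illustration. The caller simply stops waiting once the time limit passes (the worker thread itself is not killed).

```python
import concurrent.futures
import time


def run_with_timeout(task, timeout_seconds):
    """Run task() in a worker thread and stop waiting after timeout_seconds.

    Returns ("completed", result) or ("timed_out", None).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(task)
    try:
        return ("completed", future.result(timeout=timeout_seconds))
    except concurrent.futures.TimeoutError:
        return ("timed_out", None)
    finally:
        # Do not block on the worker; the caller has already moved on.
        executor.shutdown(wait=False)


def fast_lookup():
    return "10.0.0.5"  # made-up address, for illustration only


def slow_lookup():
    time.sleep(0.5)  # simulates a DNS server that does not answer in time
    return "10.0.0.5"


print(run_with_timeout(fast_lookup, 1.0))
print(run_with_timeout(slow_lookup, 0.05))
```

The second call gives up after 0.05 seconds even though the lookup would eventually have returned, which is the situation the cluster ran into with its DNS registration.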

The failover cluster is responsible for registering DNS records for the Network Name resources. The cluster runs its DNS registration process every 24 hours. The process is also invoked during certain cluster events, such as when the “IsAlive” function executes or when someone brings the Network Name resource online. During the “IsAlive” call for a Network Name resource, the cluster decides whether the resource needs to be registered in DNS by following the logic explained in the “DNS Registration behavior for the Network Name Cluster Resources” section later in this article.

Because the Hyper-V cluster could not complete the DNS registration process in time, the Network Name resource failed, which, in turn, caused other resources to fail. The next question is: if the failure was in the Network Name resource, why did the entire cluster fail, including the Disk and Virtual Machine resources? The reason is that the RHS process was terminated as part of the “IsAlive” call.

Note:
When all cluster resources run in a single RHS process and an event causes that RHS process to terminate, all the other resources fail as well. The RHS.exe process is required for a Windows Failover cluster to be up and running.
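The effect described in the note can be illustrated with a toy Python model. This is only a sketch: the host names ("rhs-1", "rhs-2") and the function are hypothetical, resource dependencies are ignored, and nothing here talks to a real cluster.

```python
def resources_still_running(hosts, failing_resource):
    """Return the resources that keep running after a host process dies.

    hosts maps a (hypothetical) host process name to the list of resources
    it runs. The host that runs failing_resource is treated as terminated,
    so every resource inside it goes down with it.
    """
    running = []
    for host_name, resources in hosts.items():
        if failing_resource in resources:
            continue  # this host process is terminated
        running.extend(resources)
    return running


# All resources share one host process: nothing survives.
shared = {"rhs-1": ["Network Name", "Physical Disk", "Virtual Machine"]}
print(resources_still_running(shared, "Network Name"))

# The failing resource runs in its own host process: the rest survive.
split = {
    "rhs-1": ["Network Name"],
    "rhs-2": ["Physical Disk", "Virtual Machine"],
}
print(resources_still_running(split, "Network Name"))
```

In the shared layout a single failing resource leaves nothing running, which matches what we saw during the outage.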

How did we fix the issue?

Because the DNS registration executed by the “IsAlive” call was failing, we decided to skip dynamic DNS registration by unchecking the “Register this connection’s addresses in DNS” setting in the network adapter properties, as shown in the red square of the screenshot below:

Image

Unchecking the “Register this connection’s addresses in DNS” setting disables dynamic DNS registration by the operating system and by any application that performs it, such as the Ipconfig /registerdns command and the cluster “IsAlive” call.

Once the cluster was up and running, we started investigating the DNS timeout issue. We found that, in a recent network change, the firewall rules for the DNS network ports of the first DNS server had been left out. As a result, the required network ports were blocked, and DNS communication between the cluster node and that DNS server failed.
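A quick way to spot this kind of blocked port is to try connecting to each configured DNS server. The Python sketch below is an illustration only, not the tool we used: the function names are hypothetical, and it tests TCP connectivity on port 53 only, so it is a partial check (DNS queries commonly travel over UDP, which this does not exercise).

```python
import socket


def tcp_port_open(host, port, timeout_seconds=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_dns_servers(servers, port=53, timeout_seconds=2.0):
    """Return the servers from the list that did not accept a connection."""
    return [
        server
        for server in servers
        if not tcp_port_open(server, port, timeout_seconds)
    ]
```

Calling unreachable_dns_servers with the DNS server addresses from the adapter's TCP/IP properties returns the servers that could not be reached from the cluster node.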

DNS Registration behavior for the Network Name Cluster Resources

The DNS registration process for the cluster Network Name resources is executed during the following events:

  • When the “Online” function executes.
  • When someone tries to bring the Network Name resource online via the Cluster Administrator or programmatically.
  • When the cluster “IsAlive” call executes.

When the cluster starts the DNS registration process, it performs the following logic:

  • Checks to see if the Network Name is registered in the NBT table.
  • If the NBT table does not hold the network name, the cluster starts the DNS registration process.
  • The cluster checks to see if the private property “RegisterAllProvidersIP” for the Network Name resource is set.
  • The cluster checks whether the DNS registration process needs to be invoked by checking the “Register this connection’s addresses in DNS” setting on the network adapter.
    • If the network adapter is configured to process DNS registration, the DNS registration process starts.
    • If the “Register this connection’s addresses in DNS” setting is unchecked, the DNS registration process is skipped.
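The checks above can be sketched as a small decision function in Python. This is one reading of the list, not the cluster's actual code: the class and function names are hypothetical, it assumes registration is not needed when the name is already in the NBT table, and because the list does not say what “RegisterAllProvidersIP” changes, the sketch only records its value in the reason text.

```python
from dataclasses import dataclass


@dataclass
class NetworkNameState:
    name_in_nbt_table: bool          # is the name already in the NBT table?
    register_all_providers_ip: bool  # value of the private property
    adapter_registers_in_dns: bool   # state of the adapter checkbox


def dns_registration_decision(state):
    """Return (should_register, reason) for one Network Name resource."""
    if state.name_in_nbt_table:
        # Assumption: a name found in the NBT table needs no registration.
        return (False, "name already present in NBT table")
    if not state.adapter_registers_in_dns:
        return (False, "adapter setting unchecked, registration skipped")
    if state.register_all_providers_ip:
        return (True, "register, RegisterAllProvidersIP is set")
    return (True, "register, RegisterAllProvidersIP is not set")


# The workaround from this article: the adapter checkbox is unchecked.
print(dns_registration_decision(NetworkNameState(False, False, False)))
```

With the adapter setting unchecked the function never reaches the registration step, which is why the workaround kept the “IsAlive” call away from the unreachable DNS server.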

Why did “IsAlive” not connect to the Secondary DNS Server?

Although the cluster node was configured with multiple DNS servers, the “IsAlive” call did not attempt to use the secondary DNS servers. As I understand from reading some documents at MSDN, “IsAlive” is an entry point configured in the Resource DLL along with the Terminate function. Once the cluster terminates “IsAlive”, no further statements in the function are processed. When someone brings the resource online or when “IsAlive” executes again, it connects to the first DNS server provided by the DNS resolver and is terminated again if it times out in the meantime.

The “IsAlive” call might check the next available DNS server if it were not terminated. Terminating a routine or function prevents it from completing successfully. Because it does not complete, it never reaches the next step (i.e., checking the next available DNS server).
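The behavior described above can be modeled with a small simulation in Python. It reflects our understanding rather than the real Resource DLL code: the function name, the server names, and the 3-second and 30-second call deadlines are hypothetical, and the 10-second per-server wait is borrowed from the DNS client example earlier in this article. No real time passes; the durations are just numbers.

```python
def simulate_health_check(servers, per_server_timeout, call_deadline):
    """Simulate one call that walks a DNS server list under a deadline.

    servers is a list of (name, response_seconds) pairs, where
    response_seconds is None for a server that never answers.
    Returns (servers_tried, outcome).
    """
    elapsed = 0.0
    tried = []
    for name, response_seconds in servers:
        tried.append(name)
        answered = (
            response_seconds is not None
            and response_seconds <= per_server_timeout
        )
        wait = response_seconds if answered else per_server_timeout
        if elapsed + wait > call_deadline:
            # The caller terminates the call; later servers are never tried.
            return (tried, "terminated")
        elapsed += wait
        if answered:
            return (tried, "registered via " + name)
    return (tried, "failed")


servers = [("dns1", None), ("dns2", 0.02)]  # dns1 is blocked by the firewall

# Short deadline: the call is terminated while still waiting on dns1.
print(simulate_health_check(servers, 10.0, 3.0))

# Generous deadline: the call survives long enough to reach dns2.
print(simulate_health_check(servers, 10.0, 30.0))
```

In the first run only dns1 is ever tried, which is the pattern we observed: the call is cut off before the secondary server gets a chance.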

In the next part, we will explain Windows Failover cluster interaction with Hyper-V resources and the process involved to keep cluster resources healthy in a Windows Failover cluster.

The Author — Nirmal Sharma

Nirmal Sharma is an MCSEx3 and MCITP, and was awarded Microsoft MVP in Directory Services. In his spare time, he likes to help others and share some of his knowledge by writing tips and articles for various online communities. Nirmal can also be found contributing to PowerShell based Dynamic Packs for ADHealthProf.ITDynamicPacks.Net solutions.
