If you would like to read the other parts in this article series please go to:
- Hyper-V Windows Failover Cluster and IsAlive Operation (Part 1)
- Hyper-V Windows Failover Cluster and IsAlive Operation (Part 3)
By default, all resources in a Windows Failover cluster are handled by the Resource Host Subsystem, which runs as a single RHS.exe process unless you configure cluster resources to run in a separate RHS process. RHS controls the cluster resources and communicates with Resource DLLs. A Resource DLL ships with a cluster-aware application and is responsible for executing cluster-specific functions against the resources maintained by that application. For example, VMCLUSRES.DLL is responsible for executing the “Online”, “IsAlive”, and “LooksAlive” functions for Hyper-V virtual machine resources. Similarly, ClusRes.DLL implements the same set of functions for generic cluster resources such as Network Name, IP Address, File Server, and Disk resources. The Windows failover cluster is designed to interact with cluster resources in the following order:
ClusSvc.exe (Cluster Service) communicates with the Resource Host Subsystem (RHS.exe). RHS.exe receives instructions from the Cluster Service to execute cluster-specific functions such as “Online”, “Offline”, “IsAlive”, “LooksAlive”, and so on. Resource DLLs receive instructions from the Resource Host Subsystem and take the necessary actions.
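The call chain above can be pictured with a small model. This is only an illustrative sketch: the classes `ClusterService`, `ResourceHost`, and `ResourceDll` are hypothetical stand-ins for ClusSvc.exe, RHS.exe, and a Resource DLL, and do not correspond to any real cluster API.

```python
from typing import Dict, List


class ResourceDll:
    """Hypothetical stand-in for a Resource DLL such as ClusRes.dll."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.calls: List[str] = []  # record of every function executed

    def execute(self, function: str, resource: str) -> str:
        # A real Resource DLL would act on the resource here.
        self.calls.append(f"{function}:{resource}")
        return f"{self.name} executed {function} for {resource}"


class ResourceHost:
    """Hypothetical stand-in for RHS.exe: routes calls to the owning DLL."""

    def __init__(self) -> None:
        self.owners: Dict[str, ResourceDll] = {}

    def register(self, resource: str, dll: ResourceDll) -> None:
        self.owners[resource] = dll

    def dispatch(self, function: str, resource: str) -> str:
        return self.owners[resource].execute(function, resource)


class ClusterService:
    """Hypothetical stand-in for ClusSvc.exe: it only talks to the host."""

    def __init__(self, host: ResourceHost) -> None:
        self.host = host

    def instruct(self, function: str, resource: str) -> str:
        return self.host.dispatch(function, resource)


if __name__ == "__main__":
    host = ResourceHost()
    host.register("Cluster Name", ResourceDll("ClusRes.dll"))
    print(ClusterService(host).instruct("IsAlive", "Cluster Name"))
```

The point of the model is the layering: the Cluster Service never calls a Resource DLL directly, so anything that stops the host process also cuts off every DLL behind it.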
There are various functions executed in a Windows failover cluster, but the functions that are executed to check resource availability and to ensure resources are healthy are explained below:
- “Online” function is executed when a resource is brought online via Cluster Administrator tool or programmatically.
- “LooksAlive” function is executed every 5 seconds and is responsible for checking the status of the resource in the cluster.
- “IsAlive” function is executed every 60 seconds and is responsible for doing a thorough check on the resource depending on the resource type.
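To get a feel for how often each check fires, the two intervals above can be turned into a simple count. This is a simplified model that assumes both timers start when the resource comes online and run independently; the function names are made up for illustration.

```python
LOOKS_ALIVE_INTERVAL_S = 5   # interval quoted in the text
IS_ALIVE_INTERVAL_S = 60     # interval quoted in the text


def checks_due(elapsed_seconds: int) -> list:
    """Return the checks that fall due at a given second after Online."""
    due = []
    if elapsed_seconds > 0 and elapsed_seconds % LOOKS_ALIVE_INTERVAL_S == 0:
        due.append("LooksAlive")
    if elapsed_seconds > 0 and elapsed_seconds % IS_ALIVE_INTERVAL_S == 0:
        due.append("IsAlive")
    return due


def count_checks(window_seconds: int) -> dict:
    """Count how many times each check falls due within a time window."""
    counts = {"LooksAlive": 0, "IsAlive": 0}
    for second in range(1, window_seconds + 1):
        for check in checks_due(second):
            counts[check] += 1
    return counts


if __name__ == "__main__":
    # Checks scheduled during a five-minute window.
    print(count_checks(300))
```

Under these assumptions the lightweight check runs twelve times for every thorough check, which is why “LooksAlive” has to stay cheap.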
Interaction of Windows Failover Cluster with Resources
This is what happens when cluster service starts:
- Cluster service starts.
- Cluster Service starts RHS.exe processes to monitor the resources in the cluster. By default, a single RHS process monitors all resources. Multiple RHS.exe processes are started if you have configured any of the cluster resources to run in a separate RHS process.
- RHS.exe process communicates with Resource DLLs (ClusRes.dll for Core cluster resources such as Cluster Name and IP Address and VMCLUSRES.DLL for Hyper-V virtual machine resources).
- The first time the Windows Failover Cluster service starts, RHS instructs the Resource DLLs to execute the “Online” function to bring resources online.
- Resource DLLs receive the instructions from the RHS.exe process and then bring the resources online.
The individual cluster resources can be brought online in a random order, but each Resource DLL is responsible for bringing its own resources online by using worker threads. All cluster resources can be brought online simultaneously. For example, ClusRes.DLL will bring the Cluster Name, IP Address, Disk, and File Server resources online, and VMCLUSRES.DLL will bring Virtual Machine resources online.
Note that some cluster resources can come online successfully even if they have health issues. Such resources might remain online until the cluster executes the “IsAlive” function to do a thorough check on them. When the Cluster Service starts, the only function that it instructs RHS to execute is the “Online” function. The “Online” function does not perform any health checks on the resources other than the resource dependency check. If a resource that others depend on has already been brought online by the cluster, all resources depending on it will also be brought online when the “Online” function executes.
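The dependency check performed by “Online” can be sketched as a topological ordering: a resource is only brought online after everything it depends on. The dependency map below is an illustrative assumption, not taken from a real cluster configuration, and `online_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative dependency map (an assumption for this sketch):
# each resource maps to the set of resources it depends on.
DEPENDS_ON = {
    "IP Address": set(),
    "Disk": set(),
    "Cluster Name": {"IP Address"},
    "File Server": {"Cluster Name", "Disk"},
}


def online_order(depends_on: dict) -> list:
    """Return one valid order in which resources could be brought online."""
    # static_order() yields each resource only after all of its dependencies.
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(online_order(DEPENDS_ON))
```

Resources with no dependency relationship between them, such as the IP Address and the Disk here, can appear in either order, which matches the observation that independent resources come online in no fixed sequence.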
- At this stage, Cluster Service provides the necessary instructions for RHS.exe processes to monitor resources by executing “IsAlive” and “LooksAlive” functions and this is where the actual check on the resources is performed.
- When the 5-second interval expires, RHS instructs the Resource DLLs to execute the “LooksAlive” call.
The first function that is executed by Resource DLL is “LooksAlive”. “LooksAlive” is a quick and lightweight health check. It is the Resource DLL that implements the necessary checks to be performed as part of the “LooksAlive” call. For example, for a Disk Resource, “LooksAlive” executed by ClusRes.DLL will perform a reservation against all disks managed by the cluster. Similarly, for Virtual Machine resources, “LooksAlive” might check whether the virtual machine is on or not.
In addition to its regular 60-second schedule, the “IsAlive” call is executed whenever the “LooksAlive” call fails. The “IsAlive” call performs a thorough check on the resource, and what that check involves depends on the resource category. For example, for disk resources, “IsAlive” issues a DIR or an equivalent command. Similarly, for SQL resources, “IsAlive” might try to connect to the SQL Server instance to ensure the service is responding to requests. For Network Name resources, the “IsAlive” call includes checking the registration status of the Network Name in the NBT table on the local node and starting the DNS registration process.
- Once the “IsAlive” call is completed, it reports the status of the resource back to RHS.exe process.
It is important to understand that every function in the cluster should report the status of its resources back to RHS.exe in a timely manner.
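The escalation from the quick check to the thorough check can be summarised in a few lines. This is a minimal sketch: `monitor_once` and the callables passed to it are hypothetical, standing in for whatever a given Resource DLL actually implements.

```python
from typing import Callable


def monitor_once(looks_alive: Callable[[], bool],
                 is_alive: Callable[[], bool]) -> str:
    """Run one monitoring pass and return the status reported to the host."""
    if looks_alive():
        # The lightweight check passed, so the thorough check is skipped.
        return "healthy"
    # The lightweight check failed: confirm with the thorough check
    # before declaring the resource failed.
    return "healthy" if is_alive() else "failed"


if __name__ == "__main__":
    # A resource whose quick check fails but whose thorough check passes.
    print(monitor_once(lambda: False, lambda: True))
```

Either way, the function returns a status to the caller, which mirrors the requirement that every call reports back to RHS.exe rather than blocking indefinitely.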
The issue we faced was a DNS timeout during the “IsAlive” DNS registration process for the Network Name resources. The cluster log recorded the “IsAlive” timeout and the RHS termination, as shown below:
- <Date and Time> ERR [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'Cluster Name'.
- <Date and Time> ERR [RHS] Resource Cluster Name handling deadlock. Cleaning current operation and terminating RHS process.
Microsoft says that an “IsAlive” call should return a response to RHS within 300000 milliseconds, which is 5 minutes. Since “IsAlive” was taking more than 300000 milliseconds to connect to the DNS Server to update or refresh the DNS record, the cluster noticed this and terminated the RHS process.
It is important to note that the 5-minute limit is only the point at which RHS is terminated; “IsAlive” itself is expected to report the status of resources within 300 milliseconds. Waiting for 5 minutes to receive a response from “IsAlive” does not serve the purpose of achieving 99.9% availability of the applications. A cluster achieves availability of the services it hosts by reducing the time it takes to perform a cluster operation. You would not want a cluster to wait for 5 minutes before a failover event is triggered.
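The timeout behaviour described above can be modelled as a call with a deadline. This is a rough sketch using Python threads, not the actual RHS deadlock monitor: `call_with_deadline` and `slow_dns_registration` are hypothetical names, and only the 300000 millisecond figure comes from the text.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout

DEADLOCK_TIMEOUT_MS = 300_000  # figure quoted in the text (5 minutes)


def call_with_deadline(check, timeout_ms: int = DEADLOCK_TIMEOUT_MS) -> str:
    """Run a health check, giving up if it does not return in time."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return "alive" if future.result(timeout=timeout_ms / 1000) else "failed"
    except CallTimeout:
        # In the scenario from the text, this is where RHS is terminated.
        return "timed out"
    finally:
        # Do not block on the stuck call when cleaning up.
        executor.shutdown(wait=False)


def slow_dns_registration(delay_s: float = 0.3) -> bool:
    """Hypothetical check that stalls, like the DNS update in the text."""
    time.sleep(delay_s)
    return True


if __name__ == "__main__":
    # A short deadline is used here so the demonstration finishes quickly.
    print(call_with_deadline(slow_dns_registration, timeout_ms=50))
```

With the default of 300000 milliseconds, a stalled check would hold up the caller for the full five minutes before anything happens, which is the availability concern raised above.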
While the issue was related to the DNS registration process for the Network Name resource, we also had to investigate what caused the other cluster resources to fail. We found that the entire cluster monitoring operation was running under one RHS.exe process. Once that RHS.exe process was terminated as a result of the “IsAlive” call, the cluster had no other RHS.exe process running to keep the other cluster resources online.
To prove the point, we were able to bring the Disk Resources online by configuring Disk Resources to run in a separate RHS container which, in turn, implemented a separate RHS.exe process. RHS.exe process that was responsible for other resources such as Network Name and IP Address was still terminating because of the way Network Name resources are handled by the Windows Failover cluster.
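The effect of moving the Disk resources into their own RHS process can be shown with a toy placement table. The process labels `rhs-1` and `rhs-2` and the helper `surviving_resources` are made up for this sketch.

```python
from typing import Dict, List


def surviving_resources(placement: Dict[str, str],
                        terminated_process: str) -> List[str]:
    """Return the resources whose host process is still running."""
    return sorted(resource for resource, process in placement.items()
                  if process != terminated_process)


# Default layout: every resource is monitored by the same RHS process.
SHARED = {"Cluster Name": "rhs-1", "IP Address": "rhs-1", "Disk": "rhs-1"}

# Layout after configuring the Disk resource to run in a separate process.
SEPARATED = {"Cluster Name": "rhs-1", "IP Address": "rhs-1", "Disk": "rhs-2"}

if __name__ == "__main__":
    print(surviving_resources(SHARED, "rhs-1"))
    print(surviving_resources(SEPARATED, "rhs-1"))
```

In the shared layout nothing survives the termination of the single process, while in the separated layout the Disk resource is unaffected by the Network Name failure.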
The Windows Failover cluster implements the DNS registration process as part of the “IsAlive” call to ensure Network Name resources are registered in the DNS Server. As explained above, the “IsAlive” call performs a couple of checks before it invokes the DNS registration process for the Network Name resources.
In this part, we explained how the Windows Failover cluster interacts with cluster resources through the “IsAlive” and “LooksAlive” functions implemented in the Resource DLLs.
In the final part of this article series, we will explain the configuration changes that you should avoid on a Hyper-V failover cluster.