Planning for High Availability: Attempting to Eliminate Downtime

Published on 18 Aug. 2004 / Last Updated on 18 Aug. 2004

To ensure that systems are available when called upon, you have to plan for the failure of the very resources being requested. Hard drives have a mean time between failures (MTBF), viruses go undetected, you name it: disaster can happen. The only way to keep systems available to those who request them is to deploy those systems in a manner designed to survive failure.

Whether with disk arrays or clustered servers, planned redundancy is always a good bet when designing for high availability. In this article we will look at what you need to know to plan a high availability solution that will keep services online and available to those who need them… and depend on them.

High availability takes work and effort up front. Taking the time to plan and design is the key to maximizing the chance of a successful deployment, and the same care must go into the design process itself. High availability design is often complex and requires expertise in a great many areas of IT to get it right, so the team that plans the solution tends to be a diverse one. So, let's take a look at why you need high availability and what you can do to plan for it. In this article we will look at how to plan for downtime, how to build a plan, how to manage your services, how to assess your systems, and how to test your plan. Remember, high availability assures uptime, and uptime may be your business; when you weigh the costs of implementing a highly available solution, one failure that causes serious downtime may be all it takes to have paid for the solution in the first place.



Plan Your Downtime

You need to achieve as close to 100 percent uptime as possible. We all know this is not completely realistic; things break. Breakdowns occur because of disk crashes, power or UPS failures, application problems that crash the system, or any other hardware or software malfunction. You could make a list a mile long of all the things that can go wrong with a computer system! So the next best thing is 99.999 percent ("five nines"), which is achievable with today's technology. You can (with enough cash) build redundancy into just about anything these days: RAID (redundant array of inexpensive disks), for example, or a pair of clustered servers as seen in Figure 1.


Figure 1: A Simple 2-Node Cluster

A clustered solution helps you minimize downtime because it is built around planned failure! A client (10.1.1.3) wants to access data from a database on a server that appears as 10.1.1.1. This is the VIP (virtual IP), which creates a transparent solution for the client: one address to access, when in reality both servers stand ready behind it. If one server malfunctions, the other takes over in its place. The database itself can live on a shared RAID configuration that is also highly available. Solutions like this, planned correctly, help you maximize uptime and cope with downtime far less often.
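The failover decision described above can be sketched in a few lines. This is only an illustrative model, using the hypothetical addresses from Figure 1 and a plain TCP connection test as the health check; a real cluster uses dedicated heartbeat links and shared-state arbitration rather than anything this simple:

```python
import socket

# Hypothetical addresses based on Figure 1 (illustrative only).
PRIMARY = ("10.1.1.1", 1433)   # node currently answering on the VIP
STANDBY = ("10.1.1.2", 1433)   # passive node ready to take over

def is_alive(addr, timeout=2.0):
    """Crude health check: can we open a TCP connection to the service port?"""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def pick_active_node():
    """Return the node that should be answering requests behind the VIP."""
    if is_alive(PRIMARY):
        return PRIMARY
    return STANDBY   # failover: the standby takes over transparently
```

The point is simply that the client only ever sees the VIP; which physical server actually answers is the cluster's business, not the client's.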

SLAs and Five Nines

You can also define in a Service Level Agreement (SLA) what 99.999 percent means to both parties: what you would request of a service provider, and what you would be expected to deliver as a service provider. I sometimes consider myself a service provider when I deploy a server; the name "server" implies its purpose, and clients will demand its services nonstop until it does, in fact, fail. If you promise 99.999 percent uptime for a single year, that translates to roughly five minutes of downtime. Since this can be very hard to accomplish, it may make sense to strive for an uptime target that is more realistic given scheduled outages and the disaster-recovery testing your staff performs. Things eventually have to be replaced or worked on, and five minutes a year is just not going to cut it. Something more reasonable for your situation may be closer to 99.9 percent uptime, which allows for roughly nine hours of downtime per year. This is far more practical to attain. Whether providing or receiving such a service, both sides should test planned outages to see whether delivery schedules can be met.
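The arithmetic behind those downtime allowances is easy to check. As a quick sketch, multiply the hours in a year by the complement of the uptime target:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a year

def allowed_downtime_hours(uptime_percent):
    """Hours of downtime per year permitted by a given uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0

print(allowed_downtime_hours(99.999) * 60)  # five nines: about 5.3 minutes/year
print(allowed_downtime_hours(99.9))         # three nines: about 8.8 hours/year
```

This is why five nines is such a demanding promise: a single reboot can blow the whole year's budget.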

You can work out this figure by taking the number of hours in a day (24) and multiplying it by the number of days in a year (365), which equals 8,760 hours in a year. Use the following equation:

% of uptime per year = ((8,760 - total hours down per year) / 8,760) × 100

If you schedule eight hours of downtime per month for maintenance and outages (96 hours total), then the percentage of uptime per year is 8,760 minus 96, divided by 8,760: about 98.9 percent. This gives you an easy way to provide an accurate accounting of your downtime.
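As a sketch, the same equation in code, using the 96-hour example above:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a year

def uptime_percent(hours_down_per_year):
    """Percentage of uptime per year, per the equation above."""
    return (HOURS_PER_YEAR - hours_down_per_year) / HOURS_PER_YEAR * 100.0

# Eight hours of maintenance per month is 96 hours per year:
print(round(uptime_percent(96), 1))  # 98.9
```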

Remember, you must account for downtime accurately when you plan for high availability. Downtime can be planned or, worse, unexpected. Very common and often overlooked sources of unexpected downtime include the following:

  • Disk crash or failure
  • Power or UPS failure (or power backup nonexistence)
  • Application problems resulting in system crashes
  • Any other hardware or software malfunction (bugs/glitches)

Make sure you plan accordingly and consider your options on planned downtime and what you can reasonably deliver if you are a service provider, or what you expect from your service provider.

Building the Highly Available Solution's Plan

Before we get knee deep into the 'official' plan, let's look at why you would want one. Consider the following scenario: a simple server crash, and what it costs you and your company. Here is what could happen, in sequence:

  1. A company uses a server to access an application that accepts orders and does transactions. (this translates to ‘how the company collects cash from customers’)
  2. The application, when running, serves not only the sales staff but also three other companies that do business-to-business (B2B) transactions. At peak, an estimated 2.5 million dollars moves through it in a single hour.
  3. The server crashes and you don’t have a High Availability solution in place.
  4. This means no failover, redundancy, or load balancing exists at all. It simply fails.
  5. It takes 5 minutes for you (the systems engineer) to be paged and about 15 minutes to get onsite. You then take 40 minutes to troubleshoot and resolve the problem. One hour, total, and that is being very conservative.
  6. The company’s server is brought back online and connections are reestablished. The system is tested and deemed physically and logically fit.

Everything appears functional again. The problem was simple this time—a simple application glitch that caused a service to stop and, once restarted, everything was okay.

Now, the problem with this whole scenario is this: although it was a true disaster, it was also a simple one. The systems engineer happened to be nearby and diagnosed the problem quickly. Even better, the fix was easy. Yet this easy problem still took the companies' shared application down for at least one hour and, had it hit during a peak period, over 2 million dollars could have been lost. Wish you had that highly available solution, huh? How much money would your company have to lose before it would have paid for the redundancy, the staff, and their lunches for a year by being proactive? Make no mistake: high availability is based on proactive thinking. You are planning for disaster so you will not have to react to it once it occurs.
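The break-even argument is easy to put into numbers. The figures below are hypothetical, combining the scenario's 2.5-million-dollar peak hour with an assumed price tag for the HA solution:

```python
def outages_to_break_even(solution_cost, revenue_per_hour, outage_hours):
    """Number of outages whose lost revenue would equal the HA solution's cost."""
    loss_per_outage = revenue_per_hour * outage_hours
    return solution_cost / loss_per_outage

# Hypothetical: a $500,000 HA solution vs. one-hour outages at the
# scenario's $2.5M peak revenue per hour.
print(outages_to_break_even(500_000, 2_500_000, 1.0))  # 0.2
```

A result below 1.0 means a single peak-time outage would more than pay for the entire solution.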

Another cost is losing customer or vendor faith in your company. The companies you connect to and do business with, as well as your own clientele, may start to lose faith in your ability to serve them if your web site is inaccessible or defaced, your database is corrupted, or your ERP application is down and holding up their business. That can cost you revenue and the chance of acquiring new clients going forward. People talk, and the uneducated may take a small glitch as a major problem with your company's people instead of its technology.

Let’s look at this scenario again, except with a Highly Available solution in place:

  1. A company uses a server to access an application that accepts orders and does transactions.
  2. The application, when running, serves not only the sales staff but also three other companies that do business-to-business (B2B) transactions. At peak, an estimated 2.5 million dollars moves through it in a single hour.
  3. The server crashes, but you do have a Highly Available solution in place. (Note, at this point, it doesn’t matter what the solution is. What matters is that you added redundancy into the service.)
  4. Server and application are redundant, so when a glitch takes place, the redundancy spares the application from failing.
  5. Customers are unaffected. Business resumes as normal. Nothing is lost and no downtime is accumulated.
  6. The one hour you saved your business in downtime just paid for the entire Highly Available solution you implemented.

A plan built around proactive design and the use of redundant, resilient services can help you avert most disasters.

Managing Your Services

In this section, you will see the factors to consider while designing a highly available solution; these span the main management areas an organization or department puts in place to plan for and maintain IT services. One such area is fault management, one of the five categories of network management defined by the ISO (International Organization for Standardization). Fault management ensures that faults are detected and sets a process in place to record, monitor, and maintain them for correlation. It encompasses fault detection, isolation, and the correction of abnormal operation, and includes functions to maintain and examine error logs, accept and act upon error-detection notifications, trace and identify faults, carry out sequences of diagnostic tests, and correct faults. Beyond fault management, you should also understand and at least consider the other areas in the ISO Network Management Model, which contains:

  • Performance Management
  • Configuration Management
  • Accounting Management
  • Fault Management
  • Security Management

The following is a list of the main services to remember and consider when planning for high availability:

  • Change Management is crucial to the ongoing success of the solution during the production phase. This type of management is used to monitor and log changes on the system.
  • Problem Management addresses the process for Help Desks and Server monitoring.
  • Security Management is tasked to prevent unauthorized penetrations of a system.
  • Performance Management addresses the overall performance of the service, availability, and reliability.

Service Management is not part of the ISO model, but it should also be considered. Service Management is the management of the true components of highly available solutions: the people, the processes in place, and the technology needed to create the solution. Keeping these in balance is important to a truly viable solution. Service Management spans the design and deployment phases and is crucial to the development of your highly available solution. Cater to your customers' demands for uptime: if you promise it, you had better deliver it.

Highly Available System Assessment Ideas

The following is a list of items to consider during the postproduction planning phase. Make sure you've covered all your bases:

  • Now that you have your solution configured, document it! A lack of documentation will surely spell disaster. Documentation isn't difficult, simply tedious, but all that work pays off when you need it in a hurry. Documentation is a lifesaver in a disaster.
  • Train your staff. Make sure your staff has access to a test lab, books to read, and advanced training classes. Go to free seminars to learn more about High Availability. If you can ignore the sales pitch, they’re quite informative.
  • Test your staff with incident response drills and disaster scenarios. Written procedures are important, but live drills are even better to see how your staff responds. Remember, if you have a failure on a system, it could failover to another system, but you must quickly resolve the problem on the first system that failed. You could have the same issue on the other nodes in your cluster and if, that’s the case, you’re on borrowed time. Set up a scenario and test it.
  • Assess your current business climate, so you know what’s expected of your systems at all times. Plan for future capacity especially as you add new applications, and as hardware and traffic increase.
  • Revisit your overall business goals and objectives. Make sure what you intend to do with your high-availability solution is being provided. If you want faster access to the systems, is it, in fact, faster? When you have a problem, is the failover seamless? Are customers affected? You don’t want to implement a high-availability solution and have performance that gets worse. This won’t look good for you!
  • Do a data-flow analysis on the connections your high availability solution uses. You'd be surprised at the effect that damaged NICs, wrong drivers, excessive protocols, bottlenecks, and mismatched port speeds or duplex settings, to name a few problems, have on a system. I've made significant speed differences in networks simply by analyzing the data flow on the wire. A good example: suppose you had old ISA-based NICs that only run at 10 Mbps. Plugged into a 100 Mbps switch port that auto-negotiates, the system would still run at only 10 Mbps, because that's as fast as the NIC will go. But what if the switch port were hard-set to 100 Mbps instead of auto-negotiate? The NIC wouldn't communicate on the network at all because of the speed mismatch. Issues like this are common and could quite possibly be the reason for poor or no data flow on your network.
  • Monitor the services you consider essential to operation and make sure they're always up. Never assume a system will run flawlessly just because nothing has changed; at times, systems choke on themselves, whether from a hung thread or a hung process. You can use network-monitoring tools like GFI Network Server Monitor to watch such services.
  • Assess your total cost of ownership (TCO) and see whether it was all worth it. In other words, earlier in this article you saw how highly available solutions can save money for your business.

So, having implemented highly available solutions, did they save your business money? Do the final cost analysis to check whether you made the right decision. The best way to determine TCO is to go online and use a TCO calculator that computes it from your own unique business model. Because business models differ, run the calculator and figure TCO from your own answers to its questions; there are TCO calculator links at the end of this article. This should give you a good running start on advanced planning for high availability, and plenty to check and think about once your implementation is done.
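To make the earlier monitoring item concrete, here is a toy service poller, a minimal stand-in for a product like GFI Network Server Monitor. The host/port list is hypothetical, and a plain TCP connect stands in for the "is it up?" test:

```python
import socket

# Hypothetical essential services to watch: (host, port, name).
SERVICES = [
    ("10.1.1.1", 1433, "database"),
    ("10.1.1.1", 80, "web front end"),
]

def check_services(services, timeout=2.0):
    """Poll each service once; return a {name: up?} report."""
    status = {}
    for host, port, name in services:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[name] = True
        except OSError:
            status[name] = False
    return status
```

Run something like this from a scheduler and alert on any False entries; a real monitoring product adds history, escalation, and service-specific checks on top of the same idea.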

Testing a High Availability System

Now that you have the planning and design fundamentals down, let's discuss the process of testing your high-availability systems. Run the test long enough to get a solid sample of how the system operates normally, without stress (or activity), and how it runs under load. A test of sufficient length gives you a solid baseline of how your systems behave on a normal day; use that baseline for comparison during periods of activity.

Summary

This article covered the fundamentals of high availability and highly available systems: why you may want them and how they can help you minimize (or quite possibly eliminate) most of the system downtime you may be experiencing now or may experience in the future. Remember that high availability planning is just that: a lot of planning and proactive work. But when implemented properly, it can really help minimize failures and keep the business up and running, as it should be. For more information on planning highly available systems and networks, please check out Windows 2000 & Windows Server 2003 Clustering & Load Balancing, by Robert J. Shimonski. You can also post questions about this article to the author in this website's 'General Discussion' forum.

Reference and Links

International Organization for Standardization
http://www.iso.org

GFI Network Server Monitor
http://www.gfi.com/nsm/

TCO Calculators
http://www.microsoft.com/business/reducecosts/efficiency/consolidate/tco.mspx
http://www.oracle.com/ip/std_infrastructure/cc/index.html?tcocalculator.html
