If you would like to read the other parts in this article series please go to:
- Preserving server hardware (Part 1)
- Preserving server hardware (Part 2)
- Preserving server hardware (Part 3)
In the first article of this series we looked at different types of airborne particulates and how they can affect the health of small business servers, PCs, and laptops used in business environments, and then examined some solutions for keeping such systems from being damaged by them. Unfortunately these protective steps don't always work as well as expected, so in the second article we followed up with tips and recommendations I've gleaned over the years from my colleagues and from readers of our weekly newsletter WServerNews.com on how to safely clean a server or client system that's gotten gunked up with dust, hair and other stuff floating around the air of the typical cubicle office, server room, or dingy hotel room. In the third article we turned to overheating as it applies to servers, PCs and laptops, examining some of its causes and some of the problems that can result from it. This article continues that theme: we'll look at vendor tools and a few third-party tools for identifying when system hardware is overheating, and then describe what you can do when it does.
Vendor tools for identifying when system hardware is overheating
If your server, PC or laptop was purchased from a major vendor like Dell or HP you should have access to vendor tools that can help you identify whether your system hardware has been overheating. For example, Dell provides various online and downloadable diagnostic tools and tests you can use or run to check whether your system might be overheating. This page in the Dell Knowledge Base provides access to an online diagnostic tool you can run on a Windows-based PC or laptop. Also on that page you will find information about offline diagnostic tests that Dell systems can perform in order to diagnose and isolate hardware-related issues that can prevent your PC or laptop from successfully booting into Windows.
For Dell servers managed with Dell OpenManage, a hardware management solution that helps administrators manage Dell PowerEdge servers, you can use OpenManage Server Administrator in the host operating system to view fan status and temperature readings for the server. For remotely managed Dell PowerEdge servers you can use the GUI of your Integrated Dell Remote Access Controller (iDRAC) to view the same information and to take remedial steps when possible.
Overheating and other potential health issues relating to HP servers can be monitored and identified by using HPE Integrated Lights Out (iLO), a solution available from Hewlett Packard Enterprise that uses embedded server management technology built into HP ProLiant and BladeSystem servers. Version 4 of iLO lets you view detailed health information about your servers via any web browser or even on your smartphone by using the iLO Mobile App which can be downloaded for the Android platform from the Google Play Store and for iPhones from Apple iTunes. HP's iLO is basically the HP counterpart to Dell's iDRAC solution.
HP also has other solutions you can use for monitoring temperature and fan speed of server systems including those running Linux instead of Windows Server. For example, if you have the HP ProLiant Support Pack installed on a ProLiant server that has Linux installed, you can use the hpasmcli command by running /sbin/hpasmcli -s "show temp" to display a variety of different temperature points within your server including CPU temperatures, memory stick temperatures, and ambient server system temperature. Mark Nellemann has an example of a script he wrote for parsing the output of the hpasmcli command in this post on his blog.
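As a rough illustration of what parsing this command's output can look like, here is a minimal Python sketch. It is not Mark Nellemann's script. The column layout in SAMPLE is an assumption about what hpasmcli typically prints and may differ between versions, and the helper names (parse_show_temp, near_threshold, read_hpasmcli) and the 10-degree margin are hypothetical choices made for this example.

```python
import re
import subprocess

# Matches rows such as "#2        CPU#1                75C/167F   82C/179F".
# The column layout is an assumption about typical hpasmcli output.
ROW = re.compile(
    r"^#(?P<sensor>\d+)\s+(?P<location>\S+)\s+"
    r"(?P<temp>\d+)C/\d+F\s+(?P<limit>\d+)C/\d+F\s*$"
)

def parse_show_temp(text):
    """Return a list of (sensor, location, temp_c, threshold_c) tuples."""
    readings = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # header rows and sensors reporting "-" are skipped
            readings.append((
                int(match["sensor"]),
                match["location"],
                int(match["temp"]),
                int(match["limit"]),
            ))
    return readings

def near_threshold(readings, margin_c=10):
    """Return readings within margin_c degrees Celsius of their threshold."""
    return [r for r in readings if r[3] - r[2] <= margin_c]

def read_hpasmcli():
    """Run the command from the article; needs hpasmcli installed and root access."""
    result = subprocess.run(
        ["/sbin/hpasmcli", "-s", "show temp"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Illustrative sample only, not captured from a real server.
SAMPLE = """\
Sensor   Location              Temp       Threshold
------   --------              ----       ---------
#1        AMBIENT              21C/69F    42C/107F
#2        CPU#1                75C/167F   82C/179F
#3        MEMORY_BD            -          -
"""

if __name__ == "__main__":
    for sensor, location, temp_c, limit_c in near_threshold(parse_show_temp(SAMPLE)):
        print(f"Sensor #{sensor} ({location}): {temp_c}C, threshold {limit_c}C")
```

On a real ProLiant server you would pass the result of read_hpasmcli() to parse_show_temp() instead of SAMPLE, and adjust the regular expression if your version of the tool formats its columns differently.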
Third-party tools for identifying when system hardware is overheating
A popular third-party option for monitoring not only temperature but other critical aspects of your server hardware is ServersCheck, an open, scalable, modular server monitoring solution. It is a complete monitoring system that includes base units, a monitoring platform, and a broad range of sensors that can monitor environmental factors (like temperature) as well as power and security. ServersCheck can be used standalone, or you can integrate it into your systems management system or even your building management system. For example, to use ServersCheck to monitor the temperature of a particular blade server in a rack-mounted system, you could attach a USB temperature probe to the wall of the enclosure in the immediate vicinity of the blade server. ServersCheck is used by several large companies I've had contact with, and a number of my colleagues who work in IT consulting and system administration have recommended it.
Another popular solution for monitoring the temperature of server systems comes from SolarWinds. For example, you can proactively monitor HP servers for possible overheating problems as well as issues relating to fan speed, power supply, and more by using SolarWinds Server & Application Monitor.
Since most CPUs today include built-in functionality for monitoring temperature and other parameters like voltage and fan speed, and most hard drives support S.M.A.R.T., which allows hard drive temperatures to be monitored, you may simply want to use a dedicated utility instead of a complete systems monitoring platform to keep a close watch on the temperature of your server, PC or laptop. SpeedFan is one popular program that lets you view the temperature of your motherboard and disks, view voltages and fan speeds, modify fan speeds, and more.
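If your system runs Linux rather than Windows, the same built-in sensors are usually exposed by the kernel under /sys/class/hwmon, where each temp*_input file holds a reading in thousandths of a degree Celsius. The following is a minimal Python sketch of reading them directly, not a replacement for a utility like SpeedFan. The function names and the 80-degree limit are hypothetical choices for this example, and which sensors show up depends on your hardware and loaded drivers.

```python
from pathlib import Path

def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {label: degrees Celsius} for every readable temp*_input file."""
    temps = {}
    for sensor_file in sorted(Path(root).glob("hwmon*/temp*_input")):
        chip_dir = sensor_file.parent
        name_file = chip_dir / "name"
        # Fall back to the directory name if the driver exposes no "name" file.
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        channel = sensor_file.name[: -len("_input")]  # e.g. "temp1"
        try:
            millidegrees = int(sensor_file.read_text().strip())
        except (OSError, ValueError):
            continue  # some sensors exist but cannot be read
        temps[f"{chip_dir.name}:{chip}/{channel}"] = millidegrees / 1000.0
    return temps

def over_limit(temps, limit_c=80.0):
    """Return only the readings above limit_c (an arbitrary example limit)."""
    return {label: value for label, value in temps.items() if value > limit_c}

if __name__ == "__main__":
    for label, value in read_hwmon_temps().items():
        print(f"{label}: {value:.1f} C")
```

The root directory is a parameter so the function can be pointed at a test directory. A sensible limit varies by component, so check the vendor's specifications before alerting on any particular value.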
What you can do when your system hardware overheats
Of course, the first thing you should do if you suspect that your system's strange behavior is the result of overheating is to eliminate any visible direct causes. For example, you might take some simple remedial action like moving your system into a cooler room, turning up the room's air conditioner, or moving the device out of direct sunlight. Or if your problem isn't hardware or software but is more "wetware" related, you might consider chasing away the cat that's sleeping on top of your server!
There are also some additional steps you can follow if you've purchased your system from a major vendor like Dell or HP since these vendors use BIOS-enabled features that allow certain hardware failures to be logged to the BIOS of your system. For example, if you suspect that your Dell system is failing to boot or behave properly due to overheating, and you've already taken all the usual remedial steps as described above, start by checking the System Event Log in the BIOS of the affected system. Some messages that the BIOS in a Dell system might display as a result of overheating can include error messages relating to heatsinks, fans, or air temperature, for example the message "Alert! Air Temperature Sensor Not Detected."
If your system has failed because of a hardware component failure, you may need to replace that component. The best course of action in such cases is usually to contact the vendor's technical support line for assistance and describe the problem and the steps you've already taken to try and identify it. Unfortunately with the quality of technical support declining because of vendor cost-cutting, you may be asked by the support technician to perform a series of tests that may include steps that you've already taken. Large-volume customers generally have it best here as they usually have sufficient leverage with the vendor to quickly bump the problem up from Tier 1 to Tier 2 or 3 support personnel who are more responsive and knowledgeable.
If your system shut down unexpectedly or you had to perform a hard shutdown (i.e. pull the plug) because the operating system froze, you should perform the following steps before attempting to restart the system:
- Disconnect all external cables (including the power cable).
- Hold down the power button for at least 5 seconds so any remaining internal power (e.g. in capacitors) can dissipate.
- Reconnect the power cable and any other external cables.
- Restart the system.
Finally, note that if your system is randomly freezing or shutting down and you've determined that the BIOS is out of date, do not attempt to flash the BIOS until the cause of the freezing or shutdowns has been identified and resolved. And if your BIOS is out of date by more than one revision, best practice is to apply each missing BIOS update in order. That is, don't jump over several BIOS updates or you might end up with an unbootable server.
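To make the "apply each missing update in order" advice concrete, here is a small Python sketch that works out the sequence of releases between the installed BIOS and the newest one. It assumes purely numeric dotted version strings such as 1.4.2. Real vendor numbering varies, so the version format, the function names, and the sample version list are all hypothetical, and the vendor's release notes remain the authority on which intermediate versions are actually required.

```python
def parse_version(text):
    """Turn a dotted version such as '1.4.2' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def update_path(installed, available):
    """Return every release newer than `installed`, oldest first."""
    current = parse_version(installed)
    newer = {v for v in available if parse_version(v) > current}
    return sorted(newer, key=parse_version)

if __name__ == "__main__":
    # Hypothetical version list for illustration only.
    releases = ["1.5.0", "1.3.1", "1.2.0", "1.10.0", "1.1.0"]
    for step, version in enumerate(update_path("1.2.0", releases), start=1):
        print(f"Step {step}: flash BIOS {version}")
```

Comparing tuples of integers rather than raw strings matters here: as plain text, 1.10.0 would sort before 1.5.0 and the updates would be applied out of order.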