If you would like to read the other parts in this article series please go to:
- Trench Tales (Part 2) - Troubleshooting Slow Logons
- Trench Tales (Part 3) - Apple in the Enterprise
- Trench Tales (Part 4) - More Apple in the Enterprise
- Trench Tales (Part 5) - Logon Banners
- Trench Tales (Part 6) - Keyboard Conundrums
Being able to troubleshoot problems when they arise is the true test of the IT professional. Why? Because IT systems are so complex that they often break down! So real-world IT means having the skills to take things apart, analyze them, figure out what's wrong, and put them back together so they work again. Often we need help to do this successfully, and the peer community of IT pros around the world is a terrific resource in this regard. This series of articles leverages the expertise of IT pros from around the world who have contacted me through my role as editor of WServerNews, the world's largest newsletter focused on system admin and security issues for Microsoft Windows Servers, and also through several other channels such as my connections with IT pros through my activities as a Microsoft Most Valuable Professional (MVP). All stories shared in these articles are used by permission of the individuals who submitted them to me.
To set the stage for the stories that follow, you should begin by reading my editorial Hardware Hell in the January 9, 2012 issue of WServerNews. Then read the From The Mailbag section of the January 16 issue where some newsletter readers respond by sharing their own hardware troubleshooting tips. But some of the stories readers sent me were too long to include in my Mailbag column, so here is a selection of them for your education and enjoyment.
Network Problems in an Industrial Environment
Craig from Australia shared the following story with me:
You called for hardware hell stories. Here’s my favourite war story.
Quite a few years ago I was sysadmin at a company that had a sales/admin office and a metal workshop factory out the back. They were using coaxial 10base2 ethernet (told you it was a while ago) running on three segments.
One day I get a report from the accounts dept that the workstations are locking up for about 10 seconds at random times. Of course I could never be in the accounts dept when it happened. After a while the head of accounts commented it seemed to happen when the air conditioner turned on or off. I scoffed at this initially as the computers and aircon were on completely separate power circuits – they couldn’t possibly interfere with each other.
Then, one day I was working on a machine in accounts and the computer locked up. A few seconds later the aircon came on and the computer worked normally.
After much investigation we finally tracked it down to a faulty thermostat on the air conditioning unit. One of the co-ax cables ran near the thermostat. Instead of switching on or off cleanly, the thermostat would rapidly flick between on and off for, you guessed it, about 10 seconds. Replaced the thermostat and the problem went away.
So an annoying IT problem was caused by an air conditioning thermostat. Technically a hardware problem but nothing to do with the network.
What are some lessons we can learn from Craig's story?
- Sometimes IT problems have nothing to do with IT.
- Electrical equipment like A/C units can sometimes cause network issues.
- Even shielded coax cabling may not be enough in some environments.
Keep a Magnifying Glass Handy
Gary told me this story about a server he built for a customer:
Here’s one of my nightmare stories:
A couple of years ago, I put together a Win 2003 Server and several PCs for a Dentist. Loaded the OS, all programs that the office needed, and then configured the server. Everything tested out fine. Then within 15 minutes I got erratic behavior on the Server, couldn’t see it on the network, all kinds of goofiness. It not only slowed down, but almost every function, including video and NIC, was affected. I took out each replaceable piece, including memory, replaced them one by one and then rebooted. Sometimes it worked, sometimes not.
Maddening when you think you found the problem, only to find out you didn’t. Anyway, I checked and rechecked everything, including the CPU. NO DIFFERENCE!!! AAHHH! More grey hair and now a bald spot!
1-2 hours later, I re-examined that CPU as the Server was running, and I noticed that the CPU was loose. I shut down and un-mounted the CPU, turned it upside down and noticed that the cooling compound wasn’t looking right. I examined the posts and one of them had a defect that turned into a hairline crack! This prevented complete contact with the cooling fan and overheated the CPU, which in turn affected every function on the Server. Needless to say, I took the board back and got a replacement. It was obvious to the vendor that it was a manufacturing defect.
Lessons learned from Gary's story:
- It's not always your fault--it could be a manufacturing defect in one of the components or even the motherboard.
Be Methodical with Your Troubleshooting
Tony from the UK shared a couple of stories about troubleshooting system hardware:
This is one I had a few weeks ago. An HP desktop (minitower) had died after a couple of years’ use. As a matter of routine, when things don’t boot, the first thing I do is to remove and reseat the memory, thus wiping the contacts clean. I didn’t even need to get that far – one of the memory sticks was barely resting in its socket. It must have been shipped years earlier from HP like that because if it had been seated correctly, it would have been positively locked in place.
Sometimes the cause is something simple that you would think impossible. But because I go about things with at least a few basic methodical steps, I found it quickly.
But on a more serious note, I have increasingly come across the following problem, even with servers – failure to boot correctly. Disconnect ALL cables, and I mean every last cable, and leave the machine for a few minutes. Now reconnect the power cable, the mouse, keyboard and monitor but definitely NO network cables. In a surprising number of cases the machine will now boot. Some small network switches (which I will not name in case they have improved the design) regularly latched up whenever there had been a momentary power interruption.
I believe the cause to be along the following lines. As technology processes shrink, it takes less static energy, induced energy and leakage current to cause a gate to latch up and sustain itself in an incorrect way, effectively causing an entire block to malfunction. The Ethernet section on most PCs is fed from a standby power rail even when your PC is switched off (unless at the wall socket) so that WOL (Wake On LAN) will allow administrators to start up machines remotely. Once latchup has occurred, all power has to be removed and the charge allowed to leak away. Only then will the circuit function again normally.
The background to this – I used to support the G170/G171, the original colour lookup table around which the VGA standard was built. Protection against latchup on the VGA cable was the one thing our competitors were never able to copy, which was why our devices were the most reliable. I also used to work on automotive electronics, and had to protect against latchup anywhere by design. I remember that, owing to a design error at one company, the microcode ROM that held the instructions for the CPU operation had been inadvertently designed with 1s for 0s and 0s for 1s. One of our engineers found that if you fed certain pins with 5v and others with 4.3v, the sense amplifiers could be made to operate incorrectly and invert when they should not, thus enabling us to temporarily make the part work, at least for initial testing.
Ethernet cabling often ends up linking equipment on different power supplies – so that offices and the server room, or even different offices, may be on different phases of the mains supply, and if these are imbalanced, there can be net flows along the return cabling which generate potential differences capable of causing latchup.
In an R&D project on protecting smart cards ten years ago, we demonstrated that the light from a photoflash gun could affect a chip for a short time, and by replacing the Xenon tube with a coil, we could “fix” SRAM cells for several days before the charge finally leaked away.
With decreasing process geometries, it takes less energy and leakage to cause latchup. If you are an old wrinkly like me (UK speak for getting on a bit) then you might know about the problems with “hum” on hi-fi equipment when there were multiple earth paths. The same thing can happen with Ethernet, but instead of getting 50/60Hz induced hum, the problem is with these unshielded cables picking up transients e.g. from inadequately suppressed sources. In theory, the twisted pairs used in Ethernet pick up the same signal so the difference between the two is zero, but the whole pair moves with respect to a nominal ground. This is similar to the old problem of common mode.
Some lessons learned from Tony's stories:
- Be methodical in your troubleshooting procedures--the simplest problem is often the culprit.
- When disconnecting all sources of power from a system, disconnect the network cable as well.
- The way your building's electrical systems are architected can affect your LANs.
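Tony's remark that a twisted pair picks up the same interference on both wires, so the difference stays clean while the whole pair moves relative to ground, can be made concrete with a small numerical sketch. The Python below is only an illustration under simplifying assumptions: the voltages, the noise samples and the 10-volt tolerance are hypothetical values chosen for the example, not figures from Tony's story or from any Ethernet specification.

```python
def differential(v_plus, v_minus):
    """What a differential receiver sees: the difference between the two wires."""
    return [p - m for p, m in zip(v_plus, v_minus)]


def common_mode(v_plus, v_minus):
    """How far the pair as a whole has moved relative to the receiver's ground."""
    return [(p + m) / 2 for p, m in zip(v_plus, v_minus)]


def samples_over_limit(offsets, limit_volts):
    """Indexes of samples whose common-mode offset exceeds a given tolerance."""
    return [i for i, v in enumerate(offsets) if abs(v) > limit_volts]


# Hypothetical data: the intended signal and a transient coupled equally
# onto both wires of the pair (all values in volts, for illustration only).
signal = [1.0, -1.0, 1.0, 1.0, -1.0]
noise = [0.0, 5.0, -3.0, 12.0, 0.5]

# Each wire carries half the signal (with opposite sign) plus the same noise.
v_plus = [s / 2 + n for s, n in zip(signal, noise)]
v_minus = [-s / 2 + n for s, n in zip(signal, noise)]

recovered = differential(v_plus, v_minus)   # the noise cancels out
offset = common_mode(v_plus, v_minus)       # the noise reappears here

print("recovered signal:", recovered)
print("common-mode offset:", offset)
# 10.0 is an assumed tolerance, not a real device rating.
print("samples over assumed limit:", samples_over_limit(offset, 10.0))
```

The difference between the wires reproduces the original signal exactly, yet the common-mode offset tracks the interference. That offset is the part the twisting does nothing to remove, and it is what the input circuitry at each end has to tolerate.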
System Won't Resume From Sleep
Finally, David from Ottawa, Canada shares this story:
About this time last year, I decided to build a new Intel Sandy Bridge system to replace my AMD Phenom II computer. I waited for the B3 revisions of the P67 chipset motherboards to make it to market then picked out the components for this new computer which included an Intel Core i7 2600K CPU, Asus P8P67 Deluxe motherboard, 16 GB Kingston DDR3 1600 MHz RAM (4 DIMMs of 4 GB each) and an Antec True Power 650 watt PSU.
Assembled the system and everything worked well with one exception. The system would not resume from sleep after an extended period in this state. The system would power up then shut down then try to power up again but would often hang at this point. I would have to hold down the power button to shut off the system, then press it again to start the computer. At this point Windows 7 would resume from the hibernation file. Interesting thing was that this problem did not happen if the computer was resumed from sleeping after a short period of 10 minutes or so. Also noted that my Logitech USB webcam would have to be reconnected before it would work properly when the system resumed.
Assuming the motherboard is the problem, I take the system back to the store where I purchased the parts and they exchange the motherboard for me. Wake from sleep problem improves but still happens often. Begin searching the net and find that many people are having this problem with boards based on the Intel P67 chipset. Discussions in the forums indicate that it is a known issue and advise waiting for an EFI BIOS update to fix it.
I eagerly await each EFI update to fix this issue but it never happens. In fact, the last update I applied makes the situation worse. Now I focus my attention on the DIMMs. I try running different combinations of the Kingston DIMMs in the system without any improvement. The store I use has a one-weekend special on 4 GB and 8 GB Mushkin DDR3 1333 MHz RAM kits. They are sold out of the 8GB kits so I pick up a 4GB kit for $20. Install this RAM into the system and am able to resume the system after it has been sleeping overnight.
Now I think I have found the problem but I want 16 GB of RAM. Two weeks later I notice the store has the 8GB kits back in stock. I return with the 4GB kit and ask if I can make an exchange. This store, Canada Computers (http://www.canadacomputers.com) which I highly recommend, takes back the 4GB kit and I then purchase and leave with two 8GB kits. I put these into the computer and I have the same problems again.
Over the December holiday season, I decide to take an Antec NEO ECO 620 watt PSU out of my older AMD system and try it on this Core i7 build. I put the system into sleep mode that night then go to bed. Next morning I go to the computer, press a button on the keyboard to wake it up and it works and this has continued to work without fail since changing the power supply. This also resolves the problem with the webcam.
I received an RMA replacement on the Antec True Power PSU today and I hope it was just one bad PSU and not a problem with the model itself.
The power supply component is the foundation and heart of any computer. Every component in a computer is connected to this device and a bad power supply will certainly result in a problematic and unstable computer. For this reason I always select quality name brand power supplies for the computers I build, which is also the reason why I did not suspect the power supply for the longest time. This experience, however, has reminded me that even the top power supply brands are not always 100% fault free.
The moral of David's story? PC power supplies are the root of all evil!