The Science of Network Troubleshooting
By stretch | Wednesday, March 10, 2010 at 4:19 a.m. UTC
A number of people have written asking me what happened to a paper I wrote back in 2008 entitled "The Science of Network Troubleshooting." Unfortunately, I neglected to republish the paper after revamping packetlife.net in late 2009, so here it is again as a blog article.
Troubleshooting is not an art. Along with many other IT methodologies, it is often referred to as an art, but it's not. It's a science, if ever there was one. Granted, someone with great skill in troubleshooting can make it seem like a natural talent, the same way a professional ball player makes hitting a home run look easy, when in fact it is a learned skill. Another common misconception holds troubleshooting as a skill derived entirely from experience with the involved technologies. While experience is certainly beneficial, the ability to troubleshoot effectively arises primarily from the embrace of a systematic process, a science.
It's said that troubleshooting can't be taught, but I disagree. More accurately, I would argue that troubleshooting can't be taught easily, or in great detail. This is because traditional education encompasses how a technology functions; troubleshooting encompasses all the ways in which it can cease to function. Given that it's virtually impossible to identify and memorize all the potential points of failure a system or network might hold, engineers must instead learn a process for identifying and resolving malfunctions as they occur. To borrow a clichéd analogy, teach a man to identify why a fish is broken, rather than expecting him to memorize all the ways a fish might break.
Troubleshooting as a Process
Essentially, troubleshooting is the correlation between cause and effect. Your proxy server experiences a hard disk failure, and you can no longer access web pages. A backhoe digs up a fiber, and you can't call a branch office. Cause, and effect. Moving forward from cause to effect, the correlation is obvious; the difficulty lies in working backward from effect to cause, and this is troubleshooting at its core.
Consider walking into a dark room. The light is off, but you don't know why. This is the observed effect for which we need to identify a cause. Instinctively, you'll reach for the light switch. If the light switch is on, you'll search for another cause. Maybe the power is out. Maybe the breaker has been tripped. Maybe someone stole all the light bulbs (it happens). Without much thought, you investigate each of these possible causes in order of convenience or likelihood. Subconsciously, you're applying a process to resolve the problem.
Even though our light bulb analogy is admittedly simplistic, it serves to illustrate the fundamentals of troubleshooting. The same concepts are scalable to exponentially more complex scenarios. From a high-level view, the troubleshooting process can be reduced to a few core steps:
- Identify the effect(s)
- Eliminate suspect causes
- Devise a solution
- Test and repeat
Step 1: Identify the Effect(s)
If you've been a network engineer for more than a few hours, you've been told at least once that the Internet is down. Yes, the global information infrastructure some forty years in the making has fallen to its knees and is in a state of complete chaos. All this is, of course, confirmed by Mary in accounting. Last time it was discovered her Ethernet cable had come unplugged, but this time she's certain it's a global catastrophe.
Correctly identifying the effects of an outage or change is the most critical step in troubleshooting. A poor judgment at this first step will likely start you down the wrong path, wasting time and resources. Identifying an effect is not to be confused with deducing a probable cause; in this step we are focused solely on listing the ways in which network operation has deviated from the norm.
Identifying effects is best done without assumption or emotion. While your mind will naturally leap to possible causes at the first report of an outage, you must force yourself to adopt an objective stance and investigate the noted symptoms without bias. In the case of Mary's doomsday forecast, you would likely want to confirm the condition yourself before alerting the authorities.
Some key points to consider:
What was working and has stopped?
An important consideration is whether an absent service was ever present to begin with. A user may report an inability to reach FTP sites as an outage, not realizing FTP connections have always been blocked by the firewall as a matter of policy.
What wasn't working and has started?
This can be a much less obvious change, but it is no less important. One example would be the easing of restrictions on traffic types or bandwidth, perhaps due to routing through an alternate path, or the deletion of an access control mechanism.
What has continued to work?
Has all network access been severed, or merely certain types of traffic? Or only certain destinations? Has a contingency system assumed control from a failed production system?
When was the change observed?
This critical point is very often neglected. Timing is imperative for correlation with other events, as we'll soon see. Also remember that we are often limited to noting the time a change was observed, rather than when it occurred. For example, an outage observed Monday morning could have easily occurred at any time during the preceding weekend.
Who is affected? Who isn't?
Is the change limited to a certain group of users or devices? Is it constrained to a geographical or logical area? Is any person or service immune?
Is the condition intermittent?
Does the condition disappear and reappear? Does this happen at predictable intervals, or does it appear to be random?
Has this happened before?
Is this a recurring problem? How long ago did it happen last? What was the resolution? (You do keep logs of this sort of thing, right?)
Correlation with planned maintenance and configuration changes
Was something else being changed at this time? Was a device added, removed, or replaced? Did the outage occur during a scheduled maintenance window, either locally or at another site or provider?
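As a rough illustration of the timing and change-correlation points above, the sketch below narrows a change log down to the entries that fall between the last time the network was known to be good and the time the problem was observed. The log layout, device names, and timestamps are all invented for the example; a real shop would pull them from its own change records.

```python
from datetime import datetime

def changes_in_window(change_log, last_known_good, observed):
    """Return log entries made between the last known-good time and the
    observation time, most recent first."""
    in_window = [e for e in change_log if last_known_good <= e[0] <= observed]
    return sorted(in_window, key=lambda e: e[0], reverse=True)

# Hypothetical change log: (timestamp, device, description)
change_log = [
    (datetime(2010, 3, 5, 22, 0), "fw1", "ACL update"),
    (datetime(2010, 3, 6, 23, 30), "core-sw2", "software upgrade"),
    (datetime(2010, 3, 9, 9, 0), "rtr3", "new BGP peer"),
]

last_known_good = datetime(2010, 3, 5, 18, 0)  # Friday evening
observed = datetime(2010, 3, 8, 8, 15)         # Monday morning

suspects = changes_in_window(change_log, last_known_good, observed)
for when, device, what in suspects:
    print(when.isoformat(), device, what)
```

The window deliberately starts at the last known-good time rather than the time of the report, since an outage noticed Monday morning may have begun at any point over the weekend.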
Step 2: Eliminate Suspect Causes
Once we have a reliable account of the effect or effects, we can attempt to deduce probable causes. I say probable because deducing all possible causes is impractical, if not impossible. One possible cause is a power failure. Another possible cause is spontaneous combustion. Only one of these possible causes is probable.
There is a popular mantra of "always start with layer one," suggesting that the physical connectivity of a network should be verified before working on the higher layers. I disagree, as this is misleading and often impractical. You're not going to drive out to a remote site to verify everything is plugged in if a simple ping verifies end-to-end connectivity. Similarly, it's unlikely that any cables were disturbed if you can verify with relative certainty no one has gone near the suspect devices. Perhaps this is an oversimplified argument, but verifying physical connectivity is often needlessly time-consuming and superseded by alternative methods.
Instead, I suggest narrowing causes in order of combined probability and convenience. For example, there might be nothing to indicate DNS is returning an invalid response, but performing a manual name resolution takes roughly two seconds, so this is easily justified. Conversely, comparing a current device configuration to its week-old backup and accounting for any differences may take a considerable amount of time, but this presents a high probability of exposing a cause, so it too is justified.
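One way to make "combined probability and convenience" concrete is to divide a guessed probability by the time a check takes and work down the list. This is only one possible heuristic, not a rule from the paper, and the probabilities and durations below are invented purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Suspect:
    name: str
    probability: float      # rough guess (0..1) that this is the cause
    minutes_to_test: float  # rough guess at how long the check takes

def rank(suspects):
    """Order suspects by likelihood per minute of effort, best first."""
    return sorted(
        suspects,
        key=lambda s: s.probability / s.minutes_to_test,
        reverse=True,
    )

# Hypothetical estimates for a single outage
suspects = [
    Suspect("drive to remote site and check cabling", 0.10, 120),
    Suspect("diff running config against last week's backup", 0.60, 30),
    Suspect("manual DNS lookup", 0.05, 2 / 60),
]

ordered = rank(suspects)
for s in ordered:
    print(s.name)
```

With these numbers the two-second DNS lookup comes first despite its low probability, the slow but promising configuration diff comes second, and the site visit comes last.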
The order in which you decide to eliminate suspect causes is ultimately dependent on your experience, your familiarity with the infrastructure, and your allowance for time. Regardless of priority, each suspect cause should undergo the same process of elimination:
Define a working condition
You can't test for a condition unless you know what condition to expect. Before performing a test, you should have in mind what outcome should be produced in the absence of an outage. For example, performing a traceroute to a distant node is meaningless if you can't compare it against a traceroute to the same destination under normal conditions.
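A minimal sketch of comparing a test result against a known-good baseline follows. It assumes the hops of each traceroute have already been collected into a list of address strings; the addresses are documentation examples, and "*" stands for a hop that did not answer.

```python
def first_divergence(baseline, current):
    """Return the index of the first hop that differs from the baseline,
    or None if the two paths are identical."""
    for i, (expected, actual) in enumerate(zip(baseline, current)):
        if expected != actual:
            return i
    if len(baseline) != len(current):
        # One path is a prefix of the other; they diverge where it ends.
        return min(len(baseline), len(current))
    return None

# Hypothetical hop lists
baseline = ["10.0.0.1", "192.0.2.1", "192.0.2.9", "198.51.100.7"]
current = ["10.0.0.1", "192.0.2.1", "*", "*"]

hop = first_divergence(baseline, current)
print(hop)
```

Without the baseline list, the two unanswered hops in the current trace would say very little; with it, the first place to look is the third hop.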
Define a test for that condition
Ensure that the test you perform is in fact evaluating the suspect cause. For instance, pinging an E-mail server doesn't explicitly guarantee that mail services are available, only the server itself (technically, only that server's address). To verify the presence of mail services, a connection to the relevant daemon(s) must be established.
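To illustrate the difference between pinging a server and testing the service, here is a small sketch that attempts a TCP connection to the port a daemon listens on. The function name and default timeout are choices made for this example. Note that a successful connection only shows that something accepted the connection on that port; it does not prove the mail daemon is healthy.

```python
import socket

def tcp_service_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For a mail server this might be called as tcp_service_reachable("mail.example.com", 25), where the hostname is a placeholder and 25 is the standard SMTP port.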
Apply the test and record the result
Once you've applied the test, record its success or failure in your notes. Even if you've eliminated the cause under suspicion, you have a reference to remind you of this and avoid wasting time repeating the same test again unnecessarily.
It is common to uncover multiple failures in the course of troubleshooting. When this happens, it is important to recognize any dependencies. For example, if you discover that E-mail, web access, and a trunk link are all down, the E-mail and web failures can likely be ignored if they depend on the trunk link to function. However, always remember to verify these supposed secondary outages after the primary outage has been resolved.
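The dependency reasoning above can be sketched as follows, assuming you can write down which services depend on which. The dependency map is hypothetical and mirrors the E-mail, web, and trunk link example.

```python
def primary_failures(failed, depends_on):
    """Return the failed items that do not depend, directly or
    indirectly, on another failed item."""
    def has_failed_ancestor(item, seen):
        for parent in depends_on.get(item, ()):
            if parent in seen:
                continue  # already examined; also guards against loops
            seen.add(parent)
            if parent in failed or has_failed_ancestor(parent, seen):
                return True
        return False

    return {item for item in failed if not has_failed_ancestor(item, set())}

# Hypothetical dependency map: service -> things it needs in order to work
depends_on = {"email": ["trunk"], "web": ["trunk"], "trunk": []}
failed = {"email", "web", "trunk"}

primary = primary_failures(failed, depends_on)
deferred = failed - primary  # verify these again once the primary is fixed
print(primary, deferred)
```

The deferred set is kept rather than discarded, so the supposed secondary outages are still checked after the trunk link is restored.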
Step 3: Devise a Solution
Once we have identified a point of failure, we want to continue our systematic approach. Just as with testing for failures, we can apply a simple methodology to testing for solutions. In fact, the process very closely mirrors the steps performed to eliminate suspect causes.
Define the failure
At this point you should have a comfortable idea of the failure. Form a detailed description so you have something to test against after applying a solution. For example, you would want to refine "The Internet is down" to "Users in building 10 cannot access the Internet because their subnet was removed from the outbound ACL on the firewall."
Define the proposed solution
Describe exactly what changes are to be made, and exactly what the expected outcome is. Blanket solutions such as arbitrarily rebooting a device or rebuilding a configuration from scratch might fix the problem, but they prevent any further diagnosis and consequently impede mitigation.
Apply the solution and record the result
Once we have a defined failure and a proposed solution, it's time to pit the two against each other. Be observant in applying the solution, and record its result. Does the outcome match what you expected? Has the failure been resolved?
In addition to our defined process, some guidelines are well worth mentioning.
Far too often I encounter a technician who, upon becoming frustrated with a failure or failures, opts to recklessly reboot, reconfigure, or replace a device instead of troubleshooting systematically. This is the high-tech equivalent of pounding something with a hammer until it works. Focus on one failure at a time, and one solution at a time per failure.
Watch out for hazardous changes
When developing a solution, remember to evaluate what effect it might have on systems unrelated to those being troubleshot. It's a horrible feeling to realize you've fixed one problem at the expense of causing a much larger one. The best course of action when this happens is typically to immediately reverse the change which was made. Note that this is only possible with a systematic approach.
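As a toy illustration of making one change at a time and being able to reverse it, the sketch below records the prior state before applying a change and restores it if a health check fails. The configuration is modeled as a plain dictionary, and the keys, values, and health check are all hypothetical.

```python
def apply_one_change(config, key, new_value, still_healthy):
    """Apply a single change; undo it if the health check fails.
    Returns True if the change was kept."""
    missing = object()                   # sentinel: key did not exist before
    previous = config.get(key, missing)  # record the prior state first
    config[key] = new_value
    if still_healthy(config):
        return True
    # Reverse exactly the change that was made, nothing more.
    if previous is missing:
        del config[key]
    else:
        config[key] = previous
    return False

# Hypothetical device settings and a check on an unrelated service
config = {"outbound_acl": ["10.1.0.0/16"], "ntp_server": "192.0.2.10"}
healthy = lambda c: bool(c.get("ntp_server"))

kept = apply_one_change(
    config, "outbound_acl", ["10.1.0.0/16", "10.10.0.0/16"], healthy
)
reverted = not apply_one_change(config, "ntp_server", "", healthy)
print(kept, reverted, config["ntp_server"])
```

Because the function touches a single setting and remembers what it replaced, the reversal is exact. That is the property a reckless multi-change fix gives up.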
Step 4: Test and Repeat
Upon implementing a solution and observing a positive effect, we can begin to retrace our steps back toward the original symptoms. If any conditions were overlooked because they were decided to be dependent on the recently resolved failure, test for them again. Refer to your notes from the initial report and verify that each symptom has been resolved. Ensure that the same tests which were used to identify a failure are used to confirm functionality.
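The retracing step can be sketched as re-running the very checks that failed earlier, using the notes kept during elimination. The check names are hypothetical, and the lambdas stand in for real tests.

```python
def retest(notes, checks):
    """Re-run each check that failed earlier; return the names that
    are still failing."""
    return [
        name
        for name, passed in notes.items()
        if not passed and not checks[name]()
    ]

# Notes from the elimination step: check name -> did it pass at the time?
notes = {"dns_lookup": True, "trunk_link": False, "email": False, "web": False}

# The same checks, run again after the fix (placeholders for real tests)
checks = {
    "dns_lookup": lambda: True,
    "trunk_link": lambda: True,
    "email": lambda: True,
    "web": lambda: False,
}

remaining = retest(notes, checks)
print(remaining)
```

Anything left in the result is where the testing cycle resumes.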
If you notice that a failure or failures remain, pick up where you left off in the testing cycle, annotate it, and press forward.
Step 5: Mitigate
The troubleshooting process does not end when the problem has been resolved and everyone is happy. All of your hard work up until this point amounts to very little if the same problem occurs again tomorrow. In IT, problems are commonly fixed without ever being resolved. Band-aids and hasty installations are not acceptable substitutes for implementing a permanent and reliable solution. Put another way, many people will go on mopping the floor day after day without ever plugging the leak.
A permanent solution may be as complex as redesigning the routed backbone, or as simple as moving a power strip somewhere it won't be tripped on anymore. A permanent solution also doesn't have to be 100% effective, but it should be as effective as is practical. At the absolute minimum, ensure that you record the observed failure and the applied solution, so that if the condition does recur you have an accurate and dated reference.
A Final Word
Everyone has his or her own preference in troubleshooting, and by no means do I consider this paper conclusive. However, if there's only one concept you take away, make it this: above all else, remain calm. You're no good to anyone in a panic. One's ability to manage stress and maintain a professional demeanor even in the face of utter chaos is what makes a good engineer great.
Most network outages, despite what we're led to believe by senior management, are not the end of the world. There are instances where downtime can lead to loss of life; fortunately, this isn't the case with the vast majority of networks. Money may be lost, time may be wasted, and feelings may be hurt, but when the problem is finally resolved, odds are you've learned something valuable.
Posted in Education
March 10, 2010 at 7:01 a.m. UTC
The increasing need to adhere to strict change control procedures kills the science of troubleshooting. In my world one test (Step 4) would require mounds of paperwork and numerous sign-offs. To do my job I'm forced to do things under the table and hope I don't break anything and call attention to my activities.
March 10, 2010 at 3:15 p.m. UTC
Great post stretch!
March 10, 2010 at 8:00 p.m. UTC
You set the bar too high, man. :(
March 10, 2010 at 9:36 p.m. UTC
For the love of god, use proper change management procedures... Too often are problems caused by hotshot admins who think they know everything.
March 11, 2010 at 8:34 a.m. UTC
This post does a good job at explaining 1) that you need to resolve (or at least be aware of) the real problem and 2) that you need to understand how and why a process happens so that you can logically predict what failure modes will be. From these failure mode predictions, you can work backwards to possibly find the root of the problem and thus a resolution to said problem.
March 11, 2010 at 10:51 a.m. UTC
HH (guest) = FAIL
go away, you don't belong here.
March 11, 2010 at 3:10 p.m. UTC
I wish I could program troubleshooting into wetware and forcibly install it in all helpdesk techs.
Great article btw.
March 11, 2010 at 3:22 p.m. UTC
you's the best...keep it real.
March 11, 2010 at 3:40 p.m. UTC
HH makes a valid point. Change controls are great - IF they're implemented practically. So long as they leave an engineer enough room to maneuver, they can be an excellent tool to help generate documentation during the troubleshooting process.
March 11, 2010 at 6:11 p.m. UTC
Another important point, particularly in networks, is to define, and then segment, the problem domain. Is it possible to isolate the problem to one half of the domain? If so, identify the half that is broken, focus on that, and see if it can be split down again. This can be a very fast way of homing in on the problem, although it doesn't work so well in multilayered topologies.
March 11, 2010 at 7:12 p.m. UTC
I'm a big fan of change control when done correctly. I've even introduced change control to some of my clients and employers.
Unfortunately, change control is typically run by people that know nothing about troubleshooting methodologies. They also tend to treat troubleshooting changes and pre-planned changes the same way.
In my experience, the best change control people are those that have operated and thrived in the trenches of technical support. Same can be said for Technical Project Managers.
March 11, 2010 at 9:02 p.m. UTC
In Electronics School, they taught us Troubleshooting by Halves. Test at the halfway point. If it works, move halfway to the end and retest. If it does not work, move halfway closer to the beginning and retest. Repeat until you find the border between working and failure. Then you know exactly where the problem is.
It was for fixing TVs and other electronics, but applies well to networks. Router A can't access Router E? Can A access C? No? Then test access to Router B. If A can then access B, test B to C. Then you know the issue is with B or C.
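The halving approach described in this comment can be sketched as a binary search along the path. It assumes a single break, so that every device before it is reachable from the source and every device after it is not. The router names and the reachability function are placeholders for real tests such as a ping from Router A.

```python
def find_break(path, reachable):
    """path[0] is the source and is assumed to work. Return the pair
    (last reachable device, first unreachable device), or None if the
    far end of the path is reachable."""
    if reachable(path[-1]):
        return None
    lo, hi = 0, len(path) - 1  # invariant: path[lo] works, path[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reachable(path[mid]):
            lo = mid
        else:
            hi = mid
    return path[lo], path[hi]

# Hypothetical path; only A and B currently respond to tests from A
path = ["A", "B", "C", "D", "E"]
responding = {"A", "B"}

border = find_break(path, lambda router: router in responding)
print(border)
```

On a five-router path this takes three reachability tests: E, then C, then B. With several simultaneous breaks, or multiple paths between devices, the single-break assumption no longer holds and the result can mislead.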
March 12, 2010 at 1:15 p.m. UTC
Very well structured troubleshooting paper. I have already sent it to a couple of IT buddies who need to apply it as a standard in their infrastructures. I have grabbed some tips out of it myself.
peace from Puerto Rico
March 12, 2010 at 2:15 p.m. UTC
I would add in the "apply the solution" section - emphasis on doing one thing at a time. Frequently, people will apply a solution, then another, then another - and suddenly it works - but it's not clear which solution fixed it.
On the other hand - it all depends on one's environment. Sometimes, if you know some equipment is buggy with a memory leak, then rebooting is about the only thing you can do, until a patch comes out.
Also, I've talked to desktop people before - where their department has a policy - if you can't fix it within 15 minutes, reimage. Some shops would rather have their people put the same band-aid on over and over, rather than actually heal their patient.
March 18, 2010 at 5:39 a.m. UTC
Excellent post Stretch, you made Network Admin's job simple. Best wishes for your future endeavor.
Thank you Balwanth
March 18, 2010 at 9:48 p.m. UTC
One thing that I would add is to be sure to document the results of your tests regardless if the test led you to the fix or not. For me the results of the tests are as important as the fix and I am sure that your coworkers would agree if the issue ever pops up again. Also, if a particular test never produces a positive result, the test should be removed from your troubleshooting process.
March 19, 2010 at 11:23 a.m. UTC
Superb post Stretch, keep up the good work. All in all a useful and amazing website.
March 20, 2010 at 5:39 p.m. UTC
I agree that it is a science, but the art comes in the management of the situation and the non IT people involved.
Too many techs could be great techs if they would just slow down and properly handle the situation in addition to the reported problem.
--as a side note, it is sometimes fun to watch 'em flounder a bit before you jump in full tilt.
March 25, 2010 at 5:55 p.m. UTC
The art here in this case is the ability to articulate all this information in a document such as this. As I read this article I nodded in agreement to all the points you made, but writing an article to describe it clearly and concisely is where I would have fallen down.
Nice work. Really nice work.