The Importance of Watching the Wire
By stretch | Wednesday, April 3, 2013 at 1:42 a.m. UTC
I came across an interesting problem today that I think serves as an excellent example of why packet analysis is such a critical skill for network engineers.
A few days ago, the internal network belonging to one of my employer's customers was compromised by a malicious party. Since the customer had connectivity into our network by way of a VPN tunnel and we didn't want to knowingly expose ourselves or other customers to the threat, we saw fit to temporarily sever the VPN while the breach was tended to by another party. We also upgraded the site's core switch to better support a feature useful in the analysis of the breach.
Shortly thereafter, we began receiving reports of problems with Internet connectivity from the site. Everything was reachable, it was just... slow. And worse, the issue seemed not to have any uniform effect: One person experiencing the issue might sit next to someone else who was completely immune and who noticed no difference from the day before. This of course made troubleshooting frustrating, to say the least.
First we tried reversing the firmware upgrade on the core switch, as it was reasonable to suspect we may have encountered some obscure bug, but this was quickly revealed to be a red herring as the issue persisted. On-site engineers verified that they could still reach everything (excluding of course our internal resources which were no longer reachable as a result of severing the VPN) and speed tests showed mostly normal results. There was no correlation between affected users and any access switch, VLAN, or IP subnet. We also confirmed about seventeen times that Internet traffic was in fact traveling from end hosts through the firewall directly to the Internet, with no proxy or caching servers in between.
Frustrated with the lack of progress in isolating the issue, I asked for a packet capture to be performed at the site. The testing procedure was as follows:
- Open a web browser
- Navigate to about:blank (this ensures that the test begins with a clean slate and guards against any rogue HTTP requests resulting from leaving the prior page)
- Start a promiscuous packet capture in Wireshark
- Navigate to packetlife.net and wait for the page to completely load
- Stop and save the capture.
My motivation for choosing packetlife.net as the test target was more than mere vanity. When you load a major web site like yahoo.com or cnn.com, you're actually generating a huge number of HTTP requests under the hood for content from a dozen or so sources (content delivery networks, embedded advertisements, etc.). By using a simple, familiar site I knew what sort and number of HTTP requests to expect, which makes the job of analyzing each packet in the capture a good deal easier.
After quickly sanity-checking the packet capture to ensure that the test was completed as designed, my first step was to isolate the initial TCP session triggered by the web request. A quick way to do this if you know the IP address of the remote server is to apply a display filter showing only traffic to or from that IP, of the form ip.addr == <server IP>.
Then, right-click on the first TCP packet (the one with only the SYN flag set) and select "Follow TCP Stream". This will apply a new display filter showing only packets which belong to that TCP session. (It will also generate a new dialog showing the contents of the HTTP request, which in many cases can be quite handy but wasn't of interest at this time.)
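To make these two filtering steps concrete, here is a minimal Python sketch of what they do conceptually. It is not how Wireshark is implemented: the Packet record, the function names, and the addresses (taken from the documentation ranges) are all assumptions made for illustration. In Wireshark itself, Follow TCP Stream applies a tcp.stream display filter for you.

```python
from collections import namedtuple

# Hypothetical, heavily simplified packet record for illustration only.
Packet = namedtuple("Packet", "number src sport dst dport flags")


def to_or_from(packets, ip):
    """Keep packets sent to or from ip (the idea behind an ip.addr filter)."""
    return [p for p in packets if ip in (p.src, p.dst)]


def session_key(p):
    """Direction-independent identifier for a TCP session."""
    return frozenset([(p.src, p.sport), (p.dst, p.dport)])


def follow_stream(packets, first):
    """Keep packets in the same TCP session as the packet `first`."""
    key = session_key(first)
    return [p for p in packets if session_key(p) == key]


# Made-up capture: every address and port here is invented.
CLIENT, SERVER, OTHER = "192.0.2.10", "198.51.100.80", "203.0.113.5"
capture = [
    Packet(1, CLIENT, 50001, OTHER, 443, "SYN"),
    Packet(2, CLIENT, 50002, SERVER, 80, "SYN"),
    Packet(3, SERVER, 80, CLIENT, 50002, "SYN,ACK"),
    Packet(4, CLIENT, 50003, SERVER, 80, "SYN"),
    Packet(5, CLIENT, 50002, SERVER, 80, "ACK"),
]

web = to_or_from(capture, SERVER)
first_syn = next(p for p in web if p.flags == "SYN")
session = follow_stream(web, first_syn)
print([p.number for p in session])  # prints [2, 3, 5]
```

Keying the session on an unordered pair of endpoints means both directions of the conversation match, which is why the SYN/ACK from the server is kept alongside the client's packets.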
I reviewed the TCP exchange and everything appeared normal, until I noticed the packet timestamps. After the initial three-way TCP handshake (SYN, SYN/ACK, ACK) completed, there was a delay of about five seconds before the HTTP GET request was transmitted (between packets #32 and #101 in the figure below). Under normal conditions, there should be practically zero delay between the handshake completion and the initial HTTP request, so I knew I was on to something.
(The delta time column in the screenshot was added to the default column set to show the difference in seconds between each packet and the one displayed immediately before it.)
One important clue to note is that the delay here is almost exactly five seconds; a round number to us humans. This suggests that some process triggered by or in parallel with the TCP connection is intelligently timing out at five seconds, at which point the HTTP request proceeds.
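The same check can be done programmatically. The sketch below assumes packets have been reduced to (number, timestamp) pairs in display order; it recomputes the delta column and flags any gap at or above a threshold. The timestamps and the handshake packet numbers are invented; only packet numbers 32 and 101 come from the capture described above, and the one-second threshold is an arbitrary choice.

```python
def displayed_deltas(packets):
    """Yield (number, delta) where delta is the time since the previous displayed packet."""
    previous = None
    for number, timestamp in packets:
        delta = 0.0 if previous is None else timestamp - previous
        yield number, delta
        previous = timestamp


def find_gaps(packets, threshold=1.0):
    """Return the packets that arrived at least `threshold` seconds after their predecessor."""
    return [(n, d) for n, d in displayed_deltas(packets) if d >= threshold]


# Hypothetical timestamps (seconds) for one filtered TCP session.
stream = [(30, 0.000), (31, 0.045), (32, 0.046), (101, 5.047), (102, 5.093)]

for number, delta in find_gaps(stream):
    print(f"packet #{number} arrived {delta:.3f} s after the previous one")
```

Note that the delta is relative to the previous displayed packet, so the result depends on which display filter is active, just as it does in Wireshark.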
I cleared the display filter and began sifting through the packets which were captured between the establishment of the TCP session and the first HTTP request. Most of it was ambient noise. There was nothing terribly interesting except for a few unanswered TCP requests from the local host to a remote server that used to be accessible via the VPN tunnel we had disconnected earlier. The requests were destined for port 5274; a quick Google search for the port number yielded no clues. But the answer finally clicked when I looked up the destination IP address: it was a Trend Micro antivirus server.
What does an antivirus application have to do with Internet connectivity? One of the services provided by Trend Micro and countless other AV products is the ability to filter URLs and block access from the local machine to questionable web sites. As I understand it (and I could be wrong, as we're well outside my area of expertise now), the client software hijacks the web browser's HTTP request and sends the URL to a central server which compares it against a blacklist. In this scenario, that server is no longer accessible, and the connection attempt times out after - you guessed it - five seconds. The client then "fails open," meaning that the HTTP request is allowed to continue despite receiving no response from the filtering server. (In hindsight, a big red error page complaining about AV software issues would have all but eliminated our time spent troubleshooting.)
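The fail-open behavior described above can be sketched in a few lines of Python. This is not Trend Micro's implementation, which is not known here; the function names, the lookup callback, and the example URLs are hypothetical, and the five-second default simply mirrors the delay seen in the capture.

```python
def url_allowed(url, lookup, timeout=5.0):
    """Decide whether a request may proceed.

    lookup(url, timeout) is a hypothetical callback that asks the filtering
    server about the URL. It returns True if the URL is blacklisted and
    raises OSError (which includes timeouts) if the server cannot be reached.
    """
    try:
        blocked = lookup(url, timeout)
    except OSError:
        # Fail open: no answer from the server, so let the request through.
        return True
    return not blocked


def unreachable_server(url, timeout):
    """Stand-in for a filtering server behind a severed VPN (no real waiting here)."""
    raise TimeoutError(f"no reply within {timeout} s")


def working_server(url, timeout):
    """Stand-in for a reachable filtering server with a one-entry blacklist."""
    return url == "http://malware.example/"


print(url_allowed("http://packetlife.net/", working_server))        # True
print(url_allowed("http://malware.example/", working_server))       # False
print(url_allowed("http://malware.example/", unreachable_server))   # True
```

The last call shows the trade-off: failing open keeps users online when the server is unreachable, but it also lets through a URL that would otherwise have been blocked.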
Five seconds may not seem like a substantial amount of time, but remember how I mentioned earlier that large sites tend to pull content from myriad sources just to render a single page? If several of those requests are being hijacked and delayed one after another, the cumulative delay can quickly grow to the point where angry phone calls are placed. This explanation also reveals why some users were immune to the issue: They were using an alternate antivirus product, or didn't have the URL filtering feature enabled.
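As a back-of-the-envelope illustration of how the stalls add up, the sketch below assumes that each new connection waits out one full timeout and that at most `parallel` connections stall at the same time. Both the model and the numbers are assumptions, not measurements from the site; a dozen connections echoes the "dozen or so sources" mentioned earlier.

```python
import math


def added_delay(connections, timeout=5.0, parallel=1):
    """Extra seconds spent waiting, under the simplified model described above."""
    return math.ceil(connections / parallel) * timeout


print(added_delay(12))              # strictly one after another: 60.0
print(added_delay(12, parallel=6))  # six at a time: 10.0
```

Even with generous parallelism, the page cannot start rendering until at least one full timeout has passed.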
Fortunately, the resolution for this issue ended up being trivial to implement. But how long do you suppose troubleshooting might have gone on had we not taken the time to inspect the wire directly? Hone your packet analysis skills well, as the investment will quickly pay off.
Posted in Packet Analysis
April 3, 2013 at 3:22 a.m. UTC
I agree wholeheartedly. Polishing one's "shark-fu" illuminates so much when troubleshooting. Thanks for the reminder!
April 3, 2013 at 4:58 a.m. UTC
Well written! Enjoyed reading it :)
April 3, 2013 at 5:21 a.m. UTC
Excellent article, though based on the title I was kind of hoping you were talking about the TV Series
April 3, 2013 at 6:20 a.m. UTC
Thanks.. interesting article!
April 3, 2013 at 7:05 a.m. UTC
The first thing I do to troubleshoot Internet connectivity issues is to disable the antivirus and check again.
So this issue wouldn't have taken this much time to resolve if it had happened in my network.
April 3, 2013 at 8:47 a.m. UTC
Thank you for the article Jeremy, useful as always. But I have to say that in investigating such an issue, the field engineer(s) failed big time.
He should have noticed that the problems were happening on the PCs where the Trend Micro antivirus was installed and not on the others. A smarter engineer would have noticed that and the packet inspection process wouldn't have been necessary. You were troubleshooting remotely so you couldn't be aware of it, and you've shown great packet inspection skills, but the field engineer was practically asleep.
I mean, if every damn application problem I faced had to be investigated using packet inspection I would have been fired at least ten times for not being fast enough in troubleshooting the issue and making my customer lose time (and money).
just my .02
April 3, 2013 at 9:39 a.m. UTC
Just like some anti-virus programs which put themselves in the socket layer between e-mail programs and the server. If your license expires, they still intercept the connection, but refuse to do anything. This results in user complaints that the mail server closes the connection, but a tcpdump on the server shows absolutely nothing.
April 3, 2013 at 12:52 p.m. UTC
True! Thank you for this. One of the skills which needs to be trained from the very beginning of every network life is to know how to use Wireshark.
April 4, 2013 at 7:54 a.m. UTC
I'm struggling to remember the last time a packet capture actually identified a network problem. As Networkers we've become skilled in packet analysis only to prove to others that it's not the network.
April 4, 2013 at 9:55 a.m. UTC
Hi Jeremy...very well-written article. It is knowledge-enhancing and gives one a broader idea of possible causes of a symptom. As is proved in this case, correct diagnosis can weigh more than the fix.
I came across your Cheat Sheets in my friend's computer. I am so impressed by those that I opened packetlife.net today to drop an appreciation note. And came across this article. Man...man..there cannot be a better way to summarize vast topics in a single page. Trust me, your cheat sheets are probably the most helpful thing I have come across in this year. They are concise, precise and most of all, complete. They cover everything about the topic and in the most lucid way.
I greatly respect people who render their knowledge/talent/hard work in return of no monetary interests. I am the latest addition to your fan following. God bless you, bro !!
April 4, 2013 at 11:50 a.m. UTC
In pcaps we trust!
April 5, 2013 at 12:29 a.m. UTC
Awesome write up! I'm a beginner CCNA Networker, but after reading the articles on this site it has inspired me to work towards becoming an NE. Thank you guys keep up the good work!
April 5, 2013 at 2:50 p.m. UTC
Nice article! A lot of Monday morning quarterbacks here.
April 7, 2013 at 1:01 a.m. UTC
Hey, Jeremy! Long time lurker, first time commenter. Great story! Glad you were able to figure out the problem.
April 8, 2013 at 10:27 a.m. UTC
Excellent article. I wasn't even aware of the Delta column in Wireshark, I usually did this calc in my head.
And I couldn't agree more with pompeychimes. The sad truth is that the network (and the networkers) is always the first suspect. But alas, we networkers are usually the ones who can pinpoint the root of the problem.
April 8, 2013 at 11:56 p.m. UTC
Thank you..excellent post...Please write more such interesting experiences
April 9, 2013 at 6:56 a.m. UTC
I have a question. If the third-party tool takes time to check the URL handed over against the blacklist, does that mean the product is affecting normal packet flow? If yes, then why are such products on the market?
Please let me know your opinion.
April 13, 2013 at 3:47 a.m. UTC
Dumb question here: Why did traffic for the a/v filtering need to go over the VPN in the first place?
April 17, 2013 at 2:14 p.m. UTC
Great post. Well written.
April 19, 2013 at 7:12 p.m. UTC
Good find. :)
April 21, 2013 at 1:38 a.m. UTC
I faced a similar problem at the company I work at. Users were reporting "the internet is slow" and had intermittent connectivity issues but it was only affecting some users. It turned out that Kaspersky, the antivirus software we use, had pushed out an update to their web filtering package that was causing the same thing you mentioned in the post. We disabled the "feature" until we were able to push out an update that resolved it.
April 22, 2013 at 11:21 p.m. UTC
You have become so smug and boring, and you state the obvious as if you were the first person to think of it.
April 22, 2013 at 11:23 p.m. UTC
Just wanted to share an anecdote. Sorry you read it that way. Of course, since I'm so smug and boring you could always just stop reading my blog.
April 24, 2013 at 4:28 p.m. UTC
So great to have real life examples for my students to see!
April 29, 2013 at 7:04 a.m. UTC
Very well written & explained, enjoyed reading it. Keep posting!!
May 18, 2013 at 10:15 p.m. UTC
Absolutely. Some AV systems actually run a proxy service on the local machine which scans traffic as it passes through; and in terms of "bad" AV systems, Trend Micro has developed a bad reputation.
(On a side note; how is it that a single organisation has multiple AV systems spread across the Desktop machine - surely it was in the early post-stages of a business acquisition?)
May 30, 2013 at 9:28 p.m. UTC
Great article! The feeling that you get when you sort out something like that is awesome!
July 13, 2013 at 6:34 a.m. UTC
Great article Jeremy. I've seen something similar with Kaspersky in the past. You are right, it is a little frustrating.
July 17, 2013 at 2:36 p.m. UTC
Would a continuous ping or MTR have yielded results in the first place, in order to rule out the application layer?
Great write up anyway.
August 13, 2013 at 2:14 p.m. UTC
Great article, the goofy problems are the most rewarding when you figure them out.
September 4, 2013 at 5:10 a.m. UTC
Beautifully said ! Thanks!
October 11, 2013 at 7:47 p.m. UTC
Great article. Unbelievable that people are critics of your attempt to provide useful info and anecdotes. Maybe their networking skills are A+ but their interpersonal skills are C-. Sure glad I don't work with them. I will be pointing this out (along with your great site) to my networking students. Thanks.
November 7, 2013 at 8:46 a.m. UTC
As I am new to the networking field I am only able to understand the article partially, but the demonstration of troubleshooting is great :) Thank you
January 15, 2014 at 11:12 a.m. UTC
This is really a great topic, packet analysis is really a MUST
February 7, 2014 at 8:16 p.m. UTC
Re: "Why did traffic for the a/v filtering need to go over the VPN in the first place?"
It sounds like it is a site-to-site VPN with the URL filtering server at the HQ site.
February 14, 2014 at 1:08 p.m. UTC
Very well written and explained....
December 4, 2015 at 7:45 a.m. UTC
good article, thanks mate.
February 19, 2016 at 8:38 p.m. UTC
Great article! I am going to continue to search your site while I troubleshoot a connection issue through Wireshark. Lots of TCP Retransmission, TCP Out-of-order, TCP Dup Ack and TCP previous segment not captured.