Traceroute and Not-so-Equal ECMP

By stretch | Monday, April 27, 2015 at 2:05 p.m. UTC

I came across an odd little issue recently involving equal-cost multipath (ECMP) routing and traceroute. Traceroutes from within our network to destinations out on the Internet were following two different paths, with one path being one hop longer than the other. This resulted in mangled traceroute output, impeding our ability to troubleshoot.

The relevant network topology comprises a mesh of two edge routers and two core switches. Each edge router has a number of transit circuits to different providers, and advertises a default route via OSPF to the two core switches below. Each core switch load-balances traffic across both default routes, one toward each edge router.

topology.png

Because each edge router has different providers, some destinations are routed out via edge1 and others via edge2, which means sometimes a packet will be routed to edge2 via edge1, or vice versa.

two_paths.png

Routers typically employ a hash function using layer three and four information from each packet to pseudo-randomly distribute traffic across equal-cost links. Because the hash input is the same for every packet in a flow (e.g. all packets with the same source and destination IP and port numbers), all packets belonging to that flow normally follow the same path.
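To make the idea concrete, here is a minimal sketch of flow-based next-hop selection. It is not how any particular router computes its hash: the `Flow` tuple, the `pick_next_hop` helper, the SHA-256 digest, and the addresses are all assumptions chosen for illustration. The point is only that an identical 5-tuple always maps to the same next hop, while changing one field (here the destination port) can move the packet to the other one.

```python
import hashlib
from collections import namedtuple

# Hypothetical 5-tuple; real routers may hash more or fewer fields.
Flow = namedtuple("Flow", "src_ip dst_ip proto src_port dst_port")


def pick_next_hop(flow, next_hops):
    """Map a flow's 5-tuple onto one of several equal-cost next hops."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first four bytes of the digest as an integer bucket selector.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


next_hops = ["edge1", "edge2"]
flow = Flow("198.51.100.5", "192.0.2.10", 17, 40000, 33434)

# The same 5-tuple is hashed 100 times and always lands on one next hop.
choices = {pick_next_hop(flow, next_hops) for _ in range(100)}
print("same flow:", choices)

# Varying only the destination port spreads packets over the next hops.
varied = {
    pick_next_hop(flow._replace(dst_port=33434 + i), next_hops)
    for i in range(64)
}
print("varying dst_port:", varied)
```

A cryptographic digest is overkill for forwarding hardware; it is used here only because it is in the standard library and distributes inputs evenly.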

However, in this case traceroute packets were being split across two paths of unequal length, which made traceroute output pretty much unreadable. We noticed that only UDP traceroutes were affected; ICMP traceroutes reported one path as normal.

Here's an ICMP traceroute for reference:

traceroute to X.X.X.X, 30 hops max, 60 byte packets
 1  46.101.0.254 (46.101.0.254)  3.515 ms  3.491 ms  3.486 ms
 2  5.101.111.233 (5.101.111.233)  0.597 ms  0.625 ms  0.623 ms
 3  linx.peer.nac.net (195.66.224.94)  90.584 ms  90.603 ms  90.714 ms
 4  0.e3-2.tbr1.tl9.nac.net (209.123.11.141)  81.491 ms  81.542 ms  81.542 ms
 5  0.e1-1.tbr1.ewr.nac.net (209.123.10.129)  81.711 ms  81.838 ms  81.905 ms
 6  0.e1-4.tbr1.oct.nac.net (209.123.10.122)  82.391 ms  82.192 ms  82.155 ms
...

Very clean; one path. And here's the same traceroute using UDP packets:

traceroute to X.X.X.X, 30 hops max, 60 byte packets
 1  46.101.0.253 (46.101.0.253)  0.619 ms  0.583 ms 46.101.0.254 (46.101.0.254)  0.531 ms
 2  5.101.111.237 (5.101.111.237)  0.541 ms 5.101.111.233 (5.101.111.233)  0.529 ms 5.101.111.241 (5.101.111.241)  0.469 ms
 3  5.101.111.250 (5.101.111.250)  0.444 ms linx.peer.nac.net (195.66.224.94)  81.334 ms 5.101.111.250 (5.101.111.250)  0.485 ms
 4  0.e3-2.tbr1.tl9.nac.net (209.123.11.141)  81.330 ms  81.299 ms  81.267 ms
 5  0.e3-2.tbr1.tl9.nac.net (209.123.11.141)  81.137 ms 0.e1-4.tbr1.mmu.nac.net (209.123.10.101)  82.411 ms  82.261 ms
 6  0.e1-4.tbr1.oct.nac.net (209.123.10.122)  82.241 ms  82.534 ms 0.e1-1.tbr1.ewr.nac.net (209.123.10.129)  89.896 ms
...

At first glance, it looks like an incoherent jumble of hops, but look closely and you'll notice that some nodes appear multiple times at different hops in the traceroute. Some packets are following the shorter path out directly via edge2, whereas others are following the longer path via edge1 and then to edge2. But why? Shouldn't all of the packets follow the same path?

The Linux traceroute utility starts out by sending packets with an IP TTL of 1 to UDP port 33434. By default, it will send three packets with a TTL of 1, and then increment the TTL to 2 and the port number to 33435. These numbers keep incrementing until the destination is reached (or the traceroute runs into a filter blocking its packets).

Or at least that's what I thought.

During troubleshooting, I actually confirmed this behavior on my local Linux Mint workstation with a packet capture. However, I failed to realize that the traceroute utility that shipped with Mint (which is part of the GNU inetutils family) was entirely different from the one installed on my Ubuntu Server 14.04 test machine.
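The probe pattern seen in that capture can be sketched as a simple schedule of (TTL, destination port) pairs. This is only a model of the behavior described above, not code taken from any traceroute implementation; the `probe_schedule` name and its parameters are made up for illustration.

```python
BASE_PORT = 33434  # first destination port, as described in the article


def probe_schedule(max_ttl, probes_per_hop=3, base_port=BASE_PORT):
    """Return (ttl, dst_port) pairs with the port bumped once per TTL step."""
    schedule = []
    for ttl in range(1, max_ttl + 1):
        port = base_port + (ttl - 1)
        # All probes sent with this TTL share the same destination port.
        schedule.extend((ttl, port) for _ in range(probes_per_hop))
    return schedule


for ttl, port in probe_schedule(max_ttl=2):
    print(f"ttl={ttl} dst_port={port}")
# ttl=1 dst_port=33434  (three times)
# ttl=2 dst_port=33435  (three times)
```

Under this model, the three probes for a given hop carry identical port numbers, so a flow-based hash treats them as one flow.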

From Mint:

workstation$ traceroute -V
traceroute (GNU inetutils) 1.9.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Elian Gidoni.

From Ubuntu:

test-box:~# traceroute -V
Modern traceroute for Linux, version 2.0.20, Aug 19 2014
Copyright (c) 2008  Dmitry Butskoy,   License: GPL v2 or any later

The Ubuntu flavor of traceroute, as it turns out, actually increments the UDP port number for every packet rather than for every TTL cycle. Thus, traceroute packets were actually alternating between path A and path B due to the hashing function on the routers. (FYI, the traceroute utility on Junos appears to operate the same way.)
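The effect of the two port-numbering schemes can be illustrated with a toy model. Everything here is an assumption made for the sake of the example: `toy_path` stands in for the routers' hash by looking only at the parity of the destination port, which is far simpler than any real ECMP hash, and the function names are hypothetical. The sketch simply counts how many distinct paths the probes for each TTL would take under each scheme.

```python
BASE_PORT = 33434
NEXT_HOPS = ["path A", "path B"]


def dst_ports(max_ttl, probes_per_hop=3, per_probe=True):
    """Yield (ttl, dst_port) pairs.

    per_probe=True  bumps the port after every packet.
    per_probe=False bumps the port once per TTL step.
    """
    port = BASE_PORT
    for ttl in range(1, max_ttl + 1):
        for _ in range(probes_per_hop):
            yield ttl, port
            if per_probe:
                port += 1
        if not per_probe:
            port += 1


def toy_path(dst_port):
    """Toy stand-in for an ECMP hash: even ports take A, odd ports take B."""
    return NEXT_HOPS[dst_port % len(NEXT_HOPS)]


def paths_per_ttl(max_ttl, per_probe):
    """Collect the set of paths used by the probes sent at each TTL."""
    seen = {}
    for ttl, port in dst_ports(max_ttl, per_probe=per_probe):
        seen.setdefault(ttl, set()).add(toy_path(port))
    return seen


print("port per TTL:  ", paths_per_ttl(2, per_probe=False))
print("port per probe:", paths_per_ttl(2, per_probe=True))
```

In this model, incrementing the port once per TTL keeps the three probes for a hop on a single path, whereas incrementing it per probe puts both paths into the same row of output, which matches the mixed hops in the UDP traceroute above.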

If you prefer, you can replace the stock Ubuntu traceroute with the inetutils version:

sudo apt-get remove traceroute
sudo apt-get install inetutils-traceroute

Or, just use the much more capable mtr (sudo apt-get install mtr-tiny) utility in the first place.

About the Author

Jeremy Stretch is a network engineer living in the Raleigh-Durham, North Carolina area. He is known for his blog and cheat sheets here at Packet Life. You can reach him by email or follow him on Twitter.

Comments


djfader
April 29, 2015 at 10:48 p.m. UTC

I have come across ECMP traceroute issues lately in a customer environment, and it is good to know that you tested this with different traceroute implementations. Something that we need to be aware of. Anyway, this Modern traceroute is better and I would recommend using it, as it shows you more than ICMP or inetutils without ECMP, I think :)


Andras (guest)
May 3, 2015 at 4:46 a.m. UTC

Another alternative is paris-traceroute, which tries to map the full ECMP path by probing with various ports; you can also specify a fixed source and destination port.


Mitch (guest)
May 4, 2015 at 12:44 a.m. UTC

Great write up - must have been a fun one to track down


jsicuran (guest)
May 20, 2015 at 3:51 p.m. UTC

Is there a CEF process involved here?


Kin (guest)
May 21, 2015 at 1:02 p.m. UTC

Pardon me, I'm not quite sure why you would configure static default routes to the two edge routers, as there should be some sort of load-balancing protocol configured on them. Thinking from a router's point of view, when there are two static default routes, where should it forward the packet?


Tom (guest)
May 28, 2015 at 1:53 a.m. UTC

Kin, if there are two default static routes configured, an IOS router will enter both in the routing table as they are equal cost (no metric for static routes and both use the same administrative distance). If you wanted a more deterministic path, you would give the static route to routerX a lower admin distance than the static route to routerY. That way, it will always choose the path to routerX unless that route goes away (i.e. using next-hop object tracking). No additional configuration would be required.

Now, if you have a default route in an IGP, the behavior may change depending on how you set/manipulate link cost and/or metrics.


Halil Baysal (guest)
June 6, 2015 at 3:45 p.m. UTC

To test these kinds of paths, the most efficient way for me has been hping3, using the traceroute flag with static source and destination ports. Changing these numbers, and perhaps adding 1 or 2 bits to the destination address, can show you the paths a flow might follow across ECMP links.

Cheers

Halil


Nate (guest)
June 25, 2015 at 1:57 p.m. UTC

I recall INE mentioning this early on in my CCIE studies, where each packet has a destination port between 33434 and 33464. I had thought this would explain Cisco IOS's 30 hops, since 33464 - 33434 = 30. For this reason, I got in the habit of doing traceroute probe 1 to clean up my output.


Bill (guest)
July 2, 2015 at 1:30 p.m. UTC

I know UDP traceroute increments the destination port with each probe, while ICMP has no ports, hence the difference between the two outputs. Could the router consider each probe a new socket/conversation, since it has a new destination port to the same IP and a different TTL?


FET (guest)
September 12, 2015 at 5:19 p.m. UTC

Can anybody recommend some traffic generators which are able to dynamically change L3/L4 addresses of each generated packet? I understand I can use UDP traceroute but I need to play with different throughput from the traffic generator. Thanks in advance!


Bava (guest)
May 26, 2016 at 12:07 p.m. UTC

Great explanation..Thanks


Solman (guest)
June 20, 2016 at 3:49 a.m. UTC

Great explanation..Thanks


JASON (guest)
July 28, 2016 at 6:30 a.m. UTC

Nice article Jeremy, it's good to know different traceroute utilities vary in operation. By the way, I have to mention that I love all your packetlife.net study guides. Thanks for writing them, and please keep more coming if you can =). Regards, Jason


insomniac (guest)
November 13, 2016 at 4:54 p.m. UTC

Have you tried using Dublin Traceroute? https://dublin-traceroute.net . It was written with this use case explicitly in mind, and it will also give you a better insight of your network. Spoiler alert: I am the author.


A guest
November 16, 2016 at 6:14 a.m. UTC

According to the information in apt-cache, the inetutils-traceroute package is IPv4-only while the traceroute package is IPv4/IPv6. Can anyone confirm this?
