Path MTU Discovery
By stretch | Monday, August 18, 2008 at 12:57 a.m. UTC
When a host needs to transmit data out an interface, it references the interface's Maximum Transmission Unit (MTU) to determine how much data it can put into each packet. Ethernet interfaces, for example, have a default MTU of 1500 bytes, not including the Ethernet header or trailer. This means a host needing to send a TCP data stream would typically use the first 20 of these 1500 bytes for the IP header, the next 20 for the TCP header, and as much of the remaining 1460 bytes as necessary for the data payload. Encapsulating data in maximum-size packets like this allows for the least possible consumption of bandwidth by protocol overhead.
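The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the header sizes assume IPv4 and TCP without options.

```python
ETHERNET_MTU = 1500  # default Ethernet MTU, excluding the Ethernet header and trailer
IPV4_HEADER = 20     # assumption: IPv4 header with no options
TCP_HEADER = 20      # assumption: TCP header with no options


def max_tcp_payload(mtu: int) -> int:
    """Return the largest TCP payload that fits in a single packet of size `mtu`."""
    payload = mtu - IPV4_HEADER - TCP_HEADER
    if payload <= 0:
        raise ValueError("MTU too small to carry any TCP payload")
    return payload


def header_overhead(mtu: int) -> float:
    """Fraction of a maximum-size packet consumed by the IP and TCP headers."""
    return (IPV4_HEADER + TCP_HEADER) / mtu


print(max_tcp_payload(ETHERNET_MTU))            # 1460
print(f"{header_overhead(ETHERNET_MTU):.2%}")   # 2.67%
```

The smaller the packet, the larger the share of it spent on headers, which is why full-size packets are preferred.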
Unfortunately, not all links which compose the Internet have the same MTU. The MTU offered by a link may vary depending on the physical media type or configured encapsulation (such as GRE tunneling or IPsec encryption). When a router decides to forward an IPv4 packet out an interface, but determines that the packet size exceeds the interface's MTU, the router must fragment the packet to transmit it as two (or more) individual pieces, each within the link MTU. Fragmentation is expensive both in router resources and in bandwidth utilization; new headers must be generated and attached to each fragment. (In fact, the IPv6 specification removes transit packet fragmentation from router operation entirely, but this discussion will be left for another time.)
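To make the cost of fragmentation concrete, here is a rough Python sketch of how an IPv4 packet might be split to fit a smaller link MTU. The function name and return format are invented for illustration, and a 20-byte header without options is assumed. The one protocol detail it relies on is that fragment offsets are expressed in 8-byte units, so every fragment except the last must carry a multiple of 8 data bytes.

```python
from typing import List, Tuple

IPV4_HEADER = 20  # assumption: IPv4 header with no options


def fragment(packet_len: int, link_mtu: int) -> List[Tuple[int, int, bool]]:
    """Split a packet of `packet_len` bytes to fit `link_mtu`.

    Returns one (fragment_offset, total_length, more_fragments) tuple per
    fragment, with fragment_offset in 8-byte units as in the IPv4 header.
    """
    if packet_len <= link_mtu:
        return [(0, packet_len, False)]  # fits as-is, no fragmentation

    # Data per fragment must be a multiple of 8 bytes (except the last one).
    max_data = (link_mtu - IPV4_HEADER) // 8 * 8
    if max_data <= 0:
        raise ValueError("link MTU too small to carry any data")

    data_len = packet_len - IPV4_HEADER
    fragments = []
    offset = 0
    while offset < data_len:
        chunk = min(max_data, data_len - offset)
        more = offset + chunk < data_len
        fragments.append((offset // 8, IPV4_HEADER + chunk, more))
        offset += chunk
    return fragments


pieces = fragment(1500, 800)
print(pieces)                                  # [(0, 796, True), (97, 724, False)]
print(sum(length for _, length, _ in pieces))  # 1520
```

In this example a single 1500-byte packet becomes two packets totalling 1520 bytes, the extra 20 bytes being the second IP header.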
To utilize a path in the most efficient manner possible, hosts must find the path MTU; this is the smallest MTU of any link in the path to the distant end. For example, for two hosts communicating across three routed links with independent MTUs of 1500, 800, and 1200 bytes, the smallest (800 bytes) must be assumed by each end host to avoid fragmentation.
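Expressed as code, the path MTU is simply the minimum over the links in the path. The sketch below uses the three example links from the paragraph above; the function name is hypothetical.

```python
from typing import Sequence


def path_mtu(link_mtus: Sequence[int]) -> int:
    """Return the path MTU: the smallest MTU of any link along the path."""
    if not link_mtus:
        raise ValueError("a path needs at least one link")
    return min(link_mtus)


links = [1500, 800, 1200]
print(path_mtu(links))                   # 800
print(links.index(path_mtu(links)) + 1)  # 2 (the second link is the bottleneck)
```

In practice a host has no such list to consult, which is the problem path MTU discovery addresses.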
Of course, a host generally has no way of knowing in advance the MTU of every link through which a packet might travel. RFC 1191 defines path MTU discovery, a simple process through which a host can detect a path MTU smaller than its interface MTU. Two components are key to this process: the Don't Fragment (DF) bit of the IP header, and a subcode of the ICMP Destination Unreachable message, Fragmentation Needed.
Setting the DF bit in an IP packet prevents a router from performing fragmentation when it encounters an MTU less than the packet size. Instead, the packet is discarded and an ICMP Fragmentation Needed message is sent to the originating host. Essentially, the router is indicating that it needs to fragment the packet but the DF flag won't allow for it. Conveniently, RFC 1191 expands the Fragmentation Needed message to include the MTU of the link necessitating fragmentation. A Fragmentation Needed message can be seen in packet #6 of this packet capture.
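As a rough illustration of what the host receives, the Python sketch below pulls the next-hop MTU out of the first eight bytes of an ICMP message. Fragmentation Needed is Destination Unreachable (type 3) with code 4, and RFC 1191 places the MTU in the low-order 16 bits of the previously unused second header word. The function name and the hand-built sample message are assumptions made for this example, and checksum validation is skipped.

```python
import struct
from typing import Optional

ICMP_DEST_UNREACHABLE = 3  # ICMP type
ICMP_FRAG_NEEDED = 4       # ICMP code: fragmentation needed and DF set


def parse_frag_needed(icmp: bytes) -> Optional[int]:
    """Return the next-hop MTU from an ICMP Fragmentation Needed message.

    Returns None if the bytes are not a Fragmentation Needed message.
    A return value of 0 means the router did not fill in the MTU field
    (it predates RFC 1191), so the host would have to guess.
    """
    if len(icmp) < 8:
        return None
    # type (1 byte), code (1), checksum (2), unused (2), next-hop MTU (2)
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp[:8]
    )
    if icmp_type != ICMP_DEST_UNREACHABLE or code != ICMP_FRAG_NEEDED:
        return None
    return next_hop_mtu


# Hand-built sample header; the checksum is left at zero for brevity.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
print(parse_frag_needed(sample))  # 1400
```

A real message also carries the IP header and first eight data bytes of the discarded packet after these eight bytes, which is how the host matches the error to a destination.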
Now that the actual path MTU has been learned, the host can cache this value and packetize future data for the destination to the appropriate size. Note that path MTU discovery is an ongoing process; the host continues to set the DF flag so that it can detect further decreases in MTU should dynamic routing select a new path to the destination. RFC 1191 also allows for periodic testing for an increased path MTU, by occasionally attempting to pass a packet larger than the learned MTU. If the packet succeeds, the path MTU will be raised to this higher value.
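The caching behaviour might look something like the following Python sketch. It is not how any particular operating system implements it: the class, its method names, and the per-destination dictionary are invented for illustration. It also simplifies the "periodic testing" step by simply forgetting a learned value after a timeout (ten minutes here, the value RFC 1191 recommends), so that the next packet is tried at the full interface MTU again.

```python
import time
from typing import Callable, Dict, Tuple

MIN_IPV4_MTU = 68  # smallest MTU an IPv4 link may have


class PathMtuCache:
    """Hypothetical per-destination path MTU cache."""

    def __init__(self, interface_mtu: int, max_age: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interface_mtu = interface_mtu
        self.max_age = max_age  # seconds before a learned value is discarded
        self._clock = clock     # injectable so the timeout can be tested
        self._entries: Dict[str, Tuple[int, float]] = {}

    def lookup(self, dest: str) -> int:
        """Return the MTU to use when sending to `dest`."""
        entry = self._entries.get(dest)
        if entry is None:
            return self.interface_mtu
        mtu, learned_at = entry
        if self._clock() - learned_at > self.max_age:
            # Entry is stale: drop it so a larger packet is tried again.
            del self._entries[dest]
            return self.interface_mtu
        return mtu

    def on_frag_needed(self, dest: str, next_hop_mtu: int) -> None:
        """Record a smaller MTU reported by an ICMP Fragmentation Needed."""
        # Ignore values that would raise the MTU or are implausibly small
        # (including 0, which routers predating RFC 1191 report).
        if MIN_IPV4_MTU <= next_hop_mtu < self.lookup(dest):
            self._entries[dest] = (next_hop_mtu, self._clock())


cache = PathMtuCache(interface_mtu=1500)
cache.on_frag_needed("192.168.1.2", 1400)
print(cache.lookup("192.168.1.2"))  # 1400
print(cache.lookup("192.168.1.3"))  # 1500
```

Keying the cache on the destination address is a design choice for this sketch; an implementation could equally track the value per route.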
You can test path MTU discovery across a live network with a tool like tracepath (part of the Linux iputils package) or mturoute (Windows only). Here's a sample of tracepath output from the lab pictured above, with the MTU of F0/1 reduced to 1400 bytes using the ip mtu command:
Host$ tracepath -n 192.168.1.2
 1:  192.168.0.2   0.097ms pmtu 1500
 1:  192.168.0.1   0.535ms
 1:  192.168.0.1   0.355ms
 2:  192.168.0.1   0.430ms pmtu 1400
 2:  192.168.1.2   0.763ms reached
     Resume: pmtu 1400 hops 2 back 254
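If you want to pull the result out of tracepath output in a script, something like the Python sketch below could work. It assumes the output format shown in the sample above, which may differ between tracepath versions, and the function name is made up for this example.

```python
import re
from typing import Optional

# The sample output from the lab above.
SAMPLE = """\
 1:  192.168.0.2   0.097ms pmtu 1500
 1:  192.168.0.1   0.535ms
 1:  192.168.0.1   0.355ms
 2:  192.168.0.1   0.430ms pmtu 1400
 2:  192.168.1.2   0.763ms reached
     Resume: pmtu 1400 hops 2 back 254
"""


def final_pmtu(output: str) -> Optional[int]:
    """Return the last path MTU reported in tracepath output, if any."""
    values = [int(v) for v in re.findall(r"\bpmtu (\d+)", output)]
    return values[-1] if values else None


print(final_pmtu(SAMPLE))  # 1400
```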
Comments
August 18, 2008 at 2:51 a.m. UTC
Just what is needed! thanks!
Gr
August 18, 2008 at 4:16 a.m. UTC
traceroute can be used as an alternative to tracepath on *nix when specifying -F (do not fragment) and setting the packetsize to your configured interface protocol MTU. If a hop in the path is below your MTU then a !F will be returned with the PMTU discovery value.
A limitation of traceroute is that continued tracing will fail and you'll need to set your packetsize to the PMTU value to test for smaller links further in the path.
August 18, 2008 at 6:25 p.m. UTC
Top stuff! PMTUD is an often unknown or very much misunderstood aspect of the network. This simplifies, explains and provides a method to troubleshoot it very well. The only thing I think you need to add is an explicit explanation of this being a prime example of why some ICMP messages should be permitted - there are a great many overzealous admins out there who configure a blanket drop or deny.
Disabling PMTUD (so that DF is not set by default) or configuring a smaller MTU (e.g. 1420) on internet-facing servers with TCP services is often also a good idea to pre-emptively avoid those PMTUD black holes.
August 29, 2008 at 4:27 p.m. UTC
RE: Colin
Disabling or removing the DF bit is never a good idea. Modern routers are not designed to do fragmentation (and it's not even possible to do fragmentation with IPv6). For example, the 6500/7600 series punts all fragmented packets to the MSFC rather than handling them in hardware on the PFC.
Obviously the best fix is to find out where the ICMP breakdown is occurring and fix it (thus fixing PMTUD). Another option is tcp-mss-adjust (although personally I think that routers shouldn't dip into L4 headers, but that's just opinion). Removing the DF bit or never setting it should never be a valid workaround!
September 3, 2008 at 7:16 p.m. UTC
Nice discussion and it leaves me curious:
How do servers cache their discovery of path MTU?
Is it on a per distant-end IP basis? Per socket?
Thanks!
November 13, 2008 at 10:00 a.m. UTC
Great post, very clear and understandable.
Thanks.
March 18, 2009 at 5:31 a.m. UTC
Great explanation, thanks!
One question I have: is it okay to think of MTU as a layer two restriction which dictates the size of the frames and therefore the size of the packets which can be fit into a frame?
From what I understand, then, a correctly sized packet passed down from layer 3 will fit into a single frame and this can pass through all the MTU restrictions along the path, correct?
November 17, 2009 at 7:27 a.m. UTC
mturoute doesn't test MTU discovery :(. It simply determines the MTU by brute force, with no analysis of the ICMP "fragmentation required" messages.
December 29, 2010 at 7:38 a.m. UTC
Can you please tell me how to disable path MTU discovery on a Linux machine?
May 13, 2011 at 5:51 p.m. UTC
Great article, thank you for taking the time to write it, and for sharing your knowledge with everybody.
June 17, 2011 at 2:41 p.m. UTC
Hi,
I was facing an issue where two particular intranet sites would not open over wireless but worked over wired. There was no error; the page just remained blank. The only resolution I could find was to reduce the MTU size on the user's system, but I am now looking for a global resolution. The wireless controllers are Motorola Symbol RFS 7000. Any suggestions would be very much appreciated.
March 14, 2012 at 10:03 a.m. UTC
The phrase
"... and a subcode of the ICMP Destination Unreachable message, Fragmentation Needed"
was the eureka moment I needed to spot that 'no ip unreachables' on an interface was breaking CIFS traffic across a VPN that has been bugging me for months!
September 6, 2012 at 6:18 a.m. UTC
Thanks Jeremy, great stuff!
April 14, 2013 at 9:35 a.m. UTC
Simple and effective. Thank you very much for the great tut.
August 3, 2013 at 6:16 p.m. UTC
What about existing FW rules blocking ICMP packets of certain size, to prevent flood of large ICMP packets?
Is there any other way to discover PMTU without using ICMP packets?
August 4, 2013 at 2:19 p.m. UTC
Pretty simplified! Thanks.
November 12, 2014 at 4:07 a.m. UTC
Great article! Thank you, Stretch!
June 2, 2015 at 6:35 a.m. UTC
Clear explanation :)
June 5, 2015 at 12:13 p.m. UTC
Thanks for putting this on the internet. Extremely insightful in helping me debug a networking issue I'd been having with OpenStack and GRE. Can you explain what the notation around the router in your 1st diagram (F0/0 & F0/1) means? I'm not a networking engineer 8-).
July 20, 2015 at 10:28 a.m. UTC
Simple and best explanation !!!
October 4, 2015 at 5:04 a.m. UTC
Awesome, thanks for putting this up. The link to the windows mturoute was much appreciated!
April 14, 2016 at 7:53 a.m. UTC
Hi Aneess.
Does your gateway device create a separate interface link? If yes, then try adjusting the MTU of that interface.