Path MTU Discovery

When a host needs to transmit data out an interface, it references the interface's Maximum Transmission Unit (MTU) to determine how much data it can put into each packet. Ethernet interfaces, for example, have a default MTU of 1500 bytes, not including the Ethernet header or trailer. This means a host sending a TCP data stream would typically use the first 20 of these 1500 bytes for the IP header, the next 20 for the TCP header (assuming neither header carries options), and as much of the remaining 1460 bytes as necessary for the data payload. Encapsulating data in maximum-size packets like this keeps the share of bandwidth consumed by protocol overhead as small as possible.
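To put numbers on that, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and the 20-byte header sizes assume no IP or TCP options.

```python
# Header sizes assume no IP or TCP options (the 20-byte minimums used above).
ETHERNET_MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20


def tcp_payload_per_packet(mtu, ip_header=IP_HEADER, tcp_header=TCP_HEADER):
    """Return the largest TCP payload that fits in one packet of size `mtu`."""
    return mtu - ip_header - tcp_header


def header_overhead(payload_size, ip_header=IP_HEADER, tcp_header=TCP_HEADER):
    """Return the fraction of each packet spent on IP and TCP headers."""
    headers = ip_header + tcp_header
    return headers / (payload_size + headers)


full = tcp_payload_per_packet(ETHERNET_MTU)
print(full)                             # 1460
print(f"{header_overhead(full):.1%}")   # 2.7%
print(f"{header_overhead(100):.1%}")    # 28.6%
```

A full-size packet spends under 3% of its bytes on headers, while a 100-byte payload spends more than a quarter of the packet on them.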

Unfortunately, not all of the links that make up the Internet have the same MTU. The MTU offered by a link may vary depending on the physical media type or configured encapsulation (such as GRE tunneling or IPsec encryption). When a router decides to forward an IPv4 packet out an interface, but determines that the packet size exceeds the interface's MTU, the router must fragment the packet to transmit it as two (or more) individual pieces, each within the link MTU. Fragmentation is expensive in both router resources and bandwidth utilization, since new headers must be generated and attached to each fragment. (In fact, the IPv6 specification removes fragmentation by transit routers entirely, but that discussion will be left for another time.)

without_pmtud.png
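The cost of fragmentation is easier to see with a worked example. The sketch below is a simplified model, not a router implementation: it assumes a 20-byte IP header with no options, and `fragment_ipv4` is a hypothetical helper name. It relies on one detail of IPv4: the fragment offset field counts in 8-byte units, so every fragment except the last must carry a multiple of 8 payload bytes.

```python
def fragment_ipv4(total_length, link_mtu, ip_header=20):
    """Split one IPv4 packet into fragments that fit within `link_mtu`.

    Returns a list of (payload_offset_in_bytes, fragment_total_length) tuples.
    Assumes a fixed-size IP header with no options.
    """
    if total_length <= link_mtu:
        return [(0, total_length)]  # fits as-is, no fragmentation needed

    # Payload carried by every fragment except the last, rounded down to a
    # multiple of 8 because the offset field counts 8-byte units.
    max_payload = (link_mtu - ip_header) // 8 * 8
    if max_payload <= 0:
        raise ValueError("link MTU too small to carry any payload")

    payload = total_length - ip_header
    fragments = []
    offset = 0
    while offset < payload:
        chunk = min(max_payload, payload - offset)
        fragments.append((offset, chunk + ip_header))  # each fragment gets its own header
        offset += chunk
    return fragments


frags = fragment_ipv4(1500, 800)
print(frags)                               # [(0, 796), (776, 724)]
print(sum(length for _, length in frags))  # 1520
```

A 1500-byte packet crossing an 800-byte link becomes two packets totaling 1520 bytes, the extra 20 being the second IP header.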

To utilize a path in the most efficient manner possible, hosts must find the path MTU; this is the smallest MTU of any link in the path to the distant end. For example, for two hosts communicating across three routed links with independent MTUs of 1500, 800, and 1200 bytes, the smallest (800 bytes) must be assumed by each end host to avoid fragmentation.

path_mtu.png
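Expressed in code, the path MTU is just the minimum over the links. This small Python sketch uses the three example links above; `path_mtu` is an illustrative name, and a real host does not have this list available, which is why a discovery mechanism is needed.

```python
def path_mtu(link_mtus):
    """Return (smallest MTU, index of the first link that imposes it)."""
    smallest = min(link_mtus)
    return smallest, link_mtus.index(smallest)


links = [1500, 800, 1200]
mtu, bottleneck = path_mtu(links)
print(mtu)         # 800
print(bottleneck)  # 1 (the second link)
```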

Of course, a host has no way of knowing in advance the MTU of each link through which a packet might travel. RFC 1191 defines path MTU discovery, a simple process through which a host can detect a path MTU smaller than its interface MTU. Two components are key to this process: the Don't Fragment (DF) bit of the IP header, and a subcode of the ICMP Destination Unreachable message, Fragmentation Needed.

df_flag.png

Setting the DF bit in an IP packet prevents a router from performing fragmentation when it encounters an MTU less than the packet size. Instead, the packet is discarded and an ICMP Fragmentation Needed message is sent to the originating host. Essentially, the router is indicating that it needs to fragment the packet but the DF flag won't allow for it. Conveniently, RFC 1191 expands the Fragmentation Needed message to include the MTU of the link necessitating fragmentation. A Fragmentation Needed message can be seen in packet #6 of this packet capture.

with_pmtud.png
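For the curious, here is roughly how a host might pull the MTU out of such a message. This is a sketch, not any operating system's actual code. It assumes the RFC 1191 layout of the first eight ICMP bytes (type, code, checksum, 16 unused bits, 16-bit next-hop MTU) and skips checksum validation. The sample message is hand-built for illustration, with its checksum left at zero.

```python
import struct

ICMP_DEST_UNREACHABLE = 3  # ICMP type
ICMP_FRAG_NEEDED = 4       # code: fragmentation needed and DF set


def parse_frag_needed(icmp_bytes):
    """Return the next-hop MTU from a Fragmentation Needed message.

    Returns None if the bytes are too short or are some other ICMP message.
    Routers that predate RFC 1191 leave the MTU field zero, so 0 is possible.
    """
    if len(icmp_bytes) < 8:
        return None
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp_bytes[:8]
    )
    if icmp_type != ICMP_DEST_UNREACHABLE or code != ICMP_FRAG_NEEDED:
        return None
    return next_hop_mtu


# Hand-built sample header (checksum not computed), reporting an MTU of 1400.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
print(parse_frag_needed(sample))  # 1400
```

In a real message, these eight bytes are followed by the IP header and the first eight data bytes of the dropped packet, which is how the host matches the report to a destination.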

Now that the actual path MTU has been learned, the host can cache this value and size future packets for the destination accordingly. Note that path MTU discovery is an ongoing process; the host continues to set the DF flag so that it can detect further decreases in MTU should dynamic routing select a new path to the destination. RFC 1191 also allows for periodic testing for an increased path MTU, by occasionally attempting to pass a packet larger than the learned MTU. If the packet gets through, the path MTU is raised to this higher value.
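The whole exchange can be modeled in a few lines of Python. This is a toy simulation under stated assumptions: the path is a plain list of link MTUs, every router reports its MTU as RFC 1191 describes, and the cache is a dictionary keyed by destination address. The function names are invented for the example, and real stacks differ in how they store and age these entries.

```python
def send_with_df(packet_size, link_mtus):
    """Simulate forwarding a DF-marked packet across the links in order.

    Returns None if the packet is delivered, or the MTU reported by the
    first router that has to drop it.
    """
    for mtu in link_mtus:
        if packet_size > mtu:
            return mtu  # router drops the packet and reports this link's MTU
    return None


def discover_path_mtu(interface_mtu, link_mtus):
    """Return (discovered path MTU, number of packets sent to find it)."""
    size = interface_mtu
    attempts = 0
    while True:
        attempts += 1
        reported = send_with_df(size, link_mtus)
        if reported is None:
            return size, attempts
        size = reported  # shrink to the reported MTU and try again


pmtu_cache = {}  # destination address -> learned path MTU
pmtu_cache["192.168.1.2"], sent = discover_path_mtu(1500, [1500, 800, 1200])
print(pmtu_cache)  # {'192.168.1.2': 800}
print(sent)        # 2
```

Only one packet is lost per bottleneck found. With the links ordered 1500, 1200, 800, the host would take three attempts, learning 1200 first and then 800.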

You can test path MTU discovery across a live network with a tool like tracepath (part of the Linux iputils package) or mturoute (Windows only). Here's a sample of tracepath output from the lab pictured above, with the MTU of F0/1 reduced to 1400 bytes using the ip mtu command:

Host$ tracepath -n 192.168.1.2
 1:  192.168.0.2       0.097ms pmtu 1500
 1:  192.168.0.1       0.535ms 
 1:  192.168.0.1       0.355ms 
 2:  192.168.0.1       0.430ms pmtu 1400
 2:  192.168.1.2       0.763ms reached
    Resume: pmtu 1400 hops 2 back 254 
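If you want to collect this value in a script, the summary line is easy to pick apart. The Python sketch below parses the sample output shown above. It assumes the "Resume: pmtu N hops N" wording seen here, which may differ between tracepath versions, and `parse_resume` is a made-up helper name.

```python
import re

SAMPLE_OUTPUT = """\
 1:  192.168.0.2       0.097ms pmtu 1500
 1:  192.168.0.1       0.535ms
 1:  192.168.0.1       0.355ms
 2:  192.168.0.1       0.430ms pmtu 1400
 2:  192.168.1.2       0.763ms reached
     Resume: pmtu 1400 hops 2 back 254
"""


def parse_resume(output):
    """Return (path MTU, hop count) from tracepath's summary line, or None."""
    match = re.search(r"Resume:\s+pmtu\s+(\d+)\s+hops\s+(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_resume(SAMPLE_OUTPUT))  # (1400, 2)
```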

About the Author

Jeremy Stretch is a freelance networking engineer, instructor, and the maintainer of PacketLife.net. He currently lives in Fairfax, Virginia, on the edge of the Washington, DC metro area. Although primarily an R&S guy, he likes to get into everything, and runs a free network training lab out of his basement for fun. You can contact him by email or follow him on Twitter.

Comments

Just what is needed! thanks!

Gr

traceroute can be used as an alternative to tracepath on *nix when specifying -F (do not fragment) and setting the packetsize to your configured interface protocol MTU. If a hop in the path is below your MTU then a !F will be returned with the PMTU discovery value.

A limitation of traceroute is that continued tracing will fail and you'll need to set your packetsize to the PMTU value to test for smaller links further in the path.

Top stuff! PMTUD is an often unknown or very much misunderstood aspect of the network. This simplifies, explains and provides a method to troubleshoot it very well. The only thing I think you need to add is an explicit explanation of this being a prime example of why some ICMP messages should be permitted - there are a great many overzealous admins out there who configure a blanket drop or deny.

Disabling PMTUD (so that DF is not set by default) or configuring a smaller MTU (e.g. 1420) on internet-facing servers with TCP services is often also a good idea to pre-emptively avoid those PMTUD black holes.

RE: Colin

Disabling or removing the DF bit is never a good idea. Modern routers are not designed to do fragmentation (and it's not even possible to do fragmentation with IPv6). For example, the 6500/7600 series punts all fragmented packets to the MSFC, so they are not handled in hardware by the PFC.

Obviously the best fix is to find out where the ICMP breakdown is occurring and fix it (thus fixing PMTUD). Another option is tcp-mss-adjust (although personally I think that routers shouldn't dip into L4 headers, but that's just opinion). Removing the DF bit or never setting it should never be a valid workaround!

Nice discussion and it leaves me curious:

How do servers cache their discovery of path MTU?

Is it on a per-distant-end IP basis? Per socket?

Thanks!

Great post, very clear and understandable.

Thanks.

Great explanation, thanks!

One question I have: is it okay to think of MTU as a layer two restriction which dictates the size of the frames and therefore the size of the packets which can be fit into a frame?

From what I understand, then, a correctly sized packet passed down from layer 3 will fit into a single frame and this can pass through all the MTU restrictions along the path, correct?

mturoute doesn't test MTU discovery :(. It simply determines the MTU by brute force. No analysis of the ICMP "fragmentation required" messages.

Can you please tell me how to disable path MTU discovery on a Linux machine?

Great article, thank you for taking the time to write it, and for sharing your knowledge with everybody.

Hi ,

I was facing an issue where two particular intranet sites would not open over wireless, although they worked over wired. There was no error; the page just remained blank. The only resolution I could find was to reduce the MTU size on the user's system, but I am now looking for a global resolution. The wireless controllers are Motorola Symbol RFS 7000. Any suggestions would be very much appreciated.

The phrase

"... and a subcode of the ICMP Destination Unreachable message, Fragmentation Needed"

was the eureka moment I needed to spot that 'no ip unreachables' on an interface was breaking CIFS traffic across a VPN that has been bugging me for months!
