The Overlay Problem: Getting In and Out
By stretch | Friday, September 30, 2016 at 1:47 p.m. UTC
I've been researching overlay network strategies recently. There are plenty of competing implementations available, employing various encapsulations and control plane designs. But every design I've encountered seems ultimately hampered by the same issue: scalability at the edge.
Why Build an Overlay?
Imagine a scenario where we've got 2,000 physical servers split across 50 racks. Each server functions as a hypervisor housing on average 100 virtual machines, resulting in a total of approximately 200,000 virtual hosts (~4,000 per rack).
In an ideal world, we could allocate a /20 of IPv4 space to each rack. The top-of-rack (ToR) L3 switches in each rack would advertise this /20 northbound toward the network core, resulting in a clean, efficient routing table in the core. This is, of course, how IP was intended to function.
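As a rough sketch of that allocation, the snippet below carves one /20 per rack out of a larger block using Python's standard `ipaddress` module. The 10.0.0.0/14 supernet is an assumption chosen purely for illustration; the rack, server, and VM counts come from the scenario above.

```python
import ipaddress

RACKS = 50
SERVERS = 2000
VMS_PER_SERVER = 100

# Hypothetical supernet, chosen only for illustration.
SUPERNET = ipaddress.ip_network("10.0.0.0/14")


def allocate_rack_prefixes(supernet, racks, new_prefix=20):
    """Return the first `racks` subnets of length `new_prefix` from `supernet`."""
    subnets = supernet.subnets(new_prefix=new_prefix)
    return [next(subnets) for _ in range(racks)]


rack_prefixes = allocate_rack_prefixes(SUPERNET, RACKS)
vms_per_rack = SERVERS // RACKS * VMS_PER_SERVER

# Each ToR switch would advertise exactly one of these prefixes northbound.
print(rack_prefixes[0], rack_prefixes[-1])
print(vms_per_rack, "VMs per rack vs.", rack_prefixes[0].num_addresses, "addresses per /20")
```

With 4,000 VMs per rack and 4,096 addresses in a /20, each rack fits with a little headroom, and the core carries 50 routes instead of 200,000.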
Unfortunately, this approach isn't usually viable in the real world because we need to preserve the ability to move a virtual machine from one hypervisor to another (often residing in a different rack) without changing its assigned IP address. Establishing the L3 boundary at the ToR switch prevents us from doing this efficiently.
The simplest solution is to move the L3 boundary higher in the network and re-deploy the distribution layer as a giant L2 domain spanning all racks. This would leave us with a /14 worth of IPv4 space at the edge of a sprawling L2 mess. The need to prevent L2 loops means we cannot utilize redundant paths within this domain without employing some form of multichassis link aggregation between the ToR and distribution switches. We also need to invest in some seriously beefed-up switches at the distribution layer which can deal with hundreds of thousands of MAC, ARP (IPv4), and ND (IPv6) table entries.
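The /14 figure can be checked with a few lines of arithmetic. This is only a sketch, and it assumes the 50 rack-sized /20 blocks are allocated contiguously from a single aligned prefix.

```python
RACKS = 50
ADDRESSES_PER_RACK = 2 ** (32 - 20)  # a /20 holds 4,096 addresses


def covering_prefix_length(address_count):
    """Longest IPv4 prefix length whose block holds `address_count` addresses."""
    host_bits = (address_count - 1).bit_length()
    return 32 - host_bits


total_addresses = RACKS * ADDRESSES_PER_RACK
prefix_len = covering_prefix_length(total_addresses)

print(total_addresses)         # addresses needed across all racks
print(prefix_len)              # prefix length of the single flat L2 domain
print(2 ** (32 - prefix_len))  # addresses available in that prefix
```

A /15 holds only 131,072 addresses, which is fewer than the 204,800 required, so the flat domain rounds up to a /14.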
This L2 approach buys us VM mobility at the expense of a complex, fragile infrastructure, but it might work well enough. That is, until we decide we also want forwarding plane isolation. We could implement a limited number of isolated domains among virtual machines using manually-configured VLANs, but with 200,000 hosts it's likely we'd quickly exhaust the 12-bit 802.1Q ID space.
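To put a rough number on that exhaustion, the sketch below compares the 12-bit 802.1Q ID space with the number of isolated domains 200,000 hosts might need. The average of 10 VMs per isolated domain is a made-up figure for illustration. The 24-bit comparison uses the raw size of an identifier such as the VXLAN VNI.

```python
def id_space(bits, reserved=0):
    """Number of usable identifiers in a field of `bits` bits."""
    return 2 ** bits - reserved


VLAN_IDS = id_space(12, reserved=2)  # VLAN IDs 0 and 4095 are reserved in 802.1Q
WIDE_IDS = id_space(24)              # raw size of a 24-bit segment identifier

TOTAL_VMS = 200_000
VMS_PER_DOMAIN = 10  # hypothetical average, not a figure from the article

# Ceiling division: every partially filled domain still consumes an ID.
domains_needed = -(-TOTAL_VMS // VMS_PER_DOMAIN)

print(domains_needed, VLAN_IDS, domains_needed > VLAN_IDS)
print(WIDE_IDS, domains_needed <= WIDE_IDS)
```

Under that assumption, 20,000 domains are needed against 4,094 usable VLAN IDs.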
This is where an overlay network becomes interesting. The theory is that we can abstract the forwarding plane from the physical network, alleviating the need for all network nodes to know where every end host resides, while simultaneously decoupling forwarding isolation from physical transport. This frees us to go back to building simple L3 underlay networks, over which encapsulated overlay traffic can be carried opaquely. In this design, only hypervisors need to know where virtual machines physically exist (information which can be disseminated using some sort of directory service or flood-and-learn behavior). The underlay needs only enough information to carry a packet from one hypervisor to another. So in contrast to the overlay, which comprises 200,000 hosts (all virtual), the underlay would have only 2,000 hosts (all physical).
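The idea can be sketched as a lookup followed by a wrap. Everything here is hypothetical: the `Directory` class stands in for whatever directory service or learned table a real overlay would use, the field names are invented, and the addresses are documentation examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncapsulatedPacket:
    outer_src: str   # underlay address of the sending hypervisor
    outer_dst: str   # underlay address of the receiving hypervisor
    segment_id: int  # isolation domain carried in the overlay header
    inner_dst: str   # overlay address of the destination VM
    payload: bytes


class Directory:
    """Hypothetical location directory: (segment, VM address) -> hypervisor."""

    def __init__(self):
        self._locations = {}

    def register(self, segment_id, vm_ip, hypervisor_ip):
        self._locations[(segment_id, vm_ip)] = hypervisor_ip

    def lookup(self, segment_id, vm_ip):
        return self._locations.get((segment_id, vm_ip))


def encapsulate(directory, local_hypervisor, segment_id, inner_dst, payload):
    """Wrap a VM's packet so the underlay only sees hypervisor addresses."""
    remote = directory.lookup(segment_id, inner_dst)
    if remote is None:
        return None  # unknown destination; a real system might flood or query
    return EncapsulatedPacket(local_hypervisor, remote, segment_id, inner_dst, payload)


directory = Directory()
directory.register(5001, "10.0.0.10", "192.0.2.11")
before_move = encapsulate(directory, "192.0.2.10", 5001, "10.0.0.10", b"hello")

# The VM migrates to another hypervisor: only the directory entry changes.
directory.register(5001, "10.0.0.10", "192.0.2.42")
after_move = encapsulate(directory, "192.0.2.10", 5001, "10.0.0.10", b"hello")

print(before_move.outer_dst, after_move.outer_dst, after_move.inner_dst)
```

The VM keeps its overlay address across the move, and the underlay never needs a route for it.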
The Gateway Problem
This all sounds pretty fantastic, right? Unfortunately, in networking we never really solve problems: We just move them around. Assuming we have a viable method for disseminating forwarding information to all hypervisors, the overlay itself works just fine. However, a problem arises when we consider how to attach the overlay to the "real" world. How does traffic from the Internet, or any traditional network, get routed into and out of the overlay?
A network engineer's first instinct is to place a designated set of boxes at the overlay edge to function as gateways between it and the outside world. The challenge with this approach is finding suitable equipment for the job. There are several crucial criteria that must be considered:
- FIB capacity - Can the device store the requisite number of IPv4 and IPv6 routes (including encapsulation tags) for every host in the overlay?
- Throughput - Can the device forward encapsulated traffic at (or reasonably close to) wire rate?
- Supported encapsulations - Does the device support our preferred method of overlay encapsulation?
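The three criteria above can be expressed as a simple screening check. This is a sketch only: both devices and all of their capacity numbers are invented, and it assumes each overlay host needs one IPv4 and one IPv6 host route on the gateway.

```python
from dataclasses import dataclass

HOSTS = 200_000  # virtual hosts in the overlay, from the scenario above


@dataclass(frozen=True)
class GatewayCandidate:
    name: str
    ipv4_host_routes: int        # /32 entries the FIB can hold
    ipv6_host_routes: int        # /128 entries the FIB can hold
    wire_rate_with_encap: bool   # forwards encapsulated traffic near wire rate
    encapsulations: frozenset    # overlay encapsulations supported


def shortfalls(device, hosts, encapsulation):
    """Return the criteria the device fails; an empty list means it qualifies."""
    problems = []
    # Assumption: one IPv4 and one IPv6 host route per overlay host.
    if device.ipv4_host_routes < hosts or device.ipv6_host_routes < hosts:
        problems.append("FIB capacity")
    if not device.wire_rate_with_encap:
        problems.append("throughput")
    if encapsulation not in device.encapsulations:
        problems.append("encapsulation")
    return problems


# Hypothetical devices; the numbers are made up for illustration.
big_box = GatewayCandidate("big-box", 512_000, 256_000, True, frozenset({"vxlan", "nvgre"}))
small_box = GatewayCandidate("small-box", 128_000, 64_000, True, frozenset({"vxlan"}))

print(big_box.name, shortfalls(big_box, HOSTS, "vxlan"))
print(small_box.name, shortfalls(small_box, HOSTS, "vxlan"))
```

Published capacity figures may not hold once protocols and encapsulations are mixed, so real numbers fed into a check like this would need to be measured rather than read off a data sheet.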
There are few existing products from the major network vendors that meet all our requirements, and they're not cheap. Alternatively, we could try building our own gateways on commodity x86 hardware, but maintaining sufficient performance is sure to be a challenge.
The best approach might be to employ an army of relatively low-powered devices along the overlay edge, divided into discrete functional sets of perhaps four or eight devices each. Each set would be responsible for providing edge connectivity to a portion of the hypervisors. This should allow us to scale horizontally with ease and at relatively low cost.
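A back-of-the-envelope sketch of that partitioning follows. The modulo assignment and the choice of 10 gateway sets are assumptions for illustration, not a recommendation, and the sketch ignores how upstream routers would pick the right set for a given destination.

```python
HYPERVISORS = 2000
VMS_PER_HYPERVISOR = 100


def assign_gateway_set(hypervisor_id, set_count):
    """Map a hypervisor to a gateway set (hypothetical modulo scheme)."""
    return hypervisor_id % set_count


def vms_per_set(set_count):
    """Count the VMs each gateway set would have to hold state for."""
    hypervisors_in_set = [0] * set_count
    for hypervisor_id in range(HYPERVISORS):
        hypervisors_in_set[assign_gateway_set(hypervisor_id, set_count)] += 1
    return [count * VMS_PER_HYPERVISOR for count in hypervisors_in_set]


SET_COUNT = 10  # hypothetical number of gateway sets
load = vms_per_set(SET_COUNT)

print(load[0], max(load), sum(load))
```

With 10 sets, each set tracks 20,000 VMs rather than 200,000, which is the point of scaling horizontally: per-device table requirements shrink as sets are added.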
I don't have any answers yet, but it's certainly an interesting problem to mull over. I'd be interested to hear if any readers have headed down this path yet.
Posted in Design
September 30, 2016 at 1:56 p.m. UTC
"Unfortunately, this approach isn't usually viable in the real world because we need to preserve the ability to move a virtual machine from one hypervisor to another (often residing in a different rack) without changing its assigned IP address. Establishing the L3 boundary at the ToR switch prevents us from doing this efficiently."
No, putting an L3 boundary at the ToR doesn't prevent it, it could work just fine with an L3 ToR if the hypervisor vendors hadn't totally punted on networking design of hypervisors. If the hypervisors would participate in the L3 routing world (just implement BGP and use the pre-existing L3 routing capability of the extant hypervisor platforms, for goodness sakes) then this whole problem disappears without any of the huge mess the networking industry has gone through for the past 5+ years trying to make this stuff work without recognizing the easy solution.
September 30, 2016 at 4:02 p.m. UTC
@Jeff: The problem with that approach is that you then have hundreds of thousands of host routes being injected into the network. Most ToR switches available today can't support that many routes in their FIB.
Take the article's example of 200,000 virtual machines. That's 200K IPv4 /32s and 200K IPv6 /128s if we don't do any form of aggregation. Odds are you're not going to be able to stuff that many routes into whatever model access switch you use without it falling over.
September 30, 2016 at 6:19 p.m. UTC
Isn't this, in the end, an ARP storm design flaw?
Maybe you could sacrifice a port on every ToR switch and run proxy ARP on it.
September 30, 2016 at 8:39 p.m. UTC
- Use network equipment that can support that many routes. We are talking /32 and /128 so Tomahawk can do it, for example. And I am sure Mellanox too.
- Or better, instead of having IPs dancing around racks, assign a /20 per rack and use a service discovery service or DNS (at the end of the day, DNS was invented exactly for this use case ;) ). You will achieve VM mobility without having to move IPs around. If some particular application doesn't support that mechanism, then assign the exceptions a /32 that can float around. If you honestly have 200,000 VMs and your organization doesn't know how to use DNS or a service discovery service, there is something very broken somewhere, run : )
October 1, 2016 at 12:23 p.m. UTC
Why not use the L2 approach with VXLAN and, instead of the 12-bit restriction (4,096 VLANs), use the 24 bits' worth?
October 1, 2016 at 6:11 p.m. UTC
I also like the idea of running routing on hypervisors. When you have L3 it's so much better than any hack to legacy L2 protocols.
Regarding switches that will support 200k routes: just FYI, the Nexus 9372 supports 208k host routes, and other switches in the Nexus 93xx family support even more. And I'm pretty sure other vendors have similar specs.
October 2, 2016 at 3:23 p.m. UTC
"In an ideal world, we could allocate a /20 of IPv4 space to each rack."..."we need to preserve the ability to move a virtual machine from one hypervisor to another (often residing in a different rack) without changing its assigned IP address." Why not use pure IPv6 space? So even if switched from one hypervisor to another, there would be no need to do all the rest of the steps outlined in this article. BTW, Welcome to the year 2016.
October 2, 2016 at 6:25 p.m. UTC
@David: As mentioned, big boxes come with big price tags. And I'm not sure I understand your point about DNS. Relocating the VM would still require changing its IP address. (I work for a VPS provider, so modifying the host OS is something we prefer to avoid entirely.)
@Nick: You'll find that those specs are best-case scenarios, and don't account for commingling of protocols or overlay encapsulations.
@Michael: We are several years away from an IPv6-only VPS being a viable product, unfortunately.
October 4, 2016 at 9:24 a.m. UTC
It seems that you are focusing on the wrong thing. An IP address is not important, but the server function is. I guess this is the generic fault with engineers: they want control of everything. If you use the existing mechanisms for providing IP addresses with DHCP/DNS, you can stick to the /20 design. If the server moves it gets a new IP address, but since you are using DNS you will be able to connect to it anyway. And since it uses DHCP/DNS you don't have to touch it as a provider.
So it's your IPAM/DCIM solution that needs an ID other than the IP address to identify a customer VPS.
As long as we are focusing on fixed addresses we will never move on to the dynamic nirvana IPv6 will provide us with.
So you send the problem from networking to application devs, who keep track of whatever address the system has at the current time.
October 4, 2016 at 1:24 p.m. UTC
@Stig: We are in the business of VPS hosting. Asking customers to change their IPs is a non-starter, unfortunately. If only it were that simple.
October 4, 2016 at 1:47 p.m. UTC
You might want to take a look at this RFC for a different take on it:
October 4, 2016 at 9:51 p.m. UTC
Something like LISP on the edge is also an interesting solution but remains mostly academic at this point.
October 9, 2016 at 3:05 a.m. UTC
I think VMware NSX has solved this in a reasonable way - deploy a layer of VMs running routing software, with one leg in overlay space, and another in physical.
Modern servers support 40G interfaces, and DPDK-based VM routers should be able to provide matching forwarding performance.
I haven't looked into what's available in open source space, but there sure are commercial virtual routers that should be able to push 40G+ per instance.
October 9, 2016 at 3:36 a.m. UTC
The solution is to kill L2. Use a /32 for the VM (container) and then build the pods for the size of the tables in the switches (leaving headroom). Upstream to the bigger routers with bigger tables, and summarize where you can. L2 needs to die. The latest AvagoCom (Broadcom) switches are ideal for this. It is 2016. If you have a dependency on L2 then you are doing it wrong. Yes, I know, "but we have these legacy apps"... Ugh. Oh, well. I talked to someone with an SNA network the other day.
October 17, 2016 at 12:58 p.m. UTC
You don't need 200k routes in the TOR. Just send a default route from your core layer, no? The TOR would only need routes for hosts/IPs in the local rack. The core should be able to handle the 200k routes.
October 17, 2016 at 9:24 p.m. UTC
@sean and @thomas: there are potential blackhole routing issues when aggregating subnets in a CLOS (which I assume is in use here). https://tools.ietf.org/html/rfc7938#section-8.2 has more info.
There is also the issue of potentially wasted IP space when you assign a /20 (or whatever) to a pod. In IaaS environments VMs can be different sizes, meaning pods may differ widely in VM capacity.
December 19, 2016 at 8:07 a.m. UTC
In reading this, Locator/ID Separation Protocol (LISP) comes to mind first. I've seen it used in production and it does quite well.
February 15, 2017 at 4:15 a.m. UTC
Can you split your VPS clusters/domains up into 10 racks per?
February 24, 2017 at 5:01 a.m. UTC
Disclaimer - I've been supporting Cisco ACI for about 2 years so I'm definitely biased...
The problem you raise is the original motivation for Cisco ACI. The goal is to have up to 1 million endpoints within the fabric while also using TOR leafs with relatively small table sizes (and therefore cheaper per-port cost). The hardware could support this scalability, but I think software currently limits it to about 180k endpoints per fabric. All endpoints are stored on the spines so traffic destined to a remote leaf can be proxied by the TOR to the spine for zero-penalty lookup (i.e., no penalty in CLOS network nor spine pipeline for routing between VTEPs vs. proxy lookup for an endpoint).
So when you talk about FIB scalability, the number of endpoints (MACs, IPv4 /32, and IPv6 /128) can be extremely high. Also, all forwarding and encapsulation is done in a single hardware lookup to support line-rate 10G/40G/100G. And, it uses a routed overlay so your L2 flood traffic is encapsulated in L3 multicast and therefore limited to only the ports that actually have the BD currently active (with interesting features around dynamically pushing BDs only to ports that have a VM active). Also, flooding can be completely disabled if not needed with some cool learning and forwarding features on the leaf (i.e., redirect ARP broadcast to the specific port+encap where the target IP was learned).
The overlay is VXLAN encap (so I don't know if that satisfies your 3rd point), but the access encap for host traffic can be untagged/VLAN/VXLAN (and NVGRE in the future, although the demand does not seem to be there).
Price is relatively low for Cisco gear. But, what do I know, I'm on the technical side not sales ;-)
March 1, 2017 at 9:50 p.m. UTC
SPB wraps layer 2 inside IS-IS and allows for some interesting flexibility while eliminating many of the issues inherent in layer 2. You can get neat things like ECMP and all-active links without requiring clustering or stacking.
March 1, 2017 at 9:51 p.m. UTC
TRILL/FabricPath functions similarly.
April 27, 2017 at 6:58 a.m. UTC
Hmmm, good to know. great publication.