Replacing an MPLS WAN with an Internet VPN Overlay
By stretch | Monday, July 14, 2014 at 1:03 p.m. UTC
I received an email last week from a reader seeking advice on a fairly common predicament:
Our CIO has recently told us that he wants to get rid of MPLS because it is too costly and is leaning towards big internet lines running IPSEC VPNs to connect the whole of Africa.
As you can imagine, this has caused a huge debate between the networks team and management, we run high priority services such as Lync enterprise, SAP, video conferencing etc. and networks feel we need MPLS for guaranteed quality for these services but management feels the Internet is today stable enough to run just as good as MPLS.
What is your take on the MPLS vs Internet debate from a network engineer's point of view? And more so, would running those services over Internet work?
This is something I struggled with pretty frequently in a prior job working for a managed services provider. MPLS WANs are great because they provide flexible, private connectivity with guaranteed throughput. Most MPLS providers also allow you to choose from a menu of QoS schemes and classify your traffic so that real-time voice and video services are given higher priority during periods of congestion.
Unfortunately, MPLS WANs tend to be considerably more expensive than Internet circuits. A dedicated 3 Mbps MPLS circuit might cost three or four times as much as a 50 Mbps business class broadband Internet circuit. These numbers are hard to justify to management who may not appreciate the importance of reliability and QoS controls. Since private connectivity can be achieved using a VPN overlay on top of plain Internet circuits, can we still justify the cost of MPLS WANs? Should we?
My advice would be to stick with the MPLS WAN if you can afford it. A VPN overlaid on top of Internet circuits might work most of the time, but when it doesn't perform adequately, you'll have little immediate recourse. Should you decide on moving to a VPN overlay, do so in phases: Keep the MPLS WAN around for a few months in case the overlay strategy doesn't work out. But if you find that your Internet circuits provide sufficient throughput so that congestion of real-time services never becomes a problem, maybe that's an acceptable solution.
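To put the price gap in per-megabit terms, here is a rough calculation using only the example figures above (a 3 Mbps MPLS circuit at three to four times the price of a 50 Mbps broadband circuit). The price unit is arbitrary and the function name is just for illustration; real quotes will vary widely by region and carrier.

```python
def cost_per_mbps(monthly_cost: float, mbps: float) -> float:
    """Monthly cost divided by circuit bandwidth."""
    return monthly_cost / mbps


# Arbitrary price unit: the 50 Mbps broadband circuit costs 1.0 per month.
broadband = cost_per_mbps(1.0, 50)

for multiple in (3, 4):
    # The 3 Mbps MPLS circuit costs `multiple` times the broadband circuit.
    mpls = cost_per_mbps(multiple * 1.0, 3)
    ratio = mpls / broadband
    print(f"MPLS at {multiple}x the price: about {ratio:.0f}x the cost per Mbps")
```

Under those example numbers, each megabit of MPLS costs roughly 50 to 67 times as much as a megabit of broadband, which is the gap management sees on paper.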
About the Author
Jeremy Stretch is a network engineer living in the Raleigh-Durham, North Carolina area. He is known for his blog and cheat sheets here at Packet Life. You can reach him by email or follow him on Twitter.
July 14, 2014 at 1:36 p.m. UTC
In my previous employment we had about 18 remote sites mixed with MPLS, Frame-relay and VPN on broadband. With my current employer we have about 7 remote offices, all of which are using VPN on broadband. Having worked with all 3 options, it all depends on what type of traffic you intend to run. At my previous employer, print jobs were local and most traffic was from Wyse terminals to a Citrix desktop, but TCP connection timeouts could be an issue, especially with SonicWALL firewalls, where we had to change the timeout setting from 15 minutes to 1 hour.
With my current employer the bandwidth is important due to print jobs, scanned documents and medical images as an MPLS circuit with the bandwidth requirements necessary would be too costly for the org. We are also using Wyse terminals with a Citrix desktop, this time with Cisco ASA firewalls, but the poor quality of service on broadband can cause the Citrix desktops to disconnect. The local carrier has had issues in a specific area causing this problem. We have two offices on the same broadband provider who run VoIP between each other on a L2L VPN and sometimes the QoS is an issue as calls can sound as if the call is in an empty, echoing warehouse.
More times than not, I find repair/response times better with broadband than with MPLS/T1 circuits, which in the US are usually covered by tariffs defining SLAs for response, repairs and credits.
July 14, 2014 at 1:46 p.m. UTC
Yes, the Internet is "stable", but there is no guarantee of bandwidth/performance. There is no QoS, etc on VPN-over-Internet connections. This is something I have to stress over and over again to network design engineers, customers, etc.
July 14, 2014 at 1:57 p.m. UTC
We are big fans of Cisco's MPLS over mGRE. We implemented it not only on top of the Internet but even over ISPs' MPLS links. This method allows us to control the full routing table, VRF segmentation, and even route IPv6 (via 6VPE) traffic without ISP involvement. Of course the drawback is the MTU issue, which makes TCP MSS adjustment critical.
July 14, 2014 at 3:47 p.m. UTC
Just make sure that for "carefully selected" location you can run parallel downloads during a VIP videoconf session. There still are lots of places on Earth where internet connectivity is just bad.
A more sensible approach would be to start with pilot sites and see what it gives. In some places MPLS may be THE way to go, in other places big business-class internet pipes MAY be a perfectly valid solution. You could end up with a mix that would cost less than a fully MPLS-based solution and still deliver comparable (or even better) performance - best of both worlds.
July 14, 2014 at 5:40 p.m. UTC
Maybe the more interesting question is whether local pricing (Africa) influences the choice. It could be that MPLS offerings there perform as badly as the regular Internet.
July 14, 2014 at 8:13 p.m. UTC
We run Telepresence over our MPLS circuits and you really can't do that with IPSEC over the Internet. We've tried and it's a mess. Without end-to-end QOS you might as well run Skype.
July 15, 2014 at 12:26 p.m. UTC
We currently use MPLS, which offers a great private design with fail over capabilities and QoS throughout the architecture, but the problem is the cost for the limited amount of bandwidth. We constantly run into bandwidth issues at some of our locations. These circuits are expensive. We are currently looking into an Ethernet Virtual Private Line (EVPL) solution for less money with higher bandwidth. This may not be the answer because the ISP's EVPL footprint is not everywhere we have a location, which would mean a third party vendor would have to get involved.
July 15, 2014 at 5:35 p.m. UTC
Yo Stretch, maybe consider MPLS half tunnels. We use them and they basically work like this: IPSec tunnels over the internet to Sprint's MPLS backbone edge routers. All you need is an internet connection at the office. It is pretty cheap too, something like 70 bucks a month.
July 16, 2014 at 4:42 p.m. UTC
Given the low cost, maybe the best approach is to procure a cable/DSL/whatever general IP connection in the area of interest and test whether your mission critical applications will work. There's so much variance between local ISPs that I'd strongly suggest trying the service at the address where it's needed, too. Worst case is you have a cable/DSL path for 12 months at ~$65/mo. That's a pretty low cost to pilot a solution which might save the company a ton of cash. And if it doesn't work you can report back that you performed real-world trials and failed.
July 18, 2014 at 12:45 a.m. UTC
Depends on the applications you need to run.
If you explain it to users, in many cases they're fine with accepting the occasional blip knowing that they're saving a bundle of money.
There are risks, do your best to calculate them and explain them to the business folks. If they're OK, you get the appropriate communication plan going, then it's probably worth a pilot experiment.
Good monitoring will make a huge difference. Find something that will watch packet loss, one-way latency, etc. perfSONAR is OSS and geared toward high performance research networks. AppNeta is a slicker solution, and also much more expensive. There are likely other solutions out there, but 5 pings every minute aren't going to do the trick. Issues live in the seconds when your polling monitor isn't running.
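A toy model makes the polling point concrete. It assumes a ten-minute window with one time slot per second, a single hypothetical 20-second brownout, and a poller that fires its 5 pings in the first five seconds of every minute. None of these numbers come from a real network; they only show how a short event can fall between polls.

```python
def loss_seen(lost_seconds, probe_times):
    """Percent of probes sent during a second in which the path was dropping packets."""
    probes = list(probe_times)
    lost = sum(1 for t in probes if t in lost_seconds)
    return 100.0 * lost / len(probes)


DURATION = 600                     # ten minutes, one slot per second
outage = set(range(130, 150))      # hypothetical 20-second brownout

# Five pings at the top of each minute versus one probe every second.
burst_poller = [m * 60 + i for m in range(DURATION // 60) for i in range(5)]
every_second = range(DURATION)

print(f"burst poller sees {loss_seen(outage, burst_poller):.1f}% loss")
print(f"per-second probe sees {loss_seen(outage, every_second):.1f}% loss")
```

In this model the burst poller reports no loss at all, while the per-second probe reports a little over 3%, which is more than enough to wreck a voice call.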
July 18, 2014 at 2:48 p.m. UTC
In some of the Fortune 500 companies that we manage, we use VPN overlay only as a back-up connection; and MPLS as the primary path of traffic. However, the design for any network heavily depends on the budget.
July 18, 2014 at 5:45 p.m. UTC
We use MPLS, IPSEC and SSL-VPN Solutions and we have had various consistent speed & reliability problems with IPSec over the years that just don't happen with MPLS. As mentioned, it depends what you are doing with it and how critical the site is.
July 18, 2014 at 7:05 p.m. UTC
Minimum latency and dedicated L2/L3 across MPLS or just overhead with L3 VPN over the Internet. Since this is in Africa, I believe MPLS lines are not relatively cheap, therefore Internet feed is the best solution.
If you are looking at mission-critical applications traversing those lines, I would try to convince management to go with the MPLS option, as in the long term it will benefit you. I stick to your recommendation Jeremy - phases - to test new services as well as to have a backup solution in place.
July 20, 2014 at 9:30 p.m. UTC
Can anyone, including Jeremy, who is working or has worked in an ISP environment shed light on how they guarantee bandwidth and latency on their end for the customers? Is it strictly QoS and bandwidth restricting + high end chassis switches? Thanks!
July 22, 2014 at 5:17 a.m. UTC
As Network Engineers, we all know (at least we should all know) that MPLS WAN's provide us mainly with advantages.... Private access with guaranteed/SLA'd latency and the ability to work hand-in-hand with the provider to guarantee a committed bandwidth, honor queues end to end, etc etc. And as Network Engineers who have worked on the Analyst side of the aisle, we also know that MPLS WAN's are insanely expensive. Yes, compared to the Frame-Relay days, MPLS WAN's drove the Big Three providers to decrease WAN costs extensively – especially in the NxT1 arena.... Most larger corporations sign contracts with these said-Service Providers using Legal Teams competent enough to insert sharp teeth into the contract – so that the SLA's have some teeth.
So we like to be able to provide at least some level of metrics to “prove it's not the network”. And, let's also face that, as Network people, those metrics don't get us very far in proving that. Most Enterprises I've seen, look at the monthly Network Performance Results emails for a total of about 2 months worth – and then, it inevitably ends up in an email filter.
Let's face it.... We Network Engineers love to talk about how critical it is for us to have these guarantees. Why? Because we all know who ultimately gets blamed for just about everything – especially performance-based issues..... we want to err way on the side of caution. “If we can get them to pay for it, then why not?!?” We often get so compartmentalized, that we start believing that the Funding Well is endless. Don't get me wrong though – Infrastructure, especially NETWORK Infrastructure, is constantly the victim of Chronic Underinvestment by management. That's why it is so critical to have a highly skilled, highly experienced, politically savvy, and somewhat fearless Project/Program Manager that can go fight for this money.
About to overgeneralize, perhaps; but we still ultimately care about Availability and Performance..... and there are many variables that we can get into starting at Bandwidth/Latency/Jitter, getting into how we prioritize traffic, etc. What does management (you know, the ones that don't care about the network until it breaks, the ones responsible for Chronic Underinvestment) care about? 1. Uptime. 2. “Speed” 3. Can they stream their favorite Sports Network online, trade their Stocks, and do their Facebook? Let's remember our reality as Network Engineers . . . . 95% of us work for companies that have never taken the time/investment to determine the actual loss of revenue associated with Network Downtime as well as certain levels of Degradation. In those same said-companies, the most important thing to most upper managers is ultimately “Is the network UP (#1) and fast enough (#2) so that I can stream today's All Blacks Rugby game (#3) on my sports network (and of course post my up to the minute commentary about the game on Facebook/Twitter)?” That reality sucks, but I've seen it be all too true – alarmingly true. “Chicago office couldn't access Exchange for two hours yesterday”? Response: “Schedule a meeting with (Insert Network Team Manager Name HERE), figure out what happened, and let me know whenever you have a chance.” “Internet Down”? Hell to pay.
Average latencies across the country for an MPLS WAN generally SLA at around 75-90ms from coast to coast with <0.01% loss. Latencies across the country for an average Business-grade ISP in a reasonably populated metro area run an average of around 40-55ms with around .01% loss. Let's remember that the Internet Backbone is much more mature than it was ten years ago, or even five years ago. Many Network Engineers will remember what happened when the Towers fell on 9/11 and took out the massive POP that existed under the complex...... Internet Latencies doubled, tripled, or worse..... Today, the Internet Backbone benefits from more resilient infrastructure, fast rerouting, and much more stable DNS services.
With all that said, this is one of those topics that could be argued for months. If we can get five to ten times the bandwidth with an Internet Circuit, for half the price....... the need for queuing and whatnot is no longer as much of an issue – You have 5-10x the available bandwidth and similar (if not better) latency. Network Infrastructure costs (Router models) are comparable, as well..... so maybe, in this day and age, it might be worth giving dynamic (or even static) overlays another shot.
July 23, 2014 at 5:42 a.m. UTC
I am currently a Network Engineer at a tier 1 ISP, and I have just dealt with that exact situation. A customer was in Africa and had a 10-15 Mbps Ethernet circuit providing internet for their VPN tunnel, and they had an "expensive" (their phrasing) 2 Mbps MPLS link. They decided that they wanted more bandwidth, so they shut down their primary MPLS circuit to use their backup VPN link. They then complained that during the peak of the day, things get VERY slow. I explained to them that there is NOTHING we can do. You are traversing the INTERNET. I'm sorry, I can't tell everyone else to stop using it because you want a VPN tunnel to work at 10 Mbps.
MPLS WAN with the guarantees on BW and latency can be very important, because you can't do much if you have throughput / latency issues on a VPN circuit.
July 24, 2014 at 6:16 a.m. UTC
@Joon, ISPs typically oversubscribe their overall network capacity by a multiple, because of the simple fact that users are never all using their connections to capacity at the same time. I have seen numbers from 4 to 1 up to 28 to 1 oversubscription levels in various countries.
CoS/QoS is typically handled at edges and really only is necessary when BOTH the subscribed MPLS bandwidth AND the actual edge ring is at or near 100% capacity (this is extremely rare). I have seen some instances where frames of around 1400-1500 bytes are tail dropped when 7000-9000 byte frames are mixed in (data and voice/video at the same time). In all instances, tail drops (simply put, frames discarded because the buffer can't handle any more) occur. QoS has a very limited effect in this situation because the buffers are unable to handle the load. Upgrading the QoS is a solution, or separating voice/video and data onto separate lines would be another (and preferable) solution.
The solution I tell ISPs is to simply rate control traffic over the lines in question so that it does not reach capacity. However, as I said before - ISPs love to oversubscribe, so without fail they refuse this solution.
Considering OP of this thread, if far and near end are in first world countries (not Africa) I would be perfectly comfortable using a secure tunnel over the interwebs for normal traffic... however in the case of X to/from Africa, I would agree with most others who said do pilot projects from key sites. Also keep in mind that if you trend drops, latency, throughput etc., make sure you do it over a sufficiently long amount of time to account for peaks, random spazzing, backhoes, BGP speediness etc. If it was me, I wouldn't be satisfied unless I had at least a month's worth of continuous data to look over.
July 25, 2014 at 7:07 a.m. UTC
Business grade Internet is great for IPsec VPNs if all your links are running on a single provider. On multiple ISPs across the globe, you won't be able to do anything if one of the ISPs congests the other (or one Tier1 ISP congests another).
July 28, 2014 at 9:50 a.m. UTC
It's always interesting to read the debate on MPLS and the opinions of the Network Engineers. I am based in South Africa so I understand the whole setup here. I am also Cisco and Juniper certified on the L3 side; however, for the last 4 years I have been on the Telecoms side, specifically L2. As Network Engineers (L3) the lowest we have looked is L2.5, i.e. MPLS; however, Carrier Ethernet (L2) provides all your connectivity with QoS. I am also MEF certified as a MECP 2.0. The MEF provides standards for E-Line, E-LAN, E-Tree and E-Access using EVCs and OVCs etc. Much cheaper than MPLS, with quality of service using Y.1731 SOAM etc.
July 28, 2014 at 5:28 p.m. UTC
I have found that 'Business Class' Internet from within the continental US and most of Europe works fairly well for voice and video. Having worked for VARs and as an independent consultant who has had to troubleshoot many of these rollouts, I can say the reasons they typically fail fall into one of these categories (listed in the order I have most experienced):
Under-provisioning of bandwidth: Companies and engineers that support VPN networks often fail to look at the overhead requirements for VPN. They continue to size G.711 VoIP at 80 Kbps when it now needs 112 Kbps, and VTC at the standard bandwidth associated with the codec when it typically takes twice as much (sizing based on GRE with IPsec Tunnel Mode). IPsec overhead will be there and will be significant in real-time applications such as voice and video. Know it, plan for it.
Poor QoS: Just because an ISP will not honor your DSCP or other markings doesn't mean you shouldn't be doing QoS. MPLS providers often spoil us by honoring and dropping based on our marking - they do intelligent policing / shaping for us. ISPs don't do this, so we have to dust off the cobwebs and shape in "both directions." I put that in quotes because the both-directions part typically is where companies fall short when they attempt QoS for internet circuits. Downstream, from the internet to us, is typically where QoS is most needed because our users usually download more than they upload. But most engineers only have outbound QoS and don’t understand that shaping towards the LAN will be required to manage what is coming in from the internet. This, QoS, is a lost art! The current paper certificate world of exam cramming is making it worse by the day.... Too many of us network engineers really have network admin level knowledge, with certs devalued through cheating. Another topic for another post.
Poor system MTU and MSS rollout: Companies and we the engineers that roll out these networks often don't plan for VPN centric WANs down to the system level. Servers that will mostly communicate across the WAN should have MTU settings that require no path fragmentation for their connections. This includes voice and video suites. Deploying TCP MSS adjustments at the network appliances helps (especially when it comes to HTML), but does not work for all services and applications (such as SQL and real-time applications such as video that don't use MSS).
Crappy ISPs: This is the least seen, but certainly does happen. Like with MPLS you get what you pay for here. All too often the only companies willing to do IPSEC vs MPLS really go cheap. Mom-and-Pop fly by night ISPs usually give consumer level service to businesses. It takes lots of time and money to deploy and mature a carrier grade network. Ask yourself when choosing one: did I pick a company that had both the time and the money to build something to my boss's expectations? As someone previously stated, ISPs are often more responsive than MPLS providers, just do not be a pushover. Many of these providers’ first couple of tiers have no other job than to find the issue on your side (no different than the MPLS side). You have to be patiently and politely willing to work through the steps that they are required to do to get on to the next tier. Honestly, if thorough testing is done on day one of acceptance, you avoid most of these problems. Good levels of IP SLA network management are also a must, since they can show quantifiable trends in MOS, jitter, and loss that often help your ISPs find the poor performing windows to isolate problematic paths or nodes between your sites.
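The bandwidth numbers in the first category can be reproduced with a little arithmetic. The sketch below is one set of assumptions that arrives at the same 80 Kbps and 112 Kbps figures: G.711 at 20 ms packetization (160-byte payload, 50 packets per second), IPv4 everywhere, GRE without a key, and ESP tunnel mode with an 8-byte cipher block and IV (as with 3DES-CBC) and a 12-byte ICV (as with HMAC-SHA1-96). The function names are made up for the example, Layer 2 overhead is ignored, and a different cipher suite will give different numbers.

```python
import math


def esp_tunnel_size(inner_len, block=8, iv=8, icv=12):
    """IPv4 packet size after ESP tunnel-mode encapsulation of `inner_len` bytes."""
    # ESP trailer: padding + 1-byte pad length + 1-byte next header,
    # aligned to the cipher block size.
    padded = math.ceil((inner_len + 2) / block) * block
    # Outer IPv4 header + ESP header (SPI, sequence) + IV + ciphertext + ICV.
    return 20 + 8 + iv + padded + icv


def g711_kbps(gre=False, ipsec=False, pps=50, **esp):
    """Layer 3 bandwidth of one G.711 call direction, in Kbps."""
    size = 160 + 12 + 8 + 20          # voice payload + RTP + UDP + IPv4
    if gre:
        size += 4 + 20                # GRE header + GRE delivery IPv4 header
    if ipsec:
        size = esp_tunnel_size(size, **esp)
    return size * 8 * pps / 1000


print(g711_kbps())                                          # plain IP
print(g711_kbps(gre=True, ipsec=True))                      # GRE + ESP, 8-byte blocks
print(g711_kbps(gre=True, ipsec=True, block=16, iv=16))     # GRE + ESP, 16-byte blocks
```

With 16-byte blocks and IV (as with AES-CBC) the same call comes out a bit higher, at about 118 Kbps, so it is worth redoing the sizing for whatever transform set is actually deployed.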
Like many network deployments, it comes down to having a good plan. We must really understand the applications that will be going over the new sub-1500 MTU paths and how that will impact them. Unlike MPLS and dedicated circuits, rolling out a successful Internet based DMVPN, PtP GRE with IPSEC, or pure IPSEC VPN network requires tuning from nearly all IT services departments / teams. As with really making MPLS work for you, a good QoS strategy is also required. The typical rewards if done well are fewer paths and providers to manage, much shorter time to make bandwidth / service adjustments, a larger, more carrier diverse and resilient backbone, and greater profitability based on reducing the network CAPEX. I don’t think this is the solution for everything. I would not recommend this for implementations that are RTP centric, such as call centers, where everything revolves around jitter. You can often get two ISPs with twice or more the bandwidth at the cost of a single MPLS path, so it really comes down to real network engineering to meet these SLAs. Like with the big rollouts of VoIP in the 2000’s, there will always be proof of why internet is not as good; but, do you miss all those non-VoIP phones today?
July 28, 2014 at 8:19 p.m. UTC
Make sure you set the expectation that MPLS won't have the same reliability as an MPLS over T1 or Metro-E. There is a cost associated with everything.
July 28, 2014 at 10:46 p.m. UTC
Another option to potentially pursue, though it is also the most complex, is to use the MPLS circuits for critical real-time traffic, such as voice and video conferencing, then use the Internet VPN for everything else that isn't delay sensitive.
Doing this should allow you to still cut costs in the MPLS infrastructure by lowering its bandwidth requirements. This solution also provides built-in redundancy, in that if either the Internet or MPLS circuit goes down, traffic can fail over to the circuit that is still up.
This is also a much more complex design, as you will have to route the traffic appropriately. Me personally, I'd choose to route this traffic with BGP, as you gain a lot of flexibility in how you route traffic.
July 31, 2014 at 2:38 p.m. UTC
In reply to R's comments, "Make sure you set the expectation that MPLS won't have the same reliability as an MPLS over T1 or Metro-E"....
I have helped as many companies with chronic last mile carriers on T1's, DS3's, and Metro-E extending MPLS as I have on internet paths. I agree with your statement, but only if the topology solely spans across large metropolitan regions, where circuits stay on the MPLS carrier's 'business unit' network. When this isn't the case, your path becomes as good as the last mile LEC extension, which has lots of variance throughout the world and even the US. This last mile extension over the LEC may be run across telephone line infrastructure over 50 years old, leading to a lot of poor pairs / noisy lines. So, the statement that it is more reliable can be true, but definitely is not always accurate. Carriers are more quickly evolving their internet infrastructure networks and the associated hardware and lines carrying it, so you more often run the risk of being on stagnant infrastructure when crossing leased lines, such as last mile LEC extensions.
So, the next time you have to bring up a branch office in Billy-Bob, West Virginia, compare the cost of a T1 of MPLS and two 10Mbps business class internet circuits from different providers. I would be willing to bet the two internet paths would be cheaper, give you the dual-path reliability, and a significant bandwidth increase. If you think the pre-WWII copper infrastructure is cleaner and more reliable there in 'Billy-Bob', I have some more stuff to sell yah :) . So, then it just comes to our ability to engineer VPNs (DMVPNs) to mesh our sites, understand sub-1500 byte MTU paths and know what to do with them to work well, and lastly understand how to tune dynamic routing for fast failover and load-balancing.
MPLS is easy because most of the routing magic needed to mesh and heal our sites is on the carrier; so, this likely should be the way ahead for companies staffed with little WAN routing and VPN experience. It is also probably the best path for companies where the business is VoIP, such as distributed call centers with high loads of voice traffic traversing the WAN (let's say greater than 30% of traffic mix) and not using site local gateways.
What I am getting at is this: don't limit your toolbox as an engineer based on what we knew in the 90's and early 2000's. The internet infrastructure is growing at at least ten times the pace of providers' private clouds, MPLS and leased line. The MPLS providers try hard to scare the companies we support into not holding them accountable for their outrageous pricing. It truly comes down to the core of engineering: gather facts about the problem / requirement, then find a solution to solve it. Unfortunately all too often, we first arrive at the solution (like MPLS), then gather facts to support it. It is our clients that suffer in cost when we aren't as creative as we can be when designing the best, most reliable, and most cost effective solution for them. Comfort and ease should be for the people we support, never for the life of an engineer during the design and rollout phase.
August 2, 2014 at 6:04 p.m. UTC
Along the lines of jtdub's comment: keep the MPLS at low committed rates for business critical and delay/jitter sensitive traffic. Use Cisco IWAN solution (with PfR) to load balance best effort traffic on Inet. If traffic falls below (configurable) SLAs, move it to the other path.
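The "move it to the other path when it falls below SLA" idea can be sketched in a few lines. This is only an illustration of the decision logic, not how PfR or IWAN is implemented; the class names, the threshold values, and the measured numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PathStats:
    name: str
    up: bool
    loss_pct: float
    latency_ms: float
    jitter_ms: float


@dataclass
class Sla:
    max_loss_pct: float
    max_latency_ms: float
    max_jitter_ms: float


def meets(path: PathStats, sla: Sla) -> bool:
    """True if the path is up and within every SLA threshold."""
    return (path.up
            and path.loss_pct <= sla.max_loss_pct
            and path.latency_ms <= sla.max_latency_ms
            and path.jitter_ms <= sla.max_jitter_ms)


def choose_path(preferred: PathStats, alternate: PathStats, sla: Sla):
    """Stay on the preferred path while it meets the SLA, else try the alternate."""
    if meets(preferred, sla):
        return preferred.name
    if meets(alternate, sla):
        return alternate.name
    # Neither path meets the SLA: fall back to whichever is still up.
    for path in (preferred, alternate):
        if path.up:
            return path.name
    return None


# Hypothetical measurements and thresholds.
mpls = PathStats("mpls", up=True, loss_pct=0.0, latency_ms=80, jitter_ms=5)
inet = PathStats("internet", up=True, loss_pct=2.5, latency_ms=60, jitter_ms=40)
voice_sla = Sla(max_loss_pct=1.0, max_latency_ms=150, max_jitter_ms=30)
bulk_sla = Sla(max_loss_pct=5.0, max_latency_ms=400, max_jitter_ms=100)

print(choose_path(mpls, inet, voice_sla))   # voice prefers MPLS
print(choose_path(inet, mpls, bulk_sla))    # best effort prefers the Internet path
```

With these made-up numbers, voice stays on MPLS and best-effort traffic stays on the Internet path; if either circuit goes down or drifts past its thresholds, the same function moves the affected class to the other path.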
August 11, 2014 at 1:48 p.m. UTC
We run voice over VPN/Internet without issues. We also have MPLS for larger sites. Here is what you need to do to make this successful: 1. Make sure to pick a stable Internet provider. Stay away from cable, DSL, or anything not Tier 1. There are some mom/pop shops that are good, but they are regional, so do your homework. 2. Make sure you have 2 data centers where you terminate your tunnels, for redundancy. If you are an all Cisco shop, DMVPN is the way to go. 3. This one is the most important: managing expectations. You have to CYA with an email, PowerPoint, anything, to make sure that when things do go bad, you did warn about it. If the CIO wants this, explain why it is not a good idea, explain the technical reasons in writing, and then if he/she still says to do it, then do it, but you have your paper to back you up if the CIO decides to pin this on you.
September 19, 2014 at 1:27 a.m. UTC
If saving on WAN costs is a big factor and Cisco routers are in use, their combo-platter of DMVPN and PfR has been shown to achieve similar composite availability and performance when provided with 2 or more disparate links. This frees you up to use Internet (Cable, DSL, whatever) AND use MPLS, FR, T-X, etc. Rather than use as-designed metrics and static, manually altered traffic paths, rely on PfR to leverage NBAR, Netflow, and IP SLA to dynamically determine traffic flows in use and distribute them across all links to achieve best performance. Several large global enterprises are using this with promising results and it frees you up to use ANY and ALL transports you want/need, at the same time.
October 2, 2014 at 1:06 a.m. UTC
I would compare service level expectations from the business side with service level agreements from the service provider side. It is not just about QoS; downtime has a cost/impact as well.
It is a tough call that will ultimately be made by management because of the savings involved.