This blog post is an account of how we have toiled over the years to improve the throughput of our inter-DC tunnels. I joined this company around 2012, when we were scaling aggressively. We quickly expanded to four DCs, a mixture of AWS and colocation. Our primary DC was connected to all these new DCs via IPsec tunnels established from an SRX. The SRX model we had offered an IPsec throughput of 350 Mbps, and around December 2015 we saturated it. Buying a bigger SRX was an option on the table; one with 2 Gbps of throughput would have cut this story short. The tech team didn't see that happening.



I don't have an answer to the question, "Is it worth spending time solving a problem when an out-of-the-box solution is already available?" This project improved our critical thinking and let us experience theoretical network fundamentals on live traffic, but it also caused us quite a bit of fatigue due to management overhead. Cutting the philosophy short, let's jump to the story.



In December 2015 it was decided to build tunnels from open-source solutions to supplement the SRX. The initial proposal was an SSH tunnel: an SSH tunnel (SOCKS proxy) established from a public box in our primary DC to a public EC2 instance in AWS. Any application that needs to skip the SRX path can be started under proxychains, which overrides all glibc connect() calls to use the SOCKS proxy as per its config. This solution actually broke a single end-to-end TCP connection into three TCP connections: server to SOCKS proxy in the DC, SOCKS proxy in the DC to SOCKS proxy in AWS, and SOCKS proxy in AWS to the destination EC2 instance. It is TCP-in-TCP tunneling, which is more susceptible to congestion because multiplicative decrease kicks in three times. As a result we decided to go with a TCP-in-UDP tunnel instead. Looking back at this solution, it shows how naive we were.
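A minimal sketch of that setup, for the curious. The hostnames, IPs and ports here are illustrative stand-ins, not our real ones:

```shell
# /etc/proxychains.conf (illustrative):
#   strict_chain
#   [ProxyList]
#   socks5 127.0.0.1 1080

# Open a dynamic (SOCKS) forward from the DC public box to the EC2 instance
ssh -f -N -D 1080 tunneluser@ec2-public-host.example.com

# Start an application under proxychains so its glibc connect() calls
# are redirected through the SOCKS proxy
proxychains curl http://internal-aws-service.example.com/health
```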



In February 2016 we started exploring TCP-in-UDP tunnels, beginning with OpenVPN. We established a client-to-site OpenVPN connection between our DC and an AWS EC2 instance. Servers could change their route to AWS to go via the OpenVPN box in the DC, which forwarded all packets it received on the LAN to its tun interface and on to the AWS instance. Because this was a client-to-site VPN, the AWS instance NATed the traffic before sending it to the destination. The first problem we hit was that some of our boxes used the tcp_tw_recycle kernel functionality (since removed), which does not work properly for hosts behind NAT. Hence we changed from client-to-site to site-to-site OpenVPN: DC servers send to the OpenVPN box, which sends to the AWS OpenVPN server, which sends to the EC2 instances without NAT. The route table in EC2 was updated to send selected DC traffic via OpenVPN. The OpenVPN tunnel could not do more than 100 Mbps. We figured out that CPU support for the AES-NI instruction set would reduce CPU utilization on encryption and decryption, so we spawned an AWS instance with AES-NI support and tweaked UDP buffer values. That got us to 200 Mbps, but our data team found 200 Mbps plus the SRX's 350 Mbps (with selective routing) insufficient. OpenVPN is a userspace process, so every packet incurs context switches to kernel space, and OpenVPN can't scale across multiple CPUs. Since this is a TCP-in-UDP tunnel, end-to-end TCP semantics were maintained.
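The two tweaks that got us from 100 to 200 Mbps look roughly like this. The buffer sizes shown are illustrative, not our exact production values:

```shell
# Confirm the CPU exposes AES-NI (look for the 'aes' flag)
grep -o -m1 'aes' /proc/cpuinfo

# Raise the kernel's UDP socket buffer ceilings (values illustrative; tune per link)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# Matching OpenVPN directives (in server.conf / client.conf):
#   sndbuf 4194304
#   rcvbuf 4194304
#   cipher AES-256-GCM   # AES-GCM is what benefits directly from AES-NI
```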



IPsec looked like magic to us: no routes added on the tunnel box, yet packets got routed properly, and no userspace process running. And since the industry standard is IPsec, we decided to move to it. pfSense was picked as the OS for the IPsec tunnel endpoint in the DC, and AWS VPN Gateway was used on the AWS side. In May 2017 we installed pfSense and proceeded with our testing. We found pfSense capping out at 300 Mbps, which was a shock. Adding a bonded 2 Gb NIC improved performance to 500 Mbps. Exploring further, we figured out that interrupts were hogging a CPU. (We did a lot of dead-end exploration before reaching this conclusion, which I'll spare you.) Distributing interrupts is the key. We enabled MSI-X and added a 10G NIC, which distributed interrupts based on source IP, destination IP, source port and destination port (similar to Receive Side Scaling in Linux). IPsec traffic is still received by a single queue, but LAN traffic is distributed across queues since its flows are independent TCP connections. By the end of this change, with one NIC, we were able to reach 600 Mbps. The one CPU doing IPsec ran hotter than the other cores receiving LAN traffic, but it was nowhere close to 100%; some bottlenecks elsewhere in the network caused the cap at 600 Mbps. In July 2017 we made pfSense our primary tunnel device with static routing (the BGP daemon was not stable in pfSense then). We had a few more network bottlenecks that are not trivial to explain here. By March 2018 we were able to reach 1.4 Gbps on pfSense, and our daily traffic stood upwards of 1 Gbps (in + out).
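On FreeBSD, which pfSense runs on, the interrupt spread is easy to observe. The commands and tunables below are the usual knobs for a multi-queue Intel 10G (ix) NIC, shown as an illustration rather than our exact config; tunable names vary by driver version:

```shell
# With MSI-X working, one interrupt vector per RX/TX queue shows up here
vmstat -i | grep ix0

# /boot/loader.conf tunable for multi-queue on the ix driver:
#   hw.ix.num_queues="0"    # 0 = one queue per core

# Per-CPU load (including the one hot core taking all IPsec interrupts)
# can then be watched with:
top -P
```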



Again a team started seeing lag in the pfSense setup. We moved inbound traffic to a different ISP with 10 Gbps capacity; throughput improved to 1.7 Gbps and we were doing a consistent 1.2 Gbps (in + out) per day. This was still not sufficient for the team during peak traffic. In July 2018, we concluded that there was a bottleneck due to LACP on the hop before the router. LACP hashes all IPsec traffic to one AWS region into a single bucket, since it all shares the same source IP, destination IP, source port and destination port. Depending on the bandwidth available on that one 1 Gbps LACP member link, tunnel performance varied. The CPU was similarly stressed when the traffic handled got close to 800 Mbps: remember the one CPU handling all IPsec traffic? That CPU had now become the bottleneck. If we had bought a bigger SRX, we would have jumped straight to this point instead of going through all the stories above. Using Direct Connect without IPsec would shard traffic across the LACP links and remove pfSense from the data path. But again the story refused to stop there, as no Direct Connect was in sight.
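The hashing problem is easy to demonstrate with a toy layer3+4 hash. Real switches use vendor-specific hash functions, so this is only an illustration of the principle, with made-up addresses:

```shell
# Toy layer3+4 LACP hash: XOR the 4-tuple, modulo the number of member links.
# Every ESP/NAT-T packet of one IPsec tunnel carries the same 4-tuple, so every
# packet hashes to the same member link, no matter how many links the LAG has.
src_ip=167772165          # 10.0.0.5 as a 32-bit integer (hypothetical DC endpoint)
dst_ip=167837701          # 10.1.0.5 (hypothetical AWS endpoint)
sport=4500; dport=4500    # IKE/NAT-T uses a fixed UDP port
n_links=2

bucket=$(( (src_ip ^ dst_ip ^ sport ^ dport) % n_links ))
echo "this tunnel always hashes to link $bucket"
```

However many packets the tunnel carries, the inputs to the hash never change, so the tunnel is pinned to one 1 Gbps link.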



We agreed sharding traffic was the way to go. IPsec performance-optimisation docs also suggest either offloading IPsec to the NIC (from 4.16 kernels) or sharding across multiple IPsec tunnels. We set up strongSwan on Linux on both sides, AWS and DC. We could not use the AWS VPN Gateway, as it had stopped supporting ECMP across multiple tunnels. Between the strongSwan boxes we ran three IPsec tunnels for the same policy (DC traffic to AWS and vice versa). We created a VTI interface for each tunnel and added Linux ECMP routes via all three VTI interfaces for cross-DC traffic. Each TCP connection is bucketed into one of the three tunnels, and LACP may place the three tunnels on different member links since their IPs differ. We were able to do just under 2 Gbps of IPsec throughput on the strongSwan setup, alongside 1 Gbps on the existing pfSense setup. We are now capable of close to 3 Gbps, and we see three CPUs utilized on the strongSwan box instead of just one on pfSense. The tunnel has thus become horizontally scalable with the number of cores. In three years we have scaled our tunnel throughput 10x with open-source tools and commodity hardware.
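Wiring this up on the Linux side looks roughly like the following. The addresses, VTI keys and the AWS-side CIDR are illustrative; the keys must match the marks strongSwan sets on the corresponding SAs:

```shell
# One VTI per IPsec SA (the key ties the interface to the SA's mark)
ip link add vti1 type vti local 203.0.113.10 remote 198.51.100.1 key 101
ip link add vti2 type vti local 203.0.113.10 remote 198.51.100.2 key 102
ip link add vti3 type vti local 203.0.113.10 remote 198.51.100.3 key 103
ip link set vti1 up; ip link set vti2 up; ip link set vti3 up

# Hash ECMP on the full 5-tuple (kernel >= 4.12) so flows spread evenly
sysctl -w net.ipv4.fib_multipath_hash_policy=1

# One ECMP route fans TCP connections across the three tunnels; each flow
# sticks to a single nexthop, so per-connection packet ordering is preserved
ip route add 172.31.0.0/16 \
    nexthop dev vti1 weight 1 \
    nexthop dev vti2 weight 1 \
    nexthop dev vti3 weight 1
```

Adding a fourth tunnel is then just one more SA, one more VTI and one more nexthop, which is what makes the setup scale with cores.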



Would it have been better to jump to a bigger SRX and then to Direct Connect? When is the right time to ask your tech team to stop experimenting so they can spend their time on something that adds more value to the company? Should we jump into experimentation even though we haven't hit a dead end or maxed out the existing solutions? If not, how do you keep the team's experimental spirit alive? I will leave these questions for the management to ponder.