This is the nineteenth post in the 2016 FastMail Advent Calendar. Stay tuned for another post tomorrow.

We have always run physically separate switches for our internal and external networks at FastMail, and likewise run our own encryption for the links between our datacentres. I'm a strong believer in airgaps and in not trusting anything outside our own locked racks within the datacentre.

When I started at FastMail, we didn't have much "offsite". There was a real machine at another provider running secondary MX and DNS. For a while there was a VPS in Singapore that was purely a forwarding service to deal with routing issues in Asia.

We used OpenVPN with symmetric keys for these simple links. It was easy to set up, easy to trust (it uses heavily-reviewed TLS negotiation to set up the encryption), and I already knew and liked the underlying concept from using CIPE to link our offices at my previous job.

We had no need to use anything fancier for these low bandwidth needs.

Growing pains

When FastMail was purchased by Opera Software in 2010 we set up a secondary datacentre in Iceland with real time replication of all email. I did a ton of work in Cyrus IMAP to make replication more efficient, but it was still maxing out the CPU on the VPN machine.

OpenVPN is single threaded for simplicity of the code and robustness, but it meant that our shiny blade servers with 24 threads (2 processors x 6 cores x hyperthreading) were only using 1/24 of their capacity.

Software or Hardware?

Opera's sysadmin department tried to convince me to buy their favourite hardware (Juniper) rather than staying with software. Of course everything is hardware eventually, but running the VPN on the commodity hardware makes it easy to substitute and means we don't have to keep as many spares. There are always a couple of blades available to be repurposed if one of them has a hardware failure. Keeping a second high-end Juniper around just in case would have been very expensive, but not having a hot spare is untenable.

So it had to be software. The only serious contender was IPsec.

Black Swan

Debian has a page with some Linux IPsec history. There was FreeS/WAN (free secure wide area network) and then the KAME API for kernel access, and freeswan got abandoned and we had Openswan and strongSwan and libreswan, and there's KLIPS, and ... they're all awful.

After much frustration with bogus and incorrect documentation, I managed to get IPsec working. I never liked it; the implementations blatantly violate the Unix philosophy. OpenVPN gives you a tun device which acts like a virtual secure cable between your two datacentres that you can route down. IPsec (Openswan at least, which was the only one I could even get to work) routes the packets over the existing interface, but applies its magic in a way that the routing tools and firewall tools don't really understand. I was never 100% confident that the routing rules would keep unencrypted packets off the external network in failure modes.

The configuration file was full of 'left' and 'right', and I certainly never figured out how to route arbitrary networks through the link; it looked like you had to set up a separate IPsec configuration for each network range you wanted to route. The config looked something like this:

conn nyi-to-thor
    left=[% conf.nyi.extip %]
    leftid=@[% conf.nyi.exthostname %]
    leftsubnet=10.202.0.0/16
    leftsourceip=[% conf.nyi.intip %]
    leftrsasigkey=[% conf.sigkey %]
    leftnexthop=[% conf.thor.extip %]
    right=[% conf.thor.extip %]
    rightid=@[% conf.thor.exthostname %]
    rightsubnet=10.205.4.0/22
    rightsourceip=[% conf.thor.intip %]
    rightrsasigkey=[% conf.sigkey %]
    rightnexthop=[% conf.nyi.extip %]
    authby=rsasig
    auto=start

Note the subnet ranges embedded in the configuration. We didn't route anything but those network ranges through IPsec, using the Opera internal management ranges for everything else during the Opera years.

But the network in Iceland had availability issues and every time the network dropped out for a couple of minutes, IPsec would fail to resync and have to be restarted manually. Even regular failover for maintenance and testing (we always had two hosts configured and ready to go in case of machine failure) was unreliable.

Maybe I'm just really stupid and can't make IPsec work for me, or I backed the wrong swan, I dunno. Anyway, IPsec never sat well. It was the least reliable part of our infrastructure.

Back to OpenVPN

Nothing much had changed last year when I finally got jack of Openswan bailing on us and looked around to see if there was anything else. But I figured: if we ran multiple VPN links in parallel and spread the traffic across them, that had to work, right?

Which leads us to our current configuration, a mesh of OpenVPN links between our datacentres. We're currently running 4 channels for each pairing, though very little traffic goes along one of the edges.

Let's take a look at how it's done. First the heavily templated config file:

[%- SET conf = global.vpnlinks.$datacentre.$link -%]
[%- SET local = global.vpndata.$datacentre %]
[%- SET remote = global.vpndata.${conf.dest} %]
local [% local.extip %]
lport [% conf.lport %]
remote [% remote.extip %]
rport [% conf.rport %]
dev tun
ifconfig [% conf.lhost %] [% conf.rhost %]
ping 5
ping-restart 30
script-security 2
up-delay
up-restart
up /etc/openvpn/up-[% link %]
down-pre
down /etc/openvpn/down-[% link %]
cipher [% conf.cipher %]
secret /etc/secure/openvpn/keys/[% conf.keyname %].key

Let's look at the interesting bits here. We run on different ports for each link so that we're running separate OpenVPN processes with no contention for the UDP ports.

I'm a bit cagey about showing IP addresses since we moved our VPN links on to hidden network ranges to avoid DDoS attacks taking out our networks, but let's take a look at the port numbers:

[brong@qvpn1 ~]$ grep port /etc/openvpn/*.conf
/etc/openvpn/nq1.conf:lport 6011
/etc/openvpn/nq1.conf:rport 6001
/etc/openvpn/nq2.conf:lport 6012
/etc/openvpn/nq2.conf:rport 6002
/etc/openvpn/nq3.conf:lport 6013
/etc/openvpn/nq3.conf:rport 6003
/etc/openvpn/nq4.conf:lport 6014
/etc/openvpn/nq4.conf:rport 6004
/etc/openvpn/sq1.conf:lport 7011
/etc/openvpn/sq1.conf:rport 7001
/etc/openvpn/sq2.conf:lport 7012
/etc/openvpn/sq2.conf:rport 7002
/etc/openvpn/sq3.conf:lport 7013
/etc/openvpn/sq3.conf:rport 7003
/etc/openvpn/sq4.conf:lport 7014
/etc/openvpn/sq4.conf:rport 7004

So each config has a local and remote port which is completely separate, but contiguous to allow us to firewall a port range on each machine.
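For illustration only, the pattern in those numbers can be written down as a tiny helper (a Python sketch with made-up names; the real values come out of our templated network layout data): each peer pairing gets its own thousand-block, and local ports sit 10 above remote ports.

```python
def link_ports(pair_base, link_num):
    """Hypothetical helper showing the port scheme visible in the grep
    output: NYI links use the 6000 block, Switch links the 7000 block,
    local ports are offset by 10 from remote ports, and each link in a
    pairing increments by one. The result: every pairing lives inside
    one contiguous, easily-firewalled range."""
    lport = pair_base + 10 + link_num
    rport = pair_base + link_num
    return (lport, rport)

# nq1 and sq4 from the grep output above
print(link_ports(6000, 1))  # (6011, 6001)
print(link_ports(7000, 4))  # (7014, 7004)
```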

The ping settings allow the VPN to quickly re-establish after a network outage.

script-security is required to allow external scripts to run, which we use to set up the routing.

up-delay stops the interface spinning up (and the up script running) until the link is established.

down-pre does the opposite: it makes the down script run before dropping the link.

up-restart causes both the down and up scripts to run if the connection restarts for any reason.

The end result of all these settings is very reliable routing across restarts and connection failures.

Up and Down

The most interesting part is the up and down scripts. These set up the routing. I'll show the full up script, which contains all the functions, so it's enough to make this blog post useful for someone wanting to duplicate our setup.

#!/usr/bin/perl
[%- SET conf = global.vpnlinks.$datacentre.$link -%]
[%- SET local = global.vpndata.$datacentre %]
[%- SET remote = global.vpndata.${conf.dest} %]
use strict;
use warnings;
use IO::LockedFile;

# [% link %]

my $lock = IO::LockedFile->new(">/var/run/vpnroute.lock");

my $dev = shift;

disable_rpfilter($dev);
set_queue_discipline($dev, "sfq");
[%- FOREACH netname = remote.networks %]
[%- SET data = $environment.network.$netname %]
add_route("[% data.netblock %]", "[% conf.rhost %]", "[% $thishost.network.internal_ip %]");
[%- END %]

sub disable_rpfilter {
    my $dev = shift;
    print "echo 0 > /proc/sys/net/ipv4/conf/$dev/rp_filter\n";
    if (open(FH, ">", "/proc/sys/net/ipv4/conf/$dev/rp_filter")) {
        print FH "0\n";
        close(FH);
    }
}

sub set_queue_discipline {
    my $dev = shift;
    my $qdisc = shift;
    runcmd('/sbin/tc', 'qdisc', 'replace', 'dev', $dev, 'root', $qdisc);
}

sub add_route {
    my ($netblock, $rhost, $srcip) = @_;
    my @existing = get_routes($netblock);
    return if grep { $_ eq $rhost } @existing;
    my $cmd = @existing ? 'change' : 'add';
    push @existing, $rhost;
    runcmd('ip', 'route', $cmd, $netblock, 'src', $srcip,
           map { ('nexthop', 'via', $_) } @existing);
}

sub del_route {
    my ($netblock, $rhost, $srcip) = @_;
    my @existing = get_routes($netblock);
    return unless grep { $_ eq $rhost } @existing;
    @existing = grep { $_ ne $rhost } @existing;
    my $cmd = @existing ? 'change' : 'delete';
    runcmd('ip', 'route', $cmd, $netblock, 'src', $srcip,
           map { ('nexthop', 'via', $_) } @existing);
}

sub get_iproute {
    my @res = `ip route`;
    chomp(@res);
    my %r;
    my $dst;
    foreach (@res) {
        if (s/^\s+//) {
            my @items = split;
            my $cat = shift @items;
            my %args = @items;
            push @{$r{$dst}{$cat}}, \%args if exists $r{$dst};
            next;
        }
        my @items = split;
        $dst = shift @items;
        my %args = @items;
        $r{$dst} = \%args if !exists $args{dev} || $args{dev} =~ m/^tun\d+$/;
    }
    return \%r;
}

sub get_routes {
    my $dst = shift;
    my $routes = get_iproute();
    my $nexthop = $routes->{$dst}{nexthop} || [$routes->{$dst}];
    return grep { $_ } map { $_->{via} } @$nexthop;
}

sub runcmd {
    my @cmd = @_;
    print "@cmd\n";
    system(@cmd);
}

The down script is identical, except that it doesn't run the rpfilter or queue discipline steps, and of course it runs del_route instead of add_route.

Firstly we disable rp_filter on the interface. rp_filter drops any packets that wouldn't route to this same interface, and with multiple interfaces all routing the same network range, it would cause packets to fail to route. We still firewall the tun+ interfaces to only allow packets from our internal datacentre ranges of course.

Next we set the queue discipline to sfq, or "Stochastic Fairness Queueing", which is a low CPU usage hashing algorithm to distribute the load fairly across all links.

Since there's no way to add or remove individual nexthops from a multipath route directly, we take a global lock using the Perl IO::LockedFile module while reading and writing routes: we read the current routing table, manipulate it, and write out the new config. The lock is necessary because commonly all eight links on a machine get spun up at once, so they're likely to be making changes concurrently.
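The lock-then-rewrite pattern is simple enough to sketch in Python (illustrative only; the Perl script gets the same behaviour from IO::LockedFile, and the function name here is my own):

```python
import fcntl

def with_route_lock(update, lockfile="/var/run/vpnroute.lock"):
    """Serialise read-modify-write of the routing table: take an
    exclusive flock on a well-known lock file, run the update callable,
    then release. Concurrent up/down scripts block on the flock, so
    each one sees the routing table the previous one wrote."""
    with open(lockfile, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)   # blocks until any other holder is done
        try:
            return update()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```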

You can see that get_routes has to handle two different styles of output from ip route: a single destination line when only one link is up, and multiple nexthop lines when several are.
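As a rough sketch of that parsing problem, here's the same idea in Python (illustrative only; the authoritative version is the Perl get_iproute/get_routes above, and this helper only pulls out a single destination):

```python
import re

def parse_via_hops(ip_route_output, dst):
    """Extract the nexthop gateway addresses for dst from `ip route`
    output, handling both shapes: a one-line route
    ("DST via GW dev tunN ...") and an ECMP route whose
    "nexthop via GW dev tunN weight 1" lines are indented under it.
    Non-tun routes (like the eth0 backup route) are skipped."""
    hops, in_block = [], False
    for line in ip_route_output.splitlines():
        indented = line[:1] in (" ", "\t")
        words = line.split()
        if not indented:
            in_block = bool(words) and words[0] == dst
            if in_block and "via" in words and _on_tun(words):
                hops.append(words[words.index("via") + 1])
        elif in_block and words[:2] == ["nexthop", "via"]:
            hops.append(words[2])
    return hops

def _on_tun(words):
    # keep only routes over tun devices, like the Perl m/^tun\d+$/ check
    return "dev" not in words or re.match(r"tun\d+$", words[words.index("dev") + 1])
```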

So we have a list of nexthop routes with the same metric via the different OpenVPN links, and we manipulate that list and then tell the kernel to update the routing table.
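That manipulation boils down to building a fresh ip route argument list each time. Here's a Python sketch of the same logic as the Perl add_route/del_route (the function names are mine):

```python
def add_hop_cmd(netblock, srcip, current, new_hop):
    """Argument list to add one nexthop: 'add' if the route doesn't
    exist yet, 'change' to rewrite the whole ECMP set otherwise.
    Returns None when the hop is already routed (nothing to do)."""
    if new_hop in current:
        return None
    hops = current + [new_hop]
    cmd = "change" if current else "add"
    args = ["ip", "route", cmd, netblock, "src", srcip]
    for h in hops:
        args += ["nexthop", "via", h]
    return args

def del_hop_cmd(netblock, srcip, current, dead_hop):
    """Inverse operation: drop one nexthop, deleting the route
    entirely when it was the last one."""
    if dead_hop not in current:
        return None
    hops = [h for h in current if h != dead_hop]
    cmd = "change" if hops else "delete"
    args = ["ip", "route", cmd, netblock, "src", srcip]
    for h in hops:
        args += ["nexthop", "via", h]
    return args
```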

Routes

Here's how it looks in the system routing table on qvpn1 in Quadranet, our LA datacentre. The links are 'sq' from Switch (Amsterdam) and 'nq' from NYI (New York).

10.202.0.0/16 src 10.207.2.173
        nexthop via 192.168.6.1 dev tun0 weight 1
        nexthop via 192.168.6.2 dev tun1 weight 1
        nexthop via 192.168.6.4 dev tun2 weight 1
        nexthop via 192.168.6.3 dev tun3 weight 1
10.202.0.0/16 via 10.207.1.205 dev eth0 metric 1
10.206.0.0/16 src 10.207.2.173
        nexthop via 192.168.7.3 dev tun4 weight 1
        nexthop via 192.168.7.2 dev tun5 weight 1
        nexthop via 192.168.7.4 dev tun6 weight 1
        nexthop via 192.168.7.1 dev tun7 weight 1
10.206.0.0/16 via 10.207.1.205 dev eth0 metric 1
192.168.6.1 dev tun0 proto kernel scope link src 192.168.6.11
192.168.6.2 dev tun1 proto kernel scope link src 192.168.6.12
192.168.6.3 dev tun3 proto kernel scope link src 192.168.6.13
192.168.6.4 dev tun2 proto kernel scope link src 192.168.6.14
192.168.7.1 dev tun7 proto kernel scope link src 192.168.7.11
192.168.7.2 dev tun5 proto kernel scope link src 192.168.7.12
192.168.7.3 dev tun4 proto kernel scope link src 192.168.7.13
192.168.7.4 dev tun6 proto kernel scope link src 192.168.7.14

(Quadra is .207, Switch is .206, NYI is .202)

If I take down a single one of the OpenVPN links, the routing just keeps working as we remove the one hop:

[brong@qvpn1 hm]$ /etc/init.d/openvpn stop sq2
Stopping virtual private network daemon: sq2.
[brong@qvpn1 hm]$ ip route | grep interesting
10.206.0.0/16 src 10.207.2.173
        nexthop via 192.168.7.3 dev tun4 weight 1
        nexthop via 192.168.7.4 dev tun6 weight 1
        nexthop via 192.168.7.1 dev tun7 weight 1
192.168.7.1 dev tun7 proto kernel scope link src 192.168.7.11
192.168.7.3 dev tun4 proto kernel scope link src 192.168.7.13
192.168.7.4 dev tun6 proto kernel scope link src 192.168.7.14

And then bring it back up again:

[brong@qvpn1 hm]$ /etc/init.d/openvpn start sq2
Starting virtual private network daemon: sq2.
[brong@qvpn1 hm]$ ip route | ...
10.206.0.0/16 src 10.207.2.173
        nexthop via 192.168.7.3 dev tun4 weight 1
        nexthop via 192.168.7.4 dev tun6 weight 1
        nexthop via 192.168.7.1 dev tun7 weight 1
        nexthop via 192.168.7.2 dev tun5 weight 1

To see the commands that it ran, we can just run the up and down scripts directly:

[brong@qvpn1 hm]$ /etc/openvpn/down-sq2 tun5
ip route change 10.206.0.0/16 src 10.207.2.173 nexthop via 192.168.7.3 nexthop via 192.168.7.1 nexthop via 192.168.7.4
[brong@qvpn1 hm]$ /etc/openvpn/up-sq2 tun5
echo 0 > /proc/sys/net/ipv4/conf/tun5/rp_filter
/sbin/tc qdisc replace dev tun5 root sfq
ip route change 10.206.0.0/16 src 10.207.2.173 nexthop via 192.168.7.3 nexthop via 192.168.7.1 nexthop via 192.168.7.4 nexthop via 192.168.7.2

Plenty of headroom

This is comfortably handling the load with the four links to NYI, which get most of the traffic. (It's quiet on the weekend while I'm writing this, and during busy times they might be using more CPU, but four cores is enough to supply our current bandwidth peaks.)

11533 root      20   0   24408   3856   3264 S 18.7  0.0   7120:19 openvpn
11521 root      20   0   24408   3912   3324 S 17.4  0.0   6690:50 openvpn
11527 root      20   0   24408   3908   3320 S 12.4  0.0   2741:46 openvpn
11539 root      20   0   24408   3796   3208 S  7.5  0.0   3749:41 openvpn

There are heaps more CPUs available in the box if we need to spin up more concurrent links, and it's just a matter of adding an extra line to the network layout data file and then running make -C conf/openvpn install; /etc/init.d/openvpn start to bring the link up at each end. The routing algorithm will automatically spread the load once the two ends pair up.

We're much happier with our datacentre links now. We can manage firewalls and routes with our standard tooling and they are rock solid.