Back To School is the busiest time for any education company. Students, teachers, schools, software companies, and the rest of the education world are all gearing up for a new school year. Unfortunately, during the most critical time of the year, Clever’s infrastructure was throwing a fit.

At the time, Clever was adding over a thousand schools per week. District administrators were scrambling to get their class schedules in order, relying on Clever more than ever. Teachers were training on new software and expecting that their students could log in with just one password on the first day of school. Right when we needed our infrastructure to “just work” we started getting strange errors and boxes started coming up good as dead.

The Clever Stack:

But let’s take a step back. Here at Clever we treat our servers as disposable and interchangeable. Our stack is:

AWS machines in a VPC

Private Docker and npm repositories

Salt to pull down machine state

Fabric to manage deploys

This stack allows us to maintain security while being able to quickly deploy to production, which we often do multiple times per day.

Our data-processing / ETL pipeline consists of many different “workers” running on hefty EC2 instances. These workers all connect to a MongoDB replicaset to update and read data. Due to the increased load from back to school we were using three times the usual number of those machines to process the incoming data, and we were re-deploying frequently both for hotfixes and for our usual development work.

The Problem:

All was not well, however. When we’d run our fabric scripts to spin up new worker machines, approximately one third of these new machines would fail our health checks, something which had never happened before. For some strange reason, these boxes could not speak to our database machines, that made them, for our purposes, about as useful as the printer in Office Space.

In particular, even a command as simple as this would fail on the affected machines:

ubuntu@10.0.1.90:~$ nc -vz -w 2 database-box-1.internal.clever.com 27017

where:

-v = verbose -z = don’t send data, just see if you can connect -w 2 = wait for 2 seconds before timing out 27017 = port mongod is listening on

We Tried The Usual Suspects:

Was DNS resolution broken?

This was certainly not right, as we could run dig and get the right IP just fine. Additionally, running the netcat query above with the IP of the database machine gave the same result.

Was it a network partition?

Identical boxes in the same Availability Zone, spun up at the same time, could access the database servers. Traceroute also worked…

Was it a port issue?

Almost certainly not – other boxes in the same AWS security group could access the job server on that port

Oh, maybe it was a ulimit issue!

Solid guess, but we had previously increased the limits to 10x what we’d need, and could have far more network connections open at a time than when we saw some of the issues. We also tested multiple new connections from a machine that could connect, which should have failed if there was a ulimit problem.

Well, maybe there’s a black hole somewhere near our database servers…

We had one other very interesting symptom: when we ran traceroute on the box with connection issues, the problem disappeared! This made debugging VERY difficult and turned this from a gross bug into a nasty bug. We had on our hands a programmer’s worst fear: a Heisenbug.

Ok, so at this point we called AWS (paid support FTW) and got a very helpful engineer who’d seen this before. There were a few key clues:

The database machines don’t churn very often, but the worker machines that do have issues connecting are often terminated and relaunched.

We use only a small amount of our virtual IP space (We let Amazon generate our IPs, and Amazon seems to pick lower IPs first rather than picking randomly). During “Back To School” we use a larger-than-usual fraction of our machines as these worker machines. When we spun up they often received IPs that had been used for previous generations of worker machines.

When we ran tcpdump , we saw that an affected worker box could connect and send packets, and the database machines received packets and sent responses back. However, the affected worker box never saw these packets come back to its NIC and timed out.

Here’s tcpdump output from the affected worker box:

ubuntu@10.0.1.90:~$ sudo tcpdump -i eth0 -e | grep 10.0.1.20 listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes 02:17:20.979058 IP 10.0.1.90.ssh > 10.0.1.20.61954: Flags [P.], seq 3431575415:3431575611, ack 4159969503, win 348, options [nop,nop,TS val 45028090 ecr 912682739], length 196 02:18:06.111010 IP 10.0.1.90.ssh > 10.0.1.20.61954: Flags [P.], seq 3431576355:3431576551, ack 4159969627, win 348, options [nop,nop,TS val 45039373 ecr 912727572], length 196

Here’s tcpdump output from the target database box:

ubuntu@10.0.1.20:~$ sudo tcpdump -i eth0 -e | grep 10.0.1.90 listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes 02:15:04.793339 IP 10.0.1.90.ssh > 10.0.1.20.61648: Flags [P.], seq 552:588, ack 1, win 348, options [nop,nop,TS val 193513889 ecr 912546555], length 36 02:15:04.797194 IP 10.0.1.20.61648 > 10.0.1.90.ssh: Flags [.], ack 552, win 8169, options [nop,nop,TS val 912547567 ecr 193513881], length 0 02:19:04.698509 IP 10.0.1.90.ssh > 10.0.1.20.61648: Flags [P.], seq 1936:2116, ack 1, win 348, options [nop,nop,TS val 193573865 ecr 912784839], length 180 02:19:04.724616 IP 10.0.1.20.61648 > 10.0.1.90.ssh: Flags [.], ack 1140, win 8178, options [nop,nop,TS val 912785848 ecr 193573865], length 0

From this you can see that the database machine ( 10.0.1.20 ) receives SYN packets (with the correct sequence number) and sends back SYN/ACK packets. However, the worker machine never receives those packets.

Last chance to guess….

The Cause:

The ultimate issue ended up being a fairly simple one, albeit a little lower in the Morlock tunnels of networking than we’re used to seeing:

The ARP tables on the database boxes had cached, outdated MAC addresses from no-longer-living boxes.

That is, the ARP table contained cached “stale” MAC addresses, pointing to the NICs of old machines which had been terminated previously.

If it’s been a while since you’ve thought about ARP tables, I don’t blame you – it’s something that usually just works! As a refresher, ARP tables are between the Network (e.g. IP) layer and the Data Link (e.g. Ethernet) layer. When sending an IP packet to another IP address within a network, we need to know what the Ethernet address is for NIC of the machine with that address. The ARP table is a cache so that we don’t need to search in the Data Link layer for that address for every packet, which would be horribly inefficient.

Here’s what the database ARP tables looked like:

ubuntu@database-box-1:~$ ip neigh 10.0.1.60 dev eth0 lladdr 00:00:01:FC:B3:45 STALE 10.0.1.61 dev eth0 lladdr 00:00:01:FC:B3:46 REACHABLE 10.0.1.80 dev eth0 lladdr 00:00:01:FC:B3:47 REACHABLE 10.0.1.90 dev eth0 lladdr 00:00:01:FC:B3:48 STALE 10.0.1.91 dev eth0 lladdr 00:00:01:FC:B3:49 REACHABLE

(Note that those STALE items seem to never be removed automatically)

However, the affected worker box had this MAC address:

ubuntu@worker-box-3:~$ ifconfig eth0 Link encap:Ethernet HWaddr 00:00:01:FC:B3:70 inet addr:10.0.1.90 Bcast:10.0.1.255 Mask:255.255.255.0

So, when a database box tried to send back to 10.0.1.90 , for instance, it sent to a MAC address that doesn’t exist anymore (our “black hole”!). According to our AWS tech, traceroute “fixed” this as we were debugging because it clears the ARP entry for the corresponding IP. However, I was unable to find any documentation around that – any pointers to that would be appreciated.

The Solution:

The solution recommended by the AWS tech was a cron job on the database boxes that simply flushes the cache every 5 minutes or so. While that seems simplistic, he had tried with other clients more subtle ways of fixing the issue and it turns out that ARP is super complex on Linux. While the database machine now has to build up the ARP cache after we clear it, that operation has a very minimal effect on performance.

ip neigh | awk '{print $1}' | xargs -IX arp -v -i eth0 -d X

We’ve set up the command above called via cron on our database machines and have had no problems since. As promised, we have not had performance problems either. While this works for us, you also might look into clearing only the “stale” addresses.

One Last Question:

After looking at this for a while, it seems possible that there is a Linux kernel bug (we’re currently using Ubuntu 12.04 , Linux kernel release 3.13.0-32-generic , version #57-Ubuntu ). Seemingly, the kernel should have refreshed that ARP entry itself, according to this documentation. However, this is a fairly complicated part of the operating system and it is possible we are misreading the situation – I imagine that the AWS techs would have realized that this is a kernel issue before we had run into it.

Lessons:

The Link Layer isn’t always reliable, even in a controlled environment such as AWS

AWS paid support can be a very quick solution to your problem

Sometimes, Linux is complicated and you should just solve your problem with a little duct tape

Hopefully, this post will help others – it looks like there are a few who have run into similar issues with AWS.

Lastly, we’re looking for experienced distributed system engineers to help us build out our systems! If you’ve been itching to play around with Golang and Docker, and want to help us help millions of students and teachers, please check out our jobs page!