World of IaaS

Base uses AWS on a daily basis. It is the most mature IaaS solution on the market, offering great availability, good professional and community support, and lots of services that make your life easier. On the other hand, this seemingly rosy picture hides some dragons. This short note is about how we ran into one of them.

Base infrastructure and TCP health checks

In an infrastructure built around microservices, health checking is a vital part of the system. Health checks are the only way to ensure the whole product works well for users, whose experience is the most important thing after all. What is perhaps a bit surprising, though, is that some of them may actually cause problems, leading to disruptions or even outages. This is about one such case.

To understand our finding we must shed some light on how our infrastructure works.

We use our own PaaS system, written a long time ago when no one even dreamed about solutions like Marathon or Kubernetes. It consists of two important layers: a scheduler (called Grid) and routing (called Mesh). The deployment process asks the scheduler to find one of our grid-node servers and launch a given microservice on it. Once the service is up and running, the routing configuration on Mesh is updated to point to the new locations and the old workers are killed. Everything is powered by AWS EC2 instances in four availability zones; each of them runs an independent copy of our PaaS for redundancy.

We’ve relied on HAProxy from the very beginning. Its simplicity and great performance serve us very well for dispatching requests from grid-nodes to Mesh. As Mesh is service-aware and forwards requests, its health is extremely important. That’s why HAProxy performs health checks on it every few seconds.

This is the place where the first problem occurred, and it was influenced by two factors:

HAProxy can perform two kinds of checks: HTTP and TCP. The first works at OSI layer 7: it performs a simple HTTP call and analyzes the return code, treating any answer other than 200 as an error. The second is much simpler and works at layer 4: it simply creates a TCP connection which is immediately closed. This is the kind of check we rely on at Base. As mentioned earlier, HAProxy dispatches requests to the Mesh in the corresponding availability zone as long as health checks are passing. If they fail, we allow HAProxy to fall back to a different datacenter. The problem was in the way we configured it: each service entry in the HAProxy configuration had all four Mesh endpoints listed, one as primary and the rest as backup. So even though every service had the same endpoints, they were being health-checked separately. This way each of our Mesh servers was receiving around 70 * 32 / 2 ≈ 1,120 TCP connections per second.
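A sketch of what the flawed configuration looked like (service names, ports and addresses are made up for illustration; the real configs differed). Each service repeats the same four Mesh endpoints, and every `check` keyword triggers an independent health check:

```haproxy
# One backend per service, each repeating the same four Mesh endpoints.
# Every `check` is independent, so N services => N separate streams of
# health checks hitting the very same Mesh hosts.
backend svc_billing
    server mesh-a 10.0.1.10:8080 check inter 2s
    server mesh-b 10.0.2.10:8080 check inter 2s backup
    server mesh-c 10.0.3.10:8080 check inter 2s backup
    server mesh-d 10.0.4.10:8080 check inter 2s backup

backend svc_search
    server mesh-a 10.0.1.10:8080 check inter 2s
    server mesh-b 10.0.2.10:8080 check inter 2s backup
    server mesh-c 10.0.3.10:8080 check inter 2s backup
    server mesh-d 10.0.4.10:8080 check inter 2s backup
```

With dozens of such backends on dozens of grid-nodes, the per-service checks multiply into the connection rates described above.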

This was obviously a design error, but it went unnoticed: initially we were running only a few services on only a few grid-nodes. Everything scaled up with time, and until things started misbehaving the problem stayed hidden.

First problems

Luckily for us, the problems first revealed themselves on our sandbox environment, a much smaller setup that also runs smaller instances. It was built from three datacenters with two grid-nodes each, so each Mesh received around 200 connections per second. The routing box was a c3.large which, as we observed later, is quite important.

Everything started when a colleague wanted to test a new feature that required adding one grid-node. Right after it was plugged into the infrastructure we started to receive complaints from developers about sandbox performance, including timeouts and fallbacks between datacenters. It didn’t take long to figure out that plugging in the new node caused HAProxy health checks to time out across the whole availability zone, but one question remained: why? After a few minutes we found that shutting down the HAProxy process on the new node solved the problem – bingo. We then started a deeper investigation.

Investigation

An obvious first test was to simulate the HAProxy health checks. We hacked together a simple TCP connect-disconnect application in Python and ran it against a Mesh server. The results were quite surprising. After a few seconds of making connections at ~500/s, traffic stopped almost completely: only one or two connections were making it through every 1–3 seconds. We linked those intervals to TCP retransmission timers, and tcpdump confirmed our observation – TCP SYN packets were not reaching the destination server and thus were being retransmitted. Simultaneously with our test, every single HAProxy in that availability zone was timing out and falling back. After some time connections started to go through again, but it didn’t last long and they got stuck again. It looked like some aggressive QoS mechanism, so we visualized the number of connections per second over time.
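The hack looked roughly like this (a minimal reconstruction, not the original script; here it targets a local listener for demonstration, whereas the real test pointed at a Mesh box):

```python
import socket
import threading
import time

def run_listener(sock):
    """Accept incoming connections and close them immediately."""
    while True:
        try:
            conn, _ = sock.accept()
            conn.close()
        except OSError:
            break  # listener socket was closed

def connect_burst(host, port, duration=0.5):
    """Open and immediately close TCP connections for `duration` seconds,
    returning the achieved connection rate. Against a throttled target,
    socket.create_connection() starts raising timeouts instead."""
    count = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        s = socket.create_connection((host, port), timeout=3)
        s.close()
        count += 1
    return count / duration

if __name__ == "__main__":
    # Demo against a throwaway local listener on an ephemeral port.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(128)
    host, port = listener.getsockname()
    threading.Thread(target=run_listener, args=(listener,), daemon=True).start()
    print(f"{connect_burst(host, port):.0f} connections/second")
    listener.close()
```

On loopback this happily does thousands of connections per second; the surprise in our case was watching the same loop collapse to one or two connections every few seconds against the Mesh server.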

Ok, something’s definitely wrong here. Why are our connections being capped so aggressively every few seconds? And why is it affecting other hosts in our infrastructure?

To answer those questions we wrote a more complex benchmark application in C. It spawned a few threads, each making connections and closing them immediately. All operations were blocking, so the maximum number of simultaneously established connections per instance was equal to the number of spawned threads. This was important, as we wanted to make sure we wouldn’t exceed any of the TCP buffers on the destination server side. CPU and memory usage were also monitored and appeared normal during our benchmarks. The next step was to build an artificial environment similar to our sandbox setup, so we launched one c3.large box with nginx and ten m3.medium instances to simulate health check traffic. All hosts were running in a VPC within the same availability zone of the eu-west region.
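The actual benchmark was written in C; a Python stand-in with the same structure (a fixed pool of workers doing blocking connect/close, so the number of in-flight connections never exceeds the thread count) might look like this:

```python
import socket
import threading

def worker(host, port, n_connections, counter, lock):
    """Open and immediately close connections, blocking on every call,
    so each worker holds at most one connection at a time."""
    for _ in range(n_connections):
        try:
            s = socket.create_connection((host, port), timeout=3)
            s.close()
            with lock:
                counter[0] += 1
        except OSError:
            pass  # a connect timeout here is what the throttling looked like

def run_benchmark(host, port, threads=4, n_connections=100):
    """Spawn `threads` blocking workers. In-flight connections are bounded
    by the thread count, so the target's TCP buffers are never the issue.
    Returns the number of connections that succeeded."""
    counter, lock = [0], threading.Lock()
    pool = [threading.Thread(target=worker,
                             args=(host, port, n_connections, counter, lock))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]
```

Adjusting the thread count and adding small sleeps between connections is how the different "speeds" in the tests below can be simulated.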

Note that we performed dozens of tests. I will present a few of them to give you an insight into the problem.

The first test was conducted with ten instances running at a similar speed of 500–700 connections per second. Nine of them were throttled drastically; one remained uncapped for some reason (we tried to reproduce this behavior in a predictable way, but as far as we can tell it’s random).

In the second test we launched nine workers at the limited speed plus one with slightly lower delays between connections. We noted that the faster worker was not affected! This behavior is 100% reproducible if it’s faster by 30–40% or more.

In the third test, half of the workers were connecting as fast as possible and the other half were limited to 500–700 connections/second. Once again all workers in the slower pool were throttled after a few seconds. More interesting things happened in the faster group – two of the workers kept connecting, though sometimes at half the speed they were capable of.

In the final test, all workers connected at full throttle. Only three of them kept working as expected.

As mentioned earlier, besides the above cases we ran dozens of benchmark variations against various instance types and found out that instances bigger than c3.large (c3.xlarge, r3.large, etc.) are not affected by this issue and connect at a fairly constant 2,500 connections per second. Smaller instances, on the other hand, are affected in a similar way. It also seems that, depending on its connection establishment speed, an instance is assigned to a different throttling pool – that’s why a test run faster than the rest was not affected.

Answers

At this point we had enough intel to bother AWS support with our case. After a few days we received a valuable answer confirming that such behavior is expected for those instance types, and that if we want to avoid it we should run bigger ones. We found this quite surprising, as the documentation doesn’t mention any limit on TCP connections. Until then we had also believed that throughput was the only networking-related limitation.

Our case didn’t finish there, though. After another few days one of the Amazon technicians spotted another interesting thing – if the security group is set to allow all TCP traffic, the issue doesn’t occur either. We had omitted this case in our benchmarks as it’s unacceptable for us anyway. We tried many other SG settings and none of them was problem-free.

Conclusions

TCP health checks are very common thanks to their speed and a reliability that is sufficient for many use cases. Obviously our case was quite specific, as it was caused by some very old code and poor configuration. As we don’t want to give up c3.large instances, we fixed our HAProxy configs to use shared backends that are health-checked only once. This way we drastically reduced the number of unnecessary connections and solved the timeout problem. It is possible, though, that someone else will run into this issue in their product. If so, the only solutions are to bump the instance type to a bigger one or to allow TCP traffic from 0.0.0.0/0.
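Sketched as hypothetical config (names, ports and addresses are again illustrative), the fix amounts to routing every service through one shared backend, so the four Mesh endpoints are health-checked once no matter how many services exist:

```haproxy
# A single shared backend: the four Mesh endpoints are checked once,
# regardless of how many services route through them.
backend mesh_shared
    server mesh-a 10.0.1.10:8080 check inter 2s
    server mesh-b 10.0.2.10:8080 check inter 2s backup
    server mesh-c 10.0.3.10:8080 check inter 2s backup
    server mesh-d 10.0.4.10:8080 check inter 2s backup

frontend svc_billing
    bind *:9001
    default_backend mesh_shared

frontend svc_search
    bind *:9002
    default_backend mesh_shared
```

Compared to the per-service backends shown earlier, the health-check load on Mesh no longer grows with the number of services.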