Load testing

Broadcasting messages to a few connections over WebSockets is easy. Doing it at scale is far more difficult, so once we’d built the core service, we immediately started load testing.

As with many services at DAZN, our autoscaling policies were set up to scale up aggressively and scale down slowly. For example, we:

Scale up by 50% when average CPU is above 60% for 3 minutes

Scale up by 100% when maximum CPU is above 85% for 1 minute

Scale down by 10% when maximum CPU is below 30% for 10 minutes
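To make the three rules concrete, here is a minimal sketch of how they could be evaluated against per-minute CPU samples. The function names, the order in which the rules are checked, and the rounding behaviour are all assumptions made for illustration; in practice the rules would live in the autoscaling configuration rather than in application code.

```python
import math

def sustained(samples, minutes, predicate):
    """True if the most recent `minutes` samples all satisfy the predicate."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(predicate(s) for s in recent)

def desired_task_count(current, avg_cpu, max_cpu):
    """Apply the three scaling rules to per-minute CPU samples (newest last).

    Assumption: the most aggressive rule wins when several match.
    """
    if sustained(max_cpu, 1, lambda cpu: cpu > 85):
        return current * 2                           # scale up by 100%
    if sustained(avg_cpu, 3, lambda cpu: cpu > 60):
        return current + math.ceil(current / 2)      # scale up by 50%
    if sustained(max_cpu, 10, lambda cpu: cpu < 30):
        # Scale down by 10%; assumption: remove at least one task, keep at least one.
        return max(1, current - max(1, current // 10))
    return current

print(desired_task_count(10, avg_cpu=[65, 70, 61], max_cpu=[70, 75, 72]))
```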

Scaling up

For any service to scale linearly, doubling the number of tasks or instances (scaling up 100%) should also double the maximum capacity. To test this, we turned off all autoscaling and ran 5 tasks. These tasks were able to handle 1 million WebSocket connections being opened over 5 minutes before we saw a slight increase in latency due to high CPU usage.

We expected that when we started 5 new tasks (total of 10), we’d be able to open another 1 million connections before seeing the same increase in latency. This was not the case: most of the new connections saw increased latency. After a few minutes, some of the existing connections were dropped, indicating service instability.

So, what was happening? Well, AWS Application Load Balancers route requests in a round-robin fashion. This works okay for HTTP-based services, as connections don’t stay open for more than a second in most cases. But, when using WebSockets, the connections are persistent, so round-robin request routing has a huge negative impact on our scalability.

When we added 5 more tasks, the original 5 tasks still had 1 million open connections. The ALB routed the next 1 million connections evenly across the 10 tasks, rather than trying to level out the load. After a while, the original 5 tasks became overloaded and were shut down due to health checks. Once that happened, their connections reconnected, the 5 new tasks became overloaded in turn, and the cycle repeated again and again.

Many load balancers support other routing algorithms, such as least outstanding requests (LOR), which selects the host with the lowest load. For HTTP-based services this is usually a great way of routing traffic; for WebSocket services, it’s a requirement. Luckily, AWS listened to our feedback and announced LOR routing at re:Invent 2019. Until then, to avoid instability, we were stuck with running our service at a higher capacity than we needed. This had significant cost implications, so we’re relieved to be able to use the LOR algorithm at last.
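The difference between the two algorithms can be illustrated with a small simulation of the test described above: 5 tasks already holding 1 million connections, 5 new empty tasks, and another 1 million connections arriving. This is only a sketch. It treats every open WebSocket connection as one outstanding request, which is an assumption, and the function names are made up for the example.

```python
import heapq
from itertools import cycle

def route_round_robin(loads, new_connections):
    """Give each new connection to the next task in turn, ignoring current load."""
    loads = list(loads)
    targets = cycle(range(len(loads)))
    for _ in range(new_connections):
        loads[next(targets)] += 1
    return loads

def route_least_outstanding(loads, new_connections):
    """Give each new connection to the task with the fewest open connections."""
    heap = [(load, task) for task, load in enumerate(loads)]
    heapq.heapify(heap)
    for _ in range(new_connections):
        load, task = heapq.heappop(heap)
        heapq.heappush(heap, (load + 1, task))
    result = [0] * len(loads)
    for load, task in heap:
        result[task] = load
    return result

# Counts are in thousands of connections to keep the simulation fast.
before = [200] * 5 + [0] * 5   # 5 loaded tasks (1M in total) plus 5 new, empty tasks

print(route_round_robin(before, 1000))
# [300, 300, 300, 300, 300, 100, 100, 100, 100, 100]
print(route_least_outstanding(before, 1000))
# [200, 200, 200, 200, 200, 200, 200, 200, 200, 200]
```

With round-robin, the original tasks end up 50% above the load at which latency started to rise, while the new tasks sit half empty. With least outstanding requests, all ten tasks level out at the same load.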

Scaling down

Scaling down is usually easy. The load balancer stops routing new requests to the target, then waits up to an hour for any open connections to finish before the target is terminated. For HTTP-based services, this works well because connections aren’t open for long. But what happens with WebSocket connections, which stay open for many hours? The target will still have thousands of open connections after an hour, and all of them are forcefully terminated. This also affects deployments, where we need to switch out the targets gracefully.

If many targets are de-registered at the same time, we can easily end up with a thundering herd of reconnections. This would immediately cause the system to scale back up again, causing instability. We needed to be able to gracefully drain the WebSocket connections over some time.
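One way to picture a gradual drain is to split the open connections into small, evenly sized batches across the drain window, so that clients reconnect in a trickle rather than all at once. The sketch below only computes such a schedule. The function name, the one-hour window, and the 10-second tick are assumptions for illustration, not the values used in production.

```python
def drain_schedule(open_connections, window_seconds, tick_seconds):
    """Return how many connections to close on each tick of the drain window.

    Batches differ by at most one connection, and their sum equals
    the number of open connections.
    """
    ticks = max(1, window_seconds // tick_seconds)
    base, extra = divmod(open_connections, ticks)
    return [base + (1 if i < extra else 0) for i in range(ticks)]

# Hypothetical numbers: 200,000 connections drained over one hour, closing a batch every 10 seconds.
schedule = drain_schedule(200_000, window_seconds=3600, tick_seconds=10)
print(len(schedule), max(schedule), min(schedule))
```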

The first idea was to listen for a process signal. We hoped that ECS would send one signal to the task as soon as it’s de-registered (such as SIGTERM), and another when it’s killed (such as SIGKILL). This is not the case. Tasks are only sent a SIGTERM 30 seconds before a SIGKILL, regardless of the load balancer deregistration delay. 30 seconds is nowhere near enough time to slowly drain all the connections.

In the end, the only solution was for each task to poll the AWS API for its own target health. However, the task needs to work out which instanceId and port it is running on before it can call the API. The instance metadata API returns the instanceId. Finding the port requires querying the task metadata API to get the taskARN and cluster, followed by calling the ECS.describeTasks API function. Altogether, this is far more complex than it should be. We’re hoping to open-source our implementation soon.
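The lookup chain described above can be sketched roughly as follows. This is not the implementation mentioned in the text, only an illustration under several assumptions: the clients are boto3-style `ecs` and `elbv2` clients passed in by the caller, the instance metadata service is reachable without a session token, and the task uses dynamic host port mapping. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: IMDSv1 is reachable from the task; IMDSv2 would need a session token first.
INSTANCE_ID_URL = "http://169.254.169.254/latest/meta-data/instance-id"

def fetch_text(url, timeout=2):
    """GET a metadata URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

def describe_self(task_metadata_uri):
    """Return (cluster, task ARN) from the ECS task metadata endpoint."""
    task = json.loads(fetch_text(f"{task_metadata_uri}/task"))
    return task["Cluster"], task["TaskARN"]

def find_host_port(ecs_client, cluster, task_arn, container_port):
    """Look up the host port that ECS mapped to the given container port."""
    response = ecs_client.describe_tasks(cluster=cluster, tasks=[task_arn])
    for container in response["tasks"][0]["containers"]:
        for binding in container.get("networkBindings", []):
            if binding["containerPort"] == container_port:
                return binding["hostPort"]
    raise LookupError(f"no host port bound to container port {container_port}")

def is_draining(elbv2_client, target_group_arn, instance_id, host_port):
    """True once the load balancer reports this target as draining."""
    response = elbv2_client.describe_target_health(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id, "Port": host_port}],
    )
    states = [d["TargetHealth"]["State"] for d in response["TargetHealthDescriptions"]]
    return "draining" in states
```

In a running task, the instance ID would come from `fetch_text(INSTANCE_ID_URL)`, the task metadata URI from the `ECS_CONTAINER_METADATA_URI` environment variable, and the clients from `boto3.client("ecs")` and `boto3.client("elbv2")`. The task would then call `is_draining` on a timer and begin closing connections once it returns true.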

Improving efficiency — Connection Tracking

Once we’d run a few successful tests of our scaling behaviour, we wanted to see how many connections each task could handle. We also wanted to see whether changing instance types or task sizes could improve efficiency.

We noticed that there was a hard limit on the number of connections which each instance type could handle. After some searching and support from AWS, we found that there are (unpublished) limits on the number of tracked connections a security group can handle, known as connection tracking or CONNTRACK. By opening our ALB and ECS security groups (relying on our private subnet and network ACLs for security), we were able to avoid this limit.

Unfortunately, when re-running our tests, we experienced similar behaviour. After a lot more searching and help from AWS experts, we found that the limit came from nf_conntrack, which Linux uses for NAT and firewall purposes. On c4.large instances, nf_conntrack_max is set to 65536 by default, meaning that a c4.large instance can only handle 65k connections. The limit can be increased with sysctl, for example /sbin/sysctl -w net.netfilter.nf_conntrack_max=196608.
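To check how close an instance is to this limit during a load test, the kernel's counters can be read directly from /proc. The sketch below assumes a Linux host with the nf_conntrack module loaded; the function name is made up for the example.

```python
from pathlib import Path

# Standard location of the netfilter connection tracking counters on Linux.
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")

def conntrack_usage(directory=CONNTRACK_DIR):
    """Return (tracked, limit, fraction_used) from the kernel's conntrack counters."""
    directory = Path(directory)
    tracked = int((directory / "nf_conntrack_count").read_text())
    limit = int((directory / "nf_conntrack_max").read_text())
    return tracked, limit, tracked / limit
```

Calling `conntrack_usage()` on an instance returns the current number of tracked connections alongside nf_conntrack_max, which makes it easy to tell whether a plateau in connection count lines up with the limit.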

With the limit raised, our instances could finally make full use of their capacity. Our next load test was able to scale quickly and handle tens of millions of connections.

Failing faster

During another load test, we used simulated latency to make one of our regions perform badly. We expected Route 53 latency-based routing to detect this and fail over to another region, but this was not the case. Latency-based routing uses the approximate latency between the user and the AWS region, NOT between the user and your service! We resolved this issue by using finely-tuned Route 53 Health Checks. For more information, read our article on how to implement the perfect failover strategy using Amazon Route53.
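The behaviour is easier to see as a toy model: the routing decision depends only on the network latency to each region and on whether that region's health check passes, never on how slowly the service itself responds. The sketch below is a simplification written for illustration. The latency figures are hypothetical, and raising an error when no region is healthy is a choice made for the example, not a description of Route 53.

```python
def pick_region(network_latency_ms, healthy):
    """Choose the lowest-latency region among those whose health check passes."""
    candidates = [region for region in network_latency_ms if healthy.get(region, False)]
    if not candidates:
        # Simplification for this sketch only.
        raise LookupError("no healthy region")
    return min(candidates, key=network_latency_ms.__getitem__)

# Hypothetical user-to-region network latencies in milliseconds.
latency = {"eu-central-1": 20, "eu-west-1": 35}

# The service in eu-central-1 is slow, but its health check still passes:
# traffic stays on the nearest region.
print(pick_region(latency, {"eu-central-1": True, "eu-west-1": True}))

# The health check is tuned to fail when the service degrades:
# traffic moves to the next-nearest healthy region.
print(pick_region(latency, {"eu-central-1": False, "eu-west-1": True}))
```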

Iteration & improvements

CloudFront-related issues

Whilst running Pubby in production, we noticed occasional spikes in the error rates reported by CloudFront. The spikes would only last a few minutes, but they were large enough to trigger our CloudWatch alarms. They didn’t correlate with any errors from our service or load balancers, but some users were receiving 502 errors, so we had to investigate further.

The CloudFront access logs showed that the errors in each spike always came from a single CloudFront edge location, such as FRA2 (Frankfurt). AWS Support helped us to diagnose the errors as “transient networking issues” at the edge locations. They were minor, short-lived incidents which weren’t even noticed by our other, HTTP-based services, so why was Pubby affected?

WebSockets, with their persistent nature, are vulnerable to network issues. When an edge location has an issue, CloudFront fails over within minutes so all new requests are directed elsewhere. With WebSockets, however, all of the open connections would be dropped and have to reconnect, causing the error rate to spike far higher than it would for a few failed HTTP requests.

After speaking with the CloudFront team, we decided to remove CloudFront from Pubby’s infrastructure. It’s not necessary for a pure WebSockets service. We’d like to introduce AWS Global Accelerator so requests enter the AWS global network as early as possible, but it’s currently missing IPv6 support.

Planned Future Architecture

Once we removed CloudFront, the next improvement on our todo list was to make the API multi-region. If our primary AWS region had an outage, our services would be unable to publish messages. This has been made much easier as single-region DynamoDB tables can now be converted to global tables.

Below is our planned future architecture. CloudFront is gone and the API is now multi-region. SNS has been replaced with DynamoDB Streams to simplify the architecture. The API Gateway has been replaced with an ALB.