What is this Red Nose Day thing?

Since its launch in 1988, Red Nose Day has become something of a British institution. It’s the day when people across the land can get together and raise money at home, school and work to support vulnerable people and communities in the UK and internationally.

With a whole year of fairly pedestrian traffic and then five hours of everything being on the line, we need to be able to ensure that our systems fail and fall over gracefully. I am going to give you a run through of how we load test our systems continuously over and above peak load, ensuring that we aren’t introducing vulnerabilities that our applications won’t be able to handle during an onslaught of traffic. Give my previous article on the journey to serverless a read for some context on how we moved nearly everything to Serverless and massively reduced our AWS footprint.

The Backstory

So, to understand where we are now with load testing, you need to know where we came from.

We had previously used Siege deployed across a load of EC2 spot instances, all deployed using Terraform and Ansible. To simulate one of our most significant past traffic events, we had used 32 r4.large instances for a suite of slam tests. It allowed us to generate the required load to ensure our applications worked as expected. The downsides were immediately evident from an engineering (and cost) perspective. These were:

It took days to deploy the infrastructure, set up and run the tests.

It was unnecessarily costly to load test, and the infrastructure was usually deployed for much longer than it needed to be.

It was not easily replicable by everyone in the engineering team who might want to test their system.

A new hope

Peter (Engineering Lead) and I struggled to understand the legacy approach and how we could move forward with it, so we immediately started googling for alternatives. We quickly came across two very suitable candidates: Goad and Serverless Artillery. We made a quick comparison of the two and settled on Serverless Artillery as it was written in NodeJS and used Serverless Framework, which was in line with the rest of our emerging Serverless applications.

To give you a bit more context on our methodology, we generally perform a varying selection of tests against our applications. Below is a quick explanation of each test type and why we do it:

Drip tests usually run over a period of a couple of days. These simulate a normal background load level, so that we can observe the behaviour of the site under normal operation and correlate any periods of increased latencies or error rates with activities known to be occurring during those periods (e.g. deployments, cache clears and the other elements of our load testing program).

Slam tests simulate a sudden spike of traffic. This enables us to understand the behaviour of the site when faced with a traffic profile that has traditionally been difficult for automated scaling systems to handle appropriately. Our goal here is to determine the level and type of errors to expect during sudden traffic spikes. Due to the nature of Comic Relief, extremely sharp spikes of traffic can occur in response to news stories, appeal films shown on national television and other similar events.

Ramp tests produce a gradually increasing level of traffic similar to that seen over the course of each day during the campaign period.
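As a rough illustration of how the three test types differ, each can be expressed as a different shape of Artillery load phase. The sketch below uses Artillery's duration, arrivalRate and rampTo phase settings; every number in it is a made-up assumption for illustration, not one of our real test plans, and expectedArrivals is a hypothetical helper.

```javascript
// Hypothetical Artillery phase definitions for the three test types.
// All durations (seconds) and rates (arrivals per second) are illustrative.
const profiles = {
  // Drip: low, constant background load held for two days.
  drip: [{ duration: 2 * 24 * 60 * 60, arrivalRate: 2 }],
  // Slam: full load from the very first second, with no warm-up.
  slam: [{ duration: 120, arrivalRate: 500 }],
  // Ramp: load climbs linearly from arrivalRate to rampTo.
  ramp: [{ duration: 300, arrivalRate: 10, rampTo: 200 }],
};

// Approximate number of arrivals a list of phases will generate.
// A ramping phase averages its start and end rates over its duration.
function expectedArrivals(phases) {
  return phases.reduce((total, phase) => {
    const endRate = phase.rampTo === undefined ? phase.arrivalRate : phase.rampTo;
    return total + (phase.duration * (phase.arrivalRate + endRate)) / 2;
  }, 0);
}

for (const [name, phases] of Object.entries(profiles)) {
  console.log(`${name}: ~${expectedArrivals(phases)} arrivals`);
}
```

Estimating the total arrivals up front is a cheap sanity check on a plan before it is pointed at a real backend.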

Donation traffic on the lead up to and including the night of TV

As for our load test workflow, we store all of the load test plans alongside the artillery code for our tests in a single repository. We could probably keep the plans nearer to the applications that we are testing, but these things grow naturally, so, oh well! For every load test, we create a release in GitHub documenting the load test and any information or findings from it, so we can quickly go back to these at a later time. All code changes related to the load test are part of said release.

A load test release on our lookups service

We report from Serverless Artillery to InfluxDB on the performance of the load test; we then visualise the report in Grafana and link to it in our GitHub releases.

We run all our load tests from a separate AWS account to the application that we are testing, which gives us rapid insights into the cost of running the test and also means the test can't eat into the application's lambda invocation limits (hopefully it wouldn’t anyway, as we have very high invocation limits). We will also normally load test the application from a different AWS region to the one that the application is hosted in. It would probably be more scientific to run these from another cloud provider, to replicate actual users more closely, but it is near enough for us for the time being.

Serverless Artillery load testing reporting to InfluxDB and viewed in Grafana

So, to give you a full run through of how we load test our Gift Aid application with Serverless Artillery, I am going to need to provide you with some context. The best description of the application is a form through which users submit their details so that we can claim gift aid on SMS donations.

The traffic hitting the Gift Aid application is generally very spiky off the back of a call to action on a BBC broadcast channel, ramping up from zero to tens of thousands of requests in a matter of seconds. It is probably our highest load application, but also one of our simplest user flows, as the bulk of the load is to a single lambda.

The Gift Aid form

For this test, we define the endpoint for the Serverless backend and set a duration of 60 seconds with an arrival rate of 120 submissions per second, ramping to 240 submissions per second.
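A minimal sketch of what such a test definition could look like, written as a JavaScript object that prints the equivalent JSON script. Only the phase values (60 seconds, 120 ramping to 240) come from the description above; the target URL, the /submit path, the form field names, the processor file name and the generateSubmission function name are all assumptions made for illustration.

```javascript
// Sketch of an Artillery test script, built as a plain object.
// Artillery scripts are normally YAML; the same structure is valid as JSON.
const script = {
  config: {
    // Hypothetical endpoint for the Serverless backend.
    target: "https://giftaid.example.com",
    // One phase: 60 seconds, starting at 120 new submissions per second
    // and ramping linearly to 240 per second.
    phases: [{ duration: 60, arrivalRate: 120, rampTo: 240 }],
    // Hypothetical file exporting the generateSubmission function.
    processor: "./submission.js",
  },
  scenarios: [
    {
      flow: [
        // Populate the template variables used in the request below.
        { function: "generateSubmission" },
        {
          post: {
            url: "/submit", // hypothetical path
            json: {
              firstname: "{{ firstname }}",
              lastname: "{{ lastname }}",
              mobile: "{{ mobile }}",
              postcode: "{{ postcode }}",
            },
          },
        },
      ],
    },
  ],
};

// A linear ramp averages its start and end rates over the phase.
const [phase] = script.config.phases;
const averageRate = (phase.arrivalRate + phase.rampTo) / 2;
const totalSubmissions = averageRate * phase.duration;

console.log(JSON.stringify(script, null, 2));
console.log(`~${totalSubmissions} submissions at an average of ${averageRate}/s`);
```

With these phase values the ramp averages 180 submissions per second, so the minute-long test makes roughly 10,800 submissions in total.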

We then create our submission logic, which submits a form filled with random data, generated using the faker NPM module, to the Serverless endpoint.
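The shape of that submission logic can be sketched as an Artillery processor function, which Artillery calls with (context, events, done) before the request is made. To keep the sketch free of dependencies, small hand-rolled helpers stand in for the faker module here; the field names, name lists and placeholder postcode are assumptions, not the real Gift Aid form schema.

```javascript
// Hypothetical Artillery processor: builds one random submission per
// virtual user. The hand-rolled helpers below stand in for the faker
// NPM module so that the sketch has no dependencies.
const FIRST_NAMES = ["Alex", "Sam", "Priya", "Jo"];
const LAST_NAMES = ["Smith", "Khan", "Jones", "Taylor"];

// Pick a random element from a list.
function pick(list) {
  return list[Math.floor(Math.random() * list.length)];
}

// Build a string of `count` random decimal digits.
function randomDigits(count) {
  let digits = "";
  for (let i = 0; i < count; i += 1) {
    digits += Math.floor(Math.random() * 10);
  }
  return digits;
}

// Artillery flow function: values set on context.vars become available
// to request templates such as "{{ firstname }}".
function generateSubmission(context, events, done) {
  context.vars.firstname = pick(FIRST_NAMES);
  context.vars.lastname = pick(LAST_NAMES);
  context.vars.mobile = `07${randomDigits(9)}`; // UK-style mobile number
  context.vars.postcode = "AB1 2CD"; // fixed placeholder value
  return done();
}

module.exports = { generateSubmission };
```

Randomising the payload per virtual user helps avoid every request looking identical to any caching or de-duplication in the backend.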

It’s that simple. I won’t go into too much more detail here, as both the Serverless Artillery and Artillery documentation cover this far better than I can.

Once we have validated and run a lower traffic version of the test on our local machines to ensure that it works as expected, we run slsart deploy to deploy the load test to AWS and then invoke the load test using slsart invoke. We will then immediately start to see counts of invoked load test lambdas in Grafana and traffic starting to hit the backend, and can be on the lookout for errors or latency issues.

On a ramp test, we will usually watch the requests ramp up and make sure that there aren’t any sudden dips in response rates or spikes in client errors.

Ramping requests over 5 minutes

A common issue that we see on very sudden high spikes of traffic is that you will get a sizeable warm-up latency. Once the fleet of lambdas is provisioned and warm, this will generally come under control. We build our frontend applications to handle this initial spike and not freak out the user if it’s happening. A 382ms average on P99.99 is acceptable, but we do end up feeling sorry for the people watching a spinner for 8 seconds.
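To make the warm-up effect concrete, here is a small sketch using a nearest-rank percentile over entirely hypothetical latency samples: 995 warm invocations and 5 cold starts. It is not how Artillery or Grafana compute their figures, just an illustration of why a handful of slow cold starts can leave the typical request untouched while dominating the extreme tail.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical latencies in milliseconds: 995 warm invocations between
// 100ms and 149ms, plus 5 cold starts that each took 8 seconds.
const latencies = [
  ...Array.from({ length: 995 }, (_, i) => 100 + (i % 50)),
  ...Array(5).fill(8000),
];

for (const p of [50, 99, 99.9]) {
  console.log(`p${p}: ${percentile(latencies, p)}ms`);
}
```

In this made-up data set the median and the 99th percentile both stay under 150ms, while the 99.9th percentile jumps to the full 8 seconds.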

For the night of TV, we generally get a progressively warm fleet of lambdas as the night progresses, so this is a bit of a non-issue. We can warm a fleet of lambdas when we know a spike is coming; the spikes generally aren’t that predictable though.

Warmup latency spike

When the load testing is running, we will be on high alert for errors from Sentry and IOPipe. We will generally try to fully replicate the alerts and dashboards for the load test environment to match production, to get us entirely used to what we would be looking at during campaign.

In the lead up to campaign we also run several game days, where we bring stakeholders from across the business in on the testing and simulation. We run more extended versions of these tests to replicate a condensed version of the traffic profile for the night of TV while watching re-runs of Red Nose Day on YouTube. This process helps to get the team used to what they will be looking at on the night and allows for communication routes to be created and optimised.

The final thoughts

The next step for us is probably to bring some degree of automation to our load tests and bring the tests into our Concourse CI pipelines alongside our feature tests. As new features come in, it would be awesome to have an idea of the load implication on our systems and make sure that we aren’t bringing in any critical issues.

That really is as simple as load testing is for us. We encourage every squad in the engineering team to do it and be comfortable with it. The process of load testing before Serverless always seemed like an uphill struggle, a bit of a pain and targeted at a niche skill set. Much like with everything in Serverless, it just takes away some of the extra rubbish hoops that you have to jump through to get peripheral stuff done and get on with your day. The price and time to load test are at such a negligible level now, it just doesn't really factor in as a reason not to do it.