GOV.UK Notify lets central government, local authorities and the NHS send emails, text messages and letters to their users.

We usually send between 100,000 and 200,000 text messages a day. It’s important for services using Notify that they’re able to quickly and successfully send text messages to their users.

Those services rely on us to send important messages, for example a flood warning or a two-factor authentication (2FA) code so their users can sign in to another service. We design and build Notify with this in mind.

Using multiple text message providers

When a central government, local authority or NHS service wants to send a text message to a user, they ask Notify, either manually through our web interface or using our API, to send it. We then send an HTTP request to a text message provider to ask them to deliver the message. No provider will be working perfectly 100% of the time (nor should we expect them to be). Because of this we have 2 different providers, so if one encounters any issues we can use the other provider to send the message.

Our original load balancing design

Originally we sent all text messages through one provider, say provider A. If provider A started having trouble, Notify would automatically swap all traffic to provider B – a process known as a failover. We used 2 measures to decide whether a provider was having problems and whether to fail over. We looked for a:

single 500-599 HTTP response code from the provider

slowdown in successful delivery callbacks (a message back from the provider to say it had delivered the message to the recipient)

To determine if callbacks were slow, we’d measure the last 10 minutes of messages being sent. We’d consider callbacks slow if 30% of them took longer than 4 minutes to report back as delivered.
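The slow callback check described above could be sketched roughly as follows. This is an illustration rather than Notify's actual code: the function name, the shape of the input data, and the choice to count messages that are still undelivered after 4 minutes as slow are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

WINDOW = timedelta(minutes=10)      # how far back we look
SLOW_AFTER = timedelta(minutes=4)   # a callback slower than this counts as slow
SLOW_FRACTION = 0.3                 # share of slow callbacks that triggers action


def callbacks_are_slow(
    messages: Iterable[Tuple[datetime, Optional[datetime]]],
    now: datetime,
) -> bool:
    """Each message is a hypothetical (sent_at, delivered_at) pair.

    delivered_at is None if no delivery callback has arrived yet.
    """
    recent = [(sent, done) for sent, done in messages if now - WINDOW <= sent <= now]
    if not recent:
        return False  # nothing sent recently, so nothing to judge

    slow = 0
    for sent_at, delivered_at in recent:
        # Assumption: a message with no callback yet is measured up to "now".
        waited = (delivered_at or now) - sent_at
        if waited > SLOW_AFTER:
            slow += 1

    return slow / len(recent) >= SLOW_FRACTION
```

For example, if 4 of the 10 messages sent in the last 10 minutes took 5 minutes to be reported as delivered, this check would treat the provider as slow.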

We could also manually swap traffic from, say, provider A to provider B as we wanted. We did this often, maybe once a week, to try to reach a roughly 50/50 split of messages sent between our providers. If we sent only a small number of messages through one provider over the long run, they might have little incentive to remain a provider in the future.

A problem with our original design

One day, towards the end of 2019, we had a large spike in requests to send text messages. We sent all these requests to one of our providers, but they could not handle the load and started to fail. Our system swapped to the other provider, but the sudden arrival of a large amount of traffic caused them to start returning errors too. It was likely that our providers needed time to scale up to handle the sudden load we were sending them.

How we improved our resiliency

We changed Notify to send traffic to both providers with a roughly 50/50 split. When a single text message is sent, Notify will pick a provider at random. This should reduce the chance of giving our providers a very large amount of unexpected traffic that they will not be able to handle.
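Picking a provider at random for each message, weighted by each provider's current share, could look something like this minimal sketch. The provider names and the `choose_provider` function are hypothetical, and the shares are assumed to be whole percentages that add up to 100.

```python
import random


def choose_provider(shares, rng=random):
    """Pick one provider at random, weighted by its share of traffic.

    shares is a hypothetical mapping such as
    {"provider_a": 50, "provider_b": 50}.
    """
    names = list(shares)
    weights = [shares[name] for name in names]
    # random.choices returns a list of k items; we only want one.
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    shares = {"provider_a": 50, "provider_b": 50}
    print(choose_provider(shares))
```

Because the choice is made per message, a sudden spike is spread across both providers in proportion to their shares instead of landing on one of them.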

We also changed how we handle errors from our providers. If a provider gives us a 500-599 HTTP response code, we reduce their share of the load by 10 percentage points (and therefore increase the other provider’s share by 10 percentage points). We do not reduce the share if it has already been reduced in the last minute.

We also decided that if a provider is slow to deliver messages, measured in the same way as before, we reduce their share of the load by 10 percentage points. Again, we do not reduce the share if it has already been reduced in the last minute.

It’s important that we wait a minute before allowing another 500-599 HTTP response code to decrease that provider’s share of traffic again. This means that just a small blip, for example five 500-599 HTTP responses over a second, doesn’t switch all traffic to the other provider too quickly.
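The one-minute wait can be sketched as below. This is a simplified illustration, not Notify's implementation: the class and method names are made up, it assumes exactly 2 providers, and it assumes the wait is tracked separately for each provider.

```python
from datetime import datetime, timedelta

STEP = 10                          # percentage points moved per reduction
COOLDOWN = timedelta(minutes=1)    # minimum gap between reductions


class ProviderShares:
    """Tracks the traffic split between exactly 2 providers (hypothetical)."""

    def __init__(self, shares):
        if len(shares) != 2 or sum(shares.values()) != 100:
            raise ValueError("expected 2 providers whose shares add up to 100")
        self.shares = dict(shares)
        self.last_reduced = {name: None for name in shares}

    def reduce_share(self, provider, now):
        """Move up to STEP points away from provider. Returns True if changed."""
        last = self.last_reduced[provider]
        if last is not None and now - last < COOLDOWN:
            return False  # already reduced in the last minute, so ignore

        step = min(STEP, self.shares[provider])  # never go below 0%
        if step == 0:
            return False

        other = next(name for name in self.shares if name != provider)
        self.shares[provider] -= step
        self.shares[other] += step
        self.last_reduced[provider] = now
        return True
```

With this approach, five errors in one second only move the split from 50/50 to 40/60. It takes sustained errors over several minutes to move all traffic to the other provider.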

Equally balancing our traffic

We still had the manual task of equally balancing our traffic if we no longer needed to push that traffic towards one of the providers. We decided that, if neither provider had changed its balance of traffic in the last hour, we’d move both providers 10 percentage points closer to their defined resting points.
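As a rough sketch of that rule, the function below moves each provider up to 10 percentage points towards its resting point once an hour has passed without a change. The function name, arguments and provider names are assumptions made for illustration, and the shares and resting points are each assumed to add up to 100.

```python
from datetime import datetime, timedelta

STEP = 10                            # percentage points moved per adjustment
QUIET_PERIOD = timedelta(hours=1)    # time with no changes before rebalancing


def move_towards_resting(shares, resting, last_changed, now):
    """Return a new split, nudged towards the resting split if it is time.

    shares and resting are hypothetical mappings of provider name to
    percentage. last_changed is when any provider's share last changed.
    """
    if now - last_changed < QUIET_PERIOD:
        return dict(shares)  # a share changed recently, so leave things alone

    updated = {}
    for name, current in shares.items():
        gap = resting[name] - current
        # Move by at most STEP points, and never overshoot the resting point.
        updated[name] = current + max(-STEP, min(STEP, gap))
    return updated
```

Starting from a 0/100 split with 50/50 resting points, repeated hourly calls would give 10/90, then 20/80, and so on until the split is back at 50/50, provided nothing reduces a share in between.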

This means our system automatically returns to its resting split, which removes the manual burden on our team of trying to send roughly equal traffic to both providers. We can still manually decide what percentage of traffic goes to each provider if we want to, but this is something we anticipate doing rarely.

We did consider trying to overcorrect traffic to bring the overall balance back to 50/50 over, say, a month. For example, if provider A has an incident and receives no traffic for 24 hours, we could give it 70% of the traffic for the next few days to overcorrect the traffic it lost. We decided doing this would only bring a small benefit and would increase the complexity of our load balancing system. Keeping things as simple as possible won the argument in this case.

How the service is doing now

The following graphs show the number of text messages we sent to each of our providers per second.

On the morning of 26 January one of our providers ran into problems and we reduced their share of traffic down to zero. For a while after this, you can see us give them 10% of traffic every hour to check whether they had recovered. They had not, so their share was reduced back to 0% each time.

Finally, the next afternoon, their system improved and we moved back towards a roughly equal split of traffic.

What’s next

This design works for us now. As we continue to grow, we’ll keep making changes like this to make sure we’re providing the best performance, resilience and value for money to Notify’s users.