Retries are a core resiliency pattern that enhances service availability by re-attempting failed operations. They are so common that many libraries (such as the aws-sdk) employ them by default. This post shows how retries can be used to increase service availability and the latency tradeoff they incur.

What are Retries?

A retry is just a repeated operation: when an error occurs during an operation, a retry attempts the operation again. Retries are usually combined with some sort of "backoff strategy," which introduces a delay between attempts in order to prevent a resource from being overwhelmed.

The state diagram below shows retry states and flow common to many different libraries:

Retry State Diagram

An operation is attempted; if it succeeds, the result is returned. If a failure occurs, the retry logic checks whether it should perform the action again. If it shouldn't, an exception is raised; if it should, a timeout is (usually) applied and the operation is attempted again. Retries are a pattern which allows services to increase availability at the expense of increased latency.
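The flow in the diagram can be sketched as a small helper (a hypothetical illustration, not part of any library mentioned here): attempt the operation, and on failure wait for a fixed backoff before trying again, up to a maximum number of attempts.

```javascript
// Minimal sketch of the retry flow above: try the operation, and on
// failure wait for a fixed backoff before the next attempt. After the
// last attempt fails, the final error is raised to the caller.
async function retryWithBackoff(operation, maxAttempts = 3, backoffMs = 50) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Fixed-interval backoff between attempts.
        await new Promise((resolve) => setTimeout(resolve, backoffMs));
      }
    }
  }
  throw lastError; // all attempts exhausted
}
```

Real libraries layer more on top of this loop (error filtering, jitter, metrics), but the success/failure/timeout cycle is the same.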

Why Use Retries?

Many classes of errors (network, application) are transient, often rooted in momentary network or server overload. These errors are ephemeral and usually resolve quickly. If the added latency of a retry can be tolerated, retrying allows increased availability at the cost of increased latency, letting a client offer higher availability than its dependencies. To illustrate this, consider a service ("Service") which has a dependency on another service ("Service Dependency") that offers an SLO availability of 99%.

Because of this the client has to assume that 1 out of every 100 requests will fail. If the service is making 1,000 requests per second this results in 10 failures per second! If the operation is retried once, two errors would have to occur back to back for the request to fail: a 0.01 * 0.01 = 0.0001 chance, or 1 in 10,000. (This makes the large assumptions that there are no other sources of errors and that errors are uncorrelated.)

With a single retry the rate of success for service calls of the main service goes from 99% to 99.99%, higher than the SLO the Service Dependency offers! With a second retry the failure rate drops to 0.000001 (0.01 * 0.01 * 0.01), or 99.9999% success. Ignoring other sources of errors, the service is able to offer a better-than-99.99% SLO. While the real world won't match this exact math, retries are able to increase availability by some amount.
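The arithmetic above can be checked with a back-of-envelope helper (a hypothetical function for illustration): with an independent per-attempt failure rate, each additional attempt multiplies the overall failure probability by that rate.

```javascript
// Availability after retries, assuming independent failures: the request
// only fails if the initial attempt AND every retry fail.
function availabilityWithRetries(perAttemptFailureRate, retries) {
  // Total attempts = 1 initial try + the number of retries.
  const overallFailure = Math.pow(perAttemptFailureRate, retries + 1);
  return 1 - overallFailure;
}

console.log(availabilityWithRetries(0.01, 0)); // 0.99     — no retries
console.log(availabilityWithRetries(0.01, 1)); // 0.9999   — one retry
console.log(availabilityWithRetries(0.01, 2)); // 0.999999 — two retries
```

The independence assumption is doing a lot of work here: correlated failures (a hard outage, a deploy gone wrong) won't be helped nearly as much by retrying.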

How?

In order to illustrate retries in action, the service scenario from above will be modeled to show how retries can allow a caller to offer higher availability than its dependencies and insulate callers from failures. To do this, resilience4js will be used. All code used to generate the test data below can be found here.

A dummy service will be used to model 99% availability:

// assumed helper from the linked code: random integer in [0, max)
const getRandomInt = (max) => Math.floor(Math.random() * max);

class DummyService {
  get() {
    return new Promise((resolve, reject) => {
      // 99 / 100 times return a good response
      const num = getRandomInt(100);
      if (num === 0) {
        reject(new Error('failed'));
        return;
      }
      resolve('success');
    });
  }
}
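As a quick sanity check, the class can be exercised directly (a hypothetical harness, with the helper and class repeated so the snippet is self-contained); the observed failure rate should hover around 1%:

```javascript
// Self-contained check: call DummyService many times and measure the
// observed failure rate, which should be roughly 1%.
const getRandomInt = (max) => Math.floor(Math.random() * max);

class DummyService {
  get() {
    return new Promise((resolve, reject) => {
      const num = getRandomInt(100);
      if (num === 0) {
        reject(new Error('failed'));
        return;
      }
      resolve('success');
    });
  }
}

(async () => {
  const service = new DummyService();
  const total = 100000;
  let failures = 0;
  for (let i = 0; i < total; i++) {
    await service.get().catch(() => { failures += 1; });
  }
  console.log(`observed failure rate: ${((failures / total) * 100).toFixed(2)}%`);
})();
```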

Next the retry policy will be configured:

const retry = resilience4js.Retry.New(
  'dummy_service',
  resilience4js.Retry.Strategies.UntilLimit.New(
    resilience4js.Retry.Timing.FixedInterval.New(50),
    3,
  ),
);

The retry is given the name dummy_service, which will be emitted as a label in the metrics. The strategy being used is UntilLimit, which makes a fixed number of attempts (`3` in this case). Attempts are made at a fixed interval of 50ms. Another common backoff strategy is exponential backoff. Some libraries, such as Polly, allow configuring the specific types of errors to be retried; currently, resilience4js will retry on any error raised by the operation it wraps.

After creating the retry policy it needs to be configured with an operation to retry. In this case it is the service dependency call from above:

const service = new DummyService();
const wrappedGet = retry.decoratePromise(service.get);

The operation will now be attempted up to 3 times, with a 50ms timeout between each failed attempt.

Each of the following test cases applies load at a rate of 500 requests per second using vegeta:

$ echo "GET http://localhost:3000/ " | vegeta attack -rate 500 -duration=0 | tee results.bin | vegeta report

No Retry

The first test is performed without any retries: