
In this article I’ll cover fault tolerance in microservices and how to achieve it. If you look it up on Wikipedia, you will find the following definition:

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components.

For us, a component means anything: a microservice, a database (DB), a load balancer (LB), you name it. I won’t cover DB/LB fault-tolerance mechanisms, because they are vendor-specific and enabling them usually comes down to setting a property or changing a deploy policy.

As software engineers, applications are where we have all the power and responsibility, so let’s take care of them. Here’s the list of patterns I am going to cover:

Timeouts

Retries

Circuit Breaker

Deadlines

Rate limiters

Some of these patterns are widely known, and you might even doubt they are worth mentioning, but stick with the article: I’ll cover the basic forms briefly, then discuss their flaws and how to overcome them.

Timeouts

A timeout is the maximum period of time we are willing to wait for some event to occur. There is a problem if you are using SO_TIMEOUT (also known as socket timeout or read timeout): it limits the interval between any two consecutive data packets, not the whole response, so it’s harder to enforce an SLA, especially when the response payload is big. What you usually want is a timeout that covers the whole interaction, from establishing the connection to the very last byte of the response. SLAs are usually described in terms of such timeouts, because they are humane and natural to us. Sadly, they don’t fit the SO_TIMEOUT philosophy. To overcome this in the JVM world you can use the JDK 11 HTTP client or OkHttp. Go has mechanisms for this in its standard library too.
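For instance, with the JDK 11 HTTP client a whole-request budget looks roughly like this minimal sketch (the endpoint is made up):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class WholeRequestTimeout {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(100)) // budget for establishing the connection
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://service-b.example.com/api")) // hypothetical endpoint
                .timeout(Duration.ofMillis(400)) // budget for the response, not per-packet
                .build();
        // Throws java.net.http.HttpTimeoutException once the budget is exceeded.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}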

If you want to dig in — check my previous article.

Retries

If your request failed, wait a bit and try again. That’s basically it. Retrying makes sense because the network might degrade for a moment, or a GC pause might hit the particular instance your request landed on. Now, imagine a chain of microservices like this: A → B → C → D.

What happens if we set the number of total attempts to 3 at every service and service D suddenly starts serving 100% errors? It will lead to a retry storm: a situation where every service in the chain starts retrying its requests, drastically amplifying the total load, so B will face 3x load, C 9x and D 27x! Redundancy is one of the key principles in achieving high availability, but I doubt you would have enough free capacity on clusters C and D in that case. Setting total attempts to 2 doesn’t help much either, plus it makes the user experience worse during small blips.

Solution:

Distinguish retryable errors from non-retryable ones. It’s pointless to retry a request when the user doesn’t have permissions or the payload isn’t structured properly. On the contrary, retrying request timeouts or 5xx responses is reasonable.

Adopt error budgeting: a technique where you stop retrying once the rate of retryable errors exceeds a threshold. For example, if 20% of interactions with service D result in errors, stop retrying it and try to degrade gracefully. The error rate can be tracked with a rolling window over the last N seconds.
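Here’s a minimal sketch of both ideas combined; callServiceD is a hypothetical stand-in for the real downstream call, and the fixed-window counters stand in for a proper rolling window:

import java.io.IOException;
import java.net.http.HttpTimeoutException;

class RetryingClient {
    private static final int MAX_ATTEMPTS = 3;
    private static final double ERROR_BUDGET = 0.20; // stop retrying above 20% errors

    // Naive fixed-window counters; production code would use a rolling window.
    private long calls = 0, errors = 0;

    synchronized String call() throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            calls++;
            try {
                return callServiceD();
            } catch (IOException e) {
                errors++;
                last = e;
                boolean retryable = e instanceof HttpTimeoutException; // timeouts/5xx, not 4xx
                boolean budgetLeft = (double) errors / calls < ERROR_BUDGET;
                if (!retryable || !budgetLeft) {
                    break; // degrade gracefully instead of amplifying the load
                }
            }
        }
        throw last;
    }

    private String callServiceD() throws IOException {
        return "ok"; // the real HTTP call would live here
    }
}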

Circuit Breaker

A circuit breaker can be seen as a stricter version of error budgeting: when the error rate is too high, the function won’t be executed at all and a fallback result will be returned, if provided. A very small portion of requests should still be let through in order to detect whether the third party has recovered. What we want is to give the third party a chance to recover without any manual work.

You might argue that it doesn’t make sense to enable a circuit breaker if the function is on the critical path, but bear in mind that this short and controlled ‘outage’ is likely to prevent a big and uncontrollable one.

Although the circuit breaker and error budgeting share similar ideas, it makes sense to configure both. Since error budgeting is less disruptive, its threshold should be lower.

Hystrix was the go-to circuit breaker implementation on the JVM for a long time. It has since entered maintenance mode, and its maintainers advise using resilience4j instead.
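With resilience4j the basic setup looks roughly like this (a sketch assuming the 1.x-style API; the thresholds and callServiceD are illustrative):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

public class BreakerExample {
    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open when >50% of calls fail
                .slidingWindowSize(100)                          // measured over the last 100 calls
                .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open before probing
                .permittedNumberOfCallsInHalfOpenState(5)        // the small probing portion
                .build();
        CircuitBreaker breaker = CircuitBreaker.of("serviceD", config);

        String result;
        try {
            result = breaker.executeSupplier(BreakerExample::callServiceD);
        } catch (Exception e) {
            result = "fallback"; // CallNotPermittedException while the breaker is open
        }
        System.out.println(result);
    }

    private static String callServiceD() {
        return "ok"; // hypothetical downstream call
    }
}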

Deadlines/distributed timeouts

We’ve discussed timeouts in the first part of this article; now let’s see how we can make them ‘distributed’. First, revisit the same chain of services calling each other: A → B → C → D.

Service A is willing to wait at most 400ms, and the request requires work from all three downstream services. Assume that service B took 400ms and is now ready to call service C. Is that reasonable at all? No! Service A has timed out and no longer waits for the result. Proceeding further will only waste resources and increase susceptibility to retry storms.

To implement this, we must add extra metadata to a request that helps decide when it’s reasonable to interrupt processing. Ideally, it should be supported by all participants and passed throughout the system.

In practice this metadata is one of the following:

Timestamp: pass the point in time at which your service will stop waiting for the response. First, the gateway/frontend service sets the deadline to ‘current timestamp + timeout’. Next, every downstream service checks whether the current timestamp ≥ deadline. If the answer is yes, it’s safe to drop the request; otherwise, start processing. Unfortunately, there is a problem with clock skew, when machines have different clock time. If it occurs, requests will get stuck or get rejected immediately, causing an outage.

Timeout: pass the amount of time the service is allowed to wait for. This is a bit trickier to implement. As before, you set the deadline as early as possible. Next, every downstream service calculates how much time it has spent, subtracts it from the inbound timeout and passes the result to the next participant. It’s crucial not to forget time spent waiting in the queue! So, if service A is allowed to wait 400ms and service B spent 150ms, it must pass a 250ms timeout when calling service C. Although this doesn’t account for time spent on the wire, the deadline can only fire later than intended, not earlier, thus potentially consuming slightly more resources but not spoiling the outcome. Deadlines are implemented this way in gRPC.
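Here’s a minimal sketch of the timeout flavor, assuming a hypothetical X-Timeout-Ms header that carries the remaining budget between services:

import java.time.Duration;

public class DeadlinePropagation {
    public static void main(String[] args) throws Exception {
        long startNanos = System.nanoTime(); // ideally captured when the request is dequeued
        long inboundTimeoutMs = 400;         // value of the inbound X-Timeout-Ms header

        Thread.sleep(150); // stand-in for the 150ms of work service B does

        long elapsedMs = Duration.ofNanos(System.nanoTime() - startNanos).toMillis();
        long remainingMs = inboundTimeoutMs - elapsedMs;
        if (remainingMs <= 0) {
            // The caller has already given up: interrupt instead of calling service C.
            System.out.println("Deadline exceeded, aborting");
            return;
        }
        // Pass the shrunken budget downstream as the outbound X-Timeout-Ms header
        // and as the whole-request timeout of the HTTP client.
        System.out.println("Calling service C with X-Timeout-Ms: " + remainingMs);
    }
}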

The last thing to discuss: does it ever make sense not to interrupt the call chain when the deadline is exceeded? The answer is yes. If your service has plenty of free capacity and completing the request will make it hotter (cache/JIT), it’s okay to keep processing.

Rate limiters

The previously discussed patterns mostly solve the problem of cascading failures: a situation where a dependent service collapses after its dependency has collapsed, eventually leading to a full shutdown. Now, let’s cover the situation when your own service is overloaded. There are plenty of technical and domain-specific reasons why it might happen; just assume it happened.


Every application has its capacity, and that capacity is usually unknown. The value is dynamic and depends on multiple variables, such as recent code changes, the CPU model the application is running on right now, how busy the host machine is, etc.

What happens when load surpasses capacity? Usually, this vicious cycle occurs:

1. Response time grows, GC footprint increases
2. Clients get more timeouts, even more load arrives
3. Goto 1, but more severe

This is just one example of what might happen. Sure, if clients have error budgeting or a circuit breaker, step 2 might not create extra load, giving the service a chance to leave this cycle. Other things might happen instead: removing the instance from the LB’s upstream list might create more load inequality and bring down neighboring instances, and so on.

Limiters to the rescue! Their idea is to shed incoming load gracefully. Ideally, excessive load should be handled like this:

1. The limiter drops the extra load above capacity, letting the application serve requests in compliance with the SLA
2. The excessive load is redistributed to other instances, or the cluster auto-scales, or it gets scaled by a human

There are two types of limiters, rate and concurrency: the former restricts inbound RPS, the latter restricts the number of requests being processed at any moment in time.
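For reference, here’s a naive fixed-window rate limiter sketch (illustrative only; real implementations use token buckets or sliding windows):

class NaiveRateLimiter {
    private final long maxPerSecond;
    private long windowStartMillis = System.currentTimeMillis();
    private long admitted = 0;

    NaiveRateLimiter(long maxPerSecond) {
        this.maxPerSecond = maxPerSecond;
    }

    synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStartMillis >= 1000) { // a new one-second window begins
            windowStartMillis = now;
            admitted = 0;
        }
        if (admitted < maxPerSecond) {
            admitted++;
            return true;
        }
        return false; // shed: respond with 429 Too Many Requests in practice
    }
}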

For the sake of simplicity, I’ll assume that all requests to our services are nearly equal in computational cost and have the same importance. Computational inequality arises from the fact that different users can have different amounts of data associated with them, e.g. favorite TV series or previous orders. Usually, embracing pagination helps to achieve computational equality of requests.

The rate limiter is more widely used, but it doesn’t provide guarantees as strong as a concurrency limit does, so if you have to choose one, stick with the concurrency limit. Here’s why.

When configuring a rate limiter, we think we enforce the following:

This service can process N requests per second at any point of time.

But what we actually declare is this:

Assuming that response time won’t change, this service can process N requests per second at any point of time.

Why is this remark important? I’ll ‘prove’ it by intuition; for those who want a math-based proof, check Little’s law. Assuming the rate limit is 1000 RPS, the response time is 1000ms and the SLA is 1200ms, we easily serve exactly 1000 requests per second within the SLA.
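Little’s law makes this precise: the number of requests in flight L equals the arrival rate λ multiplied by the response time W. With the numbers above:

L = λ × W = 1000 req/s × 1.0 s = 1000 requests in flight

So the service holds exactly 1000 concurrent requests, and any growth in W pushes L higher.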

Now, suppose the response time grows by 50ms (a dependency service started doing extra work). From now on, every second the service will face more and more requests being processed at the same time, because the arrival rate is bigger than the service rate. Having an unlimited number of workers means you will run out of resources and collapse, especially in environments where workers map 1:1 to OS threads. How would a concurrency limit with 1000 workers handle it? It will reliably serve 1000/1.05 ≈ 952 RPS without SLA violations and drop the rest. Also, no reconfiguration is needed to catch up!
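A concurrency limiter can be sketched with a plain semaphore (names are illustrative):

import java.util.concurrent.Semaphore;

class ConcurrencyLimiter {
    private static final int MAX_IN_FLIGHT = 1000;   // the worker count from the example
    private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);

    // Returns null when the request is shed (HTTP 429/503 in practice).
    String handle(String request) {
        if (!permits.tryAcquire()) {
            return null; // over capacity: drop instead of queueing and violating the SLA
        }
        try {
            return process(request);
        } finally {
            permits.release();
        }
    }

    private String process(String request) {
        return "ok"; // stand-in for real work taking ~1s in the example
    }
}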

We could update the rate limit every time a dependency changes, but that is an immense burden, potentially requiring the whole ecosystem to be reconfigured on every change.

Depending on how the limit value is set, a limiter is either static or dynamic.

Static

In this case the limit is configured manually. The value can be assessed by regular performance tests. Although it won’t be 100% accurate, it can be pessimized for safety. This type of limiting requires some work around CI/CD pipelines and yields lower resource utilization. A static limiter can be implemented by restricting the size of the worker thread pool (concurrency only), by adding an inbound filter that counts requests, by NGINX’s limiting functionality or by an Envoy sidecar proxy.
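On the JVM, the thread-pool variant can be as simple as this sketch (sizes are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StaticLimit {
    public static void main(String[] args) {
        // A fixed pool of 100 workers acts as a static concurrency limit of 100;
        // the tiny queue absorbs short bursts, everything else is rejected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                100, 100, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10),
                new ThreadPoolExecutor.AbortPolicy()); // shed via RejectedExecutionException
        pool.execute(() -> System.out.println("handling request"));
        pool.shutdown();
    }
}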

Dynamic

Here, the limit depends on a metric which is recalculated on a regular basis. Chances are high that for your service there is a correlation between being overloaded and growth in response time. If so, the metric can be a statistical function over response times, e.g. a percentile, the median or the average. Remember the computational equality property? It is key to more accurate calculations.

Then, define a predicate which answers whether the metric is healthy. For example, p99 ≥ 500ms is considered unhealthy, so the limit should be decreased. How the limit is increased and decreased should be decided by a feedback-control algorithm, like AIMD (which is used in TCP congestion control). Here’s pseudocode for it:

if healthy {
    limit = limit + increase;
} else {
    limit = limit * decreaseRatio; // 0 < decreaseRatio < 1.0
}

AIMD in action

As you can see, the limit grows slowly, probing whether the application is doing well, and drops steeply when faulty behavior is detected.
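Putting the pieces together, a dynamic limiter might recalculate the limit on a schedule, roughly like this sketch (readP99Millis is a hypothetical hook into your metrics registry):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DynamicLimit {
    private volatile int limit = 100;

    void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            boolean healthy = readP99Millis() < 500;      // the predicate from the example
            if (healthy) {
                limit = limit + 1;                        // additive increase: probe slowly
            } else {
                limit = Math.max(1, (int) (limit * 0.9)); // multiplicative decrease: back off fast
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    int current() {
        return limit;
    }

    private long readP99Millis() {
        return 400; // stand-in for a real metrics lookup
    }
}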

Netflix pioneered the idea of dynamic limits and open-sourced their solution; here’s the repo. It contains implementations of several feedback algorithms, a static limiter implementation, gRPC integration and Java servlet integration.