Microservices

One of the important aspects of building a system that involves a lot of microservices is the ability to heal from, or at least contain, failure: resilience.

How do you ensure resiliency and avoid cascading failures in microservices?

Let’s take an example.

Client calls Service A

Service A depends on Service B to satisfy the request

Service B can respond in one of three ways:

Responds fast – Success.

Responds with Connection Refused / Reset – Handled in code.

Responds slow – Timeouts, Retries.



Timeouts, Retries

Slow resources fail slowly.

The last situation, where the dependent service is slow, is the most interesting. Service A’s handler blocks on the slow resource. During that time the handler does nothing useful but still holds a worker, so requests pile up behind it and the slowness cascades to Service A’s own clients.

This can be solved in a couple of ways, both of which involve some global state that tracks how the dependency is performing.
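As a rough illustration of the timeout-and-retry handling mentioned above, here is a minimal Python sketch. The names `call_with_retries`, `timeout_s`, `attempts` and `backoff_s` are made up for this example, and it assumes the client for Service B accepts a timeout and raises `TimeoutError` when it expires.

```python
import time


def call_with_retries(call, timeout_s=0.5, attempts=3, backoff_s=0.1):
    """Run call(timeout_s), retrying on TimeoutError.

    `call` is a hypothetical client function for Service B that gives up
    and raises TimeoutError once `timeout_s` seconds have passed.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return call(timeout_s)
        except TimeoutError as exc:
            last_error = exc
            # Back off a little longer before each further attempt so a
            # struggling service is not hammered with immediate retries.
            if attempt < attempts - 1:
                time.sleep(backoff_s * (2 ** attempt))
    # Every attempt timed out: surface the failure to the caller.
    raise last_error


if __name__ == "__main__":
    def always_slow(timeout_s):
        # Stand-in for a dependency that never answers in time.
        raise TimeoutError(f"no response within {timeout_s}s")

    try:
        call_with_retries(always_slow, timeout_s=0.2, attempts=2, backoff_s=0.01)
    except TimeoutError as exc:
        print("gave up:", exc)
```

Note that the handler is still blocked for up to `attempts` times the timeout plus the back-off, so on their own timeouts and retries only bound the damage. They do not stop every request from paying that cost.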

Circuit Breaker: If we hit a timeout on a dependent resource more than once, it will probably fail on subsequent requests too. Instead of waiting on it again, we can mark it as dead and throw exceptions that are handled immediately.

Bulkheads: This treats access to a service like a connection pool. If access to Service B is restricted to 5 workers at a time, any further request fails immediately unless one of those 5 slots is free. Arriving at the number 5 requires a lot of monitoring insight. This works best when response times are expected to be long.
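To make the circuit breaker idea concrete, here is a minimal Python sketch. The class and parameter names (`CircuitBreaker`, `max_failures`, `reset_after_s`) are invented for illustration. It also assumes the breaker should allow a trial call after a cool-down period, which the description above does not spell out. It is not thread-safe, so a real handler pool would need a lock around the shared state.

```python
import time


class CircuitOpenError(Exception):
    """Raised immediately while the dependency is marked as dead."""


class CircuitBreaker:
    def __init__(self, max_failures=2, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # timeouts tolerated before opening
        self.reset_after_s = reset_after_s  # assumed cool-down before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast: do not touch the slow dependency at all.
                raise CircuitOpenError("dependency marked as dead")
            # Cool-down elapsed: let one trial call through. The failure
            # count is kept, so a single timeout re-opens the circuit.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success means the dependency is healthy again.
        self.failures = 0
        return result
```

Service A's handler would wrap each call to Service B in `breaker.call(...)` and treat `CircuitOpenError` like any other fast failure, for example by returning a fallback response.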

A bulkhead is an upright wall within the hull of a ship that limits a failure to a single compartment.

If water breaks through into one compartment, the bulkhead prevents it from flowing into the others. This prevents cascading failures and the entire ship capsizing.

The Titanic is a very well-known example of what happens when you don’t have proper isolation and failures cascade.
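In code, a bulkhead for a dependency such as Service B can be sketched with a semaphore that caps concurrent calls and rejects the overflow immediately. This is a minimal Python sketch. The names `Bulkhead`, `BulkheadFullError` and `max_concurrent` are hypothetical, and the limit of 5 is simply the number used in the example above.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when every slot for the dependency is already in use."""


class Bulkhead:
    def __init__(self, max_concurrent=5):
        # One slot per worker that is allowed to talk to the dependency.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: if no slot is free, fail right away
        # instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            # Always give the slot back, even if the call raised.
            self._slots.release()
```

With one `Bulkhead` instance per dependency, a slow Service B can tie up at most `max_concurrent` of Service A's workers, and the rest stay free to serve requests that do not need Service B.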

Implementation

Some great libraries that help with the actual instrumentation are