This article has been sitting in my drafts for a long time. I gave it as a talk roughly a year ago at Rootconf, but it got buried under other things.

If you’re wondering: why microservices? Does everyone need microservices? It makes a lot of sense for a big company not to deploy everything as a single monolithic application (imagine the horrors). However, that is not a question I aim to answer in this post. For further reading, I think this is a very good presentation that might give some insight into the topic.

Nevertheless, you don’t need to already be on a microservices model to read further; you might be able to use some of this model in your application right away! The model used throughout this post is a world-facing application that works with a set of microservices. We thus have two components in our system: the main application, which serves the customer, and the dependencies that the application consumes.

Our guiding principles

We will start with a simple principle that we will focus upon:

Always design for failures

Always assume that one or all of your microservices WILL go down. We aim to do the best for our customer when such a scenario arises. Let’s list some guiding rules for our principle:

Don’t fail if your microservice goes down: We want to serve our customers even if a microservice is unavailable.

Application should not wait forever for a microservice: Customers are impatient; don’t make them wait.

Contain and isolate failures: Limit the number of requests affected by any failure, and isolate the failures of microservices from each other.

Respect the service when it is slow: A service being down isn’t someone else’s problem, it’s your company’s; we want to make sure we don’t make things worse for the service.

Fail fast – recover fast: Our application should fail with as few requests as possible, and recover as soon as the failing service is back up.

If you think about it, the first two points in the list above are fairly straightforward to solve: simply add timeouts to your service calls and you have taken care of them.
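As a quick sketch of that idea in Python (the service call and its 2-second delay are simulated here for illustration, not a real dependency), a call that would otherwise hang can be bounded with a timeout and a fallback:

```python
import concurrent.futures
import time

def call_service_a():
    # Hypothetical remote call; we simulate a dependency that has become slow.
    time.sleep(2)
    return {"recommendations": ["x", "y"]}

FALLBACK = {"recommendations": []}  # what we serve when the service can't answer in time

def fetch_with_timeout(pool, fn, timeout_s, fallback):
    """Submit the call and return the fallback if it doesn't answer in time."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
result = fetch_with_timeout(pool, call_service_a, 1.0, FALLBACK)
# The customer gets the fallback after ~1 second instead of hanging for 2.
```

This covers the first two rules: the customer always gets an answer, and never waits longer than the timeout.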

A simple example with just timeouts

Let’s quickly look at what happens if we don’t design our application for failures. We build a simple multithreaded application that makes requests to our remote services. We have an HTTP handler in our application and a set of application threads that run our application code and asynchronously call remote services A and B.



We will pick some numbers for our application.

Number of threads: 100
Average response time from each service: 100ms

Failing service A



Think about the happy case first. Since both services are called concurrently, we can process 50 requests every 100ms. The requests per second served by our application is as high as 50 / 0.1 = 500.

Consider the (likely) scenario where Service A has trouble responding to our requests. We were smart enough to add 1s timeouts to Service A requests.

Now, under stressed conditions, every remote call to Service A takes 1 second before timing out. We can ignore the time taken by Service B in our example. So we have 100 threads occupied for 1 second each, and our RPS quickly drops to 100 / 1 = 100, which is 1/5 of our original RPS.
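These numbers can be sanity-checked in a few lines of Python, following the post’s own accounting: 50 requests in flight in the happy case, and all 100 threads stuck for the full timeout in the failure case.

```python
threads = 100
happy_latency_s = 0.1   # both services answer within 100 ms
timeout_s = 1.0         # Service A hangs until the 1 s timeout fires

# Happy case: 50 requests in flight at any moment, each finishing in 100 ms.
happy_rps = 50 / happy_latency_s

# Failure case: all 100 threads are tied up for the full 1 s timeout.
degraded_rps = threads / timeout_s

print(happy_rps, degraded_rps)  # throughput falls from 500 to 100 requests/s
```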

Going back to our rules above:

We didn’t fail when a microservice was unavailable.

We didn’t wait forever; the customer was given a fallback after 1 second.

We were unable to contain and isolate failures: Each and every request was affected, so we weren’t able to contain the effects. Service B, which was operating normally, couldn’t be served to the customer, which means we were unsuccessful in isolating the failures.

We didn’t help Service A recover: We kept making as many requests as possible to the service, making things worse by adding more and more backpressure.

We failed fast, but we wouldn’t recover fast: When Service A comes back up, it would first have to process the whole backlog of requests before the application recovers completely.

Overall, we did not design for failures. Every customer was affected, and a large majority started seeing their requests rejected.

Circuit breaker mechanism

Next, we introduce the concept of a circuit breaker and how to implement it in an uncomplicated way. As the name suggests, we build an analogy to an electrical circuit breaker: we stop sending requests to a service when it is unavailable. We go one step better than an electrical breaker and close the circuit again whenever the service becomes available.



The diagram above should give you an idea of what is happening. When everything is good, all requests are served through our circuit breaker setup. When Service A is unavailable, the circuit opens and starts rejecting requests without even sending them to the remote service. Both our application and the remote service benefit from this.

Implementation of a simple circuit breaker

In the most trivial implementation, we simply create a thread pool per service to isolate each of these remote service calls.



We again pick numbers for our thread pools. If we need to serve 500 RPS at an average latency of 100ms, each service needs a thread pool of size 50.
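That pool size follows from Little’s law (concurrency L = arrival rate λ × average residence time W); a quick sanity check with the numbers above:

```python
target_rps = 500        # arrival rate (lambda)
avg_latency_s = 0.1     # residence time (W) of each service call
pool_size = target_rps * avg_latency_s  # L = lambda * W
print(pool_size)  # 50.0 threads for each service's pool
```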



Let’s again repeat the case where Service A goes down. What happens? Again, with our smart 1s timeouts, the thread pool for Service A fills up with blocked requests. This time, though, the application is unaffected and continues to run as before. Every second, 50 requests (the size of the Service A thread pool) get blocked, delaying those customers by 1 second. The remaining 450 requests simply skip calling Service A, and the application keeps running.
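A minimal sketch of such a per-service pool in Python (the `BulkheadClient` name, sizes, and fallback shape are mine for illustration, not from a real library): a bounded pool plus a non-blocking slot counter, so that when the pool for a failing service is saturated, further callers get the fallback immediately instead of queueing.

```python
import concurrent.futures
import threading

class BulkheadClient:
    """One instance per remote service: isolates that service's failures to its own pool."""

    def __init__(self, max_workers, max_queue):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        # One slot per running-or-queued call, acquired without blocking below.
        self._slots = threading.Semaphore(max_workers + max_queue)

    def call(self, fn, timeout_s, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback  # pool saturated: reject immediately, don't even queue
        future = self._pool.submit(self._run, fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return fallback  # slow call: customer gets the fallback after timeout_s

    def _run(self, fn):
        # The slot is released only when fn really finishes, because the worker
        # thread stays busy even after the caller has timed out and moved on.
        try:
            return fn()
        finally:
            self._slots.release()
```

With a pool of 50 for Service A and 1-second timeouts, at most 50 in-flight requests are delayed at any moment; everyone else falls back instantly, matching the numbers above.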

Let’s think again about our rules:

We didn’t fail when a microservice was unavailable.

We didn’t wait forever: 10% of customers were given a fallback after 1 second, while the other 90% got a fallback immediately.

We contained and isolated the failures of Service A: Only 10% of the requests were affected, and the application wasn’t affected in serving content.

We reduced the load on Service A, and possibly helped it recover.

We failed fast and recovered immediately: When Service A comes back up, our thread pool with just 50 requests empties immediately, and we recover.

We designed for failure scenarios and gave a good experience to our customers without adding load to any of our systems.

Some more details about a real-world implementation

Each of these separate thread pools should have a fixed queue size (note that an infinite queue is the path to doom). Use Little’s law to tune the number of threads. Queue size is trickier and might need some trial and error, but unless you scale it too high (10x the thread pool size) or too low (10% of the thread pool size), it will most probably not give you any trouble.

We also like to add retries for timed-out service calls. This improves the customer experience. Assuming our timeouts are higher than our P99 latencies, we are retrying only about 1% of our requests, which means little added load on our dependencies.

Respond to the customer even faster, or reduce the number of affected customers. A simple way to do that is to make the service call timeouts higher (say 3 seconds in the above example) and the application-level timeouts 1 second. This way, the service call thread pools fill up even faster without making a dent in the customer experience, with the added benefit that fewer customers are affected.

Testing this setup is important. One easy way to simulate failures is to use iptables to drop incoming packets from the service you want to test failures with (or tc netem to add latency to them).

Monitor your setup for the number of retries and rejections. The number of retries tells us the additional load placed on the service; the number of rejections helps us understand the exact customer experience.
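The retry-on-timeout idea from the notes above can be sketched like this (the function name and shape are mine): each attempt gets its own timeout, and after a bounded number of retries we fall back.

```python
import concurrent.futures

def call_with_retry(pool, fn, timeout_s, fallback, retries=1):
    """Try the call up to retries+1 times, then fall back.
    With timeouts above the P99 latency, only ~1% of calls should ever retry."""
    for _ in range(retries + 1):
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            continue  # count this in your retry metric; it is extra load on the service
    return fallback   # count this in your rejection metric; the customer saw degradation
```

The two commented branches are exactly where the retry and rejection counters from the monitoring point above would hook in.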

References:

– Netflix’s Hystrix implementation using Semaphores is something everybody should read about

– Martin Fowler’s introduction to Circuit breakers

Credits: Alex Koturanov, my mentor at Amazon, who always guided me with his depth of knowledge. We worked closely together on putting together a fault-tolerant model.
