Microservices architecture brings a few challenges that were not present in monolithic systems. One of them is that you can no longer rely on other parts of your distributed system being available (for some ideas on why that happens, refer to the 8 fallacies of distributed computing). Does that mean you should not use microservices in a well-justified case? Not at all. But good advice is to design for failure from the start.

Let’s first discuss key practices to make your microservice resilient to failures of other parts of the system:

use timeouts for external dependencies – you don’t want to wait forever and destroy your service latency (a.k.a. fail fast). Typically you don’t want to rely only on timeouts in third-party libraries (such as HTTP clients), because you cannot easily reason about their real behaviour in case of failures (is it a read timeout or a connection timeout, are there any retries, is there any other processing that can fail outside the connection/read timeouts, etc.)

use bounded queues for external dependencies – this way you don’t eat up all the resources, both yours and those of the target service, in case it is not responding timely or correctly

if an external dependency is not responding correctly, stop calling it and give it time and capacity to recover
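The first two practices can be sketched with plain `java.util.concurrent` primitives. This is a hypothetical, JDK-only illustration (class and method names are made up, not Hystrix APIs): a small bounded pool gives back-pressure, and a timed `Future.get` gives the fail-fast timeout.

```java
import java.util.concurrent.*;

public class BoundedTimedCall {
    // Hypothetical sketch: a small, bounded pool dedicated to one dependency.
    // Daemon threads so the JVM can exit; AbortPolicy rejects instead of blocking.
    private static final ExecutorService POOL = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(8),           // bounded queue: back-pressure
            runnable -> {
                Thread t = new Thread(runnable);
                t.setDaemon(true);
                return t;
            },
            new ThreadPoolExecutor.AbortPolicy()); // full queue => fail fast

    static String callWithTimeout(Callable<String> remoteCall, long timeoutMs) throws Exception {
        Future<String> future = POOL.submit(remoteCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS); // fail fast on slow calls
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so the thread is freed
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithTimeout(() -> "ok", 200));
        try {
            callWithTimeout(() -> { Thread.sleep(1000); return "late"; }, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

Hystrix implements the same ideas (plus metrics and circuit breaking) so you don’t have to hand-roll this for every dependency.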

Hard to accomplish in your tiny microservice’s code? Fortunately, there is a library that comes to the rescue – Netflix’s Hystrix. There are already a few Hystrix Hello World tutorials on the web, so instead of jumping straight to code I would like to walk through a real-life example and discuss some of our production experiences.

Our Use Case – User Activity Feed

Throughout our application we display user activity streams – these represent user actions in different domains of our system – e.g. user X won a deal, user Y created a task, user Z created a contact, etc. Each such subdomain is managed by a different microservice, so in order to get the data for a page of the activity stream, several services need to be asked for input. A single service called Feeder acts as a gateway and is responsible for fetching this data from the target services.

What happens when an operation on a single dependency starts taking longer? The entire Feeder request would wait until the slowest dependency completes. That would noticeably increase user interface load times, which is obviously something we want to avoid.
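The problem is easy to reproduce with a naive scatter-gather. This hypothetical JDK-only sketch (the service names and delays are invented) fans out to three "services" and shows that total latency is dominated by the slowest one:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class NaiveFanOut {
    // Simulated remote call: just sleeps for the given time.
    static String slowService(long delayMs, String name) {
        try { Thread.sleep(delayMs); } catch (InterruptedException e) { }
        return name;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        List<CompletableFuture<String>> calls = List.of(
                CompletableFuture.supplyAsync(() -> slowService(50, "deals")),
                CompletableFuture.supplyAsync(() -> slowService(50, "tasks")),
                CompletableFuture.supplyAsync(() -> slowService(500, "contacts"))); // the laggard
        // Without timeouts, the gather step waits for every call, so the
        // whole request takes ~500 ms even though two calls finished in 50 ms.
        CompletableFuture.allOf(calls.toArray(new CompletableFuture[0])).join();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("elapsed ~" + elapsed + " ms, dominated by the slowest call");
    }
}
```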

Timeouts

The first solution that comes to mind is to introduce timeouts. Hystrix allows us to set a timeout for the entire command thread that connects to the remote service and fetches the data. But even with a timeout, something dangerous may happen as Feeder receives more requests.

The effect is that threads connecting to a slow downstream service pile up. A similar issue occurs on the target dependency as well. Under high load, even with a timeout set, we end up with hundreds of threads accumulating on both the client and the service endpoints within minutes or even seconds. Such an issue can propagate to other services, and our system may quickly stop responding, causing downtime for customers. This is a pretty sad place to be.

Bounded Pools

To fix this, we need to throttle incoming traffic once it exceeds a certain rate. This can be achieved by limiting the number of threads used to connect to an external dependency. Hystrix allows you to configure the size of the thread pool associated with every dependency. If your service connects to 10 downstream services, you will need 10 thread pools, each with its own limit. Once the limit is reached in one of them, no further requests are accepted in that group, and in effect your code stops calling that single misbehaving downstream service. All other groups operate independently, still allowing connections to the other services. Assuming we have 10 external services and 10 threads available for each downstream dependency, we have a safe maximum of 100 threads allocated on Feeder to handle external dependencies.

Now a single service failure does not cascade to other services. When the thread pool is exhausted, Feeder fails fast without sacrificing latency or resources. The bounded thread pool also gives downstream dependencies some breathing room. But Hystrix gives us even more to facilitate system recovery in case of failures.

Circuit Breakers

In case of failures coming from an external dependency (not only timeouts but also connection errors, 500 errors, etc.), our service keeps calling the external service only as long as the success rate stays above a specific level – typically 50%. When the success rate drops below that threshold, Hystrix opens a circuit and as a result blocks any further calls. The client service fails fast in such a scenario without even trying to call the external service. Again – only the single misbehaving dependency is switched off, and data from all other services remains available in the Feeder output. After a configured time (5 seconds by default), Hystrix gently starts probing the external service to check whether it is coming back into operation. When the success rate is high enough, full load is directed to the service again.
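The state machine behind this is small enough to sketch. The following is a hypothetical, deliberately simplified breaker (real Hystrix uses a rolling statistical window of success rates; this sketch just counts consecutive failures to keep the closed → open → half-open transitions visible):

```java
import java.util.concurrent.Callable;

public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private final int failureThreshold;
    private final long sleepWindowMs; // how long to stay open before probing

    public SimpleCircuitBreaker(int failureThreshold, long sleepWindowMs) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMs = sleepWindowMs;
    }

    public synchronized <T> T call(Callable<T> action) throws Exception {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < sleepWindowMs) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            state = State.HALF_OPEN; // sleep window elapsed: allow one probe
        }
        try {
            T result = action.call();
            consecutiveFailures = 0;
            state = State.CLOSED;    // probe (or normal call) succeeded
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;  // trip the breaker
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }

    public synchronized State state() { return state; }
}
```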

Implementation

All this was possible with just a few lines of code:

public static class InstrumentedHystrixCommand<T> extends HystrixCommand<T> {

    private Supplier<T> supplier;

    public InstrumentedHystrixCommand(HystrixCommand.Setter config, Supplier<T> supplier) {
        super(config);
        this.supplier = supplier;
    }

    @Trace(dispatcher = true)
    @Override
    protected T run() throws Exception {
        NewRelic.setTransactionName("hystrix", getCommandKey().name());
        return supplier.get();
    }
}

Supplier<T> encapsulates a risky piece of work, such as a remote call, that needs to be executed in the safe and isolated environment of a Hystrix command. We took the approach of wrapping every external call in our API client layer within Hystrix commands, so that all of our remote calls are safeguarded. New Relic’s @Trace and setTransactionName are just instrumentation code to hook into APM.

With recent Spring Cloud additions, you can put a single @HystrixCommand annotation on a risky method inside your bean and enjoy safeguarded execution.

Great, but what does it really mean for our users?

A few Words on Fallbacks

Hystrix lets you implement a fallback method to stub a response in case of a failure – it can be a predefined value, a cached entry, etc. We decided that we can skip an item in the activity feed if it is not available in time, in favour of keeping latency low and the system recoverable. So we want to provide an empty fallback. If you are using HystrixObservableCommand you can return rx.Observable.empty() . If you are using the classic HystrixCommand you cannot simply return null though, as that would generate an Observable with a single value that is null. When we wrote the code, HystrixObservableCommand was still in the RC phase, so we decided to use HystrixCommand . The solution that worked best was to skip the fallback entirely, let HystrixRuntimeException be thrown, and handle it in the calling code as shown in the next code snippet.

Plays Nice with RxJava

HystrixCommand returns an RxJava Observable , which in our case represents a stream of activity events. The caller can consume the Observable stream either asynchronously, by submitting a subscriber, or synchronously, by blocking execution until data is available.

Observable<Feed> feeds = Observable.merge(
    services.stream()
        .map(service -> service.newRequest().user(user).all().observe())
        .map(feedObservable -> feedObservable
            .doOnError(throwable -> log.error(throwable.getMessage(), throwable))
            .onErrorResumeNext(Observable.empty()))
        .collect(Collectors.toList()));

The code above calls every external service, stubs an empty Observable if any of them fails, collects all the Observables into a list, and finally merges those streams together. At the end you get a single Observable stream of different activity feeds you can consume.

An additional side effect of using Hystrix is that it manages concurrency for us. Every service call wrapped in a HystrixCommand is executed on a separate “hystrix” thread:

2015-06-12 14:51:00,254Z INFO [hystrix-AppointmentsFeedService:all-9] c.g.services.http.Client: Sending HTTP GET
2015-06-12 14:51:00,256Z INFO [hystrix-CrmFeedService:all-9] c.g.services.http.Client: Sending HTTP GET
2015-06-12 14:51:00,257Z INFO [hystrix-CommonFeedService:all-9] c.g.services.http.Client: Sending HTTP GET
2015-06-12 14:51:00,259Z INFO [hystrix-CollaborationsFeedService:all-9] c.g.services.http.Client: Sending HTTP GET

Cool! So you wrap your risky calls inside Hystrix commands and deploy your application to production. Are we done? Not yet. You now need to monitor your deployment and fine-tune the configuration.

Monitoring

Hystrix provides a rich set of metrics and a nice integration mechanism to report them: HystrixPlugins.registerMetricsPublisher(HystrixMetricsPublisher impl) . If you are using Dropwizard Metrics, you can use one of the available publisher implementations with no effort.

The most interesting metrics are command latency percentiles, the percentage of errors, and the number of open circuits over time. You need to understand this data thoroughly in order to tweak your configuration.

Important Configuration Items

First of all, you need to know the expected latency of your system. Let’s assume you decided that 99% of responses should complete under 1 second, which would be sufficient for your user interface to load. Then you need to determine the throughput during peak hours – e.g. 10 requests per second.

This means you need a capacity of at least 10 threads in a single dependency thread pool ( hystrix.threadpool.default.coreSize ) to make 99% of calls within the expected latency. Typically you will add some buffer here to accommodate slower calls as well. However, most of the latent calls will need to be cut off by the timeout ( hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds ). Set this timeout carefully – setting it at the 99th-percentile latency gives you strict latency guarantees but reduces the number of complete responses when the infrastructure is under higher stress. Too high a value is more permissive but requires bigger thread pools. For our Feeder use case, setting it 2–3 times higher than the 99th-percentile latency and using bigger pools does the job.
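The arithmetic behind that sizing is Little’s law: required concurrency ≈ peak throughput × latency. A hypothetical helper (the method name and headroom factor are our own, not a Hystrix API) makes the calculation explicit:

```java
public class PoolSizing {
    // Little's law: threads needed ≈ peak requests/second × p99 latency in
    // seconds, multiplied by a headroom factor for occasional slower calls.
    static int poolSize(double peakRps, double p99LatencySeconds, double headroomFactor) {
        return (int) Math.ceil(peakRps * p99LatencySeconds * headroomFactor);
    }

    public static void main(String[] args) {
        // 10 req/s at 1 s p99 latency with 20% headroom -> 12 threads
        System.out.println(poolSize(10, 1.0, 1.2));
    }
}
```

The result would then feed hystrix.threadpool.default.coreSize for that dependency.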

What we also found pretty useful was to set this timeout to a slightly lower value than the HTTP client’s connection and read timeouts. The reason is that timeouts then show up under the standard Hystrix metrics rather than as uncategorized failures.

Whatever you tune, it is important to keep observing your environment and make sure you maintain a good balance between latency and the number of incomplete responses.

Any Bad Parts?

Not many. We found stack traces pretty hard to analyze at first. We also missed some notion of zone awareness – Hystrix may still forward traffic to a zone even if it is completely unresponsive, as long as the other zones are operating correctly and the total success rate is high enough. You would need a second, zone-aware level of load balancing to fix that.

Other than that, we found Hystrix a perfect match. In case of more serious issues, we see circuits open in the Feeder service, feel safe that those issues are not propagated, and avoid overloading downstream services. We can sleep slightly better knowing this.

Conclusion

Microservices are fun, but to do them right you need to know your tools, understand your data and expect the unexpected.