“The most certain way to succeed is always to try just one more time.” – Thomas Edison

If you’re following the reactive hype, you have definitely heard about “resiliency” or “fault tolerance”. Many authors use the terms interchangeably and, according to the Reactive Manifesto, reactive systems should be resilient:

The system stays responsive in the face of failure. This applies not only to highly-available, mission critical systems — any system that is not resilient will be unresponsive after a failure. Resilience is achieved by replication, containment, isolation and delegation. Failures are contained within each component, isolating components from each other and thereby ensuring that parts of the system can fail and recover without compromising the system as a whole. Recovery of each component is delegated to another (external) component and high-availability is ensured by replication where necessary. The client of a component is not burdened with handling its failures.

This is very abstract, but the last two sentences are quite concrete:

Recovery of each component is delegated to another (external) component;

The client of a component is not burdened with handling its failures.

A simple, down-to-earth situation at the level of a single component: if you’re starting to build an application these days, it’s very likely that you are going to use the cloud. It is also likely that you will need to persist some data. If you use the cloud, the data will very likely need to travel across the network in order to get stored, and from time to time the network will fail. Some failures are very short-lived, and it’s reasonable to assume that if you retry, the operation will succeed.

To be even more concrete, let’s say that we’ve got a form which collects some data and sends it to an endpoint, and the endpoint needs to store the data on S3. So, you start with Play and within a fraction of an hour you’ve got everything working:
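A minimal sketch of such an endpoint, with plain Scala futures standing in for Play and the AWS SDK. The names `storeOnS3` and `handleForm` are hypothetical and not taken from the real application; the S3 call is simulated and always fails, the way a flaky network occasionally would:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object NaiveUpload {
  // Hypothetical stand-in for the S3 call: fails like a dropped connection.
  def storeOnS3(bytes: Array[Byte]): Future[Unit] =
    Future.failed(new java.io.IOException("connection reset"))

  // Hypothetical stand-in for the controller action: success becomes a response.
  def handleForm(bytes: Array[Byte]): Future[String] =
    storeOnS3(bytes).map(_ => "201 Created")

  // The failure passes through `map` untouched and reaches the caller.
  def outcome(): Try[String] =
    Try(Await.result(handleForm(Array[Byte](1, 2, 3)), 1.second))
}
```

Nothing in `handleForm` looks at the failure, so a single network hiccup is enough to turn the whole request into an error.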

If the Future on the right-hand side fails, the response will be a failure, which becomes an error sent back to the client. Not very nice, especially if the failure is easy to recover from. Let’s apply the Reactive Manifesto to our scenario:

Recovery of e̵a̵c̵h̵ ̵c̵o̵m̵p̵o̵n̵e̵n̵t̵ AWS client is delegated to another (external) component;

The c̶l̶i̶e̶n̶t̶ ̶o̶f̶ ̶a̶ ̶c̶o̶m̶p̶o̶n̶e̶n̶t̶ FormController is not burdened with handling i̶t̶s̶ ̶f̶a̶i̶l̶u̶r̶e̶s̶ failures of AWS client.

And that implies you need to write some fancy error handler with a retry mechanism, right? Not really. If you use Akka, you already have one; moreover, you can choose from at least two options.

Option 1: Akka Streams

Pros: fits nicely into both actor-based and traditional models;

Cons: steep learning curve of Akka Streams.

It’s actually very simple if you’re familiar with the basic concepts of Akka Streams. A stream can be configured with recoverWithRetries() to make a limited number of recovery attempts when it fails. The only challenge is to convert futures into streams, and then back to a future.
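A minimal sketch of that conversion, assuming Akka 2.6 or later (where `Source.lazyFuture` is available and an implicit `ActorSystem` is enough to run a stream). The helper name `withRetries` is made up for illustration; each recovery attempt simply switches to a fresh single-element source that runs the operation again:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future
import scala.util.control.NonFatal

object StreamRetry {
  // Runs `op`; on a non-fatal failure runs it again, at most `retries` more times.
  def withRetries[T](retries: Int)(op: () => Future[T])(implicit system: ActorSystem): Future[T] = {
    // Future -> stream: `op` is only invoked once the stream asks for the element.
    def attempt = Source.lazyFuture(op)
    attempt
      .recoverWithRetries(retries, { case NonFatal(_) => attempt })
      .runWith(Sink.head) // stream -> future: the first (and only) element
  }
}
```

The partial function decides which failures are worth another attempt. Narrowing the pattern to specific exception types would let a misconfiguration error fail the future immediately instead of being retried.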

Option 2: Kamikaze pattern

Pros: integrates well into actor-based applications, can be easily customized;

Cons: a bit verbose, requires good understanding of Akka mechanics.

Here is the deal: in Akka each actor has a parent, which knows what to do if there is an exception in the child. That knowledge is called a supervision strategy, and it can be easily overridden.

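import akka.actor.OneForOneStrategy
import akka.actor.SupervisorStrategy.{Escalate, Restart}

val supervisionStrategy = OneForOneStrategy(maxNrOfRetries = 10) {
  case _: BucketDoesNotExistException => Escalate // misconfiguration: a retry will not help
  case _: AmazonS3Exception => Restart            // restart the child, i.e. try again
  case _: SdkClientException => Restart
  case _: Exception => Escalate
}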

Possible reactions to an unhandled exception are:

Escalate (the default when the strategy does not match the exception): the exception is rethrown so that the supervisor fails with the same exception as the child;

Resume: the failed actor keeps working as if nothing had happened;

Stop: speaks for itself;

Restart: discards the old Actor instance and replaces it with a new one, then resumes message processing.

The idea is to write an actor in such a way that it:

automatically starts doing the job when started;

sends a report to the parent and terminates when the job is successfully done.

The parent needs to death watch the executor actor and await a message with the result. If the executor stops before sending the result, the operation could not be completed.

Now, if you need to store a file on S3 from an actor, you can create an S3OperationActor and implement the death watch / success protocol. If not, and you’d like to be able to use ask, or you plan to use the Kamikaze pattern more often, it’s fairly straightforward to create a generic actor which will run the protocol for you.
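The protocol can be sketched with classic Akka actors roughly as follows. This is an illustration only, not the actual KamikazeManagerActor: `Executor`, `Manager`, `Run`, `Done` and `retryIo` are hypothetical names, and the example strategy uses Stop rather than Escalate for errors that should not be retried, so that the watching manager sees the child terminate and can reply right away.

```scala
import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, Status, SupervisorStrategy, Terminated}
import akka.actor.SupervisorStrategy.{Restart, Stop}

object Kamikaze {
  case object Attack                  // client -> manager: start the job
  private case object Run             // executor -> itself: make one attempt
  final case class Done(result: Any)  // executor -> manager: success report

  // Executor: starts working as soon as it is (re)started, reports, then stops.
  class Executor(job: () => Any) extends Actor {
    // preStart also runs after every restart, so each new instance tries again.
    override def preStart(): Unit = self ! Run
    def receive: Receive = {
      case Run =>
        context.parent ! Done(job())  // an exception here is handed to the supervisor
        context.stop(self)            // job done: terminate
    }
  }

  // Manager: supervises and death watches one executor on behalf of a client.
  class Manager(executor: Props, strategy: SupervisorStrategy) extends Actor {
    override def supervisorStrategy: SupervisorStrategy = strategy
    def receive: Receive = {
      case Attack =>
        val child = context.watch(context.actorOf(executor))
        context.become(waiting(sender(), child))
    }
    private def waiting(client: ActorRef, child: ActorRef): Receive = {
      case Done(result) =>            // success arrives before the child's Terminated
        client ! result
        context.stop(self)
      case Terminated(`child`) =>     // the child died without reporting a result
        client ! Status.Failure(new IllegalStateException("executor stopped without a result"))
        context.stop(self)
    }
  }

  def executorProps(job: () => Any): Props = Props(new Executor(job))
  def managerProps(executor: Props, strategy: SupervisorStrategy): Props =
    Props(new Manager(executor, strategy))

  // Example strategy: retry I/O errors up to 5 times, give up on anything else.
  val retryIo: SupervisorStrategy = OneForOneStrategy(maxNrOfRetries = 5) {
    case _: java.io.IOException => Restart
    case _                      => Stop
  }
}
```

The retry counting lives entirely in the supervision strategy: once the limit is exceeded the supervisor stops the child, the manager receives Terminated, and a caller using ask gets a failed future.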

From here it becomes really easy to store an S3 file with retries from any place of our app:

import akka.pattern.ask
import scala.concurrent.duration._

val kamiManagerProps = KamikazeManagerActor.props(
  S3KamikazeActor.props(bucketName, S3Create(key, bytes)),
  S3KamikazeActor.supervisionStrategy
)

// Normally this should be shorter than the one you've used in the
// strategy.
implicit val timeout = akka.util.Timeout(3.seconds)
val fireStoredFuture = childActorOf(kamiManagerProps) ? Attack

Conclusion

We’ve considered two different approaches to achieving fault tolerance on the level of a single operation, which is still very important, because large problems are very often just clusters of smaller ones.

The better the recovery capabilities against circumstances beyond our control that we put into the application, the less time will be spent on filing bug reports, reading logs, scratching heads, and calling support. It’s not a hard thing to do using the right tools and patterns.

Of course, it doesn’t mean that you need to retry every failed action. In the examples above, a problem caused by a missing bucket will cause immediate escalation of the error and the future will fail; it doesn’t make sense to retry if the component is misconfigured.

We’ve considered two different ways of retrying an operation that may fail, without actually writing a single line of code that tracks recovery attempts. Both approaches require an understanding of the underlying technologies, Actors or Streams; it’s important to have that before you put such a thing into production. At the same time, I believe that studying the examples and reading the appropriate docs should give you a good base even if you haven’t done much with Akka yet.