Some time ago my friend wrote an article about Akka Cluster anti-patterns. It’s time for a counterweight post. I will try to answer the title question from a slightly different angle than the official documentation does. Let’s look at some use cases where Akka Cluster is a perfect solution. Maybe “perfect” is not the best word. I should probably say “recommended” solution.

First, I want to highlight that you should do whatever you can to avoid using any clustered system. Why? A cluster is quite costly to run and maintain. Here are some basic drawbacks of all cluster systems:

Seed nodes setup. A new application instance cannot simply be launched and added to the load balancer; it must first register itself in the cluster.

Split brain. Very nasty things can happen to your data (and your cluster) in case of a split brain. Always keep this in mind when using clustered solutions.

Resources. The minimum number of instances for a cluster system to achieve consensus is 3 (or 1, but then it is not actually a cluster). If you have 10 microservices, you need to launch at least 30 instances. At 2 CPU cores and 4 GB of RAM per instance, you end up with 60 cores and 120 GB of RAM, not only on production but also on dev/test/staging environments. I’m aware that you can minimize resource requirements in many different ways, but make sure that you run a cluster not only on production but also on the other environments. This way you will avoid a lot of unpleasant surprises.

Other possible drawbacks, you name it.
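To make the resource item above concrete, here is a minimal sketch of the arithmetic in plain Java. The class and method names are made up for illustration, and the quorum rule assumes a simple majority, which is a common choice but not the only possible one.

```java
// Back-of-the-envelope sizing for the example above. All names are illustrative.
final class ClusterSizing {

    // Smallest majority of n nodes: more than half of them must agree.
    static int quorum(int nodes) {
        return nodes / 2 + 1;
    }

    // How many nodes can be lost while a majority is still reachable.
    static int toleratedFailures(int nodes) {
        return nodes - quorum(nodes);
    }

    // Total amount of a resource (cores, GB of RAM) across all instances.
    static int total(int services, int instancesPerService, int perInstance) {
        return services * instancesPerService * perInstance;
    }

    public static void main(String[] args) {
        for (int nodes = 1; nodes <= 5; nodes++) {
            System.out.println(nodes + " node(s): quorum=" + quorum(nodes)
                    + ", tolerated failures=" + toleratedFailures(nodes));
        }
        System.out.println("cores  = " + total(10, 3, 2));
        System.out.println("ram GB = " + total(10, 3, 4));
    }
}
```

With 2 nodes the majority is still 2, so losing either node stops the cluster from reaching consensus. That is why 3 is the practical minimum.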

I could elaborate more, but problems with clusters are not the topic of this article. Let’s go back to the main plot.

Work distribution?

Some projects use Akka Cluster as a work distribution mechanism. You can launch “master” nodes and “worker” nodes, and parallelize some tasks across all worker nodes within the cluster. This use case is perfectly valid for Akka Cluster, but the question is: can I achieve the same thing cheaper or faster? I believe so. You just need to take advantage of a message broker and distribute the work across parallel consumers. Right now, the default choice for a highly scalable and performant message queue is, of course, Apache Kafka. Kafka and its ecosystem are the topic of plenty of our blog posts. We have successfully delivered a number of projects with Kafka on board. Work distribution is only one of the many Kafka use cases.
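The broker-based approach can be sketched without any broker at all. The snippet below is not the Kafka client API. It is a plain Java illustration, with hypothetical names, of the two ideas that make it work: a task key always maps to the same partition, and every partition is owned by exactly one consumer. Kafka’s real partitioner and group assignment strategies differ in the details.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy model of key-based partitioning and partition ownership.
final class WorkDistribution {

    // The same key always lands in the same partition.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Spread partitions over consumers round-robin; each partition gets one owner.
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        int partitions = 6;
        System.out.println("2 consumers: " + assign(partitions, 2));
        System.out.println("3 consumers: " + assign(partitions, 3));
        System.out.println("order-42 -> partition " + partitionFor("order-42", partitions));
    }
}
```

Going from 2 to 3 consumers only changes who owns which partition. Producers keep writing to the same partitions, which is why adding consumers is cheap.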

A careful reader might say that I have just swapped one cluster solution for another. That’s true. Now the question is: which one is easier to set up and run? Unfortunately, I cannot compare or measure this in an objective way. I can only summarize my personal experience, which, in this case, is in favor of a Kafka cluster. To operate a Kafka cluster, you will need to read and understand quite a lot of config params. With Akka before version 2.6.6, you had to provide your own (or some open source) Split Brain Resolver. In that release the official Lightbend Split Brain Resolver was merged into Akka, so you no longer have to worry about providing one. Remember, though, that a split brain is still possible.

Once you have set up a Kafka cluster, launching new consumer instances is a piece of cake, whereas adding a new node to an Akka Cluster requires some sort of service discovery. The next thing worth mentioning is that a single Kafka cluster can be used for many different business contexts. In Akka Cluster this would be an anti-pattern. It’s recommended to create several smaller clusters, preferably one per context. Fun fact: Akka Clusters very often use a Kafka cluster as a communication channel between them.

In a nutshell, if you’re already using Akka Cluster and you want to parallelize some work within the same context, then yes, Akka Cluster will do the work for you. Otherwise, take some time to analyze other solutions. Apache Kafka is only one example. Depending on your scale, you might use something much simpler than Akka or Kafka.

Scaling reads?

We want to easily scale our application. Sounds like another legitimate reason to use Akka Cluster. OK, but first let’s answer this: what exactly do we want to scale? Which endpoints are problematic? Where are the actual bottlenecks? Make sure that you have a good monitoring infrastructure that gives you data, not guesses.

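Most applications are read-heavy. In this case, you should focus on scaling the query side. A good approach is to migrate your architecture towards CQRS (Command Query Responsibility Segregation), which separates the write (command) side from the read (query) side.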

Query services are stateless, so scaling them only requires launching more instances. This architecture allows you to run a dedicated storage per read model. For some queries a NoSQL database will be the fastest solution; for others, good old SQL will be just fine. The most important thing is that you have the freedom to make this decision independently. Of course, some sort of synchronization between the write side (preferably an event store) and the read side is required. I won’t cover this now, but you can find hundreds of articles and books on the topic. Once again, no cluster needed.
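The synchronization between the two sides usually takes the form of a projection: a small component that consumes events from the write side and keeps a query-friendly view up to date. Below is a minimal in-memory sketch. The event type, the offsets and the class name are all hypothetical, and a real projection would store both the view and the last processed offset in the read-side database.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative read model: number of orders per customer.
final class OrderCountProjection {

    // Hypothetical event published by the write side, tagged with its position in the stream.
    static final class OrderPlaced {
        final long offset;
        final String customerId;

        OrderPlaced(long offset, String customerId) {
            this.offset = offset;
            this.customerId = customerId;
        }
    }

    private final Map<String, Integer> ordersPerCustomer = new HashMap<>();
    private long lastOffset = -1;

    // Apply each event once; events at or below the last seen offset are skipped,
    // so a re-delivery after a restart does not inflate the counts.
    void handle(OrderPlaced event) {
        if (event.offset <= lastOffset) {
            return;
        }
        ordersPerCustomer.merge(event.customerId, 1, Integer::sum);
        lastOffset = event.offset;
    }

    // The query side only reads the prepared view.
    int ordersFor(String customerId) {
        return ordersPerCustomer.getOrDefault(customerId, 0);
    }

    public static void main(String[] args) {
        OrderCountProjection projection = new OrderCountProjection();
        projection.handle(new OrderPlaced(0, "alice"));
        projection.handle(new OrderPlaced(1, "alice"));
        projection.handle(new OrderPlaced(1, "alice")); // duplicate delivery, ignored
        System.out.println("alice has " + projection.ordersFor("alice") + " orders");
    }
}
```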

Scaling writes?

How about scaling writes? Finally, something interesting. Scaling command services is more challenging because (usually) you want to maintain data consistency. If you simply launch two (or more) instances of the same application you will end up with concurrent writes. Generally, this problem is solved by optimistic locking. Make sure that your underlying storage supports such concurrency control. Sometimes locking is not necessary because you can update your data in an idempotent way, but this is a rather rare situation.
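For readers who have not used optimistic locking before, here is a minimal in-memory sketch. The store and its method names are invented for this post. In a SQL database the same idea is typically expressed as an UPDATE that also checks the version column in its WHERE clause.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: a tiny key-value store with per-row version numbers.
final class OptimisticStore {

    // Immutable snapshot of a row together with the version it was read at.
    static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final ConcurrentMap<String, Versioned> rows = new ConcurrentHashMap<>();

    // Returns null when the row does not exist.
    Versioned read(String key) {
        return rows.get(key);
    }

    // Creates the row at version 1; fails if it already exists.
    boolean insert(String key, String value) {
        return rows.putIfAbsent(key, new Versioned(1, value)) == null;
    }

    // Succeeds only if nobody has changed the row since the caller read it.
    boolean update(String key, long expectedVersion, String newValue) {
        Versioned current = rows.get(key);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        // Atomic compare-and-swap on the exact snapshot we just checked.
        return rows.replace(key, current, new Versioned(expectedVersion + 1, newValue));
    }

    public static void main(String[] args) {
        OptimisticStore store = new OptimisticStore();
        store.insert("acc-1", "balance=100");

        // Two writers read the same version...
        long seenByA = store.read("acc-1").version;
        long seenByB = store.read("acc-1").version;

        // ...but only the first update wins. The second must re-read and retry.
        System.out.println("A: " + store.update("acc-1", seenByA, "balance=90"));
        System.out.println("B: " + store.update("acc-1", seenByB, "balance=80"));
    }
}
```

The losing writer gets no lock and no wait, just a failed update. The cost shows up as retries, which is why this only works well when conflicts are rare.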

Single Writer Principle

What if optimistic locking is not feasible for your use case? Such concurrency control is fine as long as the probability of concurrent access is very low. Otherwise, contention will significantly degrade your performance. Also, at some scale, databases that support optimistic locking might not be fast enough to handle your writes. If you switch to a distributed database like Cassandra or Scylla to handle more load, you might be surprised that your locking mechanism is no longer available. OK, to be honest, Cassandra supports locking via lightweight transactions, but this approach should be avoided for heavy write loads.

To overcome this problem you should try to achieve the Single Writer Principle (SWP). It is an easy-to-grasp method for fast, consistent, contention-free and hardware-friendly updates. The essence of this method is that all writes must go through a single point, one at a time. Since there are no concurrent updates, you don’t have to worry about data consistency. SWP is relatively easy to implement in a single-instance application. A distributed SWP, however, is a completely different story. Implementing it correctly will be a huge challenge. I suppose it might be the most demanding task in your whole programming career. Of course, you want to make it fast, reliable and built with the reactive philosophy in mind.
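The single-instance version really is simple. Here is one possible sketch in plain Java, assuming a single-threaded executor as the “single point”. The class name is hypothetical and there is no persistence; the only goal is to show that state touched by one thread needs no locks.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: all writes are funneled through one thread.
final class SingleWriterCounter {

    // The single writer. Commands are queued and executed one at a time, in order.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    // Only ever read or modified on the writer thread, so no synchronization is needed.
    private long balance = 0;

    // Callable from any thread: enqueue the command and get the result asynchronously.
    CompletableFuture<Long> add(long amount) {
        return CompletableFuture.supplyAsync(() -> {
            balance += amount;
            return balance;
        }, writer);
    }

    void shutdown() {
        writer.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        SingleWriterCounter counter = new SingleWriterCounter();

        // Many concurrent callers, still exactly one writer.
        ExecutorService callers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 1000; i++) {
            callers.execute(() -> counter.add(1));
        }
        callers.shutdown();
        callers.awaitTermination(5, TimeUnit.SECONDS);

        // This command is queued behind all the previous ones.
        System.out.println("balance = " + counter.add(0).join());
        counter.shutdown();
    }
}
```

The hard part starts when the application runs on more than one machine: something has to guarantee that, for a given piece of state, there is still only one such writer in the whole system.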

Finally, it’s time for Akka Cluster. The cavalry has arrived.

[Image: The Polish cavalry, Hussars. Batowski-Kaczor reproduction painting.]

Akka Cluster is basically a distributed consensus mechanism. Obviously, there is more going on under the hood, but this simplification is good enough. Consensus by itself will not do the job for us. Another piece is necessary: Akka Sharding. That’s it, you get a Distributed Single Writer Principle, one of the most desired features of a distributed system.

The next question could be: what exactly should we shard? Since you want to scale and maintain consistency, a perfect candidate for sharding is your domain aggregate, which is responsible for maintaining business consistency in the first place.
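To show what sharding aggregates buys you, here is a toy model in plain Java. It does not use the Akka API: threads stand in for cluster nodes, a hash of the aggregate id stands in for shard resolution, and every name is hypothetical. In Akka, the allocation of shards to nodes is managed by the cluster itself and can change during rebalancing, which this sketch does not attempt to model.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: one single-threaded writer per shard.
final class ShardedSingleWriter {

    private final ExecutorService[] shards;

    // Shared map, but each key is only ever written by the thread that owns its shard.
    private final Map<String, Long> balances = new ConcurrentHashMap<>();

    ShardedSingleWriter(int numberOfShards) {
        shards = new ExecutorService[numberOfShards];
        for (int i = 0; i < numberOfShards; i++) {
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    // The same aggregate id always resolves to the same shard.
    static int shardFor(String aggregateId, int numberOfShards) {
        return Math.floorMod(aggregateId.hashCode(), numberOfShards);
    }

    // Commands for one aggregate are serialized; different aggregates proceed in parallel.
    CompletableFuture<Long> deposit(String accountId, long amount) {
        ExecutorService owner = shards[shardFor(accountId, shards.length)];
        return CompletableFuture.supplyAsync(
                () -> balances.merge(accountId, amount, Long::sum), owner);
    }

    void shutdown() {
        for (ExecutorService shard : shards) {
            shard.shutdown();
        }
    }

    public static void main(String[] args) {
        ShardedSingleWriter accounts = new ShardedSingleWriter(4);
        accounts.deposit("acc-1", 10);
        accounts.deposit("acc-2", 5);
        System.out.println("acc-1 = " + accounts.deposit("acc-1", 7).join());
        accounts.shutdown();
    }
}
```

Each aggregate has exactly one writer, so its invariants are safe, while the total write capacity grows with the number of shards.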

Akka Cluster and Akka Sharding are a powerful pair, but there is one more interesting addition. SWP, as the name suggests, needs to write something. Now the slowest part of the system will be the I/O operation. How can we programmatically improve our writes?

I don’t want to go into details here, but append-only writes are the most effective way to save data to any storage, including SSD drives. No deleting, no editing, just a simple fsync operation. The majority of databases will love to handle such a load.

A perfect way to achieve append-only writes is to apply an Event Sourcing architecture, which is what Akka Persistence and Akka Persistence Typed are built for.
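If Event Sourcing is new to you, the core idea fits in a few lines. The sketch below is plain Java, not the Akka Persistence API, and all names are hypothetical. An in-memory list plays the role of the journal that a real system would keep in a database.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: state is never updated in place, it is derived from appended events.
final class EventSourcedAccount {

    // Events are immutable facts about what has already happened.
    static final class Event {
        final String type;
        final long amount;

        Event(String type, long amount) {
            this.type = type;
            this.amount = amount;
        }
    }

    // Stand-in for a persistent journal: entries are only ever appended.
    private final List<Event> journal = new ArrayList<>();
    private long balance = 0;

    // Command handlers validate against the current state, then record an event.
    void deposit(long amount) {
        if (amount <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        persist(new Event("Deposited", amount));
    }

    boolean withdraw(long amount) {
        if (amount <= 0 || amount > balance) {
            return false; // command rejected, nothing is written
        }
        persist(new Event("Withdrawn", amount));
        return true;
    }

    private void persist(Event event) {
        journal.add(event);
        apply(event);
    }

    private void apply(Event event) {
        if (event.type.equals("Deposited")) {
            balance += event.amount;
        } else {
            balance -= event.amount;
        }
    }

    // Recovery: rebuild the state by replaying the journal from the beginning.
    static EventSourcedAccount replay(List<Event> events) {
        EventSourcedAccount account = new EventSourcedAccount();
        for (Event event : events) {
            account.persist(event);
        }
        return account;
    }

    long balance() {
        return balance;
    }

    List<Event> journal() {
        return new ArrayList<>(journal);
    }

    public static void main(String[] args) {
        EventSourcedAccount account = new EventSourcedAccount();
        account.deposit(100);
        account.withdraw(30);
        EventSourcedAccount recovered = replay(account.journal());
        System.out.println(account.balance() + " == " + recovered.balance());
    }
}
```

Because there is a single writer per aggregate, events can be appended without any coordination, and that is exactly the kind of write that storage handles best.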

Summary

Akka Cluster + Sharding + Persistence form the ultimate trio for building fast, resilient, scalable, distributed applications. Not everybody needs to jump to such a high level of awesomeness, so be careful with the decision. However, if your current solution is hitting its limits, then the Akka stack will definitely do the job for you.

As a last thought, I will only mention that in 99.9% of cases you should never implement distributed consensus yourself. Most probably you will fail or, in the best-case scenario, burn a lot of money. Just check how others leverage Akka Cluster in their solutions: zio-akka-cluster, aecor.