We explore which architecture to implement for a component that is critical to our platform.

At Zalando, our team is currently in the process of creating an essential core component of our platform. Luckily (and thanks to consistently following a microservices strategy), this component does something relatively simple: it triggers a CPU-intensive computation up front and then publishes the results to other selected services.

We already have a proof-of-concept up and running, and the team decided to use Scala and the Play framework for this. However, as this component neither has a web-facing frontend nor exposes a REST API, we are now re-evaluating our choice of technology.

Requirements

As the team is not yet super-experienced with Scala, and the component is an absolutely critical core part of the whole platform, we are looking for a technology and architecture that fulfills the following criteria:

Be as simple as possible

Be easily maintainable

Be as robust as possible during runtime

Be fast, so that we don't cause any kind of back-pressure

In the end, we as developers want to be able to sleep well, knowing that the chance of any kind of failure for this component is minimal, and if it happens, we're able to fix it quickly.

Test subjects

During our discussion, various approaches for tackling these requirements quickly emerged.

Of course, there will always be someone who wants to go for the simplest solution possible. And it is a valid point of view, as we have to get up and running fast, and would still be able to adapt later. However, its validity strongly depends on the simplest solution actually being simple, and not only the easiest thing to do. In this case, the proposal was to keep each run through the component in a single blocking thread, and to just route threads as needed via Java's ExecutorService.

Looking at the current Play application, the second option that naturally emerged was to use Scala's futures without much further sophistication.

And finally, with Scala, there's always the elephant in the room: Akka.

Until recently, Akka inevitably meant using the actor model, known from Erlang. In my experience, people are often sceptical about using actors extensively. There are several reasons for that scepticism: for many people it is quite a paradigm shift, others don't trust that so many low-level concepts are abstracted away, and some miss type safety.

However, the recent hype around Reactive has led to more alternatives popping up, even within Akka: Reactive Streams. The two major implementations of Reactive Streams for Scala out there seem to be Akka Streams on the one hand, and RxScala on the other.

There are still a lot of other approaches available to do the same things. However, the above alternatives were the ones brought up by our team, so these will be the focus of our examination.

Case study

The next step was to play around with each of the approaches above and determine which of them would be a good fit for the team and our requirements. To do this in a structured way, and to expose any obvious performance or robustness problems along the way, we chose to create a simple sandbox model of our problem and run benchmarks on it. The benchmarks are a tool for comparing the approaches, so the following shouldn't be taken as a "benchmark first" case study.

Introducing the model

Let's get coding, finally!

Here's our simple model of the jobs our component will receive (this post contains only the relevant code snippets; for a more holistic view, see the respective GitHub project):

    case class Job(id: Int) {
      val payload = Array.fill(16000)(Random.nextInt())
    }
    case class JobResult(job: Job, result: Int)
    case class PublishResult(result: JobResult)

This is how the computational part of our model looks:

    object Computer {
      import ComputationFollowedByAsyncPublishing._

      def compute(job: Job): JobResult = {
        // jmh ensures that this really consumes CPU
        Blackhole consumeCPU numTokensToConsume
        JobResult(job, job.id)
      }
    }

It uses the awesome JMH benchmarking library (nicely integrated into sbt via sbt-jmh) and its black hole to do all the work for us.

And here's the part where we "publish" to other services, which is naturally asynchronous:

    object Publisher {
      import ComputationFollowedByAsyncPublishing._

      // we use the scheduler and the dispatcher of the actor system here because it's so very convenient
      def publish(result: JobResult, system: ActorSystem): Future[PublishResult] =
        after(publishDuration, system.scheduler) {
          Future(PublishResult(result))(system.dispatcher)
        }(system.dispatcher)
    }

Notice that we're using the convenient scheduling provided by Akka actor systems here, as we'll have an actor system running for our other experiments anyway.
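If you want to try the model without an actor system, the delayed, non-blocking publishing step can be approximated with the standard library alone. The following is a minimal sketch, not the code used in our benchmarks: `DelayedPublishSketch`, its simplified `JobResult`, and the explicit `delay` parameter are assumptions made for illustration.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{Future, Promise}
import scala.concurrent.duration.FiniteDuration

object DelayedPublishSketch {
  // Hypothetical, simplified stand-ins for the model types of this post.
  final case class JobResult(id: Int)
  final case class PublishResult(result: JobResult)

  // A single scheduler thread is enough: it only completes promises.
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Returns immediately; the future completes after `delay`
  // without any thread being blocked while waiting.
  def publish(result: JobResult, delay: FiniteDuration): Future[PublishResult] = {
    val promise = Promise[PublishResult]()
    scheduler.schedule(new Runnable {
      def run(): Unit = promise.success(PublishResult(result))
    }, delay.toMillis, TimeUnit.MILLISECONDS)
    promise.future
  }

  // The scheduler thread is not a daemon, so stop it when done.
  def shutdown(): Unit = scheduler.shutdown()
}
```

Callers would combine the returned future with `flatMap` or `map` as usual and call `shutdown()` once no more results need to be published.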

Old-school blocking

Here's what the good old blocking approach looks like:

    def benchmark(coreFactor: Int): Unit = {
      val exec = Executors newFixedThreadPool numWorkers(coreFactor)
      try {
        val futures = 1 to numTasks map Job map { job =>
          exec.submit(new Callable[PublishResult] {
            // explicitly turn async publishing operation into a blocking operation
            override def call(): PublishResult =
              Await.result(Publisher publish (Computer compute job, system), 1 hour)
          })
        }
        printResult(futures map (_.get))
      } finally exec.shutdown()
    }

Plain futures

Using futures instead of blocking everywhere doesn't really look that complicated to me:

    def benchmark(coreFactor: Int): Unit = {
      import system.dispatcher
      // execution context only for the (cpu-bound) computation
      val ec = ExecutionContext fromExecutorService Executors.newFixedThreadPool(numWorkers(coreFactor))
      try {
        // `traverse` will distribute the tasks to the thread pool, the rest happens fully async
        printResult(Await.result(Future.traverse(1 to numTasks map Job) { job =>
          Future(Computer compute job)(ec) flatMap (Publisher.publish(_, system))
        }, 1 hour))
      } finally ec.shutdown()
    }

Actors

Actors, however, do get a bit more involved. First of all, here's the client distributing the jobs:

    def benchmark(coreFactor: Int): Unit = {
      import system.dispatcher
      implicit val timeout = Timeout(1 hour)
      // Route computations through a balanced pool of (cpu bound) computation workers.
      val router = system actorOf BalancingPool(numWorkers(coreFactor)).props(Props[ComputeActor])
      try {
        // Collect the results, sum them up and print the sum.
        printResult(Await.result(Future.traverse(1 to numTasks map Job) { job =>
          (router ? job).mapTo[PublishResult]
        }, 1 hour))
      } finally router ! PoisonPill
    }

The ComputeActor is just an actor wrapper around the computation which delegates work to the actors responsible for publishing:

    class ComputeActor extends Actor {
      val publisher = context actorOf Props[PublishActor]

      def receive = {
        case job: Job =>
          // tell the publisher about who sent us the job, and the job results
          val s = sender()
          publisher ! (s, Computer compute job)
      }
    }

Finally, the actor wrapper around publishing the results:

    class PublishActor extends Actor {
      import context.dispatcher

      def receive = {
        case (s: ActorRef, r: JobResult) =>
          // just pipe the result back to the original sender
          Publisher.publish(r, context.system) pipeTo s
      }
    }

Streams, using RxScala

Streaming our jobs through RxScala looks really beautiful and concise in my eyes. I'm not sure if it's correctly doing what it should, as it blows up the heap when running. I'm afraid that there's a memory leak in there somewhere, and we shouldn’t need to deal with an indistinct issue like that.

    def benchmark: Unit = {
      // looks nice, not sure if correct, blows up the heap
      Observable
        .from(1 to numTasks map Job)
        .subscribeOn(ComputationScheduler())
        .map(Computer compute)
        .subscribeOn(ExecutionContextScheduler(system dispatcher))
        .flatMap(1024, r => Observable.from(Publisher publish (r, system))(system dispatcher))
        .foldLeft(0) { case (s, r) => s + computeResult(r) }
        .foreach(println)
    }

Streams, using Akka Streams

Akka Streams, in contrast, look a little more complex. This is due to the conscious decision on the side of the Akka team to create a more abstract DSL in order to encourage the reuse of partial flow graphs. Having a second look, the above RxScala code might be a bit deceptive in its conciseness due to the simple nature of our model. If the pipelining becomes more complex, Akka's composable graph DSL might be a better fit for keeping things readable and under control.

First, we create a helper flow responsible for balanced routing of a workload. This is basically copied from the respective Akka Streams cookbook documentation:

    private def balancer[In, Out](worker: Flow[In, Out, Any], workerCount: Int): Flow[In, Out, NotUsed] = {
      Flow fromGraph GraphDSL.create() { implicit b =>
        val balancer = b add Balance[In](workerCount, waitForAllDownstreams = false)
        val merge = b add Merge[Out](workerCount)
        1 to workerCount foreach { _ => balancer ~> worker.async ~> merge }
        FlowShape(balancer.in, merge.out)
      }
    }

With this helper at hand, we can create the graph and run it. The central piece of code here is source ~> balanced ~> publish ~> sink.in.

    def benchmark(coreFactor: Int)(implicit system: ActorSystem): Unit = {
      // a sink that computes the sum
      val sink = Sink.fold[Int, PublishResult](0) { case (sum, job) => sum + computeResult(job) }
      // wiring up the graph of streams
      val g = RunnableGraph fromGraph GraphDSL.create(sink) { implicit b => sink =>
        // preparations ...
        val source = b add Source(1 to numTasks map Job)
        val compute = Flow.fromFunction(Computer compute).withAttributes(ActorAttributes dispatcher "compute-dispatcher")
        val balanced = b add balancer(compute, numWorkers(coreFactor)).async
        val publish = b add Flow[JobResult].mapAsyncUnordered(1024) { Publisher.publish(_, system) }
        // finally, here's the graph
        source ~> balanced ~> publish ~> sink.in
        ClosedShape
      }
      // Running the graph will materialize it into a future int. We wait for it and print it.
      println(Await.result(g.run()(ActorMaterializer()), 1 hour))
    }

Benchmarking results

When running the benchmarks, the main result for us was that with the numbers from our problem, the choice of approach doesn't really matter when it comes to pure runtime performance. We observe only very slight differences, if any.

Here are the results from one specific lengthy run, as an example:

The Akka Streams implementation seems to outperform all others very slightly, while the blocking implementation usually fares slightly worse, presumably due to thread-switching overhead.

The main outlier is RxScala, which unfortunately throws an OutOfMemoryError very quickly, as previously mentioned. We happily accept any hints about what we might be doing wrong here, although the code looks pretty straightforward.

Having said that, there are some significant differences in runtime behavior.

One very obvious difference is the number of threads being used, as you might expect. Plain futures and Akka streams are both very economical when it comes to threads. Actors seem to use a few more threads, which I suspect could be a question of tuning the configuration. By far the most threads are required by the blocking approach. Who would have thought?
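If you want to compare thread usage yourself, the JVM's standard management API offers a quick way to observe it. This is only a sketch of one possible measurement, not necessarily how the numbers above were obtained; `ThreadCountSketch` and `peakThreadsDuring` are hypothetical names.

```scala
import java.lang.management.ManagementFactory

object ThreadCountSketch {
  private val threads = ManagementFactory.getThreadMXBean

  // Runs `body` and returns the peak number of live JVM threads
  // (including threads unrelated to `body`) observed while it ran.
  def peakThreadsDuring(body: => Unit): Int = {
    threads.resetPeakThreadCount() // peak starts at the current live thread count
    body
    threads.getPeakThreadCount
  }
}
```

Wrapping each `benchmark` call in `peakThreadsDuring { ... }` gives a rough, JVM-wide figure. It counts all live threads, so it is best used to compare approaches rather than as an absolute number.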

Also, when varying the numbers, we notice that futures, streams, and actors are all way more consistent in their runtime behavior, whereas you have to re-tune the blocking approach each time to fit specific circumstances. This means that the blocking approach is not very robust towards a live environment where numbers will be in constant flux.
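One way to see why the blocking approach needs re-tuning is the common rule of thumb for sizing a pool whose threads alternate between computing and waiting. The sketch below is purely illustrative: the function name and the numbers in the note after it are assumptions, not values from our benchmarks.

```scala
object BlockingPoolSizing {
  // Rule of thumb: to keep `cores` busy when every thread computes for
  // `computeMillis` and then sits blocked for `waitMillis`, you need
  // roughly cores * (1 + wait / compute) threads.
  def threadsNeeded(cores: Int, computeMillis: Double, waitMillis: Double): Int =
    math.ceil(cores * (1 + waitMillis / computeMillis)).toInt
}
```

With 4 cores and 10 ms of computation per job, `threadsNeeded(4, 10, 10)` gives 8 threads, while `threadsNeeded(4, 10, 90)` gives 40. If publishing latency shifts in production, a fixed blocking pool is suddenly too small or too large, whereas the asynchronous approaches don't hold a thread while waiting.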

Summary

These results leave us with three good alternatives: Using Scala futures, using Akka actors, or going for Akka streams.

Using futures has the advantage of simplicity. However, I'm afraid that this simplicity can be a bit deceiving when the system grows more complex. Sure, futures compose, but it might quickly become difficult to reason about the flow of futures. Also, there's always the pitfall of not noticing failures with futures: one misplaced foreach can have you looking forever for a failure that you observed but can't pinpoint.
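The foreach pitfall is easy to reproduce with the standard library. The following is a minimal sketch, assuming Scala 2.12 or later (for `transform` on `Try`); `FutureFailureSketch` and its always-failing `publish` are hypothetical and stand in for any asynchronous step.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

object FutureFailureSketch {
  // Hypothetical publishing step that always fails.
  def publish(id: Int): Future[Int] =
    Future(throw new IllegalStateException(s"publishing job $id failed"))

  // `foreach` runs its callback only on success, so the failure
  // disappears silently: nothing is printed, nothing is returned.
  def silent(id: Int): Unit =
    publish(id).foreach(result => println(s"published $result"))

  // Turning both outcomes into a value keeps the failure visible.
  def observed(id: Int): Future[String] =
    publish(id).transform {
      case Success(result) => Success(s"published $result")
      case Failure(error)  => Success(s"failed: ${error.getMessage}")
    }
}
```

Calling `silent(1)` produces no output at all, while `observed(1)` completes with the message of the exception. In real code you would more likely return the failed future to the caller or log it in `onComplete`, but the point stands: the failure has to be handled somewhere explicitly.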

Using Akka streams, on the other hand, imposes some overhead in getting up and running. It requires learning the DSL, and also understanding what's going on under the hood to some extent. We might reap some benefits from this investment as soon as the component starts growing in complexity, as outlined above. Concerning error handling, streams should be quite well-behaved, as you can basically define a supervision strategy for each node in your flow graph.

If Akka streams are abstracting away too many details for you, using actors directly might be a good option. This approach gives you a lot of direct control over and insight into what's going on. The actor model in general is also tried and proven. Error handling via the Akka supervisor model is straightforward. The main problem with actors is that they are completely based on messages and effects. You have to be careful to deal with the emerging complexity by getting your actor hierarchies right and testing them well.

So there you have it. Now go and play with the code yourself! We’d be happy to hear from you with comments or improvements.

The full playground of code can be found on GitHub here.