Before I get into the meat and potatoes of my project, I should mention that I have looked into tools and projects beyond Hadoop, such as Spark, Cascading, Scalding, and Storm. I agree they do a good job of abstracting some of the work in big “map-reduce”-like jobs, but I still felt they were too big for my project. I had something else in mind.

Lately, I've come to rely on a very popular framework for a few tasks. Akka, an actor-based framework, is used for implementing Scala (and Java) based systems that can robustly handle numerous and/or large tasks, with traits such as concurrency and parallelism, asynchronicity, fault tolerance, and remote distribution (clustering!), just to name a few. With an understanding of some of the basics of recommender systems and algorithms, I wanted to see if Akka was suitable for the job too. I wasn't making any assumptions about Akka being better than, or even equivalent in performance to, the now de facto set of big data frameworks. Even if it couldn’t handle the same scale of processing as the big data crew, I at least felt it could be scalable enough for my task. Implementation experiments would prove or disprove this idea.

Based on some of the actor patterns I had recently become comfortable with, an implementation was clear in my mind. Even more so, I felt that the implementation could be minimal in comparison to some of the setups I've seen for some big data frameworks.

So let’s see what I came up with…

An Actor-based User Similarity System

The Environment

Short and sweet: I had a database with a whole bunch of user identifiers, item identifiers, and ratings in the range of 1 to 10. Actually, I had some ratings. Some of the preferential data I had was only binary; the user had, at some point in time, only signified they ‘liked’ the item. Before getting started, I decided to fill these missing ratings with a 5, as in 5 out of 10 stars. Being an experiment, I would see what kind of results that would yield.

The System

Now, I'm not going to spend this article giving an introduction to Akka. Mainly because I don’t want to :-) Hopefully, what I can present is a few glimpses of what Akka and/or the actor model can do for you as a developer. If I do happen to pique your interest, I recommend the book “Akka Concurrency” by Derek Wyatt for an introduction to the framework and model. For the purpose of this article, it’s enough to understand that actor-based frameworks work primarily on the concept of message passing, where actors communicate signals and information between each other to get tasks done.

The Design

The top-down design of a similarity job

For my design, the system can be abstractly broken down into three distinct modules. Along with being the primary components of the entire system, they also represent the basis of a similarity job, which is illustrated in the figure above.

At the top, or start, of a similarity job is the data streamer. This module is responsible for synchronously pulling the data from the database. “Stream” is an important keyword here: each piece of data should not be held locally in memory for long-term use. That would be a waste of resources, since the raw data will be abundant and is actually needed elsewhere. The data streamer quickly transmits (or messages) data along to its consumer, the vectorization module, for processing. Vectorization is the process of arranging the raw data into vectors and is a requirement for most recommender system algorithms. In general, vectorization performs the necessary grouping of preferential data (ratings) for each user.
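To make the grouping step concrete, here is a minimal plain-Scala sketch (no Akka involved). The `RawRating` record, its field names, and the `Vectorization` object are hypothetical stand-ins invented for illustration; only the idea of grouping ratings per user and filling a missing rating with 5 comes from the description above.

```scala
// Hypothetical record shape; the name and fields are assumptions for this sketch.
// A rating of None models a binary "like" with no explicit score.
final case class RawRating(userId: Long, itemId: Long, rating: Option[Int])

object Vectorization {
  // Default used for "liked" items without a rating (5 out of 10, as in the article).
  val DefaultRating = 5

  // Groups raw records into one item-to-rating map per user.
  def vectorize(rows: Seq[RawRating]): Map[Long, Map[Long, Int]] =
    rows.groupBy(_.userId).map { case (userId, userRows) =>
      userId -> userRows.map(r => r.itemId -> r.rating.getOrElse(DefaultRating)).toMap
    }
}
```

Calling `Vectorization.vectorize` on a batch of records returns one sparse vector per user. In the actual system this work is spread across worker actors and done incrementally rather than in one batch.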

To handle the expected deluge of data, the vectorization module actually consists of a number of distributed worker nodes (that is, actors) that each collect a share of the raw data. This is important as the process of vectorization can be time-consuming. Each worker node stores user vectors (that is, “user-to-item-to-rating” tuples) locally in memory. For each similarity job, consideration and experimentation will lead to discovering the best distribution of worker nodes to efficiently allocate workloads.

Once the data streamer has no more data to stream, a message is sent to signal the workers to start pushing their collected user vectors to the next module in the workflow, the computation module. This module similarly contains a number of worker nodes (more actors) that perform the actual similarity computations. For each user vector a worker node has, a similarity computation will be made against every other user vector held by every other worker node. Ideally, each pair of user vectors is compared exactly once (at least once and at most once). This is where the brunt of the recommender's work is performed, and the distribution of the worker nodes proves its worth here too.
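The article does not say which similarity measure is used, so the sketch below assumes cosine similarity purely for illustration. The `Similarity` object and the `UserVector` alias are hypothetical names, not part of the original code.

```scala
object Similarity {
  // A sparse user vector: itemId -> rating (an assumed representation).
  type UserVector = Map[Long, Int]

  // Cosine similarity; items missing from a vector are treated as 0.
  def cosine(a: UserVector, b: UserVector): Double = {
    // Only items rated by both users contribute to the dot product.
    val dot = a.iterator.collect {
      case (item, rating) if b.contains(item) => rating.toDouble * b(item)
    }.sum
    val normA = math.sqrt(a.valuesIterator.map(r => r.toDouble * r).sum)
    val normB = math.sqrt(b.valuesIterator.map(r => r.toDouble * r).sum)
    // An empty vector has no direction, so report no similarity.
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}
```

A computation worker would apply a function like this to each received vector against every vector it holds locally.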

The algorithmic approach I went with is straightforward, but with the parallelism gained from using distributed nodes, it can still be made efficient.

The Distributed Algorithm

Via message passing, the worker nodes will, one at a time, each become the distributor. A distributor is responsible for sending its user vectors to every other worker node. When a worker node receives a user vector from the distributor, it compares that user vector against all the user vectors it has stored locally. After these similarity computations, the received user vector is discarded. This is performed for each user vector the distributor sends. Eventually, the distributor will signal to all the worker nodes that it has no more user vectors. Because a distributor node's user vectors are no longer needed at this point (they have already been compared for similarity to every other user vector sitting on the other worker nodes), the distributor can shut itself down. Another worker node then becomes the distributor. But what about computing user similarities for the user vectors at the distributor? The trick to that is in the distributor: when the distributor broadcasts its user vectors, it also sends them to itself. This is because the distributor is still considered a worker node. It accepts messages from itself, but does the same action as every other worker node: it computes user similarities.
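The rotation of the distributor role can be simulated without any actors. The following plain-Scala sketch is only a model of the message flow described above; `DistributionSketch` and `comparedPairs` are hypothetical names. It makes two assumptions of its own: every worker takes a turn (the last one only sends to itself, so that its own vectors are compared with each other), and on the distributor itself mirrored pairs and self-pairs are skipped so that each pair is counted once.

```scala
object DistributionSketch {
  // Each worker is modelled as the list of user ids whose vectors it holds.
  // Returns every (sentUserId, localUserId) comparison, in the order it happens.
  def comparedPairs(workers: Vector[Vector[Long]]): Vector[(Long, Long)] = {
    val pairs = Vector.newBuilder[(Long, Long)]
    for (d <- workers.indices) {            // workers take the distributor role one at a time
      for {
        sent  <- workers(d)                 // the distributor broadcasts each of its vectors...
        r     <- d until workers.size       // ...to itself and to every worker still running
        local <- workers(r)                 // the receiver compares against its local vectors
      } {
        // Assumption of this sketch: on the distributor itself, skip self-pairs
        // and mirrored pairs so that each pair is compared only once.
        if (r != d || sent < local) pairs += ((sent, local))
      }
      // Worker d has now been compared against everyone and drops out ("shuts down").
    }
    pairs.result()
  }
}
```

With five users spread over three workers, this model produces each of the ten possible user pairs exactly once.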

That’s it! After the last distributor finishes up (which would be the second-to-last worker in a queue of workers), the job can be considered complete.

Implementation

With the ideology that actors should be responsible for a very minimal amount of work, Akka actors are usually programmed to perform a relatively small number of simple tasks in reaction to the messages they receive. The Akka actor API declares one important method to override, receive, which can be thought of as the switchboard that defines the actions an actor should perform when communicated with. Based on my design above, the receive methods for my modules and worker nodes are for the most part quite trivial.
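In classic Akka, `receive` is a `PartialFunction[Any, Unit]`, so the "switchboard" idea can be sketched in plain Scala without the library. This is not the article's actor: the message shapes (a field-less `RequestStreamer`, a `Long` `sourceId`) and the `ReceiveSketch` object are assumptions made for illustration.

```scala
import scala.collection.mutable.ListBuffer

// Message names come from the article; their exact shapes are assumed here.
case object RequestStreamer
final case class StartStream(sourceId: Long)

object ReceiveSketch {
  // Stand-in for Akka's Receive type alias.
  type Receive = PartialFunction[Any, Unit]

  // Records what the "actor" did, so the behaviour can be inspected.
  val handled: ListBuffer[String] = ListBuffer.empty[String]
  private def note(s: String): Unit = { handled += s; () }

  // The switchboard: one case per message the actor understands.
  val receive: Receive = {
    case RequestStreamer       => note("allocated for a job")
    case StartStream(sourceId) => note(s"streaming source $sourceId")
  }
}
```

Messages that match no case are simply not handled by this partial function; a real actor would treat them as unhandled.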

When a similarity job is initiated, a request is made to acquire a data streamer, a vectorization node, and a computation node. For the purposes of this discussion, we will assume these actors already exist locally in a virtual machine. I will first give an overview of my data streaming actor:

When a job is started, the data streamer, UserContentStreamer, is allocated for the job when it receives a RequestStreamer message (which is a Scala case class here). When the job is ready to start, the streamer is then sent a StartStream message, which also indicates a subset of the data in the database (via the sourceId value). Hidden from my code snippet is the implementation defined by the userContentDatabase object (line 21). That object is injected into my actor via the ApplicationModule trait and basically wraps up a bunch of implementation code (namely JDBC connection pooling with C3P0 and LINQ-style query access with Slick). What it returns is a Scala Stream of RawUserContentDto case class objects created from the database entries. These are in turn streamed to awaiting vectorization worker nodes in the upcoming UserContentVectorizer actor. In case you are wondering, the “!” is Akka’s operator (an ordinary Scala method on an actor reference) for sending (or telling) a message to some other actor. In this case, all the streaming data is sent to a single actor, which I've purposely left out of my design description until now.

Director actors are actor implementations hidden within my three-module design. These actors are responsible for the management and organisation of worker nodes. In the case of data streaming, the output actor (line 19) is the director for the UserContentVectorizer node, which was the actor that initially sent the StartStream message. This director routes user vectors to the appropriate worker nodes using Akka’s very useful consistent hashing router configuration. This configuration ensures that user vectors for a particular user always go to the same worker node (i.e. user grouping). This means that user vectors are collected properly, without duplicate vectors for the same user existing somewhere else in the distribution of worker nodes.
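To show why consistent hashing keeps each user on one worker, here is a small hash ring written in plain Scala. It is a generic illustration of the technique and not Akka's router implementation; the `HashRing` class, the worker names, and the replica count are all hypothetical.

```scala
import scala.collection.immutable.TreeMap
import scala.util.hashing.MurmurHash3

// A tiny hash ring; worker names and the replica count are assumptions of this sketch.
final class HashRing(workers: Seq[String], replicas: Int = 16) {
  require(workers.nonEmpty, "at least one worker is needed")

  private def hash(s: String): Int = MurmurHash3.stringHash(s)

  // Each worker is placed on the ring several times to even out the spread.
  private val ring: TreeMap[Int, String] =
    (for (w <- workers; i <- 0 until replicas) yield hash(s"$w#$i") -> w)
      .foldLeft(TreeMap.empty[Int, String])(_ + _)

  // Walk clockwise from the key's hash to the first worker, wrapping around at the end.
  def workerFor(userId: Long): String = {
    val it = ring.iteratorFrom(hash(userId.toString))
    if (it.hasNext) it.next()._2 else ring.head._2
  }
}
```

The same user id always resolves to the same worker, and removing a worker only reassigns the users that were mapped to it, which is the property that makes this style of routing attractive for grouping.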

Here’s a brief description of the important bits from the vectorization phase:

The UserContentVectorizer (UCV), which is synonymous with the vectorization node in my design, is also acquired for a similarity job (lines 8-11).

A useful feature of Akka actors is their ability to hot-swap the behaviour defined by receive. On line 10, the UCV changes its state to working (lines 15-23). This is important as it allows the UCV to perform the vectorization without being interrupted by additional acquisition requests (however, line 16 shows a trick that allows the UCV to stash future jobs it might want to work on later).
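In Akka this pattern is built from `context.become` and the `Stash` trait (`stash()` and `unstashAll()`). The plain-Scala sketch below only imitates that mechanism so it can run without the library; `SwitchingVectorizer`, `AcquireVectorizer`, and `JobFinished` are hypothetical names and do not appear in the original code.

```scala
import scala.collection.mutable

case object AcquireVectorizer   // hypothetical acquisition request
case object JobFinished         // hypothetical end-of-job signal

final class SwitchingVectorizer {
  type Receive = PartialFunction[Any, Unit]

  val events = mutable.ListBuffer.empty[String]   // what happened, for inspection
  private val stashed = mutable.Queue.empty[Any]  // requests set aside while busy

  private val idle: Receive = {
    case AcquireVectorizer =>
      events += "acquired"
      behaviour = working                 // swap behaviour, like context.become(working)
  }

  private val working: Receive = {
    case AcquireVectorizer =>
      stashed.enqueue(AcquireVectorizer)  // busy: keep the request for later, like stash()
    case JobFinished =>
      events += "finished"
      behaviour = idle
      val pending = stashed.toList        // replay everything set aside, like unstashAll()
      stashed.clear()
      pending.foreach(send)
  }

  private var behaviour: Receive = idle

  // Delivers a message to whichever behaviour is currently active.
  def send(msg: Any): Unit = if (behaviour.isDefinedAt(msg)) behaviour(msg)
}
```

A second acquisition request sent while the first job is running is held back and only takes effect once the first job reports that it has finished.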

When the UCV is ready to start, it receives a message (line 17) telling it which data streamer it should work with. At this point, the UCV does two important things: it creates the worker nodes (lines 27-28) and creates the director (line 32) as an anonymous actor. As previously explained, the director is the actor that is responsible for routing the vectors from the UserContentStreamer to the worker nodes (lines 40-44).

When the stream is completed, the UCV will tell each worker node to return its vectors to the similarity job level (where the UCV was allocated) and shut itself down (lines 45-48).

This code snippet also includes the code for the vectorization worker nodes defined by the UserVectorizationReducer actor.

This actor is simple: it receives RawUserContentDto objects (line 65) and retains a local map of vectors (lines 79-102 and lines 108-109). This encapsulates the logic described earlier (i.e. filling the rating gaps).

Finally, when the ReturnUserVectors message is received, the actor simply pipes them to the calling job node (lines 69-77).

The beauty of this is the parallelism gained from the Akka configuration of the worker nodes. Because the actors are designed to not share any mutable data, they are free to work independently and concurrently with only the data they receive. Furthermore, with respect to “Java-world”, this is all done without the need to work directly with threads, locks, mutexes, synchronization blocks, etc. (of course, nothing is stopping you from using them if you need them). As a bonus, the actor configuration itself is easily abstracted away from the code and defined in an easy-to-read configuration file (line 28).

That’s it for vectorization. In ~100 lines, we have two classes that can potentially do a heavy load of work.