I recently published an article about our experience scaling the Video User Profile Service we built for the Olympics. In that article, I focused on the overall experience and touched only briefly on the Elixir solution. Now it's time to explain the details of our Elixir application.

Software rewrite (from Ruby to Elixir)

It's common knowledge that a software rewrite is often a bad idea. Despite this, as we saw in the previous article, sometimes it is hard to scale past a given point. I believe that, in cases like this, a rewrite can be a natural move for a technology stack to meet its new requirements. In our case, the application was written four years ago, and it did the job very well until traffic increased 6x.

The Ruby endpoint we chose to rewrite

The Video User Profile Service has an endpoint that saves many logged-in user actions, like “Add to favorites” and “Watch Later”. It also tracks how much of a video the user watched in their “Watch History” list, which allows users to resume watching a video from where they left off. This tracking of the watched percentage is somewhat special, because our video player sends this information to our endpoint every 10 seconds, via an HTTP POST call. This endpoint, implemented in Ruby, represents 80% of the Video User Profile Service's throughput. Simplifying a bit, we have a split of READ (20%) and WRITE (80%) operations, as we can see below:

Simplified version — Ruby for read and Elixir for write

Explaining the Ruby (POST) version

The original system implementing this endpoint is a traditional RoR application. After saving a video to someone’s list, we need to update the list counter for that specific video, because we need to know how many times a video was added to favorites. Say we have video 1234 and ten users add it to their favorites; the overall COUNTER for video 1234 must then be 10.

Counter for Favorites

As with traditional Rails applications, the original system blocks its entire process for each request. Given that updating the counter is a slow operation, the usual solution in RoR applications is a background job, using something like Resque or Sidekiq. Both solutions use Redis as a message queue to allow workers to execute jobs. I consider this a workaround, because the platform has no built-in solution. This is illustrated in the diagram below:

Ruby application needs another “application” to deal with async jobs

Creating the Elixir version

Elixir has tools similar to Ruby's. In the Ruby ecosystem, we use Rake, Bundler (with Gemfiles), and so on; Elixir has Mix. Mix draws on years of experience from the Rails and Ruby communities and combines almost everything into a single, fast, well-designed tool. From starting a project with mix new to releasing a version with mix release, you can install dependencies and run tests, and your project configuration lives in a "mix file", called mix.exs.

Beyond the tooling, the syntax is very similar and the move feels natural. I think the hardest part is shifting from the object-oriented paradigm to the functional one, but if you're already familiar with functional programming, it's very easy.

Along with the functional programming paradigm, Elixir also has OTP, inherited from the Erlang platform. This environment deserves an article of its own, but in short, it is a set of abstractions that lets developers create concurrent, parallel, and distributed software, with application and supervision capabilities. It manages the whole process lifecycle, from spawn to exit, restarting on failure, and enables an easy implementation of the crash-only software philosophy.
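To make the supervision idea concrete, here is a minimal, self-contained sketch of a counter process supervised by OTP. The module name VideoCounter and its in-memory map state are illustrative, not taken from the real service; the point is that the supervisor restarts the process if it crashes, with no extra infrastructure.

```elixir
# Illustrative sketch: a GenServer keeping per-video counters in a map,
# placed under a supervisor so a crash restarts it automatically.
defmodule VideoCounter do
  use GenServer

  # Client API
  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  def increment(video_id), do: GenServer.cast(__MODULE__, {:increment, video_id})
  def count(video_id), do: GenServer.call(__MODULE__, {:count, video_id})

  # Server callbacks
  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:increment, video_id}, state) do
    {:noreply, Map.update(state, video_id, 1, &(&1 + 1))}
  end

  @impl true
  def handle_call({:count, video_id}, _from, state) do
    {:reply, Map.get(state, video_id, 0), state}
  end
end

# The supervisor restarts VideoCounter on failure (crash-only philosophy).
{:ok, _sup} = Supervisor.start_link([VideoCounter], strategy: :one_for_one)
```

A real counter would persist to the database rather than a map, but the supervision wiring is the same.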

The Architectural Changes

An important change when migrating the Ruby POST endpoint to Elixir was no longer needing a Resque application to handle async tasks. With this change, we removed 20 Docker containers responsible for the Update Counter job. Elixir has the Task module, which provides async tasks with the option of being supervised in case of failure. Combined with Poolboy, we can throttle the throughput of the Update Counter job using a pool of workers. The main difference from the Ruby solution is that everything is part of the Elixir and OTP architecture, which is well tested and mature. This is the final result:
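A minimal sketch of the fire-and-forget pattern described above, using only the standard library's Task.Supervisor (the Poolboy throttling layer is omitted here). The module and registered names are illustrative, and the "slow" counter update is simulated with a sleep instead of a database write.

```elixir
# Illustrative sketch: running the Update Counter job as a supervised Task,
# so the HTTP request process can return immediately.
defmodule UpdateCounter do
  # Simulated slow operation; the real job updates a database counter.
  def run(video_id) do
    Process.sleep(50)
    send(:listener, {:updated, video_id})
  end
end

# Start a Task.Supervisor to watch over the async jobs.
{:ok, _sup} = Task.Supervisor.start_link(name: UpdateCounter.TaskSupervisor)

# Register the current process so the job can notify us when done.
Process.register(self(), :listener)

# Fire and forget: the job runs in its own supervised process,
# replacing what a Resque worker did in the Ruby version.
Task.Supervisor.start_child(UpdateCounter.TaskSupervisor, fn ->
  UpdateCounter.run("1234")
end)
```

In the real service, a Poolboy checkout around the task body limits how many counter updates run at once.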

With Elixir, we do not need Resque anymore

Frameworks and Libraries

To develop the POST endpoint, we chose the Phoenix Framework, which is similar to Rails API and Rails itself.

To deal with caching, the best option we found was Cachex, which uses ETS and can replicate across nodes. ETS uses the BEAM VM's memory, so it's the fastest way to fetch an object when needed. Comparing it to a cache layer using Redis or Memcached is almost like comparing L1 cache with RAM access (of course, I'm exaggerating). Cachex has been working perfectly since we put our application into production.
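Since Cachex builds on ETS, a minimal raw-ETS sketch shows why this kind of cache is so fast: reads and writes happen in the BEAM's own memory, with no network round trip. The table and key names below are illustrative.

```elixir
# Illustrative sketch of the ETS layer underneath a cache like Cachex.
# :profile_cache and the keys are made-up names for this example.
table = :ets.new(:profile_cache, [:set, :public, read_concurrency: true])

# Insert and lookup are in-memory operations inside the VM,
# unlike Redis or Memcached, which require a network hop.
:ets.insert(table, {"user:42:favorites", ["1234", "5678"]})

result =
  case :ets.lookup(table, "user:42:favorites") do
    [{_key, favorites}] -> favorites
    [] -> :cache_miss
  end
```

Cachex layers TTLs, hooks, and cross-node replication on top of this primitive.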

For tests, Elixir provides ExUnit, a great tool, fully integrated with Mix, so you do not need to install another one. It's comparable to RSpec and Test::Unit. Alongside ExUnit, we're using FakeServer (created by my friend and coworker Bernardo Lins) to stub HTTP requests.
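For readers new to ExUnit, here is a minimal, self-contained example runnable as a plain .exs script. The WatchHistory module and its watched-percentage function are illustrative, not code from the real service.

```elixir
# Minimal ExUnit example; autorun is disabled so we control when tests run.
ExUnit.start(autorun: false)

defmodule WatchHistory do
  # Illustrative helper: watched percentage of a video, capped at 100.
  def watched_percent(watched_seconds, total_seconds) do
    min(round(watched_seconds / total_seconds * 100), 100)
  end
end

defmodule WatchHistoryTest do
  use ExUnit.Case, async: true

  test "computes the watched percentage" do
    assert WatchHistory.watched_percent(30, 120) == 25
  end

  test "caps the percentage at 100" do
    assert WatchHistory.watched_percent(150, 120) == 100
  end
end

result = ExUnit.run()
```

In a Mix project you would skip the explicit start/run calls and simply run mix test.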

To access MongoDB, we had some issues. There is a Mongo driver, but it's not evolving and lacks support for replica sets. I started a fork, called MongoX. It integrates with Ecto 1.1.9 via mongox_ecto. It needs improvements, but it has full replica set support, and we're using it in production.

We're using other libraries as listed below:

The current version (Ruby and Elixir)

The Video User Profile Service now has a mixed solution, using Ruby for queries and Elixir for commands. It's something like CQRS (Command Query Responsibility Segregation), but not entirely canonical: although the Ruby application reads and the Elixir application writes, we didn't isolate the command model from the query model. The structures are identical, but we're moving toward isolating them. It still needs some improvements, and we're working on them right now. The next picture shows the current state:

Current CQRS Ruby and Elixir Video User Profile Service

Conclusion

After I wrote the first article, I received lots of questions about the implementation details. I hope this article is a good answer for everyone interested in this case. Of course, there will always be uncovered questions, so feel free to ask about anything.

Update

In 2017, we rewrote the core engine using GenStage.