Hey there! My name is Michal 0xDEADB33F Ptaszek, and I’m a software architect at Riot. Today I would like to talk about communication. But not the kind of communication you’re probably thinking of. I want to talk about the other, more exciting kind of communication: LoL players communicating with chat servers during a tense game; authentication servers communicating with the LoL client on login; microservices that route state changes between clients in the middle of the night - you know, that kind of communication.

At a high level, communication between services can be split into two groups: synchronous requests, where the sender is blocked until it receives a response, and asynchronous eventing, where messages are fired to the receiver without waiting for a reaction. I want to talk about the latter through the lens of the Riot Messaging Service, a service we built to support scenarios where backend services must inform clients about certain events - such as state changes - in an asynchronous way. The Riot Messaging Service is specifically designed to handle service-to-client messaging, which comes with its own challenges, requirements, and assumptions.

In this article, I’ll start with a discussion about state changes and the importance of stateful services. Then I’ll move on to the architecture of the Riot Messaging Service that enables linear scalability and high fault tolerance. Last but not least, I’ll focus on a slice of our journey from drawing board to production servers that can support 10 million player connections on a single box. Let’s dive in!

State

At Riot Games, we’re big fans of microservice architecture. This approach to developing software has tons of benefits for us, including separation of concerns, independent deployability, scalability, and language flexibility (some teams love Java, others like Go, some are even crazy enough to write their stuff in Erlang).

Each of the many microservices running in League is responsible for its own state. Let's take Clubs as an example. Clubs are player-created, player-organized, and player-controlled social groups. The Clubs service stores membership information, tags, messages of the day, member ranks, and a few other useful details. Whenever you log into the client, it fetches your latest state from the Clubs service and renders it so that you and your friends can pick up where you left off.

But once we have the initial state, how do we receive updates when, for example, an officer changes the message of the day ("Team comp of the day: yordles!")? Should the client actively poll the Clubs service every few seconds to check if there were any changes (the "are we there yet?" approach)? Should the client wait until a player relogs to fetch a fresh state (hmm, questionable)? Or maybe clients should establish persistent connections to the Clubs service (using the Comet approach, for instance), giving us hundreds of thousands of active connections on the Clubs service side and dozens of connections in each client, one for each service the client will use?

In short, there are lots of options, and the Riot Messaging Service (RMS for short) has been built specifically to solve this problem. It's a backend service that allows other services to publish messages and enables clients to receive them. So when a state change occurs in a player’s Club, the Clubs service can publish a message to RMS asking it to inform the other Club members’ clients to refresh their local Clubs state.
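
To make the publish flow concrete, here is a minimal sketch of what the Clubs side of that exchange could look like. Everything in it is an assumption made for illustration: `FakeRms`, `publish`, `notify_club_changed`, and the JSON payload shape are hypothetical names, not the real RMS API. The sketch also assumes the member who made the change does not need to be notified.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FakeRms:
    """Hypothetical in-memory stand-in for RMS; records payloads per recipient."""
    inbox: dict = field(default_factory=dict)

    def publish(self, recipient_id: str, payload: bytes) -> None:
        # The stand-in only looks at the recipient; the payload stays opaque bytes.
        self.inbox.setdefault(recipient_id, []).append(payload)


def notify_club_changed(rms, club_id, member_ids, changed_by):
    """Ask every other member's client to refresh its local Clubs state.

    Returns the number of members that were notified.
    """
    # Assumed payload shape: a hint to refetch, not the new state itself.
    payload = json.dumps(
        {"resource": "clubs", "club_id": club_id, "action": "refresh"}
    ).encode("utf-8")

    notified = 0
    for member_id in member_ids:
        if member_id == changed_by:
            continue  # assumption: the originating client already has the new state
        rms.publish(member_id, payload)
        notified += 1
    return notified
```

Calling `notify_club_changed(rms, "club-1", ["a", "b", "c"], changed_by="a")` would publish one refresh hint each for `b` and `c`. Sending a hint instead of the full state is a design choice of this sketch: the Clubs service stays the single owner of its data, and clients fetch the details from it when they act on the message.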

Philosophically, RMS is similar to a mobile push notification service. Such a service is responsible for delivering small push events to phones, and it supports a whole constellation of mobile applications without needing to understand what's actually being routed through it. Mobile push notifications are also one-way asynchronous events, and in most cases it's up to the receiving application to act on the received notification.

RMS Under the Hood

At a high level, RMS consists of two main tiers: RMS Edge and RMS Routing.

The RMS Edge (RMSE) tier is a collection of independent servers responsible for hosting player client connections. League clients connect to an RMSE node sitting behind a load balancer using an encrypted WebSocket connection. The connection is established after successful authentication, persists throughout the player’s session, and is terminated on logout.

On top of handling authentication and holding client connections, each RMSE server is also responsible for delivering incoming published messages to connected clients. For every new session registered or existing session terminated, an RMSE node will make a request to the RMS Routing tier to inform it about the change.
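
The responsibilities of a single edge server can be sketched in a few lines. This is an illustrative model under stated assumptions, not the actual RMSE implementation: `EdgeNode`, `RecordingRouting`, and all method names are hypothetical, a plain callable stands in for a WebSocket connection, and authentication is left out.

```python
class RecordingRouting:
    """Hypothetical stand-in for the routing tier; just records notifications."""

    def __init__(self):
        self.events = []

    def session_opened(self, player_id, node_id):
        self.events.append(("opened", player_id, node_id))

    def session_closed(self, player_id, node_id):
        self.events.append(("closed", player_id, node_id))


class EdgeNode:
    """Holds local client sessions and reports every change to the routing tier."""

    def __init__(self, node_id, routing):
        self.node_id = node_id
        self.routing = routing
        self.sessions = {}  # player_id -> callable that sends bytes to that client

    def open_session(self, player_id, send):
        self.sessions[player_id] = send
        self.routing.session_opened(player_id, self.node_id)

    def close_session(self, player_id):
        # Only notify routing if the session actually existed on this node.
        if self.sessions.pop(player_id, None) is not None:
            self.routing.session_closed(player_id, self.node_id)

    def deliver(self, player_id, payload):
        """Push a published message to a locally connected client."""
        send = self.sessions.get(player_id)
        if send is None:
            return False  # player is not connected to this node
        send(payload)
        return True
```

Note that the node keeps no reference to any other edge node: the only thing it talks to besides its own clients is the routing tier.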

Because RMSE servers don't know about each other, the tier scales linearly: each server we add contributes its full capacity, with no coordination between peers. Another interesting property is that failures stay isolated to the server where they occur: if one server crashes or has performance hiccups, its issues will not affect adjacent servers.

Here's a simple diagram showing how the RMSE tier is structured:

The RMS Routing (RMSR) tier, in turn, is a layer of clustered servers responsible for a global view of all client sessions across all RMSE servers. RMSR nodes hold a global, distributed table mapping player identifiers to the RMSE nodes that hold their sessions. The RMSR tier also processes incoming published messages from other services and routes them to the appropriate RMSE nodes. Finally, RMSR servers keep track of the health of RMSE tier nodes and perform the necessary cleanup whenever something bad happens to one of them.
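
The bookkeeping described above can be modeled with two dictionaries. This is a deliberately simplified, single-process sketch: the real table is distributed across clustered RMSR nodes, which this example does not attempt to reproduce. `RoutingTable` and its method names are hypothetical, and the handling of reconnects and stale close events is an assumption of the sketch.

```python
from collections import defaultdict


class RoutingTable:
    """Maps each player to the edge node currently holding their session."""

    def __init__(self):
        self._node_by_player = {}
        # Reverse index so that cleaning up a failed node does not scan every player.
        self._players_by_node = defaultdict(set)

    def session_opened(self, player_id, node_id):
        previous = self._node_by_player.get(player_id)
        if previous is not None:
            # Assumption: a new session replaces any older one for the same player.
            self._players_by_node[previous].discard(player_id)
        self._node_by_player[player_id] = node_id
        self._players_by_node[node_id].add(player_id)

    def session_closed(self, player_id, node_id):
        # Ignore a stale close from a node the player has already moved away from.
        if self._node_by_player.get(player_id) == node_id:
            del self._node_by_player[player_id]
            self._players_by_node[node_id].discard(player_id)

    def route(self, player_id):
        """Return the edge node to forward a published message to, or None."""
        return self._node_by_player.get(player_id)

    def node_failed(self, node_id):
        """Drop every session that lived on a failed edge node; return the count."""
        players = self._players_by_node.pop(node_id, set())
        for player_id in players:
            if self._node_by_player.get(player_id) == node_id:
                del self._node_by_player[player_id]
        return len(players)
```

When a publisher sends a message for a player, the routing tier would call something like `route(player_id)` and forward the message to the returned edge node. A result of `None` means the player has no active session, and what happens to the message in that case is outside the scope of this sketch.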

Architecturally, RMSR looks like this: