At Greta we’re building a script for P2P data distribution on top of the WebRTC DataChannel. One crucial aspect of delivering the data as fast as possible is an extremely fast signaling system that scales well. We’ve tried external signaling systems such as PubNub and Pusher, and built several versions of our own, but we were never satisfied with the speed and scaling capabilities. Until now. In this post I will share some of what we learned building a signaling system that handles more than 10M concurrent users per machine and meets our need for speed.

What we needed

Let’s start by looking at the requirements of our signaling system:

Ability to handle a minimum of 1M concurrent connections on as little CPU and RAM as possible.

Needs to pass messages very fast so that it does not add latency to our system.

Needs the ability to send both 1:1 and 1:N messages.

Needs to handle a high volume of messages in flight.

Apart from looking at what you do need out of a signaling system, I think it is equally important to know what you don’t need. In our case, the things we don’t need are:

Persistent messages. If we can’t deliver the message fast enough we don’t need to deliver it at all.

Playback of messages. This follows from the previous point: messages that are never stored cannot be replayed.

Integration with any third party or legacy system that is not in the original design.
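
To make the two lists above concrete, here is a minimal in-memory sketch in Go of what 1:1 and 1:N delivery without persistence can look like. It is an illustration under assumptions, not the production system: the names Hub, Join, Send and Broadcast, the idea of a "room" and the per-peer buffer size are all made up for this example. The important part is the non-blocking send: if a peer's queue is full, the message is dropped rather than stored.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a hypothetical in-memory router used only for illustration.
type Hub struct {
	mu    sync.RWMutex
	peers map[string]chan string         // peer ID -> outbound queue
	rooms map[string]map[string]struct{} // room -> set of peer IDs
}

func NewHub() *Hub {
	return &Hub{
		peers: map[string]chan string{},
		rooms: map[string]map[string]struct{}{},
	}
}

// Join registers a peer in a room and returns its outbound queue.
func (h *Hub) Join(room, id string, buffer int) <-chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan string, buffer)
	h.peers[id] = ch
	if h.rooms[room] == nil {
		h.rooms[room] = map[string]struct{}{}
	}
	h.rooms[room][id] = struct{}{}
	return ch
}

// offer enqueues without blocking. A full queue means the message is
// dropped: nothing is persisted and nothing is replayed later.
func offer(ch chan string, msg string) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false
	}
}

// Send is 1:1 delivery. It reports whether the message was queued.
func (h *Hub) Send(to, msg string) bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	ch, ok := h.peers[to]
	return ok && offer(ch, msg)
}

// Broadcast is 1:N delivery to everyone in the room except the sender.
// It returns the number of peers whose queue accepted the message.
func (h *Hub) Broadcast(room, from, msg string) int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	n := 0
	for id := range h.rooms[room] {
		if id != from && offer(h.peers[id], msg) {
			n++
		}
	}
	return n
}

func main() {
	h := NewHub()
	alice := h.Join("video-42", "alice", 1)
	h.Join("video-42", "bob", 1)

	h.Send("alice", "offer from bob")
	fmt.Println(<-alice)
	fmt.Println(h.Broadcast("video-42", "alice", "hello room"))
}
```

Dropping on a full queue keeps a slow peer from holding up everyone else, which is only acceptable because late messages are worthless in this use case.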

The first version

Our first version scaled to 1M concurrent connections per server and used Server-Sent Events and a POST API for communication.

Every client connected to one of the servers and then sent its messages as POST requests to an API endpoint, which forwarded each POST to every server with connections. Those servers could then push the message to their clients over the Server-Sent Events channels.
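
The flow described above can be sketched for a single server as follows. This is a simplified illustration, not the code of the first version: the Broker type, the handler names and the buffer size are assumptions, and the forwarding of each POST to the other servers is left out. Each open Server-Sent Events stream subscribes to the broker, and the POST handler publishes the request body to all of them.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// Broker fans a posted message out to every open SSE stream on this server.
// Hypothetical type for illustration; cross-server forwarding is omitted.
type Broker struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

func NewBroker() *Broker { return &Broker{subs: map[chan string]struct{}{}} }

func (b *Broker) Subscribe() chan string {
	ch := make(chan string, 16) // assumed buffer size
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch
}

func (b *Broker) Unsubscribe(ch chan string) {
	b.mu.Lock()
	delete(b.subs, ch)
	b.mu.Unlock()
}

// Publish queues msg for every subscriber and returns how many accepted it.
func (b *Broker) Publish(msg string) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := 0
	for ch := range b.subs {
		select {
		case ch <- msg:
			n++
		default: // slow subscriber: drop instead of blocking
		}
	}
	return n
}

// FormatSSE encodes a payload as one Server-Sent Events frame:
// one "data:" line per payload line, terminated by a blank line.
func FormatSSE(payload string) string {
	var sb strings.Builder
	for _, line := range strings.Split(payload, "\n") {
		sb.WriteString("data: " + line + "\n")
	}
	sb.WriteString("\n")
	return sb.String()
}

// Events is the long-lived GET endpoint that streams messages to a client.
func (b *Broker) Events(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	ch := b.Subscribe()
	defer b.Unsubscribe(ch)
	for {
		select {
		case msg := <-ch:
			fmt.Fprint(w, FormatSSE(msg))
			flusher.Flush()
		case <-r.Context().Done():
			return
		}
	}
}

// Post is the endpoint a client calls to send a message.
func (b *Broker) Post(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "POST only", http.StatusMethodNotAllowed)
		return
	}
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	b.Publish(string(body))
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	b := NewBroker()
	ch := b.Subscribe() // stands in for one connected SSE client
	req := httptest.NewRequest(http.MethodPost, "/signal", strings.NewReader("offer"))
	b.Post(httptest.NewRecorder(), req)
	fmt.Print(FormatSSE(<-ch))
}
```

In a real server the two handlers would be mounted with http.HandleFunc. Note that every outgoing message costs the client a full HTTP request, while incoming messages reuse one open stream.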

However, it turned out that making a lot of POST requests from the clients was significantly slower and didn’t perform as well as sending data over a websocket connection.

The second version

We decided to rebuild the system by using websocket connections instead. This time we set out with two new requirements:

Use websockets.

Scale across servers when a message arrives on the websocket.

This time we built a system based on Erlang and RabbitMQ. With Erlang we were able to handle the 1M concurrent users very well, but a few things still slowed the system down. To make the messaging map well onto RabbitMQ we had to transform every message when it passed between RabbitMQ and the websocket connection, which consumed extra resources and added some latency. But overall we were happy with the system: it was the fastest signaling system we had used so far, it could handle a high volume of messages and connections, and it met all of our requirements.

There were only two things that concerned us. The first was that the configuration of the RabbitMQ cluster with RAM nodes and a stats node was a bit more complex than we wanted, although it was reasonable given the system’s scalability and speed. The second was the relative inefficiency of having to transform every single message twice as it passed through the system.
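
To illustrate what transforming every message twice can mean in practice, here is a sketch in Go (the second version itself was written in Erlang). The message shapes are purely hypothetical, since the actual formats are not described here: ClientMsg, BrokerMsg, the "peer." routing key prefix and the field names are assumptions. The point is only that each message is decoded and re-encoded once on the way into the broker and once on the way out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ClientMsg is a hypothetical wire format on the websocket side.
type ClientMsg struct {
	To   string          `json:"to"`
	Type string          `json:"type"`
	Data json.RawMessage `json:"data"`
}

// payload is the hypothetical body stored inside the broker envelope.
type payload struct {
	Type string          `json:"type"`
	Data json.RawMessage `json:"data"`
}

// BrokerMsg is a hypothetical broker-side envelope:
// a routing key plus an opaque body.
type BrokerMsg struct {
	RoutingKey string
	Body       []byte
}

// ToBroker is transformation #1: decode the websocket frame and
// re-encode it as a broker message addressed by routing key.
func ToBroker(frame []byte) (BrokerMsg, error) {
	var m ClientMsg
	if err := json.Unmarshal(frame, &m); err != nil {
		return BrokerMsg{}, err
	}
	body, err := json.Marshal(payload{Type: m.Type, Data: m.Data})
	if err != nil {
		return BrokerMsg{}, err
	}
	return BrokerMsg{RoutingKey: "peer." + m.To, Body: body}, nil
}

// FromBroker is transformation #2: decode the broker message and
// re-encode it as a websocket frame for the receiving client.
func FromBroker(b BrokerMsg) ([]byte, error) {
	var p payload
	if err := json.Unmarshal(b.Body, &p); err != nil {
		return nil, err
	}
	to := strings.TrimPrefix(b.RoutingKey, "peer.")
	return json.Marshal(ClientMsg{To: to, Type: p.Type, Data: p.Data})
}

func main() {
	frame := []byte(`{"to":"bob","type":"offer","data":{"sdp":"v=0"}}`)
	bm, err := ToBroker(frame)
	if err != nil {
		panic(err)
	}
	out, err := FromBroker(bm)
	if err != nil {
		panic(err)
	}
	fmt.Println(bm.RoutingKey)
	fmt.Println(string(out))
}
```

The frame that comes out is identical to the one that went in, so all four encode and decode steps are pure overhead from the client's point of view.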

The third and current version

As Greta took on larger sites, the downsides of the second system became more noticeable for us developers. We started thinking about an even better system, but realised it wasn’t gonna be easy to beat the one we already had in production.

We had previously done some experimenting with Golang and were starting to reach 1M concurrent connections, which seemed like a good starting position.

When we started reading about NATS and decided to try it out, we quickly got very interested! We realised that NATS could help us with most of our requirements and, interestingly, that it didn’t spend effort on the things we had listed as non-requirements. After a few days of development we had a first prototype. It wasn’t yet capable of handling 1M concurrent connections, but we could already tell that it was very fast! We decided to dig deeper into Golang, and after a while we started seeing massive improvements.

What we ended up with is a system that doesn’t need to transform the messages and is easier to configure and get up and running. By combining NATS and Golang, the new signaling system is now able to handle 10M+ concurrent connections on a virtual machine of decent size. On a machine that is smaller than the one on which we previously handled 1M concurrent users, we are now handling 1.2M concurrent users, meaning that the new system isn’t only better at scaling, it requires fewer resources as well. Equally exciting: the new signaling system delivers messages 32% faster on average, which means that we have successfully combined our two most important goals, scaling and speed.
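
A rough sketch of why no transformation is needed: NATS routes messages by subject and treats the payload as opaque bytes, so a proxy can pass websocket frames through untouched. The code below is an in-memory stand-in written for illustration, not the NATS client API and not the production proxy. The Bus type and the subject layout "signal.<room>.<peer>" are assumptions. The wildcard rules follow the NATS conventions, where "*" matches exactly one token and ">" matches one or more trailing tokens.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// SubjectMatches reports whether subject matches pattern using NATS-style
// wildcards: "*" matches one token, ">" matches one or more trailing tokens.
func SubjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// ">" must be the last token and needs at least one token to match.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) || (tok != "*" && tok != s[i]) {
			return false
		}
	}
	return len(p) == len(s)
}

type sub struct {
	pattern string
	handle  func(subject string, payload []byte)
}

// Bus is a hypothetical in-memory stand-in for a subject-based message bus.
type Bus struct {
	mu   sync.RWMutex
	subs []sub
}

func NewBus() *Bus { return &Bus{} }

// Subscribe registers a handler for every subject matching pattern.
func (b *Bus) Subscribe(pattern string, handle func(subject string, payload []byte)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs = append(b.subs, sub{pattern: pattern, handle: handle})
}

// Publish hands the payload, unchanged, to every matching subscriber and
// returns how many handlers were called. The payload is never decoded.
func (b *Bus) Publish(subject string, payload []byte) int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	n := 0
	for _, s := range b.subs {
		if SubjectMatches(s.pattern, subject) {
			s.handle(subject, payload)
			n++
		}
	}
	return n
}

func main() {
	bus := NewBus()
	// A 1:N listener for everything sent to peers in room1.
	bus.Subscribe("signal.room1.*", func(subject string, p []byte) {
		fmt.Printf("%s: %s\n", subject, p)
	})
	// The bytes read from the websocket are published as they are.
	bus.Publish("signal.room1.bob", []byte(`{"type":"offer"}`))
}
```

With this kind of routing, a 1:1 message is a publish to one peer's subject and a 1:N message is a publish that several wildcard subscriptions match, and in both cases the frame is forwarded byte for byte.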

There are probably things we could optimise in the Golang proxy to make the performance even better, but for now this is definitely the fastest and most effective signaling system we have come across!