This is it. It’s this simple to manage concurrent EventSource connections using the Play Framework and the Akka Actor model.

How do we know that this works well at scale? Read the next few sections to find out.

Load-testing with real production traffic

There is only so much that one can simulate with load-testing tools. Ultimately, a system needs to be tested against not-easily-replicable production traffic patterns. But how do we test against real production traffic before we actually launch our product? For this, we used a technique that we like to call a “dark launch.” This will be discussed in more detail in a later post.

For the purposes of this post, let’s say that we are able to generate real production traffic on a cluster of machines running our server. An effective way to test the limits of the system is to direct increasing amounts of traffic to a single node, uncovering the problems the entire cluster would face if traffic increased manifold.

As with anything else, we hit some limits, and the following sections tell the fun story of how we eventually reached a hundred thousand connections per machine through simple optimizations.

Limit I: Maximum number of pending connections on a socket

During some of our initial load testing, we ran into a strange problem: we could not open more than about 128 new connections at once. To be clear, the server could easily hold thousands of concurrent connections, but we could not add more than roughly 128 connections to that pool simultaneously. In the real world, this would be the equivalent of 128 members initiating a connection to the same machine at the same moment.

After some investigation, we learned about the following kernel parameter.

net.core.somaxconn

This kernel parameter is the size of the backlog of TCP connections waiting to be accepted by the application. If a connection indication arrives when the queue is full, the connection is refused. The default value for this parameter is 128 on most modern operating systems.

Bumping up this limit in /etc/sysctl.conf helped us get rid of the “connection refused” issues on our Linux machines.
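On Linux, you can check the current value and raise it without a reboot. This is a minimal sketch; the value 1024 is illustrative, chosen to match the application-level setting shown next, and the right number depends on your traffic:

```shell
# Read the kernel's accept-backlog cap; pending connections beyond this
# queue are refused until the application calls accept().
limit=$(cat /proc/sys/net/core/somaxconn)
echo "net.core.somaxconn is currently $limit"

# To raise it at runtime (requires root):
#   sysctl -w net.core.somaxconn=1024
# To persist the change across reboots, add this line to /etc/sysctl.conf:
#   net.core.somaxconn=1024
# and reload with: sysctl -p
```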

Please note that Netty 4.x and above automatically pick up the OS value for this parameter and use it when creating the Java ServerSocket. However, if you would rather also configure this at the application level, you can set the following configuration parameter in your Play application.

play.server.netty.option.backlog=1024

Limit II: JVM thread count

A few hours after we allowed a significant percentage of production traffic to hit our server for the first time, we were alerted that the load balancer was unable to connect to a few of our machines. On further investigation, we saw the following error all over our server logs.

java.lang.OutOfMemoryError: unable to create new native thread

The following graph of the JVM thread count on our machines corroborated that we were dealing with a thread leak and running out of memory.
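A leak like this shows up as a monotonically growing native thread count. One quick way to watch it on Linux is to count the entries under /proc/&lt;pid&gt;/task. In this sketch, PID points at the current shell purely so the example runs standalone; in practice you would substitute your JVM process’s pid:

```shell
# Count the native threads of a process via the /proc filesystem.
# PID is the current shell here only for illustration; point it at the
# JVM process (e.g. found with pgrep) in real use.
PID=$$
threads=$(ls "/proc/$PID/task" | wc -l)
echo "process $PID has $threads native threads"
```

Sampling this periodically (or graphing it from your metrics system) makes a leak obvious long before the JVM hits its native thread limit.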