In my previous installment I discussed how our application must accept up to 1,000 new TCP/IP connections per second and handle upwards of 40,000 simultaneous TCP/IP connections, what we tried in order to make that work under Ruby/JRuby, and how we eventually decided to try Elixir and Erlang to solve the problem. If you haven’t read part 1, I’d suggest reading it now. Go ahead… I’ll wait…

So, now that we’ve accepted 40,000 connections talking over SSL, what are we going to do with them? That was our next challenge.

TL;DR – (Erlang) Processes FTW

So each time we called Owsla.TcpSupervisor.start_endpoint(ssl_socket) in part 1, we created a new Erlang process for that SSL connection. That process has two responsibilities. First, it handles reading data from the socket (which is generally text-based and line-oriented) and putting it, along with some routing information, onto a RabbitMQ queue. Second, it can receive messages from any Erlang process in our cluster and send them back to the device.

40,000 Processes — is this guy nuts?

No, I’m not nuts – I’m using Erlang. The Erlang language and runtime have their own notion of a “process,” far lighter weight than an OS thread or process, which makes creating new processes very inexpensive and lets the VM run tens (if not hundreds) of thousands of them efficiently. Native threads on most modern OSes reserve up to several megabytes of memory for their initial stack, while an Erlang process starts with only a few hundred words of heap and stack and grows them dynamically as needed.
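To make this concrete, here’s a quick experiment you can run in iex: it spawns 50,000 idle processes and estimates the per-process memory cost. The exact numbers are machine- and VM-version-dependent; expect a few kilobytes per process, not megabytes.

```elixir
# Spawn 50,000 processes that each just wait for a :stop message, then
# estimate how much memory the VM is using per process.
before = :erlang.memory(:processes)

pids =
  Enum.map(1..50_000, fn _ ->
    spawn(fn ->
      receive do
        :stop -> :ok
      end
    end)
  end)

after_spawn = :erlang.memory(:processes)
per_process_bytes = div(after_spawn - before, length(pids))
IO.puts("roughly #{per_process_bytes} bytes per process")

# Clean up.
Enum.each(pids, &send(&1, :stop))
```

On my machines this comes out to roughly 2–3 KB per process, which is why 40,000 connections-as-processes is entirely reasonable.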

The VM also has its own scheduler, separate from the OS scheduler, which helps keep expensive OS-level context switching at bay. Notably, Erlang’s scheduler is preemptive: each process gets a budget of “reductions” (roughly, function calls) and is suspended when the budget runs out. Many other “non-OS-thread concurrency solutions” are cooperative instead, requiring the programmer to yield time back to the runtime, which allows a tight, long-running loop to hog CPU time without end.

The TcpEndpoint

So, what does this Endpoint look like anyway? Let’s walk through it and dissect some things.

```elixir
defmodule Owsla.TcpEndpoint do
  use GenServer.Behaviour
```

We start by defining our module and using Elixir’s GenServer.Behaviour, which tells OTP that this module is a gen_server.

```elixir
defrecord EndpointState, socket: nil, id: nil, sequence_no: 0, closed: false
```

Next, we define a record type called EndpointState that describes the current state of our endpoint. Note that for very simple GenServers, it’s probably fine to use a simple tuple for state, but at some point it becomes easier to manage if you create a record. Our state includes the SSL socket, a unique identifier, a sequence number for messages being sent to our back-end systems, and a flag to tell if, somehow, our process is still up but the SSL socket has been closed.

```elixir
def start_link(socket, id) do
  :gen_server.start_link({:global, id}, __MODULE__, {socket, id}, [])
end

def init({socket, id}) do
  {:ok, EndpointState.new(socket: socket, id: id)}
end
```

Here, we’ve got our boilerplate “Start up my gen_server” code.

```elixir
def send(connection_id, message) do
  :gen_server.cast {:global, connection_id}, {:send, message}
end
```

The send function is a simple wrapper around a call to :gen_server.cast as a convenience to consumers. It takes a connection ID (which is a UUID in our case) and a message to send back down the pipe. Next, we have our gen_server handling code:

```elixir
def handle_cast({:start}, state) do
  :ssl.setopts(state.socket, [active: :once])
  {:noreply, state}
end
```

Defining handle_cast functions that pattern match on certain messages is how you implement your behavior in an Erlang gen_server (part of the OTP library). This one (handling the :start message) is invoked by the TcpAcceptor way back in part 1, once it has transferred ownership of the SSL socket to this process. It tells the process that it can start working with the socket without crashing for accessing another process’s socket.

Note that we set the SSL socket to [active: :once] rather than [active: true]. The difference is important: with [active: true], a peer that sends data faster than you can process it will flood your process’s mailbox with TCP/IP messages. For more information, take a look at the Buckets of Sockets section of Learn You Some Erlang for Great Good!. Then, read all the rest of it for good measure.
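Here’s a small, self-contained sketch of the [active: :once] flow-control pattern using plain :gen_tcp on the loopback interface (the :ssl calls in our endpoint behave the same way); the payloads are arbitrary:

```elixir
# Set up a loopback connection with line-oriented framing.
{:ok, listener} = :gen_tcp.listen(0, [:binary, packet: :line, active: false])
{:ok, port} = :inet.port(listener)
{:ok, client} = :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false])
{:ok, socket} = :gen_tcp.accept(listener)

# The peer sends two lines back to back...
:ok = :gen_tcp.send(client, "line one\nline two\n")

# ...but with [active: :once], exactly ONE {:tcp, socket, data} message
# is delivered to our mailbox, no matter how fast the peer is.
:ok = :inet.setopts(socket, active: :once)

data1 =
  receive do
    {:tcp, ^socket, d} -> d
  end

# The socket is passive again: nothing else arrives until we re-arm it.
quiet =
  receive do
    {:tcp, ^socket, _} -> :flooded
  after
    100 -> :quiet
  end

# Re-arm to get the next line.
:ok = :inet.setopts(socket, active: :once)

data2 =
  receive do
    {:tcp, ^socket, d} -> d
  end

IO.inspect({data1, quiet, data2})
```

The re-arm step is exactly what our {:ssl, _, data} handler does below after processing each line.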

```elixir
def handle_cast({:send, message}, state) do
  unless state.closed do
    :ssl.send(state.socket, message)
  end
  {:noreply, state}
end
```

This handle_cast matches our :send from before. It simply checks to see if our connection has been closed and, if not, sends the message down the SSL socket.
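To see the {:global, id} addressing and the {:send, message} cast in action without the SSL plumbing, here’s a hypothetical stand-in endpoint written against the current GenServer module (the code in this post uses the older GenServer.Behaviour API); it accumulates “sent” messages instead of writing to a socket:

```elixir
defmodule DemoEndpoint do
  use GenServer

  # Register under {:global, id} so any process on any connected node
  # can reach this endpoint knowing only the connection ID.
  def start_link(id),
    do: GenServer.start_link(__MODULE__, [], name: {:global, id})

  def send_message(id, message),
    do: GenServer.cast({:global, id}, {:send, message})

  def sent(id),
    do: GenServer.call({:global, id}, :sent)

  @impl true
  def init(acc), do: {:ok, acc}

  @impl true
  def handle_cast({:send, message}, acc), do: {:noreply, [message | acc]}

  @impl true
  def handle_call(:sent, _from, acc), do: {:reply, Enum.reverse(acc), acc}
end

{:ok, _pid} = DemoEndpoint.start_link("conn-1234")
DemoEndpoint.send_message("conn-1234", "PONG\r\n")
IO.inspect(DemoEndpoint.sent("conn-1234"))  # ["PONG\r\n"]
```

The cast is asynchronous, so the caller never blocks on a slow connection; the endpoint drains its mailbox at its own pace.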

```elixir
def handle_info({:ssl, _, data}, state) do
  Owsla.RabbitProducer.send(state.id, data, state.sequence_no)
  :ssl.setopts(state.socket, [active: :once])
  {:noreply, state.sequence_no(state.sequence_no + 1)}
end

def handle_info({:ssl_closed, _}, state) do
  IO.puts "Client closed socket - stopping connection #{state.id}"
  {:stop, :normal, state.closed(true)}
end

def handle_info({:ssl_error, _, reason}, state) do
  IO.puts "Error on socket.recv. Error was: #{reason}"
  IO.puts "Closing connection #{:uuid.to_string(state.id)}"
  {:stop, :error, state}
end
```

The three handle_info functions above are all about handling asynchronous messages from the SSL socket. Because we have to be able to both send and receive data on this socket, we can’t just put a blocking :ssl.recv() call in a tight loop, so we need to receive our data asynchronously. We did that by setting the socket to active, as discussed above.

The first, {:ssl, _, data}, is invoked every time we get a line of data (remember, our data is newline-separated; gen_tcp offers other framing options if, for example, the first N bytes of a packet contain the length of the rest). It simply forwards the data to a RabbitMQ queue via the Owsla.RabbitProducer module/process, a wrapper around our fork of ExRabbit, which uses the native Erlang RabbitMQ client.
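Those framing modes are easy to experiment with directly: :erlang.decode_packet/3 implements the same packet framing that gen_tcp and ssl sockets use, so you can see what a line-framed or length-prefixed stream looks like without opening a socket:

```elixir
# :line framing - what our endpoint uses: each message is one
# newline-terminated chunk, and anything after it is left over.
{:ok, line, rest} = :erlang.decode_packet(:line, "hello world\nmore...", [])
IO.inspect({line, rest})  # {"hello world\n", "more..."}

# 4-byte length-prefix framing (packet: 4): the first 4 bytes are a
# big-endian length for the body that follows.
frame = <<0, 0, 0, 5, "hello", "extra">>
{:ok, body, rest2} = :erlang.decode_packet(4, frame, [])
IO.inspect({body, rest2})  # {"hello", "extra"}

# Not enough bytes yet? decode_packet tells you to wait for more data,
# exactly as the socket would.
{:more, _needed} = :erlang.decode_packet(4, <<0, 0, 0, 5, "he">>, [])
```

Passing packet: :line (or packet: 4, etc.) in the socket options makes the VM do this framing for you, delivering one complete message per {:ssl, _, data} tuple.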

The other two handle_info functions deal with the SSL socket either closing “naturally” or hitting some kind of error. In either case, we want to stop the current process, so rather than returning a tuple with :noreply as the first item, we return one with :stop . This signals to OTP that our process has completed and should be terminated. In one case we return :normal as the second item of our tuple, to let our supervisor know that we shut down cleanly; in the other we return :error , so OTP knows our process failed for some unexpected reason.

A quick note on Elixir defrecords – note that the first two handle_info functions transform their current state in some way – the {:ssl, _, data} handler returns:

```elixir
{:noreply, state.sequence_no(state.sequence_no + 1)}
```

The state.sequence_no(state.sequence_no + 1) part takes our existing state and creates a copy (records are immutable) with all of the data the same except for the sequence_no field, which is incremented by 1. The same pattern is used to set the closed field in the :ssl_closed handler. state.sequence_no(state.sequence_no + 1) is a much easier way to express EndpointState.new(socket: state.socket, id: state.id, sequence_no: state.sequence_no + 1, closed: state.closed).
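If you’re following along in present-day Elixir, where defrecord is no longer part of the core language, the same copy-on-update idea is usually expressed with a struct; this is a translation, not the code from the post:

```elixir
# The same state, expressed as a struct in current Elixir. Updating a
# field builds a new value; the original is untouched.
defmodule EndpointState do
  defstruct socket: nil, id: nil, sequence_no: 0, closed: false
end

state = %EndpointState{id: "conn-1234"}
next = %{state | sequence_no: state.sequence_no + 1}

IO.inspect({state.sequence_no, next.sequence_no})  # {0, 1}
```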

Finally, we want to shut down our SSL connection whenever our process ends, so we define a terminate function to close it down before our process is removed by Erlang:

```elixir
def terminate(_reason, state) do
  IO.puts "Closing connection #{state.id}"
  :ssl.close(state.socket)
  :ok
end
```

And that’s pretty much that. Note that I’ve walked you through a slightly older and simpler version of our TCP endpoint; if you handled all of your business logic inside the Endpoint process, this would probably be enough, and you could likely ignore the sequence_no gymnastics we’re going through. However, the astute reader may have noticed that outbound messages are sent down the SSL connection in order of receipt in the process’s mailbox. In our case, those messages are picked up by a pool of processes reading from a RabbitMQ queue, which means they can arrive out of order. Our current implementation handles this more gracefully, reordering messages and timing out when we somehow miss one from RabbitMQ, but the commensurate increase in complexity would make it much harder to describe in a blog post.
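To give a flavor of what that out-of-order handling involves, here’s a hypothetical, minimal reordering buffer (not our actual implementation, and without the timeout handling): messages are held in a map keyed by sequence number until the next expected number arrives, then flushed in order:

```elixir
defmodule Reorder do
  # `deliver/3` takes the next sequence number we expect, a map of
  # buffered out-of-order messages, and a newly arrived {seq, msg}.
  # It returns {messages now safe to send in order, new expected
  # sequence number, updated buffer}.
  def deliver(expected, buffer, {seq, msg}) do
    buffer = Map.put(buffer, seq, msg)
    flush(expected, buffer, [])
  end

  defp flush(expected, buffer, acc) do
    case Map.pop(buffer, expected) do
      {nil, _} -> {Enum.reverse(acc), expected, buffer}
      {msg, buffer} -> flush(expected + 1, buffer, [msg | acc])
    end
  end
end

# Message 2 arrives before message 1: nothing can be sent yet...
{[], 1, buf} = Reorder.deliver(1, %{}, {2, "second"})
# ...then 1 arrives, and both flush in order.
{out, next_expected, _buf} = Reorder.deliver(1, buf, {1, "first"})
IO.inspect({out, next_expected})  # {["first", "second"], 3}
```

In the real endpoint you would also need a timeout for the case where a sequence number never shows up, so one lost message doesn’t block the connection forever.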

The final part of the solution is the RabbitConsumer, which picks up messages from our back-end system (currently still written in Ruby) and sends them back to the appropriate endpoint to be sent down the SSL connection, assuming it’s still around. But we can talk about that one another time.