Don’t block the loop. Photo by Chris Arock on Unsplash

This is a short explainer of event-driven servers, intended to help readers gain an intuitive understanding of event loops. It could be useful when:

deciding between different MPMs for Apache HTTP Server

comparing Apache HTTP Server and NGINX

choosing concurrency models for gunicorn

troubleshooting an event loop

Concurrent Servers

In the classic client-server architecture, a server accepts connections from clients, receives data on the new socket, forwards it to the application for processing, and then sends data back to the client on the same socket.

MVP: A single-threaded server. We’re ready for some venture capital.

To achieve concurrency, the simplest multi-threaded implementation could create a new thread for each new connection. Once created, the thread is bound to the connection, processing all requests and responses on this connection sequentially. The thread is then destroyed and freed when the connection is closed.
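The thread-per-connection model can be sketched in a few dozen lines. This is a minimal, illustrative sketch using POSIX threads, not code from any particular server; the echo behavior stands in for real application logic:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* One thread per connection: the thread owns the socket for the whole
 * lifetime of the connection, handling its requests sequentially, and
 * exits when the peer closes. Here the "application logic" is a plain
 * echo. */
static void *handle_client(void *arg) {
    int fd = (int)(intptr_t)arg;      /* fd smuggled through the void* */
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {             /* write() may be partial */
            ssize_t w = write(fd, buf + off, (size_t)(n - off));
            if (w <= 0) {
                close(fd);
                return NULL;
            }
            off += w;
        }
    }
    close(fd);                        /* peer hung up: destroy the thread */
    return NULL;
}

/* The accept loop: one new (detached) thread per accepted connection. */
static void serve(int listen_fd) {
    int fd;
    while ((fd = accept(listen_fd, NULL, NULL)) >= 0) {
        pthread_t t;
        if (pthread_create(&t, NULL, handle_client,
                           (void *)(intptr_t)fd) == 0)
            pthread_detach(t);
        else
            close(fd);
    }
}
```

Every accepted connection pays for a full thread stack, whether or not the client is sending anything, which is exactly the cost profile discussed below.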

To reduce the overhead of this create-destroy pattern, the server could use a thread pool instead.

This describes the worker Multi-Processing Module of the popular Apache HTTP Server.

Depending on the specifics of each implementation, this model can have a few drawbacks, including the memory overhead incurred per connection. This scales poorly in a deployment that expects a large number of concurrent clients, each maintaining persistent connections (e.g. HTTP/1.1 Keep-Alive, long polling, or reusing a TLS socket). Precious system resources would be wasted on threads that are idle most of the time.

Threads could be waiting for results from databases or waiting for data from the client.

memory required = memory required per thread * number of connections

To drive this intuition a little further, consider this example: if a server implemented using this 1:1 model allocates 1 MB of memory per connection, it would need 10 GB of memory to serve 10 thousand concurrent connections.

Is 1 MB a good estimate? A PyMySQL client occupies about 0.37 MB. See memory-profiler.

This is the gist of the C10K problem, and we want an implementation with a resource consumption profile that scales sub-linearly with the growing number of concurrent connections.

Toward an event-driven implementation

To conserve memory and CPU while handling a large number of idle connections, we can improve upon the above implementation by doing the following:

1. reduce the memory required for each new connection by allowing a single thread to handle multiple connections, and

2. reduce the CPU cycles wasted on polling by using event notifications from the kernel to trigger callbacks when sockets are ready to be read or written.

Without getting into precise definitions, Step 1 introduces us to asynchronous programming, where each thread is no longer bound to a single connection, but instead multiplexes over multiple ones. For this to succeed, the main thread needs to be aware of the readiness and states of each socket, reading from sockets once they are ready to be read, and quickly writing to sockets when there is data to be sent.

Multiplexing over connections c0, c1, and c2 with an event loop on one thread.

Furthermore, since we are sharing the main thread across many clients, the application needs to ensure that no single request can block the entire thread, starving the other clients of their time. Anything that is using the main thread needs to yield its control as soon as it can.

Step 2 introduces us to kernel event notifications, such as I/O readiness, UNIX signals, and timeouts. These can be consumed via kqueue on BSD or epoll on Linux. The following example uses libevent, an OS-independent library that wraps kqueue, epoll, select, and other mechanisms.

A Small Example

event-proxy is a small program I wrote to demonstrate the fundamentals of event-based servers and the use of libevent. It is minimal by design, to aid instruction and study. It is also written in C, which I have found to be the best language for learning systems engineering.

Using event-proxy with cURL.

Here is a short demo of event-proxy in action:

Building and running event-proxy. Notice how it’s receiving and forwarding requests to the proxied server.

cURLing the local proxy; watch this gif side-by-side with the previous gif.

event-proxy is structured as follows:

main.c, which provides the program's entry point, initializes the logger, and invokes proxy.c#proxy to start the proxy;

proxy.c, which creates the listening socket, initializes the event loop, and blocks until the loop exits;

io.c, which provides proxy.c with the do_accept callback for accepting connections; and

client.c, which creates a socket and connects to the proxied server.

proxy.c

In proxy.c#proxy , we first create a TCP socket, bind it to an IP address and port number, and use the listen system call to mark it as accepting incoming connections. Nothing so far is unique to event-driven servers, so these steps are separated into _init_listen_fd , letting us focus on _init_event_loop .

To start using libevent, we first allocate an event_base structure using event_base_new. This structure will be used to hold the events we are interested in, as well as register pointers to callbacks. Later, it is used to start the event loop. For simplicity, we create a default event_base without any custom configuration:

ev_base = event_base_new();

Next, we register the do_accept callback against a read event on listen_fd , which is the listening socket from before. This tells the event loop to invoke do_accept whenever there is a new incoming connection at that socket.

ev_listen = event_new(ev_base, listen_fd, EV_READ|EV_PERSIST, do_accept, …);

event_add(ev_listen, NULL);

// NULL here means no timeout

// without EV_PERSIST, this event will be triggered at most once

Then, we register the quit_cb callback against a SIGQUIT signal.

ev_quit = evsignal_new(ev_base, SIGQUIT, quit_cb, ev_base);

event_add(ev_quit, NULL);

// NULL here means no timeout

Finally, we start the event loop and block indefinitely:

event_base_dispatch(ev_base);

event_base_new and event_add set up state that event_base_dispatch uses.

Roughly, event_base_dispatch starts and blocks on an infinite loop that waits until at least one event is triggered, and invokes the activated events’ registered callbacks. These callbacks are invoked on the same thread, like so:

do_accept runs on the same stack, in the same thread, as _init_event_loop .

Callbacks that take too long to return will penalize the entire event loop and undermine fairness. Thus, callbacks must return as quickly as possible and yield control of the main thread back to the event loop. For this reason, event-based programming generally requires avoiding blocking calls, such as I/O waits, explicit sleeps, and mutex waits.

Notice that the last argument to both event_new and evsignal_new is passed to the callback functions when they are invoked. For example, when SIGQUIT is sent to the program, we pass ev_base to quit_cb , so that it can tell the event loop to exit. Unfortunately, since we are working in C, the type of the argument is void * , and some casting is necessary:

struct event_base *ev_base = arg;

event_base_loopexit(ev_base, NULL);

// NULL here means exit after all current events are complete

io.c

io.c provides do_accept , a callback that is invoked whenever a new incoming connection arrives at the listening socket. When invoked, we first use the accept system call to create a new socket for the connection (named accept_fd ). Then, we connect to the proxied server, creating a new socket named client_fd . We have not done anything new yet; the novelty begins in _init_bufferevents .

At this point, we have created:

accept_fd, a socket that can be used to read and write bytes to the connected client, and

client_fd, a socket that can be used to read and write bytes to the proxied server.

The remaining tasks are:

read bytes from accept_fd and write them to client_fd

read bytes from client_fd and write them to accept_fd

This can be done by registering a callback against read events on both sockets. To do this, we use libevent’s bufferevent API, which provides a convenient interface for working with buffers and stream sockets. We start by creating a bufferevent structure that is associated with the main event loop and our new sockets:

bev = bufferevent_socket_new(ev_base, fd, BEV_OPT_CLOSE_ON_FREE);

// create a bufferevent structure

Then, we register the read callback and the error callback:

bufferevent_setcb(bev, readcb, NULL, errorcb, arg);

By default, a new bufferevent has writing enabled, but not reading, and the following is necessary:

bufferevent_enable(bev, EV_READ | EV_WRITE);

The final piece of the puzzle is the implementation of readcb , which looks like:

static void readcb (struct bufferevent *bev, …) {

…

input = bufferevent_get_input(bev);

…

bufferevent_write_buffer(output, input);

…

}

Notice that we are reusing the same ev_base as before, which lets us plug into the same event loop. The newly registered callbacks will be invoked by the same event loop that was dispatched in proxy.c#proxy , and on the same thread.

readcb runs on the same stack, in the same thread, as _init_event_loop .

Beyond Servers

The design of batch jobs can benefit from a discussion of threads and event loops as well.

Suppose, for example, we have a batch job running on a single machine that retrieves a large number of records from many databases and aggregates the data. To achieve concurrency, we could implement this batch job using multiple threads, with one thread for each database. This setup looks familiar, except that our batch job is acting as a client that is connected to multiple database servers.

This implementation suffers from the same drawbacks described above: each new thread could consume quite a bit of memory. Furthermore, if we assume that the SELECT queries take a long time on the database hosts, then we are also managing a large number of mostly idle threads and connections.

In cases like this, an event loop could be a useful optimization. After all database connections have been created, we could register callbacks on I/O events from them. The event loop would then run on one main thread, invoking the associated callbacks when results from the databases are available to be read on their respective sockets. Each invoked callback would pull the query results into process memory for aggregation.

In this model, we reduce a large number of idle threads to one fully utilized main thread; however, we also forfeit the ability to utilize the machine's multiple cores. It might be advantageous to optimize further by distributing the work over multiple processes: one process for each core, each with its own event loop.