Introduction

Latency is a tricky subject; sometimes it's not even clear what to measure or how to measure it. I've had the experience of writing a fairly complex system in Erlang that required low latencies. Fortunately, Erlang provides really good baseline performance. Most of the time you simply write your program and it will perform well. There are, however, a few tricks that can be used to lower the latency of a specific path in the system. This document describes a few of these tricks.

Yield

Erlang allows you to design efficient concurrent systems without caring how processes are scheduled or how many cores the system is running on. When Erlang runs with multiple schedulers (generally one per CPU core), the runtime balances the load by migrating processes to starved schedulers. There is no way to bind processes to schedulers or to control how processes are migrated between them. This introduces non-deterministic behavior into the system and makes latency hard to control.
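To see what the runtime is actually working with, the scheduler count can be inspected on a running node. The following is a minimal sketch; the module name sched_info is made up for illustration, while the erlang:system_info/1 items it queries are standard.

```erlang
-module(sched_info).
-export([summary/0]).

%% Returns the number of scheduler threads the VM was started with,
%% how many of them are currently online, and the number of logical
%% processors detected (the atom 'unknown' if it cannot be determined).
summary() ->
    #{schedulers         => erlang:system_info(schedulers),
      schedulers_online  => erlang:system_info(schedulers_online),
      logical_processors => erlang:system_info(logical_processors)}.
```

Calling sched_info:summary() in the shell shows whether the node really has one scheduler per core, which is worth knowing before reasoning about where a process might run.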

A common pattern is to have a demultiplexer that receives a message, sends it to one or more other processes, and then performs some additional processing on the message:

    loop(State) ->
        receive
            Msg ->
                Pid = lookup_pid(Msg, State),
                Pid ! Msg,
                State2 = update_state(Msg, State),
                loop(State2)
        end.

After the message has been sent, the receiving process is ready to execute, but unless it is on a different scheduler, the demultiplexer will finish executing first. Ideally we would bind the demultiplexer to one scheduler and bind the receiving processes to the other schedulers, but that's not allowed in Erlang.

Erlang provides only one simple but powerful way to control scheduling: the built-in function (BIF) erlang:yield/0, which lets a process voluntarily give up execution so that other processes get a chance to run.

The demultiplexer pattern can be modified by adding erlang:yield() after sending the message:

    loop(State) ->
        receive
            Msg ->
                Pid = lookup_pid(Msg, State),
                Pid ! Msg,
                erlang:yield(),
                State2 = update_state(Msg, State),
                loop(State2)
        end.

After the message has been sent, the demultiplexer gives up execution. If the demultiplexer and the receiver are on the same scheduler, the receiver will execute before the demultiplexer finishes executing. If they are on different schedulers, they will execute in parallel.

Using the erlang:yield/0 BIF it's possible to control the scheduling of Erlang processes. If used correctly this can reduce the latency in a system.
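For experimenting with the pattern, here is a self-contained sketch of the yielding demultiplexer. The module name demux, the {Key, Payload} message shape, and the map-based state are assumptions made for illustration; lookup_pid/2 and update_state/2 are filled in with the simplest bodies that make the module compile.

```erlang
-module(demux).
-export([start/1, loop/1]).

%% Workers is assumed to be a map from routing key to worker pid.
start(Workers) ->
    spawn(?MODULE, loop, [#{workers => Workers, count => 0}]).

loop(State) ->
    receive
        Msg ->
            Pid = lookup_pid(Msg, State),
            Pid ! Msg,
            %% Let the receiver run before we do our own bookkeeping.
            erlang:yield(),
            State2 = update_state(Msg, State),
            loop(State2)
    end.

%% Hypothetical routing: messages are {Key, Payload} tuples.
%% An unknown key crashes the demultiplexer (badkey).
lookup_pid({Key, _Payload}, #{workers := Workers}) ->
    maps:get(Key, Workers).

%% Hypothetical bookkeeping: count the forwarded messages.
update_state(_Msg, #{count := N} = State) ->
    State#{count := N + 1}.
```

A quick check from the shell: D = demux:start(#{a => self()}), D ! {a, hello}, and the shell process then finds {a, hello} in its own mailbox.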

Network

All network I/O in Erlang is implemented as an Erlang driver. The driver is interfaced by the module prim_inet which in turn is interfaced by the network related modules in the kernel application.

There is a performance issue with the prim_inet:send/2 and prim_inet:recv/2 functions that affects all the network related modules. When calling prim_inet:send/2 or prim_inet:recv/2, the process does a selective receive to wait for the driver's reply. A selective receive scans the message queue from the front until it finds a matching message, so the longer the process's message queue is, the higher the cost of every call.
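The cost of a selective receive on a long message queue can be observed directly, without any sockets involved. The sketch below is illustrative only: the module name selrecv and the message shapes are made up, and the absolute numbers depend on the machine and the OTP release.

```erlang
-module(selrecv).
-export([cost/1]).

%% Fills the caller's mailbox with N unrelated messages, then times a
%% selective receive that has to skip all of them to find the one it
%% wants. Returns the elapsed time in microseconds.
cost(N) ->
    _ = [self() ! {junk, I} || I <- lists:seq(1, N)],
    self() ! wanted,
    {Micros, ok} = timer:tc(fun() -> receive wanted -> ok end end),
    flush_junk(),
    Micros.

%% Removes the filler messages again so the caller's mailbox is left clean.
flush_junk() ->
    receive
        {junk, _} -> flush_junk()
    after 0 ->
        ok
    end.
```

Comparing selrecv:cost(10) with selrecv:cost(1000000) gives a feel for how much a backlog in the mailbox can add to a single call.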

For receiving there is a simple solution to this problem: use the {active, once} socket option.

A simple selective receive-free TCP receiver:

    loop(Sock) ->
        inet:setopts(Sock, [{active, once}]),
        receive
            {tcp, Sock, Data} ->
                %% Process Data here before re-arming the socket.
                loop(Sock);
            {tcp_error, Sock, Reason} ->
                exit(Reason);
            {tcp_closed, Sock} ->
                exit(normal)
        end.

To implement sending without doing a selective receive, it is necessary to use the low-level port interface function erlang:port_command/2. Calling erlang:port_command(Sock, Data) on a TCP socket hands Data to the socket for sending and returns true. The socket then replies by sending {inet_reply, Sock, Status} to the process that called erlang:port_command, where Status is ok or {error, Reason}.

A simple selective receive-free TCP writer:

    loop(Sock) ->
        receive
            {inet_reply, _, ok} ->
                loop(Sock);
            {inet_reply, _, Status} ->
                exit(Status);
            Msg ->
                try erlang:port_command(Sock, Msg)
                catch error:Error -> exit(Error)
                end,
                loop(Sock)
        end.
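The writer can be tried out over the loopback interface. This is a sketch under a few assumptions: the module name writer_demo is made up, the socket is a port-based gen_tcp socket (the default inet backend), and the {inet_reply, ...} message is an internal detail of the inet driver rather than a documented API, so it may change between OTP releases.

```erlang
-module(writer_demo).
-export([run/0]).

%% Opens a loopback TCP connection, pushes one binary through a writer
%% process built on erlang:port_command/2, and returns what the peer read.
run() ->
    Opts = [binary, {active, false}],
    {ok, L} = gen_tcp:listen(0, [{ip, {127,0,0,1}} | Opts]),
    {ok, Port} = inet:port(L),
    {ok, C} = gen_tcp:connect({127,0,0,1}, Port, Opts),
    {ok, S} = gen_tcp:accept(L),
    Writer = spawn(fun() -> writer(C) end),
    Writer ! <<"hello">>,
    Result = gen_tcp:recv(S, 5, 1000),
    exit(Writer, kill),
    _ = [gen_tcp:close(X) || X <- [C, S, L]],
    Result.

%% Same shape as the writer loop above: every message that is not a
%% driver reply is treated as data to send.
writer(Sock) ->
    receive
        {inet_reply, _, ok} ->
            writer(Sock);
        {inet_reply, _, Status} ->
            exit(Status);
        Data ->
            try erlang:port_command(Sock, Data)
            catch error:Error -> exit(Error)
            end,
            writer(Sock)
    end.
```

Note that the writer does not have to own the socket: the driver sends its reply to whichever process issued the port command. The socket still closes if its owner exits, so the owner has to stay alive for as long as the writer is in use.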

Though not Erlang specific, it is important to remember to tune the send and receive buffer sizes. If the TCP receive window is full, data may be delayed by up to one network round trip. For UDP, packets will be dropped.
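In Erlang the kernel buffer sizes are set with the sndbuf and recbuf socket options. A minimal sketch follows; the module name bufsize is made up, and the size to ask for is something to measure rather than guess.

```erlang
-module(bufsize).
-export([tune/2]).

%% Requests Size bytes for both the kernel send and receive buffers of
%% Sock and returns what the operating system actually granted. The OS
%% may round, double, or cap the requested value.
tune(Sock, Size) ->
    ok = inet:setopts(Sock, [{sndbuf, Size}, {recbuf, Size}]),
    inet:getopts(Sock, [sndbuf, recbuf]).
```

For example, {ok, S} = gen_udp:open(0) followed by bufsize:tune(S, 262144) returns {ok, Opts}, where Opts lists the effective sndbuf and recbuf values. Reading the values back matters because the granted size often differs from the requested one.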

Distribution

Erlang allows you to send messages between processes on different nodes, whether the nodes run on the same computer or on different ones. It is also possible to interact with C-nodes (Erlang nodes implemented in C). The communication is done over TCP/IP, which obviously introduces latency, especially when communicating between nodes across a network.

Even when the nodes are running on the same computer they communicate using TCP/IP over the loopback interface. Different operating systems have widely different loopback performance (Solaris has lower latency than Linux). If your system uses the loopback interface it's a good idea to consider this.
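To get a rough idea of what the loopback interface costs on a given machine, a plain TCP echo round trip can be timed. This sketch measures raw gen_tcp traffic over 127.0.0.1, not the distribution protocol itself, so treat the result as an approximate lower bound. The module name loopback_rtt, the 4-byte packet framing, and the nodelay option are choices made for illustration.

```erlang
-module(loopback_rtt).
-export([measure/1]).

%% Returns the average round-trip time in microseconds over N echoes
%% of a small message across the loopback interface.
measure(N) when is_integer(N), N > 0 ->
    Opts = [binary, {packet, 4}, {active, false}, {nodelay, true}],
    {ok, L} = gen_tcp:listen(0, [{ip, {127,0,0,1}} | Opts]),
    {ok, Port} = inet:port(L),
    %% The echo server accepts one connection and mirrors every packet.
    spawn_link(fun() ->
                       {ok, S} = gen_tcp:accept(L),
                       echo(S)
               end),
    {ok, C} = gen_tcp:connect({127,0,0,1}, Port, Opts),
    {Micros, ok} = timer:tc(fun() -> ping(C, N) end),
    gen_tcp:close(C),
    gen_tcp:close(L),
    Micros / N.

ping(_Sock, 0) ->
    ok;
ping(Sock, N) ->
    ok = gen_tcp:send(Sock, <<"ping">>),
    {ok, <<"ping">>} = gen_tcp:recv(Sock, 0),
    ping(Sock, N - 1).

%% Mirrors packets until the peer closes the connection.
echo(Sock) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, Data} ->
            ok = gen_tcp:send(Sock, Data),
            echo(Sock);
        {error, _Reason} ->
            ok
    end.
```

Running loopback_rtt:measure(10000) on each candidate operating system gives a number that can be compared directly.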

Further Reading