Erlang distribution over TLS

2016-06-28 by Magnus Henoch

What is Erlang distribution?

The “distribution protocol” is the means by which multiple Erlang nodes join together to form a cluster. When Erlang nodes are clustered, any process can send messages to processes on any other node, and spawn new processes on any other node. This forms the basis for distributed applications such as Mnesia, the database implementation that comes with Erlang/OTP, and RabbitMQ, the message broker.

The Erlang distribution protocol was designed assuming that it’s running on a trusted network. While nodes connecting to each other are required to prove that they possess a shared secret, called a “cookie”, this is mostly aimed at ensuring that different Erlang clusters on the same network don’t accidentally merge; it’s not recommended to rely on the cookie mechanism to keep an attacker out.

Furthermore, all Erlang nodes in a cluster trust each other completely. Any node in the cluster can run any code on any of the other nodes, including running arbitrary commands with os:cmd. This is why the Distribunomicon chapter of Learn You Some Erlang describes Erlang’s security model with the words *this space intentionally left blank*.

In this blog post, I describe how to run the Erlang distribution protocol over TLS, and what problems that may or may not solve.

Why TLS? What problems does it solve?

Let’s say you have an existing Erlang cluster in your data centre, and you’re going to upgrade it to use TLS. (You might think that using TLS means that you can run an Erlang cluster over the Internet, in which case I’ll say that you are very brave, and I’d like to hear about your experiences!) In the simplest possible configuration, communication between nodes is encrypted, but the nodes don’t verify certificates.

What does that mean? It means that given two Erlang nodes called Alice and Bob, if Eve (an eavesdropper) is already inside your network, and can listen to network traffic, she still cannot see what your Erlang nodes are sending to each other, which might be sensitive data that you want to protect. This is thus an example of defence in depth: even if an attacker has penetrated your firewall, they still encounter further obstacles in getting what they want. However, if the attacker inside your network is Mallory, who can perform a man-in-the-middle (MITM) attack, he can just present a different certificate to each node, and proxy the connection.

The usual way to verify TLS certificates is to check that they are signed by a trusted Certificate Authority (CA). This ensures that Mallory cannot intercept the connection unless he gets hold of the private key of the CA. However, it also introduces a new important question: which CAs do you trust? Since any Erlang node with a certificate issued by a trusted CA gets full access to your node, you probably want to trust as few CAs as possible, perhaps creating your own CA to issue certificates for the nodes in your cluster and trusting only that one.

Another way to reduce that risk is to create a whitelist of trusted certificates. This is not available out of the box, but you can do it by implementing your own verification function.

How to use distribution over TLS?

This is described in the official documentation (http://erlang.org/doc/apps/ssl/ssl_distribution.html), so I’ll just give some hints and examples here to get you started.

First of all, since this involves passing many long command line arguments to erl, I’d suggest writing a shell script that starts erl appropriately, so you don’t have to fiddle with the arguments in the terminal.

The documentation says that you need to either include the SSL application in your boot script, or explicitly include the SSL ebin directory in the code path. Eventually you’ll probably want to do the former, using your favourite release generation tool, but while you’re experimenting you can make do with the latter. Here is a snippet that saves the right directory in a shell variable:

SSL_DIR=$(erl -noinput -eval 'io:format("~s~n", [filename:dirname(code:which(inet_tls_dist))])' -s init stop)

This involves an extra invocation of the Erlang virtual machine, which makes startup a bit slower, but on the other hand you don’t need to worry about finding the right directory manually.

At the end of the shell script, start erl with the required parameters:

erl -pa $SSL_DIR -proto_dist inet_tls -ssl_dist_opt $SSL_DIST_OPT "$@"
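Putting the two snippets above together, a wrapper script might look like the following sketch. The function names, the ERL and DRY_RUN variables, and the default value of SSL_DIST_OPT are assumptions made for illustration; only the erl arguments themselves come from the text above.

```shell
#!/bin/sh
# Sketch of a wrapper script for starting a node with TLS distribution.
# ERL is an assumption: override it to use a different erl binary.
ERL=${ERL:-erl}

# Options for -ssl_dist_opt. The variable is expanded unquoted below on
# purpose, so that the shell splits it into separate arguments.
SSL_DIST_OPT=${SSL_DIST_OPT:-"server_certfile erl-dist.pem server_keyfile erl-dist.key"}

# Ask the Erlang VM where the ebin directory of the ssl application is.
find_ssl_dir() {
    "$ERL" -noinput -eval 'io:format("~s~n", [filename:dirname(code:which(inet_tls_dist))])' -s init stop
}

# Start the node, passing any extra arguments through to erl.
# If DRY_RUN is set, print the command line instead of running it.
start_node() {
    ssl_dir=$(find_ssl_dir) || return 1
    ${DRY_RUN:+echo} "$ERL" -pa "$ssl_dir" -proto_dist inet_tls -ssl_dist_opt $SSL_DIST_OPT "$@"
}
```

End the script with start_node "$@". Running it with DRY_RUN=1 set in the environment prints the resulting command line, which is handy for checking the arguments before starting a real node.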

So what should be in the SSL_DIST_OPT variable? That depends on what kind of verification you want.

Encryption, no verification

The bare minimum you need is a certificate and a private key to be used on the server side. The “client” will not be required to present a certificate.

Note that “server” and “client” are somewhat foreign terms to Erlang distribution. In Erlang, if two nodes are connected, it doesn’t matter which node initiated the connection, but when using TLS we get to set different options for the two “sides” of the connection.

The certificate and the private key should be stored in PEM format. They can either be concatenated into a single file, in which case you only need the server_certfile option, or be stored in two separate files, in which case you also need server_keyfile:

SSL_DIST_OPT="server_certfile erl-dist.pem server_keyfile erl-dist.key"

Since we’re not verifying certificates at this stage, self-signed certificates are sufficient.
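For experimenting, such a self-signed certificate and key can be generated with the openssl command line tool, assuming it is installed. The function name, the subject name and the validity period below are arbitrary placeholders; the file names match the ones used in SSL_DIST_OPT above.

```shell
# Generate an unencrypted 2048-bit RSA key (erl-dist.key) and a matching
# self-signed certificate (erl-dist.pem), both in PEM format.
# "/CN=erl-dist" and 365 days are placeholder values.
make_self_signed() {
    openssl req -x509 -newkey rsa:2048 -nodes \
        -keyout erl-dist.key -out erl-dist.pem \
        -days 365 -subj "/CN=erl-dist"
}
```

Call make_self_signed in the directory the node is started from, since the option string above refers to the files by relative path.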

Verify server certificate against CA list

If you have a set of trusted Certificate Authorities, and want the “client” to verify that the certificate of the “server” was signed by one of them, pass the client_cacertfile option. We also need to set the client_verify option to verify_peer to make the client perform the verification:

SSL_DIST_OPT="server_certfile erl-dist.pem server_keyfile erl-dist.key \
  client_cacertfile ca.pem client_verify verify_peer"

Remember that the client doesn’t present a certificate yet, so there is no point in giving the CA list to the server.
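If you go down the route of running your own CA, the steps can be sketched with the openssl command line tool as follows. This is a minimal sketch, not a hardened CA setup: the function name, the subject names, the validity periods and the ca.key and erl-dist.csr file names are assumptions, and it relies on the default OpenSSL configuration marking the self-signed certificate as a CA.

```shell
# Create a private CA, then issue a node certificate signed by it.
make_ca_and_node_cert() {
    # 1. Self-signed CA certificate (ca.pem) and its private key (ca.key).
    openssl req -x509 -newkey rsa:2048 -nodes \
        -keyout ca.key -out ca.pem -days 3650 \
        -subj "/CN=Example Erlang Cluster CA" &&
    # 2. Node key (erl-dist.key) and certificate signing request.
    openssl req -new -newkey rsa:2048 -nodes \
        -keyout erl-dist.key -out erl-dist.csr \
        -subj "/CN=erl-dist-node" &&
    # 3. Sign the request with the CA to produce erl-dist.pem.
    openssl x509 -req -in erl-dist.csr -CA ca.pem -CAkey ca.key \
        -CAcreateserial -out erl-dist.pem -days 365 &&
    # 4. Check that the node certificate chains back to the CA.
    openssl verify -CAfile ca.pem erl-dist.pem
}
```

Only ca.pem needs to be copied to the other nodes; ca.key should stay wherever certificates are issued, since anyone holding it can mint certificates that the cluster will accept.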

Verify client certificate against CA list

It might be considered a bit silly to verify the server certificate and not the client certificate — after all, in principle it might be random which node ends up connecting to the other, and regardless of the direction both nodes get full access to each other. Thus let’s pass symmetric arguments for certificate, key and CA list, and also make the server require the client to present a certificate with the server_fail_if_no_peer_cert option.

SSL_DIST_OPT="server_certfile erl-dist.pem client_certfile erl-dist.pem \
  server_keyfile erl-dist.key client_keyfile erl-dist.key \
  server_cacertfile ca.pem client_cacertfile ca.pem \
  server_verify verify_peer client_verify verify_peer \
  server_fail_if_no_peer_cert true"
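Since nearly every setting now appears twice, once per side, the option string can also be assembled by a small helper. The both_sides function below is a hypothetical convenience, not part of Erlang/OTP; it only prefixes an option name with server_ and client_.

```shell
# Print "server_<option> <value> client_<option> <value>".
both_sides() {
    printf 'server_%s %s client_%s %s' "$1" "$2" "$1" "$2"
}

# Same options as above, grouped per setting instead of per side.
SSL_DIST_OPT="$(both_sides certfile erl-dist.pem) \
$(both_sides keyfile erl-dist.key) \
$(both_sides cacertfile ca.pem) \
$(both_sides verify verify_peer) \
server_fail_if_no_peer_cert true"
```

The order of the options differs from the hand-written string, but each option name is still followed by its value, which is all that matters here.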

Use a custom verification function (from 19.0)

To get more flexibility when verifying certificates, for example if we want to do custom logging, or if we want to implement certificate whitelisting, we need to implement a custom verification function. Support for those was added in the 19.0 release of Erlang/OTP. We can set the verify_fun option for client and server, similarly to how this option is described in the documentation of the ssl module (http://erlang.org/doc/man/ssl.html):

SSL_DIST_OPT="server_certfile erl-dist.pem client_certfile erl-dist.pem \
  server_keyfile erl-dist.key client_keyfile erl-dist.key \
  server_cacertfile ca.pem client_cacertfile ca.pem \
  server_verify verify_peer client_verify verify_peer \
  server_verify_fun {my_module,my_function,my_state} \
  client_verify_fun {my_module,my_function,my_state} \
  server_fail_if_no_peer_cert true"

Note that when establishing a TLS connection from within an Erlang program, the verify_fun option takes a tuple with two elements, a fun and an initial state term. However, when parsing Erlang terms from the command line, it’s not possible to create a fun object, so we pass in the module name and the function name as atoms instead. In this case, the function my_module:my_function/3 would be called.

The verification callback function is called for each error encountered during verification, for each unknown certificate extension, and for each valid certificate in the certificate chain. For each of these cases, the function can decide whether verification should succeed or fail. It can also update its state data — the third element of the tuple is the initial state. An implementation would look something like this:

my_function(Cert, valid, State) -> ...
my_function(Cert, valid_peer, State) -> ...
my_function(Cert, {bad_cert, Reason}, State) -> ...
my_function(Cert, {extension, Extension}, State) -> ...

See the documentation for more details.

Check whether the certificate is revoked (from 19.0)

In an ideal world, possessing a certain certificate would be sufficient proof that the entity is the one it claims to be. In the real world, however, private keys get misplaced, stolen and inappropriately distributed, such that a certificate may have to be revoked before its expiry time.

When a CA revokes a certificate, it puts the serial number of the certificate on a Certificate Revocation List (CRL). An entity that wants to verify the validity of a certificate needs to download the CRL and check that the certificate is not on it.

If this is something we want to use in our Erlang cluster, we need to figure out where we get CRLs from. If we get certificates from a “proper” CA, the certificates will most likely contain a “distribution point” extension with the URL that the CRL can be downloaded from. In that case, the Erlang ssl application can download it for us. Otherwise, we may have to get hold of the CRL by other means and pass it to the ssl application manually.

We also get to choose how extensive the checks should be. This is governed by the crl_check setting. The default setting is false, meaning no checks are performed. When set to true, all certificates in the certificate chain are checked against CRLs, and if any CRL is missing, it’s treated as if that certificate were revoked. We can also set crl_check to peer, to only check the peer certificate (and not its issuing CA), or to best_effort, to accept the certificate as valid if we can’t find the relevant CRL.

Starting with Erlang/OTP 19.0, we can write something like this:

SSL_DIST_OPT="server_certfile erl-dist.pem client_certfile erl-dist.pem \
  server_keyfile erl-dist.key client_keyfile erl-dist.key \
  server_cacertfile ca.pem client_cacertfile ca.pem \
  server_verify verify_peer client_verify verify_peer \
  server_crl_check true client_crl_check true \
  server_crl_cache {ssl_crl_cache,{internal,[{http,5000}]}} \
  client_crl_cache {ssl_crl_cache,{internal,[{http,5000}]}} \
  server_fail_if_no_peer_cert true"

This instructs the ssl_crl_cache module to retrieve CRLs by HTTP, with a timeout of 5 seconds (specified as 5000 milliseconds).

What Erlang/OTP version should I use?

While TLS distribution has been supported for a long time, I’d recommend using at least version 18.3. In this version, a number of important fixes are present:

All sockets use the nodelay option by default. On Linux, by default sockets use Nagle’s algorithm to reduce overhead from packet headers by delaying sending data until there’s a full packet’s worth of it, or until a timeout which defaults to 40 milliseconds. Unfortunately, this interacts badly with Erlang distribution: every roundtrip would be delayed by 40 milliseconds.

Various options for distribution listening ports, in particular inet_dist_listen_min and inet_dist_listen_max for setting a specific port range, are supported for TLS distribution as well as for unencrypted distribution. (These options are described in the kernel documentation.)

In earlier versions, there was a race condition whereby if a node was starting up, and another node attempted to connect to it at exactly the wrong moment, the first node would drop the connection and stop listening for any further connections. This was fixed in 18.3.

TLS distribution over IPv6 is supported. Specify -proto_dist inet6_tls on the command line to use it instead of IPv4.

As mentioned above, custom verification functions and CRL checking are supported starting from version 19.0.

What about epmd?

Before you start opening up your firewall for Erlang distribution over the Internet, you should consider the last piece of the puzzle: epmd, the Erlang Port Mapper Daemon.

Since more than one Erlang node can run on a single host, the nodes cannot all listen on the same port. Therefore, you need a little program that tells you what port each Erlang node is actually listening on. Type epmd -names on the command line to see what information it holds:

$ epmd -names
epmd: up and running on port 4369 with data:
name foo at port 53668

An Erlang node that wants to connect to “foo”, regardless of whether it’s on this host or on another host, would first connect to epmd on port 4369 and ask which port “foo” is listening on. Having got the answer 53668, it would then proceed to connect to that port and do the distribution protocol handshake.
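The lookup step can be imitated from the shell by parsing the output of epmd -names. This is only an illustration of the name-to-port mapping: real nodes talk to epmd over its own protocol on port 4369 rather than calling the command line tool, and port_of is a made-up helper name.

```shell
# Read `epmd -names` output on stdin and print the port that the node
# named $1 is registered on. Prints nothing if the name is not found.
port_of() {
    awk -v node="$1" \
        '$1 == "name" && $2 == node && $3 == "at" && $4 == "port" { print $5 }'
}

# Example: epmd -names | port_of foo
```

Fed the output shown above, port_of foo prints 53668.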

However, even if you use the TLS distribution protocol, the connection to epmd is unencrypted, so all the concerns about eavesdropping and MITM mentioned above still apply. The blog post Spoofing the Erlang Distribution Protocol by Michael Santos, though somewhat dated, describes all kinds of interesting things that can be done to epmd if it’s accessible over the network.

A new feature in Erlang/OTP 19.0 lets you do something about it. Two new command line arguments are introduced: -start_epmd, by which you can turn off the automatic starting of epmd when an Erlang node is started, and -epmd_module, whereby you can specify a custom “port mapping” module instead of the standard one, which queries epmd. If you could lock down your nodes to always listen on one single port, you could write a custom module that always returns that port without having to query anything, thereby sidestepping the issue entirely.
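On the command line, such a setup might look like the sketch below. The module name my_epmd_module and the port 9100 are hypothetical: the module itself would still have to be written and be on the code path. Pinning the listen port by setting inet_dist_listen_min and inet_dist_listen_max to the same value is one possible way to lock a node down to a single port.

```shell
# Print erl arguments for running without epmd on one fixed port.
# $1: name of a custom port mapping module (hypothetical, user-supplied)
# $2: the single port the node should listen on
no_epmd_args() {
    printf '%s' "-start_epmd false -epmd_module $1 -kernel inet_dist_listen_min $2 inet_dist_listen_max $2"
}

# Example: erl $(no_epmd_args my_epmd_module 9100) -proto_dist inet_tls ...
```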

Conclusion

As all Erlang developers know, distributed systems are hard, even with Erlang! Erlang just gives you the tools to manage that complexity and to decide what trade-offs you accept. The same thing applies to security: it is very hard! Wrapping the Erlang distribution protocol in TLS may or may not be one of the pieces of the security your application requires.
