If you are an Erlang user, you probably know what atoms are. Chances are also high that you are aware of the major caveat regarding atoms in Erlang:

“Atoms are not garbage-collected. Once an atom is created, it is never removed. The emulator terminates if the limit for the number of atoms (1,048,576 by default) is reached.”

The text for atoms is stored (once for each unique atom) in an atom table, which is never garbage collected. A configurable limit exists for the number of entries in the table. Hitting the limit (e.g. by dynamically generating atoms) can result in a VM crash.

Atoms are great, but dynamically generating them (e.g. via the list_to_atom/1 function) can lead to trouble. After all, there is a reason why the list_to_existing_atom/1 function exists. Here is an interesting thread on the Erlang Questions mailing list discussing the pros and cons of atoms with input from both Richard A. O’Keefe and Joe Armstrong.
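To make the difference concrete, here is a minimal sketch of how untrusted input could be converted without growing the atom table. The module name `safe_atom` and the `{error, unknown_atom}` return value are our own illustrative choices, not part of OTP.

```erlang
-module(safe_atom).
-export([to_existing_atom/1]).

%% Converts a string to an atom only if that atom already exists.
%% list_to_existing_atom/1 raises badarg when the atom is unknown
%% (or when the argument is not a valid string), so no new entry is
%% ever added to the atom table here.
-spec to_existing_atom(string()) -> {ok, atom()} | {error, unknown_atom}.
to_existing_atom(String) when is_list(String) ->
    try
        {ok, list_to_existing_atom(String)}
    catch
        error:badarg ->
            {error, unknown_atom}
    end.
```

Calling `safe_atom:to_existing_atom("ok")` returns `{ok, ok}`, since the atom `ok` already exists in this very module, while a string that matches no existing atom is rejected instead of being turned into a new one.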

Since having too many atoms can cause your system to crash unexpectedly, it is crucial to keep an eye on the number of entries in the atom table on a long-running production system. Given that at Klarna we have one or two of these long-running systems, we collect a few metrics which help us do so. Let’s see what these metrics are.

In the current version of OTP (19.2 at the time of this writing), information about the memory used for the atom table is available to the user via the erlang:memory/1 function. There are actually two flavours of the same function that can be used. Let’s have a look at the first one:

1> erlang:memory(atom).

202481

The function returns the amount of memory used for the atom table itself plus the amount of memory reserved for atom strings at a given point in time. The amount of memory reserved for atom strings grows in chunks. The return value is expressed in bytes.

Let’s now look at a slightly different version of the function:

2> erlang:memory(atom_used).

187410

This variant returns the amount of memory used for the atom table itself plus the amount of memory of atom string space actually used. Once again, the return value is expressed in bytes. Thanks to Mikael for confirming the difference for me.
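Going by the descriptions above, subtracting one value from the other should give the atom string space that is reserved but not yet used. Here is a small sketch of that calculation. The module and function names are made up for illustration, and it relies on `erlang:memory/1` also accepting a list of memory types.

```erlang
-module(atom_memory).
-export([reserved_unused/0]).

%% Bytes reserved for atom strings but not yet used, i.e. the gap
%% between erlang:memory(atom) and erlang:memory(atom_used).
-spec reserved_unused() -> integer().
reserved_unused() ->
    %% Asking for both values in one call gives a consistent snapshot.
    Memory = erlang:memory([atom, atom_used]),
    Allocated = proplists:get_value(atom, Memory),
    Used = proplists:get_value(atom_used, Memory),
    Allocated - Used.
```

With the numbers from the two shell sessions above, that gap would be 202481 - 187410 = 15071 bytes.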

Some weeks ago I was looking at exactly these metrics for a cluster of six Erlang nodes and noticed that the atom table was constantly growing on the nodes. I considered that normal, given that in a long-running Erlang system atoms are created all the time (new modules are loaded, commands are typed in shells, etc.). What was weird, though, was that the growth was more pronounced on one of the nodes.

Erlang atom tables growing over time.

A quick investigation showed that this behaviour started immediately after our latest release. The changelog revealed that a new Erlang application had been deployed and started on exactly that node. That application could very well have been the source of dynamically generated atoms. The issue was worth investigating.

In our monitoring graphs I could only see the memory allocated for the atom table, but what I really wanted to know was how many atoms were present in our production system and whether we were close to the notorious 1M limit or not. Interestingly enough, I could not locate any other relevant metric in our monitoring system and, after going through the official Erlang documentation, I became convinced that the information was actually not exposed to the user. It was at that point that my colleague Daniel suggested that it was possible to extract this information from the (semi-undocumented) binary output returned by the erlang:system_info/1 function. Let’s have a look at it.

3> erlang:system_info(info).

<<"=memory
total: 13227160
processes: 4383720
processes_used: 4383496
system: 8843440
atom: 202481
atom_used: 187410
bi"...>>

The Erlang shell truncates the binary, so let’s print the whole return value in a more readable form (abbreviated here):

4> io:put_chars(erlang:system_info(info)).

=memory
total: 13287200
processes: 4394640
processes_used: 4394416
system: 8892560
[...]
=index_table:atom_tab
size: 8192
limit: 1048576
entries: 7227
=hash_table:module_code
[...]

Indeed, the information we need is there. Let’s implement a trivial helper module to extract it:
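The helper's listing is not reproduced in this text, so what follows is a minimal sketch of what such a module could look like rather than our exact code. It assumes the output keeps the layout shown above: an `=index_table:atom_tab` header followed by an `entries: N` line. That layout is undocumented and may change between OTP releases.

```erlang
-module(atom_table).
-export([count/0]).

%% Returns the number of entries in the atom table by parsing the
%% textual dump returned by erlang:system_info(info).
%% Assumption: the dump contains an "=index_table:atom_tab" section
%% with an "entries: N" line, as in the output shown above.
-spec count() -> non_neg_integer().
count() ->
    Info = erlang:system_info(info),
    %% Drop everything up to and including the atom table header.
    [_, AtomTab] = binary:split(Info, <<"=index_table:atom_tab\n">>),
    %% The first "entries: " after the header belongs to the atom table.
    [_, Rest] = binary:split(AtomTab, <<"entries: ">>),
    %% Keep only the digits up to the end of that line.
    [Count | _] = binary:split(Rest, <<"\n">>),
    binary_to_integer(Count).
```

If the section or the line is missing, the pattern matches fail with a `badmatch` error, which seems preferable to silently reporting a wrong number.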

Let’s finally see our helper in action:

1> atom_table:count().

7085

Not the best API ever, but at least we got the information we need. Since we believe this is an important metric for any production Erlang system to track, my colleague Mikael raised a pull request to the OTP team suggesting a better way to retrieve this precious information. The PR was accepted, meaning that starting from OTP 20 the following API to retrieve information about the number of atoms in use will be available:

erlang:system_info(atom_count).

Cute, isn’t it?
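Code that has to run on both older and newer releases could probe for the new item first. This is a rough sketch under the assumption that `erlang:system_info/1` raises `badarg` for items it does not know, which is how it behaves for unsupported items. The module name `atom_stats` and the `{error, unsupported}` value are purely illustrative.

```erlang
-module(atom_stats).
-export([count/0]).

%% Returns {ok, Count} on releases that support the atom_count item
%% (OTP 20 and later) and {error, unsupported} on older ones, where
%% erlang:system_info/1 rejects the unknown item with badarg.
-spec count() -> {ok, non_neg_integer()} | {error, unsupported}.
count() ->
    try
        {ok, erlang:system_info(atom_count)}
    catch
        error:badarg ->
            {error, unsupported}
    end.
```

On an older release the caller can then fall back to parsing the output of `erlang:system_info(info)`, as the helper above does.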

Once we had the metric we wanted, we back-ported the new feature to our own fork of Erlang/OTP and set up a periodic job to send the number of atoms present in our production system to our monitoring tool. That showed that we were well below the 1M limit and that the growth was in fact negligible.
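A periodic job of this kind can be very small. The sketch below is not our production code: `atom_reporter` is a hypothetical module, and `ReportFun` stands in for whatever client your monitoring tool provides. It assumes a release where `erlang:system_info(atom_count)` is available.

```erlang
-module(atom_reporter).
-export([start/2]).

%% Spawns a process that calls ReportFun(Count) every IntervalMs
%% milliseconds until it receives the message `stop`.
%% ReportFun is a hypothetical callback for your monitoring client.
-spec start(pos_integer(), fun((non_neg_integer()) -> any())) -> pid().
start(IntervalMs, ReportFun)
  when is_integer(IntervalMs), IntervalMs > 0, is_function(ReportFun, 1) ->
    spawn(fun() -> loop(IntervalMs, ReportFun) end).

loop(IntervalMs, ReportFun) ->
    %% Report the current number of atoms, then wait for the next tick.
    ReportFun(erlang:system_info(atom_count)),
    receive
        stop -> ok
    after IntervalMs ->
        loop(IntervalMs, ReportFun)
    end.
```

For example, `atom_reporter:start(60000, fun(N) -> io:format("atoms: ~p~n", [N]) end)` would print the count once a minute. In a real system you would more likely run this under a supervisor than as a bare spawned process.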

For crucial metrics such as this one, it is usually a good idea to raise an alarm if a predefined threshold is passed and to set the threshold to a very low value (e.g. 50% of the limit). Even in huge Erlang systems it’s unlikely to see hundreds of thousands of atoms, so a count that high may point to problematic dynamic generation of atoms. In that case, we want to be alerted as soon as possible.
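The threshold check itself could be sketched as follows. The module name and return values are illustrative, the 0.5 ratio mirrors the 50% example above, and `check/0` assumes OTP 20 or later, where both `atom_count` and `atom_limit` are available from `erlang:system_info/1`.

```erlang
-module(atom_alarm).
-export([check/0, check/3]).

%% Checks the live atom count against 50% of the configured limit.
%% Assumption: runs on OTP 20+, where atom_count and atom_limit exist.
check() ->
    check(erlang:system_info(atom_count),
          erlang:system_info(atom_limit),
          0.5).

%% Returns {alarm, Count, Limit} once Count reaches Ratio * Limit,
%% and {ok, Count, Limit} otherwise.
-spec check(non_neg_integer(), pos_integer(), float()) ->
          {ok | alarm, non_neg_integer(), pos_integer()}.
check(Count, Limit, Ratio) when Count >= Limit * Ratio ->
    {alarm, Count, Limit};
check(Count, Limit, _Ratio) ->
    {ok, Count, Limit}.
```

With the default limit of 1048576, the alarm fires at 524288 atoms, so `atom_alarm:check(7227, 1048576, 0.5)` returns `{ok, 7227, 1048576}`. How the alarm is delivered (a log line, a pager, a SASL alarm) is left to the caller.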

Let’s now imagine that our system leaks atoms. How can we figure out which atoms are getting generated? There are a few ways to retrieve the list of atoms from a running Erlang system, but my favourite is the one proposed by legoscia on Stack Overflow. It’s pure evil and it uses an undocumented feature of the external term format.

We could use the code sample from Stack Overflow to fetch the list of atoms in our system, wait a little bit and then run it again, peeking at the difference. We probably don’t even need to run the code in production, since a local workstation or a test system could be enough to spot the root cause behind the unexpected generation of atoms.
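The comparison step is simple once the two snapshots are at hand. The sketch below covers only that step: it takes the two atom lists as arguments, however you obtained them, and does not reproduce the Stack Overflow snippet. The module name `atom_diff` is made up for illustration.

```erlang
-module(atom_diff).
-export([new_atoms/2]).

%% Given two snapshots of the atom list, returns the atoms that appear
%% only in the second one, sorted and without duplicates.
-spec new_atoms([atom()], [atom()]) -> [atom()].
new_atoms(Before, After) ->
    ordsets:subtract(ordsets:from_list(After),
                     ordsets:from_list(Before)).
```

For example, `atom_diff:new_atoms([a, b], [a, b, d, c])` returns `[c, d]`. Atoms with a recognisable pattern in their names (a counter, a user id, a timestamp) are usually the quickest lead to the offending code.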

If we find atoms being dynamically generated, we may want to ensure that it does not happen anymore. In that case, I’d recommend using something like Elvis, the Erlang style reviewer, of which my colleague Juan is the main contributor.

But what about you? Are there other Erlang metrics that you track (or that you’d love to track) that are difficult to retrieve or otherwise hidden? Let us know in the comments.