It’s a well-known fact that the Erlang VM’s generational GC does not do well when it has to garbage collect non-heap binaries. Here at Splunk, while building brand new technology (standing on the shoulders of giants, of course), we’ve run into this weakness multiple times. This is a chronicle of our adventure.

A little background

Erlang binaries of up to 64 bytes are stored in each process’s heap and are garbage collected along with the rest of the process state (tuples, lists etc). Larger ones are stored outside the process heaps, in a shared memory area, and the manipulating process’s heap holds only a small reference to each of them (called a ProcBin). Those “large binaries” are not garbage collected in the conventional Erlang way (that is, per-process GC), since they are not accounted for in the process’s memory usage. They are reference counted, and a binary is freed only once every process that references it has dropped its ProcBin, which happens when that process is garbage collected. The resulting collection pattern is not very intuitive (even when fine-tuned) and can let your application self-destruct if it handles enough large binaries, both in count and in total size.
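To make the 64-byte boundary tangible, here is a minimal sketch that builds a binary of a given size in a fresh process and asks the VM which off-heap binaries that process references. The module and function names (binkind, refc_bytes/1) are made up for illustration. The shape of the tuples returned by erlang:process_info(Pid, binary) is officially undocumented and may change; the sketch assumes the {Id, Size, RefCount} layout.

```erlang
-module(binkind).
-export([refc_bytes/1]).

%% Spawn a fresh process, build a binary of Size bytes inside it, and return
%% the total size of the off-heap (reference-counted) binaries it references.
refc_bytes(Size) ->
    Parent = self(),
    Pid = spawn(fun() ->
        Bin = binary:copy(<<0>>, Size),
        %% ASSUMPTION: each entry is {Id, Size, RefCount} (undocumented).
        {binary, BinInfo} = erlang:process_info(self(), binary),
        %% Using Bin after the call keeps it referenced while we inspect.
        Parent ! {self(), BinInfo, byte_size(Bin)}
    end),
    receive
        {Pid, Info, Size} ->
            lists:sum([Bytes || {_Id, Bytes, _RefCount} <- Info])
    after 5000 ->
        timeout
    end.
```

Comparing binkind:refc_bytes(64) with binkind:refc_bytes(65) in a shell should show the larger binary appearing in the off-heap list while the smaller one does not, although the exact numbers depend on the VM version.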

In our architecture, a large volume of binary data comes into the system through a Cowboy web server instance, in packets of up to 4 KB each, and every packet is touched by 3 different long-lived gen_server processes on its way to its destination.

What we have observed is that, even though every binary datum is eventually released by all the processes that have touched it (after being served in a request or sent over to another Erlang node), its space remains allocated (and the memory is not returned to the OS) for an indefinite amount of time. Indefinite here is explained by the Erlang documentation as “until memory pressure kicks-in and an old generation GC occurs”, which is at best blurry as to what memory pressure really means and how it is measured. More importantly, memory pressure usually pertains to a process and not to the Erlang VM as a whole (which is where the large binaries are stored).

Tampering with the beast

To test the speed and effectiveness of Erlang’s binary GC, we’ve used wrk and a simple Lua script:

with a 4 KB JSON event.json file to fill our Erlang application with data via Cowboy. The wrk command used is

which, on our testbed 8-core virtualized server with 2 GB of memory and no swap, fills the internal data structures with about 1600 MB of memory in exactly 85 seconds.

The test procedure we followed is described below:

1. Start the application server.
2. Fill it up with about 1.6 GB of data from the wrk script.
3. Fetch all of the data, serially, in batches of 40 MB each (10,000 events).
4. The previous operation leaves the application server without any meaningful binary data stored in process state.
5. Run the wrk script again to populate the data structures again.
6. Crash (7 out of 10 times).
7. Repeat steps 1 to 6 with a manual [ erlang:garbage_collect(Pid) || Pid <- erlang:processes() ]. after step 4.
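The manual collection in the last step can be wrapped so that it also reports how much binary memory it released. This is a minimal sketch; the module name gc_probe is made up, while erlang:memory(binary) and erlang:garbage_collect/1 are standard BIFs.

```erlang
-module(gc_probe).
-export([gc_all/0]).

%% Garbage collect every process on the node and report the binary memory
%% (in bytes) before and after, plus the difference.
gc_all() ->
    Before = erlang:memory(binary),
    %% garbage_collect/1 returns false for processes that died meanwhile.
    _ = [erlang:garbage_collect(Pid) || Pid <- erlang:processes()],
    After = erlang:memory(binary),
    {Before, After, Before - After}.
```

Other processes keep allocating while the sweep runs, so the difference is only an approximation and can even be negative on a busy node.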

Kvetch

What we’ve observed is that:

Erlang will not garbage collect the shared binary space until there’s actual memory pressure. That translates to about 90% of the system’s memory being full, without significant competition from system processes.

Even then, it won’t run a full sweep to clean up the entire unused binary data set, but will start cleaning progressively, reclaiming only as much space as it needs in order to operate correctly (such as allocating outgoing buffers and memory to Cowboy handlers). This is standard behaviour in incremental generational GC systems, but in Erlang’s reality it doesn’t always happen in time to save the VM from crashing under OOM conditions.

Sometimes (3 out of 10), it will misestimate the remaining system memory or fail to adequately prioritize the GC mechanism. As a result, much-needed memory will not be freed in time and the whole VM will crash under OOM conditions. Erlang fanboys will say that this logic may align with Erlang’s philosophy of “let it crash”, but we believe that this concept should only apply to an Erlang-controlled environment (that is, processes, functions, ports etc) and not to the VM as a whole.

Forcing an old generation sweep after a lower number of minor sweeps (like 5 or 10 or even 0) via {spawn_opt, [{fullsweep_after, 5}]} in gen_server:start_link/4 did absolutely nothing. For a minor collection to run at all, the process has to experience some memory pressure, and that does not happen as long as the rest of its state is lightweight (remember that our binaries are stored outside the process heap, which holds only the small ProcBin references).
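For reference, this is roughly how that option is passed. A minimal sketch, with a made-up module name (sweeper_srv) and empty callbacks standing in for a real server:

```erlang
-module(sweeper_srv).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Start the server with a full sweep forced after every 5 minor collections.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [],
                          [{spawn_opt, [{fullsweep_after, 5}]}]).

init([]) -> {ok, undefined}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
```

Whether the setting took effect can be checked with erlang:process_info(whereis(sweeper_srv), garbage_collection), which lists fullsweep_after among the process’s GC settings.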

Forcing a garbage collection with erlang:garbage_collect(whereis(named_process)). does the job: it cleans up the entire stale binary data set from the shared binary space, and does so fast enough that it does not register in CPU usage.
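One caveat with that one-liner: whereis/1 returns undefined when the name is not registered, and erlang:garbage_collect(undefined) fails with badarg. A small guard avoids that; the module name gc_named and the error tuple are our own choices for this sketch.

```erlang
-module(gc_named).
-export([collect/1]).

%% Garbage collect a registered process, tolerating an unregistered name.
collect(Name) when is_atom(Name) ->
    case whereis(Name) of
        undefined -> {error, not_registered};
        Pid -> erlang:garbage_collect(Pid)
    end.
```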

Solution?

Unfortunately, there’s no elegant solution here. Even the official Erlang/OTP documentation states: “If the heap doesn’t grow, it’s likely that there won’t be a garbage collection, which may cause binaries to hang around longer than expected. A strategically-placed call to erlang:garbage_collect() will help.” What could be done is implementing some sort of self-scrubbing in OTP designs that use gen_server (or any other gen_* pattern) like:
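A minimal sketch of such self-scrubbing, assuming a timer-driven approach: the server sends itself a message at a fixed interval and collects its own garbage when the message arrives. The module name (scrub_srv), the scrub message and the 30-second interval are all illustrative choices, not tuned values.

```erlang
-module(scrub_srv).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(SCRUB_INTERVAL_MS, 30000). % assumed interval, tune for your workload

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    schedule_scrub(),
    {ok, undefined}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% On every tick, collect this process's own garbage and re-arm the timer.
handle_info(scrub, State) ->
    erlang:garbage_collect(),
    schedule_scrub(),
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.

schedule_scrub() ->
    erlang:send_after(?SCRUB_INTERVAL_MS, self(), scrub).
```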

This is as inelegant as it gets. Better solutions include measuring the binary memory held by the processes that are bound to create, manipulate or handle large binaries and running self-scrubbing on them once it grows too large, or sending the GC signal to the offending processes right after they handle a large binary transaction.
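The measuring approach could be sketched as follows. The module name (bin_scrub), the function names and the skipped return value are made up for illustration, and the code again assumes the undocumented {Id, Size, RefCount} tuples from erlang:process_info(Pid, binary). A binary referenced several times by the same process may be listed more than once, so the sum is an upper-bound estimate.

```erlang
-module(bin_scrub).
-export([refc_bytes/1, maybe_gc/2]).

%% Estimate how many bytes of off-heap binaries Pid currently references.
refc_bytes(Pid) ->
    case erlang:process_info(Pid, binary) of
        %% ASSUMPTION: entries are {Id, Size, RefCount} (undocumented shape).
        {binary, Info} -> lists:sum([Bytes || {_Id, Bytes, _Refs} <- Info]);
        undefined -> 0 % the process is no longer alive
    end.

%% Garbage collect Pid only when its referenced binaries exceed the threshold.
maybe_gc(Pid, ThresholdBytes) ->
    case refc_bytes(Pid) > ThresholdBytes of
        true -> erlang:garbage_collect(Pid);
        false -> skipped
    end.
```

A supervisor-side process could call bin_scrub:maybe_gc(Pid, Threshold) periodically for each binary-heavy worker, or the worker could call it on self() right after finishing a large binary transaction.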

Hopefully the situation will improve with R17B but, until then, workarounds such as the above have to be implemented to ensure proper application operation.

Panagiotis Papadomitsos

@priestjim