ETS will hold references to large binaries, even if just part of the original binary is stored in a table. Copy binaries if pieces of them are going to be kept around for a long time.
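In code, that advice looks something like this (a minimal sketch using the standard :binary functions):

```elixir
big = :crypto.strong_rand_bytes(1_000_000)

# A sub-binary: a 10-byte slice that still references the whole 1 MB
# binary, keeping all of it alive for as long as the slice is reachable.
piece = :binary.part(big, 0, 10)

# Copying the slice detaches it, so the original can be garbage collected.
detached = :binary.copy(piece)
```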

I was creating a new Elixir application that was part of a larger microservice. It was responsible for consuming messages from Kafka, decoding the payloads, plucking two fields from each payload, and storing those fields in an ETS table. Disregarding the boilerplate and error handling, the code looked like this:

A callback which eventually gets called by Brod to handle Kafka messages. Looks simple, but there is a memory leak here!
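That snippet was originally an image, so here is a minimal sketch of its shape: a Brod group subscriber that decodes each payload and inserts two fields into ETS. The module, table, and field names are illustrative, and I am assuming Jason for JSON decoding:

```elixir
defmodule MyService.Consumer do
  @behaviour :brod_group_subscriber

  require Record
  Record.defrecord(:kafka_message,
    Record.extract(:kafka_message, from_lib: "brod/include/brod.hrl"))

  @impl true
  def init(_group_id, _init_arg), do: {:ok, %{}}

  @impl true
  def handle_message(_topic, _partition, kafka_message(value: payload), state) do
    # Decode the payload and pluck out the two fields we care about.
    %{"id" => id, "name" => name} = Jason.decode!(payload)

    # Both fields are sub-binaries of `payload`, so inserting them keeps the
    # entire original message binary alive. This is the leak in question.
    :ets.insert(:my_table, {id, name})

    {:ok, :ack, state}
  end
end
```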

I ran the code locally, wrote some unit tests, and everything worked as expected. There were no signs of any problems until I deployed the code to production and sent it some shadow traffic, testing it under load for the first time:

DataDog: memory consumption is very high

Yikes! Each time the service started, the memory would rise very quickly until the service was killed/restarted by the container orchestrator for using too much memory. If we zoom in on the legend, we can see that the purple area represents memory being used by Binaries, while the blue area, representing ETS, is using comparatively little memory:

Datadog: Most memory being used by Binaries, not ETS

That data is provided by periodically calling :erlang.memory/0 which gives you information about how memory is being used by the VM, broken down by the memory type. Here is an example of using IEx to call it locally:

Example of :erlang.memory/0 in action
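The call and the shape of its output look like this (the numbers are illustrative):

```elixir
iex> :erlang.memory()
[
  total: 29352512,
  processes: 10464368,
  processes_used: 10463288,
  system: 18888144,
  atom: 425625,
  atom_used: 412195,
  binary: 225760,
  code: 10421140,
  ets: 584448
]
```

The binary and ets entries are the two series plotted in the graphs above.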

The next step was to connect to my service and examine things from the inside. Depending on how you create and run your releases, this step will be different, but being able to connect to a running service is a must-have in my opinion. In this particular case, the release was built with Distillery and run inside a Docker container:

```
# Inside the machine: find the CONTAINER ID of your Elixir service
$ docker ps

# Enter the container
$ docker exec -it <CONTAINER_ID> /bin/bash

# Start a remote_console; you may have to set other args and change the path to the binary
$ /opt/app/bin/my_service remote_console
```

Once we are attached to our Elixir service, we can use some helpful libraries to debug:

:observer_cli — gives you the same info as :observer but in textual form, which is great for instances like this when we don’t have a GUI

:recon — helper functions and scripts that are specifically meant for use in production environments
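Both are ordinary Hex packages, so they have to be compiled into the release ahead of time. A sketch of the dependencies (versions illustrative):

```elixir
# mix.exs: ship the debugging tools inside the release so they are
# available when you attach to a running node
defp deps do
  [
    {:observer_cli, "~> 1.7"},
    {:recon, "~> 2.5"}
  ]
end
```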

First I wanted to get a picture of which processes were using the most memory, just to make sure that it was indeed the Kafka-related stuff that I had added. To do that, I used :observer_cli and ordered all of the processes by their memory usage:

Using :observer_cli while attached to a service in production to get memory information about each process
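Starting it from the attached remote console is a single call:

```elixir
iex> :observer_cli.start()
```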

Indeed, immediately after starting the service, I could see a lot of Kafka-related processes using a lot of memory (there is one brod_consumer process spawned for each of my 64 Kafka partitions). :observer_cli updates in real time, so you can watch it for a bit to get a picture of the trends in your service.

This is a good start, but it is not conclusive enough to say that there is a memory leak; plenty of perfectly healthy processes use many MB of memory.

Next I used :recon.bin_leak/1 to find the processes that were holding onto the most binary references. If they are the same processes that are using the most memory, then we might be on to something.

:recon.bin_leak/1 works by forcing a garbage collection and then watching to see how many references are let go. It returns the top N processes (I just looked at 5). A larger negative number means a larger number of binary references were being held on to:

Using :recon.bin_leak/1 to see the top memory offenders
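The call and the shape of its result look roughly like this (the pids, counts, and process info are illustrative):

```elixir
iex> :recon.bin_leak(5)
[
  {#PID<0.1793.0>, -38042,
   [current_function: {:gen_server, :loop, 7},
    initial_call: {:proc_lib, :init_p, 5}]},
  ...
]
```

Each tuple is a pid, the change in that process's binary reference count after the forced garbage collection, and some metadata to help identify the process.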

If we cross-reference the pids that were returned with the pids in the preceding :observer_cli screenshot, we can see that they are indeed the same Kafka-related processes. Additionally, if we were watching :observer_cli in real time, we would expect to see those processes’ memory usage drop after triggering the garbage collection.