One day you set aside a shoebox to store newspaper clippings. Suddenly you are trapped under an avalanche of whole newspapers and wondering how long your body will lie there before anyone misses you.

That is what kept happening to my Erlang apps. They would store obsolete binary data in memory until memory filled up. Then they would go into swap and become unresponsive and unrecoverable. Eventually somebody would notice the smell and restart the server.

The problem seems to be related to Erlang’s memory management optimizations. Sometimes an optimization becomes pathological. If you store a piece of binary data for a while (a newspaper clipping), Erlang “optimizes” by keeping a reference to the whole binary it came from (the newspaper). When you remove all references to that data (toss the clipping), Erlang sometimes fails to purge the data (lets the newspapers pile up everywhere). If nobody shows up to collect the garbage, Erlang dies an embarrassing death.
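Here is a minimal sketch of the clipping-and-newspaper effect. The module and function names are hypothetical. It cuts ten bytes out of a one-megabyte binary and asks the runtime how much data that small piece actually refers to, using binary:referenced_byte_size/1 from the standard binary module. How binaries are shared is an implementation detail of the runtime, so I make no promise about the exact number on your system.

```erlang
-module(clipping).
-export([demo/0]).

%% Hypothetical demo: returns {SizeOfClipping, SizeOfDataItRefersTo}.
demo() ->
    %% The newspaper: one megabyte of zero bytes.
    Newspaper = binary:copy(<<0>>, 1000000),
    %% The clipping: the first ten bytes, matched out of the newspaper.
    <<Clipping:10/binary, _/binary>> = Newspaper,
    %% byte_size/1 is what you think you kept.
    %% binary:referenced_byte_size/1 is what the clipping points into.
    {byte_size(Clipping), binary:referenced_byte_size(Clipping)}.
```

If the second number comes back far larger than the first, the clipping is keeping the whole newspaper alive for as long as any process holds on to it.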

The first step to recovery is to monitor the app’s memory footprint and log in every so often to sweep out the detritus. It can be tricky to find the PIDs that need attention and tragic if you arrive too late. The permanent solution is to build periodic garbage collection into the app. It’s not hard to do. The only hazard is doing it too often since it incurs some CPU overhead.
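For the manual sweep, something like the following can help find the PIDs that need attention. It is only a sketch. The module name binary_hogs and its function top/1 are made up for illustration. It relies on erlang:process_info(Pid, binary), whose result format is undocumented and may change between releases, so I assume each entry is a three-element tuple with the size in the middle and quietly skip anything else.

```erlang
-module(binary_hogs).
-export([top/1]).

%% Hypothetical helper: the N processes that reference the most binary
%% data, largest first, as a list of {Pid, Bytes}.
top(N) ->
    Sizes = [{Pid, bin_bytes(Pid)} || Pid <- erlang:processes()],
    Sorted = lists:reverse(lists:keysort(2, Sizes)),
    lists:sublist(Sorted, N).

%% Sum the sizes of the binaries a process refers to.
%% Assumption: entries look like {Id, Size, RefCount}. Entries of any
%% other shape are ignored by the list comprehension.
bin_bytes(Pid) ->
    case erlang:process_info(Pid, binary) of
        {binary, Bins} ->
            lists:sum([Size || {_Id, Size, _RefCount} <- Bins]);
        undefined ->
            0   % the process exited while we were looking
    end.
```

From a shell attached to the node, `[erlang:garbage_collect(P) || {P, _} <- binary_hogs:top(5)].` would then sweep the five biggest offenders. A binary shared by several processes is counted once for each of them, so treat the totals as a ranking rather than an exact measurement.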

Each time I have found an app doing this, I’ve had to locate the offending module and install explicit garbage collection. If there is a periodic event, such as a timeout that happens every second, I’ll use it to call something like this:

    %% Force a garbage collection of this process on every 60th tick.
    gc(Tick) ->
        case Tick rem 60 of
            0 -> erlang:garbage_collect(self());
            _ -> ok
        end.
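To show where the call fits, here is a sketch of a bare process loop with a one-second timeout. The module name tick_gc, the stop message, and the loop itself are assumptions for illustration, not the code from my apps. A real loop would have clauses for its own messages where the stop clause is, and handling a message should carry on with the same Tick.

```erlang
-module(tick_gc).
-export([start/0, loop/1, gc/1]).

%% Start the hypothetical loop with the tick counter at zero.
start() ->
    spawn(?MODULE, loop, [0]).

%% Each time a second passes with no matching message, count a tick
%% and let gc/1 decide whether to collect.
loop(Tick) ->
    receive
        stop ->
            ok
    after 1000 ->
        gc(Tick),
        loop(Tick + 1)
    end.

%% Collect on every 60th tick, otherwise do nothing.
gc(Tick) ->
    case Tick rem 60 of
        0 -> erlang:garbage_collect(self());
        _ -> ok
    end.
```

Because the counter starts at zero, the first collection happens on the first tick and then roughly once a minute after that. The timeout restarts whenever a message is handled, so a busy process ticks less often than once a second.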

Today I installed this simple code and here is the result:

For the cost of 5% of one CPU core I stopped the cycle of swap and restart. I would like to learn why my binaries are not being garbage collected automatically. The processes involved queue the binaries in lists for a short time, then send them to socket loops, which dispose of them via gen_tcp:send/2. Setting fullsweep_after to 0 had no effect. I’ll be interested in any theories. However, I’m not looking for a new solution, since mine is satisfactory. I hope other Erlang hackers find it useful.
