Share:

A few weeks ago, Eran Hammer of Walmart labs came to the Node.js core team complaining of a memory leak he had been tracking down for months. By expending a lot of effort over those few months he had taken the memory leak from a couple hundred megabytes a day, down to a mere eight megabytes a day. That in and of itself is an impressive feat, however, the source and final portion of the leak was elusive.

So he spent more effort and found a test case that reproduced with only Node APIs, which placed the fault right at Node.js' doorstep. He found that the more HTTP client requests he did, the more memory his Node process would consume, but it was really slow. Here's the graph from his initial data:

The top line is barely trending up, but it's enough to manifest in frustrating ways for Walmart. The good news is that Node.js is getting mature, the bad news is that a more mature Node.js means bugs are harder to find. So with the test client and server in hand, I started on my 3 week journey to track down this memory leak. Lucky for you, you're getting the abbreviated version here.

$ node server.js & $ UMEM_DEBUG="default,audit=200" node --expose-gc --always-compact standalone.js

First things first, I started the server and put it in the background since we were only interested in the client. Then I ran Node with UMEM_DEBUG set to record various important information about the memory allocations. This added a certain amount of overhead to the application, but since we were just using a test script it was completely safe. Also, we were exposing the GC so we could run it from inside the script, at the same time telling the GC to stay as compact as possible.

Since it took a while to manifest, I had a little script to grab interesting information periodically from the client process. Every hour, it grabbed the output of pmap -x and a core file and stored those in Joyent Manta so we didn’t have to worry about storage space.

bash #!/bin/bash set -xe DEST=~~/public/walmart.graphs mmkdir -p $DEST for i in $(seq 1 100); do pmap -x $1 | mput $DEST/pmap.$i gcore $1 mput -f core.$1 $DEST/core.$i mput -f symboltable.json $DEST/ sleep 3600 done

Then I let this run for 15 hours. To start with, let's compare the output between the first pmap and the 15th pmap.

diff diff -u pmap.1 pmap.15 --- pmap.1 2013-11-21 01:42:27.929994800 +0000 +++ pmap.15 2013-11-21 01:42:09.622017684 +0000 @@ -4,7 +4,7 @@ 08040000 32 32 32 - rw--- [ stack ] 08050000 14944 14944 - - r-x-- node 08EF7000 56 56 16 - rwx-- node -08F05000 15204 15104 15104 - rwx-- [ heap ] +08F05000 181288 180784 180784 - rwx-- [ heap ] 80199000 4 4 4 - rw--- [ anon ] 8019A000 4 4 4 - rw--- [ anon ] 8019B000 8 8 8 - rw--- [ anon ] @@ -15,12 +15,13 @@ 801D9000 256 256 256 - rw--- [ anon ] 80219000 512 512 512 - rw--- [ anon ] 86300000 100 100 100 - rw--- [ anon ] -86B00000 1024 1024 1024 - rw--- [ anon ] +86700000 1024 636 636 - rw--- [ anon ] +8A500000 36 36 36 - rw--- [ anon ] +8A50A000 980 484 484 - rwx-- [ anon ] 8F140000 64 64 64 - rw--- [ anon ] 92A00000 36 36 36 - rw--- [ anon ] 92A0A000 164 164 164 - rwx-- [ anon ] 95B00000 228 228 228 - rw--- [ anon ] -9A300000 1024 664 664 - rw--- [ anon ] 9C000000 4096 4096 4096 - rw--- [ anon ] 9C400000 2048 2048 2048 - rw--- [ anon ] 9C600000 1024 1024 1024 - rw--- [ anon ] @@ -29,12 +30,11 @@ 9CC00000 2048 2048 2048 - rw--- [ anon ] 9CE00000 1024 1024 1024 - rw--- [ anon ] 9CF00000 1024 1024 1024 - rw--- [ anon ] -A0600000 36 36 36 - rw--- [ anon ] -A060A000 980 484 484 - rwx-- [ anon ] ABE00000 36 36 36 - rw--- [ anon ] ABE0A000 384 384 384 - rwx-- [ anon ] AC300000 100 100 100 - rw--- [ anon ] -B0B00000 1024 1016 1016 - rw--- [ anon ] +AC400000 1024 1024 1024 - rw--- [ anon ] +B6200000 1024 712 712 - rw--- [ anon ] B8700000 292 292 292 - rw--- [ anon ] BE700000 1024 1024 1024 - rw--- [ anon ] FE2F0000 64 64 64 - rwx-- [ anon ] @@ -101,4 +101,4 @@ FEFFB000 8 8 8 - rwx-- ld.so.1 FEFFD000 4 4 4 - rwx-- ld.so.1 -------- ------- ------- ------- ------- -total Kb 58728 57764 39156 - +total Kb 224812 223112 204504 -

The lines marked [ anon ] are mmap()'d regions of memory. Most of these are regions that V8 has allocated for your JavaScript objects. The lines marked [ heap ] are what you might consider the "traditional" heap, memory that is allocated with malloc() or new in C++ parlance. What you can see is that the traditional heap is growing rapidly over the 15 hours, but the V8 heap isn't growing, instead it's just shifting where its memory is stored.

When I saw that I was leaking memory in the heap, I was excited, anticipating this was a classic memory leak. A leak that would happen because Node allocated memory, but neglected to free it. In MDB there's a particularly helpful command ::findleaks that will show you the memory addresses and the stack traces for leaked memory, not unlike using valgrind , but without all the performance penalty.

> ::findleaks BYTES LEAKED VMEM_SEG CALLER 4096 4 8019a000 MMAP 32768 1 801a1000 MMAP 16384 1 8019d000 MMAP 8192 1 8019b000 MMAP ------------------------------------------------------------------------ Total 4 oversized leaks, 61440 bytes

Unfortunately for me, 61440 bytes is drastically less than the 160 megabytes actually leaking in the process as indicated by pmap . Another useful advantage of running with UMEM_DEBUG enabled, means we can tap into the cache information that libumem stores to let you know what sizes of memory are being allocated. You can access a lot of this information by simply using the ::umastat command.

> ::umastat cache buf buf buf buf memory alloc alloc name size in use in ptc total in use succeed fail ------------------------- ------ ------- ------- ------- ------- --------- ----- ... umem_alloc_2304 2304 3 - 5 12K 9 0 umem_alloc_2688 2688 0 - 7 20K 4 0 umem_alloc_4096 4096 16987 - 16987 133M 16998 0 umem_alloc_4544 4544 14 - 192 864K 8136132 0 umem_alloc_8192 8192 3 - 3 36K 5 0 ...

So at hour 15 we can see that there's 133MB allocated in the 4K cache. That's a significant amount of the memory we're trying to track down. Another feature of UMEM_DEBUG is that it keeps stack traces for those allocations, so I extended the venerable stackvis which can be used to generate flamegraphs for performance, but instead I used it to generate flamegraphs for the stack traces most responsible for the memory allocations in the 4k cache. You can see the full flamegraph, but below I've attached the interesting bit.

The majority of this memory is being allocated by a method called HandleScope::Extend deep inside V8. Basically a HandleScope is used to keep track of the JavaScript objects that were allocated in this stack frame (and above) such that when this scope is deallocated it marks JavaScript objects as available for garbage collection. When we inspect this data we see we're collecting a lot of objects, but they're all the Undefined representative object.

So what we knew at this point was that we were generating a lot of JavaScript objects, they were getting stored by some underlying HandleScope but that was never being deallocated such that those objects could be collected. We aren't seeing the V8 heap grow, because these are all singleton objects of Undefined .

I had been doing all this work on our v0.10 branch, so Trevor Norris decided to look into it on the master branch of Node. Trevor was able to find a much smaller reproducible case, using that test case, I could reuse the same UMEM_DEBUG techniques, which revealed the predominate stack trace in the 4k cache:

> ::umem_cache !grep 4096 09138010 umem_alloc_4096 020f 80000000 4096 14835 0914c010 umem_alloc_40960 020f 80000000 40960 3 > 09138010::walk bufctl | ::bufctl -v ... c008810 c009000 47e1be7da09f7a 1 9138010 0 0 libumem.so.1`umem_cache_alloc_debug+0x144 libumem.so.1`umem_cache_alloc+0x19a libumem.so.1`umem_alloc+0xcd libumem.so.1`malloc+0x2a libstdc++.so.6.0.17`operator new+0x29 libstdc++.so.6.0.17`operator new[]+0x1d v8::internal::HandleScope::Extend+0x127 v8::Isolate::GetCurrentContext+0x78 node::NeedImmediateCallbackGetter+0x1e 0xa0a660a3 0xa0a62068 0x90d1186c 0xa0a64c54 0xa0a28679 0xa0a18c2a _ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPS5_Pb+0xe4 v8::internal::Execution::Call+0x77 v8::Function::Call+0xc9 node::MakeCallback+0xc5 _ZN4nodeL14CheckImmediateEP10uv_check_si+0xc7 uv__run_check+0x31 uv_run+0x10e node::Start+0x118 main+0x1b _start+0x83 ...

This stack had very little surface area to weed through, and when I looked at the node::CheckImmediate function, I noticed it was missing a HandleScope . By adding a HandleScope the memory leak went away. Magic.

At this point we knew that we were looking for something in v0.10 that called MakeCallback but that didn't first have a HandleScope on the stack. I then worked up this simple DTrace script.

awk #!/usr/sbin/dtrace -s pid$1::*HandleScopeC1*:entry { self->cone++; } pid$1::*HandleScopeD1*:entry /self->cone/ { self->cone--; } pid$1::*HandleScopeC2*:entry { self->ctwo++; } pid$1::*HandleScopeD2*:entry /self->ctwo/ { self->ctwo--; } pid$1::*node*MakeCallback*Handle*Object*String*Value*:entry /self->cone < 1 || self->ctwo < 1/ { @[jstack(200, 8000), probefunc] = count(); } tick-10s { printa(@); }

Here we trace all the constructors and destructors for HandleScope as well as MakeCallback , if we see that MakeCallback has been called but we don't have any HandleScope s on the stack, grab the stack trace and keep track of how many times we see that stack trace. Finally, every 10 seconds print out the stack traces we've seen. The result was pretty damning.

bash ./handlescope.d $(pgrep -f standalone) dtrace: script './handlescope.d' matched 12 probes CPU ID FUNCTION:NAME 18 4362 :tick-10s node`_ZN4node12MakeCallbackEN2v86HandleINS0_6ObjectEEENS1_INS0_6StringEEEiPNS1_INS0_5ValueEEE node`_ZN4node10HandleWrap7OnCloseEP11uv_handle_s+0x83 node`uv_run+0x164 node`_ZN4node5StartEiPPc+0x17b node`main+0x1b node`_start+0x83 _ZN4node12MakeCallbackEN2v86HandleINS0_6ObjectEEENS1_INS0_6StringEEEiPNS1_INS0_5ValueEEE 3587

In 10 seconds, there was only one stack trace, and we saw that trace 3587 times. HandleWrap::OnClose was the culprit, it was calling MakeCallback without first adding a HandleScope to the stack. HandleWrap is an internal class used in Node to wrap file descriptors. That means that for every file descriptor that was closing, Node was potentially leaking 4 bytes of memory. So for someone like Walmart with lots of connections and lots of disconnects this was adding up to quite a lot of memory.

There's only good news from here, the fix landed in the v0.10 branch, and is available in version v0.10.22. We've also made a change in the master branch that will allow us to detect this type of bug much sooner, even if the error is coming from a misbehaving binary add-on.

I want to thank my fellow engineers here at Joyent: Josh Clulow, Dave Pacheco, Robert Mustacchi, and Bryan Cantrill all of whom were instrumental in the process of finding this leak. Also, thanks to Trevor Norris for finding the reduced test case for master which helped us cross the finish line. This was a team effort, and demonstrates what an incredible team we have at our disposal. So it may have taken 3 weeks to hunt this 4 byte memory leak down, but in the end, it was incredibly satisfying to finally nail it down.

You can experience the saga yourself as Eran and I walk through the set up, identification, and resolution of this memory leak.