Heaptrack - A Heap Memory Profiler for Linux

Hello everyone,

with a tingly feeling in my belly, I’m happy to announce heaptrack, a heap memory profiler for Linux. Over the last couple of months I’ve worked on this new tool in my free time. What started as a “what if” experiment quickly became such a promising tool that I couldn’t stop working on it, at the cost of neglecting my physics master’s thesis (who needs that anyway, eh?). In the following, I’ll show you how to use this tool, and why you should start using it.

A faster Massif?

Massif, from the Valgrind suite, is an invaluable tool for me. Paired with my Massif-Visualizer, I found and fixed many problems in applications that led to excessive heap memory consumption. There are some issues with Massif though:

It is relatively slow. Especially on multi-threaded applications the overhead is large, as Valgrind serializes the code execution. In the end, this sometimes prevents one from using Massif altogether, as running an application for hours is impractical. I know that we at KDAB sometimes had to resort to over-night or even over-weekend Massif sessions in the hope of analyzing elusive heap memory consumption issues.

It is not easy to use. Sure, running valgrind --tool=massif <your app> is simple, but most of the time, the resulting data will be too coarse. Frequently, one has to play around to find the correct parameters to pass to --depth , --detailed-freq and --max-snapshots . Paired with the above, this is cumbersome. Oh and don’t forget to pass --smc-check=all-non-file when your application uses a JIT engine internally. Forget that, and your Massif session will abort eventually.

The output is only written at the end. When you try to debug an issue that takes a long time to show up, it would be useful to regularly inspect the current Massif data. Maybe the problem is already apparent and we can stop the debug session? With Massif, this is not an option, as it only writes the output data at the end, when the debugee stops.

With these issues in mind, I often wondered whether there isn’t a better alternative. To track the heap memory consumption, all we need to intercept are the calls to allocation functions like malloc and free . The rest of what Valgrind does is not required, so shouldn’t it be possible to write a custom tracker with the help of the LD_PRELOAD trick which solves the issues above? But we need to get backtraces, and we need to get them quickly, as malloc & friends are called at an extremely high rate. How is that possible?

The Shoulders of Giants

For a long time, I did not know any solution to the backtrace problem. But early this year, a colleague of mine told me that vogl also uses the LD_PRELOAD trick to overload the OpenGL functions and has the ability to grab backtraces. Apparently it was also quite efficient, so I had a look at what it’s doing and indeed, there I found my holy grail: libunwind , paired with a patched libbacktrace and some (to me) esoteric Linux C APIs. Combined, this makes it possible to efficiently grab backtraces with libunwind and delay the DWARF debug symbol interpretation until a later time. Without the example code in vogl, I’d never have come up with this - so many thanks to Valve for releasing the source code on GitHub!

Introducing heaptrack

From here on, the rest was mostly plumbing. heaptrack consists of five parts:

libheaptrack_preload.so : The shared library that is injected into the debugee application using the LD_PRELOAD trick. It overloads malloc & friends, grabs a backtrace of raw instruction pointers with unw_backtrace and writes it all to the file specified via the DUMP_HEAPTRACK_OUTPUT environment variable. Additionally, dlopen and dlclose are overridden and trigger the collection of runtime information on shared libraries, which is required to later translate the instruction pointer addresses with DWARF debug information. Finally, a timer is also started, which allows us to correlate allocations and memory consumption with real time.

libheaptrack_inject.so : Similar to the preload variant, this library is used for runtime attachment to an existing process. Frequently, I found myself wondering why an application’s heap memory consumption suddenly increases. Neither Massif, nor any other tool I know of, can do runtime attaching, but heaptrack now can!

heaptrack_interpret : This process reads the output of libheaptrack.so over stdin and annotates the instruction pointer addresses with DWARF debug symbols with the help of libbacktrace . The annotated data stream is then sent to stdout . I recommend gzip ‘ing it to save some disk space, as the data files can easily consume hundreds of megabytes otherwise. The resulting data file is then “final”, meaning you can transfer it to any other machine, as no machine-dependent processing remains to be done.

heaptrack : To simplify the process, there is a small shell script which combines the first two tools. It launches the arguments passed to it as a process with the correct LD_PRELOAD environment. The output of libheaptrack.so is directly transmitted to a heaptrack_interpret process with the help of mkfifo , and the heaptrack_interpret output is finally compressed on the fly and stored to disk. This is the tool you want to use:

$ heaptrack yourapp [your arguments...]
starting application, this might take some time...
output will be written to /home/milian/heaptrack.yourapp.12345.gz
...
Heaptrack finished! Now run the following to investigate the data:

  heaptrack_print /home/milian/heaptrack.yourapp.12345.gz | less

heaptrack_print : Similar to ms_print , this process analyzes the output of heaptrack_interpret . It has many features, which I’ll outline below. You can run it at any time on the output file that heaptrack creates, and it supports transparent decompression of gzip ‘ed files. The output is written directly to the CLI, which is often cumbersome to interpret. I plan to work on a proper heaptrack-visualizer in the future.

The temporary file format of the libheaptrack.so output, as well as the permanent one written by heaptrack_interpret , is currently undocumented. It’s plain text though and should be easy to decipher, especially with the source code at hand.

Note that heaptrack, contrary to Massif, does not do any aggregation of the data. It only minimizes the data files by not printing the same backtrace information repeatedly. But each individual malloc or free call, together with the function arguments, will be tracked. This allows some extremely interesting insights into the heap usage of a debugee, as we can later analyze the data to find all of the following:

heap memory consumption: this is what Massif does, and often the most interesting

number of calls to allocation functions: usually you’d need a profiler like Valgrind’s callgrind to figure out where you frequently allocate memory. Heaptrack gives you that information as well, and much quicker. I have already used this data in many places to get rid of temporary memory allocations. This is extremely worthwhile: not only are memory allocations relatively slow themselves, your performance also benefits from “secondary” effects: when you reuse memory, the chances are much higher that it is already cached, and cache misses are often the biggest slow-down of current applications.

total amount of memory allocated, ignoring deallocations: not so useful on its own, but sometimes interesting, and it nicely accompanies the call count data when hunting for temporary memory allocations

leaked memory: Even without the fancy analysis of Valgrind’s memcheck tool to distinguish between still reachable, possibly and definitely lost memory, heaptrack can give you a quick look at what memory has not been freed when the debugee stopped.

histogram of allocation sizes over the number of calls: So far, one can only write the raw data to a file and plot it manually, e.g. with gnuplot (see the Size Histogram section below).

…: Your ideas are welcome - I’m confident that many more insights can be found in heaptrack’s data.

NOTE: Just like other profilers and tools, heaptrack relies on the DWARF debug information in your application. If you try to analyze a stripped release build without debug symbols, you’ll have a hard time making sense of it.

Using heaptrack_print

Assume we have run heaptrack on an application and now want to evaluate the obtained data. heaptrack_print is the tool to do that, but it’s relatively cumbersome to use (plain ASCII output, not even an ncurses GUI!). Thus, I explain the output here, such that you can make sense of it. Do take a look at the --help output as well.

Calls to Allocation Functions

Enabled by default, disable via -a / --print-allocators 0 .

The output below the MOST CALLS TO ALLOCATION FUNCTIONS header is a list of the top 10 locations that call memory allocation functions. The format, by default, is merged, e.g., for code similar to this:

void asdf() { new int; }
void bar() { asdf(); }
void laaa() { bar(); asdf(); }

will produce output like this when laaa() is called ten times from main() :

MOST CALLS TO ALLOCATION FUNCTIONS
11 calls to allocation functions with 44B peak consumption from
asdf() at /ssd/milian/projects/kde4/heaptrack/tests/test.cpp:24 in /ssd/milian/projects/.build/kde4/heaptrack/tests/test_cpp
10 calls with 40B peak consumption from:
    bar() at /ssd/milian/projects/kde4/heaptrack/tests/test.cpp:36 in /ssd/milian/projects/.build/kde4/heaptrack/tests/test_cpp
    laaa() at /ssd/milian/projects/kde4/heaptrack/tests/test.cpp:41 in /ssd/milian/projects/.build/kde4/heaptrack/tests/test_cpp
    main at /ssd/milian/projects/kde4/heaptrack/tests/test.cpp:103 in /ssd/milian/projects/.build/kde4/heaptrack/tests/test_cpp
1 calls with 4B peak consumption from:
    bar() at /ssd/milian/projects/kde4/heaptrack/tests/test.cpp:36 in /ssd/milian/projects/.build/kde4/heaptrack/tests/test_cpp
    laaa() at /ssd/milian/projects/kde4/heaptrack/tests/test.cpp:41 in /ssd/milian/projects/.build/kde4/heaptrack/tests/test_cpp
    main at /ssd/milian/projects/kde4/heaptrack/tests/test.cpp:105 in /ssd/milian/projects/.build/kde4/heaptrack/tests/test_cpp
...

Here, the backtraces are merged on the location of the new int allocation in asdf() , and all sub-traces are displayed beneath. Since heaptrack_print sorts the data, you can just read its output from the top to find the top 10 hotspots of allocation functions. You can disable backtrace merging with -m / --merge-backtraces 0 .

Peak Memory Consumption

Enabled by default, disable with -p / --print-peaks 0 .

To decrease your memory consumption, you need to decrease the peak memory consumption. Under the PEAK MEMORY CONSUMERS caption, heaptrack_print shows the top ten hotspots, sorted by the peak size in bytes. It can look like this, for example:

PEAK MEMORY CONSUMERS
3.98MB peak memory consumed over 37473 calls from
QString::realloc(int) in /usr/lib/libQtCore.so.4
1.04MB over 4 calls from:
    QString::append(QString const&) in /usr/lib/libQtCore.so.4
    0x7fa9ce54bf73 in /usr/lib/libQtCore.so.4
    0x7fa9ce54c5ee in /usr/lib/libQtCore.so.4
    QTextStream::readAll() in /usr/lib/libQtCore.so.4
    Kate::Script::readFile(QString const&, QString&) at /ssd/milian/projects/kde4/kate/part/script/katescripthelpers.cpp:82 in /ssd/milian/projects/compiled/kde4/lib/libkatepartinterfaces.so.4
    Kate::Script::require(QScriptContext*, QScriptEngine*) at /ssd/milian/projects/kde4/kate/part/script/katescripthelpers.cpp:289 in /ssd/milian/projects/compiled/kde4/lib/libkatepartinterfaces.so.4
    0x7fa9bac7d228 in /usr/lib/libQtScript.so.4
...

Massif Compatibility

Pass a file path to -M / --print-massif . Tune output with --massif-threshold and --massif-detailed-freq .

heaptrack_print , since yesterday, also supports converting the heaptrack data to the Massif file format. This can then be visualized with my Massif-Visualizer. The resulting files are relatively large, as much more detailed snapshots are included. I optimized the visualizer a bit as well to speed up the evaluation of these files. It is worth it though! Since the time axis uses real time, it is much easier to correlate to the actual runtime behavior of your application (note: you can configure Massif to also use “real time”, but due to its high overhead, the results are still confusing and not much different from the instruction count). The higher level of detail also makes it simpler to interpret the results. Note though, that the converter currently has no code to ensure the peak is not missed, which can be seen in the images below. I plan to add this eventually.



[Screenshot: heaptrack]

[Screenshot: Massif]

Comparison of heaptrack and Massif on the same workload shows the much higher level of detail. Overall, the results are compatible, but note that heaptrack uses real time whereas Massif defaults to instruction count for the abscissa. Also, the Massif file generated by heaptrack currently misses the peak, which Massif itself tracks accurately.

Memory Leaks

Disabled by default, enable with -l, --print-leaks 1 .

The leaks reported by heaptrack are simply all calls to malloc & friends which were never freed afterwards. It is not possible to do a “still reachable” or “possibly lost” analysis as Valgrind’s memcheck tool does. Still, it is often quite helpful. Note though that it does not support suppression files, which would be crucial here, as otherwise you’ll often see leaks reported inside libc and other external libraries which are often intentional.

Size Histogram

Disabled by default, enable by passing an output file to -H, --print-histogram .

The size histogram gives an insight into whether you could potentially benefit from a pool allocator or a similar optimization technique. heaptrack_print just writes the raw data to the output file you specify. With octave or gnuplot , you can then evaluate this manually, yielding a graph such as the following:

[Graph: histogram of allocation sizes]

Note how many allocations below 8 bytes are done by this application. All of these waste memory space, as the value itself could easily be stored in the space required for a single pointer on a 64-bit machine. For those interested, most of these allocations come from small strings, since Qt’s QString class has no small-string optimization (yet, planned for Qt 6). In the future, the heaptrack data could be analyzed such that it directly points you to the culprits of such memory wastes.

Try it out

So far, I developed this tool mostly to scratch my own itch. I demoed it to some colleagues, but until yesterday, some essential features were missing. Now, I think, it is ready for a wider audience. If you are interested, try it out - I’m interested in your feedback:

git clone git://anongit.kde.org/heaptrack
cd heaptrack
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make install

This should be all that is required to get heaptrack up and running. It depends on Boost (for heaptrack_print and heaptrack_interpret ) and a recent libunwind (for libheaptrack.so ). If in doubt, compile libunwind from source as well, as I fixed one significant performance issue there on my platform. Thus, if heaptrack is extremely slow, please try to update libunwind first. Also note that I have another patch for libunwind in the pipeline to increase the DWARF cache, which improves the runtime performance of heaptrack even further.

Furthermore, I took the liberty of leveraging C++11 features wherever I needed them. You will need a recent compiler to build heaptrack. CMake should tell you if your compiler is too old.

Also note again that this tool currently only works on Linux. With some work, it should be possible to port it to other Unixoid platforms. Personally, I won’t spend time on this, as it is not worth it for me. I develop cross-platform Qt applications, and can thus easily investigate the memory consumption on a Linux host.

Platform wise, I built and tested this code only on x86-64. I hope it also works fine on 32-bit x86, as well as on ARM, but I’ll have to test that.

A note on Performance

CPU Overhead

I do not have any reliable benchmark, but I still want to share some rough estimates of the overhead of heaptrack and compare it with Massif. In heaptrack’s source tree, you can find e.g. tests/threaded.cpp , which allocates and frees memory repeatedly and in parallel with multiple threads. With perf stat , we can estimate the worst-case overhead of heaptrack with this test:

Baseline

$ perf stat -r 5 ./tests/threaded

 Performance counter stats for './tests/threaded' (5 runs):

        147.544073 task-clock (msec)        #    1.736 CPUs utilized          ( +-  6.18% )
               563 context-switches         #    0.004 M/sec                  ( +-  6.12% )
               735 cpu-migrations           #    0.005 M/sec                  ( +-  0.78% )
               910 page-faults              #    0.006 M/sec                  ( +-  4.38% )
       235,074,081 cycles                   #    1.593 GHz                    ( +- 10.63% ) [71.15%]
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
       156,034,336 instructions             #    0.66  insns per cycle        ( +-  6.64% ) [91.21%]
        35,155,936 branches                 #  238.274 M/sec                  ( +-  3.27% ) [90.29%]
           366,564 branch-misses            #    1.04% of all branches        ( +-  8.44% ) [86.60%]

       0.084972509 seconds time elapsed                                       ( +-  6.20% )

Averaged over five runs, this test finishes in less than 100ms and roughly 150 million instructions are executed.

heaptrack

$ perf stat -r 5 heaptrack ./tests/threaded

 Performance counter stats for 'heaptrack ./tests/threaded' (5 runs):

       2126.580121 task-clock (msec)        #    2.120 CPUs utilized          ( +-  4.10% )
            60,137 context-switches         #    0.028 M/sec                  ( +- 10.29% )
             4,853 cpu-migrations           #    0.002 M/sec                  ( +-  7.48% )
           106,589 page-faults              #    0.050 M/sec                  ( +-  0.11% )
     5,398,514,290 cycles                   #    2.539 GHz                    ( +-  4.01% ) [55.40%]
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     5,403,905,664 instructions             #    1.00  insns per cycle        ( +-  1.23% ) [87.56%]
     1,154,188,099 branches                 #  542.744 M/sec                  ( +-  1.58% ) [75.83%]
        22,868,779 branch-misses            #    1.98% of all branches        ( +-  4.20% ) [76.43%]

       1.003186573 seconds time elapsed                                       ( +-  2.40% )

With heaptrack, the test application runs considerably slower: according to perf stat , it is roughly 12 times slower. Furthermore, we now execute ca. 5.4 billion instructions, trigger many more page faults, and so on.

Massif

$ perf stat -r 5 valgrind --tool=massif ./tests/threaded

 Performance counter stats for 'valgrind --tool=massif ./tests/threaded' (5 runs):

       2589.948615 task-clock (msec)        #    1.022 CPUs utilized          ( +-  0.21% )
            11,318 context-switches         #    0.004 M/sec                  ( +-  1.23% )
             7,168 cpu-migrations           #    0.003 M/sec                  ( +-  1.73% )
             8,856 page-faults              #    0.003 M/sec                  ( +-  0.18% )
     6,178,853,885 cycles                   #    2.386 GHz                    ( +-  0.19% ) [50.14%]
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     9,692,798,770 instructions             #    1.57  insns per cycle        ( +-  0.69% ) [81.01%]
     2,311,709,276 branches                 #  892.570 M/sec                  ( +-  0.26% ) [77.11%]
        29,323,133 branch-misses            #    1.27% of all branches        ( +-  0.38% ) [77.79%]

       2.534584622 seconds time elapsed                                       ( +-  0.27% )

With Massif, the situation is even worse. It serializes all threads, as evidenced by the task-clock report which shows that only one CPU is utilized. Overall, it is roughly 2.5 times slower than heaptrack and also executes nearly twice as many instructions.

This result is quite promising in favor of heaptrack. In many other tests, the test applications also feel much more fluid when running under heaptrack compared to Massif. But YMMV, so take this with a grain of salt.

Memory Overhead

Also be aware that heaptrack not only slows down your application, but also adds a considerable memory overhead, both in-process ( libheaptrack.so ) as well as out-of-process ( heaptrack_interpret ). In a non-scientific measurement of the memory consumption of kwrite showing a medium sized text file, I obtained the following numbers for the total memory used after the file is loaded:

Baseline: 26.1MB

heaptrack: 39.3MB + 19.2MB = 58.5MB

Massif: 264.9MB

So again, heaptrack seems to be significantly leaner compared to Massif, but YMMV.

What’s left to do?

I will probably not spend much more time on heaptrack in the coming months, but rather hope to finally be able to concentrate on finishing my studies. Mid-term next year, after a long vacation, I then plan to start working on the following (if no one beats me to it until then):

do a proper release: I plan to move this tool to KDE’s extragear and undergo a code review. Once that is settled, I will release a first version and hope for packagers to distribute it.

heaptrack-gui: Generating massif.out files and looking at them in my Massif-Visualizer is nice, but inefficient and only shows a fraction of the data we have available. Thus, a proper GUI application is required to show all of the data in a heaptrack output file. Additionally, it could visualize the data as it comes in, giving you the ability to track the heap behavior of your application in real-time!

public API: heaptrack does not support custom allocators yet. To support this, a simple API could be added, similar to Valgrind’s Client Request API.

I/O profiling etc.: The technique used for heap profiling can also be used to profile I/O, mutex lock contention and more.

Note that stack memory consumption cannot be profiled this way. Use Massif if you need to look at that.

Reinventing the Wheel

Initially, I thought heaptrack is unique in what it does. Over time, I realized that this is not quite the case. Google’s gperftools has a similar tool, and there is libmemusage.so and many others like it. Thankfully, none of them gives as much data as heaptrack while still being efficient. So my time was not wasted, and I learned a ton in the process. I invite everyone to inspect my code and give suggestions. So far it is only ~1.6k lines of code, but probably a bit lacking on the documentation side. I’ll improve this over time, I hope.

I also tried to implement this tool with perf probe , but could not get it to work reliably. The perf script support still lacks the ability to run native code, which is crucial here for high performance. Additionally, perf requires root access in order to use user-space probes on e.g. malloc and friends in libc.so . This is not practical - heaptrack and LD_PRELOAD work just fine as-is.

Thanks

To wrap up this lengthy blog post, I want to express my deepest gratitude again to all those who made this tool possible. In no particular order:

Julian Seward and the Valgrind team: This tool suite will always remain invaluable to me. Without it, I’d never have been able to cross-check the results of heaptrack reliably. While I’ll probably use Massif less and less, I will still use it to verify that the results obtained by heaptrack are correct. And the error-checking tools in the Valgrind suite, like memcheck , helgrind or drd , are still unmatched in their quality.

Michael Sartain, Peter Lohrman and Valve: Without the code in vogl, I’d still be out on the hunt for an efficient scheme to obtain backtraces, or would be clueless about how to translate the instruction pointer addresses with DWARF debug symbols.

Arun Sharma, Lassi Tuura and the libunwind team: The core tool to actually get the backtrace. Many thanks for this fast, easy-to-use library!

GCC team: Not only an excellent compiler, but also the source of the libbacktrace library, which does the heavy ELF/DWARF lifting to translate raw instruction pointer addresses.

My colleagues at KDAB: Fruitful discussions with them led to the solution for many of my problems over the last months. And thanks for the pre-alpha testing!

all future contributors: Patches welcome! :)