Announcing TCMalloc

By Chris Kennelly, Google Software Engineer

We are happy to announce the arrival of TCMalloc, a fast memory allocator with useful profiling and introspection features.

The source code can be found on Github. This is a distinct repository, allowing Abseil to be used independently of TCMalloc. The library includes:

A lockfree, per-CPU cache implementation based on restartable sequences, available on modern Linux kernels, and a per-thread based fallback implementation.

A hugepage-aware backend that reduces TLB stalls by more densely populating hugepage subregions while reducing memory waste.

Optimizations leveraging modern C++ language features, including sized delete from C++14 and overaligned allocation from C++17.

Always-on, sampling-based heap profiling, allowing an application to obtain a heap profile without additional configuration. We also provide a “peak heap” profile, a snapshot of the application’s memory usage near its high watermark of usage.

This blog post covers getting started and a peek at one of our key optimization techniques: per-CPU caches. For a more detailed explanation of the design of TCMalloc and its features, take a look at our design note.

Getting Started

Adding TCMalloc to a binary involves specifying it as the malloc attribute of a Bazel cc_binary or cc_test rule. For example, with Abseil hello:

cc_binary( name = "hello_main", srcs = ["hello_main.cc"], deps = [ ":hello", ], malloc = "@com_google_tcmalloc//tcmalloc", )

We can also leverage the telemetry TCMalloc provides into our heap with MallocExtension .

#include <iostream> #include “tcmalloc/malloc_extension.h” int main(int argc, char** argv) { absl::optional<size_t> heap_size = tcmalloc::MallocExtension::GetNumericProperty( "generic.current_allocated_bytes"); if (heap_size.has_value()) { std::cout << "heap size = " << *heap_size << " bytes" << std::endl; } }

MallocExtension is a separate library from TCMalloc, allowing it to be used when another malloc implementation is linked-in, for example, when using C++ sanitizers. The library is crafted so that although the telemetry and controls it provides will be inoperative, the code using it will still link and compile.

Key Optimizations: Per-CPU Caches

TCMalloc’s name comes from “thread caching” malloc. These caches allow us to allocate memory most of the time without taking locks, falling back to a more centralized cache when the thread cache is exhausted or full.

The cost of per-thread caches has increased as thread counts in applications have grown. Because we do not allow one thread to access another’s thread cache, to avoid locks, we need to carefully balance between ensuring a cache is large enough to hold needed objects and avoiding stranding memory in an idle cache. Objects held by idle threads are effectively wasted.

Our per-CPU caches take advantage of the fact that only one thread can run on a single core at any given time and that pre-emptions during the critical sequence of an allocation are relatively rare. These caches are based on the Linux kernel’s restartable sequences (RSEQ) feature, developed by Paul Turner and Andrew Hunter at Google and Mathieu Desnoyers at EfficiOS. Restartable sequences let us execute a region of code atomically and restart if we are interrupted by the kernel.

RSEQ allows us to share a cache across many interleaving threads of execution, improving the usefulness of the cache. When one thread is idle, another thread can still allocate from the core’s cache. Additionally, because the number of caches is bounded by the total number of cpus, we can allocate an array-based metadata structure for each cache. This allows us to reduce memory indirections, common in the linked-list data structures we use for our thread-cache implementation.

For systems without this feature, we fall back to a per-thread cache implementation.

More Information

For more information, consult the following documentation: