Coordinating access to global resources like the heap is a complicated topic, and getting it wrong leads to a class of bugs known as “race conditions”, which cause hard-to-debug crashes and are often also exploitable by hackers.

In our example of a website, suppose a web request being serviced on one thread tries to update a database row while another concurrent web request tries to read from that same row. Normally, we want to make sure the second thread never sees the row mid-write, in some partial or corrupted form, while it is being overwritten by the first. Databases solve this problem by making reads and writes appear to operate atomically: if two threads try to access the same row at the same time, one operation must complete before the next can begin. A very common way of resolving such race conditions is to force otherwise-simultaneous requests for a global resource into a sequential queue by using locks.
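To make the hazard concrete, here is a minimal sketch of the same pattern in miniature (an illustrative shared counter, not the database scenario itself): two threads perform unsynchronized read-modify-write updates, and increments get lost because both threads can read the same stale value.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0; /* the shared "global resource" */

/* counter++ is really load, add, store; with no coordination,
   two threads can load the same old value and one update is lost. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Expected 2000000, but a typical run prints something smaller. */
    printf("counter = %ld\n", counter);
    return 0;
}
```

Compile with `gcc -pthread` and run it a few times; the lost updates show up as a different, too-small total on each run.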

In general, locks work by having a thread “mark” that it has taken ownership of a global resource before using it, perform its operation, and then mark the resource as no longer in use. If another thread wants to use the resource and sees that some other thread is using it, it waits until that thread is done. This ensures that the global resource is only used by one thread at a time. But it comes with a cost: the thread waiting on the resource is stalled and wasting time. This is called “lock contention”.
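As a sketch of that take-ownership/release pattern (using a standard pthread_mutex_t, with the same illustrative counter as above), the racy program becomes correct once the updates are forced into a sequential queue, at the price of one thread stalling while the other holds the lock:

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);   /* mark the resource as in use; wait if it already is */
        counter++;                   /* the protected operation */
        pthread_mutex_unlock(&lock); /* mark the resource as free again */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter); /* now reliably 2000000 */
    return 0;
}
```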

For many global variables, this is an acceptable cost. But for the heap, which is constantly in use by all threads, this cost quickly translates into the whole program slowing down.

The heap manager mostly solves this problem by using per-thread arenas, where each thread gets its own arena until the process hits the arena limit (by default, 8 arenas per CPU core on 64-bit systems and 2 per core on 32-bit systems). Additionally, the tcache (per-thread cache) is designed to reduce the cost of the lock itself, because the locking instructions are quite expensive and end up taking a significant portion of the execution time on the fast path. This feature was added to glibc’s malloc in version 2.26 and is enabled by default.
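Both mechanisms can be observed and tuned. As a rough sketch (mallopt and M_ARENA_MAX are real glibc interfaces; the GLIBC_TUNABLES setting mentioned in the comment is the documented way to adjust the tcache since 2.26), the program below caps the process at a single arena, so any allocation that misses the tcache contends on one shared lock:

```c
#include <malloc.h>
#include <pthread.h>
#include <stdlib.h>

/* Four threads hammer malloc/free. With M_ARENA_MAX capped at 1,
   they all share the main arena; running with
   GLIBC_TUNABLES=glibc.malloc.tcache_count=0 additionally disables
   the tcache, so every call takes the arena lock. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        void *p = malloc(64);
        free(p);
    }
    return NULL;
}

int main(void) {
    mallopt(M_ARENA_MAX, 1); /* cap the process at a single arena */
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Timing this with and without the mallopt call (or with the tcache disabled) is an easy way to see lock contention show up as a wall-clock slowdown.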

Per-thread caching speeds up allocations by keeping per-thread bins of small chunks ready to go. That way, when a thread requests a chunk, if it has a chunk of the right size available in its tcache, it can service the allocation without ever needing to wait on a heap lock.
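One observable consequence of this (a behavior of glibc 2.26 and later, not a guaranteed API contract): freeing a small chunk and immediately requesting the same size typically hands back the very same address, served straight from the calling thread’s cache with no lock taken.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *a = malloc(100);
    printf("first allocation:  %p\n", a);
    free(a);               /* the chunk lands in this thread's tcache bin */
    void *b = malloc(100); /* same size class: served from the tcache */
    printf("second allocation: %p\n", b);
    /* On glibc 2.26+ the two addresses are typically identical. */
    free(b);
    return 0;
}
```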

By default, each thread has 64 singly-linked tcache bins. Each bin holds a maximum of 7 same-size chunks, and across the bins the chunk sizes range from 24 to 1032 bytes on 64-bit systems and from 12 to 516 bytes on 32-bit systems.
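For reference, here is a slightly simplified version of the bookkeeping structures from glibc’s malloc.c (this matches the 2.26-era layout; newer glibc versions widen counts to uint16_t and add a key field to tcache_entry):

```c
#define TCACHE_MAX_BINS 64

/* Each freed chunk in a bin links to the next via a pointer stored
   in the chunk's user-data area. */
typedef struct tcache_entry {
    struct tcache_entry *next; /* singly-linked freelist */
} tcache_entry;

/* One of these per thread: a counter and a list head for each bin. */
typedef struct tcache_perthread_struct {
    char counts[TCACHE_MAX_BINS];           /* chunks in each bin, max 7 by default */
    tcache_entry *entries[TCACHE_MAX_BINS]; /* head of each bin's freelist */
} tcache_perthread_struct;

/* glibc maps a chunk size to a bin index roughly as
   (size - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT,
   which on 64-bit (MINSIZE 32, alignment 16) covers chunk sizes
   32..1040, i.e. the usable sizes 24..1032 quoted above. */
```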