SuperMalloc is an implementation of malloc(3) originally designed for X86 Hardware Transactional Memory (HTM). It turns out that the same design decisions also make it fast even without HTM. We compared SuperMalloc to DLmalloc, JEmalloc, Hoard, and TBBmalloc, For the malloc-test benchmark, with 1 thread SuperMalloc is about 2.1 times faster than the best alternatives, with 8 threads and HTM SuperMalloc is 2.75 times faster, and on 32 threads without HTM SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size.

SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. We exploit the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. We allocate 2MiB chunks which contain objects all the same size. To translate chunk numbers to chunk metadata, we use a simple array (most of which is uncommitted to physical memory). We take care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, we employ a 10-object per-thread cache, a per-CPU cache that holds up 2MiB of objects, and a global cache that is organized so we can move 1MiB between a per-CPU cache and the global cache using O(1) instructions. We prefetch everything we can before starting a critical section, which makes the critical sections run fast, and for HTM improves the odds that the transaction will commit.