CMU 15-418 (S13) Final Project

nalloc: A Lock-Free Memory Allocator

Alex Podolsky

Summary

I've written two 100% lockfree memory allocators: __nalloc and nalloc. I've benchmarked them along with the current state of the art on my own set of synthetic tests.

The first allocator had poor scaling on par with libc, but I learned enough from it to write a second lockfree allocator that scales approximately linearly up to 30 cores. It scales sublinearly but slightly better than tcmalloc up to 64 cores.

To install, `git clone ~apodolsk/repo/nalloc` and read README.

Background

Memory allocators are important because most programs use them and many use them heavily. A bad allocator can be a central point of contention in billions of good programs while a good allocator can be a drop-in replacement that twists the memory accesses of bad, ignorant programs into a hardware-friendly pattern.

All scalable memory allocators that I know of, including existing lockfree allocators, attempt to turn allocation into a data-parallel problem by splitting the address space into CPU- or thread-local subheaps. In the best case, this has the additional benefit of improving locality, reducing false sharing, and empowering the prefetcher, as each thread touches contiguous sets of thread-private cache lines.

In the worst case, the program being serviced isn't as parallel as the allocator's addresses and careful design is needed to reduce artifactual and explicit communication resulting from memory block migrations between threads. Moreover, options available to achieve this can be at odds with the need to reduce fragmentation and memory blowup.

There's no need to describe sequential algorithms here.

tcmalloc and jemalloc are two of the best-performing general allocators around. ptmalloc is the default glibc allocator.

Analysis and challenges

I felt insecure about how/whether the project fit with the course, so I tried to come up with some non-obvious analysis of the problem. I mentioned it in the presentation, so you can skip this.

That "variable data-parallelism" that I mentioned was a new source of difficulty. Allocation stands in contrast to all of the problems that we've seen in class because those problems could be analyzed for parallelism before doing the work. In the renderer, for instance, you could phrase the problem in a way that allowed the data to be handled without communication, and thus strike a programmed balance between work and span.

On the other hand, a parallel allocator is fed a predistributed, opaque workload with unknown dependencies that require an unknown amount of communication to resolve. As such, it needs to be able to scale to different levels of parallelism.

In practice, this meant that I had to spend a lot of time thinking "does this choice make sense if the workload forces threads to communicate a lot? Does it make sense if not?". (I didn't actually achieve anything special in this direction. As far as my tests could tell me, the state of the art allocators seemed to be about as bad in the worst case.)

I was pretty excited to see that some design issues were familiar from the web server. Like a server, an allocator needs to meet latency and throughput goals on a bursty workload by making tradeoffs against resource usage goals like fragmentation and blowup targets. "Recruiting more nodes, or more than you need" has similar costs and benefits to "fetching more pages from the global heap, or more often than you need", and "bootup cost" is absolutely an issue that came up.

I mention the less abstract obstacles that I ran into below.

__nalloc

Approach

__nalloc is "naive" because it's almost the same basic design that I would have used at the start of the semester. I assumed that the bottleneck would be synchronization, so I planned to stick a fast single-threaded algorithm into an efficient lockfree wrapper. The mysterious, as-yet-unnamed problem that I eventually ran into would have been obvious from the start, had I preferred analysis to "doing what sounds elegant" or profiled existing allocators.

__nalloc and nalloc both target allocations of <= 1024B in order to make the task simpler.

The main idea is as follows:

Run a standard single-threaded allocator ("segregated lists with boundary tags and eager splitting and coalescing") on each thread-local subheap.

Use a lockfree stack of pages to distribute memory from the global heap.

Use another lockfree stack attached to each thread to return migratory/"wayward" blocks back to their original thread.

A more detailed algorithm is below. You can skip it, especially if the parenthetical description above made sense. However, the part about "wayward blocks" at the very end is interesting:

Each thread keeps doubly linked "free lists" of free blocks of memory of various size classes.

Each block lives in an "arena", a page-sized chunk of memory whose address is "naturally aligned" to its size. Threads fill their free lists by fetching more arenas from a global lockfree stack of pages. When the page stack is empty, a thread seeking an arena calls mmap() to allocate a batch of pages from the OS. Each fresh arena holds a single block of maximal size.

In malloc(), a thread pops a block from the free list matching the request size. It shaves off extra space into a new block. If the list is empty, it tries the next largest one until it runs out and has to fetch a new arena.

In free(), a thread merges the freed block with its neighbors if possible. In order to do this, each block B needs a 4B header which stores an "is_free" flag, the size of B, and the size of the block to the left of B in memory. All merged neighbors are removed from their free lists.

If thread F frees a block B that was allocated by thread M, then F inserts B into a lockfree stack of "wayward blocks" associated with M. When M runs out of blocks, it will pop the entire stack in a single cmpxchg operation, and then add each block to its free lists. How does F find that list? Each arena keeps a pointer to the stack of wayward blocks of the thread which owns it. Because arenas are naturally aligned, F can compute the address of B's host arena from B's address. Each arena also has an internal stack for "disowned blocks". If a thread exits and must free an arena before all its blocks are free, then it modifies that arena's "wayward blocks" pointer to point to the stack of disowned blocks.
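The wayward-block path above can be sketched roughly as follows. This is a minimal illustration, not __nalloc's actual code: the names (`arena_of`, `wayward_push`, `wayward_popall`, `free_wayward`), the struct layouts, and the 4096-byte arena size are all assumptions. It also uses an atomic exchange for the pop-all where the text describes a cmpxchg; with only push and pop-all operations on the stack, a plain exchange should be enough.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE 4096u  /* assumed: one page per arena */

struct block { struct block *next; };

/* One lock-free stack of wayward blocks per owning thread (hypothetical layout). */
struct wayward_stack { _Atomic(struct block *) top; };

/* Header at the base of every naturally aligned arena (hypothetical layout). */
struct arena { struct wayward_stack *owner_wayward; };

/* Natural alignment: clearing the low bits of a block address yields its arena. */
static struct arena *arena_of(void *b)
{
    return (struct arena *)((uintptr_t)b & ~(uintptr_t)(ARENA_SIZE - 1));
}

/* F's side: push one block with a CAS loop. */
static void wayward_push(struct wayward_stack *s, struct block *b)
{
    struct block *old = atomic_load_explicit(&s->top, memory_order_relaxed);
    do {
        b->next = old;  /* on CAS failure, old is refreshed and we retry */
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->top, &old, b,
                 memory_order_release, memory_order_relaxed));
}

/* M's side: detach the whole stack in one atomic operation. */
static struct block *wayward_popall(struct wayward_stack *s)
{
    return atomic_exchange_explicit(&s->top, NULL, memory_order_acquire);
}

/* free() of a block owned by another thread. */
static void free_wayward(void *p)
{
    struct arena *a = arena_of(p);
    assert(a->owner_wayward != NULL);
    wayward_push(a->owner_wayward, (struct block *)p);
}
```

M would then walk the list returned by `wayward_popall()` and add each block to its free lists. The sketch ignores the thread-exit case, which is where the race discussed later comes from.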



Benchmarks

I wrote three benchmarks and analyzed them in perf, gperftools, and vtune.

In the first benchmark, each thread randomly allocates, writes, and frees into a private pool of memory.

In the second, threads allocate into a global pool implemented as a set of lockfree stacks.

In the third, a single thread allocates into a global pool, and all other threads free from the pool.

The total workload stays constant over the number of threads. Allocation sizes are limited to <= 1024B. The max amount of allocated bytes per thread is limited, but threads may make new allocations after freeing old ones. Time durations are global (wall-clock) times taken using gettimeofday().

Except where noted, the figures to follow were generated with the first benchmark.

Optimizations and comments, chronologically

I needed to align addresses up and down often. Surprisingly, my align_up() and align_down(), just an addition or subtraction and a modulo, took a combined 9% of runtime. I replaced those with bitops for powers-of-2 and the cost went away entirely. It didn't improve scaling, but I was amazed to find that div was so intense, in an arithmetic sort of way.
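For illustration, the power-of-two versions might look like the sketch below. The names align_up() and align_down() come from the text; the signatures and the assertion are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of size. size must be a power of two. */
static inline uintptr_t align_down(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return addr & ~((uintptr_t)size - 1);   /* mask instead of addr - addr % size */
}

/* Round addr up to a multiple of size. size must be a power of two. */
static inline uintptr_t align_up(uintptr_t addr, size_t size)
{
    return align_down(addr + size - 1, size);
}
```

The mask replaces the modulo, so no div instruction is needed on this path.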

My original design used doubly linked lists of *arenas* rather than blocks. I thought it would help fragmentation, locality, and prefetching to exhaust arenas in order. Instead, threads wound up doing O(n) search over thousands of not-quite-full-enough arenas in order to find one with enough free contiguous space for big requests. Tricks like rotating arenas didn't help.

Arena initialization was taking 15% of runtime. In the relevant ASM, the most expensive instruction was the first MOV into memory. This could have been nasty, but I happened to know that Linux overcommits memory. That is, mmap() will reserve virtual addresses, but it assumes you don't really need the memory and it won't fetch physical frames until you actually touch it. I guessed that arena_init was pagefaulting in order to finalize those new VM mappings, and some man told me about an mmap flag to disable overcommit.

This plugged the page faults, but the rest of that work just moved into mmap(). I implemented a combination of batching and prefetching (each allocation of an arena allocates N others too), and mmap() has been at <1% ever since. It wasn't obvious that this would work, but it seemed likely that the expensive part was a VM segment tree lookup, and the kernel could do just 1 of those per request - and obviously amortize syscall and locking overhead.
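A rough sketch of the batching idea, under stated assumptions: the text does not name the mmap flag, so MAP_POPULATE (a Linux flag that pre-faults the mapping) is my guess at a flag with the described effect, and `fetch_page_batch`, `NA_PAGE_SIZE`, and `BATCH_PAGES` are hypothetical names and values.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS and MAP_POPULATE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define NA_PAGE_SIZE 4096u
#define BATCH_PAGES  16u  /* hypothetical batch size N */

#ifndef MAP_POPULATE
#define MAP_POPULATE 0    /* Linux-only flag; treated as a no-op elsewhere */
#endif

struct page { struct page *next; };

/* Map BATCH_PAGES pages with one syscall and chain them into a list
   in address order. Returns NULL if mmap() fails. */
static struct page *fetch_page_batch(void)
{
    unsigned char *base = mmap(NULL, (size_t)NA_PAGE_SIZE * BATCH_PAGES,
                               PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                               -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    assert(((uintptr_t)base & (NA_PAGE_SIZE - 1)) == 0);

    struct page *head = NULL;
    for (size_t i = BATCH_PAGES; i-- > 0; ) {
        struct page *p = (struct page *)(base + i * NA_PAGE_SIZE);
        p->next = head;
        head = p;
    }
    return head;
}
```

A caller would keep one page for itself and push the rest onto the global page stack, so the syscall cost is spread over the whole batch.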

On that note: you could argue that, by using mmap(NULL, MAP_ANON,...), __nalloc just moves the hard general-purpose allocation problem into the kernel. But someone will have to call mmap() no matter what, and you can't disable overcommit if you plan to preemptively mmap() some huge chunk of memory in order to manage it yourself. As long as the cost of mmap doesn't dominate, it's clearly the right thing to do. That's probably why many of the lockfree allocator papers I've read admitted to this trick too. It's fun to wonder whether the kernel can do its allocations without locks. I tried very briefly and failed in 410, and Linux has locks. But Alexia Massalin pulled it off in Synthesis.

A less right thing is that I never return memory to the system. This would make the pop from the page stack vulnerable to use-after-free. It would be exactly the bug that I mentioned in my comment on the lockfree stack slides. This is ironic because that comment also claimed that my project was motivated by this issue. If I were to solve it, here's my hypothetical solution: keep a refcount on the stack and only free if the refcount had reached 0 at any point after an arena was popped from the stack. You could check for this by keeping track of the tag value observed when the refcount was last 0. Then, upon attempted free, check to see that the tag observed when the page was popped is <= the tag mentioned above.

After writing an email to Kayvon professing "correctness", I realized that I had a very bad and obvious race condition:

A thread exits and modifies an arena's wayward block stack pointer to point to that arena's disowned_blocks stacks.

Meanwhile, another thread had read that old pointer and has committed itself to inserting onto the newly-dead thread's wayward blocks stack.

Segfault or corruption.

nalloc's design would be vulnerable to this too. The solution that I found there (described further on) doesn't work here. You could refcount the number of allocated blocks which could possibly wind up on a thread's wayward block stack. Is there a more clever (lockfree) way than having a locked instruction on the common path in order to satisfy a rare edge case?

Results

After this, and a long interlude of fixing bugs, here's how my performance looked on the theoretically perfectly parallel, no-migrations workload on a 64 core machine:

So __nalloc scales as poorly as ptmalloc (which isn't quite as bad as what I claimed in my presentation). And here's gperftools' sample-based profile of performance with 6 cores on GHC*:

*(I didn't get libunwind/gperftools set up on the ALADDIN machine)

Here's what hardware counters report, according to Linux perf:

Performance counter stats for './natest -t 6 -o 10000':

      690,482,105 L1-dcache-loads
       67,097,600 L1-dcache-misses         #  9.72% of all L1-dcache hits  [26.08%]
      523,794,050 L1-dcache-stores
       22,848,232 L1-dcache-misses
        5,618,859 L1-dcache-prefetches
          377,630 L1-dcache-misses
       29,608,157 LLC-loads
       14,276,385 LLC-misses               # 48.22% of all LL-cache hits   [21.63%]
       45,153,262 LLC-stores
        8,554,146 LLC-misses
        6,247,900 LLC-prefetches
        2,285,035 LLC-misses
              389 page-faults
    6,665,279,571 cpu-cycles
    4,921,627,270 stalled-cycles-frontend  # 73.84% frontend cycles idle
    2,489,718,064 stalled-cycles-backend   # 37.35% backend cycles idle
       24,771,227 branch-misses
       22,731,022 cache-misses             # 46.787 % of all cache refs
       48,583,721 cache-references
    3,711,376,010 instructions             #  0.56 insns per cycle
                                           #  1.33 stalled cycles per

      0.415147303 seconds time elapsed

Compare that to tcmalloc:

Performance counter stats for './tctest.sh -t 6 -o 10000':

      691,325,411 L1-dcache-loads
       18,485,376 L1-dcache-misses         #  2.67% of all L1-dcache hits  [27.18%]
      403,827,409 L1-dcache-stores
        6,107,731 L1-dcache-misses
        1,874,221 L1-dcache-prefetches
          397,279 L1-dcache-misses
        5,351,265 LLC-loads
          799,161 LLC-misses               # 14.93% of all LL-cache hits   [23.60%]
       13,872,244 LLC-stores
        2,803,597 LLC-misses
           58,189 LLC-prefetches
            6,720 LLC-misses
           15,043 page-faults
    1,679,857,462 cpu-cycles
      566,843,996 stalled-cycles-frontend  # 33.74% frontend cycles idle   [23.33%]
      515,528,620 stalled-cycles-backend   # 30.69% backend cycles idle    [23.07%]
        5,163,499 branch-misses
        2,378,294 cache-misses             # 26.742 % of all cache refs    [22.54%]
        8,893,414 cache-references
    2,850,831,772 instructions             #  1.70 insns per cycle
                                           #  0.20 stalled cycles per insn [27.57%]

      0.106878690 seconds time elapsed

Note that Linux's sysfs on the target machine (ghc29) reports that LL/L3 caches are shared while L1 and L2 are local to each physical core.

__nalloc is memory bound. It makes 3.9x as many LLC references as tcmalloc while incurring a miss on 6.3x as many. For reference, for every free(), merge_adjacent() touches up to 3 block headers and inserts into the global free list. For every malloc(), shave touches up to 3 block headers and pops from the free list too. On the other hand, tcmalloc's profile shows that it's just pushing and popping to a singly linked list.

The most telling detail is that only 9% of L1 data cache loads missed while 48% of LLC loads missed. The effect isn't present in the single threaded case, and it increases with the number of threads, the number of allocations, and the uptime. So __nalloc has enough locality for its working set to fit in cache for a while, but then it does something completely unexpected, and is more likely to do so given the above conditions. My guess is that this is the effect of a global linked list of blocks: as the uptime and number of allocations increase, it's more and more likely that adjacent blocks on the list are from arenas that have been forced out of the shared LLC. This increases with the number of threads because more threads spread the accesses out over more arenas. A single threaded program can search all arenas for a block of the right size, but a thread in the multithreaded case must restrict itself to its subset, and is thus more likely to have to mmap a new arena. So the hypothesis is poor locality in even the single threaded case, made much worse in the multithreaded case by under-utilization. (This isn't exactly fragmentation as I've heard it used - it's fragmentation where the contiguous block *exists*, but you can't access it because the address space is split).

Note: There should be no conflict misses in this 100% parallel benchmark - nalloc's low LLC miss rate on the same test suggests that's the case. In retrospect, I should have tested my hypothesis by other means than writing an entirely new allocator - i.e. varying the max number of allocations.

Update: I just tested this by increasing the max number of allocations in the single-threaded case. The same miss rates resulted - so indeed, that suggests to me that under-utilization with more threads makes the inherent poor locality of my sequential algorithm bad enough to thrash the LLC.

Given coalescing, I saw no good alternative to a global list. I considered partially disabling merging, keeping a single "current arena" pointer to check against in O(1), or trying to improve the ordering of blocks on the list. But the 418 assignments have suggested that adding complexity doesn't seem to pay off compared to changing the simple things.

Alex's report mentioned that he found no modern coalescing allocator. My own research into jemalloc, tcmalloc, and a few lockfree mallocs also failed to turn one up. I thought that I had probably found out why everyone avoids coalescing. They must have judged it to be naive, given the hardware.

nalloc

Approach

I designed nalloc to access memory less often, and in a more localized way.

The idea is to use the same lockfree wrapper, but to replace coalescing inside arenas with "slab" allocation. Each page/"slab" only provides blocks of a single size. The benefit is that you don't need to store headers, nor examine adjacent blocks in order to allocate or free; and you don't need a doubly linked list of blocks because you never need to "remove from the middle". The common case allocation is two loads and a store to pop a non-lockfree stack. The drawback is that you can waste space.

Algorithm:

Each thread keeps a private non-lockfree stack of *slabs* for each size class.

Naturally-aligned page sized slabs come from a global LF stack of pages.

Each slab contains a thread-private non-lockfree stack of blocks of identical sizes.

malloc() peeks at the head slab and pops from that slab's stack of blocks. If the slab is empty, then it gets popped and not put into any data structure. The *only* references to it are the allocated blocks which it holds.

free() exploits natural alignment to derive a block's slab and push the block onto the slab's block stack. If the slab's block stack is empty, then the slab is added back onto the stack of slabs corresponding to its size.

Each slab contains an LF stack of wayward blocks. If malloc() exhausts a slab, it'll pop that slab's wayward blocks and push them onto that slab's free block stack. free() will push onto the stack of wayward blocks if it detects that another thread owns the slab. If it has freed the last block in the slab and all of the blocks in the slab are wayward blocks, then it'll free the slab or steal it for the calling thread.



This is O(1) if you don't count cmpxchg loops inside the stack.
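The owner-thread common path described above might be sketched like this. It is a simplified illustration rather than nalloc's code: it covers a single size class, leaves out wayward blocks and the global page stack entirely, and all names (`slab_of`, `slab_init`, `slab_malloc`, `slab_free`) and layouts are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE 4096u   /* assumed: page-sized, naturally aligned slabs */

struct block { struct block *next; };

struct slab {
    struct slab  *next;        /* link in the owner's slab stack */
    struct block *free_blocks; /* private, non-lockfree stack of free blocks */
    size_t        block_size;
};

/* Per-thread stack of slabs for one size class. Invariant: every slab
   on the stack has at least one free block. */
struct slab_stack { struct slab *top; };

static struct slab *slab_of(void *p)
{
    return (struct slab *)((uintptr_t)p & ~(uintptr_t)(SLAB_SIZE - 1));
}

/* Carve the space after the header into equal blocks (eager version). */
static void slab_init(struct slab *s, size_t block_size)
{
    assert(block_size >= sizeof(struct block));
    unsigned char *base = (unsigned char *)s;
    size_t first = (sizeof *s + block_size - 1) / block_size * block_size;
    s->next = NULL;
    s->free_blocks = NULL;
    s->block_size = block_size;
    for (size_t off = first; off + block_size <= SLAB_SIZE; off += block_size) {
        struct block *b = (struct block *)(base + off);
        b->next = s->free_blocks;
        s->free_blocks = b;
    }
}

/* malloc: peek at the head slab and pop one of its blocks. */
static void *slab_malloc(struct slab_stack *ss)
{
    struct slab *s = ss->top;
    if (!s)
        return NULL;              /* a real allocator would fetch a new slab */
    struct block *b = s->free_blocks;
    s->free_blocks = b->next;
    if (!s->free_blocks) {        /* exhausted: drop our only reference to s */
        ss->top = s->next;
        s->next = NULL;
    }
    return b;
}

/* free by the owning thread: derive the slab from the address, and put
   the slab back on the slab stack if it had been dropped as empty. */
static void slab_free(struct slab_stack *ss, void *p)
{
    struct slab *s = slab_of(p);
    struct block *b = p;
    if (!s->free_blocks) {
        s->next = ss->top;
        ss->top = s;
    }
    b->next = s->free_blocks;
    s->free_blocks = b;
}
```

In this variant the slab is unlinked as soon as its last block is handed out, which keeps the "every listed slab can satisfy a request" invariant simple to state.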

The biggest challenge was dealing with wayward blocks in a way that efficiently avoided the race in __nalloc or some other race in its place. Consider some of the options, where thread M is the owner of slab S and thread F is freeing a wayward block onto S:

Suppose F just quietly pushes onto wayward_blocks (the stack) in this scheme. The issue is that M is treating free() as a signal that S is nonempty. If every thread quietly pushes onto wayward_blocks until S's wayward_blocks is full, then no signal can ever be received and S is leaked.

Suppose F tries to actively signal somehow, or to just move S back onto M's slab stacks (make them lockfree). M might exit, so this is exactly the problem that __nalloc faced. You could store the slab stacks on a management block that lives longer than M and refcount the number of blocks which have slabs with references to it. I implemented most of this - unlike in coalescing, you can avoid updating the refcount upon every malloc and free by exploiting the fact that you can compute the number of allocated blocks in a slab. But it was complicated and hard to justify in favor of simply using a mutex.

M could "search for signals" rather than waiting. You'd pay for O(N) search.

The key is that M doesn't really need to keep track of empty slabs. So in my algorithm described above, M purposefully loses its reference to the slab in order to prevent racing with F. If F's insertion would fill the slab with only wayward blocks, then neither M nor any other thread could hold a reference to S; and F can safely do what it wants with the slab. Otherwise, F can't notify M but there's still a chance that M might "receive" the "slab non-empty signal" by freeing a block on S.

Happily, stealing is something that I had wanted to support from the start. Suppose you have a producer-consumer setup where the consumer needs memory too. Does it really make sense to return all memory to the producer if the consumer has loaded it into cache? And even if stealing isn't right for a workload, it's better if the last thread to load a slab into cache is the one freeing it.

The cost of this scheme is that you either absorb conflict misses or reserve 64B per slab in order to put the wayward blocks stack onto a different cache line from the block stack. With larger slabs, 64B is a reasonable overhead.
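One way to express that separation in C11 is an alignment specifier on the shared field, as in the sketch below. The field names and layout are assumptions, and the exact number of bytes lost to padding depends on how the header is arranged.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64

struct block { struct block *next; };

struct slab {
    /* Touched only by the owning thread. */
    struct slab  *next;
    struct block *free_blocks;
    size_t        block_size;

    /* Pushed to by other threads, so it starts on its own cache line. */
    alignas(CACHE_LINE) _Atomic(struct block *) wayward_blocks;
};

/* The two stacks must never share a line. */
static_assert(offsetof(struct slab, wayward_blocks) % CACHE_LINE == 0,
              "wayward stack must start a cache line");
static_assert(offsetof(struct slab, free_blocks) / CACHE_LINE !=
              offsetof(struct slab, wayward_blocks) / CACHE_LINE,
              "private and shared stacks must be on different lines");
```

Without the alignas, remote pushes onto wayward_blocks would invalidate the line that the owner reads on every malloc and free.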

In order to avoid races when attempting to figure out whether a slab is empty or whether it consists of only wayward blocks, I needed to store a size directly in my lockfree stack. I stole 32 bits from the tag field to do this.

A stack of slabs works where a list of arenas didn't work in __nalloc because each slab can guarantee the ability to satisfy an allocation of a certain size. It has the theoretical benefits I mentioned before: locality in the allocator and also in the program that uses the memory, sequential access in the allocator (see below), and lower fragmentation (the other slabs get a better chance to fill up and be freed). This is an advantage of slab allocation that I completely missed when doing my initial design - I probably wouldn't have discovered it in any way except trying and failing.

Unlike the similar O(N) wayward stack clearing in __nalloc, this stack_popall() exchange when popping wayward blocks is O(1). The key is that you know that a slab's block stack is empty, so you can simply move the top node of the wayward_blocks stack onto it without breaking links.
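A sketch of why this can be O(1), assuming a simplified slab with hypothetical field names. The real stack also carries a tag and a size, which are left out here, and the function name `slab_reclaim_wayward` is made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct block { struct block *next; };

struct slab {
    struct block *free_blocks;               /* owner-private stack */
    _Atomic(struct block *) wayward_blocks;  /* lock-free stack, pushed by others */
};

/* Called by the owner once free_blocks is empty. Detaches the whole
   wayward chain with one atomic exchange and adopts it as the private
   stack. No links are rewritten and no nodes are visited.
   Returns 1 if any blocks were reclaimed, 0 otherwise. */
static int slab_reclaim_wayward(struct slab *s)
{
    assert(s->free_blocks == NULL);
    s->free_blocks = atomic_exchange_explicit(&s->wayward_blocks, NULL,
                                              memory_order_acquire);
    return s->free_blocks != NULL;
}
```

The chain is already a valid singly linked stack, so because the destination is known to be empty, moving the head pointer is the entire operation.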

Intel's x86 optimization guide (and reason) says that the prefetcher will in fact detect sequential accesses. So slabs also keep track of the size of the section of contiguous blocks starting at the base of the slab. They'll prefer to allocate and deallocate from this section rather than from the stack (actually, they should prefer to allocate from the stack - I implemented that backwards). This has the additional effect of making slab initialization O(1) because the stack traversal fields don't need to be written to each block immediately upon slab initialization.
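The contiguous section could be tracked with a single counter, roughly as below. This is an assumed reconstruction: the names are hypothetical, the header is rounded up to a whole block for simplicity, and the sketch takes from the stack first, which is the order the text says it should have been.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE 4096u

struct block { struct block *next; };

struct slab {
    struct block *free_blocks;   /* blocks freed out of order */
    size_t        block_size;
    size_t        contig_blocks; /* free blocks in the contiguous section */
};

/* Address of the first block, just past the (rounded-up) header. */
static unsigned char *first_block(struct slab *s)
{
    size_t hdr = (sizeof *s + s->block_size - 1) / s->block_size * s->block_size;
    return (unsigned char *)s + hdr;
}

/* O(1) init: no per-block links are written. */
static void slab_init(struct slab *s, size_t block_size)
{
    assert(block_size >= sizeof(struct block));
    s->free_blocks = NULL;
    s->block_size = block_size;
    s->contig_blocks = 0;
    s->contig_blocks =
        (SLAB_SIZE - (size_t)(first_block(s) - (unsigned char *)s)) / block_size;
}

static void *slab_alloc(struct slab *s)
{
    if (s->free_blocks) {                 /* stack first */
        struct block *b = s->free_blocks;
        s->free_blocks = b->next;
        return b;
    }
    if (s->contig_blocks > 0) {           /* then shrink the section from its end */
        s->contig_blocks--;
        return first_block(s) + s->contig_blocks * s->block_size;
    }
    return NULL;
}

static void slab_dealloc(struct slab *s, void *p)
{
    if ((unsigned char *)p == first_block(s) + s->contig_blocks * s->block_size) {
        s->contig_blocks++;               /* p borders the section: grow it */
        return;
    }
    struct block *b = p;                  /* otherwise push onto the stack */
    b->next = s->free_blocks;
    s->free_blocks = b;
}
```

Blocks taken from the section come out at consecutive addresses, which is the access pattern the prefetcher argument relies on.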

A future improvement might be to keep two stacks - one in any old order, and the other to be built in strictly decreasing order until it meets the contiguous section and is appended. That carries little cost if &second_stack->top is on the same cache line. (This sounds like a special case of some more general algorithm.)

This algorithm is less defensible for non-tiny allocations. You can't really have a stack of slabs for every 16B between 0B and (FOO * 4096)B, but you also don't want to satisfy 513B requests with 1024B blocks. Additionally, big blocks incur big overhead due to the header, so you want bigger slabs in that case - but bigger slabs waste more memory if they're not fully utilized and overcommit is off. A future implementation might elastically switch block sizes or toggle overcommit (or toggle reserving a cache line for a slab's wayward blocks stack) as it learns whether it actually needs to trade space for speed, and whether the amount of wasted space will be a significant fraction of total usage.

Also, with some modifications, I think it might be reasonable to use the space-efficient __nalloc in order to satisfy large allocations. A program probably wouldn't be making large allocations fast enough to saturate the allocator because it would be limited by the time it takes to fill up those allocations with actual data (assuming programs don't often preallocate space for a lot of huge things at once).

The only optimization I made after finishing the design was replacing a size class lookup table with arithmetic to compute the size of an allocation. That shaved 5% off of runtime. In retrospect, this might have been because of false sharing with a global LF stack - I marked the stacks with __aligned__(64) because I was worried about that, but I didn't mark the table.
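As an illustration of replacing the table with arithmetic, here is a sketch that assumes size classes spaced every 16 bytes up to 1024B. The spacing, the names, and the constants are assumptions; the text doesn't give the actual formula.

```c
#include <assert.h>
#include <stddef.h>

#define CLASS_SHIFT 4u                          /* assumed 16-byte granularity */
#define MAX_SIZE    1024u
#define NUM_CLASSES (MAX_SIZE >> CLASS_SHIFT)   /* 64 classes */

/* Map a request of 1..MAX_SIZE bytes to a class index 0..NUM_CLASSES-1. */
static inline size_t size_class_of(size_t size)
{
    assert(size >= 1 && size <= MAX_SIZE);
    return (size - 1) >> CLASS_SHIFT;
}

/* Block size served by a class: the request rounded up to 16 bytes. */
static inline size_t class_block_size(size_t cls)
{
    assert(cls < NUM_CLASSES);
    return (cls + 1) << CLASS_SHIFT;
}
```

Both functions touch no memory, so they can't share a cache line with anything, unlike a global table.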

Results

Results were generated with the benchmarks mentioned above.

nalloc's final profile in the 100% parallel case looks like this. Perf reports the following:

Performance counter stats for './natest -t 6 -o 10000':

      519,955,974 L1-dcache-loads
       22,066,220 L1-dcache-misses         #  4.24% of all L1-dcache hits  [26.59%]
      355,314,274 L1-dcache-stores
        5,272,972 L1-dcache-misses
        4,815,201 L1-dcache-prefetches
          396,406 L1-dcache-misses
        7,920,682 LLC-loads
        1,391,780 LLC-misses               # 17.57% of all LL-cache hits   [22.11%]
       12,972,675 LLC-stores
        1,386,302 LLC-misses
           14,266 LLC-prefetches
            2,807 LLC-misses
              384 page-faults
    1,712,655,726 cpu-cycles               #  0.000 GHz
      692,724,604 stalled-cycles-frontend  # 40.45% frontend cycles idle
      492,748,671 stalled-cycles-backend   # 28.77% backend cycles idle
        7,614,825 branch-misses
        2,800,160 cache-misses             # 23.218 % of all cache refs
       12,060,173 cache-references
    2,413,914,436 instructions             #  1.41 insns per cycle
                                           #  0.29 stalled cycles per insn [27.19%]

      0.105481932 seconds time elapsed

It's still memory bound, but not quite so badly. Compare that to the tcmalloc performance from before. nalloc makes fewer loads and stores. And here's how it compares to a range of allocators:

To repeat, the above shows scaling on a perfectly parallel fixed-sized-workload test where threads randomly allocate, write, and free blocks of size <= 1024 into and from private pools. No blocks migrate between threads.

The results look good. At less than 500 LOC, nalloc competes with jemalloc and tcmalloc, which are 10K and 20K LOC, respectively. The major caveat is, of course, that nalloc targets only one test and one criterion, while the rest are production ready. Having some knowledge of the differences and similarities, however, I think that it's likely that nalloc's design can be fruitfully extended to cover the full range of cases.

llalloc in these tests is Lockless Inc.'s LockLess allocator, which isn't 100% lockfree (it has a lock around the global heap). When run using the same exact test, it was twice as fast as nalloc in serial, but only achieved 25x speedup with 64 cores. When run at twice the length, it achieved a 40x speedup identical to nalloc's. I suppose some once-per-thread overhead must be dominating at that level (~42ms time-to-completion), but I didn't investigate.

I did, however, make sure that the other allocators don't suffer from this effect. Rather, they scale worse as problem size increases and the shared caches presumably become an issue.

The true lockfree allocators, NBMalloc, Streamflow, and the implementation of Maged Michael's alloc, all failed to compile on ALADDIN (2 failed to compile on my machine) and I don't think I should attempt to resolve that now.

Graphs of time-to-completion are hard to read for 64 cores, but I found that "work" graphs were a legible mix of parallel and sequential execution properties. The Y axis is num_threads * time-to-completion. Since I expect workloads to have been distributed evenly, this should be an approximation of total work done. A good slope should be <= 0. A Y value above the single-threaded Y value gives an approximation of overhead.

In the second test, where threads allocate and free into and from a global pool, block migration is very high.

WARNING: these tests are flawed. See below.

In the third test, where a single thread produces blocks for other threads to free, block migration is also very high.

Tests 2 and 3 are flawed, as the communication costs of sending blocks between threads via the global pool dominate. In a 6 core run of test 2, 76% of runtime is spent outside of nalloc, most of it operating on the lockfree stacks. I've split up the pool into multiple stacks and I'm 100% sure that the stacks are on different cache lines. The tests need to be rewritten to transfer blocks implicitly or just less often. At this point, though, I'd rather join Alex R. in his quest to find real workloads with block migrations.

Lessons Relearned

If you have the luxury of an existing solution, learn your problem's likely bottlenecks ahead of time.

Conversely, even when that's possible, trying and failing might be the only way to understand the tradeoffs well enough to tackle those bottlenecks.

You can't get away with writing a fast parallel wrapper around fast single-threaded code. You have to pay attention to the sequential parts of a parallel program, as they may contend or communicate in an obscure way. This was SAXPY's lesson (although SAXPY had a different underlying issue), but I had forgotten it.

If you're theoretically perfectly parallel, but not seeing perfect speedup, consider the effects of shared memory and/or shared caches vs working set size.

LibreOffice charts look uglier outside of LibreOffice.

References