Introduction

Writing a simple memory allocator is quite easy. Writing an efficient allocator that tries to minimize fragmentation, overhead and locking while trying to maximize locality of reference and performance is an extremely hard task, even for experienced programmers.

In this post, I’ll cover the basics of designing a simple memory allocator. The technique that will be used is called heap layers, which is described in the excellent paper “Composing High-Performance Memory Allocators”.

Essentially, what the authors propose is to create a set of elementary layers that perform simple operations on the memory they allocate, and to compose those layers together to create complex allocators. Instead of using virtual calls to forward requests to upper layers, they use C++ mixins, described in the “Mixin-Based Programming in C++” paper. Mixins are templated classes that can have any parent class, making it possible to combine layers while still inlining calls to parent functions.

For example, Doug Lea’s dlmalloc is a very popular allocator, widely used in various software. Understanding its 2000+ lines of code well enough to modify it is hard and error-prone. On the other hand, the authors of the heap layers paper have reproduced dlmalloc‘s behavior using a couple of layers, in around 500 lines of code, with similar performance and memory usage. The main advantage of this technique is that it makes it really easy to test new changes to the allocator.

The allocator that will be described in this post will look like this:

typedef StrictSegHeap< 10, StrictSegHeapTraits, ThreadSafeHeap<PagedFreeList>, ThreadSafeHeap<CRTMallocHeap> > GlobalHeap;

A strict segregated heap of 10 fixed-size paged free-lists for small allocations, using CRT malloc for large allocations, with a lock per size instead of a single lock to minimize contention. The layers provided here are for the purpose of the post, so performance was traded for readability and ease of understanding in most cases.

Let’s get started!

Basic Allocator

We start by writing a simple memory allocator that gets its memory from CRT malloc:

class CRTMallocHeap {
public:
    inline void* Malloc(tU32 Size) { return malloc(Size); }
    inline void* Realloc(void* Ptr, tU32 Size) { return realloc(Ptr, Size); }
    inline void Free(void* Ptr) { free(Ptr); }
};

Then, we plug this into the 3 memory functions described in the previous posts:

typedef CRTMallocHeap GlobalHeap;
GlobalHeap GHeap;

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    void* Result = GHeap.Malloc(Size);
    RegisterAlloc(Result, Size, AllocType, Desc, File, Line);
    return Result;
}

void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line)
{
    void* Result = GHeap.Realloc(Ptr, Size);
    UpdateAlloc(Ptr, Result, Size, File, Line);
    return Result;
}

void _Free(void* Ptr)
{
    UnregisterAlloc(Ptr);
    GHeap.Free(Ptr);
}

We are now ready to implement our first memory allocator layer.

Free List Allocator

Remember the FreeList class described in the second post? It essentially allocates memory pages from the OS using VirtualAlloc and splits them into fixed-size chunks. Then, any allocation that fits into that chunk size can be served from the free-list extremely fast.

Let’s write a heap layer for that:

class PagedFreeList {
    tU8* FirstFree;
public:
    PagedFreeList() : FirstFree(NULL) {}

    inline void* Malloc(tU32 Size) {
        if(!FirstFree) {
            const tU32 PageSize = 65536;
            const tU32 NumAllocPerBatch = PageSize / Size;
            // Allocate a 64K page from the OS
            tU8* AllocBatch = (tU8*)VirtualAlloc(NULL, PageSize, MEM_COMMIT, PAGE_READWRITE);
            if(!AllocBatch) {
                return NULL;
            }
            // Split the page into chunks and push them all on the free-list
            for(tU32 i = 0; i < NumAllocPerBatch; i++) {
                Free(AllocBatch);
                AllocBatch += Size;
            }
        }
        tU8* Result = FirstFree;
        FirstFree = *((tU8**)FirstFree);
        return Result;
    }

    inline void Free(void* Ptr) {
        // The first bytes of each free chunk link to the next free chunk
        *(tU8**)Ptr = FirstFree;
        FirstFree = (tU8*)Ptr;
    }
};

Size Heap

Some heaps need to know the size of an allocation in order to route it to the right super-heap. A quick way to achieve that is to write a simple layer that allocates a small header along with each allocation and stores the size of the allocation in that header:

template<class SuperHeap>
class SizeHeap : public SuperHeap {
    struct FreeObject {
        tU32 Size;
    };
public:
    inline void* Malloc(tU32 Size) {
        FreeObject* Ptr = (FreeObject*)SuperHeap::Malloc(Size + sizeof(FreeObject));
        if(Ptr) {
            Ptr->Size = Size;
            return (void*)(Ptr + 1);
        }
        return NULL;
    }

    inline void* Realloc(void* Ptr, tU32 Size) {
        FreeObject* NewPtr = (FreeObject*)SuperHeap::Realloc(Ptr ? (FreeObject*)Ptr - 1 : NULL, Size + sizeof(FreeObject));
        if(NewPtr) {
            NewPtr->Size = Size;
            return (void*)(NewPtr + 1);
        }
        return NULL;
    }

    inline void Free(void* Ptr) {
        SuperHeap::Free((FreeObject*)Ptr - 1);
    }

    inline tU32 GetSize(void* Ptr) {
        return ((FreeObject*)Ptr - 1)->Size;
    }
};

This heap adds 4 bytes of overhead per allocation. Other schemes could be used to minimize this overhead; for example, the PagedFreeList could implement GetSize() by keeping track of the allocation size of each page and returning that value, as a page always contains allocations of the same size.

ThreadSafe Heap

If we want to add thread-safety to our layers, we simply need to implement a ThreadSafeHeap layer that provides Lock() and Unlock() functions, like this:

template<class SuperHeap>
class ThreadSafeHeap : public SuperHeap {
    volatile tS32 LockSem;

    inline void Lock() {
        while(1) {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0)) {
                // We successfully acquired the lock
                MemoryBarrier();
                return;
            }
            YieldProcessor();
        }
    }

    inline void Unlock() {
        MemoryBarrier();
        LockSem = 0;
    }
public:
    ThreadSafeHeap() : LockSem(0) {}

    inline void* Malloc(tU32 Size) {
        Lock();
        void* Result = SuperHeap::Malloc(Size);
        Unlock();
        return Result;
    }

    inline void* Realloc(void* Ptr, tU32 Size) {
        Lock();
        void* Result = SuperHeap::Realloc(Ptr, Size);
        Unlock();
        return Result;
    }

    inline void Free(void* Ptr) {
        Lock();
        SuperHeap::Free(Ptr);
        Unlock();
    }
};

Then, adding this layer on top of another one makes it thread-safe. Simply rearranging where we place this layer can change the allocator from a slow, single lock:

typedef ThreadSafeHeap< StrictSegHeap< 10, StrictSegHeapTraits, PagedFreeList, CRTMallocHeap > > GlobalHeap;

to a lock per-layer:

typedef StrictSegHeap< 10, StrictSegHeapTraits, ThreadSafeHeap<PagedFreeList>, ThreadSafeHeap<CRTMallocHeap> > GlobalHeap;

Using a single lock will block whenever two or more threads use the allocator at the same time. On the other hand, using a lock per sub-heap as shown in the last code sample will only block when two or more threads allocate in the same size range at the same time.

Strict Segregated Heap

The last layer that will be shown is a strict segregated layer: a layer that separates allocations by size using sub-heaps. This layer is strict because it doesn’t perform block splitting or coalescing (as dlmalloc does, for instance).

First, we need to define how the sizes will be separated. For this example, I chose simple powers of two between 8 and 4096:

1..8 9..16 17..32 33..64 65..128 129..256 257..512 513..1024 1025..2048 2049..4096

Which translates into code like this:

struct StrictSegHeapTraits {
    // Return the bin that this size falls into
    static inline tU32 GetSizeClass(tU32 Size) {
        tU32 c = 0;
        Size--;
        while(Size > 7) {
            Size >>= 1;
            c++;
        }
        return c;
    }

    // Return the maximum size of allocations in this bin
    static inline tU32 GetClassMaxSize(tU32 Bin) {
        return 1 << (Bin + 3);
    }
};

And the strict segregated heap looks like this:

template<tU32 NumBins, class Traits, class LittleHeap, class BigHeap>
class StrictSegHeap : public BigHeap {
    // The array of little heaps (bins), one for each allocation size
    LittleHeap LittleHeaps[NumBins];

    // Return the bin that this allocation falls into
    inline tU32 InnerGetSizeClass(tU32 Size) {
        if(Size > Traits::GetClassMaxSize(NumBins - 1)) {
            return NumBins;
        }
        return Traits::GetSizeClass(Size);
    }
public:
    StrictSegHeap() {}

    inline void* Malloc(tU32 Size) {
        tU32 SizeClass = InnerGetSizeClass(Size);
        if(SizeClass >= NumBins) {
            // Allocation is too big, route to big heap
            return BigHeap::Malloc(Size);
        }
        tU32 MaxSizeClass = Traits::GetClassMaxSize(SizeClass);
        return LittleHeaps[SizeClass].Malloc(MaxSizeClass);
    }

    inline void* Realloc(void* Ptr, tU32 Size) {
        if(Ptr) {
            if(Size) {
                tU32 ObjectSize = GetSize(Ptr);
                tU32 ObjectSizeClass = InnerGetSizeClass(ObjectSize);
                if(ObjectSizeClass >= NumBins) {
                    return BigHeap::Realloc(Ptr, Size);
                }
                // Loose reallocation: only realloc if bigger or at least twice smaller
                if((Size > ObjectSize) || (Size < ObjectSize / 2)) {
                    void* NewPtr = Malloc(Size);
                    if(NewPtr) {
                        tU32 CopySize = Min(ObjectSize, Size);
                        memcpy(NewPtr, Ptr, CopySize);
                    }
                    LittleHeaps[ObjectSizeClass].Free(Ptr);
                    return NewPtr;
                }
                return Ptr;
            }
            Free(Ptr);
            return NULL;
        }
        return Malloc(Size);
    }

    inline void Free(void* Ptr) {
        if(Ptr) {
            tU32 ObjectSize = GetSize(Ptr);
            tU32 ObjectSizeClass = InnerGetSizeClass(ObjectSize);
            if(ObjectSizeClass >= NumBins) {
                BigHeap::Free(Ptr);
            } else {
                LittleHeaps[ObjectSizeClass].Free(Ptr);
            }
        }
    }
};

This heap layer routes small allocations into their corresponding sub-heaps and big allocations into the big heap. As we can see, the code relies on the sub-heaps providing the GetSize() function so that Realloc and Free can determine the size of incoming allocations; it has to be provided or the code won’t compile.

Conclusion

Using the heap layers technique, we can now implement and test new layers quite easily and compose them into complex allocators. There are many utility layers that could be written; for example, a debug layer that keeps a header and a footer for each allocation to detect memory overwrites. Even the memory tracking system described in the second post could be written as a layer. Make sure to read the papers mentioned above if you’ve read this far!