Allocation Set internals

Last week I explained (or attempted to) the basics of memory contexts in PostgreSQL. It was mostly about the motivation behind the concept of memory contexts, and some basic overview of how it all works together in PostgreSQL.

I planned to write a follow-up post about various "gotchas" related to memory contexts, but as I was writing that post I ended up explaining more and more details about internals of AllocSet (the only memory context implementation in PostgreSQL). So I've decided to split that post into two parts - first one (that you're reading) explains the internals of allocation sets. The next post will finally deal with the gotchas.

The level of detail should be sufficient for understanding the main principles (and some tricks) used in the AllocSet allocator. I won't explain all the subtle details - if you're interested in that (and the code is quite readable, if you understand the purpose), please consult the actual code in aset.c. Actually, it might be useful to keep that file opened and read the related functions as I explain what palloc and pfree do (at the end of this post).

blocks

The introductory post mentioned that Allocation Sets are based on "block pools" - that means the allocator requests memory in large blocks (using malloc ), and then slices these blocks into smaller pieces to handle palloc requests.

The block sizes are somewhat configurable, but by default the allocator starts with 8kB blocks and every time it needs another block it doubles the size up to 8MB (see the constants at the end of memutils.h). So it first allocates 8kB block, then 16kB, 32kB, 64kB, ... 8MB (and then keeps allocating 8MB blocks). You may change different sizes, but the minimum allowed block size is 1kB, and the block sizes are always 2^N bytes.

To make the management easier, each block is decorated with a header, represented by this structure defined in aset.c:

typedef struct AllocBlockData { AllocSet aset ; /* aset that owns this block */ AllocBlock next ; /* next block in aset's blocks list */ char * freeptr ; /* start of free space in this block */ char * endptr ; /* end of space in this block */ } AllocBlockData ;

Each block references the memory context it's part of, has a pointer to the previous block (to make a linked list of allocated blocks) and pointers delimiting the free space in the block. The header is stored at the beginning of the block - the AllocBlockData is ~32B if I count correctly, so this overhead is negligible even for the smallest block size (~3%).

Chunks

So the allocator acquires memory in blocks, but how does it provide it to the palloc callers? That's what chunks are about ...

Chunks are pieces of memory allocated within the blocks in response to palloc calls. Just like blocks, each chunk is decorated with a header (this is slightly simplified, compared to the actual structure definition in aset.c):

typedef struct AllocChunkData { void * aset ; /* the owning aset */ Size size ; /* usable space in the chunk */ } AllocChunkData ;

Each chunk mentions which memory context it belongs to (just like blocks), and how much space is available in it.

Just like block sizes, the chunk sizes are not entirely arbitrary - sure, you may request arbitrary amount of memory (using palloc ), but the amount of memory reserved in the block will be always 2^N bytes (plus the AllocChunk header, which is usually 16B on 64-bit platforms).

For example, if you request 24 bytes (by calling palloc(24) ), the memory context will actually create a 32B chunk (because 32 is the nearest power of 2 after 24), and hand it back to you. So you will (rather unknowingly) 32B of usable space. The smallest allowed request is 8B, so all smaller requests get 8B.

Chunk overhead

Chunks are the main source of overhead in memory contexts.

Firstly, because of the 2^N sizes, there'll usually be some unused space. The bottom 1/2 of the chunk is always utilized (otherwise a smaller chunk would be used), and the upper half is 50% utilized on average, assuming a random distribution of requested sizes. Thus ~75% utilization and 25% overhead on average. That's not negligible, but it's probably a fair price for the features memory contexts provide (and compare it to overhead in garbage-collected languages, for example).

Secondly, there's the chunk header. A chunk with 32B usable space will actually occupy 48B. The impact of this really depends on the request size. For large requests it gets negligible (similarly to the block header), but as you can imagine for very small requests it gets very high (e.g. 200% for 8B, which is the smallest allowed request).

Chunk reuse

But why are the chunk sizes treated like this? The answer is "reuse." When you call pfree , the chunk is not actually returned to the system (using free ). It actually can't be, because it's only a small part of the block (which is what we got from malloc ), and there may be more chunks on it. So we can't just free the chunk or block.

Instead, the chunk is moved to a "freelist" - a list of empty chunks, and may be used for future palloc calls (requesting chunk with the same size). There's one freelist for each chunk size, so that we can easily get chunk of the "right" size (if there's one). Without using these "unified" chunk sizes, this would not be really possible - the size matching would be more expensive or less efficient.

Still, there are ways this can fail tremendously (e.g. when the assumtion of random distribution of sizes does not hold, or when the reuse does not work for some reason). More on these failures in the next post.

Oversized chunks

If there's a maximum block size, what happens to requests exceeding this limit? The current implementation creates a special one-chunk block for each such oversized chunk, with almost exactly the right size (no 2^N rules, just some basic alignment tweaks). The purpose of the block pool is to prevent overhead with allocating too many tiny pieces, so allocating large blocks directly is not a big deal (before you allocate enough of them to notice the overhead, you'll likely run out of memory).

More precisely - oversized chunks are not chunks exceeding max block size, but chunks exceeding 8kB (because the number of freelists is limited) or exceeding 1/8 of max block size (so that a stream of such chunks results in 1/8 overhead).

Anyway, the thing you should remember is that large chunks are special-cased - are allocated in dedicated blocks with the right size, and pfree immediately frees them.

Now, let's see what the two main methods - palloc and pfree do.

palloc (AllocSetAlloc)

Assuming the CurrentMemoryContext is set to an instance of allocation set allocator (and it's difficult to break this assumption, because there are no other implementations), the flow after calling palloc (see mcxt.c for implementation) looks about like this:

palloc ( size ) -> CurrentMemoryContext -> alloc ( CurrentMemoryContext , size ) -> AllocSetAlloc ( CurrentMemoryContext , size )

That is, the palloc looks at CurrentMemoryContext , and invokes the alloc method (which is part of MemoryContextMethods API, described in the previous post). As it's an AllocSet instance, it invokes AllocSetAlloc which then does all the work with block and chunks.

In short, it does about this:

Checks whether the size is exceeding the "oversized chunk" limit - in that case, a dedicated block (of just the right size) is allocated.

Checks whether there's an existing chunk in the freelist with the proper size, and if yes then uses it.

Checks whether there already is a block, and if there's enough space in it (that's what the block header is for). A new block is allocated (using malloc ) if necessary.

) if necessary. Reserves sufficient space in the block for the chunk, and returns it to the caller.

There are a few more tricks (e.g. what would you do with the remaining space in a block?) - the code is easy to read and quite well commented, althouth rather long. Not sure what happened to the "Every function should fit on a screen" recommendation - either they had extremely large screens, or small font size ;-)

pfree (AllocSetFree)

The pfree does not need to consult the CurrentMemoryContext - and it can't, because the chunk could have been allocated in a completely different context. But hey, that's what the chunk header is for - there's a pointer to the proper AllocSet instance. So the flow looks about this:

pfree ( chunk ) header = chunk - 16 B -> header -> aset -> free ( header -> aset , header ) -> AllocSetFree ( header -> aset , header )

And AllocSetFree doesn't do much:

Check if this is "oversized" chunk (those can and should be freed immediately).

If it's a regular chunk, move it to a freelist, so that it can be reused by subsequent palloc calls.

Summary