A Simple Macro

It began with a simple macro: (approximative code)

```c
#define ADD_BYTE(C) do { \
        if (offset == capa) { \
            if (capa < 16) { \
                capa = 16; \
            } else { \
                capa <<= 1; \
            } \
            buffer = realloc(buffer, capa); \
            assert(buffer != NULL); \
        } \
        buffer[offset++] = (C); \
    } while (0)
```

For the C-illiterate, this simple macro just appends the byte `C` to the dynamically allocated buffer `buffer`, whose capacity (in bytes) is `capa`. The next position to be written is `offset`. The capacity is doubled each time the buffer fills up (starting with a minimum of 16 bytes).

We’re adding bytes to a dynamic buffer, which is one of the most common things a program does (in strings, in arrays, etc.)

The question is whether this reallocation strategy is good enough. If you look at the complexity, assuming realloc is O(N), you quickly conclude that adding a single byte is O(1) amortized, which is good.

But what is the worst-case complexity, when reallocating a buffer?

Guy: Are you sure this is the right strategy? It looks like you have a big performance issue if you extend the size of a large array, say, one gigabyte. Imagine the impact, especially if the buffer has to be moved from the swap.

Me: Hmm, I never really thought about it, but I think it’s fine. I’m pretty sure the system can handle that gracefully, actually.

Guy: I think a linked structure might be a better idea, maybe with the same exponential strategy, for example.

Me: I’m pretty sure this is a waste of time.

Guy: Okay, show me.

Sheesh

Okay, now I need to back up my arguments. Like all programmers, I am lazy. Or rather: like all programmers should be, I am lazy. Laziness is an incentive to be smarter and more efficient: you are too lazy to execute repetitive or boring tasks, so you look for something smarter, something faster. Some call that laziness. I prefer to call it efficiency.

Having a linked list of blocks to store a buffer would be a bit cumbersome. I didn’t say impossible ("Impossible n’est pas français", as we say sometimes; well, it did not end well for Napoléon, but never mind), just cumbersome. Copying a sub-array or storing it on disk would require some work: we would probably need an index of all the blocks’ base pointers, take the base-2 logarithm of the offset to find the right block, and do some other boring stuff. Sheesh, I’m already tired.

I have the feeling that being lazy is the right thing to do here. Because for small memory blocks, we basically do not care, and for large blocks, the system’s virtual memory will gently take care of the issue.

Let’s have a look.

What Is realloc(), By The Way?

It is a regular function, standardized by POSIX, located in the C library. On Linux, you can find it inside the `libc.so.6` library, with its siblings `malloc` and `free`:

```shell
$ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep -E "T (malloc|realloc|free)$"
000000000007e450 T free
000000000007e030 T malloc
000000000007e4e0 T realloc
```

For the curious, “T” stands for “text symbol” (the uppercase means that the symbol is public and visible); the text segment of a program holds the code, the data segment holds initialized data (variables), and the bss segment holds uninitialized (well, actually zero-initialized) data (variables).

Allocating memory, reallocating memory and freeing memory is done through an allocator (thank you Captain Obvious!). There are plenty of them (one of the most common being the buddy allocator).

We can implement our own trivial allocator using the venerable sbrk call, which basically expands the end of the data segment to get more space:

```c
#include <sys/types.h>
#include <unistd.h>
#include <string.h>

void *malloc(size_t size)
{
    return sbrk(size);
}

void free(void *ptr)
{
    /* who cares ? */
}

/* Note: meant to grow only (yes, I am lazy) */
void *realloc(void *ptr, size_t size)
{
    void *nptr = malloc(size);
    if (nptr == NULL) {
        return NULL;
    }
    // 'old_size' is supposed to be known :)
    // (for example put in a header before the actual block)
    memcpy(nptr, ptr, old_size);
    free(ptr);
    return nptr;
}
```

[ Note: as someone pointed out on Hacker News, old_size needs to be known here; for example by storing it in a header before the returned address. And no, we shouldn’t free the original block upon realloc failure :) ]

A real allocator would be a bit more clever, of course, with complex data structures aimed at limiting the number of actual memcpy calls.

We need to dig a bit inside the Glibc used on Linux to figure out what’s happening for large blocks.

Digging In The Glibc

Just download a recent glibc release, and have a look at the source tree.

There is an interesting directory in there: the malloc directory. Just pick the malloc.c file and open it in an editor.

```c
/*
  The main properties of the algorithms are:
  * For large (>= 512 bytes) requests, it is a pure best-fit allocator,
    with ties normally decided via FIFO (i.e. least recently used).
  * For small (<= 64 bytes by default) requests, it is a caching
    allocator, that maintains pools of quickly recycled chunks.
  * In between, and for combinations of large and small requests, it does
    the best it can trying to meet both goals at once.
  * For very large requests (>= 128KB by default), it relies on system
    memory mapping facilities, if supported.

  For a longer but slightly out of date high-level description, see
     http://gee.cs.oswego.edu/dl/html/malloc.html
*/
```

The interesting part (for us) is: “For very large requests (>= 128KB by default), it relies on system memory mapping facilities, if supported.”

The 128KB default can actually be controlled through the mallopt function:

M_MMAP_THRESHOLD For allocations greater than or equal to the limit specified (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from the free list, the memory-allocation functions employ mmap(2) instead of increasing the program break using sbrk(2).

Okay, so as I said previously, the system’s virtual memory will handle everything when dealing with large blocks.

Basically, it means that:

- malloc will be trunked to an anonymous mmap
- free will be trunked to munmap
- realloc will be trunked to mremap (darn, not yet in POSIX? Guys?)

Okay, so we need to dig a little deeper inside mremap.

A man mremap will tell us that:

mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).

Hmm, a very efficient realloc. How efficient?

First, mmap , munmap , and mremap do have definitions in the glibc. Well, pretty small entry points, actually:

```shell
$ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep -E "(mmap|munmap|mremap)$"
00000000000e4350 W mmap
00000000000e9080 W mremap
00000000000e4380 W munmap
```

Note that the default entry points are weak symbols in this case, which basically means that they can be overridden by someone else at dynamic link time; e.g. in this case:

```shell
$ ldd /tmp/sample
        linux-vdso.so.1 (0x00007584a8aaa000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007584a86e6000)
        /lib64/ld-linux-x86-64.so.2 (0x00007584a8aac000)
```

… the symbols may be overridden by linux-vdso.so.1, a magical library mapped into all programs on Linux to speed up some calls involving system calls.

Anyway, our symbols in the glibc are only trunks to kernel system calls (aka “syscalls”), whether they live in the glibc or inside the vdso; see the sysdeps/unix/sysv/linux/mmap64.c default implementation, for example:

```c
void *
__mmap64 (void *addr, size_t len, int prot, int flags, int fd, off64_t offset)
{
  // removed arguments check
  void *result;
  result = (void *) INLINE_SYSCALL (mmap2, 6, addr, len, prot, flags, fd,
                                    (off_t) (offset >> page_shift));
  return result;
}
```

Okay, so our initial question is no longer glibc-related; it involves the Linux kernel.

Digging In The Kernel

Now download a recent kernel version and let’s have a quick look at what mremap involves.

If you take a look at The Linux Kernel Howto, you see that there is an interesting directory for us:

mm This directory contains all of the memory management code. The architecture specific memory management code lives down in arch/*/mm/, for example arch/i386/mm/fault.c.

Ah, nice, just what we needed to check!

The mm/mremap.c file seems to be of some interest. And inside the file, you’ll discover the syscall entry point of the mremap function. It’s just here:

```c
/*
 * Expand (or shrink) an existing mapping, potentially moving it at the
 * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
 *
 * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
 * This option implies MREMAP_MAYMOVE.
 */
SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
                unsigned long, new_len, unsigned long, flags,
                unsigned long, new_addr)
{
```

This is basically where you enter in the kernel when invoking the corresponding syscall in your user code (through the glibc or through the corresponding vdso thunk).

If you read this function’s code, you first see various argument checks, then trivial cases being handled (especially shrinking a mapped memory block, which just frees the pages at the tail).

The kernel then attempts to expand the mapped area in place by growing it (as an allocator in a C library would do with sbrk), and returns successfully if expanding the area was possible.

All these trivial cases are O(1) (even if entering the kernel has a cost; a cost lower now that interrupts aren’t involved anymore, but still a cost).

What about the worst case?

```c
	/*
	 * We weren't able to just expand or shrink the area,
	 * we need to create a new one and move it..
	 */
	ret = -ENOMEM;
	if (flags & MREMAP_MAYMOVE) {
		unsigned long map_flags = 0;
		if (vma->vm_flags & VM_MAYSHARE)
			map_flags |= MAP_SHARED;

		new_addr = get_unmapped_area(vma->vm_file, 0, new_len,
					vma->vm_pgoff +
					((addr - vma->vm_start) >> PAGE_SHIFT),
					map_flags);
		if (new_addr & ~PAGE_MASK) {
			ret = new_addr;
			goto out;
		}

		map_flags = vma->vm_flags;
		ret = move_vma(vma, addr, old_len, new_len, new_addr);
		if (!(ret & ~PAGE_MASK)) {
			track_exec_limit(current->mm, addr, addr + old_len, 0UL);
			track_exec_limit(current->mm, new_addr,
					 new_addr + new_len, map_flags);
		}
	}
```

At first glance, the kernel is doing what we did in our trivial allocator:

- allocate a new area
- copy the previous area to the new one, and free the previous one (i.e. move)

But let’s have a closer look.

The move_vma call follows:

```c
static unsigned long move_vma(struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long old_len,
		unsigned long new_len, unsigned long new_addr)
{
	...
	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
			   &need_rmap_locks);
	if (!new_vma)
		return -ENOMEM;

	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
				     need_rmap_locks);
```

There are two interesting calls here:

- the copy_vma call
- the move_page_tables call

The copy_vma function lives in the sibling mm/mmap.c file, and basically moves the internal kernel structure describing the mmap’ed block, which is a vm_area_struct structure.

You can find its definition in include/linux/mm_types.h ; this small structure contains all information on the region: the start and end address, the mmap’ed file on disk if any, etc.

So moving this one is cheap: O(1).

What about move_page_tables ?

The interesting part seems to lie within this loop:

```c
	...
	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
		...
		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
			  new_vma, new_pmd, new_addr, need_rmap_locks);
		...
```

The code is a bit involved (and a bit light on comments, too), but it is mostly basic arithmetic and accounting.

The move_ptes function contains the innermost loop:

```c
	...
	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(*old_pte))
			continue;

		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
		...
		set_pte_at(mm, new_addr, new_pte, pte);
	}
	...
```

What the hell are we doing in this loop? We are moving the PTEs, the page table entries of the mmap’ed region, somewhere else. These are basically long integers, one per page.

We’re pretty much done here: the kernel never actually touched a single byte of memory belonging to our allocated block; it just swapped a few bits to relocate the entire region elsewhere.

We may argue that we are technically still O(N), but:

- instead of copying entire pages of 4KB (or 2MB for huge pages), we are swapping integers inside a kernel structure
- we are touching hot structures in the kernel, not cold memory or bytes swapped out to disk

Therefore, we are O(N), but with a huge divisor.

Oh, by the way, did you know that O(N) is in many cases misleading?

In this case, N can be at most 2^48 (the size of the virtual address space). Most machines only deal with a few gigabytes of memory, at most 64GB for most architectures, that is, 2^36 bytes. A huge page is 2^21 bytes, leaving at most 2^15 operations for a mremap (yes, that’s only 32768 operations).

So the worst-case scenario is cheap in all cases, which is what I suspected from the beginning.

Further reading: do not hesitate to have a look at Mel Gorman’s book, Understanding the Linux Virtual Memory Manager.