I recently had to write my own malloc. While replacing important bits of the C standard library would normally be serious over-engineering, it turned out to be the only option.

The Backstory

We had taken over an embedded project with known stability issues, and I quickly determined that thread race conditions were involved. After fixing some low-hanging fruit by switching to mailboxes and queues for cross-thread communication, there was still strange behavior — in particular, certain actions could cause infinite loops after a few tries or during heavy network activity.

Once I was able to consistently reproduce the issue, I focused on the affected code: One thread was allocating memory to notify the other about user input, the other allocated a response and put it on the network stack’s outgoing queue, and then both were freed. Two threads were allocating memory to send messages to each other safely, then freeing the buffers at the same time. If I commented out the free(3), the instability went away (at least until memory was exhausted). Ack! There was a race condition in free(3) itself! After consulting the C99 standard (free draft) and Harbison and Steele, I confirmed that malloc/free/etc. are not required to be reentrant, they just are on most platforms because doing otherwise is a world of pain. Unfortunately, that doesn’t include embedded platforms.

The race condition was during the update of free(3)’s internal records. If one thread free’d memory, but suspended while it was being returned to the freelist, another thread could free memory and clobber the freelist in mid-update. Since the updates were neither locked nor atomic, this could cause a cycle, and subsequent calls to malloc could get stuck in it.

Like most embedded projects, most data structures were statically allocated, but some libraries used dynamic allocation. In particular, a proprietary network stack library directly called malloc and free, so I couldn’t just wrap all their calls in a mutex. The embedded platform’s compiler had an option to generate a reentrant version of the standard library, but (despite what its documentation strongly implied!) this never called the lock/unlock callbacks I had to provide. As a last resort, I could monkeypatch all the standard library memory management functions — malloc, free, realloc, and calloc were weak symbols, and if I defined them with proper types, mine would be linked instead.

Lessons Learned

1. Design for Discoverability, Even in a Hex Dump

I had enough space to log allocations, but only in a packed format. (Unless I logged to flash, but that modified timing enough to break network functionality.) Still, I could control the structure, and had a few bytes of padding. All of my allocations were wrapped by four bytes on each side — two ‘ [ ’ characters, a uint16_t size (not counting padding), the actual memory, then two ‘ ] ’s and a uint16_t monotonic allocation ID. That way, I could detect most buffer overruns at runtime, but this also meant that memory dumps looked like this:

..[[..230p8.80aeuoa$u.80ahhte&au-theothsnth]]..[[..$#(*GU>JH.hej]]..

2e 2e 5b 5b 2e 2e 32 33 30 70 38 2e 38 30 61 65 |..[[..230p8.80ae| 75 6f 61 24 75 2e 38 30 61 68 68 74 65 26 61 75 |uoa$u.80ahhte&au| 2d 74 68 65 6f 74 68 73 6e 74 68 5d 5d 2e 2e 5b |-theothsnth]]..[| 5b 2e 2e 24 23 28 2a 47 55 3e 4a 48 2e 68 65 6a |[..$#(*GU>JH.hej| 5d 5d 2e 2e 0a |]]...|



and the relative allocation sizes stood out. Also, after internal bookkeeping, the 2N-sized linked lists for common allocation sizes were filled with ‘L’s (for “link”), any memory in the large-chunk freelist was set to ‘F’s, and otherwise un-initialized heap memory was initialized to ‘u’s. This meant I could see the general state of the heap at a glance, and some kinds of pointer corruption stood out: the network stack was freeing memory mis-aligned to its allocation, causing one allocation to nest in another.

Similarly, I logged allocation metadata to a ring buffer in memory. Each of the records had a few bytes of padding (‘ ’ characters) so lines wrapped at 16 bytes, and looked like a s1s2 PpppCccc where a was ’m’ (malloc), ‘f’ (free), ‘r’ (realloc), or ‘c’ (calloc), s1 and s2 were sizes (pre- and post- for realloc, otherwise 0 for s2), P was the pointer, and C was the same ID that followed the closing brackets. I came to recognize certain patterns in the log at a glance. Later, I was able to chase down a memory leak by converting the SREC memory dump to a human readable format, then cancelling out all freed allocations with awk. (Yes, I did miss valgrind.)

2. Malloc & Co. Have Many Edge Cases

The contracts assumed by malloc, free, realloc and calloc are more complicated than they appear. I see now why Lua’s designers chose to wrap them all in one function, lua_Alloc (whose design is closest to realloc). I didn’t know that when realloc is called on a NULL, it’s equivalent to malloc. malloc(0), while pointless, is legal, but implementation defined. free(3) and realloc(3) are full of undefined behavior. (I’d also never seen anyone actually use calloc before.)

3. Many C Programs Are Too Cavalier about Error Conditions

People debate whether it’s worth checking malloc(2)’s result for NULL when virtual memory (or the OOM killer) will pick up slack on modern systems, but few embedded systems have such things. If malloc returned NULL, things went haywire. The “C/C++” implementation of msgpack was usually the first thing to fail. (Spot the bug.) Systems need to check input for security and stability, and dynamic memory allocation is another place where bad data can creep in.

4. Fail Early and Loudly

The spoonful of memory corruption tends to compound on itself, eventually leading to swarms of mind-boggling bugs whose root cause is distant in space and time. Instead of letting the system struggle on, creating issues that will seem too bizarre to reproduce, it’s best to just halt everything and alert developers while there’s still a chance for a meaningful autopsy.

Closing

While the code (“spaceman,” since it manages space) is not open source, I may write a less-platform-specific version if there is interest. The overall design is similar to my mpool project on github, except that was written for a VM for an APL dialect I’ve been working on, and it doesn’t have any of the concessions to embedded platforms’ limitations (e.g. it assumes mmap(2)).

