I’ve been working on a new malloc implementation for Jinx recently. The last time a developer at Corensic wrote a version of malloc(), I asked him why he didn’t use a great off-the-shelf malloc like tcmalloc. Now it’s my turn, and I wrote my own too. I guess you can accuse me of failing to learn from my mistakes; I’m certainly a hypocrite!

In my defense, a few things about our hypervisor make using someone else’s malloc a pain. First, we’re effectively an embedded system working inside a small chunk of memory allocated in a device driver. The hypervisor works in its own virtual address space, but we never modify that address space. We do some large allocations along the way, but we also have some long-lived smaller memory allocations, so fragmentation is a concern. We code our hypervisor in C. In addition to these constraints, we have one opportunity: we have no pre-emption in our threading package.

I decided to create a pretty simple global allocator, but with the addition of per-physical-processor freelists. Since there’s no pre-emption in our system, we can put freelists in local storage on physical processors, and access the freelists without a lock, without fear of migration.
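Concretely, the lock-free fast path enabled by no pre-emption might look something like this sketch (all names here, `cpu_freelist`, `current_cpu_id`, `fast_alloc`, are hypothetical, not Jinx’s actual code):

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CPUS 4

struct free_node { struct free_node *next; };

/* One freelist head per physical processor. Because our threading
 * package never pre-empts, a thread cannot migrate between reading
 * its CPU id and touching that CPU's freelist, so no lock is needed. */
static struct free_node *cpu_freelist[NUM_CPUS];

/* In the real hypervisor this would read the physical CPU number;
 * here it is faked for the sketch. */
static int current_cpu_id(void) { return 0; }

static void *fast_alloc(void)
{
    struct free_node **head = &cpu_freelist[current_cpu_id()];
    if (*head == NULL)
        return NULL;            /* fall back to the locked global heap */
    struct free_node *n = *head;
    *head = n->next;            /* plain pointer ops: no lock, no atomics */
    return n;
}

static void fast_free(void *p)
{
    struct free_node *n = p;
    struct free_node **head = &cpu_freelist[current_cpu_id()];
    n->next = *head;
    *head = n;
}
```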

The global heap data structure works as follows:

The heap itself is a linked list of regions from HEAD to TAIL. Each element on the linked list is either in-use or free, or is the head or tail of the list. The heap linked list may have no two adjacent free regions (they should be coalesced into one free region). In addition, there are freelist heads for powers of two. For a given freelist head with size S, every item of size P on that freelist obeys S <= P < S * 2. At the head of every region, there’s a data structure that looks like this:

struct region_header {
    uint64 alloc_state : 2;
    uint64 bytes_to_previous_header : 31;
    uint64 bytes_to_next_header : 31;
    LIST_ENTRY(region_header) freelist;
};

That LIST_ENTRY is a FreeBSD macro that represents doubly-linked-list linkage. alloc_state has one of these values:

enum {
    ALLOC_STATE_INVALID,
    ALLOC_STATE_FREE,
    ALLOC_STATE_ALLOCED,
    ALLOC_STATE_SENTINEL,
};

Sentinel is used for HEAD and TAIL.
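The power-of-two invariant above (S <= P < S * 2 for every item of size P on the freelist with head size S) amounts to indexing the freelist heads by floor(log2(size)). A hypothetical helper, not from the original source:

```c
#include <assert.h>
#include <stdint.h>

/* Map a free region of 'bytes' bytes to the power-of-two freelist head
 * whose size class S = 2^index satisfies S <= bytes < 2*S. */
static unsigned freelist_index(uint64_t bytes)
{
    unsigned idx = 0;
    while (bytes >>= 1)
        idx++;                  /* computes floor(log2(bytes)) */
    return idx;
}
```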

To free an item into the heap, something like this occurs:

free_item(item, heap) {
    item->alloc_state = ALLOC_STATE_FREE;
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
}

This preserves the heap invariant that there exist no two adjacent free regions.
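Here’s a minimal, single-threaded sketch of how that coalescing might work. The field and function names follow the post, but the exact layout and bookkeeping are my assumptions (the sentinels keep the list’s ends from ever coalescing away):

```c
#include <assert.h>
#include <stdint.h>

enum { ALLOC_STATE_INVALID, ALLOC_STATE_FREE, ALLOC_STATE_ALLOCED,
       ALLOC_STATE_SENTINEL };

/* Simplified header: no freelist linkage, just the navigation fields. */
struct region_header {
    uint64_t alloc_state              : 2;
    uint64_t bytes_to_previous_header : 31;
    uint64_t bytes_to_next_header     : 31;
};

static struct region_header *right_neighbor(struct region_header *h)
{
    return (struct region_header *)((char *)h + h->bytes_to_next_header);
}

static struct region_header *left_neighbor(struct region_header *h)
{
    return (struct region_header *)((char *)h - h->bytes_to_previous_header);
}

/* Merge 'right' into 'left' when both are free, preserving the
 * invariant that no two adjacent regions are free. */
static void coalesce(struct region_header *left, struct region_header *right)
{
    if (left->alloc_state != ALLOC_STATE_FREE ||
        right->alloc_state != ALLOC_STATE_FREE)
        return;
    left->bytes_to_next_header += right->bytes_to_next_header;
    right_neighbor(left)->bytes_to_previous_header =
        left->bytes_to_next_header;
    right->alloc_state = ALLOC_STATE_INVALID;  /* header absorbed */
}
```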

We use one giant lock for all items in the heap, relying on per-processor freelists to provide concurrency. Those freelists reuse the freelist field of the header structure for linkage. Those fields are protected by the global heap lock only as long as alloc_state == ALLOC_STATE_FREE.

When I attempted to make this heap thread-safe, though, I wrote this:

free_item(item, heap) {
    item->alloc_state = ALLOC_STATE_FREE;
    lock();
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}

A bug in this code, like many concurrency errors, resulted in intermittent, unusual crashes and corruption. It was rare enough that I almost checked it in, but I was saved when one of the many unit tests we run prior to commit crashed once.

Now, in retrospect, setting the alloc_state field outside of the lock was a pretty obvious bug. Imagine that a manipulator of our left-hand neighbor, holding the lock, observed that our region had become free, and coalesced its region with ours. That would be catastrophic, as we’d end up with a pointer to a region that no longer was! Because regions consult their neighbors’ allocation status during coalescing, that field must be protected by that lock.
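The fix is to move the store inside the lock. A compilable sketch, with a pthread mutex standing in for the hypervisor’s global heap lock and stubbed-out neighbor/coalesce helpers (this is an illustration of the ordering fix, not the allocator’s real code):

```c
#include <assert.h>
#include <pthread.h>

enum { ALLOC_STATE_FREE = 1, ALLOC_STATE_ALLOCED = 2 };

struct item { int alloc_state; };

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static int coalesce_calls;

/* Stubs so the sketch compiles; the real versions navigate the heap. */
static struct item *right_neighbor(struct item *i) { return i; }
static struct item *left_neighbor(struct item *i)  { return i; }
static void coalesce(struct item *a, struct item *b)
{ (void)a; (void)b; coalesce_calls++; }

static void free_item(struct item *item)
{
    pthread_mutex_lock(&heap_lock);
    item->alloc_state = ALLOC_STATE_FREE;  /* now protected by the lock:
                                            * no neighbor can observe a
                                            * half-freed region */
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    pthread_mutex_unlock(&heap_lock);
}
```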

However, there’s a more subtle bug lurking here. Consider a different structure definition:

struct region_header {
    uint64 available : 2;
    uint64 alloc_state : 2;
    uint64 bytes_to_previous_header : 30;
    uint64 bytes_to_next_header : 30;
    LIST_ENTRY(region_header) freelist;
};

Let’s say the “available” field is available for use by the user of the malloc interface. Now we have a problem with non-atomicity of modification. That is, even though that field is owned by the owner of the memory block, not by the holder of the lock, it cannot be modified using normal operations.

void free_item(item, heap) {
    item->available = 2;
    lock();
    coalesce(item, right_neighbor(item));
    coalesce(left_neighbor(item), item);
    unlock();
}

The first part of that code, compiled into x86-64 instructions, might look like this:

and $~3, (%rax)
or  $2, (%rax)

As covered in a recent blog post about non-serializable single instructions, variants of these instructions without the LOCK prefix are not atomic, so there’s ample opportunity here to clobber ongoing modifications to bitfields that share our storage. My coworker David points out that it’s more likely to get compiled as this:

mov (%rax), %rdx
and $~3, %rdx
or  $2, %rdx
mov %rdx, (%rax)

Either way, this is a non-atomic update!
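If a contended flag really must live next to lock-protected bits, one workaround is to give up bitfield syntax for it and perform the read-modify-write with a compare-and-swap on the whole containing word. A C11 sketch, not the allocator’s actual code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hand-packed "available" bits in an atomic 64-bit word. */
#define AVAILABLE_SHIFT 0
#define AVAILABLE_MASK  UINT64_C(3)

static void set_available(_Atomic uint64_t *word, uint64_t value)
{
    uint64_t old = atomic_load(word);
    uint64_t desired;
    do {
        desired = (old & ~(AVAILABLE_MASK << AVAILABLE_SHIFT))
                | ((value & AVAILABLE_MASK) << AVAILABLE_SHIFT);
        /* The CAS fails and retries if any other bits changed in
         * between, so concurrent writers to sibling fields of the
         * word are never silently clobbered. */
    } while (!atomic_compare_exchange_weak(word, &old, desired));
}
```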

This is yet another occasion where local reasoning about code is foiled by concurrency. Unlike single-threaded code, you have to know the exact structure you’re dealing with to know if modifying available is legal. In addition, you need to know what sort of atomicity your compiler or language guarantees when you’re accessing variables. For example, would this be OK?

struct region_header {
    uint8 available;
    uint8 alloc_state;
    uint16 bytes_to_previous_header;
    uint16 bytes_to_next_header;
    LIST_ENTRY(region_header) freelist;
};

When I modify “available”, how do I know that “bytes_to_previous_header” doesn’t take a read-modify-write trip with me? Can I use that field without acquiring the global heap lock? I sent this question out to the rest of our development team, and our UW CSE faculty: in what languages is a byte read and write guaranteed to be isolated from adjacent fields? How about a 16 bit number? How about numbers that are the natural register width? How about larger?
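One way to poke at the question empirically is to have two threads hammer adjacent one-byte fields and check that neither clobbers the other. Under the C11/C++11 memory model each field is its own memory location, so this must pass; it’s a demonstration, though, not a proof:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct pair { uint8_t a; uint8_t b; };
static struct pair p;

/* Each thread stores only to its own byte; if the compiler turned
 * either store into a wider read-modify-write, the other thread's
 * value could be clobbered. */
static void *hammer_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        p.a = 7;
    return 0;
}

static void *hammer_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        p.b = 9;
    return 0;
}
```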

Consensus was that all languages on all commonly used architectures guarantee that if one reads or writes a byte, adjacent fields don’t go along for the ride, and that this is true, outside of bitfields, for all types up to the register size of a CPU. Programs would clearly be too hard to reason about were this not true.

Bartosz illustrated this to me for C++ with some language from the C++ spec:

(1.7.3) A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. […] Two threads of execution (1.10) can update and access separate memory locations without interfering with each other.

(1.7.4) [ Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of non-zero width. —end note ]

(1.7.5) [ Example: A structure declared as struct { char a; int b:5, c:11, :0, d:8; struct {int ee:8;} e; } contains four separate memory locations: The field a and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. —end example ]

I would be delighted to hear about it if people have examples from computers, compilers, and languages, old and new, where this is not true.

One last note on this topic: In the FreeBSD kernel, the developers have a good habit of documenting what lock protects which field of a structure, via a locking guide. In their world, that structure might look like this:

/*
 * Locking guide:
 *   L:   this field is modified only under the global malloc lock.
 *   L/F: when alloc_state == ALLOC_STATE_FREE, L. Else, unlocked field.
 */
struct region_header {
    uint64 alloc_state : 2;                  /* L/F */
    uint64 bytes_to_previous_header : 31;    /* L */
    uint64 bytes_to_next_header : 31;        /* L */
    LIST_ENTRY(region_header) freelist;      /* L/F */
};

In this world, the author of the locking guide would be reprimanded for trying to apply locking rule L/F to the field alloc_state, or to the hypothetical available field, while other members of the same bitfield had locking rule L. Any time you document who locks what in a structure, the granularity may not be smaller than what your compiler and your architecture guarantee, assuming you’re using non-atomic operations to modify the fields! This is a great practice that has served Corensic well in the development of its hypervisor. My non-adherence to it cost me a few hours of debugging!

So, in summary:

The combination of bitfields + concurrency is dangerous, because fields of a structure are implicitly intertwined with adjacent fields.

Those problems could be mitigated somewhat by wrapping all bitfields in their own structure. If you don’t do this (as we don’t), then you are denied local reasoning about fields of structures.

For these reasons, converting code into use of bitfields due to space pressure is also an operation that cannot be performed with only local reasoning. When you change a field of structure to a bitfield, you have to look at uses of that field to ensure that you don’t violate interference assumptions in the code.

Use of a locking guide in your C structures will alert you to many of these problems.
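As an illustration of the second summary point, quarantining the bitfields in a nested struct makes the shared memory location explicit, so a locking guide can annotate the whole group with one rule. The names and layout here are illustrative, not the allocator’s real header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region_header_v2 {
    struct {
        uint32_t available   : 2;  /* these two still share one memory  */
        uint32_t alloc_state : 2;  /* location: one lock must cover both */
    } bits;                            /* L: global heap lock */
    uint32_t bytes_to_previous_header; /* L */
    uint32_t bytes_to_next_header;     /* L */
};
/* The non-bitfield members are separate memory locations, so each can
 * carry its own locking rule and be reasoned about locally. */
```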