In February this year, Nathan Froyd described allocation sequences in SBCL, and the alternative approaches other implementations use to make sure the runtime never sees half-initialised objects. At first sight, SBCL's sequence seems fairly large and slow for what is, in the end, a thread-local pointer increment, even on x86-64, with its generous 14 GPRs (RBP actually serves as a frame pointer in SBCL):

        ;; Set the pseudo-atomic bit
        OR BYTE PTR [R12+160], 8
        ;; (what you'd expect for bump allocation:)
        ;; Load the allocation pointer
        MOV R11, [R12+80]
        ;; Test for overflow
        LEA RDX, [R11+16]
        CMP [R12+88], RDX
        ;; Jump to an out-of-line sequence
        JBE L4
        ;; ... or save the incremented pointer
        MOV [R12+80], RDX
        ;; ... and tag it
        LEA RDX, [R11+7]
        ;; (end of allocation per se)
        ;; Finally, unset the pseudo-atomic bit
        XOR BYTE PTR [R12+160], 8
        ;; Check for a pending interrupt
        JEQ L2
        ;; Trigger a signal if so
        BREAK 9  ; pending interrupt trap
L2:     ...

That's two read-modify-write accesses and a test just to make sure we're not caught with our pants down, handling improperly initialised objects.

One element of the alternatives is to actively check for pending interruptions (or GCs, etc.). Instead of allowing interruptions everywhere except around key sequences, interruptions are queued, and explicit checks are inserted by the compiler. That seems like an interesting implementation choice, so I wanted to see what it'd look like in SBCL.

It was pretty easy to adapt the compiler to insert such checks at appropriate locations (in the current case, at the end of each function's prologue, and at the head of loops). Using memory protection to turn an access into a conditional interruption seemed like a good idea, and that's what I quickly implemented.

These checks can be executed fairly often in tight loops, so it's important that they be very efficient. The first version loaded a magic address from thread-local storage (an address pointed to by a register), and then wrote to that address. When the thread had to be interrupted, the page containing that address was made unreadable and unwritable, so the last access triggered a segfault, which was then handled specially.

The result was slow... 25% more time than the original (signals-ful) version for a simple (loop repeat n do [not cons]) function, and no difference for a consing loop. Modifying the safepoint sequence to read from the magic address instead of writing to it halved the slowdown to ~10%, and still did not improve the runtime of the consing loop. That's still far from impressive.

Executing two instructions at each safepoint seems obviously fishy. Indeed, things improved noticeably when the safepoint sequence became a single TEST, as in Nathan's example from HotSpot. Instead of reading an arbitrary address from the thread struct, I moved the "magic page" to a fixed offset from the thread structure. The safepoint sequence then became TEST EAX, [THREAD_STRUCT_REG + offset] (a register is dedicated to thread-local storage): a single instruction that reads from memory and clobbers nothing but flags. Unfortunately, that was only enough to bring the runtime with safepoints to the same level (+/- 1-2%) as that of the original code (it does save ~50 KB out of 40 MB in the x86-64 core :).