Nifty features of the ARM architecture

While sorting out GHC’s support for architectures with weakly-ordered memory subsystems I peeked at GCC’s lowering of the __sync_bool_compare_and_swap operation on ARM. This reminded me of a few features of the architecture that I quite like and felt compelled to write about.

My test involved looking at this program,

    $ cat hi.c
    #include <stdint.h>
    int hello(uint32_t *x) {
        return __sync_bool_compare_and_swap(x, 0, 1);
    }
    $ arm-linux-gnueabihf-gcc -c hi.c

This produced the following assembler:

    Function prelude:
       0: b480        push   {r7}
       2: b083        sub    sp, #12
       4: af00        add    r7, sp, #0
       6: 6078        str    r0, [r7, #4]
       8: 687b        ldr    r3, [r7, #4]
       a: 2201        movs   r2, #1
    Issue full barrier:
       c: f3bf 8f5b   dmb    ish                 (1)
    Load value and claim exclusive access:
      10: e853 1f00   ldrex  r1, [r3]            (2)
    Compare, loop if necessary:
      14: 2900        cmp    r1, #0
      16: d103        bne.n  20 <hello+0x20>
    Store value and possibly cede exclusive access:
      18: e843 2000   strex  r0, r2, [r3]        (2)
    Ensure that store succeeded, loop if necessary:
      1c: 2800        cmp    r0, #0
      1e: d1f7        bne.n  10 <hello+0x10>
    Issue full barrier:
      20: f3bf 8f5b   dmb    ish                 (1)
    Function epilogue:
      24: bf0c        ite    eq                  (3)
      26: 2301        moveq  r3, #1
      28: 2300        movne  r3, #0
      2a: 4618        mov    r0, r3
      2c: 370c        adds   r7, #12
      2e: 46bd        mov    sp, r7
      30: f85d 7b04   ldr.w  r7, [sp], #4
      34: 4770        bx     lr

There are a few things here that I found interesting, parenthesized above and discussed in detail below.
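Before digging in, it may help to pin down what the routine is supposed to do. The following is a minimal sketch that can be built with any GCC-compatible compiler, not only the ARM cross-compiler. `hello` is the function from `hi.c`; `hello_twice` is a hypothetical helper added purely for illustration.

```c
#include <stdint.h>

/* Same function as in hi.c: atomically replace 0 with 1.
   Returns non-zero only if the swap actually happened. */
int hello(uint32_t *x) {
    return __sync_bool_compare_and_swap(x, 0, 1);
}

/* Hypothetical helper (not part of the original test): calls hello()
   twice on a fresh word and packs both results into one value. */
int hello_twice(void) {
    uint32_t word = 0;
    int first  = hello(&word);  /* word was 0: swap happens          */
    int second = hello(&word);  /* word is now 1: comparison fails   */
    return (first << 1) | second;
}
```

The failing second call corresponds to the `bne.n 20` path in the listing, which skips the `strex` entirely.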

1. Partitioned memory barriers

dmb is ARM’s Data Memory Barrier instruction. What is interesting are the options accepted by this instruction:

SY: Full system barrier

ST: Store barrier

ISH: Full barrier within inner shareable domain

ISHST: Store barrier within inner shareable domain

NSH: Full barrier out to point of unification

NSHST: Store barrier out to point of unification

OSH: Full barrier out to outer shareable domain

OSHST: Store barrier out to outer shareable domain

The amount of granularity embodied by these options is quite forward-thinking, especially given how much freedom it leaves with the microarchitecture designer:

inner/outer shareable domain: This allows a large system to be broken into multiple independently-coherent domains. For instance, a large machine may have several partitions, each running its own supervisor (e.g. hypervisor or kernel). The DMB ISH instruction provides an efficient way for a supervisor to issue a barrier within the “system” on which it runs.

point of unification: This is defined to be the point where the processor’s instruction and data caches are guaranteed to be coherent with its TLB walk logic. This depends upon the system design but is often the L2 cache. This is handy for implementing JIT compilers and other self-modifying code as it provides a flush mechanism which ensures that code modified by the local core can be executed.
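From C one rarely writes dmb directly; the barriers are usually requested through fences. Below is a minimal sketch of a flag-based hand-off using C11 `atomic_thread_fence`. My understanding is that GCC targeting ARMv7 lowers such fences to `dmb ish`, but I have not verified that for every fence kind, so treat it as an assumption. The names `payload`, `ready`, `publish` and `try_consume` are made up for this example.

```c
#include <stdatomic.h>
#include <stdint.h>

static uint32_t payload;     /* ordinary data being handed over        */
static atomic_uint ready;    /* flag: 0 = empty, 1 = payload is valid  */

/* Producer side: write the data, then publish the flag. The release
   fence keeps the payload store from being reordered after the flag. */
void publish(uint32_t value) {
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer side: returns 1 and fills *out once the flag is visible,
   otherwise returns 0 and leaves *out untouched. */
int try_consume(uint32_t *out) {
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        return 0;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return 1;
}
```

In real use `publish` and `try_consume` would run on different cores within the same inner shareable domain, which is the case the ISH variants are aimed at.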

2. Exclusive load/store

For someone used to IA-32’s rather boring atomic primitives, ARM is a breath of fresh air. IA-32 provides a handful of instructions implementing a selection of atomic operations (e.g. atomic add, subtract, AND, OR, etc.) via the LOCK prefix. By contrast, ARM provides a far more flexible mechanism consisting of two instructions:

ldrex: This is similar to the typical ldr instruction, loading a value from memory (at an address given in a register) into a destination register. However, it also lays a claim of “exclusive” access to the loaded region (typically a cacheline, I believe).

strex: This stores a value from a register to a memory location previously claimed with ldrex. If the location has been modified since it was claimed then the store is aborted and a failure result (a non-zero status in the result register) is returned.

These collectively provide the necessary pieces for a compare-and-swap operation, while allowing significantly more flexibility than a dedicated CAS instruction. For instance, one might do non-trivial computation on the loaded value before storing an updated value.
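As a rough illustration of that flexibility, here is a sketch of an atomic "raise to at least this value" operation, which IA-32 has no LOCK-prefixed instruction for. It is written with portable C11 atomics; I would expect a compiler targeting ARMv7 to expand the weak compare-exchange into an ldrex/strex retry loop much like the one above, but that is an assumption about code generation rather than something shown here. `fetch_max_u32` is a hypothetical name, not a standard function.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically raise *p to at least `candidate`.
   Returns the value that was observed before any update. */
uint32_t fetch_max_u32(_Atomic uint32_t *p, uint32_t candidate) {
    uint32_t seen = atomic_load_explicit(p, memory_order_relaxed);
    while (seen < candidate &&
           !atomic_compare_exchange_weak(p, &seen, candidate)) {
        /* The exchange failed (possibly spuriously, much as strex may);
           `seen` now holds the current value, so re-check and retry. */
    }
    return seen;
}
```

The weak form is used because it is allowed to fail spuriously, which matches the behaviour of a store-exclusive that loses its claim.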

3. If/then/else instruction

If/then/else flow control is incredibly common in typical programs. ARM’s Thumb-2 instruction set has a nifty it instruction which predicates the execution of up to four following instructions according to a given condition.

For instance, consider the program:

    // Assume:
    //   uint64_t x, y, z;
    //   uint64_t *p;
    if (x == 0) {
        p[0] = x;
        p[1] = y;
    } else {
        p[0] = z;
    }

This can be lowered to the following svelte instruction sequence:

    cmp   r0, #0          @ x == 0?
    itte  eq
    streq r0, [r7]        @ p[0] = x;
    streq r1, [r7, #8]    @ p[1] = y;
    strne r3, [r7]        @ p[0] = z;
    ...