When ARM introduced 64-bit support to its architecture, it aimed for compatibility with prior 32-bit software. But for Linux programmers, there remain some significant differences that can affect code behaviour. Here are some we found and the workarounds we developed for them.

I had originally planned to call this article “What’s NEW in ARMv8 for Linux Programmers?” However, I think “what’s different” is much more apt. And, just for the record, by “ARMv8-A” I mean AArch64, with the A64 instruction set, also known as arm64 or ARM64. I’ve used AArch64 registers in the examples, but many of the issues I’ve described also happen in the ARMv8-A 32-bit execution state.

To help frame the problems discussed here, let me start by giving a little background on the sort of codebase we have here at Undo. Our core technology is a record and replay engine, which works by recording all non-deterministic input to a program and uses just-in-time compilation (JIT) to keep track of the program state. Our technology started on x86 (32- and 64-bit) and had progressed to fairly complete, maturing support on ARM 32-bit when we began adapting it to work on AArch64. I joined the company after almost all of the low-hanging fruit had been grabbed (as well as many rather higher up the tree, to be fair), leaving us with some tricky problems to tackle when it came to moving to ARMv8.

This leads me to my first simple, but possibly helpful, observation: ARM64 is much more similar to ARM 32-bit (aka AArch32) than it is to x86. ARM64 is still quite RISC (though the cryptographic acceleration instructions do lead to raised eyebrows in a RISC architecture). So I don’t intend to try to cover the many differences between x86 and either ARM version. Nor do I want to rehash the differences between AArch32 and AArch64 — there are already good resources to explore those differences.

Also, a lot of ARM versus ARM64 resources focus on the instruction set and architectural differences. These differences are not really relevant to most Linux user space application developers, beyond the very obvious, such as “your pointers are bigger.” But, as we discovered, there are differences important to Linux user space developers, four of which I'll discuss here. These differences fall into several categories, some falling into more than one category. The categories are:

Differences due to migrating to use a fairly new kernel version.

Differences due to the architecture and instruction set (where this is relevant to user space programmers).

Ptrace differences. We use ptrace a lot, so this was very important to us.

I will try to use the following format in the next sections:

A brief explanation of the area.

What is the difference? Why is this different? (Sometimes it is easier to understand a change in behaviour by looking at a few assembly instructions than it is from a wordy description, so I'll provide that code.)

How did we encounter it?

How did we overcome it?

Where to find out more information.

1. Changes to ptrace

ptrace provides process tracing capabilities to user space programs.

There have been a number of changes to the requests accepted by ptrace(). These changes produce the most pleasant of all incompatibilities to analyse: compilation errors. Our error reports were for undefined symbols PTRACE_GETREGS (for general registers), PTRACE_GETFPREGS (for floating point and SIMD registers), and PTRACE_GETHBPREGS (for hardware breakpoint registers), as well as the SET versions of these requests.

The man page for ptrace was no help at all in resolving these errors, so we dug deeper. We had a look at the kernel source, and it turns out that usually there is an architecture-independent ptrace code path (ptrace_request() in kernel/ptrace.c), and separate architecture-dependent paths (e.g. arch_ptrace() in arch/arm/kernel/ptrace.c). Although the arm64 version has a compat_arch_ptrace for AArch32 applications, the arm64 arch_ptrace() directly calls ptrace_request() and does not add any additional ptrace request types.

The solution is to use PTRACE_GETREGSET and PTRACE_SETREGSET with various different arguments to read these registers.

Here is a table of each GETREGS-style request and the closest equivalent GETREGSET request. Different register sets are selected through the addr argument to ptrace().

ARM 32-bit      AArch64
GETREGS         NT_PRSTATUS
GETFPREGS       NT_PRFPREG
GETHBPREGS      NT_ARM_HW_BREAK
                NT_ARM_HW_WATCH

Table 1. ARM 32-bit and closest equivalent AArch64 ptrace requests.

Note that NT_ARM_HW_BREAK and NT_ARM_HW_WATCH are handled identically in a GETREGSET request; one returns the hardware breakpoint registers and the other the watchpoint registers.

Using GETREGSET is not as simple as using GETREGS, though. For a GETREGS request like this:

ptrace(PTRACE_GETREGS, 0, 0, regs);

GETREGSET would look like this:

struct { void *buf; size_t len; } my_iovec = { regs, sizeof(*regs) };  /* same layout as struct iovec */

ptrace(PTRACE_GETREGSET, 0, (void *) NT_PRSTATUS, &my_iovec);

Note, too, that I have said “the closest equivalent GETREGSET request.” Naturally, the AArch64 register set is different from the ARM 32-bit one, but there are more differences between the two beyond the register set.

Figure 1 shows a diagram of the registers returned from an ARM 32-bit GETREGS request and an AArch64 GETREGSET request.

Figure 1. GETREGS and GETREGSET.

Those familiar with AArch64 may notice that with GETREGSET we’ve been given a “cpsr” register, yet the hardware architecture does not have one. What's returned with GETREGSET has been synthesised into a cpsr-like layout from the individually accessible fields on AArch64.

A more notable difference between the two is the lack of orig_r0 (or orig_x0) in GETREGSET. This omission has to do with syscalls. On ARM 32-bit, the syscall number is placed in r7 and the syscall arguments in registers r0-r5 prior to a syscall (SVC) instruction. The value returned from the syscall is located in r0 (as per the usual APCS; r7 in exceptional circumstances). After the kernel returns from the syscall, orig_r0 provides the original first argument to the syscall (which has been overwritten by the return value).

I actually don’t know what use a “normal” application is supposed to make of this original first argument. We use it in our support for restart_syscall, where the return value is ERESTART_RESTARTBLOCK.

Unfortunately, the lack of orig_x0 is a problem for us that we have yet to resolve in all circumstances. If we have recorded the entry to the syscall, then we have all the information we need. However, if we have attached during a restart_syscall, then we do not know the original value of x0. Our only option is to allow the kernel to restart the syscall, but this restart is inefficient for us as we can’t optimise the recording of the syscall.

Returning to the subject of GETREGS versus GETREGSET: GETHBPREGS and NT_ARM_HW_BREAK are also significantly different. For a GETHBPREGS request, you use the addr field in the ptrace call to request a particular hardware breakpoint register. NT_ARM_HW_BREAK returns all hardware breakpoint registers.

The best place to look for more information on these ptrace differences is to examine the AArch64 ptrace source file: arch/arm64/kernel/ptrace.c

2. Increased use of locks

Load/store exclusives are the instructions used in ARM 32- and 64-bit to support atomic accesses.

We discovered that we were having a lot more problems due to load/store exclusive instructions in ARM64 than we had previously seen with ARM32. This increase was surprising, as the load/store exclusive instructions didn’t seem to have changed in a way that was relevant to us. In fact, I still do not think the load/store exclusive instructions have changed massively, but they have massively increased in use.

Take the random() glibc function as an example. It is called by rand(), which we all know well, and which is used in one of our favorite demo programs.

On AArch64 we see, a few instructions into random(), the following assembly code:

0x0000007fb7df3dd4 : mov   w1, #0x1      // #1
0x0000007fb7df3dd8 : ldaxr w2, [x0]
0x0000007fb7df3ddc : cmp   w2, wzr
0x0000007fb7df3de0 : b.ne  0x7fb7df3dec
0x0000007fb7df3de4 : stxr  w3, w1, [x0]
0x0000007fb7df3de8 : cbnz  w3, 0x7fb7df3dd8
0x0000007fb7df3dec : b.ne  0x7fb7df3e34

The load-exclusive-acquire (ldaxr) instruction and store-exclusive (stxr) instructions indicate that this code is probably trying to acquire a lock. A quick look at the source code shows that acquiring a lock is exactly what is happening. But the same code compiled for a Cortex-A9 (AArch32) did not do any locking. My guess is that this locking occurs purely to better support multiprocessor systems.

A short explanation of the assembly code is warranted here. ARMv8 does not have a single-instruction atomic read-modify-write. Instead, as we see in the example code, ldxr (load exclusive) and stxr (store exclusive) pairs are used. A load exclusive acquires a sort of lock called an “exclusive access mark.” This mark is checked by store exclusive instructions: if a different load exclusive has been executed in the meantime, the store exclusive will “fail”, meaning it will not update memory. We saw store exclusive instructions that would never succeed, and such failures were a common problem because of the increased use of these paired instructions in AArch64.

The answer to “why” is in the ARM Architecture Reference Manual for ARMv8 — there is a list of things that can cause a store exclusive to fail. This includes, but is not limited to, normal (non-exclusive) loads and stores between the LDXR and STXR.

Our JIT was performing non-exclusive loads and stores between the LDXR and STXR, and thus the store exclusive was failing. Our attempts to debug the problem only made things worse, because the debug code caused even more to happen in between the load exclusive and the store exclusive.

The takeaway from this is that you can never predict whether a store exclusive will succeed, because pretty much anything can cause it to fail, including thread switches, context switches, and so on. Also: do not trust GDB or other debuggers to tell you the truth. GDB in particular appears to do something sneaky to make sequences succeed when you single step through them, when really single stepping is so invasive it will cause the STXR to fail.

There’s no single or easy solution: it depends on what caused your problem. Code generators should avoid placing unrelated instructions between the load and store exclusive, though the real example above is awkward because the pair are not in the same basic block. In our JIT, we try to execute load and store exclusive instructions together where they occur as a pair.

For more information on exclusive accesses on ARM, the ARM Architecture Reference Manual is the place to look. I recommend the list of conditions in “Load-Exclusive and Store-Exclusive instruction usage restrictions”. There are also some hardcore sections on the types of monitors, and a very useful section explaining why there is an ‘a’ in ldaxr (for “acquire”) but no ‘l’ (for “release”) in stxr.

Aside: Older ARM cores do have an atomic read-and-write instruction: SWP. However, this instruction has been removed from ARMv8 32-bit, and does not exist on ARM 64-bit. I believe the instruction is absent because SWP would not work well in multiprocessor systems, whereas load/store exclusive instructions assist multiprocessing by exporting information outside of the processor using the memory interface.

3. Additional pages in memory — vvar and vdso

The addition of [vvar] and [vdso] with AArch64 is in some ways purely the result of moving to a newer kernel, and it is the most tangential topic that I’ll cover. When I cat /proc/self/maps, the output shows that the x86 laptop I am writing this on also has [vvar] and [vdso]. This excellent article does a great job of explaining what these pages are all about, which I won’t repeat here.

The problem for us is that [vvar] represents a source of non-determinism, so our record engine must save any data that is read from [vvar]. This need to respond to non-determinism is a well understood problem for us, and the solutions vary depending on the type of map we're using. The simplest (but worst performing) solution is to:

mprotect() the map to PROT_NONE.

When a fault occurs:

1. Restore the original protection with mprotect().

2. Re-execute the access.

3. Save the data read.

4. mprotect() back to PROT_NONE.

5. Continue.

On AArch64, one of our tests was consistently failing with EACCES when we tried to re-apply PROT_READ to [vvar] (the restore step above).

We isolated the cause of the problem to a nested local function in the test. I had no idea such nesting was allowed in C until this point, but it turns out to be a GNU extension. Importantly, you can pass the address of the nested function outside of the scope in which it is defined. Calls to the nested function then go via a “trampoline”, which is explained here.

Here is a simple example program:

typedef void (*fn_ptr_t)(void);

void foo(fn_ptr_t fn_ptr)
{
    fn_ptr();
}

int main(void)
{
    int a_local = 2;

    void nested_func(void)
    {
        a_local++;
    }

    foo(nested_func);
}

In order for foo() to be able to call nested_func, we need a trampoline. On AArch64, this trampoline is generated on the stack at the start of main (followed by flushing the data cache and invalidating the instruction cache). The address of the trampoline is then passed to foo().

foo() has no knowledge that fn_ptr is anything special, so foo() executes a normal BLR (branch with link, to register). The generated trampoline itself looks like:

ldr x18, 0x7ffffffaf0
br  x17

The relevant points on the stack are shown in Figure 2.

Figure 2. Stack state showing trampoline.

We had already determined that the failing test problem was caused by some aspect of the test executable itself, rather than something that the test executable was doing at runtime. The key point is that trampolines are generated on the stack, which requires the stack to be executable. This requirement in itself was something of a shock to me.

The way Linux supports execution on the stack is via a personality named READ_IMPLIES_EXEC. The name is as it sounds: not only does READ_IMPLIES_EXEC make your stack executable, it makes all readable pages executable. Linux knows to set READ_IMPLIES_EXEC on an executable from the PT_GNU_STACK ELF program header, which defines the access rights needed for the stack.

For the program above, we can observe in /proc/pid/maps that nearly all the readable maps are also executable.

r-xp /tmp/tramp
rwxp /tmp/tramp
rwxp
r-xp /lib/aarch64-linux-gnu/libc-2.19.so
---p /lib/aarch64-linux-gnu/libc-2.19.so
r-xp /lib/aarch64-linux-gnu/libc-2.19.so
rwxp /lib/aarch64-linux-gnu/libc-2.19.so
rwxp
r-xp /lib/aarch64-linux-gnu/ld-2.19.so
rwxp
r--p [vvar]
r-xp [vdso]
r-xp /lib/aarch64-linux-gnu/ld-2.19.so
rwxp /lib/aarch64-linux-gnu/ld-2.19.so
rwxp [stack]

It’s easy to see the odd page out – [vvar]. This oddity wouldn’t necessarily be a problem, but looking in /proc/pid/smaps shows most readable pages have the following VMFlags:

VmFlags: rd mr mw me dw ac sd

The ‘me’ flag means “may execute”: mprotect() is allowed to make this page PROT_EXEC. Critically, [vvar] does not have ‘me’. When we try to mprotect() it back to PROT_READ, READ_IMPLIES_EXEC transforms the request into PROT_READ|PROT_EXEC, which is of course rejected.

4. mcontext (and so rt_sigframe_t) are different

When a signal is received and a signal handler is called, the context in which the signal was received is stored on the stack in a structure called the signal stack frame. As in previous sections, we expected this frame to be significantly different on AArch64 because there are different registers.

Even with this factor taken into account, we were still encountering problems. We were seeing differences in memory in the record and replay phases, which indicated there was something different about the signal frame that we hadn’t accounted for.

Here is how the signal frame is declared on ARM 32 bit:

struct rt_sigframe {
    struct siginfo info;
    struct sigframe sig;
};

(arch/arm/kernel/signal.c)

And here is how it is declared on AArch64:

struct rt_sigframe {
    struct siginfo info;
    struct ucontext uc;
    u64 fp;
    u64 lr;
};

(arch/arm64/kernel/signal.c)

The difference is obvious straight away, once you know where to look. The interesting question is: why are the frame pointer (fp) and the link register (lr) included separately here, when they are also saved as part of the general register context in mcontext?

The answer is that they are saved separately to allow debuggers to unwind signal stacks more easily, including when using the alternate signal stack (sigaltstack). Once we had accounted for this difference (and the fairly large amount of reserved space in mcontext), we had dealt with all of the differences in AArch64 that affected our core technology, and were able to see the same memory content in record and replay when using our engine.

These are just some of the differences that we’ve encountered, but they are a good sample of the sorts of problems one should expect. There are nuances to AArch64 that go beyond the well understood differences in the instruction set.

Isa Smith is a software engineer at leading UK start-up, Undo Software (undo-software.com). She previously worked at ARM and on ARMv8 compilation tools, and specializes in building software development tools for Linux and Android.



