TL;DR

This post shows that modern optimizing C compilers assume that the source code they are compiling respects memory alignment constraints, even if the target architecture imposes no such restriction. This can lead to compiled programs not behaving as the programmer intended.

C: The Language Formerly Known as Portable Assembly

Some people will tell you that C is a portable assembly language, in which each line of source code is individually translated into one or a few assembly instructions. In this worldview, the C compiler does not do anything very smart; rather, it only handles for you the mapping of subexpressions to the registers and instructions of various incompatible target ISAs. This worldview is no longer accurate, for reasons discussed in this blog post and in previous posts in the same vein, but it used to be true several decades ago. See for instance this answer by StackOverflow’s Jerry Coffin, which ascribes the notion of “C as a portable assembly language” to the first edition of the book “The C Programming Language” by Brian Kernighan and Dennis Ritchie, the creator of C himself. This early history of C as portable assembly language is also visible in this post to comp.std.c in 1990, which retraces the problem that the volatile qualifier solved and its introduction by the ANSI standardization committee in the 1989 C standard.

The shift of the C language from “portable assembly” to “high-level programming language without the safety of high-level programming languages” should be kept in mind when writing C code from scratch in 2020, or when maintaining legacy C code.

Misaligned Memory Accesses

The x86-64 and AArch64 architectures are popular at the moment, and their respective 32-bit predecessors are still widely used; the POWER architecture is getting old, but the Power architecture continues to receive updates; RISC-V is popular in some circles, and so on. The C language is old enough to remember a time when the popular architectures were different, and unlike the popular architectures of today, did not all allow misaligned accesses. Alpha and SPARC are two examples of once-popular instruction sets in which misaligned memory accesses were forbidden (a processor exception would happen if such an access were attempted).

The C standards, having to accommodate both target architectures where misaligned accesses worked and target architectures where these violently interrupted the program, applied their universal solution: they classified misaligned access as undefined behavior. This allowed the execution either to continue normally, with the results that you would expect if you had been writing x86-64 assembly, or to stop abruptly, which you would expect if you had been writing SPARC assembly. This in turn led to the appearance of C programs that, consciously or not, were only written for architectures that allow misaligned memory accesses. In fact, if you ask early owners of Alpha or SPARC computers, there were many of these. For users of these architectures, the problem was that C programs, written to access memory in misaligned ways, worked as intended on the numerous processors that allowed this, but crashed when compiled and run on their computers. This is now a forgotten problem, but a well-documented one nevertheless. It is covered for instance in the blog post On misaligned memory accesses (2006).

Something funny happened to Pavel Zemtsov in 2016. He was writing C++ code that involved misaligned memory accesses, as one tends to do without even noticing when one is targeting x86-64, when he suddenly realized that he was using an architecture that forbade misaligned accesses after all! GCC automatically vectorized the loop he was writing, and in doing so, assumed that all uint32_t accesses were to addresses that were aligned to 32-bit boundaries. The summary of that blog post, A bug story: data alignment on x86, is that GCC used vector instructions recently added to the x86-64 ISA that had alignment requirements. The humor came from the fact that the source code looked very ordinary: nothing suggested the use of modern vector instructions; the source did not call intrinsics or anything like that. But making sophisticated transformations is what modern, optimizing C compilers do.
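To make the kind of loop in question concrete, here is a minimal sketch of a summing loop over 32-bit values that may start at any address. It is not the code from the post mentioned above; the function name sum_u32_unaligned and its parameters are assumptions made for illustration. Because the only access to the buffer goes through memcpy, the compiler has no grounds to assume 4-byte alignment if it decides to vectorize the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sums `count` 32-bit values stored back to back
   starting at `bytes`, which may sit at any address. The buffer is only
   accessed through memcpy, so no aligned uint32_t access is implied. */
uint32_t sum_u32_unaligned(const unsigned char *bytes, size_t count)
{
    assert(bytes != NULL || count == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t v;
        memcpy(&v, bytes + i * sizeof v, sizeof v); /* alignment-agnostic load */
        sum += v;                                   /* wraps modulo 2^32 */
    }
    return sum;
}
```

The loop body is written exactly as one would write the pointer-dereferencing version, with the dereference replaced by a fixed-size memcpy into a local variable.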

A Clever GCC Optimization

The present blog post brings bad, and as far as I know, previously undocumented news. Even if you really are targeting an instruction set without any memory access instruction that requires alignment, GCC still applies some sophisticated optimizations that assume aligned pointers. Consider the following function:

    int h(int *p, int *q){
      *p = 1;
      *q = 1;
      return *p;
    }

GCC optimizes this function to make it always return 1:

    h:
            movl    $1, (%rdi)
            movl    $1, (%rsi)
            movl    $1, %eax
            ret

GCC generates code for a function h that always returns 1: that’s the movl $1, %eax part of the assembly listing, in which the value 1 is hard-coded. In this function, GCC reasons that if p and q are the same, then the second assignment writes 1 to *p, and if they are distinct, then what is read in *p is exactly what was written at the first assignment, that is, 1. It seems that either way, the function can only return 1. The reasoning is sound if p and q are both aligned to an address that is a multiple of sizeof(int), but it is wrong if either pointer is misaligned, as in the calling context below:

    void f(void) {
      char *t = malloc(1 + sizeof(int));
      if (!t) abort();
      int *fp = (int*)t;
      int *fq = (int*)(t+1);
      int r = h(fp, fq);
      assert(r == *fp);
      ...
    }

Strictly speaking, the function f invokes Undefined Behavior when it computes fq. The result t of the call to malloc is guaranteed to be aligned for int, but t+1 is not aligned for int. A very old computer that uses non-uniform pointer representations could crash on the conversion to int* of t+1. The words “if the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined” in the C11 standard make the conversion Undefined Behavior.
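The alignment condition itself can be checked before a byte pointer is converted to int*. The sketch below is one way to do so; the function name is_aligned_for_int is hypothetical, and the check assumes that converting a pointer to uintptr_t yields its numeric address, which is implementation-defined but holds on mainstream flat-memory platforms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if `ptr` is suitably aligned for an
   int object. Assumption: the pointer-to-uintptr_t conversion gives the
   numeric address (implementation-defined, true on flat-memory targets). */
bool is_aligned_for_int(const void *ptr)
{
    assert(ptr != NULL);
    return (uintptr_t)ptr % _Alignof(int) == 0;
}
```

On a platform where int has an alignment requirement greater than 1, this returns true for the address of an int object and false for the address one byte further.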

The programmer, knowing that the program is to be executed on the little-endian x86-64 architecture, which has a uniform pointer representation and handles misaligned accesses, might expect r to be set to 257 and the assertion to be true. Instead, r is set to 1, but *fp still evaluates to 257, so that the assertion is evaluated as false at runtime.
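The value 257 can be derived with well-defined byte operations only. The sketch below models what the two 4-byte stores do to memory on a little-endian machine when the second store starts one byte after the first; the helper names and the assumption that int is 32 bits wide are mine, not part of the original example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value byte by byte, least significant byte first. */
static void store_le32(unsigned char *dst, uint32_t v)
{
    assert(dst != NULL);
    for (size_t i = 0; i < 4; i++)
        dst[i] = (unsigned char)(v >> (8 * i));
}

/* Read back four bytes as a little-endian 32-bit value. */
static uint32_t load_le32(const unsigned char *src)
{
    return (uint32_t)src[0]
         | (uint32_t)src[1] << 8
         | (uint32_t)src[2] << 16
         | (uint32_t)src[3] << 24;
}

/* Models the memory effects of h when q is one byte past p,
   assuming a 32-bit int and little-endian byte order. */
uint32_t model_overlapping_stores(void)
{
    unsigned char mem[5] = {0};
    store_le32(mem, 1);      /* *p = 1 : bytes 01 00 00 00 00 */
    store_le32(mem + 1, 1);  /* *q = 1 : bytes 01 01 00 00 00 */
    return load_le32(mem);   /* *p     : 0x00000101          */
}
```

The second store overwrites byte 1 of the first value with 0x01, so reading the first four bytes back gives 0x0101, that is, 1 + 256.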

Note that this discussion is not about strict aliasing. The description of strict-aliasing rules in the C standard is rather vague, and it could be interpreted as meaning that you are not allowed to use the second byte of an int to read the first byte of an int. However, if, by using the compilation option -fno-strict-aliasing, you tell GCC to use a forgiving memory model without strict aliasing restrictions, it still optimizes function h.

How to express misaligned memory accesses

If you are writing C code in 2020 that would benefit from reading four bytes at once even if these may not be aligned to a multiple-of-4 address, and if this C code is intended to be compiled with a modern C compiler, you should use memcpy. On target architectures that allow misaligned memory accesses, a modern C compiler can easily translate the memcpy to or from a temporary int variable into the single assembly instruction that you would have written if you were directly writing in assembly. As a bonus, the same code will function on architectures that do not allow misaligned memory accesses; on these, the compiler will automatically generate either a longer sequence of instructions or an actual call to memcpy.

For instance, if the function h is intended to accept a misaligned pointer q, it can be changed as follows:

    #include <string.h>

    int h(int *p, int *q){
      *p = 1;
      int one = 1;
      memcpy(q, &one, sizeof *q);
      return *p;
    }

Note: if you take this route, you might want to go the extra mile and avoid building misaligned pointers altogether, which as we said earlier, is Undefined Behavior in itself. The above example leaves the argument q as a pointer to int in order to focus on the important change with respect to the original, but if the second argument can be misaligned, it would be better to declare it as a pointer to char, so as not to force the construction of a misaligned pointer to int when the function is called.
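A minimal sketch of that variant follows. The name h_bytes is hypothetical, and unsigned char* is chosen here as the byte-pointer type; the caller passes the address of the first byte to write, so no misaligned int* is ever constructed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical variant of h: q is a byte pointer that may be misaligned
   for int, and sizeof(int) bytes are written starting at q. */
int h_bytes(int *p, unsigned char *q)
{
    assert(p != NULL && q != NULL);
    *p = 1;
    int one = 1;
    memcpy(q, &one, sizeof one); /* may overlap *p partially or fully */
    return *p;                   /* must be reloaded after the memcpy */
}
```

A caller holding a char buffer can then pass an address such as the buffer's start plus 1 directly, without any cast to int*.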

When targeting x86-64, GCC produces the code we intended but did not get with the original function h:

    h:
            movl    $1, (%rdi)
            movl    $1, (%rsi)
            movl    (%rdi), %eax
            ret

This assembly code reads back the contents pointed to by the first argument (%rdi) in order to return it (in %eax) because the compiler is aware that accessing memory through the second argument (%rsi) may have changed it. The call to memcpy in the source code is free: it is not translated to a function call, and not even to a single unnecessary instruction.
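The same pattern can be packaged once and reused for both directions of access. The following is a sketch under the assumption that 32-bit values are what the program needs; the names load_u32 and store_u32 are illustrative and not taken from any particular library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads a 32-bit value from an address with no
   alignment requirement, in the platform's native byte order. */
uint32_t load_u32(const void *src)
{
    assert(src != NULL);
    uint32_t v;
    memcpy(&v, src, sizeof v);
    return v;
}

/* Hypothetical helper: writes a 32-bit value to an address with no
   alignment requirement, in the platform's native byte order. */
void store_u32(void *dst, uint32_t v)
{
    assert(dst != NULL);
    memcpy(dst, &v, sizeof v);
}
```

Taking void* parameters means that callers never have to form a possibly misaligned uint32_t* just to call the helpers.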

By contrast, when generating assembly code for an architecture for which the direct memory access would not produce the intended result, such as armv5, GCC, having no shorter and more expedient version, keeps the call to memcpy . The armv5 ISA is a bit quirky, as RISC ISAs tend to be, so that on a processor implementing it, the misaligned memory access may not crash, but in any case, it would not produce the same result as the call to memcpy , hence GCC not replacing the latter by the former. This shows that GCC, while translating the statement *q = 1; , is assuming the int access to be aligned when it chooses to translate it to the single “store” instruction str r3, [r1] .

Does a lot of existing C code do this?

It is difficult to tell how much existing code does this. The optimization described in this post is only implemented in GCC, and it is focused enough that you wouldn’t expect it to be applied very often (just wait until C code is increasingly compiled with Link-Time Optimizations, though). At a time when the popular ISAs mostly provide the expected behavior on misaligned accesses, the crime of misaligned memory accesses is seldom reported, but this is only for lack of witnesses.

One data point is the fast compression library LZO, which contains the following lines, intended to do the right thing on each of several possible target platforms:

    #if (LZO_ARCH_ALPHA)
    #  define LZO_OPT_AVOID_UINT_INDEX 1
    #elif (LZO_ARCH_AMD64)
    #  define LZO_OPT_AVOID_INT_INDEX 1
    #  define LZO_OPT_AVOID_UINT_INDEX 1
    #  ifndef LZO_OPT_UNALIGNED16
    #  define LZO_OPT_UNALIGNED16 1
    #  endif
    #  ifndef LZO_OPT_UNALIGNED32
    #  define LZO_OPT_UNALIGNED32 1
    #  endif
    #  ifndef LZO_OPT_UNALIGNED64
    #  define LZO_OPT_UNALIGNED64 1
    #  endif
    #elif (LZO_ARCH_ARM)
    #  if defined(__ARM_FEATURE_UNALIGNED)
    #   if ((__ARM_FEATURE_UNALIGNED)+0)
    #    ifndef LZO_OPT_UNALIGNED16
    #    define LZO_OPT_UNALIGNED16 1
    #    endif
    #    ifndef LZO_OPT_UNALIGNED32
    #    define LZO_OPT_UNALIGNED32 1
    #    endif
    #   endif
    #  elif 1 && (LZO_ARCH_ARM_THUMB2)
    ...
    #if (LZO_OPT_UNALIGNED32)
    LZO_COMPILE_TIME_ASSERT_HEADER(sizeof(*(lzo_memops_TU4p)0)==4)
    #define LZO_MEMOPS_COPY4(dd,ss) \
        * (lzo_memops_TU4p) (lzo_memops_TU0p) (dd) = * (const lzo_memops_TU4p) (const lzo_memops_TU0p) (ss)
    #elif defined(lzo_memops_tcheck__)
    #define LZO_MEMOPS_COPY4(dd,ss) \
        LZO_BLOCK_BEGIN if (lzo_memops_tcheck__(lzo_memops_TU4,4,1)) { \
        * (lzo_memops_TU4p) (lzo_memops_TU0p) (dd) = * (const lzo_memops_TU4p) (const lzo_memops_TU0p) (ss); \
        } else { LZO_MEMOPS_MOVE4(dd,ss); } LZO_BLOCK_END
    #else
    #define LZO_MEMOPS_COPY4(dd,ss) LZO_MEMOPS_MOVE4(dd,ss)
    #endif

You will have noticed that the code wants to perform misaligned accesses, but is written to detect at compile-time platforms where misaligned accesses would fail, and in these cases, to avoid them.

The issue pointed out in Pavel Zemtsov’s 2016 post and in the present one is that the above #ifdef nest, when it picks the simple uint32_t lvalue access for the x86-64 target architecture (since x86-64 processors allow these memory accesses), exposes the code to mistranslation by the compiler, whether for loop vectorization or for other optimization purposes.
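For comparison, a four-byte copy that does not depend on the target's tolerance for misaligned accesses can be written with memcpy alone. This is only a sketch of the general technique, not LZO's code or a proposed patch to it; the function name copy4 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical alternative to a cast-based COPY4 macro: copies four bytes
   from `ss` to `dd`, neither of which needs to be aligned. Going through
   a temporary also keeps the copy well-defined if the two ranges overlap,
   which a single memcpy(dd, ss, 4) would not guarantee. */
void copy4(void *dd, const void *ss)
{
    assert(dd != NULL && ss != NULL);
    uint32_t tmp;
    memcpy(&tmp, ss, sizeof tmp);
    memcpy(dd, &tmp, sizeof tmp);
}
```

With this shape, no per-architecture #ifdef is needed to decide whether the access is safe; the compiler picks the instruction sequence for each target.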

Conclusion

This blog post is an extended version of a ticket I opened on GCC’s Bugzilla, where I suggested as a new feature a command-line option to make GCC not assume that memory accesses are aligned. The reaction of the GCC developers confirms that this behavior is here to stay. Clang developers, if you asked them, would reserve the right to make Clang behave in the same way, especially now that GCC has broached the topic. The conclusion is that as a modern C developer, you should always avoid misaligned memory accesses, and use memcpy to or from a temporary variable instead. The memcpy call will be free as long as the code is compiled with an optimizing compiler for a platform that allows misaligned memory accesses.

Acknowledgments

The following people have contributed context information to this post, participated in a discussion that led to the discovery being written up, improved the wording of some sentences, or had their joke stolen in this post: hikari, Alexander Monakov, David Monniaux, Markus F.X.J. Oberhumer, John Regehr, Miod Vallat, Ashley Zupkus.