March 26, 2017

Since my job involves a lot of HPC stuff and performance-obsessed code, one of the first topics I had to dive into was aliasing in the C standard. I found out that it is a subtle topic, quite obscure and very often overlooked even by the most experienced programmers. This post is an attempt to clarify the concept of aliasing, why should we care, how it impacts the code generated by the compiler and how can we master it with no hassle.

So what is an alias? Let's try to clarify this a bit:

One pointer (lvalue expression) is said to alias another pointer when both refer to the same location or object

In practice, an alias involves referring to the same physical entity (an object stored in memory) through multiple handlers (pointers):

int i; int * ptr1 = & i; int * ptr2 = & i;

In this trivial example, ptr1 and ptr2 are both aliases of each other since they are pointing to the same object (the integer i ).

Just a quick note: even if we are talking about pointers and C-ish stuff, this concept is valid for all programming languages: a lot of recent languages use pointers (Go, for example) but in C this issue is a lot more critical since we can do arithmetics with them, moving pointers around in memory as we like.

Why should I care?

Got it, this is the question. Why should I care about aliasing at all?

Short answer: if you are writing code that isn't supposed to be pushed to its limits in terms of performance, you can ignore aliasing at all and be happy.

Long answer: while working on code among whose targets had performance, I found myself several times blaming the compiler for not having recognized obvious optimizations thus generating what I thought was crappy assembly code, translated in a way that was too pedantic for a modern piece of software. After diving into the obscure topic of aliasing, it obviously turned out that the problem was me ignoring a quite large slice of the C language.

So, if you want to dive into this, let's start with a trivial example:

void foo ( float * out_vector_a, float * out_vector_b, int * in_vector, int n) { for ( int i = 0 ; i < n; + + i) { out_vector_a[i] = in_vector[i]; out_vector_b[i] = in_vector[i]; } }

In this snippet we are doing a simple thing: copying an input memory chunk (starting at address in_vector ), one integer at a time, into two output memory chunks (starting at addresses out_vector_a and out_vector_b ). Pretty straghtforward, uh? How an harmless snippet like this could be source of hassles? Let's dive into the x86 assembly code generated by gcc (disassembled through objdump ):

Don't let this pile of assembly code scare you, just get the big picture and focus on the instructions that carry out the actual copies:

out_vector_a[i] = in_vector[i]; 10: 66 0f ef c0 pxor xmm0,xmm0 14: f3 0f 2a 04 82 cvtsi2ss xmm0,DWORD PTR [rdx+rax*4] 19: f3 0f 11 04 87 movss DWORD PTR [rdi+rax*4],xmm0 out_vector_b[i] = in_vector[i]; 1e: 66 0f ef c0 pxor xmm0,xmm0 22: f3 0f 2a 04 82 cvtsi2ss xmm0,DWORD PTR [rdx+rax*4] 27: f3 0f 11 04 86 movss DWORD PTR [rsi+rax*4],xmm0

In an attempt to render these instructions in a human friendly way, we can decode them like this:

out_vector_a[i] = in_vector[i]; 10: cleanup vector register xmm0 14: cast int -> float and load memory location in_vector[i] into xmm0 19: store xmm0 into memory location out_vector_a[i] out_vector_b[i] = in_vector[i]; 1e: cleanup vector register xmm0 22: cast int -> float and load memory location in_vector[i] into xmm0 27: store xmm0 into memory location out_vector_b[i]

There are no doubts about the first three instructions ( 10 , 14 and 19 ), we are actually spending cycles to carry out useful work copying the value contained in memory at in_vector[i] in the corresponding output memory at out_vector_a[i] using the register xmm0 as a buffer. But what about the following instructions? We are carrying out exactly the same work again. Take a look at instruction 22 : it's the same instruction as 14 . Exactly the same. That cvtsi2ss tells the CPU to load the content of the same memory location as the previous one ( [rdx+rax*4] ) in the same vector register ( xmm0 ).

Why are we loading the same data twice? Why the compiler isn't just recycling the value he loaded just two instructions earlier loading it from memory again instead?

That is because during the previous store instruction ( 14 ) the CPU wrote in a location inside the boundaries of array out_vector_a and in order to guarantee semantic correctness, the compiler must assume that the first store could have ended up writing in the same area as in_vector , in other words that out_vector_a could be an alias for in_vector . In order to avoid issues, it forces the CPU to reload the same data from memory again, so if the location has been written during the previous instruction, its updated content would be correctly retrieved in its updated state.

This conservative behavoiur guarantees correctness even in the situation where the two output memory chunks are actually the same but, of course, in the majority of situations where the output memory chunks are not overlapped, performing a load twice means that we are halving our own memory bandwidth and wasting a lot of clock cycles. How can we help the compiler in doing a better job?

Aliasing rules

In the ISO C standard document the committee precisely states how we can access an object or, in other words, what is the type of the pointer that can legally refer to an object:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types: a type compatible with the effective type of the object,

with the effective type of the object, a qualified version of a type compatible with the effective type of the object,

of a with the effective type of the object, a type that is the signed or unsigned type corresponding to the effective type of the object,

type corresponding to the effective type of the object, a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,

an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or

type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or a character type. ISO/IEC 9899:201x, Section 6.5

Don't worry about the definition of compatible types, it turns out to be exactly what you are thinking of (except for struct and union , but we can ignore that for now):

Two types have compatible type if their types are the same. ISO/IEC 9899:201x, Section 6.2.7

So, following the rules, we can have pointers that are aliases in a lot of different legal ways. For example, these pointers could be valid aliases of each other:

int * a; unsigned int * b; char * c; typedef union { char c; float f; } Mask; Mask * d;

…but these couldn't:

long * a; double * b;

Note that a char * is always a potential alias since char is the minimum allocation unit in C: all type sizes are specified in char units and char pointers are how we swipe raw memory (copying chunks regardless of the underlying object type, for example).

Just a note about terminology: in the ISO C world, strict aliasing, ANSI aliasing and type-based aliasing are exactly the same thing, they all refer to the same concept (actually they are aliases…sorry for the pun :).

Exploiting the law

So, what about our previous example? We were dealing with two pointers of incompatible types ( int * and float * ), so our output arrays ( out_vector_a and out_vector_b ) and the input array ( in_vector ) aren't legal aliases. Anyway the compiler treated them as they were, that is because gcc (and almost all the compilers I've dealt with) works in a conservative way and assumes that the programmer doesn't care about aliasing rules and it tries to not to insert very subtle bugs. Let's try to tell gcc to abide by the standard with a proper flag ( -fstrict-aliasing ). This time I've left out the full assembly file (find it here), give a look to the interesting lines instead, the ones carrying out the actual copy:

out_vector_a[i] = in_vector[i]; 10: 66 0f ef c0 pxor xmm0,xmm0 14: f3 0f 2a 04 82 cvtsi2ss xmm0,DWORD PTR [rdx+rax*4] 19: f3 0f 11 04 87 movss DWORD PTR [rdi+rax*4],xmm0 out_vector_b[i] = in_vector[i]; 1e: f3 0f 11 04 86 movss DWORD PTR [rsi+rax*4],xmm0 23: 48 83 c0 01 add rax,0x1

This time the second load disappeared. Again, let's try to decode the meaning of the assembly:

out_vector_a[i] = in_vector[i]; 10: cleanup vector register xmm0 14: cast int -> float and load memory location in_vector[i] into xmm0 19: store xmm0 into memory location out_vector_a[i] out_vector_b[i] = in_vector[i]; 1e: store xmm0 into memory location out_vector_b[i] 23: increment loop counter

This is nice: under our own guarantee that we know what we are doing, the compiler loads the memory location once and for the second array it just retrieves the same value from the buffer register.

We've just doubled the memory bandwidth available for our copy function.

But what happens when the types of our output and input arrays are compatible? Let's give it a try:

void foo ( float * out_vector_a, float * out_vector_b, float * in_vector, int n) { for ( int i = 0 ; i < n; + + i) { out_vector_a[i] = in_vector[i]; out_vector_b[i] = in_vector[i]; } }

Again, just focus on the instructions carrying out the copy (full assembly here):

out_vector_a[i] = in_vector[i]; 10: f3 0f 10 04 82 movss xmm0,DWORD PTR [rdx+rax*4] 15: f3 0f 11 04 87 movss DWORD PTR [rdi+rax*4],xmm0 out_vector_b[i] = in_vector[i]; 1a: f3 0f 10 04 82 movss xmm0,DWORD PTR [rdx+rax*4] 1f: f3 0f 11 04 86 movss DWORD PTR [rsi+rax*4],xmm0 24: 48 83 c0 01 add rax,0x1

Apart from tiny differences due to the fact that we are dealing with float only (no more casts, the instruction that loads from memory is now movss ), we have fallen into the same double load situation:

out_vector_a[i] = in_vector[i]; 10: load memory location in_vector[i] into vector register xmm0 15: store xmm0 into memory location out_vector_a[i] out_vector_b[i] = in_vector[i]; 1a: load memory location in_vector[i] into vector register xmm0 1f: store xmm0 into memory location out_vector_b[i] 24: increment loop counter

Even if we ask the compiler to follow the standard aliasing rules, the use of compatible types leads to the come back of the extra instruction. What we are seeing here is that for the standard rules, both output arrays and input array are pointers of type float (compatible by definition since it's exactly the same type) so the compiler must assume that they can be aliases to each other referring to the same object in memory. How can we escape this ditch?

The restrict qualifier

Luckily the standard provides us a tool to tell the compiler under our own responsibility that even if two pointers could be legal aliases, we are guaranteeing that they aren't:

void foo ( float * restrict out_vector_a, float * restrict out_vector_b, float * restrict in_vector, int n) { for ( int i = 0 ; i < n; + + i) { out_vector_a[i] = in_vector[i]; out_vector_b[i] = in_vector[i]; } }

We've just took the same code as in the previous example and added the restrict qualifier to all pointers. Compiling it with the same gcc options as before, we end up with the following assembly (you can find the full listing here):

out_vector_a[i] = in_vector[i]; 10: f3 0f 10 04 82 movss xmm0,DWORD PTR [rdx+rax*4] 15: f3 0f 11 04 87 movss DWORD PTR [rdi+rax*4],xmm0 out_vector_b[i] = in_vector[i]; 1a: f3 0f 11 04 86 movss DWORD PTR [rsi+rax*4],xmm0 2f: 48 83 c0 01 add rax,0x1

Even this time we are able to let the compiler drop the second load instruction, the only load from memory we have here is at instruction 10 . This time we used the restrict qualifier to tell the compiler that a restricted pointer has no aliases at all. Of course the standard states precisely the meaning of the restrict qualifier:

An object that is accessed through a restrict -qualified pointer has a special association with that pointer. This association […] requires that all accesses to that object use, directly or indirectly, the value of that particular pointer. ISO/IEC 9899:201x, Section 6.7.3

In other words, a restrict pointer is the only legal way to access the object it points to in memory. This kind of special association is the same that for example links a newly allocated memory chunk to its pointer: since it is brand new, no other already existing pointer can be an alias for it.

For example, a statement that assigns a value returned by malloc to a single pointer establishes this association between the allocated object and the pointer. ISO/IEC 9899:201x, Section 6.7.3

Note that restrict hasn't been designed as a mean to change the code's behaviour but to allow additional optimizations (actually, compilers are allowed to completely ignore it):

The intended use of the restrict qualifier […] is to promote optimization, and deleting all instances of the qualifier from all preprocessing translation units composing a conforming program does not change its meaning (i.e., observable behavior). ISO/IEC 9899:201x, Section 6.7.3

Restricted ownership (there can be only one)

Let's consider the following function definitions:

void bar ( int * restrict p, int * restrict q, int n) { while (n - - > 0 ) { * p + + = * q + + ; } } void callbar ( void ) { extern int d[ 100 ]; bar(d + 50 , d, 50 ); // Valid! bar(d + 1 , d, 50 ); // Undefined behaviour! }

In this case we are seeing two different situations:

during the first call, even if the memory belongs to the same d array, the two pointers are swiping two completely disjoint areas, d[50] .. d[99] through p and d[0] .. d[49] through q . Due to the fact that restrict ensures unique accesses to objects (the underlying int elements in this case), each array element is going to be accessed through one and only one pointer: this code is perfectly legal leading to defined behaviour; during the second call, p accesses int objects in the range d[1] .. d[50] while q in the range d[0] .. d[49] : all the d array elements (except for d[0] and d[50] ) are going to be accessed through both pointers. This call breaks the restrict contract leading to undefined behaviour.

This example highlights the fact that restrict compliance of accesses doesn't care about timing. For instance, bar iterations during the latter call access d in the sequence:

p = d[1] , q = d[0]; p = d[2] , q = d[1]; p = d[3] , q = d[2]; …and so on…

The two restricted pointers never access the same element at the same time but, despite this, we end up with an undefined behaviour anyway: it's like a first touch ownership claim, the first pointer that touches an element estabilish the kind of special association with the object the standard talks about.

If inside a block an object is going to be accessed at any point in time through a restricted pointer, then that pointer becomes the one and only allowed to access that particular object.

There is an exception to this rule though, and it's about read-only accesses. Take a look to this example:

void foo ( int * restrict p, int * restrict q, int * restrict r, int n) { for ( int i = 0 ; i < n; + + i) { p[i] = q[i] + r[i]; } } void callfoo ( int n) { int a[n]; int b[n]; foo(a, b, b, n); // Valid! }

In this case, we are accessing elements of array b through two restricted pointers (arguments q and r ) at the same time but, unlike in the previous example, now we are doing read-only accesses since we are writing elements belonging to the (non overlapping) array a only: in this situation, what we are getting is a completely well defined behaviour. So, we can now reformulate the restricted ownership rule we saw in the previous example taking into account this aspect as well:

If inside a block an object is going to be written at any point in time through a restricted pointer, then that pointer becomes the one and only allowed to access (both reading and writing) that particular object.

Assignment between restrict pointers

The standard is so harsh about restrict that even producing a pointer alias to a restricted one (inside the same block) is undefined behaviour:

int * restrict q; int * restrict p = q; / / UNDEFINED BEHAVIOUR ! ! !

Note that the rule limiting assignments between restricted pointers does not distinguish between a function call and an equivalent nested block. However, there is one exception to the rule: only outer-to-inner assignments between restricted pointers declared in nested blocks have defined behavior:

{ int * restrict foo; int * restrict bar; foo = bar; // undefined behaviour { int * restrict foo_inner = foo; // OK: outer-to-inner int * restrict bar_inner = bar; // OK: outer-to-inner foo = bar_inner; // undefined behaviour bar_inner = foo_inner; // undefined behaviour } }

What happens in other languages

Since we are talking about references and objects in memory, aliasing is a concept common to all programming languages. In the modern ones, this issue is often ignored (preventing any optimization, obviously) or just carefully handled by the language definition itself (Rust for example does it in a brilliant way). But what happens in some classic languages?

Aliasing in C++

Of course C++ is a different language than C , but it inherited a lot of rules from the latter. Aliasing is no exception, the rules are the same with some additions: the concept of compatible types is extended to compatible dynamic types.

class A { A operator + ( const A & a) const ; }; class B : A {}; void foo (A * ip1, B * ip2, int n) { for ( int i = 0 ; i < n; + + i) { ip1[i + 1 ] = ip2[i] + ip2[i + 1 ]; } }

Disclaimer: I know that this example is actually crap, remember that in modern C++ , direct use of raw pointers is considered a bad practice.

In this case, the function parameters ip1 and ip2 are legal aliases of each other since class B is a subclass of A . Anyway, in C++ we have templates: two types generated from the same template while using different template parameters are considered non-compatible types.

Aliasing in FORTRAN

People doing numerical stuff do love FORTRAN , and for good reasons: it is carefully cut for their needs, designed from the ground up for number crunching. And people doing numerical stuff do love saying that FORTRAN is faster than any other language they ever tried (or just read about, usually they think is enough). Well, this is obviously not true, especially when comparing FORTRAN and C . There is a caveat though: if you don't know the C language, during casual use it's likely that compiled FORTRAN will produce faster assembly code. There's a reason of course: regarding aliasing, the designers took the easy way forbidding it at all. Aliasing in FORTRAN is strictly forbidden, except for read-only references.

So, a function like the following:

SUBROUTINE f (A, B) INTEGER , DIMENSION ( * ), INTENT (INOUT) :: A INTEGER , DIMENSION ( * ), INTENT (OUT) :: B INTEGER :: I DO I = 2 , LEN (A): A(I) = A(I - 1 ) ** 2 B(I) = A(I) ** 2 ENDDO END SUBROUTINE

…cannot be called in this way:

INTEGER , DIMENSION ( 100 ) :: Arr CALL f(Arr, Arr) ! < - - UNDEFINED BEHAVIOUR

The fact that we are passing the same reference to both the parameters yields an undefined behaviour with its load of potentially mind crushing bugs.

To be continued…

We will dive deeper in more outer-to-inner rule corollaries, pointer hierarchies and memory windows in the next post (coming soon).

Resources: