The restrict keyword can be considered an extension to the strict aliasing rule. It allows the programmer to declare that pointers which share the same type (or were otherwise validly created)alias eachother. By using restrict the programmer can declare that any loads and stores through the qualified pointer (or through another pointer copied either directly or indirectly from the restricted pointer) are theloads and stores to the same address during the lifetime of the pointer. In other words, the pointer is not aliased by any pointers other than its own copies.Read on for more information on the practical use and benefits to using the restrict keyword...

I, [insert your name], a PROFESSIONAL or AMATEUR [circle one] programmer recognize that there are limits to what a compiler can do. I certify that, to the best of my knowledge, there are no magic elves or monkeys in the compiler which through the forces of fairy dust can always make code faster. I understand that there are some problems for which there is not enough information to solve. I hereby declare that given the opportunity to provide the compiler with sufficient information, perhaps through some key word, I will gladly use said keyword and not bitch and moan about how "the compiler should be doing this for me." In this case, I promise that the pointer declared along with the restrict qualifier is not aliased. I certify that writes through this pointer will not effect the values read through any other pointer available in the same context which is also declared as restricted. * Your agreement to this contract is implied by use of the restrict keyword ;)

Restrict is a "no data hazards will be generated" contract between the programmer and the compiler. The compiler relies on this information to make optimizations. If the data is, in fact, aliased, the results are undefined and a programmer should not expect the compiler to output a warning. The compiler assumes the programmer is not lying .

Restrict is a type qualifier

A new feature of C99: The restrict type qualifier allows programs to be written so that translators can produce significantly faster executables. [...] Anyone for whom this is not a concern can safely ignore this feature of the language.

int* restrict foo;

Not all compilers are compliant with the C99 standard. For example Microsoft's compiler, does not support the C99 standard at all. If you are using MSVC on a x86 platform you will not have access to this critical optimization option.



When using GCC, remember to enable the C99 standard by adding -std=c99 to your compilation flags. In code that cannot be compiled with C99, use either __restrict or __restrict__ to enable the keyword as a GCC extension.



The restrict keyword was not included as part of the C++98 standard. However some C++ compilers may support it as an extension. It's important that when restrict is used in C++ to remember that the implicit this pointer should also be restricted. Consult your compiler's manual for how to do this, if possible.

An understanding the strict aliasing rule will provide good context for problems related to the restrict keyword.

Why was restrict introduced into C99?

The problem that the restrict qualifier addresses is that potential aliasing can inhibit optimizations. Specifically, if a translator cannot determine that two different pointers are being used to reference different objects, then it cannot apply optimizations such as maintaining the values of the objects in registers rather than in memory, or reordering loads and stores of these values. This problem can have a significant effect on a program that, for example, performs arithmetic calculations on large arrays of numbers. The effect can be measured by comparing a program that uses pointers with a similar program that uses file scope arrays (or with a similar Fortran program). The array version can run faster by a factor of ten or more on a system with vector processors. Where such large performance gains are possible, implementations have of course offered their own solutions, usually in the form of compiler directives that specify particular optimizations. Differences in the spelling, scope, and precise meaning of these directives have made them troublesome to use in a program that must run on many different systems. This was the motivation for a standard solution.

Non-aliased Memory Windows

typedef struct vector3 vector3;



struct vector3

{

float x;

float y;

float z;

};



The example code sections in the article are not meant to serve as examples of real production code, but rather as examples of real patterns often found in production code.

void

move( vector3* velocity,

vector3* position,

vector3* acceleration,

float time_step,

size_t count )

{

for (size_t i=0;i<count;i++)

{

velocity[i].x += acceleration[i].x * time_step;

velocity[i].y += acceleration[i].y * time_step;

velocity[i].z += acceleration[i].z * time_step;

position[i].x += velocity[i].x * time_step;

position[i].y += velocity[i].y * time_step;

position[i].z += velocity[i].z * time_step;

}

}



This article will examine the assembly output generated for the PowerPC. However, the principles and suggestions presented are applicable to many common architectures.

# This code was compiled with GCC 3.4.1 for PowerPC,

# with the following options: -O3 -fstrict-aliasing -std=c99

#

move:

cmpwi 0,6,0

stwu 1,-16(1)

beq- 0,.L7

li 8,0

mtctr 6

.L8:

add 9,8,3

lfsx 13,8,5

add 10,8,5

lfsx 0,8,3

lfs 8,4(9)

add 11,8,4

lfs 5,8(10)

lfs 7,4(10)

lfs 6,8(9)

fmadds 4,13,1,0

fmadds 3,7,1,8

fmadds 2,5,1,6

stfsx 4,8,3 # Store velocity_x

stfs 3,4(9) # Store velocity_y

stfs 2,8(9) # Store velocity_z

lfsx 11,8,4 # Load position_x

lfs 10,4(11) # Load position_y

lfs 9,8(11) # Load position_z

fmadds 12,4,1,11

fmadds 0,3,1,10

fmadds 13,2,1,9

stfsx 12,8,4

addi 8,8,12

stfs 0,4(11)

stfs 13,8(11)

bdnz .L8

.L7:

addi 1,1,16

blr



The use of unrestricted pointers inhibits the compiler's ability to schedule loads and may cause redundant loads in many cases. With few exceptions, accessing any value through a pointer will force the compiler to load, or reload, the value after any store. This is because the compiler cannot gaurantee that the value being loaded was not aliased by the value that was stored.

void

call_move( vector3* some_data, float time_step, count )

{

move( some_data, some_data, some_data, time_step, count );

}



vector3 velocity [ PARTICLE_COUNT ];

vector3 position [ PARTICLE_COUNT ];

vector3 acceleration [ PARTICLE_COUNT ];



void

move( float time_step )

{

for (size_t i=0;i<PARTICLE_COUNT;i++)

{

velocity[i].x += acceleration[i].x * time_step;

velocity[i].y += acceleration[i].y * time_step;

velocity[i].z += acceleration[i].z * time_step;

position[i].x += velocity[i].x * time_step;

position[i].y += velocity[i].y * time_step;

position[i].z += velocity[i].z * time_step;

}

}



Data Channel Channel Elements (by Index) velocity [0] ---> [1] ---> [2] ---> [N] position [0] ---> [1] ---> [2] ---> [N] acceleration [0] ---> [1] ---> [2] ---> [N]

An element in a restricted data stripe can be a function of one or more elements of any other restricted data stripes, but cannot be a function of a change in an element of a data stripe.

# This code was compiled with GCC 3.4.1 for PowerPC,

# with the following options: -O3 -fstrict-aliasing -std=c99

#

move:

lis 3,velocity@ha

lis 11,acceleration@ha

lis 9,position@ha

la 6,velocity@l(3)

la 5,acceleration@l(11)

la 7,position@l(9)

li 8,0

stwu 1,-16(1)

li 0,8192

mtctr 0

.L18:

add 12,8,6

lfsx 12,8,6 # Load velocity + 0

add 10,8,5

lfsx 13,8,5 # Load acceleration + 0

lfs 8,4(12) # Load velocity + 4

add 4,8,7

lfs 5,8(10) # Load acceleration + 8

lfs 6,8(12) # Load velocity + 8

lfs 7,4(10) # Load acceleration + 4

fmadds 9,13,1,12

fmadds 10,7,1,8

fmadds 11,5,1,6

lfsx 4,8,7 # Load position + 0

lfs 3,4(4) # Load position + 4

lfs 2,8(4) # Load position + 8

fmadds 0,9,1,4

fmadds 13,10,1,3

fmadds 12,11,1,2

stfsx 9,8,6 # Store velocity + 0

stfs 11,8(12) # Store velocity + 8

stfs 10,4(12) # Store velocity + 4

stfsx 0,8,7 # Store position + 0

addi 8,8,12

stfs 13,4(4) # Store position + 4

stfs 12,8(4) # Store position + 8

bdnz .L18

addi 1,1,16

blr



void

move( vector3* velocity,

vector3* position,

vector3* acceleration,

float time_step,

size_t count,

size_t stride )

{

float* restrict acceleration_x = &acceleration->x;

float* restrict velocity_x = &velocity->x;

float* restrict position_x = &position->x;

float* restrict acceleration_y = &acceleration->y;

float* restrict velocity_y = &velocity->y;

float* restrict position_y = &position->y;

float* restrict acceleration_z = &acceleration->z;

float* restrict velocity_z = &velocity->z;

float* restrict position_z = &position->z;



for (size_t i=0;i<count*stride;i+=stride)

{

velocity_x[i] += acceleration_x[i] * time_step;

velocity_y[i] += acceleration_y[i] * time_step;

velocity_z[i] += acceleration_z[i] * time_step;

position_x[i] += velocity_x[i] * time_step;

position_y[i] += velocity_y[i] * time_step;

position_z[i] += velocity_z[i] * time_step;

}

}



Data Channel Channel Elements (by Index) velocity_x [0] ---> [1] ---> [2] ---> [N] velocity_y [0] ---> [1] ---> [2] ---> [N] velocity_z [0] ---> [1] ---> [2] ---> [N] position_x [0] ---> [1] ---> [2] ---> [N] position_y [0] ---> [1] ---> [2] ---> [N] position_z [0] ---> [1] ---> [2] ---> [N] acceleration_x [0] ---> [1] ---> [2] ---> [N] acceleration_y [0] ---> [1] ---> [2] ---> [N] acceleration_z [0] ---> [1] ---> [2] ---> [N]

|---> velocity_x

velocity -------|---> velocity_y

|---> velocity_z



|---> position_x

position -------|---> position_y

|---> position_z



|---> acceleration_x

acceleration ---|---> acceleration_y

|---> acceleration_z



Typically, only the leaf nodes in a hierarchy of restricted pointers should be used.

# This code was compiled with GCC 3.4.1 for PowerPC,

# with the following options: -O3 -fstrict-aliasing -std=c99

#

move:

stwu 1,-32(1)

stw 31,28(1)

mullw 31,6,7

stw 30,24(1)

cmplwi 7,31,0

mr 30,7

addi 12,3,4

addi 6,5,4

addi 8,4,4

addi 7,5,8

addi 10,3,8

addi 11,4,8

li 9,0

ble- 7,.L27

.L31:

slwi 0,9,2

lfsx 13,3,0 # Load velocity_x

add 9,9,30

lfsx 8,12,0 # Load velocity_y

cmplw 7,31,9

lfsx 6,10,0 # Load velocity_z

lfsx 12,5,0 # Load acceleration_x

lfsx 7,6,0 # Load acceleration_y

lfsx 5,7,0 # Load acceleration_z

fmadds 11,12,1,13

fmadds 10,7,1,8

fmadds 9,5,1,6

lfsx 4,4,0 # Load position_x

lfsx 3,8,0 # Load position_y

lfsx 2,11,0 # Load position_z

fmadds 0,11,1,4

fmadds 13,10,1,3

fmadds 12,9,1,2

stfsx 11,3,0 # Store velocity_x

stfsx 10,12,0 # Store velocity_y

stfsx 9,10,0 # Store velocity_z

stfsx 0,4,0 # Store position_x

stfsx 13,8,0 # Store position_y

stfsx 12,11,0 # Store position_z

bgt+ 7,.L31

.L27:

lwz 30,24(1)

lwz 31,28(1)

addi 1,1,32

blr



Non-aliased Memory Access Patterns

struct particle

{

vector3 position;

vector3 velocity;

vector3 acceleration;

};



[ ... ]



void

call_move( particle* particles, float time_step, count )

{

move( &particles->position,

&particles->velocity,

&particles->acceleration,

time_step,

count,

sizeof(particle) );

}



Memory windows can overlap and still be non-aliased.

Usage and Suggestions

Before modifying code to use the restrict keyword, ensure that all code can compile safely with strict aliasing enabled.

void

move( vector3* restrict velocity,

vector3* restrict position,

vector3* restrict acceleration,

float time_step,

size_t count,

size_t stride );



void*

memcpy(void* restrict s1,

const void* restrict s2,

size_t n );



Use restrict in function prototypes and in structure definitions to publish the assumptions made about aliasing.

{

vector3* restrict position = &obj_a->position;

float* restrict position_x = &position->x; <-- UNDEFINED

{

float* restrict position_y = &position->y; <-- VALID

}

}



Restricted child pointers must be in a different block-level scope than the parent pointer.

lfsx 4,4,0 # Load position_x

lfsx 3,8,0 # Load position_y

lfsx 2,11,0 # Load position_z

fmadds 0,11,1,4 # Update position_y

fmadds 13,10,1,3 # Update position_x

fmadds 12,9,1,2 # Update position_z



void

move( vector3* restrict velocity,

vector3* restrict position,

vector3* restrict acceleration,

float time_step,

size_t count,

size_t stride )

{

float* restrict acceleration_x = &acceleration->x;

float* restrict velocity_x = &velocity->x;

float* restrict position_x = &position->x;

float* restrict acceleration_y = &acceleration->y;

float* restrict velocity_y = &velocity->y;

float* restrict position_y = &position->y;

float* restrict acceleration_z = &acceleration->z;

float* restrict velocity_z = &velocity->z;

float* restrict position_z = &position->z;



for (size_t i=0;i<count*stride;i+=stride)

{

const float ax = acceleration_x[i];

const float ay = acceleration_y[i];

const float az = acceleration_z[i];

const float vx = velocity_x[i];

const float vy = velocity_y[i];

const float vz = velocity_z[i];

const float px = position_x[i];

const float py = position_y[i];

const float pz = position_z[i];



const float nvx = vx + ( ax * time_step );

const float nvy = vy + ( ay * time_step );

const float nvz = vz + ( az * time_step );

const float npx = px + ( vx * time_step );

const float npy = py + ( vy * time_step );

const float npz = pz + ( vz * time_step );



velocity_x[i] = nvx;

velocity_y[i] = nvy;

velocity_z[i] = nvz;

position_x[i] = npx;

position_y[i] = npy;

position_z[i] = npz;

}

}



# This code was compiled with GCC 3.4.1 for PowerPC,

# with the following options: -O3 -fstrict-aliasing -std=c99

#

move:

stwu 1,-32(1)

stw 31,28(1)

mullw 31,6,7

stw 30,24(1)

cmplwi 7,31,0

mr 30,7

addi 12,3,4

addi 6,5,4

addi 8,4,4

addi 7,5,8

addi 10,3,8

addi 11,4,8

li 9,0

ble- 7,.L47

.L51:

slwi 0,9,2

lfsx 8,3,0 # Load vx

add 9,9,30

lfsx 7,12,0 # Load vy

cmplw 7,31,9

lfsx 6,10,0 # Load vz

lfsx 10,4,0 # Load px

lfsx 9,8,0 # Load py

lfsx 5,11,0 # Load pz

lfsx 4,5,0 # Load ax

lfsx 3,6,0 # Load ay

lfsx 2,7,0 # Load az

fmadds 0,8,1,10 # Update npx

fmadds 13,7,1,9 # Update npy

fmadds 12,6,1,5 # Update npz

fmadds 11,4,1,8 # Update nvx

fmadds 10,3,1,7 # Update nvy

fmadds 9,2,1,6 # Update nvz

stfsx 0,4,0 # Store npx

stfsx 13,8,0 # Store npy

stfsx 12,11,0 # Store npz

stfsx 11,3,0 # Store nvx

stfsx 10,12,0 # Store nvy

stfsx 9,10,0 # Store nvz

bgt+ 7,.L51

.L47:

lwz 30,24(1)

lwz 31,28(1)

addi 1,1,32

blr



Simplify expressions. Do not mix memory access with calculations. Use the [ Load --> Update --> Store ] pattern.

Summary

Strict aliasing means that two objects of different types cannot refer to the same location in memory. Enable this option in GCC with the -fstrict-aliasing flag. Be sure that all code can safely run with this rule enabled. Enable strict aliasing related warnings with -Wstrict-aliasing , but do not expect to be warned in all cases.

flag. Be sure that code can safely run with this rule enabled. Enable strict aliasing related warnings with , but do not expect to be warned in all cases. Compare the assembly output of the function with restricted pointers and file scope arrays to ensure that all of the possible aliasing information has been used.

Only use restricted leaf pointers. Use of parent pointers may break the restrict contract.

Publish as many assumptions as possible about aliasing information in the function declaration.

Memory windows may be overlapping and still be without aliases. Do not limit the data design to non-overlapping windows.

Begin using the restrict keyword immediately. Retrofit old code as soon as possible.

Keep loads and stores separated from calculations. This results in better scheduling in GCC, and makes the relationship between the output assembly and the original source clearer.

Additional Reading

The restrict keyword is a type qualifier for pointers and is a formal part of the C99 standard.Example usage:Notice that the restrict keyword qualifies the pointer and not the object being pointed to.In other words, proper use of the restrict keyword gives the compiler enough information to select a more optimal order of loads and stores to/from memory and to potentially make better use of registers to store non-aliased objects.Given the following structure, there is a significant difference in performance in even the smallest update loops.What follows is a simple example function that updates some "particles" with unrestricted pointers. Note that the pointers share the same type, so the compiler will assume they can be aliased, per the strict aliasing rule.Notice above thatmust wait forto be stored. This is because the compiler cannot gaurantee that the two are not aliased and must assume that the write tocan overwrite the location wherewill be read. Because the compiler mustperform the operations in the order declared in the source, it must assume this is the behavior the programmer intended.For instance, there is no reason (other than sanity) why the programmer could not call the function in this way:The use of restricted pointers would specifically disallow this.Compare this to the same function working with arrays of file scope. Working with file scope arrays represents the best case for the compiler with regard to alias analysis and should be used as the baseline for implementing functions with restricted pointers.With the above code the compiler knows the arrays will be stored seperately and can determine that they are three independent data, orand there can be no aliasing among them. A data stripe can be thought of as amade up of indexable elements.All the stores are completed at the end of the loop. More specifically, the load foris scheduled before the store of. This validates that the compiler has enough information to determine that the values stored do not alias the values loaded.In order to get this same behavior with non-file scope pointers, use the restrict keyword to declare that every location which is either loaded or stored has no aliases.Nine (9) non-aliased memory stipes were declared in the above code. This completely defines the aliasing relationships between all the loads and stores.By copying addresses from from pointer to another, an implicit hierarchy (or tree) of pointers is created. The child pointers are usually completely aliased by the parent pointer and it's important not to use them both at the same time (i.e. in the same scope). When restricted child pointers are created, consider the parent pointer to beand do not make an accesses through it. Note that in this case, any use oforwould invalidate the restrict contract and the results would be undefined.This code was compiled with GCC 3.4.1 for PowerPC with the following options:This version has all the flexibility of the first (unrestricted) version and the performance of the second (file scope arrays) version. You should expect code where all aliasing information is declared with the restrict keyword toperform significantly better, and never worse, than with unrestricted pointers. This is especially true on superscalar RISC, or RISC-like architectures with large register files, like the PowerPC or MIPS R4000.The asute reader may also have noticed that because nine (9) restricted stripes were used instead of three (3) file scope arrays, the compiler has been able to select a much simplier addressing scheme. Much of the pointer arithmetic has been hoisted out of the loop. The version with the restricted pointers is actuallyefficient than the one with file scope arrays.An important distinction to make is that the restrict keyword is not restricting anything. It is in factthe compiler to do more than it could previously. It should also be noted that the type of the pointer that is qualified with restrict is not important, it is only important what location and size was used when loading or storing from the pointer. The restrict keyword does not declare that the object being pointed to is completely without aliases, only that the addresses that are loaded and stored from are unaliased.For example, the following setup would be a completely valid use of restricted pointers:Although each stripe of data is part of the same "object", none of the accesses would be aliased. Some runtime systems try to determine whether or not pointers are aliased by simply checking to see if the memory windows overlap. That is not sufficient.Use of the restrict keyword should be very common. It should be used as a standard part of all new code. Older code should be revisited as possible to take advantage of the new optimization opportunities. It is somewhat difficult to refactor restricted requirements into pre-existing code as a certain amount of alias analysis must be done by the programmer. However, for the majority of live code in typical applications, memory access is not aliased (nor are memory windows overlapping) and aliasing hazards will be limited to a small fraction of the code base.Programmers using functions that make assumptions about aliasing must know what those assumptions are. Certainly, if at all possible, memory usage patterns should be documented. However, at the very least, aliasing assumptions in the parameters passed to the functions should be declared. In the above examples, the parametersandmust not be aliased and the restrict contract should be made public bydeclaring those parameters restricted.Not publishing aliasing assumptions will lead to very difficult to find bugs. Programmers will not know that the data must be independent and someone, someday will find a reason to use the same array in two or more pointers.Take for example, which has been officially changed to have the following declaration:Restricted pointers can be copied from one to another to create a hierarchy of pointers. However there is one limitation defined in the C99 standard. The child pointerbe in the same block-level scope as the parent pointer. The result of copying restricted pointers in the same block-level scope is undefined.There is one additional problem in the assembly output above which is somewhat particular to the GCC scheduler. Notice that the load forhappens immediately before its update and store. The first multiply-add will stall waiting the first load to be completed before executing. The first float (be ready in three (3) cycles. It would be considerably better (and faster) if the load could be pushed closer to the top of the loop so that it is more likely to be completed by the time it is needed.Due to the order in which scheduling is done in GCC, it is always better to simplify expressions. Do not mix memory access with calculations. The code can be re-written as follows:The loads are now properly scheduled and moved as far in advance as possible. The pattern [Load --> Update --> Store] is usually the optimal pattern for simple memory transformations on a superscalar RISC-like architecture, and is exactly what is being emitted. This is reasonably close to good hand-written assembly for the same code (without re-defining the problem), and the code now very suitable for unrolling.