I'm writing a program that computes an endgame tablebase for a chess variant. The algorithm for filling the tablebase works like this:

1. Start with a huge array of unsigned char, each member representing one position (we always assume that it's white's turn). An array member is even if the position is lost, odd if it is won, 0xff if it is invalid, and 0xfe if it is a draw.
2. Iterate over the array, marking each illegal position with 0xff, each position where white is mated with 0x00, and all other positions with 0xfe.
3. Iterate over the array, considering only positions marked 0xfe. Check whether there is a move that leads to a position whose array member is even; if there is one, write one plus that member's value into the member corresponding to the current position. If all moves lead to positions marked with odd numbers (i.e. this is a losing position), write one plus the highest of these numbers, to indicate how long the strongest defence takes.
4. Repeat step three until the array doesn't change anymore.
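To make the fill procedure concrete, here is a serial sketch. The six-position succ table is a made-up stand-in for a real move generator, and the whole "game" is invented for illustration; only the value encoding (even = lost, odd = won, 0xfe = draw/unknown, 0xff = invalid) is from the description above.

```c
#define NPOS 6      /* toy game with six positions (made up) */
#define MAXMOVES 2

/* Hypothetical stand-in for move generation: succ[p] lists the
   positions reachable from p, padded with -1. */
static const int succ[NPOS][MAXMOVES + 1] = {
    { -1, -1, -1 },     /* 0: white is mated (no moves)          */
    {  0, -1, -1 },     /* 1: can move into the mate             */
    {  1, -1, -1 },     /* 2: every move reaches a won position  */
    {  2,  1, -1 },     /* 3: can move into the lost position 2  */
    {  3, -1, -1 },     /* 4: forced into the won position 3     */
    {  5, -1, -1 },     /* 5: cycles onto itself, i.e. a draw    */
};

/* even = lost, odd = won, 0xfe = draw/unknown, 0xff = invalid */
void fill_tablebase(unsigned char tb[NPOS])
{
    /* Initial pass: mates get 0x00, everything else 0xfe.
       (This toy game has no illegal positions, so no 0xff.) */
    for (int p = 0; p < NPOS; p++)
        tb[p] = (succ[p][0] == -1) ? 0x00 : 0xfe;

    /* Main pass: iteration i only ever writes the value i. */
    for (int i = 1;; i++) {
        int changed = 0;
        for (int p = 0; p < NPOS; p++) {
            if (tb[p] != 0xfe)
                continue;
            int fastest_win = -1, slowest_loss = -1, all_known_odd = 1;
            for (int m = 0; succ[p][m] != -1; m++) {
                unsigned char v = tb[succ[p][m]];
                if (v >= 0xfe) { all_known_odd = 0; continue; }
                if (v % 2 == 0) {           /* move into a lost position */
                    all_known_odd = 0;
                    if (fastest_win < 0 || v < fastest_win)
                        fastest_win = v;
                } else if (v > slowest_loss) {
                    slowest_loss = v;
                }
            }
            int nv;
            if (fastest_win >= 0)
                nv = fastest_win + 1;                       /* won  */
            else if (all_known_odd && slowest_loss >= 0)
                nv = slowest_loss + 1;                      /* lost */
            else
                continue;
            if (nv == i) {      /* resolve exactly at its distance */
                tb[p] = (unsigned char)i;
                changed = 1;
            }
        }
        if (!changed)
            break;              /* array stopped changing */
    }
}
```

Position 5, which only moves to itself, is never resolved and stays 0xfe, which is how draws fall out of the fixed point.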

For speed, I'd like to parallelize step three. A careful reading shows that in each iteration, we only ever write one value (the number of the iteration) into the array. This suggests the following strategy:

Split the array into n parts and let one thread work on each part. Let the current iteration be i.

If a thread reads a member from the array and it is equal to i, treat it as if it were 0xfe, because that means the member was just written concurrently by another thread.
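The two ingredients of this strategy can be sketched single-threaded; the names read_member and chunk_bounds are mine, invented for the sketch:

```c
#include <stddef.h>

/* The read rule: any member equal to the current iteration number i
   may have been written during this iteration by another thread, so
   the reader pretends it is still the unresolved marker 0xfe. */
static unsigned char read_member(const unsigned char *a, size_t idx,
                                 unsigned char i)
{
    unsigned char v = a[idx];
    return (v == i) ? 0xfe : v;
}

/* The partitioning: split [0, len) into nthreads contiguous chunks;
   chunk t is [*start, *end).  Later chunks absorb the remainder. */
static void chunk_bounds(size_t len, int nthreads, int t,
                         size_t *start, size_t *end)
{
    *start = len * (size_t)t / nthreads;
    *end   = len * (size_t)(t + 1) / nthreads;
}
```

In the threaded version, thread t would run step three over its chunk only, going through read_member for every lookup; nothing here addresses the data-race question itself, which is what I'm asking about.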

Now obviously there is a race condition in this program, but it doesn't matter: we always get the right result as long as there are no pink elephants (which can't exist if a char is written atomically). However, since on paper there is a data race, the C compiler may declare my program undefined and format my hard disk.

What can I do to parallelize this algorithm without violating any constraints of the C memory model and without causing massive slowdown (e.g. by adding locks)?

Simplified problem description

Here is a simplified algorithm that demonstrates the same concept but is stripped of all the unimportant stuff:

Start out with an array unsigned char a[n]. Each array member is either 0 or 1. For each array member that is set to 0: if some of the other array members it checks are equal to 0 or 2, set this array member to 2. Which array members are checked depends on the index of the member we want to update. There is no simple relationship between the index and the array members we need to check; it's essentially random.
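A serial version of this simplified algorithm might look like the following; the deps table standing in for the "essentially random" check set is made up for the sketch:

```c
#include <stddef.h>

#define N 8

/* Stand-in for the index-dependent check set: for member k we examine
   a[deps[k][0]] and a[deps[k][1]].  These indices are invented; the
   real relationship is essentially random. */
static const size_t deps[N][2] = {
    {3, 5}, {0, 2}, {7, 1}, {4, 6}, {5, 0}, {2, 7}, {1, 3}, {6, 4},
};

/* One pass: every 0 whose checked members contain a 0 or a 2 becomes
   a 2.  The only write that ever happens is 0 -> 2, which is why the
   processing order does not matter. */
static void step(unsigned char a[N])
{
    for (size_t k = 0; k < N; k++) {
        if (a[k] != 0)
            continue;
        unsigned char x = a[deps[k][0]];
        unsigned char y = a[deps[k][1]];
        if (x == 0 || x == 2 || y == 0 || y == 2)
            a[k] = 2;
    }
}
```

The parallel version would run the loop body over disjoint index ranges in separate threads; reads and the 0 -> 2 writes can then hit the same a[k] concurrently, which is exactly the technical race in question.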

Since we only ever change a 0 to a 2, it doesn't matter in which order we process the array entries, even though there is technically a race condition if we do so in parallel (since we concurrently read/write the same object). How can I tell the compiler that it shouldn't care about the race condition without sacrificing performance?