This week, we will finally tackle the heart of the job system: the implementation of the lock-free work-stealing queue. Read on for a foray into low-level programming.

Other posts in the series

Part 1: Describes the basics of the new job system, and summarizes work-stealing.

Part 2: Goes into detail about the thread-local allocation mechanism.

Recap

Remember the three operations that the work-stealing queue needs to offer:

Push(): Adds a job to the private (LIFO) end of the queue.

Pop(): Removes a job from the private (LIFO) end of the queue.

Steal(): Steals a job from the public (FIFO) end of the queue.

Further remember that Push() and Pop() are only called by the thread that owns the queue, and thus never concurrently. Steal() can be called by any other thread, at any time, concurrently with both Push() and Pop().
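To make this contract concrete, here is a hypothetical single-threaded reference model of the three operations built on std::deque. The Job type and all names are placeholders for illustration, and thread safety is deliberately ignored; it only demonstrates the LIFO/FIFO behavior of the two ends:

```cpp
#include <deque>

// placeholder job type, just for illustration
struct Job { int id; };

class ReferenceDeque
{
public:
    // private end: newest job goes to the back
    void Push(Job* job) { m_jobs.push_back(job); }

    // private end: newest job comes off the back (LIFO)
    Job* Pop(void)
    {
        if (m_jobs.empty())
            return nullptr;
        Job* job = m_jobs.back();
        m_jobs.pop_back();
        return job;
    }

    // public end: oldest job comes off the front (FIFO)
    Job* Steal(void)
    {
        if (m_jobs.empty())
            return nullptr;
        Job* job = m_jobs.front();
        m_jobs.pop_front();
        return job;
    }

private:
    std::deque<Job*> m_jobs;
};
```

Pushing three jobs and then stealing one yields the oldest job, while subsequent pops yield the newest ones first.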

A locking implementation

Before delving into the realm of lock-free programming, let’s first work on an implementation that uses traditional locks. Conceptually, how do we build a data structure that acts like a double-ended queue (deque), with one end behaving like a LIFO, and the other end behaving like a FIFO?

Fortunately, this can quite easily be solved by having two indices that denote the two ends of the deque, as described in “D. Chase and Y. Lev, Dynamic circular work-stealing deque”. If we assume for a moment that we have infinite memory, and therefore an array with an unlimited number of entries, we can introduce two counters called bottom and top that possess the following properties:

bottom indicates the next available slot in the array to where the next job is going to be pushed. Push() first stores the job, and then increments bottom.

Similarly, Pop() decrements bottom, and then returns the job stored at that position in the array.

top indicates the next element that can be stolen (i.e. the topmost element in the deque). Steal() grabs the job from that slot in the array, and increments top.

An important observation to make is that at any given point in time, bottom – top yields the number of jobs currently stored in the deque. If bottom is less than or equal to top, the deque is empty and there is nothing to steal.

Another interesting fact is that both Push() and Pop() only change bottom, and Steal() only changes top. This is a very important property that minimizes synchronization overhead for the owner of the work-stealing queue, and proves to be advantageous for a lock-free implementation.

To better illustrate how the deque works, take a look at the following list of operations performed on the deque:

Operation      bottom   top   size (bottom – top)
Empty deque    0        0     0
Push           1        0     1
Push           2        0     2
Push           3        0     3
Steal          3        1     2
Pop            2        1     1
Pop            1        1     0

As described earlier, Push() and Pop() work in LIFO fashion, while Steal() works in FIFO fashion. In the example above, the first call to Steal() returns the job at index 0, while subsequent calls to Pop() return the jobs at index 2 and 1, in that order.

In C++ code, the implementation of the three operations for the work-stealing queue could look as follows:

void Push(Job* job)
{
    ScopedLock lock(criticalSection);

    m_jobs[m_bottom] = job;
    ++m_bottom;
}

Job* Pop(void)
{
    ScopedLock lock(criticalSection);

    const int jobCount = m_bottom - m_top;
    if (jobCount <= 0)
    {
        // no job left in the queue
        return nullptr;
    }

    --m_bottom;
    return m_jobs[m_bottom];
}

Job* Steal(void)
{
    ScopedLock lock(criticalSection);

    const int jobCount = m_bottom - m_top;
    if (jobCount <= 0)
    {
        // no job there to steal
        return nullptr;
    }

    Job* job = m_jobs[m_top];
    ++m_top;
    return job;
}

The only thing left to do for now is turning the unbounded, infinite array into a circular array. This could be accomplished by wrapping bottom and top accordingly, e.g. using a modulo operation. But that would make it much harder to calculate the number of jobs currently in the deque, because we would have to account for wrap-arounds. A better and simpler solution is to apply the modulo operation to bottom and top only when accessing the array. As long as the size of the underlying array is a power-of-two, this is nothing more than a binary AND operation:

static const unsigned int NUMBER_OF_JOBS = 4096u;
static const unsigned int MASK = NUMBER_OF_JOBS - 1u;

void Push(Job* job)
{
    ScopedLock lock(criticalSection);

    m_jobs[m_bottom & MASK] = job;
    ++m_bottom;
}

Job* Pop(void)
{
    ScopedLock lock(criticalSection);

    const int jobCount = m_bottom - m_top;
    if (jobCount <= 0)
    {
        // no job left in the queue
        return nullptr;
    }

    --m_bottom;
    return m_jobs[m_bottom & MASK];
}

Job* Steal(void)
{
    ScopedLock lock(criticalSection);

    const int jobCount = m_bottom - m_top;
    if (jobCount <= 0)
    {
        // no job there to steal
        return nullptr;
    }

    Job* job = m_jobs[m_top & MASK];
    ++m_top;
    return job;
}
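As a quick sanity check of the masking trick, a hypothetical helper (the function name is illustrative, not part of the queue) shows that the binary AND matches the modulo operation for a power-of-two array size:

```cpp
static const unsigned int NUMBER_OF_JOBS = 4096u;
static const unsigned int MASK = NUMBER_OF_JOBS - 1u;

// maps an ever-increasing counter (bottom or top) to a slot in the circular array.
// identical to counter % NUMBER_OF_JOBS because NUMBER_OF_JOBS is a power of two.
unsigned int SlotIndex(unsigned int counter)
{
    return counter & MASK;
}
```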

And there you have it, an implementation of a work-stealing queue using traditional locks.
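The snippets above use ScopedLock and criticalSection without defining them. A minimal sketch in terms of std::mutex could look like the following; the names match the snippets, but the actual types used by the author are not shown in this post:

```cpp
#include <mutex>

// assumption: the critical section is a plain mutex
typedef std::mutex CriticalSection;

// RAII helper: acquires the lock on construction, releases it on destruction
class ScopedLock
{
public:
    explicit ScopedLock(CriticalSection& cs)
        : m_cs(cs)
    {
        m_cs.lock();
    }

    ~ScopedLock(void)
    {
        m_cs.unlock();
    }

    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;

private:
    CriticalSection& m_cs;
};
```

Because the destructor runs on every path out of the enclosing scope, early returns in Pop() and Steal() release the lock automatically.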

Prerequisites: Lock-free programming

Lock-free programming is a huge topic, and much has been written about it already. Rather than repeating what has been said, I would like to recommend a few good articles that everybody should read before continuing with this post:

In general, Jeff Preshing’s blog is a gold mine full of articles about lock-free programming. If you’re ever in doubt, check his blog.

A lock-free work-stealing queue

Finished reading all the articles? Good.

Assuming no compiler reordering and absolutely no memory reordering for now, let’s implement all three operations in a lock-free manner, one by one. We will discuss the needed compiler and memory barriers later.

Consider a lock-free implementation of Push() first:

void Push(Job* job)
{
    long b = m_bottom;
    m_jobs[b & MASK] = job;
    m_bottom = b+1;
}

What could happen in terms of other operations happening concurrently? Pop() cannot be executed concurrently, so we only need to consider Steal(). However, Steal() only writes to top, and reads from bottom. So the worst that could happen is that Push() gets pre-empted by a call to Steal() right before signaling the availability of a new item in line #5. This is of no concern, because it only means that Steal() could not steal an item that would have been there already – no harm done.

Next in line, the Steal() operation:

Job* Steal(void)
{
    long t = m_top;
    long b = m_bottom;
    if (t < b)
    {
        // non-empty queue
        Job* job = m_jobs[t & MASK];
        if (_InterlockedCompareExchange(&m_top, t+1, t) != t)
        {
            // a concurrent steal or pop operation removed an element from the deque in the meantime.
            return nullptr;
        }

        return job;
    }
    else
    {
        // empty queue
        return nullptr;
    }
}

As long as top is less than bottom, there are still jobs left in the queue to steal. If the deque is not empty, the function first reads the job stored in the array, and then tries to increment top by using a compare-and-swap operation. If the CAS fails, a concurrent Steal() operation successfully removed a job from the deque in the meantime.
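The CAS step can also be expressed portably. _InterlockedCompareExchange(&m_top, t+1, t) atomically performs "if m_top still equals t, set it to t+1" and returns m_top's previous value, so a return value different from t means the CAS lost a race. A sketch of the same step using std::atomic (an assumption for illustration; the code in this post uses the MSVC intrinsic on a plain long):

```cpp
#include <atomic>

// attempts to advance top from t to t+1; fails if a concurrent
// Steal() or Pop() modified top in the meantime.
bool TryBumpTop(std::atomic<long>& top, long t)
{
    long expected = t;
    return top.compare_exchange_strong(expected, t + 1);
}
```

An uncontended attempt succeeds and advances top; a second attempt with the now-stale value of t fails and leaves top untouched, which is exactly the signal Steal() uses to back off.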

Note that it is important to read the job before carrying out the CAS, because the location in the array could be overwritten by concurrent Push() operations happening after the CAS has completed.

A very crucial thing to note here is that top is always read before bottom, ensuring that the values represent a consistent view of the memory. Still, a subtle race may occur if the deque is emptied by a concurrent Pop() after bottom is read and before the CAS is executed. We need to ensure that no concurrent Pop() and Steal() operations both return the last job remaining in the deque, which is achieved by also trying to modify top in the implementation of Pop() using a CAS operation.

The Pop() operation is the most interesting of the bunch:

Job* Pop(void)
{
    long b = m_bottom - 1;
    m_bottom = b;
    long t = m_top;
    if (t <= b)
    {
        // non-empty queue
        Job* job = m_jobs[b & MASK];
        if (t != b)
        {
            // there's still more than one item left in the queue
            return job;
        }

        // this is the last item in the queue
        if (_InterlockedCompareExchange(&m_top, t+1, t) != t)
        {
            // failed race against steal operation
            job = nullptr;
        }

        m_bottom = t+1;
        return job;
    }
    else
    {
        // deque was already empty
        m_bottom = t;
        return nullptr;
    }
}

In contrast to the implementation of Steal(), this time around we need to ensure that we first decrement bottom before attempting to read top. Otherwise, concurrent Steal() operations could remove several jobs from the deque without Pop() noticing.

Additionally, if the deque was already empty, we need to reset it to a canonical empty state where bottom == top.

As long as there is more than one job left in the deque, we can simply return the job without carrying out any additional atomic operations. However, as pointed out in the implementation notes above, we need to protect the code against races with concurrent calls to Steal() in case only one job is left.

If so, the code carries out a CAS to increment top, checking whether we won or lost a race against a concurrent Steal() operation. There are only two possible outcomes:

The CAS succeeds and we won the race against Steal(). In that case, we set bottom = t+1 which sets the deque into a canonical empty state.

The CAS fails and we lost the race against Steal(). In that case, we return an empty job, but still set bottom = t+1. Why? Because losing the race implies that a concurrent Steal() operation successfully set top = t+1, so we still have to put the deque into an empty state.

Adding compiler and memory barriers

So far, so good. However, in its current form the implementation won't work, because we completely left compiler and memory ordering out of the picture. This needs to be fixed.

Consider Push():

void Push(Job* job)
{
    long b = m_bottom;
    m_jobs[b & MASK] = job;
    m_bottom = b+1;
}

Nobody guarantees that the compiler won’t reorder any of the statements above. Specifically, we cannot be sure that we first store the job in the array, and then signal this to other threads by incrementing bottom – it could be the other way around, which would lead to other threads stealing jobs which aren’t there yet!

What we need in this case is a compiler barrier:

void Push(Job* job)
{
    long b = m_bottom;
    m_jobs[b & MASK] = job;

    // ensure the job is written before b+1 is published to other threads.
    // on x86/64, a compiler barrier is enough.
    COMPILER_BARRIER;

    m_bottom = b+1;
}

Note that on x86/64 a compiler barrier is enough, because the strongly ordered memory model does not allow stores to be reordered with other stores. On other platforms (PowerPC, ARM, …) you would need a memory fence instead. Furthermore, notice that the store operation also doesn’t need to be carried out atomically in this case, because the only other operation writing to bottom is Pop(), which cannot be carried out concurrently.
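COMPILER_BARRIER is not a standard construct; how it is defined depends on the toolchain. The post doesn't show the author's actual macro, but plausible definitions could look like this:

```cpp
// prevents the COMPILER from reordering memory accesses across this point.
// it emits no machine instruction, so it does NOT stop the CPU from reordering;
// on weakly ordered platforms a real memory fence is needed instead.
#if defined(_MSC_VER)
    #include <intrin.h>
    #define COMPILER_BARRIER _ReadWriteBarrier()
#else
    #define COMPILER_BARRIER __asm__ __volatile__("" ::: "memory")
#endif
```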

Similarly, we also need a compiler barrier in the implementation of Steal():

Job* Steal(void)
{
    long t = m_top;

    // ensure that top is always read before bottom.
    // loads will not be reordered with other loads on x86, so a compiler barrier is enough.
    COMPILER_BARRIER;

    long b = m_bottom;
    if (t < b)
    {
        // non-empty queue
        Job* job = m_jobs[t & MASK];

        // the interlocked function serves as a compiler barrier, and guarantees that the read happens before the CAS.
        if (_InterlockedCompareExchange(&m_top, t+1, t) != t)
        {
            // a concurrent steal or pop operation removed an element from the deque in the meantime.
            return nullptr;
        }

        return job;
    }
    else
    {
        // empty queue
        return nullptr;
    }
}

Here, we need a compiler barrier to ensure that the read of top truly happens before the read of bottom. Additionally, we need another barrier that guarantees that the read from the array is carried out before the CAS. In this case however, the interlocked function implicitly acts as a compiler barrier.

One more operation left to go:

Job* Pop(void)
{
    long b = m_bottom - 1;
    m_bottom = b;
    long t = m_top;
    if (t <= b)
    {
        // non-empty queue
        Job* job = m_jobs[b & MASK];
        if (t != b)
        {
            // there's still more than one item left in the queue
            return job;
        }

        // this is the last item in the queue
        if (_InterlockedCompareExchange(&m_top, t+1, t) != t)
        {
            // failed race against steal operation
            job = nullptr;
        }

        m_bottom = t+1;
        return job;
    }
    else
    {
        // deque was already empty
        m_bottom = t;
        return nullptr;
    }
}

The most important part of this implementation is the first three lines. This is one of the rare cases where we need a true memory barrier, even on Intel’s x86/64 architecture. More specifically, adding just a compiler barrier between the store m_bottom = b and the load long t = m_top is not enough, because the memory model explicitly allows that “Loads may be reordered with older stores to different locations”, which is exactly the case here.

This means that the first few lines of code need to be fixed as follows:

long b = m_bottom - 1;
m_bottom = b;

MEMORY_BARRIER;

long t = m_top;

Note that neither SFENCE nor LFENCE would suffice in this case, and it really needs to be a full MFENCE barrier. Alternatively, we can also use an interlocked operation such as XCHG instead of a full barrier. XCHG acts like a barrier internally, but turns out to be a tiny bit cheaper in this case:

long b = m_bottom - 1;
_InterlockedExchange(&m_bottom, b);
long t = m_top;

The rest of the implementation of Pop() can stay as it is. Similar to Steal(), the CAS operation serves as a compiler barrier, and the store m_bottom = t+1 does not need to happen atomically because there can’t be any concurrent operation that also writes to bottom.
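With C++11 atomics, the XCHG trick can be sketched portably: a sequentially consistent exchange() on m_bottom acts like XCHG and provides the required full barrier before the subsequent load of m_top. The member names and atomic types below are assumptions for illustration; the code in this post uses plain longs and the MSVC intrinsic:

```cpp
#include <atomic>

std::atomic<long> m_bottom(0);
std::atomic<long> m_top(0);

// sketch of Pop()'s prologue: the seq_cst exchange prevents the load of
// m_top from being reordered before the store to m_bottom.
long PopPrologue(void)
{
    long b = m_bottom.load(std::memory_order_relaxed) - 1;
    m_bottom.exchange(b);   // store with full-barrier semantics
    long t = m_top.load();
    return t;
}
```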

Performance

What did the lock-free implementation gain us in terms of performance? Here’s an overview:

              Basic     Thread-local allocator   Lock-free deque   Increase vs. first impl.   Increase vs. second impl.
Single jobs   18.5 ms   9.9 ms                   2.93 ms           6.31x                      3.38x
parallel_for  5.3 ms    1.35 ms                  0.76 ms           6.97x                      1.78x

Compared to our first implementation, going lock-free and using thread-local allocators gave us almost 7x the performance. Even on top of using thread-local allocators, the lock-free implementation performs between 1.78x and 3.38x faster.

Outlook

That’s it for today. Next time, we will see how high-level algorithms like parallel_for can be implemented on top of this system.