Back in 2012, I wrote about the task scheduler implementation in Molecule. Three years have passed since then, and now it’s time to give the old system a long-deserved facelift.

Requirements for the new job system were the following:

The base implementation needs to be simpler. Jobs can be quite stupid on their own, but it should be possible to build high-level algorithms such as parallel_for on top of the implementation.

The job system needs to implement automatic load balancing.

Performance improvements should be made by gradually replacing parts of the system with lock-free alternatives, where applicable.

The system needs to support “dynamic parallelism”: it must be possible to alter parent-child relationships and add dependencies to a job while it is still running. This is needed to allow high-level primitives such as parallel_for to dynamically split the given workload into smaller jobs.

Today, we will look at the base implementation of the new job system, using locks/critical sections. Even when using locks, there are a few pitfalls that I would like to point out before going lock-free.

The very basics

Similar to the old task scheduler, our new job system basically works as follows:

We have N worker threads that continually grab a job from a queue, and execute it.

For N cores, we create N-1 worker threads.

The main thread is also considered a worker thread, and can help with executing jobs.

This time around, there is one major difference though: our job system now implements a concept known as work stealing, which means that rather than using one global queue into which all jobs are pushed, each worker thread has its own job queue. A single global job queue creates a lot of contention, which only gets worse as the number of threads grows.

Work stealing is a simple, but effective, concept:

New jobs are always pushed into the queue of the calling thread.

Whenever a worker thread wants to work on a job, it tries to pop a job from its own queue first. If there is no job in the queue, the thread tries to steal a job from one of the other worker thread’s queues.

The operations Push() and Pop() are only called by the worker thread that owns the queue.

The operation Steal() is only called by worker threads that do not own the queue.

The last two items are important, and lead to the following observations:

Push() and Pop() can work on one end of the queue (the private end), while Steal() works on the other end (the public end).

The private end can work in LIFO fashion for better utilization of the cache, while the public end works in FIFO fashion for better work balancing.

When talking about work stealing, such a double-ended data structure is often called a work-stealing queue/deque. One very important benefit of such a work-stealing queue is the fact that it is possible to implement it in a lock-free manner.
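Before going lock-free, the deque can be sketched with a single lock guarding both ends. The following is a minimal illustration using C++11 primitives, not Molecule’s actual implementation; the fixed capacity and the nullptr-for-empty convention (where the post’s code uses an IsEmptyJob() check) are assumptions of this sketch:

```cpp
#include <mutex>
#include <cstdint>

struct Job;

// a locked work-stealing deque sketch: Push()/Pop() operate on the private
// (LIFO) end, Steal() on the public (FIFO) end. the capacity is illustrative.
class WorkStealingQueue
{
public:
    static const uint32_t CAPACITY = 4096u;

    WorkStealingQueue(void) : m_bottom(0), m_top(0) {}

    void Push(Job* job)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_jobs[m_bottom & (CAPACITY - 1u)] = job;
        ++m_bottom;
    }

    Job* Pop(void)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_top >= m_bottom)
            return nullptr;     // queue is empty

        --m_bottom;             // LIFO: take the most recently pushed job
        return m_jobs[m_bottom & (CAPACITY - 1u)];
    }

    Job* Steal(void)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_top >= m_bottom)
            return nullptr;     // nothing to steal

        Job* job = m_jobs[m_top & (CAPACITY - 1u)];  // FIFO: take the oldest job
        ++m_top;
        return job;
    }

private:
    Job* m_jobs[CAPACITY];
    uint32_t m_bottom;  // private end
    uint32_t m_top;     // public end
    std::mutex m_mutex;
};
```

Note how the single mutex serializes all three operations; this is exactly the contention that the lock-free version discussed in a later post gets rid of.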

In C++, a basic implementation could look as follows:

```cpp
// main function of each worker thread
while (workerThreadActive)
{
    Job* job = GetJob();
    if (job)
    {
        Execute(job);
    }
}

Job* GetJob(void)
{
    WorkStealingQueue* queue = GetWorkerThreadQueue();

    Job* job = queue->Pop();
    if (IsEmptyJob(job))
    {
        // this is not a valid job because our own queue is empty, so try stealing from some other queue
        unsigned int randomIndex = GenerateRandomNumber(0, g_workerThreadCount+1);
        WorkStealingQueue* stealQueue = g_jobQueues[randomIndex];
        if (stealQueue == queue)
        {
            // don't try to steal from ourselves
            Yield();
            return nullptr;
        }

        Job* stolenJob = stealQueue->Steal();
        if (IsEmptyJob(stolenJob))
        {
            // we couldn't steal a job from the other queue either, so we just yield our time slice for now
            Yield();
            return nullptr;
        }

        return stolenJob;
    }

    return job;
}

void Execute(Job* job)
{
    (job->function)(job, job->padding);
    Finish(job);
}
```

I’ve deliberately left out details for the Finish() function at this point – it will be discussed later.

What is a job?

Obeying the “keep it simple” requirement, a job needs to store at least two things: a pointer to the function being executed, and an optional parent job.

Additionally, we need a counter that lets us keep track of the number of unfinished jobs for handling parent/child-relationships. And in order to avoid False Sharing, we add padding to ensure that a Job object occupies at least one whole cache line:

```cpp
struct Job
{
    JobFunction function;
    Job* parent;
    int32_t unfinishedJobs; // atomic
    char padding[];
};
```

Note that the unfinishedJobs member is marked as being atomic. In Molecule, it is altered by using any of the Interlocked* functions on Windows. Using C++11, you could use a std::atomic type. I also left out the size of the padding array, because it’s different for 32-bit and 64-bit and clutters the code with insignificant complexity due to several sizeof() operators being involved.

The job function associated with a job accepts two parameters: the job it belongs to, and the data associated with the job.

```cpp
typedef void (*JobFunction)(Job*, const void*);
```

Associating data with a job

One thing I didn’t like about the old task scheduler was the fact that the user had to hold on to a job’s data until the job was finished. This is not a problem when job data can be stored on the stack, but it sometimes led to unnecessary allocations from the heap which I wanted to get rid of in the new system.

Fortunately, there is a simple solution for storing data that belongs to a job: We can store it in-place in our Job struct!

The padding array is a perfect candidate for storing job data: it is otherwise unused, and we need it anyway, so why not put it to good use? In Molecule, data associated with a job is memcpy-ied into the padding array as long as the given data fits – which is ensured by a compile-time check. If the data is too big to be stored in-place, the user can always allocate it from the heap, and hand just a pointer to the data to the job system.
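To illustrate the idea, storing job data in-place could look like the following sketch. The payload-size computation and the StoreJobData helper are made up for this example and are not Molecule’s actual code:

```cpp
#include <cstring>
#include <cstdint>

struct Job;
typedef void (*JobFunction)(Job*, const void*);

// pad the Job struct to one (assumed) 64-byte cache line; the remaining
// bytes double as in-place storage for job data
struct Job
{
    JobFunction function;
    Job* parent;
    int32_t unfinishedJobs; // atomic
    char padding[64 - sizeof(JobFunction) - sizeof(Job*) - sizeof(int32_t)];
};

// hypothetical helper: copy the given data into the padding array,
// rejecting data that doesn't fit at compile time
template <typename T>
void StoreJobData(Job* job, const T& data)
{
    static_assert(sizeof(T) <= sizeof(Job::padding), "Job data does not fit into the padding array");
    std::memcpy(job->padding, &data, sizeof(T));
}
```

The job function then simply casts the `const void*` it receives back to the expected data type.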

Adding jobs

Pushing jobs into the system is always done in two steps: first, a job is created. Second, the job is added to the system. Splitting this operation into two parts allows us to implement dynamic parallelism, which was one of the requirements mentioned earlier.

In C++, jobs are created using either of the following functions:

```cpp
Job* CreateJob(JobFunction function)
{
    Job* job = AllocateJob();
    job->function = function;
    job->parent = nullptr;
    job->unfinishedJobs = 1;

    return job;
}

Job* CreateJobAsChild(Job* parent, JobFunction function)
{
    atomic::Increment(&parent->unfinishedJobs);

    Job* job = AllocateJob();
    job->function = function;
    job->parent = parent;
    job->unfinishedJobs = 1;

    return job;
}
```

Note that I’ve left out the functions accepting additional data that is memcpy-ied into the padding array.

For now, AllocateJob() simply allocates and returns a new Job object by calling new.

As can be seen, when creating a job as a child of an already existing job, the parent’s unfinishedJobs member is atomically incremented. This needs to be done atomically because other threads could be adding different jobs as children to the same job, leading to data races.

Adding a newly created job to the system is done by a call to Run():

```cpp
void Run(Job* job)
{
    WorkStealingQueue* queue = GetWorkerThreadQueue();
    queue->Push(job);
}
```

Waiting for a job

Of course, once we’ve added a few jobs to the system, we need to be able to check if they are finished, and do something meaningful in the meantime. This is accomplished by calling Wait():

```cpp
void Wait(const Job* job)
{
    // wait until the job has completed. in the meantime, work on any other job.
    while (!HasJobCompleted(job))
    {
        Job* nextJob = GetJob();
        if (nextJob)
        {
            Execute(nextJob);
        }
    }
}
```

Determining whether a job has completed can be done by comparing unfinishedJobs with 0. If the counter is greater than 0, either the job itself or one of its child jobs has not finished yet. If the counter is zero, all associated jobs have finished.
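In code, this check is a single comparison against the atomically updated counter. A minimal sketch (note that the final version of Finish() further below changes the sentinel value from 0 to -1):

```cpp
#include <cstdint>

// minimal stand-in for the Job struct from this post
struct Job
{
    Job* parent;
    int32_t unfinishedJobs; // atomic
};

// a job has completed once its counter - covering the job itself and all of
// its children - has dropped to zero
bool HasJobCompleted(const Job* job)
{
    return job->unfinishedJobs == 0;
}
```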

The system in practice

The following simple example creates a bunch of single, empty jobs which are added to the system:

```cpp
void empty_job(Job*, const void*)
{
}

for (unsigned int i=0; i < N; ++i)
{
    Job* job = jobSystem::CreateJob(&empty_job);
    jobSystem::Run(job);
    jobSystem::Wait(job);
}
```

Of course, this is inefficient because we create, run, and wait for each job in isolation. Still, this serves as a good test for measuring the job system’s overhead for creating, adding, and running jobs.

Another example creates single jobs again, but runs them as children of one root job:

```cpp
Job* root = jobSystem::CreateJob(&empty_job);
for (unsigned int i=0; i < N; ++i)
{
    Job* job = jobSystem::CreateJobAsChild(root, &empty_job);
    jobSystem::Run(job);
}

jobSystem::Run(root);
jobSystem::Wait(root);
```

This is much more efficient, because creating and running jobs is now done in parallel to executing jobs already in the system.

Finishing and deleting jobs

We’re almost done. We still need to properly finish jobs by telling their parent that execution has finished. And we need to delete all jobs that we allocated.

You might be tempted to write the Finish() function like this:

```cpp
void Finish(Job* job)
{
    const int32_t unfinishedJobs = atomic::Decrement(&job->unfinishedJobs);
    if (unfinishedJobs == 0)
    {
        if (job->parent)
        {
            Finish(job->parent);
        }

        delete job;
    }
}
```

We first atomically decrement our counter of unfinishedJobs. As mentioned earlier, as soon as this counter reaches 0, this job and all its children have completed, so we need to tell our parent about this. Afterwards, we can delete the job because we no longer need it. However, there is a not-so-subtle bug in there – can you spot it?

The problem is that we are not allowed to delete the job at this point. There could still be threads waiting on this exact job, calling HasJobCompleted() to check whether this job has finished already. These threads would then access memory that has already been freed, either causing an access violation or reading garbage values.

One solution is to defer deletion of jobs to a later point in time, but you still have to be careful about this:

```cpp
void Finish(Job* job)
{
    const int32_t unfinishedJobs = atomic::Decrement(&job->unfinishedJobs);
    if (unfinishedJobs == 0)
    {
        const int32_t index = atomic::Increment(&g_jobToDeleteCount);
        g_jobsToDelete[index-1] = job;

        if (job->parent)
        {
            Finish(job->parent);
        }
    }
}
```

I’ve inserted code that stores the jobs to be deleted in a global array, but it is still wrong. The reason is that the thread finishing the job could get pre-empted right after decrementing the unfinishedJobs member – before the job has been added to the array and before the parent has been notified. If you’re unlucky and this particular job was the root job in the example above, it would be disastrous to go ahead and try to delete all jobs stored in the array at that point.

Of course, there is a way to do all this in a safe manner:

```cpp
void Finish(Job* job)
{
    const int32_t unfinishedJobs = atomic::Decrement(&job->unfinishedJobs);
    if (unfinishedJobs == 0)
    {
        const int32_t index = atomic::Increment(&g_jobToDeleteCount);
        g_jobsToDelete[index-1] = job;

        if (job->parent)
        {
            Finish(job->parent);
        }

        atomic::Decrement(&job->unfinishedJobs);
    }
}
```

Note that this code decrements the counter one more time after the job has been added to the global array and the parent has been notified. Completion of a job is now signalled by unfinishedJobs being -1, not 0. After the root job has finished executing, it is safe to delete all jobs that have been allocated for this frame.

In this particular case, it would be safe to set unfinishedJobs to -1 without using an atomic instruction, but the code would need an additional compiler barrier (and memory barrier on other platforms) in order to be correct.
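With C++11, that barrier could be expressed portably as a release fence, so the idea works beyond x86 as well. This is a sketch of the concept, not Molecule’s code; the function name is mine:

```cpp
#include <atomic>
#include <cstdint>

// minimal stand-in for the Job struct from this post
struct Job
{
    Job* parent;
    int32_t unfinishedJobs;
};

// sketch: signal completion with a plain store preceded by a release fence,
// so the earlier writes (e.g. into g_jobsToDelete) cannot be reordered past
// it. on x86, the fence compiles down to a pure compiler barrier.
void SignalCompletion(Job* job)
{
    std::atomic_thread_fence(std::memory_order_release);
    job->unfinishedJobs = -1;
}
```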

Implementation details

A few notes on how to implement some of the things mentioned in this post:

Accessing worker thread queues can most easily be done by using a thread-local index. On Windows/MSVC, this can be accomplished by either using __declspec(thread) or TlsAlloc.
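As an illustration (using C++11 thread_local, which is equivalent to __declspec(thread) on MSVC; the variable names and the array size are assumptions of this sketch, mirroring g_jobQueues from the post):

```cpp
#include <cstdint>

struct WorkStealingQueue;  // defined elsewhere in the job system

static const uint32_t MAX_WORKER_THREADS = 16u;  // illustrative limit
WorkStealingQueue* g_jobQueues[MAX_WORKER_THREADS];

// each worker thread stores its own index here when it starts up
thread_local uint32_t tls_threadIndex = 0;

WorkStealingQueue* GetWorkerThreadQueue(void)
{
    return g_jobQueues[tls_threadIndex];
}
```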

Yielding a thread’s time slice can be done by using either _mm_pause, Sleep(1), Sleep(0), or other variants. However, you should always make sure that worker threads do not consume 100% CPU time when there is nothing to do. An Event, Semaphore or Condition Variable can be used for that.
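One way to keep idle workers from spinning, sketched with a C++11 condition variable (the function names and bookkeeping are mine, not Molecule’s): workers go to sleep after both their own queue and a steal attempt come up empty, and Run() wakes one of them up.

```cpp
#include <condition_variable>
#include <mutex>

// sketch: let idle worker threads sleep instead of burning CPU time.
// NotifyJobAdded() would be called from Run(), WaitForJob() from the
// worker loop after a failed Pop()/Steal().
std::mutex g_sleepMutex;
std::condition_variable g_jobAvailable;
int g_pendingJobs = 0;  // illustrative bookkeeping, protected by g_sleepMutex

void NotifyJobAdded(void)
{
    {
        std::lock_guard<std::mutex> lock(g_sleepMutex);
        ++g_pendingJobs;
    }
    g_jobAvailable.notify_one();
}

void WaitForJob(void)
{
    std::unique_lock<std::mutex> lock(g_sleepMutex);
    g_jobAvailable.wait(lock, [] { return g_pendingJobs > 0; });
    --g_pendingJobs;
}
```

The predicate passed to wait() guards against spurious wakeups, which any real implementation has to handle regardless of the primitive used.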

Performance

Using the job system described above, I conducted two tests to measure the performance and overhead of the system. The first test creates 65000 single, empty jobs that are run in isolation, as shown in the first example above. The second test also creates 65000 jobs, but uses several parallel_for loops that recursively split their work into smaller jobs.

Performance was measured on an Intel Core i7-2600K CPU clocked at 3.4 GHz, having 4 physical cores with Hyperthreading (= 8 logical cores).

The running times are as follows:

Single jobs: 18.5 ms

parallel_for: 5.3 ms

There are a few things worth noting:

Using parallel_for is much more representative of practical work loads, because creating and adding jobs can be done in parallel.

The job system uses new and delete for allocating jobs, which is not really efficient.

The implementation of the work-stealing queue uses locks.

Outlook

Next time, we will look at how to get rid of new and delete, simplifying our Finish() function in the process. After that, we will tackle the lock-free implementation of the work-stealing queue. Last but not least, we will take a look at how to implement high-level algorithms such as parallel_for using this job system.

I promise we’ll be cutting the running time down to only a fraction of what we have right now.

Disclaimer

The post assumes an x86 architecture and a strong memory model. If you are not aware of the underlying implications, you are better off using C++11 and std::atomic with sequential consistency when working on other platforms.