Continuing from where we left off last time, today we are going to discuss how to build high-level algorithms such as parallel_for using our job system.

Other posts in the series

Part 1: Describes the basics of the new job system, and summarizes work-stealing.

Part 2: Goes into detail about the thread-local allocation mechanism.

Part 3: Discusses the lock-free implementation of the work-stealing queue.

The basic idea

At the moment, all we can do is push jobs into the system, and let the system take care of balancing the load across all available cores. So far, the system is not aware of the fact that some jobs might be smaller, some might be larger, and all of them probably take a different amount of time to finish.

A very common scenario in game programming is to perform a certain task for a fixed number of elements, e.g. applying the same transformation to all elements of an array/vector/range. Transformations in this context could be culling 10,000 bounding boxes, updating the bone hierarchy for 1,000 characters, or animating 100,000 particles. No matter the task, there is probably some data parallelism we can (and should!) exploit.

This is where high-level algorithms such as parallel_for come into play.

Conceptually, the idea behind a parallel_for job is very simple. Given a range of elements, the job should split the range into (almost) equal parts, and distribute the individual workloads by spawning new jobs into the system. For each newly created range, a user-supplied function is then called that performs the actual work on the given range of elements.

As an example, consider how we would go about updating particles. In its simplest form, this could be nothing more than a free function accepting an array of data, along with the number of elements stored in the array:

void UpdateParticles(Particle* particles, unsigned int count);

Assuming that we have 100,000 particles to update, we could update all of them on the same core by spawning a job that simply does the following:

UpdateParticles(particles, 100000);
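
As a sketch of how that single job could be expressed with the job system from the previous parts (the UpdateParticlesJob wrapper and its data struct are illustrative names, not part of the actual system):

// hypothetical job data and job function wrapping UpdateParticles()
struct UpdateParticlesJobData
{
  Particle* particles;
  unsigned int count;
};

void UpdateParticlesJob(Job* job, const void* jobData)
{
  const UpdateParticlesJobData* data = static_cast<const UpdateParticlesJobData*>(jobData);
  UpdateParticles(data->particles, data->count);
}

// spawn a single root job that updates all 100,000 particles on one core
const UpdateParticlesJobData jobData = { particles, 100000 };
Job* job = jobSystem::CreateJob(&UpdateParticlesJob, jobData);
jobSystem::Run(job);
jobSystem::Wait(job);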

What do we do if we want to split the update into 4 equally-sized parts instead? Pretty simple with this kind of function:

UpdateParticles(particles, 25000);
UpdateParticles(particles + 25000, 25000);
UpdateParticles(particles + 50000, 25000);
UpdateParticles(particles + 75000, 25000);

Putting each call into a separate job will automatically distribute the workload to 4 cores, if available. Note that this is exceptionally easy due to the way we defined the UpdateParticles() function.
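
Using the hypothetical UpdateParticlesJob wrapper from the sketch above, this could look roughly like the following; note the hard-coded counts and offsets:

// spawn one job per range of 25,000 particles
Job* updateJobs[4];
for (unsigned int i = 0; i < 4; ++i)
{
  const UpdateParticlesJobData jobData = { particles + i * 25000, 25000 };
  updateJobs[i] = jobSystem::CreateJob(&UpdateParticlesJob, jobData);
  jobSystem::Run(updateJobs[i]);
}

// wait for all four jobs to finish
for (unsigned int i = 0; i < 4; ++i)
{
  jobSystem::Wait(updateJobs[i]);
}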

Of course, doing all of this by hand is tedious and error-prone, and should be the responsibility of the parallel_for job.

Implementation

In its most basic form, the parallel_for needs to accept the following parameters:

A range of data, consisting of a pointer to the data, and the number of elements stored in the array.

A function to call on newly spawned jobs/ranges.

Without further ado, the high-level implementation could look like the following:

// data passed to each parallel_for job
struct parallel_for_job_data
{
  Particle* data;
  unsigned int count;
  void (*function)(Particle*, unsigned int);
};

Job* parallel_for(Particle* data, unsigned int count, void (*function)(Particle*, unsigned int))
{
  const parallel_for_job_data jobData = { data, count, function };

  Job* job = jobSystem::CreateJob(&jobs::parallel_for_job, jobData);

  return job;
}

The implementation simply creates a new (root) job called jobs::parallel_for_job that takes care of dividing the range into smaller ranges, recursively. The returned job can then be run and waited upon by the user:

Job* job = parallel_for(g_particles, 100000, &UpdateParticles);
jobSystem::Run(job);
jobSystem::Wait(job);

The parallel_for_job itself is also only a handful of lines:

void parallel_for_job(Job* job, const void* jobData)
{
  const parallel_for_job_data* data = static_cast<const parallel_for_job_data*>(jobData);

  if (data->count > 256)
  {
    // split in two
    const unsigned int leftCount = data->count / 2u;
    const parallel_for_job_data leftData = { data->data, leftCount, data->function };
    Job* left = jobSystem::CreateJobAsChild(job, &jobs::parallel_for_job, leftData);
    jobSystem::Run(left);

    const unsigned int rightCount = data->count - leftCount;
    const parallel_for_job_data rightData = { data->data + leftCount, rightCount, data->function };
    Job* right = jobSystem::CreateJobAsChild(job, &jobs::parallel_for_job, rightData);
    jobSystem::Run(right);
  }
  else
  {
    // execute the function on the range of data
    (data->function)(data->data, data->count);
  }
}

As can be seen, the job splits the given range into two halves as long as there are more than 256 elements left in the range. For each newly created range, a job is spawned as a child of the given job, effectively cutting the initial range into smaller pieces using a divide-and-conquer strategy. Note that this interleaves the splitting of the range with the execution of these (and other) jobs, which is a bit more efficient than having one job doing all the splitting work.

Of course, there are at least two things that need to be changed in this implementation:

The parallel_for algorithm currently only accepts functions that deal exclusively with particles. This was intentional to make the code more readable, but needs to be fixed.

Ranges are always split as long as they contain more than 256 elements. Basing the splitting strategy solely on the number of elements might not be the best option available, and we certainly want to be able to at least configure the threshold on a per-job basis.

Unsurprisingly, C++ templates offer a nice way to deal with both problems in an efficient and type-safe manner.

A more generic implementation

Accepting ranges of any type can be achieved by introducing a template parameter for both the parallel_for function, as well as the job itself. Additionally, the parallel_for_job_data also needs to be able to hold pointers of arbitrary types.

Similarly, the splitting strategy can be fed into the parallel_for algorithm by an additional template argument supplied by the user. This allows the user to choose from different strategies, based on the job to be performed.

With the addition of the two above-mentioned arguments, the code then becomes the following:

// data passed to each parallel_for job, templated on the element type and the splitter type
template <typename T, typename S>
struct parallel_for_job_data
{
  typedef T DataType;
  typedef S SplitterType;

  parallel_for_job_data(DataType* data, unsigned int count, void (*function)(DataType*, unsigned int), const SplitterType& splitter)
    : data(data)
    , count(count)
    , function(function)
    , splitter(splitter)
  {
  }

  DataType* data;
  unsigned int count;
  void (*function)(DataType*, unsigned int);
  SplitterType splitter;
};

template <typename JobData>
void parallel_for_job(Job* job, const void* jobData)
{
  const JobData* data = static_cast<const JobData*>(jobData);
  const typename JobData::SplitterType& splitter = data->splitter;

  if (splitter.template Split<typename JobData::DataType>(data->count))
  {
    // split in two
    const unsigned int leftCount = data->count / 2u;
    const JobData leftData(data->data, leftCount, data->function, splitter);
    Job* left = jobSystem::CreateJobAsChild(job, &jobs::parallel_for_job<JobData>, leftData);
    jobSystem::Run(left);

    const unsigned int rightCount = data->count - leftCount;
    const JobData rightData(data->data + leftCount, rightCount, data->function, splitter);
    Job* right = jobSystem::CreateJobAsChild(job, &jobs::parallel_for_job<JobData>, rightData);
    jobSystem::Run(right);
  }
  else
  {
    // execute the function on the range of data
    (data->function)(data->data, data->count);
  }
}

template <typename T, typename S>
Job* parallel_for(T* data, unsigned int count, void (*function)(T*, unsigned int), const S& splitter)
{
  typedef parallel_for_job_data<T, S> JobData;

  const JobData jobData(data, count, function, splitter);

  Job* job = jobSystem::CreateJob(&jobs::parallel_for_job<JobData>, jobData);

  return job;
}

Now the parallel_for implementation is able to cope with ranges of arbitrary data, also accepting a splitting strategy as an additional parameter. A valid strategy implementation only needs to provide a Split() function that decides whether a given range should be split further or not, exemplified by the following two implementations:

class CountSplitter
{
public:
  explicit CountSplitter(unsigned int count)
    : m_count(count)
  {
  }

  template <typename T>
  inline bool Split(unsigned int count) const
  {
    return (count > m_count);
  }

private:
  unsigned int m_count;
};

class DataSizeSplitter
{
public:
  explicit DataSizeSplitter(unsigned int size)
    : m_size(size)
  {
  }

  template <typename T>
  inline bool Split(unsigned int count) const
  {
    return (count * sizeof(T) > m_size);
  }

private:
  unsigned int m_size;
};

CountSplitter simply splits a range based solely on the number of elements, as we’ve seen in the initial example.
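
For instance, the behaviour of the initial, hard-coded implementation can now be expressed by passing a CountSplitter with a threshold of 256:

// split as long as a range holds more than 256 elements
Job* job = parallel_for(g_particles, 100000, &UpdateParticles, CountSplitter(256));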

On the other hand, DataSizeSplitter is a strategy that also takes the size of the working set into account. For example, on a platform with an L1 cache size of 32KB, a DataSizeSplitter(32*1024) can be used to keep splitting ranges until their working set fits into the L1 cache. The important feature here is that this works across jobs of all kinds, no matter whether they work on particle, animation, or culling data: the splitter accounts for the element size automatically.

Invocation of the parallel_for is the same as before, except for an additional splitter argument:

Job* job = parallel_for(g_particles, 100000, &UpdateParticles, DataSizeSplitter(32*1024));
jobSystem::Run(job);
jobSystem::Wait(job);
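
To get a feeling for the resulting granularity, assume a hypothetical Particle that is 32 bytes in size: Split() returns true as long as count * 32 > 32*1024, i.e. as long as a range holds more than 1024 particles. Starting from 100,000 elements, the range is therefore halved seven times, yielding 128 leaf jobs of roughly 780 particles (about 25KB) each, so every leaf's working set fits into the L1 cache.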

Future work

As of now, the parallel_for implementation can only call free functions. It should be easy to upgrade the system to also make it accept member functions, lambdas, std::function, etc.
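
As an illustration of one possible direction (only a sketch, not part of the actual implementation, and it glosses over how larger or non-trivially copyable callables would be stored inside a job), the function pointer could be replaced by a template parameter for the callable type, with parallel_for and parallel_for_job adjusted accordingly:

// sketch: job data templated on an arbitrary callable type F
template <typename T, typename S, typename F>
struct parallel_for_job_data
{
  typedef T DataType;
  typedef S SplitterType;

  parallel_for_job_data(DataType* data, unsigned int count, const F& function, const SplitterType& splitter)
    : data(data)
    , count(count)
    , function(function)
    , splitter(splitter)
  {
  }

  DataType* data;
  unsigned int count;
  F function;        // a function pointer, a lambda, a functor, or std::function
  SplitterType splitter;
};

// usage with a lambda capturing per-frame state
const float deltaTime = 0.016f;
Job* job = parallel_for(g_particles, 100000,
  [deltaTime](Particle* particles, unsigned int count)
  {
    // update 'count' particles, using deltaTime for the integration step
  },
  DataSizeSplitter(32*1024));

The call inside parallel_for_job, (data->function)(data->data, data->count), would keep working unchanged for any of these callable types.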

Outlook

The next part in this series is probably going to be the last one, and will go into detail about how to handle dependencies between jobs.