Little Trouble in Big Data – Part 2 improving CPU utilization across cores

In the first blog in this series, I showed how we solved the original problem of how to use mmap() to load a large set of data into RAM all at once, in response to a request for help from a bioinformatics group dealing with massive data sets on a regular basis. The catch in our solution, however, was that the process still took too long. In this blog, I describe how we solve this, starting with Step 3 of the Process I introduced in Blog 1:

3. Fine-grained Threading

The original code we inherited was written on the premise that:

Eigen uses OpenMP to utilize multiple cores for vector and matrix operations. Writing out the results of the Monte Carlo simulation is time-consuming and therefore put into its own thread by way of OpenMPI with another OpenMPI critical section doing the actual analysis.

Of course, there were some slight flaws in this plan.

Eigen’s use of OpenMP is only for some very specific algorithms built into Eigen itself. None of which this analysis code was using, so that was useless. Eigen does make use of vectorization, however, which is good and can in ideal circumstances give a factor of 4 speedup compared to a simplistic implementation. So we wanted to keep that part. The threading for writing results was, shall we say, sub-optimal. Communication between the simulation thread and the writer thread was by way of a lockless list/queue they had found on the interwebs. Sadly, this was implemented with a busy spin loop which just locked the CPU at 100% whilst waiting for data to arrive once every n seconds or minutes. Which means it’s just burning cycles for no good reason. The basic outline algorithm looks something like this:

const std::vector colIndices = {0, 1, 2, 3, ... }; const std::vector markerIndices = randomise(colIndices); for (i = 0; i < maxIterations; ++i) { for (j = 0; j < numCols; ++j) { const unsigned int marker = markerIndices[j]; const auto col = data.mappedZ.col(marker); output += doStuff(col); } if (i % numIterations == 0) writeOutput(output); }

So, what can we do to make better use of the available cores? For technical reasons related to how Markov Chain Monte Carlo works, we can neither parallelize the outer loop over iterations nor the inner loop over the columns (SNPs). What else can we do?

Well, recall that we are dealing with large numbers of individuals – 500,000 of them in fact. So we could split the operations on these 500k elements into smaller chunks and give each chunk to a core to process and then recombine the results at the end. If we use Eigen for each chunk, we still get to keep the SIMD vectorization mentioned earlier. Now, we could do that ourselves but why should we worry about chunking and synchronization when somebody else has already done it and tested it for us?

This was an ideal chance for me to try out Intel’s Thread Building Blocks library, TBB for short. As of 2017 this is now available under the Apache 2.0 license and so is suitable for most uses.

TBB has just the feature for this kind of quick win in the form of its parallel_for and parallel_reduce template helpers. The former performs the map operation (applies a function to each element in a collection where each is independent). The latter performs the reduce operation, which is essentially a map operation followed by a series of combiner functions, to boil the result down to a single value.

These are very easy to use so you can trivially convert a serial piece of code into a threaded piece just by passing in the collection and lambdas representing the map function (and also a combiner function in the case of parallel_reduce ).

Let’s take the case of a dot (or scalar) product as an example. Given two vectors of equal length, we multiply them together component-wise then sum the results to get the final value. To write a wrapper function that does this in parallel across many cores we can do something like this:

const size_t grainSize = 10000; double parallelDotProduct(const VectorXf &Cx, const VectorXd &y_tilde) { const unsigned long startIndex = 0; const unsigned long endIndex = static_cast(y_tilde.size()); auto apply = [&](const blocked_range& r, double initialValue) { const long start = static_cast(r.begin()); const long count = static_cast(r.end() - r.begin()); const auto sum = initialValue + (Cx.segment(start, count).cast() * y_tilde.segment(start, count)).sum(); return sum; }; auto combine = [](double a, double b) { return a + b; }; return parallel_reduce(blocked_range(startIndex, endIndex, grainSize), 0.0, apply, combine); }

Here, we pass in the two vectors for which we wish to find the scalar product, and store the start and end indices. We then define two lambda functions.

The apply lambda, simply uses the operator * overload on the Eigen VectorXf type and the sum() function to calculate the dot product of the vectors for the subset of contiguous indices passed in via the blockedRange argument. The initialValue argument must be added on. This is just zero in this case, but it allows you to pass in data from other operations if your algorithm needs it. The combine lambda then just adds up the results of each of the outputs of the apply lambda.

When we then call parallel_reduce with these two functions, and the range of indices over which they should be called, TBB will split the range behind the scenes into chunks based on a minimum size of the grainSize we pass in. Then it will create a lightweight task object for each chunk and queue these up onto TBB’s work-stealing threadpool. We don’t have to worry about synchronization or locking or threadpools at all. Just call this one helper template and it does what we need!

The grain size may need some tuning to get optimal CPU usage based upon how much work the lambdas are performing but as a general rule of thumb, it should be such that there are more chunks (tasks) generated than you have CPU cores. That way the threadpool is less likely to have some cores starved of work. But too many and it will spend too much time in the overhead of scheduling and synchronizing the work and results between threads/cores.

I did this for all of the operations in the inner loop’s doStuff() function and for some others in the outer loop which do more work across the large (100,000+ element) vectors and this yielded a very nice improvement in the CPU utilization across cores.

So far so good. In the next blog, I’ll show you how we proceed from here, as it turns out this is not the end of the story.