We spend way too much time building the bits and pieces needed to put together parallel programs. How much time do we (the programmers) spend writing boilerplate code that could easily be abstracted away by libraries like RaftLib? RaftLib enables programmers to stitch together parallel code (both pipeline and task parallel) from sequential kernels, with far fewer lines of code than most other parallelization frameworks, while being just as performant. This mode of programming is known as streaming (or data-flow). Not only does RaftLib remove the burden of boilerplate coding, it also automates the optimization process: it chooses the type of memory to allocate for the links between kernels and dynamically resizes them using multiple analytic models (both queueing-network and Support Vector Machine learned models).

To show how easy it is to build an application, I want to demonstrate a simple, widely used app. I chose parallel bzip2. We’ll go through a bit of the RaftLib implementation of parallel bzip2 (using libbz2), then move on to a performance comparison between this version and the classic pbzip2 utility.

If you’ve never looked at the pbzip2 code, you’re not alone. I hadn’t either until I wanted to do this comparison. After spending about ten minutes reviewing it, I decided it was fairly well written. At first glance it seems like spaghetti code; however, the number of global variables and error handlers is quite typical of parallel programs these days (often by necessity). Although I disdain lines of code as a metric, it is useful for understanding the scale of the code base. The main file (pbzip2.cpp) contains over 4,500 lines of code, and the header file an additional 300. Most of the code is devoted to book-keeping tasks (e.g., building and managing the structures for safe communication between parallel threads). To reimplement pbzip2 I needed to know exactly which functions do the work of compressing, so I cross-referenced the bzip2 library documentation with this code, keying in on anything in pbzip2.cpp with the bzip2 library function prefix BZ2_. I found that the pbzip2 authors call the buffer-to-buffer compress function from within a worker thread; voilà. Now I know exactly how to implement my kernel within RaftLib.
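That search amounts to something like the following (assuming the pbzip2 source is checked out locally; the path is illustrative):

```shell
# list every libbz2 call site in the main pbzip2 source file
grep -n 'BZ2_' pbzip2.cpp
```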

I won’t have to worry about the custom communications framework that consumes so much code within pbzip2; RaftLib manages that for me. I also don’t need to build any custom file readers; that is handled by a stock RaftLib library call. What I will have to construct is a compress kernel, and a kernel to write the file in indexed order, since each compressed block will be of variable length (i.e., the original blocks are uncompressed and homogeneous, while the compressed blocks differ based on compressibility). Let’s go over the file-write kernel first. We only have to write the code as if it were executing sequentially; that is, no special care need be taken to accommodate its parallel execution (each kernel can potentially be executed as a separate thread). The only rule that must be followed is that state must be accessed only from within the kernel, or streamed in via its ports. Our filewrite kernel looks like almost any other C/C++ code containing a file operation (granted, it’s wrapped inside a virtual “run” function). I did make one concession for this code: since I wanted to benchmark it against the classic bzip2 on a per-thread-count basis, I added a port count parameter to the constructor so I could control the exact number of worker threads for benchmarking (see below).

Since we’re extending raft::parallel_k, care must be taken to access ports only via the for-loop iterator construct. This gives the run-time the freedom to add input ports as it sees fit (and prevents ports from being removed or added while the kernel is using them). The default RaftLib file-reader object also extends the same raft::parallel_k class. At line 27 we see this iterator loop, accessing each port sequentially for data. To serialize the writes in the correct order, the code checks the order id passed through by the compress kernel from the file reader. Visually, this process looks something like this:

On to the actual worker: the compress kernel. To build it, we need to write an object that extends raft::kernel, then set up the run function to pull data from an input port, compress it using the BZ2_bzBuffToBuffCompress function, and write it to the output port. Using RaftLib, this is quite simple. We can even do it zero-copy, directly using the input stream as a data source and the output stream as the destination. We first add input ports within the constructor, with the filechunk struct as the type. This struct enables the port to carry statically sized chunks of a file, to which the filereader RaftLib library kernel attaches a block order and a file position (relative to the original file). This allows us to maintain order when re-assembling the compressed blocks. Of course, the programmer must know to pass the order through if they intend to use it; the run-time can’t simply infer that the programmer intends to re-order the data within the file-write kernel.

Here’s my implementation of the compress kernel (by far the longest chunk of code in rbzip2):

I even did some basic error handling for the bzip2 library calls. One interesting line I’d like to call out is number 87, where the CLONE() function (well, actually a macro) is inserted. This CLONE() macro enables the run-time to duplicate the object as many times as needed (assuming the programmer has also provided a copy constructor). All that is left is to define a main function, instantiate the needed kernels, and link them together.

Let’s see what that looks like:
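(The listing itself isn’t embedded here; as pseudocode, the wiring step looks roughly like this, with kernel and variable names purely illustrative:)

```
// pseudocode sketch, not the original listing
raft::map m;
filereader reader( input_file, num_workers );  // stock RaftLib reader kernel
compress   comp;                               // the compress kernel above
filewrite  writer( output_file, num_workers ); // the ordered writer above
// "<=" / ">=" let the run-time duplicate comp to match the available ports
m += reader <= comp >= writer;
m.exe();
```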

Awesomely, that is pretty much all that is necessary to write a parallel bzip compressor using the RaftLib library. It’s short and simple.

As you can see, most of the code is taken up by command-line processing (using the cmdargs library). The actual kernel instantiation and linking are handled in lines 55–72. Line 72 defines a pipeline- and task-parallel bzip2 compressor topology. The “<=” tells the run-time to expand the kernels between it and the “>=” operator to match the number of input and output ports available. This means the compress kernel will be automatically duplicated by the run-time; the programmer only has to specify it once.

The full code is available here. Total time spent authoring and debugging the code: about 20 minutes. Looking at the original version of pbzip2, I’m fairly certain it took a lot longer than that.