This is the second post in my C++ accelerator library series. It is about Boost.Compute, a header-only C++ accelerator programming library developed by Kyle Lutz and others. It is based on OpenCL and is released under the Boost 1.0 license. The name of the library, the license, as well as the documentation suggest that Boost.Compute one day will apply for an official review by the Boost community to become a part of the Boost C++ library suite. Acceptance into Boost can be a stepping stone to inclusion into the C++ Standard. But this is not the only reason why it is worthwhile to take a look at Boost.Compute.

Compute, stand by, operate, stand by to operate, image by Marcin Wichary

Boost.Compute manages memory through the compute::vector template, a concept similar to std::vector . Memory is either transferred synchronously through copy or asynchronously through the copy_async function. I consider this split an excellent design decision: copy does exactly what one would expect as it is the same call we can find in the STL. But asynchronous copy is also available for those programmers that need more performance and know what they are doing. An asynchronous copy returns an instance of compute::future . The future can be blocked on to ensure deterministic and correct execution of the program. Both copy function interfaces are iterator based.

All functions resulting in commands issued to an accelerator accept an optional command_queue argument. A command_queue has a wait function to block until completion of all enqueued commands. In addition, barriers and markers can be enqueued in a command_queue for additional control over asynchronous operations. These are in my opinion all the tools necessary to express concurrency and control command-level parallelism.

Boost.Compute comes with a multitude of parallel primitives. These STL like functions compose the core of Boost.Compute. This is an incredible amount of accelerator-enabled general algorithms and I believe a listing of the more complex functions is in order:

/** Returns the result of applying function to the elements in the range [first, last) and init. */ T accumulate(InputIterator first, InputIterator last, T init, BinaryFunction function) /** Returns true if value is in the sorted range [first, last). */ bool binary_search(InputIterator first, InputIterator last, const T & value) /** Performs an exclusive scan on the elements in the range [first, last) and stores the results in the range beginning at result. */ OutputIterator exclusive_scan(InputIterator first, InputIterator last, OutputIterator result) /** Calls function on each element in the range [first, last). */ UnaryFunction for_each(InputIterator first, InputIterator last, UnaryFunction function) /** Copies the elements using the indices from the range [first, last) to the range beginning at result using the input values from the range beginning at input. */ void gather(MapIterator first, MapIterator last, InputIterator input, OutputIterator result) /** Performs an inclusive scan on the elements in the range [first, last) and stores the results in the range beginning at result. */ OutputIterator inclusive_scan(InputIterator first, InputIterator last, OutputIterator result) /** Returns the inner product of the elements in the range [first1, last1) with the elements in the range beginning at first2. */ T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryAccumulateFunction accumulate_function, BinaryTransformFunction transform_function) /** Merges the sorted values in the range [first1, last1) with the sorted values in the range [first2, last2) and stores the result in the range beginning at result. Values are compared using the comp function. */ OutputIterator merge(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, InputIterator2 last2, OutputIterator result, Compare comp) /** Calculates the cumulative sum of the elements in the range [first, last) and writes the resulting values to the range beginning at result. */ OutputIterator partial_sum(InputIterator first, InputIterator last, OutputIterator result) /** Randomly shuffles the elements in the range [first, last). */ void random_shuffle(Iterator first, Iterator last) /** Returns the result of applying function to the elements in the range [first, last). Requires the binary operator to be commutative. */ void reduce(InputIterator first, InputIterator last, OutputIterator result, BinaryFunction function) /** Performs left rotation such that element at n_first comes to the beginning. */ void rotate(InputIterator first, InputIterator n_first, InputIterator last) /** Copies the elements from the range [first, last) to the range beginning at result using the output indices from the range beginning at map. */ void scatter(InputIterator first, InputIterator last, MapIterator map, OutputIterator result) /** Sorts the values in the range [first, last) according to compare. */ void sort(Iterator first, Iterator last, Compare compare) /** Performs a key-value sort using the keys in the range [keys_first, keys_last) on the values in the range [values_first, values_first + (keys_last - keys_first)) using compare. */ void sort_by_key(KeyIterator keys_first, KeyIterator keys_last, ValueIterator values_first, Compare compare) /** Sorts the values in the range [first, last) according to compare. The relative order of identical values is preserved. */ void stable_sort(Iterator first, Iterator last, Compare compare) /** Transforms the elements in the range [first, last) using transform and stores the results in the range beginning at result. */ OutputIterator transform(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, OutputIterator result, BinaryOperator op) /** Transforms each value in the range [first, last) with the unary transform_function and then reduces each transformed value with reduce_function. */ void transform_reduce(InputIterator first, InputIterator last, OutputIterator result, UnaryTransformFunction transform_function, BinaryReduceFunction reduce_function)

There is no special support for numerical analysis but a number of numerical operations can be built from aforementioned parallel primitives.

Boost.Compute provides a number of built in functions to pass to the parallel primitives but programmers may also specify custom functions. This can be done by creating an instance of a compute::function. A shorthand macro BOOST_COMPUTE_FUNCTION() simplifies this task. Next to custom functions, programmers can also declare a BOOST_COMPUTE_CLOSURE() with the added benefit of capturing variables that can then be used within the accelerator function. As a third option, programmers can specify a lambda expression instead of a custom function object. This is accomplished with the help of Boost.Proto. Kyle Lutz talks about these features in a recent blog post.

Boost.Compute contains a well designed C++ wrapper for OpenCL. Each wrapper type has conversion operators to the underlying OpenCL object; a handy feature that enables full compatibility to the lower layer of accelerator programming. Boost.Compute is inter-operable with OpenGL, OpenCV, Eigen, QT and VTK. It is available on github, the documentation is available here.

I would categorize Boost.Compute as a general purpose, high productivity library. Depending on the quality of implementation and specialization of the provided parallel primitives, close to peak performance should be possible with Boost.Compute. I like the combination of low level OpenCL wrappers and higher level parallel primitives as well as the possibility to fall back to OpenCL code if certain features are not yet wrapped by Boost.Compute. I think the work in Boost.Compute can be an important part of a yet to be defined standard C++ accelerator library.