I have heard many complaints about the verbosity of the OpenCL API. These complaints are not unwarranted.

The verbosity is due to the low-level nature of OpenCL. Its API is written in C, the lingua franca of programming languages. While this allows you to run an OpenCL program on virtually any platform, it has some disadvantages.

In a typical OpenCL program you must:

Query for the platform

Get the device IDs from the platform

Create a context from a set of device IDs

Create a command queue from the context

Create buffer objects for your data

Transfer the data to the buffer

Create and build a program from source

Extract the kernels

Launch the kernels

Transfer the data to the host

Wow, that was longer than I thought it would be. As you can imagine, this can be a daunting task for anyone who is not familiar with the OpenCL API.

For kicks and giggles I decided to create the smallest OpenCL program I could, with a focus on simplicity and readability. I am going to make use of the excellent C++ API from the Khronos website. Technically, a program that only transfers data to and from the device would be even smaller, but I wanted to make something less trivial. Here are my requirements for the program:

Transfer data from host to device

Add two vectors and store the result in a third

Return the data to the host

Print the results

Here is my attempt:

    #define __CL_ENABLE_EXCEPTIONS
    #include "cl.hpp"
    #include <vector>
    #include <iostream>
    #include <iterator>
    #include <algorithm>

    using namespace cl;
    using namespace std;

    int main(int argc, char* argv[])
    {
        // Create a context for the default device type.
        Context(CL_DEVICE_TYPE_DEFAULT);

        static const unsigned elements = 1000;
        vector<float> data(elements, 5);

        // Iterator constructors: readOnly = true, useHostPtr = false.
        Buffer a(begin(data), end(data), true, false);
        Buffer b(begin(data), end(data), true, false);
        Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

        // The second argument (true) builds the program right away.
        Program addProg(R"d(
            kernel
            void add(global const float * restrict const a,
                     global const float * restrict const b,
                     global float * restrict const c)
            {
                unsigned idx = get_global_id(0);
                c[idx] = a[idx] + b[idx];
            }
        )d", true);

        // Launch one work-item per element.
        auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
        add(EnqueueArgs(elements), a, b, c);

        // Read the result back to the host and print it.
        vector<float> result(elements);
        cl::copy(c, begin(result), end(result));

        std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
    }

Not bad, eh? The biggest savings came from using the default platform, context, and command queue objects in the C++ API. I also took advantage of a few C++11 features, including raw string literals and auto. This code was tested on OS X using the clang++ compiler. It should also build with Visual Studio 2013 and GCC with little or no change.
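The raw string literal is what lets the kernel source sit in the host file without escaped quotes or "\n" sequences. As a small sketch in plain C++11 (no OpenCL needed; kernel_source and source_mentions_add are hypothetical helper names, not part of the C++ API), the d in R"d( and )d" is simply a delimiter of your choosing:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the kernel source as an ordinary std::string.
// Everything between R"d( and )d" is taken verbatim, newlines included.
inline std::string kernel_source() {
    return R"d(
kernel void add(global const float * restrict const a,
                global const float * restrict const b,
                global float * restrict const c)
{
    unsigned idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}
)d";
}

// `auto` deduces std::string here, so the type is not spelled out twice.
inline bool source_mentions_add() {
    auto src = kernel_source();
    assert(!src.empty());
    return src.find("kernel void add") != std::string::npos;
}
```

The string returned by kernel_source() could be handed to the Program constructor in place of the inline literal, which keeps the host code tidy when the kernel grows.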

Let's dive into the code. A default context can be created by passing CL_DEVICE_TYPE_DEFAULT to the cl::Context constructor. This allows you to call several OpenCL functions without specifying a context. In many cases managing a context yourself is not needed and only adds overhead to the code.

I am also taking advantage of the special Buffer constructors in the C++ API that take iterators as inputs. They allocate the correct amount of memory and transfer the data in one call.
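To make the size calculation concrete, here is a rough sketch of how a byte count can be derived from an iterator pair. range_bytes is a hypothetical helper written for this post, not a function in the C++ API, and it assumes the range holds plain values such as float:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: number of bytes needed to hold [first, last).
template <typename Iterator>
std::size_t range_bytes(Iterator first, Iterator last) {
    typedef typename std::iterator_traits<Iterator>::value_type Value;
    const auto count = std::distance(first, last);
    assert(count >= 0);  // the range must not be reversed
    return static_cast<std::size_t>(count) * sizeof(Value);
}
```

For the 1000-element vector<float> in the program, range_bytes(begin(data), end(data)) gives the same number as the elements * sizeof(float) that is written out by hand for the c buffer.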

I especially like make_kernel in the C++ API. It creates a functor (an object which overloads the function call operator, operator()). Let's take a look at how it is used.

    auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
    add(EnqueueArgs(elements), a, b, c);

make_kernel is a class template whose constructor takes the program object and the name of the kernel as arguments. The template parameters are the types of the kernel's parameters. The resulting functor takes an EnqueueArgs object followed by three Buffer objects. The EnqueueArgs object can be used to set the launch configuration of the kernel. In this case I took advantage of the default command queue, but a queue can also be set using the EnqueueArgs object.
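If functors are unfamiliar, the pattern can be sketched in plain C++ without any OpenCL at all. The LaunchConfig and HostAdd types below are hypothetical stand-ins invented for this illustration. They only mimic the call shape of EnqueueArgs and the functor you get from make_kernel, and they run the loop on the host instead of on a device:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a launch configuration: just a global size.
struct LaunchConfig {
    std::size_t global_size;
    explicit LaunchConfig(std::size_t n) : global_size(n) {}
};

// Hypothetical functor: an object that can be called like a function
// because it overloads operator().
struct HostAdd {
    void operator()(const LaunchConfig& cfg,
                    const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c) const {
        assert(a.size() >= cfg.global_size);
        assert(b.size() >= cfg.global_size);
        assert(c.size() >= cfg.global_size);
        // One loop iteration plays the role of one work-item.
        for (std::size_t idx = 0; idx < cfg.global_size; ++idx) {
            c[idx] = a[idx] + b[idx];
        }
    }
};
```

Calling it reads much like the real thing: declare HostAdd add; and then write add(LaunchConfig(n), a, b, c);. The launch configuration comes first and the kernel arguments follow in order.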

The cl::copy function can be used to transfer data to and from the device. In this case it transfers the data from the c buffer into the result vector.
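Once the data is back on the host it is easy to sanity-check. Here is a minimal sketch, assuming a hypothetical helper named all_equal_to (not part of the C++ API). Both inputs are filled with 5, so every element of result should be 10, which a float represents exactly, so an exact comparison is safe in this particular case:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: true when every element equals `expected`.
inline bool all_equal_to(const std::vector<float>& values, float expected) {
    assert(!values.empty());  // an empty result would mean nothing was copied
    return std::all_of(values.begin(), values.end(),
                       [expected](float v) { return v == expected; });
}
```

After the cl::copy call, all_equal_to(result, 10.0f) should hold. For inputs that are not exactly representable, a tolerance-based comparison would be the safer choice.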

The C++ OpenCL API provides numerous abstractions over its C counterpart. It is also easy to mix the C and C++ interfaces in the same program. I have yet to find any functionality that is missing from the C++ API. I would encourage everyone to check out the C++ Wrapper API for OpenCL.