Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone!

CUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself here. Instead, I wanted to take a deeper look at C++11 support in device code.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. New C++ language features include auto, lambda functions, variadic templates, static_assert, rvalue references, range-based for loops, and more. To enable C++11 support, pass the flag --std=c++11 to nvcc (this option is not required for Microsoft Visual Studio).

In my earlier CUDA 7 feature overview post, I presented a small example to show some C++11 features. Let’s dive into a somewhat expanded example to show the power of C++11 for CUDA programmers. This example will proceed top-down, covering a couple of layers of abstraction that allow us to write concise, reusable C++ code for the GPU, all enabled by C++11. The complete example is available on GitHub.

Let’s say we have a very specific (albeit contrived) goal: count the number of characters from a certain set within a text. (In parallel, of course!) Here’s a simple CUDA C++11 kernel that abstracts the mechanics of this a bit.

__global__ void xyzw_frequency(int *count, char *text, int n)
{
  const char letters[] { 'x', 'y', 'z', 'w' };

  count_if(count, text, n, [&](char c) {
    for (const auto x : letters)
      if (c == x) return true;
    return false;
  });
}

Peeling Back the Layers

This code puts most of the algorithmic implementation inside the count_if function. Let’s dig deeper. xyzw_frequency() uses an initializer list to initialize the letters array to the four characters x, y, z, and w. It then calls count_if, a generic algorithm that increments a counter for each element of its input for which the specified predicate evaluates to true. Here’s the interface of count_if; we’ll look at the implementation shortly.

template <typename T, typename Predicate>
__device__ void count_if(int *count, T *data, int n, Predicate p);

The last argument is a predicate function object that count_if calls for each element of the data input array. You can see in xyzw_frequency that we use a special syntax for this function object: [&](char c) { ... }. That [](){} syntax indicates a lambda function definition. The definition constructs a “closure”: an unnamed function object capable of capturing variables in scope.

Lambda

C++11 lambdas are really handy when you have a simple computation that you want to use as an operator in a generic algorithm, like our count_if. As Herb Sutter says,

Lambdas are a game-changer and will frequently change the way you write code to make it more elegant and faster.

By the way, I recommend you check out Sutter’s “Elements of Modern C++ Style” for a concise guide to using new C++ features.

Let’s look at that lambda again:

[&](char c) {
  for (const auto x : letters)
    if (c == x) return true;
  return false;
}

This defines an anonymous function object whose operator() takes one argument, char c. But you can see that it also accesses letters. It does so by “capturing” variables from the enclosing scope, in this case by reference, as the [&] capture list specifies. Lambdas give us a way to define functions inline with the code that calls them, and to access local variables without having to pass every variable to the function explicitly. That’s powerful.

Inside our lambda function we use two more C++11 features: the auto specifier and a range-based for loop.

Auto

auto specifies that the compiler should deduce the type of the declared variable from its initializer. auto lets you avoid specifying type names that the compiler already knows, but more importantly, sometimes it lets you declare variables of unknown or “unutterable” types (like the type of many lambdas). You’ll find yourself using auto all the time.
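For example, here’s a small standalone host-side sketch (the function names are purely illustrative) showing both uses of auto: avoiding a verbose iterator type name, and holding a lambda, whose closure type has no name you could write out.

```cpp
#include <cassert>
#include <map>
#include <string>

// auto deduces types the compiler already knows...
int count_keys(const std::map<std::string, int>& m) {
    int n = 0;
    // Without auto: std::map<std::string, int>::const_iterator it = m.begin();
    for (auto it = m.begin(); it != m.end(); ++it) ++n;
    return n;
}

// ...and names "unutterable" types: each lambda has a unique,
// unnameable closure type that only auto can hold directly.
int apply_twice(int x) {
    auto square = [](int v) { return v * v; };
    return square(square(x));
}
```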

Range-based For Loops

A range-based for loop simply executes a loop over a range. In our case, the loop for (const auto x : letters) is equivalent to this loop:

for (auto it = std::begin(letters); it != std::end(letters); ++it) {
  const auto x = *it;
  // loop body
}

Range-based for loops can be used to iterate over arrays of known size, as in our example (char letters[] { 'x', 'y', 'z', 'w' };), or over any object that defines begin() and end() member functions. I think the range-based for loop is much clearer in this case, and I’ll come back to how we can use it for something more specific to CUDA C++.
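To make the begin()/end() requirement concrete, here’s a minimal hypothetical range type (not part of our example code) that works with range-based for: all the loop needs is begin() and end() returning something that supports !=, prefix ++, and unary *.

```cpp
#include <cassert>

// A minimal illustrative range type usable in a range-based for loop.
struct counting_range {
    struct iterator {
        int value;
        bool operator!=(const iterator& other) const { return value != other.value; }
        iterator& operator++() { ++value; return *this; }
        int operator*() const { return value; }
    };
    int first, last;
    iterator begin() const { return iterator{first}; }
    iterator end() const { return iterator{last}; }
};

// Sum the half-open range [first, last) using range-based for.
int sum_range(int first, int last) {
    int total = 0;
    for (auto i : counting_range{first, last}) total += i;
    return total;
}
```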

Iterating in Parallel

Now we need to define our count_if function. This function needs to count the elements of an input array for which a predicate returns true. And of course, we want to do this in parallel! Here’s a possible implementation:

template <typename T, typename Predicate>
__device__ void count_if(int *count, T *data, int n, Predicate p)
{
  for (int i = blockDim.x * blockIdx.x + threadIdx.x;
       i < n;
       i += gridDim.x * blockDim.x)
  {
    if (p(data[i])) atomicAdd(count, 1);
  }
}

If you are an avid CUDA C++ programmer, or an avid reader of this blog, you probably recognize the idiom used in this loop. We call it a “grid-stride loop”, and I’ve written about it before on Parallel Forall. The thing to know about grid-stride loops is that they let you decouple the size of your CUDA grid from the data size it is processing, resulting in less coupling between your host and device code. It also has portability and debugging benefits.
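To see why the idiom decouples grid size from data size, here is a host-side simulation (plain C++, with hypothetical grid and block dimensions, not the actual kernel) checking that a grid-stride loop visits every element exactly once no matter how many threads the grid has:

```cpp
#include <cassert>
#include <vector>

// Simulate a grid-stride loop on the host: each "thread" starts at its
// global index and strides by the total thread count.
std::vector<int> visit_counts(int n, int grid_dim, int block_dim) {
    std::vector<int> visits(n, 0);
    int total_threads = grid_dim * block_dim;
    for (int block = 0; block < grid_dim; ++block)
        for (int thread = 0; thread < block_dim; ++thread)
            for (int i = block_dim * block + thread; i < n; i += total_threads)
                ++visits[i];
    return visits;
}

// True if every one of the n elements was visited exactly once.
bool covers_exactly_once(int n, int grid_dim, int block_dim) {
    for (int v : visit_counts(n, grid_dim, block_dim))
        if (v != 1) return false;
    return true;
}
```

Whether the grid is far smaller or far larger than the data, coverage is complete and non-overlapping, which is exactly why the launch configuration can be tuned independently of n.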

Grid-Stride Range-Based For Loops?

But the downside of this CUDA C++ idiom is that it is verbose, ugly, and bug-prone (it’s easy to type blockIdx when you mean blockDim). Maybe we can use C++11 range-based for loops to make it better. Wouldn’t it be nice to implement count_if like this instead?

template <typename T, typename Predicate>
__device__ void count_if(int *count, T *data, int n, Predicate p)
{
  for (auto i : grid_stride_range(0, n)) {
    if (p(data[i])) atomicAdd(count, 1);
  }
}

To do this we just need to define a helper function grid_stride_range() that takes a start and end index and returns an object with the appropriate interface for a C++ range, one that steps through the range with a grid stride. Writing a range class is fairly straightforward; there are various examples on Stack Overflow and GitHub. I chose one that I liked, range.hpp by GitHub user klmr. To make this useful in CUDA, I forked the repository and annotated all functions with __host__ __device__ so the classes can be used on either the CPU or GPU (and I wrapped this in a macro so you can still compile it with any C++11 compiler). You can find the updated utilities on GitHub.

These utilities let us define a range as simply as range(0, n), and we can make it a strided range with range(0, n).step(2), which creates an object representing the range from 0 to n with a stride of 2. So I can create our CUDA grid_stride_range() utility function like this:

#include "range.hpp"

using namespace util::lang;

// type alias to simplify typing...
template <typename T>
using step_range = typename range_proxy<T>::step_range_proxy;

template <typename T>
__device__ step_range<T> grid_stride_range(T begin, T end) {
  begin += blockDim.x * blockIdx.x + threadIdx.x;
  return range(begin, end).step(gridDim.x * blockDim.x);
}

Now we can write for (auto i : grid_stride_range(0, n)) when we want to implement a kernel that covers (in parallel) the entire range of indices from 0 to n. This makes our grid-stride loop idiom clean, safe, and fast (in my limited testing, performance was more or less the same as the explicit grid-stride loop).
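The sequence of indices a single thread sees is easy to reason about. Here’s a host-side sketch of the strided-range semantics grid_stride_range() relies on (the helper name is illustrative; this is not the actual range.hpp implementation): starting at begin, advance by stride until reaching or passing end.

```cpp
#include <cassert>
#include <vector>

// Collect the indices a strided range [begin, end) with the given
// stride would produce. This mirrors what one thread of a grid-stride
// loop iterates over: begin = its global thread index, stride = the
// total number of threads in the grid.
std::vector<int> strided_indices(int begin, int end, int stride) {
    std::vector<int> out;
    for (int i = begin; i < end; i += stride) out.push_back(i);
    return out;
}
```

For example, with 8 total threads and n == 20, the thread with global index 3 processes elements 3, 11, and 19.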

Type Aliases

It’s worth noting one other C++11 feature in use here: type aliases (sometimes called “template typedefs”). The code template <typename T> using step_range = ... gives us a way to define generic type aliases that act as synonyms for the original type. Here I just used it to simplify the long type name in klmr’s range classes. C++14 adds return type deduction, which would let us declare grid_stride_range() with an auto return type and have the compiler deduce the return type from the call to range().step(). Alas, C++14 features are not yet supported by nvcc in CUDA 7, but we plan to support them in a future release.
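To see why this matters, note that a classic typedef cannot be parameterized, while a C++11 alias template can. A small standalone sketch (names are illustrative):

```cpp
#include <cassert>
#include <map>
#include <string>

// A classic typedef can name only one concrete type, but a C++11
// alias template stays generic over T:
template <typename T>
using NamedValues = std::map<std::string, T>;

// The alias is a true synonym for the underlying type, so
// NamedValues<double> is exactly std::map<std::string, double>.
double total(const NamedValues<double>& values) {
    double sum = 0.0;
    for (const auto& kv : values) sum += kv.second;
    return sum;
}
```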

Et Voilà!

With this, our example is complete; the full code is available on GitHub. In it, we load the complete text of Tolstoy’s “War and Peace” and run our xyzw_frequency kernel on it. Here’s the output.

Read 3288846 byte corpus from warandpeace.txt
counted 107310 instances of 'x', 'y', 'z', or 'w' in "warandpeace.txt"

C++11 and Thrust

As I mentioned in my post about CUDA 7 features, CUDA 7 also includes a major update to Thrust. Thrust is a powerful, open source C++ parallel algorithms library, and the new features of C++11 make Thrust more expressive than ever.

The most obvious way that Thrust benefits from C++11 is through the auto keyword. You’ll find that auto saves you from typing (knowing, even) the names of complex Thrust types. Here’s a rather extreme example where we need to store an instance of a complex iterator type in a variable:

typedef device_vector<float>::iterator FloatIterator;
typedef tuple<FloatIterator, FloatIterator, FloatIterator> FloatIteratorTuple;
typedef zip_iterator<FloatIteratorTuple> Float3Iterator;

Float3Iterator first = make_zip_iterator(make_tuple(A0.begin(), A1.begin(), A2.begin()));

With auto, the code above is drastically simplified:

auto first = make_zip_iterator(make_tuple(A0.begin(), A1.begin(), A2.begin()));

Thrust is designed to resemble the C++ Standard Template Library (STL), and just like the STL, C++11 lambdas make a powerful combination with Thrust algorithms. We can easily use a lambda as an operator when we apply Thrust algorithms to host_vector containers. Here’s a version of our xyzw_frequency() function that executes on the host. Instead of our custom count_if, it uses thrust::count_if and a host-side lambda function.

void xyzw_frequency_thrust_host(int *count, char *text, int n)
{
  const char letters[] { 'x', 'y', 'z', 'w' };
  *count = thrust::count_if(thrust::host, text, text + n, [&](char c) {
    for (const auto x : letters)
      if (c == x) return true;
    return false;
  });
}

Another major new feature of Thrust is the ability to call Thrust algorithms from CUDA device code. This means that you can launch a CUDA kernel and then call Thrust algorithms such as for_each and transform_reduce from device threads. It also means that you can nest Thrust algorithm calls; in other words you could call transform_reduce inside for_each .

To make sure this can be done as efficiently and flexibly as possible, Thrust now provides execution policies to control how algorithms are executed. If you call a Thrust algorithm from a device thread (or inside another Thrust algorithm), you can use the thrust::seq execution policy to run the “inner” algorithm sequentially within a single CUDA thread. Alternatively, you can use thrust::device to have the algorithm launch a child kernel (using Dynamic Parallelism). On devices that don’t support Dynamic Parallelism, Thrust algorithms called with thrust::device fall back to running on a single thread on the device.

Here’s another CUDA kernel version of xyzw_frequency() that uses thrust::count_if() just like before, except now it operates on device memory using a device-side lambda. Other than the __global__ specifier and the thrust::device execution policy, this code is identical to the host version!

__global__ void xyzw_frequency_thrust_device(int *count, char *text, int n)
{
  const char letters[] { 'x', 'y', 'z', 'w' };
  *count = thrust::count_if(thrust::device, text, text + n, [=](char c) {
    for (const auto x : letters)
      if (c == x) return true;
    return false;
  });
}

Note that count_if() here is called with the thrust::device execution policy, so each calling thread will perform the count using a dynamic parallel kernel launch on devices that support it (sm_35 and higher). Therefore we launch the kernel using a single thread, and let Thrust generate more parallelism on the device. Using thrust::seq here would probably perform much worse, since a single device thread would process the entire text sequentially. (See our series of posts on Dynamic Parallelism (1, 2, 3) for a good understanding.)

It’s worth noting one other very minor change in this function versus the host version: the capture list of the lambda. In the host version it was [&]; in the device version it must be [=]. The reason is that when using Dynamic Parallelism, the child kernel cannot access the local memory of the parent kernel. So we must capture the letters array by value in the lambda, or the child kernel that executes the lambda will perform an invalid memory access.
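A host-side analogue makes the distinction tangible: a lambda that must run after its enclosing stack frame is gone has to capture by value, just as a lambda executed by a child kernel cannot reach back into the parent kernel’s locals. This is a sketch, not device code; make_matcher is an illustrative name.

```cpp
#include <cassert>
#include <functional>

// Returns a lambda that outlives the function's local variables.
// Capturing `letters` by value copies the array into the closure,
// so the returned lambda stays valid after make_matcher() returns.
// A [&] capture here would dangle, the same way a [&] device lambda
// would reference parent-kernel local memory the child cannot access.
std::function<bool(char)> make_matcher() {
    const char letters[] { 'x', 'y', 'z', 'w' };
    return [letters](char c) {
        for (const auto x : letters)
            if (c == x) return true;
        return false;
    };
}
```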

In addition to device-side algorithms and C++11 support, Thrust now provides much higher performance for a number of algorithms, and supports execution of algorithms on specified CUDA streams.

Download CUDA 7 and Learn More at GTC!

Download the CUDA Toolkit version 7 now from CUDA Zone!

I’ll be covering this material and more in person at the 2015 GPU Technology Conference this week. If you missed my talk “CUDA 7 and Beyond” on Tuesday, March 17, I’ll be presenting it again as a featured talk on Friday, March 20. If you are unable to attend, the talk will also be available from GTC on demand.

The complete example from this post is available on GitHub. Note: CUDA 7 does have some documented limitations for C++11 feature support.