OpenACC is a high-level programming model for accelerators, such as NVIDIA GPUs, that allows programmers to accelerate applications using compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded to an accelerator. Through the use of compiler directives, OpenACC allows programmers to maintain a single source code for the CPU and GPU that is portable across a range of accelerators and operating systems. In the past we featured an introductory series of posts on OpenACC as well as several CUDACasts videos—click here to find them.

OpenACC version 2.0 was ratified last year and is now available in multiple commercial compilers, so now is a good time to discuss the new features of OpenACC 2.0.

Function calls within compute regions

OpenACC 1.0 compilers rely on inlining function and subroutine calls within compute regions. This means that unless the compiler can automatically inline a function call, the programmer must manually inline the function. This limitation proved to be difficult for applications, so OpenACC 2.0 introduces the acc routine directive, which instructs the compiler to build a device version of the function or subroutine so that it may be called from a device region. For readers already familiar with CUDA programming, this functionality is similar to the __device__ function specifier. To guide optimization, you can use clauses to tell the compiler whether the routine should be built for gang, worker, vector, or seq (sequential) level parallelism. You can specify multiple clauses for routines that may be called at multiple levels of parallelism.

#pragma acc routine vector void foo(float* v, int i, int n) { #pragma acc loop vector for ( int j=0; j<n; ++j) { v[i*n+j] = 1.0f/(i*j); } } #pragma acc parallel loop for ( int i=0; i<n; ++i) { foo(v,i); //call on the device }

In the above C/C++ example, we have specified that the foo routine may be called from the device and that the loop contained within the function contains vector-level parallelism. Note that in C/C++ you must apply the acc routine directive on both the function declaration and implementation, if they are not the same. The following example shows how to use acc routine from Fortran.

subroutine foo(v, i, n) { !$acc routine vector real :: v(:,:) integer :: i, n !$acc loop vector do j=1,n v(i,j) = 1.0/(i*j) enddo end subroutine !$acc parallel loop do i=1,n call foo(v,i,n) enddo !$acc end parallel loop

Unstructured Data Regions

Data regions provide explicit control over how long data remains on the GPU so that it can be shared between multiple OpenACC compute regions. Since frequent PCI-e data transfers can limit application performance, data regions are important for getting the best possible performance from an OpenACC application on a GPU. In OpenACC 1.0, data regions are always structured, meaning that they begin and end in the same scope, as in the following example.

int main(int argc, char **argv) { double a[100]; #pragma acc data create(a) { function_with_openacc(a); } return 0; }

For many applications, particularly those with C++ classes, there isn’t a natural place to add a data region that expresses the lifetime of the data. OpenACC 2.0 adds unstructured data lifetimes to better handle these situations. The acc enter data and acc exit data directives define where an unstructured data region begins and ends. The following example creates an unstructured data region in a C++ class constructor and deletes it in the matching C++ class destructor. This pattern was impossible in OpenACC 1.0, since the class data is not visible outside of the class instance.

class Matrix { Matrix(int n) { len = n; v = new double[len]; #pragma acc enter data create(v[0:len]) } ~Matrix() { #pragma acc exit data delete(v[0:len]) delete[] v; } private: double* v; int len; };

Mark Ebersole talked about unstructured data lifetimes in a recent CUDACast video, included here.

Unstructured data lifetimes are not limited to C++ classes. Fortran programmers may find unstructured data regions convenient for handling module data that is allocated and deallocated in module initialization and finalization subroutines, as in the following.

module mymod real, allocatable :: a(:) contains subroutine mymod_init allocate(a(100)) !$acc enter data create(a) end subroutine mymod_init subroutine mymod_finalize !$acc exit data delete(a) deallocate(a) end subroutine mymod_finalize end module mymod

Extended Runtime API

OpenACC 2.0 also enhances data management to include run-time library routines for cases where it is difficult to express data management needs using directives alone. A few of the particularly useful routines include acc_map_data and acc_unmap_data , which provide a way to map an existing device memory allocation (e.g. from CUDA or OpenCL) to an OpenACC variable. Similarly, acc_deviceptr and acc_hostptr provide a way to directly access a host or device copy of a variable, respectively. You can explicitly copy data to and from the accelerator device with acc_memcpy_to_device and acc_memcpy_from_device .

OpenACC 2.0 also recommends some additional device-specific routines that make interoperating with other programming models, such as CUDA or OpenCL, even easier. For example, the acc_get_cuda_stream and acc_set_cuda_stream functions provide a way for OpenACC and CUDA kernels to work within the same CUDA streams.

For a complete list of new API routines, see section 1.7 of the OpenACC specification document.

Nested OpenACC regions

OpenACC 2.0 expands the way programmers can express parallelism with the addition of nested OpenACC regions. In OpenACC 1.0 a parallel or kernels region could not contain another parallel , kernels , or data region. While most algorithms are easily expressed in this manner, some algorithms benefit from being able to launch additional parallelism dynamically. OpenACC 2.0 allows nesting regions within regions to express these types of nested parallelism in a more natural way. This feature is much like CUDA Dynamic Parallelism, introduced in CUDA 5.0 on Kepler (sm_35 and newer) GPU devices.

OpenACC atomics

The OpenACC atomic directive ensures that a particular variable is accessed and/or updated atomically (no two gangs, workers, or vector threads access the location simultaneously) to prevent indeterminate results and race conditions. In other words, it prevents one thread from stepping on the toes of other threads due to accessing a variable simultaneously, resulting in different results run-to-run. For example, if I want to count the number of elements that have a value greater than zero, I could write the following.

if ( val > 0 ) { #pragma acc atomic capture { cnt++; } }

Without the atomic directive, it would be possible for multiple threads to try to increment the value of cnt simultaneously, which will likely result in an incorrect final count. The atomic directive ensures that no two threads access cnt at the same time, resulting in a correct final count. The OpenACC specification lists the types of operations that can be included in an atomic region.

Support for multiple device types

Since performance portability is one of the primary goals of OpenACC, it’s important to provide a way for programmers to optimize their directives for all accelerator platforms of interest. With OpenACC 1.0, preprocessor directives were required if the programmer wished to tune directives for particular targets. In addition to making the code more difficult to maintain, due to duplication of directives, this means that it’s impossible to support multiple device types in the same executable. OpenACC 2.0 allows certain clauses to be provided specifically for particular architectures. In the example below, the programmer has chosen to specify a different vector_length depending on the type of device being used.

#pragma acc parallel loop \ device_type(nvidia) vector_length(256) \ device_type(radeon) vector_length(512) \ vector_length(64) for ( int i=0; i<n; ++i) { c[i] = a[i] + b[i]; }

In this hypothetical example, if the loop is being built for an NVIDIA device, the compiler uses a vector length of 256; if it’s being built for a Radeon device the compiler uses a vector length of 512; and for any other accelerator it uses a vector length of be 64. This specialization doesn’t just apply at compile time; if the executable is built for multiple targets, the appropriate target and vector length will be selected at run time. Not all clauses can be specialized by device type, but clauses related to device-specific optimizations are supported.

Tile Clause

The final new OpenACC 2.0 feature that I’d like to highlight is the addition of a tile clause to the existing acc loop directive. The tile clause gives the programmer a way to express locality within a loop nest, guiding the compiler to additional optimization options. Consider the following matrix transpose example.

#pragma acc parallel loop private(i,j) tile(8,8) for(i=0; i<rows; i++) { for(j=0; j<cols; j++) { out[i*rows + j] = in[j*cols + i]; } }

By adding the tile(8,8) clause to this parallel loop, I’ve instructed the compiler to automatically introduce two additional loops that work on an 8×8 chunk (tile) of the matrix before moving on to the next chunk. This gives the compiler the opportunity to optimize within the tile in hopes of achieving better performance. While a matrix transpose does not have a lot of data reuse, other algorithms may see a significant performance improvement by exploiting both data locality and data reuse within the available loops.

Learn More

This is just a sample of some of the most significant new capabilities in OpenACC 2.0. If you’re interested in an exhaustive list of new features, section 1.7 of the OpenACC 2.0 specification lists everything that changed since 1.0. Our CUDACasts series will also be highlighting OpenACC 2.0, so stay tuned for more of OpenACC 2.0 in action. Finally, come out to GTC14 or tune in to GTC On Demand to learn more about OpenACC.