I have previously explained the basics of OpenACC in two earlier articles. As a recap, OpenACC is an emerging interface for easily writing simple code that executes on the GPU. It follows the OpenMP model of using pragmas, which are translated behind the scenes, transparently to the programmer.

OpenACC conditional data clauses such as present_or_copy (abbreviated as pcopy) and present_or_create (abbreviated as pcreate) enable programmers to make multiple function and library calls that can run on any OpenACC device while performing a minimum number of expensive dynamic-memory and data-transfer operations. They are an alternative to device-specific optimized libraries or existing CUDA and OpenCL application frameworks, which can be called using regions of OpenACC memory allocated with acc_malloc() or identified by use_device and deviceptr clauses. The tutorial examples that follow demonstrate how these OpenACC features can be combined with conditional runtime execution based on device type to build device-independent application binaries that run efficiently on both GPU and non-GPU hardware without recompilation. In other words, pragma-based OpenACC can act as a high-level, massively parallel programming platform that creates transparent, yet efficient, scientific and commercial applications for both CPUs and GPUs from a single source tree. It does not matter whether the preferred source language is C, Fortran, or some mix of languages. The previous article in this series provided a general introduction to OpenACC parallel regions and the OpenACC execution model.

OpenACC Conditional Data Clauses

By definition, library functions bundle some functionality so it can be reused in different application contexts. A consistent calling syntax, or API, is used to pass data between the calling method and the function. A unique challenge facing the authors of libraries, subroutines, and functions for accelerators is the specification of data location (whether the data is present on a device). Unlike conventional computer languages like C and Fortran, the pragma-based OpenACC syntax cannot pass information about a variable (such as its presence on a device) via an API. Instead, OpenACC requires that the programmer define data location via pragmas. For reusable code, the OpenACC specification defines a set of conditional data clauses that only perform expensive I/O and memory allocation operations when the data is not already present on the device. These conditional data clauses start with present_or_, which can be abbreviated with a "p" at the beginning of the clause. The following are the conditional data clauses:

present_or_copy(list) or pcopy(list): tells the runtime to check whether each of the variables or arrays on the list is already present in the accelerator memory. If present, the accelerator data is used. If not, memory is allocated on the accelerator at region entry and the data is copied to the accelerator. Data is copied back to the host at region exit.

present_or_copyin(list) or pcopyin(list): tells the runtime to check whether each of the variables or arrays on the list is already present in the accelerator memory. If present, the accelerator data is used. If not, memory is allocated on the accelerator and the data is copied to the accelerator at region entry.

present_or_copyout(list) or pcopyout(list): tells the runtime to check whether each of the variables or arrays on the list is already present in the accelerator memory. If present, the accelerator data is used. If not, memory is allocated on the accelerator at region entry and the data is copied to the host at region exit.

present_or_create(list) or pcreate(list): tells the runtime to check whether each of the variables or arrays on the list is already present in the accelerator memory. If present, the accelerator data is used. If not, memory is allocated on the accelerator at region entry.

While the OpenACC programmer specifies variable names to the pragmas, the OpenACC runtime uses the host data address of the variable to find the device location in a scope-dependent lookup table. Basing the runtime check on the host address means that conditional data clauses will work correctly regardless of how variables are named inside the function. OpenACC places no limitation on how the host processor uses or reuses memory, which means that a lookup by host address can only work correctly when the host and device addresses are guaranteed to be related. In other words, the author of the code that calls the OpenACC library function must do two things to benefit from the use of conditional data clauses in the underlying function:

Create a context, say, with an OpenACC data pragma, that ensures all required variables and arrays are present on the device.

Call the library function within the scope of that context. (Note: the OpenACC runtime will generate an error when the data is only partially present on the device; for example, if only a slice of one of the arrays is on the device.)

The function doMult() in the following code sample is called within the context of the OpenACC data pragma (line 2), which ensures that the A and B matrices are present within the scope of the for loop.

// Ensure the A and B matrices are present on the device
#pragma acc data pcopyin(A[0:size][0:size],B[0:size][0:size])
for(int i=0; i < nIter; i++) {
   double startTime = omp_get_wtime();
   doMult(size,A,B,C);
   double endTime = omp_get_wtime();
   printf("%s runtime %8.5g\n",argv[0], (endTime-startTime));
}

Viewing the timeline with the NVIDIA Visual Profiler shows that the pcopyin() and pcopyout() clauses work correctly: the A and B matrices are only transferred to the GPU prior to the first call to doMult(), while the results in matrix C are transferred from the device to the host after each call.

[Figure 1]

The next code snippet illustrates how easy it is to forget the needed OpenACC data pragma before the for loop. Most programmers have spent years working with code that looks like this example, which makes the omission of the pragma very easy to miss. In contrast with the previous code snippet, the call to doMult() in this example will copy matrices A and B from the host to the device on every loop iteration, even though the doMult() function uses conditional data clauses.

// this will work but not take advantage of the OpenACC conditionals
for(int i=0; i < nIter; i++) {
   double startTime = omp_get_wtime();
   doMult(size,A,B,C);
   double endTime = omp_get_wtime();
   printf("%s runtime %8.5g\n",argv[0], (endTime-startTime));
}

As the timeline in Figure 2 shows, the good news is that data is correctly supplied to the doMult() function on the device. The bad news is that it is very easy to miss such performance inefficiencies, especially in legacy code. The "lessons learned" message is clear: always check the runtime behavior of any new code with the NVIDIA Visual Profiler (nvvp), just to see if there are any unexpected data transfers.

[Figure 2]

At the risk of being pedantic, it is easy to forget that reusable OpenACC code (such as library code) is not an independent package where the programmer's only concern is to correctly pass variables through the API. From a performance point of view, it is the responsibility of the author of the OpenACC calling code, not the author of the reusable code, to make certain that the required data is present on the device. The fail-safe built into the OpenACC conditional data clauses is that the program will still work correctly, but the omission can be disastrous from a performance point of view.

The benefits of conditional data clauses are:

When used appropriately, a minimal number of data transfers and dynamic-memory operations will occur.

Lookup tables are fast, so the use of conditional data clauses should not have any noticeable performance impact. For this reason, conditional data clauses are recommended for use throughout your OpenACC code, subject to the concerns listed below.

The concerns are:

The OpenACC runtime checks only whether memory is present; it does not check whether the data is valid or is the latest version. It is extremely important that the OpenACC programmer verify that no conditional operation or arbitrary calling sequence will have unintended side effects. Otherwise, the application might crash, produce incorrect results, or suffer from occasionally poor performance.

Utilizing a runtime check for data presence (as opposed to a compile-time check through type casting or other methods) means that more time must be spent verifying application correctness and exercising all conditional branches in the code. Even so, some errors and performance issues can remain hidden, only to be discovered later by users.

A performance "gotcha", especially when calling a third-party library method, is forgetting that the calling code must ensure that the data is present on the device when the library function is called. While the code will run correctly, no performance benefit will be realized from the use of conditional data operations.

The following code refactors the simple matrix multiplication from the first article in this series to work with dynamic 2D arrays and to use OpenACC conditional data clauses.

/* simpleMult.c for OpenACC and OpenMP */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// A simple square matrix multiply using conditional data clauses
void doMult(int size,
            float (* restrict A)[size],
            float (* restrict B)[size],
            float (* restrict C)[size])
{
  // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
}

void fill(int size, float (* restrict A)[size], float (* restrict B)[size])
{
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      A[i][j] = ((float)i + j);
      B[i][j] = ((float)i - j);
    }
  }
}

int main(int argc, char *argv[])
{
  if(argc != 3) {
    fprintf(stderr,"Use: %s size nIter\n",argv[0]);
    return -1;
  }
  int size=atoi(argv[1]);
  int nIter=atoi(argv[2]);
  if(nIter <= 0) {
    fprintf(stderr,"%s: Invalid nIter (%d)\n",argv[0],nIter);
    return -1;
  }

  // allocate the square matrices
  float (*restrict A)[size] = malloc(sizeof(float)*size*size);
  float (*restrict B)[size] = malloc(sizeof(float)*size*size);
  float (*restrict C)[size] = malloc(sizeof(float)*size*size);

  fill(size,A,B);

  // Ensure the A and B matrices are present on the device
#pragma acc data pcopyin(A[0:size][0:size],B[0:size][0:size])
  for(int i=0; i < nIter; i++) {
    double startTime = omp_get_wtime();
    doMult(size,A,B,C);
    double endTime = omp_get_wtime();
    printf("%s runtime %8.5g\n",argv[0], (endTime-startTime));
  }

  free(A); free(B); free(C);
  return 0;
}