OpenCL provides many benefits in the field of high-performance computing, and one of the most important is portability. OpenCL-coded routines, called kernels, can execute on GPUs and CPUs from such popular manufacturers as Intel, AMD, Nvidia, and IBM. New OpenCL-capable devices appear regularly, and efforts are underway to port OpenCL to embedded devices, digital signal processors, and field-programmable gate arrays.

Not only can OpenCL kernels run on different types of devices, but a single application can dispatch kernels to multiple devices at once. For example, if your computer contains an AMD Fusion processor and an AMD graphics card, you can synchronize kernels running on both devices and share data between them. OpenCL kernels can even be used to accelerate OpenGL or Direct3D processing.

Despite these advantages, OpenCL has one significant drawback: it's not easy to learn. OpenCL isn't derived from MPI or PVM or any other distributed computing framework. Its overall operation resembles that of Nvidia's CUDA, but OpenCL's data structures and functions are unique. Even the most introductory application is difficult for a newcomer to grasp. You really can't just dip your foot in the pool: you either know OpenCL or you don't.

My goal in writing this article is to explain the concepts behind OpenCL as simply as I can and show how these concepts are implemented in code. I'll explain how host applications work and then show how kernels execute on a device. Finally, I'll walk through an example application with a kernel that adds 64 floating-point values together.

Host Application Development

In developing an OpenCL project, the first step is to code the host application. This runs on a user's computer (the host) and dispatches kernels to connected devices. The host application can be coded in C or C++, and every host application requires five data structures: cl_device_id, cl_kernel, cl_program, cl_command_queue, and cl_context.

When I started learning OpenCL, I found it hard to remember these structures and how they work together, so I devised an analogy: An OpenCL host application is like a game of cards.

A Game of Cards

In a card game, a dealer sits at a table with one or more players and distributes cards from a deck. Each player receives these cards as part of a hand and then analyzes how best to play. The players can't interact with one another or see another player's cards, but they can make requests to the dealer for additional cards or a change in stakes. The dealer handles these requests and takes control once the game is over. Figure 1 illustrates this analogy.

In addition to the dealer and the players, Figure 1 also depicts the table that supports the game. The players seated at the table don't have to take part, but only those seated at the table can participate in the game.

The Five Data Structures

In my analogy, the card dealer represents the host. The other aspects of the game correspond to the five OpenCL data structures that must be created and configured in a host application:

Device: OpenCL devices correspond to the players. Just as a player receives cards from the dealer, a device receives kernels from the host. In code, a device is represented by a cl_device_id.

Kernel: OpenCL kernels correspond to the cards. A host application distributes kernels to devices in much the same way a dealer distributes cards to players. In code, a kernel is represented by a cl_kernel.

Program: An OpenCL program is like a deck of cards. In the same way that a dealer selects cards from a deck, the host selects kernels from a program. In code, a program is represented by a cl_program.

Command queue: An OpenCL command queue is like a player's hand. Each player receives cards as part of a hand, and each device receives kernels through a command queue. In code, a command queue is represented by a cl_command_queue.

Context: OpenCL contexts correspond to card tables. Just as a card table makes it possible for players to transfer cards to one another, an OpenCL context allows devices to receive kernels and transfer data. In code, a context is represented by a cl_context.

To clarify this analogy, Figure 2 shows how these five data structures work together in a host application. As shown, a program contains multiple functions, and each kernel encapsulates a function taken from the program.

Once you understand how host applications work, learning to write the code comes easily. Most of the functions in the OpenCL API have descriptive names like clCreateCommandQueue and clGetDeviceInfo. Given the analogy, it should be clear that clCreateKernel requires a cl_program structure, and that clCreateContext requires one or more cl_device_id structures.

Shortcomings of the Analogy

My analogy has its flaws. Six significant shortcomings stand out:

1. It makes no mention of platforms. A platform is a data structure that identifies a vendor's implementation of OpenCL. Platforms make it possible to access devices; for example, you can access an Nvidia device through the Nvidia platform.

2. A card dealer doesn't choose which players sit at the table, but an OpenCL host selects which devices should be placed in a context.

3. A card dealer can't deal the same card to multiple players, but an OpenCL host can dispatch the same kernel to multiple devices through their command queues.

4. The analogy doesn't describe how devices execute kernels. Many OpenCL devices contain multiple processing elements, and each element may process a subset of the input data. The host identifies the number of work items that should be generated to execute the kernel.

5. In a card game, the dealer distributes cards to players, and each player arranges the cards to form a hand. In OpenCL, the host creates a command queue for each device and enqueues commands. One type of command tells a device to execute a kernel.

6. In a card game, the dealer passes cards in round-robin fashion. OpenCL sets no constraints on how host applications distribute kernels to devices.

At this point, you should understand that a large part of a host application's job involves creating kernels and deploying them to OpenCL-compliant devices such as GPUs, CPUs, or hybrid processors. Next, I'll discuss how these kernels execute on the devices.