This article is for you if you don’t want to read the whole new specification [PDF] for OpenCL 1.2.

As always, feedback will be much appreciated.

After many meetings between the many members of the OpenCL task force, a lot of ideas sprouted. Every 17 or 18 months a new version of OpenCL comes out to give form to all these ideas. Some are totally new ideas that a member has already brought to market in another product. Others do not appear at all, because other members voted against them. That last category is very interesting, and hopefully we’ll soon see a lot of forum discussion about what is missing now and should be in the next version.

With the release of 1.2 it was also announced that (at least) two task forces will be set up. One of them will target integration in high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to go for OpenCL 2.0. In a follow-up I will discuss these phases and what you as a user, programmer or customer can expect… and how you can act on it.

Another big announcement was that Altera is starting to support OpenCL for an FPGA product. In another article I will let you know everything there is to know. For now, let’s concentrate on the actual differences in this version software-wise, and what you can do with them. I have added links to the 1.1 and 1.2 man-pages, so you can look things up.

New Kernel-functions

The most rudimentary debug tool, printf, used to require a vendor-specific extension, but now you can flood the standard output without it. For those who have not tried printf yet: take a global size of 1000, let the CPU print “ping\n” and the kernels “pong\n”. Then you’ll know exactly why you need to be careful with this function.

The function popcount returns the number of ones in the binary representation of a value. So if x is 5 (binary 101), then popcount(x) is 2. A nice explanation of fast popcount on SSE is here. It counts bits regardless of what they represent, so the sign bit is counted too.
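To make the counting concrete, here is a minimal host-side sketch in plain C that mirrors what the built-in does for a 32-bit value. The name popcount32 is hypothetical and only for illustration; it is not the OpenCL built-in itself, and a real device will most likely use a hardware instruction instead of a loop.

```c
#include <stdint.h>

/* Hypothetical host-side reference: counts the set bits of a 32-bit
   value, the way the OpenCL C built-in popcount does per element. */
unsigned popcount32(uint32_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit */
        ++count;
    }
    return count;
}
```

Calling popcount32(5) gives 2, and popcount32((uint32_t)-1) gives 32, because the sign bit of -1 is counted like any other bit.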

Replaced functions

The OpenCL group prefers to change the name of functions when the parameter-list changes. Below you’ll find the “new” functions I encountered.

clEnqueueMarker, clEnqueueBarrier and clEnqueueWaitForEvents have been merged into clEnqueueMarkerWithWaitList and clEnqueueBarrierWithWaitList. The barrier and marker functionality is still the same, but if a non-NULL wait list is given, the command waits for just the events in that list instead of for everything enqueued before it. This was tricky to program before. A new option is that both functions can return an event, which fires once everything they wait for has completed.

clCreateImage2D and clCreateImage3D have been merged into clCreateImage. clCreateFromGLTexture2D and clCreateFromGLTexture3D have been merged into clCreateFromGLTexture. As the functions were comparable and the parameter texture_target handles the differences, not much has changed. What is new (and a major reason for merging these functions) is the addition of 1D images and support for image arrays (see below for an explanation of how they work). 1D images were introduced to be compatible with OpenGL 1D textures.

Mem-flags CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY and CL_MEM_HOST_NO_ACCESS have been added to describe how the host can access the object, where 1.1 only described how the device could access the object and whether the memory was allocated at the device or the host.

clUnloadCompiler got renamed to clUnloadPlatformCompiler, and clGetExtensionFunctionAddress to clGetExtensionFunctionAddressForPlatform; both now require a valid platform reference. This seems logical, as clUnloadCompiler probably removed the compilers of all platforms, and the function address seems to have been unspecified when more than one platform was loaded.

DirectX

Besides the fancy 1D images, support for DirectX 9 and 11 textures has also been added. DX9 is an interesting choice, but this way such software can be given a longer life by adding OpenCL to speed it up. I still disagree with giving it official KHR status, as it only works on Microsoft platforms. Under Linux (and all its derivatives like Android) and OSX it is not supported.

The new functions clCreateFromDX9MediaSurfaceKHR, clEnqueueAcquireDX9MediaSurfacesKHR and clEnqueueReleaseDX9MediaSurfacesKHR are comparable to clCreateFromD3D10Texture2DKHR, clEnqueueAcquireD3D10ObjectsKHR and clEnqueueReleaseD3D10ObjectsKHR. clCreateFromD3D11BufferKHR, clCreateFromD3D11Texture2DKHR, clCreateFromD3D11Texture3DKHR, clEnqueueAcquireD3D11ObjectsKHR and clEnqueueReleaseD3D11ObjectsKHR are like their D3D10-counterparts.

Sharing like cl_khr_d3d10_sharing for DX9 and 11 is enabled with cl_khr_dx9_media_sharing and cl_khr_d3d11_sharing. The counterparts of clGetDeviceIDsFromD3D10KHR are clGetDeviceIDsFromD3D11KHR and clGetDeviceIDsFromDX9MediaAdapterKHR.

Multi-user and Multi-device

As OpenCL devices get more powerful, it becomes more likely that a single device will be shared between users or tasks. It is also getting more common to have multiple GPUs in a system, and to have several capable devices now that CPUs get better support.

With multiple devices, clEnqueueMigrateMemObjects helps to move memory objects from one device to another; previously this had to be done by copying via the host.

clCreateSubDevices partitions a device into sub-devices. It can be partitioned in equal parts, in specified sizes, or depending on specific hardware. The last option can split the device based on e.g. the cache hierarchy, so that the compute units in each sub-device share a cache at the given level. The functions clRetainDevice and clReleaseDevice have been added to manage the lifetime of sub-devices. Previously this functionality was available through the extension cl_ext_device_fission.
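The arithmetic behind the "equal parts" option can be sketched in plain C. This is only a model under my reading of the 1.2 specification, namely that the device is split into as many parts of the requested size as fit and that leftover compute units are simply not used. The helper names are hypothetical, and no OpenCL call is made here.

```c
/* Hypothetical model of equal partitioning: how many sub-devices
   result when `total_units` compute units are split into parts of
   `units_per_part` each. */
unsigned equal_partition_count(unsigned total_units, unsigned units_per_part)
{
    return units_per_part == 0 ? 0 : total_units / units_per_part;
}

/* Compute units that fit in no part and therefore stay unused
   (assumption based on my reading of the specification). */
unsigned unused_units(unsigned total_units, unsigned units_per_part)
{
    return units_per_part == 0 ? total_units : total_units % units_per_part;
}
```

For a device with 10 compute units and parts of 4, this model gives 2 sub-devices with 2 compute units left idle, which is worth checking before choosing a partition size.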

Initialisation of data

clEnqueueFillBuffer and clEnqueueFillImage help with initialising data by filling it with a pattern or a colour. Previously this was best done at the host, or with a kernel specially written for it, or it was just ignored. Our lives have improved.
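What "filling with a pattern" means can be shown with a small host-side model in plain C. This is a sketch of the semantics only, not the OpenCL call: the function name fill_with_pattern is hypothetical, and the requirement that the size is a multiple of the pattern size is taken from the clEnqueueFillBuffer documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side model of a pattern fill: repeats `pattern`
   (pattern_size bytes) across the first `size` bytes of `dst`.
   As with clEnqueueFillBuffer, size must be a multiple of
   pattern_size. */
void fill_with_pattern(void *dst, size_t size,
                       const void *pattern, size_t pattern_size)
{
    assert(pattern_size > 0 && size % pattern_size == 0);
    unsigned char *out = dst;
    for (size_t offset = 0; offset < size; offset += pattern_size) {
        memcpy(out + offset, pattern, pattern_size);
    }
}
```

Filling an array of floats with 1.0f then comes down to passing the address of a single float as the pattern and sizeof(float) as the pattern size.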

Building

It seems that more effort is being put into making sure the kernels are better protected. The work of clBuildProgram can now be split between clCompileProgram and clLinkProgram. If I understand correctly, the result is comparable to how clCreateProgramWithBinary works, as that takes compiled binaries.

clGetProgramInfo and clGetProgramBuildInfo have been extended to give information on how the program has been built. The new function clGetKernelArgInfo returns the requested information on the arguments of a kernel. This is useful when building the kernels is separated from the software that uses them, as is the case when binaries are used.

Image arrays

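An array of 1D or 2D images can be written with write_image{f|i|ui|h}. The image ID is given by the y (1D) or z (2D) value of the coordinate. With read_image{f|i|ui|h} you need to specify the coordinates plus the image number in the same way: a 2-component coordinate for an array of 1D images, and a 4-component coordinate (of which w is ignored) for an array of 2D images.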

The kernel-function get_image_array_size returns the number of images in an array. It is the responsibility of the software to keep things in order, as it does not give an array of image-numbers.
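How the extra coordinate selects an image can be modelled on the host in plain C. The sketch below is purely illustrative: the struct ImageArray2D, the function texel_at and the layer-after-layer memory layout are all assumptions of mine, as the real layout of an image array on a device is up to the implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model of an array of 2D images with one
   float per texel, stored image after image. */
typedef struct {
    size_t width, height;
    size_t array_size;   /* number of images, cf. get_image_array_size */
    float *texels;
} ImageArray2D;

/* Returns a pointer to texel (x, y) of image number `layer`,
   mirroring how z picks the image in a 2D image array. */
float *texel_at(ImageArray2D *img, size_t x, size_t y, size_t layer)
{
    assert(x < img->width && y < img->height && layer < img->array_size);
    return &img->texels[(layer * img->height + y) * img->width + x];
}
```

The bounds check on layer is the part the software has to take care of itself: nothing in the image array tells you which image numbers are meaningful, only how many there are.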

Other

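Macros CL_VERSION_1_2 and __OPENCL_C_VERSION__ have been added. In the host header the first is defined as 1 when version 1.2 is supported; when it is not supported the macro is simply undefined, which a preprocessor #if treats as 0. The latter is available in kernel code and gives 120 for version 1.2.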
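A compile-time check on the host side could look like the sketch below. The function name built_against_opencl_1_2 is hypothetical. The snippet deliberately includes no OpenCL header, so as written it takes the fallback branch; in real code you would include the OpenCL header first.

```c
/* Hypothetical helper: reports whether this translation unit was
   compiled against OpenCL headers that know version 1.2.
   CL_VERSION_1_2 only exists after a 1.2-capable OpenCL header
   has been included. */
int built_against_opencl_1_2(void)
{
#ifdef CL_VERSION_1_2
    return 1;   /* 1.2 entry points are declared */
#else
    return 0;   /* older headers, or no OpenCL header included */
#endif
}
```

Keep in mind that this only says something about the headers used at compile time. Whether a device actually supports 1.2 still has to be queried at runtime, for instance through CL_DEVICE_VERSION.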

Double precision is now an optional core feature instead of an extension. That means you still need to check whether the device supports it, but you no longer need a pragma to enable it.

CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE has been deprecated. It gives the smallest alignment in bytes which can be used for any data type. It is quite comparable to CL_DEVICE_MEM_BASE_ADDR_ALIGN. This could help select the best device for an alignment-optimised kernel, but is rarely used.

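A new flag CL_MAP_WRITE_INVALIDATE_REGION has been added to cl_map_flags. This is comparable to CL_MAP_WRITE, but without the guarantee that the mapped pointer holds the current contents of the region. That suits the case where the host is going to overwrite the whole region anyway.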

Storage class specifiers extern and static are now supported. A storage class determines the lifetime and linkage of a variable (C definition here).

Video

Tim Mattson of Intel explains some of the highlights of OpenCL 1.2 in this 12-minute video.