About this document

This document is a design and coding guide for developing high performance OpenCL applications for the Intel® Xeon Phi™ coprocessor. It will take you from the Intel Xeon Phi coprocessor architecture and microarchitecture, through the key OpenCL constructs and show you how to use them efficiently to best utilize the Intel Xeon Phi coprocessor HW. Since exploiting HW parallelism is essential for performance applications, we will show you how to improve the parallelism of your OpenCL application on Intel Xeon Phi coprocessor. With this knowledge, you will be ready to design and program your application to perform best on Intel Xeon Phi coprocessor through OpenCL.

Why is this paper needed?

While OpenCL is a portable programming model, the performance portability is not guaranteed. Traditional GPUs and the Intel Xeon Phi coprocessor have different HW designs. Their differences are such that they benefit from different application optimizations. For example, traditional GPUs rely on the existence of fast shared local memory, which the programmer needs to program explicitly. Intel Xeon Phi coprocessor includes fully coherent cache hierarchy, similar to regular CPU caches, which automatically speed up memory accesses. Another example: while some traditional GPUs are based on HW scheduling of many tiny threads, Intel Xeon Phi coprocessors rely on the device OS to schedule medium size threads. These and other differences suggest that applications usually benefit from tuning to the HW they’re intended to run on.

Will I need to have different OpenCL optimizations for different devices?

Not necessarily. Will you add a small #ifdef in your code to run 50% faster on Intel Xeon Phi coprocessor? Will you duplicate a 1000-line file for that? Would you do it for only 10% speedup? Or, maybe you would prefer adding the optimization unconditionally and pay 10% slowdown on other devices for 50% improvement on Intel Xeon Phi coprocessor? It is totally your decision. In some cases, you will need to make the tradeoff between cross device performance and maintainability of your OpenCL application. We really encourage developers to explore the performance potential of the Intel Xeon Phi coprocessor, using the guidelines available in this document and then decide based on the performance numbers. This document doesn’t intend to answer all the questions, but instead give you some tools to answer them yourself.

Intel Xeon Phi coprocessor high-level HW overview

Figure 1. Intel® Xeon Phi™ Coprocessor Microarchitecture

An Intel Xeon Phi coprocessor contains many cores, each with a 512-bit vector arithmetic unit, capable of executing SIMD vector instructions. An L1 cache is included in each core (32 KB data + 32 KB instructions). An L2 cache is associated with each core (512 KB combined Data and Instr, L1 D cache is inclusive). A high-speed interconnect allows data transfer between the L2 caches and the memory subsystem. Each core can execute up to four HW threads simultaneously. This simultaneous multi-threading helps hide instruction and memory latencies. OpenCL hides most of these details from the programmer.

For further reading about the Intel Xeon Phi coprocessor HW:

http://software.intel.com/en-us/mic-developer

Key Intel Xeon Phi Coprocessor Performance Aspects

Multi-threading parallelism

Intel Xeon Phi coprocessor HW includes many cores depending on the SKU (I assume 60 in this paper). Each core is capable of running up to four HW threads. In most cases, populating the 240 threads with tasks is essential to maximize performance. The exact number of HW threads can be queried with the clGetDeviceInfor(NUM_COMPUTE_UNITS); interface.

In Core Vectorization

The vector size in the Intel Xeon Phi coprocessor is 512 bit wide SIMD. Typically, this vector represents 8 double precision floating point numbers, or 16 single precision floating point numbers. Each Intel Xeon Phi coprocessor core can issue a single vector computation instruction per cycle.

PCI Express* (PCIe) Bus Interface

The Intel Xeon Phi coprocessor resides on the PCIe bus. Transferring data over the PCIe bus has the highest latency and the lowest bandwidth. As you would do in any other PCIe device, you should reduce this traffic to a minimum.

Memory subsystem

The Intel Xeon Phi coprocessor includes three levels of memory (GDDR, L2 cache, and L1 cache). The following table includes important cache information:

L1 (Data + Instructions) Shared L2 Total Size 32 KB + 32 KB 512 KB Miss Latency 15-30 cycles 500-1000 cycles

Accessing the L1 cache involves latency of only one cycle.

Since the Intel Xeon Phi coprocessor is an in-order machine, the latency of memory accesses has significant impact on software performance. Luckily, the programmer can reduce these latencies. Prefetches are one of the tools that can help hide memory latencies. We will discuss it in more detail later.

Data Access Pattern

Accessing memory consecutively is the fastest way to access memory on the Intel Xeon Phi coprocessor. It improves cache efficiency, reduces the number of TLB (Translation Lookaside Buffer) misses, and allows the HW prefetcher to kick in.

Mapping the OpenCL constructs to Intel Xeon Phi coprocessor

Understanding how the key OpenCL constructs are implemented on the Intel Xeon Phi coprocessor will help you better design your application to take advantage of the coprocessor’s HW. It will also help you avoid the coprocessor’s performance pitfalls.

Conceptually, at initialization time, the OpenCL driver creates 240 SW threads and pins them to the HW threads (for a 60-core configuration). Then, following a clEnqueueNDRange() call, the driver schedules the work groups (WG) of the current NDRange on the 240 threads. A WG is the smallest task being scheduled on the threads. So calling clEnqueueNDRange() with less than 240 WGs, leaves the coprocessor underutilized.

The OpenCL compiler creates an optimized routine that executes a WG. This routine is built from up to three nested loops, as shown in the following pseudo code:

__Kernel ABC(…) For (int i = 0; i < get_local_size(2); i++) For (int j = 0; j < get_local_size(1); j++) For (int k = 0; k < get_local_size(0); k++) Kernel_Body;

Note that the innermost loop is used for dimension zero of the NDRange. This directly impacts the access pattern of your performance critical code. It also impacts the implicit vectorization efficiency.

The OpenCL compiler implicitly vectorizes the WG routine based on dimension zero loop, i.e., the dimension zero loop is unrolled by the vector size. So the WG code with vectorization looks like:

__Kernel ABC(…) For (int i = 0; i < get_local_size(2); i++) For (int j = 0; j < get_local_size(1); j++) For (int k = 0; k < get_local_size(0); k += VECTOR_SIZE) Vector_Kernel_Body;

The vector size of Intel Xeon Phi coprocessor is 16, regardless of the data types used in the kernel. However, in the future, we may increase the vectorization size to allow more instruction level parallelism.

Exposing algorithm parallelism

While the OpenCL specification provides various ways to express parallelism and concurrency, some of them will not map well to Intel Xeon Phi coprocessor. We will show you how the key OpenCL constructs are mapped to the coprocessor, so you can design your application to exploit its parallelism.

Multi-threading

To get good utilization of the 240 HW threads, it’s best to have more than 1000 WGs per NDRange. Having 180‒240 WGs per NDRange will provide basic threads utilization; however, the execution may suffer from poor load-balancing and high invocation overhead.

Recommendation: Have at least 1000 WGs per NDRange to optimally utilize the Intel Xeon Phi coprocessor HW threads. Applications with NDRange of 100 WGs or less will suffer from serious under-utilization of threads.

Single WG execution duration also impacts the threading efficiency. Lightweight WGs are also not recommended, as these may suffer from relatively high overheads.

Vectorization

OpenCL on Intel Xeon Phi coprocessor includes an implicit vectorization module. The OpenCL compiler automatically vectorizes the implicit WG loop over the work items in dimension zero (see example above). The vectorization width is currently 16, regardless of the data type used in the kernel. In future implementations, we may vectorize even 32 elements. As OpenCL work items are guaranteed to be independent, the OpenCL vectorizer needs no feasibility analysis to apply vectorization. However, the vectorized kernel is only used if the local size of dimension zero is greater than or equal to 16. Otherwise, the OpenCL runtime runs scalar kernel for each of the work items. If the WG size at dimension zero is not divisible by 16, then the end of the WG needs to be executed by scalar code. This isn’t an issue for large WGs, e.g., 1024 items at dimension zero, but is for WGs of size 31 on dimension zero.

Recommendation 1: Don’t manually vectorize kernels, as the OpenCL compiler is going to scalarize your code to prepare it for implicit vectorization.

Recommendation 2: Avoid using a WG size that is not divisible by 32 (16 will work for now).

Work-Item-ID nonuniform control flow

In this section, we explain the difference between uniform and nonuniform control flow, in the context of implicit vectorization. It is important to understand because uniform control flow may have small negative impacts on performance. But nonuniform control flow creates significant performance overhead within the innermost NDRange dimension. The uniformity with respect to the vectorized loop (dimension zero) matters.

A branch is uniform if it is statically guaranteed that all work items within a WG execute the same side of the branch.

Uniform branch example:

//isSimple is a kernel argument Int LID = get_local_id(0); If (isSimple == 0) Res = buff[LID];

Nonuniform branch example:

Int LID = get_local_id(0); If (LID == 0) Res = -1;

Another uniform branch example:

Int LID = get_local_id(1); //Uniform as the IF is based on dimension one, while vectorization on dimension on. If (LID == 0) Res = -1;

While vectorizing, the compiler has to linearize (flatten) any code dominated by nonuniform control flow via predication. The first and major cost of predication is the execution of both sides of the branch. Additional penalties result from the masked execution.

Assuming the following original kernel code:

Int gid = get_global_id(0); If(gid % 32 == 0) Res = HandleEdgeCase(); Else Res = HandleCommonCase(); End

After vectorization (and predication), the code looks like:

int16 gid = get16_global_id(0); uint mask; Mask = compare16int((gid % broadcast16(32)), 0) res_if = HandleEdgeCase(); res_else = HandleCommonCase(); Res = (res_if & mask) | (res_else & not(mask));

Note that both the IF and the ELSE are executed for all of the work items.

Recommendation: Avoid branches, especially those that are nonuniform on dimension zero.

Data Alignment

For various reasons, memory access that is vector-size-aligned is faster than unaligned memory access. In the Intel Xeon Phi coprocessor, OpenCL buffers are guaranteed to start on a vector-size-aligned address. However, this only guarantees that the first WG starts at an aligned address. To guarantee that all WGs start at a properly aligned location, the WG size (local size) needs to be divisible by 16, or even by 32 if you want to take advantage of potential product improvements. Calling EnqueueNDRange with local size NULL, lets the OpenCL driver choose the best WG size for you. The driver should be smart enough to choose a WG size matching the alignment requirements. However, the programmer needs to make sure that the global size is divisible by VECTOR_SIZE and the quotient is big enough to allow the runtime efficient split to WGs. “Big enough” is 1,000,000 in cases of a small kernel and 1000 in the case of a huge kernel including a 1000 iteration loop in the kernel. Also NDRange offsetting can break the alignment.

Recommendation 1: Don’t use NDrange offset. If you have to use an offset, then make it a multiple of 32, or at least a multiple of 16.

Recommendation 2: Use local size that is a multiple of 32, or at least of 16.

Design your algorithm to benefit from the Intel Xeon Phi coprocessor memory subsystem

Since Intel Xeon Phi coprocessor is an in-order machine, it is very sensitive to memory latencies. Memory-related optimizations, at the application level, can lead to 2X-4X performance speedup.

Intra WG data reuse

Designing your application to maximize the amount of data reuse from the caches is the first memory optimization to apply. However, only certain algorithms need to reuse data. For example, adding two matrices involves no opportunity to reuse any data. But multiplying two matrices (GEMM) involves significant data reuse. Therefore, it is an obvious candidate for blocking/tiling optimization. Please see more details in the Intel SDK for OpenCL Applications XE – Optimization Guide.

To benefit from data reuse, you need to take into account the WG implicit loop(s), as described earlier in this document. The programmer’s control over these loops is through the local size definition. The programmer can add additional loop(s) (explicit) in the kernel.

Cross WG data reuse

Cross-group data reuse is a greater challenge. Currently, OpenCL on Intel Xeon Phi coprocessor doesn’t allow enough control over the WGs scheduling. Therefore, cross WG data reuse is almost impossible. We will keep this section as a placeholder for future development.

Data access pattern

Consecutive data access usually allows the best memory system performance. When one considers consecutive memory access, understanding the structure of the WG implicit loops is crucial. The innermost implicit loop is the loop over dimension zero. If your kernel introduces no additional (explicit) loop, then you should try having most of your memory accesses consecutive with that implicit dimension zero loop in mind. For example:

The following code accesses the 2D buffers consecutively in memory (recommended):

__kernel ABC(…){ int ID1 = get_global_id(1); int ID0 = get_global_id(0); res[ID1][ID0] = param1 * buffer[ID1][ID0]; }

The following code doesn’t access the 2D buffers consecutively in memory (not recommended):

__kernel ABC(…){ int ID1 = get_global_id(1); int ID0 = get_global_id(0); res[ID0][ID1] = param1 * buffer[ID0][ID1]; }

The second code example scans the 2D buffers “column major.” With vectorization, it results in double faults, namely: 1) The input vector data need to be gathered along the column from 16 consecutive rows. The result is stored via scatter instructions to 16 different rows. Both operations perform slowly. 2) Memory access is not consecutive, iteration to iteration. Both of these increase the pressure on the TLB and prevent prefetching.

Simple one dimension example (recommended):

Consecutive access:

Int id = get_global_id(0); A[id]= B[id];

Non-Consecutive access (not recommended):

Int id = get_global_id(0); A[id*4] = B[id*4]

Recommendation: Use ID(0) to index memory consecutively within the row. With explicit 2D buffer: buffer[ID1][ID0]. With 2D indexing into 1D buffer: buffer[STRIDE * ID1 + ID0]

If your kernel includes an explicit loop, then you should remember that the implicit vectorization is still based on the ID(0) implicit loop. So accessing buffers through the OpenCL IDs should follow the recommendation above (buffer[ID1][ID0]). This will keep vector access consecutive and efficient. Accessing buffers through the inner loop index (idx), will be consecutive within the inner loop (buffer[ID1][idx]) and will be uniform to the vectorized loop, which is excellent! However, mixing ID0 and idx should be avoided. For example, buffer[ID0][idx] is strided to the vectorized loop, therefore will result in gather/scatter.

Data layout

Pure SOA (Structure-of-Arrays) data layout results in simple and efficient vector loads and stores. However, spatial locality is lower, the pressure on the TLB is higher, and the number of pages used simultaneously can be higher.

With AOS (Array-of-Structures) data layout, the generated vectorized kernel needs to load and store data via gather and scatter instructions, which are less efficient than simple vector load and store. However, for random access pattern, AOS layout is often more efficient than SOA because of better spatial locality. Please remember that random access of SOA data layout creates gather and scatter instructions too.

The third option is AOSOA—an array of structures of small arrays. The size of the small arrays should be 32 for the Intel Xeon Phi coprocessor. This would allow vectorization of up to 32 elements vector.

struct Point32 { float x[32], y[32], z[32]; }; __kernel void ABC(__global Point32* ptrData)

AOSOA allows efficient vectorization using simple vector loads, while not overloading the TLB, nor spreading the accesses across many pages. The problem of AOSOA is the readability of the code. Most people don’t naturally think in AOSOA terms.

Data prefetching

With the Intel Xeon Phi coprocessor being an in-order machine, data prefetching is an essential way to bring data closer to the cores, in parallel with other computations. Loads and stores are executed serially, with parallelism. For example, any two load instructions are executed entirely serially. The prefetch instruction is exceptional. It is executed in parallel to other instructions, including to other prefetch instructions. Therefore, prefetch instruction that hasn’t finished on time, can still improve the performance as this memory request executed in parallel to other instruction. A cache miss means a thread stall plus a few cycles penalty to reissue the instruction. The Intel Xeon Phi coprocessor includes a simple automatic HW prefetcher to the L2. It takes some time for the HW prefetcher to kick in, and it needs to restart on every 4 KB virtual page boundary.

Automatic SW prefetches to the L1 and L2 are inserted by the OpenCL compiler for data accessed in future iterations, whenever it figures out (through analysis) that such can be inserted and provide benefit. The beta release includes partial support for automatic SW prefetching.

Manual prefetching can be inserted by the programmer into the OpenCL kernel, via the prefetch built-in. Currently, manual prefetches are inserted exactly at the location and to the address that the programmer requested, but these are limited to L2 prefetches. In the future, the OpenCL compiler may add both L2 and L1 prefetches for the PREFETCH built-in. It may also improve the location and stride indicated by the programmer. Manual prefetches should be inserted at least 500 cycles before the data is going to be actually used. Usually only the main input and output buffers need to be prefetched.

Local memory and Barriers

While traditional GPUs include Shared Local Memory (SLM), which requires manual management, Intel Xeon Phi coprocessor includes a two-level cache system (automatic), similar to most modern CPUs. Therefore, using the OpenCL SLM provides no benefit on the Intel Xeon Phi coprocessor. Furthermore, local memory in the coprocessor is allocated on the regular GDDR memory and is supported by the cache system like any other memory. Therefore, it introduces additional overhead in terms of redundant data copy and management.

Recommendation: Avoid using Shared Local Memory on the Intel Xeon Phi coprocessor.

The Intel Xeon Phi coprocessor includes no special HW support for barriers. Therefore, barriers are emulated by OpenCL on the coprocessor. We recommend avoiding the use of barriers. Also, splitting the kernel into two separate kernels will be slower than a barrier, so we don’t recommend taking this path either.

As of the beta release, the combination of barrier and WG size nondivisible by 16, results in execution of a scalar kernel. Please avoid this combination. Currently, we don’t see a justification to optimize it within the OpenCL compiler.

Summary

While designing your OpenCL application for Intel Xeon Phi coprocessor, you should pay careful attention to the following aspects:

Include enough work groups within each NDRange—a minimum of 1000 is recommended. Avoid lightweight work groups. Don’t hesitate using the maximum local size allowed (currently 1024). Keep the WG size a multiple of 32. Avoid ID(0) dependent control flow. This allows efficient implicit vectorization. Prefer consecutive data access. Data layout preferences: AOS for sparse random access; pure SOA or AOSOA(32) otherwise. Exploit data reuse through the caches within the WG—tiling/blocking. If auto-prefetching didn’t kick in, use the PREFETCH built-in to bring the global data to the cache 500‒1000 cycles before use. Don’t use local memory. Avoid using barriers.

For further reading

The Intel SDK for OpenCL Applications XE – Optimization Guide includes many more details.

Intel Xeon Phi coprocessor performance tuning with Intel® VTune™ Amplifier XE.

Using the Intel® SDK for OpenCL™ Applications XE 2013 with the Intel® Xeon Phi™ Coprocessor