DISCLAIMER: This article was migrated from the old blog and may therefore contain formatting and content differences compared to the original post. Additionally, it likely contains technical inaccuracies, opinions that I may no longer align with, and most certainly poor use of English (I was young and foolish :)). This article remains public for those who may find it useful despite its flaws.

There are currently several ways to feed data to the GPU, regardless of which API we use or what type of application we develop. In OpenGL we have uniform buffers, texture buffers, texture images, and so on. The same is true for OpenCL and other compute APIs, which provide even more fine-grained memory management by exposing the local data store (LDS) available on today’s hardware. In this article I’ll present the memory access performance characteristics of AMD’s Evergreen-class GPUs, focusing on what they mean from an OpenGL point of view. While most of the data is about the HD5870, the general principles and relative performance characteristics are valid for other GPUs, including ones from other vendors.

Introduction

Traditional CPU-based applications don’t have to worry too much about where they put their data, as they have a simple set of options: registers and global memory (accessed through a series of linear caches called L1, L2 and, on newer architectures, L3). While even this can be quite cumbersome to use efficiently, GPU-based algorithms need more investigation still, as GPUs are built on a more complex multi-level memory design.

Typical questions an OpenGL graphics developer could ask nowadays are:

Where should I put my per-object data?

From where should I source animation data?

Should I use uniform buffers, texture buffers or vertex buffers for my per-instance data?

What does it mean from a performance point of view if I use read-write buffers or textures?

Of course, the list could go on. Answering the individual questions is not easy and often requires performance measurements to confirm our suspicions. Instead of trying to answer all of these questions one by one, it is easier to look at the actual hardware performance characteristics and resolve the individual issues based on that.

I’ve already touched on the topic in the article Uniform Buffers VS Texture Buffers, where I presented the key differences between the two data access methods and a few examples of when to use one or the other. In this article I’ll go further and try to provide more accurate data about how various memory access methods perform in practice.

In the past there was little to no detailed information about the actual performance of API-level memory access methods. Fortunately, the increasing popularity of OpenCL has led vendors to publish more technical details about the architecture and performance of their products so that software developers can fully leverage the power of today’s GPUs. While these documents focus on OpenCL or other compute APIs, most of the data applies indirectly to OpenGL as well.

The Evergreen architecture

In order to provide some actual performance data, I’ve selected AMD’s Evergreen architecture as the reference and the Radeon HD5870 as the target hardware. Note that most of the presented details roughly apply to all other modern GPUs, including NVIDIA’s Fermi architecture. Wherever there is a clear difference between the two, I’ll try to point it out. However, I cannot be 100% sure what these differences are, as ATI’s OpenCL programming guide is somewhat more talkative about actual performance details than NVIDIA’s.

OpenCL Platform Model

From the point of view of the OpenCL platform model, the Radeon HD5870 is structured in the following way:

Total of 20 compute units.

Each compute unit consists of 16 stream cores.

Each stream core consists of 5 processing elements (4 traditional, 1 transcendental).

This sums up to a total of 1600 processing elements on the Radeon HD5870.
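The total above is simply the product of the three levels of the hierarchy. A minimal sketch of that arithmetic follows; the constant names are my own and do not come from any API.

```python
# Radeon HD5870 hierarchy as listed above (names are illustrative only).
COMPUTE_UNITS = 20
STREAM_CORES_PER_CU = 16
PROCESSING_ELEMENTS_PER_STREAM_CORE = 5  # 4 traditional + 1 transcendental

stream_cores = COMPUTE_UNITS * STREAM_CORES_PER_CU
processing_elements = stream_cores * PROCESSING_ELEMENTS_PER_STREAM_CORE
# One transcendental unit per stream core.
transcendental_units = stream_cores

print(stream_cores)          # 320
print(processing_elements)   # 1600
print(transcendental_units)  # 320
```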

The basic OpenCL architecture applies in the same way to NVIDIA GPUs; however, there are differences between AMD’s and NVIDIA’s GPU architectures. AMD has used a special super-scalar architecture since the HD2000 series that allows it to execute 5 separate instructions in each stream core.

ATI super-scalar architecture consisting of one transcendental unit (left), four traditional units and a dedicated branch execution unit (right).

What this already tells us from an OpenGL point of view is that AMD’s architecture groups together 16 stream cores, so fragment shaders most probably run in sync on 4×4 tiles of fragments. This matters, for example, when we use heavy dynamic branching in shaders: if the branch selection is not coherent across the fragment neighborhood, performance can drop because the hardware masks out those processing elements that did not select the branch currently being executed.
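The cost of incoherent branching can be illustrated with a toy model. The sketch below assumes that a group of lanes running in lockstep executes each side of a branch once if at least one lane needs it, with the other lanes masked out. The function name, the cycle counts and the 16-lane group size are illustrative assumptions, not measured hardware behavior.

```python
def divergent_cost(lane_takes_branch, cost_if, cost_else):
    """Cycles spent by one group of lanes running in lockstep.

    Simplified model (an assumption, not measured hardware behavior):
    each side of the branch is executed once if at least one lane needs it,
    and lanes that did not choose that side are masked out.
    """
    cost = 0
    if any(lane_takes_branch):
        cost += cost_if
    if not all(lane_takes_branch):
        cost += cost_else
    return cost


LANES = 16  # one 4x4 fragment tile, following the guess in the text

coherent = [True] * LANES                        # every fragment takes the "if" side
incoherent = [i % 2 == 0 for i in range(LANES)]  # alternating fragments disagree

print(divergent_cost(coherent, cost_if=100, cost_else=20))    # 100
print(divergent_cost(incoherent, cost_if=100, cost_else=20))  # 120
```

In this model a single disagreeing fragment is enough to make the whole tile pay for both sides of the branch.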

It is also important to note that usually only one out of four or five processing elements (depending on hardware generation and vendor) is capable of executing transcendental instructions such as logarithm, exponential or trigonometric functions.

Memory capacity and performance

AMD is very clear about memory capacity and performance details in their OpenCL programming guide. The table below lists these hardware characteristics for the Radeon HD5870:

| OpenCL Memory Type | Hardware Resource | Size/CU | Size/GPU | Peak Read Bandwidth / Stream Core |
|---|---|---|---|---|
| Private | GPRs | 256KB | 5MB | 48 bytes/cycle |
| Local | LDS | 32KB | 640KB | 8 bytes/cycle |
| Constant | Direct-addressed | – | 48KB | 16 bytes/cycle |
| Constant | Same-indexed | – | – | 4 bytes/cycle |
| Constant | Varying-indexed | – | – | ~0.6 bytes/cycle |
| Images | L1 Cache | 8KB | 160KB | 4 bytes/cycle |
| Images | L2 Cache | – | 512KB | ~1.6 bytes/cycle |
| Global | Memory | – | 1GB | ~0.6 bytes/cycle |
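To get a feel for what the per-stream-core figures mean for the whole chip, they can be scaled by the number of stream cores and the engine clock. This is a rough sketch: the 850 MHz clock is an assumption that is not stated in this article, and the names are illustrative.

```python
# Figures from the article: 20 compute units x 16 stream cores.
STREAM_CORES = 20 * 16
# Assumption (not stated in the article): 850 MHz engine clock.
ENGINE_CLOCK_HZ = 850e6

# Peak read bandwidth per stream core in bytes/cycle, from the table above.
PER_STREAM_CORE = {
    "GPRs": 48.0,
    "LDS": 8.0,
    "Direct-addressed constant": 16.0,
    "L1 cache": 4.0,
    "L2 cache": 1.6,
    "Global memory": 0.6,
}


def aggregate_gbps(bytes_per_cycle):
    """Scale a per-stream-core figure to whole-GPU GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_cycle * STREAM_CORES * ENGINE_CLOCK_HZ / 1e9


for name, bytes_per_cycle in PER_STREAM_CORE.items():
    print(f"{name:28s}{aggregate_gbps(bytes_per_cycle):8.0f} GB/s")
```

Under these assumptions registers come out at roughly 13 TB/s, while global memory lands at around 160 GB/s, about two orders of magnitude lower.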

GPRs – General Purpose Registers

LDS – Local Data Store

Direct-addressed constant – a constant accessed using a constant address.

Same-indexed constant – a varying-indexed constant where each processing element accesses the same index.

Varying-indexed constant – a varying-indexed constant where the processing elements access different indices.
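The three constant access categories can be told apart by looking at the indices that the processing elements working in sync use for a fetch. The sketch below is a hypothetical classifier that maps each category to the bandwidth figure from the table above. The function and parameter names are my own, and real hardware or compilers may decide this differently.

```python
# Peak read bandwidth per stream core in bytes/cycle, from the table above.
CONSTANT_BANDWIDTH = {
    "direct-addressed": 16.0,
    "same-indexed": 4.0,
    "varying-indexed": 0.6,
}


def classify_constant_access(indices, index_is_compile_time_constant=False):
    """Classify one constant fetch issued by a group of lanes running in sync.

    `indices` holds the index each processing element uses for the fetch.
    """
    if index_is_compile_time_constant:
        return "direct-addressed"
    if len(set(indices)) <= 1:
        return "same-indexed"
    return "varying-indexed"


# Every lane reads element 3, e.g. an index that is the same for a whole draw.
per_draw = classify_constant_access([3] * 16)
# Every lane reads a different element, e.g. an index that varies per fragment.
per_fragment = classify_constant_access(list(range(16)))

print(per_draw, CONSTANT_BANDWIDTH[per_draw])          # same-indexed 4.0
print(per_fragment, CONSTANT_BANDWIDTH[per_fragment])  # varying-indexed 0.6
```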

Of course, these figures apply to fetches that are properly aligned. With unaligned data access the actual throughput can be much lower. In order to reach the peak bandwidth we usually have to align our data to multiples of 4, 8 or 16 bytes (depending on the actual hardware).
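On the application side, alignment usually comes down to padding each record to the next multiple of the required size before uploading it. The sketch below uses hypothetical helpers (`align_up`, `pad_record`) and an invented 5-float record, and assumes a 16-byte alignment.

```python
import struct


def align_up(size, alignment=16):
    """Round `size` up to the next multiple of `alignment` (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


def pad_record(payload, alignment=16):
    """Append zero bytes so the record length is a multiple of `alignment`."""
    return payload + b"\x00" * (align_up(len(payload), alignment) - len(payload))


# Hypothetical per-instance record: 5 little-endian floats = 20 bytes.
raw = struct.pack("<5f", 1.0, 2.0, 3.0, 0.5, 0.25)
padded = pad_record(raw)

print(len(raw), len(padded))  # 20 32
```

The padding costs some memory, but every record then starts on a 16-byte boundary when the records are stored back to back.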

As can be seen, constant storage falls into three different access performance categories, and so do buffers and images. While the actual numbers differ between platforms, the guidelines apply to most modern GPUs: choose the addressing method wisely and take access locality into consideration in order to get optimum performance.

These numbers carry over to OpenGL terminology as well: just replace the word “constant” with uniform buffers and think of images and global data as texture images or buffer objects. The only exception is that there is no direct equivalent of local memory in OpenGL.

An additional thing to consider on Shader Model 5.0 hardware is read-write images and buffers. AMD refers to the two memory access methods as FastPath and CompletePath. For read-only textures or buffers the GPU uses the FastPath, which is able to take full advantage of the L2 cache, while read-write textures and buffers usually use the so-called CompletePath, which sacrifices the advantages of the L2 cache to enable atomic operations on global memory objects. This, of course, has a large performance effect, reducing the throughput of the GPU to roughly one fifth on the Radeon HD5870:

| Kernel | Effective Bandwidth | Ratio to Peak Bandwidth |
|---|---|---|
| copy 32-bit 1D FastPath | 96 GB/s | 63% |
| copy 32-bit 1D CompletePath | 18 GB/s | 12% |
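The "about five times" figure and the peak bandwidth behind the percentages can be recovered from the two rows above. This is a small worked check using only the quoted numbers; the variable names are illustrative.

```python
fast_path_gbps = 96.0      # copy 32-bit 1D FastPath
complete_path_gbps = 18.0  # copy 32-bit 1D CompletePath

# How much slower the CompletePath copy is than the FastPath copy.
slowdown = fast_path_gbps / complete_path_gbps

# Peak bandwidth implied by each quoted ratio (the ratios are rounded,
# so the two estimates only agree approximately).
implied_peak_fast = fast_path_gbps / 0.63
implied_peak_complete = complete_path_gbps / 0.12

print(f"{slowdown:.1f}x")                    # 5.3x
print(f"{implied_peak_fast:.0f} GB/s")       # 152 GB/s
print(f"{implied_peak_complete:.0f} GB/s")   # 150 GB/s
```

Both rows point to a peak of roughly 150 GB/s, so the percentages and the absolute numbers are consistent with each other.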

Summary

Now that we’ve seen how the various OpenCL memory types perform in reality, let’s see how all this information translates to the OpenGL world. Here are my top 10 recommendations on when and how to use the various ways of sourcing data in modern OpenGL:

Align your data to multiples of 16 bytes and fetch them accordingly.

Use direct addressing of data in uniform buffers and try to avoid indexing into uniform buffers.

If you must index into uniform buffers, make sure that the indices are coherent across processing elements working in sync.

If you heavily use indexed data, consider using texture buffers instead of uniform buffers to take advantage of the L1 and L2 caches.

Texture and buffer caches are linear, so consider this when planning your access patterns.

Bind textures and buffers in read-write mode only when it is really necessary; use regular texture binding otherwise to ensure optimum performance.

A single atomic buffer operation forces the shader to use the slow path, so use atomic operations wisely.

Do not use atomic buffer operations to implement atomic counters; use the built-in hardware atomic counters instead, as they are much faster.

Consider using dynamic branching to avoid costly memory operations as often as possible.

Try to make your branch selection coherent across processing elements working in sync (e.g. a 4×4 fragment tile in the case of a fragment shader).

Relative performance characteristics of memory access methods (higher is better).