DISCLAIMER: This article was migrated from my old blog, so it may contain formatting and content differences compared to the original post. Additionally, it likely contains technical inaccuracies, opinions that I may no longer align with, and most certainly poor use of English (I was young and foolish :)). This article remains public for those who may find it useful despite its flaws.

After the release of the OpenGL 4.1 specification the Khronos Group slowed the pace a little, but they didn’t leave OpenGL developers without a new specification version for too long: a few weeks ago they released OpenGL 4.2. The new version of the specification brings several API improvements and exposes some important pieces of hardware functionality that make OpenGL 4.x class hardware a great step forward in GPU history. This article presents the features newly introduced in the latest version of the OpenGL specification and, as I wrote an article a few months ago about Suggestions for OpenGL 4.2 and beyond, I will also write a few words about how the new specification reflects my forecast.

New features in OpenGL 4.2

OpenGL 4.2 finally fills the holes in the capability matrix of Shader Model 5.0 hardware with some long-awaited extensions, although part of this functionality was already accessible through cross-vendor and vendor-specific extensions. The new version of the specification also brings some important API improvement extensions and GLSL constructs that continue the transition toward easier-to-use state and shader management.

This extension (ARB_texture_compression_bptc) adds the new block compression texture formats called BC7 and BC6H in Direct3D terminology. The extension has actually been available for quite some time, since the release of OpenGL 4.0, but now it has become core. The formats provide high quality block compression for fixed point RGBA and sRGB textures as well as two floating point texture compression formats for signed and unsigned data.

Traditional block compression methods (such as S3TC or RGTC) interpolate along a single gradient per block of pixels, which works fine for smooth images but gives poor results around sharp edges. BPTC solves the issue by dividing blocks into multiple partitions that are compressed using independent gradients, thus providing better overall quality.

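When comparing compression efficiency, BPTC stores each 4x4 block in 128 bits, that is 8 bits per texel, which gives a compression ratio of 3:1 against 24-bit RGB data or 4:1 against 32-bit RGBA data. That is the same storage cost as S3TC DXT5 and twice that of S3TC DXT1, whose usual figures are 4:1 (against RGBA) and 6:1 (against RGB) respectively, while the RGTC formats reach 2:1 on their one- and two-channel data.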
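The ratios quoted for these formats depend on which uncompressed format is taken as the baseline, so it can help to redo the arithmetic. The following is a minimal sketch; the function names are mine, and the block sizes (8 bytes for DXT1 and RGTC1, 16 bytes for DXT5, RGTC2 and BPTC) are the usual 4x4 block sizes of these formats.

```cpp
#include <cassert>

// Bits stored per texel for a format that packs 4x4 texels into blockBytes bytes.
double bitsPerTexel(int blockBytes) {
    assert(blockBytes > 0);
    return blockBytes * 8.0 / 16.0;  // 16 texels in a 4x4 block
}

// Compression ratio relative to an uncompressed format with the given
// number of bits per texel (24 for RGB8, 32 for RGBA8, 8 for R8, ...).
double compressionRatio(int uncompressedBitsPerTexel, int blockBytes) {
    return uncompressedBitsPerTexel / bitsPerTexel(blockBytes);
}

// Usual block sizes in bytes (illustrative constants, not GL enums).
const int kDxt1BlockBytes  = 8;
const int kDxt5BlockBytes  = 16;
const int kRgtc1BlockBytes = 8;
const int kBptcBlockBytes  = 16;
```

For example, compressionRatio(24, kBptcBlockBytes) gives the 3:1 figure for BPTC against RGB data, while compressionRatio(32, kBptcBlockBytes) gives 4:1 against RGBA data.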

This is an interesting extension (ARB_compressed_texture_pixel_storage) that solves a problem I didn’t even know was such a big issue. The extension is designed primarily to support compressed image formats with fixed-size blocks, such as BPTC. The application can use it to configure pixel store parameters so that subtexture operations on compressed data give consistent results in all cases.
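To give an idea of what the block-related pixel store parameters are for, here is a rough sketch of how the byte offset of a sub-rectangle inside a larger block-compressed image can be computed once the block dimensions and block size are known. The struct and function are hypothetical helpers of mine, not part of the API, and the formula is a simplification for 2D images rather than the wording of the specification.

```cpp
#include <cassert>
#include <cstddef>

// Describes a fixed-size block format. In OpenGL these values would be passed
// through glPixelStorei with GL_UNPACK_COMPRESSED_BLOCK_WIDTH,
// GL_UNPACK_COMPRESSED_BLOCK_HEIGHT and GL_UNPACK_COMPRESSED_BLOCK_SIZE.
struct BlockLayout {
    int blockWidth;   // texels per block horizontally (4 for BPTC)
    int blockHeight;  // texels per block vertically (4 for BPTC)
    int blockBytes;   // bytes per block (16 for BPTC)
};

// Byte offset of the block containing texel (skipPixels, skipRows) in an image
// that is imageWidth texels wide. The skip values play the role of
// GL_UNPACK_SKIP_PIXELS and GL_UNPACK_SKIP_ROWS and must be block aligned.
std::size_t subBlockOffset(const BlockLayout& f, int imageWidth,
                           int skipPixels, int skipRows) {
    assert(skipPixels % f.blockWidth == 0 && skipRows % f.blockHeight == 0);
    const std::size_t blocksPerRow = (imageWidth + f.blockWidth - 1) / f.blockWidth;
    const std::size_t blockX = skipPixels / f.blockWidth;
    const std::size_t blockY = skipRows / f.blockHeight;
    return (blockY * blocksPerRow + blockX) * f.blockBytes;
}
```

With a 256 texel wide BPTC image, skipping 8 pixels and 4 rows lands on block (2, 1), that is 66 blocks or 1056 bytes into the data.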

This is again an interesting extension (ARB_texture_storage), one that improves how texture storage is allocated compared to classic OpenGL. As we all know, OpenGL has always been too ad hoc about resource management, in the sense of when the actual resources are allocated for a particular API object. This is especially a problem for textures, where we are potentially talking about large amounts of data. In classic OpenGL the driver could not know up front, for example, whether the application would need mipmaps for the texture or how many levels would be required. This could easily result in bad allocation patterns and/or large reallocations. The extension introduces the concept of immutable textures, where all the levels of a texture object are allocated up front.
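Since the number of levels has to be stated when the storage is allocated, the application needs to compute it beforehand. A small sketch of that computation follows; the helper names are mine, and the commented-out call only indicates where the result would typically be used.

```cpp
#include <algorithm>
#include <cassert>

// Number of levels in a complete mipmap chain: floor(log2(max(w, h))) + 1.
int fullMipLevels(int width, int height) {
    assert(width > 0 && height > 0);
    int levels = 1;
    for (int size = std::max(width, height); size > 1; size /= 2) {
        ++levels;
    }
    return levels;
}

// Size of one dimension at a given level; it never drops below one texel.
int mipSize(int baseSize, int level) {
    return std::max(1, baseSize >> level);
}

// Typical use with the new entry point (requires a GL context, so it is
// only shown as a comment here):
//   glTexStorage2D(GL_TEXTURE_2D, fullMipLevels(w, h), GL_RGBA8, w, h);
```

A 1024x512 texture therefore needs 11 levels, the last one being 1x1.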

This extension (ARB_transform_feedback_instanced) extends the so-called “AutoDraw” feature by providing instanced “AutoDraw”. This means that geometry captured using transform feedback can be rendered multiple times using geometry instancing. This is actually a feature that even D3D11 does not provide, so I didn’t even think that hardware supports it, even though I think the list of usage patterns for the extension is most probably pretty narrow.

This extension (ARB_base_instance) is actually the feature I called ARB_instanced_arrays2 in my suggestion list. It provides three new draw commands; one of them is rather awkwardly named DrawElementsInstancedBaseVertexBaseInstance, even though it can be considered the “basic” indexed draw command that specifies all parameters. The parameter list of the indirect indexed draw command is also extended with the base instance parameter. Fortunately, the ARB chose to add new commands rather than a SetBaseInstance-style state setter command to introduce the new concept. Funnily enough, this feature was missing for a long time even though, as far as I know, it is supported by all GPUs capable of instanced drawing and is available in D3D as well.
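To make the effect of the base instance more concrete, the sketch below mirrors the indirect indexed draw parameters on the CPU and shows how the element of an instanced vertex attribute is selected. The struct and function names are illustrative choices of mine; the selection rule (instance divided by the attribute divisor, plus the base instance) is how I understand the specification.

```cpp
#include <cassert>
#include <cstdint>

// CPU-side mirror of the indexed indirect draw parameters as of OpenGL 4.2.
// The last field used to be reserved and now carries the base instance.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices per instance
    std::uint32_t instanceCount;  // number of instances to draw
    std::uint32_t firstIndex;     // offset into the index buffer, in indices
    std::int32_t  baseVertex;     // value added to every fetched index
    std::uint32_t baseInstance;   // offset applied to instanced attributes
};

// Element fetched from an instanced attribute array for a given instance.
// Note that the base instance only offsets the attribute fetch; the instance
// number seen by the shader still starts at zero.
std::uint32_t instancedElementIndex(std::uint32_t instance,
                                    std::uint32_t divisor,
                                    std::uint32_t baseInstance) {
    assert(divisor > 0);  // a divisor of 0 means a regular per-vertex attribute
    return instance / divisor + baseInstance;
}
```

This is what allows per-instance data of many objects to be packed into one buffer and addressed with a different base instance per draw.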

This is where things start to get really interesting. This new extension (ARB_shader_image_load_store) is the ARBified version of the EXT_shader_image_load_store extension, which fortunately didn’t make it into core in its original form.

The extension provides GLSL built-in functions that allow shaders in any stage to load from, store to, and perform atomic read-modify-write operations on a single level of a texture, called an image. The extension also indirectly enables the same set of operations for buffer objects through buffer textures. This lets developers implement more sophisticated algorithms in shaders that require more complex data structures than just plain arrays.

This, together with atomic counters that we will talk about later, makes it possible to implement append/consume buffers and rendering techniques like AMD’s Order-Independent Transparency (OIT) algorithm as presented at GDC10.

Because the new write operations give fragment shaders side effects beyond the traditional framebuffer writes, the result of executing a shader becomes sensitive to whether the hardware uses early-Z or not. The extension therefore also provides a mechanism to force early depth and stencil tests in the fragment shader (the early_fragment_tests layout qualifier).

A similar issue arises with vertex shaders, as the post-transform cache may no longer be valid under certain usage patterns of load/store images. Depending on how smart the shader compiler is, the post-transform cache may simply be disabled whenever a vertex shader uses load/store images, which degrades performance. Care must therefore be taken when using read/write images in vertex shaders, as OpenGL does not have any mechanism to help with these issues (but I actually have a proposal that I’ll talk about in a future article).

The API of this extension is greatly improved compared to the EXT version, especially when dealing with various texture image formats. The extension also provides a future-proof DSA-style API. Further, the ARB version of the extension supports loads from any texture format and corrects some specification bugs of the EXT version.

From a hardware implementation point of view, note that when a shader contains atomic operations applied to a particular read/write image, the driver uses a different hardware path, as required by atomic read-modify-writes, so atomic operations should be used only when necessary. Also note that this decision is made statically at compile time by the driver, so even a single atomic operation in a rarely taken branch will result in degraded performance. This is another reason to use atomic counters rather than read/write image atomics to implement append/consume buffers.

This is the other long-awaited feature (ARB_shader_atomic_counters) that I also suggested, and that was still missing from OpenGL while being available in D3D11. The specification had actually been in the works for a long time (about a year) and it even appeared for a while in AMD’s OpenGL drivers, sometimes as an EXT and sometimes as an ARB extension. The extension provides an API to access a number of hardware atomic counters that offer efficient counter operations on a GPU-global scale. Atomic counters come in handy in many cases, like append/consume buffers or indirect draw buffer construction.

The extension provides access to these atomic counters from GLSL and also makes it possible to back them with buffer objects, so that after OpenGL draw calls the values of the counters are preserved in these buffers for later use.
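The append buffer pattern mentioned above can be illustrated on the CPU. The class below is only an analogy of mine, using std::atomic in place of a GPU atomic counter: like the GLSL atomicCounterIncrement built-in, fetch_add returns the value the counter held before the increment, and that value is used as the slot to write to.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU analogy of an append buffer: a fixed-size array plus an atomic counter
// that hands out unique slots to concurrent writers.
class AppendBuffer {
public:
    explicit AppendBuffer(std::size_t capacity) : data_(capacity), counter_(0) {}

    // Reserves the next free slot and stores the value there.
    // Returns false if the buffer is already full.
    bool append(std::uint32_t value) {
        const std::uint32_t slot = counter_.fetch_add(1);  // returns the old value
        if (slot >= data_.size()) {
            return false;
        }
        data_[slot] = value;
        return true;
    }

    // Number of valid entries; the raw counter may overshoot the capacity.
    std::size_t count() const {
        return std::min<std::size_t>(counter_.load(), data_.size());
    }

    std::uint32_t at(std::size_t index) const {
        assert(index < count());
        return data_[index];
    }

private:
    std::vector<std::uint32_t> data_;
    std::atomic<std::uint32_t> counter_;
};
```

On the GPU the counter value left in the backing buffer object plays the role of count(), and can for example feed the instance or vertex count of an indirect draw.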

The OpenGL implementation is superior to D3D’s as it provides access to atomic counters from all shader stages. There are caveats of course: as mentioned in the previous section, the side effects made possible by read/write images and atomic counters require special care in fragment and vertex shaders, as they may result in invalid rendering and/or lower performance.

As for hardware vendor implementations, note that atomic counters are much, much faster than read/write image atomics, at least on AMD hardware, which has dedicated hardware for atomic counters. On NVIDIA hardware, though, there seems to be no separate hardware path for atomic counters, as their performance is roughly the same as that of read/write image atomics.

The dedicated hardware implementation of atomic counters, however, comes with a trade-off: the number of atomic counters is severely limited on AMD hardware, but one can still fall back to read/write image atomics after running out of atomic counters.

This is another extension I suggested (ARB_conservative_depth), and it fills another functionality hole compared to D3D11. The extension is actually an ARBified version of AMD_conservative_depth that extends the application developer’s control over early depth and stencil tests. ARB_shader_image_load_store already provides a way to force early-Z; this extension adds further modes that give the driver a hint about how depth is modified in a fragment shader that outputs depth. This passes enough information to the GL implementation to safely activate some early depth test optimizations while still taking the final depth value into account in the depth test.

The extension exposes the new capability as layout qualifiers applied to a redeclaration of the gl_FragDepth fragment shader output: “depth_any”, “depth_greater”, “depth_less” and “depth_unchanged”. The interesting ones are those that promise a greater or a smaller depth value as output, as they make it possible to reject groups of fragments early using Hi-Z and early-Z even when depth is modified. This technique can greatly improve the rendering performance of volumetric particles, decals and billboards.
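The reasoning behind the early rejection can be written down as a small decision function. This is a sketch of the logic as I understand it, not of how any driver implements it; the enums and the function are hypothetical, and only the LESS and GREATER depth functions are modelled. In GLSL the hint itself is given with a redeclaration such as layout(depth_greater) out float gl_FragDepth;.

```cpp
#include <cassert>

enum class DepthHint { Any, Greater, Less, Unchanged };
enum class DepthFunc { Less, Greater };  // fragment passes if z < stored / z > stored

// Returns true if a fragment can be discarded before the shader runs, given
// its interpolated depth, the depth already stored, and the promise the
// shader made about the depth value it will write.
bool canRejectBeforeShading(DepthHint hint, DepthFunc func,
                            float interpolatedZ, float storedZ) {
    const bool failsWithInterpolatedZ = (func == DepthFunc::Less)
        ? !(interpolatedZ < storedZ)
        : !(interpolatedZ > storedZ);
    if (!failsWithInterpolatedZ) {
        return false;  // it may still pass, so the shader has to run
    }
    switch (hint) {
        case DepthHint::Unchanged:
            return true;                      // final depth equals interpolated depth
        case DepthHint::Greater:
            return func == DepthFunc::Less;   // depth only grows, so it still fails LESS
        case DepthHint::Less:
            return func == DepthFunc::Greater;  // depth only shrinks, still fails GREATER
        case DepthHint::Any:
            return false;                     // nothing is known about the final depth
    }
    return false;
}
```

With the usual LESS depth test, a fragment hidden at its interpolated depth stays hidden if the shader can only push it further away, which is exactly the depth_greater case.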

As far as I can tell, though, the extension currently provides performance benefits only on AMD hardware, as NVIDIA hardware does not have such functionality. Using the extension would thus still force NVIDIA GPUs to disable early-Z whenever the fragment shader outputs a depth value, but future hardware may change this.

This is a strangely named extension (ARB_shading_language_420pack) that provides a lot of improvements to GLSL. These are mostly API improvements only, but they have great value when it comes to source code maintainability and resource management.

I think the most useful addition of the extension is the “binding” layout qualifier, which covers what I referred to as ARB_explicit_sampler_location and ARB_explicit_uniform_block_index in my suggestion list. It enables shader writers to explicitly assign a uniform block binding index to a uniform block, as well as to explicitly assign a texture or image unit to a sampler or image variable, directly in the shader source.
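As a small illustration, here is how a GLSL 4.20 shader using the qualifier might be embedded in application code. The shader, the variable names and the helper function are made up for this example; before 4.20 the same assignments had to be done from the application with calls like glUniformBlockBinding and glUniform1i after linking.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical GLSL 4.20 fragment shader, stored as a raw string literal.
// The uniform block is tied to uniform buffer binding point 1 and the
// sampler to texture unit 0 without any API call.
const char* const kFragmentSource = R"glsl(
#version 420 core

layout(std140, binding = 1) uniform Material {
    vec4 tint;
};

layout(binding = 0) uniform sampler2D albedoMap;

in vec2 uv;
out vec4 color;

void main() {
    color = tint * texture(albedoMap, uv);
}
)glsl";

// Crude textual check for the qualifier, enough for a quick sanity test of
// shader sources; a real tool would parse the layout qualifiers properly.
bool usesExplicitBinding(const char* source) {
    assert(source != nullptr);
    return std::strstr(source, "binding") != nullptr;
}
```

The benefit is mostly organizational: the binding points become part of the shader source, so shaders and application code can agree on a fixed convention.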

Besides that, the extension adds other minor improvements, like implicit conversion of return values of functions, UTF-8 character set support, C-style initializer list support and scalar swizzle operators.

This is another somewhat strangely named extension (ARB_internalformat_query) that was meant to make it possible to query information about the internal formats of textures. It actually falls short of that, as it only provides the ability to query the sample counts (and thus the maximum number of samples) supported for different internal formats.

The extension was ambitious, as the plan was to expose internal format information such as the actual internal format used, whether the format is renderable, whether it is accessible in a particular shader stage, whether it can be used as a read/write image, and even performance hints about using a particular internal format. Unfortunately all of these were left for a future extension.

This is the last new extension introduced in OpenGL 4.2 (ARB_map_buffer_alignment). It simply requires the pointers returned by buffer mapping commands to be aligned to at least 64 bytes, so that the data can be processed directly with special CPU instructions like SSE or AVX. This can provide a further performance increase when the client is modifying buffer data.
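Code that relies on this guarantee may still want to verify it in debug builds. A minimal sketch of such a check follows; the helper is mine, and the pointer passed to it would in practice come from a buffer mapping call, with the guaranteed minimum queryable through GL_MIN_MAP_BUFFER_ALIGNMENT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if ptr is aligned to the given power-of-two alignment.
bool isAligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Typical use (requires a GL context, so it is only shown as a comment):
//   void* p = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
//   assert(isAligned(p, 64));  // safe for aligned SSE (16) and AVX (32) access
```

Note that the guarantee covers the start of the mapping; when only a sub-range of the buffer is mapped, the offset into the buffer has to be taken into account as well.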

Conclusion

OpenGL 4.2 has again proven that OpenGL is not dead, and in fact aims to be once more the ultimate choice of 3D API by pushing the exposed hardware capabilities over the line set by D3D11. Looking back at the list of expected extensions I presented in my earlier article, Suggestions for OpenGL 4.2 and beyond, we can see that OpenGL 4.2 fulfilled all my expectations and even my wish list was partly fulfilled. Here’s the list for a better overview:

My expectations for OpenGL 4.2:

GL_EXT_shader_image_load_store added in the form of GL_ARB_shader_image_load_store

GL_ARB_shader_atomic_counters added as is

GL_ARB_instanced_arrays2 added in the form of GL_ARB_base_instance

GL_ARB_explicit_sampler_location added in the form of GL_ARB_shading_language_420pack

GL_ARB_explicit_uniform_block_index added in the form of GL_ARB_shading_language_420pack



My personal wish-list for OpenGL 4.2: