DISCLAIMER: This article was migrated from the old blog thus may contain formatting and content differences compared to the original post. Additionally, it likely contains technical inaccuracies, opinions that I may no longer align with, and most certainly poor use of English (I was young and foolish :)). This article remains public for those who may find it useful despite its flaws.

The Khronos Group continues the progress of streamlining the OpenGL API. One very important step in this battle has been made just a few days ago by releasing two concurrent core releases of the OpenGL specification, namely version 3.3 and 4.0. This is a major update of the standard containing many revolutionary additions to the tool-set of OpenGL that need careful examination. In this article I would like to talk about these new features trying to point out their importance and touching also some practical use case scenarios.

This is the fourth revision of the OpenGL API standard in the last two years. This fast pace revolution started about one and half years ago with the release of the version 3.0 of the specification. At that time, a great feel of disappointment has overcame the developers due to the lack of the promised rewrite of the whole API. Others, who had to deal with legacy code were also disappointed but they felt so because the new revision of the API threatened them with removing old features. These two opposing forces have put the Khronos Group into a situation where there was very difficult to make a decision that would make everybody happy. After two releases, this issue has been mostly resolved with OpenGL 3.2 and also lots of missing features have been integrated into the core API meanwhile.

Even though great steps has been made in order to fulfill everybody’s needs, the gap between the core functionality of OpenGL and the DirectX API still increased, especially due to the introduction of Shader Model 5.0 hardware. OpenGL was in a position when it had to adopt the features of the new hardware generation and also try to make up leeway in case of Shader Model 4.0 hardware. My personal wish was that there should be two new versions of the API: one that complements the OpenGL 3.x API with the missing features and another that catches up to DirectX 11. Actually my wish became true as the first time in the history of OpenGL we got two new releases of the standard at once, and finally, we got an API that is a really competitive alternative for Microsoft’s DirectX API. I think I can say this in the name of every OpenGL developer: Thank you Khronos!

Okay, but that’s enough about history and acknowledgements. Lets see what’s under the hood of the new API revisions! When I read the good news at OpenGL.org I felt myself like a child at Christmas just taking the first look at the presents under the tree: I was in great ecstasy and started to “open the presents” as fast as I could…

New features of OpenGL 3.3

Let’s start with the new version of the API targeting Shader Model 4.x hardware. It seems that the concentration on the major release 4.0 didn’t capture the attention of the ARB explicitly as we have many interesting features already in the first box…

This is a feature for what I’ve seen many requests on the OpenGL discussion forums. It enables fragment shaders to output an additional color per render target that can be used as a blending factor for either source or destination colors providing an additional degree of freedom to affect the way how fragments are blended into the destination buffers. This is one functionality that is supported by the underlying hardware for a while but without API support it was impossible to take advantage of it. As it is very straightforward how this feature works I would not even talk about it too much. Just one additional comment: surprisingly AMD already supports this extension in its latest graphics drivers which is a remarkable thing taking in consideration that AMD drivers were always a step behind the NVIDIA ones in the race of adopting latest OpenGL features. It seems that now AMD takes seriously the OpenGL support and this is good news for all the developers out there, especially for me, being an ATI fan.

Most probably not just for me, the way how the binding of vertex attributes to shader attributes and the binding of shader outputs to render targets happened earlier caused a big headache from both the point of view of modular software design and efficiency. Previously, the application developer had little to no control over how to automatically connect these elements together in a shader independent way. This tight coupling between the host application code and the shaders just make the work of the developers cumbersome. This feature leverages the way how this binding process is done by allowing to globally assign a particular semantic meaning to an attribute location without knowing how that attribute will be named in any particular shader, decoupling the host application from the shaders. This extension is a typical example how design abstractions can ease the life of the developer without any dependency on hardware support.

Well, there isn’t too much to say about this extension as it just adds a new occlusion query type that reports just a boolean value about the visibility of the object rather than the actual samples. It is somewhat equivalent to the occlusion query extensions prior to ARB_occlusion_query. Don’t ask me why this feature is important but they felt that it might be useful. One thing I can think about that with such a query we might get our results about the occlusion query of the proxy object sooner as we have to wait only till the first passed sample but I’m not confident whether such thing is supported by either the hardware or the drivers.

This is one another feature that people have been waiting for years. This extension decouples texture image data from sampler state. Previously, if a texture image had to be used with different sampler modes, no matter if we talk about various filtering modes or texture coordinate wrapping, one had to do expensive state changes to modify the sampler state of the texture object, accomplish the needed filtering or wrapping from within shaders or, in worst case, duplicating texture image data in order to have access to the same texture with different sampler parameters. The primary intend of this feature is to solve these problems.

One thing to remark regarding to this extension is that even though it is a long waited addition to the API, several people already expressed their discontent regarding to the fact that the texture unit semantics have been kept. Nevertheless, I also expected that the introduction of this feature should be the point when the texture unit semantics has to go but after seeing the example of ARB_explicit_attrib_location as a way to decouple the shader code from the host application code I tend to agree with Khronos in this decision as we can think about the texture units from now as an adapter layer between GPU and CPU code and as such the decision seems reasonable.

This extension adds built-in functions for getting and setting the bit encoding for floating-point values in the OpenGL Shading Language. As it is more like an indicator extension regarding to added functionality in the Shading Language I would rather not go into details as I will talk about the new Shading Language later.

Again, an extension that is quite self-explanatory: new texture image format called RGB10_A2 with non-normalized unsigned integers in them. This is nothing more than another hole filled in the gap between hardware and API support.

Especially when using one or two component texture formats, like in the case of shadow maps, the specification was somewhat unclear how these components are finally mapped to RGBA quadruples and provided little to no facilities to control this process. If the developers weren’t already fed up with this, the possibility of a problem increased even further because often the driver implementations behaved differently as well. This issue has been finally clarified with this extension by providing an explicit tool for the application developer to control the swizzling of the components that is done implicitly afterwards in case of every single texture fetch. The new state is introduced as part of texture object state that provides fine grained control over when and how to use the swizzling. According to the extension specification, this feature has a notable role in helping porting issues of legacy OpenGL applications as well as those of the games written for PlayStation 3 as the console provides such functionality already.

Prior to this extension, runtime performance measurements were limited to the use of client side timing information or relying on the use of offline profiling mechanisms like that of AMD’s GPUPerfStudio. During development, this timing information can help identify application, driver or GPU bottlenecks. At runtime, this data can be used to dynamically optimize the scene to achieve reasonable frame rates. While today’s hardware provides a great repertoire of performance measurement metrics there was no API support to access these previously. This feature provides an additional asynchronous query type that enables application developers to measure the driver and GPU time that is required to complete a set of rendering commands, thus providing additional flexibility for both offline and runtime optimizations. While this extension does not guarantee 100% consistency and repeatability, the information gathered with timer queries will definitely make it possible to identify server side bottlenecks and the reasons behind them.

Many people argued with me at the OpenGL discussion forums when I stated that instanced arrays should be included in core OpenGL. Their reasoning was built on the fact that we already have the ARB_draw_instanced extension that provides a shader based thus much more flexible way to handle instanced geometry. While from this point of view I tend to agree with them, there are many non-trivial use cases which prove that my reasoning is not pointless. It seems that Khronos agrees with me regarding to this topic.

In a nutshell, the instanced arrays feature enables the use of vertex attributes as a source of instance data. This is done by introducing a so called “array divisor” that specifies how the corresponding vertex attributes are mapped to instances. Usually a vertex attribute advances on a per-vertex basis. In case of instanced arrays this advance happens only after every Nth conceptual draw calls that is equivalent to a traditional draw command, excluding instanced draw commands.

One use case can be when one deals with huge number of instances where the per-instance data simply not fits into uniform buffers. While in such cases one can use a texture buffer instead to source the instance data like it was mentioned in my article Uniform Buffers VS Texture Buffers, accepting the additional overhead of using texture fetches may prove to be a not-so-performance-wise decision. Beside standard instancing use cases, there are plenty of nasty tricks that can be efficiently achieved using this feature but that goes far beyond the scope of this article and requires a separate discussion on what I will most probably recap in the near future.

We’ve arrived to the final new extension included in core OpenGL 3.3. This is another gap filling extension to provide two new vertex attribute data formats: a signed and an unsigned format with 10 bits for each significant coordinate. The most typical use of this format is to store vertex normals in the signed-normalized version of the format in order to have a compact (4 bytes per normal) yet high precision (due to 10 bits per component) format that can reduce memory needs and bandwidth requirements while retaining sufficient precision. Previously, there was no way to have such high precision for the vertex attributes in case of a 4-byte footprint.

The OpenGL Shading Language 3.30

The first remarkable thing is the shift in the versioning of the Shading Language. It seems that from now it will be in align with the core specification version. This decision was most probably made because of the introduction of two release branches of the standard specification in order to avoid confusion regarding to the correspondence between API and Shading Language versioning.

As in case of talking about the OpenGL Shading Language it is much more difficult to easily summarize the new features with corresponding use cases I will simply limit my comments to an excerpt from its specification regarding to the features added in this new version:

Layout qualifiers can be used to declare the location of vertex shader inputs and fragment shader outputs in align with the API functionality provided by ARB_explicit_attrib_location as mentioned before.



Built-in functions provided to converting floating-point values to integer ones representing their encoding.



Some clarification of already existing facilities of the language.

New features of OpenGL 4.0

It is very obvious that the major version number change indicates that this revision of the specification is targeting Shader Model 5.0 hardware. To be honest, as I was never really interested in DirectX, I barely know all the features introduced by DX11 but seems that there are some great facilities in OpenGL 4.0 that I’ve never heard that hardware supports it. This can be due to DX11 does not even support such functionalities but it is maybe because I don’t know enough details about DX11. Anyway, let’s see the revolutionary things that we face we checking out the latest version of the OpenGL specification…

Using this feature one is able to select individual blend equations and blend functions for each render target. This extension was already exposed for a few months now so most probably everybody heard about it or even if not the functionality is very straightforward. It simply removes some of the restrictions when dealing with multiple render targets (MRT). One interesting thing is still that the Khronos Group decided to include this extension in the 4.0 version of the API but not in 3.3. This is odd as Shader Model 4.0 capable hardware already supports this feature or at least I have the extension on my Radeon HD2600 which raises the question: why only in 4.0? Unfortunately, I don’t know the answer but I hope the ARB has a good reason behind this, as we will see later, there are other features that for some reason were only exposed in the latest version of the API but not in core for Shader Model 4.0 hardware.

In case of traditional multisample rendering the hardware optimizes the multisampling in a way that the fragment shader is executed only once for each fragment. This can be done as the standard specification relaxes the way how the implementation behaves regarding to feeding color and texture coordinate values for each sample. While this optimization usually does not provide any rendering artifacts and it heavily reduces the amount of pressure on the GPU, there are some situations when this optimization results in aliasing artifacts. One sample use case is when alpha-tested primitives are rendered.

This extension provides a global state for enabling and disabling sample shading and a way to control how fine-grained per-sample shading should be by supplying a minimum number of samples that need to be shaded. Beside this, it also introduces the required language elements to the OpenGL Shading Language to support sample shading.

In my humble opinion, this is one of the most important features introduced in this new version of the API specification. So far, many engine and shader developers faced the problems that where inherently there in the Shading Language that heavily reduced the ability to create a modular shader design in order to separate the independent tasks done in shaders nowadays. One initiative was the idea behind the EXT_separate_shader_objects extension. While that extension removed the dependency between shader stages, it does not address the problem with tight coupling inside one shader stage, also the aforementioned extension defeats some of the design goals of the Shading Language introducing complicated language semantics in order to solve the problem of inter-stage dependency.

Just to emphasize the importance of this new functionality with a very basic example, let’s take a simple rendering engine that supports skeletal animated geometry, materials and lights. In such a use case both the vertex and fragment shaders have multiple roles: the vertex shader has to perform the skeletal animation (property of the geometry) and the view transformation (property of the camera or of the light in case of shadow map rendering), and the fragment shader has to calculate the incident light to the surface point (property of the light) and then calculate the illuminance factor (property of the material). With the traditional tool-set these components of the shaders were tightly coupled and in order to support the combination of any geometry type (animated or not, skeletal or morph animation, etc.), any light type (directional, point, etc.) and material type (diffuse, phong, environment mapped, etc.), one had to compile all possible combinations of the shaders or create uber-shaders that do run-time decisions in order to solve the problem of heterogeneous inputs. Both of these solutions provide additional hardware resource usage and possible runtime overhead.

This extension adds some kind of polymorphism support to shaders. This way a single shader can include many alternative subroutines for a particular task and dynamically select through the API which subroutine is called from each call site. This opens the doors for modular shader designs while retaining most of the performance of specialized shaders.

Yes, this is about the new geometry tessellation mechanism introduced by Shader Model 5.0 hardware. The extension itself introduces three new stages that are roughly situated between the vertex shader and the geometry shader:

Tessellation Control Shader – This new shader type operates on a patch that is actually nothing more than a fixed-size collection of vertices, each with per-vertex attributes and a number of associated per-patch attributes. Also note that while it operates on a patch, it is invoked on a per-vertex basis. The most important rule of this shader is to perturb the tessellation level for the patch that controls how finely the patch will be tesselated. Usually think about a patch as a triangle or quad. This shader is equivalent to DX11′s hull shader.



– This new shader type operates on a patch that is actually nothing more than a fixed-size collection of vertices, each with per-vertex attributes and a number of associated per-patch attributes. Also note that while it operates on a patch, it is invoked on a per-vertex basis. The most important rule of this shader is to perturb the tessellation level for the patch that controls how finely the patch will be tesselated. Usually think about a patch as a triangle or quad. This shader is equivalent to DX11′s hull shader. Fixed-function tessellation primitive generator – The role of this new stage is to subdivide the incoming patch based on the tessellation level and related configuration that the unit gets as input.



– The role of this new stage is to subdivide the incoming patch based on the tessellation level and related configuration that the unit gets as input. Tessellation Evaluation Shader – This new shader type is responsible of calculating the position and other attributes of the vertices produced by the tesselator. This shader is equivalent to DX11′s domain shader.

One important thing to notice is that a new primitive type is introduced, namely a patch. A patch on its own it is not directly or indirectly related to any traditional OpenGL primitive as it cannot be directly rendered. It is used only as the input type for the tesselator, however, a patch supplies the control grid of the geometry to be generated via tessellation so in practice it is most likely to be equivalent with triangles or quads but it is important to remark the difference.

As this is maybe the most well known feature of Shader Model 5.0 hardware I wouldn’t like to talk about it more as everybody knows what is it for and it would be rather long to explain how to use it. Also, it is not the intension of this article to fully cover the usage of all the new features, it is just a quick summarization of the new possibilities.

Yet another extension that introduces an additional format, now for texture buffers. Previously, texture buffers supported only four-component formats, this is extended with three-component formats. As currently there is no any practical use case in my mind when this can be useful, I would rather not come up with one. However, my opinion is that these formats most probably work with reduced performance compared to the four-component ones even though the memory footprint and bandwidth usage is maybe somewhat lower, I have concerns regarding to alignment related performance issues.

Those who already use texture arrays to help batching issues and remove unnecessary state changes most probably adore this extension as it enables texture array capabilities also for cube map textures. This comes handy especially in case when many materials use environment cube maps or when shadow cube maps are used for point lights. To come up with even a more concrete example, you can render the shadow cube maps for many hundreds of point lights with a single draw call by taking advantage of the layered rendering capability of geometry shaders and the possibility to bind texture arrays as render targets.

One more thing to notice here is that cube map arrays are already supported by Shader Model 4.1 hardware so the question to the ARB is again there, however, as OpenGL 3.3 still targets Shader Model 4.0 hardware maybe we will see a 3.x version of the specification that will also include this extension. The judgement is up to you whether you agree with me or not.

Another feature from the repertoire of Shader Model 4.1. This extension introduces new texture fetching functions to the Shading Language that determine a 2×2 footprint of the texture that would be used for linear filtering in a texture lookup and returns a vector consisting of the first component from each of the four texels in the footprint. This is the so called Gather4 texture fetching mode and can be useful to accelerate percentage closer filtering of shadow maps as it can fetch four samples at once. Still, there are some limitations on the use of this fetching mode, one important thing is that a shader cannot use normal and gather fetches on the same sampler. This makes me think about whether this feature is not part of the sampler object state instead of being a Shading Language construct. Anyway, as in typical use cases these limitations does not defeat the goal of the feature, I would not consider this problem a design issue.

The transform feedback mechanism already proved to me that is a great addition to the tool-set of graphics application developers. This feature extends transform feedback with an object type that encapsulates transform feedback related state to enable configuration reuse. Also it provides a way to pause and resume transform feedback mode if, for some reason, some rendering commands should be excluded from the feedback process.

The last and maybe most important benefit of this extension is the ability to draw primitives captured in transform feedback mode without querying the captured primitive count. It is roughly equivalent to DX10′s AutoDraw feature and the purpose of it is to eliminate the need to query the number of previously generated primitives in order to supply it to an OpenGL draw command. This solves the synchronization issues that previously happened between the CPU and the GPU.

One example is when a skeletal animated geometry has to be used in a multipass rendering technique. We can think about traditional forward rendering or when dealing with multiple shadow maps that have to be generated. Anyway, as the calculations needed to perform skeletal animation are rather expensive, it is wastage to perform these calculations in each pass. A common way to solve this problem is to use transform feedback to capture the geometry emitted by a vertex shader that simply executes the skeletal animation on the input geometry. In subsequent rendering passes this feedback buffer can be used to source the geometry data to eliminate the need to recompute the animation. Without this extension, in such cases the application is most probably stalled until the feedback process ends as it needs to query the number of generated primitives. With this extension, this is solved as we don’t have to know the results of the previous transform feedback in order to issue a draw command that sources the data from the feedback buffer. By the way, this seems to be logical as the information is already on the GPU so why it should ping-pong between the CPU and the GPU?

As I mentioned before, the functionality provided by this extension is equivalent to DX10′s AutoDraw feature. This time my question is really serious: why this feature haven’t been included in OpenGL 3.3? It would provide a great benefit for those who use transform feedback and I don’t see any reason behind not supporting it because, as far as I can tell, it is supported on the corresponding hardware.

Surprisingly, OpenGL 4.0 comes with another transform feedback extension as well but this time a true Shader Model 5.0 feature. The new hardware generation has the ability to emit vertices from the geometry shader to multiple vertex streams. In order to provide clever API support, the ARB decided to relax the previous limitation of transform feedback mode that output can be in either interleaved format or to separate buffers. This new extension enables the use of both together also providing a way to group geometry shader outputs to groups in order to target the individual vertex streams.

The most important benefit of this feature is still that we have separate streams, each with its own primitive emission counter so the outputs should not necessarily have the same granularity. This provide room for very clever rendering techniques. As an example, remember NVIDIA’s Skinned Instancing demo that used one draw call per geometry LOD to sort instance data on a per-LOD basis. Using this extension, this preprocessing step can be done with a single draw call, but the abilities of this feature goes far beyond such a simple use case, I will also talk a bit about another in the next section.

One of my less technical notes is that it seems that the Khronos Group members have good sense of humor. I realized this when I met the “manbearbig” when reading one of the examples in the extension specification.

We’ve arrived to the most culminating point in the list of features introduced. It is hard to say such a thing, but in my humble opinion this extension can be the Holy Grail of next generation rendering engines. I will explain why I think so…

The extension provides a way to source the parameters of instanced draw commands from within buffer objects. One naive use case would be to put all the rendering command parameters to a buffer object using the host application and then draw everything with a single command. While this simple method already has its benefits, this feature provides much more flexibility than this. The most revolutionary is that, using this extension, one is able to generate instanced draw commands with the GPU on-the-fly. Together with ARB_transform_feedback3 it is possible to write a completely GPU based scene management system.

Those who remember my Instance Cloud Reduction (ICR) algorithm, presented in the article Instance culling using geometry shaders, know that the required synchronization points between the CPU and the GPU heavily limited the practical utility of the culling technique. By taking advantage of the aforementioned features in case of ICR does not just eliminate the synchronization issues that I’ve spoken of but makes the technique practical also in case of heavily heterogeneous scenes with virtually any number of geometries even if there are multiple number of LOD level for them, and this whole stuff can be done with even less number of draw calls than that of the demo that accompanied my article. As soon as we will see OpenGL 4.0 capable drivers I will write an article about this technique, supplying also a reference implementation.

This extension provides new fragment shader texture functions, namely textureLOD*, that return the results of automatic LOD computations that would be performed if a texture lookup would be performed. These functions return a two-component vector. The X component of the result vector contains information about the mipmap level that would be used if a normal texture lookup would have been made with the same coordinates. This value can be a concrete mipmap level or a value between two levels if trilinear filtering is in use. The Y component of the result holds the computed LOD lambda-prime, see the OpenGL specification in order to check out where it is actually coming from and how it is calculated.

One interesting thing that this extension can be used for is when one implements some shader based filtering and addressing method for textures. As an example, lets take a mega-texture implemented that uses a 3D texture for storage, without actual mipmaps, and the addressing, filtering and mipmapping is done with shader code. As right now this is the only example that came into my mind and this is already awkward enough, I would rather leave the further discussion of the importance of this feature to more competent people.

Basically, this extension is nothing more than a big umbrella feature under what all the additional general or minor API changes go. Just to sum up the miscellaneous features provided by this extension, here is an excerpt from the extension specification:

Support for indexing into arrays of samplers using non-constant indices.

Support for indexing into an array of uniform blocks.



Extending Gather4 with the ability to select any single component of a multi-component texture, to perform per-sample depth comparison, and to specify arbitrary offsets computed at runtime when gathering the 2×2 footprint.



Support for instanced geometry shaders, where a geometry shader may be run multiple times for each primitive.

For a full list of new facilities introduced by the extension refer to the extension specification.

This extension enables the use of double-precision floating-point data types and arithmetic from within shaders, also providing API entry points for double-precision data where it was missing. While one may think that the added precision is somewhat wastage in case of real-time graphics, it is important to note that GPUs are more and more often used for scientific calculations, not even necessarily in case of graphics related tasks. Taking in consideration this fact, the importance of double-precision floating-point support should not be underestimated. Beside that, maybe standard graphics application developers can also take advantage of the higher precision in some extreme use case scenarios.

The OpenGL Shading Language 4.00

Beside what I’ve already mentioned, there is no important thing to mention regarding to the Shading Language. There where many changes but most of them are simply provide Shading Language support to the API extensions. What I haven’t mentioned so far is the synchronization possibility for tessellation shaders, more implicit conversions, more integer functions, packing and unpacking facilities for floating-point formats and a new qualifier to force precision and disallow optimizations that re-order operations or treat different instances of the same operator with different precision.

Conclusion

I hope that some of you didn’t give up the reading so far. Sorry, but it seems that this article gone wild and still didn’t manage to cover all the topics I intended to talk about. But still, maybe I’ll recap on those subjects later.

Where is direct state access?

The original promise of eliminating the bind-to-modify semantics from the OpenGL API is still not done. The first reaction of many people is still to ask this question. While the bind-to-modify semantics is a rather annoying “feature” of OpenGL, I tend to state that if we are not talking about legacy OpenGL, the importance of direct state access is less and less relevant as we can already heavily reduce the number of state changes and API calls in our applications, thanks to the fast pace evolution of OpenGL. I sincerely think that with a modern rendering engine design built upon the idioms behind the new versions of the OpenGL API one should not face any significant scalability issues due to the outdated bind-to-modify semantics but maybe I’m wrong.

Personally, I have only one problem with the newly released specification versions that I’ve already tried to emphasize several times: the fact that so far many Shader Model 4.x features are missing from the 3.x line of the API specification. Hopefully that will be solved sooner or later, however addressing these issues should happen before the hardware to support will become outdated.

Anyway, we should not have any harsh complains as the Khronos Group did a great job again. They managed to keep again the half-year schedule and they even published two parallel releases at once! If someone still says that the DirectX API is superior compared to OpenGL should think it twice, as it seems that the tendency is that OpenGL just starts to evolve more and more fast. Beside that as now also AMD is being active in the OpenGL world, we can expect good support from both industry and developer community point of view.

My respect for the Khronos Group and thanks for reading the article!