DISCLAIMER: This article was migrated from the old blog thus may contain formatting and content differences compared to the original post. Additionally, it likely contains technical inaccuracies, opinions that I may no longer align with, and most certainly poor use of English (I was young and foolish :)). This article remains public for those who may find it useful despite its flaws.

OpenGL 3.0 capable GPUs introduced a level of processing power and programming flexibility that isn’t comparable with any earlier generation. OpenGL 4.0 and the hardware supporting it pushed the limits even further. Thanks to these features, graphics developers now have more and more options for implementing GPU based scene management and culling algorithms. The Mountains demo showcases some of these rendering techniques that, as far as I know, have not been implemented in OpenGL before. In this article I present the key features of the demo; they will be discussed in more detail in subsequent articles. Demo binaries with full source code are also published.

The demo itself is mainly inspired by the March of the Froblins demo released by AMD and by the SIGGRAPH 2008 Course Notes by Jeremy Shopf, Joshua Barczak, Christopher Oat and Natalya Tatarchuk, which present the actual implementation in detail. That demo targeted the Radeon HD4800 series and presented several practical GPU based culling algorithms implemented using DirectX 10. The Mountains demo implements these techniques in OpenGL and improves on the approach used in AMD’s demo by taking advantage of the new features introduced by Shader Model 5.0 hardware and OpenGL 4.0.

This article only briefly presents the demo and the rendering techniques it uses. A thorough examination of each technique needs a longer discussion that would make this article too long and overwhelming, so the details are left to subsequent articles.

Introduction

The Mountains demo renders a tiled terrain block with thousands of high detail tree models (the full detail tree model is over five thousand triangles). Because the view distance used in the demo is quite large, several tiles of the terrain block are potentially visible on the screen, which results in a huge explosion in the number of triangles the GPU has to render. Also, with traditional methods, rendering the terrain blocks and the several thousand tree models would need loads of draw calls. To solve this problem, the demo renders the trees using geometry instancing to minimize the number of draw calls.

A traditional rendering engine would use CPU based culling methods. While that would work in practice, it is more convenient to perform the culling on the GPU, as all the information needed to do it is available there. Moreover, culling is a typical algorithm that can easily take advantage of the highly parallel architecture of the GPU. Also, performing the culling on the CPU would make geometry instancing barely beneficial.

Another problem with a scene like this is that simple per-object view frustum culling would not solve the problem completely, as most of the trees in the view frustum are not visible because they are hidden by the terrain. In traditional OpenGL, the way to solve this problem would be to use per-object occlusion queries and render bounding volumes. While this may work in practice, it involves too much CPU intervention even if we take advantage of conditional rendering, and it also breaks instancing.

These are the issues that motivated me in creating this demo and I established the following goals for the project:

All the object-level information must stay on the GPU and the CPU should not make decisions on a per-object basis.

The renderer should use as few draw calls as possible in order to solve the problem of visibility determination.

Don’t draw anything that is not inside the view frustum or is occluded by terrain.

The result is a renderer that does little to no scene management on the CPU and uses the GPU for visibility determination instead. In most cases it is able to reduce the scene’s geometric complexity from over 400 million triangles to under one million, providing an interactive experience at around 200 frames per second on a Radeon HD5770.

View from above

Implementation

The scene consists of a tiled terrain with over 130 thousand triangles and more than 1400 tree instances, each with almost 6 thousand triangles. This sums up to about 8 million triangles for a single tile of terrain. As the view range needs to be quite large, we actually deal with a 7×7 grid of terrain tiles that is dynamically placed so that the camera always resides in the middle tile. This means that even though we dynamically generate the scenery around the camera, we still have to deal with a scene consisting of over 400 million triangles. This is simply too much for the GPU to deal with.
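The totals above can be checked with a few lines of arithmetic. This is only a rough sketch: the terrain and tree counts are the rounded-down figures quoted in the text ("over 130 thousand", "more than 1400"), so the real numbers are somewhat higher, and the constant names are made up for illustration.

```python
# Rough triangle budget for the scene, using rounded figures from the text.
TERRAIN_TRIS_PER_TILE = 130_000  # "over 130 thousand" (rounded down, assumption)
TREES_PER_TILE = 1_400           # "more than 1400" (rounded down, assumption)
TRIS_PER_TREE = 5_811            # full-detail tree model
GRID_SIDE = 7                    # 7x7 tiles around the camera


def tile_triangles():
    """Triangles in one terrain tile including all of its trees."""
    return TERRAIN_TRIS_PER_TILE + TREES_PER_TILE * TRIS_PER_TREE


def scene_triangles():
    """Triangles in the full 7x7 grid before any culling."""
    return GRID_SIDE * GRID_SIDE * tile_triangles()


print(f"per tile: {tile_triangles() / 1e6:.1f} M triangles")  # per tile: 8.3 M triangles
print(f"7x7 grid: {scene_triangles() / 1e6:.0f} M triangles")  # 7x7 grid: 405 M triangles
```

Almost all of the budget comes from the trees, which is why the per-instance culling and LOD selection described below matter far more than anything done to the terrain itself.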

The first step in reducing the geometric complexity of the scene is done on the CPU by performing view frustum culling on a per-tile basis. This limits our 7×7 grid to a smaller subset that contains only those tiles that lie within the view frustum. The result is a scene of usually around 50 million triangles.
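A minimal CPU-side sketch of this step might look like the following. It is not the demo's code: the function names, the plane representation `(nx, ny, nz, d)` with "inside" meaning `n·p + d >= 0`, and the use of axis-aligned tile boxes are all assumptions made for illustration.

```python
import math


def tiles_around(camera_x, camera_z, tile_size, grid_side=7):
    """Origins (min_x, min_z) of a grid_side x grid_side block of tiles placed
    so that the camera sits in the middle tile."""
    half = grid_side // 2
    cx = math.floor(camera_x / tile_size)
    cz = math.floor(camera_z / tile_size)
    return [((cx + i) * tile_size, (cz + j) * tile_size)
            for j in range(-half, half + 1)
            for i in range(-half, half + 1)]


def aabb_outside_plane(box_min, box_max, plane):
    """True if the whole box lies on the negative side of the plane."""
    nx, ny, nz, d = plane
    # Pick the box corner that lies farthest along the plane normal.
    px = box_max[0] if nx >= 0 else box_min[0]
    py = box_max[1] if ny >= 0 else box_min[1]
    pz = box_max[2] if nz >= 0 else box_min[2]
    return nx * px + ny * py + nz * pz + d < 0


def visible_tiles(tiles, tile_size, min_y, max_y, planes):
    """Keep the tiles whose bounding box is not fully outside any plane."""
    kept = []
    for x, z in tiles:
        box_min = (x, min_y, z)
        box_max = (x + tile_size, max_y, z + tile_size)
        if not any(aabb_outside_plane(box_min, box_max, p) for p in planes):
            kept.append((x, z))
    return kept


# Usage with a toy "frustum" made of two planes: 0 <= x <= 100.
tiles = tiles_around(50.0, 50.0, 100)
planes = [(1, 0, 0, 0), (-1, 0, 0, 100)]
print(len(visible_tiles(tiles, 100, 0, 50, planes)), "of", len(tiles), "tiles kept")
```

The test is conservative: a tile that merely touches a plane is kept, which is the safe choice for a coarse first pass whose survivors are refined later on the GPU.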

While this is already a reasonable amount of simplification, we have to do per-object culling to further reduce the amount of geometry we render. But as mentioned before, we would not like to do such fine grained scene management on the CPU, so we need some more sophisticated methods to do it on the GPU.

In order to accomplish this, we will take advantage of the geometry shader’s capability of discarding geometry. We will use it to do the per-object decisions in order to cull the tree instances that are not visible. The three techniques implemented in the culling geometry shader and the accompanying vertex shader are the following:

Instance Cloud Reduction (ICR) – This method does view frustum culling on a per-instance basis based on the bounding box of the instanced geometry, in this case the tree. The technique was first presented in my previous article titled Instance culling using geometry shaders and then further improved according to the instructions presented in Instance Cloud Reduction reloaded. In this case, the technique allows us to do a more fine grained yet still high level view frustum culling of the tree instances than that allowed by the simple per-tile culling performed on the CPU.

Hierarchical-Z Map based Occlusion Culling – This technique allows for conservative per-instance occlusion culling done and evaluated completely on the GPU, using an algorithm similar to the one the hardware depth buffer uses to hierarchically reject fragments based on their depth values. Using this technique, a coarse occlusion culling can be performed on the instances without the need for occlusion queries and CPU intervention. Update! The technique is discussed in detail in the article Hierarchical-Z map based occlusion culling.

Dynamic Level-of-Detail Determination – This method allows us to dynamically select a suitable geometry level-of-detail on a per-instance basis completely on the GPU, based on the application provided LOD parameters and the distance of the instance from the camera. The Mountains demo uses three LOD levels for the tree object: one with 5811 triangles, another with 2893 triangles, and the lowest detail version with 1492 triangles. Update! The technical details of the algorithm are presented in the article GPU based dynamic geometry LOD.
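To make the Hi-Z map idea from the list above more concrete, here is a small CPU-side illustration. It is a sketch under stated assumptions rather than the demo's shader code: the depth buffer is a square, power-of-two 2D list, larger depth means farther away, and the names `build_hiz` and `is_occluded` are hypothetical.

```python
def build_hiz(depth):
    """Build a mip chain in which each texel stores the farthest (maximum)
    depth of the 2x2 texels below it. `depth` is a square, power-of-two
    2D list of depth values where larger means farther."""
    chain = [depth]
    while len(chain[-1]) > 1:
        src = chain[-1]
        n = len(src) // 2
        chain.append([[max(src[2 * y][2 * x], src[2 * y][2 * x + 1],
                           src[2 * y + 1][2 * x], src[2 * y + 1][2 * x + 1])
                       for x in range(n)] for y in range(n)])
    return chain


def is_occluded(chain, x0, y0, x1, y1, nearest_depth):
    """Conservative occlusion test for a screen-space rectangle given in
    pixels (inclusive bounds) and the nearest depth of the object."""
    size = max(x1 - x0, y1 - y0) + 1
    # Smallest level whose texels are at least `size` pixels wide, so the
    # rectangle covers at most 2x2 texels at that level.
    level = min((size - 1).bit_length(), len(chain) - 1)
    mip = chain[level]
    tx0, ty0, tx1, ty1 = x0 >> level, y0 >> level, x1 >> level, y1 >> level
    farthest = max(mip[ty][tx]
                   for ty in range(ty0, ty1 + 1)
                   for tx in range(tx0, tx1 + 1))
    # Occluded only if the object is behind everything stored in that region.
    return nearest_depth > farthest


# Usage: a 4x4 depth buffer with a "hill" (depth 0.3) covering the left half.
depth = [[0.3, 0.3, 1.0, 1.0] for _ in range(4)]
hiz = build_hiz(depth)
print(is_occluded(hiz, 0, 0, 1, 1, 0.5))  # fully behind the hill
print(is_occluded(hiz, 1, 0, 2, 1, 0.5))  # partly over open sky
```

Storing the maximum depth is what makes the test conservative: an instance is rejected only when it lies behind the farthest occluder sample in the region it covers, so a visible instance is never culled, while some hidden ones may survive.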

While in the Mountains demo all these techniques are used to determine the visibility and the LOD of static scenery (as trees are unlikely to move), these methods also apply to dynamic scenery without modification. This is very important to note, as dynamic objects are usually what make many CPU based scene management and visibility determination algorithms difficult to use or simply inefficient.

The key improvement compared to how these techniques are used in AMD’s demo is that my implementation applies all the algorithms to the instance set in a single rendering pass compared to the several passes needed by the original implementation. This is because the Mountains demo takes advantage of the latest technologies introduced by OpenGL 4.0 and the supporting hardware (in this case the functionality provided by the extension GL_ARB_transform_feedback3).
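The single-pass idea can be illustrated on the CPU as one loop that culls each instance and appends the survivors to one output list per LOD. This is only a sketch: it assumes each LOD gets its own output stream, tests the instance center rather than its bounding box, uses a symmetric frustum looking down +z in view space, and takes the LOD distance thresholds as made-up parameters.

```python
import math

LOD_TRIANGLES = (5811, 2893, 1492)  # tree LOD triangle counts from the article


def cull_and_bin(instances, tan_half_fov, near, far, lod_distances,
                 occluded=lambda p: False):
    """Visit every instance once: frustum test, optional occlusion test,
    LOD selection, then append to the output list of the chosen LOD.
    `instances` are view-space (x, y, z) positions with +z pointing forward."""
    streams = [[] for _ in LOD_TRIANGLES]
    for p in instances:
        x, y, z = p
        if not (near <= z <= far):
            continue  # in front of the near plane or beyond the far plane
        if abs(x) > z * tan_half_fov or abs(y) > z * tan_half_fov:
            continue  # outside the side planes of the frustum
        if occluded(p):
            continue  # e.g. rejected by a Hi-Z map lookup
        dist = math.sqrt(x * x + y * y + z * z)
        # LOD index = number of distance thresholds the instance is beyond.
        lod = sum(dist >= d for d in lod_distances)
        streams[lod].append(p)
    return streams


def triangles_drawn(streams):
    """Triangles submitted if each stream is drawn with its LOD's mesh."""
    return sum(len(s) * t for s, t in zip(streams, LOD_TRIANGLES))


# Usage: five instances, thresholds at 50 and 150 units (hypothetical values).
instances = [(0, 0, 10), (0, 0, 100), (0, 0, 200), (500, 0, 10), (0, 0, -5)]
streams = cull_and_bin(instances, 1.0, 1.0, 1000.0, (50.0, 150.0))
print([len(s) for s in streams], triangles_drawn(streams))
```

Each resulting list then corresponds to one instanced draw call with the mesh of that LOD, which is why the number of surviving instances per LOD is the only thing the CPU needs to know.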

By using these techniques the GPU is able to reduce the geometric complexity of the scene from 50 million triangles down to a few million, sometimes even under a million. Of course, the actual reduction efficiency is heavily influenced by the view position and direction.

Besides the scene management and visibility determination techniques, the demo also showcases a few simple visual effects:

A simple infinitely far skybox generated using a geometry shader.

Simple diffuse lighting applied to the tree instances.

Global illumination-like effect that simulates the terrain casting shadows over the trees even though no shadow rendering technique is applied.

Fog effect to smooth out the disappearance of the terrain at the far clip plane.

Simplistic fake depth-of-field effect that makes far away objects look blurry.

I may also present some of these techniques in detail in another article if there is interest.

As I mentioned, I used a geometry shader to render the skybox, and I did the same when rendering full screen quads to apply image space algorithms. I’ve done this because I always feel kind of stupid when I have to put such simple geometry as a skybox or a full screen quad into a vertex buffer. In these situations I feel like simply using immediate mode to draw that damn little piece of geometry, but I want to stick to core OpenGL so I quickly change my mind. As a simple alternative, I used geometry shaders to emit these simple geometric objects, which are used so often that I even wonder why OpenGL does not have e.g. a glDrawScreenQuad-like command. Of course, the geometry shaders don’t start by themselves, so I used dummy draw commands to make the geometry shader do its job.

View horizon and sky

Performance

Now let’s see how our GPU based optimizations perform in practice. I’ve collected results from typical view positions from where a moderate number of trees are visible. The tests were done on a Radeon HD 5770. Other configuration parameters are not really relevant, as the demo is clearly GPU bound: only a few state changes and render commands are executed on the CPU. Of course, this is kind of a synthetic demo, as you would normally want to balance the workload between the CPU and the GPU. However, the CPU usually has AI, physics and other things to handle, so transferring as much work to the GPU as possible tends to give a great benefit.

Performance comparison of the demo in frames per second on a Radeon HD5770 (higher is better): no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).

As the figure above shows, enabling all the optimizations clearly benefits the frame rate of the demo, even though Hi-Z map based occlusion culling requires several additional draw passes to construct the Hi-Z map. It is also clear that in a scene like this, with a lot of occluders, ICR is simply not sufficient on its own. One final note: dynamic LOD has a more significant effect without Hi-Z, as occlusion culling already removes the largest share of the instances.

Amount of visible geometry after culling in millions of triangles: no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).



The next chart shows the amount of geometry that is finally drawn after culling, in millions of triangles. This figure is essentially the inverse of the previous chart, which is not surprising as we obviously have a geometry throughput bottleneck. It also clearly shows how important dynamic LOD is even if we don’t use more sophisticated visibility determination algorithms.

| Culling method | No LOD | Dynamic LOD |
| No culling | 17 draw calls | 19 draw calls |
| Instance cloud reduction | 17 draw calls | 19 draw calls |
| ICR + Hi-Z map based occlusion culling | 27 draw calls | 29 draw calls |

Finally, the table above lists the number of draw calls needed by each technique from the reference point of view. The techniques applied do not have a significant effect on the number of draw calls: we have a fixed number of draw calls, plus two more if we use LOD. The only exception is Hi-Z map based occlusion culling, as the Hi-Z map is a full mipmap chain and we need ten additional draw calls to generate all the mip-levels.
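The figure of ten extra draw calls can be related to the size of the Hi-Z map with a short calculation. The article does not state the map's resolution, so the sizes below are assumptions; the sketch also assumes one draw call per generated mip level below the base level, and the function names are made up.

```python
def mip_levels(width, height):
    """Number of levels in a full mipmap chain: floor(log2(max dim)) + 1."""
    return max(width, height).bit_length()


def hiz_downsample_passes(width, height):
    """One pass per level below the base level (assumption)."""
    return mip_levels(width, height) - 1


# Any map whose largest dimension is between 1024 and 2047 texels needs
# ten downsampling passes, matching 17 + 10 = 27 and 19 + 10 = 29 above.
for size in ((1024, 1024), (1280, 720), (2048, 2048)):
    print(size, hiz_downsample_passes(*size))
```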

Conclusion

The techniques presented are rather simple to implement and can provide huge performance increases. Moreover, they allow the renderer to offload even some of the object-level algorithms from the CPU to the GPU, and obviously this is the direction to go in the future.

We’ve also mostly met the goals set at the beginning. Not fully, of course, as the occlusion culling performed is a rather coarse method and does not completely eliminate all the instances that will not contribute to the final image.

Future work

While the implementation almost completely eliminates the need for CPU intervention during the rendering phase, I still had to use a few asynchronous queries to get the number of visible instances for each geometry LOD. The latency incurred by the use of query objects is hidden in the demo by rendering the skybox between issuing the queries and retrieving the results.

As soon as we get atomic counters into core OpenGL and consequently when we’ll have drivers supporting it, I will further improve the technique using indirect rendering and atomic counters so even the need for these queries will be eliminated.

Additionally, as mentioned several times, I plan to write detailed articles about the individual techniques I used in the demo. I decided to go in this direction as a thorough description of all the details of the demo would be simply too long in one piece.

Deep in the forest

Running the demo

The demo uses OpenGL 4.0, so a Shader Model 5.0 capable graphics card is a must. Even though most of the techniques used could be implemented on OpenGL 3.x, this time I wanted to stick to GL 4.0, as I took advantage of its new features to further improve the implementation.

First, don’t be alarmed if the demo runs at a very low frame rate after startup. This is because all GPU based optimizations are disabled by default.

You can use the SPACE button to switch between the various culling methods:

No culling at all

Instance cloud reduction

ICR with Hi-Z map based occlusion culling

Finally, you can turn dynamic LOD on and off using the F3 key.

There are a few other controls present in the demo that you may figure out if you read the code, but I don’t want to go into their details here, as they will be covered in the upcoming articles where I will present Hi-Z map based occlusion culling and dynamic LOD in detail. So stay tuned: follow me on Twitter or subscribe to the RSS feed.

Links: source code, Win32 binary