WARNING: This is a technical graphics rendering post! If you are not interested in the nitty gritty of OpenGL performance optimization and what I did this week, then what you should know is “Hey, we’re making the game faster!” and “Well, you should probably update your drivers if you haven’t done so already.” The fact of the matter is that this is a blog post that will be most interesting to people with an interest in graphics programming, but this is a game development blog after all, and this is game development!

(We’ve left a whacky animated gif at the bottom of this post as a reward for making it all the way through. No cheating! -ed.)

One of the things that has been talked about recently in the graphics programming community is the problem of driver overhead. Every time a game makes a draw call, the driver sits in between the game and the video card and looks at everything the game sends it. It checks to make sure you didn’t feed it garbage and that everything is valid, so that you don’t end up crashing the computer in a fiery blaze of glory or a blue screen of death. These validations add up: every time you switch states, every time you change textures, and every time you draw something, the little man in the driver needs to go through and sort it out, and that slows the game down.

In the build of Clockwork Empires which is currently on Steam (30B), we spend about 30 msec/frame doing rendering. Most of that time is spent on the CPU: setting up state, culling things, and generally doing busywork.

There are a number of traditional ways to try to make this better – geometry instancing, which is something I have been planning on implementing forever, is one of them. In geometry instancing, rather than drawing one cabbage, we draw a hundred cabbages all with one draw call. We batch up all the cabbage positions when we draw cabbage, and do it all at once. If we draw 100 cabbages, we save ourselves 99 trips to the driver. The problem, however, is when you have 15 logs, 12 pipes, 37 planks, 12 crates of beer, 14 fish people heads, 18 trees… what do you do then? Well, ideally, you’d like to draw it all at once.
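To make the "99 trips saved" arithmetic concrete, here is a minimal CPU-side sketch in C++ of the batching step. All of the names (`DrawRequest`, `batchByMesh`, `driverTripsSaved`) are made up for illustration and are not the actual Clockwork Empires code; a real renderer would hand each batch to an instanced draw call rather than just counting.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct Position { float x, y, z; };

struct DrawRequest {
    std::string mesh;   // e.g. "cabbage"
    Position    where;  // where this copy goes
};

// One batch = one instanced draw call: a mesh plus every position it appears at.
using InstanceBatches = std::map<std::string, std::vector<Position>>;

InstanceBatches batchByMesh(const std::vector<DrawRequest>& requests) {
    InstanceBatches batches;
    for (const DrawRequest& r : requests) {
        batches[r.mesh].push_back(r.where);
    }
    return batches;
}

// Without instancing we make one driver call per object; with it, one per
// distinct mesh. The difference is the number of trips to the driver we skip.
std::size_t driverTripsSaved(const std::vector<DrawRequest>& requests) {
    const std::size_t instanced = batchByMesh(requests).size();
    assert(instanced <= requests.size());  // never more batches than objects
    return requests.size() - instanced;
}
```

Feeding this 100 cabbage requests gives one batch and 99 saved calls. Note that it still costs one call per distinct mesh, which is exactly the logs-pipes-planks problem described above.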

Recent initiatives like AMD’s Mantle library, Apple’s Metal library, and DirectX 12 have tried to fix this by providing methods that reduce the amount of driver (for lack of a better word) between the user and the video card. None of this is helpful to us, because we use OpenGL. The people looking at how OpenGL handles this problem (a consortium of half-crazed developers from various IHVs such as AMD and NVIDIA) proposed a solution called Approaching Zero Driver Overhead (AZDO), which postulates “Hey, we don’t actually need to do anything to fix this problem; really, all we need to do is to be creative with some of the extensions we already have.” Their solution:

Batch all your draw calls up at once, using instancing on steroids: put ALL your vertex and index data in one giant vertex buffer object, and have everything reference it (saving that state change).

Put all your texture data in one giant texture array, and let the instancing code (“draw cabbage #73”) figure out which texture in the array to use (“texture 5 – cabbage”).

When writing instancing data to the GPU, put it in GPU memory and use what are called “fence” extensions to make sure the driver only validates what you wrote, and copies it into the right place, at a time when you don’t need that memory anyway.

Reduce the number of shaders as much as possible (which is great, and very compatible with deferred rendering, which we do).
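The first of those points can be sketched on the CPU without touching OpenGL at all. The struct below follows the `DrawElementsIndirectCommand` layout from the OpenGL specification; everything else (`Mesh`, `PackedScene`, `addMesh`) is a hypothetical name invented for this sketch, which assumes one command per mesh and per-instance data laid out in the same order as the commands.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One record per mesh, in the layout OpenGL reads from the indirect buffer.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices in this mesh
    std::uint32_t instanceCount;  // how many copies to draw
    std::uint32_t firstIndex;     // where the mesh starts in the shared index buffer
    std::int32_t  baseVertex;     // added to every index: start in the shared vertex buffer
    std::uint32_t baseInstance;   // where this mesh's per-instance data starts
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "commands must be tightly packed");

struct Vertex { float x, y, z; };

// Hypothetical mesh description; indices are relative to the mesh's own vertices.
struct Mesh {
    std::vector<Vertex>        vertices;
    std::vector<std::uint32_t> indices;
};

struct PackedScene {
    std::vector<Vertex>                      vertices;  // would be uploaded as the one giant VBO
    std::vector<std::uint32_t>               indices;   // would be uploaded as the one index buffer
    std::vector<DrawElementsIndirectCommand> commands;  // would be uploaded as the indirect buffer
    std::uint32_t                            totalInstances = 0;
};

// Append a mesh to the shared buffers and record how to find it again.
void addMesh(PackedScene& scene, const Mesh& mesh, std::uint32_t instanceCount) {
    assert(!mesh.indices.empty());
    DrawElementsIndirectCommand cmd{};
    cmd.count         = static_cast<std::uint32_t>(mesh.indices.size());
    cmd.instanceCount = instanceCount;
    cmd.firstIndex    = static_cast<std::uint32_t>(scene.indices.size());
    cmd.baseVertex    = static_cast<std::int32_t>(scene.vertices.size());
    cmd.baseInstance  = scene.totalInstances;

    scene.vertices.insert(scene.vertices.end(), mesh.vertices.begin(), mesh.vertices.end());
    scene.indices.insert(scene.indices.end(), mesh.indices.begin(), mesh.indices.end());
    scene.commands.push_back(cmd);
    scene.totalInstances += instanceCount;
}
```

Because `baseVertex` is added to each index by the GPU, the mesh's indices can be copied in unchanged, which keeps the packing step a plain append.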

What is not mentioned in any of the technology documentation provided by the IHVs is how to actually structure this to make it work without driving yourself around the bend. Clockwork Empires already has a renderer that operates in two parts – a frontend, and a backend. The frontend does things like “update all the animations” and “walk the scene, figuring out what’s visible and what’s not visible, and put all the visible stuff in a list of things to draw.” Any visible thing (static model, skeletal model, rigid body model, particle system, landscape chunk, ground polygon, etc.) just dumps a big bunch of “polygon soup, shader, texture, rendering state flags” to the backend. The backend then sorts the pile of soup/shader/texture/flags in order to reduce the number of driver calls it has to make anyway, and fires it off at the driver.
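As a rough illustration of that backend sort, here is one common way to do it: pack the expensive state into a single integer key and sort on that. This is a sketch under assumptions, not our actual backend. The `DrawItem` fields, the 16-bit limits, and the choice to rank shader above flags above texture are all invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record the frontend hands to the backend.
struct DrawItem {
    std::uint32_t shader;      // assumed to fit in 16 bits
    std::uint32_t texture;
    std::uint32_t stateFlags;  // assumed to fit in 16 bits
    std::uint32_t mesh;        // handle to the polygon soup
};

// Most expensive state change goes in the highest bits, so items that share
// a shader end up next to each other, then items that share flags, and so on.
std::uint64_t sortKey(const DrawItem& d) {
    assert(d.shader < (1u << 16) && d.stateFlags < (1u << 16));
    return (static_cast<std::uint64_t>(d.shader) << 48) |
           (static_cast<std::uint64_t>(d.stateFlags) << 32) |
           static_cast<std::uint64_t>(d.texture);
}

// Stable, so items with identical state keep the order the frontend gave them.
void sortForBackend(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return sortKey(a) < sortKey(b);
                     });
}

// Counts how many times the backend would have to bind a different shader.
std::size_t countShaderBinds(const std::vector<DrawItem>& items) {
    std::size_t binds = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].shader != items[i - 1].shader) ++binds;
    }
    return binds;
}
```

With four items whose shaders alternate 1, 2, 1, 2, the unsorted list needs four shader binds and the sorted one needs two.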

The first trick to moving to an AZDO renderer is to work incrementally. In my case, I worked in the following order: get everything in the big array, get it using the right function, then get the texture arrays working. Thankfully the actual function that does the work (glMultiDrawElementsIndirect, from the ARB_multi_draw_indirect extension) is built on three other extensions, each of which wraps the functionality of the previous one up in a bigger, more efficient function with fewer draw calls. Each of these previous extensions corresponds to “the next level of AZDO”, so you can figure out what goes wrong one step at a time rather than trying to make everything efficient all at once: first put all your vertices and indices in one giant array and access it normally; then use normal instancing to draw everything in it; then use the big fancy dispatch function. We then do the same for texture arrays.
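The layering is easier to see in code. The OpenGL specification describes a multi-draw-indirect call as behaving like a loop over the command records, each one turning into an instanced draw with a base vertex and base instance. The sketch below emulates that loop on the CPU, with a callback standing in for the real instanced draw so it runs without a GL context; `InstancedDraw` and `emulateMultiDrawElementsIndirect` are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Same record layout the indirect buffer uses.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Stand-in for one instanced draw of a sub-range of the big buffers.
// In a real renderer this is where the GL call would go.
using InstancedDraw = std::function<void(std::uint32_t count,
                                         std::uint32_t firstIndex,
                                         std::uint32_t instanceCount,
                                         std::int32_t  baseVertex,
                                         std::uint32_t baseInstance)>;

// What the "big fancy dispatch function" does for you in one driver call:
// walk the command list and issue one instanced draw per record.
void emulateMultiDrawElementsIndirect(
        const std::vector<DrawElementsIndirectCommand>& commands,
        const InstancedDraw& draw) {
    assert(draw && "need a draw callback");
    for (const DrawElementsIndirectCommand& c : commands) {
        draw(c.count, c.firstIndex, c.instanceCount, c.baseVertex, c.baseInstance);
    }
}
```

Running the emulated loop first, with real per-record draws, and only then swapping in the single multi-draw call is one way to follow the step-at-a-time order described above: if the loop renders correctly and the multi-draw does not, the problem is in the indirect buffer rather than in the packing.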

Performance is excellent, for what that’s worth. I’m not done yet, but we have now gone from 40 msec/frame to 10 msec/frame on my development box, and have eliminated a lot of needless CPU overhead. There is still some low-hanging fruit to pick, and I haven’t quite done everything I need to do, but I’m pretty happy.

Some warning notes and general musings for developers:

If you’re going to try this on your engine at home, upgrade your drivers first. Nothing is more frustrating than running the NVIDIA sample code on the NVIDIA graphics card to see if you’ve missed something debugging your own implementation, and having it crash. 🙂

glMapBufferARB() will stall everything, unless you feed it correctly. If you don’t get a performance increase, mark up your source tree with your favourite profiler and see whether you are actually getting a speedup or whether you have simply created another stall. Right now we keep a CPU copy of all the matrices and texture IDs, and pass it along with glBufferData(); this will eventually use the half-crazed, fenced triple-buffering that Cass Everitt’s presentation recommends, but we have more immediate bottlenecks.
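For anyone wondering what fenced triple-buffering amounts to, here is a CPU-only simulation of the bookkeeping. It is a sketch of the general idea, not the scheme from the presentation and not something we have shipped. In a real renderer the buffer would be one persistently mapped allocation split into three regions, and the "has the GPU finished frame N" question would be answered by a sync object (glFenceSync and glClientWaitSync) rather than by the `gpuCompletedFrame` number that this example takes as a parameter. `FencedRing` is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

class FencedRing {
public:
    static constexpr std::size_t kRegions = 3;

    // Returns the region the CPU may write this frame, or -1 if the GPU has
    // not finished reading it yet (real code would wait on the fence here).
    int beginFrame(std::uint64_t gpuCompletedFrame) const {
        assert(gpuCompletedFrame < frame_);  // GPU cannot finish frames we never submitted
        const std::size_t region = currentRegion();
        if (lastReadBy_[region] > gpuCompletedFrame) {
            return -1;  // writing now would stomp on data the GPU still needs
        }
        return static_cast<int>(region);
    }

    // Call after submitting the draws that read the region: this is the
    // point where real code would insert a fence for the region.
    void endFrame() {
        lastReadBy_[currentRegion()] = frame_;
        ++frame_;
    }

private:
    std::size_t currentRegion() const {
        return static_cast<std::size_t>((frame_ - 1) % kRegions);
    }

    std::uint64_t frame_ = 1;                            // frames are numbered from 1
    std::array<std::uint64_t, kRegions> lastReadBy_{};   // 0 means "never used"
};
```

With three regions the CPU can get up to three frames ahead of the GPU before it has to wait, which is the whole point: the map-and-write never blocks as long as the GPU is keeping up.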

So how do you structure an engine around AZDO if you have a bunch of things in your engine that you want to do? Very often game development presentations from IHVs are of the form “here are some best practices” with no actual explanation as to how to make a thing that satisfies the best practices AND the demands of the game/your art department.

I am almost convinced that, from a logistics perspective, the right way to organize all the AZDO stuff in your codebase is to build one set of buffers for instancing and texture lookups per shader. This is very consistent with the idea of data-oriented design that has seized the industry by force. You typically batch meshes in a game by shader, there aren’t many shaders, and that’s the place in a deferred renderer where the nature of your data transform changes (we do different things in the vertex shader to skin a model versus a static model, we do different things in the fragment shader to render models with glow versus without glow, or whatever.) Obviously you then want to make your shaders as general purpose as possible – which is great if you’re filling a deferred rendering pipeline.
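A minimal sketch of what "one set of buffers per shader" could look like on the CPU side, before anything is uploaded. The names (`ShaderBatch`, `FrameBatches`) and the choice of a plain matrix plus a texture-array layer per instance are assumptions made for the example; the real per-instance payload would depend on what each shader needs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using Mat4 = std::array<float, 16>;

// Per-instance data for everything drawn with one shader, stored as
// parallel arrays so each can be uploaded as one contiguous block.
struct ShaderBatch {
    std::vector<Mat4>          transforms;     // one per instance
    std::vector<std::uint32_t> textureLayers;  // index into the texture array, one per instance
};

class FrameBatches {
public:
    // Called by the frontend for every visible thing.
    void submit(std::uint32_t shaderId, const Mat4& transform, std::uint32_t textureLayer) {
        ShaderBatch& batch = batches_[shaderId];
        batch.transforms.push_back(transform);
        batch.textureLayers.push_back(textureLayer);
        assert(batch.transforms.size() == batch.textureLayers.size());
    }

    // Empties every batch but keeps the allocations for the next frame.
    void clear() {
        for (auto& entry : batches_) {
            entry.second.transforms.clear();
            entry.second.textureLayers.clear();
        }
    }

    const std::map<std::uint32_t, ShaderBatch>& batches() const { return batches_; }

private:
    std::map<std::uint32_t, ShaderBatch> batches_;  // keyed by shader id
};
```

The backend then walks `batches()` once per frame: bind the shader, upload its two arrays, issue one multi-draw. Since there are only a handful of shaders, the map stays tiny.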

I don’t have a good solution to the problem of “artists use lots of textures in different sizes” yet; right now, I’m assuming that our largest texture is 512×512 and am eating the cost of unused memory on the GPU until I get it working a bit better. If you create an array larger than available texture memory, you can and will cause Troubles. The correct answer, from an academic, pie-in-the-sky perspective, is to use sparse bindless textures; however, we can’t guarantee that these extensions exist on our target hardware.
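To put a number on "eating the cost of unused memory", here is a small sketch that adds up how much of a texture array is padding when every layer is as big as the largest texture. It assumes uncompressed 4-byte texels and no mipmaps, which is a simplification chosen for the example and not a statement about our actual texture formats; `textureArrayCost` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption for this sketch: uncompressed RGBA8, no mipmaps.
constexpr std::uint64_t kBytesPerTexel = 4;

struct ArrayCost {
    std::uint64_t allocated = 0;  // bytes the texture array takes on the GPU
    std::uint64_t used      = 0;  // bytes the artists' textures actually need
    std::uint64_t wasted() const { return allocated - used; }
};

// Every texture gets a full layerSize x layerSize layer, whatever its real size.
ArrayCost textureArrayCost(
        const std::vector<std::pair<std::uint32_t, std::uint32_t>>& sizes,
        std::uint32_t layerSize) {
    ArrayCost cost;
    const std::uint64_t layerBytes =
        static_cast<std::uint64_t>(layerSize) * layerSize * kBytesPerTexel;
    for (const auto& wh : sizes) {
        assert(wh.first <= layerSize && wh.second <= layerSize);  // must fit in a layer
        cost.allocated += layerBytes;
        cost.used += static_cast<std::uint64_t>(wh.first) * wh.second * kBytesPerTexel;
    }
    return cost;
}
```

Under those assumptions, three textures of 512×512, 256×256 and 128×128 in a 512×512 array take 3,145,728 bytes, of which 1,769,472 are padding. The smaller the average texture, the worse the ratio gets.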

As a side benefit of all this, I am working on removing a lot of dead code from the renderer. More on this later.

The current speed issues in the renderer are now caused by two things: poor quadtree performance (in particular, cache misses) and a very slow function called when clicking and dragging selection boxes. Both of these can and will be fixed. I’ve also fixed a bunch of crash bugs caused by graveyards, and by dropping a shovel infinitely, over and over again. Stuff will be put into the experimental tree as soon as it is stable, which will hopefully be before I take off for PAX Friday morning.