Adventures in Optimization…

Shortly after my last post here, I added wandering nomads to the game. They occasionally come to the town, and the player can either allow or deny them citizenship. There might be consequences to turning them away too many times… but that’s a feature for another day.

After adding the nomads I was in a graphics sort of mood, and so I started looking at adding some atmospheric effects to the game – mist, fog, rain, snow, and changing lighting conditions throughout the year. I was looking at the shader programs that make the game pretty, and started wondering about graphics performance – thinking I should look at optimizing things before adding more GPU overhead. Which is very dangerous for me, being a graphics programmer. I can tinker forever trying this and that to make the same scene display at a higher framerate.

So I took a long road down optimization lane, and with it came some serious coding and learning a new graphics API. This is probably too long and has too much information about graphics programming, so I understand if you feel like saying TL;DR, finish the game already. 🙂

Let’s look at my test scenes. One is a view over a town in winter, and the other is the same town seen from high above.

The first scene has 2409 objects submitted to the GPU, and consists of ~817,000 triangles.



The second scene has 4617 objects submitted to the GPU, and consists of ~1,243,000 triangles.



From early on my graphics engine has been able to render the same object in multiple locations with a single draw call. This is called instancing. So instead of having to call Draw() 2409 times for the first scene, the engine can batch all similar trees together, or all the rocks together, or all the houses together and submit them in 411 calls to Draw(). This is good since each draw call requires some state change for the GPU, and it (usually) saves CPU time as well.

Up until last week, my engine used DirectX 9.0c and shader model 2. I initially chose this for the widest possible support of video cards and older computers. If you know anything about shader model 2, it doesn’t support instancing. So if you want it, you have to fake it. This is done using data repetition. Each mesh is repeated in memory some number of times, but each repetition has a different ‘id’ encoded with each vertex. This id is used to look up into a table of transforms. (A transform for those non-graphics people is just a location, orientation, and scale…) This is very much like hardware skinning – the deformation that takes place to bend a model around its skeletal structure. Due to limitations on the number of constants that can be set for shader model 2, the engine can draw up to 52 objects at once.

To draw multiple copies of an object the graphics code does something like:

device->SetVertexShaderConstantF(transformListIndex, transformData, transformCount * 4);
device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, startVertex, 0,
    numVertex * instanceCount, startIndex, primitiveCount * instanceCount);

The vertex shader then does something like this:

uniform float4x4 tc_transforms[52];

struct input
{
    float4 position : POSITION; // instance id encoded in w
};

float4x4 localToClip = tc_transforms[input.position.w];
float4 localPosition = mul(localToClip, float4(input.position.xyz, 1.0));

This method works great, and is very fast. However, the mesh vertex and index data are repeated 52 times! Think about the memory requirements for this in the scope of all the meshes in the game…. For dynamic vertices, like particle systems, drawing instances is worse, since the engine has to manually copy the data N times and set the vertex id value. memcpy() starts to show up in the profiler for heavy scenes.

To alleviate this memory requirement (which is getting huge, btw), I decided to try the instancing method available in shader model 3. In shader model 3, you can mark a vertex stream as instanced. This was pretty quick to add to the engine. I’ve got all the graphics code isolated to a few files, so in a few hours I had this new method working. Instead of shoving transforms into vertex constants, they instead get copied into a vertex buffer. At draw time, you do something like:

device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1);
device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | instanceCount);
device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, startVertex, 0,
    numVertex * instanceCount, startIndex, primitiveCount);

You then change the vertex declaration to include the ‘instance’ inputs, and the shader then loads the transform like so:

struct input
{
    float4 position : POSITION;
    float4 w0 : TEXCOORD0; // transform row 1
    float4 w1 : TEXCOORD1; // transform row 2
    float4 w2 : TEXCOORD2; // transform row 3
    float4 w3 : TEXCOORD3; // transform row 4
};

float4x4 localToClip = transpose(float4x4(input.w0, input.w1, input.w2, input.w3));
float4 localPosition = mul(localToClip, float4(input.position.xyz, 1.0));
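For reference, a D3D9 vertex declaration for this kind of layout might look like the following – a sketch, not the engine’s actual declaration. Stream 0 holds the per-vertex positions, and stream 1 holds the per-instance transform as four float4 rows mapped to TEXCOORD0-3:

```cpp
// Illustrative instanced vertex declaration (variable names are made up).
D3DVERTEXELEMENT9 elements[] =
{
    { 0, 0,  D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 1, 0,  D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    { 1, 16, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1 },
    { 1, 32, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 2 },
    { 1, 48, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 3 },
    D3DDECL_END()
};
device->CreateVertexDeclaration(elements, &instancedDeclaration);
```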

After changing all the vertex layouts and vertex programs to do this, I took some frame rate measurements. The scene was tested at 1600×900, 2xMSAA, 2x anisotropic filtering, a 2048 shadow map, and a 5-tap PCF kernel for sampling the shadow map. The test machine is an i7 920 @ 2.67GHz with an NVIDIA GeForce GTX 280.

          Shader Model 2    Shader Model 3
Scene 1   103 FPS           85 FPS
Scene 2   63 FPS            55 FPS



What!?! Why is the shader model 3 method 10-20% slower? This sort of thing frustrates me. I make a change that’s supposed to be either faster or the same speed, and it’s actually slower. After making sure I wasn’t doing something bad when filling the transform buffer, I pulled out a GPU profiler to see where the difference was.

You’ve probably heard talk about games being either vertex bound or pixel bound. This just means that the GPU is spending more time working on the vertices of triangles, or more time filling pixels on screen. The truth is that most games are both pixel and vertex bound at different times. If a game uses shadow maps, it’s probably vertex bound while rendering the shadow map. Modern GPUs are crazy fast at filling the pixels used for shadow maps. So fast, in fact, that the GPU spends most of its time loading vertex data and running vertex shaders, and that isn’t fast enough to generate enough work for the pixel pipeline, even with load balancing.

My game is bound like this. It’s also vertex bound on characters and animals – they’re small on screen, but the GPU spends more time deforming the meshes than shading pixels. This probably means I need lower-poly meshes and LODs, but that’s another task for another day.

However, for other objects – when the game is rendering buildings and terrain – more time is spent on pixels than on vertices.

What’s happening to make the shader model 3 version slower is that the areas that are vertex bound become even more vertex bound because of the additional data loaded per vertex. When drawing an object for a shadow, the shader model 2 version only has to load 8 bytes of data per vertex. The shader model 3 version has to load 72. That’s a lot more memory traffic, and the only explanation I have for the slow-down.

While the shader model 3 version is slower, the memory consumption of the application drops from 270Meg to 172Meg. That’s nearly 100 megabytes of repeated mesh data!

At this point, I have an idea to have both low memory usage and faster render time, but I know it’s going to take a few days to implement. Really, the game runs fine, I should focus on gameplay features, and I don’t need to write a DirectX 11 render path but….

I don’t think DX10/11 is very well accepted or widely used yet. If I have a question about something in DX9, I can google it and have an answer in seconds. The same can’t be said of DX11. There are very few examples. And while the documentation is there, there are a lot of hidden issues that you only find by reading the debug output from DX11 once you start using it. I’d hate to be someone who never used DX9 and jumped right into DX11.

I had a very slow start getting DX11 up and running. Apparently automatic updates installed Internet Explorer 10 on my development machine, which breaks the DX11 debug output. It also breaks PIX – a tool that lets you capture a frame of a game and examine all the GPU calls and state. I use it all the time to take the guesswork out of rendering errors. It’s like a debugger for graphics.

The fix for this is to use Windows SDK 8.0. At the same time I figured I’d update to Visual Studio 2012. Once compiling with the new SDK, I discover that XAudio 2.8, which is all that ships with SDK 8.0, isn’t available for Windows 7. So I hack things up to use the old SDK to get XAudio 2.7 while still using the new DirectX 11.1. This all finally works, but PIX is still broken.

Finally I just uninstall IE10 and related updates, since I don’t use it anyway. Now PIX works, and the DX11 debug layer works. And back to Visual Studio 2010. On with coding….

The DirectX graphics interface for my engine is only about 60K of code, so writing the same small bit for DX11 was pretty quick. I spent more time writing a shader generation system so that I didn’t have to write different vertex and pixel programs for shader model 2/3 vs 4. Texture sampling and shader inputs and outputs are significantly different between the shader models. I also spent a fair amount of time debugging and making sure I wasn’t doing anything to cause the GPU to stall.

In shader model 4, there is this great input called SV_INSTANCEID. It gives you the index of the instance the GPU is working on. This is exactly what my initial implementation did, but since I don’t have to supply the index myself, there’s no need for data repetition.

The draw call becomes

context->DrawIndexedInstanced(primitiveCount * 3, // index count per instance for a triangle list
    instanceCount, startIndex, startVertex, 0);

The shader looks like:

cbuffer tc
{
    float4x4 tc_transforms[128];
};

struct input
{
    float4 position : POSITION;
    uint instanceid : SV_INSTANCEID;
};

float4x4 localToClip = tc_transforms[input.instanceid];
float4 localPosition = mul(localToClip, float4(input.position.xyz, 1.0));

This is fantastic. DX11 also uses the least memory while running the game. 94Meg for the test scene.

While writing the DX11 implementation and reworking the vertex and pixel programs for shader model 4, I found a bunch of problems, such as unneeded input assembler loads, floating point exceptions, and render state that was needlessly making things slower. Because of those fixes, my shader model 2 implementation runs at 130 FPS instead of the original 103 FPS. Also fantastic.

Here’s my final resulting frame rates for my two systems. All results are GPU limited. CPU time is under 4ms in all cases.

The first scene has 2409 objects, and has ~817,000 triangles.

The second scene has 4617 objects, and has ~1,243,000 triangles.

Test System 1

i7 920 @ 2.67GHz, NVIDIA GeForce GTX 280

1600×900, 2xMSAA, 2x anisotropic filtering, 2048 shadow map, 5-tap PCF shadow kernel

          Shader Model 2    Shader Model 3     Shader Model 4
Scene 1   130 FPS (7.7ms)   100 FPS (10.0ms)   118 FPS (8.5ms)
Scene 2   77 FPS (13ms)     61 FPS (16ms)      71 FPS (14ms)



Test System 2

i5 M480 @ 2.67GHz, NVIDIA GeForce 610M

1280×720, no MSAA, trilinear filtering, 1024 shadow map, 5-tap PCF shadow kernel

          Shader Model 2    Shader Model 3    Shader Model 4
Scene 1   33 FPS (30.3ms)   26 FPS (38.5ms)   48 FPS (20.8ms)
Scene 2   21 FPS (47.6ms)   17 FPS (58.8ms)   28 FPS (35.7ms)



What does this tell me? It tells me I possibly still have something wonky in my DX11 implementation, since it’s still 0.8ms slower than the DX9 shader model 2 version. However, on the laptop GPU, the results are phenomenal. A decrease of nearly 10ms in the first test scene and 12ms in the second is pretty amazing just for an API change.

It also tells me that I probably won’t ship the shader model 3 version. While it does use less memory, I’d prefer a better gameplay experience for those with older systems and video cards, and I can tweak the memory used for each model. Trees and rocks can have the full 52 copies of the mesh data, but buildings and other things that will never reach 52 on screen at once can have only 2-5 copies. This will bring memory consumption down to reasonable levels, although it does require a per-asset tweak.

The DX11 memory usage is really good, and I could probably get the DX9 version down even further by not using the D3DPOOL_MANAGED flag on resources, but then alt-tabbing away from and back to the application becomes annoying, since I have to manually load all graphics resources from disk again. I’d much rather have the switch be immediate.

Was this week and a half of trying different instancing methods worth it? For sure. The original implementation now runs 2ms faster (103 to 130 FPS), and those with DX10 level video cards will get a performance boost on some systems. While writing the DX11 code, I treated it as a different platform. This makes me more confident about porting the game to other systems (like ones that use OpenGL), as the functionality is now there for making ports and dealing with different data per platform.

Now back to that mist, fog, rain, snow, and changing lighting conditions….