Performance Optimizations


Normally the last few weeks of development before launch are really stressful because everything has to come together in a nice clean package. Usually this means frantically fixing the remaining bugs and making sure the game runs great on all supported hardware, while at the same time polishing the game. The last weeks are critical to achieving that certain high quality feel. That’s why this time I’ve decided to ease the pain of the last few weeks with a head start. In this blog post I’ll talk about the most important performance optimizations that I’ve been working on for the last two weeks.

Performance Profiler

Some might think that optimizing is boring and tedious because the work is very time consuming and nothing seems to happen to the game on the surface. To help with this I’ve turned optimizing into a sort of game for myself. Whatever I’m optimizing, I first come up with some metric, usually a number whose value I can track and try to make as small as possible. I set up a challenge for myself to see how low I can push that number. Some things are easy to measure, like the memory consumption of the process with Windows Task Manager or the time needed to process a single frame. But to really dig deep into performance issues it’s necessary to break down the measurements into smaller bits to get a better idea of what to optimize. Therefore I built a performance profiler directly into the game that I can summon with a press of a button.

The profiler shows milliseconds and percentage of frame time spent in each subsystem of the game, and various other statistics. To run at 60 frames per second the computer can spend up to about 16 milliseconds to process each frame, including updating the game world and rendering the view. The profiler also shows the amount of temporary memory allocations (Malloc column) during the frame — more about that later.

Memory optimizations

Legend of Grimrock 2 is a 32-bit application on Windows, which means that the application can use up to 2GB of memory without hacks. To make matters worse, in my experience the real limit is closer to 1.5GB, presumably because DirectX resources eat up virtual address space. Textures and other assets eat a lot of memory, so 1.5GB is not that much today. Optimizing memory usage also helps with load times, and can potentially increase overall performance too. It’s important to get the memory usage as low as possible without sacrificing the quality of assets.

After cleaning up unused assets we determined that we still needed to shave off some more memory, so I began looking into what could be done on the code side. One easy optimization, which was already planned for Grimrock 1 but which I never had time to work on, was a simple animation compression technique. In Grimrock 1 animations are stored as an “array of structs”, where the struct contains position, rotation and scale. This was not optimal because, for example, scaling is very rarely used. In fact most skeletal animation nodes have only rotation movement. A simple optimization is to store keyframes as a “struct of arrays”, meaning that position, rotation and scale keys are optional. For many animation nodes we just need to store constant position and scale values and varying rotation keyframes. This optimization cut the memory usage of animations by 20 MB.

Another big optimization was compression of the vertex format used by models. Previously all model vertices had normal, tangent, bitangent and texcoord vectors stored as 32-bit floating point values. Floats have a very big range and high precision, more than we need, so I compressed those into 16-bit integers. A common trick is also to leave out the bitangent needed for normal mapping, because it can be reconstructed in the shader by computing the cross product of the normal and tangent vectors (the TBN handedness still needs to be stored, but it fits nicely into the tangent vector’s fourth component). The vertex format optimization yielded about a 75 MB saving.

Rendering optimizations

Grimrock 1 didn’t need a geometry level of detail system, but Grimrock 2 has much longer view distances and many more models on the screen, so the number of triangles drawn can get quite large. I had implemented a very simple level of detail (LOD) rendering system some months ago where we simply swap between high detail and low detail meshes based on their distance from the player. While this increased the frame rate, it resulted in an ugly snap when the LOD level changed, which limited the usability of the system. When doing the performance optimizations I revisited the old LOD system. After doing some initial tests with the artists we figured that a crossfade between the LODs would be the ideal solution. Unfortunately alpha-blending the models is out of the question with a light prepass deferred renderer. A pretty common technique is to use alpha-test dissolving instead. I used the technique successfully in Alan Wake’s rendering engine and it still works great today. The end result is quite good and it’s hard to see the LOD transition even if you know what’s going on. We also use the same dissolving technique to fade out small objects like grass.

We are using distorted planar reflections for our water rendering, and this requires rendering the scene twice: once for the main view and once mirrored upside down. This can get really heavy on frame rate. Fortunately it’s not necessary to draw the reflected scene with full detail. In fact many objects don’t need reflections at all, usually those that are far away from the reflective surface (except if they are really tall, like towers). Making an automatic solution that handles all cases nicely is pretty hard, so we give a few hints to the renderer to help pick the objects to reflect. Objects in Grimrock 2 have three reflection modes: “never” means that the object is never reflected; it’s used for small objects like most items. “always” means the object is always reflected, like the sky and very large structures. “cell” is the default option and is used by almost everything. With this option we take advantage of our grid structure: the level designer can paint in the Dungeon Editor which cells in the level have reflections enabled. Static objects with “cell” reflection mode will then skip reflection rendering if their cell is not reflective. For dynamic objects we currently use either “never” or “always” mode, so that we don’t have to check constantly where they are and update their reflection enable flag.

Game world update

Years of object oriented programming tend to produce bad habits. One good example is the game object update logic that was in place two weeks ago. In Grimrock 2 game objects such as monsters, doors and teleporters are made from components such as lights, models, clickable zones and particle effects. In the object-oriented way, each game object had an update() method which calls update() for all its components. Can you see what’s wrong with this? There are at least two big problems (and a few other missed optimization opportunities). First, the code has to iterate through all components regardless of whether they actually need to be updated. For example, models do not need to be updated at all because there is nothing dynamically changing about them. Secondly, the code has to “megamorphically dispatch” to the component update routine, meaning the code is jumping between different component types all the time. This code branching is very slow. A much better approach is to update all components of a given type in one go, i.e. update all particle systems in one pass, update all animations in one pass and so on, something like this:

...
updateComponents(LightComponent)
updateComponents(AnimationComponent)
updateComponents(FloorTriggerComponent)
...



This restructuring of the update alone saves several milliseconds per frame. It also has other very nice properties. The code is easier to profile and it’s easy to toggle updating of components by type. It’s also trivial to change the update order of component types. For example, animation components should be updated after monsters so that animations start playing immediately, not one frame after the monster’s brain has decided what to do.

With all of these optimizations in place the average frame rate seemed decent (we haven’t tested on low end setups yet though, so we may have to go back to optimizations later). I have a frame rate number displayed on the screen all the time and I’ve set it up so that it turns bright red if the frame rate dips below 60 fps. While testing I noticed that every now and then, apparently for no reason, the frame time spiked above 16ms. I immediately began suspecting Lua’s garbage collector and added Lua memory statistics to the profiler. It turned out we were allocating about 40 KB per frame — at 60 fps that’s about 2.5 MB per second! After a few seconds of this Lua decided that enough is enough and collected garbage, which dipped the frame rate. We were very lucky that this problem had not surfaced with Grimrock 1. Lucky because garbage collection issues are really hard to fix. I suspect that the working set and garbage generated were much smaller in Grimrock 1, so the problem did not exist.

I began hunting down the source of garbage. Thanks to the update restructuring it was easy to add per-component-type memory allocation statistics, and a few culprits were quickly found. Some cases were easy to optimize away, like the creation of temporary tables here and there. Much more problematic was the vector math code that created a lot of temporaries. Lua’s garbage collector is not particularly good at short lived temporaries like these. I decided to try an experimental technique: separate vector and matrix classes that are allocated from a pool. At the end of the frame the temp vectors and matrices are returned back to the pool. The only problem was how to handle “boxing”. Temporaries could not be stored permanently in objects’ fields because their values would become corrupt at the end of the current frame. A simple solution is to use boxed vectors as member variables and copy the values explicitly from the temporary to the boxed version. It’s a bit of a chore to do it but it seems to work okay.

I still haven’t gone through all the places, but the worst behaving temporary-allocating routines have now been optimized. As a result, temporary memory allocations per frame have gone down from over 40 KB to about 4 KB. My goal is to keep optimizing until it’s below 1 KB. Garbage collection is already pretty harmless, but I want to beat it so there’s no doubt about it.

That pretty much sums up the work of the past two weeks. Together with content optimizations the game now uses about 25% less memory and runs 25% faster. Not bad for two weeks of work! Hopefully this was an interesting read. If not, then prepare yourself for a more artsy blog update coming next! 🙂