Adventures in Debugging

I’ve just fixed one of the worst bugs that has been plaguing Banished. It’s potentially the cause of nearly all my pending crash to desktop bugs. It took me so long to track down that I thought I’d share the debugging process. Normally fixing a bug is an hour task. Reproduce the issue, find a solution, implement the fix, and then test the fix. Other times debugging is an all day and all night process, or worse.

The most important thing is to be able to reproduce the issue. If a developer can’t make a bug happen reliably, there is little chance it can be fixed. And finding reproduction steps can be hard.

So more than a few months ago I’d noticed an issue when I’d build test cities by cheating. I’d pause the passage of time in game, turn on debug mode so I could place buildings for free, and place a whole lot of buildings – basically lay out an entire town. This let me test features and the simulation of large towns, without spending hours to make it properly. Once in a while, when the passage of time was resumed, the game would immediately crash.

Large towns run much better when the game isn’t built for debugging, so I’d often be in release mode for a high framerate, and be unable to properly use the debugger to see why the game crashed. Even worse, the autosave feature was properly saving towns. If the crash occurred and the save was reloaded, the game could be unpaused without issue!

Because the save game was intact, I usually lost less than a minute of work due to the autosave process. And since the bug was so infrequent and it occured while I was cheating, I half assumed I had something wrong with the code that let me build buildings for free. So I ignored the issue for a while.

But lately I’ve had testers having this exact situation occur, but without cheating. They lay out where buildings should go with the passage of time paused, and on resume it crashes. Testers have also been having other random crashes that I can’t quite figure out, and no one can reproduce them or give any hint as to what they were doing when the game crashed. Save games always seem to be intact and reload without issue or crashing… Hmm…

About a month ago I added functionality to output core dumps and call stacks from crashes and asserts in game so I’d at least have an idea of where in the code the game was crashing for testers. A core dump gives me the state of the program when the crash or assert occurs, and a call stack is just the list of function calls that were made up to the current point of execution.

While the core dump isn’t as useful as live debugging, it gives me ideas about where to start looking and how I might reproduce the bug. A save game, core dump, and call stack usually make fixing bugs easy. But not today.

Looking at a single core dump didn’t give me much idea about what was going on. I knew memory was either being accessed incorrectly, or it was being overwritten, but that was it. But looking at over 20 dumps with similar call stacks got me to thinking. While the call stacks all had crashes in different locations, they all came from a root function called ComponentEntity::Update(). A small percentage of them even occurred directly in that function.

This is the lower part of the call stack of the crashes: