Recurring Nightmares

Almost every project I’ve worked on has a bit of code that causes chronic problems. Some code gets written, and it seems to work well in development and testing. Then someone on the team creates real-world data that makes it malfunction.

So the programmers fix the issue, and everyone continues on their merry way, until some different real-world data makes it malfunction again.

This cycle continues until the game ships. I’m not sure this problem comes from poor development – it’s just that code can be so complex, no one can envision all the ways it will be used or all the ways it can break.

I’ve even seen this happen across multiple projects that reuse a game engine. A feature might work flawlessly on one game and then, unchanged, break on the next game in development because it’s being used differently or with different requirements.

I’ve seen this happen across all fields. Graphics, physics, collision, AI, audio, animation, controller input, even movie playback!

Sometimes, once the true real-world requirements of a system are known, a full rewrite cures the chronic bugs. Or eventually all the edge cases get discovered. Sometimes not.

Unfortunately, I’ve got an issue like this now. A while back I decided I wanted to support arbitrary placement of objects: any shape, any rotation, with potential intersections, added and removed dynamically from the play field. I ended up implementing an algorithm from a paper called Fully Dynamic Constrained Delaunay Triangulations. It’s a bit complicated, but it was fun to implement and works really well. The paper covers all the cases needed for robustness, but really tricky issues still come up.

Right now it handles pretty much whatever I throw at it. It makes nice big triangular maps of pathable space that I can run A-Star on for pathfinding, and the map can be analyzed quickly to allow different sized units.

Here are some pretty pictures of it, along with some test paths that have a non-zero unit radius.

So every few weeks, the random level generator ends up placing objects in configurations that break my code. Arghghgh. Sometimes it breaks while building a map. Sometimes it breaks when quitting, when a specific object is removed. The worst is when it breaks as I place an object manually, because it’s really hard to find and recreate the exact placement that caused the issue.

And so I stop whatever task I was doing and start debugging the triangulation code. This is not easy – some maps have 20,000 objects and 175,000 triangles in the pathing mesh. I end up doing a mix of visual and code debugging. By drawing the pathing mesh at the point the error occurs, as well as before and after, and stepping slowly through the code, I can figure out what’s causing the bug. It usually takes me several hours to determine the problem. Sometimes more. Finding a quality fix is typically hard. So I take a break, sleep on it, and generally have a good idea for a solution in the morning. Implement and test.

Then I wonder, “Okay, what was I doing before I ran into this bug??”

For this triangulation system, the culprit is floating point math. Every time. The algorithm is good. I haven’t had to change the major details of it a single time since I got the initial implementation working. But because math on a computer is an approximation using a fixed number of bits, math that works out on paper does not always behave the same way on a computer.

For example, one of my issues dealt with computing the circumcircle of a triangle. The algorithm just didn’t work at far distances from the origin until I wrote the math three different ways to find the most numerically stable and accurate implementation. On paper the math should have resulted in exactly the same result for all three methods!

Another issue arose because I was testing a large circle against a point that lay exactly on its perimeter, and the test failed for lack of numerical precision. I’ve also had failures due to nearly degenerate triangles. And other crazy things that are hard to describe concisely.

I’m pretty sure some of the worst bugs to fix properly in my programming career are due to floating point imprecision. We have a long history, and we are not friends.

When I started making games professionally the fix would be to test using an epsilon. For example instead of

if (x == 0.0) { ... }

I would write something like

if (fabs(x) < 0.00001) { ... }

In the right case this can be good, but in most cases it’s very bad: without the right epsilon, and without knowing what x is and always will be, you are potentially creating false positives on top of the original problem you were fixing. I avoid this whenever possible.

My go-to solution now is to use geometric analysis to determine an answer that needs high precision. Can I infer the result from vertices, edges, and faces, and their relation to each other? Can I write the algorithm to be fault tolerant of values slightly under or over the desired one? If not, can I rewrite the math so that I’m never combining values orders of magnitude apart?

Having fixed so many small cases – about one a month – I do sometimes consider going back to a simple grid for pathfinding. But I purposely chose this route as the most flexible. I do wonder if I’ll ever get this piece of code fully stable. At least it works today, and it generates lovely paths for units to follow.

Only time will tell if I get all the kinks worked out – at least until the next project that uses it in a different way.