Hardware bugs??

Sometimes fixing bugs makes you question why programming is your chosen profession. Or at least why you chose to make a game engine from scratch. Sometimes there are technical details you don’t really care to know about.

This last week I’ve been attempting to build the largest city I can, to determine reasonable population counts for achievements and scenarios. This takes a lot of time, and it takes even longer because I fix the issues I run into along the way. Because of bugs, I haven’t gotten to max population yet.

If the bug is something minor, such as an icon having the wrong opacity, I just write it down in my bug list to be fixed later. If it’s something that stops play, like a crash, I fix the issue immediately. In general, if the fix isn’t too invasive, I can continue from an auto-save and lose at most a minute of gameplay time.

There are also balance issues. I ran into a situation where some fishing docks weren’t producing enough fish for a beginning settlement to survive on. You can survive on gathering, farming, or hunting – so it doesn’t make sense that you can’t survive on fishing. Balancing bugs take a bit of play and testing to fix, but in general they are pretty easy to deal with.

Then there are the really bad issues. Problems that crop up only every few hours and are not reproducible using any simple steps. If you’ve got good testers, they can eventually figure out reproduction steps, but doing so in a debug build is usually painful: the game runs so slowly that it takes a long time for the bug to occur.

I’ve got one bug like this that I’ve been ignoring for a while. Ever since I added DirectX 11 rendering, I’ve occasionally seen the interface to the graphics hardware fail. All it reports is:

D3D11: Removing Device.

If you query the interface for more information, the error code is:

DXGI_ERROR_DEVICE_REMOVED

When this occurs, you can’t continue rendering without restarting the graphics interface.
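As a rough illustration of that query (a portable sketch, not the game's actual code): in a real D3D11 renderer the code would come from ID3D11Device::GetDeviceRemovedReason() after IDXGISwapChain::Present() fails. The helper names DescribeRemovedReason and NeedsDeviceRestart are hypothetical, and the HRESULT constants are copied from the Windows SDK headers only so that the sketch builds without them.

```cpp
#include <cstdint>
#include <string>

// HRESULT values as defined in the Windows SDK headers, repeated here
// so this sketch compiles without the SDK.
constexpr std::uint32_t kDxgiErrorInvalidCall         = 0x887A0001u;
constexpr std::uint32_t kDxgiErrorDeviceRemoved       = 0x887A0005u;
constexpr std::uint32_t kDxgiErrorDeviceHung          = 0x887A0006u;
constexpr std::uint32_t kDxgiErrorDeviceReset         = 0x887A0007u;
constexpr std::uint32_t kDxgiErrorDriverInternalError = 0x887A0020u;

// Hypothetical helper: turn the code returned by GetDeviceRemovedReason()
// into a name that can be written to a log file.
std::string DescribeRemovedReason(std::uint32_t hr) {
    switch (hr) {
        case 0u:                            return "S_OK";
        case kDxgiErrorInvalidCall:         return "DXGI_ERROR_INVALID_CALL";
        case kDxgiErrorDeviceRemoved:       return "DXGI_ERROR_DEVICE_REMOVED";
        case kDxgiErrorDeviceHung:          return "DXGI_ERROR_DEVICE_HUNG";
        case kDxgiErrorDeviceReset:         return "DXGI_ERROR_DEVICE_RESET";
        case kDxgiErrorDriverInternalError: return "DXGI_ERROR_DRIVER_INTERNAL_ERROR";
        default:                            return "unrecognized HRESULT";
    }
}

// Hypothetical helper: true when the result of Present() means the device
// is gone and has to be recreated before rendering can continue.
bool NeedsDeviceRestart(std::uint32_t presentResult) {
    return presentResult == kDxgiErrorDeviceRemoved ||
           presentResult == kDxgiErrorDeviceReset;
}
```

Logging the name of the reason code each time the device is lost at least leaves a record of which failure was reported, even if the code itself says little about the cause.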

It happens occasionally, usually while rendering large scenes, but not always. The D3D11 debug layer output shows nothing extra: no warnings, no errors. This is a non-debuggable error. Rendering the exact same scene that caused the issue doesn’t make the issue occur again, so reproducing it is very hard. Only my main development machine does this. Other computers I have around do not. The DirectX 9 version of the game also has no problems and can run for days without the graphics card dying.

The documentation says if you get this error, either the driver for the video card has been updated, or the video card has been physically removed from the machine. Clearly, I’m not doing either of these things while the game is running. Since those two things are very rare, I had written the code to just throw an error and quit should it occur.

It’s not comforting that it seems to happen anywhere from five times a minute to once an hour on my main development machine. But some days it doesn’t happen at all.

Searching the internet for other developers with the same issue is very hard – there’s a lot of noise from non-programmers talking about it. A lot of new games don’t handle this error and simply quit as well. Doing a search for ‘DXGI_ERROR_DEVICE_REMOVED’ gives you a seemingly endless list of forum threads for Crysis 3, Arma, Civ 5, Conan, Hitman, Secret World, Battlefield 3, 3DMark, and more, where the game just reports this error and quits. Apparently gamers are not happy about this. I wouldn’t be either.

The list of possible fixes people recommend includes re-installing drivers, using beta drivers, re-installing Windows, removing dust from the video card and reseating it, lowering quality settings in the game, and even increasing the voltage sent to the video card.

These are not things I have ever had to do to fix a bug. I’ve never actually run into a driver malfunction either – it’s always been something stupid I’m doing that makes the video card throw errors. After hours of debugging and trying different things to figure out what I was doing wrong, I eventually made a Google search that excluded all pages containing the names of games that have this error, and found a Microsoft blog post describing causes of this error that aren’t in the official documentation.

It could be driver bugs, hardware faults, overheating, GPU removed from system, or DirectX running out of memory. I can’t tell which is happening. And since I can render the exact same scenes with DirectX 9 on the same video card, I have a hard time believing any of these things is actually occurring in a fatal way. Either way, it seems like this is a general ‘something went wrong’ error.

For all my searching, I’ve not seen one developer write about how they fixed this issue. It’s annoying that it even happens, and more so since it can occur for reasons other than the ones the official documentation gives.

I don’t know how many gamers will have this error occur, but looking at the major releases that have it, I can imagine it won’t be an isolated problem. Since it appears that I’m not doing anything wrong, I just need to handle this error as gracefully as possible. So what’s the fix?

My first gut feeling is to just ignore DirectX 11 and only ship DirectX 9 since it works perfectly, but I did spend the time to make a DirectX 11 renderer and the performance win on newer graphics hardware is very hard to ignore.

To fix it properly, I have to release all graphics resources, shut down D3D11, restart it, and then recreate all the resources. All while the game is running. The problem with this is that recreating the resources is somewhat painful. The resources are in video memory, but not in any application-accessible memory. Once the device gets into the removed state, I can’t access them.

I could possibly reload all meshes and textures from disk, but this is slow. Some resources, like the terrain, are only stored inside a save game and would be very hard to get at in this state. If a device removal occurs, I really wouldn’t want the player to see a hiccup lasting more than a second or two as things reload.

So instead I’ll have to keep a memory backup of every texture and mesh so that I can restart the device at any time. This seems like a serious waste of memory, but there’s no other real choice. D3D9 requires similar handling for device resets, but it isn’t as severe, since D3D9 takes care of some of the legwork for you and you don’t have to actually destroy and recreate the interface to the graphics hardware.
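The idea of keeping a memory backup can be sketched roughly as follows. This is only an illustration under assumptions, not the engine's real design: ShadowedResources, GpuHandle, and the upload callback are all hypothetical names, and the callback stands in for whatever D3D11 calls would actually create a buffer or texture from raw bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using ResourceId = std::uint32_t;
using GpuHandle  = std::uint64_t;  // stand-in for a D3D11 buffer/texture pointer

// Hypothetical registry that keeps a CPU-side copy of every resource so the
// GPU copies can be rebuilt after the device has been destroyed and recreated.
class ShadowedResources {
public:
    // Stand-in for "create a GPU resource from these bytes on the current device".
    using UploadFn = std::function<GpuHandle(const std::vector<std::uint8_t>&)>;

    explicit ShadowedResources(UploadFn upload) : upload_(std::move(upload)) {}

    // Store the bytes in system memory, then upload them to the device.
    ResourceId Create(std::vector<std::uint8_t> bytes) {
        const ResourceId id = nextId_++;
        Entry entry{std::move(bytes), 0};
        entry.gpu = upload_(entry.cpuCopy);
        entries_.emplace(id, std::move(entry));
        return id;
    }

    GpuHandle Handle(ResourceId id) const { return entries_.at(id).gpu; }

    // After a device restart, re-upload everything from the CPU copies.
    // Old handles are simply overwritten because the old device is gone.
    void RestoreAll(UploadFn uploadOnNewDevice) {
        upload_ = std::move(uploadOnNewDevice);
        for (auto& kv : entries_) {
            kv.second.gpu = upload_(kv.second.cpuCopy);
        }
    }

    // The memory cost of the backup: total bytes held in system memory.
    std::size_t ShadowBytes() const {
        std::size_t total = 0;
        for (const auto& kv : entries_) total += kv.second.cpuCopy.size();
        return total;
    }

private:
    struct Entry {
        std::vector<std::uint8_t> cpuCopy;
        GpuHandle gpu;
    };

    UploadFn upload_;
    ResourceId nextId_ = 1;
    std::unordered_map<ResourceId, Entry> entries_;
};
```

Game code would hold on to ResourceId values rather than raw handles, so nothing outside the registry needs to change when the device is rebuilt. ShadowBytes makes the memory cost of the backup visible.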

I haven’t made this fix yet, so I’m interested to see how often it occurs once the recovery process works properly. I’m also wondering if I should make the game quit if it’s happening at some high frequency.
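One possible way to decide when removals are happening at "some high frequency" is a sliding-window counter. The sketch below is hypothetical: the class name RemovalRateMonitor and any thresholds passed to it are assumptions for illustration, not values taken from the game.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical monitor: remembers recent device removals and reports when
// more than maxEvents of them have happened within the given time window.
class RemovalRateMonitor {
public:
    using Clock = std::chrono::steady_clock;

    RemovalRateMonitor(std::size_t maxEvents, Clock::duration window)
        : maxEvents_(maxEvents), window_(window) {}

    // Record one device removal at time `now`. Returns true when the recent
    // rate is too high and quitting is the better option than recovering.
    bool RecordRemoval(Clock::time_point now) {
        events_.push_back(now);
        // Forget removals that are older than the window.
        while (!events_.empty() && now - events_.front() > window_) {
            events_.pop_front();
        }
        return events_.size() > maxEvents_;
    }

private:
    std::size_t maxEvents_;
    Clock::duration window_;
    std::deque<Clock::time_point> events_;
};
```

With, say, a limit of three removals per minute, an occasional failure once an hour would be recovered silently, while a burst of several per minute would trip the limit and let the game save and exit cleanly.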

If anyone has any additional information about this issue, I’d love to hear about it.

And here I thought making a video game was all about design, balancing, and fun. 🙂