
TL;DR: Server-side game recording is awesome for performing scalability testing, as well as reproducing game bugs and enabling players to replay their game experiences.

How to load-test game servers

In anticipation of the launch of Guild Wars 2 in August, I thought it would be interesting to talk about some of the techniques we used to ensure that the original game of Guild Wars was capable of scaling to the massive number of users we expected, hoped for and feared.

When developing game server software, being too successful can be just as scary as a flop. Gamers who can’t play because the servers are overloaded don’t really care why the servers are failing; they just quit playing if they can’t get online.

Note: I don’t work at ArenaNet (makers of Guild Wars) anymore, though many of my friends still do; I’m eagerly awaiting the results of their efforts just like the legions of fans of the original game.

Simulating user load

There are lots of ways to load-test servers:

Launch the game and watch users suffer, then fix the problems as they are discovered

Write game “bots” to simulate the (expected) behavior of real players

Run background processes on the server to consume CPU, memory and other resources

Record real games and play them back on the server

Testing with real players

In the early days of online gaming (and I’m thinking here of the 1990s) players were so happy to be able to experience multiplayer games that they gladly suffered the pain of live-debugging issues for the development team. Creating this kind of user pain was never an acceptable solution, and few companies that chose this path continue to exist. Once they had better alternatives, gamers quickly became intolerant of online games that didn’t work correctly.

I’m convinced that the effort we put into releasing highly polished, mostly bug-free games during my time at Blizzard, and later at ArenaNet, was a key ingredient in the success of games like Warcraft and Guild Wars. Diablo (the original version) was a fun enough game experience that, while it was buggy as hell, it managed to succeed despite its flaws. It also had the advantage of releasing at the end of 1996, during a period when players were still willing to suffer mightily just to play online.

On the other hand, it was pretty obvious to folks who worked at ArenaNet that Tabula Rasa, a game developed by a sister studio to ArenaNet and released in late 2007, was going to fail, if for no other reason than that the development team was using the beta-test audience to debug the game. By that late date players were already accustomed to bailing out of a game that didn’t have a good online play experience: there’s no point in paying money for a game if the devs can’t release a decent beta experience.

So yeah, foisting a game that doesn’t scale on a beta or launch audience doesn’t cut it these days, which makes this strategy of limited utility.

Simulating real players

Many dev-teams choose to write “bots” that endeavor to simulate real player behavior. This turns out to be challenging for lots of reasons.

First, the programming team is already busy working on the game. Any effort spent writing bots usually means sacrificing other game features, which is a poor trade-off. And those bots have to be continually updated and debugged as the game is modified so that they can keep playing, which takes even more time.

Second, anticipating all the actions that players might take is nigh impossible. Players are much more diverse than game designers anticipate, and consequently they discover corner-cases that might never occur to anyone on the development team. That’s worth a whole ‘nother article about obscure bugs.

Third, it may be impractical to simulate even a fraction of the game. In the original release of Guild Wars there were 600 different spells/skills in the magic system (and 1500 by the time the fourth game in the series rolled out). Trying to simulate all the combinations of spells that teams of players might cast leads to enormous combinatorial possibilities. If we limit a player-simulator bot to casting ten spells in sequence, then there are 600^10 possibilities for those spells: that’s 6,046,617,600,000,000,000,000,000,000. If we imagine that the game server can test one billion sets of those ten spells per second, it would take about 191 billion years to test all the combinations. That’s a lot of room for error!

Every game-server developer I’ve talked to who went down the path of writing a player-simulation bot said that the actual behavior of game-players differed radically from what their bots simulated, limiting the overall usefulness of this load-testing method. YMMV.

Load the server with fake work

Running background processes that burn up lots of memory, CPU, network bandwidth, and disk IO is easy — there are plenty of open-source and proprietary tools that do the job. Here’s a program that uses an entire CPU core all by itself, which you’re welcome to use if you’ll only send me a small royalty (2% of game revenue?!? Negotiable!).

int main (void) { for (;;) { /* spin forever, pinning one CPU core */ } }

Using such programs can be useful for finding trivial problems, but doesn’t do a great job of helping the development team identify the hot-spots that can cause real problems.

A good case-in-point is Diablo 2, which was released after I left Blizzard Entertainment. While the QA testing team worked hard to identify game bugs, the job of load-testing was something that the development team didn’t tackle prior to launch.

It turned out that game-server memory access patterns were highly inefficient, leading the servers to spend a lot of time thrashing memory in and out of the cache, which caused more heat to be generated than was expected. This turned out to be a huge problem because it would cause the top-of-the-line Compaq servers which were used to host Diablo multiplayer games to overheat and crash frequently.

See, in a datacenter, “pizza box” servers are loaded into tall storage racks. It’s possible to cram 42 1U (for one unit high) servers into a single rack. That stack of servers running flat-out would create so much heat that the datacenter refrigeration system couldn’t provide enough cooling air to keep the servers happy.

The short term solution was to reduce the number of servers in the rack, leaving gaps between servers to allow for additional air-flow.

So yeah, some load testing is needed to discover problems not just in the game code, but in the infrastructure as well. But while loading the server with fake work can find some problems, it’s not going to help identify scalability problems in the game-server code itself.

Load testing using recorded games

We had a lot of ideas about what we wanted to do with Guild Wars, and several of them overlapped with our load-testing needs.

One of our primary goals in building Guild Wars was to create a game that could be an “e-sport”. And in fact we gave away over $300K in prizes to players competing in Guild Wars tournaments between 2005 and 2007. Our goal was to record games by top players and play them back for the entire user community, with the kind of color commentary that John Madden did so brilliantly for US football. We miss you, John.

Recording games is also useful because if — or rather when — the game crashes it’s possible to use the game recording to recreate and fix the bug. In programming, bugs that are reproducible are trivial to correct; most of the challenge in debugging games is in discovering why the game is crashing in the first place; once that’s known everything else is easy-peasy.

Some Guild Wars players are probably familiar with the “double patch” we usually used on patch-days. We’d patch the game, and often discover a critical issue affecting a percentage of players. We’d play back recordings of crashed games, fix the bug, run the build system, and often turn around new builds of the game in under five minutes. Being able to fix problems that rapidly does wonders for a game’s reputation.

But the relevant use of recorded games for this article is that they can be replayed to simulate exactly what real players do when they’re playing the game, and so create the desired server load. We were able to record hundreds of games and play them back on a server to (mostly) simulate how a server would behave when heavily loaded. I say mostly because our solution didn’t simulate sending network packets: the playback code just “pretended” to send because there weren’t any real players connected to the game. Not sending network packets didn’t turn out to cause any significant difference between simulated and real server load.

The ability to perform load-testing by playing back recorded games had all sorts of secondary utility. When AMD released a new and inexpensive series of Opteron processors we had doubts about whether those chips would be good at running 32-bit Guild Wars server code. Intel had previously released a processor called the Itanium which turned out to be terrible; it was quickly dubbed the “Itanic”, in reference to the purportedly unsinkable ship Titanic, given its sinking reputation. We didn’t want to buy into hardware that would sink our company.

So to test the Opteron chips we simply gathered up a slew of game-recording files, measured the throughput on one of our current-gen servers, and compared the results to those of an Opteron, which were very favorable. We ended up choosing the Opteron for later hardware purchases, saving hundreds of thousands of dollars in hardware, as well as in hosting, cooling and power costs. Those costs are all critical to a hybrid free-to-play game like Guild Wars: you can’t host a free-to-play game if your operational costs are burning through your profit margin.

References

Server programmers will probably be interested to read some recently-posted notes about scaling from the folks at Dropbox. Check out Scaling lessons learned at Dropbox, part 1.

Important: I should mention that I don’t endorse the Dropbox service. As someone with a security mindset it boggles my mind that Dropbox user data can be decrypted by the Dropbox staff (or a hacker) because all data is encrypted by the same key. If you’re looking for a good alternative check out SpiderOak. From their site: “SpiderOak never stores or knows a user’s password or the plaintext encryption keys which means not even SpiderOak employees can access the data”. Note that I don’t have any affiliation with SpiderOak; I just use their service.

Conclusion

I hope I have one. Oh yeah: game recording is awesome, and not just for load-testing. It enables you to build much higher quality games than traditional debugging techniques do, and it’s worth spending the time to implement it for your next online game project.

It’s definitely worth another article to talk about the coding complexities required to implement recording, like recording format, random-number generation, time-sequencing, compression, debugging, playback desynchronization and other esoterica, but I hope I’ve convinced you that it’s a powerful technique for writing online games.