

Author: “No Bugs” Hare. Job Title: Sarcastic Architect. Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek.

[Chapter 22(a) from “beta” Volume VI]

First of all, let’s note that whatever I will discuss in this section – is just my personal feelings on the subject of testing. There are very few universally-acceptable truths in testing, and every company has its own testing process, which is quite different from the others.

As a result (and unlike many other places in this book), I am not going to say “hey, this is THE way to do testing!”. Instead – I will outline my personal experiences and concerns about existing testing techniques (and will try to cross-reference them with existing works on the subject).

Automated Regression Testing is a Good Thing™

While just two paragraphs above I admitted that there are very few (near-)absolute truths when it comes to testing, at least one such truth clearly exists. It is that

For any sizeable agile project, you DO need some kind of automated regression testing.

At this point, I do NOT try to specify how to implement this automated regression testing; instead – I am just saying that we DO need it, that’s it. The rationale for this rule is very obvious – as soon as we have a project with frequently changing requirements (and games pretty much inevitably fall under this definition) – making changes and just hoping that each and every change didn’t break anything else – is way way too optimistic. With the number of changes within a monthly build going into the hundreds – you can be pretty sure that at least one of them will break the existing behavior. As a result – having automated tests for regressions certainly qualifies as a Good Thing™ (and moreover, this stands both in theory and in practice).

On “Common Wisdoms” a.k.a. Popular Misperceptions

So far so good – and I haven’t even deviated from “common wisdoms” yet. Well, I am just about to do so.

On Unit Testing

When people are speaking about “automated testing”, they very often imply “automated unit testing”. However, at the very least for distributed systems,

Unit testing is not enough – by far.

It means that if you think that by unit-testing your game, you’re going to have it error-free – you’re in for a Big Fat Surprise™ (and a very unpleasant one at that). From my experience, for distributed systems (and games in particular), the percentage of regressions which can be detected by unit testing is well below 20%. The remaining 80% tend to represent all the bugs related to a generalized notion of “races”. NB: for our purposes here, we’ll consider both classical inter-thread/inter-DB-connection races and “unusual sequences of incoming messages” as “generalized races”. While classical ones are not possible at application level as long as your architecture is (Re)Actor-based and single-writing-DB-connection-based (as discussed in Vol. I-III) – the “unusual sequence” ones are inherent to all interactive distributed systems, including (Re)Actor-based ones.
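To illustrate the point about “generalized races”, here is a minimal sketch (in Python, purely for illustration – all names and the toy matchmaking logic are made up, not taken from the book). Each message handler of this toy (Re)Actor is trivially correct in isolation and would pass any per-handler unit test; yet one perfectly legal ordering of messages breaks an invariant, and only exploring message sequences finds it:

```python
import itertools

class MatchmakingReactor:
    """Toy (Re)Actor: players JOIN a lobby; a match STARTs at 2 players.
    Hypothetical example – not the book's actual (Re)Actor interface."""
    def __init__(self):
        self.players = set()
        self.started = False
        self.out = []  # outgoing messages

    def react(self, msg):
        kind, player = msg
        if kind == "JOIN":
            if not self.started:
                self.players.add(player)
                if len(self.players) == 2:
                    self.started = True
                    self.out.append(("START", tuple(sorted(self.players))))
        elif kind == "LEAVE":
            # bug: a LEAVE after START silently shrinks a running match
            self.players.discard(player)

def find_race(msgs):
    """Replay every ordering of msgs against a fresh reactor; return an
    ordering that violates the invariant 'a started match has exactly
    2 players', or None if all orderings are fine."""
    for perm in itertools.permutations(msgs):
        r = MatchmakingReactor()
        for m in perm:
            r.react(m)
        if r.started and len(r.players) != 2:
            return perm
    return None

# JOIN alice, JOIN bob, LEAVE alice -> a started match with one player
race = find_race([("JOIN", "alice"), ("JOIN", "bob"), ("LEAVE", "alice")])
```

Note that unit tests of `react()` for a single JOIN or a single LEAVE pass happily; the bug lives entirely in the *sequence*.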

As a result, I have to insist that

unit testing should not take more than a very small portion (such as “20%”) of your overall testing efforts.

Unfortunately, way too often when speaking about testing, there is a horrible misperception of “hey, we’re 100% unit-tested, so we should be fine” – which tends to fail very badly when facing real-world systems (games, stock exchanges, banks, etc.).

BTW, I don’t want to say that unit testing is pointless. If your code is 100% unit-tested – this is not bad, but – you still need to do like 80% of your testing work. OTOH, if you’re not 100% unit-tested but are using different testing techniques (in particular, those techniques discussed below) – these non-unit-testing testing efforts can still keep your game in a very good shape. In other words, I am arguing that:

You should do MUCH more than mere unit testing.

You should take a look at which amount of unit testing represents a “sweet spot” in terms of bringing the most improvement for your testing efforts. In general, each effort is subject to the “law of diminishing returns” – so if you keep spending more and more effort on unit testing, at some point your unit testing efforts will most likely start yielding results which are worse than you’d get by spending the same effort on some non-unit testing.

Of course, in theory (and for military/nuclear stations/…) it is necessary to perform all the types of testing (and to have both 100% unit-test coverage and use all the other testing methods which we’ll discuss below). However, for quite a few games we’re operating with a limited time budget for testing – and should allocate it to those techniques which provide the best bang for each hour spent.

BTW, apparently I am not alone here; for example, [Coplien], to the best of my understanding, seems to go even further and claim that “Most Unit Testing is Waste”.

On Code Coverage

One popular metric in testing is the “code coverage” of the test suites. It is easy to obtain, and is even useful. However, as any other metric, it is often misused. In particular, it falls victim to Goodhart’s law, which says:

When a measure becomes a target, it ceases to be a good measure.

When applying it to code coverage, it means that as soon as you tell your teams that you’re looking at the code coverage – they will find a way to abuse this metric, so it will become pretty much useless. BTW, it doesn’t mean that your developers are dishonest or trying to game the system – as Goodhart-law-related abuses can easily happen at a subconscious level.

On the other hand – as long as your developers don’t know that you’re using this metric (and you just come up with suggestions “let’s write a test for this use case”, without mentioning where you got it) – you’re ok to use code coverage as a way to find under-covered pieces of code.

BTW, [Coplien] seems to provide a very good real-world example of developers abusing “code coverage” when it became a CMM target (and in my interpretation – it is a very obvious application of Goodhart’s law).

On TDD, ATDD, and BDD

The idea behind TDD is very neat – “let’s write the test first, and the code second”. This way we can be sure that we have 100% of the functionality covered by our testing.

For quite a while, Test-Driven Development (TDD) was seen as The Way to make programs reliable. Apparently, the reality was not that bright, to say the least. In particular (to re-iterate – these are just my personal takes on it, not to be seen as any kind of “universally acceptable truth”):

I am arguing that changing design to enable testing should be avoided. A side note: in IC design, Design-for-Testing (DfT) is an (almost-)universally-acceptable practice. However, even there DfT is usually implemented as non-intrusively as possible with relation to “normal” operation. I am arguing for doing the same in program design, optimizing for normal operation – and adding tests to the existing design (in general, well-designed code should also be testable, but forgetting about normal operation and caring only about testability leads to Really Ugly Results™ way too often). For some examples of the stuff which I am trying to avoid, please refer, for example, to [Hannson].

I am arguing that changing code to enable testing should be avoided (on the same readability grounds as above). It means that I am pretty much ruling out extra levels of indirection just to enable testing.

OTOH, I am not against mocking and stubbing – as long as they can be done without affecting the “normal” design and code.

And – I am not against TDD as such (as long as the above rules stand).

If you think about it – you’ll probably see that following the way outlined above will lead us to

Having larger pieces to test.

Sure, with these restrictions we still can test all our code base – but we’ll be doing it using larger modules-under-test – and essentially on a higher level of abstraction.

And – IMNSHO this is a Really Good Thing™ for several reasons:

The higher we are in the abstraction level diagram – the more we’re moving from how we’re doing things to what we’re doing. This helps to avoid “Tautological TDD” (for a discussion of TTDD, see, for example, [Pereira]) and other related anti-patterns.

With larger pieces – we can test much more than when testing the smaller ones. In other words – we can test the whole larger piece, including interactions between the smaller pieces comprising the larger one. As noted above – low-level unit testing has only limited use, so going from low-level unit testing towards functional testing and/or acceptance testing is a Good Thing™.

We’re not cluttering the code (which, as we know, is read many more times than it is written) with not-really-relevant details. As a developer, I am saying it is a Good Thing™ too.

In a sense – IMO the approach outlined above, is well-aligned with Acceptance-Test-Driven Development (ATDD) and Behaviour-Driven Development (BDD) – which also tend to imply larger pieces for testing (with ATDD and BDD, at least as I understand them, you shouldn’t try to unit-test each and every function – but should test much larger units instead, keeping implementation details off the testing table). Also – while I hate taking yet another risk to misinterpret [Coplien] – it seems that his “Test at a coarser level of granularity” is pretty close too.
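To show what testing “at a coarser level of granularity” can look like in practice, here is a small sketch (Python for illustration; the `Shop`/`Wallet` module and all its names are hypothetical). The whole module is exercised through its public surface only – the tests assert on outcomes visible to the caller, never on internal helpers, so refactoring the internals doesn’t break them:

```python
class Wallet:
    def __init__(self, gold):
        self.gold = gold

class Shop:
    """Toy module; internally it could be split into pricing, validation,
    inventory etc. – the tests below deliberately don't know or care."""
    PRICE = {"sword": 10}

    def buy(self, wallet, item):
        price = self.PRICE[item]
        if wallet.gold < price:
            return ("REJECTED", "not enough gold")
        wallet.gold -= price
        return ("OK", item)

# Behaviour-level tests: one scenario per *observable outcome*.
def test_buy_succeeds_and_charges():
    w = Wallet(gold=25)
    assert Shop().buy(w, "sword") == ("OK", "sword")
    assert w.gold == 15  # the observable side effect, not an internal detail

def test_rejected_buy_leaves_gold_intact():
    w = Wallet(gold=3)
    status, _ = Shop().buy(w, "sword")
    assert status == "REJECTED"
    assert w.gold == 3
```

Note that nothing here forces a redesign of `Shop` for testability – no injected interfaces, no extra indirection; the tests take the module as-is.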

Automated testing for (Re)Actors

If (as I really hope) you have decided to implement your system as a bunch of (Re)Actors (as was discussed at length in Vol. II’s chapter on (Re)Actors) – we can use (Re)Actors to introduce two more layers of automated testing.

Script-Driven Testing

As discussed in Vol. II, (Re)Actors provide a very well-defined interface to the outside world; in short – for (Re)Actors, pretty much everything is an input event (and most of the reactions generated by (Re)Actors in response to input events are outgoing messages).

This, in turn, means that it is possible to write a test script which takes a (Re)Actor as a whole, throws messages at it – and observes the messages generated by the (Re)Actor, checking these generated messages against the rules specified in the script.

With such (Re)Actor-level script-based testing – we’re essentially treating our (Re)Actor as a “black box” – which is very well aligned with the “we should test what the object does, not how it works” paradigm.

That’s it – but this simple testing has been observed to be very efficient (especially when compared to traditional unit testing). The reason for it is quite simple – (Re)Actor-level script-driven testing usually operates at a level which is much higher than usual unit-testing – which in turn:

Tests things which are different from those things usually tested by unit testing

Tends to be more resilient to implementation changes (in a sense – because it is closer to the BDD/ATDD paradigm)

In general, script-driven testing at (Re)Actor level is a good fit for Functional Testing – and for Regression Testing.
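A minimal sketch of what such a script runner might look like (Python for illustration; the `PingReactor`, the script format, and all names are my own invention, not the book’s API). The runner knows nothing about the (Re)Actor’s internals – it only sends messages and compares the emitted messages against the script’s expectations:

```python
class PingReactor:
    """Toy deterministic (Re)Actor: replies PONG to each PING,
    numbering the replies; ignores everything else."""
    def __init__(self):
        self.count = 0

    def react(self, msg):
        if msg == "PING":
            self.count += 1
            return [("PONG", self.count)]
        return []  # unknown messages produce no output

def run_script(reactor, script):
    """script: list of ('send', message) / ('expect', [messages]) steps.
    Treats the reactor as a black box; returns a list of
    (expected, actual) mismatches - empty list means the script passed."""
    failures = []
    pending = []  # outputs produced by the most recent 'send'
    for step, payload in script:
        if step == "send":
            pending = reactor.react(payload)
        elif step == "expect":
            if pending != payload:
                failures.append((payload, pending))
    return failures

script = [
    ("send", "PING"),  ("expect", [("PONG", 1)]),
    ("send", "NOISE"), ("expect", []),
    ("send", "PING"),  ("expect", [("PONG", 2)]),
]
```

Running `run_script(PingReactor(), script)` returns an empty list here; any regression in the reactor’s observable behavior would show up as a mismatch, without the test knowing how the reactor is implemented.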

[[TODO: databases/mocking]]

Replay-based Regression Testing

The idea of such testing goes along the following lines:

The whole point of regression testing is to ensure that there are no unexpected changes introduced from version V into version V+1

With deterministic (Re)Actors, we can easily record all the inputs and outputs of selected (Re)Actors during real game play from a production system – while it is running version V.

Then – we can separate all the changes intended for the upcoming-but-not-yet-released version V+1 into two categories: (a) those which are supposed to change the existing behavior, and (b) those which are not supposed to change it. NB: most “added functionality” changes tend to fall into the (b) category (i.e. most of the time, existing behavior shouldn’t change unless the new feature is enabled – and it certainly wasn’t enabled when version V was run). This separation can be done quite easily – as long as you have a practice of attributing your commits to the issues within your issue tracking. And for issues – separating non-modifying-existing-behavior issues from modifying ones is rarely a problem. Moreover, in practice you’ll see that like 80+% of all the commits are supposedly non-modifying ones (the vast majority of commits tend to be about new features, and they’re usually not changing existing behavior unless enabled).

Then – we can build version V+0.5, based on version V, plus only those-commits-which-are-not-supposed-to-change-existing-behavior.

As soon as we have this version V+0.5 – we can (and should) replay the logs which were recorded during normal operation of version V, against version V+0.5 – and see if there are any discrepancies. Each such discrepancy should be seen as a bug unless proven otherwise (and the burden of proof should be on the developer who has made the offending commit). If it happens to be caused by a commit which was misattributed to a different issue – I won’t argue whether it is a bug or not, but what is clear is that it should be re-attributed – and then version V+0.5 should be re-built, and re-tested.

This process can be quite unpleasant at first, but pretty soon your developers will learn to be more careful with attributing their commits (which is a Good Thing™ for any sizeable development anyway). BTW, if any non-trivial bugs are identified during replay testing – test cases for them may (and should) be introduced into script-based regression testing.
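The record-then-replay mechanics can be sketched as follows (Python for illustration; the (Re)Actor-as-pure-function shape, the `ADD`/`TOTAL` messages, and the “V+0.5” bug are all hypothetical). Determinism is what makes this work: given the same recorded inputs, version V+0.5 must reproduce version V’s outputs bit-for-bit, so any discrepancy flags a behavior change:

```python
def reactor_v(state, msg):
    """Version V of a deterministic (Re)Actor step: a pure function
    (state, input) -> (new_state, outgoing_messages)."""
    if msg[0] == "ADD":
        state = state + msg[1]
        return state, [("TOTAL", state)]
    return state, []

def record(step, inputs):
    """Run version V (e.g. in production), logging every input together
    with the outputs it produced."""
    log, state = [], 0
    for msg in inputs:
        state, out = step(state, msg)
        log.append((msg, out))
    return log

def replay(step, log):
    """Replay the recorded inputs against a candidate build; every output
    discrepancy is a regression unless proven otherwise."""
    state, discrepancies = 0, []
    for msg, expected in log:
        state, out = step(state, msg)
        if out != expected:
            discrepancies.append((msg, expected, out))
    return discrepancies

def reactor_v05(state, msg):
    """Candidate V+0.5 with an unintended behavior change: a commit that
    was 'not supposed to change existing behavior' now truncates totals."""
    if msg[0] == "ADD":
        state = state + msg[1]
        return state, [("TOTAL", int(state))]  # oops: silently truncates
    return state, []

log = record(reactor_v, [("ADD", 1.5), ("ADD", 2)])
```

Here `replay(reactor_v, log)` is empty (the build reproduces itself), while `replay(reactor_v05, log)` reports the truncation – precisely the kind of “nobody thought to write a test for this” regression the technique is meant to catch.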



Bingo! We have our replay-based regression testing – which tends to provide much more thorough testing than simple unit-testing, and is also more thorough than script-based testing. I tend to attribute it to the following phenomenon:

Nobody can possibly predict all the scenarios which your players will throw at your system.

I’ve spent quite a bit of time looking at the players and cases they create – and I have to admit that predicting all the scenarios is well beyond not only my humble capabilities to generate test cases – but also well beyond capabilities of all the developers and QAs I’ve ever met. BTW, some of the QAs I had the pleasure to work with, were exceptionally good and were able to find like 3x-5x more significantly different test cases than developers; however, capabilities of 100K players pressing the buttons at the same time – go at least 5x-10x further than the very best of QAs.

Simulation Testing

One more thing which tends to help when testing games is simulation testing. We just create 10’000 simulated players – and run them against the QA instance of our Server, keeping an eye on it to see whether everything makes sense (no asserts fire, there are no crashes, database invariants still stand after we finish the simulation, and so on).

Just like replay-based testing, simulation-based testing tends to find those scenarios which are impossible to predict in advance. However, it is neither a subset nor a superset of the replay-based testing, because:

Bugs in new functionality cannot be tested by replay-based testing by definition – but there is a chance to find some of them in simulation

While simulators are good at finding randomly occurring interactions (including those “generalized races” mentioned above), they’re usually not able to find certain non-random patterns which can be caused by players who are trying to game the system, or by players who just behave in a way the simulation writer wasn’t able to think of.

In addition, simulation testing is a reasonably good way to do basic Performance Testing, and it can be seen as a kind of Integration Testing too.
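A bare-bones version of the idea (Python for illustration; the gold-transfer server, the conservation invariant, and all numbers are made-up stand-ins for a real QA instance). Simulated players fire random actions at the server, and afterwards we check a global invariant that no single hand-written test case would be likely to stress this hard:

```python
import random

class ToyServer:
    """Hypothetical QA-instance stand-in: players transfer gold.
    Invariant: total gold in the world is conserved."""
    def __init__(self, n_players, gold_each=100):
        self.gold = {p: gold_each for p in range(n_players)}

    def transfer(self, src, dst, amount):
        # reject invalid requests instead of corrupting state
        if amount >= 0 and self.gold[src] >= amount:
            self.gold[src] -= amount
            self.gold[dst] += amount

def simulate(n_players=1000, n_actions=20000, seed=42):
    """Throw n_actions random (including some invalid) transfer requests
    at the server, then verify the conservation invariant."""
    rng = random.Random(seed)  # seeded, so a failing run is reproducible
    server = ToyServer(n_players)
    expected_total = sum(server.gold.values())
    for _ in range(n_actions):
        src = rng.randrange(n_players)
        dst = rng.randrange(n_players)
        server.transfer(src, dst, rng.randrange(-5, 50))
    return sum(server.gold.values()) == expected_total
```

Seeding the random generator matters: when the invariant check fails, you can replay exactly the same action sequence while debugging, instead of chasing a one-in-a-million fluke.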

Summary: My Personal Recommendations for Testing

[[TODO: manual testing (see arguments in the comments below)]]

[[TODO: testing frameworks (“don’t matter much” for unit tests, your own one for (Re)Actor-level testing)]]

To summarize my own personal recommendations for testing of the games (and any other distributed system):

Unit tests (at least as they’re usually understood) are usually not enough to provide sufficient test coverage.

This can (and IMO should) be alleviated by moving tests up the abstraction level – and testing larger pieces of code.

As a Big Fat Rule of Thumb™, you should not spend more than 20% of your overall testing efforts on unit testing.

Modifying design and/or code to make your code testable is generally a pretty bad thing (especially from readability point of view)

Whether to use TDD – is up to you, as long as:

You’re testing larger pieces of code with higher-level tests (think ATDD or BDD)

You modify neither design nor code to run your tests

BTW, mock-ups and stubs are fine as long as the above restrictions stand

Most importantly – DO use automated testing. Moreover, DO use as many different automated testing techniques as you can:

Unit testing might be useful. What is important though is not to rely solely on unit testing. As a rule of thumb, for games (and any other interactive distributed system) unit testing should take at most 20% of overall testing efforts.

DO use (Re)Actors – and you’ll have two other testing techniques to use, which are known to work very well:

Script-driven regression testing at (Re)Actor level (just throwing messages at a (Re)Actor and seeing its reactions to them)

Replay-based regression testing. This one requires (Re)Actors to be deterministic – but tends to test things which no test writer can possibly think of.

Simulation-based testing provides yet another significantly different way to test your system (and it can find bugs not findable by any other testing methods).



[[To Be Continued…

This concludes beta Chapter 22(a) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”.

Stay tuned for beta Chapter 22(b), about logging and production post-factum debugging]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.