Start Killing Mutants: Mutation testing your code

Testing your tests to test that they test what you think they test.

I’ve been in a situation where I’d “forgotten” to write a test for a certain function — on a tech test for an interview, no less — because mocking out a dependency was just too awkward. I was nagged by the knowledge that I hadn’t done the right thing, and in real life it might have been caught at code review, with the public shaming that entails, rather than just in an interview, where I could point to all the other tests I did write and explain that I’d hit the point where I’d just stopped caring after spending my whole weekend on the damn thing.

This isn’t an article on interview technique.

It’s on how to identify the tiny cracks in test coverage. Gaps smaller than skipping an entire function like I did — things more like statements half-tested and mocks half-mocked. Holes you wouldn’t necessarily notice when reviewing unit tests (and how often does that happen anyway?) but which still mean that you can break your system and be lied to by your supposedly-trusty test suite. Unit testing is standard practice now, but you can only rely on your suite if you make sure it covers enough of the application’s functionality.

A great way to do this is with a mutation testing framework, but to understand its true value, let’s think about why we actually write tests in the first place.

Why do we write unit tests?

There are lots of reasons we write tests. A few are:

Fast feedback if code is not behaving as expected: this is good for debugging, and reduces the cost of defects.

They can be a form of documentation, because nobody ever updates actual documentation.

They can often encourage better design principles, as untestable code is often bad code.

But my number one reason, at least as far as the purpose of this article goes, is:

To ensure that any changes we make don’t accidentally alter the functionality of the existing codebase.

To make sure our changes don’t alter any existing functionality, we need to make sure our tests are testing existing functionality.

This is what mutation testing can help with.

What is mutation testing?

A mutation testing framework will go through this process:

1. Alter source code in one very small way

2. Run unit tests

3. Record if any tests fail

The changes really are small, often just flipping an operator or return value. One possible alteration (‘mutant’) looks like this:

const isPositive = (num) => num > 0

becomes

const isPositive = (num) => num >= 0

and if your tests all pass, that means you haven’t checked the boundary condition of isPositive(0). Writing the mutated code in the first place would be an easy mistake to make: anyone could type the wrong operator, or copy/paste the wrong thing. For the tests, it would be easy to test isPositive(5) and isPositive(-5) and call it a day.

But with the magic of mutation testing, we can find out that not every relevant case has been tested for. So we can guard against small slips like this, as well as bigger mistakes, both when we first write the source code and when we come to edit it in the future.
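In this case, a single boundary test would kill the mutant. Here’s a minimal sketch using a Jest-style test; the test runner and assertion API are my assumption, not part of the original snippet:

// Minimal sketch: assumes a Jest-style test runner and expect API.
const isPositive = (num) => num > 0;

test('zero is not positive', () => {
  // Passes against the original code (0 > 0 is false),
  // but fails against the mutant (0 >= 0 is true), so the mutant is killed.
  expect(isPositive(0)).toBe(false);
});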

The framework will create a large number of mutants, and for each mutation, one of three things happens:

At least one test fails. This means the mutant has been ‘killed’ and therefore the part of the code that has been changed is properly covered.

All unit tests pass. This means the mutant has ‘survived’, and the changed functionality is not covered by tests.

Infinite loop/runtime error. This usually means that the mutation is something that couldn’t actually happen — or it’s something that would be caught when you actually try to run your application — and counts as a kill.
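Put together, the whole process looks something like the sketch below. This is purely illustrative pseudocode, not the internals of any real framework; applyMutant, revertMutant and runAllTests are hypothetical helpers standing in for whatever machinery a tool actually uses:

// Illustrative sketch only: the helpers passed in are hypothetical,
// not the API of any real mutation testing tool.
function runMutationTesting(mutants, { applyMutant, revertMutant, runAllTests }) {
  const results = { killed: 0, survived: 0, errored: 0 };
  for (const mutant of mutants) {
    applyMutant(mutant);            // 1. alter the source code in one very small way
    const outcome = runAllTests();  // 2. run the unit tests (with a timeout)
    revertMutant(mutant);
    if (outcome.timedOut || outcome.crashed) {
      results.errored++;            // infinite loop / runtime error: counts as a kill
    } else if (outcome.failures > 0) {
      results.killed++;             // 3. at least one test failed: mutant killed
    } else {
      results.survived++;           // all tests passed: the mutant survived
    }
  }
  return results;
}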

Some other possible mutations:

if (x === 3)

becomes

if (x >= 3)
if (x <= 3)
if (x !== 3)
if (true)
if (false)

if (x === 3) {
  k++
}

becomes

if (x === 3) {}
if (x === 3) k--

return {token: "c38bf32"}

becomes

return {}
return null
return {token: ""}

N.B. As mentioned below, creating a large number of these mutants and testing them can take a long time, which limits the use of mutation testing in continuous integration pipelines. If running locally, you can pinpoint specific files to mutate (e.g. the ones you’re working on), and leave whole-codebase runs for remote overnight jobs.
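For example, with Stryker (one of the tools suggested at the end of this article) you can narrow a local run by pointing the mutate globs at particular files. The sketch below is a rough stryker.conf.js; the file path is a placeholder and the exact options vary by version, so check stryker-mutator.io before copying it:

// stryker.conf.js: rough sketch, option names/values vary by Stryker version.
module.exports = {
  // Mutate only the files you're currently working on (placeholder path)...
  mutate: ['src/login/handleLogin.js'],
  // ...rather than the whole codebase, e.g. 'src/**/*.js', which is the overnight job.
  testRunner: 'jest',
  reporters: ['html', 'clear-text'],
};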

Example

Here’s a real-life example: a block of code and its pseudo-tests (they were real tests, I’ve just shortened and readable-ified them):

SOURCE CODE

function handleLogin(request, response) {
  const {username, password} = request.body;

  if (!username) {
    return response.status(400)
      .json({reason: 'ERR_NO_USERNAME'})
  }

  if (!password) {
    return response.status(400)
      .json({reason: 'ERR_NO_PASSWORD'})
  }
}

PSEUDO-UNIT TESTS

const testRequest = {
  body: {} // no username or password fields
}

const mockResponse = () => ...

testNoUsername() {
  handleLogin(testRequest, mockResponse);
  expect(mockResponse.calls.single.toBe(400));
}

testNoPassword() {
  handleLogin(testRequest, mockResponse);
  expect(mockResponse.calls.single.toBe(400));
}

Let’s look at what the mutation framework did:

MUTATED SOURCE CODE

function handleLogin(request, response) {
  const {username, password} = request.body;

  if (!username) {
    return response.status(400)
      .json({reason: 'ERR_NO_USERNAME'})
  }

  if (false) { // formerly if (!password)
    return response.status(400)
      .json({reason: 'ERR_NO_PASSWORD'})
  }
}

Now the second `if` statement never runs. But both tests still pass — the mutant survives. We have testNoPassword() which makes sure the response is handled if there’s no password, so what’s going on?

Because the test request.body doesn’t contain either a username or a password, both tests are getting caught by the if (!username) check and returning out there. if (!password) is never hit at all! Technically, of course, we’re passing in an object that doesn’t have a password, but without a username field the test never exercises the password check.

The problem might have been jumping out at you, but this is a very small part of a large codebase, and this segment was part of a big pull request, so the gap in the test coverage wasn’t noticed by anyone on the team. The solution is to create a testRequest for each test:

UPDATED TEST

const mockResponse = () => ...

testNoPassword() {
  const testRequest = {
    body: {
      username: "somethingNotStupid"
    }
  }

  handleLogin(testRequest, mockResponse);
  expect(mockResponse.calls.single.toBe(400));
}

And then that mutant is killed!
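One further tweak, which goes beyond the original fix: the updated test only asserts on the status code, so a return-value mutant along the lines of the return {token: ""} example earlier, say {reason: ""}, could still survive. Asserting on the response body as well, written here in the same made-up pseudo-test style, would kill that kind of mutant too:

testNoPassword() {
  const testRequest = {
    body: {
      username: "somethingNotStupid"
    }
  }

  handleLogin(testRequest, mockResponse);
  expect(mockResponse.calls.single.toBe(400));
  // Hypothetical extra assertion, same pseudo-test style as above:
  // checking the body means a mutant like {reason: ""} also gets killed.
  expect(mockResponse.json.calls.single.toBe({reason: "ERR_NO_PASSWORD"}));
}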

By identifying and fixing cases like this across the application, we can build a reliable test suite that checks a significant number of the logic flows through our system. This is extremely useful whether we need to do a large refactor or just make a small change: we will know that our tests are dependable and will catch any unwanted changes to the system’s logic.

Advantages and disadvantages

Let’s go back to my self-identified number one reason to write unit tests:

To ensure that any changes we make don’t accidentally alter the functionality of the existing codebase.

It’s difficult to make sure we create tests that are reliable enough to achieve this aim. There are lots of reasons this is the case, ranging from complex application logic to people just not finding tests interesting enough to review them properly. I’ve shown one of many examples where relying on human code review wasn’t enough, and it makes sense to use any tools available to write a decent test suite.

Some more advantages and disadvantages:

Advantages

• The ratio of mutants killed to mutants survived is a more reliable metric than line coverage. This isn’t something I’ve touched on yet, but it’s a very valuable point. Mutation testing actually ensures your unit tests are testing what they should be, and that they cover a high proportion of the logic (most of the cyclomatic complexity). Line coverage is only concerned with whether lines are hit somewhere along the way. There’s a time and a place for line coverage, but mutation tests are a valuable way to ensure you’re really testing what you want to (there’s a sketch of this after the list).

• It catches many small and easy-to-miss programming errors, as well as holes in unit tests that would otherwise go unnoticed. Mutation testing is based on the ‘competent programmer hypothesis’. This is the idea that programmers are basically good at what they do, so errors are caused by small slips (such as >= instead of > ) rather than by the large-scale design of the program. Whether or not you subscribe to the theory, it’s clear that we make a lot of small mistakes a lot of the time, and they aren’t always easy to catch.
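As a quick sketch of the line-coverage point above (the function and Jest-style test here are hypothetical, purely for illustration): the test below executes every line of applyDiscount, so line coverage reports 100%, but it asserts nothing, so any mutant of the calculation survives.

// Hypothetical example: 100% line coverage, zero mutants killed.
const applyDiscount = (price, isMember) =>
  isMember ? price * 0.9 : price;

test('applyDiscount runs without throwing', () => {
  // Both branches are executed, so line and branch coverage look perfect...
  applyDiscount(100, true);
  applyDiscount(100, false);
  // ...but with no assertions, mutants such as `price * 0.9` -> `price * 1.1`
  // or the condition flipped to `true`/`false` all survive.
  // Mutation testing flags this; line coverage does not.
});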

Disadvantages

• Running a mutation testing framework against the whole of any non-trivial codebase is extremely computationally expensive. Runs can take anywhere up to several hours, making them unsuitable for a fast release process, even though many tools have clever ways of cutting down the time required. Of course, you can run a framework overnight and check the report later — like with line coverage, you get value from general trends as well as specific failures. You can set aside time to go through a report in detail and start to shore up some obvious deficiencies in your tests.

• Mutation testing requires brainpower to sort ‘junk’ mutations from useful catches. Not every surviving mutant is legitimate, and with some languages/frameworks, you can get an unfavourable signal-to-noise ratio. In these cases it’s potentially still useful to compare trends over a period of time, to ensure surviving mutations don’t keep increasing.

Mutation testing gives us a far clearer view of what our tests are actually testing than human analysis and line coverage do. I urge you to give it a try: you may be surprised at what your tests are missing.

Suggested Tools

A couple of open-source tools I use are:

JavaScript: Stryker (stryker-mutator.io)

Java and Kotlin: PIT (pitest.org)

Caution: PIT mutates bytecode directly, and lots of mutations that are valid in Java aren’t valid in Kotlin, as they would be caught by the Kotlin compiler (e.g. mutations throwing null pointer exceptions, or relating to non-exhaustive when-statements). Work is ongoing to support Kotlin more fully, but beware of the signal-to-noise ratio caused by these junk mutations.

Further Reading