My team at Opower has gone through several evolutions of our testing philosophy. Although I now feel confident that we are testing the right set of contracts, we’ve made some mistakes along the way. I’d like to share those lessons learned with you.

It’s ok to duplicate config for tests

I used to write a lot of tests like this:

This is three files represented in one gist, for clarity. The assertion language in the test is chai. Also, I’m using some es6 here (“const”) because it’s wonderful.

No point in duplicating the constant data, right? If the spec changed, I’d hate to have to update this in two places.

Unfortunately, by not duplicating data, we’re actually duplicating the logic we’re trying to test. There’s no point in writing a test that just re-implements the original function, either in the same or a different way. Just hardcode your expected output.

And this test is not even really testing what I want. I want a function that returns “Hi, Sam”. I don’t want to care what the constant’s value is or where it’s stored — that’s an implementation detail. The following example actually tests the external contract of the function, which is what a good test is all about:

When I first started doing that, I was afraid that it would be painful to have to update multiple places when I made a change. In practice, that hasn’t been a problem at all. DRY is theoretically great to protect you against change, but some fields change very rarely. Additionally, if you make a change in your app code, and the tests break, this is a great validation that your tests actually do something. When I make a change, I like to update the test first, see it fail, then update the app code to make the test pass. (This is a standard TDD approach.) Just changing a config value and seeing all the tests still pass doesn’t give me the same confidence that the tests are tracking the functionality I care about.

Be careful with unit tests

Unit tests for methods whose outputs are a pure function of the inputs are nice. But when you tell yourself, “Well, the outputs of this function are its side effects”, things start to deteriorate rapidly. proxyquire is a wonderful technical achievement but has become a curse word on my team.

Let’s look at a test for a js file called package-exists. This file exports a function that checks to see if a package exists in the npm registry.

This is just one of many dependencies which were proxyquire’d.

The test file this snippet was taken from is 93 lines. The code it is testing is 55 lines. That should have been a signal to us right there.

The problem here is that we have an absurd amount of knowledge in the tests about implementation details of the app code. We need to know how it imports and uses npm. There are other valid ways that the app code could fulfill its contract, but there’s no way it can change without breaking the tests.

That was the ultimate demise of these tests. We would go to make some trivial change, one that affected the internals of a single file but not its external contract (or the module’s), and all the unit tests would fail. We would spend a ton of time trying to keep them up to date. Finally, we just adopted a policy of deleting tests as soon as their knowledge of implementation details got out of date with reality.

e2e tests are better than unit tests

As we progressed with the aforementioned painful unit tests, we found that the e2e tests were far more durable and useful in tracking down failures. But before we dove wholeheartedly into them, I was concerned that although e2e tests are good for smoking out large issues, they don’t provide any insight into what part of the code is broken. Unit tests are good for that, right?

This potential problem did not manifest itself. Our modules are all fairly small, so if an e2e test is failing, the search space is not that large.

And just because a test exercises your module from an outsider’s perspective doesn’t mean you can’t have focused tests. By being smart about our assertions, we can still target individual parts of the contract. For instance, let’s say that our module writes out a manifest file, and we want to ensure that it gets written correctly:

In the first example, we just dump an object into our assertion library of choice and call it a day. This test is easy to write, but difficult to maintain. We have to specify all the fields in the object with this approach, even those we don’t care about. And when the test fails, you might get a nice diff, depending on your testing tooling, or you might just see an error.

In the second example, we are specifically calling out the fields we care about. This is much more flexible, allowing you to add new fields to the object without breaking old tests. (If you didn’t want this, you could make an assertion about the total number of keys.) Additionally, you could set it up so each assertion corresponds to a specific point in your contract. That way, when a single assertion fails, the problem is more immediately tractable.

We have been particularly enjoying e2e over unit tests for our new koa server. A koa server, like Express, uses a series of middlewares to define how requests are handled. In our tests, instead of calling the function for a single middleware, we just make a request to the server, and assert on the response. We have to craft requests such that they exercise certain middlewares and not others, and sometimes we have to add a little cruft to account for middlewares not related to the tests which are still running, but overall, this has provided much more value because it exercises the entire stack. Here is an example middleware:
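A sketch of such a middleware, in the generator style of early koa (the heartbeat endpoint is hypothetical):

```javascript
// heartbeat.js -- koa 1.x generator middleware; the app does app.use(heartbeat)
function* heartbeat(next) {
  if (this.path === '/heartbeat') {
    this.status = 200;
    this.body = { ok: true };
    return; // handled; don't fall through
  }
  yield next; // not ours; hand off to the next middleware
}

module.exports = heartbeat;
```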

If we wanted to do a unit test of this middleware, we could do the following:

We spend a fair portion of the set-up phase just mocking out koa. But we don’t really care about how koa works. We could switch to hapi.js, and if our server still responds to the /heartbeat endpoint, I still want that test to pass. Let’s try an e2e test:

Some details omitted for brevity.

Now our assertion matches what we care about: we have a server, and its response to our request is what we want.

Mocking for e2e tests

It would be nice if we could spin up new instances of all our dependent services for each test we run, but we have not yet invested the time to get Docker fully set up. For now, we use nock for external dependencies. It is quite nice to work with. Here’s an example, using the same server from before. It makes a call to a backing service to determine which navigation to show the user. We want to test that it handles that service’s response correctly.

Although we are mocking here, it’s ok, because we are mocking out an interface that is external to the module we are testing. There is some risk that our mocking will not match how the backing service actually behaves. To mitigate that, we could use Docker to spin up an instance of that backing service for our tests. However, that would take substantial effort, and a few lines of nock code provide good value for the effort it takes to write them. In practice, we haven’t had many issues with mocking backing services incorrectly.

It’s easy to mock out a service with a simple API like this. However, not all services you depend on will be like that. We have code that reads and writes S3 content. Nocking out the S3 http interactions would not be a good use of time — they are fairly complicated. Instead, we just create a bucket on the fly, use it for the tests, make our assertions on it, and then delete the bucket. This has worked out well — the only pitfall has been tricky issues around buckets not getting deleted when the test run is killed partway through.

Specific testing tools

There are many factors to consider when picking a testing framework, but for this discussion, I’ll focus on one that is very important to my team: how the framework handles asynchronous code.

The latest version of Jasmine provides a done callback:

This is fine, but a little cumbersome when you are working with promises. Mocha lets you return the promise directly:

The assertion library is chai.

This is nicer. Still, it’s easy to mess up. Quick, what’s wrong with the following test?

It looks reasonable enough at a glance. But because you are not returning the promise, your assertion is never called. I’ve seen entire test suites where the author forgot they had to return promises, and gleefully said, “all 50 of my newly written tests are passing”. This was odd, because the code was totally broken. Turns out that none of the “passing tests” were really being run. Mocha makes this mistake too easy.

We’ve moved on from Mocha and Jasmine, and now use tape. The brilliance of tape (and tap) is that it lets you specify up front how many assertions you want to make:

If you don’t know up-front how many assertions you will make, you can also call t.end() at the end of the function, which behaves like Jasmine’s done().

Now we know that passing tests actually mean something.

A reasonable person might object to tap’s approach, saying that it’s just extra metadata that must be maintained, and that the real solution is code coverage. I would disagree. The maintenance cost of this metadata is low: if it gets out of sync, you discover it easily, because the tests fail, and it’s trivial to figure out what to change it to. I have not found this burdensome in practice.

Code coverage in general has to be taken with a grain of salt; the ROI of hitting 100% code coverage is not always apparent. Additionally, our code coverage numbers will not necessarily line up with the number of user stories we successfully implement. Imagine that our module exports a function that needs to do the right thing for several different inputs, and each set of inputs exercises all the code. We could have one test that runs, and another test that is broken and does not actually run. Because both tests exercise the function, we would get a 100% code coverage score, even though we are not testing all of our use cases. Code coverage is merely a downstream indicator of what we really care about: whether our system does what it is supposed to.

But moreover, code coverage will not solve the “are my tests really being run” problem. One could imagine a series of events like the following:

The test runner starts the test case

The test case initiates an async operation, but does not tell the test runner that the test is async

The test runner ends the test case synchronously, since it was not informed that the test was async

The async operation is executed, and the code coverage mechanism says, “Yup, this code was run”

The async operation completes, and the assertions fail, but they are not picked up by the test runner, because it has already ended the test

We would get a passing code coverage score for the function we are testing, because we are technically running it, but our assertions are not firing in a useful way. Code coverage is not a replacement for a mechanism for ensuring that your assertions fire.

Conclusion

In summary:

Don’t share anything between app and test code

e2e tests are preferable to unit tests

tape is great (thanks, @substack and @izs!)

Thanks to @dylang and Ben Siemon for reading drafts of this.