At the beginning of 2014 the security of the Internet was rocked by two serious flaws: Apple’s “goto fail” bug (CVE-2014-1266) and OpenSSL’s “Heartbleed” bug (CVE-2014-0160). Both were vulnerabilities in the Secure Sockets Layer (SSL) technology upon which the majority of secure communications on the Internet relies. These bugs are as instructive as they are devastating: They were rooted in the same programmer optimism, overconfidence, and haste that strike projects of all sizes and domains.

These bugs arouse my passion because I've seen and lived the benefits of unit testing, and this strongly-imprinted experience compels me to reflect on how unit testing approaches could prevent defects as high-impact and high-profile as these SSL bugs. Unit testing is the practice of looking for chunks of code that make convenient “units” to cover with automated Unit Tests, small programs designed to verify low-level implementation details and detect coding errors early. The nature of the defects inspired me to write my own proof-of-concept unit tests to reproduce the errors and verify their fixes. I wrote these tests to validate my intuition, and to demonstrate to others how unit tests could have detected these defects early and without heroic effort.

Writing unit tests produces benefits beyond detecting low-level coding errors. In this article, I explore the question of whether unit testing could have helped prevent the "goto fail" and Heartbleed bugs. In doing so, I hope to establish a compelling case for the adoption of unit testing as part of everyday development, so that the experience of Self Testing Code becomes universal. I offer my insights in the hope that they may help avoid similar failures in the future, in the spirit of a postmortem or project retrospective. My experience doesn't mean I'm owed deference based on mah authoritah, but I hope to make a sufficiently compelling case that will lead more people and organizations to consider the benefits of a unit-testing culture.

Many popular and technical media stories have run with explanations of how these defects originated, why they slipped past existing safeguards before being so widely deployed, and what should be done to prevent such bugs from happening again. It troubles me that most of these analyses fall back on facile excuses that miss the mark, and promote resigned acceptance of such defects due to the ever-increasing complexity of modern software systems. It is as though the software industry at large, as well as the public that depends on it, is anxious to accept such failures as inevitable fate, the price we pay for the modern conveniences that technology affords us. It's the easiest possible explanation that allows us to make sense of a bad situation and move on as a society.

I don't accept such defects as inevitable. Rather, we must seize this opportunity to reflect on how we developers can do far better than rely on fate, or more funding, or any number of external factors to prevent security vulnerabilities or other high-impact defects caused by low-level coding errors. Bugs will happen, but neither software developers nor the public should be satisfied with that as a response to defects this colossal in scope. Deep, genuine reflection is difficult and encounters a lot of resistance, as it calls on developers to accept responsibility for their human limitations—which is often a challenge to the very self-image of a programmer. That makes it all the more important to dive deeply into these two bugs in particular, to search for genuine solutions and to avoid setting a dangerous precedent: If everything in the short-term turns out OK in the wake of "goto fail" and Heartbleed, then why bother changing anything about current software development practices?

My proof-of-concept unit test for "goto fail" may be easy to dismiss as a one-off test written with 20/20 hindsight. I would rather it appear as an example of the kind of accessible unit testing approach that development teams everywhere can apply to existing code, right now, to avoid similarly embarrassing (and potentially catastrophic) bugs. A development culture that values unit testing and whose members work to improve their craft will produce tests that will most likely catch programming errors exactly like "goto fail" long before they have a chance to impact any users.

I do know that development cultures can change. Bugs such as these give us an occasion to reflect upon our own development cultures, if unit testing is not already a vital part of them, and to begin to appreciate why unit testing is such an important development practice. I'll discuss my experience with changing a development culture in detail in a later section of this article, and offer advice for how to effect change in other development cultures, from a single team to an entire company.

I've never worked at Apple, nor do I know any Apple developers. I don't know exactly what the company-wide development culture is there, and whether this code is representative or exceptional. Even if this code is the exception rather than the norm, it's still unacceptable. It doesn't matter to me, as someone whose privacy and security might've been violated by this coding error, what the circumstances were "excusing" this particular error or what the rest of the culture looks like. I want to see greater accountability for such mistakes. Not shaming, not condemnation, etc., but accountability and the due diligence that follows it. That's the deeper strategy for how we prevent the next "goto fail" from happening.

The presence of six separate copies of the same algorithm clearly indicates that this bug was not due to a one-time programmer error: This was a pattern. This is evidence of a development culture that tolerates duplicated, untested code.

Also, the Security-55471 version of ssl_regressions.h, which appears to list a number of SSL regression tests for this library, remains unchanged in the Security-55471.14 version of ssl_regressions.h. The only substantial difference between the two versions of the library is the deletion of the goto fail statement itself, with no added tests or eliminated duplication:
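(Abridged sketch of the affected region of sslKeyExchange.c; the entire substantive fix is the deletion of one line.)

```c
    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        goto fail;
        goto fail;  /* <-- the Security-55471.14 fix deletes this duplicated line */
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;
```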

Unit testing introduces pressure to minimize copy/paste, because the copy/pasted code also has to be unit tested. That pressure could have ensured that only one copy of this algorithm existed, since a single copy would have been easier to test. A unit test could have easily verified that this algorithm was correct, merge or no, and could have prevented the “goto fail” bug from being written in the first place.

Code duplication is a Code Smell that is known to increase the likelihood of software errors. It is also apparent from the function names above that there’s more duplication besides that of the core handshake algorithm. This cut-and-paste code reuse also supports the hypothesis that the bug might have been caused by a large merge operation, as duplicate code increases the available “code surface” during merges and compounds the potential for undetected merge errors.

A copy of the same algorithm with a different HashReference instance appears immediately above the buggy algorithm in the same function. In total, the algorithm appears six different times in the same file (sslKeyExchange.c from Security-55471):
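Reconstructing from the function names in that file, the six copies are, approximately:

- SSLVerifySignedServerKeyExchange() (SSLHashMD5 variation)
- SSLVerifySignedServerKeyExchange() (SSLHashSHA1 variation, containing the bug)
- SSLVerifySignedServerKeyExchangeTls12()
- SSLSignServerKeyExchange() (SSLHashMD5 variation)
- SSLSignServerKeyExchange() (SSLHashSHA1 variation)
- SSLSignServerKeyExchangeTls12()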

That permanent double-check is important here: We don’t know exactly how that rogue second goto fail got into the code; a likely reason is that it was the result of a large merge operation. When merging a branch into the mainline, large differences can result. Even if a merge compiles, it can still introduce errors. Inspecting such merge differences can be time-consuming, tedious, and error-prone, even for experienced developers. In this case the automated double-check provided by unit tests provides a fast and painstaking (yet painless!) code review, in the sense that the tests will likely catch potential merge errors before a human inspects the merged code. It’s unlikely the original author introduced the "goto fail" bug into the code, but a suite of tests doesn’t just help you find your own mistakes: It helps reveal mistakes made by programmers far into the future.

It’s likely that the programmer who wrote this algorithm the first time did execute the program to check for errors in the new code. Most programmers will run a program with some sample inputs to verify that it’s doing what they think it should do. The problem is that these runs are often ephemeral and thrown away once the code is working; an automated test captures those runs as a permanent double-check.

Writing a set of tests to exercise this function is straightforward because now we’re thinking about concrete examples rather than conditions. Furthermore, the tests act as a double-check: It’s easy to make a mistake with conditional logic, accidentally reversing one test in the chain; but when you write tests you are stating the behavior twice, once with examples, once with logic. You have to make the same mistake in two different representations for a bug to get through.

This test was written without a testing framework, to demonstrate that an effective test can be written using tools already in use by a project. Even without referencing a standard framework, the explanation in the preceding paragraph should prove relatively easy to follow: Well-organized test cases using well-organized objects and functions with well-chosen names mean that if a test fails, you can usually diagnose the failure from the information in the test case alone, without digging through the full implementation of the test program. Testing frameworks can help in writing tests more efficiently, but they are not a prerequisite for writing well-organized, thorough unit tests.

A framework can make unit tests easier to write, and their failures easier to diagnose, especially as an entire team or company becomes familiar with its idioms. However, a framework is a convenience, not a prerequisite for effective unit testing.

A unit testing framework is a library providing code structures for organizing test cases and assertion methods for verifying that test outcomes match expectations. The output generated by a framework adheres to a standard format that usually identifies the failing test case and the line in the test file at which a failing assertion occurs. Many unit testing frameworks are based on the xUnit model.

This example will only run on OS X. You will need to have Xcode installed. The Security-55471-bugfix-and-test.tar.gz bundle contains a build.sh script that will:
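(Reconstructed summary; the script in the bundle is authoritative.)

- download and unpack the Security-55471 source
- apply the accompanying patch, which extracts the HashHandshake() function and adds the tls_digest_test.c unit test
- build the patched sources
- build and run the proof-of-concept test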

Despite the fact that C is not an object-oriented programming language, the existing code for this algorithm exhibits a clearly object-oriented design that actually makes for easy unit testing once the code is extracted into its own function. The tls_digest_test.c proof-of-concept unit test shows how a HashReference stub can be used to effectively cover every path through the extracted HashHandshake() algorithm. The actual test cases look like this:
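(The cases below reconstruct those in the bundle; the fixture and helper names approximate the originals.)

```c
/* Each case swaps a failing stub into the HashReference for one step of the
 * handshake, then checks the value HashHandshake() returns. */
static int TestHandshakeSuccess() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.expected = SUCCESS;
  return ExecuteHandshake(fixture);
}

static int TestHandshakeInitFailure() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.ref.init = FailInit;  /* stub forces the first step to fail */
  fixture.expected = INIT_FAILURE;
  return ExecuteHandshake(fixture);
}

static int TestHandshakeUpdateClientRandomFailure() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.ref.update = FailUpdateClientRandom;
  fixture.expected = UPDATE_CLIENT_RANDOM_FAILURE;
  return ExecuteHandshake(fixture);
}

/* ...plus analogous cases for the serverRandom and exchangeParams updates and
 * for final(): one success case and five failure cases in total. */
```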

Regardless of the scope of the code under test, it's critical to exhaustively test failure cases to the extent possible. It is tempting to test that the code does what it should do and leave it at that, but it's arguably even more important to test that it doesn't do what it shouldn't do.

For an algorithm this straightforward, the test cases will rather closely "mirror" the implementation: One success case, five failure cases. For higher-level or more complex operations, such close "mirroring" can make for brittle tests and should generally be avoided. This is especially important to keep in mind when using mocks or other Test Doubles to test code in isolation from its collaborators.

In the case of HashHandshake(), the contract can be described as: Five steps, all must pass. Success or failure is propagated to the caller by the return value. The HashReference is expected to respond correctly to the series of calls; whether it makes use of any functions or data beyond that passed in by HashHandshake() is an implementation detail opaque to HashHandshake() itself.

This function is more easily understood in isolation. Faced with a self-contained function like this, a programmer can begin to focus on the external effects of the code, considering questions such as:
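- What is the contract fulfilled by the code under test?
- What preconditions are required, and how are they enforced?
- What postconditions are guaranteed?
- What example inputs trigger different behaviors?
- What set of tests will trigger these behaviors and validate the full contract?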

Such a block of code is easier to test when extracted into its own function. Extracting chunks of code like this is habitual practice for people writing unit tests and can help get parts of an existing code base under test a piece at a time. Looking closely at the variables and data types used in the algorithm makes it clear that this block of code is performing a handshake on the hashes. By looking up the type of SSLHashSHA1, we can also see that it is an instance of a HashReference “jump table”, a structure containing function pointers that enables C programmers to implement virtual function-like behavior (i.e. substitutability and run-time polymorphism). We can extract this operation into a function with a name signifying its intent (leaving out the extra goto fail):
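(A close reconstruction of the extracted function; ReadyHash() and SSLFreeBuffer() are existing helpers from the library.)

```c
static OSStatus HashHandshake(const HashReference* hashRef,
                              SSLBuffer* clientRandom, SSLBuffer* serverRandom,
                              SSLBuffer* exchangeParams, SSLBuffer* hashOut) {
  SSLBuffer hashCtx;
  OSStatus err;
  hashCtx.data = 0;
  if ((err = ReadyHash(hashRef, &hashCtx)) != 0)
    goto fail;
  if ((err = hashRef->update(&hashCtx, clientRandom)) != 0)
    goto fail;
  if ((err = hashRef->update(&hashCtx, serverRandom)) != 0)
    goto fail;
  if ((err = hashRef->update(&hashCtx, exchangeParams)) != 0)
    goto fail;
  err = hashRef->final(&hashCtx, hashOut);
fail:
  SSLFreeBuffer(&hashCtx);
  return err;
}
```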

While looking for “units” to which to apply “unit” tests, the entire block of code containing the buggy algorithm, with its cluster of conditional logic, leaps out as such a unit (from the SSLVerifySignedServerKeyExchange() function in version 55471 of Apple’s Secure Transport library):
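```c
    /* From SSLVerifySignedServerKeyExchange(), abridged: */
    if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)
        goto fail;
    if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)
        goto fail;
    if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
        goto fail;
    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        goto fail;
        goto fail;
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;
```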

Some have claimed that a coding style requiring the use of curly braces for all if statements or enabling unreachable-code compiler warnings could have helped. However, there are deeper problems with the code that unit testing could help to resolve.

C programmers will also immediately recognize that the first goto fail statement is bound to the result of the if statement preceding it, but the second goto fail is not: The matching indentation of the two statements bears no significance in C, as surrounding curly braces are required to bind more than one statement to an if condition. If the first goto fail is not executed, the second one certainly will be. This means that subsequent steps of the handshake algorithm will never be executed, but any exchange successfully passing this point will always produce a successful return value even if the final verification step would have failed. More plainly: The algorithm gets short-circuited by the extra goto fail statement.

Some have argued that all goto statements are bad, based on Edsger Dijkstra's famous essay A Case against the GO TO Statement, summarized by the popular axiom "goto considered harmful". However, the goto fail statement expresses an idiom familiar to C programmers. In the case of an unrecoverable error, such statements pass control immediately to a recovery block at the end of a function, where locally-allocated resources are properly released. Other languages have built-in support for such “abortion clauses”, as Dijkstra called them in the conclusion to his essay: destructors in C++; try/catch/finally in Java; defer, panic, and recover in Go; try/except/finally and the with statement in Python. In C, there is no essential problem with or confusion surrounding the use of goto in this context. In other words, goto should not be considered harmful here.
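A minimal sketch of the idiom (the processing steps are hypothetical):

```c
#include <stdlib.h>

extern int step_one(char *buf);  /* hypothetical steps; each returns 0 on success */
extern int step_two(char *buf);

int process(size_t size) {
  int err;
  char *buf = malloc(size);  /* locally-allocated resource */
  if (buf == NULL)
    return -1;
  if ((err = step_one(buf)) != 0)
    goto fail;
  err = step_two(buf);
fail:
  free(buf);  /* single recovery block releases resources on every path */
  return err;
}
```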

The “goto fail” bug first shipped to iPhones, iPads, and AppleTVs in September 2012 with iOS 6, later appearing in iOS 7 and OS X Mavericks, and was not fixed until February 2014—seventeen months after it was introduced. A short-circuit skipping the final step of the SSL/TLS handshake algorithm left users vulnerable to a man-in-the-middle attack, whereby a malicious system relaying traffic between an affected system and another system could present the illusion of a secure connection using false credentials, and subsequently intercept all communications between the other two systems.

However such decisions are made, it is true that developing and sustaining a high-functioning unit testing culture is not a cost-free proposition. In the next section I'll explore those costs and consider whether or not they're worthwhile.

Consequently, the development cultures which produced the bugs either had not considered unit testing at all, or had considered it and rejected it on some basis, which I believe can be described as a perceived "opportunity cost": unit testing was deemed to provide insufficient value in return for the investment, draining precious resources from other priorities and opportunities. This may not have been a conscious decision, but the choice is made manifest by the other tools and practices a team adopts instead.

The buck stops with the code review process, whereby a change is accepted for inclusion into the code base by the developers who control access to the canonical source repository. If unit tests are not required by a code reviewer, then cruft will pile on top of cruft, multiplying the chances of another "goto fail" or Heartbleed slipping through. As was perhaps the case with "goto fail", the development teams at many companies are focused on high-level business goals, lack any direct incentive to improve code quality, and perceive an investment in code quality to be at odds with shipping on-time. As was the case with Heartbleed, many Open Source projects are volunteer-driven, and the central developers are short on either the time or the skills required to enforce the policy that each code change be accompanied by thorough, well-crafted unit tests. No one is paying, rewarding, or pressuring them to maintain a high level of code quality.

Of course, this raises the question: Why didn’t the teams responsible for the code write or insist upon such tests years ago at the time the bugs were introduced?

Also in both cases, having access to the Open Source code enabled me to dive into each code base and, in a matter of hours, write conclusive proof-of-concept unit tests for each bug. It also enabled me to engage the OpenSSL developers and submit a pull request for the proof-of-concept Heartbleed unit test (adapted from Google to OpenSSL coding style, of course), which was ultimately included in the central OpenSSL source repository as ssl/heartbeat_test.c.

At the same time, providing open access to the source code meant that, in both cases, anyone in the world with Internet access could inspect the code after-the-fact to grasp the nature and severity of the errors, report on their technical details and ramifications, and debate the lessons learned and appropriate responses to prevent a recurrence. The quality of those reports varies, naturally, but the transparency afforded by Open Source software enables an open debate that should ultimately lead to object lessons benefiting society. Had a similar vulnerability occurred in closed-source software, this valuable discussion would be more difficult to have—it’s quite possible that similar vulnerabilities have existed, and the software development community at large will likely never get a chance to learn from them.

It is unknown whether either bug has ever been successfully exploited, but the code has been available as Open Source on Apple’s and OpenSSL’s servers for years, providing the opportunity for a malicious agent to discover either bug and use knowledge of it to his/her advantage without notifying anyone else. In light of this realization, let’s propose a corollary to Linus’s Law:

It should be clear by now that both the “goto fail” and Heartbleed bugs were fairly straightforward programming errors, the kind of errors unit tests are so good at catching early. It should also be clear from the above discussion, supported by the implementation of both proof-of-concept unit tests, that these bugs could likely have been prevented had the teams that produced them embraced the practice of unit testing.

One last point to make: By opening each of dtls1_process_heartbeat() (ssl/d1_both.c) and tls1_process_heartbeat() (ssl/t1_lib.c) in separate browser tabs and flipping between them, again we see apparent tolerance of duplicated, untested code, as we did in the “goto fail” example. With the proof-of-concept test in place, it would be possible to eliminate the duplication by extracting one common function with an extra set of parameters—perhaps a small “jump table”—to implement the slight differences between the algorithms.

Given the power of modern version control systems and the increasingly-common practices of forking, merging, and cherry-picking, tests have become more important than ever to guard against unintentional changes, especially changes leading to a regression of a known catastrophic bug. The apparent removal of a regression test during a cherry pick or a merge should set off alarm bells, even more so if the test was included in the same change as the fix, as the fix could become undone as well.

In a unit testing culture, when a bug is discovered, the natural reaction is to write a test that exposes it, then to fix the code to squash it. The “goto fail” discussion made the point that manual runs to verify a code change prove ephemeral; by the same token, a fix unaccompanied by a test is vulnerable to becoming undone. An automated regression test guards against future errors just as a test written for the code in the first place could have.

The proof-of-concept test above shows that, had someone tried to unit test the code, they could plausibly have caught and prevented one of the most catastrophic computer bugs in history. The existence of the proof-of-concept unit test refutes the assertion that writing one would have been impossible. Sadly, the fix submitted for the bug also lacked a unit test to verify it and guard against regression.

A coding standard document could also help with this process. In addition to specifying the particulars of naming, whitespace, and brace placement, such a standard could require that request- and buffer-handling code be accompanied by tests to verify the absence of buffer overrun issues. This would be in addition to requiring that all code submitted for review be covered by new or existing unit tests as a matter of policy.

Developers well-accustomed to unit testing would have produced or insisted upon a small series of well-tested changes building up to a feature rather than a single, monolithic change such as the one in question. A smaller, well-tested change containing only the above functions could have better enabled the author, the reviewer, or an interested onlooker to notice the use of an externally-supplied value to read a block of memory, and to verify that such a value had been handled properly. An explicit reference to the specific section of the protocol defining the structure and handling of heartbeat requests might've also helped focus the testing and the review.

There is another issue we can address in the Heartbleed example that we could not in the “goto fail” example. With “goto fail”, we have no visibility into the exact change that introduced the bug; available evidence suggests that it was possibly a large merge operation, compounded by code duplication. Still, the “complicated merge” theory is only a guess. With Heartbleed, we can see the exact change that introduced both the TLS heartbeat feature and the Heartbleed bug buried within it, and that it had been code-reviewed.

The contents of the returned buffer in the failing test will depend on the contents of memory on the machine executing the test. The value of kMaxPrintableCharacters, set to 1024 by default at the top of the test file, can be increased to see even more memory contents returned.

Like the “goto fail” test, this test was written without the help of a testing framework. It may be copied directly into the test/ directory of any OpenSSL release from 1.0.1-beta1 to 1.0.1g and executed without any modification. When executed for version 1.0.1g, the test passes and produces no output. For the other versions, the test cases with “Heartbleed” in the name fail with output resembling:

The tls1_process_heartbeat() tests are nearly identical, except they call SetUpTls() to initialize a HeartbleedTestFixture and don’t cover the ExcessivePlaintextLength case. ExecuteHeartbeat() and other test helper functions are a little more complicated than those of the “goto fail” test, but only slightly.

heartbleed_test.c should compile on UNIX-based platforms without modification. On Windows, you may want to install VirtualBox (or another virtual machine platform) to run Ubuntu Linux, FreeBSD, or some other Open Source UNIX variant.

In this case, the protocol spec practically defines the appropriate unit test for us. It doesn't explicitly say that there should be verification that payload_length matches what is actually read, but it provides a strong hint that payload_length should receive special attention.

Given that the heartbeat functions process request buffers containing externally-supplied data, a programmer accustomed to self-testing would find it habitual to probe for weaknesses in handling such input—especially as it pertains to the reading and allocation of memory buffers.
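For example, a test case in the spirit of the proof-of-concept test might look like this (the fixture and helper names only approximate those in heartbeat_test.c):

```c
/* Send a heartbeat request whose declared payload length far exceeds its
 * actual payload, and expect it to be silently discarded rather than echoed
 * back along with adjacent process memory. */
static int TestDtls1Heartbleed() {
  HeartbleedTestFixture fixture = SetUpDtls(__func__);
  fixture.payload = "HEARTBLEED";                      /* actual payload: ten bytes */
  fixture.sent_payload_len = kMaxPrintableCharacters;  /* declared: far larger */
  fixture.expected_return_value = 0;                   /* request silently dropped */
  fixture.expected_payload_len = 0;                    /* nothing echoed back */
  return ExecuteHeartbeat(fixture);
}
```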

As opposed to the case of the “goto fail” bug, there is no need to extract a new function: both dtls1_process_heartbeat() and tls1_process_heartbeat() are already good-sized units that don’t require a large amount of complicated setup to get under test. We can get right to the same questions posed earlier in the context of “goto fail”:

The first check covers the case where the client has sent the empty string as a payload, making sure the actual size of the data read from the socket matches this minimum request size; the second ensures the client-supplied payload size does not exceed that of the buffer containing the payload data.
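In the fixed functions, those checks read essentially as follows (abridged from the 1.0.1g fix):

```c
    /* Read type and payload length first */
    if (1 + 2 + 16 > s->s3->rrec.length)
        return 0; /* silently discard */
    hbtype = *p++;
    n2s(p, payload);
    if (1 + 2 + payload + 16 > s->s3->rrec.length)
        return 0; /* silently discard per RFC 6520 sec. 4 */
    pl = p;
```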

The memcpy() is bad because the length value, payload, is not verified as matching the length of what was actually read from the request. The request could have contained the empty string, yet indicated a length of up to 64 kilobytes. As a result, up to 64 kilobytes of the process's memory are returned as a response, not merely the contents of the request buffer. Again, there is no logging of this event; it literally leaves no trace.

n2s() is a macro (from ssl/ssl_locl.h) that reads the next two bytes pointed to by p, stores the value in payload, and advances p by two bytes.
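Its definition looks roughly like this (from ssl/ssl_locl.h, whitespace adjusted):

```c
#define n2s(c,s)  ((s = (((unsigned int)(c[0])) << 8) | \
                        ((unsigned int)(c[1]))), c += 2)
```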

The local pointer variable p is initialized to the beginning of the heartbeat request buffer. The first byte identifies the type of the request, to be stored in hbtype. The next two bytes specify the client-supplied size of the request data that the client expects to be copied and sent back as the response; this size is stored in payload. (payload_size or payload_len would’ve been a better name to match the variable's intent.) Following that is the beginning of the client-supplied data, or “payload”, to be copied and returned to the client, which will be pointed to by pl. (This variable should’ve been named payload.)
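The relevant lines, abridged from the vulnerable dtls1_process_heartbeat():

```c
    unsigned char *p = &s->s3->rrec.data[0], *pl;
    unsigned short hbtype;
    unsigned int payload;
    unsigned int padding = 16; /* Use minimum padding */

    /* Read type and payload length first */
    hbtype = *p++;
    n2s(p, payload);  /* client-supplied length, never validated */
    pl = p;           /* start of the client-supplied payload */

    if (hbtype == TLS1_HB_REQUEST) {
        /* ... */
        buffer = OPENSSL_malloc(1 + 2 + payload + padding);
        bp = buffer;

        /* Enter response type, length and copy payload */
        *bp++ = TLS1_HB_RESPONSE;
        s2n(payload, bp);
        memcpy(bp, pl, payload);  /* copies `payload` bytes, however large */
        /* ... */
    }
```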

The change introducing the bug was code reviewed; it is apparent that the reviewer did not insist that the change include unit tests. The bug was not discovered until April 2014, and the fix was released as part of 1.0.1g.

Heartbleed was a similarly heartbreaking case of untested security-critical code which appeared as part of the ubiquitous OpenSSL library. It was introduced in January 2012 as part of a large, untested change implementing TLS heartbeats in OpenSSL-1.0.1-beta1. The bug enabled an attacker to send an empty heartbeat request while declaring that it contained up to 64 kilobytes of data; a vulnerable system would read but not verify the declared size, and would respond with whatever contents resided in up to 64 kilobytes of its memory adjacent to the request buffer. There was no logging of this exchange; there would be absolutely no trace of the attack.

However, despite all of its benefits, unit testing shouldn't be the only tool in your development toolbox for ensuring high-quality, mostly bug-free code. Next, let's consider a number of other available tools and practices that can be used in concert with unit testing as part of day-to-day development, and why it's still worth adopting unit testing in light of these other items that can be brought to bear in catching defects early.

That sense of immediate gratification is what really hooked most of us who swear by our unit tests. For others, it's the high degree of trust that regressions won't occur. In either case, unit testing creates a pure high, based on a sense of forward progress, a sense of fearless productivity, with none of the other nasty side effects of addiction. No rational arguments, no data, no charts or dollar amounts needed.

Immediate gratification is what really hooked most of us who swear by our unit tests. No rational arguments, no data, no charts or dollar amounts needed.

The exhilaration of immediately verifying that the code you just added or changed really did what you intended it to do is its own reward. The feeling of (relative) certainty that your code will correctly handle any input thrown at it is invigorating. The rush of excitement when a test detects an error in code you just wrote, an error you (or some other poor sap) won't have to spend hours debugging, fixing, verifying, and cleaning up later, is addictive.

What you (should have) experienced is the intellectual thrill that comes from making a change to part of a real system, and seeing the impact of that change in near-real time, without having to build and launch the entire product and poke around through the user interface. Think about it: Up until now, you think you've understood the "goto fail" and Heartbleed code by just reading it, or the explanations earlier in this article, or perhaps other sources you may have read. But now you've actually felt how the code works. In the case of the Heartbleed test, you could actually see the contents of your machine's memory spilled onto the screen. (On my machine, I can clearly see my PATH and other environment variables.)

The first two sections of this article contain links to the "goto fail" unit test bundle and the Heartbleed unit test . If you haven't done it already, download the code, build it and run it on your system. Make sure the tests pass. Then, change something, either in the test code or in the code under test, to make it break. Look at the output. Take it in, reflect. Then fix the code to make the test pass again.

Best of all, unit testing skill is portable across domains, languages, and companies, just like any other basic programming skill. It is an investment that pays returns over the course of a lifetime. Remember: Past unit testing experience is what enabled me to write proof-of-concept unit tests for both “goto fail” and Heartbleed so quickly, having no familiarity with the code and not programming on a regular basis for years.

What I'm saying is, there is no greater argument in favor of unit testing than the actual experience of unit testing. You Cannot Measure Productivity, but you can feel it. Even if your first unit tests prove ugly, complicated, and brittle, trust me, you can get better at it, and the reward will be well worth the journey.

There is no greater argument in favor of unit testing than the actual experience of unit testing. Best of all, unit testing skill is portable across domains, languages, and companies, just like any other basic programming skill.

My own experience with unit testing did not begin with some extensive rational argument, or compelling objective evidence convincing me to try it. The team I was a member of at Northrop Grumman had just finished a brutal push to meet a required certification deadline; in the following months, while rewriting a subsystem for performance and stability reasons, I tried unit testing out for kicks. The contrast between the two experiences couldn't have been more stark, or more convincing. I could see and feel the progress of the new system as every new feature was added, and the finished product turned out exactly as intended. When the rare bug did occur, it took no more than a couple of hours to pinpoint it, reproduce it, fix it, and ship the fix—without adding any new defects in the process.

After all these words, words, words, do you remain unconvinced of the value and power of unit testing? Can't say I blame you. To be honest, like other good things in life, you can't really know what it's like until you've actually tried it. On top of that, it's possible you won't enjoy it at all until someone helps you learn how to do it well.

Or, even worse: The team may decide to leave the bug in-place from fear of breaking something else. That certainly doesn't inspire user trust, much less developer confidence and productivity.

Contrast that against the situation where the buggy code isn't well-covered by unit tests. The developer must take time to understand the affected code and far more care to pinpoint the error and ensure its fix is free of side-effects. Verification of the fix may not come for days or even longer, depending on the nature of whatever pre-release testing happens to be in place, if any. The interruption is prolonged, and drains more development and testing time from the new release.

If the buggy code is well-covered by a suite of automated tests, especially small unit tests, this interruption may not take much time on the part of the developer assigned to fix the bug. The existing tests serve as documentation of the intent of the affected code. The developer adds a new test to reproduce the bug, verifying that the defect is well-understood before attempting to fix it. This new test verifies the fix for the bug, and the existing tests provide a high degree of confidence that the fix is free of unintentional side-effects. The new test becomes a permanent part of the test suite to guard against regression, the fix is released, and development on the new release continues. The interruption is finished.

Imagine a bug is found in integration or system testing, or after a new release is pushed to a datacenter, or perhaps by a user some time after that. The developers responsible for the buggy code have already moved on to other tasks, and are likely under deadline pressure to deliver. If the bug is severe enough, at least one of those developers will have to stop to address it, slowing the progress of the new development work underway.

Think of the opposite, as was the case in the pre-unit testing days of GWS: When you're on a project that doesn't have ample unit testing coverage, you're afraid to do anything since you don't know what you might break.

Think of it like this: Every time a test fails, that is an opportunity to deepen your understanding of the system. If you're new to a team, breaking many tests as you begin to make changes to the system can help you become productive far more quickly, as each of these events aligns your awareness of the system more closely with reality. If you've been on the team for a long time, existing tests will answer many questions that new contributors may have, saving your time and focus. They will also remind you of all the nuances of the code you might have written in the past, and haven't had to think about for some time, should you have to dive back into it. In other words, you benefit your future self when adding a well-crafted suite of tests to your code, minimizing the time needed to context-switch back into that prior state of mind.

Poorly-written unit tests lack this quality, usually because less thought is given to test code than "production" code. The solution: Set the same quality bar for test code as production code. If you don't, your tests will become hard to maintain and slow down the team.

Well-written unit tests can provide two types of documentation: the test names act as a sort of specification of the code's behavior; and the tests themselves act as code samples for each behavior case. Even better than typical Application Programming Interface (API) documentation, well-maintained unit tests are by definition an up-to-date representation of actual behavior. The author of a unit test effectively communicates to other developers how a piece of code should be used, and what to expect from it. These "other developers" may be brand new to the team, or may not yet be hired (or even born). Such documentation helps developers understand unfamiliar code, even entire systems, without interrupting anyone else to the degree that they might without unit tests.

Unit test names can act as a specification of the code's behavior; the tests themselves act as code samples for each behavior case. To achieve this, set the same quality bar for test code as production code.

When code-level design is approached this way, all of the smaller pieces that make up the larger system become not just more reliable, but easier to understand. This makes everyone more productive, as the mental effort required to comprehend what a specific piece of code does is minimized.

Think of what problems you're trying to solve with the code you're writing; then think of the code you'd like to write, as a client, to make use of the solution. That ideal client code can be expressed as unit test cases that use the interface of the code you're developing.

Far from being an exercise in academic purity, code quality matters. Bad code provides bugs with plenty of shadows in which to hide; good code increases the chances that they will be found and squashed sooner rather than later. When the author of a piece of code writes a test for that code, the author effectively becomes the first user. Just as eating your own dogfood is good software development practice at the overall product level, having to write code that uses your own code can lead to improved designs that are more readable, maintainable, and debuggable.

In other words, you can be more productive because you can iterate on code much more quickly: You don't need to start up some heavyweight server if you can just run a unit test instead. So if it takes a few tries to get some code right, those few tries might take minutes (or longer) if you have to start up a server again and again, compared to seconds if you just need to rerun the unit tests each time.

This rapid feedback cycle generates a sense of flow during development, which is the ideal state of focus and motivation needed to solve complex problems. Contrast that with the opposite phenomenon, using the familiar operating systems metaphor of context switching . Context switching requires that the present state of operations be saved somehow, and that a new state of operations be swapped in before initiating the new activity; then there's the time and effort involved in switching back. Plus, there's the issue of how much state must be managed per operation. Without unit tests, we have to use more of our brains to remember weird corner cases and strange side-effects, giving us less time and energy to do the thing we're better at than the computer: advancing solutions to new problems rather than juggling the weight of all the problems that have already been solved.

Unit testing is not in the same class as integration testing, or system testing, or any kind of adversarial "black-box" testing that tries to exercise a system based solely on its interface contract. These types of tests can be automated in the same style as unit tests, perhaps even using the same tools and frameworks, and that's a good thing. However, unit tests codify the intent of a specific low-level unit of code. They are focused, and they are fast. When an automated test breaks during development, the responsible code change is rapidly identified and addressed.

In case you missed it, the important point about the GWS Team story is that over time, unit testing discipline allowed the team to move faster and do more. Unit tests are just as much about improving productivity as they are about catching bugs, so proper unit testing sped them up rather than slowed them down. Let's highlight a few factors that contributed to this outcome.

Thanks to the GWS example inspiring the efforts of the Testing Grouplet (a team of developers volunteering to promote unit testing adoption, described in a later section of this article), many teams at Google were able to transition to a unit testing culture and benefit from reduced fear and increased productivity. It did take time to overcome inertia, indifference, the friction of outdated tools, and resistance, since at first unit testing felt like a cost, and some people worried that the time spent writing that second representation of behavior could be spent writing new code (that would get them promoted). Eventually, as people experienced what it meant to cast aside the fear of change, they came to see that fearlessness as easily outweighing those lines of code, in terms of its impact on their happiness, on their team's happiness, and on the bottom line of productive output.

Furthermore, the mitigation of fear led to the expansion of their joy in programming, as they could see tangible progress being made towards exciting new milestones without being held back by chronic outbreaks of high-priority bugs. The impact on productivity of high morale, based on the ability to remain in a state of creative flow, cannot be overstated. While I was at Google, the GWS Team exhibited the ideal testing culture, integrating an enormous number of complex changes from outside contributors while making their own constant improvements.

Over time, unit test coverage and development momentum went up, while defect, production rollback, and emergency release counts went down. New team members found themselves becoming productive far more quickly because the tests allowed them to gain a deeper perspective on a system one unit at a time, and to begin contributing changes with the confidence that the existing tests would likely detect any unexpected side-effects. Any tests they caused to fail in the course of their early efforts accelerated their grasp of the system. Experienced members of the team, who had grown cautious of making changes and accepting changes from contributors, were able to make and accept changes quickly for the same reason and no longer had to rely primarily upon large and expensive system or manual tests with feedback cycles on the order of hours or days. Adding more new developers actually allowed the team to move faster and do more, avoiding the scenario described by Brooks's Law in which "adding manpower to a late software project makes it later".

Determined to overcome this fear, the GWS Team introduced a testing culture. They took a hard line: No code was accepted, no code review was approved without an accompanying unit test. This often frustrated contributors from other teams trying to launch their features, but the GWS Team stuck to its guns.

Fear is the mind-killer. It stops new team members from changing things because they don't understand the system, and it stops experienced people from changing things because they understand it all too well.

As a concrete example, let's take what is possibly the most popular page on the Internet: Google's home page. The Google Web Server (GWS) team's unit testing story became well-known throughout the company. The GWS Team had gotten into a position in the mid-2000s where it was difficult to make changes to the web server, a C++ application serving Google's home page and many other Google web pages. Despite this difficulty, integrating new features was integral to the success of Google as a business. The barrier stopping people from making changes as rapidly as possible was the same one that slows change on most mature codebases: a quite reasonable fear that changes will introduce bugs.

When I joined Google in 2005, it was already very successful and many "long timers" believed it was because we were doing everything right. As a result, at that time and for some years afterwards, there was a lot of resistance to change. However, as the user base and potential for catastrophe exploded, and as success and the growth that came with it caught up to Google, it became clear that more “rock stars” producing “rock star” code was going to produce nothing but a bunch of noise and confusion in the long-term. An influx of new Google developers eventually helped accelerate the cultural shift towards unit testing adoption, both because these new developers were open to the idea, and because testing eventually proved effective in helping these new folks get up to speed and avoid making mistakes.

Despite the risks and the costs, it's important to realize that the benefits of unit testing go beyond merely minimizing the chances of releasing catastrophic bugs.

There have been examples in the past of successful teams or companies full of rock star programmers banging out code that changes the world. Google certainly fit this description for its first several years of existence. In that case, it’s arguable that, during that era, the time spent on unit testing would’ve been wasteful, as it might've needlessly slowed down those top-notch developers, especially if they weren't already used to writing unit tests. Since the company and the code base were smaller, and code reviews were already mandatory, the company could effectively manage the complexity by only hiring the "smartest" programmers who could rapidly get up to speed in that environment.

In practice, buggy unit tests tend to be the exception. If practicing pure Test-Driven Development, a failing test should be written before the code that makes it pass; this could help to prevent such bugs. If not practicing pure TDD, temporarily adding an error into the code under test to make sure the test will fail can also help. In either case, writing multiple test cases that check that the code doesn't do what it shouldn't do (instead of just checking the happy path where all inputs are valid) may reveal bugs in other test cases. Still, the possibility remains that unit tests themselves may contain bugs, especially if care isn't taken to ensure that they fail when they're supposed to.

It's arguable that the test makes things worse in this case, providing a false sense of security. However, the bug could exist even if the test hadn't been written; given the existence of the buggy test, fixing the code and the test is tantamount to providing a regression test for the bug. Fixing the test and learning from the mistake provides value; blaming the test and deleting it is a step backwards. As one possible measure to avoid buggy tests in the future, the team responsible for such a bug could endeavor to take a closer look at the test code submitted as part of future code reviews, to provide it with the same priority and care as "production" code.
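As a hypothetical illustration (invented names, not drawn from either code base), consider a test for a function that is supposed to strip "B:"-tagged fields from a record:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* filter_b_fields() is assumed to remove every "B:"-tagged field from a
 * space-delimited record, writing the result to out. */
extern void filter_b_fields(const char *record, char *out, size_t out_size);

static void TestFiltersOutBFields(void) {
  char out[64];
  filter_b_fields("A:1 B:2 C:3", out, sizeof(out));
  assert(strstr(out, "A:1") != NULL);  /* kept fields survive */
  assert(strstr(out, "B") == NULL);    /* bug: should check for "B:" */
}
```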

The second expectation in this test should check for "B:", with a colon, not just "B". If the code under test accidentally filters for "B" without a colon, the test will pass when it should fail.

On the other hand, be aware of the saying: "There is nothing more permanent than throw-away code." The trade-off is that the more features are implemented without accompanying tests, the more Technical Debt a team builds up that must be repaid later. Unit testing can be difficult if you don't design for testability from the start—using dependency injection, writing well-defined classes that focus on one thing, and so forth. It is up to the team to gauge the acceptable limits of such debt, and at which point it must be paid to avoid an even more expensive rewrite once maintenance and new feature development grow too cumbersome.

Used wisely, it is a very effective tool. It is not a universal cure, however; much of the debate surrounding the use of mocks is also a debate about excessive DI.

Dependency injection is the design principle whereby a piece of code does not contain direct references to its dependencies, but contains abstract interface references instead. This can improve the isolation of a piece of code to make it easier to test, by replacing concrete dependencies with stubs, mocks, or other Test Doubles.
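In C, this is often accomplished with a structure of function pointers, exactly like the HashReference jump table from the “goto fail” example. A minimal sketch with hypothetical names:

```c
#include <stdio.h>
#include <string.h>

/* The dependency is expressed as an abstract interface... */
typedef struct {
  int (*send)(const char *msg);  /* returns 0 on success */
} Transport;

/* ...so the code under test never names a concrete implementation. */
static int NotifyUser(const Transport *t, const char *user) {
  char msg[64];
  snprintf(msg, sizeof(msg), "hello, %s", user);
  return t->send(msg);
}

/* A test double records the call instead of touching the network. */
static char last_sent[64];
static int FakeSend(const char *msg) {
  snprintf(last_sent, sizeof(last_sent), "%s", msg);
  return 0;
}

static int TestNotifyUserFormatsGreeting(void) {
  Transport fake = { FakeSend };
  return NotifyUser(&fake, "alice") == 0 &&
         strcmp(last_sent, "hello, alice") == 0;
}
```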

Speaking of new projects or teams or companies or domains, as ideal as it may be to follow Agile practices to the letter and practice pure Test Driven Development (TDD) at all times, sometimes a developer or a team needs to explore, to play, before getting serious about defining expectations and behavior. (Some argue that always following all Agile practices to the letter is a demonstration that you don't understand Agile.) While it’s always nice to get testing experience as early as possible on a project, sometimes you just need to write throw-away, prototype code; in that case thorough unit testing is probably overkill. This may be especially true of startups trying to launch a product as fast as possible.

Sometimes, tests themselves can become a maintenance burden; it may seem like they paint a project into a corner, restricting progress rather than maximizing it. This is a particular danger to new teams that lack experience with unit testing and don't understand its value. Mock objects are prone to misuse by inexperienced practitioners, leading to brittle tests of dubious value. With experience, this scenario becomes less likely. You eventually learn to step back, reevaluate the goal of the code and the test, and rewrite one, the other, or both. In the meanwhile, it may become necessary at times to replace an overly-restrictive test rather than to spend the effort salvaging it.

Martin, Kent Beck, and David Heinemeier Hansson initiated a series of Google Hangout discussions billed as Is Test-Driven Development Dead? I thought it should've been called "Are Mocks Dead?", given the agreement between the three that overuse of mock objects is beginning to leave folks with a negative view of unit testing.

Mock objects are a type of Test Double that enable tests to specify expected behaviors rather than outcomes. They can help to test code in isolation from its dependencies and to detect critical side-effects that aren't reflected by the outputs. However, they can be overused, leading to tests that "mirror" the implementation, whereby any change in the code will break the test.

If developers are not motivated to research available materials and improve their skills, or just don't have any idea how to begin, this may imply the need to invest in internal training programs or to contract outside help to provide training. This can lead to a bit of price-shock if resources are tight, deadlines are looming, and the future benefits do not seem clear. The time required to learn the necessary skills should be no greater than that required to train developers in any other skill or technology; but if developers resist, the process can become more drawn-out, painful, and expensive.

To remedy this lack of knowledge and experience, motivated developers can band together to improve one another's unit testing skills and increase the amount of test coverage of the code base over time. In this section, I will describe how the Google Web Server Team built up its test coverage and achieved a high degree of overall productivity; in later sections I will explain how Google as a whole was able to adopt a unit testing culture, and how lessons from that experience may apply to individual teams. However, self-training will take time and energy, and the big-picture payoff may not be immediately apparent, so it requires patience, honest effort, and commitment to see all the way through. Over time, though, as the code base grows and more developers join the team, the value becomes increasingly clear. A two-person team might manage without unit testing, but a twenty-person team will have a harder time, as feature and communication complexity is compounded.

While coverage is a useful metric, it's not an objective measure of code or product quality, and should be used merely as a signal rather than proof of anything. Martin's Test Coverage bliki explains this in detail.

Code coverage is a measurement of how much code is actually executed by a test or test suite. It can help identify areas of a system that have not been sufficiently tested.

Unit testing, like any other tool, language, or process, can be applied poorly—especially when one first begins, and even more so if one has no good examples to follow, nor mentors for guidance. Unit tests which are brittle, large, slow, perpetually broken (and subsequently ignored), or flaky set bad examples which can get replicated through an entire test suite like a virus. Poorly-written tests can actually be worse than no tests at all, leaving the impression that testing is a waste of time. Builds remain broken and ignored, flooding the testing signal with the noise of constant failures. Developers uninterested in working with the testing environment become willing to live with the fear of making slow and painful changes. The end result is a drag on productivity, an increased risk of defects, and a team convinced that testing is for other people.

Flaky tests, aka nondeterministic tests, produce different results with no change in the code under test or its inputs. They indicate a lack of control over at least one dependency that is relevant to the outcome of the test. Fixing flakiness involves either better isolating the code under test, or better controlling its dependencies.
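A minimal sketch of the failure mode and its remedy (hypothetical names):

```c
#include <time.h>

typedef struct { time_t timestamp; } Record;

/* Hypothetical code under test: stamps a record with the current time. */
extern Record MakeRecord(void);

/* Flaky: the wall clock is an uncontrolled dependency, so this passes or
 * fails depending on whether the clock ticks between the two calls. */
static int TestRecordIsTimestamped(void) {
  Record r = MakeRecord();
  return r.timestamp == time(NULL);
}

/* Deterministic: inject the clock so the test controls it. */
extern Record MakeRecordWithClock(time_t (*now)(void));
static time_t FakeNow(void) { return 12345; }

static int TestRecordIsTimestampedDeterministic(void) {
  Record r = MakeRecordWithClock(FakeNow);
  return r.timestamp == 12345;
}
```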

That said, this is a one-time cost. The cost of bringing someone up to speed on unit testing is relatively low if good unit testing practices are already in place on a team, and unit testing skills are portable from one project to the next. Hence, the learning curve is steepest for teams that don't have any unit testing practices at all.

There will be a learning curve. Like any skill that relies on craft rather than rote processes, a programmer learning to write unit tests will have to go through phases of learning and development, of trial and error, reflection, experimentation, and integration. This takes time, energy, and funding away from other activities. It will cause an initial slow-down in development as people grow accustomed to the practice.

While unit testing can greatly reduce the number of low-level defects, including defects as high-visibility and high-impact as "goto fail" and Heartbleed, and have a positive influence on other aspects of code quality and the development process, building and maintaining a unit testing culture comes at a cost. There’s no such thing as a free lunch.

The point is, eventually, we did make it happen, despite the odds being stacked against us. To wrap up this article, I'd like to spell out a few general principles I've drawn from my Google experience that may provide clearer insight into how to effect similar changes in your own team or throughout your own company over time.

However, I don't want to leave you with the impression that Google is wonderful and does everything right, but your own team or company is hopelessly screwed. I've provided this description to foster ideas, not remind you of how far from the ideal your environment may be. Trust me, this environment I'm describing of the Google I left is in stark contrast to the Google I first joined, and my Testing Grouplet partners-in-crime and I were underfunded and woefully outnumbered. We had to start small and grind away for years to effect the change in the culture that we'd committed to make happen.

Google had other tools, processes, and layers of testing and staging in place to ensure the highest possible code quality and avoid catastrophic, preventable defects. They didn’t catch every defect, but many that did slip through were relatively minor and easy to pinpoint and repair swiftly, free of the fear of negative side effects. More challenging defects could usually be addressed with a greater degree of confidence and speed as well. Automated testing, including high levels of unit test coverage, was critical to this fear-free environment that enabled high productivity despite the massive scale of the development operation and user base.

After Testing on the Toilet launched, Nooglers became the primary mechanism for improving distribution throughout Mountain View as the company grew and acquired more office space. We ended the unit testing lecture with the promise of books or T-shirts for any brave Nooglers who'd volunteer to post that week's TotT episode in their buildings. We called them the "Noogler Army". This was yet another way to get people engaged in the unit testing culture, to have fun and feel a sense of belonging and early contribution to the cause.

Working with EngEDU, Google's in-house training organization, the Testing Grouplet produced an introductory unit testing lecture and lab. This helped ensure that every new developer coming into Google was at least aware of the available tools and frameworks, of the rationale behind unit testing, and of some basic unit testing principles and techniques. Normally, after the one-hour lecture given by a member of the Testing Grouplet, the Nooglers would attend a lab proctored by another Testing Grouplet member to gain some immediate hands-on experience with what they'd just learned. The Testing Grouplet helped produce and maintain the internal materials used in this lab.

Physical build orbs are highly-visible information radiators, the next best thing to a full-blown Communal Dashboard. Arguably, orbs still might have a place on a team with a full-blown dashboard, as they encourage a playful "shaming" culture, whereby team members grow personally concerned about the well-being of the orb and hold each other accountable when it appears unhappy due to build breakages.

The point of the orbs was threefold. Beyond serving as the information radiators just described, they were fun to hack on: for people who wanted to promote testing culture in an immediately tangible fashion, putting together or extending an orb project was an enjoyable way to go about it, which helped recruit people into Testing Grouplet projects and generated a sense of energy and progress, boosting morale. Finally, the Testing Grouplet used them as "prizes" for teams that signed up for the Test Certified program, in the time-honored Google tradition of persuading people to take action by rewarding them with nifty swag. We went with the grain of Google nature, not against it. Absent funding and authoritah, the Testing Grouplet had to make the best use of available resources and cultural forces to effect change. In fact, I'd argue that these constraints forced us to produce creative solutions with more staying power than any amount of money or authoritah ever could have bought.

My first coding project at Google was to write a script that would change the color and pulse of a glowing orb—a spherical lamp small enough to be balanced in one hand and large enough that, when placed on a cube wall or a shelf, it could be seen by a whole team—based on the pass/fail status of a Chris/Jay continuous build. Over time, this script would expand in scope to handle a dizzying combination of build projects running on different continuous integration systems (ultimately including TAP) and controlling several different hardware orb devices, including the NYC-inspired Statue of Lorberty (yes, the torch would glow with different colors). Eventually browser plugins would serve as more visible reminders to individual team members regardless of whether they were at their desks or logged in using their laptops, but physical orbs in a shared team space never went entirely out of style.

In case that last section hasn't sunk in yet: Centrally-managed continuous integration infrastructure. One-page, one-click setup of build projects. Every change in the company was integrated, built and tested within minutes (at least at the time I was there) via distributed build and execution in the cloud. Every result was stored and made visible to every developer in the company. Most breakages were fixed before most affected projects even noticed. Heaven, Nirvana, Valhalla, Stonehenge —whatever you want to call it, TAP was it.

TAP represented the crowning achievement of the Testing Grouplet's efforts. Developed by the Testing Technology team in close collaboration with the Build Tools team, TAP pushed the boulder over the top of the hill after years of steady effort. By the time I left Google, nearly every team had at least one TAP build, and most build breakages were rolled back or fixed before the responsible build cops even had a chance to notice them.

A "build cop" is someone tasked with ensuring that the build remains in a passing state. When a breakage occurs, it's the build cop's responsibility to either "roll back" the responsible change, to ensure its fix is submitted immediately, or to handle other technical issues leading to the breakage. This is a role that team members typically rotate through on a weekly basis.

An outcome of the January 2008 Revolution Fixit, the Test Automation Platform (TAP) became Google's centralized continuous integration system. Rolled out Google-wide during the March 2010 TAP Fixit, TAP was built upon Google's in-house toolchain, which used cloud infrastructure to massively parallelize build actions and test executions. For every code change, TAP executed every test in the company's code base affected by that change, and only those tests, within minutes. (This time scale may have shifted by now, as Google has continued to grow since I left.) A TAP build was configured via a single short web form, and any project could have multiple builds. TAP's data collection component, Sponge, collected the results of every build attempt and test run, whether launched by an automated build or an individual developer, recorded its build commands and complete execution environment, and archived the information for later inspection. The TAP UI provided easy visibility into every change affecting every project in the company.

When the Testing Grouplet first started in 2005, the existing centralized testing service, called the Unit Test Framework, was unable to keep up with demand. It used a dedicated set of machines to build and execute every test in the company and store the results in a database, but the feedback cycle grew ever longer due to increased load on the system, diminishing its value.

In response, two Ads developers built their own single-machine, project-specific continuous integration framework, known as the "Chris/Jay Continuous Build". This framework spread throughout Google thanks in part to its inclusion as a Test Certified Level One requirement. It provided a relatively flexible continuous integration server for Google projects and supported the Testing Grouplet's Test Certified mission well for many years, though a C/J build did require a fair amount of maintenance from each team that used one.

Given the large shared source repository and the uniform language styles applied throughout, Google encouraged the development of common libraries that hid low-level details and were reused across all Google projects. The most widespread examples were the infrastructure for Remote Procedure Calls (RPCs) and protocol buffers, a data description language used within the RPC system and in many other places where hierarchical, often serialized data structures were required. If anyone at Google had tried to define serialized structures and manipulate memory buffers directly (such as the buffer manipulation in the code containing the Heartbleed bug), the first thing a code reviewer would've said is, "Why not use a protobuf?"
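To make that concrete, here is a rough sketch of the alternative the reviewer would have had in mind. The message definition and all names below are hypothetical, not an actual Google API; the point is that the generated code handles the serialization and bounds-checking that would otherwise be written (and botched) by hand:

```cpp
// Hypothetical message definition, compiled by protoc into heartbeat.pb.h:
//
//   message Heartbeat {
//     optional uint32 sequence = 1;
//     optional bytes  payload  = 2;
//   }

#include <string>
#include "heartbeat.pb.h"  // assumed output of the protobuf compiler

std::string EchoPayload(const std::string& wire_bytes) {
  Heartbeat request;
  // ParseFromString() validates the wire format and fails cleanly on
  // truncated or malformed input, rather than trusting a length field
  // and reading past the end of a buffer as the Heartbleed code did.
  if (!request.ParseFromString(wire_bytes)) {
    return "";  // reject malformed input; don't echo back stray memory
  }

  Heartbeat reply;
  reply.set_sequence(request.sequence());
  reply.set_payload(request.payload());

  // SerializeToString() emits exactly the bytes of the fields that are
  // set; no hand-rolled pointer arithmetic is involved.
  std::string out;
  reply.SerializeToString(&out);
  return out;
}
```

A generated message class like this is also trivially testable as a unit: feed ParseFromString() a truncated string in a test and assert that it returns false.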

Thanks to Test Certified Level Two requirements, nearly every team had a formal, written development policy requiring that every code change be accompanied by tests (except for pure refactorings that didn't change existing behavior in already-covered code). Eventually the Build Tools and Testing Technology teams integrated test results (or the lack thereof) directly into the code review tool, so the reviewer could see whether the author had run any tests and whether they had passed, especially if changes had been made in response to previous review comments.

Google had instituted the practice of code review at its inception: no code was committed to source control until it had been reviewed and explicitly approved by someone other than the author. Controls existed to ensure that project "owners" were included in any relevant reviews. Reviewing code was just as much a part of a programmer's day-to-day responsibility as writing code, sometimes more so, and the common style guidelines removed a ton of friction from the process, allowing the reviewer to quickly flag potential issues where the style appeared wrong and to remain focused on the implications of the change itself. Internal tools helped developers manage their queues of incoming and outgoing reviews, and gave every developer visibility into the status of, and discussion around, every code change.

Nearly all of Google’s source code repository was available to all developers to browse and check out into a personal working copy. Since the Google style guides applied to all projects in a given language, and many of the naming conventions were similar across language guides, Google developers could easily scan code in parts of the code base that they’d never seen before and make sense of it relatively quickly. This made it easy for Googlers to contribute to different projects, to extract repeated code into common libraries reusable by all projects, to identify and possibly patch bugs in other projects, and even to switch projects without enduring the friction of adapting to a new coding style.

As an example of a style guideline aimed at avoiding errors (as opposed to avoiding frivolous arguments over braces, spaces, and names), the current Google C++ style guide insists that heap-allocated function parameters be passed via std::unique_ptr if the callee is to assume ownership, and by const reference if the caller is to retain ownership. Because memory is not automatically managed in C++, training developers to recognize poor memory management by sight is worth the cost, compared to waiting for static and dynamic analysis tools to catch such errors. (Google ran such tools as well, but they were costly and provided a longer feedback cycle.)
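A minimal sketch of the convention (the type and function names here are my own invention, not taken from the style guide):

```cpp
#include <memory>
#include <utility>

struct Connection { int fd = -1; };

// Callee assumes ownership: taking std::unique_ptr by value makes the
// transfer explicit in the signature; the caller must std::move() the
// pointer in and cannot use it afterwards.
void AdoptConnection(std::unique_ptr<Connection> conn) {
  // The Connection is freed automatically when conn goes out of scope.
}

// Caller retains ownership: a const reference signals a read-only
// borrow; the callee can neither delete nor hold on to the object.
void InspectConnection(const Connection& conn) {
  (void)conn.fd;  // read-only access
}

int main() {
  auto conn = std::make_unique<Connection>();
  InspectConnection(*conn);          // borrow: conn is still valid here
  AdoptConnection(std::move(conn));  // transfer: conn is now null
  return 0;
}
```

A reviewer can spot an ownership mistake from the signatures alone, without having to trace through the function bodies.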

All Google developers had to "earn readability" in each language they regularly used. "Earning readability" was a guided process whereby a developer internalized much of the language-specific style guide. Though it involved writing code, the ultimate intent was to ensure that the code one wrote remained "readable" to other developers according to company-wide conventions. The undersung Readability Grouplet was the all-volunteer team that maintained this invaluable process. Source control mechanisms made it prohibitively cumbersome to produce code in a language over the long term without earning readability status, which ensured that the style guides remained relevant and widely enforced.

No executive permission or directives were required to run a fixit. Once a group decided to run one, they ran one. (A VP of Engineering was usually willing to send a prepared announcement encouraging participation, however.) The Fixit Grouplet existed to help coordinate between fixit teams, to ensure that they picked optimal dates (e.g. avoiding any fixits during Burning Man week in early September, because half of Mountain View would be on the playa) and didn't cannibalize each other's efforts, which would have led to a condition known as "fixit fatigue". The Fixit Grouplet also provided tools, documentation, history, and advice so that new fixits could benefit from the experience of past ones.

These goal-focused events served to punctuate the other long-term efforts initiated by the Testing Grouplet, driving the overall unit testing adoption mission to the next level. Each new fixit capitalized on the experiences and momentum of previous fixits. Testing on the Toilet proved an invaluable tool in getting the word out about these events and preparing the Google development community for them well in advance.

The Testing Grouplet organized Testing Fixits in August 2006 and March 2007 focused on fixing broken tests and writing new tests for uncovered code, as well as the Revolution Fixit in January 2008 that introduced powerful new tools from the Build Tools team that dramatically improved development and testing speed. The Test Certified Challenge, lasting several months during summer 2008, recruited many new projects and helped many others move to higher Test Certified levels. The Build Tools team’s October 2009 Forgeability Fixit finished getting nearly every build target and test built and executing in the cloud, perfectly setting up the capstone of the entire Testing Fixit/Testing Grouplet arc: The March 2010 TAP Fixit, which introduced the Test Automation Platform throughout Google.

Fixits were short events organized to focus Google’s entire development community on issues that were important but had been largely put aside. They were also useful for rolling out new tools, and helping address any problems developers may have encountered. Fixits typically lasted from a day to a week, and were one of the most effective techniques used by several Grouplets and other teams for making big changes happen, thanks to the critical mass of planning and participation that went into each event.

The Test Mercenaries were a team of software developers dedicated full-time to helping Google development teams achieve Test Certified status. The Testing Grouplet proposed the concept for the team, which existed from late 2006 until early 2009. Ideally, at least two Mercs would be assigned to a team for three months, during which they would learn about the product, the code, and the team dynamic, and then try to introduce improved unit testing practices along the path set by Test Certified. Success varied from team to team and was difficult to measure in terms of productivity impact, but the focused, full-time efforts of the Test Mercenaries greatly augmented all the other volunteer-based Testing Grouplet efforts. Test Mercenary experiences informed many Test Certified discussions and Testing on the Toilet episodes, and inspired tool developments that proved critical to driving unit testing adoption throughout the culture.

Getting every Google development team to achieve Test Certified Level Three status became the ultimate goal of all the Testing Grouplet-related efforts. The Engineering Productivity department became sold on the idea that Test Certified could provide Test Engineers and Software Engineers in Test with a tool to better communicate with development teams and make better use of everyone’s time, and threw its weight behind the program. The goal was effectively met with the rollout of the Test Automation Platform continuous integration system in 2010, after which nearly every development team at Google was operating at Test Certified Level Three.

Engineering Productivity was the department dedicated to improving in-house development tools and practices, especially for testing. Test Engineers were focused on system-level testing and test automation. Software Engineers in Test were focused on all levels of testing but were expected to contribute test code to their projects and develop in-house tools that could be reused across projects.

Test Certified was a program designed by the Testing Grouplet which provided development teams a clear path towards improved unit testing practices and code quality. It originally consisted of three "levels" composed of discrete steps that a team could adopt as quarterly goals and achieve over time. (It eventually defined five levels, last I heard.) The first level focused on establishing the use of tools and baseline measurements (e.g. a continuous integration server, code coverage, identification of chronically broken and "flaky" tests); the second level focused on adopting and enforcing a testing policy requiring tests for all code changes and new code, and setting easily-reachable test coverage goals; the third level focused on guiding a team towards a high level of test coverage and the accompanying productivity benefits.

Why post flyers in the bathrooms, as opposed to other public spaces? Why not blast out email newsletters? The idea was tossed out during a Testing Grouplet brainstorming session in which no idea was off-limits. We'd tried a number of conventional methods (internal training, guest speakers, handing out books) and were looking for some new angle to take in getting people's attention. The boldness of this particular idea and the alliterative name just clicked with the group; it worked for us. Fortunately, once we got rolling and started actually posting the flyers, the idea stuck. Despite early objections from the vocal minority (as expected), the value of the medium became apparent, and the message it conveyed, that testing was an accessible skill conducive to incremental learning and improvement, resonated more deeply the longer the series continued.

Testing on the Toilet (TotT), a series of one-page articles posted in Google bathrooms, is the most visible of the Testing Grouplet's efforts and achievements. The series started in 2006, and weekly episodes continue to be published. Each episode is a one-page overview of a particular testing technique, tool, or related issue, distributed to bathrooms in Google development offices throughout the world. "Ads" at the bottom, which approximate Google search result ads, provide links to more information related to the topic. Each episode is written, vetted, edited, and distributed entirely by volunteers. Over the years, TotT has been immensely effective in educating Google developers about the benefits and proper application of unit testing, and in starting company-wide conversations using standard concepts that have further enriched the Testing Grouplet's efforts. These conversations helped prevent an echo-chamber effect by allowing non-Testing Grouplet members to contribute their ideas, arguments, and experiences.

Despite its longevity, TotT remains on the cutting edge: while writing this article, I contributed Episode 327: Finding More Than One Worm in the Apple and Episode 330: While My Heart Gently Bleeds. These were the first episodes from an "outside" contributor, and the first to investigate real-world, user-visible software defects and how unit tests could have prevented them.

Guests to Google offices have taken a number of pictures of Testing on the Toilet episodes and posted them online. Several of these pictures are featured in the TotT post on my blog, linked from this subsection.

The Testing Grouplet was but one of a collection of "Intergrouplets" that aimed to improve the quality of day-to-day development life and productivity throughout Google by helping solve issues that cut across all teams. The Grouplets often complemented the efforts of official, dedicated teams by providing grassroots feedback, advocacy, and other forms of support. For example, the Testing Grouplet had a close relationship with the Testing Technology and Build Tools teams, the EngEDU internal training organization, and the Engineering Productivity department as a whole (discussed below in the "Test Certified" subsection). Other Grouplets formed by passionate volunteers to improve development quality and experience included the Documentation Grouplet; the Mentoring Grouplet; the Hiring Grouplet; the Readability Grouplet, guardians of the Google style guides and the readability tradition; and the Fixit Grouplet, which maintained the tradition of "fixits": focused company-wide efforts designed to address widespread issues or to roll out new tools.

These Testing Grouplet-related efforts represent a number of our best ideas, which happened to be the right ideas at the right time. There were plenty of other things we tried that didn't stick as well; the important point is that we persevered. We continued to experiment with new ideas and learn from our experiences until we found a set of methods that worked especially well in the context of Google culture at the time. Some of the same methods may work for other teams and other companies; then again, they may not. Still, I hope they serve as a source of inspiration for ideas that could work in other development organizations.

The Testing Grouplet was a team of Google developers who worked together in their 20% time (time provided by Google to allow developers to work on Google-related projects of their choosing aside from their main projects) to address the challenges in promoting unit testing adoption throughout Google. An all-volunteer group with little funding and no direct authority, it relied on persuasion and innovation to convince Google developers of the value of unit testing, and provided them with the tools and knowledge needed to do it well. The Testing Grouplet successfully employed unconventional tactics to achieve its grand strategy of driving unit testing culture throughout Google, many of which are described in the following subsections.

The Testing Grouplet provided a community for those of us who cared about unit testing. The Testing Grouplet and its allies worked steadily over the course of years, and were successful in disseminating testing knowledge throughout Google, as well as in driving the development and adoption of new tools. These tools gave Google developers the time to test, and the shared knowledge made their code easier to test over time. Metrics and success stories shared by participants in the Testing Grouplet's Test Certified program also helped convince other teams to give unit/automated testing a try. Participating teams often credited Test Certified with helping improve the productivity metrics they most cared about, such as the number of code changes and/or features submitted over a given period relative to bugs, rollbacks, and emergency releases over the same period.

Resistance to unit testing at Google was largely a matter of developers undereducated in unit testing struggling to write new code using old tools that were straining heavily under the load of Google's ever-growing operation. Adding tests to existing code appeared prohibitively difficult, and given the status quo, providing tests for new code appeared futile. People who cared about unit testing did the hard work of convincing other Googlers that writing unit tests not only provides confidence that the code they write is correct today, but also that it'll stay correct in six months' time when someone else (or even the original developer) needs to change it.

You may believe that it was easy for Google to adopt a unit testing culture because Google is the mythical Google, with endless resources and talent at its disposal. Trust me, "easy" is not the word I would use to describe our efforts. In fact, vast pools of resources and talent can get in the way, as they tend to reinforce the notion that everything is going as well as possible, allowing problems to fester in the long shadows cast by towering success. Google was not able to change its development culture by virtue of being Google; rather, changing Google's development culture is what helped its development environment and software products continue to scale and to live up to expectations despite the ever-growing ranks of developers and users.

I'll also mention a few other components of the Google development environment during the time I worked there, to provide a more complete picture of how Google maintained a high level of code quality despite its massive scale and rate of feature development. Some of this information may be out of date, but I believe the overall picture, based as it is on my memories, may still prove helpful. I offer this description not to prescribe a guaranteed process, but to provide inspiration to other individuals and teams looking to make similar changes in their own organizations.

The biggest reason to make instructional examples out of the "goto fail" and Heartbleed bugs, aside from their high visibility, is that detecting and preventing bugs like these is a solved problem. At the time I joined Google, the development culture was largely averse to unit testing. The work that I and others did as part of Google's Testing Grouplet helped make writing tests the norm, rather than the exception. What follows is a brief description of how the Testing Grouplet fostered a strong unit testing culture at a large, growing, successful company where most developers were either ignorant of unit testing or hostile towards it, claiming "My code is too hard to test" or "I don't have time to test".

How to Change a Culture

You may be convinced that "goto fail" and Heartbleed could've been prevented by unit testing. You may be convinced that open-sourcing code should increase the need for unit testing, not decrease it. You may be convinced that unit testing produces a slew of benefits in addition to defect prevention, and that it's worth the cost. You may have acquired a taste for it after playing with the proof-of-concept tests from this article, and perhaps after beginning to test some of your own code. You may be convinced that unit testing can improve the application of existing tools and practices, and you may be inspired by Google's example of driving unit testing throughout a company in the large.

Now you are ready to start making a change in your own project, your own team, or your own company...but you might not have any clue how to start. Here I'll offer a few personal insights that may help guide you. They are not a prescription to follow to the letter, nor do they guarantee results; rather, I hope they will foster insights of your own that prove helpful in driving unit testing adoption throughout your environment.

Be the Change You Wish to See (Quote courtesy of Mahatma Gandhi) Whether you realize it or not, you've already started. You've read through this article and internalized its arguments. You've experienced unit testing for yourself. This has given you a foundation to build from, and a perspective to bring to any discussion on the topic of software development. There's nothing stopping you from walking the walk now, even if no one else follows you. Don't try to change any minds directly yet; just show how it's done by writing tests for your own code. Seek out blogs, magazines, books, and seminars to hone your skills, such as those in the Further Reading section below. Read through everything here on Martin's website. Join a Meetup, such as the AutoTest Meetups in Boston, New York, San Francisco, and Philadelphia, or start your own. Lead by example and stay the course.

Start Small with the Existing Code As demonstrated by the "goto fail" and Heartbleed proof-of-concept examples, the Google Web Server story, and the Google story as a whole, you can begin making improvements in existing code, right now. The only way your code base will improve is by working with it, and no amount of discussion or argument will be as effective as actually writing tests. By setting an example, by providing a pattern for others to follow, you're demonstrating that these ideas can work even in your team's code—and working code is its own best argument. Take a small part of the existing code base, and write a test for it. Refactor the code if you have to; extract functions and classes that would serve as good, isolated units to test. When adding new functionality to existing code, make sure it's part of a well-tested unit, refactoring the code using the new unit if need be. Add a unit testing framework if you can; otherwise, study the examples provided in this article to learn how to get by without one. Chip away at the problem; in time, you will be amazed at how much you alone have been able to accomplish.
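To make "getting by without one" concrete, here is a minimal sketch in the spirit of this article's proof-of-concept tests. The function under test is invented for illustration, standing in for the kind of small unit you might extract from legacy code:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical unit extracted from legacy parsing code so that it can be
// exercised in isolation: accepts only well-formed non-negative decimals.
static bool ParseNonNegativeInt(const std::string& text, int* value) {
  if (text.empty()) return false;
  int result = 0;
  for (const char c : text) {
    if (c < '0' || c > '9') return false;
    result = result * 10 + (c - '0');  // overflow ignored for brevity
  }
  *value = result;
  return true;
}

// No framework required: each "test" is a function made of plain asserts.
static void TestParsesDigits() {
  int value = -1;
  assert(ParseNonNegativeInt("42", &value));
  assert(value == 42);
}

static void TestRejectsMalformedInput() {
  int value = -1;
  assert(!ParseNonNegativeInt("", &value));
  assert(!ParseNonNegativeInt("4x2", &value));
  assert(!ParseNonNegativeInt("-1", &value));
}

int main() {
  TestParsesDigits();
  TestRejectsMalformedInput();
  std::printf("PASS\n");  // reached only if no assertion fired
  return 0;
}
```

Once a real test framework arrives, tests like these port over almost mechanically; the hard part, carving out a testable unit, is already done.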

The Small/Medium/Large Test Pyramid Unit tests are not a one-size-fits-all solution for code or product quality, and you should never promise that they are. The Testing Grouplet pioneered the concept of the Small/Medium/Large test size schema; Mike Cohn's Test Pyramid bears an uncanny resemblance to it. Make sure everyone is clear on the fundamental role that unit tests play, but don't oversell them.

Set Up Continuous Integration Do whatever you can, even if you have to beg, borrow, or steal, to set up a continuous integration environment. Roll your own using a shell script and a cron job if you have to, even if it runs on your own workstation. Even if it doesn't run tests at first, being able to ensure that the code can build (for compiled languages) and that the program can launch at all times is a critical prerequisite for spreading a unit testing culture; unit tests are pretty useless if the code can't compile to begin with. If your team isn't already in the habit of keeping the code in a compilable state, that may be the first battle you need to win before driving the adoption of unit testing. If everyone develops on completely separate branches and integration comes long after the fact, take it upon yourself to perform the integration work covertly. Set up your own git repository to pull from these different branches and integrate between them. When people see what you've been up to and how many headaches you're helping to avoid, you'll gain credibility that will serve you well.
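At its most basic, a roll-your-own continuous build is just a loop: sync, build, test, report. A few lines of shell driven by cron are usually all it takes; the sketch below expresses the same skeleton in C++ for concreteness, with placeholder commands that you would substitute with your own version control, build, and test invocations:

```cpp
#include <cstdio>
#include <cstdlib>

// Runs a command, echoing it first so the log shows what happened.
static bool Run(const char* command) {
  std::printf("running: %s\n", command);
  return std::system(command) == 0;
}

int main() {
  // The three commands are placeholders for your own tools.
  if (!Run("git pull --ff-only")) return 1;            // sync
  const bool build_ok = Run("make");                   // build everything
  const bool tests_ok = build_ok && Run("make test");  // test if it built

  // "Report" can start as an unmissable line in a log file; later,
  // flip an orb, update a dashboard, or send an email instead.
  std::printf("BUILD %s\n", build_ok && tests_ok ? "PASSED" : "BROKEN");
  return build_ok && tests_ok ? 0 : 1;
}
```

Schedule it with cron, or run it in a loop on a spare machine, and you have the seed of the visibility practices described next.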

Maximize Visibility Make sure other people can see when the build is broken. People and managers who were once indifferent or hostile towards continuous builds and testing have had their minds changed by monitoring devices set up within easy sight of their desks. This works because people naturally start to ask questions when the build breaks ("Why is that thing red again?"), and over time it can have a major effect on everyone's attitude. It's human nature to care only about the problems we can see, so make it easy for people to see when there's a problem. The monitoring device can take the form of a plugin in individuals' browsers, centrally-located glowy orbs, large monitor screens displaying a build dashboard, specially-wired traffic lights, you name it. It should be conspicuous enough that people have to make a deliberate effort to remain unaware of the build's current status. Visibility aids can add a sense of fun as well. Teams can be imaginative and amusingly competitive with how they display their test statuses. One team at Google had a flapping penguin that came noisily to life when their build broke. Of course, all the surrounding teams had to try to find something just as good. It all helps to spread the message.

Partners-In-Crime Eventually you will have to join forces with some partners-in-crime, people who need no convincing. You will both challenge and reinforce one another's ideas, and provide moral support to one another when the time comes to make a stand in the face of resistance. Develop your arguments, methods, idioms, etc. by bouncing them off of one another. Be more critical of these ideas than any potential critics might be, but treat each other with courtesy and respect. Make each other better, and eventually you may make the rest of your team or company better. When trying to persuade a group of people (in anything), it's always easiest to start with those who are already closest to agreeing with you. Once you get one other person seeing it your way, you are no longer a loner, no longer the crazy guy with that wacky idea that no one else believes, and there are now two of you doing the persuading. Once you get a third, and then a fourth, you have some momentum. Another subtle, effective way to get other people involved is to ask for advice. If someone on your team is resistant to testing, or even just unfamiliar with it, ask that person to review your code and tests. Ask whether there are other tests you haven't thought of. Most programmers are happy to offer an opinion, and it's a way to involve them in testing without forcing them. Over time, they might become convinced to the point of advocating for unit testing of their own volition.

Educate Find a way to spread knowledge throughout your team. It can be as straightforward as a weekly brown-bag lunch or as crazy as posting weekly flyers in the bathroom. Invite people to speak to your team, or organize a team outing to go to a talk or a Meetup. Start an internal mailing list to share and discuss ideas and tools.

Delegate, Delegate, Delegate! Paradoxically, the less you have to do directly to make things happen, the more things you can make happen. If you can establish a vision and a direction, you'll find volunteers who are more than happy to assume specific roles and run with them, which will give them a sense of belonging and value within the community you're building and will free you to stay focused on the larger picture. After running a couple of fixits, I realized that rather than holding onto every responsibility myself, it was far more productive to create an explicit list of roles I needed people to fill. From then on, presenting a list of roles up-front worked like a charm to get a grassroots organization up and running very, very quickly. Some roles you might consider right now for your team or organization (and some of the names are deliberately silly, to keep it light and fun):

Historian: Documents, summarizes, and archives notable issues or activities and their artifacts in a centrally-accessible repository (e.g. a wiki or a team blog)

Minister of Information: Personally solicits people to produce talks, blog posts, articles, etc.; this person can then head a sub-community of speakers, authors, and volunteer editors (a la Testing on the Toilet), maybe even cultivate a community-specific knowledge base (e.g. using a wiki)

Minister of Propaganda: Oversees announcements of team activities through a variety of media, e.g. emails, flyers, prominent wall projections, scripts given to high-profile managers, executives, or other representatives, etc.

Minister of Communication: Monitors the health of the communication channels available to the team, suggests and implements improvements (along with the Minister of Information); perhaps maintains a list of contact information and archives of artifacts (along with the Historian)

Wordsmith: Specifically handles the upkeep and organization of new artifacts, e.g. making sure posts are tagged, maybe experimenting with CSS styles, doing SEO work to make sure the content is easily discoverable by search engines (if artifacts are public), etc.

Scheduler: Keeps track of the logistics, e.g. who is speaking when and where events are taking place; maintains a list of suitable venues and seeks out new ones, etc.

Festmeister: For events, makes sure that beer and pizza and any goodies to be given out are all taken care of.

Heart and Soul: Follows up with speakers, authors, or other contributors and guests and personally expresses gratitude on behalf of the team, in a variety of forms: personal emails, gift certificates, schwag, small parties, etc.

These are just a few off the top of my head, but I want you to notice something: Right now, you may be filling all of these roles, whether you're aware of it or not. That's a lot for one person to do, and it both bogs you down and misses an important opportunity to grow the team into a real, honest-to-goodness community.

Be the Walrus So what role would be left for you after delegating away everything? I called myself "The Walrus" because I'm a silly Beatlemaniac, but the essence of the role is "Organizer". You're the one with an eye on the big picture who manages a team of specialists. You're the one who gets to set the direction and priorities, who has the privilege of providing feedback to the creative people you've trusted with important responsibilities and helping to remove any obstacles they encounter, and who gets to be constantly amazed at the energy and creativity people bring to their tasks to do incredible things you'd never even dreamed possible.

Embrace the Power of Teamwork Holding on to all those other roles only impedes your ability to thrive as an Organizer, which in turn holds the community back from its full potential. So I'd encourage you to come up with your own list of things that you already do for the community, codify them in a set of roles, and actively engage the individuals you think would best suit each role. I would sometimes even produce a list of role names with my name next to them in bold red type, and tell everyone that the success of the enterprise would be inversely proportional to the number of roles that still had my name next to them in red. (The only one with my name next to it in green was "The Walrus".) When confronted with such a list of clearly defined roles, it's amazing how quickly people will volunteer and act. That said, folks in such roles should be encouraged to interact without running every decision through you; the roles clarify responsibilities so that you don't have to be involved in every little detail, and people can work many things out between themselves. Everyone should be encouraged to seek out good ideas, to develop good ideas, and to share them amongst themselves. You should be keeping an ear to the ground, of course, but people should feel like you're listening, not like you're listening in or trying to be their boss. Expect them to pleasantly surprise you, and they will.

Make Yourself Obsolete Start looking for your replacement on day one. No enterprise should be so fragile that it falls apart after you've moved on. That goes for life in general. With regard to spreading a unit testing culture, you don't want to be stuck with the title "The Testing Guy" or "The Testing Girl". You want to make sure people are willing and able to step up whenever you may need or want to step aside. That's how a legacy is built.

Run a Fixit Speaking of roles and fixits, a fun and productive way to rally the community you're building to promote unit testing is to run a fixit. You can start with a small team-sized fixit, and later run office-wide or even company-wide events. All you need to get started is a clear goal (e.g. fix all broken code/tests, increase coverage by X%, adopt some nifty new tool), a set of well-defined volunteer roles (as mentioned above), and a shared spreadsheet of some kind to track the tasks that need to be done and who's assigned to handle each. Then pick a day, get the word out, and make it happen! There's nothing like applying concerted team effort to boost morale and solve nasty, lingering problems. As an example of why fixits are not only fun and productive, but may prove vital to the cause, consider the case where a large project is in such a state that parts of it cannot even be compiled. This completely undercuts any effort to set up continuous integration, and encourages people to select the branches of the project that they test rather than trying to test everything before committing their changes. In other words, they will execute <tool> test subprojectyBitA/**/* anotherBitB/ohAndThis/**/* partC/**/* ... rather than <tool> build **/* or some other more appropriate selection criterion, such as test size. Consequently, chronic problems potentially get worse, and continuous integration remains out of reach. This case would be perfect for a fixit: broken areas of the code can be identified in advance and collected into a spreadsheet; then people can volunteer to handle specific breakages to avoid duplicated effort. The team can attack these problems in one dedicated sprint, can make the event festive and fun, and, hopefully, the code will be in a state conducive to continuous integration and testing by the end of the day. Even if everything isn't fixed right away, the team should be encouraged by the tangible progress made on a chronic issue, and should have gained insights that will motivate them to eventually solve the problem completely.

Eschew Authoritah The best solutions are not those imposed top-down; the best solutions are those that individuals independently perceive as providing value and embrace willingly. Such solutions tend to give people a sense of empowerment and purpose, whereas forced solutions produce feelings of powerlessness and senselessness. Resist the temptation to solve problems by issuing orders, or by asking managers or executives to issue orders on your behalf; such orders will almost certainly backfire. People hate being told how to do their jobs, and you know programmers hate that exponentially more than most. To that end, emphasize the goal you're trying to achieve, rather than insisting on the exact way to achieve it. Provide clear, concrete ideas, but allow people the flexibility to adapt them to their situation. Very few programmers will argue that reducing the number of build breakages, rollbacks, or late-night fire drills is a bad thing. Unit tests, continuous integration, and code reviews have worked for many different teams in many different situations as ways of reducing stress, increasing confidence in the code, and spending less time diagnosing and fixing problems; but no two sets of unit tests, continuous builds, or code review practices are exactly alike. Soliciting support from management or executive management is another matter. Encouragement and endorsements can raise the visibility of your efforts in a helpful way, so long as they are presented as suggestions rather than orders from on high. If management is passive, keep plowing ahead as best you can; at least they aren't obstructing or threatening you in any way. If management is actively hostile or dismissive, then the choice becomes harder. Is pushing for change worth risking your job over? Is it worth quitting? These are risks you have to weigh for yourself; but despite the unpleasantness of getting fired or quitting, that doesn't mean it isn't worth trying. You do have a choice, like it or not, and you will have to live with it. Also, remember that no matter how mu