I made a mistake. Not a typo or a bug in some pasted code (actually I’ve made some of those, too). I perpetuated what seems (now, since I analyse it) to be a big myth in software engineering. I uncritically quoted some work without checking its authority, and now find it lacking. As an author, it’s my responsibility to ensure that what I write represents the best of my knowledge and ability so that you can take what I write and use it to be better at this than I am.

In propagating incorrect information, I’ve let you down. I apologise unreservedly. It’s my fervent hope that this post explains what I got wrong, why it’s wrong, and what we can all learn from this.

The mistake

There are two levels to this. The problem lies in Table 1.1 at the top of page 4 (in the print edition) of Test-Driven iOS Development. Here it is:

As explained in the book, this table is reproduced from an earlier publication:

Table 1.1, reproduced from Code Complete, 2nd Edition, by Steve McConnell (Microsoft Press, 2004), shows the results of a survey that evaluated the cost of fixing a bug as a function of the time it lay “dormant” in the product. The table shows that fixing bugs at the end of a project is the most expensive way to work, which makes sense…

The first mistake I made was simply that I seem to have made up the bit about it being the result of a survey. I don’t know where I got that from. In McConnell, it’s titled “Average Cost of Fixing Defects Based on When They’re Introduced and Detected” (Table 3-1, at the top of page 30). It’s introduced in a paragraph marked “HARD DATA”, and is accompanied by an impressive list of references in the table footer. McConnell:

The data in Table 3-1 shows that, for example, an architecture defect that costs $1000 to fix when the architecture is being created can cost $15,000 to fix during system test.

As already covered, the first problem is that I misattributed the data in the table. The second problem, and the one that in my opinion I’ve let down my readers the most by not catching, is that the table is completely false.

Examining the data

I was first alerted to the possibility that something was fishy with these data by the book the Leprechauns of Software Engineering by Laurent Bossavit. His Appendix B has a broad coverage of the literature that claims to report the exponential increase in cost of bug fixing.

It was this that got me worried, but I thought deciding that the table was broken on the basis of another book would be just as bad as relying on it from CC2E was in the first place. I therefore set myself the following question:

Is it possible to go from McConnell’s Table 3-1 to data that can be used to reconstruct the table?

My results

The first reference is Design and Code Inspections to Reduce Errors in Program Development. In a perennial problem for practicing software engineers, I can’t read this paper: I subscribe to the IEEE Xplore library, but they still want more cash to read this title. Laurent Bossavit, author of the aforementioned Leprechauns book, pointed out that the IEEE often charge for papers that are available for free elsewhere, and that this is the case with this paper (download link).

The paper anecdotally mentions a 10-100x factor as a result of “the old way” of working (i.e. without code inspections). The study itself looks at the amount of time saved by adding code reviews to projects that hitherto didn’t do code reviews; even if it did have numbers that correspond to this table I’d find it hard to say that the study (based on a process where the code was designed such that each “statement” in the design document corresponded to 3-10 estimated code statements, and all of the code was written in the PL/S language before a compiler pass was attempted) has relevance to modern software practice. In such a scheme, even a missing punctuation symbol is a defect that would need to be detected and reworked (not picked up by an IDE while you’re typing).

The next I discovered was Boehm and Turner’s “Balancing Agility and Discipline”. McConnell doesn’t tell us where in this book he was looking, and it’s a big book. Appendix E has a lot of secondary citations supporting “The steep version of the cost-of-change curve”, but quotes figures from “100:1” to “5:1” comparing “requirements phase” changes to “post-delivery” changes. All defect fixes are changes but not all changes are defect fixes, so these numbers can’t be used to build Table 3-1.

The graphs shown from studies for agile Java projects have “stories” on the x axis and “effort to change” in person/hour on the y-axis; again not about defects. These numbers are inconsistent with the table in McConnell anyway. As we shall see later, Boehm has trouble getting his data to fit agile projects.

“Software Defect Removal” by Dunn is another book, which I couldn’t find.

“Software Process Improvement at Hughes Aircraft” (Humphrey, Snyder, and Willis 1991) The only reference to cost here is a paragraph on “cost/performance index” on page 22. The authors say (with no supporting data; the report is based on a confidential study) that costs for software projects at Hughes were 6% over budget in 1987, and 3% over budget in 1990. There’s no indication of whether this was related to the costs of fixing defects, or the “spread” of defect discovery/fix phases. In other words this reference is irrelevant to constructing Table 3-1.

The other report from Hughes Aircraft is “Hughes Aircraft’s Widespread Deployment of a Continuously Improving So” by Ron R. Willis, Robert M. Rova et al.. This is the first reference I found to contain useful data! The report is looking at 38,000 bugs: the work of nearly 5,000 developers over dozens of projects, so this could even be significant data.

It’s a big report, but Figure 25 is the part we need. It’s a set of tables that relate the fix time (in person-days) of defects when comparing the phase that they’re fixed with the phase in which they’re detected.

Unfortunately, this isn’t the same as comparing the phase they’re discovered with the phase they’re introduced. One of the three tables (the front one, which obscures parts of the other two) looks at “in-phase” bugs: ones that were addressed with no latency. Wide differences in the numbers in this table (0.36 days to fix a defect in requirements analysis vs 2.00 days to fix a defect in functional test) make me question the presentation of Table 3-1: why put “1” in all of the “in-phase” entries in that table?

Using these numbers, and a little bit of guesswork about how to map the headings in this figure to Table 3-1, I was able to use this reference to construct a table like Table 3-1. Unfortunately, my “table like Table 3-1” was nothing like Table 3-1. Far from showing an incremental increase in bug cost with latency, the data look like a mountain range. In almost all rows the relative cost of a fix in System Test is greater than in maintenance.

I then looked at “Calculating the Return on Investment from More Effective Requirements Management” by Leffingwell. I have to hope that this 2003 webpage is a reproduction of the article cited by McConnell as a 1997 publication, as I couldn’t find a paper of that date.

This reference contains no primary data, but refers to “classic” studies in the field:

Studies performed at GTE, TRW, and IBM measured and assigned costs to errors occurring at various phases of the project life-cycle. These statistics were confirmed in later studies. Although these studies were run independently, they all reached roughly the same conclusion: If a unit cost of one is assigned to the effort required to detect and repair an error during the coding stage, then the cost to detect and repair an error during the requirements stage is between five to ten times less. Furthermore, the cost to detect and repair an error during the maintenance stage is twenty times more.

These numbers are promisingly similar to McConnell’s: although he talks about the cost to “fix” defects while this talks about “detecting and repairing” errors. Are these the same things? Was the testing cost included in McConnell’s table or not? How was it treated? Is the cost of assigning a tester to a project amortised over all bugs, or did they fill in time sheets explaining how long they spent discovering each issue?

Unfortunately Leffingwell himself is already relying on secondary citation: the reference for “Studies performed at GTE…” is a 520-page book, long out of print, called “Software Requirements—Objects Functions and States”. We’re still some degrees away from actual numbers. Worse, the citation at “confirmed in later studies” is to Understanding and controlling software costs by Boehm and Papaccio, which gets its numbers from the same studies at GTE, TRW and IBM! Leffingwell is bolstering the veracity of one set of numbers by using the same set of numbers.

A further reference in McConnell, “An Economic Release Decision Model” is to the proceedings on a 1999 conference on Applications of Software Measurement. If these proceedings have actually been published anywhere, I can’t find them: the one URL I discovered was to a “cybersquatter” domain. I was privately sent the powerpoint slides that comprise this citation. It’s a summary of then-retired Grady’s experiences with software testing, and again contains no primary data or even verifiable secondary citations. Bossavit describes a separate problem where one of the graphs in this presentation is consistently misattributed and misapplied.

The final reference provided by McConnell is What We Have Learned About Fighting Defects, a non-systematic literature review carried out in an online “e-Workshop” in 2002.

Section 3.1 of the report is “the effort to find and fix”. The 100:1 figure is supported by “general data” which are not presented and not cited. Actual cited figures are 117:1 and 135:1 for “severe” defects from two individual studies, and 2:1 for “non-severe” defects (a small collection of results).

The report concludes:

“A 100:1 increase in effort from early phases to post-delivery was a usable heuristic for severe defects, but for non-severe defects the effort increase was not nearly as large. However, this heuristic is appropriate only for certain development models with a clearly defined release point; research has not yet targeted new paradigms such as extreme programming (XP), which has no meaningful distinction between “early” and “late” development phases.”

A “usable heuristic” is not the verification I was looking for – especially one that’s only useful when practising software development in a way that most of my readers wouldn’t recognise.

Conclusion

If there is real data behind Table 3-1, I couldn’t find it. It was unprofessional of me to incorporate the table into my own book—thereby claiming it to be authoritative—without validating it, and my attempts to do so have come up wanting. I therefore no longer consider Table 1-1 in Test-Driven iOS Development to be representative of defect fixing in software engineering, I apologise for including it, and I resolve to critically appraise material I use in my work in the future.