By Janie Chang, Writer, Microsoft Research

At Microsoft Research, there are computer scientists and mathematicians who live in a world of theory and abstractions. Then there is Nachi Nagappan, who was on loan to the Windows development group for a year while building a triage system for software bugs. For Nagappan, a senior researcher at Microsoft Research Redmond with the Empirical Software Engineering Group (ESE), the ability to observe software-development processes firsthand is critical to his work.

The ESE group studies large-scale software development and takes an empirical approach. When Nagappan gets involved in hands-on projects with Microsoft development teams, it’s all part of ongoing research in his quest to validate conventional software-engineering wisdom.

Spotlight: Academic programs Now accepting applications for 2021 Ada Lovelace Fellowship A fellowship that provides a research funding opportunity for doctoral students underrepresented in the field of computing. Read more

“A big part of my learning curve when I joined Microsoft in 2005,” Nagappan says, “was getting familiar with how Microsoft does development. I found then that many of the beliefs I had in university about software engineering were actually not that true in real life.”

That discovery led Nagappan to examine more closely the observations made by Frederick Brooks in The Mythical Man Month, his influential book about software engineering and project management.

“To some degree, The Mythical Man Month formed the foundation of a lot of the work we did,” Nagappan says. “But we also studied other existing assumptions in software engineering. They can be good or bad, because people make decisions based on these assumptions. Our primary goal was to substantiate some of these beliefs in a Microsoft context, so managers can make decisions based on data rather than gut feel or subjective experience.”

More Isn’t Always Better

One assumption Nagappan examined was the relationship between code coverage and software quality. Code coverage measures how comprehensively a piece of code has been tested; if a program contains 100 lines of code and the quality-assurance process tests 95 lines, the effective code coverage is 95 percent. The ESE group measures quality in terms of post-release fixes, because those bugs are what reach the customer and are expensive to fix. The logical assumption would be that more code coverage results in higher-quality code. But what Nagappan and his colleagues saw was that, contrary to what is taught in academia, higher code coverage was not the best measure of post-release failures in the field.

“Furthermore,” Nagappan says, “when we shared the findings with developers, no one seemed surprised. They said that the development community has known this for years.”

The reason is that software quality depends on so many other factors and dynamics that no one metric can predict quality—and not all metrics apply to all projects. Nagappan points out two of the most obvious reasons why code coverage alone fails to predict error rates: usage and complexity.

Code coverage is not indicative of usage. If 99 percent of the code has been tested, but the 1 percent that did not get tested is what customers use the most, then there is a clear mismatch between usage and testing. There is also the issue of code complexity; the more complex the code, the harder it is to test. So it becomes a matter of leveraging complexity versus testing. After looking closely at code coverage, Nagappan now can state that, given constraints, it is more beneficial to achieve higher code coverage of more complex code than to test less complex code at an equivalent level. Those are the kinds of tradeoffs that development managers need to keep in mind.

Write Test Code First

Nagappan and his colleagues then examined development factors that impact quality, another area of software engineering discussed in The Mythical Man Month. One of the recent trends that caught their interest was development practices; specifically, test-driven development (TDD) versus normal development. In TDD, programmers first write the test code, then the actual source code, which should pass the test. This is the opposite of conventional development, in which writing the source code comes first, followed by writing unit tests. Although TDD adherents claim the practice produces better design and higher quality code, no one had carried out an empirical substantiation of this at Microsoft.

“The nice thing about working at Microsoft,” Nagappan says, “is that the development organization is large enough that we could select teams that allowed for an apples-to-apples comparison. We picked three development projects under the same senior manager and looked at teams that used TDD and those that didn’t. We collected data from teams working on Visual Studio, Windows, and MSN and also got data from a team at IBM, since the project was a joint study.”

The study and its results were published in a paper entitled Realizing quality improvement through test driven development: results and experiences of four industrial teams, by Nagappan and research colleagues E. Michael Maximilien of the IBM Almaden Research Center; Thirumalesh Bhat, principal software-development lead at Microsoft; and Laurie Williams of North Carolina State University. What the research team found was that the TDD teams produced code that was 60 to 90 percent better in terms of defect density than non-TDD teams. They also discovered that TDD teams took longer to complete their projects—15 to 35 percent longer.

“Over a development cycle of 12 months, 35 percent is another four months, which is huge,” Nagappan says. “However, the tradeoff is that you reduce post-release maintenance costs significantly, since code quality is so much better. Again, these are decisions that managers have to make—where should they take the hit? But now, they actually have quantified data for making those decisions.”

Proving the Utility of Assertions

Another development practice that came under scrutiny was the use of assertions. In software development, assertions are contracts or ingredients in code, often written as annotations in the source-code text, describing what the system should do rather than how to do it.

“One of our very senior researchers is Turing Award winner Tony Hoare of Microsoft Research Cambridge in the U.K.,” Nagappan says. “Tony has always promoted the utility of assertions in software. But nobody had done the work to quantify just how much assertions improved software quality.”

One reason why assertions have been difficult to investigate is a lack of access to large commercial programs and bug databases. Also, many large commercial applications contain significant amounts of legacy code in which there is minimal use of assertions. All of this contributes to lack of conclusive analysis.

At Microsoft however, there is systematic use of assertions in some Microsoft components, as well as synchronization between the bug-tracking system and source-code versions; this made it relatively easy to link faults against lines of code and source-code files. The research team managed to find assertion-dense code in which assertions had been used in a uniform manner; they collected the assertion data and correlated assertion density to code defects. The results are presented in the technical paper Assessing the Relationship between Software Assertions and Code Quality: An Empirical Investigation, by Gunnar Kudrjavets, a senior development lead at Microsoft, along with Nagappan and Tom Ball.

The team observed a definite negative correlation: more assertions and code verifications means fewer bugs. Looking behind the straight statistical evidence, they also found a contextual variable: experience. Software engineers who were able to make productive use of assertions in their code base tended to be well-trained and experienced, a factor that contributed to the end results. These factors built an empirical body of knowledge that proved the utility of assertions.

The work also brings up another issue: What kind of action should development managers take based on these findings? The research team believes that enforcing the use of assertions would not work well; rather, there needs to be a culture of using assertions in order to produce the desired results. Nagappan and his colleagues feel there is an urgent need to promote the use of assertions and plan to collaborate with academics to teach this practice in the classroom. Having the data makes this easier.

Has there been any feedback from Hoare?

“Absolutely,” Nagappan says. “He followed up and read our work on assertions and was very happy that someone was proving the relationship between assertions and software quality.”

Organizational Structure Does Matter—a Lot.

Nagappan recognized that although metrics such as code churn, code complexity, code dependencies, and other code-related factors have an impact on software quality, his team had yet to investigate the people factor. The Mythical Man Month is most famous for describing how communication overhead increases with the number of programmers on a project, but it also cites Conway’s Law, paraphrased as, “If there are N product groups, the result will be a software system that to a large degree contains N versions or N components.” In other words, the system will resemble the organization building the system.

The first challenge was to somehow describe the relationships between members of a development group. The team settled on using organizational structure, taking the entire tree structure of the Windows group as an example. They took into account reporting structure but also degrees of separation between engineers working on the same project, the level to which ownership of a code base rolled up, the number of groups contributing to the project, and other metrics developed for this study.

The Influence of Organizational Structure on Software Quality: An Empirical Case Study, by Nagappan, Brendan Murphy of Microsoft Research Cambridge, and Victor R. Basili of the University of Maryland, presents startling results: Organizational metrics, which are not related to the code, can predict software failure-proneness with a precision and recall of 85 percent. This is a significantly higher precision than traditional metrics such as churn, complexity, or coverage that have been used until now to predict failure-proneness. This was probably the most surprising outcome of all the studies.

“That took us by surprise,” Nagappan says. “We didn’t expect it to be that accurate. We thought it would be comparable to other code-based metrics, but these factors were at least 8 percent better in terms of precision and recall than the closest factor we got from the code measures. Eight percent, on a code base the size of Windows, is a huge difference.”

Geographical Distance Doesn’t Matter—Much.

One of the most cherished beliefs in software project management is that a distributed-development model has a negative impact on software quality because of problems with communication, coordination, culture, and other factors. Again, this meant looking at organizational metrics.

“The fact is,” Nagappan says, “that no one has really studied a large project. Most studies were either based on assumptions or on an outsourced model where a piece of the project is handled outside. But at Microsoft, we don’t outsource product development. Our global development teams are Microsoft employees: the same management structure, access to the same resources.”

But first of all, how do you define “distributed”? The research team took the corporate address book and came up with six degrees of distribution:

In the same building.

In different buildings but sharing a cafeteria.

On the same campus, within walking distance.

In the same region, within easy driving distance.

In the same time zone.

More than three time zones away.

Next, they classified all developers in the Windows team into these buckets. Then they looked for statistical evidence that components developed by distributed teams resulted in software with more errors than components developed by collocated teams.

Does distributed development affect software quality? An empirical case study of Windows Vista—by Christian Bird, University of California, Davis; Nagappan; Premkumar Devanbu, University of California, Davis; Harald Gall, University of Zurich, and Murphy—found that the differences were statistically negligible. In order to verify results, the team also conducted an anonymous survey with researchers Sriram Rajamani and Ganesan Ramalingam in Microsoft Research India, asking engineers who they would talk to if they ran into problems. Most people preferred to talk to someone from their own organization 4,000 miles away rather than someone only five doors down the hall but from a different organization. Organizational cohesiveness played a bigger role than geographical distance.

Putting Research to Work in the Real World

Many of the findings from these papers have been put to use by Microsoft product teams. Some of the tools that shipped with Visual Studio 2005 and Visual Studio 2008 incorporate work from the ESE group, and Microsoft’s risk-analysis and bug-triage system for Windows Vista SP2 made use of the team’s technology for risk estimation and analysis.

Nagappan believes there is value in further exploring organizational metrics but adds that more real-world context needs to be applied, because his studies have been confined to Microsoft development groups. He would like to extend his work to other commercial environments, distributed development teams, and open-source environments.

But there is one point that gives this software-engineering myth buster a great deal of satisfaction.

“I feel that we’ve closed the loop,” Nagappan says. “It started with Conway’s Law, which Brooks cited in The Mythical Man-Month; now, we can show that, yes, the design of the organization building the software system is as crucial as the system itself.”