Introduction

The “Five Whys” is a method for performing a “Root Cause Analysis” to detect the true cause of problems in a product or process. The Five Whys was developed by Taiichi Ohno, the father of the Toyota Production System and is a cornerstone for how Toyota achieves the quality it does.

Software development is a complex process and modern business software is quite large and complicated. If something goes wrong, we have a tendency to just blame the first named culprit, for instance if a bug appears in the field then blame the QA person that tested that feature. Then this is usually accompanied with a lot of talk about people being “accountable” for the work they do. But if you step back a bit you realize that actually a much larger system has failed. The QA person wasn’t the only person to look at that feature. The programmer was responsible for testing it (both manually and providing unit tests) before giving it to QA. A software architect reviewed what the programmer was doing. The entire product went through a regression test before release. This feature was reviewed by a Business Analyst and a Usability Analyst. The product and feature had gone through a customer Beta Testing phase. And then in-spite of all this process and review by all these responsible and accountable people, the bug still made it out into the field. So to just superficially blame one person doesn’t give a good perspective on what went wrong and doesn’t help preventing the problem happening in the future.

The real goal of these methods is to allow companies to learn from their mistakes rather than to assign blame. Generally the process works way better when everyone is confident that this is the real intent. If everyone believes the goal is to assign blame, then the whole process goes political and getting to the truth is usually impossible.

Learning from mistakes has its own pitfalls. One common theory for why successful companies eventually fail is that they dutifully analyze problems and learn from their mistakes. For every problem that happens they develop a process or procedure to prevent that problem ever happening again. The downfall here is that as time passes, the number of processes and procedures created from this process becomes huge. This makes the company very bureaucratic and getting things done very hard. Great care has to be taken that the outcome of these studies doesn’t just bury a company in procedures that get harder and harder to follow. The best result is if the current process can be simplified, perhaps eliminating the step that created the mistake. After all simple processes and procedures are much easier to follow and less error prone. When developing remedies to problems analyzed, careful attention has to be paid to the cost of the solution, so that the solution isn’t worse than the original problem.

Performing a Root Cause Analysis study takes time and attention. So when you want to get started with this, start with the worst and most easily defined problems first. Run a few pilots to get used to the methodology. One good place to start is problems found in the field by customers, as mentioned before these indicate a fairly serious systemic failure which might be useful to fix. Problems found in the field are usually categorized by severity, so you could start by looking at the most severe problem first and then working down the list.

The Five Whys

The basic concept behind the Five Whys is really simple. Basically keep asking why for five times to get past the superficial reasons to the underlying reason. It’s something that we all did as three year olds when we kept annoying our parents by asking why over and over again. Basically as children we perceived that this was a good way to get a deeper understanding of the world, but then we abandoned this technique as we got older (or were scolded enough times). Basically the insight here is that we had it right as three year olds and shouldn’t have given up.

As an example perhaps of an interviewer asking a programmer:

Why did this defect make it into the product?

– Because the code review and unit tests didn’t catch it.

– Because the code review and unit tests didn’t catch it. Why didn’t the code review catch it?

– Because we substituted a reviewer from another team.

– Because we substituted a reviewer from another team. Why did you substitute who was doing the reviews?

– Because our team was frantically coding and couldn’t spare the time.

– Because our team was frantically coding and couldn’t spare the time. Why were they so busy coding?

– Because the stories they committed to were larger than expected.

– Because the stories they committed to were larger than expected. Why was that?

– Two key dependencies were underestimated and forced a lot of extra work.

This is a fairly simplified and abbreviated example, but shows the basic idea. Also notice that each question can often lead to several avenues of pursuit for the next level. It’s up to the interviewer to decide whether to pursue several or follow the main thread. The diagram below shows how this might go.

Root Cause Analysis

When you perform a Root Cause Analysis (RCA), you usually follow the following phases:

The Five Whys are usually performed in the “Data Analysis and Assessment” step. The “Data Collection” step before is used to identify the correct people, so you know who to ask the Five Whys to. Typically you want to have a separate Five Whys interview with each of the people involved with the problem and then perhaps additional people depending on the outcome of the interviews.

Generally you do each interview separately and document all the questions and answers. We document everything to do with the process on a Wiki, so everything is kept transparent and visible. After all the interviews you then need perform the assessment to identify the root causes and choose the items you want to pursue. This is usually done in a couple of brainstorming sessions. First everyone studies the documented Five Why interviews then they get together to discuss what happened and make suggestions of how to change processes or make other changes. All these ideas are documented. Then some time is left to think about them and another meeting happens to decide which ideas will be implemented, keeping in mind that we don’t want solutions worse than the problem and don’t want to introduce unnecessary bureaucracy. Then we apply the solutions and inform all affected departments. Follow up is scheduled to ensure that the solution is working and we are getting the desired results, if they aren’t working they should be scrapped or changed.

Summary

Usually when people first read about RCA they take it as a very heavy handed procedure to address simple problems. However when you’ve run a few of these, you quickly realize that most problems aren’t that simple and that solving systemic problems can be quite difficult. I’ve found Five Whys an extremely effective method to get to the bottom of things, where one of the key benefits is that it forces people to take the time to really think about what happened. Then the documentation produced provides a lot of visibility into what happened which usually makes implementing the solution face less resistance.