Highlights on "quality," and Deming's work as it applies to software development

Back when I was in university, I visited a bookstore and found, in the physics section, a group of books on relativity. One of the shortest books with the least glossy cover had a surprising author: Albert Einstein. Surprising? Well, he formed the theory, but everyone knows he wrote science papers, not books, and for advanced physicists, not average people, right?

Apparently not. You can still buy Relativity, the Special and General Theory on Amazon, including a Kindle format for $0.99. From Einstein's own description, "The present book is intended, as far as possible, to give an exact insight into the theory of Relativity to those readers who, from a general scientific and philosophical point of view, are interested in the theory, but who are not conversant with the mathematical apparatus of theoretical physics."

Sure enough, the book is very readable, and didn't require any advanced math in order to understand the concepts. Einstein himself understood his theory so well that he was able to explain it in terms an average person can understand.

In contrast, the other relativity books on the same shelf were more complicated and harder to read. Maybe the authors understood just as well as Einstein, and had the benefit of decades more research in the area, but somehow they managed to obscure the main point. Perhaps part of genius, or some kinds of genius, is just the ability to explain your deep thoughts to others.

Anyway, I took that as an important lesson: don't be afraid to go back to the original source when you want to learn something. It's not always as intimidating as it sounds.

Which brings us to this Christmas holiday, where I had a few hours here and there to catch up on reading, and I set out to learn about the quality revolution in manufacturing (generally agreed to have started in Japan, albeit with ironic American influence). I wanted to know: What was the big insight? Did westerners finally catch on? Does it all just seem obvious to us now? And even more importantly, does it apply to software development, or is software too different from manufacturing for the concepts to really apply? (In particular, it seems clear that manufacturing quality is largely about doing a one-time design and reliably producing zillions of identical items, and the problems seem to be mainly in the reliable reproduction. In software, reproduction is the easy part.)

In my stint as a "Semiconductor Industry Veteran" (The Guardian's words, not mine :)) for four months in the late 1990s, I had heard buzzwords related to this topic. Notably: Toyota Production System (TPS), Six Sigma, Zero Defects, and Total Quality Management. I didn't know much about them, just the names. Well, I had heard that in the TPS, any worker could push a big red button and stop the assembly line if they found a defect, but I didn't really see how that could help. Six Sigma sounded fishy... six sigma from what? Just three sigma is better than 99%, so Six Sigma must be essentially Zero Defects, which sounds nice, it must be what Japanese companies were doing. Total Quality Management sounds like a scam just from the name, so let's just ignore that one.

So it's a vacation, but I have limited time, what should I read? After a bit of investigation, one thing the 1990's Japanese System fan fiction management consultants all seemed to have in common was an influence by W. Edwards Deming, an American who apparently either co-invented or at least evangelized all this stuff around Japan (I'm still not quite sure). Maybe let's go there and see if we can find the original source.

(It later turned out that Deming was in turn heavily influenced by Walter A. Shewhart's work in the 1920s, but that's going back a bit far, and it seems like Deming generalized his work to a bunch of new areas, so let's stick with Deming for now.)

A quick trip to Amazon brought me to Deming's last book, The New Economics for Industry, Government, Education, which wins the dubious award for "most overpriced e-book by a dead guy I have ever seen." But on the other hand, if it's that expensive, maybe there's something good in there, so I bought it.

It was even better than I thought. I've never highlighted so many passages in a single book before. I don't even like highlighting passages in books. And this is a pretty short book. It gets right to the point, demolishing everything I thought I knew. (It also includes some well-placed jabs at Six Sigma, a management technique supposedly based on Deming's methods!)

(Around the part of the book where he was insulting Six Sigma, I thought I'd better look it up to see what it's about. It's indeed pretty awful. Just to start with, the "six" sigma are just because they count the entire range from -3 standard deviations to +3 standard deviations. That's an error margin +/- 3 stddev, or ~99.7%. It's not "six" standard deviations by any normal human's terminology. It gets worse from there. Safe to ignore.)

Anyway, back to demolishing everything I thought I knew. I was looking for advice on how to improve quality, which means I was kinda surprised when early in the book I found this:

I listened all day, an unforgettable Thursday, to 10 presentations, reports of 10 teams, on reduction of defects. The audience was composed of 150 people, all working on reduction of defects, listening intently, with touching devotion to their jobs. They did not understand, I think, that their efforts could in time be eminently successful-no defects-while their company declines. Something more must take place, for jobs.

In a book that I thought was about reducing defects, this was a little disconcerting.

We move on to,

Have you ever heard of a plant that closed? And why did it close? Shoddy workmanship? No.

Um, what is this? Maybe I should go get a Total Quality Management book after all. (Just kidding. I refuse to acknowledge any religion with a name that dumb.)

He moved on to blast a few more things that many of us take as axiomatic:

The effect of Hard work? Best efforts? Answer: We thus only dig deeper the pit that we are in. Hard work and best efforts will not by themselves dig us out of the pit. [...] Ranking is a farce. Apparent performance is actually attributable mostly to the system that the individual works in, not to the individual himself. [...] In Japan, the pecking order is the opposite. A company in Japan that runs into economic hardship takes these steps:2 1. Cut the dividend. Maybe cut it out. 2. Reduce salaries and bonuses of top management. 3. Further reduction for top management. 4. Last of all, the rank and file are asked to help out. People that do not need to work may take a furlough. People that can take early retirement may do so, now. 5. Finally, if necessary, a cut in pay for those that stay, but no one loses a job. [...] Appointment of someone as Vice President in Charge of Quality will he disappointment and frustration. Accountability for quality belongs to top management. It can not he delegated. [...] It is wrong to suppose that if you can't measure it, you can't manage it-a costly myth. [...] If you can accomplish a [numerical quality/goal] goal without a [new] method, then why were you not doing it last year? There is only one possible answer: you were goofing off. [...] The bulletin America 2000: An Educational Study, published by the Secretary of Education, Washington, 18 April 1991, provides a horrible example of numerical goals, tests, rewards, but no method. By what method? [...] It is important that an aim never be defined in terms of a specific activity or method. [...] This diagram [process flow diagram], as an organization chart, is far more meaningful than the usual pyramid. The pyramid only shows responsibilities for reporting, who reports to whom. It shows the chain of command and accountability. A pyramid does not describe the system of production. It does not tell anybody how his work fits into the work of other people in the company. If a pyramid conveys any message at all, it is that anybody should first and foremost try to satisfy his boss (get a good rating). The customer is not in the pyramid. [...] Suppose that you tell me that my job is to wash this table. You show to me soap, water, and a brush. I still have no idea what the job is. I must know what the table will be used for after I wash it. Why wash it? Will the table be used to set food on? If so, it is clean enough now. If it is to be used for an operating table, I must wash the table several times with scalding water, top, bottom, and legs; also the floor below it and around it. [...] The U.S. Postal Service is not a monopoly. The authorities of the postal service are hampered by Congress. If the U.S. Postal Service were a monopoly, there would be a chance of better service. [...] The interpretation of results of a test or experiment is something else. It is prediction that a specific change in a process or procedure will be a wise choice, or that no change would be better. Either way the choice is prediction. This is known as an analytic problem, or a problem of inference, prediction. Tests of significance, t-test, chi-square, are useless as inference-i.e., useless for aid in prediction. Test of hypothesis has been for half a century a bristling obstruction to understanding statistical inference. [...] Management is prediction. [...] Theory is a window into the world. Theory leads to prediction. Without prediction, experience and examples teach nothing. To copy an example of success, without understanding it with the aid of theory, may lead to disaster.

I could keep quoting the quotes. They are all about equally terrifying. Let me ruin any suspense you have might left: you aren't going to fix quality problems by doing the same thing you were doing, only better.

This book also quickly answered my other questions: yes, it is applicable to software development. No, westerners have still not figured it out. No, software will not get better if we just keep doing what we were doing. Yes, Japanese manufacturers knew this already in the 1980s and 1990s.

Statistical Control

This post is getting a bit long already, so let's skip over more ranting and get to the point. First of all, I hope you read Deming's book(s). This one was very good, quite short, and easy to read, like Einstein's Relativity was easy to read. And you know it works, because his ideas were copied over and over, poorly, eventually into unrecognizability, ineffectiveness, and confusion, the way only the best ideas are.

Just for a taste, here's what I think are the two super important lessons. They may seem obvious in retrospect, but since I never really see anybody talking about them, I think they're not obvious at all.

The first lesson is: It's more important for output to have a continuous (usually Gaussian) distribution than to be within specifications. Western companies spend a lot of time manufacturing devices to be "within spec." If an output is out of spec, we have to throw it away and make another one, or at least fix it. If it's in spec, that's great, ship it! But... stop doing that. Forget about the spec, in fact. Real-life continuous processes do not have discontinuous yes/no breakpoints, and trying to create those requires you to destabilize and complexify your processes (often by adding an inspection or QA phase, which, surprisingly, turns out to be a very bad idea). I don't have time to explain it here, but Deming has some very, very strong opinions on the suckiness of post-manufacturing quality inspection.

The second lesson is: Know the difference between "special causes of variation" and "common causes of variation." Special causes produce outputs outside the continuous statistical distribution you were expecting. In my line of work, the continous variable might be Internet throughput, which varies over time, but is pretty predictable. A special cause would be, say, an outage, where you get 0 Mbps for a while before the outage is fixed. (On the other hand, if you're running an ISP, the pattern of your outages may actually be its own statistical distribution, and then you break it down further. Drunk drivers knocking down poles happens at a fairly predictable rate; accidentally deleting your authentication key on all the routers, causing a self-induced accidental outage is a special event.)

The reason it's so important to differentiate between special and common causes is that the way you respond to them is completely different. Special causes are usually caused by, and thus can be fixed or avoided by, low-level individual workers, in the case of manufacturing or repetitive work. In software (although Deming doesn't talk about software), special causes are weird bugs that only trigger in certain edge cases but cause major failure when they do. Special causes can often be eliminated by just fixing up your processes.

Common causes are simple variation inside your normal statistical distribution. It's tempting to try to fix these, especially when you're unlucky enough to get an outlier. Wifi throughput got really low for a few seconds, lower than two standard deviations fro the mean, isn't that weird? We got a surprisingly large number of customer service calls this week, much more than last week, what did we do wrong? And so on.

The great insight of Deming's methods is that you can (mostly) identify the difference between common and special causes mathematically, and that you should not attempt to fix common causes directly - it's a waste of time, because all real-life processes have random variation. (In fact, he provides a bunch of examples of how trying to fix common causes can make them much worse.)

Instead, what you want to do is identify your mean and standard deviation, plot the distribution, and try to clean it up. Rather than poking around at the weird edges of the distribution, can we adjust the mean left or right to get more output closer to what we want? Can we reduce the overall standard deviation - not any one outlier - by changing something fundamental about the process?

As part of that, you might find out that your process is actually not in control at all, and most of your problems are "special" causes. This means you're overdriving your process. For example, software developers working super long hours to meet a deadline might produce bursts of high producitivity followed by indeterminate periods of collapse (or they quit and the whole thing shuts down, or whatever). Running them at a reasonable rate might give worse short-term results, but more predictable results over time, and predictable results are where quality comes from.

My job nowadays is trying to make wifi better. This separation - which, like I said, sounds kind of obvious in retrospect - is a huge help to me. In wifi, the difference between "bugs that make the wifi stop working" and "variable performance in wifi transfer speeds or latency" is a big deal. The former is special causes, and the latter is common causes. You have to solve them differently. You probably should have two different groups of people thinking about solving them. Even further, just doing a single wifi speed test, or looking at the average or a particular percentile to see if it's "good enough" is not right. You want to measure the mean and stddev, and see which customers' wifi is weird (special causes) and which is normal (common causes).

Anyway, it's been a good vacation so far.