When Dr. Deming said that 94% of all problems are due to the system and are the responsibility of management, what he was saying was that improvement requires systems thinking and a blameless culture. He often targeted management as the problem; however, I think his use of the word “management” can be read as a modern abstraction for any one actor in a complex adaptive system: all the actors in the system (management, machines, governance, …).

Deming drew a large part of his thinking from Walter Shewhart’s work on Statistical Process Control. Shewhart’s work was grounded in statistics, specifically in understanding variation. I can’t find the author of the original quote below; however, it is one of my favorite interpretations of Deming and Shewhart’s core message about variation.

“Misunderstanding variation is the root cause of knee-jerk reactions, over control, micromanagement, and tampering with results. Quite frankly, when management does not understand variation their decisions can, and usually do, make things worse. The ramifications are widespread and costly. Employees feel frustration for having to explain randomness or things completely out of their control. Customers have no idea what to expect. Customer facing employees have no confidence in what can be promised or delivered”

When we look at variation from the Deming/Shewhart perspective, we see variation as much more than just noise vs. anomalies. We are actually looking for variation in terms of randomness vs. non-randomness. Shewhart developed Statistical Process Control (SPC) as a tool for examining statistical variation in a process. With SPC you collect data on the process, then evaluate data points in terms of their standard deviation (STD), also called sigma (1 STD = 1 sigma). Standard deviation is one of many statistical tools for measuring variation. Shewhart introduced SPC charts, which graph data points against an Upper Control Limit (UCL, 3 sigma above the mean) and a Lower Control Limit (LCL, 3 sigma below the mean).
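The mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not a production SPC implementation: the data list is hypothetical, and sigma is estimated here with the plain standard deviation of the sample (real Shewhart individuals charts typically estimate sigma from moving ranges instead).

```python
# Minimal sketch of 3-sigma Shewhart control limits.
# Assumes a hypothetical list of process measurements.
from statistics import mean, pstdev

def control_limits(data):
    """Return (center, LCL, UCL) using 3-sigma limits around the mean."""
    center = mean(data)
    sigma = pstdev(data)  # 1 STD = 1 sigma
    return center, center - 3 * sigma, center + 3 * sigma

def classify(data):
    """Label each point: common cause (inside limits) or special cause."""
    _, lcl, ucl = control_limits(data)
    return ["special" if x < lcl or x > ucl else "common" for x in data]
```

For example, twenty stable measurements followed by one extreme value would flag the extreme point as special cause and leave the rest as common cause.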

Data points that fall within the UCL and LCL are called common (or chance) cause variation, and data points that fall beyond the UCL and LCL are called special (or assignable) cause variation. Classic incident management treats a control chart as normal vs. anomalies (i.e., data points beyond the UCL/LCL). With SPC, this is far from the truth.

Note: Shewhart called them assignable and chance variation, and Deming renamed them special and common cause variation. Chance variation is also often referred to as random variation.

With SPC, special/assignable cause data points actually have less impact on systemic improvement. They are typically black swans, where post-mortem improvements would at best be too costly and at worst useless. Even when a special/assignable data point is truly assignable, the resulting fix is less likely to create widespread systemic improvement. For example, a truck runs into a data center (true story).

Therefore, the real beauty of Shewhart’s ideas about variation lies in the focus on common cause variation. It’s here where we can start seeing Deming’s 94/6 (I believe today Deming would actually say something more like 99/1). If Deming is correct that most process issues are related to the system (systems thinking), then we are not likely to find special or assignable causes between the UCL and LCL (common cause). What becomes really important, then, is looking at the process as a system within this abstraction called statistical process control (i.e., the common cause variation). Remember, statistical process control does not mean deterministic process control. It means you are using statistics as an abstraction, or a heuristic, to model the behavior of the process. Think of it more like an ideal from quantum physics: we don’t really know if we are right or wrong; we can only use probabilities (statistics) as a tool to predict and understand the process. When a process is in statistical control, the common cause data points should appear random. If the data points inside the control limits are non-random, it might be an indication of a system-related trend.

So what Deming/Shewhart would say is that when you see true randomness in your common cause variation, it is more likely (predictably) that you have a process under control. However, when you start seeing non-random data points in the common cause variation, this could be an area for improvement. For example, a linear run of data points inside the UCL/LCL on a control chart could mean the process has been tampered with. Notice the last 6 data points in the example below: they show a linear trend within the UCL/LCL.

I’m adding an update to this blog… I found a great resource on all things Deming called the “Curious Cat Management Improvement blog” (https://curiouscat.com/). They have a great definition page on understanding variation here: https://curiouscat.com/management/variation.cfm, which I will summarize as follows. There is Common Cause and Special Cause variation. What we look for, however, are the patterns that indicate a specific special cause. Here are the patterns:

- 8 points in a row on one side of the mean
- 4 of 5 points between 1 and 3 sigma
- 2 of 3 points between 2 and 3 sigma
- 6 points in a row increasing or decreasing
- 14 points in a row that alternate above and below the mean
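As a sketch, two of these patterns (a run on one side of the mean, and a monotone run) can be checked in a few lines of Python. The function names, window sizes, and data are my own illustration, not from the Curious Cat page:

```python
# Sketch of two special-cause run rules on a list of measurements.
# `center` is the chart's center line (the process mean), assumed known.

def one_side_run(data, center, n=8):
    """True if any n consecutive points fall on the same side of the center."""
    signs = [x > center for x in data if x != center]
    for i in range(len(signs) - n + 1):
        window = signs[i:i + n]
        if all(window) or not any(window):
            return True
    return False

def monotone_run(data, n=6):
    """True if any n consecutive points strictly increase or strictly decrease."""
    for i in range(len(data) - n + 1):
        w = data[i:i + n]
        if all(a < b for a, b in zip(w, w[1:])) or \
           all(a > b for a, b in zip(w, w[1:])):
            return True
    return False
```

The other three rules are analogous window checks once you also know sigma; the point is that every rule is a test for non-randomness inside the control limits, not a test against the limits themselves.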

This brings me back to systems thinking and blamelessness. We often see knee-jerk reactions in IT; however, in my opinion we rarely connect them to an appropriate understanding of variation. When the truck crashed into the data center, did they fire the system administrator because the data center was down for a week? Or even worse, did they create a task force to study and investigate whether there should have been three concrete pylons on the back road protecting the data center instead of the original two? In Eric Ries’ The Lean Startup, he tells the story of a 5 Whys exercise where a programmer gets blamed for his code bringing down the system. Would an SPC control chart have shown a trend of certain teams’ code deploys coinciding with the outages (in common cause variation)? Maybe, on further investigation of that trend, it would be found that this was a new team, and that due to cutbacks in new-hire training the team was unaware of the coding guard rails available to developers.

In my opinion, Western management typically fails for two main reasons:

1. They do not use data, specifically statistics, as an abstraction to understand improvement or second-order derivatives. In other words, in complex adaptive systems their judgment is usually based on false analytics. They tend to rely on cognitive biases to explain organizational events. When using a tool like SPC, it is less likely that you will single out an individual (a person to blame) as the bad apple.
2. When all else fails, they tend to fall back on a Western command-and-control (deterministic) way of thinking, where individuals are rewarded or blamed based on hierarchical control structures and incomplete data.

Here is a link to all my presentation videos and slides. A lot of Deming’s ideas can be found in my Bad Equilibrium, Kata, and of course my Deming to DevOps presentations.

https://github.com/botchagalupe/my-presentations