Why was the Windows 10 October update that problematic?

In September 2017, Microsoft boasted that it had just released the "best version of Windows 10 ever." A year later, as Windows engineers struggle with the most recent release of the company's flagship operating system, there's a compelling case that the October 2018 Update is the worst version of Windows 10 ever.

The month began almost triumphantly for Microsoft, with the announcement on October 2 that its second Windows 10 release of the year, version 1809, was ready for delivery to the public, right on schedule. Then, just days later, the company took the unprecedented action of pulling the October 2018 Update from its servers while it investigated a serious, data-destroying bug.

An embarrassing drip-drip-drip of additional high-profile bug reports has continued all month long. Built-in support for Zip files is not working properly. A keyboard driver caused some HP devices to crash with a Blue Screen of Death. Some system fonts are broken. Intel pushed the wrong audio driver through Windows Update, rendering some systems suddenly silent. Your laptop's display brightness might be arbitrarily reset.

And with November fast approaching, the feature update still hasn't been re-released.

What went wrong? My ZDNet colleague Mary Jo Foley suggests Microsoft is so focused on new features that it's losing track of reliability and fundamentals. At Ars Technica, Peter Bright argues that the Windows development process is fundamentally flawed.

Or maybe there's an even simpler explanation.

I suspect a large part of the blame comes down to Microsoft's overreliance on one of the greatest management principles of the last half-century or so: "What gets measured gets done." That's certainly a good guiding principle for any organization, but it also leads to a trap for any manager who doesn't also consider what's not being measured.

For Windows 10, a tremendous number of performance and reliability events are measured constantly on every Windows 10 PC. Those streams of diagnostic data come from the Connected User Experience and Telemetry component, aka the Universal Telemetry Client. And there's no doubt that Microsoft is using that telemetry data to improve the fundamentals of Windows 10.

In that September 2017 blog post, for example, Microsoft brags that it improved battery life by 17 percent in Microsoft Edge, made boot times 13 percent faster, and saw an 18 percent reduction in users hitting "certain system stability issues." All that data translated into greater reliability, as measured by a dramatically reduced volume of calls to Microsoft's support lines:

"Our internal customer support teams are reporting significant reductions in call and online support request volumes since the Anniversary Update. During this time, we've seen a healthy decline in monthly support volumes, most notably with installation and troubleshooting update inquiries taking the biggest dip."

Microsoft has been focusing intently on stuff it can see in its telemetry dashboard, monitoring metrics like installation success rates, boot times, and number of crashes. On those measures of reliability and performance, Windows 10 is unquestionably better than any of its predecessors.

Unfortunately, that focus has been so intense that the company missed what I call "soft errors," where everything looks perfectly fine on the telemetry dashboard and every action returns a success event even when the result is anything but successful.
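To make the idea of a soft error concrete, here is a minimal Python sketch. It is purely illustrative and is not Microsoft's code or a model of the actual bug: the `Telemetry` class, the `upgrade_profile` step, the file paths, and the `verify_no_data_loss` check are all hypothetical names invented for this example. The upgrade step contains a logic flaw that silently drops some files, yet because nothing crashes, the only event it emits is a success.

```python
from dataclasses import dataclass, field


@dataclass
class Telemetry:
    """Hypothetical event sink standing in for a real telemetry client."""
    events: list = field(default_factory=list)

    def emit(self, name: str, ok: bool) -> None:
        self.events.append((name, ok))


# Hypothetical default folder used only for this illustration.
DEFAULT_FOLDER = "C:/Users/me/Documents/"


def upgrade_profile(files: dict, telemetry: Telemetry) -> dict:
    """Hypothetical upgrade step with a logic flaw.

    It keeps only files under the default folder and silently drops
    everything else. No exception is raised, so it reports success.
    """
    migrated = {
        path: data
        for path, data in files.items()
        if path.startswith(DEFAULT_FOLDER)
    }
    telemetry.emit("upgrade_profile", ok=True)
    return migrated


def verify_no_data_loss(before: dict, after: dict) -> list:
    """Post-condition check: paths present before the step but missing after."""
    return sorted(set(before) - set(after))


if __name__ == "__main__":
    telemetry = Telemetry()
    before = {
        "C:/Users/me/Documents/a.txt": b"a",
        "D:/Docs/b.txt": b"b",
    }
    after = upgrade_profile(before, telemetry)
    print("telemetry events:", telemetry.events)
    print("files lost:", verify_no_data_loss(before, after))
```

The dashboard in this sketch would show a clean success, while the post-condition check, which compares the outcome to what the user expected, is the only place the loss shows up. That kind of check has to be written deliberately by someone who anticipated the edge case.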

Telemetry is most effective at gathering data to diagnose crashes and hangs. It provides great feedback for developers looking to fine-tune performance of Windows apps and features. It can do a superb job of pinpointing third-party drivers that aren't behaving properly.

But telemetry fails miserably at detecting anything that isn't a crash or an unambiguous failure. In theory, those low-volume, high-impact issues should be flagged by members of the Windows Insider Program in the Feedback Hub. And indeed, there were multiple bug reports from members of the Windows Insider Program, over a period of several months, flagging the issue that caused data to be lost during some upgrades. There were also multiple reports that should have caught the Zip file issue before it was released.

So why were those reports missed? If you've spent any time in the Feedback Hub, you know that the quality of reporting varies wildly. As one Windows engineer complained to me, "We have so many issues reported daily that are variations of 'dark theme sucks, you guys should die' that it's hard to spot the six upvotes on a real problem that we can't repro in-house."

In response to those missed alarms, Microsoft has added a new field to its problem reporting tool in Feedback Hub, to provide an indication of the severity of an issue.

Time will tell if that addition helps or if testers will automatically overrate every bug report out of frustration. Even with that change, the recent problems highlight a fundamental flaw in the Windows Insider Program: Its members aren't trained in the art of software testing.

The real value of Insider Preview builds is, not surprisingly, capturing telemetry data from a much wider population of hardware than Microsoft can test in-house. As for those manual feedback reports, I'm skeptical that even an extra layer of filtering will be sufficient to turn them into actionable data.

Ultimately, if Microsoft is going to require most of its non-Enterprise customers to install feature updates twice a year, the responsibility to test changes in those features starts in Redmond. The two most serious bugs in this cycle, both of which wound up in a released product, were caused by a change in the fundamental workings of a feature.

An experienced software tester could have and should have caught those issues. A good tester knows that testing edge cases matters. A developer rushing to check in code to meet a semi-annual ship deadline is almost certainly not going to test every one of those cases and might not even consider the possibility that customers will use that feature in an unintended way.

Sometime in the next few days, Microsoft will re-release the October 2018 Update, and everything in the Windows-as-a-service world will return to normal. But come next April, when the 19H1 version is approaching public release, a lot of people will be holding their breath.
