Last week, McAfee broke a lot of its customers' computers. A virus definition update falsely identified a key Windows file as a virus.

McAfee initially tried to downplay the issue. It claimed that affected machines would see only "moderate to significant" problems, and that machines running its software's default configuration were unaffected. "Not booting properly and being useless for real work" strikes us as somewhat worse than "moderate to significant," and many users have reported that McAfee is wrong about the default configuration. The situation is unclear, but it appears that upgrades and certain patches can produce a different "default," one that isn't safe. As if that were any consolation: no setting should result in machines getting broken. Ultimately, such quibbling is irrelevant; tens or hundreds of thousands of machines were disabled by the virus update.

Eventually, McAfee did issue a suitably apologetic statement. On Monday, the company offered home users affected by the problem two years of free updates plus compensation for any costs incurred (business users are offered nothing more than an apology). What was missing was any credible explanation of why it happened and how it would be prevented in the future.

One rather depressing hint was given in an early revision of a FAQ the company published about the problem. The document has been sanitized, but the relevant portion can be found at ZDNet:

9. What is McAfee going to do to ensure this does not repeat? McAfee is currently conducting an exhaustive audit of internal processes associated with DAT creation and Quality Assurance. In the immediate term McAfee will do the following to provide mitigation from false detections:

- Strict enforcement of rules and processes regarding DAT creation and Quality Assurance.
- *Addition of the missing Operating Systems and Product configurations.*
- Leveraging of cloud based technologies for false remediation.
- A revision of Risk Assessment criteria is underway.

(Emphasis ours.)

In other words, McAfee didn't bother to test against one of the most widely used operating systems around, Windows XP Service Pack 3, before pushing out an update. Good job.

One might expect other anti-virus vendors to highlight McAfee's failing and promote their own products instead. But their response has been, if anything, sympathetic; they have laid the blame squarely on malware authors, who have resorted to such cunning trickery as "giving their executables the same names as Windows files" just to make the virus scanners' jobs harder.

McAfee isn't the first company to make a mistake like this. Last month a BitDefender update broke 64-bit Windows XP, Windows Vista, and Windows 7 installations. Five years ago, Trend Micro hobbled Windows XP Service Pack 2 machines, an incident that even saw the company pay compensation to some affected customers.

Sometimes the damage is less severe but still thoroughly debilitating; a 2006 McAfee update identified many programs, including Excel and Google Toolbar, as viruses and duly broke them.

The truth is, false positives are abundant. A site tracking false positives gave up updating after being inundated with reports. Small developers producing shareware or custom applications are getting nailed with false positives on a consistent basis. These guys are producing programs that won't feature on any AV vendor's test matrix (though as history shows, even being widely used doesn't guarantee that), and their customers (or, even worse, potential customers) are routinely being inconvenienced, if not downright scared off.

A 2007 Symantec update to various Norton products shows just how hard effective testing is, even against well-known products: the update took down Windows XP Service Pack 2 machines, but only the Simplified Chinese edition, and only when a particular Windows patch was installed. That was still enough to cause problems for millions of PCs.

So what's the solution here? Unfortunately, there doesn't appear to be a good one. Signature-based anti-virus software is always going to suffer this kind of problem, and the scale of testing, even if restricted to major software, is enormous. Perhaps impractically so, given the number of different patch levels and languages that would need to be tested. Certainly, given the alarming regularity with which these problems occur, it seems to be a larger task than the anti-virus vendors can manage.
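To make the testing problem concrete, here is a toy model of signature matching and of a pre-release check against known-good files. It is a sketch under deliberately simple, hypothetical assumptions: a "signature" is just a byte pattern, a file is flagged if the pattern appears anywhere in it, and the names `DEFINITIONS`, `scan`, and `prerelease_check` are invented for illustration. Real definition formats are far more elaborate, but the failure mode is the same: a benign file that happens to contain the pattern is flagged, and the pre-release check only catches that if the exact file version is in the test corpus.

```python
# Toy signature scanner illustrating how a false positive can slip through QA.
# Hypothetical model: a signature is a plain byte pattern matched anywhere.

# Hypothetical definitions update: signature name -> byte pattern.
DEFINITIONS: dict[str, bytes] = {
    "Toy.Worm.A": b"\x4d\x5a\x90\x00\xde\xad",
}


def scan(data: bytes, definitions: dict[str, bytes]) -> list[str]:
    """Return the names of all signatures whose pattern occurs in data."""
    return [name for name, pattern in definitions.items() if pattern in data]


def prerelease_check(definitions: dict[str, bytes],
                     known_good: dict[str, bytes]) -> dict[str, list[str]]:
    """Scan a corpus of known-good files; any hit is a false positive.

    The check is only as good as the corpus: a file version missing from it
    (one OS edition, language, or patch level) is never tested.
    """
    hits: dict[str, list[str]] = {}
    for filename, data in known_good.items():
        matches = scan(data, definitions)
        if matches:
            hits[filename] = matches
    return hits


malware = b"junk" + b"\x4d\x5a\x90\x00\xde\xad" + b"payload"
# Two builds of the same benign system file; only build B happens to
# contain the byte sequence the signature looks for.
system_file_build_a = b"\x4d\x5a\x90\x00\x03\x00 benign code"
system_file_build_b = b"\x4d\x5a\x90\x00\xde\xad benign code"

print(scan(malware, DEFINITIONS))  # ['Toy.Worm.A']

# A QA corpus that only includes build A: the update looks safe.
print(prerelease_check(DEFINITIONS, {"sysfile_a": system_file_build_a}))  # {}

# Adding build B to the corpus exposes the false positive before release.
print(prerelease_check(DEFINITIONS, {
    "sysfile_a": system_file_build_a,
    "sysfile_b": system_file_build_b,
}))  # {'sysfile_b': ['Toy.Worm.A']}
```

The scanner itself cannot tell build B from malware; the only defense in this model is having build B in the corpus, which is why the size of the test matrix matters so much.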

But other approaches to anti-virus fare no better. Heuristic scanners, which try to trap software because the actions it takes or the network traffic it sends appear malicious, ultimately have the same problem: they catch things they shouldn't. A strong case can be made that virus scanners should verify digital signatures and ignore files that are properly signed (a valid signature shows that a file hasn't been altered since it was signed), and some anti-virus software already does this. But even this has issues. Many scanners also examine running processes (to detect, for example, self-propagating worms that attack network services), and a valid signature on a file on disk says nothing about what a process is doing in memory. Terminating system processes because they appear to be infected can be just as damaging as deleting system programs from disk.
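The "skip trusted files" idea can be sketched in a few lines. This is a simplified stand-in, not real signature verification: instead of checking a vendor's digital signature (Authenticode, on Windows), it uses a hypothetical allowlist of SHA-256 digests of known-good files, so it relies only on Python's standard `hashlib`. The two share the property that matters here, namely that any change to a file's bytes drops it back into ordinary scanning, though a real signature check can also trust new versions the scanner has never seen. `DEFINITIONS` and `scan_with_allowlist` are invented names.

```python
import hashlib

# Hypothetical definitions: signature name -> byte pattern.
DEFINITIONS: dict[str, bytes] = {"Toy.Worm.A": b"\xde\xad\xbe\xef"}


def digest(data: bytes) -> str:
    """SHA-256 hex digest, used here as a stand-in for a signature check."""
    return hashlib.sha256(data).hexdigest()


def scan_with_allowlist(data: bytes, trusted: set[str],
                        definitions: dict[str, bytes]) -> list[str]:
    """Skip pattern matching for files whose exact contents are trusted."""
    if digest(data) in trusted:
        # Trusted file: never flagged, even if it happens to contain a pattern.
        return []
    return [name for name, pattern in definitions.items() if pattern in data]


# A benign system file that happens to contain the signature's byte pattern.
system_file = b"MZ benign system code \xde\xad\xbe\xef more code"
trusted = {digest(system_file)}  # hypothetical manifest of known-good files

print(scan_with_allowlist(system_file, trusted, DEFINITIONS))  # []

# Any modification changes the digest, so a tampered copy is scanned normally.
tampered = system_file + b" appended payload"
print(scan_with_allowlist(tampered, trusted, DEFINITIONS))  # ['Toy.Worm.A']
```

Note what this does not cover: it only looks at bytes on disk, so it does nothing for running processes, and it cannot help with files that were never signed or listed in the first place.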

Moreover, not every file on a system is signed. In principle, every program on a corporate desktop could be signed; typical corporate desktops don't need to run arbitrary downloaded programs, so greater use of signatures (even for custom, in-house applications) might be of value. But that's probably not an option for home users. And besides, user documents typically aren't signed, and a virus scanner destroying a document you're working on just because it happens to look like a virus is not much of an improvement.

IT departments should perhaps be more circumspect about rolling out definition updates, but they too suffer some of the same testing problems. Though the problem should be more tractable for those organizations that have standard system images and carefully manage their own patching, that isn't the reality for a great many companies. This isn't even a situation where a virtualized lab can provide a good solution—a false positive could easily be generated for a critical driver that's only used on real hardware.

Improved operating system security, while useful for many things, does little here, at least given current OS designs. Tasks like sending spam e-mail or destroying documents don't need elevated user privileges, so in many cases privilege restrictions do nothing to stop malware from carrying them out.

Similarly, many viruses depend not on software flaws to propagate but on end-user flaws. In other words, they trick people into running them. Better programming and more restricted user accounts don't do much to help here. A suitably radical redesign of the operating system could reduce vulnerability (something along the lines of the experimental Qubes, for example, would offer greater separation between user data and potentially hostile software), but the prospect of such a redesign becoming mainstream any time soon is remote.

With AV vendors trapped in a game they can never win (virus writers will always outpace them), this problem shows no sign of being solved any time soon. Though some false positives, this one included, shouldn't have happened, and McAfee really should have tested against Windows XP Service Pack 3, such problems will unfortunately continue to be a fact of anti-virus life.