A year ago, Stephen Jakisa was having some serious computer problems. It started while he was playing Battlefield 3, a first-person shooter game set in the near future. But soon even his web browser was crapping out every 30 minutes or so. He couldn't even install software on the PC.

It got so bad that Jakisa – a programmer by profession, and no technical neophyte – thought he might have a virus, or maybe some seriously buggy software on his PC. But he decided to check things out with a friend, Ioan Stefanovici, who happened to be writing his Ph.D. thesis on computer reliability.

After a bit of investigative work, Jakisa and Stefanovici traced the source of the problem: a bad memory chip on Jakisa's PC. Because his computer had been running fine for about six months before the problems popped up, Jakisa hadn't suspected the hardware until his friend talked him into running a special memory analysis tool. "I was really losing my mind," he says. "If this were to happen to Joe Blow down the street who doesn't know anything about computers, he would have been completely stumped."
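Memory analysis tools of the kind Jakisa ran work on a simple idea: write known bit patterns into every location, read them back, and flag any mismatch. Here is a minimal sketch of that pattern-testing approach in Python – it exercises a `bytearray` standing in for a region of RAM, not physical memory, and the function name and patterns are illustrative rather than taken from any real tool:

```python
# Toy illustration of memory pattern testing. Real testers write directly
# to physical RAM; here a bytearray stands in for a memory region.

def pattern_test(buf: bytearray, patterns=(0x00, 0xFF, 0xAA, 0x55)) -> list:
    """Write each pattern to every byte, read it back, report mismatches."""
    bad = []
    for p in patterns:
        for i in range(len(buf)):
            buf[i] = p                      # write phase
        for i in range(len(buf)):
            if buf[i] != p:                 # a stuck or flipped bit shows up here
                bad.append((i, p, buf[i]))  # (address, expected, observed)
    return bad

ram = bytearray(4096)        # stand-in for a region of memory
print(pattern_test(ram))     # prints [] on healthy memory
```

The alternating patterns 0xAA and 0x55 flip every bit between writes, which is what catches cells that are stuck at 0 or 1 or that interfere with their neighbors.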

Jakisa pulled out the buggy memory module, and the computer has worked fine ever since.

When computers crash, buggy software usually gets the blame. But over the past few years, computer scientists have started taking a hard look at hardware failures, and they're learning that another type of problem pops up more often than many people realize. That's right: hardware bugs.

Stephen Jakisa Photo: Stephen Jakisa

Chipmakers work hard to make sure their products are tested and working properly before they ship, but they don't like to talk about the fact that it can be a struggle to keep the chips working accurately over time. Since the late 1970s, the industry has known that obscure hardware problems could cause bits to flip inside microprocessor transistors. As transistors have shrunk in size, it's become even easier for stray particles to bash into them and flip their state. Industry insiders call this the "soft error" problem, and it's something that's going to become more pronounced as we move to smaller and smaller transistors where even a single particle can do much more damage.

But these "soft errors" are only part of the problem. Over the past five years, a handful of researchers have taken a long hard look at some very large computing systems, and they've realized that in many cases, the computer hardware we use is just plain broken. Heat or manufacturing defects can cause components to wear out over time, causing electrons to leak from one transistor to another, or breaking down the on-chip channels designed to carry current. These are the "hard errors."

The Power of 'Soft Errors'

Scientists designing the next generation of computer chips are really worried about this soft-error problem, and that's because of one major factor: power. As the next generation of supercomputers start to come online, they will have more chips and smaller components. And with all of these tiny transistors, it will take more and more energy to keep bits from flipping within these computers.

The problem is tied to basic physics. As chipmakers send electrons down smaller and smaller wires on their chips, the electrons simply escape, like drops of water bursting out of a leaky hose. The smaller the wires, the more electrons that leak out, and the more power it takes to keep everything working properly.

The problem is so tricky that Intel is working with the U.S. Department of Energy and other government agencies to solve it. Using its future-generation 5-nanometer chipmaking processes, Intel will build the brains of supercomputers that are 1,000 times more powerful than today's top machines by the end of the decade. But, right now, it looks like these super-systems will also be power hogs.

"We have a path to get there not worrying about power," says Mark Seager, chief technology officer for the high-performance computing ecosystem at Intel. "But if you want us to address power too, that's over and above our technical roadmap."

For regular computer users like Stephen Jakisa, the world of bit-flips and soft errors is a murky space. Chipmakers don't like to talk about how often their products fail – they think of this information as a proprietary secret – and good studies are hard to come by. Often, technology companies prohibit their own customers from talking about hardware failure rates. "That's been an area of active research in the industry," says Seager. "We don't talk about it much externally because it's a very sensitive topic."

Not-So-Soft Errors

Soft errors are one thing, but there are other problems that hardware makers have said even less about. According to a small team of researchers at the University of Toronto, when a computer's dynamic random-access memory (DRAM) fails, it's more likely to be caused by old age or buggy manufacturing (these are hard errors) than the soft errors that come from cosmic rays.

In 2007, University of Toronto professor Bianca Schroeder got access to Google's data centers, where she collected a treasure trove of information on how frequently the company's custom-designed Linux systems crapped out. She found far more errors than expected. And furthermore, about eight percent of Google's memory chips were responsible for 90 percent of the problems. Sometimes errors cropped up every few minutes.

Looking more closely, Schroeder's team found that the bugs seemed to be concentrated on specific regions of the computer's memory, and they tended to happen in older machines. The problems they uncovered were hard errors, not soft errors, and they were a much bigger deal than the U of T researchers had expected.

Schroeder and her team published a paper on their Google findings in 2009, and they followed up with a second paper earlier this year that found similar results on memory chips used in IBM Blue Gene systems as well as on a Canadian supercomputer called SciNet.

On all of the systems, the DRAM failure rates were about the same, says Ioan Stefanovici, who co-authored the 2012 paper. Another paper, this one written by researchers at AMD, also found that hard errors were more common than soft errors in DRAM memory chips. But AMD, like Intel, hasn't released any research on the failure rates of the static random-access memory (SRAM) that's built into its general-purpose microprocessors.

"It's not a new problem," says Vilas Sridharan, a reliability architect at AMD and one of the authors of the AMD paper. "Errors in DRAM devices were first identified in 1979, but we're still learning."

The world's largest DRAM maker, Samsung, does "not have any specific data" to share on the topic, according to a company spokesman.

Did bad memory cause this Blue Screen of Death in Toronto? Photo: Ioan Stefanovici

Schroeder and Stefanovici say that chipmakers need to take these hard errors more seriously. Today's high-end chips use a variety of tricks and techniques – things like error-correcting code – to recover from soft errors, but they're not as well equipped to handle hard errors.

And that's causing more problems than most people realize. High-end supercomputers might have the error-correcting code that fixes up bit-flips whenever they happen. But that's not the case on the PC. "Most mobile devices and consumer-grade laptops and desktops don't include error-correcting code, partly because the error model has been that errors in DRAM are mostly caused by soft errors," says Stefanovici.
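The error-correcting codes that server-grade memory uses work by storing a few extra check bits alongside each chunk of data; when a single bit flips, the pattern of check-bit mismatches points at exactly which bit went bad. A minimal sketch of the idea, using the classic Hamming(7,4) code – real DRAM ECC uses wider codes (typically SECDED over 64-bit words), but the principle is the same:

```python
# Hamming(7,4): 4 data bits protected by 3 parity bits. A single flipped
# bit produces a nonzero "syndrome" that equals the bad bit's position.

def hamming74_encode(d):
    """d: list of 4 data bits -> 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c):
    """Locate and fix a single flipped bit; return (corrected word, position)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4     # 0 means no single-bit error detected
    if pos:
        c[pos - 1] ^= 1            # flip the bad bit back
    return c, pos

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1                  # simulate a cosmic-ray bit flip
fixed, pos = hamming74_correct(corrupted)
print(fixed == word, pos)          # prints: True 5
```

This scheme silently repairs soft errors, which is exactly why a worn-out chip throwing hard errors in the same region over and over is a different and nastier problem – correction masks the symptom until the failures pile up.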

Because of his computer skills, Stefanovici gets tapped every now and then to diagnose bizarre computer crashes. He says he's traced at least three issues over the past year to bad DRAM.

Two years ago, he was walking past Dundas Square – it's Canada's slightly muted take on New York's Times Square – a big block filled with flashy signs and tourists in the heart of Toronto. Gazing up, he saw that one of the signs had gone blue – the sure sign of a computer crash. Stefanovici snapped a blurry shot of the screen with his BlackBerry and noted the error code. He isn't positive, but judging from the parity error displayed on the screen, he thinks that bad memory in the computer's video card was to blame.
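A parity error like the one on that billboard comes from the simplest form of memory checking: store one extra bit per byte recording whether the number of 1s is even or odd, and compare on every read. A toy sketch of the idea (the function names are illustrative, not from any real driver):

```python
# Even-parity checking: the extra bit makes the total count of 1s even.
# Parity can detect a single flipped bit, but unlike ECC it cannot say
# which bit flipped, so the system can only halt -- hence the blue screen.

def parity_bit(byte: int) -> int:
    """Parity bit for an 8-bit value: 1 if the count of 1-bits is odd."""
    return bin(byte & 0xFF).count("1") % 2

def parity_ok(byte: int, stored_parity: int) -> bool:
    """Recompute parity on read and compare against the stored bit."""
    return parity_bit(byte) == stored_parity

value = 0b10100001
stored = parity_bit(value)          # saved when the byte was written
print(parity_ok(value, stored))     # prints: True
print(parity_ok(value ^ 0b10, stored))  # one bit flips -> prints: False
```

That asymmetry – detection without correction – is why a parity error is fatal where an ECC machine would have quietly carried on.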