In 2009, a researcher named Michael Schatz revolutionized the world of genetics research when he showed how an open-source software tool called Hadoop could help find mutations hidden in the long and winding string of DNA that is the human genome.

Hadoop is a number-crunching tool that can pool the processing power of thousands of computer servers. Working as a bioinformatician at the University of Maryland, Schatz ran Hadoop atop Amazon EC2 – a cloud computing service that gives you instant access to as many servers as you need – and he needed no more than a few hours to handle calculations that would ordinarily require a month of processing time.

The rub is that Hadoop was built for software engineers – not geneticists. It's not the easiest thing for science researchers to wrap their heads around, and though it significantly reduced calculation times, it's not necessarily suited to crunching genomic data atop cloud services such as Amazon, which often involves moving enormous amounts of information from place to place. Hadoop is meant to crunch data without moving it.

But today, multiple startups – including DNAnexus and Spiral Genetics – are taking the genomics world beyond Hadoop and onto a new breed of web service designed to analyze genome data even more efficiently. These services still process information using the power of thousands of servers, but they're specifically built for the sort of problems geneticists are looking to solve – and according to the companies, they don't require the software know-how you need to operate your own cluster of Hadoop servers.

"Our system is really kind of a comprehensive, whole system for working with genomic data," says Andreas Sundquist, the CEO of DNAnexus, a Mountain View, California company funded in part by Google Ventures, the search giant's investment arm. "Most bioinformatics software that exists today is not written to run with Hadoop."

Spiral Genetics – a company based in Seattle – also claims that it can deliver calculations about 10 times faster than a system that merely runs Hadoop atop a cloud service such as Amazon EC2.

Scientists used to map genes sequentially, from point A to point Z. That’s the way the Human Genome Project was done, and it took a group of international scientists 13 years and roughly $4.6 billion in today’s dollars to map all 23 human chromosomes. But about a year before Michael Schatz published his seminal paper on Hadoop, the genomics community started using a cheaper, faster method known as "next-generation sequencing."

This method maps genes by chopping them up into millions of small, random fragments that can be sequenced in parallel. A computer algorithm then determines how the pieces fit together by comparing them to a known sequence, or reference genome, and with additional algorithms, you can zero in on the locations where there might be mutations.
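As a rough illustration of the idea described above, here is a minimal Python sketch. It is not the pipeline used by Schatz, DNAnexus, or Spiral Genetics; the function names (`shred`, `align`, `call_variants`), the brute-force mismatch search, and the thresholds are all assumptions chosen for readability. Real aligners use indexed data structures and handle insertions, deletions, and sequencing errors.

```python
import random
from collections import Counter, defaultdict


def shred(genome, read_len, n_reads, seed=0):
    """Cut a genome string into n_reads random fragments (toy stand-in for sequencing)."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    return [genome[s:s + read_len]
            for s in (rng.randint(0, last_start) for _ in range(n_reads))]


def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align(read, reference, max_mismatches=2):
    """Return the reference offset where the read fits best, or None if nothing is close."""
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        dist = hamming(read, reference[pos:pos + len(read)])
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos


def call_variants(reads, reference, min_depth=3):
    """Pile aligned reads onto the reference and report positions whose consensus differs."""
    pileup = defaultdict(Counter)
    for read in reads:
        pos = align(read, reference)
        if pos is None:
            continue  # read could not be placed on the reference
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            variants.append((pos, reference[pos], base))
    return variants


if __name__ == "__main__":
    rng = random.Random(42)
    reference = "".join(rng.choice("ACGT") for _ in range(120))
    # Hypothetical sample genome: the reference with one substituted base.
    alt = "A" if reference[60] != "A" else "C"
    sample = reference[:60] + alt + reference[61:]
    print(call_variants(shred(sample, 25, 300), reference))
```

Each read is placed independently of the others, which is why the work can be spread across many machines.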

You can do all this with Hadoop, known for crunching data inside big-name web services such as Facebook, Yahoo, and Twitter. Michael Schatz, who's now at Cold Spring Harbor Laboratory, and others have open-sourced algorithms specifically designed to process genomics data with the platform. But DNAnexus and Spiral Genetics are looking to simplify the process.
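To give a sense of how this kind of work splits into Hadoop-style steps, the sketch below imitates the map, shuffle, and reduce phases in plain Python on a single machine. It does not use Hadoop's actual API, and it is not the open-sourced code mentioned above; the function names and the `(start, bases)` input format are assumptions made for illustration. On a real cluster, the map and reduce calls would run on many servers at once.

```python
from collections import Counter, defaultdict
from itertools import chain


def map_read(aligned_read):
    """Map step: turn one aligned read into (position, base) pairs."""
    start, bases = aligned_read
    return [(start + i, base) for i, base in enumerate(bases)]


def shuffle(pairs):
    """Shuffle step: group every observed base by its genome position."""
    groups = defaultdict(list)
    for position, base in pairs:
        groups[position].append(base)
    return groups


def reduce_position(position, bases):
    """Reduce step: pick the most frequently observed base at one position."""
    base, _count = Counter(bases).most_common(1)[0]
    return position, base


def run_job(aligned_reads):
    """Run map, shuffle, and reduce in sequence and return {position: consensus base}."""
    mapped = chain.from_iterable(map_read(read) for read in aligned_reads)
    grouped = shuffle(mapped)
    return dict(reduce_position(pos, bases) for pos, bases in grouped.items())


if __name__ == "__main__":
    # Hypothetical reads that have already been aligned: (start offset, bases).
    reads = [(0, "ACG"), (1, "CGT"), (2, "GTA"), (1, "CTT")]
    print(run_job(reads))
```

Because each position is reduced on its own, no single server ever needs to see the whole data set.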

“Clients use our website like Gmail or Google Maps,” says DNAnexus CEO Andreas Sundquist. “We make it really easy to take huge data sets, do all the data crunching, and come down with a list of genes impacted.”

According to Sundquist, DNAnexus delivers that list in a matter of hours or sometimes days – depending on how complex the analysis is. Meanwhile, Spiral Genetics claims a delivery time of less than three hours – whether researchers upload one genome or 1,000. This is only possible, the company says, because it built a Hadoop alternative from scratch.

"When we started out, we were interested in using Hadoop, just like everybody else," says Adina Mangubat, the 25-year-old CEO of Spiral Genetics. "But it became clear it just wasn't going to perform the way we needed."

The trouble, the company says, is that if you process genomics data with an online service, you're forced to move a lot of data from place to place. Amazon houses the human genome data on its S3 storage service, and if you want to crunch it, you have to move it onto S3's sister service, EC2. This can slow things down.

Spiral's system is specifically designed to dovetail with both S3 and EC2, and according to chief technology officer Jeremy Bruestle, it can even outperform a dedicated Hadoop cluster that already houses the genome data set. "We have the flexibility of the cloud, but with performance that is actually even better than a cluster," he says. The company does not provide many details describing how its patented system works – other than to say it's able to grab and process data from S3 more efficiently than a service based on Hadoop.

The other problem with Hadoop is that it wasn't designed for real-time queries. You can't instantly ask small questions of your data set. It's what's known as a "batch system," and that means there's always a lag time when you run a job. But just as companies such as Cloudera have worked to instantly query big data sets in the world of big business, Spiral and DNAnexus are looking towards real-time performance in the genomics game.

According to both companies, their systems make it easier for researchers to, say, query the genome of a particular patient. That's the same reason Knome – another genomics outfit – also built an alternative to Hadoop.

But to gain traction among scientists, Spiral and DNAnexus will have to convince large research institutions to part ways with their existing infrastructure. Institutions such as BGI and the University of California, Santa Cruz have already built massive server farms designed to crunch genomics data, so they're unlikely to move onto a new cloud service any time soon.

“What’s really been happening is more specialized clouds are being built for particular data sets,” says Michael Schatz, referring to tools such as DNAnexus and Spiral. “I really don’t see major research institutions letting go of their computing infrastructure any time soon.”

To ease those pains, Spiral offers a product called Spiral Cluster that lets researchers power their own clusters with the company's technology and offload any jobs they can’t handle on their own onto the Spiral cloud service. “It makes researchers feel like they have an ever-expanding cluster,” says Spiral CEO Mangubat.

The hope is that when they need to upgrade their clusters, scientists will opt to move their entire operation to Spiral's cloud service instead of investing in hardware.

Spiral and DNAnexus also say that a researcher can customize the way their services operate or even upload new applications to these services. "We've built a framework to let you run really anything you want in the cloud," Sundquist says. "We just provide the infrastructure to allow the developer to choose how they want to deploy their tools most effectively."

That's important because not all scientists use the same technologies to sequence genes, and the methods they use to map DNA impact the types of analysis that should be done. Both companies bill their services as a way for any genomics researcher to analyze data – and share this work with others.

“I hope these guys deliver on that exact promise,” says Jonathan Hirsch, the president of Syapse, a cloud-based startup trying to bring genomics into the clinic. “If they can handle that, that’s tremendous value.”