Alfred Pasieka

Amazon’s cloud computing unit, Amazon Web Services, will store for public use the entire contents of the National Institutes of Health’s 1000 Genomes Project, a survey of genetic information from 1,700 individuals that is some 200 terabytes in size. Anyone can access the information for free, and there is no requirement to share any research results.

Amazon is incurring significant costs here, and providing a useful service: Although the government data is normally available to anyone, downloading and storing this sequenced DNA information is a long and expensive process.

AWS could end up making money on the deal, however. Manipulating this much information requires a lot of computing power, and Amazon will be charging its regular rates for use of computers. That is still significantly less than buying the kind of supercomputers needed for most big genetic research.

“Downloading this to your own servers could take weeks to a month, assuming you had the data storage,” said Adam Selipsky, vice president of AWS. To crunch the numbers, he added, “you’d need access to large, high performance compute clusters that cost conservatively hundreds of thousands of dollars, and in many cases millions to tens of millions of dollars.”
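The download estimate can be sanity-checked with rough arithmetic. The sketch below assumes decimal terabytes (10^12 bytes) and a perfectly sustained link; the link speeds are hypothetical illustrations, not figures from Amazon or the N.I.H.

```python
# 200 TB is the size reported in the article; decimal terabytes are an assumption.
DATASET_BYTES = 200 * 10**12


def transfer_days(link_mbit_per_s: float, size_bytes: int = DATASET_BYTES) -> float:
    """Days needed to move size_bytes over a link sustained at link_mbit_per_s."""
    bits = size_bytes * 8
    seconds = bits / (link_mbit_per_s * 10**6)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical sustained rates in Mbit/s, chosen only for illustration.
    for rate in (100, 1_000, 10_000):
        print(f"{rate:>6} Mbit/s: {transfer_days(rate):7.1f} days")
```

Under these assumptions a fully used 1 Gbit/s connection needs about 18.5 days, which sits within the “weeks to a month” range Mr. Selipsky describes. Real transfers would likely be slower because of protocol overhead and shared links.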

For example, he said, AWS recently created for a pharmaceutical client a virtual supercomputer of 30,000 semiconductor “cores,” for which it charged $1,279 an hour. The AWS machine executed the equivalent of what had been 11 years of work on the company’s computers in a few hours, Amazon says.
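To put the hourly price in per-core terms, here is a minimal sketch using only the two figures quoted above. The run length is a hypothetical parameter, since the article says only “a few hours.”

```python
# Figures quoted in the article.
CORES = 30_000
RATE_USD_PER_HOUR = 1_279


def cost_per_core_hour(rate: float = RATE_USD_PER_HOUR, cores: int = CORES) -> float:
    """Hourly price divided evenly across all cores."""
    return rate / cores


def run_cost(hours: float, rate: float = RATE_USD_PER_HOUR) -> float:
    """Total charge for a run of the given length at the quoted hourly rate."""
    return hours * rate


if __name__ == "__main__":
    print(f"Per core-hour: ${cost_per_core_hour():.4f}")
    # Hypothetical run lengths; the article does not give the exact duration.
    for hours in (2, 3, 5):
        print(f"{hours} hours: ${run_cost(hours):,.0f}")
```

That works out to a little over four cents per core per hour, so even a run of several hours would cost thousands of dollars rather than the hundreds of thousands Mr. Selipsky cites for owning a cluster.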

The people in the study have consented to have their data made public, and there is no personal information, such as a history of disease, associated with the genetic information. “It’s the only public data set like this,” said Lisa D. Brooks, program director for the Genetic Variation Program of the National Human Genome Research Institute, a part of the National Institutes of Health. “It is an almost complete set of human genetic variants.”

“Some diseases, like sickle cell anemia, we know the genetic basis,” she said. “With others, like diabetes and heart disease, there is a genetic contribution there, but there are multiple genes, and environmental contributions. If we knew more about the genetic component of this we could predict risk better.”