I get asked this question a lot. How big is our genetic code? You know… that DNA blueprint thing… consisting of billions of letters… A’s, G’s, C’s, T’s… present in all of the TRILLIONS of cells in the human body… the thing that makes you you. How big is it, really?

We each have ~3 billion base pairs in our genomes, but how much storage space does one human genome take up? The answer, of course, is:

It depends.

It depends on what we’re talking about. Are we referring to that single string of letters inside your cells? Or the raw data that comes off a genome sequencer, which has to have many “reads” at each position for adequate coverage, and has quality data associated with it? Or perhaps we’re just talking about the list of every spot in your genome where you differ from the so-called “normal” reference genome?

Here are just a few of the many ways of breaking it down:

1. In a perfect world (just your 3 billion letters): ~700 megabytes

If you had a perfect sequence of the human genome (with no technological flaws to worry about, and therefore no need to include data-quality information along with the sequence), then all you would need is the string of letters (A, C, G and T) that make up one strand of the human genome, and the answer would be about 700 megabytes. It would look something like this:

AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCAGTCAGATTTACCCTGGCTCACCTTGGCGTCGCGTCCGGCGGCAAACTAAGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCTCCTTCCAGCCTCCGAAAAACTCGGACCAAAGATCAGGCTTGTCCGTTCTTCGCTAGTGATGAGACTGCGCCTCTGTTCGTACAACCAATTTAGGTGAGTTCAAACTTCAGGGTCCAGAGGCTGATAATCTACTTACCCAAACATAG

To do the math, each base pair takes 2 bits (you can use 00, 01, 10, and 11 for T, G, C and A). Multiply that by the number of base pairs in the human genome, and you get 2 * 3 billion = 6,000,000,000 bits. And remember, you have to go from bits to bytes to get to an answer in megabytes. A bit is just a single unit of digital information, but a byte is a sequence of bits (usually 8). And because computers work in binary math, 1 kilobyte = 1024 bytes (i.e. 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 = 1024). 1 gigabyte = 1024 megabytes = 1,048,576 kilobytes = 1,073,741,824 bytes. So you take the 6,000,000,000 bits and divide by 8 to get 750,000,000 bytes. Divide that by 1024 and you get ~732,422 kilobytes. Divide it by 1024 once more and you’re left with ~715 megabytes. Yup, it could pretty much fit on a CD-ROM, not that anyone uses those things anymore.
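If you want to check the arithmetic yourself, the bits-to-megabytes chain above can be sketched in a few lines of Python (the constants are just the round numbers used in this post, not exact genome figures):

```python
# Back-of-the-envelope size of a "perfect" genome at 2 bits per base.
GENOME_LENGTH = 3_000_000_000  # ~3 billion base pairs
BITS_PER_BASE = 2              # A, C, G, T -> 00, 01, 10, 11

total_bits = GENOME_LENGTH * BITS_PER_BASE  # 6,000,000,000 bits
total_bytes = total_bits / 8                # 750,000,000 bytes
kilobytes = total_bytes / 1024              # ~732,422 KB
megabytes = kilobytes / 1024                # ~715 MB

print(f"~{megabytes:.0f} MB")  # -> ~715 MB
```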

2. In the real world, right off the genome sequencer: ~200 gigabytes

In reality, in order to sequence a whole human genome, you need to generate a bunch of short “reads” (~100 base pairs, depending on the platform) and then “align” them to the reference genome. The number of reads overlapping each position is known as coverage. For example, a whole genome sequenced at 30x coverage means that, on average, each base in the genome was covered by 30 sequencing reads. Illumina’s next-generation sequencers, for example, can produce millions of short 100bp reads per hour, and these are often stored in the FASTQ format. This format stores not only the letter at each base position, but also other information such as the quality of each base call. Here’s what a FASTQ entry looks like:

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

This quality data isn’t as simple as ACGT, because a variety of different letters and symbols are used. So in this case, we’re storing each character as a full byte rather than packing bases into 2 bits. Using some quick & dirty, oversimplified math, the numbers look like this: assuming a 3-billion-letter human genome and an average depth of coverage of 30x, we have 90 billion letters, occupying roughly 90 gigabytes of disk space if we map one character to one byte. And since a FASTQ file contains both the reads and their quality scores, the total comes to around 180 gigabytes (ignoring header lines and line breaks). It varies widely, but let’s call it 200 gigabytes.

3. As a variant file, with just the list of mutations: ~125 megabytes

Only about 0.1% of the genome differs among individuals, which equates to about 3 million variants (aka mutations) in the average human genome. This means we can make a “diff file” of just the places where any given individual differs from the normal “reference” genome. In practice, this is usually done in a .VCF file format, which in its simplest form looks something like this:

chr20 14370 rs6054257 G A 29 PASS 0|0

Each line takes ~45 bytes; multiply that by the ~3 million variants in a given genome, and you get a .VCF file size of about 135,000,000 bytes, or ~125 megabytes.
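And the matching arithmetic for the variant file (using the same rough constants as above; note the exact binary-megabyte figure comes out a touch higher than the rounded ~125):

```python
# Rough .VCF size: a short text line per variant.
BYTES_PER_VCF_LINE = 45    # rough size of one variant line, as in the example
NUM_VARIANTS = 3_000_000   # ~0.1% of 3 billion bases

total_bytes = BYTES_PER_VCF_LINE * NUM_VARIANTS  # 135,000,000 bytes
megabytes = total_bytes / 1024**2                # ~129 binary MB, call it ~125

print(f"~{megabytes:.0f} MB")
```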

So there you have it. A few of the many ways of looking at genome storage size. Practically speaking, #1 doesn’t really apply, because you never get a perfect string of an entire human genome. #3 is the most efficient, and is what people often pass around and deal with for downstream analysis and interpretation. But #2 is how genomes are usually stored, because sequencing is still an imperfect science, as is variant calling. So you really need to hang on to the raw sequencing reads and associated quality data, for future tweaking of the data analysis parameters if needed.

What this means is that we’d all better brace ourselves for a major flood of genomic data. The 1000 Genomes Project data, for example, is now available in the AWS cloud and consists of >200 terabytes for the 1700 participants. As the cost of whole genome sequencing continues to drop, bigger and bigger sequencing studies are being rolled out. Just think about the storage requirements of this 10K Autism Genome project, or the UK’s 100k Genome project….. or even.. gasp.. this Million Human Genomes project. The computational demands are staggering, and the big question is: Can data analysis keep up, and what will we learn from this flood of A’s, T’s, G’s and C’s….?