Programming dirty-dive into the Human Genome (in Python)

“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.” ~ Fred Brooks, The Mythical Man-Month, p. 102.

The human genome is, we are told, the heart of modern biology. It serves as the static blueprint from which our biological cells compute with. It is in short, simple glorious data.

To explore the human genome, you could futz around with fancy webapps designed for Excel-loving biologists. But no, you're a programmer. You just want the data.

So here it is: [chromFa.tar.gz] from the UCSC Genome Institute.

It’s 905 MB compressed, 2.9 GB decompressed.

Let's unpack this a bit.

First, there is no such thing as the human genome. This is a carefully synthesised version provided by the University of Santa Cruz. Specifically, it is Version 19 that was released in 2009. It serves as a useful reference, where individual human genomes are stored as diffs to the reference.

Other institutes provide other versions. For instance, the original genome company, Celera, released a genome that was the actual genome of Craig Venter, the founder of Celera. Ponder the politics of that.

Files as chromosomes

If you decompress the file, it unpacks into a bunch of files. You can, of course, fetch these files individually. You can even download genomes for other organisms.

The files we are interested are chr[\d+|XY].fa.gz . These are chr1.fa to chr22.fa , and chrX.fa and chrY.fa .

There are 46 chromosomes in every one of your cells (apart from sperm and egg cells).

Each of your cells has two copies of the first 22 chromosomes, one copy from your father, and one from your mother. They are ordered in terms of size, from the largest chr1.fa which comes in at 242 MB, to the smallest chr22.fa at 50 MB.

As well, there are also chromosome X and chromosome Y, which dictates whether you're male or female. Chromosome Y is found only in men - men have one chromosome X and one chromosome Y. Women have two Chromosome X. In sequencing experiments, if you find any DNA fragments from chromosome Y, you can be sure that the sample is from a guy.

Reading the genome: the FASTA format

For all the sophistication of bioinformatics, the FASTA format is a simple one. Consider the start of chr1.fa :

>chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ...

It’s text! The first line starts with > and a title. Then it’s a list of lines of 50 characters separated by a carriage return. The N means nucleic acid of unknown characteristic. The beginning of the chromosome is not well-determined.

At line 200, it gets a bit more interesting:

... NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN taaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta accctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac cctaacccaaccctaaccctaaccctaaccctaaccctaaccctaacccc taaccctaaccctaaccctaaccctaacctaaccctaaccctaaccctaa ccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaa ccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaac cccaaccccaaccccaaccccaaccctaacccctaaccctaaccctaacc ctaccctaaccctaaccctaaccctaaccctaaccctaacccctaacccc ...

And here you start seeing the a , c , t and g of the nucleic acids of DNA.

In Python, this is how you'd read chromosome 22:

def read_chromosome (fname) : lines = [] with open(fname, 'UR' ) as f: lines = [l.strip() for l in f.readlines()] seq = "" .join(lines[ 1 :]) return seq chr22_sequence = read_chromosome( 'chr22.fa' )

Welcome, you've just read a DNA molecule into your computer as a single long-ass 50MB string. On today's computers, you should be able to easily read this into RAM. This string represents the unbroken sequence of nucleotides that make up an individual DNA molecule that make up a chromosome.

Btw, always use 'UR' in the read function because from experience, biologists will fuck you up by giving you a Windows text file, whilst you work in Unix. If you don't use 'UR', the different line endings will break your code.

So how big is your genome?

Simply getting the file size will give you a good idea:

> du -shc chr*.fa

or we can use the python function from above:

print( "chr22.fa length is" , len(read_chromosome( 'chr22.fa' )))

Reading the Gene list

The human genome is one very long string. Obviously there must be some kind of structure but we have yet to completely decipher how our cells read it.

Biologists have figured out a bunch of things tho. Different properties of the human genome can be accessed with UCSC Genome Institute Table Browser or the (FTP version). If you were to train as a bioinformatician, you will learn how all these tables work.

Here, we will focus on the most studied feature of the genome: the gene. A reasonably workable understanding of a gene is a region of the gene that is involved in the production of a protein.

To look at the current mapping of genes on the release 19 of the human genome from UCSC, you need to download [refGene.txt.gz] which lists all the regions of the genes in all the chromosomes.

refGene.txt is simply a TSV file. So here's a Python function that reads a specific gene, using the commonly known name of the gene name2 :

labels = [ 'bin' , 'name' , 'chrom' , 'strand' , 'txStart' , 'txEnd' , 'cdsStart' , 'cdsEnd' , 'exonCount' , 'exonStarts' , 'exonEnds' , 'score' , 'name2' , 'cdsStartStat' , 'cdsEndStat' , 'exonFrames' ] def has_text (s) : return len(s.strip()) > 0 def get_int_list (s) : return map(int, filter(has_text, s.split( ',' ))) def get_entry (name2) : for line in open( 'refGene.txt' ): tokens = line.split( '\t' ) entry = dict(zip(labels, tokens)) if entry[ 'name2' ] == name2: for key, value in entry.items(): if 'exon' in key: entry[key] = get_int_list(value) else : if value.isdigit(): entry[key] = int(value) return entry return None

So let's look at the HBB gene, which is involved in the production of hemoglobin, the protein in your blood that is red and binds oxygen:

entry = get_entry( 'HBB' ) print(entry)

which gives the metadata as:

{ "bin" : 625 , "exonEnds" : [ 5246956 , 5248029 , 5248301 ], "exonFrames" : [ 0 , 2 , 0 ], "name" : "NM_000518" , "txStart" : 5246695 , "exonCount" : [ 3 ], "cdsEndStat" : "cmpl" , "cdsEnd" : 5248251 , "score" : 0 , "name2" : "HBB" , "strand" : "-" , "cdsStart" : 5246827 , "cdsStartStat" : "cmpl" , "chrom" : "chr11" , "txEnd" : 5248301 , "exonStarts" : [ 5246695 , 5247806 , 5248159 ] }

So we’ve got the metadata, so how do we read the DNA of the HBB gene?

First up, the designers of refGene.txt wisely decided to use [0, n) indexing. This is important as biologists typically prefer counting from 1, where this mix of indexing is a plague in bioinformatics.

The entirety of a gene lies between positions txStart to txEnd on the chromosome indicated by chrom . So HBB is found on chromosome 11 in the interval [5246695 , 5248301) , but on the - strand.

Let’s naively read it:

def read_interval (fasta, interval) : f = open(fasta, 'r+b' ) f.seek( 0 , 0 ) s = f.readline() offset = f.tell() s = f.readline() line_len = len(s) - 1 position = (interval[ 0 ] // line_len) * (line_len + 1 ) + (interval[ 0 ] % line_len) interval_len = interval[ 1 ] - interval[ 0 ] f.seek(position + offset, 0 ) letters = f.read(interval_len) letters = '' .join(letters.splitlines()) while (len(letters) < interval_len): letters += f.read( 1 ) letters = '' .join(letters.splitlines()) return letters read_interval( 'chrom11.fa' , [ 5246695 , 5248301 ])

We can compare this with some online sources. It’s good to use the more technical name of HBB where name = NM_000518 . We use this to search ensemble.org to lead eventually to this online display of the cDNA sequence of the gene HBB-201.

Comparing to the cDNA sequence, we see that our output is completely wrong!

The reason is that we forgot to interpret the - strand. This naturally leads us into the concept of reverse complementarity.

Watson-Crick pairing, in reverse!

What is this - strand?

There’s a beautiful redundancy in the chromosome - it is physically composed of two long molecules entwined in a double-helical form:

visualization from jolecule.com

The string of the FASTA file refers to the chemical make-up of one of the molecules. Let’s call it the + strand - so called because cellular processes only use one of the strands, the + strand, but ignores the - one. Still, the other strand - serves to bind to the + strand, forming the beautiful double-helix, which is a vastly more stable chemical arrangement.

Inside the cell, the two DNA molecules that make up a chromosome rarely exist in an isolated single strand form. They are prised open, cut, copied, modified, and then wrapped back together again.

The beauty in the pairing of the strands is that there is a strict order. You should only ever pair A's to T's, and G's to C’s, reflecting their stereochemical compatibility in a double-helix.

visualization from jolecule.com

In code, to get the sequence of the other strand (the reverse complement) in a particular region, we reverse the direction, and then apply the base-pairing complementarity as a dictionary:

complement = { 'A' : 'T' , 'T' : 'A' , 'C' : 'G' , 'G' : 'C' , 'a' : 't' , 't' : 'a' , 'c' : 'g' , 'g' : 'c' } def get_reverse_complement (s) : result = '' for c in reversed(s): result += complement[c] return result

So, combined with the code above:

s = read_interval( "chrom11.fa" , [ 5246695 , 5248301 ]) if entry[ "strand" ] == "-" : s = reverse_complement(s) print(s)

So let’s re-check the output with cDNA sequence of the gene HBB-201.

It’s pretty close except for the last two letters. The reason for this difference is that the online website is on Version 38 of the UCSC human genome. And we used Version 19 of the USCS human genome. It's important to track these differences as our knowledge of even well-studied molecules improves over time.

Protein coding regions

Okay, we can now read a gene using our table of contents in refGene.txt . So what good is it?

A modern operational definition of a gene is a region of the genome that is used to build a protein. Well it turns out a gene is actually used to build a bunch of related proteins. Bits of the DNA is used to encode chunks of proteins, and these are stitched together in the manufacturing process.

The chunks that directly code for proteins are called exons, and the rest of the gene are control elements and recognition sites, much of it still needs to be deciphered.

Still let’s work out how you’d read the exons, especially for one on the - strand:

for i_exon, interval in enumerate(reversed(zip(entry[ 'exonStarts' ], entry[ 'exonEnds' ]))): letters = read_interval(fasta, interval) if entry[ 'strand' ] == '-' : letters = get_reverse_complement(letters) print() print( '>' , i_exon+ 1 , fasta, entry[ 'strand' ], interval, len(letters)) print(letters)

We can compare with the online version: Transcript: HBB-201, and it looks pretty good for the three exons.

Actual protein

Can we do anything with these transcripts - pieces of DNA known to code for bits of proteins.

To brutally summarize one of the most glorious episodes of scientific discovery, biologists discovered a universal biological code where three nucleic acids (a codon) map to a single amino acid, where a string of amino acids make up a protein. The molecular machinery for this process is stunning.

So here's a simple translation of the three-leter codons to proteins:

def translate_dna (sequence) : amino_by_codon = { 'ATA' : 'I' , 'ATC' : 'I' , 'ATT' : 'I' , 'ATG' : 'M' , 'ACA' : 'T' , 'ACC' : 'T' , 'ACG' : 'T' , 'ACT' : 'T' , 'AAC' : 'N' , 'AAT' : 'N' , 'AAA' : 'K' , 'AAG' : 'K' , 'AGC' : 'S' , 'AGT' : 'S' , 'AGA' : 'R' , 'AGG' : 'R' , 'CTA' : 'L' , 'CTC' : 'L' , 'CTG' : 'L' , 'CTT' : 'L' , 'CCA' : 'P' , 'CCC' : 'P' , 'CCG' : 'P' , 'CCT' : 'P' , 'CAC' : 'H' , 'CAT' : 'H' , 'CAA' : 'Q' , 'CAG' : 'Q' , 'CGA' : 'R' , 'CGC' : 'R' , 'CGG' : 'R' , 'CGT' : 'R' , 'GTA' : 'V' , 'GTC' : 'V' , 'GTG' : 'V' , 'GTT' : 'V' , 'GCA' : 'A' , 'GCC' : 'A' , 'GCG' : 'A' , 'GCT' : 'A' , 'GAC' : 'D' , 'GAT' : 'D' , 'GAA' : 'E' , 'GAG' : 'E' , 'GGA' : 'G' , 'GGC' : 'G' , 'GGG' : 'G' , 'GGT' : 'G' , 'TCA' : 'S' , 'TCC' : 'S' , 'TCG' : 'S' , 'TCT' : 'S' , 'TTC' : 'F' , 'TTT' : 'F' , 'TTA' : 'L' , 'TTG' : 'L' , 'TAC' : 'Y' , 'TAT' : 'Y' , 'TAA' : '_' , 'TAG' : '_' , 'TGC' : 'C' , 'TGT' : 'C' , 'TGA' : '_' , 'TGG' : 'W' , } amino_sequence = '' i_start = 0 for i in range(i_start, len(sequence), 3 ): codon = sequence[i:i+ 3 ] if codon in amino_by_codon: amino_sequence += amino_by_codon[codon] return amino_sequence

So we run the three exons above with the translation function

for i_exon, interval in enumerate(reversed(zip(entry[ 'exonStarts' ], entry[ 'exonEnds' ]))): letters = read_interval(fasta, interval) if entry[ 'strand' ] == '-' : letters = get_reverse_complement(letters) print() print( '>' , i_exon+ 1 , fasta, entry[ 'strand' ], interval, len(letters)) print(translate_dna(letters))

It turns out that exon 1 and exon 2 don’t give anything meaningful using this naive approach, but exon 3 gives:

LLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH_ARFLAVQFLLKVPLFPKSNY_TGGYYEGP_ASGFCLIKNIYFHC

which is part of the sequence of Hemoglobin at position 108.

Now obviously lots of complicated things are happening to exon 1 and exon 2, but it’s quite nice that we can get part of the protein sequence using just a short calculation with no further information about cellular machinery.

So hopefully you can start seeing how the circle of life goes around.

Find out more