DNA seen through the eyes of a coder or If you are a hammer, everything looks like a nail

`This is one of the coolest things I've read in a while.' -- jwz

This is just some rambling by a computer programmer about DNA. I'm not a molecular geneticist. If you spot the inevitable mistakes, please mail me (bert hubert, @PowerDNS_Bert) at bert@hubertnet.nl.

I'm not trying to force my view unto the DNA - each observation here is quite 'uncramped'. To see where I got all this from, head to the bibliography.

Quick links: The source code, Position Independent Code, Conditional compilation, Epigenetics, Dead code, bloat, comments ('junk dna'), fork() and fork bombs ('tumors'), Mirroring, failover, Cluttered APIs, dependency hell, Viruses, worms, Central Dogma, Binary patching aka 'Gene therapy', Bug Regression, Reed-Solomon codes: 'Forward Error Correction', Holy Code, Framing errors: start and stop bits, Massive multiprocessing: each cell is a universe, Self hosting & bootstrapping, The Makefile, Plugins, Further reading .

Updates

20th of January 2016:

Added a section on Plugins, or Plasmids.

3rd of January 2008:

A lot of updates are arriving since this page was linked on Reddit.com, I'm currently evaluating and merging the suggested changes. Please do keep sending updates!

23rd of September 2006:

Small update on the number of genes. Some other updates have been sent to me over the past four years, and I'll try to work them in to the page.

The source code

DNA is not like C source but more like byte-compiled code for a virtual machine called 'the nucleus'. It is very doubtful that there is a source to this byte compilation - what you see is all you get.

The language of DNA is digital, but not binary. Where binary encoding has 0 and 1 to work with (2 - hence the 'bi'nary), DNA has 4 positions, T, C, G and A. Whereas a digital byte is mostly 8 binary digits, a DNA 'byte' (called a 'codon') has three digits. Because each digit can have 4 values instead of 2, an DNA codon has 64 possible values, compared to a binary byte which has 256.

A typical example of a DNA codon is 'GCC', which encodes the amino acid Alanine. A larger number of these amino acids combined are called a 'polypeptide' or 'protein', and these are chemically active in making a living being.

See also http://www.ultranet.com/~jkimball/BiologyPages/C/Codons.html.

Position Independent Code

Nearly half of the human genome is composed of transposable elements or jumping DNA. First recognized in the 1940s by Dr. Barbara McClintock in studies of peculiar inheritance patterns found in the colors of Indian corn, jumping DNA refers to the idea that some stretches of DNA are unstable and "transposable," ie., they can move around -- on and between chromosomes. http://www.ornl.gov/hgmis/resource/people.html

Conditional compilation

Of the 20,000 to 30,000 genes now thought to be in the human genome, most cells express only a very small part - which makes sense, a liver cell has little need for the DNA code that makes neurons. But as almost all cells carry around a full copy ('distribution') of the genome, a system is needed to #ifdef out stuff not needed. And that is just how it works. The genetic code is full of #if/#endif statements. This is why 'stem cells' are so hot right now - these cells have the ability to differentiate into everything. The code hasn't been #ifdeffed out yet, so to speak. Stated more exactly, stem cells do not have everything turned on - they are not at once liver cells and neurons. Cells can be likened to state machines, starting out as a stem cell. Over the lifetime of the cell, during which time it may clone ('fork()') many times, it specializes. Each specialization can be regarded as choosing a branch in a tree. Each cell can make (or be induced to make) decisions about its future, which each make it more specialized. These decisions are persistent over cloning using transcription factors and by modifying the way DNA is stored spacially ('steric effects'). A liver cell, although it carries the genes to do so, will generally not be able to function as a skin cell. There are some indications out there that it is possible to 'breed' cells 'upwards' into the hierarchy, making them pluripotent. See also this article.

Epigenetics & imprinting: runtime binary patching

This can be compared to the Linux kernel, which at boot time discovers what CPU it is running on, and actually disables parts of its binary code in case (for example) it is running on a single CPU system. This goes beyond something like if(numcpus > 1), it is the actual nopping out of locking. Crucially, this nopping occurs in memory and not on the disk based image.

Similarly, as an embryo develops in the mother's womb, its DNA is edited substantially to reduce its growth rate, and the size of the placenta. In such a way, the competing interests of the father ('large strong children') and the mother ('survive pregnancy') are balanced. Such 'imprinting' can only happen within the mother, since the father's genome doesn't know anything about the size of the mother.

Recently, it is also becoming clear that the metabolic status of the parents influences the chances of long life, cancer and diabetes in their grandchildren. This also makes sense, as surviving in a food poor climate may require a different metabolic strategy than in one where food is abundantly available.

Mechanisms behind epigenetics and imprinting are 'methylation', which attaches methyl groups to DNA to 'flip' their activation status, but also histone modification, which can curl up DNA so it is not activated.

Some of these DNA edits are heritable and passed on to children, other forms may only impact one animal.

This field is still developing rapidly, and it may be that our DNA is much more dynamic than originally thought.

Dead code, bloat, comments ('junk dna')

The genome is littered with old copies of genes and experiments that went wrong somewhere in the recent past - say, the last half a million years. This code is there but inactive. These are called the 'pseudo genes'. Furthermore, 97% of your DNA is commented out. DNA is linear and read from start to end. The parts that should not be decoded are marked very clearly, much like C comments. The 3% that is used directly form the so called 'exons'. The comments, that come 'inbetween' are called 'introns'. These comments are fascinating in their own right. Like C comments they have a start marker, like /*, and a stop marker, like */. But they have some more structure. Remember that DNA is like a tape - the comments need to be snipped out physically! The start of a comment is almost always indicated by the letters 'GT', which thus corresponds to /*, the end is signalled by 'AG', which is then like */. However because of the snipping, some glue is needed to connect the code before the comment to the code after, which makes the comments more like html comments, which are longer: '<!--' signifies the start, '-->' the end.

So an actual stretch of DNA with exons and introns might look like this:

ACTUAL CODE<!-- blah blah blah blah ---- blah -->ACTUAL CODE | | | | | | exon 1 donor intron 1 branch acceptor exon 2 (start of comment) (end of comment)

The actual cutting of the comments happens after the DNA has been transcribed into RNA and is performed by looping the comment and bringing the pieces of actual code close together. Then the RNA is cut at the 'branch site' near the end of the comment, after which the 'donor' (comment start) and 'acceptor' (comment end) are connected to each other.

Now, what are these comments good for? That discussion is part of a holy war that can rival the vi/emacs one. When comparing different species, we know that some introns show fewer code changes than the neighboring exons. This suggests that the comments are doing something important.

There are lots of possible explanations for the massive amount of non-coding DNA - one of the most appealing (to a coder) has to do with 'folding propensity'. DNA needs to be stored in a highly coiled form, but not all DNA codes lend themselves well to this.

This may remind you of RLL or MFM coding. On a hard disk, a bit is encoded by a polarity transition or the lack thereof. A naive encoding would encode a 0 as 'no transition' and 1 as 'a transition'.

Encoding 000000 is easy - just keep the magnetic phase unchanged for a few micrometers. However, when decoding, uncertainty creaps in - how many micrometers did we read? Does this correspond to 6 zeroes or 5? To prevent this problem, data is treated such that these long stretches of no transitions do not occur.

If we see 'no transition,no transition,transition,transition' on disk, we can be sure that this corresponds to '0011' - it is exceedingly unlikely that our reading process is so imprecise that this might correspond to '00011' or '00111'. So we need to insert spacers so as to prevent too little transitions. This is called 'Run Length Limiting' on magnetic media.

The thing to note is that sometimes, transitions need to be inserted to make sure that the data can be stored reliably. Introns may do much the same thing by making sure that the resulting code can be coiled properly.

However, this area of molecular biology is a minefield! Huge diatribes rage about variants with exciting names like 'introns early' or 'introns late', and massive words like 'folding propensity' and 'stem-loop potential'. I think it best to let this discussion rage on a bit.

A fascinating link of uncertain scientific value is http://post.queensu.ca/~forsdyke/introns.htm .

2013 Update: ten years on, the debate still hasn't settled! It is very clear that 'junk dna' is a misnommer, but as to its immediate function, there is no consensus. Check out Fighting about ENCODE and junk for a discussion of where we stand.

fork() and fork bombs ('tumors')

As with unix, great problems arise when cells keep on forking. They quickly exhaust resources, sometimes leading to death. This is called a tumor. The cell is riddled with 'ulimits' and 'watchdogs' to prevent this sort of thing from happening. The number of divisions is limited by Telomere shortening, for example.

A cell cannot clone unless very stringent conditions are met - a 'secure by default' configuration. It is only when these safeguards fail that tumors can grow. Like with computer security, it is hard to strike a balance between security ('no cells can divide') and usability.

Compare this to the well known Halting Problem, first described by the founder of Computer Science, Alan Turing. Perhaps it is as impossible to predict if a program will ever finish as it is to create a functional genome that cannot get cancer?

Mirroring, failover Each DNA Helix is redundant in itself - you can see the genome as a twisted ladder whereby each spoke contains two bases - hence the word 'basepair'. If one of these bases is missing, it can be derived from the one on the other side. T always binds to A, C always to G. So, we can state that the genome is mirrored within the helix. 'RAID-1' so to speak. Furthermore, there are two copies of each chromosome present - one from each parent, with the notable exception of the Y chromosome, which is only present in males. The actual details are complicated - but most genes are thus present twice. In case one is broken or unusefully mutated, the other independent copy is still there. This is what we would normally call 'failover'. Cluttered APIs, dependency hell As proteins interact in the cell, they rely on each others' characteristics. It has just been shown that proteins that interact with a lot of other proteins cannot evolve, or at least, only do so at a very slow rate. See Nature, 28 June 2001, and M. Kimura, T. Ohta, Science, 26 April 2002. They propose that this is because of great internal dependencies which inhibit the changing of the 'contract' of the protein. It is also noted that evolution does take place, but very slowly as both parts of the dependency need to evolve in a compatible way at the same time. Viruses, worms Somebody recently proposed in a discussion that it would be really cool to hack the genome and compromise it so as to insert code that would copy itself to other genomes, using the host-body as its vehicle. 'Just like the nimda worm!'

He shortly thereafter realised that this is exactly what biological viruses have been doing for millions of years. And they are exceedingly good at it.

A lot of these viruses have become a fixed part of our genome and hitch a ride with all of us. To do so, they have to hide from the virus scanner which tries to detect foreign code and prevent it from getting into the DNA.

The Central Dogma: .c -> .o -> a.out/.exe

This dogma tells us that DNA is used to make RNA and that RNA is used to make proteins, which is like saying that from a .c file comes a .o object file, which can be compiled into an executable (a.out/exe). It also tells us that this is the only order in which information flows.

Now, the Central Dogma has recenly been tarnished somewhat. Like any billion year old coding project, a lot of hacking has been going on, and sometimes information flows the other way. Sometimes RNA patches the DNA and at other times, the DNA is modified by proteines created earlier.

But generally, the dependencies are clear, so the Central Dogma remains important.

Binary patching aka 'Gene therapy'

We can fiddle easily enough with DNA. There are companies to which you can send an ASCII file with DNA characters, and they will synthesise the corresponding 'output' for you. We can also splice DNA into developing animals and plants. It is far harder to 'patch the running executable', as any programmer can attest. It is just like that with the genome. To change a running copy ('a human'), you need to edit each and every relevant copy of the gene you want to patch. For many years, medical science has tried to patch people with SCID, or 'Severe Combined Immunodefeciency', which is a very nasty disease which in effect disables the immune system - leading to very ill patients. It has been clear for quite a while now which letters in the DNA need to be fixed in order to cure these people. Many attempts where made to patch running people, using viruses that insert new DNA into living organisms, but this proved to be very hard. The genome is guarded far too well for such a simple approach to work - cells guard their code better than Microsoft! However, recently the right virus was found which was able to breach the protection of the genome and fix the broken characters, leading to apparently healthy people.

Bug Regression

In tropical regions of the world where the parasite-borne disease malaria is prevalent, people with a single copy of a particular genetic mutation have a survival advantage.

...

While inheriting one copy of the mutation confers a benefit, inheriting two copies is a tragedy. Children born with two copies of the genetic mutation have sickle cell anemia, a painful disease that affects the red blood cells. http://www.fda.gov/fdac/features/496_sick.html

There are quite a few examples of this happening. See also the wonderful book 'Genome' by Matt Ridley.

Reed-Solomon codes: 'Forward Error Correction'

6 bits could conceivably map to 64 amino acids, yet there are only 20 in use. For example, UCU, UCC, UCA and UCG all encode for 'Serine', whereas only UGG maps to 'Tryptophan'.

Now, it turns out that some likely 'typos' (UCU -> UCC) in the encoding lead to an identical amino acid being expressed. For more about this fascinating phenomenon, read 'Metamagical Themas' by Douglas Hofstadter.

DNA knows the concept of the 'molecular clock'. Some parts of the genome are actively changing and some parts are sacrosanct. A good example of the latter are the Histone genes H3 and H4.

These genes are fundamental to the actual storage of the genome and are thus of paramount importance. Any failure in this code rapidly leads to a non-functioning organism.

So it is to be expected that this code isn't tinkered with and that turns out the case. The H3 an H4 genes have a *zero* effective mutation rate in humans. But it goes far beyond that. You share almost the exact same code with anything from chickens to grass or moulds.

RATES OF NUCLEOTIDE SUBSTITUTION PER SITE PER 1000 MILLION YEARS BETWEEN VARIOUS HUMAN AND RODENT PROTEINS-CODING GENES WITH DIVERGENCE SET AT 80 MILLION YEARS BASED ON FOSSIL EVIDENCE: gene Number of codons Effective rate histone 3 135 0.00 histone 4 101 0.00 insulin 51 0.13 gamma interferon 136 2.79 http://www.staffs.ac.uk/schools/sciences/biology/Handbooks/evolseqphylo.htm

Now, it does appear that there are two ways the genome can make sure that code does not mutate. The first way is described above: use amino acids that are highly degenerate and making sure that those typos that DO occur result in the same output.

Furthermore, genes can be copied earlier or later in the cell's reproductive process, leading to more or less favourable copying conditions. Many more of such conditions apply.

It appears as if H3 and H4 were authored very carefully as they do have a lot of 'synonymous changes', which through the clever techniques described above do not lead to changes in the output.

Framing errors: start and stop bits

...0 0000 0001 0000 0010 0000 0011 0...

...00 0000 0010 000 00100 000 00110 ...

Each way a strand of DNA can be read is called an Open Reading Frame and there are generally 6, 3 each way.

Massive multiprocessing: each cell is a universe

If a cell needs to do something ('call a function'), it whips up the right piece of the genome and transcribes it into RNA. The RNA is then translated into a sequence of amino acids, which together make up a protein the DNA coded for. Now for the really cool bit :-) This protein is tagged with a shipping address. This is a marker consisting of several amino acids which tell the rest of the cell where this protein needs to go. There is machinery which acts on these instructions, and delivers the protein, which is potentially on the outside of the cell. The delivery instruction is then stripped off and several post processing steps may be performed, possibly activating the protein - which is good, because you may not want to transport an active protein through places where it should not do work.

A Cell

Self hosting & bootstrapping

In actual fact, this was solved by not writing the first C compiler in C (duh), but in a language that was available already: B. See here for details about 'bootstrapping'.

The same holds for the genome. To create a new 'binary' of a specimen, a *living* copy is required. The genome needs an elaborate toolchain in order to deliver a living thing. The code itself is impotent. This toolchain is commonly called 'your parents'.

Update: Recently, it has become possible to 'bootstrap' life with very little actually living source material. The dictum "every cell comes from a cell" is becoming less true. See for example Mycoplasma laboratorium.

It appears that RNA, which is an intermediate code between DNA and a protein, may have been the 'B' for DNA. Which begs the question where RNA came from. It is very interesting to note that extra-terrestial objects often contain amino acids! See http://www.google.com/search?hl=en&q=amino+acids+meteorites

The Makefile

Organisms typically start out as a single cell, which as said before contains two entire copies of the genome. The big tarfile so to speak, with all files extracted, ready to go. Now what? Enter the Homeobox genes. Cells must be copied and assigned a purpose. The Homeobox genes start out by laying a 'top to bottom' dependency which reads 'start with the head'. In order to make this happen, a chemical gradient is created by which cells can sense where they are, and decide if they need to do things useful for building a head, or for building a primordial notochord. Only discovered in 1983, the Homeobox genes are a very exciting area of research right now. It is interesting to note that like a Makefile, 'HOX' genes only trigger things in other genes and don't materially build things themselves. The homeobox 'syntax' appears to be very 'holy' in the sense described above. What happens if you copy paste the 'legs selector' part of a mouse HOX gene into the fruitfly Homeobox: 'In fact, when the mouse Hox-B6 gene is inserted in Drosophila, it can substitute for Antennapedia and produce legs in place of antennae' http://www.ultranet.com/~jkimball/BiologyPages/H/HomeoboxGenes.html The fruitfly and human genomes did not branch just millions of years ago but hundreds of millions of years ago. And you can copy paste parts ('Selectors' in the genetic language) of the Makefile and it still clicks. Please note that the 'build a leg' routine in a fruitfly is of course radically different from that in a mouse, but the 'selector' correctly triggers the right instructions.

Plugins: Plasmids

All living organisms have DNA, sometimes organized into multiple chromosomes ('libraries'), sometimes in only one, typically circular in that case. This goes for most bacteria. Next to this large main genome, such bacteria frequently also host 'plasmids': tiny circles of DNA with specific functions.

Such plasmids are somewhat portable between species, and through a variety of mechanisms they do indeed get transferred horizontally. By this way even non-identical bacteria can 'learn' antibiotic resistance from each other, for example.

Compared to the programming world, a plasmid is not voluntary, and is like LD_PRELOADing a .so or the equivalent on other platforms. And in fact, plasmids are frequently injected for research purposes. They can easily be injected in all kinds of bacteria, and immediately get to work.

Plasmids copy themselves independently from the main chromosome, and are thus a permanent fixture of bacteria. To make this happen, a plasmid features the magnificently named 'Origin of Replication' gene which gets triggered when the cell wants to divide.

Further reading