This week, the ENCODE project released the results of its latest attempt to catalog all the activities associated with the human genome. Although we've had the sequence of bases that comprise the genome for over a decade, there were still many questions about what a lot of those bases do when inside a cell. ENCODE is a large consortium of labs dedicated to helping sort that out by identifying everything they can about the genome: what proteins stick to it and where, which pieces interact, what bases pick up chemical modifications, and so on. What the studies can't generally do, however, is figure out the biological consequences of these activities, which will require additional work.

Yet the third sentence of the lead ENCODE paper contains an eye-catching figure that ended up being reported widely: "These data enabled us to assign biochemical functions for 80 percent of the genome." Unfortunately, the significance of that statement hinged on a much less widely reported item: the definition of "biochemical function" used by the authors.

This was more than a matter of semantics. Many of the resulting press reports painted an entirely fictitious history of biology, along with a misleading picture of its present. As a result, the public that relied on those press reports now has a completely mistaken view of our current state of knowledge (this happens to be the exact opposite of what journalism is intended to accomplish). But you can't entirely blame the press in this case. They were egged on by the journals and university press offices that promoted the work and, in some cases, by the scientists themselves.

To understand why, we'll need a bit of biology and a bit of history before we can turn back to the latest results and the public response to them.

What we know about DNA, and when we knew it

DNA has at least two key functions. First, it codes for the proteins that perform most of a cell's functions. Second, it has control sequences that don't encode anything, but determine when and where the coding sequences are active. Since the 1960s, when the Lac operon was described and won its discoverers the Nobel Prize, we've had some indication that non-coding DNA plays key regulatory roles.

The Lac operon is present in bacterial genomes, which are under extreme pressure to carry as little DNA as possible. The typical bacterial genome is over 85 percent protein-coding DNA, leaving just a small fraction for regulatory purposes. But that isn't generally true of vertebrates.

The coding portions of vertebrate genes turned out to be interrupted by noncoding regions, called introns. Some of these are huge, roughly a third the size of some of the smaller bacterial genomes. Vertebrate genomes also appeared to be littered with old and disabled viruses and mobile genetic parasites called transposons. Even some of the coding portions seemed a bit useless: near-exact duplicates of genes were common, as were mutated and disabled copies. Many of these apparently useless pieces of DNA continued to carry sites for regulatory DNA-binding proteins and continued to make RNA.

(To give you an idea of how mainstream all this was, I spent some time working on a mouse gene that was thought to be superfluous because it was a near-exact copy of a gene used by the immune system. But the copy was only expressed in males because a mobile genetic element's regulatory sequences had been inserted nearby. And I knew all this as an undergrad in the late 1980s.)

By the time we sequenced the human genome, we discovered that this seemingly useless stuff was the majority. Over half the genome was built from the remains of viruses and transposons. Introns accounted for another large fraction. And all of it seemed to be an evolutionary accident. One fish, the fugu, lacks a lot of this DNA, and seems to get along fine, while many salamanders have ten times the DNA per cell that humans do. And if you looked at the DNA of different mammals, the vast majority of it (about 95 percent) wasn't shared by different species.

These findings seemed to support a model that was first proposed back in the 1970s, which picked up the (possibly unfortunate) moniker junk DNA. Genomic accidents, such as duplicating a gene or picking up a virus, happen at a steady rate. Individually, these don't impose an appreciable cost in terms of fitness, so species aren't under strong selective pressure to get rid of them, and the pieces can linger in the genome for millions of years. But the typical bit of junk doesn't do anything positive for the animals that carry it.

You could even consider the idea of junk DNA to be a scientific hypothesis. It notes that animal genomes experience several processes that produce superfluous bits of DNA, predicts that these will not cause enough harm to be selected for elimination, and proposes an outcome: genomes littered with random bits of history that have no impact on an organism's fitness.

Junk dies a thousand deaths

For decades we've known a few things: some pieces of non-coding DNA were critically important, since they controlled when and where the coding pieces were used; but there was a lot of other non-coding DNA and a good hypothesis, junk DNA, to explain why it was there.

Unfortunately, things like well-established facts make for a lousy story. So instead, the press has often turned to myths, aided and abetted by the university press offices and scientists who should have been helping to make sure they produced an accurate story.

The discovery of new regulatory DNA isn't usually surprising, given that we've known it's out there for decades. Yet there has been a steady stream of press releases that act as if finding a function for non-coding DNA is a complete surprise. And many of these are accompanied by quotes from scientists that support this false narrative.

The same thing goes for junk DNA. We've known for decades that some individual pieces of junk DNA do something useful. Introns can regulate gene expression. Bits of former virus or transposon have been found incorporated into genes or used to regulate their expression. So some junk DNA can be useful, in much the same way that a junkyard can be a valuable source of spare parts.

But it's important to keep these findings in perspective. Even if a function is assigned to a piece of junk that's 1,000 base pairs long, that only accounts for about 1/2,250,000 of the total junk that is estimated to reside in the human genome. Put another way, we shouldn't fall into the logical fallacy that finding a use for one piece of junk must mean that all of it is useful.
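
To make that arithmetic concrete, here is a minimal sketch in Python. It assumes roughly 2.25 billion base pairs of junk, a total back-calculated from the fraction above rather than taken from any measurement; the names `junk_share`, `PIECE_BP`, and `JUNK_TOTAL_BP` are made up for illustration.

```python
from fractions import Fraction

# Size of the hypothetical piece of junk that has been assigned a function.
PIECE_BP = 1_000

# ASSUMPTION: total junk in the human genome, in base pairs. This value is
# back-calculated from the 1/2,250,000 figure in the text, not a measured number.
JUNK_TOTAL_BP = 2_250_000_000


def junk_share(piece_bp: int, junk_total_bp: int) -> Fraction:
    """Return the exact fraction of all junk that one piece represents."""
    return Fraction(piece_bp, junk_total_bp)


share = junk_share(PIECE_BP, JUNK_TOTAL_BP)

print(share)                        # the share as an exact fraction
print(f"{float(share):.2e}")        # the same share as a decimal
print(JUNK_TOTAL_BP // PIECE_BP)    # pieces of this size needed to cover all the junk
```

Under that assumption, it would take 2,250,000 separate discoveries of this size, with no overlap, before every base of junk had a function assigned to it.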

Despite that, many new findings in this area are accompanied by some variation on the declaration that junk is dead. Both press officers and scientists have presented a single useful piece of virus as definitively establishing that every virus, transposon, and dead gene in the human genome is essential for our collective health and survival.