Estimates of the number of genes in the human genome have ranged wildly over the past two decades, from 20,000 all the way up to 150,000. By the time the working draft of the human genome was published in 2001, the best approximation stood at 35,000, yet even that number has fallen. A new analysis, one that harnesses the power of comparing genome sequences of various organisms, now reveals that the true number of human genes is about 20,500, thousands fewer than are currently listed in human gene catalogs.

The work, led by researchers at the Broad Institute of MIT and Harvard, has implications beyond merely settling the debate over how many genes are in the human genome. An accurate gene count can help identify the locations of genes and their functions, an important step in translating genomic information into biomedical advances.

Ironically, the way genes are recognized has triggered much of the confusion over the human gene count. Scientists on the hunt for typical genes — that is, the ones that encode proteins — have traditionally set their sights on so-called open reading frames, which are long stretches of 300 or more nucleotides, or “letters” of DNA, bookended by genetic start and stop signals. This method produced the most recent gene count of roughly 25,000, but the number came under scrutiny after the 2002 publication of the mouse genome revealed that many human genes lacked mouse counterparts and vice versa.
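The open-reading-frame search described above can be sketched in a few lines. This is only an illustration, not the gene-finding software behind any catalog: the function name `find_orfs`, the choice of ATG as the start signal, the three standard stop codons, and the restriction to the forward strand are all assumptions made for the example.

```python
# Illustrative sketch only: real gene finders also handle the reverse
# strand, introns, and statistical scoring, none of which is modeled here.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop signals (assumed)
START_CODON = "ATG"                  # standard start signal (assumed)


def find_orfs(seq, min_len=300):
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here runs from the first ATG after the previous stop codon
    through the next in-frame stop codon. `end` is exclusive and the
    reported span includes the stop codon. Spans shorter than
    `min_len` nucleotides are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the open ATG in this frame, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None  # close the frame and look for the next ATG
    return orfs


# A made-up 306-letter sequence: start, 100 filler codons, stop.
example = "ATG" + "GCT" * 100 + "TAA"
orfs = find_orfs(example)
```

Nothing in this scan checks whether a stretch actually encodes a protein; it only tests length and the bracketing signals.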

Such a discrepancy seemed suspicious in part because evolution tends to preserve gene sequences — genes, by virtue of the proteins they encode, usually serve crucial biological roles. But like it or not, the 25,000 DNA sequences were already listed in the catalogs of human protein-coding genes, and skeptics had no systematic way to remove them. “At that point, no one had gone through the gene catalogs with a fine-toothed comb to find evidence that they weren’t valid,” said Michele Clamp, first author of the study and senior computational biologist at the Broad Institute.

Far from blatant mistakes, non-gene sequences can masquerade as true genes if they are long enough and happen by chance to fall between start and stop signals. Despite having gene-like characteristics, these open reading frames may not encode proteins. Instead, they might have other functions or possibly none at all.
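A rough calculation shows why chance alone can produce gene-like stretches. The sketch below assumes, purely for illustration, that every DNA letter is drawn uniformly at random, so that 3 of the 64 possible codons are stop signals. Real genomes are not uniform, and `prob_no_stop` is a hypothetical helper, not part of the study.

```python
# Simplifying assumption: each of the 64 codons is equally likely,
# and 3 of them (TAA, TAG, TGA) are stop signals.
STOP_CODONS = 3
TOTAL_CODONS = 64


def prob_no_stop(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return ((TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS) ** n_codons


# 300 nucleotides correspond to 100 codons.
p_100 = prob_no_stop(100)
```

Under these assumptions the probability for a 100-codon stretch comes out a little under 1 percent. That is small for any single position, but not negligible when multiplied across a genome billions of letters long.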

To distinguish such misidentified genes from true ones, the research team, led by Clamp and Broad Institute director Eric Lander, developed a method that takes advantage of another hallmark of protein-coding genes: conservation by evolution. The researchers considered genes to be valid if and only if similar sequences could be found in other mammals, namely mouse and dog. Applied to nearly 22,000 genes in the Ensembl gene catalog, the technique revealed 1,177 “orphan” DNA sequences. These orphans looked like protein-coding genes because of their open reading frames, but were not found in either the mouse or dog genomes.
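The conservation test amounts to a simple filter over the catalog. The sketch below assumes the sequence comparison has already been done and summarized as a mapping from each gene ID to the set of species in which a similar sequence was found. The mapping, the gene IDs, and the function name `find_orphans` are hypothetical, and a hit in either reference species is assumed to be enough to count a gene as conserved.

```python
def find_orphans(homolog_hits, reference_species=("mouse", "dog")):
    """Return the gene IDs with no similar sequence in any reference species.

    homolog_hits maps a gene ID to the set of species where a similar
    sequence was found (hypothetical, precomputed input).
    """
    return sorted(
        gene
        for gene, species in homolog_hits.items()
        if not any(ref in species for ref in reference_species)
    )


# Made-up catalog entries for illustration only.
catalog = {
    "GENE_A": {"mouse", "dog"},  # found in both: conserved
    "GENE_B": {"dog"},           # found in one: still counted as conserved
    "GENE_C": set(),             # found in neither: orphan
}
orphans = find_orphans(catalog)
```

The hard part in practice is deciding what counts as a "similar sequence"; this sketch deliberately leaves that comparison out.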

Although this was strong evidence that the sequences were not true protein-coding genes, it was not quite convincing enough to justify their removal from the human gene catalogs. Two other scenarios could, in fact, explain their absence from other mammalian genomes. For instance, the genes could be unique to primates, new inventions that appeared after the divergence of mouse and dog ancestors from primate ancestors. Alternatively, the genes could have been more ancient creations, present in a common mammalian ancestor, that were lost in the mouse and dog lineages yet retained in humans.

If either of these possibilities were true, then the orphan genes should appear in other primate genomes, in addition to our own. To explore this, the researchers compared the orphan sequences to the DNA of two primate cousins, chimpanzees and macaques. After careful genomic comparisons, the orphan genes were found to be true to their name: they were absent from both primate genomes. This evidence strengthened the case for stripping these orphans of the title “gene.”
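The primate check can be pictured as a second pass over the orphan list. In the sketch below, an orphan with a similar sequence in chimpanzee or macaque would remain a candidate gene under either scenario (a primate invention, or an ancient gene lost in mouse and dog), while an orphan with no primate match has no evolutionary support. The input mapping and the function name `split_by_primate_support` are hypothetical, and treating a single primate hit as sufficient is an assumption.

```python
def split_by_primate_support(orphans, primate_hits,
                             primates=("chimpanzee", "macaque")):
    """Partition orphan IDs by whether any primate genome has a similar sequence.

    primate_hits maps an orphan ID to the set of primate species with a
    match (hypothetical, precomputed input). Orphans missing from the
    mapping are treated as having no matches.
    """
    supported, unsupported = [], []
    for gene in orphans:
        hits = primate_hits.get(gene, set())
        if any(p in hits for p in primates):
            supported.append(gene)    # still a candidate gene
        else:
            unsupported.append(gene)  # no evidence from any primate
    return supported, unsupported


# Made-up example: one orphan has a chimpanzee match, one has none.
supported, unsupported = split_by_primate_support(
    ["ORF_A", "ORF_B"],
    {"ORF_A": {"chimpanzee"}},
)
```

In the study as reported, the orphans fell on the unsupported side of this split.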

After extending the analysis to two more gene catalogs and accounting for other misclassified genes, the team’s work invalidated a total of nearly 5,000 DNA sequences that had been incorrectly added to the lists of protein-coding genes, reducing the current estimate to roughly 20,500.

In addition to suggesting a major revision to the human gene count, this work provides a set of rules for evaluating any future proposed additions to the human gene catalog. It also underscores the benefit of genome sequencing projects. “Without several primate genomes, we wouldn’t have been able to put the final nail in the coffin of these putative genes,” said Clamp.

More broadly, the research reveals that little invention of genes has occurred since mammalian ancestors diverged from the non-mammalian lineage. “There’s no real creativity going on in the mammalian genome,” explained Clamp. That means that the number, structure, and function of protein-coding genes are not expected to differ very much from mammal to mammal, so what makes humans different from mice and dogs likely lies outside this realm of the genome. Clamp and her Broad Institute colleagues are now peering into the genomes of many other mammals, in an attempt to explain what parts of our genome truly make us human.

Journal reference: Clamp M et al. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl. Acad. Sci. USA. DOI: 10.1073/pnas.0709013104