Interestingly, the same "remarkable lack of correspondence" can be noted when discussing the relationship between the number of protein-coding genes and organism complexity. Scientists estimate that the human genome, for example, has about 20,000 to 25,000 protein-coding genes. Before completion of the draft sequence of the Human Genome Project in 2001, scientists made bets as to how many genes were in the human genome. Most predictions were between about 30,000 and 100,000. Nobody expected a figure as low as 20,000, especially when compared to the number of protein-coding genes in an organism like Trichomonas vaginalis. T. vaginalis is a single-celled parasitic organism responsible for an estimated 180 million urogenital tract infections in humans every year. This tiny organism features the largest number of protein-coding genes of any eukaryotic genome sequenced to date: approximately 60,000.

In fact, compared to almost any other organism, humans' 25,000 protein-coding genes do not seem like many. The fruit fly Drosophila melanogaster, for example, has an estimated 13,000 protein-coding genes. Or consider the mustard plant Arabidopsis thaliana, the "fruit fly" of the plant world, which scientists use as a model organism for studying plant genetics. A. thaliana has just about the same number of protein-coding genes as humans—actually, it has slightly more, coming in at about 25,500. Moreover, A. thaliana has one of the smallest genomes in the plant world! It would seem obvious that humans would have more protein-coding genes than plants, but that is not the case. These observations suggest that there is more to the genome than protein-coding genes alone.

As shown in Table 1 (adapted from Van Straalen & Roelofs, 2006), there is no clear correspondence between genome size and number of protein-coding genes, which is another indication that the number of genes in a eukaryotic genome reveals little about organismal complexity. For most of the species listed, the number of protein-coding genes levels off at roughly 20,000 to 30,000, even as genome size grows into the billions of base pairs.

Table 1: Genome Size and Number of Protein-Coding Genes for a Select Handful of Species

| Species and Common Name | Estimated Total Size of Genome (bp)* | Estimated Number of Protein-Encoding Genes* |
|---|---|---|
| Saccharomyces cerevisiae (unicellular budding yeast) | 12 million | 6,000 |
| Trichomonas vaginalis | 160 million | 60,000 |
| Plasmodium falciparum (unicellular malaria parasite) | 23 million | 5,000 |
| Caenorhabditis elegans (nematode) | 95.5 million | 18,000 |
| Drosophila melanogaster (fruit fly) | 170 million | 14,000 |
| Arabidopsis thaliana (mustard; thale cress) | 125 million | 25,000 |
| Oryza sativa (rice) | 470 million | 51,000 |
| Gallus gallus (chicken) | 1 billion | 20,000-23,000 |
| Canis familiaris (domestic dog) | 2.4 billion | 19,000 |
| Mus musculus (laboratory mouse) | 2.5 billion | 30,000 |
| Homo sapiens (human) | 2.9 billion | 20,000-25,000 |

* There may be other estimates in the literature, but most estimates approximate those listed here.
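
One way to see the lack of correspondence is to compute gene density, that is, the number of protein-coding genes per million base pairs, for each row of Table 1. The short Python sketch below does this. The figures are copied from the table; where the table gives a range, the midpoint is used, which is an assumption made only for illustration. The function names are hypothetical.

```python
# Rows copied from Table 1: (species, genome size in bp, protein-coding genes).
# Where Table 1 gives a range, the midpoint is used (an illustrative assumption).
TABLE_1 = [
    ("Saccharomyces cerevisiae", 12e6, 6_000),
    ("Trichomonas vaginalis", 160e6, 60_000),
    ("Plasmodium falciparum", 23e6, 5_000),
    ("Caenorhabditis elegans", 95.5e6, 18_000),
    ("Drosophila melanogaster", 170e6, 14_000),
    ("Arabidopsis thaliana", 125e6, 25_000),
    ("Oryza sativa", 470e6, 51_000),
    ("Gallus gallus", 1e9, 21_500),      # midpoint of 20,000-23,000
    ("Canis familiaris", 2.4e9, 19_000),
    ("Mus musculus", 2.5e9, 30_000),
    ("Homo sapiens", 2.9e9, 22_500),     # midpoint of 20,000-25,000
]


def genes_per_megabase(genome_bp, gene_count):
    """Return the number of protein-coding genes per million base pairs."""
    return gene_count / (genome_bp / 1e6)


def density_table(rows):
    """Return (species, genes per Mb) pairs, densest genome first."""
    densities = [(name, genes_per_megabase(bp, genes)) for name, bp, genes in rows]
    return sorted(densities, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, density in density_table(TABLE_1):
        print(f"{name:28s}{density:8.1f} genes per Mb")
```

With these inputs, budding yeast comes out at about 500 genes per million base pairs, while the human genome comes out at fewer than 10. The largest genomes in the table are therefore the most sparsely populated with protein-coding genes.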

Although most attention has been paid to protein-coding genes, scientists have continued to refine their definition of what exactly a gene is, partly in response to the realization that DNA encodes more than just proteins. For instance, in a study of the mouse genome, scientists found that more than 60% of this 2.5 billion bp genome is transcribed, but less than 2% is actually translated into functional protein products (FANTOM Consortium et al., 2005). Much of the genome's transcription is dedicated to making tRNA, rRNA, and many RNAs involved in splicing and gene regulation. Unless otherwise stated, however, the discussion in this article focuses on protein-coding genes.
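
To put the mouse percentages into base pairs, the following small Python calculation uses only the figures quoted above. Because the text says "more than 60%" and "less than 2%", the results are a lower bound and an upper bound, not exact values. The variable names are hypothetical.

```python
# Figures quoted in the text for the mouse genome.
GENOME_BP = 2_500_000_000          # 2.5 billion bp

# "More than 60%" is transcribed, so this is a lower bound.
min_transcribed_bp = GENOME_BP * 60 // 100

# "Less than 2%" is translated, so this is an upper bound.
max_translated_bp = GENOME_BP * 2 // 100

# Transcribed sequence therefore exceeds translated sequence by at least this factor.
min_ratio = min_transcribed_bp // max_translated_bp

if __name__ == "__main__":
    print(f"Transcribed: more than {min_transcribed_bp:,} bp")
    print(f"Translated:  less than {max_translated_bp:,} bp")
    print(f"Ratio:       at least {min_ratio}x")
```

In other words, more than 1.5 billion bp are transcribed, but fewer than 50 million bp end up represented in protein, a difference of at least thirtyfold.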

While scientists have been measuring genome size for decades, they have only recently had the technological capacity and know-how to count genes. To estimate the number of protein-coding genes in a genome, scientists often start by using what are known as gene-prediction programs: computational programs that align the sequence of interest with one or more known genome sequences. Other computer programs can predict gene location by looking for sequence characteristics of genes, such as open reading frames within exons and CpG islands within promoter regions.
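
The two sequence characteristics mentioned above can be illustrated with a deliberately simplified Python sketch. It is not how any particular gene-prediction program works. The function names, the minimum-length parameter, and the choice to scan only the forward strand are assumptions made for illustration. The first function looks for open reading frames (a start codon followed in frame by a stop codon). The second computes the observed-to-expected CpG ratio, a statistic commonly used when screening for CpG islands. Real eukaryotic gene finders must also handle introns, both strands, and much else.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Find simple open reading frames on the forward strand.

    Returns a list of (start, end, frame) tuples, where sequence[start:end]
    begins with ATG and ends with the first in-frame stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i            # open a candidate reading frame
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None             # close it and keep scanning
    return orfs


def cpg_observed_expected(sequence):
    """Return the observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"          # made-up sequence, not real data
    print(find_orfs(toy))
    print(cpg_observed_expected("CGCG"))
```

A hit from either function is only a candidate. A long open reading frame or a CpG-rich stretch can occur by chance, which is one reason such predictions need independent confirmation.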

However, all of these computer programs only predict the presence of genes. Each prediction must then be experimentally validated, for example by using microarray hybridization to confirm that the predicted genes are represented in RNA (Yandell et al., 2005). As Michael Brent, a professor of computer engineering at Washington University, explained in Nature Biotechnology, gene prediction has become much more accurate over the past several years (Brent, 2007). This improved precision explains why estimates of the number of genes in the human genome have decreased from 45,000 about 10 years ago, to Venter et al.'s estimate of 26,588 upon completion of the Human Genome Project (Venter et al., 2001), to the current estimate of between 20,000 and 21,000. In short, the older computational methods generated many false positives, meaning that they predicted the presence of protein-coding genes that weren't actually there.