We've tended to measure our success at sequencing genomes by our ability to read the billions of bases in the human genome. But that progress has made completing the genomes of bacteria, which are typically a thousand times smaller, relatively trivial. For these organisms, we actually have the luxury of being able to do a thorough survey of genomes.

So far, however, the emphasis has been on sequencing the ones we know well: the lab strains, those associated with major diseases, etc. A new paper takes an approach that's less driven by self-interest. Its authors surveyed hundreds of strains of bacteria and archaea that we know how to culture, and picked 200 of them that are broadly dispersed across the tree of life, based on the sequence of a ribosomal RNA gene. They're now in the process of completing the genomes of all of them, and the paper serves as an interim report.
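To make the selection idea concrete, here is a minimal sketch of how one might pick a broadly dispersed subset from a marker gene. It is not the authors' procedure, which this article doesn't detail. It substitutes a simple greedy "farthest-point" heuristic: start with one strain, then repeatedly add the strain whose closest already-chosen relative is most distant. The function names, the toy strain names, and the ten-base "rRNA" sequences are all hypothetical, and the distance is a plain fraction of mismatched positions in pre-aligned sequences rather than a real phylogenetic distance.

```python
from typing import Dict, List


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pick_dispersed(seqs: Dict[str, str], k: int) -> List[str]:
    """Greedily choose k strains that are spread out by marker-gene distance.

    Illustrative heuristic only (an assumption, not the paper's method).
    """
    names = sorted(seqs)  # sorted so the result is deterministic
    if not 0 < k <= len(names):
        raise ValueError("k must be between 1 and the number of strains")

    chosen = [names[0]]
    # For every strain, track the distance to its nearest chosen strain.
    nearest = {n: mismatch_fraction(seqs[n], seqs[chosen[0]]) for n in names}

    while len(chosen) < k:
        candidates = [n for n in names if n not in chosen]
        # Pick the candidate that is farthest from everything chosen so far.
        best = max(candidates, key=lambda n: nearest[n])
        chosen.append(best)
        for n in names:
            nearest[n] = min(nearest[n], mismatch_fraction(seqs[n], seqs[best]))
    return chosen


# Hypothetical, pre-aligned toy "rRNA" fragments (made up for illustration).
TOY_RRNA = {
    "strain_a": "ACGTACGTAC",
    "strain_b": "ACGTACGTAT",
    "strain_c": "TTGTACGGAC",
    "strain_d": "ACGAACGTAC",
    "strain_e": "GGCCTTAACC",
}

if __name__ == "__main__":
    print(pick_dispersed(TOY_RRNA, 3))
```

With the toy data, the near-identical strains (a, b, and d) end up represented by a single pick, which is the point of the exercise: sequencing effort goes to the distant branches rather than to close relatives of genomes already in hand.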

So far, it seems to be working. The ribosomal RNA gene proved to be a fair guide for identifying distant relatives, and the complete genome sequences have revealed many genes that we've never seen before. Over 10 percent of the gene families (groups of similar genes that appear in multiple species) are completely new to biologists.

Some of the individual genes are rather interesting. Biofuel efforts have focused attention on the enzymes that break down the cellulose in wood. The survey found 35 new relatives of the genes that encode these enzymes, even though the organisms sequenced included only two that are known to digest cellulose. The authors have also identified the first and (so far) only bacterial version of the actin gene, which eukaryotes use to build their cells' internal skeletons. They suggest that it's probably used as part of a bacterial attack on these cells, where it interferes with the eukaryotic version.

Since the ribosomal RNA gene acted as a decent proxy for evolutionary distance, the authors used existing sequences to estimate how many bacteria and archaea we'd have to sequence to get a full picture of the diversity we're able to grow in the lab. They estimate that sequencing about 1,500 strains in total would capture about half of the genetic diversity, something that seems well within reach of current technology.
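For a sense of what "capturing half of the diversity" can mean in practice, here is a rough sketch of one common measure, phylogenetic diversity: the total branch length of the part of a tree that connects the chosen organisms to the root, expressed as a fraction of the whole tree's branch length. Whether this matches the authors' exact calculation is an assumption; the tree, its branch lengths, and the function names below are invented for illustration.

```python
from typing import Dict, Iterable, Tuple

# Tree stored as child -> (parent, branch length to parent).
# The root has no entry. All values here are hypothetical.
Tree = Dict[str, Tuple[str, float]]


def branch_length_covered(tree: Tree, selected: Iterable[str]) -> float:
    """Sum the branch lengths on the paths from the selected nodes to the root."""
    covered = set()
    for node in selected:
        # Walk toward the root, stopping early once we hit a shared ancestor
        # whose branch has already been counted.
        while node in tree and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return sum(tree[n][1] for n in covered)


def fraction_of_diversity(tree: Tree, selected: Iterable[str]) -> float:
    """Share of the tree's total branch length captured by the selection."""
    total = sum(length for _, length in tree.values())
    return branch_length_covered(tree, selected) / total


# A toy four-leaf tree: two clades, the second with longer terminal branches.
TOY_TREE: Tree = {
    "clade_1": ("root", 1.0),
    "clade_2": ("root", 1.0),
    "A": ("clade_1", 1.0),
    "B": ("clade_1", 1.0),
    "C": ("clade_2", 2.0),
    "D": ("clade_2", 2.0),
}

if __name__ == "__main__":
    print(fraction_of_diversity(TOY_TREE, ["A", "C"]))
```

In this toy tree, sequencing one leaf from each clade covers more branch length than sequencing two close relatives from the same clade, which is why a selection spread across the tree reaches a given fraction of the diversity with fewer genomes.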

A couple of other aspects of the work seem worthy of mention. For starters, the authors used just about all of the major sequencing technologies that were on the market while the work was being done, including some of the newer, high-throughput methods that were used for the recently described panda genome.

The other thing is that the last author of the paper, Jonathan Eisen (who was kind enough to provide the image below), is a major proponent of open access science. Not only are the sequences going into a public repository, which is typical for genome work, but Nature has agreed to publish the paper under a Creative Commons license. Anyone who's curious about the details can go have a look.

Nature, 2009. DOI: 10.1038/nature08656



A genomic tree of bacterial life.

Listing image by Jonathan Eisen