Tue 19 February 2013 In science. tags: science, data, metagenomics

I spend so much of my time writing stuff down to cadge funding or bruit about ideas, and much of that never really goes anywhere. In the interests of slowing down any competitors by getting them to take my old ideas seriously, here is an interesting set of ideas that I wrote up a few months ago with one particular funding body in mind.

I would welcome comments by scientists on whether or not the social ideas, below, would actually work. Remember, this is in the context of "no do-ey, no fund-ey".

(Basically, I'm trying to hack scientific culture the way ESR talks about hacking software culture. See my more general thoughts on this, too.)

Technology: We and others have a number of solutions that need to be carefully implemented with attention to both biological correctness and scale. Some specific ideas:

1) Methods for integrating metagenomic and metatranscriptomic data, and eventually metaproteomic data, to identify and annotate genes. The goal is to enable the robust comparison of gene expression across conditions and environments. The digital normalization approach developed by my lab allows us to combine both metagenome and metatranscriptome data from many different conditions for a maximally sensitive global assembly. Following this assembly, we can then recover differentially expressed genes by looking at transcriptional levels in specific samples.

2) Correlation and difference analysis of large metagenomic data sets. Specifically, this would enable us to query for presence/absence/abundance across metagenomic and metatranscriptomic shotgun data sets from a vast number of samples (thousands to hundreds of thousands), and extract gene presence/absence and expression level profiles from that data. Our lab has developed the ability to do this very sensitively at the genomic level, which would be a nice complement to protein-based techniques. For example, we can see a ~50% genomic overlap between the raw reads from Iowa prairie and Iowa corn soil samples, indicating that a substantial portion of the underlying genomes are shared. This is a general approach that would let us compare and contrast microbial communities without passaging data through the very biased filter of assembly. The underlying technology already exists, but scaling it up so that we can do ongoing comparisons of thousands or (eventually) millions of samples, and providing a flexible query system on top of it, is a significant challenge.

3) Assembly-graph-based exploration of complex data sets. It is quite likely that we are failing to assemble highly variable regions from complex metagenomes, and it should be straightforward to use partitioning to isolate, detect and analyze such regions.

4) Annotation evaluation. Virtually everyone expresses frustration with the current genome annotation pipelines. I propose to develop methods for evaluating annotations for environmental (meta)genomes so that annotation pipelines and assembly strategies can be compared more objectively.
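
To make the read-level comparison in idea 2 a bit more concrete, here is a toy sketch of one way to estimate overlap between two sets of raw reads without assembling anything: index the k-mers of one sample, then ask what fraction of reads in the other sample share most of their k-mers with that index. This is an illustration only, not the method our lab actually uses. The function names, the choice of k, and the `min_fraction` threshold are all assumptions made for the example, and an in-memory Python set would not survive contact with real soil data.

```python
from typing import Iterable, Set

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both strands map to the same key."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of canonical k-mers in seq, skipping any k-mer
    that contains a non-ACGT character such as N."""
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            found.add(canonical(kmer))
    return found


def build_index(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every canonical k-mer seen in a sample's reads."""
    index: Set[str] = set()
    for read in reads:
        index |= kmers(read, k)
    return index


def read_overlap(query_reads: Iterable[str], index: Set[str], k: int,
                 min_fraction: float = 0.5) -> float:
    """Fraction of query reads for which at least min_fraction of their
    k-mers are present in the index (hypothetical threshold)."""
    total = 0
    shared = 0
    for read in query_reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read too short or all ambiguous bases
        total += 1
        hits = sum(1 for km in read_kmers if km in index)
        if hits / len(read_kmers) >= min_fraction:
            shared += 1
    return shared / total if total else 0.0


# Made-up reads standing in for two samples; real reads would come from FASTQ.
prairie_reads = ["ACGTACGTAC", "TTTTGGGGCC"]
corn_reads = ["ACGTACGTAC", "GATTACAGAT"]

prairie_index = build_index(prairie_reads, k=4)
overlap = read_overlap(corn_reads, prairie_index, k=4)
print(f"fraction of corn reads found in prairie: {overlap:.2f}")
```

The exact set is the part that does not scale: at thousands of samples you would want a compact or probabilistic k-mer index instead, which is exactly where the engineering challenge mentioned above comes in.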