Scalability

In order to assess the scalability of Ray Meta, we simulated two large datasets. Although a simulation does not capture all genetic variations (and associated complexity) occurring in natural microbial populations, it is a way to validate the correctness of assemblies produced by Ray Meta and the abundances predicted by Ray Communities. The first dataset contained 400 × 106 reads, with 1% as human contamination. The remaining reads were distributed across 100 bacterial genomes selected randomly from GenBank. The read length was 101 nucleotides, the substitution error rate was 0.25% and reads were paired. Finally, the proportion of bacterial genomes followed a power law (with exponent -0.5) to mimic what is found in nature (see the section on Materials and methods). The number of reads for this 100-genome metagenome roughly corresponds to the number of reads generated by one lane of an Illumina HiSeq 2000 flow cell (Illumina, Inc.). Table S1 in Additional file 1 lists the number of reads for each bacterial genome and for the human genome. This dataset was assembled by Ray Meta using 128 processor cores in 13 hours, 26 minutes, with an average memory usage of 2 GB per core. The resulting assembly contained 22,162 contigs with at least 100 nucleotides and had an N50 of 152,891. The sum of contig lengths was 345,945,478 nucleotides. This is 93% of the sum of bacterial genome lengths, which was 371,623,377 nucleotides. Therefore, on average there were 3,459,454 assembled nucleotides and 221 contigs per bacterial genome, assuming that the bacterial genomes were roughly of the same size and same complexity and that the coverage depth was not sufficient to assemble incorporated human contamination. Using the known reference sequences, we validated the assembly using MUMmer to assess the quality. There were 11,220 contigs with at least 500 nucleotides. Among these, 152 had misassemblies (1.35%). Any contig that did not align as one single maximum unique match with a breadth of coverage of at least 98.0% was marked as misassembled. The number of mismatches was 1,108 while the number of insertions or deletions was 597.

To further investigate the scalability of our approach for de novo metagenome assembly, we simulated a second metagenome. This one contained 1,000 bacterial genomes randomly selected from GenBank as well as 1% of human sequence contamination. The proportion of the 1,000 bacterial genomes was distributed according to a power law (with exponent -0.3) and the number of reads was 3 × 109 (Table S2 in Additional file 1). This number of reads is currently generated by one Illumina HiSeq 2000 flow cell (Illumina, Inc.). This second dataset, which is larger, was assembled de novo by Ray Meta in 15 hours, 46 minutes using 1,024 processor cores with an average memory usage of 1.5 GB per core. It contained 974,249 contigs with at least 100 nucleotides; N50 was 76,095 and the sum of the contig lengths was 2,894,058,833, or 80% of the sum of bacterial genome lengths (3,578,300,288 nucleotides). Assuming a uniform distribution of assembled bases and contigs and that human sequence coverage depth was not sufficient for its de novo assembly, there were, on average, 974 contigs and 2,894,058 nucleotides per bacterial genome. To validate whether or not the produced contigs were of good quality, we compared them to the known references. There were 196,809 contigs with at least 500 nucleotides. Of these, 2,638 were misassembled (1.34%) according to a very stringent test. There were 59,856 mismatches and 13,122 insertions or deletions.

Next, we sought to quantify the breadth of assembly for the bacterial genomes in the 1,000-genome dataset. In other words, the assembled percentage was calculated for each genome present in the 1,000-genome metagenome. Many of these bacterial genomes had a breadth of coverage (in the de novo assembly) greater than 95% (Figure 1).

Figure 1 Assembled proportions of bacterial genomes for a simulated metagenome with sequencing errors. 3 × 109 100-nucleotide reads were simulated with sequencing errors (0.25%) from a simulated metagenome containing 1,000 bacterial genomes with proportions following a power law. Having 1,000 genomes with power law proportions makes it impossible to classify sequences with their coverage. This large metagenomic dataset was assembled using distributed de Bruijn graphs and profiled with colored de Bruijn graphs. Highly similar, but different genomes, are likely to be hard to assemble. This figure shows the proportion of each genome that was assembled de novo within the metagenome. Of the bacterial genomes, 88.2% were assembled with a breadth of coverage of at least 80.0%. Full size image

Estimating bacterial proportions

Another problem that can be solved with de Bruijn graphs is estimating the genome nucleotide proportion within a metagenome. Using Ray Communities, the 100-genome and 1,000-genome datasets de novo assembled de Bruijn graphs were colored using all sequenced bacterial genomes (Table S4 in Additional file 1) in order to identify contigs and to estimate bacterial proportions in the datasets. Ray Communities estimates proportions by demultiplexing k-mer coverage depth in the distributed de Bruijn graph (see the section on Demultiplexing signals from similar bacterial strains in Materials and methods). Because coloring occurs after de novo assembly has completed, the reference sequences are not needed for assembling metagenomes.

For the 100-genome dataset, only two bacterial genome proportions were not estimated correctly. The first was due to a duplicate in GenBank and the second to two almost identical genomes (Figure 2A). When two identical genomes are provided as a basis to color the de Bruijn graph, no k-mer is uniquely colored for any of these two genomes, and identifying k-mers cannot be found through demultiplexing. This can be solved by using a taxonomy, which allows reference genomes to be similar or identical.

Figure 2 Estimated bacterial genome proportions. For the two simulated metagenomes (100 and 1,000 bacterial genomes, respectively), colored de Bruijn graphs were utilized to estimate the nucleotide proportion of each bacterial genome in its containing metagenome. Genome proportions in metagenomes followed a power law. Black lines show the expected nucleotide proportion for bacterial genomes while blue points represent proportions measured by colored de Bruijn graphs. (A) For the 100-genome metagenome, only two bacterial genomes were not correctly measured (2.0%), namely Methanococcus maripaludis X1 and Serratia AS9. Methanococcus maripaludis X1 was not detected because it was duplicated in the dataset as Methanococcus maripaludis XI, thus providing zero uniquely colored k-mers. Serratia AS9 was not detected because it shares almost all its k-mers with Serratia AS12. (B) For the 1,000-genome metagenome, 4 bacterial genomes were overestimated (0.4%) while 20 were underestimated (2.0%). These errors were due to highly similar bacterial genomes, hence they did not provide uniquely colored k-mers. This problem can be alleviated either by using a curated set of reference genomes or by using a taxonomy. The remaining 976 bacterial genomes had a measured proportion near the expected value. Full size image

In the 1,000-genome dataset, four bacterial genome proportions were overestimated and 20 were underestimated (Figure 2B). In both the 100-genome and 1,000-genome datasets, the proportion of bacterial genomes with incorrect estimates was 2.0%. In both of these, the incorrect estimates were caused by either duplicated genomes, identical genomes or highly similar genomes. The use of a taxonomy alleviates this problem.

The results with the 100-genome and 1,000-genome datasets show that our method can recover bacterial genome proportions when the genome sequences are known. In real microbiome systems, there is a sizable proportion of unknown bacterial species. For this reason, it is important to devise a system that can also accommodate unknown species by using a taxonomy, which allows the classification to occur at higher levels - such as phylum or genus instead of species.

Metagenome de novo assembly of real datasets

Here, we present results for 124 fecal samples from a previous study [22]. From the 124 samples, 85 were from Denmark (all annotated as being healthy) and 39 were from Spain (14 were healthy, 21 had ulcerative colitis and 4 had Crohn's disease). Each metagenome was assembled independently (Table S3 in Additional file 1) and the resulting distributed de Bruijn graphs were colored to obtain taxonomic and gene ontology profiles (see Materials and methods and Table S4 in Additional file 1).

These samples contained paired 75-nucleotide and/or 44-nucleotide reads obtained with Illumina Genome Analyzer sequencers. In about 5 hours, 122 samples were assembled (and profiled) using 32 processor cores and the two remaining samples, namely MH0012 and MH0014, were assembled (and profiled) with 48 and 40 processor cores, respectively (Table S3 in Additional file 1). These runtime figures include de novo assembly, graph coloring, signal demultiplexing and taxonomic and gene ontology profiling, which are all tightly coupled in the process. In the next section, taxonomic profiles are presented for these 124 gut microbiome samples.

Taxonomic profiling

In metagenomic projects, the bacterial genomes that occur in the sample may be unknown at the species level. However, it is possible to profile these samples using a taxonomy. The key concept is to classify colored k-mers in a taxonomy tree: a k-mer is moved to a higher taxon as long as many taxons have the k-mer so it can be classified as the nearest common ancestor of the taxons. For example if a k-mer is not classified at the species level, it can be classified at the genus level and so on. Furthermore, taxonomy profiling does not suffer from similarity issues as seen for proportions present in samples because k-mers can be classified in higher taxons when necessary.

Accordingly, k-mers shared by several bacterial species cannot be assigned to one of them accurately. For this reason, the Greengenes taxonomy [28] (version 2011_11) was utilized to classify each colored k-mer in a single taxon with its taxonomic rank being one of the following: kingdom, phylum, class, order, family, genus or species. For each sample, abundances were computed at each taxonomic rank. At the moment, the most recent and accurate taxonomy for profiling taxons in a metagenome is Greengenes [28]. We profiled taxons in the 124 gut microbiome samples using this taxonomy. We also incorporated the human genome into this taxonomy to profile the human abundance in the process. At the phylum level, the two most abundant taxons were Firmicutes and Bacteroidetes (Figure 3A). The profile of the phylum Chordata indicated that two samples contained significantly more human sequences than the average (Figure 3A). The most abundant genera in the 124 samples were Bacteroides and Prevotella (Figure 3B). The taxon Bacteroides is reported more than once because several taxons had this name with a different ancestry in the Greengenes taxonomy. The genera Prevotella and Butyrivibrio had numerous samples with higher counts, indicating that the data are bi-modal (Figure 3B). The genus Homo had two samples with significantly more abundance (Figure 3B).

Figure 3 Fast and efficient taxonomic profiling with distributed colored de Bruijn graphs. From a previous study, 124 metagenomic samples containing short paired reads were assembled de novo and profiled for taxons. The graph coloring occurred once the de Bruijn graph was assembled de novo. (A) The taxonomic profiles are shown for the phylum level. The two most abundant phyla were Firmicutes and Bacteroidetes. This is in agreement with the literature [22]. The abundance of human sequences was also measured. The phylum Chordata had two outlier samples. This indicates that two of the samples had more human sequences than the average, which may bias results. (B) At the genus level, the most abundant taxon was Bacteroides. This taxon occurred more than once because it was present at different locations within the Greengenes taxonomic tree. Also abundant is the genus Prevotella. Furthermore, the later had numerous samples with higher counts, which may help in non-parametric clustering. Two samples had higher abundance of human sequences, as indicated by the abundance of the genus Homo. Full size image

Grouping abundance profiles

It has been proposed that the composition of the human gut microbiome of an individual can be classified as one of three enterotypes [23]. We profiled genera for each of the 124 gut microbiome samples to reproduce these three enterotypes. The 124 samples (85 from Denmark and 39 from Spain) were analyzed using the two most important principal components (Figure 4; see Materials and methods). Two clear clusters are visible, one enriched for the genus Bacteroides and one for the genus Prevotella. A continuum between two enterotypes has also been reported recently [42].

Figure 4 Principal component analysis shows two clusters. Principal component analysis (see Materials and methods) with abundances at the genus level yielded two distinct clusters. Abundances were obtained with colored de Bruijn graphs. One was enriched in the genus Bacteroides while the other was enriched in the genus Prevotella. Principal component 1 was linearly correlated with the genus Prevotella while principal component 2 was linearly correlated with the genus Bacteroides. This analysis suggests that there is a continuum between the two abundant genera Bacteroides and Prevotella. This interpretation differs from the original publication in which three human gut enterotypes were reported [23]. More recently, it has been proposed that there are only two enterotypes and individuals are distributed in a continuum between the two [42]. Full size image

Profiling of ontology terms

Gene ontology is a hierarchical classification of normalized terms in three independent domains: biological process, cellular component and molecular function. Some biological datasets are annotated with gene ontology. Here, we used gene ontology to profile the 124 metagenome samples based on a distributed colored de Bruijn graph (see Materials and methods). First, abundances for biological process terms were obtained (Figure 5A). The two most abundant terms were metabolic process and transport. The terms oxidation-reduction process and DNA recombination had numerous sample outliers, which indicates that these samples had different biological complexity for these terms (Figure 5A). Next, we sought to profile cellular component terms in the samples. The most abundant term was membrane, followed by cytoplasm, integral to membrane and plasma membrane. This redundancy is due to the hierarchical structure of gene ontology (Figure 5B). Finally, we measured the abundance for molecular function terms. The most abundant was ATP binding, which had no outliers. The term DNA binding was also abundant. However, the latter had outlier samples (Figure 5C).

Figure 5 Ontology profiling with colored de Bruijn graphs. Gene ontology profiles were obtained by coloring of the graph resulting from de novo assembly. Gene ontology has three domains: biological process, cellular component and molecular function. For each domain, only the 15 most abundant terms are displayed. (A) Ontology terms in the biological process domain were profiled. Some of these have several outlier samples, namely oxidation-reduction process and DNA recombination. (B) Ontology profiling for cellular component terms is shown. The most abundant is the membrane term. (C) The profile for molecular function terms is shown. Binding functions are the most abundant with ATP binding, nucleotide binding and DNA binding in the top three. Next is catalytic activity, which is a general term. More specific catalytic activities are listed. Full size image

Comparison of assemblies

Three samples from the MetaHIT Consortium [22] - MH0006 (ERS006497), MH0012 (ERS006494) and MH0047 (ERS006592) - and three samples from the Human Microbiome Project Consortium [24] - SRS011098, SRS017227 and SRS018661 - were assembled with MetaVelvet [39] and Ray Meta to draw a comparison. Assembly metrics are displayed in Table 1. The average length is higher for MetaVelvet for samples ERS006494 and ERS006592. For the other samples, the average length is higher for Ray Meta. The N50 length is higher for Ray Meta for all samples. For all samples but ERS006497, the total length is higher for Ray Meta. Although we assembled the 124 samples from [22] and 313 samples (out of 764) from the Human Microbiome Project [24] with Ray Meta on supercomputers composed of nodes with little memory (24 GB), we only assembled a few samples with MetaVelvet because a single MetaVelvet assembly requires exclusive access to a single computer with a large amount of available memory (at least 128 GB). Ray Meta produced longer contigs and more bases for these six samples. The shared content of assemblies produced by MetaVelvet and Ray Meta is shown in Table 1. A majority of sequences assembled by MetaVelvet and Ray Meta are shared. As metagenomic experiments will undoubtedly become more complex, Ray Meta will gain a distinct advantage owing to its distributed implementation.

Table 1 Comparison of assemblies produced by MetaVelvet and Ray Meta Full size table

Validation of taxonomic profiling

We compared Ray Communities to MetaPhlAn in order to validate our methodology. Taxonomic profiles for 313 samples (Additional file 2) from the Human Microbiome Project [24] were generated with Ray Communities and compared to those of MetaPhlAn [27]. The correlations are shown in Table 2 for various body sites. Correlations are high - for instance the correlations for buccal mucosa (46 samples) were 0.99, 0.98, 0.97, 0.98, 0.95 and 0.91 for the ranks phylum, class, order, family, genus and species, respectively. These results indicate that Ray Communities has an accuracy similar to that of MetaPhlAn [27], which was utilized by the Human Microbiome Project Consortium [24]. The correlation at the genus rank for the site anterior nares was poor (0.59) because MetaPhlAn classified a high number of reads in the genus Propionibacterium thus yielding a very high abundance while the number of k-mer observations classified this way by Ray Communities was more moderate. For the body site called stool, the correlation at the family rank was weak (0.62) because MetaPhlAn utilizes the NCBI taxonomy whereas Ray Communities utilizes the Greengenes taxonomy, which has been shown to be more accurate [28]. Overall, these results indicate that Ray Communities yields accurate taxonomic abundances using a colored de Bruijn graph.