Development of an integrative binning approach

The DAS Tool approach to solving the binning problem is to integrate predictions from multiple established binning tools. The number and type of binning tools are flexible. Candidate bins are generated independently when all binning tools are applied to the same assembly. DAS Tool then uses a consensus approach to select a single set of non-redundant, high-quality bins (Fig. 1). Nevertheless, we advise that the user examine each of the final bins to identify potential contamination based on erroneous phylogenetic affiliation and to remove sequences from phage/virus (based on gene content).

Fig. 1: Overview of the DAS Tool algorithm. Step 1: The input of DAS Tool comprises scaffolds of one assembly (grey lines) and a variable number of bin sets from different binning predictions (same-coloured rounded rectangles). Step 2: Single-copy genes (blue shapes) on scaffolds are predicted and scores (blue and green boxes) are assigned to bins. Step 3: Aggregation of a redundant candidate bin set from all binning predictions. Step 4: Iterative selection of high-scoring bins and updating of scores of remaining partial candidate bins. The output comprises a non-redundant set of high-scoring bins from different input predictions.
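The iterative selection in Steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the DAS Tool implementation: the scoring function (distinct single-copy genes minus surplus copies, normalised by a reference gene count), the `min_score` cutoff and all names are assumptions made for illustration only.

```python
from collections import Counter

def score_bin(scaffolds, scg_on_scaffold, n_reference_scgs):
    """Toy score (an assumption, not the published scoring function):
    distinct single-copy genes minus surplus copies, normalised."""
    counts = Counter(g for s in scaffolds for g in scg_on_scaffold.get(s, ()))
    distinct = len(counts)
    surplus = sum(c - 1 for c in counts.values())
    return (distinct - surplus) / n_reference_scgs

def select_bins(candidates, scg_on_scaffold, n_reference_scgs, min_score=0.5):
    """Greedily pick the best-scoring bin, remove its scaffolds from the
    remaining candidates, re-score them and repeat."""
    remaining = {name: set(scafs) for name, scafs in candidates.items()}
    selected = {}
    while remaining:
        scores = {name: score_bin(scafs, scg_on_scaffold, n_reference_scgs)
                  for name, scafs in remaining.items()}
        best = max(scores, key=scores.get)
        if scores[best] < min_score:
            break
        chosen = remaining.pop(best)
        selected[best] = chosen
        # Remaining bins become partial bins; empty ones are dropped.
        remaining = {name: scafs - chosen
                     for name, scafs in remaining.items() if scafs - chosen}
    return selected

# Hypothetical input: three candidate bins from two binning tools.
scg_on_scaffold = {"s1": ["g1", "g2"], "s2": ["g3"], "s3": ["g4"], "s5": ["g1", "g2"]}
candidates = {
    "toolA_bin1": {"s1", "s2", "s3"},
    "toolB_bin1": {"s1", "s2"},
    "toolB_bin2": {"s3", "s5"},
}
final_bins = select_bins(candidates, scg_on_scaffold, n_reference_scgs=4)
```

In this toy input, the overlapping bin from the second tool loses all of its scaffolds once the first bin is chosen, which is how redundancy is removed from the final set.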

DAS Tool applied to simulated microbial communities

To validate the DAS Tool algorithm, we applied it to three assemblies from simulated microbial communities that were created for the CAMI challenge19. The assemblies comprise different numbers of organisms including strain variation to simulate microbial communities with low (40 genomes), medium (132 genomes) and high complexity (596 genomes). We predicted bins using five binning tools (ABAWACA 1.07 (https://github.com/CK7/abawaca), CONCOCT9, MaxBin 213, MetaBAT10 and tetranucleotide ESOMs4) and combined the results using DAS Tool. To determine how well the reconstructed bins represent the reference genomes, we calculated F1 scores, which are the harmonic mean of precision and recall. We also focused on how well each tool reconstructs genomes with common or unique strains in the data set. For the most challenging, high-complexity data set, DAS Tool reports more high-quality genomes with and without strain variation than any individual tool (Fig. 2). DAS Tool reports 41 high-quality bins (F1 score > 0.6) of genomes with common strains and 299 genomes of unique strains. MaxBin 2 obtained the second-best results with 23 and 253 genomes (F1 score > 0.6) for reference genomes with common and unique strains, respectively. Tetranucleotide ESOMs performed well in reconstructing genomes from unique strains (173 genomes, F1 score > 0.6), but reported only a low number of the genomes with strain variation (6 genomes, F1 score > 0.6) (Fig. 2). In addition to the higher number of high-quality genomes, the F1 score distribution of all genomes reconstructed by DAS Tool shows a comparable or higher median than that of the best-performing single binning tool (DAS Tool: 0.627 (common strain), 0.979 (unique strain); MaxBin 2: 0.449 (common strain), 0.980 (unique strain)) (Fig. 2).
DAS Tool not only reconstructs a higher number of high-quality genomes and resolves strain variation better than any of the individual tools on the high-complexity data set, but also performs better on the assemblies of medium- and low-complexity communities (Supplementary Fig. 1).
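As a worked illustration of the F1 score used here, the following minimal Python sketch computes the harmonic mean of precision and recall for one bin and its reference genome. Measuring overlap in base pairs, as well as the function and parameter names, are assumptions for this example; the text above does not specify the unit of comparison.

```python
def f1_score(shared_bp: int, bin_bp: int, genome_bp: int) -> float:
    """Harmonic mean of precision and recall for one bin/reference pair.

    shared_bp: bases of the bin that belong to the reference genome
    bin_bp:    total bases in the bin
    genome_bp: total bases in the reference genome
    """
    if shared_bp == 0:
        return 0.0
    precision = shared_bp / bin_bp      # how pure the bin is
    recall = shared_bp / genome_bp      # how much of the genome was recovered
    return 2 * precision * recall / (precision + recall)

# Hypothetical bin: 4.0 Mbp in total, 3.6 Mbp of it from a 4.5 Mbp reference.
example = f1_score(shared_bp=3_600_000, bin_bp=4_000_000, genome_bp=4_500_000)
print(f"F1 = {example:.3f}")
```

With a precision of 0.9 and a recall of 0.8, this hypothetical bin would clear the F1 > 0.6 threshold used for high-quality bins above.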

Fig. 2: Reconstructed genomes from a simulated microbial community consisting of 596 genomes. a, The number of reconstructed genomes per method above a certain F1 score threshold. The higher the F1 score the more similar the reconstructed genome is to the reference. b, The distribution of F1 scores of all reported bins (centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range). Individual values appear as dots. The precise n number in terms of reconstructed bins per method is given above each boxplot. Metrics are calculated for all reference genomes (all), genomes with strain variation (common_strain; ≥95% average nucleotide identity (ANI) to other reference genomes) and without strain variation (unique_strain; <95% ANI to other reference genomes).

Application of DAS Tool to environmental metagenomic data

Probst et al.17 generated a highly curated set of genome bins from metagenomic data from a high-CO2 cold-water geyser that were ideal for evaluation of the DAS Tool algorithm. The data comprise two assemblies of sequences from samples collected sequentially on 3.0 μm and 0.2 μm filters and a set of 3.0 μm filtrates from subsurface fluids collected at a single time point. The published bins were generated by a comparative approach of three methods followed by manual curation of the results17. We used CheckM15 to generate marker gene-based quality estimates for the published bins that can be compared to quality estimates for all binning methods, including DAS Tool. Bins were only considered to be of high (>90% complete) or draft (>70% complete) quality if they had less than 5% contamination.
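The quality tiers described above can be expressed as a small Python helper. This is a sketch under stated assumptions: the function name and tier labels are hypothetical, the inputs are taken to be percentage estimates such as those reported by CheckM, and the function returns the most stringent tier that a bin satisfies.

```python
from typing import Optional

def quality_tier(completeness: float, contamination: float) -> Optional[str]:
    """Return 'high', 'draft' or None for a bin, given percentage estimates."""
    if contamination >= 5.0:
        return None          # too contaminated for either tier
    if completeness > 90.0:
        return "high"        # near-complete genome
    if completeness > 70.0:
        return "draft"
    return None

# Hypothetical estimates for three bins: (completeness %, contamination %).
bins = {"bin_a": (96.2, 1.1), "bin_b": (78.4, 3.9), "bin_c": (93.0, 7.5)}
tiers = {name: quality_tier(*estimates) for name, estimates in bins.items()}
```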

We compared the results of the three independent binning predictions from ref. 17 (ABAWACA 1.0, tetranucleotide ESOMs, differential-abundance ESOMs), as well as those from ABAWACA 1.07, CONCOCT, MetaBAT and MaxBin 2 to results achieved using DAS Tool. DAS Tool was applied using either a combination of three or seven different binning algorithms (Fig. 3 and Supplementary Table 2).

Fig. 3: Reconstructed genomes from Crystal Geyser, a high-CO2 cold-water geyser. The number of high-quality genomes with low contamination (<5%) from metagenomic assemblies of two samples. Probst.2016 represents the combination from ref. 17 of ABAWACA.1, tetraESOM and seriesESOM and a final manual curation step. DAS_Tool.3binners uses the same three predictions as input. DAS_Tool.7binners additionally uses ABAWACA.2, CONCOCT, MaxBin.2 and MetaBat.

Although DAS Tool with three binning algorithms reported more near-complete and draft genomes than the three methods alone, it returned fewer genomes than in the curated set from ref. 17 (Fig. 3 and Supplementary Table 2). However, when we included seven binning tools in DAS Tool (adding ABAWACA 1.07, CONCOCT, MaxBin 2 and MetaBAT), the reported number of near-complete genomes was higher for the 0.2 μm sample (DAS Tool: 36 genomes, Probst: 32) and even higher for the 3.0 μm sample (DAS Tool: 38, Probst: 31). For both samples a larger number of draft genomes was reconstructed than was achieved previously17 (Fig. 3 and Supplementary Table 2). The number of draft genomes increased slightly when allowing more contamination per bin (Supplementary Fig. 3).

Combination of bins using DAS Tool improves genome count from metagenomic data with different levels of complexity

To evaluate the performance of DAS Tool on samples of different complexity, we applied it to shotgun metagenomic data of lower, medium and high complexity from human microbiomes20, natural oil seeps21,22 and soil (see Data availability). We binned all samples separately using ABAWACA 1.07, CONCOCT, MaxBin 2, MetaBAT and tetranucleotide ESOMs. All predictions were combined using DAS Tool and CheckM was used to estimate the quality of the resulting bins. In addition, we used ggKbase binning tools to analyse the human gut data. This was appropriate, given colonization of the human gut by genomically well-characterized bacteria. ggKbase tools were not used in the other analyses because they do not perform well in systems with many previously unreported organisms.

Summing up the number of bins of each quality level that were generated for the three ecosystems, DAS Tool reported the highest number of near-complete and draft bins in all cases (Fig. 4).

Fig. 4: The number of high-quality genomes with low contamination (<5%) from metagenomic assemblies of samples from three ecosystems representing a range of complexity. Samples were collected from adult human gut (1 faecal sample), oil seeps (5 samples) and hillslope soil and underlying weathered shale (6 samples). The samples were assembled and binned separately. Reconstructed genomes were summed up per ecosystem. For sample-by-sample results, see Supplementary Fig. 5.

Interestingly, the performance of the single binning tools that were used as input for DAS Tool differed between ecosystems and none of them was the clear winner. This is also reflected in the composition of the final bin set in terms of the input methods where genomes were selected (Supplementary Fig. 4). In the case of bins generated for the lower-complexity human gut samples using single binning tools, ggKbase followed by MetaBAT generated the largest number of near-complete genomes. For the medium-complexity oil seeps, ABAWACA 1.07 and MetaBAT produced the most draft-quality genomes while CONCOCT produced slightly more high-quality bins. For high-complexity soil data, MaxBin 2 reported the most draft and near-complete genomes.

We also examined the performance of the various binning approaches sample by sample. DAS Tool reported either the most or the same number of near-complete genomes with low contamination for all 12 samples (higher: 6/12; equal: 6/12). The number of reconstructed genomes per sample increases when considering genomes with a higher amount of contamination. In 11 of 12 samples, DAS Tool reports a higher number of genomes with more than 70% completeness and less than 15% contamination (Supplementary Fig. 6).

To estimate the expected species number per ecosystem we clustered for each assembly all predicted ribosomal protein S3 sequences at 99% amino acid identity. Given the number of resulting clusters and the number of draft genomes, DAS Tool reconstructed 76.5% (75 bins/98 clusters), 24.6% (86/349) and 8.7% (79/907) of possible genomes from the data sets of human gut, oil seeps and soil, respectively (Supplementary Table 3).
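The recovery percentages above follow directly from the reported counts. The short Python sketch below reproduces the arithmetic, using the number of ribosomal protein S3 clusters as the estimate of expected genomes; the variable names are illustrative only.

```python
# (draft-quality bins reported by DAS Tool, ribosomal protein S3 clusters)
counts = {
    "human gut": (75, 98),
    "oil seeps": (86, 349),
    "soil": (79, 907),
}

# Percentage of expected genomes recovered per ecosystem, to one decimal place.
recovered_percent = {
    ecosystem: round(100 * n_bins / n_clusters, 1)
    for ecosystem, (n_bins, n_clusters) in counts.items()
}

for ecosystem, percent in recovered_percent.items():
    print(f"{ecosystem}: {percent}% of expected genomes recovered")
```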

Besides CheckM, we also estimated the completeness of bins using the single-copy gene-based approach BUSCO16. In general, the estimations of BUSCO are less conservative, which results in a higher number of classified high-quality genomes compared to CheckM. According to BUSCO, DAS Tool reports the most near-complete and draft-quality genomes for all ecosystems (Supplementary Fig. 7a).

We also applied the recently published Binning_refiner18 to combine the binning results of the three environments and compared its performance to DAS Tool. For all 12 assemblies, DAS Tool extracted considerably more near-complete and draft genomes than Binning_refiner (Supplementary Fig. 8).

Genome analysis reveals previously unreported lineage with hydrocarbon degradation potential

Binning of metagenomic data from Santa Barbara oil seep samples revealed three genomes whose 16S rRNA gene sequences lacked closely related sequences in the SILVA database23 (78.8, 79.4 and 87.4% identity). The estimated completeness of these reconstructed genomes ranges from 89.6 to 95.6% (Supplementary Table 4).

In a phylogenetic tree based on 16 concatenated ribosomal proteins, the three genomes cluster as a monophyletic group with one TA06 and two WOR-3 genomes (Supplementary Fig. 9a). The JGI_Cruoil_03_Bacteria_38_101 forms a cluster together with the TA06 lineage at a pairwise tree distance (patristic distance) of 1.2977 but is more distant to the two WOR-3 (patristic distances of 1.5531 and 1.5258, respectively). In contrast, the two lineages JGI_Cruoil_03_Bacteria_44_89 and JGI_Cruoil_03_Bacteria_51_56 share greater similarity with the two WOR-3 at a minimal patristic distance of 1.3350 and 1.0582, respectively, and have a greater distance to the TA06 (patristic distance of 1.4328 and 1.4673, respectively).

For comparison, the patristic distance between representatives of closely related phyla in the same tree was between 1.0282 and 1.2110 (Firmicute Thermincola sp. JR versus the Chloroflexus C. aurantiacus J-10-fl and Melainabacteria Obscuribacter phosphatis versus the Cyanobacteria Leptolyngbya sp. PCC 7104) (Supplementary Fig. 10).

Given that both distances are smaller than the distances of TA06 and WOR-3 to our reconstructed genomes JGI_Cruoil_03_Bacteria_38_101 and JGI_Cruoil_03_Bacteria_44_89 as well as the distance of JGI_Cruoil_03_Bacteria_38_101 to JGI_Cruoil_03_Bacteria_44_89 (patristic distance of 1.5164), we conclude that these two genomes may be representatives of two previously unreported phylum-level lineages. The third genome, JGI_Cruoil_03_Bacteria_51_56, is closer to the WOR-3 at a patristic distance of 1.0582 and is probably part of the WOR-3 candidate division.
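The comparison behind this conclusion can be restated as a short Python sketch using only the patristic distances quoted above. Treating the larger of the two reference inter-phylum distances as a cutoff is an illustrative simplification, not a criterion stated in the text, and the dictionary keys are hypothetical labels.

```python
# Distances between representatives of closely related phyla in the same tree.
reference_phylum_distances = [1.0282, 1.2110]
cutoff = max(reference_phylum_distances)  # illustrative simplification

# Patristic distances from each reconstructed genome to known lineages.
distances_to_known = {
    "JGI_Cruoil_03_Bacteria_38_101": {"TA06": 1.2977, "WOR-3_a": 1.5531, "WOR-3_b": 1.5258},
    "JGI_Cruoil_03_Bacteria_44_89": {"WOR-3_closest": 1.3350, "TA06": 1.4328},
    "JGI_Cruoil_03_Bacteria_51_56": {"WOR-3_closest": 1.0582, "TA06": 1.4673},
}

def exceeds_phylum_level(distances: dict, cutoff: float) -> bool:
    """True if even the closest known lineage is farther than the cutoff."""
    return min(distances.values()) > cutoff

candidate_new_lineage = {
    genome: exceeds_phylum_level(distances, cutoff)
    for genome, distances in distances_to_known.items()
}

# The two candidate genomes are also distant from each other.
distance_38_101_to_44_89 = 1.5164
mutually_distinct = distance_38_101_to_44_89 > cutoff
```

Under this simplification, the first two genomes exceed the reference distances while the third falls within them, matching the assignment of JGI_Cruoil_03_Bacteria_51_56 to WOR-3.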

Interestingly, the 16S rRNA gene sequences of all three of our reconstructed genomes group with some sequences classified as TA06 and one sequence classified as a WS3 (the other WS3 sequences form a lineage sibling to Zixibacteria) (Supplementary Figs. 9b and 11). Except for one TA06 (Candidate_division_TA06_bacterium_32_111), the corresponding TA06 and WS3 genomes place distant from our genomes on the concatenated ribosomal protein tree. Thus, some of the 16S rRNA gene sequences of these publicly available genomes may be misclassified or misbinned (a common problem with 16S rRNA gene binning, especially if the gene is in multi-copy and the scaffolds are short). Regardless, it is clear that our genomes are highly distinct from any other genomes in public databases.

Pathway analysis reveals genes encoding hydrocarbon degradation enzymes, including aldehyde dehydrogenase, which are present in all three genomes. Additionally, alcohol dehydrogenase, aldehyde ferredoxin oxidoreductase and methanol dehydrogenase are present in JGI_Cruoil_03_Bacteria_44_89, the genome with the highest estimated completeness, suggesting pathways for degradation of alkanes and methanol (Supplementary Table 5).

Genomes from soil

From six soil samples, we reconstructed 79 minimally contaminated (<5%) draft genomes (>70% completeness), 26 of which were high-quality draft genomes (>90% completeness) (Supplementary Fig. 5). Two of the high-quality genomes were well-assembled (a Gemmatimonadetes genome consisting of 11 scaffolds and a Bacteroidetes genome on 14 scaffolds), with estimated completeness above 97% and contamination below 3.3%.

It has been shown recently that some Gemmatimonadetes are able to consume methanol using a pyrrolo-quinoline quinone (PQQ)-dependent methanol dehydrogenase (MDH) and to convert the resulting formaldehyde using the tetrahydromethanopterin (THMPT) and tetrahydrofolate (THF)-linked formaldehyde oxidation pathways24. Likewise, we were able to find a PQQ-MDH and two key enzymes of the THF pathway (methenyltetrahydrofolate cyclohydrolase, methylenetetrahydrofolate dehydrogenase) in the high-quality Gemmatimonadetes genome bin but could not find any enzymes belonging to the THMPT pathway. Additionally, we found genes for carbon fixation, fermentation, nitrogen assimilation, complex carbon degradation and sulfur metabolism. Similarly, the Bacteroidetes genome encodes enzymes for carbon fixation, fermentation and nitrogen assimilation, but by contrast has no genes for methane metabolism, complex carbon degradation or sulfur metabolism (Supplementary Table 5).