The 154,723 reconstructed genomes and the 80,990 reference genomes retrieved from public repositories (see below) were then clustered based on whole-genome nucleotide similarity estimation using Mash (). The cutoff on the hierarchical clustering was tuned based on the intra- and inter-species diversity of the confidently taxonomically labeled subset of the 80,990 reference genomes resulting in species-level genome bins (SGBs) spanning ∼5% genetic diversity, as independently proposed elsewhere (). Overall we obtained 16,332 SGBs that were further divided in known SGBs (kSGB) that contain at least one reference genome, unknown SGBs (uSGBs) without any reference genomes, and non-human SGBs containing only reference genomes and no genomes reconstructed from our assembly of the human microbiome ( Figure 6 A). The kSGBs were then taxonomically labeled with the species label (if available) of the reference genome(s) present in the bin, whereas uSGBs were assigned the phylum of their closest reference genome, and to a genus-level and family-level annotation when possible.

In brief, we first collected and curated a metagenomic resource comprising a total of 9,428 metagenomes (from public resources and samples sequenced in this study, see below) and then applied metagenomic assembly - metaSPAdes () or MEGAHIT () - to each sample separately. Each metagenomic assembly was then quality controlled for minimum length and the 204M contigs were subjected to sample-specific contig binning based on tetranucleotide frequency and contig abundance using MetaBAT2 () resulting in over 345,000 putative genome bins ( Figure 6 A). Genome bins were then strictly quality controlled to identify reconstructed genomes with quality at least comparable with the typical quality of isolate genome sequencing. By controlling genome completeness and contamination using CheckM () and strain heterogeneity with the CMSeq pipeline described below, we identified 70,178 high-quality genomes and 84,545 additional MQ genomes ( Figure 6 A).

Our approach to reconstruct bacterial and archaeal genomes from the human microbiome ( Figure 6 A) exploits metagenomic single-sample assembly, contig binning, and species-level inter-sample genome grouping at the scale of the many thousands of metagenomes now available in public repositories.

We collected publically available metagenomic samples from 46 different studies, totaling 9,316 metagenomes and 4.1e11 Illumina reads. Overall, the samples cover 31 countries: USA (1,431 samples), China (1,342), Israel (956), Sweden (600) and Denmark (580) are the 5 most represented. The metagenomes were sampled from 5 major body sites: 7,783 samples from the gut (stool samples), 783 from the oral cavity, 503 from the skin (including 93 samples from anterior nares), 88 from the vagina, and 9 from maternal milk (excluded for visualization from the figures). Samples from adults (19 to 65 years of age) account for 6,615 samples, but all age categories are covered with 1,098 newborns (< 1 year of age), 465 children (age ≥1 year and <12 years), 216 school-age individuals (age ≥12 and <19 years), and 876 from adults and seniors (age ≥19 and >65 years; merged with the class “adult” in Figure 1 ). Despite manual curation efforts, 46 samples from public repositories used here still miss the metadata for age category. All these and other manually-curated metadata fields are available in Table S1 and are included in the curatedMetagenomicData package () together with all the taxonomic () and functional potential profiles () of the microbial species with available reference genomes. To cross-validate the results on the raw-reads mappability, we also retrieved 384 additional metagenomes not used to reconstruct the SGBs. Specifically, we considered 303 Westernized gut metagenomes, 52 Westernized oral metagenomes and 29 non-Westernized oral metagenomes as reported in Table S1

We enrolled, sampled, and sequenced the gut microbiome of individuals from the Madagascar Health and Environmental Research (MAHERY) study cohort that was set up in 2004 in a remote rainforest region in north-eastern Madagascar to study the impact of environmental change on human health (). The cohort includes local people (Betsimisaraka and Tsimihety ethnicity) whose diet relies heavily on self-grown rice and wild plants and meats. Samples were collected between January 2013 and May 2014 from two subsistence communities (A and B) adjacent to the Makira Natural Park, approximately 10 km away from each other. A subset of the households in the two communities were randomly selected to be enrolled in the study (95 households out of 160 in Community A and 57 households out of 157 in Community B), for a total of 719 individuals < 74 years old. Enrolled people were subjected to clinical visits and questionnaires about dietary intake, and were asked to collect biological samples (fingernails, blood, faeces) to assess health and nutritional status. The samples considered in this study were collected from a total of 112 healthy volunteers (54 females and 58 males, Table S1 ). The gut microbiome of five female individuals were also sampled from a previously established cohort in Gimbichu (Ethiopia, Oromia Region).

Faecal samples from the Madagascar cohort were self-collected in sterile polypropylene screw cap collection tubes (Sarstedt) after defecation on the waxy side of a banana leaf, and returned to the local research team within five hours of collection (). Three ml of 97% ethanol were added to stabilize samples before storing them at −23°C within 14 days of collection. Samples were then shipped on dry ice to the USA to be stored at −80°C. Faecal samples from Ethiopian individuals were collected in REAL MiniSystem “Total - fix” (Durviz S.L., Valencia, Spain) and kept frozen at −80°C.

DNA was extracted with the PowerSoil DNA Isolation Kit (MoBio Laboratories) after pre-heating to 65°C for 10 min and to 95°C for 10 min (). Libraries were prepared with the NexteraXT DNA Library Preparation Kit (Illumina) and sequenced on the HiSeq2500 machine (Illumina). The metadata for this cohort are available in Table S1 and are included in the curatedMetagenomicData package together with the taxonomic and functional potential profiles of the species with available reference genomes. We sequenced the 117 samples for a total of 593.9 Gb (5.3 Gb average per sample after quality control, 3.87 Gb standard deviation, Table S1 ). The raw reads were submitted to the NCBI-SRA archive and are available under the BioProjects PRJNA485056 (Madagascar cohort) and PRJNA504891 (Ethiopian cohort).

In addition to the sequenced Madagascar cohort (above), 480 additional samples were annotated as “non-Westernized” from a total of 5 studies spanning 4 populations. These were a traditional Fijian population () (172 stool samples and 140 saliva samples), the hunter-gatherer Hadza population (Tanzania) from two different studies () (67 stool samples in total), the traditional agro-pastoral Mongolian population () (65 stool samples), and a Peruvian rural community () (36 stool samples). With the Madagascar cohort, this work thus considers a total of 592 non-Westernized compared with 8,836 Westernized samples.

Westernization and urbanization are complex processes that occurred during the last few centuries involving profound lifestyle changes compared to populations prior to the modern era. These changes include increased hygiene and sanitized environments, introduction and large availability of antibiotics and other drugs, switch toward a high-calorie high-fat dietary regimes and toward processed sterilized food, enhanced exposure to xenobiotics and pollutants, reduced contact with wildlife and domesticated animals, and transition from autarchic food production systems to a controlled food chain in a global economy. All these factors are thought to have dramatic effects on the human microbiome that co-evolved with our our body for hundred thousands of years in non-Westernized conditions. In this work, we adopt the terms “Westernized” and “non-Westernized” as umbrella terms to depict populations that differ by at least the majority of the above factors even though this definition comprises very heterogeneous populations.

We considered the whole set of 17,607 microbial species (16,959 bacteria, 648 archaea) available as of March 2018 in the UniProt portal () for which at least one proteome (the set of coding sequences associated with the genome) is available. Quality control performed by UniProt to retain the proteomes and the associated genomes include the availability of a set of annotated coding sequences and the check that the number of coding sequences is statistically consistent with the one of proteomes of neighboring species. We then considered all the available annotated genomes for these species and downloaded them from the NCBI GenBank database () obtaining a total of 80,853 genomes. This large genome set comprises both complete (12%) and draft (88%) genomes, and it is the largest set of microbial isolate genomes with taxonomic assignments and quality-controlled sequences available as of March 2018. Draft genomes include also metagenomic species that are explicitly labeled with the “MAG” abbreviation (n = 37) and co-abundance gene groups metagenomic assemblies (CAGs, n = 377) (). We further added this genome set to the 137 isolate genomes collected in () for a total of 80,990 considered as reference genomes. We refer to this set of 80,990 as “isolate genomes” for brevity, but they also comprise previous metagenomic assembly as mentioned above. To further expand the set of reference genomes we also considered all the 159,803 assemblies available in NCBI as of September 2018.

The relative abundance of each reconstructed genome in the 9,428 metagenomes was calculated from the alignments of the raw reads against the assemblies of the same sample (performed using BowTie2 as reported above). This avoids spurious read assignments (i.e., reads mapping sufficiently well against more than one genome in the same or different species). Indeed, as a direct consequence of the assembly-based approach, it is very rare (< 0.01%) that a read can be assigned to more than one contig assembled from a metagenome containing the read itself. Thus, the relative genome abundance in each sample was defined as the number of reads aligning to each contig of the genome normalized by the total number of reads in the sample. Only primary alignments with alignment length ≥50 nt and edit-distance with respect to the contig ≤2 nt were considered. Abundances at SGB level in each sample were computed as the sum of the abundances of the reconstructed strains belonging to the same SGB.

Each of the 9,428 samples were processed with the standard quality-control employed by metaSPAdes () which includes the read corrector BayesHammer () and then independently subjected to de-novo metagenomic assembly through metaSPAdes () (version 3.10.1; default parameters), which exhibited the best accuracies in recent comparisons among metagenomic assemblers (). Samples that failed to be processed due to memory requirements (>1Tb of RAM), and samples with only unpaired reads, were assembled through MEGAHIT () (version 1.1.1; default parameters). An extended comparison between metaSPAdes and MEGAHIT assemblers across all the datasets considered in this study confirmed that metaSPAdes performs consistently better especially in recovering long contigs ( Figure S7 A). Contigs shorter than 1,000 nt were discarded from further processing. This resulted in 2.04e8 different contigs for a total length of 8.67e11 nt. Reads were mapped to contigs using Bowtie2 () (version 2.2.9; option ‘--very-sensitive-local’) and the mapping output was used for contig binning through MetaBAT2 () (version 2.12.1; option ‘-m 1500’), which showed good performance in comparison with other binning methods (). MetaBAT2 achieved the best performances among single-sample binning tools also in the evaluation performed in the Metawrap paper (), a recent tool for multiple binning. The multiple binning approach looks promising, although lack of independent validation and high computational requirements make it infeasible to be used in the large-scale scenario exploited in this paper at this stage. The procedure of binning through MetaBAT2 generated 345,654 bins (i.e., putative genomes) for a total length of 6.55e11 nt indicating that 75% of the assembled contigs were grouped into bins.

We evaluated the presence of plasmids and viruses within reference genomes and reconstructed SGBs by mapping the 13,924 plasmids and 10,529 viruses in RefSeq against the 80,990 reference genomes and the 154,723 genomes in the SGBs with BLAST (). We filtered alignments shorter than 500 nucleotides and with less than 80% identity. A plasmid or virus was considered to be present if at least 50% of its sequence was covered by any genome or SGBs in our catalog. We found that 37% of the fully sequenced plasmids in the RefSeq repository were represented in the reconstructed genomes (95% in the available reference genomes). The 16S rRNA sequences in the SGB genomes were searched with Barrnap 0.9 (default parameters). The 16S rRNA taxonomy ( Table S4 ) was inferred with RDP rRNA classifier version 2.11 () (default parameters), only on predicted rRNA sequences longer than 500 nucleotides. We set RDP’s minimum confidence threshold to call for each taxonomic level at 75%. Although we confirmed that the 16S rRNA gene is challenging to be recovered by metagenomic assembly (it was recovered in only 7.43% of the reconstructed genomes), the search for the most 400 conserved coding genes from PhyloPhlAn () in the reconstructed genomes and isolate sequencing available for the 9 largest SGBs and the 10 Bacteroides SGBs of Figure 4 , confirmed that cross-species conservation of genes is not an issue for metagenomic assembly. Metagenomically reconstructed genomes recovered more PhyloPhlAn markers in 10 cases and less markers in 9 cases, and all comparisons were within 5% average differences.

Based on these quality estimated and on recent guidelines (), we selected as medium-quality (MQ) genomes those having completeness >50% and contamination <5% resulting in a total of 154,723 microbial genomes. Stricter quality control reduced the set of near-complete, high-quality (HQ) genomes to 70,178 with completeness >90% and no evidence of strong intra-sample strain heterogeneity (<0.5% polymorphic positions). The strain heterogeneity threshold removed 3,653 reconstructed genomes (5.2%) of otherwise HQ genomes, and we verified that these genomes tended to have higher CheckM contamination (although always below the recommended 5% threshold) with a median of 0.74% against 1.56% (p < 1e-50). This provides an additional indication that the CMSeq heterogeneity score helps in controlling strain mixtures and contaminations.

Putative genomes were subjected to quality control to generate the final set of reconstructed draft genomes. Three main measures were taken into account: i) completeness; ii) contamination; and iii) strain heterogeneity. Completeness and contamination were estimated using CheckM () (version 1.0.7; lineage specific workflow), while strain heterogeneity was estimated through a strategy we developed to identify assemblies resulting from strain mixtures even when the strains were very closely related. Following this procedure, reads were mapped against the reconstructed genomes from the same sample using Bowtie2 () (version 2.2.9; option ‘--very-sensitive-local’) and dominant and non-dominant alleles were determined over all protein coding nucleotides. We only considered base calls with a PHRED quality score of at least 30 and only those positions with a coverage of at least 10x. We considered a position as non-polymorphic if the dominant allele frequency was >80%. In order to calculate the polymorphic rate, we then considered only polymorphic positions corresponding to non-synonymous mutations. Validation experiments performed by mixing simulated metagenomic sequencing (with Illumina error models) of 5 randomly selected pairs of strains from each of the the 10 Bacteroides species of Figure 4 at decreasing dominant strain frequency (and thus higher nucleotide-level heterogeneity) confirmed that this approach reflects indeed the expected level of strain mixture. The strain heterogeneity estimation tool is available at https://bitbucket.org/CibioCM/cmseq

A similar analysis was conducted to compare the genomes reconstructed using our fully automated pipeline with the ones obtained through manual curation using anvi’o () ( Figure 7 D; Table S2 ). Manually curated genomes were generated starting from the same set of unbinned contigs. A total of 50 metagenomes from the database considered in this study were randomly selected and assigned to six groups of students that were previously trained for the task of manual curation of contig binning by guided execution and discussion of the available anvi’o tutorials followed by curation of several example metagenomes common to all groups. Each group was asked to bin contigs for the strain with the highest reconstruction quality in the sample. This resulted in 46 manually-curated reconstructed genomes. Our automatic procedure recovered a genome closely matching (>99.5% whole genome genetic identity) the manually-curated one in all 46 cases. The comparison between genomes was done by computing the ANI score through the pyani tool and the results are reported in Figure 7 D and Table S2

Comparison between the genome reconstructed from the automatic pipeline and the one from isolate was done by computing the average nucleotide identity (ANI) and the corresponding alignment coverage using the pyani tool () (version 0.2.6; option ‘-m ANIb’). Results showed that in all cases the genomes reconstructed from metagenomes with our automatic pipeline were almost identical to the genome of the expected isolated strains. For the only case in which this was not true (S. aureus isolate MF093 and paired metagenome CM_cf__CF_FIFC009SS_t3M17__bin.3), we verified with MLST typing (both from assembled and unassembled reads) and with StrainPhlAn () that a clinical event of strain-replacement from ST45 to ST273 occurred.

Genomes reconstructed from metagenomes were compared with the ones of isolates obtained from the same sample, or from samples obtained from the same individual at earlier or later time points ( Figures 7 A–7C; Table S2 ). We compared 18 isolates with 36 genomes reconstructed from metagenomes from 8 different bacterial species. Compared samples included sputum from cystic fibrosis patients (), stool and breast milk samples from mother-infant pairs (), and feces of adults consuming fermented milk product containing a probiotic strain ().

Evaluation of single-sample assemblies against co-assembly and co-binning methods

Ferretti et al., 2018 Ferretti P.

Pasolli E.

Tett A.

Asnicar F.

Gorfer V.

Fedi S.

Armanini F.

Truong D.T.

Manara S.

Zolfo M.

et al. Mother-to-Infant Microbial Transmission from Different Body Sites Shapes the Developing Infant Gut Microbiome. Costea et al., 2017 Costea P.I.

Coelho L.P.

Sunagawa S.

Munch R.

Huerta-Cepas J.

Forslund K.

Hildebrand F.

Kushugulova A.

Zeller G.

Bork P. Subspecies in the global human gut microbiome. Costea et al., 2017 Costea P.I.

Coelho L.P.

Sunagawa S.

Munch R.

Huerta-Cepas J.

Forslund K.

Hildebrand F.

Kushugulova A.

Zeller G.

Bork P. Subspecies in the global human gut microbiome. Li et al., 2015 Li D.

Liu C.-M.

Luo R.

Sadakane K.

Lam T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Li and Durbin, 2009 Li H.

Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Alneberg et al., 2014 Alneberg J.

Bjarnason B.S.

de Bruijn I.

Schirmer M.

Quick J.

Ijaz U.Z.

Lahti L.

Loman N.J.

Andersson A.F.

Quince C. Binning metagenomic contigs by coverage and composition. In order to provide a comparison to the single sample strategy employed here, we co-assembled and co-binned a subset of the data where multiple samples from the same individual were available. Samples were taken from two studies: the already described investigation of the microbiome of newborns and of their mothers (), and a study considering fecal microbiome time series for adults (). From the first, we selected 22 infants for which at least 3 fecal samples taken during the first four months post-partum were available (maximum 5, median 4). We also co-assembled 21 fecal samples from the mothers from the same study to provide a comparison against cross-sectional co-assembly. Somewhat longer fecal time series were available from the second study, from which we selected four individuals with a number of time points between eight and ten (). This gave us a total of 26 longitudinal time series from the same individual and one cross-sectional study (21 individuals) each of which we co-assembled using MEGAHIT () with default parameters except for the kmer-list set to (21,31,..,99). The assembled contigs were then cut into 10kbp fragments and the reads from each sample within the time series (or mother in the cross-sectional study) were then mapped back onto the contig fragments using BWA and a per sample depth of coverage was calculated (). The contig fragments were then clustered using the CONCOCT algorithm (default parameters) which combines both tetramer composition and coverage in a Gaussian mixture model after a PCA based dimensionality reduction (). Following clustering, a consensus cluster assignment across fragments was given to each contig to assign clusters based on the original co-assembly.

Tatusov et al., 2003 Tatusov R.L.

Fedorova N.D.

Jackson J.D.

Jacobs A.R.

Kiryutin B.

Koonin E.V.

Krylov D.M.

Mazumder R.

Mekhedov S.L.

Nikolskaya A.N.

et al. The COG database: an updated version includes eukaryotes. Alneberg et al., 2014 Alneberg J.

Bjarnason B.S.

de Bruijn I.

Schirmer M.

Quick J.

Ijaz U.Z.

Lahti L.

Loman N.J.

Andersson A.F.

Quince C. Binning metagenomic contigs by coverage and composition. We called ORFs on the co-assembled contigs and assigned COGs () using RPSBlast. The same procedure was applied to the genomes reconstructed by single sample assembly from the same set of samples used in each co-assembly. We then selected only those reconstructed genomes from both studies that possessed more than 75% of a panel of 36 single copy core genes in single copy (). To remove redundancy across reconstructed genomes from the single sample clustering (i.e., same genome reconstructed at multiple time points from the same individual), and to determine the intersection of genomes between the two approaches, we then performed a hierarchical average linkage clustering of all the genomes from both methods and clustered at 1% nucleotide identity on the core gene panel. The results of such procedure are given in Table S2 . We then also evaluated the genomes obtained by the co-assembly by computing CheckM completeness, CheckM contamination, and CMSeq heterogeneity as described for the single-assembly reconstructed genomes. Co-assembled genomes were then assigned to the HQ or MQ category with the same thresholds used for the single-assembly reconstructed genomes. The number of HQ and MQ genomes obtained with the two approaches was then compared, and additional genome quality metrics such as genome length, N50, completeness estimate, and contamination estimates were considered. For the genomes obtained by single-sample assembly, the grouping into SGBs was used to compare the number of distinct species obtained compared to the co-assembly approach. This second set of evaluations is also reported in Table S2

For the short infant time series, the increase in number of genomes obtained by co-assembling and co-clustering was typically modest after collapsing closely related strains from single-genome assembly (median increase of 3% for the 36-core gene based evaluation - Table S2 , 6.87% for the CheckM-based evaluation with thresholds for HQ genomes - Table S2 ). Without removal of closely related strains, single-genome assembly recovered more genomes (12% HQ genomes, 50% MQ genomes) because the same strains (or closely related ones) were recovered at multiple time points ( Figures 7 E and S7 B).

The improvement for the co-assembly approach was more clear from the second study where at least eight time points were available (median increase 31% - Table S2 ). Across all the considered individuals there was a weak correlation between increase in the number of reconstructed genomes obtained from co-clustering and sample number (p = 0.08). We conclude that co-assembling and co-binning of gut metagenomes requires a moderate number of samples (more than 5) to achieve substantial improvements. The co-assembly of mothers yielded an increase of 3% in the number of HQ genomes (after merging single-sample assemblies into 99% identity genome bins) when using the 36 single-copy genes for quality control ( Table S2 ), and a decrease from 124 to 88 HQ SGB-grouped genomes when using >90% CheckM completeness and < 5% CheckM contamination thresholds ( Table S2 ). Other genome quality statistics were very similar between the two approaches with however the co-assembly method showing slightly more contamination (1.7% against 0.9% for HQ genomes, Table S2 ). Overall, this suggests that large scale co-assembly may at best offer limited improvement in terms of overall recovered diversity.

Quince et al., 2017b Quince C.

Delmont T.O.

Raguideau S.

Alneberg J.

Darling A.E.

Collins G.

Eren A.M. DESMAN: a new tool for de novo extraction of strains from metagenomes. Truong et al., 2017 Truong D.T.

Tett A.

Pasolli E.

Huttenhower C.

Segata N. Microbial strain-level population structure and genetic diversity from metagenomes. Quince et al., 2017b Quince C.

Delmont T.O.

Raguideau S.

Alneberg J.

Darling A.E.

Collins G.

Eren A.M. DESMAN: a new tool for de novo extraction of strains from metagenomes. Truong et al., 2017 Truong D.T.

Tett A.

Pasolli E.

Huttenhower C.

Segata N. Microbial strain-level population structure and genetic diversity from metagenomes. It is of note that the co-assembly approach can reconstruct only one bin per species or subspecies ( Figures 7 E and S7 B) and on a large cross-sectional database such as the one considered in this study, this would effectively be a composite population-level genome incorporating both variation in single-nucleotide variants on core genes and variation in accessory genes. It is possible to resolve this variation on co-assemblies via single-nucleotide variant calling () and when this is followed by deconvolution of haplotypes across samples as employed in the DESMAN pipeline () this does allow the reconstruction of whole-genome haplotypes and assignment of accessory genes to specific strains. However, when most species are present in a single dominant strain as it is the case in the human microbiome (), directly assembling strains from individual samples is a more straightforward strategy that both avoids the deconvolution step and uncertainties associated with variant calling from mapped reads. It is therefore more suitable for the very large scale analyses considered here where the aim is to generate a small number of HQ strains from each sample to provide the most comprehensive picture of overall diversity in the human gut.

The general conclusion of this comparison is thus that co-assembly and co-binning approaches would be useful for retrieving substantially more genomes in relatively long (>5) subject-specific time series, whereas the potential advantage of retrieving more low-abundance species in a cross-sectional co-assembly is overcome by the disadvantage of having to use more complex approaches such as DESMAN to resolve the strain variation. That is perhaps more appropriate where the aim is to extract as much information as possible from a single study rather than to produce a single comprehensive high fidelity strain catalog. Because time series comprising more than 5 samples from the same subject and body site are very rare in the available cohorts (only 70 individuals - i.e., 1.0% - in our database), co-assembly is not considered in the present work as it would not provide advantages.