Publicly available human gut metagenomes

We downloaded 11,523 sequencing runs for publicly available human gut metagenomes from the NCBI SRA39. These data correspond to 3,810 samples, 15 studies9,21,40,41,42,43,44,45,46,47,48,49,50,51 (and https://olive.broadinstitute.org/projects/infant_gut_flora_and_antibiotics) and >181 billion sequencing reads with an average length of 100 bp (Supplementary Tables 1, 2). Sequencing metadata were obtained from the SRAdb relational database52 and host metadata were obtained from either the NCBI BioSample database53 or from supplementary datasets linked to publications (Supplementary Table 3). No metadata were available online or upon request from the Fiji cohort9; these individuals were treated as healthy adults from a rural population.

Metagenome assembly and binning

We co-assembled the 11,523 sequencing runs for each of the 3,810 biological samples using MegaHIT v.1.1.154 with default parameters. This resulted in 333,661,332 contigs longer than 200 bp, totalling 453.5 × 109 bp, with an average N50 of 12,460 bp (Supplementary Table 2). Human gut MAGs were generated per sample using three different tools with default options: MaxBin v.2.2.455, MetaBAT v.2.12.156 and CONCOCT v.0.4.010, which all use a combination of sequence composition and coverage information. DAS Tool v.1.1.057 with option ‘-score_threshold 0’ was used to integrate MAGs produced by the three tools. Contigs shorter than 1 kb were discarded. This process resulted in 152,591 MAGs longer than 100 kb, which totalled 73,632,219 contigs (22% of total assembled) and 310.7 × 109 bp (69% of total assembled). All MAGs were screened for contamination against the human genome (build 38) and phiX genome with BLASTN v.2.6.058.

Refinement of MAGs on the basis of alignment of contigs between conspecific genomes

To refine MAGs from the HGM dataset, we performed pairwise alignment of contigs between MAGs and other closely related, near-complete MAGs and reference genomes (Supplementary Table 6). Our logic was that strains of the same species should share homology between most contigs, and that contigs that fail this condition (that is, are present in one genome but absent in the other) probably represent contamination. For each input MAG, we used Mash v.2.059 to find at least five closely related, near-complete genomes in the database (>95% estimated completeness, <5% estimated contamination, Mash distance ≤ 0.05, P ≤ 0.001), and then used BLASTN to align contigs between each MAG and all target genomes. Contigs in the MAG that failed to align at ≥70% nucleotide identity over ≥25% length to any of the closely related genomes were flagged for removal.

Refinement of MAGs on the basis of taxonomic annotation of contigs

We identified and removed taxonomically discordant contigs from MAGs using two complementary approaches (Supplementary Table 6). The first approach performs taxonomic annotation on the basis of universal single-copy marker genes. Hidden Markov models for marker-gene families were downloaded from the PhyEco database60, and searched against MAGs with HMMER v.3.1b261. A subset of 100 (for Archaea) or 88 (for Bacteria) gene families was used. Marker genes found in MAGs were then aligned against a reference database of taxonomically annotated marker genes from reference genomes using BLASTP. For each gene, we transferred the taxonomy of the best hit in the reference database at the appropriate rank on the basis of the percentage of amino acid identity cut-offs specific to each gene family at each rank. We then taxonomically annotated each MAG on the basis of the consensus taxonomy of marker genes at the lowest rank, such that >70% of marker genes were annotated. Contigs were flagged for removal if they (1) contained a taxonomically discordant marker gene, and (2) lacked a concordant marker gene. The second approach for taxonomic refinement is similar to the first, except that 855,764 clade-specific prokaryotic marker genes from the MetaPhlAn227 database were used for taxonomic annotation after excluding ‘pseudo markers’ that are not unique to a clade.

Refinement of MAGs on the basis of outlier nucleotide composition and sequencing read depth

Using an approach similar to a previously published method11, we identified and removed contigs from MAGs with either (1) outlier GC content, (2) outlier tetranucleotide frequency or (3) outlier sequencing read depth (Supplementary Table 6). We used principal component analysis to reduce the tetranucleotide frequency dimensionality down to the first principal component (tetranucleotide frequency PC1). For each MAG, we then measured the average GC content, average tetranucleotide frequency PC1 and average sequencing read depth. Contigs were flagged for removal if they deviated from these averages beyond cut-offs selected to minimize reduction in completeness (Supplementary Table 6).

Validation of MAG refinement pipeline

We simulated 1,000 human gut MAGs to validate our overall MAG refinement strategy (Supplementary Table 7). Each simulated MAG contained two genomes: one ‘host’ genome (representing the target genome) and one ‘donor’ genome (representing the contaminating genome). All 102 genomes used in simulations were isolated from the human gut, and were estimated to have >95% completeness, <1% contamination and <25 contigs. MAGs were simulated with completeness (mean = 61.9%), contamination (mean = 10.0%) and N50 (mean = 35.8 kb) on the basis of randomly sampled MAGs from the HGM dataset. MAGs were dropped in cases in which contamination exceeded completeness, and thus the host genome was in the minority. The refinement pipeline was applied to each simulated MAG and—to evaluate the pipeline—we quantified the overall reduction in completeness and contamination (Extended Data Fig. 1a, b).

Application of refinement strategies to the HGM dataset

We applied each of the refinement approaches described above to the MAGs (Extended Data Fig. 2c, Supplementary Table 6). In rare cases, these approaches may erroneously flag a large proportion of a MAG. To avoid this, we applied a particular approach to a MAG only if it resulted in ≤25% reduction in total length. The five approaches combined removed 5,251,859 contigs (7.13% of total) and 20,821.2 Mb (6.70% of total) from the MAGs. After removing potential contaminants, we were left with 152,279 MAGs with a total length ≥100 kb and 10,036 individual contigs longer than 100 kb that were either unbinned or removed during decontamination. These long contigs were included with other MAGs, which brought the total number to 162,315.

Quality assessment of MAGs

CheckM v.1.0.713 was used to estimate completeness and contamination of the 162,315 recovered MAGs (Supplementary Table 6); CheckM is based on the copy-number of lineage-specific single-copy genes. Additional statistics were obtained for each genome, including the contig N50, number of contigs, average contig length, contig read-depth, and number of tRNA and rRNA genes. tRNAs were identified using tRNAscan-s.e. v.1.3.162 and rRNA genes using Barrnap v.0.9-dev63 with options ‘–reject 0.01 –evalue 1e-3’. We identified 60,664 MAGs that met the MIMAG medium-quality criteria of ≥50% completeness with ≤10% contamination14. For analyses that required near-complete genomes, we used a subset of 24,345 high-quality MAGs that were ≥90% complete, ≤5% contaminated, with an N50 ≥ 10 kb, an average contig length ≥5 kb, ≤500 contigs and ≥90% of contigs with ≥5× read-depth.

Estimation of SNP density

Read mapping and SNP calling were performed to assess the genetic diversity of each MAG (Supplementary Table 5). Bowtie 2 v.2.3.464 was used to construct a database of MAGs for each sample, and to align metagenomic reads. Reads with low mapping and sequence quality were discarded (quality scores <20 and <30, respectively), and we counted the occurrence of nucleotides with quality ≥30 across each MAG. To compare SNPs between MAGs sequenced to different depths, we down-sampled each MAG to 40 mapped reads per site. MAGs with at least 200,000 sites of ≥40× depth were retained for analysis. A SNP was called if at least two reads matched the alternative allele at a genomic site. SNP density was calculated as the number of SNPs per kilobase.

Reference genomes used for comparison

We downloaded 201,102 publicly available bacterial and archaeal reference genomes from the Integrated Microbial Genomes (IMG; https://img.jgi.doe.gov/)65 (n = 61,713) and Pathosystems Resource Integration Center (PATRIC; https://www.patricbrc.org/)66 (n = 139,389) databases, on 16 January 2018. These included genomes from 2 human gut culturomics studies6,7 and 16,525 previously published MAGs, including a previous MAG study from the human gut20 and nearly 8,000 MAGs assembled from SRA metagenomes11. To remove redundancy within and between databases, we used Mash59 with default parameters to cluster genomes with a Mash distance of 0.0, which are expected to be identical. This resulted in 153,900 non-redundant reference genomes, of which 127,419 were classified as high quality, 18,498 as medium quality and another 7,983 as low quality (Supplementary Table 9).

Species-level clustering of reference genomes and MAGs

Using an approach similar to a previously published method67, we clustered the 60,664 MAGs and 145,917 reference genomes meeting or exceeding the MIMAG medium-quality standard into species-level OTUs on the basis of 95% whole-genome ANI (Supplementary Table 10). We first performed single-linkage clustering of genomes on the basis of a Mash ANI of 99%, which resulted in 79,675 clusters that can be confidently assigned to the same species-level OTU. Mash is extremely fast, although it can underestimate ANI for incomplete genomes67. To address this, we used the ANIcalculator v.1.068 to compute whole-genome-based ANI (gANI) between the 99%-identity clusters, and required that at least 20% of genes were aligned. The 20% cut-off was chosen to minimize the negative effects of incomplete genomes, and to avoid the formation of spurious OTUs (Extended Data Fig. 5a). To increase computational efficiency, we calculated gANI only for genome pairs with >90% Mash ANI. Genomes were clustered into OTUs using average-linkage hierarchical clustering with a 95% gANI cut-off using the package MC-UPGMA v.1.0.069, which yielded 23,790 OTUs.

All OTUs were taxonomically annotated using the tool GTDBTk v.0.0.6 (release 80, www.github.com/Ecogenomics/GtdbTk), which produces standardized taxonomic labels that are based on those used in the Genome Taxonomy Database26. Additionally, we constructed pan-genomes on the basis of clustering all genes within each OTU, using VSEARCH v.2.4.370 with 90% DNA identity and 50% alignment cut-offs (maximum 500 genomes per OTU). Human gut OTUs were identified from the set of 23,790 OTUs on the basis of (1) containing a MAG from the HGM dataset, (2) being detected by IGGsearch (see ‘Development of IGGsearch for metagenomic profiling of species-level OTUs’) in at least 1 of 3,810 metagenomes used for MAG recovery or (3) containing a genome isolated from the human gut (Extended Data Fig. 6a, b, Supplementary Table 10). A total of 4,558 species-level OTUs were annotated as being found in the human gut, on the basis of a combination of the three criteria.

Phylogenetic analysis of MAGs and reference genomes

We constructed phylogenetic trees of MAGs and reference genomes using concatenated alignments of conserved, single-copy marker-gene families from the PhyEco database60 for Bacteria (n = 88 genes) and Archaea (n = 100 genes). Individual marker genes were identified using HMMER v.3.1b2 with bit-score cut-offs that are specific to gene family. For computational efficiency, genomes were collapsed down to species-level OTUs, which were represented as individual leaves in the phylogenetic tree. To reduce the effect of contamination, taxonomically discordant marker genes were removed, as described in ‘Refinement of MAGs on the basis of taxonomic annotation of contigs’. FAMSA v.1.2.571 was used to construct protein-based multiple sequence alignments for each gene family. Columns with >15% gaps were removed, alignments were concatenated and sequences with >70% gaps were removed (n = 39). FastTree2 v.2.1.1072 was used to build a maximum likelihood phylogeny for Bacteria and Archaea with default options. All trees were visualized using iTOL v.373. To quantify the gain in phylogenetic diversity from the HGM dataset, we computed the total branch length of two subtrees: a tree of all 4,558 gut OTUs (PD Gut ) and a tree of 2,500 gut OTUs with reference genomes (PD RefGut ). The percentage gain in phylogenetic diversity was computed as: 100 × (PD Gut − PD RefGut )/PD RefGut . To identify OTUs for higher-rank groups, we performed average-linkage hierarchical clustering of phylogenetic distances, which was implemented in R (Supplementary Table 10). Rank-specific cut-offs were identified by maximizing similarity to the Genome Taxonomy Database for reference genomes (Extended Data Fig. 3d, e).

Development of IGGsearch for metagenomic profiling of species-level OTUs

Using an approach similar to MetaPhlAn227, we developed an accurate and efficient tool for quantifying the abundance of species-level OTUs from unassembled metagenomes. First, we identified marker genes for each OTU (Supplementary Fig. 1a). Up to 300 genes from the pan-genome of each OTU were selected with the maximum intra-OTU frequency and minimum inter-OTU frequency. The intra-OTU frequency was computed as the fraction of genomes within an OTU in which a gene was found at 90% DNA identity. The inter-OTU frequency was determined on the basis of DNA alignments (using HS-BLASTN v.0.0.574) between each gene and the pan-genomes of other OTUs, and accounts for (1) the number of other pan-genomes in which the gene is found, (2) the frequency of the gene in each pan-genome and (3) the percentage of identity of each alignment. For computational reasons, genes were first aligned within each phylum, and only the 300 top-scoring candidates per OTU were subsequently checked for uniqueness between phyla. A total of 6,198,663 marker genes were identified for 23,790 OTUs.

A large number of OTUs contained just a single genome, which made it difficult to accurately predict conserved genes. To refine our marker-gene set, we used abundance co-variation information, which is a common strategy for binning genetic regions from the same species and has previously been applied3,10,20,21,56. Specifically, we performed read-mapping of the 3,810 metagenomic samples to the database of 6,198,663 marker genes using Bowtie 2 v.2.3.4 and quantified the read depth of each gene in each sample. We used average linkage clustering to group genes from each OTU into co-variance groups on the basis of Pearson correlations of read depth across samples (Supplementary Fig. 1b). After applying a correlation threshold of 0.90, we selected the largest cluster of genes for the final marker-gene set. This procedure removed 55,132 genes for 1,402 OTUs that were present in ≥10 samples with ≥1× coverage.

IGGsearch is a command-line tool that uses Bowtie 2 to map metagenomic reads to the database of marker genes and quantify species-level OTUs. Read alignments are removed with low percentage of identity (minimum = 95%), alignment coverage (minimum = 70% of read) and base quality (minimum = 20). For each metagenomic sample, OTU relative abundance is estimated by taking the average read depth across marker genes and normalizing these values to 1.0 across all OTUs. Species presence is determined on the basis of the percentage of marker genes with at least one mapped read.

The sensitivity and specificity of IGGsearch was evaluated on two benchmark datasets. First, we benchmarked IGGsearch on the CAMI challenge dataset (https://data.cami-challenge.org/participate) (Supplementary Tables 11, 12, Supplementary Fig. 2a). Second, we benchmarked IGGsearch on simulated gut metagenomes that contained between 500,000 and 50,000,000 paired-end reads, read length of 100 bp, Illumina-style sequencing error and 1 genome from each of 100 randomly selected gut species-level OTUs (Supplementary Fig. 2b). On the basis of these benchmarks, we called OTUs present when at least 15% of their marker genes were detected, which gave a good balance between sensitivity and specificity.

Metagenome-wide association of species abundance with disease

We used IGGsearch species profiles to identify species-level OTUs associated with disease for ten previously published studies, including colorectal cancer43, type 2 diabetes21,44, rheumatoid arthritis42, Parkinson's disease75, atherosclerotic cardiovascular disease76, ankylosing spondylitis77, non-alcoholic fatty liver disease78, liver cirrhosis79 and obesity80 (Extended Data Table 1, Supplementary Tables 15, 16). To identify species–disease associations, we compared the relative abundances of species for the 4,558 human gut species-levels OTUs between cases and healthy controls using the Wilcoxon rank-sum test. Non-gut OTUs were excluded to reduce the effect of multiple hypothesis testing. For each disease, P values were corrected for multiple hypothesis tests using the Benjamini–Hochberg procedure. We performed the same statistical procedure using species profiles from three other tools: MIDAS v.1.3.04, MetaPhAn2 v.2.7.727 and mOTU v.1.1.13. All tools were run with default parameters and the distributed reference data. To prevent confounding signals owing to disease treatment, we excluded 100 individuals taking drugs that affect microbiome composition, including metformin in patients with type 2 diabetes21,44, acarbose, atorvastatin, fondaparinux and metoprolol in patients with atherosclerotic cardiovascular disease76, and antirheumatic drugs in patients with rheumatoid arthritis42.

Machine learning models for disease prediction

We constructed random forest models implemented in the scikit-learn Python package (https://scikit-learn.org) to predict disease state from species abundance profiles generated with IGGsearch, MIDAS, mOTU and MetaPhlAn2 (Extended Data Table 1). For IGGsearch, we included all 23,790 species OTUs and allowed the random forest model to choose the most-predictive OTUs. Random forest models were implemented in the scikit-learn package v.0.19.181 and were optimized for each of the four tools for each of the ten diseases. Specifically, we tested 1,000 random combinations of parameter values for (1) the number of trees in the forest, (2) the number of features to consider at each split, (3) the maximum number of levels in each tree, (4) the minimum number of samples to split a node, (5) the minimum number of samples at each leaf and (6) whether to use bootstrapping during model training. To avoid overfitting, each model was evaluated using tenfold cross-validation and the combination of parameters that yielded the best receiver operating curve (ROC) area under the curve (AUC) was selected. To obtain robust estimates of model performance, all models were re-run 100 times and ROC AUC values were averaged across runs.

Identifying genomic features and auxotrophies of uncultured gut bacteria

We selected a subset of 504 human gut species-level OTUs from bacteria for comparative genomic analysis between cultured and uncultured organisms (Supplementary Table 17). OTUs with <5% prevalence in human gut metagenomes were excluded, because rare organisms may be amenable to cultivation but not yet sampled. Uncultivated OTUs were defined as those that contain only MAGs (either from the current study or previous studies, n = 271) and cultivated OTUs as those that contain at least one isolate genome (n = 233). We based all comparative analysis between OTUs using 24,345 high-quality MAGs from the HGM dataset, which was done (1) to avoid biases that result from a comparison of MAGs to isolate genomes (which differ in assembly quality) and (2) to avoid issues arising from low completeness among MAGs in the medium-quality tier.

We compared several broad genomic features between groups, including estimated genome size, GC content, coding density and estimated replication rate. Estimated genome size was corrected for completeness and contamination using: Ĝ = G × 100/Ĉ − (G × \(\hat{T}\)/100), in which Ĝ is the estimated genome size of a MAG, G is the observed genome size, Ĉ is the estimated percentage completeness and \(\hat{T}\) is the estimated percentage contamination. Replication rate was estimated with iRep v.1.1028 for MAGs with >5× read-depth, which is based on differences in sequencing depth between the origin and terminus of replication. Genomic features were averaged across all high-quality MAGs for each OTU, and then compared between OTUs using the Wilcoxon rank-sum test (Supplementary Table 18).

To identify potential auxotrophies, we compared the prevalence of genes, modules and pathways from the KEGG database (release 77.1)82 between cultivated and uncultivated OTUs. Proteins from high-quality MAGs were annotated on the basis of amino acid alignments to KEGG using LAST v.82883, and assigned to the KEGG orthology group with lowest value of E < 1 × 10−5. Next, we computed the fraction of MAGs per OTU that contained each KEGG orthology group, and compared these values between OTUs using the Ives–Garland test implemented in the phylolm R package v.2.684. The Ives–Garland test performs logistic regression while controlling for differences in phylogeny between groups, and has previously been applied to microbiome data85. This analysis was repeated for modules and pathways from the KEGG database. P values were corrected for multiple hypothesis tests using the Benjamini–Hochberg procedure (Supplementary Table 19). This same analysis was performed for functions from the TIGRFAM database (release 15.0)86 (Extended Data Fig. 9a).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.