No statistical methods were used to predetermine sample size. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Metagenomic datasets

We extracted 13,133 sequencing runs classified as human gut metagenomes in the European Nucleotide Archive (ENA), encompassing 75 different studies (Supplementary Table 1). Metadata (location, age, health status and antibiotic usage) for each individual sampled was retrieved through the ENA API with the mg-toolkit (https://pypi.org/project/mg-toolkit/) and further curated by inspecting the publications linked to each project when available. Samples were classified as having been obtained from healthy individuals only if explicitly stated in their original study.

De novo assembly and binning

Raw reads from each run were first assembled with SPAdes v.3.10.020 with option --meta21. Thereafter, MetaBAT 215 (v.2.12.1) was used to bin the assemblies using a minimum contig length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of coverage required for the binning was inferred by mapping the raw reads back to their assemblies with BWA-MEM v.0.7.1645 and then calculating the corresponding read depths of each individual contig with samtools v.1.546 (‘samtools view -Sbu’ followed by ‘samtools sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.722 using the lineage_wf workflow and calculated as: level of completeness − 5 × contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from INFERNAL v.1.1.247 (options -Z 1000 --hmmonly --cut_ga) using the Rfam48 covariance models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the sum of all non-overlapping hits. Each gene was considered present if more than 80% of the expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified with tRNAscan-s.e. v.2.049 using the bacterial tRNA model (option -B) and default parameters. Classification into high- and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenome-assembled genome (MIMAG) standards23 (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S rRNA genes, and at least 18 tRNAs; medium: ≥ 50% completeness and <10% contamination). Given that only 240 of the MAGs with >90% completeness and <5% contamination passed the MIMAG thresholds regarding the presence of rRNA and tRNA genes due to known issues relating to the assembly of rRNA regions16,50, we refer to our highest quality MAGs as ‘near complete’16 instead. VirFinder v.1.151 was used to predict the presence of viral contigs within the 13,133 human gut assemblies generated with SPAdes. This tool uses a k-mer-based, machine-learning approach to detect distinguishing signatures between virus and host (prokaryotic) sequences. Expected P values for the presence of viral sequences were calculated for each contig with ≥5 kb length and subsequently corrected for multiple testing using the Benjamini–Hochberg method with a FDR threshold of 10%.

Assignment of MAGs to reference databases

Four reference databases were used to classify the set of MAGs recovered from the human gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination) retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the HGG8. From the RefSeq database, we used all the complete bacterial genomes available (n = 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053 eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August 2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly available as of August 201813,16,17,18,19, including those deposited in the Integrated Microbial Genomes and Microbiomes (IMG/M) database52. For each database, the function ‘mash sketch’ from Mash v.2.053 was used to convert the reference genomes into a MinHash sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG and the set of references was calculated with ‘mash dist’ to find the best match (that is, the reference genome with the lowest Mash distance). Subsequently, each MAG and its closest relative were aligned with dnadiff v.1.3 from MUMmer 3.2354 to compare each pair of genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.

Genome dereplication

To dereplicate the collection of unclassified bacterial MAGs (AQ <60% or ANI <95% against the target references), high-level similarity clusters were first generated with Mash53. In brief, a MinHash sketch was created for these genomes to perform an all-against-all comparison. Then, a hierarchical clustering was built from the Mash distance relationships and individual clusters were defined at a cut-off of 0.2. Each cluster was subsequently dereplicated with dRep v.2.2.255 to extract the MAGs displaying the best quality and representing individual metagenomic species (MGS). dRep was run with options -pa 0.9 (primary cluster at 90%), -sa 0.95 (secondary cluster at 95%), -cm larger (coverage method: larger), -con 5 (contamination threshold of 5%). For the near-complete MAGs, the -nc parameter was set to 0.60 (coverage threshold of 60%), whereas for the medium-quality MAGs with a QS >50 this was changed to 0.30 (coverage threshold of 30%). The 2,468 HR genomes were also dereplicated into 956 representative species with dRep, using the criteria defined above for the near-complete MAGs. These included 553 species collected specifically from the human gut, referred to as HGR.

Phylogenetic and taxonomic analyses

Genes were predicted using prodigal v.2.6.356 (default single mode) and 40 universal core marker genes from each genome were extracted using specI v.1.032. Phylogenetic trees were built by concatenating and aligning the marker genes with MUSCLE v.3.8.31. Marker genes absent only from specific genomes were kept in the alignment as missing data. Maximum-likelihood trees were constructed using RAxML v.8.1.1557 with option -m PROTGAMMAAUTO. All phylogenetic trees were visualized in iTOL58. Phylogenetic diversity was quantified by the sum of branch lengths using the phytools R package59.

Taxonomic classification of each MGS was performed with both CheckM and UniProtKB29. First, the function tree_qa from CheckM was used to infer the approximate phylogenetic placement of the MGS genome within the CheckM internal reference tree (which comprised 2,052 finished and 3,604 draft genomes). Those classified at least at the class rank were then compared with the taxonomic assignment deduced from protein alignments against UniProtKB (release 2018_04) using the blastp function of DIAMOND v.0.9.17.11860. A positive hit at the species level was inferred if ≥60% of the proteins had ≥80% of the sequence aligned with an amino acid identity of ≥96%, based on previously reported thresholds26,33. Genomes within UniProtKB were presumed to represent cultured species if labelled with a full species name lacking any of the following terms: uncultured, sp. or bacterium. For those MGS without an assigned species (UMGS), a genus-level boundary was set with the following criteria, as previously defined61: at least 50% of the proteins with an e value less than 1 × 10−5, a sequence identity of more than 40% and a query coverage above 50%. In case the taxon predicted with UniProt was missing from the CheckM reference database, the full lineage was manually inspected to determine the most likely annotation. Owing to possible mislabelling of the UniProt entries, the CheckM taxonomic lineage was kept if there were incongruences between both classifications. Lastly, the positioning of the UMGS genomes within the HGR phylogenetic tree was used to resolve further inconsistencies or misclassifications.

Technical reproducibility and cluster quality

A random subset of 1,000 metagenomes (Supplementary Table 1) was tested with two additional approaches to assess the reproducibility of the MAGs generated here. With one of the methods, metagenomes were assembled with MEGAHIT v.1.1.324 and subsequently binned with MetaBAT 2, MetaBAT 1 and MaxBin v.2.2.462. A refinement step was then performed using the bin_refinement module from MetaWRAP v.1.025 to combine and improve the results generated by the three binners. The second method involved a modified co-assembly approach, in which individual assemblies from the same study were first merged and dereplicated with CD-HIT v.4.763 (cd-hit-est with option -c 0.99 defining a sequence identity threshold of 99%). Metagenomic datasets were then mapped to their merged, non-redundant assembly with BWA-MEM to obtain co-abundance information for binning with MetaBAT 2 (with option --minContig 2000). The resulting MAGs with a QS >50 obtained with each method were compared to the MAGs recovered with our main pipeline (individual assembly with SPAdes, plus binning with MetaBAT 2) for the same 1,000 datasets, using the combined Mash and MUMmer workflow described above.

To further assess the level of potential contamination of the MGS reported, we analysed the quality of the Mash clusters containing each MGS using the Matthews Correlation Coefficient (MCC). First, CompareM v.0.0.23 (https://github.com/dparks1134/CompareM) was used to analyse the average amino acid identity (AAI) of the specI marker genes within and between Mash clusters. To be able to estimate the MCCs, true positives, false negatives, false positives and true negatives were determined based on three different AAI thresholds: 90%, 95% and 97%. For each pairwise comparison, we considered a true positive when both MAGs belonged to the same cluster and had an AAI equal to or above the threshold; false negatives if they belonged to the same cluster, but the AAI was below the threshold; false positives when the genomes were included in different clusters, but their AAI was equal to or above the threshold; and true negatives corresponded to genomes from different clusters with an AAI below the threshold. Thereafter, MCCs were calculated with the mcc function from the mltools64 R package. Possible values range from −1 to 1, with 1 indicating perfect agreement between the Mash clustering and the marker genes AAI.

Functional characterization

Functional prediction analyses were carried out for the 1,952 UMGS and the dereplicated set of 553 HGR genomes. Predicted genes were first functionally characterized with InterProScan v.5.27-66.036 with options -goterms and -pa. The presence of microbial BGCs was inferred with antiSMASH 435, using option --knowclusterblast to determine the number of BGCs that matched the MIBiG repository. GO39,40 annotations were deduced for each gene based on the InterPro (IPR) entries, and translated to GPs37,38 using the assign_genome_properties.pl script present in http://github.com/ebi-pf-team/genome-properties. GhostKOALA42 was used to generate KO annotations of the protein-coding sequences. Differential abundance analysis of GO slim and KO term frequencies between the UMGS and HGR genomes was performed with the compositional data analysis tool ALDEx265. Because we were evaluating genomes with differing lengths and degrees of completeness, this method was used to take into account discrepancies in total gene counts. The aldex.clr function was used with 128 Monte Carlo instances sampled from a Dirichlet distribution to generate a distribution of probabilities for each GO slim/KO term consistent with the observed data. These were subsequently converted to distributions of log ratios to account for the compositional nature of the data. The aldex.effect function was used to calculate the expected value of the difference between distributions of each group (median log 2 difference), the expected value of the pooled group variance (median log 2 dispersion) and the standardized effect sizes on the abundance difference of each GO/KO classification. The effect-size measure used is similar in concept to Cohen’s d but is calculated on the distributions themselves rather than on the summary statistics of those distributions, resulting in metrics that are relatively robust and efficient66. Lastly, the aldex.ttest was used to perform non-parametric Wilcoxon rank-sum tests on the GO/KO frequencies between the two test groups (UMGS and HGR). GPs, classified as ‘yes’, ‘no’ and ‘partial’ were converted to 2, 0 and 1, respectively, and those more prevalent specifically among the UMGS genomes were detected with a two-tailed χ2 test. The expected P values from all the statistical tests were corrected for multiple testing with the Benjamini–Hochberg method. A PCA was carried out on the GP distributions of the HGR and UMGS genomes, using the FactorMineR67 package. Separation according to phylum and genome type was assessed with the ANOSIM test based on the Gower distances between the GP profiles.

Species prevalence and abundance

Read classification of the 13,133 human gut metagenomic datasets was performed with sourmash v.2.0.0a468 against the HR, RefSeq and UMGS genome collections. Signature files were generated for both the reference (FASTA) and query (FASTQ) files, with ‘sourmash compute --scaled 1000 -k 31 --track-abundance’. For each set of references, a lowest common ancestor database was created (‘sourmash lca index --scaled 1000 -k 31’), with each genome representing a unique species lineage. Raw reads were then compared with ‘sourmash lca gather’ against each database. Species prevalence and abundance was determined with BWA-MEM, where species presence was inferred by assessing the level of genome coverage, mean read depth and depth evenness. First, we calculated depth and variation penalty scores corresponding to the missing coverage (100% − genome coverage) multiplied by either the log(mean depth) or the depth coefficient of variation (defined as the standard deviation of read depth divided by the mean), respectively. These metrics allowed us to gauge both coverage and depth simultaneously, as genomes that have a high mean depth (or high depth variation) but are not well covered are less likely to be present in the sample than those that have the same level of coverage with lower read depth. Thresholds for determining genome presence were set at a minimum coverage of at least 60%, and both depth and variation penalty scores at a maximum of the 99th percentile (Extended Data Fig. 7). Relative abundance of each species was determined by the proportion of uniquely mapped and correctly paired reads (filtered using ‘samtools view -q 1 -f 2’) out of the total read count. Accumulation curves based on the number of UMGS detected per geographical region were bootstrapped ten times at each sampling interval. Asymptotic regressions were performed using the SSasymp and nls functions from the R stats package69.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Code availability

Custom scripts used to generate data and figures are available at https://github.com/Finn-Lab/MGS-gut.