Constitution of the databases of ARDs

We define as an ARD as in Martínez et al.39: a protein encoded by a gene that confers resistance to antibiotics when it is present or increases susceptibility to antibiotics when it is absent. This definition excluded housekeeping genes in which mutations can confer resistance to some antibiotics (such as topoisomerases in which mutations can lead to fluoroquinolone resistance) and genes involved in the regulation of antibiotic resistance genes. Also, we excluded efflux pumps such as Tet(A) or QepA as very few or no PDBs are available, presumably due to the difficulty to crystallize transmembrane proteins. Amino acid sequences of functionally characterized ARDs from the major antibiotic families used in human medicine (β-lactams, aminoglycosides, tetracyclines, trimethoprim, sulfonamides, macrolides-lincosamides-synergistines, fluoroquinolones, fosfomycin and glycopeptides)20,40 were obtained from the following antibiotic resistance databases: Resfinder9, ARG-ANNOT7, the Lahey Clinic (http://www.lahey.org/studies/), RED-DB (http://www.fibim.unisi.it/REDDB/), Marilyn Roberts’s website for macrolides and tetracycline resistance genes (http://faculty.washington.edu/marilynr/) and from functional metagenomics studies5,6,41. When ARDs were provided as nucleic acids sequences, they were translated into proteins with Prodigal42. Non-redundancy of the reference ARDs was assessed with CD-HIT v4.5.743 (100% identity). The final database was manually curated to remove incomplete sequences and ARDs from families not considered in this work. The cluster of orthologous genes (COG) of each member of the reference dataset was assigned from the v3 eggNOG database44. In total, we collected 1,651 non-redundant amino acid sequences spanning 20 ARD families: class A β-lactamases (Blaa), class B1-B2 β-lactamases (Blab1), class B3 β-lactamases (Blab3), class C β-lactamases (Blac), class D β-lactamases (Blad), aminoglycoside acetyltransferases (AAC) AAC(2’), AAC(3)-I, AAC(3)-II and AAC(6’), aminoglycoside nucleotidyltransferases (ANT), aminoglycoside phosphotransferases (APH), 16S ribosomal RNA (rRNA) methylases, Tet(M), Tet(X), type A dihydrofolate reductases (DfrA), dihydropteroate synthases (Sul), erythromycin ribosome methylases (Erm), quinolone resistance proteins (Qnr), fosfomycin resistance proteins (Fos) and D-Ala–D-Lac/Ser ligases (Van) (Table 1). The recently described plasmid-mediated colistin resistance mcr-1 gene45 could not be included because of the lack of a reliable PDB template obtained by X-ray diffraction at the time of the study.

Interrogation of the catalogue for ARDs

We used a 3,871,657 million proteins catalogue previously published19. This catalogue was built from the metagenomic sequencing of the faeces of 396 subjects from Denmark and Spain. In brief, the 3.9 million gene catalogue results from a non-redundancy filtering at 95% nucleic acid identity and 90% coverage: predicted genes from all samples (45.4 million in total) were clustered using BLAT by single linkage. Any two genes with greater than 95% identity and covering more than 90% of the shorter gene were clustered together. The contigs were originally built using SOAPdenovo (from the MOCAT pipeline46). We selected this catalogue over the more recent 9.9 million gene catalogue that was published during the course of this study27 because MGUs (including the MGS) had been determined only for the 3.9 million gene catalogue. The genes of the catalogue were translated into proteins using Prodigal42 using the –p meta option. For each ARD family, we searched for ARDs using the following three methods: (1) we built a hidden Markov model file for each ARD family and searched the catalogue with Hmmsearch (v3.1)47; (2) we performed a Smith–Waterman alignment with a heuristic seed detection (BLASTP v.2.2.28+)21; and (3) a rigorous Smith–Waterman search (SSearch v.36.3.6)48 with an E-value threshold of 1 × 10–5. Only the hits with a size ranging from 75 to 125% of the mean amino acid size of the ARD family were further considered. All candidates were assigned a COG/NOG from eggNOG v344. When candidates were found in different ARD families (for example, a candidate could be a hit in class B1-B2 and class B3 β-lactamases), the candidate was assigned to the family for which it had the highest amino acid identity with the reference.

Negative references

For each ARD family, COGs/NOGs were attributed to reference ARDs. In parallel, the COGs/NOGs were attributed to the hits obtained during the initial steps of PCM (that is, the hits obtained by the BLASTP/SSearch and Hmmer search). In the list of candidates from a given ARD family, the COGs/NOGs that were not found in the COGs/NOGs attributed to reference ARDs were assumed to be potential COGs/NOGs from false positive hits (Supplementary Fig. 2) as it reproduced the errors of functional assignment that were likely to be generated in sequence-only annotations. The amino acid sequences of the representative proteins from those COGs/NOG groups were obtained from the eggNOG v3 database and were added to the negative reference dataset. A manual curation step was performed to ensure that no references were included in the negative references.

Selection of structural templates

The list of protein structures that could be used as structural templates was downloaded (June 2014 and November 2014) from the PDB library (ref. 49, http://www.rcsb.org/). Using the reference dataset and the negative references described above, Hmmer47, BLASTP21 and SSearch48 were performed on the PDB database with default settings and E values of 1 × 10–5. Results were merged into a non-redundant PDB list. Both lists (references and negative templates) were manually curated to ensure that no references were represented in the negative templates dataset, and vice versa. If more than one PDB shared the same UniProt number (that is, if the structure of a protein has been determined on multiple occasions), we filtered the PDB files to include a unique structure per UniProt number using the following positive criteria: absence of ligand, completeness of the protein and high resolution.

PCM

The concept of PCM is shown in Supplementary Figs. 1–3 and the framework is available at https://github.com/aghozlane/pcm. The concept of leveraging the protein structure in complement to its amino acid sequence was motivated by the fact that proteins sharing common functions would be more conserved in the active site that cannot be observed by the analysis of protein sequence alignments37. Each candidate was subjected to homology modelling with reference templates and negative templates, generating two three-dimensional structures for each candidate (Fig. 1a). The main idea is that if a sequence is truly functionally related to the reference fold, its model must be significantly different from the ones obtained with the negative structural template. Homology modelling was performed by PCM in six main steps (example in Supplementary Fig. 3):

(1) Three structural templates were identified by BLASTP (among the lists produced as described above) that shared the highest amino acid identity with the candidate protein. (2) A multiple sequence alignment was performed between the candidate and the three templates sequences using Clustalo50. (3) A prediction of the secondary structure was performed using psipred (v3.5)51. The residues predicted to fold in helix or in beta-sheet conformation with a level of confidence higher or equal to seven were considered to constrain the model. (4) A comparative modelling was performed with the MODELLER programming interface52. MODELLER automatically calculates a model by satisfaction of spatial restraints such as atomic distance and dihedral angles in the target sequence, extracted from its alignment with the template structures. Stereo-chemical restraints for residues are obtained from the CHARMM-22 molecular force field and statistical preferences obtained from a representative set of known protein structures. (5) The best model out of a hundred produced by MODELLER (based on the Dope score) was considered for structure assessment analysis using ProQ53 and Prosa-web54. The Dope score (Modeller), z-score (Prosa), MaxSub and Levitt-Gerstein (LG) score (ProQ) are statistical potential variables used to predict the model quality. Both ProQ and Prosa-web are trained on the PDB to determine real protein configuration and they estimate the energetic farvourability of the conformation of each residue in the model. (6) The best model was aligned with the reference set of structures using TM-Align17 and MAMMOTH55. The r.m.s.d (TM-Align), z-score (MAMMOTH), TM score (MAMMOTH, TM-Align) estimates the degree of superposition of the residue between two structures.

The differences (delta) between the scores determined from each modelling path (with the reference set or the negative set) were calculated and used for the PCM machine-learning program (see below).

For one given candidate, the PCM whole process took an average of 8 CPU-hours (30 min on 16 CPUs).

Taxonomic assignation

The pdARDs were taxonomically assigned by combining the results obtained from BLASTN against the National Center for Biotechnology Information (NCBI) Genomes database (minimal 70% identity and 80% coverage), a BLASTN against the IMOMI in-house database (minimal 85% identity and 90% coverage) and the taxonomy of the metagenomic unit whenever applicable. The lowest taxonomic rank from the results of the three methods was assigned to the pdARD.

Statistical analysis

To discriminate reference proteins from negative references, we used model quality predictors and alignment scores (inferred from the semi-automatic pipeline described above) and developed a custom pipeline in R (R Core Team, 2013, http://www.R-project.org) to perform the classification. The LASSO penalized logistic regression56 implemented in LIBLINEAR57 was used to compute the classifier. Ten-fold stratified cross validation (re-sampled 100 times to obtain more stable accuracy estimates) was used to partition the data into a training and test sets. The LASSO hyper-parameter was optimized for each model in a nested five-fold cross validation on the training dataset using the area under curve as the model selection criterion. From the 100 times re-sampled ten-fold cross validation, receiver operating characteristic analysis was used to evaluate model performance using the median area under curve. Coefficients extracted for each modelling or alignment score were also evaluated for their stability throughout the computed models. The PCM score was the ratio (expressed as a percentage) between the numbers of time a candidate was classified as a reference and the number of bootstraps. Predicted ARDs were candidates with a PCM score ≥50% and a TM score given by TM-Align ≥0.517. To control how structural modelling brought additional information compared to amino acid sequence alignment only, we built a logistic regression model based on T-coffee alignment score (R glm, ten-fold stratification, re-sampled 100 times). We then compared the two classifier models used for PCM and for T-coffee alignment based on the reference set (see Supplementary Information).

Validation of the method with a functional metagenomic dataset

The performance of PCM was assessed by analysing the data in Forsberg et al., where the ARD content of different North American soils was analysed using functional metagenomics18. The screening of the clones was performed on aztreonam, chloramphenicol, ciprofloxacin, colistin, cefepime, cefotaxime, cefoxitin, D-cycloserine, ceftazidime, gentamicin, meropenem, penicillin, piperacillin, piperacillin-tazobactam, tetracycline, tigecycline, trimethoprim and trimethoprim-sulfamethoxazole (cotrimoxazole). Here, we collected the nucleotide sequences of the inserts deposited on GenBank (KJ691878–KJ696532). The sequence translation of the open reading frames was performed by Prodigal (using default parameters)42. A total of 4,654 insert sequences were collected, in which 12,904 amino acid sequences were predicted. We then searched for ARDs belonging to the relevant ARD families according to the antibiotics used for the screening of the clones: β-lactamases (all classes), APH, ANT, AAC(2’), AAC(3)-I, AAC(3)-II, AAC(6’), 16S rRNA methylases, Tet(M), Tet(X), Qnr, Sul and DfrA, using the Supplementary Table 2 of the Forsberg et al. paper. Inserts with no putative ARDs (according to the annotation of the gene) were removed (n = 269). Inserts selected on cycloserine (n = 868) and chloramphenicol (n = 129) were not considered here because they were not included in the 20 ARD families in this work. Fourteen inserts that contained more than one putative ARD that could be identified to confer resistance to the antibiotic used for the screening (for example, two β-lactamases) were not considered in this analysis. An additional 1,658 inserts containing no putative ARDs or a putative ARDs that did not confer resistance to the antibiotic used for selection were discarded and so were 294 inserts containing efflux pumps, as these were not considered in this study. The resulting validation set contained 1,423 inserts (with resistance genes) for a total of 3,778 genes. To compare the outcome of PCM with other tools, the results for class B1-B2 and B3 β-lactamases generated by PCM were merged into one class B β-lactamases group as other tools do not separately consider the different class B β-lactamases.

In total, 1,390 unique hits were found during the initial screen of PCM, of which 1,374 were predicted as ARDs (Supplementary Table 7). Among the 33 ARDs not included for PCM, 12 were not considered because they were undersized and ten because they were oversized. No hits for AAC(2’), ANT, Qnr or Sul were found. The mean identity shared with reference ARDs was 37.6% (range 18.8–94.5). Overall, the sensitivity was 96.6%, with no false negative. In comparison, only eight ARDs would have been identified by a conventional method (combination of Hmmsearch, BLASTP and SSearch with both a minimal identity with a reference ARD and coverage over or equal to 80%). Conversely, Resfams11 that was specifically designed to identify ARDs from functional metagenomic datasets showed a similar sensitivity to PCM with the identification of 1,346 ARDs out of 1,423 (94.6% sensitivity).

Validation of the method for incomplete genes

The 3.9 million gene catalogue harbours 41.4% of genes that are predicted to be incomplete either on the 5’, the 3’ or both extremities19. As the size parameter is crucial for homology modelling, we tested to what extent the prediction of incomplete ARDs by PCM could remain valid. We selected 12 reference class A β-lactamases (BlaZ, CblA-1, CepA-29, CfxA2, CfxA6, CTX-M-8, KPC-10, OXY-1, PER-1, SHV-100, TEM-101 and VEB-1) and we then iteratively removed 5% of the amino acid sequence at both edges to obtain 16 bi-directionally trimmed candidates (from 100 to 25%) per reference ARD. Candidate genes were chosen to span the diversity of known β-lactamases, but the main representative β-lactamase of the subfamily (for example, TEM-1 for TEM β-lactamase) was not necessarily chosen. Note that SHV-100 has a slightly longer sequence (13 amino acid duplication) than other SHV. A total of 192 PCM experiments were performed: we observed that the 12 references were correctly predicted as ARDs when at least 40% of the protein remained (that is, 30% trim from each extremity, Supplementary Fig. 4). Thus, we are confident that with the 75% size threshold used in this study (a maximum of 25% removed from one edge), no misclassification due to an incomplete gene would be expected.

Gene synthesis

We selected 71 pdARDs from 12 ARD families: 14 from class A β-lactamases, eight from class B1-B2 β-lactamases, seven from class B3 β-lactamases, four from class C β-lactamases, two from class D β-lactamases, 2 AAC(3)-I, 5 AAC(3)-II, 8 AAC(6’), 3 ANT, 4 APH, 13 Tet(M) and 1 Tet(X)) for gene synthesis and sub-cloning into E. coli to test the decrease of susceptibility to antibiotics. For β-lactamases, a chromogenic test (nitrocefin) was used to detect function. Minimal inhibitory concentrations (MIC) were determined by E-Test strips (bioMérieux, Marcy-l’Etoile, France) in duplicate. A pdARD was considered to have an activity against an antibiotic (tobramycin for AAC(3)-I, AAC(3)-II, AAC(6’) and ANT; kanamycin for APH and tetracycline for Tet(M)) when the MIC of the clone was above the MIC of a clone harbouring the plasmid without a synthesized gene or when the colour of the broth containing nitrocefin turned red, in the case of β-lactamases. We used the plasmid vector pET-22b+ (embedding a β-lactamase-encoding gene) for pdARDs hypothesized to confer resistance to aminoglycosides and the pET-26b (embedding a gene conferring resistance to kanamycin) for the other pdARDs. The selection of the pdARDs for synthesis was performed as follows: references (n = 12), pdARDs that shared a high identity with known ARDs (≥95% amino acid identity and ≥80% coverage with a reference ARD); good predictions (n = 41), pdARDs with the highest degree of confidence for the prediction (PCM score >99%, TM score TM-Align >0.9 and <70% amino acid identity with a reference ARD); fair predictions (n = 18): pdARDs with the lowest degree of confidence for the prediction (PCM score <80%, TM score TM-Align <0.8 and <70% amino acid identity with a reference ARD).

Signatures of MGEs nearby the predictions of ARDs

We searched for MGE-associated proteins encoded by genes located in the same contigs as pdARDs. The 3.9 million gene catalogue results from a non-redundancy filtering at 95% for the genes19, but to identify the contigs on which pdARDs were identified, we needed to return to the redundant catalogue (that is, the non-dereplicated catalogue of genes) and identified homologues sharing 95% nucleic acid identity with the pdARDs. By doing so, we could identify contigs (n = 16,955) carrying at least one pdARD. The mean size of the contigs was 19,711 base pairs (minimum 500, maximum 461,981, median 8,513). In total, the 16,955 contigs contained a total of 908,888 genes after the subtraction of pdARDs. The 908,888 genes were then translated into proteins with Prodigal42 and queried for IS elements using BLASTP (query size threshold, 150 amino acids; E value, 1 × 10–30; identity threshold, 40%) against the ISfinder database58. Conjugative elements were queried among the same gene set (n = 908,888) with Conjscan59, using the default parameters and the filters recommended by the authors (best E < 0.001 and sequence coverage of at least 50%). Most proteins belonging to the type IV secretion systems (T4SS), which are involved in conjugation, are ubiquitous in that they have numerous homologues. Hence, when searching for conjugation proteins in a 3.9 million protein catalogue, there would be a high risk of false positives. Accordingly, the co-location of hits was deemed crucial. A conjugative T4SS is made from:

a protease (VirB4)

a second coupling protein protease (t4cp)

a relaxase (MOB)

a proteic complex (MPF) composed of at least ten proteins

To identify a T4SS on a contig, we required presence of at least one virB4 hit, a t4cp1 or t4cp2 hit, a MOB hit and a certain number of MPF hits. All hits must co-localize. A MOB element alone can mobilize a neighbouring gene (such as an ARD-encoding gene) via other T4SSs. However, in our dataset the short length of contigs led us to adapt those parameters (following the recommendations of the developers of the Conjscan software). Besides the MOB element, we considered that the presence of two hits from the same family (for example, T_virB6 and T_virB8, or B_traF and B_traH) or virB4+ any hit from another family on the same contig as a pdARD was a strong indication of the presence of mobility associated elements. Integrons were identified using IntegronFinder60 on the 16,955 contigs using default parameters.

We also searched for pdARDs in MSPs26 obtained from the 9.9 million intestinal gene catalogue27 using BLASTN with a 95% identity threshold over 90% of the query.

Finally, we searched for homologues of pdARDs in GenBank with 97% identity threshold over 90% of the query. We found 820 out of 6095 pdARDs (13.5%) that aligned against 139,413 GenBank entries. We filtered hits corresponding to a virus, a plasmid or a vague taxonomic affiliation by considering the following terms: uncultured bacterium, artificial, unidentified, uncultured organism, environmental samples and metagenome.

Distribution of the pdARDs in the MetaHIT cohort (n = 663 subjects)

pdARDs profiles were obtained from the abundance matrix of the 3.9 million genes as described in Nielsen et al19. The ‘reads per kilobase per million mapped read’ method was used to normalize the mapping counts. After summing the relative abundances of pdARDs genes belonging to the same family, Dirichlet multinomial mixture models were used to find ARD clusters (that is, resistotypes) using the Dirichlet multinomial R package. The same method was applied to detect gut microbiota clusters (that is, enterotypes)61. The Laplace criterion was used to define optimal number of clusters as described on oral and faecal microbial dataset62. By analogy with the term enterotype, we chose to name a cluster of subjects on the basis of their similarity of their faecal relative abundance of pdARDs a resistotype. The chi-squared test was used to assess the associations between resistotypes and enterotypes. Rarefaction analysis at 1 million reads was done to determine the gene richness per samples. RLQ analysis63 was conducted to assess the associations between the relative abundances of pdARDs, their characteristics (family, size of the cluster of associated genes) and those of subjects (enterotypes, resistotypes, gender, body mass index, age). Of note, we excluded the patients suffering from inflammatory bowel disorders from this analysis. Co-inertia analysis was conducted to assess the associations between microbiota β-diversity and pdARDs profiles. Microbiota composition was assessed using MGS (see below) relative abundance and β-diversity by square root Jensen–Shannon Divergence. A principal coordinate analysis was done on Jensen–Shannon Divergence distance matrix and a principal component analysis was done on ARD profiles. Both analyses were then subjected to co-inertia analysis and Monte Carlo permutation was done to assess to robustness of shared inertia.

Constitution of cohorts of patients with various antibiotic exposures

We included three cohorts of patients with various exposures to antibiotics.

Hospitalization without antibiotics

A total of 31 patients with no exposure to antibiotics or hospitalization during the three preceding months and admitted to the medicine ward of the Beaujon University Teaching Hospital (Clichy, France) were included and provided a faecal sample at admission. Among them, 16 also provided a stool sample at discharge. One patient received antibiotics between admission and discharge and was not further considered for the analysis. In total, 15 patients could provide a stool sample soon after admission (T0) and at discharge (T1). The mean time between T0 and T1 samples was 10.7 days. The mean age of patients was 67.8 years old and the gender ratio (M/F) was 1.3. All patients gave informed consent. This work was approved by the French National Institutional Review Board (IRB 00008522) and registered at clinicaltrials.gov (NCT02031588).

Chronic exposure

Thirty cystic fibrosis patients were enroled at the Cystic Fibrosis Unit of the Ramón y Cajal Hospital in Madrid. One faecal sample was collected at the occasion of a consultation. All subjects for this study were provided a consent form describing the study and providing sufficient information for subjects to make an informed decision about their participation as faecal donors in this study. Cystic fibrosis is a genetic disease that leads to an impairment of the lung function through an uncontrolled production of mucus. The consequence is chronic bacterial colonization, resulting in deleterious reactive fibrosis of the lung. Bacterial load is controlled by chronic exposure to antibiotics (home-therapy, mostly oral and inhaled in our cohort), which has resulted in significant life prolongation, and the near-absence of hospital care. Hence, the cystic fibrosis patients had been exposed to various antibiotics during the five years before the faecal sample was collected:

β-lactams (ampicilln, amoxycillin, cloxacillin, piperacillin-tazobactam, cefepime, ceftriaxone, ceftazidime, cefditoren, meropenem): 25 out of 30

Macrolides (azithromycin, clarithromcyin): 17 out of 30

Colistin: 21 out of 30

Fluoroquinolones (ciprofloxacin, levofloxacin, moxifloxacin): 26 out of 30

Cotrimoxazole: 14 out of 30

Glycopeptides (vancomycin): 1 out of 30

Aminoglycosides (amikacin, tobramycin): 12 out of 30

Tetracyclines (doxycycline, minocycline): 2 out of 30

Linezolid: 3 out of 30

Rifampin: 1 out of 30

Fosfomycin: 5 out of 30

On average, cystic fibrosis patients had been exposed to 5.9 different antibiotics and had an average of 12.2 antibiotic courses during the five years before the sample was taken. The mean age was 36.3 years old and the gender ratio (M/F) was 1.3. The consent form was obtained before that subject provided any faecal sample for the study and was signed by the subject or legally acceptable surrogate, and the investigator-designated research professional obtaining the consent. According to the National Spanish laws the study did not require the approval of the Ethics Committee. Nonetheless, the Ethics Committee of the Hospital Ramón y Cajal guaranteed that the study was performed done according to the good clinical practices guidelines.

Short high-dose exposure

Short high-dose exposure consists of administering a mixture of topical and parenteral antibiotics and antifungal agents to a patient at admission to eliminate potential bacterial and fungal pathogens. SDD has been shown to significantly reduce mortality in the intensive care unit29 and is now part of standard care for intensive care patients in the Netherlands. To assess the effect of SDD on the intestinal microbiota, we analysed the faecal samples from 13 patients admitted to the intensive care unit of the University Medical Centre of Utrecht (Netherlands). The samples were collected at admission (T0, first sample passed after admission) and after SDD (T1). Among the 13 patients for whom a faecal sample could be obtained at T0, 10 could provide a faecal sample at T1. The mean age was 59.9 years old and the gender ratio (M/F) was 0.5. SDD consisted of 4 days of intravenous cefotaxime and topical application of tobramycin, colistin and amphotericin B. Additionally, a subset of samples (n = 4) from this cohort was cultured in a brain-heart infusion broth overnight in ambient atmosphere at 37 °C. The protocol for the collection of stool samples was reviewed and approved by the institutional review board of the University Medical Centre of Utrecht (Netherlands) under number 10/0225. Informed consent for faecal sampling during hospitalization was waived. Written consent was obtained for the collection of faecal samples after hospitalization.

Metagenomic sequencing and mapping

Total faecal DNA was extracted64,65 and sequenced using SOLiD 5500 wildfire (Life Technologies) resulting in a mean of 68.5 million sequences of 35-base-long single-end reads. High-quality reads were generated with quality score cut-off >20. Reads with a positive match with human, plant, cow or SOLiD adaptor sequences were removed.

Filtered high-quality reads were mapped to the MetaHIT 3.9 million gene catalogue19 using the METEOR software66. The read alignments were performed in colourspace with Bowtie software (version 1.1.0)67. Uniquely mapped reads (reads mapping to a single gene from the catalogue) were attributed to the corresponding genes. Shared reads (mapping different genes of the catalogue) were attributed according to the ratio of their unique mapping counts, as following: as a read can map on different genes of the catalogue, the abundance of a gene G(A g ) depends on the abundance of uniquely mapped reads (A u ), that is, reads that map only to the gene G and on the abundance of N shared reads (A s ) that aligned with M genes in addition to the gene G:

$$A_{\mathrm{g}} = A_{\mathrm{u}} + A_{\mathrm{s}}$$

where

$$A_{\mathrm{s}} = \mathop {\sum }\limits_{i = 1}^N C_{o_i}$$

For each shared read, the gain of abundance corresponds to a coefficient C o that takes in account the total number of uniquely mapped reads on the M genes:

$$C_{o_i} = \frac{A_{{\mathrm{u}}}}{{A_{\mathrm{u}} + \mathop {\sum}\limits_{j = 1}^M {A_{{{\mathrm{u}}}_{{j}}}} }}$$

For instance, if a gene G is mapped by ten reads that only map to it (unique reads), but also with one read that also align on a gene M that was mapped by five unique reads, then:

$$A_{\mathrm{g}} = 10 + \frac{{10}}{{10 + 5}} \approx 10.7$$

To decrease technical biases due to different sequencing depth, samples with at least 5 million mapped reads were downsized to 5 million mapped reads (random sampling of 5 million mapped reads without replacement) using R package momr31. The abundance of each gene in a sample was then normalized by dividing the number of reads that mapped to the gene (A g ) by the gene nucleotide length and by the total number of reads from the sample. The resulting set of gene abundances, termed a microbial gene profile, was used to estimate the abundance of MGS19.

Gene richness analysis

Microbial gene richness was calculated by counting the number of genes mapped at least once for a given sample. Gene richness was calculated using R package momr for samples where 5 million or more reads had been mapped to the 3.9 million gene catalogue.

MGS

MGS are co-abundant gene groups with more than 700 genes and can be considered part of complete bacterial species genomes. 741 MGS were delineated from 396 human gut microbiome samples19. In this study, the relative abundance of MGS was determined as the median abundance of 90% of the genes composing each cluster, meaning that the 10% genes with the lowest abundance for each MGS were not considered for the calculation of the abundance of the MGS. Typically, these genes correspond to genes with zero count, to accessory genes (hence their detection is not constant) or to genes that are not detected because of insufficient sequencing depth. The MGS taxonomical annotation was updated by sequence similarity using NCBI BLASTN, when more than 50% of the genes matched the same reference of NCBI database (December 2014 version) at a threshold of 95% of identity and 90% of gene length coverage to get the species annotation19.

Statistical analysis for the distribution of pdARDs and MGS between groups

Statistical analyses for the differential abundances of pdARDs and MGS were performed using the application SHAMAN68(http://shaman.pasteur.fr/). Data are available at (https://github.com/aghozlane/evotar), with the graphical representations using the abundances from the matrix rarefied at 5 million reads. The relationship between richness and the abundance of ARDs was assessed by the Spearman correlation test. The statistical threshold for significance was set at a P value of 0.05.

Code availability

The PCM code can be found at https://github.com/aghozlane/pcm.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.