Ethics statement

The Institutional Review Board of South China Agricultural University (SCAU-IRB) approved the protocols. All animals were sampled under authorization from Animal Research Committees of South China Agricultural University (SCAU-IACUC).

Study design

Fourteen senior class veterinary students (Student ID: H, I, J, K, L, M, N, O, P, Q, W, X, Y, Z) provided their written informed consent and voluntarily enrolled in the study during participation in an ~3-month-long practical training course in veterinary science at SCAU from July to October 2015. The 14 students were randomly divided into three groups of four to five persons, and each group was assigned to one of three swine farms in three different Chinese provinces, including (from north to south), Henan (Farm ID: H farm), Jiangxi (Farm ID: D farm), and Guangdong (Farm ID: S farm) (Supplementary Fig. 1a). These are typical large-scale swine farms, and all have been in operation for more than 5 years. Three farms implement self-breeding, and all use the closed-end management model. Among them, H farm is the largest, with 15,000 sows, D farm (7400 sows) is the next largest, and S farm (3800 sows) is the smallest. Due to limitations in the volunteer veterinary student population, all subjects were male and a parallel group of swine farm unexposed students was not possible. We have taken several steps to mitigate these limitations, including comparisons to a healthy cohort from urban Chinese individuals (BioProject accession number PRJEB13870 [https://www.ncbi.nlm.nih.gov/bioproject/PRJEB13870]). To control for differences at individual level, the students’ fecal samples were collected longitudinally and the fecal samples at the phase before arriving at the farm (T0) were considered a blank control. In addition, four to five farm workers in each swine farm were also recruited in this study. All the farm workers had engaged in pig farming for 4–18 years and stayed at the present farm at least for 1 year. The volunteers signed an informed consent form and were asked to agree to fecal swabbing and to complete a short questionnaire related to personal information, such as age and gender, personal hygiene, dietary habits, antibiotic use, hospitalization, previous visits to farms or factories, and other pertinent factors (Supplementary Table 1). In addition to environmental exposure, other factors such as diet and work stress may be the important factors influencing the human gut microbiota. Considering that these factors may be caused by environmental changes, in this study, we consider these related factors as environmental impacts.

Sample collection

The students’ fecal samples were collected at the following intervals: (1) 1–2 weeks prior to their entry into the swine farm, (2) weekly for the 3 consecutive months of their stay at the swine farm; (3) monthly for another 3 consecutive months after their return to the university. At each swine farm, four to five farm workers who had worked on the farm for at least one year were recruited, and their fecal samples were collected monthly during students’ swine farm stays. In addition, averages of 40 pig feces samples, 3 soil samples, 3 sewage samples, and 3 ventilation dust samples for each farm, were collected monthly for the 3 consecutive months of the students stay at the swine farm. Among them, 42 students’ fecal samples and 12 pooled samples consisting of 55 environmental samples (around 3–5 samples for each item per farm) from the swine farms, including pig feces, soil, sewage, and ventilation dust, were used in the metagenomic sequencing (Supplementary Data 3 and 5). Samples were submitted using an assigned student study ID and date. Samples were kept on dry ice during transport and were stored at −80 °C prior to DNA extraction and chemical analysis.

DNA extraction

Genomic DNA was extracted from all samples using the HiPure Stool DNA Kit (Magen, No. D3141) according to the manufacturer’s instructions. Briefly, STL buffer (1 ml) was added to 50 mg of sample in a 2-ml screw-cap tube (Axygen), and the mixture was incubated at 65 °C for 10 min. The samples were then vortexed for 15 s and centrifuged at 13,000×g for 10 min, and 600 μl of the supernatant was transferred to a fresh 2.0-ml tube. PS buffer (150 µl) and 150 µl of absorber solution was then added. Following a second centrifugation (13,000×g, 5 min), the supernatants were placed in fresh 2.0-ml tubes, and 700 µl of GDP buffer was added. A HiPure DNA Mini Column (Magen; No. D3141) was used to absorb the products, which were then eluted with sterile water.

16S rRNA amplification, sequencing, and preprocessing

The V3 and V4 hypervariable regions of the 16S rRNA gene were sequenced and analyzed to define the composition of the bacterial community in human fecal samples. The following amplification primers were used: primer-F = 5′ TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG; primer-R = 5′ GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGAC TACHVGGGTATCTAATCC. For amplicon library preparation, 20 ng of each genomic DNA, 1.25 U Taq DNA polymerase, 5 μl 10× Ex Taq buffer (Mg2+ plus), 10 mM dNTPs (all reagents purchased from TaKaRa Biotechnology Co., Ltd), and 40 pmol of primer mix was used for each 50-μl amplification reaction. For each sample, the 16S rRNA gene was amplified under the following conditions: initial denaturation at 94 °C for 3 min followed by 30 cycles of 94 °C for 45 s, 56 °C for 1 min, and 72 °C for 1 min and a final extension at 72 °C for 10 min. The PCR products were quantified by gel electrophoresis, pooled and purified for reactions. Pyrosequencing was performed on an Illumina MiSeq sequencer with paired-end reads 300 base pairs (bp) in length.

Based on the overlaps between the sequenced paired-end reads, the reads were merged into long sequences using the FLASH algorithm (min-overlap = 30, max-overlap = 150)42. Low-quality sequences were then trimmed and eliminated from the analysis based on the following criteria: (a) shorter than 400 bp and (b) a sequence producing more than 3 ‘N’ bases. Bioinformatic analysis was implemented using the Quantitative Insights into Microbial Ecology QIIME2 platform (https://qiime2.org/)43. Briefly, raw Illumina amplicon sequence data was quality control processed using the DADA2 algorithm44, removing the chimeric sequences and truncating the sequences from 5 to 250 bases. Phylogenetic diversity analyses were realized via the q2-phylogeny plugin, which used the mafft45 program to perform multiple sequence alignment on the representative sequences (FeatureData in QIIME2) and the FastTree46 program to generate phylogenetic tree from the alignments. The microbial community structure (i.e., species richness, evenness, and between-sample diversity) of fecal samples was estimated by biodiversity. The Shannon index was used to evaluate alpha diversity, and the weighted and unweighted UniFrac distances were used to evaluate beta diversity. All of these indices were calculated by the QIIME2 pipeline (q2-diversity plugin).

Metagenomic sequencing and data quality control

The Illumina HiSeq 3000 platform was used to sequence the samples. We constructed a 150-bp paired-end library with an insert size of 350 bp for every sample. The raw sequencing reads for each sample were independently processed for quality control using the FASTAX Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). The quality control used the following criteria: (1) reads were removed if they contained more than 3 ‘N’ bases or more than 50 bases with low quality (<Q20); (2) no more than 10 bases with low quality (<Q20) or assigned as N in the tails of reads were trimmed. The remaining reads were then mapped to the human and swine genomes using SOAPalinger247 to remove host DNA contamination. Overall, an average of 0.9% of low-quality or host genome reads was removed from the sequenced samples.

De novo assembly, gene calling, and gene catalog construction

To determine the best assembling method for high-quality whole-metagenome-sequencing reads, we compared the performance of two assemblers, SOAPdenovo v2 (previously used in human gut microbiomes)25,48 and MEGAHIT (a de novo assembler for large and complex metagenomic sequences)49. For SOAPdenovo, we tested the k-mer length ranging from 23 to 123 bp by 20-bp steps for each sample and selected the assembled contig set with the longest N50 length. For MEGAHIT parameters “–mink 21 –maxk 119 –step 10 –pre_correction” were used. For most of the samples, MEGAHIT obtained a better assembled contig set than SOAPdenovo; this could be due to its improved assembly of bacterial genomes with highly uneven sequencing depths in metagenomic samples. As a result, we obtained an average of 254.6 ± 72.4 and 754.4 ± 180.4 Mbp (mean ± SD) contig sets for human fecal samples and environmental samples, respectively. The unassembled reads for each ecosystem were pooled and reassembled for further analysis.

Genes were predicted by MetaGeneMark50 based on parameter exploration by the MOCAT pipeline26. A non-redundant gene catalog was constructed using CD-HIT51; from this catalog, genes with >90% overlap and >95% nucleic acid similarity (no gap allowed) were removed as redundancies. The gene catalogs contained 3,338,109 and 11,374,480 non-redundant genes generated from the human microbiome and the swine farm ecosystem, respectively.

Quantification of metagenomic genes

The abundance of genes in the non-redundant gene catalogs was quantified as the relative abundance of reads. First, the high-quality reads from each sample were aligned against the gene catalog using SOAP 2.2147 using a threshold that allowed at most two mismatches in the initial 32-bp seed sequence and 90% similarity over the whole read. Then, only two types of alignments were accepted: (1) those in which the entirety of a paired-end read could be mapped onto a gene with the correct insert size; (2) those in which one end of the paired-end read could be mapped onto the end of a gene only if the other end of the read mapped outside the genic region. The relative abundance of a given gene in a sample was finally estimated by dividing the number of reads that uniquely mapped to that gene by the length of the gene region and by the total number of reads from the sample that uniquely mapped to any gene in the catalog. The resulting set of gene relative abundances for all samples was termed a gene profile. The average read mapping rates (or mean reads usage) were 71.5% and 43.8% for human gut microbiome and swine farm environmental samples, respectively.

Quantification of taxa in metagenomic data

We performed the taxonomic profiling (including phylum, class, order, family, genus, and species levels) of the metagenomic samples using MetaPhlAn252, which relies on ~1 million clade-specific marker genes derived from 17,000 microbial genomes (including bacterial, archaeal, and viral species) to unambiguously classify metagenomic reads to taxonomies and yield relative abundances of taxa identified in the sample.The Shannon index, was used to represent the within-sample diversity (alpha diversity) of the microbiota in the samples53.

Identification and quantification of AR genes

The AR genes from each metagenomic assemblies were identified by blasting protein sequences against CARD (downloaded February 2018)36 database using stringent cutoff (>95%ID and >95 overlap with subject sequence). The remaining unannotated sequences were filtered and subsequently annotated with Resfams core database. This approach resulted in 12,739 unique AR genes from 66 metagenomic assemblies. Together, these 12,739 genes with 2252 AR sequences from CARD database were used to create high-precision sequence markers using ShortBRED35 (parameters: –clustid 0.95 and –ref Uniref90.fasta).

The ShortBRED results included 20,514 markers for 5607 AR gene families. The marker list was then manually curated to reduce the rate of false positives in our surveys. Following criteria was used to filter out the false positives:

genes that confer resistance via overexpression of resistant target alleles (e.g. resistance to antifolate drugs via mutated DHPS and DHFR);

global gene regulators, two-component system proteins, and signaling mediators;

efflux pumps that confer resistance to multiple antibiotics;

genes modifying cell wall charge (e.g. those conferring resistance to polymixins and defensins).

The final set consisted of 1924 AR gene families. The abundance of AR gene families was measured using shortbred_quantify.py script and about 1018 AR determinants were detected with RPKM > 0 in at least two samples.

Identification of virulence factor genes and antibacterial biocide and metal resistance genes

We identified the virulence factors based on the Virulence Factors of Pathogenic Bacteria Database (VFDB, downloaded February 2018)54 and the antibacterial biocide and metal resistance genes based on the BacMet database55. Amino acid sequences were aligned against the databases using BLASTP (e-value ≤ 1e−5) and assigned to genes by the highest-scoring annotated hit with >80% similarity that covered >70% of the length of the query protein.

Species transmission event identification and SourceTracker

We used a modified SourceTracker algorithm37 to identify species transmission events from the swine farm environment to human gut microbiota. Briefly, the new genes found in each sample during swine farm residence were grouped into species-level clusters by consistent taxonomic assignment and relative abundance (range: average ±5%). The SourceTracker algorithm was then used to estimate the probability that the species in the fecal sample came from the source environment (probability > 80%). The probable transferred species with <100 genes or <0.01% relative abundance in the human gut microflora were further filtered.

To identify transfer events involving AR genes, SourceTracker was run with the default settings using the environmental microbiota as the source.

Microbial genome reconstruction in metagenomes

We established an approach to reconstruct the genomes of the high-abundance (typically, >3%) species in the human gut metagenomes. Firstly, metagenomic reads were mapped to the closest reference genomes using SOAP2.2147 (>95% identity). The mapped reads were independently assembled using Velvet56, an algorithm for de novo short read assembly for single microbial genomes. The software was run multiple times using different k-mer parameters ranging from 39 to 131 to generate the best assembly results. Then, the raw assembled genome was scaffolded by SSPACE57, and gaps were closed by GapFiller58. The short scaffolds were filtered with a minimum length threshold of 200 bp. A circle plot of the draft genomes was obtained using BRIG software59. The average nucleotide identity (ANI) between genomes was calculated using the ANIb algorithm, which uses BLAST as the underlying alignment method60.

Network visualization

The AR gene co-occurrence network was visualized by Cytoscape 3.3.061 using an edge-weighted spring-embedded layout.

Mobile genetic elements

Putative MGE genes, including transposase, integrase, recombinase, phage terminase and endopeptidase genes, and bacterial insertion (IS) sequences were identified from the functional selection by Pfam (v29.0)62 and Kyoto Encyclopedia of Genes and Genomes (KEGG, downloaded December 2017)63 annotation. AR genes were considered to co-localize with an MGE if they shared a contig with an MGE gene in a nearby area (<10 kilobases).

Phylogenetic classification of contigs

AR contigs and metagenomic assembly contigs were classified using BLASTN with parameters “-word_size 16 -evalue 1e−5 -max_target_seqs 5000” based on the NCBI reference microbial genomes (downloaded December 2017). At least 70% alignment coverage of each contig reads was required. Based on the parameter exploration of sequence similarity across phylogenetic ranks64, we used 90% identity as the threshold for species assignment and 85% identity as the threshold for genus assignment.

Cultures and E. coli analyses

All samples were cultured on MacConkey agar plates and incubated at 37 °C for 24 h. Five to six suspicious colonies with typical E. coli morphology was selected from each sample for identification. We obtained 1851 E. coli isolates, including 954 isolates from students’ fecal samples, 182 isolates from farm workers’ fecal samples, 657 isolates from pig feces, and 58 isolates from other farm environmental samples (soil, sewage, and ventilation dust). After identifying E. coli isolates by MALDI-TOF MS (Biomerieux, France), we characterized 82 E. coli isolates to determine their genetic relatedness by pulsed-field gel electrophoresis (PFGE)65. These 82 E. coli strains were randomly selected from one pig farm and origin from students (n = 13), farm workers (n = 2), pigs (n = 51), and farm environments (16). The DNA banding patterns were analyzed by BioNumerics software (Applied Maths, Sint-Martens-Latem, Belgium) using the Dice similarity coefficient and a cut-off value of 85% of the similarity values was chosen to indicate identical Eric types. Salmonella enterica serotype Braenderup H9812 standards served as size markers.

Phenotypic and genotypic resistance testing

All 1851 E. coli isolates were tested for susceptibility to 11 antimicrobials for human medicine and food animals’ production, including colistin (CS), cefotaxime (CTX), gentamicin (GEN), amikacin (AMK), tetracycline (TET), fosfomycin (FOS), ciprofloxacin (CIP), methoxazole/trimethoprim (S/T), chloromycetin (CHL), meropenem (MEM), and tigecycline (TIG). Antimicrobial susceptibilities of isolates were determined by the agar dilution method and the results were interpreted according to the Clinical and Laboratory Standards Institute (CLSI) (M100-S25)66. All isolates were further screened for bla CTX-M and fosA3 genes (conferring resistance to CTX and FOS, respectively) by PCR amplification using primers (Supplementary Table 2). As fosA3 was frequently co-transferred with bla CTX-M mediated by a single plasmid24, the genetic contexts of the fosA3 and bla CTX-M genes were explored by PCR mapping (Supplementary Table 2) using the reference regions surrounding them among 15 fosA3-bla CTX-M -co-harboring E. coli isolates, which were randomly selected from one pig farm.

Creation of the DBN model

The DBN model was created based on genus composition profiles of students’ fecal samples at all seven time points. Firstly, we removed (1) two students (H and N) who lacked the sequencing data for at least two time points, and (2) the genera with average relative abundance <0.5% in students, remaining the gut microbial communities of 12 students on 39 high-abundant genera for further analysis. These genera covered 86% of total relative abundance of analyzed samples. Then, we calculated the genus–genus associations based on the extended local similarity analysis (eLSA) algorithm67 (default parameters), using the students’ genus profiles at all seven time points. The eLSA tool generated an association network from significant associations (permutated P < 0.01), including both time-independent (undirected) and time-dependent (directed) associations. For each genus, five most significant associations were remained to simplify the network. Lastly, the partially directed DBN model was created based on the genus–genus association network and the directed associations for each genus from its previous time point to current time point (as shown in Supplementary Fig. 15).

Prediction of the microbial composition based on the DBN model

In the DBN model, the current relative abundance (t n ) of every genus can be expressed as a function of the relative abundances of its parent genera at the previous time point (t n−1 ). The functions in the resulting DBN were derived using Eureqa v1.24.06 (default parameters). Eureqa is a freely downloadable software for deducing equations and hidden mathematical relationships in numerical data sets without prior knowledge of existing patterns. The operations, including constant, add, subtract, multiply, divide, sine, cosine, and exponential, were permitted in solutions. Eureqa was allowed to search for best-fitting equations for a maximum of 1 × 1010 formula evaluations, or until correlations >0.8 were observed. To evaluate the accuracy of the DBN model, we trained a new model by using the microbial compositions at time points T0–T5 and then predicted the microbial composition at T6. This leave-one-out cross-validation strategy was also used to predict the compositions of time points T1–T5. For all samples, their predicted microbial compositions achieved high consistency by Bray–Curtis similarity (1-Bray–Curtis distance). Finally, in our dataset, we predicted the relative abundance of all genera at an extrapolated time point (T7) based on the formulas, using their abundances at time point T6. Similarly, the microbial communities at time points T8 and T9 were predicted based on T7 and T8.

Statistical analysis

Statistical analysis was implemented using the R platform. Principal coordinate analysis (PCoA) was performed using the “ape” package68 based on the UniFrac distances between samples. dbRDA was performed using the “vegan” package69 based on the Bray–Curtis distances on normalized taxa abundance matrices and visualized using the “ggplot2” package. In analyses of PCoA and dbRDA, the top two principal components of the samples were shown, and the Mann–Whitney U-test was used to evaluate the significance of differences in samples obtained at different time points. PERMANOVA70 was used to determine the significance of time points on the subject’s gut microbiota as well as antibiotic resistome. We implemented PERMANOVA using the adonis function based on the Bray–Curtis dissimilarity and 999 permutations. This function calculates the interpoint dissimilarities of each group and compares these values to the interpoint dissimilarities of all points to generate a pseudo-F statistic. This pseudo-F statistic is then compared to the distribution of pseudo-F statistics generated when the function is run on the dissimilarity matrix with permuted labels. Procrustes analysis was performed using the “vegan” package, and the significance of the Procrustes statistic (a correlation-like statistic derived from the symmetric Procrustes sum of squares) was estimated by the protest function with 999 permutations. Rarefaction analysis implemented by in-house Perl scripts was performed to assess the gene richness of environmental samples. Statistical significance was set at P < 0.05 following Benjamini–Hochberg corrections.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.