Sample collection and DNA extraction

A total of 116 sewage samples were collected from 32 WWTPs influents in 17 major Chinese cities during August 2014 (summer, n = 59) and February 2014 (winter, n = 57). All the untreated influent samples from each WWTP were taken within two consecutive days without recent rainfall to exclude the effect of the weather. Detailed information on these samples is summarized in Table 1 and Additional file 1: Table S1. All sewage samples from WWTPs were collected in 400-mL sterilized containers and were mixed with 100% ethanol at a volume ratio of 1:1 for biomass fixation. The fixed samples were kept on ice and were immediately delivered to laboratory. The microbial cells from 400 mL of fixed sample were pelleted by centrifuging at 9500 g for 20 min at 4 °C. All pellets were stored at −20 °C before DNA extraction.

Table 1 Information of WWTP sewage samples Full size table

Genomic DNA was extracted from the collected pellets using the FastDNA® Spin kit for Soil (MP Biomedicals, France) following the manufacturer’s instructions. Total DNA was eluted in 100 μL of sterile water and kept at −20 °C until use. DNA concentrations and purity were measured using a NanoDrop spectrophotometer (ND-1000, Nanodrop, USA) [16].

DNA sequencing

The hypervariable V4-V5 region of the 16S rRNA gene was amplified using the primer pair (515 F: 5′-GTGCCAGCMGCCGCGG-3′ and 907R: 5′-CCGTCAATTCMTTTRAGTTT-3′ with sample-identifying six-nucleotide barcodes) [17]. The 4 × 50 μL reaction system was set up for each PCR amplification under the following program: initial denaturation at 95 °C for 5 min, and 30 cycles at 95 °C for 30 s, 58 °C for 30 s, and 72 °C for 30 s and a final extension at 72 °C for 10 min. The resulting amplicons were purified, quantified, pooled, and sequenced on an Illumina MiSeq PE300 platform (Novogene, Beijing, China). For metagenome sequencing, approximately 3 μg of sewage DNA was used for shotgun library construction with an insert size of 300 bp, followed by Illumina paired-end sequencing on the HiSeq 2500 platform (Novogene, Beijing, China).

Phylotype analysis

All the raw reads were processed using QIIME [18] (1) to sort and assign by exactly matching the unique barcode into each sample, (2) to remove primers and the sequences with ambiguous bases, primer mismatches, and homopolymers in excess of six bases, or error in barcodes, and (3) to filter low-quality reads with >20 low-quality bases. Chimeric and noisy sequences were also filtered out. After processing, sequences were clustered into operational taxonomic units (OTUs) using Uclust clustering, which groups sequences at a minimum pair-wise identity of 97%. Mitochondrion or chloroplast sequences and singleton OTUs were discarded from the final OTU table. For each resulting OTU, the most abundant read was selected as a representative sequence. The taxonomic classification of each representative sequence was conducted using a Ribosomal Database Project (RDP) Classifier at an 80% confidence threshold (Version 2.2) [19, 20]. Alignment of the OTU representative sequences was conducted using a PyNAST aligner [21], and a phylogenetic tree was built using a FastTree algorithm [22] for downstream analysis.

Rarefaction was performed to discern Phylogenetic Diversity, Chao1 diversity, Shannon index, and observed species metrics at each sampling depth. To remove the bias caused by different sequencing depth, the OTU table was rarefied and an even sampling depth was set by randomly subsampling the same number of sequences from each sample. Beta-diversity was estimated by computing weighed/unweighed UniFrac and Bray-Curtis distances between every pair of community samples using QIIME.

Metagenomic analysis

Of the original 116 sewage samples, 24 samples were excluded from metagenomic sequencing due to the low quantities of DNA or poor sequence data (Additional file 1: Table S1). Thus, only 92 samples were further used for metagenomics analysis. Metagenomic sequencing of sewage DNA samples generated 203 Gb pairs of high-quality data with an average of 2.2 Gb for each sample. Data filtration was conducted to remove raw reads with low-quality following the methods used in a previous study [23]. Subsequently, metagenomic sequences were analyzed by BLASTx against the Structured Non-redundant Clean Antibiotic Resistance Genes Database (SNC-ARDB) with E value ≤1 × 10−5. A read was annotated to be a resistance gene if its BLAST hit for the alignment against SNC-ARDB had ≥90% amino acid read identity for ≥25 amino acids [16, 24]. In the present study, a package of customized scripts was developed to automatically classify the BLAST hits into different types and subtypes of resistance genes. The detailed procedure for sorting sequences using a customized Python script was reported previously [23].

SNC-ARDB contains a number of genes for efflux proteins that do not necessarily confer resistance phenotypes. These proteins do, however, function in the efflux of antibiotics and have previously been classified as resistance genes and in the Comprehensive Antibiotic Resistance Database (CARD) [25,26,27]. Therefore, efflux pump-related genes were retained in the SNC-ARDB to evaluate antibiotic resistance potential [16].

The ‘abundance’ of the resistance type or subtype was calculated as previously reported by Li et al. (2015) [16]. Thus, the abundance of resistance genes based on metagenomic analysis was compared with those derived from qPCR in the previous studies. The abundance of resistance genes was transformed to ‘concentration’ (copies per liter) by normalization to the absolute copy number of 16S rRNA gene [28]. The average copy number of 16S rRNA genes per bacterium is currently estimated at 4.1 based on the Ribosomal RNA Operon Copy Number Database (rrnDB version 4.4.4) [29]. The numbers of bacterial cells were calculated by dividing the copy number of 16S rRNA gene by 4.1, and the ‘relative abundance’ of resistance genes (copies per bacterial cell) was estimated by dividing the ARGs concentration in each sample by its corresponding number of bacterial cells. Additionally, the copy number of resistance genes discharged by each person per day in urban areas is defined as ‘ARG load’, which can be calculated by the formula: ARG load (copies/capita/day) = (the average concentration of sewage ARGs) × (the volume of municipal sewage discharge)/urban population. ‘ARG burden’ (copies/day, the total ARG load in urban areas) is calculated by multiplying the medium value of ARG load by total urban population.

Real-time qPCR quantification of 16S rRNA gene

Real-time qPCR assay of total bacteria was performed using a SYBR® Green approach on a Roche 480 (Roche Inc., USA). The absolute copy numbers of 16S rRNA gene were quantified using primers 515 F and 907R. The qPCR system (20 μL) amplification was conducted as reported previously [30]. The size of amplified fragments was about 410 bp. For the preparation of 16S rRNA gene standards, 16S rRNA gene was amplified from extracted DNA and then was cloned into the pMD 19-T vector (TaKaRa, Japan). Plasmids containing the target gene were used as standards for the calibration curve. All qPCR assays were conducted in triplicate with negative and positive controls.

Human gut microbiome analysis

In this study, human gut microbiota was defined as the bacterial genera detected in human gut or human intestinal tract. To track the human gut microbiome fingerprint in the sewage, the human gut microbiome database including 382 bacterial genera (Additional file 2: Dataset 1) was retrieved from Human Microbiome Project (HMP) [31] and the Metagenomics of the Human Intestinal Tract (MetaHit) [32] project. The microbial catalogue reference set (16S rRNA gene) from these two human metagenomic projects covers almost all genera of human gut bacteria and is a useful resource for further analyses of human gut microbiome [33].

Statistical analysis and network analysis

Averages and standard deviations were determined using Excel 2010 (Microsoft Office 2010, Microsoft, USA). One-way analysis of variation (ANOVA), paired-sample t tests and correlation tests were performed using SPSS V20.0 (IBM, USA). All statistical tests were considered significant at P < 0.05. Diversity index, non-metric multidimensional scaling (NMDS) and significance test (Adonis test, procrustes analysis, and mantel test) were performed in R 3.1.0 with vegan 2.2.0 [34, 35]. Post-hoc plot was generated using STAMP V2.1.3 [36]. To investigate co-occurrence patterns of microbial community and resistome, correlation matrixes were constructed by calculating each pairwise Spearman’s rank correlations. The P value was adjusted with a multiple testing correction using FDR method to reduce the false-positive results [37]. A correlation between any two items was considered statistically robust if the Spearman’s correlation coefficient (ρ) was > 0.7 and the P value was < 0.01 [16, 38]. The resulting correlation matrixes were translated into an association network using Gephi 0.9.1 [39]. An informatics mathematical approach based on geographical information systems, ArcGIS, was applied to map the resistance load and the resistance burden at varying spatial scales [40]. Spatial autocorrelation analysis was conducted to evaluate spatial dependency of ARG burdens between provinces using ArcGIS [41].