16S rRNA gene sequencing of a pure Salmonella bongori culture

To demonstrate the presence of contaminating DNA and its impact on high and low biomass samples, we used 16S rRNA gene sequence profiling of a pure culture of Salmonella bongori that had undergone five rounds of serial ten-fold dilutions (equating to a range of approximately 10^8 cells as input for DNA extraction in the original undiluted sample, to 10^3 cells in dilution five). S. bongori was chosen because we have not observed it as a contaminant in any of our previous studies and it can be differentiated from other Salmonella species by 16S rRNA gene sequencing. As a pure culture was used as starting template, regardless of starting biomass, any organisms other than S. bongori observed in subsequent DNA sequencing results must therefore be derived from contamination. Aliquots from the dilution series were sent to three institutes (Imperial College London, ICL; University of Birmingham, UB; Wellcome Trust Sanger Institute, WTSI) and processed with different batches of the FastDNA SPIN Kit for Soil (kit FP). 16S rRNA gene amplicons were generated using both 20 and 40 PCR cycles and returned to WTSI for Illumina MiSeq sequencing.

S. bongori was the sole organism identified in the original undiluted culture, but with subsequent dilutions a range of contaminating bacterial groups increased in relative abundance while the proportion of S. bongori reads concurrently decreased (Figure 1). By the fifth serial dilution, equivalent to an input biomass of roughly 10^3 Salmonella cells, contamination was the dominant feature of the sequencing results. This pattern was consistent across all three sites and was most pronounced with 40 cycles of PCR. These results highlight a key problem with low biomass samples. The most diluted 20-PCR cycle samples resulted in low PCR product yields, leading to under-representation in the multiplexed pool of samples for sequencing as an equimolar mix could not be achieved (read counts for each sample are listed in Additional file 1: Table S1a). Conversely, using 40 PCR cycles generated enough PCR products for effective sequencing (at least 14,000 reads per sample were returned, see Additional file 1: Table S1a), but a significant proportion of the resulting sequence data was derived from contaminating, non-Salmonella, DNA. It should be noted, though, that even when using only 20 PCR cycles contamination was still predominant with the lowest input biomass [see Additional file 1: Figure S1].
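The dilution effect described above can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that every extraction carries a fixed contaminant load that is independent of input biomass, and that template is sampled in proportion to its abundance. The function name and the `contaminant_cells` value are hypothetical and are not measurements from this study.

```python
def expected_target_fraction(input_cells, contaminant_cells):
    """Fraction of template expected to derive from the target organism,
    assuming a fixed contaminant load expressed in cell equivalents."""
    return input_cells / (input_cells + contaminant_cells)


# Undiluted sample (~10^8 cells) followed by five ten-fold dilutions (down to ~10^3).
dilution_series = [10 ** (8 - d) for d in range(6)]

# Hypothetical contaminant load in cell equivalents (an assumption, not data).
contaminant_cells = 1e4

for d, cells in enumerate(dilution_series):
    frac = expected_target_fraction(cells, contaminant_cells)
    print(f"dilution {d}: {cells:.0e} input cells, target fraction {frac:.4f}")
```

Under this assumed load the target makes up half of the template at dilution four and less than a tenth at dilution five. The crossover point depends entirely on the assumed load, which is why the same reagents can appear clean with high biomass input and heavily contaminated with low biomass input.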

Figure 1 Summary of 16S rRNA gene sequencing taxonomic assignment from ten-fold diluted pure cultures and controls. Undiluted DNA extractions contained approximately 10^8 cells, and controls (annotated in the Figure with 'con') were template-free PCRs. DNA was extracted at ICL, UB and WTSI laboratories and amplified with 40 PCR cycles. Each column represents a single sample; sections (a) and (b) describe the same samples at different taxonomic levels. a) Proportion of S. bongori sequence reads in black. The proportional abundance of non-Salmonella reads at the Class level is indicated by other colours. As the sample becomes more dilute, the proportion of the sequenced bacterial amplicons from the cultured microorganism decreases and contaminants become more dominant. b) Abundance of genera which make up >0.5% of the results from at least one laboratory, excluding S. bongori. The profiles of the non-Salmonella reads within each laboratory/kit batch are consistent but differ between sites.

Sequence profiles revealed some similar taxonomic classifications between all sites, including Acidobacteria Gp2, Microbacterium, Propionibacterium and Pseudomonas (Figure 1b). Differences between sites were observed, however, with Chryseobacterium, Enterobacter and Massilia more dominant at WTSI, Sphingomonas at UB, and Corynebacterium, Facklamia and Streptococcus at ICL, along with a greater proportion of Actinobacteria in general (Figure 1a). This illustrates that there is variation in contaminant content between laboratories, which may be due to differences between reagent/kit batches or contaminants introduced from the wider laboratory environment. Many of the contaminating operational taxonomic units (OTUs) represent bacterial genera normally found in soil and water, for example Arthrobacter, Burkholderia, Chryseobacterium, Ochrobactrum, Pseudomonas, Ralstonia, Rhodococcus and Sphingomonas, while others, such as Corynebacterium, Propionibacterium and Streptococcus, are common human skin-associated organisms. By sequencing PCR 'blank' negative controls, specifically PCR-amplified ultrapure water with no template DNA added, we were able to distinguish between taxa that had originated from the DNA extraction kits as opposed to DNA from other sources (such as PCR kit reagents, laboratory consumables or laboratory personnel). Sixty-three taxa were absent from all PCR blank controls but present at >0.1% proportional abundance in one or more serially-diluted S. bongori samples [see Additional file 1: Figure S2], suggesting that they were introduced to the samples at the DNA extraction stage. These include several abundant genera observed at all three sites, such as Acidobacteria Gp2, Burkholderia, unclassified Burkholderiaceae and Mesorhizobium. They also include taxa, such as Hydrotalea and Bradyrhizobium, that were only abundant in samples processed by one or two sites, possibly indicative of variation in contaminants between different batches of the same type of DNA extraction kit.
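The blank-control comparison described above amounts to a simple set filter: keep taxa that never appear in a PCR blank but exceed 0.1% proportional abundance in at least one diluted sample. The following is a minimal sketch of that logic, not the analysis pipeline used in this study; the function name, data layout and example proportions are all hypothetical.

```python
def kit_associated_taxa(sample_abundance, blank_abundance,
                        threshold=0.001, exclude=("Salmonella",)):
    """Return taxa absent from every blank but above `threshold`
    proportional abundance in at least one sample.

    Both mappings have the form {sample_name: {taxon: proportion}}.
    Taxa listed in `exclude` (the cultured organism) are never flagged.
    """
    seen_in_blanks = {
        taxon
        for profile in blank_abundance.values()
        for taxon, proportion in profile.items()
        if proportion > 0
    }
    flagged = set()
    for profile in sample_abundance.values():
        for taxon, proportion in profile.items():
            if (proportion > threshold
                    and taxon not in seen_in_blanks
                    and taxon not in exclude):
                flagged.add(taxon)
    return sorted(flagged)


# Made-up proportions for illustration only; not data from this study.
samples = {
    "dilution_4": {"Salmonella": 0.60, "Burkholderia": 0.30, "Pseudomonas": 0.10},
    "dilution_5": {"Salmonella": 0.20, "Burkholderia": 0.55, "Mesorhizobium": 0.25},
}
blanks = {"pcr_blank_1": {"Pseudomonas": 1.0}}

print(kit_associated_taxa(samples, blanks))
```

In this made-up example Pseudomonas is not flagged because it also appears in the blank, so it cannot be attributed to the extraction step alone.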

Quantitative PCR of bacterial biomass

To assess how much background bacterial DNA was present in the samples, we performed qPCR of bacterial 16S rRNA genes and calculated the copy number of genes present with reference to a standard curve. Assuming a complete absence of contamination, copy number of the 16S rRNA genes present should correlate with dilution of S. bongori and reduce in a linear manner. However, at the third dilution copy number remained stable and did not reduce further, indicating the presence of background DNA at approximately 500 copies per μl of elution volume from the DNA extraction kit (Figure 2).
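As a rough illustration of the two calculations in this paragraph, the sketch below fits a log-linear standard curve (Ct against log10 copy number) and then shows why a constant background produces a plateau. The standards, Ct values and starting copy number are hypothetical; only the background of approximately 500 copies per μl is taken from the text, and the function names are illustrative.

```python
def fit_standard_curve(log10_copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate copy number from a Ct value."""
    return 10 ** ((ct - intercept) / slope)


def expected_copies_per_ul(start_copies, dilution, background=500.0):
    """Expected copies after `dilution` ten-fold steps plus a constant background."""
    return start_copies / 10 ** dilution + background


# Hypothetical standards lying on Ct = -3.3 * log10(copies) + 38.
standard_log10_copies = [2, 3, 4, 5, 6]
standard_ct = [31.4, 28.1, 24.8, 21.5, 18.2]

slope, intercept = fit_standard_curve(standard_log10_copies, standard_ct)
print(f"slope {slope:.2f}, intercept {intercept:.2f}")
print(f"Ct 24.8 corresponds to ~{copies_from_ct(24.8, slope, intercept):.0f} copies")

# Hypothetical starting copy number; the background term dominates at high dilution.
for d in range(6):
    print(d, expected_copies_per_ul(1e6, d))
```

With any fixed background the expected copy number approaches the background value as dilution increases, which corresponds to the plateau described above.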

Figure 2 Copy number of total 16S rRNA genes present in a dilution series of S. bongori culture. Total bacterial DNA present in serial ten-fold dilutions of a pure S. bongori culture was quantified using qPCR. While the copy number initially reduces in tandem with increased dilution, plateauing after four dilutions indicates consistent background levels of contaminating DNA. Error bars indicate standard deviation of triplicate reactions. The broken red line indicates the detection limit of 45 copies of 16S rRNA genes. The no template internal control for the qPCR reactions (shown in blue) was below the cycle threshold selected for interpreting the fluorescence values (that is, less than 0), indicating the contamination did not come from the qPCR reagents themselves.

Shotgun metagenomics of a pure S. bongori culture processed with four commercial DNA extraction kits

Having established that 16S rRNA gene sequencing results can be confounded by contaminating DNA, we next investigated whether similar patterns emerge in shotgun metagenomics studies, which do not involve a targeted PCR step. We hypothesised that if contamination arises from the DNA extraction kit, it should also be present in metagenomic sequencing results. DNA extraction kits from four different manufacturers were used in order to investigate whether or not the problem was limited to a single manufacturer. Aliquots from the same S. bongori dilution series were processed at UB with the FastDNA SPIN Kit for Soil (FP), MoBio UltraClean Microbial DNA Isolation Kit (MB), QIAamp DNA Stool Mini Kit (QIA) and PSP Spin Stool DNA Plus kit (PSP). As with 16S rRNA gene sequencing, the proportion of reads mapping to the S. bongori reference genome sequence decreased as the sample dilution increased (Figure 3a). Regardless of kit, contamination was always the predominant feature of the sequence data by the fourth serial dilution, which equated to an input of around 10^4 Salmonella cells.

Figure 3 Summary of the metagenomic data for the S. bongori ten-fold dilution series (initial undiluted samples contained approximately 10^8 cells), extracted with four different kits. Each column represents a single sample. A sample of ultrapure water, without DNA extraction, was also sequenced (labelled 'water'). a) As the starting material becomes more diluted, the proportion of sequenced reads mapping to the S. bongori reference genome decreases for all kits and contamination becomes more prominent. b) The profile of the non-Salmonella reads (grouped by Family, only those comprising >1% of reads from at least one kit are shown) is different for each of the four kits.

Samples were processed concurrently within the same laboratory. If the contamination was derived from the laboratory environment then similar bacterial compositions would be expected in each of the results. Instead, a range of environmental bacteria was observed, with a different profile in each kit (Figure 3b). FP had a stable kit profile dominated by Burkholderia, PSP was dominated by Bradyrhizobium, while the QIA kit had the most complex mix of bacterial DNA. Bradyrhizobiaceae, Burkholderiaceae, Chitinophagaceae, Comamonadaceae, Propionibacteriaceae and Pseudomonadaceae were present in at least three quarters of the dilutions from PSP, FP and QIA kits. However, relative abundances of taxa at the Family level varied according to kit: FP was marked by Burkholderiaceae and Enterobacteriaceae, whereas PSP was marked by Bradyrhizobiaceae and Chitinophagaceae. The contamination in the QIA kit was relatively diverse, and included higher proportions of Aerococcaceae, Bacillaceae, Flavobacteriaceae, Microbacteriaceae, Paenibacillaceae, Planctomycetaceae and Polyangiaceae than the other kits. Kit MB did not have a distinct contaminant profile. This was likely a result of the very low number of reads sequenced, with 210 reads in dilution 2, 79 reads in dilution 3 and fewer than 20 reads in subsequent dilutions [see Additional file 1: Table S1b]. Although read count is only a semi-quantitative measure of DNA concentration, this may indicate that levels of background contamination from this kit were lower than those from the other kits tested.

Comparatively few of the contaminant taxa detected in the 'blank' water control, which was dominated by Pseudomonas, were also detected in the serially diluted metagenomic samples. This provided further evidence that the observed contamination was likely to have originated in large part from the DNA extraction kits themselves. These metagenomic results, therefore, clearly show that contamination becomes the dominant feature of sequence data from low biomass samples, and that the kit used to extract DNA can have an impact on the observed bacterial diversity, even in the absence of a PCR amplification step. Reducing input biomass again increases the impact of these contaminants upon the observed microbiota.

Impact of contaminated extraction kits on a study of low-biomass microbiota

Having established that the contamination in different lots of DNA extraction kits is not constant or predictable, we next show the impact that this can have on real datasets. A recent study in a refugee camp on the border between Thailand and Burma used an existing nasopharyngeal swab archive [38] to examine the development of the infant nasopharyngeal microbiota. A cohort of 20 children born in 2007/2008 were sampled every month until two years of age, and the 16S rRNA gene profiles of these swabs were sequenced by 454 pyrosequencing.

Principal coordinate analysis (PCoA) showed two distinct clusters distinguishing samples taken during early life from those taken from subsequent sampling time points, suggesting an early, founder nasopharyngeal microbiota (Figure 4a). Four batches of FP kits had been used to extract the samples and a record was made of which kit was used for each sample. Further analysis of the OTUs present indicated that samples possessed different communities depending on which kit had been used for DNA extraction (Figure 4b,d,e) and that the first two kits' associated OTUs made up the majority of their samples' reads (Figure 4d). As samples had been extracted in chronological order, rather than random order, this led to the false conclusion that OTUs from the first two kits were associated with age. OTUs driving clustering to the left in Figure 4a and b (P value of <0.01) were classified as Achromobacter, Aminobacter, Brevundimonas, Herbaspirillum, Ochrobactrum, Pedobacter, Pseudomonas, Rhodococcus, Sphingomonas and Stenotrophomonas. OTUs driving data points to the right (P value of <0.01) included Acidaminococcus and Ralstonia. A full list of significant OTUs is shown in Additional file 1: Table S2. Once the contaminants were identified and removed, the PCoA clustering of samples from the run no longer had a discernible pattern, showing that the contamination was the biggest driver of sample ordination (Figure 4c). New aliquots were obtained from the original sample archive, reprocessed using a different kit lot and sequenced. The previously observed contaminant OTUs were not detected, further confirming their absence in the original nasopharyngeal samples (manuscript in preparation, Salter S, Turner P, Turner C, Watthanaworawit W, Goldblatt D, Nosten F, Mather A, Parkhill J, Bentley S).
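The contaminant-removal step that precedes the second ordination can be sketched as a filter over an OTU count table followed by renormalisation. This is an illustrative sketch only, not the procedure used in the study: the function name, the table layout and the read counts are hypothetical, and the ordination itself is not shown.

```python
def remove_contaminants(otu_counts, contaminant_otus):
    """Drop flagged OTUs from each sample and convert the remaining
    counts to proportions.

    `otu_counts` has the form {sample_name: {otu: read_count}}.
    """
    cleaned = {}
    for sample, counts in otu_counts.items():
        kept = {otu: n for otu, n in counts.items()
                if otu not in contaminant_otus}
        total = sum(kept.values())
        # Samples left with no reads get an empty profile so that they can be
        # reviewed rather than silently discarded.
        cleaned[sample] = ({otu: n / total for otu, n in kept.items()}
                           if total else {})
    return cleaned


# Made-up read counts; the contaminant genera are among those named in the text.
contaminants = {"Ochrobactrum", "Ralstonia"}
table = {
    "swab_1": {"resident_otu_1": 30, "resident_otu_2": 10, "Ochrobactrum": 60},
    "swab_2": {"Ochrobactrum": 80, "Ralstonia": 20},
}

for sample, profile in remove_contaminants(table, contaminants).items():
    print(sample, profile)
```

A sample such as the hypothetical `swab_2`, which retains no reads after filtering, signals that its original profile was composed entirely of flagged OTUs.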

Figure 4 Summary of the contaminant content of nasopharyngeal samples from Thailand. a) The PCoA plot appears to show age-related clustering; however, b) extraction kit lot explains the pattern better. c) When coloured by age, the plot shows the loss of the initial clustering pattern after excluding contaminant OTUs from ordination. d) The proportion of reads attributed to contaminant OTUs for each sample, demonstrating that the first two kits were the most heavily contaminated. e) Genus-level profile of contaminant OTUs for each kit used.

This dataset, therefore, serves as a case study for the significant, and potentially misleading, impact that contaminants originating from kits can have on microbiota analyses and subsequent conclusions.