Bacterial sampling

Genomes from 415 S. epidermidis isolates, from multiple sampling efforts, were analysed (Supplementary Data 2). These included: 240 isolates sampled as part of this study; 35 genomes available from public repositories (February 2013); and 140 recently sequenced S. epidermidis genomes from geographically and clinically diverse isolates characterised in previous studies21,23,24,52. Asymptomatic carriage isolates were sampled from healthy volunteers in Swansea University (UK) in 2012, using culture swabs containing Ames media, which were then cultured on Columbia Blood Agar plates. Volunteers gave informed consent, as assessed by the local Human Tissue Act committee (Wales REC 6) at the Swansea University Medical School (ref: #13/WA/0190). To ensure that isolates from infection were not laboratory contaminants, 113 strains from prosthetic joint infections isolated from independent pure cultures of pre-operative joint aspirates, and intraoperative tissue specimens were obtained under strict aseptic conditions16. Additionally, isolates were sampled from intraoperative surgical specimens of fracture fixations (n = 60), osteomyelitis (n = 5), bacteraemia (n = 2) and colonised catheters linked to an infection event (n = 45). Among these, 85 isolates from infection (identifier 1043–1136) were collected as part of a prospective study performed between November 2011 and September 2013 at the BGU Murnau, Germany, a level-one trauma centre with a high volume, 70-bed unit for septic and reconstructive surgery52,53. The total dataset in this study comprised 141 isolates from healthy carriage obtained in hospitals and the community from 11 countries in three continents (57/141 from the UK) and 274 isolates from clinical infections (Supplementary Data 2).

Genomic DNA extraction, sequencing and archiving

DNA was extracted using the QIAamp DNA Mini Kit (QIAGEN, Crawley, UK), using manufacturer’s instructions, with 1.5 μg/μl lysostaphin (Ambi Products LLC, NY) to facilitate cell lysis. DNA was quantified using a Nanodrop spectrophotometer, as well as the Quant-iT DNA Assay Kit (Life Technologies, Paisley, UK). High-throughput genome sequencing was performed using a HiSeq 2500 machine (Illumina, San Diego, CA), and the 100-bp short-read paired-end data were assembled using the de novo assembly algorithm, Velvet54. The VelvetOptimiser script (version 2.2.4) was run for all odd k-mer values from 21 to 99. The minimum output contig was size set to 200 bp with the scaffolding option disabled. Other program settings were as default, and assembly quality metrics were recorded (Supplementary Data 5). All genome sequences were archived on a web-accessible BIGSdb database55, and genome sequences generated in this study are available on NCBI BioProject PRJNA433155.

Core and accessory genome characterization

A S. epidermidis coding sequence pangenome gene list was constructed for isolates in this study56 by automated annotation of all genomes from the dataset using the RAST/SEED system57 and the WebMGA COG annotation server58 (see Supplementary Methods). After removal of alleles of the same gene with a BLAST threshold of 70% sequence similarity56, there were 12,079 unique genes present in at least one of the 415 genomes. Consistent with previous studies, and the whole-genome MLST principle55,59,60,61, the gene complement and allelic variation of each isolate was determined by comparison with the pangenome with gene presence recorded as a BLAST match of > 70% sequence identity over ≥ 50% of sequence length. For each pair of isolates, the number of shared genes and alleles (identical sequences at a given locus) was calculated. Core genes were present in 100% of the genomes and accessory genes were present in at least one isolate.

Phylogenetic analyses

Core gene sequences were individually aligned, using MUSCLE62, and concatenated, consistent with the gene-by-gene approach55,60,61; and a tree was reconstructed using an approximation of maximum-likelihood phylogenetics in FastTree263. This tree was used as an input for ClonalFrameML64 to produce core genome phylogenies with branch lengths corrected for recombination.

In vitro phenotype assays

To measure variation in clinically relevant phenotypes for 80 isolates, established in vitro laboratory assays quantified: (i) biofilm formation; (ii) toxicity using a vesicle lysis test (VLT) (for 48 isolates); (iii) methicillin resistance; (iv) production of interleukin-8 (IL-8) by human keratinocytes in presence of S. epidermidis; (v) IL-8 production following inoculation of human blood with S. epidermidis. Briefly, biofilm formation was assessed using crystal violet staining of bacteria attached to the polystyrene surface of a 96-well microtitre plate65, in three biological replicates for each bacterial strain, grown for 24 h at 37 °C in tryptone soy broth (TSB) and washed in PBS (see Supplementary Methods). Methicillin resistance was quantified using standard European Committee on Antimicrobial Susceptibility Testing (EUCAST)66 methods for susceptibility testing67. Bacteria were cultured in the presence of Etest strips (bioMerieux) comprising a pre-determined continuous gradient of methicillin for ~16 h. The minimum inhibitory concentration (MIC) was recorded based upon the zone of inhibition. Cell toxicity was assessed using a vesicle lysis test (VLT)29 designed to be specific to small amphipathic peptides, including staphylococcal delta and phenol soluble modulin (PSM) toxins (see Supplementary Methods). Briefly, a solution of lipid vesicles containing encapsulated self-quenched fluorescent dye, 5(6)-carboxyfluorescein (CF), were designed to be responsive to specific Staphylococcus toxins so that when the vesicles were disrupted by bacterial supernatants containing secreted cytolytic factors, an increase of fluorescence was measured (see Supplementary Methods)29. IL-8 was chosen as an immune response marker from a suite of cytokines as it is known to be important in mediating the pro-inflammatory response in staphylococcal infection, culminating in neutrophil recruitment in pathogen defence. Overexpression of IL-8, along with TNFα, IL-6 and IL-1B, has been postulated as a biomarker for staphylococcal sepsis68. IL-8 production by a HaCaT keratinocytes (ATCC)69 cell line from human skin epithelial, and by human whole blood was measured by enzyme-linked immunosorbent assay (ELISA) after challenge by 80 strains (in three biological replicates) of S. epidermidis representing the genomic diversity of the species23. These phenotype assays were chosen as they have been previously related to pathogenicity. However, it should be noted that S. epidermidis is principally adapted to the commensal niche (17) with no clear virulence-associated phenotype that completely distinguishes invasive from commensal strains70,71. The incomplete understanding of pathogenicity means that it is possible that a given phenotype may promote opposite outcomes, for example, infectivity (and acute infection) on one hand and adaptability (chronic infection) on the other. Full details of in vitro phenotype assays are included in Supplementary Methods.

Pangenome-wide association study

The alignment-free GWAS method involved fragmentation of assembled genomes into consecutive, overlapping 30 bp k-mers (or “30-mers”, termed “k-mers” throughout this study), and sorting by isolate source (asymptomatic carriage vs. infection), capturing genetic variation in the core and accessory genome59,72. The prevalence of each k-mer in the two phenotypic groups was quantified in a 2 × 2 contingency table (with four cells a, b, c, d) in which rows indicated presence/absence of the k-mer and columns indicated phenotype. Because bacteria reproduce clonally, sequences present in related strains will not only reflect adaptive elements associated with the phenotype of interest, but also sequence that was inherited from the common ancestor, potentially confounding GWAS analysis25,72,73,74. To account for this, two steps were taken. First, duplicate input datasets were defined, each containing 38 unique isolate pairs (one from asymptomatic carriage, one from infection) that are closely related on a ClonalFrameML phylogeny (Supplementary Figure S6). These two datasets were technical replicates for independent GWAS analyses. Second, the significance of the association score (P-value) for each k-mer, a + d – (b + c), was determined by comparing the observed association score with a Monte Carlo simulated null distribution where k-mers where randomly gained and lost along the branches of the clonal phylogeny, independent of the phenotype of interest (asymptomatic carriage vs. infection). Algorithmic comparison of the simulated and observed k-mer score distributions allows correction of the P-values to account for the phylogenetic relationships72. Details of the pipeline and scripts are available in Supplementary information (see Supplementary Figure 1) and on https://github.com/sheppardlab/pGWAS. To allow functional inference, the significantly associated k-mers (P < 0.001) were mapped to the coding sequence pangenome described above56, and allele at each locus were identified. The reference pangenome approach is detailed in the Supplementary information.

Covariance of GWAS hits with secondary in vitro phenotypes

All k-mers significantly associated with the primary phenotype (asymptomatic carriage and infection) were correlated with data from in vitro phenotypes for that isolate. Results from quantitative biofilm formation, methicillin resistance, cell toxicity and host cell immune response phenotype assays were divided into three categories with a third of ranked values in each (upper 100th–66th, middle 66th–33rd, lower 33rd–1st). For every k-mer associated with the primary phenotype (n = 310,850), a 2 × 2 contingency table summarised k-mer presence/absence in isolates within the upper and lower percentile for the secondary phenotype (Fig. 1). The genome position of k-mers significantly associated with the secondary phenotype (Fisher’s exact test, P-value < 0.005) were visualised using Circos75.

Horizontal gene transfer among infection-associated genes

Population genetic analyses were undertaken to compare molecular variation among 61 genes that contained infection-associated elements, correlated with a secondary infection phenotype and those that did not (n = 1946 genes), in asymptomatic carriage and infection isolates. For both groups, the number of alleles at each locus (determined using a whole-genome MLST approach61 and consistency index (CI)) were calculated. The consistency of a phylogenetic tree to patterns of variation in sequence alignments was determined for each gene of interest, and constituted an inference of the minimum amount of homoplasy in these genes, as implied by the tree76. The CI function from the R Phangorn package77 was used to calculate consistency indices for every single-gene alignment of the 61 genes of interest to a phylogeny constructed from a concatenated gene-by-gene alignment of 1946 genes shared by all 152 isolates used in the GWAS. The average CI of these shared genes was compared to that of the 61 genes containing pathogenicity-associated elements and correlated with secondary in vitro phenotypes.

Risk calculation

Pathogenicity is a complex multifactorial property. By training a classifier using the output of the GWAS analysis, we were able to go from observations of sequence variation among infection and carriage isolates to predicting phenotype and allowing risk calculation for different genotypes. To capture the non-linear and potentially complex association between sequence variation and phenotype, a Random Forest (RF) classifier was used78. To limit the complexity of the model, a feature selection procedure was applied. The data contained 415 isolates (141 asymptomatic, 274 infection). The set of candidate predictors consisted of 310,850 presence/absence patterns of disease-associated k-mers identified in the primary GWAS analysis and 23,561 presence/absence patterns of disease-associated and lab phenotype-correlated k-mers (Fig. 1). After filtering out the non-unique k-mer patterns, this corresponded to 1900 and 293 predictors, respectively. In separate RF runs, the classifiers were trained using all 1900 or 293 predictors. The importance of the predictors was estimated using the built-in criterion of the RF model. The predictors were then sorted from the most to the least important. To reduce the model complexity and thereby the risk of overfitting, we applied a two-step feature selection approach. In the first step, we made use of prior biological knowledge and focused on k-mers that were correlated with known pathogenicity-associated laboratory phenotypes. In the second step, we used a data-driven procedure to pick out a small subset of the most informative predictors discovered during the first step. To evaluate the performance of models including only a small subset of the predictors, the classification accuracy of RF models including only the l highest ranked predictors (l = 1,…n) was estimated using two-fold cross-validation (100 iterations).

The accuracy of the classifier was estimated by out-of-bag prediction, which gives an unbiased estimate of the out-of-sample accuracy without requiring a separate test set. The procedure exploits the subsampling step used during training where the out-of-bag prediction of isolate A is the mean prediction averaged over all trees that did not have isolate A included in their bootstrap training sample.

Ethics

Volunteers who donated blood for this study gave their consent as part of a research project that has been assessed by the local Human Tissue Act committee (Wales REC 6) at the Swansea University Medical School (ref: #13/WA/0190).