SCs have distinct accessory genomes

The original analysis of 616 pneumococcal genome sequences identified 5,442 clusters of orthologous genes (COGs), of which around 1,500 were ‘core’ to almost all isolates and around 3,000 were rare19. Applying the ‘power law’ method for quantifying the pangenome22,23 suggested the gene pool available to this population was unbounded (Supplementary Fig. 1). However, this was heavily influenced by numerous rare COGs that individually had little impact on the population structure, and were the most likely to represent false-positive gene predictions. An alternative representation (Fig. 1) showed the distribution of variation across the population using pairwise comparisons between isolates. This revealed three distinct groups of points that suggested differences in gene content were approximately proportional to core genome divergence.

Figure 1: Existence of distinct clusters in the pneumococcal population and the properties of the cCOGs with which they are associated. (a) Comparison of pairwise distances between isolates in terms of their core genome divergence, as measured by the cophenetic distance calculated from a maximum likelihood core genome phylogeny19, and the difference in their accessory genomes, as measured by the Jaccard distance based on the variation in the COG content of their sequences. Points in red indicate comparisons within monophyletic sequence clusters, while purple points represent comparisons between isolates within the diverse SC16. Points in green indicate comparisons between the atypical unencapsulated isolates of SC12 and other sequence clusters; points in turquoise represent all other comparisons between isolates in different sequence clusters. (b) Properties of the COGs characteristic of SC12. The 78 COGs found in >95% of SC12 isolates, and found at a frequency <5% in the other monophyletic sequence clusters, were classified according to their function or location. (c) The 141 cCOGs of all other sequence clusters classified in the same manner. Full size image

The mainly red group of points nearest the origin of the plot demonstrated isolates within the same monophyletic SC were highly similar in their core and accessory genomes, while the set of turquoise points showed the greater level of divergence between representatives of different non-SC12 SCs (purple points represent comparisons between isolates in SC16). The discontinuity between these two sets of points indicated clonal structure in the population, as higher rates of recombination were predicted to generate a more homogenous distribution (Supplementary Fig. 2). Hence the co-circulating lineages that could be distinguished through their core genomes24 also maintained distinct accessory genomes. The set of green points represented comparisons between isolates in SC12 and those in other SCs, highlighting the divergence of SC12 from the rest of the population.

Potential speciation of atypical genotypes

Genetic loci unique to SC12 seemed likely to explain its distinctive phenotype and disease tropism. To define a set of candidate genes, ‘characteristic COGs’ (cCOGs) were identified in each SC as those COGs found in greater than 95% of genomes in that cluster, while being present in fewer than 5% of genomes in any other monophyletic SC. This identified 78 cCOGs in SC12 (Fig. 1b). Forty-four cCOGs were found within putative MGEs, and a further three were found within the conjugative element-related pneumococcal pathogenicity island 1 (PPI-1)8, although the pit iron transporter operon within this locus implicated in pathogenesis25 was absent from SC12 but present in all other SCs (Supplementary Fig. 3a).

Other non-MGE GIs contributing to SC12 divergence appeared to represent single gain or loss events conserved across the SC. All lacked a functional capsule polysaccharide synthesis (cps) locus and conserved a distinct set of large surface proteins (Supplementary Fig. 3b–e). The SC12 isolates also lacked either of the fucose utilization loci, one of which was evident in all other SCs (Supplementary Fig. 4); the conservation of these sequences across deep-branching clades suggested exchange at this locus was not rapid, and therefore a single deletion in an ancestor of SC12 would account for the observed pattern. Analogously, it seemed likely a single acquisition of genes encoding dihydroxyacetone kinases, in place of the pneumococcal histidine triad protein gene phpA (Supplementary Fig. 3f), would explain their conservation in SC12 and absence from other isolates in the collection. Hence, the SC12 isolates appeared to be genetically, antigenically and metabolically distinct from the other SCs, and therefore may represent a separate species.

Clonal association of GI diversity

Each of the other SCs had a smaller number of cCOGs (Fig. 1c and Supplementary Table 1). In some cases, these corresponded to putative protein antigens or cps genes; for instance, all SC4 representatives expressed capsule type 22F, not found elsewhere in the sample19. The other cCOGs showed little similarity in sequence or putative function, although a substantial number were located in the 3′ variable region of PPI-18. Although extensive diversity was observed at this locus across the species, there was little evidence of variation within SCs (Fig. 2), suggesting genes within this locus may underlie lineages’ distinctive traits: at least some allelic variation has previously been associated with differences in virulence in a mouse model of disease26. Distinct loci within PPI-1, each ~20-kb long and encoding metabolic genes, were evident in SC1, SC5 and the serotype 3 isolates27. However, not all alleles were unique to a SC: SC2 and SC3 shared an ~10 kb gene cassette, a 3.8 kb allele was common to SC6, SC10, SC13 and SC15, and both complete and incomplete versions of a previously described lantibiotic synthesis gene cluster8 were found in the PPI-1 loci of SC8, SC9 and SC12. In SC12, these genes were accompanied by a putative RMS, which alone constituted the 3′ variable region of PPI-1 in SC4 and SC11. The read mapping suggested SC4 also possessed the version of the island found in SC6, but in fact these genes were found on an ICE and appear to exemplify the contribution of MGEs to the diversity of sequences within PPI-18,28. MGEs themselves accounted for almost a quarter of the cCOGs not associated with PPI-1, suggesting such elements did not necessarily exhibit a high level of mobility, and instead may contribute to the stable differences between SCs.

Figure 2: Distribution of PPI-1 3′ variable region sequences across the population. (a) The maximum likelihood phylogeny based on the core genome annotated according to the distribution of sequence clusters. (b) The set of sequences representing the diversity of the 3′ variable region of PPI-1 across the collection; different reference sequences are demarcated by the alternating orange and brown blocks. Coding sequences that could be annotated based on functional domain information are marked (DHase—dehydrogenase). (c) Heatmap representing the distribution of sequence across the population. Each row corresponds to an isolate in the phylogeny. Absence of mapping reads is indicated by blue; red regions indicate read mapping coverage up to a maximum of 50-fold, demonstrating the locus is present in the relevant isolate. Full size image

Diversification driven by MGEs

The smallest putative mobile sequences previously characterized in pneumococci, three families of short interspersed repeats29, were generally stable in frequency within SCs (except for expansion and contraction of boxB tandem arrays), with small differences between lineages (Supplementary Fig. 5). All three families were evident in SC12 at typical frequencies, in contrast to the related species S. pseudopneumoniae30 and S. mitis31. Similarly, some types of insertion sequences (ISs) were ubiquitous across the sample, while others exhibited stable associations with particular lineages (Supplementary Fig. 6). Acquisition of novel ISs was observed within SC8: IS1202 was gained through serotype switching events twice, while ISSpn5 was imported as part of ICE ‘scars’8. An extensive search for longer MGEs (see Supplementary Methods) identified 2,260 putative MGE-derived genetic loci, with a median length of 31 COGs (range 2–91 COGs). On the basis of their distribution around the chromosome, 16 insertion sites could be robustly identified within the core genome (Supplementary Fig. 7 and Supplementary Methods). As in Escherichia coli and Salmonella enterica, the majority of the insertion sites (15 of 16) were in intergenic regions despite the high coding densities of bacterial genomes32. However, in contrast to these enteric bacteria, all but two insertion sites were closer to the origin of replication than the terminus. As the distance of genes from the replichore boundaries is conserved even more strongly than synteny in pneumococci33, this result should apply across the species, although sequence variation prevented the re-identification of three known insertion sites for large conjugative elements that lie close to the terminus of replication (Supplementary Fig. 7).

A network was constructed in which each putative MGE was represented by a node, coloured according to the SC of the host cell, with vertices linking elements determined as being similar using Mountford’s index34. This allowed all but 34 putative MGEs to be classified into three groups based on the presence of functional domains (Fig. 3, Supplementary Figs 8,9; Supplementary Methods). The most numerous group (1,083 nodes) represented putative ICEs (Supplementary Table 2). These spanned the full range of detected MGE lengths, likely reflecting the efficiency of conjugation in transferring long segments of DNA between streptococci35, permitting modular variation through the insertion or deletion of sequence segments36. Hence, these elements are effective vectors for the import of novel DNA into a species. For instance, all antibiotic resistance genes encoded by MGEs were found on ICEs in the component labelled A and B. These consisted of sequences related to Tn5253, generated through the insertion of Tn916-type elements and other cassettes into Tn5252-type elements8,37 (Supplementary Figs 10,11). Conversely, ICEs in component C did not appear to carry `cargo' genes, but did exhibit extensive similarity to Streptococcus suis MGE ICESsu3245738, which contained a cluster of resistance genes not evident in the pneumococcal elements (Supplementary Fig. 12). Hence, in other species, these elements can fulfil the role played by component A and B ICEs in pneumococci.

Figure 3: Mobile genetic elements found in the pneumococcal population. The 2,226 MGE sequences identified in the collection that could be classified as derived from a putative ICE, PRCI or prophage are each represented by a node, coloured according to the sequence cluster of the host in which it was found as displayed in the key. These are linked by vertices based on their similarity in terms of COGs using the Mountford index and classified using functional domains that appear characteristic of different MGE types. Clusters of nodes described in the text are annotated with letters. The grey boxes display the annotation of representative nodes, indicated by the dashed lines, from the main clusters in the network. Full size image

Component C was one of the six ICE network components predominately associated with SC12, and in this case appeared to represent a conserved GI distinguishing these isolates from the rest of the population. Such sharing of MGEs through recent common ancestry (that is, vertical transmission of the MGE) was indicated by these cliques of highly connected nodes within the same SC. Conservation of ICEs has been observed in multidrug-resistant lineages in which the ICEs encoded resistance genes37,39,40, but many examples identified here, such as those in component C or the Tn5252-type MGE similar to ICESpPN128 found in SC6 (Supplementary Fig. 11), lack such obviously beneficial cargo genes. In some cases, the importance of vertical transmission to the spread of some MGEs may reflect the absence of modules encoding the machinery needed for horizontal mobility. Examples were evident in component D (Supplementary Fig. 13), the longest representative of which appeared potentially mobile, whereas the shorter members may have lost some of the machinery for transfer between cells41.

The second group of MGEs, accounting for 471 nodes, likely represents phage-related chromosomal islands (PRCIs), mobilized in cis by ‘helper’ prophage42. First identified as ‘pathogenicity islands’ carrying superantigens in Staphylococcus aureus43, these pneumococcal examples encoded a high proportion of sequences for which no robust functional prediction could be made. Representatives from components E and F exhibited similarity with the Streptococcus pyogenes PRCI SpyCI1 (Supplementary Figs 14,15) and were typically between 8 and 15 kb in length with putative integrase and regulatory genes transcribed in one direction and a DNA primase gene transcribed in the opposite direction. Representatives from component G were similar in size and genetic organization, with an integration module that showed limited similarity with the enterococcal PRCI EfCIV58344 (Supplementary Fig. 16). The most unusual representatives were in component H, in which the putative integrase and primase genes were linked to a central, transposase-flanked portion that closely matched a GI from Streptococcus mutans LJ23 (Supplementary Fig. 17). However, there was generally little evidence of the ICE-type modular evolution: PRCIs exhibited less variation in size (Supplementary Fig. 8), and the same core set of functions tended to be conserved between them. Sequence variation was instead mosaic in nature, with the level of sequence divergence between representatives changing at breakpoints that varied between elements, likely representing the consequence of homologous recombination.

Exhibiting a similar mosaic pattern of sequence variation were prophage, the third group of 672 nodes. These generally conserved a distinctive module order, and had a consistent orientation across the five insertion sites containing full-length prophage in which the genes active during MGE replication were aligned with the strong coding bias of the pneumococcal genome, akin to the ‘polarization’ seen in enteric bacteria32. In marked contrast with ICEs and PRCIs, few instances of prophage being stably associated with a lineage were observed, as implied by the connectivity within component I (Supplementary Table 2). Notable exceptions to this trend were evident: isotypes of prophage φOXC141, independently observed to be stably associated with the serotype 3 genotype predominant in Massachusetts19,27, were identified in the expected hosts (Supplementary Fig. 18). These viruses were within component I, which encompassed the previously described diversity of pneumococcal phages4. Similarly conserved between related isolates were two atypical phage: one similar to Enterococcus faecalis V583-pp144 present in all but four members of SC4 (components J and K; Supplementary Fig. 19), and another similar to S. oralis prophage φPH10 present in five SC11 isolates (component L; Supplementary Fig. 20). Yet, the largest set of nodes that showed stable association with host SCs was found within component I; these represented a GI identified in the multidrug-resistant PMEN1 lineage8 that likely represented a ‘prophage remnant’ that has lost its mobilization machinery (Supplementary Fig. 18). Such degradation of an MGE can occur when selection acts to conserve a beneficial cargo gene45; the only candidate in this instance was a coding sequence (CDS) with a functional domain associated with RMSs, suggesting this gene may have been preserved to protect against other MGEs.

Potential barriers to sequence exchange

The apparently high rate of phage transmission suggested there would be strong selection for mechanisms that prevented infection with these viruses, which may also inhibit the exchange of other GIs. RMSs seemed likely to play such a role, and 11 candidate RMSs were identified using Pfam domains46 (Supplementary Tables 3 and 4). Three of these were present at the dpn locus, of which two were the previously characterized DpnI and DpnII systems11. The only example of switching between these two systems within a SC occurred on the long branch within SC12 (Fig. 4b). The one other change at the dpn locus within a SC involved replacement of a Type II RMS (designated DpnIII and represented by SPN23F18640-18650 in the genome of S. pneumoniae ATCC 700669 (ref. 8)) present in all but one isolate of SC13, in which it had been replaced by DpnI. DpnIII likely targets a different motif to DpnI and DpnII, both sensitive to adenine methylation11, as functional domain information suggested the DpnIII RMS modified cytosine bases.

Figure 4: Mechanisms potentially affecting GI transfers. (a) Maximum likelihood phylogeny based on the core genome annotated according to the distribution of sequence clusters. The branches of the phylogeny are coloured according to a maximally parsimonious reconstruction of CSP pherotype. The ‘CSP-3-like’ sequence was identical to the previously described CSP-3 (ref. 16) but lacking an FNIFNF peptide. (b) Variation in accessory RMSs. The columns to the left indicate which of the three RMSs is present at the dpn locus by black bars in the appropriate rows. The eight columns to the right indicate the presence of other putative accessory RMSs, as inferred from the distribution of the relevant methylase COGs. Columns are labelled with the accession code of the sequence in Supplementary Table 4, with black bars again indicating the presence of an RMS in the corresponding isolate. (c) Variation in the ivr locus. The left columns show reads corresponding to the 5′ part of the full-length spnIVRhsdS gene assigned to the two alternative 5′ TRD-encoding sequences A or B. The heatmap indicates the proportion of reads corresponding to the spnIVRhsdS gene that matched each allele, with red indicating a higher proportion and blue a lower proportion. The right columns show reads likely corresponding to the 3′ part of spnIVRhsdS assigned to the three alternative TRD-encoding sequences a, b or c. (d) Variation in the tvr locus. Eleven different spnTVRhsdS TRD-encoding sequences were identified across the population. When the TRD-encoding sequence was present as part of a full-length CDS, the cell is coloured red, if the TRD was found in the 5′ half (these are labelled with uppercase Roman numerals), and orange, if found in the 3′ half (these are labelled with lowercase Roman numerals). Where the TRD-encoding sequence was present as a lone fragment, the corresponding cell in the grid is coloured blue. Empty cells indicate the TRD-encoding sequence was absent from the corresponding isolate. Full size image

As both DpnI and DpnII do not prevent the uptake of GIs by transformation2, but are likely to be similarly effective against MGEs found as double-stranded DNA forms in the cell, it was unsurprising to find that the accessory genome diversified at approximately the same rate in isolates carrying either system (Supplementary Fig. 21). However, there was also little difference in the equivalent rate estimated from those isolates carrying DpnIII, which appeared to be a typical Type II restriction system. Furthermore, increasing numbers of non-Dpn accessory RMSs also did not appear to affect the rate of accessory genome diversification, despite these Type I and II systems being potentially able to cleave MGEs or any GIs imported by transformation (Supplementary Fig. 22). Despite their apparent lack of effect on the plasticity of genome content, these accessory RMSs exhibited a similar level of conservation across clades as the Dpn systems (Fig. 4b).

The absence of an observed effect may reflect the influence of other aspects of the transformation mechanism. One candidate was pherotype, which was also conserved across deep-branching clades. All 15 monophyletic SCs were uniformly associated with either CSP-1 or CSP-2, with no isolates having acquired the rarer pherotypes or switched between the more common types (Fig. 4a) despite requiring a change within the range of commonly observed transformation events47. One explanation is that inter-pherotype exchange of sequence is infrequent18. However, any inhibition of exchange between the pherotypes does not appear to substantially affect their relative rates of recombination. No significant difference was observed in the rate of diversification through homologous recombination relative to point mutation between SCs of the two common pherotypes (Wilcoxon rank-sum test of previously calculated r/m values19, N=15, W=20, P=0.46), and no substantial difference in the relative rate of accessory to core genome diversification could be identified between them (Supplementary Fig. 23).

Rapid RMS variation through DNA inversion

The lack of a detectable impact of either accessory RMSs or pherotype on the rate of genome content diversification indicated there may be an alternative mechanism inhibiting the spread of GIs. Two candidates were ‘core’ RMSs that were ubiquitous in the sampled population (Figs 4 and 5). Both of these were Type I RMS loci containing multiple sequences encoding different DNA-binding target recognition domains (TRDs) of specificity subunits together with a recombinase. One of these loci encoded TRDs on both strands of the genome (Fig. 5a), and had previously been demonstrated to undergo rearrangements through sequence inversion in S. pneumoniae TIGR4 (ref. 48). This phase variation resulted in five TRD-encoding sequences being combined into up to six different full-length genes, each encoding a putatively functional Type I RMS specificity subunit formed of two TRDs. This region was denoted the ‘inverting variable restriction’ locus (ivr locus), with the specificity subunit encoded by the spnIVRhsdS gene. The rapid variation in the composition of spnIVRhsdS across the sequenced collection (Fig. 4c) was hypothesized to be driven by intragenomic recombinations catalysed by the recombinase encoded by ivrR within the ivr locus48. Hence, ivrR was disrupted using an antibiotic resistance marker to stabilize three different versions of spnIVRhsdS generated by intragenomic recombination during routine culturing of S. pneumoniae R6 (ref. 49).

Figure 5: Structures of the RMS loci varying through intragenomic recombination. (a) Structure of the ivr locus in S. pneumoniae R6 and (b) structure of the tvr locus in S. pneumoniae ATCC 700669. In both cases, the CDSs encoding the methylases (hsdM genes) and endonucleases (hsdR genes) are coloured red, and the CDSs encoding the recombinases are coloured pink. The components of the specificity subunit CDSs are coloured differently: the 5′ TRD-encoding sequences are dark blue, the 3′ TRD-encoding sequences are light blue, and invariant regions are purple. TRD-encoding sequences are labelled as in Fig. 4. Full-length specificity subunit genes, containing a representative of each component type, are boxed. The sets of repeats on which the recombinases may act are indicated by the orange and green boxes. In the tvr locus of S. pneumoniae ATCC 700669, the two grey CDSs represent tvrAT, encoding a putative toxin–antitoxin system. This sequence is aligned with other tvr loci encoding functional spnTVRhsdS genes. Regions of sequence similarity between loci are indicated by red bands. The positions of primers are indicated with purple arrows. (c) Variation in the tvr locus through intragenomic recombination. For each Massachusetts isolate in (b) (and their corresponding mutants in which the 3′ end of the locus has been replaced), a single colony was serially passaged in broth three times. DNA was extracted from each passage, and the arrangement of the tvr locus assayed by PCR amplification with the primers labelled in (b) and an extension time of 60 s. The ladder used was the 1 kb Plus Invitrogen ladder, with the darkest band corresponding to a size of 1.65 kb. In each case, the expected size of the band from the native tvr locus was >3 kb, with shuffling of the TRD-encoding sequences expected to result in tvr locus arrangements that generated smaller bands. Full size image

The mutants were characterized by SMRT sequencing, with de novo assemblies confirming that each had a different spnIVRhsdS allele (Supplementary Fig. 24). The mutant with the same spnIVRhsdS sequence as the R6 genome (composed of the TRDs denoted Aa) was found to have adenines methylated at the N6 position in three motifs. Two of these, TCG A G and TCTAG A (underlined adenines were methylated; Supplementary Table 5), likely represented the activity of two Type II RMSs. Twenty CDSs in the S. pneumoniae R6 genome matched the RMS-associated functional domains listed in Supplementary Table 3. The most likely candidates for causing these methylation patterns were SpnIM, an ‘orphan’ methyltransferase encoded by a CDS adjacent to an endonuclease pseudogene, predicted to target the TCTAGA motif50; and Spr1102, which appears to form a functional Type II RMS with Spn1103 (orthologous with the accessory RMS with accession code LK020705 in Fig. 4). The third motif, C A G(N) 8 TTYG, was bipartite and likely to represent the activity of the Type I RMS encoded by the ivr locus. SMRT sequencing of the second mutant, in which the 3′ region of the spnIVRhsdS gene had been switched by an inversion such that it was composed of TRDs Ab, found the same Type II RMS motifs and an altered Type I motif, GA A (N) 9 TTYG. The maintenance of the 5′ half of the spnIVRhsdS allele was consistent with the conservation of the TTYG component of the original motif. Correspondingly, SMRT sequencing of a third mutant with a Ba spnIVRhsdS allele, in which only the 5′ region of spnIVRhsdS differed from allele Aa, identified a bipartite methylated motif of C A G(N) 7 GTG; this preserved the CAG component of the original motif, while nevertheless again altering the system’s overall specificity.

Rapid RMS variation through DNA translocation

In contrast with the ivr locus described above, the TRD-encoding sequences at the second ‘core’ RMS (the SP_0886-SP_0892 region of the S. pneumoniae TIGR4 genome) were all on the same strand (Fig. 5b). In many isolates, apparently functional specificity subunit genes were formed through the combination of two TRDs, as at the ivr locus. Alignments of the locus in closely related members of sequence type 3280 (ref. 19) suggested that ‘shuffling’ of TRDs occurred through lateral translocation of DNA (Supplementary Fig. 25); PCR amplification confirmed this was genuine variation and not an assembly artefact (Supplementary Fig. 26). This was unlikely to represent spontaneous, irreversible mutation because isolates in SC2 and SC3 apparently alternated between two different forms (Fig. 4d and Supplementary Fig. 27). Rather, the changes were likely catalysed by the putative recombinase, TvrR, encoded by this locus, henceforth termed the ‘translocating variable restriction’ (tvr) locus. These alterations would likely involve excision and re-integration of DNA; the putative TvrTA toxin–antitoxin system may select against failure to re-insert the gene cassette during rearrangements, as these systems are likely to be effective in stabilizing such dynamic genetic loci. To test whether variation in this locus could occur through intragenomic recombination, individual colonies from three different isolates, each of which had a different full-length spnTVRhsdS gene (Fig. 5b), were serially passaged in broth three times. A PCR was designed to amplify an ~3 kb product from the ‘native’ version of the locus, which could also detect rearrangements through the amplification of shorter products as the consequence of a primer binding site within a TRD-encoding sequence being shuffled closer to the 3′ edge of the locus (Fig. 5c). In the case of CH2060, only the ~3 kb product was clearly observed, suggesting any rearrangements were rare in this isolate; with BR1109, a shorter band became prominent over the time course, suggesting infrequent rearrangement; whereas the variant locus was easily detectable in ND6010 after a single night's growth. None of these shorter bands were observed following the replacement of the 3′ end of the locus, including tvrR, with a kanamycin resistance marker (Supplementary Fig. 28). Again, the high rate of this mechanism was reflected by the extensive variation in spnTVRhsdS observed across the population (Fig. 4d). These data indicated that the ‘shuffling’ of spnTVRhsdS configurations was rapid, commonly occurring within SCs, whereas the horizontal acquisition of new spnTVRhsdS TRD-encoding sequences was much less frequent.