DNA methylation acts in concert with restriction enzymes to protect the integrity of prokaryotic genomes. Studies in a limited number of organisms suggest that methylation also contributes to prokaryotic genome regulation, but the prevalence and properties of such non-restriction-associated methylation systems remain poorly understood. Here, we used single molecule, real-time sequencing to map DNA modifications including m6A, m4C, and m5C across the genomes of 230 diverse bacterial and archaeal species. We observed DNA methylation in nearly all (93%) organisms examined, and identified a total of 834 distinct reproducibly methylated motifs. This data enabled annotation of the DNA binding specificities of 620 DNA Methyltransferases (MTases), doubling known specificities for previously hard to study Type I, IIG and III MTases, and revealing their extraordinary diversity. Strikingly, 48% of organisms harbor active Type II MTases with no apparent cognate restriction enzyme. These active ‘orphan’ MTases are present in diverse bacterial and archaeal phyla and show motif specificities and methylation patterns consistent with functions in gene regulation and DNA replication. Our results reveal the pervasive presence of DNA methylation throughout the prokaryotic kingdoms, as well as the diversity of sequence specificities and potential functions of DNA methylation systems.

DNA methylation is a chemical modification of DNA present in many prokaryotic genomes. The best-known role of DNA methylation is as a component of restriction-modification systems. In these systems, restriction enzymes target foreign DNA for cleavage, while DNA methylation protects the host genome from destruction. Studies in a handful of organisms show that DNA methylation may also act independently of restriction systems and function in genome regulation. However, a lack of technologies has limited the study of DNA methylation to a small number of organisms, and the broader patterns and functions of DNA methylation remain unknown. Here we use SMRT-sequencing to determine the genome wide DNA methylation patterns of more than 200 diverse bacteria and archaea. We show that DNA methylation is pervasive and present in more than 90% of studied organisms. Analysis of this data enabled annotation of the specific DNA binding sites of more than 600 restriction systems, revealing their extraordinary diversity. Strikingly, we observed widespread DNA methylation in the absence of restriction systems. Analyses of these patterns reveal that they are conserved through evolution, and likely function in genome regulation. Thus DNA methylation may play a far wider function in prokaryotic genome biology than was previously supposed.

Funding: RJR and AF were supported by the Small Business Innovation Research Program (National Institute of General Medical Sciences) of the National Institutes of Health award number R44GM105125 to RJR. Research was conducted at the E.O. Lawrence Berkeley National Laboratory and performed under Department of Energy Contract DE-AC02-05CH11231, University of California. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

In the present study, we systematically use SMRT sequencing to uncover the patterns of DNA methylation across a large panel of more than 200 diverse bacterial and archaeal genomes to provide an overview of the epigenomic landscape of prokaryotes. In so doing we reveal the ubiquity of DNA methylation, and annotate DNA binding specificities for hundreds of MTases belonging to previously intractable types of RM systems. Furthermore, we demonstrate that a large proportion of the ‘orphan’ MTase genes encoded in prokaryotic genomes are active under normal conditions and produce patterns of DNA methylation that are consistent with gene regulatory functions. Our findings provide evidence for the pervasiveness and potentially diverse functions of DNA methylation in prokaryotic genomes.

MTase-encoding genes are present in the majority of bacterial and archaeal genomes, suggesting that DNA methylation may be similarly abundant. Bisulfite sequencing has enabled genome-wide surveys of 5mC methylation [ 28 , 29 ], but a historic absence of tools for studying m6A and m4C modifications that predominate in prokaryotic DNA[ 30 ] has precluded more comprehensive studies. It has recently been demonstrated that kinetic analysis of single molecule, real-time (SMRT) sequencing data can directly detect many types of DNA modification [ 4 , 31 , 32 ]. While this approach is only modestly sensitive to m5C methylation, it is capable of detecting both m6A and m4C highly with a high degree of accuracy and sensitivity. The application of SMRT sequencing to a small number of prokaryotes enabled the identification of methylated motifs, and annotation of the respective MTases [ 15 – 21 ].

Beyond RM systems, MTases can also be involved in prokaryotic genome regulation [ 8 , 22 ]. These enzymes are typically observed as ‘orphan’ MTases that are found encoded in prokaryotic genomes in the absence of genes encoding a cognate restriction enzyme [ 23 ]. Examples include the Dam MTases that regulate DNA replication timing and gene expression of Gammaproteobacteria [ 24 ] and the CcrM MTases that regulate cell cycle progression of Alphaproteobacteria [ 19 , 25 ]. While genome-wide methylation analysis of individual genomes can in principle identify regulatory MTases and provide insight into the associated regulatory DNA methylation system [ 17 , 18 , 20 , 21 , 26 , 27 ], in the absence of systematic mapping efforts it has remained unclear how common such mechanisms are in prokaryotes. It is unknown whether the MTases associated with RM systems can also play a regulatory role.

Knowledge of the binding specificities of RM systems is critical to understanding their biological functions. Traditional approaches to determine RM system specificities rely on patterns of DNA cleavage by REases, a strategy that limits discovery largely to Type II RM systems where the REase binds and cleaves DNA at the same location [ 5 ]. Owing to this limitation, while the DNA binding specificities of several thousand Type II RM systems are known, typically fewer than 100 of each of the other types of RM system are known [ 5 ]. For Type I, IIG and III systems that cut outside of the RM binding site, a more recent alternative approach is to take advantage of the identical motif specificities of methylation and restriction. In these cases, determination of the sequences methylated by the MTase can directly reveal the recognition sequence of the accompanying REase, as recently demonstrated for individual RM systems [ 15 – 21 ].

RM systems play a central role in prokaryotic defense, and their constituent enzymes are foundational tools in modern molecular biology [ 6 ]. RM systems comprise a restriction endonuclease (REase) and a MTase with the same DNA binding specificity. The REase degrades DNA from viruses and other exogenous sources, while the cognate MTase methylates potential REase target sites in the host genome and thus protects them from cleavage. RM systems are classified into four main types [ 5 , 6 , 9 , 10 ]. Type I RM systems are complex, multi-subunit systems composed of separate REase and MTase subunits, and a common DNA recognition specificity (S) subunit [ 11 ]. The S subunit in combination with two MTase subunits methylates DNA, while the S subunit in combination with two MTase subunits and two REase subunits results in restriction. Type I RM systems recognize bi-partite motifs (e.g. CAGNNNNNTCA), and cleave at large distances (up to several kb) from their binding site. Type II RM systems comprise separate REase and MTase enzymes, which are expected to show identical DNA binding specificity [ 12 ]. They bind short, mostly palindromic, motifs (e.g. GATC), and cleave DNA within or close to the recognition site. Exceptions are the Type IIG RM systems that are single chain polypeptides containing both DNA restriction and methylation activities, bind short non-palindromic sequences (e.g. GCCCAG), and cleave DNA outside of the DNA binding site [ 12 ]. In Type III systems the MTase alone contains a DNA binding specificity domain and forms a complex with the REase in order to restrict [ 13 ]. They bind short non-palindromic motifs (e.g. CGAAT) and cut outside of the DNA binding site. Finally, Type IV RM systems cut modified DNA and do not have a MTase component [ 14 ].

DNA methylation has widespread roles in the regulation of eukaryotic genomes [ 1 – 3 ], but the extent to which similar processes exist in prokaryotes is unknown. Methylated DNA is found in the genomes of bacteria and archaea in the forms of 6-methyladenosine (m6A), 4-methylcytosine (m4C), and 5-methylcytosine (m5C)[ 4 ], and is the product of DNA methyltransferase (MTase) enzymes [ 5 ]. MTases are often a component of restriction-modification (RM) systems [ 6 ], but have also been implicated in DNA mismatch repair [ 7 ] and other epigenetic regulatory phenomena [ 8 ]. While MTase genes are present in the genomes of many prokaryotes, the overall abundance and patterns of prokaryotic DNA methylation, and the functional diversity of MTases remains largely unknown.

Results

Genome-Wide Methylation Patterns in 230 Diverse Prokaryotes To explore the locations and potential functions of DNA methylation across prokaryotes, we selected 230 organisms for study, including 217 bacterial and 13 archaeal species, spanning 19 different phyla and 37 different classes (Fig 1A, S1 Table). These organisms were selected primarily based on their phylogenetic diversity to enable a comprehensive survey of bacterial methylation systems and maximize the chances for discovery of novel systems. For each organism, we isolated genomic DNA, and performed deep single molecule, real-time (SMRT) sequencing. We obtained on average 130-fold read coverage per organism, resulting in a combined dataset size of more than 79 million single-molecule reads and 105 Gb across all sequenced genomes. We aligned all SMRT sequences to the respective reference genomes, and used kinetic data analysis to identify the locations and probable types (m6A, m4C, m5C) of high-confidence base modifications in each sequenced genome (see Methods). We then identified sequence motifs that were recurrently methylated in each genome (Methods). The results of these analyses were genome-wide basepair-resolution methylation maps for each organism examined, as well as a set of modified motifs for each genome, where each motif represents the likely binding specificity of a DNA MTase. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 1. Methylomes of 230 prokaryotes. A) Phylogenetic tree of 230 sequenced organisms. Outer bars indicate the number and types of active MTases detected per genome B) Number of methylated motifs and MTase genes identified across all 230 organisms. C) Breakdown of 583 annotated DNA MTases by type. D) Proportion of annotated DNA MTases with and without cognate restriction enzymes. For Type IIG systems, MTase and restriction activities are encoded in the same peptide. https://doi.org/10.1371/journal.pgen.1005854.g001 In total we identified 858 methylated motifs, with DNA modifications detected from 215 / 230 organisms (93%), and across all sequenced phyla (Fig 1A). On average, we observed 3 methylated motifs per organism, with a maximum of 19 in Neisseria gonorrhoeae. Among modified motifs, the predominant base modification type detected was m6A (75%), with m4C and m5C accounting for 20% and 5%, respectively (S1 Fig). The large number of m6A methylated motifs is consistent with the frequent occurrence of this modification type in the database of known MTase specificities [5], and the ease with which this modification type is detected by SMRT sequencing. In contrast, the low frequency of m5C methylated motifs is an underestimate of the true number of such motifs across these genomes due to the lower sensitivity of SMRT sequencing to this modification type (S2 Fig)[16]. The fifteen organisms without detectable methylation are from across the sampled taxa, with no obvious shared characteristics. In 8/15 cases, their genomes lack predicted MTase genes (but harbor methyl-directed restriction enzymes), while in other cases MTases are present but were not detectably active by SMRT sequencing (S2 Table). In summary, these data reveal that DNA methylation is widespread across prokaryotes, and provide a valuable resource for exploring the specificities and functions of the MTases present in these genomes.

Systematic Annotation of DNA Methyltransferase Specificities To identify the individual MTases responsible for each methylated motif, we performed large-scale annotation of MTase binding specificities across the studied genomes. Using an integrative RM-system gene annotation pipeline (Methods), we identified 1,459 candidate MTase genes across the 230 genomes, and classified them according to RM-system type (panel A in S3 Fig). We then similarly classified the 858 detected motifs according to the type of MTase system to which they likely belong (panel B in S3 Fig). Comparison of the types of methylated motifs and MTase genes within the same organism enabled us to make initial predictions of the MTase enzyme responsible for each observed methylated motif (Fig 1B). For nearly all detected methylated motifs (849, 99%), we identified at least one candidate MTase in the same genome predicted to be capable of producing the modification. In contrast, there were many (640, 44%) candidate MTase genes for which no potential modification activity was detected. Of these 227 are MTases that are predicted to produce m5C modifications that are difficult to detect by SMRT sequencing. Other cases may be MTases that are inactive due to genetic drift, mis-identified enzymes that target RNA or protein rather than DNA, or genes that are not expressed, as frequently occurs when MTases are located on prophages. In 620 cases, we were able to unambiguously match a single candidate MTase to a motif of the same type in the same genome (Fig 1C), thus generating a set of high confidence annotations of MTase specificities (S3 Table and S4 Table). The remaining unmatched motifs are due to several candidate MTases being present in the same genome, with insufficient evidence to make an unambiguous assignment. For almost all Type I and III MTase gene predictions, a cognate REase was identified in the same genomic region, suggesting that these constitute intact RM systems, and enabling the systematic annotation of restriction specificities (Fig 1D, S1 Text). In contrast, restriction enzyme candidates could not be identified for over half (165/318) of the Type II MTases that are present (Fig 1D). This is consistent with the previous observation that Type II MTases frequently occur as orphans in bacterial genomes [23]. While we cannot exclude the possibility that some novel REase genes were not identified due to sequence divergence, these 165 orphan Type II MTases represent a large group of MTases with likely non-RM functions.

Expanding the Repertoire of RM System Specificities Comparison with known RM systems [5] indicates that our systematic analysis identified 148 RM systems with previously undescribed sequence determinants, substantially expanding the repertoire of available specificities. The discovery rate of novel enzyme specificities was particularly large for Type I, IIG, and III RM systems that have been historically difficult to study using conventional approaches (Fig 2A). For example, 92% (161/175) of annotated Type I system specificities identified in our study were novel. In addition, among the Type I motifs that could not be matched to genes the majority were new specificities not seen previously. As a result, our analysis increases the number of known Type I system specificities almost four-fold (from 76 to 293, Fig 2B). Our data also reveals the extraordinary diversity of modes of DNA recognition by Type I RM systems, with variation observed in all aspects of the DNA recognition architecture (Fig 2C). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 2. Novelty and diversity of methylome data. A) Novelty of MTases annotated in this study. The 583 newly annotated MTases were compared against all existing annotated MTases in the REBASE database [5], and classed as novel or homologous to existing MTases on the basis of their sequence specificities. B) Impact of this study on the diversity of known MTase specificities. ‘Previously existing’ refers to all specificities in the REBASE database prior to this study. New specificity refers to methylated motifs identified in this study. C) Diversity of DNA recognition by Type I RM systems. Type I specificity subunits recognize bipartite motifs. In this data, all aspects of this architecture are found to vary. Red bases indicate the positions of 6mA methylation (A), or their reverse complement (T). Sub-motif sequence variation shows the IUPAC codes for ambiguous nucleotides found within the bipartite DNA sequences. https://doi.org/10.1371/journal.pgen.1005854.g002 We also identified a substantial number of novel recognition specificities by Type IIG and Type III MTases. Among Type IIG RM systems annotated, 82% (56/68) were novel, while the same was true for 79% (47/59) of the Type III specificities (Fig 2A). Unmatched motifs in these categories cannot always be unambiguously attributed as being from a Type IIG or Type III enzyme because both lead to characteristic single-strand methylation. Preliminarily we have considered short recognition sequence of 4 or 5 bases to most likely belong to the Type III family, while the longer recognition sequences of 6 or more base pairs are considered as Type IIG. Overall, the number of observed specificities across these Types of restriction system increased 2.7-fold (from 144 to 385) as a result of our study.

Novel Protective Modifications Previously, protection against Type I restriction enzymes was always found to be mediated by m6A modification [11]. In this study, we find examples of protection by m4C (M.Dac11109IV in Desulfobacca acetoxidans and M1.Mma5219I in Methanohalophilus mahii, S3 Table). Similar results have been obtained from other recent studies [5], and several of these systems have now been experimentally verified (Morgan et al. personal communication). Interestingly, when this happens there are two MTase genes associated with the system, one of which appears responsible for m6A methylation and the other for m4C methylation. In these cases the bipartite recognition sequence of the Type I S subunit has only G and C residues in one of the target recognition domains, which explains why m6A cannot be used to protect both halves. There are many homologs elsewhere in REBASE of systems like this, but often the specificity is unknown [5]. A similar situation has also been found for some Type III MTases where occasionally m4C is found as the protective modification both in some of the systems identified here as well as others [5].

New Families of Type IIG-Like Restriction Systems Type IIG systems are defined by the presence of a single target recognition domain (TRD) for the entire RM system. They typically consist of a single polypeptide containing both the endonuclease domain and m6A MTase, as in the prototypical enzyme MmeI [33] (S4A Fig). Here, we identified 76 novel Type IIG-like systems, many of which were atypical in terms of gene order, presence or absence of a DNA translocase, and differences in linkage between the endonuclease and MTase domains (S3 Table and panels B-E in S4 Fig). For example, we identified several different systems in which one peptide contains an MmeI family MTase/TRD, but in which the endonuclease is encoded on a separate peptide (AchA6III and OspHL35III, panels B and C in S4 Fig). Other examples such as CalB3II (panel D in S4 Fig) are new examples of BREX-like systems [34]. These systems use the specific methylation of the MTase protein to distinguish self from non-self in phage restriction, but appear to accomplish restriction without generating DNA cleavage. Finally, we observe novel systems that are unrelated to MmeI or BREX. For example, MexAMORF1192P is a four-protein system of two translocase proteins and separate MTase-TRD and endonuclease proteins (panel E in S4 Fig). These analyses highlight the value of SMRT-sequencing in annotating novel RM systems. The examples we describe represent just a portion of the wide diversity of Type IIG-like systems that evolve from various permutations of endonuclease, MTase and translocase domains with a single DNA recognition module. The preliminary annotations of Type IIG-like MTases from this study can be propagated across many orthologs and will enable their further characterization and systematic classification.

An Unusual Type II RM System While Type II RM systems represent historically the best-studied class of RM systems, our systematic survey identified a substantial number of new Type II RM systems, some of which have unusual properties. For example, all Type II RM systems described to date are characterized by close genomic proximity of the genes encoding the REase and the MTase, respectively [5]. We observed one pair of adjacent MTases M1.Csp12AI and M2.Csp12AI in Clostridium sp. 12(A) that were very similar to the m6A-MTase M.FokI from Flavobacterium okeanokoites. However, in Clostridium sp. 12(A) the gene encoding the corresponding FokI-like restriction enzyme was not found in the immediate vicinity of M1/M2.Csp12AI, but at a genomic location 1.2 megabase pairs (Mb) away. All three genes were tested for activity by cloning. While M2.Csp12AI could be cloned alone, it was only possible to clone the M1.Csp12AI gene in the presence of M2.Csp12AI. In both cases, just as in the genome, both MTases were shown to be fully functional by PacBio sequencing of DNA (S5 Fig). To exclude the possibility that the large apparent distance resulted from an incorrect genome assembly, we confirmed by PCR that the distance between the REase gene and the two MTase genes is at least 36 kb (S6 Fig). These results indicate that, unlike all previously described Type II RM systems, there are Type II RM systems in which the REase and MTase genes are located at distant sites on the chromosome.

Families of Active ‘Orphan’ MTases Are Conserved across Diverse Prokaryotic Phyla Our systematic survey identified 165 candidate ‘orphan’ Type II MTases (Fig 3A, S3 Table and S4 Table, Methods). These MTases are found in isolation, i.e. in the absence of corresponding restriction enzymes, but nonetheless actively methylate specific sites in the genome. This feature raises the possibility that these MTases are involved in non-RM-functions, such as gene regulation. Orphan MTases are widely distributed among prokaryotes with at least one example in 111 (48%) organisms and 15/20 different phyla included in this study (Fig 3B). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 3. Identification and classification of ‘Orphan’ Type II MTases. A) Proportion of Type II MTases encoded with (RM-system) or without (‘orphan’) a cognate restriction system. B) Proportion of sequenced organisms with orphan Type II MTases. C) Comparison of evolutionary conservation properties of Type II MTases. An MTase is considered conserved if orthologs are present in >50% species in the respective taxonomic class. D) Table of orphan MTase families identified based on protein sequence clustering (Methods and S2 Fig). Tree is based on agglomerative clustering of protein sequences. Bar charts represents fraction of related species with (green) or without (grey) a copy of the orphan MTase gene (diagonal lines, < 5 sequenced genomes). https://doi.org/10.1371/journal.pgen.1005854.g003 To explore the properties and potential functions of orphan MTases in more detail, we first examined the phylogenetic conservation of orphan and RM system MTases. We determined the presence or absence of each MTase among all sequenced species related to the host organism at the genus, family or class level, and with an available reference genome sequence (Methods). We considered MTases to be conserved if present in at least 50% of species within the respective taxonomic group (Fig 3C). Overall, orphan MTases are far more likely to be evolutionarily conserved than RM system-associated MTases. For example, the majority of orphan MTases (57%) are conserved at the genus level, while the same is true for only 9% of RM system MTases. A similar contrast between orphan and RM MTases is observed at the level of family and class (Fig 3C). These results are consistent with a greater degree of conservation of orphan MTases compared with RM MTases [23], and suggest that orphan MTases have functional roles distinct from host protection. We next performed protein sequence similarity-based clustering to identify candidate novel families of related orphan MTases. We generated initial protein clusters from all 260 Type II MTases in our study (S7 Fig and S8 Fig), then extracted sub-clusters of orphan MTases from taxonomically related host organisms and with identical motif recognition sequences (Methods). These analyses resulted in 19 orphan MTase families accounting for 107 / 165 orphan MTases in our study (Fig 3D). The remaining 58 MTases are ‘singletons’ with no ortholog in any other genome in our dataset. The two most highly represented orphan MTase families in our study are the known regulatory orphan Dam MTases in Gammaproteobacteria, and CcrM MTases in Alphaproteobacteria, reflecting our large sampling of organisms from these taxa. Of the remaining 17 candidate families, 3 are apparent homologs of Dam MTases in Cyanobacteria and two archaeal classes, respectively. The other 12 families are novel orphan MTases of unknown function and are found in diverse prokaryotes including both bacteria and archaea. The most highly represented orphan MTase family methylates the motif 5’-RAm6ATTY-3’ (T indicates that the A on the complementary strand is modified) in all six Spirochaetaceae sequenced as part of this study. This motif and orphan MTase had previously been observed in Campylobacter jejuni [16]. In many cases, novel orphan MTase families are widely conserved in genomes beyond those included in our study. For example, the gene for the orphan MTase targeting 5’-TTA m6A-3’ in two Arthrobacter species in our study is present in 39 / 42 (93%) of all sequenced genomes from the genus Arthrobacter. Similarly the orphan MTase targeting 5’-m4CATG-3’ in two Haloarchaeal species in our study is present in 121 / 156 (78%) of all sequenced genomes from the class Haloarchaea (Fig 3D). In summary, these analyses reveal the presence of several novel evolutionarily conserved families of orphan MTases of unknown function.

Orphan Type II MTases Are Associated with Unmethylated Sites in Gene Regulatory Regions We hypothesize that some of the newly discovered orphan MTases function similarly to the known regulatory orphan MTases Dam and CcrM, i.e. that they regulate gene expression through the presence or absence of methylation in regulatory sequences. Alternatively their function may be to regulate DNA replication, through clusters of motifs in regions of the genome associated with DNA replication control [35]. To explore these possibilities in more detail, we searched our methylome data for signatures consistent with such functions. It has previously been shown that a subset of target sites of the E. coli regulatory MTase Dam is completely unmethylated [36–38]. These unmethylated sites are the consequence of the competing activities of Dam MTase and regulatory proteins, and the presence or absence of methylation at these sites has a demonstrated impact on gene expression [39, 40]. We therefore asked if we could recapitulate these findings for Dam MTases in our dataset, and if similar patterns are associated with novel orphan MTases. In the E. coli data from this study, the vast majority (17,544/17,562, 99.9%) of 5’-G m6ATC-3’ motifs are fully methylated on both strands of the genome. However, a distinct set of 18 5’-G m6ATC-3’ motifs is unmethylated on both strands of the genome (Fig 4A). These unmethylated sites include six GATC positions in upstream regulatory regions of agn43 genes that are known to be regulatory targets of Dam methylation [39]. Unmethylated sites are also detected in association with the dam orphan MTase gene of Salmonella bongorii, (Fig 4B, and S5 Table). In contrast, unmethylated sites are absent from the genome of Clostridium thermocellum, a bacterium harboring a 5’-G m6ATC-3’ specific MTase that is part of an RM system (Fig 4C). These results suggest that the presence of small subsets of reproducibly unmethylated recognition motifs across the genome may be a distinctive signature of orphan MTases. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 4. Orphan MTases are associated with unmethylated sites in the gene regulatory regions. A) Scatter plot of DNA modification scores on forward and reverse strands of each Dam MTase target motif (GATC) in the E. coli genome. The ipdR (inter-pulse duration ratio) is the primary metric in DNA modification detection, and corresponds to the time delay in incorporation of successive bases in a sample versus an unmodified control. This plot reveals a distinct set of sites that is unmethylated on both strands of the genome (highlighted in red). B) Scatter plot of DNA methylation scores at Dam target motif sites in Salmonella bongorii reveals a similar set of unmethylated sites. C) Scatter plot of DNA methylation scores GATC-specific RM-system MTase in Clostridium thermocellum. In this case all sites in the genome are methylated. D) Systematic analysis of the number of unmethylated motifs (on both strands of the genome) associated with orphan MTases (blue panel), and RM-system MTase (red panel). Orphan MTase names and gene orders correspond to Fig 3. E) Fold enrichment of all motifs (grey bars) and unmethylated motifs (black bars) in gene regulatory regions. * = significantly enriched (p <0.01, Fishers exact). Letters indicate enrichment at specific functional categories of genes based on COG category analysis. K = transcription, T = signal transduction, H = coenzyme metabolism. https://doi.org/10.1371/journal.pgen.1005854.g004 We extended this analysis to all m6A orphan and RM-system associated MTases in our dataset with sufficient SMRT sequencing coverage for confident detection of unmethylated sites (Methods). We observed widespread occurrence of unmethylated sites in association with Dam MTases across Gammaproteobacteria, as well as with the regulatory CcrM orphan MTases in Alphaproteobacteria (consistent with recent observations of unmethylated sites in Caulobacter [21]). Strikingly, we also observed unmethylated sites in association with at least one MTase for the majority (13/16) of novel orphan MTase families, as well as with over half of ‘singleton’ orphan MTases (Fig 4D, S9 Fig and S5 Table). In contrast, MTases of restriction systems are almost always associated with complete modification of their genomes, with only four apparently unmethylated sites observed across 41 RM MTases (Fig 4D), and consistent with a role in protecting the genome from the cognate restriction enzyme. On further inspection, all four apparent unmethylated RM MTase sites have modification scores at the borderline of detection, and likely represent the background false-positive rate of detection of unmethylated sites. Overall these analyses confirm that unmethylated motifs are a common signature of novel orphan MTases, and may represent novel regulatory sites in the genome. In known cases of gene regulation by orphan MTases, functionally relevant motif sites are located in regulatory sequences upstream of genes and are unmethylated in some or all of the population [39, 41]. We therefore asked whether the target motifs of the orphan MTases identified in this study are similarly associated with gene regulatory regions (Fig 4E). In general, orphan MTase motifs (irrespective of their methylation state) are not significantly enriched at gene regulatory regions (defined as 100bp upstream of CDS start to 50bp downstream of CDS start, Fig 4E, grey bars). However, two-thirds of orphan MTases are associated with a significant enrichment of unmethylated motifs in gene regulatory regions (Fig 4E, black bars). Furthermore, unmethylated motifs are especially enriched in the promoters of genes of related function, most notably transcriptional regulators (Fig 4E). For example, in Nocardia sp BMG111209, unmethylated 5’-ATCGm6AT-3’ motifs are 5-fold enriched in gene regulatory regions, compared with fully methylated motifs (17/28 (61%), compared to 13% by chance). This enrichment increases to more than 20-fold for unmethylated sites upstream of transcriptional regulators (7/28 (25%) unmethylated motifs compared with only 1.2% methylated motifs, p < 0.01). Finally, at least in the case of dam methylases in gammaproteobacteria, unmethylated motifs overlap predicted transcription factor binding sites significantly more frequently than do methylated motifs (S10 Fig and S5 Table). Overall, these results demonstrate a substantial enrichment of unmethylated motifs in regulatory regions of the genome. Since this enrichment is not merely a consequence of an elevated density of motifs in these regions, it may instead reflect the involvement of these sites in regulatory processes. The patterns of novel orphan MTases (including ‘singleton’ MTases) resemble those of the known MTases Dam and CcrM, further supporting the possibility that they may have shared functions in the epigenetic control of gene expression.