DNAmod consists of two components: a relational database back-end and a web interface front-end. We used the Chemical Entities of Biological Interest (ChEBI) database [13, 22] to seed the DNAmod database. We imported a nucleobase-related subset of ChEBI, consisting of chemical entities and related annotations. We performed queries against the entities to construct a set of candidate DNA modifications for DNAmod, retaining most of these as a separate unverified set. Then, we filtered candidate entities into a manually curated set of verified DNA modifications, augmenting them with modification-specific annotations.

The web interface front-end allows users to either search or browse through the catalogue of DNA modifications, integrating ChEBI’s information with our own.

Identifying candidate DNA modifications from ChEBI

DNAmod leverages ChEBI [22] to define a set of modified DNA candidates for inclusion and to add preliminary information for each candidate. ChEBI is a database of small, biologically relevant molecules that affect living organisms. We queried ChEBI via ChEBI Web Services [22], using Biopython [10] and the Python Simple Object Access Protocol (SOAP) client suds [35] to run the queries and construct the DNAmod database.

ChEBI provides an ontology that encodes the relationships between its compounds. We used this ontology to precisely define the notion of parents and children, which we used to hierarchically retrieve and display modifications. We used two kinds of relationships for this purpose, both of which have associated symbols, defined by ChEBI [13]: \(\mathcal {F}\) has functional parent and \(\triangle \) is a. We used these relationships to find candidate DNA modifications by identifying entities related to the core nucleobases, which we represent by their symbols: {A,C,G,T,U}. We included uracil, since many of its descendants in the ontology are modifications of thymine (CHEBI:17821, which is equivalent to 5-methyluracil) and are not annotated as descendants of thymine itself. For each of these bases, we imported all entities annotated in the ontology as its child via the \(\mathcal {F}\) has functional parent relationship. ChEBI rates entities by their degree of curation. We imported only entities with the highest rating, three stars, which indicates manual curation by ChEBI. Whenever possible, we included entities only in their nitrogenous base (nucleobase) form. If ChEBI did not have the nucleobase, we selected the nucleoside form and finally, if necessary, the nucleotide. These imported bases formed the candidate set of modifications (the unverified set), from which we created a curated set of DNA modifications (the verified set).
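As an illustration, the rating filter and the preference for nucleobase over nucleoside over nucleotide forms could be sketched as follows. This is a minimal sketch, not the DNAmod implementation: the `Entity` record, its `modification` grouping key, and the function name are hypothetical, and the identifiers used with it are placeholders rather than real ChEBI IDs.

```python
from dataclasses import dataclass

# Preference order stated in the text: nucleobase, then nucleoside, then nucleotide.
FORM_PREFERENCE = ("nucleobase", "nucleoside", "nucleotide")


@dataclass(frozen=True)
class Entity:
    chebi_id: str      # placeholder identifier, not a real ChEBI ID
    modification: str  # hypothetical key grouping the forms of one modification
    form: str          # one of FORM_PREFERENCE
    stars: int         # ChEBI curation rating, from 1 to 3


def select_candidates(entities):
    """Keep three-star entities, choosing the most preferred form per modification."""
    best = {}
    for entity in entities:
        if entity.stars != 3:
            continue  # only manually curated (three-star) entries are imported
        rank = FORM_PREFERENCE.index(entity.form)
        current = best.get(entity.modification)
        if current is None or rank < FORM_PREFERENCE.index(current.form):
            best[entity.modification] = entity
    return best
```

Under these assumptions, a modification available only as a three-star nucleotide is still retained, while a lower-rated nucleobase entry for the same modification is ignored.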

The ChEBI ontology does not generally encode \(\mathcal {F}\) has functional parent relationships for nucleobases beyond the children of the unmodified nucleobases. It instead encodes modified nucleobases with an \(\triangle \) is a relationship to their parent base. This is because descendant entities of specific modifications are generally subtypes of the class of modifications from which they originate. For example, 3-methyladenine \(\triangle \) is a methyladenine. Methyladenine, however, \(\mathcal {F}\) has functional parent adenine, since it is conceived of as possessing adenine as a characteristic group and as being derived via functional modification [13]. We therefore need to use both of these relationships, within the ChEBI ontology, to accurately capture the full nucleobase hierarchy.
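The need for both relationships can be seen in a small sketch built on the methyladenine example above. The edge list and the relationship labels `has_functional_parent` and `is_a` are an illustrative in-memory encoding, not the format returned by ChEBI Web Services, and the helper name is hypothetical.

```python
from collections import deque

# Toy ontology edges as (child, relationship, parent), from the example in the text.
EDGES = [
    ("methyladenine", "has_functional_parent", "adenine"),
    ("3-methyladenine", "is_a", "methyladenine"),
]


def descendants(root, edges, relationships):
    """Return every entity reachable from root via the allowed relationship types."""
    children = {}
    for child, relationship, parent in edges:
        if relationship in relationships:
            children.setdefault(parent, []).append(child)
    found, queue = set(), deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# Following only "has functional parent" stops at methyladenine.
functional_only = descendants("adenine", EDGES, {"has_functional_parent"})
# Following both relationships also reaches 3-methyladenine.
both = descendants("adenine", EDGES, {"has_functional_parent", "is_a"})
```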

ChEBI also provides selected citations associated with some of its entities. We retrieved the citations from ChEBI as PubMed IDs [32]. We used the Biopython [10] package Bio.Entrez to query the PubMed citation database, using NCBI’s Entrez Programming Utilities [32]. We retrieved the details of each citation and used them to construct a formatted citation. We currently support only publications indexed in PubMed.

Manual curation and annotation

We manually created and defined a whitelist, which contains our curated (or verified) set of candidates that we deem DNA modifications. For each of the bases enumerated in our whitelist, we also imported all descendants with an eventual \(\mathcal {F}\) has functional parent or \(\triangle \) is a relationship with any of the members of the verified set. We expanded the verified set to include any bases recursively imported in this manner, since they were children of verified DNA nucleobases. We also manually created and defined a distinct blacklist, which contains compounds that we deem to not be DNA modifications, also excluding any of their descendant compounds. Therefore, our above verification rule has the exception that it excludes any bases with an ancestor in our blacklist.

We can formalize the above description of bases imported from the ChEBI ontology [13] and subsequent filtering as follows. Let \(a\mathbin{\mathcal {F}}\,b\) specify that a has the \(\mathcal {F}\) has functional parent relationship with b. The definition of \(\mathcal {F}\) is transitive: for all n entities, \(l_{i}\), for \(i = 0\) to \(n - 1\), between a and b,

$$\begin{aligned} a\mathbin{\mathcal {F}}\,b \iff \bigl ( a\mathbin{\mathcal {F}}\,l_{n - 1} \bigr ) \wedge \bigl ( l_{i}\mathbin{ \mathcal {F}}\,l_{i - 1} \mathord {\forall } i \in \left( 0, n\right) \bigr ) \wedge \bigl ( l_{0}\mathbin{\mathcal {F}}\,b \bigr ). \end{aligned}$$

The analogous definitions hold for \(\triangle \).

We call each \(l_{i}\) a child of \(l_{i - 1}\) and call each \(l_{i - 1}\) a parent of \(l_{i}\). We refer to a as a descendant of b and refer to b as an ancestor of a. Let \(\mathcal {C}\) represent the first level of children of the unmodified nucleobases, such that \(\mathcal {C} = \left\{ x \mid x\mathbin{{\mathcal {F}}}\,y, y \in \{\tt{A, C, G, T, U}\} \right\} \). Let \(\mathcal {V} \subset \mathcal {C}\) represent the manually-annotated, verified proper subset of \(\mathcal {C}\).

We manually curated a blacklist of excluded entities, \(\mathcal {B}\), satisfying: \(\mathcal {B} \subseteq \left\{ b\mid\left( b\mathbin{{\mathcal {F}}}\,p \vee b \mathbin{\triangle} p \right) , p \in \mathcal {V} \right\} \). We imported the set of verified DNA modifications, \(\mathcal {M}\), defined in set-builder notation with predicates, as:

$$\begin{aligned} \mathcal {M}=\, {} \mathcal {V}\, \cup\, &\left\{ z \mid \left( \exists v\, {\in } \mathcal {V} \right) \left( \forall\, b\, {\in }\, \mathcal {B} \right) \right. \\&\left. \left[ \left( z\mathbin{{\mathcal {F}}}\,v \vee z \mathbin{\triangle} v \right) \wedge \lnot \left( z\mathbin{{\mathcal {F}}}\,b \vee z \mathbin{\triangle} b \right) \right] \right\} . \end{aligned}$$
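The construction of \(\mathcal {M}\) can be sketched in a few lines of Python over a toy parent-to-children map. This sketch makes two assumptions that go beyond the formula: it follows chains that mix the two relationship types, and, following the prose description of the blacklist, it excludes the blacklisted entities themselves as well as their descendants. All entity names and function names are placeholders.

```python
def descendants_of(entity, children):
    """All descendants of entity, following either relationship type transitively."""
    found, stack = set(), [entity]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def verified_modifications(verified, blacklist, children):
    """Compute M: V plus the descendants of V that have no blacklisted ancestor."""
    from_verified = set().union(*(descendants_of(v, children) for v in verified))
    from_blacklist = set().union(*(descendants_of(b, children) for b in blacklist))
    # Assumption: blacklisted entities themselves are excluded, as in the prose.
    return set(verified) | (from_verified - from_blacklist - set(blacklist))


# Toy hierarchy: parent -> children, via either relationship (placeholder names).
CHILDREN = {"v1": ["x", "bad"], "x": ["z"], "bad": ["y"]}
M = verified_modifications({"v1"}, {"bad"}, CHILDREN)
```

Here `M` contains `v1`, `x`, and `z`, while `bad` and its descendant `y` are left out.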

Finally, we manually added a small number of bases that do not have any of the DNA bases or uracil as a parent in the ontology, but are nonetheless notable modified bases, such as 2′-deoxyinosine.

We additionally provided two kinds of manual annotations: sequencing techniques and occurrence in nature, for each modified DNA base. We surveyed the literature of sequencing methods for covalent DNA modifications [6, 29, 37, 39, 45], and annotated the available methods for each base, providing curated citations. These annotations include the method’s name, our categorizations of the basis for the method (such as chemical conversion), its resolution, and any further qualifier (Table 1A). Qualifiers include limitations (such as applicability to only some genomic regions), enrichment methods, and advantages (such as optimization for single-cell sequencing). We considered any method which involves affinity-based recognition of targets to be of “low” resolution [5]. These methods can also suffer from low specificity or antibody cross-reactivity [6]. Conversely, we annotated any methods based principally upon the detection of a chemically converted modification as “high” resolution. This generally reflects the resulting resolution of the method’s output data and often corresponds to the necessity to bin genomic regions during downstream analyses of the detected analyte.

For each modified base, we investigated whether it had previously been reported to occur in vivo. This included any endogenous occurrences, as well as those stimulated exogenously, such as from exposure to an environmental toxin. We annotated any modification observed in vivo as “natural”. We additionally provided non-exhaustive examples of organisms in which the modifications have been reported. We based these annotations on our ability to find evidence of in vivo occurrence, as opposed to publications describing only the synthesis or physicochemical properties of a nucleobase. For each of these annotations, we also briefly annotated a primary biological function, if known (Table 1B). We annotated any modification not observed in vivo as “synthetic” and listed a reference pertaining to its synthesis or in which the synthetic base was used.

We entered these annotations in two annotation source files (Table 1), which we later imported into our database. This decouples them from the rest of our pipeline and allows outside experts to submit additions without requiring knowledge of our pipeline or programming workflow.

Table 1 Possible annotations within DNAmod’s curated (A) sequencing method data and (B) natural occurrence information

DNAmod integrates manually-curated nomenclature, including the name and abbreviation deemed most consistent and in common use [9, 11, 28]. We additionally provide recommendations for one-letter symbols of selected modified bases, and in some instances for their base-pairing complements, as previously described [49]. The DNAmod web interface displays recommended notation in an organized table (Fig. 1).

Fig. 1 Manually-curated recommended notation, mapping techniques, and natural occurrence data for 5-formylcytosine (5fC). See Table 1 for an explanation of the mapping and natural occurrence table headers

We store all data, either imported from ChEBI or from our manual annotations, within a SQLite [25] database, used via the Python sqlite3 package [16].
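To make the storage step concrete, the following is a minimal sketch of how imported entities and manual annotations might be held and joined with the sqlite3 package. The table names, column names, identifier, and method name are hypothetical placeholders; the actual DNAmod schema is not described here.

```python
import sqlite3

# In-memory database for illustration; a file path would give a persistent database.
connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE modification (
        chebi_id TEXT PRIMARY KEY,
        name     TEXT NOT NULL,
        verified INTEGER NOT NULL
    );
    CREATE TABLE sequencing_method (
        chebi_id   TEXT NOT NULL REFERENCES modification (chebi_id),
        method     TEXT NOT NULL,
        resolution TEXT NOT NULL
    );
    """
)

# Placeholder rows: "EXAMPLE:1" and "example-method" are not real identifiers.
with connection:
    connection.execute(
        "INSERT INTO modification VALUES (?, ?, ?)",
        ("EXAMPLE:1", "5-formylcytosine", 1),
    )
    connection.execute(
        "INSERT INTO sequencing_method VALUES (?, ?, ?)",
        ("EXAMPLE:1", "example-method", "high"),
    )

# Join imported entities with their manual annotations, verified set only.
rows = connection.execute(
    """
    SELECT m.name, s.method, s.resolution
    FROM modification AS m
    JOIN sequencing_method AS s ON s.chebi_id = m.chebi_id
    WHERE m.verified = 1
    """
).fetchall()
```

Keeping annotations in their own table mirrors the separation between ChEBI-derived data and the manually curated source files.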

Website generation

We created a static website to display and provide navigation for the information contained within the database. We generated it by formatting the database content using the templating engine Jinja2 [42]. Two templates were sufficient to generate all HTML files. We used a single template for all modification pages and another for the homepage. We also record the date of the most recent update to the database. The main footer contains this date, along with the current ChEBI and DNAmod versions. All web pages use the Bootstrap [36] framework, which provides a standardized, portable, and mobile-compatible viewing format. We visualized the chemical structure of each compound from its Simplified Molecular-Input Line-Entry System (SMILES) [52] data, if available from ChEBI, as a vector graphic. We did this using the cheminformatics toolkit Open Babel [34], via its Python wrapper Pybel [33].
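The page generation step might look like the following minimal Jinja2 sketch. The template text, the record fields, and the placeholder date are assumptions made for illustration; the actual DNAmod templates are not reproduced here.

```python
from jinja2 import Template

# Hypothetical template; the real modification-page template is not shown in the text.
PAGE_TEMPLATE = Template(
    "<h1>{{ name }}</h1>\n"
    "<ul>\n"
    "{% for synonym in synonyms %}"
    "  <li>{{ synonym }}</li>\n"
    "{% endfor %}"
    "</ul>\n"
    "<footer>Last updated: {{ updated }}</footer>\n"
)


def render_page(record):
    """Render one static HTML page from a database record given as a dict."""
    return PAGE_TEMPLATE.render(**record)


# Placeholder record; the date is illustrative only.
html = render_page(
    {"name": "5-formylcytosine", "synonyms": ["5fC"], "updated": "2000-01-01"}
)
```

Because one template serves every modification page, regenerating the site after a database update amounts to calling the render function once per record and writing each result to a file.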

Searching and navigation

DNAmod makes modifications accessible via three main navigation options, each provided on a tab of the DNAmod homepage. First, users may search for modifications by several fields. Second, users may find curated DNA modifications via a pie menu [7]. Third, users may find candidate entities as a list, categorized by their parent unmodified nucleobases.

Client-side search functionality provides a means of rapidly finding bases with differing nomenclature (Fig. 2a), while maintaining a static web page. This functionality relies on the elasticlunr.js JavaScript module [47]. Searches match to multiple fields: common or International Union of Pure and Applied Chemistry (IUPAC) names, all synonyms, any assigned abbreviation, and recommended notation symbol, when available. DNAmod displays curated DNA modifications in green, and others in magenta. The search results provide the field matched by the query, such as “abbreviation”, along with the common name of the associated hit.

Alternatively, users may browse the modifications in DNAmod through a pie menu [7] interface (Fig. 2b). This interface hierarchically arranges the bases according to their structure within the ChEBI ontology. The innermost ring consists of the four unmodified DNA bases, with an additional “other” category. This category encapsulates modified bases found in DNA, but which are not modifications of one of the four DNA bases. Consecutive outer rings represent children of the previous base or category. We demarcated natural versus synthetic bases by colouring natural bases in teal and synthetic bases in grey.