Evolution is based on the selection of phenotypic variants that (i) confer a reproductive advantage to the individual, and (ii) are heritable, i.e. information on how to generate the phenotypic variants in response to an environment is passed from parents to offspring. Heritability has traditionally been thought to be exclusively genetic, i.e. based on variations in the DNA sequence. In this view, genetic information is expressed under the influence of environmental cues to bring about the phenotype, a process known as G × E → P (ref. 43). Over the last 30 years, however, it has become clear that a substantial amount of heritable phenotypic variance can be coded by non-genetic means44. We had earlier conceptualized this view as a systems approach to inheritance that includes genetic, epigenetic, cytoplasmic and microbial elements that are interrelated by forward and reverse interactions2. These elements interact mutually, and with the environment, to give rise to the phenotype. In this concept, genetic information (the genotype) is only one of several elements of an inheritance system that provides heritable information, which the environment shapes into a phenotype. We define here ‘inheritance system’ as a system that is able to write and store, transmit, and receive hereditary information45. The concept also implies that genotype and epigenotype cannot exist independently of each other, and are interrelated by forward action and feedback. This is different from the idea that sees the genome as hard-wired information that is controlled by the epigenome46. In the latter, the epigenome is conceptually closer to the (molecular) phenotype (i.e. product of the genotype) than to an element of the inheritance system itself.

The introduction of the epigenotype notion did not really solve the question, since theoretically each phenotype could just be the visible expression of its underlying epigenotype. Given the multiple facets phenotypes can acquire in living organisms, it is remarkable that, with very few exceptions, the genetic material and the genetic code remain extremely constant and thus universal47. In other words, there exists a single ‘type’ of genome. The origin of this universality of the genetic code remains enigmatic and controversial, but whatever the origin is, it allows coded information to be transmitted from one generation to the next. Successive generations can understand the code since it uses a universal and constant key.

Given the presumably close relation between genotype and epigenotype, we and others reasoned that the epigenotype and the epigenetic code should equally possess universality. The high conservation of histones and histone marks, and the conservation of cytosine methylation, indeed suggest this. Nevertheless, one could argue that the epigenetic code is simply entirely genetically determined. If this were true, we would expect different DNA methylation types to correspond to the clades of a taxonomic tree based on DNA sequence similarity. Our results do not support this view. Alternatively, DNA methylation types could be entirely determined by environmental conditions. In this case, similar environments should impose similar DNA methylation types. Neither our results nor recent analyses of DNA methylation in invertebrates provide evidence of this. For example, a very comprehensive study of DNA methylation in insects37 did not find relations between methylation types and social behavior, and the authors concluded that DNA methylation must have a “more ubiquitous function”. However, compared to the tremendous amount of genomic data that is available, epigenomic data is relatively sparse and biased, which is an obstacle to answering the question conclusively. In the present study, we addressed this limitation with a hybrid approach in which we combined available experimental data on DNA methylation with results from newly developed software that predicts gene body DNA methylation types from CpG o/e ratios. Our algorithm allowed us to include species for which no experimental DNA methylation data existed. Its positive predictive value (PPV, calculated from the number of species predicted positive and the number of true positives according to the literature) is excellent for mosaic methylation (PPV = 1) and methylated gene bodies (PPV = 0.875), but then decreases to 0.75 for low and 0.5 for ultra-low gene body methylation. This is because our algorithm does not differentiate well between low and ultra-low methylation. If we consider the dataset as a whole, out of the 54 species with known DNA methylation types, 41 were predicted correctly (total PPV = 0.76).
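The whole-dataset figure quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the published software: the function name `ppv` is hypothetical, and it simply divides the number of correct predictions by the number of predictions made, which is how the total value is reported in the text.

```python
def ppv(true_positives: int, predicted_positives: int) -> float:
    """Fraction of positive predictions that are correct (hypothetical helper)."""
    if predicted_positives <= 0:
        raise ValueError("predicted_positives must be greater than zero")
    if not 0 <= true_positives <= predicted_positives:
        raise ValueError("true_positives must lie between 0 and predicted_positives")
    return true_positives / predicted_positives


# Whole-dataset value from the text: 41 of 54 species predicted correctly.
total = ppv(41, 54)
print(round(total, 2))  # 0.76
```

The per-type values (1, 0.875, 0.75 and 0.5) would be obtained in the same way from the per-type counts, which are not restated here.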

There are some particularly interesting cases of “wrong” prediction. Cryptosporidium parvum is a monoxenic unicellular parasite of vertebrates. It belongs to the Apicomplexa, but its exact phylogenetic position is controversial. Excysted oocysts are the only stage that can be used to produce genomic DNA preparations free of host DNA. Notos clearly predicts high gene body methylation, but LC-ESI-MS did not detect 5mC in excysted oocysts purified from infected cattle48. Genome analysis of C. hominis, from which C. parvum shows only 3–5% sequence divergence49, showed that the number of genes is reduced (3,952 genes) compared to other apicomplexans and that the parasite relies heavily on host gene activity. The genome also shows traces of integration of genes by lateral transfer. We hypothesize that either the progenitor DNA was methylated, or that Cryptosporidium methylates DNA in the intracellular stages using the vertebrate host DNMTs.

Another peculiar case is the ciliate Tetrahymena thermophila. For this species, too, Notos predicted high methylation, while radioisotope labeling showed that Tetrahymena contains only N6-methyl-adenine but not 5mC50. T. thermophila and other ciliates use DNA elimination to remove approximately one-third of the genome when the somatic macronucleus differentiates from the germline micronucleus. Histone 3 lysine 9 trimethylation (H3K9me3) is deposited on DNA destined for this elimination (reviewed in Bracht51). Interestingly, in other ciliates, DNA methylation is used to tag the DNA to be eliminated. It might therefore be that Tetrahymena used DNA methylation in the past and lost this capacity relatively recently, so that we still see traces in the CpG o/e ratio.

In summary, Notos predicts mosaic and high gene body methylation very reliably, without being entirely error-free. We had earlier36 used only mode number (1 or 2, i.e. non-mosaic and mosaic methylation) and a peak position of 0.75 to differentiate species with presumably methylated (<0.75) and non-methylated (≥0.75) gene bodies. For the present work we added skewness (−0.04) and SD (0.12) as parameters, and changed the peak position threshold to 0.69 for better prediction.
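The two-parameter rule described above (mode number plus peak position) can be written down as a short sketch. This is an illustration under stated assumptions, not the Notos implementation: the function name and the class labels are hypothetical, and the skewness and SD parameters are deliberately left out because the text does not state how they are combined with the other two.

```python
def classify_gene_body_methylation(n_modes: int,
                                   mode_position: float,
                                   threshold: float = 0.69) -> str:
    """Classify a CpG o/e distribution from its mode number and peak position.

    Hypothetical sketch: two modes are read as mosaic methylation; for one
    mode, a peak below the threshold is read as methylated gene bodies and a
    peak at or above it as non-methylated gene bodies.
    """
    if n_modes not in (1, 2):
        raise ValueError("n_modes must be 1 or 2")
    if n_modes == 2:
        return "mosaic"
    return "methylated" if mode_position < threshold else "non-methylated"


# Same unimodal profile under the current (0.69) and the earlier (0.75) threshold.
print(classify_gene_body_methylation(1, 0.72))                  # non-methylated
print(classify_gene_body_methylation(1, 0.72, threshold=0.75))  # methylated
```

Keeping the threshold as a parameter makes it easy to see which species would change class when moving from 0.75 to 0.69.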

Conceptually, our approach is based on the classical observation that CpN dinucleotides are observed at the statistically expected frequency in low methylated regions or genomes. It was initially used to identify unmethylated CpG islands in vertebrate promoters, and two major algorithms exist (Gardiner-Garden and Frommer52 and Takai and Jones53). These two algorithms use CpG o/e ratios with a score above 0.60 and 0.65, respectively. ‘Score’ (here ‘mode position’ Mo) is one parameter of our clustering algorithm. We used a decision tree to iteratively adjust this score and reached 0.69. This value is close to what was used in previous studies (e.g. for C. intestinalis: 0.70 (ref. 54) and 0.80 (ref. 31); for Nematostella vectensis: 0.70 (ref. 54); for Apis mellifera: 1.0 (ref. 54)). It is conceivable that Mo is slightly different for each major phylogenetic clade, and more sophisticated clustering algorithms that can use multiple thresholds, such as support vector machine clustering, could further improve the PPV of our method. In addition, more experimental data on a wide range of organisms is urgently needed.
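For readers unfamiliar with the quantity, the CpG o/e ratio of a sequence can be computed as follows. This is a minimal sketch using the widely used Gardiner-Garden and Frommer form, (number of CpG × sequence length) / (number of C × number of G); the function name `cpg_oe` is hypothetical, and the exact normalisation used by Notos may differ.

```python
def cpg_oe(sequence: str) -> float:
    """Observed/expected CpG ratio of a DNA sequence (hypothetical helper).

    Uses (count of CpG * length) / (count of C * count of G). Values near 1
    indicate that CpG occurs about as often as expected from base
    composition; lower values indicate CpG depletion.
    """
    seq = sequence.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        raise ValueError("sequence must contain at least one C and one G")
    n_cpg = seq.count("CG")  # 'CG' cannot overlap itself, so this counts every CpG
    return n_cpg * len(seq) / (n_c * n_g)


print(cpg_oe("CCGG"))      # 1.0
print(cpg_oe("CAGTCAGT"))  # 0.0
```

Applied to every gene body of a genome, such ratios give the distribution whose mode number and mode position are discussed above.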

We find that there are four types of gene body methylation. Although our data cover a much wider range of phylogenetic clades, our results confirm earlier findings that pointed to three to four DNA methylation types17,24. This could be the result of a “frozen accident” situation in which methylation (e.g. type 1 and type 3) occurred randomly in early ancestors (since 5mC is coding-neutral, this would not have had an impact on translation). With the establishment of a chromatin structure, 5mC was then recruited as an epigenetic information carrier; from that point on, any change in DNA methylation type would have had a strong impact on genome function and thus fitness, and the existing type was therefore maintained. Nevertheless, switching of methylation type has occurred on evolutionary time scales. Our findings indicate that there were at least three large events of secondary loss of DNA methylation: in archaeplastida (the “true” plants), where we find one branch with high methylation and another with mosaic methylation (in monocotyledons); in the apicomplexa branch of “protists”, with ultra-low or mosaic methylation; and in Diptera, with one transition to ultra-low gene body methylation (Fig. 2). For D. melanogaster in the dipteran branch, it was shown experimentally that only the ‘writing’ capacity of the epigenetic inheritance element was lost, not the receiving (‘reading’) capacity55. The reason for evolutionary switching between methylation types is not clear, and the arguments are controversial.

It has been proposed that secondary loss of DNA methylation occurs because its mutational costs outweigh its adaptive value56. Indeed, in mosaic type methylation it is the evolutionarily stable “old” genes that are in the methylated compartments, meaning that there must be stabilizing mechanisms that prevent mutations there. Therefore, it might not be the mutational costs but the costs of maintaining such mechanisms that become an evolutionary burden. It was an early observation that CpG-containing codons are used much less in coding sequences of vertebrates, and mutations due to CpG methylation were considered a major cause of such codon bias (ref. 57 and references therein). Codon bias was also observed recently in the reef-building coral Acropora millepora57, and linked to mosaic methylation in this species. Again, phylogenetically old genes that are constitutively expressed are methylated and CpG-depleted. The authors conclude that CpG methylation leads to mutations that establish a set of preferred codons in constitutively expressed genes. Once such codon bias is fixed, alleles that control the abundance of appropriate tRNAs could have stronger effects that are more amenable to natural selection. The authors hypothesize that mutation-driven codon bias would be beneficial for organisms with small population size or otherwise inefficient selection. Still another explanation for mosaic methylation was advanced by Gavery and Roberts32, who speculated that hypo-methylated regions (here in the Pacific oyster Crassostrea gigas) might have greater epigenetic flexibility and higher regulatory control than hyper-methylated ones. Mosaic methylation could also be the result of whole genome duplication (WGD) events, as suggested for Oryza sativa56. In addition, we have shown that environmental conditions can influence germline methylation in C. gigas, which possesses mosaic methylation, and that blocks of CpG methylation are added or removed preferentially in or around genes58.

One should keep in mind that DNA methylation is only one of many bearers of epigenetic information. Another one, and probably the most difficult to capture, is the topology of the interphase nucleus. Using Hi-C data, Lieberman-Aiden et al.59 established that the human genome is divided into two compartments (A-B), with pairs of loci in compartment B showing a higher interaction frequency at a given genomic distance than pairs of loci in compartment A. They concluded that compartment B is more densely packed (heterochromatic) than compartment A. Higher average DNA methylation was later found to be a good predictor for the open compartment A in human cell lines60, but that link could be broken in cancer cells. This cannot be interpreted as DNA methylation being decisive for the establishment of topologically associated domains (TADs), since DNA-methylation-free organisms such as D. melanogaster also present canonical A-B domains61. In Drosophila, however, such TAD organization is not driven by long-lived interactions but rather relies on the formation of transient, low-frequency contacts62. We therefore hypothesize that DNA methylation affects the relative dynamics of contact formation in A and B compartments, possibly stabilizing the contacts. It is tempting to speculate that one consequence of such compartmentation of genome dynamics by methylation is the creation of additional units of selection. Results from tunicates support this idea: Ciona CpG o/e ratios have different profiles (bimodal for C. intestinalis and unimodal for C. savignyi). The C. intestinalis methylome is predicted to be mosaic, which corresponds to experimental observations63. Our prediction for C. savignyi is low methylation (cluster 2). The two species diverged from each other 184 (±15) Mya (ref. 42) and their genomes are very different64,65.
For instance, analysis of 18S rRNA sequences shows that the pairwise divergence of the two Ciona species is slightly greater than that between human and e.g. birds66. This is puzzling since developmental features, body plan, effective population size and environment are very similar, and even hybrids can be generated up to the tadpole stage67. However, C. savignyi shows a genome-wide average single nucleotide polymorphism (SNP) heterozygosity of 4.5%, while C. intestinalis, which has mosaic methylation, is genetically less polymorphic (1.5%) (reviewed in Veeman et al.68). It is conceivable that the methylated C. intestinalis genome can generate sufficiently stable TADs so that genome × epigenome interactions can serve as a heritable unit of selection, while in C. savignyi TADs are more dynamic because the relative weight of DNA methylation in the generation of stable heritable phenotypic variants is lower. Our prediction agrees with very recent results showing that stress-induced DNA methylation changes in C. savignyi can occur but are highly ephemeral (<48–120 h), and thus not maintained through the germline69.

In conclusion, our findings indicate that initially there were three types of gene body DNA methylation, ‘primary no methylation’, ‘primary whole genome methylation’, and ‘primary mosaic methylation’, which by secondary loss gave rise to ‘weak methylation’ or ‘secondary no methylation’. These findings are in concordance with the idea that DNA methylation in gene bodies (i) uses three types of universal codes (low, high and mosaic), and (ii) is an element of the inheritance system and not a molecular phenotype that results from genotype × environment interaction. This has immediate practical consequences: since there are three types of methylation codes, pan-species conclusions about the potential function of DNA methylation can only be drawn within a type (e.g. functional tests in vertebrates with high gene body methylation cannot be used to draw conclusions about methylation function in mosaic type mollusks). In addition, if DNA methylation is part of the inheritance system, then heritable phenotypic diversity can be produced by DNA methylation changes without changes in the DNA sequence. The notion that everything that is heritable is necessarily genetic should be abandoned.