Compared to normal methylation levels in the population, epigenetic mutations can be either higher or lower in methylation level. Hence, we defined HMO and LMO as CpGs with abnormally higher or lower methylation levels than the average (Additional file 1: Figure S2). Of the defined epigenetic mutation sites, almost half were identified as HMOs and the other half as LMOs (118,259 HMOs and 119,175 LMOs). Thirty-six CpGs were defined as both HMOs and LMOs because those sites had intermediate methylation levels and very small IQRs. However, among the frequently mutated CpGs, there were significantly more HMOs than LMOs (969 and 216, p < 1e-16) (Fig. 2). Similar to the results of epigenetic mutations, both HMOs and LMOs were significantly associated with age (p = 1.98e-12 for HMOs and p = 1.73e-14 for LMOs), sex (p = 1.81e-3 for HMOs and p = 0.037 for LMOs), and B cells (p = 3.76e-22 for HMOs and p = 3.03e-20 for LMOs). Nevertheless, numbers of both sets of frequent mutations (log10-transformed) significantly increased with age (p = 2.09e-17 for HMOs and p = 1.14e-05 for LMOs) (Fig. 1c, e). Sex was no longer a significant factor with either frequent HMOs or LMOs. The composition of CD19+ B cell was still strongly associated with HMOs (p = 2.25e-12), but only marginally significant for LMOs (p = 0.046). Sample quality, as measured by detection p value, showed strong effects on both frequent HMOs and LMOs; however, LMOs were much more influenced (p = 8.09e-30) than HMOs (p = 3.58e-8). Moreover, the first genetic principal component became a significant factor (p = 7.65e-5) when analyzing frequent HMOs, while it had no effect on LMOs (p = 0.92) (Table 2).

Fig. 2 The distribution of mutated samples for high methylation outliers (HMOs) and low methylation outliers (LMOs). For most CpGs, epigenetic mutations only occurred in a small number of samples, but HMOs were more likely to appear in a large number of samples (n > 50) than LMOs (969 HMOs and 216 LMOs, p < 1e-16) Full size image

To better present the longitudinal effect, the same analysis was performed on 110 individuals with four or more measures (in total 470 samples). Still, the total epigenetic mutations, frequent mutations, frequent HMOs, and LMOs all significantly increased with age (Fig. 1b, d, f) despite the lower statistical power (Additional file 1: Table S1). Among the other factors, sample quality and CD19 B cell proportion were still significantly associated with all the four outcomes, while the effects of sex and genetic PC1 were no longer significant (Additional file 1: Table S1).

Functional annotation of epigenetic mutations

To characterize HMO and LMOs, we examined their locations in relation to CpG island regions and regulatory features. Compared to all CpGs analyzed, where 33.5% of CpGs located in CpG islands, HMOs were enriched within CpG islands (63% of CpGs, p < 1e-16) and frequent HMOs even more so (88% of CpGs, p < 1e-16). On the other hand, LMOs were mostly located outside of CpG islands (88% CpGs outside of CpG islands, p < 1e-16), but the opposite was true for frequent LMOs, which were enriched in CpG islands (51% of CpGs, p = 8.6e-8) (Fig. 3). We further explored regulatory features of the frequent epigenetic mutations using the Ensembl database [14], and found that frequent HMOs were enriched in promoter regions (p = 1.1e-10), but less likely to be found in CCCTC-Binding factor (CTCF) binding sites (p = 1.4e-09) and regions of open chromatin (p = 3.6e-07) (Fig. 4a). The frequent LMOs, on the other hand, were enriched in CTCF (p = 7.7e-12) and transcription factor binding sites (p = 3.9e-05), open chromatin (p = 0.0012), and promoter flanking regions (p = 0.041), while depleted in promoter regions (p = 6.9e-19) (Fig. 4b). Moreover, we performed a pathway analysis of frequent epigenetic mutations using DAVID [15], but failed to identify enriched pathway.

Fig. 3 Proportions of high methylation outliers (HMOs) and low methylation outliers (LMOs) in different CpG island regions. HMOs are enriched in CpG islands (p < 1e-16) while LMOs are more distributed outside of CpG islands (p < 1e-16), especially in open sea regions. However, both frequent HMOs and LMOs are enriched in CpG islands (p < 1e-16 and p = 8.6e-8) Full size image

Fig. 4 The distribution of regulatory features of frequent high methylation outliers (HMOs) and low methylation outliers (LMOs). Compared to the background distribution of the 450k array design, frequent HMOs were enriched in promoter regions (a), while the opposite was true for LMOs (b) Full size image

Epigenetic mutation is associated with cancer diagnosis

As aberrant DNA methylation levels in gene regulatory regions may cause abnormal gene expression, which may be associated with cancer, we analyzed epigenetic mutations in relation to cancer diagnosis in the SATSA participants. Cancer diagnosis date was retrieved using linkage to The National Patient Registry (prior to May 2016) including ICD-codes for all cancer types (ICD7 codes 140-205, ICD8 codes 140-209, ICD9 codes 140-208, ICD10 codes C00-C97, and B21). The SATSA participants included 29 prevalent cancer cases diagnosed already at baseline, and 79 incident cases that developed cancer during the follow-up period. Hence, information on whether the participant was diagnosed with cancer by the end of the follow-up was tested in the mixed model for associations with log10-transformed numbers of epigenetic mutations. Samples of individuals with cancer, including samples before and after cancer diagnosis, were observed to have a significantly higher number of epigenetic mutations (p = 0.014), HMOs (p = 0.019), LMOs (p = 0.027), and frequent HMOs (p = 0.013), but no associations were found for frequent LMOs (p = 0.71, Table 2). Furthermore, in the survival analysis, people with a higher number of frequent HMOs had a higher risk of cancer incidence (Additional file 1: Table S2).

Epigenetic mutations are shared within twin pairs

By applying a co-twin control design, we could further study the genetic effect and the genetic-age interaction in association with epigenetic mutations. We calculated the number of shared epigenetic mutations within a twin pair sampled at the same time, and studied their association with time and twin zygosity using a random effects model (Table 3). The numbers of shared epigenetic mutations were normalized in order to compare the effect sizes from different sets of CpGs. First, taking all CpGs into account (n = 390,894), the number of shared epigenetic mutations increased significantly with age (β=0.019, p = 0.026), and MZ pairs shared more epigenetic mutations than DZ pairs (β=1.078, p = 3.41e-18). After excluding 20,660 cis-meQTL CpGs, the age effect became stronger (β=0.025, p = 5.98e-3) while the zygosity effect was smaller (β=0.855, p = 1.05e-11). Last, within the 20,660 cis-meQTL-CpGs, the number of shared epigenetic mutations was not associated with age (β=2.86e-4, p = 0.969), while the zygosity difference (β=1.461, p = 8.34e-28) was larger than in results from non-meQTL-CpGs. None of the three tests showed significant twin zygosity-age interaction or sex effect.

Table 3 The results of the scaled number of shared epigenetic mutations calculated from different sets of CpGs in association with age, sex, twin zygosity, and zygosity-age interaction Full size table

Epigenetic mutations were validated using pyrosequencing

To verify epigenetic mutations identified from 450k array, we selected four frequently mutated CpGs (One HMO: cg05270750 and three LMOs: cg17338133, cg25351353, cg05124918) in 93 samples from 26 individuals for validation with pyrosequencing. In general, the pyrosequencing results were well correlated with methylation data measured by the 450k array (cg05270750: r = 0.84; cg17338133: r = 0.59; cg25351353: r = 0.80; cg05124918: r = 0.77). In addition, we compared methylation levels of mutated samples to the normal group using results from the 450 k array and pyrosequencing respectively. In pyrosequencing data, significant differences were observed between mutated samples and normal ones, using the same definition of a mutated sample as that for the 450k array data (Table 4). Hence, pyrosequencing technically validated epigenetic mutations identified from the 450k array. Although the agreement between the two methods was generally good, we still observed large differences between pyrosequencing and 450k data in some samples, where four samples in cg17338133 and six samples in cg 05124918 showed over 15% methylation level differences between 450k array and pyrosequencing data after centering their mean methylation levels. This indicates that we might wrongly detect or fail to detect epigenetic mutations from 450k chip data. In general, pyrosequencing data were more stable and changes in methylation levels were smoother than that from 450k array (Fig. 5). For example, in cg05270750 measured by the 450k array (Fig. 5e), one participant was identified to have epigenetic mutations in the first three measures, but the methylation level turned back to normal status in the last two measures. However, pyrosequencing data showed the methylation levels of the five measures from this individual were consistently defined as epigenetic mutations.

Table 4 Results from t tests comparing methylation levels in samples with epigenetic mutations to normal samples using data from the 450k array and pyrosequencing Full size table

Fig. 5 The longitudinal change of four CpGs in 93 samples from 26 individuals measured by 450k array (left panel) and pyrosequencing (Pyroseq, right panel) techniques. Methylation levels of a cg05270750 from 450 k-chip, b cg05270750 from Pyroseq, c cg17338133 from 450 k-chip, d cg17338133 from Pyroseq, e cg25351353 from 450 k-chip, f cg25351353 from Pyroseq, g cg05124918 from 450-chip, and h cg05124918 from Pyroseq. Samples are shown as points colored by their mutation status defined by the 450k data and lines links longitudinal samples collected in the same individual Full size image

Functional validation of epigenetic mutations in cancer tissues

To further verify the overabundance of epigenetic mutations in cancer tissues, we picked a gene PR/SET domain 7 (PRDM7) which was the only gene related to CpGs tested in pyrosequencing (cg05270750), and analyzed DNA methylation and gene expression data of the gene in tumor tissues and normal adjacent tissues using The Cancer Genome Atlas (TCGA) [16] data downloaded from Wanderer [17]. We selected the four most common cancer types in both sexes combined: lung cancer, breast cancer, colorectal cancer, and prostate cancer [18]. The total numbers of tumor and normal adjacent samples were 2,209 and 261 respectively, all cancer types combined. On average, the expression levels of PRDM7 were higher in tumor tissues than normal adjacent tissues in all cancer types, but the difference was only statistically significant for lung cancer (p = 1.83e-09, Additional file 1: Table S3). For DNA methylation data, the tumor tissues had significantly lower methylation levels than normal adjacent tissues in the gene body (Fig. 6a). However, for CpGs in the PRDM7 promoter (from cg06295223 to cg26935333), there was no significant difference between the mean methylation levels of cancer and normal adjacent tissues (Fig. 6a). To quantify and compare epigenetic mutations in both tissues, we used the distribution of normal adjacent samples to determine epigenetic mutation cutoffs. By calculating the number of epigenetic mutations in tissue samples, tumor tissues had higher proportions of epigenetic mutations in the gene body, while epigenetic mutations were not observed in normal adjacent tissues. In the gene promoter, tumor and normal adjacent tissues had similar and relatively low proportions of epigenetic mutations (Fig. 6b).