Estimation of 5 mC and 5 hmC

We adapted the BS–oxBS technology to DNA from 30 glioblastomas and applied a novel algorithm, OxyBS, to obtain genome-wide 5 hmC and 5 mC estimates21. All samples were primary tumours, fresh frozen and IDH1 and IDH2 wild type. Patient demographic, tumour characteristics and survival data are presented in Table 1. Here, we focused on the identification of 5 hmC distribution and abundance to better understand the role of 5 hmC in glioblastoma pathobiology. Our approach to investigate potential functions of 5 hmC included assessment of five aims (Supplementary Fig. 1): (1) to characterize genomic abundance and distribution of 5 hmC and 5 mC, (2) to delineate high 5 hmC CpGs in glioblastoma, (3) to determine functional annotation of regions with high 5 hmC, (4) to assess the impact of 5 hmC on gene expression (5) and to apply a machine learning algorithm to cluster patients by 5 hmC profiles.

Table 1 Patient demographics and tumour characteristics. Full size table

Global content and genomic distribution of 5 hmC and 5 mC

First, to qualitatively assess the extent of deregulation in the glioblastoma epigenome we compared the total genomic content of 5 hmC among glioblastoma samples with levels of 5 hmC in the prefrontal cortex. Although we were unable to obtain matched disease-free cortex samples in our own cohort, we accessed publicly available prefrontal cortex data using the same BS–oxBS protocol in an independent population of individuals that presented with no evidence of neurological impairment (GSE74368, n=5). To avoid limitations due to the presence of multiple cell types (that is, glial and neuronal) in the prefrontal cortex samples we did not compare CpG-specific differences in 5 hmC. However, to provide a global perspective on 5 hmC differences we examined the overall proportion of 5 hmC in total cytosine content (that is, summed 5 hmC beta-values across all CpGs within each sample divided by total number of CpGs profiled) between the two tissues. We observed an average 3.5 fold reduction (P=6.2E-06, Wilcoxon rank sum test) in the total 5 hmC content in glioblastoma samples when compared with cortex samples (Supplementary Fig. 2A). While all glioblastoma samples exhibited a decreased level of total 5 hmC there was high variability in total 5 hmC loss across tumours. Potential biological explanations that may account for the variability of total 5 hmC include sample differences in cellular proportions and differences in the expression of the TET enzymes. To this end, we estimated putative cellular proportions from DNA methylation data (‘Methods’ section) using a non-negative matrix factorization approach (RefFreeEWAS) and found that there were stable estimates of tumour purity across samples (Supplementary Fig. 3A). Thus, differences in 5 hmC levels across tumours may be more strongly associated with the molecular alterations across tumours rather than tumour purity (Supplementary Fig. 3B). We next tested whether the overall proportions of 5 hmC in total cytosine content were associated with the expression of TET enzymes and additional epigenetic enzymes (that is, TET1, TET2, DNMT3A, IDH1, etc). Again, we did not observe any significant relationships (P>0.05, Spearman’s rho test, Supplementary Table 1) consistent with prior studies22. Finally, we note that the levels of 5 hmC were not significantly associated with total 5 mC content (Supplementary Fig. 2B, P=0.18, linear regression). 5 hmC has been implicated in the DNA demethylation pathway and may suggest functionality of 5 hmC outside simply removal of 5 mC.

At the nucleotide level, the dynamic range of 5 hmC detection was between 0.00 and 0.99 on the beta-value scale. Empirical cumulative density plots for both mean 5 hmC and 5 mC across all subjects revealed near zero 5 hmC levels in a majority of CpGs and revealed the well-established bimodal distribution for 5 mC (Fig. 1a). To assess the site-specific relationship between 5 hmC and 5 mC levels we computed Spearman correlation coefficients for each CpG and observed that moderate negative correlations exist for a large majority of CpGs, and that <20% of CpGs demonstrated weak positive correlations between 5 hmC and 5 mC (Fig. 1b). There are known differences in the distribution of DNA methylation based upon by genomic context23. The dependency of 5 mC levels upon CpG density was observed in glioblastoma and the CpG islands and CpG island shores exhibited demonstrably lower levels of 5 mC (Fig. 1c). To determine whether potential CpG density specific patterns also exist for 5 hmC we next computed the proportions of CpG-specific mean 5 hmC across the four strata of CpG Island, Shores, Shelves and Ocean. Figure 1d displays the distributions of mean 5 mC and 5 hmC as well as statistical assessment over each of the CpG strata. Generally, the patterns of 5 hmC levels were similar to 5 mC and the lowest levels of 5 hmC were found across CpG islands and CpG Island shores. Similar to observations in 5 mC, CpG Island shelves and Ocean regions also harboured the highest levels of 5 hmC. The role of 5 mC in regulating gene expression has best been described for promoter regions and it is known that TET1, which catalyzes the oxidation of 5 mC to 5 hmC, is enriched for binding to DNA at CpG island promoters24. To examine the overall distribution of mean 5 mC and 5 hmC across the promoter regions for both cortex tissue and glioblastomas we averaged the cytosine modifications over a four kilobase pair window near transcriptional start sites23. We noted the occurrence of hypermethylation (5 mC) across glioblastoma promoter regions and a consistently lower level of glioblastoma 5 hmC proximal to promoter regions when compared with healthy tissue (Fig. 2a,b).

Figure 1: 5hmC is depleted and uniquely distributed in glioblastoma. (a) Empirical cumulative distribution of mean 5-hydroxymethylcytosine (5hmC) and 5-methylcytosine (5mC) across thirty glioblastomas. (b) Cumulative proportions of Spearman correlation coefficients calculated for CpG-specific across thirty glioblastomas. (c) Percentiles of mean 5mC beta-values for glioblastomas (n=30) across CpG Island strata. Percentiles (that is, quintiles) shown were selected arbitrarily to highlight a large range of values and significant differences between beta-values across the genomic regions at these percentiles were evaluated by Kruskal-Wallis hypothesis tests. Statistical significance (P<8.3E-03, Bonferroni adjusted alpha) is indicated with a *. (d) Percentiles of mean 5hmC in glioblastomas (n=30) across CpG Island strata with statistical assessment via Kruskal-Wallis tests. Full size image

Figure 2: Promoter region 5hmC and 5mC in cortex and glioblastoma. (a) Heat maps represent 5hmC and 5mC levels for 450K array CpGs across 4,000 base pair window in glioblastoma (n=30) and prefrontal cortex (n=5) promoter regions. (b) Mean 5hmC and 5mC averages with +/− 2 kilo base pairs around canonical transcriptional start sites for glioblastoma (n=30) and prefrontal cortex (n=5). Full size image

5 hmC is uniquely distributed in the glioblastoma genome

To determine which 5 hmC sites had the highest potential functional relevance we calculated the mean 5 hmC for each CpG across all thirty tumours to identify CpG sites with the highest proportion of 5 hmC alleles (Supplementary Fig. 4). In glioblastoma, a large number of CpGs demonstrated mean 5 hmC values near zero while signals >5% 5 hmC were found in ∼35,000 CpGs (Supplementary Fig. 4). However, to be confident in distinguishing signal from background we then defined CpGs as high 5 hmC CpGs if they were in the highest 1% of mean 5 hmC level; resulting in 3,876 CpGs with a mean value of at least ∼9.0% 5 hmC that we described here as ‘high 5 hmC CpGs’ and are used in subsequent analyses. Among high 5 hmC CpGs, 60% were recurrent across a majority of the thirty glioblastomas (2,347/3,876 CpGs with at least 15 tumours displaying ∼9% 5 hmC at a given CpG) while 30% (1,162/3,876 CpGs) were also among the 1% most variable 5 hmC probes suggesting greater inter-individual variability at these locations. The complete list of the high 5 hmC CpGs with genomic location, CpG-specific mean and s.d. is provided as a resource in Supplementary Data 1. The dynamic regulation of DNA methylation for cell-identity is a balance between methylation and demethylation25. Here, we visualized the CpG-specific relationships between high 5 hmC and 5 mC at these high 5 hmC CpGs (Fig. 3a) and observed that the highest levels of 5 hmC track to sites with intermediate levels of 5 mC. These results are consistent with the notion that 5 hmC may function as an intermediate in the DNA demethylation process at dynamically regulated regions. Together, the observed loss of 5 hmC that occurs in glioblastoma may also reflect disruption of regulated genomic DNA methylation patterns.

Figure 3: 5hmC in glioblastoma is enriched at gene regulatory regions. (a) 5hmC-5mC scatterplot for the highest 5hmC CpGs (n=3,876) with smooth curve fitted by loess across thirty glioblastomas. Each point represents a single CpG per tumour with the highest intensity values represented in red and lower intensity (that is, fewer signals) represented in blue. (b) The distribution of high 5hmC CpGs relative to the nearest canonical transcriptional start site (TSS) in kilo base pairs with category bins for genomic distance shown for both upstream and downstream of the TSS. Percentage of high 5hmC CpGs (n=3,876) are shown above each proportion. (c) Forest plot of odds ratios and 95% confidence intervals (Cochran-Mantel-Haenszel or Fisher’s exact test) for enrichment of high 5hmC genomic regions against 450K background set. Numerical representation of the odds ratio and associated P-value for each genomic feature are also presented. Full size image

Previous sequencing studies in non-diseased tissue have shown that 5 hmC marks tend to reside in intronic and gene regulatory regions26,27. However, the localization of 5 hmC in the glioblastoma genome remains obscure and may influence disease-critical gene expression programs. The genomic distribution of high 5 hmC CpGs relative to canonical transcriptional start sites (TSS) is presented in Fig. 3b and reveals that ∼75% of high 5 hmC were found beyond the 5 kb region upstream of the TSS. To determine whether elevated levels of 5 hmC were associated with genomic features, we used a Cochran–Mantel–Haenszel test with binary outcomes (for example, genomic region of interest versus other) to provide a measure of enrichment that permitted stratification by CpG density (that is, CpG islands, shores, shelves and ocean). We observed that high 5 hmC CpGs were more likely to be found in intronic regions (2.12 odds ratio (OR), P=1.7E-112, Cochran–Mantel–Haenszel test, Fig. 3c) as well as depleted in promoter regions (OR=0.69, P=4.8E-21, Cochran–Mantel–Haenszel test, Fig. 3c) when the genomic regions of the 450 K array were used as a background. These findings concur with previous observations in non-diseased tissue28,29. Leveraging publicly available histone H3 lysine 27 acetylation (H3K27ac) genome-wide maps, which marks active enhancers, from three primary glioblastomas we observed substantial enrichment for enhancer regions among the high 5 hmC CpGs for all three tumours (median OR=3.1, P=1.9E-211, Cochran–Mantel–Haenszel test, Fig. 3c). Interestingly, we also found a significant enrichment of high 5 hmC CpGs at glioblastoma cell line (U87) defined enhancers (OR=2.2, P=1.7E-46, Cochran–Mantel–Haenszel test, Fig. 3c) and super-enhancers (OR=3.00, P=9.7E-63, Cochran–Mantel–Haenszel test, Fig. 3c). Super-enhancers are a subset of enhancers that have critical functions in defining cell-identity and are frequently found at key oncogenic drivers providing support that 5 hmC may play key roles in disease30. We also noted that there was a modest enrichment among high 5 hmC CpGs for glioblastoma DNase hypersensitivity sites, a marker of open chromatin (OR=1.32, P=5.34E-7, Cochran–Mantel–Haenszel test, Fig. 3c). Furthermore, co-localization of 5 hmC with Polycomb repressive complex 2 has been established in embryonic stem cells, but has not been observed in differentiated cells31. Similarly, we found no association between 5 hmC sites and enrichment for Polycomb group protein target genes in glioblastomas (OR=0.96, P=0.48, Cochran–Mantel–Haenszel test, Fig. 3c)31.

Enrichment of genomic regions among critical gene sets

To provide a broader interpretation of 5 hmC function we next sought to identify enrichment of high 5 hmC within specific gene sets. To this end, we used the genomic coordinates of high 5 hmC CpGs as a query set of regions and tested for enrichment against the background of all CpGs present on the 450 K array in a Genomic Regions Enrichment of Annotations Tool (GREAT) analysis. The top twenty most highly significant gene sets ranked by fold enrichment are presented in Table 2. The high 5 hmC CpGs overlapped significantly with diverse biological pathways including regulation of RNA catabolism and stability, immune response and somatic stem cell maintenance. Enrichment of RNA stability gene sets may reflect shared biological processes of 5 hmC between healthy cells and cancer cells as a prior report on adult liver 5 hmC made similar observations26. In contrast, it is possible that gene sets involved in the immune response and stem cell maintenance are driven by the presence of tumour infiltrating lymphocytes. To validate our pathway enrichment findings we applied an agnostic consensus clustering approach to biological pathways defined in the Kyoto Encyclopedia of Genes and Genomes (KEGG), which considered all 5 hmC CpGs on the array. We observed that gene sets involved with metabolism, immunity and RNA processing functions displayed the highest levels of 5 hmC (Supplementary Fig. 5A–C).

Table 2 GREAT functional enrichment analysis of genomic regions for high 5 hmC CpGs (n=3,876). Full size table

It has previously been observed that 5 hmC is located near transcription factor binding sites (TFBSs) in the mammalian genome32. To identify whether particular transcription factors are associated with patterns of high 5 hmC we tested for enrichment in binding sites from 689 transcription factor experiments (ENCODE) using the Locus Overlap Analysis (LOLA) software33. Relative to genomic regions of the 450 K array, high 5 hmC sites were significantly associated with sixty-nine TFBSs (Q-value<0.05, Fisher’s exact test, Supplementary Data 2). The top ranked TFBSs enrichments are ranked and the original cell line in which they were profiled is presented in Fig. 4 and the complete rankings can be found in Supplementary Data 2. Several of these transcription factors have established roles in oncogenesis including c-Fos and c-Jun as regulators of cell proliferation and survival, whereas the co-activator p300 is typically found at enhancer regions34,35

Figure 4: Biological interpretation of genomic regions with high 5hmC. Significance of overlap between high 5hmC regions in glioblastoma with binding sites of transcription factors profiled by ENCODE. The top 10 enriched transcription factors (TFs) as obtained by LOLA analysis are shown. TF are plotted on the x-axis sorted by TF Q-values (Fisher's exact test, corrected for multiple hypotheses testing of 689 TFs) on the y-axis. The log odds ratio for each TF is represented by bubble size and the cell line in which the ChIP-seq experiment was conducted is indicated by bubble colour. Full size image

5 hmC localizes to genes actively transcribed in glioblastoma

The robust enrichment for 5 hmC sites in super-enhancers and binding sites of proliferation-associated TFs suggests that 5 hmC is strongly associated with regulation of disease-specific gene expression programs. To confirm that 5 hmC marks are generally associated with active expression in glioblastoma we interrogated the entire glioblastoma transcriptome using RNA sequencing expression levels from The Cancer Genome Atlas (TCGA) (n=172)13. Genes were then segregated into the three groups of lowly, moderately or highly expressed genes based on expression tertiles and we tested enrichment for 5 hmC sites against the background of the 450 K array. A significant enrichment was found for genes that are actively transcribed in glioblastomas (P=5.2E-139, χ2 test, Supplementary Table 2). Prior evidence has suggested a role for 5 hmC in the regulation of alternative transcript splicing in normal tissues28. To examine whether a similar process occurs in cancer, we leveraged the TCGASpliceSeq database for glioblastoma-specific alterations in mRNA splicing patterns of RNAseq data36. We found that 5 hmC is significantly enriched for exon skip (OR=2.03, P=2.23E-48, Fisher’s exact test, Supplementary Table 3) and alternate promoter events (OR=2.06, P=9.75E-35, Fisher’s exact test, Supplementary Table 3), but observed neither an enrichment nor depletion in retained introns (OR=0.95, P=6.3E-01, Fisher’s exact test, Supplementary Table 3) or alternate donor sites (OR=1.12, Fisher’s exact test, Supplementary Table 3).

We next investigated whether gene expression of candidate genes, within our own cohort, were associated with high 5 hmC (‘Methods’ section) using the NanoString nCounter technology. For our candidate gene approach we selected 14 genes with a high proportion of high 5 hmC sites in a given gene region (Supplementary Data 3) and a gene with an established relationship to glioblastoma survival, MGMT (Supplementary Data 4). In general, 5 hmC sites exhibited positive associations with gene expression in our data set (Fig. 5a). In contrast, 5 mC at these locations demonstrated primarily negative associations (Fig. 5b). MGMT methylation has been used as a biomarker of response to alkylating agents in glioblastoma patients11. We found that 5 hmC was not associated with MGMT expression, but confirmed the strength of association between 5 mC and MGMT expression (Supplementary Fig. 6). Overall, the strongest association between 5 hmC and gene expression among the candidate genes was found in the TSS1500 region of the long non-coding RNA SOX2-OT (Spearman correlation coefficient=0.62, Fig. 5c,d, Supplementary Data 5). Notably, there was no significant association between 5 mC and gene expression at this genomic location (Spearman correlation coefficient=−0.18, Fig. 5e). Finally, given our observation that 5 hmC was associated with splicing patterns from the TCGA we next assessed whether CpG-specific 5 hmC was associated with the differential expression of gene transcript variants from seven genes with high 5 hmC in gene regulatory regions (Supplementary Data 4). In our candidate list, CpGs did not exhibit significant associations between differential expression of transcript variants and 5 hmC levels (Supplementary Data 6). Taken together, we conclude that 5 hmC is often positively correlated with expression and may, in specific gene contexts, be associated with alternative mRNA transcript splicing.

Figure 5: 5hmC is positively associated with gene expression. (a) 5hmC sites (that is, CpG position) relative to gene feature are plotted against correlation coefficient derived from Spearman correlations of CpG-specific 5hmC and gene expression. The size of each bubble point represents the nominal statistical significance of the given Spearman correlation with an increasing bubble size corresponding to a smaller P-value. The colour of each bubble refers to a distinct gene region. All numerical 5hmC-gene expression correlation coefficients and P-values are presented in Supplementary Data 5. (b) For each 5hmC site, the correlation between 5mC and candidate gene expression was plotted. Bubble point size also reflects the increasing strength of association based on Spearman correlation. All numerical 5mC-gene expression correlation coefficients and P-values are also presented in Supplementary Data 5. (c) The genomic location and context of the CpG (cg02417857, 5hmC value) most significantly associated with gene expression mapped to within 1,500 bp of SOX2-OT transcription start site. Alternative transcripts, CpG island locations, and DNase hypersensitivity sites from the UCSC genome browser are presented. (d) 5hmC levels at cg02417857 were significantly positively correlated with SOX2-OT expression (n=23, P=0.002). (e) 5mC levels at cg02417857 were not significantly associated with SOX2-OT expression (n=23, P=0.40). Full size image

5 hmC profiles are associated with patient survival

Prior publications have identified an association between decreased total 5 hmC (determined by different methods) and poorer survival in glioblastoma patients37. Although the tumours we profiled were IDH1 and IDH2 wild type, we confirmed that one sample was G-CIMP+ as defined in Noushmehr et al.8 (Supplementary Fig. 7). Patients with G-CIMP+ tumours have a younger age at diagnosis and significantly improved survival (independent of age)10. As a result, this sample was excluded from our survival analyses. To investigate the relation of 5 hmC patterns (that is, not total 5 hmC) with survival we used a model-based clustering method, Recursively Partitioned Mixture Model (RPMM). RPMM has been extensively used for clustering DNA methylation data to identify classes of tumours based upon 5 mC values, including TCGA38,39,40. Here we applied RPMM to the 3,876 high 5 hmC CpGs and the resulting clustering solution contained two distinct clusters, defining a low and a high 5 hmC cluster (Fig. 6a). Separate clustering analyses were performed for the 1,000, 2,000, 3,000 and 4,000 highest 5 hmC CpGs to examine classification sensitivity and we observed complete stability of cluster membership (that is, samples remained in either low or high 5 hmC clusters regardless of CpG number selected). Cluster membership was associated with total 5 hmC amount (P=2.0E-04, Kruskal–Wallis rank sum test), and patient age (P=0.03, Kruskal–Wallis rank sum test), but not copy number alterations (P>0.05, Fisher’s exact test, Supplementary Fig. 8). The cluster of patients with low-5 hmC tumours had an older age at diagnosis (median age=74.3 years) compared with the high 5 hmC cluster (median age=65.2 years) and had a shorter median survival (median overall survival=2.2 months) when compared with the high 5 hmC cluster (median overall survival=14.5 months). A multivariable Cox proportional hazards model adjusting for age at diagnosis and patient sex resulted in a significantly increased hazard of death related with membership in the low-5 hmC cluster independent of age (Hazard Ratio=3.3, CI 95% 1.3–8.2, P=0.03, Cox proportional hazards regression, Fig. 6b,c). To compare with prior work that has measured total 5 hmC, we also constructed an index of total 5 hmC by taking the mean across all CpGs considered in the present analysis (n=387,617) and segregated subjects into low and high-total 5 hmC groups based on whether total 5 hmC levels were above or below the median. In a multivariable Cox proportional hazards model adjusted for age at diagnosis and sex, the low total 5 hmC group was not significantly associated with survival (HR=0.94 CI 95% (0.43–2.03), P=0.88, Cox proportional hazards regression) suggesting that the genomic location of 5 hmC is an important consideration in associations with disease progression.