Study population and sample preparation

This study consisted of 428 women of European ancestry (214 pre-treatment invasive epithelial ovarian cancer cases and 214 controls one-to-one matched with EOC cases on the basis age (within 1-year)) between the ages of 27 and 91 enrolled in the Mayo Clinic Ovarian Cancer Study [20]. Genomic DNA was isolated from whole blood collected at the time of enrollment, using PureGene DNA isolation reagents (Gentra Systems, Minneapolis, MN), re-suspended in TE buffer, and stored at -80°C. Samples were bar-coded with a unique subject identification number to ensure accurate and reliable sample processing and storage. Research protocols were approved by the Mayo Clinic Institutional Review Boards, and all participants provided written informed consent.

Genotype data

Leukocyte-derived DNA was genotyped with the Illumina 610-quad Beadchip Array™ according to manufacturer’s protocol, at the Mayo Clinic Medical Genome Facility (Rochester, MN) by laboratory personnel blinded to case-control status. Detailed quality control (QC) procedures have been described elsewhere [20, 27]. Briefly, Illumina’s Genome Studio™ software was used to perform automated genotype clustering and calling. Assays included duplicates and laboratory controls, which showed sample concordance of 99.93%, genotype call rate of 99.7%. SNPs were excluded with call rate <95%, MAF <0.05, Hardy-Weinberg Equilibrium (HWE) p-value < 10-4, or unresolved replicate errors, and samples were excluded with call rate <95%, ambiguous gender, or predicted less than 80% European ancestry. SNPnexus was used to annotate the genotyped variants [28–30].

DNA methylation assays

Leukocyte-derived DNA was assayed and underwent QC procedures at the Mayo Clinic Molecular Genome Facility (Rochester, MN). Samples were assayed in two batches, hereafter referred to as Batch 1 (n = 132; 66 cases and matched control samples) and Batch 2 samples (n = 296; 148 cases and matched control samples). For each sample, 1 μg of genomic DNA was bisulfite modified (BSM) using the Zymo EZ96 DNA Methylation Kit (Zymo Research, Orange, CA) according to the manufacturer’s protocol. Epigenome-wide assessment of DNA methylation was carried out using the Illumina Infinium HumanMethylation27 BeadChip, which is capable of interrogating the methylation status >27,000 CpG loci across the genome. This assay uses bisulfite-treated DNA and two site-specific probes for each marker, which bind to the associated methylated and unmethylated sequences. The intensity of the methylated probe relative to the total probe intensity (sum of methylated and unmethylated probe intensities) represents the fractional level of methylation for that specific site within a sample. Centre d’Etudes du Polymorphisme Humain (CEPH) DNA, placental DNA (positive control) and whole genome amplified (WGA) DNA (negative control) were also included (n = 9, n = 12 and n = 8, respectively), as were technical replicates (n = 12). Briefly, fragmented DNA was hybridized to the BeadChips, which were then processed through a primer extension and an immunohistochemistry staining protocol to allow detection of a single-base extension reaction. Finally, BeadChips were coated and then imaged on an Illumina iScan. Analysis included control probes for assessing sample-independent and sample-dependent performance.

Methylation data pre-processing and quality control

The methylation level of each CpG locus was calculated in GenomeStudio® Methylation module (v.1.9.0) by comparing the ratio of fluorescent signal from the methylated allele to the sum from the fluorescent signal from both methylated and unmethylated alleles and scored as beta values, ranging from 0 (unmethylated) to 1 (methylated). We first excluded probes that had an rsid, were located on the Y chromosome, or were positioned at a single nucleotide polymorphism (SNPs) (dbSNP build 137), as SNPs at the same site have the potential to confound methylation assessment. We also removed CpG loci that had high beta values in BSM negative controls (defined as exceeding four standard deviations of the mean) and those that were detected in <70% of samples (based on a detection p-value cut-off of 0.05). This left a total of 25,926 out 27,578 (94%) of probes that passed QC. The intra-class correlation coefficients, computed based on beta values among CEPH replicates and for duplicate samples, were >0.93, indicating a high degree of reproducibility in our array. In addition, samples were excluded if >25% of the probes for that sample had detection p-values that exceeded 1 × 10-5. Following QC, 428 samples remained for analyses; including 132 samples (66 cases and matched control samples) and 296 samples (148 cases and matched control samples) in Batch 1 and Batch 2, respectively.

Next, we assessed possible plate/Beadchip/batch effects visually and through principal component analyses (PCA) [31, 32]. DNAm values were logit-transformed (i.e., log 2 (β/1- β)) as in previous studies to obtain the DNAm M-value for each CpG locus [33, 34]. PCA represents a feature extraction technique where the methylation data is orthogonally transformed, such that the first principal component has the largest possible variance (accounts for maximal amount of variability in the methylation data), and each succeeding component, in turn has the next highest variance possible. PCA was applied to the methylation data for each batch separately (n 1 = 138 and n 2 = 296, for Batch 1 and Batch 2, respectively) and also to the combined methylation data for both Batches (n = 428). The resulting top principal components (those explaining the maximal proportion of variability in DNAm) were then examined in terms of their association with technical aspects concerning the array (i.e., plate/BeadChip), and batch for the principal components estimated from the combined methylation data from the Batch 1 and Batch 2 samples. As batch was observed to be a major determinant of variability in the combined DNAm data (Additional files 1 and 2), we adjusted for batch-effects by applying the ComBat normalization method [35] using the R-package ‘sva’. Combat is an empirical Bayes batch adjustment methodology that uses a location and scale adjustment for standardizing the mean and variability in methylation levels across batches. This methodology been shown to perform effectively and efficiently compared to competing batch/plate-adjustment methodologies [36, 37] and has become an established preprocessing step for array-based DNA methylation data [38–40]. Following the application of Combat, principal components were computed from the batch-adjusted data and inspected to ensure that batch effects had been successfully attenuated. In addition, within each batch we observed plate-effects (data not shown). To remove variability in DNAm due to plate, we fit a linear model to the logit-transformed methylation values for each CpG locus and included a fixed effect term for plate. The logit-transformed locus means were then added back onto the unstandardized residuals derived from these models, before back transforming values on the logit-scale to a 0 to 1 scale.

Technical validation of the methylation array data

As an orthogonal array validation, eight CpGs with a broad spectrum of percent methylation (range; 0.11-0.73) and variability (standard deviation; 0.11-0.15) were assessed using bisulfite pyrosequencing. Ninety-six samples were tested, including 45 cases and 45 controls, two samples each of WGA, BSM negative, and control samples (CpGenome™ Universal Methylated DNA; Millipore Corporation, Billerica, MA). Primers (Additional file 3) were designed using the Pyrosequencing Assay Design Software. Genomic DNA (20-30 ng) was PCR-amplified using primers, one of which was biotinylated. Briefly, the incorporated biotinylated amplicon was immobilized on streptavidin-coated beads used to purify and render the denatured, single stranded and biotinylated PCR product. Single stranded DNA was purified using the pyrosequencing vacuum workstation. The single-stranded product was annealed to 0.3 μM of the sequencing primer complementary to the single-stranded template and placed at 85°C for two minutes, then cooled to room temperature for five minutes. Pyrosequencing reactions were performed on Biotage PyroMark MD, and data were analyzed using PyroMD Software. Percent methylation was quantified as methylated C to unmethylated C ratio using the Pyro Q-CpG software, which provided automatic QC for each sample for completion of bisulfite conversion and estimates of non-converted DNA. The median Pearson correlation of methylation values between the array and pyrosequencing assays was 0.88 (Additional file 3), suggesting high concordance in the methylation array values and those generated from pyrosequencing.

Cell mixture deconvolution analysis

Recent work has demonstrated substantial differences in the DNAm signature across different leukocyte subtypes [13–15] and also differences in white blood cell proportions by EOC case-control status [41–43]. As such, heterogeneity in the underlying distribution of white blood cell types is likely to be a key confounder when examining the association between DNAm and EOC status. Using the plate- and batch-adjusted methylation data, we employed a statistical methodology [13] for inferring changes in the distribution of leukocytes based on peripheral blood DNAm signatures, in combination with a previously obtained external reference data set consisting of methylation signatures from purified leukocyte samples (i.e., B cells, natural killer (NK) cells, CD8+ T lymphocytes, CD4+ T lymphocytes, monocytes, and granulocytes) [13, 14]. In this approach data obtained from a target set comprised of DNA methylation profiles from a heterogeneous mixture of cell populations is assumed to be a high-dimensional multivariate surrogate for the underlying distribution of cell types. Houseman et al. [13], proposed a cell mixture deconvolution methodology – similar to regression calibration – that involves the projection of DNA methylation profiles from the target set onto a reference data set, which consists of the DNA methylation signatures for isolated leukocyte subtypes. Under certain constraints, the cell mixture deconvolution approach can be used to approximate the underlying distribution of cell proportions within the target data via constrained projection. Application of this method to our data allowed us to estimate the expected difference in cell type proportions between ovarian cases and controls, as well as to predict the proportion of the aforementioned leukocyte subtypes for each of the study samples. In addition, these methods allowed us to quantify the proportion of total and systematic variability in peripheral blood DNAm explained by estimated immune cell composition.

Although this method has been shown to produce accurate and reliable estimates of the underlying distribution of cell type [44], we additionally investigated the consistency of our results with an independent study population. Specifically, we compared our estimates of the expected difference in cell type proportions between ovarian cases and controls with the results reported in Houseman et al. [13]; which consisted of the application of the cell mixture methodology using blood-derived methylation data from n = 131 pretreatment EOC cases and n = 274 controls [12].

Causal inference test (CIT)

In a manner similar to that described in Liu et al. [16], genotype (G), methylation (M), and phenotype (Y) relationships were assessed using the causal inference test (CIT) [45] to classify them as “methylation mediated”, “methylation consequential” or “independent”. The CIT is comprised of a series of conditional correlation analyses that consider the possible directed relationships between a causal factor (genotype (G)), a potential mediator (methylation (M)) and an outcome (EOC status (Y)) (Figure 1A). In order for methylation (M) to be classified as a mediator of genetic (G) risk for EOC (Y) the following conditions must be met: (1) G and Y are associated, (2) G is associated with M after adjustment for Y, (3) M is associated with Y after adjusting for G, and (4) G is independent of Y after adjusting for M (Figure 1B). When M is a consequence of Y or independently acted on by G (Figure 1A), there should be no difference in the effect of G on Y, when conditioning on M. However, when M mediates the genetic risk for EOC, conditioning on M should substantially reduce the effect of G on Y [16, 45].

Figure 1 Identification of epigenetically mediated genetic risk factors for EOC. (A) Directed acyclic graphs (DAGs) depicting the possible relationships between a causal factor (G), a potential mediator (M), and an outcome (Y). Top, DAG for the methylation-mediated relationship, wherein G acts on Y through M. Middle, DAG for the methylation-consequential (reverse causality) relationship, in which changes in M arise as a consequence of Y. Bottom, DAG for the methylation-independent relationship, wherein G acts on M and Y independently. (B) The four components of the CIT. (C) Flow diagram illustrating the various filtering steps, and ensuing results, used to identify methylation sites that are candidates for mediators of genetic risk for EOC. Full size image

The CIT P-value was defined using the intersection-union framework as the maximum of the component P-values for the first three of these conditions. G, M, and Y relationships were considered methylation-mediated if: the p-value obtained from the fourth condition above was > 0.05, indicating no statistically significant association between G and Y after adjustment for M; and the CIT P-value was < 0.05. Where appropriate, linear and logistic regression models were used to examine the four conditions comprising the CIT.

To ease the computational burden that would ensue from examining the above conditions for every G, M, and Y trio (503,502 × 25,926 × 4 total tests), we implemented the three step filtering procedure described in [16]. In the first step, the methylation status of each CpG, epigenome-wide, was examined for its association with Y. In step two, potential genetic regulators of methylation were identified by computing the pairwise relationship between each SNP, genome-wide, and each of CpG sites that were associated with Y in step 1. The final step involved an examination of the relationship between Y and each SNP that was identified as statistically significant in step 2. The general scheme of this analytic procedure is given in Figure 1C.

Identifying ovarian cancer-associated differentially methylated CpGs

To discern differentially methylated CpGs between EOC cases and controls, we fit a series of linear regression models, which modeled the methylation M-value for each CpG as a function of ovarian case/control status. Models were adjusted for the estimated differential leukocyte cell counts described above, as well as age (continuous), current smoking status (yes vs. no), alcohol consumption (never, former, and current), study enrollment year (1999-2002, 2003, 2004, 2005, and 2006-2007), location of residence (MN vs. other), parity and age at first birth (nulliparous, 1-2 at ≤ 20 yrs, 1-2 at > 20 yrs, 3+ at ≤ 20 yrs, and 3+ > 20 yrs), and the first principal component representing within-European population sub-structure. Due to the large number of tests being performed, we corrected for multiple comparisons by computing the false discovery rate (FDR) q-value [46].

Identifying genotype-dependent differentially methylated CpGs

All epigenome-wide statistically significant (FDR q-value < 0.05) ovarian cancer- associated differentially methylated CpGs were subsequently examined based on their association with genotype using an additive minor-allele dosage model fit to all of the study subjects. Briefly, we used a series of linear regression models (# ovarian cancer-associated CpGs × # of SNPs) that modeled methylation M-values, as a function of the number of minor alleles for a specific SNP. Genotype-methylation associations were adjusted for multiple comparisons by computing the FDR q-value. A less stringent FDR q-value cutoff of 0.10 was used to determine statistical significance, so as to limit false negative findings.