Study subjects

Multiple-case breast cancer families. Subjects were members of 25 multi-generational families with multiple cases of breast cancer. The families were participants in the Kathleen Cunningham Foundation Consortium for Research into Familial Breast Cancer (kConFab) and the Australian Breast Cancer Family Registry (ABCFR)39,40. The present study was based on samples and phenotypic data from a total of 210 family members (87 affected and 123 unaffected) from 25 families and phenotypic data on their relatives.

One or more members of these families had undergone previous genetic testing and were not found to carry a mutation in a known breast cancer susceptibility gene. Genomic DNA was isolated from blood samples or (if no blood specimen was available) from Epstein-Barr virus transformed cell lines (Supplementary Data 2). All participants provided signed informed consent to participate in the relevant research resources. This study was approved by the Human Research Ethics Committee of The University of Melbourne (1441955) and meets the principles of the Declaration of Helsinki.

Melbourne Collaborative Cohort Study (MCCS)

Data from an independent nested case–control study of methylation as a risk factor for breast cancer within the Melbourne Collaborative Cohort Study (MCCS) were used to test the findings from the family analysis41. This included breast cancer cases with a first diagnosis of invasive adenocarcinoma of the breast (International Classification of Diseases for Oncology, C50) occurring between blood collection and 31 December 2007 and ascertained by record linkage to the population-based Victorian Cancer Registry (VCR), and to the Australian Cancer Database. Controls were selected through incidence density sampling and matched with cases on year of birth, year of baseline attendance, country of origin and, when possible, type of baseline blood specimen (dried blood spot, buffy coat, or lymphocyte). The HM450K array was used to measure genome-wide methylation in DNA prepared from peripheral blood sample collected prior to cancer diagnosis of the cases as described by Severi et al. (2014)2. All participants provided signed informed consent to participate in the relevant research resources.

Bisulfite conversion and the HM450K array

A total of 500 ng of genomic DNA per sample was bisulfite converted using Zymo Gold EZ-DNA kit (Irvine, CA). Prior to processing the bisulfite converted samples on the Infinium HM450 K BeadArray, the conversion was confirmed using bisulfite-specific PCR designed in-house42. The Infinium HM450 K (San Diego, CA) was performed using the TECAN automated liquid handler (Männedorf, Switzerland) according to the manufacturer’s instruction.

HM450K data processing

All bioinformatic processing was performed with R version 3.2.043. Raw intensity signals were imported and processed using the minfi package44. All samples had an average detection p-value < 0.001, indicating good quality data. Therefore, no sample was removed from the analysis. Wherever possible, individuals from the same families were run on the same chips. Individual CpG probes with detection p-value greater than 0.05 (3949 CpG probes) were deemed unreliable and excluded from further analyses. All samples were Illumina and SWAN normalised to reduce technical bias between Type 1 and Type 2 probes45. β-values and M-values were calculated in minfi24,44. β-values denote relative methylation percentage calculated from the ratio of the methylated probe intensity and the overall intensity, where 0 indicates 0% methylation and 1 indicates 100% methylation24. Due to the heteroscedastic nature of β-values and unsuitable for many statistical tests, M-values, which are the log2 of β-values, are also calculated24. Methylation measures from twelve technical duplicates were used for testing the reproducibility of methylation measures and removed from subsequent analysis. No further batch correction method was performed.

Clonal bisulfite sequencing

Clonal bisulfite sequencing was performed to test for the parent-of-origin allelic methylation patterns of the VTRNA2-1 locus as previously described25. Germline DNA provided by 8 families, including 16 children were included in this analysis. All DNAs were first genotyped for rs2346019, (located at the downstream region of VTRNA2-1) using High-Resolution Melt curve analysis run on a RotorGene thermocycler (Qiagen, Hilden, Germany). Families where the allelic-specific methylation could be discriminated using this genotype information were selected for the bisulfite sequencing analysis (i.e., parents with disparate genotypes whose children were heterozygote at rs2346019). A set of previously published bisulfite-specific primers were used for amplifying the VTRNA2-1 locus25. Cloning was performed using a TOPO-TA kit and at least 10 colonies per individual? were selected for Sanger Sequencing.

Statistical methods

Our method for identifying heritable methylation marks is based on a generalisation of the standard expectation–maximisation (EM) algorithm for Gaussian mixtures to allow for non-independent group memberships. These calculations were performed using custom code implemented in R version 3.1.143 because existing general segregation analysis software was too slow to make the calculations feasible for almost half a million probes.

For each methylation site (CpG probe), two statistical models were fitted to the site’s M-values: a mixture model, in which the M-values were modelled as a mixture of two normal distributions (with means and variances to be estimated); and a Mendelian model, which is the same as the mixture model except that group membership was modelled as the carrier status (e.g., for a rare variant) at an autosomal genetic locus. Therefore, group memberships are independent under the mixture model but not under the Mendelian model. The maximised log-likelihoods, l mix and l Mendel , for these models were calculated using the EM algorithm, with l mix obtained from the standard EM algorithm for Gaussian mixtures46 and l Mendel calculated using the modification of this algorithm described in The EM algorithm for the Mendelian model, below. For each model, setting the means and variances for the two groups to be equal corresponds to a Gaussian model in which the M-values follow a normal distribution, so this Gaussian model is nested inside both the mixture and Mendelian models. Using the likelihood ratio test to compare these models to the Gaussian model is uninformative because many probes appear to have a bimodal distribution, so instead we compared l mix to l Mendel . A maxim from the field of statistical model selection is that the maximised log-likelihood quantifies how well a model fits the observed data47. Therefore, Δl = l Mendel − l mix is a measure of how Mendelian the probe’s M-values are, over and above how bimodal they are. Note also that since the mixture and Mendelian models have the same number of model parameters, Δl is the difference between the AICs for these two models, so the AIC model-selection approach would select the Mendelian model in preference to the mixture model whenever Δl > 0 (and similarly for the BIC)47.

To validate the ability of the Δl statistic to identify methylation sites with Mendelian-like inheritance patterns, we calculated Δl for all 481,563 methylation sites and used logistic regression and the likelihood ratio test to test whether or not the proportion of probes within 10 bp of a known SNP increases with Δl. This is a test on the efficacy of our statistic Δl, because the observed M-values of methylation probes with nearby SNPs are likely to have Mendelian-like inheritance patterns, just as an artefact of how the HM450 K array measures methylation48. The HM450K probes are 50mer oligonucleotides in design with the interrogated target CpGs at the last base. A technical limitation of the platform is that a large proportion of probes overlap one or more known SNPs48. As the accuracy of methylation measurements relies on the efficient hybridising of probes to target complementary DNA fragment, SNPs within probes potentially interfere with this binding and interrupt the actual methylation measurements48. The observed methylation values are therefore biased by nearby SNPs and will tend to follow Mendelian patterns of inheritance. We could therefore assess if Δl identified heritable sites by testing whether probes with higher values of Δl were more likely to have nearby SNPs. In addition to the formal test above, we also binned probes by their values of Δl and graphed the proportion of probes within 10 bp of a known SNP for each bin. Known SNPs were defined by Illumina’s HM450 K Manifest v1.2 (see Web resources).

To identify heritable methylation marks associated with breast cancer, we first excluded all methylation probes on sex chromosomes or within 10 bp of known SNPs. Then we screened the remaining 365,169 probes for those most consistent with a Mendelian pattern of inheritance, using the statistic Δl. Note that this screening was based on the structure of the 25 families and did not use any data on breast cancer-affected status or age. For each of the 1000 most Mendelian sites (those with the highest values of Δl), we calculated carrier probabilities for the hypothetical genetic variant that determines group membership in the Mendelian model. These calculations used standard techniques from segregation analysis49, in which the observed M-values played the role of the ‘phenotypes’ and the Gaussian densities (with the model parameters equal to their maximum likelihood estimates from the Mendelian model) played the role of the ‘penetrance’ function. The calculation of these carrier probabilities also only used pedigree structure and M-values, not age or breast cancer data.

Cox proportional hazards survival analysis was then used to test for associations between breast cancer and the carrier probabilities for the 1000 most Mendelian methylation marks. These analyses were conducted in R version 3.1.143 using the coxph function of the survival package50. To adjust for multiple testing, a Bonferroni-corrected p-value threshold of 0.05/1000 was used to determine statistical significance. Note that the effects of multiple testing were greatly reduced in our study because we screened the methylation sites for those with Mendelian inheritance patterns before testing for association with breast cancer.

The families in this study were ascertained because they each contained multiple breast cancer cases, and no adjustment for this ascertainment criterion was made. This means that our hazard ratio estimates are biased, so we do not report these here, but since the ascertainment criterion has no effect on the test statistic under the null hypothesis, our p-values for association with breast cancer are valid. These p-values were based on the likelihood ratio test, not the Wald test, so variances for the hazard ratios were not needed and hence were not estimated using either standard maximum likelihood or robust variance estimators.

The EM algorithm for the Mendelian model

This section gives a detailed, mathematical description of our generalization of the standard EM algorithm for Gaussian mixtures to allow for non-independent group memberships, as well as a precise description of the above statistic \(\Delta\ell\) and its two related statistical models.

The statistic Δl for measuring how Mendelian the inheritance pattern of a given site is: for each of the methylation sites, we fitted two statistical models to the sites’ M-values x 1 ,…,x n , where n is the number of people with epigenome-wide data and x i is the site’s M-value for person i. The first model is a mixture of two Gaussians, so under this model there are binary random variables y 1 ,…,y n so that: the n bivariate random variables (x 1 ,y 1 ),…,(x n ,y n ) are independent; and for each j = 0 or 1, P(y i = j) = α j and (x i |y i = j) ~N(μ j σ j 2), where θ = (α 0 ,α 1 ,μ 0 ,μ 1 ,σ 0 ,σ 1 ) is a vector of parameters to be estimated while satisfying the constraint α 0 + α 1 = 1. In this paper, we will also impose the additional constraint that α 1 = 0.01, so α 0 and α 1 are fixed constants. The second model is the same as the first, except that the group membership variables y 1 ,…,y n are modelled as the carrier status for a rare, autosomal genetic variant, with y i = 1 if individual i is a carrier and y i = 0 if he or she is a non-carrier. Note that y i and y j will generally be dependent random variables if individuals i and j belong to the same pedigree, though we still assume that x 1 ,…,x n are conditionally independent given y 1 ,…,y n .

We will refer to these models as the mixture and Mendelian models, respectively. Setting μ 0 = μ 1 and σ 0 = σ 1 in either of these models gives a third model for the M-values, in which x 1 ,…,x n are independent and follow a univariate normal distribution, that we call the Gaussian model. The maximised log-likelihoods \(\ell _{\rm mix}\), \(\ell _{\rm Mendel}\), and \(\ell _{\rm Gauss}\) of these three models measure the goodness-of-fit of each model to the site’s M-values47. Since the Gaussian model is nested inside the other two models, \(\ell _{\rm mix}\) and \(\ell _{\rm Mendel}\) can both be formally compared to \(\ell _{\rm Gauss}\) using a likelihood ratio test in order to determine if either of these models gives a more parsimonious fit to the data than the Gaussian model. However, the M-values of a very large number of the sites are bimodal, so these tests very often prefer both of the other models to the Gaussian model. To discover sites whose methylation patterns are Mendelian, we therefore compare \(\ell _{\rm Mendel}\) to \(\ell _{\rm mix}\), even though the mixture and Mendelian models are not nested. Since these models have the same number of parameters, \({\mathrm{\Delta}}\ell = \ell _{\rm Mendel} - \ell _{\rm mix}\) is the difference in both the AIC and BIC of the two models, so if \({\Delta}\ell > 0\) then the AIC and BIC would both favour the Mendelian model over the mixture model as the more parsimonious description of the data47. Also, since \(\ell _{\rm mix}\) and \(\ell _{\rm Mendel}\) measure the goodness-of-fit of these models to the site’s M-values47, the better the Mendelian model fits the data compared to the mixture model, the larger \({\Delta}\ell\) should be. We therefore interpret \({\Delta}\ell\) as a statistic which measures how ‘Mendelian’ the site is, i.e. how consistent the observed M-values at the site are with a Mendelian pattern of inheritance within families.

Note that we have assumed that all familial aggregation of aberrant DNA methylation is due to a major gene, so \(\ell _{\rm Mendel}\) and hence \({\Delta}\ell\) will be upwardly biased if part of this familial aggregation is caused by multiple genes of small effect (i.e., a polygenic effect), or if our model is misspecified in other ways. However, note that we only use \({\Delta}\ell\) to rank the methylation sites, and this ranking is completely insensitive to a wide range of biases. Also, while there are good theoretical and empirical reasons for using \({\Delta}\ell\) to screen the methylation sites, this screening is not a formal statistical procedure, so even if \({\Delta}\ell\) were biased then this would have no effect on the validity of our tests for association with breast cancer (the only formal part of our analysis). Finally, we note that replacing the Mendelian model with a mixed model (a model that incorporates a polygene in addition to a major gene) would possibly identify sites with polygenic but not Mendelian patterns of inheritance, which we are not interested in here.

A detailed description of the EM algorithm for the Mendelian model: since our analysis included approximately 480,000 sites, efficient algorithms were needed to maximise the likelihoods. For the mixture model this was straight-forward, because the EM algorithm for a mixture of Gaussians results in analytical update formulae46, which can be iterated to rapidly converge (in most cases) to the maximum likelihood estimates. For the Mendelian model, we used a modification of this algorithm that we now describe in detail.

In the EM algorithm for the Mendelian model, we took the M-values x 1 ,…,x n of a given site as the observed data and the binary carrier statuses y 1 ,…,y n as the hidden data. For now, the reader can simply think of y 1 ,…,y n as variables defining group memberships, as in the standard EM algorithm for Gaussian mixtures46, though with the caveat that y 1 ,…,y n are not independent. With model parameters θ = (α 0 ,α 1 ,μ 0 ,μ 1 ,σ 0 ,σ 1 ) as above, if θ t is the estimate of these parameters at iteration t then the EM algorithm chooses the estimate θ t + 1 at the next iteration to be the argument which maximises the function of θ given by

$$Q(\theta ,\theta _t) = {\Bbb E}[{\mathrm{log}}P(x,y|\theta)|x,\theta _t]$$

where x = (x 1 ,…,x n ), y = (y 1 ,…,y n ), \({\Bbb E}[ \cdot ]\) is the expectation functional and P(x,y|θ) is the likelihood of the full data at parameter value θ. More precisely, if \({\cal Y}\) = {0,1}n is the set of all binary vectors of length n, then

$$\begin{array}{l}Q(\theta ,\theta _t) = \mathop {\sum }\limits_{y \in {\cal Y}} P(y|x,\theta _t){\mathrm{log}}P(x,y|\theta )\\ = \mathop {\sum }\limits_{y \in {\cal Y}} P(y|x,\theta _t){\mathrm{log}}\left[ {P(x|y,\theta )P(y|\theta )} \right]\\ = \mathop {\sum }\limits_{y \in {\cal Y}} P(y|x,\theta _t){\mathrm{log}}P(x|y,\theta ) + \mathop {\sum }\limits_{y \in {\cal Y}} P(y|x,\theta _t){\mathrm{log}}P(y|\theta )\\ = \mathop {\sum }\limits_{y \in {\cal Y}} \mathop {\sum }\limits_{i = 1}^n P(y|x,\theta _t){\mathrm{log}}P(x_i|y_i,\theta ) + \mathop {\sum }\limits_{y \in {\cal Y}} P(y|x,\theta _t){\mathrm{log}}P(y|\theta )\end{array}$$ (1)

since the M-values x 1 ,…,x n are assumed to be conditionally independent given the carrier statuses y, with the distribution of x i only a function of y i . The first sum in (1) is a function only of the parameters μ 0 , μ 1 , σ 0 , and σ 1 , while the second sum only depends on α 0 and α 1 . So, to find θ that maximises Q(θ,θ t ), we can maximise these two functions separately. In the analysis presented in this paper, however, α 0 and α 1 were fixed to the values 0.99 and 0.01, respectively, so we focus on maximising the first term of (1) here.

Let δ ij denote the Kronecker delta, and for each j = 0 or 1, let φ(x i |μ j ,σ j ) be the probability density function for the normal distribution N(μj,σ2) evaluated at x i , so that P(x i |y i ,θ) = ϕ(x i |μ yi, σ yi ). Then the first sum in (1) is

$$\begin{array}{l}\mathop {\sum }\limits_{y \in {\cal Y}} \mathop {\sum }\limits_{i = 1}^n P(y|x,\theta _t){\mathrm{log}}P(x_i|y_i,\theta )\\ = \mathop {\sum }\limits_{y \in {\cal Y}} \mathop {\sum }\limits_{i = 1}^n P(y|x,\theta _t){\mathrm{log}}\phi (x_i|\mu _{y_i},\sigma _{y_i})\\ = \mathop {\sum }\limits_{y \in {\cal Y}} \mathop {\sum }\limits_{i = 1}^n P(y|x,\theta _t)\mathop {\sum }\limits_{l = 0}^1 \delta _{ly_i}{\mathrm{log}}\phi (x_i|\mu _l,\sigma _l)\\ = \mathop {\sum }\limits_{i = 1}^n \mathop {\sum }\limits_{l = 0}^1 q_{il}^t{\mathrm{log}}\phi (x_i|\mu _l,\sigma _l)\end{array}$$ (2)

where

$$q_{il}^t = \mathop {\sum }\limits_{y \in {\cal Y}} \delta _{ly_i}P(y|x,\theta _t)$$ (3)

so that \(q_{il}^t\) is the carrier probability for person i corresponding to x and the parameter values θ t when l = 1 (note that the t in \(q_{il}^t\) is a general superscript, not a power). Therefore, (2) is a weighted log-likelihood of normal distributions, so it can be maximised in exactly the same way as for the standard EM algorithm for Gaussian mixtures46. This gives the following parameter values at iteration t + 1, for each l = 0,1:

$$\mu _l^{t + 1} = \mathop {\sum }\limits_{i = 1}^n w_{il}^tx_i{\kern 1pt} \, {\mathrm {and}}\ \sigma _l^{t + 1} = \sqrt {\mathop {\sum }\limits_{i = 1}^n w_{il}^t\left( {x_i - \mu _l^{t + 1}} \right)^2}$$ (4)

where \(w_{il}^t = q_{il}^t/\mathop {\sum }\limits_{j = 1}^n q_{jl}^t\) and, as before, the superscripts t and t + 1 are not exponents.

To calculate these estimates, we used the definition (3) of \(q_{il}^t\) and the following expression for P(y|x,θ t ). Let F be the partition of {1,…,n} into families, so that F is a set of sets of indices, with each f∈F of the form f = {i 1 ,…,i k }, where i 1 ,…,i k are all of the people in a given family with epigenome-wide methylation data. For any such f∈F, let xf = (x i1,…, x ik ) and yf = (y i1,…, y ik ) be the observed and hidden data for the family, respectively. Then since the carrier statuses and M-values of people from different families are independent,

$$P(y|x,\theta _t) = \mathop {\prod }\limits_{f \in F} P(y^f|x^f,\theta _t) = \mathop {\prod }\limits_{f \in F} \frac{{P(x^f|y^f,\theta _t)P(y^f|\theta _t)}}{{P(x^f|\theta _t)}}$$ (5)

To calculate P(y|x,θ t ) from the right-hand side of (5), we note that, as before,

$$P(x^f|y^f,\theta _t) = \mathop {\prod }\limits_{i \in f} P(x_i|y_i,\theta _t) = \mathop {\prod }\limits_{i \in f} \phi (x_i|\mu _{y_i}^t,\sigma _{y_i}^t)$$

Also, P(yf|θ t ) can be calculated using standard techniques from segregation analysis49, as described in more detail in Statistical methods, above. Finally, the denominator P(xf|θ t ) in the right-hand side of (5), which is just a normalising constant, can be obtained by summing the numerator overall values of yf. Therefore, P(y|x,θ t ) can be calculated from (5), so substituting this into (3) gives \(q_{il}^t\) which, by (4), gives the updated parameters for the EM algorithm.

Improving calculation speeds: our analyses of ~480,000 sites would not be feasible without a number of techniques to improve the speed of the EM algorithm for the Mendelian model, so we briefly describe two of these techniques now.

The Mendelian model is a segregation analysis model49, and for such models the most time-consuming part of the calculation is summing overall possible genotype combinations for all family members in each family. However, this part of the calculation is essentially common to all methylation sites, so we obtain considerable improvements in speed by performing this calculation once and storing the results for later use.

More precisely, the update equations (4) for the EM algorithm depend on the carrier probabilities P(yf|θ t ) via (3) and (5), where we recall that yf is a set of carrier statuses for all of the members of family f with epigenome-wide data. Using standard techniques from segregation analysis49, P(yf|θ t ) can be expressed as a sum over all genotype combinations for the family which are consistent with the genotypes yf. Evaluating these sums is usually very time-consuming, however P(yf|θ t ) depends on \(\alpha _0^t\) and \(\alpha _1^t\) but not on \(\mu _0^t\), \(\mu _1^t\), \(\sigma _0^t\) or \(\sigma _1^t\), and \(\alpha _0^t\) and \(\alpha _1^t\) are held fixed for all t, so P(yf|θ t ) does not depend on t or the M-values xf. Therefore, we calculated P(yf|θ t ) once for every possible combination yf of genotypes, and stored these values of P(yf|θ t ) for later use in the update equations (4) (via (3) and (5)) for each methylation site.

We also used a simplifying assumption. To reduce the number of genotype combinations yf for which we had to store values of P(yf|θ t ), we assumed that no more than 1 of the founders in each family is a carrier and that no founder is a homozygote carrier (as usually holds if the variant is rare). This assumption is not essential, however, and it can be weakened (e.g., to allow 2 variants or less among the alleles of the founders) or entirely dispensed with (if the families are not too large and not too many family members have epigenome-wide data).

Testing 24 methylation marks in the MCCS

For each of the 24 CpG sites of interest, we first estimated odds ratios (OR) for breast cancer risk using conditional logistic regression models, for a one standard deviation increase in the methylation M-values in blood HM450K data set of 433 cases and their matched controls from the MCCS. The models were adjusted for body mass index, tobacco smoking, alcohol drinking, time between blood collection, and cancer diagnosis, and sample type (DNA extracted from dried blood spots, peripheral blood mononuclear cells, and buffy coats, although the vast majority (97%) of case–control pairs were successfully matched on sample type). For methylation marks exhibiting a bimodal or trimodal distribution, we categorised the methylation variables into groups corresponding to the observed ‘peaks’ of hypo, hemimethylated or hypermethylated, based on visual inspection of the M-value distribution (Supplementary Fig. 4). We used the same models as for the continuous variable analyses. The larger peak was chosen as the reference category. Sensitivity analyses were conducted: (1) further adjusting the models for blood cell composition as estimated by the algorithm by Houseman et al.51; (2) further adjusting the models for age at menarche, menopausal status, number of live births, and use of hormonal replacement therapy; (3) restricting the analyses to DNA prepared from dried blood spots.

Associations between genetic variants and DNA methylation

Data for all variants with 1 kb of the GREB1 probe that were genotyped or imputed using the iCOGS array were retrieved for MCCS participants included in the Breast Cancer Association Consortium52. A total of 251 participants (231 cases and 20 controls) had iCOGS and HM450K data available. Association between genotype and methylation was assessed using linear regression, with beta-value as the outcome variable and allele count as the explanatory variable. The allele count was estimated by rounding the allele dose to an integer value.

Web resources

Illumina Infinium HumanMethylation450K manifest was downloaded from http://support.illumina.com/array/array_kits/infinium_humanmethylation450_beadchip_kit/downloads.html

Data availability

All DNA methylation data (HM450K array) has been deposited to GEO (Accession No. GSE104942) and all bisulfite sequencing data has been deposited into BankIt2071934 (MG686237-MG686418) and is freely available.