Participants provided written informed consent, all studies were approved by the relevant ethics committees, and procedures followed were in accordance with the ethical standards of these committees.

UK Biobank samples were genotyped using Affymetrix UK BiLEVE Axiom array and Affymetrix UK Biobank Axiom array and imputed to the combined 1000 Genomes Project v.3 and UK10K reference panels using SHAPEIT3 and IMPUTE3. The lowest imputation info score for the SNPs used in these analyses was 0.86. Samples were included on the basis of female sex (genetic and self-reported) and ethnicity filter (Europeans/White British ancestry subset). Duplicates, individuals with high degree of relatedness (>10 relatives), and one of each related pair of first degree relatives were removed. Samples were also excluded using standard quality control criteria.

Genotype calling, quality control, and imputation for iCOGS and OncoArray were performed as previously described. Briefly, imputation was performed for the iCOGS and OncoArray datasets separately using the Phase 3 (October 2014) release of the 1000 Genomes data as reference. We followed a two-stage approach using SHAPEIT for phasing and IMPUTE2 for the imputation. Where samples were genotyped with iCOGS and OncoArray, the OncoArray calling was used. SNPs with MAF > 0.01 and imputation $r^2 > 0.9$ for OncoArray and $r^2 > 0.3$ for iCOGS were included in this analysis (∼7 million SNPs); a higher threshold was imposed for OncoArray to ensure accurate determination of the PRS in the validation and test datasets.

The best PRSs were evaluated in an independent test dataset comprising 11,428 invasive breast cancer-affected case subjects and 18,323 control subjects from ten studies nested within prospective cohorts, all genotyped using the OncoArray (Tables S3 and S4). The overall breast cancer PRS was also evaluated among 190,040 women of European ancestry from the UK Biobank cohort who had not had any cancer diagnosis or mastectomy prior to recruitment. A total of 3,215 incident registry-confirmed invasive breast cancers developed over 1,381,019 person years of prospective follow-up. Follow-up started 6 months after age of baseline questionnaire. The primary endpoint was invasive breast cancer. Follow-up was censored at the earliest of: risk-reducing mastectomy, diagnosis of any type of cancer, death, or January 15, 2017.

The dataset used for development of the PRSs comprised 94,075 breast cancer-affected case subjects and 75,017 control subjects of European ancestry from 69 studies in the BCAC (Tables S1 and S2). Data collection for individual studies has been described previously. Samples were genotyped using one of two arrays: iCOGS and OncoArray. The dataset was divided into a training set and a validation set. The validation set was randomly selected (approximately 10% of case and control subjects) from studies that had been genotyped with the OncoArray, after excluding studies of bilateral breast cancer, studies or sub-studies oversampling for family history, and individuals with in situ cancers or case subjects with unknown ER status.

Statistical Analysis

The general aim was to derive a PRS of the form:

$$\mathrm{PRS} = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \dots + \beta_n x_n$$

where $\beta_k$ is the per-allele log odds ratio (OR) for breast cancer associated with SNP $k$, $x_k$ is the allele dosage for SNP $k$, and $n$ is the total number of SNPs included in the PRS. Previous analyses found no evidence for statistically significant interactions between SNPs (Milne et al.; Joshi et al.) and little evidence for departures from a log-additive model for individual SNPs. Assuming this is true in general, the PRS summarizes efficiently the combined effects of SNPs on disease risk.
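As a minimal sketch of the PRS definition above, the score for one individual is the sum of allele dosages weighted by per-allele log ORs. The function name and the numeric values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
from typing import Sequence


def polygenic_risk_score(betas: Sequence[float], dosages: Sequence[float]) -> float:
    """Return PRS = sum over SNPs k of beta_k * x_k.

    betas   -- per-allele log odds ratios, one per SNP
    dosages -- allele dosages for one individual (0 to 2; imputed
               dosages may be fractional), in the same SNP order
    """
    if len(betas) != len(dosages):
        raise ValueError("betas and dosages must have the same length")
    return sum(b * x for b, x in zip(betas, dosages))


# Hypothetical weights and dosages for three SNPs (illustration only).
example_betas = [0.10, -0.05, 0.20]
example_dosages = [2.0, 1.0, 0.5]
example_prs = polygenic_risk_score(example_betas, example_dosages)
```

Because the model is log-additive, each additional copy of an effect allele shifts the score by that SNP's log OR, regardless of the genotypes at the other SNPs.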

The main challenge is how to determine which SNPs to include and the weighting parameters $\beta_k$ to assign. Inclusion of only those SNPs reaching a stringent significance threshold (“genome-wide significant,” $p < 5 \times 10^{-8}$) ignores information from larger numbers of SNPs that are likely, but not certain, to be associated with the risk of breast cancer. We used two general approaches for model selection: “hard-thresholding,” based on a stepwise regression model that retained SNPs significantly associated with overall or subtype-specific disease at a given threshold, and penalized regression using lasso (Friedman et al.; Tibshirani). A schema for the analyses is shown in Figure S1.

To prioritize SNPs for analysis, single SNP association tests were first conducted in the training set. Per-allele ORs and standard errors were estimated separately in the iCOGS and OncoArray datasets, adjusting for study and nine ancestry informative principal components (PCs) in the iCOGS dataset and for country and ten PCs in the OncoArray dataset, using a purpose-written program (Michailidou et al.). Combined p values were then derived using a fixed-effects meta-analysis with the software METAL (Willer et al.). SNPs were sorted by p value and filtered on LD, such that uncorrelated SNPs (correlation $r^2 < 0.9$) with lowest p value for association with overall breast cancer in the training set were retained (more rigorous pruning, for example at $r^2 < 0.2$, would have removed from consideration informative SNPs from regions with multiple correlated signals; French et al.; Meyer et al.).
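The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the purpose-written program or the METAL implementation: the fixed-effects combination is assumed to use inverse-variance weights, and the LD filter is assumed to be a greedy pass over SNPs in order of increasing p value. The function names and the `r2` callback are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def fixed_effects_meta(betas: Sequence[float],
                       ses: Sequence[float]) -> Tuple[float, float, float]:
    """Combine per-dataset log ORs with inverse-variance weights.

    Returns the pooled estimate, its standard error, and a
    two-sided p value from the normal approximation.
    """
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p


def greedy_ld_prune(p_values: Sequence[float],
                    r2: Callable[[int, int], float],
                    threshold: float = 0.9) -> List[int]:
    """Return indices of retained SNPs, most significant first.

    A SNP is kept only if its r^2 with every SNP already kept is
    below the threshold (assumed greedy procedure).
    """
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    kept: List[int] = []
    for i in order:
        if all(r2(i, j) < threshold for j in kept):
            kept.append(i)
    return kept
```

With a threshold of 0.9, only near-duplicate SNPs are dropped, so several partially correlated SNPs in the same region remain available to the later selection steps.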

In the hard-thresholding approach, a series of stepwise forward regression analyses was first carried out in 1 Mb regions centered on SNPs significant at a pre-specified threshold for association with overall and/or subtype-specific disease in the training set. Only SNPs passing the specified p value thresholds were included in each 1 Mb region. Two analyses were performed in parallel: for overall breast cancer and ER-negative disease. At each stage the SNP with the smallest (conditional) p value for any analysis was added to the model, the threshold for the stepwise regression being the same as that for pre-selection. The process was repeated until no further SNPs could be added at the pre-defined threshold. A second stage of stepwise regressions was then carried out across all regions in each chromosome, to take into account correlated SNPs in different regions. Finally, the effect sizes for the selected SNPs were jointly estimated in a single logistic regression model.

For the best-performing PRSs, SNPs associated with ER-positive disease at $p < 10^{-6}$ but not with overall breast cancer (at $p < 10^{-5}$) were added at the end of the final SNP list. A third round of stepwise forward regression was then carried out with a p value for selection of $p < 10^{-6}$ for ER-positive disease. For completeness we added to this final PRS two rarer variants (BRCA2 p.Lys3326X and CHEK2 p.Ile157Thr) which are established to confer a moderate risk of breast cancer and were genotyped on the OncoArray but did not pass the allele frequency threshold in the PRS development phase.

For the penalized regression using lasso, we used the program glmnet (Friedman et al.). SNPs with p < 0.001 for overall breast cancer or ER-negative disease in the training set were pre-selected for inclusion in the lasso, and BRCA2 p.Lys3326X and CHEK2 p.Ile157Thr were added. Covariates for 19 PCs (9 for iCOGS and 10 for OncoArray) and country were included in each model. For overall breast cancer, the penalty parameter (lambda) giving the best overall breast cancer PRS in the validation set was selected.

To construct subtype-specific PRSs, we evaluated four different methods: (1) using effect sizes for overall breast cancer (for each of the subtypes), (2) using effect sizes for subtype-specific (ER-positive or ER-negative) disease, (3) using a hybrid method, in which effect sizes were estimated in the relevant subtype for SNPs passing a certain optimal significance threshold in a case-only logistic regression (ER-positive versus ER-negative disease), and otherwise, using effect sizes estimated for overall breast cancer, or (4) by estimating case-only ORs using lasso and combining these with the overall breast cancer ORs to derive subtype-specific estimates, using the formulae:

$$\beta_{\text{ER-positive}} = \beta_{\text{overall}} + \eta \, \beta_{\text{case-only}}$$

$$\beta_{\text{ER-negative}} = \beta_{\text{overall}} - (1 - \eta) \, \beta_{\text{case-only}}$$

where η = 0.27 was the proportion of ER-negative tumors in the validation set.
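The formulae for method 4 can be written as a short function. This is a sketch for illustration; the function name and the example coefficients are hypothetical, and only η = 0.27 comes from the text.

```python
from typing import Tuple


def subtype_betas(beta_overall: float,
                  beta_case_only: float,
                  eta: float = 0.27) -> Tuple[float, float]:
    """Derive ER-positive and ER-negative log ORs for one SNP.

    beta_overall   -- log OR for overall breast cancer
    beta_case_only -- log OR from the case-only analysis
                      (ER-positive versus ER-negative)
    eta            -- proportion of ER-negative tumors
    """
    beta_er_positive = beta_overall + eta * beta_case_only
    beta_er_negative = beta_overall - (1.0 - eta) * beta_case_only
    return beta_er_positive, beta_er_negative


# Hypothetical coefficients for a single SNP (illustration only).
example_pos, example_neg = subtype_betas(0.10, 0.05)
```

By construction, the difference between the two derived coefficients equals the case-only coefficient, and their average weighted by (1 − η) and η recovers the overall coefficient.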

For the lasso analysis, effect sizes for subtype-specific disease were estimated using method 4 above, combining the estimates from a case-only lasso analysis with the coefficients for overall breast cancer from the lasso analysis. The lambda for the case-only model giving the best subtype-specific PRS in the validation set was selected.

To evaluate the performance of each potential PRS, we standardized the PRSs to have unit standard deviation (SD) in the validation set of control subjects. The association of the standardized PRSs was evaluated in the validation and test (prospective studies) datasets, by logistic regression. We used a Cox proportional hazards regression model to assess the association with risk of breast cancer in UK Biobank. Models were also compared in terms of the area under the receiver operator characteristic curves (AUC), adjusted for study, calculated using the Stata command comproc. Meta-analysis of study-specific effects was carried out using the Stata command metan.
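A minimal sketch of the standardization step, together with a simple rank-based AUC, is given below. It is an illustration only: whether the scores were also mean-centered is not stated, so this sketch only rescales, and the AUC here is unadjusted, whereas the analysis above used a study-adjusted AUC computed in Stata. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def standardize_by_controls(prs: Sequence[float],
                            is_case: Sequence[bool]) -> List[float]:
    """Rescale all scores so that control subjects have unit SD."""
    control_values = [s for s, c in zip(prs, is_case) if not c]
    sd = statistics.stdev(control_values)
    return [s / sd for s in prs]


def unadjusted_auc(scores: Sequence[float],
                   is_case: Sequence[bool]) -> float:
    """Probability that a random case scores above a random control.

    Ties count as one half. No adjustment for study is made.
    """
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    wins = sum((a > b) + 0.5 * (a == b) for a in cases for b in controls)
    return wins / (len(cases) * len(controls))
```

After this rescaling, the coefficient of the PRS in a logistic regression is the log OR per 1 SD of the score in control subjects, which makes different candidate PRSs directly comparable.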

The goodness of fit of the continuous model (i.e., assuming a linear association between log(OR) and risk) was tested using the Hosmer-Lemeshow (HL) test to compare the observed and predicted risks by quantile and using the tail-based test proposed by Song et al. In addition, we considered specifically the risks in the highest and lowest 1% of the distribution.

Effect modification of the PRS by age and family history of breast cancer in first-degree relatives was evaluated by fitting additional interaction terms in the model. The validation and prospective test datasets were combined for this analysis.

The absolute risks of developing breast cancer (overall and subtype-specific disease) were calculated taking into account the competing risk of dying from causes other than breast cancer, as described previously (Mavaddat et al.), with the PRS modeled as a continuous covariate and including a linear “age × PRS” interaction term. The absolute risk of developing subtype-specific disease was obtained by constraining to the overall incidence of ER-negative and ER-positive disease in the UK. Women are at risk of developing both ER-negative and ER-positive disease, so the absolute risks were calculated given that the individual has been free of breast cancer of any subtype.
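The idea of an absolute risk calculation with competing mortality can be sketched as follows. This is a simplified illustration, not the published procedure: it assumes hazards that are constant within one-year age intervals, takes the baseline incidence as given, and omits the step that constrains incidence to population rates. The function and parameter names are hypothetical; an “age × PRS” interaction can be represented by letting the log relative risk vary by year.

```python
import math
from typing import Sequence


def absolute_risk(baseline_incidence: Sequence[float],
                  other_mortality: Sequence[float],
                  log_rr_by_year: Sequence[float]) -> float:
    """Cumulative breast cancer risk over consecutive one-year intervals.

    baseline_incidence -- breast cancer hazard per person-year at the
                          reference PRS, one value per year of age
    other_mortality    -- hazard of death from other causes, per year
    log_rr_by_year     -- log relative risk for the woman's PRS in
                          each year (may change with age)
    """
    event_free = 1.0  # alive and free of breast cancer so far
    risk = 0.0
    for lam0, mu, log_rr in zip(baseline_incidence, other_mortality,
                                log_rr_by_year):
        lam = lam0 * math.exp(log_rr)
        total = lam + mu
        if total > 0.0:
            # Probability of any event this year, split by cause.
            risk += event_free * (1.0 - math.exp(-total)) * lam / total
        event_free *= math.exp(-total)
    return risk
```

Accounting for competing mortality lowers the cumulative risk relative to a calculation that ignores it, because women who die of other causes are no longer at risk of a breast cancer diagnosis.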