Japanese subjects and QC on genotypes

Details on study design and basic characteristics for each study are provided in Supplementary Methods. Briefly, 1,703 MEC Japanese American subjects were genotyped by the Broad Genotyping Center on the Illumina 1M-Duo Array and 1,602 (803 cases, 799 controls) passed their initial QC filters. To maximize sample size, samples on five plates that initially ‘failed’ were re-clustered with a customized genotype calling algorithm; this step recovered 42 additional MEC subjects (23 cases, 19 controls), although not all SNPs on the array were preserved. To increase statistical power and to provide a larger control pool, 1,033 prostate cancer-free men and 808 breast cancer-free women genotyped on the Illumina 660W-Quad platform were drawn from the MEC prostate cancer12 and breast cancer13 studies, respectively.

Japanese subjects from the following studies were all genotyped on the Illumina 1M-Duo array by the University of Southern California (USC) Epigenome Center: 697 from CCFR (384 cases, 313 controls), 155 cases from CR2&3, 1,463 from Fukuoka, Japan (685 cases, 778 controls), 212 from Nagano, Japan (106 cases, 106 controls) and 1,332 from JPHC (670 cases, 662 controls). In general, all genotyped samples were examined and excluded according to the following criteria: (1) call rate below 90%, 95% or 97%, depending on the batch, (2) missing basic covariates (age, sex or disease status), (3) gender mismatch, that is, the reported sex differed from that estimated from the X chromosome inbreeding coefficient F, calculated by PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/), (4) ethnicity outliers, that is, subjects who fell outside the Japanese cluster (by visual inspection) on PC plots, where PCs were derived for study subjects as well as unrelated HapMap CEU, YRI and JPT samples with our own R program (The Comprehensive R Archive Network http://www.r-project.org/), based on about 20k SNPs with inter-marker distance>100 kb, and (5) close (⩾2nd degree) relatives, where relationships were derived from estimated probabilities of sharing 0, 1 or 2 alleles based on genomic data (calculated by PLINK), and relatives were removed in the following order: subjects with most relatives, controls and subjects with lower call rates. All cases were verified by histological records to have invasive carcinoma of the colon or rectum. More details on genotype QC can be found in Supplementary Methods. After QC, the following subjects were retained in analysis: 3,094 from the MEC (797 cases, 2,297 controls), 285 from CCFR (276 cases and 9 controls), 134 cases from CR2&3, 1,411 from Fukuoka, Japan (662 cases, 749 controls), 207 from Nagano, Japan (105 cases, 102 controls) and 1,293 from the JPHC (653 cases, 640 controls).
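The removal order for relatives in criterion (5) can be illustrated with a small sketch. It is not the authors' program: it assumes the stated order acts as a priority rule (most relatives first, then controls before cases, then the lower call rate) applied greedily until no related pair remains. The input format and the function name `remove_relatives` are hypothetical.

```python
from collections import defaultdict


def remove_relatives(pairs, is_control, call_rate):
    """Greedily drop subjects until no related pair remains.

    pairs      -- iterable of (id_a, id_b) related pairs (hypothetical format)
    is_control -- dict mapping subject id to True for controls
    call_rate  -- dict mapping subject id to genotype call rate
    Returns the list of removed subject ids, in removal order.
    """
    relatives = defaultdict(set)
    for a, b in pairs:
        relatives[a].add(b)
        relatives[b].add(a)

    removed = []
    while any(relatives.values()):
        candidates = [s for s in relatives if relatives[s]]
        # Assumed priority: most relatives, then controls, then lower call rate.
        # The subject id is a final tie-breaker so the result is deterministic.
        drop = min(
            candidates,
            key=lambda s: (-len(relatives[s]), not is_control[s], call_rate[s], s),
        )
        for other in relatives.pop(drop):
            relatives[other].discard(drop)
        removed.append(drop)
    return removed


if __name__ == "__main__":
    pairs = [("A", "B"), ("A", "C"), ("D", "E")]
    is_control = {"A": False, "B": True, "C": False, "D": False, "E": True}
    call_rate = {"A": 0.99, "B": 0.99, "C": 0.98, "D": 0.99, "E": 0.99}
    print(remove_relatives(pairs, is_control, call_rate))
```

In this toy input, subject A is related to two others and is dropped first; in the remaining pair the control E is dropped in preference to the case D.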

African American subjects and QC on genotypes

Sample collection and genotyping QC have been described in detail elsewhere4 and in Supplementary Methods. We genotyped 7,168 African American samples from six studies/centres: the MEC (442 cases, 4,620 controls), CCFR (999 cases, 290 controls), SCCS (164 cases, 160 controls), the MD Anderson Cancer Center (189 cases), UNC-CanCORS (84 cases) and UNC-Rectal (112 cases, 108 controls) on the Illumina 1M-Duo platform. QC procedures for all subjects were similar to those described for the Japanese study subjects. The analysis included 6,427 subjects (1,818 cases, 4,609 controls) genotyped on 1,049,327 markers. We also included 170 PLCO samples (76 cases, 94 controls) that were previously genotyped on the Illumina Omni 2.5M array and pre-filtered by the NCI genotyping centre for analysis (527,383 markers that overlapped with other studies). Overall, 6,597 subjects (1,894 cases, 4,703 controls) were used in association testing. Supplementary Table 2 shows the distribution of subjects by participating study.

Imputation

Prediction of un-typed or partly genotyped SNPs was performed with BEAGLE 3.3 (ref. 2) using the 1000 Genomes Project (phase 1, release 3) East Asians as reference panels for the Japanese data and Europeans and Africans for the African American data. Imputation was performed separately for the two ethnic groups with all cases and controls combined. Markers with minor allele frequencies<0.005 in reference panels were excluded from imputation. For the African American data, 10,050,748 markers with imputation accuracy R2>0.8 were kept for association analysis; for Japanese data, 4,266,108 markers with imputation R2>0.95 were retained. Altogether, 4,276,079 autosomal genotyped or imputed markers were available in both populations for meta-analysis.
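As an illustration of the marker selection described above, the following sketch applies the population-specific R2 thresholds and keeps the markers that pass in both data sets. It is not the authors' pipeline: the dictionary inputs and the function name are hypothetical, and directly genotyped markers are assumed to be represented with R2 = 1.0.

```python
def markers_for_meta_analysis(japanese_r2, african_american_r2,
                              japanese_min_r2=0.95, african_american_min_r2=0.8):
    """Return the set of marker ids that pass the R2 threshold in both data sets.

    Each argument maps a marker id to its imputation R2
    (assumption: genotyped markers are entered with R2 = 1.0).
    """
    japanese_kept = {m for m, r2 in japanese_r2.items() if r2 > japanese_min_r2}
    african_american_kept = {
        m for m, r2 in african_american_r2.items() if r2 > african_american_min_r2
    }
    # Only markers available in both populations can be meta-analysed.
    return japanese_kept & african_american_kept


if __name__ == "__main__":
    japanese = {"rs1": 0.99, "rs2": 0.90, "rs3": 1.0}
    african_american = {"rs1": 0.85, "rs2": 0.95, "rs3": 0.79}
    print(markers_for_meta_analysis(japanese, african_american))
```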

Analysis of the Japanese and African American GWASs

PCs were calculated as in EIGENSTRAT14 with our own R program, including unrelated HapMap CEU, YRI and JPT samples as population controls. Ethnicity outliers were identified on PC plots by visual inspection and subsequently removed. Pair-wise PC plots suggested that the first two PCs were most informative for global ancestry and that the distribution of PCs was similar among all cases and controls in both Japanese and African Americans (Supplementary Figs 1 and 2). Logistic regression of CRC on allelic dosage, with adjustment for age at blood draw, sex and the first four PCs, was performed with PLINK to obtain per-allele OR estimates and 95% CIs, where age was grouped as <55 years, 5-year intervals from 55 to 80 and ⩾80 years. The genomic control factor (λ) was estimated from the median of the χ2 statistics divided by 0.456.
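Two of the steps above lend themselves to a short sketch: the age grouping used as a covariate and the genomic control factor. The code is illustrative only. The function names and category labels are hypothetical, and the intervals are assumed to be closed on the lower bound (55 to 59, 60 to 64 and so on). The divisor 0.456 is the value given in the text; the median of a 1-d.f. χ2 distribution is approximately 0.455.

```python
from statistics import median


def age_group(age):
    """Assign an age at blood draw to one of the categories described above."""
    if age < 55:
        return "<55"
    if age >= 80:
        return ">=80"
    lower = 55 + 5 * int((age - 55) // 5)  # assumed lower-closed 5-year bins
    return f"{lower}-{lower + 4}"


def genomic_control_lambda(chi2_stats, expected_median=0.456):
    """Median of the observed chi-square statistics over the expected median."""
    return median(chi2_stats) / expected_median


if __name__ == "__main__":
    print([age_group(a) for a in (54, 55, 63, 79, 80)])
    print(genomic_control_lambda([0.1, 0.456, 2.0]))
```

A λ close to 1 indicates little inflation of the test statistics; values well above 1 suggest residual stratification or other systematic bias.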

Heterogeneity of genetic effects by site (colon versus rectal cancer, mutually exclusive), stage (regional/distant versus local/in situ) and age at diagnosis (≤55 versus >55 years) was tested in a case-only analysis. Effect modification by sex was assessed comparing the model with and without the cross-product term. These and additional stratified analyses by site, stage, age at diagnosis and sex were adjusted for age at blood draw, sex (where appropriate), the first four PCs and BMI.

Conditional analyses were performed to examine the independence of association signals in the chromosome 10 region, conditioning on the SNP with the smallest P-value. Significance of the additional contribution by other SNPs was calculated based on a likelihood ratio test. These analyses were carried out using SAS 9.3.
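The likelihood ratio test used in the conditional analyses can be sketched as follows. The analyses themselves were run in SAS; this Python fragment only illustrates the arithmetic, and it assumes that a single SNP term is added to the conditional model, so that the statistic is referred to a χ2 distribution with 1 d.f. The function name and the log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio statistic and P-value for one added parameter.

    loglik_reduced -- log-likelihood of the model with the conditioning SNP only
    loglik_full    -- log-likelihood after adding one further SNP
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # For 1 d.f., P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    stat, p = likelihood_ratio_test_1df(-100.0, -98.08)
    print(round(stat, 2), round(p, 3))
```

With more than one added SNP the degrees of freedom would equal the number of added terms, and a general χ2 survival function would be needed instead of the 1-d.f. shortcut.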

Local ancestry estimation for African Americans

The percentage of African ancestry (0, 50 or 100%, that is, half of the estimated number of African chromosomes) was inferred for each participant at the putative CRC risk locus on chromosome 10 (±250 kb) with the LAMP program v2.4 (ref. 15). To summarize local ancestry, for each individual we averaged all local ancestry estimates that fell within the region. The effect of local ancestry was evaluated by examining the relative change in ORs with and without adjustment for local ancestry in logistic regression.
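The per-individual summary described above amounts to a simple average. In the sketch below, the input is assumed to be the estimated number of African chromosomes (0, 1 or 2) at each position in the ±250 kb window, as might be extracted from LAMP output; the input format and function name are hypothetical.

```python
def regional_african_ancestry(african_chromosome_counts):
    """Average percentage of African ancestry across positions in the region.

    african_chromosome_counts -- estimated number of African chromosomes
    (0, 1 or 2) at each position within the region for one individual.
    """
    if not african_chromosome_counts:
        raise ValueError("no local ancestry estimates in the region")
    if any(n not in (0, 1, 2) for n in african_chromosome_counts):
        raise ValueError("chromosome counts must be 0, 1 or 2")
    # 0, 1 or 2 African chromosomes correspond to 0, 50 or 100% ancestry.
    percentages = [50.0 * n for n in african_chromosome_counts]
    return sum(percentages) / len(percentages)


if __name__ == "__main__":
    print(regional_african_ancestry([2, 2, 1, 1]))
```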

CORECT study for replication

The CORECT study meta-analysis was conducted using germline DNA in the Molecular Epidemiology of Colorectal Cancer study (MECC) (set 1: 484 cases and 498 controls; set 2: 1,120 cases and 820 controls), CCFR (set 1: 1,977 cases and 999 controls; set 2: 1,660 cases and 1,393 controls), Kentucky case–control study (1,038 cases and 1,134 controls), Newfoundland case–control study (548 cases and 538 controls), American Cancer Society CPS II nested case–control study (ACS/CPSII, 539 cases and 469 controls) and the Melbourne nested case–control study (195 cases and 477 controls). All subjects were self-reported whites. The majority of the studies were genotyped using the Affymetrix Axiom CORECT Set containing ~1.3 million SNPs and indels on two physical genotyping chips (Supplementary Table 3). Genotype data were screened based on filters such as call rates, concordance rates, sample relatedness and ethnic outliers. IMPUTE2 (ref. 16) was used to impute missing genotypes based on the cosmopolitan panel of reference haplotypes from Phase I of the 1000 Genomes Project. Imputed genotypes were screened based on stringent imputation quality and accuracy filters (info⩾0.7, certainty⩾0.9, concordance⩾0.9 between directly measured and imputed genotypes after masking input genotypes for genotyped markers only). Associations between genetic variants and CRC risk were tested using a log-additive genetic model within each study, allowing for study-specific adjustment for age, sex, study centre, genotyping batch and 2–4 PCs. More details of each participating study can be found in Supplementary Methods.

GECCO study for replication

The GECCO GWAS consortium has been described before17,18,19. The consortium consisted of European-descent participants within the French Association Study Evaluating RISK for sporadic CRC (ASTERISK, 948 cases and 947 controls); CR2&3 (87 cases and 125 controls); Darmkrebs: Chancen der Verhütung durch Screening (DACHS set 1: 1,710 cases and 1,707 controls; DACHS set 2: 675 cases and 498 controls); Diet, Activity, and Lifestyle Study (DALS set 1: 706 cases and 710 controls; DALS set 2: 410 cases and 464 controls); Health Professionals Follow-up Study (HPFS set 1: 227 cases and 230 controls; HPFS set 2: 176 cases and 172 controls); MEC (328 cases and 346 controls); Nurses’ Health Study (NHS set 1: 394 cases and 774 controls; NHS set 2: 159 cases and 181 controls); Ontario Familial Colorectal Cancer Registry (OFCCR, 650 cases and 522 controls); Physician’s Health Study (PHS, 382 cases and 389 controls); Postmenopausal Hormone study (PMH, 280 cases and 122 controls); Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO set 1: 533 cases and 1,976 controls; PLCO set 2: 486 cases and 415 controls); VITamins And Lifestyle (VITAL, 285 cases and 288 controls); and the Women’s Health Initiative (WHI set 1: 470 cases and 1,529 controls; WHI set 2: 1,006 cases and 1,010 controls). All individual studies were genotyped on Illumina arrays with 240k–730k markers and went through rigorous QC. The genotype data were imputed to increase the density of genetic variants, using haplotypes from the 1000 Genomes Project Phase I as the reference panel. Logistic regression of CRC on SNP dosage was performed within each individual study, with adjustment for age, sex (when appropriate), centre (when appropriate), smoking status (PHS only), batch effects (ASTERISK only) and the first three PCs from EIGENSTRAT14 to account for population substructure. Additional details on sample collection, genotyping, QC and statistical methods are provided in Supplementary Methods.

All samples were collected with informed consent and all procedures were approved by the Human Research Institutional Review Boards (IRBs) at the relevant institutions. Specifically, the study protocols of the Japanese and African American GWASs were approved by the University of Hawaii Human Studies Program and University of Southern California IRB, the IRB in the National Cancer Center, Japan, the Ethics Committee of Kyushu University Faculty of Medical Sciences, the University of North Carolina IRB, Vanderbilt University IRB, the Fred Hutchinson Cancer Research Center IRB and the MD Anderson Cancer Center IRB. The GECCO portion of this work was approved by the Fred Hutchinson Cancer Research Center IRB. The University of Southern California Health Sciences IRB approved all elements of the CORECT study.

Meta-analysis

A fixed-effect model with inverse variance weighting implemented in METAL20 was used to combine the results from the Japanese and the African American studies and to further combine them with the replication studies. The heterogeneity measure I2 and Cochran’s Q statistic were calculated to test for heterogeneity21. For the 12 top hits in the VTI1A region at 10q25 (see text), OFCCR in GECCO was excluded because these SNPs did not pass the quality filters in this substudy (Table 1, Supplementary Table 4 and Supplementary Fig. 4). In Supplementary Fig. 5, SNPs that passed the filters in OFCCR were included whenever applicable.
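For reference, the standard formulas behind a fixed-effect inverse variance meta-analysis, Cochran's Q and I2 can be sketched in a few lines. This is a textbook illustration and not a reimplementation of METAL. The function name and the example log-OR estimates are hypothetical.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse variance weighted fixed-effect meta-analysis.

    betas           -- per-study log-OR estimates
    standard_errors -- their standard errors
    Returns (pooled beta, pooled SE, Cochran's Q, I2 in percent).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2: share of variation beyond chance, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled_beta, pooled_se, q, i2


if __name__ == "__main__":
    beta, se, q, i2 = fixed_effect_meta([0.10, 0.20], [0.05, 0.05])
    print(f"OR = {math.exp(beta):.3f}, Q = {q:.2f}, I2 = {i2:.0f}%")
```

Under the null hypothesis of no heterogeneity, Q follows a χ2 distribution with one fewer degree of freedom than the number of studies.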