Discovery cohort

The UK Biobank15 (UKBB) is a prospective cohort study of over 500,000 individuals from across the United Kingdom, aged between 40 and 70. Blood, urine and saliva samples were collected and each participant answered an extensive questionnaire on health and lifestyle. Written informed consent was obtained from each participant, in accordance with the Declaration of Helsinki.

Genotyping and imputation

A total of 152,736 samples belonging to the interim data release of UK Biobank genetic data (May 2015) were genotyped using a combination of two arrays: the UK Biobank Axiom array from Affymetrix (N = 102,754), that was specifically designed for the purpose of genotyping the UK Biobank participants, and the UK BiLEVE array (N = 49,982), that was designed to study the genetics of lung health and disease. The UK10K haplotype reference and the 1000 Genomes Phase 3 reference panels were merged and used as reference panel in the IMPUTE2 software34. Kinship coefficients for all pairs were calculated using KING’s robust estimator35 and used to identify and remove related individuals. Imputation, relatedness assessment and quality control were performed by the analysis group at the Wellcome Trust Centre for Human Genetics, University of Oxford. Details are provided at the UK Biobank website (httpλ://biobank.ctsu.ox.ac.uk). A total of 8,351,141 SNPs meeting the following conditions were included in our genome-wide association study: call rate ≥95%, minor allele frequency (MAF) ≥1% and Hardy–Weinberg equilibrium test with P ≥ 1 × 10−9.

Phenotyping

Ease of skin tanning was collected for 460,922 individuals with self-reported European ethnicity, 140,749 of whom had genotype data. Individuals that reported themselves to be of European ancestry but described their skin colour as 'black' were removed from the data set. As carried out in previous studies of similar phenotypes3,4 and in order to reduce misclassifications in the self-reported data, individuals reporting that they never tanned only burn, or get mildly or occasionally tanned were included in the group of individuals showing a low tan response, while people who reported getting moderately or very tanned were included in the group of individuals showing a high tan response. Reliability of the reported tanning ability was cross-verified using self-reported information on hair colour, and 2947 individuals reporting themselves to have red hair yet declaring that they get very or moderately tanned were removed from the data set. We further removed 16,067 individuals who were estimated to be genetically related, and 439 individuals showing signs of insufficient data quality, resulting in 121,296 individuals (Supplementary Table 1).

Association study

Logistic regression was performed using PLINK36 (version 1.90 b3.38) assuming an additive genetic model and including sex as covariate, as well as the first five principal components assessed on the genomic data to control for potential population stratification. Associations were considered significant and taken forward for replication if the UKBB discovery P value was lower than 5.0 × 10−8. Genome-wide Manhattan and Q–Q plots were generated using the qqman R package37 (version 0.1.2). Regional Manhattan plots for the associated loci were generated using LocusZoom Standalone38. The genotype data from the individuals included in the UKBB data set was used to estimate a more accurate LD structure.

LD score regression analysis

The genomic inflation factor (λ GC ) was calculated as the ratio between the observed and expected median χ2 statistics. We used the LD score regression (LDSC) software16 (version 1.0.0) to quantify the proportion of such inflation that was due to the presence of polygenic inheritance and to other confounding biases, such as population stratification. Briefly, the LDSC approach evaluates an LD score based on an unbiased estimator of the squared Pearson’s correlation, and then regresses the χ2 statistics against it. The mean contribution of confounding biases in the test statistics is evaluated as the intercept of the regression model minus one. LD scores were evaluated using the 1000 Genomes Project European data16.

Identification of independent signals within loci

We used a stepwise procedure to identify independent signals within the loci identified in the UKBB sample. Specifically, we extended each locus to include a 1 Mb flanking region either side and fitted a new regression model, where the top-associated genome-wide significant SNP was included as a covariate (conditional model). We considered the genome-wide significant (P < 5 × 10−8) top-associated SNP resulting from the conditional model as an independent signal, and included it in the covariate set of a new conditional model. We stopped the stepwise procedure when we could not identify any additional genome-wide significant SNPs. Conditional models were build using PLINK36 (version 1.90 b3.38).

Replication cohorts

The TwinsUK cohort includes more than 13,000 monozygotic and dizygotic twin volunteers from all regions across the United Kingdom39. The phenotype was collected via nurse-administered questionnaires using the Fitzpatrick classification40 and dichotomised into two categories: low (sun-reactive skin type I and II) and high (sun-reactive skin type III and IV) tan response. Phenotypic and genotypic data were available for 3937 female individuals with European ancestry (Supplementary Table 3). Microarray genotyping was conducted using a combination of Illumina arrays (HumanHap300, HumanHap610, 1M-Duo and 1.2M-Duo 1 M) and imputation was performed using the IMPUTE2 software using haplotype information from the 1000 Genomes Project (Phase 1, integrated variant set across 1092 individuals, v2, March 2012), as previously described41. Guy’s and St Thomas’ Hospital NHS Trust Research Ethics Committee approved the study, and all twins provided informed written consent.

The Rotterdam Study (RS) is a prospective study of men and women from the Ommoord municipality of Rotterdam, the Netherlands42. Subjects were genotyped using the Infinium II HumanHap550 K and Human 610 Quad Arrays of Illumina. Imputation was performed using the MaCH and minimac software packages and the 1000 Genomes Project (Phase 1, integrated variant set across 1092 individuals, v2, March 2012) as the reference panel, as described elsewhere43. Ease of skin tanning was collected on 10,451 individuals via questionnaires, and individuals who reported getting easily burned while in the sun were considered as having a low tan response (Supplementary Table 4). All Rotterdam Study participants have provided written informed consent. The study has been approved by the medical ethics committee according to the Wet Bevolkingsonderzoek ERGO (Population Study Act Rotterdam Study), executed by the Ministry of Health, Welfare and Sports of the Netherlands.

The Australian Study from the Queensland Institute of Medical Research (QIMR) comprises twins and their family members taking part in a long-running study of melanoma risk factors44. Two independent samples were used for this analysis. The first included adolescent twins, their siblings and parents, and the second a collection of adult twins. These were genotyped in two phases. Phase 1 samples were genotyped using Illumina Human610-Quadv1_B arrays at deCODE Genetics, Iceland. For imputation, they were merged with a larger set of individuals genotyped on Human610-Quadv1_B, and Human660W-Quad_v1_C arrays. Phase 2 samples were genotyped on HumanOmniExpress-12v1-1_A, HumanOmni25M-8v1-1_B and HumanCoreExome 12v1-0_C arrays from Illumina at the Diamantina Institute, University of Queensland. Self-reported ease of skin tanning was collected on 5509 individuals from the adolescents plus parents, and on 2248 individuals from the adult twins, but only a total of 5149 individuals had genotype data and were included in this analysis (Supplementary Tables 5 and 6). Individuals reporting that they never tanned only burn, or usually burn and sometimes tanned were included in the group of individuals showing a low tan response, while people who reported of usually or always getting tanned were included in the group of individuals showing a high tan response. The study protocol was approved by appropriate institutional review boards, and all participants have provided written informed consent.

The Nurses’ Health Study (NHS), the Nurses’ Health Study 2 (NHS2) and the Health Professionals Follow-Up Study (HPFS) are three large prospective cohort studies of US men and women of European ancestry (HS). Ease of skin tanning was assessed on 35,845 individuals by asking what kind of tan was developed after repeated sun exposures (e.g., a 2-week vacation outdoors) during childhood or adolescence, and categorised into a binary variable (Supplementary Table 7). Subjects were genotyped on multiple arrays (Affymetrix, Illumina HumanHap, Illumina OmniExpress, HumanCore Exome and OncoArray; Supplementary Table 8) and imputed to approximately 47 million markers using the 1000 Genomes mixed population Project Phase 3 Integrated Release Version 5 (2010–11 data freeze, 2012-03-14 haplotypes) as reference panels. Specifically, SNP genotypes were imputed in two steps. First, genotypes on each chromosome were phased using ShapeIT (v2.r837). Then, phased data were submitted to the Michigan Imputation Server, and imputed using Minimac3. The protocol of the study was approved by the Institutional Review Board of Brigham and Women’s Hospital and the Harvard School of Public Health.

Meta-analysis

Association studies in the TwinsUK and QIMR cohorts were conducted for individual SNPs using a linear mixed model as implemented in Merlin45 and GEMMA46, respectively, in order to take into account the non-independence of twin data. Association studies in the RS and in the HS cohorts were conducted using a logistic regression model adjusting for sex, age and the first five genotype principal components. Specifically, in the HS cohort, associations in each component GWAS set (Affymetrix, Illumina HumanHap series, Illumina OmniExpress, HumanCore Exome, and OncoArray) were combined in an inverse-variance-weighted meta-analysis using the METAL software47.

Meta-analyses of the results obtained in the replication cohorts were carried out using a weighted Z-score method based on sample size, P value and direction of effect in each study as implemented in the METAL software47. We considered an association replicated when the meta-analysis P value reached a Bonferroni-adjusted significance threshold of P < 0.05/30 = 1.67 × 10−3.

SNP-by-sex interaction

We assessed SNP-by-sex interaction in the UKBB data set for all replicated SNPs using PLINK36 (version 1.90 b3.38). The first five principal components assessed on the genomic data were included as covariates to control for potential stratification issues. We considered an interaction term significant when its P value reached a Bonferroni-adjusted significance threshold of P < 0.05/20 = 2.5 × 10−3. We then attempted replication using four of our replication cohorts (the TwinsUK cohort included only female individuals). The meta-analysis in the replication cohorts was performed using a weighted Z-score method based on sample size, P value and direction of effect in each study as implemented in the METAL software. We considered the interaction replicated when the meta-analysis P value was smaller than the Bonferroni-adjusted significance threshold of P < 0.05/5 = 0.01.

Power calculations

Power was assessed with Genetic Power Calculator48 assuming a prevalence of 40% for increased tanning ability—as observed in both the UKBB (42%) and TwinsUK (38%) data sets.

Identification of known loci

We interrogated the NHGRI-EBI GWAS catalogue49 (release 26 June 2017; association P < 5 × 10−8) to identify overlap between SNPs identified in our study (or in tight linkage disequilibrium with them; r2 ≥ 0.8) and previously reported associations for ease of skin tanning, pigmentation-related phenotypes and skin cancer.

Genetic correlation between tanning ability and skin cancer

The 2017 release of the UK Biobank genetic data included a further 367,186 individuals. We removed 968 individuals flagged because of low genetic data quality, 120,502 individuals who were estimated to be genetically related and 57,157 individuals affected by any cancer, either malignant or in situ, resulting in 907 CMM (ICD-10 code: C43) and 5912 non-melanoma skin cancer (ICD-10 code: C44) cases, and 181,740 controls (Supplementary Table 12). Genotype data were processed as described in 'Genotyping and imputation', resulting in 5,734,850 SNPs meeting the following conditions: call rate ≥95%, minor allele frequency (MAF) ≥1% and Hardy–Weinberg equilibrium test with P ≥ 1 × 10−9. Due to the small number of CMM cases, we only assessed association between ease of skin tanning and non-melanoma skin cancer using a logistic regression approach, as implemented in PLINK (version 2.00), assuming an additive genetic model, and including age, sex, genotyping array and the first five principal components assessed on the genomic data as covariates.

We used the cross-trait LD score regression (LDSC) software16,50 (version 1.0.0) to estimate the genetic correlation between ease of skin tanning and cancer occurrence. We followed the protocol described in Bulik-Sullivan et al.50, removing indels, structural variants, strand ambiguous SNPs and those with MAF <1%. LD scores were evaluated using the 1000 Genomes Project European data16.

Identification of loci affecting CMM and tanning ability

We applied a Bayesian bivariate analysis as implemented in GWAS-PW27 to investigate whether loci here associated with ease of skin tanning and previously involved in melanoma risk exert a shared effect on both ease of skin tanning and CMM, using data from a large meta-analysis of CMM11 (N case = 15,990; N control = 26,409). GWAS-PW estimates the posterior probability that each locus includes a genetic variant which (i) associated only with CMM, (ii) associated only with tanning ability, (iii) associated with both traits or (iv) that the genomic block includes two genetic variants, associating independently with each of the two traits.

Association study with natural hair colour

Self-reported hair colour was assessed via questionnaire in 118,777 out of 121,296 individuals used in this study (Supplementary Tables 14 and 15). To assess association between hair colour and the loci replicated in our study of ease of skin tanning, we used (a) a linear regression model to test association with non-red-haired individuals where blonde = 1, light brown = 2, dark brown and black = 3, and (b) a logistic regression model to test association with red versus non-red hair colour. Both linear and logistic regression were performed using PLINK36 (version 1.90 b3.38) assuming an additive genetic model and including age, sex and the first five principal components assessed on the genomic data as covariates.

GTEx cis-eQTL analysis

To study whether the replicated SNPs have a regulatory effect on gene expression levels, we used expression quantitative trait loci (eQTLs) data in three skin tissues from the GTEx consortium project31 (data release v6), namely fibroblasts, sun-exposed skin (lower leg) and non-sun-exposed skin (suprapubic). cis-eQTLs were defined by the GTEx consortium as SNPs 1 MB around the transcript start site passing 5% FDR31. We extended the set of 34 independent SNPs replicated for ease of skin tanning by including any SNP in high linkage disequilibrium (r2 ≥ 0.8) with them using SNAP Proxy Search51 and data from both the HapMap 3 (release 2) and 1000 Genome (pilot 1) projects. We then evaluated an empirical enrichment P value by comparing the overlap between the set of cis-eQTLs in the GTEx project database and the original extended set of SNPs with the overlap obtained using 1000 random sets of SNPs created using a cyclic permutation procedure52.

Analysis of epigenetic marks

By following the same procedure described above for the cis-eQTL enrichment analysis, we additionally assessed enrichment and depletion of epigenetic markers by using histone marks and DNA accessibility peak data in epithelial foreskin melanocyte primary cells from the Roadmap consortium project32 (data release v9). We focused on a core set of histone modifications (namely: H3K4me1, H3K4me3, H3K27me3, H3K36me3, H3K9me3 and H3K27ac) and DNA-seq accessibility data. Since the histone modification data for the studied cell line was available from two donors, we averaged the overlaps among samples.

URLs

For UK Biobank, see http://www.ukbiobank.ac.uk/. For LD score regression, see https://github.com/bulik/ldsc. For NHGRI-EBI GWAS catalogue, see https://www.ebi.ac.uk/gwas/. For GWAS-PW, see https://github.com/joepickrell/gwas-pw. For Power calculation, see http://pngu.mgh.harvard.edu/~purcell/gpc/., For SNAP, see http://archive.broadinstitute.org/mpg/snap/. For GTEx portal, see http://www.gtexportal.org/home/. For Roadmap Epigenome Project, see http://www.roadmapepigenomics.org/.

Data availability

Supplementary data that support the findings of this study have been deposited in Zenodo with the https://doi.org/10.5281/zenodo.1194289 at the address https://doi.org/10.5281/zenodo.1194289. The association summary statistics for the 10,834 genome-wide significant SNPs are provided as Supplementary Data 1. Genomic coordinates are reported in GRCh37.p13. The association summary statistics for the SNPs associated in the UKBB non-melanoma skin cancer GWAS (P < 1 × 10−5) are provided as Supplementary Data 2. Genomic coordinates are reported in GRCh37.p13. The cis-eQTLs identified in the three studied skin tissues, and available within the GTEx project, are provided as Supplementary Data 3.