UK Biobank

UK Biobank is a prospective cohort study of over 500,000 individuals from across the United Kingdom62. Participants, aged between 37 and 73, were invited to one of 22 centers across the UK between 2006 and 2010. Blood, urine and saliva samples were collected, physical measurements were taken, and each individual answered an extensive questionnaire focused on questions of health and lifestyle. All participants gave written informed consent and the study was approved by the North West Multicentre Research Ethics Committee. UKB has Human Tissue Authority research tissue bank approval, meaning separate ethical approvals are not required to use the existing data. UKB provided genotyping information for 488,377 individuals. Data access to UKB was granted under application #21988. Phenotypes and genotypes were downloaded directly from UKB.

Genotyping and imputations

UKB participants were genotyped on two slightly different arrays and quality control was performed by UKB63. 49,950 samples were genotyped as part of the UK BiLEVE study using a newly designed array, with 438,427 remaining samples genotyped on an updated version (UK Biobank Axiom array), both manufactured by Affymetrix (96% of SNPs overlap between the arrays). Samples were processed and genotyped in batches approx. 5000 samples each. In brief, SNPs or samples with high missingness, multi-allelic SNPs and SNPs with batchwise departures from Hardyâ€“Weinberg equilibrium were removed from the data set. After quality control, genotypes were available for 488k subjects at 805k sites. UKB provided 40 principal components (PCs) of genetic relatedness (UKB field id 22009) and a binary assessment of whether subjects were genetically confirmed European Ancestry (UKB field id 22006), based on principal components analysis of their genetic data.

We have computed Pearson correlations between self reported ethnicity (UKB field id 21000), coded as binary variable, and the 40 principal components in UKB data set of 488,363 participants with genetics principal components analysis data available. The estimates could be found in Supplementary Data 17.

Imputed data were prepared by UKB. In summary, autosomal phasing was carried out using a version of SHAPEIT364 modified to allow for very large sample sizes. Imputation was carried out using IMPUTE265 using the merged UK10K and 1000 Genomes Phase 3 reference panels to yield higher imputation accuracy of haplotypes. The imputations resulted in 92,693,895 SNPs, short indels and large structural variants, imputed in 488,377 individuals63.

Discovery and replication samples

For the discovery and replication we used only the data from PCA cohort (QC passed, Data-Field 22020, N = 407,208). This cohort also represents the largest possible unrelated individuals subset63 with all relatives of third degree or closer removed. For the discovery set we selected 300,447 genetically confirmed white (GCW) British individuals according to the genetic principal components provided by the UK Biobank who were not included in UK BiLEVE study (UKB Resource 531). For replication, we used a combination of the UK Biobank participants not included in the discovery set that comprised rest of European ancestry individuals (self-reported white, data-field 21000, n = 81,099), individuals of African ancestry (self-reported Africans, n = 3073), individuals of South Asian ancestry (Indian, Pakistani, and Bangladeshi; n = 6921), Chinese individuals (n = 1422) and Caribbean individuals (n = 3799). Remaining self-declared ethnicities that were mixed, or were ambiguous (Other ethnic group, Prefer not to answer, Not available) were not analyzed. To reduce the risk of bias due to population stratification, all groups were analyzed separately followed by a meta-analysis. Total resulting sample size for replication was 96,313 individuals. Additionally, we checked that there is no individuals with kinship coefficient r > 0.01 between discovery and replication cohorts, using relationship data provided by UKB (UKB data category 100315). For more details see Supplementary Data 18.

The replication threshold was set as p < 0.05/12 = 0.004. For each SNP, statistical power (or probability) of replication was estimated using the fact that under alternative hypothesis (H 1 :β ≠ 0) the test statistics T2 from replication sample is expected to follow the \(\chi _{{\mathrm{df}} = {\mathrm{1,NCP}}}^2\) distribution, where NCP is the expected non-centrality parameter computed as \((T_{{\mathrm{disc}}}^2 - 1) \times N_{{\mathrm{rep}}}/N_{{\mathrm{disc}}}\), where \(T_{{\mathrm{disc}}}^2 = (\beta _{{{\mathrm{disc}}}}/se_{{\mathrm{disc}}})^2/\lambda _{{\mathrm{LDSC}}}\) is test statistic for particular SNP in discovery cohort, corrected for LD score regression interecept λ LDSC , N rep is the sample size of the replication cohort and N disc is the sample size of the discovery cohort. The the power of replication is equal to the probability that such distributed statistics would exceed the threshold value k = 8.2 that corresponds to right-hand integral of \(\chi _1^2\) equal to 0.004.

Incidence of diseases calculation from UKB data

We used in-patient hospital admissions data (UKB data category 2000) and self-reported diagnoses obtained via verbal interview (UKB data category 100074) to extract information in relation to the disease history, the nature of and the age at the available diagnosis. For each of the condition, we follow the instructions similar to the ones given by the UK Biobank outcome adjudication group for algorithmic-defined stroke and MI (UKB data category 42). For each selected condition, except for cancer and death we compile a list of hospital data codes (ICD-10, Supplementary Data 19) and self-reported data codes (UKB data coding 6) that defines these conditions in our study. We used National cancer registries linkage to UKB (UKB data category 100092) in addition to hospital data for cancer and National death registries linkage to UKB (UKB data category 100093) to define death event. First, for each condition we set the age of first occurrence of any of corresponding hospital data codes as age this condition was manifested. Next, if there was missing hospital data (for hospital data it is impossible to distinguish between missing data and absence of any disease) we added self-reported data if there was any. Therefore we obtained age each condition was occurred. The minimal age from this data set for every individual from UKB was taken as age the healthspan terminates. When calculating disease incidence rates, each participant was counted despite the existence of any other disease earlier in life, therefore some participant may have different event times for different conditions. By definition, the incidence rate of a disease is the limit m(t) = Δt−1N d (t, Δt)/N h (t) when Δt is sufficiently small. Here t is the age, N h (t) is the number of people healthy at the age t and N d (t, Δt) is the number of people diagnosed between the ages t and t + Δt (both N h and N d are presumed to be large). This definition does not rely on any specific underlying model. In practice, datasets are of limited size and the interval Δt cannot be made arbitrarily small, and therefore precautions should be taken to avoid possible artifacts in the calculation. To compute the incidence rate at a given age t, one shall consider a set of participants Υ(t, Δt) defined as those who are healthy at the age t and whose health status is available in the whole age range [t, t + Δt): \(\Upsilon (t,{\mathrm{\Delta }}t) = \{ u|((\delta ^u = 0) \vee (\delta ^u = 1 \wedge t \le t_{\mathrm{d}}^u)) \wedge (t + {\mathrm{\Delta }}t \ < \ t_2^u)\}\). Here u is the participant’s id, δu = 1 if the participant was diagnosed and δu = 0 otherwise, \(t_{\mathrm{d}}^u\) is the age when diagnosed, and \(t_2^u\) is the maximal age at which the information about the diagnosis (if any) would still be recorded. From this N h (t) = |Υ(t, Δt)| and \(N_{\mathrm{d}}(t,{\mathrm{\Delta }}t) = |\{ u \in \Upsilon (t,{\mathrm{\Delta }}t)|\delta ^u = 1 \wedge t \le t_{\mathrm{d}}^u \ < \ t + {\mathrm{\Delta }}t\} |\), where |..| is the size of the set.

The maximum follow-up age \(t_2^u\) does not coincide with the age at the diagnosis \(t_{\mathrm{d}}^u\) and shall be inferred from the study setup. Assuming \(t_2^u = t_{\mathrm{d}}^u\) for diagnosed participants would overestimate the risks. Also, the age is often rounded and hence Δt may be not large enough to treat the rounding errors as negligible. We addressed the issue by consistently using half-open intervals [..) definitions. Finally, our prescription relies on the implicit assumption, that the diagnosis does not influence the enrollment. This is not always true. If someone is dead, this would, naturally, prevent that person from being enrolled at a greater age. This can be addressed by the following modification: \(\Upsilon \prime (t,{\mathrm{\Delta }}t) = \{ u\, \in \,\Upsilon (t,{\mathrm{\Delta }}t)|t_1^u \ < \ t\}\), where \(t_1^u\) is the age at enrollment. In this study, we assumed that the enrollment in UKB was not biased by diagnoses and thus we used the Υ for all diseases and conditions, Υ' participants set was only employed for the mortality rate calculation.

Cox-Gompertz proportional hazards model and healthspan

By design of the UKB study, every participant is admitted into the cohort at the age \(t_1^n\). According to the medical history information, the participant may be diagnosed with any of the diseases relevant to determination of lifespan at the age of the first \(t_{\mathrm{d}}^n\) (if applicable). By the end of the followup age, \(t_2^n\), we labeled every study participant as frail, δn = 1, if the participant is already diagnosed with any of the diseases, \(t_{\mathrm{d}}^n \le t_2^n\), or δn = 0, otherwise.

Under then Cox-Gompertz proportional hazards model the risks of frailty acquisition or healthspan end at the age t is \(h(t,x^n) = h_0e^{{\mathrm{\Gamma }}t + \beta x^n}\), where xn is a vector of age-independent parameters, characterizing the participant. Here h 0 , Γ, and β are the baseline morbidity incidence, the Gompertz exponent and the log-odds-ratio regression coefficients vector, the model parameters. The (negative log of) likelihood of the data can be presented in the following form:

$$\begin{array}{*{20}{l}} L \hfill & = \hfill & {\mathop {\sum}\limits_n \frac{{h_0e^{\beta x^n}}}{\Gamma }\left( {e^{{\mathrm{\Gamma }}\,{\mathrm{min}}\left( {t_{\mathrm{d}}^n,t_2^n} \right)} - 1} \right)} \hfill \\ {} \hfill & {} \hfill & { - \delta ^n({\mathrm{log}}\,h_0 + \beta x^n + {\mathrm{\Gamma }}\,{\mathrm{min}}(t_{\mathrm{d}}^n,t_2^n)).} \hfill \end{array}$$ (1)

Given a necessary amount of data the model parameters could be obtained by the likelihood maximization or, equivalently, minimization of the cost function L.

We built the first version of the Cox-Gompertz healthspan model by including GCW-British UKB participants information, including gender and the first genetic principal components variables, assessment center codes and genotyping batch labels (see Supplementary Data 3 for the summary of the model parameters). The morbidity incidence growth rate is 0.098 per year, which corresponds to a doubling time of seven years, compatible with the mortality rate doubling time of approximately 8 from the Gompertz mortality law. As expected, being male is a risk factor (log-hazard ratio, log(HR) = 0.26 at the significance level of p = 5 × 10−301) corresponding to an average healthspan difference of about five years. The genetic principal component PC4 was highly significant log(HR) = 3.4 × 10−2, p = 9.2 × 10−23. PC5 was also highly significant log(HR) = 4.6 × 10−2, p = 1.7 × 10−40. The average healthspan or lifespan can be estimated from Cox-Gompertz model parameters as \(\bar t \approx ({\mathrm{ln}}({\mathrm{\Gamma }}/h_0) - \gamma )/{\mathrm{\Gamma }}\), where γ = 0.577 is the Euler-Mascheroni constant, see, e.g.,66.

Gene variant-healthspan association testing

If the participants state vector \(x_i^n\) is extended by the genetic variants variables sn, in principle, the model has to be re-evaluated, to obtain a new versions of all model parameters. We do not expect, however, large effects of any of the gene variants on lifespan. Therefore the model parameters should not change much as well and the variation of the Cox-Gompertz model with respect to the genetic variables can be accurately obtained by iterations, using the model from 4.5 as the zeroth order approximation (see a related example of a perturbation theory application in a proportional hazards model involving prediction of all-cause mortality in ref. 46).

We note, however, that the simultaneous determination of the weak effects of a gene on the baseline hazard h 0 and the rate of aging Γ is an ill-defined mathematical problem66. Only the combination of the two parameters, the change in the life expectancy can be determined with accuracy. We therefore fix the Gompertz exponent Γ to its most probable value in the zeroth order model and allow for all other model parameters adjustment. The perturbation theory expansion for the small effect β s associated with the gene variants yields (the derivation is not shown):

$$\beta _s = \frac{{\mathop {\sum}

olimits_n {s^n} \left( {\delta ^n - N_{\mathrm{d}}\rho ^n} \right)}}{{N_{\mathrm{d}}\left\langle {\delta s^2} \right\rangle _\rho }},$$ (2)

where, for convenience, we introduced the weights

$$\rho ^n = \frac{{e^{\beta x^n}\left( {e^{{\mathrm{\Gamma }}\,{\mathrm{min}}\left( {t_{\mathrm{d}}^n,t_2^n} \right)} - 1} \right)}}{{\mathop {\sum}\limits_n e^{\beta x^n}\left( {e^{{\mathrm{\Gamma }}\,{\mathrm{min}}\left( {t_{\mathrm{d}}^n,t_2^n} \right)} - 1} \right)}}$$

normalized in such a way that \(\mathop {\sum}

olimits_n \rho _n = 1\). We used the notation 〈δs2〉 ρ for the corresponding weighted average. The effect determination error

$$\sigma _s^2 = \frac{1}{{N_{\mathrm{d}}\left\langle {\delta s^2} \right\rangle _\rho }},$$ (3)

and hence the statistical power of the gene variant association with the healthspan is explicitly dependent on the number of people with diagnoses, \(N_{\mathrm{d}} = \mathop {\sum}

olimits_n \delta ^n\).

In our analyses, we used imputed variants with the expected effective minor allele count (defined as twice the minor allele frequency multiplied by sample size and by the imputation quality) more than 200 for discovery cohort genotypes and imputation info score (as IMPUTE info, calculated by RegScan67 for discovery cohort with–info2 option) more than 0.7.

Conditional and joint multi-SNP analysis

Conditional and joint analysis (COJO) as implemented in the program GCTA21 was used to find SNPs independently associated with the phenotypes of interest. As input, this method uses (meta-analysis) summary statistics and a reference sample that is utilized for the LD estimation. The method starts with the “top SNP” (the one with smallest p-value, conditional that p < p 0 , where p 0 is specific threshold defined by user) as provided by the summary-level data and then the p-values for all the remaining SNPs are calculated conditional on the selected SNP. The algorithm then selects the next top SNP in the conditional analysis (provided p < p 0 ) and proceeds to fit all the selected SNPs in the model dropping all those SNPs with p-values > p 0 . The iteration continues until no SNP is added or dropped from the model thus finding a subset of associated SNPs with a threshold for LD (r2 < 0.9) among SNPs. Finally, a joint analysis of the subset of associated SNPs is performed. We had performed analyses with p 0 = 5 × 10−8 and p 0 = 1 × 10−5.

As the LD reference, we used a sub-sample of 10,000 people, randomly chosen from the total set of 120,286 people used for GWAS discovery phase. Additional to our previous SNP filters described in the Association testing section, in selecting LD reference data, we further filtered out the SNPs with imputation info scores less than 0.7 and minor allele frequencies (MAF) less than 0.002.

Sex-specific analysis

We performed sex-specific genetic association analysis (males: n = 137,469, females: n = 162,978) for 12 genome-wide significantly associated SNPs from Table 2. We estimated the difference of SNP effects between males and females using approach from ref. 68 (see “SNP selection strategy” subsection in Methods, Eq. (1)) that allows testing difference between effect sizes accounting for their possibly correlated joint distribution. The results are reported in Supplementary Data 16. For this method Spearman correlation for effect sizes between males and females was estimated using only called SNPs with MAF > 0.05 (377,781 SNPs in total). The significance threshold was set as p < 0.05/12 = 0.042.

Heritability and genetic correlation analyses

We used LD hub and ldsc58 tools for estimation of captured heritability and genetic correlations between HS and different traits and common diseases58. A total of 231 traits were analyzed after removing duplicates via using only the most recent study for each trait as indicated by the largest PMID number. Genetic correlations between HS and the traits with p < 4.3 × 10−5 (Bonferroni corrected, 0.01/231) were considered statistically significant. Pair-wise genetic correlations between all the traits selected as described above were obtained from the LD-hub. To focus on the largest magnitude genetic correlations, we selected only the traits with absolute values of genetic correlations with HS more than 0.3. This filtering led to the total of 36 traits (including HS). Clustering and visualization was carried out using corrplot package for R and basic hclust function. For clustering, we estimated squared Euclidean distances by subtracting absolute values of genetic correlation from 1 and used Ward’s clustering method.

For genetic correlation analysis between each disease comprising healthspan phenotype and healthspan itself we used LDSC (LD Score) v1.0.0 software. Genotype calls were filtered by MAF > 0.01 using LDSC ‘munge-sumstats’ script to produce total 659,079 variants valid for downstream analysis. Genomic reference was constructing by randomly sampling 10,000 individuals from the UKB population. Then, we ran LDSC genetics correlation analysis with default parameters and input data as described above. Cross-correlations can be seen at Fig. 3 and Supplementary Data 16.

For analysis of heritability, genomic control inflation factor λ19 and genetics correlations we have used SNPs defined by overlap between our set of SNPs and ‘high quality SNPs’ as suggested by the authors of the LD hub (these represent common HapMap3 SNPs that usually have high imputation quality; also, this set excludes HLA region)20, 1,162,742 SNPs in total).

Variant effect prediction (VEP)

We used PAINTOR software69 to prepare the set of SNPs for VEP annotation. For this analysis, we provided PAINTOR with clumping results, LD matrices and annotation files calculated using the same 10,000 UKB individuals reference set that we used for COJO analysis. With PLINK70 and we performed clumping analysis with ‘p1’ and ‘p2’ p-value threshold parameters set to 5 × 10−8, ‘r2’ set to 0.1 and MAF > 0.002. Then, we generated pair-wise correlation matrix for all SNPs in each region in clumping analysis results using plink–r option. When running PAINTOR, we did not use annotations; we changed options controlling input and output files format only, and otherwise we have used default parameters. We choose 159 SNPs marked by PAINTOR as 99% credible set for further analysis. In the next step, each SNP was extended with a list of proxy SNPs with R2 > 0.8 calculated using EUR cohort from 1000 Genomes Project Phase 371 (N = 503) with 84.4 million variants as reference set. Total 924 SNPs was chosen for functional annotation by VEP with GRCH37 genomic reference.

Gene-set and tissue/cell enrichment analysis

For prioritizing genes in associated regions, gene set enrichment and tissue/cell type enrichment analyses, we have used the DEPICT software v. 1 rel. 19432 with following parameters: flag_loci = 1; flag_genes = 1; flag_genesets = 1; flag_tissues = 1; param_ncores = 10. Independent (as selected by COJO procedure) variants with p < 5 × 10−8 (14 SNPs) and p < 10−5 (135 SNPs) has resulted from these analyses. We have used UKB subset of 10,000 individuals for computations of LD (the same subset as used for COJO analysis).

Pleiotropy with complex traits

We investigated the overlap between associations obtained here and elsewhere, using PhenoScaner v1.1 database34. For five replicated SNPs (Table 1) we looked up traits that have demonstrated genome-wide significant (p < 5 × 10−8) association at the same or at strongly (r2 < 0.8) linked SNPs.

Code availability

All computer code used in this research is available at https://github.com/azenin/healthspanpaper.

Reporting Summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this Article.