Experimental model and subjects

Participants of the MGRB were consented through the biobank programmes of the ASPREE and 45 and Up studies11,13. At the time of blood collection, each participant was aged 60 years or older.

Samples from the ASPREE study were from individuals aged 75 years or older at the time of enrolment, with no reported history of any cancer type, no clinical diagnosis of atrial fibrillation, no serious illness likely to cause death within the next 5 years (as assessed by general practitioner), no current or recurrent condition with a high risk of major bleeding, no anaemia (haemoglobin > 12 g/dL males, >11 g/dL females), no current continuous use of other antiplatelet drug or anticoagulant, no systolic blood pressure ≥180 mm Hg and/or a diastolic blood pressure ≥105 mm Hg, no history of dementia or a Modified Mini‐Mental State Examination (3MS) score ≤77, and no severe difficulty or an inability to perform any one of the 6 Katz basic activities of daily living.

Samples from the 45 and Up Study were from individuals with no self-reported history of cancer, heart disease, or stroke. Neurological disease was not explicitly excluded, but participants were required to correctly self-complete a health survey at enrolment. We confirmed no record of cancer diagnosis in the NSW Central Cancer Registry, and no record of cancer diagnosis in the NSW Admitted Patient Data Collection, for all 45 and Up Study individuals in the MGRB.

Participants in the Australian Schizophrenia Research Bank (ASRB) were recruited through a national media campaign and consented to data and sample collection and genomic analyses following discussion with a clinical assessment officer42. UK Biobank samples were sourced from the UK Biobank Resource under Application Number 17984.

Ethics

The ASPREE Biobank study was approved by the Monash University Human Research Ethics Committee, and subsequent WGS of Australian ASPREE participants was approved by the Alfred Hospital Ethics Committee. The use of 45 and Up Study samples in the MGRB is covered by ethics approvals from the University of New South Wales Human Research Ethics Committee and the NSW Population & Health Services Research Ethics Committee. The use of the ASRB data was approved by the University of Newcastle Human Ethics Research Committee.

Sample collection and processing

For ASPREE participants of the MGRB, peripheral blood samples were processed to buffy coat within 4 h of collection, then stored at −80 °C. DNA was later purified from buffy coat via magnetic bead extraction (Qiagen).

For 45 and Up Study participants of the MGRB, peripheral blood samples were refrigerated at 4 °C and processed to buffy coat within 48 h of collection. Buffy coat was stored at −80 °C, and DNA purified via column extraction (Qiagen).

ASRB participant PBMCs were extracted from whole blood by centrifugation in Lymphoprep (Vital Diagnostics). Genomic DNA (gDNA) was extracted from PBMCs using salt extraction and quantified by PicoGreen assay (Life Technologies). The integrity of gDNA was determined by agarose gel electrophoresis prior to sequencing.

Sequencing

WGS of the MGRB, 45 and Up cancer, and ASRB cohorts was performed on Illumina HiSeq X sequencers at the Kinghorn Centre for Clinical Genomics (KCCG), Sydney, using paired-end Illumina TruSeq Nano DNA HT libraries and v2.5 clustering and sequencing reagents. Each sample was sequenced on one HiSeq X lane.

Sequence alignment and processing

All sequence data generated at the KCCG were processed following the Genome Analysis Toolkit (GATK) best practices43. We first defined a custom reference genome tailored to Illumina HiSeq X sequencers, being the 1000 Genomes Phase 3 decoyed version of build 37 of the human genome, with an added contig of NC_001422.1 to act as a decoy for the HiSeq-specific ΦX174 sequence spike-in. Reads were aligned to this reference using bwa 0.7.15 mem in paired mode, and duplicates marked with biobambam2 2.0.65 bamsormadup, with a minimum optical pixel distance of 2500. All other parameters for both bwa and bamsormadup were left at defaults. For high-depth samples run on multiple sequencing lanes, data merging was performed at this point using samtools 1.5. Indel realignment and base quality score recalibration of mapped reads were performed using GATK 3.7-0 and best practices parameters; unmapped reads were left unmodified. GATK HaplotypeCaller was used to generate g.vcfs from all single-lane realigned and recalibrated BAMs using recommended parameters. Pipeline steps were accelerated using GNU parallel 20170722 (ref. 44).

Locus confidence tiers

We defined locus confidence tiers for WGS genotyping on the basis of prior annotations, sequence complexity, and empirical metrics on our data. Locus tiers ranged from 1 to 3, with 1 indicating the highest confidence in WGS variant detection performance and 3 the lowest.

To specify the locus confidence tiers, we first identified regions of the genome which empirically had unusual coverage in the MGRB and 45 and Up cancer sequencing data. For each sample we defined bounds on the expected sequence coverage as the 0.001 and 0.999 quantiles of a Poisson distribution, with rate equal to the modal nonzero coverage observed across all autosomal loci within that sample. As typically 15 reads are required for high genotyping performance45, the lower bound was thresholded to always be at least 15. Within each sample, we defined each autosomal locus as being either in-bound (depth within the sample-specific bounds) or out-of-bound. We then calculated across all samples the rate at which each locus was out-of-bounds, considering the entire MGRB cohort. Regions for which this rate exceeded 5% (in other words, loci which had unusual coverage in at least 5% of MGRB + 45 and Up cancer samples) were marked as problematic. These problematic regions were smoothed by a morphological closing operation followed by an open operation, with structuring elements being centred intervals on the genome of size 131 and 11 bp, respectively, to yield a final definition of regions of unusual depth in the MGRB cohort. These regions totalled 409 Mb, 13.0% of the reference genome, 13.2% of the canonical chromosomes (1–22, X, Y), and 14.9% of the CCDS coding sequence (accessed 21 Nov 2017).

We then defined a poor-quality subset of the genome as all loci within 5 bp of the union of: the unusual depth regions, repeat regions identified by RepeatMasker, low complexity regions of the reference genome detected by mdust with default parameters, excludable regions from the ENCODE project, and poorly aligned or non-unique regions from the ENCODE project (Supplementary Data 4). This poor-quality subset totalled 1832 Mb in size, 58.4% of the reference genome, 59.0% of the canonical chromosomes, and 18.1% of the CCDS coding sequence.

Variants in non-canonical chromosomes, the pseudoautosomal regions (X: 60001–2699520, 154931044–155260560; Y: 10001–2649520, 59034050–59363566), or within the poor-quality subset of the genome defined above, were assigned to the lowest confidence tier 3. For the remaining variants in canonical chromosomes, if the variant overlapped a high-confidence HG001 region identified by the GiaB consortium v3.3.2 (ref. 14) it was assigned the highest confidence tier 1, else it was assigned an intermediate confidence tier of 2. In total, 81.9% of the CCDS coding genome was in confidence tier 1 or 2 (Table 4).

Table 4 Quantity in megabases of genomic regions in each locus confidence tier. Full size table

Initial sample quality control

Poor-quality MGRB and 45 and Up cancer samples were identified on the basis of genotype metrics at a small diagnostic set of loci. All 3033 single-lane samples were genotyped at SNP loci on the Illumina Infinium QC Array 24 v1.0, using GATK GenotypeGVCFs, and quality metrics calculated within Hail v0.1 (ref. 46). A total of 2904/3033 (95.7%) samples passed initial quality thresholds (Table 5). Of these, 14 (0.5%) had a reported sex that did not match their genetic sex, as determined from the X chromosome inbreeding coefficient; these sex-discordant samples were not considered further. In total, 2890/3033 (95.3%) MGRB and 45 and Up cancer samples passed initial quality control (QC).

Table 5 Quality metric conditions for samples to pass quality control. Full size table

Small variant genotyping and final QC

The 2890 MGRB and 45 and Up cancer samples passing initial QC were joint called in a single batch using GATK GenotypeGVCFs, and imported to Hail v0.1 for processing. A second round of QC (Table 5) identified an additional 31 samples with poor-quality metrics not revealed by the initial QC round; these were dropped. The PCRELATE component of the GENESIS 2.8.0 package47 was used to determine structure-corrected relatedness between the 2859 samples remaining, using autosomal SNPs LD-pruned with an r2 threshold of 0.1, KING robust relatedness estimates from SNPrelate 1.12.1 (ref. 48), and without a population reference cohort. Fourteen pairs of individuals related to second degree or closer were identified and excluded from the cohort. MGRB (cancer-free) and 45 and Up cancer samples were split into separate cohorts at this point, and four 45 and Up cancer samples excluded on the basis of incomplete or inconsistent clinical data. In summary 2841 unrelated samples passed all data quality requirements, comprising 2570 cancer-free MGRB individuals, 269 45 and Up cancer samples, and the reference materials RM 8391 and RM 8398.

ASRB processing and quality control

Sequence processing and quality control for the ASRB cohort proceeded as described for the MGRB. 344/476 ASRB samples passed all QC thresholds and were used for subsequent analysis.

Cohort population structure

The MGRB cohort population structure was determined using principal components analysis (PCA), with reference to the 1000 genomes (1000 G) populations. A merged dataset of all MGRB and 45 and Up cancer genotypes and the 1000G Phase 3 genotypes was generated in Hail. To ensure high genotype concordance between platforms, merged variants were restricted to autosomal strand-specific SNPs in Tier 1 regions of the genome (see Locus confidence tiers), with a 1000G allele frequency in the range of 5–95%, and no evidence of deviation from Hardy–Weinberg equilibrium within any of 17 homogeneous 1000G populations (P HWE > 0.01/17 for each of population codes BEB, CDX, CEU, CHB, CHS, FIN, GBR, GWD, IBS, ITU, JPT, KHV, LWK, MSL, STU, TSI, and YRI). Merged variants were LD-pruned in Hail with an r2 threshold of 0.1, and PCA performed in Hail on biallelic variants with a combined MGRB and 1000G allele frequency in the range of 5–95%.

A hierarchical eigenvalue decomposition discriminant analysis classifier was constructed to assign MGRB samples to 1000G populations on the basis of PCA scores. The first classifier layer predicted a sample’s 1000G superpopulation (AFR, AMR, EAS, EUR, or SAS), and the second a sample’s European population (CEU, FIN, GBR, IBS, or TSI), conditional on EUR being the predicted superpopulation by the first layer. Models were trained on 1000G sample scores only using PC1–4 as predictors, and then were applied to predict population source for the MGRB samples. All models were implemented using mclust v5.3 (ref. 49).

Small variant processing and annotation

Small variant processing and annotation was performed within Hail v0.1. Variant consequences were determined using Ensembl VEP 90 with default Ensembl release 90 databases. Variants were further annotated with 1000 genomes Phase 3, Haplotype reference consortium, GnomAD, and dbSNP allele frequencies, as well as ClinVar, CATO, and Eigen annotations (see Supplementary Data 4 for resource versions).

Germline SV detection

Germline SVs in the MGRB and 45 and Up cancer cohorts were detected using GRIDSS v1.4.1 (ref. 50), excluding regions in the Encode DAC Mappability Consensus Excludable list (Supplementary Data 4). Where possible, linked sets of breakend calls resulting from a single rearrangement were merged into higher-level structural events. To eliminate overlap with GATK indel calls and enable assessment of cohort frequencies, SV events were filtered to be of length at least 50 bp, and those of the same type within a window of 100 bp were merged to the one call.

Germline MEIs were identified using Mobster v0.2.2 (ref. 51) without blacklisting existing mobile element regions. MEI calls were then processed to remove false-positive events in existing mobile element regions and to estimate variant zygosity by local realignment to the reference genome. MEIs occurring in different samples within 100 bp of each other were merged to the one call.

Rare variant burden comparison

To compare rare variant burden between the platform-matched MGRB and 45 and Up cancer cohorts, missense or nonsense variants (as judged by VEP) in ACMG SF 2.0 cancer-associated genes were joint called across both cohorts, and each variant scored for pathogenicity by ACMG criteria, blinded to cohort. The rate of individuals carrying pathogenic variants was then directly compared by Fisher’s test. To exclude potential confounding due to source cohort, the 45 and Up component of the MGRB only was compared to the 45 and Up cancer cases.

Genome-wide common variant frequency comparison

To compare patterns of common variation between the MGRB and other cohorts, we merged the MGRB variants with gnomAD v2.0.1 non-Finnish European (NFE) WGS allele frequencies, and allele frequencies from a subset of the UK Biobank genotype set, consistent of participants who self-identified as white British and who had confirmed European ancestry by PCA52. To minimise the influence of technical artefacts, variants were restricted to strand-specific biallelic SNPs listed in the EBI GWAS database, that were located in regions of the genome covered by the Genome in a Bottle standard, and were sequenced to a depth of at least 15 in at least 98% of samples in both the MGRB and gnomAD WGS cohorts. Further, variants which were not observed in one or more cohorts, or were genotyped at a rate of less than 97% in any cohort, were excluded. In total, 21,033 SNPs remained following this filtering, with very similar allele frequencies across all cohorts (Supplementary Data 2, sheet 1; Supplementary Fig. 3); these loci and frequencies were used in the following common variant analyses.

We tested for phenotype-linked bias in allele frequency between the cohorts as follows. For a given phenotype-associated set of variants, each variant was scored on two metrics: its variant allele frequency enrichment or depletion in MGRB vs gnomAD or UKBB, and the positive or negative association of the variant allele with the trait. A Fisher’s exact test was then used to test for dependence of variant enriched/depleted status on the trait direction of effect, with deviation from the null indicating an allele frequency bias between MGRB and gnomAD or UKBB that is specific to the phenotype considered.

Three sets of variants were tested by this procedure: a test set of variants reported to be associated with phenotypes depleted in the MGRB (Supplementary Data 2, sheets 2–3), and two negative control sets of variants linked to anthropometric traits (Supplementary Data 2, sheets 4–5), or behavioural traits (Supplementary Data 2, sheets 6-7).

PS estimation and testing

PSs were calculated as \(S_i = \mathop {\sum}

olimits_j {\beta _jd_{ij}}\) where S i is the PS for individual i, β j the GWAS-reported coefficient for a single variant allele at locus j, and d ij is the variant allele dosage for individual i at locus j. We considered only autosomal variants, and if a variant dosage was not available for an individual, it was imputed as \(\widehat {d_{ij}} = 2f_j\), with f j the variant allele frequency reported by the source publication. To reduce bias due to this imputation, variants with a call rate under 97% were excluded from PS calculation in all individuals.

PS GWAS coefficients were derived from each source publication by the following procedure. Firstly, all loci that were reported to be genome-wide significant in replication were selected, along with the effect allele’s association coefficient in the replication cohort. If replication results were not reported, the derivation cohort loci and coefficients were used. Regression coefficients for binary traits were then transformed to a log-odds scale; coefficients for the continuous traits of height and parental lifespan were used as-is. Loci were converted to GRCh37-centric coordinates, and where possible any loci falling within tier 3 regions (see section Locus confidence tiers) were imputed to an alternative locus falling in a tier 1 or 2 region, chosen so that the original and alternative loci had an R2 of at least 0.9 as assessed in the GBR population of 1000 genomes by NCI LDlink53. If no alternative loci fulfilled this criterion, the original locus was dropped from the PS. PSs were derived by this procedure for colorectal cancer54, melanoma55, breast cancer56, prostate cancer57, blood pressure58, early-onset coronary artery disease59, atrial fibrillation60, height21, Alzheimer’s disease61, and parental longevity1. The Alzheimer’s disease PRS as originally reported lacked the highly significant APOE locus; accordingly this locus was manually added to the PRS using the tag SNP rs10414043, and an estimated β = 1.34 (ref. 62). rs10414043 was used in preference to the more conventional rs429358 as the latter was not robustly genotyped on all platforms. All loci, alleles, and coefficients used in the PRS calculations are detailed in Supplementary Data 2, sheet 8.

An approximate bootstrap procedure was used to test for PS shift between MGRB, gnomAD, UKBB, and the 45 and Up cancer cohort, accounting for genetic drift between populations (Supplementary Fig. 9). All cohorts were first collapsed to allele-frequency data only, with individual genotypes discarded. PS variants were subset to those called at a rate of at least 97% all cohorts, and with an absolute difference in alternate allele frequency between MGRB, gnomAD, or UKBB of less than 4%. To include the effect of genetic drift in the bootstrap, the allele frequency differences between MGRB, and each of the gnomAD, UKBB, and 45 and Up cancer cohorts were calculated as \(d_{2,i} = \left( {f_{{\mathrm{MGRB}},i} - f_{{\mathrm{Other}},i}} \right) \div \sqrt {f_{{\mathrm{Other}},i}\left( {1 - f_{{\mathrm{Other}},i}} \right)}\), where f Cohort,i denotes the variant allele frequency at locus i in cohort Cohort. d 2,i was calculated for all well-genotyped loci (not just those in PSs). d 2,i is related to the fixation index F ST , and the distribution of d 2,i values between a cohort and MGRB reflects the genetic distance between these groups of individuals. For a given PS and comparison cohort Other, testing then proceeded as follows. A bootstrap Australian reference cohort (ARC) was generated by shifting the Other allele frequencies based on d 2 : \(f_{{\mathrm{ARC}},i} = f_{{\mathrm{Other}},i} + d_2^{(i)}\sqrt {f_{{\mathrm{Other}},i}\left( {1 - f_{{\mathrm{Other}},i}} \right)}\), where \(d_2^{(i)}\) has been sampled with replacement from d 2 . This simulates an unseen comparison cohort ARC which has the same mean genetic distance from MGRB as the comparison cohort Other, but which we would expect has no allelic shift due to the phenotypic depletion that we hypothesise is present in MGRB. To test this hypothesis we calculate the expected PS in both MGRB and in ARC, as \(s_{{\mathrm{Cohort}}} = \frac{2}{n}\sum_{j = 1}^n {f_{{\mathrm{Cohort}},{{j}}}\beta _j}\), where j = 1...n indexes the PS loci, and estimate the MGRB score depletion as s MGRB −s Other . This depletion statistic was calculated for each comparator cohort and PS, for 100,000 bootstrap rounds.

To facilitate comparisons between scores of different scales, bootstrap distributions for each score were normalised so that the s UKBB score had a mean of zero and a standard deviation of 1. The 95% bootstrap confidence intervals for each cohort were then defined as the 0.025 and 0.975 quantiles of the normalised scores. Approximate two-sided p values for the MGRB score depletion were calculated as \(p \approx \frac{2}{{B + 1}}\left( {\frac{1}{2} + {\mathrm{min}}\left( {\sum_{k = 1}^B {\left[ {s_{{\mathrm{MGRB}},(k)} \, < \, s_{{\mathrm{Other}},(k)}} \right]} ,\sum_{k = 1}^B {\left[ {s_{{\mathrm{MGRB}},(k)}\, > \, s_{{\mathrm{Other}},(k)}} \right]} } \right)} \right)\) where s MGRB,(k) and s Other,(k) are samples from the bootstrap mean values for cohorts MGRB and Other, respectively, B = 100,000 is the number of bootstrap rounds, and [] denotes Iverson brackets.

The statistical power improvement from using the MGRB extreme phenotype cohort as a control vs gnomAD was estimated by asymptotic approximation. Bootstrap distribution means of the mean prostate cancer score difference between the 45 and Up cancer cases, and gnomAD and MGRB controls were used as mean shift values for statistical power calculation. After correcting for varying cohort sample size, bootstrap distribution variance was highly consistent across all three cohorts, and pooled variance scaled to a sample size of 1 was used as the dispersion parameter. Power was then calculated across a range of sample sizes for both the MGRB vs 45 and Up cancer, and gnomAD vs 45 and Up cancer tests, by direct root finding of the relevant t distributions. Finally, the power vs sample size relationship was inverted by piecewise linear interpolation to yield the sample size vs power curves.

Individual genotypes were available for both MGRB and 45 and Up cancer cohorts. In these cases, a secondary analysis was performed that directly compared the distributions of individual PSs between cohorts. Height prediction was validated by ordinary linear regression of measured individual height against the polygenic height predictor21 with additional additive linear covariates of sex and age at measurement; no evidence for model misspecification was observed. The association between PS and risk of specific cancers was assessed by logistic regression, with the effect of PS on cancer risk modelled by GCV-penalised thin plate splines. Comparisons were restricted to the specific cancers of prostate, colorectal, and melanoma, as other cancers were either poorly sampled in the 45 and Up cancer cases, or did not have PSs defined.

Incidental somatic variant detection

Somatic variants were identified in post-BQSR BAM files using FreeBayes, with options: --pooled-continuous --standard-filters --min-alternate-fraction 0 --min-alternate-count 3 --hwe-priors-off --allele-balance-priors-off --use-mapping-quality. FreeBayes was restricted to detecting variants within 10 kb of RefSeq genes in the COSMIC Cancer Gene Census downloaded 11 December 2017. Variant annotation was performed using the Ensembl VEP63 release 90, with default options, and variants were notated with COSMIC 83 frequencies.

Annotated variants were filtered to retain only non-synonymous variation (missense, splice donor or acceptor, start lost, stop gained, frameshift, or inframe indel) affecting Cancer Gene Census Tier 1 genes, with a maximum population allele frequency of less than 0.1%, a variant allele fraction (VAF) of at least 10%, and three or more reads supporting the variant. We then identified likely driver mutations from these filtered variants by the following criteria: either a variant had a HIGH consequence in a canonical tumour suppressor gene transcript or the variant was observed at least 100 times in the COSMIC database. Consequences and canonical transcripts were as defined by Ensembl VEP; tumour suppressor genes were Tier 1 genes from the COSMIC Cancer Gene Census with a TSG annotation.

Telomere length

Telomere lengths were estimated using Telseq v0.0.1 (ref. 64). To reduce batch effects between the ASRB and MGRB cohorts, ASRB telomere length estimates were calibrated using Deming regression, fit to 85 ASRB samples sequenced both in the original ASRB batch, and contemporaneously with the MGRB.

Telomere length estimation by Telseq was validated by qPCR on a subset of 120 samples from the ASRB and MGRB cohorts (Supplementary Fig. 10)65. Briefly, qPCR was conducted in triplicate. Reactions included genomic DNA (5 ng), 2× Rotor-Gene SYBR Green Master Mix (Qiagen), 500 nM Tel forward [5′-CGGTTT(GTTTGG) 5 GTT-3′] and 500 nM Tel reverse [5′-GGCTTG(CCTTAC) 5 CCT-3′] or 300 nM 36B4 forward [5′-CAGCAAGTGGGAAGGTGTAATCC-3′] and 500 nM 36B4 reverse [5′-CCCATTCTATCATCAACGGGTACAA-3′] in a 25 μL reaction. Amplification was conducted in a Rotor-Gene Q qPCR cycler (Qiagen) at 95 °C for 5 min, followed by 30 cycles of 95 °C for 7 s and 58 °C for 10 s (telomere reaction) or 35 cycles of 95 °C for 15 s and 58 °C for 30 s (single copy gene reaction). Telomere content for each sample was determined by the telomere to single copy gene ratio (T/S ratio) by calculating \({\mathrm{\Delta }}C_{\mathrm{t}} = C_{{\mathrm{t}}_{{\mathrm{telomere}}}} \div C_{{\mathrm{t}}_{{\mathrm{single}}\,{\mathrm{copy}}\,{\mathrm{gene}}}}\). The T/S ratio of each sample was normalised to the mean T/S ratio of a reference sample, which was included in each run. The experiment was accepted if the reference sample T/S ratio ranged within 95% variation interval, and if the standard curve had a high correlation factor (R2 > 0.95).

Mitochondria and Y chromosome copy number

Mean mitochondrial genome copy number in each sample was estimated using read counts, as \(2 \times \left( {R_{{\mathrm{MT}}} \div S_{{\mathrm{MT}}}} \right) \div \left( {R_{\mathrm{A}} \div S_{\mathrm{A}}} \right)\), where R Z and S Z denote the number of reads mapping to contig set Z and the total size of contig set Z, and MT and A denote mitochondrial and autosomal contigs, respectively. Read counts were mapped and aligned reads reported by samtools idxstats, and were not corrected for read duplication. Patch contigs were not included in counts. Y copy number in males was estimated by an analogous procedure, as \(2 \times \left( {R_{\mathrm{Y}} \div S_{\mathrm{Y}}} \right) \div \left( {R_{\mathrm{A}} \div S_{\mathrm{A}}} \right)\).

Mitochondrial variants

Variants in the mitochondrial genome were detected using FreeBayes, considering only reads with base quality over 24 and mapping quality over 30; all other parameters were left at defaults. Variants with fewer than 10 alternate reads, or an alternate allele fraction under 0.001, were discarded. For each variant passing these filters a Phred-like quality score q was calculated as \(q = - 10{\mathrm{{log}}}_{10}\left( {1 - F\left( {n;p,N} \right)} \right)\), with n the count of alternate allele reads, N the total depth at the variant locus, p = 0.0025 a fixed error rate estimate, and F(n; p, N) the cumulative density function of a binomial distribution with N draws and success probability p. Variants with q < 30, high-depth variants (n > 15) with an alternate read strand bias of greater than 0.9, or variants in the highly variable locations MT:302–319 or MT:3105–3109 were discarded. The final metric of mitochondrial variant burden for a sample was defined as the number of low-frequency (variant allele fraction under 0.01) variants passing all above filters in that sample.

Somatic single-nucleotide variants

Somatic SNV burden was estimated using a combination of statistical filtering and spectral denoising. Putative somatic SNVs were first identified on the basis of a variant allele frequency that was statistically inconsistent with either machine error or germline variation. The burden of these variants in each sample was then dimensionally reduced by spectral factorisation, and per-sample signature scores used as the final somatic variant estimates.

We first developed a statistical filtering procedure to identify likely somatic variants that uses dynamic thresholds to optimise sensitivity while controlling signal to noise ratio. This procedure calls a variant at a given locus as likely somatic if it satisfies the following criterion:

$$c_{\mathrm{E}} \le n_{\mathrm{A}} \le c_{\mathrm{H}},$$

where n A is the number of non-reference allele reads at the locus, and c E and c H are integers which maximise:

$$p_{{n}} = r_{{\mathrm{RR}}}\mathop {\sum}\limits_{n_{\mathrm{A}} = c_{\mathrm{E}}}^{c_{\mathrm{H}}} {\frac{n}{{n_{\mathrm{A}}}}p_{\mathrm{A}}^{n_{\mathrm{A}}}\left( {1 - p_{\mathrm{A}}} \right)^{n - n_{\mathrm{A}}}}$$

subject to:

$$\frac{{r_{\mathrm{c}}}}{{1 - r_{\mathrm{c}}}}\frac{{p_{{n}}}}{{r_{{\mathrm{RR}}}\alpha _{\mathrm{E}} + r_{\mathrm{H}}\alpha _{\mathrm{H}}}} \ge g_{\mathrm{r}}$$

with \(r_{{\mathrm{RR}}} = \frac{1}{2}\left( {1 - r_{\mathrm{H}} + \sqrt {1 - 2r_{\mathrm{H}}} } \right)\), and \(p_{\mathrm{A}} = \left( {1 - \frac{4}{3}\varepsilon } \right)f + \varepsilon\). Here n is the sum of reference and alternate allele depths at the locus, r H is the expected rate of heterozygous variant germline loci, r C the expected rate of somatic variant loci, ε the base read error rate, g r the minimum acceptable ratio of true positive calls to false positive, and f is the expected somatic variant allele fraction. α E and α H are test sizes corresponding to thresholds c E and c H : \(c_{\mathrm{E}} \equiv {\mathrm{inf}}\left\{ {n_{\mathrm{A}}:{\mathrm{Pr}}({\mathrm{err}})\, <\, \alpha _{\mathrm{E}}} \right\}\),\(c_{\mathrm{H}} \equiv {\mathrm{sup}}\left\{ {n_{\mathrm{A}}:{\mathrm{Pr}}({\mathrm{het}})\, <\, \alpha _{\mathrm{H}}} \right\}\), \({\mathrm{Pr}}({\mathrm{err}}) = \mathop {\sum}

olimits_{i = n_{\mathrm{A}}}^n {\frac{n}{i}\varepsilon ^i\left( {1 - \varepsilon } \right)^{n - i}}\), \({\mathrm{Pr}}({\mathrm{het}}) = \mathop {\sum}

olimits_{i = 0}^{n_{\mathrm{A}}} {\frac{n}{i}\left( {\frac{1}{2} + \frac{1}{3}\varepsilon } \right)^i\left( {\frac{1}{2} - \frac{1}{3}\varepsilon } \right)^{n - i}}\). Informally, this procedure selects variants with an alternate allele count n A too large to be due to sequencing error (n A ≥ c E ), yet too low to be from a poorly sampled heterozygous germline locus (n A ≤ c H ). The derivation of this procedure and further details are available in Supplementary Data 5. A notable advantage of this procedure is that it yields per-locus estimates of variant detection sensitivity, as p n . These estimates are critical in normalisation of variant detection rates to account for differential coverage across variable sequencing runs, which is necessary for the accurate estimation of sample somatic variant burden.

Insufficiencies of the simple error model used above result in incomplete control of the signal to noise ratio, and further filtering is required to reliably quantify somatic variant burden. Assuming that true somatic events and false-positive machine noise exhibit differential sequence context bias, we use a spectral dimensionality reduction approach to achieve additional denoising and summarise the total somatic variant burden in each sample. Extending previous cancer somatic signature work66,67, we calculate per-sample sensitivity-normalised somatic variant burden for each of 96 single-nucleotide variant classes, as the total number of detected somatic events of a given class in that sample, divided by the sum of p n in that sample for all loci corresponding to the given variant class. The resulting 96 × n normalised burden matrix is then reduced by non-negative matrix factorisation, using 100 random optimisation starting points per cardinality. To select the appropriate factorisation cardinality, we reduce the burden matrix by merging groups of 16 age-consecutive samples by summing burden for each variant class, and factorise this reduced matrix with 100 random restarts and cardinality ranging from 2 to 10. The lowest cardinality that gives an inflection point on the plot of explained variance vs cardinality is selected, and applied to the full burden matrix. Per-sample scores are extracted from the best of 100 random runs at this final selected cardinality.

We applied the above procedure to the MGRB and ASRB samples, with parameters adapted to maximise sensitivity with low-depth sequencing data: g r = 5, f = 0.2, \(\varepsilon = 2.0 \times 10^{ - 3}\), \(r_{\mathrm{H}} = 1.0 \times 10^{ - 4}\), \(r_{\mathrm{C}} = 5.0 \times 10^{ - 7}\). Our filtering process employed SNVs identified by samtools mpileup, with maximum depth 101, mapping quality adjustment of 50, BAQ recalculation, no indel reporting, and minimum read and mapping qualities of 30, and employed a blacklist of common SNPs observed in either MGRB or dbSNP. The factorisation cardinality procedure applied to our data indicated that three signatures best described the mutation patterns observed (Supplementary Fig. 11). Signature 3 in this work was quantitatively similar to COSMIC signature 5 (cosine similarity 0.81), previously reported to be associated with age at cancer diagnosis29, and the per-sample scores for this signature were used as the summative somatic burden measure. Signature 1 from this work was very similar to COSMIC signature 1 (cosine similarity 0.95), which has also been associated with spontaneous deamination processes and age. However, we observed substantial inter-cohort differences in score distribution for this signature, suggestive of high technical variability, and did not examine it further.

Somatic copy number variants

We developed a model-based strategy to identify subclonal copy number variants (CNVs), assuming a single genetically homogeneous subclone present on a background of diploid cells.

We first defined a set of autosomal SNPs with highly stable sequencing characteristics on our platform. We selected loci containing autosomal biallelic SNPs in the MGRB cohort, with a variant allele fraction between 5% and 95%, and a mean GC content in the surrounding 100 bp of between 30% and 55%. These were further filtered to retain only loci with highly consistent coverage in both the MGRB and ASRB cohort data, with mean(DP rel ) ∈ [0.9, 1.1] in both cohorts, var(DP rel ) ∈ [0.025, 0.033] in the MGRB, and var(DP rel ) ∈ [0.025, 0.040] in the ASRB cohort. Here DP rel is locus depth relative to mean sample depth, and statistics are calculated over all samples in each cohort. In total, 1,862,065 loci passed all filters, with a median inter-locus distance of 626 bp, and 5th and 95th percentiles of 30 and 4904 bp, respectively.

We individually genotyped MGRB, 45 and Up cancer, and ASRB samples at this set of highly reliable loci using GATK HaplotypeCaller with default parameters, except for a variant window size of 100 bp (-ip 100). Within each sample, the depths of reference and variant alleles at all heterozygous SNV target loci were fit to the following subclonal CNV model to produce estimates of local ploidy and global sample subclonal fraction.

Consider a locus i in a single sample which contains fraction f of aneuploid cells, the remaining 1−f being entirely diploid (gonosomes are not modelled). We denote the copy number (ploidy) of the aneuploid cells at i with k 1,i and k 2,i , \(k_{ \cdot ,i} \in \aleph\). For example, \(k_{1,i},k_{2,i} = (1,1)\) denotes a diploid state (no aneuploidy), \(k_{1,i},k_{2,i} = (1,0)\) the deletion of one allele, and k 1,i , k 2,i = (2,2) duplication of both alleles. Our task is to estimate k 1,i , k 2,i for all i, and f globally, given reference and non-reference allele depths d R,i and d A,i .

The extent to which the aneuploid cell ploidies k 1,i and k 2,i affect the representation of alleles in the mixed cell population depends on the aneuploid cell fraction f. Let p 1,i and p 2,i represent the mean ploidy of each chromatid in the mixed cell DNA pool. As the pool is assumed to consist of only two populations, with 1−f of the cells diploid, \(p_{1,i} = fk_{1,i} + (1 - f)\) and \(p_{2,i} = fk_{2,i} + (1 - f)\).

We assume that the sequencer does not exhibit allelic bias. Then, \(E[d_{1,i}] = c_ip_{1,i}\), \(E[d_{2,i}] = c_ip_{2,i}\), with c i a normalising constant to account for the depth at locus i. Here d 1,i and d 2,i denote the depths of reads from chromatid 1 and 2, respectively. Unfortunately we do not have phased genotypes, and so cannot easily determine the chromatid source of each read. Instead we have unphased reference and non-reference depths d R,i and d A,i , and must account for the resulting phase uncertainty with a mixture model.

Disregarding allele phasing we model the depths of reference and non-reference reads at i using a mixture: \(d_{{\mathrm{R}},i},d_{{\mathrm{A}},i} \sim D(c_ip_{1,i}),D(c_ip_{2,i})\) with probability \(\frac{1}{2}\), else \(d_{{\mathrm{R}},i},d_{{\mathrm{A}},i} \sim D(c_ip_{2,i}),D(c_ip_{1,i})\), D(μ) denoting a distribution function with expected value μ. In our implementation we employ a negative binomial distribution for D, with probability mass function \(f_D(x;\mu ,s) \equiv \frac{{\Gamma (x \, + \, s)}}{{\Gamma (x \, + \, 1)\Gamma (s)}}q^s(1 - q)^x\), \(q \equiv \frac{s}{{s \, + \, \mu }}\). The size term s captures overdispersion relative to the Poisson distribution, and is optimised per-sample in the model fit.

The normalising constant c i is half the expected depth at locus i, which is itself a complex function of locus- and sample-specific factors. We model this function at the locus- and sample-level using empirical cohort depth measurements, and a sample-specific GC bias correction. Specifically, we define \(c_i \equiv b_ie^{h(g_i)}\), where b i is the mean relative depth of locus i (where relative depth is defined as \(d_i \div \frac{1}{n}\sum_i {d_i}\), with \(d_i \equiv d_{{\mathrm{R}},i} + d_{{\mathrm{A}},i}\) and n the number of target loci, n = 1,862,065), across the sample’s cohort (either MGRB or ASRB), h is a smooth function, and g i is a vector of GC fraction in windows of various size around locus i. For this work, g i was a 5-vector of GC fraction in windows of size 100, 200, 400, 600, and 800 bp, calculated on the reference sequence centred at locus i. The sample-specific GC correction function h was implemented using a generalised additive model (GAM) with five smooth terms, and fit to all heterozygous loci for each sample as \({\mathrm{ln}}(c_i \div b_i) \sim s(r_{1,i}) + s(r_{2,i}) + s(r_{3,i}) + s(r_{4,i}) + s(r_{5,i})\), with r j,i being the score of the jth principal component of the GC fraction matrix for locus i, and s denoting a penalised regression spline term. GAMs were fit using mgcv 1.8–17 (ref. 68) with default parameters.

A greedy agglomerative algorithm was used to segment the genome of each sample into regions of differing ploidy state. Initially the genome was divided into segments of 100 consecutive heterozygous loci, with segment boundaries enforced between chromosomes. Adjacent segments were tested for identical distribution of \(d_{{\mathrm{R}},i} \div c_i,d_{{\mathrm{A}},i} \div c_i\) by a two-sample Kolmogorov–Smirnov test, and the two segments with the highest p value genome-wide were merged. This process was repeated until either no segment pairs remained to merge, or all Kolmogorov–Smirnov test p values were less than 0.01. Segments were never merged between chromosomes.

The above model was fit to the allele counts within each genome segment by maximum likelihood. Ploidies of each segment, k 1,i and k 2,i , as well as the global aneuploid fraction f and overdispersion s, were optimised by grid search with local polishing. As very high ploidies coupled with low f result in highly expressive but likely incorrect models, the maximum allowable ploidy k max ; \(k_{1,i},k_{2,i} \le k_{{\mathrm{max}}}\) was determined in an outer loop through minimisation of the Bayesian Information Criterion (BIC). A final polishing step was applied to the BIC-optimal model, which merged consecutive segments of the genome if they were assigned identical chromatid ploidies by the model. This final polished model yielded global cell fraction f, as well as local ploidies across the genome, for the single aneuploid clone assumed to be present in each sample.

Clonal haematopoiesis

Extending previous work6, clonal haematopoiesis of indeterminate potential (CHIP) was defined in an individual if either: a somatic small variant (see section Incidental somatic variant detection) was detected with a variant allele frequency of at least 10%, or somatic CNV (see section Somatic copy number variants) indicated the presence of a clone comprising at least 10% of nucleated blood cells.

Somatic burden statistical analysis

Exploratory analysis indicated that variable transformation was required for some measures. For the following analyses, telomere length and Y copy number were modelled as-is; somatic variant burden, mitochondrial load, and mitochondrial variant count were log-transformed prior to modelling; and grip strength in kg was power transformed with exponent 0.7.

Within-cohort trends in somatic measures were estimated by linear regression, with 95% Wald confidence intervals. Likelihood ratio tests of nested models were used to evaluate inter-cohort trend differences, with p values corrected for multiple testing by Holm’s step-up procedure69.

We used a permutation procedure to test the importance of somatic burden measures in predicting grip strength and gait speed, conditioned on age. For each of 18 possible frailty measure × somatic measure × sex combinations (frailty measures: grip strength, gait speed; Somatic measures: Telseq telomere length, nuclear somatic variant burden, mtDNA copy number, mitochondrial variant count, and Y copy number in males only), we calculated the deviance of the following generalised additive model \({\mathrm{frailty}} \sim s({\mathrm{age}}) + s({\mathrm{weight}}) + s({\mathrm{BMI}}) + s({\mathrm{abdocirc}}) + s({\mathrm{somatic}})\), with age in years, weight in kg, BMI in kg/m2, abdominal circumference (abdocirc) in cm, and the somatic measure of interest (transformed if relevant following exploratory analysis). In this model specification, s(x) denotes a GCV-penalised thin plate spline smooth term in x as implemented in R package mgcv 1.8-17(ref. 68), with Gaussian error and identity link. This model’s deviance d was compared to the deviance d (i) of 10,000 models fit in the same manner but with the somatic variable permuted, and a p value estimated as \(\widehat p = \frac{1}{{10,001}}\left( {\sum_i {\left[ {d_{(i)} \le d} \right] + 0.5} } \right)\). To address multiple testing concerns we used a two-stage process. In the first stage p values were calculated as above for all 18 tests on a randomly selected subset of 25% of the ASPREE samples. Tests with a p value less than 0.2 in the first stage were tested in the second validation stage on the remaining 75% of the ASPREE samples, and these second-stage p values corrected for multiple testing by Holm’s method.

We observed cohort differences in intercepts in plots of somatic measures vs age. To remove these solely for the purposes of illustration (Fig. 3), for each somatic measure we fit the generalised additive model \({\mathrm{measure}} \sim s({\mathrm{age}},{\mathrm{by}} = {\mathrm{sex}}) + {\mathrm{cohort}}\), with Gaussian error and identity link. In this model specification s(age, by = sex) denotes a GCV-penalised thin plate spline with age as the predictor variable, stratified by sex. Model fits were performed using the R package mgcv68. After confirming the suitability of the model fits, cohort-specific effects were removed by calculating the quantity \(y_i^\prime = y_i - \widehat s_C + \widehat s_{{\mathrm{ASPREE}}}\) for each individual and measure, where \(y_i^\prime\) is the cohort-corrected somatic measure for individual i, to be plotted; y i is the original measurement for individual i in cohort C; and \(\widehat s_C\) and \(\widehat s_{{\mathrm{ASPREE}}}\) are the model estimates of the cohort intercept term for cohort C and the ASPREE cohort, respectively. In this manner, somatic measurements were transformed to have an intercept matching that fitted to the ASPREE cohort.

We used the following procedure to illustrate the effect of mtDNA copy number on grip strength in males. For each male individual i in the ASPREE cohort, an age-local quantile of mitochondrial DNA copy number c i was defined as \(q_i \equiv \widehat {F_i}\left( {c_i} \right)\), where \(\widehat {F_i}\) is the empirical cumulative distribution function of c in the neighbourhood of individual i, with the neighbourhood of an individual i defined as all male ASPREE individuals within ±1 year of age of i. Ages were rounded to the nearest integer for the purposes of neighbourhood definition; for the median ASPREE male age of 80 years, this neighbourhood contained 293 men with ages in [79, 81] years. Given these local mtDNA copy number quantile estimates q, a generalised additive model of the form grip strength~age + s(q) was fit using the R package mgcv68, with s smooth term as above. Predictions from this model with age = 80 and varying q defined the estimated influence of age-local mtDNA copy number on grip strength for an 80 year old man. These grip strength predictions were transformed to effective age estimates assuming typical mtDNA copy number by inversion of the model predictions for s = 0.5, and used to calculate an age excess as a function of q. Variability of this relationship was estimated using 100,000 bootstrap samples, and results presented as highest posterior density intervals.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.