Lifespan estimation from CpG density

We identified all vertebrate species that had reference genomes available in NCBI35, known maximum lifespans in the AnAge database23 and evolutionary divergence times in TimeTree37. This primary data set contained 252 species from five vertebrate classes (Supplementary Table 1), with lifespans ranging from 1.1 years for the turquoise killifish (Nothobranchius furzeri) to 205 years for the rougheye rockfish (Sebastes aleutianus). We removed humans (Homo sapiens) from the data set because they were listed with a maximum lifespan of 120 years, which reflects neither the variability nor the true global average lifespan (60.9–86.3 years)38.

Mammals were the most represented class of vertebrates in the data set (Supplementary Table 1), and the average BLAST hit length of promoters from EPD was 374 bp (Supplementary Fig. 2). The BLAST hit length decreased with increasing evolutionary distance from humans (R² = −0.85, p-value < 2.20 × 10⁻¹⁶; Supplementary Fig. 2), which most likely reflects the use of human promoter sequences. We also identified a positive correlation (R² = 0.64, p-value < 2.2 × 10⁻¹⁶) between total CpG sites and genome size across animal species (Supplementary Fig. 3). It has been suggested that an increased rate of recombination prevents the loss of CpG island density as chromosome number and genome size increase39. This suggests that genome-wide CpG density is maintained across animal genomes of different sizes.
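As an illustration of the quantity discussed above, the following is a minimal sketch of counting CpG sites in a sequence and expressing them as a density. It assumes that CpG density is the number of CpG dinucleotides divided by the sequence length; this section does not state the exact definition used, and the function names `cpg_count` and `cpg_density` are hypothetical.

```python
def cpg_count(sequence: str) -> int:
    """Count CpG dinucleotides (a C immediately followed by a G) on one strand."""
    seq = sequence.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def cpg_density(sequence: str) -> float:
    """CpG sites per base pair (assumed definition); 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    return cpg_count(sequence) / len(sequence)


# Toy sequence for illustration only, not a real promoter.
toy_promoter = "ACGTCGCG"
print(cpg_count(toy_promoter), cpg_density(toy_promoter))
```

In practice the input would be the promoter region recovered by a BLAST hit in each genome, so the density is normalised by a hit length that varies between species.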

The final lifespan predictor was based on 42 promoters (Supplementary Table 2) after a 10-fold cross validation to optimise the model (see Methods). From here on, promoters in the model will be referred to as the lifespan loci and the model itself as the lifespan clock. As expected, the lifespan clock returned a high regression coefficient between the known and predicted lifespan of species within the training data set (R² = 0.78, p-value < 2.20 × 10⁻¹⁶; Fig. 1a). Furthermore, using the independent set of samples in the testing data set, the lifespan clock also returned a high regression coefficient (R² = 0.76, p-value < 2.20 × 10⁻¹⁶; Fig. 1b). In addition, the correlation between the known and predicted lifespan using untransformed log values was 0.77 and 0.76 for the training and testing data sets, respectively. Although the model was developed using all classes of vertebrates, which was accounted for using a phylogenetic generalized least squares (PGLS) approach (see Methods), it is important to note that regression slopes may differ among classes of vertebrates. Therefore, more tailored models could potentially be developed for a specific class or taxonomic group. This was confirmed using an ANCOVA test, which showed a significant effect of vertebrate class on predicted maximum lifespan (p-value = 0.00014). Moreover, in the testing data set, using untransformed log values of lifespan, we found statistically significant but differing regression coefficients between the known and predicted lifespans for each vertebrate class (Aves: R² = 0.49, p-value = 0.043; Fish: R² = 0.56, p-value = 0.025; Mammalia: R² = 0.91, p-value = 1.85 × 10⁻¹³; Reptilia: R² = 0.94, p-value = 0.029). We were unable to determine the regression coefficient for Amphibia due to the low sample size in the testing data set.
Other lifespan prediction models could potentially be developed in the future, specific to species class or taxonomic ranking, as the AnAge database continues to gain more lifespan data and more reference genomes become available. The accuracy in predicting the lifespan of species from all five vertebrate classes examined (Supplementary Table 1 and Fig. 1b) suggests that CpG density could serve as a universal biomarker panel for lifespan in vertebrates.

Figure 1 Lifespan estimation from CpG density with lifespan loci. The correlation between the known and predicted lifespan in the (a) training and (b) testing data set. Colours denote the class of each species. The R² value and p-value are given above each plot.

The lifespan clock performed well across species from all classes, producing a median absolute error (MAE) of 3.72 years (Fig. 2a) and a maximum relative error of 5.9% (Fig. 2b) in the testing data set. We also found no significant difference in the absolute error rate between the training and testing data sets (p = 0.20, t-test). In the testing data set, no difference in MAE was found between species whose lifespan estimates were obtained from captivity (43 species) and those whose estimates were obtained from the wild (26 species) (p = 0.31, t-test). This suggests that the source of the lifespan estimate in the AnAge database (captivity or wild) was not a major confounding factor in the model. Despite the high accuracy, individual lifespan loci do not necessarily represent the promoters most strongly correlated with lifespan (Fig. 3a,b). This is similar to other age-related models, where individual components of the overall model do not necessarily correlate well with the age-related feature13,40. Therefore, the lifespan loci may only be somewhat predictive of the directionality of CpG density with increasing lifespan. Principal component analysis (PCA) was used to visually characterise the variation of CpG density among the different species. A PCA of the lifespan loci shows the extent to which the species separate by lifespan and whether there are other drivers of variation in CpG density among the species. The PCA separated the species based on lifespan (Fig. 4). This analysis suggests that the CpG density of the lifespan loci separates species well by lifespan. It also suggests that technical variation, such as the genome assembly level (e.g. contig, scaffold or chromosome assembly), is not a major source of variation between samples (Supplementary Fig. 4). We also tested whether genome GC content was a driver of variation in predicted lifespan and whether it should be adjusted for within the model. However, there was no correlation between GC content and the absolute error rate (Supplementary Fig. 5). This analysis suggests that the lifespan clock is independent of technical factors and variations within genomes. We also tested the lifespan clock on non-vertebrates using the raw prediction values (Supplementary Table 3 and Supplementary Text). However, the lifespan clock returned inaccurate estimates for non-vertebrates, suggesting that it is only suitable for vertebrate species.
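The error summaries reported above can be sketched as follows. This is a minimal illustration only: it assumes relative error is the absolute error expressed as a percentage of the known lifespan, which this section does not define explicitly. The lifespans in the example are hypothetical values, not data from the study, and the function names are illustrative.

```python
from statistics import median


def absolute_errors(known, predicted):
    """Absolute difference (in years) between known and predicted lifespans."""
    return [abs(p - k) for k, p in zip(known, predicted)]


def relative_errors_percent(known, predicted):
    """Absolute error as a percentage of the known lifespan (assumed definition)."""
    return [100.0 * abs(p - k) / k for k, p in zip(known, predicted)]


def median_absolute_error(known, predicted):
    """Median of the per-species absolute errors."""
    return median(absolute_errors(known, predicted))


# Hypothetical lifespans (years) for four species; not values from the paper.
known = [10.0, 20.0, 40.0, 80.0]
predicted = [12.0, 19.0, 44.0, 72.0]

print(median_absolute_error(known, predicted))
print(max(relative_errors_percent(known, predicted)))
```

A median is less sensitive than a mean to a small number of poorly predicted species, which may matter when a data set spans lifespans from about one year to more than 200 years.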

Figure 2 Performance and characterisation of the lifespan loci. Box plots show the (a) absolute error rate and (b) relative error rate of each species in the training and testing data sets. Each point overlaid on the box plots represents an individual species.

Figure 3 Weighting and correlation coefficients of the lifespan loci. (a) Weighting of each lifespan locus, ordered from most positive to most negative. (b) Pearson correlation compared to the weight of each lifespan locus.

Figure 4 Principal component analysis using the CpG density in the lifespan loci, showing that the species separate based on their known lifespans. Species are coloured by increasing lifespan.

We characterised the functions of the lifespan-related loci by performing a gene ontology (GO) enrichment analysis with the associated genes detailed in EPD. Previous research has described an association between energy metabolism and lifespan41,42, often referred to as the rate-of-living theory43,44. However, although the genes associated with the lifespan loci were most commonly related to development and energy metabolism processes, there was no significant enrichment for any GO terms. We also performed Pearson correlations between lifespan and CpG density to determine which promoters correlated positively and which negatively (Supplementary Table 2). Of the 42 promoters, 34 correlated significantly (p < 0.05) with lifespan: 12 negatively and 22 positively. The remaining 8 lifespan loci did not significantly correlate with lifespan.
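The per-promoter correlation described above can be sketched as follows. This is a minimal illustration that computes only the Pearson coefficient and its sign; the significance test (p < 0.05) is not reproduced here. The promoter names, CpG densities and log lifespans are hypothetical, and whether lifespan was log-transformed for this analysis is an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlation_direction(r):
    """Label the sign of a correlation coefficient."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"


# Hypothetical log lifespans for four species and CpG densities at two promoters.
log_lifespan = [0.1, 1.0, 1.5, 2.0]
density_by_promoter = {
    "promoter_A": [0.02, 0.04, 0.05, 0.07],
    "promoter_B": [0.09, 0.06, 0.05, 0.03],
}

directions = {
    name: correlation_direction(pearson_r(log_lifespan, density))
    for name, density in density_by_promoter.items()
}
print(directions)
```

A library routine that also returns a p-value, such as `scipy.stats.pearsonr`, would be needed to separate significant from non-significant loci.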

Extinct animal lifespan estimation

Lifespan is a central life-history attribute, so a lifespan estimator coupled with ancient DNA analysis can reveal this previously hidden aspect of the ecology of extinct species. We estimated lifespan for two extinct members of the Elephantidae family, the woolly mammoth (Mammuthus primigenius)45 and the straight-tusked elephant (Palaeoloxodon antiquus)46. By identifying single nucleotide polymorphisms (SNPs) relative to the African elephant genome, we were able to estimate lifespans for these two extinct species. The AnAge database lists the African elephant as having an estimated lifespan of 65 years, which was used in the training data set. The lifespan clock estimated both the woolly mammoth and the straight-tusked elephant as having a lifespan of 60.0 years. Although this is within the range of their modern-day counterpart, the lack of lifespan information for the woolly mammoth and the straight-tusked elephant makes it difficult to determine the true accuracy of the model for these two species. There is no a priori reason why lifespan estimates for extinct species should be less accurate than those for living ones (median 1.2%, 3.72 years in the testing data set). We also analysed the passenger pigeon (Ectopistes migratorius), which has an assembled genome47 and became extinct in 191448. The lifespan clock estimated the lifespan of the passenger pigeon to be 28.0 years. The lifespan of the passenger pigeon in the wild was never recorded. However, it has been suggested that Martha, the last surviving member of the species, was at least 17 and more likely as old as 29 years49,50, which, although only a single example, adds credibility to our model-based estimate of lifespan.

We also examined whether lifespan estimates for humans differed significantly from those of their close relatives, including chimpanzees51,52 and extinct members of the Hominidae family, Denisovans53 (Homo denisova) and Neanderthals54 (Homo neanderthalensis). The lifespan clock estimated a 38.0 year lifespan for humans (hg19). The maximum lifespan of humans is a controversial topic55,56. In the past 200 years, the average life expectancy of humans has more than doubled because of modern medicine and changes in lifestyle57,58. Early humans have been reported to have had a maximum life expectancy of 40 years57,58, less than half that of modern standards23,38. Similarly, the lifespan of chimpanzees was estimated at 39.7 years. The maximum recorded longevity of a chimpanzee in the wild is thought to be that of a 55-year-old female; however, it is reported that many live to approximately 40 years of age23,59. We next estimated the lifespan of Denisovans and Neanderthals, and found that both had an estimated lifespan of 37.8 years. This suggests that these extinct Hominidae species had similar lifespans to early humans.

Lifespan estimation in long-lived species

The rougheye rockfish (Sebastes aleutianus) was the longest-lived species in the data set at 205 years. Some species of tortoises and whales have also been reported to live for more than 100 years60,61. These species are of interest because they can provide models of, and insights into, longevity and age-associated diseases, but lifespan estimates are difficult to obtain for them. We explored the application of the lifespan clock to several very long-lived species that were not included in the training data set. We first tested the lifespan clock on the genome of the Pinta Island tortoise (Chelonoidis abingdonii), which has a lifespan within the calibration range62. Lonesome George was the last surviving Pinta Island tortoise and was estimated to be over 100 years old when his genome was sequenced. The lifespan clock estimated the maximum lifespan of the Pinta Island tortoise to be 120 years. This estimate is 10–20 years higher than most estimates of Lonesome George’s age at death62. It is important to note that this is not the accepted maximum lifespan of the Pinta Island tortoise, because only one individual has had its age recorded at death. Nevertheless, the model provides a credible and rigorously validated lifespan estimate for this long-lived and extremely data-deficient species. Application of the model to other species of Galapagos tortoise with better lifespan information would enable further evaluation of the lifespan estimate for Chelonoidis abingdonii.

Bowhead whales are thought to be the longest-living mammals19, with one individual estimated to be 211 years old19. Using our lifespan estimator and the bowhead whale genome63, we estimated the maximum longevity of the bowhead whale to be 268 years. This lifespan estimate is 57 years more than the age of the oldest individual aged to date19,63. Lifespan estimation for long-lived species is difficult because many age estimates have been made by extrapolation from models calibrated on limited data from much younger known-age individuals. Bowhead whales provide another example of this, with the lifespan predicted by the alternative method of eyeball amino acid racemisation19 being well beyond the calibration range of that model, as is the case with our lifespan clock. Moreover, it is rarely possible to follow long-lived species from birth to death, as they would normally outlive a generation of researchers. It is also important to note that many of the animals for which ages were estimated showed no signs of pathology19. Generally, if an animal were in the upper limits of its lifespan, one would expect pathological features of some age-related diseases. The lack of such findings suggests that the animals were not near their maximum lifespan and could potentially have lived for many more years.