Phenotype prediction from plasma protein profiles

We have previously quantified abundance levels of circulating plasma proteins from cardiovascular and cancer biomarker panels using the highly sensitive proximity extension assay (PEA) [10,21] in 976 individuals from the Northern Swedish Population Health Study (NSPHS). Seventy-seven of these protein measurements were used to build models to predict chronological age, weight, height and hip circumference. Prediction models were built using generalized linear models with penalized maximum likelihood, as implemented in the glmnet package [23] in R [24]. Models were optimized using a 10-fold cross-validation scheme on 75% of the observations and subsequently evaluated on the remaining 25% (see Methods for details). We repeated the process 500 times and recorded which proteins were selected in each model. As expected, individual variation in protein abundance values and in the distribution of phenotypes gave rise to some variation in the proteins selected for the final model. On average, 68 of the 77 proteins were included in the model predicting age (Fig. 1A, Table 1). All 77 proteins were included in at least one of the age-predicting models, and a core set of 29 proteins was present in all of them. The models for age, height, weight and hip circumference performed well on the test and training sets (Table 1); summary statistics (including protein inclusion statistics) for all models and traits are reported in Supplementary Tables 2–5. The models predicted chronological age with R² = 0.83, while weight (R² = 0.48), height (R² = 0.34) and hip circumference (R² = 0.60) were predicted with somewhat lower R² values. An example of the correlation between chronological and predicted age for one model is shown in Fig. 1B, and the distribution of prediction errors for the 500 age models in Fig. 1C.
In the test sets, 95% of the per-model average errors were within ±1.23 years, and there was no statistically significant difference (p = 0.52, Wilcoxon rank-sum test) between the distributions of errors in the training and test sets, indicating that the models were not over-fitted to the training data. In terms of accuracy, the plasma protein profile predicted chronological age within 5.0 years, weight within 6.8 kg, height within 4.7 cm and hip circumference within 5.1 cm for 50% of the observations. Additional performance measurements for the models are shown in Supplementary Figures 1–3. We also evaluated the performance of the models when restricted to the core set of proteins that were included in all models for each trait (Table 1). Interestingly, the models based on the core set of proteins showed performance statistics similar to those of the models using the full set of proteins, suggesting that a smaller set of proteins can capture most of the phenotype variation. This observation was also confirmed by an analysis of the fraction of variance of the traits that can be explained by individual and combined proteins included in the prediction models (Supplementary Figure 4, Supplementary Tables 2–5). An analysis of the overlap between the proteins present in the four core models showed that only 4 proteins (Fig. 2) were common to all models. These were Tissue plasminogen activator (tPA), Tumor necrosis factor receptor 1 (TNFR1), the Receptor tyrosine-protein kinase ErbB-3 (ErbB3) and Endothelial cell-specific molecule 1 (ESM-1). None of the genes coding for these proteins has been implicated in a recent GWAS for variation in human adult height [25]. In our material, out of the four proteins common to all models, ESM-1 explains the largest proportion of the variance seen in height (9.8%, Supplementary Table 4). ESM-1 is mainly expressed in endothelial cells in lung and kidney tissue but circulates in the bloodstream [26].
We have found no evidence in the literature relating ESM-1 to height, but speculate that circulating levels of ESM-1 could reflect lung volume, which is correlated with height [27]. Notably, none of the four proteins common to all traits is among the proteins explaining the largest fraction of variance in the four traits (Supplementary Tables 2–5).
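The repeated split-and-select procedure described above can be sketched in a few lines. This is only an illustrative, dependency-free stand-in and not the glmnet implementation used in the study: it assumes a pure lasso penalty fitted by coordinate descent, a small hypothetical penalty grid, synthetic data, and 5 repeats instead of 500. All function names are made up for the example.

```python
import random
from collections import Counter

def soft_threshold(z, lam):
    """Shrink z towards zero by lam (the lasso proximal step)."""
    return (z - lam) if z > lam else (z + lam) if z < -lam else 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate-descent lasso on centred data; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residuals when all coefficients are zero
    col_sq = [sum(r[j] ** 2 for r in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the partial residual (feature j added back).
            rho = sum(r[j] * (e + r[j] * beta[j]) for r, e in zip(Xc, resid)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta:
                resid = [e - r[j] * delta for r, e in zip(Xc, resid)]
                beta[j] = new
    return y_mean - sum(b * m for b, m in zip(beta, x_mean)), beta

def mse(model, X, y):
    """Mean squared prediction error of a fitted (intercept, coefficients) pair."""
    intercept, beta = model
    preds = [intercept + sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((a - b) ** 2 for a, b in zip(preds, y)) / len(y)

def cv_choose_lambda(X, y, lambdas, k=10):
    """Pick the penalty with the lowest mean k-fold cross-validated error."""
    folds = [list(range(i, len(y), k)) for i in range(k)]
    def cv_error(lam):
        total = 0.0
        for held in folds:
            held_set = set(held)
            tr = [i for i in range(len(y)) if i not in held_set]
            model = lasso_fit([X[i] for i in tr], [y[i] for i in tr], lam)
            total += mse(model, [X[i] for i in held], [y[i] for i in held])
        return total / k
    return min(lambdas, key=cv_error)

def inclusion_counts(X, y, n_repeats, lambdas, seed=0):
    """Repeat the 75/25 split and count how often each feature is selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        tr = idx[: int(0.75 * len(idx))]  # the remaining 25% would be the test set
        Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
        lam = cv_choose_lambda(Xtr, ytr, lambdas)
        _, beta = lasso_fit(Xtr, ytr, lam)
        counts.update(j for j, b in enumerate(beta) if b != 0.0)
    return counts

# Synthetic data: feature 0 drives the outcome, features 1-4 are pure noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [50 + 3 * row[0] + data_rng.gauss(0, 1) for row in X]
counts = inclusion_counts(X, y, n_repeats=5, lambdas=[0.05, 0.2, 0.5])
print(dict(counts))
```

Features selected in every repeat play the role of the "core set" in this toy setting, while features selected only occasionally correspond to proteins with lower inclusion rates.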

Table 1 Model performances for the traits.

Figure 1 Model performance. (A) Inclusion rate of proteins (number of times a protein was included in a model) in age prediction models across 500 runs. (B) Actual age (y-axis) vs. predicted age (x-axis) for one model, with training set in red and test set in blue. P-values indicate the significance of the correlation, calculated using Spearman’s method. (C) Distribution of errors for all 500 separate runs, overlaid, with training set in red and test set in blue. Vertical dashed lines indicate the 2.5% and 97.5% percentiles, respectively, of the distribution of the average error from each of the 500 separate runs. P-values for the test set represent two-sided differences between the error distributions in the test set and the training set, calculated using the Wilcoxon rank-sum test.

Figure 2 Protein overlap in core models. Overlaps between proteins present in each of the four core models predicting Age, Hip Circumference (HIP), Weight and Height.

The proteins included in the study represent a non-random selection of the proteome, since they are based on biomarker panels for cancer and cardiovascular disease. We therefore evaluated the distribution of superfamilies relative to the human proteome using the Protein Information Resource (PIR) database. We found a significant overrepresentation (p < 0.05, Bonferroni adjusted) of three such families among the 77 proteins analysed (PIRSF002522:CXC chemokine, PIRSF001950:small inducible chemokine, C/CC type and PIRSF000619:TyrPK_EGF-R, Supplementary Table 6). In the core sets of proteins used in the prediction models, only one family was overrepresented (PIRSF000619:TyrPK_EGF-R), and only in the models predicting age, weight and height. We refitted our models after removing every protein annotated to this family and found that performance remained unchanged (Table 1, Supplementary Figures 5–7). This suggests that the non-random selection of proteins included in the analysis does not significantly contribute to the performance of the models.

Lifestyle choices affect the biological age

The ability to use the plasma protein profile to accurately predict age allowed us to examine the effect of lifestyle choices on the predicted phenotype (age). We first studied smoking by comparing 115 individuals in the study cohort who self-reported as smokers with 860 individuals who reported as non-smokers. Smoking status was used to split the cohort into training and test sets, and an age-prediction model was built using the non-smokers. This model predicted smokers to be on average 2.3 years older than their chronological age (Fig. 3A, p < 1.8 × 10⁻⁴, Wilcoxon rank-sum test), even though the two groups did not differ in chronological age (p > 0.9, Wilcoxon rank-sum test). Use of the Swedish wet tobacco product “snus” did not alter the predicted age (Fig. 3B, p > 0.5, Wilcoxon rank-sum test).
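The group comparison behind these p-values can be illustrated with a small rank-sum test. The sketch below assumes that the quantity compared is the prediction error (predicted minus chronological age) in the exposed group versus the control group; the exact quantity tested is described in Methods. It uses the normal approximation without continuity or tie correction of the variance, and the numbers are hypothetical.

```python
import math

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Ties receive average ranks; no continuity or tie correction is applied.
    Returns (z, p) where z > 0 means sample a tends to be larger than b.
    """
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of a
    n1, n2 = len(a), len(b)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical prediction errors (predicted minus chronological age, in years).
exposed_errors = [2.1, 3.4, 1.8, 4.0, 2.9, 3.7]
control_errors = [-0.5, 0.3, -1.2, 0.8, 0.1, -0.4]
z, p = rank_sum_test(exposed_errors, control_errors)
print(f"z = {z:.2f}, p = {p:.4f}")
```

For samples as small as this toy example an exact test would normally be preferred; the normal approximation is used here only to keep the sketch short.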

Figure 3 Effects of lifestyle factors on predicted age. (A) Smoking. Age-predicting model trained on non-smokers and applied to smokers. (B) Snus. Snus is Swedish wet tobacco. Age-predicting model trained on non-snus-users and applied to snus-users. (C) Fatty fish. Age prediction model trained on individuals with the most common consumption of fatty fish (Salmon, Whitefish and Herring) and applied to groups with other levels of fatty fish consumption. Analysis restricted to individuals between 20 and 50 years of age. (D) Significant correlations between soda consumption and other phenotypic traits in the study cohort, with red colour indicating positive and blue negative correlations. (E) BMI. Model trained on individuals with normal BMI (18.5–24.9) and applied to individuals with higher BMI. Analysis restricted to individuals over 20 years of age. (F) Soda. Age prediction model trained on individuals who do not drink soda and applied to groups with different levels of soda consumption. Analysis restricted to individuals between 20 and 50 years of age. (G) Coffee. Age prediction model trained on non-coffee drinkers and applied to groups with different levels of coffee consumption. Analysis restricted to individuals between 20 and 50 years of age. (H) Exercise. Model trained on individuals reporting that they are as active in their free time as other individuals in their age group and applied to individuals reporting that they are much less, less, more or much more active than individuals in their age group. (A–C, E–H) Explicitly written-out predicted phenotype differences indicate a statistically significant (p < 0.05, two-sided Wilcoxon rank-sum test) change compared to the control group (coloured black). All other differences have p > 0.05. All actual phenotype differences have p > 0.05.

Body mass index (BMI) is used to classify obesity, and we examined the impact of BMI on predicted age by training a model on individuals with normal weight (BMI between 18.5 and 25) and applying it to higher BMI intervals (Fig. 3E). A BMI below 40 did not shift the predicted age relative to the chronological age; however, individuals with a BMI over 40 were predicted to be on average 6.3 years older than their chronological age (p < 4.2 × 10⁻³, Wilcoxon rank-sum test).

Over 1000 phenotypic traits have been measured in our study cohort, including lifestyle factors such as dietary habits. Many of the 284 lifestyle and anthropometrical variables were, however, not independent. This is illustrated for dietary items in Fig. 3D, which shows the variables that are significantly correlated (p < 0.05/284² = 6.2 × 10⁻⁷, Spearman’s Rho) with soda consumption. For instance, consumption of sweets (bulk confectionery, R = 0.40, p below machine precision (BMP)), French fries (R = 0.40, p < BMP), pizza (R = 0.25, p < 8.9 × 10⁻¹⁶) and white bread (R = 0.25, p < 3.6 × 10⁻¹⁵) were all positively correlated with soda consumption, while consumption of fatty fish (R = −0.18, p < 1.8 × 10⁻⁷), porridge (R = −0.19, p < 5.6 × 10⁻⁹) and berries (R = −0.19, p < 5.4 × 10⁻⁹), as well as chronological age (R = −0.38, p < 3.5 × 10⁻³⁴), were all negatively correlated with soda consumption. In light of these findings, individual dietary variables should be viewed as lifestyle indicators, and differences in plasma protein abundance are not necessarily the effect of a single food item. Nevertheless, we trained a model on non-soda drinkers and predicted the age of soda drinkers stratified by consumption. Since soda consumption is known to be age-correlated, we included only individuals between 20 and 50 years of age and restricted the analysis to categories with at least 25 individuals. Individuals with high soda consumption were predicted to be significantly older than their chronological age (Fig. 3F, p < 9.6 × 10⁻³, Wilcoxon rank-sum test). There was no statistical difference in actual chronological age between individuals when stratified on soda consumption (p > 0.1, Wilcoxon rank-sum test).
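The significance threshold quoted for the correlation analysis is a Bonferroni-style correction: the nominal level of 0.05 divided by 284 squared, that is, by the number of ordered pairs of the 284 variables. A quick arithmetic check (variable names are illustrative only):

```python
n_variables = 284            # lifestyle and anthropometrical variables in the text
alpha = 0.05                 # nominal significance level
n_tests = n_variables ** 2   # the text divides by 284 squared
threshold = alpha / n_tests
print(f"{threshold:.1e}")    # prints 6.2e-07
```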

In addition, we used an age-predicting model to examine the consumption of fatty fish. Using individuals with the most common consumption frequency (once per week) as controls, individuals consuming fatty fish at least 3 times per week were predicted to be younger than their chronological age (Fig. 3C, p < 1.2 × 10⁻², Wilcoxon rank-sum test), whilst individuals with little or no consumption were predicted to be older than their chronological age (Fig. 3C, p < 4.3 × 10⁻², Wilcoxon rank-sum test). There was no statistical difference in chronological age between individuals when stratified on fatty fish consumption (p > 0.05). Remarkably, the same pattern was found for coffee consumption. The age-prediction model was trained on non-coffee drinkers and applied to the remaining individuals. Individuals reporting a consumption of between 3 and 6 cups of coffee per day were predicted to be on average 5.6 years younger than their chronological age (Fig. 3G, p < 4.0 × 10⁻², Wilcoxon rank-sum test). This analysis was also restricted to individuals between 20 and 50 years of age and, as before, there was no statistical difference in chronological age between groups based on consumption of coffee (p > 0.1, Wilcoxon rank-sum test). Finally, we studied self-reported exercise, where participants rated their own level relative to individuals of the same age in the community (peers). We trained the model using individuals who exercise at similar levels to their peers and applied it to the other exercise categories. Individuals who exercised less or much less than their peers had a significantly higher predicted age (+2.3 years, p < 1.2 × 10⁻², and +5.2 years, p < 7.9 × 10⁻⁴, respectively; Wilcoxon rank-sum test). There was no difference in actual chronological age between the different exercise groups (p > 0.6, Wilcoxon rank-sum test) (Fig. 3H).
None of the individual groups in any of the lifestyles investigated showed any significant (Bonferroni adjusted p > 0.05, Breusch-Pagan test) dependency between the predicted age and the actual age.

The contribution of an individual protein to the age model and to the difference between groups (e.g. smokers vs. non-smokers) was in most cases modest, and an increase in protein abundance was shown to have either an additive or a subtractive effect on the predicted age (Supplementary Tables 1–6). This is illustrated using the effect of individual proteins on the predicted age of smokers versus non-smokers. The majority of proteins contributed a small positive or negative effect to the predicted age (Fig. 4). Some proteins, however, such as the cytokines CXCL9 and CXCL10, mediated relatively large effects (on average +0.27 years in smokers compared to non-smokers, p < 5.6 × 10⁻⁷, and −0.77 years, p < 5.6 × 10⁻², respectively). Both CXCL9 and CXCL10 have previously been shown to be down-regulated in response to cigarette smoke extract compared to control samples in human monocyte-derived macrophages [28]. In our age prediction model, the coefficient (β) for CXCL9 was positive while that for CXCL10 was negative, and both abundance levels were found to be higher in non-smokers compared to smokers. Therefore, the contribution from CXCL9 to the predicted age was lower in smokers compared to non-smokers, while higher in smokers compared to non-smokers for CXCL10. Notably, IL-12 was found to contribute the largest effect (on average +0.82 years in smokers compared to non-smokers, p < 7.6 × 10⁻¹⁴, Wilcoxon rank-sum test). For IL-12, the coefficient (β) in the age prediction model was negative; the positive contribution therefore means that smokers have lower levels of IL-12, which in turn contributes to a higher predicted age compared to non-smokers.
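The sign logic in this paragraph follows from the linearity of the model: under the assumption that the predicted age is an intercept plus a weighted sum of protein abundances, each protein's share of the difference in mean predicted age between two groups is its coefficient multiplied by the difference in mean abundance. The sketch below uses hypothetical protein names, coefficients and abundances; it is not taken from the study data.

```python
# Hypothetical coefficients (years per abundance unit) and group mean abundances.
beta = {"protein_A": 0.8, "protein_B": -1.2}
mean_nonsmokers = {"protein_A": 5.0, "protein_B": 4.0}
mean_smokers = {"protein_A": 4.5, "protein_B": 3.5}

def contributions(beta, mean_case, mean_control):
    """Per-protein share of the difference in mean predicted age (case minus control)."""
    return {name: beta[name] * (mean_case[name] - mean_control[name]) for name in beta}

contrib = contributions(beta, mean_smokers, mean_nonsmokers)
total = sum(contrib.values())

for name, years in contrib.items():
    print(f"{name}: {years:+.2f} years")
print(f"total: {total:+.2f} years")
```

In this toy example both proteins are less abundant in the case group. The protein with a positive coefficient therefore lowers the predicted age of that group, while the protein with a negative coefficient raises it.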

Figure 4 Effect of single proteins on predicted age in smokers. The age-predicting model was trained on non-smokers and applied to smokers. The Y-axis shows the contribution of each protein to the total age difference between predicted and chronological age in smokers, based on the change in protein levels between the two groups. Red (blue) colour corresponds to a positive (negative) contribution to the age in smokers compared to non-smokers. The X-axis depicts the statistical significance of that contribution for each protein (two-sided Wilcoxon rank-sum test, −log10(p)).

Limitations of the study

The NSPHS cohort used here consists exclusively of individuals of western European ethnicity, and the models may therefore not be representative of other populations. The sample size is moderate, which restricts statistical evaluation of relationships between different lifestyle choices or of stratifications on these. Finally, the NSPHS is a cross-sectional study and we lack follow-up data that could have been used to study relationships between longevity, mortality and the age prediction carried out here.