To perform this study, we received a large administrative dataset of anonymized blood biochemistry and cell count results linked to individuals’ chronological age, sex, and confirmed smoking status. The dataset was representative of the entire Alberta population, both rural and urban, with proportional representation of individuals of all ethnic origins. We then trained a set of supervised feed-forward deep neural networks (DNNs) on the nonsmokers to predict chronological age (Fig. 1B). Subsequently, we used these networks to predict the ages of the smokers and of the nonsmokers excluded from training. To further investigate the effect of smoking on age prediction, we included smoking status as one of the input features and performed feature importance (FI) analysis. Finally, we trained a set of supervised feed-forward DNNs to predict the smoking status of patients using only their blood profiles and sex.

Figure 1 Deep learning-based blood-biochemistry clocks accurately predict chronological age. (A) Prediction accuracy of the best-performing model. The model trained on 24 parameters achieved an R2 of 0.57 and an MAE of 5.7 years. (B) The design of the deep learning study that used blood-biochemistry data to predict an individual’s age. Blood samples of nonsmokers were first preprocessed and normalized as previously described8. Next, arbitrage ranking based on 320 RF models was applied to facilitate the selection of the most appropriate feature space with maximum samples available. Afterward, missing values were reconstructed using an autoregressive model with a view towards increasing the training sets, and the resulting feature sets were used to train and test DNNs for predicting patient age and smoking status. (C) Feature importance plot. Fasting glucose, sex, and RDW exhibited higher relative importance scores than other features used in model training. Note: HDL for high-density lipoprotein cholesterol, LDL for low-density lipoprotein cholesterol, RDW for red blood cell distribution width, RBC for red blood cell counts, MCV for mean corpuscular volume, ALT for alanine transaminase, MCHC for mean corpuscular hemoglobin concentration.

Data overview and preprocessing

We obtained data from 149,000 fully anonymized individual records linked to smoking status (49,000 smokers), sex, and age, with up to 66 blood biochemistry and cell count markers (Supplementary Table 1). Of the 66 markers, 36 were among the 41 features used to train our previous Aging.AI 1.0 system10. The numbers of females, males, smokers, and nonsmokers within each age group were comparable (Supplementary Fig. 1). The median age was 55 years.

DNNs require large training datasets. To obtain sufficiently large training sets, we first selected samples with the same blood test date, that is, datasets consisting exclusively of blood-based biomarkers measured on the same day, so that our DNNs could be trained on consistent and relevant measurements.

Although deep learning models can automatically extract features from the data and usually outperform shallow machine learning at this task, it is good practice to select a set of relevant features before training the network. We first optimized the feature spaces used to train the age-prediction models, excluding smoking status, by applying a multifactorial adaptive statistical arbitrage model13 to subsets of samples with various numbers of measured markers. We trained 320 random forest (RF) models on distinct feature spaces and subsequently extracted FI values from each model. The features were ranked by their relative importance to age prediction according to the scores of the models (Formula 1, Supplementary Fig. 2). The accuracy of any predictor depends on the sample size and the feature space on which it is trained. To supplement the number of features used to train our predictors, we applied linear regression to fill missing values for 30–60% (depending on the feature type) of the samples in the dataset. This reconstruction increased the number of available features from 14, 15, and 18 to 18, 20, and 23 features, respectively.
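The reconstruction of missing values by linear regression can be illustrated with a minimal sketch. It assumes a single predictor marker and an ordinary least-squares line fitted on the complete pairs; which predictors were actually used for each feature is not stated here, and the marker names and values below are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def impute(predictor, target):
    """Fill None entries of `target` using a line fitted on the complete pairs."""
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    a, b = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [y if y is not None else a + b * x for x, y in zip(predictor, target)]

# Hypothetical toy values (not taken from the study)
marker_a = [120.0, 130.0, 140.0, 150.0]
marker_b = [0.36, 0.39, None, 0.45]   # third sample lacks marker_b
filled = impute(marker_a, marker_b)
```

Observed values are left untouched; only the missing entry is replaced by the fitted value, so a sample that lacked one marker becomes usable in a wider feature space.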

The blood marker with the largest contribution to the age-prediction model is glycated hemoglobin (hemoglobin A1c), followed in descending order by blood urea, fasting serum glucose, and serum ferritin (Supplementary Fig. 2). Fasting glucose was among the most important features in our previous studies on deep learning-based hematological aging clocks10,11.

Interestingly, the most important markers (as selected by the arbitrage FI method) each show only a weak biweight midcorrelation with age, a measure of the strength of the linear association between a blood marker and age. Biweight midcorrelation is more robust than the Pearson correlation coefficient, being a median-based measure that is less sensitive to outliers (Supplementary Fig. 3, Table 2).
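For readers unfamiliar with the measure, the following is a minimal sketch of the standard biweight midcorrelation definition (deviations from the median, down-weighted by a biweight function of the median absolute deviation). It is not the authors' code, and the toy data are hypothetical.

```python
from math import sqrt
from statistics import median

def bicor_terms(values):
    """Median-centred deviations multiplied by their biweight weights."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; biweight midcorrelation is undefined")
    terms = []
    for v in values:
        u = (v - med) / (9 * mad)
        # Points with |u| >= 1 (far from the median) receive zero weight
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        terms.append((v - med) * w)
    return terms

def bicor(xs, ys):
    """Biweight midcorrelation of two equal-length sequences."""
    a, b = bicor_terms(xs), bicor_terms(ys)
    num = sum(p * q for p, q in zip(a, b))
    den = sqrt(sum(p * p for p in a)) * sqrt(sum(q * q for q in b))
    return num / den

# Hypothetical toy data: the last x value is a gross outlier
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
ys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
robust_corr = bicor(xs, ys)
```

Because the outlier lies far more than nine median absolute deviations from the median, it receives a weight of zero and does not contribute to the estimate, which is the sense in which the measure is less sensitive to outliers than the Pearson coefficient.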

Deep-learned blood-biochemistry clocks can effectively predict biological age

Using the FI ranking determined by the RF models, we selected three different sets of blood biochemistry and cell count markers (Supplementary Table 3). Input feature sets were chosen to contain the maximum number of available samples that displayed the features selected via RF-based arbitrage feature selection (see previous section).

To predict individual age, we trained three DNNs on selected blood test input features of nonsmoking subjects. The predictive performance of each model was evaluated using the Pearson correlation coefficient (r), the standard coefficient of determination (R2), and the mean absolute error (MAE) (Formulae 2–4).
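The three evaluation metrics can be sketched as follows. Formulae 2–4 are not reproduced in this section, so the code assumes the standard textbook definitions; the ages used are hypothetical toy values.

```python
from math import sqrt
from statistics import mean

def pearson_r(y_true, y_pred):
    """Pearson correlation between actual and predicted ages."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mt = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the ages (years)."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

# Hypothetical chronological and predicted ages
actual = [30, 40, 50, 60]
predicted = [32, 38, 53, 57]
scores = (pearson_r(actual, predicted), r_squared(actual, predicted), mae(actual, predicted))
```

Note that r measures only linear association, whereas R2 and MAE also penalize systematic offsets between predicted and actual age, which is why the three are reported together.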

All three models achieved a relatively high correlation between predicted and actual chronological age. The best-performing model was the deep neural network trained on 23 blood test input features (MAE = 5.72 years, R2 = 0.56). The deep neural network trained on 20 blood test input features achieved an MAE of 5.78 years and an R2 of 0.578, followed by the deep neural network trained on the 18 available blood test input features, which achieved an MAE of 5.898 years and an R2 of 0.55 (Fig. 1A, Supplementary Fig. 4A,B, Table 1). Samples from the tail ends of the distribution (individuals younger than 35 years and those older than 75 years) exhibited a higher error rate for age prediction. Fasting glucose, sex, and red blood cell distribution width (RDW) were predicted to be the most important markers (Fig. 1C, Supplementary Fig. 4C,D).

Deep-learned biochemistry clocks reveal differences in the biological ages of smokers and nonsmokers

To investigate the effect of smoking on age prediction, we used neural networks trained on nonsmokers to predict the ages of the smokers and of the nonsmokers excluded from the training set. The model achieved an R2 of 0.57 for nonsmokers and an R2 of 0.55 for smokers. We also calculated the log2 aging ratio (Formula 5) as proposed by Hannum et al.14. Compared with nonsmokers, smokers showed an accelerated rate of aging up to age 55 years regardless of sex (Figs 2B and 3, Supplementary Fig. 8). After age 55, these differences disappeared and perhaps even reversed for the most elderly subjects (Figs 2B and 3, Supplementary Table 4). In the context of biological aging, this suggests that the contribution of tobacco smoking as an external factor of aging may eventually be masked by the intrinsically stochastic and physiologically deleterious nature of the aging process. Alternatively, the people most affected by smoking may have died at an earlier age and thus were excluded from the old-age smoking group.
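As a worked illustration of the log2 aging ratio: Formula 5 is not reproduced in this section, so the sketch below assumes the ratio is the base-2 logarithm of predicted age divided by chronological age, which matches the interpretation given in the Fig. 2B caption (a value of 1 means predicted twice as old, −1 means half as old). The function name is hypothetical.

```python
from math import log2

def log2_aging_ratio(predicted_age, chronological_age):
    """Assumed definition: log2(predicted age / chronological age)."""
    return log2(predicted_age / chronological_age)

# Hypothetical examples
twice_as_old = log2_aging_ratio(60, 30)   # predicted twice the chronological age
half_as_old = log2_aging_ratio(20, 40)    # predicted half the chronological age
on_target = log2_aging_ratio(45, 45)      # predicted equals chronological age
```

A positive value indicates that the model predicts an age older than the chronological age, a negative value a younger one, and zero an exact match, so group means of this quantity can be compared directly between smokers and nonsmokers.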

Figure 2 Deep learning-based hematological clocks demonstrated accelerated aging rates in smokers and revealed patient smoking status. (A) The prediction accuracy of the best-performing model trained on the feature space extended with smoking status. The model, trained on 24 parameters, achieved an R2 of 0.60 and an MAE of 5.42 years. (B) The log2 aging ratio of smokers to nonsmokers by age and sex groups for the best-performing model. Smokers demonstrated a higher aging rate regardless of sex. However, these differences plateaued after 55 years of age. A log2 aging ratio of 1 means the sample was predicted to be twice as old as its chronological age, and a log2 aging ratio of −1 means the sample was predicted to be half as old as its chronological age. (C) The most important features in the classification of smoking status selected by the PFI method. HDL cholesterol, sex, and hemoglobin exhibited higher relative importance scores than other features used in model training. (D) The model trained on 23 parameters achieved an F1 score of 0.67 and an accuracy of 0.84. Note: HDL for high-density lipoprotein cholesterol, LDL for low-density lipoprotein cholesterol, RDW for red blood cell distribution width, RBC for red blood cell counts, MCV for mean corpuscular volume, ALT for alanine transaminase, MCHC for mean corpuscular hemoglobin concentration.

Figure 3 Confusion matrices. (A) Confusion matrices for the best-performing smoking status classifier, trained on 23 features, in number of samples (left) and percentage (right). Row values show predicted smoking status, and columns show actual smoking status. Most of the erroneous smoking status predictions occurred in individuals older than 55 years. (B) Confusion matrices for age prediction by age groups for the best model, trained on 24 parameters, in number of samples (left) and percentage (right). Row values show actual chronological age group, and columns show predicted age group. Smokers in the age groups < 30 and 30–40 were mostly predicted to be older.

To further evaluate the importance of smoking status in age prediction, we included smoking status as an input feature along with blood test values and trained a new set of DNNs on the three extended sets of input features. Smokers were included in the training set for this round. To robustly compare the performance of these models with models trained on nonsmokers, we used the same number of samples in the training sets. The best-performing deep neural network, which was trained on 24 input features (including smoking status), performed better than the model trained on 23 input features (without smoking status) and achieved an R2 of 0.60 and an MAE of 5.42 years (Fig. 2A, Table 1). Deep neural networks trained on 21 and 19 input features also exhibited higher age-prediction accuracy than the models trained on 20 and 18 blood test input features, respectively (Supplementary Figs S5A and S5B, Table 1). These results suggest that smoking status plays an important role in predicting age. However, this feature was not among the five most important features (Supplementary Figs S5C, S5D and S5E). To evaluate the dependence between age prediction as a target function and smoking status, we conducted a partial dependence analysis, which confirmed that predicted age increases with a smoking status of 1 (smokers) (Supplementary Figs 7–9). The same analysis of sex as an input feature showed that predicted age increases slightly with a sex of 1 (male) (Supplementary Fig. 9).
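The idea behind the partial dependence analysis can be sketched as follows: a feature is forced to a fixed value for every sample, and the model's predictions are averaged. The `toy_model` function below is a hypothetical stand-in for a trained DNN, and the feature names and coefficients are assumptions made purely for illustration.

```python
def partial_dependence(predict, rows, feature, values):
    """Average prediction when `feature` is set to each value for all rows."""
    result = {}
    for v in values:
        # Copy each row with the feature overridden; other features keep their observed values
        preds = [predict({**row, feature: v}) for row in rows]
        result[v] = sum(preds) / len(preds)
    return result

def toy_model(row):
    """Hypothetical age predictor, NOT the study's network."""
    return 20.0 + 5.0 * row["glucose"] + 3.0 * row["smoker"]

# Hypothetical samples
samples = [
    {"glucose": 5.0, "smoker": 0},
    {"glucose": 6.0, "smoker": 1},
]
pd_smoking = partial_dependence(toy_model, samples, "smoker", [0, 1])
```

In this toy setting, the difference between the two averaged predictions recovers the effect that the stand-in model assigns to smoking status; with a real network the same procedure summarizes how predicted age shifts between nonsmokers (0) and smokers (1).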

Table 1 Prediction accuracy of the three top-performing models after rounds of optimization.

Deep-learned biochemistry clocks as biomarkers of lifestyle

To explore whether the smoking status of patients could be assessed using only their sex and blood test values, we trained three DNNs on the same input feature sets used in the prior models to classify smokers and nonsmokers. The best-performing smoking status classifier, which was trained on 23 blood test input features, achieved an accuracy of 0.83 and an F1 score of 0.67, followed in descending order by the model trained on 20 blood test input features and the model trained on 18 blood test input features (Fig. 2D, Supplementary Figs 6A,B, Table 1). High-density lipoprotein (HDL) cholesterol, hemoglobin, RDW, and mean corpuscular volume (MCV) were consistently the most important factors in determining a patient’s smoking status (Fig. 2C, Supplementary Fig. 4C,D).
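Accuracy and the F1 score can both be derived from the four cells of a binary confusion matrix, as in the minimal sketch below. It uses the standard definitions with smokers as the positive class (an assumption), and the counts are hypothetical rather than taken from Fig. 3A.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: smokers are the positive (minority) class
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
```

Accuracy counts correct predictions in both classes, whereas F1 ignores true negatives, so with fewer smokers than nonsmokers a classifier can reach a high accuracy while its F1 score remains noticeably lower.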

Curiously, most of the false-positive and false-negative smoking status predictions occurred in individuals older than 55 years (Fig. 3A). This observation was consistent with the increased error rate that accompanied predictions of the ages of smokers and nonsmokers who were chronologically younger than 40 years. Furthermore, the majority of smoker samples for individuals younger than 30 years were predicted to be within the range of 31–40 years (35%) and 41–50 years (36%), whereas the ages of most of the nonsmokers (62%) were predicted correctly (Fig. 3B). The same trend was observed for the 31–40 age group, in which the ages of 43% of the smokers were predicted to be 41–50, and only 23.43% of nonsmokers were predicted to fall within the 31–40 age group. This trend was not observed in subjects older than 51 years and was therefore consistent with the observation made above.

Cardiovascular disease risk and smoking status

To assess cardiovascular risk, we examined the cholesterol ratio, calculated by dividing total cholesterol by HDL cholesterol (cholesterol ratio = total cholesterol/HDL cholesterol). We classified the blood samples into four groups based on their cholesterol ratios and fasting glucose levels, using the following reference ranges: (1) cholesterol ratio > 4 and fasting glucose > 5 mmol/L; (2) cholesterol ratio > 4 and fasting glucose ≤ 5 mmol/L; (3) cholesterol ratio ≤ 4 and fasting glucose > 5 mmol/L; and (4) cholesterol ratio ≤ 4 and fasting glucose ≤ 5 mmol/L. As shown in Fig. 4, smokers had a higher log2 aging ratio than did nonsmokers regardless of their cholesterol ratio and fasting glucose levels. On average, female smokers were predicted to be twice as old as their chronological age compared with nonsmokers. Male smokers, on average, were predicted to be one and a half times as old as their chronological age compared with nonsmokers. However, females with cholesterol ratio > 4 and fasting glucose < 5 mmol/L tended to be predicted as being older. Interestingly, our results also suggest that smokers from the age groups 60–70 years and >70 years with a normal glucose level (<5 mmol/L) are predicted to be younger than their chronological age. This phenomenon is not observed in smokers with a high blood glucose level.
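The grouping can be sketched as a small function. It assumes that the four groups are the 2 × 2 combinations of the two thresholds (cholesterol ratio above or at most 4, fasting glucose above or at most 5 mmol/L), with group numbers following the order in the text; the function name and input values are hypothetical.

```python
def risk_group(total_cholesterol, hdl_cholesterol, fasting_glucose):
    """Return the group number (1-4) for one sample; lipid and glucose values in mmol/L."""
    ratio = total_cholesterol / hdl_cholesterol   # cholesterol ratio = total / HDL
    high_ratio = ratio > 4
    high_glucose = fasting_glucose > 5
    if high_ratio and high_glucose:
        return 1
    if high_ratio:
        return 2      # ratio > 4, glucose <= 5
    if high_glucose:
        return 3      # ratio <= 4, glucose > 5
    return 4          # ratio <= 4, glucose <= 5 (assumed fourth group)

# Hypothetical samples
groups = [
    risk_group(6.0, 1.2, 5.5),
    risk_group(6.0, 1.2, 4.8),
    risk_group(4.0, 1.6, 5.5),
    risk_group(4.0, 1.6, 4.8),
]
```

Once each sample carries a group number, the log2 aging ratios of smokers and nonsmokers can be averaged within each group and compared.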