We used a cohort of over 80,000 patients from the CALIBER programme [26]. CALIBER links four sources of electronic health data in England: primary care health records (coded diagnoses, clinical measurements, and prescriptions) from 244 general practices contributing to the Clinical Practice Research Datalink (CPRD); coded hospital discharges (Hospital Episode Statistics, HES); the Myocardial Ischaemia National Audit Project (MINAP); and death registrations from the Office for National Statistics (ONS). CALIBER includes about 4% of the population of England [27] and is representative in terms of age, sex, ethnicity, and mortality [26].

Diagnoses were identified in CPRD, HES, or MINAP records according to definitions in the CALIBER data portal (www.caliberresearch.org/portal). Stable angina was defined by angina diagnoses in CPRD (Read codes) and HES (ICD-10 codes), repeat prescriptions for nitrates, coronary revascularisation (Read codes in CPRD or OPCS-4 codes in HES), or ischaemia test results (CPRD). Acute coronary syndromes were defined by records in MINAP or by diagnoses in CPRD and HES.

Patients were eligible to enter the cohort after being registered with the general practice for a year. If they had pre-existing coronary artery disease (myocardial infarction [MI], unstable angina, or stable angina), they entered the cohort on their eligibility date; otherwise they entered on the date of the first stable angina diagnosis after eligibility, or six months after the first acute coronary syndrome diagnosis (MI or unstable angina) after eligibility. The six-month delay was chosen to differentiate long-term prognosis from the high-risk period that typically follows acute coronary syndromes. For patients whose index diagnosis was MI, we used information in CPRD and MINAP to attempt to classify the type of MI as ST-elevation myocardial infarction (STEMI) or non–ST-elevation myocardial infarction (NSTEMI).

The primary endpoint was death from any cause. Patients were followed up until they died (as identified in ONS or CPRD) or transferred out of practice, or until the last data collection date of the practice.

We also generated an extended set of predictor variables for the data-driven models, derived from rich data prior to the index date in CALIBER: primary care (CPRD) tables of coded diagnoses, clinical measurements and prescriptions; and hospital care (HES) tables of coded diagnoses and procedures. From each of these sources, we generated new binary variables for the presence or absence of a coded diagnosis or procedure, and numeric variables for laboratory or clinical measurements. We selected the 100 variables with the fewest missing values from each source table, and the 100 least-missing variables from each of the clinical history and measurement tables in the primary care data. After combining with the pre-selected predictors and removing duplicate and erroneous variables, there were 586 variables per patient in the extended dataset.

Expert-selected predictors [28] were extracted from primary care (CPRD) and coded hospital discharges (HES). These included cardiovascular risk factors (e.g. hypertension, diabetes, smoking, lipid profile), laboratory values, pre-existing diagnoses and prescribed medication. Small-area index of multiple deprivation (IMD) score was derived from the patient’s postcode. Table 1 shows all the expert-selected predictors used.

Approval was granted by the Independent Scientific Advisory Committee (ISAC) of the Medicines and Healthcare Products Regulatory Agency (protocol 14_107). Approval from ISAC is equivalent to approval from an institutional review board. All data were fully, irreversibly anonymised before access was provided to the authors.

Statistical methods

Imputation of missing data. Multiple imputation was implemented using multivariate imputation by chained equations in the R package mice [29], using the full dataset of 115,305 patients. Imputation models included the following variables. Covariates at baseline comprised all of the expert-selected predictors and additional variables: age, quadratic age, diabetes, smoking, systolic blood pressure, diastolic blood pressure, total cholesterol, HDL cholesterol, body mass index, serum creatinine, haemoglobin, total white blood cell count, CABG or PCI surgery in the six months prior to study entry, abdominal aortic aneurysm prior to study entry, index of multiple deprivation, ethnicity, hypertension diagnosis or medication prior to study entry, use of long-acting nitrates prior to study entry, diabetes diagnosis prior to study entry, peripheral arterial disease prior to study entry, and history of myocardial infarction, depression, anxiety disorder, cancer, renal disease, liver disease, chronic obstructive pulmonary disease, atrial fibrillation, or stroke.

The models also included prior (between 1 and 2 years before study entry) and post (between 0 and 2 years after study entry) averages of continuous expert-selected covariates and other measurements: haemoglobin A1c (HbA1c), estimated glomerular filtration rate (eGFR) [30], lymphocyte counts, neutrophil counts, eosinophil counts, monocyte counts, basophil counts, platelet counts, and pulse pressure.

Finally, the models included the Nelson–Aalen cumulative hazard estimate and the event status for all-cause mortality. Since many of the continuous variables were non-normally distributed, all continuous variables were log-transformed for imputation and exponentiated back to their original scale for analysis. Imputed values were estimated separately for men and women, and five multiply imputed datasets were generated. We verified that the distributions of observed and imputed values of all variables were similar.
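As an illustration only (not the authors' exact code), the imputation step might be sketched as follows with mice; the column names (time_to_death, death, sex, and the laboratory variables) are hypothetical, and mice::nelsonaalen() supplies the cumulative hazard estimate included as a predictor.

```r
library(mice)

# Log-transform skewed continuous variables before imputation (hypothetical column names)
continuous_vars <- c("total_cholesterol", "hdl_cholesterol", "serum_creatinine")
cohort[continuous_vars] <- lapply(cohort[continuous_vars], log)

# Nelson-Aalen cumulative hazard and event status included as predictors in the imputation models
cohort$cumhaz <- nelsonaalen(cohort, time_to_death, death)

# Impute men and women separately, generating five imputed datasets each
imputed_by_sex <- lapply(split(cohort, cohort$sex), function(d) {
  mice(d, m = 5, method = "pmm", seed = 1)
})

# Extract one completed dataset and back-transform to the original scale
completed <- complete(imputed_by_sex[[1]], action = 1)
completed[continuous_vars] <- lapply(completed[continuous_vars], exp)
```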

Training, validation and test sets. Before building prognostic models for all-cause mortality, we randomly split the cohort into a training set and a held-out test set, which was used to evaluate performance of the final models. Performance statistics were calculated on repeated bootstrap samples from this held-out test set in order to estimate their variability. Where cross-validation was performed (other than for elastic net regression), the training set was randomly split into three folds; the model was trained on two of the folds and its performance validated against the third to establish optimal parameters, before those parameters were used to build a model on the full training set.
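A sketch of how performance variability could be estimated from the held-out test set by bootstrapping, assuming a fitted Cox model `fit` and hypothetical column names time_to_death and death:

```r
library(survival)

set.seed(42)
n_boot <- 200

boot_cindex <- replicate(n_boot, {
  idx  <- sample(nrow(test), replace = TRUE)          # resample the test set with replacement
  boot <- test[idx, ]
  lp   <- predict(fit, newdata = boot, type = "lp")   # linear predictor: higher = higher risk
  concordance(Surv(time_to_death, death) ~ lp, data = boot, reverse = TRUE)$concordance
})

quantile(boot_cindex, c(0.025, 0.5, 0.975))            # median C-index and 95% interval
```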

Continuous Cox proportional hazards models. Cox proportional hazards models are survival models which assume all patients share a common baseline hazard function, which is multiplied by a factor based on the values of various predictor variables for an individual. Missing values in continuous variables were accommodated both by imputation and, separately, by explicit inclusion in the model. The latter was done by first setting the missing value to 0, such that it did not contribute to the patient’s risk, and then employing a missingness indicator with an independent coefficient, meaning an additional binary dummy variable to account for whether a value is missing or not [8, 14]. For example, the hazard as a function of time, λ(t), in a Cox model with a single continuous variable with some missingness would be expressed as

λ(t) = λ_0(t) exp(β_1 x_1 + β_missing x_missing)   (1)

where λ_0(t) is the baseline hazard function; β_1 is the coefficient relating to the continuous variable; x_1 is the variable’s value (standardised if necessary), set to 0 if the value is missing; β_missing is the coefficient for that value being missing; and x_missing is 0 if the value is present, and 1 if it is missing.
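A minimal sketch of this missing-indicator approach for a single continuous predictor x1, assuming hypothetical survival columns time and status:

```r
library(survival)

# Indicator is 1 where the value is missing, 0 otherwise
df$x1_missing <- as.integer(is.na(df$x1))

# Set missing values to 0 so they contribute nothing to the linear predictor
df$x1[is.na(df$x1)] <- 0

# beta_1 is fitted from the observed values; beta_missing captures the effect of missingness itself
fit <- coxph(Surv(time, status) ~ x1 + x1_missing, data = df)
summary(fit)
```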

Missing categorical data. Missingness in categorical data can be dealt with very simply, by adding an extra category for ‘missing’. This is compatible with both Cox models (in which variables with n categories are parameterised as n − 1 dummy variables) and random forests (which can include categorical variables as-is).

Discrete Cox models for missing values. Another method to incorporate missing data is to discretise all continuous predictor variables, and include missing data as an additional category [8, 14]. The choice of category boundaries then becomes an additional hyperparameter, and discretisation schemes were chosen by cross-validation to avoid overfitting: discretising a variable too finely may allow a model to fit the training data with spurious precision and generalise poorly to new data, whilst doing so too coarsely risks losing information carried by that variable. Every continuous variable i was split into 10 ≤ n_i ≤ 20 bins, which were assigned by quantiles; for example, a 10-bin discretisation would split patients into deciles. An additional category was then assigned for missing values, resulting in n_i + 1 coefficients per variable when fitting the Cox model.
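A sketch of quantile discretisation with an explicit ‘missing’ level (which also illustrates the extra-category handling of categorical missingness described above); the variable name and bin count are placeholders:

```r
# Discretise a continuous variable into n quantile bins, with 'missing' as an extra level
discretise <- function(x, n_bins = 10) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE)
  binned <- cut(x, breaks = unique(breaks), include.lowest = TRUE)
  levels(binned) <- c(levels(binned), "missing")
  binned[is.na(binned)] <- "missing"
  binned
}

df$creatinine_cat <- discretise(df$serum_creatinine, n_bins = 10)  # hypothetical variable
table(df$creatinine_cat, useNA = "ifany")
```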

Random survival forests. Random survival forests [19–21] are an alternative method for survival analysis which has previously been used to model deaths in the context of cardiovascular disease [31]. It is a machine-learning technique which builds a ‘forest’ of decision trees, each of which predicts patient outcomes by splitting patients into groups with similar characteristics. These hundreds to thousands of decision trees each have random imperfections, meaning that whilst individual trees are relatively poor predictors, the averaged result from the forest is more accurate and less prone to overfitting than an individual ‘perfect’ decision tree [32].

At each node in a decision tree, starting at its root, patients are split into two branches by looking at an individual covariate. The algorithm selects a split point which maximises the difference between the survival curves of patients in the two branches defined by that split. For a continuous parameter, this is a value where patients above are taken down one branch, and patients below are taken down the other; for a categorical variable, a subset of values is associated with each branch. Split points are defined so as to maximise the homogeneity within each branch, and the inhomogeneity between them. Decision trees for regression or classification often use variance or Gini impurity, respectively. Survival forests, by contrast, maximise the difference between survival curves as measured by the logrank test [19], or optimise the C-index using which branch a patient falls into as a predictor [21]. After testing, we found splitting on the C-index to be too computationally intensive for negligible performance benefit, so logrank tests were used. This process is repeated until the leaves of the tree are reached: nodes where no further criteria remain by which the patients at that leaf can be distinguished (possibly because only a single patient remains), or where splitting has been stopped early by a minimum node size or maximum tree depth.

In a random forest, each tree is created using only a random sample with replacement of the training data, and only a subset of parameters is considered when splitting at each node. A ‘forest’ comprising such trees can then use the average, majority vote or combined survival curve, for regression, classification and survival forests, respectively, to aggregate the results of the individual trees and predict results for new data. Random survival forests are thus, in principle, able to model arbitrarily shaped survival curves without assumptions such as proportional hazards, and can fit nonlinear responses to covariates, and arbitrary interactions between them.

One issue with random forests is that variables with a large number of different values offer many possible split points, which gives them more chances to provide the optimal split. One approach to combat this is to use multiple-comparisons corrections to account for the resulting increased likelihood of finding a split point in variables where there are many options [33, 34], but this is complicated by split statistics for adjacent split points being correlated. To avoid this issue, we instead used a value n_split to fix the number of split points tested in continuous variables, which also has the advantage of reducing computational time for model building [35].

Many methods have been proposed for dealing with missing data in random forests, and these primarily divide into two categories.
Some methods effectively perform on-the-fly imputation: for example the surrogate variable method, proposed in the classic classification and regression trees (CART) decision tree algorithm, attempts to use nonmissing values to find the variable which splits the data most similarly to the variable initially selected for the split [36]. Other methods allow missing values to continue through the tree, for example by assigning them at random to a branch weighted by the ratio of nonmissing values between branches, known as adaptive tree imputation [19], or sending missing cases down both branches, but with their weights diminished by the ratio of nonmissing values between branches, as used in the C4.5 algorithm [37]. We used adaptive tree imputation, with multiple iterations of the imputation algorithm due to the large fraction of missing data in many variables used [19]. The number of iterations of the imputation process, number of trees, number of split points (n_split), and number of variables tested at each split point (m_try) were selected with cross-validation.
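A sketch of fitting such a forest with the randomForestSRC package, which implements logrank splitting and adaptive tree imputation; the hyperparameter values shown are placeholders standing in for the cross-validated choices, and the outcome column names are hypothetical:

```r
library(survival)
library(randomForestSRC)

# Random survival forest with logrank splitting and adaptive tree imputation of missing values
rsf <- rfsrc(
  Surv(time_to_death, death) ~ .,   # hypothetical outcome columns
  data      = train,
  ntree     = 1000,                 # number of trees in the forest
  mtry      = 20,                   # variables considered at each split
  nsplit    = 10,                   # random split points tested per continuous variable
  splitrule = "logrank",
  na.action = "na.impute",          # adaptive tree imputation of missing data
  nimpute   = 3                     # iterations of the imputation algorithm
)

# Out-of-bag error (1 - C-index) after the final tree, and predictions on new data
rsf$err.rate[rsf$ntree]
pred <- predict(rsf, newdata = test, na.action = "na.impute")
```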

Variable selection. Performance can be improved by variable selection to isolate the most relevant predictors before fitting the final model. We evaluated several methods for variable selection: ranking by completeness (i.e. retaining those variables with the fewest missing values); by permutation variable importance (which measures a variable’s contribution to model performance by randomly shuffling its values and observing the decrease in C-index), similar to varSelRF [38, 39]; and by effect on survival, as measured by a logrank test between survival curves for different values of the variable (true vs false for binary variables, categories for categorical variables, or quartiles for continuous variables). All of these rankings are inherently univariate, and may neglect variables which have a large effect only in combination with others [15, 40]. After ranking, we fitted models to decreasing numbers of the highest-ranked variables and cross-validated to find the optimal number based on the C-index. After selection, the contribution of variables to the final model was assessed by permutation variable importance. There are a number of measures of variable importance in common use [41, 42], often using metrics which are specific to random forests, such as the number of times a variable is split upon, or the purity of nodes following such a split. Permutation variable importance is more readily generalised: it is given by the reduction in performance when values of a particular variable are shuffled randomly within the dataset, thus removing their power to aid predictions; the larger the reduction, the more important the variable is considered to be. This allows it to be used to assess the importance of variables in other types of model, such as Cox models, and it was therefore used in this study to allow comparisons across different types of model.
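Permutation importance is model-agnostic; a minimal sketch for a fitted Cox model is shown below (column names are hypothetical): shuffle one variable in the test data, recompute the C-index, and report the average drop.

```r
library(survival)

permutation_importance <- function(fit, data, var, n_perm = 20) {
  cindex <- function(d) {
    lp <- predict(fit, newdata = d, type = "lp")
    concordance(Surv(time_to_death, death) ~ lp, data = d, reverse = TRUE)$concordance
  }
  baseline <- cindex(data)
  drops <- replicate(n_perm, {
    shuffled <- data
    shuffled[[var]] <- sample(shuffled[[var]])   # break the association for this variable
    baseline - cindex(shuffled)                  # positive drop = variable was informative
  })
  mean(drops)
}

permutation_importance(fit, test, "systolic_bp")   # hypothetical variable name
```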

Elastic net regression. Elastic net regression extends regression techniques including Cox models to penalise complex models and thus implicitly select variables during the fitting procedure [40, 43]. It works by minimising the regularised error function

E(β) = ‖y − Xβ‖² + λ[(1 − α)‖β‖_1 + α‖β‖_2²]   (2)

where y is the response variable, X are the input data, β is a vector of coefficients, 0 ≤ α ≤ 1 is a hyperparameter for adjusting the weight of the two different regularisation components, and λ is a hyperparameter adjusting the overall magnitude of regularisation. The hyperparameter α linearly interpolates between a LASSO (least absolute shrinkage and selection operator) and ridge regression: a value α = 0 corresponds to pure LASSO, whilst α = 1 corresponds to pure ridge regression, with intermediate values representing a mixture of the two. The value of λ is determined to some extent by the scale of the data, and so normalisation between parameters is important before fitting. In our case, values were either binary true/false when referring to the presence or absence of particular codes in a patient’s medical history, or represented discretised continuous values with an additional ‘missing’ category, binarised into a one-hot vector for fitting. Thus, since all values are either 0 or 1, no normalisation was undertaken before performing the elastic net procedure. Hyperparameters α and λ were chosen by grid search with ten-fold cross-validation over the training set. Having identified the optimal α, this value was used to fit models on bootstrapped samples of the training set to extract model parameters along with an estimate of their variation due to sampling.
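A sketch of the elastic net fit using glmnet; note that glmnet parameterises the mixing weight the other way around (alpha = 1 is pure LASSO, alpha = 0 pure ridge), and the one-hot design matrix and outcome column names here are assumptions:

```r
library(glmnet)
library(survival)

y <- Surv(train$time_to_death, train$death)   # hypothetical outcome columns
x <- as.matrix(train_onehot)                  # 0/1 design matrix, no further normalisation needed

# Grid search over the mixing parameter; lambda is tuned internally by ten-fold cross-validation
alphas  <- seq(0, 1, by = 0.1)
cv_fits <- lapply(alphas, function(a) {
  cv.glmnet(x, y, family = "cox", alpha = a, nfolds = 10)
})

# Pick the alpha with the lowest cross-validated partial-likelihood deviance
best <- which.min(sapply(cv_fits, function(f) min(f$cvm)))

# Coefficients at the lambda minimising cross-validated error for the chosen alpha
coef(cv_fits[[best]], s = "lambda.min")
```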

Model performance. Model performance was assessed using both the C-index and a calibration score derived from the accuracy of five-year mortality risk predictions. The C-index measures a model’s discrimination, its ability to distinguish low-risk cases from high-risk ones, by evaluating the probability of the model correctly predicting which of two randomly selected patients will die first. It ranges from 0.5 (chance) to 1 (perfect prediction). The C-index is widely used, but its sensitivity for distinguishing between models of different performance can be low because it is a rank-based measure [44]. In particular, the C-index is unable to assess the calibration of a model, which reflects the accuracy of absolute risk predictions made for individual patients. A well-calibrated model is particularly important in the context of a clinical risk score, where a treatment may be assigned based on whether a patient exceeds a given risk threshold [45]. Calibration was assessed by plotting patient outcome (which is binary: dead or alive, with patients censored before the date of interest excluded) against the mortality risk predicted by the model [44]. A locally smoothed regression line was then calculated, and the area, a, between the regression line and the ideal line y = x provides a measure of model calibration. Since a smaller area is better, we define the calibration score as (1 − a), such that better performance results in a higher score. This means that, like the C-index, it ranges from 0.5 (very poor) to 1 (perfect calibration).
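A sketch of the calibration score, assuming a vector of predicted five-year mortality risks and a 0/1 outcome for patients not censored before five years; the area is approximated numerically over the range of the predictions rather than with the authors' exact smoothing choices:

```r
# predicted: model's five-year mortality risk; observed: 0/1 death within five years
# (patients censored before five years are assumed to have been removed)
calibration_score <- function(predicted, observed) {
  smooth <- loess(observed ~ predicted)                    # locally smoothed calibration curve
  grid   <- seq(min(predicted), max(predicted), length.out = 200)
  fitted <- predict(smooth, newdata = data.frame(predicted = grid))
  gaps   <- abs(fitted - grid)                             # distance from the ideal line y = x
  a      <- mean(gaps, na.rm = TRUE) * diff(range(grid))   # approximate area between the curves
  1 - a                                                    # higher is better
}

calibration_score(pred_5yr_risk, died_within_5yr)          # hypothetical vectors
```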