Data

Data was acquired from a prospective observational study conducted by OAI. The dataset followed 4,796 patients and acquired images including 2D posteroanterior radiographs and 3D Sagittal Double Echo Steady-State (DESS) MRI images over the course of 10 years. Details of data collection and study design have been previously reported30. The OAI study protocol was approved by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) and is registered on ClinicalTrials.gov as “Osteoarthritis Initiative (OAI): A Knee Health Study”, NCT#00080171. The study was carried out in accordance with all pertinent guidelines and regulations, and written and informed consent was obtained from participants prior to each clinical visit in the study.

Both posteroanterior radiographs and DESS MRI images were evaluated as data sources for TKR prediction models. Patients for whom KL grade was not recorded at any point in the longitudinal study were excluded. To homogenize datasets, radiograph and MRI images were only taken from patients and time points at which both were available (n = 35,482). We labeled entries as cases if the patient underwent a first TKR within 5 years of the given time point (n = 1,043). We labeled entries as controls if patients did not undergo a TKR or eventually underwent one but the time to it was longer than 5 years (n = 34,439). Contralateral TKRs were not considered.

The radiographs and MRI images were preprocessed for training and model evaluation. Radiographs were cropped to a 500 × 500 region centered around the knee joint. Briefly, 2D cross-correlation template matching was used to identify a 500 × 500 bounding box centered around the knee joint in 450 joints, and these cases were used to train a U-Net architecture that identified this region for all posteroanterior radiographs from the OAI study32. DESS MRIs were center-cropped to a 120 × 320 × 320 region, after which both sets of cropped images were normalized. Normalized MRI pixel values were then rounded to nearest integers, compressing the MRI image to 14 possible pixel values. This rounding approach was initially tested as a strategy to accelerate training of a 3D CNN, given the large imaging volumes and large dataset on which it was being trained, believing the approach could suppress information extraneous to eventual TKR. Empirically, this approach yielded superior validation performance to leaving pixel values unrounded, so it was utilized. Examples of the results of this compression strategy are in Supplementary Fig. S1.

Non-imaging variables were screened for among studies and reviews detailing risk factors for knee OA progression and TKR onset21,22,23,34,35,36,37. Variables such as KL grade known to be deducible directly from MRI images and radiographs were not considered. From these studies, 40 non-imaging variables of interest were identified (see Supplementary Table S1). The OAI database was then parsed for corresponding variables, and these corresponding variables were added as potential non-imaging variables for our study, yielding 44 potential non-imaging variables. In some cases, multiple OAI metrics corresponded to non-imaging variables of interest, causing the number of OAI non-imaging variables to exceed what was identified from literature. Missing data points were imputed with k-nearest neighbors. These potential variables were used to train a random forest with 100 trees to predict onset of TKR within 5 years, and the minimum depth at which each feature was used across all trees in the forest was identified. Features whose minimum depth was below the average minimum depth of all features were preserved as non-imaging variables38. This yielded 27 non-imaging variables that are displayed in Table 1.

Table 1 List of non-imaging variables fed into logistic regression models to make predictions of whether a patient would undergo TKR within 5 years. Abbreviations used: Body Mass Index (BMI), Nonsteroidal Anti-inflammatory drugs (NSAIDS), Blood Pressure (BP), Physical Activity Scale for the Elderly (PASE), Knee Injury and Osteoarthritis Outcome Score (KOOS), Quality of Life (QOL), Western Ontario and McMaster Universities Arthritis Index (WOMAC), Short Form 12 (SF-12). Full size table

The data were then split into training, validation, and test with a 65%/20%/15% split, ensuring entries of any patient were only in one of the three datasets to prevent data leakage. Within the training set, imbalance between TKR and non-TKR cases was addressed with data augmentation, drawing bootstrap samples from the rare class with replacement39. A summary of the data prior to augmentation is provided in Table 2, detailing the number of cases and controls while showing descriptive statistics regarding demographics in each of the three datasets.

Table 2 Data used to train 3D DESS MRI and 2D radiograph architectures. After exclusion criteria were applied, 35,482 qualifying entries were found in the OAI dataset across 4,790 unique patients, all of which were split into training, validation, and test sets as displayed in table. To prevent data leakage, all entries from any given patient were only allowed to be in one of the three sets. S.d. is reported for age, BMI, and KOOS pain score within the table. Full size table

Pipeline architecture

The DL-based pipeline is based on a DenseNet-121 with the following parameters: 16 filters in initial layer, growth rate of 32, pooling block configuration of [6, 12, 24, 16], 4 bottleneck layers, 2 classes. The same architecture was used for the radiograph and MRI pipelines, but for the MRI pipeline, we modified the convolutional layers, batch normalization layers, pooling layers, and leaky rectified linear unit (ReLU) layers to allow for 3D image input40. The network yielded a scalar reflecting certainty of TKR within 5 years, which was added to the non-imaging variables. The 28 resulting variables were fed into one of three sets of Logistic Regression (LR) ensembles, with each ensemble optimized to maximize sensitivity and specificity in cases of no (KL = 0, 1), moderate (KL = 2, 3), and severe OA (KL = 4). Based on the KL grade of a sample, it was fed into an LR ensemble, yielding a prediction as to whether the patient will undergo a TKR within 5 years.

Training

A DenseNet-121 was initially pretrained to predict knee OA using the entire training set, assessing cross-entropy loss and accuracy on the validation set after completion of each epoch. The pre-train was stopped when validation loss began to increase. The pretrained model was subsequently fine-tuned to predict TKR. We utilized a random search to determine optimal learning rate, dropout rate, weights of the cross-entropy loss function, and number of layers to freeze during fine-tuning. The search was carried out for 25 iterations, after which a set of parameters were selected that yielded the best combination of accuracy, sensitivity, and specificity on the validation set. Due to computational intensity, the hyperparameter search was not conducted on the entire dataset: for the 2D DenseNet-121, 10% of training and validation sets were used, whereas for the 3D DenseNet-121, 2.5% of both were used. After the search, the model fine-tuned using the subset of the training set was further fine-tuned on the entire training set using optimal parameters until validation loss began to increase. The test set was held out during training and predictions for it evaluated just once after fine-tuning, which marked the end of model optimization.

Integration of imaging and non-imaging data

Random forest regression, support vector machine, neural network, and LR architectures were assessed for efficacy of integrating imaging and non-imaging predictions, with LR providing best results on validation data. The LR architecture was thus used: all 28 imaging and non-imaging models were fed into an LR model, the optimal parameters of which were also identified through a random search. The search was conducted for 100 iterations, seeking to optimize the cross-entropy loss function weights afforded to both classes. For the cases of no, moderate, and severe OA, ideal parameters were identified by selecting those that maximized Youden’s index within each OA classification in the search41. Predictions of the best few models in each classification were averaged to yield final TKR predictions. The number of predictions averaged in each classification was selected by finding a value that optimized validation accuracy, AUC, and Youden’s index. The resulting LR models were ensembled and run on test data just once. Confidence intervals of accuracy, sensitivity, and specificity for each OA severity were obtained by bootstrapping, sampling 100% of test data with replacement (B = 100). Confidence intervals for AUC were calculated in the same manner. Results are reported on 3 versions of each model: the sole DenseNet-121 output (image only), output of a single LR model trained to predict TKR using solely the 27 non-imaging variables while not weighting the loss function class weights (non-imaging info. only), and output of the LR ensemble with image predictions (integrated model).

Statistical analysis

The accuracies of X-ray and MRI pipeline performances within each OA classification and overall were compared using McNemar’s test42,43. This test was appropriate because it specifically tests for differences in a dichotomous variable in matched groups. In our case, the variable was correct TKR prediction and the groups were the X-ray and MRI pipelines. Initially, the McNemar test statistic was modeled with a chi-squared distribution to test for significant differences between the pipelines, and if one existed, a binomial distribution was used to interrogate which pipeline yielded the significantly higher performance. All tests were carried out at α = 0.05.

Relative sensitivity and specificity of the X-ray and MRI pipelines were assessed by comparing their AUCs within each OA classification and overall. This test is appropriate because the ROC curve plots true positive rate (sensitivity) against false positive rate (1 – specificity); consequently, the closer the AUC is to 1, the better the combination of sensitivity and specificity. 100% of test data was sampled with replacement (B = 100), and for each corresponding pair of X-ray and MRI pipelines (matched by OA classification and use of images only or both image and non-image information), AUCs were calculated. To test if one outperformed the other, differences in AUCs were calculated at each iteration, and the mean and standard deviation of the differences used to conduct a student’s t-test with 99 degrees of freedom. This test is applicable on each matched pair of X-ray and MRI pipelines due to the number of iterations for which test data was sampled, allowing the central limit theorem to apply. For confidence intervals, mean and standard deviation of AUCs of individual models were calculated and used to report 95% intervals.

Imaging biomarker identification

For all 124 true positives in the test data for the integrated MRI pipeline, corresponding controls were identified by randomly sampling from test data true negatives, keeping OA status distributions identical and using a student’s t-test with 123 degrees of freedom to ensure no significant difference in KOOS pain scores across cases and corresponding controls at α = 0.05. Occlusion maps were generated for all cases and controls using voxel size of 12 × 32 × 32 and stride of 12. For each pixel, the value displayed represented the magnitude of change in the scalar pipeline output resulting when that pixel was occluded, averaged across all occlusions in which that pixel existed. Pixels for which scalar pipeline output change lied in the top 5% were designated as “hotspots.” Anatomic regions of these hotspots were identified and odds ratios (OR) calculated to interrogate possible imaging biomarkers of TKR. 95% OR confidence intervals were calculated for each anatomic region investigated in this analysis using Cornfield’s method, as this method performs well with relatively small sample sizes44. P values of ORs were calculated using a two-tailed Fisher’s exact test45. Tissues where p values fell below the significance level of α = 0.05 and in which 95% OR confidence intervals did not include 1 were deemed significant. These test selections were appropriate, as they allowed for direct comparison of the frequencies at which several tissues were hotspots across cases and controls, and as such, identified significant tissues with regards to TKR onset.