Clinical information

Taiwan’s National Health Insurance launched the Integrated Care of CKD project in 2002. China Medical University Hospital (CMUH), a tertiary medical center in Central Taiwan, joined this program in 2003, prospectively enrolling consecutive patients with CKD willing to participate.41 CKD diagnosis was based on the criteria of the National Kidney Foundation’s Kidney Disease Outcomes Quality Initiative’s Clinical Practice Guidelines for CKD.41,43 The patients in this program were regularly followed at the outpatient department; they routinely underwent at least one kidney sonographic study. In Taiwan, almost all kidney sonographic studies are performed and interpreted by nephrologists. Biochemical markers of renal injury, including serum creatinine and blood urea nitrogen levels as well as the spot urine protein-to-creatinine ratio, were measured every 12 weeks or more frequently. Since 2003, CMUH has implemented electronic medical records (EMRs) for care management; therefore, we integrated the data of CMUH’s pre-ESRD program with CMUH’s EMRs containing laboratory test results, medications, special procedures, medical images, and admission records.44 We initially enrolled 8,281 CMUH pre-ESRD patients aged 20–89 years, with a total of 203,353 sonographic images; their eGFR was measured within 4 weeks before or after the day of the kidney sonography. The study was approved with waived informed consent by the Research Ethical Committee/Institutional Review Board of the China Medical University Hospital in Taiwan (Approval nos.: CMUH105-REC3-068 and CMUH106-REC3-118).

The eGFR was estimated using the abbreviated MDRD equation (eGFR = 186 × creatinine^−1.154 × age^−0.203 × 1.212 [if Black] × 0.742 [if female]).45 The serum creatinine level closest to, and within 4 weeks before or after, the day of the kidney sonography was used to define the labeled eGFR. The sociodemographic variables collected during the enrollment interview were age, sex, education, cigarette smoking status, and alcohol consumption. Diabetes mellitus and hypertension were defined by the physicians’ clinical diagnoses, based on the patients’ International Classification of Diseases codes, and by the use of glucose-lowering or blood pressure-lowering agents. History of cardiovascular disease was defined as documented coronary artery disease, myocardial infarction, or stroke in the EMRs.
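As a concrete illustration, the abbreviated MDRD calculation can be sketched in Python (the function name and argument conventions below are ours, not from the study code; serum creatinine is assumed to be in mg/dl):

```python
# Sketch of the abbreviated 4-variable MDRD equation used to label eGFR.
# Function name and defaults are illustrative, not from the study's code.
def mdrd_egfr(creatinine_mg_dl, age_years, is_female=False, is_black=False):
    """Estimated GFR in ml/min/1.73 m^2 via the abbreviated MDRD equation."""
    egfr = 186.0 * (creatinine_mg_dl ** -1.154) * (age_years ** -0.203)
    if is_black:
        egfr *= 1.212  # race coefficient from the equation as published
    if is_female:
        egfr *= 0.742  # sex coefficient
    return egfr
```

For example, a 50-year-old non-Black male with a serum creatinine of 1.0 mg/dl has an eGFR of roughly 84 ml/min/1.73 m2, i.e., above the 60 ml/min/1.73 m2 threshold used later in the paper.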

Data information

All kidney ultrasound studies were performed by board-certified nephrologists and deidentified with waived consent, in compliance with the Institutional Review Board of CMUH. We selected studies performed after 2014 on GE ultrasound systems (LOGIQ E9 and LOGIQ P3, GE Healthcare, Milwaukee, WI, USA) because of their higher image quality, in terms of sharpness, contrast, and noise, compared with images from prior systems (before 2014). The original Digital Imaging and Communications in Medicine files were then converted into Portable Network Graphics images; 37,696 images were selected from the two GE models, with two different sizes: 960 × 720 for the LOGIQ E9 and 820 × 614 for the LOGIQ P3.

In general, nephrologists determine an individual’s kidney length by obtaining images of the best possible quality to capture the maximum observable kidney length. We used the template matching technique to detect the presence of this specific annotation pattern in every image and to filter out images without length annotations measuring kidney size. We selected these high-quality images to train our deep learning model, with the final dataset containing 1,446 uniquely identifiable primary sonographic studies of 1,299 patients. Each sonographic study provided at least one image each of the right and the left kidneys. Each sonographic study also served as the primary unique input, with the final database comprising 4,505 images. The selection flow chart is presented in Supplemental Fig. 3. For the selected 4,505 images, we applied the “findContours” function of the cv2 module in Python to isolate the “bean-shaped” kidneys from irrelevant information surrounding them, such as the supplier’s logo, which could otherwise have degraded the learning accuracy of our proposed CNN.
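The template matching step can be illustrated with a minimal pure-numpy sketch of sum-of-squared-differences matching (the study presumably used an optimized implementation such as OpenCV’s `cv2.matchTemplate`; the function name and brute-force loop here are ours, for exposition only):

```python
import numpy as np

def match_template_ssd(image, template):
    """Slide `template` over `image` and return the (row, col) of the best
    match, i.e., the lowest sum of squared differences (SSD). Brute-force
    sketch of the template matching used to detect length-annotation
    markers; a production pipeline would use cv2.matchTemplate instead."""
    ih, iw = image.shape
    th, tw = template.shape
    best, best_pos = np.inf, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            ssd = np.sum((image[r:r + th, c:c + tw] - template) ** 2)
            if ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos
```

Images whose best SSD stays above a tolerance can then be filtered out as lacking the annotation pattern.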

Model Selection: prediction of continuous eGFRs through CNNs

We predicted eGFRs from patients’ kidney ultrasound images by using deep CNNs. Our neural network architecture, as illustrated in Fig. 1a, was based on the ResNet-101 model.46 In brief, the ResNet-101 model comprises a stack of residual blocks, with each block being a combination of convolution and identity-mapping layers, resulting in a total of 101 layers (Fig. 1b). To predict the patients’ eGFRs, we replaced the final 1000-class classifier in the ResNet-101 model with a regressor of consecutive fully connected layers comprising 512 (FC1), 512 (FC2), 256 (FC3), and 1 (output) units, as illustrated in Fig. 1a. Between every two consecutive fully connected layers, we employed a dropout layer to reduce overfitting, with the dropout probability determined using the grid search method.47 All layers except the output layer used rectified linear units as the activation function; the output layer adopted a linear activation function because this prediction task was a regression-type problem with output values ranging from 0 to >100. For this regression-type prediction problem, we optimized the mean squared error, defined as follows:

$${\mathrm{MSE}} = \frac{1}{n}\mathop {\sum }\limits_{i = 1}^n \left( {\hat Y_i - Y_i} \right)^2,$$

where \(\hat Y_i\) and \(Y_i\) are the predicted and actual eGFRs of sample i, respectively.
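The regression head and its MSE objective can be sketched in numpy as follows. This is a simplified forward pass only, assuming a 2048-dimensional pooled ResNet-101 feature vector; the convolutional backbone and the train-time dropout layers are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def regressor_forward(features, weights):
    """Forward pass of the regression head that replaces the 1000-class
    classifier: FC1 (512) -> FC2 (512) -> FC3 (256) -> output (1), with
    ReLU on hidden layers and a linear output unit."""
    x = features
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:      # ReLU on all layers but the output
            x = np.maximum(x, 0.0)
    return x

def mse(y_pred, y_true):
    """The optimized objective: mean squared error over n samples."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return np.mean((y_pred - y_true) ** 2)

# Illustrative random weights; 2048 is ResNet-101's pooled feature size.
dims = [2048, 512, 512, 256, 1]
weights = [(rng.standard_normal((d_in, d_out)) * 0.1, np.zeros(d_out))
           for d_in, d_out in zip(dims[:-1], dims[1:])]
```

A batch of pooled features of shape (n, 2048) thus maps to eGFR predictions of shape (n, 1), scored against the labels with `mse`.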

Motivated by the observation44 that the earlier features of a convolutional network are generally not specific to a particular task and are thus transferable to other tasks, we explored different combinations of frozen residual blocks and found that keeping the first residual block fixed achieved the minimal mean squared error over the validation dataset (Supplemental Table 7).48 The parameters of the regressor were randomly initialized using a Gaussian distribution with a mean of zero and a standard deviation of 0.1. Except for the regressor, the rest of our network weights were initialized from ResNet-101 pretrained on ImageNet.

Model selection: prediction of irreversible CKD status through extreme gradient-boosting tree

Clinically, an eGFR of <60 ml/min/1.73 m2 denotes prognostic significance of reduced kidney function. At this stage, patients must receive multidisciplinary nephrological care.49 To evaluate whether our CNN model accurately detects an irreversible CKD status, we reformulated the original regression problem to a binary classification problem by predicting whether a patient’s eGFR was lower than the cutoff threshold of 60 ml/min/1.73 m2.

Here, we treated the ResNet model trained in Section 3 as a fixed feature extractor and, for every image, computed a 256-dimension vector containing the activations of the last fully connected layer (FC3) of that model. We denoted these 256-dimension features as image codes. After extracting these codes from the images, we trained an eXtreme Gradient-Boosting model (XGBoost), a scalable end-to-end tree boosting model proposed by Chen and Guestrin,50 to identify whether the corresponding eGFR value was below the 60-ml/min/1.73 m2 threshold. The objective of this binary classification problem was to minimize the binary cross-entropy loss; the hyperparameters of our XGBoost model were determined using the grid search method.47 For the XGBoost implementation in Python, the finalized hyperparameters were tree depth = 3, learning rate = 0.1, data subsampling = 50%, column sampling = 50%, and positive sample weight scaling = 0.25; the remaining components used the default settings. The XGBoost model output a probability of the eGFR being below the cutoff threshold (60 ml/min/1.73 m2).
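In the xgboost Python package, the listed hyperparameters correspond to the following parameter names (mapping “scale the positive sample weight by 0.25” to `scale_pos_weight` is our reading of the text); a configuration sketch:

```python
# Parameter names follow the xgboost Python package; values are those
# reported in the text. "binary:logistic" is the standard objective that
# minimizes binary cross-entropy and outputs a probability.
xgb_params = {
    "objective": "binary:logistic",
    "max_depth": 3,                  # tree depth = 3
    "learning_rate": 0.1,
    "subsample": 0.5,                # data subsampling = 50%
    "colsample_bytree": 0.5,         # column sampling = 50%
    "scale_pos_weight": 0.25,        # positive sample weight scaling
}
# e.g., clf = xgboost.XGBClassifier(**xgb_params).fit(image_codes, labels)
```

All unspecified parameters stay at the package defaults, as stated in the text.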

Training phase

All sonographic studies were partitioned into nontesting (90%) and testing (10%) groups based on a unique, hashed patient identification key to ensure that patients flowed into mutually exclusive groups. The sample size planning was based on the MAE learning curves (Supplementary Fig. 2a). We used all images from the sonographic studies in a group as the dataset of that group. The testing dataset was not employed in this phase. We adopted the bootstrap aggregation (bagging) technique, a model ensemble algorithm, to improve the stability and accuracy of our deep learning model. During bagging, we uniformly sampled from the nontesting dataset with replacement to assemble a double-sized training dataset. A training dataset therefore contained some duplicate sonographic studies and was expected to cover 86.4% of the unique sonographic studies in the entire nontesting dataset.51 We considered the sonographic studies left out of a training dataset as the corresponding validation (out-of-bag) dataset. We repeated this sampling process 10 times to obtain 10 pairs of training and out-of-bag datasets. For each pair, we trained a ResNet model for eGFR prediction and an XGBoost model for CKD status classification. In the final evaluation (testing) phase, we averaged the outputs of the 10 independent models from bagging as the final prediction. The flowchart is presented in Fig. 2.
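The bagging split can be sketched as follows (function name and use of study identifiers are illustrative):

```python
import random

def make_bag(study_ids, seed=0):
    """Bootstrap-aggregation split: draw 2x the number of nontesting
    studies with replacement as the training set; studies never drawn
    form the out-of-bag (validation) set."""
    rng = random.Random(seed)
    train = [rng.choice(study_ids) for _ in range(2 * len(study_ids))]
    out_of_bag = sorted(set(study_ids) - set(train))
    return train, out_of_bag

# With 2n draws from n studies, the expected unique coverage is
# 1 - (1 - 1/n)**(2n) -> 1 - e**-2, about 86.5%, consistent with the
# ~86.4% quoted in the text.
```

Repeating `make_bag` with 10 different seeds yields the 10 training/out-of-bag pairs used for the ensemble.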

Before feeding images into our ResNet model, we applied a tailored image-cropping method, based on the two markers annotating the kidney length, to remove the irrelevant peripheral region around the kidneys. We first identified the positions of the two markers, \((x_1,y_1)\) and \((x_2,y_2)\), and calculated their distance and middle point, denoted as d and \((x_c,y_c)\), respectively. Next, we cropped the square region centered at \((x_c,y_c)\) with side length d. To unify the size of the input images, we resized the cropped images to 224 × 224 pixels and normalized each pixel value based on the mean and standard deviation of the images in the ImageNet dataset. During training, three image augmentation schemes (shifts along the x and y axes by ±10%, rotation by ±40 degrees, and horizontal flip) were applied independently, with each scheme having an 80% probability of occurrence. Several input images are presented in Fig. 5.
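The marker-based cropping can be sketched in numpy. This is a simplified version using nearest-neighbor resizing; the study's exact interpolation method is not specified, and the normalization step is omitted:

```python
import numpy as np

def crop_between_markers(image, p1, p2, out_size=224):
    """Crop the square of side d (the distance between the two length
    markers) centered at their midpoint, clipped to the image bounds,
    then resize to out_size x out_size by nearest-neighbor sampling."""
    (x1, y1), (x2, y2) = p1, p2
    d = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    xc, yc = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = d / 2.0
    r0, r1 = int(max(yc - half, 0)), int(min(yc + half, image.shape[0]))
    c0, c1 = int(max(xc - half, 0)), int(min(xc + half, image.shape[1]))
    patch = image[r0:r1, c0:c1]
    rows = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
    cols = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
    return patch[np.ix_(rows, cols)]
```

The augmentation schemes (shift, rotation, flip) would then be applied to the cropped 224 × 224 patch during training only.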

Fig. 5 Tailored image-cropping method, based on two markers that annotated the kidney length, was used to remove the irrelevant peripheral region of the kidneys. To unify the image size to our neural network model, we resized cropped images to 224 × 224 pixels. Data augmentation schemes comprising shift, rotation, and horizontal flip were performed.

In Section 3, our ResNet model was trained using the Adam optimizer, which automatically adapts the learning rate of every parameter and considers the momentum of gradients during optimization, with a batch size of 128 for gradient calculation.52 An initial learning rate of 10^−4 was used, which was then reduced by a factor of 10 after the validation loss plateaued over 10 epochs. We imposed an L2 regularization of 10^−5 on the network parameters (also called weight decay) to achieve better model generalization. We adopted an early stopping mechanism with a patience of 20 epochs to prevent overfitting and retain the model at the minimum validation loss. We then aggregated the 10 ResNet models from the bagging process by averaging their predictions when evaluating the testing dataset. In Section 4, we trained a corresponding XGBoost model to identify whether a patient’s eGFR was <60 ml/min/1.73 m2 by using the codes extracted from the ResNet model as inputs. The nontesting and testing members were the same as in Section 3. Finally, we obtained 10 XGBoost models for predicting an irreversible CKD status and stored these models for the testing phase.
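The plateau-based learning-rate schedule and early stopping can be sketched as a plain-Python loop over validation losses (deep learning frameworks provide equivalents such as ReduceLROnPlateau; this simplified version is ours):

```python
def train_loop(val_losses, lr=1e-4, lr_patience=10, stop_patience=20):
    """Sketch of the schedule described in the text: divide the learning
    rate by 10 whenever validation loss has not improved for lr_patience
    epochs, and stop after stop_patience epochs without improvement,
    keeping the epoch with the minimum validation loss."""
    best, best_epoch, since_best = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, since_best = loss, epoch, 0
        else:
            since_best += 1
            if since_best % lr_patience == 0:
                lr /= 10.0            # plateau: reduce learning rate
            if since_best >= stop_patience:
                break                 # early stopping
    return best_epoch, best, lr
```

In a real training run, the model checkpoint at `best_epoch` would be the one stored for the ensemble.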

Evaluation (testing) phase

For each sonographic study in the testing group, we selected the ultrasound kidney image with the longest annotated length for the final testing dataset. No image augmentation was performed in the evaluation phase. To reduce the variance among the models, we averaged the outputs from the 10 ResNet models restored from the bagging process as the final eGFR prediction. We quantified the prediction results using the following metrics: MAE, Pearson’s correlation, and R-squared.

\(\hat Y_i = \frac{1}{{10}}\mathop {\sum }\limits_{j = 1}^{10} y_{ij}\), where \(y_{ij}\) is the prediction of input sample i by the ResNet-j model

\({\mathrm{MAE}} = \frac{1}{n}\mathop {\sum }\limits_{i = 1}^n \left| {\hat Y_i - Y_i} \right|\), where \(Y_i\) is the measured eGFR value of input sample i

\(\rho _{Y,\hat Y} = \frac{{cov(Y,\hat Y)}}{{\sigma _Y \cdot \sigma _{\hat Y}}}\)

\(R^2 = 1 - \frac{{SS_{res}}}{{SS_{tot}}}\), where \(SS_{tot} = \mathop {\sum }\limits_{i = 1}^n (Y_i - \bar Y)^2\), \(SS_{res} = \mathop {\sum }\limits_{i = 1}^n (Y_i - \hat Y_i)^2\), and \(\bar Y\) is the mean of the measured eGFR values.
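The three regression metrics can be computed with a short numpy sketch, using the standard definitions (R^2 with the residual sum of squares \(\sum (Y_i - \hat Y_i)^2\) and the total sum of squares about the mean):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, Pearson correlation, and R^2 for eGFR predictions."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_pred - y_true))
    pearson = np.corrcoef(y_true, y_pred)[0, 1]
    ss_res = np.sum((y_true - y_pred) ** 2)      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, pearson, r2
```

A perfect prediction yields MAE = 0, Pearson correlation = 1, and R^2 = 1.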

For evaluating the testing performance for classifying CKD status, we averaged the output probabilities from 10 restored XGBoost models used in the previous training phase. The classification probability threshold was set to 0.5 as follows:

\(\hat P_i = \frac{1}{{10}}\mathop {\sum }\limits_{j = 1}^{10} P_{ij}\), where \(P_{ij}\) is the prediction of input sample i by the XGBoost-j model

\(\hat Y_i = \left\{ {\begin{array}{*{20}{c}} {0,\hat P_i \,<\, threshold} \\ {1,\hat P_i \ge threshold} \end{array}} \right.\), where \(threshold\) is set at 0.5.

True positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR) were used to calculate the accuracy, precision, recall, and F1 score and to plot the receiver operating characteristic (ROC) curve and estimate the area under the curve (AUC). We also evaluated the agreement between the CNN-based eGFR and the serum creatinine-based eGFR in the classification of CKD by using the B-statistic, given the highly imbalanced nature of the present data. The definitions of TPR, FPR, TNR, and FNR are provided in the Supplementary Text. To examine the model’s reliability, we used the bootstrap method to construct 95% bootstrap confidence intervals by evaluating the model’s performance on 10,000 bootstrap testing datasets sampled from the testing dataset with replacement. We regarded the 2.5th and 97.5th percentiles of the evaluation results as the 95% bootstrap confidence intervals. The bootstrap confidence intervals of the accuracy, precision, recall, and F1 score are presented in the Results section.
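The thresholding of averaged probabilities and the confusion-matrix-based metrics can be sketched as follows (a plain-Python illustration of the standard definitions, not the study's evaluation code):

```python
def classification_metrics(y_true, p_avg, threshold=0.5):
    """Threshold the averaged XGBoost probabilities at 0.5 and compute
    accuracy, precision, recall, and F1 from the confusion counts."""
    y_pred = [1 if p >= threshold else 0 for p in p_avg]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # recall = TPR
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Bootstrap confidence intervals are then obtained by re-running this function on resampled copies of the testing set and taking the 2.5th and 97.5th percentiles of each metric.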