Abstract Background Chest radiograph interpretation is critical for the detection of thoracic diseases, including tuberculosis and lung cancer, which affect millions of people worldwide each year. This time-consuming task typically requires expert radiologists to read the images, leading to fatigue-based diagnostic error and lack of diagnostic expertise in areas of the world where radiologists are not available. Recently, deep learning approaches have been able to achieve expert-level performance in medical image interpretation tasks, powered by large network architectures and fueled by the emergence of large labeled datasets. The purpose of this study is to investigate the performance of a deep learning algorithm on the detection of pathologies in chest radiographs compared with practicing radiologists. Methods and findings We developed CheXNeXt, a convolutional neural network to concurrently detect the presence of 14 different pathologies, including pneumonia, pleural effusion, pulmonary masses, and nodules in frontal-view chest radiographs. CheXNeXt was trained and internally validated on the ChestX-ray8 dataset, with a held-out validation set consisting of 420 images, sampled to contain at least 50 cases of each of the original pathology labels. On this validation set, the majority vote of a panel of 3 board-certified cardiothoracic specialist radiologists served as reference standard. We compared CheXNeXt’s discriminative performance on the validation set to the performance of 9 radiologists using the area under the receiver operating characteristic curve (AUC). The radiologists included 6 board-certified radiologists (average experience 12 years, range 4–28 years) and 3 senior radiology residents, from 3 academic institutions. We found that CheXNeXt achieved radiologist-level performance on 11 pathologies and did not achieve radiologist-level performance on 3 pathologies. 
The radiologists achieved statistically significantly higher AUC performance on cardiomegaly, emphysema, and hiatal hernia, with AUCs of 0.888 (95% confidence interval [CI] 0.863–0.910), 0.911 (95% CI 0.866–0.947), and 0.985 (95% CI 0.974–0.991), respectively, whereas CheXNeXt’s AUCs were 0.831 (95% CI 0.790–0.870), 0.704 (95% CI 0.567–0.833), and 0.851 (95% CI 0.785–0.909), respectively. CheXNeXt performed better than radiologists in detecting atelectasis, with an AUC of 0.862 (95% CI 0.825–0.895), statistically significantly higher than radiologists' AUC of 0.808 (95% CI 0.777–0.838); there were no statistically significant differences in AUCs for the other 10 pathologies. The average time to interpret the 420 images in the validation set was substantially longer for the radiologists (240 minutes) than for CheXNeXt (1.5 minutes). The main limitations of our study are that neither CheXNeXt nor the radiologists were permitted to use patient history or review prior examinations and that evaluation was limited to a dataset from a single institution. Conclusions In this study, we developed and validated a deep learning algorithm that classified clinically important abnormalities in chest radiographs at a performance level comparable to practicing radiologists. Once tested prospectively in clinical settings, the algorithm could have the potential to expand patient access to chest radiograph diagnostics.

Author summary

Why was this study done?

Chest radiographs are the most common medical imaging test in the world and critical for diagnosing common thoracic diseases.

Radiograph interpretation is a time-consuming task, and there is a shortage of qualified trained radiologists in many healthcare systems.

Deep learning algorithms that have been developed to provide diagnostic chest radiograph interpretation have not been compared to expert human radiologist performance.

What did the researchers do and find?

We developed a deep learning algorithm to concurrently detect 14 clinically important pathologies in chest radiographs.

The algorithm can also localize parts of the image most indicative of each pathology.

We evaluated the algorithm against 9 practicing radiologists on a validation set of 420 images for which the majority vote of 3 cardiothoracic specialty radiologists served as ground truth.

The algorithm achieved performance equivalent to the practicing radiologists on 10 pathologies, better on 1 pathology, and worse on 3 pathologies.

Radiologists labeled the 420 images in 240 minutes on average, and the algorithm labeled them in 1.5 minutes.

What do these findings mean?

Deep learning algorithms can diagnose certain pathologies in chest radiographs at a level comparable to practicing radiologists on a single institution dataset.

After clinical validation, algorithms such as the one presented in this work could be used to increase access to rapid, high-quality chest radiograph interpretation.

Citation: Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, et al. (2018) Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med 15(11): e1002686. https://doi.org/10.1371/journal.pmed.1002686
Academic Editor: Aziz Sheikh, Edinburgh University, UNITED KINGDOM
Received: May 29, 2018; Accepted: October 3, 2018; Published: November 20, 2018
Copyright: © 2018 Rajpurkar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this study is third party and is publicly hosted by the National Institutes of Health Clinical Center at https://nihcc.app.box.com/v/ChestXray-NIHCC. The test set annotations are not made publicly available to preserve the integrity of the test results when hosting public model evaluation. All other data is included in the paper, its Supporting Information files, and at the following Box link (which contains code as well): https://stanfordmedicine.box.com/s/b3gk9qnanzrdocqge0pbuh07mreu5x7y.
Funding: The authors received no specific funding for this work. This study was made possible via infrastructure support from the Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI.stanford.edu).
Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: CPL holds shares in whiterabbit.ai and Nines.ai, is on the Advisory Board of Nuance Communications and on the Board of Directors for the Radiological Society of North America, and has other research support from Philips, GE Healthcare, and Philips Healthcare. MPL holds shares in and serves on the Advisory Board for Nines.ai. None of these organizations have a financial interest in the results of this study.
Abbreviations: AUC, area under the receiver operating characteristic curve; CAM, class activation mapping; CI, confidence interval; IRB, Institutional Review Board; NPV, negative predictive value; PPV, positive predictive value; ROC, receiver operating characteristic

Introduction Chest radiography is the most common type of imaging examination in the world, with over 2 billion procedures performed each year [1]. This technique is critical for screening, diagnosis, and management of thoracic diseases, many of which are among the leading causes of mortality worldwide [2]. A computer system to interpret chest radiographs as effectively as practicing radiologists could thus provide substantial benefit in many clinical settings, from improved workflow prioritization and clinical decision support to large-scale screening and global population health initiatives. Recent advancements in deep learning and large datasets have enabled algorithms to match the performance of medical professionals in a wide variety of other medical imaging tasks, including diabetic retinopathy detection [3], skin cancer classification [4], and lymph node metastases detection [5]. Automated diagnosis from chest imaging has received increasing attention [6,7], with specialized algorithms developed for pulmonary tuberculosis classification [8,9] and lung nodule detection [10], but the use of chest radiographs to discover other pathologies such as pneumonia and pneumothorax motivates an approach that can detect multiple pathologies simultaneously. Only recently have the computational power and availability of large datasets enabled the development of such an approach. The National Institutes of Health’s release of ChestX-ray14 led to many more studies that use deep learning for chest radiograph diagnosis [11–13]. However, the performance of these algorithms has not been compared to that of practicing radiologists. In this work, we aimed to assess the performance of a deep learning algorithm to automatically interpret chest radiographs. We developed a deep learning algorithm to concurrently detect the presence of 14 different disease classes in chest radiographs and evaluated its performance against practicing radiologists.

Methods Data The ChestX-ray14 dataset [14] was used to develop the deep learning algorithm. The dataset is currently the largest public repository of radiographs, containing 112,120 frontal-view (both posteroanterior and anteroposterior) chest radiographs of 30,805 unique patients. Each image in ChestX-ray14 was annotated with up to 14 different thoracic pathology labels that were chosen based on frequency of observation and diagnosis in clinical practice. The labels for each image were obtained using automatic extraction methods on radiology reports, resulting in 14 binary values per image, where 0 indicates the absence of that pathology and 1 denotes the presence (multiple pathologies can be present in each image). We partitioned the dataset into training, tuning, and validation (see S1 Table for statistics of dataset splits used in this study). The training set was used to optimize network parameters, the tuning set was used to compare and choose networks, and the validation set was used to evaluate CheXNeXt and radiologists. There is no patient overlap among the partitions. Radiologist annotations A validation set of 420 frontal-view chest radiographs was selected from ChestX-ray14 for radiologist annotation. The set was curated to contain at least 50 cases of each pathology according to the original labels provided in the dataset by randomly sampling examples and iteratively updating the selected examples by sampling from the examples labeled with the underrepresented pathologies. The radiographs in the validation set were annotated by 3 independent board-certified cardiothoracic specialist radiologists (average experience 15 years, range 5–28 years) for the presence of each of the 14 pathologies. The majority vote of their annotations was taken as a consensus reference standard on each image. 
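The majority-vote reference standard described above can be illustrated with a minimal Python sketch. The function name and the toy labels are hypothetical and are not taken from the study's code; the sketch only assumes that each reader supplies one binary label per pathology for an image.

```python
def majority_vote(annotations):
    """Combine per-reader binary labels into one consensus label per pathology.

    annotations: list of readers, each a list of 0/1 labels (one per pathology).
    Returns a list of 0/1 consensus labels.
    """
    n_readers = len(annotations)
    n_labels = len(annotations[0])
    consensus = []
    for j in range(n_labels):
        votes = sum(reader[j] for reader in annotations)
        # Positive when more than half of the readers marked the finding.
        consensus.append(1 if 2 * votes > n_readers else 0)
    return consensus


# Hypothetical labels from 3 readers for 4 pathologies on a single image.
readers = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
reference = majority_vote(readers)
print(reference)  # [1, 0, 1, 0]
```

With 3 readers a tie is impossible, so each pathology label is decided by at least 2 agreeing readers.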
To compare to the algorithm, 6 board-certified radiologists from 3 academic institutions (average experience 12 years, range 4–28 years) and 3 senior radiology residents also annotated the validation set of 420 radiographs for all 14 labels. All radiologists individually reviewed and labeled each of the images using a freely available image viewer with capabilities for picture archiving and communication system features such as zoom, window leveling, and contrast adjustment. Radiologists did not have access to any patient information or knowledge of disease prevalence in the data. Labels were entered into a standardized data entry program, and the total time to complete the review was recorded. The Stanford Institutional Review Board (IRB) approved this study, and all radiologists consented to participate in the labeling process. Algorithm development The deep learning algorithm, called CheXNeXt, is a neural network trained to concurrently detect the 14 pathologies in frontal-view chest radiographs. Neural networks are functions with many parameters that are structured as a hierarchy of layers to model different levels of abstraction. In this study, the selected architecture was a convolutional neural network, a particular type of neural network that is specially designed to handle image data. By exploiting a parameter sharing receptive field, convolutional neural networks scan over an image to learn features from local structure and aggregate the local features to make a prediction on the full image. The neural network used in this study is a 121-layer DenseNet architecture [15] in which each layer is directly connected to every other layer within a block. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are passed on to all following layers as inputs. 
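The dense connectivity described above means that the number of input feature maps grows with each layer in a block. The following Python sketch works through that arithmetic. The growth rate of 32, the block sizes (6, 12, 24, 16), and the 64 initial channels are assumptions taken from the standard DenseNet-121 configuration; they are not stated in this paper, and the function names are illustrative only.

```python
# Assumed standard DenseNet-121 settings (not stated in this paper).
GROWTH_RATE = 32
BLOCK_LAYERS = (6, 12, 24, 16)
INIT_FEATURES = 64


def dense_block_inputs(in_channels, num_layers, growth_rate):
    """Input feature-map count seen by each layer of one dense block.

    Layer i receives the block input plus the outputs of the i layers before it.
    """
    return [in_channels + i * growth_rate for i in range(num_layers)]


def final_feature_count():
    """Channels entering the classifier after all blocks and transitions."""
    channels = INIT_FEATURES
    for idx, num_layers in enumerate(BLOCK_LAYERS):
        channels += num_layers * GROWTH_RATE
        if idx < len(BLOCK_LAYERS) - 1:
            channels //= 2  # each transition layer halves the channel count
    return channels


def conv_fc_layer_count():
    """Stem conv + two convs per dense layer + transition convs + final FC."""
    return 1 + 2 * sum(BLOCK_LAYERS) + (len(BLOCK_LAYERS) - 1) + 1


def head_parameters(n_features, n_classes):
    """Weights plus biases of a fully connected output layer."""
    return n_features * n_classes + n_classes


print(dense_block_inputs(INIT_FEATURES, BLOCK_LAYERS[0], GROWTH_RATE))
print(final_feature_count(), conv_fc_layer_count())
print(head_parameters(final_feature_count(), 14))
```

Under these assumptions the count of convolutional and fully connected layers comes to 121, which matches the "121-layer" description, and a 14-output head on the final feature vector adds 14,350 parameters.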
Once the neural network architecture is specified, the parameters are automatically learned from a large amount of data labeled with the presence or absence of each pathology. The learning process consists of iteratively updating the parameters to decrease the prediction error, which is computed by comparing the network’s prediction to the known annotations on each image. By performing this procedure using a representative set of images, the resulting network can make predictions on previously unseen frontal-view chest radiographs. Training procedure The training process consisted of 2 consecutive stages to account for the partially incorrect labels in the ChestX-ray14 dataset. First, multiple networks were trained on the training set to predict the probability that each of the 14 pathologies is present in the image. Then, a subset of those networks, each chosen based on the average error on the tuning set, constituted an ensemble that produced predictions by computing the mean over the predictions of each individual network. The ensemble was used to relabel the training and tuning sets as follows: first, the ensemble probabilities were converted to binary values by computing the threshold that led to the highest average F1 score on the tuning set across all pathologies. Then, the new label was taken to be positive if and only if either the original label was positive or the ensemble prediction was positive. Finally, new networks were trained on the relabeled training set, and a subset of the new networks was selected based on the average error on the relabeled tuning set. The final network was an ensemble of 10 networks trained on the relabeled data, where again the predictions of the ensemble were computed as the mean over the predictions of each individual network. Before both stages of training, the parameters of each network were initialized with parameters from a network pretrained on ImageNet [16]. 
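The relabeling step described above (ensemble mean, thresholding, then a logical OR with the original label) can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the function names are hypothetical, the threshold value 0.6 is made up for the example (the paper selects it by maximizing average F1 on the tuning set), and treating a probability equal to the threshold as positive is also an assumption.

```python
def ensemble_mean(per_network_probs):
    """Average the predicted probabilities of several networks for one image.

    per_network_probs: list of networks, each a list of per-pathology probabilities.
    """
    n = len(per_network_probs)
    return [sum(col) / n for col in zip(*per_network_probs)]


def relabel(original_labels, ensemble_probs, threshold):
    """New label is positive iff the original label or the ensemble prediction is positive."""
    return [
        1 if (y == 1 or p >= threshold) else 0
        for y, p in zip(original_labels, ensemble_probs)
    ]


# Hypothetical outputs of 3 networks for 3 pathologies on one image.
network_probs = [
    [0.9, 0.2, 0.4],
    [0.7, 0.1, 0.6],
    [0.8, 0.3, 0.5],
]
original = [0, 1, 0]
mean_probs = ensemble_mean(network_probs)
new_labels = relabel(original, mean_probs, threshold=0.6)
print(new_labels)  # [1, 1, 0]
```

Because of the OR rule, relabeling can only add positive labels; an original positive is never removed.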
The final fully connected layer of the pretrained network was replaced with a new fully connected layer producing a 14-dimensional output, after which the sigmoid was applied to each of the outputs to obtain the predicted probabilities of the presence of each of the 14 pathology classes. Before inputting the images into the network, the images were resized to 512 pixels by 512 pixels and normalized based on the mean and standard deviation (SD) of images in the ImageNet training set. For each image in the training set, a random lateral inversion was applied with 50% probability before being fed into the network. The networks were updated to minimize the sum of per-class weighted binary cross entropy losses, where the per-class weights were computed based on the prevalence of that class in the training set. All parameters of the networks were trained jointly using Adam with standard parameters [17]. Adam is an effective variant of an optimization algorithm called stochastic gradient descent, which iteratively applies updates to parameters in order to minimize the loss during training. We trained the networks with minibatches of size 8 and used an initial learning rate of 0.0001 that was decayed by a factor of 10 each time the loss on the tuning set plateaued after an epoch (a full pass over the training set). In order to prevent the networks from overfitting, early stopping was performed by saving the network after every epoch and choosing the saved network with the lowest loss on the tuning set. No other forms of regularization, such as weight decay or dropout, were used. Each stage of training completed after around 20 hours on a single NVIDIA GeForce GTX TITAN Black. Each network had 6,968,206 learnable parameters, and the final ensemble had 69,682,060 parameters. The open-source deep learning framework PyTorch (http://pytorch.org/) was used to train and evaluate the algorithms. 
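The per-class weighted binary cross entropy described above can be sketched in plain Python. The paper states only that the weights were based on class prevalence; the specific scheme below (positives weighted by the fraction of negatives, negatives by the fraction of positives) is an assumption chosen for illustration, and all names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    """Map a raw network output to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def class_weights(prevalence):
    """Assumed weighting: rare positives get a weight close to 1."""
    return 1.0 - prevalence, prevalence  # (weight for positives, weight for negatives)


def weighted_bce(prob, label, prevalence, eps=1e-7):
    """Weighted binary cross entropy for one pathology on one image."""
    w_pos, w_neg = class_weights(prevalence)
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(w_pos * label * math.log(prob)
             + w_neg * (1 - label) * math.log(1.0 - prob))


def total_loss(probs, labels, prevalences):
    """Sum of the per-class weighted losses, as in the training objective."""
    return sum(weighted_bce(p, y, q) for p, y, q in zip(probs, labels, prevalences))


# Hypothetical raw outputs for 3 pathologies with made-up training prevalences.
logits = [2.0, -1.0, 0.0]
probs = [sigmoid(z) for z in logits]
loss = total_loss(probs, labels=[1, 0, 1], prevalences=[0.10, 0.02, 0.05])
print(round(loss, 4))
```

Under this weighting, an uncertain prediction on a rare positive is penalized more heavily than the same uncertainty on a common negative, which counteracts the class imbalance.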
Interpreting network predictions In order to interpret predictions, CheXNeXt produced heat maps that identified locations in the chest radiograph that contributed most to the network’s classification through the use of class activation mappings (CAMs) [18]. To generate the CAMs, images were fed into the fully trained network, and the feature maps from the final convolutional layer were extracted. A map of the most salient features used in classifying the image as having a specified pathology was computed by taking the weighted sum of the feature maps using their associated weights in the fully connected layer. The most important features used by CheXNeXt in its prediction of the pathology were identified in the image by upscaling the map to the dimensions of the image and overlaying the image. Statistical analysis and evaluation on the validation set We provide a comprehensive comparison of the CheXNeXt algorithm to practicing radiologists across 7 performance metrics, namely, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, F1 metric, positive and negative predictive value (PPV and NPV), and Cohen’s kappa [19]. To convert the probabilities produced by CheXNeXt to binary predictions, we chose pathology-specific thresholds through maximization of the F1 score on the tuning set (more details presented in S1 Appendix). To compare the CheXNeXt algorithm to radiologists using a single diagnostic performance measure, we used the AUC metric. Because the radiologists only provided yes/no responses for each image and not a continuous score, the receiver operating characteristic (ROC) was estimated for the radiologists as a group using partial least-squares regression with constrained splines to fit an increasing concave curve to the specificities and sensitivities of 9 radiologists. We specify knots at each 1/20th and assume symmetry. An example with R code is provided in S1 Appendix. 
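The class activation mapping computation described earlier in this section (a weighted sum of the final convolutional feature maps, followed by upscaling) can be sketched as below. The toy feature maps and weights are hypothetical, nearest-neighbour upscaling is an assumption because the paper does not state the interpolation method, and the overlay step is omitted.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one pathology.

    class_weights holds the K fully connected weights of that pathology's output.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(feature_maps, class_weights):
        for r in range(height):
            for c in range(width):
                cam[r][c] += w * fmap[r][c]
    return cam


def upscale_nearest(cam, factor):
    """Nearest-neighbour upscaling (assumed; interpolation is not specified in the paper)."""
    height, width = len(cam), len(cam[0])
    return [
        [cam[r // factor][c // factor] for c in range(width * factor)]
        for r in range(height * factor)
    ]


# Two hypothetical 2 x 2 feature maps and their weights for one pathology.
maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights = [0.5, 1.0]
cam = class_activation_map(maps, weights)
heat = upscale_nearest(cam, factor=2)
print(cam)  # [[0.5, 0.0], [0.0, 2.0]]
```

In this toy case the lower-right region dominates the map because its feature map carries both the larger activation and the larger class weight.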
Because we estimate the ROCs for the radiologists, we cannot use standard confidence intervals (CIs) for the radiologists' AUCs, and so to ensure a fair comparison, we calculated and compared the respective AUCs in the same manner, as follows. We first estimate the ROC for the radiologists using constrained splines—as described above—and the ROC for the algorithm and then estimate the AUCs for both the algorithm and the radiologists using linear interpolation and the composite trapezoidal rule. Finally, we use the robust bootstrap method, described below, to construct CIs around the AUCs. In addition to individual-level and pathology-specific performance measures, the CheXNeXt algorithm was evaluated over all pathologies and against radiologists as a group. To evaluate CheXNeXt against resident radiologists as a group and board-certified radiologists as a group, the micro-averages of the performance measures were computed across all resident radiologists as well as across all board-certified radiologists. Micro-averages for groups of radiologists were calculated by concatenating the predictions of group members and then calculating the performance measures. For example, to calculate the sensitivity for board-certified radiologists in predicting hernia (420 images), we concatenated each of 6 board-certified radiologists' predictions into a single array of length 420 × 6 = 2,520, repeated the reference standard for hernia 6 times to create an array of the same length, and then calculated sensitivity. To provide an overall estimate of accuracy, the proportion correct was calculated for each image across all 14 pathologies, and the mean and SD of these proportions are reported. The nonparametric bootstrap was used to estimate the variability around each of the performance measures; 10,000 bootstrap replicates from the validation set were drawn, and each performance measure was calculated for CheXNeXt and the radiologists on these same 10,000 bootstrap replicates. 
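The micro-averaging and bootstrap steps described above can be sketched in Python. This is a simplified illustration, not the study's R code: the function names and toy data are hypothetical, the default of 1,000 replicates is smaller than the 10,000 used in the paper, and the percentile lookup uses plain order statistics without interpolation.

```python
import random


def sensitivity(predictions, reference):
    """True positives divided by all reference positives."""
    tp = sum(1 for p, y in zip(predictions, reference) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(predictions, reference) if p == 0 and y == 1)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def micro_sensitivity(group_predictions, reference):
    """Concatenate all readers' predictions and repeat the reference to match."""
    pooled_pred = [p for reader in group_predictions for p in reader]
    pooled_ref = reference * len(group_predictions)
    return sensitivity(pooled_pred, pooled_ref)


def bootstrap_percentile_ci(group_predictions, reference,
                            n_boot=1000, level=0.95, seed=0):
    """Resample images with replacement and take percentile bounds."""
    rng = random.Random(seed)
    n = len(reference)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ref_b = [reference[i] for i in idx]
        preds_b = [[reader[i] for i in idx] for reader in group_predictions]
        s = micro_sensitivity(preds_b, ref_b)
        if s == s:  # drop replicates with no positives (NaN)
            stats.append(s)
    stats.sort()
    lower = stats[int((1 - level) / 2 * (len(stats) - 1))]
    upper = stats[int((1 + level) / 2 * (len(stats) - 1))]
    return lower, upper


# Hypothetical reference standard and 2 readers on 6 images.
reference = [1, 1, 0, 0, 1, 0]
readers = [
    [1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
]
micro = micro_sensitivity(readers, reference)
lower, upper = bootstrap_percentile_ci(readers, reference)
print(round(micro, 3), lower, upper)
```

Resampling image indices once per replicate and applying them to every reader keeps the readers paired on the same images, which mirrors evaluating CheXNeXt and the radiologists on the same bootstrap replicates.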
This produced a distribution for each estimate, and the 95% bootstrap percentile intervals (2.5th and 97.5th percentiles) are reported [20]. Because AUC is a single measure on which to compare the CheXNeXt algorithm to the radiologists as a group, the difference between the AUCs on these same bootstrap replicates was also computed. To control the familywise error rate when testing for significant differences in AUCs, the stringent Bonferroni-corrected [21] CIs of 1 − 0.05/14 are reported. If the interval does not include 0, there is evidence that either CheXNeXt or the radiologists are superior in that task. All statistical analyses were completed in the R environment for statistical computing [22]. The irr package [23] was used to calculate the exact Fleiss’ kappa and Cohen’s kappa. The boot package [24] was used to perform the bootstrap and construct the bootstrap percentile intervals (95% and 99.6%). The ConSpline package [25] was used to estimate the ROC for the radiologists using partial least-squares regression with constrained splines, the pROC package [26] was used to estimate the ROC for the algorithm, and the MESS package [27] was used to calculate the AUC for both the radiologists and CheXNeXt. Figures were created using the ggplot2 [28] and gridExtra [29] packages.
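As a worked check of the Bonferroni correction described above, the following Python sketch computes the corrected confidence level for 14 comparisons and the corresponding bootstrap percentile bounds. The function names are hypothetical, and the interval passed to the last call is made up for illustration.

```python
def bonferroni_level(alpha, n_tests):
    """Per-comparison confidence level that keeps the familywise error at alpha."""
    return 1.0 - alpha / n_tests


def percentile_bounds(level):
    """Lower and upper percentiles of a two-sided percentile interval."""
    tail = (1.0 - level) / 2.0
    return 100.0 * tail, 100.0 * (1.0 - tail)


def excludes_zero(lower, upper):
    """True when an interval for the AUC difference does not contain 0."""
    return lower > 0.0 or upper < 0.0


level = bonferroni_level(0.05, 14)
low_pct, high_pct = percentile_bounds(level)
print(round(100 * level, 1))  # 99.6
print(round(low_pct, 3), round(high_pct, 3))

# Hypothetical interval for an AUC difference (values are illustrative only).
print(excludes_zero(0.01, 0.09))  # True
```

The corrected level of about 99.6% is the second interval width named in the description of the boot package usage above.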

Discussion The results presented in this study demonstrate that deep learning can be used to develop algorithms that can automatically detect and localize many pathologies in chest radiographs at a level comparable to practicing radiologists. Clinical integration of this system could allow for a transformation of patient care by decreasing time to diagnosis and increasing access to chest radiograph interpretation. The potential value of this tool is highlighted by the World Health Organization, which estimates that more than 4 billion people lack access to medical imaging expertise [30]. Even in developed countries with advanced healthcare systems, an automated system to interpret chest radiographs could provide immense utility [31,32]. This algorithm could be used for worklist prioritization, allowing the sickest patients to receive quicker diagnoses and treatment even in hospital settings in which radiologists are not immediately available. Furthermore, experienced radiologists are still subject to human limitations, including fatigue, perceptual biases, and cognitive biases, all of which lead to errors [33–37]. Prior studies suggest that perceptual errors and biases can be reduced by providing feedback on the presence and locations of abnormalities on radiographs to interpreting radiologists [38], a scenario that is well suited for our proposed algorithm. An additional application for CheXNeXt is screening of tuberculosis and lung cancer, both of which use chest radiography for screening, diagnosis, and management [39–43]. The CheXNeXt algorithm detected both consolidation and pleural effusion, the most common findings for primary tuberculosis, at the level of practicing radiologists. Similarly, CheXNeXt achieved radiologist-level accuracy for both pulmonary nodule and mass detection, a critical task for lung cancer diagnosis, with much higher specificity than previously reported computer-aided detection systems and comparable sensitivity [44–47]. 
Although chest radiography is not the primary method used to perform lung cancer screening, it is the most common thoracic imaging study in which incidental lung cancers (nodules or masses) are discovered. For example, in a large study of incidentally discovered lung cancers in 593 patients, 71.8% were diagnosed incidentally on chest X-ray and the remaining on computed tomography (CT) scan [48]. This would suggest that, despite the recommendation for, and widespread use of, screening CT in modernized healthcare environments, chest radiographs remain the primary modality by which lung cancer is imaged. Additionally, lung cancers are sometimes diagnosed on chest CT and then identified in retrospect as “missed” on previous chest radiographs. This scenario is not rare and has a considerable medicolegal impact on the field of radiology. Furthermore, the vast majority of the world’s population does not have access to chest CT for lung cancer screening or diagnosis and therefore must rely on the versatile and less resource-intensive chest radiograph for the detection of thoracic pathologies, including lung cancer and tuberculosis. Once clinically validated, an algorithm such as CheXNeXt could have impactful clinical applications in healthcare systems. While CheXNeXt performed extremely well in comparison to board-certified radiologists on acute diagnoses, it performed most poorly in the detection of emphysema and hiatal hernia. The symmetric "global" radiographic appearance in emphysema (symmetric pulmonary overexpansion) may have been more challenging to recognize as opposed to asymmetric "localized" findings such as pulmonary nodule, effusion, or pneumothorax. In addition, hiatal hernia was the least prevalent of all the 14 labels in the training data. These shortcomings could be addressed in the future by obtaining more labeled training data for these pathologies. Additionally, the sensitivity of board-certified radiologists in the detection of mass was low. 
To investigate this, we evaluated the sensitivity of the board-certified radiologists and algorithm after grouping the mass and nodule pathology classes as lung lesion (if the label was positive for either nodule or mass, the new label was positive for lung lesion; otherwise, it was negative). Before collapsing these classes, the board-certified radiologists achieved a sensitivity of 0.573 in detecting nodules and 0.495 in detecting masses. After collapsing, the board-certified radiologists achieved a sensitivity of 0.667 in the detection of lung lesions. This indicates that the board-certified radiologists frequently selected the nodule label when the ground truth was mass but did accurately detect a pulmonary lesion. CheXNeXt had higher sensitivities for mass and nodule than board-certified radiologists (0.754 and 0.690, respectively) and maintained a higher sensitivity (0.723) after grouping. This study has limitations that likely led to a conservative estimate of both radiologist and algorithm performance. First, the radiologists and algorithm only had access to frontal radiographs during reading, and it has been shown that up to 15% of accurate diagnoses require the lateral view [1]. The lack of lateral views in the dataset may limit detection of certain clinical findings such as vertebral body fractures or subtle pleural effusions not detected on frontal views alone; future work may consider utilizing the lateral views when applicable for diagnosis and algorithm development. Second, neither CheXNeXt nor the radiologists were permitted to use patient history or review prior examinations, which has been shown to improve radiologist diagnostic performance in interpreting chest radiographs [49,50]. Third, the images were presented to the radiologists and the CheXNeXt algorithm at a resolution of 1,024 pixels and 512 pixels, respectively, and chest radiographs are usually presented at a resolution of over 2,000 pixels. 
Fourth, the reference standard was decided by a consensus of cardiothoracic radiologists, and no access to cross-sectional imaging, laboratory, or pathology data was available to determine the reference standard. The comparison to gold standard cases for all pathologies is outside the scope and purpose of this study. Instead, the goal is to evaluate the performance of a deep learning algorithm in diagnostic tasks on radiographs using a retrospective approach based on the interpretations of an expert panel compared with the interpretations of individual nonspecialist radiologists. Finally, consolidation, infiltration, and pneumonia are all manifestations of airspace opacities on chest radiographs yet were provided as distinct labels. While any given radiograph can be marked with one or more of these 3 labels, certain radiographic patterns of airspace opacities are characteristic of pneumonia and, when combined with clinical information, can determine the pneumonia diagnosis specifically. Even in the absence of clinical data, identifying airspace opacity patterns characteristic of pneumonia is useful, particularly in parts of the world where access to expert diagnostics is limited. This work has additional limitations that should be considered when interpreting the results. This study is limited to evaluation on a dataset from a single institution, so future work will be necessary to address generalizability of these algorithms to datasets from other institutions. Additionally, the experimental design used to assess radiologists in this work does not replicate the clinical environment, so the radiologist performance scores presented in this study may not exactly reflect true performance in a more realistic setting. 
Specifically, disagreement in chest radiograph interpretation between clinical radiologists has been well described and would not always be interpreted as error in clinical practice, e.g., atelectasis is not always a clinically important observation, particularly if other findings are present. In that way, the labeling task performed by the radiologist readers in this study differs from routine clinical interpretation because in this work, any/all relevant findings in each image were labeled as present no matter the potential clinical significance. Finally, the primary performance metric comparison in this study required estimating the ROC for radiologists. While we assumed symmetry in the specificities and sensitivities, allowing for a better fit, we acknowledge that this is not a perfect comparison, and for this reason, we also provided a comprehensive view of how the algorithm compares to radiologists on 6 other performance metrics (Fig 2 and S1 Fig). All performance metrics and estimates of uncertainty should be taken together to better understand the performance of this algorithm in relation to these practicing radiologists.

Conclusion We present CheXNeXt, a deep learning algorithm that performs comparably to practicing board-certified radiologists in the detection of multiple thoracic pathologies in frontal-view chest radiographs. This technology may have the potential to improve healthcare delivery and increase access to chest radiograph expertise for the detection of a variety of acute diseases. Further studies are necessary to determine the feasibility of these outcomes in a prospective clinical setting.

Acknowledgments We would like to acknowledge the Stanford Machine Learning Group (stanfordmlgroup.github.io) and the Stanford Program for Artificial Intelligence in Medicine and Imaging for infrastructure support (AIMI.stanford.edu).