Having spent the last few days at RSNA 2017, I had a chance to see many of the different deep learning based prototypes for clinical decision support that vendors are offering. It’s impressive, and I am excited to see this technology get built into real, deployable products that can improve patient care. Judging from the standing-room-only crowds at AI-focused RSNA events, I’m not alone.

I am worried though that some of the hype is running ahead of the actual evidence. As an example, one company at RSNA, DeepRadiology, had a slick monochromatic booth where they were showing a video featuring quotes from prominent figures in the deep learning world. The video claimed performance already better than that of human radiologists, and constantly improving, with an eye-popping comparison of error rates (0.82% for humans, 0.0367% for their deep learning based interpretation system). I would have taken a picture, but they were strictly enforcing a no photography rule, further enhancing the mystique.

However, they put a paper up on arXiv before the conference. I’ll unpack it, because it serves as a great example of how claims about exciting deep learning systems can be exaggerated.

Here’s the basic claim about performance we’ll explore, the last two sentences of the abstract: “DeepRadiologyNet achieved a clinically significant miss rate of 0.0367% on an automatically-detected high confidence subset of the test set, considerably below an estimated literal error rate of 0.82%. Thus, the DeepRadiologyNet system proved superior to published error rates for US board certified radiologists in detecting clinically significant abnormalities on CT scans of the head.”

What they are doing: they want to identify findings in head CT exams. They gathered an impressive amount of data: 24,000 studies to train on and 29,925 studies to test on, collected from over 80 different sites. They use an ensemble of GoogLeNet-inspired convolutional neural networks to predict 30 findings. Since GoogLeNet is two-dimensional, they must have trained on the individual image slices and then devised a scheme for deciding that the CT as a whole contained pathology; this scheme is not described. Standard tricks like dropout and data augmentation were used. They also use a novel hierarchical loss function to account for the fact that different pathologies differ in clinical importance: it is worse to miss a critical finding than to falsely report one.
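The paper does not spell out the loss function, so here is a minimal sketch of the general idea only — an asymmetric, importance-weighted cross-entropy, not a reproduction of their actual hierarchical loss. The weights and penalty values below are invented for illustration:

```python
import numpy as np

def asymmetric_weighted_bce(y_true, y_prob, finding_weight, fn_penalty=5.0):
    """Per-finding binary cross-entropy where false negatives cost more.

    A rough sketch in the spirit of (not a reproduction of) the paper's
    hierarchical loss: `finding_weight` scales each finding by clinical
    importance, and `fn_penalty` up-weights the miss (false negative) term.
    All weights here are hypothetical.
    """
    eps = 1e-7
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_label = -(fn_penalty * y_true * np.log(y_prob)
                  + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(finding_weight * per_label))

# Toy example with 3 of the 30 findings; the first weighted as most critical.
weights = np.array([3.0, 1.0, 0.5])
missed = asymmetric_weighted_bce(np.array([1.0, 0.0, 0.0]),   # finding present, model says no
                                 np.array([0.1, 0.1, 0.1]), weights)
false_alarm = asymmetric_weighted_bce(np.array([0.0, 0.0, 0.0]),  # nothing present, model says yes
                                      np.array([0.9, 0.1, 0.1]), weights)
# Missing the critical finding incurs a much larger loss than falsely reporting it.
```

Under this kind of weighting, gradient descent is pushed harder to avoid misses on the high-weight findings than to avoid false alarms on them.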

They show an average ROC curve for their model across 4 clinically significant findings (Figure 3); they don’t report a precise AUC, but it looks to be very roughly around 0.75, which translates to a sensitivity of 80% at a specificity of 60% for the average clinically significant finding. So far, so good: a standard convolutional neural net architecture with a novel loss function and interesting results, executed on an impressively large dataset.

How do they get from that to claiming better-than-human performance? They calculate a “clinically significant miss rate” (i.e., the percent of false negatives for their four clinically significant findings) on their held-out data. They then report this miss rate not on the full test dataset, but on the 42.1% and 8.5% subsets of the test dataset on which the model was most confident. Averaging five different studies that report different error rates, they back into an estimated 0.82% clinically significant miss rate for human radiologists. Finally, they take the model’s miss rate on its most confident 8.5% of held-out data (0.0367%) and compare it to the human figure backed into from a rough literature search (0.82%).
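To see why reporting a miss rate only on the model’s most confident cases flatters the number, here is a toy simulation. Everything in it — prevalence, score distributions, the notion of confidence as distance from the threshold — is invented for illustration and is not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all numbers invented): an imperfect classifier scores 30,000
# studies, 5% of which contain a clinically significant finding.
n = 30_000
y = rng.random(n) < 0.05
scores = np.where(y, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))

threshold = 0.5
pred = scores > threshold

def miss_rate(y, pred):
    # Fraction of all cases that are missed positives (false negatives).
    return (y & ~pred).mean()

# Treat distance from the decision threshold as "confidence", and keep
# only the 8.5% of cases the model is most confident about.
confidence = np.abs(scores - threshold)
top = confidence >= np.quantile(confidence, 1 - 0.085)

print(f"miss rate, all cases:           {miss_rate(y, pred):.3%}")
print(f"miss rate, most confident 8.5%: {miss_rate(y[top], pred[top]):.3%}")
```

The same mediocre classifier posts a far lower miss rate on its confident subset, because the confident subset is, by construction, the easy cases. That is the statistic being compared against human performance on all cases.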

This comparison is seriously misleading:

(1) First and foremost, it is extremely unfair to compare your best 8.5% of cases against 100% of someone else’s cases. If I could count only my best 8.5% of free throws, I would be better than any player in the NBA.

(2) Comparing their performance to a radiologist miss rate extracted from heterogeneous external studies — rather than the radiologists’ miss rate on the same cases the algorithm was evaluated on — is dubious. It is misleading to present it as if it were a precise head-to-head matchup.

(3) They can’t directly compare their model to human performance on their own cases, because the model is trained to match human annotations; measured against that ground truth, the human annotations are perfect by definition.* Given that, a claim that the model dramatically reduces clinically significant human misses lacks face validity.

(4) The choice of classification threshold needs to be specified: at what false positive rate is the model achieving its low miss rate? They say it is a “complex issue not discussed in this manuscript”, but their chosen ‘clinically significant miss rate’ statistic (based only on false negatives) has no meaning in a vacuum.
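Point (4) is easy to demonstrate: the same model can report almost any miss rate you like by moving the threshold, at the cost of false positives. A quick illustration, again with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (invented numbers): an imperfect classifier scoring 20,000
# studies, 5% of which contain a clinically significant finding.
n = 20_000
y = rng.random(n) < 0.05
scores = np.where(y, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))

for threshold in (1.5, 0.5, -2.0):
    pred = scores > threshold
    fn_rate = (y & ~pred).mean()   # "miss rate" over all cases
    fpr = pred[~y].mean()          # false positive rate among normal studies
    print(f"threshold {threshold:+.1f}: miss rate {fn_rate:.3%}, FPR {fpr:.1%}")
```

An aggressively low threshold drives the miss rate toward zero while flagging nearly every normal study — which is why a miss rate quoted without the accompanying false positive rate tells you almost nothing.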

In summary, the claim that DeepRadiologyNet is “more accurate than humans” does not appear justified, and is based on a flawed comparison.

If you hear a claim that sounds too good to be true, read the study.

Deep learning models need to be held to the same evidence-based standards as anything else in medicine. Until we have prospective, multicenter studies that show strong performance in actual clinical use, validated by third parties, a certain amount of skepticism is in order. I believe deep learning likely will demonstrate superhuman performance on a range of visual perceptual tasks in radiology, but claims about its performance need to be based on transparent science with reproducible results.

*Note: their test data was labeled by a consensus of 2–5 human radiologists who apparently reinterpreted each study. The process by which the training data was labeled is described in less detail: while it also involved expert radiologists, a pre-screening NLP step scanned the initial radiologist reports for key words indicating findings. It is possible that the training data was partially labeled automatically (i.e., a study was negative unless certain words showed up, which flagged it for human annotation), which would reduce label accuracy. The number of radiologists who annotated each training image is not described.