With transfer learning from BI-RADS classification, we can already train a fairly accurate breast cancer classification model. However, this comes at the cost of using a model with less capacity, with a shallower ResNet for processing visual features. Is there a way that we can take advantage of a deeper model while still training on full-resolution mammograms?

Clearly, something has to give. In our case, we make the following compromise: What if we give up end-to-end training?

Patch-level Cancer Classification

Remember the pixel-level segmentations? They could serve as an extremely powerful training signal, telling us exactly where a suspicious change is located in a mammogram. While we could directly train a localization model for identifying cancers, here we’ll target something much simpler: classifying patches.

From our big dataset of mammograms, we randomly sample 256x256 patches from the full mammogram images. If a sampled patch overlaps with an annotated lesion, we assign it the corresponding label (malignant or benign); otherwise, we assign it a negative label. Then, we train a model to classify these patches.

We sample 256x256 patches randomly from full-sized mammographic images.
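To make the sampling procedure concrete, here is a minimal NumPy sketch. The function name, the lesion-mask encoding (0 = background, 1 = benign, 2 = malignant), and the toy data are all hypothetical choices for illustration; only the 256x256 patch size and the overlap-based labeling come from the description above.

```python
import numpy as np

PATCH = 256  # patch side length used in the post

def sample_patch(image, lesion_mask, rng):
    """Sample one random patch and label it by its overlap with annotated
    lesions. The mask encoding (0/1/2) is a hypothetical convention."""
    h, w = image.shape
    y = rng.integers(0, h - PATCH + 1)
    x = rng.integers(0, w - PATCH + 1)
    patch = image[y:y + PATCH, x:x + PATCH]
    overlap = lesion_mask[y:y + PATCH, x:x + PATCH]
    if (overlap == 2).any():
        label = "malignant"
    elif (overlap == 1).any():
        label = "benign"
    else:
        label = "negative"
    return patch, label

# Toy example: a 1024x1024 "mammogram" with one annotated benign lesion.
rng = np.random.default_rng(0)
image = rng.random((1024, 1024))
mask = np.zeros((1024, 1024), dtype=np.uint8)
mask[300:400, 300:400] = 1  # benign lesion region
patch, label = sample_patch(image, mask, rng)
```

In a real pipeline the sampling would typically be biased so that lesion-containing patches are not vanishingly rare, but the labeling logic stays the same.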

With 256x256-sized patches, not only can we use a model with higher capacity, we can also use off-the-shelf models pretrained on ImageNet. We experimented with a large number of existing models and found DenseNet-121 to perform best. Because we're sampling small patches from large mammograms, we can sample and train on a very large number of them: we end up training the model for 1,000 epochs on 5,000,000 sampled patches. This model performs exceedingly well on the patch-classification task.

Importantly, the patch-classification model has a significant limitation: by operating on small patches of the full mammograms, it lacks information from the full context of a mammogram. In practice, radiologists routinely make clinical determinations based on whole breast evaluation comparing tissue in different regions of a breast, different mammographic views of the breast or even between breasts. The patch classification model, while having extremely high capacity, is constrained to use very local features. It effectively misses the forest for the trees.

In contrast, our “image-only” model was able to use information from all parts of a mammogram, but had far less capacity. So is there a way we can combine the benefits of both?

Patch-Classification Heatmaps: Seeing both the forest and the trees

Since our patch-level classifier is so good at classifying mammogram patches, why don't we try using its output as an input to the "image-only" model?

We do just that: we apply the (trained and frozen) patch classifier in a sliding window fashion across the entire high-resolution mammogram, generating a sort of “heatmap” of predictions. We extract heatmaps for both the benign and malignant predictions from the patch classifier.
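The sliding-window procedure can be sketched as follows. The stride, the dummy classifier, and the function names here are illustrative assumptions; in the real pipeline each window would be scored by the trained, frozen DenseNet-121.

```python
import numpy as np

def patch_heatmaps(image, classify_patch, patch=256, stride=128):
    """Slide a frozen patch classifier over the image and record its
    benign/malignant probabilities at each window position.

    `classify_patch` stands in for the trained patch model: any function
    mapping a (patch, patch) array to (p_benign, p_malignant).
    """
    h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    benign = np.zeros((rows, cols))
    malignant = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = image[i * stride:i * stride + patch,
                           j * stride:j * stride + patch]
            benign[i, j], malignant[i, j] = classify_patch(window)
    return benign, malignant

# Dummy classifier: brighter windows score higher (illustration only).
dummy = lambda p: (p.mean(), p.max() * 0.5)
img = np.random.default_rng(1).random((512, 512))
b_map, m_map = patch_heatmaps(img, dummy)
```

For a 512x512 toy image with a 256-pixel window and 128-pixel stride, this yields a coarse 3x3 grid of predictions per class; on a full-resolution mammogram the grid is correspondingly larger.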

We then append these heatmaps as additional channels to the mammogram images themselves. We modify our "image-only" model to take a three-channel input: the mammographic image, the 'benign' patch heatmap and the 'malignant' patch heatmap. We call this the "image-and-heatmaps" model. Training it thus requires us to first run the patch classifier across the whole mammogram to generate the heatmaps, and then use both the mammogram images and the heatmaps as inputs to the model. It turns out that this works really well.
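As a rough illustration of how the heatmaps become extra input channels, here is a NumPy sketch. The nearest-neighbour upsampling and the toy sizes are assumptions made for illustration; the actual resizing scheme may differ.

```python
import numpy as np

def upsample_nn(heatmap, factor):
    """Nearest-neighbour upsampling so a coarse sliding-window heatmap
    matches the mammogram's resolution (one simple choice among many)."""
    return np.repeat(np.repeat(heatmap, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(2)
mammogram = rng.random((512, 512))                    # toy single-channel image
benign_map = upsample_nn(rng.random((4, 4)), 128)     # -> (512, 512)
malignant_map = upsample_nn(rng.random((4, 4)), 128)  # -> (512, 512)

# The "image-and-heatmaps" input: three channels instead of one.
x = np.stack([mammogram, benign_map, malignant_map], axis=0)
```

The downstream model then sees the mammogram and both heatmaps jointly, much as an RGB network sees three color channels.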

Results

We measure the performance of our model using the area under the ROC curve (AUC for short). We find that BI-RADS pretraining and patch-classification heatmaps each help significantly, and the combination of both does even better. Despite having fewer positive examples, our models actually classify malignant cases more accurately than benign ones. This may be because benign lesions are often harder to spot: many are "mammographically occult", meaning that radiologists concluded the lesion could not have been identified from mammograms alone. These benign lesions were usually identified and worked up via other screening methods (for example, breast ultrasound or MRI).
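For readers unfamiliar with the metric, AUC has a convenient rank-based (Mann-Whitney) interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A small self-contained implementation, written here from that definition rather than taken from our codebase:

```python
import numpy as np

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation: the probability
    that a random positive is scored above a random negative."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Average ranks over tied scores so ties count as half a win.
    for s in np.unique(scores):
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])  # perfect separation -> 1.0
```

A score of 0.5 corresponds to chance-level ranking, and 1.0 to perfectly separating malignant from non-malignant exams.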

Comparison to Radiologists

An important question to ask for models trained on medical data is: how well do they work in practice? To answer this question, we conducted a reader study. Using a subset of our test set, comprising both positive and negative examples of cancer, we asked a group of 14 radiologists with varying levels of experience to determine the presence of malignant lesions based on just the screening mammograms. We compared their predictions to our model’s predictions. (Note: the distribution of sampled exams for the reader study is slightly different from the full test set, so the AUC numbers here are not directly comparable to those in the table above.)

Impressively, our model performs at least as well as an average individual radiologist on this task. While the model still somewhat underperforms the averaged predictions of all 14 radiologists — effectively an ensemble of radiologists — this already demonstrates the utility of our model. (In practical terms, it would be unrealistic to have every screening mammogram be looked over by that many radiologists, as the standard of care in the United States is a single radiologist reading the study.) In outperforming an average individual radiologist, our model may potentially be used to assist a radiologist tasked with going through 80–150 screening mammograms a day.

It is worth noting that the task in our reader study is a simplification of what radiologists do in reality. As mentioned above, radiologists only assign BI-RADS labels, an assessment of risk, based on screening mammograms. To actually diagnose cancer, radiologists ask the patient to return for additional imaging and rely on a suite of other techniques: diagnostic mammography (similar to screening mammography, but with additional specialized views focusing on a smaller area of the breast), ultrasound and MRI, often concluding with a biopsy to make a final determination.

Hence, this result can be seen from two perspectives. On one hand, classifying breast cancer screening exams alone is not a task that radiologists are typically trained or expected to do. On the other hand, we show that our model is able to predict the presence of cancer, a downstream goal, using only screening mammograms, and this could potentially be of great help to radiologists.

We can go one step further: what if we combined the expertise of radiologists with the accuracy of our model?

We find that the combination of the radiologist and our model is even better than either alone. This suggests not only that radiologists and the model specialize in different aspects of the task, but also that radiologists working in conjunction with the model reach even more accurate predictions. In our opinion, this is the true takeaway from our work: our models can be used not to substitute for radiologists but to assist them, leading to better outcomes for patients.
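One simple way to combine the two, shown here as a hypothetical illustration rather than the exact scheme used in the study, is a convex combination of the radiologist's and the model's malignancy scores:

```python
import numpy as np

def combine(radiologist_scores, model_scores, weight=0.5):
    """Hypothetical radiologist-model hybrid: a convex combination of the
    two sets of malignancy scores. `weight` controls how much trust is
    placed in the radiologist relative to the model."""
    r = np.asarray(radiologist_scores, dtype=float)
    m = np.asarray(model_scores, dtype=float)
    return weight * r + (1 - weight) * m

# Example: equal weighting of two exams' malignancy scores.
hybrid = combine([0.2, 0.9], [0.6, 0.7], weight=0.5)
```

The combined scores can then be evaluated with AUC just like the individual predictions; the mixing weight could be tuned on a validation set.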

Conclusion

The confluence of the deep learning revolution in computer vision and the growth of medical imaging technology has opened the doors to a new wave of research at the intersection of machine learning and healthcare. Today, we have a real chance to apply cutting-edge machine learning methods to improve millions of lives, and the close collaboration between the NYU Center for Data Science and the Department of Radiology at NYU Langone Health has presented just such an opportunity. This research was the product of integrating medical knowledge and machine learning expertise, and combining the power of modern medical imaging technology with the power of cutting-edge computing hardware.

We are proud to present our work on applying deep learning to the problem of breast cancer screening, along with the accompanying report detailing our data-creation procedure. We have shown that trained neural networks can not only perform comparably to trained radiologists on this task, but can also meaningfully improve the accuracy of radiologists and assist them in their work. We believe this is an exciting result, and we are happy to share our methods as well as our code and trained models with the world. By opening our models to the public, we hope to invite other research groups both to independently validate our work and to build on it.

Of course, this is just the beginning. There are many further problems we want to solve. Can we train models to directly localize and classify cancerous lesions? Can we train models to better incorporate past patient exams, just as radiologists compare exams from the same patient over time when making determinations? And can we make these models interpretable, so that we understand how they make their judgments and can pass that knowledge on to radiologists and doctors? These are just some of the questions we are working to answer, and we cannot wait to share more soon.