Analyzing chest X-ray images with machine learning algorithms is easier said than done. That’s because typically, the clinical labels required to train those algorithms are obtained with rule-based natural language processing or human annotation, both of which tend to introduce inconsistencies and errors. Additionally, it’s challenging to assemble data sets that represent an adequately diverse spectrum of cases, and to establish clinically meaningful and consistent labels given only images.

In an effort to advance the state of the art in X-ray image classification, researchers at Google devised AI models to spot four findings on human chest X-rays: pneumothorax (collapsed lung), nodules and masses, fractures, and airspace opacities (filling of the pulmonary tree with material). In a paper published in the journal Radiology, the team claims the model family, which was evaluated using thousands of images across data sets with high-quality labels, demonstrated “radiologist-level” performance in an independent review conducted by human experts.

The study’s publication comes months after Google AI and Northwestern Medicine scientists created a model capable of detecting lung cancer from screening tests better than human radiologists with an average of eight years of experience, and roughly a year after New York University used Google’s Inception v3 machine learning model to detect lung cancer. AI also underpins the tech giant’s advances in diabetic retinopathy diagnosis through eye scans, as well as Alphabet subsidiary DeepMind’s AI that can recommend the proper line of treatment for 50 eye diseases with 94% accuracy.

This newer work tapped over 600,000 images sourced from two de-identified data sets. The first was developed in collaboration with Apollo Hospitals and consists of X-rays collected over years from multiple locations. The second is the publicly available ChestX-ray14 image set released by the National Institutes of Health, which has historically served as a resource for AI efforts but suffers from shortcomings in accuracy.

The researchers developed a text-based system to extract labels using radiology reports associated with each X-ray, which they then applied to provide labels for over 560,000 images from the Apollo Hospitals data set. To reduce errors introduced by the text-based label extraction and provide the relevant labels for a number of ChestX-ray14 images, they recruited radiologists to review approximately 37,000 images across the two corpora.
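
To make the idea of text-based label extraction concrete, here is a minimal sketch of a keyword-and-negation approach. It is not the system the Google team built; the keyword lists, the negation cues, and the function name `extract_labels` are all hypothetical, chosen only to show why this kind of extraction is cheap to run at scale yet prone to the errors that radiologist review is meant to catch.

```python
import re

# Hypothetical keyword rules for the four findings named in the article.
FINDING_PATTERNS = {
    "pneumothorax": re.compile(r"\bpneumothorax\b", re.IGNORECASE),
    "nodule_or_mass": re.compile(r"\b(nodules?|mass(es)?)\b", re.IGNORECASE),
    "fracture": re.compile(r"\bfractures?\b", re.IGNORECASE),
    "airspace_opacity": re.compile(r"\b(opacity|opacities|consolidation)\b", re.IGNORECASE),
}

# Hypothetical negation cues; real reports use many more phrasings.
NEGATION = re.compile(r"\b(no|without|negative for)\b", re.IGNORECASE)


def extract_labels(report):
    """Return {finding: bool} for one free-text radiology report."""
    labels = {name: False for name in FINDING_PATTERNS}
    # Treat each sentence separately so a negation only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", report):
        for name, pattern in FINDING_PATTERNS.items():
            match = pattern.search(sentence)
            if match is None:
                continue
            # Skip the mention if a negation cue precedes it in the sentence.
            if NEGATION.search(sentence[:match.start()]):
                continue
            labels[name] = True
    return labels


example = "No pneumothorax. Right lower lobe opacity. Old rib fracture."
print(extract_labels(example))
```

Rules this simple will mislabel reports that hedge ("cannot exclude pneumothorax") or that describe a finding as resolved, which is one reason a human review pass on a subset of images is valuable.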

The next step was generating high-quality reference labels for model evaluation purposes. A panel-based process was adopted, whereby three radiologists reviewed all final tune and test set images and resolved disagreements through online discussion. This, the study’s coauthors say, allowed difficult findings that were initially only detected by a single radiologist to be identified and documented appropriately.
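
The difference between this kind of panel review and a simple majority vote can be sketched in a few lines. The function below is an illustrative assumption, not code from the study: it takes three independent reads for one finding on one image and sends any non-unanimous case to discussion instead of letting two readers outvote the third.

```python
def triage_reads(reads):
    """Classify three independent radiologist reads for one finding.

    reads: a sequence of three truthy/falsy values, one per radiologist.
    Returns "positive" or "negative" when all three agree, and "discuss"
    when the case should go to the panel for adjudication.
    """
    if len(reads) != 3:
        raise ValueError("expected exactly three reads")
    positives = sum(1 for r in reads if r)
    if positives == 3:
        return "positive"
    if positives == 0:
        return "negative"
    # One or two positive reads: the panel resolves the disagreement.
    return "discuss"


# A finding seen by only one reader is kept for discussion,
# whereas a majority vote would have labeled it negative.
print(triage_reads([True, False, False]))  # prints "discuss"
```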

Google notes that while the models achieved expert-level accuracy overall, performance varied across corpora. For example, the sensitivity for detecting pneumothorax among radiologists was approximately 79% for the ChestX-ray14 images, but was only 52% for the same radiologists on the other data set.
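
Sensitivity is the share of truly positive cases that a reader or model flags as positive. A minimal sketch of the calculation follows; the function name and the toy labels are invented for illustration and are not data from the study.

```python
def sensitivity(reference, predicted):
    """Fraction of reference-positive cases that were also called positive."""
    true_pos = sum(1 for r, p in zip(reference, predicted) if r and p)
    false_neg = sum(1 for r, p in zip(reference, predicted) if r and not p)
    if true_pos + false_neg == 0:
        raise ValueError("no positive cases in the reference labels")
    return true_pos / (true_pos + false_neg)


# Toy example: 1 = finding present, 0 = finding absent.
reference = [1, 1, 1, 1, 0, 0, 0, 1]  # adjudicated reference standard
reader = [1, 0, 1, 1, 0, 1, 0, 0]     # one reader's calls on the same images
print(sensitivity(reference, reader))  # prints 0.6
```

Because the denominator counts only reference-positive cases, the figure depends heavily on how those positives were defined and how subtle they are, which helps explain why the same readers can score so differently on two data sets.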

“The performance differences between datasets … emphasize the need for standardized evaluation image sets with accurate reference standards in order to allow comparison across studies,” wrote Google research scientist Dr. David Steiner and Google Health technical lead Shravya Shetty, both of whom contributed to the paper, in a blog post. “[Models] often identified findings that were consistently missed by radiologists, and vice versa. As such, strategies that combine the unique ‘skills’ of both the [AI] systems and human experts are likely to hold the most promise for realizing the potential of AI applications in medical image interpretation.”

The research team hopes to lay the groundwork for superior methods with a corpus of the adjudicated labels for the ChestX-ray14 data set, which they’ve made available in open source. It contains 2,412 training and validation set images and 1,962 test set images, or 4,374 images in total.

“We hope that these labels will facilitate future machine learning efforts and enable better apples-to-apples comparisons between machine learning models for chest X-ray interpretation,” wrote Steiner and Shetty.