Last year the United States Food and Drug Administration (FDA) cleared a total of 12 AI tools that use machine learning for health (ML4H) algorithms to inform medical diagnosis and treatment. The tools can now be marketed, with millions of potential users in the US alone. Because ML4H tools directly affect human health, their progression from lab experiments to hospital deployment takes place under heavy scrutiny. A critical component of this process is reproducibility.

A team of researchers from MIT, the University of Toronto, New York University, and Evidation Health has proposed a set of “recommendations to data providers, academic publishers, and the ML4H research community in order to promote reproducible research moving forward” in their new paper Reproducibility in Machine Learning for Health.

Reproducibility Crisis in Machine Learning

Just as boxers show their strength in the ring by getting up again after being knocked to the canvas, researchers test their strength in the arena of science by ensuring their work’s reproducibility. If other researchers cannot replicate a study’s findings, the original work will draw doubters and critics. Although reproducibility is an essential part of science, many fields, including machine learning, are now experiencing a reproducibility crisis.

According to a survey of 1,576 researchers conducted by the respected journal Nature in 2016, more than 70 percent of researchers had failed in their attempts to reproduce others’ experiments, and more than half were unable to reproduce even their own experimental results. In the critical field of medicine, only 41 percent of respondents reported taking concrete steps to improve the reproducibility of their research.

This April, organizers of one of the world’s largest AI gatherings, the Neural Information Processing Systems Conference (NeurIPS), updated their paper submission policy to include “a mandatory Reproducibility Checklist for all submissions.”

NeurIPS Reproducibility Checklist

But how can reproducibility be improved? Traditionally, researchers have either repeated their experiments themselves or appointed someone within their lab to test reproducibility. Another approach has been to improve the documentation and standardization of experimental methods.

The MIT et al. researchers argue that merely replicating experimental results is not enough, and propose examining a machine learning study from three different perspectives. If other researchers can replicate the exact technical results of a paper under identical conditions, the study achieves Technical Replicability. The team then adds Statistical Replicability and Conceptual Replicability as further criteria for determining whether a study is fully reproducible.

Unique Challenges for ML4H

Scientists across various disciplines have deployed machine learning approaches to speed up research data analysis. Isaac Kohane, Chair of the Department of Biomedical Informatics in the Blavatnik Institute at Harvard Medical School, explains: “A machine-learning model can be trained on tens of millions of electronic medical records with hundreds of billions of data points without lapses.”

ML4H, however, faces unique challenges in Technical Replicability, Statistical Replicability, and Conceptual Replicability. The researchers combined qualitative arguments with quantitative literature reviews of over 300 papers from different institutions covering ML4H, NLP, CV, and general machine learning, concluding that ML4H “lags behind other subfields of machine learning on various reproducibility metrics.”

Technical Replicability Challenges

Health data is privacy-sensitive. The de facto confidentiality of health data makes it difficult for researchers to release datasets openly without de-identification techniques to guard against malicious use by others. Turning to the few available public datasets won’t help either, because of the risk of dataset-specific overfitting. The researchers found that only half of the ML4H papers they surveyed used public datasets, compared to over 90 percent of CV and NLP papers. And only about 13 percent of the ML4H papers open-sourced their code, compared to 37 percent of CV papers and about half of NLP papers.

Statistical Replicability Challenges

Complex, noisy data types. The researchers quantified how often papers report variance in their results, for example whether a paper lists both the mean and the standard deviation of a performance metric over several random splits. They found that only 38 percent of ML4H papers demonstrated such statistical replicability. The problem persists, they note, because the datasets used in ML4H papers tend to be relatively small, high-dimensional, sparse or irregularly sampled, and suffer from high rates of noise.
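The reporting practice described above, evaluating a model on several random splits and publishing both the mean and the standard deviation of the metric, can be sketched in a few lines. The snippet below is a minimal illustration, not code from the paper; the dataset and the "metric" are toy stand-ins.

```python
# Hedged sketch: report mean ± standard deviation of a performance metric
# over several random splits, instead of a single point estimate.
# The data and evaluate_split() below are illustrative placeholders.
import random
import statistics

def evaluate_split(data, seed):
    """Evaluate on one random split; returns a toy stand-in 'metric'."""
    rng = random.Random(seed)
    sample = rng.sample(data, k=len(data) // 2)  # random half as a mock test set
    return sum(sample) / len(sample)             # placeholder metric in [0, 1]

data = [random.Random(0).random() for _ in range(100)]  # toy dataset
scores = [evaluate_split(data, seed) for seed in range(5)]  # 5 random splits
mean = statistics.mean(scores)
std = statistics.stdev(scores)
print(f"metric: {mean:.3f} ± {std:.3f}")
```

In a real study, `evaluate_split` would train and score a model on each split; the point is simply that variance across splits is measured and reported alongside the average.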

Conceptual Replicability Challenges

Lack of multi-institution datasets in healthcare. Only 19 percent of ML4H papers used multiple datasets in their studies, compared to 83 percent of CV papers and 66 percent of NLP papers. Relying on a single dataset can compromise a study’s conclusions, especially since the goal of ML4H research is real-world deployment, which requires models to function across varied medical care practices. The researchers also attributed the low 19 percent figure to the fact that different medical institutions have different deployment environments and data collection methods.

The researchers propose that placing these three forms of replicability at the heart of future ML4H studies will give stakeholders a clearer picture, and that multi-institution datasets should be made more accessible, since the increasing use of multi-source data will improve conceptual reproducibility. They also call on the ML community and researchers to focus on “expanding our trajectory of statistical rigor.”

The paper Reproducibility in Machine Learning for Health is available on arXiv.