The bedrock of any machine learning system is data. For algorithms to learn to make accurate predictions, they need to be taught with lots of relevant examples. The collection of these examples is called a dataset.
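To make this concrete, here is a minimal sketch of the idea in Python, using the scikit-learn library (the features, labels, and numbers are invented purely for illustration and have nothing to do with the studies discussed here):

```python
from sklearn.linear_model import LogisticRegression

# A toy dataset: each row of X is one example, described by two
# made-up numeric features; each entry of y is that example's label.
X = [[0.1, 1.2], [0.3, 0.9], [2.1, 0.2], [2.4, 0.1]]
y = [0, 0, 1, 1]

# "Teach" the model by showing it the labeled examples...
model = LogisticRegression()
model.fit(X, y)

# ...then ask it to predict the label of a new, unseen example.
print(model.predict([[2.2, 0.3]]))
```

The more numerous and representative the examples, the better the model's chances of making correct predictions on data it has never seen before.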

All the extraordinary-looking studies that Vandewiele encountered were based on the popular Term-Preterm EHG Database. It contained several hundred records, each corresponding to a single pregnancy. Each record, in turn, contained clinical variables like the mother’s age and weight at the time of the visit to the obstetrician, the number of weeks until the actual birth, as well as an electrical signal measured by an electrode placed on the abdomen.
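To picture what such a record might look like in code, here is a hypothetical sketch (the field names are invented for illustration; the actual database stores its records in PhysioNet’s WFDB format):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PregnancyRecord:
    """One record of the Term-Preterm EHG Database, roughly sketched;
    field names are illustrative, not the database's actual labels."""
    maternal_age: float       # mother's age at the recording visit
    maternal_weight: float    # mother's weight at the recording visit
    weeks_to_delivery: float  # weeks between the recording and the birth
    ehg_signal: np.ndarray    # electrical signal from the abdominal electrode
```

The prediction target follows from the last clinical variable: a delivery before 37 weeks of gestation is conventionally labeled preterm, and anything later is labeled term.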

More often than not, the sensitive nature of medical datasets renders them inaccessible to researchers outside the original studies. This makes reproduction efforts extremely complicated, if not impossible. One can thus only imagine the sighs of relief from Vandewiele’s group when they discovered that the necessary dataset was publicly available. A mere click of a button and the data was theirs.

With the data downloaded, it was time to start feeding it into the predictive models outlined in the papers. Ideally, the scientists would have open-sourced their codebases, in which case this step would have amounted to merely running some existing scripts. Unfortunately, keeping research codebases private remains common practice in the artificial intelligence community.

Not ones to recoil from hardship, the group rolled up their sleeves and got to work. They took the article showing the best results and replicated its setup in full. But when they finally ran the analysis, something strange happened: the results were markedly worse than those reported. The predictions were barely better than random!

“Surely we’ve made a mistake,” Vandewiele thought. Yet, after spending days double- and triple-checking each and every line of their code, nothing seemed amiss. Eventually, with their curiosity giving way to frustration, the team wrote off the first experiment and tried to replicate the next paper in line. Same thing. The system worked much worse than advertised. What was going on? Had they stumbled upon a conspiracy?

Now downright maniacal, the team was in beast mode. Articles were being reimplemented left and right. Still, none of the reproductions came even close to achieving the promised near-perfect predictive accuracy. It was as if they were stuck in a Greek tragedy, tortured by a pitiless boulder that turned around and rolled downhill the moment they were about to send it toppling over the mountain’s crest.

Several fruitless months later, with the immense toil of duplicating eleven studies behind them, the team could take no more. But just as they were about to throw in the towel… a breakthrough. With a simple change in how the data was organized before it was fed into the machine learning models, Vandewiele and his collaborators were finally able to obtain results on par with the original studies. The only problem: such a data-handling scheme was fundamentally flawed.

To make sense of the conundrum, we need to take a closer look at the methodology behind building machine learning systems.