Researchers should take care that they know what questions they're really asking of their data, say two Johns Hopkins scientists.


Scientists have come under fire recently for a lack of reproducible results, stemming from occasional sloppiness and, sometimes, potential fraud. But as reproducibility grabs attention, another question looms on the horizon: Are researchers actually using their data to answer the questions they're asking?

"Public pressure has contributed to the massive recent adoption of reproducible research, with corresponding improvements in reproducibility. But an analysis can be fully reproducible and still be wrong," write Johns Hopkins University biostatistics professors Jeff Leek and Roger Peng in a perspective in Science.


"Once an analysis is reproducible, the key question we want to answer is, 'Is this data analysis correct?'" Leek and Peng write. "We have found that the most frequent failure in data analysis is mistaking the type of question being considered."

Perhaps the most widely known example of that failure occurs when scientists mistake correlation for causation, or, as Leek and Peng describe it, inferential questions for causal ones. Inferential questions ask whether two variables are somehow related, while causal questions ask whether deliberately changing one actually affects the other. As an example, some researchers report a connection between mobile phones and brain cancer, but those studies simply asked cancer patients and healthy people about their past cell phone use. Sure, it's possible that cell phones cause brain cancer, but maybe a cancer diagnosis leads people to report spending more time on their phones than they actually did.

A simpler example is the strange correlation between sports and elections. One rule of thumb: If the Washington football team wins their last home game of the season, the incumbent party stays in the White House. That rule has predicted 17 of the last 19 presidential elections, but that doesn't mean their wins or losses influenced who moved into the White House. Maybe it's the other way around, or maybe it's a coincidence. Answering the inferential question—is there a correlation?—says nothing about the causal question.

Researchers also mistake exploratory data analysis for prediction. That was the case with Google Flu Trends, which in its early days uncovered a possible correlation between Web searches for flu-related words and actual flu diagnoses. When Flu Trends went on to predict future influenza cases from that initial analysis, it failed miserably.

It's a challenge for reporters, and even the general public, too. Earlier this year, Leek and Peng point out, headlines incorrectly proclaimed that two-thirds of cancer cases were the result of "bad luck." That error stemmed from misunderstandings about what the original study had actually claimed—namely, that there was a high correlation between the rate of cell death and regeneration in a given part of the body and the risk of getting cancer.

"Confusion between data analytic question types is central to the ongoing replication crisis, misconstrued press releases describing scientific results, and the controversial claim that most published research findings are false," Leek and Peng write. The only solution, they say, is better education for scientists and scientists-in-training, with an emphasis on knowing the question.