By Amy Orben and Andrew Przybylski

Behavioural scientists have long assumed data analysis to be automatically objective: that they could open a dataset, ‘have a look at the data’ and report the results they find. Our paper shows that doing this, especially with powerful large-scale datasets, will produce many statistically significant results that flag non-existent effects (‘false positives’), misleading the scientific literature and social policy in the process.

We started working on this paper after realising how easy it was to find headline-worthy significant results in large-scale public datasets. While many behavioural scientists assume that studies with large numbers of participants have high scientific quality, we found that this is fundamentally untrue. Studies that examine large-scale datasets without the necessary precautions are built on exceptionally unstable foundations.

For our work, we focused on the link between adolescent well-being and digital technology use. To begin answering this question, we first needed to define well-being. The datasets we, like others, work with contain many hundreds of questions, including over a dozen measures that could relate to well-being. Without specifying how one would measure well-being before accessing the data, it is easy to home in on a definition by trial and error. Indeed, we found that most papers focused on diverse subsets of the available measures. With the measurement of well-being being just one of many decisions a researcher has to make during the analysis process, there are often thousands, if not millions, of analyses that could have been run to answer just one research question.
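To see how quickly these choices multiply, consider a rough sketch. The counts below are illustrative assumptions, not the paper's actual numbers: with a dozen candidate well-being measures, any non-empty subset could serve as a definition, and each further analytic choice multiplies the total.

```python
from itertools import combinations

# Hypothetical counts for illustration only.
MEASURES = 12          # e.g. a dozen candidate well-being measures

# Any non-empty subset of the measures could define "well-being".
definitions = sum(1 for k in range(1, MEASURES + 1)
                  for _ in combinations(range(MEASURES), k))
assert definitions == 2**MEASURES - 1   # closed form: 4095 subsets

CONTROL_SETS = 4       # hypothetical choices of control variables
TECH_MEASURES = 6      # hypothetical measures of technology use
total = definitions * CONTROL_SETS * TECH_MEASURES
print(total)           # 98280 distinct analyses from just three choice points
```

Just three choice points already yield tens of thousands of defensible analyses; add choices about samples, transformations and model types, and millions are within reach.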

Moreover, as these datasets include many thousands of participants, they can detect extremely small effects. It is therefore almost impossible not to find statistically significant effects, especially if one can choose from multiple analysis pathways. This flexibility enables researchers to quickly, and unknowingly, find statistically significant results. These will garner press and policy interest, even though they most likely do not constitute effects that merit such coverage.
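A small worked example (with illustrative numbers of our own choosing, not figures from the paper) shows why sample size alone manufactures significance. The t-statistic for a Pearson correlation is t = r·sqrt((n−2)/(1−r²)), so a fixed, trivially small r crosses the significance threshold once n is large enough:

```python
import math

def corr_p_value(r: float, n: int) -> float:
    """Two-sided p-value for a Pearson correlation r at sample size n,
    using a normal approximation to the t distribution (fine for large n)."""
    t = r * math.sqrt((n - 2) / (1 - r**2))
    return math.erfc(abs(t) / math.sqrt(2))  # 2 * P(Z > |t|)

# The same tiny correlation (r = 0.03) at two sample sizes:
print(corr_p_value(0.03, 10_000))  # well below 0.01: "significant"
print(corr_p_value(0.03, 100))     # well above 0.05: not significant
```

The effect is identical in both cases; only the sample size, and hence the press release, differs.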

In our paper we use a novel method to examine and visualise the effect of such analytical flexibility when using large-scale data. Instead of running one statistical analysis, we ran hundreds of thousands.

The article outlines the range of results we could have found answering one research question: what is the correlation between adolescent well-being and digital technology use? Many thousands of theoretically defensible analysis options revealed negative correlations, each of which could have been a paper claiming negative technology effects. Thousands of others produced non-significant (and possibly unpublishable) results or potentially overly optimistic positive effects. Different scientific papers with different conclusions could therefore have been written just by analysing the same datasets in slightly different ways.
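The idea of running every defensible analysis can be sketched in a few lines. This is a toy simulation of our own (the variable names, measures and specifications are assumptions for illustration, not the paper's actual data or code): we enumerate combinations of analytic choices, run the same correlation for each, and collect the spread of estimates.

```python
import itertools
import random
import statistics

random.seed(1)
N = 5000
tech = [random.gauss(0, 1) for _ in range(N)]

# Three hypothetical well-being measures, each only weakly (or not at all)
# related to technology use in the simulated data.
measures = {
    "life_satisfaction": [random.gauss(-0.02 * t, 1) for t in tech],
    "self_esteem":       [random.gauss(-0.01 * t, 1) for t in tech],
    "mood":              [random.gauss(0.0, 1) for _ in tech],
}

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

# One specification = one choice of which measures to average into "well-being".
results = []
for names in itertools.chain.from_iterable(
        itertools.combinations(measures, k) for k in (1, 2, 3)):
    outcome = [statistics.fmean(vals)
               for vals in zip(*(measures[n] for n in names))]
    results.append((names, correlation(tech, outcome)))

for names, r in results:
    print(f"{'+'.join(names):45s} r = {r:+.3f}")
```

Even this toy version, with only one choice point, yields seven different estimates from the same dataset; the paper's specification curve does the same across every defensible combination of choices, producing the full distribution of results a researcher could have reported.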

Although statistical significance is often used as an indicator that findings are practically important, the paper moves beyond this surrogate to put its findings in a real-world context. In one dataset, for example, the negative association between wearing glasses and adolescent well-being is larger than that of social media use. Yet policymakers are currently not contemplating pumping billions into interventions that aim to decrease the wearing of glasses.

Behavioural scientists need to acknowledge the dangers of analysing large-scale datasets with traditional ‘let’s just have a look’ approaches. In the area of social media research, such a change would ensure that money and time aren’t wasted on research or policy campaigns built on nothing but hot air, large datasets and a little analytical flexibility.

Photo from Nokton (Flickr)