Policymakers often draw on the work of social scientists to predict how specific policies might affect social outcomes such as employment or crime rates. The idea is that if they can understand how different factors might change the trajectory of someone’s life, they can propose interventions to promote the best outcomes.

In recent years, though, policymakers have increasingly relied upon machine learning, which promises to produce far more precise predictions by crunching far greater amounts of data. Such models are now used to predict the likelihood that a defendant might be arrested for a second crime, or that a child is at risk of abuse and neglect at home. The assumption is that an algorithm fed with enough data about a given situation will make more accurate predictions than a human or a more basic statistical analysis.

Now a new study published in the Proceedings of the National Academy of Sciences casts doubt on how effective this approach really is. Three sociologists at Princeton University asked hundreds of researchers to predict six life outcomes for children, parents, and households using nearly 13,000 data points on over 4,000 families. None of the researchers got even close to a reasonable level of accuracy, regardless of whether they used simple statistics or cutting-edge machine learning.

“The study really highlights this idea that at the end of the day, machine-learning tools are not magic,” says Alice Xiang, the head of fairness and accountability research at the nonprofit Partnership on AI.

The researchers used data from a 15-year-long sociology study called the Fragile Families and Child Wellbeing Study, led by Sara McLanahan, a professor of sociology and public affairs at Princeton and one of the lead authors of the new paper. The original study sought to understand how the lives of children born to unmarried parents might turn out over time. Families were randomly selected from among those with children born in hospitals in large US cities during the year 2000. Researchers followed up with them to collect data when the children were 1, 3, 5, 9, and 15 years old.

McLanahan and her colleagues Matthew Salganik and Ian Lundberg then designed a challenge to crowdsource predictions of six outcomes, measured in the final phase of data collection, that they deemed sociologically important. These included the children’s grade point average at school; their level of “grit,” or self-reported perseverance in school; and the overall level of poverty in their household. Challenge participants from various universities were given only part of the data to train their algorithms, while the organizers held some back for final evaluations. Over the course of five months, hundreds of researchers, including computer scientists, statisticians, and computational sociologists, then submitted their best techniques for prediction.
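To make the held-back evaluation concrete, here is a minimal Python sketch of the general idea: split the records into a training portion and a holdout portion, then score predictions only on the holdout. The function names, the 25% holdout fraction, the synthetic outcome values, and the choice of an R²-style score against a "predict the training mean" baseline are all assumptions made for illustration; the article does not specify how the organizers split or scored the data.

```python
import random

def split_train_holdout(records, holdout_fraction=0.25, seed=0):
    """Shuffle records and split them into (train, holdout) lists.

    The 25% default is an arbitrary illustrative choice.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def r_squared_vs_baseline(y_true, y_pred, baseline_value):
    """Return 1 - (model squared error / baseline squared error).

    A score of 0 means the predictions are no better than always
    guessing baseline_value; a score of 1 means perfect prediction.
    """
    ss_model = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_base = sum((t - baseline_value) ** 2 for t in y_true)
    return 1.0 - ss_model / ss_base

# Synthetic outcomes invented for illustration; not data from the study.
rng = random.Random(42)
outcomes = [rng.gauss(3.0, 0.5) for _ in range(200)]

train, holdout = split_train_holdout(outcomes)
train_mean = sum(train) / len(train)

# The simplest possible "model": predict the training mean for everyone.
baseline_preds = [train_mean] * len(holdout)
score = r_squared_vs_baseline(holdout, baseline_preds, train_mean)
print(f"Score of the mean-only baseline on the holdout set: {score:.2f}")
```

Scoring on data the participants never saw is what keeps a model from looking accurate simply because it memorized its training examples.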

The fact that no submission was able to achieve high accuracy on any of the outcomes confirmed that the results weren’t a fluke. “You can't explain it away based on the failure of any particular researcher or of any particular machine-learning or AI techniques,” says Salganik, a professor of sociology. The most complicated machine-learning techniques also weren’t much more accurate than far simpler methods.

For experts who study the use of AI in society, the results are not all that surprising. Even the most accurate risk assessment algorithms in the criminal justice system, for example, max out at an accuracy of 60% or 70%, says Xiang. “Maybe in the abstract that sounds somewhat good,” she adds, but reoffending rates can be lower than 40% anyway. That means simply predicting that no one will reoffend would already be right more than 60% of the time.
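The arithmetic behind that baseline can be sketched in a few lines of Python. The function name and the example rates of 40% and 35% are illustrative assumptions chosen to match the "lower than 40%" figure Xiang cites; they are not measurements from any particular jurisdiction.

```python
def majority_baseline_accuracy(positive_rate: float) -> float:
    """Accuracy of always predicting whichever outcome is more common."""
    if not 0.0 <= positive_rate <= 1.0:
        raise ValueError("positive_rate must be between 0 and 1")
    return max(positive_rate, 1.0 - positive_rate)

# Illustrative reoffending rates only (assumed, not taken from a dataset).
for reoffense_rate in (0.40, 0.35):
    baseline = majority_baseline_accuracy(reoffense_rate)
    print(
        f"Reoffending rate {reoffense_rate:.0%}: always predicting "
        f"'no reoffense' is right {baseline:.0%} of the time"
    )
```

An algorithm that is 60% or 70% accurate therefore has to be judged against this do-nothing baseline, not against zero.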

Likewise, research has repeatedly shown that in contexts where an algorithm is assessing risk or choosing where to direct resources, simple, explainable algorithms often have close to the same predictive power as black-box techniques like deep learning. The added benefit of the black-box techniques, then, is not worth the large cost in interpretability.

The results do not necessarily mean that predictive algorithms, whether based on machine learning or not, will never be useful tools in the policy world. Some researchers point out, for example, that data collected for the purposes of sociology research is different from the data typically analyzed in policymaking.

Rashida Richardson, policy director at the AI Now Institute, which studies the social impact of AI, also notes concerns about the way the prediction problem was framed. Whether a child has “grit,” for example, is an inherently subjective judgment that research has shown to be “a racist construct for measuring success and performance,” she says. The detail immediately tipped her off to thinking, “Oh there’s no way this is going to work.”

Salganik also acknowledges the limitations of the study.

But he emphasizes that it shows why policymakers should be more careful about evaluating the accuracy of algorithmic tools in a transparent way. “Having a large amount of data and having complicated machine learning does not guarantee accurate prediction,” he adds. “Policymakers who don't have as much experience working with machine learning may have unrealistic expectations about that.”
