Mark Zuckerberg faced questions over data privacy and misinformation during marathon hearings before the US Congress in April 2018. Credit: Andrew Harrer/Bloomberg/Getty

A pioneering research initiative designed to allow independent scientists to access Facebook data has hit a major snag over privacy.

The goal of the project was to enable academic researchers to study how social media is influencing democracies — and to establish a model of collaboration that would allow scientists to take advantage of tech companies’ rich troves of data. But the funders backing the initiative are considering ending their support for the project because privacy issues have prevented Facebook from providing scientists with all the data that they were promised — and it’s not clear when the full data can be made available.

Academic scientists have been increasingly keen to get their hands on data from tech giants, such as Facebook, to conduct independent analyses as concerns about the influence of misinformation circulating on social-media sites plague political processes worldwide. The US-based research initiative — launched in cooperation with Facebook last July in the wake of the Cambridge Analytica scandal — funded 12 projects that were designed to investigate topics such as the spread of fake news and how social media was used in recent elections in Italy, Chile and Germany. Facebook was not involved in selecting which projects received funding.


But issues with the data quickly emerged: Facebook has been able to share some information with researchers, but providing them with more sensitive and detailed data without compromising user privacy proved technically more difficult than the project’s organizers expected.

Last month, the eight charitable funders — which so far have provided a total of up to US$600,000 for the scheme, called the Social Media and Democracy Research Grants programme — gave Facebook until 30 September to provide the full data set, saying that they would otherwise begin winding down the programme. They say that it is impractical to allow researchers to keep bidding for cash while no one knows when the necessary data will become available. The programme’s structure — which included separate bodies to oversee grants and to provide access to the data — had also proved too complex, says Larry Kramer, president of one of the charities, the Hewlett Foundation in Menlo Park, California.

Following the funders’ statement, Facebook has released a further data set, but not the full range originally promised. Now that the deadline has passed, the Hewlett Foundation says that it is working with its partners to assess the next steps for this project and to determine which of the originally approved research proposals can be accomplished. Researchers who have already received money will not be required to return it, and those who are able to complete their studies with the limited data set will continue to receive funding, say the charities.

Other partners that are involved in the project — and have spent a year working with Facebook on data-sharing solutions — say they are continuing their efforts to build a computing infrastructure that allows the company to share its data with researchers, irrespective of the funders’ decisions. The partners will continue to release data sets in the coming weeks, and Facebook has more than 30 people working on the project, says Gary King, a social scientist at Harvard University in Cambridge, Massachusetts, and co-founder of Social Science One, a body that is central to the project. Academics set up this non-profit organization at the outset of the funding programme to act as a ‘data broker’ between Facebook and the researchers in this initiative, as well as future ones.

“To learn about societies, we must go to where the data are,” says King. Although more social-science data exist than ever before, most are tied up in companies and are inaccessible to researchers, he adds. King also notes that the model his team is implementing remains the only plausible model for future collaborations with other technology giants, and that solving the problem of how to get useful data out of companies while maintaining user privacy is essential.

A spokesperson for Facebook told Nature: “This is one of the largest sets of links ever to be created for academic research on this topic. We are working hard to deliver on additional demographic fields while safeguarding individual people’s privacy.”

Data shortcomings

At issue is the amount and type of information that Facebook has been able to give external researchers.

Data sets released so far, for example, include 32 million links, or URLs, each of which has been shared since 1 January 2017 by at least 100 users with their privacy settings set to ‘public’. These links include some valuable information, such as ratings of the page’s trustworthiness as scored by third-party fact-checking sites.


But the company had promised to give researchers access to URLs that had been shared publicly even just once, along with a wider range of demographic data about users. This is a bigger data set of around one billion links, and would include those that were largely shared privately, says Simon Hegelich, a political-data scientist at the Technical University of Munich in Germany, whose team is studying misinformation campaigns that took place during Germany’s 2017 election. Because fake news tends to circulate in links that are shared privately, the data on public shares are not a good proxy for how misinformation spreads in general, says Hegelich. “My impression is that, at least for our project, the data that Facebook is offering is more or less useless,” he adds.

But other scientists funded by the programme say that the data already released are unprecedented and will allow them to achieve at least some of their research goals. “Results from this initiative are promising,” says Magdalena Saldaña, a social scientist at the Pontifical Catholic University of Chile in Santiago. Her team is examining how Facebook users consumed misinformation — and the properties that the untruths had in common — during the 2017 Chilean presidential election campaign. Although they cannot yet, for example, study the demographic profiles of users who tend to get exposed to misinformation, they can determine how content predicts the amount of fake news that is shared, she says.

Trusted party

Facebook does its own research on the impact of information shared on its platform. But academics want to carry out their own studies that are not subject to vetting by the company. This is a problem because, to do such research, external academics often need to access proprietary information, which means that their results would then need the company’s pre-publication approval. The solution was to establish a trusted ‘third party’ — Social Science One — whose members sign non-disclosure agreements with the firm but can advocate on researchers’ behalf. Through a complex legal agreement, the organization acts as a Facebook insider: it is able to see the types of data available and pick interesting sets, which allows researchers to retain academic freedom and receive assurance that they can trust what is released.

But Social Science One encountered a problem almost as soon as the project began. King and his co-founder, Nathaniel Persily at Stanford University in California, thought that researchers could carry out their work using Facebook’s systems. However, the company did not have structures that could be readily adapted to give outside parties access to specific data, says King. “It was like renting out a room if you don’t have a separate entrance — you’d instead have to give keys to the whole house,” he says.


Instead, sharing data with researchers without compromising user privacy required entirely new infrastructure. Working with Facebook, Social Science One has built a secure portal that connects to Facebook’s servers and uses a mathematical technique known as differential privacy, pioneered by Harvard and Microsoft Research computer scientist Cynthia Dwork. This adds noise to the results of analyses, preventing individual users from being identified while leaving the results unbiased. “Differential privacy turned out to be not only useful, but the thing that has to work,” says King.
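The idea can be illustrated with the Laplace mechanism, the canonical differential-privacy technique: before releasing the answer to a query (say, a count of users), noise is added whose scale depends on how much one user could change the answer. The sketch below is purely illustrative and does not reflect Facebook’s actual implementation; the function names and the example query are hypothetical.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution (inverse-CDF method)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one user
    changes the answer by at most 1), so Laplace noise with scale
    1/epsilon is sufficient.
    """
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical query: how many users shared a given URL?
shares = 4213
noisy_shares = private_count(shares, epsilon=0.5)
```

Because the noise has zero mean, aggregate analyses over many such releases remain statistically unbiased, while any single released number reveals little about whether a particular individual is in the data — smaller values of epsilon give stronger privacy at the cost of noisier answers.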

This ‘trusted third party’ model is one that scientists now hope to emulate with other companies, says Jake Metcalf, a technology ethicist at the think tank Data & Society in New York City. Similar systems are used to give researchers access to genetic data, he says. But he adds that social-media data, although less sensitive than medical information, bring an extra privacy challenge in that they are connected to a person’s real-world behaviour. This means that, even if data are anonymized, it is relatively easy to use them to identify individuals, especially if they are cross-referenced with other data, such as those from cell phones, says Metcalf — who is also part of the team conducting ethical reviews for proposals to the scheme.

“Facebook gets the headlines here, but really, the effort has been to build a model for data sharing between social-media platforms and researchers,” says Metcalf. “It’s a very challenging model to achieve.”

Although the grant scheme might have been over-ambitious, its breakdown is not a death knell for the model, he says. “I’m still confident that this is basically the way to move forward.”