When Alex Stamos describes the challenge of studying the worst problems of mass-scale bad behavior on the internet, he compares it to astronomy. To chart the cosmos, astronomers don't build their own Hubble telescopes or Arecibo observatories. They concentrate their resources in a few well-situated places and share time on expensive hardware. But when it comes to tackling internet abuse ranging from extremism to disinformation to child exploitation, Stamos argues, Silicon Valley companies and academics are still trying to build their own telescopes. What if, instead, they shared their tools—and more importantly, the massive data sets they've assembled?

That's the idea behind the Stanford Internet Observatory, part of the Stanford Cyber Policy Center where Stamos is a visiting professor. Founded with a $5 million donation from Craigslist creator Craig Newmark, the Internet Observatory aspires to be a central hub for the study of all manner of internet abuse, assembling for visiting researchers the necessary machine learning tools, big data analysts, and perhaps most importantly, access to major tech platforms' user data—a key to the project that may hinge on which tech firms cooperate and to what degree.

"People have to stand up and be patriots. That means platforms and researchers and funders."
Craig Newmark, philanthropist

As an example of the sorts of phenomena the observatory hopes to study and learn to prevent, Stamos points to political disinformation of the kind that roiled the 2016 presidential election during his time at Facebook, a problem that has become the most glaring example of Silicon Valley's blind spots around abuse of their services. "Misinformation is not just a computer science problem. It's a problem that brings in political science, sociology, psychology," Stamos says. "Part of the idea of the Internet Observatory is to build a place for these people to work together, and we want to build the infrastructure necessary to allow all the different parts of the political and social sciences to study what’s happening online."

Stamos says the observatory is currently negotiating with tech firms—he names Facebook, Google, Twitter, YouTube, and Reddit as examples—that it hopes will offer access to user data via API in real time and in historical archives. The observatory will then share that access with social scientists who might have a specific research project but lack the connections or resources to grapple with the immensity of the data involved. Stamos hopes that his data clearinghouse might lower the technical barriers that social scientists face when they try to study users on the internet at scale.

"They have to have a grad student write Python, they have to spend months negotiating data access agreements with tech companies, they have to build a bunch of data science infrastructure," Stamos says. "We're trying to do that work once and offer it to all these people."
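Stamos doesn't spell out what that shared infrastructure would look like, but the grad-student Python work he describes is typically repetitive plumbing of this kind. The sketch below shows a generic helper for walking a cursor-paginated data feed; the `fetch_page` interface and its return shape are purely hypothetical, not any real platform's API.

```python
def fetch_all(fetch_page, cursor=None, max_pages=1000):
    """Collect every item from a cursor-paginated data source.

    `fetch_page(cursor)` must return `(items, next_cursor)`, where
    `next_cursor` is None once the final page has been reached.
    This is the kind of boilerplate each research group otherwise
    rewrites from scratch; the interface here is illustrative.
    """
    items = []
    for _ in range(max_pages):  # guard against runaway pagination
        batch, cursor = fetch_page(cursor)
        items.extend(batch)
        if cursor is None:
            break
    return items
```

A researcher would wrap a platform's real endpoint, with its authentication and rate limiting, inside `fetch_page`; the loop itself never has to change, which is the point of building it once.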

First, Get the Data

But negotiating that access to data may not be an easy sell, even for someone with as many Silicon Valley connections as Stamos. Facebook has been wary of data-sharing agreements with academics since the Cambridge Analytica scandal, a privacy debacle that unfolded under Stamos' watch and for which the FTC announced a $5 billion fine against the company just yesterday. The European Union's General Data Protection Regulation also limits what sort of data tech firms can share about European users. And concentrating all that access in a single organization could make it a significant target for hackers.

WIRED reached out to Twitter, Google, Facebook, and Reddit about the observatory's plan. Twitter and Reddit declined to comment, though a Reddit spokesperson said the company hadn't been approached to share its data. Facebook and Google didn't respond.

As a model for how those data-sharing deals can actually be struck, though, Stamos points to another project known as Social Science One. Created in April 2018, that initiative hammered out a deal with Facebook to access some of its user data as part of its efforts to combat disinformation aimed specifically at influencing democratic elections. The arrangement uses a form of so-called differential privacy, a still-developing class of tools that allow data to be queried in aggregate while limiting the details included in responses, so that no uniquely identifying information about any individual is ever shared.
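Social Science One's actual machinery is far more elaborate, but the core idea behind a differentially private aggregate query can be sketched in a few lines. This is a minimal illustration of the classic Laplace mechanism, not the system Facebook deployed, and every name in it is hypothetical: calibrated random noise is added to a count so that the aggregate stays useful while the presence or absence of any single person barely changes the answer.

```python
import math
import random

def laplace_noise(scale):
    # Draw from a Laplace(0, scale) distribution via inverse CDF sampling.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon=1.0):
    """Answer 'how many records satisfy predicate?' with noise added.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon makes the
    released number epsilon-differentially private: researchers get a
    usable aggregate, but no single record can be inferred from it.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller values of `epsilon` mean more noise and stronger privacy; real deployments spend this "privacy budget" carefully across all the queries a researcher is allowed to run.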