I’ve been a Twitter user since March 2009. In that time, I’ve posted more than 32,000 messages to the social media site, and I’ve liked more than 6,000 tweets from other people. More than likely, that means I’ve also been a social science guinea pig. But I’m not special: If you’re a habitual Twitter user, your tweets were probably harvested alongside mine.

“If you’re on Twitter and tweeting publicly, you’re part of a data set somewhere,” said Nick Proferes, a professor of information studies at the University of Kentucky. Twitter is a popular research tool, but it’s also a very new one. Social scientists are still debating whether it’s OK to collect and analyze tweets without users’ knowledge — and what the ethical norms of studying publicly available social media data should be.

Researchers use Twitter for a huge range of subjects: measuring how people cope with global crises and their aftermaths, tracking geographic differences in public health and analyzing the behavior of automated “bot” accounts during the 2016 presidential debates, to name a few. At least 25 billion tweets were collected and analyzed in scholarly research published from 2007 to 2012, according to a paper Proferes published with a colleague in 2014. They counted 382 papers in just the first six years that Twitter existed. Katrin Weller, an information scientist working at the GESIS Leibniz Institute for the Social Sciences in Germany, published a separate paper in 2014 that also tried to count social science research papers dealing with Twitter, and she came up with a similar number. Proferes and Weller agreed that many more Twitter-based papers have been published since.

Twitter is a preferred source of social media data for research because the website is relatively easy and cheap for scientists to use. Twitter is basically set up as a self-publishing platform, said Nina Cesare, a digital data researcher at the University of Washington’s Institute for Health Metrics and Evaluation. She has published research based on Twitter data and a guide for other scientists who want to do the same. Unless a user makes his or her account private — and the vast majority don’t — everything posted is public information, just as if you printed it up on a flyer or yelled it loudly in the town square. But Twitter posts are a lot more useful for data analysis than a guy wandering around screaming, “’Tis 6 o’clock and all is #blessed!”

Using Twitter’s own systems or third-party apps that scrape data from the site, scientists can get free samples of tweets — drawn at random either from the site’s ongoing daily stream of other people’s consciousness or (using keywords and other search parameters) from the archive of public tweets that go back years. With bigger budgets, scientists can pay for larger collections of tweets. And, since Twitter is predominantly a text-based medium, it’s easier to analyze and compare those messages with one another than it would be to study Instagram, which has a similar level of public availability, Cesare said. Earlier this week, Twitter announced that it would begin to put some limits on how many of these requests people can make and which apps they can use to make them — but, in general, Proferes said it’s still one of the easiest social media sites to use for research.

Some researchers have used Twitter to amass huge data sets. Proferes found nine papers published between 2007 and 2012 that were based on collections of more than 1 billion tweets each. Jennifer Van Hook, a professor of sociology and demography at Penn State, is part of a research team that has collected 30 terabytes of geotagged tweets — something the university has promoted as “the largest publicly accessible archive of human behavior in existence.” One use for this data set: improving the quality of other Twitter-based social science research by figuring out how well the population of Twitter matches the population of a given geographic area. Eventually, Van Hook told me, the team hopes to create tools that allow researchers to statistically account for differences between physical and online communities.
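The article does not say which statistical tools Van Hook's team plans to build. One common way to account for a mismatch between an online sample and a real-world population is post-stratification weighting, sketched below. The age groups, shares and per-group rates are all hypothetical and chosen only for illustration. They do not come from the Penn State data set.

```python
# Minimal sketch of post-stratification weighting (an assumed technique,
# not necessarily the one the research team will use).
# All numbers below are hypothetical, for illustration only.

def poststratification_weights(sample_shares, population_shares):
    """Weight each group by (population share / sample share)."""
    return {g: population_shares[g] / sample_shares[g] for g in sample_shares}

def weighted_mean(values, sample_shares, weights):
    """Average a per-group value, reweighting the sample to match the population."""
    numerator = sum(values[g] * sample_shares[g] * weights[g] for g in values)
    denominator = sum(sample_shares[g] * weights[g] for g in values)
    return numerator / denominator

# Hypothetical: share of each age group among tweeters vs. among residents.
twitter_shares = {"18-29": 0.50, "30-49": 0.30, "50+": 0.20}
census_shares = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

# Hypothetical: rate of some behavior measured from tweets, by age group.
rate_by_group = {"18-29": 0.60, "30-49": 0.40, "50+": 0.20}

if __name__ == "__main__":
    weights = poststratification_weights(twitter_shares, census_shares)
    unweighted = sum(rate_by_group[g] * twitter_shares[g] for g in rate_by_group)
    adjusted = weighted_mean(rate_by_group, twitter_shares, weights)
    print(f"Unweighted estimate: {unweighted:.2f}")
    print(f"Reweighted estimate: {adjusted:.2f}")
```

In this made-up example, younger users are overrepresented on Twitter, so the raw estimate leans toward their behavior. Reweighting pulls the estimate toward what the area's actual age mix would imply.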

This is all possible because, when they joined the site, Twitter users agreed that the messages they post and the details they share about themselves are public information. It’s all there in the user agreement.

But research suggests that Americans don’t really read or understand the user agreements we sign off on. Meanwhile, earlier this year, Proferes published results of a survey of 268 people who have public Twitter accounts. Forty-three percent of them didn’t think researchers were permitted to use tweets without permission from the tweeter. Sixty-five percent didn’t think researchers should do that, regardless of whether it was allowed.

Twitter users aren’t always keen to have their tweets studied
From a survey of 268 Twitter users who have public accounts, 2018

| How would you/do you feel … | Very uncomfortable | Somewhat uncomfortable | Neither | Somewhat comfortable | Very comfortable |
|---|---|---|---|---|---|
| … about the idea of tweets being used in research? | 3.0% | 17.5% | 29.1% | 35.1% | 15.3% |
| … if a tweet of yours was used in one of these research studies? | 4.5% | 22.5% | 23.6% | 33.3% | 16.1% |
| … if your entire Twitter history was used in one of these research studies? | 21.3% | 27.2% | 18.3% | 21.6% | 11.6% |

The second question had 267 respondents, instead of 268.
Source: Social Media and Society
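To make the pattern in those survey figures easier to see, the short sketch below adds the "very" and "somewhat" shares on each side of the scale. The percentages are copied from the survey results above. The dictionary labels and function name are illustrative choices, not part of the study.

```python
# Response shares (percent) copied from the survey figures above.
# Order: very uncomfortable, somewhat uncomfortable, neither,
#        somewhat comfortable, very comfortable.
SURVEY = {
    "tweets used in research": (3.0, 17.5, 29.1, 35.1, 15.3),
    "one of your tweets used": (4.5, 22.5, 23.6, 33.3, 16.1),
    "entire Twitter history used": (21.3, 27.2, 18.3, 21.6, 11.6),
}

def summarize(shares):
    """Collapse the five-point scale into uncomfortable / neither / comfortable."""
    very_un, some_un, neither, some_c, very_c = shares
    return {
        "uncomfortable": round(very_un + some_un, 1),
        "neither": neither,
        "comfortable": round(some_c + very_c, 1),
    }

if __name__ == "__main__":
    for question, shares in SURVEY.items():
        s = summarize(shares)
        print(f"{question}: {s['uncomfortable']}% uncomfortable, "
              f"{s['comfortable']}% comfortable")
```

Collapsed this way, about a fifth of respondents were uncomfortable with tweets being used in research in general, but nearly half were uncomfortable with the idea of their entire Twitter history being used.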

Scientists haven’t quite figured out how to deal with this phenomenon yet. “This isn’t a discipline where a textbook tells you what to do,” Weller told me. The ethical rules about using social media data are still being hashed out.

Some scientists, like Van Hook, don’t see a problem with taking tweets and metadata and analyzing that information without explicitly informing the tweeters. To her mind, this is really no different than observing things that exist in public spaces, like counting the number of participants in a protest march or reading signs people have chosen to post in their front yards. Institutional review boards, the authorities that govern research ethics at universities and research institutions, generally agree. Most do not even count a study of tweets as human subjects research, a status typically associated with medical studies that offers participants special ethical and privacy protections.

But this kind of digital-commons research can have real risks for participants, Cesare said. She pointed to a case where Harvard sociologists had put together an anonymized database based on Facebook profiles that they were using to study how culture and race affected relationships between people. But then they made the data set public for other researchers to use. “Turns out that likes and ‘about me’s’ and all these pieces of metadata could be reassembled and reaggregated in a way that allowed you to figure out where they went to school, what dorm they lived in and who they were,” Cesare said. She believes this same sort of risk exists for metadata pulled from Twitter, especially in the case of tweets that have been tagged with geospatial data.

It’s possible to anonymize and aggregate. Researchers can also choose not to share their data sets and are, in many cases, required not to. But if you keep the data set private, you basically make it impossible for your research to be replicated. Anyone wanting to replicate would have to request their own tweet sample, and there’s no guarantee that the random sample another scientist gets will be the same or produce the same results, Weller said. More frustratingly, even if scientists do want to give Twitter users a say in the process, there’s really no practical way to do that — not when you’re talking millions of tweets from tens of thousands of users.

Proferes said he’s seen a shift in the past couple of years toward more researchers having conversations about how to solve these conflicts. But for now, this is an area where scientists are doing research while simultaneously debating how that research should be done. “We’re at this crossroads,” Cesare said. “Do we just adjust the codified standards to accommodate the structure … behind these new data sources, or do we have to rethink our framework entirely based on what we’re seeing online?”