The web is catalyzing a quiet revolution in social science and behavioral research. Researchers are using it as a source of detailed information about humans, and the variety and amount of available data may yield new insights into human behaviors. Users of online services, such as Facebook and Twitter, generate large amounts of personal information as they create profiles, communicate with others, share video and images, leave comments on blogs, and otherwise interact with the web. Researchers can easily extract this information using automated software tools to build datasets for analysis, increasing the speed at which large datasets can be compiled while simultaneously reducing the burden of managing them. The richness and quantity of the data promise to give researchers “the capacity to collect and analyze with an unprecedented breadth and depth of scale.” Despite the provocative insights that may result from this new vein of data, these emerging practices fall into a category of human subjects research in which the legal and ethical standards are unclear. At the heart of the matter are some difficult questions about the boundaries between public and private information.

In the US, certain types of academic research on humans are governed by regulations known as the Common Rule. These regulations are designed to decrease the risk of psychological and physical harms to human subjects by requiring academic institutions to establish Institutional Review Boards (IRBs) and oversight programs to review and approve research studies conducted at their institutions. For example, a researcher might be required to disclose to potential subjects the nature of the study, obtain meaningful consent, and put in place controls to protect the security and confidentiality of any sensitive data collected from subjects.

Not all research on humans is subject to the Common Rule.

Studies that use only publicly available information, which potentially includes information mined from the web, have long been an exempt category. The term public, in this sense, is synonymous with benign: if the information was collected from a public source, analyzing and disseminating it is not considered harmful to the person to whom it pertains. More importantly, if a research study falls into this category, institutions are not required to oversee it. The public-private distinction in the Common Rule owes its origins to privacy law, which has traditionally held that public information is not subject to privacy protections.

While the public-private distinction made sense for a world in which the costs of obtaining information were greater, one might question whether it remains sensible. For instance, it does not account for the potential vulnerabilities of users who might become unwilling participants in studies, or the techno-social privacy norms to which communities adhere as they publish and share information online. As evidenced in recent studies, there is a startling disconnect between user expectations of privacy on the internet and their legal realities. Users may not fully understand or appreciate the consequences of their actions when they decide to publish, only to later regret them without recourse. Normative behaviors around privacy on the web have also been shown to be more nuanced than previously thought. Recent scholarship has illustrated how community norms and the specific contexts in which individuals share information play an important role in shaping user notions of privacy. Although they choose to publish, these users may believe their audience is limited to only those they target or to a community that will respect normative boundaries. This leaves room for users to feel violated when their information is taken from one context and placed in a new context that they did not anticipate.

On the other hand, the web is often described as a one-to-many or many-to-many medium in which people are interconnected with indistinct boundaries in “networked publics.” Most information published to the web is indexed and discoverable by search engines, and it is capable of being copied and stored indefinitely. Once information is published to the web, the user effectively relinquishes control over it, making it ostensibly public. In many respects, these characteristics lend support to the argument that information on the web should be considered public, regardless of what users expect or intend. Indeed, they are often cited by researchers and IRBs as the basis for why studies that use information mined from the web are currently subject to minimal oversight in practice.

Changes to the public-private distinction may be on the horizon

Ethicists and researchers are increasingly flagging the lack of clear ethical guidance as problematic, and changes to the public-private distinction as used in research may be on the horizon. In 2013 the US Department of Health and Human Services, the agency responsible for administering the Common Rule, released draft guidelines that urge researchers to note “expressed norms or requests in a virtual space, which — although not technically binding — still ought to be taken into consideration.” A number of associations within the field of social science have also begun crafting new ethical guidelines for data mining practices which, among other things, suggest deliberative decision-making processes to help researchers identify potential risks on a case-by-case basis. While these developments give some clues to the trajectory of new policy and ethics, more research and public debate are needed to better understand the potential risks to users.

This essay first appeared in the Internet Monitor project’s second annual report, Internet Monitor 2014: Reflections on the Digital World, and was developed in collaboration with the Privacy Tools for Sharing Research Data project at Harvard University. The report, published by the Berkman Center for Internet & Society, is a collection of roughly three dozen short contributions that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year. The Privacy Tools for Sharing Research Data project is supported by NSF grant CNS-1237235. Illustration by Willow Brugh.