Social interaction and data integration in the digital society can affect the control that individuals have over their privacy. Social networking sites can access data from other services, including user contact lists where nonusers are listed too. Although most research on online privacy has focused on inference of personal information of users, this data integration poses the question of whether it is possible to predict personal information of nonusers. This article tests the shadow profile hypothesis, which postulates that the data given by the users of an online service predict personal information of nonusers. Using data from a disappeared social networking site, we perform a historical audit to evaluate whether personal data of nonusers could have been predicted with the personal data and contact lists shared by the users of the site. We analyze personal information of sexual orientation and relationship status, which follow regular mixing patterns in the social network. Going back in time over the growth of the network, we measure predictor performance as a function of network size and tendency of users to disclose their contact lists. This article presents robust evidence supporting the shadow profile hypothesis and reveals a multiplicative effect of network size and disclosure tendencies that accelerates the performance of predictors. These results call for new privacy paradigms that take into account the fact that individual privacy decisions do not happen in isolation and are mediated by the decisions of others.

INTRODUCTION

The networked nature of our digital society fundamentally changes the principles of how we interact (1). One of these is privacy: Using online services carries privacy losses that are not always trivial to perceive and decide upon, either for users or for regulators. Not only does social surveillance allow people to closely watch each other (2), but the data of an individual user can also be used to infer their private attributes (3, 4). From a purely individualist perspective, empowering users to control and price their private information would allow them to balance the benefits, costs, opportunities, and risks of online activity (3, 5). This would hold if individuals used online media in isolation, as was the case in the early days of the Web. However, the ubiquity of social media renders this individual perspective obsolete and can produce collective effects beyond individual decisions and control (6–10). Users are constantly interacting with each other online, leaving large and deep layers of information that can reveal private attributes of others without their awareness (11). It is possible that the control of individuals over their information is progressively being lost through leaking privacy, leaving a trace of private information with each social interaction.

An example of leaking privacy is the phenomenon of shadow profiles: files with private information of a person that online services can generate from the data that the social contacts of that person give to the service (12). Shadow profiles could be constructed without permission or knowledge of the person who is being profiled, who might neither be a user nor have agreed to the terms of the online service that builds the profile. The idea of shadow profiles came to light in 2013, when a bug in Facebook revealed that the mobile phone numbers of some users had been extracted from the phonebooks of their friends but never provided by the users themselves (13). Because many online services have access to user contact lists outside the service (for example, through the permissions of Facebook’s Messenger phone app and its potential connection with WhatsApp), the same inference of personal information could be carried out for people who are not users. To ensure the right to privacy and informational self-determination (14), we need to evaluate whether shadow profiles are a possibility. This question is formalized as the shadow profile hypothesis: The data given by users of an online service predict personal information of nonusers.

Previous research on privacy in social media provides background on the inference of private attributes of users from their online activity (15). Some examples of this line of research are the prediction of gender, age, and political orientation with Twitter data (16), and of sexual orientation and romantic partnerships with Facebook data (17, 18). These predictions build on the information captured by assortativity and homophily in social networks (16), providing evidence that private attributes of users can be predicted when sufficient contextual data are available. These analyses evaluate how some information about a user can be predicted through their activity and the activity of their friends but do not venture to evaluate whether these predictions can be applied to people who are not users of the service. Notable exceptions have applied simulation approaches to investigate the inference of friendships outside Facebook (19) and used friendship signals to infer sexual orientation (9), but we lack an empirical and formal test of the shadow profile hypothesis in a large online social network.

The research gap that has so far prevented the analysis of predictive power over nonusers can be explained by the lack of necessary data outside the private control of the owners of online services. A large company, such as Facebook or Google, could publicly demonstrate that shadow profiles can be built, but these results could easily be in conflict with the company’s interests and business models. Therefore, we need audits by independent researchers to reliably test whether shadow profiles are a possibility. To overcome this challenge and provide a first test of the shadow profile hypothesis, we use the method of Internet Archaeology (8): We study the traces of a disappeared online social network, Friendster, to address a question about its functioning. Here, we test the shadow profile hypothesis against the data that were abandoned in Friendster when it was discontinued as an online social network but captured by the Internet Archive and made available for independent research. We trace back the history of the growth of the social network to evaluate whether information inside the network had predictive power to infer personal information of individuals who were not users at that time, with the aim of empirically testing the shadow profile hypothesis.

We apply principles from network science to gain insight into the structural properties that can explain whether shadow profiles can be built, measuring how personal attributes of users are related to their neighbors. We evaluate a straightforward prediction method for private information on the basis of the data of the friends of a user and then historically evaluate that prediction for nonusers as the network grew. The aim of this article is not to advance the techniques to infer personal information of individuals outside a social network but to measure a lower bound on that predictability and use it to test the shadow profile hypothesis.
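
To make the idea of a neighbor-based prediction concrete, the following minimal Python sketch shows one way such a predictor could look. It is an illustration only and not the method evaluated in this article: the function names, the toy data, the generic labels "a" and "b", and the majority-vote rule (which assumes that connected individuals tend to share the attribute) are all assumptions introduced here.

```python
from collections import Counter


def same_attribute_edge_fraction(edges, attribute):
    """Fraction of friendships joining two users with the same attribute value.

    Only edges whose two endpoints both have a known attribute are counted.
    Returns None when no such edge exists.
    """
    labelled = [(u, v) for u, v in edges if u in attribute and v in attribute]
    if not labelled:
        return None
    same = sum(attribute[u] == attribute[v] for u, v in labelled)
    return same / len(labelled)


def predict_nonuser(nonuser, contact_lists, attribute):
    """Majority vote over the users whose disclosed contact lists name `nonuser`.

    Assumption: the attribute is shared between contacts more often than not.
    Returns None when no user with a known attribute lists the nonuser.
    """
    votes = Counter(
        attribute[user]
        for user, contacts in contact_lists.items()
        if nonuser in contacts and user in attribute
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy data: four users with a disclosed attribute, one nonuser ("eve").
attribute = {"ana": "a", "ben": "a", "cho": "b", "dee": "b"}
edges = [("ana", "ben"), ("cho", "dee"), ("ana", "cho")]
contact_lists = {
    "ana": {"ben", "eve"},
    "ben": {"ana", "eve"},
    "cho": {"dee", "eve"},
    "dee": {"cho"},
}

mixing = same_attribute_edge_fraction(edges, attribute)
guess = predict_nonuser("eve", contact_lists, attribute)
```

In this toy example, two of the three users who list "eve" carry the label "a", so the vote returns "a". A nonuser who appears in no contact list receives no prediction, which is why the number of users and their tendency to disclose contact lists both matter for how many nonusers can be reached.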