Members of social network platforms often choose to reveal private information, and thus sacrifice some of their privacy, in exchange for the manifold opportunities and amenities offered by such platforms. In this article, we show that the seemingly innocuous combination of knowledge of confirmed contacts between members on the one hand and their email contacts to non-members on the other hand provides enough information to deduce a substantial proportion of relationships between non-members. Using machine learning we achieve an area under the (receiver operating characteristic) curve (AUC) of at least for predicting whether two non-members known by the same member are connected or not, even for conservative estimates of the overall proportion of members, and the proportion of members disclosing their contacts.

Funding: EAH and KAZ are supported by the Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences ( http://www.mathcomp.uni-heidelberg.de/ ), University of Heidelberg, Germany, which is funded by the German Excellence Initiative (GSC 220). FAH and KAZ were supported by a fellowship of the Marsilius Kolleg ( http://www.marsilius-kolleg.uni-heidelberg.de/ ), University of Heidelberg. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2012 Horvát et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Inference of undisclosed, unobserved, or future contacts (the “edges”) between people or agents (the “nodes”) is known as the “link prediction” problem [5]–[7]. It is a difficult problem mainly because the imbalance between possible and realized future edges is extremely high in most cases [8], [9]. In contrast, the prediction of some properties of given links, e.g. the sign of the weight on a link [10], is simpler because the problem is typically more balanced. Link prediction has mostly been approached with unsupervised [11] and, more recently, also with supervised learning methods [12]–[14]. Inference has been based either solely on structural measures derived from the network topology [15], [16] or additionally on the nodes’ attributes [17]–[19]. The most common setting of the link prediction problem is, given an evolving network at an early stage, to predict newly acquired edges at a later stage. The success of link prediction has usually been estimated by cross-validation within the same network [20], [21]. This typically implies a dependence between training and test data and, hence, an overly optimistic estimate of the accuracy of an algorithm. To our knowledge, we present the first link prediction work where learning and testing are performed on entirely independent networks.

Some individuals prefer to keep intimate details such as their political preferences or sexual orientation private. Recent results suggest that such details can nonetheless be inferred with high probability if a sufficient number of their confirmed contacts in a social network choose to reveal their own details [1]–[4]. As a consequence, some of the more circumspect choose to stay away from social network platforms such as Facebook in the belief that this will help protect their privacy. In this article, we show that such an assumption is no longer valid: with the help of machine learning, social network operators can predict with a high rate of success whether or not two non-members are acquainted. To our knowledge these are the first results on the potential of social network platforms to infer relationships between non-members.

Methods

The Problem

All members of society can be seen as nodes in an unobservable social graph. This latent social graph is dynamic and extremely complex, with edges of widely differing quality (two people may be kindred, or engaged, or work together, they may like or dislike each other, etc.). From the point of view of a social network platform like Facebook, the set of all people can be divided into a fraction of members and the remaining fraction of non-members. The multi-faceted relationships between people are much simplified, in an extreme case into mere binary form: two members may declare a “friendship”, which is then represented by an edge in the set of member-to-member edges. In reality, social networks typically have access to more information that allows them to estimate the quality of an edge (its strength, its asymmetry, etc.), especially if they integrate a messaging service. Additionally, a fraction of all platform members may also share their non-member contacts, e.g., by uploading their email address book (Figure 1). Social network platforms then have direct access to two different sets of relationships: on the one hand, the mutually confirmed contacts between platform members (the member-to-member edges); and on the other hand, their members’ unilateral declarations of their acquaintance with non-members (the member-to-non-member edges). The edges in both sets are an abstraction and a subset of the edges in the latent social graph. The central question of this article is to what extent the acquaintance of two people who are both non-members can be predicted.


Figure 1. Definitions and examples. Any social network platform divides society into two sets: the set of members (black nodes) and the set of non-members. In our toy example, a fraction of the individuals are members. The relevant subset of non-members (red nodes) that are in contact with at least one member is distinguished from other non-members (gray nodes). A fraction of the members have disclosed their outside social contacts. The knowledge of the set of edges between members (black, bi-directed) and the set of edges (green) to non-members is enough to infer a substantial fraction of edges between non-members (red edges). https://doi.org/10.1371/journal.pone.0034740.g001

For the very reason that the latent social graph is fundamentally unobservable, both we and a social network operator with aims similar to those described here need to impute the missing information before a machine learning procedure can be applied. The approach we choose is to use the observed part of a social network (say, the Facebook network of all students at a given university) and presume it represents the complete (and unobservable) social graph of a hypothetical community. In other words, the edges in this social graph are considered the ground truth. We then proceed to partition this community into a set of members and non-members by a number of member recruitment models outlined below, which represent a broad range of potential strategies by which people choose to become members. Finally, we predict the existence or otherwise of an edge between any two non-members and evaluate the accuracy of these predictions with respect to the ground truth.

Ground Truth Imputation

In line with what would be available to a social network operator, we use real social networks, in this case real-world Facebook friendship networks representing the students from five different US universities [22]. Figure 2 compares their number of members, their average degree (the average number of friends a member has), their density (the fraction of all possible friendships that are actually present), and their average clustering coefficient (the average probability that two friends of a member are friends themselves [23]).
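The statistics compared in Figure 2 can be illustrated with a minimal sketch. It assumes the standard definitions (density as realized over possible edges, local clustering averaged over all nodes, with nodes of degree below two counted as zero); the function name `network_stats` and the adjacency-dictionary representation are illustrative choices, not taken from the article.

```python
from itertools import combinations


def network_stats(adj):
    """Basic statistics of an undirected graph.

    adj maps each node to the set of its neighbours (no self-loops).
    Assumption: nodes with fewer than two neighbours get clustering 0.
    """
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is seen twice
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count pairs of neighbours that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return {
        "n": n,
        "avg_degree": 2 * m / n,
        "density": 2 * m / (n * (n - 1)),
        "avg_clustering": sum(coeffs) / n,
    }


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
stats = network_stats(toy)
```

On real data sets of the size used here, a graph library would be the more practical choice; the sketch only makes the definitions concrete.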


Figure 2. Comparison of basic network analytic statistics of the five data sets obtained from Traud et al. [22] https://doi.org/10.1371/journal.pone.0034740.g002

The ground truth imputation comprises three steps. In the first step the platform penetration percentage is modeled. The percentage of members in a given population varies strongly with the type of social network platform and the social community of interest. According to Facebook, it had more than 800 million active users in November 2011 [24], while the number of internet users worldwide is estimated at over 2 billion [25]. Thus, roughly 40% of all internet users are active Facebook members. Around of all Facebook users are US citizens. Assuming that each Facebook account represents one individual, we can estimate that over of all North American internet users are members of Facebook. For certain social strata the percentage of Facebook users is known exactly: one study showed that already back in 2005 over of the undergraduates of the Carnegie Mellon University were members of the platform [26]. Later, in 2009 around of all interviewed students of the University of Illinois at Chicago [27] and of a polled contemporary Canadian sample [28] were Facebook members. The platform penetration parameter thus reflects different membership densities and allows us to model different social network platforms and their acceptance in different communities. Given a real Facebook friendship network and a choice of platform penetration, the second step of the ground truth imputation is to partition the nodes of the network into members and non-members. For this we need models of how people choose to become members of a platform, which we call member recruitment models. 
An analysis of the evolution of online social networks [29] suggests that a network platform recruits its members through a mixture of online mediated invitations by friends who already are members, and independent decisions by individuals who are not yet friends of a member. Since the actual member recruitment process is unknown and probably also depends on the group of people that is considered (e.g. college students vs. employees), we have emulated the growth of social network platforms using processes ranging from strongly dependent to purely independent decisions. All models start by labeling a node chosen uniformly at random as the first member. Strongly dependent decisions are modeled by processes in which only people who know at least one member will join the network. In a breadth-first search (BFS) model, all friends of the first member are labeled as members, after which all their friends are labeled, and so on. In a depth-first search (DFS) model, a randomly chosen friend of the first member joins the platform, after which a randomly chosen friend of the new member joins, and so on recursively. Less dependent decisions are modeled by a random walk (RW), which is restarted from a new node as soon as the chosen friend of a new member turns out to be a member already. The ego networks selection (EN) model combines the independent decisions of some randomly chosen seed members with the dependent decisions of their direct friends. Purely random selection of members (RS) is based entirely on independent decisions, modeled by the random selection of a set of members. These member recruitment models are described in more detail in Text S1. For an analysis of the structural properties of the partitions obtained with different member recruitment models see Figures S1, S2, and S3. Figure 3 shows the resulting partitions of a toy graph under all five models. We show below that our main findings are robust with respect to the specific choice of the member recruitment model. 
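Three of the five recruitment models (BFS, RW and RS) can be sketched as follows to illustrate the range from dependent to independent decisions. This is not the authors' implementation: Text S1 is not reproduced here, so stopping at a target member count, drawing restart nodes uniformly from the current non-members, and all function names are assumptions made for illustration.

```python
import random
from collections import deque


def recruit_bfs(adj, target, rng):
    """Friends of members join layer by layer (strongly dependent)."""
    members = set()
    while len(members) < target:
        # Assumption: if a component is exhausted, restart at a random non-member.
        start = rng.choice(sorted(set(adj) - members))
        members.add(start)
        queue = deque([start])
        while queue and len(members) < target:
            current = queue.popleft()
            for friend in sorted(adj[current]):
                if len(members) == target:
                    break
                if friend not in members:
                    members.add(friend)
                    queue.append(friend)
    return members


def recruit_random_walk(adj, target, rng):
    """Random walk that restarts when it steps onto an existing member."""
    members = set()
    current = None
    while len(members) < target:
        if current is None:
            # (Re)start from a node that is not yet a member.
            current = rng.choice(sorted(set(adj) - members))
            members.add(current)
            continue
        friends = sorted(adj[current])
        step = rng.choice(friends) if friends else None
        if step is None or step in members:
            current = None  # the chosen friend is already a member: restart
        else:
            members.add(step)
            current = step
    return members


def recruit_random(adj, target, rng):
    """Purely independent decisions: a uniformly random set of members."""
    return set(rng.sample(sorted(adj), target))


def partition(adj, penetration, model, seed=0):
    """Split the nodes into (members, non_members) for a given penetration."""
    rng = random.Random(seed)
    target = max(1, round(penetration * len(adj)))
    members = model(adj, target, rng)
    return members, set(adj) - members


# Toy usage on a ring of ten people, half of whom become members.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
members, non_members = partition(ring, 0.5, recruit_bfs, seed=1)
```

The DFS and EN models follow the same pattern and are omitted to keep the sketch short.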

Figure 3. Membership propagation in a toy example according to different propagation models. Note that real social networks exhibit more long-range edges. The examples, drawn for a single platform penetration value, show the nodes from which the propagation started (black nodes with white core). Other members are marked black and relevant non-members red; for ease of reading arrows are not displayed, but black edges are bidirectional while green edges point from black to red nodes. With BFS and DFS the network is explored starting from one node (denoted by a white circle); with RW and EN there are more nodes from which the propagation is launched; and finally, for RS all selected nodes can be seen as starting nodes. https://doi.org/10.1371/journal.pone.0034740.g003

In the third step, a so-called disclosure parameter is chosen to model the probability with which a member opens her email address book to the platform, i.e. reveals her email contacts. We assume that a member who reveals her email contact list thereby shares all of her contacts, which is justified by platform features that allow the entire email address book to be uploaded automatically. The disclosure parameter governs the fraction of connections between the member and non-member sets. As such, it is a key ingredient of the ground truth imputation. Given an underlying graph with its nodes and edges, the simulated member recruitment model yields a subset of nodes that are considered members and a set of non-members. We will focus only on the set of relevant non-members whose email address has been disclosed by at least one member (see Figure 1). These node sets induce the member-to-member and member-to-non-member edge sets defined above, and additionally the edges between non-members, which are not directly accessible to the platform and are at the core of our interest and prediction efforts. 
Consider the new graph containing all the structural information that is assumed to be known by a given social network platform (black and red nodes, and black and green edges in Figure 1). The ground truth imputation is thus determined by the choice of the percentage of individuals deciding to become members of the social network (the platform penetration), the member recruitment model (BFS, DFS, RW, EN, RS), and the choice of the disclosure parameter, i.e. the propensity of members to disclose their contacts with non-members.
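The disclosure step and the resulting split into visible and hidden edges can be sketched as below. The function name `observe` and the returned names `e_mm`, `e_mn`, `relevant` and `e_nn` are hypothetical labels chosen for this illustration; the sketch assumes that each member discloses independently with the given probability and, as stated in the text, that a disclosing member shares all of her non-member contacts.

```python
import random


def observe(adj, members, disclosure, seed=0):
    """Return what the platform sees, plus the hidden prediction target.

    adj        -- ground-truth graph: node -> set of neighbours
    members    -- set of nodes that joined the platform
    disclosure -- probability that a member uploads her address book
    """
    rng = random.Random(seed)
    # Mutually confirmed friendships between members (undirected).
    e_mm = {frozenset((u, v)) for u in members for v in adj[u] if v in members}
    # Members who disclose share ALL their contacts to non-members.
    disclosers = {u for u in sorted(members) if rng.random() < disclosure}
    e_mn = {(u, v) for u in disclosers for v in adj[u] if v not in members}
    # Relevant non-members: disclosed by at least one member.
    relevant = {v for _, v in e_mn}
    # Hidden edges among relevant non-members: never shown to the platform,
    # used only as ground truth when evaluating predictions.
    e_nn = {frozenset((v, w)) for v in relevant for w in adj[v] if w in relevant}
    return e_mm, e_mn, relevant, e_nn


# Toy usage: members 0 and 1, everybody discloses.
truth = {0: {1, 2, 3}, 1: {0}, 2: {0, 3}, 3: {0, 2, 4}, 4: {3}}
e_mm, e_mn, relevant, e_nn = observe(truth, {0, 1}, disclosure=1.0)
```

In this toy case node 4 is a non-member known to no member, so it is not relevant and the edge between 3 and 4 is outside the prediction task.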

Feature Extraction

To predict whether two non-members v and w are connected, we compute topological features of the network around v and w in the graph known to the platform, on which the prediction is based. We deduce the features from relational knowledge because the data at hand is anonymized and therefore no node attributes are available. The exact choice of features is rooted in the known structural properties of (online) social networks [30], [31]. The intuition that two people sharing common friends are likely to be friends themselves motivates including a feature that counts the absolute number of common neighbors v and w have. However, the absolute number of common neighbors might be misleading if v has just a few neighbors, while w has many. Thus, we add three normalized versions of the number of common neighbors, where the normalization is done by the smaller degree, the larger degree, and the number of nodes which are neighboring at least one of the two nodes (the so-called Jaccard coefficient). The typically high assortativity (measuring the likelihood for nodes to connect to other nodes with similar degrees [32]) and the significant local clustering [33] of nodes in online social networks justify focusing on the average degree and the clustering coefficient of the common neighbors of v and w. The community structure of social networks [34] leads us to construct several features that reflect the interconnectedness of the member-side neighbors of the two nodes, as illustrated by Figure 4. Finally, we count the absolute number of distinct paths between v and w in the known graph with exactly three edges. For a precise description of the features see Text S2. For each pair of non-members these scalars are stored in a feature vector. A feature vector relating to two connected non-members is called a positive sample, and one describing unconnected non-members is a negative sample. Based on this vector, supervised machine learning is used to predict which pairs of non-members are connected (acquainted) and which ones are not.
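A few of the features described above can be made concrete with a short sketch. It covers only a subset (common neighbors, the three normalizations, the average degree of common neighbors, and three-edge paths); the exact definitions are in Text S2, which is not reproduced here, so the function name `pair_features`, the dictionary keys, and the use of degrees in the platform-visible graph are assumptions.

```python
def pair_features(known, v, w):
    """Topological features for a pair of non-members v and w.

    known maps every node of the platform-visible graph to its set of
    neighbours there; non-members are linked only to members in it.
    """
    nv, nw = known[v], known[w]
    common = nv & nw   # members who know both v and w
    joint = nv | nw    # members who know at least one of them
    c = len(common)

    def ratio(a, b):
        return a / b if b else 0.0

    # Paths v - a - b - w with exactly three edges and distinct nodes.
    three_edge_paths = sum(
        1
        for a in nv
        for b in nw
        if a != b and a not in (v, w) and b not in (v, w) and b in known[a]
    )
    return {
        "common": c,
        "common_by_min_degree": ratio(c, min(len(nv), len(nw))),
        "common_by_max_degree": ratio(c, max(len(nv), len(nw))),
        "jaccard": ratio(c, len(joint)),
        "avg_degree_of_common": ratio(sum(len(known[x]) for x in common), c),
        "three_edge_paths": three_edge_paths,
    }


# Toy usage: members 0, 1, 2 form a triangle; non-members 10 and 11.
known = {
    0: {10, 1, 2},
    1: {10, 11, 0, 2},
    2: {11, 1, 0},
    10: {0, 1},
    11: {1, 2},
}
features = pair_features(known, 10, 11)
```

The neighborhood-interconnectedness features of Figure 4 would be added to the same dictionary in a full implementation.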


Figure 4. Features based on different edge sets between the exclusive, joint, and common neighborhoods of v and w. All left-hand nodes belong to the joint neighborhood of v and w. Some of them are exclusive to one of the two nodes, while the others are common neighbors of both. Our features comprise the absolute number of edges between common neighbors (black, dashed edges), exclusive neighbors (black, straight edge), joint neighborhood (all black edges between the left-hand nodes), and an exclusive and a common neighbor (black, dotted edges). For each of them we also added their normalized value. Normalization was done by the number of possible edges between the respective neighbors. https://doi.org/10.1371/journal.pone.0034740.g004

The Prediction Algorithm

Supervised learning requires a training set on which the classifier’s parameters are adjusted. Its performance is then evaluated on an independent test set. We restrict our predictions to those pairs of non-members with at least one common neighbor among the members. In this respect we follow similar approaches which restricted link predictions to pairs of nodes with a maximum distance of two [9], [12]. Our focus is thus on predicting whether two non-member friends of a member are friends themselves, i.e. whether or not a pair of non-members is connected by an edge in the hidden set of non-member edges. We employ the random forest classifier [35], an ensemble of decision trees that has previously been used for link prediction in dynamic networks [9], [11], [12]. For a more detailed description see Text S3. Once the random forest has been trained, it can be applied to the test set, and edges with a predicted probability higher than some threshold are predicted to exist. This prediction can then be compared to the ground truth.
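The restriction to candidate pairs and the final thresholding can be sketched independently of the classifier. The helper names below are hypothetical, and the scores passed to `predict_edges` stand in for the probabilities a trained random forest would output; the classifier itself is not shown.

```python
from itertools import combinations


def candidate_pairs(e_mn):
    """Pairs of non-members that share at least one disclosing member.

    e_mn is a set of (member, non_member) tuples known to the platform.
    """
    contacts = {}
    for member, non_member in e_mn:
        contacts.setdefault(member, set()).add(non_member)
    pairs = set()
    for disclosed in contacts.values():
        # Sorting makes each pair appear once, as (smaller, larger).
        pairs.update(combinations(sorted(disclosed), 2))
    return pairs


def label_pairs(pairs, e_nn):
    """Ground-truth label per pair: True if the hidden edge exists."""
    return {pair: frozenset(pair) in e_nn for pair in pairs}


def predict_edges(scores, threshold):
    """Pairs whose predicted probability exceeds the threshold."""
    return {pair for pair, p in scores.items() if p > threshold}


# Toy usage: member 0 knows 2 and 3, member 1 knows 3 and 4.
e_mn = {(0, 2), (0, 3), (1, 3), (1, 4)}
hidden = {frozenset((2, 3)), frozenset((2, 4))}
pairs = candidate_pairs(e_mn)
labels = label_pairs(pairs, hidden)
predicted = predict_edges({(2, 3): 0.8, (3, 4): 0.3}, threshold=0.5)
```

Note that the true edge between 2 and 4 never becomes a sample here, because the two share no disclosing member; this is the consequence of restricting predictions to pairs with a common member neighbor.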

Accuracy Measures for Prediction

A good classification result is characterized by a high sensitivity (probability of predicting an edge that truly exists) and high specificity (probability of predicting the absence of an edge that truly doesn’t exist). In the following we use two classic accuracy measures for the link prediction problem, the AUC and the positive predictive value among the top-ranked samples (PPV_k), which combine sensitivity and specificity [13], [36]. Varying the threshold allows us to trade off sensitivity against specificity. The receiver operating characteristic (ROC curve) plots the sensitivity (true positive rate) against 1 − specificity (false positive rate). The area under this curve (AUC) is a scalar performance measure that aggregates the prediction accuracy over all possible settings of this threshold. A perfect predictor achieves an AUC of 1, while random guessing in a two-class problem yields a value of 0.5. While the AUC measures the accuracy over the full range of possible thresholds, the PPV_k is based on a specific threshold: let k denote the number of positive samples in the test set, i.e., the non-member pairs connected by an edge, and let all samples of the test set, i.e., all non-member pairs having at least one common member friend, be ordered non-increasingly by their prediction value. The PPV_k introduced by [11] is defined as the percentage of correctly classified positive samples among the first k samples in the ranking, and is thus also equal to the sensitivity achieved by predicting these k samples to be edges. It can be shown that the specificity is linearly dependent on PPV_k and thus both measures are captured by it. The higher the PPV_k value, the more the positive samples are enriched among the highest-ranked samples. Note that the PPV_k should always be at least as large as the overall fraction of positive samples among all samples. Otherwise, the prediction algorithm performs worse than a naive algorithm in which k samples are drawn uniformly at random from all samples and predicted to be edges.
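Both accuracy measures can be computed directly from labels and prediction scores, as the following sketch shows. It uses the well-known equivalence between the area under the ROC curve and the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one (ties counted as one half); the function names are illustrative, and the quadratic-time AUC is chosen for clarity rather than speed.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives
    and negatives (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ppv_at_k(labels, scores):
    """Share of true edges among the k highest-ranked samples,
    where k is the number of positive samples in the test set."""
    k = sum(1 for y in labels if y)
    if k == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = sum(1 for _, y in ranked[:k] if y)
    return hits / k


# Toy usage: two true edges among five candidate pairs.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.1]
area = auc(labels, scores)           # 5 of 6 positive/negative pairs ordered correctly
top_k = ppv_at_k(labels, scores)     # 1 of the top 2 samples is a true edge
baseline = sum(labels) / len(labels) # fraction of positives a naive guess would hit
```

Comparing `top_k` with `baseline` corresponds to the sanity check described above: a useful predictor should enrich true edges among the top-ranked samples beyond their overall share.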


Figure 5. Prediction accuracy (AUC) of samples based on all member recruitment models in the cross-validation training scheme applied to UNC data. The white square denotes a data point where there was not enough data to make the prediction. https://doi.org/10.1371/journal.pone.0034740.g005


Figure 6. 4 → 1 cross-prediction accuracy. Minimal (lower triangle) and maximal (upper triangle) prediction accuracy for all five member recruitment models are shown as a function of platform penetration and the disclosure parameter. The upper and lower rows show the two accuracy measures; black triangles denote data points where the positive predictive value among the top-ranked samples was smaller than the corresponding fraction of positive samples among all samples. https://doi.org/10.1371/journal.pone.0034740.g006


Figure 7. 1 → 1 cross-prediction accuracy. Accuracy values for each of the five member recruitment models at a fixed choice of parameters. The x- and y-axes show on which network the random forest was trained and tested, respectively. The white field indicates that there were too few edge samples to reasonably train the classifier. https://doi.org/10.1371/journal.pone.0034740.g007