Information Flows in You — And Your Friends

Upper limits of predictability using social media information even if a person has deleted their social media presence

You’ve had enough. Of baby pictures, of political rants by ‘friends’, even of cute cat pictures! Of worrying about your privacy and future career security. You decide to delete your accounts on Facebook, Twitter and Instagram. And you’re in good company: millions of Facebook users started leaving the platform in 2017, and the outlook isn’t much better going forward. Twitter struggles to find enough new users to keep its active user base constant, and only Instagram is growing. After deleting your account together with all the content you ever posted, you and millions along with you think you’re safe. Turns out, you’re not. Because, from an information theory perspective, it’s like you never left.

Conversely, suppose you’re a data scientist trying to build predictive machine learning models based on social media information (and social media creates a large share of the 2.5 quintillion bytes of data generated each day). You would probably find it quite useful to have the data of the individuals you want to make predictions about. From Target supposedly identifying a pregnancy before the pregnant person’s parents knew, to predicting political allegiances and election outcomes, social media data are a veritable treasure trove for data science and machine learning. So if you’re a data scientist, you are in luck! Because even though your person of interest and all the content they generated may have vanished from social media, you still have a chance to predict their behavior or sentiments. Here’s how.

Limits of text predictability based on the availability of data from friends and the individual.

Our point of reference here is a recent publication in the journal Nature Human Behaviour by James P. Bagrow and Xipei Liu from the University of Vermont and Lewis Mitchell from the University of Adelaide. In this work, the authors probed the limits of how much predictive power you can potentially extract out of social media, specifically in a context where the individual in question has deleted their social media profile. Turns out, quite a lot! Up to 95% of the predictive power revolving around you can be gathered without you. Who’s to blame? Your friends! There’s so much information about you embedded in your close social environment that data scientists in principle don’t need the individual themselves to be present to achieve a reasonable prediction accuracy. At the end of the article we’ll discuss potential limitations to this approach, so stick around.

But let’s return to the actual publication. The researchers conducting this study were guided by information theory. Conceived by Bell Labs researcher Claude Shannon after World War II, information theory is a whole discipline interested in topics such as data loss and compression during communication or, in short, the flow of information. Shannon also introduced the notion of entropy into information theory, denoting the amount of uncertainty about the outcome of an event. For an event with two equally likely outcomes, an entropy of 1 bit implies that we cannot predict the outcome at all (for instance which side comes up in a fair coin flip), while an entropy of zero tells you that there is no uncertainty about the outcome. This is important because predicting individual behavior in this case means predicting what a person will write in the future (on Twitter) given what they wrote in the past. Therefore, a lower entropy means you’re able to do a better job at predicting what an individual will say next or will say given a cue (for instance politics).
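To make the coin example concrete, here is a minimal Python sketch of Shannon entropy for a discrete distribution. The function name and the example probabilities are chosen purely for illustration and do not come from the publication.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Outcomes with probability 0 contribute nothing, so they are skipped.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])    # maximal uncertainty for two outcomes
biased_coin = shannon_entropy([0.9, 0.1])  # easier to guess, so lower entropy
certain = shannon_entropy([1.0])           # no uncertainty at all

print(f"fair coin:   {fair_coin:.3f} bits")
print(f"biased coin: {biased_coin:.3f} bits")
print(f"certain:     {certain:.3f} bits")
```

The fair coin comes out at exactly 1 bit and the certain event at 0 bits, while the biased coin sits in between: the more lopsided the odds, the easier the guess.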

In their work, Bagrow et al. use the entropy rate, given in bits, as the amount of information needed to predict future text. An entropy rate of 4 bits would correspond to choosing randomly from 2^4 = 16 words for each predicted word. This might not sound like much, but considering that social media users have a vocabulary of about 5000 words, this is a dramatic improvement! An important consideration here is that information theory deals with the limits of communication, so a given entropy rate and its corresponding predictability describe the upper limit that any predictive model could reach given this data. The random predictability (picking uniformly from a vocabulary of 5000 words) would be 1/5000 = 0.02%, whereas the average Twitter user in their dataset is characterized by 53% predictability (corresponding to an entropy rate of 6.6 bits). This would mean that, in an ideal model built on this Twitter data, more than every second predicted word would be correct.
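The arithmetic above can be checked in a few lines of Python. The sketch below also converts an entropy rate into a maximum predictability using Fano’s inequality, a standard tool for this kind of bound. Treating it as the exact procedure of the study, and using a single vocabulary size of 5000 for every user, are assumptions made here for illustration; all function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_entropy(predictability, vocab_size):
    """Entropy rate (bits) implied by a predictability under Fano's inequality."""
    return binary_entropy(predictability) + (1 - predictability) * math.log2(vocab_size - 1)

def max_predictability(entropy_rate, vocab_size, tol=1e-9):
    """Solve fano_entropy(p) = entropy_rate for p by bisection.

    On [1/vocab_size, 1] the implied entropy falls as p rises,
    so a plain bisection is enough.
    """
    lo, hi = 1.0 / vocab_size, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fano_entropy(mid, vocab_size) > entropy_rate:
            lo = mid  # implied entropy still too high: predictability must be larger
        else:
            hi = mid
    return (lo + hi) / 2

VOCAB_SIZE = 5000                # assumed vocabulary size, taken from the text
effective_choices = 2 ** 4       # 4 bits -> 16 equally likely words
random_guess = 1 / VOCAB_SIZE    # 0.0002, i.e. 0.02%
pi_max = max_predictability(6.6, VOCAB_SIZE)

print(effective_choices, f"{random_guess:.2%}", f"{pi_max:.1%}")
```

With these assumptions, an entropy rate of 6.6 bits lands between 50% and 60% predictability, the same range as the reported 53%. The exact figure depends on the vocabulary size assumed for each user.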

Yet the main part of this story deals with your friends. Analogous to entropy, cross-entropy is the number of bits needed to predict your texts when only your friend’s texts are available. In this article, the authors chose the 15 presumably closest friends for each individual (those most often mentioned by the individual on Twitter). The first important point is that there is information about you in your social circle. Combining predictive information from you with that from your friends increases the predictability to over 60% (64% for an infinite number of friends). There are definitely diminishing returns operating here, meaning that the effect of the first friend added to the model is a lot more palpable than the effect of the tenth friend. This is all well and good for improving predictions, but now comes the whopper. Remove yourself from the social media network and only use your friends to predict your text, and you end up at ~56% predictability with 15 friends or ~61% with an infinite number of friends. Just to make it really clear: using a mere 8–9 friends (without using the individual themselves) you break even with using information from the actual person, and you can get up to 95% of the maximum predictability of individual+friends (61% out of 64%)!
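To get a feel for cross-entropy, here is a toy Python sketch. It is not the estimator used in the study: it ignores word order entirely and simply asks how many bits per word are needed to encode one person’s text with a smoothed word-frequency model built from someone else’s text. The function name and the example sentences are made up for illustration.

```python
import math
from collections import Counter

def unigram_cross_entropy(target_words, source_words):
    """Average bits per word to encode target_words with a unigram model
    fitted on source_words (add-one smoothing over the joint vocabulary)."""
    vocab = set(target_words) | set(source_words)
    counts = Counter(source_words)
    total = len(source_words) + len(vocab)  # add-one smoothing denominator
    bits = 0.0
    for word in target_words:
        probability = (counts[word] + 1) / total
        bits -= math.log2(probability)
    return bits / len(target_words)

# Hypothetical example texts
individual = "coffee then code then more coffee".split()
close_friend = "coffee and code all day then coffee".split()
stranger = "football scores and weather reports today".split()

friend_bits = unigram_cross_entropy(individual, close_friend)
stranger_bits = unigram_cross_entropy(individual, stranger)

print(f"from close friend: {friend_bits:.2f} bits per word")
print(f"from stranger:     {stranger_bits:.2f} bits per word")
```

The friend who shares vocabulary with the individual needs fewer bits per word than the stranger, which is the intuition behind the finding: the more your friends’ language overlaps with yours, the more of your text can be predicted from theirs.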

If you’re a data science person who wants to make use of this, here’s another insight: this predictability is especially pronounced for individuals who posted a lot (thereby strongly influencing/imprinting their friends) combined with friends who didn’t post a lot (as they would be too varied in their expressions otherwise) and who mention the individual frequently. Therefore, the embedding a person leaves in the network when they delete their profile varies according to their personality and network.

Here are some caveats/limitations to this study: As with all such studies, this social embedding effect unfortunately only works in practice if you have at some point been an active user of social media, so that your social circle can be identified. Yet if there is some other way to link you to your friends (GPS co-location, mentions in posts without tagging, etc.) and if they are on social media, this caveat is void. Another consideration is that the predictability gained with social media might be limited to texts posted on social media. And while this still allows for probing sentiments and attitudes, a model built on this might not be accurate in predicting, say, longform articles written by the individual. The most important limitation, also mentioned by the authors, is that this embedded information may change over time as your social circle evolves (hell, some of them might even quit social media as well). Thus, you might have only a short time-frame to develop a well-performing model from the moment the individual quits social media (good for you, individual!). It would be interesting to see whether this could be mitigated by including more friends in the model, by carefully choosing friends with a low degree of ‘change’ to preserve the embedding, or simply by relying on older archived social media data (the internet forgets nothing).

This is why prediction from your friends after you quit social media will be important. Source: Google Trends

In summary, I think this article is a great example of the way we preserve information in our environment. Think of cities for instance. What else are they than the memory of crossroads, natural harbors and trade routes? In the same vein, our friends are representative of at least a part of us and carry information about us with them. And apparently, with enough friends, this could be enough to teach a model to know us better than if it had ‘studied’ us directly. With the exodus from traditional social media platforms mentioned earlier, this might be a golden opportunity for advertising firms or agencies interested in your political leanings to maintain or expand their predictive potential using machine learning and data mining. I wonder how easy to build and how potent a machine learning model trained just on friends would be in practice! Let me know if you give it a try!

Some additional notes / trivia from the article which you may find interesting:

- Social media texts are more extreme than ‘conventional texts’ in how predictable they are, with some being very predictable and some intractable.

- Based on cognitive limits, Dunbar’s number postulates a maximum number of around 150 friends per individual. Yet the average number of Facebook friends is over 300 and the average number of LinkedIn connections clocks in at over 500. This means social media platforms could have quite the impact compared to conventional friendship networks.

- There is more long-term information in the posts of individuals themselves compared to their friends. This shows in the way older posts by friends contribute much less to predicting an individual’s text than their recent posts do.

- The limits presented here might be extendable, as the authors excluded hyperlinks from their data. Using co-training or other approaches might result in a model which includes information about these hyperlinks and is therefore able to leverage more information and achieve a better prediction.