Ars has just one question for PhD student Arvind Narayanan and his advisor Vitaly Shmatikov: why must you continually shatter our illusions? Despite the all-seeing, all-knowing panopticon that is the Internet, some of us like to dream our simple dreams of anonymity and privacy; we choose to believe that our Netflix movie recommendations do not identify us; and we hold on to the belief that we can remain comfortably anonymous behind the veil of our Pumpalumpkin Twitter account.

But like the yapping Toto at the end of The Wizard of Oz, Narayanan and Shmatikov take delight in ripping back the curtain, exposing the great and terrible Oz as nothing more than a scrawny academic.

Their newest paper, "De-anonymizing social networks," is yet another attack on the idea that data can be easily anonymized by stripping out a few bits of personally identifiable information (PII). Much of their work over the last few years is built on the premise that PII extends far beyond names and addresses; in many datasets, the very structure of the data provides all sorts of clues that can be deciphered with only a few bits of information.

Who needs names when we have topology?

In "De-anonymizing social networks," Narayanan and Shmatikov take an anonymous graph of the social relationships established through Twitter and find that they can actually identify many Twitter accounts based on an entirely different data source—in this case, Flickr.

One-third of users with accounts on both services could be identified on Twitter based on their Flickr connections, even when the Twitter social graph being used was completely anonymous. The point, say the authors, is that "anonymity is not sufficient for privacy when dealing with social networks," since their scheme relies only on a social network's topology to make the identification.
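
To make the topology point concrete, here is a toy Python sketch of re-identification by structure alone. It is not the algorithm from the paper. It is a simplified illustration that assumes the attacker already knows a couple of "seed" matches between a public, named graph and the anonymized one, then repeatedly matches whichever unknown account shares the most already-matched neighbors. The graphs, the names, the `propagate` function, and the `min_score` threshold are all invented for the example.

```python
def propagate(aux, target, seeds, min_score=1):
    """Extend a seed mapping (named node -> anonymous node) using only edges.

    aux and target are undirected graphs stored as {node: set_of_neighbors}.
    This is a toy heuristic for illustration, not the published method.
    """
    mapping = dict(seeds)
    changed = True
    while changed:
        changed = False
        used = set(mapping.values())
        for node in sorted(aux):
            if node in mapping:
                continue
            # Where this node's already-identified friends sit in the target graph.
            mapped_nbrs = {mapping[n] for n in aux[node] if n in mapping}
            # Score each unclaimed anonymous node by how many of those it touches.
            scores = {cand: len(mapped_nbrs & target[cand])
                      for cand in target if cand not in used}
            ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
            if not ranked:
                continue
            best, best_score = ranked[0]
            runner_up = ranked[1][1] if len(ranked) > 1 else 0
            # Accept only a clear winner; ties are left unresolved.
            if best_score >= min_score and best_score > runner_up:
                mapping[node] = best
                used.add(best)
                changed = True
    return mapping


# Hypothetical public graph where accounts carry real names.
aux = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "erin"},
    "dave": {"bob", "erin", "frank"},
    "erin": {"carol", "dave", "frank"},
    "frank": {"dave", "erin"},
}

# Hypothetical "anonymized" graph: the same people under opaque ids,
# plus n7, who has no account in the public graph.
target = {
    "n1": {"n2", "n3"},
    "n2": {"n1", "n3", "n4"},
    "n3": {"n1", "n2", "n5"},
    "n4": {"n2", "n5", "n6"},
    "n5": {"n3", "n4", "n6"},
    "n6": {"n4", "n5", "n7"},
    "n7": {"n6"},
}

# Assumption: the attacker already knows who these two accounts are.
seeds = {"alice": "n1", "bob": "n2"}

mapping = propagate(aux, target, seeds)
print(mapping)
```

Starting from two known accounts, the toy run labels the remaining four named users without ever looking at a username. Real graphs overlap only partially and are far noisier, which is why a real attack needs more careful scoring and accepts some error rate.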

The issue is of more than academic interest, as social networks now routinely release such anonymous social graphs to advertisers and third-party apps, and government and academic researchers ask for such data to conduct research. But the data isn't nearly as "anonymous" as those releasing it appear to think it is, and it can easily be cross-referenced to other data sets to expose user identities.

It's not just about Twitter, either. Twitter was a proof of concept, but the idea extends to any sort of social network: phone call records, healthcare records, academic sociological datasets, etc.

As for who might care, the authors sketch out a few scenarios:

The strongest adversary is a government-level agency interested in global surveillance. Its objective is large-scale collection of detailed information about as many individuals as possible. Another attack scenario involves abusive marketing. If an unethical company were able to de-anonymize the graph using publicly available data, it could engage in abusive marketing aimed at specific individuals. Phishing and spamming also gain from social-network de-anonymization. Using detailed information about the victim gleaned from his or her de-anonymized social-network profile, a phisher or a spammer will be able to craft a highly individualized, believable message. Yet another category of attacks involves targeted de-anonymization of specific individuals by stalkers, investigators, nosy colleagues, employers, and neighbors.

This isn't the first time that Narayanan and Shmatikov have sounded the alarm, either; it's the main subject of their research. The pair made waves back in 2007 with a similar paper showing that Netflix's release of 100 million anonymous movie ratings could expose a user's entire rating history to anyone who knew just eight of that user's movie ratings and their dates, within a 14-day error margin.

In other words, knowing that a friend rated eight movies in a particular way at a particular point in time suddenly enables one to extract from the dataset all the movies rated by the person between 1999 and 2005. And you don't even need to go to the trouble of asking someone; in the paper, the researchers used public recommendation data that people had entered in the Internet Movie Database (where many people also use their real names) to expose those people's entire recommendation set from Netflix.
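
The matching idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not the researchers' actual scoring method: it simply looks for the one anonymized record that agrees with every known rating, allowing the date to be off by up to 14 days. The record ids, movie titles, ratings, and dates below are made up, and `count_matches` and `link` are hypothetical helper names.

```python
from datetime import date


def count_matches(known, record, max_days=14):
    """Count known (movie, rating, date) facts consistent with one record."""
    by_movie = {movie: (rating, day) for movie, rating, day in record}
    hits = 0
    for movie, rating, day in known:
        if movie in by_movie:
            rec_rating, rec_day = by_movie[movie]
            # Same rating, and rated within max_days of the known date.
            if rec_rating == rating and abs((rec_day - day).days) <= max_days:
                hits += 1
    return hits


def link(known, dataset, max_days=14):
    """Return the single anonymized id matching every known fact, else None."""
    full = [rid for rid, rec in dataset.items()
            if count_matches(known, rec, max_days) == len(known)]
    return full[0] if len(full) == 1 else None


# Made-up "anonymized" release: opaque id -> list of (movie, rating, date).
dataset = {
    "u1": [("Movie A", 5, date(2004, 3, 1)),
           ("Movie B", 1, date(2004, 3, 9)),
           ("Movie C", 4, date(2004, 6, 2))],
    "u2": [("Movie A", 5, date(2004, 3, 3)),
           ("Movie B", 4, date(2004, 3, 10)),
           ("Movie D", 2, date(2005, 1, 5))],
    "u3": [("Movie A", 3, date(2003, 11, 20)),
           ("Movie C", 4, date(2004, 6, 1))],
}

# What the attacker knows about a friend, with only approximate dates.
known = [("Movie A", 5, date(2004, 3, 8)),
         ("Movie B", 1, date(2004, 3, 1))]

match = link(known, dataset)
print(match, dataset.get(match))
```

In this toy data, one known rating is not enough to single anyone out, but two are, and once the record is found every other rating in it comes along for free.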

That might not sound like a big deal until one considers an example:

First, we can immediately find his political orientation based on his strong opinions about Power and Terror: Noam Chomsky in Our Times and Fahrenheit 9/11. Strong guesses about his religious views can be made based on his ratings on Jesus of Nazareth and The Gospel of John. He did not like Super Size Me at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, Bent and Queer as folk were rated one star out of five. He is a cultish follower of Mystery Science Theater 3000. This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details.

A lesson that keeps being relearned

AOL famously learned this lesson the hard way in 2006, when it released a huge dataset of anonymized search terms. Journalists quickly showed that although the data itself was nominally "anonymous," individual users could still be tracked down by anyone who looked through their sets of search terms.

The fact is that we're not as anonymous as many of us would like to think. Back in 2000, a Carnegie Mellon researcher took a look at 1990 US census data and concluded that 87 percent of all Americans could be uniquely identified based on only three items: ZIP code, gender, and date of birth.
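
The underlying measurement is easy to express in code. The Python sketch below is only an illustration of the idea, run on four invented records rather than census data; the `unique_fraction` helper and the field names `zip`, `gender`, and `dob` are assumptions made for the example. It counts how many records are the only one with their particular combination of the three fields.

```python
from collections import Counter


def unique_fraction(records, keys=("zip", "gender", "dob")):
    """Fraction of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Invented sample records; no names are stored anywhere.
people = [
    {"zip": "15213", "gender": "F", "dob": "1970-01-02"},
    {"zip": "15213", "gender": "F", "dob": "1970-01-02"},
    {"zip": "15213", "gender": "M", "dob": "1970-01-02"},
    {"zip": "15217", "gender": "F", "dob": "1982-07-19"},
]

print(unique_fraction(people))
print(unique_fraction(people, keys=("zip",)))
```

Each field is harmless on its own, but the combination narrows quickly: in the toy data, ZIP code alone isolates one record in four, while all three fields together isolate two in four.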

For most people at most times, anonymity isn't crucial; knowing that you could be unmasked isn't a major deterrent to Internet postings. Many people using social networks like Twitter, for instance, do so as a way of connecting to others and gaining followers using their real names. Narayanan himself has a Twitter account with his name on it.

But for those doing anything "sensitive," such as watching movies that you don't want the world to know you're watching or searching for things that you don't want the world to know you're searching for, it's useful to remember just how far your data trail extends behind you on the Internet... and just how easily determined researchers can follow the digital bread crumbs.