
About a decade ago, a hacker said to me, flatly, “Assume every card in your wallet is compromised, and proceed accordingly.” He was right. Consumers have adapted to a steady thrum of data breach notifications, random credit card charges, and out-of-the-blue card replacements. A privacy-industrial complex has sprung up from this — technology, services, and policies all aimed at trying to protect data while allowing it to flow freely enough to keep the modern electronic bazaar thriving. A key strategy in this has been to “scrub” data, which means removing personally identifiable information (PII) so that even if someone did access it, they couldn’t connect it to an individual.

So much for all that.

In a paper published in Science last week, MIT scientist Yves-Alexandre de Montjoye shows that anonymous credit card data can be reverse engineered to identify individuals’ transactions, a finding that calls into question many of the policies developed to protect consumers and forces data scientists to reconsider the ethics of how they use large datasets.

De Montjoye and colleagues examined three months of credit card transactions for 1.1 million people, all of which had been scrubbed of any PII. Still, 90% of the time he managed to identify individuals in the dataset using the date and location of just four of their transactions. By adding knowledge of the price of the transactions, he increased “reidentification” (the academic term for spotting an individual in anonymized data) to 94%. Additionally, women were easier to reidentify than men, and reidentifiability increased with the consumer’s income.
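The core idea — that a handful of date/location points tends to single a person out — can be illustrated with a toy simulation. This is not the paper’s method or data: the population size, transaction counts, and the grid of days and shops below are all invented for illustration, and the resulting percentage depends entirely on those parameters.

```python
import random
from collections import defaultdict

random.seed(0)

# Invented stand-in numbers (the real study covered 1.1M people over 3 months)
N_USERS, N_SHOPS, N_DAYS, TXNS_PER_USER = 500, 50, 90, 40

# Each user's transaction history is a set of (day, shop) points
transactions = defaultdict(set)
for user in range(N_USERS):
    for _ in range(TXNS_PER_USER):
        transactions[user].add((random.randrange(N_DAYS),
                                random.randrange(N_SHOPS)))

def uniquely_identified(target, n_points):
    """Do n_points known (day, shop) pairs match only `target`'s history?"""
    known = random.sample(sorted(transactions[target]), n_points)
    matches = [u for u, txns in transactions.items()
               if all(p in txns for p in known)]
    return matches == [target]

trials = 200
hits = sum(uniquely_identified(random.randrange(N_USERS), 4)
           for _ in range(trials))
print(f"uniquely reidentified with 4 points: {hits / trials:.0%}")
```

Even in this small synthetic population, four exact date/location points are almost always unique to one person, because the space of possible (day, shop) combinations is far larger than any one person’s footprint in it.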

To be clear: Reidentification means that the researchers could identify all the transactions that belong to an individual, but de Montjoye didn’t attempt to say which individual. For example, if he wanted to know my transactions, he’d need to take additional steps to cross reference something he knew about me to his data. If, for example, I posted on Facebook about a trip to a restaurant, that could provide the key to connecting me to an entire portfolio of anonymous transactions. “We didn’t try to put names on it,” de Montjoye says, “but we know basically what you need to do that.”

What’s more, de Montjoye showed that even “coarse” data provides “little anonymity.” He lowered the “resolution” on his data by looking only at areas where purchases happened, not specific shops, and 15-day time frames in which they happened, not specific dates. He also broadened the price ranges, so that transactions previously categorized as between $5 and $16 now fell into a bin more than twice as wide, spanning $5 to $34. Even with low-res data like this, he could pluck out four transactions and reidentify individuals 15% of the time. By looking at 10 such data points, he could, remarkably, reidentify individuals 80% of the time.
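The coarsening step can be sketched the same way: collapse exact dates into 15-day windows, shops into areas, and prices into wide bins, then re-test uniqueness. Again, everything here — the bin widths, population, and transaction counts — is an invented sketch, so the percentages it prints will not match the paper’s 15% and 80% figures; the point is only the qualitative pattern that more coarse points pin down more people.

```python
import random
from collections import defaultdict

random.seed(1)

def coarsen(day, shop, price):
    """Generalize a transaction, loosely in the spirit of the paper."""
    return (day // 15,        # 15-day window instead of an exact date
            shop // 10,       # neighborhood of shops instead of one shop
            int(price // 29)) # wide price bin instead of an exact amount

# Invented parameters for illustration only
N_USERS, N_SHOPS, N_DAYS, TXNS = 5000, 50, 90, 40
users = defaultdict(set)
for u in range(N_USERS):
    for _ in range(TXNS):
        users[u].add(coarsen(random.randrange(N_DAYS),
                             random.randrange(N_SHOPS),
                             random.uniform(1, 300)))

def unique(target, n):
    known = random.sample(sorted(users[target]),
                          min(n, len(users[target])))
    return [u for u, s in users.items()
            if all(p in s for p in known)] == [target]

rates = {}
for n in (4, 10):
    rates[n] = sum(unique(random.randrange(N_USERS), n)
                   for _ in range(200)) / 200
    print(f"{n} coarse points -> unique {rates[n]:.0%}")
```

Coarse points are shared by more people, so four of them no longer reliably single anyone out — but stacking up ten of them closes the gap again, which mirrors the paper’s qualitative finding.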

It’s not the first time de Montjoye has played the part of privacy killjoy. In previous work he pulled off a similar trick, reidentifying individuals using anonymous mobile phone location data. (Others have performed similar parlor tricks with other datasets.) And while he hasn’t yet tested other types of large datasets, such as browsing histories, he believes that “it seems likely” that they, too, are susceptible to reidentification.

The implications of de Montjoye’s work are profound. Broadly, it means that anonymity doesn’t ensure privacy, which could render toothless many of the world’s laws and regulations around consumer privacy. Guaranteeing anonymity (that is, the removal of PII) in exchange for being able to freely collect and use data — a bread-and-butter marketing policy for everyone from app makers to credit card companies — might not be enforceable if anonymity can be hacked. Anonymization as we define it today, de Montjoye says, is “inadequate” and ultimately doomed to fail with large metadata — the kind of publicly available big data that so many companies are tapping into. (He won’t use the term “big data,” but what he describes as “metadata datasets” is largely in line with that concept.)

One obvious response to this problem, being explored in Europe, is to make anyone who wants to use such data prove that they’ve made it impossible to identify individuals within the dataset. But if de Montjoye can identify four out of five people from anonymous data with only a general sense of where they were, when they were there, and how much they spent, it’s hard to imagine someone proving beyond a doubt they’ve anonymized their data. That kind of mandate, then, could ultimately prohibit the use and sharing of data.

That would be a terrible outcome given the power of the kinds of large datasets de Montjoye is testing. “The potential for good that comes from this kind of data is too great to shut it down,” he says, citing any number of cases: Mobile data can be used in the fight against the spread of disease. Traffic data can enable smarter traffic systems that significantly reduce emissions. Economic data tracking can help identify opportunities for innovation and growth.

One model de Montjoye cites is “PII 2.0,” proposed by Paul M. Schwartz and Daniel Solove. Currently, PII is binary: information is either personally identifiable or it isn’t. Schwartz and Solove propose a spectrum between those two poles, with a third category in between, in which identification is possible but not probable, and then regulation that addresses each category separately.

De Montjoye also looks to the “New Deal on Data” proposed by MIT’s Sandy Pentland (a co-author on de Montjoye’s paper), in which ownership rights over data shift to the consumer.

“Our goal is to start a debate, not shut down the use of this kind of data,” says de Montjoye. “This is a potential risk with these large datasets; anonymization is limited, but the potential uses for this data are great. So let’s find a better model. Let’s find a balance between privacy and utility.”