Analysis How do you protect people's privacy when you have big databases of personal records you want to share?

That's the question that the US National Institute of Standards and Technology (NIST) has dug into in an extensive review [PDF] of the different methods that government departments and other organizations use when publishing data.

The paper, now finalized after a period of public comment and review, is chock-full of acronyms and jargon, but does its best to wade through them to provide an understandable document.

There are two main lessons that come out of it: first, striking the balance between providing privacy and useful data is not easy and will require using a number of different approaches; and second, there is a lot more work that needs to be done.

There is a third point too: it's harder to protect the privacy of celebrities.

On the second point, the paper attempts to make sense of the most common terms used in the ever-expanding privacy industry.

There are, for example, three words commonly used to describe the method of removing or altering information to shield people's privacy. The paper broadly argues that they are all effectively interchangeable. De-identification and pseudonymization may purport to be different, but in reality there is no dividing line; likewise anonymization, which often fails to do what it says and actually anonymize people.

The paper also throws out the buzz-term "personally identifiable information" in favor of the simpler "personal information" because the former is largely meaningless.

Methods

But to the guts of it: what do people do and what are the best methods?

There is a surprisingly large array of different ways to protect privacy in data. They can be largely grouped into two areas: the data itself, and the way it is provided.

When it comes to the data itself, possibly the most common approach is to simply pull out the data fields that contain personal information – everything from social security numbers to IP addresses. But that can often remove data that is very useful to have, such as geographic identifiers. And information is never simply personal or not personal. As the paper points out, information sits on a spectrum from unrelated (like weather) to highly personal (like your name). There is no clear cut-off.

Sometimes companies will replace the more personal data fields with substitute values generated by some other method, in order to further separate the data from the people it describes. But the problem then comes when people are able to combine different databases to "re-identify" people. Efforts to encrypt data have also failed.
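To make the two approaches concrete, here is a minimal sketch in Python of dropping personal fields and swapping a record key for a substitute value. The field names, the `deidentify` function and the choice of random tokens are all assumptions made for illustration; the paper does not prescribe this method.

```python
import secrets

# Hypothetical field names treated as directly personal.
DIRECT_IDENTIFIERS = {"name", "ssn", "ip_address"}

def deidentify(records, id_field="patient_id"):
    """Drop direct identifiers and replace the record key with a random pseudonym.

    Returns the cleaned records plus the original-to-pseudonym mapping,
    which the data holder would need to keep secret.
    """
    pseudonyms = {}
    cleaned_records = []
    for record in records:
        original_id = record[id_field]
        # Reuse the same pseudonym for the same person so records stay linkable.
        if original_id not in pseudonyms:
            pseudonyms[original_id] = secrets.token_hex(8)
        cleaned = {
            key: value
            for key, value in record.items()
            if key not in DIRECT_IDENTIFIERS and key != id_field
        }
        cleaned["pseudonym"] = pseudonyms[original_id]
        cleaned_records.append(cleaned)
    return cleaned_records, pseudonyms

# Made-up sample data.
raw = [
    {"patient_id": 1, "name": "Pat Example", "ssn": "000-00-0000", "zip": "00001", "visit": "A"},
    {"patient_id": 1, "name": "Pat Example", "ssn": "000-00-0000", "zip": "00001", "visit": "B"},
]
released, mapping = deidentify(raw)
```

Note what survives: the zip code is still in the released records, which is exactly the kind of leftover field that makes combining databases possible.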

Several high-profile examples are given in the paper. In one, Massachusetts governor William Weld was sent his own medical records through the post after hospital data that he had championed as protected was used to identify him.

In that case, it was possible to find Weld's records in the database by finding other identifying information about him: his zip code, date of birth, and sex.
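That kind of matching can be sketched in a few lines of Python. Everything here is made up for illustration: the records, the names, the `link_records` function, and the assumption that the second dataset is a public list that carries names alongside the same three fields.

```python
def link_records(deidentified, public, keys=("zip", "dob", "sex")):
    """Return (name, record) pairs where the key fields match exactly one person."""
    # Index the public list by the combination of key fields.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = []
    for record in deidentified:
        names = index.get(tuple(record[k] for k in keys), [])
        # Only a unique match re-identifies someone with confidence.
        if len(names) == 1:
            matches.append((names[0], record))
    return matches

# "De-identified" records: no names, but zip, date of birth, and sex remain.
hospital_records = [
    {"zip": "00001", "dob": "1950-03-14", "sex": "M", "diagnosis": "condition A"},
    {"zip": "00002", "dob": "1982-11-02", "sex": "F", "diagnosis": "condition B"},
]

# Hypothetical public list with names attached.
public_list = [
    {"name": "Pat Example", "zip": "00001", "dob": "1950-03-14", "sex": "M"},
    {"name": "Sam Sample", "zip": "00002", "dob": "1982-11-02", "sex": "F"},
    {"name": "Alex Other", "zip": "00002", "dob": "1982-11-02", "sex": "F"},
]

reidentified = link_records(hospital_records, public_list)
```

In this toy data the first hospital record matches only one name and is re-identified, while the second matches two people and stays ambiguous. The fewer people who share a combination of fields, the weaker the protection.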

Another famous example came when the encrypted taxi numbers published by New York City were cracked. By then reviewing photos of celebrities getting into or out of cabs, researchers and journalists were able to pinpoint people's movements around the Big Apple.
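Why might scrambled identifiers be crackable at all? A rough sketch in Python, under stated assumptions: the identifiers were protected with an unsalted hash (MD5 stands in here), and they follow a short, predictable pattern. The four-character format and the function names are hypothetical, not the city's actual scheme.

```python
import hashlib
import itertools
import string

def scramble(identifier: str) -> str:
    """Unsalted hash of an identifier (assumed scheme, MD5 as a stand-in)."""
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()

def build_lookup():
    """Hash every possible identifier in a hypothetical digit-letter-digit-digit format."""
    lookup = {}
    for d1, letter, d2, d3 in itertools.product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        identifier = f"{d1}{letter}{d2}{d3}"
        lookup[scramble(identifier)] = identifier
    return lookup

# Only 10 * 26 * 10 * 10 = 26,000 candidates, so this runs in well under a second.
LOOKUP = build_lookup()

def unscramble(hashed: str):
    """Return the original identifier, or None if it is not in the table."""
    return LOOKUP.get(hashed)

published = scramble("5X41")      # what a data release might contain
recovered = unscramble(published)  # what an attacker can get back
```

The weakness in this sketch is not the hash function itself but the tiny number of possible inputs: anyone can hash them all and look the answers up.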

Then there was the time AOL anonymized people's data but left in their search terms, making it possible in some cases to identify people (for example, from a search on their own home address) – and then learn a whole lot more about them.

Bradley Cooper's cab journeys were tracked through the release of New York taxi data. Credit: Gawker