There are only three known jokes about statistics in the whole universe, so to complete the trilogy (see here and here for the other two), listen up:

Three statisticians are on a train journey to a conference, and they get chatting to three epidemiologists who are also going to the same place. The epidemiologists are complaining about the ridiculous cost of train tickets these days. At this, one of the statisticians pipes up “it’s actually quite reasonable if use our method – we’ve just got one ticket between the three of us”. The epidemiologists are amazed. “But how do you get away with that?”, they cried in unison. “Watch and learn” replied a statistician. A few minutes later, the inspector’s voice was heard down the carriage. At that, the statisticians bundled themselves into the toilet. The inspector knocked on the door. “Tickets please”, she said, and the statisticians passed their single ticket under the door. The inspector stamped it and returned it, and the statisticians made it to the conference. On the way back, the statisticians again met the epidemiologists. This time, the epidemiologists proudly displayed their single ticket. “Aha”, said a statistician. “This time we have no tickets.” Again the epidemiologists were amazed, but they had little time to ponder it because the inspector was coming down the carriage. The epidemiologists dashed off into the toilet, and soon enough there was a knock on the door. “Tickets please”, they heard, and passed their ticket under the door. The statisticians took the ticket and went off to their own toilet! The moral of the story being “never use a statistical technique that you don’t understand”.

All this preamble goes by way of saying: data anonymisation isn’t something that I know a great deal about, but I had some ideas and wanted to get feedback from you.

Any personal data of any importance needs to respect the privacy of the people it represents. Data containing financial or medical details in particular should not be exposed for public consumption (at least if you want people to continue providing you with their data). Anonymising data is an important concept in achieving this privacy.

While this is something you need to think about through the whole data lifecycle (from creating it, to storing it – probably in a database – through analysing it, and possibly publishing it) this post focuses on the analysis phase. At this stage, you data is probably in a data frame form, with some identifying columns that need to be anonymised, and some useful values that need to be preserved. Here’s some made-up data, in this case pacman scores of the Avengers.