Data mining competitions can accelerate research in a growing list of fields, now that data is stored about just about everything. The 'traditional' approach is to collect a lot of data, perhaps clean it up a bit, and release it as a public data set for competitors to work on. A subset of the data is withheld for scoring purposes.



There's already a problem here: a lot of data out there is personal information that absolutely should not be released into the public domain, and in many cases this is codified in law (e.g. data protection acts). Netflix tried to work around this by doing the following:



1) Anonymize the data.

2) Perturb the data to frustrate attempts at de-anonymizing it.



This sort of worked, but not completely. For example, some people used browser applets (or similar widgets) to publish their Netflix queue so that friends and family could see what they had been watching recently and were intending to watch; fair enough. But that short list of films could then be cross-referenced against the Netflix Prize data set. With enough films to match, you get a fairly unique 'fingerprint' that can de-anonymize the Netflix data (or one row of it). Add in the date that each film was rated (part of the Netflix data) and the de-anonymization is easier still.
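The cross-referencing attack above can be sketched in a few lines. Everything here is illustrative: the toy rows, the `match_score` helper, and the 3-day date slack are my assumptions, not the structure of the real Netflix Prize data.

```python
import datetime

# Hypothetical shapes: each "anonymized" row maps movie_id -> (rating, date),
# and a leaked public queue exposes a few (movie_id, date) pairs for one person.
anonymized = [
    {101: (5, "2005-03-01"), 202: (3, "2005-03-04"), 303: (4, "2005-04-10")},
    {101: (4, "2005-06-02"), 404: (2, "2005-06-11")},
]

public_queue = {101: "2005-03-01", 303: "2005-04-10"}  # leaked by a widget

def match_score(row, queue, date_slack_days=3):
    """Count how many leaked (movie, date) pairs this row is consistent with."""
    hits = 0
    for movie, date in queue.items():
        if movie in row:
            d1 = datetime.date.fromisoformat(row[movie][1])
            d2 = datetime.date.fromisoformat(date)
            if abs((d1 - d2).days) <= date_slack_days:
                hits += 1
    return hits

# The row with the most matches is the best de-anonymization candidate.
best = max(range(len(anonymized)), key=lambda i: match_score(anonymized[i], public_queue))
print(best)  # row 0 matches both leaked films
```

With only a handful of films the fingerprint is already discriminating, which is why adding rating dates makes the attack so much stronger: dates shrink the set of consistent rows dramatically.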



So what, you say? Who cares, and does it matter? Well, suppose you'd rated some 'dodgy' films and were careful not to publish those ratings/viewings in your public Netflix queue, and suppose also that you're a public figure with the tabloid press on your tail. That's just one example. The broader principle is that we have laws to protect privacy, they exist for good reasons, and here's one way information can leak out unlawfully. That the leak was possible because of data you published in your Netflix queue is irrelevant: data that was private to you was made available without your authorization. And this is just film ratings we're talking about; there's far more sensitive data out there (e.g. medical records, mobile phone location data).



So now you're thinking: if this mining risks privacy in the name of commercial gain, then don't do it. Absolutely. If you can't guarantee privacy then don't do it; that is probably the only safe route, legally and morally.



But two points complicate this:

1) What if the gain is a broader societal gain, e.g. medical research?

2) Commercial gain benefits society too. Are there ways to prevent the data leaking out so we can tap these gains as well?





There are a couple of possibilities, neither of which is ideal but both perhaps worth pursuing:



Option 1: Don't release the data

Ask competitors to submit algorithms, e.g. a code class that implements some interface. The competition servers run each algorithm on the private data and generate a score. To make this work you'd need to allocate a fixed amount of CPU time to each submission, e.g. one submission per day, allocated a maximum of N minutes of CPU on the organiser's servers.
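The interface the organiser publishes might look something like the sketch below. The names (`CompetitionModel`, `fit`, `predict`, the row dictionaries) are all hypothetical; the point is that competitors ship code, not queries, and never see the data.

```python
from abc import ABC, abstractmethod

class CompetitionModel(ABC):
    """Hypothetical contract every submission must satisfy."""

    @abstractmethod
    def fit(self, train_rows):
        """Train on the private data; called once, inside the CPU budget."""

    @abstractmethod
    def predict(self, test_rows):
        """Return one prediction per held-out test row."""

class MeanBaseline(CompetitionModel):
    """A trivial entrant: always predict the global mean rating."""

    def fit(self, train_rows):
        ratings = [row["rating"] for row in train_rows]
        self.mean = sum(ratings) / len(ratings)

    def predict(self, test_rows):
        return [self.mean for _ in test_rows]

# Server side (much simplified): instantiate, train, score against
# withheld labels, all under a wall-clock / CPU limit.
model = MeanBaseline()
model.fit([{"rating": 4}, {"rating": 2}])
preds = model.predict([{}, {}])
print(preds)  # [3.0, 3.0]
```

In practice the server would also sandbox the submitted code and enforce the time limit (e.g. via a subprocess with a timeout), since competitors' code is untrusted.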



Pros

- Data is private.

- Level(er) playing field. Competitors with modest hardware (e.g. a poor student) can compete more favourably with better-equipped competitors (e.g. a well-funded research lab).

- Restricted to relatively quick (practical) methods.



Cons

- Cannot perform statistical analysis, visualization, etc. on the data. This is often key to understanding the problem and creating a good model. You can still form hypotheses, build the corresponding algorithm and submit it, but only once per day (as an example), rather than the many iterations per day or hour possible when operating on local data. Some statistics could be precalculated and published by the organiser to partially alleviate this problem (more work for the organiser).

- Cannot push algorithms to their limits, e.g. by creating an ensemble of thousands of models to squeeze out extra accuracy. Typically the top N competitors battle it out by doing exactly this. Arguably that directs a lot of effort at very little gain (law of diminishing returns), so perhaps limiting it is not entirely bad.

- Limited to relatively quick models, whereas some of the really interesting developments are in CPU-hungry algorithms.

- Required CPU time grows with the size of the data set, and some of the most interesting aspects of data mining lie in finding subtle features that are only detectable in large data sets.





Option 2: Anonymize the data

I've already covered how this can go wrong, but are there ways around the problem? An interesting observation from cryptography is that you can sometimes pick out statistical correlations in encrypted data that correspond to aspects of the underlying decrypted data (plain text). Instead of perturbing the data as Netflix did (or perhaps as well as), could we obfuscate the data in some way that renders it unusable as a source of directly identifiable information, but such that the statistical structures are still discoverable?
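As a toy illustration of the idea (my own sketch, not Netflix's method, and not a secure scheme): replace identifiers with salted hashes. The data is no longer directly readable, yet because the mapping is one-to-one, aggregate statistics such as "how many users rated both films" survive intact.

```python
import hashlib

SALT = "secret-competition-salt"  # hypothetical; kept private by the organiser

def obfuscate_id(movie_id):
    """Map a movie id to an opaque token via a salted hash (1-to-1)."""
    return hashlib.sha256(f"{SALT}:{movie_id}".encode()).hexdigest()[:12]

ratings = {  # user -> set of rated movie ids
    "alice": {101, 202},
    "bob":   {101, 202, 303},
    "carol": {303},
}

# Release pseudonymous users with hashed movie ids instead of the raw data.
obfuscated = {f"u{i}": {obfuscate_id(m) for m in movies}
              for i, (user, movies) in enumerate(sorted(ratings.items()))}

def co_rating_count(data, a, b):
    """A sample statistical structure: number of users who rated both films."""
    return sum(1 for movies in data.values() if a in movies and b in movies)

before = co_rating_count(ratings, 101, 202)
after = co_rating_count(obfuscated, obfuscate_id(101), obfuscate_id(202))
print(before, after)  # 2 2
```

This also exposes the tension: any 1-to-1 mapping that preserves the statistical structure necessarily preserves the fingerprint structure used in the de-anonymization above, so something more than relabelling is needed.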



To what extent would finding these statistical structures amount to decrypting/deobfuscating the data? Is it in fact the same problem (classic cryptography), or is there a distinction to be made?

Tags: health, machine learning