Today EFF Staff Technologist Jeremy Gillula is speaking at an FTC workshop on big data and its impact on privacy, prompted by the recent reports on big data by the White House as well as the President’s Council of Advisors on Science and Technology (PCAST). Our major point at the workshop will be that many seem to be putting the cart before the horse when it comes to big data: before we as a society start worrying about how we can mitigate big data’s privacy risks, we think its proponents first need to show that their analyses are statistically valid. In other words, we need proof that big data is good science and not just snake oil.

As we explained in comments we submitted last month to the National Telecommunications and Information Administration (NTIA), although there’s a lot of hype out there about how amazing big data will be, there are very few public examples of big data use that have actually held up to scientific and technical scrutiny. This is partially due to the closed and proprietary nature of most big data projects, which are run by companies that probably find it easier to share their successes than their failures. As a result, “there are no big data about big data” from which we can draw conclusions about its validity.

Big Data’s Technical Hurdles

But big data also faces purely technical challenges. For example, a major assumption in big data is that if one has a large enough data set, then the data will automatically be statistically representative of the underlying population. This claim, “that sampling bias does not matter, is simply not true in most cases that count.” On the contrary, big data sets “are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.”
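
To make the sampling point concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the population of ages, the rule that younger people are more likely to end up in the “found” data, and the function name `compare_samples`. It is not drawn from any real data set. It compares the average age estimated from a very large but biased sample with the estimate from a small random one.

```python
import random
import statistics


def compare_samples(seed=0, population_size=200_000, survey_size=500):
    """Compare a huge biased sample with a small random sample.

    All numbers here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    # Hypothetical population: ages spread evenly between 18 and 80.
    population = [rng.randint(18, 80) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # "Found" data: the chance of being included falls linearly with age,
    # so 18-year-olds are always included and 80-year-olds never are.
    found = [age for age in population if rng.random() < (80 - age) / 62]

    # A small survey drawn uniformly at random from the same population.
    survey = rng.sample(population, survey_size)

    return {
        "true_mean": true_mean,
        "found_size": len(found),
        "found_error": abs(statistics.fmean(found) - true_mean),
        "survey_size": len(survey),
        "survey_error": abs(statistics.fmean(survey) - true_mean),
    }


if __name__ == "__main__":
    for name, value in compare_samples().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the “found” sample is many times larger than the survey, yet its estimate is much further from the true average, because adding more rows does nothing to correct the way the rows were selected.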

Additionally, even if one tackles the problem of sampling bias, the data alone can’t tell you which correlations are actually meaningful. (This is the old “correlation is not causation” problem.) More problematic, however, is that “big data may mean more information, but it also means more false information.” This contributes to what is known as the “multiple-comparisons” problem: if you have a large enough data set, and you run enough comparisons between different variables in it, some comparisons will appear statistically significant even though they are in fact flukes. (Take for example a data set that contains both the U.S. homicide rate and the market share of Internet Explorer. Both went down sharply from 2006 to 2011, but that doesn’t mean the correlation is meaningful.)
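
The multiple-comparisons problem is easy to reproduce with pure noise. The sketch below is illustrative only: the number of variables, the number of observations, and the function names are assumptions. The cutoff of 0.444 is approximately the correlation that counts as significant at the conventional 5% level for 20 observations. The code generates variables that are unrelated by construction, compares every pair, and counts how many pairs clear the cutoff anyway.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(num_vars=40, num_obs=20, threshold=0.444, seed=0):
    """Count pairs of independent noise variables that look correlated.

    threshold=0.444 is roughly the two-sided 5% cutoff for 20 observations
    (an assumption of this sketch; it changes with num_obs).
    """
    rng = random.Random(seed)

    # Every variable is independent random noise, so any "significant"
    # correlation found below is a fluke by construction.
    series = [[rng.gauss(0, 1) for _ in range(num_obs)]
              for _ in range(num_vars)]

    pairs = list(itertools.combinations(range(num_vars), 2))
    hits = [(i, j) for i, j in pairs
            if abs(pearson(series[i], series[j])) > threshold]
    return len(hits), len(pairs)


if __name__ == "__main__":
    hits, total = count_spurious()
    print(f"{hits} of {total} pairs look 'significant' despite being noise")
```

With 40 variables there are 780 pairs to compare, so a test that is wrong about 5% of the time should be expected to flag dozens of them. The more variables a data set contains, the more such flukes an analyst will find if they simply go looking.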

Not All Big Data Can Overcome These Technical Challenges

Ironically, the types of big data usage that get the most hype are exactly the types of usage that tend to suffer from these statistical problems. One such problematic use is individualized targeting of large numbers of individuals based on “found” data, i.e., information about them that was originally collected for some other purpose. When “found” data is put to a use it was never intended for, sampling biases are inevitable. Another problematic use is the idea that by collecting all the data and then storing it indefinitely, it can be used to learn something new at some distant point in the future. Not only will such a “discovery” likely be subject to sampling biases, but any correlations that are discovered in the data (as opposed to being explicitly tested for) are likely to be spurious.

This isn’t to say that all usage of big data is destined to fail. With careful forethought and the proper use of statistical techniques, some uses of big data might be able to overcome these statistical traps and truly benefit society. But it’s worth emphasizing that the types of uses that tend to suffer from statistical weaknesses also tend to pose the greatest privacy threats, since they involve keeping data longer than otherwise necessary and using that data for purposes for which consent (or perhaps even notice) was not given.

Creatively using new tools is a huge part of innovation. But so is understanding when a tool works and when it doesn’t. Thus, before we start asking about how to reconcile big data with privacy, we should understand when big data is good science, and when it’s not.