Our present enthusiasm for big data stems from the confusion of data and knowledge. Firms today can gather more data, at lower cost, about a wider variety of subjects, than ever before. Big data’s advocates claim that this data will become the raw material of a new industrial revolution. As with its 19th century predecessor, this revolution will alter how we govern, work, play, and live. But unlike the 19th century, we are told, the raw materials driving this revolution are so cheap and abundant that the horizon is bounded only by the supply of smart people capable of molding these materials into the next generation of innovations (Manyika et al. 2011).

This utopia of data is badly flawed. Those who promote it rely on a series of dubious assumptions about the origins and uses of data, none of which hold up to serious scrutiny. In aggregate, these assumptions fail to establish whether the data we have actually provide the raw materials a data-driven industrial revolution would need. Taken together, these failures point out the limits of a revolution built on the raw materials that today seem so abundant.

Four of these assumptions merit special attention: First, N = all, or the claim that our data allow a clear and unbiased study of humanity; second, that today = tomorrow, or the claim that understanding online behavior today implies that we will still understand it tomorrow; third, offline = online, the claim that understanding online behavior offers a window into economic and social phenomena in the physical world; and fourth, that complex patterns of social behavior, once understood, will remain stable enough to become the basis of new data-driven, predictive products and services in sectors well beyond social and media markets. Each of these has its issues. Taken together, those issues limit the future of a revolution that relies, as today’s does, on the “digital exhaust” of social networks, e-commerce, and other online services. The true revolution must lie elsewhere.

N = All

Gathering data via traditional methods has always been difficult. Small samples were unreliable; large samples were expensive; samples might not be representative, despite researchers’ best efforts; tracking the same sample over many years required budgets and institutional commitments that few organizations outside governments could justify. None of this, moreover, was very scalable: researchers needed a new sample for every question, or had to divine in advance a battery of questions and hope that this proved adequate. No wonder social research proceeded so slowly.

Mayer-Schönberger and Cukier (2013) argue that big data will eliminate these problems. Instead of having to rely on samples, online data, they claim, allows us to measure the universe of online behavior, where N (the number of people in the sample) is basically all (the entire population of people we care about). Hence we no longer need worry, they claim, about the problems that have plagued researchers in the past. When N = all, large samples are cheap and representative, new data on individuals arrives constantly, monitoring data over time poses no added difficulty, and cheap storage permits us to ask new questions of the same data again and again. With this new database of what people are saying or buying, where they go and when, how their social networks change and evolve, and myriad other factors, the prior restrictions borne of the cost and complexity of sampling will melt away.

But N ≠ all. Most of the data that dazzles those infatuated by “big data”—Mayer-Schönberger and Cukier included—comes from what McKinsey & Company termed “digital exhaust” (Manyika et al. 2011): the web server logs, e-commerce purchasing histories, social media relations, and other data thrown off by systems in the course of serving web pages, online shopping, or person-to-person communication. The N covered by that data concerns only those who use these services—not society at large. In practice, this distinction turns out to matter quite a lot. The demographics of any given online service usually differ dramatically from the population at large, whether we measure by age, gender, race, education, or myriad other factors.

Hence the uses of that data are limited. It’s very relevant for understanding web search behavior, purchasing, or how people behave on social media. But the N here is skewed in ways both known and unknown—perhaps younger than average, or more tech-savvy, or wealthier than the general population. The fact that we have enormous quantities of data about these people may not prove very useful to understanding society writ large.

Today = Tomorrow

But let’s say that we truly believe this assumption—that everyone is (or soon will be) online. Surely the proliferation of smart phones and other devices is bringing that world closer, at least in the developed world. This brings up the second assumption—that we know where to find all these people. Several years ago, MySpace was the leading social media website, a treasure trove of new data on social relations. Today, it’s the punchline to a joke. The rate of change in online commerce, social media, search, and other services undermines any claim that an N = all sample that works today will work tomorrow. Instead, we learn about new developments—and the data and populations they cover—only well after they have already become big. Hence our N = all sample is persistently biased in favor of the old. Moreover, we have no way of systematically checking how biased the sample is without resorting to traditional survey methods and polling—the very methods that big data is supposed to render obsolete.

Online Behavior = Offline Behavior

But let’s again assume that problem away. Let’s assume that we have all the data, about all the people, for all the online behavior, gathered from the digital exhaust of all the relevant products and services out there. Perhaps, in this context, we can make progress understanding human behavior online. But that is not the revolution that big data has promised. Most of the “big data” hype has ambitions beyond improving web search, online shopping, socializing, or other online activity. Instead, big data should help cure disease, detect epidemics, monitor physical infrastructure, and aid first responders in emergencies.

To satisfy these goals, we need a new assumption: that what people do online mirrors what they do offline. Otherwise, all the digital exhaust in the world won’t describe the actual problems we care about.

There’s little reason to think that offline life faithfully mirrors online behavior. Research has consistently shown that individuals’ online identities vary widely from their offline selves. In some cases, that means people are more cautious about revealing their true selves. Danah Boyd’s work (Boyd and Marwick 2011) has shown that teenagers cultivate online identities very different from their offline selves—whether for creative, privacy, or other reasons. In others, it may mean that people are more vitriolic, or take more extreme positions. Online political discussions—another favorite subject of big data enthusiasts—suffer from levels of vitriol and partisanship far beyond anything seen offline (Conover et al. 2011). Of course, online and offline identity aren’t entirely separate. That would invite suggestions of schizophrenia among internet users. But the problem remains—we don’t know what part of a person is faithfully represented online, and what part is not.

Furthermore, even where online behavior may echo offline preferences or beliefs, that echo is often very weak. In statistical terms, our ability to distinguish “significant” from “insignificant” results improves with the sample size—but statistical significance is not practical significance. Knowing, say, that a history of purchasing some basket of products is associated with an increased risk of being a criminal may be helpful. But if that association is weak—say a one-hundredth of a percent increase—its practical import is effectively zero. Big data may permit us to find these associations, but it does not promise that they will be useful.
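The gap between statistical and practical significance is easy to demonstrate. The sketch below uses hypothetical numbers, loosely matching the one-hundredth-of-a-percent example above: a tiny difference in proportions is statistically invisible at survey scale, yet comfortably "significant" at big-data scale, while remaining practically meaningless.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z-statistic for the difference between two observed proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical base rates: 1% vs. 1.01% among purchasers of some basket
# of products -- the "one-hundredth of a percent" increase from the text.
base, bumped = 0.0100, 0.0101

# At survey scale, the difference is statistically undetectable...
z_small = two_proportion_z(bumped, 10_000, base, 10_000)

# ...but at big-data scale it clears the conventional |z| > 1.96 bar,
# even though the effect itself is still effectively zero.
z_big = two_proportion_z(bumped, 100_000_000, base, 100_000_000)

print(f"z at N = 10k:  {z_small:.2f}")
print(f"z at N = 100M: {z_big:.2f}")
```

The association does not get any stronger as N grows; only our ability to detect it does.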

Behavior of All (Today) = Behavior of All (Tomorrow)

OK, you say: but surely we can determine how these distortions work, and incorporate them into our models? After all, doesn’t statistics have a long history of trying to gain insight from messy, biased, or otherwise incomplete data?

Perhaps we could build such a map, one that allows us to connect the observed behaviors of a skewed and selective online population to offline developments writ large. This suffices only if we care primarily about describing the past. But much of the promise of big data comes from predicting the future—where and when people will get sick in an epidemic, which bridges might need the most attention next month, whether today’s disgruntled high school student will become tomorrow’s mass shooter.

Satisfying these predictive goals requires yet another assumption. It is not enough to have all the data, about all the people, and a map that connects that data to real-world behaviors and outcomes. We also have to assume that the map we have today will still describe the world we want to predict tomorrow.

Two obvious and unknowable sources of change stand in our way. First, people change. Online behavior is a product of culture, language, social norms, and other factors that shape both people and how they express their identities. These factors are in constant flux. The controversies and issues of yesterday are not those of tomorrow; the language we use to discuss anger, love, hatred, or envy changes. The pathologies that afflict humanity may endure, but the ways we express them do not.

Second, technological systems change. The data we observe in the “digital exhaust” of the internet is created by individuals acting in the context of systems with rules of their own. Those rules are set, intentionally or not, by the designers and programmers who decide what we can and cannot do with them. And those rules are in constant flux. What we can and cannot buy, who we can and cannot contact on Facebook, what photos we can or cannot see on Flickr vary, often unpredictably. Facebook alone is rumored to run up to a thousand different variants on its site at one time. Hence even if culture never changed, our map from online to offline behavior would still decay as the rules of online systems continued to evolve.

An anonymous reviewer pointed out, correctly, that social researchers have always faced this problem. This is certainly true, but many of the features of social systems—political and cultural institutions, demography, and other factors—change on a much longer timeframe than today’s data-driven internet services. US Congressional elections, for instance, operate very differently now than they did a century ago, but they change little between any two successive elections. Contrast that with the pace of change for major social media services, for which 2 years may be a lifetime.

A recent controversy illustrates this problem to a T. Facebook recently published a study (Kramer et al. 2014) in which it selectively manipulated the news feeds of a randomized sample of users, to determine whether it could manipulate users’ emotional states. The revelation of this study prompted fury on the part of users, who found this sort of manipulation unpalatable. Whether they should have been furious, of course, given that Facebook routinely runs experiments on its site to determine how best to satisfy (i.e., make happier) its users, is an interesting question. But the broader point remains: someone watching the emotional state of Facebook users might have concluded that overall happiness was on the rise, perhaps as a consequence of the improving American economy. In fact, that increase was entirely spurious, driven by Facebook’s successful experiment at manipulating its users.

Compounding this problem, we cannot know, in advance, which of the social and technological changes we do know about will matter to our map. That only becomes apparent in the aftermath, as real-world outcomes diverge from predictions cast using the exhaust of online systems.

Lest this come off as statistical nihilism, consider the differences in two papers that both purport to use big data to project the outcome of US elections. DiGrazia et al. (2013) claim that merely counting the tweets that reference a Congressional candidate—with no adjustments for demography, or spam, or even name confusion—can forecast whether that candidate will win his or her election. This is a purely “digital exhaust” approach. They speculate—but cannot know—whether this approach works because (to paraphrase their words) “one tweet equals one vote”, or “all attention on Twitter is better”. Moreover, it turns out that the predictive performance of this simple model provides no utility. As Huberty (2013) shows, their estimates perform no better than an approach that simply guesses that the incumbent party would win—a simple and powerful predictor of success in American elections. Big data provided little value.

Contrast this with Wang et al. (2014). They use the Xbox gaming platform as a polling instrument, which they hope might help compensate for the rising non-response rates that have plagued traditional telephone polls. As with Twitter, N ≠ all: the Xbox user community is younger, more male, less politically involved. But the paper nevertheless succeeds in generating accurate estimates of general electoral sentiment. The key difference lies in their use of demographic data to re-weight respondents’ electoral sentiments to look like the electorate at large. The Xbox data were no less skewed than Twitter data; but the process of data collection provided the means to compensate. The black box of Twitter’s digital exhaust, lacking this data, did not. The difference? DiGrazia et al. (2013) sought to reuse data created for one purpose in order to do something entirely different; Wang et al. (2014) set out to gather data explicitly tailored to their purpose alone.
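The re-weighting idea behind the Xbox study can be sketched in a few lines. The demographic cells, population shares, and support rates below are invented for illustration; the actual paper uses multilevel regression and post-stratification over far finer demographic cells.

```python
# Share of each demographic cell in a (hypothetical) target electorate.
population_share = {"young_male": 0.20, "young_female": 0.20,
                    "older_male": 0.28, "older_female": 0.32}

# A skewed sample, like the Xbox panel (young and male), with the
# observed candidate-support rate in each cell. All values fabricated.
sample = {
    "young_male":   {"n": 6000, "support": 0.45},
    "young_female": {"n": 1500, "support": 0.50},
    "older_male":   {"n": 2000, "support": 0.55},
    "older_female": {"n":  500, "support": 0.60},
}

def raw_estimate(sample):
    """Unweighted support: dominated by whoever is over-represented."""
    total = sum(c["n"] for c in sample.values())
    return sum(c["n"] * c["support"] for c in sample.values()) / total

def poststratified_estimate(sample, population_share):
    """Weight each cell's support by its share of the real population."""
    return sum(population_share[cell] * c["support"]
               for cell, c in sample.items())

print(f"raw estimate:             {raw_estimate(sample):.3f}")
print(f"post-stratified estimate: "
      f"{poststratified_estimate(sample, population_share):.3f}")
```

The point is not the arithmetic but the prerequisite: re-weighting is only possible because each respondent's demographics were collected alongside the response. A black box of digital exhaust offers no such handle.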

The Implausibility of Big Data 1.0

Taken together, the assumptions that we have to make to fulfill the promise of today’s big data hype appear wildly implausible. To recap, we must assume that:

1. everyone we care about is online;
2. we know where to find them today, and tomorrow;
3. they represent themselves online consistent with how they behave offline; and
4. they will continue to represent themselves online—in behavior, language, and other factors—in the same way, for long periods of time.

Nothing in the history of the internet suggests that even one of these statements holds true. Not everyone was online in the past, and not everyone is likely to be online in the future. The constant, often wrenching changes in the speed, diversity, and capacity of online services mean that those who are online move around constantly. They do not, as we’ve seen, behave in ways necessarily consistent with their offline selves. And the choices they make about how to behave online evolve in unpredictable ways, shaped by a complex and usually opaque amalgam of social norms and algorithmic influences.

But if each of these statements falls down, then how have companies like Amazon, Facebook, or Google built such successful business models? The answer lies in two parts. First, most of what these companies do is self-referential: they use data about how people search, shop, or socialize online to improve and expand services targeted at searching, shopping, or socializing. Google, by definition, has an N = all sample of Google users’ online search behavior. Amazon knows the shopping behaviors of Amazon users. Of course, these populations are subject to change their behaviors, their self-representation, or their expectations at any point. But at least Google or Amazon can plausibly claim to have a valid sample of the primary populations they care about.

Second, the consequences of failure are, on the margins, very low. Google relies heavily on predictive models of user behavior to sell the advertising that accounts for most of its revenue. But the consequences of errors in that model are low—Google suffers little from serving the wrong ad on the margins. Of course, persistent and critical errors of understanding will undermine products and lead to lost customers. But there’s usually plenty of time to correct course before that happens. So long as Google does better than its competitors at targeting advertising, it will continue to win the competitive fight for advertising dollars.

But if we move even a little beyond these low-risk, self-referential systems, the usefulness of the data that underpin them quickly erodes. Google Flu provides a valuable lesson in this regard. In 2008, Google announced a new collaboration with the Centers for Disease Control (CDC) to track and report rates of influenza infection. Historically, the CDC had monitored US flu infection patterns through a network of doctors that tracked and reported “influenza-like illness” in their clinics and hospitals. But doctors’ reports took up to 2 weeks to reach the CDC—a long time in a world confronting SARS or avian flu. Developing countries with weaker public health capabilities faced even greater challenges. Google hypothesized that, when individuals or their family members got the flu, they went looking on the internet—via Google, of course—for medical advice. In a highly cited paper, Ginsberg et al. (2008) showed that they could predict region-specific influenza infection rates in the United States using Google search frequency data. Here was the true promise of big data—that we capitalize on virtual data to better understand, and react to, the physical world around us.
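The kind of map Google Flu built can be caricatured in a few lines: fit a relationship between search frequency and infection rates on historical weeks, then apply it to fresh search data to "nowcast" the present. All numbers below are fabricated for illustration; the actual model of Ginsberg et al. regressed log-odds of influenza-like-illness rates on aggregated query shares selected from millions of candidates.

```python
# Toy version of the Google Flu mapping: learn a linear relationship
# between flu-related search share and CDC-reported infection rates,
# then predict the current week's rate from search volume alone.

def fit_ols(xs, ys):
    """Ordinary least squares for one predictor: y ~ a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    b = num / den
    return my - b * mx, b

# Historical weeks (hypothetical): share of searches that are
# flu-related vs. the reported infection rate for the same week.
search_share   = [0.010, 0.014, 0.020, 0.026, 0.031]
infection_rate = [0.021, 0.029, 0.041, 0.053, 0.063]

a, b = fit_ols(search_share, infection_rate)

# Nowcast this week's infection rate from today's search share.
this_week = a + b * 0.024
print(f"predicted infection rate: {this_week:.3f}")
```

The fragility the text goes on to describe lives entirely in the fitted coefficients: if a new virus changes who searches, or media coverage changes why they search, the historical relationship no longer holds, and the prediction fails silently.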

The subsequent history of Google Flu illustrates the shortcomings of the first big data revolution. While Google Flu has performed well in many seasons, it has failed twice, both times in the kind of abnormal flu season during which accurate data are most valuable. The patterns of and reasons for failure speak to the limits of prediction. In 2009, Google Flu under-predicted flu rates during the H1N1 pandemic. Post-hoc analysis suggested that the different viral characteristics of H1N1, compared with garden-variety strains of influenza, likely meant that individuals didn’t recognize that they had the flu, and thus didn’t go looking for flu-related information (Cook et al. 2011). Conversely, in 2012, Google Flu over-predicted influenza infections. Google has yet to discuss why, but speculation has centered on the intensive media coverage of an early-onset flu season, which may have sparked interest in the flu among healthy individuals (Butler 2013).

The problems experienced by Google Flu provide a particularly acute warning of the risks inherent in trying to predict what will happen in the real world based on the exhaust of the digital one. Google Flu relied on a map—a mathematical relationship between online behavior and real-world infection. Google built that map on historic patterns of flu infection and search behavior. It assumed that those patterns would continue to hold in the future. But there was nothing fundamental about them. Either a change in the physical world (a new virus) or in the virtual one (media coverage) was enough to render the map inaccurate. The CDC’s old reporting networks out-performed big data when it mattered most.