$\begingroup$

I think the only useful definition of big data is data which catalogs all information about a particular phenomenon. What I mean by that is that rather than sampling from some population of interest and collecting some measurements on those units, big data collects measurements on the whole population of interest. Suppose you're interested in Amazon.com customers. It's perfectly feasible for Amazon.com to collect information about all of their customers' purchases, rather than only tracking some users or only tracking some transactions.

To my mind, definitions that hinge on the memory size of the data itself are of somewhat limited utility. By that metric, given a large enough computer, no data is actually big data. At the extreme of an infinitely large computer, this argument might seem reductive, but consider the case of comparing my consumer-grade laptop to Google's servers. Clearly I'd have enormous logistical problems attempting to sift through a terabyte of data, but Google has the resources to manage that task quite handily. More importantly, the size of your computer is not an intrinsic property of the data, so defining the data purely in reference to whatever technology you have at hand is kind of like measuring distance in terms of the length of your arms.

This argument isn't just a formalism. The need for complicated parallelization schemes and distributed computing platforms disappears once you have sufficient computing power. So if we accept the definition that Big Data is too big to fit into RAM (or crashes Excel, or whatever), then after we upgrade our machines, Big Data ceases to exist. This seems silly.

But let's look at some data about big data, and I'll call this "Big Metadata." This blog post observes an important trend: available RAM is increasing more rapidly than data sizes, and provocatively claims that "Big RAM is eating Big Data" -- that is, with sufficient infrastructure, you no longer have a big data problem, you just have data, and you return to the domain of conventional analysis methods.

Moreover, different representation methods will have different sizes, so it's not precisely clear what it means to have "big data" defined in reference to its size-in-memory. If your data is constructed in such a way that lots of redundant information is stored (that is, you choose an inefficient coding), you can easily cross the threshold of what your computer can readily handle. But why would you want a definition to have this property? To my mind, whether or not the data set is "big data" shouldn't hinge on whether or not you made efficient choices in research design.
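To make the point about representation concrete, here is a minimal sketch using only the Python standard library. The data (100,000 small integer "ratings") and the three encodings are my own illustrative assumptions, not anything from a real system; the point is only that identical information can occupy very different amounts of memory.

```python
import array
import json

# Hypothetical data: 100,000 small integer ratings with values 0..4.
values = [i % 5 for i in range(100_000)]

# Encoding 1: JSON text, a verbose human-readable representation.
as_json = json.dumps(values).encode("utf-8")

# Encoding 2: one unsigned byte per value (enough for 0..255).
as_uint8 = array.array("B", values).tobytes()

# Encoding 3: a signed 64-bit integer per value (wasteful for this range).
as_int64 = array.array("q", values).tobytes()

sizes = {
    "json": len(as_json),
    "uint8": len(as_uint8),
    "int64": len(as_int64),
}

for name, n_bytes in sizes.items():
    print(f"{name:>5}: {n_bytes} bytes")

# All three encodings decode back to exactly the same information.
assert json.loads(as_json) == values
assert list(array.array("B", as_uint8)) == values
assert list(array.array("q", as_int64)) == values
```

The 64-bit encoding is eight times the size of the one-byte encoding, yet nothing about the underlying phenomenon has changed. A size-based definition would classify the same data differently depending on a storage choice.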

From the standpoint of a practitioner, big data as I define it also carries with it computational requirements, but these requirements are application-specific. Thinking through database design (software, hardware, organization) for $10^4$ observations is very different than for $10^7$ observations, and that's perfectly fine. This also implies that big data, as I define it, may not need specialized technology beyond what we've developed in classical statistics: samples and confidence intervals are still perfectly useful and valid inferential tools when you need to extrapolate. Linear models may provide perfectly acceptable answers to some questions. But big data as I define it may require novel technology. Perhaps you need to classify new data in a situation where you have more predictors than training data, or where your predictors grow with your data size. These problems will require newer technology.
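As a rough illustration of the claim that classical tools remain useful, here is a sketch in which a simple random sample and a normal-approximation confidence interval are used to estimate a population mean. The population (simulated purchase amounts from an exponential distribution), its size, and the sample size are all hypothetical choices of mine, and the 1.96 critical value assumes the usual large-sample normal approximation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "whole population": 10^5 simulated purchase amounts.
population = [random.expovariate(1 / 40) for _ in range(10**5)]

# A simple random sample of 10^3 units, drawn without replacement.
sample = random.sample(population, 10**3)

sample_mean = statistics.fmean(sample)
std_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% interval using the normal critical value 1.96.
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

# With the whole population in hand, the estimate can be checked directly.
population_mean = statistics.fmean(population)

print(f"sample mean:     {sample_mean:.2f}")
print(f"approx. 95% CI:  ({ci_low:.2f}, {ci_high:.2f})")
print(f"population mean: {population_mean:.2f}")
```

Nothing here needs distributed infrastructure. If the question is simply "what is the average?", a modest sample answers it with quantifiable uncertainty, whatever the size of the full data set.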

As an aside, I think this question is important because it implicitly touches on why definitions are important -- that is, for whom are you defining the topic. A discussion of addition for first-graders doesn't start with set theory, it starts with reference to counting physical objects. It's been my experience that most of the usage of the term "big data" occurs in the popular press or in communications between people who are not specialists in statistics or machine learning (marketing materials soliciting professional analysis, for example), and it's used to express the idea that modern computing practices mean that there is a wealth of available information that can be exploited. This is almost always in the context of the data revealing information about consumers that is, if not private, at least not immediately obvious. The anecdote about a retail chain sending direct mailings to people it assessed were expectant mothers on the basis of their recent purchases is the classic example of this.

So the connotation and analysis surrounding the common usage of "big data" also carries with it the idea that data can reveal obscure, hidden or even private details of a person's life, provided a sufficiently powerful inferential method is applied. When the media report on big data, this deterioration of anonymity is usually what they're driving at -- defining what "big data" is seems somewhat misguided in this light, because the popular press and nonspecialists have no concern for the merits of random forests and support vector machines and so on, nor do they have a sense of the challenges of data analysis at different scales. And this is fine. The concern from their perspective is centered on the social, political and legal consequences of the information age. A precise definition for the media or nonspecialists is not really useful because their understanding is not precise either. (Don't think me smug -- I'm simply observing that not everyone can be an expert in everything.)