$\begingroup$

In a word, yes. I believe there are still clear situations where sampling is appropriate, both inside and outside the "big data" world, but the nature of big data will certainly change our approach to sampling, and we will increasingly use datasets that are nearly complete representations of the underlying population.

On sampling: In most circumstances it will be clear whether sampling is appropriate. Sampling is not an inherently beneficial activity; it is what we do because we need to make tradeoffs on the cost of data collection. We are trying to characterize populations, and we need to select an appropriate method for gathering and analyzing data about them. Sampling makes sense when the marginal cost of collecting or processing each additional observation is high. Trying to reach 100% of the population is not a good use of resources in that case, because you are often better off spending the effort on problems like non-response bias than on tiny reductions in random sampling error.
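To put rough numbers on those "tiny improvements", here is a minimal sketch, assuming simple random sampling from a large population and the usual normal approximation for a proportion. The function name and the default values are mine, chosen only for illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from a simple random sample of size n (normal approximation,
    no finite-population correction)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 4_000, 100_000):
    print(n, round(margin_of_error(n), 4))
# 1,000 respondents give a margin of about 0.031; 100,000 give about 0.0031.
```

The error shrinks with the square root of the sample size, so quadrupling the sample only halves it. A bias from non-response does not shrink with n at all, which is why it is often the better place to spend money.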

How is big data different? "Big data" addresses many of the same questions we've had for ages, but what's "new" is that the data collection happens as a byproduct of an existing, computer-mediated process, so the marginal cost of collecting data is essentially zero. This dramatically reduces our need for sampling.

When will we still use sampling? If your "big data" population is the right population for the problem, then you will only employ sampling in a few cases: when you need to run separate experimental groups, or when the sheer volume of data is too large to capture and process (many of us can handle millions of rows of data with ease nowadays, so the boundary here keeps moving further out). If it seems like I'm dismissing your question, it's probably because I've rarely encountered situations where the volume of the data was a concern in either the collection or processing stages, although I know many have.

The situation that seems hard to me is when your "big data" population doesn't perfectly represent your target population, so the tradeoffs are more apples to oranges. Say you are a regional transportation planner, and Google has offered to give you access to its Android GPS navigation logs to help you. While the dataset would no doubt be interesting to use, the population would probably systematically under-represent low-income residents, public-transportation users, and the elderly. In such a situation, traditional travel diaries sent to a random sample of households, although costlier and far fewer in number, could still be the superior method of data collection. But this is not simply a question of "sampling vs. big data"; it's a question of which population, combined with the data collection and analysis methods you can apply to that population, will best meet your needs.
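A toy simulation makes the transportation example concrete. This is only a sketch under made-up assumptions: a 30% true share of public-transit users, and transit users being much less likely than everyone else to show up in the navigation logs. None of these numbers come from real data, and the function and key names are hypothetical.

```python
import random

def simulate(pop_size=100_000, survey_size=1_000, seed=0):
    rng = random.Random(seed)
    # True means the person mainly uses public transit (assumed 30% share).
    population = [rng.random() < 0.30 for _ in range(pop_size)]
    # "Big data": transit users are assumed far less likely to appear in the logs.
    logged = [uses_transit for uses_transit in population
              if rng.random() < (0.20 if uses_transit else 0.80)]
    # Traditional approach: a small simple random sample of the whole population.
    survey = rng.sample(population, survey_size)
    return {
        "true_share": sum(population) / pop_size,
        "logged_n": len(logged),
        "logged_share": sum(logged) / len(logged),
        "survey_n": len(survey),
        "survey_share": sum(survey) / len(survey),
    }

result = simulate()
for key, value in result.items():
    print(key, round(value, 3))
```

Under these assumptions the logs contain tens of thousands of people yet badly understate the transit share, while the 1,000-person random sample lands close to the truth. More rows do not repair a coverage problem; only the right population (or a credible reweighting toward it) does.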