Sample your data!

Overview

Being a good data scientist requires you to give answers that are approximately correct. But it also requires you to be pragmatic, and a big part of being pragmatic is to run your analyses quickly. That's where big data can hurt you: After you get over the first rush of seeing millions or billions of observations, you realize that any analysis with this data will be slow. If only you had less data!

There's an easy fix: Sample your data randomly. This is often easy to do and it will make your analysis run faster. The catch is that sampling your data will lead to a loss of precision. In other words, the standard error of your estimates will increase. This is bad in theory, but in practice you might have more precision than you need. In that situation, sampling your data is a no-brainer.

There's another good reason for sampling, and this reason is much less well-known: If your outcome variable has heavy tails or captures a rare event, then you can do better than sampling your data randomly. If you use stratified sampling on the outcome variable, the loss of precision is small even if you drastically down-sample your data.

This situation applies often in the real world: For instance, response rates to advertising or fraud rates often lie much below 1%. If an event has a 1% probability of being true, you can sample your data down to 10% of its original size and increase the standard error of your estimate by less than 10%. In other words: Your computation time goes down by 90%, but the precision of your estimates stays almost constant.

These numbers show why knowing how to sample well is an essential skill for any data scientist. To any colleague asking me to improve the speed of their data analysis, my first response is: "Did you sample the data? And if you did, can you sample more? Is the outcome variable tailed? If so, can you sample on the outcome variable?"

Although sampling is valuable, I was never formally taught how to do it; I learned through trial and error. Nor was I able to find many tutorials online (if you find any, please let me know!). To me this suggests that there's value in a brief introduction.

Whether or not sampling is good for your data analysis depends on what you're doing. Here's a quick decision tree:

Does my analysis already run so fast that speeding it up wouldn't make me happier? In that case, no sampling is needed.

Otherwise, sampling can likely help: Can I afford no loss of precision at all? Then don't sample your data for the final analysis, but do sample it while testing your code. Otherwise, sample your data for almost all steps of the analysis, potentially including the final one.



In the following, I'll work with a simple example: Your company sells hammers on its website. Your manager asks you to predict how likely customers are to buy a hammer given that they've visited your website. In statistical terms, this means that you're given \(N\) observations of a dummy \(D_i, i=1,\ldots,N\) with mean \(p\). Your goal is to estimate \(p\).

Random sampling

We estimate \(p\) using the sample mean \(\hat{p}=\sum_i D_i/N\). The standard error of this estimator is \(\sqrt{p(1-p)/N}\).
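As a minimal sketch of what this looks like in code (assuming the dummies are stored in a pandas column df.purchases, the same names used in the weighting section below):

import numpy as np

# Estimate of p: the sample mean of the dummy
p_hat = df.purchases.mean()

# Standard error of that estimate: sqrt(p(1-p)/N), with p replaced by its estimate
N = len(df)
se_p_hat = np.sqrt(p_hat * (1 - p_hat) / N)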

To save on computation time, we now choose \(M < N\) observations at random, which I call random sampling. This increases the standard error of your estimate: Relative to using all of the data, the standard error under random sampling grows by a factor of \(\sqrt{N/M}\).
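In pandas, a random sample is one line. The sketch below keeps 25% of the rows; the fraction and the fixed random seed are arbitrary choices for illustration:

import numpy as np

# Keep a 25% random sample of the rows
df_small = df.sample(frac=0.25, random_state=0)

# Relative standard error compared to using all observations: sqrt(N/M) = 2 for a 25% sample
se_inflation = np.sqrt(len(df) / len(df_small))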

The graph below shows this relationship - it displays the increase in the standard error as a function of the reduction in the sample size. For instance, if you take a 25% sample of your data, you will reduce your sample by a factor of four and standard errors will double.

So far, this is all statistics. Your decision about sampling will need to weigh the increased standard error against the time cost of computing your results.

The relationship between the number of observations and the computation time is called complexity. In this application, it's likely that complexity is proportional to \(N\). Therefore, taking a 25% sample will make your analysis four times as fast. For this reason, I've given the x-axis a second label, 'Speed-up of analysis'. This is the real goal of sampling.

Is the trade-off worth it? This is where the statistics end and your judgment enters: Let's say that you estimate a mean of \(0.1\) with a standard error of \(0.001\) and your code takes four hours to run. By taking a 25% sample, you'd increase the standard error of your estimate to \(0.002\) and lower the computation time to one hour. Is that worth it? Unless you care whether your estimate is \(0.101\) or \(0.102\), sampling will be a good idea.

Stratified sampling

The beautiful theory

Good news: While the sampling strategy presented above will always work, you can do better as long as \(p \neq 0.5\)! To get an intuition for a sampling strategy that's better than random sampling, let's assume that \(p\) is very small: You have a thousand zeros and only ten ones. In that case, what type of observation is more important? Clearly, the ones: If you dropped one of these observations, your estimated mean would change by almost ten percent. If, instead, you dropped a zero, your estimate for the mean would stay basically unchanged.

This implies that, if you sample, the observations you drop should be the ones with \(D=0\): keep all the ones and down-sample the zeros. This I call stratified sampling.

The first time you hear this, you should be a little concerned: Aren't we selecting based on the outcome? We are. And if you do not adjust your estimates for this sampling strategy, they will be biased. To fix this, you will need to use sample weights in your analyses.
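As a sketch of what this looks like in code, the snippet below keeps every observation with \(D=1\), down-samples the observations with \(D=0\), and attaches the corresponding sampling weights. The 10% keep rate for the zeros is an arbitrary choice; the names df, purchases and weight match the weighting example further down.

import pandas as pd

keep_frac = 0.10                       # fraction of the zeros to keep; an arbitrary choice
ones = df[df.purchases == 1]                                         # keep every observation with D = 1
zeros = df[df.purchases == 0].sample(frac=keep_frac, random_state=0) # down-sample the zeros

# Each kept zero stands in for 1/keep_frac zeros of the full data; the ones keep weight 1
df_sampled = pd.concat([
    ones.assign(weight=1.0),
    zeros.assign(weight=1.0 / keep_frac),
])

The weighted mean computed from df_sampled and its weight column then recovers an unbiased estimate of \(p\), as shown in the weighting section below.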

The graph below tells the story: You can see how stratified sampling (blue) performs against the random sampling (red) that I described above. Move the slider around to change the probability \(p\) of the outcome variable. Note: While deriving the standard error is straightforward when sampling randomly, doing so for stratified sampling was nowhere near as easy. The derivation is given in the appendix.


Did I mention that you can move the slider? If you do that, the probability goes from 50% to 0% in a non-linear way, so you can really close in on 0%. Neat!

Let's start with a probability of 50%. In this case, the idea behind stratified sampling doesn't make sense, since both ones and zeros in our outcome variable are equally valuable for estimating the mean. Therefore, random and stratified sampling perform equally well. As we decrease the probability, however, stratified sampling starts to outperform random sampling, and the difference becomes dramatic as the probability gets closer to zero. In fact, this difference becomes so large that I don't even bother trying to fit both curves onto the same graph.

To see how much stratified sampling can help you, let's assume that you were to reduce your sample size by a factor of ten:

If you sample randomly, your standard error more than triples: it grows by a factor of \(\sqrt{10}\approx 3.2\).

With stratified sampling at \(p=0.05\), it increases by 50%.

With stratified sampling at \(p=0.01\), it increases by only 5%!

There are many data analyses where the outcome variable is true less than 1% of the time. In that case, sampling can give your analysis a huge boost in speed at almost no cost in precision.
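If you want to convince yourself of numbers like these, a rough Monte Carlo check is sketched below. It repeatedly draws Bernoulli data, applies random and stratified sampling at a 10% sampling rate, and compares the spread of the resulting estimates. The parameters, and the exact way the stratified sample is drawn (each zero kept independently with probability q), are assumptions for illustration and may differ from the scheme behind the graph and the appendix.

import numpy as np

rng = np.random.default_rng(0)
p, N, reps = 0.01, 200_000, 500
q = (0.10 - p) / (1 - p)          # keep rate for the zeros so that roughly 10% of all rows survive

est_full, est_random, est_strat = [], [], []
for _ in range(reps):
    d = rng.random(N) < p                         # one draw of Bernoulli(p) outcomes
    est_full.append(d.mean())

    keep = rng.random(N) < 0.10                   # random sampling: keep ~10% of all rows
    est_random.append(d[keep].mean())

    keep = d | (rng.random(N) < q)                # stratified: keep all ones, a q-share of the zeros
    weights = np.where(d[keep], 1.0, 1.0 / q)
    est_strat.append(np.average(d[keep], weights=weights))

print("SE, full sample:        ", np.std(est_full))
print("SE, random sampling:    ", np.std(est_random))
print("SE, stratified sampling:", np.std(est_strat))

With a rare outcome like this, the stratified estimates should spread barely more than the full-sample ones, while the randomly sampled estimates should spread roughly three times as much.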

The ugly practice: Weighting

With stratified sampling, it's necessary to weight your data. While that's simple statistics, implementing it with code is not always straightforward.

Example in Python

The code below uses Python to calculate the unweighted mean of the variable purchases on the full data and its weighted mean on the sampled data. We assume that the full data is stored in a pandas DataFrame df; the sampled data is called df_sampled and also contains a column weight with the sampling weights.

import numpy as np

# Unweighted mean on the full data
mean_unsampled = df.purchases.mean()
# Weighted mean on the sample, using the sampling weights
mean_sampled = np.average(df_sampled.purchases, weights=df_sampled.weight)

Example in R

In R, it's a little nicer: To take an average, you use mean; to take a weighted average, you use weighted.mean, which base R provides out of the box.

mean_unsampled <- mean(df$purchases)
mean_sampled <- weighted.mean(df_sampled$purchases, w = df_sampled$weight)

Click here to see a Python example in action!

Conclusion and next steps

The examples I gave cover only some of the use cases for sampling. For instance, you will also have to deal with panel data (many observations per person, for many people) or your outcome variable might be continuous instead of binary. In these situations, sampling is just as useful, but you'll need to do it somewhat differently:

With panel data, you should use stratified sampling at the person level (see the sketch after this list).

With a continuous outcome variable that has a heavy right tail (for instance, a log-normal distribution), you should do your down-sampling at the dense lower end of the distribution and keep the rare observations in the tail.
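Here is a sketch of the person-level version, assuming a pandas DataFrame df with a hypothetical person identifier person_id and the dummy purchases from the running example: keep every person who ever purchased, sample the remaining persons, and weight their rows accordingly.

import numpy as np
import pandas as pd

keep_frac = 0.10                      # fraction of never-purchasing persons to keep; an arbitrary choice
rng = np.random.default_rng(0)

# Classify every person by whether they ever purchased
ever_bought = df.groupby("person_id").purchases.max()
buyers = ever_bought[ever_bought == 1].index
non_buyers = ever_bought[ever_bought == 0].index

# Keep all buyers; keep a random subset of the non-buyers
kept_non_buyers = rng.choice(non_buyers, size=int(len(non_buyers) * keep_frac), replace=False)
kept_persons = buyers.union(pd.Index(kept_non_buyers))

# Keep every row of a kept person; weight the rows of non-buyers by 1/keep_frac
df_sampled = df[df.person_id.isin(kept_persons)].copy()
df_sampled["weight"] = np.where(df_sampled.person_id.isin(buyers), 1.0, 1.0 / keep_frac)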

If you are interested in more detail, send me an email! I'm happy to add content if there's enough interest.

For the above analysis, I assumed that the time to run the analysis (the complexity) was linear in the number of observations. For many common analyses, for instance estimating a mean or running OLS, that holds true unless you run out of memory. For more complicated analyses, complexity will usually increase more than linearly in the number of observations. This suggests that the relative benefits of sampling are even higher in these situations. If you want a more formal exposition, let me know!

I hope this article was helpful. Did you love it or hate it? Did you find errors? In either case send me an email!

Thanks to Peter Cohen and Alexandra Vo for their fantastic feedback on this post, and also to the wonderful people at TGG Group for allowing me to write much of this while at work.

Appendix

The standard error under stratified sampling

I use these equations to calculate the standard error of the weighted mean.

This solution doesn't look nice. My hunch is that a nicer-looking solution exists. If you know it, please tell me.