If you’ve been following this series, you know there are many ways an analysis you’re doing in data science can become biased. One way is the common problem that correlation doesn’t imply causation. You can find, for example, a negative effect of a treatment or policy when the real effect is positive! We’ve discussed how to correct for this type of bias at length in the article about the back-door criterion, A Technical Primer on Causality.

This article will be about a different source of bias: selection bias. We’ll develop the intuition behind a generalization of the back-door criterion, called the selection back-door criterion, or the s-back-door criterion. First, let’s dig into exactly what selection bias means.


About Selection Bias

Selection bias comes in many forms. Here’s an example. Political pollsters might survey people’s voting preferences in an election by randomly sampling phone numbers. This approach under-represents people who forgo landlines for cellphones (note: modern pollsters call cellphones too, precisely because of this problem!). If these cellphone-preferring voters also tend to favor one candidate at a rate that differs from the general population, then this under-representation will lead to bias. We’d like a way to rigorously correct for this problem. In general, survey samples won’t perfectly represent the general population.

Here’s another slightly different example. You might be surveying skill sets within a college. At that college, students will be admitted if they have strong math skills, or strong social skills (or both). If you admit people with this policy, you’ll find a negative relationship between math and social skills within the college population, even if they’re un-associated in the general population. This is strictly due to the admissions process. Intuitively, if I know someone was admitted, and I know they don’t have strong math skills, they must have strong social skills. In other words, conditional on admittance, math and social skills are statistically dependent. If you only sample students (people who were admitted), then your sample implicitly conditions on admittance!

In the rest of this post, I’ll go more into the technical background so you understand selection bias. Then I’ll briefly explain the use of post-stratification weighting in Google Consumer Surveys as a practical example, and as a reference point for the technique of post-stratification weighting. Finally, I’ll detail some powerful and relatively new results from Bareinboim, Tian, and Pearl on a general approach for correcting selection bias. They’ve answered the questions “When is selection bias present?”, “What do I need to adjust for to correct for selection bias?”, and finally “How do I make the adjustment to correct for selection bias?”. I’ll demonstrate with a real-world example and give some code in Jupyter notebooks.

Understanding Selection Bias

If you haven’t read about the back-door criterion, go back and read the technical primer on causality! It’s a pre-requisite for this article. In particular, you’ll need to understand statistical and causal dependence, d-separation, and how d-separation relates to graphs.

When does selection bias happen? The textbook case is as in the example above: we select a population without realizing that the population selects for some characteristics. In our example, admittance to college selects for high achievement in math or social skills. After this selection, you’ll find math and social skills are negatively related, even when they’re unrelated in the general population! If you draw a causal graph for this, it looks like the figure below.

Here, you can decompose the joint distribution like P(A, M, S) = P(A|M, S)P(M)P(S), and use that decomposition to show P(M,S|A) doesn’t decompose into P(M|A)P(S|A)! In other words, M and S aren’t conditionally independent given admittance to the college. Why are we conditioning?

When we say we’re examining the college population, we’re saying that A=true (the student was admitted). In the general population, there are many students who aren’t admitted (A=false), but we don’t see them, e.g. because we’re running a study in a college research lab and recruit using fliers around campus. Our sample is selection biased.
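You can watch this happen in a quick simulation. Here’s a minimal sketch (the unit-normal skill distributions and the admission threshold are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Math and social skills: independent in the general population.
math = rng.normal(size=n)
social = rng.normal(size=n)

# Admission selects for strong math OR strong social skills.
admitted = (math > 1.0) | (social > 1.0)

r_pop = np.corrcoef(math, social)[0, 1]                      # ~0 in the population
r_adm = np.corrcoef(math[admitted], social[admitted])[0, 1]  # negative!

print(f"population correlation:    {r_pop:+.3f}")
print(f"admitted-only correlation: {r_adm:+.3f}")
```

Conditioning on admission manufactures a strong negative correlation between two variables that are independent in the general population.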

Sampling Indicators and Intuition

In general, you can add a binary sampling indicator to a graph like this one. In a real study, we’d augment the graph above with a sampling indicator like

the figure to the left, where admittance drives the sampling process. Now, we’re conditioning on S=1 (the person was in the study), which is a descendant of the collider at A, and so has a similar effect to conditioning on A directly: it induces statistical dependence between M and So (social skills, relabeled to avoid clashing with the sampling indicator) conditional on S. Conditioning on a collider, or on a descendant of a collider, introduces selection bias!

In general, the selection bias problem can be more subtle than this. Consider a case like in the figure below. Here, the sampling process is driven by the X variable, and Y (the outcome of interest) is also driven by X. Let’s make this concrete: X might be how hard you study for a test, and Y might be your test outcome. Someone might be running a study on the effects of studying on test outcomes, and might recruit by asking around the school library. Then you’re more likely to be recruited if you study harder, so X drives S as well.

In this example, we tend to select students who study harder. We have a biased sample relative to the general population, and we’ll find that P(Y|S=1), the distribution of Y in the population studied, is different from P(Y), the distribution of Y in the general population. People we sample, because they were sampled, are people who tend to study harder. (Thanks to Alan O’Donnell for clarifying this explanation!)
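Here’s a minimal sketch of that library-recruitment story (the exponential study-time distribution and the logistic recruitment probability are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# X: hours studied; Y: test score, driven by X plus noise.
x = rng.exponential(scale=5.0, size=n)
y = 10.0 * x + rng.normal(scale=5.0, size=n)

# Recruiting at the library: the harder you study, the likelier you're sampled.
p_sample = 1.0 / (1.0 + np.exp(-(x - 5.0)))
s = rng.random(n) < p_sample

print(f"E[Y]     = {y.mean():.1f}")     # population mean
print(f"E[Y|S=1] = {y[s].mean():.1f}")  # biased sample mean, noticeably higher
```

The sampled group over-represents hard studiers, so its mean score lands well above the population mean.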

This problem can happen in more general cases: any time there’s statistical dependence between S and Y. You can see an especially weird version of this in the following example, which actually has an interpretation as collider bias! Here, we’ll look at bias in the conditional distribution P(Y|D), which is not equal to P(Y|D, S=1).

Generally, we’d like to know when we don’t have to deal with selection bias. We should set up some notation before we go on.

Notation: Samples vs. Populations

The quantities we usually want to estimate are probability distributions and statistics of random variables like averages, variances, etc. We’ll write the distribution of Y when we know X as P(Y|X=x). This is what you would measure in the general population if you estimated the distribution of Y over all data points where X=x. Contrast this with the sampled version: if a data point is sampled, then the sampling indicator takes the value S=1. If it wasn’t, then S=0. That means the version of the conditional distribution we get for Y given X in our sample is actually P(Y|X, S=1).

All sample quantities are estimated conditional on S=1. An example we’ll look at often is the sample mean of Y, E[Y|S=1], and the conditional sample mean E[Y|X=x, S=1].

Sometimes, sample quantities will match population quantities. For the mean, that’s E[Y] = E[Y|S=1]: the population mean equals the sample mean.

Some useful theorems

Returning to our example, our sample estimate P(Y|X, S=1) matches the population quantity P(Y|X) exactly when P(Y|X) = P(Y|X, S=1), i.e. when Y is independent of S given X. Bareinboim, Tian, and Pearl have formalized this intuition into a nice, simple theorem:

Gs here is the causal graph, which is required to include a sampling indicator, S. There’s a nice implication here: sampling is always caused by something, so you should be able to add a sampling variable to your causal graph. [EDIT: as one commenter points out, it’s worth noting James Heckman’s Nobel Prize-winning work on this subject, as cited in Bareinboim et al.’s paper].

We can look at the example of getting sampled because you study, and doing well on a test because you study. We’re saying here that because studying accounts both for getting sampled and doing well on a test, it’s all we need to know to account for the fact that we tend to over-sample people who study hard. We simply look at sub-populations who are equally likely to be sampled, and see how likely they are to do well on the test. These are precisely the sub-populations of equal X! In other words, in this system, P(Y|X, S=1) = P(Y|X).

In this theorem, “s-recoverable” has a precise definition. We won’t detail it here. We’ll just say that intuitively, it means that you can recover the population-level quantity P(Y|X) from a sample. In other words, it’s possible to recover from selection bias! That won’t be true in general, so it’s nice to know when it is true.
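To make s-recoverability concrete, here’s a simulation of the studying example in which sampling depends only on X, so Y is independent of S given X (all of the numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# X: "studies hard" (binary); Y: test score depends on X.
x = rng.random(n) < 0.5
y = np.where(x, 80.0, 60.0) + rng.normal(scale=10.0, size=n)

# Sampling depends only on X, so Y is independent of S given X.
s = rng.random(n) < np.where(x, 0.8, 0.2)

# The marginal sample mean is biased...
print(f"E[Y] = {y.mean():.1f}   E[Y|S=1] = {y[s].mean():.1f}")
# ...but within each stratum of X, the sample matches the population.
for xv in (False, True):
    pop = y[x == xv].mean()
    smp = y[(x == xv) & s].mean()
    print(f"x={int(xv)}: E[Y|X] = {pop:.1f}   E[Y|X, S=1] = {smp:.1f}")
```

The marginal distribution of Y is distorted by the over-sampling of hard studiers, but P(Y|X) is s-recoverable: each X-stratum of the sample looks just like the corresponding stratum of the population.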

This result holds for any causal graph G. The criterion is very restrictive: it requires that the variables you want to condition on, X, render S and Y conditionally independent. That generally won’t be true. When that condition doesn’t hold, we simply can’t recover P(Y|X) from a selection-biased sample (without a more general result). That is, statistical inference isn’t even possible (never mind causal inference, the recovery of P(Y|do(X=x))!).

Fortunately, if we allow some other criteria to hold, we can do better! This result only applies in the case that you only have data from a sample (all distributions are conditional on S=1). If you have some population-level data (not conditional on S=1), you can do much better!

A more general solution: using external data

In general, we can add some extra data from a set of variables, C. We’d like S to be independent of Y given X together with C. If we can achieve that, then we can say P(Y|X, C, S=1) = P(Y|X,C), and we can recover this new conditional distribution. If we use the law of total probability with this, we can recover P(Y|X)! The tricky part is that the law of total probability will require us to have P(X,C), which is a population-level quantity (not conditional on S=1). This all gives us the following theorem:

This result is a big deal! It’s a general tool for using some population-level data with biased data from a sample, and recovering unbiased population-level conditional distributions. Let’s look at some applications to make this all concrete.
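Here’s a quick numerical check of that idea, with invented probabilities: sampling depends only on an external variable C, so Y is independent of S given {X, C}, and combining the biased within-stratum estimates with the population-level P(C|X) recovers P(Y|X):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

x = rng.random(n) < 0.5                        # X: a binary covariate
c = rng.random(n) < np.where(x, 0.7, 0.3)      # C: external variable, depends on X
y = rng.random(n) < 0.2 + 0.3 * x + 0.4 * c    # Y: depends on X and C
s = rng.random(n) < np.where(c, 0.7, 0.1)      # sampling driven by C only

def recovered(xv):
    # sum over c of P(Y=1 | X, C=c, S=1) * P(C=c | X), P(C|X) from population data
    mask = x == xv
    return sum(
        y[mask & (c == cv) & s].mean() * (c[mask] == cv).mean()
        for cv in (False, True)
    )

for xv in (False, True):
    truth = y[x == xv].mean()          # population P(Y=1 | X)
    naive = y[(x == xv) & s].mean()    # biased: over-weights the c=1 stratum
    print(f"x={int(xv)}: truth={truth:.3f}  naive={naive:.3f}  "
          f"recovered={recovered(xv):.3f}")
```

The naive estimate is pulled toward the heavily sampled c=1 stratum, while the adjusted estimate lands on the population value.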

Post-stratification and survey weighting: an example

It turns out this formula reduces to post-stratification weighting when X = {}. Post-stratification is used in Google Consumer Surveys to achieve better accuracy by making the survey sample more representative of the target population. If your sample is 25% women, but the general population is 50%, you can adjust your estimates by weighting women more heavily in your sample. You can derive this using the law of total probability (try it as an exercise, if you want to refresh your probability skills!),

For the C=female stratum, we’d estimate 25% for P(C=female|S=1) — our sample proportion. If we just replace this value with the population proportion (50%), and we assume Y is independent of S given C, we can get a new estimate for E[Y] that better reflects the population value. This is the basic idea behind post-stratification weighting. It gives us a new estimator for the population mean of Y,

using our sample knowledge of the average value of Y in each stratum, E[Y| C=c, S=1], and the known population proportion for C.

There are some nice side effects to the stratification. If C accounts for some of the variance in Y, the variance of the population mean estimator can be smaller than if you didn’t stratify! You can find the variance estimator in statistics classes, and Penn State actually has a nice online course with a tutorial on post-stratification weighting I like to refer to!

For the purposes of estimation, you can turn this expression into a sum over data points instead of strata as

which is just a weighted average over the Y, with weights given by this ratio of probabilities. That’s why this is called post-stratification weighting. For surveys, you can weight survey responses using the sample proportion of various demographic traits (age, sex, location) and the population-level proportion of those same traits (e.g. from US census data). This is exactly the approach taken by Google Consumer Surveys.
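Putting both forms together on the 25%-women example (the outcome means and the sampling rates below are invented; the census proportion is taken as 50%):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

# C = True means female: 50% of the population, but under-sampled.
c = rng.random(n) < 0.5
y = np.where(c, 70.0, 50.0) + rng.normal(scale=5.0, size=n)
s = rng.random(n) < np.where(c, 0.2, 0.6)   # the sample ends up ~25% female

naive = y[s].mean()
# Stratum form: sum over c of E[Y | C=c, S=1] * P(C=c), census P = 0.5 each.
post = 0.5 * y[s & c].mean() + 0.5 * y[s & ~c].mean()
# Weighted form: per-respondent weight P(C=c) / P(C=c | S=1).
p_sample_stratum = np.where(c[s], c[s].mean(), 1.0 - c[s].mean())
weighted = np.average(y[s], weights=0.5 / p_sample_stratum)

print(f"population mean:   {y.mean():.2f}")
print(f"naive sample mean: {naive:.2f}")
print(f"post-stratified:   {post:.2f}")
print(f"weighted average:  {weighted:.2f}")
```

The two corrected estimates agree (they’re algebraically identical) and land near the population mean of 60, while the naive sample mean sits near 55.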

To justify all of this intuition theoretically, if we take the formula from the s-recovery theorem above, set X={}, and multiply both sides by Y and sum over Y, we can get a result for the expectation value of Y given sample data.

and we see the population-level estimate for the outcome, when Y is independent of S given C, is given by the post-stratification estimator! The more general result reduces nicely to the familiar post-stratification weighting formula.

Now, let’s apply this theorem to a more general problem, where X isn’t the null set.

A Machine Learning Example

When I was at BuzzFeed, we had a tricky missing-data problem. We wanted to translate articles into other languages if we thought they’d perform well. In the past, people had done this subjectively, based on their intuition for which articles might perform well. I suggested training a machine-learning model to predict the performance of articles if they’re translated, Yt (translated article pageviews), based on a set of attributes, X. The problem is that we’ve only translated a subset of articles, so we’ve only measured Yt for translated articles, not the whole population of articles! Our selection indicator is “whether or not an article is translated”.

Even worse, we translated articles because we expected that the performance would be good, so we were sure Yt wouldn’t be independent of translation. We expected that there‘d definitely be some selection bias. How can we tackle this problem with our new theorem?

We can start by drawing a causal graph for this problem,

Here, Y is the original article performance. X is some set of known attributes of the article that drives its performance, and might also drive the translated article performance. Those attributes, together with past performance, drive whether we translate the article, S. There is a set of unknown attributes, U, that drive performance before and after translation, as well. The problem, then, is to estimate P(Yt|X, Y) in the general population, given data only for a sample. With that, we’d be able to estimate E[Yt|X, Y]. Without bias correction, we’d estimate something that’s potentially different: P(Yt|X, Y, S=1), or E[Yt|X, Y, S=1].

Referring back to the theorem, we need to find a set of variables in the graph that will d-separate Yt and S (conditionally on X and Y). There are a couple of paths that induce statistical dependence between them: Yt ← U → Y → S and Yt ← X → Y → S. Additionally, since we want to condition on Y, we open the collider at Y, so there’s the path Yt ← U → Y ← X → S to worry about, too. We need to find a set that blocks all of these paths, separating Yt from the sampling indicator. The set {X, Y} blocks all of them!

This blocking really hinges on whether we can exhaustively measure all of the variables, X. If we miss some, we’ll fail to correct part of the selection bias. We can hope that we’ve done a good enough job, but we have no guarantee. This is a general problem when working with observational data; the usual rejoinder is a question to the effect of “would you rather not use the data at all?”. You should treat these estimates as your provisional best estimate, and take them with a grain of salt.

As we proceed, we’ll assume we’ve done a good job accounting for confounding factors. That would mean we actually understand the factors that are common drivers of original and translated article performance, as well as the selection process. That’s where our “best guess” interpretation comes in.

From here, we can use the adjustment formula.

where Y → Yt, X → {X, Y}, and C = {}. This case simplifies to just estimating P(Yt|X, Y). We can use the first theorem in this case: there’s no need to do any adjustment to P(Yt|Y, X)!

We might instead prefer to estimate P(Yt|X), where we don’t have to wait for the article to perform to know whether we should translate it. We could also use this model to explore what kinds of article tend to perform well when translated.

In that case, we still need to control for Y to render Yt conditionally independent of S. Here, Y → Yt, X → {X}, and C → {Y}. The procedure is then to use a machine learning model for P(Yt|X, Y, S=1), and another model for P(Y|X). The first estimator uses only sample-level data, but the second needs population-level data. Fortunately, we have population-level data for Y and X jointly: it’s just the original article performances together with their attributes. We can estimate these two quantities, and combine them as the formula requires. You can estimate this by modifying the final example that follows the next section.
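Here’s a sketch of that two-model procedure on synthetic data. Everything here is invented: the linear coefficients, the logistic selection rule, and the variable names. Note that plugging the predicted E[Y|X] into the first model is only exact because both models are linear:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Synthetic stand-in for the translation problem.
x = rng.normal(size=n)                 # article attributes
u = rng.normal(size=n)                 # unobserved quality
y = 2.0 * x + u + rng.normal(size=n)   # original performance (known for all articles)
yt = 1.5 * x + u + rng.normal(size=n)  # translated performance
s = rng.random(n) < 1.0 / (1.0 + np.exp(-(x + y - 2.0)))  # translate promising ones

# Model 1: E[Yt | X, Y, S=1], fit on the translated sample only.
A = np.column_stack([np.ones(s.sum()), x[s], y[s]])
b1, *_ = np.linalg.lstsq(A, yt[s], rcond=None)

# Model 2: E[Y | X], fit on population-level data (every article has X and Y).
B = np.column_stack([np.ones(n), x])
b2, *_ = np.linalg.lstsq(B, y, rcond=None)

# Combine: E[Yt | X] = E_Y[ E[Yt | X, Y, S=1] ] = model1(X, model2(X)) here.
x_grid = np.array([-1.0, 0.0, 1.0])
y_hat = b2[0] + b2[1] * x_grid
yt_hat = b1[0] + b1[1] * x_grid + b1[2] * y_hat
print("E[Yt|X] estimates:", np.round(yt_hat, 2))
```

The first regression only ever sees translated articles, but because translation depends only on X and Y, its conditional mean is unbiased; the second regression supplies the population-level P(Y|X) the formula asks for.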

Causal Inference with Selection Bias

Now that we know how to estimate P(Y|X) in the context of selection bias, we can understand how we estimate P(Y|do(X))! The key trick will be to use the back-door adjustment formula,

but require that the strata-level regressions are selection-bias corrected. That means if Z is a back-door admissible set for the causal effect of X on Y, we need Y to be conditionally independent of S given X and Z. Then, P(Y|X,Z,S=1) = P(Y|X,Z)! Here’s the weird part, if you have any experience doing regression analysis: in general, that might require conditioning on descendants of X! That was forbidden by the back-door criterion, which is a sufficient condition for adjustment. The extra wiggle-room comes from the fact that the back-door criterion (which told us not to condition on descendants of the causal state!) is sufficient but not necessary.

There’s good intuition for why you shouldn’t usually condition on descendants of the causal state, X. You can block mechanisms for the causal effect, or accidentally condition on colliders or descendants of colliders. Both of those will result in bias. The trick here will be that we have to explicitly require that the set we condition on doesn’t introduce that bias. The s-backdoor criterion does exactly that. It reads:

From Bareinboim, Tian, and Pearl (2014). http://ftp.cs.ucla.edu/pub/stat_ser/r425.pdf

Z is the set of things we’d like to condition on for back-door adjustment, and to make sure we adjust for sample-selection bias. Now, we need to make sure we don’t add bias due to conditioning on descendants of the causal state, so we have to add extra constraints on all descendants of the causal state. Those are the Z- variables.

The extra constraint is that the Z- variables are blocked from Y by a set of variables that satisfies the back-door criterion! That makes sure they don’t lie along a causal path between the causal state, X, and Y. It also makes sure they’re not a descendant of a collider with information about Y.

With those extra constraints satisfied, we have a new back-door criterion that corrects for selection bias. It required conditioning on descendants of the causal state, so we made sure we didn’t add extra bias by doing that. The new adjustment formula is

Notice that if Y is independent of S given {X, Z}, then you’re free to remove S from the conditional! This reduces then to the original back-door adjustment formula.
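A small simulated check of this adjustment, with invented probabilities: Z confounds X and Y, selection is driven by the treatment and the confounder (so Y is independent of S given {X, Z}), and weighting the biased strata estimates by the population-level P(Z) recovers the causal effect:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

z = rng.random(n) < 0.4                        # confounder; population P(z) is known
x = rng.random(n) < np.where(z, 0.7, 0.3)      # treatment influenced by z
y = rng.random(n) < 0.1 + 0.3 * x + 0.4 * z    # true effect of x on y is 0.3
s = rng.random(n) < 0.1 + 0.4 * x + 0.3 * z    # selection driven by x and z

def p_do(xv):
    # adjustment: sum over z of P(y | x, z, S=1) * P(z)
    return sum(
        y[(x == xv) & (z == zv) & s].mean() * (z == zv).mean()
        for zv in (False, True)
    )

naive = y[x & s].mean() - y[~x & s].mean()   # confounded and selection-biased
adjusted = p_do(True) - p_do(False)
print(f"naive effect:    {naive:.3f}")
print(f"adjusted effect: {adjusted:.3f}")    # should land close to 0.3
```

Within each (x, z) stratum the sampling probability is constant, so the sample strata estimates are unbiased; weighting them by the population P(z) then removes the confounding.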

Let’s return to the earlier example, with selection for translated articles. We wanted to estimate P(Yt|X). P(Yt|do(X)) would tell us what kinds of articles we might want to produce more of in order to get better translated-article performance, while P(Yt|X) tells us which articles tend to perform well. I’d do both, but this article is long enough! Look forward to a future post for that, along with other selection-bias correction methods.

Conclusion

In summary, we can estimate conditional distributions and expectations at a population level, as long as we satisfy certain independence assumptions. We’ve shown that the post-stratification approach is just a special case of a more general formula for selection-bias correction. We’ve used that intuition to understand causal inference in the presence of selection bias.

These theorems are powerful tools. You can turn all of these formulae into statistical and causal estimators by using machine learning models to estimate the P(Y|X,Z) and P(Y|X,Z,S=1) pieces (and similarly for expectation values).

The biggest constraint is probably that these approaches require joint population-level data for the X and Z (or C) variables. You could always start with the simplest version, where X is empty, and take a post-stratification styled approach!