In this post, I will be using the excellent CausalInference package to give an overview of how we can use the potential outcomes framework to try and make causal inferences about situations where we only have observational data. The author has a good series of blog posts on it's functionality.

Because most datasets you can download are static, throughout this post I will be using be using my own functions to generate data. This has two advantages: we can and will generate datasets with specific properties, and we have the ability to "intervene" in the data generating system directly, giving us the ability to check whether our inferences are correct. These data generators all generate i.i.d. samples from some distribution, returning the results as a pandas dataframe. You can find the functions which generate these datasets in the accompanying file datagenerators.py on github here.

To begin, let's look at a motivating example.

One day a team lead notices that some members of their team wear cool hats, and that these members of the team tend to be less productive. Being data drive, the Team Lead starts to record whether or not a team member wears a cool hat ($X=1$ for a cool hat, $X=0$ for no cool hat) and whether or not they are productive ($Y=1$ for productive, $Y=0$ for unproductive).

After making observations for a week, they end up with a dataset like the following: