Simpson's paradox is an example of one of the counterintuitive properties of probability distributions. It occurs when an observed relationship between two variables is reversed once you take a third variable into account.

The paradox itself has been described very well elsewhere, so I'm not going to describe it in too much detail here. Instead I'm going to try to answer another question: how do we write a program to generate examples of Simpson's paradox?

Let's start with an example of the paradox:

Imagine we are trying to work out whether a certain drug is an effective treatment for a disease. To decide, we compare people who took the drug with people who did not (call this $x$: $x=1$ if someone took the drug, $x=0$ if not) by examining how many in each group recovered from the disease (call this $y$: $y=1$ means they got better, $y=0$ means they did not).

When we look at just these two variables, we find that of the 250 people who took the drug, 110 recovered (44%), whereas of the 250 people who did not take the drug, 177 recovered (71%). From these results it looks like there is a clear advantage to not taking the drug. Unfortunately, this drug was not administered as part of a randomized controlled trial. This means that the decision of whether or not to take the drug may have been confounded: some other factor may have influenced both who took the drug and who recovered.
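As a quick sanity check on the arithmetic, here is a minimal Python sketch that recomputes the two aggregate recovery rates from the counts quoted above. The helper name `recovery_rate` and the variable names are my own choices for illustration, not part of any library.

```python
def recovery_rate(recovered, total):
    """Return the fraction of a group that recovered."""
    return recovered / total


# Counts from the example: 250 people in each group.
treated_rate = recovery_rate(110, 250)    # took the drug, x = 1
untreated_rate = recovery_rate(177, 250)  # did not take the drug, x = 0

print(f"P(y=1 | x=1) = {treated_rate:.0%}")    # prints "P(y=1 | x=1) = 44%"
print(f"P(y=1 | x=0) = {untreated_rate:.0%}")  # prints "P(y=1 | x=0) = 71%"
```

Looking only at $x$ and $y$, the untreated group appears to do better by about 27 percentage points.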

In fact, if we look at the recovery rate for different age groups (denoted by the variable $z$), we find that in each age group a higher proportion of those who took the drug recovered than of those who did not, as shown in the table below: