This is my first attempt at an elementary statistics post, which I hope is suitable for Less Wrong. I am going to present a discussion of a statistical phenomenon known as Simpson's Paradox. This isn't a paradox, and it wasn't actually discovered by Simpson, but that's the name everybody uses for it, so it's the name I'm going to stick with. Along the way, we'll get some very basic practice at calculating conditional probabilities.

A worked example

The example I've chosen is an exercise from a university statistics course that I have taught on for the past few years. It is by far the most interesting exercise in the entire course, and it goes as follows:

You are a doctor in charge of a large hospital, and you have to decide which treatment should be used for a particular disease. You have the following data from last month: there were 390 patients with the disease. Treatment A was given to 160 patients of whom 100 were men and 60 were women; 20 of the men and 40 of the women recovered. Treatment B was given to 230 patients of whom 210 were men and 20 were women; 50 of the men and 15 of the women recovered. Which treatment would you recommend we use for people with the disease in future?

The simplest way to represent this sort of data is to draw a table. We can then pick the relevant numbers out of the table to calculate the required conditional probabilities.

Overall

          A      B
lived    60     65
died    100    165

The probability that a randomly chosen person survived if they were given treatment A is 60/160 = 0.375

The probability that a randomly chosen person survived if they were given treatment B is 65/230 = 0.283

So a randomly chosen person given treatment A was more likely to survive than a randomly chosen person given treatment B. Looks like we'd better give people treatment A.
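As a quick check on the arithmetic, here is a minimal Python sketch of the same calculation. The counts come straight from the exercise; the names `counts` and `overall_rate` are my own, chosen purely for illustration.

```python
# (lived, treated) counts from the exercise, keyed by treatment then gender.
counts = {
    "A": {"men": (20, 100), "women": (40, 60)},
    "B": {"men": (50, 210), "women": (15, 20)},
}

def overall_rate(by_gender):
    """P(lived | treatment), pooling men and women together."""
    lived = sum(l for l, _ in by_gender.values())
    treated = sum(n for _, n in by_gender.values())
    return lived / treated

rates = {treatment: overall_rate(g) for treatment, g in counts.items()}
for treatment, rate in rates.items():
    print(f"P(lived | {treatment}) = {rate:.3f}")
```

This should print 0.375 for treatment A and 0.283 for treatment B, matching the figures above.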

However, since we were given a breakdown of the data by gender, let's look and see if treatment A is better for both genders, or if it gets all of its advantage from one or the other.

Women

          A      B
lived    40     15
died     20      5

The probability that a randomly chosen woman survived given that they were given treatment A is 40/60 = 0.67

The probability that a randomly chosen woman survived given that they were given treatment B is 15/20 = 0.75

So it looks like treatment B is better for women. Guess that means treatment A must be much better for men, in order to be better overall. Let's take a closer look.

Men

          A      B
lived    20     50
died     80    160

The probability that a randomly chosen man survived given that they were given treatment A is 20/100 = 0.2

The probability that a randomly chosen man survived given that they were given treatment B is 50/210 = 0.238

So a randomly chosen man was more likely to survive if given treatment B than treatment A. What is going on here?

Treatment A, which seemed better in the overall data, was worse for both men and women when considered separately.
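If you'd rather not take my arithmetic on trust, the reversal can be checked mechanically. The sketch below uses exact fractions so that rounding can't muddy the comparison; the variable and function names are hypothetical, but the counts are those given in the exercise.

```python
from fractions import Fraction

# (lived, treated) counts from the exercise, keyed by gender then treatment.
data = {
    "women": {"A": (40, 60), "B": (15, 20)},
    "men": {"A": (20, 100), "B": (50, 210)},
}

def rate(lived, treated):
    """Exact survival rate as a fraction."""
    return Fraction(lived, treated)

def pooled_rate(treatment):
    """Survival rate for a treatment with both genders lumped together."""
    lived = sum(data[g][treatment][0] for g in data)
    treated = sum(data[g][treatment][1] for g in data)
    return rate(lived, treated)

# Within each gender, compare B against A.
b_better_in_every_group = all(
    rate(*data[g]["B"]) > rate(*data[g]["A"]) for g in data
)
# In the aggregated data, compare A against B.
a_better_overall = pooled_rate("A") > pooled_rate("B")

print("B better for women and for men:", b_better_in_every_group)
print("A better overall:", a_better_overall)
```

Both lines should report True, which is exactly the reversal described above.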

This, in essence, is Simpson's Paradox: partitioning data can result in a reversal of the correlations present in the aggregated data. Why does this happen? Well, essentially for two reasons. Firstly, the treatments were given to different numbers of people: treatment A was used much less often than treatment B in the example data. Secondly (and probably more importantly), the aggregation is hiding a confounding variable. Treatment B was given overwhelmingly to men (210 of its 230 patients, against 100 of 160 for treatment A), and men are much less likely than women to survive the disease, so treatment B looks worse in the aggregated data.
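One way to see the confounding at work is to write each overall survival rate as a weighted average of the per-gender rates, where the weights are the share of men and women among the patients given that treatment. The following is a rough sketch of that decomposition; `decompose` and the other names are my own invention, and only the counts come from the exercise.

```python
# (lived, treated) counts from the exercise, keyed by treatment then gender.
counts = {
    "A": {"men": (20, 100), "women": (40, 60)},
    "B": {"men": (50, 210), "women": (15, 20)},
}

def decompose(by_gender):
    """Return the overall rate plus each gender's (share, rate, contribution)."""
    total = sum(treated for _, treated in by_gender.values())
    parts = {}
    for gender, (lived, treated) in by_gender.items():
        share = treated / total   # P(gender | treatment)
        rate = lived / treated    # P(lived | treatment, gender)
        parts[gender] = (share, rate, share * rate)
    overall = sum(contribution for _, _, contribution in parts.values())
    return overall, parts

results = {treatment: decompose(g) for treatment, g in counts.items()}
for treatment, (overall, parts) in results.items():
    print(f"Treatment {treatment}: P(lived) = {overall:.3f}")
    for gender, (share, rate, contribution) in parts.items():
        print(f"  {gender:5s} share = {share:.3f}, rate = {rate:.3f}, "
              f"contributes {contribution:.3f}")
```

The point to notice in the output is the share column: men make up a far larger share of treatment B's patients than of treatment A's, so B's overall rate is pulled towards the low survival rate for men even though B does better within each gender.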

So, you might think, we've sorted things out. Gender was the missing variable, and we now know that we can safely give everyone treatment B. Well, if I were writing the exercises for the course I teach on, I would include the following follow-up question.

Yet Another Variable

It turns out that gender wasn't the only data that were collected about the patients. For the men, we also noted whether they had any family history of heart disease. Of the men given treatment A, 80 had a family history of heart disease, and 10 of these survived. Of the men given treatment B, 55 had a family history of heart disease, and 5 of these survived. The data now break down as follows:

History of heart disease

          A      B
lived    10      5
died     70     50

No history of heart disease

          A      B
lived    10     45
died     10    110

This time I will leave the calculations as an exercise to the reader but, as you can see, things have changed again. We can keep playing this game all day.
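For readers who want to check their answers once they've had a go, here is a small sketch that computes the four survival rates for men and confirms that the two subgroups add back up to the figures for all men. The names are hypothetical; the counts are the ones given above.

```python
from fractions import Fraction

# (lived, treated) counts for men only, split by family history of heart disease.
men = {
    "history": {"A": (10, 80), "B": (5, 55)},
    "no history": {"A": (10, 20), "B": (45, 155)},
}

def survival(lived, treated):
    """Exact survival rate as a fraction."""
    return Fraction(lived, treated)

for group, by_treatment in men.items():
    for treatment, (lived, treated) in by_treatment.items():
        print(f"{group:10s} {treatment}: {float(survival(lived, treated)):.3f}")

# Sanity check: summing the subgroups should recover the counts for all men.
pooled = {
    t: tuple(sum(men[g][t][i] for g in men) for i in (0, 1))
    for t in ("A", "B")
}
print("all men:", pooled)

a_better_in_both = all(
    survival(*men[g]["A"]) > survival(*men[g]["B"]) for g in men
)
print("A better in both subgroups:", a_better_in_both)
```

The pooled counts should come out as 20 of 100 for treatment A and 50 of 210 for treatment B, the same numbers as in the table for men.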

Which data to use?

This leaves us with the important question: which data should we use when making our decisions? Given a randomly chosen person, it looks like treatment A is better than treatment B. But any randomly chosen person is either a man or a woman, and whichever they are, treatment B is better than treatment A. But let's say the randomly chosen person is a man. Then we could ask them whether or not they have a family history of heart disease, and whichever answer they give, we will prefer to give them treatment A.

It may appear that the partitioned data always give a better answer than the aggregated data. Unfortunately, this just isn't true. I made up the numbers in the previous example five minutes ago in order to reverse the correlation in the original exercise. Similarly, for just about any given set of data, you can find some partition which reverses the apparent correlation. How are we to decide which partitions are useful? If someone tells us that women born under Aries, Leo or Sagittarius do better with treatment A, as do those born under the Earth, Air and Water signs, would we really be willing to switch treatments?

As you might expect, Judea Pearl has an answer to this problem (in chapter 6 of [1]). If we draw the relevant causal networks, we can formally decide which variables are confounding and so which partitions we should use (he quotes a further famous example in which it is shown that you might want to use different versions of the same data depending on how they were acquired!), but that's another post for another time (and probably for someone better acquainted with Pearl than I am). In the meantime we should take Simpson's Paradox as a further warning of the dangers of drawing causal conclusions from data without understanding where the causes come from.

In Real Life

I'll finish with a famous real-life example. In 1975, there was a study published [2] which demonstrated that 44% of male applicants for graduate programmes at Berkeley were being accepted, whereas only 35% of female applicants were. This was obviously a pretty serious problem, so the authors decided to have a closer look, to try to see which departments in particular were most guilty of discrimination.

As you'll be expecting by now, what they found was that not only were most of the departments not biased at all, in fact, there were more which were biased in favour of women than there were in favour of men! The confounding variable that was found was that women were applying for more competitive departments than men... of course, as we've seen, it's just possible that something else was hiding in the data.

There are several other real-life examples. You can find a few in the Wikipedia article on Simpson's Paradox. Batting averages are a common toy example. It's possible for one player to have a better average than another every season for his entire career, and a worse average overall. Similar phenomena are not particularly unusual in medical data - treatments which are given to patients with more serious illnesses will tend to look worse in aggregate data. One of my personal favourite examples is that countries which put fluoride in the water have significantly more people who require false teeth than those which don't. As usual, there's a hidden variable lurking.

References:

[1] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press (2000, 2nd edition 2009).

[2] P.J. Bickel, E.A. Hammel and J.W. O'Connell (1975). "Sex Bias in Graduate Admissions: Data From Berkeley". Science 187 (4175): 398–404.