Wearing a Bra Makes You Live Longer

On causality versus correlation

This is hopefully obvious to most readers, but in my experience many still confuse the following two statements: “whenever A holds, B is more likely to hold” and “A causes B”. These are not the same. The first expresses statistical dependence; the second describes causal influence.

Let me demonstrate this distinction with an example. If you look at some data, you may observe that “people who wear bras tend to live longer”. So, if you want to live longer you should wear a bra, right? Of course not. The statement simply reflects the fact that wearing a bra and life expectancy are positively correlated: women are more likely to wear bras, and they also tend to live longer than men. Trying to extend your life by wearing a bra is a stupid strategy.
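To make this concrete, here is a minimal simulation sketch in Python. Every number in it (the share of people wearing bras, the average lifespans, the spread) is an invented assumption, not real data, and the names simulate_population and mean_lifespan are just illustrative. In this toy world, lifespan depends only on sex and the bra has no effect at all, yet the pooled data still shows bra wearers living longer.

```python
import random

def simulate_population(n=100_000, seed=0):
    """Return a list of (is_female, wears_bra, lifespan) tuples.

    Assumed toy model with made-up numbers: sex influences both
    bra wearing and lifespan. The bra never influences lifespan.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        is_female = rng.random() < 0.5
        wears_bra = rng.random() < (0.9 if is_female else 0.01)
        lifespan = rng.gauss(82.0 if is_female else 77.0, 8.0)
        people.append((is_female, wears_bra, lifespan))
    return people

def mean_lifespan(people, wears_bra, is_female=None):
    """Average lifespan for a given bra status, optionally within one sex."""
    values = [life for female, bra, life in people
              if bra == wears_bra and (is_female is None or female == is_female)]
    return sum(values) / len(values)

people = simulate_population()

# Pooled comparison: ignores sex, so the confounder leaks into the result.
pooled_gap = mean_lifespan(people, True) - mean_lifespan(people, False)

# Stratified comparison: compare like with like inside each sex.
gap_women = mean_lifespan(people, True, True) - mean_lifespan(people, False, True)
gap_men = mean_lifespan(people, True, False) - mean_lifespan(people, False, False)

print(f"pooled gap:      {pooled_gap:.2f} years")
print(f"gap among women: {gap_women:.2f} years")
print(f"gap among men:   {gap_men:.2f} years")
```

The pooled gap comes out at several years, while the gaps within each sex hover around zero (the one among men is noisier, because the simulated sample contains few bra-wearing men). The dependence is real, but it is produced entirely by the variable that drives both quantities.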

Despite this exaggerated example, and several others like it, which clearly demonstrate what can go wrong when statistical dependence is misread as causal influence, the distinction between correlation and causation is very often overlooked. Almost every day I see blog posts and articles making this fundamental mistake. Below are two examples I have come across recently.

A shorter commute time to save your marriage?

Before finishing this post I found a great example, a blog post about the correlation between commute times and divorce rates by HubSpot founder and CTO Dharmesh Shah. The post referenced the finding that:

“If your commute is longer than 45 minutes, you’re 40% more likely to get divorced.”

The rest of the blog post proposes some possible causal mechanisms that could explain the phenomenon, and essentially suggests that you should aim to reduce your commute time to save your marriage. Some of the hypotheses presented make sense, but the data does not actually imply a cause-effect relationship.

So what can go wrong?

The most common problem is a confounding factor that the study does not account for. What if the observed correlation is due to a confounding factor such as household income: poorer people tend to have longer commute times, and they also tend to divorce more often? If this is the case, reducing your commute time alone will hardly save your marriage.
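The difference between observing a short commute and imposing one can be sketched with another small Python simulation. This is a hypothetical model, not a reconstruction of the study: the probabilities are invented, and it assumes that low income raises both the chance of a long commute and the chance of divorce, while the commute itself does nothing. The names simulate and rate are illustrative.

```python
import random

def simulate(n=200_000, seed=1, force_short_commute=False):
    """Return a list of (long_commute, divorced) pairs.

    Assumed toy model with made-up probabilities: low income makes a
    long commute and a divorce more likely. Commute never affects divorce.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        low_income = rng.random() < 0.4
        long_commute = rng.random() < (0.6 if low_income else 0.2)
        if force_short_commute:
            long_commute = False  # intervention: everyone gets a short commute
        divorced = rng.random() < (0.5 if low_income else 0.3)
        rows.append((long_commute, divorced))
    return rows

def rate(rows, long_commute=None):
    """Divorce rate overall, or within one commute group."""
    selected = [d for c, d in rows if long_commute is None or c == long_commute]
    return sum(selected) / len(selected)

observed = simulate()

# Observational view: long commuters divorce more often.
relative_risk = rate(observed, True) / rate(observed, False)

# Interventional view: shorten every commute and look at the overall rate.
overall_before = rate(observed)
overall_after = rate(simulate(force_short_commute=True))

print(f"relative risk of divorce, long vs short commute: {relative_risk:.2f}")
print(f"overall divorce rate before intervention: {overall_before:.3f}")
print(f"overall divorce rate after intervention:  {overall_after:.3f}")
```

In the observed data, long commuters are noticeably more likely to be divorced. Forcing everyone onto a short commute leaves the overall divorce rate exactly where it was, because in this model income is the only thing that drives divorce. Whether the real world looks like this model is precisely what the correlation alone cannot tell you.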

Best time to tweet?

Another good example of the correlation vs causation oversight is bit.ly’s analysis of the optimal time to tweet. Bit.ly observes click patterns on millions or probably billions of shortened URLs a day, and from all this data they created the neat heat map visualisation below (showing only links shared on Twitter). The y axis shows the day of the week, the x axis the time of day. Darker blue corresponds to a higher click rate.

Click through patterns during the week

(source: bit.ly)

Based on this heat map the blog post concludes:

“For Twitter, posting in the afternoon earlier in the week is your best chance at achieving a high click count (1-3pm Monday through Thursday). Posting after 8pm should be avoided.” — Hilary Mason, bit.ly

Wrong! The data does not strongly support this strategy. It only shows correlation, not causation. Again, you can think of possible confounding factors: say, football-related content gets a lot of clicks, and it tends to be shared around 1-3pm.

Don’t get me wrong, this analysis gives very useful insights, certainly better than having no information at all. But care has to be taken when strategies for manipulating an outcome are devised from observational data. A general rule of thumb you can follow is this: