Data fallacies

Statistical fallacies are common tricks data can play on you, which lead to mistakes in data interpretation and analysis. Explore some common fallacies, with real-life examples, and find out how you can avoid them.



The practice of selecting results that fit your claim and excluding those that don’t. The worst and most harmful example of being dishonest with data.

When making a case, data adds weight, whether it comes from a study, an experiment or something you’ve read. However, people often highlight only the data that backs their case, rather than the entire body of results. It’s prevalent in public debate and politics, where two sides can both present data that backs their position. Cherry Picking can be deliberate or accidental. Commonly, when you’re receiving data second hand, whoever chooses what to share has an opportunity to distort the truth towards whatever opinion they’re peddling. When on the receiving end of data, it’s important to ask yourself: ‘What am I not being told?’

Related reading: Economics Help: Examples of Cherry Picking in action

Misrepresenting climate science: Cherry Picking data to hide the disappearance of arctic ice

Drawing conclusions from an incomplete set of data, because that data has ‘survived’ some selection criteria.

When analyzing data, it’s important to ask yourself what data you don’t have. Sometimes the full picture is obscured because the data you’ve got has survived a selection of some sort. For example, in WWII, a team was asked where the best place was to fit armour to a plane. The planes that came back from battle had bullet holes everywhere except the engine and cockpit. The team decided it was best to fit armour where there were no bullet holes, because planes shot in those places had not returned.

Related reading: Abraham Wald and the Missing Bullet Holes: An excerpt from How Not To Be Wrong by Jordan Ellenberg
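The reasoning in the plane example can be sketched in a few lines of Python. The sortie counts below are invented purely for illustration (they are not historical figures), and the names `sorties`, `hit_counts` and `loss_rate` are hypothetical. The sketch compares the holes visible on returning planes with what happened across all planes, including those that were lost.

```python
from collections import Counter

# Hypothetical sorties: (section that was hit, whether the plane returned).
# These numbers are made up for illustration only.
sorties = (
    [("wings", True)] * 40 + [("wings", False)] * 5
    + [("fuselage", True)] * 35 + [("fuselage", False)] * 5
    + [("engine", True)] * 5 + [("engine", False)] * 30
    + [("cockpit", True)] * 3 + [("cockpit", False)] * 20
)


def hit_counts(records):
    """Count hits per section for the given records."""
    return Counter(section for section, _ in records)


returned = [record for record in sorties if record[1]]
observed = hit_counts(returned)  # what the team could inspect on the ground
actual = hit_counts(sorties)     # what really happened, lost planes included


def loss_rate(section):
    """Share of planes hit in this section that never came back."""
    return 1 - observed[section] / actual[section]


for section in actual:
    print(f"{section:9s} holes seen: {observed[section]:3d}  "
          f"loss rate: {loss_rate(section):.0%}")
```

In this toy data the sections with the fewest visible holes are the ones with the highest loss rates, which is the pattern the team had to infer without ever seeing the lost planes.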

When an incentive produces the opposite of the intended result. Also known as a Perverse Incentive.

Named after a historic legend, the Cobra Effect occurs when an incentive for solving a problem creates unintended negative consequences. It’s said that in the 1800s, the British Empire wanted to reduce cobra bite deaths in India. To motivate cobra hunting, they offered a financial incentive for every cobra skin brought to them. But instead, people began farming cobras. When the government realized the incentive wasn’t working, they removed it, so cobra farmers released their snakes, increasing the population. When setting incentives or goals, make sure you’re not accidentally encouraging the wrong behaviour.

Related reading: Unintended Consequences of the Wrong Measures

The Cobra Effect: How to avoid unintended consequences when setting goals

Falsely assuming that when two events occur together, one must have caused the other.

Global temperatures have steadily risen over the past 150 years and the number of pirates has declined at a comparable rate. No one would reasonably claim that the reduction in pirates caused global warming or that more pirates would reverse it. But it’s not usually this clear-cut. Often correlations between two things tempt us to believe that one caused the other. However, it’s often a coincidence, or there’s a third factor causing both of the effects you’re seeing. In our pirates and global warming example, the cause of both is industrialization. Never assume causation because of correlation alone; always gather more evidence.

Related reading: Spurious Correlations

This ≠ That
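To see how a shared third factor can produce a strong correlation, here is a minimal sketch of the pirates and temperature example using synthetic data. Every number is an assumption chosen for illustration: a hypothetical `industrialization` index drives both series, and neither series influences the other. The helpers `pearson` and `residuals` are written out by hand so the block needs only the standard library.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical index that rises by one unit per year for 150 years.
industrialization = list(range(150))
# Both series depend only on the index plus independent noise.
temperature = [14 + 0.01 * t + rng.gauss(0, 0.2) for t in industrialization]
pirates = [1000 - 6 * t + rng.gauss(0, 50) for t in industrialization]

raw = pearson(temperature, pirates)
controlled = pearson(residuals(temperature, industrialization),
                     residuals(pirates, industrialization))

print(f"raw correlation:                  {raw:.2f}")
print(f"after removing the shared driver: {controlled:.2f}")
```

The raw correlation is strongly negative even though the code contains no link between pirates and temperature. Once the linear effect of the shared driver is removed from both series, little correlation should remain.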

The Cardinal Sin of Data Mining and Data Science: Overfitting

The practice of deliberately manipulating the boundaries of political districts in order to sway the result of an election.

In many political systems, it’s possible to manipulate the likelihood of one party being elected over another by redefining the political districts, for example by including more rural areas in a district to disadvantage the party that’s more popular in cities. A similar phenomenon known as the Modifiable Areal Unit Problem (MAUP) can occur when analyzing data. How you define the areas used to aggregate your data, e.g. what you define as ‘Northern counties’, can change the result. The scale used to group data can also have a big impact: results can vary wildly depending on whether you use postcodes, counties or states.

Related reading: How the new math of Gerrymandering works: The New York Times covers a Gerrymandering case in Wisconsin

Understanding the Modifiable Areal Unit Problem [PDF]

The Modifiable Areal Unit Problem (MAUP) explained [Wikipedia]
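A small worked example shows how the choice of boundaries alone can change an aggregated result. The vote counts and the two zonings below are hypothetical, and `districts_won` is an illustrative helper rather than a standard method. The underlying data is identical in both cases; only the grouping differs.

```python
# Hypothetical votes for party A in six equally sized areas (10 voters each).
votes_for_a = [7, 7, 4, 4, 4, 4]
VOTERS_PER_AREA = 10


def districts_won(zoning):
    """Count the districts in which party A holds a strict majority."""
    won = 0
    for district in zoning:
        a_votes = sum(votes_for_a[i] for i in district)
        total = VOTERS_PER_AREA * len(district)
        if a_votes * 2 > total:
            won += 1
    return won


# Two ways of grouping the same six areas into three districts.
zoning_one = [(0, 1), (2, 3), (4, 5)]
zoning_two = [(0, 2), (1, 3), (4, 5)]

overall_share = sum(votes_for_a) / (VOTERS_PER_AREA * len(votes_for_a))

print(f"overall share for A: {overall_share:.0%}")
print("districts won, zoning one:", districts_won(zoning_one))
print("districts won, zoning two:", districts_won(zoning_two))
```

Party A has exactly half the votes overall, yet wins one district under the first zoning and two under the second. The same sensitivity applies when grouping any data by region, so it is worth checking whether a finding survives a different choice of boundaries or scale.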

Drawing conclusions from a set of data that isn’t representative of the population you’re trying to understand.

This is a classic problem in election polling, where the people taking part in a poll aren’t representative of the total population, either due to self-selection or bias from the analysts. One famous example occurred in 1948 when The Chicago Tribune mistakenly predicted, based on a phone survey, that Thomas E. Dewey would become the next US president. They hadn’t considered that only a certain demographic could afford telephones, which excluded entire segments of the population from their survey. Make sure to consider whether your research participants are truly representative and not subject to some sampling bias.

Related reading: How to identify bias in samples and surveys
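The effect of polling only one segment of a population can be illustrated with a toy electorate. The group sizes and preferences below are assumptions made up for this sketch, not figures from the 1948 election, and `support` is a hypothetical helper.

```python
# Hypothetical electorate: (group, number of voters, voters backing candidate X).
population = [
    ("has_phone", 400, 240),
    ("no_phone", 600, 240),
]


def support(groups):
    """Share of voters in the given groups who back candidate X."""
    voters = sum(n for _, n, _ in groups)
    backers = sum(b for _, _, b in groups)
    return backers / voters


true_support = support(population)
# A phone survey can only ever reach the first group.
phone_poll = support([g for g in population if g[0] == "has_phone"])

print(f"support across everyone:        {true_support:.0%}")
print(f"support among phone owners only: {phone_poll:.0%}")
```

In this made-up electorate the phone poll shows a clear majority for candidate X, while the full population gives X less than half the vote. Asking more phone owners would not help, because the error comes from who can be reached rather than from how many people are asked.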

The mistaken belief that because something has happened more frequently than usual, it’s now less likely to happen in future, and vice versa.

This is also known as the Monte Carlo Fallacy because of an infamous example that occurred at a roulette table there in 1913. The ball fell in black 26 times in a row and gamblers lost millions betting against black, assuming the streak had to end. However, the chance of black is always the same as red regardless of what’s happened in the past, because the underlying probability is unchanged. A roulette table has no memory. When tempted by this fallacy, remind yourself that there’s no rectifying force in the universe acting to ‘balance things out’!

Related reading: The Gambler’s Fallacy (aka the Monte Carlo Fallacy) explained [Wikipedia]
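The claim that a roulette table has no memory can be checked with a short simulation. This sketch assumes a single-zero wheel with 18 black, 18 red and 1 green pocket; the function name `black_rate_after_streak` and the choice of a streak of three are illustrative. It compares the fixed chance of black with the observed frequency of black on spins that immediately follow a run of blacks.

```python
import random
from fractions import Fraction

# Assumption: single-zero wheel with 18 black, 18 red and 1 green pocket.
P_BLACK = Fraction(18, 37)

# Chance of 26 blacks in a row, judged before any spin has happened.
p_streak = P_BLACK ** 26


def black_rate_after_streak(streak_len, spins, seed=0):
    """Frequency of black on spins that follow at least streak_len blacks."""
    rng = random.Random(seed)
    run = 0          # current run of consecutive blacks
    followed = 0     # spins that came right after a qualifying streak
    black_next = 0   # how many of those spins were black
    for _ in range(spins):
        is_black = rng.randrange(37) < 18
        if run >= streak_len:
            followed += 1
            black_next += is_black
        run = run + 1 if is_black else 0
    return black_next / followed


rate = black_rate_after_streak(streak_len=3, spins=200_000)

print(f"chance of black on any spin: {float(P_BLACK):.3f}")
print(f"observed after a streak:     {rate:.3f}")
print(f"chance of 26 blacks upfront: {float(p_streak):.2e}")
```

A long streak is very unlikely before it starts, but once it has happened, the next spin still lands on black at roughly the usual rate. The gamblers confused the first quantity with the second.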

When something happens that’s unusually good or bad, over time it will revert towards the average.

Anywhere that random chance plays a part in the outcome, you’re likely to see regression toward the mean. For example, success in business is often a combination of both skill and luck. This means that the best-performing companies today are likely to be much closer to average in 10 years’ time, not through incompetence but because today they’re likely benefitting from a string of good luck, like rolling a double-six repeatedly.

Related reading: The Regression to the Mean explained [Wikipedia]

A more in-depth look at Regression Toward the Mean
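The skill-plus-luck argument can be sketched as a simulation. The model is an assumption made for illustration: each company has a fixed skill value and draws fresh, independent luck in each of two periods, with both components on the same scale. The variable names are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable
N = 10_000

# Assumed model: score = fixed skill + fresh luck in each period.
skill = [rng.gauss(0, 1) for _ in range(N)]
period_1 = [s + rng.gauss(0, 1) for s in skill]
period_2 = [s + rng.gauss(0, 1) for s in skill]

# Select the top 10% of performers based on period 1 only.
cutoff = sorted(period_1)[int(N * 0.9)]
top = [i for i in range(N) if period_1[i] >= cutoff]

top_then = mean(period_1[i] for i in top)   # their average when selected
top_later = mean(period_2[i] for i in top)  # the same companies, later
overall = mean(period_2)                    # everyone, later

print(f"top group, period 1: {top_then:.2f}")
print(f"top group, period 2: {top_later:.2f}")
print(f"all companies:       {overall:.2f}")
```

The top group still beats the overall average in the second period because skill persists, but it falls back a long way from its first-period score because the luck that helped it get selected does not carry over.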

A phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.

In the 1970s, the University of California, Berkeley was accused of sexism because female applicants were less likely to be accepted than male ones. However, when trying to identify the source of the problem, they found that for individual subjects the acceptance rates were generally better for women than men. The paradox was caused by a difference in which subjects men and women were applying for. A greater proportion of the female applicants were applying to highly competitive subjects where acceptance rates were much lower for both genders.

Related reading: Simpson’s Paradox

When average isn’t good enough: Simpson’s Paradox in education and earnings
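The reversal described in the admissions example is easier to follow with concrete numbers. The figures below are hypothetical and deliberately small; they are not the real Berkeley data. The sketch computes acceptance rates per subject and then for all applicants combined.

```python
# Hypothetical admissions figures: (applicants, admitted) per subject and gender.
applications = {
    "less competitive": {"men": (100, 80), "women": (20, 18)},
    "highly competitive": {"men": (20, 4), "women": (100, 30)},
}


def rate(applied, admitted):
    """Acceptance rate as a fraction between 0 and 1."""
    return admitted / applied


def overall_rate(gender):
    """Acceptance rate for one gender across all subjects combined."""
    applied = sum(subject[gender][0] for subject in applications.values())
    admitted = sum(subject[gender][1] for subject in applications.values())
    return rate(applied, admitted)


for name, subject in applications.items():
    print(name, {g: f"{rate(*counts):.0%}" for g, counts in subject.items()})
print("overall", {g: f"{overall_rate(g):.0%}" for g in ("men", "women")})
```

Women have the higher acceptance rate in each subject, yet the lower rate overall, because most of the women in this toy data applied to the subject that rejects most applicants of either gender. Comparing the combined rates alone would hide that.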