$\begingroup$

Another example is the ecological fallacy.

Example

Suppose that we look for a relationship between voting and income by regressing the vote share for then-Senator Obama on the median income of a state (in thousands). We get an intercept of approximately 20 and a slope coefficient of 0.61.
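To make the fitted line concrete, here is the arithmetic for one hypothetical state. I am assuming that vote share is measured in percentage points, which the source does not say explicitly, and the \$50,000 median income is an illustrative value, not a figure from the data:

$$\widehat{\text{vote share}}_s = 20 + 0.61 \times \text{median income}_s, \qquad 20 + 0.61 \times 50 = 50.5$$

Under that reading, each additional \$1,000 of state median income goes with roughly 0.61 more points of vote share for the state as a whole.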

Many would interpret this result as saying that higher income people are more likely to vote for Democrats; indeed, popular press books have made this argument.

But wait, I thought that rich people were more likely to be Republicans? They are.

What this regression is really telling us is that rich states are more likely to vote for a Democrat and poor states are more likely to vote for a Republican. Within a given state, rich people are more likely to vote Republican and poor people are more likely to vote Democrat. See the work of Andrew Gelman and his coauthors.

Without further assumptions, we cannot use group-level (aggregate) data to make inferences about individual-level behavior. This is the ecological fallacy. Group-level data can only tell us about group-level behavior.

To make the leap to individual-level inferences, we need the constancy assumption. Here, the voting choice of individuals must not vary systematically with the median income of a state; a person who earns \$X in a rich state must be just as likely to vote for a Democrat as someone who earns \$X in a poor state. But people in Connecticut, at all income levels, are more likely to vote for a Democrat than people in Mississippi at those same income levels. Hence, the constancy assumption is violated and we are led to the wrong conclusion (fooled by aggregation bias).
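The pattern described above can be reproduced with a small simulation. This is only a sketch on made-up data: the function name `simulate_ecological_fallacy`, the within-state slope of -0.006, the income spread, and the sample sizes are all my assumptions, chosen so that the state baseline rises with state income (violating constancy) while richer individuals within each state lean Republican. Only the 20 and 0.61 echo the regression quoted earlier.

```python
import numpy as np


def simulate_ecological_fallacy(n_states=50, n_per_state=2000, seed=0):
    """Simulate voters in hypothetical states and fit both regressions.

    Returns the state-level slope and intercept (vote share in percent on
    median income in thousands) and the array of within-state slopes.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical "true" state income levels, in thousands of dollars.
    state_level = rng.uniform(35, 70, size=n_states)

    state_median = np.empty(n_states)
    state_share = np.empty(n_states)    # Democratic vote share, in percent
    within_slopes = np.empty(n_states)  # individual-level slope in each state

    for s, m in enumerate(state_level):
        income = m + rng.normal(0, 10, size=n_per_state)
        # Assumed data-generating process: the state baseline rises with
        # state income, but within a state richer people are less likely
        # to vote for the Democrat.
        p_dem = np.clip(0.20 + 0.0061 * m - 0.006 * (income - m), 0, 1)
        vote = (rng.random(n_per_state) < p_dem).astype(float)

        state_median[s] = np.median(income)
        state_share[s] = 100 * vote.mean()
        within_slopes[s] = np.polyfit(income, vote, 1)[0]

    # Group-level regression: one data point per state.
    state_slope, state_intercept = np.polyfit(state_median, state_share, 1)
    return state_slope, state_intercept, within_slopes


if __name__ == "__main__":
    slope, intercept, within = simulate_ecological_fallacy()
    print("state-level slope:", slope)
    print("average within-state slope:", within.mean())
```

The state-level slope comes out positive and the average within-state slope negative, so reading the aggregate fit as a statement about individuals gets the sign wrong.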

This topic was a frequent hobbyhorse of the late David Freedman; see this paper, for example. In that paper, Freedman provides a means for bounding individual-level probabilities using group data.

Comparison to Simpson's paradox

Elsewhere in this CW, @Michelle proposes Simpson's paradox as a good example, as it indeed is. Simpson's paradox and the ecological fallacy are closely related, yet distinct. The two examples differ in the natures of the data given and analysis used.

The standard formulation of Simpson's paradox is a two-way table. In our example here, suppose that we have individual data and we classify each individual as high or low income. We would get an income-by-vote 2x2 contingency table of the totals. We'd see that a higher share of high income people voted for the Democrat relative to the share of low income people. Were we to create a contingency table for each state, however, we'd see the opposite pattern.
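Here is a toy version of those tables, using two hypothetical states and invented counts (none of these numbers come from real election data). The helper names `dem_share` and `share_by_income` are mine.

```python
# Hypothetical counts: (state, income group) -> (Democratic voters, all voters)
counts = {
    ("rich state", "high"): (480, 800),
    ("rich state", "low"): (140, 200),
    ("poor state", "high"): (60, 200),
    ("poor state", "low"): (320, 800),
}


def dem_share(rows):
    """Democratic share of the vote across a list of (dem, total) pairs."""
    dem = sum(d for d, _ in rows)
    total = sum(n for _, n in rows)
    return dem / total


def share_by_income(state=None):
    """Democratic share by income group, pooled or for a single state."""
    return {
        group: dem_share([
            v for (s, g), v in counts.items()
            if g == group and (state is None or s == state)
        ])
        for group in ("high", "low")
    }


pooled = share_by_income()
by_state = {s: share_by_income(s) for s in ("rich state", "poor state")}

if __name__ == "__main__":
    print("pooled:", pooled)
    for s, shares in by_state.items():
        print(s, shares)
```

In the pooled table the high income group has the larger Democratic share, yet within each state the low income group does. The reversal arises because most high income people live in the state that leans Democratic overall.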

In the ecological fallacy, we don't collapse income into a dichotomous (or perhaps multichotomous) variable. To get state-level data, we compute the mean (or median) state income and the state vote share, run a regression, and find that higher income states are more likely to vote for the Democrat. If we kept the individual-level data and ran the regression separately by state, we'd find the opposite effect.

In summary, the differences are:

Mode of analysis: We could say, following our SAT prep skills, that Simpson's paradox is to contingency tables as the ecological fallacy is to correlation coefficients and regression.

Degree of aggregation/nature of data: Whereas the Simpson's paradox example compares two numbers (Democrat vote share among high income individuals versus the same for low income individuals), the ecological fallacy uses 50 data points (i.e., one per state) to calculate a correlation coefficient. To get the full story in the Simpson's paradox example, we'd just need the two numbers from each of the fifty states (100 numbers), while in the ecological fallacy case, we need the individual-level data (or else be given state-level correlations/regression slopes).

General observation

@NeilG comments that this just seems to be saying that you can't have any selection on unobservables/omitted variables bias issues in your regression. That's right! At least in the regression context, I think that nearly any "paradox" is just a special case of omitted variables bias.

Selection bias (see my other response on this CW) can be controlled for by including the variables that drive the selection. Of course, these variables are typically unobserved, driving the problem/paradox. Spurious regression (my other other response) can be overcome by adding a time trend. These cases say, essentially, that you have enough data, but need more predictors.

In the case of the ecological fallacy, it's true, you need more predictors (here, state-specific slopes and intercepts). But you also need more observations, at the individual rather than the group level, to estimate these relationships.
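One way to write the contrast down (the notation is my own shorthand, not taken from Gelman or Freedman): the group-level regression fits a single line through the state averages,

$$\bar{y}_s = a + b\,\bar{x}_s + u_s, \qquad s = 1, \dots, 50,$$

whereas the individual-level story allows each state its own intercept and slope,

$$y_{is} = \alpha_s + \beta_s\, x_{is} + \varepsilon_{is}.$$

Here $y_{is}$ is the vote of person $i$ in state $s$ and $x_{is}$ is that person's income. The 50 pairs $(\bar{x}_s, \bar{y}_s)$ can pin down $a$ and $b$, but they cannot identify the 100 parameters $\alpha_s, \beta_s$; for that we need the individual observations. Nothing forces the sign of $b$ to match the signs of the $\beta_s$.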

(Incidentally, if you have extreme selection where the selection variable perfectly divides treatment and control, as in the WWII example that I give, you may need more data to estimate the regression as well; in that case, data on the downed planes.)