Surprising reciprocity

I have two correlated random variables, \(X\) and \(Y\), with zero mean and equal variance. I tell you that the best way to predict \(Y\) based on the knowledge of \(X\) is \(y = a x\). Now, you tell me, what is the best way to predict \(X\) based on \(Y\)?

Your intuition might tell you that if \(y = ax\), then \(x = y/a\). That would be correct if the relationship were exact, but it is not the best prediction when the relationship is noisy. The right answer may surprise you.

So what is the best way to predict \(Y\) based on \(X\) and vice versa? Let’s find the \(a\) that minimizes the mean squared error \(E[(Y-aX)^2]\):

\[E[(Y-aX)^2] = E[Y^2-2aXY+a^2X^2]=(1+a^2)\mathrm{Var}(X)-2a\mathrm{Cov}(X,Y);\]

\[\frac{\partial}{\partial a}E[(Y-aX)^2] = 2a\mathrm{Var}(X)-2\mathrm{Cov}(X,Y);\]

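Setting this derivative to zero, and using \(\mathrm{Var}(X)=\mathrm{Var}(Y)\) for the last equality, we get

\[a=\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}=\mathrm{Corr}(X,Y).\]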

Notice that the answer, the (Pearson) correlation coefficient, is symmetric with respect to \(X\) and \(Y\). Thus it will be the same whether we want to predict \(Y\) based on \(X\) or \(X\) based on \(Y\)!
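The symmetry can be checked numerically. Below is a minimal sketch that uses only the Python standard library; the helper names `simulate` and `ls_slope`, the Gaussian sampling scheme, and the value \(0.6\) for the correlation are illustrative assumptions, not part of the derivation above.

```python
import random

def simulate(rho, n=20000, seed=0):
    """Draw n pairs (x, y) with zero mean, unit variance and correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Mixing x with independent noise gives Var(y) = 1 and Cov(x, y) = rho.
        y = rho * x + (1.0 - rho ** 2) ** 0.5 * noise
        xs.append(x)
        ys.append(y)
    return xs, ys

def ls_slope(u, v):
    """Slope a that minimizes the mean of (v - a*u)^2 for a line through the origin."""
    return sum(ui * vi for ui, vi in zip(u, v)) / sum(ui * ui for ui in u)

xs, ys = simulate(rho=0.6)
slope_y_on_x = ls_slope(xs, ys)  # best a in y = a*x
slope_x_on_y = ls_slope(ys, xs)  # best b in x = b*y
print(f"predict Y from X: slope = {slope_y_on_x:.3f}")
print(f"predict X from Y: slope = {slope_x_on_y:.3f}")
```

Both slopes should come out close to \(0.6\), up to sampling noise, rather than one being the reciprocal of the other.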

How to make sense of this? It may help to consider a couple of special cases first.

First, suppose that \(X\) and \(Y\) are perfectly positively correlated (correlation \(1\)) and you’re trying to predict \(Y\) based on \(X\). Since \(X\) is such a good predictor, just use its value as it is (\(a=1\)).

Now, suppose that \(X\) and \(Y\) are uncorrelated. Knowing the value of \(X\) doesn’t tell you anything about the value of \(Y\) (as far as linear relationships go). The best predictor you have for \(Y\) is its mean, \(0\).

Finally, suppose that \(X\) and \(Y\) are somewhat correlated. The correlation coefficient is the degree to which we should trust the value of \(X\) when predicting \(Y\) versus sticking to \(0\) as a conservative estimate.

This is the key idea: think of \(a\) in \(y=ax\) not as a constant of proportionality, but as a degree of “trust”.
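To see what the "trust" reading buys us, here is a small sketch comparing three ways of predicting \(X\) from \(Y\). It assumes unit variance for both variables (a special case of the equal-variance setup), in which case \(E[(X-bY)^2] = 1 - 2b\rho + b^2\), where \(\rho\) is the correlation. The function name `mse_predicting_x` and the value \(\rho = 0.5\) are chosen purely for illustration.

```python
def mse_predicting_x(b, rho):
    """E[(X - b*Y)^2] for zero-mean, unit-variance X and Y with correlation rho."""
    return 1.0 - 2.0 * b * rho + b * b

rho = 0.5  # illustrative correlation, not taken from the post

mse_trust = mse_predicting_x(rho, rho)        # x = rho * y (least-squares answer)
mse_naive = mse_predicting_x(1.0 / rho, rho)  # x = y / rho (naive inversion)
mse_zero = mse_predicting_x(0.0, rho)         # x = 0 (ignore y entirely)

print(f"x = rho * y : MSE = {mse_trust:.2f}")
print(f"x = y / rho : MSE = {mse_naive:.2f}")
print(f"x = 0       : MSE = {mse_zero:.2f}")
```

With \(\rho = 0.5\) the three errors are \(0.75\), \(3\) and \(1\): under these assumptions, the naive inversion does worse than ignoring \(Y\) altogether.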

Added on 2020-08-21:

There was an interesting related discussion on Twitter:

Which line do you think does a better job of predicting Y given X?

(Hint: This is one of those days where very basic statistics confuses the hell out of me.)

(1/n) — Jay Hennig ( @jehosafet ) August 19, 2020

The answer is…black! Its rmse is 4 times lower than the red line. I keep staring at that plot and it still blows my mind. (2/n) — Jay Hennig ( @jehosafet ) August 19, 2020

Meanwhile, the red line is better than the black line at predicting X given Y.

One explanation of the discrepancy is, predicting Y given X assumes that there is no measurement noise in X, while predicting X given Y assumes there is no measurement noise in Y.

(3/n) — Jay Hennig ( @jehosafet ) August 19, 2020

The weird thing is, you are just as good at predicting X given Y as you are at predicting Y given X. So why does the red line look so much better than the black line?

(4/n) — Jay Hennig ( @jehosafet ) August 19, 2020

By the way, both lines have zero mean and unit variance, so the slope of each line is equal to the pearson's correlation (ρ) between X and Y:

black: Y = ρX

red: X = ρY

(5/n) — Jay Hennig ( @jehosafet ) August 19, 2020
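
The remark in the thread that both directions are equally accurate can be checked against the mean squared error computed earlier. As a sketch, write \(\rho=\mathrm{Corr}(X,Y)\) and let \(\sigma^2\) denote the common variance of \(X\) and \(Y\) (the symbol \(\sigma^2\) is introduced here for convenience), so that \(\mathrm{Cov}(X,Y)=\rho\sigma^2\). Substituting \(a=\rho\) gives

\[E[(Y-\rho X)^2] = (1+\rho^2)\sigma^2-2\rho\,\mathrm{Cov}(X,Y) = (1+\rho^2)\sigma^2-2\rho^2\sigma^2 = (1-\rho^2)\sigma^2.\]

This expression is symmetric in \(X\) and \(Y\), so \(E[(X-\rho Y)^2]\) takes the same value: each line is equally good at its own task, and each is worse when used for the other one.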