$\begingroup$

Many people (outside the specialist experts) who think they are frequentist are in fact Bayesian. This makes the debate a bit pointless. I think that Bayesianism won, but that there are still many Bayesians who think they are frequentist. Some people think that because they don't use priors they must be frequentist. This is dangerous logic. The issue is not so much about priors (uniform or non-uniform); the real difference is more subtle.

(I'm not formally in the statistics department; my background is maths and computer science. I'm writing because of difficulties I've had trying to discuss this 'debate' with other non-statisticians, and even with some early-career statisticians.)

The MLE is actually a Bayesian method. Some people will say "I'm a frequentist because I use the MLE to estimate my parameters". I have seen this in peer-reviewed literature. This is nonsense and is based on the (unsaid, but implied) myth that a frequentist is somebody who uses a uniform prior instead of a non-uniform prior.

Consider drawing a single number from a normal distribution with known mean, $\mu = 0$, and unknown variance. Call this variance $\theta$.

$ X \sim \mathcal{N}(\mu = 0, \sigma^2 = \theta) $

Now consider the likelihood function. This function has two arguments, $x$ and $\theta$, and it returns the probability density of $x$ given $\theta$.

$ f(x,\theta) = p_{\sigma^2=\theta} (x) = \frac{1}{\sqrt{2\pi \theta}} e^{-\frac{x^2}{2\theta}} $

You can imagine plotting this in a heatmap, with $x$ on the x-axis and $\theta$ on the y-axis, and using the colour (or a z-axis) to show the value of $f(x,\theta)$. Here is the plot, with contour lines and colours.

First, a few observations. If you fix on a single value of $\theta$, then you can take the corresponding horizontal slice through the heatmap. This slice will give you the pdf for that value of $\theta$. Obviously, the area under the curve in that slice will be 1. On the other hand, if you fix on a single value of $x$, and then look at the corresponding vertical slice, then there is no such guarantee about the area under the curve.
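As a rough numerical check of these two observations, here is a minimal sketch in plain Python. The function names, the midpoint rule, and the cutoff `theta_max` are my own choices for illustration, not part of the argument above. The cutoff is needed because, for this particular $f$, the vertical slice decays only like $\theta^{-1/2}$, so its area over all $\theta > 0$ is not even finite.

```python
import math

def f(x, theta):
    """Density of N(0, theta) at x, where theta is the variance."""
    return math.exp(-x * x / (2.0 * theta)) / math.sqrt(2.0 * math.pi * theta)

def horizontal_area(theta, n=20000):
    """Midpoint-rule area under f(., theta) for x within 12 standard deviations."""
    half_width = 12.0 * math.sqrt(theta)
    h = 2.0 * half_width / n
    return sum(f(-half_width + (i + 0.5) * h, theta) for i in range(n)) * h

def vertical_area(x, theta_max, n=20000):
    """Midpoint-rule area under f(x, .) for theta in (0, theta_max]."""
    h = theta_max / n
    return sum(f(x, (i + 0.5) * h) for i in range(n)) * h

# Horizontal slices: the area is 1 whatever theta is.
for theta in (0.5, 1.0, 4.0):
    print("theta =", theta, "horizontal area =", round(horizontal_area(theta), 6))

# Vertical slice at x = 1: the area is not 1 and keeps growing with the cutoff.
for theta_max in (10.0, 100.0):
    print("theta_max =", theta_max, "vertical area =", round(vertical_area(1.0, theta_max), 3))
```

The horizontal areas should all come out as 1 to within rounding, while the vertical area depends on both $x$ and the cutoff.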

This distinction between the horizontal and vertical slices is crucial, and I found this analogy helped me to understand the frequentist approach to bias.

A Bayesian is somebody who says

For this value of x, which values of $\theta$ give a 'high enough' value of $f(x,\theta)$?

Alternatively, a Bayesian might include a prior, $g(\theta)$, but they are still talking about

for this value of x, which values of $\theta$ give a high enough value of $f(x,\theta)g(\theta)$?

So a Bayesian fixes x and looks at the corresponding vertical slice in that contour plot (or in the variant plot incorporating the prior). In this slice, the area under the curve need not be 1 (as I said earlier). A Bayesian 95% credible interval (CI) is the interval which contains 95% of the available area. For example, if the area is 2, then the area under the Bayesian CI must be 1.9.
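To make the vertical-slice construction concrete, here is a minimal grid-based sketch. Everything specific in it is an assumption of mine: the prior $g(\theta) = e^{-\theta}$ (chosen only so that the slice has a finite area), the grid limit `theta_max`, and the choice of an equal-tailed interval rather than the narrowest one.

```python
import math

def f(x, theta):
    """Density of N(0, theta) at x, where theta is the variance."""
    return math.exp(-x * x / (2.0 * theta)) / math.sqrt(2.0 * math.pi * theta)

def prior(theta):
    """Hypothetical prior g(theta) = exp(-theta); an assumption for illustration."""
    return math.exp(-theta)

def credible_interval(x, level=0.95, theta_max=60.0, n=60000):
    """Equal-tailed credible interval for theta on the vertical slice at x.

    Returns (lo, hi, area), where area is the total area under f(x, .) * g(.).
    """
    h = theta_max / n
    thetas = [(i + 0.5) * h for i in range(n)]
    weights = [f(x, t) * prior(t) * h for t in thetas]
    area = sum(weights)
    tail = (1.0 - level) / 2.0
    cum = 0.0
    lo = hi = None
    for t, w in zip(thetas, weights):
        cum += w / area          # running fraction of the slice's area
        if lo is None and cum >= tail:
            lo = t
        if cum >= 1.0 - tail:
            hi = t
            break
    return lo, hi, area

lo, hi, area = credible_interval(1.5)
print("area under the slice:", area)   # not 1; only the 95% fraction matters
print("95% credible interval for theta:", (lo, hi))
```

The interval is defined by the fraction of the slice's area it contains, so the total area never needs to equal 1.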

On the other hand, a frequentist will ignore x and first consider fixing $\theta$, and will ask:

For this $\theta$, which values of x will appear most often?

In this example, with $\mathcal{N}(\mu=0, \sigma^2 = \theta)$, one answer to this frequentist question is: "For a given $\theta$, at least 95% of the $x$ (in fact about 99.7%) will appear between $-3\sqrt\theta$ and $+3\sqrt\theta$."
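The coverage of a band of the form $\pm k\sqrt\theta$ can be checked directly with the error function. This is a small sketch; the helper name `coverage` and the comparison value $k = 1.96$ are mine, not part of the example above.

```python
import math

def coverage(k):
    """P(|X| <= k * sqrt(theta)) for X ~ N(0, theta); the same for every theta."""
    return math.erf(k / math.sqrt(2.0))

print(round(coverage(3.0), 4))   # 0.9973
print(round(coverage(1.96), 4))  # 0.95
```

So the $\pm 3\sqrt\theta$ band comfortably clears the 95% requirement on every horizontal slice, at the cost of being wider than necessary.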

So a frequentist is more concerned with the horizontal lines corresponding to fixed values of $\theta$.

This is not the only way to construct a frequentist CI, and it's not even a good (narrow) one, but bear with me for a moment.

The best way to interpret the word 'interval' is not as an interval on a 1-d line, but to think of it as an area on the above 2-d plane. An 'interval' is a subset of the 2-d plane, not of any 1-d line. If somebody proposes such an 'interval', we then have to test whether the 'interval' is valid at a 95% confidence/credible level.

A frequentist will check the validity of this 'interval' by considering each horizontal slice in turn and looking at the area under the curve. As I said before, the area under this curve will always be one. The crucial requirement is that the area within the 'interval' be at least 0.95.

A Bayesian will check validity by instead looking at the vertical slices. Again, the area under the curve will be compared to the subarea that's under the interval. If the latter is at least 95% of the former, then the 'interval' is a valid 95% Bayesian credible interval.
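The two validity checks can be sketched side by side for one proposed 'interval', here the region $|x| \le 3\sqrt\theta$ from the frequentist example. As before, the prior $g(\theta) = e^{-\theta}$, the grid limits, and the function names are assumptions made only so that the vertical-slice check is well defined and computable.

```python
import math

def f(x, theta):
    """Density of N(0, theta) at x, where theta is the variance."""
    return math.exp(-x * x / (2.0 * theta)) / math.sqrt(2.0 * math.pi * theta)

def prior(theta):
    """Hypothetical prior g(theta) = exp(-theta); an assumption for illustration."""
    return math.exp(-theta)

def in_region(x, theta):
    """The proposed 2-d 'interval': all (x, theta) with |x| <= 3 * sqrt(theta)."""
    return abs(x) <= 3.0 * math.sqrt(theta)

def horizontal_fraction(theta, n=20000):
    """Frequentist check: fraction of the horizontal slice's area inside the region."""
    half_width = 12.0 * math.sqrt(theta)
    h = 2.0 * half_width / n
    inside = total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        w = f(x, theta) * h
        total += w
        if in_region(x, theta):
            inside += w
    return inside / total

def vertical_fraction(x, theta_max=80.0, n=80000):
    """Bayesian check: fraction of the vertical slice's area inside the region."""
    h = theta_max / n
    inside = total = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        w = f(x, theta) * prior(theta) * h
        total += w
        if in_region(x, theta):
            inside += w
    return inside / total

for theta in (0.5, 1.0, 4.0):
    print("theta =", theta, "horizontal fraction =", round(horizontal_fraction(theta), 4))
for x in (1.5, 6.0):
    print("x =", x, "vertical fraction =", round(vertical_fraction(x), 4))
```

Every horizontal slice passes the 0.95 threshold, so the region is a valid frequentist 'interval'. Under the assumed prior, the vertical slice at $x = 6$ falls short of 0.95, so the same region is not a valid Bayesian one.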

Now that we know how to test whether a particular interval is 'valid', the question is how do we choose the best option among the valid options. This can be a black art, but generally you want the narrowest interval. Both approaches tend to agree here - the vertical slices are considered and the goal is to make the interval as narrow as possible within each vertical slice.

I have not attempted to define the narrowest possible frequentist confidence interval in the above example. See the comments by @cardinal below for examples of narrower intervals. My goal is not to find the best intervals, but to emphasize the difference between the horizontal and vertical slices in determining validity. An interval that satisfies the conditions of a 95% frequentist confidence interval will usually not satisfy the conditions of a 95% Bayesian credible interval, and vice versa.

Both approaches desire narrow intervals, i.e. when considering one vertical slice we want the (1-d) interval in that slice to be as narrow as possible. The difference is in how the 95% is enforced: a frequentist will only look at proposed intervals where at least 95% of each horizontal slice's area is under the interval, whereas a Bayesian will insist that each vertical slice be such that at least 95% of its area is under the interval.
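In symbols, the two requirements can be sketched as follows. The notation is mine rather than anything standard: $R$ is the proposed 'interval' viewed as a region of the $(x,\theta)$ plane, $\mathbf{1}_R$ is its indicator function, and $g$ is the prior (take $g \equiv 1$ if no prior is used, provided the integral in the denominator is finite).

$ \text{frequentist:}\quad \int \mathbf{1}_R(x,\theta)\, f(x,\theta)\, dx \;\ge\; 0.95 \quad \text{for every } \theta $

$ \text{Bayesian:}\quad \frac{\int \mathbf{1}_R(x,\theta)\, f(x,\theta)\, g(\theta)\, d\theta}{\int f(x,\theta)\, g(\theta)\, d\theta} \;\ge\; 0.95 \quad \text{for every } x $

The first integral runs along a horizontal slice, where the total area is already 1, so no normalisation is needed. The second runs along a vertical slice and has to be divided by that slice's total area.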

Many non-statisticians don't understand this and they focus only on the vertical slices; this makes them Bayesians even if they think otherwise.