This post is by Phil.

I’m aware that there are some people who use a Bayesian approach largely because it allows them to provide a highly informative prior distribution based on subjective judgment, but that is not the appeal of Bayesian methods for a lot of us practitioners. It’s disappointing and surprising, twenty years after my initial experiences, to still hear highly informed professional statisticians claim that what distinguishes Bayesian statistics from Frequentist statistics is “subjectivity” (as seen in a recent blog post and its comments).

My first encounter with Bayesian statistics was just over 20 years ago. I was a postdoc at Lawrence Berkeley National Laboratory, with a new PhD in theoretical atomic physics but working on various problems related to the geographical and statistical distribution of indoor radon (a naturally occurring radioactive gas that can be dangerous if present at high concentrations). One of the issues I ran into right at the start was that many researchers were trying to use data from statewide radon surveys to make maps of radon concentrations within states. The surveys had been designed to characterize the statistical distribution of radon in each state, not the spatial distribution, so sample sizes were much higher in highly populated counties than in low-population counties.

Within the counties with lots of measurements, the statistical distribution of radon measurements was roughly lognormal, with a geometric standard deviation of around 3 (a dimensionless number) and a geometric mean that varied from county to county. One of the first datasets I looked at in detail was from Minnesota, where counties with more than 100 radon survey measurements had observed geometric means as low as 83 becquerels per cubic meter and as high as 136 becquerels per cubic meter.

One hundred radon measurements from a county are enough to characterize the county’s distribution of radon concentrations pretty well, especially if one is willing to assume the distribution is lognormal. For instance, one of the parameters of interest to policy-makers is the expected fraction of homes with radon concentrations exceeding the recommended “action level” of 150 becquerels per cubic meter, and with 100 measurements you can estimate that fraction to within a few percentage points, which was close enough.
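
Under the lognormal assumption, that exceedance fraction is a one-line calculation. Here is a minimal sketch in R (the modern successor to the S-Plus mentioned below); the geometric mean of 120 becquerels per cubic meter is an illustrative value, not a number from the survey:

```r
gm  <- 120  # geometric mean (Bq/m^3); illustrative, not a survey value
gsd <- 3    # geometric standard deviation (dimensionless)

# Expected fraction of homes above the 150 Bq/m^3 action level
plnorm(150, meanlog = log(gm), sdlog = log(gsd), lower.tail = FALSE)
#> about 0.42
```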

But what about counties with many fewer measurements? Many counties had measurement counts in the single digits. Unsurprisingly, by far the lowest and highest observed geometric mean radon concentrations were from counties with very few measurements: if you measure only two homes in a county, 4% of the time both will come from the highest 20% of homes in the county, so your observed geometric mean will be much higher than the true geometric mean; another 4% of the time both will come from the lowest 20%, with the opposite result. I still remember, 20 years later, that the highest county geometric mean was in Lac Qui Parle County, with just two sampled homes and an observed geometric mean of 500 becquerels per cubic meter.

Using the statistics language S-Plus (this was years before R was released), I created my own model of what I thought was happening: I could choose a statistical distribution of “true” geometric means and geometric standard deviations, simulate the survey sampling process, and look at the resulting distribution of simulated county geometric means and standard deviations. It was evident right away that if some of the true geometric means were near 500 becquerels per cubic meter, then some of the sampled geometric means were very likely to be far higher than that: the distribution of sampled geometric means is far wider than the distribution of true values. So it seemed to me extremely unlikely that the geometric mean radon concentration of all of the homes in Lac Qui Parle County was anywhere near 500 becquerels per cubic meter. Instead, the true geometric mean was probably much lower, and the survey had simply happened to sample two homes from the upper end of the distribution. But however certain I was, how could I come up with an estimate and an uncertainty for the county’s true geometric mean radon concentration? I asked this question of the UC Berkeley statistics professor who gave us occasional advice, but didn’t get a useful answer. That professor’s attitude was: I have an “unbiased” estimate of the geometric mean, 500 becquerels per cubic meter, and an uncertainty in that quantity…what else could I hope for? This was very unsatisfying to me, especially since it was clear that the “unbiased” estimate was in fact far more likely to be too high than too low. But being unsatisfied didn’t get me closer to having a good answer.
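
For concreteness, here is a rough reconstruction of that kind of simulation in R; all of the distribution parameters below are illustrative stand-ins, not the values I actually used:

```r
set.seed(1)
n_counties <- 500  # hypothetical number of counties
true_gm <- rlnorm(n_counties, meanlog = log(120), sdlog = 0.3)  # "true" county GMs

# Simulate the survey: two homes per county, lognormal within-county
# variation with a GSD of 3, then take each county's observed GM.
obs_gm <- sapply(true_gm, function(gm) {
  x <- rlnorm(2, meanlog = log(gm), sdlog = log(3))
  exp(mean(log(x)))
})

# The observed GMs are spread far wider than the true GMs.
quantile(true_gm, c(0.01, 0.99))
quantile(obs_gm, c(0.01, 0.99))
```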

Fortunately, my high school friend Andrew Gelman was a new statistics professor at Berkeley, and he let me sit in on his graduate-level course on Bayesian data analysis. Quite early in the course we were introduced to “hierarchical” models, the first example of which was an analysis (originally by Don Rubin) of the effectiveness of classes designed to improve students’ performance on the Scholastic Aptitude Test. I won’t go into details here; if you’re interested, see any edition of Bayesian Data Analysis (Gelman et al.). To perform the evaluation, Rubin first estimated the effect (and its uncertainty) of the training, on average, in each of the eight schools. For our purposes, we take those estimates as given:

| School | Estimated treatment effect | Standard error of the estimate |
|--------|----------------------------|--------------------------------|
| A      | 28                         | 15                             |
| B      | 8                          | 10                             |
| C      | -3                         | 16                             |
| D      | 7                          | 11                             |
| E      | -1                         | 9                              |
| F      | 1                          | 11                             |
| G      | 18                         | 10                             |
| H      | 12                         | 18                             |

Now, suppose I want to estimate the “true effect size” in School A. One argument is that the best estimate is 28 points: I have an unbiased estimate and a standard error; what else can I hope for?

On the other hand, the standard error is larger than the estimated effect in most of the schools. A hypothesis test fails to reject the hypothesis that the true effect size is the same in all of the schools, with the common value being about 8 points (the precision-weighted average of the eight estimates). So another argument is that my best estimate of the effect in School A is 8 points. After all, the “same course” was taught in each of the schools, so each of the eight schools gives an estimate of the same quantity: the effectiveness of the course.
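
That pooled estimate is just a precision-weighted average of the eight observed effects; a quick check in R, using the values from the table above:

```r
y  <- c(28, 8, -3, 7, -1, 1, 18, 12)   # estimated treatment effects
se <- c(15, 10, 16, 11, 9, 11, 10, 18) # standard errors
sum(y / se^2) / sum(1 / se^2)          # about 7.7 points, i.e. roughly 8
```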

Rubin thought (and I think most people would agree) that neither of those arguments represents our state of knowledge. The first argument treats School A in complete isolation, ignoring the considerable evidence from the other schools that similar courses typically have effect sizes well below 20 points. The second argument assumes the true effect is exactly equal in all of the schools, in spite of the courses being taught by different teachers to different students. Rubin suggested a middle path: (1) assume that each school’s “true effect” is drawn from a normal distribution with unknown mean and standard deviation, and (2) assume the observed effect in each school is sampled from a normal distribution with mean equal to the true effect and standard deviation given in the table above. (As a physicist I recognized this as an “inverse problem”: we are given the results in the table above, and we have to work backwards to deduce the underlying parameters that produced them.)
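
In symbols, a minimal statement of the model (my formalization of the description above, with j indexing the schools): theta_j ~ Normal(mu, tau^2) for the true effect in school j, and y_j ~ Normal(theta_j, sigma_j^2) for the observed effect, where sigma_j is the standard error from the table and mu and tau are the unknown mean and standard deviation of the distribution of true effects.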

An interesting feature of Rubin’s model is that it contains both of the conventional approaches discussed above as special cases: if we force the standard deviation of true effects to be zero, then all of the schools end up with the same estimated true effect (with a central estimate of 8 points), whereas if we force the standard deviation of true effects to be infinite, then each school’s estimated true effect is equal to its observed effect. One can play around with a few different prior distributions for the standard deviation of true effects and find that the choice doesn’t really matter: any reasonable prior that doesn’t force the standard deviation to be very small or very large gives about the same posterior distribution. School A ends up with an estimated true effect of 10 points, with about 50% of the probability between 7 and 16 points.
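
For readers who want to see the whole computation, here is one way to carry it out in R, following the grid-over-tau approach described in Bayesian Data Analysis; the flat prior on tau and the grid endpoints are my choices here, in the spirit of the “any reasonable choice” remark above:

```r
y     <- c(28, 8, -3, 7, -1, 1, 18, 12)   # observed effects
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18) # standard errors

tau_grid <- seq(0.01, 40, length.out = 2000)  # grid over the between-school sd

# Marginal posterior of tau (up to a constant), with flat priors on mu and tau
log_post <- sapply(tau_grid, function(tau) {
  V <- sigma^2 + tau^2
  mu_hat <- sum(y / V) / sum(1 / V)  # precision-weighted mean, given tau
  0.5 * log(1 / sum(1 / V)) - 0.5 * sum(log(V)) - 0.5 * sum((y - mu_hat)^2 / V)
})
post <- exp(log_post - max(log_post))
post <- post / sum(post)

# School A's estimated true effect: given tau, shrink y[1] toward mu_hat;
# then average over the posterior distribution of tau.
theta_A <- sapply(tau_grid, function(tau) {
  V <- sigma^2 + tau^2
  mu_hat <- sum(y / V) / sum(1 / V)
  w <- (1 / sigma[1]^2) / (1 / sigma[1]^2 + 1 / tau^2)
  w * y[1] + (1 - w) * mu_hat
})
sum(post * theta_A)  # roughly 10 points
```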

The thing that was so great about this example at the time was that it was almost exactly like the data and model I was already using for my indoor radon work. If I worked with the logarithms of the radon measurements, then I had data that were approximately normally distributed within each county, and I was already assuming in my simulation program that the county geometric means were lognormally distributed. So immediately after leaving Andrew’s classroom I went back to my office, coded up the calculations, and came up with estimates for the geometric mean radon concentration in every county in Minnesota. A few weeks later, I improved the model by including a couple of explanatory variables in a linear regression. The final estimate for Lac Qui Parle County: 192 +/- 29 becquerels per cubic meter, much lower than the geometric mean of its two observations, but higher than that of the typical Minnesota county. (In the unlikely event that you’re interested in the details, including model checking, see Price PN, Nero AV, and Gelman A, “Bayesian prediction of mean indoor radon concentrations for Minnesota counties,” Health Physics 71:922-936, 1996.)
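
To connect the two problems explicitly, here is the same shrinkage idea applied to a single county, in the simplest no-regression form. Apart from the within-county GSD of 3 and the two sampled homes, every number below is a hypothetical placeholder, not a value from the paper:

```r
gm_obs   <- log(500)  # log of the county's observed geometric mean
n        <- 2         # homes sampled in the county
sigma_w  <- log(3)    # within-county sd on the log scale (GSD of 3)
mu_state <- log(120)  # hypothetical statewide mean of county log-GMs
tau      <- 0.25      # hypothetical between-county sd on the log scale

se2 <- sigma_w^2 / n                      # sampling variance of the county log-GM
w   <- (1 / se2) / (1 / se2 + 1 / tau^2)  # weight on the county's own data
exp(w * gm_obs + (1 - w) * mu_state)      # shrunken GM estimate, in Bq/m^3
```

With only two noisy measurements the weight w is small, so the estimate is pulled strongly toward the statewide mean, just as the county’s published estimate of 192 is far below its observed 500.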

Although the characteristic of Rubin’s 8-schools model that I most appreciated at the time was its close match to the problem I was working on, the thing I most appreciate about it now is that it cleanly illustrates almost every theme of Bayesian modeling that I have encountered in the past twenty years: where do models come from; what happens if you pick the “wrong” form of the prior distribution; how do you check the sensitivity of the results to your assumptions… I feel like once you’ve done the 8-schools problem, the rest of Bayesian hierarchical modeling is just tinkering with the details.

At the time that we did the radon work described above, Bayesian data analysis was not at all mainstream; indeed, in some ways it was controversial. Many of Andrew’s colleagues at Berkeley were openly hostile towards, or mocking of, Bayesian statistics, and I once sat in a conference room in slack-jawed wonder as I heard two UC Berkeley statistics professors discuss some Bayesian work that had just been presented, in terms that were both shockingly ignorant and nearly slanderous. They really had no idea what they were talking about; anyone who has worked through the eight-schools problem knows more than those guys did. The anti-Bayes prejudice of the UC Berkeley faculty was a major factor in Andrew’s leaving for Columbia, which is a pity; I’m still unhappy about it.

One of the criticisms those professors brought up was that “of course, Bayesian analysis is so subjective…” But in fact, the normal-normal model that we used for the radon analysis was no more (or less!) “subjective” than the normal-normal model our Frequentist project advisor had suggested using for an analysis of variance of those same data; indeed, it was the same model!

As for me, I am completely indifferent to whether a given statistical approach is “Frequentist” or “Bayesian”; what I want to know is: does it work for my problem? If there’s a Frequentist method for estimating the radon concentration in Lac Qui Parle County, Minnesota, that is better (by which I mean more accurate in its central estimate, or providing more accurate uncertainty estimates, or both), then I want to use it.

That said, I am almost always dealing with situations in which the underlying parameters of interest (theta) are observed subject to error (measurements y), which is to say I am trying to find p(theta | y), and that is exactly the question Bayesian methods are designed to answer. So I find myself using Bayesian methods a lot. But that’s a pragmatic decision based on what works for my problem, not a philosophical decision based on what school of statistical philosophy I want to associate myself with.

This post is by Phil.