When everyone starts walking around the chilly streets of Toronto looking like they’re cosplaying the last 5 minutes of Call Me By Your Name, you know that Spring is in the air. Let’s celebrate the end of winter by pulling out our Liz Phair records, our slightly less-warm coats, and our hunger for long reads about some fundamentals.

I’ve always found the real sign that Spring has almost sprung is when strangers start asking which priors you should use on the variance parameters in a multilevel model.

A surprising twist on this age-old formula is that this year the question was actually asked on Twitter! Because I have a millennial’s-1 deep understanding of social media, my answer was “The best way to set priors on the variance of a random effect0 is to stop being a Nazi TERF”. But thankfully I have access to a blogging platform, so I’m going to give a more reasoned, nuanced, and calm answer here.

(Warning: This post is long. There’s a summary at the end. There is also an answer that fits into a 280 character limit. It obviously links to the Stan prior choice wiki. You can also look at this great Stan case study from Michael Betancourt.)

The book of love

In this post, I’m going to focus on weakly informative priors for the variance parameters in a multilevel model. The running example will be a very simple multilevel model for Poisson where the log-risk is modelled using a global intercept, a covariate, and one iid Gaussian random effect0. Hence we only have one variance parameter in teh model that we need to find a prior for.

Some extensions beyond this simple model are discussed further down, but setting priors for those problems are much harder and I don’t have good answers for all of them. But hopefully there will be enough information here to start to see where we’re (we = Andrew, Aki, Michael and I, but also Håvard Rue, Sigrunn Sørbye, Andrea Riebler, Geir-Arne Fuglstad, and others) going with this thread of research.

Fallin’ out of love

The first thing I’m going to do is rule out any attempt to construct “non-informative” (whatever that means) priors. The main reason for this is that in most situations (and particularly for multilevel models), I don’t think vague priors are a good idea.

The main reason is that I don’t think they live up to their oft-stated ability to “let the data speak for itself”. The best case scenario is that you’re working with a very regular model and you have oodles of data. In this case, a correctly-specified vague prior (ie a reference prior) will not get in the way of the data’s dream to stand centre stage and sing out “I am a maximum likelihood estimator to high order“. But if that’s not the tune that your data should be singing (for example, if you think regularization or partial pooling may be useful), then a vague prior will not help.

Neither do I think that “letting the data speak” is the role of the prior distribution. The right way to ensure that your model is letting the data speak for itself is through a priori and a posteriori model checking and model criticism. In particular, well-constructed cross-validation methods, data splitting, posterior predictive checks, and prior sensitivity analysis are always necessary. It’s not enough to just say that you used a vague prior and hope. Hope has no place in statistics.

Most of the priors that people think are uninformative turn out not to be. The classic example is the prior for the variance of a random effect0. Andrew wrote a paper about this 12 years ago. It’s been read (or at least cited) quite a lot. The paper is old enough that children who weren’t even born when this paper was written have had time to rebel and take up smoking. And yet, the other day I read yet another paper that referred to a prior on the variance of a random effect0 as “weakly informative”. So let me say it one more time for the people at the back: It’s time to consign priors to the dustbin of history.

Love connection

But even people who did read Andrew’s paper0.25 can go wrong when they try to set vague priors. Here’s a “close to home” example using the default priors in INLA. INLA is an R package for doing fast, accurate, approximate Bayesian inference for latent Gaussian models, which is a small but useful class of statistical models that covers multilevel models, lots of spatial and spatiotemporal models, and some other things. The interface for the INLA package uses an extension to R’s formula environment (just like rstanarm and brms ), and one of the things that this forces us to do is to specify default priors for the random effects0. Here is a story of a default prior being bad.

This example is taken from the first practical that you can find here (should you want to reproduce it). The model, which will be our running example for the next little while, is a simple Poisson random effects0 model. It has 200 observations, and the random effect0 has 200 levels. For those of you familiar with the lme4 syntax, the model can be fit as

lme4::glmer(num_awards ~ math + (1|id),data=awards,family="poisson")

If you run this command, you get a standard deviation of 0.57 for the random effect0.

INLA parameterises a normal distribution using its precision (ie the inverse of the variance). This is in line with JAGS and BUGS, but different to Stan (which is the younger piece of software and uses the standard deviation instead). The default prior on the precision is , which is essentially a very flat exponential prior.

The following table shows the 95% posterior credible intervals for the standard deviation across five different priors. The first is the INLA default; the second is recommended by Bernardinelli, Clayton, and Montomoli. Both of these priors are very vague. The third is a prior with a not particularly small . The fourth and fifth are examples of PC priors which are designed to be weakly informative. For this model, the PC prior is an exponential prior on the standard deviation with the rate parameter tuned so that a priori with upper bounds and respectively.

2.5% Median 97.5% INLA default 0.00 0.01 0.05 IG(0.5,0.0005) 0.02 0.07 0.59 IG(0.001,0.001) 0.04 0.40 0.75 PC(1,0.1) 0.04 0.42 0.74 PC(10,0.1) 0.09 0.49 0.79

The takeaway message here is that the two vague priors on the precision (the default priors in INLA and the one suggested by Bernardinelli et al.) are terrible for this problem. The other three priors do fine and agree with the frequentest estimate0.5.

The take away from this should not be that INLA is a terrible piece of software with terrible priors. They actually work fairly well quite a lot of the time. I chose INLA as an example because it’s software that I know well. More than that, the only reason that I know anything at all about prior distributions is that we were trying to work out better default priors for INLA.

The point that we came to is that the task of finding one universal default prior for these sorts of parameters is basically impossible. It’s even more impossible to do it for more complex parameters (like lag-1 autocorrelation parameters, or shape parameters, or over-dispersion parameters). What is probably possible is to build a fairly broadly applicable default system for building default priors.

The rest of this post is a stab at explaining some of the things that you need to think about in order to build this default system for specifying priors. To do that, we need to think carefully about what we need a prior to do, as well as the various ways that we can make the task of setting priors a little easier.

To think that I once loved you

The first problem with everything I’ve talked about above is that, when you’re setting priors, it is a terrible idea to parameterize a normal distribution by its precision. No one has an intuitive feeling about the inverse of the variance of something, which makes it hard to sense-check a prior.

So why is it used? It simplifies all of the formulas when you’re dealing with hierarchical models. If

,

then the variance of is , whereas if we parameterize by the precision, we just need to add the precisions of and to get the precision of . So when people sat down to write complicated code to perform inference on these models, they made their lives easier and parameterized normal distributions using the precision.

But just because something simplifies coding, it doesn’t mean that it’s a good idea to use it in general situations!

A much better way to parameterize the normal distribution is in terms of the standard deviation. The key benefit of the standard deviation is that it directly defines the scale of the random effect0: we know that it’s highly likely that any realization of the random effect0 is within 3 standard deviations of the mean. This means that we can easily sense-check a prior on the standard deviation by asking “is this a sensible scale for the random effect0?”.

For this model, the standard deviation is on the scale of the log-counts of a Poisson distribution, so we probably don’t want it to be too big. What does this mean? Well, the largest count observation that a computer is most likely only going to allow is 2,147,483,648. This is a fairly large number. If we want the mean of the Poisson to be most likely less than that (in the case where everything else in the log-risk is zero), then we want to ensure . If we want to ensure that the expected count in each cell is less than a million, we need to ensure . If we want to ensure the expected count to be less than 1000, we need to ensure . (I can literally go on all day.) Under the Gamma(0.001,0.001) prior on the precision, the probabilities of these three events happening are, respectively, 0.01, 0.009, and 0.007. This means that rather than “speaking for itself” (whatever that means), your data is actually trying to claw back probability mass from some really dumb parts of the parameters space.

If you plot all of the distributions on the standard deviation scale (sample from them and do a histogram of their inverse square root), you see that the first 3 priors do not look particularly good. In particular, the first two are extremely concentrated around small values. This explains the extreme shrinkage of the standard deviation estimates. The third prior is extremely diffuse.

F.E.E.L.I.N.G.C.A.L.L.E.D.L.O.V.E.

So if the standard deviation is the correct parameter for setting a prior, what sort of prior should we set on it?

Well I have some good news for you: there are several popular alternatives and no clear winner. This is pretty common when you do research on priors. If you don’t screw up the prior completely, it’s a higher-order effect. This means that asymptotically, the effect of the prior on the posterior is smaller than the effect of sampling variation in the data.

But this didn’t happen in the above example, where there was a very clear effect. There are a few reasons for this. Firstly, the data only has 200 observations, so we may be so far from asymptopia the the prior is still quite important. The second reason is more interesting. There is more than one sensible asymptotic regime for this problem, and which one we should use depends on the details of the application.

The data, as it stands, has 200 records and the random effect0 has 200 different levels. There are essentially 3 different routes to asymptopia here. The first one is that we see an infinite number of repeated observations for these 200 levels. The second is a regime where we have observations, the random effect0 has levels, and we send . And the third option is half way between the two where the number of levels grows slower than the number of observations.

It is a possibly under-appreciated point that asymptotic assessments of statistical methods are conditional on the particular way you let things go to infinity. The “standard” asymptotic model of independent replication is quite often unrealistic and any priors constructed to take advantage of this asymptotic structure might have problems. This is particularly relevant to things like reference priors, where the asymptotic assumptions are strongly baked in to the prior specification. But it’s also relevant to any prior that has an asymptotic justification or any method for sense-checking priors that relies on asymptotic reasoning.

This girl’s in love with you

Well that got away from me, let’s try it again. What should a prior for the standard deviation of a Gaussian random effect0 look like?

The first thing is that it should peak at zero and go down as the standard deviation increases. Why? Because we need to ensure that our prior doesn’t prevent the model from easily finding the case where the random effect0 should not be in the model. The easiest way to ensure this is to have the prior decay away from zero. My experience is that this also prevents the model from over-fitting (in the sense of having too much posterior mass on very “exciting” configurations of the random effects0 that aren’t clearly present in the data). This idea that the prior should decrease monotonically away from a simple base model gives us some solid ground to start building our prior on.

The second thing that we want from a prior is containment. We want to build it so that it keeps our parameters in a sensible part of the model space. One way to justify this (and define “sensible”) is to think about the way Rubin characterized the posterior distribution in a very old essay (very old = older than me). He argued that you can conceptualize the posterior through the following procedure (here is the observed data):

Simulate

Simulate

If , then keep , otherwise discard it.

The set of that are generated with this procedure are samples from the posterior distribution.

This strongly suggests that the prior should be built to ensure that the following containment property holds: The prior predictive distribution should be slightly broader than a prior predictive distribution that only has mass on plausible data sets.

This requirement is slightly different to Andrew’s original definition1 of a weakly informative prior (which is a prior on a parameter that is slightly broader than the best informative prior that we could put on a parameter without first seeing the data). The difference is that while the original definition of a WIP requires us to have some sort of interpretation of the parameter that we are putting the prior on, the containment condition only requires us to understand how a parameter affects the prior data generating process. Containment also reflects our view that The prior can often only be understood in the context of the likelihood.

If this all sounds like it would lead to very very complicated maths, it probably will. But we can always check containment visually. We have a discussion of this in our (re-written the other month to read less like it was written under intense deadline pressure) paper on visualization and the Bayesian workflow. That paper was recently accepted as a discussion paper in the Journal of the Royal Statistical Society Series A (date of the discussion meeting TBC) which is quite nice.

Standard Bitter Love Song #1

An immediate consequence of requiring this sort of containment condition is that we need to consider all of the parameters in our model simultaneously in order to check that it is satisfied. This means that we can’t just specify priors one at a time when a model has a number or random effects0, or when there are other types of model component.

For the sake of sanity, we can simplify the process by breaking up the parameters into groups that affect different part of the data generating process. For instance, if you have several random effects0, a good way to parameterize their variances is to put a prior on the standard deviation of the total variability and then a simplex-type prior to distribute the total standard deviation across the many components.

There are some complexities here (eg what do you do when the random effects0 have differing numbers of levels or differing dependence structures between levels?). Some suggested solutions are found towards the end of this paper and in this paper. A similar idea (except it’s the variances that are distributed rather than the standard deviations) is used for the default priors in rstanarm. The same idea pops up again for setting priors on mixture models.

I am leaving you because I don’t love you

So while that was a fun diversion, it still hasn’t given us a prior to use on the variance of a random effect0. But we do now know what we’re looking for.

There are a number of priors that decay away from the base model and give us some form of containment. Here is an incomplete, but popular, list2 of simple containment priors:

A half-Gaussian on the standard deviation

An exponential on the standard deviation

A half-t with 7 degrees of freedom on the standard deviation

A half-t with 3 degrees of freedom on the standard deviation

A half-Cauchy on the standard deviation.

These were listed from lightest to heaviest tail on purpose.

All of these priors require the specification of one parameter controlling the width of the 90% highest probability density interval2.5. This parameter allows us to control containment. For example, if the random effect0 was part of a model for the log-relative risk of a disease, it may be sensible to set the parameter so that the probability that the standard deviation was less than 1 with 90% prior probability. This would correspond to the random effect0 being contained within [-3,3] and hence the random effect’s0 contribution to the relative risk be constrained to be a multiplicative factor contained within [0.05, 20].

This brings out an important aspect of containment priors: they are problem dependent. Although this example does not need a lot of expert information to set a sensible prior, it does need someone to understand how big a deviation from the baseline risk is unlikely for the particular scenario you are modelling. There isn’t a way around this. You either explicitly encode information about the problem in a prior, hide it in the structure of your asymptotic assumptions, or just throw your prior against a wall and hope that it sticks.

One way to standardize the procedure for setting priors is to demand that the model is correctly scaled in order to ensure that all of the parameters are on the unit scale. This can be done structurally (if your expected counts are well constructed, the relative risk will often be on unit scale). It can also be done by introducing some data-dependence into the prior, although this is a little more philosophically troublesome and you have to be diligent with your model assessment.

As for which of these priors is preferred, it really depends on context. If your model has a lot of random effects0, or the likelihood is sensitive to extreme values of the random effect0, you should opt for a lighter tail. On the other hand, a heavier tail goes some way towards softening the importance of correctly identifying the scale of the random effect0.

To put some skin in the game, last time I talked to them about this3 Andrew seemed more in favour of good scaling and a half-Normal(0,1) prior, I like an exponential prior with an explicitly specified 90% quantile for the standard deviation, and Aki seemed to like one of the Student-t distributions. Honestly, I’m not sure it really makes a difference and with the speed of modern inference engines, you can often just try al three and see which works best. A nice4 simulation study in this direction was done by Nadja Klein and Thomas Kneib.

I love you but you’re dead

None of us are particularly fond of the half-Cauchy as a prior on the standard deviation. This was one of the early stabs at a weakly informative prior for the standard deviation. It does some nice things, for example if you marginalize out a standard deviation with a half-Cauchy prior, you get a distribution on the random effect0 that has a very heavy tail. This is the basis for the good theoretical properties of the Horseshoe prior for sparsity, as well as the good mean-squared error properties of the posterior mean estimator for the normal means problem.

But this prior has fallen out of favour with us for a couple of reasons. Firstly, the extremely heavy tail makes for a computational disaster if you’re actually trying to do anything with this prior. Secondly, it turns out that some regularization does good things for sparsity estimators.

But for me, it’s the third problem that sinks the prior. The tail is so heavy that if you simulate from the model you frequently get extremely implausible data sets. This means that the half-Cauchy prior on the standard deviation probably doesn’t satisfy our requirement that a good prior satisfies the containment property.

Can’t help lovin’ that man

Right. We’ve managed to select a good prior for the standard deviation for a Gaussian random effect0. The next thing to think about is whether or not there are any generalizable lessons here. (Hint: There are.) So let us look at a very similar model that would be more computationally convenient to fit in Stan and see that, at least, all of the ideas above still work when we change the distribution of the random effect0.

The role of the random effect0 in the example model is to account for over-dispersion in the count data (allowing the variance being larger than the mean). An alternative model that does the same thing is to take the likelihood as negative-binomial rather than Poisson. The resulting model no longer has a random effect0 (it’s a pure GLM). If you’re fitting this model in Stan, this is probably a good thing as the dimension of the parameter space goes from 203 to 3, which will definitely make your model run faster!

To parameterize the negative binomial distribution, we introduce an over-dispersion parameter with the property that the mean of the negative binomial is and the variance is .

We need to work out a sensible prior for the over-dispersion parameter. This is not a particularly well-explored topic in the Bayesian literature. And it’s not really obvious what a sensible prior will look like. The effect of on the distribution of is intertwined with the effect of .

One way through this problem is to note that setting a prior on is in a lot of ways quite similar to setting a prior on the standard deviation of a Gaussian random effect0.

To see this, we note that we can write the negative binomial with mean and overdispersion parameter as

.

This is different to the previous model. The Gamma distribution for has a heavier left tail and a lighter right tail than then log-normal distribution that was implied by the previous model. That being said, we can still apply all of our previous logic to this model.

The concept of the base model would be a spike at , which is a Poisson distribution. (The argument for this is that it is a good base model because every other achievable model with this structure is more interesting as the mean and the variance are different from each other.) The base model occurs when .

So we now need to work out how to ensure containment for this type of model. The first thing to do is to try and make sure we have a sensible parameterization so that we can use one of our simple containment priors.

The gamma distribution has a mean of 1 and a variance of , so one option would be to completely follow our previous logic and use as a sensible transformation. Because the Gamma distribution is highly skewed, it isn’t completely clear that the standard deviation is as good a parameterization as it is in the previous model. But it turns out we can justify it a completely different way, which suggests it might be a decent choice.

A different method for finding a good parameterization for setting priors was explored in our PC priors paper. The idea is that we can parameterize our model according to its deviation from the base model. In both the paper and the rejoinder to the discussion, we give a pile of reasons why this is a fairly good idea.

In the context of this post, the thing that should be clear is that this method will not ensure containment directly. Instead, we are parameterizing the model by a deviation so that if you increase the value of by one unit, the model gets one unit more interesting (in the sense that the square root5 of the amount of information lost when you replace this model by the base model increases by one). The idea is that with this parameterization we will contain and hopefully therefore constrain .

If we apply the PC prior re-parameterization to the Gaussian random effects0 model, we end up setting the prior on the standard deviation of the random effect0, just as before. (This is a sense check!)

For a the Gamma random effect0, some tedious maths leads to the exact form of the distance (Warning: this looks horrible. It will simplify soon.)

,

where is the Kullback-Leibler divergence, and is the digamma function (the derivative of the log-gamma function).

To simplify the distance, let’s take a look at what it looks like for very small and very large . When is near zero, some rooting round the internet shows that . Similarly, when is large, we also get . Moreover, if you plot the square of the distance, it looks a lot like a straight line (it isn’t quite, but it’s very close). The suggests that putting a containment prior on the parameter might be a good idea.

So the end of this journey is that something like an appropriately scaled half-normal, exponential, or half-t distribution on is a good candidate for a containment prior for the over-dispersion parameter of a negative binomial distribution. The truth of this statement can be evaluated using prior data simulations to check containment, and posterior sensitivity and predictive checks to check that the prior is appropriate for the problem at hand. Because no matter how good I think the mathematical argument is for a prior, it is still vital to actually verify the theory numerically for the particular problem you are trying to solve.

To bring you my love

If I had to summarize this very long post, I’d probably say the following:

Vague priors tend to be “accidentally informative” in random effects 0 models (and other somewhat complicated statistical models)

models (and other somewhat complicated statistical models) The first thing you need to think about when setting a prior is to find a parameterization that is at least a little bit interpretable. If you can’t give an “in words” interpretation of your prior, you probably haven’t put enough care into setting it.

Base models are very useful to work out how to set priors. Think about what the most boring thing that your model can do and expand it from there.

The idea of containment is hiding in a lot of the ways people write about priors. It’s a good thing and we should pay more attention to it.

Containment tells us explicitly that we need to consider the joint prior on all parameters of the model, rather than just thinking about the priors on each parameter independently. Sometimes we can cheat if different sets of parameters change (almost) disjoint aspects of a model.

Containment also suggests that the priors need to respect the scale of the effect a parameter has. Sometimes we can fix this though a linear scaling (as in regression coefficients or standard deviation parameters), but sometimes we need to be more creative (as in the over-dispersion parameter).

Containment makes it hard to specify a universal default prior for a problem, but we can still specify a universal default procedure for setting priors for a problem.

We can always check containment using prior simulations from the model.

We can always assess the effect of the prior on the estimates through careful prior and posterior checks (this is really the story of our Visualization and Workflow paper).

These considerations stretch far beyond the problem of specifying priors for variance components in multilevel models.

Footnotes:

-1 For those of you who use “millennial” as a synonym for “young”, I promise you we are not. Young people were not alive when the Millennium turned. The oldest millennials are 35.

0 When I talk about random effects, I’m using the word in the sense of Hodges and Clayton, who have a lovely paper looks at how that term should be modernized. In particular, they point out the way that in modern contexts, random effects can be interpereted as “formal devices to facilitate smoothing or shrinkage, interpreting those terms broadly”. Everything that I talk about here holds for the scale parameter of a Gaussian process or a spatial random effect. Andrew and Aki prefer to just use different terms. The specific example that I’m using falls into their definition of an “old-style” random effect, in that the distribution of the random effect is of interest rather than the particular value of the realization isn’t of interest. I personally think that “random effect” is a sufficiently broad, accessible definition that bridges decades of theory with current practice, so I’m happy with it. But mileage obviously varies. (I have tagged the term every time it’s used in case someone gets half way through before they realize they’re not sure what I mean.)

0.25 Even if you use the suggested prior in Andrew’s paper (a half-Cauchy with scale parameter 25 on the standard deviation), bad things will happen. Consider a simple version of our problem, where , . If we put the recommended half-Cauchy prior on , then the simulation of will return an NA in R around 20% of the time. This happens because it tries to draw an integer bigger than , which is the largest integer that R will let you use. This will lead to very very bad inference and some fun numerical problems in small-data regimes.

0.5 Posterior intervals that are consistent with a particular estimator with some good frequentist properties isn’t necessarily a sign that a prior is good. Really there should be some proper posterior checks here. I have not done them because this is a blog.

1 Now I didn’t go back and check Andrew’s paper, so this may well be my memory of my original interpretation of Andrew’s definition of a Weakly Informative Prior. There are turtles everywhere!

2 These options are taken from the Stan prior choice wiki, which has a lot of interesting things on it.

2.5 I can’t say it controls the standard deviation because the half-Cauchy doesn’t have a well-defined standard deviation. So I picked a 90% highest density interval as the thing to be controlled. Note that because we’ve specified priors that decay away from the base model, these are one-sided intervals. Obviously if you want to use something else, use something else.

3 Obviously I do not speak for them here and you really shouldn’t take my half-memory of a conversation at some point in the dusty, distant past as a reflection of their actual views. But we do have a range of opinions on this matter, but not such a diverse set of opinions that we actually disagree with each other.

4 The exponential prior on the standard deviation (which is the PC prior for this model) did very well in these simulations, so obviously I very much like the results!

5 The square root is there for lots of good reasons, but mainly to get make sure all of the scales come out right. For the maths-y, it’s strongly suggested by Pinsker’s inequality.