Preprologue

Blogs without audiences tend to die. I’m hoping to lift this one from death into mere dormancy by fleshing out a post that’s been sitting as a draft for a few months; it’s about fixed and random effects models, but meanders rather far into linear algebra and genetics.

Prologue

Models are created in order to 1) predict or discriminate between outcomes, or 2) estimate the relationship between parameters and outcomes. Sometimes classification is what's important, and this is typically for intermediate steps. For instance, a company called Affectiva uses (or at least proposed using, in their white paper) feature detection and support vector machines to classify emotional reactions in real time based on facial expressions. If I'm interested in what kinds of things elicit a negative response, I care that the classification is accurate, but not so much about what it's picking up on. By contrast, sometimes it's the what that's important: which risk factors predispose people to myocardial infarction, and how might they elucidate the biology underlying the disease? The big challenge facing the machine learning community is demonstrating that its methods can be used not only for setting (1), but also for setting (2). One of the problems preventing widespread adoption of certain methods (say, those built on the graph Laplacian) is a lack of understanding of their behavior under null distributions; that is, nobody knows how to interpret their parameters. (Is the 6th component of a weight vector from an SVM significant? How do you test for this besides a costly permutation?) In the case of testing significance of a parameter against a null, progress is being made: just last year a test for weight components in least-squares SVMs was derived via the Lyapunov CLT. On the other hand, little work is being done on testing specific alternatives against said null, perhaps because the idea is antithetical to a principle of machine learning: making assumptions as mild as possible.

I believe this is the reason General Linear Mixed Models are still so widely used when better methods exist.

Fixed vs Random?

Nobody understands the difference between fixed effects and random effects. RA Fisher probably did, but he’s dead now, so more’s the pity. Anyway, ask any 10 statisticians and you’ll get 4 different interpretations, and 6 people telling you that if you asked any 10 statisticians you’ll get 4 different interpretations, and 6 people telling you that if you asked any 10 statisticians…

(No seriously: see http://stats.stackexchange.com/questions/4700/what-is-the-difference-between-fixed-effect-random-effect-and-mixed-effect-mode and http://andrewgelman.com/2005/01/25/why_i_dont_use/)

So I’m not going to try to explain the difference, because I don’t understand it either. What I do understand are the practical and mathematical differences, and those I can describe. These follow from a simple question:

Q: When must you use a random effect model?

A: When the thing you're using as an explanatory element is a matrix instead of a vector.

Ex: Let's say you're interested in the relation between a user's Netflix ratings and how often they buy books on Amazon. So you poll 1000 people with both Netflix and Amazon accounts, and ask how many times in the past 12 months they've bought books on Amazon. Now, you've conned Netflix into participating, but Netflix doesn't want to share the full spectrum of movie ratings for each user. They want to avoid follow-up questions like "which movies are most predictive of buying books on Amazon," so they structure the deal thusly: for these 1000 individuals, they tell you only how alike or unlike each pair of individuals' rating vectors are.

So you've got 1000 response datapoints, some number of control variables (because you're a good researcher who thinks about confounding), and a 1000 × 1000 Netflix Ratings Similarity Matrix, which is the thing you're actually interested in. Clearly you can't stuff that matrix next to the control variables in the model equation – the dimensions just don't work out – so the only place to stick it is into the covariance, as a random effect.

And you fit your random components via ML or EM or REML and have a nice result, and everybody is happy, or at least you are for a time.
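For concreteness, here's a minimal sketch (Python, toy data, invented numbers throughout) of how such a fit works mechanically: rotate into the eigenbasis of the similarity matrix so the covariance becomes diagonal, then profile the likelihood over the noise-to-signal ratio. This is the same trick EMMA and FaST-LMM exploit; a real analysis would use one of those tools rather than my crude grid search.

```python
# Sketch: ML fit of y ~ N(X b, sg2*K + se2*I) via the eigendecomposition
# of the similarity matrix K. All data simulated; names are mine.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# A valid (PSD) similarity matrix K, built from latent "rating" vectors.
Z = rng.normal(size=(n, 5))
K = Z @ Z.T / 5.0

# Simulate a response: g = Z a gives Cov(g) = 0.8*K when a ~ N(0, 0.16*I).
g = Z @ rng.normal(scale=0.4, size=5)
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one control
y = X @ np.array([1.0, 0.5]) + g + rng.normal(size=n)

# Rotate into the eigenbasis of K: the covariance is diagonal there.
lam, U = np.linalg.eigh(K)
yr, Xr = U.T @ y, U.T @ X

def neg_loglik(log_delta):
    """Profile negative log-likelihood in delta = sigma_e^2 / sigma_g^2."""
    d = lam + np.exp(log_delta)                  # diagonal covariance, up to scale
    W = Xr / d[:, None]
    beta = np.linalg.solve(Xr.T @ W, W.T @ yr)   # GLS for the fixed effects
    r = yr - Xr @ beta
    sg2 = np.mean(r**2 / d)                      # overall scale, profiled out
    return 0.5 * (n * np.log(sg2) + np.sum(np.log(d)))

# A crude grid search stands in for a proper optimizer.
grid = np.linspace(-4.0, 4.0, 401)
delta_hat = np.exp(grid[np.argmin([neg_loglik(ld) for ld in grid])])

d = lam + delta_hat
W = Xr / d[:, None]
beta_hat = np.linalg.solve(Xr.T @ W, W.T @ yr)
print(beta_hat)   # should land near the simulated [1.0, 0.5]
```

The eigen-rotation is what makes these models tractable: it turns an O(n³) solve per likelihood evaluation into a single decomposition up front.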

(To whatever fraction of my zero readers who yelled out “You could use some of the eigenvectors as fixed effects”: shhh, we’ll get there, but for now you’re confusing everyone)

To place this purely practical consideration into more model-esque terms: if what you're measuring is a property of a datapoint, it's a fixed effect, and goes on the scale of the mean. If you're measuring a property between two datapoints, it's a random effect, and goes on the scale of the covariance. These between-point metrics are generally similarity metrics, inverse distances, graph Laplacians, or other measures that aim to capture "closeness" as a large positive value and "farness" as something near 0.
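As a toy illustration of that scale distinction (Python, made-up ratings; all names are mine): a per-point feature matrix can sit next to the covariates, while the derived between-point similarity matrix can only live in the covariance.

```python
# Per-point vs between-point: a feature matrix lives on the fixed-effect
# (mean) scale; its inner-product similarity lives on the covariance scale.
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 40
ratings = rng.integers(1, 6, size=(n, p)).astype(float)  # toy "Netflix ratings"

# Per-point: one row per person -> usable directly as fixed effects.
print(ratings.shape)           # (6, 40)

# Between-point: centre each person's ratings, then take inner products.
# "Close" pairs get large positive values; unrelated pairs sit near 0.
centred = ratings - ratings.mean(axis=1, keepdims=True)
K = centred @ centred.T / p
print(K.shape)                 # (6, 6)

# K is symmetric and positive semi-definite, as a covariance model requires.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) > -1e-10)
```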

The duality between fixed and random effect models is well known, and is the cause of the "fixed vs random effects" question in the first place. To step from the covariance scale to a fixed effect scale, we diagonalize the covariance model

\[ y = X\beta + g + \epsilon, \qquad g \sim N(0,\, \sigma_g^2 K), \qquad K = U \Lambda U^\top = \sum_i \lambda_i u_i u_i^\top \]

and blindly write

\[ y = X\beta + \sum_i \alpha_i u_i + \epsilon. \]

But here's the problem: there's an eigenvector for every row in the matrix, so 1000 of them (or \(n\), for \(n\) datapoints). Even without covariates, all the degrees of freedom are gone. The random effect model only had one additional parameter (\(\sigma_g^2\)), but we've gained \(n\) more here, and this comes from the fact that we've relaxed all the constraints on the relative weights of the eigenvectors (the eigenvalues). Instead of free weights, the random effect model insists that

\[ \alpha_i \sim N(0,\, \sigma_g^2 \lambda_i). \]
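A quick numerical sketch of that degrees-of-freedom point (Python, toy data): using every eigenvector as an unconstrained fixed effect reproduces any response exactly, while the random effect uses the same basis but shrinks each coordinate by an eigenvalue-determined factor.

```python
# Diagonalise K = U diag(lam) U^T and treat the eigenvectors as regressors.
# Unconstrained, they saturate the model (n regressors for n points);
# the random effect instead ties their weights to the eigenvalues.
import numpy as np

rng = np.random.default_rng(2)
n = 50
Z = rng.normal(size=(n, n))
K = Z @ Z.T / n                      # full-rank similarity matrix
lam, U = np.linalg.eigh(K)

y = rng.normal(size=n)               # any response at all

# "Blindly" using every eigenvector as a fixed effect fits y exactly.
alpha, *_ = np.linalg.lstsq(U, y, rcond=None)
assert np.allclose(U @ alpha, y)     # perfect fit -- no degrees of freedom left

# The random effect's BLUP of g uses the same basis but shrinks each
# coordinate by lam_i / (lam_i + delta): the eigenvalues fix the weights.
delta = 1.0                          # noise-to-signal ratio, assumed known here
g_blup = U @ (lam / (lam + delta) * (U.T @ y))
print(np.linalg.norm(g_blup) < np.linalg.norm(y))   # shrinkage, not interpolation
```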

This isn’t new or special or clever or brilliant, but I think it’s one of the “deep” properties of random effect models.

The principal components of a covariance matrix (which is what we hypothesize these random effect "similarity" matrices to be) describe orthogonal directions along which the variance of the data is independently maximized (so PC1 explains the most, a proportion \(\lambda_1 / \sum_i \lambda_i\) of the total, and so on down the list). Each of these vectors provides a value for each datapoint you have (in the Amazon/Netflix case, one value for each person sampled), and thus the datapoints stratify along this dimension: sometimes they cluster discretely, other times they smear, and sometimes there are one or two outliers driving the variance. And, typically, these "dimensions of similarity" demonstrate a kind of spooky latent variable. For instance, a principal component of the Netflix Rating Similarity Matrix may well be how much people love Kevin Spacey, information that was never actually queried. (As I mentioned in a previous post, we geneticists typically use a genetic similarity matrix in our models, and without fail the top components sort samples by geography or by technical metrics like DNA quality.)
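The spooky-latent-variable phenomenon is easy to reproduce (Python sketch; the "Spacey affinity" variable and all numbers are invented): a trait that is never directly measured shifts the ratings, and the top eigenvector of the resulting similarity matrix recovers it.

```python
# A hidden trait drives part of the ratings; PC1 of the similarity
# matrix sorts the sample by it even though it was never queried.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 300
fan = np.repeat([0, 1], n // 2)              # hidden "Spacey fan" indicator

ratings = rng.normal(size=(n, p))
ratings[fan == 1, :30] += 1.5                # fans rate 30 "Spacey films" higher

centred = ratings - ratings.mean(axis=0)
K = centred @ centred.T / p                  # similarity matrix
lam, U = np.linalg.eigh(K)
pc1 = U[:, -1]                               # eigh sorts ascending; top PC is last

# Splitting the sample at the median of PC1 recovers the latent groups.
calls = (pc1 > np.median(pc1)).astype(int)
accuracy = max(np.mean(calls == fan), np.mean(calls != fan))
print(accuracy)   # close to 1.0
```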

So when you put a matrix into a random effects model, you are making a fairly narrow hypothesis: That you a) know the axes of variation along which your response is distributed (codified by the eigenvectors), and b) know their relative importance (codified by the eigenvalues). Now (a) isn’t all that deep, as the reason for specifying the model in the first place is to provide a measure of similarity that should nicely explain the covariance in the data, so it’s not such a big deal to step from “covariance” to “axes of variation underlying covariance.” But I think (b) is the crux of the fixed-vs-random topic. It’s also very much related to two of my previous posts: the Manifold Hypothesis, and Population Stratification.

Population stratification occurs when non-genetic risk factors stratify along geographical lines (it arises through many other mechanisms, too). For instance, in studying obesity, environmental and cultural factors like the nutritional content of food or the price of gas differ from place to place, and likely explain a large proportion of the variance; given that genetic variation similarly stratifies between geographic locations, these environmental effects can contribute to a phantom genetic association where none exists. Luckily, as mentioned, these correlates have a good proxy in the genetic relationship matrix (GRM), and can be corrected for via principal components, or by including the GRM as a random effect.
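A minimal simulation of that phantom association (Python, invented frequencies): a variant with no effect on the phenotype ends up correlated with it purely because both the allele frequency and the environment differ by region; adjusting for the axis of stratification (here the region label stands in for a top PC of the GRM) removes the signal.

```python
# Stratification producing a phantom association: allele frequency and an
# environmental effect both differ by region; there is no causal effect.
import numpy as np

rng = np.random.default_rng(4)
n = 2000
region = np.repeat([0, 1], n // 2)           # two populations

# Allele frequency 0.2 in one region, 0.6 in the other.
freq = np.where(region == 0, 0.2, 0.6)
genotype = rng.binomial(2, freq).astype(float)

# Phenotype driven only by a regional environmental effect.
phenotype = 0.8 * region + rng.normal(size=n)

naive_r = np.corrcoef(genotype, phenotype)[0, 1]
print(naive_r)        # clearly non-zero despite zero genetic effect

# Residualise both on region (a stand-in for the top PC): signal vanishes.
g_resid = genotype.copy()
p_resid = phenotype.copy()
for k in (0, 1):
    g_resid[region == k] -= genotype[region == k].mean()
    p_resid[region == k] -= phenotype[region == k].mean()
corrected_r = np.corrcoef(g_resid, p_resid)[0, 1]
print(abs(corrected_r) < abs(naive_r))
```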

The question of which approach to use is often considered more of an "art" than a "science," and the advice is always "do both, see which works best, the other can be a robustness check." Well, yes. But this is also a copout, and the real answer is "do both, at the same time, according to your model." This gets back to the constraints the random effect matrix places on the relative importance of the axes of variation: including however many principal components as fixed effects relaxes these constraints, in effect saying "I believe my covariance matrix identified axes along which my response varies, but I don't think it captured their relative importance." Concretely, if the first principal component is an east-west cline explaining 15% of the variation, and the second is a north-south cline explaining 8%, why should the east-west cline explain more of the variance in the response than the north-south cline? You might be looking at incidence of melanoma, where the north-south cline is associated not only with longer days and more sun, but also with higher temperatures and thus more time spent wearing T-shirts. Freeing the axes of variation from the constraints of the eigenvalues enables the stratification correction to adjust along geographical, technical, or Kevin-Spacey axes to the degree that they truly explain the variance in the response. The price you pay for this is a loss of degrees of freedom. (Obviously you can't include all the eigenvectors as fixed effects, so the balance is to include some number of top eigenvectors that clearly correspond to physical confounders, and include the random effect matrix as well to partially correct for the remainder – which, if it's simply noise, will have similar eigenvalues by Wigner's semicircle law, i.e. eigenvalues from the middle of the semicircle.)
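Sketching the "both at once" setup in code (Python; toy GRM, and the deflation step is one assumed convention, not the only way to do it): the top eigenvectors move into the design matrix as freed fixed effects, while the matrix stays in the covariance for the remainder.

```python
# Top eigenvectors of the GRM enter as fixed effects (weights freed from
# the eigenvalues); the deflated matrix remains as the random effect.
import numpy as np

rng = np.random.default_rng(5)
n = 300
Z = rng.normal(size=(n, 50))
K = Z @ Z.T / 50                       # stand-in GRM
lam, U = np.linalg.eigh(K)

n_pcs = 2
pcs = U[:, -n_pcs:]                    # top PCs; eigh sorts ascending

# Design matrix: intercept, controls, and the freed top PCs.
controls = rng.normal(size=(n, 3))
X = np.column_stack([np.ones(n), controls, pcs])
print(X.shape)                         # (300, 6)

# One convention: remove the top PCs' contribution from K, so the random
# effect only corrects for the remainder (the "partial correction" above).
K_resid = K - (pcs * lam[-n_pcs:]) @ pcs.T
assert np.allclose(K_resid @ pcs, 0, atol=1e-8)   # top axes deflated away

# X and K_resid would then go to the mixed-model fitter of your choice.
```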

These are actually forms of the Strong and Weak manifold hypotheses. Recall the Strong hypothesis was "my data is drawn from a distribution on a manifold \(\mathcal{M}\), and the domain of my hypothesis class is \(\mathcal{M}\)", while the Weak hypothesis was only the first of the two clauses. In a very similar way, a random effect is a Strong hypothesis: it provides both the coordinate system (the eigenvectors) and the hypothesis class (the eigenvalue-constrained weightings), and asks only that the best hypothesis from this class be chosen; by contrast, the fixed effect provides only the coordinate system, but relaxes the hypothesis class substantially (to the full span of the chosen eigenvectors). Another way to view it: a random effect is a projection down to 1D (through the eigenvalue-weighted linear combination of eigenvectors), and the corresponding fixed effect is a projection down to 3D or 4D (or however many eigenvectors you include).

What does this say about which model to use? My recommendation:

Case 1: I'm testing a specific covariance matrix, both the axes of variation and their relative importance => Random Effect

Case 2: I'm testing a covariance matrix, but am not sure the relative importance of the variation axes is right => Fixed Effect

Case 3: I have measurements on the data point level (e.g. 1000 x 1, not 1000 x 1000) => Fixed Effect

Case 4: I’m controlling for population stratification and/or cryptic relatedness => Both

A few concluding comments: I'm actually surprised that these distinctions aren't discussed all that much. For instance, the EMMA/EMMAX/FaST-LMM papers each correct for population stratification on the random effect level, yet none of them compare their degree of correction to that of including both principal components and the GRM. Interestingly, the EMMAX paper does demonstrate a situation in which the random effect model outperforms including the top 100 principal components; I find this quite interesting, as it would be incredibly easy to break (just generate phenotypes according to PC6). I'm sure there's some nice reference from the 80s or 90s that discusses all of this at length, but my google-fu just isn't up to the task of turning it up.

It should also be pointed out that while the eigenvalue-constrained weighting is not optimal, there's still going to be a correlation between the response and that weighted sum of eigenvectors; it's just that a substantial amount of "noise" is added to the explanatory variable (alternatively: signal is removed from it). This is all orthogonal to the question of which GRM is the right GRM, when multiple GRMs may be appropriate, and what to do in those cases. Obviously these are interesting intellectual questions, but of an importance level commensurate with a rambling spare-time blog, as opposed to, say, analyzing the data we've got.