Marginal likelihoods and Bayes factors¶

After so much fuss, you may ask yourself: "I still don't know how to make relative judgements about the plausibility of expressed hypotheses given data!". Fair point, which is why this tutorial finally provides the final piece of the HypTrails puzzle: the marginal likelihood.

Let us re-iterate Bayesian inference:

$$ \overbrace{P(\theta| D, H)}^{\text{posterior}} = \frac{\overbrace{P(D | \theta, H)}^{\text{likelihood}}\overbrace{P(\theta|H)}^{\text{prior}}}{\underbrace{P(D|H)}_{\text{marginal likelihood}}} $$

$\theta$ corresponds to the parameters of the Markov chain model, $D$ are the human trails of interest and $H$ corresponds to a hypothesis. We have said that we incorporate hypotheses as priors into the inference process. Thus, the priors (with pseudo counts) $P(\theta|H)$ encode our hypotheses. This means that for individual hypotheses, we have different pseudo count configurations of the prior. Furthermore, for a single hypothesis, we can steer the amount of belief via the hypothesis weighting factor $k$, which increases the concentration of the prior.
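To make this concrete, here is a minimal sketch of one way to turn a hypothesis into Dirichlet pseudo counts: each row of a hypothesis matrix is normalized and scaled by $k$ times the number of states, so that larger $k$ places more pseudo counts (stronger belief) on the hypothesized transitions. The example matrix, the function name `elicit_prior`, and this particular scaling scheme are illustrative assumptions, not the exact HypTrails elicitation procedure.

```python
import numpy as np

# Hypothetical 3-state example: a belief matrix over transitions
# (values are illustrative only).
hypothesis = np.array([
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

def elicit_prior(hypothesis, k):
    """Turn a hypothesis matrix into Dirichlet pseudo counts: each row
    receives k * n pseudo counts (n = number of states), distributed
    proportionally to the hypothesized transition beliefs in that row."""
    n = hypothesis.shape[0]
    row_sums = hypothesis.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # guard against all-zero rows
    return (hypothesis / row_sums) * k * n

# Larger k concentrates the prior more strongly on the hypothesis.
print(elicit_prior(hypothesis, k=1))
print(elicit_prior(hypothesis, k=10))
```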

What we want to achieve now is to determine which prior best reflects the given data. To that end, we utilize the marginal likelihood $P(D|H)$ (also called the evidence), which we describe next.

The marginal likelihood corresponds to the probability of the data given a hypothesis where the parameters have been marginalized out:

$$ P(D | H) = \int P(D | \theta, H)P(\theta | H)d\theta $$

The evidence is thus the average of the likelihood over all possible values of the parameters $\theta$, weighted by the prior.

Generally, we can say that if the prior is well aligned with the data, the evidence rises with the strength of the prior. The evidence is largest if the prior and the likelihood concentrate on the same parameter regions, and lowest if they concentrate on different regions. Hence, we want to choose an informative prior that captures the same regions as the likelihood. Consequently, priors that better capture the underlying mechanisms producing the data yield higher evidences compared to those that capture these mechanisms less well. Furthermore, if the prior represents a valid hypothesis about the behavior producing the data (in our case human trails), its evidence should be larger than that of a uniform prior, or of an unlikely hypothesis prior with an equal amount of pseudo counts. For a more detailed understanding of the marginal likelihood and its application in HypTrails, please consult both our HypTrails paper and PlosOne paper.

For our Markov chain model with a conjugate Dirichlet prior, the marginal likelihood is given by (please consult our PlosOne paper for the derivation):

$$ P(D | H) = \prod_i\frac{\Gamma(\sum_j \alpha_{i,j})}{\prod_j \Gamma(\alpha_{i,j})} \frac{\prod_j \Gamma(n_{i,j}+\alpha_{i,j})}{\Gamma(\sum_j (n_{i,j}+\alpha_{i,j}))} $$
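This product can be evaluated directly in log space using the log-gamma function, which avoids the numerical underflow that the raw product of Gamma terms would cause. Below is a minimal sketch of this computation; the toy count matrix and the two example priors are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(n, alpha):
    """Log of P(D|H) for a first-order Markov chain with Dirichlet priors.
    For each row (state) i, this computes
      log G(sum_j a_ij) - sum_j log G(a_ij)
      + sum_j log G(n_ij + a_ij) - log G(sum_j (n_ij + a_ij))
    and sums over rows. n: observed transition counts, alpha: pseudo
    counts (must be strictly positive)."""
    return np.sum(
        gammaln(alpha.sum(axis=1)) - gammaln(alpha).sum(axis=1)
        + gammaln(n + alpha).sum(axis=1) - gammaln((n + alpha).sum(axis=1))
    )

# Toy transition counts and two priors: a uniform prior and one whose
# pseudo counts are concentrated on the frequently observed transitions.
n = np.array([[0, 9, 1],
              [8, 0, 2],
              [1, 1, 8]])
uniform = np.ones((3, 3))
aligned = np.array([[1., 8., 1.],
                    [8., 1., 1.],
                    [1., 1., 8.]])

# The prior aligned with the data yields a higher (log) evidence.
print(log_marginal_likelihood(n, uniform))
print(log_marginal_likelihood(n, aligned))
```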

For comparing the plausibility of two hypotheses, we resort to Bayes factors. Bayes factors represent a Bayesian method for model comparison that includes a natural Occam's razor guarding against overfitting. In our case, each model represents a hypothesis of interest, with a different prior whose hyperparameters express the corresponding beliefs. For illustrative purposes, we are now interested in comparing hypotheses $H_1$ and $H_2$, where $H_1,H_2\in\textbf{H}$, given observed data $D$. Assuming that all hypotheses are equally likely a priori (an unbiased comparison), we define the Bayes factor as follows:

$$ B_{1,2} = \frac{P(D | H_1)}{P(D|H_2)} $$
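Since marginal likelihoods of real trail data are typically far too small to represent as ordinary floats, the ratio is best computed in log space, where it becomes a difference of log evidences. A minimal sketch, using made-up illustrative log evidence values:

```python
# Hypothetical log evidences log P(D|H_1) and log P(D|H_2)
# (the numbers are illustrative, not from real data).
log_evidence_h1 = -120.4
log_evidence_h2 = -135.9

# log B_{1,2} = log P(D|H_1) - log P(D|H_2);
# a positive value means the data favor H_1 over H_2.
log_bayes_factor = log_evidence_h1 - log_evidence_h2
print(log_bayes_factor)
```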