So, what are Markov chain Monte Carlo (MCMC) methods? The short answer is:

MCMC methods are used to approximate the posterior distribution of a parameter of interest by random sampling in a probabilistic space.

In this article, I will explain that short answer, without any math.

First, some terminology. A parameter of interest is just some number that summarizes a phenomenon we’re interested in. In general, we use statistics to estimate parameters. For example, if we want to learn about the height of human adults, our parameter of interest might be average height in inches. A distribution is a mathematical representation of every possible value of our parameter and how likely we are to observe each one. The most famous example is a bell curve:

In the Bayesian way of doing statistics, distributions have an additional interpretation. Instead of just representing the values of a parameter and how likely each one is to be the true value, a Bayesian thinks of a distribution as describing our beliefs about a parameter. Therefore, the bell curve above shows we’re pretty sure the value of the parameter is quite near zero, but we think there’s an equal likelihood of the true value being above or below that value, up to a point.
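
If it helps to see that in code, here is a minimal sketch of a bell curve like the one above, rendered with Matplotlib (the same library behind this article’s figures). The mean of 0 and spread of 1 are purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

# A distribution pairs every possible value of a parameter with how
# likely we are to observe it. Here: a bell curve centered at 0.
values = np.linspace(-4, 4, 400)
density = np.exp(-0.5 * values**2) / np.sqrt(2 * np.pi)

plt.plot(values, density)
plt.xlabel("parameter value")
plt.ylabel("how likely each value is")
plt.show()
```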

As it happens, human heights do follow a normal curve, so let’s say we believe the true value of average human height follows a bell curve like this:

Clearly, the person with beliefs represented by this graph has been living among giants for years, because as far as they know, the most likely average adult height is 6'2" (but they’re not super confident one way or another).

Let’s imagine this person went and collected some data, and they observed a range of people between 5' and 6'. We can represent that data below, along with another normal curve that shows which values of average human height best explain the data:

In Bayesian statistics, the distribution representing our beliefs about a parameter is called the prior distribution, because it captures our beliefs prior to seeing any data. The likelihood distribution summarizes what the observed data are telling us, by representing a range of parameter values accompanied by the likelihood that each parameter value explains the data we are observing. Estimating the parameter value that maximizes the likelihood distribution is just answering the question: what parameter value would make it most likely to observe the data we have observed? In the absence of prior beliefs, we might stop there.
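
To make “maximizing the likelihood” concrete, here is a minimal sketch in Python. The observed heights are made up, and I assume individual heights scatter around the average with a known spread:

```python
import numpy as np

# Hypothetical observed heights in inches, all between 5' and 6'
heights = np.array([61, 63, 64, 65, 66, 67, 68, 69, 70, 71])

# Candidate values for our parameter of interest: average height
candidate_means = np.linspace(55, 80, 1000)

# Log-likelihood of the data under each candidate, assuming heights
# scatter normally around the average with spread sigma
# (constants that don't affect the maximum are dropped)
sigma = 3.0
log_likelihood = np.array([
    np.sum(-0.5 * ((heights - mu) / sigma) ** 2)
    for mu in candidate_means
])

# The value that makes the observed data most likely
best = candidate_means[np.argmax(log_likelihood)]
print(f"Maximum-likelihood estimate of average height: {best:.1f} inches")
```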

The key to Bayesian analysis, however, is to combine the prior and the likelihood distributions to determine the posterior distribution. This tells us which parameter values maximize the chance of observing the particular data that we did, taking into account our prior beliefs. In our case, the posterior distribution looks like this:

Above, the red line represents the posterior distribution. You can think of it as a kind of average of the prior and the likelihood distributions. Since the prior distribution is shorter and more spread out, it represents a set of beliefs that is ‘less sure’ about the true value of average human height. Meanwhile, the likelihood summarizes the data within a relatively narrow range, so it represents a ‘more sure’ guess about the true parameter value.

When the prior and the likelihood are combined, the data (represented by the likelihood) dominate the weak prior beliefs of the hypothetical individual who had grown up among giants. Although that individual still believes the average human height is slightly higher than what the data alone suggest, they are mostly convinced by the data.
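
That ‘kind of average’ is literal. When the prior and the likelihood are both bell curves, the posterior mean is a weighted average of the prior mean and the data’s mean, where each is weighted by how ‘sure’ it is. A minimal sketch with made-up numbers, using the giant-dweller’s 6'2" prior:

```python
import numpy as np

# Normal prior + normal likelihood => normal posterior (a closed form).
# All numbers here are made up for illustration.
prior_mean, prior_sd = 74.0, 5.0   # the giant-dweller's belief: 6'2", not very sure
heights = np.array([61, 63, 64, 65, 66, 67, 68, 69, 70, 71])  # observed data
sigma = 3.0                        # assumed spread of individual heights

# Precision = 1 / variance: a measure of how 'sure' each source is
prior_precision = 1 / prior_sd**2
data_precision = len(heights) / sigma**2

# The posterior mean is the precision-weighted average of the two
posterior_var = 1 / (prior_precision + data_precision)
posterior_mean = posterior_var * (
    prior_precision * prior_mean + data_precision * heights.mean()
)

print(f"Posterior: mean {posterior_mean:.1f} inches, sd {posterior_var**0.5:.1f}")
```

Running this gives a posterior mean just above the sample mean of the data: the data dominate, but the giant-flavored prior still nudges the answer upward, exactly as described above.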

In the case of two bell curves, solving for the posterior distribution is very easy. There is a simple equation for combining the two. But what if our prior and likelihood distributions weren’t so well-behaved? Sometimes it is most accurate to model our data or our prior beliefs using distributions which don’t have convenient shapes. What if our likelihood were best represented by a distribution with two peaks, and for some reason we wanted to account for some really wacky prior distribution? I’ve visualized that scenario below, by hand-drawing an ugly prior distribution:

Visualizations rendered in Matplotlib, enhanced using MS Paint

As before, there exists some posterior distribution that gives the likelihood for each parameter value. But it’s a little hard to see what it might look like, and it is impossible to solve for analytically.
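
In one dimension we could still brute-force an approximation: evaluate prior times likelihood on a dense grid of parameter values and normalize numerically. Here is a minimal sketch, with a two-peaked likelihood and a lumpy prior standing in for the hand-drawn curves above (the shapes are made up). The catch is that this grid approach becomes hopeless as the number of parameters grows.

```python
import numpy as np

def bell(x, mean, sd):
    # A normal density, used here only as a building block
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

mu = np.linspace(50, 90, 2000)   # a dense grid of candidate average heights

# Stand-ins for the hand-drawn curves: a two-peaked likelihood
# and a wacky, lumpy prior
likelihood = 0.6 * bell(mu, 64, 2) + 0.4 * bell(mu, 72, 2)
prior = 0.5 * bell(mu, 58, 3) + 0.2 * bell(mu, 70, 1) + 0.3 * bell(mu, 80, 4)

# The posterior is proportional to prior * likelihood;
# normalize numerically since there is no closed form
unnormalized = prior * likelihood
posterior = unnormalized / np.trapz(unnormalized, mu)

print(f"Grid-approximated posterior mean: {np.trapz(mu * posterior, mu):.1f} inches")
```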

Enter MCMC methods. They allow us to estimate the shape of a posterior distribution in cases where we can’t compute it directly. Recall that MCMC stands for Markov chain Monte Carlo. To understand how these methods work, I’m going to introduce Monte Carlo simulations first, then discuss Markov chains.
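
As a taste of what’s coming, here is a minimal sketch of one of the simplest MCMC methods, a random-walk Metropolis sampler, applied to our heights example. The data, prior, and step size are all made up for illustration; the sections that follow unpack why this kind of random wandering traces out the posterior.

```python
import numpy as np

# A minimal random-walk Metropolis sampler (one flavor of MCMC),
# sketched for illustration with made-up numbers
heights = np.array([61, 63, 64, 65, 66, 67, 68, 69, 70, 71])

def log_posterior(mu, prior_mean=74.0, prior_sd=5.0, sigma=3.0):
    # log prior + log likelihood, up to constants that cancel in the ratio
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    log_lik = np.sum(-0.5 * ((heights - mu) / sigma) ** 2)
    return log_prior + log_lik

rng = np.random.default_rng(0)
n_samples, step = 10_000, 1.0
samples = np.empty(n_samples)
current = 60.0                     # an arbitrary starting guess
current_lp = log_posterior(current)

for i in range(n_samples):
    proposal = current + rng.normal(0, step)   # wander to a nearby value
    proposal_lp = log_posterior(proposal)
    # Move there with probability min(1, posterior ratio); otherwise stay
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        current, current_lp = proposal, proposal_lp
    samples[i] = current

# The collected samples approximate the posterior distribution
print(f"Estimated posterior mean: {samples.mean():.1f} inches")
```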