What’s all this then?

I’m pretty excited to share something I’ve been working on for quite some time: a probabilistic programming DSL in Haskell.

I’ll start with a little on why I think probabilistic programming is interesting.

I’ll also do a quick review of the approach I’ve taken, but not an in-depth explanation of the algorithms- see the references section (or the code) for technical details.

Finally, I’m releasing this before it’s really “ready”, so this post is more of a summary on what I’ve learned so far rather than how to actually use it. If you do want to play with it, I’ve included an example model in Language.PP.Examples.

However, please note there are a lot of improvements I want to make. If you’ve got any ideas, please open an issue on github - contributions are more than welcome. I’d love to eventually make this into a tool that can be used for day-to-day data science work, but there’s certainly work to be done first.

Oh, and if you just want to look at the code, it’s on github.

Motivation

I’ve found that often when trying to model some real-world phenomenon, you have a pretty good idea of how the underlying system works, but frequently it’s hard to encode the dynamics of that system into your model.

Consider the humble pendulum. Here’s the setup:

You have a time series of measurements

Each measurement is the pendulum’s angle at a given time

Measurements are corrupted with zero-mean gaussian random noise

You want to answer these questions:

Where is the pendulum at a given time?

What are the physical parameters of the system, like arm length and bob weight?

In this case, a probabilistic program would be a simulation of the system using the physical laws that govern it, along with some probabilistic components: in this case, the gaussian noise that corrupts measurements.

In general, a probabilistic program is a generative model (in the Bayesian sense). In other words, it’s a program that when executed returns samples from a particular probability distribution.

“Probabilistic programming” refers to both these programs and the algorithms for doing inference over them, given some observed data.

When we say “doing inference”, we mean “discovering the values of parameters of the simulation.” For example, our pendulum simulation is parameterised by the arm length and bob weight- different values of these will result in different generated observations. Inference is just “going backwards” to find the parameters that generated a given set of observations.

The amazing thing about probabilistic programming is that there are inference tools that work across any program- regardless of its structure!

This is great news for fans of Bayesian inference: the dream of probabilistic programming is that you declaratively specify your modelling assumptions and you can straight away do inference with it- no more hand-writing inference algorithms tailored to your model.

Contrast this to “traditional” machine learning algorithms- like a decision tree, in which modelling assumptions are fixed.

A decision tree in particular learns a set of decision boundaries in some high-dimensional feature space. Modelling usually means feature engineering: carefully selecting features that can be well approximated by the assumptions of the model.

I like to see these approaches as dual- with a probabilistic program, you encode your assumptions in the model structure, whereas with a decision tree it’s encoded in your features.

Interlude: The Sticky-State model

Because I like to refer to concrete examples, I’m going to describe a simple hidden-markov type model and refer back to it later in the post. Here’s a made up scenario we’re trying to model:

Let’s say you have a robot with a faulty light sensor. The sensor can be in one of two states: working or faulty (let’s call them \(A\) and \(B\), respectively).

Observations in the \(A\) state are drawn from a gaussian with mean \(0\), while observations in the \(B\) state come from one with mean \(5\).

Additionally, when the sensor is in a state it tends to stay in that state- it’s “sticky”- and the probability of staying in the same state is the same- whichever state it’s currently in. We’ll use \(\theta\) to refer to the probability that the state remains the same, and \(1 - \theta\) the probability that it changes.

Finally, We’ll also assume we always start in state \(A\).

Now, given a series of observations, we want to figure out which are faulty as well as the probability of staying in the same state.

A simulation of this model might proceed as follows:

Start in state A Switch current state to \(B\) with probability \(1 - \theta\) , or remain in state \(A\) . Generating observations. If in state \(A\) , generate from a gaussian with mean \(0\)

, generate from a gaussian with mean If in state \(B\) , generate from a gaussian with mean \(5\) Repeat from step 2

The “stickiness” tends to produce “runs” of values, like \(AAABBBBBAABBBBB\) which we want to exploit in our inference. That’s where probabilistic programming comes in.

I’ll also show pseudocode that generates data according to this model in the “Monads of Probability” section.

Purely-Functional Probabilistic Programming

So what does this library do? Two things:

Provides a representation of Probabilistic Programs using the Free Monad Implements an inference algorithm called PMMH (which is part of a family of methods called PMCMC, or Particle Markov-Chain Monte-Carlo)

The idea is that the particular representation of programs (part 1) makes implementing the inference algorithm (part 2) much easier (or even possible).

I’ll start with explaining a little about PMCMC (part 2), and then I’ll describe why part 1 is important for its implementation in the context of general probabilistic programs.

Particle Markov-Chain Monte-Carlo and Learning Probabilistic Programs

Using PMCMC for inference on probabilistic programs is an idea from the frankly amazing paper, “Learning Probabilistic Programs”

The authors show how PMCMC is applicable to arbitrary probabilistic programs - not just state-space models like HMMs. They use it in the context of a domain-specific language called “Anglican”.

Although this blog post is not concerned with it, the main focus of the “Learning Probabilistic Programs” paper is actually profoundly more interesting. Note:

PMCMC lets you do inference on arbitrary probabilistic programs You can describe a distribution over probabilistic programs You can describe distributions as probabilistic programs

Therefore, you can use probabilistic programs to model probabilistic programs. In particular, “Learning Probabilistic Programs” is concerned with learning an entire model for a given set of observations- simply by placing a prior over probabilistic programs. I strongly recommend you read this paper- it’s quite accessible and extremely interesting.

Now that’s all well and good- but why use PMCMC in the first place? The reason requires a little background on how the Metropolis-Hastings algorithm works.

You have some desired parameters to infer- for example, the transition probability from our “Sticky State” example \(\theta\).

The general idea is that you start at a random point in the parameter space, \(\theta_0\), and randomly walk around, selecting new points by preferring those that are more probable given your observations.

This preference for more probable points is done by selecting a proposal distribution \(Q(\theta_{i+1} | \theta_{i})\)- which is a distribution over the next point in the walk, given the current point. This is used along with the probability of your observations given these parameters to decide whether or not to use the proposed point.

This poses a problem for probabilistic programming systems: we don’t want the user to have to come up with a proposal distribution, because…

It could be difficult or impossible

Some choices can result in poor performance (inference doesn’t work well)

Programmers are lazy

PMCMC is great, because it works well even when your proposal distribution is simply the prior- i.e. you completely ignore the previous point in the random walk, and select a new one from the prior distribution.

This isn’t actually the only reason- part of the Metropolis-Hastings “acceptance ratio” requires that you know the likelihood of your observed data given the parameters of your model. This is actually not easy in some models- including our “Sticky State” model.

That’s because our observations depend on some latent variables. So we might know an expression for \(P( y | s, \theta )\), but we don’t have an expression for \(P(y | \theta)\) directly.

PMCMC methods solve this problem by using sequential monte carlo to provide an estimate of \(P(y | \theta)\) which is then used with Metropolis-Hastings. Turns out an estimate works just as well in the acceptance ratio! If you want to know more on why, I recommend this series of blog posts by Prof. Darren Wilkinson- they were immensely helpful to me.

Sequential Monte Carlo

We’ve learned that PMCMC uses a pairing of sequential monte carlo and metropolis-hastings to provide a joint estimate over latent values and parameters- the question is now: what is SMC, and how do we implement it?

Sequential Monte Carlo algorithms- also known as particle filters- are a family of algorithms concerned with doing inference over state space models like the HMM. They’re similar in some ways to genetic algorithms, in that they have generations of particles, with less probable particles “dying out”, and more probable particles propagating more.

In state space models, a particle represents a particular sequence of states, but the key insight from Learning Probabilistic Programs is that they can equally well represent an interpreter over a probabilistic program. More on this later!

I’ll use the Sticky State example to demonstrate how SMC does inference. First, let’s see some made-up observed data: a sequence of latent states including the initial state:

[A, A, A, A, A, B, B, A]

which generates the observations- and remember, an observation depends only on the current state.

[0, 1, 1, 2, 6, 5, 0]

Now, to do inference we do the following:

Begin with 100 particles initialised to state A. Transition each particle to its next state, so we’ll have a mixture of particles in states A and B . Weight particles by the probability of the last transition- so particles in B state will have lower weights. Resample the particles (i.e., randomly select 100 particles weighted by their weight from step 3) Weight each particle by the probability of the corresponding observation, given its current state Resample again Go to step 2.

Hopefully it’s clear how more probable sequences get propagated, while less probable sequences die out. To translate this into the language of probabilistic programs:

Particles are probabilistic programs being evaluated step-by-step

Each “step” of a probabilistic program is a probabilistic operation- i.e. a random value being drawn from a distribution.

Each step is weighted by the probability of the value drawn from its distribution

Resampling is resampling of partially-evaluated programs

Now that we understand how the particle filter works, we can move on to representations of our program to help implementation of it.

Monads of Probability

The fact that probability distributions form a monad has been known for some time. It’s exploited in the popular paper Probabilistic Functional Programming In Haskell, and the excellent random-fu library uses it for generating random values (the RVar type).

random-fu’s RVar gives us the ability to write simulations, but not do inference on them. Here’s some pseudocode for roughly what our sticky-state model looks like using random-fu:

Recalling the particle filter, we need a way to evaluate model step by step, so we can resample the set of partially-evaluated programs.

The Free Monad is the solution to this problem. In short, it lets us capture the structure of our monadic model, so we can evaluate it step-by-step later. I don’t want to go into an explanation of what the free monad is in this post so I’ll refer you to some of the many free monad tutorials.

So how do you use it? To explain, I’ll list here (as of writing this post) the core types of my library:

There are just two important ones:

P (for “probabilistic”): A functor representing randomly-sampled values with an associated probability. PP (for “probabilistic program”): the free monad over the functor P.

Note that each is parameterised by a “sampling monad”, m - this is where the actual random generation is taking place. Currently, I’m using random-fu to fill this role. To demonstrate what this means, here’s an implementation of eval . It takes a probabilistic program PP m a , and builds a monadic value of type a by summing log-probabilities at each step, to yield a final sample along with the summed log-probabilities.

At last! Below is the bootstrap particle filter. I’ve left out implementations for the following functions in the interests of conciseness:

terminated : true if a program has finished

: true if a program has finished stepAll : advance a list of probabilistic programs one step

: advance a list of probabilistic programs one step resample : resample a list of probabilistic programs

Hopefully the core idea is clear: we simply repeatedly step the programs (or particles, or ps ), resample them, and repeat until all have terminated.

Acknowledgements and Improvements

Hopefully that was an illuminating glance into what probabilistic programming is about. As mentioned, this library is in a very early state, and there are a few shortcomings with the code right now. Here’s a list of things I’m aware of and working on:

Parametrising P and PP by a monad m might be better done using FreeT .

and by a monad might be better done using . WriterT might be used instead of summing probabilities

might be used instead of summing probabilities Performance is bad: The Metropolis-Hastings implementation takes a very long time, and seems to have a space leak.

The “StickyState” example doesn’t seem to converge properly, although hidden states are correctly inferred.

I also like to list the resources I found useful in writing this:

And last but certainly not least, it seems that I’ve been beaten to the punch by a talk and paper from ICFP 2015: Practical Probabilistic Programming with Monads.

I’ve only skimmed this paper, but I believe it’s a similar idea- it also seems like it’s a pretty great explanation (and possibly a more solid implementation) so I definitely recommend reading it.