Bayesian networks (and probabilistic graphical models more generally) are cool. We computer geeks can love ‘em because we’re used to thinking of big problems modularly and using data structures. But better than being cool, they’re useful. Especially if you have the kind of problem that involves hundreds or thousands of interrelated variables, any one of which you might want to predict based on some subset of the others. Did I mention that your variables can have noisy, missing, or just plain unobservable data?

At Khan Academy, we’ve got problems like that.

What are the underlying concepts that relate mastery of our hundreds of exercises to each other?

Can we predict the best ordering of topics for a user to minimize time spent and maximize success?

What instructional interventions can an intelligent learning system make to best aid the user?

So I sat down to start applying these tools, and there were plenty of online resources to help with the book learnin’. I’m not going to duplicate them–I recommend this or that. When I started coding, though, I had several practical questions and really wanted a simple code example. I didn’t find a great one, so I’m posting what I came up with in the hope of helping the next soul.

THE WORKING EXAMPLE

For our example problem, let’s say there is a hidden (unobservable) binary variable T for every Khan Academy user that represents whether they have mastered a given topic (1=mastered, 0=not mastered). You can think of a topic as a collection of any N exercises. While topic mastery is hidden, we can partially observe performance on the exercises in that topic. I say ‘partially’ because the user may not do problems on all of the exercises. So the example program needs to handle missing data for all of the exercise variables E_i, which for simplicity can also be binary variables (1=good performance, 0=bad). In the idiom of Bayes nets, then, our graph looks like this: a single hidden parent node T with an arrow pointing to each child node E_1, ..., E_N.

Of course, lots of problems could be modeled by this simple graph structure of a hidden parent variable with a collection of children with missing data. Let me know if you invent another interesting application!
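To make the model concrete, here is a minimal sketch of how data from this graph could be simulated with NumPy. The function name `simulate`, the parameter layout (`p_e_given_t` as a 2-by-N array), and the choice to encode missing observations as `np.nan` are all my assumptions for illustration, not necessarily what the posted code does.

```python
import numpy as np

def simulate(n_users, p_t, p_e_given_t, p_missing, seed=0):
    """Sample hidden T and partially observed E_i for n_users.

    p_t: P(T=1). p_e_given_t: shape (2, N), row t holds P(E_i=1 | T=t).
    Missing exercise observations are encoded as np.nan (an assumption).
    """
    rng = np.random.default_rng(seed)
    p_e_given_t = np.asarray(p_e_given_t, dtype=float)
    # Draw the hidden mastery variable for every user.
    t = (rng.random(n_users) < p_t).astype(int)
    # Each user's exercises use the Bernoulli parameters matching their T.
    n_exercises = p_e_given_t.shape[1]
    e = (rng.random((n_users, n_exercises)) < p_e_given_t[t]).astype(float)
    # Hide a random subset of results to mimic exercises the user never tried.
    e[rng.random(e.shape) < p_missing] = np.nan
    return t, e

# Hypothetical parameters: three exercises, mastered users do much better.
t, e = simulate(1000, 0.6, [[0.2, 0.3, 0.1], [0.8, 0.9, 0.7]], p_missing=0.3)
```

In a real data set you would only ever see `e`; `t` is returned here so the learned model can be checked against the truth.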

THE CODE

Here’s the code. It contains functions to learn from a simulated example or from a data file representing the evidence from your child variables. The learning algorithm is expectation-maximization (EM). The most interesting piece is the expectation step, where we must impute the hidden T-variable given whatever E-variable evidence is available for that data sample. To see the algebra worked out using Bayes’ rule, check out this excellent write-up courtesy of Jascha.
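As a rough illustration of that expectation step, the sketch below computes P(T=1 | observed E) by Bayes’ rule, multiplying the prior by the likelihood of only the exercises that were actually observed. The name `posterior_t`, the `np.nan` encoding for missing data, and the parameter layout are assumptions of mine; the linked code may organize this differently.

```python
import numpy as np

def posterior_t(e, p_t, p_e_given_t):
    """Return P(T=1 | observed E) for each row of e (np.nan = missing).

    p_t: P(T=1). p_e_given_t: shape (2, N), row t holds P(E_i=1 | T=t).
    All probabilities are assumed to lie strictly between 0 and 1.
    """
    e = np.asarray(e, dtype=float)
    p_e_given_t = np.asarray(p_e_given_t, dtype=float)
    observed = ~np.isnan(e)
    e_filled = np.where(observed, e, 0.0)
    log_joint = np.empty((e.shape[0], 2))
    for t, prior in enumerate((1.0 - p_t, p_t)):
        # Log-likelihood of each exercise outcome under T=t.
        ll = (e_filled * np.log(p_e_given_t[t])
              + (1 - e_filled) * np.log(1 - p_e_given_t[t]))
        # Missing exercises contribute nothing: they are marginalized out.
        log_joint[:, t] = np.log(prior) + np.where(observed, ll, 0.0).sum(axis=1)
    # Normalize in log space for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    joint = np.exp(log_joint)
    return joint[:, 1] / joint.sum(axis=1)

# One user with a good result on exercise 1 only, one with no data at all.
evidence = np.array([[1.0, np.nan], [np.nan, np.nan]])
w = posterior_t(evidence, 0.5, [[0.2, 0.3], [0.8, 0.9]])
```

With no evidence at all, the posterior simply falls back to the prior, which is the behavior you want for a user who has not touched the topic.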

If for some sad reason you have an aversion to matrices (or NumPy), you might also take a look at my rough draft version written with the excellent Pandas library instead of just NumPy. (It doesn’t support handling of missing data for the E-variables and I ended up converting strictly to NumPy for speed.)

What I find fascinating (and what the Pandas version illustrates nicely) is that the most complicated math really needed here is… counting! The idea that we can learn the full joint distribution with nary a gradient or a step size parameter in sight kind of feels like magic to me.
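Here is a sketch of what “counting” can look like in the maximization step, under my own assumptions: the hidden T is represented by a weight `w` = P(T=1 | evidence) per user, missing E-values are `np.nan`, and the function name `m_step` is hypothetical. Each parameter is just a weighted count of successes divided by a weighted count of observations.

```python
import numpy as np

def m_step(e, w):
    """Re-estimate P(T=1) and P(E_i=1 | T=t) from soft counts.

    e: shape (n, N) with np.nan for missing values.
    w: shape (n,), the imputed P(T=1 | evidence) for each user.
    Assumes every exercise is observed at least once for each value of T.
    """
    e = np.asarray(e, dtype=float)
    w = np.asarray(w, dtype=float)
    observed = ~np.isnan(e)
    e_filled = np.where(observed, e, 0.0)
    # Row 0 weights each user by P(T=0), row 1 by P(T=1).
    weights = np.stack([1.0 - w, w])
    successes = weights @ e_filled            # weighted count of E_i = 1
    trials = weights @ observed.astype(float)  # weighted count of observed E_i
    return w.mean(), successes / trials

e = np.array([[1.0, 0.0], [1.0, np.nan], [0.0, 1.0], [0.0, 0.0]])
w = np.array([1.0, 1.0, 0.0, 0.0])  # pretend T is known, for a simple check
p_t, p_e_given_t = m_step(e, w)
```

When the weights are exactly 0 or 1, as in this toy call, the result reduces to ordinary frequency counts within each group.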

PRACTICAL TIPS AND TRICKS, TRIALS AND ERRORS

A few lessons learned here and over the years:

Learning from simulated data is very useful for testing and debugging your algorithm. Two useful tests are:

Make sure your algorithm recovers a model for which you know the true distribution.

If the above doesn’t work, initialize your parameters to the correct values and see if they “walk away.”

Graphing the log-likelihood evolution is helpful to visualize convergence behavior.

I initially tested an example model with just two child variables. That system was “underdetermined” because there were more parameters than constraints (five free parameters, but only three degrees of freedom in the joint distribution of two observed binary variables). That made things not converge as expected, and I spent many hours searching for a bug before I was mercifully rescued.

When imputing values, make sure you calculate probabilities and sample from that distribution (or pass the full distribution through to future imputations) rather than just filling in the most likely value.
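The tips above can be sketched as a small test harness. This is a simplified stand-in I wrote for illustration, not the posted code: it simulates data from known parameters, starts EM at the true values, records the log-likelihood at every iteration, and passes the full posterior through as soft counts instead of sampling. All names and parameter values are hypothetical, and it uses four children to stay clear of the underdetermined two-child case.

```python
import numpy as np

def e_step(e, p_t, p_e):
    """Return P(T=1 | evidence) per row and the total log-likelihood."""
    obs = ~np.isnan(e)
    x = np.where(obs, e, 0.0)
    log_joint = np.empty((e.shape[0], 2))
    for t, prior in enumerate((1.0 - p_t, p_t)):
        ll = x * np.log(p_e[t]) + (1 - x) * np.log(1 - p_e[t])
        log_joint[:, t] = np.log(prior) + np.where(obs, ll, 0.0).sum(axis=1)
    log_evidence = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
    return np.exp(log_joint[:, 1] - log_evidence), log_evidence.sum()

def m_step(e, w):
    """Re-estimate the parameters from soft (weighted) counts."""
    obs = ~np.isnan(e)
    x = np.where(obs, e, 0.0)
    weights = np.stack([1.0 - w, w])
    p_e = (weights @ x) / (weights @ obs.astype(float))
    # Clip away from 0 and 1 so the logs in the E-step stay finite.
    return w.mean(), np.clip(p_e, 1e-6, 1 - 1e-6)

def run_em(e, p_t, p_e, n_iter=50):
    """Alternate E and M steps, recording the log-likelihood each time."""
    history = []
    for _ in range(n_iter):
        w, loglik = e_step(e, p_t, p_e)
        history.append(loglik)
        p_t, p_e = m_step(e, w)
    return p_t, p_e, history

# Simulate from a known distribution (hypothetical values), 30% missing.
rng = np.random.default_rng(0)
true_p_t = 0.6
true_p_e = np.array([[0.2, 0.3, 0.1, 0.25], [0.8, 0.9, 0.7, 0.85]])
t = (rng.random(20000) < true_p_t).astype(int)
e = (rng.random((20000, 4)) < true_p_e[t]).astype(float)
e[rng.random(e.shape) < 0.3] = np.nan

# Start at the truth: the estimates should not "walk away".
p_t_hat, p_e_hat, history = run_em(e, true_p_t, true_p_e)
```

The `history` list is what you would hand to your plotting tool of choice. EM should never decrease the log-likelihood, so a dip in that curve is a strong hint of a bug. When starting from random values instead, keep in mind that the labels T=0 and T=1 can come out swapped.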

THE PAYOFF

Despite the simplicity of this example, it is already Really Useful. I created a data set for a subset of the Khan Academy exercises, which you can download here. I heuristically chose a definition for the E-variables of whether a user answered more or less than 85% of problems correct on an exercise. Once I had learned the full joint distribution (“theta” in the code), I could infer the probability of mastery for a given user on any exercise, including exercises for which they had not yet done any problems. When I plugged those predictions in as an additional feature to our accuracy model, it was a highly significant feature, especially on the first few problems done for an exercise.
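For the prediction step, a minimal sketch (with a hypothetical function name and made-up parameter values) looks like this: once you have the imputed P(T=1 | evidence) for a user, the predicted probability of good performance on any exercise is the average of P(E_i=1 | T=t) over both values of T, weighted by that posterior. This works equally well for exercises the user has never attempted.

```python
import numpy as np

def predict_exercises(w, p_e_given_t):
    """Return P(E_i=1 | evidence) for every user and every exercise.

    w: shape (n,), P(T=1 | evidence) for each user.
    p_e_given_t: shape (2, N), row t holds P(E_i=1 | T=t).
    """
    w = np.asarray(w, dtype=float)[:, None]
    p_e_given_t = np.asarray(p_e_given_t, dtype=float)
    # Mix the two conditional distributions by the posterior over T.
    return (1.0 - w) * p_e_given_t[0] + w * p_e_given_t[1]

# Two users: one probably has mastered the topic, one is a coin flip.
pred = predict_exercises([0.8, 0.5], [[0.2, 0.3], [0.8, 0.9]])
```

Each column of `pred` could then serve as an extra feature for a downstream model.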

Of course, there are many different ways we could construct new features to summarize performance across a pool of exercises, but this is a clean, robust, and transparent option. It’s easily extended to a full hierarchical model of our knowledge map. And the graphical modeling framework is powerful enough that we can eventually accommodate temporal effects, a decision-making agent, prior knowledge of experts (hello, teachers!), and more.

NEXT STEPS

I’m tremendously excited about the potential of probabilistic graphical models to power optimized online learning at Khan Academy and elsewhere. If you’re interested in learning about these modeling techniques in general, hustle over to Stanford’s free and recently launched online course, and follow me on Twitter for more practical examples and updates. If you want to directly improve the future of education, there’s a place to do that, too.