(Updated 9/2/2009, but still unfinished; see other’s work on this that I’ve collected)

I never took a statistics class, so I only know the kind of statistics you learn on the street. But now that I’m in global health research, I’ve been doing a lot of on-the-job learning. This post is about something I’ve been reading about recently, how to decide if a simple statistical model is sufficient or if the data demands a more complicated one. To keep the matter concrete (and controversial) I’ll focus on a claim from a recent paper in Nature that my colleague, Haidong Wang, choose for our IHME journal club last week: Advances in development reverse fertility declines. The title of this short letter boldly claims a causal link between total fertility rate (an instantaneous measure of how many babies a population is making) and the human development index (a composite measure of how “developed” a country is, on a scale of 0 to 1). Exhibit A in their case is the following figure:

An astute observer of this chart might ask, “what’s up with the scales on those axes?” But this post is not about the visual display of quantitative information. It is about deciding if the data has a piecewise linear relationship that Myrskyla et al claim, and doing it in a Bayesian framework with Python and PyMC. But let’s start with a figure where the axes have a familiar linear scale!

I was able to produce this alternative plot without working much because the authors and the journal made their dataset easily available on the web as a csv file. Thanks, all! In fact, they have provided more data than 1975 and 2005. Here is everything from the years in between as well:

Since we discussed this in journal club, I can now repeat my fellow post-docs’ observations and sound smart. In fact, I already did with that bit about the axes being on a non-linear (and non-logarithmic) scale. The letter claims that there is a breakpoint around human development index .86, and the data is piecewise linear with decreasing slope below this value, and increasing above. The next observation that is not mine originally, is that the data also looks like a quadratic would fit it pretty well.

To be slightly more formal, here are two alternative models for the association between HDI and TFR:

and

How should I decide between these two models? Well, I’m open to suggestions, but I learned about one way when I read a citation classic by UW statistician Adrian Raftery and co-author about Bayes Factors. They describe the ratio . If K is more than 3.2, this constitutes “substantial” evidence that model 2 is superior.

Since I’ve titled this post in the form of a tutorial, I’m now going to go through calculating the Bayes factor with MCMC in Python, which turns out to be a slightly challenging computation (but easy to code; thanks, PyMC!).

To set up the models in the fully Bayesian way, we need some priors, which I’ve made up below in a convenient, reusable form:

class linear: def __init__(self, X, Y, order=1, mu_beta=None, sigma_beta=1., mu_sigma=1.): if mu_beta == None: mu_beta = zeros(order+1) self.beta = Normal('beta', mu_beta, sigma_beta**-2) self.sigma = Gamma('standard error', 1., 1./mu_sigma) @potential def data_potential(beta=self.beta, sigma=self.sigma, X=X, Y=Y): mu = self.predict(beta, X) return normal_like(Y, mu, 1 / sigma**2) self.data_potential = data_potential def predict(self, beta, X): return polyval(beta, X) def logp(self): logp = [] for beta_val, sigma_val in zip(self.beta.trace(), self.sigma.trace()): self.beta.value = beta_val self.sigma.value = sigma_val logp.append(Model(self).logp) return array(logp)

and

class piecewise_linear: def __init__(self, X, Y, breakpoint=.86, mu_beta=[0., 0., 0.], sigma_beta=1., mu_sigma=1.): self.breakpoint=breakpoint self.beta = Normal('beta', mu_beta, sigma_beta**-2) self.sigma = Gamma('standard error', 1., 1./mu_sigma) @potential def data_potential(beta=self.beta, sigma=self.sigma, X=X, Y=Y, breakpoint=self.breakpoint): mu = self.predict(beta, breakpoint, X) return normal_like(Y, mu, 1 / sigma**2) self.data_potential = data_potential def predict(self, beta, breakpoint, X): very_high_dev_indicator = X &gt;= breakpoint mu = (beta[0] + beta[1]*(X-breakpoint)) * (1 - very_high_dev_indicator) mu += (beta[0] + beta[2]*(X-breakpoint)) * very_high_dev_indicator return mu def logp(self, beta_val=None, sigma_val=None, breakpoint_val=None): for beta_val, sigma_val in zip(self.beta.trace(), self.sigma.trace()): self.beta.value = beta_val self.sigma.value = sigma_val logp.append(Model(self).logp) return array(logp)

Then to try to calculate the Bayes factor, I can draw samples from the posterior distribution of each model, and look at the harmonic mean of their posterior liklihoods.

def bayes_factor(m1, m2, iter=1e6, burn=25000, thin=10, verbose=0): MCMC(m1).sample(iter, burn, thin, verbose=verbose) logp1 = m1.logp() MCMC(m2).sample(iter, burn, thin, verbose=verbose) logp2 = m2.logp() mu_logp = mean(logp2) K = exp(pymc.flib.logsum(-logp1) - log(len(logp1)) - (pymc.flib.logsum(-logp2) - log(len(logp2)))) return K

Unfortunately, it seems to take a prohibitively large number of samples to get the same average twice.

To make it all run, I’ve made a little module to load the data, but I won’t bore you with the details; it’s online here if you want to play around with it yourself.

>>> import data >>> m1=models.linear(X=data.hdi, Y=data.tfr, order=2) >>> m2=models.piecewise_linear(X=data.hdi, Y=data.tfr, mu_beta=[1,-10,1]) >>> model_selection.bayes_factor(m1, m2, iter=1e7)

According to the Bayes factor, the piecewise linear model is (to be filled in soon) better than the quadratic model. Or, more quantitatively, the observed data is (tba) times more likely under model 2 than model 1. Cool!

As a side-effect, this yields an alternative way to ask if the association on the “high development” piece of the piecewise model is really positive:

>>> m2 = models.piecewise_linear(X=data.hdi, Y=data.tfr, breakpoint=.86, mu_beta=[1,-8,1], sigma_beta=1., mu_sigma=.1) >>> MCMC(m2).sample(iter=1000*1000+20000, thin=1000, burn=20000, verbose=1) >>> m2.beta.stats()['95% HPD interval'] array([[ 1.91520875, 2.01669476], [-9.84036014, -9.48869047], [-3.85410676, -1.23869569]]) >>> m2.beta.stats()['mean'] array([ 1.96932651, -9.66349812, -2.5683748 ])

The research questions for the computer scientist in me are: did I draw enough samples to get a correct answer? and did I really need to draw that many?

The tentative answers are no and no! See the comments for leads on more efficient schemes.