




Tuesday, January 09, 2007

This has been so absurdly trailed it is bound to be a total anticlimax



yes folks it's



The long-awaited Freakonomics review, Part 3:



Happy New Year to most of my readers. Time to crack on with the Freakonomics review, which I actually wrote last year, but wanted to hold back until I'd finished the whole thing (yes I know, touching isn't it?). Thanks to Radek for giving me the heads up on this review, which makes a number of good points. The remaining parts are tentatively entitled "Freakiology", on the subject of Levitt & Dubner's in my opinion very sketchy treatment of important issues of sociology, and "How Freaked is Economics?", on what the success of Freakonomics as a popular book, and the success of Levitt as an academic economist, say about the current state of the science of economics. This bit rounds off the statistical and methodological critique. It is entitled ...



Natural experiments ain't so Freaking Natural



This section of the Freakonomics review deals with my problems with the underlying econometric methodology of Freakonomics (or more specifically, with themes in the career of Steven Levitt which are summarised in the book). It expands on this post from a year ago, in which I got rather alarmed at Levitt's reaction to being criticised in a working paper on his abortion & crime model. I need to start with a big caveat; if you're criticising someone's statistics, it is easy to get into the realm of the unprofessional and/or defamatory. Nothing I say below should be taken as accusing Levitt or any of his coauthors of intentionally misrepresenting anything. In particular, references to "data-mining" refer to what I regard as a deformation of the general field of econometric methodology rather than to anything purposeful or specific to Levitt. So here we go.



Levitt's reputation in economics rests on micro-economic empirical work rather than theory. Empirical microeconomics is in general a rather hairy mathematical field; unlike empirical macroeconomics, where you are generally dealing with a relatively small number of well-known and consistently collected aggregate time series, microeconomic datasets tend to be idiosyncratic, problem-specific, collected in ways which might be considered to introduce bias, qualitative (ie, yes/no) rather than quantitative, and statistically hairy in all sorts of other ways.



This is why microeconometricians tend to be much stronger on nonlinear methods than macroeconometricians (although generally weaker on time series modelling) and generally more familiar with the dusty end of the STATA instruction manual. Very few macro models are particularly complicated from a mathematical standpoint; they are often big, which makes them complicated in a different way, and in general a lot of thought has to go in to the process of deciding what variables to include and exclude, but the actual statistical guts of the thing is normally a linear regression – probably one that is estimated by maximum likelihood because of the time series issues, but basically a model where the output is a linear function of the input, with parameters chosen to minimise squared error.



Microeconometric models are much gnarlier, almost always estimated by ML (or these days, as often as not, by Bayesian methods which allow for even more complicated functional forms), with structures that are highly non-linear. On the other hand, my assessment of microeconometricians is that they don't spend anything like as much time and effort on the modelling issues (as opposed to estimation issues) as the macro guys, and they might be surprised by what they found if they did.



Levitt, on the other hand, does a lot of microeconometrics, but he is not a good econometrician (as he freely admits). How does he manage it? Well, partly by having good co-authors (I will return to this issue in part 5). But partly by making very extensive use indeed of the instrumental variables approach.



As part of my "mission to explain", I should probably now explain what the IV approach is and why anyone might be interested in it. OK here goes then. Imagine you are the chancellor of Oxford University, trying to find out whether rugby players are thicker than rowers[1]. You might want to carry out a regression analysis of the form:



Alpha (% of rugby players) + Beta (% of rowers) = Average Finals mark +/- an error term



across the colleges, and have a look at the coefficients on rugby and rowing.



However, the dean of your medical school points out to you that this regression won't work. The variation across the finals marks of different colleges is also affected by the amount of beer the students drink. Rugby players drink more beer than the average student, so colleges with a high proportion of rugby players will have lower marks than average, not because rugby players are morons, but because they are also boorish drunks. Also, because you're not on speaking terms with any of the bursars, you can't get any college-level data for beer consumption. Hmmm.



Inspiration strikes. You realise that Welsh students are more likely to be rugby players than the average student, but no more likely than the average student to be a drunk. Furthermore, from a previous year's study you have data on the number of closeted gay men in each college, and it too is well-correlated with the proportion of rugby players. Bingo zingo, it turns out that the college-level data on purchases of annoying novelty hats also correlates well with the rugby players, while the readership of the Financial Times correlates strongly and negatively. So you can estimate a preliminary regression thus:



Rho (% of Welsh) + Tau (% closet gays) + Theta (novelty hats) + Mu (FT readers) = %Rugby players +/- an error term.



Call the left hand side of this equation Gamma. Gamma is a pretty good estimate of the number of rugby players, and (because Welshness, closeted gayness, novelty hat purchase and FT readership are none of them correlated with beer drinking), unlike the raw data for the number of rugby players, it isn't correlated with the error term for finals marks. You can therefore substitute Gamma for the % of rugby players in the first equation, and your estimates will now be consistent, because you've got rid of the confounding factor of beer consumption. Gamma is an "instrument" for the number of rugby players, and the version of your regression equation which substitutes Gamma for the percentage of rugby players is the "instrumental variable" estimate of the relationship between rugby, rowing and finals marks.
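(For readers who learn better from code than from rugby players: here is a minimal sketch of the two-stage procedure in Python, using numpy only. Every number and variable name is invented for the purposes of the example - none of this is Levitt's data or anybody else's.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # imagine an absurd number of colleges

# The unobserved confounder: beer consumption (no bursar will tell us this).
beer = rng.normal(size=n)
# Instruments: correlated with rugby playing but not with beer.
welsh = rng.normal(size=n)
hats = rng.normal(size=n)
# Rugby playing depends on the instruments AND on beer - the confounding.
rugby = 1.0 * welsh + 0.8 * hats + 1.0 * beer + rng.normal(size=n)
rowing = rng.normal(size=n)
# True model: rugby has NO effect on finals marks; beer lowers them.
marks = 0.0 * rugby + 0.5 * rowing - 1.0 * beer + rng.normal(size=n)

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive OLS: rugby soaks up the beer effect and looks damaging.
b_naive = ols(np.column_stack([rugby, rowing, np.ones(n)]), marks)

# Stage 1: regress rugby on the instruments; the fitted values are Gamma.
Z = np.column_stack([welsh, hats, np.ones(n)])
gamma = Z @ ols(Z, rugby)

# Stage 2: substitute Gamma for the raw rugby numbers.
b_iv = ols(np.column_stack([gamma, rowing, np.ones(n)]), marks)

print("naive rugby coefficient:", round(b_naive[0], 2))  # spuriously negative
print("IV rugby coefficient:   ", round(b_iv[0], 2))     # close to the true zero
```

The naive regression blames the rugby players for the beer's effect on marks; the instrumented version recovers the truth, because Gamma contains only the part of rugby playing that has nothing to do with beer.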



That's IV estimation[2]. Levitt does a hell of a lot of it. As long as the left hand variables of your preliminary regression aren't themselves correlated with the error term in the original equation (in other words, as long as Welshness isn't itself correlated with drunkenness), and as long as the fit of the equation estimating Gamma is reasonably good, it will be OK. (You can actually make do even with a really bad fit in the Gamma equation if you have loads and loads of data[3], but usually you don't.) It's a good method of estimating these models, so why doesn't everybody do it?



Well, in the real world (a place I have often visited), you aren't allowed to randomly pluck series out of the air and say that they are strongly correlated with the variable you want them to be instruments for. If it turns out that they are weakly correlated, then you are in hell. The reason for this is that, although we said that Gamma was "uncorrelated" with the error term in the finals marks equation, in any real (finite) dataset the measured correlation is likely to be a small number close to zero rather than the actual number zero. This matters like hell because:



1) the bias introduced by this small empirical correlation gets "scaled up" by the reciprocal of the covariance between the instrument and its target variable (this is now technical as hell, but here's the best discussion I can find). The idea here is that you are trying to explain the (signal + noise) in finals marks with (signal + noise) in the instrument. If there is only a small correlation between the two signals, then there had better be no correlation at all between the two noise terms, or you are just fitting noise to noise and your overall signal/noise ratio will go through the floor.



2) in finite samples, the sampling distribution of the IV estimate is the ratio of two normally distributed variables. The ratio of two normals is a surprisingly complicated distribution; basically, if it is important to you to estimate something which is the ratio of two normals, then you had better hope that the correlation between them is pretty high because as it goes to zero the ratio of two normals becomes a Cauchy distribution, which is in statistical terms "a really awkward bastard to deal with"[4].
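(If you don't believe me about the awkwardness, here's a quick numerical illustration - no real data involved, just a demonstration that the ratio of two independent standard normals has Cauchy tails.)

```python
import numpy as np

rng = np.random.default_rng(0)

# The ratio of two independent standard normals is a standard Cauchy variate.
ratio = rng.normal(size=100_000) / rng.normal(size=100_000)
normal = rng.normal(size=100_000)

# The tails are the giveaway: for a standard Cauchy, P(|X| > 10) is about
# 6% (it's 1 - (2/pi)*arctan(10)); for a standard normal it is effectively zero.
print("share of |ratio| > 10: ", np.mean(np.abs(ratio) > 10))
print("share of |normal| > 10:", np.mean(np.abs(normal) > 10))

# And a Cauchy has no mean, so the sample average is dominated by rare,
# enormous draws and never settles down the way a normal sample's does.
print("largest |ratio| draw:", np.abs(ratio).max())
```

Which is the practical reason why normal-approximation standard errors and t-tests can't be trusted when your estimator is effectively a ratio of this kind.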



Do you see why this took so bloody long to write, by the way? So the take-away here is that weak instruments in IV estimation are really bad news, much much worse than poorly correlated regressors in normal regression analysis. It can actually be better from a mean-squared error point of view to just ignore the bias and do the ordinary regression, if the only instruments you can find are weak. This is important.
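(Again for the code-minded, a little Monte Carlo sketch of both points, with every parameter invented for illustration. The instrument carries a small (0.05) correlation with the error term; when the instrument is strong this barely matters, and when it is weak it gets scaled up and the IV estimate's mean squared error blows past plain OLS.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, true_beta = 200, 500, 1.0

def simulate(strength):
    """Monte Carlo: IV vs OLS when the instrument z has a small (0.05)
    correlation with the error term. `strength` is how well z predicts x.
    Returns (mean IV estimate, IV mean squared error, OLS MSE)."""
    ols_est, iv_est = [], []
    for _ in range(reps):
        z = rng.normal(size=n)
        u = rng.normal(size=n)                # unobserved confounder
        e = 0.05 * z + rng.normal(size=n)     # the small empirical correlation
        x = strength * z + u + rng.normal(size=n)
        y = true_beta * x + u + e
        ols_est.append((x @ y) / (x @ x))     # biased by the confounder u
        iv_est.append((z @ y) / (z @ x))      # textbook IV: (z'y)/(z'x)
    iv_est, ols_est = np.array(iv_est), np.array(ols_est)
    return (iv_est.mean(),
            ((iv_est - true_beta) ** 2).mean(),
            ((ols_est - true_beta) ** 2).mean())

strong = simulate(1.0)   # bias of roughly 0.05 / 1.0 sneaks in: tolerable
weak = simulate(0.1)     # same 0.05 correlation, scaled up by 1 / 0.1

print("strong instrument: IV MSE %.3f, OLS MSE %.3f" % strong[1:])
print("weak instrument:   IV MSE %.3f, OLS MSE %.3f" % weak[1:])
```

With the strong instrument, IV handily beats the confounded OLS regression; with the weak one, the near-zero denominator makes the IV estimate so wild that the biased OLS estimate is the better bet on mean squared error.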



In general, in those of Levitt's published papers that I've read, there really is not very much discussion of the strength of the instruments. There is also a hell of a tendency to say that "there is no reason to believe that this is correlated", with a bit of a lacuna where the bit ought to be in which you actually check that it is uncorrelated, or have a look at how any small correlation might get inflated by a weakish instrument. What was that Malcolm Gladwell quote again?



Steve Levitt has the most interesting mind in America, and reading Freakonomics is like going for a leisurely walk with him on a sunny summer day as he waves his fingers in the air and turns everything you once thought to be true inside out.



Yup, always with the waving of the fingers. We should have got suspicious the moment that anyone told us that econometrics could be fun. In fact, as with all statistical work, the ratio of inspiration and creativity to meaningless grind is so low that it is scarcely possible to reject the null hypothesis of no fun at all. In fairness, a lot of the work that made Levitt famous predates a lot of the weak instruments literature - it is only comparatively recently that it has even become standard practice to report the results of the first-stage regressions so that everyone can make their own mind up about the strength of the instrument. And Levitt is actually quite good by the standards of econometricians when it comes to doing crosschecks and similar non-data-driven tests of whether the model is working or not, which is really the only "solution" to a weak instruments problem at present (people keep working on statistical refinements and there are a few goodish rules of thumb, but basically weak instruments is an unsolved problem of estimation theory). So it's not that this is an awful thing about Levitt; the point I want to make here is that the idea that creativity and flair can substitute for the hard yards in econometrics sounds like a free lunch and it probably is.



On the other hand, however, in most cases, it looks to me as if Levitt is using quite strong instruments (the seminal abortion 'n' crime paper is an exception though; without doing the work, it looks to me as if the crack epidemic in the data takes nearly all the strength out of the instrument Levitt & Donohue were using). I didn't really want to write this piece about weakness of instruments. The real critique I have is based on the way the instruments get found.



Levitt is a hell of a one for "natural experiments". A "natural experiment" is a subspecies of instrumental variable estimation, taking advantage of some natural or otherwise exogenous variation to create a situation where some units get assigned to a treatment group and some get assigned to a control group, by chance rather than design. The stereotypical textbook example is one where you want to investigate whether military service creates human capital (whether ex-soldiers do better in civilian life than non ex-soldiers), but you think that there might be some unobserved characteristics (like self-discipline or bravery) which affect both the decision to sign up, and later success. So what you do is take the cohort of men born in the early 1950s, and use their draft number as an instrument.



Natural experiments are another of those "why doesn't everyone do econometrics this way?!?!?" areas. The answer is twofold. The first part of the answer is that natural experiments are really quite hard to find. Things like the Vietnam draft don't come along with anything like the frequency at which problems arrive that look like they'd be amenable to this sort of approach.



Levitt's big thing, the one that won him the John Bates Clark and the fawning adulation of millions of groupies (well, of Steven Dubner), is being really creative and unconventional in the selection of quirky things which can be used as natural experiments. This is the whole selling point of Freakonomics - it's all about this sort of lateral thinking and "making you see the world in a whole new light".



Which brings me to the second part of the answer which is, unfortunately, that since the success of Freakonomics, every bugger does use natural experiments, all the time. Levitt's book is Edward de Bono for the green eyeshades set. I am wholly suspicious of this outpouring of creativity on the part of economists, rather as I would suspect and fear a sudden outbreak of interest in stochastic calculus among teachers of modern dance.



The problem is that the upsurge in economists finding natural experiments is not a result of there being more natural experiments to find, but a result of economists deeming more things to be acceptable natural experiments. This is worrisome, from a statistical point of view, and it is here that the discussion of "data mining" shall begin, so I redirect your attention to the disclaimer above in which I make it clear that I use the phrase in a sense which is pejorative from a methodological point of view but not a personal one.



The trouble is that there are two ways in which you can go about discovering a natural experiment if you don't have an obvious one to hand. You can either be more assiduous in searching for them, or you can lower your standards as to what constitutes a decent natural test of your thesis. Of these two, oddly enough, I regard the second as much less potentially harmful. It just gives us a social phenomenon whereby not a robin can fall without some lazy graduate student or junior faculty member using its passing as a "natural experiment" on the market for bird seed. It tends to mean that crap papers proliferate in the journals (in general, purporting to prove propositions that nobody was ever disposed to doubt, using econometric techniques so bad as to make you doubt it after all), but it is hard to get worked up about this on opportunity cost grounds, as the authors of these papers would be churning out crap of some kind or another anyway.



The first phenomenon, however, is more subtly pernicious. Choosing natural experiments is a form of data-mining. Since all sorts of things are happening all the time, if you are prepared to get really creative about it, and prepared to put up with weakish instruments in an IV estimate, you are often able to find all sorts of natural experiments for propositions of interest if you look hard enough. Specifically, you will as likely as not be able to find one which gives you the result you are looking for.



I direct readers now to my discussion of data mining and stepwise regression from a couple of years ago. And to this stupid joke from roughly the same period, in order to point out that the decline in quality of this blog since then is largely illusory. The point I want to make is that the natural experiment version of data-mining causes just the same problems as stepwise regression.



Recall that in the case of stepwise regression, it became impossible to interpret the normal tests of statistical significance, because the critical values of the test statistics assumed that the underlying process was a random one. And the process which generated the test statistics wasn't a random one, because it had been specifically set up to iterate through combinations of regressors until a model was found with the "right" result.



I think something exactly similar could be at work in the natural experiments literature. We just don't know how many potential "natural experiments" were looked at and didn't work out, and why. In many ways, we're even worse off than we were in the stepwise regression case, because there is at least a sensible mathematical way of getting an idea of the size and shape of the space of possible regression models that a data-miner has iterated over, and constructing an algorithm like PcGets in order to do so in as sensible a manner as possible. There is no such objective way of dealing with the potential space of natural experiments.
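(The problem is easy to demonstrate with a toy simulation - all numbers invented. Generate data where nothing is related to anything, rummage through fifty candidate "experiments" per study, and keep the most significant-looking one; "significance at the 5% level" turns up in the overwhelming majority of studies.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_candidates, searches = 100, 50, 200

hits = 0
for _ in range(searches):
    y = rng.normal(size=n)  # the outcome: pure noise, nothing to find
    best_t = 0.0
    for _ in range(n_candidates):
        z = rng.normal(size=n)  # a candidate "natural experiment": also noise
        r = np.corrcoef(z, y)[0, 1]
        t = abs(r) * np.sqrt((n - 2) / (1 - r ** 2))  # t-stat for a correlation
        best_t = max(best_t, t)
    if best_t > 1.98:  # roughly the 5% two-sided critical value for n = 100
        hits += 1

print("fraction of searches finding a 'significant' result:", hits / searches)
# theoretically about 1 - 0.95**50, ie north of 90%, not the advertised 5%
```

The individual test is perfectly honest; it's the unreported search over fifty candidates that destroys the meaning of the significance level.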



I note here that, as with the stepwise case, the point has to be made that simply trusting in the honesty of our econometricians isn't going to do any good. As I pointed out back then, the double blind criterion is not used in medical tests in order to protect us from dishonest experimenters. It's there to protect us from unconscious bias, wishful thinking and the temptation to find rationalisations for a course of action that is most congenial. And as far as I can see, there is simply no way to introduce any equivalent of the double blind into this form of econometrics.



Another way of describing this problem is to notice that the business of coming up with a natural experiment to test some hypothesis or other is basically the same thing as looking for a piquant anecdote to illustrate a point. It's the same sort of thing that Gladwell or Friedman do, without the statistical manipulation. And to be honest, the econometric toolkit does not actually add anything much at all to the evidentiary value of a natural experiment - all the persuasive power is in the selection of the "experiment" itself. I think that this is both a bad thing about the natural experiment literature and a good thing about anecdotal evidence and case studies (which are, at the end of the day, often a good way of backing up a hypothesis about the world). There is nothing wrong with what Gladwell does, but it is a mistake to think that one is adding anything by taking the semi-attached anecdote and turning it into a regression. Or to put it another way, the plural of "anecdote" is not "data" - it's "Freakonomics".



Postscript I think it makes sense to repeat a third time my disclaimer above that I am specifically not accusing Levitt of sharp statistical practice. My dislike of the natural experiment methodology is general, and while Levitt is the poster boy for its renaissance in American economics, I think he is simply the expression of much wider trends, which result from much deeper pathologies of the subject, which I'll be dealing with in Part 5. Note in particular that I've poured a lot of scorn in the past on the "Devastating Critique" school of statistical rhetoric as exemplified by Steve Milloy, where one takes an utterly standard limitation of some methodology or other (canonically, a suggestion for further research made in the paper itself) and inflates it into a "Devastating Critique" of the methodology itself. I haven't changed my mind about "Devastating Critiques" and didn't intend to deliver one here myself. A fair old amount of Levitt's work does not use natural experiments, and not all of his natural experiment work is necessarily data-mined. But a lot of the key claims made in Freakonomics look to me to be based on "Just So" stories where opposing "Just So" stories could easily be told (Ariel Rubinstein's review picks out a lot of them), and Dubner does not seem to realise how impressive and definitive it isn't that Levitt has converted his "Just So" story into a model.



Parts 4 and 5 to come some time between next week and the heat death of the universe.



[1] Strictly speaking this regression wouldn't answer that specific question but cut me some bloody slack here will you.

[2] Specifically it's "Two-stage Least Squares". IV is a bit more general than this; it is also possible to do it through the Generalised Method of Moments and Limited Information Maximum Likelihood, which I am fucked if I'm going to explain because I barely understand them myself. However, Levitt often uses 2SLS, and I think most of my comments here apply to GMM and LIML estimation too.

[3] Everyone says this but it isn't really true. Asymptotically, the IV estimator is unbiased (strictly speaking, consistent), which is what I mean here. But weak instrument bias can be a problem in finite sample estimation even with huge numbers of data points - famously, Bound, Jaeger and Baker found it in a study with 329,000 data points.

[4] Benoit Mandelbrot advocates the widespread use of the Cauchy distribution for capturing the uncertainty of a wide variety of modelling situations. Oddly enough, many past colleagues describe Benoit Mandelbrot as "a really awkward bastard to deal with", which perhaps tells us something about self-similarity in general.

this item posted by the management 1/09/2007 03:50:00 PM

