‘Metrics Monday: 2SLS–Chronicle of a Death Foretold?

Last week I discussed how it is generally not possible to compare 2SLS estimates with OLS estimates because the two estimates apply to different groups of observations. Given that, it makes sense that I should write this week about a new working paper by Alwyn Young that has been making the rounds these past few months.

The paper is titled “Consistency without Inference: Instrumental Variables in Practical Application.” In it, Young uses the bootstrap to conduct a meta-analysis of 1,400 2SLS coefficients across 32 papers published in the AEA journals, and to essentially ask: “Is 2SLS all that it is cracked up to be?”

Before anything, I should note that this article is not about bias. Rather, it is about inference. Specifically, Young assumes that all the IVs included in his study are exogenous. What he is concerned with is whether they are as strong as they should be, and with whether the inferences that can be drawn from 2SLS results are valid.

For a while now, I have been thinking that with the Credibility Revolution having brought the focus of applied micro back to getting causal (unbiased) estimates, the next logical step–the Second Credibility Revolution,* so to speak–should be for the literature to focus on getting the standard errors right. Young’s paper–along with the Abadie et al. (2017) paper on clustering I discussed a few weeks ago–is a step in that direction.

There is a lot going on in Young’s paper, and I cannot possibly do justice to all of his results in a 500-word, or rather 1,000-word, blog post.

To begin with, when I say Young uses the bootstrap, I mean that he uses two versions of it (he refers to them as bootstrap-c and bootstrap-t) to study the distribution of test statistics for the 1,400 2SLS coefficients across the 32 papers in his sample, with the goal of assessing the quality of the resulting inference. He generally compares those two bootstrap versions with (i) what the authors have done, (ii) clustering the standard errors or using robust versions thereof, and (iii) the default, which is to do nothing.
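For readers who want to see the mechanics, here is a minimal sketch of what two such bootstrap tests can look like for a single just-identified 2SLS coefficient. It is an illustration under stated assumptions, not Young's code: it assumes bootstrap-c compares the bootstrapped coefficient deviations with the estimate itself, that bootstrap-t does the same with t-statistics, that observations (not clusters) are resampled, and that the null is a zero coefficient. The function names and the simulated data are hypothetical.

```python
import numpy as np

def tsls(y, x, z):
    """Just-identified 2SLS with an intercept.

    Returns the slope on x and its heteroskedasticity-robust standard error.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    Z = np.column_stack([np.ones(n), z])
    ZX_inv = np.linalg.inv(Z.T @ X)
    beta = ZX_inv @ (Z.T @ y)              # (Z'X)^{-1} Z'y
    u = y - X @ beta                       # structural residuals
    meat = (Z * u[:, None] ** 2).T @ Z     # sum of u_i^2 z_i z_i'
    V = ZX_inv @ meat @ ZX_inv.T
    return beta[1], np.sqrt(V[1, 1])

def bootstrap_pvalues(y, x, z, reps=499, seed=0):
    """Pairs bootstrap p-values for H0: slope = 0 (assumed definitions)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    b_hat, se_hat = tsls(y, x, z)
    dev = np.empty(reps)
    tstat = np.empty(reps)
    for r in range(reps):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        b, se = tsls(y[idx], x[idx], z[idx])
        dev[r] = b - b_hat                 # centered coefficient deviation
        tstat[r] = (b - b_hat) / se        # centered t-statistic
    # "bootstrap-c": how often is the deviation as large as the estimate?
    p_c = np.mean(np.abs(dev) >= abs(b_hat))
    # "bootstrap-t": how often is the t-stat as large as the sample t-stat?
    p_t = np.mean(np.abs(tstat) >= abs(b_hat / se_hat))
    return p_c, p_t

# Simulated data (hypothetical numbers): true slope 1.0, strong instrument,
# and an endogenous regressor because v enters both equations.
rng = np.random.default_rng(42)
n = 500
z = rng.normal(size=n)
v = rng.normal(size=n)
x = 1.0 * z + v
y = 1.0 * x + 0.5 * v + rng.normal(size=n)

b_hat, se_hat = tsls(y, x, z)
p_c, p_t = bootstrap_pvalues(y, x, z)
print(f"2SLS slope = {b_hat:.3f}, robust s.e. = {se_hat:.3f}")
print(f"bootstrap-c p = {p_c:.3f}, bootstrap-t p = {p_t:.3f}")
```

With clustered data, one would resample whole clusters instead of individual rows; the rest of the logic stays the same.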

As it turns out, Young finds that:

1. Conventional tests tend to overreject the null hypothesis that the 2SLS coefficient is equal to zero. 2SLS estimates are falsely declared significant one third to one half of the time, depending on the method used for bootstrapping.
2. The 99-percent confidence intervals (CIs) of those 2SLS estimates include the OLS point estimate over 90 percent of the time. They include the full OLS 99-percent CI over 75 percent of the time.
3. 2SLS estimates are extremely sensitive to outliers. When just one outlying cluster or observation is removed, almost half of 2SLS results become insignificant. Things get worse when removing two outlying clusters or observations, as over 60 percent of 2SLS results then become insignificant.
4. Using a Durbin-Wu-Hausman test, fewer than 15 percent of regressions can reject the null that OLS estimates are unbiased at the 1-percent level.
5. 2SLS has considerably higher mean squared error than OLS.
6. In one third to one half of published results, the null that the IVs are totally irrelevant cannot be rejected, and so the correlation between the endogenous variable(s) and the IVs is due to finite sample correlation between them.
7. Finally, fewer than 10 percent of 2SLS estimates reject instrument irrelevance and the absence of OLS bias at the 1-percent level using a Durbin-Wu-Hausman test. It gets much worse–fewer than 5 percent–if you add in the requirement that the 2SLS CI exclude the OLS estimate.
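The sensitivity to outliers mentioned above is easy to check in one's own work. The following is a rough sketch of a delete-one exercise, not a reproduction of Young's procedure: it assumes a single instrument, drops one observation at a time (not one cluster), and uses simulated data with heavy-tailed errors. All names and numbers are hypothetical.

```python
import numpy as np

def tsls(y, x, z):
    """Just-identified 2SLS with an intercept: slope and robust s.e."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    Z = np.column_stack([np.ones(n), z])
    ZX_inv = np.linalg.inv(Z.T @ X)
    beta = ZX_inv @ (Z.T @ y)
    u = y - X @ beta
    V = ZX_inv @ ((Z * u[:, None] ** 2).T @ Z) @ ZX_inv.T
    return beta[1], np.sqrt(V[1, 1])

# Simulated data (hypothetical): modest first stage, heavy-tailed errors.
rng = np.random.default_rng(1)
n = 150
z = rng.normal(size=n)
v = rng.normal(size=n)
x = 0.3 * z + v
y = 0.5 * x + 0.6 * v + rng.standard_t(df=3, size=n)

b_full, se_full = tsls(y, x, z)
t_full = b_full / se_full

# Re-estimate n times, each time leaving out exactly one observation.
keep = np.ones(n, dtype=bool)
t_drop = np.empty(n)
for i in range(n):
    keep[i] = False
    b, se = tsls(y[keep], x[keep], z[keep])
    t_drop[i] = b / se
    keep[i] = True

# The deletion that leaves the smallest |t| is the most damaging one.
worst = int(np.argmin(np.abs(t_drop)))
print(f"full-sample t = {t_full:.2f}")
print(f"smallest |t| after dropping one observation = "
      f"{abs(t_drop[worst]):.2f} (observation {worst})")
```

If the full-sample t-statistic clears a significance threshold but the smallest delete-one t-statistic does not, the result hinges on a single observation.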

Young further discusses weak instrument tests and the weaknesses of common methods used in dealing with weak IVs.

There are several goodies along the way. For instance, the discussion of Table VI on p. 26 highlights the importance of always showing both 2SLS and OLS results. Young even goes as far as providing some guidance for the practice of econometrics when he writes (emphasis added):

With these basic prerequisites for credibility in place, one might then ask whether 2SLS estimates rule out the OLS results, i.e. accepting that, taking into full account their covariance, the OLS and 2SLS population moments are different, one might still want to know if the OLS estimates are unlikely to be true. The weak form of this demand might be that the 2SLS confidence interval does not encompass the entirety of the OLS confidence interval, while the strong form might be that it does not contain the actual OLS point estimate.
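The weak and strong forms described in that passage boil down to two interval comparisons. Here is a small sketch of that logic, with made-up intervals. The function and class names are hypothetical, and how the CIs are computed (conventional or bootstrapped) is left to the reader.

```python
from typing import NamedTuple, Tuple

class Interval(NamedTuple):
    lo: float
    hi: float

def contains_point(ci: Interval, value: float) -> bool:
    return ci.lo <= value <= ci.hi

def contains_interval(outer: Interval, inner: Interval) -> bool:
    return outer.lo <= inner.lo and inner.hi <= outer.hi

def rules_out_ols(tsls_ci: Interval, ols_ci: Interval,
                  ols_point: float) -> Tuple[bool, bool]:
    """Return (weak, strong).

    weak:   the 2SLS CI does not encompass the entire OLS CI.
    strong: the 2SLS CI does not contain the OLS point estimate.
    """
    weak = not contains_interval(tsls_ci, ols_ci)
    strong = not contains_point(tsls_ci, ols_point)
    return weak, strong

# Made-up numbers for illustration only.
ols_ci = Interval(0.25, 0.55)
ols_point = 0.40

wide = rules_out_ols(Interval(-0.20, 1.40), ols_ci, ols_point)
partial = rules_out_ols(Interval(0.30, 1.50), ols_ci, ols_point)
tight = rules_out_ols(Interval(0.60, 1.90), ols_ci, ols_point)

print("wide 2SLS CI:   ", wide)     # fails both forms
print("partial overlap:", partial)  # passes the weak form only
print("disjoint CI:    ", tight)    # passes both forms
```

As long as the OLS point estimate lies inside the OLS CI, passing the strong form implies passing the weak form, which is why the three cases above cover the possible outcomes.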

To come full circle, Young finally concludes by implying the need for a Second Credibility Revolution when he writes that “[t]he care devoted to research design deserves … an equally careful and complementary inference design, one that combines the information in 2SLS and OLS using practical measures of their strengths and weaknesses.”

As regards the title of this post, the question mark at the end signals that I don’t think applied econometricians will stop using 2SLS. I do think, however, that the religious reverence in which 2SLS results using plausibly exogenous IVs are held might weaken in the near future given the inference issues highlighted by Young.

___

* I don’t have my copy of Mostly Harmless Econometrics at hand, given that I tend to write these posts at home on weekends and my book is in the office, but I recall that in their conclusion, Angrist and Pischke cop to the fact that their book is about getting rid of bias, and that using the methods they outline, your standard errors might not be ideal, but at least you’ll have done your best to eliminate bias.