In 2010, I wrote:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.” At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

Randomized controlled trials (RCTs) have well-known problems with realism or validity (a problem that researchers try to fix using field experiments, but it’s not always possible to have a realistic field experiment either), and cost/ethics/feasibility (which pushes researchers toward smaller experiments in more artificial settings, which in turn can lead to statistical problems).

Beyond these, there is the indirect problem that RCTs are often overrated—researchers prize the internal validity of the RCT so much that they forget about problems of external validity and problems with statistical inference. We see that all the time: randomization doesn’t protect you from the garden of forking paths, but researchers, reviewers, publicists and journalists often act as if it does. I still remember a talk by a prominent economist several years ago who was using a crude estimation strategy—but, it was an RCT, so the economist expressed zero interest in using pre-test measures or any other approaches to variance reduction. There was a lack of understanding that there’s more to inference than unbiasedness.
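To make the variance-reduction point concrete, here is a minimal simulation sketch. It is not taken from the talk or from any particular study. It assumes a hypothetical experiment in which each unit has a pre-test score, exactly half of the units are randomized to treatment, and the post-test equals the pre-test plus a constant treatment effect plus noise. The function names, the sample size, the effect size of 0.5, and the noise level are all illustrative assumptions. Under that assumed model, the simple difference in means and the pre-test-adjusted estimate should both center near the true effect, but the adjusted estimate should vary less from one simulated experiment to the next.

```python
import random
import statistics


def simulate_once(rng, n=100, effect=0.5, noise_sd=0.5):
    """One hypothetical RCT; returns (unadjusted, adjusted) effect estimates."""
    # Pre-test scores, measured before treatment is assigned.
    pre = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Randomize exactly half of the units to treatment.
    treated = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(treated)
    # Assumed outcome model: post = pre + effect * treatment + noise.
    post = [x + effect * z + rng.gauss(0.0, noise_sd)
            for x, z in zip(pre, treated)]

    def group_mean(values, arm):
        return statistics.fmean(v for v, z in zip(values, treated) if z == arm)

    # Unadjusted estimate: difference in mean post-test between arms.
    diff_post = group_mean(post, 1) - group_mean(post, 0)
    # Chance imbalance in the pre-test between arms.
    diff_pre = group_mean(pre, 1) - group_mean(pre, 0)

    # Pooled within-arm least-squares slope of post-test on pre-test.
    sxy = sxx = 0.0
    for arm in (0, 1):
        mx, my = group_mean(pre, arm), group_mean(post, arm)
        for x, y, z in zip(pre, post, treated):
            if z == arm:
                sxy += (x - mx) * (y - my)
                sxx += (x - mx) ** 2
    slope = sxy / sxx

    # Adjusted estimate: correct for the chance pre-test imbalance.
    return diff_post, diff_post - slope * diff_pre


def compare_estimators(n_sims=2000, seed=1):
    """Repeat the hypothetical experiment and summarize both estimators."""
    rng = random.Random(seed)
    draws = [simulate_once(rng) for _ in range(n_sims)]
    unadjusted = [d[0] for d in draws]
    adjusted = [d[1] for d in draws]
    return {
        "unadjusted_mean": statistics.fmean(unadjusted),
        "adjusted_mean": statistics.fmean(adjusted),
        "unadjusted_sd": statistics.stdev(unadjusted),
        "adjusted_sd": statistics.stdev(adjusted),
    }


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.3f}")
```

The size of the gain depends entirely on how strongly the pre-test predicts the outcome, which is built into the assumed model here; with a weakly predictive pre-test the two standard deviations would be much closer.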

From a different direction, James Heckman has criticized RCTs on the grounds that they can be, and often are, performed in a black-box manner without connection to substantive theory. And, indeed, it was black-box causal inference that was taught to me as a statistics student many years ago, and I think the fields of statistics, economics, political science, psychology, and medicine are still clouded by this idea that causal research is fundamentally unrelated to substantive theory.

In defense, proponents of randomized experiments have argued persuasively that all the problems with randomized experiments—validity, cost, etc.—arise just as much in observational studies. As Alan Gerber and Don Green put it, deciding to unilaterally disable your identification strategy does not magically connect your research to theory or remove selection bias. From this perspective, even when we are not able to perform RCTs,

Christopher Hennessy writes in with another criticism of RCTs:

In recent work, I set up parable economies illustrating that in dynamic settings measured treatment responses depart drastically and systematically from theory-implied causal effects (comparative statics) both in terms of magnitudes and signs. However, biases can be remedied and results extrapolated if randomisation advocates were to take the next step and actually estimate the underlying shock processes. That is, old-school time-series estimation is still needed if one is to make economic sense of measured treatment responses. In another line of work, I show that the econometric problems become more pernicious if the results of randomisation will inform future policy setting, as is the goal of many in Cambridge, for example. Even if an economic agent is measure zero, if he views that a randomisation is policy relevant, his behavior under observation will change since he understands the future distribution of the policy variable will also change. Essentially, if one is doing policy-relevant work, there is endogeneity bias after the fact. Or in other words, policy-relevance undermines credibility. Rather than deal with these problems formally, there has been a tendency amongst a proper subset of empiricists to stifle their impact by lumping them into an amorphous set of “issues.” I think the field will make faster progress if we were to handle these issues with the same degree of formal rigor with which the profession deals with, say, standard errors. We should not let the good be the enemy of the best. A good place to start is to write down simple dynamic economic models that actually speak to the data generating processes being exploited. Absent such a mapping, reported econometric estimates are akin to a corporation reporting the absolute value of profits without reporting the currency or the sign. What does one learn from such a report? And how can such be useful in doing cost-benefit analyses on government policies? We have a long way to go.
Premature claims of credibility only serve to delay confronting the issues formally and making progress.

Here are the abstracts of two of Hennessy’s papers:

Double-blind RCTs are viewed as the gold standard in eliminating placebo effects and identifying non-placebo physiological effects. Expectancy theory posits that subjects have better present health in response to better expected future health. We show that if subjects Bayesian update about efficacy based upon physiological responses during a single-stage RCT, expected placebo effects are generally unequal across treatment and control groups. Thus, the difference between mean health across treatment and control groups is a biased estimator of the mean non-placebo physiological effect. RCTs featuring low treatment probabilities are robust: Bias approaches zero as the treated group measure approaches zero.

Evidence from randomization is contaminated by ex post endogeneity if it is used to set policy endogenously in the future. Measured effects depend on objective functions into which experimental evidence is fed and prior beliefs over the distribution of parameters to be estimated. Endowed heterogeneous effects generates endogenous belief heterogeneity making it difficult/impossible to recover causal effects. Observer effects arise even if agents are measure zero, having no incentive to change behavior to influence outcomes.

As with the earlier criticisms, the implication is not that observational studies are OK, but rather that real-world complexity (in this case, dynamics of individual beliefs and decision making) should be included in a policy analysis, even if an RCT is part of the story. Don’t expect the (real) virtues of a randomized trial to extend to the interpretation of the results.

To put it another way, Hennessy is arguing that we should be able to think more rigorously, not just about a localized causal inference, but also about what is traditionally part of story time.