I joke with my graduate students that they need to pick up as many technical skills as possible as PhD students, because the moment they graduate it’s a slow decline into obsolescence. And of course by “joke” I mean “cry on the inside because it’s true”.

Take experiments. Every year the technical bar gets raised. Some days my field feels like an arms race to make each experiment more thorough and technically impressive, with more and more attention to formal theories, structural models, pre-analysis plans, and (most recently) multiple hypothesis testing. The list goes on. In part we push because we want to do better work. Plus, how else to get published in the best places and earn the respect of your peers?

It seems to me that all of this is pushing social scientists to produce better quality experiments and more accurate answers. But it’s also raising the size and cost and time of any one experiment.

This should lead to fewer, better experiments. Good, right? I’m not sure. Fewer studies is a problem if you think that the generalizability of any one experiment is very small. What you want is many experiments in many places and with many populations, which together help triangulate an answer.

The funny thing is, after all that pickiness about getting the perfect causal result, we then apply it in the most unscientific way possible. One example is deworming. It’s only a slight exaggeration to say that one randomized trial on the shores of Lake Victoria in Kenya led some of the best development economists to argue we need to deworm the world. I make the same mistake all the time.

We are not exceptional. All of us—all humans—generalize from small samples of salient personal experiences. Social scientists do it with one or two papers. Usually ones they wrote themselves.

The latest thing that got me thinking in this vein is an amazing new paper by Alwyn Young. The brave masochist spent three years re-analyzing more than 50 experiments published in several major economics journals, and argues that more than half the regressions that claim statistically significant results don’t actually have them.

My first reaction was “This is amazingly cool and important.” My second reaction was “We are doomed.”

Here’s the abstract:

I follow R.A. Fisher’s The Design of Experiments, using randomization statistical inference to test the null hypothesis of no treatment effect in a comprehensive sample of 2003 regressions in 53 experimental papers drawn from the journals of the American Economic Association. Randomization tests reduce the number of regression specifications with statistically significant treatment effects by 30 to 40 percent. An omnibus randomization test of overall experimental significance that incorporates all of the regressions in each paper finds that only 25 to 50 percent of experimental papers, depending upon the significance level and test, are able to reject the null of no treatment effect whatsoever. Bootstrap methods support and confirm these results.

The basic story is this. First, papers often look at more than one treatment and many outcomes. There are so many tests that some are bound to look statistically significant. What’s more, when you see a significant effect of a treatment on one outcome (like earnings), you are more likely to see a significant effect on a related outcome (like consumption), and if you treat these like independent tests you overstate the significance of the results.
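To put a rough number on the "some are bound to look significant" point, here is a minimal sketch. It assumes every test is independent and every null hypothesis is true, which is a simplification: as noted above, real outcomes are correlated, so the actual rate will differ. The function names are mine, for illustration only, and the Bonferroni cutoff is just one standard (and conservative) correction, not necessarily the one any given paper uses.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests tests.

    Assumes the tests are independent and every null is true.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test cutoff that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


# How quickly the chance of a spurious "finding" grows with the number of tests.
for m in (1, 5, 20, 40):
    print(m, round(familywise_error_rate(m), 3), bonferroni_threshold(m))
```

Under these assumptions, a paper running a few dozen tests is more likely than not to turn up at least one "significant" result even when the treatment does nothing.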

Second, the ordinary statistics most people use to estimate treatment effects are biased in favor of finding a result. When we cluster standard errors or make other corrections, we rely on assumptions that simply don’t apply to experimental samples.

One way to deal with this is something called Randomization Inference. You take your sample, with its actual outcomes. You engage in a thought experiment, where you randomly assign treatment thousands of times, and generate a treatment effect each time. Most of these imaginary randomizations will generate no significant treatment effect. Some will. You then look at your actual treatment effects, compare them to the distribution of potential treatment effects, and ask “what are the chances I would get these treatment effects by chance?”
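The thought experiment above can be sketched in a few lines. This is a minimal illustration, not Young's procedure: it assumes a single outcome, a simple difference in means as the test statistic, and complete randomization with a fixed number of treated units (no clusters or strata). The function name and parameters are hypothetical.

```python
import numpy as np


def randomization_p_value(outcomes, treated, n_draws=5000, seed=0):
    """Two-sided randomization-inference p-value for a difference in means.

    Tests the sharp null that treatment affects no unit's outcome.
    """
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    treated = np.asarray(treated, dtype=bool)

    def diff_in_means(assignment):
        return outcomes[assignment].mean() - outcomes[~assignment].mean()

    observed = diff_in_means(treated)

    # Re-assign treatment at random many times, holding the outcomes and
    # the number of treated units fixed, and recompute the "effect".
    placebo = np.array(
        [diff_in_means(rng.permutation(treated)) for _ in range(n_draws)]
    )

    # Share of placebo effects at least as large as the observed one.
    # The +1 counts the actual assignment as one of the possible draws.
    extreme = np.sum(np.abs(placebo) >= abs(observed))
    return observed, (extreme + 1) / (n_draws + 1)


# Simulated data for illustration only: 100 units, half treated,
# with a true effect of 0.5 standard deviations.
data_rng = np.random.default_rng(42)
assignment = data_rng.permutation(np.repeat([True, False], 50))
y = data_rng.normal(size=100) + 0.5 * assignment
effect, p = randomization_p_value(y, assignment)
print(f"observed effect = {effect:.3f}, RI p-value = {p:.4f}")
```

A real application would re-randomize the same way the experiment was actually randomized (by cluster, within strata, and so on), and would usually use a regression coefficient or t-statistic rather than a raw difference in means.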

RI has been around for a while, but very few experimenters in economics have adopted it. It’s more common but still unusual in political science. Here is Jed Friedman with a short intro. The main textbook is by Don Green and I recommend it to newcomers and oldcomers alike. I have been reading it all week and it’s a beautiful book.

Alwyn Young has very usefully asked what happens if we apply RI (and other methods) to existing papers. I don’t completely buy his conclusion that half the experiments are actually not statistically significant. Young analyzed about 2,000 regressions across 53 papers, or nearly 40 regressions per paper. Not all regressions are equal. Some outcomes we don’t expect treatment to affect, for instance. So Young’s tests are probably too stringent. Pre-analysis plans are designed to help fix this problem. But he has a good point. And work like this will raise the bar for experiments going forward.

But I don’t want to get into that. Rather, I want to talk about why this trend worries me.

I predict that, to get published in top journals, experimental papers are going to be expected to confront the multiple treatments and multiple outcomes problem head on.

This means that experiments starting today that do not tackle this issue will find it harder to get into major journals in five years.

I think this could mean that researchers are going to start to reduce the number of outcomes and treatments they plan to test, or at least prioritize some tests over others in pre-analysis plans.

I think it could also push experimenters to increase sample sizes, to be able to meet these more stringent standards. If so, I’d expect this to reduce the quantity of field experiments that get done.

Experiments are probably the field’s most expensive kind of research, so any increase in demands for statistical power or technical improvements could have a disproportionately large effect on the number of experiments that get done.

This will probably put field experiments even further out of the reach of younger scholars or sole authors, pushing the field to larger and more team-based work.

I also expect that higher standards will be disproportionately applied to experiments. So in some sense it will raise the bar for some work over others. Junior scholars will have stronger incentives to do observational work.

On some level, this will make everyone more careful about what is and is not statistically significant. More precision is a good thing. But at what cost?

Well for one, I expect it to make experiments a little more rote and boring.

I can tell you from experience it is excruciating to polish these papers to the point that a top journal and its exacting referees will accept them. I appreciate the importance of this polish, but I have a hard time believing the current state is the optimal allocation of scholarly effort. The opportunity cost of time is huge.

Also, all of this is fighting over fairly ad hoc thresholds of statistical significance. Rather than think of this as “we’re applying a common standard to all our work more correctly”, you could instead think of this as “we’re elevating the bar for believing certain types of results over others”.

Finally, and most importantly to me, if you think that the generalizability of any one field experiment is low, then a large number of smaller, less precise experiments in different places is probably better than a small number of large, very precise studies.
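One stylized way to see the logic, under assumptions that are mine rather than anything established in the papers above: suppose each site has its own true effect, the sites we study are a random draw from the places we care about, and a fixed total sample is split evenly across sites. Then the variance of the average estimated effect is the between-site variance divided by the number of sites, plus a sampling term that depends only on the total sample. The function and parameter names below are hypothetical.

```python
def variance_of_pooled_estimate(n_sites, total_sample, site_variance, unit_variance):
    """Variance of the simple average of site-level effect estimates.

    Assumes equal sample per site, sites drawn at random, and unbiased
    site estimates: site_variance / n_sites + unit_variance / total_sample.
    """
    return site_variance / n_sites + unit_variance / total_sample


# Same total sample of 10,000, spread over more or fewer sites.
# The variance numbers here are made up for illustration.
for k in (2, 5, 20):
    print(k, variance_of_pooled_estimate(k, 10_000, site_variance=1.0, unit_variance=4.0))
```

In this toy setup, only adding sites shrinks the first term. It ignores the fixed cost of opening each new site, which is the binding constraint in practice, and it ignores any bias in the smaller studies.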

There’s no problem here if you think that a large number of slightly biased studies are worse than a smaller number of unbiased and more precise studies. But I’m not sure that’s true. My bet is that it’s false. Meanwhile, the momentum of technical advance is pushing us in the direction of fewer studies.

I don’t see a way to change the professional incentives. I think the answer so far has been “raise more money for experiments so that the profession will do more of them.” This is good. But surely there are better answers than just throwing more fuel on the fire.

Incentives for technical advances in external rather than just internal validity strike me as the best investment right now. Journal editors could play a role too, rewarding the study of scale ups and replications (effectiveness trials) as much as the new and counter-intuitive findings (efficacy trials).

Of course, every plea for academic change ends with “more money for us” and “journal editors should change their preferences.” This is a sign of either lazy or hopeless thinking. Or, in my case, both.

I welcome ideas from readers, because to me the danger is this: that all the effort to make experiments more transparent and accurate ends up limiting how well we understand the world, and that a reliance on too few studies makes our theory and judgment and policy worse rather than better.

Update: In a follow-up post I round up the many comments and papers people shared.