
In early 2009, the United States was engaged in an intense public debate over a proposed $800 billion stimulus bill designed to boost economic activity through government borrowing and spending. James Buchanan, Edward Prescott, Vernon Smith, and Gary Becker, all Nobel laureates in economics, argued that while the stimulus might be an important emergency measure, it would fail to improve economic performance. Nobel laureates Paul Krugman and Joseph Stiglitz, on the other hand, argued that the stimulus would improve the economy and indeed that it should be bigger. Fierce debates can be found in frontier areas of all the sciences, of course, but this was as if, on the night before the Apollo moon launch, half of the world’s Nobel laureates in physics were asserting that rockets couldn’t reach the moon and the other half were saying that they could. Prior to the launch of the stimulus program, the only thing that anyone could conclude with high confidence was that several Nobelists would be wrong about it.

But the situation was even worse: it was clear that we wouldn’t know which economists were right even after the fact. Suppose that on February 1, 2009, Famous Economist X had predicted: “In two years, unemployment will be about 8 percent if we pass the stimulus bill, but about 10 percent if we don’t.” What do you think would happen when 2011 rolled around and unemployment was still at 10 percent, despite the passage of the bill? It’s a safe bet that Professor X would say something like: “Yes, but other conditions deteriorated faster than anticipated, so if we hadn’t passed the stimulus bill, unemployment would have been more like 12 percent. So I was right: the bill reduced unemployment by about 2 percentage points.”

Another way of putting the problem is that we have no reliable way to measure counterfactuals—that is, to know what would have happened had we not executed some policy—because so many other factors influence the outcome. This seemingly narrow problem is central to our continuing inability to transform social sciences into actual sciences. Unlike physics or biology, the social sciences have not demonstrated the capacity to produce a substantial body of useful, nonobvious, and reliable predictive rules about what they study—that is, human social behavior, including the impact of proposed government programs.

The missing ingredient is controlled experimentation, which is what allows science positively to settle certain kinds of debates. How do we know that our physical theories concerning the wing are true? In the end, not because of equations on blackboards or compelling speeches by famous physicists but because airplanes stay up. Social scientists may make claims as fascinating and counterintuitive as the proposition that a heavy piece of machinery can fly, but these claims are frequently untested by experiment, which means that debates like the one in 2009 will never be settled. For decades to come, we will continue to be lectured by what are, in effect, Keynesian and non-Keynesian economists.

Over many decades, social science has groped toward the goal of applying the experimental method to evaluate its theories for social improvement. Recent developments have made this much more practical, and the experimental revolution is finally reaching social science. The most fundamental lesson that emerges from such experimentation to date is that our scientific ignorance of the human condition remains profound. Despite confidently asserted empirical analysis, persuasive rhetoric, and claims to expertise, very few social-program interventions can be shown in controlled experiments to create real improvement in outcomes of interest.

To understand the role of experiments in this context, we should go back to the beginning of scientific experimentation. In one of the most famous (though probably apocryphal) stories in the history of science, Galileo dropped unequally weighted balls from the Leaning Tower of Pisa and observed that they reached the ground at the same time. About 2,000 years earlier, Aristotle had argued that heavier objects should fall more rapidly than lighter objects. Aristotle is universally recognized as one of the greatest geniuses in recorded history, and he backed up his argument with seemingly airtight reasoning. Almost all of us intuitively feel, moreover, that a 1,000-pound ball of plutonium should fall faster than a one-ounce marble. And in everyday life, lighter objects often do fall more slowly than heavy ones because of differences in air resistance and other factors. Aristotle’s theory, then, combined authority, logic, intuition, and empirical evidence. But when tested in a reasonably well-controlled experiment, the balls dropped at the same rate. To the modern scientific mind, this is definitive. The experimental method has proved Aristotle’s theory false—case closed.

Of course, Aristotle, like other proto-scientific thinkers, relied extensively on empirical observation. The essential distinction between such observation and an experiment is control. That is, an experiment is the (always imperfect) attempt to demonstrate a cause-and-effect relationship by holding all potential causes of an outcome constant, consciously changing only the potential cause of interest, and then observing whether the outcome changes. Scientists may try to discern patterns in observational data in order to develop theories. But central to the scientific method is the stricture that such theories should ideally be tested through controlled experiments before they are accepted as reliable. Even in scientific fields in which experiments are infeasible, our knowledge of causal relationships is underwritten by traditional controlled experiments. Astrophysics, for example, relies in part on physical laws verified through terrestrial and near-Earth experiments.

Thanks to scientists like Galileo and methodologists like Francis Bacon, the experimental method became widespread in physics and chemistry. Later, it invaded the realm of medicine. Though comparisons designed to determine the effect of medical therapies have appeared around the globe many times over thousands of years, James Lind is conventionally credited with executing the first clinical trial in the modern sense of the term. In 1747, he divided 12 scurvy-stricken crew members on the British ship Salisbury into six treatment groups of two sailors each. He treated each group with a different therapy, tried to hold all other potential causes of change to their condition as constant as possible, and observed that the two patients treated with citrus juice showed by far the greatest improvement.

The fundamental concept of the clinical trial has not changed in the 250 years since. Scientists attempt to find two groups of people alike in all respects possible, apply a treatment to one group (the test group) but not to the other (the control group), and ascribe the difference in outcome to the treatment. The power of this approach is that the experimenter doesn’t need a detailed understanding of the mechanism by which the treatment operates; Lind, for example, didn’t have to know about Vitamin C and human biochemistry to conclude that citrus juice somehow ameliorated scurvy.

But clinical trials place an enormous burden on being sure that the treatment under evaluation is the only difference between the two groups. And as experiments began to move from fields like classical physics to fields like therapeutic biology, the number and complexity of potential causes of the outcome of interest—what I term “causal density”—rose substantially. It became difficult even to identify, never mind actually hold constant, all these causes. For example, how could an experimenter in 1800, when modern genetics remained undiscovered, possibly ensure that the subjects in the test group had the same genetic predisposition to a disease under study as those in the control group?

In 1884, the brilliant but erratic American polymath C. S. Peirce hit upon a solution when he randomly assigned participants to the test and control groups. Random assignment permits a medical experimentalist to conclude reliably that differences in outcome are caused by differences in treatment. That’s because even causal differences among individuals of which the experimentalist is unaware—say, that genetic predisposition—should be roughly equally distributed between the test and control groups, and therefore not bias the result.
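
The logic of random assignment can be illustrated with a small simulation. What follows is only a sketch under invented assumptions: the 30 percent share of subjects with a hidden predisposition, the recovery probabilities, and the 10-point treatment effect are all hypothetical numbers chosen for illustration, not figures from any trial discussed here. The point is that the experimenter never looks at the hidden trait, yet it ends up in roughly equal proportions in both groups, so the difference in outcomes tracks the treatment.

```python
import random


def simulate_trial(n=10_000, treatment_effect=0.10, seed=0):
    """Randomly assign n subjects and compare recovery rates.

    Every number here is an illustrative assumption, not data.
    """
    rng = random.Random(seed)
    groups = {"test": [], "control": []}
    for _ in range(n):
        # Hidden predisposition the experimenter cannot observe.
        hidden_risk = rng.random() < 0.30
        # Random assignment: a coin flip, independent of the hidden trait.
        group = "test" if rng.random() < 0.5 else "control"
        p_recover = 0.50 - (0.20 if hidden_risk else 0.0)
        if group == "test":
            p_recover += treatment_effect
        recovered = rng.random() < p_recover
        groups[group].append((hidden_risk, recovered))

    def share(rows, index):
        # Fraction of rows whose field at `index` is True.
        return sum(row[index] for row in rows) / len(rows)

    return {
        "risk_share_test": share(groups["test"], 0),
        "risk_share_control": share(groups["control"], 0),
        "estimated_effect": share(groups["test"], 1) - share(groups["control"], 1),
    }


result = simulate_trial()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In a run like this, the two risk shares should come out close to each other and the estimated effect close to the assumed 0.10, even though the code that assigns groups never consults the hidden trait.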

In theory, social scientists, too, can use that approach to evaluate proposed government programs. In the social sciences, such experiments are normally termed “randomized field trials” (RFTs). In fact, Peirce and others in the social sciences invented the RFT decades before the technique was widely used for therapeutics. By the 1930s, dozens of American universities offered courses in experimental sociology, and the English-speaking world soon saw a flowering of large-scale randomized social experiments and the widely expressed confidence that these experiments would resolve public policy debates. RFTs from the late 1960s through the early 1980s often attempted to evaluate entirely new programs or large-scale changes to existing ones, considering such topics as the negative income tax, employment programs, housing allowances, and health insurance.

By about a quarter-century ago, however, it had become obvious to sophisticated experimentalists that the idea that we could settle a given policy debate with a sufficiently robust experiment was naive. The reason had to do with generalization, which is the Achilles’ heel of any experiment, whether randomized or not. In medicine, for example, what we really know from a given clinical trial is that this particular list of patients who received this exact treatment delivered in these specific clinics on these dates by these doctors had these outcomes, as compared with a specific control group. But when we want to use the trial’s results to guide future action, we must generalize them into a reliable predictive rule for as-yet-unseen situations. Even if the experiment was correctly executed, how do we know that our generalization is correct?

A physicist generally answers that question by assuming that predictive rules like the law of gravity apply everywhere, even in regions of the universe that have not been subject to experiments, and that gravity will not suddenly stop operating one second from now. No matter how many experiments we run, we can never escape the need for such assumptions. Even in classical therapeutic experiments, the assumption of uniform biological response is often a tolerable approximation that permits researchers to assert, say, that the polio vaccine that worked for a test population will also work for human beings beyond the test population. But we cannot safely assume that a literacy program that works in one school will work in all schools. Just as high causal densities in biology created the need for randomization, even higher causal densities in the social sciences create the need for even greater rigor when we try to generalize the results of an experiment.

Criminology provides an excellent illustration of the way experimenters have grappled with the problem of very high causal density. Crime, like any human social behavior, has complex causes and is therefore difficult to predict reliably. Though criminologists have repeatedly used the nonexperimental statistical method called regression analysis to try to understand the causes of crime, regression doesn’t even demonstrate good correlation with historical data, never mind predict future outcomes reliably. A detailed review of every regression model published between 1968 and 2005 in Criminology, a leading peer-reviewed journal, demonstrated that these models consistently failed to explain 80 to 90 percent of the variation in crime. Even worse, regression models built in the last few years are no better than models built 30 years ago.
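
To make concrete what it means for a model to leave 80 to 90 percent of the variation unexplained, here is a minimal sketch on synthetic data. It assumes a single predictor and a simple least-squares line, which is far cruder than the published models; the noise level is a hypothetical choice made so that the unexplained share lands in the same range.

```python
import random


def r_squared(xs, ys):
    """Share of the variance in ys explained by a least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(1)
# Synthetic data: a real but weak signal buried in much larger noise.
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys = [x + rng.gauss(0, 2.5) for x in xs]

unexplained = 1 - r_squared(xs, ys)
print(f"share of variation left unexplained: {unexplained:.2f}")
```

The predictor here genuinely affects the outcome, yet the fitted line still accounts for only a small fraction of the variation. That is roughly the position a modeler is in when many unmeasured causes are at work.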

So since the early 1980s, criminologists have increasingly turned to randomized experiments. One of the most widely publicized of these tried to determine the best way for police officers to handle domestic violence. In 1981 and 1982, Lawrence Sherman, a respected criminology professor at the University of Cambridge, randomly assigned one of three responses to Minneapolis cops responding to misdemeanor domestic-violence incidents: they were required to arrest the assailant, to provide advice to both parties, or to send the assailant away for eight hours. The experiment showed a statistically significant lower rate of repeat calls for domestic violence for the mandatory-arrest group. The media and many politicians seized upon what seemed like a triumph for scientific knowledge, and mandatory arrest for domestic violence rapidly became a widespread practice in many large jurisdictions in the United States.

But sophisticated experimentalists understood that because of the issue’s high causal density, there would be hidden conditionals to the simple rule that “mandatory-arrest policies will reduce domestic violence.” The only way to unearth these conditionals was to conduct replications of the original experiment under a variety of conditions. Indeed, Sherman’s own analysis of the Minnesota study called for such replications. So researchers replicated the RFT six times in cities across the country. In three of those studies, the test groups exposed to the mandatory-arrest policy again experienced a lower rate of rearrest than the control groups did. But in the other three, the test groups had a higher rearrest rate.

Why? In 1992, Sherman surveyed the replications and concluded that in stable communities with high rates of employment, arrest shamed the perpetrators, who then became less likely to reoffend; in less stable communities with low rates of employment, arrest tended to anger the perpetrators, who would therefore be likely to become more violent. The problem with this kind of conclusion, though, is that because it is not itself the outcome of an experiment, it is subject to the same uncertainty that Aristotle’s observations were. How do we know if it is right? By running an experiment to test it—that is, by conducting still more RFTs in both kinds of communities and seeing if they bear it out. Only if they do can we stop this seemingly endless cycle of tests begetting more tests. Even then, the very high causal densities that characterize human society guarantee that no matter how refined our predictive rules become, there will always be conditionals lurking undiscovered. The relevant questions then become whether the rules as they now exist can improve practices and whether further refinements can be achieved at a cost less than the benefits that they would create.
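
A hidden conditional of this kind is easy to miss when results are pooled. The sketch below uses entirely hypothetical rearrest counts for six invented sites (labeled A through F); none of these numbers comes from the actual replications. It compares the pooled difference in rearrest rates with the difference computed separately for each type of community.

```python
from collections import defaultdict

# Hypothetical counts, invented for illustration only:
# (site, context, rearrests_test, n_test, rearrests_control, n_control)
SITES = [
    ("A", "high employment", 40, 400, 60, 400),
    ("B", "high employment", 45, 400, 62, 400),
    ("C", "high employment", 38, 400, 55, 400),
    ("D", "low employment", 80, 400, 60, 400),
    ("E", "low employment", 78, 400, 61, 400),
    ("F", "low employment", 75, 400, 58, 400),
]


def rate_difference(rows):
    """Rearrest rate of the test group minus that of the control group."""
    test_events = sum(r[2] for r in rows)
    test_n = sum(r[3] for r in rows)
    control_events = sum(r[4] for r in rows)
    control_n = sum(r[5] for r in rows)
    return test_events / test_n - control_events / control_n


# Pooled estimate: every site lumped together.
pooled_effect = rate_difference(SITES)

# Stratified estimates: one per type of community.
rows_by_context = defaultdict(list)
for row in SITES:
    rows_by_context[row[1]].append(row)
effect_by_context = {
    context: rate_difference(rows) for context, rows in rows_by_context.items()
}

print(f"pooled: {pooled_effect:+.3f}")
for context, effect in effect_by_context.items():
    print(f"{context}: {effect:+.3f}")
```

With these made-up counts the pooled difference is zero, while the two strata differ by 4.5 percentage points in opposite directions. Note that the grouping variable was chosen after looking at the results, so the stratified pattern is itself only a hypothesis until a new experiment tests it.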

Sometimes, of course, we do stumble upon a policy innovation that appears consistently to work (or, much more often, not work). For example, various forms of intensive probation—in which an offender is closely monitored but not incarcerated—were tested via RFT at least a dozen times through 2004 and failed every test.

Criminologists at the University of Cambridge have done yeoman’s work in cataloging all 122 known criminology RFTs with at least 100 test subjects executed between 1957 and 2004. By my count, about 20 percent of these demonstrated positive results—that is, a statistically significant reduction in crime for the test group versus the control group. That may sound reasonably encouraging at first. But only four of the programs that showed encouraging results in the initial RFT were then formally replicated by independent research groups. All failed to show consistent positive results.

It is true that 12 of the programs were tested in “multisite RFTs”—experiments conducted in several different cities, prisons, or court systems. While not true replication, this is a better way to uncover context sensitivity than a single-site trial. But there, too, 11 of the 12 failed to produce positive results; and the small gains produced by the one successful program (which cost an immense $16,000 per participant) faded away within a few years. In short, no program within this universe of tests has ever demonstrated, in replicated or multisite randomized experiments, that it creates benefits in excess of costs. That ought to be pretty humbling.

The same conclusion holds if you forget about formal replications and merely examine similar programs that have been tested at different times, despite material differences at the level of detail and execution. From those 122 criminology experiments, I extracted the 103 that were conducted in the United States and grouped them into 40 “program concepts”: mandatory arrest for domestic violence, intensive probation, and so on. Of these 40 concepts, 22 had more than one trial. Of those 22, only one worked each time it was tested: nuisance abatement, in which the owners of blighted properties were encouraged to clean them up. And even nuisance abatement underwent only two trials.

So what do we know, based on this series of experiments, about reducing crime? First, that most promising ideas have not been shown to work reliably. Second, that nuisance abatement—which is at the core of what is often called “Broken Windows” policing—tentatively appears to work. Even that conclusion needs qualification: it’s a safe bet that there is some jurisdiction in the United States where even Broken Windows would fail. We must remain open to the iconoclast who will find the limits of our conclusions—just as the hard sciences always devote some resources to those who try to unseat conventional wisdom. That is, experimentation does not create absolute knowledge but rather changes both the burden and the standard of proof for those who disagree with its findings.

At the same time that the social sciences began struggling with the problem of dismayingly high causal densities, the same problem was being addressed by another entity entirely: the business world. There have been pockets of successful randomized experimentation in business for decades—consumer-package companies running test markets for new products, for example, and catalog marketers testing new offers. More recently, the information-technology revolution has created the possibility of experimenting much more broadly.

A key event occurred in 1988, when Rich Fairbank and Nigel Morris left a small strategy-consulting firm where the three of us worked to found credit-card company Capital One. The company was designed precisely as an application of the experimental method to business, and that method quickly permeated Capital One, to an extent never before seen. Suppose marketers wanted to know whether a credit-card solicitation would meet with greater success if it was mailed in a blue envelope or in a white one. Rather than debate the question, the company would simply mail, say, 50,000 randomly selected households the solicitation in a blue envelope and 50,000 randomly selected households the same solicitation in a white envelope, and then measure the relative profitability of the resulting customer relationships from each group. The success of Capital One, Fairbank told Fast Company, was predicated on its “ability to turn a business into a scientific laboratory where every decision about product design, marketing, channels of communication, credit lines, customer selection, collection policies and cross-selling decisions could be subjected to systematic testing using thousands of experiments.” By 2000, Capital One was reportedly running more than 60,000 tests per year. And by 2009, it had gone from an idea in a conference room to a public corporation worth $35 billion.
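
The arithmetic behind an envelope test of this kind can be sketched in a few lines. This is a simplified illustration, not a description of Capital One's actual analysis: it compares response rates rather than the profitability of customer relationships, it uses a standard two-proportion z-test, and the response counts are hypothetical.

```python
import math


def two_proportion_z(responders_a, n_a, responders_b, n_b):
    """Two-sided z-test for a difference between two response rates."""
    rate_a = responders_a / n_a
    rate_b = responders_b / n_b
    # Pooled rate under the null hypothesis that the envelopes perform alike.
    pooled = (responders_a + responders_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a - rate_b, z, p_value


# Hypothetical results: 50,000 households per envelope color.
diff, z, p_value = two_proportion_z(
    responders_a=1_100, n_a=50_000,   # blue envelope
    responders_b=1_000, n_b=50_000,   # white envelope
)
print(f"difference in response rate: {diff:.4f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With mailings this large, a gap of two-tenths of a percentage point is already distinguishable from chance at conventional thresholds, which is part of why very large test groups make such experiments practical.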

Through competitive pressure and professional osmosis, Capital One has transformed not only the credit-card industry but most financial services marketed through direct channels. Randomized experimentation is now a core capability for the marketing of everything from credit cards to checking accounts. Nonfinancial companies, too, have imported the experimental model. Harrah’s Entertainment carefully executes randomized tests of various hypotheses for how to market to customers—for example, identifying a large number of people who live in Southern California and who usually visit Las Vegas on weekends, mailing a randomly selected group of them an attractive hotel offer for a Tuesday night, and comparing the response of that group (the test group) with the response of the rest of the sample (the control group). “It’s like you don’t harass women, you don’t steal and you’ve got to have a control group,” the CEO of Harrah’s said in a Stanford Business School case study. “This is one of the things that you can lose your job for at Harrah’s—not running a control group.”

The Internet is even better for experimentation than the direct-mail and telemarketing channels that Capital One originally used. Executing a randomized experiment—say, to determine whether a pop-up ad should appear in the upper-left or upper-right corner of a webpage—is close to costless on a modern e-commerce platform. The leaders in this sector, such as Google, Amazon, and eBay, are inveterate experimenters. These days, experimentation is something that one expects of any successful online commerce company.

For all these companies, from Capital One to Google, very large test groups of consumers—tens of thousands or even more—can be selected economically, and the insights that the experiments create can be applied to millions of total customers. In 1999, after years of chewing on Fairbank and Morris’s example, I started a software company that applied the experimental method to environments where such large samples weren’t feasible—a chain of retail stores, for example, that wants to test which of two window displays will lead to greater sales. The company now provides the software platform for experiments for dozens of the world’s largest corporations.

What businesses have figured out is that they can deal with the problem of causal density by scaling up the testing process. Run enough tests, and you can find predictive rules that are sufficiently nuanced to be of practical use in the very complex environment of real-world human decision making. This approach places great emphasis on executing many fast, cheap tests in rapid succession, rather than big, onetime “moon shots.” It’s something like the replacement of craft work by mass production. The crucial step was to lower the cost and time of each test, which doesn’t simply make the process more efficient but, by allowing many more test iterations, leads to faster and more useful learning.

Many of the same techniques that businesses use to lower the cost per test—integration with operational data systems, standardization of test design, and so on—could be applied to social policy experiments. In fact, they were applied in a limited way during the execution of more than 30 randomized experiments during the welfare-reform debate of the 1990s, which was one of the most fruitful sequences of social policy experiments ever done. Businesses have demonstrated that the concept of replication of field experiments can be pushed much further than most social scientists had imagined.

But what do we know from the social-science experiments that we have already conducted? After reviewing experiments not just in criminology but also in welfare-program design, education, and other fields, I propose that three lessons emerge consistently from them.

First, few programs can be shown to work in properly randomized and replicated trials. Despite complex and impressive-sounding empirical arguments by advocates and analysts, we should be very skeptical of claims for the effectiveness of new, counterintuitive programs and policies, and we should be reluctant to trump the trial-and-error process of social evolution in matters of economics or social policy.

Second, within this universe of programs that are far more likely to fail than succeed, programs that try to change people are even more likely to fail than those that try to change incentives. A litany of program ideas designed to push welfare recipients into the workforce failed when tested in those randomized experiments of the welfare-reform era; only adding mandatory work requirements succeeded in moving people from welfare to work in a humane fashion. And mandatory work-requirement programs that emphasize just getting a job are far more effective than those that emphasize skills-building. Similarly, the list of failed attempts to change people to make them less likely to commit crimes is almost endless—prisoner counseling, transitional aid to prisoners, intensive probation, juvenile boot camps—but the only program concept that tentatively demonstrated reductions in crime rates in replicated RFTs was nuisance abatement, which changes the environment in which criminals operate. (This isn’t to say that direct behavior-improvement programs can never work; one well-known program that sends nurses to visit new or expectant mothers seems to have succeeded in improving various social outcomes in replicated independent RFTs.)

And third, there is no magic. Those rare programs that do work usually lead to improvements that are quite modest, compared with the size of the problems they are meant to address or the dreams of advocates.

Experiments are surely changing the way we conduct social science. The number of experiments reported in major social-science journals is growing rapidly across education, criminology, political science, economics, and other areas. In academic economics, several recent Nobel Prizes have been awarded to laboratory experimentalists, and leading indicators of future Nobelists are rife with researchers focused on RFTs.

It is tempting to argue that we are at the beginning of an experimental revolution in social science that will ultimately lead to unimaginable discoveries. But we should be skeptical of that argument. The experimental revolution is like a huge wave that has lost power as it has moved through topics of increasing complexity. Physics was entirely transformed. Therapeutic biology had higher causal density, but it could often rely on the assumption of uniform biological response to generalize findings reliably from randomized trials. The even higher causal densities in social sciences make generalization from even properly randomized experiments hazardous. It would likely require the reduction of social science to biology to accomplish a true revolution in our understanding of human society—and that remains, as yet, beyond the grasp of science.

At the moment, it is certain that we do not have anything remotely approaching a scientific understanding of human society. And the methods of experimental social science are not close to providing one within the foreseeable future. Science may someday allow us to predict human behavior comprehensively and reliably. Until then, we need to keep stumbling forward with trial-and-error learning as best we can.