Six years ago, a woman in rural Kenya told me her story. Every night when it rained, she’d have to move her children to a neighbour’s hut – one with a metal roof instead of her leaky, straw one. But then the nonprofit organisation GiveDirectly transferred $1,000 to her, so she bought a roof, and now her family could get a full night’s sleep. Years later, I can still remember sitting with her in her hut, and thinking to myself that I wished I had a lot more money, so I could give to dozens of families like hers.

Meeting the woman and others in her village had an impact on me emotionally, but what struck me was also the weight of evidence from research studies. Cash transfers of various types have been studied through dozens of large randomised controlled trials (RCTs) in many low-income countries, showing consistent positive effects on health and education. And contrary to the past fears of donors, there’s growing evidence that people buy essential goods and do not while away the money on such things as cigarettes and alcohol.

Another striking fact about these studies: they are all relatively recent. This kind of research on how to help people in low-income countries, through ‘micro’-type programmes, has exploded over the past two decades. And this dramatic surge of studies has been trailed by a fascinating – and crucial – debate about the value of this kind of evidence.

Prior to 2000, it was much more common for economists studying international economic development to look at ‘macro’ questions, rather than to study ‘micro’ anti-poverty programmes. For instance, they’d look at big datasets and compare many factors across countries, asking questions about how rich countries developed economically, while others stayed poor. But starting in the late 1990s and early 2000s, economists including Esther Duflo and Abhijit Banerjee at the Massachusetts Institute of Technology, and Michael Kremer at Harvard University, began to argue that another approach was needed too. In their book Poor Economics (2011), Duflo and Banerjee explain that, rather than just studying the ultimate causes of poverty, or the effectiveness of aid in general, their question is: ‘Do we know of effective ways to help the poor?’

This approach led them and their colleagues to ask micro-type questions such as: should we distribute bed nets to combat malaria for free, or will people use bed nets less if they get them free versus paying for them? (A randomised study showed that giving them away for free in Kenya didn’t lead to less usage than charging for them.) Will giving small loans (‘microfinance’) lift people out of poverty by allowing them to start businesses? (A set of six RCTs in six countries showed little evidence that the loans raise incomes overall, though they might have other beneficial effects.) What are the long-term impacts of giving deworming pills to children, on school attendance and later-life income? (A long-term RCT in Kenya showed positive effects on both, though this study also prompted much debate.)

To give a bit of background: RCTs work by randomly assigning one group of people to participate in the programme (the ‘treatment’ group) and comparing their results with those of another randomly assigned group (the ‘control’ group). By contrast, ‘observational’ studies make use of existing data, without conducting an ‘intervention’ (ie, without enrolling people in a study). It’s also possible to use other methods that intervene rather than merely observe, such as ‘difference-in-difference’ studies that compare non-randomly assigned treatment and control groups before and after a programme.

But using these other methods raises tough questions about correlation versus causation. For instance, one might compare the health of people who eat tofu with the health of those who never do. But in how many ways are those two groups different? Maybe the tofu-eaters also eat vegetables more often. Maybe they exercise more. So, even if they are healthier, a researcher would need to ‘control for’ these differences statistically – and what if there are differences they didn’t think of or can’t measure?
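
To make the worry concrete, here is a minimal simulation sketch of the tofu example. Everything in it is an assumption made up for illustration: the hidden ‘health-conscious’ trait, the probabilities, the health scores and the function name `health_gap`. Tofu is given no causal effect at all, yet the observational comparison still shows a gap, because the hidden trait drives both tofu-eating and health.

```python
import random


def health_gap(randomised, n=20000, seed=0):
    """Mean health of tofu-eaters minus mean health of non-eaters.

    Toy population (all numbers invented): tofu has NO causal effect.
    A hidden 'health-conscious' trait raises health by 10 points and,
    in the observational case, also makes tofu-eating more likely.
    """
    rng = random.Random(seed)
    tofu, no_tofu = [], []
    for _ in range(n):
        health_conscious = rng.random() < 0.5
        if randomised:
            # Coin flip: tofu is assigned independently of the hidden trait.
            eats_tofu = rng.random() < 0.5
        else:
            # Observational world: the hidden trait drives the choice.
            eats_tofu = rng.random() < (0.7 if health_conscious else 0.2)
        # Health depends only on the hidden trait, never on tofu.
        health = rng.gauss(60 if health_conscious else 50, 10)
        (tofu if eats_tofu else no_tofu).append(health)
    return sum(tofu) / len(tofu) - sum(no_tofu) / len(no_tofu)


print("observational gap:", round(health_gap(randomised=False), 2))
print("randomised gap:   ", round(health_gap(randomised=True), 2))
```

Under these invented numbers, the observational gap should come out at around 5 points in expectation, while the randomised gap should hover near zero. A researcher who measured the hidden trait could adjust for it; the trouble is the traits nobody thought to measure.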

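The beauty of RCTs is that you don’t need to worry as much about possible confounders. Because assignment is random, the treatment and control groups should be similar on average in every respect, measured or not, apart from the programme itself. Statistically, you can then estimate the likelihood of observing the study results (or more extreme results) if the ‘null hypothesis’ were true (often, this is defined as the hypothesis that there is no effect).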
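
One way to see what that likelihood means is a permutation test, sketched below. This is an illustrative technique of my choosing, not necessarily the analysis used in any of the studies mentioned here, and the function name `permutation_p_value` and the toy numbers are hypothetical. The idea: if the programme truly had no effect, the group labels would be arbitrary, so we can reshuffle them many times and count how often chance alone produces a gap at least as large as the one observed.

```python
import random


def permutation_p_value(treatment, control, n_perm=5000, seed=0):
    """Estimate a two-sided p-value for a difference in group means.

    Under the null hypothesis of no effect, the treatment/control labels
    are interchangeable, so we reshuffle them and count how often the
    reshuffled gap is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    k = len(treatment)
    observed = sum(treatment) / k - sum(control) / len(control)
    pooled = list(treatment) + list(control)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical outcome scores for two small randomly assigned groups.
treated = [14, 17, 15, 19, 16, 18, 20, 15]
untreated = [12, 13, 15, 11, 14, 12, 16, 13]
print(permutation_p_value(treated, untreated))
```

A small value says the observed gap would be rare if there were no effect. It does not, by itself, say how large the effect is or whether it would hold in a different setting.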

Back in 2010, I became enamoured with RCTs. I had been disillusioned by the lack of evidence during my early forays into the nonprofit world. As a mentor to a teenager in foster care, I was horrified to learn that a huge percentage of those who ‘age out’ of foster care (usually at 18) become homeless within 18 months. I interned for a couple of nonprofits that worked with this population, but they seemed to know – or, really, care to know – little about their effectiveness. I was a grad student in philosophy at the time, specialising in topics related to skepticism, and I latched on to RCTs as a route to reliable knowledge of how to help people.

I decided to leave academic philosophy, finding that my interests fit well into a job with GiveWell, a US nonprofit that has relied on RCTs (though not exclusively) for evidence of effectiveness. In my early days working for that organisation, I conducted research into Cochrane, a nonprofit that synthesises RCTs (mostly on health-related questions) into meta-analyses of their cumulative effect. I was astounded by how few people seemed to have heard of Cochrane’s systematic reviews, and became an unofficial evangelist, telling friends and family to consult their reviews when they had health questions.

Many RCTs in low-income countries have been conducted by Innovations for Poverty Action (IPA), a US group that I went to work for after GiveWell, and that is among those driving the movement towards RCTs in development economics. In 2003, Duflo, Banerjee and their fellow economist Sendhil Mullainathan at the University of Chicago founded the Abdul Latif Jameel Poverty Action Lab (J-PAL). Since the early 2000s, researchers affiliated with J-PAL and IPA, as well as with the World Bank and the UK’s Department for International Development (DFID), have conducted many hundreds of RCTs in low- and middle-income countries on education, financial inclusion and health programmes.

This surge in RCTs has been described by many inside and outside the field as having a dramatically positive impact. In some cases, the movement has been praised as revolutionary, for instance by the Australian politician and economist Andrew Leigh in Randomistas: How Radical Researchers are Changing Our World (2018).

My own view of RCTs has evolved over time from great enthusiasm to a generally positive feeling, though heavily annotated with caveats and unanswered questions. Gradually, through involvement with these groups, I’d become aware of criticisms of the RCT movement. In some cases, this awareness was very specific to a case. For example, when I co-wrote a GiveWell report on mass media for social change back in 2012, in part pointing to a lack of RCTs, some key researchers wrote a reply expressing their view that it was neither necessary nor feasible to conduct an RCT as opposed to relying on other study methods to get good evidence that these programmes had an impact. In other cases, I encountered the criticism in a much more general form – for at least the past decade, Nancy Cartwright, a philosopher of science at Durham University in the UK, has been critical of the idea that we can figure out ‘what works’ using RCTs. Her argument is that this view is far too simple.

Along with the Nobel laureate Angus Deaton, Cartwright argues that researchers and organisations conducting RCTs often overstate their value as evidence. In a co-authored paper, they make a case that RCTs are one method among many that can be useful, but shouldn’t be thought of as the research ‘gold standard’. Much of their reasoning about why evidence from RCTs is not preferable to other study methods centres around external validity. Cartwright and Deaton argue that the context for social interventions is often very complicated, with many relevant factors at play that can easily be absent in a new context. ‘Causal processes often require highly specialised economic, cultural or social structures to enable them to work,’ they write. Thus, to apply the findings beyond the original context, we must have a theory about which ‘supporting factors’ are important (ie, additional factors that function along with the treatment to create the observed outcome).

Cartwright and Deaton point to examples in which a programme that was effective in an RCT failed when replicated in a new context. One such example comes from RCTs in Kenya, which showed that student test scores went up considerably when NGO-run schools hired extra teachers. However, when the programme was replicated in the same way, only this time with government-hired teachers, test scores did not increase. They point to research from the economist Eva Vivalt at the Australian National University on the generalisability of social interventions, which has shown a similar failure to replicate programmes in many other instances.

Cartwright and Deaton conclude that, in order to make the evidence from RCTs useful for new contexts, researchers and policymakers have to seek to understand not just that a treatment works (in some specific context), but why it works. Applying RCT evidence to a new context requires many assumptions, they argue, and so the purported advantage of RCTs (ie, that they require relatively few assumptions) isn’t really an advantage when it comes to using the evidence for policy, which would often require applying it in places outside the original context. This is true both for new contexts (such as a new country), and also for applying a study’s results to an individual person.

When RCT proponents hear this critique, they often wonder why this argument is directed towards RCTs rather than to research studies in general. This is a theme in the 20 recent articles responding to Deaton and Cartwright, and is summed up by this Tweet from the economist Pamela Jakiela at the University of Maryland: ‘Nice insight by Deaton and Cartwright, but for some reason they keep spelling “study” as R-C-T.’

Another kind of complaint about the RCT movement has been about its relevance and importance. How often are the things that RCTs study really useful, versus just things that are easy to study with RCTs? Here the metaphor that’s often invoked is someone looking for lost keys under a streetlight: maybe that’s where you can see most easily, but that doesn’t mean the keys are most likely to be there. For example, government decision-making in areas such as trade policy and adjudication of property rights is arguably of huge importance to the wellbeing of people in a country, yet it’s rarely feasible to conduct an RCT on these policies and practices.

The economist Lant Pritchett at Harvard University argues that many of the micro programmes studied by J-PAL and other groups are unlikely to do much to really combat poverty, when compared with the macro changes that governments could decide to make. Recently, 15 well-known academics, including a number of economists, signed a letter in The Guardian expressing this critique of the RCT movement. They wrote: ‘[T]he real problem with the “aid effectiveness” craze is that it narrows our focus down to micro-interventions at a local level that yield results that can be observed in the short term.’ Rather than focusing on RCTs to test various micro-development projects that ‘generally do little to change the systems that produce the problems in the first place’, they urge, we should aim to tackle the ‘root causes of poverty, inequality and climate change’.

Many supporters of RCTs would respond that whether to conduct RCTs or work on shifting policies isn’t an ‘either/or’ question, and I’d agree with them. Ruth Levine, programme director of global development and population at the Hewlett Foundation in California, writes that findings from RCTs on individual and community behaviours can be useful to the big questions about the structural drivers of poverty. She argues against the view that RCTs threaten to crowd out other kinds of research and programmes, calling it unrealistic, given that the ‘majority of official and private development dollars are spent on programmes that are not and will not be subject to RCTs’.

Not only that, but it’s not clear if the critics are right when they argue that micro programmes have depressingly small effects. The study of Mexico’s Progresa programme (later renamed Oportunidades) bolstered evidence for a programme of conditional cash transfers that has helped many millions of people, and the same kind of programme has been tested and rolled out in many other low-income countries. Dean Karlan, founder of IPA and co-author with Jacob Appel of More Than Good Intentions (2011), writes that this kind of approach ‘won’t eradicate poverty with one fell swoop … but we can make – and are making – real, measurable, and meaningful progress towards eradicating it’.

The concern that I’ve found most disturbing personally, in the eight years since I first started thinking about RCTs, is about research more generally. There are many potential problems with research studies, including RCTs, that can crop up, compromising our ability to rely on the results. One big one is publication bias: the skew towards publishing positive results while burying null results in the proverbial file drawer. There’s some evidence that this problem is less of an issue with RCTs than with other study methods: RCTs in development economics tend to be expensive and time-consuming, and the results can be interesting and get published regardless of their findings. But it’s still something to worry about.

Another big problem is selective reporting. In some fields such as medicine, journals require potential authors to ‘pre-register’ their studies online in advance, specifying what outcomes they plan to measure. The idea is that if there are many things being measured, and researchers can cherry-pick what to report later, it’s easy to find a so-called positive result that could very well be statistical noise. One striking example: after the US National Heart, Lung and Blood Institute required researchers to pre-register their studies, the proportion of studies finding positive results declined from 57 per cent to 8 per cent. But in development economics, it’s still relatively rare to pre-register a study’s outcomes and analysis plan, and there’s controversy over whether doing so is a good idea in the first place, with some researchers arguing that specifying outcomes in advance is overly constraining. Another idea is to accept a paper ahead of time, before the study is conducted, based purely on its importance and design – this ‘registered report’ model has expanded in psychology and, this past year, the Journal of Development Economics announced that it would try out the model as well.
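
The arithmetic behind the cherry-picking worry can be sketched with a small simulation. The set-up is entirely hypothetical: 300 imaginary studies, each measuring 20 independent outcomes on which the programme has no effect whatsoever, with a simple z-test at the conventional 5 per cent threshold. The function name `share_of_positive_studies` and all the parameters are assumptions for illustration, and real study outcomes are rarely independent in this way.

```python
import math
import random


def share_of_positive_studies(n_studies=300, n_outcomes=20, n_per_arm=30, seed=0):
    """Simulate studies of a programme with NO true effect on any outcome.

    Returns two shares of studies that can claim a 'significant' result:
    (a) judged only on one outcome fixed in advance (pre-registered), and
    (b) free to report whichever outcome happened to cross the threshold.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means when each arm has variance 1.
    se = math.sqrt(2 / n_per_arm)
    preregistered_hits = 0
    cherry_picked_hits = 0
    for _ in range(n_studies):
        significant = []
        for _ in range(n_outcomes):
            treat = [rng.gauss(0, 1) for _ in range(n_per_arm)]
            ctrl = [rng.gauss(0, 1) for _ in range(n_per_arm)]
            z = (sum(treat) / n_per_arm - sum(ctrl) / n_per_arm) / se
            significant.append(abs(z) > 1.96)  # two-sided 5% threshold
        preregistered_hits += significant[0]   # only the outcome named up front
        cherry_picked_hits += any(significant)  # any outcome will do
    return preregistered_hits / n_studies, cherry_picked_hits / n_studies


prereg, cherry = share_of_positive_studies()
print(f"pre-registered outcome only: {prereg:.0%} of studies 'positive'")
print(f"best of 20 outcomes:         {cherry:.0%} of studies 'positive'")
```

With 20 independent null outcomes, the chance that at least one crosses the threshold is 1 − 0.95^20, or roughly 64 per cent, against about 5 per cent for a single outcome named in advance. That gap is the noise that pre-registration is meant to filter out.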

Major concerns about publication bias and selective reporting aren’t specific to RCTs, but they do serve to temper an inflated view of the RCT movement’s capacity to deliver answers without complication. Clearly, RCTs can suffer from many issues that reduce our ability to rely on their results, as can other methods. That might be clear to most of the researchers who kickstarted the movement, but it seems that many of us could use a reminder.

Going back to the villages where GiveDirectly is distributing cash in Kenya, there’s much evidence of the impact of giving cash transfers. But many questions remain: IPA is conducting an RCT of GiveDirectly’s basic income experiment, giving monthly transfers to households in rural Kenya for 12 years. The study will shed much-needed light on the effects of basic income in the Kenyan context. Already, there’s been critical discussion of the RCT itself, on the particular details of how it has been conducted and analysed.

Rather than see this kind of critique as a sign that something has gone wrong, I think it’s exactly what we need – close and sustained scrutiny of the details of particular RCTs; discussion of when to conduct an RCT versus another kind of method; and weighing up how much investment to put into micro versus macro ways of trying to help people. As I’ve learned over the years, we can’t easily answer such questions. But we can’t stop trying.