If we find that a deworming programme in one region improves school performance, how much have we learned? Should we be confident that a programme carried out by a different organization in another country will have similar effects? Will the programme be equally effective if it’s handed over to the government and scaled up?

Eva Vivalt is a Post-Doctoral Research Fellow at NYU’s Development Research Institute whose work addresses questions like these. She founded AidGrade to synthesize data from different impact evaluations and improve the assessment of development interventions. Her paper “How much can we generalize from impact evaluations?” has been described as a must-read for development economists. I recently got in touch with Eva to learn more about her research.

Could you tell us about your work as a development economist? How does your academic research relate to your work with AidGrade?

Right now I'm focusing on leveraging AidGrade's data for research because there is a lot more that can be said with it. But I am interested in other topics in development as well and have several other studies in progress. My academic work only leaves me time to set up the structure for AidGrade and manage it on the side, so I'm lucky that the staff are so great that they can take it from there.

Your paper assesses to what extent we can generalize from evaluations of programs implemented by one organization in a particular location to the efficacy of similar programs implemented by other organizations elsewhere. Why is this issue important? Does it matter for individual donors?

How much one can extrapolate across settings is critical to understanding what we can take away from any one study. Without this kind of work, we could make fantastically wrong predictions. For example, a similar issue came up in medicine, where many studies were traditionally done on men but there has been increasing acknowledgement that the effects of some drugs might be different in women or children. The issue is likely to be even more important in development economics, because there are so many more things that could interact with the intervention and determine its effects. So one reason this work is important is that it teases apart some of the factors associated with the effects of a program and helps to improve our ability to predict.

Apart from the prediction issue, knowing the distribution of effects is also important. If we were equally unsure of the effects of all development programs, perhaps we could get away with only looking at the average effect for a particular intervention, but if we know that some have spanned a broader range than others, that could make a difference as to which we want to support. A former professor and inspiration, Adrian Wood, liked to say that first we should just do no harm. One can imagine that we should perhaps pay more attention to the lower end of the range of possible effect sizes, because if you end up doing an intervention with poor results you might sow distrust, and research in psychology shows that people dislike losses more than they like gains. Alternatively, if there are poverty traps, perhaps we should weight the large, positive effect sizes more highly. Either way, it is important to consider the full range of possible outcomes.

Finally, and less relevant to individual donors but more relevant for researchers, this kind of work can help tell us where we should allocate more attention – where the big unknowns and the unexplained variation are.

How do you assess the generalizability of impact evaluations?

AidGrade gathered data from about 600 impact evaluations of 20 different kinds of development programs. These were academic studies, and you can think of our data as essentially comprising the results tables along with characteristics of each paper (e.g. whether it was a randomized controlled trial, whether it was randomized by cluster, characteristics of the sample being studied, etc.). We also coded what kind of outcome the results captured (e.g. attendance rates in percentage points, height in cm, etc.).

For the main results, I run leave-one-out meta-analyses of all the papers within each intervention-outcome combination. For example, there might be six papers estimating the effect of school meals (intervention) on attendance rates (outcome). I take one of those estimates out, generate the meta-analysis result from the other five, and see how well that meta-analysis result predicts the one left out, cycling through so that each estimate is left out in turn. It's a bit more complicated than that, but that's the quick summary.
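To make the leave-one-out idea concrete, here is a minimal sketch in Python. It is an illustration only, not the method used in the paper: it assumes a simple fixed-effect (inverse-variance weighted) meta-analysis, and the function names and the six school-meals estimates are hypothetical numbers invented for the example.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Pool estimates with inverse-variance weights (a fixed-effect model, assumed here)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


def leave_one_out(estimates, std_errors):
    """Hold out each study in turn and predict it from the pooled result of the rest."""
    results = []
    for i in range(len(estimates)):
        rest_estimates = estimates[:i] + estimates[i + 1:]
        rest_std_errors = std_errors[:i] + std_errors[i + 1:]
        pooled, pooled_se = fixed_effect_meta(rest_estimates, rest_std_errors)
        results.append({
            "held_out": estimates[i],
            "predicted": pooled,
            "predicted_se": pooled_se,
            "error": estimates[i] - pooled,
        })
    return results


# Hypothetical data: six studies of school meals on attendance (percentage points).
effects = [4.0, 1.5, 6.2, -0.5, 3.1, 2.4]
ses = [1.0, 1.2, 2.0, 1.5, 0.8, 1.1]

for row in leave_one_out(effects, ses):
    print(f"held out {row['held_out']:5.1f}  predicted {row['predicted']:5.2f}  "
          f"error {row['error']:6.2f}")
```

The size of the prediction errors across held-out studies gives a rough sense of how well results from the other settings predict a new one. A random-effects model, which allows the true effect to differ between studies, would be a natural alternative to the fixed-effect pooling assumed here.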

You find that a positive, significant result in one study is a poor predictor of finding a similar result if the same intervention is studied again: there’s only a 32% chance that the next result will be positive and significant. How worried should we be in light of this finding?

It does suggest some caution. One should also consider the sample size of the studies. Studies are very frequently underpowered, meaning that the sample is just too small to be likely to show a significant effect even if the program really had a positive effect. In general, especially with regard to how we commonly talk about programs or see them discussed in the press, I think we don't know quite as much as we think we know. Academics are better at being careful about this.
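The point about underpowered studies can be illustrated with a small power calculation. This is a sketch under simplifying assumptions, not taken from the paper: it treats the estimate as normally distributed around the true effect with a known standard error and uses a two-sided z-test at the 5% level. The function name `power_two_sided` and the numbers are hypothetical.

```python
from statistics import NormalDist


def power_two_sided(true_effect, std_error, alpha=0.05):
    """Probability that a two-sided z-test rejects zero, given the true effect.

    Assumes the estimate is normal with mean true_effect and sd std_error.
    """
    standard_normal = NormalDist()
    critical = standard_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    shift = true_effect / std_error
    upper_tail = 1 - standard_normal.cdf(critical - shift)
    lower_tail = standard_normal.cdf(-critical - shift)
    return upper_tail + lower_tail


# Hypothetical example: the same true effect of 2 percentage points,
# measured with a noisy (small-sample) and a precise (large-sample) study.
print(f"small study power: {power_two_sided(2.0, 1.5):.2f}")
print(f"large study power: {power_two_sided(2.0, 0.5):.2f}")
```

Because the standard error shrinks roughly with the square root of the sample size, a small study of a genuinely effective program can easily fail to find a significant result, which is one reason a single non-replication should not be over-read.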

You find that government-implemented programs generally fare worse than programs implemented by academics and NGOs. Why do you think we find this pattern of results? What lessons can we take away?

There are a few possible explanations. The first thing that comes to mind for most people is that government-implemented and NGO/academic-implemented programs are qualitatively different – the government-implemented programs, in particular, tend to be much larger. There could be "equilibrium effects" at a large scale that reduce the observed impact. For example, perhaps microfinance programs help some business owners but at the expense of others – if one looked at a small-scale program, it would appear to help, but it might not when scaled up. Alternatively, it could simply be harder to implement a program well at scale. Another possibility is that NGO/academic-implemented programs are simply better-managed or devote more effort or other resources to each beneficiary.

In my paper, I show that while it is true that larger programs tend to fare worse, government-implemented programs still perform worse even controlling for sample size, so size is not the full story and it appears to be a combination of factors. More work is needed to distinguish between the remaining possibilities.

Based on your findings, are there any changes you think we should make in how we go about trying to identify the best charities to support with our donations? For example, should we significantly increase our confidence in an organization like GiveDirectly, where we have an evaluation that looks directly at their program?

I think that is a correct inference to draw. There is definitely still reason to sometimes support unstudied charities, but the full range of possible results and our uncertainty about them should enter into that calculation. We should also be cautious as to whether the evaluation of a charity was cherry-picked in any way.

What steps can be taken to improve the generalizability of impact evaluations? Are development economists following these steps or is generalizability a neglected concern?

Great question. First, there are things that people could do to help measure the local generalizability of their results – for example, include subgroup analyses in their pre-analysis plans. This means that they commit to examining differences in results for different subsets of their own sample. While this would only give us within-sample variation, I found that variance in results within a paper is associated with variance in results across the papers that look at the same intervention-outcome, and it might be relevant locally (such as if the program were to be expanded), so it would seem a step in the right direction. There should also be better coordination across people working on different papers so that they use more of the same outcome measures. Paul Gertler and co-authors, among others, have been setting up impact evaluations that ask the same questions in many different settings. This is very valuable work.

More generally, better integration of theory in development work would help. People like to say that impact evaluations are testing some causal chain, but if you delve into the papers, how they think the intervention actually works is often left unstated. It's a knotty problem, but the holy grail would be to uncover some stable parameters explaining at least some of the variation in an intervention's effects. Failing that, we would need many more iterations of experiments.