For about four weeks over winter break I interned at GiveWell, a nonprofit that evaluates charities to figure out which ones do the most effective work. I learned a lot from working there, including a lot about how GiveWell works that seems not to be widely known. So I thought I’d share my findings.

Be more skeptical!

a.

In the developing world, water quality is a huge concern. Many natural water sources contain bacteria that cause diarrhea, a major killer. If you filter or chlorinate the water, you can get rid of those pathogens.

So if you give people filters or chlorine dispensers, that’ll reduce how much diarrhea they get, right? Not so fast—it’s a really plausible theory, but we need to check that it actually works in practice. Fortunately, we know how to do this: a randomized controlled trial!

So we did a bunch of studies. We gave dispensers to some villages and not to others, and periodically surveyed them for diarrhea prevalence. Sure enough, diarrhea went down in the intervention villages compared to the controls. Great! Let’s scale it up and save some lives!

Except we missed a thing: the trials weren’t blinded. The intervention group knew they were getting cleaner water. And what’s worse, the outcome (diarrhea incidence) was self-reported. And when we did more studies, and blinded them this time, the entire effect went away. Turns out that self-reported diarrhea incidence isn’t a reliable enough measure of diarrhea.

Except we missed another thing! The blinded studies all had methodological problems. Except so did the unblinded ones! So it’s not really clear which way the evidence points. You can find out more at GiveWell’s intervention report; to quote from their summary:

Overall, we are ambivalent about the effect of water quality interventions on diarrhea. We find plausible theories grounded in the available evidence for both believing that water quality interventions reduce diarrhea and for the more pessimistic conclusion that these interventions do not have an effect. We also do not see any analysis that may lead to a more definitive answer without a significant, additional investment of time. We therefore summarize our work so far, provide feedback we’ve received from scholars and leave questions for further investigation.

Just to repeat that: stopping people from ingesting diarrhea-causing bacteria every day with their water may not decrease diarrhea prevalence. Why? Well, I don’t have much of an idea. One speculative guess is that diarrhea is a simple contagion and people who were getting it from their wells were also getting it from other places. But, as GiveWell notes, there are plausible theories in both directions.

b.

My assignment at GiveWell was to do research around interventions that promote exclusive breastfeeding. The Global Burden of Disease study used a meta-analysis by Lamberti et al. of many observational studies to conclude that proper breastfeeding practices could prevent up to 800,000 deaths annually, making suboptimal breastfeeding the third-biggest risk factor for death in the developing world (source). This meta-analysis found that in the first six months, infants who were partially breastfed were over four times more likely to die than infants who were exclusively breastfed, and infants who weren’t breastfed at all were a staggering fourteen times more likely to die. And that didn’t change even when you controlled for all of the known confounders: maternal income, maternal health, job outside the home, etc. What an effect! And it was used by the GBD program, widely cited in cost-effectiveness studies, and seemed to be well-endorsed by the public health community.

Now, unlike for water quality, there aren’t any randomized trials of breastfeeding directly, because it would be unethical to make participants not breastfeed their children—breastfeeding has known health benefits. But it is ethical to take two random groups and give one of them support for breastfeeding—peer counselling, education about its benefits, etc. And this raises breastfeeding rates substantially, often almost doubling them.

Most of these studies don’t measure mortality directly, presumably because they don’t have enough statistical power to detect a reasonably-sized difference. But many of them did record deaths, because when one of your subjects dies you have to drop them from your study, and it’s good practice to record the amount of attrition and the reasons for it. And when I pooled all the studies that recorded mortality this way, the result was totally inconsistent with the results obtained by Lamberti et al. Instead of the drop in mortality that you might expect to find, mortality was higher in the intervention groups, by a margin that fell just short of statistical significance. Ouch.
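
To make the pooling step concrete, here is a minimal sketch of one standard way to combine death counts across trials: a fixed-effect, inverse-variance pooled risk ratio. It is not necessarily the method I used for the analysis above. The function name `pooled_risk_ratio` is hypothetical, and the counts in the usage example are invented purely for illustration.

```python
import math


def pooled_risk_ratio(trials, z=1.96):
    """Fixed-effect (inverse-variance) pooled risk ratio with a CI.

    Each trial is a tuple:
        (deaths_intervention, n_intervention, deaths_control, n_control)
    Every death count must be nonzero; trials with zero events need a
    continuity correction or a different method, which this sketch omits.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for deaths_i, n_i, deaths_c, n_c in trials:
        log_rr = math.log((deaths_i / n_i) / (deaths_c / n_c))
        # Large-sample variance of the log risk ratio.
        var = 1 / deaths_i - 1 / n_i + 1 / deaths_c - 1 / n_c
        weight = 1 / var
        weighted_sum += weight * log_rr
        total_weight += weight
    pooled = weighted_sum / total_weight
    se = math.sqrt(1 / total_weight)
    return (math.exp(pooled),
            math.exp(pooled - z * se),
            math.exp(pooled + z * se))


# Hypothetical counts, NOT taken from any real breastfeeding trial.
example_trials = [
    (12, 500, 8, 500),
    (9, 300, 6, 310),
    (15, 800, 11, 790),
]
rr, ci_low, ci_high = pooled_risk_ratio(example_trials)
print(f"pooled RR = {rr:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

In this framing, a pooled ratio above 1 whose interval still includes 1 is roughly what "higher, but not quite significantly" looks like.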

So what gives? Again, I don’t know. But my suspicion, after reading about 30 papers on breastfeeding and crunching a bunch of numbers, is that there’s simply too much variation between countries in who breastfeeds, and in what mediates its benefits, to say anything very general about its effects. In some countries, the problem is that children are weaned too early; this might deprive them of nutrition or antibodies in their mother’s milk. In others, mothers breastfeed for long enough, but they introduce complementary foods (which may be contaminated with pathogens) too early, causing diarrhea, respiratory infections, and allergies. In yet other countries, mothers tend to introduce solid foods too late, which can cause stunting or underweight since breastmilk eventually stops having enough nutrition for a growing baby. There seems to be some sort of bonding effect as well, although this isn’t usually discussed in developing-world studies. And breastfeeding might or might not end up having a largish effect on IQ down the road, maybe through the same nutritional or health effects as before. So that’s at least four different mechanisms and six effects. Yikes.

The meta-analysis of Lamberti et al. drew from many studies, but their results on all-cause mortality were based on only three. So it’s possible that that simply wasn’t a representative enough sample of countries. In effect, the data was highly clustered, and their confidence intervals should have been way wider. Or maybe there were unknown confounders that would have changed the analysis when controlled for. Who knows!
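
As a rough illustration of why clustering should widen a confidence interval, here is a sketch using the Kish design effect, 1 + (m - 1) * ICC, where m is the average cluster size and ICC is the intra-cluster correlation. This is a textbook approximation, not a reconstruction of the Lamberti et al. analysis. The function names and every number in the usage lines are hypothetical.

```python
import math


def design_effect(avg_cluster_size, icc):
    """Kish design effect: variance inflation from clustered sampling."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def cluster_adjusted_ci(estimate, naive_se, avg_cluster_size, icc, z=1.96):
    """Confidence interval after inflating a naive standard error."""
    adjusted_se = naive_se * math.sqrt(design_effect(avg_cluster_size, icc))
    return estimate - z * adjusted_se, estimate + z * adjusted_se


# Hypothetical inputs chosen only to show the direction of the effect.
naive = cluster_adjusted_ci(1.4, 0.2, avg_cluster_size=1, icc=0.0)
clustered = cluster_adjusted_ci(1.4, 0.2, avg_cluster_size=1000, icc=0.05)
print("naive CI:    ", naive)
print("clustered CI:", clustered)
```

Even a small within-cluster correlation inflates the interval a lot when each cluster (here, each study or country) contributes many correlated observations.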

Public health is really hard

This probably doesn’t come as a surprise after what I just wrote, but I hadn’t previously grasped just how difficult it is to know anything about public health. For instance, after reading maybe ten different randomized controlled trials of breastfeeding promotion, I realized just how hard it is for study designers to get all the details right.

In fact, there were over 50 decent randomized controlled trials of breastfeeding support, according to a Cochrane review on the subject. Of those, only 11 measured any secondary outcomes—some kind of health outcome like reports of infections or hospitalizations that would add to the evidence base on the actual effects of breastfeeding. For the ones that did, there was no standardized set of metrics to test, so it was impossible to formally meta-analyze the secondary outcomes; I ended up just making a table of all the effects and staring at it until I came up with some sort of intuition about what was going on. Many of the studies had problems with the mechanics of the trial, like allocation (it turns out that assigning to treatment group by alternating based on the baby’s birth week is, um, not very random) and concealment (no, your data collectors aren’t supposed to know which group their interviewees are in). The lack of standardization and the difficulty with protocols made it much harder to understand the big picture on breastfeeding promotion.
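
To show the difference between alternation and actual randomization, here is a small sketch. The function names are hypothetical, and neither function is taken from any of the trials I read; the second is just one simple way to generate a balanced random schedule.

```python
import random


def alternate_by_birth_week(birth_week):
    """The flawed scheme: the arm is fully determined by the birth week."""
    return "intervention" if birth_week % 2 == 0 else "control"


def randomized_allocation(participant_ids, seed):
    """Balanced random allocation: shuffle an equal mix of arm labels."""
    ids = list(participant_ids)
    arms = ["intervention", "control"] * ((len(ids) + 1) // 2)
    arms = arms[:len(ids)]
    rng = random.Random(seed)  # seeded so the schedule can be audited later
    rng.shuffle(arms)
    return dict(zip(ids, arms))


# Anyone who knows the calendar can predict the alternating scheme.
print([alternate_by_birth_week(week) for week in range(1, 5)])

# Hypothetical participant IDs and seed, for illustration only.
schedule = randomized_allocation(range(1, 9), seed=2014)
print(schedule)
```

For concealment, the schedule would be generated and held by someone other than the people recruiting participants and collecting data, so that nobody in the field can predict or learn the next assignment.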

GiveWell takes a broader view than I thought

GiveWell is often accused of treating charity too narrowly. The focus on global health can come off as a kind of measurability bias: focusing on the best-understood, most-researched charitable cause and ignoring anything that’s too hard to analyze. Within global health they’ve also been accused of treating symptoms rather than underlying issues, again because something like distributing bednets is easier to understand than e.g. changing laws or improving education. Sometimes it’s also described as “risk aversion”: GiveWell doesn’t want to risk making a bad recommendation.

But in fact, their focus on symptom treatments in global health doesn’t arise from risk aversion or measurability bias so much as from regressing everything a ton: heavily discounting impressive-looking estimates toward a more ordinary expectation when the evidence behind them is weak. GiveWell has several times had the experience of making a “bad recommendation” in the sense of their top charity turning out to be not quite as good as they’d hoped (for instance, VillageReach’s data issues and AMF’s room for more funding). But I don’t think this has made them any more cautious (in the sense of giving up expected value to reduce variance), just changed their views on certain types of arguments.
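
One common way to formalize "regressing everything a ton" is normal-normal Bayesian shrinkage: combine a noisy estimate with a prior, weighting each by its precision. The sketch below is my own illustration of that idea and not a description of GiveWell's actual procedure. The function name and all numbers are hypothetical.

```python
def shrink_toward_prior(estimate, estimate_var, prior_mean, prior_var):
    """Posterior mean and variance for a normal prior and normal estimate."""
    precision_est = 1.0 / estimate_var
    precision_prior = 1.0 / prior_var
    post_var = 1.0 / (precision_est + precision_prior)
    post_mean = post_var * (precision_est * estimate
                            + precision_prior * prior_mean)
    return post_mean, post_var


# Two hypothetical charities with the same headline estimate (10 units of
# good per dollar, in made-up units) but very different evidence quality.
well_studied, _ = shrink_toward_prior(10.0, 1.0, prior_mean=1.0, prior_var=4.0)
speculative, _ = shrink_toward_prior(10.0, 100.0, prior_mean=1.0, prior_var=4.0)
print(f"well-studied: {well_studied:.2f}, speculative: {speculative:.2f}")
```

The noisier the estimate, the further it gets pulled back toward the prior, which favors well-evidenced interventions without any appeal to risk aversion.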

In retrospect, it’s somewhat surprising to me that the impression still persists that GiveWell is conservative and focused on strength of evidence rather than strength of effect. After all, GiveWell Labs is now maybe 50% of GiveWell’s resources, and their list of shallow cause overviews is pretty heavy on non-mainstream-EA causes like advocacy, existential risk, open borders, animal welfare, etc. I guess the myth that GiveWell is stodgy or risk averse will dissipate more as Labs becomes more active, but right now it still exists.

Things depend ridiculously on the details

Another thing GiveWell gets accused of is being too thorough. And they truly are ridiculously thorough. One coworker recounted doing some due diligence on GiveDirectly by checking that they had correctly audited the work of their field officers who checked villages to mark down which houses had thatched roofs—one of the criteria GiveDirectly uses to determine who received cash transfers. This involved looking at an aerial photo of each village in question and comparing it to the audit spreadsheet to make sure that GiveDirectly had noticed all the discrepancies in their field workers’ work. When she mentioned this, I couldn’t imagine how it could have a worthwhile return on time. (Update: I came in late at the meeting where my coworker described this process and apparently made the wrong inference. The auditing process involved comparing GiveDirectly’s spreadsheets to make sure that households where GiveDirectly found a discrepancy, e.g. in roof materials, were marked for additional auditing.)

But in the past, many cases for top charities have hinged on surprisingly picayune details. Errors in the DCP2 radically changed the cost-effectiveness estimates for deworming. A chart important for the evaluation of VillageReach turned out to overstate the case significantly because it didn’t mention missing data. Some strong-looking pieces of evidence for SCI were much weaker than they appeared once some unclear wording was interpreted correctly.

In fact, even the audit example turned out to be less extreme than it sounded. While looking at the spreadsheets, my coworker noticed that in one village every house was marked as needing an audit. It turned out to be just an error in copying over the data, but it might have been something as severe as the wording issues which essentially destroyed the evidential value of those studies. So GiveWell’s thoroughness is understandable; it’s been won through hard experience.

Research is harder than I thought

Before working there, I thought that GiveWell’s research process was relatively straightforward. I’d come up with (or be given) a question, read a bunch of papers on it, figure out what the answer was, shore up any holes in the reasoning, and then we’d know what to do about (e.g.) breastfeeding promotion in developing countries.

As the headline indicates, though, that’s not how it worked. My research was constantly in self-directed exploration-mode, probing a little bit further in various directions to see which ones would be the most fruitful in finding a conclusive view of how breastfeeding interventions work. Because of how extremely this kind of reasoning depends on the details of the analysis, this required me to hold a huge amount of information about the various breastfeeding studies in my head at once, and very frequently aggregate it into micro-decisions that could cause me to spend hours down a rabbit hole if I made them incorrectly.

Something about the high ratio of reading things and making decisions to actually taking actions made this kind of research pretty draining. I now understand much better why GiveWell claims that they’re bottlenecked on people despite the fact that general “research” skills seem like they should be in good supply: it’s easy for me to imagine people varying by orders of magnitude in their ability to direct their research in appropriate directions and produce a good synthesis of the amount of information I had to take in during my work there.