Transcript

Robert Wiblin: Hi listeners, this is the 80,000 Hours Podcast, the show about the world’s most pressing problems and how you can use your career to solve them. I’m Rob Wiblin, Director of Research at 80,000 Hours.

Before we get into it just a few quick announcements.

If you think of yourself as part of the effective altruism community you should fill out the 2018 effective altruism survey. This helps keep track of who is involved, how they’re trying to improve the world, and what they believe. I’ll put a link in the show notes and associated blog post.

If you want to get a high impact job you should check out our job board, which was recently updated with new vacancies. You can find it at 80000hours.org/job-board/. It’s where we list the positions we’re most excited about filling.

Finally I just wanted to give a shout out to our producer Keiran Harris who has been doing a great job editing the episodes and generally helping to improve the show.

And without further ado, I bring you Eva Vivalt.

Robert Wiblin: Today I’m speaking with Dr. Eva Vivalt. Eva is a lecturer in the Research School of Economics at the Australian National University and the founder of AidGrade, a research institute that pools together hundreds of global development studies in order to provide actionable advice.

Eva has a PhD in Economics and an MA in Mathematics from UC Berkeley, and an MPhil in Development Studies from Oxford University. She’s also previously worked at the World Bank. She’s a vegan, a Giving What We Can member, and principal investigator on Y Combinator Research’s randomized controlled trial of the basic income.

Thanks for coming on the podcast.

Eva Vivalt: Thank you. Great to be here.

Robert Wiblin: So, we’re going to talk a bit about your career as an economist and the various findings that you’ve had in your research over the last five years. But first, what are your main research interests these days? Is there any way of summarizing it? Is there a core topic that you’re looking into?

Eva Vivalt: So a lot of my work is really on how to make better evidence-based policy decisions. And part of that, that I’ve recently gotten into, is looking more at priors that people may have, both policy makers and researchers. And there’s lots, actually, to say about priors. But I think that’s a direction that my research has gone recently that actually relates quite well to some of the previous stuff, the linkage being evidence-based policy.

Robert Wiblin: There’s a lot of heavy material to cover there later on in the show. But to warm up let’s talk first about Y Combinator’s basic income study – what is the study looking at and what motivates it?

Eva Vivalt: Yeah, no, I’m really excited by this study. So essentially the study is to give out $1000 per month for either three or five years to a bunch of individuals who are randomly selected. So the randomization is at this individual level, it’s not actually like giving, for example, everybody in an area the program. There’s a control group as well that still gets some nominal amount too, hopefully, so that they continue to answer surveys and such. We’re looking at a variety of outcomes. Things like time use, for example. Most economists would say that if you give people money they should actually work a little bit less; that’s a completely rational thing to do. But if they are working less, what are they doing with their time instead? Because it could actually be really good for people to work less if they are, for example, getting more education so they can get a better job in the future. Or taking care of their kids, et cetera, et cetera. There’s all sorts of productive uses of time that one might find otherwise adding a lot of value. There’s health outcomes, education outcomes.

I should say this program is targeted to relatively poorer individuals and relatively younger individuals because the thought is it could actually change people’s trajectory over time. Those are kind of the areas where we might expect the money to go a bit farther and to see slightly larger effects.

Robert Wiblin: Interesting. Okay. So, given that it’s from Y Combinator, which is a tech startup accelerator, is it kind of motivated by the concern that everyone’s going to lose their jobs because of technology? Or is it just more prosaic issues around inequality and lack of opportunity in the United States?

Eva Vivalt: I think there’s a variety of motivations here. So I think in the background somewhere there is this concern about technology potentially displacing workers. I think there’s also some genuine utopian ideal of people should be able to do …

Robert Wiblin: They shouldn’t have to be wage slaves.

Eva Vivalt: Yeah, yeah, yeah. It’s not all negative if people lose their jobs, because people could lose jobs in a good way. Nobody actually wants hard work, in some regards. To be fair, it’s not really a great test of what happens if people lose jobs per se, because to do that what you’d want is a randomized controlled trial in which you fire people, which is not likely to happen anytime soon.

Robert Wiblin: Not going to get past the ethics board.

Eva Vivalt: Yeah. But I think this is more motivated by the idea that you can imagine some worlds in which what you would want to do is expand the social safety net. And if you are expanding the social safety net, this could be one relatively efficient way of doing so, and so let’s look at what the effects of this particular kind of program would be. And you might imagine that some kind of program like this would probably start out with targeting relatively poorer individuals even though a true basic income program would target everybody.

Robert Wiblin: So what do you expect to find? Given past studies that are similar. And also, how many people are in this study?

Eva Vivalt: We have about 1000 people in the treatment group, 2000 control and then this larger super control group for which we just have administrative data. It’s actually a decent sized experiment and there’ve not been … the most similar studies in the States are some of the negative income tax experiments and the EITC from the ’70s, there’s also I guess the Alaska Permanent Fund. The other similar ones I would say would be Moving to Opportunity and the Oregon Health Insurance Experiment. But these are all like … they’ve all got quite a lot of differences actually.

So, Alaska Permanent Fund; everybody just gets a certain transfer. So, that one actually is universal. It’s not very much of a transfer and you’ve got to use different approaches to evaluate it since everybody gets it. Oregon health insurance, well obviously that’s health insurance. Negative income tax experiments, those were quite old and had a lot of differential attrition issues.

Like I say, by now I think most economists would expect some effects on labor supply. There’s loads of papers on labor supply elasticity. I think there’s a little bit less on what people do with their time otherwise. One thing we’re doing is designing this custom time use app that people can put on their phones so we can sort of ping them and ask, “Hey, what are you doing right now?”

Robert Wiblin: Is there a key uncertainty that it’s trying to resolve? Like will people quit their jobs? Or will they become happier? Or will they spend more time on leisure with their family? That kind of thing.

Eva Vivalt: Yeah, so rather than one key outcome, we’ve got like lots of different families of outcomes. So we’ve got health outcomes, we’ve got education outcomes, we’ve got financial health, we’ve got subjective wellbeing, we’ve got this kind of employment/time use/income stuff. We’ve actually even got some more behavioral things like political outcomes, do people have more or less inter-group prejudice and other-regarding preferences, that kind of thing. So, we’ve got actually quite a lot of things. Also doing some things relating to work on scarcity, that people under a lot of economic pressure might make worse decisions. Is that a short-term effect? A long-term effect? That kind of thing.

So there’s actually quite a lot of outcomes and sometimes when I talk to people about it they get a little bit confused. We’re looking at so many different things, but I think for a study of this kind of cost it’s actually really good to get a lot of different outcomes from it.

Robert Wiblin: I just quickly did the maths and it looks like it should cost like 100 million dollars.

Eva Vivalt: Not quite, but still quite high up there. Yeah.

Robert Wiblin: I was just thinking, if you’ve got 1,000 people and you’re giving them $12,000 each, that would come to $12 million each year of the study, plus then the control group and all the other on-costs and so on. It depends how long you run it, but it’s a pretty serious expense.
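Rob’s back-of-the-envelope figure can be sketched in a few lines. The control-group stipend below is a made-up placeholder, since the interview only says the control group gets “some nominal amount”.

```python
# Back-of-the-envelope transfer costs for the shortest (three-year) arm.
# The $50/month control stipend is hypothetical; the transcript only
# says the control group gets "some nominal amount".
treated, monthly, years = 1000, 1000, 3
controls, control_monthly = 2000, 50

treat_cost = treated * monthly * 12 * years             # $36,000,000
control_cost = controls * control_monthly * 12 * years  # $3,600,000
total_transfers = treat_cost + control_cost
print(f"${total_transfers:,}")  # $39,600,000
```

Transfers alone for the three-year arm run to tens of millions before any survey and administrative costs, which is why Rob’s $100 million guess is only “not quite” right.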

Do you worry about having too many outcome variables? Or I suppose, you’ll be smart enough to adjust for the multiple testing problem.

Eva Vivalt: Yeah, we’re adjusting for that. We’re basically — within a type of thing, so like health, we’ll consider these as sort of like separate subject areas. So, there’ll be like a paper on health, a paper on financial health, et cetera. And then within each of those papers we’ll do all the appropriate family-wise error corrections, et cetera.
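The family-wise corrections Eva mentions can be sketched with the Holm step-down procedure, one standard choice; the transcript doesn’t say which correction the team actually uses, and the p-values below are hypothetical.

```python
# Holm step-down correction within one family of outcomes (say, the
# health paper). This is one standard family-wise error correction;
# the study's actual choice of method isn't specified in the interview.
def holm(p_values, alpha=0.05):
    """Return, for each p-value, whether to reject H0 after Holm correction."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject

family_p = [0.001, 0.012, 0.03, 0.20]  # hypothetical health-family p-values
print(holm(family_p))  # [True, True, False, False]
```

Holm is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate within each family of outcomes.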

Robert Wiblin: Yeah. Are you going to preregister the analysis do you think?

Eva Vivalt: Yes, we will.

Robert Wiblin: Excellent. That’s great.

So what’s your role in the whole thing? There’s quite a significant number of people involved right?

Eva Vivalt: Yeah, no, this is a great project. For the PIs it’s myself and Elizabeth Rhodes, who’s a recent PhD grad from Michigan, David Broockman, who’s a Stanford GSB assistant professor, and Sarah Miller, who’s a health economist at the business school at Michigan. So those are the PIs and then we’ve got like a larger advisory board. We’re trying to keep in touch with both relevant academics, a bunch of senior researchers, as well as people obviously who are involved in other similar projects that we try to continue to talk with.

Robert Wiblin: And what’s your niche?

Eva Vivalt: Well I’m just one of the PIs. “Just”, with quote marks. I think I was originally brought on board partially for experience with impact evaluations and sort of these large-scale trials.

Robert Wiblin: Yeah. When might we hope to see results from it? It’d be some years out.

Eva Vivalt: Yeah it will. The shortest treatment arm, that’s three years out. Actually we’d be gathering data slightly before the very end of it because what we don’t want to do is do the survey at the end of the three years and then get the effect of people coming off the program, that kind of transition effect. We’ve got a baseline survey, midline survey and endline survey, and we’ve got a bunch of little intermediate surveys along the way that people can do just quickly by themselves on mobile. And for the big surveys, we’re going to do the last of those like two and a half years in or so.

And even if we get like some early results, we’re not going to release the bulk of things until at least the end of that three year arm because things can always change and we don’t … because it’s a very high-profile study, what we don’t want is people to come away with some idea of how things went a year in and then three years in things have changed a lot but nobody listens to it. And it could also like affect some of the narrative. We don’t want the subjects to hear about themselves in the media, right? That would not be great.

Robert Wiblin: That would be disastrous really.

Another exciting thing you’re working on outside of your core research agenda is how to get people to accept ‘clean meat’, which we’ve recently done a few episodes on. That paper is called Effective Strategies for Overcoming the Naturalistic Heuristic: Experimental Evidence on Consumer Acceptance of ‘Clean’ Meat.

What did you look at in that study?

Eva Vivalt: Yeah, so we were interested in a few things. We were interested in looking at … I assume you’ve covered clean meat; clean meat is essentially, you can think of it as lab-grown meat or synthetic meat or some other kind of unpalatable terms, if you like.

Robert Wiblin: It’s the rebranding of that.

Eva Vivalt: Yeah, it’s the rebranding of that. So meat not from animals directly. Some people have got a knee-jerk reaction that, “Ew, this is disgusting. It’s not natural,” and so this is what we’re calling the naturalistic heuristic, that sort of prevents people from being interested in clean meat. And we’re looking at ways of overcoming that. We tried various methods, like directly saying, “Look, things that are natural aren’t necessarily good and vice versa.” We tried another appeal that was more trying to get them to think about things that they are quite happy with even though they are unnatural. So maybe prompt some sort of cognitive dissonance there. Like if they don’t like clean meat they should also not like a lot of other things that they do like.

Robert Wiblin: Vaccines.

Eva Vivalt: Yeah, yeah, and I mean there’s lots of foods that something has happened to them. Like they’re fermented or they just changed a lot from the past anyways. Like corn nowadays looks nothing like corn a long time ago, chickens nowadays look nothing like chickens a long time ago, et cetera. And we also looked at giving people sort of a descriptive norms type of approach of; other people are very excited about clean meat so maybe you should be, too.

It’s a little bit tentative but it seemed like the approach that was sort of trying to prompt cognitive dissonance by telling them about how there’s all these other unnatural goods that they like was maybe doing the best. The downside though is it did seem like quite a lot of … more people than I would have thought were actually quite negative towards clean meat. And especially, almost nothing did as well as — we had one treatment where we didn’t know how, a priori, how poorly people would respond to it. So we thought we’re going to prime some people with negative social information so that at least there’s some people for whom they’ve got some kind of anti, you know, they’ve got some kind of naturalistic [crosstalk 02:02:14].

Robert Wiblin: Some prejudice against it.

Eva Vivalt: Yeah, exactly. And it turned out that priming effect was pretty much bigger than anything else we found, which is kind of disappointing because you can imagine that the very first thing that other companies who produce conventional meat products will do, most likely, is to try to attack clean meat as like-

Robert Wiblin: Gross.

Eva Vivalt: Yeah. So that was a little bit unfortunate.

And we also did another study where we were looking at the effects of knowing about clean meat on ethical beliefs because we thought actually if the … to some extent your ethical beliefs could be a function of what you think is like fairly easy to do. And so if you think that there is a good alternative out there, it could actually potentially change your views towards animals more generally, or the environment. So we were using this negative priming as an instrument for people thinking more or less positively towards clean meat and then looking at the effect on ethical beliefs, and there was actually some evidence that people were changing at least their stated ethical beliefs. I think we need to do a few more robustness checks there, but it was still quite surprising.

Robert Wiblin: Yeah. Why do you think the ’embrace unnaturalness’ message worked the best? Do you have a theory there?

Eva Vivalt: My best guess is that it had something to do with cognitive dissonance and the fact that it was a relatively mild way of putting things. People don’t tend to like fairly strong messages against what they hold dear. We weren’t really undermining or trying to undermine what they were valuing, we were just saying, “Look, even by your own judgements here, to be consistent with your own things” …

Robert Wiblin: ‘You’re right about these other things, so why not be right about this one too’.

Eva Vivalt: Exactly.

Robert Wiblin: ‘You’re so smart’.

Eva Vivalt: It’s a very positive message in a way.

Robert Wiblin: How clear cut was the result? Are you pretty confident that that was the best one?

Eva Vivalt: You know I’m not 100% confident. So this is why I don’t want to oversell it because one could say this one was the one that sort of lasted the longest. We had like some follow ups. But at least in the short run, it could have also been — the descriptive norms might have done pretty well as well. So like it depends on whether you think — how we should weight the different rounds of data that we collected, right? And so we kind of pre-specified we were interested in the follow up but if you weren’t interested in that, if you thought that actually the early data should be somewhat informative about the later data, maybe the later data was just a bad draw, for example, then, you know.

So I wouldn’t lean too, too hard on it.

Robert Wiblin: Yeah. I mean I think that the naturalistic heuristic is one of the most consistently harmful heuristics that people apply because it causes them to, in my view at least, reach the wrong answer on so many different issues. And I wonder if there’s potential to have a non-profit that just relentlessly pursues this point that being unnatural is not bad, being natural is not good. They would help with clean meat, but also with so many other things as well.

Eva Vivalt: That’s a fair point, and while doing this we got introduced to so many people who are doing so much interesting work on vaccines, et cetera, that, you know …. Yeah, I think that especially in the future as biotech in general becomes better, et cetera, et cetera, there’s going to be so many new products that are unnatural that plausibly benefit from such a message.

Robert Wiblin: We just need a generic pro-unnaturalness organization that can kind of be vigilantes and go to whatever new unnatural thing people don’t like.

Eva Vivalt: Yes, exactly.

Robert Wiblin: Well, it sounds like clean meat is just kind of being developed now so there’s probably going to be … we’ll want to try out a whole lot of other messages, because you’ve only tried out three here. Were there any other messages that you considered including that you would like to see other people test?

Eva Vivalt: Hmm, that’s a good question. Things don’t come to mind at this moment but I do think there’s a lot more room for further research here. Especially, one thing I don’t know about … I’m imagining that people are using unnaturalness … they seem to also think it’s unnatural and therefore it’s not healthy and therefore it’s all this other stuff. But I think there could be more done to break that down a little bit more because presumably you could in fact, at least theoretically, think that something is unnatural without thinking it’s necessarily unhealthy.

Robert Wiblin: So, you’ve written a paper that’s been pretty widely cited in the last few years called “How Much Can We Generalize From Impact Evaluations?” That was your job market paper, right?

Eva Vivalt: Yep.

Robert Wiblin: So that’s the work that you did during your PhD that you’re using to try and get a job, which we might talk about later. But, what question were you trying to answer with this paper?

Eva Vivalt: Yeah. So at the time that I was writing it, there was quite a lot of impact evaluation being done on various topics like de-worming, bednets, et cetera. But not so much of an effort to synthesize all the results. And so I’d started this non-profit research institute, AidGrade, to gather all the results from various impact evaluations and try to say something more systematic about them.

But in the course of doing so I was kind of shocked to see how much results really varied. And I think if you talk to researchers they’ll say, “oh yeah, we know that things vary. Of course, they vary. There’s obviously all these sources of heterogeneity.” But I think that the language people use when talking to the general public or to funders is actually quite a bit different. And there, you know, things get really simplified. So I think there’s a bit of a disconnect.

And anyways, I was investigating a little bit some of the potential sources of heterogeneity. I mean, at that point, what I’m looking at is observational data, even if the data are coming from RCTs, because I’m just looking at the results that the various papers found. So I can’t definitively say what the sources of the heterogeneity are, but I could at least look for correlates of that and also try to say something about how, in a way, we should be thinking about generalizability. And how there are some metrics that we can use that can help us estimate the generalizability of our own results.

Robert Wiblin: So basically, you’re trying to figure out if we have a study in a particular place and time that has an outcome, how much can we say that that result will apply to other places and times that this same question could be studied. Is that one way of putting it?

Eva Vivalt: Yeah, because you’ll never actually have exactly the same setting ever again. Even if you do it in the same place, things hopefully would have changed from the first time you did it. So we might naturally expect to have different results. And then the issue is, well by how much? And how can we know that?

Robert Wiblin: All right. So I’m the kind of guy who, when they load up a paper, skips the method section, skips straight to the results. So, how much can we generalize from studies in development economics?

Eva Vivalt: Not terribly much, I’m afraid to say. This was really disheartening to me at the time. Gotten over it a bit, but yeah. I guess one main takeaway as well is that we should probably be paying a little more attention to sampling variance in terms of thinking of the results of studies. Sampling variance is just the kind of random noise that you get, especially when you’ve got very small studies. And some small studies just happen to find larger results. So I think if we try to separate that out a bit and a little bit down-weight those results that are coming from studies of small sample sizes, that certainly helps a bit.
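The down-weighting Eva describes is, in its simplest fixed-effect form, inverse-variance weighting: each study counts in proportion to one over its squared standard error, so small noisy studies that “got lucky” move the pooled estimate very little. A minimal sketch with made-up numbers:

```python
# Fixed-effect inverse-variance pooling: small, noisy studies (large
# standard errors) get little weight. Effect sizes and SEs are made up.
effects = [0.40, 0.12, 0.10]   # one small study with a big effect, two larger ones
ses     = [0.30, 0.05, 0.04]   # the small study is much noisier

weights = [1.0 / se ** 2 for se in ses]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
print(round(pooled, 3))  # 0.111: close to the larger studies, not to 0.40
```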

Another thing that came out, and this is just an observational correlation, but one of the more interesting ones and I think it’s now part of the dialogue you hear from people, is that results from smaller studies that were done with an NGO, potentially as a pilot before government scale-up, those ones were initially more promising. And then the scale-ups didn’t live up to the hype as it were. Like the government-implemented larger versions of the same programs, or similar programs, they didn’t seem to do so well. So that’s a little bit disconcerting, if we think that generally we start as researchers by studying these interventions in smaller situations in the hopes that when we scale it up we’ll find the same effects.

Robert Wiblin: Hmm. So is the issue there that NGOs do these pilot studies and for those pilot studies they’re a bit smaller and the people who are running them are very passionate about it, so they run them to a very high standard? Or they offer the intervention to a very high standard. But then when it’s scaled up, the people who are doing it they don’t have a much money or they don’t know what they’re doing. And so the results tend to be much worse?

Eva Vivalt: Yeah, I think that’s part of it. There could also be like a targeting aspect of this. You start with the places where you think there’s going to be particularly high effects. And then, as you scale it up, you might end up expanding the treatment to some people who are not going to benefit as much. And that would be, actually, completely fine. The worst story is where the initial NGO, or the initial study, everybody was very excited about it and put a lot of effort into it. And then maybe capacity constraints worsened it when it was scaled up. So, that’s a little more disconcerting I guess.

Robert Wiblin: Right. So let’s just back up a little bit. You said the answer is that we can’t generalize very much from these development studies. What is your measure of generalizability, statistically? And on a scale between zero and one, where do we stand?

Eva Vivalt: Yeah, so that’s an excellent question. One of the things I argue for in my paper is that we should be caring about this true inter-study variance term. Which, I and some other people like Andrew Gelman call tau-squared. Which one has to estimate, you don’t know that up front. But that this is a pretty good measure of, well, the true inter-study variance.

And there’s also a related figure that that ties into, which is called the I-squared. Where you’ve got essentially the proportion of the variance that’s not just sampling error. And that’s nice because it’s a unitless metric that’s well established in the meta-analysis literature. And it kind of ranges from zero to one and it’s very much related to this pooling factor, where if you’re trying to think about how much to weight a certain study, you might think of putting some weight on that study and some weight on all the other studies in that area.

And if you’re doing that, there’s some weight that you can put on one individual study’s result and that would range between zero and one. And similarly, for the weight you put on all the other studies’ results. I’m not sure if that completely answered your question.
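The quantities Eva describes can be sketched with the standard DerSimonian-Laird estimator from the meta-analysis literature; the effect sizes and standard errors below are hypothetical.

```python
# DerSimonian-Laird estimates of tau^2 (true inter-study variance) and
# I^2 (share of total variation that is not just sampling error).
# Effect sizes and standard errors below are hypothetical.
effects = [0.05, 0.10, 0.30, 0.18]
ses     = [0.04, 0.05, 0.08, 0.06]

w = [1.0 / se ** 2 for se in ses]
pooled = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)

# Cochran's Q: weighted squared deviations from the pooled estimate
Q = sum(wi * (ei - pooled) ** 2 for wi, ei in zip(w, effects))
df = len(effects) - 1
C = sum(w) - sum(wi ** 2 for wi in w) / sum(w)

tau2 = max(0.0, (Q - df) / C)  # in the outcome's own (squared) units
I2 = max(0.0, (Q - df) / Q)    # unitless, ranges from 0 to 1

# Pooling weight: how much to trust study 1's own result vs. the rest
shrink = tau2 / (tau2 + ses[0] ** 2)
```

Because tau-squared carries the outcome’s units, it can’t be compared across interventions with different outcomes; I-squared can, which is the advantage Eva goes on to describe.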

Robert Wiblin: Yeah.

Eva Vivalt: But there are these metrics you can use, and I would completely agree, and I was trying to push for initially that … I mean, I’m still trying to push for it, but I think it’s now more accepted that we should be thinking of generalizability as something that is non-binary that lies somewhere between zero and one.

Robert Wiblin: So, what is tau-squared? I saw this in the paper, but to be honest I didn’t really understand what it actually is. Is this some kind of partition of the variance that’s due to … I just don’t know.

Eva Vivalt: Yeah, no worries. So essentially, yeah, you can think of it as some measure of …. Okay, you’ve got a whole bunch of different results from different studies. Some of that variation is just due to sampling variance. So if you think of these studies as all replications, I mean they’re not, but if you were to think of them as replications then the only source of variance would be the sampling variance because you’d be drawing an observation from some distribution. And you’d be drawing a slightly different observation, so you’d get a little bit of noise there naturally …

Robert Wiblin: So that’s just some studies get lucky and some studies get unlucky in a sense. So they have higher or lower numbers just because of what individuals they happened to include?

Eva Vivalt: Yeah, exactly. And so if you’re then thinking okay well we’re not actually really in a case of replications. We’re actually in a case where there is a different effect size in every place that we do the study because there’s so much heterogeneity. Like, there’s other contextual factors or whatnot. Well, then you’ve got not just this sampling variance, but also some additional sort of true latent heterogeneity that you need to estimate.

Robert Wiblin: That the effect was different in the different cases.

Eva Vivalt: Exactly. Exactly. So, I’m just arguing for separating the two of these things out. And then trying to say, well this is the true heterogeneity.

And you could go even a step further and say well, maybe we can model some of the variation. And maybe we want to think that the important thing in terms of generalizing is how much unmodeled heterogeneity there is. Like how much we can’t explain. Like if we can say that, for example, well I’ve got a conditional cash transfer program and I want to know the effects on enrollment rates and maybe I think baseline enrollment rates are really important in determining that. Because it’s probably easier to do a better job in improving the enrollment rate from 75% than from 99%, right? It’s just a little bit easier. So, you can say okay well then I’ve got some model where baseline enrollment rates are an input into that model. And then after accounting for baseline enrollment rates, what’s sort of the residual unexplained heterogeneity in results. Because that’s going to be the limiting factor on how much I can actually extrapolate from one setting to another accurately.
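The conditional cash transfer example can be sketched as a one-covariate model: regress each study’s effect on its baseline enrollment rate, then look at how much of the spread in effects remains. All numbers are hypothetical, and plain OLS here stands in for a proper meta-regression, which would also weight studies by their sampling variance.

```python
# One-covariate model of heterogeneity: regress each study's effect on
# its baseline enrollment rate, then see how much variation is left
# unexplained. All numbers are hypothetical; a real meta-regression
# would also weight studies by their sampling variance.
baseline = [0.75, 0.80, 0.90, 0.95, 0.99]  # baseline enrollment rates
effect   = [0.12, 0.10, 0.06, 0.04, 0.01]  # effect on enrollment rates

n = len(baseline)
mx = sum(baseline) / n
my = sum(effect) / n
b = sum((x - mx) * (y - my) for x, y in zip(baseline, effect)) / \
    sum((x - mx) ** 2 for x in baseline)
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(baseline, effect)]
var_total = sum((y - my) ** 2 for y in effect) / n
var_resid = sum(r ** 2 for r in residuals) / n

print(round(b, 2))                      # negative: higher baseline, smaller effect
print(round(var_resid / var_total, 2))  # share of heterogeneity left unexplained
```

The residual share, not the total spread, is what limits how accurately one can extrapolate from one setting to another once baseline enrollment is accounted for.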

Robert Wiblin: Okay. So a tau-squared of one would indicate that all of them had the same effect in every case that they were implemented. And a zero would indicate that it was totally random, the effect that it would have in each different circumstance. Is that right?

Eva Vivalt: Not quite, actually. Sorry, I might have explained this a little bit funny. So there is something that ranges between zero and one, which is either the I-squared or this pooling term. But the tau-squared itself, you can think of it as a kind of variance. It’s going to really be in terms of the units of whatever the thing was initially. So if it’s conditional cash transfers on enrollment rates, enrollment rates are maybe in percentage points. So then the variance would relate to those units of enrollment rates. And so that’s actually a great point because it’s going to be very difficult to compare the tau-squared of one particular outcome to the tau-squared of a completely different intervention’s effect on a completely different outcome because those things are going to be in different units entirely.

That’s one advantage of I-squared relative to tau-squared, is that I-squared is unitless. It kind of scales things. So that does run between zero and one, and does not depend on the units. Although it’s not 100% straightforward either. I mean, that has also got some drawbacks.

I’m trying to summarize the paper here, but I guess if one’s really super interested in these issues I would just recommend reading the paper.

Robert Wiblin: Taking a look at it.

Eva Vivalt: It goes in much greater detail. I’m simplifying a bit here.

Robert Wiblin: Sure, okay. We’ll definitely stick up a link to it.

So let’s say that we had a new intervention that no one really knew anything about. And then one trial was done of it in a particular place, and it found that it improved the outcome by one standard deviation. Given your findings, how should we expect it to perform in a different situation. Presumably less than one standard deviation improvement, right?

Eva Vivalt: Yeah. I mean, to be honest, one standard deviation improvement is just huge. Enormous.

Robert Wiblin: I was just saying that because one’s a nice round number.

Eva Vivalt: Oh yeah. But the typical intervention is going to be more like 0.1 rather than one. So if I saw one somewhere, I’d be like, wow, that’s got to be a real outlier. That was a very high draw. So I would be skeptical just for that reason.

Robert Wiblin: Okay, so I’ve got 0.1. What might you expect then if it was done somewhere else?

Eva Vivalt: Well, it’s going to depend a lot on the intervention and the outcome. And if I’m using some more complicated model. I think the best way to answer those questions is to look at a specific intervention and a specific outcome and try to model as much of the heterogeneity as possible. And there’s not going to be any substitute for that, really.

What I’m looking at in my paper is trying to say something like, well that might be so. But still, what can we say about looking across all the interventions, across all the outcomes? And that’s where I pick up patterns like if it’s done by an NGO, if it’s a relatively smaller program it tends to have higher effects. But that’s a little bit hand-wavy. I think the best way to answer those questions in terms of what do I really find is to go to that particular intervention, that particular outcome.

But what I can say is that even with one study’s results (and this is pretty weak, but it’s still true, there’s still a relationship), if you look at the heterogeneity of results within the study, that actually does predict the heterogeneity of results across studies. I mean, weakly. And there’s no reason for it to necessarily be true, but it is a stylized fact that one could use.

Robert Wiblin: Hey, I just wanted to interject that I later emailed Eva to see if there was any rule of thumb we could use to get a sense of how bad the generalisability is from one study to another.

One option is to say that:

The median absolute amount by which a predicted effect size differs from the true value found in the next study is 99% of the prediction. In standardized values, the average absolute value of the error is 0.18, compared to an average effect size of 0.12.

So, colloquially, if you say that your naive prediction was X, well, it could easily be 0 or 2X; that’s how badly this estimate was off on average. In fact, it’s as likely to be outside the range between 0 and 2X as inside it.

This wouldn’t be rigorous enough to satisfy an expert in the field, but it’s good enough for us here. Back to the interview.
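To make that rule of thumb concrete, here is a small sketch using the two standardized numbers quoted above (a 0.18 average absolute error against a 0.12 average effect size); the prediction X is a hypothetical example value, not anything from the paper:

```python
# Rule-of-thumb sketch using the standardized numbers quoted above;
# the prediction X below is a hypothetical example value.
avg_effect_size = 0.12  # average effect size, in standard deviations
avg_abs_error = 0.18    # average absolute prediction error, in standard deviations

# How big the typical miss is relative to the typical effect.
relative_error = avg_abs_error / avg_effect_size
print(f"Typical error is {relative_error:.0%} of the typical effect size")  # 150%

# If a naive prediction for a new setting is X, the next study's estimate
# could easily land anywhere from about 0 to 2X (and beyond).
x = 0.12
low, high = x - avg_abs_error, x + avg_abs_error
print(f"Plausible range around X = {x}: [{low:.2f}, {high:.2f}]")
```

On this back-of-the-envelope view, a single study's point estimate says rather little about the magnitude you should expect in a new context, which is just a restatement of the "0 to 2X" point above.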

Robert Wiblin: Okay. So did you find out under what circumstances results are more generalizable and when they’re less generalizable?

Eva Vivalt: Yeah. So again this is a little bit hand-wavy and I think a little bit less the point of the paper, because like I say, even though these studies are mostly RCTs, when I’m looking at them, at that point it’s as though I’ve got observational data. Because the studies are selected in various ways that … where people even choose to do the studies is selected and I’m just looking at this data. But despite that, if you do the naïve thing of doing ordinary least squares regression of your effect sizes on various study characteristics…. So I mentioned bigger programs and government-implemented programs tend to do worse. There’s not much of a general trend in other things. In particular, it doesn’t seem to matter so much if it’s an RCT or not. Or where it was done.

Actually, one thing I did find is, you can’t even necessarily just say …. So often you hear from policy-makers and researchers, “well we’ve got results from one particular country. So at least we know how it works in that country.” And actually, I would disagree with that. Because even within a country, if you’ve got multiple results from the same country, they don’t predict each other very well. And it makes sense if you think about, you know, I don’t think anybody would say within the US, “oh yeah, well results from Massachusetts are going to be very similar to results from Texas,” or something like that. Right? Even within a country there’s so much variation that maybe it’s no better than taking results from a completely different area of the globe. But it’s still not that great and I can’t actually even find any kind of statistically significant relationship within a country.

Robert Wiblin: Isn’t this pretty damning? Why would we bother to do these studies if they don’t generalize to other situations? It seems like we can’t learn very much from them.

Eva Vivalt: Yeah, so that’s a great devil’s advocate type question. I’m still, despite all this, an optimist that we’re learning something. Right? Because part of it is that this way of looking at it doesn’t model all the little factors. I mean, I am actually quite skeptical of most of the stories that people tell about why an intervention worked in one place and why it didn’t work in another place. Because I think a lot of those stories are constructed after the fact, and they’re just stories that I don’t think are very credible. But that said, I don’t want to say that we can learn nothing. I would just say that it’s very, very hard to learn things. But, what’s the alternative?

Robert Wiblin: Well, I guess, potentially using one’s intuition. But one thing you could say looking at this, is that it’s not really worth running these studies. An alternative view would be that because each study is less informative than we thought, we have to run even more of them. Do you have a view between those two different ways of responding?

Eva Vivalt: Yeah. I would argue for running more of them, but not in a completely senseless manner. I think we can still say something about …. There are ones which are higher variance, where we could learn more, where the value of information of doing another study is going to be higher.

So, I guess part of this depends on, sorry to get into technical details but …

Robert Wiblin: No, go for it.

Eva Vivalt: … the decision problem we think people are faced with. Right? Because if you think that a policy-maker is, what they really care about in making their decision is whether some result is statistically significant and better than some other result in a statistically significant way. Well okay, then that’s a different problem from if they are just trying to find, if they’re okay with something that there’s a 20% chance works better than the alternative.

So think of this all in terms of: there is some problem that a policy-maker is trying to solve, and then within that problem you’ve got the ability to run studies or not run studies. And the value of information of running each of those things is going to be different depending on how much underlying heterogeneity there is.

Just to be a little bit simpler about this, the intuition is that if you’ve got … I mean, the studies that are the most valuable to run would be the ones where you don’t know very well a priori what’s going to happen. You’ve got a higher degree of uncertainty up front. But where you think there is a good upswing potential, as it were, right? Like it could overtake the best possible outcome.

Robert Wiblin: A lot of value of information, I think is the …

Eva Vivalt: Yeah, exactly.
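That value-of-information intuition can be sketched with a small simulation. Everything here is hypothetical (the effect sizes, noise levels, and decision rule are all made up for illustration): a policy-maker chooses between a known outside option and an uncertain new program, either with or without one noisy study.

```python
import random

random.seed(0)

def value_of_information(prior_mean, prior_sd, noise_sd, outside_option, n=100_000):
    """Monte Carlo sketch: expected gain from running one noisy study
    before choosing between a new program and a known outside option.
    All quantities are hypothetical effect sizes in standard deviations."""
    prior_var, noise_var = prior_sd ** 2, noise_sd ** 2
    gain = 0.0
    for _ in range(n):
        true_effect = random.gauss(prior_mean, prior_sd)
        signal = true_effect + random.gauss(0, noise_sd)
        # Standard normal-normal Bayesian update of the mean.
        post_mean = (prior_mean / prior_var + signal / noise_var) / (
            1 / prior_var + 1 / noise_var
        )
        # Without the study: choose on the prior mean alone.
        choice_without = true_effect if prior_mean > outside_option else outside_option
        # With the study: choose on the posterior mean instead.
        choice_with = true_effect if post_mean > outside_option else outside_option
        gain += choice_with - choice_without
    return gain / n

# Higher prior uncertainty makes the study worth more:
print(value_of_information(0.10, 0.05, 0.10, 0.12))  # low uncertainty
print(value_of_information(0.10, 0.30, 0.10, 0.12))  # high uncertainty
```

Raising the prior uncertainty (the second call) raises the value of running the study, matching the intuition that the most valuable studies are the ones where you don't know well a priori what will happen.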

Robert Wiblin: Okay. We’ll come back to some of those issues later because you have other papers that deal with how these RCTs can inform policy-makers.

But let’s just talk a little bit more about your method here. So, how did you collect all of this data on all these different RCTs? It sounds like an enormous hassle.

Eva Vivalt: Yeah. I wouldn’t recommend it. I mean, obviously one has to do it. But, oh, my goodness. I think I was very lucky actually to have a lot of great help from various RAs over the course of several years, through AidGrade, who were gathering and double-checking and sometimes triple-checking some of this data. All the data was gathered by two people, and if their inputs disagreed, then a third person would come in and arbitrate. So that’s how we got all of the characteristics of the different studies coded up, and all the effect sizes.

I am hopeful that in the future, we’re going to be able to do a lot more with automated reading of these papers. You would think that’s absolutely crazy, but I think it works pretty well so far. I mean, not for the actual results tables. I think the results tables are actually the hardest task in a way, because you need to really know what a particular result represents. Is this a regression with controls, is it with whatever else. What methods, et cetera. But for basic characteristics of studies, like where was it done, was it an RCT or not, those kinds of things, actually we’ve had pretty good success with some pilot studies trying to read that automatically through natural language processing.

And that, I think, is really the best hope for the future. Because studies are coming out so quickly these days that I think to keep abreast of all of the literature and all the various topics — I mean, it’s even more of a constraint for the medical literature where there’s loads of studies and new ones coming out all the time. Meta-analyses can go out of date quite quickly and they’re not really incentivized properly in the research community so the only way to get people to actually do them and keep the evidence up-to-date in some sense is by at least making the process easier.

I don’t think that it can be ever 100% done by computer. I think you’re still going to need some inputs from people. But if you can reduce the amount of effort it takes by 80% or 90% and just have people focus on the harder questions and the harder parts of that, that would be a huge benefit.

Robert Wiblin: Do you think there’s enough of this data aggregation? Or are there too few incentives for people to do this in academia?

Eva Vivalt: No, I think the incentives are all wrong. Because researchers, they want to do the first paper on a subject. Or ideally, if not the first then the second. The third is even worse than that. And by the time you get to do a meta-analysis, well that’s kind of the bottom of the bin in some regards. You think it would be more highly valued, but it’s not.

Robert Wiblin: Wouldn’t you get a lot of citations from that? Because people would trust the results of a meta-analysis more than the individual papers.

Eva Vivalt: I think that’s fair. And you can get some fairly well cited meta-analyses. Unfortunately, citations are just not the criterion that’s really used for evaluating research in economics. I know it is more so in other fields, but not so much in economics where it really is the journal that matters.

Robert Wiblin: So the journals that publish that kind of thing just aren’t viewed as the most prestigious?

Eva Vivalt: Yeah, that’s exactly right.

Robert Wiblin: I’ve also heard that in fields where collecting a big data set, especially a historical data set, is what enables you to ask a lot of new questions, there are perhaps too few incentives to put it together. Because you do all of the work of putting it together, then you publish one paper about it, and then other people will use the same dataset to publish lots of papers themselves. And in a sense you don’t get the full fruit of all of the initial work that you did. Is that a possibility here, where other people can now access this dataset of all of these different RCTs that you’ve compiled and so you don’t … kind of they drank a bit of your milkshake, in a sense.

Eva Vivalt: I wouldn’t put it that strongly, both because I’m actually quite happy if other people do things with the data and also because …. It depends I guess where you are at in the process. I think for people who are just finishing up their PhD, for example, it’s actually very good to show that you can compile a very large dataset because that’s what a lot … a lot of research depends on having very good data and if you can show that you can collect really good data then that’s great for you. Obviously you also want to publish well based on that. That’s, I guess, a separate question.

Robert Wiblin: So, what are the biggest weaknesses of this study? Do you think that we should trust this result, that results aren’t that generalizable? Or is this something that could be overturned with future research?

Eva Vivalt: I don’t think it’s really in danger of being overturned per se. That’s just a function of the fact that we’re doing social science and there are all sorts of things that can change and that matter for your treatment effects. So, yeah, I’m not tremendously concerned about that.

Robert Wiblin: So what kinds of studies did you include in this particular dataset? For example, you were looking at development studies.

Eva Vivalt: Yeah.

Robert Wiblin: If you looked instead at say, education studies in the developed world. Might you get a different results if you were looking at a different domain or field?

Eva Vivalt: Maybe. I think the bigger difference, though, would probably be with things that are less, at least intuitively, context-specific. Things like health …

Robert Wiblin: Medicine.

Eva Vivalt: Yeah, exactly. So for example in our data, the things that actually varied more were the health interventions. But that’s because we weren’t controlling for things like baseline incidence of disease or any of those kinds of things.

Robert Wiblin: Right.

Eva Vivalt: And if you do control for those, then, I mean, we weren’t doing that in the general analysis, but if you do control for them then actually the heterogeneity is a lot smaller. So, things that have a clearer, more straightforward causal effect, there we might expect to see slightly different results.

Robert Wiblin: Hmm. So kind of antibiotics will usually treat the same disease anywhere. But I suppose in these studies they actually have different impacts because in different places people have the underlying disease at different levels.

Eva Vivalt: Yeah, exactly. Yeah. I mean, everybody I think at this point would agree that things like de-worming et cetera depend on what the baseline prevalence of the worms, or whatever, is. And once you control for those things, then you actually … Because there’s some very clear mechanisms through which these things work, there are fewer things that can go wrong. Whereas the more general social science type thing, there’s so many factors that feed into what the treatment effects ultimately are, so it’s a little bit messier.

Robert Wiblin: So you wrote another paper called “How Much Can Impact Evaluations Inform Policy Decisions?”, which I can imagine was partly informed by this other paper. Do you want to explain what you found there?

Eva Vivalt: Sure. So that paper is looking a bit at, well the fact that if we do try to put this into some kind of framework where a policy-maker is deciding between different options, and they’re always going to want to choose the thing that has the highest effect. Well, given the heterogeneity we observe, how often would they actually change their mind? You know, if the outside option takes some particular value. So, yeah, it’s quite related.

We also tried to use some priors that we had collected. Some predictions that policy-makers had made about the effects of particular programs.

Robert Wiblin: So just to see if I’ve understood the set-up correctly, you’ve got this modeled agent, which I guess is a politician or a bureaucrat or something. And they’ve got some background thing that they could spend money on; perhaps this is spending more money on schools or whatever else. And they think that they know how good that is. And so that’s somewhere they could stick the money. And then you’re thinking of the value of a study on another thing, that might be better, or might be worse. And the bureaucrat, even though there haven’t been any studies done yet, or not many, has some belief about how good this other option is, this new option. But they’re not sure about it, and they would somewhat change their mind if a randomized control trial were done. And then you want to see, well, how often would that trial cause them to actually change their decision and go for this alternative option?

Eva Vivalt: Yeah, that’s exactly it. You’re putting it much better than I did.

Robert Wiblin: So, what did you find? Is there any way of communicating how often people do change their mind? And maybe perhaps what’s the monetary value of these studies?

Eva Vivalt: That’s an excellent question. So, we didn’t actually connect it to actual monetary value because that depends a bit upon what you think the value of some of these outcomes is. We did this a little bit abstractly, trying to compare two programs that — one was 90% of the value of another one, or 50%. But we weren’t actually making assumptions on the final, the last mile type part of “well yeah, but what is this actually worth?” I mean, that’s going to depend a bit on what the actual outcomes and the values of the outcomes are.

So, I wish I had a better answer is what I’m trying to say.

Robert Wiblin: Okay. So in the abstract you wrote, “We show that the marginal benefits of a study quickly fall and when a study will be the most useful in making a decision in a particular context is also when it will have the lowest external validity,” which is a bit counter-intuitive. And then also, “The results highlight that leveraging the wisdom of the crowds can result in greater improvements in policy outcomes than running an additional study.”

Did you want to explain those sentences?

Eva Vivalt: Sure. So, yeah. I think one of the interesting things is the statement that when a study will be most useful is when it will have the lowest external validity. That relates to the point that, in a sense, when is a study going to be most useful? When it surprises us and is really different. And when is it going to be the most different? Well, when we’re not going to be able to generalize from it, when it’s got some underlying factors that make it a little bit weird in some way. It’s going to be the highest value in that setting, but if you try to think about extrapolating from it…

Robert Wiblin: So is it not so much that that study can’t be generalized to other things that makes it valuable, but rather that other things can’t already be generalized to this one? So this is a more unique case?

Eva Vivalt: Yeah. And I mean, it could go either way in the sense that if you think that the other studies haven’t found this particular thing, and this particular thing is a bit unique, well, likewise, you wouldn’t expect this unique thing to say much about those other ones either. So, again, this is a little bit abstract because you can try to think about, “well, yes, but does this new thing tell us something about some other, more complicated underlying models of the world as to why this one happened to be so surprising?” But yeah, that’s just the general intuition.

And then with respect to leveraging the wisdom of the crowds, well, we did look at different kinds of ways of making decisions. We looked at a dictator making a decision all by themselves versus a collective of various bureaucrats voting, just using a majority voting rule to try to decide which particular intervention to do. And there, because people can frequently be wrong, actually adding additional people to the set of people who are making the decision can lead to substantial benefits in terms of choosing the right program afterwards. There were actually some simulations in which it performed better.

Robert Wiblin: Are you saying that running these broad surveys is potentially more informative than an RCT? And I guess also presumably cheaper as well? Or at least in the model.

Eva Vivalt: Yeah, so I guess … So in the model it’s more a matter of how many people are making the decision and how many people’s inputs are being fed into this process. So, I guess if you’ve got a more democratic decision making process or you involve more people, their priors are more likely to be correct in that case. Sort of like their aggregate prior. And the benefits of just doing that can be higher than the benefits of doing an RCT. I mean, it depends a little bit on all sorts of underlying parameters here. But there were at least some simulations for which that was definitely true, where adding additional people helping to make the decision resulted in better decisions than running an additional study.
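A minimal sketch of that majority-voting idea, with entirely made-up parameters: each voter's prior about which of two programs is better is only weakly informative, and the majority is right more often than a lone decision-maker.

```python
import random

random.seed(1)

def pick_correctly(n_voters, true_gap=0.02, prior_noise_sd=0.05, trials=20_000):
    """Sketch of majority voting: program A is truly better than B by
    `true_gap`; each voter perceives A's advantage with independent noise
    and votes for whichever program their noisy prior favors. Returns the
    share of trials in which the majority picks the better program.
    All parameters are hypothetical."""
    correct = 0
    for _ in range(trials):
        votes_for_a = sum(
            1
            for _ in range(n_voters)
            if true_gap + random.gauss(0, prior_noise_sd) > 0
        )
        if votes_for_a * 2 > n_voters:
            correct += 1
    return correct / trials

# A lone decision-maker versus a committee of 21: more (somewhat informed)
# voters make the right call more often.
print(pick_correctly(1))
print(pick_correctly(21))
```

This is the Condorcet jury theorem flavor of the result: as long as each voter is right more often than not, adding voters improves the group's decision. As Eva notes below, it cuts the other way if individual priors are badly uninformed.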

Robert Wiblin: So, what surprised you the most from these simulations that you were running? Was there anything that you didn’t expect?

Eva Vivalt: Well I don’t think I was expecting that result, to be honest. Also, obviously, it does depend on the quality of the priors that people initially have, right? Like if you actually do have very highly uninformed individuals, then aggregating more highly uninformed priors is not going to help you.

Robert Wiblin: Shit in, shit out.

Eva Vivalt: Yeah, basically.

Robert Wiblin: I get to swear on my own show.

Eva Vivalt: Well, I could just say that you said it.

Robert Wiblin: So, do you think that we should run more studies, or less, on the basis of this paper?

Eva Vivalt: Well, I don’t think that’s the right … It’s not like we … There’s not a real trade-off here. Have more democratic decision making processes or run additional studies. We can do both. So I think more studies still is going to help, but so is actually taking that evidence into consideration and also having more people help to make decisions and hopefully balance out some of the errors that are made because, actually a lot of … I mean, I’ve also done some work looking at how policy-makers interpret evidence from studies and update.

Robert Wiblin: So you modeled bureaucrats or politicians as these Bayesian agents who I guess update perfectly. Was that right?

Eva Vivalt: At least in this paper. There’s another paper that does not do it, but yeah.

Robert Wiblin: Yeah. What kind of deviations might you expect? Do you think they might update too much or too little in the real world?

Eva Vivalt: Well I think, actually, so I’ve got this other paper with Aidan Coville of the World Bank where we are looking at precisely some of the biases that policy-makers have. And one of the bigger ones is that people are perfectly happy to update on new evidence when that goes in a nice, positive direction — when it’s good news. But people really hate to update based on bad news. So for example, suppose you think that a conditional cash transfer program will increase enrollment rates by maybe three percentage points. And then we can randomly show you some information that either says it’s five or it’s one. Well, if we show you information that says it’s five, you’re like “great, it’s five.” If we show you information that says it’s one, you’re like “eh, maybe it’s two.” So we see that kind of bias. We also-
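That asymmetry can be written as a toy quasi-Bayesian rule. The weights here are made up purely to reproduce the “great, it’s five” versus “eh, maybe it’s two” pattern in the example, not estimates from the paper:

```python
def quasi_bayesian_update(prior, signal, weight_good=1.0, weight_bad=0.5):
    """Toy asymmetric updating rule: a Bayesian would shrink toward the
    signal by the same weight either way; here good news (signal > prior)
    gets more weight than bad news. Weights are illustrative only."""
    w = weight_good if signal > prior else weight_bad
    return prior + w * (signal - prior)

prior = 3.0  # believed effect on enrollment rates, in percentage points
print(quasi_bayesian_update(prior, 5.0))  # good news: "great, it's five" -> 5.0
print(quasi_bayesian_update(prior, 1.0))  # bad news: "eh, maybe it's two" -> 2.0
```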

Robert Wiblin: It’s interesting because if you update negatively or if you update downwards then you’re creating a much greater possibility for future exciting positive updates. You can’t have positive updates without negative updates as well.

Eva Vivalt: Well, that’s fair I guess.

Robert Wiblin: I guess they’re not thinking that way.

Eva Vivalt: Present bias or something. No, I don’t know.

And it kind of makes sense intuitively, because one of the initial reasons for why we’re considering this particular bias in the first place is… I think a situation that will be very familiar to people who engage with policy-makers is, you know, you’re asked to do an impact evaluation. You come back saying, “oh yeah, this thing showed no effect.” And people are like, “oh really? It must be the impact evaluation that’s wrong.”

Robert Wiblin: I wonder, it’s notorious that impact evaluations within bureaucracies that want to protect their own programs are too optimistic. But I wonder, it’s a bit like, kind of everyone overstates how tall they are on dating sites but at the end of the day, you end up knowing how tall someone is, because everyone overstates by the same amount. And I wonder if looking at these impact evaluations you kind of figure out what’s the truth or what’s right on average just by saying “well was it extremely good or was it merely good?” You just adjust everything down by a bit.

Eva Vivalt: That’s a good point. That’s a good point. Yeah, no, fair enough. I mean the other thing that …

Robert Wiblin: I suppose that would just end up rewarding even more extreme lying.

Eva Vivalt: Yeah. And that’s not the only bias that people have got either, right? So another thing that we were looking at is how people were taking or not taking the variance into consideration. So in the simplest idea, you can think of this as just sampling variance. But you can also look at heterogeneity across studies. And basically, people were not updating correctly based on confidence intervals. That might be the easiest way of framing it.

And we did try to break that down a bit, and try to say, “well okay but why is that? Are they misinterpreting what a confidence interval is? Is it some kind of aggregation failure? Is it just that they’re ignoring all new information, and so obviously they’re going to be caring less about confidence intervals than somebody who actually does take information into consideration and does actually update at all?”

So we did try to break it down in several ways. And yeah, it does seem like people are not taking the variance into account as a Bayesian would.

Robert Wiblin: Oh, hold on. So you’re saying they just look at the point results and not at how uncertain it was?

Eva Vivalt: Yeah, pretty much. I mean, they do look a little bit at how uncertain it was, but not as much as they should if they were fully Bayesian. If they were actually Bayesian then they would care more about the confidence intervals.

Robert Wiblin: Right. So if a small study gets a kind of fluky extreme result, people over-rely on that kind of thing.

Eva Vivalt: Yeah, exactly.

Robert Wiblin: That doesn’t surprise me.

So what is the latest on your work on priors? Is that related to this paper?

Eva Vivalt: So, it is. This is one of the things that I’ve been up to. So for this particular one, we were looking at biases that policy-makers might have and biases in updating.

So, you start out with a Bayesian model and say, “okay, well look, but people aren’t Bayesian. How can we modify this model and have some kind of quasi-Bayesian model?” And so we were looking at two biases: this kind of optimism I was talking about, and this variance neglect, which you can think of as some kind of extension neglect more broadly, related to the hot hand fallacy or gambler’s fallacy, for people who are into the behavioral economics literature.
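For contrast, here is what fully Bayesian updating looks like in the simple normal-normal case, the benchmark against which variance neglect shows up: the same point estimate should move beliefs much less when it comes with a wide confidence interval. All numbers are hypothetical:

```python
def bayes_posterior_mean(prior_mean, prior_se, estimate, estimate_se):
    """Normal-normal conjugate update: a full Bayesian weights the study
    by its precision (1 / variance), so a noisy study moves beliefs less."""
    wp, we = 1 / prior_se**2, 1 / estimate_se**2
    return (wp * prior_mean + we * estimate) / (wp + we)

prior_mean, prior_se = 3.0, 1.0  # hypothetical prior, in percentage points
# Same point estimate of 6.0, but very different precision:
print(bayes_posterior_mean(prior_mean, prior_se, 6.0, 0.5))  # tight CI -> 5.4
print(bayes_posterior_mean(prior_mean, prior_se, 6.0, 3.0))  # wide CI  -> 3.3
# Variance neglect, in this framing, is updating by a similar amount in
# both cases -- reading the "6.0" but not the standard error around it.
```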

And we basically … It was a really simple study. We just collected people’s priors. We then showed them some results from studies, and then we got their posteriors. And we presented information in different ways, because we were also interested in knowing if the way in which we present information can also help people overcome biases if they are biased. So if you’ve got a problem, what’s the solution? And we did this not just for policy-makers, but also for researchers, and for practitioners like NGO operational staff, that kind of thing. We also got a side sample of MTurk participants.

And these biases actually turned out to be pretty general. And the big thing on the solution side is more information will encourage people to update more on the evidence. So I guess if you’re in that situation of, you’ve got some bad news, come bearing a lot of data and that should help at least a little bit. So, you know, more quantiles of the data, that kind of thing. Maximum, minimum values, you know, the whole range of as many statistics as you can really.

Robert Wiblin: Hold on. So your main finding was in order to accept a negative result, people have to be confronted with the overwhelming evidence so that they can’t ignore it?

Eva Vivalt: Yeah, at least it should help.

Robert Wiblin: Were there any other discoveries?

Eva Vivalt: The other kinds of things that we’ve been doing … We have actually collected priors in a whole bunch of different settings so actually I’m in the process, also with a grad student, of trying to look at some additional biases that policy makers may have. Like omission bias, status quo bias, where people don’t want to actually change, deviate, from decisions that were made in the past where they would have to do something differently, or take action. Like there might be some bias towards inaction.

Robert Wiblin: Or at least not changing your action. Not shutting down the program.

Eva Vivalt: Yeah. Yeah, yeah. I mean, the kinds of things that bureaucracies are typically sort of criticized for. But more specifically, on the priors, we’ve also asked experts to predict effects of various impact evaluations. One thing that I’m really excited about is trying to more systematically collect priors in the future. And so, I’ve been talking with many people actually, including Stefano DellaVigna and Devin Pope, who’ve got these great papers on expert predictions, about setting up some larger websites so that in the future people could more systematically collect priors for their research projects.

I’m getting at this point an email every week roughly asking for advice on collecting priors, because I think researchers are very interested in collecting priors for their projects because it makes sense from their perspective. They’re highly incentivized to do so because it helps with, not just with all this updating work, but also for them, personally, it’s like, “Well now nobody can say that they knew the results of my study all along.” Like, “I can tell them ‘well, this is what people thought beforehand and this is the benefit of my research.’” And also, if I have null results, then it makes the null results more interesting, because we didn’t expect that.

So, the researchers are incentivized to gather these things, but I think that, given that, we should be doing it a little bit more systematically, to be able to say some interesting things. For example, one thing is that people’s priors might, on average, be pretty accurate. So this is what we saw with the researchers, when we gathered our researchers’ priors: they were quite accurate on average. Individually, they were off by quite a lot. There’s that kind of wisdom of the crowds thing.

But, if you think that you could get some wisdom of the crowds and that people are pretty accurate overall, if you aggregate, well that actually suggests that it could be a good yardstick to use in those situations where we don’t have RCTs. And it could even help us figure out where should we do an RCT, where are we not really certain what the effect will be and we need an RCT to come in and arbitrate, as it were.

So I think there’s a lot more to do there that could be of pretty high value.

Robert Wiblin: Right, okay. So, I’ve got a number of questions here. I guess, so the question we’re trying to answer, well at least one of them, is: how good are experts as a whole at predicting the likeliest outcome of a study that you’re going to conduct? Or, to put it another way, the impact of an intervention. And, I guess, the stuff that I’ve read is that experts, at least individual experts, are not very reliable. But you’re saying that if you systematically collect the expectations of many different experts, then on average they can be surprisingly good.

Eva Vivalt: Yeah. Yeah. I would say that. I think that like, again, it sort of depends a bit on — this is why it would be really nice to get systematic data across many, many different situations. Because it could just be that the ones that we’ve looked at so far are not particularly surprising, but there probably are some situations in which people are able to predict things less well, and it would be nice to know are there some characteristics of studies that can help to tell us when experts are going to be good or bad at predicting this kind of thing.

But I would agree that any one individual expert is going to be fairly widely off, I think.

Robert Wiblin: So how do you actually solicit these priors or these expectations from these experts? Have you figured out the best way of doing that?

Eva Vivalt: Yeah, so that’s an excellent question. And we tried several different things. By now, I think I’ve got a pretty good idea of what works. So, in some sense the gold standard, if people can understand it, which is a big if, is to ask people to put weights in different bins because then you can get the distributions of their priors as well. Like not just a mean, but sort of how much uncertainty is captured in that.

But that’s quite hard for most people to do. People aren’t really used to thinking of their beliefs as putting weights in bins.

Robert Wiblin: Not even people in this field of social science?

Eva Vivalt: Not really. I mean, the researchers are a bit better at it, but in any case, at least what we’ve done, is even when talking with researchers it’s better to try to be perfectly clear about what the bins mean and go through all that kind of thing beforehand.

The other thing is, if you are asking sort of more of the lay public, it’s probably better to move to asking them to sort of give ranges, as it were. So, you know, what is a value such that you think that it’s less than a 10% chance it’ll fall below this value, or less than 10% chance it’ll fall above this value, or different quantiles… I mean, you then have to make some assumptions about the actual distribution because people can give you a range but if you really want to get at some of the updating questions, you need to know a little bit more. Like, you want to know whether those distributions are normal or not. And you don’t know whether things are normally distributed if you just have three points, right?
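
A minimal sketch of the quantile-based elicitation Eva describes, assuming (as she cautions you must) that the respondent’s beliefs are normal. The function name and the elicited numbers here are invented for illustration, not taken from her studies:

```python
from statistics import NormalDist

def normal_from_quantiles(q10, q90):
    """Fit a normal belief distribution to two elicited quantiles
    (10th and 90th). Normality is an assumption here, exactly the
    one you cannot verify from a few elicited points alone."""
    z90 = NormalDist().inv_cdf(0.90)  # ~ 1.2816
    mean = (q10 + q90) / 2            # symmetric quantiles straddle the mean
    sd = (q90 - q10) / (2 * z90)
    return mean, sd

# Hypothetical elicitation: "less than a 10% chance the effect is below 2,
# and less than a 10% chance it's above 8" (say, in percentage points)
mean, sd = normal_from_quantiles(2.0, 8.0)
print(mean, sd)  # mean 5.0, sd ~ 2.34
```

With only two (or three) elicited points, any heavier-tailed distribution with the same quantiles would fit equally well, which is exactly the identification problem mentioned above.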

Robert Wiblin: Yeah, yeah. So that sounds like a really exciting research agenda, but we’ve got to push on because there’s quite a lot of other papers that you’ve published in the last few years that I want to talk about.

Another one that you’ve written up, which is a bit more hopeful, is ‘How Often Should We Believe Positive Results: Assessing The Credibility Of Research Findings In Development Economics.’ And of course, most of social science is facing a replication crisis where we’re just finding that many published results in papers don’t pan out when you try to do the experiment again. What did you find in development economics?

Eva Vivalt: Yeah, so actually the situation was a lot better than I would have initially thought. So I think this is actually quite a positive result. It could be biased from the kinds of studies that we included. Like we had a lot of conditional cash transfers in there. They tend to have very large sample sizes, so they’re kind of like the best case scenario. But nonetheless, the false positive report probabilities are actually quite small.

Robert Wiblin: Are you able to describe the method that you applied in that paper? Obviously, you weren’t replicating lots of these studies, you must have used some other method to reach this conclusion.

Eva Vivalt: Yep. Well, there’s quite a lot of nice literature here that I can refer people on to. The false positive and false negative report probabilities, the equations for how to calculate those come out of a paper by Wacholder et al. There’s some other people who’ve also looked at this, where essentially the probability that you’ve got a false positive or a false negative depends a bit on the priors that you’ve got.

So for example, if you think of some study that is looking at, I don’t know, something we really don’t believe to exist, like extra sensory perception or something, right? And if you found some positive result for that, well, nobody’s going to trust a study that shows that ESP is real. And to really show that credibly, you would need to have lots of studies with really precisely estimated coefficients.

Again, your priors are going into it, the statistical significance or your p-values that you’ve found would go into it and that’s just an equation you can sort of write out.
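
The equation Eva is pointing at can be written out in a few lines. This is a sketch with hypothetical inputs, not the calibration from her paper; the `fprp` name is mine:

```python
def fprp(alpha, power, prior):
    """False positive report probability (in the spirit of
    Wacholder et al.): the chance that a 'significant' finding is
    a false positive, given the significance level alpha, the
    study's power, and the prior probability the effect is real."""
    false_pos = alpha * (1 - prior)   # significant results from null effects
    true_pos = power * prior          # significant results from real effects
    return false_pos / (false_pos + true_pos)

# ESP-style example: a very implausible hypothesis (prior 0.1%),
# tested at p < 0.05 with 80% power
print(fprp(alpha=0.05, power=0.80, prior=0.001))  # ~ 0.98: almost surely false
# The same design applied to a plausible intervention (prior 50%)
print(fprp(alpha=0.05, power=0.80, prior=0.5))    # ~ 0.06
```

The two calls show the ESP point numerically: with the same p-value threshold and power, the prior alone moves the chance that a positive result is spurious from about 6% to about 98%.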

The other thing is that there are these type S and type M errors that Andrew Gelman and some co-authors talk about. And these are the probability that if you’ve got a statistically significant result, it’s actually of the right sign-

Robert Wiblin: So it’s positive rather than negative, or negative rather than positive.

Eva Vivalt: Yeah, yeah. Because you would be surprised, but it’s actually true that if you’ve got low-powered results, then even if you find something statistically significant, there is some probability that the true value is negative when you see something that says it’s positive, or vice versa.

Robert Wiblin: Yeah, and then there’s type M errors?

Eva Vivalt: Yeah, so this is the same kind of thing except for magnitude. So, you’ve found some significant result and it has a certain magnitude, but chances are that’s actually incorrect in some way. Like it’s most likely inflated in value, so the truth is likely to lie lower than that.
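
A rough sketch of the Type S and Type M calculations Gelman and co-authors describe. The `retrodesign` name and all the numbers are illustrative, and the Type M figure here is estimated by simulation rather than in closed form:

```python
import random
from statistics import NormalDist, mean

def retrodesign(true_effect, se, alpha=0.05, sims=100_000, seed=0):
    """Power, Type S, and Type M diagnostics in the spirit of
    Gelman and Carlin. Given a hypothesized true effect and the
    study's standard error, returns:
      power  - chance of a statistically significant result,
      type_s - chance a significant estimate has the wrong sign,
      type_m - average inflation of a significant estimate
               (the 'exaggeration ratio')."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)            # ~ 1.96 for alpha = 0.05
    lam = true_effect / se
    power = 1 - nd.cdf(z - lam) + nd.cdf(-z - lam)
    type_s = nd.cdf(-z - lam) / power        # wrong sign, given significance
    # Type M by simulation: average |estimate| among significant draws
    random.seed(seed)
    sig = [abs(est)
           for est in (random.gauss(true_effect, se) for _ in range(sims))
           if abs(est) > z * se]
    type_m = mean(sig) / true_effect
    return power, type_s, type_m

# Hypothetical low-powered study: true effect 1.0, standard error 3.0
power, type_s, type_m = retrodesign(1.0, 3.0)
print(power, type_s, type_m)
```

For this invented low-powered design, significant estimates come out several times larger than the truth on average, and a nontrivial share of them have the wrong sign, which is the point Eva makes next.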

Robert Wiblin: So how did you put together this information to try to figure out what fraction of results were accurate? I’m not quite understanding that.

Eva Vivalt: Sure, sure, sure. So, the main source of data that we used here is we had to get a whole bunch of expert beliefs, because these were inputs into the equations. And to get the expert beliefs we did one thing that’s not 100% kosher, but is the best kind of approximation we could do, which is that we didn’t want to wait until a lot of impact evaluations were over. Like a lot of the other work that I’ve done on priors, also with Aiden, we are actually waiting until all the results of the real studies come out. But for this we wanted a bunch of results to use already, as it were. So what we did was we used AidGrade’s database of impact evaluation results and we said, “Okay let’s go to topic experts,” like people who have, for example, done a study on a conditional cash transfer program, and then ask them “which of all these other programs have you heard about?”

They were also all conditional cash transfer programs but, you know, ones by other people. And then for the ones that they hadn’t heard about, we asked them to make up to five predictions about the effects that those studies would find. We’d describe the studies to them in great detail and then got their best guess.

Then, using this data we could say something about the false positive report probability, because then we’ve got the p-value that each study found and we’ve got what we’re considering to be the prior probability of some kind of nominal effect. We needed, actually, them to also give a certain value below which they would consider the study to have not been successful. Like, if the conditional cash transfer program doesn’t improve enrollment rates by, I don’t know, 5 percentage points then it’s not successful, because we wanted to …. All these equations deal with sort of like, the likelihood that some particular hypothesis is true. For us we wanted … there’s like some critical threshold above which we would think that it had an effect, versus not have an effect. Some meaningful effect. The minimum meaningful, kind of like the minimum detectable effect size.

So we create this probability of attaining this non-null effect, given the distribution of priors and given this particular cut-off threshold. And those are just inputs to this equation, along with the power of the study.

Robert Wiblin: Right. Okay. I think I understand now. So, you’ve got all of these different studies looking at the effect size on different outcomes, and they have different levels of power. So different kind of sample sizes and different variances in them. And then, you’re collecting priors from a bunch of different subject matter experts, and then you’re thinking, “Well, if we took those priors and updated appropriately based on the results in those studies, how often would we end up forming the wrong conclusion?” Or is it actually just that: what if you took the point estimate from that study, how often would you be wrong relative to if you’d updated in a Bayesian way? Is the second right? Or am I totally wrong?

Eva Vivalt: So I would think of it in a different way. If you see a positive, significant result, there’s some probability that it just happened to be that way by chance and there’s some probability that that’s a true thing.

Robert Wiblin: And especially if it was unlikely to begin with, then it may well still probably be wrong, because of, kind of-

Eva Vivalt: Yes.

Robert Wiblin: Regression to the mean effect.

Eva Vivalt: Yeah, if you think that it’s really unlikely a priori and you observe it, it’s more likely to be a false positive. If you’re under-powered to begin with, it’s more likely to be a false positive. If it’s got a p-value of 0.049, it’s more likely to be a false positive. So, these are all just sort of factors that go into it and you could do the same kind of thing for false negatives actually. Yep.

Robert Wiblin: Okay. Well, let’s push on.

You did another paper on specification searching, which is the practice where people who are writing a paper try out a whole lot of different specifications to try to, I guess, get the answer that they’ll like and they publish just the results of that, like, to show you. And you were trying to figure out how common this practice is in different disciplines and researchers using different methods. How did you try to do that and what did you find?

Eva Vivalt: Yeah, this paper is similar in methodology to some papers by Gerber and Malhotra and others, where … and also there’s some work by Brodeur et al. looking at essentially the distribution of statistics. Say you’ve got a bunch of different studies, you’ve got a bunch of different t-statistics from each of those results, what you would expect is that there’s going to be some smooth distribution of those statistics. I mean, hopefully. But what you actually observe in the data is there’s some lumpiness, and in particular there tends to be some slightly lower density of results that are just marginally insignificant than you would expect, and some sort of bump in the distribution just above the threshold for statistical significance, which is usually at the 0.05 level. So 1.96.

So you’ll see like, relatively few results around 1.95 and relatively more results than you had anticipated having around 1.97. That’s the general intuition, right?
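
That intuition, comparing density just below and just above 1.96, can be sketched as a toy caliper test. The data, the bandwidth, and the `caliper_test` name are all invented for illustration; the actual papers use much larger samples and more careful inference:

```python
from math import comb

def caliper_test(t_stats, threshold=1.96, width=0.10):
    """Toy caliper test in the spirit of Gerber and Malhotra: among
    t-statistics within `width` of the significance threshold,
    roughly half should fall on each side absent specification
    searching. Returns the counts and a one-sided binomial p-value
    for 'too many just above'."""
    below = sum(1 for t in t_stats if threshold - width <= t < threshold)
    above = sum(1 for t in t_stats if threshold <= t <= threshold + width)
    n = below + above
    # P(X >= above) for X ~ Binomial(n, 0.5)
    p = sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n
    return below, above, p

# Invented t-statistics with a suspicious bump just above 1.96
ts = [1.90, 1.93, 1.95, 1.97, 1.97, 1.98, 1.99, 2.00, 2.02, 2.05]
below, above, p = caliper_test(ts)
print(below, above, p)  # 3 below, 7 above
```

Note the band-width point Eva makes next: the narrower the caliper, the harder it is to explain an asymmetry by anything other than manipulation.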

Robert Wiblin: Yeah. And that’s an indication that people were fishing around to find the specification that would just get them over the line to be able to publish.

Eva Vivalt: Exactly. But I mean it’s not as straightforward as just that because you can imagine that … what is that distribution supposed to look like in reality? And there’s other reasons why you might expect to see some more statistically significant results. For example, people design the studies such that they can find significant results in the first place. So, it’s not 100% straightforward to just say, “Oh yeah well we’ve got a lot of significant results and therefore it must be specification searching.” I think it becomes more credible that it is specification searching if you can say, “Yeah but it’s within a really small band, right around the threshold for significance.” As you expand the band out a little bit, I think you could try to argue-

Robert Wiblin: There are other possible explanations.

Eva Vivalt: Yeah, exactly. That like, people are designing this study very cleverly just to get significance. Although, honestly to be fair, I think it’s difficult to swallow that people are designing the study perfectly appropriately to just barely get statistical significance, right? I mean it’s so hard to predict what the effects will be anyways, and then your hands are a little bit tied from the fact that generally when you’re doing this you have got a given budget and you can’t really exceed that budget anyways. So you’re dealing with a certain sample size and having to adapt your study accordingly. It’s not like you’ve got free rein to perfectly maximize.

Robert Wiblin: Okay so the alternative innocent explanation is that people can anticipate ahead of time what the effect size will be, and then they choose the sample size that will allow them to get just below a 0.05 p-value. So they’ll be able to publish the paper at minimum cost.

Eva Vivalt: Yeah.

Robert Wiblin: But in reality it’s just it’s a bit hard to believe that that explains most of what’s going on, especially given that we just know that lots of academics in fact do do specification searching.

Eva Vivalt: Yeah. It’s just people don’t have as fine a control over the design of a study as you would perhaps anticipate because funding is somewhat out of their hands. Also, because any one given paper is going to be looking at so many different outcomes, so how can you really design a study so that you are just barely significant for outcome A and B and C, you know? And so like it becomes a little bit implausible. But that would be the best case for the contrary view.

Robert Wiblin: Yeah. Okay. So you looked for this suspicious clumping of p-values or effect sizes across a whole of lot of different methods and disciplines, and what did you find?

Eva Vivalt: Yeah, actually the situation seemed a lot better for RCTs than non-RCTs, which is kind of understandable if you think about it because I think RCTs generally have an easier time getting published these days anyways. It could be reflecting that, that you don’t need to engage in specification searching if you’ve got an RCT and people are more likely to publish your results anyways, even if they’re null.

The other thing is that things do seem to be changing a little bit over time. In particular the non-RCTs, as time goes on they become more and more significant, as it were. Let’s just not lean too hard on this explanation but it could be, in the old days, maybe you would lie and say, “Well, I’ve got a non-RCT and it found a value of 1.97.” People would be like, “Oh, okay. 1.97, I believe that.” And nowadays if you see 1.97 everybody’s like, “Wait a second.” So now, you’ll see values that are more like 2.1 or something, right? It’s like values that are a little bit farther out there and more significant.

Robert Wiblin: I see. Okay, so you’re saying because people have learned that this is kind of an indication of specification searching, people have to go even further and find specifications that get them an even more significant result so it doesn’t look suspicious.

Eva Vivalt: Yeah, maybe, yeah. That would be the intuition. Again, I can’t like 100% say, but it would be consistent with that at least.

Robert Wiblin: It sounds to me like you’ve been doing quite a lot of work on this Bayesian approach. Looking into priors and updating, based on those. Does it feel like development economics is becoming more Bayesian? And is that a good thing?

Eva Vivalt: You know, actually, honestly, I believe it is and that’s really exciting. These days I don’t have to worry quite so much about … I’m definitely hardcore Bayesian and I think that it’s a little bit easier for me to talk about things that rely on a Bayesian interpretation.

Robert Wiblin: Do you think there’s any downsides of Bayesian methods being applied more often? I guess one thing I worry about is people kind of fiddling with the priors in order to get the outcome that they want. Or perhaps there’s a bit more flexibility and there’s more possibility for specification searching.

Eva Vivalt: Hmm. Honestly we’re probably not going to go down the route of being … I don’t see the discipline as becoming fully Bayesian any time in the near future. I just don’t see the likelihood of that. What I do think though is that … so it is true that what researchers do and what policy makers do could be a bit different. It might be fine to be different. I’ve heard the argument that researchers should be very concerned about getting unbiased estimates and policy makers … there’s this bias-variance tradeoff that I care actually very passionately about and that others care passionately about as well, I believe.

Robert Wiblin: Did you want to explain what that is?

Eva Vivalt: Sure. The bias-variance tradeoff is essentially saying that you’ve got several sources of prediction error. You’ve got some error due to possible biases, you’ve got some error due to variance and you’ve got some other idiosyncratic error. And this is something that is generally true in all contexts, in all ways, and comes up in different ways.

An example is: if you think of nearest neighbor matching, if you want you can include more neighbors, and if you include more neighbors you’ve got more observations, so you’ve got more precise estimates. Like lower variance estimates. But on the other hand, if you’re including more neighbors, you’ve got some worse matches. So you’re increasing your bias. And so, all estimation approaches are going to have some error due to bias and some error due to variance. And economists have focused really narrowly on producing unbiased estimates, and if all you care about is prediction error … I know Andrew Gelman takes this view and so do I and so do other people like Rachael Meager I think and others. We’re like, “Well hang on, why do we care just so much about getting unbiased estimates?” You also care about having precise estimates, too. It would help for prediction error to maybe accept a little bit of bias.
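
The tradeoff Eva describes can be shown with the simplest possible case, a shrinkage estimator for a mean. The function name and numbers are arbitrary illustrations, not from any of the papers discussed:

```python
def shrinkage_mse(c, mu, sigma, n):
    """Prediction error (MSE) of the estimator c * xbar for a true
    mean mu, where xbar averages n observations with variance
    sigma^2. MSE = bias^2 + variance: shrinking toward zero
    (c < 1) adds bias but cuts variance."""
    bias_sq = ((c - 1) * mu) ** 2
    variance = c ** 2 * sigma ** 2 / n
    return bias_sq + variance

mu, sigma, n = 1.0, 2.0, 10
unbiased = shrinkage_mse(1.0, mu, sigma, n)  # no bias, all variance: 0.4
shrunk = shrinkage_mse(0.7, mu, sigma, n)    # some bias, less variance
print(unbiased, shrunk)  # the slightly biased estimator wins on total error
```

This is the same logic as the nearest-neighbor example: adding neighbors (here, shrinking the estimate) trades a little bias for a bigger drop in variance, and total prediction error falls.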

And the argument I’ve heard is that “maybe researchers should be unbiased, but policy makers interpreting the evidence, it’s okay to accept a bit more bias there.” Maybe the … you don’t need every person at every layer to be reducing prediction error as much as possible. I think that like in practical terms, if you’re an effective altruist, et cetera, you do care about minimizing prediction error regardless of the source. But then it’s a slightly separate question to say what researchers should be doing per se.

Robert Wiblin: So I’ll stick up links to both Andrew Gelman’s blog and a description of the bias-variance tradeoff.

So as I understand it you’re saying that there’s different statistical methods that you could use that would be systematically too optimistic or pessimistic, but would be more precise, is that right? And in general, people go for something that’s neither too optimistic nor pessimistic, but is not as precise as it might be. It has like larger average mistakes, and it’s just not clear why we’ve chosen that particular approach.

Eva Vivalt: Yeah. So, there’s a nice diagram that you can throw up if you’re putting links to things that sort of shows the bias-variance tradeoff really, really nicely, I think. Where you’ve got prediction error on one axis and you’ve got different curves of error for if you’ve got biased estimates or if you’ve got estimates with high variance, low precision. Your total prediction error is going to be some function of both of these things as well as some other error. And economists have focused really quite a lot on getting unbiased estimates.

You would think that if anywhere this consideration might come up a little bit in the process of using machine learning, because there, there’s a lot of techniques that are biased that people accept. Like Lasso or ridge regressions and all sorts of other things, but even there, if you talk to people who are actually involved with these kinds of methods, they’re highly focused on getting unbiased estimates so that the rest of the profession accepts them, which I think is kind of a shame in some regards.

But again, I want to be a little bit agnostic because I’m not 100% sure actually myself what is the best way of going about it, I just feel that at least at the time of making a policy decision, we should be minimizing overall prediction error regardless of the source of that error. Whether it’s bias or variance. I’m not sure what the researchers should do. That’s, I think, like I said, a slightly separate problem. But I do think we’re not paying attention to prediction error as much as we should.

Robert Wiblin: Alright. Let’s turn now to some of the implications of this work and some research that we’ve done for people involved in the effective altruism movement. So we wrote this article, ‘Is It Fair To Say That Most Social Interventions Don’t Work?’ Ben Todd worked on that and put it up last year. It’s one of the articles on our site that I like, I think, the most out of all of them. And the reason we looked into it is in a lot of our talks, for many years, we’ve been saying most social interventions, if you look at them, don’t work. On the basis of looking at lots of randomized control trials and saying while most of them seem to produce null results, the interventions that they’re looking at don’t seem to be helping.

But then we had some doubts about that, because we’re thinking “it’s possible you’re getting false negatives for example, and it’s possible that an intervention works in some circumstances and not others.” So, is there anything that you want to say about that article possibly? We could walk through the various different moves that we take and then try to reach a conclusion about it.

Eva Vivalt: Yeah, it’s a really difficult question because, like you say, there are lots of things that go into it. Null results could just be underpowered. The other big thing is that unfortunately we tend to do impact evaluations in some of the better situations in the first place, and this would sort of work in the other direction. Like, so many impact evaluations just fall apart and never happen and we don’t actually observe their outcomes because the study just fell apart.

So yeah, it’s hard to say, to be honest, but happy to walk through-

Robert Wiblin: Sure, sure. Okay. So one of the things is: only some interventions are ever evaluated and they’re probably ones that are better than others, because you would only bother spending the money on an RCT if it looks really positive. Do you have any sense of how big that effect is?

Eva Vivalt: Honestly, I don’t, but I will say that there’ve been some people looking at the impact evaluations that don’t end up happening. Like David McKenzie and some other people were trying to pool together some estimates of this. And I think that problem is actually quite large. It’s not necessarily that it’s … it’s a little bit distinct from the problem that we only try to study those things that have some chance of being really highly effective. It’s also that even within a particular topic, that is highly effective or that we suspect is highly effective, the ones that end up happening are the better instantiations of that particular program. Like the government in that particular area had it more together or whatever else. So we’re getting biased estimates as well that way.

Robert Wiblin: Okay. So, we kind of start with this quote from David Anderson, who does research in this area, and he says it looks like 75% of social interventions that he’s seen have weak or no effects. And this suggests that it might even be worse than that because there’s all of these programs that aren’t even being evaluated, which are probably worse. Maybe it’s 80 or 90% of social interventions have small or no effects.

But there’s other things that we need to think about. So, there’s lots of different outcomes that you could look at when you’re studying an intervention. You might think “you’ve got this change in a school, should it be expected to improve their math scores or their English scores or how much they enjoy being at school?” All of these different things, which I guess pushes in the direction of being overoptimistic because the papers that get published can kind of fish for whichever one they found a significant effect in. But even if we were honestly reporting the results, it then just becomes kind of unclear which were the things that you kind of expected to have an effect on anyway. It just makes it quite confusing. What actually are we saying when we say 75% of things have weak or no effects? Was it just a primary effect or on many of them?

Eva Vivalt: Yeah. That’s totally fair because oftentimes a study will throw in all sorts of random other things that they don’t actually honestly anticipate there being an effect on, but if you’re doing the study anyways, why not?

Robert Wiblin: Yeah, yeah, yeah. So it turns out that this change at the school didn’t make the students happier, would you expect it to anyway? Maybe they were just curious about that. So it’s really unclear what you’re sampling across.

Then there’s this issue of: we said no effect or weak effects is often how this quote is given, but then what is a weak effect? That’s just kind of a subjective judgment. Is it relative to the cost? Is it relative to the statistical significance? Is it material? Again, that just kind of muddies the water and if you think about it, it becomes a much more subjective kind of claim. Do you have anything to add to that?

Eva Vivalt: Not really. I mean-

Robert Wiblin: Does this come up in your own research?

Eva Vivalt: I mean, to me, what I would find the important question is actually in some ways … I realize that obviously for the purposes of this post that you’ve put together with Ben Todd, et cetera, I think that the question is really interesting, of which have any effect whatsoever, but I would a little bit think that another important question would be “which matter relative to some other outside” … I guess, which matter at all is a good question, but I always think about what is the outside option, and what the outside option is really matters.

So when you were talking about weak effects, yeah probably they are talking about statistical significance, but you can also think of weak effects as like “sure it has an effect but so what? We can do so much better.”

Robert Wiblin: Mm-hmm (affirmative), yeah. And then, I think the part of the article that you helped with was moving from talking about individual studies, where very often you get null results to meta analyses where you combine different studies. And then more often, I think you find that an intervention works, at least on average. Do you want to talk about that?

Eva Vivalt: Yeah, if you’ve got some underpowered studies then combining them does tend to improve the situation slightly. It depends a little bit on exactly how you’re doing it and what kinds of things you’re including, but I’d say by and large you do end up with … because you’re essentially adding some power when you do a meta-analysis, by at least partially pooling results from different studies.
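
A minimal sketch of the pooling Eva describes, using fixed-effect (inverse-variance) meta-analysis with made-up study results. Real meta-analyses, including AidGrade’s, typically use random-effects models, so this is the simplest illustrative case only:

```python
from math import sqrt
from statistics import NormalDist

def fixed_effect_meta(estimates, ses):
    """Fixed-effect (inverse-variance) meta-analysis. Pooling
    shrinks the standard error, which is how several individually
    underpowered studies can be jointly significant."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = 1 / sqrt(sum(weights))
    z = pooled / pooled_se
    pval = 2 * (1 - NormalDist().cdf(abs(z)))
    return pooled, pooled_se, pval

# Three hypothetical studies, none individually significant (all |t| < 1.96)
pooled, pooled_se, pval = fixed_effect_meta([0.10, 0.12, 0.08],
                                            [0.07, 0.08, 0.07])
print(pooled, pooled_se, pval)  # pooled estimate is significant at 5%
```

The pooled standard error is smaller than any single study’s, which is the “adding power” point: the combined estimate crosses the significance threshold even though no individual study does.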

Robert Wiblin: And so you can pick up smaller effects.

Eva Vivalt: Yeah.

Robert Wiblin: Which means that, I guess, more of them become … like just jump over the line of being positive or material or observable.

Eva Vivalt: Becoming significant, not necessarily-

Robert Wiblin: Statistically.

Eva Vivalt: Yeah exactly. It could be like a very small effect, but …

Robert Wiblin: Well there’s a bunch of other moves that we make here, or adjustments up and down, but what we were trying to kind of get at is how much of a gain do you get by picking the best interventions or trying to be evidence based rather than just picking something at random. And I think the conclusion that we reached after looking at all of this, is that it’s perhaps not as much as you might … or people who are extremely supportive of doing more empirical work might hope. Because, one, the measurements are somewhat poor, so there’s often a good chance that you think you’ve picked the best intervention from a pool but in fact you’ve gotten it wrong.

But also that even if there’s like a small fraction of the interventions that you might be sampling from that are much more effective than others, even if you choose at random, you still have a reasonable chance of picking one of those anyway. Which means that, let’s say that there’s like 10 different interventions and only one of them works. If you pick at random, you can’t do worse than a tenth as well as definitely picking the best one because you have a one in ten chance of picking it anyway.
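
Rob’s arithmetic here can be made explicit with a hypothetical pool of interventions (the numbers are invented to match his one-in-ten example):

```python
# Hypothetical pool: ten candidate interventions, only one of which works.
effects = [10.0] + [0.0] * 9

best = max(effects)                        # perfect selection gets 10
random_pick = sum(effects) / len(effects)  # picking blindly gets 1 on average
print(best / random_pick)  # even here, perfect selection only beats random 10x
```

Since a random pick lands on the good intervention one time in ten, its expected value is already a tenth of the best pick, so the gain from perfect selection is capped at 10x in this setup.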

Which I guess is perhaps something that I think effective altruism hadn’t thought as much about. We often tended to compare the very best interventions with the very worst ones, but it’d be a very peculiar strategy to try to find the very worst ones and do those. Instead you should really compare your attempt at picking the best intervention with kind of picking at random among things that have been studied. In which case the multiple in effectiveness that you get probably isn’t going to be huge. Do you have any comments on that?

Eva Vivalt: Yeah. I mean this is a little bit similar to when I was trying to look at like how much we can learn from an impact evaluation. I had to make assumptions about what that outside option is that the policy makers are considering. And just sort of based on the distribution of effects that I saw in AidGrade’s database, it’s actually reasonable that a lot of these projects, a lot of interventions have got somewhat similar effect sizes, at least without taking cost-effectiveness into consideration. Obviously I’d love to take costs into consideration but it’s very hard to because like 10% of studies say anything about costs and then it’s not very credible when they do say it.

But things were pretty tightly distributed. So I tried some different specifications. Like I was saying, trying out 50% of the effect of another program or 90% of the effect of another program, like how well can you distinguish between two programs, one of which is 90% of the value of the other one, as it were. You have to make some pretty strong assumptions there. Things do seem to be … so, I don’t know. That’s how I’ve gone about it in the past.

Robert Wiblin: Things seem to be fairly clumped together, you’re seeing?

Eva Vivalt: Well, out of the ones in AidGrade’s database, and again without taking costs into consideration. I’m not trying to make a broader claim than that because there’s just no data.

Robert Wiblin: Right. Okay, so I was just about to bring this up next, which is like four years ago or so, Robin Hanson responded to one of your graphs from AidGrade, which seemed to suggest that if you looked at effect sizes in terms of standard deviation improvements then you kind of found a normal distribution of effect sizes and it wasn’t that widely dispersed, as you’re saying. And he was saying “well this was a bit in conflict with the standard line that people in effective altruism give, which is that there’s massive dispersion in how cost effective different approaches are. That it’s not just normal, but it’s lognormal or power law distributed, or something like that. Which gives you much greater dispersion between the best and the average and the worst.”

Did you ever respond to that? Because I think we ended up concluding it might be a bit of a misunderstanding.

Eva Vivalt: I think that, yeah … so there’s two things that are certainly not included. One thing I just alluded to is costs. That’s saying nothing about the cost-effectiveness of a particular intervention and I would love to have been able to produce those graphs for the cost-effectiveness. But, like I say, the thing is that papers just don’t report costs, and they should. But they don’t. So it’s really hard for me to come in as an outsider to each of these papers and say, “Oh yeah, but actually I know what the costs are.”

One could make strong assumptions about those and try to infer what costs are from other studies, et cetera, but it’s quite hard to do and not very credible. So I’m sure one could do it, but probably not in an academic setting. I haven’t been pursuing it but I would love for other people to pursue it and I’m sure that other people are pursuing it.

Robert Wiblin: Well the other thing, if you want to move to cost-effectiveness you also have to think about the actual welfare gain from the different improvements.

Eva Vivalt: Exactly. So, that’s the other thing I was going to say is then how can you actually value these outcomes? Because the outcomes are pretty … they don’t have intuitive value to them, right? How do you value an extra year in school versus a centimeter of height, right? How do you think about that kind of thing? What does that actually mean in terms of value? So then you need some additional mapping from the outcomes to something that we value.

Robert Wiblin: So, yeah. Is it possible that we start with this normal distribution of standard deviation changes and then, because costs per recipient are so wildly distributed and the benefits per standard deviation improvement are so wildly distributed, you still get very wide dispersion in the cost effectiveness of different interventions?

Eva Vivalt: You could do.

Robert Wiblin: Mm-hmm (affirmative), you could. Yeah.

Eva Vivalt: I just don’t have a very clear sense of that because I don’t have a clear sense of the costs.

Robert Wiblin: Okay, that’s fair. Other people could look at this and try to figure that out.

Eva Vivalt: Yeah, yeah, and I really hope somebody does.

Robert Wiblin: I guess there’s also the Disease Control Priorities Project, of course, which has produced cost-effectiveness estimates for lots of different health treatments and finds that they’re extremely widely dispersed. But I think that their resourcing per intervention that they’re looking at isn’t so good, and very often they rely on modeling rather than empirical results, which might be causing them to overstate the variance, because some of it is just mistakes on their part.

Eva Vivalt: I see. Yeah, no, I’ve heard a little bit about that. That makes a lot of sense. I think that one thing that is certainly necessary, and I hope happens in the near future, is some attempt at also adding values to these other things that we might care about. Like all the educational stuff, et cetera, to sort of be able to compare them with health interventions, et cetera. Because the same kind of way that they do the disability-adjusted life years, et cetera, they could do for some kind of more general wellbeing.

Robert Wiblin: Right, yeah.

So I really want to try to pin you down a little bit on how valuable is being empirical? Because it seems like you’ve got some positive results and some negative results: the generalizability doesn’t seem so good, so can we really learn that much? On the other hand, it looked like some of your research suggests that in fact most of the results that show positive effects are kind of right about that. And then we’ve got to consider, I guess, the cost of doing these different studies and whether people in government actually respond to them. You’ve been working in this area for five or ten years now — have you updated in favor of empirical social science or against it?

Eva Vivalt: I think it’s the only game in town, to be honest. As much as we may criticize some of the things that come out of standard research, I guess the only answer in terms of what to do next is more of the same, with some improvements — but more is better. And I think people are a little bit more aware of and focused on addressing some of the limitations in past research: people are thinking more now about the differences in scale-up, and people are thinking a bit more now about how results actually feed into the policy process. So, for me I think there’s incremental change, but I’m certainly pro-empirical work because what’s the alternative? It’s not-

Robert Wiblin: Well, I think there are alternatives. One is, as you were saying, to just survey people on their expectations about what works, even before you’ve run any studies. And it could just be that that gets you a lot of the way and it costs very little, so maybe we should just do that and then skip the RCTs, or only do them occasionally.

Eva Vivalt: I don’t want to rule out the possibility that we can learn something from … I think we can learn more using observational data, which priors would also be similar to. And I don’t want to rule out that we can learn something from those. I just — maybe this is just a matter of semantics — I would still consider that, in some sense, empirical work, because what you could do is try to say, “well yes, but I want to try to”-

Robert Wiblin: A systematic survey.

Eva Vivalt: “Figure out which are the … yeah. Figure out the situations in which this is actually relatively okay,” and then it’s some approximation strategy that’s still not quite valid but better than