0:33 Intro. [Recording date: March 6, 2017.] Russ Roberts: My guest is author and blogger, Andrew Gelman, professor of statistics and political science at Columbia University. Andrew is a very dangerous man, for me. As listeners know, I've been increasingly skeptical over the years about the reliability of various types of statistical analyses in psychology, economics, epidemiology. And coming across your work, Andrew, which I've done lately in reading your blog, you've confirmed a lot of my biases. Which is always a little bit dangerous. But in such an interesting way. So, I'm hoping to learn a lot in this conversation, along with our listeners. And at the end we'll talk about whether I've gone too far and become too comfortable. So, I want to start with something--the other point I want to make before we start is it's sometimes hard to talk about statistics and data over the phone. And in a podcast. We're going to do the best we can, without a whiteboard. But I'm hoping that both beginners and sophisticated users of statistics will find things of interest in our conversation. So, we start, though, with something very basic. Which is: statistical significance. And we are going to wonder about statistical significance in small samples. But let's just start with a definition. When economists, or psychologists, say, 'This result is statistically significant,' what do they usually have in mind? Andrew Gelman: Statistically significant means you are observing a pattern in data for which, if there was actually nothing going on and the data were just noise, the probability of seeing a pattern at least that extreme is less than 1 in 20. Russ Roberts: And that 1 in 20, so, it's almost surely 95% of the time not due to the randomness of the noise. That's an arbitrary cut-point that has somehow emerged as a norm in published academic research. Correct? Andrew Gelman: Correct. But you state it a little wrong. 
One of the challenges of statistical significance is it kind of answers the wrong question. And one way to see that is that when people define it, they tend to get the definition garbled. In fact, I have a published article where, I think it's either the first or second sentence completely garbles the definition of statistical significance. That was our article about the Garden of Forking Paths. Russ Roberts: Okay, we're in good company here. Andrew Gelman: Yeah. I blame the editor of the magazine, because they edited the thing. And of course it's certainly not my responsibility what comes out under my name. Russ Roberts: No. Andrew Gelman: I don't think I should ever take responsibility for that-- Russ Roberts: No, absolutely not. Andrew Gelman: So, statistical significance does not say that your result is almost certainly not due to noise. It says: If there was nothing going on but noise, then the chance is only 1 in 20 that you'd see something as extreme or greater. And it's typically illustrated in textbooks with examples of coin-flipping. Um, like, if you flip a coin a hundred times, it's very unlikely that you'd see more than 60 heads. Or more than 60 tails. So, if you saw that, you'd say, 'Well, this is kind of odd. It doesn't seem consistent with my model, in which nothing's going on.' Russ Roberts: Well, in that case, you'd presume--your model is that it is a fair coin, with a 50% chance of heads and a 50% chance of tails. So, if you consistently got 70 or 80 or 90 you'd start to wonder whether your model was, your assumption was correct. But--correct me if I'm wrong--1 in 20 times you will get more than 60 heads. So that, if that might just happen to be one of the times that that happens. That will happen, actually, 5% of the time, is what you are saying. Correct? Andrew Gelman: Yeah. It's not literally 1 in 20 that you'll get more than 60, because those numbers were just approximate. Russ Roberts: Sorry. 
Andrew Gelman: No, no, I was just pulling--I mean, there's some cut point. Exactly. But. Right. So, if you are living in a world where sometimes you have random number generators and you are trying to say, 'Are the results consistent with a certain random number generator?' then the statistical significance test, it's doing that for you.
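
The coin-flipping illustration can be checked with a short calculation. The following is a minimal sketch in Python, assuming a fair coin and 100 independent flips; the function names are illustrative and do not come from the conversation.

```python
from math import comb


def prob_more_than(k: int, n: int = 100) -> float:
    """P(heads > k) for n independent flips of a fair coin (exact binomial sum)."""
    favorable = sum(comb(n, heads) for heads in range(k + 1, n + 1))
    return favorable / 2**n


def prob_either_extreme(k: int, n: int = 100) -> float:
    """P(more than k heads OR more than k tails); requires k >= n / 2."""
    if 2 * k < n:
        raise ValueError("k must be at least n / 2 so the two events cannot overlap")
    # The two events are mutually exclusive and equally likely for a fair coin.
    return 2 * prob_more_than(k, n)


if __name__ == "__main__":
    print(f"P(more than 60 heads)          = {prob_more_than(60):.4f}")
    print(f"P(more than 60 heads or tails) = {prob_either_extreme(60):.4f}")
    print(f"P(more than 59 heads or tails) = {prob_either_extreme(59):.4f}")
```

Under these assumptions the chance of more than 60 heads or more than 60 tails comes out below 5%, while the same calculation for "more than 59" comes out slightly above 5%, which is in line with the remark that the numbers were approximate and that "there's some cut point."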

4:54 Russ Roberts: But that's not how we use it in empirical work. So, your more accurate definition than mine was, I think, had at least two negatives. So, I was somewhat confused by it. You are someone who has written papers with statistically significant results, I think. So, let's go back to that more accurate definition. I run some analysis. I run an experiment. I have a hypothesis about what I'm going to find. I may come to that hypothesis after the fact--we'll come to that later. But I have some--in my published paper, I have a result. And it says that this claim, about this, say, difference or impact of some variable on something we care about--a public policy change, might be the minimum wage, it might be immigration, it might be male/female differences--this difference or this effect is statistically significant. And when people write that in published work, what do they have in mind? Andrew Gelman: What they are saying--should I give an example? Russ Roberts: Yeah; sure. Andrew Gelman: So, a few years ago some economist did a study of early childhood intervention in Jamaica. And about 25 years ago, they went, they gathered, about I think it was 130 4-year-olds in Jamaica, and they divided them into a treatment group and a control group. And the treatment group--they--the control group--they did some helpful things for the kids' families. The control group, the treatment group, sorry--they had--a fairly intense 1-year intervention with the parent. Then they followed up the kids, 20 years later they looked at the kids' earnings, which is--earnings is this quaint term that economists use to describe how much money you make. Russ Roberts: Yes. Andrew Gelman: And they like to call it how much you earn. Russ Roberts: Yes. Andrew Gelman: And it turned out that the kids, in the treatment group, had 42% higher earnings than the kids in the control group. Russ Roberts: And the intervention was when they were 4 years old-- Andrew Gelman: When they were 4 years old. 
And the idea--I mean, there is some vague theory behind it, that this is a time of life where, if the kids can be prepared it can make a big difference. It's controversial. There's people who don't believe it. So they did this study. And they did the study--we'll come back to the study. It's a great example. So, they found an estimate of 42%. And it was statistically significant. So, the statement goes as follows: Suppose that the treatment had no effect. So, suppose that they were giving, not even placebo. Like, just nothing. Like there was no difference between getting leaflets or whatever they were getting and getting the full treatment. Zero effect. There's still going to be randomness in the data because some kids are going to earn more than other kids, as they grow older. So, if you have no--if it's completely random and the treatment has zero effect whatsoever, not any effect at all, not a placebo, not nothing--then, you'd expect some level of variation. And it turns out that if you see an effect with those data, if you see an effect as large as 42%, there's a less than 5% chance that you see an effect that large just by chance. Russ Roberts: Which encourages you to think that you found something that works. Andrew Gelman: Right. It says that, well, here's two stories of the world. One is the treatment really works; it's helping these kids. Another story of the world is, it's just random; it's just fluctuations in data. Statistical significance tests, the p-value rules out the hypothesis, seems to rule out the hypothesis that it's just capitalizing on noise. Now, it doesn't really, for reasons we can get into-- Russ Roberts: [?] want to-- Andrew Gelman: Right. That's an example. And I can give you another example, if you want.
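
The "suppose that the treatment had no effect" reasoning can be illustrated with a permutation test. This is a minimal sketch in Python, not the method used in the Jamaica study: the earnings figures below are made-up numbers, and `permutation_p_value` is a hypothetical helper written for this illustration. If the treatment did nothing, the group labels are arbitrary, so the labels are reshuffled many times and the sketch counts how often chance alone produces a gap at least as large as the observed one.

```python
import random
from statistics import fmean


def permutation_p_value(treated, control, n_perm=10_000, seed=0):
    """Share of random relabelings whose gap in means is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(fmean(treated) - fmean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # under "no effect", any relabeling is equally likely
        gap = abs(fmean(pooled[:n_treated]) - fmean(pooled[n_treated:]))
        if gap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical annual earnings, invented purely for illustration.
    treated = [31, 45, 28, 52, 39, 61, 35, 48, 42, 57]
    control = [25, 38, 30, 41, 27, 44, 33, 29, 36, 40]
    print(f"p-value under 'no effect': {permutation_p_value(treated, control):.3f}")
```

A small p-value here says only that such a gap would be unusual if the labels were arbitrary and if this one analysis had been fixed in advance.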

8:41 Russ Roberts: I want to stop with that one. Because I just want to--before we go any further, I just want to talk about my favorite, one of my favorite things that I dislike. And I want to let you react to it. So, let's say I'm at a cocktail party; someone says, 'I think we should spend more on preschool education.' And I say, 'Well, it might be good. I don't know. It probably has some benefit. It depends on what it costs, and how you decide what's in that education.' And then the person, the other person says, 'Well, studies show that preschool education has a 42% impact on wages, even 20 years later.' And that 42% is really--I'm sure the actual number is even more precise than 42--it's 42.something. And now the burden of proof is on me. This is a scientific result. There was a peer-reviewed paper. And in fact as you point out, and I did my homework before this interview, one of the authors of that paper is Jim Heckman, who has got a Nobel Prize in economics. He's been a guest on EconTalk, to boot. And you're going to tell me that that result isn't reliable? So there is a certain magicness to statistical significance that people do invoke in policy discussions and debate. Andrew Gelman: Yes. There is a magic. I think that your hypothetical person who talks to you at a cocktail party, I would not agree that the evidence is so strong. I wouldn't want to personalize it with respect to Jim Heckman. Russ Roberts: Yeah, neither would I. He's a good man-- Andrew Gelman: What he's doing, he's doing standard practice. And so, I wouldn't--I would hope that he could do better. But this is kind of what everybody's doing. Russ Roberts: So, what's wrong with that conclusion? Why might one challenge that 42% result? Which, as you point out, only 132 observations--and this is where one of my biggest 'Aha!' moments came from reading your stuff. Because I would have thought, 'Wow. Only 130 observations'--that means you have to divide it in half: 65 and 65 roughly. I assume. 
Maybe people dropped out; maybe you lost track of some people. But you have a small sample, and you still found--that's very noisy, usually. Very imprecise. And you still found statistical significance. That means, 'Wow, if you'd had a big sample you'd have found even a more reliable effect.' Andrew Gelman: Um, yes. You're using what we call the 'That which does not kill my statistical significance makes it stronger' fallacy. We can talk about that, too. Russ Roberts: Yeah. Explain. Andrew Gelman: Well, to explain that we need to probably do more background. So, let's get to that. But let me say here that there are two problems with the claim. So, the first problem is that, it's not true that if nothing were going on there's less than a 5% chance that you'd see something so extreme. That's actually not correct. Actually, if nothing's going on, your chance of finding something "statistically significant" is actually much more than 5% because of what some psychology researchers refer to as 'researcher degrees of freedom,' or 'p-hacking.' Or, what I call the Garden of Forking Paths. Basically, there are many different analyses that you could do of any data set. And you kind of only need one that's statistically significant to get it to be publishable. It's a little bit like you have a lottery where there's a 1 in 20 chance of a winning ticket, but you get to keep buying lottery tickets until you win. So that's half of it. The other half, that the estimate of 42% is an over-estimate--in statistics jargon, it's a biased estimate. So, when you report things that are statistically significant, by definition, or by construction, a statistically significant estimate has to be large. Under typical calculations, it has to be at least 2 times, 2 standard errors away from zero. And the standard error is something calculated based on the design. This particular study had, maybe, maybe had a standard error of about 20%--so, 42% is two standard errors away from zero. 
Because it's selection bias, the only things that get--I wouldn't say the only things-- Russ Roberts: yeah, almost-- Andrew Gelman: typically--well, people do publish studies where they say, 'Hey, we shot something down.' Okay? Russ Roberts: Yep. Andrew Gelman: But when studies are reported as a success, they are almost always reported as statistically significant. So, if you design a study with a small sample that's highly variable--as of course studies of kids, and adults are: people are highly variable creatures--when you design a study that's highly variable, in a small sample you'll have a large standard error. Which means any result that's possibly statistically significant has to be large. So, you are using a statistical procedure which, a). has more than a 5% chance of giving you a positive finding even if nothing is going on, and b). whether or not something is going on, the estimate is going to be over-estimated. So that, I don't believe that 42% number because of the procedure used to create it.
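
The claim that a statistically significant estimate from a noisy study "has to be large" can be illustrated by simulation. This is a minimal sketch in Python under loudly stated assumptions: the true effect of 10% is invented for illustration, the 20% standard error is the rough figure mentioned above (which may not match the study), estimates are assumed to be normally distributed, and `significance_filter` is a hypothetical helper.

```python
import random


def significance_filter(true_effect, std_error, n_studies=50_000, seed=1):
    """Simulate many studies; return (share significant, mean |estimate| among those)."""
    rng = random.Random(seed)
    threshold = 2 * std_error  # "at least 2 standard errors away from zero"
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)  # one noisy study
        if abs(estimate) > threshold:
            significant.append(estimate)
    if not significant:
        return 0.0, float("nan")
    share = len(significant) / n_studies
    mean_abs = sum(abs(e) for e in significant) / len(significant)
    return share, mean_abs


if __name__ == "__main__":
    true_effect, std_error = 0.10, 0.20  # assumed values, not taken from the paper
    share, mean_abs = significance_filter(true_effect, std_error)
    print(f"share of studies reaching significance: {share:.1%}")
    print(f"average size of significant estimates:  {mean_abs:.0%}")
    print(f"assumed true effect:                    {true_effect:.0%}")
```

By construction, every estimate that passes the filter is larger than 40% in absolute value, so the average of the significant estimates must exceed the assumed 10% true effect several times over. The sketch covers only the second problem (overestimation); it does not model forking paths.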

14:00 Russ Roberts: Now, the people who did the study were aware of some of these issues: in fact, aware of all of them, really. One of the ways they avoided--one of the ways they tried to keep down the standard error--the imprecision of the estimate that's inevitable with a small, finite sample--is they--the parents of these children were chosen to be somewhat similar. Right? They were low-income parents. Now, low-income parents yield high-income children sometimes. And sometimes not. As you point out, there's a lot of variation. It just has nothing to do with the treatment effect. And everyone understands that. So, when you say that the standard error is 20%, it's a way of saying in statistical jargon that: Of course, there's going to be some--even if we had all the children come from parents with the same income--literally the same--the same within a few dollars--they'd still have variation as they grew up because of random life events, skills, things you can't observe, things you can't control for. And so, as you point out, you are trying to say, well, but if it's 42%, that must be really big. And your point is that, well, it kind of had to be or it wouldn't be published. So, what we're observing is a non-random sample of the results measuring this impact. Is that a good way to say it? Andrew Gelman: Yes. And let me say--I don't know that the standard error was exactly 20%, in case anyone wants to look that up-- Russ Roberts: No, no-- Andrew Gelman: I just know that 42% was statistically--so, I'll tell you a few things about the study. So, some of the kids went to other countries. And I think they on average had higher incomes. And the percentage who went to other countries was different for the treatment and control group. Now, that's not necessarily a bad thing--now, maybe part of the treatment is to encourage people to move. Or maybe going to another country is kind of a random thing that you might want to control for. 
So, there's sort of a degree of freedom in the analysis right there. Another degree of freedom in the analysis is that they actually had 4 groups, not 2 groups, because, if I'm remembering it correctly, they crossed the intervention with some dietary intervention, I think, giving minerals. I can't remember the details. But I think they concluded that the dietary intervention didn't really seem to have an effect. So they averaged over that. But of course, had they found something, that would have been reportable. The published paper actually--the pre-print had 42%; and then the published paper, I think the estimate was more like 25%. So, there's a lot. And I guess the standard error was lower, too. So a lot depends on how you actually do the regression. Which I'm sure you're aware of, from your experience. Well, let me just sort of think: It's not--I don't feel that--it's not that any of their choices were wrong. So, I don't think that what they did with the people who moved to other countries was necessarily a bad choice. You have to choose how to analyze it. I don't think it was necessarily a wrong thing to aggregate over the intervention that didn't seem to make a difference. Certainly I don't think it's wrong to run a regression analysis. I've written a whole book about regression. It's not that any of the analyses are wrong. It's that, when you do that, given that you kind of know the goal, is to get a win, you are more likely to find statistical significance than you would have. And I actually have written about this--that there's a term, 'p-hacking,' which is that people hack the data until they get p less than .05, and they can publish. And, I don't really--I have no reason to think the authors of this paper were "p-hacking." I don't think--it's not like a matter of people cheating. It's not a matter of, like, people like craftily trying to manipulate the system. I mean, the guys who wrote this paper have very secure careers. 
If they didn't think the effect was large, they would have no motivation to exaggerate it. But, they are using statistical procedures, which happen to be biased, because of selection. And it's hard to avoid that. Just like if you are doctor and you don't blind yourself. You can have all the good will in the world, but you are using a biased procedure and it's hard to correct for bias.
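
The lottery analogy from earlier in the conversation (a 1 in 20 chance per ticket, but you keep buying tickets) can be put into numbers. This is a minimal sketch in Python that assumes each candidate analysis is an independent 5% shot under pure noise. That independence is a simplifying assumption: real analyses of the same data set are correlated, so the figures are illustrative only, and `chance_of_any_win` is a hypothetical helper.

```python
def chance_of_any_win(n_analyses: int, alpha: float = 0.05) -> float:
    """P(at least one 'significant' result) across independent analyses of pure noise."""
    return 1 - (1 - alpha) ** n_analyses


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        print(f"{n:>2} possible analyses -> {chance_of_any_win(n):.0%} chance of a 'win'")
```

Under this simplification, a handful of defensible choices (how to treat the children who moved abroad, whether to pool the dietary arms, which regression to run) is enough to push the chance of some significant result well above the nominal 5%, even when each choice is reasonable on its own.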

18:14 Russ Roberts: It's a deep, really deep point. And I really like your distinction between p-hacking and the garden of the forking paths. And we'll come back to that phrase--garden of the forking paths--to make a little clearer why that's the phrase that you use. But, p-hacking has a negative connotation. It sounds corrupt. It sounds like you've done something--as you've said, either dishonest or fraudulent. And, tragically, it's not. It's just common research practice that if you run a result, you run an analysis, and you don't get an interesting result, your natural inclination is to tinker. 'What if we try a different specification? What if we throw out the people who moved? What if I treated the people who moved to the Western hemisphere different from the people who moved to the Eastern hemisphere? What if I--?' There's so many choices. And that's the garden of the forking paths: you have so many decision modes that you have to inevitably make in these kind of analyses where there's a lot going on--and the world's a complicated place--that your natural inclination is to try different stuff. And the economics version of this is Ed Leamer's paper, "Let's Take the Con Out of Econometrics", where he basically says: When you are doing this, you no longer have the situation where the classical statistical test of p < .05 is the right one. Because it's not a one-time thing. You've made all these other choices. Andrew Gelman: Let me pick up on that. Because this relates to this concept of the replication crisis. Before I get into that, let me also interject that Uri Simonsohn and his colleagues, [?] psychologists who coined the terms p-hacking and researcher degrees of freedom, if you read their paper, they never say that p-hacking is cheating. I mean, they are pretty clear. So, I don't like the term 'p-hacking' because I think it implies cheating. But I just want to say that the people who came up with term were quite scrupulous. 
They are, I think, actually, nicer than I am. They had a paper called--their paper from 2011 was called "False Positive Psychology." About how you can get false positives through p-hacking. I wrote something on the blog and I said how they used that to mock--there's a sub-field of psychology called Positive Psychology, which is about how[?] psychology can help you. Which happens to be plagued with studies that are flawed. And I wrote that, um, the title "False Positive Psychology" was a play on words. And, Uri Simonsohn emailed me and said, 'No! They had no meaning to-- Russ Roberts: It was an accident-- Andrew Gelman: he just--he is a nice guy; he wasn't doing that. Now, let me come back to--so, there's something called a Replication Crisis in Psychology that-- Russ Roberts: We've interviewed Brian Nosek a couple of times on the program. Andrew Gelman: Okay. Russ Roberts: So, we're into it. But describe what it is. And it's very important. Andrew Gelman: People have done studies which are either published or appear completely successful--have good p-values. Later on people try to replicate them, and the replications fail. And the question is: Why does the replication fail? When a replication fails, it's natural to say, 'How does the new study differ from the old study?' Actually, though, typically the main way the new and the old studies differ is that the new study is controlled. Meaning, you actually know ahead of time what you are going to look for. Whereas the old study is uncontrolled. So, you can do p-hacking on the old study but not on the new study. And Nosek himself--he must have told you--he did a study, the so-called 50 stages-- Russ Roberts: "50 Shades of Gray"-- Andrew Gelman: said he, with his own study that he was ready to publish with his collaborators; and they replicated it; and it didn't replicate. And they realized they had p-hacked without realizing. So, let me get back to the early childhood intervention. 
So, the usual, one way you could handle this result is you could say, 'It's an interesting study. Great. Okay. There's forking paths. We don't know if we believe this result, so let's replicate. So, the trouble is, that to replicate this would mean waiting another 25 years. Russ Roberts: Yeah. Andrew Gelman: And what are you going to do in the meantime? So, replication is a lot easier in psychology than it is in economics or political science. We can't just like say, 'I want to learn more about international relations so let's start a few more wars. And rip up some trees. And see what happens.' Or, throughout the economy, 'Let's create a couple more recessions and create some discontinuities.' It might happen, but people aren't doing it on purpose. Russ Roberts: It's hard to get a large number where the other things are constant. There's always the potential to say, 'This time was different,' because this depression, or this recession was started by the housing sector, or the whatever--so it actually can't even be generalized with all those other ones. Andrew Gelman: Yeah. You just can't replicate it. So, it puts us--what I'd like to get back to in these examples--it puts us in a difficult position. Because on the one hand, I think these claims are way overstated. On the other hand, you have to do something different. I'd like to share two more examples, if there's a chance. But you tell me when is the best time.

23:17 Russ Roberts: Well, I want to talk about, in the psychology literature particularly, this issue of priming, that was recently talked about. But we can talk about lots of things. So, what do you want to talk about? Andrew Gelman: I wanted to give two--I want to give three examples. So the first example was something that really matters, and there's a lot of belief that early childhood intervention should work. Although the number of 42% sounds a little high. And it's also one where people actually care about how much it helps. It's not enough to say--even if you could somehow prove beyond a shred of a doubt that it's had a positive effect, you'd need to know how much of an effect it is. Because it's always being compared to other potential uses of tax dollars. Or individual dollars. So, I want to bring up two other examples. So, the second example is from a few years ago. There was a psychologist at Cornell University who did an experiment on Cornell students of ESP (Extra Sensory Perception). And he wrote a paper finding that these students could foretell the future. And it was one of these lab experiments--I don't remember the details but they could click on something and somehow you--it was one of these things where you could only know the right answer after you clicked on it. And he felt, he claimed that they were predicting the future. And if you look carefully-- Russ Roberts: Andrew, I've got to say before you go on--when I saw the study, the articles on that, I thought it was from the Onion. But, it's evidently a real paper. Any one of these, by the way, strikes me as an Onion article--that people named Dennis are more likely to be dentists. Andrew Gelman: I was going to get to that one. Russ Roberts: Yeah, well, go ahead. Go with the ESP first. Andrew Gelman: So, the early childhood intervention is certainly no Onion article. 
The ESP article--it was published in the Journal of Personality and Social Psychology, which is one of the top journals in the field. Now, when it came out, the take on it was that it was an impeccably-done study; and, sure, like people--most people didn't believe it. I don't even think the Editor of the journal believed it. They published it nonetheless. Why did they publish it? Part of it is, like, we're scientists and we don't want to be suppressing stuff just because we don't believe it. But part of it was the take on it--which I disagree with, by the way. But at the time, the take on it was that this was an impeccably-done study, was high quality research; it had to be published because if you are publishing these other things you have to publish this, too. And there's something wrong. Like, once it came out, there's obviously something wrong there. Like, what did they do wrong? It was like a big mystery. Oh, and by the way: The paper was featured, among other places, completely uncritically on the Freakonomics blog. Russ Roberts: I'm sure it made the front page of newspapers and the nightly news-- Andrew Gelman: It was on the front page of the New York Times. Yeah. So in the newspaper--they were more careful in the newspaper than in Freakonomics, and they wrote something like, 'People don't really believe it, but this is a conundrum.' If you look at the paper carefully, it had so many forking paths: there's so much p-hacking--almost every paragraph in the results section--they try one thing, it doesn't work. They try something else. It's the opposite of a controlled study. The experiment was controlled: they randomly assigned treatments. But then the analysis was completely uncontrolled. It's super-clear that they had many more than 20 things they could have done for every section, for every experiment. It's not at all a surprise that they could have got statistical significance. 
And what's funny is when it came out, a lot of people--like, the journal editor--were like, 'Oh, this is solid work.' Well, like, that's what people do in psychology. This is a standard thing. But when you look at it carefully it's completely--it was terrible. Russ Roberts: So, in that example--I mean, what's interesting about that for me is that you say, 'In the results it was clear to you.' But of course in retrospect, in many published studies--the phrase I like is 'We don't get to be in the kitchen with the statistician, the economist, the psychologist. We don't know what was accepted and rejected.' So, one of my favorites is baseball players whose names start with 'K' are more likely to strike out. Well, did you look at basketball players and see if their names start with 'A' are more likely to have assists? Did you look at--how many things did you look at? And if you don't tell me that--'K' is the scoring letter for strike-out, for those listening at home who are not from America; or who don't follow baseball; or who don't score--keep track of the game via scorecard; 'K' is a shorthand abbreviation for strikeout--which, of course, is funny because I'm sure some athletes don't know that either. But the claim was that they are more likely to strike out. I don't know the full range of things that the author has tested for unless they give me what I've started to call the Go-Pro--you wear the HeadCam [head camera]--and I get to see all your regressions; and all your different specifications; and all the assumptions you made about sample; and who you excluded; and what outliers. Now, sometimes you get some of that detail. Sometimes authors will tell you. Andrew Gelman: This is like cops--like research [? audio garbled--Econlib Ed.] Russ Roberts: Exactly. Andrew Gelman: So, it's actually worse than that. It's not just all the analyses you did. It's all the analysis you could have done. 
And so, some people wrote a paper, and they had a statistically significant result, and I didn't believe it; and I gave all these reasons, and I said how it's the garden of forking paths: If you had seen other data you would have done--you could have done your analysis differently. And they were very indignant. And they said, 'How can you dismiss what we did based on--and your assumption'--that's me--'how can I dismiss what they did based on my assumption about what they would have done, had the data had been different? That seems super-unfair.' Russ Roberts: It does. Andrew Gelman: Like, how is it that I come in from the outside? And the answer is that, if you report a p-value in your paper--a probability that a result would have been more extreme, had the data come from, at random--your p-value is literally a statement about what you would have done had the data been different. So the burden is on you. So, to get back to the person in the, you know, who bugs you at the cocktail party, if someone says, 'This is statistically significant, the p-value is less than .05; therefore had the data been noise it's very unlikely we would have seen this,' they are making a statement saying, 'Had the data looked different, we would have done the exact same analysis.' They are making a statement about what they would have done. So, the GoPro wasn't even quite enough. Because my take on it is people navigate their data. So, you see an interesting pattern in some data, and then you go test it. It's not--like, the thing with the assists, the letter 'A', whatever--maybe they never did that. However, had the data been different maybe they would have looked at something different. They would have been able-- Russ Roberts: And someone, I didn't read carefully in this, but someone did write a response to that article saying that it turned out that people with the letter 'O' struck out even more often. What do you do with that? 
Which is a different variation on that, all the possible things you could have looked at. Andrew Gelman: Well, they also found that--my favorite was that lawyers--they looked at the number of lawyers named 'Laura,' and the number of dentists named 'Dennis'. And there are about twice as many lawyers named 'Laura' and dentists named 'Dennis' as you would expect if the names were just at random. And I believe this. So, when I-- Russ Roberts: Twice as much! How could you--it's obviously not random! Andrew Gelman: Well, no. Well, twice as much--well, yeah. Twice as much is first-- Russ Roberts: Twice as much as what? Andrew Gelman: It's not as ridiculous as you might think. So, it goes like this. Very few people are dentists. So, if like 1% of the people named 'Dennis' decide to become dentists, that will be enough to double the number of dentists named 'Dennis.' Because it's a rare career choice. So, in some way it's not the most implausible story in the world. It actually takes only a small number of people to choose their career based on their name for it to completely do this to the statistics. But--and I bought it. I was writing about it. But then someone pointed out that the names 'Laura' and 'Dennis' were actually quite popular many years ago--like I guess when we were kids or even before then. And when the study was done, the lawyers and dentists in the study were mostly middle-aged people. So, in fact they hadn't corrected for the age distribution. So there was something that they hadn't thought of. It was an uncontrolled study. So, I bring up the ESP only because that's a case where, like, it's pretty plausible that it was just noise. And then when you look carefully at what they did, it's pretty clear that they did just zillions of different analyses.
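
The arithmetic behind "twice as many dentists named Dennis" can be sketched directly. This is a minimal illustration in Python with hypothetical numbers: the 1% base rate is chosen only so that the quoted "1% of the people named Dennis" produces a doubling, and it is not an actual occupational statistic. The helper `overrepresentation` is likewise invented for this illustration.

```python
def overrepresentation(base_rate: float, name_driven_rate: float) -> float:
    """Ratio of observed to expected dentists among people with a given name.

    base_rate: share of the general population who become dentists (assumed).
    name_driven_rate: extra share of that name's bearers who become dentists
        because of the name (assumed).
    """
    return (base_rate + name_driven_rate) / base_rate


if __name__ == "__main__":
    # Hypothetical: 1 in 100 people become dentists, plus 1 in 100 Dennises
    # who pick dentistry because of the name.
    print(f"ratio with a 1% base rate:   {overrepresentation(0.01, 0.01):.1f}x")
    # The rarer the career, the fewer name-driven choices a doubling requires.
    print(f"ratio with a 0.2% base rate: {overrepresentation(0.002, 0.002):.1f}x")
```

The ratio depends only on how the name-driven share compares with the base rate, which is why a rare career choice can show a large ratio from a small number of people. The sketch does not address the age-distribution problem raised above, which can produce the same ratio with no name effect at all.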

32:20 Russ Roberts: So, I want to bring up the Priming example, because I want to make sure we get to it. And then I want to let you do some psychological analysis of my psyche. But the Priming example is, there was a very respected and incredibly highly cited paper that took a bunch of undergraduates, put them in a room, asked them to form sentences with 5 different words. And that really wasn't the experiment. The real experiment was watching what they did when they left the room. And the 5--one group got 5 words that were associated with the elderly, like 'Florida,' and 'bald' and 'wrinkly' and 'old'--not 'old' but 'subtle'. 'Subtle,' 'gray'. Andrew Gelman: 'Tenured professor,' right, was there? Russ Roberts: Right. Gray. I don't remember. And then one of the--the control group, the other group, got sort of regular words. And it turned out that the people who got the 'old' words, like, 'Florida,' 'wrinkly,' 'old,' 'gray,' and 'bald,'--they left the room more slowly, because they'd been primed to think about being old. Now, none of them, of course, asked for a walker. This is my bad joke about this kind of study. It's like, these are going to have to be somewhat subtle effects. And yet they found a statistically significant result that, when the replication attempt was tried, was not found to be successful. They could not replicate this result. And of course there was a big argument back and forth between the original authors, whether they did it right. But my view was always this seemed silly to me. And your point about small samples--and these are very small samples, I think it's 30 or 50 undergraduates, where the speed of leaving the room is going to be highly noisy, meaning a high standard error. So to find a statistically significant effect, you're going to have to find a big effect, which to me is implausible. But that's what they found. And then it didn't replicate. Andrew Gelman: Oh, but it's worse than that.
Because between the original study and the non-replication, there were maybe 300 papers that cited the original paper. What were called 'conceptual replications.' What appeared to be replicated. So, someone would do a new study and they would test something slightly different--a different set of words, different conditions--and find a different pattern. Like you might--maybe there was another study where you'd do a certain kind of word and people would end up walking faster, not slower. Russ Roberts: That seemed to confirm the original result. Overwhelmingly. Because, as you say, hundreds of studies found the existence of priming once they knew to look for it. Why isn't that true? Why isn't priming real? Andrew Gelman: Right. That's the problem. It's that--well, of course, priming is real. Everything is real, at that level. It's that it varies. So, I think, give people a couple of words, and for a lot of people, it won't prime them at all, for anything. But depending on who you are, it might really tick you off; it might remind you that you have to go to the bathroom and you have to walk faster. There are all sorts of things it can do. The effect is highly variable. In fact, the concept, the effect, is kind of meaningless. Because it's the nature of these kinds of indirect stimuli to do different things to different people in different scenarios. So, I think part of the problem is the theoretical framework. They have a sort of button-pushing, take-a-pill model of the world. So, this idea that like, oh, you push the button and it makes people walk a lot slower--that's very naive. Just a psychological--just treating it as a psychological theory, it's naive. But it's a self-contained system. You do a study, and it's possible to get statistical significance through forking paths. Do another study: If it shows the same thing as the first study, that's great. If it doesn't, you come up with a story for why it doesn't.
Then, you come up with a story and you find a pattern in data, itself statistically significant. That's a second study. This can go on forever. There's really no way of stopping it. The only way of stopping it, perhaps, is to do, either through theoretical analyses of the sort that I do, to explain why statistical significance is not so meaningful. Or, by just brute force running a replication. And running a replication is great. You can't very often do that in political science and econ, but when you can do it, it sort of shuts people up. For sure.
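
A rough simulation can show why significance is easy to reach when the analysis is allowed to follow the data. The sketch below is a simplification, not the procedure of any study discussed here: it assumes pure noise (no true effect anywhere), a known standard deviation of 1, and an analyst who looks at several outcomes and reports whichever comparison looks strongest. The function name, sample size, and number of outcomes are all hypothetical.

```python
import math
import random


def false_positive_rate(n_outcomes, n_per_group=20, n_sims=2000, seed=0):
    """Share of pure-noise studies that yield at least one |z| > 1.96.

    Each simulated study compares a treated and a control group on
    n_outcomes independent outcomes; the analyst reports the strongest one.
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # std. error of a difference in means, sd = 1
    hits = 0
    for _ in range(n_sims):
        best_z = 0.0
        for _ in range(n_outcomes):
            treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            diff = sum(treated) / n_per_group - sum(control) / n_per_group
            best_z = max(best_z, abs(diff) / se)
        if best_z > 1.96:  # "statistically significant" at the 5% level
            hits += 1
    return hits / n_sims


one_path = false_positive_rate(n_outcomes=1)
five_paths = false_positive_rate(n_outcomes=5)
print(f"one pre-chosen comparison: {one_path:.3f}")
print(f"best of five comparisons:  {five_paths:.3f}")
```

With one pre-chosen comparison the rate should sit near the nominal 5%; with five independent comparisons theory gives roughly 1 - 0.95^5, or about 23%. The forking-paths argument in the conversation is subtler than this, since the analyst may run only one analysis, chosen after seeing the data, but the inflation works the same way.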

36:28 Russ Roberts: So, in this case, the dozens or hundreds of statistically significant results of priming didn't seem to be confirmed by the attempts to replicate them. As you point out, where you have one choice. We're going to look at these kind of words and see if people walk slower. As opposed to, 'Well, they walked faster. I guess that's because--' and it's statistically significant, or, 'I tried a different set of words.' Or I tried a different group. And so, somebody blogged on this recently. And Daniel Kahneman, Nobel Laureate, commented on the blog--and apparently it actually was him. There's always some uncertainty about whether he actually commented. Because he had had a chapter in his book, Thinking Fast and Slow, on Priming. And he conceded--he had actually written about it a few years ago. But he conceded that: these results are probably not reliable. And this was the shocking part for me--and we'll link to this, because I'm going to write about it. It's just stunning. He said, 'Well, I just assumed that, since they had survived the peer-review process, I have to accept them as scientific.' And that was the most stunning thing I'd--besides the fact that he conceded that his chapter was probably not reliable, the fact that he also conceded that he had used the fact that they had survived the peer-review process as sufficient to prove their scientific merit was also stunning to me. Now, he's conceding--I think correctly--that peer review is not an infallible scientific barometer. Which is good--that's a good thing. Andrew Gelman: It is a good thing. But it took us a while to realize that.

38:01 Russ Roberts: But that brings us to my problem. Which I want your help with. So, now what? So, I'm a skeptic. So, I tend to reject--I don't like psychology so much. I don't like these cutesy results that fill all these pop books by authors whose names we are not going to mention, that use these really clever, bizarre results; but they are peer-reviewed and they are statistically significant. So, I tend to make fun of all of them. Even before I've looked at the study. And that's not a good habit. And similarly, in economics, the idea that we can control for these factors and measure, say, 'The Multiplier'--to come back to 'The' Priming Effect--strikes me as foolish, silly, and unscientific. But, don't I have a problem of going too far? Can I now--I'm tempted to reject all of these findings in economics, epidemiology, social psychology. Because I say, 'Oh, they are all p-hacking. It's the garden of the forking paths. It's the file-drawer bias.' Etc., etc. Almost none of them replicate. Or, do I say, 'Well, I'm going to keep an open mind. Some of them might replicate. Some of them might be true. And if so, how do I decide which ones?' Help me out, Doctor. Andrew Gelman: Um, I think we have to move away from the idea of which are true and which are false. So, setting aside things like the ESP, which may be--I'm not an expert in that topic, but there's certainly a lot of people who take the reasonable view that there's nothing going on at all there. But generally, I think there are effects. The Early Childhood Intervention has effects on individual kids. And I think that priming people has effects. It's just that the consistent effect of something like priming is just the average of all local effects. The same with early childhood intervention. The reported effect, the average treatment effect, of early childhood intervention, is the average of all the effects on individual kids. It will be positive for some and negative for others.
It's going to vary in size. So, I don't think you should put yourself in a position of having to decide, 'Does it work or not?' I think everything works, everything has an effect. Sometimes, some of these things, maybe it doesn't matter. Like, the Priming--what are you supposed to do with the priming? You are supposed to say, 'Oh well, maybe the priming will make a difference.' So, for example, if you are advising a company, and they are advertising: would a certain prime help them sell more of a product? Or, if you don't like the advertising, you are working for a government agency, you are trying to prime people to have better behavior or trying to prime soldiers to be less afraid or whatever it is. That's going to be a specific context. And, you know--I think you want to study it in that specific context. I don't think we're going to learn much from some literature in the psychology labs, seeing words on the screen.

40:58 Russ Roberts: Well, let me take a slightly more important example. So, I'm not going to argue that what you just said is un-important. But it's relatively unimportant. It would be scary if the government or if corporations were secretly influencing us. I mean, an example of this would be when they allegedly flashed messages about Coke on the screen during movies, and people supposedly rushed out at intermission and bought a lot of Coke without realizing that they'd seen these simple subliminal suggestions. And I don't think--it turned out--I've seen the real actual study of that. It really didn't work. But somehow that became this scary thing. And if it were true--it would be scary. But let's take the minimum wage. Does an increase in the minimum wage affect employment? Job opportunities for [?] there's a lot of smart people on both sides of this issue who disagree. And who have empirical work showing that they're right and you're wrong; and each side feels smug: that its studies are the good studies. And I reject your claim that I have to accept that it's true or not true. I mean, I'm not sure--which way do I go there? I don't know what to do. Andrew Gelman: Well, I think-- Russ Roberts: Well, excuse me: I do know what to do. Which is, I'm going to rely on something other than the latest statistical analysis. Because I know it's noisy and full of problems, and has probably been p-hacked. I'm going to rely on basic economic logic, the incentives that I've seen work over and over and over again. And my level of empirical evidence that the minimum wage isn't good for low-income people is the fact that firms ship jobs overseas to save money; they change, they put in automation to save money. And I assume that when you put in the minimum wage they are going to find ways to save money there, too. So, I--it's not a made-up religious view. I have evidence for it. But it's not statistical. So, what do I do there? Andrew Gelman: Okay.
I'd rather not talk too much about the minimum wage because it has a lot of technical knowledge that I'm not an expert on. Last time I took economics, was in 11th grade. I did get an A in the class, but still I wouldn't say I'm an expert on the minimum wage. But let's talk a little bit about that. So, the first thing is that I do think that having a minimum wage policy would have effects on a lot of people. And it will help some people and hurt other people. So, I think the hypothesis that the minimum wage has 0 effect is kind of silly. Of course it's going to have an effect. And obviously there are going to be people who are going to get paid more, and other people who aren't going to get hired. So, part of it is just quantitative: How much is the effect going to be? Who is it going to be helping and hurting? The other thing--I agree with you completely about the role of theory, that your theory has to help you understand it. I think it's possible to fill in the gaps a little bit. So, to say, 'You have a theory about what firms will do in response to the minimum wage; and you have evidence based on how firms have responded to the minimum wage in the past. But you can argue quite reasonably that number of minimum wage changes is fairly small and idiosyncratic.' So you [?] how firms have responded to many other stressors in the past. And so, you have a theory. I think that one could build a statistical model incorporating those data into your theory. So, you'd have a model that says stressors have different effects; you characterize a stressor are--you'd have some that are somehow more similar to minimum wage like regulatory stressors versus other things, which are economic rather than political; how much--when prices of raw materials change, so forth. It should be possible to fill in the steps: to connect from the theory to the empirics and ultimately [?] make a decision. Let me talk about early childhood intervention, though, instead. 
Not that I know anything about that, either; but that's an area where our theory is weaker. Russ Roberts: Maybe. I don't know. Andrew Gelman: Okay. So, if we talk about theory of early childhood intervention, there's two theories out there. One theory is that this should help because there are certain deficits that kids have and you are directly targeting them. There is another theory that says most things won't help much because people are already doing their best. Right? So those are, sort of, in some sense those are your two theories to get things started. Russ Roberts: There's another one: Nature is stronger than nurture, so it doesn't really matter. Nurture is over-rated. You know, that's another theory. Andrew Gelman: Sure. Indeed. So there's another theory that these things won't have such large effects; that the deficits that people have are symptoms, not causes; and so reducing these deficits might not solve the problem. I mean, for that matter, it's not even nature versus nurture: it's individual versus group. So, if Jamaica is a poor country, it could be their environment. And so changing some aspect--so, sure. So, basically we have a bunch of theories going on. And, if you want to understand them, you probably have to get a little closer to the theory in terms of what's measured. Now, in the meantime, you have decisions to make. So, this study that was done was in a more traditional--like, take a pill, push a button. Like, the experts came up with an intervention. One way to frame this--I think it's very easy to be dismissive of experts. But one way to frame this in a positive way, I believe, is: Imagine that this wasn't an experiment. Imagine that there was just a fixed budget for early childhood intervention--like the government was going to spend x on it, and that was what the people wanted. If you have a fixed budget, you might as well do the best job. And of course you should talk to education experts, and economics experts, and so forth. 
I wouldn't want just some non-experts to make it up. Like, experts have flaws, but presumably people who know about curricula and child development could do a better job. I would assume. Russ Roberts: I'm going to let you assume that, but I'm going to also argue that the evidence for that is very weak. But, go ahead. Andrew Gelman: Okay. Well, let me just say that--let me say that I doubt that they are going to be worse. Assuming that they don't have, if they don't have motivation in[?]-- Russ Roberts: There's fads. There's group-think. Andrew Gelman: Sure. I do[?] take that. I'll accept that. Let's say that--put it another way. I'll accept your point; and let me just step back and--forget about who is doing it. Suppose you are doing it. Or suppose some group is doing it. Suppose there is a mandate to do some level of early childhood intervention, just as there's a mandate to have public education in this country, and so forth. Somehow, you want to do a better job rather than a worse job. So, however that's done, some approach is chosen. And so, you are going to do that. Now, then there's a question of how effective this thing is going to be. And here, this just gets back to the statistics. So, with 130 kids, it's going to be hard to detect much, because the error is so high. The variation is so high. So, it's going to be hard to use a study like this to make a decision. And one of the problems is that we are kind of conditioned to think that if you--we're conditioned to think that the point of social science is to get these definitive studies, these definitive experiments. And we're going to prove that the drug works, or that the treatment works. And, that's kind of a mistake. And partly because of the small sample. But not even just that. It's also because conditions change. What worked in Jamaica 25 years ago might not work in Jamaica now, let alone the United States right now. So, there is no substitute for a theory. 
I think there is no substitute for observational data. So, economists use a lot of observational data. There's millions of kids who go to school, and millions of kids who do preschool. So, in some sense, you have to do that: you have to do the observational analysis. You have to have theory. Ultimately, decisions have to be made. I agree with your skepticism about your own skepticism. That is, saying, 'I don't trust this 42%,' doesn't mean you can say, 'I believe it's 0.' [?] You don't know. And so you have to kind of triangulate. And I think, one thing I like to say is that I think research should be more real-world-like; and the real world should be more research-like. And economists are moving into this. So, more and more people are doing field experiments rather than lab experiments; are doing big studies. So they are trying to make research more realistic. But the flip side is that these studies are small. And they have this noise problem. And, it's taken people a while to realize this. So, a lot of people felt that if you have a field experiment, if you have a field experiment, you have identification because it's an experiment. And you have external validity, because it's in the field. Therefore you have a win. But actually, that's not the case. If it's too noisy a study, and too small a study, the identification and the generalizability aren't enough. So, the flip is if people are doing policy, they need to have good records. The organization should be keeping track of which kids are getting preschool and which kids aren't. And how they are doing. And future statisticians, economists, and sociologists should study these data. And, yes, they are going to have arguments, just like they have arguments about the minimum wage. But, I think you have to kind of do your best on that.
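
The point about small, noisy studies can be made concrete with a short simulation. The sketch below assumes a small true effect and a large standard error; both numbers are hypothetical and are not taken from the Jamaica study or any other study mentioned here. It asks: among the simulated studies that happen to reach statistical significance, how much do they overstate the true effect, and how often do they get the sign wrong? This kind of overstatement is sometimes called a Type M (magnitude) error, and the sign mistake a Type S error.

```python
import random


def significance_filter(true_effect, std_error, n_sims=50_000, seed=1):
    """Simulate noisy estimates and keep only the 'significant' ones.

    Returns (power, exaggeration, wrong_sign):
      power        - share of estimates with |estimate| > 1.96 * std_error
      exaggeration - mean |estimate| among significant ones, divided by |true_effect|
      wrong_sign   - share of significant estimates whose sign is wrong
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > 1.96 * std_error:
            significant.append(estimate)
    power = len(significant) / n_sims
    mean_abs = sum(abs(e) for e in significant) / len(significant)
    exaggeration = mean_abs / abs(true_effect)
    wrong_sign = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    return power, exaggeration, wrong_sign


# Hypothetical study: true effect 0.1, standard error 0.5 (same units).
power, exaggeration, wrong_sign = significance_filter(true_effect=0.1, std_error=0.5)
print(f"power: {power:.3f}")
print(f"exaggeration among significant results: {exaggeration:.1f}x")
print(f"share of significant results with the wrong sign: {wrong_sign:.2f}")
```

Under these assumptions an estimate has to exceed 0.98 in absolute value to be significant, so any significant result overstates the true effect of 0.1 by nearly a factor of ten or more. That is the sense in which a large estimate from a noisy study is "not as meaningful" as it looks, while still not being evidence that the true effect is zero.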

50:31 Russ Roberts: So, let me try a different approach. I'm not proud of this; but I'm going to push it and see whether I can sell you at all on it, okay? So, when you talk about field experiments, I'm thinking about the deworming literature, which had a lot of enthusiasm for deworming. And there was a huge encouragement to give to charities that deworm poor children in Africa because a field experiment found that they did much better in all kinds of dimensions. And when they did it on a very large scale, it didn't work so well. Now, there's pushback from the people who did the first study; and I don't know where we are on that. As you point out: I think de-worming is probably better than not de-worming. The magnitude is what's at issue here, and the variation across individuals. And of course we had a guest on EconTalk who said we have too sterile an environment and that's leading to many autoimmune problems. So, even the question of whether it's good or not is maybe a little bit up in the air. But, here's the way I see it--and I don't like this way of seeing it, but I find myself increasingly drawn to this perspective. Which is: You are arguing, 'Well, you've got to be more honest; you've got to look at a bigger sample, you've got to be more thorough; you've got to keep better records. You've got to be more skeptical; you've got to not oversell. You've got to be aware of the biases.' And I agree with all of that, 100%. But wouldn't you--isn't it possible that the study of statistics, the way it's taught in a Ph.D. program in statistics and the way it's taught in economics and econometrics--is it just giving you a cudgel, a club, a stick, with which to beat your intellectually inferior opponents? And all it really comes down to is ideology and gut instinct? So, when you tell me about priming and I do, 'Get, that [?] strike me as plausible?' And I'm right. 'Hey, my biases, my gut, turned out to be better than the statistical analysis.'
And you tell me about the minimum wage, and we go back and forth with all these incredibly complex statistical analyses, and it turns out, maybe something will come to a consensus--I have no idea. But I'm tempted to just rely on my gut feeling, and to be honest about it. As opposed to pretending, as most--I fear--young economists do now: 'Oh, I just listen to the data. I don't have any preconceptions. I see what the data tell me.' And I find that to be dangerous, to be honest. And I'd almost rather live in a world where people said, 'I'm not going to pretend that my opinion is scientifically based, because there's not much science. There's a lot of pages in the appendix; there's a lot of Greek letters. But the truth is, it's mostly just my gut with a few facts.' Andrew Gelman: I think it depends on the context. I have certainly worked on a lot of problems where people change their views based on the data. And we--it's always--there's always new data coming. So, we estimated the effects of redistricting--you know, gerrymandering. And we estimate--we did a paper in 1994 which was based on data from the 1960s and 1970s, I think, or maybe the 1970s and 1980s--I'm not remembering; I think it was the 1960s and 1970s. Anyway, we looked at a bunch of redistrictings, and we found that the effect of redistricting was largely to make elections more competitive, not less competitive. Now, we found that in the data, and that changed how people viewed things. Now, is that still the case? Maybe not. So, redistricting has become much more advanced than it used to be. You can gerrymander, like you couldn't gerrymander before. So, our conclusions were time-bound. But, in doing that, we learned something new.
We did an analysis of decision-making for radon gas--for radon in your house, which can give you cancer--and using a sort of technocratic approach or a statistical approach, we found that a targeted measurement and intervention, if applied nationally, could save billions of dollars without losing any lives. I do work in toxicology and pharmacology, where we have fairly specific models. I don't use statistical significance to make these decisions. So, when I do these analyses, we do use prior information. And we're very explicit about it. But it's information. So I liked what you said--what you said about the minimum wage was, you said, you have a theory as to why it's counterproductive; and you feel you have data. And with care, they could be put together, and put into a larger model. And, sure, there's going to be political debates. I'm not denying that. But I think there's a lot of room between, on the one hand, things that are so politicized-- Russ Roberts: and complex. Andrew Gelman: Well, not even that complex. But on one hand, some things are so politicized it's going to be very hard for some people to judge, and you have to sort of rely on the political process. And on the other extreme, maybe the other extreme would be studies that are purely data-based, looking at statistical significance, like this ESP study, that are just wrong. I think there's a lot of room in between. I don't think--the fact that we're not going to use science to solve the issue of abortion, or whatever, or maybe even the minimum wage will be difficult--I don't think that means that science or social science are useless. I think that it is part of the political process. I mean, you might as well say that, like, public health is useless because some people aren't going to quit smoking. Like, well, on the margin it could still make a difference, right? And there are a lot of things that are maybe easier to quit, easier for people to change things.
Russ Roberts: The problem with that argument--it's a good point. I take the point. It's a great point. Here's the problem with it. The problem with it is that there are all these errors on the other side, where we do something that's actually--we encourage people, we are encouraging people to smoke because of, we've got empirical, so-called statistical, scientific studies that show that x is good or y is bad-- Andrew Gelman: Well, but that's why we should do better statistics. I think that--I don't think--right. Someone wrote a paper where they looked at--well, I mean like, there's lots of papers like cancer cure-of-the-week; everything causes cancer-- Russ Roberts: Right. Andrew Gelman: everything cures cancer; everything prevents cancer. Right. And someone did this study and they found there had been published papers with a lot of food ingredients that were said to both cause cancer and cure cancer. You know, who knows? Maybe they do. Whatever. Right. And so, I think that we do need to have better statistical analyses. I think that people have to move away from statistical significance. I think that's misleading. People have to have an understanding that when they do a noisy study and they get a large effect, that that's not as meaningful as they think. But, within the context of doing that, I do feel that we've learned. At least, I feel like I have learned from statistical analyses--I've learned things that I couldn't have otherwise learned. A lot of--look at baseball. Look at Bill James. He wasn't doing things of significance[?]-- Russ Roberts: I think about him all the time-- Andrew Gelman: Bill James, like, he learned a lot from data,-- Russ Roberts: 100% correct. Andrew Gelman: from a combination of data and theory, replicating going back and checking on new data. I think if Bill James had been operating based on the principles of standard statistical methods in psychology, he would have discovered a lot less. 
So, I'd like to move towards a Bill James world, even if that means that there's still going to be places where people are making bad decisions. Russ Roberts: So, for people who don't know: Bill James wrote the Baseball Abstract for a number of years and has written many books using data to analyze baseball. And he's considered the founder of the Sabermetrics movement, which is the application of statistics to baseball--as opposed to people who follow their gut. And, as it turns out, I'm an enormous Bill James fan. And I'm an enormous believer that data in baseball is more reliable than the naked eye watching, say, a player over even 30 or 40 games. A game like baseball, where the effects are quite small, actually, it's important to use statistical analysis.
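
A back-of-the-envelope calculation illustrates why 30 or 40 games is too few for the naked eye. The sketch below treats each at-bat as an independent coin flip with a fixed hit probability, which is a simplifying assumption; the two batting averages, the 4 at-bats per game, and the function name are all hypothetical choices for illustration.

```python
import math


def batting_gap_z(avg_a, avg_b, at_bats):
    """z-score for the gap between two hitters' true averages.

    Assumes each hitter has `at_bats` independent at-bats (binomial model),
    so the observed average has variance p * (1 - p) / at_bats.
    """
    var_a = avg_a * (1 - avg_a) / at_bats
    var_b = avg_b * (1 - avg_b) / at_bats
    return (avg_a - avg_b) / math.sqrt(var_a + var_b)


# Hypothetical hitters: a true .300 hitter versus a true .275 hitter.
at_bats_40_games = 40 * 4  # assuming about 4 at-bats per game
z_40_games = batting_gap_z(0.300, 0.275, at_bats_40_games)
z_5000_at_bats = batting_gap_z(0.300, 0.275, 5000)
print(f"gap after 40 games:     {z_40_games:.2f} standard errors")
print(f"gap after 5000 at-bats: {z_5000_at_bats:.2f} standard errors")
```

Under these assumptions the 25-point gap is about half a standard error after 40 games, well within the noise, and only becomes clearly visible after thousands of at-bats. That is consistent with the claim that accumulated data beats watching a player for a few dozen games.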

58:54 Russ Roberts: I think the challenge is, is that baseball is very different from, say, the economy. Or the human body. Baseball is a controlled environment: so, you can actually measure pretty accurately, either through simulation or through actual data analysis, say, whether trying to steal a base is a good idea. There is a selection bias. There's issues--it's not 100% straightforward. You can still do it badly. But you can actually learn about the probability of a stolen base leading to a run, and actually get pretty good at measuring that. What I think we can't measure is the probability of a billion dollar, or trillion dollar stimulus package in helping us recover from a recession. That's what I'm a little more skeptical about. Well, actually, a lot more. Andrew Gelman: No, those things are inherently much more theory-based. I mean, the baseball analogy would be, Bill James has suggested, like, reorganizing baseball in various ways. Russ Roberts: Right. Great example. Andrew Gelman: So, if you were to change [?] well, who's to say? But, again, okay sure. I'm not going to somehow defend if someone says, 'Well, I have this statistically significant result, therefore you should do this in the economy'. But I think that there are a lot of intermediate steps. I've done a lot of work in Political Science, which is not as controlled as baseball. And it is true: The more controlled the environment is, the more you can learn. It's easier to study U.S. Presidential elections than it is to study primary elections. The general elections, easier to study than the primary election, because the general election is controlled, and the primary election is uncontrolled. Pretty much. So, the principle still applies. And I think--you are right. But there is a lot of-- Russ Roberts: Yeah, I don't--don't misunderstand-- Andrew Gelman: back [?] social and biological world that have enough regularity that it seems like we can study them.