Deborah Mayo and I had a recent blog discussion that I think might be of general interest so I’m reproducing some of it here.

The general issue is how we think about research hypotheses and statistical evidence. Following Popper etc., I see two basic paradigms:

Confirmationist: You gather data and look for evidence in support of your research hypothesis. This could be done in various ways, but one standard approach is via statistical significance testing: the goal is to reject a null hypothesis, and then this rejection will supply evidence in favor of your preferred research hypothesis.

Falsificationist: You use your research hypothesis to make specific (probabilistic) predictions and then gather data and perform analyses with the goal of rejecting your hypothesis.

In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.
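This confirmationist move can be sketched in a toy simulation (a hypothetical illustration with made-up variables, not data from any actual study): we test the null hypothesis B of zero correlation and, if it is rejected, count that as support for the substantive hypothesis A.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: day of cycle and a coded "clothing choice" score
n = 100
cycle_day = rng.uniform(0, 28, n)
clothing = 0.02 * cycle_day + rng.normal(0, 1, n)  # tiny true slope plus noise

# Test null hypothesis B: zero correlation
r, p = stats.pearsonr(cycle_day, clothing)

if p < 0.05:
    # The confirmationist leap: rejecting B is read as supporting A
    print(f"r = {r:.2f}, p = {p:.3f}: null rejected, hypothesis A 'supported'")
else:
    print(f"r = {r:.2f}, p = {p:.3f}: null not rejected")
```

Note that nothing in the code ever tests hypothesis A itself; the only hypothesis that can be rejected here is B.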

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

How do these two forms of reasoning differ? In confirmationist reasoning, the research hypothesis of interest does not need to be stated with any precision. It is the null hypothesis that needs to be specified, because that is what is being rejected. In falsificationist reasoning, there is no null hypothesis, but the research hypothesis must be precise.

It is tempting to frame falsificationists as the Popperian good guys who are willing to test their own models and confirmationists as the bad guys (or, at best, as the naifs) who try to do research in an indirect way by shooting down straw-man null hypotheses.

And indeed I do see the confirmationist approach as having serious problems, most notably in the leap from “B is rejected” to “A is supported,” and also in various practical ways because the evidence against B isn’t always as clear as outside observers might think.

But it’s probably most accurate to say that each of us is sometimes a confirmationist and sometimes a falsificationist. In our research we bounce between confirmation and falsification.

Suppose you start with a vague research hypothesis (for example, that being exposed to TV political debates makes people more concerned about political polarization). This hypothesis can’t yet be falsified as it does not make precise predictions. But it seems natural to seek to confirm the hypothesis by gathering data to rule out various alternatives. At some point, though, if we really start to like this hypothesis, it makes sense to fill it out a bit, enough so that it can be tested.

In other settings it can make sense to check a model right away. In psychometrics, for example, or in various analyses of survey data, we start right away with regression-type models that make very specific predictions. If you start with a full probability model of your data and underlying phenomenon, it makes sense to try right away to falsify (and thus, improve) it.
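The falsify-and-improve idea can be sketched as a simple predictive check (a generic illustration under assumed data, not an analysis from the post): fit a regression, simulate replicated datasets from the fitted model, and see whether a test statistic of the real data looks extreme relative to the replications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with skewed errors, so a normal-error regression is wrong
n = 200
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.exponential(1.0, n)

# Fit ordinary least squares, which assumes symmetric (normal) errors
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma = resid.std(ddof=2)

def skew(v):
    """Sample skewness, used as the model-checking test statistic."""
    return ((v - v.mean()) ** 3).mean() / v.std() ** 3

t_obs = skew(resid)

# Replicate data under the fitted normal-error model and recompute the statistic
t_rep = []
for _ in range(1000):
    y_rep = X @ beta + rng.normal(0, sigma, n)
    b_rep, *_ = np.linalg.lstsq(X, y_rep, rcond=None)
    t_rep.append(skew(y_rep - X @ b_rep))

# Proportion of replications as skewed as the observed residuals
p_check = np.mean(np.array(t_rep) >= t_obs)
print(f"observed skewness {t_obs:.2f}, check p-value {p_check:.3f}")
```

A tiny check p-value is a falsification of the assumed error model, and the payoff is constructive: it tells you which feature of the model (here, the symmetric errors) to improve.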

Dominance of the falsificationist rhetoric

That said, Popper’s ideas are pretty dominant in how we think about scientific (and statistical) evidence. And it’s my impression that null hypothesis significance testing is generally understood as being part of a Popperian, falsificationist approach to science.

So I think it’s worth emphasizing that, when a researcher is testing a null hypothesis that he or she does not believe, in order to supply evidence in favor of a preferred hypothesis, that this is confirmationist reasoning. It may well be good science (depending on the context) but it’s not falsificationist.

The “I’ve got statistical significance and I’m outta here” attitude

This discussion arose when Mayo wrote of a controversial recent study, “By the way, since Schnall’s research was testing ’embodied cognition’ why wouldn’t they have subjects involved in actual cleansing activities rather than have them unscramble words about cleanliness?”

This comment was interesting to me because it points to a big problem with a lot of social and behavioral science research, which is a vagueness of research hypotheses and an attitude that anything that rejects the null hypothesis is evidence in favor of the researcher’s preferred theory.

Just to clarify, I’m not saying that this is a particular problem with classical statistical methods; the same problem would occur if, for example, researchers were to declare victory when a 95% posterior interval excludes zero. The problem that I see here, and that I’ve seen in other cases too, is that there is little or no concern with issues of measurement. Scientific measurement can be analogized to links on a chain, and each link—each place where there is a gap between the object of study and what is actually being measured—is cause for concern.

All of this is a line of reasoning that is crucial to science but is often ignored (in my own field of political science as well, where we often just accept survey responses as data without thinking about what they correspond to in the real world). One area where measurement is taken very seriously is psychometrics, but it seems that the social psychologists don’t think so much about reliability and validity. One reason, perhaps, is that psychometrics is about quantitative measurement, whereas questions in social psychology are often framed in a binary way (Is the effect there or not?). And once you frame your question in a binary way, there’s a temptation for a researcher, once he or she has found a statistically significant comparison, to just declare victory and go home.

The measurements in social psychology are often quantitative; what I’m talking about here is that the research hypotheses are framed in a binary way (really, a unary way in that the researchers just about always seem to think their hypotheses are actually true). This motivates the “I’ve got statistical significance and I’m outta here” attitude. And, if you’ve got statistical significance already and that’s your goal, then who cares about reliability and validity, right? At least, that’s the attitude, that once you have significance (and publication), it doesn’t really matter exactly what you’re measuring, because you’ve proved your theory.

I am not intending to be cynical or to imply that I think these researchers are trying to do bad science. I just think that the combination of binary or unary hypotheses along with a data-based decision rule leads to serious problems.

The issue is that research projects are framed as quests for confirmation of a theory. And once confirmation (in whatever form) is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.

To this, Mayo wrote:

I agreed that “the measurements used in the paper in question were not” obviously adequately probing the substantive hypothesis. I don’t know that the projects are framed as quests “for confirmation of a theory”, rather than quests for evidence of a statistical effect (in the midst of the statistical falsification arg at the bottom of this comment). Getting evidence of a genuine, repeatable effect is at most a necessary but not a sufficient condition for evidence of a substantive theory that might be thought to (statistically) entail the effect (e.g., a cleanliness prime causes less judgmental assessments of immoral behavior—or something like that). I’m not sure that they think about general theories–maybe “embodied cognition” could count as general theory here. Of course the distinction between statistical and substantive inference is well known. I noted, too, that the so-called NHST is purported to allow such fallacious moves from statistical to substantive and, as such, is a fallacious animal not permissible by Fisherian or NP tests. I agree that issues about the validity and relevance of measurements are given short shrift and that the emphasis–even in the critical replication program–is on (what I called) the “pure” statistical question (of getting the statistical effect). I’m not sure I’m getting to your concern Andrew, but I think that they see themselves as following a falsificationist pattern of reasoning (rather than a confirmationist one). They assume it goes something like this: If the theory T (clean prime causes less judgmental toward immoral actions) were false, then they wouldn’t get statistically significant results in these experiments, so getting stat sig results is evidence for T. This is fallacious when the conditional fails.

And I replied that I think these researchers are following a confirmationist rather than falsificationist approach. Why do I say this? Because when they set up a nice juicy hypothesis and other people fail to replicate it, they don’t say: “Hey, we’ve been falsified! Cool!” Instead they give reasons why they haven’t been falsified. Meanwhile, when they falsify things themselves, they falsify the so-called straw-man null hypotheses that they don’t believe.

The pattern is as follows: Researcher has hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A. I don’t see this as falsificationist reasoning, because the researchers’ actual hypothesis (that is, hypothesis A) is never put to the test. It is only B that is put to the test. To me, testing B in order to provide evidence in favor of A is confirmationist reasoning.

Again, I don’t see this as having anything to do with Bayes vs non-Bayes, and all the same behavior could happen if every p-value were replaced by a confidence interval.
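The point that confidence intervals would license the same behavior is easy to check numerically. The following sketch (a standard textbook equivalence, illustrated with simulated data) shows that for a normal-theory test of a mean, the two-sided p-value falls below 0.05 exactly when the 95% confidence interval excludes zero, so swapping one summary for the other changes nothing about the declare-victory decision rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

agree = True
for _ in range(500):
    x = rng.normal(0.2, 1.0, 30)  # small true effect, noisy data
    t, p = stats.ttest_1samp(x, 0.0)
    lo, hi = stats.t.interval(0.95, df=len(x) - 1,
                              loc=x.mean(), scale=stats.sem(x))
    # "Significant" by p-value iff the 95% CI excludes zero
    agree &= (p < 0.05) == (lo > 0 or hi < 0)

print("p < .05 matches CI-excludes-zero in every draw:", agree)
```

The two criteria are mathematically the same decision boundary, so any confirmationist habit built on one transfers intact to the other.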

I understand falsificationism to be that you take the hypothesis you love, try to understand its implications as deeply as possible, and use these implications to test your model, to make falsifiable predictions. The key is that you’re setting up your own favorite model to be falsified.

In contrast, the standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify, and then presents the rejection of B as evidence in favor of A.

As I said above, this has little to do with p-values or Bayes; rather, it’s about the attitude of trying to falsify the null hypothesis B rather than trying to falsify the researcher’s hypothesis A.

Take Daryl Bem, for example. His hypothesis A is that ESP exists. But does he try to make falsifiable predictions, predictions for which, if they happen, his hypothesis A is falsified? No, he gathers data in order to falsify hypothesis B, which is someone else’s hypothesis. To me, a research program is confirmationist, not falsificationist, if the researchers are never trying to set up their own hypotheses for falsification.

That might be ok—maybe a confirmationist approach is fine; I’m sure that lots of important things have been learned in this way. But I think we should label it for what it is.

Summary for the tl;dr crowd

In our paper, Shalizi and I argued that Bayesian inference does not have to be performed in an inductivist mode, despite a widely-held belief to the contrary. Here I’m arguing that classical significance testing is not necessarily falsificationist, despite a widely-held belief to the contrary.