(Draft of material for forthcoming The Personality Puzzle, 8th edition. New York: W.W. Norton).

[Note: These are two sections of a chapter on Research Methods, and the first section follows a discussion of Null Hypothesis Significance Testing (NHST) and effect size.]

Replication

Beyond the size of a research result, no matter how it is evaluated, lies a second and even more fundamental question: Is the result dependable, something you could expect to find again and again, or did it merely occur by chance? As was discussed above, null hypothesis significance testing (NHST) is typically used to answer this question, but it is not really up to the job. A much better indication of the stability of results is replication. In other words, do the study again. Statistical significance is all well and good, but there is nothing quite so persuasive as finding the same result repeatedly, with different participants and in different labs (Asendorpf et al., 2013; Funder et al., 2014).[1]

The principle of replication seems straightforward, but it has become remarkably controversial in recent years, not just within psychology, but in many areas of science. One early spark for the controversy was an article, written by a prominent medical researcher and statistician, entitled “Why most published research findings are false” (Ioannidis, 2005). That title certainly got people’s attention! The article focused on biomedicine but addressed reasons why findings in many areas of research shouldn’t be completely trusted. These include the proliferation of small studies with weak effects, researchers reporting only selected analyses rather than everything they find, and the undeniable fact that researchers are rewarded, with grants and jobs, for studies that get interesting results. Another factor is publication bias, the fact that studies with strong results are more likely to be published than studies with weak results – leading to a published literature that makes effects seem stronger than they really are (Polanin, Tanner-Smith & Hennessy, 2016).
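For readers who like to see the mechanism at work, here is a minimal simulation sketch of how publication bias can inflate published effects. Everything specific in it is an assumption made for illustration, not something drawn from the studies cited above: a small true effect of 0.2 standard deviations, 20 participants per group, a known standard deviation of 1, and a rule that only studies reaching p < .05 (two-sided) get "published." The function name is hypothetical.

```python
import random
from statistics import mean


def simulate_publication_bias(true_effect=0.2, n_per_group=20,
                              n_studies=5000, seed=1):
    """Run many small two-group studies and 'publish' only significant ones.

    Assumes scores are normal with a known SD of 1, so a difference in
    means is significant (two-sided, alpha = .05) when it exceeds
    1.96 standard errors.
    """
    rng = random.Random(seed)
    critical_diff = 1.96 * (2 / n_per_group) ** 0.5
    all_effects = []
    published_effects = []
    for _ in range(n_studies):
        treatment = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        observed = mean(treatment) - mean(control)
        all_effects.append(observed)
        if abs(observed) > critical_diff:  # the study "worked"
            published_effects.append(observed)
    return mean(all_effects), mean(published_effects), len(published_effects)


if __name__ == "__main__":
    avg_all, avg_published, n_published = simulate_publication_bias()
    print(f"Average effect across all studies:  {avg_all:.2f}")
    print(f"Average effect in published studies: {avg_published:.2f}")
    print(f"Studies published: {n_published} of 5000")
```

Under these assumptions, the average across all simulated studies should land close to the true effect, while the average among the "published" subset should be considerably larger, because a small study can only reach significance when it happens to overestimate the effect.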

Worries about the truth of published findings spread to psychology a few years later, in a big way, when three things happened almost at once. First, an article in the influential journal Psychological Science outlined how researchers could make almost any data set yield significant findings through techniques such as deleting unusual responses, adjusting results to remove the influence of seemingly extraneous factors, and neglecting to report experimental conditions or even whole experiments that fail to get expected results (Simmons, Nelson & Simonsohn, 2011). Such questionable research practices (QRPs) have also become known as p-hacking, a term which refers to hacking around in one’s data until one finds the necessary degree of statistical significance, or p-level, that allows one’s findings to be published. To demonstrate how this could work, Simmons and his team massaged a real data set to “prove” that listening to the Beatles song “When I’m Sixty-Four” actually made participants younger!

Coincidentally, at almost exactly the same time, the prominent psychologist Daryl Bem published an article in a major journal purporting to demonstrate a form of ESP called “precognition,” reacting to stimuli that are presented in the future (Bem, 2011). And then, close on the heels of that stunning event, another well-known psychologist, Diederik Stapel, was exposed for having become famous on the basis of studies in which he, quite literally, faked his data (Bhattacharjee, 2013). The two cases were different, because nobody suggested that Bem faked his data. But nobody seemed to be able to repeat his findings, either, which suggested that flawed (but common) practices of data analysis, in particular various kinds of p-hacking, were to blame (Wagenmakers, Wetzels, Borsboom & van der Maas, 2011). For example, it was suggested that Bem might have published only the studies that successfully demonstrated precognition, not the ones that failed. One thing the two cases did have in common was that the work of both researchers had passed through the filters of scientific review that were supposed to ensure that published findings can be trusted.

And this was just the beginning. Before too long, many seemingly well-established findings in psychology were called into question when researchers found that they were unable to repeat them in their own laboratories (Open Science Collaboration, 2015). One example is a study that I described in previous editions of this very book, which purported to demonstrate a phenomenon sometimes called “elderly priming” (Anderson, 2015; Bargh, Chen & Burrows, 1996). In the study, some college student participants were “primed” with thoughts about old people by having them unscramble words such as “DNIRKWE” (wrinkled), “LDO” (old), and (my favorite) “FALODRI” (Florida). Others were given scrambles of neutral words such as “thirsty,” “clean,” and “private.” The remarkable – even forehead-slapping – finding was that when they walked away from the experiment, participants in the first group moved more slowly down the hallway than participants in the second group! Just being subtly reminded about concepts related to being old, it seemed, is enough to make a person act old.

I reported this fun finding in previous editions because the measurement of walking speed seemed like a great example of B-data, as described in Chapter 2, and also because I thought readers would enjoy learning about it. That was my mistake! The original study was based on just a few participants[2] and later attempts to repeat the finding, some of which used much larger samples, were unable to do so (e.g., Anderson, 2015; Doyen, Klein, Pichon & Cleeremans, 2012; Pashler, Harris & Coburn, 2011). In retrospect, I should have known better. Not only were the original studies very small, but the finding itself is so remarkable that extra-strong evidence should have been required before I believed it.[3]

The questionable validity of this finding and many others that researchers tried and failed to replicate stimulated lively and sometimes acrimonious exchanges in forums ranging from academic symposia and journal articles to impassioned tweets, blogs, and Facebook posts. At one point, a prominent researcher referred to colleagues attempting to evaluate the replicability of published findings as “shameless little bullies.” But for the most part, cooler heads prevailed, and insults gave way to positive recommendations for how to make research more dependable in the future (Funder et al., 2014; Shrout & Rodgers, 2018). These recommendations include using larger numbers of participants than has been traditional, disclosing all methods, sharing data, and reporting studies that don’t work as well as those that do. The most important recommendation – and one that really should have been followed all along – is to never regard any one study as conclusive proof of anything, no matter who did the study, where it was published, or what its p-level was (Donnellan & Lucas, 2018). The key attitude of science is – or should be – that all knowledge is provisional. Scientific conclusions are the best interpretations that can be made on the basis of the evidence at hand. But they are always subject to change.

[Discussion of ethics follows, including deception and protection of research subjects, and then this section:]

Honesty and Open Science

Honesty is another ethical issue common to all research. The past few years have seen a number of scandals in physics, medicine, and psychology in which researchers fabricated their data; the most spectacular case in psychology involved the Dutch researcher Diederik Stapel, mentioned earlier. Lies cause difficulty in all sectors of life, but they are particularly worrisome in research because science is based on truth and trust. Scientific lies, when they happen, undermine the very foundation of the field. If I report some data that I have found, you might disagree with my interpretation. That is fine, and in science this happens all the time. Working through disagreements about what data mean is an essential scientific activity. But if you cannot be sure that I really even found the data I report, then there is nothing for us to talk about. Even scientists who vehemently disagree on fundamental issues generally take each other’s honesty for granted (contrast this with the situation in politics). If they cannot, then science stops dead in its tracks.

In scientific research, complete honesty is more than simply not faking one’s data. A lesson that emerged from the controversies about replication, discussed earlier, is that many problems arise when the reporting of data is incomplete, as opposed to false. For example, it has been a not-uncommon practice for researchers to simply not report studies that didn’t “work,” i.e., that did not obtain the expected or hoped-for result. And, because of publication bias, few journals are willing to publish negative results in any case. The study failed, the reasoning goes, which means something must have gone wrong. So why would anybody want to hear about it? While this reasoning makes a certain amount of sense, it is also dangerous, because reporting only the studies that work can lead to a misleading picture overall. If fifty attempts to find precognition fail, for example, and one succeeds, then reporting the single success could make it possible to believe that people can see into the future!
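The arithmetic behind the precognition example can be sketched in a few lines. The sketch assumes, purely for illustration, that there is no real effect at all, that each of the 51 attempts is an independent study, and that each is tested at the conventional .05 level; none of these numbers comes from an actual set of studies, and the function names are hypothetical.

```python
def prob_at_least_one_false_positive(n_studies, alpha=0.05):
    """Chance that at least one of n independent studies of a nonexistent
    effect reaches significance at level alpha."""
    return 1 - (1 - alpha) ** n_studies


def expected_false_positives(n_studies, alpha=0.05):
    """Average number of 'successful' studies expected by chance alone."""
    return n_studies * alpha


if __name__ == "__main__":
    n = 51  # fifty failures plus one success, as in the example above
    print(f"P(at least one significant result): "
          f"{prob_at_least_one_false_positive(n):.3f}")
    print(f"Expected number of significant results: "
          f"{expected_false_positives(n):.2f}")
```

Under these assumptions, a lone success among 51 attempts is about what chance alone would produce, which is why the failures need to be reported along with it.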

A related problem arises when a researcher does not report results concerning all the experimental conditions, variables, or methods in a study. Again, the not-unreasonable tendency is only to report the ones that seem most meaningful, and omit aspects of the study that seem uninformative or confusing. In a more subtle kind of publication bias, reviewers and editors of journals might even encourage authors to focus their reports only on the most “interesting” analyses. But again, a misleading picture can emerge if a reader of the research does not know what methods were tried or variables were measured that did not yield meaningful results. In short, there is so much flexibility in the ways a typical psychology study can be analyzed that it’s easy – much too easy – for researchers to inadvertently “p-hack,” which, as mentioned earlier, means that they keep analyzing their data in different ways until they get the statistically significant result that they need (Simmons, Nelson & Simonsohn, 2011).
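To see how analytic flexibility inflates false positives, consider the following simulation sketch. It is not the procedure used by Simmons and colleagues; it covers only one kind of flexibility, namely measuring several outcome variables and reporting whichever one reaches significance. The specifics are assumptions chosen for illustration: no true effect, 20 participants per group, independent outcomes with a known standard deviation of 1, and a simple z-test. The function names are hypothetical.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b):
    """p-value for a difference in means between two equal-sized groups,
    using a z-test that assumes a known SD of 1."""
    n = len(group_a)
    z = (mean(group_a) - mean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_outcomes, n_per_group=20, n_sims=4000,
                        alpha=0.05, seed=1):
    """Share of simulated null studies in which at least one of
    n_outcomes measures comes out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same population: no real effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sided_p(a, b))
        if min(p_values) < alpha:  # report whichever outcome "worked"
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print(f"One outcome, reported regardless: {false_positive_rate(1):.3f}")
    print(f"Best of five outcomes:            {false_positive_rate(5):.3f}")
```

With a single planned outcome, the false positive rate should come out near the nominal 5 percent. With five outcomes to choose from, it should be close to 1 − .95⁵, or about 23 percent, even though nothing real is going on in the data.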

The emerging remedy for these problems is a movement towards what is becoming known as open science, a set of practices intended to move research closer to the ideals on which science was founded. These practices include fully describing all aspects of all studies, reporting studies that failed as well as those that succeeded, and freely sharing data with other scientists. An institute called the “Center for Open Science” has become the headquarters for many efforts in this direction, offering internet resources for sharing information. At the same time, major scientific organizations such as the American Psychological Association are establishing new guidelines for full disclosure of data and analyses (Appelbaum et al., 2018), and there is even a new organization, the Society for the Improvement of Psychological Science (SIPS), devoted exclusively to promoting these goals.

[1] R.A. Fisher, usually credited as the inventor of NHST, wrote “we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment that will rarely fail to give us a statistically significant result” (1966, p. 14).

[2] Actually, there were two studies, each with 30 participants, which is a very small number by any standard.

[3] The astronomer Carl Sagan popularized the phrase “extraordinary claims require extraordinary evidence,” but he wasn’t the first to realize the general idea. David Hume wrote in 1748 that “No testimony is sufficient to establish a miracle, unless the testimony be of such a kind, that its falsehood would be more miraculous than the fact which it endeavors to establish” (Rational Wiki, 2018).