Introduction

Scientific findings cannot exist in isolation, but rather must rely on the capacity of other laboratories to successfully replicate them. When only the ‘discovering’ lab can replicate a result, trust in the finding can and should be examined. As a consequence of increased concern regarding the replicability of scientific results (obtaining the same results with the same methods on new data), psychologists have initiated assorted replication efforts to assess the reliability of extant research findings. The results of these large-scale replication attempts have introduced new questions into the field. One such initiative ran single replications of 100 studies and reported that only about one third of the studies replicated according to various plausible criteria for what should count as a successful replication (Open Science Collaboration, 2015; see also Earp, 2016). While conclusions regarding the actual replication rate in this and other efforts have been debated (e.g., Gilbert et al., 2016a; Gilbert et al., 2016b; Anderson et al., 2016; Etz & Vandekerckhove, 2016), the question of why systematic replication efforts have routinely failed to replicate original findings has become an important topic in psychology.

There are a number of reasons why efforts to replicate a research finding may fail. A relatively rare one is that researchers deliberately fabricated the initial result (see John, Loewenstein & Prelec, 2012; Fiedler & Schwarz, 2016; and Agnoli et al., 2017 for admission rates). A more commonplace source of false findings is selective reporting of participants, conditions, or analyses (e.g., Simmons, Nelson & Simonsohn, 2011; see also John, Loewenstein & Prelec, 2012), which can make non-results appear real. If an original effect is wholly fallacious, it is easy to explain why it would not replicate. But what about findings that correspond to real effects?

There are also a number of reasons why replication efforts may fail to replicate a real effect, including lack of power in the replications (Cohen, 1969), lack of fidelity among replicating researchers to the procedures of the original study (see Gilbert et al., 2016b), unacknowledged variance in auxiliary assumptions (Earp & Trafimow, 2015), and deliberate questionable research practices used by the replicator to show a lack of evidence (e.g., Protzko, 2018), among others.

In this manuscript, we investigate a separate possible explanation: researcher ‘expertise and diligence’. Related to the notion of replication fidelity, it seems reasonable that highly skilled researchers would be more effective than less skilled researchers at isolating the dependent and independent variables in a manner that enables the original finding to be replicated. An apocryphal example comes from cognitive dissonance research, where different experimenters read the exact same script to participants to induce a feeling of choosing to do the researcher a favor; differences in acting ability and expressed sincerity would constitute a lack of fidelity if the replicating experimenters did not come across to participants as sincerely as the original experimenters did. If processes such as these alter replicability, then the capacity to effectively carry out the replicated research might be expected to be associated with the achievement of the investigator carrying out the replications. In other words, researchers who have been highly successful in carrying out their own lines of research may be better able to effectively carry out replications than their less successful counterparts.

The above logic suggests that researchers who attempt and fail to replicate canonical studies are of inferior ‘expertise and diligence’ (Bartlett, 2014; Cunningham & Baumeister, 2016). In this study, we test this hypothesis using the h-index of replicators in a series of pre-registered replications to determine whether researchers of higher ‘expertise and diligence’ are more successful at replicating a given effect. Notably, this hypothesis regarding the relationship between replication and h-index has been explicitly put forth:

“In other words, it is much easier to be a successful nonreplicator while it takes ‘expertise and diligence’ to generate a new result in a reliable fashion. If that is the case, it should be reflected in measures of academic achievement, e.g., in the h-index or in the number of previous publications.” (Strack, 2017; p. 9).

Although various researchers have speculated on the role of researcher ‘expertise and diligence’ in replication success, the only (indirect) empirical evidence bearing on this question comes from a re-analysis of 100 single replications of prominent psychology findings (Open Science Collaboration, 2015). Specifically, the number of times a study was internally replicated by the original authors in the original publication did not predict whether an outside research team could replicate the effect (Kunert, 2016; cf. Cunningham & Baumeister, 2016). While this was interpreted as evidence for the prevalence of questionable research practices (a researcher who engages in such practices to ‘create’ an effect is likely to do so in many of their own internal replications), the evidence could also support the hypothesis that the original authors had the requisite ‘expertise and diligence’ that the replicating team lacked. Both interpretations rely on a property of the original research team (a tendency to engage in QRPs, or a special ability) that was not shared by the replicating team.

As the hypothesis has been put forward that researchers of differing ‘expertise and diligence’ (indexed through the h-index) are more or less able to experimentally replicate an effect, we sought to address the question empirically. This conjecture rests on two previous speculations: that replication failures are a product of an inadequate skill set on the part of the replicating researchers, and that the h-index is a reasonable metric of researchers’ acumen (especially as put forward in the hypothesis to be tested). Because replication ‘success’ is a function of the observed effect size and the sample size, and the studies we investigate here have essentially fixed sample sizes, we investigate the hypothesis in terms of the observed effect size returned by a replicator as a function of their h-index (the proxy for ‘expertise and diligence’ outlined above).
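To make the fixed-sample-size point concrete, the sketch below shows why a significance-based replication criterion reduces to a function of the observed effect size once the sample size is fixed by the protocol. This is an illustrative simplification only (a normal approximation to a two-sample comparison with a hypothetical per-group n of 50), not the analysis reported in this manuscript:

```python
from math import erf, sqrt

def replication_p_value(d, n_per_group):
    """Approximate two-sided p-value for a two-sample comparison, given an
    observed standardized effect size d (Cohen's d) and a fixed per-group
    sample size. Normal approximation; illustrative only."""
    z = d * sqrt(n_per_group / 2)            # test statistic with equal groups
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)

# With n fixed (as in an RRR protocol), 'success' at p < .05 is driven
# entirely by the effect size each replicating lab observes:
for d in (0.1, 0.3, 0.5):
    print(f"d = {d}: p = {replication_p_value(d, 50):.3f}")
```

Because the sample size term is constant across labs in a given RRR, analyzing the observed effect size directly is more informative than a binary significant/nonsignificant criterion.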

Replications

To test the hypothesis that replication success is a function of researcher ‘expertise and diligence’, we collected 100 replications conducted across five studies. We used the first five published Registered Replication Reports (RRRs), investigations in which a dozen or more individual research teams all attempt to replicate the same study. RRRs are direct replications of a study conducted by multiple, independent laboratories of researchers who vary in the extent to which they believe in the original finding. All labs follow the exact same protocol, approved before data collection begins by the original study team or by surrogates who share their theoretical perspective. The five RRRs represent multi-lab replications of the following phenomena: verbal overshadowing (Schooler & Engstler-Schooler, 1990); priming commitment and reaction to hypothetical romantic betrayal (Finkel et al., 2002); the facial feedback hypothesis (Strack, Martin & Stepper, 1988); ego depletion (Hagger et al., 2016); and the claim that people are intuitively cooperative yet deliberatively selfish (Rand, Greene & Nowak, 2012). All data and analysis scripts are archived at https://osf.io/qbq6v/. We used this sample because it provides multiple replications of the same basic effect by researchers of varying levels of ‘expertise and diligence’. In one replication investigation (Open Science Collaboration, 2015), researchers who had more publications chose to replicate studies with larger original effect sizes (Bench et al., 2017). After taking this initial self-selection into account, there was no residual relationship between researcher ‘expertise and diligence’ and replication success (cf. Cunningham & Baumeister, 2016). As that investigation only examined one replication per study, however, it was unable to examine replication variation within the same study.
The analysis proposed here can examine variation in ‘expertise and diligence’ across different replicators within multiple replications of the same study. Instead of each effect having one replication and averaging across different studies, each study under replication has multiple effect sizes from multiple researchers. In short, by examining the replication success of multiple investigations of the same effect, it should in principle be possible to isolate the role of variation in researchers’ ‘expertise and diligence’ in contributing to their effectiveness in replicating the original findings.
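The within-study logic above can be sketched as a fixed-effects regression: demean replicator h-index and observed effect size within each study, then estimate the pooled slope, so that between-study differences in the underlying effect cannot masquerade as an ‘expertise’ effect. The function and toy records below are hypothetical illustrations, not the actual RRR data or the analysis reported here:

```python
from collections import defaultdict

def within_study_slope(records):
    """Pooled slope of replication effect size on replicator h-index after
    removing per-study means (equivalent to study fixed effects in OLS).
    `records` is an iterable of (study_id, h_index, effect_size) tuples."""
    by_study = defaultdict(list)
    for study, h, d in records:
        by_study[study].append((h, d))
    num = den = 0.0
    for rows in by_study.values():
        mean_h = sum(h for h, _ in rows) / len(rows)
        mean_d = sum(d for _, d in rows) / len(rows)
        for h, d in rows:                      # accumulate within-study
            num += (h - mean_h) * (d - mean_d) # covariance and variance
            den += (h - mean_h) ** 2
    return num / den

# Hypothetical toy data: two RRR-like studies, three labs each,
# (study, replicator h-index, observed effect size).
toy = [("rrr1", 10, 0.20), ("rrr1", 25, 0.25), ("rrr1", 40, 0.22),
       ("rrr2", 5, -0.05), ("rrr2", 30, 0.00), ("rrr2", 50, 0.02)]
slope = within_study_slope(toy)
```

Under the ‘expertise and diligence’ hypothesis, this within-study slope should be positive: higher-h-index replicators should recover larger effects for the same original study.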