Our analyses examined the reporting practices employed in a large sample of empirical studies published in leading philosophy journals over a recent 4-year period. We found that NHST (in the form of reporting p-values) is overwhelmingly the dominant statistical analysis approach. Over the period we examined, the older field of experimental psychology gradually acknowledged the shortcomings of over-reliance on p-values as the sole marker of findings’ meaningfulness, and reporting complementary measures such as effect sizes and confidence intervals became common there. In Experimental Philosophy, however, this is not yet the norm: only half of the papers we examined reported measures of effect size, and still fewer reported confidence intervals. (Admittedly, confidence intervals have a one-to-one relation with p-values, but they are widely viewed as more straightforward to interpret.)

Furthermore, it is now accepted in the fields of experimental psychology and cognitive neuroscience that underpowered studies have, in the past, led to an over-representation of false positives in the published record. This has prompted a recent emphasis on using prospective power analysis, when possible, to determine sample sizes in advance; to a lesser extent, reporting of observed power has also increased. We find no evidence of this trend in the Experimental Philosophy literature: among the studies we assessed, only a very small number made any reference at all to statistical power. Finally, very few studies employed more sophisticated statistical approaches, such as Bayes factors.
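To illustrate what prospective power analysis involves, the following sketch (not from the source; all numbers are illustrative) computes an approximate per-group sample size for a two-sample t-test using the standard normal approximation. Dedicated tools such as G*Power or statsmodels give slightly larger, exact t-based answers.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sample t-test (normal approximation).

    effect_size is Cohen's d; the exact t-based answer is slightly larger.
    """
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed critical value
    z_beta = norm.ppf(power)            # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Detecting a "medium" effect (d = 0.5) at 80% power needs about 63 per group;
# halving the effect size roughly quadruples the required sample.
print(n_per_group(0.5))
print(n_per_group(0.25))
```

The key point for pre-study planning is that the required n is fixed by the smallest effect one cares to detect, not by the effect eventually observed.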

The results reported here suggest that, to date, Experimental Philosophy has adopted analytical and reporting practices closer to those that dominated psychology and cognitive neuroscience before the re-examination prompted by recent concerns about a replication crisis (Button et al. 2013; Open Science Collaboration 2012, 2015). In our Introduction, we reviewed surveys of the psychology literature spanning the years 1996 to 2013. We showed that reporting of effect sizes, for example, increased from 26% of the articles sampled in 1996–2000 (Matthews et al. 2008) to over 90% in a survey of articles published in Canadian psychology journals in 2013 (Counsell and Harlow 2017). The turning point seems to fall after 2010: a survey of papers from 2009 to 2010 still found effect sizes reported in only about 40% of studies (Fritz et al. 2012), and a large-scale analysis (Fritz et al. 2013) of articles published in psychology journals between 1990 and 2010 found that only 3% reported power analysis, 10% reported confidence intervals, and 38% reported effect sizes (although an upward trend across this period was noted for effect sizes). This has changed in recent years (though the process is still ongoing): Tressoldi et al. (2013) found that effect sizes and confidence intervals were reported in a majority of articles published in 2011 in both high- and low-impact journals (with the notable, and lamentable, exception of the highest-impact venues, Nature and Science), in some journals reaching 90%, the figure also found by Counsell and Harlow (2017). In light of this, our findings that only 53% of the Experimental Philosophy articles in our sample reported effect sizes, and only 28% provided confidence intervals, suggest that statistical reporting practices in Experimental Philosophy lag a few years behind those of comparable fields.

The studies we examined almost always provided information about sample size. Other important information about sample demographics and study design was reported less consistently (though still frequently). However, fewer than half of the studies directly referred to the number of participants that had been excluded from analysis. It is possible, of course, that the low proportion of reported exclusions is due to a low rate of exclusions in the studies themselves, and that all authors who excluded participants also reported this explicitly. However, it is noteworthy that participant exclusion is a highly common practice in psychology and related fields; although there are often good justifications for doing so (e.g., when participants fail to engage with the task, are unable to perform it adequately, or have clear response biases), the practice has also been highlighted as an element of ‘researcher degrees of freedom’ (Simmons et al. 2011). Specifically, when exclusion criteria are not set a priori (and reported as such), there is room to introduce novel exclusion criteria after the results are known; this may, in turn, make it easier to obtain statistically significant results. Because of the human susceptibility to cognitive biases, to which even those who do research on such biases are not immune (Simmons et al. 2011), the best researchers, armed with the best of intentions, may be unaware that they are using exclusion rules they would not have invoked before the data were known.

Our current sample gives reason to believe that participant exclusion may also be common in Experimental Philosophy, given the large variety of criteria that were applied when such exclusions were reported. On the one hand, as mentioned above, there are often perfectly valid reasons for excluding participants. On the other hand, the need to exclude a substantial number of participants (in some cases, over half) should be avoided as much as possible, to prevent concerns about researcher degrees of freedom (Simmons et al. 2011) and statistical artefacts (Shanks 2017) as alternative explanations for reported findings. Several of the studies we surveyed excluded a large number of participants for failing basic comprehension tests or otherwise showing that they did not follow task requirements: for example, Wilkenfeld et al. (2016) tested 142 participants but mention that a further 188 were excluded for failing to consent, failing to complete the experiment, or giving an incorrect response to one of the reading or comprehension questions; Horvath and Wiegmann (2016) excluded the data of 142 (out of 284) subjects who did not complete the survey or completed it in under 1 min; Berniūnas and Dranseika (2016) excluded 52 of 300 participants for failing a comprehension task; and Roberts et al. (2016) tested 140 participants but excluded 72 of them, 65 for answering one or more comprehension questions incorrectly and 7 because they had formal training in philosophy. When a large proportion of participants fails comprehension tests, this suggests that the task design might have benefited from additional piloting, prior to running the study, to make its content sufficiently clear to participants; and criteria that disqualify participants and can be known in advance (such as having formal training in philosophy) should be applied during initial participant screening rather than after data collection.
The flipside of exclusion criteria is very strict inclusion criteria: Holtzman (2013) reported that out of 1195 participants recruited through blogs and social networks and who had completed his survey, he focused only on 234 philosophers who held a PhD or DPhil in philosophy. There is nothing wrong with conducting research on populations with specific educational or professional backgrounds; but ideally, recruitment procedures should prevent the sample from consisting mostly of participants who do not belong to the relevant population.

Most of the above examples are of studies that used online platforms for data collection. Although such platforms are incredibly useful, their use may also result in the recruitment of a high number of unsuitable participants or a low level of participant engagement, which can negatively impact the quality of the data collected. This attests to the difficulties involved in carrying out research online; however, such difficulties must be mitigated through rigorous recruitment procedures and the use of comprehensible tasks. Unless the measured variables are entirely independent of the exclusion criteria (a requirement that is very hard to verify), excessive post-hoc data selection—even when completely justified in light of the study’s goals—can lead to results that are pure artefacts resulting from regression to the mean (Shanks 2017). Finally, many of the concerns raised by data exclusion can be assuaged by adhering to two simple recommendations: Pre-registering the study before it is run, including details of its proposed exclusion criteria and analysis plans; and reporting the effect of exclusions on the results after the study is concluded. We go into further detail on both of these recommendations below.
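The regression-to-the-mean concern raised by Shanks (2017) can be made concrete with a small seeded simulation (entirely illustrative; the setup and numbers are our own assumptions, not from the source). Suppose every participant's true effect is exactly zero, but a comprehension check and the outcome are measured in the same session and therefore share some session-level noise. Excluding "failures" on the check then manufactures a positive effect out of nothing:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
n = 1000

# Hypothetical setup: every participant's true effect is exactly zero.
# The comprehension check and the outcome share session-level noise
# (attention, mood), plus independent measurement noise.
session_noise = rng.normal(0.0, 1.0, n)
check_score = session_noise + rng.normal(0.0, 0.5, n)
outcome = session_noise + rng.normal(0.0, 0.5, n)

kept = check_score > 0  # "exclude participants who failed the check"

print(round(outcome.mean(), 3))        # near zero: no real effect exists
print(round(outcome[kept].mean(), 3))  # clearly positive: a pure selection artefact
```

If the check and the outcome shared no noise at all, the selection would be harmless here; the problem is that such independence is, as noted above, very hard to verify in practice.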

The sample of studies covered by our analysis is representative of the work being published in leading philosophy journals, but is obviously not entirely comprehensive: some Experimental Philosophy articles have not been included in our sample because they were published in journals such as Episteme, an outlet that was not listed in the two rankings considered in this study. Furthermore, the sample of journals considered here is rather heterogeneous: for example, some of the journals that are classed here as philosophical, such as Review of Philosophy and Psychology, are outlets intended to attract genuinely interdisciplinary research. It should also be noted that the classification of authors as philosophers and non-philosophers is at least somewhat arbitrary. We considered the affiliation at the time of publication (usually given in the published article) but this might not fully capture the researcher’s educational background. Finally, it could be argued that the sample itself is not large enough, at 134 papers, to adequately cover the field’s norms on such a diverse range of variables, not all of which are relevant to all the papers in the sample. While we acknowledge that any sample meant to reflect a greater whole could benefit from being larger, we do believe that our principled choice of leading journals, combined with our methodology for selecting all the empirical papers these journals published over a substantial period, provides a representative picture of the state of the art as indicated by the field’s leading publication venues.

We also note that our coding strategy (a score of “0” for the answer “no”, and a score of “1” for the answer “yes”) has limited resolution, meaning that items which varied in their degree of completeness could still receive the same score. Importantly, however, this is likely to have painted a more positive picture of reporting practices than is warranted: any mention of a relevant variable (e.g., effect size) would lead to a paper being assigned a value of 1 for that variable, even if the report itself was partial or applied inconsistently (or even incorrectly, an issue we did not delve into); a value of 0 was assigned only if the paper did not mention the variable at all. This may have somewhat inflated the number of papers coded with a value of 1 for any given variable.

On the other hand, the keyword-based search deployed here may have also occasionally missed some papers which did, in fact, report on a particular variable. In particular, in examining the reporting of study design features, we assessed whether the study was presented as “within subjects”, “between subjects”, “repeated measures” or “independent groups”; however, even in psychological research these labels are not universally used in reports; it is often assumed that educated readers would be able to infer such design features from the description of the study.

Notably, we focus here on the type of information reported, not on reporting or analysis errors. In the field of psychology, recent studies (Veldkamp et al. 2014; Nuijten et al. 2016) have focused instead on the prevalence of inconsistent p-values in top psychology journals by means of an automated procedure to retrieve and check errors in the reporting of statistical results. A recent application of this type of analysis to the field of Experimental Philosophy (Colombo et al. 2018) concludes that statistical inconsistencies are not more widespread in Experimental Philosophy than in psychology—meaning that when experimental philosophers use NHST, they do not tend to make consistency errors any more than psychologists do.

Despite its limitations, we believe our study of current practices for reporting the design and analysis of Experimental Philosophy research offers interesting and potentially important findings. Such investigations provide insight into what researchers are doing well and what could be done to improve research and reporting practices in future studies. This complements direct assessments of replicability, such as the XPhi Replicability Project, a recent large-scale effort to reproduce central Experimental Philosophy findings (Cova et al. 2018 https://osf.io/dvkpr/), which has provided encouraging data about current levels of replication in the field. We should not be complacent, though: Ensuring continued replicability requires the consistent adoption of appropriate reporting practices. We therefore end this report with a set of recommendations for authors, editors and reviewers of Experimental Philosophy papers (see Fig. 1 for a summary infographic).

Fig. 1 Recommendations for authors, editors and reviewers of Experimental Philosophy studies. This list complements the recommendations that Simmons et al. (2011) made for Psychology. We repeat two of their recommendations (marked with asterisks) but endorse all of their suggestions. The present recommendations build on practices that have been adopted in recent years in other empirical fields, but have yet to become the norm in Experimental Philosophy

We start with a general recommendation for philosophers and academic philosophy departments. A growing number of philosophers are carrying out empirical research, and an increasing number (in sub-fields such as philosophy of mind and philosophy of neuroscience and psychology) view empirical findings as directly relevant to their conceptual analysis. If this trend is to continue, it will become essential for philosophers to acquire statistical literacy as part of their education. Statistical analyses are the lens through which present-day science looks at empirical data. Therefore, an adequate understanding of statistics—including current developments and controversies in relevant fields—should not be outsourced to collaborators from other fields, but rather should become as integral to a philosopher’s education as courses in logic currently are.

As for authors, editors and reviewers, we strongly endorse the recommendations of Simmons et al. (2011), who made a list of suggestions aimed at reducing the number of false-positive publications by putting in place checks on experimenter degrees of freedom. These recommendations were aimed at researchers in psychology, but are equally applicable to any field in which statistics are used to analyze empirical data, and particularly to fields where those data are human behaviors, beliefs and attitudes. We will not repeat those recommendations here, but our recommendations below do include a couple of them that, in light of the present findings, seem to have particular relevance to Experimental Philosophy.

For example, it seems particularly necessary for authors in Experimental Philosophy to take heed of Simmons et al.’s (2011) recommendation that “If observations are eliminated, authors must also report what the statistical results are if those observations are included”. Further requirements also make sense in light of the large number of exclusions in some of the studies examined here (none of which report whether and to what extent application of exclusion or inclusion criteria affected the results): reports must commit to having defined the rules for exclusion prior to conducting any analysis (including the calculation of descriptive statistics), and must provide a clear rationale for such exclusions, to prevent ad-hoc removal of participants. Furthermore, to prevent undisclosed exclusions, papers should always explicitly report whether any participants were excluded or not.
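The recommendation to report results both with and without exclusions amounts to running the same analysis twice and presenting both outcomes. A minimal sketch (the data and the comprehension-check flags are invented for illustration):

```python
from scipy import stats

# Hypothetical ratings from two vignette conditions; the last two entries
# in each list are participants who failed a comprehension check.
condition_a = [6, 7, 5, 6, 7, 2, 1]
condition_b = [4, 5, 4, 3, 5, 6, 7]

def report(a, b, label):
    t, p = stats.ttest_ind(a, b)
    print(f"{label}: t = {t:.2f}, p = {p:.3f}")

# With these invented numbers the apparent group difference exists
# only after the exclusions -- exactly the contrast readers need to see.
report(condition_a, condition_b, "all participants")
report(condition_a[:-2], condition_b[:-2], "after exclusions")
```

Whether or not the two analyses agree, reporting both lets readers judge how much the conclusions hinge on the exclusion rule.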

More generally, transparency can be improved by adopting pre-registration. There is increasing support across the sciences for the idea of pre-registering studies, with initiatives such as the Preregistration Challenge (http://cos.io/prereg) offering assistance and incentives to conduct pre-registered research, and journals such as Psychological Science awarding ‘badges’ to papers that employ various good practices, including pre-registration. Current pre-registration platforms (e.g., the Open Science Framework, http://osf.io/; and AsPredicted, http://AsPredicted.org/) allow registration to consist simply of the basic study design, although they also enable inclusion of a detailed pre-specification of the study’s procedures, expected outcomes and plan for statistical analysis (including exclusion criteria). Importantly, pre-registering the analysis plan does not preclude analyses that were not originally considered, or further analyses on subsets of the data; rather, it enables a clear and transparent distinction between confirmatory (pre-registered) and exploratory analyses, with the acknowledgment that it is often the latter kind that leads to the most interesting follow-up research.

With regard to specific analysis techniques, NHST is the main approach to statistical analysis in Experimental Philosophy (and is still the norm in Experimental Psychology too). However, experimental philosophers should take heed of the recent move in psychology toward augmenting p-values with measures of effect size and increased use of confidence intervals. In particular, a paper’s discussion and interpretation of its findings should focus on effect sizes, as they are more informative than simply reporting whether a finding was statistically significant.
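Concretely, augmenting a p-value with an effect size and a confidence interval requires only a few extra lines. The following sketch (invented data; Cohen's d with a pooled standard deviation, and a 95% CI for the raw mean difference) shows one common way to do it:

```python
import numpy as np
from scipy import stats

group_a = np.array([1, 2, 3, 4, 5], dtype=float)  # invented ratings
group_b = np.array([2, 3, 4, 5, 6], dtype=float)

t, p = stats.ttest_ind(group_a, group_b)

# Cohen's d using the pooled standard deviation
n1, n2 = len(group_a), len(group_b)
pooled_var = ((n1 - 1) * group_a.var(ddof=1) +
              (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2)
d = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)

# 95% confidence interval for the raw mean difference
diff = group_a.mean() - group_b.mean()
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci = (diff - crit * se, diff + crit * se)

print(f"p = {p:.3f}, d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Note how the output pairs naturally in a results sentence: a nonsignificant p-value accompanied by a medium-sized d and a wide CI tells readers far more than "p > .05" alone.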

The use of other statistical approaches in place of NHST (e.g., Bayesian analysis) is also on the rise in psychology and other sciences, although the use of these approaches is still controversial: Simmons et al. (2011) oppose the adoption of Bayesian statistics as a way of addressing the shortcomings of p-values, noting that such analyses are prone to arbitrary assumptions (e.g., in the choice of prior probabilities) that, along with simply adding another set of tests to choose from, increase researcher degrees of freedom; several other authors (e.g., Dienes 2011, 2014; Kruschke 2013; Rouder et al. 2009), focus instead on the usefulness of Bayesian analyses for establishing whether the evidence supports the null hypothesis. Whatever the outcome of these debates, experimental philosophers should remain up to date on the current consensus regarding best practice.
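To give a flavour of what a Bayes factor adds, here is an illustrative sketch (not from the source, and not the JZS Bayes factor of Rouder et al. 2009) using the rough BIC-based approximation of Wagenmakers (2007) for a one-sample t-test. It converts a t-statistic into BF01, the evidence for the null relative to the alternative:

```python
from math import sqrt

def bf01_bic(t: float, n: int) -> float:
    """BIC-based Bayes factor BF01 for a one-sample t-test
    (Wagenmakers 2007 approximation; n = number of observations).

    BF01 > 1 favours the null; BF01 < 1 favours the alternative.
    """
    df = n - 1
    return sqrt(n) * (1 + t**2 / df) ** (-n / 2)

# A result around the p < .05 boundary (t = 2.1, n = 30) yields a
# Bayes factor below but near 1: only weak evidence against the null.
print(round(bf01_bic(t=2.1, n=30), 2))
```

This is precisely the kind of analysis Dienes (2011, 2014) recommends for asking whether the data actually support the null rather than merely failing to reject it.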

Authors should also make sure they provide all the relevant information on both the methods and results. Although the vast majority of the studies we examined reported their sample size, a much smaller number reported sample demographics that would allow an assessment of their findings’ generalizability. Furthermore, many studies were vague on design and procedure details that determine whether a reader who wanted to conduct an exact replication would be able to do so. To facilitate clear and comprehensive writing, journal editors should recognize that word limits can be a serious obstacle to proper reporting of methods and results. In light of this, journals such as Psychological Science have now made clear that “The Method and Results sections of Research Articles do not count toward the total word count limit. The aim here is to allow authors to provide clear, complete, self-contained descriptions of their studies” (Psychological Science 2018). We suggest that editors of Philosophy journals should also consider revising their guidelines and strive to allow for a sufficient level of detail in reporting.

Philosophers are not as accustomed as psychologists to using graphs to make their point, but Experimental Philosophy authors should present their findings graphically when visualization allows readers to see trends and patterns more clearly (Matejka and Fitzmaurice 2017). For example, although there is some controversy about the use of bar graphs to display results (see Bar Bar Plots Project 2017; Pastore et al. 2017), there is a consensus that bar graphs showing means are uninterpretable without error bars representing standard errors, standard deviations, or confidence intervals; when error bars are included, the measure they represent should be clearly indicated.
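As a minimal sketch of this advice (data invented; matplotlib assumed available), the following plots condition means with error bars and states explicitly what those bars represent:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Invented per-participant ratings for two conditions
cond_a = np.array([5.0, 6.0, 7.0, 5.0, 7.0])
cond_b = np.array([3.0, 4.0, 5.0, 4.0, 4.0])

means = [cond_a.mean(), cond_b.mean()]
sems = [cond_a.std(ddof=1) / np.sqrt(len(cond_a)),
        cond_b.std(ddof=1) / np.sqrt(len(cond_b))]

fig, ax = plt.subplots()
ax.bar(["Condition A", "Condition B"], means, yerr=sems, capsize=4)
ax.set_ylabel("Mean rating")
# Say explicitly what the bars show -- SEM here, not SD or a CI:
ax.annotate("Error bars: ±1 SEM", xy=(0.02, 0.95), xycoords="axes fraction")
fig.savefig("conditions.png")
```

Whether SEM, SD, or a confidence interval is the right choice depends on what the figure is meant to convey; the non-negotiable part is the label.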

However, even when graphics are helpful, authors should always provide numerical values for descriptive statistics and effect sizes as well, so that the study can be included in future replication efforts, Bayesian analyses and meta-analyses. To avoid redundancy, numerical values that are represented in graphic depictions can be given in supplementary online information, which is allowed by most journals. In cases in which journals do not allow authors to use supplementary materials, editors and publishers should consider updating their editorial policies to allow for their use.

Further, it is the role of editors and reviewers to verify that appropriate reporting practices, including those detailed above, are adhered to. In particular, editors of philosophy journals that publish experimental papers should make it a habit to go outside their usual reviewer pool and seek reviewers with the relevant methodological and statistical expertise to evaluate the empirical aspects of the work.

Reviewers, for their part, should focus not only on the content of the findings but also make sure to address quality of reporting, verifying the clarity and completeness of empirical methods, and the use of statistical analyses that go further than simply reporting p-values. As recommended by Simmons et al. (2011), reviewers should also be tolerant of imperfections in the results—empirical data are messy, and an unrealistic expectation for perfectly neat stories is a strong incentive for researchers to apply so-called ‘researcher degrees of freedom’. Although we have no evidence that unrealistic demands are a particular problem amongst reviewers of Experimental Philosophy studies, we do note that real data often lend themselves less comfortably to the kind of air-tight conceptual arguments that philosophers are more accustomed to.

The rapid recent growth of Experimental Philosophy suggests exciting prospects for informing philosophical arguments using empirical data. This burgeoning field must, however, insure itself against facing its own replication crisis in years to come by taking advantage of insights reached, over the same recent period, by other fields; adopting best-practice standards in analysis and reporting should go a long way towards this goal.