It is time for researchers to avail themselves of the full arsenal of quantitative and qualitative statistical tools. . . . The current practice of focusing exclusively on a dichotomous reject-nonreject decision strategy of null hypothesis testing can actually impede scientific progress. . . . The focus of research should be on . . . what data tell us about the magnitude of effects, the practical significance of effects, and the steady accumulation of knowledge. (Kirk, 2003, p. 100)

We need to make substantial changes to how we usually carry out research. My aim here is to explain why the changes are necessary and to suggest how, practically, we should proceed. I use the new statistics as a broad label for what is required: The strategies and techniques are not themselves new, but for many researchers, adopting them would be new, as well as a great step forward.

Ioannidis (2005) and other scholars have explained that our published research is biased and in many cases not to be trusted. In response, we need to declare in advance our detailed research plans whenever possible, avoid bias in our data analysis, make our full results publicly available whatever the outcome, and appreciate the importance of replication. I discuss these issues in the Research Integrity section. Then, in sections on estimation, I discuss a further response to Ioannidis, which is to accept Kirk’s advice that we should switch from null-hypothesis significance testing (NHST) to using effect sizes (ESs), estimation, and cumulation of evidence. Along the way, I propose 25 guidelines for improving the way we conduct research (see Table 1).

These are not mere tweaks to business as usual, but substantial changes that will require effort, as well as changes in attitudes and established practices. We need revised textbooks, software, and other resources, but sufficient guidance is available for us to make the changes now, even as we develop further new-statistics practices. The changes will prove highly worthwhile: Our publicly available literature will become more trustworthy, our discipline more quantitative, and our research progress more rapid.

As I mentioned, Ioannidis (2005) identified reliance on NHST as a major cause of many of the problems with research integrity, so shifting from NHST would be a big help. There are additional strong reasons to make this change: For more than half a century, distinguished scholars have published damning critiques of NHST and have described the damage it does. They have advocated a shift from NHST to better techniques, with many nominating estimation—meaning ESs, CIs, and meta-analysis—as their approach of choice. Most of the remainder of this article is concerned with explaining why such a shift is so important and how we can achieve it in practice. Our reward will be not only improved research integrity, but also a more quantitative, successful discipline.

Further discussion is needed of research integrity, as well as of the most effective practical strategies for achieving the two types of research integrity I have identified. The discussion needs to be enriched by more explicit consideration of ethics in relation to research practices and statistical analysis ( Panter & Sterba, 2011 ) and, more broadly, by consideration of the values that inform the choices researchers need to make at every stage of planning, conducting, analyzing, interpreting, and reporting research ( Douglas, 2007 ).

For the research literature to be trustworthy, we need to have confidence that it is complete and that all studies of at least reasonable quality have been reported in full detail, with any departures from prespecified procedures, sample sizes, or analysis methods being documented in full. We therefore need to be confident that all researchers have conducted and reported their work honestly and completely. These are demanding but essential requirements, and achieving them will require new resources, rules, and procedures, as well as persistent, diligent efforts.

A study that keeps some features of the original and varies others can give a converging perspective, ideally both increasing confidence in the original finding and starting to explore variables that influence it. Converging lines of evidence that are at least largely independent typically provide much stronger support for a finding than any single line of evidence. Some disciplines, including archaeology, astronomy, and paleontology, are theory based and also successfully cumulative, despite often having little scope for close replication. Researchers in these fields find ingenious ways to explore converging perspectives, triangulate into tests of theoretical predictions, and evaluate alternative explanations (Fiedler, Kutzner, & Krueger, 2012); we can do this too. (See Guideline 5 in Table 1.)

A single study is rarely, if ever, definitive; additional related evidence is required. Such evidence may come from a close replication, which, with meta-analysis, should give more precise estimates than the original study. A more general replication may increase precision and also provide evidence of generality or robustness of the original finding. We need increased recognition of the value of both close and more general replications, and greater opportunities to report them.

Considering the diversity of our research, full prespecification may sometimes not be feasible, in which case we need to do the best we can, keeping in mind the argument of Simmons et al. (2011) . Any selection—in particular, any selection after seeing the data—is worrisome. Reporting of all we did, including all data-analytic steps and exploration, must be complete. Acting with research integrity requires that we be fully informative about prespecification, selection, and the status of any result—whether it deserves the confidence that arises from a fully prespecified study or is to some extent speculative. (See Guideline 4 in Table 1 .)

Exploration has a second meaning: Running pilot tests to explore ideas, refine procedures and tasks, and guide where precious research effort is best directed is often one of the most rewarding stages of research. No matter how intriguing, however, the results of such pilot work rarely deserve even a brief mention in a report. The aim of such work is to discover how to prespecify in detail a study that is likely to find answers to our research questions, and that must be reported. Any researcher needs to choose the moment to switch from not-for-reporting pilot testing to prespecified, must-be-reported research.

Tukey (1977) advocated exploration of data and provided numerous techniques and examples. Serendipity must be given a chance: If we do not explore, we might miss valuable insights that could suggest new research directions. We should routinely follow planned analyses with exploration. Occasionally the results might be sufficiently interesting to warrant mention in a report, but then they must be clearly identified as speculative, quite possibly the result of cherry-picking.

Sample sizes, in particular, need to be declared in advance—unless the researcher will use a sequential or Bayesian statistical procedure that takes account of variable N . I explain later that a precision-for-planning analysis (or a power analysis if one is using NHST) may usefully guide the choice of N , but such an analysis is not mandatory: Long confidence intervals (CIs) will soon let us know if our experiment is weak and can give only imprecise estimates. The crucial point is that N must be specified independently of any results of the study.

Full details of a study need to be specified in advance of seeing any results. The procedure, selection of participants, sample sizes, measures, and statistical analyses all must be described in detail and, preferably, registered independently of the researchers (e.g., at Open Science Framework, openscienceframework.org ). Such preregistration might or might not be public. Any departures from the prespecified plan must be documented and explained, and may compromise the confidence we can have in the results. After the research has been conducted, a full account must be reported, and this should include all the information needed for inclusion of the results in future meta-analyses.

Confirmatory and exploratory are terms that are widely used to refer to research at the ends of this spectrum. Confirmatory , however, might imply that a dichotomous yes/no answer is expected and suggest, wrongly, that a fully planned study cannot simply ask a question (e.g., How effective is the new procedure?). I therefore prefer the terms prespecified and exploratory . An alternative is question answering and question formulating .

Psychologists have long recognized the distinction between planned and post hoc analyses, and the dangers of cherry-picking, or capitalizing on chance. Simmons et al. (2011) explained how insidious and multifaceted the selection problem is. We need a much more comprehensive response than mere statistical adjustment for multiple post hoc tests. The best way to avoid all of the biases Simmons et al. identified is to specify and commit to full details of a study in advance. Research falls on a spectrum, from such fully prespecified studies, which provide the most convincing results and must be reported, to free exploration of data, results of which might be intriguing but must—if reported at all—be identified as speculation, possibly cherry-picked.

One key requirement is that a decision to report research—in the sense of making it publicly available, somehow—must be independent of the results. (See Guideline 3 in Table 1 .) The best way to ensure this is to make a commitment to report research in advance of conducting it ( Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012 ). Ethics review boards should require a commitment to report research fully within a stated number of months—or strong reasons why reporting is not warranted—as a condition for granting approval for proposed research.

Achieving such complete reporting—and thus a research literature with integrity—is challenging, given pressure on journal space, editors’ desire to publish what is new and striking, the career imperative to achieve visibility in the most selective journals, and a concern for basic quality control. Solutions will include fuller use of online supplementary material for journal articles, new online journals, and open-access databases. We can expect top journals to continue to seek importance, novelty, and high quality in the research they choose to publish, but we need to develop a range of other outlets so that complete and detailed reporting is possible for any research of at least reasonable quality, especially if it was fully prespecified (see the next section). Note that whether research meets the standard of “at least reasonable quality” must be assessed independently of the results, to avoid bias in which results are reported. Full reporting means that all results must be reported, whether ESs are small or large, seemingly important or not, and that sufficient information must be provided so that inclusion in future meta-analyses will be easy and other researchers will be able to replicate the study. The Journal Article Reporting Standards listed in the American Psychological Association’s (APA’s) Publication Manual ( APA, 2010 , pp. 247–250; see also Cooper, 2011 ) will help.

Meta-analysis is a set of techniques for integrating the results from a number of studies on the same or similar issues. If a meta-analysis cannot include all relevant studies, its result is likely to be biased—the file-drawer effect. Therefore, any research conducted to at least a reasonable standard must be fully reported. Such reporting may be in a journal, an online research repository, or some other enduring, publicly accessible form. Future meta-analysts must be able to find the research easily; only then can meta-analysis yield results free of bias.

In considering how to address our three central problems, and thus work toward research integrity, we need to recognize that psychological science uses a wonderfully broad range of approaches to research. We conduct experiments, and also use surveys, interviews, and other qualitative techniques; we study people’s reactions to one-off historical events, mine existing databases, analyze video recordings, run longitudinal studies for decades, use computer simulations to explore possibilities, and analyze data from brain scans and DNA sequencing. We collaborate with disciplines that have their own measures, methods, and statistical tools—to the extent that we help develop new disciplines, with names like neuroeconomics and psychoinformatics. We therefore cannot expect that any simple set of new requirements will suffice; in addition, we need to understand the problems sufficiently well to devise the best responses for any particular research situation, and to guide development of new policies, textbooks, software, and other resources. Guideline 2 ( Table 1 ) summarizes the three problems, to which I now turn.

First, consider the broad label research integrity . I use this term with two meanings. The first refers to the integrity of the publicly available research literature, in the sense of being complete, coherent, and trustworthy. To ensure integrity of the literature, we must report all research conducted to a reasonable standard, and reporting must be full and accurate. The second meaning refers to the values and behavior of researchers, who must conduct, analyze, and report their research with integrity. We must be honest and ethical, in particular by reporting in full and accurate detail. (See Guideline 1 in Table 1 .)

Important sets of articles discussing research-integrity issues appeared in Perspectives on Psychological Science in 2012 and 2013 (volume 7, issue 6; volume 8, issue 4). The former issue included introductions by Pashler and Wagenmakers (2012) and by Spellman (2012) . Debate continues, as does work on tools and policies to address the problems. Here, I discuss how we should respond.

Simmons, Nelson, and Simonsohn (2011) made a key contribution in arguing that “undisclosed flexibility in data collection and analysis allows presenting anything as significant.” Researchers can very easily test a few extra participants, drop or add dependent variables, select which comparisons to analyze, drop some results as aberrant, try a few different analysis strategies, and then finally select which of all these things to report. There are sufficient degrees of freedom for statistically significant results to be proclaimed, whatever the original data. Simmons et al. emphasized the second of the three central problems I noted, but also discussed the first and third.

Ioannidis (2005) invoked all three problems as he famously explained “why most published research findings are false.” He identified as an underlying cause our reliance on NHST, and in particular, the imperative to achieve statistical significance, which is the key to publication, career advancement, research funding, and—especially for drug companies—profits. This imperative explains selective publication, motivates data selection and tweaking until the p value is sufficiently small, and deludes us into thinking that any finding that meets the criterion of statistical significance is true and does not require replication.

I recognize how difficult it may be to move from the seductive but illusory certainty of “statistically significant,” but we need to abandon that security blanket, overcome that addiction. I suggest that, once freed from the requirement to report p values, we may appreciate how simple, natural, and informative it is to report that “support for Proposition X is 53%, with a 95% CI of [51, 55],” and then interpret those point and interval estimates in practical terms. The introductory statistics course need no longer turn promising students away from our discipline, having terminally discouraged them with the weird arbitrariness of NHST. Finally, APA’s Publication Manual ( APA, 2010 , p. 34) makes an unequivocal statement that interpretation of results should whenever possible be based on ES estimates and CIs. (That and other statistical recommendations of the 2010 edition of the manual were discussed by Cumming, Fidler, Kalinowski, & Lai, 2012 .) It is time to move on from NHST. Whenever possible, avoid using statistical significance or p values; simply omit any mention of NHST. (See Guideline 10 in Table 1 .)

I am advocating shifting as much as possible from NHST to estimation. This is no mere fad or personal preference: The damage done by NHST is substantial and well documented (e.g., Fidler, 2005 , chap. 3); the improved research progress offered by estimation is also substantial, and a key step toward achieving the core aim expressed in Guideline 6. Identification of NHST as a main cause of problems with research integrity ( Ioannidis, 2005 ) reinforces the need to shift, and the urgency of doing so.

Schmidt and Hunter responded in detail to each of the statements commonly offered in defense of NHST. They concluded that each is false, that we should cease to use NHST, and that estimation provides a much better way to analyze results, draw conclusions, and make decisions—even when the researcher may primarily care only about whether some effect is nonzero.

My colleagues and I ( Coulson, Healey, Fidler, & Cumming, 2010 ) presented evidence that, at least in some common situations, researchers who see results presented as CIs are much more likely to make a correct interpretation if they think in terms of estimation than if they consider NHST. This finding suggests that it is best to interpret CIs as intervals, without invoking NHST, and, further, that it is better to report CIs and make no mention of NHST or p values. Fidler and Loftus (2009) reported further evidence that CIs are likely to prompt better interpretation than is a report based on NHST. Such evidence comes from the research field of statistical cognition , which investigates how researchers and other individuals understand, or misunderstand, various statistical concepts, and how results can best be analyzed and presented for correct comprehension by readers. If our statistical practices are to be evidence based, we must be guided by such empirical results. In this case, the evidence suggests that we should use estimation and avoid NHST.

Considering exact p values gives an even more dramatic contrast with CIs. If an experiment gives a two-tailed p of .05, an 80% prediction interval for one-tailed p in a replication study is (.00008, .44), which means there is an 80% chance that p will fall in that interval, a 10% chance that p will be less than .00008, and a 10% chance that p will be greater than .44. Perhaps remarkably, that prediction interval for p is independent of N, because the calculation of p takes account of sample size. Whatever the N, a p value gives only extremely vague information about replication (Cumming, 2008). Any calculated value of p could easily have been very different had we merely taken a different sample, and therefore we should not trust any p value. (See Guideline 9 in Table 1.)
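The quoted interval can be reproduced with a few lines of code, under the assumption that, in standard-error (z) units, a replication's result is centred on the original result with standard deviation √2 (because the original and the replication each carry independent sampling error). This is a sketch of that calculation, not a general-purpose tool:

```python
import numpy as np
from scipy import stats

# Original experiment: two-tailed p = .05, so the observed z is 1.96.
z_orig = stats.norm.ppf(1 - 0.05 / 2)

# Prediction for a replication's z: centred on the original z with SD sqrt(2).
z_lo, z_hi = stats.norm.ppf([0.10, 0.90], loc=z_orig, scale=np.sqrt(2))

# Convert the z limits to one-tailed p values (large z means small p).
p_small, p_large = stats.norm.sf(z_hi), stats.norm.sf(z_lo)
print(f"80% prediction interval for one-tailed replication p: ({p_small:.5f}, {p_large:.2f})")
# Prints approximately (.00008, .44), the interval quoted above.
```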

Now consider NHST. A replication has traditionally been regarded as “successful” if its statistical-significance status matches that of the original experiment—both ps < .05 or both ps ≥ .05. In Figure 1, just 9 of the 24 (38%) replications match the significance status of the experiment immediately below, and are thus successful by this criterion. With power of .50, as here, in the long run we can expect 50% to be successful. Even with power of .80, only 68% of replications will be successful. This example illustrates that NHST gives only poor information about the likely result of a replication.

What does an experiment tell us about the likely result if we repeat that experiment? For each experiment in Figure 1 , note whether the CI includes the mean of the next experiment. In 20 of the 24 cases, the 95% CI includes the mean next above it in the figure. That is 83.3% of the experiments, which happens to be very close to the long-run average of 83.4% ( Cumming & Maillardet, 2006 ; Cumming, Williams, & Fidler, 2004 ). So a 95% CI is an 83% prediction interval for the ES estimate of a replication experiment. In other words, a CI is usefully informative about what is likely to happen next time.
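A minimal simulation sketch of this property follows; the particular n, µ, and σ below are arbitrary choices, and the long-run percentage is much the same for other values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu, sigma, reps = 32, 10.0, 20.0, 100_000
t_crit = stats.t.ppf(0.975, n - 1)
hits = 0
for _ in range(reps):
    original = rng.normal(mu, sigma, n)
    replication = rng.normal(mu, sigma, n)
    moe = t_crit * original.std(ddof=1) / np.sqrt(n)   # half-length of the original 95% CI
    hits += abs(replication.mean() - original.mean()) <= moe
print(hits / reps)   # close to .83 in the long run
```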

Running a single experiment amounts to choosing randomly from an infinite sequence of replications like those in Figure 1 . A single CI is informative about the infinite sequence, because its length indicates approximately the extent of bouncing around in the dance. In stark contrast, a single p value gives virtually no information about the infinite sequence of p values. (See Guideline 8 in Table 1 .)

CIs and p values are based on the same statistical theory, and with practice, it is easy to translate from a CI to the p value by noting where the interval falls in relation to the null value, µ_0 = 0. It is also possible to translate in the other direction and use knowledge of the sample ES (the difference between the group means) and the p value to picture the CI; this may be the best way to interpret a p value. These translations do not mean that p and the CI are equally useful: The CI is much more informative because it indicates the extent of uncertainty, in addition to providing the best point estimate of what we want to know.
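As an illustration of these translations, here is a sketch using a normal approximation (a t-based version would use the df of the design); the interval [6.1, 27.7] is the Experiment 1 CI quoted in the eight-step list later in this article:

```python
from scipy import stats

# From a 95% CI to an approximate two-tailed p (normal approximation).
lo, hi = 6.1, 27.7                    # illustrative CI for a difference between means
est = (lo + hi) / 2                   # point estimate sits at the centre of the CI
se = (hi - lo) / (2 * 1.96)           # back out the standard error from the CI length
z = (est - 0) / se                    # distance from the null value, mu_0 = 0
p = 2 * stats.norm.sf(abs(z))
print(round(p, 3))                    # roughly .002

# And back again: from the estimate and p to an approximate 95% CI.
se = abs(est) / stats.norm.isf(p / 2)
print(round(est - 1.96 * se, 1), round(est + 1.96 * se, 1))   # recovers about [6.1, 27.7]
```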

The 95% CIs bounce around as we expect; they form the dance of the CIs. Possibly surprising is the enormous variation in the p value—from less than .001 to .75. It seems that p can take almost any value: this dance of the p values is astonishingly wide. You can see more about the dance at tiny.cc/dancepvals and download ESCI (Cumming, 2013) to run the simulation. Vary the population ES and n, and you will find that even when power is high—in fact, in virtually every situation—p varies dramatically (Cumming, 2008).

I describe here a major problem of NHST that is too little recognized. If p reveals truth, and we replicate the experiment—doing everything the same except with a new random sample—then replication p , the p value in the second experiment, should presumably reveal the same truth. We can simulate such idealized replication to investigate the variability of p . Figure 1 depicts the simulated results of 25 replications of an experiment with two independent groups, each group having an n of 32. The population ES is 10 units of the dependent variable, or a Cohen’s δ of 0.50, which is conventionally considered a medium effect. Statistical power to find a medium-sized effect is .50, so the experiment is typical of what is published in many fields in psychology ( Cohen, 1962 ; Maxwell, 2004 ).
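A simulation in the spirit of this description is easy to sketch. The control-group mean of 50 and σ of 20 below are arbitrary choices (only δ = 0.50 and n = 32 matter), so this is an illustration rather than a reconstruction of Figure 1 itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, mu_c, sigma, delta = 32, 50.0, 20.0, 0.50      # population ES = delta * sigma = 10 units
mu_e = mu_c + delta * sigma

for i in range(1, 26):                            # 25 simulated replications
    e = rng.normal(mu_e, sigma, n)                # experimental group
    c = rng.normal(mu_c, sigma, n)                # control group
    diff = e.mean() - c.mean()
    _, p = stats.ttest_ind(e, c)
    sp = np.sqrt((e.var(ddof=1) + c.var(ddof=1)) / 2)
    moe = stats.t.ppf(0.975, 2 * n - 2) * sp * np.sqrt(2 / n)
    print(f"Exp {i:2d}: diff = {diff:5.1f}, 95% CI [{diff - moe:5.1f}, {diff + moe:5.1f}], p = {p:.3f}")
# The CIs vary moderately from run to run; p ranges over several orders of magnitude.
```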

Despite warnings in statistics textbooks, the word significant is part of the seductive appeal: A “statistically significant” effect in the results section becomes “significant” in the discussion or abstract, and “significant” shouts “important.” Kline (2004) recommended that if we use NHST, we should refer to “a statistical difference,” omitting “significant.” That is a good policy; the safest plan is never to use the word significant . The best policy is, whenever possible, not to use NHST at all.

Why is NHST so deeply entrenched? I suspect the seductive appeal—the apparent but illusory certainty—of declaring an effect “statistically significant” is a large part of the problem. Dawkins (2004) identified “the tyranny of the discontinuous mind” (p. 252) as an inherent human tendency to seek the reassurance of an either-or classification, and van Deemter (2010) labeled as “false clarity” (p. 6) our preference for black or white over nuance. In contrast to a seemingly definitive dichotomous decision, a CI is often discouragingly long, but its quantification of uncertainty is accurate, and that accuracy is a message we need to come to terms with.

Kline (2004 , chap. 3; also available at tiny.cc/klinechap3) provided an excellent summary of the deep flaws in NHST and how we use it. He identified mistaken beliefs, damaging practices, and ways in which NHST retards research progress. Anderson (1997) has made a set of pithy statements about the problems of NHST available on the Internet. Very few defenses of NHST have been attempted; it simply persists, and is deeply embedded in our thinking. Kirk (2003) , quoted at the outset of this article, identified one central problem: NHST prompts us to see the world as black or white, and to formulate our research aims and make our conclusions in dichotomous terms—an effect is statistically significant or it is not; it exists or it does not. Moving from such dichotomous thinking to estimation thinking is a major challenge, but an essential step. (See Guideline 7 in Table 1 .)

In a book on the new statistics ( Cumming, 2012 ), I discussed most of the issues mentioned in the remainder of this article. I do not refer to that book in every section below, but it extends the discussion here and includes many relevant examples. It is accompanied by Exploratory Software for Confidence Intervals, or ESCI (“ESS-key”), which runs under Microsoft Excel and can be freely downloaded from the Internet, at www.thenewstatistics.com ( Cumming, 2013 ). ESCI includes simulations illustrating many new-statistics ideas, as well as tools for calculating and picturing CIs and meta-analysis.

Rodgers (2010) argued that psychological science is, increasingly, developing quantitative models. That is excellent news, and supports this core aim. I am advocating estimation as usually the most informative approach and also urging avoidance of NHST whenever possible. I summarize a few reasons why we should make the change and then discuss how to use estimation and meta-analysis in practice.

Suppose you read in the news that “support for Proposition X is 53%, in a poll with an error margin of 2%.” Most readers immediately understand that the 53% came from a sample and, assuming that the poll was competent, conclude that 53% is a fair estimate of support in the population. The 2% suggests the largest likely error. Reporting a result in such a way, or as 53 ± 2%, or as 53% with a 95% CI of [51, 55], is natural and informative. It is more informative than stating that support is “statistically significantly greater than 50%, p < .01.” The 53% is our point estimate, and the CI our interval estimate, whose length indicates precision of estimation. Such a focus on estimation is the natural choice in many branches of science and accords well with a core aim of psychological science, which is to build a cumulative quantitative discipline. (See Guideline 6 in Table 1.)
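The arithmetic behind such a statement is simple; in this sketch the poll's sample size of 2,400 is hypothetical, chosen so that the margin of error comes out near 2%:

```python
from math import sqrt
from scipy import stats

p_hat, n = 0.53, 2400                        # 53% support in a hypothetical poll of 2,400 people
se = sqrt(p_hat * (1 - p_hat) / n)           # standard error of a sample proportion
moe = stats.norm.ppf(0.975) * se             # margin of error for a 95% CI
print(f"{100 * p_hat:.0f}% ± {100 * moe:.1f}%, 95% CI [{100 * (p_hat - moe):.1f}, {100 * (p_hat + moe):.1f}]")
# Prints roughly: 53% ± 2.0%, 95% CI [51.0, 55.0]
```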

Estimation: How

In this section, I start with an eight-step new-statistics strategy, discuss some preliminaries, and then consider ESs, CIs, the interpretation of both of these, and meta-analysis.

An eight-step new-statistics strategy for research with integrity

The following eight steps highlight aspects of the research process that are especially relevant for achieving the changes discussed in this article.

1. Formulate research questions in estimation terms. To use estimation thinking, ask “How large is the effect?” or “To what extent . . . ?” Avoid dichotomous expressions such as “test the hypothesis of no difference” or “Is this treatment better?”

2. Identify the ESs that will best answer the research questions. If, for example, the question asks about the difference between two means, then that difference is the required ES, as illustrated in Figure 1. If the question asks how well a model describes some data, then the ES is a measure of goodness of fit.

3. Declare full details of the intended procedure and data analysis. Prespecify as many aspects of your intended study as you can, including sample sizes. A fully prespecified study is best.

4. After running the study, calculate point estimates and CIs for the chosen ESs. For Experiment 1 in Figure 1, the estimated difference between the means is 16.9, 95% CI [6.1, 27.7]. (That is the APA format. From here on, I omit “95% CI,” so square brackets signal a 95% CI.) (A calculation sketch appears after this list.)

5. Make one or more figures, including CIs. As in Figure 1, use error bars to depict 95% CIs.

6. Interpret the ESs and CIs. In writing up results, discuss the ES estimates, which are the main research outcome, and the CI lengths, which indicate precision. Consider theoretical and practical implications, in accord with the research aims.

7. Use meta-analytic thinking throughout. Think of any single study as building on past studies and leading to future studies. Present results to facilitate their inclusion in future meta-analyses. Use meta-analysis to integrate findings whenever appropriate.

8. Report. Make a full description of the research, preferably including the raw data, available to other researchers. This may be done via journal publication or posting to some enduring publicly available online repository (e.g., figshare, figshare.com; Open Science Framework, openscienceframework.org; Psych FileDrawer, psychfiledrawer.org). Be fully transparent about every step, including data analysis—and especially about any exploration or selection, which requires the corresponding results to be identified as speculative.

All these steps differ from past common practice. Step 1 may require a big change in thinking, but may be the key to adopting the new statistics, because asking “how much” naturally prompts a quantitative answer—an ES. Step 6 calls for informed judgment, rather than a mechanical statement of statistical significance. Steps 3 and 8 are necessary for research integrity.
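To make step 4 concrete, here is a minimal calculation sketch for the difference between two independent group means, working from summary statistics; the means and standard deviations are hypothetical values chosen to land close to the Experiment 1 result quoted above:

```python
import numpy as np
from scipy import stats

def diff_ci(m_e, m_c, sd_e, sd_c, n_e, n_c, conf=0.95):
    """Point estimate and CI for the difference between two independent group means,
    from summary statistics, assuming homogeneity of variance (pooled SD)."""
    df = n_e + n_c - 2
    sp = np.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / df)
    diff = m_e - m_c
    moe = stats.t.ppf(1 - (1 - conf) / 2, df) * sp * np.sqrt(1 / n_e + 1 / n_c)
    return diff, diff - moe, diff + moe

# Hypothetical summary statistics for two groups of n = 32:
print(diff_ci(m_e=61.9, m_c=45.0, sd_e=21.0, sd_c=22.0, n_e=32, n_c=32))
# Prints roughly (16.9, 6.2, 27.6), close to the Experiment 1 values quoted in step 4.
```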

The new statistics in context

The eight-step strategy is, of course, far from a complete recipe for good research. There is no mention, for example, of selecting a good design or finding measures with good reliability and validity. Consider, in addition, the excellent advice of the APA Task Force on Statistical Inference (Wilkinson & Task Force on Statistical Inference, 1999; also available at tiny.cc/tfsi1999), including the advice to keep things simple, when appropriate: “Simpler classical approaches [to designs and analytic methods] often can provide elegant and sufficient answers to important questions” (p. 598, italics in the original). The task force also advised researchers, “As soon as you have collected your data, . . . look at your data” (p. 597, italics in the original).

I see the first essential stage of statistical reform as being a shift from NHST. I focus on estimation as an achievable step forward, but other approaches also deserve wider use. Never again will any technique—CIs or anything else—be as widely used as p values have been. (See Guideline 11 in Table 1.) I mention next four examples of further valuable approaches:

Data exploration: John Tukey’s (1977) book Exploratory Data Analysis legitimated data exploration and also provides a wealth of practical guidance. There is great scope to bring Tukey’s approach into the era of powerful interactive software for data mining and representation.

Bayesian methods: These are becoming commonly used in some disciplines, for example, ecology (McCarthy, 2007). Bayesian approaches to estimation based on credible intervals, to model assessment and selection, and to meta-analysis are highly valuable (Kruschke, 2010). I would be wary, however, of Bayesian hypothesis testing, if it does not escape the limitations of dichotomous thinking.

Robust methods: The common assumption of normally distributed populations is often unrealistic, and conventional methods are not as robust to typical departures from normality as is often assumed. Robust methods largely sidestep such problems and deserve to be more widely used (Erceg-Hurn & Mirosevich, 2008; Wilcox, 2011).

Resampling and bootstrapping methods: These are attractive in many situations. They often require few assumptions and can be used to estimate CIs (Kirby & Gerlanc, 2013).

Note that considering options for data analysis does not license choosing among them after running the experiment: Selecting the analysis strategy was one of the degrees of freedom described by Simmons et al. (2011); that strategy should be prespecified along with other details of the intended study.

ESs

An ES is simply an amount of anything of interest (Cumming & Fidler, 2009). Means, differences between means, frequencies, correlations, and many other familiar quantities are ESs. A p value, however, is not an ES. A sample ES, calculated from data, is typically our point estimate of the population ES. ESs can be reported in original units (e.g., milliseconds or score units) or in some standardized or units-free measure (e.g., Cohen’s d, β, η_p², or a proportion of variance). ESs in original units may often be more readily interpreted, but a standardized ES can assist comparison over studies and is usually necessary for meta-analysis. Reporting both kinds of ESs is often useful.

Cohen’s d

Cohen’s d deserves discussion because it is widely useful but has pitfalls. It is a standardized ES that is calculated by taking an original-units ES, usually the difference between two means, and expressing this as a number of standard deviations. The original-units ES is divided by a standardizer that we choose as a suitable unit of measurement:

d = (M_E − M_C) / s,     (1)

where M_E and M_C are the experimental (E) and control (C) means, and s is the standardizer. Cohen’s d is thus a kind of z score. First we choose a population standard deviation that makes sense as the unit for d, and then we choose our best estimate of that standard deviation to use as s in the denominator of d.

For two independent groups, if we assume homogeneity of variance, the pooled standard deviation within groups, s_p, is our standardizer, just as we use for the independent-groups t test. If we suspect the treatment notably affects variability, we might prefer the control population’s standard deviation, estimated by s_C (the control group’s standard deviation), as the standardizer. If we have several comparable control groups, pooling over these may give a more precise estimate to use as the standardizer. These choices obviously lead to different values for d, so whenever we see a value of d, we need to know how it was calculated before we can interpret it. When reporting values of d, make sure to describe how they were calculated.

Now consider a repeated measure design, in which each participant experiences both E and C treatments. We would probably regard the C population as the reference and choose its standard deviation, estimated by s_C, as the standardizer. However, the CI on the difference in this repeated measure design (and also the paired t test) is calculated using s_diff, the standard deviation of the paired differences, rather than s_C. As noted earlier, with two independent groups, s_p serves as the standardizer for d and also for the independent-groups t test. By contrast, the repeated measure design emphasizes that the standard deviation we choose as the standardizer may be quite different from the standard deviation we use for inference, whether based on a CI or a t test.

Equation 1 emphasizes that d is the ratio of two quantities, each estimated from data (Cumming & Finch, 2001). If we replicate the experiment, both numerator and denominator—the original-units ES and the standardizer—will be different. Cohen’s d is thus measured on a “rubber ruler,” whose unit, the standardizer, stretches in or out if we repeat the experiment. We therefore must be very careful when interpreting d, especially when we compare d values given by different conditions or experiments. Do the original-units ESs differ, does the standardizer differ, or do both differ?
This difficulty has led some scholars, especially in medicine (Greenland, Schlesselman, & Criqui, 1986), to argue that standardized ES measures should never be used. In psychology, however, we have little option but to use a standardized ES when we wish to meta-analyze results from studies that used different original-units measures—different measures of anxiety, for example.

I have three final remarks about d. First, because d is the ratio of two estimated quantities, its distribution is complex, and it is not straightforward to calculate CIs for d. ESCI can provide CIs on d in a number of basic situations, or you can use good approximations (Cumming & Fidler, 2009). (See Grissom & Kim, 2012, chap. 3, for more about CIs for d.) Second, symbols and terms referring to the standardized difference between means, calculated in various ways, are used inconsistently in the literature. “Hedges’s g,” for example, is used with at least two different meanings. I recommend following common practice and using Cohen’s d as the generic term, but be sure to explain how d was calculated. Third, the simple calculations of d I have discussed give values that are biased estimates of δ, the population ES; d is somewhat too large, especially when N is small. A simple adjustment (Grissom & Kim, 2012, p. 70; or use ESCI) is required to give d_unb, the unbiased version of Cohen’s d; we should usually prefer d_unb. (For more on Cohen’s d, see Cumming, 2012, chap. 11.)

Interpretation of ESs

Interpretation of ESs requires informed judgment in context. We need to trust our expertise and report our assessment of the size, importance, and theoretical or practical value of an ES, taking full account of the research situation. Cohen (1988) suggested 0.2, 0.5, and 0.8 as small, medium, and large values of d, but emphasized that making a judgment in context should be preferred to these fallback benchmarks. Interpretation should include consideration, when appropriate, of the manipulation or treatment, the participants, and the research aims. When interpreting an ES, give reasons.

Published reference points can sometimes guide interpretation: For the Beck Depression Inventory-II (Beck, Steer, Ball, & Ranieri, 1996), for example, scores of 0 through 13, 14 through 19, 20 through 28, and 29 through 63 are labeled as indicating, respectively, minimal, mild, moderate, and severe levels of depression. In pain research, a change in rating of 10 mm on the 100-mm visual analog scale is often regarded as the smallest change of clinical importance—although no doubt different interpretations may be appropriate in different situations. A neuropsychology colleague tells me that, as a rough guideline, he uses a decrease of 15% in a client’s memory score as the smallest change possibly of clinical interest. Comparison with ESs found in past research can be useful. I hope increasing attention to ES interpretation will prompt emergence of additional formal or informal conventions to help guide interpretation of various sizes of effect. However, no guideline will be universally applicable, and researchers must take responsibility for their ES interpretations. (See Guideline 12 in Table 1.)
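Drawing these points together, here is a sketch for the independent-groups case, using the pooled standard deviation as the standardizer and the common approximate correction factor, 1 − 3/(4df − 1), for d_unb; the other standardizer choices discussed above would change the denominator:

```python
import numpy as np

def cohens_d(e, c, unbiased=True):
    """Cohen's d for two independent groups, using the pooled SD as the standardizer.
    With unbiased=True, the usual approximate small-sample correction gives d_unb."""
    e, c = np.asarray(e, dtype=float), np.asarray(c, dtype=float)
    df = len(e) + len(c) - 2
    sp = np.sqrt(((len(e) - 1) * e.var(ddof=1) + (len(c) - 1) * c.var(ddof=1)) / df)
    d = (e.mean() - c.mean()) / sp
    return d * (1 - 3 / (4 * df - 1)) if unbiased else d

rng = np.random.default_rng(0)
experimental = rng.normal(60, 20, 32)     # illustrative data only
control = rng.normal(50, 20, 32)
print(cohens_d(experimental, control, unbiased=False), cohens_d(experimental, control))
```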

Interpretation of CIs

CIs indicate the precision of our ES estimates, so interpretation of ESs must be accompanied by interpretation of CIs. I offer six approaches, one or more of which may be useful in any particular case. The discussion here refers to a 95% CI for a population mean, µ, but generally applies to any CI. (For an introduction to CIs and their use, see Cumming & Finch, 2005; also available at tiny.cc/inferencebyeye.)

One from an infinite sequence

The CI calculated from our data is one from the dance, as Figure 1 illustrates. In the long run, 95% of CIs will include µ, and an unidentified 5% will miss. Most likely our CI includes µ, but it might not—it might be red, as in Figure 1. Thinking of our CI as coming from an infinite sequence is the correct interpretation, but in practice we need to interpret what we have—our single interval. That is reasonable, providing our CI is typical of the dance. It usually is, with two exceptions.

First, in Figure 1, the CIs vary somewhat in length, because each is based on the sample’s standard deviation. Each CI length is an estimate of the length of the heavy line at the bottom of the figure, which indicates an interval including 95% of the area under the curve—the CI length we would obtain if we knew the population standard deviation. With two groups of n = 32, CI length varies noticeably from experiment to experiment. A smaller n gives greater variation, and if n is less than, say, 10, the variation is so large that the length of a single CI may be a very poor estimate of precision. A CI is of little practical use when samples are very small.

A second exception occurs when our CI is not chosen randomly. If we run several experiments but report only the largest ES, or the shortest CI, that result is not typical of the dance, and the CI is practically uninterpretable. Simmons et al. (2011) explained how such selection is problematic. Barring a tiny sample or data selection, it is generally reasonable to interpret our single CI, and the following five approaches all do that. We should, however, always remember the dance: Our CI just might be red! (See Guideline 13 in Table 1.)

Focus on our interval

Our CI defines a set of plausible, or likely, values for µ, and values outside the interval are relatively implausible. We can be 95% confident that our interval includes µ and can think of the lower and upper limits as likely lower and upper bounds for µ. Interpret the point estimate—the sample mean (M) at the center of the interval—and also the two limits of the CI. If an interval is sufficiently short and close to zero that you regard every value in the interval as negligible, you can conclude that the true value of µ is, for practical purposes, zero or very close to zero. That is the best way to think about what, in the NHST world, is acceptance of a null hypothesis.

Prediction

As I discussed earlier, our CI is an 83% prediction interval for the ES that would be given by a replication experiment (Cumming & Maillardet, 2006). Our CI defines a range within which the mean of a repeat experiment most likely will fall (on average, a 5-in-6 chance).

Precision

The margin of error (MOE—pronounced “mow-ee”) is the length of one arm of a CI and indicates precision. Our estimation error is the difference between the sample and population means (M − µ). We can be 95% confident that this error is no greater than the MOE in absolute value. A large MOE indicates low precision and an uninformative experiment; a small MOE is gold.
A major purpose of meta-analysis is to integrate evidence to increase precision. Later, I discuss another use of precision—to assist research planning.

The cat’s-eye picture of a CI

The curve in Figure 1 shows the sampling distribution of M, the difference between the two sample means: As the dance also illustrates, most values of M fall close to µ, and progressively fewer fall at greater distances. The curve is also the distribution of estimation errors: Errors are most likely close to zero, and larger errors are progressively less likely, which implies that our interval has most likely fallen so that M is close to µ. Therefore, values close to our M are the best bet for µ, and values closer to the limits of our CI are successively less good bets. The curve, if centered around M rather than µ, indicates the relative plausibility, or likelihood, of values being the true value of µ.

The center graphics of Figure 2 show the conventional error bars for a 95% CI and the cat’s-eye picture of that CI, bounded by the likelihood curve centered around M and its mirror image. The black area of the cat’s-eye picture spans the CI and comprises 95% of the area between the curves. The horizontal width of the cat’s-eye picture indicates the relative likelihood that any value of the dependent variable is µ, the parameter we are estimating. A value close to the center of the CI is about 7 times as likely to be µ as is a value near a limit of the 95% CI. Thus, the black area is the likelihood profile, or beautiful “shape,” of the 95% CI (Cumming, 2007; Cumming & Fidler, 2009). This fifth approach to interpreting a CI is a nuanced extension to the second approach: Our CI defines an interval of plausible values for µ, but plausibility varies smoothly across and beyond the interval, as the cat’s-eye picture indicates.

Link with NHST

If our CI falls so that a null value µ_0 lies outside the interval, the two-tailed p is less than .05, and the null hypothesis can be rejected. If µ_0 is inside the interval, then p is greater than .05. Figure 1 illustrates that the closer the sample mean is to µ_0, the larger is p (Cumming, 2007). This is my least preferred way to interpret a CI: I earlier cited evidence that CIs can prompt better interpretation if NHST is avoided. Also, the smooth variation of the cat’s-eye picture near any CI limit suggests that we should not lapse back into dichotomous thinking by attaching any particular importance to whether a value of interest lies just inside or just outside our CI.
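The “about 7 times” figure can be checked directly: it is the ratio of the height of the normal likelihood curve at the centre of the interval to its height 1.96 standard errors away:

```python
from scipy import stats

# Relative likelihood at the centre of a 95% CI versus at one of its limits,
# measured in standard-error units:
print(stats.norm.pdf(0) / stats.norm.pdf(1.96))   # about 6.8 -- the "about 7 times" figure
```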

Error bars: prefer 95% CIs

It is extremely unfortunate that the familiar error-bar graphic can mean so many different things. Figure 2 uses it to depict a 99% CI and SE bars (i.e., mean ± 1 SE), as well as a 95% CI; the same graphic may also represent standard deviations, CIs with other levels of confidence, or various other quantities. Every figure with error bars must state clearly what the bars represent. (An introductory discussion of error bars was provided by Cumming, Fidler, & Vaux, 2007.)

Numerous articles in Psychological Science have included figures with SE bars, although these have rarely been used to guide interpretation. The best way to think of SE bars on a mean is usually to double the whole length—so the bars extend 2 SE above and 2 SE below the mean—and interpret them as being, approximately, the 95% CI, as Figure 2 illustrates (Cumming & Finch, 2005). The right-hand cat’s-eye picture in Figure 2 illustrates that relative likelihood changes little across SE bars, which span only about two thirds of the area between the two curves. SE bars are usually approximately equivalent to 68% CIs and are 52% prediction intervals (Cumming & Maillardet, 2006): There is about a coin-toss chance that a repeat of the experiment will give a sample ES within the original SE bars.

It may be discouraging to display 95% CIs rather than the much shorter SE bars, but there are strong reasons for preferring CIs. First, they are designed for inference. Second, for means, although there is usually a simple relation between SE bars and the 95% CI, that relation breaks down if N is small; for other measures, including correlations, there is no simple relation between the CI and any standard error. Therefore, SE bars cannot be relied on to provide inferential information, which is what we want. Since the 1980s, medical researchers have routinely reported CIs, not SE bars. We should do the same. (See Guideline 14 in Table 1.)

Figure 2 shows that 99% CIs simply extend further than 95% CIs, to span 99% rather than 95% of the area between the two likelihood curves. They are about one third longer than 95% CIs. Further approximate benchmarks are that 90% CIs are about five sixths as long as 95% CIs, and 50% CIs are about one third as long. Noting benchmarks like these (Cumming, 2007) allows you to convert easily among CIs with various levels of confidence. In matters of life and death, it might seem better to use 99% CIs, or even 99.9% CIs (about two thirds longer than 95% CIs), but I suggest that it is virtually always best to use 95% CIs. We should build our intuitions (bearing in mind the cat’s-eye picture) for 95% CIs—the most common CIs—and use benchmarks if necessary to interpret other CIs.
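These benchmarks, and the 52% prediction figure for SE bars, follow from normal-distribution critical values; a quick check:

```python
from scipy import stats

z95 = stats.norm.ppf(0.975)                     # half-length of a 95% CI, in SE units
for level in (0.50, 0.683, 0.90, 0.99, 0.999):  # 0.683 corresponds to SE bars (roughly a 68% CI)
    z = stats.norm.ppf(0.5 + level / 2)
    print(f"{100 * level:5.1f}% CI is {z / z95:.2f} times as long as a 95% CI")

# SE bars as a prediction interval for the mean of a replication experiment:
print(2 * stats.norm.cdf(1 / 2 ** 0.5) - 1)     # about .52 -- the coin-toss chance noted above
```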

Research planning

One of the most challenging and creative parts of empirical research is devising ingenious studies likely to provide precise answers to our research questions. I refer to such studies as informative. The challenge of research design and planning is to increase informativeness, and it is worth much time and effort to do this: We should refine tasks, seek measures likely to have better reliability and validity, consider participant selection and training, use repeated measures when appropriate, consider statistical control, limit design complexity so informativeness is increased for the remaining questions, use large sample sizes, and consider measuring more than once and then averaging; in general, we should reduce error variability in any way we can, and call on the full range of advice in handbooks of design and analysis. High informativeness is gold. (See Guideline 21 in Table 1.)

Statistical power

Statistical power is the probability that if the population ES is equal to δ, a target we specify, our planned experiment will achieve statistical significance at a stated value of α. I am ambivalent about statistical power for two reasons. First, it is defined in terms of NHST, so it has meaning or relevance only if we are using NHST, and has no place when we are using the new statistics. However, anyone who uses NHST needs to consider power. (See Guideline 22 in Table 1.) Second, the term power is often used ambiguously, perhaps referring to the narrow technical concept of statistical power, but often referring more broadly to the size, sensitivity, quality, or informativeness of an experiment. For clarity, I suggest using informativeness, as I did earlier, for this second, broader concept.

For a given experimental design, statistical power is a function of sample size, α, and the target δ. Increasing sample size increases informativeness as well as power, but we can also increase power merely by choosing α of .10 rather than .05, or by increasing the target δ—neither of which increases informativeness. Therefore, high power does not necessarily imply an informative or high-quality experiment.

Funding bodies and ethical review boards often require justification for proposed experiments, especially proposed sample sizes. It is particularly important to justify sample sizes when human participants may be subjected to inconvenience or risk. Power calculations have traditionally been expected, but these can be fudged: For example, power is especially sensitive to δ, so a small change to the target δ may lead to a substantial change in power. For a two-independent-groups design with n of 32 in each group, choosing δ of 0.50 gives power of .50, as in Figure 1, but δ of 0.60 or 0.70 gives power of .67 or .79, respectively. Power of .80 is, following Cohen (1988), often regarded as acceptable, even though 20% of such experiments would fail to achieve statistical significance if the population ES were equal to the stated target δ. For several simple designs, ESCI provides power curves and calculations. Beyond that, I recommend the excellent free software G*Power 3 (available at tiny.cc/gpower3).

One problem is that we never know true power, the probability that our experiment will yield a statistically significant result, because we do not know the true δ—that is why we are doing the experiment! All we can say is that our experiment has power of x to detect a stated target δ. Any statement that an experiment has power of x, without specifying the target δ, is meaningless.
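For readers who do use NHST, the power values quoted above can be checked with the noncentral t distribution; this sketch agrees closely with them (small discrepancies of a percentage point or so can arise from different computational conventions):

```python
import numpy as np
from scipy import stats

def power_two_groups(n, delta, alpha=0.05):
    """Power of a two-tailed t test for two independent groups (n per group)
    to detect a population effect size of Cohen's delta."""
    df = 2 * n - 2
    ncp = delta * np.sqrt(n / 2)                    # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

for target_delta in (0.50, 0.60, 0.70):
    print(target_delta, round(power_two_groups(32, target_delta), 2))
# Power rises steeply as the target delta increases, as described above.
```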
Should we calculate power after running the experiment, using our observed estimate of δ as the target? That is post hoc power. The trouble is that post hoc power tells us about the result, but little if anything about the experiment itself. In Figure 1, for example, Experiments 23, 24, and 25 give post hoc power of .17, .98, and .41, respectively. Post hoc power can often take almost any value, so it is likely to be misleading, as Hoenig and Heisey (2001) argued. If computer output provides a value for power, without asking you to specify a target ES, it is probably post hoc power and should be ignored. (See Guideline 23 in Table 1.)

Precision for planning

In an NHST world, statistical power can support planning. In an estimation world, we need instead precision for planning—sometimes called accuracy in parameter estimation (AIPE). We specify how large a MOE we are prepared to accept and then calculate what N is needed to achieve a CI with a MOE no longer than that. Familiarity with ESs and CIs should make it natural to think of an experiment aiming for precision of f units of σ, the population standard deviation (and thus the units of Cohen’s δ).

Figure 7, from ESCI, displays precision-for-planning curves. The vertical line marks an f of 0.4 and tells us that a two-independent-groups experiment would on average give a MOE no larger than 0.4 × σ if the groups each had an n of 50. A complication, however, is that the MOE varies from experiment to experiment, as Figure 1 illustrates. The lower curve in the figure gives n for an experiment that on average yields a satisfactory MOE. To do better, we may set the level of assurance, γ, to 99, as in the upper curve in Figure 7, which tells us that two groups with n of 65 will give a MOE no longer than our target of 0.4 × σ on 99% of occasions. I hope funding bodies and ethics review boards will increasingly look for precision-for-planning analyses that justify sample sizes for proposed research.

Figure 7 shows that small changes to f can indicate the need for large changes to n. The corresponding figure for a repeated measure experiment indicates that much smaller Ns will suffice if we have a reasonably large correlation between the measures. Such figures should give us practical guidance in choosing experimental designs and sample sizes likely to make an experiment sufficiently informative to be worth running. After we conduct an experiment, the MOE calculated from the data tells us, of course, what precision we achieved. Previously (Cumming, 2012, chap. 13), I have discussed precision for planning, and ESCI provides calculations for several simple situations. Precision-for-planning techniques are being developed for an increasing range of ES measures and designs (Maxwell, Kelley, & Rausch, 2008). (See Guideline 24 in Table 1.)
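The following sketch approximates such a precision-for-planning calculation for two independent groups; it expresses the MOE in units of σ and is a rough stand-in for ESCI's curves, for both the average case and 99% assurance:

```python
from math import sqrt
from scipy import stats

def n_for_target_moe(f, assurance=None, conf=0.95):
    """Smallest per-group n for a two-independent-groups design such that the 95% CI
    margin of error (MOE) is no larger than f * sigma. With assurance=None the target
    must be met on average; with assurance=0.99 it must be met on 99% of occasions."""
    n = 2
    while True:
        df = 2 * n - 2
        t_crit = stats.t.ppf(1 - (1 - conf) / 2, df)
        scale = 1.0 if assurance is None else sqrt(stats.chi2.ppf(assurance, df) / df)
        if t_crit * scale * sqrt(2 / n) <= f:       # MOE expressed in units of sigma
            return n
        n += 1

print(n_for_target_moe(0.4))                    # about 50, as for the lower curve in Figure 7
print(n_for_target_moe(0.4, assurance=0.99))    # about 65, as for the gamma = 99 curve
```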