



In an earlier blog post I talked about equivalence tests. Sometimes you perform a study where you might expect the effect is zero or very small. So how can we conclude an effect is ‘zero or very small’? One approach is to specify effect sizes we consider ‘not small’. For example, we might decide that effects larger than d = 0.3 (or smaller than d = -0.3 in a two-sided test) are ‘not small’. Now, if we observe an effect that falls between the two equivalence bounds of d = -0.3 and d = 0.3, we can act (in the good old-fashioned Neyman-Pearson approach to statistical inference) as if the effect is ‘zero or very small’. It might not be exactly zero, but it is small enough. You can check out a great interactive visualization of equivalence testing by RPsychologist.

We can use two one-sided tests to statistically reject effects ≤ -0.3 and ≥ 0.3. This is the basic idea of the TOST (two one-sided tests) equivalence procedure. The idea is simple, and it is conceptually similar to the traditional null-hypothesis test you probably use in your article to reject an effect of zero. But where all statistics programs will allow you to perform a normal t-test, it is not yet that easy to perform a TOST equivalence test (Minitab is one exception).





But psychology really needs a way to show effects are too small to matter (see ‘Why most findings in psychology are statistically unfalsifiable’ by Richard Morey and me). So I made a spreadsheet and R package to perform the TOST procedure. The R package is available from CRAN, which means you can install it using install.packages("TOSTER").





Let’s try a practical example (this is one of the examples from the vignette that comes with the R package).





Eskine (2013) showed that participants who had been exposed to organic food were substantially harsher in their moral judgments relative to those in the control condition (Cohen’s d = 0.81, 95% CI: [0.19, 1.45]). A replication by Moery & Calin-Jageman (2016, Study 2) did not observe a significant effect (Control: n = 95, M = 5.25, SD = 0.95, Organic Food: n = 89, M = 5.22, SD = 0.83). The authors have used Simonsohn’s recommendation to power their study so that they have 80% power to detect an effect the original study had 33% power to detect. This is the same as saying: We consider an effect to be ‘small’ when it is smaller than the effect size the original study had 33% power to detect.
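The ‘effect size the original study had 33% power to detect’ can be computed rather than looked up. Here is a minimal sketch using power.t.test from base R's stats package, assuming a two-sided independent-samples t-test at alpha = 0.05 and the n = 21 per condition of the original study. The variable names are mine and are not part of TOSTER.

```r
# Solve for the standardized effect (delta, with sd = 1) that a
# two-sample, two-sided t-test with n = 21 per group detects with 33% power.
original <- power.t.test(n = 21, power = 0.33, sd = 1, sig.level = 0.05,
                         type = "two.sample", alternative = "two.sided")

# With sd = 1, delta is on the Cohen's d scale.
d33 <- original$delta
print(round(d33, 2))
```

The result should land close to the d = 0.48 that is used as the equivalence bound in this example.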





With n = 21 in each condition, Eskine (2013) had 33% power to detect an effect of d = 0.48. This is the effect the authors of the replication study designed their study to detect. The original study had shown an effect of d = 0.81, and the authors performing the replication decided that an effect size of d = 0.48 would be the smallest effect size they would aim to detect with 80% power. So we can use this effect size as the equivalence bound. We can use R to perform an equivalence test:





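# Install the package once, then load it
install.packages("TOSTER")

library("TOSTER")

# Equivalence test from summary statistics, with bounds of d = -0.48 and d = 0.48
TOSTtwo(m1=5.25, m2=5.22, sd1=0.95, sd2=0.83, n1=95, n2=89, low_eqbound_d=-0.48, high_eqbound_d=0.48, alpha = 0.05, var.equal = TRUE)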





Which gives us the following output:





TOST results:
t-value lower bound: 3.48     p-value lower bound: 0.0003
t-value upper bound: -3.03    p-value upper bound: 0.001
degrees of freedom : 182

Equivalence bounds (Cohen's d):
low eqbound: -0.48
high eqbound: 0.48

Equivalence bounds (raw scores):
low eqbound: -0.4291
high eqbound: 0.4291

TOST confidence interval:
lower bound 90% CI: -0.188
upper bound 90% CI: 0.248

NHST confidence interval:
lower bound 95% CI: -0.23
upper bound 95% CI: 0.29

Equivalence Test Result:
The equivalence test was significant, t(182) = -3.026, p = 0.00142, given equivalence bounds of -0.429 and 0.429 (on a raw scale) and an alpha of 0.05.

Null Hypothesis Test Result:
The null hypothesis test was non-significant, t(182) = 0.227, p = 0.820, given an alpha of 0.05.

Based on the equivalence test and the null-hypothesis test combined, we can conclude that the observed effect is statistically not different from zero and statistically equivalent to zero.



You see, we are just using R like a fancy calculator, entering all the numbers in a single function. But I can understand if you are a bit intimidated by R. So, you can also fill in the same info in the spreadsheet.













Using a TOST equivalence procedure with alpha = 0.05, and assuming equal variances (var.equal = TRUE in the call above, which is what gives the 182 degrees of freedom; when sample sizes are unequal, you should report Welch’s t-test by default, which you get with var.equal = FALSE), we can reject effects larger than d = 0.48: t(182) = -3.03, p = 0.001.
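To see that nothing magical is happening inside the package, here is a rough sketch of the same two one-sided tests in base R, computed from the summary statistics and assuming Student's t-test with a pooled standard deviation. The helper tost_two_summary is a hypothetical name of my own, not a TOSTER function, and it only covers this equal-variances case.

```r
# Two one-sided tests (TOST) from summary statistics, Student's t, pooled SD.
# tost_two_summary is an illustrative helper, not part of the TOSTER package.
tost_two_summary <- function(m1, m2, sd1, sd2, n1, n2, low_d, high_d) {
  df <- n1 + n2 - 2
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / df)
  se <- sd_pooled * sqrt(1 / n1 + 1 / n2)

  # Convert the bounds from Cohen's d to raw score units.
  low_raw <- low_d * sd_pooled
  high_raw <- high_d * sd_pooled
  diff <- m1 - m2

  # Test 1: reject "true difference <= lower bound" (upper-tail p-value).
  t_low <- (diff - low_raw) / se
  p_low <- pt(t_low, df, lower.tail = FALSE)

  # Test 2: reject "true difference >= upper bound" (lower-tail p-value).
  t_high <- (diff - high_raw) / se
  p_high <- pt(t_high, df, lower.tail = TRUE)

  # Equivalence is declared only if both tests reject, so report the larger p.
  list(df = df, t_low = t_low, p_low = p_low,
       t_high = t_high, p_high = p_high,
       p_tost = max(p_low, p_high))
}

res <- tost_two_summary(m1 = 5.25, m2 = 5.22, sd1 = 0.95, sd2 = 0.83,
                        n1 = 95, n2 = 89, low_d = -0.48, high_d = 0.48)
print(round(unlist(res), 4))
```

The t-values for the lower and upper bound should agree with the package output above after rounding (3.48 and -3.03).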





The R package also gives a graph, where you see the observed mean difference (in raw scale units), the equivalence bounds (also in raw scores), and the 90% and 95% CI. If the 90% CI falls entirely between the equivalence bounds (that is, it includes neither bound), we can declare equivalence.
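The confidence interval rule can be checked by hand as well. This is a small sketch in base R, again assuming equal variances and a pooled standard deviation; all variable names are illustrative. With alpha = 0.05 for each one-sided test, the matching interval is the 90% CI (1 - 2 * alpha).

```r
# Summary statistics from the replication study.
m1 <- 5.25; sd1 <- 0.95; n1 <- 95   # control
m2 <- 5.22; sd2 <- 0.83; n2 <- 89   # organic food
alpha <- 0.05

df <- n1 + n2 - 2
sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / df)
se <- sd_pooled * sqrt(1 / n1 + 1 / n2)
diff <- m1 - m2

# Equivalence bounds of d = -0.48 and d = 0.48, converted to raw scores.
low_raw <- -0.48 * sd_pooled
high_raw <- 0.48 * sd_pooled

# 90% CI around the raw mean difference.
ci90 <- diff + c(-1, 1) * qt(1 - alpha, df) * se

# Equivalence: the whole 90% CI lies strictly between the two bounds.
equivalent <- ci90[1] > low_raw && ci90[2] < high_raw
print(round(ci90, 3))
print(equivalent)
```

This gives the same decision as the two one-sided tests, because each end of the 90% CI corresponds to one of the one-sided tests at alpha = 0.05.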













Moery and Calin-Jageman concluded from this study: “We again found that food exposure has little to no effect on moral judgments”. But what is ‘little to no’? The equivalence test tells us the authors successfully rejected effects of a size the original study had 33% power to detect. Instead of saying ‘little to no’ we can put a number on the effect size we have rejected by performing an equivalence test.





If you want to read more about equivalence tests, including how to perform them for one-sample t-tests, dependent t-tests, correlations, or meta-analyses, you can check out a practical primer on equivalence testing using the TOST procedure I've written. It's available as a pre-print on PsyArXiv. The R code is available on GitHub.

I’m happy to announce my first R package ‘TOSTER’ for equivalence tests (but don’t worry, there is an old-fashioned spreadsheet as well).