Abstract A typical rule that has been used for the endorsement of new medications by the Food and Drug Administration is to have two trials, each convincing on its own, demonstrating effectiveness. "Convincing" may be interpreted subjectively, but the use of p-values and the focus on statistical significance (in particular, with p < .05 deemed significant) are pervasive in clinical research. Therefore, in this paper, we use simulations to calculate what it means to have exactly two trials, each with p < .05, in terms of the actual strength of evidence quantified by Bayes factors. Our results show that different cases in which two trials have a p-value below .05 can have wildly differing Bayes factors. Bayes factors of at least 20 in favor of the alternative hypothesis are not guaranteed and fail to be reached in a large proportion of cases, in particular when the true effect size is small (0.2 standard deviations) or zero. In a non-trivial number of cases, the evidence actually points to the null hypothesis, in particular when the true effect size is zero, when the number of trials is large, and when the number of participants in both groups is low. We recommend the use of Bayes factors as a routine tool to assess the endorsement of new medications, because Bayes factors consistently quantify strength of evidence. Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications.

Citation: van Ravenzwaaij D, Ioannidis JPA (2017) A simulation study of the strength of evidence in the recommendation of medications based on two trials with statistically significant results. PLoS ONE 12(3): e0173184. https://doi.org/10.1371/journal.pone.0173184 Editor: Chuhsing Kate Hsiao, National Taiwan University, TAIWAN Received: January 20, 2017; Accepted: February 16, 2017; Published: March 8, 2017 This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication. Data Availability: Scripts to reproduce the published simulation results may be obtained from the first author's website: http://www.donvanravenzwaaij.com/Papers.html and are included as a supporting information file. Funding: The authors received no specific funding for this work. Competing interests: The authors have declared that no competing interests exist.

Introduction Endorsement of medications (drugs and biologics) for clinical use is under rigorous control by regulatory agencies. Since 1962, the body that provides this control is the US Food and Drug Administration (FDA; [1]). The FDA has a critical function as the gateway for the adoption of new medications. The way the FDA endorses drugs and biologics is through clinical trials. New medications are tested, often against a placebo condition or an existing alternative, and statistical evidence is accumulated to quantify efficacy and to offer some reassurance of safety. By far the most common way to quantify evidence of efficacy is to define a null hypothesis (e.g., the new medicine is just as effective as the placebo), collect data, and then generate, using some statistical procedure, what is known as a p-value. A p-value quantifies the probability of observing a difference between placebo and treatment at least as extreme as the difference observed in the data, given that the null hypothesis is true. In other words, if in reality there is no difference between the efficacy of the new medicine and the placebo, the probability of finding an effect at least as large as the one present in the data is equal to the p-value. Typically, a p-value smaller than .05 is deemed statistically significant: it is considered sufficient evidence to reject the null hypothesis. Empirical studies have shown that the use of p-values is practically ubiquitous in biomedical research, and this applies also to clinical trials [2]. Unfortunately, p-values are associated with a number of problems [3–7]. Firstly, a p-value cannot quantify evidence in favor of the null hypothesis: it can be used to reject or to fail to reject the null hypothesis, but never to accept it. Secondly, using p-values leads to over-rejection of the null hypothesis.
The a-priori plausibility of the alternative hypothesis is not taken into account; as a result, the alternative hypothesis gets endorsed whenever the null hypothesis is deemed sufficiently unlikely. This leads to incorrect inference, in particular if the alternative hypothesis is even less likely a priori. Thirdly, p-values are notoriously hard to interpret. Researchers generally want to use the data to infer something about their hypotheses, such as: what evidence do the data provide for the null hypothesis versus the alternative hypothesis? The p-value cannot answer such questions; instead, it gives an abstract number that quantifies the probability of obtaining a data pattern at least as extreme as the one observed if the null hypothesis were true. This definition proves very cryptic to most researchers in the field [8,9]. Finally, p-values do not allow for optional stopping based on examining the preliminary evidence [10]. This means that a p-value can only be properly interpreted when the sample size for testing was determined beforehand and the statistical inference was carried out on the data of that exact sample size. In practice, additional participants are often tested when "the p-value approaches significance", after which the p-value is calculated again. In clinical trials, this takes the form of interim analyses with the potential of early stopping at different points [11]. Alternatively, testing is sometimes discontinued when "an intermediate analysis fails to show a trend in the right direction". Another issue associated with the use of p-values is the almost ubiquitous focus on a .05 threshold (though some fields, like genomics, do employ more stringent criteria for significance [12]). Nevertheless, one trial may require a different degree of certainty than another [13]. The FDA recognizes the need for rigorous statistical evidence in its policy for drug endorsement.
In their guidance for industry [14], the FDA states "With regard to quantity, it has been FDA's position that Congress generally intended to require at least two adequate and well-controlled studies, each convincing on its own, to establish effectiveness." (p. 3). "Convincing on its own" can be interpreted in many different ways, and the meaning of "adequate and well-controlled" is somewhat subjective. However, given the widespread use of p-values and the strong emphasis on passing the threshold of p < .05, one may often interpret this guidance as meaning that two independent clinical trials with p < .05 are required before a new drug or biologic gets endorsed. Moreover, there is no specification of how many trials with p > .05 are allowed among the set of trials that contains these two statistically significant trials. Combining evidence in such a fashion is statistically inappropriate and can lead to wildly differing levels of strength of evidence. In this paper, we present through simulation the extent to which strength of evidence varies across different scenarios when a criterion for drug approval of exactly two p-values lower than .05 is employed. We focus on the scenario of exactly two statistically significant results, as this represents the FDA's threshold for establishing effectiveness. We will show that in certain cases, this policy could actually lead to evidence in favor of the null hypothesis. We will quantify strength of evidence using Bayes factors [15,16]. A Bayes factor captures the relative evidence that the data provide for the alternative hypothesis against the null hypothesis in the form of an odds ratio. For example, when BF = 10, the data are 10 times more likely to have occurred under the alternative hypothesis than under the null hypothesis. Conversely, when BF = 0.1, the data are 10 times more likely to have occurred under the null hypothesis than under the alternative hypothesis.
As for interpreting the strength of evidence as quantified by a Bayes factor, a Bayes factor between 1 and 3 (or, conversely, between 1/3 and 1) is considered ‘not worth more than a bare mention’, a Bayes factor between 3 and 20 (or, conversely, between 1/20 and 1/3) is considered ‘positive’, and a Bayes factor between 20 and 150 (or, conversely, between 1/150 and 1/20) is considered ‘strong’ [17]. In the next section, we will describe the set-up of our simulations in detail. Then, we will present the results of our simulations, demonstrating both the range in strength of evidence and the proportion of times the evidence actually points in favor of the null hypothesis. We will conclude with a discussion of the implications of our results for regulatory assessment of new medications.
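The interpretation bands above can be encoded in a small helper function. The sketch below is purely illustrative (the function name is ours; the band labels follow the categorization cited as [17], which also labels Bayes factors above 150 "very strong"):

```python
def interpret_bf(bf):
    """Map a Bayes factor (BF10) to the qualitative labels used in the text.

    Values below 1 favor the null hypothesis; the bands are symmetric,
    so a BF of 0.04 is treated like its reciprocal, 25.
    """
    favored = "alternative" if bf >= 1 else "null"
    odds = bf if bf >= 1 else 1 / bf
    if odds < 3:
        label = "not worth more than a bare mention"
    elif odds < 20:
        label = "positive"
    elif odds < 150:
        label = "strong"
    else:
        # > 150: not spelled out in the text, but part of the same scheme [17]
        label = "very strong"
    return favored, label
```

For instance, `interpret_bf(10)` yields positive evidence for the alternative, while `interpret_bf(0.04)` yields strong evidence for the null.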

Method We conducted three sets of simulations. For every set, we generated 12,500 data sets. All of the data sets were intended to mimic two-condition between-subjects experiments with an experimental group and a control (e.g., placebo) group. Significance was determined with a two-tailed t-test with a threshold of p < .05, combined with selecting as "successes" only those results for which all statistically significant effects were in the direction of the experimental arm being better than the control (e.g., placebo) arm. The three sets of simulations differed in the true population effect size between the two groups. In the first set of simulations, the true population effect size was small (0.2 standard deviations, or 0.2 SD), in the second set it was medium (0.5 SD), and in the third set it was zero (0 SD) [18]. Empirical evidence suggests that most effective treatments have small or modest effects, in particular when major clinical outcomes are concerned, and very large effects are uncommon, except as chance findings in very small trials [19]. Therefore, our simulations are:

p123 ∼ N(0, 1)
e1 ∼ N(0.2, 1)
e2 ∼ N(0.5, 1)
e3 ∼ N(0, 1)

where p123 indicates simulated data for the placebo groups in all sets of simulations, and e1, e2, and e3 indicate simulated data for the experimental groups in the first, second, and third set of simulations, respectively. The notation ∼N(,) indicates that values were drawn from a normal distribution with mean and standard deviation given by the first and second number between parentheses, respectively.
For each effect-size simulation set, we ran five variants differing in the total number of trials: one with 2 trials with statistically significant results out of 2 performed, one with 2 significant results out of 3 performed, one with 2 out of 4, one with 2 out of 5, and one with 2 out of 20. This was achieved by continuously regenerating data until exactly 2 significant results emerged. Note that our simulations are not concerned with the likelihood of obtaining exactly 2 out of 5 significant results given a certain effect size; their purpose is to demonstrate the range of strengths of evidence if such a scenario were to occur. These simulations reflect different scenarios: on one end, the scenario in which exactly two trials were conducted and both were statistically significant in the expected direction; on the other end, the scenario in which twenty trials were conducted and exactly two were significant in the expected direction (and 18 were not statistically significant). We also varied the number of participants per group, running five conditions: n = 20, n = 50, n = 100, n = 500, and n = 1,000. Thus, to sum up, our simulations varied along the following dimensions:

Effect size: small (0.2 SD), medium (0.5 SD), and zero (0 SD)
Number of total trials: 2, 3, 4, 5, and 20
Number of participants per group: 20, 50, 100, 500, and 1,000

This resulted in a total of 75 types of simulations. We replicated each simulation type 500 times. In addition to these simulations, we performed sensitivity analyses with simulations that used individual differences in the effect size distribution and unequal variance in the two groups (see S1 and S2 Files for details).
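The data-generating procedure described above can be sketched as follows. This is a minimal Python illustration, not the authors' original R scripts; the function names and the use of scipy's t-test are our own choices:

```python
import numpy as np
from scipy import stats

def simulate_trials(n_trials, n_per_group, effect_size, rng):
    """Simulate one set of trials; each trial draws a placebo group from
    N(0, 1) and an experimental group from N(effect_size, 1), then runs a
    two-sided independent-samples t-test. Returns (p-value, direction) pairs,
    where direction is True if the experimental arm outperformed placebo."""
    results = []
    for _ in range(n_trials):
        placebo = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect_size, 1.0, n_per_group)
        t, p = stats.ttest_ind(treatment, placebo)
        results.append((p, t > 0))
    return results

def sample_until_two_significant(n_trials, n_per_group, effect_size, seed=0):
    """Regenerate whole sets of trials until exactly two are significant at
    p < .05, with all significant effects favoring the experimental arm,
    mirroring the paper's 'exactly 2 significant trials' selection."""
    rng = np.random.default_rng(seed)
    while True:
        res = simulate_trials(n_trials, n_per_group, effect_size, rng)
        significant = [(p, pos) for p, pos in res if p < .05]
        if len(significant) == 2 and all(pos for _, pos in significant):
            return res
```

In the paper's design, this selection step is repeated 500 times per cell of the 3 (effect size) × 5 (total trials) × 5 (group size) grid.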
We calculated one-sided JZS Bayes factors for the combined data from the total number of trials conducted [20]. This one-sided Bayes factor quantifies the relative likelihood of the one-sided alternative hypothesis, that the experimental group has a higher mean than the control group, against the null hypothesis, that the experimental group has the same mean as the control group. Bayes factors were calculated using the BayesFactor package available in the statistical software package R (R package available at https://cran.r-project.org/web/packages/BayesFactor/index.html). For each replication, we computed an independent-samples one-sided Bayesian t-test for the data of all trials combined. The JZS Bayes factor is calculated by comparing the marginal likelihood of the data under the null hypothesis to the marginal likelihood of the data under the alternative hypothesis, integrated over a range of plausible alternative hypotheses. The range of alternative hypotheses is given by a prior on the effect size parameter δ, which follows a Cauchy distribution with a width of r = √2/2 (see [20], equation in note 4 on page 237). A mathematically equivalent, and perhaps more intuitive, way of obtaining the JZS Bayes factor is to divide the height of the prior distribution by the height of the posterior distribution, both evaluated at effect size δ = 0, a result known as the Savage–Dickey density ratio (for a mathematical proof, see [21]).

Discussion In this study, we simulated clinical trial data comparing an experimental group to a placebo group. Simulations differed in the underlying true effect size, the number of clinical trials, and the number of participants in each trial. The simulations all had one thing in common: exactly two of the conducted trials were statistically significant, with a two-tailed p-value lower than .05 and an effect in the expected direction (favoring the experimental intervention). For all simulations, we assessed the strength of evidence in favor of the alternative hypothesis, that the experimental group outperforms the placebo, by means of a Bayes factor using a default prior. The result of our simulations is simple yet compelling: an endorsement criterion of two p-values lower than .05 leads to wide variation in the strength of evidence in favor of a medicine's efficacy. In a non-trivial proportion of cases, this criterion even leads to endorsement when the statistical evidence actually favors the null hypothesis. Bayes factors that favor the null hypothesis despite two p-values lower than .05 occur when the true effect size is zero, when the number of trials is large, and when the number of participants in both groups is low. What are we to conclude from this? First and foremost, it is likely that a criterion asking for two statistically significant trials would often lead to correct endorsement of new medications. It is difficult to estimate the true proportion of incorrectly endorsed medicines, because the underlying true effect size cannot be known in advance. However, empirical evidence across many medical interventions suggests that most effects are small or modest [19]. Our results show that even for a true effect size of 0.5, medicines sometimes get endorsed based on very unconvincing evidence. This result is a direct consequence of the p-value's propensity to over-reject the null hypothesis [6]. Fortunately, a straightforward solution exists: quantifying evidence using the Bayes factor.
Such a change in protocol for statistical inference is not unprecedented. Bayesian statistics have been used in clinical trial design and analysis for a number of years by the FDA in some domains (e.g., medical device clinical trials; [22–24]). Use of Bayesian methods for statistical inference has long been widely recommended [25,26], but has sparsely been adopted [2]. Computational difficulties are no longer an excuse for the underuse of Bayesian methods: with the growth in computational power in recent decades came the possibility of computer-driven sampling routines, so-called Markov chain Monte Carlo (MCMC) sampling [27,28], to approximate the posterior distributions necessary to calculate Bayes factors. Another important problem that was not solved until recently was the absence of easy-to-use statistical software with an intuitive interface, which meant that the Bayesian hypothesis test was a tool that could only be used by statistical experts. The recent development of online Bayes factor calculator tools [20] (available at http://pcl.missouri.edu/bayesfactor) and the statistical freeware program JASP [29] has greatly enhanced the accessibility of Bayesian statistical inference. It is important to stress that the results of our simulations make no assumptions as to how the data were obtained: our simulations make no commitment about whether the data were obtained with honest intentions, through cherry-picking, or through p-hacking. Our results are in agreement with some empirical data on placebo-controlled trials submitted to the FDA. For example, Monden et al. examined 58 trials of second-generation antidepressants for generalized anxiety [30]. The BFs estimated for these trials varied widely, from 0.07 to 131,000. Among the 59 doses that were felt by the FDA to have substantial evidence for efficacy, only 26 had a BF of at least 20. Some limitations of our study should be discussed.
First, the "exactly 2 significant trials" rule that we simulated may not fully capture the way that the FDA or other regulatory agencies operate, even without explicit consideration of Bayesian methods. The FDA's position to require "…at least two adequate and well-controlled studies, each convincing on its own, to establish effectiveness" is not the same as "exactly 2 significant trials". Our demonstration provides an indication of the strength of evidence one would obtain when this policy is employed with exactly two significant trials. Furthermore, the approval process includes consideration of multiple aspects of efficacy and safety, and it also entails a qualitative assessment of the adequacy of the design, conduct, and analysis of the trial and of the relevance of the outcomes used. Therefore, our simulations should not be seen as exactly mapping the regulatory process, but rather as exploring the consequences of using a rule that is based on statistical significance alone. Safety assessment in particular has a stronger track record of use of Bayesian analysis [31]. Second, there can be differences of opinion on how strong the evidence should be before a medication is approved: a BF of 3 is typically considered very weak and barely worth mentioning, while even a BF of 20 may not be considered conclusive at times [17]. Third, we used two different non-null effect sizes, but the magnitude of the effect that is considered sufficiently good to lead to approval may vary on a case-by-case basis. For example, the type of outcome, the availability of other drugs, the safety profile of the tested medication, and other factors may also be involved in the decision-making. Fourth, we have focused on evaluating superiority trials, but for some drugs decisions may be made based on non-inferiority designs. Non-inferiority trials are a minority: a survey identified 209 non-inferiority or equivalence trials published in 2009 [32].
Appropriate Bayesian considerations apply also to non-inferiority trials [33,34]. Fifth, we used a standard Bayesian framework for all analyses, so as to standardize the inferences derived. In real practice, some further diversification can exist based on additional prior evidence; for our simulations, we assumed non-informative priors. Allowing for these caveats, our study offers, through simulations, yet another demonstration of the unfortunate effect of p-values on statistical inferences. More routine consideration of BFs in regulatory assessments and clinical decision-making would be a step forward for the adoption of medications in clinical practice.

Author Contributions Conceptualization: DvR. Formal analysis: DvR. Methodology: DvR JPAI. Project administration: DvR JPAI. Software: DvR. Supervision: JPAI. Validation: DvR. Visualization: DvR JPAI. Writing – original draft: DvR. Writing – review & editing: JPAI.