Data from FDA Reviews

We identified the phase 2 and 3 clinical-trial programs for 12 antidepressant agents approved by the FDA between 1987 and 2004 (median, August 1996), involving 12,564 adult patients. For the eight older antidepressants, we obtained hard copies of statistical and medical reviews from colleagues who had procured them through the Freedom of Information Act.19 Reviews for the four newer antidepressants were available on the FDA Web site.17,20 This study was approved by the Research and Development Committee of the Portland Veterans Affairs Medical Center; because of its nature, informed consent from individual patients was not required.

From the FDA reviews of submitted clinical trials, we extracted efficacy data on all randomized, double-blind, placebo-controlled studies of drugs for the short-term treatment of depression. We included data pertaining only to dosages later approved as safe and effective; data pertaining to unapproved dosages were excluded.

We extracted the FDA's regulatory decisions — that is, whether, for purposes of approval, the studies were judged to be positive or negative with respect to the prespecified primary outcomes (or primary end points).21 We classified as questionable those studies that the FDA judged to be neither positive nor clearly negative — that is, studies that did not have significant findings on the primary outcome but did have significant findings on several secondary outcomes. Failed studies22 were also classified as questionable (for more information, see the Methods section of the Supplementary Appendix, available with the full text of this article at www.nejm.org). For fixed-dose studies (studies in which patients are randomly assigned to receive one of two or more dose levels or placebo) with a mix of significant and nonsignificant results for different doses, we used the FDA's stated overall decisions on the studies. We used double data extraction and entry, as detailed in the Methods section of the Supplementary Appendix.

Data from Journal Articles

Our literature-search strategy consisted of the following steps: a search of articles in PubMed, a search of references listed in review articles, and a search of the Cochrane Central Register of Controlled Trials; contact by telephone or e-mail with the drug sponsor's medical-information department; and finally, contact by means of a certified letter sent to the sponsor's medical-information department, including a deadline for responding in writing to our query about whether the study results had been published. If these steps failed to reveal any publications, we concluded that the study results had not been published.

We identified the best match between the FDA-reviewed clinical trials and journal articles on the basis of the following information: drug name, dose groups, sample size, active comparator (if used), duration, and name of principal investigator. We sought published reports on individual studies; articles covering multiple studies were excluded. When the results of a trial were reported in two or more primary publications, we selected the first publication.

Few journal articles used the term “primary efficacy outcome” or a reasonable equivalent. Therefore, we identified the apparent primary efficacy outcome, or the result highlighted most prominently, as the drug–placebo comparison reported first in the text of the results section or in the table or figure first cited in the text. As with the FDA reviews, we used double data extraction and entry (see the Methods section of the Supplementary Appendix for details).

Statistical Analysis

We categorized the trials on the basis of the FDA regulatory decision, whether the trial results were published, and whether the apparent primary outcomes agreed or conflicted with the FDA decision. We calculated risk ratios with exact 95% confidence intervals and Pearson's chi-square analysis, using Stata software, version 9. We used a similar approach to examine the numbers of patients within the studies. Sample sizes were compared between published and unpublished studies with the use of the Wilcoxon rank-sum test.
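The comparisons described above can be sketched in code. The sketch below uses hypothetical counts and sample sizes (not the study's actual data); it computes a risk ratio with a normal-approximation (Wald) confidence interval, whereas the paper reports exact confidence intervals from Stata, and it uses the Mann-Whitney U test, which is equivalent to the Wilcoxon rank-sum test the paper applies to sample sizes.

```python
from math import exp, log, sqrt
from scipy.stats import chi2_contingency, mannwhitneyu

# Hypothetical 2x2 counts: publication status by FDA regulatory decision
a, n1 = 30, 40   # published among FDA-positive studies
c, n2 = 10, 40   # published among FDA-negative studies

# Risk ratio of publication for positive vs. negative studies
rr = (a / n1) / (c / n2)
# Wald log-scale CI -- an approximation to the exact CI the paper reports
se = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
ci_lo, ci_hi = exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)

# Pearson's chi-square test on the same 2x2 table
chi2, p_chi2, _, _ = chi2_contingency([[a, n1 - a], [c, n2 - c]],
                                      correction=False)

# Rank-sum comparison of per-study sample sizes (hypothetical values)
published_n = [120, 150, 200, 240, 300]
unpublished_n = [90, 100, 110, 140, 160]
u_stat, p_rank = mannwhitneyu(published_n, unpublished_n,
                              alternative="two-sided")
```

With these illustrative counts the risk ratio is 3.0; the exact-CI method in Stata would give slightly different interval bounds.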

For our major outcome indicator, we calculated the effect size for each trial using Hedges's g — that is, the difference between two means divided by their pooled standard deviation.23 However, because means and standard deviations (or standard errors) were inconsistently reported in both the FDA reviews and the journal articles, we used the algebraically equivalent computational equation24:

g = t × √(1/n_drug + 1/n_placebo).

We calculated the t statistic25 using the precise P value and the combined sample size as arguments in Microsoft Excel's TINV (inverse t) function, multiplying t by −1 when the study drug was inferior to the placebo. Hedges's correction for small sample size was applied to all g values.26
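The effect-size calculation can be reproduced as follows. This is a minimal sketch, not the authors' code: scipy's inverse t distribution stands in for Excel's TINV (the two identities agree because TINV(p, df) equals the 1 − p/2 quantile of the t distribution), and the standard small-sample correction factor 1 − 3/(4 df − 1) is applied.

```python
from math import sqrt
from scipy.stats import t as t_dist

def hedges_g_from_p(p_two_tailed, n_drug, n_placebo, drug_inferior=False):
    """Recover Hedges's g from a two-tailed P value and group sizes.

    Mirrors the pipeline described in the text: t is obtained from the
    P value via the inverse t distribution (the equivalent of Excel's
    TINV), then g = t * sqrt(1/n_drug + 1/n_placebo), with Hedges's
    small-sample correction applied at the end.
    """
    df = n_drug + n_placebo - 2
    # Two-tailed inverse t: Excel's TINV(p, df) == t.ppf(1 - p/2, df)
    t_val = t_dist.ppf(1.0 - p_two_tailed / 2.0, df)
    if drug_inferior:
        t_val = -t_val  # drug worse than placebo -> negative effect size
    g = t_val * sqrt(1.0 / n_drug + 1.0 / n_placebo)
    # Hedges's correction factor for small samples
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    return j * g
```

For example, a two-tailed P value of 0.04 with 50 patients per group yields a corrected g of roughly 0.41.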

Precise P values were not always available for the above calculation. Rather, P values were often indicated as being below or above a certain threshold — for example, P<0.05 or “not significant” (i.e., P>0.05). In these cases, we followed the procedure described in the Supplementary Appendix.

For each fixed-dose (multiple-dose) study, we computed a single study-level effect size weighted by the degrees of freedom for each dose group. On the basis of the study-level effect-size values for both fixed-dose and flexible-dose studies, we calculated weighted mean effect-size values for each drug and for all drugs combined, using a random-effects model with the method of DerSimonian and Laird27 in Stata.28
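The pooling step can be illustrated with a minimal implementation of the DerSimonian-Laird estimator; the paper itself used Stata's implementation, and this sketch takes study-level effect sizes and their within-study variances as given rather than deriving them from the trial data.

```python
def dersimonian_laird(effects, variances):
    """Pool study-level effect sizes with a DerSimonian-Laird
    random-effects model (method-of-moments estimate of tau^2)."""
    k = len(effects)
    w = [1.0 / v for v in variances]                  # fixed-effect weights
    sw = sum(w)
    fe = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures observed between-study heterogeneity
    q = sum(wi * (yi - fe) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]    # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2
```

With two hypothetical studies of equal variance (g = 0.2 and 0.6, each with variance 0.05), the pooled estimate is 0.4 and the between-study variance estimate is 0.03.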

Within the published studies, we compared the effect-size values derived from the journal articles with the corresponding effect-size values derived from the FDA reviews. Next, within the FDA data set, we compared the effect-size values for the published studies with the effect-size values for the unpublished studies. Finally, we compared the journal-based effect-size values with those derived from the entire FDA data set — that is, both published and unpublished studies.

We made these comparisons at the level of studies and again at the level of the 12 drugs. Because the data were not normally distributed, we used the nonparametric rank-sum test for unpaired data and the signed-rank test for paired data. In these analyses, all the effect-size values were given equal weight.
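These paired and unpaired comparisons can be sketched with scipy's nonparametric tests, using hypothetical effect-size values rather than the study's actual data (scipy's Mann-Whitney U test is the unpaired rank-sum test; `wilcoxon` is the paired signed-rank test).

```python
from scipy.stats import mannwhitneyu, wilcoxon

# Hypothetical study-level Hedges's g values
fda_published = [0.30, 0.35, 0.28, 0.41, 0.33,
                 0.38, 0.29, 0.36, 0.31, 0.40]
inflation = [0.03, 0.05, 0.04, 0.06, 0.05,
             0.02, 0.07, 0.04, 0.05, 0.03]
# Journal-based values for the same studies, shifted upward
journal_values = [g + d for g, d in zip(fda_published, inflation)]
fda_unpublished = [0.05, 0.10, -0.02, 0.12, 0.08]

# Paired comparison: journal vs. FDA values for the published studies
w_stat, p_paired = wilcoxon(journal_values, fda_published)

# Unpaired comparison: published vs. unpublished within the FDA data set
u_stat, p_unpaired = mannwhitneyu(fda_published, fda_unpublished,
                                  alternative="two-sided")
```

Because every hypothetical journal value exceeds its FDA counterpart, and the unpublished effect sizes are uniformly smaller, both tests reject at the 0.05 level in this toy example.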