Mean improvement scores were not available in five of the 47 trials (Figure 1). Specifically, four sertraline trials involving 486 participants and one citalopram trial involving 274 participants were reported as having failed to achieve a statistically significant drug effect, without reporting mean HRSD scores. We were unable to find data from these trials on pharmaceutical company Web sites or through our search of the published literature. These omissions represent 38% of patients in sertraline trials and 23% of patients in citalopram trials. Analyses with and without inclusion of these trials found no differences in the patterns of results; similarly, the revealed patterns did not interact with drug type. The purpose of using the data obtained from the FDA was to avoid publication bias, by including unpublished as well as published trials. Inclusion of only those sertraline and citalopram trials for which means were reported to the FDA would constitute a form of reporting bias similar to publication bias and would lead to overestimation of drug–placebo differences for these drug types. Therefore, we present analyses only for medications for which change scores were reported for the complete set of clinical trials. The dataset comprised 35 clinical trials (five of fluoxetine, six of venlafaxine, eight of nefazodone, and 16 of paroxetine) involving 5,133 patients, 3,292 of whom had been randomized to medication and 1,841 of whom had been randomized to placebo.

Confirming earlier analyses [2], but with a substantially larger number of clinical trials, weighted mean improvement was 9.60 points on the HRSD in the drug groups and 7.80 in the placebo groups, yielding a mean drug–placebo difference of 1.80 on HRSD improvement scores. Although the difference between these means easily attained statistical significance (Table 2, Model 3a), it does not meet the three-point drug–placebo criterion for clinical significance used by NICE. Represented as the standardized mean difference, d, mean change for drug groups was 1.24 and that for placebo 0.92, both of extremely large magnitude according to conventional standards. Thus, the difference between improvement in the drug groups and improvement in the placebo groups was 0.32, which falls below the 0.50 standardized mean difference criterion that NICE suggested. The amounts of change for drug and placebo groups varied widely around their respective means, Q(34) = 51.80 and 74.59, respectively (both p < 0.05), and I² = 34.18 and 54.47. Thus, the mean change exhibited in trials provides a poor description of results, and moderator models are indicated.
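The comparisons against the NICE criteria in the paragraph above are simple arithmetic, and the heterogeneity index can be approximated from the reported Q statistics. The sketch below is illustrative only: it assumes the common Higgins-Thompson definition I² = 100 × (Q − df) / Q, which the text does not state explicitly, and all variable names are our own. Under that assumption the computed I² values land close to, though not exactly on, the reported 34.18 and 54.47; the small gap may reflect rounding of the published Q values.

```python
# Summary values copied from the text above.
drug_improvement = 9.60      # weighted mean HRSD improvement, drug groups
placebo_improvement = 7.80   # weighted mean HRSD improvement, placebo groups
d_drug, d_placebo = 1.24, 0.92

NICE_RAW_CRITERION = 3.0     # HRSD points
NICE_SMD_CRITERION = 0.50    # standardized mean difference

raw_difference = drug_improvement - placebo_improvement
smd_difference = d_drug - d_placebo


def i_squared(q: float, df: int) -> float:
    """I^2 in percent, assuming the definition 100 * (Q - df) / Q, floored at 0."""
    return max(0.0, 100.0 * (q - df) / q)


i2_drug = i_squared(51.80, 34)
i2_placebo = i_squared(74.59, 34)

print(f"raw difference {raw_difference:.2f}, "
      f"meets 3-point criterion: {raw_difference >= NICE_RAW_CRITERION}")
print(f"standardized difference {smd_difference:.2f}, "
      f"meets 0.50 criterion: {smd_difference >= NICE_SMD_CRITERION}")
print(f"approximate I^2: drug {i2_drug:.1f}%, placebo {i2_placebo:.1f}%")
```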

Baseline HRSD scores, improvement, and sample sizes in drug and placebo groups for each clinical trial are reported in Table 1. As in the FDA files, studies are identified by protocol numbers. The data from these trials can be obtained from the FDA using FOIA requests and citing the medication name and protocol number. The table also includes references to published reports of the data abstracted from the FDA files, when they could be found (using the search methods described above). Studies in which data only from selected sites of a multisite study were published are not cited in the table. We have also excluded published reports in which dropouts have been removed from the data. For each of the trials, the pharmaceutical companies had submitted to the FDA data in which attrition was handled by carrying forward each patient's last observation (LOCF), which was the basis in all cases of the FDA review. These data and their corresponding citations appear in the table. Even in the LOCF data, there are sometimes minor discrepancies between the published version and the version submitted to the FDA. In some cases, for example, the N is slightly larger in the published studies than in the data reported to the FDA. Further complicating this problem is the fact that occasionally, the company has published a trial more than once, with slight discrepancies in the data between publications. Data in the table are those reported to the FDA.

Drug and Initial Severity Trends in Change

Moderator analyses examined whether drug type, duration of treatment, and baseline severity (HRSD) scores related to improvement. Although drug type and duration of treatment were unrelated to improvement, the drug versus placebo difference remained significant, and amount of improvement was a function of baseline severity (Table 2, Model 1a). Specifically, the amount of improvement depended markedly on the quadratic function of baseline severity, but the linear function of baseline severity interacted with assignment to drug versus placebo (Model 1b). As Figure 2 shows, improvement from baseline operated as a ∩-shaped curvilinear function in relation to baseline severity, with those at the lowest and highest levels experiencing smaller gains, whereas those in-between experienced larger gains; the slope for placebo declined as severity increased, whereas the slope for drug was slightly positive. The difference between drug and placebo exceeded NICE's 0.50 standardized mean difference criterion at comparisons exceeding 28 in baseline severity. Further analyses indicated that drug type did not moderate this effect. Although venlafaxine and paroxetine had significantly (p < 0.001) larger weighted mean effect sizes comparing drug to placebo conditions (ds = 0.42 and 0.47, respectively) than fluoxetine (d = 0.22) or nefazodone (d = 0.21), these differences disappeared when baseline severity was controlled.

Figure 2. Mean Standardized Improvement as a Function of Initial Severity and Treatment Group. Drug improvement is portrayed as red triangles around their solid red regression line and placebo improvement as blue circles around their dashed blue regression line; the green shaded area indicates the point at which comparisons of drug versus placebo reach the NICE clinical significance criterion of d = 0.50. Plotted values are sized according to their weight in analyses. https://doi.org/10.1371/journal.pmed.0050045.g002

For all but one sample, baseline HRSD scores were in the very severe range according to the criteria proposed by the American Psychiatric Association (APA) [21] and adopted by NICE [1]. The one exception derived from a fluoxetine trial that had two samples, one with HRSD scores in the very severe range and the other with scores in the moderate range. Because the low-HRSD condition might be considered an outlier, the analyses were performed again without it. Results continued to reveal that drug versus placebo assignment interacted with initial severity to influence improvement: the curvilinear function of baseline severity was no longer significant, but group continued to interact with the linear component (Table 2, Model 2c). As Figure 3 shows, drug efficacy did not change as a function of initial severity, whereas placebo efficacy decreased as initial severity increased; values again exceeded NICE's 0.50 standardized mean difference criterion at comparisons greater than 28 in baseline severity. This final model comprising three simultaneous study dimensions (viz., drug vs. placebo, baseline, and the interaction) explained 51.45% of the variation in improvement. Although this model was in a formal sense incorrectly specified (residual Q(64) = 96.07, p < 0.01), when a random-effects constant was instead assumed, the same pattern of results remained in this more statistically conservative mixed-effects model. A final model that incorporated even the drug types for which only some trials were available confirmed these trends.
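The structure of the final model (treatment group, baseline severity, and their interaction) can be illustrated with a small weighted regression. The sketch below does not use the trial data and is not the analysis code behind Table 2: the simulated arms, the coefficients used to generate them, the use of arm sample size as a stand-in for inverse-variance weights, and all function names are assumptions made for illustration. The generating values were chosen so that drug improvement is flat, placebo improvement declines with severity, and the gap reaches 0.50 at a baseline of 28.

```python
import numpy as np

NICE_SMD_CRITERION = 0.50


def simulate_arms(rng, n_trials=35, noise_sd=0.05):
    """HYPOTHETICAL arm-level data: drug improvement flat, placebo declining."""
    baseline = rng.uniform(23.0, 30.0, size=n_trials)
    n_arm = rng.integers(20, 200, size=n_trials).astype(float)
    d_drug = 1.25 + noise_sd * rng.standard_normal(n_trials)
    d_placebo = 2.15 - 0.05 * baseline + noise_sd * rng.standard_normal(n_trials)
    group = np.concatenate([np.ones(n_trials), np.zeros(n_trials)])
    return (group,
            np.concatenate([baseline, baseline]),
            np.concatenate([d_drug, d_placebo]),
            np.concatenate([n_arm, n_arm]))


def fit_interaction_model(group, baseline, d, weights):
    """Weighted least squares for d ~ group + baseline + group:baseline."""
    center = np.average(baseline, weights=weights)
    b = baseline - center                      # center baseline for stability
    X = np.column_stack([np.ones_like(b), group, b, group * b])
    sw = np.sqrt(weights)                      # WLS via row scaling
    coef, *_ = np.linalg.lstsq(X * sw[:, None], d * sw, rcond=None)
    return coef, center


def crossing_baseline(coef, center, criterion=NICE_SMD_CRITERION):
    """Baseline severity at which the fitted drug-placebo gap equals the criterion."""
    _, b_group, _, b_inter = coef
    return center + (criterion - b_group) / b_inter


group, baseline, d, w = simulate_arms(np.random.default_rng(0))
coef, center = fit_interaction_model(group, baseline, d, w)
print("fitted gap reaches 0.50 near baseline HRSD =",
      round(crossing_baseline(coef, center), 1))
```

In this parameterization the fitted drug-placebo gap at a given baseline is the group coefficient plus the interaction coefficient times the centered baseline, so the crossing point follows from solving that linear expression for the criterion value.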

Figure 3. Mean Standardized Improvement as a Function of Initial Severity and Treatment Group, Including Only Trials Whose Samples Had High Initial Severity. Drug improvement is portrayed as red triangles around their solid red regression line and placebo improvement as blue circles around their dashed blue regression line; the green shaded area indicates the point at which comparisons of drug versus placebo reach the NICE clinical significance criterion of d = 0.50. Plotted values are sized according to their weight in analyses. https://doi.org/10.1371/journal.pmed.0050045.g003

Figure 4 displays raw mean differences between drug and placebo as a function of initial severity, rising as a linear function of baseline severity levels (Table 2, Models 3a and 3b) even though, almost without exception, the scores were in the very severe range of the criteria proposed by APA [21]. Yet when these data are considered in conjunction with those in Figure 3, it seems clear that the increased difference is due to a decrease in improvement in placebo groups, rather than an increase in drug groups.

Figure 4. Mean Drug–Placebo Difference Scores as a Function of Initial Severity. Plotted values are sized according to their sample sizes (n); the green line represents the NICE clinical significance criterion. The solid blue regression line represents the trend across all 35 trials; the dashed red line represents the trend excluding the left-most observation. https://doi.org/10.1371/journal.pmed.0050045.g004

A visual inspection of Figure 4 suggests that studies' effects are fairly evenly distributed above and below the NICE criterion of three points but that most small studies have high baselines and show large effects. Although sample size (N) was negatively linked to the drug-versus-placebo differences (β = −0.34, p = 0.003), when mean baseline severity values are controlled, this effect disappears and the baseline effect remains significant. The interaction of sample size with baseline severity was marginally significant, p = 0.0586, and the pattern indicated that baseline severity was somewhat more predictive for smaller than for larger studies. Yet, because simple-slopes analyses revealed that baseline scores were significantly predictive even for the largest studies, study differences in sample size would appear to qualify neither the pattern of results we have reported nor their interpretation.
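The reasoning in the paragraph above, that an apparent sample-size effect can vanish once baseline severity is controlled, can be demonstrated on simulated data. The sketch below is a hypothetical illustration, not a reanalysis: the trials are generated under the assumptions that smaller trials tend to have higher baselines and that the drug-placebo difference depends on baseline severity only. It uses unweighted ordinary least squares on standardized variables, which is a simplification of the weighted models reported here, and every name and coefficient in it is our own.

```python
import numpy as np


def zscore(x):
    """Standardize to mean 0, SD 1 so fitted slopes are standardized betas."""
    return (x - x.mean()) / x.std()


def simulate_trials(rng, n_trials=35, noise_sd=0.3):
    """HYPOTHETICAL trials in which smaller trials tend to have higher baselines."""
    baseline = rng.uniform(23.0, 30.0, n_trials)
    # Assumed confounding: sample size shrinks as baseline severity rises.
    n = 400.0 - 40.0 * (baseline - 23.0) + 30.0 * rng.standard_normal(n_trials)
    # Assumed truth: the drug-placebo gap depends on baseline only, not on n.
    diff = 1.8 + 0.5 * (baseline - 26.5) + noise_sd * rng.standard_normal(n_trials)
    return baseline, n, diff


def sample_size_betas(baseline, n, diff):
    """Return (beta for n alone, beta for n with baseline controlled)."""
    zn, zb, zd = zscore(n), zscore(baseline), zscore(diff)
    ones = np.ones_like(zn)
    alone, *_ = np.linalg.lstsq(np.column_stack([ones, zn]), zd, rcond=None)
    joint, *_ = np.linalg.lstsq(np.column_stack([ones, zn, zb]), zd, rcond=None)
    return alone[1], joint[1]


baseline, n, diff = simulate_trials(np.random.default_rng(0))
beta_alone, beta_adjusted = sample_size_betas(baseline, n, diff)
print(f"beta for N alone: {beta_alone:.2f}")
print(f"beta for N with baseline controlled: {beta_adjusted:.2f}")
```

Under these assumptions the coefficient for sample size is negative when it is the only predictor and shrinks toward zero once baseline severity enters the model, which is the qualitative pattern described in the text.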

Examination of publication bias often relies on inspections of effect sizes in relation to sample size (or inverse variance) [22]. A funnel plot of the data depicted in Figure 4 indicates that the larger studies in the FDA datasets tended to show smaller drug effects than smaller studies. Although such a pattern might be construed as indicating a publication or other reporting bias, our use of complete datasets precludes this possibility, unless some small trials were not reported despite the FDA Guidelines [6]. A more plausible explanation is that trials with higher baseline scores tended to be small. In any case, funnel-plot inspections assume that there is only one population effect size that can be tracked by a comparison between drug and placebo groups, whereas the current investigation shows that these effects vary widely and that the magnitude of the difference depends on initial severity values. Consequently, funnel-plot inspection is much less appropriate in the present context. Unfortunately, there are no other tools yet available to detect publication or other reporting biases in the face of effect modifiers.