Meta-analyses indicate that antidepressants are superior to placebos in statistical terms, but the clinical relevance of the differences has not been established. Previous suggestions of clinically relevant effect sizes have not been supported by empirical evidence. In the current paper we apply an empirical method that consists of comparing scores obtained on the Hamilton rating scale for depression (HAM-D) with scores on the Clinical Global Impressions-Improvement (CGI-I) scale. This method reveals that a HAM-D difference of 3 points is undetectable by clinicians using the CGI-I scale. A difference of 7 points on the HAM-D, or an effect size of 0.875, is required to correspond to a rating of ‘minimal improvement’ on the CGI-I. By these criteria, differences between antidepressants and placebo in randomised controlled trials, including trials conducted with people diagnosed with very severe depression, are not detectable by clinicians and fall far short of the threshold for a clinically observable minimal improvement. Clinical significance should be considered alongside statistical significance when making decisions about the approval and use of medications like antidepressants.

Decisions about the approval and use of medications should not be based on statistical significance alone. Estimation of the clinical relevance of drug-placebo differences is also necessary, to balance the utility of a drug's effects against its side effects and health risks. Antidepressants have been compared with placebo in numerous randomised controlled trials. The methodological flaws of these studies have been widely discussed and include selective publication and outcome reporting, bias introduced by placebo washout procedures, infringement of the double blind, and inflation of drug-placebo differences through categorisation of data []. Despite these problems, there remains a consensus that antidepressants have worthwhile effects in people with more severe depression, at least. The difference between antidepressants and placebo in the treatment of major depression is small, however. Mean differences between antidepressants and placebo reported in meta-analyses of the Food and Drug Administration data set have ranged from 1.80 to 2.56 points on the widely used Hamilton rating scale for depression (HAM-D) [], with effect sizes (d) ranging from 0.31 to 0.32 []. In some studies [], effects varied as a function of baseline severity, ranging from d = 0.11 for patients in the mild to moderate range (HAM-D ≤ 18) to 0.47 for patients with very severe depression (HAM-D ≥ 23) [], although another study failed to find a severity effect [].

Subsequently, an empirical method of establishing the clinical relevance of change scores has been reported in a number of studies []. The method links scores on various scales used in psychiatric outcome trials to scores on the commonly used Clinical Global Impressions-Improvement (CGI-I) scale, which rates improvement from 1 (very much improved) through 4 (no change) to 7 (very much worse) []. The CGI-I is said to be ‘intuitively understood by clinicians’ ([], p 243) and has good inter-rater reliability, between 0.65 and 0.92 []. It has been judged to be a useful measure in clinical trials [] and shown to have concurrent validity with other measures, including CGI severity ratings []. Spearman correlations ranging between .70 and .80 have been reported between the CGI-I and the HAM-D []. Thus, this method allows one to align the degree of change on a symptom scale with clinician perception of improvement, and provides a means of establishing an empirically derived criterion for clinical significance. The method has been applied to scales measuring symptoms of schizophrenia [], and more recently to depression scales, specifically the HAM-D. We suggest that a CGI-I rating of 3, which indicates that the patient has “minimally improved”, provides the most liberal criterion possible, as the next step on the scale is “no change.”

Until recently there have been no empirically validated criteria for establishing the clinical significance of change scores on scales measuring psychiatric symptoms. In the 2004 National Institute for Health and Clinical Excellence (NICE) guidelines on treating depression, it was suggested that differences of three points on the HAM-D and standardized mean differences of 0.50 might be clinically significant [], but no evidence was cited to support these proposed cut-offs, and they were criticised as arbitrary []. The specification of criteria for clinical relevance was removed from the later edition of the Guidance published in 2009, but effects continued to be classified according to their ‘clinical importance,’ apparently using the same criteria proposed in the 2004 Guidance []. For example, based on a standardized mean difference of 0.34, the 2009 updated NICE guidance concluded that the difference between SSRIs and placebos is “unlikely to be of clinical importance” (p. 317).

Leucht and colleagues also reported that the correspondence of HAM-D change scores to clinical ratings varied somewhat as a function of baseline severity. For less severely depressed patients, a clinician rating of minimal improvement corresponded to a 6-point HAM-D difference, whereas for very severely depressed patients, it corresponded to an 8-point change.

To date, this method has been used to establish the clinical relevance of pre–post treatment differences. We propose that it can also serve as an empirically validated method of evaluating the clinical significance of drug-placebo differences, since these are also frequently calibrated in terms of differences on the Hamilton scale. Applying this to placebo-controlled antidepressant trials, Leucht et al.'s [] data reveal that the 3-point difference in HAM-D scores proposed by NICE is overly lenient. It results in classifying a difference that cannot be detected by clinicians as clinically important. These data suggest that a difference of 7 points on the HAM-D might be a more reasonable cut-off, as it corresponds to a clinician rating of minimal improvement.

Leucht and colleagues described these data as follows: ‘The results were consistent for all assessment points examined. A CGI-I score of 4 (“no change”) corresponds with a slight reduction on the HAM-D-17 of up to 3 points’ ([], p 245–246). In other words, clinicians could not detect a difference of 3 points on the Hamilton when asked to rate a patient's overall improvement. Examination of the figure reveals that a CGI-I score of 3 (‘minimally improved’) corresponded to changes in Hamilton score of around 7 points after two to four weeks of treatment. To attain a CGI-I score of 2 (‘much improved’) required a change in Hamilton score of 14 points at the four-week assessment.

Leucht et al. [] used the raw data on the antidepressant mirtazapine gathered from 43 trials in more than 7000 people diagnosed with ‘major depressive disorder’. The data were derived from placebo-controlled, comparative and open label trials that had been sponsored by the drug company, Organon. The linking analysis of absolute change in Hamilton scores to CGI-improvement scores at four time points is presented in Fig. 1.

Fig. 1 *Reprinted from J Affect Disord, 148 (2,3), Leucht S, Fennema H, Engel R, Kaspers-Janssen M, Lepping P, Szegedi A. What does the HAMD mean? 243–8, (2013), with permission from Elsevier.

Conventionally, an effect size of 0.50 is considered ‘medium’ and 0.80 is considered ‘large.’ However, Cohen proposed these cut-offs with “invitations not to employ them if possible. The values chosen had no more reliable a basis than my own intuition” [] (p 534). The data considered here suggest that with respect to changes on the HAM-D, effect sizes as large as 1.00 may be required to indicate ‘minimal’ differences as rated by clinicians.

Using an SD of 8.0, the effect size (d) corresponding to a difference score of 7 points (i.e., a clinician rating of minimally improved) is 0.875. For very severely depressed patients, the effect size corresponding to a minimal difference would be 1.00, and for less severely depressed patients it would be 0.75. These are the effect sizes that are required to indicate a ‘minimal’ difference as rated by clinicians. They are more than twice the magnitude of the effect sizes derived from meta-analyses, including those examining separately people with the most severe levels of depression [].
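The arithmetic behind these figures is simply the standardised mean difference, d = raw HAM-D difference divided by the SD of change scores. A minimal sketch in Python, using the SD of 8.0 and the 6-, 7- and 8-point thresholds reported above:

```python
# Cohen's d for a drug-placebo difference in HAM-D change scores:
# d = raw mean difference / SD of change scores.
def cohens_d(mean_difference: float, sd: float) -> float:
    """Standardised mean difference."""
    return mean_difference / sd

SD_CHANGE = 8.0  # pooled SD of HAM-D change scores cited in the text

d_minimal = cohens_d(7, SD_CHANGE)  # clinician-rated 'minimal improvement'
d_severe = cohens_d(8, SD_CHANGE)   # threshold for very severe depression
d_mild = cohens_d(6, SD_CHANGE)     # threshold for less severe depression

print(d_minimal, d_severe, d_mild)  # → 0.875 1.0 0.75
```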

A meta-analysis of 5 placebo-controlled mirtazapine trials yielded change score SDs of 7.7 for mirtazapine and 8.3 for placebo []. Reported in the same paper, a meta-analysis of 5 trials comparing mirtazapine to amitriptyline yielded SDs of 7.9 and 7.8, respectively. A later comparator trial [] reported SDs of 7.5 for mirtazapine and 7.7 for paroxetine. These data reveal substantial consistency in the variance of HAM-D change scores across different trial designs, antidepressants, and placebos.

One problem with the cut-offs proposed by NICE (2004) is that a 3-point difference in HAM-D change scores does not correspond well to the effect size of d = 0.50 that was proposed to indicate clinical significance. The pooled SD of change scores in the Kirsch et al. meta-analysis (N = 5133) was 8.0 (7.9 for the investigational drug and 8.2 for placebo) []. However, that meta-analysis did not include the medication assessed in the Leucht et al. analysis (i.e., mirtazapine). More important, it did not include comparator studies without placebo arms, which were included in the Leucht et al. paper. Thus, it seemed important to assess the reliability of our SD estimate using other data.
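For readers wishing to check such an estimate, the pooled SD of two groups follows the standard formula sqrt(((n1−1)s1² + (n2−1)s2²)/(n1+n2−2)). A sketch in Python; the group sizes below are illustrative assumptions, since the meta-analysis reports only the total N = 5133:

```python
import math

def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)

# SDs of 7.9 (drug) and 8.2 (placebo) are taken from the text;
# the split of N = 5133 into two groups is a hypothetical assumption.
sd = pooled_sd(7.9, 2900, 8.2, 2233)
print(round(sd, 1))  # → 8.0
```

Because the two SDs are so similar, the pooled value is insensitive to exactly how the total N is split between the groups.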

4. Discussion

Over the last few decades antidepressants have become some of the most widely used and profitable drugs in history. Rates of prescriptions have risen throughout the developed world [], leading to debates about the inappropriate medicalisation of misery []. The more fundamental question, however, is whether antidepressants achieve worthwhile effects in depression in general. Guidelines have attempted to consider the issue of the clinical relevance of antidepressant effects, but have not constructed empirically validated criteria.

The commonly used method of estimating the ‘response’ to drug treatment in clinical trials of antidepressants (arbitrarily set at a 50% reduction in symptoms) involves the categorisation of continuous data from symptom scales, and therefore does not provide an independent arbiter of clinical significance. Moreover, this method can exaggerate small differences between interventions such as antidepressants and placebo [], and statisticians note that it can distort data and should be avoided []. Response rates in double-blind antidepressant trials are typically about 50% in the drug groups and 35% in the placebo groups (e.g., []). This 15% difference is often defended as clinically significant on the grounds that 15% of depressed people who get better on antidepressants would not have gotten better on placebo. However, a 50% reduction in symptoms is close to the mean and median of drug improvement rates in placebo-controlled antidepressant trials [] and thus near the apex of the distribution curve. Thus, with an SD of 8 in change scores, a 15% difference in response rates (an odds ratio of 1.86, a relative risk of 0.77, and an NNT of 7) is exactly what one would expect from a mean 3-point difference in HAM-D scores []. Lack of response does not mean that the patient has not improved; it means that the improvement has been less, by as little as one point, than the arbitrary criterion chosen for defining a therapeutic response.
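This expectation can be reproduced under a simple normal model. Assuming, as argued above, that the 50%-reduction ‘response’ criterion sits at the mean of the drug group's change-score distribution, a 3-point mean difference with an SD of 8 yields roughly the observed response rates and an NNT near 7 (a sketch, with all numbers taken as assumptions from the text):

```python
from statistics import NormalDist

SD = 8.0         # SD of HAM-D change scores
MEAN_DIFF = 3.0  # mean drug-placebo difference in HAM-D points

# If the response criterion falls at the drug group's mean, half the
# drug group 'responds'; the placebo rate is the normal tail that lies
# MEAN_DIFF points further out.
drug_response = 0.5
placebo_response = NormalDist().cdf(-MEAN_DIFF / SD)

risk_diff = drug_response - placebo_response
nnt = 1 / risk_diff
odds_ratio = (drug_response / (1 - drug_response)) / (
    placebo_response / (1 - placebo_response))

print(round(placebo_response, 2), round(nnt), round(odds_ratio, 2))
# → 0.35 7 1.83
```

The odds ratio here (≈1.83) is close to, though not identical to, the 1.86 quoted above, which follows from rounding the response rates to exactly 50% and 35% before computing it.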

The small differences detected between antidepressants and placebo may represent drug-induced mental alterations (such as sedation or emotional blunting) or amplified placebo effects rather than specific ‘antidepressant’ effects []. At a minimum, therefore, it is important to ascertain whether differences correspond to clinically detectable and meaningful levels of improvement. The CGI has been criticised for not reflecting the patient's perspective [], and other data, such as functioning and quality of life measures, are also required to fully assess the value of antidepressant treatment. Cuijpers et al. [] have proposed a different method of establishing a ‘minimal important difference’ (MID) based on ‘utility’ measures derived from quality of life scales. However, the study from which the MID was estimated did not include samples of depressed individuals, and the values obtained were found to be unstable. As a result, the authors were only able to provide a “very rough estimate of the cutoff for clinical relevance” (p. 376). Use of a patient-rated version of the CGI might allow for a more reliable and valid complement to the clinician-rated data used here to assess the clinical relevance of HAM-D scores. In its absence, CGI improvement scores provide the first empirically validated method for establishing the clinical relevance of antidepressant effects. Based on the Leucht et al. data [], empirically derived criteria for minimal clinically relevant drug-placebo differences would be a 7-point difference in HAM-D change scores (8 points for very severely depressed patients) and a drug-placebo effect size (d) of 0.875 (1.00 for very severely depressed patients). Currently, drug effects associated with antidepressants fall far short of these criteria.