Discussion of new results

Our reanalyses of the trial data based on the published protocol generated some troubling findings. First, scores on the protocol-specified primary outcome measure – improvement in self-reported fatigue and physical function – were numerically higher for the CBT and GET groups than for the Control group. However, these differences did not pass the threshold for statistical significance after correcting for the number of planned comparisons specified in the trial protocol. Using a more lenient correction (assuming only five planned comparisons), the outcomes were only marginally more positive: the comparison between GET and Control just reached this threshold, but the comparison between CBT and Control did not. Of course, our analyses did not incorporate a number of important stratification variables that were unavailable in the FOIA dataset. However, it appears unlikely that their inclusion would substantially alter the result, and our analyses remain the closest approximation to the originally specified analysis that has yet been published. Our findings suggest that, had the investigators adhered to their original primary outcome measure, the results would have appeared much less impressive.
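The threshold arithmetic involved here can be sketched as follows. This is an illustrative sketch only, assuming a simple Bonferroni correction; the function name, the comparison counts and the example p-value are ours, not figures taken from the PACE protocol or dataset.

```python
# Sketch (not the trial's actual analysis): how a Bonferroni-style
# correction for planned comparisons shifts the per-comparison threshold.
# The counts (5 vs. 10) and p-value below are illustrative assumptions.

def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons

alpha = 0.05
lenient = bonferroni_threshold(alpha, 5)    # 0.05 / 5  = 0.01 (lenient correction)
strict = bonferroni_threshold(alpha, 10)    # 0.05 / 10 = 0.005 (stricter correction)

# A hypothetical p-value can pass the lenient threshold yet fail the strict one.
p_observed = 0.009  # invented for illustration
print(p_observed < lenient, p_observed < strict)  # → True False
```

The general point is that the more planned comparisons a protocol specifies, the stricter the per-comparison threshold becomes, so a result that looks "significant" uncorrected may not survive correction.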

Improvement rates for self-rated fatigue and physical function considered individually did yield some statistically significant findings, which suggests that the interventions were somewhat specific in the way they altered patients’ illness perceptions. Self-rated physical function scores showed greater improvement in the GET group than in the Control group – but not self-rated fatigue scores – which suggests GET had a modest effect on patients’ perceptions of their physical function, but did little to alter symptom perceptions. Conversely, self-rated fatigue showed greater improvement in the CBT group than in Controls – but not physical function – which suggests CBT elicited modest reductions in symptom-focusing, but did little to improve patients’ confidence in their physical capacities.

Second, when recovery rates were calculated using the definition specified in the published protocol, these were extremely low across the board, and not significantly greater in the CBT or GET groups than in the Control group. Neither an intent-to-treat nor an available case analysis yielded a significant benefit for these therapies over conventional medical care. Again, we were unable to incorporate a number of stratification variables into this analysis, but it is unlikely that the result would be different had we done so.

With respect to long-term outcomes, the investigators’ original analysis did not reveal any significant effects of treatment allocation on self-reported fatigue and physical function at long-term follow-up [7]. They suggested this null effect may have been due to the confounding effects of post-trial therapy. Our informal re-examination of the long-term follow-up results provides no support for this suggestion. We found that even when patients who received post-trial CBT or GET are excluded, there is still no evidence of any long-term treatment-related benefits – not even a trend in the hypothesised direction. Of course, our analyses were informal. Ideally, we would have replicated the analysis reported in [7] for this patient subset, including all the covariates listed in that analysis ([7], p. 1068), such as fatigue and physical function scores at 52 weeks, time of follow-up, trial centre and disease caseness. This was not possible with the data available. However, until better evidence becomes available, there is no reason to believe that post-trial therapy can offer a viable explanation for the absence of treatment effects at long-term follow-up.

One major problem for the PACE trial is that it was originally designed around a highly optimistic view of the therapeutic benefits of CBT and GET. Drawing on results from previous, smaller trials, the PACE investigators estimated that CBT would be likely to yield an improvement rate some six times greater than medical care alone, and GET would yield a rate five times greater [20]. These expectations formed the basis of the power calculations for the trial. Unfortunately, the improvement rates for CBT and GET participants – when compared with Control participants – fell markedly short of those expectations. So it is perhaps not surprising that an analysis of the binary improvement data alone was insufficient to detect any statistically reliable effects. In this context, it would have been perfectly acceptable first to report the protocol-specified primary outcome analysis, and then to explore the data using methods that are more sensitive to smaller effects – for example, analysis of the individual, continuous outcome measures. However, the researchers instead chose to omit the former analysis altogether and report only the latter. They then reported improvement rates based on an entirely new, and much more generous, definition of improvement. In sum, the analyses that were the least complimentary to CBT and GET never appeared in the published reports; the analyses that showed these interventions in a more favourable light were the only ones to be published.

As we have already pointed out, the timing of the change to the primary outcome – several months after trial completion – was highly problematic. There was also insufficient independent justification for making the change. For reasons that are never made clear, the investigators had suddenly taken the view that “…a composite measure would be hard to interpret, and would not allow us to answer properly our primary questions of efficacy (i.e. comparing treatment effectiveness at reducing fatigue and disability).” ([13], p. 25). Certainly, the separate analysis of the two continuous measures provides useful additional information, but this does not justify abandoning the originally planned outcome. Further, the protocol already included measures of specific improvement rates in self-rated fatigue and physical function, and it is not clear why these were abandoned in favour of the new measure.

Turning now to the recovery rates, the late changes to the definition of recovery made it much easier for a patient to qualify as recovered. These changes were quite substantial. For example, the minimum physical function score required to qualify as recovered was reduced from 85 to 60, which is close to the mean score for patients with Class II congestive heart failure (57/100 [21]), and lower than the score required for trial entry (65/100). Also, on the fatigue criterion, a patient could now count as “recovered” despite reporting continuing fatigue on as many as seven out of the 11 fatigue questionnaire items, a level that substantially overlaps with that required for trial entry. Again, these changes operated to favour the study hypotheses. They enabled the researchers to make the claim that CBT and GET were significantly more likely to lead to recovery than conventional medical care (the original recovery definition would have yielded a null result), and to declare that at least “a fifth” of participants recover with CBT and GET [4, 22]. Neither claim could have been made if the original definition of recovery had been used.
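The overlap between the revised recovery threshold and the trial-entry criterion can be made concrete with a minimal sketch. The threshold values (85, 60 and 65 on the 0–100 SF-36 physical function scale) come from the text above; the helper function and the example score are hypothetical.

```python
# Illustrative sketch of the SF-36 physical function thresholds discussed
# in the text (0-100 scale, higher = better function). The constant values
# are from the paper; the function and example score are ours.

PROTOCOL_RECOVERY_MIN = 85   # original protocol-specified recovery threshold
REVISED_RECOVERY_MIN = 60    # post-hoc revised recovery threshold
TRIAL_ENTRY_MAX = 65         # a score this low (or lower) was required to enter the trial

def recovered(score: int, threshold: int) -> bool:
    """Does a score meet a given recovery threshold on this criterion?"""
    return score >= threshold

score = 62  # hypothetical participant
print(recovered(score, PROTOCOL_RECOVERY_MIN))  # → False (not recovered under protocol definition)
print(recovered(score, REVISED_RECOVERY_MIN))   # → True  (recovered under revised definition)
print(score <= TRIAL_ENTRY_MAX)                 # → True  (yet still disabled enough for trial entry)
```

The sketch shows the paradox described above: a score between 60 and 65 counts as “recovered” under the revised definition while simultaneously satisfying the disability criterion for entering the trial.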

Again, the timing of the change to the recovery definition – over a year after the trial was completed – is highly problematic. Also, an adequate justification for the change is yet to be provided. In their 2013 publication on recovery rates, the researchers argued that the normal ranges for some key scores were wider than previously thought, which would justify classing more participants as “recovered” on these measures [4]. However, we have recently shown that when the chronically ill and the very old were excluded from the relevant reference samples, and when the correct statistics were applied to determine appropriate cut-off values, the normal ranges are, if anything, narrower than previously believed [17]. Consequently, this argument does not stand up to scrutiny (see [17] for further details).

Several other arguments have been presented in defence of these changes [23, 24]. One was that since there is no agreed definition of recovery, the new modified one is just as good as the original (the original definition “simply makes different assumptions” [24], p. 289). This argument fails to explain why the definition was changed in the first place. If both definitions are indeed equally good, then the one to be preferred is surely the one that was specified in advance, before any of the results were known. Another argument was that the recovery rates obtained with the modified definition were numerically similar to those found in some previous trials of CBT for CFS [23]. However, these other trials used entirely different definitions of recovery, so are not relevant here. One final argument was that the original definition of recovery was simply “too stringent to capture clinically meaningful recovery” [23]. However, the only supporting evidence for this statement comes from the disappointing recovery rates in the PACE trial itself; no independent justification is offered. Clearly, a strong concept like recovery must be operationalised carefully. Physicians and lay people understand this term to mean a return to good health [25], and any definition must preserve this core meaning. If anything, the original protocol-specified definition was rather generous, and may have identified some individuals who had not recovered in the plain English sense of the word. For example, on the primary physical function measure (the SF-36), it was possible to score in the bottom decile for working-age individuals with no long-term illness or disability, and still count as recovered on that criterion [17]. The definition also did not require evidence of an ability to return to work or other premorbid activities, even though these are very important components of what recovery means to patients. There was certainly no justification for further loosening that definition.
In sum, none of the trial investigators’ arguments adequately justified the late changes to the recovery definition. More detailed discussion of these issues can be found elsewhere [26].

Turning now to long-term follow-up, the original publication of the long-term follow-up data reported no significant differences amongst treatment groups at this time point [7]. However, the authors dismissed their own finding, arguing that many participants received additional post-trial therapy which might have operated to obscure group differences. Instead, they based their main conclusion on comparisons between time points. For example, the first line of the Discussion reads: “The main finding of this long-term follow-up study of the PACE trial participants is that the beneficial effects of the rehabilitative CBT and GET therapies on fatigue and physical functioning observed at the final 1 year outcome of the trial were maintained at long-term follow-up 2·5 years from randomisation.” ([7] p. 1072, Italics added). This conclusion is repeated in the Abstract. The decision to lead with this conclusion again operated to show the findings in a more positive light than would have been possible based on their own primary between-groups analysis. The informal analyses we presented here provide no support for the investigators’ claim that post-trial therapy contaminated the long-term outcome data. Of course, our analyses did not include important potentially confounding variables that might differ amongst trial arms, and such a comprehensive analysis might possibly produce a different result. However, until there is positive evidence to suggest that this is the case, the conclusion we must draw is that PACE’s treatment effects are not sustained over the long term, not even on self-report measures. CBT and GET have no long-term benefits at all. Patients do just as well with some good basic medical care.

Overall evaluation of the trial

Some notable strengths of the PACE study included the large sample size (determined a priori using power analysis [1]), the random allocation of patients to treatment arms, the use of a well-formulated protocol to minimise drop-outs, and the reporting of the full CONSORT trial profile (including detailed information about missing data). The incorporation of an active comparison group – Adaptive Pacing Therapy – also provided a useful secondary control for factors such as overall therapy time and patient-therapist alliance. It is worth pointing out that results for this group were not significantly different from those for the Control group on any of the measures considered in this paper. Other strengths were that each therapy group received a substantial dose of therapy, and standardised manuals ensured comparability of treatments across centres and therapists. Finally, a wide range of outcomes was measured, including several objective measures, as well as various adverse events measures.

However, despite these strengths, the design, analysis and reporting of the results introduced some significant biases. We have already discussed some of the biases introduced at the analysis and reporting stage. Several key results that showed CBT and GET in a less than favourable light were omitted and replaced with new ones that appeared more favourable to the treatments. These changes were made at a late stage in the trial, and we have argued here that none had sufficient independent justification. In reality, the effects of CBT and GET were very modest – and not statistically reliable overall if we apply procedures very close to those specified in the original published protocol.

Another source of bias arose from the trial’s heavy reliance on self-reports from participants who were aware of their treatment allocation. Clearly, in a behavioural intervention trial, full blinding is not possible. Nevertheless, it is the researchers’ responsibility to consider the possible effects of lack of blinding on outcomes, and to ensure such factors are insufficient to account for any apparent benefits. In a trial that is not blinded, self-reported outcomes in particular can produce highly inflated estimates of treatment-related benefits [27, 28]. A recent meta-analysis of clinical trials for a range of disorders found that when patients were not blinded to their treatment allocation, their self-reported improvement on the treatment of interest was inflated by 0.56 standard deviations, on average, when compared to a corresponding blinded phase of the same trial [29]. In contrast, observer-rated measures of improvement were not significantly affected by blinding. Given this discrepancy in the effects of blinding on subjective and objective measures, it appears unlikely that these effects reflect genuine health benefits. A more plausible explanation is that they are expectation-related artefacts – for example, they reflect the operation of attentional biases that favour the reporting of events consistent with one’s expectations [30], or recall/confirmation biases that enhance recollection for expectation-consistent events [31].
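To give a sense of scale, a standardised inflation of 0.56 SD can be converted back into raw questionnaire points. A minimal sketch, assuming a hypothetical scale SD; neither the function name nor the SD value is taken from PACE or from [29].

```python
# Back-of-envelope sketch: expressing a standardised effect (in SD units)
# as raw scale points. The SD value is an invented placeholder.

def effect_in_points(d: float, scale_sd: float) -> float:
    """Convert a standardised effect size into raw points on a given scale."""
    return d * scale_sd

# e.g. for a questionnaire whose scores have SD = 5 points (assumed):
print(effect_in_points(0.56, 5.0))  # ≈ 2.8 points of apparent "improvement"
```

In other words, on such a scale, lack of blinding alone could account for nearly three points of self-reported improvement with no genuine change in health.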

The PACE investigators have argued that expectancy effects alone cannot account for the positive self-reported improvements, because at the start of treatment, patients’ expectations of improvement were not greater in the CBT and GET groups than in the other groups [2, 23]. However, they fail to point out that CBT and GET participants were primed during treatment to expect improvement. The manual given to CBT participants at the start of treatment proclaimed CBT to be “a powerful and safe treatment which has been shown to be effective in... CFS/ME” ([32], p. 123). The GET participants’ manual described GET as “one of the most effective therapy strategies currently known” ([33], p. 28). Both interventions emphasised that faithful adherence to the programme could lead to a full recovery. Such messages — from an authoritative source — are likely to have substantially raised patients’ expectations of improvement. Importantly, no such statements were given to the other treatment groups. When we add to this the fact that the CBT programme, and to a lesser extent GET, was designed to reduce “symptom focusing”, which may have further influenced self-report behaviour in the absence of genuine improvement [27, 34], these findings start to look very worrying indeed.

A further cause for concern in the PACE trial was that the two primary self-report measures appear to behave in different ways depending upon the intervention. Our analysis based on the protocol-specified outcomes indicated that GET produced modest enhancements in patients’ perceived physical function, but had little effect on symptom perception. Conversely, CBT improved symptom perception – specifically, self-rated fatigue scores – but had little effect on perceived physical function. If these interventions were operating to create a genuine underlying change in illness status, we would expect change on one measure to be accompanied by change on the other.

Given the high risk of participant response bias in this study, it was therefore crucial to demonstrate accompanying improvement on more objective measures. However, only one such measure showed a treatment effect. On the six-minute walking test, the originally reported available case analysis found that GET participants walked reliably farther than Control participants at the primary, 52-week endpoint. However, after an entire year, this group walked an average of just 67 m farther than baseline, and around 30 m farther than Controls. To put this in context, a sample of Class II chronic heart failure patients with similar baseline walking distances increased their distance by an average of 141 m after only three weeks of a gentle graded exercise programme [35].

No other objective measures yielded significant treatment effects. Most notably, treatment did not affect aerobic fitness, measured using a step test. If GET had genuinely improved participants’ physical function and levels of activity, these improvements should have been clearly evident on fitness measures taken a full year after trial commencement. Treatment also did not affect time lost from work [3]. There was ample opportunity for improvement here: during the six months preceding the trial, 83% of participants were either in work or would have worked if able (based on the number reporting lost work days). This suggests they could have immediately increased their hours if their health had permitted. Finally, the percentage of participants receiving government benefits or income protection actually increased over the treatment period for all groups [3]. It is concerning that these negative findings were not published until years after the primary results had been reported, meaning these inconsistencies were not immediately apparent to readers. For example, the crucial fitness results were not published until four years after the primary outcomes. The investigators dismissed most of these measures as unimportant or unreliable; they did not consider them valuable as a means of estimating the degree of bias inherent in their self-report outcomes.

The absence of evidence for treatment-related recovery is an additional, serious concern for the trial. CBT and GET were not seen as adjunct treatments that might relieve a little distress. Rather, they were seen as capable of reversing the very behaviours and cognitions responsible for CFS. The behavioural-deconditioning model, on which the treatments were based, assumes that there is no underlying disease process in CFS, and that patients’ concerns about exercise are merely “fearful cognitions” that need addressing ([36], p. 47–8). Participants in some trial arms were even told that “there is nothing to stop your body from gaining strength and fitness” ([32], p. 31). If this model of CFS were correct, and if the treatments were operating as hypothesised, then participants who duly followed the programme should have returned to the levels of health and physical function that they enjoyed prior to illness onset. Therefore, the rates of recovery in the CBT and GET groups should have been significantly and reliably higher than in the Control group, irrespective of the method used to define recovery. This was not the case.

The failure of CBT and GET to “reverse” CFS is perhaps not so surprising when we consider recent exercise physiology studies. CFS patients have shown various physical abnormalities when tested 24 h after exertion (reduced VO₂max and/or anaerobic thresholds; for a review, see [37]). These abnormalities are not seen in sedentary, healthy adults or even in patients with cardiovascular disease, and therefore cannot be attributed to deconditioning alone. Such findings call into question the core assumption of the behavioural/deconditioning model that there is no ongoing disease process. If there is a rational basis for patients’ concerns over exercise, encouraging them to push through symptoms may be harmful, and recasting patients’ concerns as dysfunctional may cause additional psychological harm.

Turning now to safety issues, there were few group differences in the incidence of adverse events, and the researchers concluded that both CBT and GET were safe for people with CFS. This finding – particularly that relating to GET – contrasts markedly with findings from informal surveys conducted by patient organisations [38, 39]. In these surveys, between 33% and 79% of respondents reported worsened health as a result of having participated in some form of graded exercise programme (weighted average across 11 different surveys: 54% [39]). Of course, in such surveys, participant self-selection may operate to enhance the reporting rates for adverse outcomes. However, this finding is so consistent, and the number of participants surveyed is so large (upwards of 10,000 cases), that it cannot be entirely dismissed. One likely reason for the discrepancy between PACE’s findings and those of patient surveys is the conservative approach used in PACE’s GET programme. Patients were encouraged to increase activity only if it provoked no more than mild symptoms [40]. Unfortunately, compliance with the activity recommendations was not directly assessed: actigraphy data were collected only at trial commencement [1] and never reported. This is a significant omission, since there is evidence that graded exercise therapies are not always successful in actually increasing CFS patients’ activity levels [41]. Even those who comply with exercise goals may reduce other activities to compensate [42]. The lack of improvement in fitness levels in PACE’s GET group does suggest that participants may not have substantially increased their activity levels, even over the course of an entire year. Also, even though the majority of GET participants chose walking as their primary activity [2], this group demonstrated an average increase in walking speed of only 10% after an entire year (increases of 50% or more have been observed in other patient populations [35]).
Given these features, it is inappropriate to generalise the safety findings from PACE to graded activity programmes more widely, especially as they are currently implemented in clinical settings.
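The weighted average quoted above from [39] is a simple respondent-weighted mean across surveys, which can be sketched as follows. The per-survey figures below are invented for illustration; the real counts for the 11 surveys are in [39] and are not reproduced here.

```python
# Sketch of a respondent-weighted average across patient surveys.
# Each tuple: (number of respondents, fraction reporting worsened health).
# All figures are invented placeholders, NOT the data from [39].

surveys = [(1200, 0.33), (450, 0.79), (2000, 0.54)]

total_n = sum(n for n, _ in surveys)
weighted_rate = sum(n * rate for n, rate in surveys) / total_n
print(f"{weighted_rate:.0%}")  # → 50% (with these invented figures)
```

Weighting by respondent count, rather than averaging the percentages directly, prevents a small survey with an extreme result from dominating the pooled estimate.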