The pre-registered primary outcome measure was whether participants met the specified threshold for improvement in self-reported fatigue and physical function. Several years after trial pre-registration, the investigators decided this measure was “hard to interpret” ([6], p. 25). They replaced it with the continuous scores generated by the two original self-report scales, and they also modified the scoring method for the fatigue scale [5]. In addition, they substantially loosened the definition of recovery used in secondary analyses, making it much easier for patients to qualify as recovered [2]. These changes are clearly substantial. Further, as we showed in our paper, all of them resulted in more successful outcomes than would have been obtained using the pre-registered measures [4].

Sharpe et al. argue that the pre-specified outcome measures are “no more valid” than the modified ones ([5], p. 4). This argument is puzzling. The purpose of pre-registration is to prevent researchers from altering their outcome measures in ways that favour their hypotheses, after they have begun to observe the trial’s progress. Therefore, all other things being equal, measures that are stipulated ahead of time will always trump those formulated after the fact. Sharpe et al. offer the justification that changing the scoring method for the fatigue scale made it “more accurate and sensitive to change” ([5], p. 1). However, they provide no evidence to support this claim.

Pre-registration is the cornerstone of a good clinical trial, which is why it is vital to get good statistical advice before the trial begins, especially on matters such as the sensitivity, validity and interpretability of the primary outcome measures. Of course, it is perfectly acceptable to report additional, exploratory analyses that come to mind at a later date, but these should not replace the originally specified measures.

An additional reason to prefer the pre-registered primary outcomes is that they formed the basis of the power analyses conducted to determine sample size. Given that the trial was estimated to be sufficiently well-powered to detect effects on a binary outcome measure, the failure to observe such effects reliably is of central interest, and should have been highlighted in the trial publications.
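
To make the link between a binary primary outcome and sample size concrete, the following is a minimal sketch of a textbook sample-size calculation for comparing two proportions, using the normal approximation. It is not the calculation performed for the trial: the function name `n_per_arm`, the improvement rates passed to it, and the default significance level and power are all hypothetical values chosen purely for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.90):
    """Approximate participants needed per arm to detect a difference
    between two proportions with a two-sided test.

    Hypothetical helper for illustration only; the inputs are assumed
    proportions of participants meeting a binary improvement threshold.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_control + p_treatment) / 2          # average proportion under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control)
                        + p_treatment * (1 - p_treatment))
    ) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


# Hypothetical improvement rates, not figures from the trial.
print(n_per_arm(0.30, 0.50))
```

The point of the sketch is that the required sample size depends directly on the binary outcome and the difference that the investigators expect to detect. Replacing that outcome after the fact leaves the original power calculation tied to a measure that is no longer reported as primary.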

With regard to the recovery measure, we addressed all of Sharpe et al.’s justifications for altering it in our original paper, and see no need to repeat those arguments here (see [4], p. 8; see also [7, 8]). To summarise, Sharpe et al. “prefer” their modified definition because it generates similar rates of recovery to previous studies, and is also more consistent with “our clinical experience” ([5], p. 6). Clearly, it is not appropriate to loosen the definition of recovery simply because things did not go as expected based on previous studies. Researchers need to be open to the possibility that their results may not align with previous findings, or with their own preconceptions. That is the whole point of a trial. Otherwise, the enterprise ceases to be genuinely informative, and becomes an exercise in belief confirmation.