Human protection policies require favorable risk–benefit judgments prior to launch of clinical trials. For phase I and II trials, evidence for such judgment often stems from preclinical efficacy studies (PCESs). We undertook a systematic investigation of application materials (investigator brochures [IBs]) presented for ethics review for phase I and II trials to assess the content and properties of PCESs contained in them. Using a sample of 109 IBs most recently approved at 3 institutional review boards based at German Medical Faculties between the years 2010–2016, we identified 708 unique PCESs. We then rated all identified PCESs for their reporting on study elements that help to address validity threats, whether they referenced published reports, and the direction of their results. Altogether, the 109 IBs reported on 708 PCESs. Less than 5% of all PCESs described elements essential for reducing validity threats such as randomization, sample size calculation, and blinded outcome assessment. For most PCESs (89%), no reference to a published report was provided. Only 6% of all PCESs reported an outcome demonstrating no effect. For the majority of IBs (82%), all PCESs were described as reporting positive findings. Our results show that most IBs for phase I/II studies did not allow evaluators to systematically appraise the strength of the supporting preclinical findings. The very rare reporting of PCESs that demonstrated no effect raises concerns about potential design or reporting biases. Poor PCES design and reporting thwart risk–benefit evaluation during ethical review of phase I/II studies.

To make a clinical trial ethical, regulatory agencies and institutional review boards have to judge whether the trial-related benefits (the knowledge gain) outweigh the trial-inherent risks. For early-phase human research, these risk–benefit assessments are often based on evidence from preclinical animal studies reported in so-called “investigator brochures.” However, our analysis shows that the vast majority of such investigator brochures lack sufficient information to systematically appraise the strength of the supporting preclinical findings. Furthermore, the very rare reporting of preclinical efficacy studies that demonstrated no effect raises concerns about potential design and/or reporting biases. The poor preclinical study design and reporting thwarts risk–benefit evaluation during ethical review of early human research. Regulators should develop standards for the design and reporting of preclinical efficacy studies in order to support the conduct of ethical clinical trials.

Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: JK serves in a remunerative capacity on the data safety monitoring board of an early-phase trial being pursued by Dimension Therapeutics. JK is part of the Editorial Board of the Meta-Research section of PLOS Biology.

In general, the preclinical safety studies (mainly pharmacokinetics and toxicology experiments) inform judgments about risk in early phase trials. Judgments about clinical promise rely heavily on preclinical efficacy studies (PCESs, often described as “preclinical pharmacodynamic” studies in regulatory documents), which aim at providing a readout of disease response in animal models. The primary objective of this study was to determine the extent, quality, and accessibility of PCESs that are contained within IBs submitted for ethical review of early-phase clinical trials.

Clinical investigators, IRBs, data safety and monitoring boards, and regulatory agencies (e.g., the European Medicines Agency [EMA] and the Food and Drug Administration [FDA]) are all charged with risk–benefit assessment. Their main source of information is the investigator brochure (IB). According to the ICH (International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use) Guideline for Good Clinical Practice E6 [ 15 ], the information in IBs “should be presented in a concise, simple, objective, balanced, and non-promotional form that enables a clinician, or potential investigator, to understand it and make his/her own unbiased risk–benefit assessment of the appropriateness of the proposed trial”.

However, many such analyses reflect preclinical studies that have been submitted for animal care committee review or that are described in publications. Many such studies are not necessarily embedded within drug development programs and may have been pursued after a drug had already shown efficacy in trials. In contrast, little is known about the extent, quality, and accessibility of preclinical evidence used to justify and review the launch of early phase clinical trials. As a result, it is unclear whether preclinical studies submitted to institutional review boards (IRBs) or regulatory agencies are described in ways that enable the respective evaluators to perform a critical assessment about the strength of evidence supporting a new trial. Nor is it clear whether such materials adhere to various standards and guidelines on the design of preclinical studies [ 12 – 14 ].

In early phase trials, assessment of risks and benefits depends heavily on evidence gathered in preclinical animal studies. Over the past 10 years, many commentators have raised concerns about the design and reporting of preclinical reports [ 2 – 9 ]. These concerns have been mainly informed by cross-sectional studies of peer-reviewed publications [ 2 , 10 ] and study protocols [ 11 ] for preclinical studies. These analyses consistently show infrequent reporting of measures aimed at reducing bias, including a priori sample size calculation, blinding of outcome assessment, and randomization. Further analyses suggest that publication bias frequently leads to inflated estimation effect sizes [ 2 ].

With regard to outcomes, the results for 636 PCESs (90%) demonstrated an effect, and the results for 43 PCESs (6%) demonstrated no effect. For 29 PCESs (4%), the direction of results was unclear. The 43 PCESs demonstrating no effect came from 16 IBs (18%). S1 Table gives examples for PCESs demonstrating no effect.

Less than half of all PCESs (44%) stemming from 68 IBs (75%) reported results in quantitative terms that allowed scoring for the results as “demonstrating an effect” or “demonstrating no effect.” Altogether, 30% of all PCESs (n = 211) reported an effect size, and 23% of all PCES (n = 161) reported a p-value. Another 53% of all PCES (n = 372) provided narrative descriptions of results.

A reference to published, peer-reviewed reports of preclinical efficacy was provided for 80 PCESs (11% of all PCESs) stemming from 20 IBs (18% of all IBs). These journal publications provided additional information on sample size (for 91% of PCESs reported in journals versus for 26% of PCESs reported in IBs), baseline characterization (76% versus 18%), control groups (98% versus 46%), randomization (28% versus 4%), and blinding of outcome assessment (5% versus 0%). However, no or even less additional information in journal publications was found, for example, for sample size calculations (0% for PCES in journals and 0% for PCES in IBs) and for age matching of animals and patient group (5% versus 9%). For further information, see Table 2 .

As described in the “Materials and methods” section, some validity issues pertain less to individual PCESs than to a package of several PCESs within 1 IB. These were thus rated at the IB level only. The following percentages refer to the 91 IBs that included at least 1 PCES. At the IB level, 9% (n = 8) reported for at least 1 PCES whether the age of animals matched the patient group proposed in the clinical trial. Most IBs (74%, n = 67) described a preclinical dose response for at least 1 studied outcome. At least 1 PCES described mechanistic evidence of efficacy in 70% of all IBs (n = 64). At least 1 replication of an efficacy experiment was described in 82% of IBs (n = 75). These included 70 replications in different models (77%) and 25 replications in a different species (27%).

Baseline characterization of animals was described for 18% of all PCESs (n = 127 studies). The animal species was reported for 88% of all PCESs (n = 622 studies); see Table 3 for further details on animal species. The animal model used in the experiment was reported in general terms (e.g., “xenograft” or “T-cell tolerance model”) for 97% of all PCESs (n = 684 studies), but its specifications (e.g., transplanted cell line for xenograft or tumor size before treatment) were reported for only 27% of all PCESs (n = 193 studies). Outcome choice was reported for 81% of all PCESs (n = 575 studies).

Table 2 presents the extent to which the 109 IBs described the implementation of practices to reduce validity threats for the 708 PCESs contained in them. A sample size was reported in 26% of PCESs (n = 184), with a median group size of 8 animals. Sample size calculation was never explained. None of the 708 PCESs were described as using blinded treatment allocation and/or outcome assessment. Only 4% of PCESs (n = 26) were described as using randomization, and 5% (n = 38 studies) reported the exclusion of animal data.

Altogether, we obtained 109 IBs for phase I (n = 15), phase I/II (n = 10), and phase II (n = 84) clinical trials that were submitted to 1 of 3 German IRBs (see the “ Materials and methods ” section). The majority of these IBs (n = 97, 83%) reflected the full sample of IBs for phase I/II trials submitted to 1 of the 3 IRBs between 2010 and 2016. The IBs covered 8 out of 12 therapeutic areas as distinguished by the European Medicine Agency ( Table 1 ). Seven studies (6%) were “first in human,” whereas all other IBs (94%) mentioned at least some clinical evidence for the investigational product. All trials were privately funded (1 IRB did not allow recording of the funders of the 6 IBs they shared, so we have this information for only 103 IBs). These included 48 IBs (47%) from the top 25 pharma companies by global sales [ 16 ].

Discussion

Our analysis of 109 IBs for phase I/II trials uncovered 3 striking features of the 708 PCESs the IBs presented to IRBs and regulatory agencies to support early-phase trials. First, 89% of all PCESs present data without a reference to a published, peer-reviewed report. While it is possible that sponsors maintain internal review mechanisms for PCESs, members of IRBs or regulatory agencies reviewing IBs have no way of knowing whether preclinical efficacy data have been subject to critical and independent evaluation. Further, IRBs and regulatory sponsors have no way of directly accessing preclinical reports if they are not published. Such commonplace nonpublication of preclinical evidence is potentially inconsistent with numerous scientific and ethical guidelines on early-phase trial launch [13,17].

The second finding is that much of the information needed for a favorable appraisal of the PCES’s validity is not provided in IBs. On the positive side of the ledger, IBs often contain PCESs that characterize the mechanism of action for new drugs or a dose response. Many also contain more than 1 study testing similar hypotheses, thus establishing some level of reproducibility for efficacy claims. On the negative side, IBs contain very little information that would enable reviewers to evaluate the risk of bias in these individual studies. For example, less than 20% of PCESs reported baseline characterization, exclusion of data from analysis, or randomization. Sample size calculation and blinding for outcome assessment were never reported. Many have previously argued that internal validity is the sina qua non of a valid experimental claim [18]. The dose response, mechanism, and replication studies described in IBs are difficult to interpret without knowing how well they implemented measures to limit bias and the effects of random variation. A potential reporting bias for studies demonstrating the intended mechanism or dose response would further aggravate this difficulty.

This leads us to the third striking finding: the scarcity of PCESs in IBs that do not demonstrate an effect (n = 43, 6%). Several nonexclusive explanations for this imbalance of outcomes in PCESs can be envisioned. One is biased study design. It is possible that in the absence of prespecifying end points or the limited use of techniques like blinded outcome assessment, PCESs consistently show large effects. A second explanation is biased inclusion of PCESs in IBs. With a median group size of 8, the PCESs in our sample had a limited ability to measure treatment effects precisely. This might have resulted in studies that showed unusually large effects. Attrition of animals in small experiments might aggravate the tendency for studies to occasionally produce large effects [19]. A recent investigation from the British Medical Journal (BMJ) supports the notion that animal data are sometimes reported selectively in IBs [20]. A third explanation is that only those treatments that show consistently positive effects in PCES were selected for early-phase trials. Though our study does not allow us to discriminate between these explanations, we think the latter explanation is improbable. Studies demonstrating no effect are crucial for demarcating the boundaries of dosing, diagnostic eligibility, or treatment timing for a new treatment [21]. The evidentiary basis for such boundaries would be important to include in an IB. Indeed, some PCESs we found (see S1 Table) demonstrated such efforts at “demarcation.”

Our study has several limitations. First, because of difficulties accessing IBs [20,22], our analysis did not utilize a random sample. Nevertheless, we believe our findings are likely to be generalizable to other IBs used for early-phase trials. For example, IBs used in our study covered a broad spectrum of different funders and addressed many different therapeutic areas. It is also important to note that the same IBs that funders submit to local IRBs are also submitted to the national regulatory agency (Bundesinstitut fuer Arzneimittel und Medizinprodukte [BfArM]). If the studies in our sample deviate from norms that are used elsewhere, these deviations fall within the window of acceptability for drug regulators.

A second limitation might be seen in the fact that only a small minority of included IBs were “first in human” studies (7 IBs, comprising 30 PCESs). However, many phase I trials that are not first in human involve new disease indications or drug combinations; PCESs are critical for justifying such studies. Moreover, phase II trials represent the first attempt to test a drug’s efficacy in human beings. Their justification ultimately rests on the evidence of clinical promise established in PCESs. In general, if information on preclinical efficacy is considered important to include in an IB, then validity reporting for the respective PCESs should be important as well.

Last, there are limitations to the way we measured the degree to which validity threats were described as being addressed in IBs. For instance, our 14-item matrix did not weight any practices based on their potential impact on bias. Also, that measures aimed at strengthening the validity of PCES findings are reported so infrequently in IBs does not necessarily mean that such measures were not implemented. Nevertheless, our matrix was based on systematic review evidence and thus provided a reasonable starting point for describing how study designs were reported in IBs. Stakeholders tasked with risk–benefit assessment, such as investigators, IRBs, regulatory agencies, and data safety and monitoring boards, have no way of knowing how studies were performed if the relevant study design information is not provided in IBs—this is especially the case if the studies themselves have not been published.

To improve the effective use of preclinical information for risk–benefit assessment in phase I/II trials, we offer the following recommendations. First, IBs should describe measures taken in PCESs to support clinical generalizability. Animal Research: Reporting of In Vivo Experiments (ARRIVE) recommendations, established in 2010, provide one suggestion for how to do so [23]. To facilitate efficient evaluation of IBs, it might be helpful to present relevant practices, like use of randomization or choice of endpoints, in tabular form. Furthermore, explicit remarks on the “level of evidence” for each preclinical study might be presented in such tables, with studies designated as “confirmatory” when they had prespecified hypotheses and protocols or “exploratory” when their hypotheses and protocols were not established prospectively. “Confirmatory” studies should also undertake the expense of employing methods that would enhance internal and construct validity−namely, a priori sample size calculation, concealed allocation, or blinded outcome assessment and use of clinically relevant endpoints [12]. Information on the reproducibility of confirmatory preclinical studies and meta-analysis of sufficiently similar studies might further improve the level of evidence [24]. A more stepwise presentation of preclinical evidence could further help evaluators to navigate through the most important questions to assess clinical promise and safety [9,25]: Have effects been reproduced in different models and/or in independent laboratories? Do the conditions of the experiment (for instance, age of animal models, timing of treatments, and outcomes) match clinical scenarios?

Second, IBs should state whether they are presenting the totality of preclinical evidence, and if not, how data were selected for inclusion in the IB. One option would be to only present preclinical studies that have been preregistered [26,27]. This increased transparency might help with preventing selective outcome reporting, and it allows evaluators to check whether other relevant preclinical studies exist.

Future studies need to evaluate how improved reporting for preclinical data presented in IBs influences risk–benefit analysis during ethical review. However, better reporting alone is unlikely to solve problems related to risk of bias in preclinical evidence. Regulatory bodies like the FDA and the EMA offer specific recommendations for the design of preclinical safety studies [28]. To our knowledge, there are no regulatory guidelines offering standards for the design and reporting of PCESs. As the IBs investigated in this study inform ethical as well as regulatory review, we recommend that regulators develop standards for the design and reporting of PCESs to be included in IBs.