Low quality of the published literature

The VPA model of autism is relatively new and potential therapeutic compounds tested in this model have not yet advanced to human trials. The opportunity therefore exists to clean up the literature and prevent a repeat of the SOD1 story. The main finding is that only 9% (3/34) of studies correctly identified the experimental unit and thus made valid inferences from the data. One study used a nested design [44], the second mentioned that litter was the experimental unit [45], and the third used one animal from each litter, thus bypassing the issue [46]. In fourteen studies (41%) it was not possible to determine the number of dams that were used (i.e. the sample size), and in four studies (12%) the number of offspring used was not indicated. In addition, only four (12%) reported randomly assigning pregnant females to the VPA or control group. Many studies also used only a subset of the offspring from each litter, but often it was not mentioned how the offspring were selected. Only six studies (18%) reported that the investigator was blind to the experimental condition when collecting the data. Ten studies (29%) did not indicate whether both male and female offspring were used. No study mentioned performing a power analysis to determine a suitable sample size to detect effects of a given magnitude, but this is perhaps just as well, given that only three studies correctly identified the experimental unit: a power analysis based on the wrong experimental unit would itself be misleading. It is possible that many studies did use randomisation and assess outcomes blindly, but simply did not report it. However, randomisation and blinding are crucial for the validity of the results, and their omission from manuscripts suggests that they were not used. This is further supported by studies showing that when manuscripts do not mention using randomisation or blinding, the estimated effect sizes are larger than in studies that do mention using these methods, which is suggestive of bias [19–23, 29].

A number of papers had additional statistical or experimental design issues, ranging from trivial (e.g. reporting total degrees of freedom rather than residual degrees of freedom for an F-statistic) to serious. These include treating individual neurons as the experimental unit, which is common in electrophysiological studies, but is just as inappropriate as treating blood pressure values taken from the left and right arms as n = 2, or dissecting a single liver sample into ten pieces and treating the expression of a gene measured in each piece as n = 10 [12]. If it were that easy, clinical trials could be conducted with tens of patients rather than hundreds or thousands. Regulatory authorities are not fooled by such stratagems, but it seems many journal editors and peer-reviewers are. A list of studies can be found in Additional file 3.

Estimating the magnitude of litter effects

To illustrate the extent to which litter effects can influence the results, data originally published by Mehta et al. [41] were used; experimental details can be found therein. Locomotor activity in the open field is shown in Figure 2 for nine VPA and five saline-injected control litters. Half of the animals from each condition were given MPEP (an mGluR5 receptor antagonist) and the other half saline. Visually, there do not appear to be differences between the VPA and control groups, and there is a slight increase in activity due to MPEP. The effect of MPEP was not significant when litter effects were ignored (Figure 2A; p = 0.082), but it was when adjusting for litter (Figure 2B; p = 0.011). In this case the shift in p-value was not large, but it fell below the 0.05 threshold once the excess noise caused by litter-to-litter variation was removed.

Figure 2 Analysis with and without litter taken into account. Nine pregnant female C57BL/6 mice were injected with 600 mg/kg VPA subcutaneously on embryonic day 13, and five control females received vehicle injections. Half of the animals in each condition were also injected postnatally with either an mGluR5 receptor antagonist (MPEP) or saline. Total locomotor activity in the open field over a 30 min period at 8–9 weeks of age is shown. There was a slight increase in activity due to MPEP, but it was not significant when differences between litters were ignored (A; two-way ANOVA, mean difference = 0.60, F(1,44) = 3.17, p = 0.082). Adjusting for litter removed unexplained variation in the data, allowing the small difference between groups to become statistically significant (B; mixed-effects model, mean difference = 0.64, F(1,32) = 7.19, p = 0.011). Note how the values in the second graph have less variability around the group means; this increased precision leads to greater power of the statistical tests. Lines go through the mean of each group and points are jittered in the x direction.
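As an illustration of how the two analyses in Figure 2 might be run, the sketch below uses Python with pandas and statsmodels. The file name locomotor.csv and the column names (activity, group, drug, litter) are hypothetical placeholders, and this is not necessarily the pipeline used for the original analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per animal, with locomotor activity, prenatal
# group (VPA/saline), postnatal drug (MPEP/saline), and litter identity.
df = pd.read_csv("locomotor.csv")  # columns: activity, group, drug, litter

# Analysis ignoring litter (as in Figure 2A): an ordinary two-way ANOVA
# that treats every animal as an independent observation.
fixed = smf.ols("activity ~ group * drug", data=df).fit()
print(fixed.summary())

# Analysis adjusting for litter (as in Figure 2B): a mixed-effects model
# with a random intercept for each litter, so animals within a litter are
# not treated as independent replicates.
mixed = smf.mixedlm("activity ~ group * drug", data=df, groups=df["litter"]).fit()
print(mixed.summary())
```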

It may be difficult to determine whether litter effects are present simply by plotting the data by litter, because they may be obscured by the experimental effects. For a visual check, it is preferable to remove the effect of the experimental factors first and then plot the residual values against litter. The y-axis of Figure 3 shows the residuals, defined as the difference between the observed locomotor activity for each animal and the value predicted from the model containing group (VPA/saline) and condition (MPEP/saline) as factors (from Figure 2A). The residuals should be pure noise, centred at zero, and should not be associated with any other variable. However, it is clear that there are large differences between litters (Figure 3A), indicating heterogeneity in the response from one litter to the next. When litter effects are taken into account, the mean of each litter is closer to zero. Also note that the variance of the residuals (σε²) is reduced by 61% when litter is taken into account (p < 0.001). This is shown by the spread of the grey points around zero on the right side of each graph, which are clustered closer together in the second analysis. The interpretation is that litter accounted for 61% of the previously unexplained variation in the data. Note that it would be impossible to determine whether litter effects are present if only one litter per treatment group were used, because litter and treatment would be completely confounded.

Figure 3 Visualising litter-to-litter variation. The residuals represent the unexplained variation in the data after the effects of VPA and MPEP have been taken into account; they should be pure noise and therefore not associated with any other variable. However, the standard analysis (A) shows that when residuals are plotted against litter (x-axis) there are large differences between litters. In other words, there is another factor affecting the outcome besides the experimental factors of interest. The variance of the residuals (grey points on the right) is high (σε² = 1.29). The proper analysis (B) reduces the unexplained variation in the data by 61% (σε² = 0.50; p < 0.001), which can be seen by the narrower spread of the grey points around zero, and the large differences between the litters have been removed. This reduction in noise allows smaller true signals to be detected. Error bars are SEM. Litters F and L have only one observation and thus no error bars.
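Continuing the sketch above (reusing the df, fixed, and mixed objects), the residual diagnostic in Figure 3 could be reproduced along the following lines; the variance comparison uses the estimated residual variances from the two models, and the specific numbers will of course depend on the data.

```python
import matplotlib.pyplot as plt

# Residuals from the model that ignores litter: observed activity minus the
# value predicted from group and drug alone. If litter effects are absent,
# every litter should scatter around zero with no systematic offset.
df["resid"] = fixed.resid
df.boxplot(column="resid", by="litter")
plt.axhline(0, linestyle="--", color="grey")
plt.ylabel("Residual locomotor activity")
plt.show()

# Proportion of previously unexplained variation attributable to litter:
# compare the residual variance before and after adjusting for litter.
var_ignored = fixed.mse_resid   # residual variance, litter ignored
var_adjusted = mixed.scale      # residual variance, litter as random effect
print(f"Reduction in residual variance: {1 - var_adjusted / var_ignored:.0%}")
```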

A similar analysis was performed for other variables and the results are displayed in Table 1. It is clear that litter-to-litter variation is important for a number of behavioural outcomes. It is also clear from Figure 3A how one could obtain false positives with an inappropriate design and analysis. Suppose an experiment were conducted with only one VPA and one saline litter, with ten animals from each, and suppose that there is no overall effect of VPA on a particular outcome. If the experimenter happened to select Litter A (saline) and Litter M (VPA), there would be a significant increase due to VPA, but if Litter D (saline) and Litter G (VPA) were selected, there would be a significant effect in the opposite direction! There are many combinations of a single saline and a single VPA litter that would lead to a significant difference between conditions. Having two or three litters per group instead of one will reduce the false positive rate, but it will still be much higher than 0.05 [4]. Moreover, these apparent differences would not replicate in a properly designed follow-up experiment.
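A small simulation makes the false-positive problem concrete. The sketch below assumes illustrative variance components (between-litter SD = 1.0, within-litter SD = 0.7) and no true treatment effect; with one litter per group and the ten animals per litter naively treated as n = 10, the t-test rejects far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_litter = 10_000, 10
litter_sd, animal_sd = 1.0, 0.7   # illustrative variance components
false_pos = 0

for _ in range(n_sims):
    # No true treatment effect: both litter means come from the same population.
    litter_means = rng.normal(0.0, litter_sd, size=2)
    control = rng.normal(litter_means[0], animal_sd, size=n_per_litter)
    treated = rng.normal(litter_means[1], animal_sd, size=n_per_litter)
    # Naive analysis: treat the ten animals in each litter as independent.
    if stats.ttest_ind(control, treated).pvalue < 0.05:
        false_pos += 1

print(f"False positive rate: {false_pos / n_sims:.2f}")  # far above 0.05
```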

Table 1 Importance of litter effects on body weight and behavioural tests

How power is affected by the number of litters and animals

Figure 4 shows the power for various combinations of the number of litters and the number of animals per litter. This analysis is based on averaging the values for the animals within a litter and then comparing the groups with a t-test. It is clear that increasing the number of animals per litter has little effect on power (the lines in Figure 4A are nearly flat after two animals per litter), whereas increasing the number of litters results in a large increase in power. The results for the mixed-effects model are nearly identical, while the inappropriate analysis that ignores litter shows increasing power with increasing numbers of animals per litter (Additional file 4). This is false power, however: it is due to an artificially inflated sample size (pseudoreplication) and will lead to many false positive results.

Figure 4 Power calculations for VPA experiments. Panel A shows how power changes as the number of animals per litter increases from one to eight (x-axis) and the number of litters per group increases from three to ten (different lines). It is clear that increasing the number of animals per litter has only a modest effect on power, with little improvement after two animals. A two-group study with three litters per group and eight animals per litter (2 × 3 × 8 = 48 animals) will have only a 30% chance of detecting the effect, whereas a study with ten litters per group and one animal per litter (2 × 10 × 1 = 20 animals) will have almost 80% power while using far fewer animals. Panel B shows the same data presented differently: power for different combinations of litters and animals per litter is indicated by colour (red = low power, white = high), with reference lines for 70%, 80%, and 90% power. Note that these specific power values are only relevant for the locomotor activity task with a fixed effect size and would have to be recalculated for other outcomes. However, the general result (increasing the number of litters is better than increasing the number of animals per litter) applies to all outcomes.
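Power curves of this kind can be approximated by simulation. The sketch below uses hypothetical values for the effect size and variance components, chosen so that the two example designs roughly mirror the 30% vs. 80% comparison in the caption; the paper's actual values come from the locomotor data and would need to be re-estimated for other outcomes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def power(n_litters, n_animals, effect=1.5, litter_sd=1.0, animal_sd=0.5,
          n_sims=2000, alpha=0.05):
    """Simulated power for a two-group design: average the animals within
    each litter, then compare the litter averages with a t-test."""
    hits = 0
    for _ in range(n_sims):
        ctrl_means = rng.normal(0.0, litter_sd, n_litters)
        trt_means = rng.normal(effect, litter_sd, n_litters)
        ctrl = rng.normal(ctrl_means[:, None], animal_sd,
                          (n_litters, n_animals)).mean(axis=1)
        trt = rng.normal(trt_means[:, None], animal_sd,
                         (n_litters, n_animals)).mean(axis=1)
        hits += stats.ttest_ind(ctrl, trt).pvalue < alpha
    return hits / n_sims

# Many animals from few litters vs. one animal from many litters
print(power(n_litters=3, n_animals=8))   # 48 animals, low power
print(power(n_litters=10, n_animals=1))  # 20 animals, much higher power
```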

Some may object on ethical grounds to using so many litters and then selecting only one or a few animals from each, as there will be many additional animals that will not be used and will presumably be culled. Certainly all of the animals could be used, but there is almost no increase in power after three animals per litter (at least for the locomotor data), and it is therefore a poor use of time and resources to include all of the animals. One could argue that it is unethical to subject a greater number of animals to the experimental procedure if they contribute little or nothing to the result. One could also argue that it is even more unethical to use any animals for a severely underpowered (or flawed) study in the first place and then to clutter the scientific literature with the results. One way to deal with the excess animals is to use them for other experiments. This requires greater planning, organisation, and coordination, but it is possible. Another option is to purchase animals from a commercial supplier, requesting that the animals come from different litters, rather than maintaining an in-house colony. As a side note, suppliers do not routinely provide information on which litters the animals come from, and thus an important variable is neither under the experimenter's control nor available to check for its influence on the results.

How does litter-to-litter variation arise?

Differences between litters can arise for a variety of reasons, including shared genes, shared prenatal and early postnatal environments, age differences (it is difficult to control the time of mating), and because litters are convenient units to work with. For example, it is not unusual for litter-mates to be housed in the same cage, which means that animals within a litter share not just their early but also their adult environment. It is also often administratively easier to apply experimental treatments on a per-cage (and thus per-litter) basis rather than per animal; for example, animals in cages A and C are treated while cages B and D are controls. Animals may also undergo behavioural testing on a per-cage basis; for example, animals are taken from the housing room to the testing room one cage at a time, tested, and then returned. Larger experiments may need to be conducted over several days, and it is often easier to test all the animals in a subset of cages on each day rather than a subset of animals from all of the cages. At the end of the experiment, animals may also be killed on a per-cage basis. Given that it may take many hours to kill the animals, remove the brains, collect blood, etc., the values of many outcomes such as gene expression, hormone and metabolite concentrations, and physiological parameters may change due to circadian rhythms. All of these factors can lead to systematic differences between litters and can thus bias results and/or add noise to the data.

There is an important distinction to be made between applying treatments to whole litters and “natural” variation between litters. When a treatment is applied to a whole litter, as in the VPA model of autism or maternal stress models, the litter is the experimental unit and the sample size is the number of litters. Therefore, by definition, litter needs to be included in the analysis if more than one animal per litter is used (or the values within a litter can be averaged). However, if multiple litters are used but the treatment(s) are applied to individual animals, experiments should be designed so that valid inferences can still be made if litter effects exist. In other words, litters should not be confounded with other experimental variables, because it would then be difficult or impossible to detect their influence and remove their effects. Whether litter is an important factor for any particular outcome is then an empirical question, and if it is not important it need not be included in the analysis. However, the power to detect differences between litters will be low if only a few litters are used, and a non-significant test for litter effects should therefore not be interpreted as the absence of such effects. Analysing the data with and without litter and choosing the analysis that gives the “right” answer should of course be avoided [35]. Flood et al. provide a nice example in the autism literature of an appropriate design followed by a check for litter effects, with the results for the experimental effect reported both with litter included and with it excluded [47]. Consistent with other studies demonstrating litter effects, this paper found a strong effect of litter on brain mass.
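One way such a check for litter effects might be implemented, when treatments are applied to individual animals across several litters, is a likelihood-ratio comparison of models with and without a random litter intercept. The sketch below assumes a data frame df with placeholder columns outcome, treatment, and litter; the halved p-value is the usual adjustment for testing a variance component on its boundary.

```python
import scipy.stats as st
import statsmodels.formula.api as smf

# Model without litter (ordinary least squares) vs. model with a random
# intercept per litter, both fitted by maximum likelihood so that their
# log-likelihoods are comparable.
without_litter = smf.ols("outcome ~ treatment", data=df).fit()
with_litter = smf.mixedlm("outcome ~ treatment", data=df,
                          groups=df["litter"]).fit(reml=False)

# Likelihood-ratio statistic for the single added variance component.
lr = 2 * (with_litter.llf - without_litter.llf)
p = st.chi2.sf(lr, df=1) / 2  # halved: variance tested on its boundary
print(f"LR = {lr:.2f}, p = {p:.4f}")
```

As the text notes, this test will have low power with few litters, so a non-significant result should not be read as evidence that litter effects are absent.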

Four ways to improve basic and translational research

Better training for biologists

Most experimental biologists are not provided with sufficient training in experimental design and data analysis to plan, conduct, and interpret scientific investigations at the level required to consistently obtain valid results. The solution is straightforward but requires major changes in the education and training of biologists, and it will take many years to implement. Nevertheless, this should be a long-term goal for the biomedical research community.

Make better use of statistical expertise

A second solution is to have statisticians play a greater role in preclinical studies, including peer reviewing grant applications and manuscripts, as well as being part of scientific teams [48]. However, there are not enough statisticians with the appropriate subject-matter knowledge to fully meet this demand: just as it is difficult to do good science without a knowledge of statistics, it is difficult to perform a good analysis without knowledge of the science. In addition, this type of “project support” is often viewed by academic statisticians as a secondary activity. Despite this, there is still scope for improving the quality of studies by making better use of statistical expertise.

More detailed reporting of experimental methods

Detailed reporting of how experiments were conducted, how data were analysed, how outliers were handled, whether all animals that entered the study completed it, and how the sample size was determined is required to assess whether the results of a study are valid. A number of guidelines have been proposed which cover these points, including the National Institute of Neurological Disorders and Stroke (NINDS) guidelines [49], the Gold Standard Publication Checklist [50], and the ARRIVE (Animal Research: Reporting of In Vivo Experiments) guidelines [51]. For example, ARRIVE items 6 (Study design), 10 (Sample size), 11 (Allocating animals to experimental groups), and 13 (Statistical methods) could be made a mandatory requirement for all publications involving animals and included as a separate checklist submitted along with the manuscript, much like a conflict of interest or transfer of copyright form. Something similar has recently been introduced by Nature Neuroscience [52]. This would make it easier for reviewers, editors, and other readers to spot design and analysis issues. In addition, and more importantly, if scientists are required to state how they randomised treatment allocation, or how they ensured that assessment of outcomes was blinded, then they will conduct their experiments accordingly if they plan on submitting to a journal with these reporting requirements. Similarly, if researchers are required to state what the experimental unit is (e.g. litter, cage, or individual animal), they will be prompted to think hard about the issue and design better experiments, or to seek advice. This recommendation would not only improve the quality of reporting but also the quality of experiments, which is the real benefit. A final advantage is that it would make quantitative reviews and meta-analyses easier because much of the key information would be on a single page.

Make raw data available

Another solution is to make the provision of raw data a requirement for acceptance of a manuscript: not making it “available upon request”, which is the current requirement of many journals, but uploading it as supplementary material or depositing it in a third-party data repository. None of the VPA studies provided the data on which the conclusions were based, making reanalysis impossible. Remarkably, of the thirty-five studies published, only one provided the information necessary to conduct a power analysis to plan a future study [46], and this was only because one animal per litter was used and the necessary values could be extracted from the figures. Datasets used in preclinical animal studies are typically small, have no confidentiality issues associated with them, are unlikely to be used for further analyses by the original authors, and raise no additional intellectual property issues given that the manuscript itself has been published. It is noteworthy that many journals require microarray data to be uploaded to a publicly available repository (e.g. Gene Expression Omnibus or ArrayExpress), but not the corresponding behavioural or histological data. It is perhaps not surprising that there is a relationship between study quality and the willingness to share data [53–55]. Publishing raw data can be taken as a signal that researchers stand behind their data, analysis, and conclusions. Funding bodies should encourage this by requiring that data arising from a grant be made publicly available (with penalties for non-adherence).

The above suggestions would help ensure that appropriate designs and analyses are used, and would make it easy to verify claims or to reanalyse data. Currently, it is often difficult to establish whether designs and analyses were appropriate, and almost impossible to verify or reanalyse the results. Moreover, it is clear that appropriate designs and analyses are often not used, making it difficult to give the benefit of the doubt to those studies with incomplete reporting of how experiments were conducted and data were analysed.