The first aim of this study was to evaluate the relationship between sample size and the number of discovered foci reported in published VBM studies across different neuropsychiatric conditions, so as to probe the possibility of reporting biases preferentially affecting smaller studies. This was achieved by focusing on available voxel‐based meta‐analyses of VBM studies, such as those performed with Activation Likelihood Estimation [Laird et al., 2005] or Signed Differential Mapping [Radua and Mataix‐Cols, 2009, 2012; Radua et al., 2010a, 2012a]. These meta‐analyses attempt to reconcile contrasting and inconclusive individual VBM findings by obtaining larger sample sizes with associated greater statistical power. Thus, we assessed whether larger sample sizes are associated with a larger number of identified foci. If, conversely, small studies claim to identify the same number of foci as larger ones, or even more, this would offer evidence of bias. The second aim of this study was to explore the impact of a number of variables on the relationship between sample size and number of reported foci in the VBM literature, including the type of neuropsychiatric disorder, the publication year, the sample size of the study, the slice thickness of the images, the degree of smoothing, the software used to preprocess or statistically analyze the data, the statistical threshold employed, and the use of small volume corrections (SVC) in the analysis. Our third aim was to evaluate whether sample size was related to the number of reported foci in meta‐analyses of VBM studies across different psychiatric conditions, and whether these meta‐analyses report more or fewer foci than the much smaller studies that they include.

Detecting these biases in single studies is difficult: unpublished studies are very hard to unearth, and unless the original protocol is available it is not possible to check whether the presented results are more favorable (e.g., claim more discovered foci with abnormalities) than an analysis based on the original protocol would have been. However, one may obtain hints about the presence of such biases when many studies have been performed. In the absence of bias, one would expect power to detect abnormalities to improve as sample size increases, other things being equal. Conversely, with such biases, small studies with unimpressive null results may remain unpublished, or they may be analyzed in ways that yield more foci. Evidence from many different scientific fields suggests that bias may affect large studies to a lesser extent, since these are likely to be published regardless of their results and analytical manipulation may be less prominent [Rothstein et al., 2005].

Structural magnetic resonance imaging (sMRI) studies have been carried out by many researchers in different neuropsychiatric conditions, including psychosis, depression, dementia, attention deficit hyperactivity disorder (ADHD), and autistic disorders. Often, the morphometric measurements used in these studies have been obtained from a priori regions of interest (ROIs) that can be clearly defined (such as the hippocampi or the ventricles) [Ashburner and Friston, 2000]. However, a number of morphometric features are more difficult to quantify by inspection, meaning that many structural differences may be overlooked or misestimated. The caveat of ROI‐based structural analyses is that, because of these difficulties, researchers can introduce a large source of heterogeneity that undermines the consistency of their results. These problems may ultimately prevent clinical application of sMRI to psychiatry [Borgwardt and Fusar‐Poli, 2012]. To address this limitation, an advanced structural imaging technique has recently been introduced and widely applied. Voxel‐based morphometry (VBM) involves a voxel‐wise comparison of the local concentration of gray matter between two or more groups of subjects. The procedure usually involves spatially normalizing high‐resolution images from all the subjects in the study into the same stereotactic space, segmenting the gray and white matter, and smoothing the resulting gray‐matter segments. Some protocols also include a 'modulation' step, but its effects are disputed [Radua et al., 2013]. Voxel‐wise parametric or non‐parametric statistical tests comparing the experimental groups are then performed, correcting for multiple comparisons. The value of this automated analytical approach is that it gives an "even‐handed and comprehensive assessment of anatomical differences throughout the brain" without biasing attention a priori toward a specific ROI [Laird et al., 2005].
Because of this, VBM is considered an objective method to analyze whole‐brain structural abnormalities in neuroscience and psychiatric research and to bridge structural neuroimaging toward clinical applications. However, it is unclear whether the current VBM literature may still be affected by biases, in particular publication and other selective reporting biases, whereby investigators selectively report statistically significant results and under‐report non‐significant findings – as noted for sMRI studies [Ioannidis, 2011].

Finally, we also assessed the results of the meta‐analyses that had combined these VBM studies. First, we evaluated the relationship between the number of reported foci in each meta‐analysis and the combined sample size of the studies included in the meta‐analysis with a Poisson regression. Again, this model was used, instead of simpler ones, because the number of foci in VBM studies was observed to follow a count distribution. For the sake of completeness, we also conducted a simple Pearson correlation, a non‐linear Spearman correlation, and a Pearson correlation after discarding the most influential study according to the dfbetas statistic of the regression of foci by sample size. Second, we evaluated whether the number of foci reported in the meta‐analyses was larger or smaller than the number of foci reported in each of the VBM studies that they had combined. We tested the hypothesis that the much larger sample size of the meta‐analyses would allow the detection of at least as many foci as in the individual studies, if not more. In the presence of bias in single studies, the meta‐analyses may report even fewer validated foci than the single studies, because the biases may be diluted in the meta‐analysis. The analysis used the Wilcoxon paired test, in which the number of foci in each meta‐analysis was compared, pairwise, against the number of foci in each study that it included.
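The two steps above can be sketched as follows. This is a minimal illustration on randomly generated counts (not the actual dataset), and the Poisson regression is implemented by hand with iteratively reweighted least squares rather than the specific R routines used in the study.

```python
import numpy as np
from scipy.stats import wilcoxon

def poisson_irls(x, y, n_iter=50):
    """Fit log(E[y]) = b0 + b1*x by iteratively reweighted least squares."""
    X = np.column_stack([np.ones_like(x, dtype=float), x.astype(float)])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = np.exp(X @ beta)            # expected foci count per meta-analysis
        z = X @ beta + (y - mu) / mu     # working response for the log link
        XtW = X.T * mu                   # IRLS weights for Poisson/log link are mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Illustrative (simulated) data: foci counts with a weak positive slope
rng = np.random.default_rng(0)
n = rng.integers(100, 2000, size=60)     # hypothetical combined sample sizes
foci = rng.poisson(np.exp(1.0 + 0.0005 * n))
b0, b1 = poisson_irls(n, foci)
print(f"{100 * (np.exp(10 * b1) - 1):.2f}% more foci per 10 extra patients")

# Wilcoxon paired test: foci in a meta-analysis vs foci in its included studies
meta_foci  = np.array([3, 5, 2, 8, 4])   # invented counts for illustration
study_foci = np.array([6, 9, 4, 7, 10])
stat, p = wilcoxon(meta_foci, study_foci)
print(f"Wilcoxon statistic = {stat}, p = {p:.3f}")
```

Note that exp(10·b1) converts the per-patient log-rate coefficient into the "% increase per 10 patients" figure quoted in the Results.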

To explore experimental variables influencing the relationship between sample size and number of reported foci, subgroup regressions were conducted on the following subsets of studies: studies published up to and after 2008, studies with up to or more than six authors, studies with fewer than or at least 32 patients, studies with sample sizes of up to 80 patients, studies conducted on MRI devices with magnets of up to or stronger than 1.5 Tesla (T), studies with an MRI acquisition slice thickness of at least or thinner than 1.5 mm, studies employing statistical parametric mapping (SPM) or other software packages to pre‐process and compare the images, studies applying a smoothing kernel of up to or greater than 8 mm FWHM, studies thresholding at P < 0.001 uncorrected for multiple comparisons, studies thresholding at P < 0.05 FDR‐ or FWE‐corrected for multiple comparisons, studies employing SVC, and studies investigating different neuropsychiatric conditions. Cutoffs for magnet intensity, slice thickness, and smoothing kernel were chosen because they allowed dividing the total sample of studies into two sub‐groups of fairly similar size. The year 2008 was chosen as the cutoff to specifically test the impact of advanced VBM algorithms such as DARTEL, which were introduced shortly before [Ashburner, 2007b]. The sample size of 32 patients was chosen on the basis of evidence indicating that the minimum sample size for a neuroimaging study is 16 patients per group [Friston, 2012]. The cutoff of six authors was chosen on the basis of the previous findings by Sayo et al. (2011). Regression slopes of complementary subgroups with different findings (e.g., one slope is significantly higher than zero and the other slope is not) were formally compared with zero‐inflated Poisson models. All calculations were performed with the "pscl" package for R [Jackman, 2012; R Development Core Team, 2011].
Regression lines were added to the reported plots for both significant and non‐significant relationships.
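One standard way to formally compare the slopes of two complementary subgroups is a Wald test on a group-by-sample-size interaction term. The sketch below uses a plain Poisson model on simulated data (the study itself used zero-inflated Poisson models, and all parameter values here are invented for illustration).

```python
import numpy as np

def poisson_glm(X, y, n_iter=50):
    """Poisson regression (log link) via iteratively reweighted least squares,
    returning coefficients and their asymptotic standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        z = X @ beta + (y - mu) / mu      # working response
        XtW = X.T * mu                    # weights are the fitted means
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    mu = np.exp(X @ beta)
    cov = np.linalg.inv((X.T * mu) @ X)   # asymptotic covariance of beta
    return beta, np.sqrt(np.diag(cov))

# Simulated data: the foci/sample-size slope is steeper below 32 patients
# (hypothetical slopes: 0.04 per patient below the cutoff, 0.002 above)
rng = np.random.default_rng(3)
n = rng.integers(12, 61, size=400).astype(float)
g = (n >= 32).astype(float)               # subgroup indicator (>= 32 patients)
y = rng.poisson(np.exp(1.0 + 0.04 * n * (1 - g) + 0.002 * n * g))

# The interaction term g*n tests whether the two subgroup slopes differ
X = np.column_stack([np.ones_like(n), n, g, g * n])
beta, se = poisson_glm(X, y)
z = beta[3] / se[3]                        # Wald z for the slope difference
print(f"slope difference z = {z:.2f}")
```

A strongly negative z indicates that the slope in the larger-sample subgroup is significantly flatter than in the smaller-sample subgroup.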

A simulation framework was used to assess whether such potential bias could significantly affect the expected relationship. First, 84,000 gray‐matter datasets were simulated by adding normally distributed noise to a normal gray‐matter template ( n = 42,000 controls), or to a gray‐matter template with abnormal volume in regions reported to have decreased gray matter in first psychotic episodes ( n = 42,000 patients) [Radua et al., 2012b ]. Second, these data were smoothed with a large Gaussian kernel [ σ = 6 mm, full‐width at half maximum (FWHM) = 14 mm], thus simulating both the spatial covariance observable in raw data and the smoothing usually applied in VBM pre‐processing. Finally, individuals were grouped in 400 simulated studies with different numbers of participants (from n = 10 to 200 per group), and standard group‐level voxel‐based statistics were performed (uncorrected P = 0.001, 20 voxels extent). As shown in Figure 2 , the number of clusters followed a clear positive relationship with the sample size. The relationship would be the same if each cluster was substituted by three reported foci.
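A drastically simplified one-dimensional analogue of this simulation illustrates the expected pattern. The template length, effect magnitude, kernel width, and cluster extent below are arbitrary stand-ins for the 3-D pipeline described above.

```python
import numpy as np
from scipy.stats import ttest_ind

def smooth(data, sigma=4):
    """Gaussian smoothing along the voxel axis (last axis)."""
    k = np.arange(-3 * sigma, 3 * sigma + 1)
    w = np.exp(-k**2 / (2 * sigma**2))
    w /= w.sum()
    return np.apply_along_axis(lambda v: np.convolve(v, w, mode="same"), -1, data)

def count_clusters(mask, extent=3):
    """Count runs of contiguous supra-threshold voxels of a minimal extent."""
    clusters, run = 0, 0
    for v in np.append(mask, False):       # trailing False flushes the last run
        if v:
            run += 1
        else:
            clusters += run >= extent
            run = 0
    return clusters

rng = np.random.default_rng(1)
V = 300                                    # voxels along a 1-D "brain"
effect = np.zeros(V)
for c in (50, 150, 250):                   # three regions of reduced gray matter
    effect[c - 5:c + 5] = 0.6              # hypothetical effect size

def simulate_study(n):
    """One study: n controls vs n patients, voxel-wise t-tests, P < 0.001."""
    controls = smooth(rng.normal(0.0, 1.0, (n, V)))
    patients = smooth(rng.normal(-effect, 1.0, (n, V)))
    t, p = ttest_ind(controls, patients, axis=0)
    return count_clusters((p < 0.001) & (t > 0))

counts = {n: simulate_study(n) for n in (10, 50, 200)}
print(counts)   # detected cluster counts should rise with sample size
```

With small groups the per-voxel t-statistic hovers around the P < 0.001 threshold, so true regions are detected only sporadically; with larger groups all simulated regions survive thresholding.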

It should be noted that variability in the way authors report VBM results could conceal the expected positive relationship between sample size and the number of reported foci. Statistically significant voxels are usually grouped into clusters of spatially contiguous voxels, and only the local maxima (i.e., foci) are reported. Importantly, an increase in sample size helps non‐significant voxels between two close clusters achieve statistical significance, sometimes converting the two close clusters into a single larger one. The number of foci should not be affected by this conversion, but some authors choose to report only three foci per cluster. In other words, these authors could report up to six foci when describing the two close clusters, but no more than three when describing the single larger cluster obtained after an increase in sample size. In such a case, the relationship between sample size and the number of foci could be biased downward.
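A toy calculation with invented counts makes the capping effect concrete: two adjacent clusters reported under a three-foci-per-cluster convention yield six foci, whereas the single merged cluster obtained with a larger sample yields only three.

```python
def reported_foci(cluster_maxima, cap=3):
    """Total reported foci when at most `cap` local maxima per cluster are listed."""
    return sum(min(m, cap) for m in cluster_maxima)

print(reported_foci([3, 3]))   # two close clusters, three maxima each -> 6
print(reported_foci([6]))      # same six maxima in one merged cluster -> 3
```

So a larger sample that merges clusters can paradoxically reduce the reported focus count under this convention.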

Specifically, the relationship between the number of reported foci in each study and the sample size of the study was assessed with a zero‐inflated Poisson regression [Zeileis et al., 2008]. This model was used, instead of simpler ones, because the number of foci in VBM studies was observed to follow a mixture of a point mass at zero with a count distribution. However, for the sake of completeness, we also conducted a meta‐analytical combination of Poisson coefficients estimated separately for each published meta‐analysis, simple Pearson correlations, non‐linear Spearman correlations, Pearson correlations after discarding the most influential study according to the dfbetas statistic of the regression of foci by sample size [Belsley et al., 1980], and a meta‐analytical combination of (Fisher‐transformed) Pearson correlations estimated separately for each published meta‐analysis.
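The mixture just described (a point mass at zero plus a Poisson count whose log-mean depends on sample size) can be fitted by direct maximum likelihood. The sketch below does so on simulated data with scipy, rather than with the pscl routines used in the study; all parameter values are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, expit

def zip_negloglik(params, x, y):
    """Negative log-likelihood of a zero-inflated Poisson:
    log mu = b0 + b1*x (count part), logit pi = g0 (constant inflation)."""
    b0, b1, g0 = params
    mu = np.exp(b0 + b1 * x)
    pi = expit(g0)
    logp_pois = -mu + y * np.log(mu) - gammaln(y + 1)
    ll = np.where(y == 0,
                  np.log(pi + (1 - pi) * np.exp(-mu)),   # zeros: mixture of both parts
                  np.log(1 - pi) + logp_pois)            # positive counts: Poisson only
    return -ll.sum()

# Simulated studies: hypothetical sample sizes, weak positive slope, 25% "structural" zeros
rng = np.random.default_rng(2)
n = rng.integers(20, 300, size=200).astype(float)
y = rng.poisson(np.exp(1.2 + 0.002 * n))
y[rng.random(200) < 0.25] = 0

res = minimize(zip_negloglik, x0=[0.0, 0.0, 0.0], args=(n, y),
               method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
b0, b1, g0 = res.x
print(f"count slope: {100 * (np.exp(10 * b1) - 1):.2f}% per 10 patients; "
      f"zero-inflation: {expit(g0):.2f}")
```

The binomial (zero-inflation) part is held constant here, mirroring the observation reported below that it was not influenced by sample size.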

At the level of each individual study, we extracted the total sample size, the overall number of foci, the condition, the contrast within the condition, the imaging parameters (magnet intensity, slice thickness, degree of smoothing, and software packages used), the statistical threshold [false discovery rate (FDR), family‐wise error (FWE) correction, uncorrected P ‐value] and the use of SVC in the analysis. Similarly, at the level of the meta‐analyses, we extracted total sample size, the overall number of foci, the condition, the contrast within the condition, the statistical threshold and the software packages used.

Articles were included in our analysis if they were independent (see above) whole‐brain voxel‐based meta‐analyses of magnetic resonance imaging (MRI) studies of the human brain. Meta‐analyses were eligible regardless of the neurological or psychiatric condition investigated. Exclusion criteria were (i) meta‐analyses of ROIs (not whole brain), (ii) structural modalities other than VBM (e.g., diffusion tensor imaging, cortical pattern matching), (iii) functional brain imaging meta‐analyses, (iv) non‐human studies, and (v) overlapping meta‐analyses. Meta‐analyses of studies investigating white matter differences (Peters et al., in press; Radua et al., 2010b) were also excluded. Only a minority of the retrieved meta‐analyses fully listed the number of subjects and the number of foci identified in each individual study, which was necessary to perform the statistical analysis. To circumvent this problem, include moderators, and avoid missing data, we collected all the individual studies for each meta‐analysis and extracted these details ourselves (see below).

Many of the included articles conducted different meta‐analyses for more than one condition (sub‐meta‐analyses). These sub‐meta‐analyses were each considered separately for inclusion in our study. To avoid inclusion of overlapping data, when two meta‐analyses included overlapping sets of studies in a similar contrast, we retained only the more recent meta‐analysis with the larger number of studies/sample size. Where the same paper included both an overall meta‐analysis and separate sub‐analyses for different conditions, the sub‐analyses were preferentially included (with the overall meta‐analysis being considered a duplicate and excluded). As a consequence, all the included meta‐analyses addressed different between‐group contrasts. We then carefully searched the included articles at the level of each individual study, and studies were compiled and compared a second time to eliminate overlapping samples and ensure that no individual study was double counted.

We conducted a four‐step literature search. First, we searched PubMed using the Boolean terms "voxel‐based morphometry meta‐analysis." All publications listed in PubMed prior to August 1, 2012 were included. In a second step, we also searched the bibliographies of the BrainMap ( http://brainmap.org/pubs/ ) and SDM ( http://sdmproject.com ) databases (last search performed on July 31, 2012). All eligible publications were included. In a third step, we hand‐searched the references of the included publications to minimize the possibility of biases in the literature search. Full texts were retrieved for all potentially eligible publications. The retrieved publications then underwent an initial culling of ineligible and duplicate analyses. These publications were then hand‐searched against the inclusion criteria and selected by two analysts independently (MF & PFP), with any discrepancies adjudicated until 100% rater agreement was achieved. To achieve a high standard of reporting, we adopted the 'Preferred Reporting Items for Systematic Reviews and Meta‐Analyses' (PRISMA) guidelines [Liberati et al., 2009].

The regression was not nominally significant when meta‐analyses, rather than individual studies, were analyzed as a whole group. However, this was only true for those meta‐analyses that included fewer than 10 studies (−0.17% change in the number of foci per each 10‐patient increase in sample‐size, P = 0.641), while the regression achieved statistical significance for those meta‐analyses including 10 studies or more (0.35% increase in the number of foci per each 10‐patient increase in sample‐size, P < 0.001, Fig. 7). As shown in Figure 7, there were many meta‐analyses with fewer than 10 studies and a total sample size <1,000 that reported a substantial number of foci, e.g., six of them reported at least 10 foci, while this occurred in only one meta‐analysis with 10 or more studies. The median number of foci was three in meta‐analyses with at least 10 studies and five in meta‐analyses with fewer than 10 studies.

There were no significant differences in the other methodological subgroups (up to 80 patients, studies with magnets up to 1.5 T, studies with magnets stronger than 1.5 T, MRI slices of 1.5 mm or thicker, MRI slices thinner than 1.5 mm, SPM used for pre‐processing, other software used for pre‐processing, FWHM of 8 mm or less, FWHM greater than 8 mm, SPM used for statistics, other software used for statistics, no correction for multiple comparisons, use of SVC, and no use of SVC). Similarly, exclusion of foci detected with SVC from the main analysis did not change the main results (2% increase in the number of foci per each 10‐patient increase in sample‐size, P < 0.001; Pearson r = 0.138, P = 0.007; Spearman rho = 0.113, P = 0.023; Pearson r without the most influential study = 0.166, P = 0.002).

No major differences according to the field (psychiatry or neurology, Supporting Information Fig. 1S) or clinical condition (Supporting Information Fig. 2S) were observed. With respect to methodological moderators, the only subgroup in which the regression between sample size and number of reported foci was relatively stronger was the set of studies with fewer than 32 patients (52% increase in the number of foci per each 10‐patient increase in sample‐size, P < 0.001). The regression slope was nominally significant but small in studies with at least 32 patients (2% increase in the number of foci per each 10‐patient increase in sample‐size, P < 0.001), with the difference in regression slope between these two subgroups being statistically significant (P < 0.001). The regression slope was small but still nominally significant in studies published up to 2008, in studies with more than six authors (Fig. 5), and in studies thresholding at P < 0.05 FWE‐corrected for multiple comparisons (Fig. 6) (2–3% increase in the number of foci per each 10‐patient increase in sample‐size, P < 0.001). Conversely, it was null in studies published after 2008, in studies with up to six authors, and in studies thresholding at P < 0.05 FDR‐corrected for multiple comparisons (<1% increase in the number of foci per each 10‐patient increase in sample‐size, P > 0.05), with differences in regression slope between these pairs of subgroups being statistically significant (P ≤ 0.005 in all cases).

As shown in Table 3 and Figure 4 , studies with larger sample sizes were found to report more abnormalities, but the slope was very small (2% increase in the number of foci per each 10‐patient increase in sample‐size, P < 0.001), thus indicating potential reporting biases affecting the smaller studies more than the larger studies. Results were similar when using Pearson and Spearman correlations, when discarding statistically influential studies, or when combining Pearson correlations separately estimated per each published meta‐analysis (simple Pearson r = 0.148, P = 0.004; Spearman rho = 0.139, P = 0.006; Pearson r without the most influential study = 0.176, P < 0.001; meta‐analytically combined Pearson r = 0.110, P = 0.010). The binomial part of the zero‐inflated Poisson regression was not found to be influenced by the sample size.
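As a quick arithmetic check of how small this slope is: a Poisson coefficient b1 multiplies the expected count by exp(b1·Δn) for an increase of Δn patients, so the reported 2% per 10 patients compounds only modestly even over large sample-size differences.

```python
import numpy as np

# Per-patient coefficient implied by a 2% increase per 10 additional patients
b1 = np.log(1.02) / 10
for dn in (10, 50, 100):
    print(f"+{dn} patients -> x{np.exp(b1 * dn):.3f} expected foci")
# Even 100 extra patients predicts only about 22% more reported foci
```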

Our literature search identified 54 full‐text articles, which were assessed against the inclusion criteria. The final database comprised 42 articles with 79 meta‐analyses (including sub‐analyses). After checking for duplicate or overlapping meta‐analyses, a final set of 47 meta‐analyses was included, and the final dataset used in this study comprised a total of 324 individual VBM studies. The literature search and the characteristics of the included and excluded meta‐analyses are detailed in Figure 3, Table 1, and Supporting Information Table IS. As shown in Table 2, the number of participants ranged from 12 to 545 in the studies (median = 47, interquartile range = 39), and from 149 to 4,087 in the meta‐analyses (median = 534, interquartile range = 721). The median number of reported foci per study was six, while the median number of foci reported per meta‐analysis was three. Seventy‐four percent of the studies reported 10 foci or fewer, and 79% of the meta‐analyses reported five foci or fewer. Other descriptive details of the included studies and meta‐analyses are depicted in Table 2.

DISCUSSION

This study explored the potential confounding role of biases in VBM studies by assessing whether the number of reported brain abnormalities was positively related to the sample size of the studies, as would be statistically expected given that studies with larger sample sizes have more power to detect abnormalities. Overall, we found a weak correlation between sample size and number of reported foci, corresponding to an increase of only 2% per 10 additional patients. This is far less than would be expected based on power considerations, suggesting that reporting biases may be inflating the number of discovered foci in small studies. Evaluation of a large number of moderator variables suggested similar findings across a wide array of study, disease, technical, and other characteristics, although there were hints that the statistical threshold employed and publication year had a modulating effect. Finally, we found that for whole‐brain voxel‐based meta‐analyses including fewer than 10 studies, there was no association between sample size and number of foci, which is again suggestive of potential reporting biases. This pattern was not seen in meta‐analyses with 10 or more studies, which generally reported few foci (median = 3). The number of foci reported in meta‐analyses, especially large ones, was significantly smaller than the number of foci reported in single studies, also corroborating that the literature of single studies may often present inflated numbers of discoveries.

Overall, the strength of the evidence that we found for reporting biases in VBM studies may be weaker than previous findings in non‐VBM (i.e., ROI) structural neuroimaging studies, where an excess significance bias was more clearly detected [Ioannidis, 2011]. However, the ROI assessment evaluated the number of observed versus expected significant results in each study and in multiple studies, which was not possible to do for VBM studies given the nature of the data. Second, automated methods such as VBM tend to be less biased by the researcher's influence; in contrast, a researcher could perform several exploratory ROI analyses and report results for only those ROIs that yielded significant results [Ioannidis, 2011; Radua and Mataix‐Cols, 2012]. Third, the manual tracing of ROIs, as compared with VBM methods, can introduce significant heterogeneity in the anatomical definition of the brain areas investigated across studies, affecting the significance of the reported results and thereby facilitating publication biases. Fourth, we found that the median sample size of the individual VBM studies retrieved was 47, which is larger than the typical sample size of previously analyzed ROI studies [Ioannidis, 2011]. Some authors have even proposed optimal sample sizes for individual VBM studies of 16–32 subjects per group [Friston, 2012], suggesting that between‐subjects comparison studies of n < 32 are too small even by liberal estimates. Still, the fact that the number of foci reported in single VBM studies is larger than what eventually gets validated in large meta‐analyses suggests that reporting or other biases may sometimes be substantial in some VBM studies.

We explored several potential factors that may influence or modulate reporting biases in the VBM literature, such as publication year, type of condition investigated, statistical threshold employed, and other methodological characteristics of the analysis method. A similar pattern of weak or null correlations was seen across analyses of different moderator variables, although we found some hints that the statistical threshold employed and publication year had some modulating effect. There was no difference in the relationship between identified foci and sample size according to the type of clinical condition. Similarly, no differences in the relationship between number of foci and sample size were detected when the VBM psychiatric literature was compared with the VBM neurological literature. Factors other than publication biases may account for the lack of clinical applications of psychiatric neuroimaging (e.g., heterogeneity of psychiatric diagnoses or differences in the psychopathological characteristics across samples) [Borgwardt et al., 2012]. Sample size, magnet field strength, slice thickness, analysis package, smoothing kernel, and use of SVC did not affect the results.

Conversely, the foci‐sample size relationship was positive and statistically significant in studies employing an FWE correction, and negative and non‐significant in studies applying an FDR correction, even though the increase of foci with sample size should be higher in studies applying such a correction (Fig. 1). Furthermore, there was a statistically significant difference between these subgroups. The reasons for the existence of potential bias in studies using FDR are again speculative. It should be noted that the sample size range appeared to be smaller for VBM studies that employed FDR correction compared with those that employed FWE correction (see Fig. 6); it is therefore possible that the absence of a significant relationship for studies that used FDR, but not for those that used FWE correction, could simply be explained by differences in power. Alternatively, the absence of a relationship between sample size and foci in studies applying an FDR correction could be related to some mis‐use of FDR in neuroimaging [Chumbley and Friston, 2009].

In addition, we found an effect of the number of authors, with a significant correlation between sample size and number of foci for VBM studies with more than six authors and no correlation for studies with up to six authors (statistically significant between‐subgroup differences, Fig. 5). Strikingly, this replicates a similar relationship previously reported by Sayo et al. (2011), who found that studies with fewer coauthors reported larger ventricular‐brain ratio abnormalities in patients with schizophrenia. They suggested that larger research groups may be more conservative or exacting in their research methodology. Similarly, we also found a differential effect of publication year, with a significant correlation between sample size and number of foci for VBM studies published up to 2008 and no correlation for studies published after 2008 (statistically significant between‐subgroup differences). The reasons for this observation are highly speculative, and it could be a chance finding, given the number of moderator variables assessed. It could be, for instance, that once the structural abnormalities in many disorders have been more or less established in previous studies, only studies finding such abnormalities are published. Alternatively, newer studies use advanced VBM algorithms (i.e., DARTEL, introduced shortly before 2008 [Ashburner, 2007a, b]), which could enable them to detect most of the abnormalities even with relatively small sample sizes. However, this seems unlikely given that the number of foci reported in these studies is in fact lower than in older studies (9–74% decrease depending on the sample size). The causes for this observation are also speculative. On the one hand, in recent years investigators may have conservatively thresholded the analyses when the results appeared in brain regions that were unexpected based on the results of previous studies.
On the other hand, some of the new VBM algorithms that have been introduced based on theoretical grounds lack formal empirical validations and may have had a detrimental impact on the sensitivity of the analyses. A third possibility is that these new VBM algorithms have resulted in fewer false‐positives compared to standard VBM, for instance by improving the spatial registration of the images or minimizing the impact of non‐normality [Salmond et al., 2002; Viviani et al., 2007]. Because the causes for the lower number of significant foci after 2008 are speculative, the implications of this observation for the minimal appropriate N per group are also unclear.

Finally, we tested the sample size/number of foci correlation hypothesis at the meta‐analytical level (Fig. 7). We found that the relationship was absent when meta‐analyses with fewer than 10 studies were included in a Poisson regression. Small meta‐analyses reported more foci than larger ones. This finding may be useful to guide editors, reviewers, and authors in improving the reliability of voxel‐based meta‐analyses, either by setting the bar for the number of studies included in a meta‐analysis at k ≥ 10, or by ensuring that high‐quality null findings are available for conditions with fewer than 10 published singleton studies after a systematic literature review and, ideally, with access to data registries, since it is notoriously difficult to unearth unpublished, unregistered data. The fact that meta‐analyses with many studies validate few foci (median = 3) also suggests that the larger numbers of foci reported in small studies and small meta‐analyses may be inflated by several false positives.

Some caveats should be discussed about our study. First, we cannot rule out the possibility that in some cases large studies and even large meta‐analyses may suffer from reporting and/or other biases that inflate the number of discovered foci. Conversely, some small studies may be more meticulous and thus optimize the yield of discoveries, despite their limited sample size. However, it is unlikely that there would be a systematic error in favor of small studies being better than larger studies in this regard. Our analysis focuses on the big picture, including many hundreds of studies. Second, the total number of genuine foci to be discovered in each disease and condition is unknown, and power calculations require making assumptions about how many of such abnormalities would be detected. Most likely, the number and magnitude of abnormalities differ substantially across different diseases and conditions. Thus, again, our approach offers an aggregate view of the big picture, and inferences may not extend to every topic that we analyzed. Even if reporting biases are present in the field‐at‐large, this does not mean that all sub‐fields and each topic are equally affected. Third, there is preliminary evidence that VBM studies with a smaller sample size may be more susceptible to false positives than those with a larger sample size; this is due to the impact of non‐normality of the data [Salmond et al., 2002; Viviani et al., 2007], which is critically dependent on sample size [Scarpazza et al., in press]. Thus, we cannot exclude the possibility that our results reflect differences in false positive rates as a function of sample size rather than reporting biases.