This study explored the potential confounding role of reporting bias in fMRI studies of sex differences by assessing the prevalence of “positive” results and conclusions, and whether the number of reported foci was positively related to the sample size of the studies. Across 179 identified fMRI studies of the brain published over a decade, few had a title that focused on a lack of sex differences or on similarities between sexes, and only 17 did not highlight sex differences in their abstract. Given the typically very small sample sizes in this literature, this “success rate” is implausibly high. Moreover, there was no statistical correlation between sample size and the number of identified foci. We analyzed relationships across different types of spatial smoothing, slice thickness, date of publication, use of corrected or uncorrected p-values, use of SPM or other statistical approaches, whole-brain or ROI studies, and a range of different behavioral and somatosensory tasks. Across all of these subgroups, there was no clear and consistent relationship between sample size and the number of significant foci. These results are surprising: owing to their higher statistical power, studies with larger sample sizes should detect more differences when true sex differences are present9,10,11.
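The power argument can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: it uses a normal approximation rather than the exact noncentral t distribution, and the effect size (d = 0.5) and critical value (z ≈ 2.81, roughly a two-sided alpha of 0.005) are hypothetical choices, not parameters taken from the reviewed studies.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sample_power(effect_size: float, n_per_group: int, z_crit: float) -> float:
    """Approximate power of a two-sided two-sample comparison.

    Normal approximation: under the alternative, the test statistic
    is centred at effect_size * sqrt(n_per_group / 2).
    """
    noncentrality = effect_size * sqrt(n_per_group / 2.0)
    return normal_cdf(noncentrality - z_crit)

# Hypothetical medium effect (d = 0.5) at a stringent threshold
# (z_crit ~ 2.81, i.e. two-sided alpha ~ 0.005):
for n in (10, 16, 32, 64, 128):
    print(f"n per group = {n:3d}  power = {two_sample_power(0.5, n, 2.81):.3f}")
```

Under these assumptions, a medium effect is nearly undetectable with 16 subjects per group but reliably detected with 128, which is why larger studies would be expected to report more true foci at a fixed threshold.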

The lack of relationship observed in these analyses may reflect systematic reporting bias in small fMRI studies that produces a published literature with more sex-difference signals than truly exist. We have previously reported a small but significantly positive correlation between sample size and number of brain abnormalities in VBM studies, with variance by publication date, statistical thresholds, and other imaging parameters13, and a lack of a consistently positive relationship between sample size and foci across the larger field of published fMRI studies14. The median number of foci in small studies (≤32 subjects) was approximately the same as in larger studies (>32 subjects). As has been shown for morphometric12 and fMRI studies at large9,10,11,14, reporting bias appears to be driving an excess of significance. Studies with smaller sample sizes and reduced statistical power have been shown to produce imprecise and frequently spurious results, and it is possible that studies and analyses with more significant results are preferentially selected for publication. While this problem is not specific to the study of sex differences but is inherent to small-sample fMRI research, it may be exacerbated by the simple fact that subgroup analyses based on sex are always tempting and easy to perform (information on sex is present in most datasets).
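To make the primary analysis concrete, here is a self-contained sketch of a Spearman rank correlation between sample size and number of reported foci. The numbers are made up for illustration (they are not the actual study data), and the `ranks` and `spearman` helpers are illustrative, not the authors' code:

```python
def ranks(values):
    """Average ranks (ties share the mean rank), 1-based."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Group a run of tied values and assign their mean rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-study sample sizes and foci counts (made-up data):
sample_sizes = [12, 16, 20, 24, 30, 40, 60, 90]
n_foci       = [5, 2, 7, 3, 6, 2, 4, 5]
print(f"rho = {spearman(sample_sizes, n_foci):.3f}")
```

A rho near zero, as in this toy example, is the pattern described above: foci counts that do not rise with sample size, contrary to what statistical power alone would predict.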
It is probable that the high proportion of “positive findings” results from a combination of factors, including publication bias due to journal editorial practices favoring positive results, and significance biases such as selective outcome and analysis reporting (reporting additional analyses that were not pre-specified), under-reporting of null results (the “file-drawer problem”, particularly in underpowered studies), p-value “hacking” (manipulating analysis parameters until significant results are obtained), and other factors identified across the psychological literature18,19 and in the fMRI literature8. We have published suggestions for reducing these practices8, and there is some evidence that efforts to promote open science are bearing fruit as more light is shed on these problems20. We do not know whether these recommendations are now widely followed, but if investigators are applying the recommendation to use more stringent primary thresholds only in higher-powered studies, this might explain why higher-powered studies are not reporting more foci; if such a practice were rampant and systematic across the field, it would itself represent a type of reporting or significance bias.

Our results could also reflect a dearth of biologically plausible sexual dimorphism in brain function across the many tasks published in the literature. A previous systematic review of fMRI studies concluded that there was widespread publishing of underpowered studies with “false-positive claims of sex differences in the brain, to enable the proliferation of untested, stereotype-consistent functional interpretations”21, and suggested that widespread scientific assumptions that female and male brains are functionally distinct, dichotomous, fixed, and invariant due to a sexually differentiated genetic blueprint are not scientifically justified and may be sexist22. Other investigators have posited that sex differences in cognitive test performance are explained by hormonal differences throughout development in combination with cultural influences, gender stereotypes, and biopsychosocial interactions23, and that females and males belong to a single heterogeneous population rather than two distinct populations with regard to brain structure and function24.

Some limitations should be acknowledged. First, to avoid difficulties arising from multiple measurements, we extracted foci only from the analysis with the largest number of foci, although many studies reported more than one analysis. As a result, some studies may have claimed far more significant foci than the number we extracted. Thus, our analysis probably underestimates the potential problem of having too many statistically significant claims of sex differences in fMRI studies. Second, study designs differed across studies. We attempted to address this methodological heterogeneity with sensitivity analyses across subgroups defined by methodological features. However, these subgroup analyses might be underpowered to demonstrate the relationship explored. Conversely, the one positive subgroup result encountered may be a spurious association found by chance, since it did not survive correction for multiplicity. Third, the statistical significance of fMRI results may depend on the analytical method used, and some parametric methods have been shown to yield inappropriate type I error rates25. Here, we considered the correction used, but did not re-analyze the raw data or confirm the results using the same assumptions and statistical methods employed by the original authors. In addition, we sought to control for the level of correction (cluster-level vs. voxel-level) in each study, but the use of clusterwise vs. voxelwise correction was often not clearly documented in the papers. Another open question that we were not able to address is how to appraise statistical stringency across approaches: for instance, is 0.005 cluster-level FWE more or less stringent than 0.01 voxel-level FWE, or than 0.01 cluster-level FDR, and so forth?

Fourth, our literature search was limited to studies published in the decade 2004–2013. Curating the database required extensive time and effort, and we judged that updating the search to capture more recent studies would not, at this time, yield enough additional information to justify the effort. It is unlikely that earlier or more recent studies would present a different pattern, but empirical evaluations of very recent fMRI studies may be worth performing in the future, especially if large, multicenter investigations start appearing more frequently in this literature. Interestingly, we observed a small but statistically significant interaction between sample size and publication year, suggesting that the most recent studies may have operated in an environment where the strength of these biases has decreased.

Fifth, our searches were extensive, but we might have missed some studies of sex differences. In particular, we may have missed studies that found no significant sex differences where this “negative” result was alluded to only in fine print and thus could not be retrieved by our literature searches. If so, this would be a further form of reporting bias: “positive” results would be not only more likely to be published, but also more prominently presented when published, compared with “negative” results.

Sixth, we acknowledge that an increase in sample size and power may enable non-significant voxels between two nearby clusters to reach statistical significance, sometimes merging the two clusters into a single larger one. The number of foci should not be affected by this merging, but some authors choose to report only three foci per cluster; we did not assess reporting of ≤3 foci per cluster in our sensitivity analyses. In such cases, the relationship between sample size and number of foci could be biased downwards. However, in a previous publication we found no evidence of an effect of this practice on the correlation between sample size and number of reported foci13. Although that modeling drew on a database of VBM studies, it should be noted that in our earlier mega-analysis of fMRI studies14 we found the expected relationship between sample size and number of foci in meta-analyses, which likewise merge close clusters into a single, robust activation focus via activation-likelihood estimation. We may also have failed to extract other important confounders, such as study quality defined in other ways. We cannot exclude that some large studies were of poor quality and thus less likely to find foci than smaller studies. Nevertheless, this seems unlikely, since higher quality standards would be expected in larger investigations, which are typically performed by more experienced teams.

Importantly, our evaluation cannot establish that there are no biologically plausible sex differences in human brain function, cognition, or behavior that would be reflected in fMRI studies. However, the present data suggest that there is likely excess significance bias in the reported results of fMRI studies of sex differences in the brain.

This excess significance and reporting bias may stem from a constellation of factors likely to affect the small-study literature most prominently. These factors include, but are not limited to, lack of pre-registration8, large flexibility in the modes of analysis26, inappropriate statistical methods26, and selection pressure from the current reward and incentive system to report the most significant results8. Conversely, solutions may involve pre-registered protocols and registration databases8, openness and transparency with wider data-sharing practices such as Neurovault27 and OpenfMRI28, as well as pre-registered reports29 and other efforts that aim to minimize selective reporting20,30.