You may have seen some news stories saying one part of the brain is bigger, or smaller, in people with a certain mental health problem, or even a specific job. These are generally based on real, published research. But how reliable are the studies?

One way of critiquing a piece of scientific research is to read the academic paper in detail, looking for flaws. But that may not be enough if some sources of bias exist outside the paper itself, in the wider system of science.

By now you'll be familiar with publication bias: the phenomenon where studies with boring, negative results are less likely to get written up or published. You can estimate this using a tool such as, say, a funnel plot. The principle is simple: expensive landmark studies are harder to brush under the carpet, but small ones can disappear more easily. So split your studies into "big ones" and "small ones": if the small studies, averaged out together, give a more positive result than the big studies, then maybe some small negative studies have gone missing in action.
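The "big ones" versus "small ones" comparison can be sketched in a few lines. This is only a rough illustration of the principle, not a proper funnel plot analysis: the study sizes and effect estimates below are invented, and splitting at the median sample size is an arbitrary assumption.

```python
from statistics import mean, median

# Hypothetical (sample_size, effect_estimate) pairs, invented for illustration.
studies = [
    (20, 0.62), (25, 0.55), (30, 0.48), (40, 0.51),
    (400, 0.21), (650, 0.18), (900, 0.24), (1200, 0.20),
]

def small_vs_big(studies):
    """Return (mean effect in small studies, mean effect in big studies).

    Studies at or below the median sample size count as "small".
    """
    cutoff = median(n for n, _ in studies)
    small = [effect for n, effect in studies if n <= cutoff]
    big = [effect for n, effect in studies if n > cutoff]
    if not big:
        # Without variation in study size there is nothing to compare.
        raise ValueError("studies do not vary enough in size to split")
    return mean(small), mean(big)

small_mean, big_mean = small_vs_big(studies)
print(f"small studies: {small_mean:.2f}, big studies: {big_mean:.2f}")
```

If the small studies come out clearly more positive than the big ones, as they do in this made-up data, that is a hint (no more than that) that some small negative studies may be missing.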

Sadly, this doesn't work with brain scan studies, because there's not enough variation in size. So Professor John Ioannidis, a godlike figure in the field of "research about research", took a different approach. He collected a large, representative sample of these anatomical studies, counted up how many positive results they got, and how positive those results were, and then compared this to how many similarly positive results you could plausibly have expected to detect, simply from the sizes of the studies.

This can be derived from something called the "power calculation". Everyone knows that the more data you collect for a piece of research, the greater your ability to detect a modest effect. What people often miss is that the size of sample needed also changes with the size of the effect you're trying to detect: detecting a true 0.2% difference in the size of the hippocampus between two groups, say, would need more subjects than a study aiming to detect a 25% difference.
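To make that concrete, here is a minimal sketch of a textbook power calculation for comparing the means of two groups, using a normal approximation. Several things here are assumptions rather than anything from the studies themselves: a two-sided 5% significance threshold, 80% power, and a purely hypothetical between-person standard deviation of 10% in hippocampal volume, which is needed to turn a percentage difference into a standardised effect size.

```python
from math import ceil
from statistics import NormalDist

def subjects_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a difference in means.

    effect_size is the difference divided by the standard deviation.
    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Hypothetical: assume volumes vary between people with an SD of 10%.
ASSUMED_SD_PERCENT = 10.0

for difference_percent in (0.2, 25.0):
    d = difference_percent / ASSUMED_SD_PERCENT
    print(f"{difference_percent}% difference: "
          f"{subjects_per_group(d)} subjects per group")
```

On these assumptions, the 0.2% difference needs tens of thousands of people in each group, while the 25% difference needs only a handful (where the normal approximation is itself too crude to be trusted). The exact figures depend entirely on the assumed variability; the gulf between them is the point.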

By working backwards and sideways from these kinds of calculations, Ioannidis was able to determine, from the sizes of effects measured and from the numbers of people scanned, how many positive findings could plausibly have been expected, and compare that with how many were reported. The answer was stark: even being generous, there were twice as many positive findings as you could realistically have expected from the amount of data reported on.
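The general shape of that comparison can be sketched as follows. This is an illustration of the idea, not a reconstruction of Ioannidis's actual procedure: the group sizes, the assumed true effect and the count of reported positives are all invented, and the power formula is a normal approximation for a two-group comparison of means. Each study's power is its chance of getting a positive result if the effect is real, so adding up the powers gives the number of positive results you would expect.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate chance that one study detects a true effect of this size."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect_size * sqrt(n_per_group / 2) - z_crit)

def expected_positives(effect_size, group_sizes, alpha=0.05):
    """Sum of per-study powers: the expected number of positive results."""
    return sum(approx_power(effect_size, n, alpha) for n in group_sizes)

# Hypothetical inputs, invented for illustration.
group_sizes = [15, 20, 25, 30, 40]  # subjects per group in each study
assumed_effect = 0.5                # assumed true standardised effect
reported_positives = 4              # positives these studies "reported"

expected = expected_positives(assumed_effect, group_sizes)
print(f"expected about {expected:.1f} positives, "
      f"reported {reported_positives}")
```

In this made-up example, five small studies could realistically be expected to produce a little over two positive results between them, so four reported positives is nearly twice what the data can support.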

What could explain this? Inadequate blinding is an issue: a fair amount of judgment goes into measuring the size of a brain area on a scan, so wishful nudges can creep in. And boring old publication bias is another: maybe whole negative papers aren't getting published.

But a final, more interesting explanation is also possible. In these kinds of studies, it's possible that many brain areas are measured to see if they're bigger or smaller, and maybe, then, only the positive findings get reported within each study.
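A little arithmetic shows why this matters. As a simplified sketch, assume each brain area is tested separately at the usual 5% threshold, that the tests are independent of each other (in real brains they won't quite be), and that there is no true difference anywhere. The chance of at least one "positive" finding then climbs quickly with the number of areas measured.

```python
def chance_of_any_false_positive(n_regions, alpha=0.05):
    """Chance that at least one of n independent tests is positive by luck."""
    return 1 - (1 - alpha) ** n_regions

for n_regions in (1, 5, 10, 20):
    chance = chance_of_any_false_positive(n_regions)
    print(f"{n_regions} areas measured: {chance:.0%} chance of a false positive")
```

Measure ten areas and, on these assumptions, you have about a 40% chance of finding something to report even when nothing is there. If only that finding makes it into the paper, the literature fills up with positives.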

There is one further line of evidence to support this kind of selective reporting. In studies of depression, for example, 31 studies report data on the hippocampus, six on the putamen and seven on the prefrontal cortex. Perhaps more investigators really did focus solely on the hippocampus. But given how easy it is to measure the size of another area – once you've recruited and scanned your participants – it's also possible that people are measuring these other areas, finding no change and not bothering to report that negative result in their paper alongside the positive ones they've found.

There's only one way to prevent this: researchers would have to publicly pre-register which areas they plan to measure and then report all findings. In the absence of that process, the entire field might be distorted, by a form of exaggeration that is – we trust – honest and unconscious, but more interestingly, collective and disseminated.