In many fields, it is nigh impossible to quantify many important aspects of the work objectively, or even semi-automatically. In the experimental sciences, however, there are fortunately aspects of reliability and accuracy that can be quantified objectively and compared across large numbers of articles and journals.

According to a quote attributed to Albert Einstein, “Not everything that can be counted counts, and not everything that counts can be counted.” Whether a publication is considered “good” depends on a number of variables. Among the most frequently cited is novelty, i.e., that the publication in question constitutes a discovery not made before and a significant scientific advancement. However, novelty alone is a questionable aspect of quality long before one attempts to quantify it. Whether a publication is novel depends on the knowledge, and thus the perspective, of the reader. Similarly, what constitutes a significant advancement is highly subjective as well. For these reasons, a focus on novelty incentivizes authors, likely against their better knowledge, to make their work appear more novel, e.g., by using the word “novel” more often [37] or by leaving out references to prior work, a common practice that some journals seem to openly endorse [38]. Finally, table-top cold fusion, arsenic in DNA, or the purported link between the MMR vaccine and autism were at least as novel as the discovery of CRISPR gene scissors, gravitational waves, or place cells, and yet most would agree that there is an important enough distinction between the former group of “discoveries” and the latter to justify not treating them equivalently. In other words, novelty alone is useless as a signal of quality. Of course, if a discovery is truly novel, it cannot yet have been reproduced. Therefore, any journal rank that aspires to capture quality beyond mere novelty must be able to distinguish between submitted, novel manuscripts of the former, unreliable kind and those of the latter, reliable kind before actual replications have been attempted. Is our system of ranked journals up to this task? Given that we all send our most novel work to the best journals, are these top journals indeed able to separate the novel, reliable wheat from the novel, unreliable chaff?

It is a noteworthy discovery in and of itself that a number as flawed as the IF nevertheless correlates with anything, let alone exceedingly well with scholars’ subjective notion of journal prestige [32–36]. Because of this correlation between IF and subjective prestige rank, the IF lends itself as a tool to test several quantifiable aspects of quality and to see how well the hierarchy of prestige stands up against the scientific method.

The evidence against our notion of prestige

For instance, crystallographers quantify the quality and accuracy of computer models derived from experimental work in structural biology and chemistry by comparing the models to established properties of the substance’s constituents. They use bond distances, angles, and other factors to derive a difference score that measures how far a given model is from being perfectly accurate. When thousands of such models are averaged over the journals in which they were published, prestigious journals such as Cell, Molecular Cell, Nature, EMBO Journal, and Science turn out to publish significantly substandard models of such structures [39].
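
As a rough illustration of the kind of difference score involved, the sketch below computes a root-mean-square deviation of bond lengths from reference values for each model and averages the scores per journal. The records, reference values, and metric are simplified assumptions for illustration; they are not the exact procedure used in [39].

```python
from collections import defaultdict
from statistics import mean

def rmsz(observed, ideal, sigma):
    """Root-mean-square Z-score: how far observed bond lengths deviate
    from their reference values, in units of the expected spread sigma.
    A perfectly accurate model would score 0."""
    z = [(o - i) / s for o, i, s in zip(observed, ideal, sigma)]
    return (sum(v * v for v in z) / len(z)) ** 0.5

# Hypothetical records: one entry per published structural model.
models = [
    {"journal": "Journal A", "obs": [1.53, 1.34], "ideal": [1.52, 1.33], "sigma": [0.02, 0.01]},
    {"journal": "Journal B", "obs": [1.60, 1.38], "ideal": [1.52, 1.33], "sigma": [0.02, 0.01]},
]

# Average the difference scores over the journal each model appeared in;
# a larger average means less accurate models.
per_journal = defaultdict(list)
for m in models:
    per_journal[m["journal"]].append(rmsz(m["obs"], m["ideal"], m["sigma"]))

for journal, scores in per_journal.items():
    print(journal, round(mean(scores), 2))
```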

Such prestigious journals have also been found to publish exaggerated effect sizes with smaller than necessary sample sizes in single-gene association studies of psychiatric disorders [40]. Overall statistical power has been found to be weak across the biomedical and psychological sciences [41–44], indicating an overall low reliability for these fields. Statistical power was found to be at best uncorrelated with journal rank [8], or negatively correlated, i.e., publications in higher-ranking journals report lower statistical power [42,44].
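
To make the notion of achieved statistical power concrete, here is a minimal sketch, under assumed per-study effect sizes, group sizes, and journal-rank scores, of how one might compute the power of a two-sample t-test for each study and test its rank correlation with journal rank. The numbers are invented and the analysis is far simpler than those in [8,42,44].

```python
from scipy.stats import spearmanr
from statsmodels.stats.power import TTestIndPower

# Hypothetical per-study data: assumed true effect size (Cohen's d),
# sample size per group, and a journal-rank score (e.g., the IF).
studies = [
    {"d": 0.5, "n": 20, "rank": 30.0},
    {"d": 0.4, "n": 35, "rank": 12.0},
    {"d": 0.6, "n": 15, "rank": 40.0},
    {"d": 0.3, "n": 80, "rank": 4.0},
]

power_calc = TTestIndPower()
powers = [power_calc.power(effect_size=s["d"], nobs1=s["n"], alpha=0.05)
          for s in studies]
ranks = [s["rank"] for s in studies]

# A negative coefficient would mean: the higher the journal rank,
# the lower the achieved statistical power.
rho, p = spearmanr(ranks, powers)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```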

Animal disease models are subjected to procedures similar to clinical trials in humans in order to evaluate the effectiveness of treatments. Clearly, only the highest standards of scientific rigor should apply to such experiments. Among the most basic standards are the randomized assignment of individuals to treatment and control groups and the blind assessment of the outcome. When the reporting of randomization and blinding in the methods sections is analyzed, it becomes clear that not only are these basic procedures rarely reported, but authors publishing in high-ranking journals report them even less often than authors publishing in lower-ranking journals. Therefore, at best, authors of publications in high-ranking journals are sloppier in reporting their methods than their counterparts in less prestigious journals [45]. At worst, they adhere less to basic notions of good experimental design.
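
Analyses of this kind typically screen the text of methods sections for mentions of randomization and blinding. The following is a minimal keyword-based sketch of such a screen; the keywords are assumptions, and published analyses such as [45] rely on more careful coding of the full text.

```python
import re

def reports_rigor(methods_text):
    """Crude keyword screen: does a methods section mention
    randomization and blinding at all?"""
    text = methods_text.lower()
    return {
        "randomization": bool(re.search(r"random(ly|ised|ized|isation|ization)?", text)),
        "blinding": bool(re.search(r"\bblind(ed|ing)?\b", text)),
    }

methods = "Animals were randomly assigned to groups; outcome assessment was blinded."
print(reports_rigor(methods))  # {'randomization': True, 'blinding': True}
```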

Sloppiness may also be inferred whenever discrepancies can be found between the actual results of a study and what is reported in the publication. For instance, gene symbols and accession numbers may inadvertently be converted into dates or floating-point numbers when -omics researchers copy and paste their results into Microsoft Excel spreadsheets without tedious error correction by hand. This is a rather common error, but it is noteworthy that the incidence of such errors is higher in more prestigious journals [46]. It may also happen that the p-values reported in a publication deviate from those calculated from the data. Here, too, the incidence of these errors increases with journal rank, and, curiously, the errors universally lower the p-value rather than sometimes raising and sometimes lowering it, as one would expect if they were due to chance alone [42]. In the arms race between authors desperate to get ahead of the competition and journals trying to detect questionable research practices, the low-hanging fruit seem to be collected by the high-ranking journals: the rate of duplicated images is lower in these journals than in other journals [47]. To my knowledge, this is currently the only example in the literature in which higher-ranking journals appear to be better at catching errors than lower-ranking journals.
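
The Excel error is concrete enough to illustrate with a short sketch: the check below scans a column that is supposed to contain gene symbols for entries that look like the dates or scientific-notation numbers Excel silently produces. The patterns and example values are assumptions for illustration, not the screening procedure of [46].

```python
import re

# Shapes that Excel typically produces when it auto-converts a gene
# symbol such as SEPT2 or MARCH1, or an accession like 2310009E13.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", re.IGNORECASE)
SCIENTIFIC_LIKE = re.compile(r"^\d+(\.\d+)?E[+-]?\d+$", re.IGNORECASE)

def suspicious_cells(gene_column):
    """Return entries in a supposed gene-symbol column that look like
    the result of Excel's automatic date/number conversion."""
    return [cell for cell in gene_column
            if DATE_LIKE.match(cell) or SCIENTIFIC_LIKE.match(cell)]

# Hypothetical supplementary-table column.
genes = ["TP53", "2-Sep", "MARCH1", "2.31E+13", "BRCA1"]
print(suspicious_cells(genes))  # ['2-Sep', '2.31E+13']
```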

These few examples stand in for a growing body of evidence in which high-ranking journals often seem to struggle to reach even average reliability [8,48]. In fact, some of the most convincing studies point towards an inverse relation between journal rank and reliability. A straightforward ad hoc hypothesis explaining this observation is that the emphasis of editors on novelty increases with journal rank, but editorial focus on scientific rigor and reliability does not. Given that novel and surprising results ought to be met with increased scrutiny, the data suggest that this increase in editorial and statistical scrutiny does not take place. Taken together, the available evidence therefore not only invalidates the current use of the IF specifically and of our subjective journal rank more generally, but also demonstrates how counterproductive their deployment in evaluations is, because it rewards unreliable science.

This body of evidence points in the same direction as complementary research showing that selecting researchers based on their productivity also leads to decreased reliability [49,50]: selecting scientists on the number of their publications and on journal rank will, over time, tend to decrease scientific reliability. In both cases, scientists are hired and promoted who publish less reliable work than their peers and who then go on to teach their students how to become successful scientists, namely by publishing a lot and in prestigious journals. This research is agnostic to the intentions or motives of the individuals. Training, strategies, and competence all vary among the population of early-career researchers from which institutions hire faculty. Using these common selection criteria ensures a bias towards unreliability, irrespective of its ultimate underlying source or reason, and institutions employ them at their own risk. Therefore, inasmuch as the number and venue of scholarly publications are used as measures of scientific “excellence,” the currently available data support recent parallel conclusions that this “excellence” is not excellent [51]. As it stands, “used in its current unqualified form it is a pernicious and dangerous rhetoric that undermines the very foundations of good research and scholarship” [51].