Thus far, we have been arguing that “excellence” is primarily a rhetorical signalling device used to claim value across heterogeneous institutions, researchers, disciplines, and projects rather than a measure of intrinsic and objective worth. In some cases, the qualities of these projects can be compared in detail on other bases; in many—perhaps most—cases, they cannot. As we have argued, the claim that a research project, institution, or practitioner is “excellent” is little more than an assertion that that project, institution, or practitioner can be said to succeed better on its own terms than some other project, institution, or practitioner can be said to succeed on some other, usually largely incomparable, set of terms.

But what about these sets of “own terms”? How easy is it to define the “excellence” of a given project, institution, or practitioner on an intrinsic basis? Even if we leave aside the comparative aspect, are there formal criteria that can be used to identify “excellence” in a single research instance on its own terms, or on those of a single discipline?

Research suggests that this is far harder than one might think. Academics, it turns out, appear to be particularly poor at recognizing a given instance of “excellence” when they see it, or, if they think they do, at getting others to agree with them. Their continued willingness to debate relative quality in these terms, moreover, creates a basis for extreme competition that has serious negative consequences.

Do researchers recognize excellence when they see it?

The short answer is no. This can be seen most easily when different potential measures of “excellence” conflict in their assessment of a single paper, project, or individual. Adam Eyre-Walker and Nina Stoletzki, for example, conclude that scientists are poor at estimating the merit and impact of scientific work even after it has been published (2013). Post-publication assessment is prone to error and biased by the journal in which the paper is published. Predictions of future impact as measured by citation counts are also generally unreliable, both because scientists are not good at assessing merit consistently across multiple metrics and because the accumulation of citations is itself a highly stochastic process, such that two papers of similar merit measured on other bases can accumulate very different numbers of citations just by chance. Moreover, Wang et al. (2016) show that in terms of citation metrics the most novel work is systematically undervalued over the time frames that conventional measures use, including, for instance, the Journal Impact Factor that Eyre-Walker and Stoletzki suggest biases expert assessment.
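
The stochastic point can be made concrete with a minimal simulation. The sketch below is our own illustration, not a model drawn from the studies cited above: the cumulative-advantage rule and every parameter in it are assumptions. Two papers of identical merit accumulate citations under a rule in which each new citation goes to a paper with probability proportional to its existing count; replaying the same “history” many times shows how far their final counts diverge through chance alone.

```python
import random

def simulate_citations(n_papers=2, n_citations=1000, seed=None):
    """Cumulative-advantage (Polya urn) model: papers start with equal
    merit, and each new citation goes to a paper with probability
    proportional to its current count + 1."""
    rng = random.Random(seed)
    counts = [0] * n_papers
    for _ in range(n_citations):
        weights = [c + 1 for c in counts]  # equal-merit baseline
        winner = rng.choices(range(n_papers), weights=weights)[0]
        counts[winner] += 1
    return counts

# Replay "history" many times for two equally meritorious papers.
ratios = []
for trial in range(1000):
    a, b = simulate_citations(seed=trial)
    ratios.append(max(a, b) / max(min(a, b), 1))
ratios.sort()

print(f"median advantage of the luckier paper: {ratios[500]:.1f}x")
print(f"90th-percentile advantage:             {ratios[900]:.1f}x")
```

Nothing in this toy model distinguishes the two papers except luck, which is precisely the sense in which a citation gap need not reflect a merit gap.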

The difficulty holds even for work that can be shown to be successful by other measures. Campanario, Gans and Shepherd, and others, for example, have traced the rejection histories of Nobel and other prize winners, including papers reporting the very results for which they were later recognized (Gans and Shepherd, 1994; Campanario, 2009; Azoulay et al., 2011: 527–528). Campanario and others have also reported on the initial rejection of papers that later went on to become among the most highly cited in their fields or in the journals that ultimately accepted them (Campanario, 1993, 1995, 1996; Campanario and Acedo, 2007; Calcagno et al., 2012; Nicholson and Ioannidis, 2012; Siler et al., 2015). Yet others have found a generally poor relationship between high ratings in grant competitions and subsequent “productivity” as measured by publication or citation counts (Pagano, 2006; Costello, 2010; Lindner and Nakamura, 2015; Fang et al., 2016; Meng, 2016).

As this suggests, academics’ abilities to distinguish the “excellent” from the “not-excellent” do not correlate well with one another even within the same disciplinary environment (there tends to be greater agreement at the other end of the scale, in distinguishing the “not acceptable” from the “acceptable”; see Cicchetti, 1991; Weller, 2001). To earn citations or win prizes for a rejected manuscript, after all, authors must first convince a different journal (and its referees) to accept work that others have previously found wanting.

But this is not something that only Nobel prize winners are good at: as Weller reported in the early years of this century, most (51.4%) rejected manuscripts were ultimately published; in the vast majority of cases (approximately 90%), these previously rejected articles were accepted on their second submission and, in the vast majority of those cases (also approximately 90%), at a journal of similar prestige and circulation (Weller, 2001). While these statistics have almost certainly shifted in recent years, with changes in the demographics of submission and, especially, the development of venues that focus on the publication of “sound science” (Public Library of Science, 2016), the basic picture remains: journal peer review is a gatekeeper that is frequently circumvented.

Articles that are initially rejected and then go on to be published to great acclaim, or even just in journals of a similar or higher ranking, represent what are in essence false negatives in our ability to assess “excellence.” They are also evidence of terrible inefficiency. The rejection of papers that are subsequently published with little or no revision at journals of similar rank increases costs for everyone involved without any countervailing improvement in quality. In addition to multiplying the systemic cost of refereeing and editorial management by the number of resubmissions, such articles also impose an opportunity cost on their authors: lost chances to claim priority for discoveries, for example, or, even more commonly, lost opportunities for citation and influence (Gans and Shepherd, 1994; Campanario, 2009; Şekercioğlu, 2013; Brembs, 2015; Psych Filedrawer, 2016).

More worryingly, there is also considerable evidence of false positives in the review process: submissions judged to meet the standards of “excellence” required by one funding agency, journal, or institution, but which fare worse when measured against other or subsequent metrics. In a somewhat controversial study, Peters and Ceci resubmitted already-published papers in slightly disguised form to the journals that had previously accepted them (Peters and Ceci, 1982; see Weller, 2001 for a critique). Only 8% of these resubmissions were explicitly detected by the editors or reviewers to whom they were assigned. Of those that went undetected, approximately 90% were ultimately rejected, for methodological and other reasons, by the same journals that had previously published them; they were rejected, in other words, as insufficiently “excellent” by journals that had earlier judged them “excellent” enough to enter the literature.

When it comes to funding, a similar pattern of false positives may pertain: a study by Nicholson and Ioannidis (2012) suggests that highly cited authors are less likely to head major biomedical research grants than less-frequently-cited but socially better-connected authors who are associated with granting-agency study groups and review panels. Fang, Bowen and Casadevall have found that “the percentile scores awarded by peer review panels” at the NIH correlated “poorly” with “productivity as measured by citations of grant-supported publications” (Fang et al., 2016). These findings suggest a bias towards conformity and social connectedness over innovation in funding decisions, in a world in which success rates are as low as 10%. They also provide further evidence of the funding-agency bias against disruptively innovative work noted by many researchers over the years (Kuhn [1962] 2012; Campanario, 1993, 1995, 1996, 2009; Costello, 2010; Ioannidis et al., 2014; Siler et al., 2015).

Fraud, error and lies

To the extent that the above are evidence of inefficiencies in the system, some might argue that problems in determining “excellence” in specific cases are resolved in the longer term and over large samples. Of course, these examples show only work for which multiple measures of “excellence” can be compared: given the unreliability of any single measure, work that is measured only once may be unjustly suppressed or unjustly published, without our being able to tell the difference. On the other hand, it is presumably possible that even such extreme divergences in perceptions of “excellence” represent honest differences of opinion as to the qualitative merit of the research or researchers. The same cannot be said, however, of actual fraud and outright errors.

As various studies have concluded, reported instances of both fraud and error (as measured through retractions) are on the rise (Claxton, 2005; Dobbs, 2006; Steen, 2011; Fang et al., 2012; Grieneisen and Zhang, 2012; Yong, 2012b; Chen et al., 2013; Andrade, 2016). This is particularly true at higher-prestige journals (Resnik et al., 2015; Siler et al., 2015; Belluz, 2016). If we add to this list of (potentially) “false positives” studies that cannot be replicated, the number of papers that meet one measure of “excellence” (that is, passing peer review, often at “top” journals) while failing others (that is, being accurate, reproducible, and non-fraudulent) rises considerably (Dean, 1989; Burman et al., 2010; Lehrer, 2010; Bem, 2011; Goldacre, 2011; Yong, 2012b; Rehman, 2013; Resnik and Dinse, 2013; Hill and Pitt, 2014; Chang and Li, 2015; Open Science Collaboration, 2015). It is the very focus on “excellence”, however, that creates this situation: the desire to demonstrate the rhetorical quality of “excellence” encourages researchers to submit fraudulent, erroneous, and irreproducible papers, at the same time as it works to prevent the publication of the replication studies that could identify such work.

In other words, erroneous, and especially fraudulent or irreproducible, papers are interesting because they represent a failure both of our ability to identify and predict actual qualitative “excellence” and of the incentive system used to encourage scientists and scholars to produce the kind of sound and defensible work that should be a sine qua non for quality. As Fang, Steen, and Casadevall (2012; cf. Steen, 2011, which the later article corrects) have shown, the majority of retracted papers are withdrawn for reasons of misconduct, including fraud, duplicate publication, or plagiarism (67.4%), rather than error (21.3%), although inadvertent error should presumably itself be a disqualification from “excellence”. But even these figures may under-represent the true incidence of misconduct. Mistakes and errors made in good faith are a natural and necessary part of the research process. Yet, as focus groups and surveys conducted by various researchers have demonstrated, some forms of error amount to misconduct: a (semi-)deliberate strategy for ensuring quick and/or numerous publications by “‘cutting a little corner’ in order to get a paper out before others or to get a larger grant,... [or] because... [a researcher] needed more publications that year” (Anderson et al., 2007: 457–458; see also Fanelli, 2009; Tijdink et al., 2014; Chubb and Watermeyer, 2016).

Thus, in a pooled analysis of survey results, Fanelli showed that while only a small percentage of scientists admitted to fabricating, falsifying, or modifying data (1.97%, pooled weighted average across 7 surveys), a much larger percentage claimed to have seen others engage in similarly outright fraudulent activity (14.12% across 12 surveys). Larger percentages still had engaged in (33.7%), or seen others engage in (72%), questionable research practices described in less negatively loaded language (Fanelli, 2009; the percentage of scientists admitting to explicit misconduct is considerably higher [15%] in Tijdink et al., 2014). As Fanelli concludes: “Considering that these surveys ask sensitive questions and have other limitations, it appears likely that this is a conservative estimate of the true prevalence of scientific misconduct” (2009, 9), a conclusion strongly supported by the anecdotal admissions of Anderson et al.’s focus groups.

The drive for “excellence” in the eyes of assessors is shown even more starkly in work by Chubb and Watermeyer (2016). In structured interviews, academics in Australia and the United Kingdom admitted to outright lies in the claims of broader impacts made in research proposals. As the authors note: “Having to sensationalize and embellish impact claims was seen to have become a normalized and necessary, if regretful, aspect of academic culture and arguably par for the course in applying for competitive research funds” (6). Quoting an interviewee, they continue, “If you can find me a single academic who hasn’t had to bullshit or bluff or lie or embellish to get grants, then I will find you an academic who is in trouble with his [sic] Head of Department” (6; “[sic]” as in Chubb and Watermeyer). Here we see how a competitive requirement, perceived or real, for “excellence”, in combination with a lack of belief in the ability of assessors to detect false claims, leads to a conception of “excellence” as pure performance: a concept defined by what you can get away with claiming in order to suggest (rather than actually accomplish) “excellence”.

What is striking about these behaviours, of course, is that they are unrelated to (and to a great extent perhaps even incompatible with, or opposed to) the actual qualities that funders, governments, journal editors and referees, and researchers themselves are ostensibly using “excellence” to identify. No agency, ministry, press, or research office intentionally uses “excellence” as shorthand for “able to embellish results or importance convincingly”, even as the researchers adjudicated under this system report such embellishment as a primary criterion for success. Whether it occurs through fraud, corner-cutting, or exaggeration, this performance of “excellence” is commonly justified as necessary for survival, suggesting a cognitive and cultural dissonance between those aspects of their work that the performers feel are essential and those they feel they must emphasise, overstate, embellish, or fabricate in order to appear more “excellent” than their competitors. The evidence that fraud and corner-cutting are a problem at the core of the research process suggests that the pressure for these performances of “excellence” is not restricted to stages that do not matter. As Kohn argues, reward-motivation affects scientific creativity (the ability to “break out of the fixed pattern of behaviour that had succeeded in producing rewards… before”) as much as it does evidence-gathering or the inflation of results (1999, 44; see also Lerner and Wulf, 2006; Azoulay et al., 2011; Tian and Wang, 2011).

Competition for scarce resources and the performance of “excellence”

So why do researchers engage in this kind of dubious activity? Clearly, for both Chubb and Watermeyer’s interviewees and those identified as having committed scientific fraud, the answer is competition for scarce resources, whether funding, positions, or community prestige. This is not, of course, a new issue (Smith, 2006). Taking time away from his work on the Difference Engine, Charles Babbage published an analysis of what he saw as the four main kinds of scientific fraud in an 1830 polemic, Reflections on the Decline of Science in England: And on Some of Its Causes. These included the self-explanatory “hoaxing” and “forging,” in addition to “trimming” (“clipping off little bits here and there from those observations which differ most in excess from the mean and in sticking them on to those which are too small”) and “cooking” (“an art of various forms, the object of which is to give ordinary observations the appearance and character of those of the highest degree of accuracy”) (Babbage, 1831: 178; see Zankl, 2003 and Secord, 2015 for discussion).
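
Babbage’s “trimming” repays a moment’s unpacking, because it shows why such manipulation is hard to detect. The sketch below is our own illustration (the observations are invented, not Babbage’s): shaving the extreme values and redistributing the shavings leaves the mean of a data set untouched while shrinking its scatter, so the “trimmed” data support the same conclusion with a spurious appearance of precision.

```python
import statistics

def trim(observations, amount=0.5):
    """Babbage's 'trimming': clip a little from the largest observation
    and stick it on the smallest. The mean is unchanged; the apparent
    precision improves."""
    data = sorted(observations)  # sorted() copies, so the input survives
    data[-1] -= amount           # clip the high outlier...
    data[0] += amount            # ...and pad the low one
    return data

raw = [9.1, 9.8, 10.0, 10.3, 11.2]
cooked = trim(raw)
print(f"mean:  {statistics.mean(raw):.2f} -> {statistics.mean(cooked):.2f}")   # identical
print(f"stdev: {statistics.stdev(raw):.2f} -> {statistics.stdev(cooked):.2f}") # reduced
```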

The motivation for these frauds, then as now, involves prestige and competition for resources. Babbage’s typology of fraudulent science was but a minor chapter in a book otherwise mostly concerned with the internal politics of the Royal Society. He attributed the decline he saw in English science to the lack of attention and professional opportunities available to potential scientists, and he was, as a result, keenly sensitive to questions of credit and its importance in determining rank and authority. Indeed, as Casadevall and Fang remind us, “Since Newton, science has changed a great deal, but this basic fact has not. Credit for work done is still the currency of science…. Since the earliest days of science, bragging rights to a discovery have gone to the person who first reports it” (Casadevall and Fang, 2012: 13). The prestige of first discovery has always been a scarce resource. Now that that prestige is also measured through the scarce resource of authorship in “the right journals”, and coupled ever more strongly to the further scarce resources of career advancement and grant funding, it should not be a surprise that competition for those markers has become steadily stronger, nor that the performance of “excellence” has become more marked as a result.

If scandals such as fraudulent articles were the only way in which this overwhelming competitive focus on “excellence” hurt research, it would be bad enough. But the emphasis on rewarding the performance of “excellence” also has a more general impact on research capacity: it is the mechanism by which the “Matthew effect” (that is, the disproportionate accrual of resources to those researchers and institutions that are already well-rewarded) operates in a hyper-competitive research environment, creating distortions throughout the research cycle, even for work that is not fraudulent or the result of misconduct (Bishop, 2013; as its etymology implies, the “Matthew effect” predates today’s hypercompetition, see Merton, 1968, 1988). It increases the stakes of the competition for resources and, as a result, encourages gamesmanship; it creates a bias towards (non-disruptively) novel, positive, and even inflated results on the part of authors and editors; and it discourages the pursuit and publication of the kinds of “Normal Science” (such as replication studies) that are crucial to the viability of the research enterprise without being glamorous enough to suggest that their authors are “excellent”.
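
The compounding logic of the Matthew effect is the same cumulative-advantage mechanism sketched for citations above, recast here as a grant competition. Again, this is a toy model of our own devising, not an analysis from Merton or Bishop, and all of its parameters are assumptions: researchers start equal, each grant is awarded with probability proportional to past winnings, and early luck snowballs into durable concentration.

```python
import random

def matthew_effect(n_researchers=100, n_rounds=50, grants_per_round=10, seed=1):
    """Toy grant competition: every researcher starts with nothing, and
    each grant goes to a researcher with probability proportional to
    their past winnings + 1, so early wins raise the odds of later ones."""
    rng = random.Random(seed)
    winnings = [0] * n_researchers
    for _ in range(n_rounds):
        for _ in range(grants_per_round):
            weights = [w + 1 for w in winnings]
            winner = rng.choices(range(n_researchers), weights=weights)[0]
            winnings[winner] += 1
    return sorted(winnings, reverse=True)

outcome = matthew_effect()
share = 100 * sum(outcome[:10]) / sum(outcome)
print(f"the top 10% of researchers hold {share:.0f}% of all grants awarded")
```

Because nothing differentiates the simulated researchers except the order in which luck arrives, whatever concentration the model produces is pure path dependence; in a real system, the same mechanism amplifies whatever initial advantages, including the gamesmanship described above, researchers bring to it.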

Positive bias and the decline effect

Just how destructive this need to perform “excellence” is can be illustrated by the well-known bias towards positive results in scientific publication (for example, Dickersin et al., 1987, 2005; Sterling, 1959; Kennedy, 2004; Young and Bang, 2004; Bertamini and Munafò, 2012; Rothstein, 2014; Psych Filedrawer, 2016). Thus, for example, Fanelli (2011) demonstrated a 22% growth between 1990 and 2007 in the “frequency of papers that, having declared to have ‘tested’ a hypothesis, reported a positive support for it”. This is all the more remarkable given that the late 1980s were themselves not a halcyon period of unbiased science: in a 1987 study of 271 unpublished and 1041 published trials, Dickersin et al. found that 14% of unpublished but 55% of published trials favoured the experimental therapy (1987). As Young et al. suggest, “the general paucity in the literature of negative data” is such that “[i]n some fields, almost all published studies show formally significant results so that statistical significance no longer appears discriminating” (2008, 1419).

Another artifact of this positive bias is the “decline effect,” or the tendency for the strength of evidence for a particular finding to decline over time from that reported on its first publication (Schooler, 2011; Gonon et al., 2012; Brembs et al., 2013; Groppe, 2015; Open Science Collaboration, 2015). While this effect is also well known, Brembs et al. have recently shown that its presence is significantly positively correlated with journal prestige as measured by Impact Factor: early papers appearing in high-prestige journals, often based on smaller samples, report larger effects than subsequent studies are able to confirm (2013, see Figs. 1b and 1c in this reference).
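
The mechanism behind the decline effect is straightforward to demonstrate in simulation. The sketch below is our own illustration, not Brembs et al.’s analysis, and its true effect size, sample sizes, and significance threshold are all assumptions: initial studies are “published” only if they clear a significance filter in a small sample, while replications with larger samples face no filter. The filtered initial estimates come out inflated, and the replications “decline” back towards the true value.

```python
import random
import statistics

def study(true_effect, n, rng):
    """One study: draw n observations around the true effect and report
    the estimated effect plus a crude significance flag (|z| > 1.96)."""
    xs = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    est = statistics.mean(xs)
    se = statistics.stdev(xs) / n ** 0.5
    return est, abs(est / se) > 1.96

rng = random.Random(42)
TRUE_EFFECT, N_INITIAL, N_REPLICATION = 0.2, 25, 100

# Initial publications: only studies that reached significance survive.
published = []
while len(published) < 500:
    est, significant = study(TRUE_EFFECT, N_INITIAL, rng)
    if significant:
        published.append(est)

# Replications: larger samples, reported regardless of outcome.
replications = [study(TRUE_EFFECT, N_REPLICATION, rng)[0] for _ in range(500)]

print(f"true effect:                     {TRUE_EFFECT}")
print(f"mean published initial estimate: {statistics.mean(published):.2f}")
print(f"mean replication estimate:       {statistics.mean(replications):.2f}")
```

No effect has actually declined here; the initial estimates were simply selected for being overestimates, which is why a literature that filters on significance and novelty will produce “declines” as a matter of course.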

The bias against replication

Finally, there is a bias against the publication of replication studies in disciplines where such studies make scientific sense. There are currently insufficient structural incentives to perform work that “merely” revalidates existing studies, a lack fuelled by the focus on novelty in most definitions of “excellence”. As Nosek et al. note:

Publishing norms emphasize novel, positive results. As such, disciplinary incentives encourage design, analysis, and reporting decisions that elicit positive results and ignore negative results. Prior reports demonstrate how these incentives inflate the rate of false effects in published science. When incentives favour novelty over replication, false results persist in the literature unchallenged, reducing efficiency in knowledge accumulation. (2012)

This bias against replication is even more remarkable, however, when it involves studies that invalidate rather than confirm the original result, especially when the original result has a high profile or is potentially field-defining: qualities that one would assume would increase the novelty and interest of the (non-)replication itself (Goldacre, 2011; Wilson, 2011; Nosek et al., 2012; Yong, 2012a, b; Aldhous, 2011; for a view from the other side of replication, see Bissell, 2013). This is, in part, a function of publishing economics: commercial journals earn money from subscription, access, and reprint fees (Lundh et al., 2010); high-profile results, and the high prestige reflected in a high Impact Factor, help maintain demand for these journals and hence ensure both a continuing stream of interesting new material and a steady or rising income for the journal as a whole (Lawrence, 2007; Munafò et al., 2009; Lundh et al., 2010; Marcovitch, 2010). Undercutting (or perhaps even qualifying) the high-profile results that bring in these subscribers, new articles, and attention strikes at the very foundation of this success: a journal that publishes high-profile but incorrect papers undermines its own case for subscriptions and author submissions. One does not need to imagine a conspiracy to promote poor science to understand how a conscious or unconscious bias against replication studies might arise under such circumstances.

The reluctance of major journals to publish replication studies embeds this bias in the incentive system that guides authors. As Wilson notes:

[M]ajor journals simply won't publish replications. This is a real problem: in this age of Research Excellence Frameworks and other assessments, the pressure is on people to publish in high impact journals. Careful replication of controversial results is therefore good science but bad research strategy under these pressures, so these replications are unlikely to ever get run. Even when they do get run, they don't get published, further reducing the incentive to run these studies next time. The field is left with a series of “exciting” results dangling in mid-air, connected only to other studies run in the same lab. (2011)

As Rothstein (2014) argues, “The consequences of this problem include the danger that readers and reviewers will reach the wrong conclusion about what the evidence shows, leading at times to the use of unsafe or ineffective treatments”.

Homophily

Thus far, we have been discussing the negative impact of “excellence” largely in terms of its effect on the practice and results of professional researchers. There is, however, another effect of the drive for “excellence”: a restriction in the range of scholars, in the research and scholarship those scholars perform, and in the impact such research and scholarship has on the larger population. Although “excellence” is commonly presented as the most fair or efficient way to distribute scarce resources (Sewitz, 2014), it can in fact have an impoverishing effect on the very practices it seeks to encourage. A funding programme that looks to improve a nation’s research capacity by differentially rewarding “excellence” can have the paradoxical effect of reducing this capacity, by underfunding the very forms of “normal” work that make science function (Kuhn [1962] 2012) or by distracting attention from national priorities and well-conducted research towards the performance measures of North America and Europe (Vessuri et al., 2014). A programme that seeks to reward Humanists by focussing on output in “high impact” academic journals, similarly, paradoxically reduces the impact of those same disciplines by encouraging researchers to address their professional peers rather than broader cultural audiences (Readings, 1996), reducing the domain’s relevance even as its performance of “excellence” improves. A programme of concentration on the “best” academics, in other words, can have the effect of focussing attention on the problems and approaches in which “excellence” can be performed most easily, rather than on those that could benefit most from, or provide the greatest actual impact through, increased attention.

Moreover, a concentration on the performance of “excellence” can promote homophily among scientists themselves. There is strong evidence of systemic bias within the institutions of research against women, under-represented ethnic groups, non-traditional centres of scholarship, and other disadvantaged groups (for a forthright admission of this bias with regard to non-traditional centres of scholarship, see Goodrich, 1945). It follows that an emphasis on the performance of “excellence” (in other words, on being able to convince colleagues that one is even more deserving of reward than others in the same field) will create even stronger pressure to conform to unexamined biases and norms within the disciplinary culture: challenging expectations as to what it means to be a scientist is a very difficult way of demonstrating that you are the “best” at science; it is much easier if your appearance, work patterns, and research goals conform to those with which your adjudicators already have experience. In a culture of “excellence”, the quality of work from those who do not work in the expected “normative” fashion runs a serious risk of being under-estimated and unrecognised (King et al., 2014, 2016; O’Connor and O’Hagan, 2015; University of Arizona Commission on the Status of Women, 2015; this is, in part, an explanation for the systemically under-reported and poorly acknowledged and rewarded work of women “assistants” in many of the great scientific discoveries of the twentieth century). There is a clear case to answer that, absent substantial corrective measures and awareness, a focus on “excellence” will continue to maintain, rather than work to overcome, social barriers to participation in research by currently under-represented groups.

Homophily is in some senses a variant on Merton’s “Matthew effect,” discussed above. It is also a variant on the old argument that existing power structures, populated by those who, it is assumed, already exemplify “excellence”, tend towards conservatism in their processes of evaluation. It underpins calls to reassess the focus of mainstream scholarship, whether that focus is “great men” history, the “Dead White Male” literary “canon”, or the bias towards the ills of the western male patient in medical research. As Barbara Herrnstein Smith says with respect to literary evaluation:

…[a work that “endures”] will also begin to perform certain characteristic cultural functions by virtue of the very fact that it has endured… In these ways, the canonical work begins increasingly not merely to survive within but to shape and create the culture in which its value is produced and transmitted and, for that very reason, to perpetuate the conditions of its own flourishing. (Herrnstein Smith, 1988; emphasis in the original)

In other words, the works that, and the people who, are considered “excellent” will always be evaluated, like the canon that shapes the culture that transmits it, on a conservative basis: past performance by preferred groups helps establish the norms by which future performances of “excellence” are judged. Whether it is viewed as a question of power and justice or simply as an issue of lost opportunities for diversity in the cultural co-production of knowledge, an emphasis on the performance of “excellence” as the criterion for the distribution of resources and opportunity will always be backward-looking: the product of an evaluative process, established by those who came before, that resists disruptive innovation in people as much as in ideas or processes.