As noted above, the search queries returned a raw list of 51 items from Web of Science and 47 from Scopus. Once duplicates were eliminated, we had a unique list of 68 items. The final analysis set comprised 46 references: 44 modeling papers and 2 position papers. In this section we review the 44 modeling papers following the structure of our data charting.

Kinds of models

Several kinds of modeling approaches were represented in our set of modeling papers. Most papers (26 cases) developed simulation experiments by means of agent-based modeling (ABM). 15 cases adopted other kinds of stochastic models, such as evolutionary models or latent Markov models. Lastly, 3 papers developed formal models of peer review that were studied analytically instead of numerically. Thus, ABMs appear to be the preferred method for simulating a peer review system.

We identified two ABMs which were particularly influential and thus deserve special attention: Thurner and Hanel (2011) and Squazzoni and Gandelli (2012a, 2013). After their publication, these two models sparked their own strands of research, as they were further developed and studied in subsequent publications both by the authors and by other scholars. Both models were originally developed to study the potential effects of different behavioral strategies that a journal reviewer could adopt. In both cases results indicate that reviewer behaviors are highly consequential: they affect both the efficacy of peer review (that is, how good the peer review system is at promoting the best submissions and filtering out the worst) and its efficiency (that is, the amount of resources, e.g. time, that its functioning requires).

Because several other papers built on these two models and inherited many of their characteristics, for convenience we will refer to these two ABMs as the root models, and the literature they sparked as their model group, akin to a phylogenetic tree, with branching structures. The existence of two model groups leads to a second general observation we can make about the landscape of models of peer review: it is fragmented. We found no instances of research where different existing models were combined, integrated, or tested against one another. With the exception of the two root models, most peer review models were not further developed after their first publication. This observation echoes the conclusion by Grimaldo et al. (2018, 2018) about the state of the broader literature on peer review, where fragmentation and lack of collaboration and/or knowledge sharing also prevail.

Kinds of modeled systems

Most of the references in our set focused specifically on journal peer review (29 cases, including the two root models). Much less studied are other types of peer review systems: grants (8 cases), conferences (4 cases), career evaluation (2 cases), and evaluation of research institutions (1 case). Lastly, in 2 models peer review was defined so abstractly that it could represent all of the above types of peer review systems. This suggests that peer review systems other than journal peer review are relatively understudied.

Prominent model features

In this section we discuss the main features, or sets of assumptions, that help classify the existing models.

Intrinsic quality

First, we consider the assumption of what we call intrinsic quality. According to this assumption, what is being evaluated in the peer review (be it a paper, a grant proposal, or a CV) has an intrinsic, objectively quantified level of quality. This assumption is important because it embodies a precise perspective on what peer review is for. That is, to assume that submissions to a journal have an intrinsic quality implies that the role of the reviewer is to estimate the intrinsic quality as accurately as possible. By contrast, not assuming any intrinsic quality implies that the submission can only be evaluated subjectively: this means that, when two reviewers widely disagree on the assessment of the submission, they still may both be right.

Most of the models from our set (35 out of 44 modeling papers, including the two root models) make the assumption of intrinsic quality, explicitly or not. They define that what is being evaluated has an attribute, often named quality, expressed in one or more continuous variables. The intrinsic quality is typically assigned randomly during the initialization of the simulation. In these models, the reviewer’s assessment is calculated from the intrinsic quality: the difference between the intrinsic quality and the reviewer’s assessment is determined by some degree of evaluation error or bias.
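The intrinsic quality assumption can be illustrated with a minimal sketch. It is not taken from any of the reviewed models: the uniform quality distribution, the Gaussian error term, and the names `review_score`, `error_sd` and `bias` are all illustrative assumptions.

```python
import random


def review_score(intrinsic_quality, error_sd, bias=0.0, rng=random):
    """Return a reviewer's assessment of a submission.

    Assumption: assessment = intrinsic quality + systematic bias + random error.
    """
    return intrinsic_quality + bias + rng.gauss(0.0, error_sd)


rng = random.Random(42)

# Intrinsic quality is assigned at random during initialization
# (uniform on [0, 1] is an assumption made for this sketch).
submissions = [rng.random() for _ in range(1000)]

# Each submission receives one noisy, unbiased assessment.
scores = [review_score(q, error_sd=0.1, rng=rng) for q in submissions]

# Average distance between the assessment and the intrinsic quality.
mean_abs_error = sum(abs(s - q) for s, q in zip(scores, submissions)) / len(submissions)
```

In a sketch of this kind, the reviewer's accuracy is governed entirely by `error_sd` and `bias`; models without the intrinsic quality assumption have no such reference value against which a review could be called accurate.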

In models without this assumption, the reviewer’s assessment is either entirely random, or based on other objective properties of what is being peer reviewed. For example, applicants can be evaluated by their research experience, or paper submissions by the seniority of the author.

Trade-off between resources for reviewing and preparing own submissions

Reviewers (for a journal or for a grant review panel) are usually scholars themselves. As such, their own work is typically peer reviewed by other colleagues. Some models of peer review need this dual role of scholars to be explicitly modeled. Therefore, these models assume a realistic choice model for scholars: scholars are endowed with a finite amount of resources (e.g. time), and need to choose how much of it to invest in reviewing other scholars’ work, and how much to invest in preparing their own submissions. Models assuming this realistic choice model also assume that the quality of reviews and submissions is a function of the quantity of resources invested in them.
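A minimal sketch of such a choice model follows. The function names, the `review_share` parameter and the concave effort-to-quality mapping are assumptions made for illustration; the reviewed models specify these relations in their own ways.

```python
def allocate(resources, review_share):
    """Split a finite resource budget between reviewing and own submissions.

    review_share is the fraction of resources invested in reviewing.
    Returns (review_effort, submission_effort).
    """
    if not 0.0 <= review_share <= 1.0:
        raise ValueError("review_share must lie in [0, 1]")
    review_effort = resources * review_share
    return review_effort, resources - review_effort


def quality_from_effort(effort, scale=1.0):
    """Map invested effort to quality in [0, 1).

    Assumption: quality increases with effort, with diminishing returns.
    """
    return effort / (effort + scale)


review_effort, submission_effort = allocate(resources=8.0, review_share=0.25)
review_quality = quality_from_effort(review_effort)
submission_quality = quality_from_effort(submission_effort)
```

Because the budget is fixed, any increase in review quality in this sketch comes at the expense of submission quality, which is the trade-off that this class of models is built to study.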

By contrast, models without this assumption abstract from the dual role of scholars. They propose a simpler implementation of a peer review system, where scholars (whose work is to be peer reviewed) and reviewers are two distinct populations. The quality of a submission or a review is either entirely random, or a function of some other quantity (e.g. the seniority of the scholar).

The majority of the modeling papers in our set (30 out of 44) do not use this realistic choice model for scholars. This feature is the key distinction between the two groups of ABMs of peer review: whereas papers based on the root model by Squazzoni and Gandelli (2012a, 2013) assume the realistic choice model, the ones based on Thurner and Hanel (2011) do not.

Social influence

A manuscript, proposal or application and the information that comes with it are not the only aspects guiding a reviewer’s assessments. This is because reviewers do not do their job in complete isolation. In some peer review systems (such as review panels), the assessment is produced collectively by the reviewers through discussion. During a review panel discussion, different social processes are at play, which can determine and bias the final decision on a particular proposal (Derrick 2018; van Arensbergen et al. 2014). Similarly, in other peer review systems, reviewers come to a final assessment after having read other reviewers’ assessment, or after having read the author’s response from a previous round of review. In all of these instances, the reviewer’s final assessment results from influence dynamics, where the initial assessment can be socially influenced by repeated, complex interactions with other reviewers or with the authors, which give rise to non-linear dynamics.

The complexity inherent to social influence dynamics is an aspect of peer review that is absent in most of the modeling papers in our set (41 out of 44, including the root models). In these models, for the sake of simplicity, the reviewers are assumed to act independently from their social environment, and their assessment is either randomly produced, or solely based on the properties of what they are evaluating.

Only three papers modeled at least some aspects of complex social influence dynamics. Zhu et al. (2016), for instance, test the predicted effect of including two phases in the peer review process: a reviewer discussion and the author’s feedback. The reviewer discussion consists of reviewers adjusting their own score based on the scores and confidence of all reviewers; the author feedback is the opportunity for authors to improve the quality (and hopefully then the scores) of their submission. Lyon and Morreau (2018) developed an ABM to study wisdom-of-crowds effects in expert panels. In Flynn and Moses (2012), social influence affects which papers will be reviewed by a reviewer: whether (and how much) the reviewer agrees with other reviewers on the assessment of a conference submission is consequential for which submissions to review next.
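To make the idea of a reviewer discussion concrete, the sketch below lets each reviewer move part of the way towards a confidence-weighted consensus score. This is one simple way to implement score adjustment based on the scores and confidence of all reviewers; it is not the update rule of Zhu et al. (2016), and the `openness` parameter is a hypothetical name introduced here.

```python
def discuss(scores, confidences, openness=0.5):
    """Return the reviewers' scores after one round of discussion.

    Assumption: every reviewer moves a fraction `openness` of the way
    towards the confidence-weighted mean of all scores.
    """
    total_confidence = sum(confidences)
    consensus = sum(s * c for s, c in zip(scores, confidences)) / total_confidence
    return [s + openness * (consensus - s) for s in scores]


# Two equally confident reviewers who initially disagree by two points.
after_discussion = discuss([2.0, 4.0], [1.0, 1.0], openness=0.5)
```

With `openness=0.0` the reviewers act independently, which corresponds to the simplifying assumption made by most models in the set.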

Empirical calibration and validation

Simulation models can make use of empirical data in two ways: in their calibration (when model parameters are set to an empirically-observed value) and validation (when the model predictions are empirically tested) (Hassan et al. 2010). Empirical calibration and validation of a model can be desirable for different reasons. On the one hand, calibration can reduce the parameter space that needs to be explored and can tailor the model to the social environment that is being studied. Validation, on the other hand, can provide insight into the accuracy of the model’s predictions, and thus on the goodness of our understanding of the modeled social process.

While the modeling community advocates for the use of empirical data in modeling (Hedström and Manzo 2015), few of the modeling papers on peer review use any. Out of 44 modeling papers, only 12 contained at least one empirically calibrated model parameter, and only 6 compared at least some of the model’s predictions to empirical data.

Research questions

Here we group the 44 modeling papers by their main research questions, or aspect of peer review that was investigated. We chose the aspects that were examined in several papers and with different modeling methods or theoretical frameworks. All the following aspects have one commonality: they often emerge as crucial in determining the efficacy and efficiency of a peer review system.

The aggregation of reviewers’ assessments

Peer review typically relies on the reviews of two or more reviewers. In the review process, the different reviews need to be synthesized into one single score, or one single decision (e.g. to accept or to reject): hence the need for a decision rule, or some other method of aggregation of the assessments made by different reviewers into an atomic piece of information.

Four of the modeling papers tested different ways to aggregate reviewers’ assessments. Linton (2016) compares two aggregation rules: a standard averaging rule, where the final decision is the mean of the scores from all reviews, and a rule based on the Black–Scholes model (Black and Scholes 1973). The latter rule of aggregation predicts a higher acceptance rate for high-risk high-gain submissions (that is, submissions where the reviews are in disagreement). Esarey (2017) ran numerical simulations to compare other aggregation rules: acceptance upon unanimous approval by reviewers (including or excluding the editor’s opinion); acceptance upon approval by the majority of reviews (including or excluding the editor’s); and unilateral editor decision based on the average review score. The study also examines the effects of a desk rejection phase prior to all of the above aggregation rules. For all the above rules, the main finding is that the editor’s role and random noise are the main factors in determining the outcome of the selection process.
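The difference between such decision rules can be seen in a short sketch. The three functions below are generic textbook versions of unanimity, majority and mean-score rules, written for illustration only; the function names and the `threshold` parameter are assumptions, and the code does not reproduce the simulations of Linton (2016) or Esarey (2017).

```python
def unanimous(votes):
    """Accept only if every reviewer recommends acceptance."""
    return all(votes)


def majority(votes):
    """Accept if more than half of the reviewers recommend acceptance."""
    return sum(votes) > len(votes) / 2


def mean_score_rule(scores, threshold):
    """Accept if the average review score reaches the threshold."""
    return sum(scores) / len(scores) >= threshold


# One submission, three reviewers: two positive votes and one negative.
votes = [True, True, False]
decisions = {
    "unanimous": unanimous(votes),
    "majority": majority(votes),
    "mean_score": mean_score_rule([4, 2, 3], threshold=3.0),
}
```

The same set of reviews leads to rejection under unanimity and to acceptance under the other two rules, which is why the choice of aggregation rule can matter for controversial submissions.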

Righi and Takács (2017) build on the root model by Squazzoni and Gandelli (2012a, 2013). They study the alternatives that the editor of a journal has available when the reviewers disagree on a manuscript submission. Specifically, the editor can either reject the paper, accept the paper, or follow the advice of one of the reviewers. In the latter case, the editor can choose to what degree a reviewer’s reputation matters when choosing which reviewer to trust. The model shows that reviewer reputation does not contribute to better quality reviews or submissions. Surprisingly, the acceptance of controversial submissions is predicted to indirectly improve the quality of submissions: by inducing an oversupply of publishable manuscripts, this aggregation rule forces the editor to rely on the author’s reputation in order to make a decision which, in turn, incentivizes the authors to improve their reputation by investing more in submitting good quality manuscripts.

Lyon and Morreau (2018) are concerned with the composition of review committees, groups of reviewers who grade documents using scores and grades. Reviewers may have a different understanding (or interpretation) of scores and grades, and simulations predict that diversity in reviewers’ interpretations can foster the accuracy of the aggregated score.

Allocation of submissions to reviewers

In a peer review process, a key step is the selection of experts to be invited to act as reviewers. In some peer review systems, the approach is top-down: there is a person or persons (e.g. program officer, conference chair, journal editor) who oversees finding and inviting a suitable potential reviewer for each given submission or proposal. In some other cases (e.g. some conferences) there is a bidding system. In bidding systems, a pool of potential reviewers is invited to choose among (‘bid on’) the submissions available for review, and a procedure is put in place to match submissions and reviewers based on the reviewers’ expertise and preferences.

Both the top-down and the bidding approach can be implemented in various ways. This raises the question: which approach has the most desirable outcome and under what circumstances? Some papers in our set have used simulations to answer this question.

Top-down allocation rules

In most of the models where the allocation is explicitly modeled, it is assumed to be random: scientists have a uniform probability to be selected as reviewers by journal editors or program officers (D’Andrea and O’Dwyer 2017; Grimaldo and Paolucci 2013; Roebber and Schultz 2011; Squazzoni et al. 2012a, 2012c, 2013). However, two papers examine alternative rules of allocation. Cook et al. (2005) test the efficacy of alternative heuristics for matching reviewers and submissions in one specific case: when reviewers are asked to supply their assessment in the form of an (ordinal) rank of submissions.

Cabotà et al. (2014b) compare different allocation rules based on the reputation (or skill level) of authors and reviewers. In their alternative scenarios, submissions are sent out for review to reviewers with the same reputation as the authors, to reviewers with a lower reputation, or to reviewers with a higher reputation; a control treatment, where reviewers are chosen randomly, is also examined. The outcome variables capture the efficacy of the peer review process, its efficiency, and the inequality in the distribution of resources across scholars. Results suggest that the strongest difference between the allocation rules emerges when reviewers are systematically biased against authors with a better reputation than their own: in this case, choosing reviewers with a reputation higher than the author’s produces less biased reviews and thus improves the efficacy of peer review.
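The four allocation scenarios can be sketched as a single selection function. This is a simplified illustration, not the implementation by Cabotà et al.: reputations are plain numbers, `pick_reviewer` is a hypothetical name, "same reputation" is approximated by the closest available value, and falling back to the whole pool when no reviewer qualifies is an assumption.

```python
import random


def pick_reviewer(author_reputation, reviewer_reputations, rule, rng=random):
    """Pick one reviewer reputation from the pool according to an allocation rule."""
    pool = list(reviewer_reputations)
    if rule == "random":
        candidates = pool
    elif rule == "higher":
        candidates = [r for r in pool if r > author_reputation]
    elif rule == "lower":
        candidates = [r for r in pool if r < author_reputation]
    elif rule == "same":
        # Approximate "same reputation" by the smallest available distance.
        best = min(abs(r - author_reputation) for r in pool)
        candidates = [r for r in pool if abs(r - author_reputation) == best]
    else:
        raise ValueError(f"unknown allocation rule: {rule}")
    if not candidates:
        # Assumption: fall back to the whole pool if nobody qualifies.
        candidates = pool
    return rng.choice(candidates)


pool = [0.2, 0.5, 0.9]
chosen_higher = pick_reviewer(0.5, pool, "higher")
chosen_lower = pick_reviewer(0.5, pool, "lower")
chosen_same = pick_reviewer(0.5, pool, "same")
```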

Allocation by bidding

Two papers focus on bidding systems. Allesina (2012) proposes an allocation system for peer review in journals which is based on a public repository of manuscripts. Authors who want to submit their own work to the public repository first have to review three other submissions of their choice. After a submission is peer reviewed, journals compete to publish it. This innovative system, it is argued, can help address some of the shortcomings of a traditional peer review system (with a top-down allocation rule).

Flynn and Moses (2012) focus on the choices available to a member of a program committee (PC) who needs to bid on the submissions that she intends to review. If the PC member wants to review the best submissions, how can she identify them and make the right bids? The authors argue that a solution to this question can be found in a search algorithm, known in computer science as the ‘ant colony optimization’ algorithm (Dorigo et al. 2006).

Role of the editor

In all peer review systems, there are individuals who have the final say on whether submissions, proposals or applications are to be accepted or rejected. The role of these individuals may be particularly crucial in a peer review process in two ways: through personal behavior, or through policies. Their personal attitudes can directly influence which submissions to desk reject before the peer review process, and when to follow or disregard the reviewers’ recommendation (editor behavior). These individuals can also enact policies to change the peer review process. Examples are the selection of aggregation rules and allocation rules (which we already discussed), or the selection of how many reviewers to invite (see e.g. Kovanis et al. 2016).

The number of reviewers is one of the main manipulations in Bianchi and Squazzoni (2015, based on the root model by Squazzoni and Gandelli 2012a, 2013). Here, increasing the number of reviewers (n = 1 through 3) is shown to improve the accuracy of the peer review process, at the cost of increasing the amount of resources invested in the peer review.

In the model by Zhu et al. (2016), it is the program chair of a conference who can enact different policies. Various policies are explored: (a) the choice for a single blind vs. double blind review process; (b) the choice to add the chair’s own evaluation of the submission to the reviewers’; (c) the choice to allow reviewers to be socially influenced by their peers via reviewer discussion; and (d) the choice to allow authors to improve their submission after author feedback. By showing how all four of these policies could impact the peer review process, Zhu et al. argue that the editorial choices of the program chair are of paramount importance.

In a similar vein, Wang et al. (2016) and D’Andrea and O’Dwyer (2017) extend the root model by Thurner and Hanel (2011) to include an array of choices available to a journal editor (through modifying editorial policies and enacting decisions in their personal role). The editor can affect the process structure by choosing an aggregation rule, an allocation rule, and the number of reviewers to be involved in the process; she can consult a tiebreaking referee when the initial reviews are in disagreement; desk-reject blatantly low quality submissions; blacklist selfish referees (referees who systematically reject submissions that they perceive as competition); and/or allow authors to revise and resubmit their manuscript.

Mrowinski et al. (2016, 2017) show how an evolutionary algorithm can be used to optimize editorial strategies by (1) minimizing the review time, and (2) keeping constant the number of reviewers involved. The model takes as input two editorial choices: how many reviewers to try to involve, and the target number of reviews. Then, based on the current state of the review process, the model can inform the editor as to how many new potential reviewers to invite, and when.

When inviting reviewers, the editor may also consider the larger review network (i.e., the network of which scholar reviews whose work). Waters et al. (2016) propose a model to study how network properties affect the efficacy of the review process. Their preliminary results identify the conditions under which clustering in the review network may have an adverse effect on the efficacy of the review process.

Lastly, Roebber and Schultz (2011) study how authors can optimally respond to editorial strategies. The model compares two strategies that scholars can follow when applying for funding: striving for quantity (submitting many proposals) or quality (submitting fewer, but of better quality). The model they develop makes it possible to test which strategy is the most effective depending on the editorial policy put in place by the funding program officer. Specifically, the funding officer has three choices to make: how many reviewers to invite for review; whether to base a decision on the quality of the proposal, or on the reputation of the applicant; and whether or not to fund only proposals that received unanimously positive reviews. Results show that in most cases applicants are better off prioritizing quantity over quality in their proposals. There is only one case where prioritizing quality is the winning strategy: when the editorial policy requires many reviews (i.e. > 4) and the reviews must all be positive.

Reviewer behavior

Reviewers are the core of any peer review system. It follows that reviewer behavior and social influence play a vital role in the peer review process. Reviewers’ own attitudes and biases can come into play when reviewing a submission. Following this line of thought, Sigelman and Whicker (1987) study two dimensions of reviewers’ attitude: their severity and conventionality. Severity refers to the tendency to give generally positive (or negative) reviews; conventionality is the tendency to be harsher towards highly innovative submissions. Severity and conventionality show no effect on the effectiveness of the peer review process—a non-result that, the authors stress, may not be robust given a different parameterization of the two variables (Sigelman and Whicker 1987: 506).

The root model by Thurner and Hanel (2011) examines the effects of different reviewer strategies on the efficacy of peer review; some of these effects are examined in subsequent research (e.g. Wang et al. 2016), in some cases with adjustments (e.g. D’Andrea and O’Dwyer 2017). In particular, reviewers are considered accurate if they can correctly differentiate between good and bad quality submissions; inaccurate when their assessment is given at random; selfish if they adopt the strategic behavior of rejecting contributions of a higher quality than their own work while being accurate otherwise; altruist if they accept all contributions; and misanthropist if they reject all. Simulations consistently show that inaccurate or selfish reviewers are especially detrimental to the peer review process, as they lower the average quality of the published papers. Paolucci and Grimaldo (2014) replicate this finding and identify simulation conditions under which selfish reviewers are less detrimental, or even slightly beneficial.
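The five reviewer types described above can be written down as simple decision rules. The sketch below follows the verbal definitions given in the text and is not the original implementation by Thurner and Hanel (2011); in particular, the fixed acceptance `threshold`, the coin flip for inaccurate reviewers, and the function name `decide` are illustrative assumptions.

```python
import random


def decide(reviewer_type, submission_quality, reviewer_quality, threshold, rng=random):
    """Return True if the reviewer recommends acceptance."""
    if reviewer_type == "altruist":
        return True  # accepts all contributions
    if reviewer_type == "misanthropist":
        return False  # rejects all contributions
    if reviewer_type == "inaccurate":
        # Assessment given at random (a fair coin is an assumption).
        return rng.random() < 0.5
    # Accurate judgment: the submission is good enough if it reaches the threshold.
    good_enough = submission_quality >= threshold
    if reviewer_type == "accurate":
        return good_enough
    if reviewer_type == "selfish":
        # Rejects work better than the reviewer's own; accurate otherwise.
        return good_enough and submission_quality <= reviewer_quality
    raise ValueError(f"unknown reviewer type: {reviewer_type}")


# A strong submission (0.9) reviewed by a scholar whose own work is at 0.7.
accurate_verdict = decide("accurate", 0.9, 0.7, threshold=0.5)
selfish_verdict = decide("selfish", 0.9, 0.7, threshold=0.5)
```

In this example the accurate reviewer accepts the strong submission while the selfish reviewer rejects it, which illustrates the mechanism by which selfish reviewers can lower the average quality of what gets published.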

The root model by Squazzoni and Gandelli (2012a, 2013) and some follow-up papers (Bianchi and Squazzoni 2015; Cabotà et al. 2013, 2014a, b) also test scenarios with varying degrees of reviewer accuracy. A control treatment where reviewers of manuscripts give accurate reviews is compared to (1) treatments where reviewers have an increasing probability of giving inaccurate reviews, (2) a treatment where reviewers are only accurate if their own manuscript was accepted (in Bianchi and Squazzoni 2015), and (3) a treatment with some conformist reviewers (i.e., reviewers who imitate other reviewers) (Cabotà et al. 2014a). These results show that, compared to the control treatment, such reviewer strategies can negatively affect both the efficacy and efficiency of peer review.

In a study looking at grant applications, Roebber and Schultz (2011) manipulate the proportion of reviewers who give an accurate vs. inaccurate (or ‘hasty’) review. Their results show that, under some conditions, inaccurate reviewers can also have a beneficial effect: they can reward applicants who apply less often, but with higher quality proposals.

Lastly, Sobkowicz (2015, 2017) proposes a simulation model of a scientific community where scholars compete for grants. The model highlights the role of reviewers’ tendency to favor proposals submitted by their close collaborators (hereafter: in-group favoritism), or to switch to more promising scientific domains. In-group favoritism in particular, even if not very prevalent, is predicted to distort the selection process through peer review.

Peer review systems

Many modeling papers in our set focus on peer review systems as a whole. For instance, Tan et al. (2018) focus on how a journal peer review system reacts to an external shock, such as an increase in the number of received submissions. Their model shows that the number of submissions positively correlates with journal quality, but only up until a critical saturation level. Beyond this level, more submissions result in lower journal quality.

Fang (2011) models a different kind of exogenous constraint on peer review: over-competition induced by scarcity of funding. This simulation model shows that over-competition in science can, alone, trigger a cascade where some scientific fields go extinct while a few dominant ones become monopolistic. This dynamic would not be driven by the scientific quality of the fields, nor by the goodness of the research, but solely by the self-reinforcing dynamics in the reproduction of scientists and scientific fields.

Other papers align and compare different existing peer review systems. They do so by simulating the alternative systems under the same conditions, and then measuring and comparing their efficacy and efficiency (Dignum and Dignum 2015; Zhou et al. 2016). These simulation models show how different journal peer review systems (i.e. single blind, double blind, open peer review or glance review) differ in terms of their efficacy. A more complete and recent comparison between journal peer review systems (Kovanis et al. 2016; Kovanis et al. 2017) found that while most systems are shown to outperform a conventional peer review system in at least some ways, cascade peer review emerges as the best compromise.

Furthermore, scholars have used simulation models to benchmark and develop new variants of existing peer review systems, or new systems altogether. Abramo et al. (2011), for instance, focus on national research assessments whereby a national agency uses peer review to obtain a ranking of researchers or research institutions based on their research quality or productivity. Using empirical data from the Italian national assessment, the authors show that a ranking process constructed with bibliometric indicators can outperform (and be cheaper than) the traditional process based on peer review.

Grimaldo and co-authors developed two changes to a standard peer review system: reviewer accountability, also called ‘disagreement control’ (Grimaldo and Paolucci 2012, 2013; Grimaldo et al. 2012), and a reputation system (Grimaldo et al. 2018b). The idea of reviewer accountability hinges on the notion that repeated disagreement between reviewers may be a signal of poor reviews, or of selfish reviewer behavior. Thus, reviewer accountability can be implemented in traditional peer review systems (e.g. in a conference or for a journal) by banning low-quality reviewers; in other words, reviewers who consistently disagree with other reviewers are blacklisted and prevented from reviewing again for the same outlet. The second alternative, the reputation system, is explored as a viable alternative to peer review. Here, author reputation and peer review can both be used as means to filter out manuscripts which are unworthy of publication and to identify the ones which are worthy. By means of an ABM, the authors show the conditions under which reviewer accountability and reputation can be equally or even more effective and efficient than a traditional peer review system.
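A minimal sketch of disagreement control is given below. It only illustrates the general idea of banning reviewers who consistently disagree with their co-reviewers; the function name, the disagreement-rate cutoff `max_rate` and the minimum track record `min_reviews` are hypothetical choices, not parameters of the models by Grimaldo and co-authors.

```python
def blacklisted(review_counts, disagreement_counts, max_rate=0.5, min_reviews=3):
    """Return the set of reviewers to ban from reviewing for the outlet.

    review_counts: reviewer -> number of reviews delivered.
    disagreement_counts: reviewer -> number of reviews that disagreed
        with the other reviewers of the same submission.
    """
    banned = set()
    for reviewer, n_reviews in review_counts.items():
        if n_reviews < min_reviews:
            continue  # too short a track record to judge (assumption)
        rate = disagreement_counts.get(reviewer, 0) / n_reviews
        if rate > max_rate:
            banned.add(reviewer)
    return banned


review_counts = {"r1": 10, "r2": 10, "r3": 2}
disagreement_counts = {"r1": 8, "r2": 2, "r3": 2}
banned = blacklisted(review_counts, disagreement_counts)
```

Here only `r1` is banned: `r2` disagrees too rarely, and `r3` has not yet delivered enough reviews to be judged.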

Bias in peer review

Biases can lead to scientific outputs or proposals or careers succeeding or failing on grounds unrelated to quality. For this reason, we can argue that biases are a systematic impairment of the efficacy of a peer review process.

Some of the models of peer review focus specifically on bias. Some models try to explain how biases come about, others explore the consequences of the actions of biased individuals in the peer review process, and a few try to develop solutions.

(a) Where bias comes from

Stinchcombe and Ofshe (1969), the oldest paper in our set, attempt to explain the emergence of evaluation bias. With a simple numerical test of a probabilistic model they show that two core conditions are enough to explain why nearly half of publishable-quality journal submissions are in fact rejected during the peer review process. The conditions are that (1) journals have a very low acceptance rate, and (2) reviewers’ estimates of the quality of a manuscript are not perfect.
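The interplay of the two conditions can be explored with a small Monte Carlo sketch. It is not the probabilistic model of Stinchcombe and Ofshe (1969): the normally distributed quality, the definition of publishable work as the top share of submissions by true quality, and all parameter values are illustrative assumptions.

```python
import random


def rejected_share_of_publishable(n=20000, acceptance_rate=0.2,
                                  publishable_share=0.2, noise_sd=1.0, seed=0):
    """Estimate the share of publishable submissions that end up rejected."""
    rng = random.Random(seed)
    quality = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Reviewers perceive the true quality with some random error.
    perceived = [q + rng.gauss(0.0, noise_sd) for q in quality]

    n_publishable = round(n * publishable_share)
    n_accepted = round(n * acceptance_rate)

    # Publishable submissions: the best ones by true quality (assumption).
    by_quality = sorted(range(n), key=quality.__getitem__, reverse=True)
    publishable = set(by_quality[:n_publishable])

    # The journal accepts the best submissions by perceived quality.
    by_perceived = sorted(range(n), key=perceived.__getitem__, reverse=True)
    accepted = set(by_perceived[:n_accepted])

    return len(publishable - accepted) / n_publishable


perfect_review = rejected_share_of_publishable(noise_sd=0.0)
noisy_review = rejected_share_of_publishable(noise_sd=1.0)
scarce_slots = rejected_share_of_publishable(acceptance_rate=0.1, noise_sd=0.0)
```

With perfect reviewers and enough publication slots, no publishable submission is rejected. Adding evaluation error, or lowering the acceptance rate below the share of publishable work, is each sufficient to produce rejections of publishable submissions.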

Evaluation bias is even stronger when reviewers are not only somewhat inaccurate, but also biased. In a simulation study, Day (2015) compares how the acceptance rate varies as a function of the introduction of bias against some of the submissions. Even small amounts of bias are shown to predict a large and significant detrimental effect on the success rate of applicants who are discriminated against.

Bornmann et al. (2009) investigate the origins of gender bias in PhD and postdoc fellowship applications. Expanding their previous modeling work (Bornmann et al. 2008), the authors represent the peer review of fellowship applications with a hidden Markov model, which makes it possible to estimate the stability of the review scores through the different review stages of the application process. Their results are twofold. On the one hand, the assessment obtained during the first stage of review emerges as the most important predictor of the success of an application, and in this stage, there appears to be no gender difference. On the other hand, PhD applications show significant gender differences in the subsequent stages, where male applicants are systematically evaluated more favorably.

(b) The consequences of biased individuals

The model by Squazzoni and Gandelli shows the potential consequences of reviewers’ bias against submissions from authors with a different status or productivity level (Squazzoni and Gandelli 2012b, c—based on the root model by Squazzoni and Gandelli 2012a, 2013). This bias is shown to moderate the effect of reviewer’s accuracy and ultimately impact the efficacy and efficiency of the peer review process. Further work shows that evaluation bias is also affected by the interplay between the number of reviewers and their accuracy (Bianchi and Squazzoni 2015).

Other authors study the consequences of biased editors and reviewers. Particularly negative for the efficacy of peer review is the bias against highly innovative contributions, modeled as a tendency to favor conventional work and to promote a reviewer’s favored topics (Sigelman and Whicker 1987; Sobkowicz 2015). Similarly, Wang et al. (2016) explore consequences of ingroup favoritism by editors and selfish behavior by reviewers.

(c) A possible remedy to biased reviewers

Only three papers address potential remedies. One proposed solution is to introduce a system of reviewer accountability: simulations suggest that this can be achieved by banning reviewers who prove unreliable (Grimaldo and Paolucci 2012, 2013; Grimaldo et al. 2012). A second solution is proposed by Sobkowicz as a remedy to reviewers’ bias against highly innovative contributions, and consists in appointing an additional reviewer for submissions which prove controversial among reviewers (Sobkowicz 2015).