Study selection and general characteristics of reports

The screening process is summarized in a flow diagram (Fig. 1). Of the 4312 records retrieved, we finally included 46 reports: 39 research articles, 3 editorials, 2 information guides, 1 letter to the editor, and 1 study available only as an abstract (excluded studies are listed in Additional file 2; included studies are listed in Additional file 3).

Fig. 1 Study selection flow diagram

General characteristics of the tools

In the 46 reports, we identified 24 tools, including 23 scales and 1 checklist. The tools were developed from 1985 to 2017. Four tools had from 2 to 4 versions [22,23,24,25]. Five tools were used as an outcome in a randomized controlled trial [23, 25,26,27,28]. Table 3 lists the general characteristics of the identified tools. Table 4 presents a more complete descriptive summary of the tools’ characteristics, including types and measures of validity and reliability.

Table 3 Main characteristics of the included tools

Table 4 Descriptive characteristics of tools used to assess the quality of a peer review report

Six scales consisted of a single item enquiring into the overall quality of the peer review report, all of them directly asking users to score the overall quality [22, 25, 29,30,31,32]. These tools assessed the quality of a peer review report by using 1) a 4- or 5-point Likert scale (n = 4); 2) a categorical rating of ‘good’, ‘fair’ or ‘poor’ (n = 1); or 3) a restricted scale from 80 to 100 (n = 1). Seventeen scales and one checklist had multiple items, ranging in number from 4 to 26. Of these, 10 gave the same weight to each item [23, 24, 27, 28, 33,34,35,36,37,38]. The overall quality score was the sum of the item scores (n = 3), the mean of the item scores (n = 6), or a summary score (n = 11) (for definitions see Table 1). Three scales reported more than one way to assess overall quality [23, 24, 36]. The scoring system instructions were not defined in 67% of the tools.
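
To make the difference between these aggregation approaches concrete, the sketch below illustrates them with hypothetical item scores; it is not taken from any of the included tools, and the ‘summary score’ is assumed here to be a single overall rating given in addition to the item scores (see Table 1 for the review’s definitions).

```python
# Minimal sketch of the scoring approaches described above, using hypothetical
# per-item ratings on a 5-point scale; not drawn from any included tool.
item_scores = [4, 3, 5, 4]

sum_score = sum(item_scores)                       # sum of the item scores: 16
mean_score = sum(item_scores) / len(item_scores)   # mean of the item scores: 4.0
summary_score = 4  # assumed: a single overall rating recorded alongside the items

print(sum_score, mean_score, summary_score)
```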

None of the tools reported a definition of peer review report quality, and only one described the tool development process [39]. The first version of this tool was designed by a development group of four researchers and three editors. It was based on a tool used in an earlier study that had been developed by reviewing the literature and interviewing editors. Subsequently, the tool was modified by rewording some questions after group discussions, and a guideline for using the tool was drawn up.

Only 3 tools assessed and reported a validation process [39,40,41]. The types of validity assessed included face validity, content validity, construct validity, and preliminary criterion validity. Face and content validity involved either a single editor and author or a group of researchers and editors. Construct validity was assessed either with multiple regression analysis using discriminant criteria (reviewer characteristics such as age, sex, and country of residence) and convergent criteria (training in epidemiology and/or statistics), or by comparing the overall assessment of the peer review report by authors with an assessment of specific components (n = 4–8) of the peer review report by editors or authors. Preliminary criterion validity was assessed by comparing grades obtained by an editor with those obtained by an editor-in-chief using an earlier version of the tool. Reliability was assessed in 9 tools [24,25,26,27, 31, 36, 39, 41, 42]; all reported inter-rater reliability, and 2 also reported test-retest reliability. One tool reported internal consistency measured with Cronbach’s alpha [39].
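
For reference, the internal consistency statistic mentioned above, Cronbach’s alpha for a tool with $k$ items, is commonly computed as

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right),$$

where $\sigma^{2}_{Y_i}$ is the variance of item $i$ and $\sigma^{2}_{X}$ is the variance of the total score; the cited report does not state which variant of the statistic was used.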

Quality components of the peer review reports considered in the tools with more than one item

We extracted 132 items from the 18 tools. One item, which asked for the percentage of co-reviews the reviewer had graded, was not included in the classification because it measured reviewer performance rather than a component of peer review report quality.

We organized the key concepts from each item into ‘topic-specific matrices’ (Additional file 4), identifying nine main domains and 11 subdomains: 1) relevance of the study (n = 9); 2) originality of the study (n = 5); 3) interpretation of the study results (n = 6); 4) strengths and weaknesses of the study (n = 12) (general, methods and statistical methods); 5) presentation and organization of the manuscript (n = 8); 6) structure of the reviewer’s comments (n = 4); 7) characteristics of the reviewer’s comments (n = 14) (clarity, constructiveness, detail/thoroughness, fairness, knowledgeability, tone); 8) timeliness of the review report (n = 7); and 9) usefulness of the review report (n = 10) (decision making and manuscript improvement). The total number of tools corresponding to each domain and subdomain is shown in Fig. 2. Explanations and examples of all domains and subdomains are provided in Table 5. Some domains and subdomains were considered in most tools, such as whether the reviewer’s comments were detailed/thorough (n = 11) and constructive (n = 9), whether the reviewer commented on the relevance of the study (n = 9), and whether the peer review report was useful for manuscript improvement (n = 9). However, other items were rarely considered, such as whether the reviewer commented on the statistical methods (n = 1).

Fig. 2 Frequency of quality domains and subdomains

Table 5 Explanations and examples of quality domains and subdomains

Clustering analysis among tools

We created a domain profile for each tool. For example, the tool developed by Justice et al. consisted of 5 items [35]. We classified three items under the domain ‘Characteristics of the reviewer’s comments’, one under ‘Timeliness of the review report’ and one under ‘Usefulness of the review report’. Accordingly, the domain profile (represented by the proportion of items in each domain) for this tool was 0.6:0.2:0.2 for these three domains and 0 for the remaining ones. Hierarchical clustering on the matrix of Euclidean distances among domain profiles yielded five main clusters (Fig. 3).
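
As an illustration of this step (not the authors’ actual code), the sketch below builds domain profiles for a few tools and applies agglomerative clustering to the Euclidean distance matrix. Only the Justice et al. profile is taken from the text; the other profiles and the linkage criterion are assumptions made for illustration.

```python
# Illustrative sketch of the clustering described above. The full item
# classifications are in Additional file 4, and the linkage method is not
# stated in the text, so "average" linkage is an assumption.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

DOMAINS = [
    "relevance", "originality", "interpretation", "strengths_weaknesses",
    "presentation", "structure_of_comments", "characteristics_of_comments",
    "timeliness", "usefulness",
]

# Each tool is represented by the proportion of its items in each of the nine
# domains. Justice et al.: 3 of 5 items on characteristics of the reviewer's
# comments, 1 on timeliness, 1 on usefulness (from the text); the other rows
# are hypothetical placeholders.
profiles = {
    "Justice et al.":        [0, 0, 0, 0,    0, 0, 0.6, 0.2, 0.2],
    "Tool B (hypothetical)": [0.25, 0, 0, 0, 0, 0, 0.5, 0,   0.25],
    "Tool C (hypothetical)": [0, 0, 0, 0.5,  0, 0, 0,   0,   0.5],
}
X = np.array(list(profiles.values()))

# Euclidean distances among domain profiles, then hierarchical clustering.
dist = pdist(X, metric="euclidean")
tree = linkage(dist, method="average")

# Cut the dendrogram into a fixed number of clusters (five in the review;
# two here because only three example tools are shown).
labels = fcluster(tree, t=2, criterion="maxclust")
for name, lab in zip(profiles, labels):
    print(f"{name}: cluster {lab}")
```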

Fig. 3 Hierarchical clustering of tools based on the nine quality domains. The figure shows which quality domains are present in each tool. Each slice of the chart represents a tool, and each slice is divided into sectors indicating quality domains (in different colours). The area of each sector corresponds to the proportion of that domain within the tool. For instance, the “Review Rating” tool consists of two domains: Timeliness, encompassing 25% of its items, and Characteristics of reviewer’s comments, occupying the remaining 75%. The blue lines starting from the centre of the chart show how the tools are divided into the five clusters. Clusters #1, #2 and #3 are sub-nodes of a major node grouping all three, meaning that the tools in these clusters have similar domain profiles compared with the tools in clusters #4 and #5

The first cluster consisted of 5 tools developed from 1990 to 2016. All of these tools included at least one item in the characteristics of the reviewer’s comments domain, representing at least 50% of each domain profile. The second cluster contained 3 tools developed from 1994 to 2006, characterized by incorporating at least one item in each of the usefulness and timeliness domains. The third cluster included 6 tools developed from 1998 to 2010 and exhibited the most heterogeneous mix of domains. These tools were distinct from the rest because they encompassed items related to interpretation of the study results and originality of the study; moreover, this cluster included two tools with different versions and variations. The first, second, and third clusters were linked together in the hierarchical tree, which grouped tools with at least one quality component in the characteristics of the reviewer’s comments domain. The fourth cluster comprised 2 tools developed from 2011 to 2017, each with at least one component in the strengths and weaknesses domain. Finally, the fifth cluster included 2 tools developed from 2009 to 2012 that consisted of the same 2 domains. The fourth and fifth clusters were separated from the rest in the hierarchical tree, grouping tools that covered only a few domains.