Objectives

To evaluate the reliability of RobotReviewer's risk of bias judgments.

Study Design and Setting

In this prospective cross-sectional evaluation, we used RobotReviewer to assess risk of bias in 1,180 trials. We quantified reliability between RobotReviewer and human reviewers using Cohen's kappa coefficient and calculated sensitivity and specificity. We investigated differences in reliability by risk of bias domain, topic, and outcome type using the chi-square test in meta-analysis.
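As a minimal sketch of the agreement statistic used above, the following computes Cohen's kappa, the chance-corrected agreement between two raters, from paired judgments. The rater labels and example data are hypothetical, not taken from the study:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: proportion of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical "low"/"high" risk of bias judgments for eight trials.
human = ["low", "low", "high", "low", "high", "high", "low", "low"]
robot = ["low", "high", "high", "low", "high", "low", "low", "low"]
print(round(cohens_kappa(human, robot), 2))  # → 0.47
```

A kappa near 0.47 would fall in the "moderate" band of the conventional interpretation scale (slight, fair, moderate, substantial) used in the Results.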

Results

Reliability (95% CI) was moderate for random sequence generation (0.48 [0.43, 0.53]), allocation concealment (0.45 [0.40, 0.51]), and blinding of participants and personnel (0.42 [0.36, 0.47]); fair for overall risk of bias (0.34 [0.25, 0.44]); and slight for blinding of outcome assessors (0.10 [0.06, 0.14]), incomplete outcome data (0.14 [0.08, 0.19]), and selective reporting (0.02 [−0.02, 0.05]). Reliability for blinding of participants and personnel (P < 0.001), blinding of outcome assessors (P = 0.005), selective reporting (P < 0.001), and overall risk of bias (P < 0.001) differed by topic. Sensitivity and specificity (95% CI) ranged from 0.20 (0.18, 0.23) to 0.76 (0.72, 0.80) and from 0.61 (0.56, 0.65) to 0.95 (0.93, 0.96), respectively.

Conclusion

Risk of bias appraisal is subjective. Compared with reliability between author groups, RobotReviewer's reliability with human reviewers was similar for most domains and better for allocation concealment, blinding of participants and personnel, and overall risk of bias.