Given that there are far more applications than can be funded, and that only the best ones are even discussed, we would hope that study sections can agree on the grades those applications receive, especially at the top end of the spectrum.

In this study of the system, researchers obtained 25 funded proposals from the National Cancer Institute. Sixteen of them were considered “excellent,” as they were funded the first time they were submitted. The other nine were funded on resubmission — grant applications can be submitted twice — and so can still be considered “very good.”

They then set up mock study sections, recruiting researchers to serve on them just as researchers do on actual study sections. Each recruit was assigned grant applications, which were reviewed as they would be for the N.I.H. The recruits were then brought together in groups of eight to 10 to discuss and score the proposals as if actual funding were at stake.

The intraclass correlation, a statistic that measures how much groups agree, was 0 for the scores assigned. This meant that there was no agreement at all on the quality of any application. Because they were concerned about the reliability of this result, the researchers also computed Krippendorff’s alpha, another statistic of agreement. A value above 0.7 (range 0 to 1) is considered “acceptable.” None of the values reached that threshold; all were very close to zero. A final statistic measured overall similarity and showed that scores for the same application were no more similar than scores for different applications.
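For readers who want to see how an agreement statistic of this kind behaves, here is a minimal sketch of one common form of the intraclass correlation, the one-way random-effects version. It is an illustration only: the study may well have used a different variant, the function name `icc_oneway` is hypothetical, and the scores below are invented, not taken from the study.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects intraclass correlation for a balanced table.

    scores: one row per application, each row holding the same number
    of ratings (for example, one rating per mock study section).
    """
    n = len(scores)      # number of applications
    k = len(scores[0])   # ratings per application
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a balanced table, at least 2 rows and 2 columns")

    grand = mean([x for row in scores for x in row])
    row_means = [mean(row) for row in scores]

    # Mean square between applications: how far apart the applications'
    # average scores are.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Mean square within applications: how much ratings of the same
    # application disagree with one another.
    msw = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("all scores are identical; the statistic is undefined")
    return (msb - msw) / denom


# Invented data: three applications, three ratings each.
full_agreement = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
no_agreement = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]

icc_full = icc_oneway(full_agreement)
icc_none = icc_oneway(no_agreement)
print(icc_full, icc_none)
```

When every group gives an application the same score, the statistic is 1. When scores for the same application vary as much as scores across different applications, or more, it falls to zero or below, which is the situation the study describes.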

There wasn’t even any difference between the scores for those funded immediately and those requiring resubmission.

It would be easy to mistake this study for a death knell for the peer review process. It’s not. A careful reader must note that all of the grants in this study were exceptional. They succeeded, after all. Since the N.C.I. funds only about 10 percent of grants, we’re looking only at proposals in the best decile, and it’s likely that there is less variability in scores among those than among grants occupying the full spectrum of quality.

This should still concern us greatly. This system was devised back when more than half of submitted grants were funded. That’s very different from what we see today.