Preventable medical error leads to an estimated 200,000 deaths per year in the US, and many of these deaths are caused by mistaken diagnoses. Clearly, making it easier for doctors to avoid errors should be a priority.

One promising avenue could be collective decision-making: pooling the diagnoses of several doctors and using their joint wisdom to arrive at the most likely answer. According to a paper in this week’s PNAS, though, this method is only likely to work if the doctors in the group have similar levels of skill.

Obviously, ethics committees are unlikely to allow a team of researchers to toy with patients’ potentially life-or-death diagnoses. So in order to figure out whether collective decision-making would help with the problem, the team combined real-world data with a computer simulation.

First, they compiled images from real breast cancer and skin cancer screening examinations. More than 100 doctors who examine mammogram data were invited to make diagnoses based on 182 mammogram image sets, and 40 dermatologists examined images of 108 skin lesions. Not all doctors assessed all cases, but the overall result was 16,813 diagnoses of mammograms and 4,320 diagnoses of skin lesions. Each doctor also rated how confident they were in their diagnoses.

For each image, the researchers had data on whether the patient had gone on to have treatment for breast or skin cancer. That meant they could label a doctor’s diagnosis as correct if the doctor diagnosed cancer and the patient went on to be treated for it; if the doctor didn’t diagnose cancer and the patient was treated later, the diagnosis was clearly incorrect. Each doctor could thus be given an accuracy rating.
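
To make the idea of an accuracy rating concrete, here is a minimal Python sketch. It assumes the rating is simply the fraction of a doctor's diagnoses that match the patient's later treatment record, and that a "no cancer" call counts as correct when the patient was never treated. The article does not spell out how the paper computed its ratings, so the function name `accuracy_rating` and the scoring rule are illustrative assumptions, not the authors' method.

```python
def accuracy_rating(diagnoses, outcomes):
    """Fraction of diagnoses that match what later happened to the patient.

    diagnoses: list of bools, True if the doctor diagnosed cancer.
    outcomes:  list of bools, True if the patient was later treated for cancer.
    Assumption: every case is scored as a plain match or mismatch.
    """
    if not diagnoses or len(diagnoses) != len(outcomes):
        raise ValueError("need one outcome per diagnosis, and at least one case")
    correct = sum(d == o for d, o in zip(diagnoses, outcomes))
    return correct / len(diagnoses)


# Hypothetical doctor who misses one cancer in four cases.
rating = accuracy_rating([True, False, True, False],
                         [True, True, True, False])
print(rating)  # 0.75
```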

The researchers then used the data in a computer simulation, creating virtual "doctors" with the same diagnoses, accuracy ratings, and confidence ratings. These virtual doctors were then used to test what would happen with two different kinds of collective decision-making processes: one where a group of virtual doctors submitted their diagnoses and then went with the diagnosis of the most confident doctor, and another where the most common diagnosis of a group of virtual doctors was chosen as the overall diagnosis (a simple majority-wins rule).
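
The two decision rules described above can be sketched in a few lines of Python. This is an illustration rather than the researchers' simulation code: the `Diagnosis` record, the function names, and the choice to reject tied votes are all assumptions made for the example.

```python
from collections import Counter
from typing import NamedTuple


class Diagnosis(NamedTuple):
    doctor: str
    has_cancer: bool   # the doctor's call for this image
    confidence: float  # self-reported confidence; higher means more sure


def most_confident_rule(diagnoses):
    """Go with the call made by the most confident doctor in the group."""
    return max(diagnoses, key=lambda d: d.confidence).has_cancer


def majority_rule(diagnoses):
    """Go with the most common call (simple majority wins)."""
    votes = Counter(d.has_cancer for d in diagnoses)
    if votes[True] == votes[False]:
        # Assumption: ties are rejected instead of being broken arbitrarily.
        raise ValueError("tied vote; use an odd-sized group")
    return votes[True] > votes[False]


# Hypothetical group of three virtual doctors looking at one image.
group = [
    Diagnosis("A", True, 0.9),
    Diagnosis("B", False, 0.6),
    Diagnosis("C", False, 0.7),
]
print(most_confident_rule(group))  # True: doctor A is the most confident
print(majority_rule(group))        # False: two of three say no cancer
```

In this made-up case the two rules disagree, which is exactly the situation where the choice of rule matters.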

This process allowed the research team to see whether group diagnoses matched up with the real-world medical history of the patients. What they found was that collective decision-making did improve accuracy, and the majority rule worked a little better than the most-confident-doctor rule.

But the improvement was only significant when the doctors in a group had similar accuracy ratings. If the doctors were similar in their diagnostic abilities, group decision-making resulted in a better accuracy rate than that of an individual doctor. When one of the doctors was better than the others, collective decision-making wasn’t so useful.

This makes sense because both processes work by overruling outlying judgments. It's great when doctors have similar accuracy levels because the outlying judgment is likely to be the wrong one. But if one doctor in the group is better than the rest, a collective judgment is likely to drown out the best doctor’s judgment.
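
A small worked calculation shows the drowning-out effect. The Python sketch below computes the exact chance that a majority vote is correct, under the simplifying assumption that each doctor's errors are independent of the others'. The accuracy values are invented for illustration and do not come from the study, and real doctors' mistakes are likely to be correlated.

```python
from itertools import product


def majority_accuracy(accuracies):
    """Probability that a simple majority of independent doctors is correct."""
    n = len(accuracies)
    if n % 2 == 0:
        raise ValueError("use an odd group size so there are no ties")
    total = 0.0
    # Enumerate every pattern of who is right (True) and who is wrong (False).
    for pattern in product([True, False], repeat=n):
        if sum(pattern) * 2 > n:  # more than half of the doctors are right
            prob = 1.0
            for is_right, p in zip(pattern, accuracies):
                prob *= p if is_right else 1.0 - p
            total += prob
    return total


# Three evenly matched doctors: the group beats any one of them.
matched = majority_accuracy([0.8, 0.8, 0.8])
# One strong doctor with two weaker ones: the group does worse than the best.
mixed = majority_accuracy([0.95, 0.6, 0.6])
print(round(matched, 3))  # 0.896, up from 0.8 for each doctor alone
print(round(mixed, 3))    # 0.816, down from 0.95 for the best doctor alone
```

Under these assumptions, the matched group gains almost ten percentage points over any single member, while the mixed group gives up more than thirteen points compared with simply trusting its best doctor.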

This result held true in the simulation across different group sizes and regardless of how good the doctors in the group were. A group of similarly bad doctors would do better diagnosing as a group, and a group of similarly great doctors would also do better. It’s combining a poor doctor with a good one, or even a good one and a great one, that causes the problem.

The outcome has some important real-world implications. If doctors are well-matched, they can very simply pool their resources to arrive at better diagnoses (assuming they could first figure out a nice, neat accuracy rating for themselves, which is unlikely to be very simple). However, in some countries and medical fields, it’s standard practice to get a second diagnosis, and this study points out that we should be very careful here—if the doctors aren’t matched in accuracy, this process could result in a worse outcome for the patient.

An important complication that this simulation couldn’t take into account is that doctors often don’t just make diagnoses based on images—they also spend time talking to the patient and each other.

The authors of this study point out that the next step in research like this would be to figure out how face-to-face communication might change things. Coming up with a way to calculate accuracy ratings for real doctors is also an important hurdle. Until these hurdles are cleared, collective medical decision-making should probably be treated with caution.

PNAS, 2016. DOI: 10.1073/pnas.1601827113.