
Background

The question relates to research I am doing into the Wisdom of Crowds effect (Galton, 1907; Page, 2007; Surowiecki, 2004), in which an average of the estimates made by individuals proves to be highly accurate relative to the estimates made by the individuals themselves. For example, if we ask a bunch of participants to provide an estimate of the number of jellybeans in a jar, and then take the average of those estimates, the average tends to be more accurate than almost all of the individuals in the group. Specifically I am thinking here about a special case in which crowd wisdom is computed using pairs of different individuals (or, pairs of estimates taken from the same individual at different times).

The Question

I am seeking to investigate the appropriateness of various metrics for quantifying the accuracy gain obtained from averaging dyads of estimates instead of adopting individual estimates.

I'll attempt to illustrate the question using a simplified example, in which three participants provide a single estimate. Let's say the estimates provided are 1, 4, and 6, and that the true value is 5.

We calculate the error that arises when we compare averaged pairs of estimates with the truth. The process is exhaustive and order matters, so with three participants we have six ordered pairs [participant 1's estimate averaged with participant 2's, then 1 & 3, 2 & 1, 2 & 3, 3 & 1, 3 & 2]. Previous studies (e.g. Herzog & Hertwig, 2009) have calculated the gain from averaging by subtracting the absolute error of the averaged estimate from the absolute error of the initial estimate. Specifically, they've defined percentage gain as (Gain From Averaging / Absolute Error of Estimate 1) × 100. This is illustrated in Figure 1.
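To make the procedure concrete, here is a small Python sketch of the exhaustive pairwise computation described above (the function name `pairwise_gains` and the tuple layout are my own; the formulas follow the Herzog & Hertwig definition):

```python
from itertools import permutations

def pairwise_gains(estimates, truth):
    """For every ordered pair of estimates, compute the absolute error of
    Estimate 1, the absolute error of the pair's average, the absolute gain
    from averaging, and the percentage gain (gain / error of Estimate 1 * 100).
    Assumes no estimate exactly equals the truth (otherwise the percentage
    involves division by zero)."""
    rows = []
    for e1, e2 in permutations(estimates, 2):
        err1 = abs(e1 - truth)                 # absolute error of Estimate 1
        err_avg = abs((e1 + e2) / 2 - truth)   # absolute error of the average
        gain = err1 - err_avg                  # absolute gain from averaging
        pct = gain / err1 * 100                # percentage gain
        rows.append((e1, e2, err1, err_avg, gain, pct))
    return rows

rows = pairwise_gains([1, 4, 6], truth=5)
for e1, e2, err1, err_avg, gain, pct in rows:
    print(f"({e1}, {e2}): err1={err1}, err_avg={err_avg}, gain={gain}, pct={pct:.2f}%")
```

For the Figure 1 estimates (1, 4, 6 with truth 5), the six absolute gains sum to 4 and the six absolute errors of Estimate 1 sum to 12.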

Summed across all pairs, the absolute gain from averaging is zero if every answer lies above the true value, or if every answer lies below it (the gain in one ordering of a pair is exactly cancelled by the loss in the reverse ordering). As soon as there is some mix of answers below and above the true value, averaging produces a positive absolute gain overall. In Figure 1 we have both an absolute gain from averaging and a percentage gain from averaging, and this seems reasonable.
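This same-side cancellation can be checked numerically. The sketch below (helper name `total_gain` is my own) sums the absolute gain over all ordered pairs for a same-side crowd and for a mixed crowd:

```python
from itertools import permutations

def total_gain(estimates, truth):
    # Sum of (|e1 - truth| - |mean(e1, e2) - truth|) over all ordered pairs.
    return sum(abs(e1 - truth) - abs((e1 + e2) / 2 - truth)
               for e1, e2 in permutations(estimates, 2))

print(total_gain([6, 7, 9], truth=5))   # all estimates above the truth -> 0
print(total_gain([1, 2, 4], truth=5))   # all estimates below the truth -> 0
print(total_gain([1, 4, 6], truth=5))   # mixed -> positive (4.0 here)
```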

Things start getting a little strange if I change Person 3's estimate from 6 to 50 (see Figure 2). Now some of the percentages become massively negative. Without attempting anything formal, here's a prose description of why this happens. When the wildly wrong high answer (50) appears in the Value #1 column and the mildly wrong low answer (1) appears in the Value #2 column, we get a large absolute error on Estimate 1 (45) and a medium-sized absolute error from averaging (20.5). Our absolute gain from averaging is 24.5, but this translates into only a modest percentage gain (54.44%). When the two values swap positions, with 1 in the Value #1 column and 50 in the Value #2 column, we get a small absolute error on Estimate 1 (4) and the same absolute error from averaging as before (20.5). We have lost 16.5 from averaging, which is less than the 24.5 we gained above. Yet when we turn this into a percentage, the loss is massively bigger in magnitude than the earlier percentage gain (-412.50%). Intuitively this seems rather unfair.
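The asymmetry can be reproduced directly from the percentage-gain definition. In this sketch the helper `pct_gain` is my own name; the two calls are the two orderings of the (50, 1) pair discussed above:

```python
truth = 5

def pct_gain(e1, e2):
    # Percentage gain = (|e1 - truth| - |avg - truth|) / |e1 - truth| * 100.
    err1 = abs(e1 - truth)
    err_avg = abs((e1 + e2) / 2 - truth)
    return (err1 - err_avg) / err1 * 100

print(pct_gain(50, 1))  # 54.44...: a modest percentage gain
print(pct_gain(1, 50))  # -412.5: a massive percentage loss for the same pair
```

The same absolute change in error (±20.5 vs. the Estimate 1 error) looks modest when divided by the large error of 45, and enormous when divided by the small error of 4.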

Initial thoughts about answering the question

The reason I initially adopted the percentages approach is that I was following the precedent of Herzog and Hertwig (2009). However, I'm starting to wonder whether a better approach might be to divide the sum of the Gains From Averaging by the sum of the Absolute Errors of Estimate 1. This would mean that the overall 'gain' from averaging could never be negative. In Figure 1 it would be 4/12; in Figure 2 it would be 10/100. I would greatly appreciate readers' input on which metric is most appropriate.

Finally, readers may wonder why we average dyad by dyad rather than just taking the average of all three participants. The reason is that we are looking for a way to compare with another condition in which we take two estimates from each participant and average those. Figure 3 illustrates this process. We want to be able to say something like "Averaging two estimates from each participant creates an accuracy gain of [some number]. However, it is not as effective as averaging guesses between participants, which produces an accuracy gain of [some other number]."

Links

Link to an imgur album containing all the figures.

Link to the Excel file used to generate the figures.

References

Galton, F. (1907). Vox populi. Nature, 75, 450-451.

Herzog, S. M., & Hertwig, R. (2009). The wisdom of many in one mind. Psychological Science, 20, 231-237.

Page, S. E. (2007). The difference: How the power of diversity creates better groups, firms, schools, and societies. Princeton, NJ: Princeton University Press.

Surowiecki, J. (2004). The wisdom of crowds: Why the many are smarter than the few. London: Abacus.