Here, we provide a more detailed statistical argument describing the framework's extreme sensitivity to incidental parameters. The crux of the statistical issue is this: the framework could only be valid if d, the estimated difference limen used in the calculation step, is a measure of olfactory resolution that converges to the true value of this quantity as more data is collected, that is, if it is consistent.

‘Significantly discriminable’ is a moving target dependent on sample size, choice of significance criterion, and correction for multiple comparisons. Moreover, d is the only data-dependent value used in subsequent calculations (Equation 1). Together, these facts guarantee that the estimate of z in (Bushdid et al., 2014) is a moving target as well, dependent on these same parameters. d is generated by testing a number of null hypotheses, and is closely related to the fraction of these that are rejected. But the probability of, and criteria for, rejection of these null hypotheses depend critically on sample size and α, the values that we explored in Figure 3 and Table 2. Certainly, we would agree that there is nothing objectionable about the specific parameters chosen in (Bushdid et al., 2014). However, there is nothing objectionable about many other values for those parameters either.

In effect, calculating d is somewhat like judging whether a coin meets a cutoff for being fair based on a series of tosses. It matters very much how many tosses one makes, and how much deviation from chance one is willing to tolerate before calling a coin unfair. If you have no particular reason to believe a coin is unfair, you might be disinclined to call it unfair if you observed 6/10 (60%) heads, but you would probably call it unfair if you observed 600/1000 heads (also 60%). However, if you own a casino, you might consider even 5100 heads in 10,000 (51%) evidence of an unfair coin. Whether the coin is fair is not something we directly measure; rather, we have more or less evidence for various degrees of fairness.
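The coin analogy can be made concrete with a short calculation. The following is a minimal sketch, not part of the original analysis: the function name `fair_coin_upper_tail` is our own, and we assume a one-sided exact binomial tail probability as the measure of evidence against fairness.

```python
from math import comb


def fair_coin_upper_tail(heads: int, tosses: int) -> float:
    """Exact probability of at least `heads` heads in `tosses` tosses of a fair coin."""
    # Count the sequences with >= `heads` heads, then divide by all 2**tosses sequences.
    favourable = sum(comb(tosses, i) for i in range(heads, tosses + 1))
    return favourable / 2 ** tosses


# The three scenarios from the text: 60%, 60%, and 51% heads.
for heads, tosses in [(6, 10), (600, 1000), (5100, 10_000)]:
    p = fair_coin_upper_tail(heads, tosses)
    print(f"{heads}/{tosses} heads: P(at least this many | fair coin) = {p:.3g}")
```

Under these assumptions, the same 60% proportion of heads is unremarkable in 10 tosses and wildly improbable in 1000, while 51% in 10,000 tosses falls near conventional significance cutoffs. Whether any of these counts as "unfair" depends on the cutoff one adopts.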

A similar situation arises in the analysis of (Bushdid et al., 2014), as can be seen by considering its formal definition of d (a definition we verified by reconstructing the critical figures from (Bushdid et al., 2014) in Figure 2). d is defined as that inter-stimulus distance D for which 50% of subjects can significantly discriminate a mixture class. By a mixture ‘class’ we denote the set of mixture pairs for which each mixture has the same total number of components (N) and each pair has the same number of distinct, non-overlapping components D (D = N − O, where O is the number of overlapping components; see Table 1). For example, the mixture pair (ABC, ABD) would be a member of the class with N = 3 and D = 1 distinct components. We focus here on calculations pertaining to the number of tests T per class, but the same argument is readily translated over to the number of subjects S.

To assess significant discriminability from chance, (Bushdid et al., 2014) used a two-tailed binomial test. Thus, if the p-value is smaller than α/2, the subject is considered able to significantly discriminate pairs in the mixture class. The p-value is given by 1 minus the cumulative binomial distribution function for n = T trials, k successes, and a probability of success equal to 1/3, with k corresponding to the number of trials on which the subject discriminates correctly, and 1/3 to chance in a 3-way forced-choice task. Thus, the subject's discrimination performance is significant if:

(3) $$\alpha/2 > 1 - \mathrm{cdf}_{\mathrm{binomial}}\left(T, k, \tfrac{1}{3}\right) = 1 - \sum_{i=0}^{k} \binom{T}{i} \left(\tfrac{1}{3}\right)^{i} \left(\tfrac{2}{3}\right)^{T-i}$$

For α = 0.05 and T = 20 (as used in [Bushdid et al., 2014]), this inequality is satisfied for k ≥ 11. For each subject, k might be any value between 0 and 20 depending on olfactory acuity. If k ≥ 11 for more than 50% of subjects, then the value of D characterizing that mixture pair is necessarily > d. If k ≥ 11 for fewer than 50% of subjects, then D < d. If k ≥ 11 for exactly 50% of subjects, then D = d. The actual estimate for d is obtained by regression in the spirit of Figure 2.
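The threshold on k quoted above can be checked directly from inequality (3). The sketch below is illustrative only; the helper names `binom_cdf` and `critical_k` are our own, and the criterion is taken exactly as stated in the text (1 minus the cumulative distribution at k must fall below α/2).

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def critical_k(n_trials: int, alpha: float, p_chance: float = 1 / 3) -> int:
    """Smallest k for which 1 - cdf(k) < alpha / 2, as in inequality (3)."""
    for k in range(n_trials + 1):
        if 1 - binom_cdf(k, n_trials, p_chance) < alpha / 2:
            return k
    raise ValueError("no k satisfies the criterion")


print(critical_k(20, 0.05))  # 11, matching the threshold stated in the text
print(critical_k(20, 0.01))  # a stricter alpha raises the threshold to 12
```

With T = 20 fixed, merely tightening α from 0.05 to 0.01 changes which subjects count as significant discriminators, even though their data are unchanged.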

What kind of subject can discriminate successfully 11 times out of 20? Consider a mixture class $X_{N,D}$ (characterized by N and D), and a subject performance of $f_{N,D}$, corresponding to the proportion of mixtures correctly discriminated from a sample of size T. Note that $f_{N,D}$ is simply the abscissa of Figure 1 from (Bushdid et al., 2014). A subject with $f_{N,D} = 0.55$ would get $k = T \cdot f_{N,D} = 11$ out of T = 20 correct on average. So we can rewrite the inequality above as an equation:

(4) $$1 - \alpha/2 = \sum_{i=0}^{f_{N,D} \cdot T} \binom{T}{i} \left(\tfrac{1}{3}\right)^{i} \left(\tfrac{2}{3}\right)^{T-i}$$

If the above equation is satisfied, then the subject will be considered to be on the boundary between significantly discriminating and not significantly discriminating mixture pairs in the class. If half of subjects perform better than $f_{N,D}$, and half worse, then half of subjects will be considered to significantly discriminate mixture pairs in the class (and half not), and so d will be set equal to D. This is simply the definition of d.
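To illustrate how this definition interacts with α, the sketch below classifies one mixture class under three significance levels. The per-subject counts in `hypothetical_counts` are invented purely for illustration and are not data from (Bushdid et al., 2014); the function names are likewise our own.

```python
from math import comb


def upper_tail(k: int, n: int, p: float) -> float:
    """1 - cdf(k): probability of more than k successes for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))


def fraction_significant(counts, n_trials, alpha, p_chance=1 / 3):
    """Fraction of subjects whose count k satisfies 1 - cdf(k) < alpha / 2."""
    hits = [upper_tail(k, n_trials, p_chance) < alpha / 2 for k in counts]
    return sum(hits) / len(hits)


# HYPOTHETICAL correct-trial counts (out of T = 20) for ten imaginary subjects.
hypothetical_counts = [8, 9, 10, 11, 11, 12, 12, 13, 14, 15]

for alpha in (0.05, 0.01, 0.001):
    frac = fraction_significant(hypothetical_counts, 20, alpha)
    verdict = "D > d" if frac > 0.5 else "D < d" if frac < 0.5 else "D = d"
    print(f"alpha = {alpha}: {frac:.0%} of subjects significant, so {verdict}")
```

For these made-up counts, the same class is placed above, at, or below the limen depending only on the choice of α.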

The value of $f_{N,D}$ for which that equation is satisfied depends upon α and T. $f_{N,D}$ is related to N and D through the data, and so the value of D for which the equation is satisfied (i.e., D = d) depends upon α, T, and the data. However, it is inappropriate for the discriminability limen to depend on α and T in this way. As we showed above, this has serious consequences for the estimate of d, and therefore also for the estimate of z. It is what makes z inconsistent.

Figure 1—figure supplement 1 shows the relationship between the critical $f_{N,D}$, T, and α. Note that this relationship is independent of the data. The data only determine how $f_{N,D}$ depends upon D, and consequently determine z. In summary, a smaller (larger) value of α or T requires a much higher (lower) value of $f_{N,D}$ to satisfy the equation. This higher (lower) value of $f_{N,D}$ might only be found at a much larger (smaller) value of D, implying a much larger (smaller) value of d and therefore a much smaller (larger) value of z.
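The data-independent relationship between the critical $f_{N,D}$, T, and α can be tabulated with a few lines of code. This is a minimal sketch under the criterion of inequality (3), not a reproduction of the figure supplement; the function name `critical_fraction` and the grid of T and α values are our own choices.

```python
from math import comb


def critical_fraction(n_trials: int, alpha: float, p_chance: float = 1 / 3) -> float:
    """Smallest k / T for which 1 - cdf(k) < alpha / 2 under Binomial(T, p_chance)."""
    cdf = 0.0
    for k in range(n_trials + 1):
        # Accumulate the cumulative distribution one term at a time.
        cdf += comb(n_trials, k) * p_chance**k * (1 - p_chance) ** (n_trials - k)
        if 1 - cdf < alpha / 2:
            return k / n_trials
    raise ValueError("no k satisfies the criterion")


alphas = (0.05, 0.01, 0.001)
print("T    " + "  ".join(f"alpha={a:<6}" for a in alphas))
for n_trials in (10, 20, 50, 100, 200):
    row = "  ".join(f"{critical_fraction(n_trials, a):<12.3f}" for a in alphas)
    print(f"{n_trials:<4} {row}")
```

Reading across a row shows the critical fraction rising as α shrinks; reading down a column shows it falling toward chance (1/3) as T grows. No subject data enter the calculation.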