
The Kappa ($\kappa$) statistic was introduced in 1960 by Cohen [1] to measure agreement between two raters. Its variance, however, was a source of contradictions for quite some time.

My question is which variance calculation is best to use with large samples. I am inclined to believe that the one tested and verified by Fleiss [2] is the right choice, but it does not seem to be the only published one that appears to be correct (and is used throughout fairly recent literature).

Right now I have two concrete ways to compute its asymptotic large sample variance:

The corrected method published by Fleiss, Cohen and Everitt [2];

The delta method which can be found in the book by Congalton, 2009 [4] (page 106).

To illustrate some of this confusion, here is a quote by Fleiss, Cohen and Everitt [2], emphasis mine:

Many human endeavors have been cursed with repeated failures before final success is achieved. The scaling of Mount Everest is one example. The discovery of the Northwest Passage is a second. The derivation of a correct standard error for kappa is a third.

So, here is a small summary of what happened:

1960: Cohen publishes his paper "A coefficient of agreement for nominal scales" [1], introducing his chance-corrected measure of agreement between two raters, called $\kappa$. However, he publishes incorrect formulas for the variance calculations.

1968: Everitt attempts to correct them, but his formulas are incorrect as well.

1969: Fleiss, Cohen and Everitt publish the correct formulas in the paper "Large Sample Standard Errors of Kappa and Weighted Kappa" [2].

1971: Fleiss publishes another $\kappa$ statistic (a different one) under the same name, with incorrect formulas for the variances.

1979: Fleiss, Nee and Landis publish the corrected formulas for Fleiss' $\kappa$.

First, consider the following notation, in which a dot in a subscript means that the summation operator is applied over that dimension:

$\ \ \ p_{i.} = \displaystyle\sum_{j=1}^{k} p_{ij}$ $\ \ \ p_{.j} = \displaystyle\sum_{i=1}^{k} p_{ij}$

Now, one can compute Kappa as:

$\ \ \ \hat\kappa = \displaystyle\frac{p_o-p_c}{1-p_c}$

In which

$\ \ \ p_o = \displaystyle\sum_{i=1}^{k} p_{ii}$ is the observed agreement, and

$\ \ \ p_c = \displaystyle\sum_{i=1}^{k} p_{i.} p_{.i}$ is the chance agreement.
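In code, these definitions can be sketched as follows (Python with NumPy; the function name and the example table are my own, not from any of the cited references):

```python
import numpy as np

def cohen_kappa(counts):
    """Cohen's kappa from a k x k table of counts (rows: rater A, cols: rater B)."""
    p = np.array(counts, dtype=float)
    p /= p.sum()                          # p_ij: joint proportions
    p_o = np.trace(p)                     # observed agreement: sum of p_ii
    p_c = p.sum(axis=1) @ p.sum(axis=0)   # chance agreement: sum of p_i. * p_.i
    return (p_o - p_c) / (1 - p_c)

# e.g. cohen_kappa([[20, 5], [10, 15]])  # ≈ 0.4
```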

To date, the correct variance calculation for Cohen's $\kappa$ is given by:

$\ \ \ \newcommand{\var}{\mathrm{var}}\widehat{\var}(\hat{\kappa}) = \displaystyle\frac{1}{N(1-p_c)^4} \left\{ \displaystyle\sum_{i=1}^{k} p_{ii}\left[(1-p_c) - (p_{.i} + p_{i.})(1-p_o)\right]^2 + (1-p_o)^2 \displaystyle\sum_{i=1}^{k} \displaystyle\sum_{j=1 \atop j \not= i}^{k} p_{ij} (p_{.i} + p_{j.})^2 - (p_op_c-2p_c+p_o)^2 \right\} $

and under the null hypothesis it is given by:

$\ \ \ \widehat{\mathrm{var}}_0(\hat{\kappa}) = \displaystyle\frac{1}{N(1-p_c)^2} \left\{ \displaystyle\sum_{i=1}^{k} p_{.i}p_{i.} \left[1 - (p_{.i} + p_{i.})\right]^2 + \displaystyle\sum_{i=1}^{k} \displaystyle\sum_{j=1 \atop j \not= i}^{k} p_{i.}p_{.j}(p_{.i} + p_{j.})^2 - p_c^2 \right\} $
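To make the two estimators concrete, here is a sketch of both in Python with NumPy. This is my own transcription of the formulas as I read them, not code from any of the cited references, so treat it as illustrative:

```python
import numpy as np

def fce_variances(counts):
    """Fleiss-Cohen-Everitt variance of Cohen's kappa (general, and under the
    null hypothesis kappa = 0) from a k x k table of counts."""
    p = np.array(counts, dtype=float)
    n = p.sum()
    p /= n
    row, col = p.sum(axis=1), p.sum(axis=0)   # p_i. and p_.j
    p_o, p_c = np.trace(p), row @ col
    off = ~np.eye(len(p), dtype=bool)         # mask selecting the i != j terms
    sq = (col[:, None] + row[None, :]) ** 2   # (p_.i + p_j.)^2 at entry (i, j)

    # General large-sample variance.
    var = (np.sum(np.diag(p) * ((1 - p_c) - (col + row) * (1 - p_o)) ** 2)
           + (1 - p_o) ** 2 * np.sum((p * sq)[off])
           - (p_o * p_c - 2 * p_c + p_o) ** 2) / (n * (1 - p_c) ** 4)

    # Variance under the null hypothesis of no agreement beyond chance.
    var0 = (np.sum(col * row * (1 - (col + row)) ** 2)
            + np.sum((np.outer(row, col) * sq)[off])
            - p_c ** 2) / (n * (1 - p_c) ** 2)
    return var, var0
```

As a sanity check, at perfect agreement ($p_o = 1$) the general expression collapses to zero, as it should.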

Congalton's method seems to be based on the delta method for obtaining variances (Agresti, 1990; Agresti, 2002); however, I am not sure what the delta method is or why it has to be used. The $\kappa$ variance, under this method, is given by:

$\ \ \ \widehat{\var}(\hat{\kappa}) = \frac{1}{n} \{ \frac{\theta_1 (1-\theta_1)}{(1-\theta_2)^2} + \frac{2(1-\theta_1)(2\theta_1\theta_2-\theta_3)}{(1-\theta_2)^3} + \frac{(1-\theta_1)^2(\theta_4-4\theta_2^2)}{(1-\theta_2)^4} \} $

in which

$\ \ \ \theta_1 = \frac{1}{n} \displaystyle\sum_{i=1}^{k} n_{ii}$

$\ \ \ \theta_2 = \frac{1}{n^2} \displaystyle\sum_{i=1}^{k} n_{i+}n_{+i}$

$\ \ \ \theta_3 = \frac{1}{n^2} \displaystyle\sum_{i=1}^{k} n_{ii}(n_{i+} + n_{+i})$

$\ \ \ \theta_4 = \frac{1}{n^3} \displaystyle\sum_{i=1}^{k} \displaystyle\sum_{j=1}^{k} n_{ij}(n_{j+} + n_{+i})^2$
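Here is my own sketch of this delta-method estimator in Python with NumPy, computed directly from the raw count matrix $n_{ij}$ (again, illustrative code, not taken from Congalton's book):

```python
import numpy as np

def delta_method_variance(counts):
    """Delta-method (Congalton) variance of kappa from the k x k confusion
    matrix of raw counts n_ij, via the theta_1..theta_4 quantities."""
    m = np.array(counts, dtype=float)
    n = m.sum()
    rows, cols = m.sum(axis=1), m.sum(axis=0)   # n_i+ and n_+j
    t1 = np.trace(m) / n
    t2 = (rows @ cols) / n ** 2
    t3 = np.sum(np.diag(m) * (rows + cols)) / n ** 2
    # entry (i, j) gets (n_j+ + n_+i)^2
    t4 = np.sum(m * (cols[:, None] + rows[None, :]) ** 2) / n ** 3
    return (t1 * (1 - t1) / (1 - t2) ** 2
            + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2) ** 3
            + (1 - t1) ** 2 * (t4 - 4 * t2 ** 2) / (1 - t2) ** 4) / n
```

For what it is worth, on the small tables I tried this agrees numerically with the corrected Fleiss, Cohen and Everitt value.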

(Congalton uses a $+$ subscript rather than a $.$, but it seems to mean the same thing. In addition, I am supposing that $n_{ij}$ is a matrix of counts, i.e. the confusion matrix before being divided by the number of samples, so that $p_{ij} = \frac{n_{ij}}{n}$.)

Another odd part is that Congalton's book seems to refer to the original paper by Cohen, but does not seem to cite the corrections to the Kappa variance published by Fleiss et al., not until it goes on to discuss weighted Kappa. Perhaps his first publication was written when the true formula for kappa was still lost in confusion?

Can somebody explain the reason for these differences? Or why would someone use the delta-method variance instead of the corrected version by Fleiss?

[1]: Cohen, Jacob (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. doi:10.1177/001316446002000104

[2]: Fleiss, Joseph L.; Cohen, Jacob; Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72(5), 323–327. doi:10.1037/h0028106

[3]: Agresti, Alan (2002). Categorical Data Analysis, 2nd edition. John Wiley and Sons.

[4]: Congalton, Russell G.; Green, K. (2009). Assessing the Accuracy of Remotely Sensed Data: Principles and Practices, 2nd edition.