 Cohen's kappa

Cohen's kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator agreement^{[1]} for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent agreement calculation, since κ takes into account the agreement occurring by chance. Some researchers (e.g. Strijbos, Martens, Prins, & Jochems, 2006^{[1]}) have expressed concern over κ's tendency to take the observed categories' frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement.
Others (e.g., Uebersax, 1987^{[2]}) contest the assertion that kappa "takes into account" chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess, which is a very unrealistic scenario.
Calculation
Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first mention of a kappa-like statistic is attributed to Galton (1892),^{[3]} see Smeeton (1985).^{[4]}
The equation for κ is:

$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$

where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as defined by Pr(e)), κ = 0.
The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal Educational and Psychological Measurement in 1960.
A similar statistic, called pi, was proposed by Scott (1955). Cohen's kappa and Scott's pi differ in terms of how Pr(e) is calculated.
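The difference in how Pr(e) is calculated can be sketched in a few lines of Python. The text above does not spell out Scott's formula; the sketch assumes the usual definition, in which both raters are taken to share a single category distribution (the mean of their two marginals), whereas Cohen's kappa lets each rater keep their own marginals. The function names and the counts in `table` are illustrative only.

```python
def chance_agreement(table, method="cohen"):
    """Pr(e) for a square table (rows: rater A, columns: rater B)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    if method == "cohen":
        # Cohen: each rater keeps their own marginal distribution.
        return sum(r * c for r, c in zip(row_p, col_p))
    if method == "scott":
        # Scott (assumed definition): one shared distribution,
        # the mean of the two raters' marginals.
        return sum(((r + c) / 2) ** 2 for r, c in zip(row_p, col_p))
    raise ValueError("method must be 'cohen' or 'scott'")


def agreement_coefficient(table, method="cohen"):
    """(Pr(a) - Pr(e)) / (1 - Pr(e)) with the chosen Pr(e)."""
    n = sum(sum(row) for row in table)
    pr_a = sum(table[i][i] for i in range(len(table))) / n
    pr_e = chance_agreement(table, method)
    return (pr_a - pr_e) / (1 - pr_e)


table = [[20, 5], [10, 15]]  # illustrative counts
print(round(agreement_coefficient(table, "cohen"), 3))  # 0.4
print(round(agreement_coefficient(table, "scott"), 3))  # 0.394
```

The two coefficients coincide when both raters have identical marginal distributions and drift apart as the marginals diverge.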
Note that Cohen's kappa measures agreement between two raters only. For a similar measure of agreement (Fleiss' kappa) used when there are more than two raters, see Fleiss (1971). Fleiss' kappa, however, is a multi-rater generalization of Scott's pi statistic, not Cohen's kappa.
Example
Suppose that you were analyzing data related to people applying for a grant. Each grant proposal was read by two people and each reader either said "Yes" or "No" to the proposal. Suppose the data were as follows, where rows are reader A and columns are reader B:
|        | B: Yes | B: No |
| A: Yes | 20     | 5     |
| A: No  | 10     | 15    |

Note that there were 20 proposals that were granted by both reader A and reader B, and 15 proposals that were rejected by both readers. Thus, the observed percentage agreement is Pr(a) = (20 + 15)/50 = 0.70.
To calculate Pr(e) (the probability of random agreement) we note that:
 Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time.
 Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time.
Therefore the probability that both of them would say "Yes" randomly is 0.50 × 0.60 = 0.30 and the probability that both of them would say "No" is 0.50 × 0.40 = 0.20. Thus the overall probability of random agreement is Pr(e) = 0.30 + 0.20 = 0.50.
So now applying our formula for Cohen's kappa we get:

$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} = \frac{0.70 - 0.50}{1 - 0.50} = 0.40$$
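The steps of this example can be reproduced with a short Python sketch. The function name `cohen_kappa` and the list-of-lists table layout (rows for reader A, columns for reader B) are choices made for illustration, not part of any particular library.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square contingency table of counts.

    Rows are rater A's categories, columns are rater B's.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement Pr(a): share of items on the main diagonal.
    pr_a = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Chance agreement Pr(e): product of the two marginals, summed over categories.
    pr_e = sum(r * c for r, c in zip(row_p, col_p))
    return (pr_a - pr_e) / (1 - pr_e)


grant_table = [[20, 5],
               [10, 15]]
print(round(cohen_kappa(grant_table), 2))  # 0.4
```

The sketch does not guard against the degenerate case Pr(e) = 1 (both raters using a single category for every item), where kappa is undefined.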
Inconsistent results
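One of the problems with Cohen's kappa is that it does not always produce the expected answer.^{[5]} For instance, in the following two cases there is equal agreement between A and B (60 out of 100 in both cases), so we would expect the values of Cohen's kappa to reflect this. However, calculating Cohen's kappa for each:

First case:

|     | Yes | No |
| Yes | 45  | 15 |
| No  | 25  | 15 |

$$\kappa = \frac{0.60 - 0.54}{1 - 0.54} \approx 0.1304$$

Second case:

|     | Yes | No |
| Yes | 25  | 35 |
| No  | 5   | 35 |

$$\kappa = \frac{0.60 - 0.46}{1 - 0.46} \approx 0.2593$$

we find that it shows greater similarity between A and B in the second case, compared to the first. The observed agreement is the same, but the marginal totals differ, so the chance agreement Pr(e) is 0.54 in the first case and 0.46 in the second.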
Significance and Magnitude
Statistical significance only states how precisely we have measured the magnitude. It makes no claim about how important the magnitude is in a given application, or about what counts as high or low agreement.
Statistical significance for kappa is rarely reported, probably because even relatively low values of kappa can nonetheless be significantly different from zero but not of sufficient magnitude to satisfy investigators.^{[6]}^{:66} Still, its standard error has been described^{[7]} and is computed by various computer programs.^{[8]}
If statistical significance is not a useful guide, what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but factors other than agreement can influence its magnitude, which makes interpretation of a given magnitude problematic. As Sim and Wright noted, two important factors are prevalence (are the codes equiprobable or do their probabilities vary) and bias (are the marginal probabilities for the two observers similar or different). Other things being equal, kappas are higher when codes are equiprobable and distributed similarly by the two observers.^{[9]}^{:261–262}
Another factor is the number of codes. As the number of codes increases, kappas become higher. Based on a simulation study, Bakeman and colleagues concluded that for fallible observers, values for kappa were lower when codes were fewer. And, in agreement with Sim & Wright's statement concerning prevalence, kappas were higher when codes were roughly equiprobable. Thus Bakeman et al. concluded that "no one value of kappa can be regarded as universally acceptable."^{[10]}^{:357} They also provide a computer program that lets users compute values for kappa specifying number of codes, their probability, and observer accuracy. For example, given equiprobable codes and observers who are 85% accurate, values of kappa are .49, .60, .66, and .69 when the number of codes is 2, 3, 5, and 10, respectively.
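The quoted values can be reproduced with a simple closed-form sketch. It rests on an assumed observer model that is not taken from Bakeman et al.'s program: the true codes are equiprobable, each observer independently records the true code with probability `accuracy`, and otherwise picks one of the remaining codes uniformly at random. The function name `expected_kappa` is hypothetical.

```python
def expected_kappa(num_codes, accuracy):
    """Expected kappa under the assumed fallible-observer model."""
    # Observers agree when both are right, or when both are wrong
    # and happen to choose the same one of the (num_codes - 1) wrong codes.
    pr_a = accuracy ** 2 + (1 - accuracy) ** 2 / (num_codes - 1)
    # With equiprobable true codes and uniform errors, each observer's
    # marginal distribution is uniform, so chance agreement is 1 / num_codes.
    pr_e = 1 / num_codes
    return (pr_a - pr_e) / (1 - pr_e)


for k in (2, 3, 5, 10):
    print(k, round(expected_kappa(k, 0.85), 2))
# 2 0.49
# 3 0.6
# 5 0.66
# 10 0.69
```

Under this model the observed agreement barely changes with the number of codes, while chance agreement falls as 1/k, which is why kappa rises as codes are added.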
Nonetheless, magnitude guidelines have appeared in the literature. Perhaps the first was Landis and Koch,^{[11]} who characterized values < 0 as indicating no agreement and 0–.20 as slight, .21–.40 as fair, .41–.60 as moderate, .61–.80 as substantial, and .81–1 as almost perfect agreement. This set of guidelines is however by no means universally accepted; Landis and Koch supplied no evidence to support it, basing it instead on personal opinion. It has been noted that these guidelines may be more harmful than helpful.^{[2]} Fleiss's^{[12]}^{:218} equally arbitrary guidelines characterize kappas over .75 as excellent, .40 to .75 as fair to good, and below .40 as poor.
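For illustration only, the two sets of bands can be written as lookup functions. The function names are hypothetical, and because the published bands are given to two decimals (for example .20 and .21), the sketch assumes that a value falling between two bands belongs to the lower one. Given the caveats above, the labels are descriptive conventions rather than evidence-based thresholds.

```python
def landis_koch_label(kappa):
    """Label a kappa value using the Landis and Koch bands."""
    if kappa < 0:
        return "no agreement"
    # Upper bound of each band; values above .80 fall through to the last label.
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


def fleiss_label(kappa):
    """Label a kappa value using the Fleiss bands."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"


print(landis_koch_label(0.40), "/", fleiss_label(0.40))  # fair / fair to good
```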
Weighted Kappa
Weighted kappa lets you count disagreements differently^{[13]} and is especially useful when codes are ordered.^{[6]}^{:66} Three matrices are involved: the matrix of observed scores, the matrix of expected scores based on chance agreement, and the weight matrix. Weight matrix cells located on the diagonal (upper-left to bottom-right) represent agreement and thus contain zeros. Off-diagonal cells contain weights indicating the seriousness of that disagreement. Often, cells one off the diagonal are weighted 1, those two off 2, etc.
The equation for weighted κ is:

$$\kappa = 1 - \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} x_{ij}}{\sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} m_{ij}}$$

where k = number of codes and w_{ij}, x_{ij}, and m_{ij} are elements in the weight, observed, and expected matrices, respectively. When diagonal cells contain weights of 0 and all off-diagonal cells weights of 1, this formula produces the same value of kappa as the calculation given above.
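A minimal Python sketch of this calculation follows. It assumes the expected matrix is built from the product of the row and column totals divided by N, which is the usual chance-agreement model; the function names, the linear weighting helper, and the 3 × 3 counts are illustrative.

```python
def weighted_kappa(observed, weights):
    """Weighted kappa from a square table of counts and a disagreement-weight matrix."""
    k = len(observed)
    n = sum(sum(row) for row in observed)
    row_tot = [sum(observed[i]) for i in range(k)]
    col_tot = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Expected counts m_ij under chance agreement (assumed: product of marginals / N).
    expected = [[row_tot[i] * col_tot[j] / n for j in range(k)] for i in range(k)]
    num = sum(weights[i][j] * observed[i][j] for i in range(k) for j in range(k))
    den = sum(weights[i][j] * expected[i][j] for i in range(k) for j in range(k))
    return 1 - num / den


def linear_weights(k):
    """Weights of 0 on the diagonal, 1 one step off, 2 two steps off, and so on."""
    return [[abs(i - j) for j in range(k)] for i in range(k)]


def unit_weights(k):
    """Weights of 0 on the diagonal and 1 everywhere else (unweighted kappa)."""
    return [[0 if i == j else 1 for j in range(k)] for i in range(k)]


ordinal_table = [[10, 2, 0],
                 [3, 8, 2],
                 [1, 2, 12]]  # illustrative counts for three ordered codes
print(weighted_kappa(ordinal_table, linear_weights(3)))
print(weighted_kappa(ordinal_table, unit_weights(3)))
```

With `unit_weights`, the function returns the same value as the unweighted calculation, which gives a quick consistency check on any implementation.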
Kappa Maximum
Kappa assumes its theoretical maximum value of 1 only when both observers distribute codes the same, that is, when corresponding row and column sums are identical. Anything less is less than perfect agreement. Still, the maximum value kappa could achieve given unequal distributions helps interpret the value of kappa actually obtained. The equation for κ maximum is:^{[14]}

$$\kappa_{\max} = \frac{P_{\max} - P_{\exp}}{1 - P_{\exp}}$$

where

$$P_{\exp} = \sum_{i=1}^{k} P_{i+} P_{+i},$$

as usual,

$$P_{\max} = \sum_{i=1}^{k} \min(P_{i+}, P_{+i}),$$

k = number of codes, P_{i+} are the row probabilities, and P_{+i} are the column probabilities.
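As a sketch of this calculation in Python, assuming the table is a list of rows of counts with one observer on the rows and the other on the columns (the function name `kappa_max` is a choice made here):

```python
def kappa_max(table):
    """Maximum kappa attainable given the two observers' marginal distributions."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(k)]                        # P_{i+}
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]   # P_{+i}
    # Chance agreement, as in ordinary kappa.
    p_exp = sum(r * c for r, c in zip(row_p, col_p))
    # Largest agreement the marginals allow: the smaller of each pair of marginals.
    p_max = sum(min(r, c) for r, c in zip(row_p, col_p))
    return (p_max - p_exp) / (1 - p_exp)


table = [[20, 5],
         [10, 15]]  # illustrative counts
print(round(kappa_max(table), 2))  # 0.8
```

For this table the obtained kappa of 0.40 can then be read against a ceiling of 0.80 rather than 1, because the row marginals (0.5, 0.5) and column marginals (0.6, 0.4) differ.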
Notes
 ^ Carletta, Jean. (1996) Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), pp. 249–254.
 ^ Gwet, K. (2010). "Handbook of Inter-Rater Reliability (Second Edition)" ISBN 9780970806222
References
 ^ Strijbos, J.; Martens, R.; Prins, F.; Jochems, W. (2006). "Content analysis: What are they talking about?". Computers & Education 46: 29–48. doi:10.1016/j.compedu.2005.04.002.
 ^ Uebersax, J.S. (1987). "Diversity of decision-making models and the measurement of interrater agreement" (PDF). Psychological Bulletin 101: 140–146. doi:10.1037/0033-2909.101.1.140. http://www.namic.org/Wiki/images/d/df/Kapp_and_decision_making_models.pdf.
 ^ Galton, F. (1892). Finger Prints Macmillan, London.
 ^ Smeeton, N.C. (1985). "Early History of the Kappa Statistic". Biometrics 41: 795.
 ^ Gwet, Kilem (May 2002). "Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity". Statistical Methods for Inter-Rater Reliability Assessment 2: ???. http://agreestat.com/research_papers/inter_rater_reliability_dependency.pdf
 ^ ^{a} ^{b} Bakeman, R.; & Gottman, J.M. (1997). Observing interaction: An introduction to sequential analysis (2nd ed.). Cambridge, UK: Cambridge University Press. ISBN 0521275938.
 ^ Fleiss, J.L.; Cohen, J., & Everitt, B.S. (1969). "Large sample standard errors of kappa and weighted kappa". Psychological Bulletin 72: 323–327. doi:10.1037/h0028106.
 ^ Robinson, B.F; & Bakeman, R. (1998). "ComKappa: A Windows 95 program for calculating kappa and related statistics". Behavior Research Methods, Instruments, and Computers 30: 731–732. doi:10.3758/BF03209495.
 ^ Sim, J; & Wright, C. C (2005). "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements". Physical Therapy 85: 257–268.
 ^ Bakeman, R.; Quera, V., McArthur, D., & Robinson, B. F. (1997). "Detecting sequential patterns and determining their reliability with fallible observers". Psychological Methods 2: 357–370. doi:10.1037/1082-989X.2.4.357.
 ^ Landis, J.R.; & Koch, G.G. (1977). "The measurement of observer agreement for categorical data". Biometrics 33 (1): 159–174. doi:10.2307/2529310. JSTOR 2529310. PMID 843571.
 ^ Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. ISBN 0471263702.
 ^ Cohen, J. (1968). "Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit". Psychological Bulletin 70 (4): 213–220. doi:10.1037/h0026256. PMID 19673146.
 ^ Umesh, U.N.; Peterson, R.A., & Sauber. M.H. (1989). "Interjudge agreement and the maximum value of kappa.". Educational and Psychological Measurement 49: 835–850. doi:10.1177/001316448904900407.
 Gwet, Kilem (May 2002). "Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity". Statistical Methods for Inter-Rater Reliability Assessment 2:
 Banerjee, M.; Capozzoli, Michelle; McSweeney, Laura; Sinha, Debajyoti (1999). "Beyond Kappa: A Review of Interrater Agreement Measures". The Canadian Journal of Statistics / La Revue Canadienne de Statistique 27 (1): 3–23. JSTOR 3315487.
 Brennan, R. L.; Prediger, D. J. (1981). "Coefficient Kappa: Some Uses, Misuses, and Alternatives". Educational and Psychological Measurement 41: 687–699. doi:10.1177/001316448104100307.
 Cohen, Jacob (1960). "A coefficient of agreement for nominal scales". Educational and Psychological Measurement 20 (1): 37–46. doi:10.1177/001316446002000104.
 Cohen, J. (1968). "Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit". Psychological Bulletin 70 (4): 213–220. doi:10.1037/h0026256. PMID 19673146.
 Fleiss, J.L. (1971). "Measuring nominal scale agreement among many raters". Psychological Bulletin 76 (5): 378–382. doi:10.1037/h0031619.
 Fleiss, J. L. (1981) Statistical methods for rates and proportions. 2nd ed. (New York: John Wiley) pp. 38–46
 Fleiss, J.L.; Cohen, J. (1973). "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability". Educational and Psychological Measurement 33: 613–619. doi:10.1177/001316447303300309.
 Gwet, K. (2008). "Computing interrater reliability and its variance in the presence of high agreement". British Journal of Mathematical and Statistical Psychology 61 (Pt 1): 29–48. doi:10.1348/000711006X126600. PMID 18482474. http://www.agreestat.com/research_papers/bjmsp2008_interrater.pdf.
 Gwet, K. (2008). "Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters". Psychometrika 73 (3): 407–430. doi:10.1007/s11336-007-9054-8. http://www.agreestat.com/research_papers/psychometrika2008_irr_random_raters.pdf.
 Gwet, K. (2008). "Intrarater Reliability." Wiley Encyclopedia of Clinical Trials, Copyright 2008 John Wiley & Sons, Inc.
 Scott, W. (1955). "Reliability of content analysis: The case of nominal scale coding". Public Opinion Quarterly 19 (3): 321–325.
 Sim, J.; Wright, C. C. (2005). "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements". Physical Therapy 85 (3): 257–268. PMID 15733050.
External links
 Kappa, its meaning, problems, and several alternatives
 Kappa Statistics: Pros and Cons
 Windows program for kappa, weighted kappa, and kappa maximum
 Java and PHP implementation of weighted Kappa
Wikimedia Foundation. 2010.