Christian M. Meyer
09.11.2009 | Computer Science Department | Ubiquitous Knowledge Processing Lab | © Christian M. Meyer | 1
Introduction
Validity, Reliability, Agreement

m raters r ∈ R (a.k.a. coders, annotators, observers, …)

50 years of agreement studies – 50 different notation schemas!
Percentage of Agreement
Definition

Relatedness judgments of two raters (7 of the 10 item pairs shown):

  Word pair             r1     r2
  gem – jewel           high   high
  coast – shore         high   high
  coast – hill          high   low
  forest – graveyard    low    high
  asylum – fruit        low    low
  noon – string         low    low
  automobile – wizard   low    low

Contingency table:

           r1 high   r1 low    Σ
  r2 high     2         2      4
  r2 low      1         5      6
  Σ           3         7     10
Percentage of Agreement
Standard Error and Confidence Interval

           r1 high   r1 low    Σ
  r2 high     2         2      4
  r2 low      1         5      6
  Σ           3         7     10

Percentage of agreement:
  AO = 1/n · Σc (# of agreements on category c)
  AO = 1/10 · (2 + 5) = 0.7

Standard error:
  SE(AO) = √(AO · (1 – AO) / n)
  SE(AO) = √(0.7 · (1 – 0.7) / 10) ≈ 0.145

Confidence intervals:
  CL = AO – SE(AO) · zCrit
  CU = AO + SE(AO) · zCrit
  0.416 ≤ 0.7 ≤ 0.984 with zCrit = 1.96 (95% confid.)
  0.462 ≤ 0.7 ≤ 0.938 with zCrit = 1.645 (90% confid.)
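These calculations can be sketched in Python (a minimal sketch; the helper names are mine, and the standard error is the usual binomial one, √(AO(1 – AO)/n)):

```python
import math

def percentage_agreement(table):
    """Observed agreement A_O: fraction of items on the diagonal
    of a 2x2 contingency table [[a, b], [c, d]]."""
    n = sum(sum(row) for row in table)
    return (table[0][0] + table[1][1]) / n

def confidence_interval(a_o, n, z_crit=1.96):
    """Binomial standard error and the resulting confidence interval."""
    se = math.sqrt(a_o * (1 - a_o) / n)
    return a_o - se * z_crit, a_o + se * z_crit

table = [[2, 2], [1, 5]]           # rows: r2 high/low, cols: r1 high/low
a_o = percentage_agreement(table)  # (2 + 5) / 10 = 0.7
lo, hi = confidence_interval(a_o, 10)  # ≈ (0.416, 0.984) at 95%
```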
Issues
Why is there Disagreement at all?

          r1 +    r1 –      Σ
  r2 +     10      20       30
  r2 –     20    1000     1020
  Σ        30    1020     1050

AO = (10 + 1000) / 1050 ≈ 0.96: almost perfect agreement, although the actual PN identification did not really work (each rater marked 30 items positive, but they agree on only 10 of them)!
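A quick computation confirms the number (a minimal sketch; the variable names are mine):

```python
# Skewed contingency table: rows r2 (+/-), columns r1 (+/-).
table = [[10, 20], [20, 1000]]

n = sum(sum(row) for row in table)     # 1050 items in total
a_o = (table[0][0] + table[1][1]) / n  # (10 + 1000) / 1050
print(round(a_o, 3))                   # 0.962
```

The high value is driven almost entirely by the dominant negative class.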
Issues
Summary

  Measure                   chance-corrected   multiple raters   weighted categories
  Percentage of Agreement          –                  –                  –
Chance-corrected Measures
Definition

  Bennett, Alpert & Goldstein (1954):   S = (AO – AES) / (1 – AES)
  Scott (1955):                         π = (AO – AEπ) / (1 – AEπ)
  Cohen (1960):                         κ = (AO – AEκ) / (1 – AEκ)
Chance-corrected Measures
Example

Basic idea: agreement = (AO – AE) / (1 – AE)

Percentage of agreement:
  AO = 1/n · Σc (# of agreements)
  AO = 1/10 · (2 + 5) = 0.7

Chance-corrected S:
  AES = 1/k
  AES = 1/2 = 0.5
  S = (0.7 – 0.5) / (1 – 0.5) = 0.4

Cohen's κ:
  AEκ = 1/n² · Σc nc,r1 · nc,r2
  AEκ = 1/10² · (3 · 4 + 6 · 7) = 0.54
  κ = (0.7 – 0.54) / (1 – 0.54) ≈ 0.35
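The example can be reproduced in Python (a sketch; Scott's π is added for comparison, with AEπ computed from the averaged marginals as in its definition):

```python
def chance_corrected(a_o, a_e):
    """Generic form shared by S, pi, and kappa: (A_O - A_E) / (1 - A_E)."""
    return (a_o - a_e) / (1 - a_e)

# Contingency table: rows r2 (high/low), columns r1 (high/low).
table = [[2, 2], [1, 5]]
n = 10
k = 2                                   # number of categories
a_o = (table[0][0] + table[1][1]) / n   # 0.7

r1_marg = [3, 7]                        # column totals (r1: high, low)
r2_marg = [4, 6]                        # row totals (r2: high, low)

a_e_s = 1 / k                                                        # 0.5
a_e_pi = sum(((c1 + c2) / (2 * n)) ** 2
             for c1, c2 in zip(r1_marg, r2_marg))                    # 0.545
a_e_kappa = sum(c1 * c2 for c1, c2 in zip(r1_marg, r2_marg)) / n**2  # 0.54

S = chance_corrected(a_o, a_e_s)          # 0.4
pi = chance_corrected(a_o, a_e_pi)        # ≈ 0.341
kappa = chance_corrected(a_o, a_e_kappa)  # ≈ 0.348
```

Note how the three measures differ only in how they estimate the expected agreement AE.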
Issues
Agreement by Chance

Raters who annotate purely at random still reach a non-zero percentage of agreement: with two categories AO = 0.5, with three categories AO = 1/3. The chance-corrected measures correctly yield S = π = κ = 0.0 in both cases.
Issues
Summary

  Measure                   chance-corrected   multiple raters   weighted categories
  Percentage of Agreement          –                  –                  –
  Chance-corrected S               ✓                  –                  –
  Scott's π                        ✓                  –                  –
  Cohen's κ                        ✓                  –                  –
Multiple Raters
Agreement Table
Multiple Raters
Generalized Measures

So far, we have only considered two raters, although there are usually more.

Generalized two-rater measures: Fleiss (1971) generalizes Scott's π; Davies and Fleiss (1982) generalize Cohen's κ. The basic idea of both is to consider each pairwise agreement of raters and to average over all items i.

  multi-π = (A'O – A'Eπ) / (1 – A'Eπ)
  multi-κ = (A'O – A'Eκ) / (1 – A'Eκ)

  A'O  = 1 / (nm(m – 1)) · Σi Σc ni,c (ni,c – 1)

  A'Eπ = 1 / (nm)² · Σc nc²

  A'Eκ = 1 / (m choose 2) · Σc Σ(r1=1..m–1) Σ(r2=r1+1..m) (nc,r1 · nc,r2) / n²

with ni,c the number of raters that annotated item i with category c, nc the total number of annotations with category c, and nc,r the number of annotations by rater r with category c.
Multiple Raters
Example for multi-π

Annotations of m = 3 raters for n = 10 items with the categories high and low (ni,c counts per item; item 10 is inferred from the column totals):

  Item   high   low    Σc ni,c(ni,c – 1)
   1      3      0     3 · 2 + 0 · (–1) = 6
   2      2      1     2
   3      2      1     2
   4      2      1     2
   5      1      2     2
   6      0      3     6
   7      0      3     6
   8      1      2     2
   9      1      2     2
  10      1      2     2
  Σ      13     17     32

Fleiss (1971):

  A'O = 1 / (nm(m – 1)) · Σi Σc ni,c(ni,c – 1)
  A'O = 32 / (10 · 3 · (3 – 1)) ≈ 0.533

  A'Eπ = 1 / (nm)² · Σc nc²
  A'Eπ = 1 / (10 · 3)² · (13² + 17²) ≈ 0.509

  multi-π = (A'O – A'Eπ) / (1 – A'Eπ) = (0.533 – 0.509) / (1 – 0.509) ≈ 0.049
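The multi-π computation generalizes directly to code (a minimal sketch assuming every item is annotated by the same number of raters):

```python
def fleiss_multi_pi(counts):
    """multi-pi (Fleiss 1971) from per-item category counts.

    counts[i][c] = number of raters who assigned category c to item i.
    Assumes every item was annotated by the same number of raters m.
    """
    n = len(counts)       # number of items
    m = sum(counts[0])    # raters per item
    k = len(counts[0])    # number of categories

    # Observed agreement: pairwise agreements per item, averaged.
    a_o = sum(c * (c - 1) for item in counts for c in item) / (n * m * (m - 1))

    # Expected agreement from the pooled category distribution.
    totals = [sum(item[c] for item in counts) for c in range(k)]
    a_e = sum(t ** 2 for t in totals) / (n * m) ** 2

    return (a_o - a_e) / (1 - a_e)

# Per-item (high, low) counts from the example; item 10 = (1, 2).
data = [(3, 0), (2, 1), (2, 1), (2, 1), (1, 2),
        (0, 3), (0, 3), (1, 2), (1, 2), (1, 2)]
result = fleiss_multi_pi(data)   # ≈ 0.0498
```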
Issues
Summary

  Measure                   chance-corrected   multiple raters   weighted categories
  Percentage of Agreement          –                  –                  –
  Chance-corrected S               ✓                  –                  –
  Scott's π                        ✓                  –                  –
  Cohen's κ                        ✓                  –                  –
  multi-π                          ✓                  ✓                  –
  multi-κ                          ✓                  ✓                  –
Krippendorff’s α
Definition
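As a sketch of the definition (the standard formulation for nominal data, not taken verbatim from the slide): α = 1 – Do/De, where the observed disagreement Do and the expected disagreement De are computed from a coincidence matrix over all pairable values:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units[i] is the list of category labels assigned to item i
    (raters who skipped an item are simply absent from its list).
    """
    # Coincidence matrix: each ordered pair of values within a unit
    # contributes 1/(m_u - 1), where m_u is the unit's number of values.
    coincidences = Counter()
    for values in units:
        m_u = len(values)
        if m_u < 2:
            continue  # units with fewer than two values are not pairable
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m_u - 1)

    n_total = sum(coincidences.values())
    marginals = Counter()
    for (a, _), w in coincidences.items():
        marginals[a] += w

    d_o = sum(w for (a, b), w in coincidences.items() if a != b)
    d_e = sum(marginals[a] * marginals[b]
              for a, b in permutations(marginals, 2)) / (n_total - 1)
    return 1 - d_o / d_e

# Same data as the multi-pi example: 3 raters, categories high/low.
units = [["high"] * h + ["low"] * l
         for h, l in [(3, 0), (2, 1), (2, 1), (2, 1), (1, 2),
                      (0, 3), (0, 3), (1, 2), (1, 2), (1, 2)]]
alpha = krippendorff_alpha_nominal(units)   # ≈ 0.081, close to multi-pi
```

For complete nominal data, α differs from multi-π only by a small-sample correction in the expected disagreement.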
Issues
Summary

  Measure                        chance-corrected   multiple raters   weighted categories
  Percentage of Agreement               –                  –                  –
  Chance-corrected S                    ✓                  –                  –
  Scott's π                             ✓                  –                  –
  Cohen's κ                             ✓                  –                  –
  multi-π                               ✓                  ✓                  –
  multi-κ                               ✓                  ✓                  –
  Krippendorff's α                      ✓                  ✓                  ✓
  Weighted κ (not covered here)         ✓                  –                  ✓
Side Note: Criticism
"The Myth of Chance-Corrected Agreement"

Pearson correlation r vs. Cohen's κ:

  A   B          A    B
  1   1          1    2
  2   2          2    4
  3   3          3    6
  4   4          4    8
  5   5          5   10

  r = 1.0        r = 1.0
  κ = 1.0        κ ≈ –0.08

Correlation measures are therefore also not suitable as agreement measures.
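Both columns are easy to check (a sketch; Cohen's κ here treats every distinct numeric rating as its own nominal category):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def cohens_kappa(xs, ys):
    """Cohen's kappa, treating each distinct value as a category."""
    n = len(xs)
    a_o = sum(x == y for x, y in zip(xs, ys)) / n
    cats = set(xs) | set(ys)
    a_e = sum(xs.count(c) * ys.count(c) for c in cats) / n ** 2
    return (a_o - a_e) / (1 - a_e)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
# pearson_r(a, a) = 1.0, cohens_kappa(a, a) = 1.0
# pearson_r(a, b) = 1.0, cohens_kappa(a, b) ≈ -0.087
```

The second rater's values correlate perfectly with the first's but never agree with them, which is why r and κ diverge so sharply.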
Interpretation
What is good agreement?

Krippendorff (2004): "even a cutoff point of 0.8 is a pretty low standard"

Neuendorf (2002): "reliability coefficients of 0.9 or greater would be acceptable to all, 0.8 or greater [..] in most situations"
Recommendations
by Artstein and Poesio (2008)
@ARTICLE{Artstein08,
author = {Artstein, Ron and Poesio, Massimo},
title = {Inter-Coder Agreement for Computational Linguistics},
journal = {Computational Linguistics},
year = {2008},
volume = {34},
pages = {555--596},
number = {4},
month = {December},
publisher = {Cambridge, MA: The MIT Press},
issn = {0891-2017},
url = {http://www.aclweb.org/anthology-new/J/J08/J08-4004.pdf}
}
Extended article version:
http://cswww.essex.ac.uk/Research/nle/arrau/icagr.pdf