
Reading Group

Christian M. Meyer

Inter-Coder Agreement for Computational Linguistics


based on the corresponding survey article
by Ron Artstein and Massimo Poesio,
Computational Linguistics 34(4):555-596, December 2008.

Introduction
Validity, Reliability, Agreement

• We manually or automatically generate datasets for evaluating our NLP
  systems. For the evaluation, we need to consider the following questions:

  Is my evaluation valid?
  • Can we actually draw conclusions?
  • One prerequisite for validity is that the evaluation data is reliable.

  Is my evaluation data reliable?
  • Is the generation reproducible?
  • Raters annotate a sample of the data.
  • Assumption: The data is reliable if their agreement is good.

  What is good agreement?
  • How to measure agreement?
  • How to interpret the result?
Introduction
Notation

50 years of agreement studies – 50 different notation schemas!

• m raters r ∈ R, aka coders, annotators, observers, …
• n items i ∈ I, aka units, records, …
• k categories c ∈ C, aka labels, annotations, …, which can be:
  • binary (yes, no)
  • nominal (NN, NNP, JJ, VB)
  • ordinal (1, 2, 3, …)
  • ordered-category (low, high)
  • likert-scale (strongly disagree, disagree, agree, strongly agree)
  • continuous (0.03, 0.49, …)

Example annotations (one item per row, one rater per column):

  Item           r1    r2      r3
  matching?      yes   yes     no
  score for ..   low   medium  low
  Apple          NN    NNP     NN
  ..bass..       WN1   WN2     WN1

Percentage of Agreement
Definition

  Relatedness?           r1     r2
  gem – jewel            high   high
  coast – shore          high   high
  coast – hill           high   low
  forest – graveyard     low    high
  asylum – fruit         low    low
  noon – string          low    low
  automobile – wizard    low    low
  brother – lad          low    high
  cord – smile           low    low
  autograph – shore      low    low

Contingency table:

            r1 high   r1 low
  r2 high      2         2        4
  r2 low       1         5        6
               3         7       10

Percentage of agreement:
  AO = 1/n · Σ_c (# of agreements on category c)
  AO = 1/10 · (2 + 5) = 0.7

Example word pairs taken from Rubenstein & Goodenough (1965).
Calculation example is inspired by Artstein & Poesio (2008).

Percentage of Agreement
Standard Error and Confidence Interval

            r1 high   r1 low
  r2 high      2         2        4
  r2 low       1         5        6
               3         7       10

Percentage of agreement:
  AO = 1/n · Σ_c (# of agreements on category c) = 1/10 · (2 + 5) = 0.7

Standard error:
  SE(AO) = √(AO · (1 – AO)) / n
  SE(AO) = √(0.7 · (1 – 0.7)) / 10 ≈ 0.046

Confidence intervals:
  CL = AO – SE(AO) · zCrit
  CU = AO + SE(AO) · zCrit

  0.610 ≤ 0.7 ≤ 0.789   with zCrit = 1.96 (95% confidence)
  0.624 ≤ 0.7 ≤ 0.775   with zCrit = 1.645 (90% confidence)
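To make the computation concrete, here is a minimal Python sketch that recomputes the values above from the raw labels. The function and variable names are illustrative only, and the standard error follows the formula as given on this slide (the more common binomial form divides by n inside the square root).

import math

def observed_agreement(r1, r2):
    """Percentage of agreement A_O for two raters over the same items."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

# Relatedness judgements from the Rubenstein & Goodenough example above.
r1 = ["high", "high", "high", "low", "low", "low", "low", "low", "low", "low"]
r2 = ["high", "high", "low", "high", "low", "low", "low", "high", "low", "low"]

a_o = observed_agreement(r1, r2)      # 0.7
n = len(r1)
se = math.sqrt(a_o * (1 - a_o)) / n   # ~0.046, the slide's SE formula
for z, level in ((1.96, 95), (1.645, 90)):
    print(f"{level}%: {a_o - z * se:.3f} <= {a_o} <= {a_o + z * se:.3f}")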

Issues
Why is there Disagreement at all?

Sources of disagreement:
• Insecurity in deciding on a category
• Carelessness
• Difficulties/differences in comprehending the instructions
• Openness to distractions
• Tendency to relax the performance standard when tired
• Personal opinions/values
• …

Possible corrective actions:
• Train the annotators
• Write better instructions
• Provide a better environment
• Reduce the amount of annotated data per annotator
• Use more annotators
• …
Taken from Krippendorff (2004)
Issues
Equal Weights

• All categories are treated equally.
• Consider annotating proper nouns (PN) for tokens in arbitrary texts:

            r1 +    r1 –
  r2 +       10      20        30
  r2 –       20   1,000     1,020
             30   1,020     1,050

  AO = 1/1050 · (10 + 1000) ≈ 0.962

  Almost perfect agreement, although the actual PN identification
  did not really work!

• For binary data: calculate positive and negative agreement
  (Cicchetti and Feinstein, 1990):

  AO+ = 2 · (# of agreements on +) / Σ_r (# of + annotations by rater r)
  AO+ = 2 · 10 / (30 + 30) = 0.333

  AO– = 2 · (# of agreements on –) / Σ_r (# of – annotations by rater r)
  AO– = 2 · 1000 / (1020 + 1020) = 0.980
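A small sketch of the positive/negative agreement computation (the helper name is illustrative; the raw label lists are reconstructed from the contingency table above):

def specific_agreement(r1, r2, category):
    """Category-specific agreement (Cicchetti and Feinstein, 1990):
    2 * (# agreements on the category) / (# of its annotations by both raters)."""
    agreements = sum(a == b == category for a, b in zip(r1, r2))
    annotations = sum(a == category for a in r1) + sum(b == category for b in r2)
    return 2 * agreements / annotations

# Proper-noun example: 10x both "+", 1000x both "-", 20 disagreements each way.
r1 = ["+"] * 10 + ["+"] * 20 + ["-"] * 20 + ["-"] * 1000
r2 = ["+"] * 10 + ["-"] * 20 + ["+"] * 20 + ["-"] * 1000
print(specific_agreement(r1, r2, "+"))  # ~0.333
print(specific_agreement(r1, r2, "-"))  # ~0.980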
Issues
Agreement by Chance

• Percentage of agreement does not account for agreement by chance.
• Imagine the raters would guess randomly:

  Two categories:
            r1 high   r1 low
  r2 high      45        45       90
  r2 low       45        45       90
               90        90      180

  AO = 1/180 · (45 + 45) = 0.5

  Three categories:
            r1 high   r1 med   r1 low
  r2 high      20        20       20       60
  r2 med       20        20       20       60
  r2 low       20        20       20       60
               60        60       60      180

  AO = 1/180 · (20 + 20 + 20) = 1/3

• One would assume similar agreement in both settings.
• → Use chance-corrected measures!

Issues
Summary

  Measure                    chance-corrected agreement   multiple raters   weighted categories
  Percentage of Agreement               ✗                        ✗                   ✗

Chance-corrected Measures
Definition

• Basic idea:

    agreement = (agreement beyond chance) / (attainable chance-corrected agreement)
              = (AO – AE) / (1 – AE)

  Bennett, Alpert & Goldstein (1954):   S = (AO – AE_S) / (1 – AE_S)
  Scott (1955):                         π = (AO – AE_π) / (1 – AE_π)
  Cohen (1960):                         κ = (AO – AE_κ) / (1 – AE_κ)

• S assumes a uniform distribution, i.e. the same probability for each of the
  k categories:
    AE_S = 1/k

• π assumes a single distribution for all raters, i.e. each rater annotates
  the same way:
    AE_π = 1/(4n²) · Σ_c n_c²
  with n_c the total number of annotations for category c by all raters.

• κ assumes different probability distributions for each rater:
    AE_κ = 1/n² · Σ_c n_{c,r1} · n_{c,r2}
  with n_{c,r} the total number of annotations for category c by rater r.

Chance-corrected Measures
Example

• Basic idea: agreement = (AO – AE) / (1 – AE)

            r1 high   r1 low
  r2 high      2         2        4
  r2 low       1         5        6
               3         7       10

  Percentage of agreement:
  AO = 1/n · Σ_c (# of agreements on category c) = 1/10 · (2 + 5) = 0.7

• Chance-corrected S:   AE_S = 1/k
  AE_S = 1/2 = 0.5
  S = (0.7 – 0.5) / (1 – 0.5) = 0.4

• Scott’s π:   AE_π = 1/(4n²) · Σ_c n_c²
  AE_π = 1/(4 · 10²) · ((3 + 4)² + (6 + 7)²) = 0.545
  π = (0.7 – 0.545) / (1 – 0.545) ≈ 0.341

• Cohen’s κ:   AE_κ = 1/n² · Σ_c n_{c,r1} · n_{c,r2}
  AE_κ = 1/10² · (3 · 4 + 6 · 7) = 0.54
  κ = (0.7 – 0.54) / (1 – 0.54) ≈ 0.348
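The three coefficients differ only in their expected agreement. A minimal Python sketch for two raters (illustrative names) that reproduces the example above:

from collections import Counter

def s_pi_kappa(r1, r2, k):
    """Chance-corrected S, Scott's pi and Cohen's kappa for two raters;
    k is the number of available categories (needed for S only)."""
    n = len(r1)
    a_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    categories = set(c1) | set(c2)

    a_e_s = 1 / k                                                          # uniform categories
    a_e_pi = sum((c1[c] + c2[c]) ** 2 for c in categories) / (4 * n ** 2)  # pooled distribution
    a_e_k = sum(c1[c] * c2[c] for c in categories) / n ** 2                # per-rater distributions

    return tuple((a_o - a_e) / (1 - a_e) for a_e in (a_e_s, a_e_pi, a_e_k))

r1 = ["high"] * 3 + ["low"] * 7
r2 = ["high", "high", "low", "high", "low", "low", "low", "high", "low", "low"]
print(s_pi_kappa(r1, r2, k=2))  # approx. (0.4, 0.341, 0.348)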

Issues
Agreement by Chance

• Percentage of agreement does not account for agreement by chance.
• Imagine the raters would guess randomly:

  Two categories:
            r1 high   r1 low
  r2 high      45        45       90
  r2 low       45        45       90
               90        90      180

  AO = 0.5,  S = 0.0,  π = 0.0,  κ = 0.0

  Three categories:
            r1 high   r1 med   r1 low
  r2 high      20        20       20       60
  r2 med       20        20       20       60
  r2 low       20        20       20       60
               60        60       60      180

  AO = 1/3,  S = 0.0,  π = 0.0,  κ = 0.0

• Chance-corrected: all three measures are 0.0 when the raters guess randomly.

Issues
Summary

  Measure                    chance-corrected agreement   multiple raters   weighted categories
  Percentage of Agreement               ✗                        ✗                   ✗
  Chance-corrected S                    ✓                        ✗                   ✗
  Scott’s π                             ✓                        ✗                   ✗
  Cohen’s κ                             ✓                        ✗                   ✗

Multiple Raters
Agreement Table

  Relatedness?           r1    r2    r3        Item   high   low
  gem – jewel            high  high  high        1      3     0
  coast – shore          high  high  low         2      2     1
  coast – hill           high  low   high        3      2     1
  forest – graveyard     low   high  high        4      2     1
  asylum – fruit         low   low   high        5      1     2
  noon – string          low   low   low         6      0     3
  automobile – wizard    low   low   low         7      0     3
  brother – lad          low   high  low         8      1     2
  cord – smile           low   low   high        9      1     2
  autograph – shore      low   low   high       10      1     2

  The raw annotations (left) are converted to an agreement table (right)
  that counts, per item, how many raters chose each category.

Example word pairs taken from Rubenstein & Goodenough (1965).

Multiple Raters
Generalized Measures

• So far, we only considered two raters, although there are usually more.
• Generalize the two-rater measures by considering each pairwise agreement
  of raters and averaging over all items i:

  Fleiss (1971) generalizes Scott’s π:
    multi-π = (A'O – A'E_π) / (1 – A'E_π)

  Davies and Fleiss (1982) generalize Cohen’s κ:
    multi-κ = (A'O – A'E_κ) / (1 – A'E_κ)

  Observed agreement (identical for both):
    A'O = 1/(nm(m – 1)) · Σ_i Σ_c n_{i,c} · (n_{i,c} – 1)
  with n_{i,c} the number of raters that annotated item i with category c.

  Expected agreement:
    A'E_π = 1/(nm)² · Σ_c n_c²
  with n_c the total number of annotations for category c by all raters.

    A'E_κ = 1/C(m,2) · Σ_c Σ_{r1=1..m-1} Σ_{r2=r1+1..m} n_{c,r1} · n_{c,r2} / n²
  with n_{c,r} the total number of annotations for category c by rater r.

Multiple Raters
Example for multi-π

  Item   high   low    Σ_c n_{i,c}(n_{i,c} – 1)
    1      3     0              6
    2      2     1              2
    3      2     1              2
    4      2     1              2
    5      1     2              2
    6      0     3              6
    7      0     3              6
    8      1     2              2
    9      1     2              2
   10      1     2              2
  n_c     13    17             32

  Fleiss (1971):
    A'O = 1/(nm(m – 1)) · Σ_i Σ_c n_{i,c} · (n_{i,c} – 1)
        = 32 / (10 · 3 · (3 – 1)) ≈ 0.533

    A'E_π = 1/(nm)² · Σ_c n_c²
          = 1/(10 · 3)² · (13² + 17²) ≈ 0.509

    multi-π = (A'O – A'E_π) / (1 – A'E_π) ≈ 0.05

• multi-π is also known as κ (Fleiss, 1971) and as K (Carletta, 1996) –
  check the definition when comparing studies!
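A short Python sketch of multi-π following the formulas above (names are illustrative); it reproduces the three-rater example:

from collections import Counter

def multi_pi(item_labels):
    """Fleiss' multi-pi; item_labels holds one list of m labels per item."""
    n, m = len(item_labels), len(item_labels[0])
    # Observed agreement: proportion of agreeing rater pairs within each item.
    a_o = sum(c * (c - 1) for labels in item_labels
              for c in Counter(labels).values()) / (n * m * (m - 1))
    # Expected agreement from the pooled category totals n_c.
    totals = Counter(label for labels in item_labels for label in labels)
    a_e = sum(t ** 2 for t in totals.values()) / (n * m) ** 2
    return (a_o - a_e) / (1 - a_e)

items = [["high", "high", "high"], ["high", "high", "low"], ["high", "low", "high"],
         ["low", "high", "high"], ["low", "low", "high"], ["low", "low", "low"],
         ["low", "low", "low"], ["low", "high", "low"], ["low", "low", "high"],
         ["low", "low", "high"]]
print(multi_pi(items))  # approx. 0.05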

Issues
Summary

  Measure                    chance-corrected agreement   multiple raters   weighted categories
  Percentage of Agreement               ✗                        ✗                   ✗
  Chance-corrected S                    ✓                        ✗                   ✗
  Scott’s π                             ✓                        ✗                   ✗
  Cohen’s κ                             ✓                        ✗                   ✗
  multi-π                               ✓                        ✓                   ✗
  multi-κ                               ✓                        ✓                   ✗

Krippendorff’s α
Definition

• Gain further flexibility by allowing arbitrary category metrics.

  Krippendorff (1980):
  Derived from empirical statistics and content analysis, but it can be
  expressed in the same notation.

    α = 1 – DO_α / DE_α = 1 – (est. variance within items) / (est. total variance)

    DO_α = 1/(nm(m – 1)) · Σ_i Σ_{c1} Σ_{c2} n_{i,c1} · n_{i,c2} · d_{c1,c2}
    DE_α = 1/(nm(nm – 1)) · Σ_{c1} Σ_{c2} n_{c1} · n_{c2} · d_{c1,c2}

  with n_{i,c} the number of raters that annotated item i with category c,
  and n_c the total number of annotations for category c by all raters.

  Distance function d_{c1,c2}:
  An arbitrary metric, which allows working with
  • binary or nominal data:  d_{c1,c2} = (c1 == c2 ? 0 : 1)
    with this distance function: α ≈ π
  • ordinal data (‘squared distance function’):  d_{c1,c2} = (c1 – c2)²
  • weighted data, e.g. for POS tags:

      d_{c1,c2}   NN    NNP   VB
      NN           –    0.1   0.9
      NNP         0.1    –    0.9
      VB          0.9   0.9    –

  • as well as interval and ratio data.
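The distance functions listed above can be written as plain Python functions and passed to an α implementation such as the sketch after the example on the next slide; these definitions are illustrative transcriptions, not part of the original deck:

# Nominal/binary data: 0 for identical categories, 1 otherwise (then alpha ~ pi).
def d_nominal(c1, c2):
    return 0.0 if c1 == c2 else 1.0

# Ordinal/interval data ('squared distance function'):
def d_squared(c1, c2):
    return float(c1 - c2) ** 2

# Weighted data, e.g. the POS-tag distances from the table above:
POS_DISTANCES = {frozenset({"NN", "NNP"}): 0.1,
                 frozenset({"NN", "VB"}): 0.9,
                 frozenset({"NNP", "VB"}): 0.9}

def d_pos(c1, c2):
    return 0.0 if c1 == c2 else POS_DISTANCES[frozenset({c1, c2})]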
Krippendorff’s α
Example

  Contingency table n_{c1,c2} (100 items, 2 raters):

            r1 +    r1 ●    r1 –
  r2 +       46       0       6       52
  r2 ●        0      10       6       16
  r2 –        0       0      32       32
             46      10      44      100

  Distance function d_{c1,c2}:
           +      ●      –
  +       0.0    0.5    1.0
  ●       0.5    0.0    0.5
  –       1.0    0.5    0.0

  Category totals n_c:   + 98,   ● 26,   – 76

  Products n_{c1} · n_{c2} · d_{c1,c2}:
           +       ●       –
  +        0    1274    7448
  ●     1274       0     988
  –     7448     988       0

  Krippendorff (1980):
    DO_α = 1/(nm(m – 1)) · Σ_i Σ_{c1} Σ_{c2} n_{i,c1} · n_{i,c2} · d_{c1,c2}
         = 2 · (46 · 0 + 10 · 0 + 6 · 1 + 6 · 0.5 + 32 · 0) / (100 · 2 · (2 – 1)) = 0.09

    DE_α = 1/(nm(nm – 1)) · Σ_{c1} Σ_{c2} n_{c1} · n_{c2} · d_{c1,c2}
         = (1274 + 7448 + 1274 + 988 + 7448 + 988) / (100 · 2 · (100 · 2 – 1)) ≈ 0.4879

    α = 1 – DO_α / DE_α = 1 – 0.09 / 0.4879 ≈ 0.8155
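A compact Python sketch that transcribes the D_O and D_E formulas above directly (illustrative names; no handling of missing annotations); it reproduces the example:

from collections import Counter
from itertools import permutations

def krippendorff_alpha(item_labels, distance):
    """Krippendorff's alpha for m raters and an arbitrary distance function."""
    n, m = len(item_labels), len(item_labels[0])
    # Observed disagreement: distances between annotations within the same item.
    d_o = sum(distance(a, b) for labels in item_labels
              for a, b in permutations(labels, 2)) / (n * m * (m - 1))
    # Expected disagreement: distances between any two annotations overall.
    totals = Counter(label for labels in item_labels for label in labels)
    d_e = sum(totals[c1] * totals[c2] * distance(c1, c2)
              for c1 in totals for c2 in totals) / (n * m * (n * m - 1))
    return 1 - d_o / d_e

DIST = {frozenset({"+", "o"}): 0.5, frozenset({"o", "-"}): 0.5, frozenset({"+", "-"}): 1.0}
def d(c1, c2):
    return 0.0 if c1 == c2 else DIST[frozenset({c1, c2})]

# The 100 items from the table above; "o" stands for the filled-circle category.
items = ([["+", "+"]] * 46 + [["o", "o"]] * 10 + [["-", "-"]] * 32 +
         [["-", "+"]] * 6 + [["-", "o"]] * 6)
print(krippendorff_alpha(items, d))  # approx. 0.8155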

Issues
Summary

  Measure                          chance-corrected agreement   multiple raters   weighted categories
  Percentage of Agreement                     ✗                        ✗                   ✗
  Chance-corrected S                          ✓                        ✗                   ✗
  Scott’s π                                   ✓                        ✗                   ✗
  Cohen’s κ                                   ✓                        ✗                   ✗
  multi-π                                     ✓                        ✓                   ✗
  multi-κ                                     ✓                        ✓                   ✗
  Krippendorff’s α                            ✓                        ✓                   ✓
  Weighted κ (not covered here)               ✓                        ✗                   ✓
Side Note: Criticism
“The Myth of Chance-Corrected Agreement”

• Chance-corrected measures have often been criticized.
• The presented measures S, π, κ assume that the raters are completely
  statistically independent.
• This means either (1) the raters guess on every item, or (2) the raters
  guess with probabilities similar to the observed ratings.
  • (1) is clearly not valid for an annotation study.
  • (2) would not need a chance correction.
• Another argument is the different approach taken when comparing to a gold
  standard: there, we measure precision/recall without any chance correction.
• John Uebersax proposes using raw agreement and focusing on statistical
  significance tests, standard errors and confidence intervals.
• cf. (Uebersax, 1987; Agresti, 1992; Uebersax, 1993)
Traditional Statistics
Why not use χ2 or correlations?

• χ² measures association rather than agreement:

            r1 +    r1 ●    r1 –
  r2 +       25      13      12       50
  r2 ●       12       2      16       30
  r2 –        3      15       2       20
             40      30      30      100

  χ² = 64.59 is highly significant because of the strong associations
  +/+, ●/– and –/●. The agreement, however, is low:
  AO = 0.36,  S = 0.04,  π = 0.02,  κ = 0.04
  (adapted from Cohen, 1960)

• Pearson correlation r vs. Cohen’s κ:

    A   B        A    B
    1   1        1    2
    2   2        2    4
    3   3        3    6
    4   4        4    8
    5   5        5   10

    r = 1.0      r = 1.0
    κ = 1.0      κ ≈ –0.09

  Correlation measures are also not suitable as agreement measures.
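A small sketch (illustrative helper names) that reproduces the correlation-vs-κ contrast on the right: perfectly correlated ratings on different scales still get r = 1.0 but a κ near zero or below.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cohen_kappa(r1, r2):
    n = len(r1)
    a_o = sum(a == b for a, b in zip(r1, r2)) / n
    a_e = sum(r1.count(c) * r2.count(c) for c in set(r1) | set(r2)) / n ** 2
    return (a_o - a_e) / (1 - a_e)

a = [1, 2, 3, 4, 5]
print(pearson_r(a, a), cohen_kappa(a, a))   # 1.0  1.0
b = [2, 4, 6, 8, 10]
print(pearson_r(a, b), cohen_kappa(a, b))   # 1.0  approx. -0.09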

Interpretation
What is good agreement?

• Landis and Koch (1977):

    below 0.0     Poor
    0.00 – 0.20   Slight
    0.21 – 0.40   Fair
    0.41 – 0.60   Moderate
    0.61 – 0.80   Substantial
    0.81 – 1.00   (Almost) Perfect

• Krippendorff (1980); Carletta (1996):
  • 0.67 < K < 0.8 “allowing tentative conclusions to be drawn”
  • above 0.8 “good reliability”

• Krippendorff (2004):
  • “even a cutoff point of 0.8 is a pretty low standard”

• Neuendorf (2002):
  • “reliability coefficients of 0.9 or greater would be acceptable to all,
    0.8 or greater [..] in most situations”
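For convenience, the Landis and Koch scale can be turned into a small lookup; the boundary handling here is one possible convention, not something prescribed by the slide:

def landis_koch_label(coefficient):
    """Map an agreement coefficient to the Landis and Koch (1977) labels."""
    if coefficient < 0.0:
        return "Poor"
    for upper, label in [(0.2, "Slight"), (0.4, "Fair"), (0.6, "Moderate"),
                         (0.8, "Substantial"), (1.0, "(Almost) Perfect")]:
        if coefficient <= upper:
            return label
    raise ValueError("coefficient must be <= 1.0")

print(landis_koch_label(0.341))  # Fair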
Recommendations
by Artstein and Poesio (2008)

1. Anything is better than nothing.
2. Give details on your study (who annotates and how?).
3. Use intensive training or professional annotators.
4. Also report the agreement table/contingency matrix
   rather than only the obtained agreement value.
5. Annotate with as many raters as possible, since this reduces the
   differences between the measures.
6. Use K (equal to multi-π) or α: they are used in the majority of studies,
   allow comparison and solve the chance-related issues.
7. Use Krippendorff’s α for category labels that are not fully distinct
   from each other (custom distance function).
8. Be careful with weighted measures, as they are hard to interpret.
9. Agreement should be above 0.8 to ensure data reliability (but this
   depends on the case).
Bibliography
Where to Start Reading
• Artstein, R. and Poesio, M. "Inter-Coder Agreement for Computational Linguistics," Computational Linguistics (34:4), 2008, pp. 555-596.
• Artstein, R. and Poesio, M. "Bias decreases in proportion to the number of annotators," Proceedings of the 10th Conference on Formal Grammar and the 9th Meeting on Mathematics of Language, 2005, pp. 141-150.
• Bennett, E. M., Alpert, R. and Goldstein, A. C. "Communications through limited response questioning," Public Opinion Quarterly (18:3), 1954, pp. 303-308.
• Carletta, J. "Assessing agreement on classification tasks: The kappa statistic," Computational Linguistics (22:2), 1996, pp. 249-254.
• Cicchetti, D. V. "A new measure of agreement between rank ordered variables," Proceedings of the American Psychological Association, American Statistical Association, 1972, pp. 17-18.
• Cohen, J. "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement (20:1), 1960, pp. 37-46.
• Cohen, J. "Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit," Psychological Bulletin (70:4), 1968, pp. 213-220.
• Davies, M. and Fleiss, J. L. "Measuring agreement for multinomial data," Biometrics (38:4), 1982, pp. 1047-1051.
• Di Eugenio, B. "On the usage of Kappa to evaluate agreement on coding tasks," Proceedings of the Second International Conference on Language Resources and Evaluation, 2000, pp. 441-444.
• Di Eugenio, B. and Glass, M. "The Kappa Statistic: A Second Look," Computational Linguistics (30:1), 2004, pp. 95-101.
• Fleiss, J. L. "Measuring nominal scale agreement among many raters," Psychological Bulletin (76:5), 1971, pp. 378-381.
• Fleiss, J. L. and Cohen, J. "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability," Educational and Psychological Measurement (33:3), 1973, pp. 613-619.
• Krippendorff, K. Content Analysis: An Introduction to Its Methodology, Thousand Oaks, CA: Sage Publications, 2004.
• Landis, J. R. and Koch, G. G. "The measurement of observer agreement for categorical data," Biometrics (33:1), 1977, pp. 159-174.
• Neuendorf, K. A. The Content Analysis Guidebook, Thousand Oaks, CA: Sage Publications, 2002.
• Passonneau, R. J. "Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation," Proceedings of the Fifth International Conference on Language Resources and Evaluation, 2006.
• Scott, W. A. "Reliability of content analysis: The case of nominal scale coding," Public Opinion Quarterly (19:3), 1955, pp. 321-325.
• Siegel, S. and Castellan Jr., N. J. Nonparametric Statistics for the Behavioral Sciences, New York, NY: McGraw-Hill, 1988.

Full BibTeX file available on request.
Thanks for your attention
Any Questions?

@ARTICLE{Artstein08,
author = {Artstein, Ron and Poesio, Massimo},
title = {Inter-Coder Agreement for Computational Linguistics},
journal = {Computational Linguistics},
year = {2008},
volume = {34},
pages = {555--596},
number = {4},
month = {December},
publisher = {Cambridge, MA: The MIT Press},
issn = {0891-2017},
url = {http://www.aclweb.org/anthology-new/J/J08/J08-4004.pdf}
}
Extended article version:
http://cswww.essex.ac.uk/Research/nle/arrau/icagr.pdf
