
Reading Group

Christian M. Meyer

Inter-Coder Agreement for Computational Linguistics


based on the corresponding survey article
by Ron Artstein and Massimo Poesio,
Computational Linguistics 34(4):555-596, December 2008.

Introduction
Validity, Reliability, Agreement

• We manually or automatically generate datasets for evaluating our NLP
  systems. For the evaluation, we need to consider the following questions:

  Is my evaluation valid?
  • Can we actually draw conclusions?
  • One prerequisite for validity is that the evaluation data is reliable.

  Is my evaluation data reliable?
  • Is the generation reproducible?
  • Raters annotate a sample of the data.
  • Assumption: The data is reliable if their agreement is good.

  What is good agreement?
  • How to measure agreement?
  • How to interpret the result?
Introduction
Notation

50 years of agreement studies – 50 different notation schemas!

• m raters r ∈ R, aka coders, annotators, observers, …
• n items i ∈ I, aka units, records, …
• k categories c ∈ C, aka labels, annotations, …, which can be:
  • binary (yes, no)
  • nominal (NN, NNP, JJ, VB)
  • ordinal (1, 2, 3, …)
  • ordered-category (low, high)
  • likert-scale (strongly disagree, disagree, agree, strongly agree)
  • continuous (0.03, 0.49, …)

Example annotations (one item per row, one rater per column):

  Item           r1    r2      r3
  matching?      yes   yes     no
  score for ..   low   medium  low
  Apple          NN    NNP     NN
  ..bass..       WN1   WN2     WN1

Percentage of Agreement
Definition

  Relatedness?           r1     r2
  gem – jewel            high   high
  coast – shore          high   high
  coast – hill           high   low
  forest – graveyard     low    high
  asylum – fruit         low    low
  noon – string          low    low
  automobile – wizard    low    low
  brother – lad          low    high
  cord – smile           low    low
  autograph – shore      low    low

Contingency table:

            r1 high   r1 low
  r2 high      2         2        4
  r2 low       1         5        6
               3         7       10

Percentage of agreement:
  AO = 1/n · Σ_c (# of agreements on category c)
  AO = 1/10 · (2 + 5) = 0.7

Example word pairs taken from Rubenstein & Goodenough (1965).
Calculation example is inspired by Artstein & Poesio (2008).

Percentage of Agreement
Standard Error and Confidence Interval

            r1 high   r1 low
  r2 high      2         2        4
  r2 low       1         5        6
               3         7       10

Percentage of agreement:
  AO = 1/n · Σ_c (# of agreements on category c) = 1/10 · (2 + 5) = 0.7

Standard error:
  SE(AO) = √(AO · (1 – AO)) / n
  SE(AO) = √(0.7 · (1 – 0.7)) / 10 ≈ 0.046

Confidence intervals:
  CL = AO – SE(AO) · zCrit
  CU = AO + SE(AO) · zCrit

  0.610 ≤ 0.7 ≤ 0.789   with zCrit = 1.96 (95% confidence)
  0.624 ≤ 0.7 ≤ 0.775   with zCrit = 1.645 (90% confidence)
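To make the computation concrete, here is a minimal Python sketch that recomputes the values above from the raw labels. The function and variable names are illustrative only, and the standard error follows the formula as given on this slide (the more common binomial form divides by n inside the square root).

import math

def observed_agreement(r1, r2):
    """Percentage of agreement A_O for two raters over the same items."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

# Relatedness judgements from the Rubenstein & Goodenough example above.
r1 = ["high", "high", "high", "low", "low", "low", "low", "low", "low", "low"]
r2 = ["high", "high", "low", "high", "low", "low", "low", "high", "low", "low"]

a_o = observed_agreement(r1, r2)      # 0.7
n = len(r1)
se = math.sqrt(a_o * (1 - a_o)) / n   # ~0.046, the slide's SE formula
for z, level in ((1.96, 95), (1.645, 90)):
    print(f"{level}%: {a_o - z * se:.3f} <= {a_o} <= {a_o + z * se:.3f}")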

Issues
Why is there Disagreement at all?

Sources of disagreement:
• Insecurity in deciding on a category
• Carelessness
• Difficulties/differences in comprehending the instructions
• Openness to distractions
• Tendency to relax the performance standard when tired
• Personal opinions/values
• …

Possible corrective actions:
• Train the annotators
• Write better instructions
• Provide a better environment
• Reduce the amount of annotated data per annotator
• Use more annotators
• …
Taken from Krippendorff (2004)
Issues
Equal Weights

• All categories are treated equally.
• Consider annotating proper nouns (PN) for tokens in arbitrary texts:

            r1 +    r1 –
  r2 +       10      20        30
  r2 –       20   1,000     1,020
             30   1,020     1,050

  AO = 1/1050 · (10 + 1000) ≈ 0.962

  Almost perfect agreement, although the actual PN identification
  did not really work!

• For binary data: calculate positive and negative agreement
  (Cicchetti and Feinstein, 1990):

  AO+ = 2 · (# of agreements on +) / Σ_r (# of + annotations by rater r)
  AO+ = 2 · 10 / (30 + 30) = 0.333

  AO– = 2 · (# of agreements on –) / Σ_r (# of – annotations by rater r)
  AO– = 2 · 1000 / (1020 + 1020) = 0.980
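A small sketch of the positive/negative agreement computation (the helper name is illustrative; the raw label lists are reconstructed from the contingency table above):

def specific_agreement(r1, r2, category):
    """Category-specific agreement (Cicchetti and Feinstein, 1990):
    2 * (# agreements on the category) / (# of its annotations by both raters)."""
    agreements = sum(a == b == category for a, b in zip(r1, r2))
    annotations = sum(a == category for a in r1) + sum(b == category for b in r2)
    return 2 * agreements / annotations

# Proper-noun example: 10x both "+", 1000x both "-", 20 disagreements each way.
r1 = ["+"] * 10 + ["+"] * 20 + ["-"] * 20 + ["-"] * 1000
r2 = ["+"] * 10 + ["-"] * 20 + ["+"] * 20 + ["-"] * 1000
print(specific_agreement(r1, r2, "+"))  # ~0.333
print(specific_agreement(r1, r2, "-"))  # ~0.980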
Issues
Agreement by Chance

• Percentage of agreement does not account for agreement by chance.
• Imagine the raters would guess randomly:

  Two categories:
            r1 high   r1 low
  r2 high      45        45       90
  r2 low       45        45       90
               90        90      180

  AO = 1/180 · (45 + 45) = 0.5

  Three categories:
            r1 high   r1 med   r1 low
  r2 high      20        20       20       60
  r2 med       20        20       20       60
  r2 low       20        20       20       60
               60        60       60      180

  AO = 1/180 · (20 + 20 + 20) = 1/3

• One would assume similar agreement in both settings.
• → Use chance-corrected measures!

Issues
Summary

  Measure                    chance-corrected agreement   multiple raters   weighted categories
  Percentage of Agreement               ✗                        ✗                   ✗

Chance-corrected Measures
Definition

• Basic idea:

    agreement = (agreement beyond chance) / (attainable chance-corrected agreement)
              = (AO – AE) / (1 – AE)

  Bennett, Alpert & Goldstein (1954):   S = (AO – AE_S) / (1 – AE_S)
  Scott (1955):                         π = (AO – AE_π) / (1 – AE_π)
  Cohen (1960):                         κ = (AO – AE_κ) / (1 – AE_κ)

• S assumes a uniform distribution, i.e. the same probability for each of the
  k categories:
    AE_S = 1/k

• π assumes a single distribution for all raters, i.e. each rater annotates
  the same way:
    AE_π = 1/(4n²) · Σ_c n_c²
  with n_c the total number of annotations for category c by all raters.

• κ assumes different probability distributions for each rater:
    AE_κ = 1/n² · Σ_c n_{c,r1} · n_{c,r2}
  with n_{c,r} the total number of annotations for category c by rater r.

Chance-corrected Measures
Example

• Basic idea: agreement = (AO – AE) / (1 – AE)

            r1 high   r1 low
  r2 high      2         2        4
  r2 low       1         5        6
               3         7       10

  Percentage of agreement:
  AO = 1/n · Σ_c (# of agreements on category c) = 1/10 · (2 + 5) = 0.7

• Chance-corrected S:   AE_S = 1/k
  AE_S = 1/2 = 0.5
  S = (0.7 – 0.5) / (1 – 0.5) = 0.4

• Scott’s π:   AE_π = 1/(4n²) · Σ_c n_c²
  AE_π = 1/(4 · 10²) · ((3 + 4)² + (6 + 7)²) = 0.545
  π = (0.7 – 0.545) / (1 – 0.545) ≈ 0.341

• Cohen’s κ:   AE_κ = 1/n² · Σ_c n_{c,r1} · n_{c,r2}
  AE_κ = 1/10² · (3 · 4 + 6 · 7) = 0.54
  κ = (0.7 – 0.54) / (1 – 0.54) ≈ 0.348
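The three coefficients differ only in their expected agreement. A minimal Python sketch for two raters (illustrative names) that reproduces the example above:

from collections import Counter

def s_pi_kappa(r1, r2, k):
    """Chance-corrected S, Scott's pi and Cohen's kappa for two raters;
    k is the number of available categories (needed for S only)."""
    n = len(r1)
    a_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    categories = set(c1) | set(c2)

    a_e_s = 1 / k                                                          # uniform categories
    a_e_pi = sum((c1[c] + c2[c]) ** 2 for c in categories) / (4 * n ** 2)  # pooled distribution
    a_e_k = sum(c1[c] * c2[c] for c in categories) / n ** 2                # per-rater distributions

    return tuple((a_o - a_e) / (1 - a_e) for a_e in (a_e_s, a_e_pi, a_e_k))

r1 = ["high"] * 3 + ["low"] * 7
r2 = ["high", "high", "low", "high", "low", "low", "low", "high", "low", "low"]
print(s_pi_kappa(r1, r2, k=2))  # approx. (0.4, 0.341, 0.348)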

Issues
Agreement by Chance

• Percentage of agreement does not account for agreement by chance.
• Imagine the raters would guess randomly:

  Two categories:
            r1 high   r1 low
  r2 high      45        45       90
  r2 low       45        45       90
               90        90      180

  AO = 0.5,  S = 0.0,  π = 0.0,  κ = 0.0

  Three categories:
            r1 high   r1 med   r1 low
  r2 high      20        20       20       60
  r2 med       20        20       20       60
  r2 low       20        20       20       60
               60        60       60      180

  AO = 1/3,  S = 0.0,  π = 0.0,  κ = 0.0

• Chance-corrected: all three measures are 0.0 when the raters guess randomly.

Issues
Summary

  Measure                    chance-corrected agreement   multiple raters   weighted categories
  Percentage of Agreement               ✗                        ✗                   ✗
  Chance-corrected S                    ✓                        ✗                   ✗
  Scott’s π                             ✓                        ✗                   ✗
  Cohen’s κ                             ✓                        ✗                   ✗

Multiple Raters
Agreement Table

  Relatedness?           r1    r2    r3        Item   high   low
  gem – jewel            high  high  high        1      3     0
  coast – shore          high  high  low         2      2     1
  coast – hill           high  low   high        3      2     1
  forest – graveyard     low   high  high        4      2     1
  asylum – fruit         low   low   high        5      1     2
  noon – string          low   low   low         6      0     3
  automobile – wizard    low   low   low         7      0     3
  brother – lad          low   high  low         8      1     2
  cord – smile           low   low   high        9      1     2
  autograph – shore      low   low   high       10      1     2

  The raw annotations (left) are converted to an agreement table (right)
  that counts, per item, how many raters chose each category.

Example word pairs taken from Rubenstein & Goodenough (1965).

Multiple Raters
Generalized Measures

• So far, we only considered two raters, although there are usually more.
• Generalize the two-rater measures by considering each pairwise agreement
  of raters and averaging over all items i:

  Fleiss (1971) generalizes Scott’s π:
    multi-π = (A'O – A'E_π) / (1 – A'E_π)

  Davies and Fleiss (1982) generalize Cohen’s κ:
    multi-κ = (A'O – A'E_κ) / (1 – A'E_κ)

  Observed agreement (identical for both):
    A'O = 1/(nm(m – 1)) · Σ_i Σ_c n_{i,c} · (n_{i,c} – 1)
  with n_{i,c} the number of raters that annotated item i with category c.

  Expected agreement:
    A'E_π = 1/(nm)² · Σ_c n_c²
  with n_c the total number of annotations for category c by all raters.

    A'E_κ = 1/C(m,2) · Σ_c Σ_{r1=1..m-1} Σ_{r2=r1+1..m} n_{c,r1} · n_{c,r2} / n²
  with n_{c,r} the total number of annotations for category c by rater r.

Multiple Raters
Example for multi-π

  Item   high   low    Σ_c n_{i,c}(n_{i,c} – 1)
    1      3     0              6
    2      2     1              2
    3      2     1              2
    4      2     1              2
    5      1     2              2
    6      0     3              6
    7      0     3              6
    8      1     2              2
    9      1     2              2
   10      1     2              2
  n_c     13    17             32

  Fleiss (1971):
    A'O = 1/(nm(m – 1)) · Σ_i Σ_c n_{i,c} · (n_{i,c} – 1)
        = 32 / (10 · 3 · (3 – 1)) ≈ 0.533

    A'E_π = 1/(nm)² · Σ_c n_c²
          = 1/(10 · 3)² · (13² + 17²) ≈ 0.509

    multi-π = (A'O – A'E_π) / (1 – A'E_π) ≈ 0.05

• multi-π is also known as κ (Fleiss, 1971) and as K (Carletta, 1996) –
  check the definition when comparing studies!
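A short Python sketch of multi-π following the formulas above (names are illustrative); it reproduces the three-rater example:

from collections import Counter

def multi_pi(item_labels):
    """Fleiss' multi-pi; item_labels holds one list of m labels per item."""
    n, m = len(item_labels), len(item_labels[0])
    # Observed agreement: proportion of agreeing rater pairs within each item.
    a_o = sum(c * (c - 1) for labels in item_labels
              for c in Counter(labels).values()) / (n * m * (m - 1))
    # Expected agreement from the pooled category totals n_c.
    totals = Counter(label for labels in item_labels for label in labels)
    a_e = sum(t ** 2 for t in totals.values()) / (n * m) ** 2
    return (a_o - a_e) / (1 - a_e)

items = [["high", "high", "high"], ["high", "high", "low"], ["high", "low", "high"],
         ["low", "high", "high"], ["low", "low", "high"], ["low", "low", "low"],
         ["low", "low", "low"], ["low", "high", "low"], ["low", "low", "high"],
         ["low", "low", "high"]]
print(multi_pi(items))  # approx. 0.05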

Issues
Summary

  Measure                    chance-corrected agreement   multiple raters   weighted categories
  Percentage of Agreement               ✗                        ✗                   ✗
  Chance-corrected S                    ✓                        ✗                   ✗
  Scott’s π                             ✓                        ✗                   ✗
  Cohen’s κ                             ✓                        ✗                   ✗
  multi-π                               ✓                        ✓                   ✗
  multi-κ                               ✓                        ✓                   ✗

Krippendorff’s α
Definition

• Gain further flexibility by allowing arbitrary category metrics.

  Krippendorff (1980):
  Derived from empirical statistics and content analysis, but it can be
  expressed in the same notation.

    α = 1 – DO_α / DE_α = 1 – (est. variance within items) / (est. total variance)

    DO_α = 1/(nm(m – 1)) · Σ_i Σ_{c1} Σ_{c2} n_{i,c1} · n_{i,c2} · d_{c1,c2}
    DE_α = 1/(nm(nm – 1)) · Σ_{c1} Σ_{c2} n_{c1} · n_{c2} · d_{c1,c2}

  with n_{i,c} the number of raters that annotated item i with category c,
  and n_c the total number of annotations for category c by all raters.

  Distance function d_{c1,c2}:
  An arbitrary metric, which allows working with
  • binary or nominal data:  d_{c1,c2} = (c1 == c2 ? 0 : 1)
    with this distance function: α ≈ π
  • ordinal data (‘squared distance function’):  d_{c1,c2} = (c1 – c2)²
  • weighted data, e.g. for POS tags:

      d_{c1,c2}   NN    NNP   VB
      NN           –    0.1   0.9
      NNP         0.1    –    0.9
      VB          0.9   0.9    –

  • as well as interval and ratio data.
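The distance functions listed above can be written as plain Python functions and passed to an α implementation such as the sketch after the example on the next slide; these definitions are illustrative transcriptions, not part of the original deck:

# Nominal/binary data: 0 for identical categories, 1 otherwise (then alpha ~ pi).
def d_nominal(c1, c2):
    return 0.0 if c1 == c2 else 1.0

# Ordinal/interval data ('squared distance function'):
def d_squared(c1, c2):
    return float(c1 - c2) ** 2

# Weighted data, e.g. the POS-tag distances from the table above:
POS_DISTANCES = {frozenset({"NN", "NNP"}): 0.1,
                 frozenset({"NN", "VB"}): 0.9,
                 frozenset({"NNP", "VB"}): 0.9}

def d_pos(c1, c2):
    return 0.0 if c1 == c2 else POS_DISTANCES[frozenset({c1, c2})]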
Krippendorff’s α
Example

  Contingency table n_{c1,c2} (100 items, 2 raters):

            r1 +    r1 ●    r1 –
  r2 +       46       0       6       52
  r2 ●        0      10       6       16
  r2 –        0       0      32       32
             46      10      44      100

  Distance function d_{c1,c2}:
           +      ●      –
  +       0.0    0.5    1.0
  ●       0.5    0.0    0.5
  –       1.0    0.5    0.0

  Category totals n_c:   + 98,   ● 26,   – 76

  Products n_{c1} · n_{c2} · d_{c1,c2}:
           +       ●       –
  +        0    1274    7448
  ●     1274       0     988
  –     7448     988       0

  Krippendorff (1980):
    DO_α = 1/(nm(m – 1)) · Σ_i Σ_{c1} Σ_{c2} n_{i,c1} · n_{i,c2} · d_{c1,c2}
         = 2 · (46 · 0 + 10 · 0 + 6 · 1 + 6 · 0.5 + 32 · 0) / (100 · 2 · (2 – 1)) = 0.09

    DE_α = 1/(nm(nm – 1)) · Σ_{c1} Σ_{c2} n_{c1} · n_{c2} · d_{c1,c2}
         = (1274 + 7448 + 1274 + 988 + 7448 + 988) / (100 · 2 · (100 · 2 – 1)) ≈ 0.4879

    α = 1 – DO_α / DE_α = 1 – 0.09 / 0.4879 ≈ 0.8155
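A compact Python sketch that transcribes the D_O and D_E formulas above directly (illustrative names; no handling of missing annotations); it reproduces the example:

from collections import Counter
from itertools import permutations

def krippendorff_alpha(item_labels, distance):
    """Krippendorff's alpha for m raters and an arbitrary distance function."""
    n, m = len(item_labels), len(item_labels[0])
    # Observed disagreement: distances between annotations within the same item.
    d_o = sum(distance(a, b) for labels in item_labels
              for a, b in permutations(labels, 2)) / (n * m * (m - 1))
    # Expected disagreement: distances between any two annotations overall.
    totals = Counter(label for labels in item_labels for label in labels)
    d_e = sum(totals[c1] * totals[c2] * distance(c1, c2)
              for c1 in totals for c2 in totals) / (n * m * (n * m - 1))
    return 1 - d_o / d_e

DIST = {frozenset({"+", "o"}): 0.5, frozenset({"o", "-"}): 0.5, frozenset({"+", "-"}): 1.0}
def d(c1, c2):
    return 0.0 if c1 == c2 else DIST[frozenset({c1, c2})]

# The 100 items from the table above; "o" stands for the filled-circle category.
items = ([["+", "+"]] * 46 + [["o", "o"]] * 10 + [["-", "-"]] * 32 +
         [["-", "+"]] * 6 + [["-", "o"]] * 6)
print(krippendorff_alpha(items, d))  # approx. 0.8155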

Issues
Summary

  Measure                          chance-corrected agreement   multiple raters   weighted categories
  Percentage of Agreement                     ✗                        ✗                   ✗
  Chance-corrected S                          ✓                        ✗                   ✗
  Scott’s π                                   ✓                        ✗                   ✗
  Cohen’s κ                                   ✓                        ✗                   ✗
  multi-π                                     ✓                        ✓                   ✗
  multi-κ                                     ✓                        ✓                   ✗
  Krippendorff’s α                            ✓                        ✓                   ✓
  Weighted κ (not covered here)               ✓                        ✗                   ✓
Side Note: Criticism
“The Myth of Chance-Corrected Agreement”

• Chance-corrected measures have often been criticized.
• The presented measures S, π, κ assume that the raters are completely
  statistically independent.
• This means either (1) the raters guess on every item, or (2) the raters
  guess with probabilities similar to the observed ratings.
  • (1) is clearly not valid for an annotation study.
  • (2) would not need a chance correction.
• Another argument is the different approach taken when comparing to a gold
  standard: there, we measure precision/recall without any chance correction.
• John Uebersax proposes using raw agreement and focusing on statistical
  significance tests, standard errors and confidence intervals.
• cf. (Uebersax, 1987; Agresti, 1992; Uebersax, 1993)
Traditional Statistics
Why not use χ2 or correlations?

• χ² measures association rather than agreement:

            r1 +    r1 ●    r1 –
  r2 +       25      13      12       50
  r2 ●       12       2      16       30
  r2 –        3      15       2       20
             40      30      30      100

  χ² = 64.59 is highly significant because of the strong associations
  +/+, ●/– and –/●. The agreement, however, is low:
  AO = 0.36,  S = 0.04,  π = 0.02,  κ = 0.04
  (adapted from Cohen, 1960)

• Pearson correlation r vs. Cohen’s κ:

    A   B        A    B
    1   1        1    2
    2   2        2    4
    3   3        3    6
    4   4        4    8
    5   5        5   10

    r = 1.0      r = 1.0
    κ = 1.0      κ ≈ –0.09

  Correlation measures are also not suitable as agreement measures.
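A small sketch (illustrative helper names) that reproduces the correlation-vs-κ contrast on the right: perfectly correlated ratings on different scales still get r = 1.0 but a κ near zero or below.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cohen_kappa(r1, r2):
    n = len(r1)
    a_o = sum(a == b for a, b in zip(r1, r2)) / n
    a_e = sum(r1.count(c) * r2.count(c) for c in set(r1) | set(r2)) / n ** 2
    return (a_o - a_e) / (1 - a_e)

a = [1, 2, 3, 4, 5]
print(pearson_r(a, a), cohen_kappa(a, a))   # 1.0  1.0
b = [2, 4, 6, 8, 10]
print(pearson_r(a, b), cohen_kappa(a, b))   # 1.0  approx. -0.09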

Interpretation
What is good agreement?

• Landis and Koch (1977):

    below 0.0     Poor
    0.00 – 0.20   Slight
    0.21 – 0.40   Fair
    0.41 – 0.60   Moderate
    0.61 – 0.80   Substantial
    0.81 – 1.00   (Almost) Perfect

• Krippendorff (1980); Carletta (1996):
  • 0.67 < K < 0.8 “allowing tentative conclusions to be drawn”
  • above 0.8 “good reliability”

• Krippendorff (2004):
  • “even a cutoff point of 0.8 is a pretty low standard”

• Neuendorf (2002):
  • “reliability coefficients of 0.9 or greater would be acceptable to all,
    0.8 or greater [..] in most situations”
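For convenience, the Landis and Koch scale can be turned into a small lookup; the boundary handling here is one possible convention, not something prescribed by the slide:

def landis_koch_label(coefficient):
    """Map an agreement coefficient to the Landis and Koch (1977) labels."""
    if coefficient < 0.0:
        return "Poor"
    for upper, label in [(0.2, "Slight"), (0.4, "Fair"), (0.6, "Moderate"),
                         (0.8, "Substantial"), (1.0, "(Almost) Perfect")]:
        if coefficient <= upper:
            return label
    raise ValueError("coefficient must be <= 1.0")

print(landis_koch_label(0.341))  # Fair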
Recommendations
by Artstein and Poesio (2008)

1. Anything is better than nothing.
2. Give details on your study (who annotates and how?).
3. Use intensive training or professional annotators.
4. Also report the agreement table/contingency matrix
   rather than only the obtained agreement value.
5. Annotate with as many raters as possible, since this reduces the
   differences between the measures.
6. Use K (equal to multi-π) or α: they are used in the majority of studies,
   allow comparison and solve the chance-related issues.
7. Use Krippendorff’s α for category labels that are not fully distinct
   from each other (custom distance function).
8. Be careful with weighted measures, as they are hard to interpret.
9. Agreement should be above 0.8 to ensure data reliability (but this
   depends on the case).
Bibliography
Where to Start Reading
• Artstein, R. and Poesio, M. "Inter-Coder Agreement for Computational Linguistics," Computational Linguistics (34:4), 2008, pp. 555-596.
• Artstein, R. and Poesio, M. "Bias decreases in proportion to the number of annotators," Proceedings of the 10th Conference on Formal Grammar and the 9th Meeting on Mathematics of Language, 2005, pp. 141-150.
• Bennett, E. M., Alpert, R. and Goldstein, A. C. "Communications through limited response questioning," Public Opinion Quarterly (18:3), 1954, pp. 303-308.
• Carletta, J. "Assessing agreement on classification tasks: The kappa statistic," Computational Linguistics (22:2), 1996, pp. 249-254.
• Cicchetti, D. V. "A new measure of agreement between rank ordered variables," Proceedings of the American Psychological Association, American Statistical Association, 1972, pp. 17-18.
• Cohen, J. "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement (20:1), 1960, pp. 37-46.
• Cohen, J. "Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit," Psychological Bulletin (70:4), 1968, pp. 213-220.
• Davies, M. and Fleiss, J. L. "Measuring agreement for multinomial data," Biometrics (38:4), 1982, pp. 1047-1051.
• Di Eugenio, B. "On the usage of Kappa to evaluate agreement on coding tasks," Proceedings of the Second International Conference on Language Resources and Evaluation, 2000, pp. 441-444.
• Di Eugenio, B. and Glass, M. "The Kappa Statistic: A Second Look," Computational Linguistics (30:1), 2004, pp. 95-101.
• Fleiss, J. L. "Measuring nominal scale agreement among many raters," Psychological Bulletin (76:5), 1971, pp. 378-381.
• Fleiss, J. L. and Cohen, J. "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability," Educational and Psychological Measurement (33:3), 1973, pp. 613-619.
• Krippendorff, K. Content Analysis: An Introduction to Its Methodology, Thousand Oaks, CA: Sage Publications, 2004.
• Landis, J. R. and Koch, G. G. "The measurement of observer agreement for categorical data," Biometrics (33:1), 1977, pp. 159-174.
• Neuendorf, K. A. The Content Analysis Guidebook, Thousand Oaks, CA: Sage Publications, 2002.
• Passonneau, R. J. "Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation," Proceedings of the Fifth International Conference on Language Resources and Evaluation, 2006.
• Scott, W. A. "Reliability of content analysis: The case of nominal scale coding," Public Opinion Quarterly (19:3), 1955, pp. 321-325.
• Siegel, S. and Castellan Jr., N. J. Nonparametric Statistics for the Behavioral Sciences, New York, NY: McGraw-Hill, 1988.

Full BibTeX file available on request.
Thanks for your attention
Any Questions?

@ARTICLE{Artstein08,
author = {Artstein, Ron and Poesio, Massimo},
title = {Inter-Coder Agreement for Computational Linguistics},
journal = {Computational Linguistics},
year = {2008},
volume = {34},
pages = {555--596},
number = {4},
month = {December},
publisher = {Cambridge, MA: The MIT Press},
issn = {0891-2017},
url = {http://www.aclweb.org/anthology-new/J/J08/J08-4004.pdf}
}
Extended article version:
http://cswww.essex.ac.uk/Research/nle/arrau/icagr.pdf
