You are on page 1of 7

Psy 524/624

Spring 2008

Lab Activities # 6

Today we will:
- talk about inter-rater agreement and reliability
- create contingency tables (remember?) to use in estimating kappas
- estimate kappa using SPSS and calculate it by hand
- calculate percent agreement

Before we dig into this week’s dataset, let’s talk about concepts.

Inter-rater agreement/reliability: what and so what?

Kappa: what and so what?

Are inter-rater agreement and inter-rater reliability the same thing? Why or why
not?
Agreement: “Extent to which the different judges tend to make exactly the same judgments about
the rated subject” (Tinsley & Weiss, 1975, p. 359) E.g., different judges assign exactly the same
numerical value to the same subject.
Reliability: “Degree to which the ratings of different judges are proportional when expressed as
deviations from their means” (Tinsley & Weiss, 1975, p. 359) E.g., different judges have the same
general rank ordering of subjects, but don’t necessarily assign the same numerical value to each
subject.
*Note: not a ton of time needs to be spent on this issue, but it’s important to understand that inter-
rater agreement/reliability are NOT the same thing. Also, a general warning that a lot of very
smart people have messed this up by confusing the terms (e.g., Schmidt & Hunter, 1989;
Kozlowski & Hattrup, 1992).

1
DATAFILE: engagementratings.sav
Here we have two different raters rating the level of engagement on a reading
activity for 91 children. Each rater makes one rating per child, and there are four
levels of engagement: low, moderate-low, moderate-high, and high. What you
see in the dataset is a summary of the observations. So the frequency
associated with 1-1, for example, represents the number of children that rater1
and rater2 both rated as “low.” 1-2 represents the number of children who
received a rating of “low” from rater1 and a rating of “moderate-low” from rater2.
The engR1 variable contains the ratings for rater1 and engR2 contains ratings for
rater2. If inter-rater agreement is high, where would you expect the majority of
the frequencies to be? (in 1-1, 2-2, 3-3, and 4-4)

First we will need to weight cases by frequency.


DataWeight Cases.
Select Weight Cases By and move freq over to the open space.
Now we will analyze the weighted data.
AnalyzeDescriptive StatisticsCrosstabs. (this should be familiar to you from
521)
Move engR1 to Rows and engR2 to Columns.
Click Statistics, select Chi Square, Phi and V, and Kappa. Click Continue.
Click Cells, select Observed and Expected.
Click Continue and OK.

Crosstabs
Case Processing Summary

Cases
Valid Missing Total
N Percent N Percent N Percent
engR1 * engR2 91 100.0% 0 .0% 91 100.0%

engR1 * engR2 Crosstabulation

engR2
low mod-low mod-high high Total
engR1 low Count 7 7 2 3 19
Expected Count 2.5 5.8 3.8 6.9 19.0
mod-low Count 2 8 3 7 20
Expected Count 2.6 6.2 4.0 7.3 20.0
mod-high Count 1 5 4 9 19
Expected Count 2.5 5.8 3.8 6.9 19.0
high Count 2 8 9 14 33
Expected Count 4.4 10.2 6.5 12.0 33.0
Total Count 12 28 18 33 91
Expected Count 12.0 28.0 18.0 33.0 91.0

2
This table provides the expected, observed, and marginal counts (we could also
get the residuals), which we can use to compute kappa. Note that the diagonal
counts represent rater agreement (these are the 1-1, 2-2, etc from the dataset).

Chi-Square Tests

Asymp. Sig.
Value df (2-sided)
Pearson Chi-Square 16.955a 9 .049
Likelihood Ratio 15.486 9 .078
Linear-by-Linear
10.014 1 .002
Association
N of Valid Cases 91
a. 7 cells (43.8%) have expected count less than 5. The
minimum expected count is 2.51.

Chi square: this statistic tells us that there does appear to be dependency
somewhere in the data. This is a good thing, as we want the ratings of the two
judges to be related.
Linear-by-linear association: this tells us that as one rater’s ratings increase, the
other rater’s ratings also increase. (Note that this is only useful if the categories
used in ratings are at least ordinal in nature, ideally interval or ratio, and the rows
and columns of the contingency table are in an increasing or decreasing order. If
either is not true, both the chi-square and linear-by-linear statistics are
meaningless). This is also a good thing, as it indicates that the ratings are likely
more similar than dissimilar. A negative linear-by-linear association would not be
a good thing.

Symmetric Measures

Asymp.
a b
Value Std. Error Approx. T Approx. Sig.
Nominal by Nominal Phi .432 .049
Cramer's V .249 .049
Measure of Agreement Kappa .129 .069 2.114 .035
N of Valid Cases 91
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

Phi & Cramer’s V: You tell me-which of these statistics should I focus on?
Given that our table is larger than 2X2 we’ll interpret Cramer’s V: the relationship
between rater1’s ratings and rater2’s ratings is .25, a small relationship.

Kappa: (agreement correcting for chance) is .13, which is less than desirable.
However, our kappa does differ significantly from 0, p < .04. The discrepancy
between magnitude (.13) and significance (<.04) is likely due to sample size. A

3
significant kappa is a necessary but not sufficient condition in the context of
displaying inter-rater agreement.
-So, can anyone tell me why kappa doesn’t look so great? What’s causing
the problem (i.e., lack of agreement)?

Avoiding small kappas: Ideally, before you’ve put a ton of time and resources into
your study, you have piloted your scales (particularly if they are behavioral
observation scales). If there are problems with the rating scale, they will likely
emerge during the pilot study (i.e., you’ll see that kappa stinks). However, the
problem may also be in the way you train your raters. Debriefing with the raters is
a good way to get to the root cause.

Dealing with small kappas: In some sense, you can think of a small kappa in the
same way that you would think of a really low alpha (i.e., .4). It’s a bummer, and
you don’t want it. It limits the confidence you can have in the validity of the
construct in the same way that a low alpha does: if you’re not at least measuring
“something” consistently, how can you say that the “something” you’re measuring
is in fact what you intended to measure?
That said, some people will try to reduce the number of rating categories in order
to improve kappa. Reducing the number of categories to improve inter-rater
agreement is a judgment call, and you will likely need to do quite a bit of justifying
so that your reviewers, peers, advisors, committee members, etc., don’t yell at
you. This decision becomes especially shady because it is known that, all else
equal, kappa is higher when the number of categories is smaller. In short, I
wouldn’t advocate this approach.

DATAFILE: engagementratingsB.sav.
This dataset contains the data from two different raters who rated the same 91
children as in the first example. Here, though, raters were simply asked to
indicate whether the level of engagement on the activity was high or low. Let’s
analyze the agreement between these two new raters using a different rating
scale.

Again, we will need to weight cases by frequency.


DataWeight Cases. Select Weight Cases By and move freq over to the open
space. Now we will analyze the weighted data following the same steps as
above. AnalyzeDescriptive StatisticsCrosstabs. Move engR1 to Rows and
engR2 to Columns.
Statistics: select Chi Square, Phi and V, and Kappa. Click Continue.
Cells: select Observed and Expected. Click Continue and OK.

Crosstabs

4
Case Processing Summary

Cases
Valid Missing Total
N Percent N Percent N Percent
engR1 * engR2 91 100.0% 0 .0% 91 100.0%

engR1 * engR2 Crosstabulation

engR2
low high Total
engR1 low Count 24 15 39
Expected Count 17.1 21.9 39.0
high Count 16 36 52
Expected Count 22.9 29.1 52.0
Total Count 40 51 91
Expected Count 40.0 51.0 91.0

This table provides the expected, observed, and marginal counts (we could also
get the residuals…), which we can use to compute kappa. Note that the diagonal
counts represent rater agreement.

To calculate % agreement:
#Agree/total =
60/91 = .66
66% rater agreement on engagement level during the activity.

Chi-Square Tests

Asymp. Sig. Exact Sig. Exact Sig.


Value df (2-sided) (2-sided) (1-sided)
Pearson Chi-Square 8.565b 1 .003
Continuity Correctiona 7.361 1 .007
Likelihood Ratio 8.657 1 .003
Fisher's Exact Test .005 .003
Linear-by-Linear
8.471 1 .004
Association
N of Valid Cases 91
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 17.
14.

Chi-square: there is an association between the raters’ engagement ratings


Linear-by-linear association: as one rater’s ratings increase, the other rater’s
ratings also increase. Again, this is a good thing.

5
Symmetric Measures

Asymp.
a b
Value Std. Error Approx. T Approx. Sig.
Nominal by Nominal Phi .307 .003
Cramer's V .307 .003
Measure of Agreement Kappa .307 .100 2.927 .003
N of Valid Cases 91
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

Phi: .31, p = .003; the ratings of rater one are moderately related to the ratings of
rater two.
Kappa: (agreement correcting for chance) is .31—still less than desirable, but
better than in the other example. Kappa does differ significantly from 0, p < .01.
There is 31% agreement between rater one and rater two after correcting for
chance. Note how much this differs from the simple percent agreement statistic
of .66!

Calculating Kappa by Hand

Step 1: Calculate Chance Agreement (i.e., expected frequencies)


Low-low Cell
-Probability of low rating by rater 1 = 39/91 = .43
-Probability of low rating by rater 2 = 40/91 = .44
If the two raters are operating independently of one another, the probability that
they would both report low = .43X.44 = .19.
Thus, for 91 cases, .19X91 = 17.22; 17.22 low ratings for both raters would be
expected purely by chance.
Expected frequency for low-low cell: 17.22

High-high Cell
-Probability of high rating by rater 1 = 52/91 = .57
-Probability of high rating by rater 2 = 51/91 = .56
If the two raters are operating independently of one another, the probability that
they would both report low = .57X.56 = .32.
Thus, for 91 cases, .32X91 = 29.14; 29.14 high ratings for both raters would be
expected purely by chance.
Expected frequency for low-low cell: 29.14

Now we can plug these values, along with the observed frequencies, into the
tables below. (Note: of course, you can get the expected frequencies directly
from SPSS, rather than calculating them by hand)

6
Exp Freq r2 low r2 high Obs Freq r2 low r2 high
r1 low 17.22 r1 low 24
r1 low 29.14 r1 high 36
Adding agreement diagonal: 46.36 60

κ = Pobs – Pexp

1 – Pexp

κ = (60/91) – (46/91)

1 – (46/91)

~30% agreement between rater one and rater two after correcting for chance
(basically the same as what we found when using SPSS, with rounding error)

Finally, let’s also calculate the odds ratio:


(24/15) / (36/16) = .71
The odds-ratio tells us that the odds of raters agreeing on high engagement .71
times greater than the odds of raters agreeing on low engagement. This might
indicate that it is easier for the raters to recognize high engagement behaviors.