
for Use with Multiple Raters

Jason E. King

Baylor College of Medicine

Educational Research Association, Dallas, Texas, Feb. 5-7, 2004.

Correspondence concerning this article should be addressed to Jason King, 1709 Dryden Suite 534, Medical Towers, Houston, TX 77030. E-mail: Jasonk@bcm.tmc.edu

Generalized Kappa

Abstract

Many researchers are unfamiliar with extensions of Cohen's kappa for assessing the interrater reliability of more than two raters simultaneously. This paper briefly illustrates calculation of both Fleiss' generalized kappa and Gwet's newly developed robust measure of multi-rater agreement using SAS and SPSS syntax. An online, adaptable Microsoft Excel spreadsheet will also be made available for download.


Theoretical Framework

Cohen's (1960) kappa statistic (κ) has long been used to quantify the level of agreement between two raters in placing persons, items, or other elements into two or more categories. Fleiss (1971) extended the measure to include multiple raters, denoting it the generalized kappa statistic,1 and derived its asymptotic variance (Fleiss, Nee, & Landis, 1979). However, popular statistical computing packages have been slow to incorporate the generalized kappa. Lack of familiarity with the psychometrics literature has left many researchers unaware of this statistical tool when assessing reliability for multiple raters. Consequently, the educational literature is replete with articles reporting the arithmetic mean of all possible paired-rater kappas rather than the generalized kappa. This approach does not make full use of the data, will usually not yield the same value as that obtained from a multi-rater measure of agreement, and makes no more sense than averaging results from multiple t tests rather than conducting an analysis of variance.

Two commonly cited limitations of all kappa-type measures are their sensitivity to raters' classification probabilities (marginal probabilities) and to trait prevalence in the subject population (Gwet, 2002c). Gwet (2002b) demonstrated that statistically testing the marginal probabilities for homogeneity does not, in fact, resolve these problems. To counter these potential drawbacks, Gwet (2001) has proposed a more robust measure of agreement among multiple raters, denoting it the AC1 statistic. This statistic can be interpreted similarly to the generalized kappa, yet is more resilient to the limitations described above.

A search of the Internet revealed no freely-available

algorithms for calculating either measure of inter-rater

reliability without purchase of a commercial software

package. Software options do exist for obtaining these

statistics via the commercial packages, but they are not

typically available in a point-and-click environment and

require use of macros.

The purpose of this paper is to briefly define the generalized kappa and the AC1 statistic, and then describe their acquisition via two of the more popular software packages. Syntax files for both the Statistical Analysis System (SAS) and the Statistical Package for the Social Sciences (SPSS) are provided. In addition, the paper presents a Microsoft Excel spreadsheet that estimates the generalized kappa statistic,

its standard error (via two options), statistical tests, and

associated confidence intervals. Application of each software solution is demonstrated using a real dataset. The dataset

consists of three expert physicians having categorized each

of 45 continuing medical education (CME) presentations into

one of six competency areas (e.g., medical knowledge,

systems-based care, practice-based care, professionalism).

To encourage the reader to replicate these analyses, the

data are provided in Table 1.

Generalized Kappa Defined

Kappa is a chance-corrected measure of agreement

between two raters, each of whom independently classifies

each of a sample of subjects into one of a set of mutually

exclusive and exhaustive categories. It is computed as

$$K = \frac{p_o - p_e}{1 - p_e}, \qquad (1)$$

where p_o is the observed proportion of agreement and $p_e = \sum_{i=1}^{k} p_{i1}\,p_{i2}$ is the proportion of agreement expected by chance, computed from the ratings by two raters on a scale having k categories.
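Equation 1 is easy to compute directly. The sketch below is illustrative Python (not part of the paper, which uses SAS and SPSS; the function name `cohen_kappa` is ours):

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected two-rater agreement, Equation 1."""
    n = len(ratings_a)
    # p_o: observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # p_e: chance agreement from the two raters' marginal proportions.
    ca, cb = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in ca.keys() | cb.keys())
    return (p_o - p_e) / (1 - p_e)

# Toy data: two raters classify ten subjects into three categories.
a = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
b = [1, 2, 2, 2, 3, 3, 1, 1, 3, 2]
print(round(cohen_kappa(a, b), 3))   # → 0.552
```

Here p_o = .70 and p_e = .33, so kappa credits only the agreement beyond the 33% expected by chance.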

Fleiss' extension of kappa, called the generalized kappa, is defined as

$$K = 1 - \frac{nm^2 - \sum_{i=1}^{n}\sum_{j=1}^{k} n_{ij}^{2}}{nm(m-1)\sum_{j=1}^{k} p_j q_j}, \qquad (2)$$

where n = the number of subjects rated, m = the number of raters, n_ij = the number of raters assigning subject i to category j, p_j = the mean proportion for category j, and q_j = 1 - the mean proportion for category j. This index can be interpreted as a chance-corrected measure of agreement among three or more raters, each of whom independently classifies each of a sample of subjects into one of a set of mutually exclusive and exhaustive categories.
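Equation 2 can likewise be sketched in a few lines. The Python below is an illustrative implementation (again not the paper's macros; `fleiss_kappa` takes one list of category labels per subject):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Generalized kappa, Equation 2. `ratings` holds one list of
    category labels per subject; every subject has the same m raters."""
    n = len(ratings)                         # number of subjects
    m = len(ratings[0])                      # number of raters
    counts = [Counter(r) for r in ratings]   # the n_ij values
    categories = set().union(*ratings)
    # Mean proportion p_j of all ratings falling in each category.
    p = {j: sum(c[j] for c in counts) / (n * m) for j in categories}
    sum_pq = sum(pj * (1 - pj) for pj in p.values())
    sum_sq = sum(v * v for c in counts for v in c.values())   # sum of n_ij^2
    return 1 - (n * m ** 2 - sum_sq) / (n * m * (m - 1) * sum_pq)

# Four subjects, three raters, two categories.
ratings = [[1, 1, 1], [1, 1, 2], [2, 2, 2], [1, 2, 2]]
print(round(fleiss_kappa(ratings), 4))   # → 0.3333
```

Applied to the full Table 1 data, this function reproduces the paper's overall value of .28205.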

As mentioned earlier, Gwet suggested an alternative to the generalized kappa, denoted the AC1 statistic, to correct for kappa's sensitivity to marginal probabilities and trait prevalence. See Gwet (2001) for computational details.

A complication in computing the generalized kappa is the lack of consensus on the correct standard error formula to employ. Fleiss' (1971) original standard error formula is as follows:

$$SE(K) = \sqrt{\frac{2}{Nm(m-1)} \cdot \frac{P(E) - (2m-3)\,[P(E)]^2 + 2(m-2)\sum_{j=1}^{k} p_j^3}{\left[1 - P(E)\right]^2}}, \qquad (3)$$

where $P(E) = \sum_{j=1}^{k} p_j^2$. Fleiss, Nee, and Landis (1979) later derived a revised formula,

$$SE(K) = \frac{\sqrt{2}}{\sum_{j=1}^{k} p_j q_j \sqrt{nm(m-1)}}\sqrt{\left(\sum_{j=1}^{k} p_j q_j\right)^2 - \sum_{j=1}^{k} p_j q_j\,(q_j - p_j)}, \qquad (4)$$

which is ostensibly more accurate than the original formula.
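Both standard error formulas are simple functions of the category proportions. The sketch below is illustrative Python (function names ours); the proportions at the bottom are the category totals tallied from the Table 1 data (31, 77, 5, 0, 2, and 20 of the 135 ratings):

```python
import math

def se_fleiss_1971(p, n, m):
    """Equation 3: Fleiss' (1971) standard error from the category
    proportions p (a list of p_j), n subjects, and m raters."""
    pe = sum(pj ** 2 for pj in p)        # P(E)
    p3 = sum(pj ** 3 for pj in p)
    var = (2 / (n * m * (m - 1))) \
        * (pe - (2 * m - 3) * pe ** 2 + 2 * (m - 2) * p3) / (1 - pe) ** 2
    return math.sqrt(var)

def se_fleiss_1979(p, n, m):
    """Equation 4: the revised Fleiss, Nee, & Landis (1979) standard error."""
    pq = [pj * (1 - pj) for pj in p]     # the p_j * q_j terms
    s = sum(pq)
    # Note q_j - p_j = 1 - 2 p_j.
    inner = s ** 2 - sum(pq_j * (1 - 2 * pj) for pq_j, pj in zip(pq, p))
    return math.sqrt(2) / (s * math.sqrt(n * m * (m - 1))) * math.sqrt(inner)

p = [31 / 135, 77 / 135, 5 / 135, 0 / 135, 2 / 135, 20 / 135]
print(round(se_fleiss_1971(p, 45, 3), 5))   # → 0.08132
print(round(se_fleiss_1979(p, 45, 3), 4))   # → 0.0577
```

These reproduce the two overall SE estimates reported later in the paper (.08132 from Equation 3; .058 from Equation 4).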

Algorithms employed in the computing packages may use either formula. Gwet (2002a) mentioned in passing that the Fleiss et al. (1979) formula used in the MAGREE.SAS macro (see below) is less accurate than the formula used in his macro (i.e., Fleiss' original SE formula). However, it is unknown why Gwet would prefer Fleiss' original formula to the (ostensibly) more accurate revised formula.

Generalized Kappa Using SPSS Syntax

David Nichols at SPSS developed a macro to be run

through the syntax editor permitting calculation of the

generalized kappa, a standard error estimate, test

statistic, and associated probability. The calculations for

this macro, entitled MKAPPASC.SPS (available at
ftp://ftp.spss.com/pub/spss/statistics/nichols/macros/mkappasc.sps),
are taken from Siegel and Castellan (1988). Siegel

and Castellan employ equation 3 to calculate the standard

error.

The SPSS dataset should be formatted such that the number of rows = the number of items being rated, the number of columns = the number of raters, and each cell entry represents a single rating. The macro is invoked by running the following command:

MKAPPASC VARS=rater1 rater2 rater3.

where the three rater variables are named rater1, rater2, and rater3. Results for the sample dataset are as follows:

Run MATRIX procedure:

Estimated Kappa, Asymptotic Standard Error,
and Test of Null Hypothesis of 0 Population Value

     Kappa          ASE      Z-Value      P-Value
 .28204658    .08132183   3.46827632    .00052381

------ END MATRIX -----

These results indicate that the kappa value is statistically significantly different from 0 (p < .001), but not large (κ = .282).

Generalized Kappa Using SAS Syntax

SAS Technical Support has also developed a macro for calculating kappa, denoted MAGREE.SAS (available at http://ewe3.sas.com/techsup/download/stat/magree.html). That macro will not be presented here; instead, a SAS macro developed by Gwet will be described. Gwet's macro, entitled INTER_RATER.MAC, allows for calculation of both the generalized kappa and the AC1 statistic (available at http://ewe3.sas.com/techsup/download/stat/magree.html). Gwet's macro also employs equation 3 to calculate the standard error. A nice feature of the macro is its ability to calculate both conditional and unconditional (i.e., generalizable to a broader population) variance estimates.

The SAS dataset should be formatted such that the number of rows = the number of items being rated, the number of columns = the number of raters, and each cell entry represents a single rating. A separate one-variable data set must be created defining the categories available for use in rating the subjects (see an example available at http://www.ccit.bcm.tmc.edu/jking/homepage/).

The macro is then invoked as follows:

%Inter_Rater(InputData=a,
             DataType=c,
             VarianceType=c,
             CategoryFile=CatFile,
             OutFile=a2);

VarianceType can be set to u rather than c if unconditional variances are desired. Results for the sample data are as follows:

INTER_RATER macro (v 1.0)

Kappa statistics: conditional and unconditional analyses

                        Standard
Category     Kappa      Error          Z        Prob>Z
1            0.28815    0.21433     1.34441    0.08941
2            0.21406    0.29797     0.71841    0.23625
3           -0.03846    0.27542    -0.13965    0.55553
4            .          .           .          .
5            0.49248    0.38700     1.27256    0.10159
6            0.47174    0.21125     2.23311    0.01277
Overall      0.28205    0.08132     3.46828    0.00026

AC1 statistics: conditional and unconditional analyses

Inference based on conditional variances of AC1

             AC1        Standard
Category     statistic  Error          Z        Prob>Z
1            0.37706    .           .          0.00000
2            0.61643    0.12047     5.11695    0.00000
3           -0.13595    .           .          .
4            .          .           .          .
5            0.43202    0.56798     0.76064    0.22344
6            0.48882    0.25887     1.88831    0.02949
Overall      0.51196    0.05849     8.75296    0.00000

The overall kappa value (.28205) is identical to that obtained earlier. This algorithm also permits calculation of kappas for each rating category. It is of interest to observe that the AC1 statistic yielded a larger value (.512) than kappa (.282). This reflects the sensitivity of kappa to the unequal trait prevalence in the population (notice in the Table 1 data that few presentations were judged as embracing competencies 3, 4, and 5).
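The per-category kappas in the output above can also be checked by hand using the category-level kappa defined by Fleiss (1971), which needs only the counts n_ij for that category. The Python below is an illustrative rendering (names ours), with the category-1 counts tallied from Table 1; it reproduces the first row of the macro output:

```python
def category_kappa(counts_j, m):
    """Kappa for one category j (Fleiss, 1971). `counts_j` gives, for each
    subject, how many of the m raters chose category j."""
    n = len(counts_j)
    p_j = sum(counts_j) / (n * m)
    q_j = 1 - p_j
    # Pairwise disagreements on category j: n_ij * (m - n_ij) per subject.
    disagree = sum(c * (m - c) for c in counts_j)
    return 1 - disagree / (n * m * (m - 1) * p_j * q_j)

# Number of the 3 physicians choosing category 1 for each of the
# 45 presentations (tallied from Table 1).
cat1 = [3, 1, 0, 2, 1, 1, 1, 1, 1, 2, 1, 1, 0, 0, 2, 2, 0, 1, 0, 3, 0, 1, 3,
        0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(round(category_kappa(cat1, 3), 5))   # → 0.28815
```

Category 4 was never chosen, so p_j q_j = 0 and its kappa is undefined; this corresponds to the "." entries in the macro output.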


To facilitate more widespread use of the generalized kappa, the author developed a Microsoft Excel spreadsheet that calculates the generalized kappa, kappa values for each rating category (along with associated standard error estimates), overall standard error estimates using both Equations 3 and 4, test statistics, associated probability values, and confidence intervals (available for download at http://www.ccit.bcm.tmc.edu/jking/homepage/). To the author's knowledge, such a spreadsheet is not available elsewhere.

Directions are provided on the spreadsheet for entering

data. Edited results for the sample data are provided below:

BY CATEGORY
gen kappa_cat1 =    0.288
gen kappa_cat2 =    0.214
gen kappa_cat3 =   -0.038
gen kappa_cat4 =  #DIV/0!
gen kappa_cat5 =    0.492
gen kappa_cat6 =    0.472
******************
OVERALL
gen kappa =         0.282

SEFleiss1(a) =      0.081        SEFleiss2(b) =      0.058
z =                 3.468        z =                 4.888
p calc =            0.000524     p calc =            0.000001
CILower =           0.123        CILower =           0.169
CIUpper =           0.441        CIUpper =           0.395

(a) This approximate standard error formula is based on Fleiss (Psychological Bulletin, 1971, Vol. 76, 378-382).
(b) This approximate standard error formula is based on Fleiss, Nee, & Landis (Psychological Bulletin, 1979, Vol. 86, 974-977).

The overall kappa value is the same as that obtained earlier, as is the SE estimate based on Fleiss (1971). Fleiss et al.'s (1979) revised SE estimate is slightly lower and yields tighter confidence intervals. Use of confidence intervals permits assessing a range of possible kappa values, rather than making dichotomous decisions concerning interrater reliability. This is in keeping with current best practices (e.g., Fan & Thompson, 2001).
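The z test and confidence interval follow directly from any kappa estimate and its standard error. A minimal Python sketch (illustrative, not the spreadsheet itself), using the overall values from the worked example:

```python
import math

def kappa_inference(kappa, se):
    """z statistic, two-sided normal p-value, and 95% confidence
    interval for a kappa estimate with standard error se."""
    z = kappa / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    crit = 1.959963984540054               # 97.5th normal percentile
    return z, p, (kappa - crit * se, kappa + crit * se)

# Overall generalized kappa and Fleiss (1971) SE from the example.
z, p, (lo, hi) = kappa_inference(0.28205, 0.08132)
print(round(z, 3), round(p, 6), round(lo, 3), round(hi, 3))
# → 3.468 0.000524 0.123 0.441
```

These match the spreadsheet's z, p, and confidence limits for the Fleiss (1971) SE column.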

Conclusion

The generalized kappa statistic provides a chance-corrected measure of interrater agreement among three or more judges. This

measure has not been incorporated into the point-and-click

environment of the major statistical software packages, but

can easily be obtained using SAS code or SPSS syntax. An alternative approach is to use a newly developed Microsoft Excel spreadsheet.

Footnote

1 Gwet (2002a) notes that Fleiss' generalized kappa was based not on Cohen's kappa but on the earlier pi (π) measure of inter-rater agreement introduced by Scott (1955).


References

Cohen, J. (1960). A coefficient of agreement for nominal

scales. Educational and Psychological Measurement, 20,

37-46.

Fan, X., & Thompson, B. (2001). Confidence intervals about

score reliability coefficients, please: An EPM guidelines

editorial. Educational and Psychological Measurement, 61,

517-531.

Fleiss, J. L. (1971). Measuring nominal scale agreement

among many raters. Psychological Bulletin, 76, 378-382.

Fleiss, J. L. (1981). Statistical methods for rates and

proportions (2nd ed.). New York: John Wiley & Sons, Inc.

Fleiss, J. L., Nee, J. C. M., & Landis, J. R. (1979). Large

sample variance of kappa in the case of different sets of

raters. Psychological Bulletin, 86, 974-977.

Gwet, K. (2001). Handbook of inter-rater reliability.

STATAXIS Publishing Company.

Gwet, K. (2002a). Computing inter-rater reliability with the

SAS system. Statistical Methods for Inter-Rater

Reliability Assessment Series, 3, 1-16.

Gwet, K. (2002b). Inter-rater reliability: Dependency on

trait prevalence and marginal homogeneity. Statistical

Methods for Inter-Rater Reliability Assessment Series, 2,

1-9.

Gwet, K. (2002c). Kappa statistic is not satisfactory for

assessing the extent of agreement between raters.

Statistical Methods for Inter-Rater Reliability

Assessment Series, 1, 1-6.

Scott, W. A. (1955). Reliability of content analysis: The
case of nominal scale coding. Public Opinion Quarterly,
19, 321-325.

Siegel, S., & Castellan, N. J. (1988). Nonparametric
statistics for the behavioral sciences (2nd ed.). New
York: McGraw-Hill.


Table 1

Physician Ratings of Presentations Into Competency Areas

Subject  Rater1  Rater2  Rater3    Subject  Rater1  Rater2  Rater3
   1       1       1       1         24       2       2       6
   2       2       1       2         25       2       6       6
   3       2       2       2         26       6       1       1
   4       2       1       1         27       6       6       6
   5       2       1       2         28       2       6       6
   6       2       1       2         29       2       6       6
   7       2       2       1         30       6       6       1
   8       2       1       2         31       6       6       6
   9       2       1       2         32       2       5       5
  10       2       1       1         33       2       3       2
  11       2       1       3         34       2       2       2
  12       2       2       1         35       2       2       2
  13       2       2       2         36       2       6       6
  14       2       2       2         37       2       2       6
  15       2       1       1         38       2       2       2
  16       2       1       1         39       2       2       2
  17       2       2       3         40       2       2       2
  18       2       1       6         41       2       2       3
  19       2       2       3         42       2       2       2
  20       1       1       1         43       2       2       2
  21       2       2       2         44       2       2       2
  22       2       1       2         45       2       1       2
  23       1       1       1
