
The Measurement of Observer Agreement for Categorical Data

Author(s): J. Richard Landis and Gary G. Koch


Reviewed work(s):
Source: Biometrics, Vol. 33, No. 1 (Mar., 1977), pp. 159-174
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2529310 .
Accessed: 19/11/2012 06:33


International Biometric Society is collaborating with JSTOR to digitize, preserve and extend access to
Biometrics.

http://www.jstor.org

This content downloaded by the authorized user from 192.168.82.209 on Mon, 19 Nov 2012 06:33:45 AM
All use subject to JSTOR Terms and Conditions
BIOMETRICS 33, 159-174
March 1977

The Measurement of Observer Agreement for Categorical Data

J. RICHARD LANDIS
Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109 U.S.A.

GARY G. KOCH

Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27514 U.S.A.

Summary
This paper presents a general statistical methodology for the analysis of multivariate
categorical data arising from observer reliability studies. The procedure essentially involves
the construction of functions of the observed proportions which are directed at the extent to
which the observers agree among themselves and the construction of test statistics for
hypotheses involving these functions. Tests for interobserver bias are presented in terms of
first-order marginal homogeneity and measures of interobserver agreement are developed as
generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis
example from the epidemiological literature.

1. Introduction
Researchers in many fields have become increasingly aware of the observer (rater or
interviewer) as an important source of measurement error. Consequently, reliability studies
are conducted in experimental or survey situations to assess the level of observer variability
in the measurement procedures to be used in data acquisition. When the data arising
from such studies are quantitative, tests for interobserver bias and measures of
interobserver agreement are usually obtained from standard ANOVA mixed models or random
effects models such as those discussed in Anderson and Bancroft [1952], Scheffe [1959],
and Searle [1971]. As a result, hypothesis tests of observer effects are used to investigate
interobserver bias, i.e., differences in the mean response among observers, and estimates
of intraclass correlation coefficients are used to measure interobserver reliability.
Modifications and extensions of these standard ANOVA models have been proposed by Grubbs
[1948, 1973], Mandel [1959], Fleiss [1966], Overall [1968], and Loewenson, Bearman and
Resch [1972] to evaluate the measurement error in various types of applications. Although
assumptions of normality for these models may not be warranted in certain cases, the
ANOVA procedures discussed in Searle [1971] and the symmetric square difference
procedure in Koch [1967, 1968] still permit the estimation of the appropriate components
of variance and the reliability coefficients.
On the other hand, many observer reliability studies involve categorical data in which
the response variable is classified into nominal (or possibly ordinal) multinomial categories.
As reviewed in Landis and Koch [1975a, 1975b], a wide variety of estimation and testing
procedures have been recommended for the assessment of observer variability in these
Key Words: Observer agreement; Multivariate categorical data; Kappa statistics; Repeated measurement
experiments; Weighted least squares.

cases. In this paper we propose a unified approach to the evaluation of observer agreement
for categorical data by expressing the quantities which reflect the extent to which the
observers agree among themselves as functions of observed proportions obtained from
underlying multidimensional contingency tables. These functions are then used to produce
test statistics for the relevant hypotheses concerning interobserver bias in the overall
usage of the measurement scale and interobserver agreement on the classification of
individual subjects. For illustrative purposes, this general methodology is developed within
the context of a typical data set which resulted from an investigation of observer
variability in the clinical diagnosis of multiple sclerosis.

2. A Clinical Diagnosis Example


Let us consider the data arising from the diagnosis of multiple sclerosis reported in
Westlund and Kurland [1953]. Among other things, the investigators were interested in
comparing patient groups to study possible differences in the geographical distributions
of the disease. For this purpose, a series of patients in Winnipeg, Manitoba and a separate
series of patients in New Orleans, Louisiana were selected and were examined by a
neurologist in their respective locations. After the completion of all the examinations, each
neurologist was requested to review all the records without seeing his earlier summary
and diagnosis and to classify them into one of the following diagnostic classes:
1. Certain multiple sclerosis;
2. Probable multiple sclerosis;
3. Possible multiple sclerosis (odds 50:50);
4. Doubtful, unlikely, or definitely not multiple sclerosis.
In order to evaluate agreement between the diagnosticians, the Winnipeg neurologist then
reviewed and classified each of the New Orleans patient records, and vice versa. The
data resulting from these review diagnoses are presented in Table 1.
A preliminary inspection of the Winnipeg data indicates that the Winnipeg neurologist
tended to diagnose more of the patients as certain (1) or probable (2) multiple sclerosis
than did his counterpart in New Orleans. As a result, they agreed on the diagnosis of
only 64/149 (43 percent) of the patients. Although the differences in the overall crude
distributions of the diagnoses seem to be less prominent within the New Orleans patients,
the neurologists diagnosed only 33/69 (48 percent) of them into identically the same
category. The statistical issues concerning these differences in diagnosis can be summarized
within the framework of the following basic questions:
(1) Are there any differences between the two patient populations with respect to the
overall crude distribution of the diagnoses by each of the two neurologists?
(2) Are there any differences between the overall crude distributions of the diagnoses
by the two neurologists within each of the respective patient populations?
(3) Is there any neurologist × sub-population interaction in the overall crude
distribution of the diagnoses?
(4) Is there any difference between the two patient populations with respect to the
overall agreement of the two neurologists on the specific diagnosis of individual
patients?
(5) Is the agreement of the two neurologists on the specific diagnosis of individual
patients significantly different from chance agreement based on their overall crude
distributions of diagnoses?


Table 1
DIAGNOSTIC CLASSIFICATION REGARDING MULTIPLE SCLEROSIS

Sub-population: Winnipeg Patients (1)

New Orleans Neurologist (1)     Winnipeg Neurologist (2), Diagnostic Class
Diagnostic Class                    1       2       3       4     Total   Proportion
1                                  38       5       0       1        44      0.295
2                                  33      11       3       0        47      0.315
3                                  10      14       5       6        35      0.235
4                                   3       7       3      10        23      0.154
Total                              84      37      11      17       149
Proportion                      0.564   0.248   0.074   0.114

Sub-population: New Orleans Patients (2)

New Orleans Neurologist (1)     Winnipeg Neurologist (2), Diagnostic Class
Diagnostic Class                    1       2       3       4     Total   Proportion
1                                   5       3       0       0         8      0.116
2                                   3      11       4       0        18      0.261
3                                   2      13       3       4        22      0.319
4                                   1       2       4      14        21      0.304
Total                              11      29      11      18        69
Proportion                      0.159   0.420   0.159   0.261

(6) Are there certain patterns of disagreement which may reflect significant imprecision
in the diagnostic criteria?
As stated in Koch et al. [1977], questions (1)-(3) are directly analogous to the hypotheses
of "no whole-plot effects," "no split-plot effects," and "no whole-plot × split-plot
interaction" in standard split-plot experiments. In this context, question (1) addresses differences
among the sub-populations, question (2) involves the issue of interobserver bias, and
question (3) is concerned with the observer × sub-population interaction. Thus, the
first-order marginal distributions of response for each of the neurologists within each
sub-population contain the relevant information for dealing with these questions. In


contrast to overall crude differences, questions (4)-(6) are addressed at interobserver
agreement on a subject-to-subject basis; and, as such they are directly analogous to
hypotheses concerning intraclass correlation coefficients in random effects models. Hence,
certain functions of the diagonal cells of various subtables are used to provide information
for estimating and testing the significance of agreement on the classification of individual
subjects.
In the following sections a general methodology for answering these questions is
developed in terms of specific hypotheses. These procedures are then illustrated with an
analysis of the data in Table 1.
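As a check on the crude figures quoted above, the following minimal sketch encodes the two sub-tables of Table 1 and recomputes the marginal proportions and the crude proportion of agreement. The names WINNIPEG, NEW_ORLEANS and summarize are illustrative and are not part of the original paper.

```python
# Table 1 counts: rows = New Orleans neurologist (observer 1),
# columns = Winnipeg neurologist (observer 2), diagnostic classes 1-4.
WINNIPEG = [
    [38, 5, 0, 1],
    [33, 11, 3, 0],
    [10, 14, 5, 6],
    [3, 7, 3, 10],
]
NEW_ORLEANS = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]


def summarize(table):
    """Return (n, row proportions, column proportions, crude agreement)."""
    size = len(table)
    n = sum(sum(row) for row in table)
    row_props = [sum(row) / n for row in table]
    col_props = [sum(row[k] for row in table) / n for k in range(size)]
    crude = sum(table[k][k] for k in range(size)) / n
    return n, row_props, col_props, crude


for name, table in [("Winnipeg", WINNIPEG), ("New Orleans", NEW_ORLEANS)]:
    n, rows, cols, crude = summarize(table)
    print(name, n, [round(p, 3) for p in rows],
          [round(p, 3) for p in cols], round(crude, 2))
```

The crude agreement is simply the sum of the main diagonal divided by the sample size; it takes no account of the agreement expected from the marginal distributions alone.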

3. Methodology
Let $i = 1, 2, \ldots, s$ index a set of sub-populations from which random samples have
been selected. Suppose that the same response variable is measured separately by each
of $d$ observers using an $L$-point scale. Let the $r = L^d$ response profiles be indexed by a
vector subscript $j = (j_1, j_2, \ldots, j_d)$, where $j_g = 1, 2, \ldots, L$ for $g = 1, 2, \ldots, d$.
Furthermore, let $\pi_{ij} = \pi_{i j_1 j_2 \ldots j_d}$ represent the joint probability of response
profile $j$ for randomly selected subjects from the $i$th sub-population. Then let the
first-order marginal probability

$$\phi_{igk} = \sum_{j \text{ with } j_g = k} \pi_{i j_1 j_2 \ldots j_d} \quad \text{for } g = 1, 2, \ldots, d; \; k = 1, 2, \ldots, L \tag{3.1}$$

represent the probability of the $k$th response category for the $g$th observer in the $i$th
sub-population.
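For the case of $d = 2$ observers, which is the situation in the multiple sclerosis example, the marginal probabilities in (3.1) reduce to the row and column sums of a two-way table. The worked instance below uses the observed counts of Table 1 as estimates; the hat notation for estimates is an assumption introduced here for illustration.

```latex
\phi_{i1k} = \sum_{j_2 = 1}^{L} \pi_{i k j_2}, \qquad
\phi_{i2k} = \sum_{j_1 = 1}^{L} \pi_{i j_1 k}, \qquad k = 1, 2, \ldots, L.

% Winnipeg patients (i = 1), diagnostic class k = 1:
\hat{\phi}_{111} = \frac{38 + 5 + 0 + 1}{149} = \frac{44}{149} \approx 0.295, \qquad
\hat{\phi}_{121} = \frac{38 + 33 + 10 + 3}{149} = \frac{84}{149} \approx 0.564.
```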
3.1 Hypotheses Involving Marginal Distributions
Hypotheses directed at the questions of differences among sub-populations and
interobserver bias involve distributions of the response profiles and can be expressed in terms
of constraints on the first-order marginal probabilities $\{\phi_{igk}\}$. As a result, the specific
hypotheses associated with questions (1)-(3) are directly analogous to $H_{SM}$, $H_{CM}$, and
$H_{AM}$ outlined in Koch et al. [1977] in expressions (2.4), (2.5), and (2.9), respectively. In
particular, the $d$ observers correspond to the $d$ conditions, and thus the hypothesis of
first-order marginal symmetry (homogeneity) addresses the issue of interobserver bias. These
hypotheses can also be expressed in terms of constraints on mean score functions associated
with each observer such as the $\{\bar{\eta}_{ig}\}$ summary indexes specified in (2.14) in Koch et al. [1977].
Further discussion of hypotheses involving marginal distributions within the context
of observer agreement studies is given in Landis [1975].

3.2 Hypotheses Involving Generalized Kappa-Type Measures

Whereas the previous hypotheses concerning differences among sub-populations and
interobserver bias involved only the first-order marginal probabilities, hypotheses directed
at the extent to which observers agree among themselves on the classification of individual
subjects must be formulated in terms of the internal elements of the table. For example,
the estimate of the crude proportion of agreement between two observers is simply the
sum of the observed proportions on the main diagonal of the corresponding two-way table.
In addition, if partial credit is permitted for certain types of disagreement, an estimate
of the weighted proportion of agreement will involve the weighted inclusion of the
off-diagonal cells.


As reviewed in Landis and Koch [1975a, 1975b], numerous measures of observer
agreement have been proposed for categorical data, e.g., Goodman and Kruskal [1954], Cohen
[1960, 1968], Fleiss [1971], Light [1971], and Cicchetti [1972]. Most of these quantities
are of the form

$$\kappa = \frac{\pi_o - \pi_e}{1 - \pi_e} \tag{3.2}$$

where $\pi_o$ is an observational probability of agreement and $\pi_e$ is a hypothetical expected
probability of agreement under an appropriate set of baseline constraints such as total
independence of observer classifications. Ranging from $[-\pi_e/(1 - \pi_e)]$ to $+1$, $\kappa$ indicates
the extent to which the observational probability of agreement is in excess of the
probability of agreement hypothetically expected under the baseline constraints. Furthermore,
as shown in Fleiss and Cohen [1973] and Fleiss [1975], $\kappa$ is directly analogous to the
intraclass correlation coefficient obtained from ANOVA models for quantitative measurements
and can be used as a measure of the reliability of multiple determinations on the same
subjects.
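The form (3.2) and its stated range can be illustrated numerically. The sketch below evaluates (3.2) for the Winnipeg patients of Table 1, taking the expected agreement from the product of the two neurologists' observed marginal totals (the total independence baseline). The function names are illustrative only.

```python
def kappa(p_obs, p_exp):
    """Kappa of the form (3.2): agreement in excess of the baseline, rescaled."""
    return (p_obs - p_exp) / (1.0 - p_exp)


def kappa_lower_bound(p_exp):
    """Smallest attainable value of (3.2), reached when p_obs = 0."""
    return -p_exp / (1.0 - p_exp)


# Winnipeg patients in Table 1.
n = 149
p_obs = 64 / n                       # sum of the main diagonal over n
row_totals = [44, 47, 35, 23]        # New Orleans neurologist
col_totals = [84, 37, 11, 17]        # Winnipeg neurologist
p_exp = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

print(round(kappa(p_obs, p_exp), 3))
print(round(kappa_lower_bound(p_exp), 3))
```

Perfect observed agreement gives the upper limit of +1 whatever the baseline, whereas the lower limit depends on the expected agreement itself.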
Several kappa-type measures of interobserver agreement can be formulated to
investigate selected patterns of disagreement simultaneously by choosing corresponding
sets of weights which reflect the role of each response category in a given agreement index.
For example, a set of weights can be chosen so that the resulting agreement measure
indicates the combined performance of all the observers, such as majority or consensus
agreement, or sets of weights can be directed at subsets of observers, such as all possible
pairwise agreement measures. Alternatively, these weights can be chosen so that the
associated kappa measures indicate the increments in agreement which result by
successively combining relevant categories of the response variable. Such kappa measures are
said to be in a hierarchical relationship with each other. Thus, in general, let $w_{1j}, w_{2j},
\ldots, w_{uj}$ be $u$ sets of weights assigned to the response profiles indexed by $j = (j_1, j_2, \ldots, j_d)$.
Moreover let $0 \le w_{hj} \le 1$ for $h = 1, 2, \ldots, u$ over all $j$, so that the resulting estimates
are interpretable as probabilities of agreement. Then the observational probability of
agreement associated with the $h$th set of weights in the $i$th sub-population is the weighted
sum

$$\lambda_{ih} = \sum_{j} w_{hj}\, \pi_{ij} \quad \text{for } i = 1, 2, \ldots, s; \; h = 1, 2, \ldots, u. \tag{3.3}$$

Correspondingly, the expected proportion of agreement associated with (3.3) is the weighted
sum

$$\gamma_{ih} = \sum_{j} w_{hj}\, \pi_{ij}^{(e)} \quad \text{for } i = 1, 2, \ldots, s; \; h = 1, 2, \ldots, u, \tag{3.4}$$

where $\pi_{ij}^{(e)}$ represents the joint hypothetical expected probability of response profile $j$
for randomly selected subjects from the $i$th sub-population.
These expected probabilities are determined by the choice of a particular set of baseline
constraints assumed for the response profiles. For this purpose, let $E = \{E_1, E_2, \ldots\}$
represent such underlying constraints on the marginal probabilities $\{\phi_{igk}\}$ of (3.1). In
this context, the following sets of constraints are of interest in creating interobserver
agreement measures:
(i) Under the assumption of total independence among the response variables from
the $d$ observers, the $\{\pi_{ij}^{(e)}\}$ satisfy


$$E_1: \; \pi^{(e)}_{i j_1 j_2 \ldots j_d} = \phi_{i1j_1}\, \phi_{i2j_2} \cdots \phi_{idj_d} = \prod_{g=1}^{d} \phi_{igj_g} \quad \text{for } i = 1, 2, \ldots, s. \tag{3.5}$$

(ii) Under the assumption of "no interobserver bias" the hypothesis of first-order
marginal homogeneity ($H_{CM}$ in Koch et al. [1977]) holds. In this situation, let the
common probability of classification into the $k$th category be

$$\psi_{ik} = \phi_{i1k} = \phi_{i2k} = \cdots = \phi_{idk} \tag{3.6}$$

for $i = 1, 2, \ldots, s$ and $k = 1, 2, \ldots, L$. Then under the baseline constraints
of total independence and marginal homogeneity the $\{\pi_{ij}^{(e)}\}$ satisfy

$$E_2: \; \pi^{(e)}_{i j_1 j_2 \ldots j_d} = \psi_{ij_1}\, \psi_{ij_2} \cdots \psi_{ij_d} = \prod_{g=1}^{d} \psi_{ij_g} \quad \text{for } i = 1, 2, \ldots, s. \tag{3.7}$$

Consequently, a generalized kappa-type measure of agreement directly analogous to
(3.2) can be formulated by

$$\kappa_{ih} = \frac{\lambda_{ih} - \gamma_{ih}}{1 - \gamma_{ih}} \quad \text{for } i = 1, 2, \ldots, s; \; h = 1, 2, \ldots, u \tag{3.8}$$

under a set of specified constraints in $E$. Here $\kappa_{ih}$ represents an agreement measure among
the $d$ observers in the $i$th sub-population with respect to the $h$th set of weights.
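The measure (3.8) can be sketched for the special case of d = 2 observers as follows. The function below accepts any weight matrix with entries between 0 and 1 and evaluates the baseline either under total independence (3.5) or under independence with marginal homogeneity (3.7). Estimating the common category probabilities in (3.6) by the average of the two observers' marginal proportions is an assumption made here for illustration; the paper itself obtains its estimates through the GSK approach.

```python
def generalized_kappa(table, weights, homogeneous=False):
    """Kappa-type measure (3.8) for d = 2 observers from a square table of counts.

    homogeneous=False: baseline (3.5), product of each observer's own marginals.
    homogeneous=True:  baseline (3.7); the common category probabilities are
    estimated by averaging the two observers' marginals (an assumption).
    """
    size = len(table)
    n = sum(sum(row) for row in table)
    p = [[table[a][b] / n for b in range(size)] for a in range(size)]
    rows = [sum(p[a]) for a in range(size)]
    cols = [sum(p[a][b] for a in range(size)) for b in range(size)]
    if homogeneous:
        rows = cols = [(r + c) / 2 for r, c in zip(rows, cols)]
    # Observed weighted agreement, as in (3.3).
    lam = sum(weights[a][b] * p[a][b] for a in range(size) for b in range(size))
    # Expected weighted agreement under the chosen baseline, as in (3.4).
    gam = sum(weights[a][b] * rows[a] * cols[b]
              for a in range(size) for b in range(size))
    return (lam - gam) / (1.0 - gam)


WINNIPEG = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
IDENTITY = [[1 if a == b else 0 for b in range(4)] for a in range(4)]

k_independence = generalized_kappa(WINNIPEG, IDENTITY)
k_homogeneous = generalized_kappa(WINNIPEG, IDENTITY, homogeneous=True)
print(round(k_independence, 3), round(k_homogeneous, 3))
```

With identity weights and the independence baseline this reduces to the kappa of Cohen [1960]. For the Winnipeg data the homogeneity baseline gives a smaller value, because pooling two markedly different marginal distributions raises the expected agreement.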
Within this framework, the specific hypotheses associated with questions (4)-(6) can
now be formulated as follows:
(4) If there are no differences among the $s$ sub-populations with respect to the measures
of overall specific agreement among the $d$ observers under $E$, then the $\{\kappa_{ih}\}$ satisfy
the hypothesis

$$H_{SA|E}: \; \kappa_{1h} = \kappa_{2h} = \cdots = \kappa_{sh} \quad \text{for } h = 1, 2, \ldots, u, \tag{3.9}$$

where SA denotes sub-population agreement.
(5) If the level of observed agreement is equal to that expected under $E$, then the
$\{\kappa_{ih}\}$ satisfy the hypothesis

$$H_{NA|E}: \; \kappa_{ih} = 0 \quad \text{for } i = 1, 2, \ldots, s; \; h = 1, 2, \ldots, u, \tag{3.10}$$

where NA denotes no agreement.
(6) In some cases the weights for the kappa measures are chosen to be in a hierarchical
relationship with each other in order to investigate specific disagreement patterns.
In these situations, if the extent of disagreement is the same for the categories
combined by the $(h + 1)$-st set of weights as for those combined by the $h$th set,
then the $\{\kappa_{ih}\}$ satisfy the hypothesis

$$H_{HA|E}: \; \kappa_{i,h+1} = \kappa_{i,h} \quad \text{for } i = 1, 2, \ldots, s, \tag{3.11}$$

where HA denotes hierarchical agreement.
In order to maintain consistent nomenclature when describing the relative strength
of agreement associated with kappa statistics, the following labels will be assigned to the
corresponding ranges of kappa:


Kappa Statistic     Strength of Agreement
< 0.00              Poor
0.00-0.20           Slight
0.21-0.40           Fair
0.41-0.60           Moderate
0.61-0.80           Substantial
0.81-1.00           Almost Perfect

Although these divisions are clearly arbitrary, they do provide useful "benchmarks" for
the discussion of the specific example in Table 1.
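The benchmark labels can be applied mechanically, as in the sketch below. The table leaves values such as 0.205 (between 0.20 and 0.21) unassigned; treating each upper limit as inclusive and assigning anything above it to the next label is an assumption made here, as is the function name.

```python
def strength_of_agreement(kappa):
    """Map a kappa statistic to the benchmark labels listed above.

    Upper limits are treated as inclusive (an assumption for values that
    fall between the printed two-decimal ranges).
    """
    if kappa < 0.0:
        return "Poor"
    benchmarks = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost Perfect"),
    ]
    for upper, label in benchmarks:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


for value in (-0.10, 0.15, 0.35, 0.55, 0.75, 0.95):
    print(value, strength_of_agreement(value))
```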
3.3 Estimation and Hypothesis Testing
Test statistics for the hypotheses considered in the previous sections as well as estimators
for corresponding model parameters can be obtained by using the general approach for
the analysis of multivariate categorical data proposed by Grizzle, Starmer and Koch [1969]
(hereafter abbreviated GSK) as outlined in Appendix 1 in Koch et al. [1977]. The hypotheses
in Section 3.1 involving constraints on the first-order marginal probabilities can be tested
by expressing the estimates of the $\{\phi_{igk}\}$ or the $\{\bar{\eta}_{ig}\}$ as linear functions of the type given
in Appendix 1 (A.14) in Koch et al. [1977]. These particular matrix expressions have already
been discussed in considerable detail in Koch and Reinfurt [1971] and Koch et al. [1977],
and thus they will not be elaborated here. Otherwise, their specific construction for these
hypotheses in observer agreement studies is documented in Landis [1975].
In contrast to the linear functions which pertain to the hypotheses in Section 3.1,
all the hypotheses involving generalized kappa-type measures require the expression of
the ratio estimates of the $\{\kappa_{ih}\}$ as compounded logarithmic-exponential-linear functions
of the observed proportions as formulated in Appendix 1 (A.20) in Koch et al. [1977].
As a result, the test statistics for the hypotheses in Section 3.2 can also be generated by
the corresponding expression given in Appendix 1 (A.11) in Koch et al. [1977].

4. Analysis of Multiple Sclerosis Data

This section is concerned with the analysis of the multiple sclerosis data in Table 1
with primary emphasis given to illustrating the methodology in Section 3. Tests of
significance are used in a descriptive context to identify important sources of variation as
opposed to a rigorous inferential context; and thus issues pertaining to multiple
comparisons are ignored here. These, however, can be handled by the Scheffe type procedures
given in Grizzle, Starmer and Koch [1969]. The design for this example involves $s = 2$
sub-populations, $d = 2$ observers, and $L = 4$ response categories. Thus, there are $r = L^d = 16$
possible multivariate response profiles within each of the sub-populations.

4.1 Marginal Homogeneity Tests
The functions required to test the hypotheses involving marginal distributions can
be generated in the formulation of (A.14) in Appendix 1 in Koch et al. [1977] with the
function vector $F' = (F_1', F_2')$ where

$$\begin{aligned}
F_1' &= (0.295,\ 0.315,\ 0.235,\ 0.564,\ 0.248,\ 0.074) \\
F_2' &= (0.116,\ 0.261,\ 0.319,\ 0.159,\ 0.420,\ 0.159),
\end{aligned} \tag{4.1}$$


Table 2
HIERARCHICAL WEIGHTS FOR AGREEMENT MEASURES

Weights                  w1j           w2j           w3j           w4j
                      Observer 2    Observer 2    Observer 2    Observer 2
Observer 1
Diagnostic Class       1 2 3 4       1 2 3 4       1 2 3 4       1 2 3 4
1                      1 0 0 0       1 1 0 0       1 1 0 0       1 1 0 0
2                      0 1 0 0       1 1 0 0       1 1 0 0       1 1 1 0
3                      0 0 1 0       0 0 1 0       0 0 1 1       0 1 1 1
4                      0 0 0 1       0 0 0 1       0 0 1 1       0 0 1 1

which contain the marginal proportions for diagnostic classes "1," "2" and "3" for the
two observers within the two sub-populations. The test statistic for $H_{SM}$ is $Q_C = 46.37$
with d.f. = 6, which implies that there are significant ($\alpha = 0.01$) differences in the
distributions of the observed response profiles between the Winnipeg and New Orleans patients.
The tests of this hypothesis within each of the observers also indicate statistically
significant ($\alpha = 0.01$) differences between the two sub-populations, although the Winnipeg
neurologist represents the more dominant component. Similarly, the test statistic for $H_{CM}$
is $Q_C = 69.01$ with d.f. = 6, which implies that there are significant ($\alpha = 0.01$) differences
in the response profiles between the two neurologists within each sub-population. Moreover,
the dominant component of these observer differences is within the Winnipeg patient
group. These results suggest that significant interobserver bias exists between the two
neurologists in their overall usage of the diagnostic classification scale. In addition, the
goodness-of-fit statistic for testing the interaction hypothesis $H_{AM}$ is $Q = 14.09$ with
d.f. = 3. This significant ($\alpha = 0.01$) observer × sub-population interaction is consistent
with the result that the observer differences are more substantial in the Winnipeg patient
group ($Q_C = 58.47$) than in the New Orleans patient group ($Q_C = 10.54$).
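The significance statements above compare each test statistic with a chi-square distribution on the stated degrees of freedom. A minimal sketch of that comparison is given below, using the standard closed-form series for the chi-square upper tail with integer degrees of freedom. The function name chi2_sf is illustrative, and referring these statistics to the chi-square distribution follows the usual large-sample practice of the GSK approach rather than any computation shown in this section.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        tail, k = math.exp(-half), 2              # exact tail for df = 2
    else:
        tail, k = math.erfc(math.sqrt(half)), 1   # exact tail for df = 1
    # Recurrence: tail(k + 2) = tail(k) + half**(k/2) * exp(-half) / Gamma(k/2 + 1)
    while k < df:
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail


for label, q, df in [("H_SM", 46.37, 6), ("H_CM", 69.01, 6), ("H_AM", 14.09, 3)]:
    print(label, q, df, chi2_sf(q, df) < 0.01)
```

The same function can be applied to the single-degree-of-freedom statistics reported in the later tables.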

Table 3
DESCRIPTION OF HIERARCHICAL WEIGHTS

Set of Weights    Disagreement Permitted for Agreement Statistic

1                 None; requires perfect agreement.

2                 Certain (1) with Probable (2).

3                 Certain (1) with Probable (2);
                  Possible (3) with Doubtful (4).

4                 Certain (1) with Probable (2);
                  Probable (2) with Possible (3);
                  Possible (3) with Doubtful (4).


4.2 Hierarchical Kappa-Type Measures of Agreement

Specific patterns of disagreement between the neurologists on the diagnostic
classification of individual subjects can be investigated by selecting a hierarchy of weights which
successively combine adjoining categories of diagnosis in order to create potentially less
stringent reliability measures. For example, the four sets of weights in Table 2 can be
used to investigate the sources of imprecise diagnostic criteria. As indicated in Table 3,
these weights are chosen so that specific disagreement patterns are successively tolerated
in the corresponding estimates of agreement. In particular, $w_{1j}$ represents the set of weights
which generate the kappa measure of perfect agreement proposed in Cohen [1960].
The sequence of hierarchical kappa-type statistics within each of the two patient
populations associated with the weights given in Table 2 can be expressed in the
formulation (A.20) in Appendix 1 in Koch et al. [1977] under the baseline constraints of total
independence $E_1$ in (3.5) by letting
$$\underset{24 \times 32}{A_1} = \left[\begin{array}{cccc|cccc|cccc|cccc}
1&1&1&1&0&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&1&1&1&1&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&1&1&1&1&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&1&1&1&1\\
1&0&0&0&1&0&0&0&1&0&0&0&1&0&0&0\\
0&1&0&0&0&1&0&0&0&1&0&0&0&1&0&0\\
0&0&1&0&0&0&1&0&0&0&1&0&0&0&1&0\\
0&0&0&1&0&0&0&1&0&0&0&1&0&0&0&1\\
1&0&0&0&0&1&0&0&0&0&1&0&0&0&0&1\\
1&1&0&0&1&1&0&0&0&0&1&0&0&0&0&1\\
1&1&0&0&1&1&0&0&0&0&1&1&0&0&1&1\\
1&1&0&0&1&1&1&0&0&1&1&1&0&0&1&1
\end{array}\right] \otimes I_2; \tag{4.2}$$

$$\underset{40 \times 24}{A_2} = \left[\begin{array}{cccc|cccc|cccc}
1&0&0&0&1&0&0&0&0&0&0&0\\
1&0&0&0&0&1&0&0&0&0&0&0\\
1&0&0&0&0&0&1&0&0&0&0&0\\
1&0&0&0&0&0&0&1&0&0&0&0\\
0&1&0&0&1&0&0&0&0&0&0&0\\
0&1&0&0&0&1&0&0&0&0&0&0\\
0&1&0&0&0&0&1&0&0&0&0&0\\
0&1&0&0&0&0&0&1&0&0&0&0\\
0&0&1&0&1&0&0&0&0&0&0&0\\
0&0&1&0&0&1&0&0&0&0&0&0\\
0&0&1&0&0&0&1&0&0&0&0&0\\
0&0&1&0&0&0&0&1&0&0&0&0\\
0&0&0&1&1&0&0&0&0&0&0&0\\
0&0&0&1&0&1&0&0&0&0&0&0\\
0&0&0&1&0&0&1&0&0&0&0&0\\
0&0&0&1&0&0&0&1&0&0&0&0\\
0&0&0&0&0&0&0&0&1&0&0&0\\
0&0&0&0&0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&0&0&0&0&1
\end{array}\right] \otimes I_2; \tag{4.3}$$

$$\underset{16 \times 40}{A_3} = \left[\begin{array}{cccc|cccc|cccc|cccc|cccc}
-1&0&0&0&0&-1&0&0&0&0&-1&0&0&0&0&-1&1&0&0&0\\
-1&-1&0&0&-1&-1&0&0&0&0&-1&0&0&0&0&-1&0&1&0&0\\
-1&-1&0&0&-1&-1&0&0&0&0&-1&-1&0&0&-1&-1&0&0&1&0\\
-1&-1&0&0&-1&-1&-1&0&0&-1&-1&-1&0&0&-1&-1&0&0&0&1\\
0&1&1&1&1&0&1&1&1&1&0&1&1&1&1&0&0&0&0&0\\
0&0&1&1&0&0&1&1&1&1&0&1&1&1&1&0&0&0&0&0\\
0&0&1&1&0&0&1&1&1&1&0&0&1&1&0&0&0&0&0&0\\
0&0&1&1&0&0&0&1&1&0&0&0&1&1&0&0&0&0&0&0
\end{array}\right] \otimes I_2; \tag{4.4}$$

$$\underset{8 \times 16}{A_4} = \left[\, I_4 \;\; -I_4 \,\right] \otimes I_2. \tag{4.5}$$

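For the data in Table 1, these estimates are given by

$$F = \begin{bmatrix} \hat{\kappa}_{11} \\ \hat{\kappa}_{12} \\ \hat{\kappa}_{13} \\ \hat{\kappa}_{14} \\ \hat{\kappa}_{21} \\ \hat{\kappa}_{22} \\ \hat{\kappa}_{23} \\ \hat{\kappa}_{24} \end{bmatrix}
= \begin{bmatrix} 0.208 \\ 0.328 \\ 0.408 \\ 0.596 \\ 0.297 \\ 0.332 \\ 0.386 \\ 0.789 \end{bmatrix} \tag{4.6}$$

where $\hat{\kappa}_{ih}$ is the estimate of the agreement measure in the $i$th sub-population associated
with the $h$th set of weights shown in Table 2. In addition, the estimated covariance matrix
of $F$ is given by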


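$$V_F = \begin{bmatrix}
0.2546 & 0.2122 & 0.1868 & 0.1442 & 0 & 0 & 0 & 0 \\
0.2122 & 0.4005 & 0.3862 & 0.2912 & 0 & 0 & 0 & 0 \\
0.1868 & 0.3862 & 0.5200 & 0.3832 & 0 & 0 & 0 & 0 \\
0.1442 & 0.2912 & 0.3832 & 0.5700 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.6163 & 0.5582 & 0.5046 & 0.2185 \\
0 & 0 & 0 & 0 & 0.5582 & 0.6879 & 0.6544 & 0.3010 \\
0 & 0 & 0 & 0 & 0.5046 & 0.6544 & 1.0030 & 0.4147 \\
0 & 0 & 0 & 0 & 0.2185 & 0.3010 & 0.4147 & 0.7720
\end{bmatrix} \times 10^{-2}. \tag{4.7}$$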
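The point estimates in (4.6) can be reproduced from Table 1 by chaining linear, logarithmic and exponential operations. The sketch below handles one sub-population at a time (so the Kronecker factor for the two sub-populations is omitted) and assumes that the compounded function takes the form exp{A4 log[A3 exp(A2 log(A1 p))]}, with p the vector of observed cell proportions in row-major order; the matrices are built from the Table 2 weights under the independence baseline (3.5). Function and variable names are illustrative, and the covariance matrix (4.7) is not computed here.

```python
import math

# Table 2 weights, flattened row-major over (observer 1 class, observer 2 class).
W = [
    [1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1],
    [1, 1, 0, 0,  1, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1],
    [1, 1, 0, 0,  1, 1, 0, 0,  0, 0, 1, 1,  0, 0, 1, 1],
    [1, 1, 0, 0,  1, 1, 1, 0,  0, 1, 1, 1,  0, 0, 1, 1],
]


def unit(k, size=4):
    """Indicator row vector with a 1 in position k."""
    return [1 if c == k else 0 for c in range(size)]


def matvec(matrix, vector):
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def build_matrices():
    # A1: row marginals, column marginals, then weighted observed agreement.
    a1 = [[1 if c // 4 == a else 0 for c in range(16)] for a in range(4)]
    a1 += [[1 if c % 4 == b else 0 for c in range(16)] for b in range(4)]
    a1 += [row[:] for row in W]
    # A2: log(row marginal) + log(column marginal) for each cell; pass lambdas through.
    a2 = [unit(a) + unit(b) + [0] * 4 for a in range(4) for b in range(4)]
    a2 += [[0] * 8 + unit(h) for h in range(4)]
    # A3: numerators (lambda - gamma), then denominators (1 - gamma).
    a3 = [[-w for w in W[h]] + unit(h) for h in range(4)]
    a3 += [[1 - w for w in W[h]] + [0] * 4 for h in range(4)]
    # A4: log(numerator) - log(denominator).
    a4 = [unit(h) + [-v for v in unit(h)] for h in range(4)]
    return a1, a2, a3, a4


def hierarchical_kappas(table):
    n = sum(sum(row) for row in table)
    p = [count / n for row in table for count in row]
    a1, a2, a3, a4 = build_matrices()
    step = [math.log(v) for v in matvec(a1, p)]
    step = [math.exp(v) for v in matvec(a2, step)]
    step = [math.log(v) for v in matvec(a3, step)]
    return [math.exp(v) for v in matvec(a4, step)]


WINNIPEG = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
NEW_ORLEANS = [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]]
print([round(k, 3) for k in hierarchical_kappas(WINNIPEG)])
print([round(k, 3) for k in hierarchical_kappas(NEW_ORLEANS)])
```

The logarithms require every marginal proportion and every numerator to be strictly positive, which holds for these data but would need separate handling for tables with empty margins or agreement at or below the baseline.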

The test statistics for the hierarchical hypotheses in (3.11) are displayed in Table 4.
These results indicate that all increases in successive agreement measures within the
Winnipeg patient group are significant ($\alpha = 0.05$); but for the New Orleans patient group,
the only significant ($\alpha = 0.05$) increase in agreement pertained to the final set of weights.
Thus, the neurologists are exhibiting significant disagreement between diagnoses (1,2),
(2,3) and (3,4) in the Winnipeg group and significant disagreement between diagnoses
(2,3) in the New Orleans group, as evidenced by the inflated frequencies in these off-diagonal
cells in Table 1. On the other hand, the estimates in (4.6) suggest that the hierarchical

Table 4
STATISTICAL TESTS FOR HIERARCHICAL HYPOTHESES

Hypothesis                        D.F.      Q_C

Combined Patient Groups
κ12 = κ11, κ22 = κ21               2        6.89*
κ13 = κ12, κ23 = κ22               2        5.15
κ14 = κ13, κ24 = κ23               2       28.13**

Winnipeg Patients (1)
κ12 = κ11                          1        6.20*
κ13 = κ12                          1        4.38*
κ14 = κ13                          1       10.96**

New Orleans Patients (2)
κ22 = κ21                          1        0.69
κ23 = κ22                          1        0.76
κ24 = κ23                          1       17.17**

* means significant at α = 0.05;
** means significant at α = 0.01.


kappa measures within both patient groups exhibit the same increasing trend. Since the
estimated variances of the kappa statistics are much larger for the New Orleans patient
group (due to the smaller sample size), the agreement patterns may indeed be essentially
the same in both patient groups.
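The single-degree-of-freedom statistics in Table 4 can be approximated from the rounded values in (4.6) and (4.7) by a quadratic form in a contrast vector c, namely (c'F)^2 / (c'V c). The sketch below assumes this Wald-type form, which is how contrasts are ordinarily tested in the GSK framework; because the inputs are rounded to three or four decimals, the results agree with Table 4 only to about the first decimal place. All names are illustrative.

```python
# Estimates (4.6): kappa_11..kappa_14 (Winnipeg), kappa_21..kappa_24 (New Orleans).
F = [0.208, 0.328, 0.408, 0.596, 0.297, 0.332, 0.386, 0.789]

# Diagonal blocks of (4.7), before multiplying by 10**-2.
V_BLOCKS = [
    [[0.2546, 0.2122, 0.1868, 0.1442],
     [0.2122, 0.4005, 0.3862, 0.2912],
     [0.1868, 0.3862, 0.5200, 0.3832],
     [0.1442, 0.2912, 0.3832, 0.5700]],
    [[0.6163, 0.5582, 0.5046, 0.2185],
     [0.5582, 0.6879, 0.6544, 0.3010],
     [0.5046, 0.6544, 1.0030, 0.4147],
     [0.2185, 0.3010, 0.4147, 0.7720]],
]


def covariance_matrix(blocks, scale=1e-2):
    """Assemble the block-diagonal covariance matrix from its diagonal blocks."""
    size = sum(len(block) for block in blocks)
    cov = [[0.0] * size for _ in range(size)]
    offset = 0
    for block in blocks:
        for r, row in enumerate(block):
            for c, value in enumerate(row):
                cov[offset + r][offset + c] = value * scale
        offset += len(block)
    return cov


def wald_statistic(contrast, estimates, cov):
    """Single-d.f. statistic (c'F)**2 / (c'Vc) for the hypothesis c'kappa = 0."""
    size = len(contrast)
    effect = sum(c * f for c, f in zip(contrast, estimates))
    variance = sum(contrast[r] * cov[r][c] * contrast[c]
                   for r in range(size) for c in range(size))
    return effect ** 2 / variance


def step_contrast(i, h, u=4, s=2):
    """Contrast for kappa_{i,h+1} - kappa_{i,h}, with 0-based i and h."""
    contrast = [0] * (s * u)
    contrast[i * u + h + 1] = 1
    contrast[i * u + h] = -1
    return contrast


V = covariance_matrix(V_BLOCKS)
for i, group in enumerate(["Winnipeg", "New Orleans"]):
    for h in range(3):
        q = wald_statistic(step_contrast(i, h), F, V)
        print(group, "step", h + 1, "to", h + 2, round(q, 2))
```

The same function applies to contrasts between the sub-populations, for which the covariance between the two estimates is zero because the samples are independent.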
If the two neurologists are indeed exhibiting the same agreement patterns with respect
to the weights given in Table 2 within the two groups of patients, then under (3.5) the
$\{\kappa_{ih}\}$ satisfy the following hypotheses from (3.9)

$$H_{SA|E_1}: \; \kappa_{1h} = \kappa_{2h} \quad \text{for } h = 1, 2, 3, 4. \tag{4.8}$$

Test statistics for these hypotheses both individually and jointly are presented in Table 5.
The results in Tables 4 and 5 suggest that a reduced model can be used to combine
parameters which are essentially equivalent. For this purpose, the agreement statistics
in (4.6) can be modeled by

$$E_A\{F\} = X\kappa = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \kappa_1 \\ \kappa_2 \\ \kappa_3 \\ \kappa_4 \\ \kappa_5 \end{bmatrix} \tag{4.9}$$

where "$E_A$" denotes "asymptotic expectation." For this model, the goodness-of-fit statistic
is $Q = 2.27$ with d.f. = 3. Thus, this reduced model provides a satisfactory characterization
of the variation among these agreement measures. Specific test statistics for the
corresponding hypotheses in (3.10) and (3.11) pertaining to the model $X$ in (4.9) are given in
Table 6. These results suggest that all the parameters are significantly ($\alpha = 0.01$) different
from zero, and moreover, are significantly ($\alpha = 0.05$) different from each other.
Furthermore, by reducing the model to these smoothed estimates, the marginally significant
($\alpha = 0.10$) difference between $\kappa_{14}$ and $\kappa_{24}$ in Table 5 is now significant ($\alpha = 0.05$) for the
Table 5
STATISTICAL TESTS BETWEEN PATIENT SUB-POPULATIONS

Hypothesis                        D.F.      Q_C

κ1h = κ2h for h = 1,2,3,4          4        7.15
κ11 = κ21                          1        0.90
κ12 = κ22                          1        0.00
κ13 = κ23                          1        0.03
κ14 = κ24                          1        2.77


Table 6
STATISTICAL TESTS FOR MODEL X

Hypothesis    D.F.    Q_C          Hypothesis    D.F.    Q_C

κ2 = κ1        1      5.40*        κ1 = 0         1     31.05**
κ3 = κ2        1      4.92*        κ2 = 0         1     40.71**
κ4 = κ3        1     12.33**       κ3 = 0         1     45.49**
κ5 = κ4        1      4.88*        κ4 = 0         1     72.44**
                                   κ5 = 0         1     94.97**

* means significant at α = 0.05;
** means significant at α = 0.01.

comparison of $\kappa_4$ and $\kappa_5$ in this final model. Finally, the predicted values for the $\{\kappa_{ih}\}$ based
on the fitted model (4.9) are displayed in Table 7 together with their corresponding estimated
standard errors.
Thus, these results suggest that the diagnostic criteria are not very distinct with respect
to their usage by these two neurologists. In addition to bias at the macro stage, i.e.,
considering only the overall marginal proportions, these observers exhibited significant
disagreement at the micro stage, i.e., considering each individual subject, in specifying a
diagnosis. Only with respect to the relatively relaxed criterion corresponding to the fourth
set of weights do the kappa statistics indicate a "moderate" to "substantial" level of
interobserver reliability.

5. Discussion
In some applications, one may also be interested in a set of weights which assign varying
degrees of partial credit to the off-diagonal cells depending on the extent of the
disagreement, rather than successively combining adjoining categories as shown in Table 2. For

Table 7
SMOOTHED ESTIMATES OF AGREEMENT UNDER MODEL X

                            Sub-population 1                Sub-population 2
           Agreement     Estimate     Estimated          Estimate     Estimated
Weights    Statistic     Under X      Standard Error     Under X      Standard Error

w1j        κi1           0.236        0.042              0.236        0.042
w2j        κi2           0.311        0.049              0.311        0.049
w3j        κi3           0.383        0.057              0.383        0.057
w4j        κi4           0.579        0.068              0.790        0.081


Table 8
ALTERNATIVE WEIGHTS FOR OVERALL AGREEMENT MEASURES

Weights                     w1j                    w2j
                         Observer 2             Observer 2
Observer 1
Diagnostic Class       1    2    3    4       1     2     3     4
1                      1    0    0    0       1    1/2   1/4    0
2                      0    1    0    0      1/2    1    1/2   1/4
3                      0    0    1    0      1/4   1/2    1    1/2
4                      0    0    0    1       0    1/4   1/2    1

example, the weights $w_{2j}$ in Table 8 are directly analogous to those discussed in Cohen
[1968], Fleiss, Cohen and Everitt [1969] and Cicchetti [1972], which were used to generate
weighted kappa and C statistics. For the data in Table 1, these estimates are given by

$$F = \begin{bmatrix} \hat{\kappa}_{11} \\ \hat{\kappa}_{12} \\ \hat{\kappa}_{21} \\ \hat{\kappa}_{22} \end{bmatrix}
= \begin{bmatrix} 0.208 \\ 0.315 \\ 0.297 \\ 0.407 \end{bmatrix} \tag{5.1}$$

where the $\{\hat{\kappa}_{i1}\}$ estimate the perfect agreement kappa measure and the $\{\hat{\kappa}_{i2}\}$ estimate
the partial-credit weighted kappa agreement measure between the two neurologists in the
two patient populations. A more extensive analysis of these data under the weights in
Table 8 is given in Landis [1975] and Landis et al. [1976].
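The partial-credit estimates in (5.1) can be checked with a distance-based weight function. The entries of the second weight set in Table 8 are difficult to read in the scanned source; the sketch below assumes weights of 1, 1/2, 1/4 and 0 for ratings that differ by 0, 1, 2 and 3 classes, an assumption adopted here because it reproduces both values reported in (5.1) under the independence baseline (3.5). The names are illustrative.

```python
# Assumed partial-credit weights by absolute difference in diagnostic class.
CREDIT = {0: 1.0, 1: 0.5, 2: 0.25, 3: 0.0}


def partial_credit_kappa(table, credit=CREDIT):
    """Weighted kappa for two observers with weights depending on |a - b|."""
    size = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(row) / n for row in table]
    cols = [sum(row[b] for row in table) / n for b in range(size)]
    observed = sum(credit[abs(a - b)] * table[a][b] / n
                   for a in range(size) for b in range(size))
    expected = sum(credit[abs(a - b)] * rows[a] * cols[b]
                   for a in range(size) for b in range(size))
    return (observed - expected) / (1.0 - expected)


WINNIPEG = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
NEW_ORLEANS = [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]]
print(round(partial_credit_kappa(WINNIPEG), 3))
print(round(partial_credit_kappa(NEW_ORLEANS), 3))
```

Passing a credit table that is 1 at distance 0 and 0 elsewhere recovers the perfect agreement kappa, so the two rows of (5.1) for each sub-population come from the same function with different weights.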
Although the methodology for the assessment of observer agreement developed in this
paper is quite general, these procedures have been illustrated with an example involving
only two observers. However, for situations in which either the number of observers d
or the number of response categories L is moderately large, the number of possible
multivariate response profiles r = L^d becomes extremely large. Consequently, the matrices
required to implement the GSK procedures directly may be outside the scope of computational
feasibility. In addition, for each of the s sub-populations many of the r possible
response profiles will not necessarily be observed in the respective samples, so that the
corresponding cell frequencies are zero. Thus, in such cases, specialized computing procedures
are required to obtain the estimates of the pertinent functions.
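The growth of r = L^d, and the sparsity it implies, can be seen in a small sketch. The ratings below are hypothetical, and the helper name `profile_space_size` is introduced only for illustration. Tallying just the profiles that actually occur is one simple way to avoid storing the full table; it is not the specialized procedure the authors refer to.

```python
from collections import Counter

def profile_space_size(num_categories, num_observers):
    """Number of possible multivariate response profiles, r = L**d."""
    return num_categories ** num_observers

# Hypothetical data: one tuple per subject, giving the class assigned by each
# of d = 5 observers on an L = 4 category scale.
ratings = [
    (1, 1, 1, 2, 1),
    (1, 1, 1, 2, 1),
    (2, 2, 3, 2, 2),
    (4, 4, 4, 4, 4),
    (3, 2, 3, 3, 4),
    (4, 4, 4, 4, 4),
]

r = profile_space_size(4, 5)        # cells in the conceptual contingency table
observed = Counter(ratings)         # tally only the profiles that occur
empty_cells = r - len(observed)     # cells whose observed frequency is zero

print(r, len(observed), empty_cells)
```

For comparison, `profile_space_size(4, 10)` is already 1,048,576, so with samples of ordinary size almost every cell of the conceptual table would be empty.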
One alternative approach for handling such very large contingency tables in which
most of the observed cell frequencies are zero is discussed in Landis and Koch [1977].
In this regard, the same estimators which would need to be obtained from the conceptual
multidimensional contingency table can be generated by first forming appropriate indicator
variables of the raw data from each subject and then computing the across-subject
arithmetic means. Subsequent to these preliminary steps, the usual matrix operations discussed
in Appendix 1 in Koch et al. [1977] can then be applied to these indicator variable means
to determine the required measures of observer agreement. These alternative computations
involving raw data, as well as the extended GSK procedures summarized in Appendix 1
in Koch et al. [1977], can all be performed by a recently developed computer program
(GENCAT) discussed in Landis et al. [1976].
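The indicator-variable idea can be sketched for the simplest case of two observers and an unweighted kappa. The data and both function names below are hypothetical. The sketch stands in for neither GENCAT nor the matrix operations of Koch et al. [1977]; it only shows that means of per-subject indicators give the same point estimate as the cross-classified table.

```python
def kappa_from_indicator_means(pairs, categories):
    """Unweighted kappa computed from across-subject means of indicators."""
    n = len(pairs)
    # Mean of the indicator "both observers chose the same class".
    p_obs = sum(1 if a == b else 0 for a, b in pairs) / n
    # Means of the indicators "observer 1 chose c" and "observer 2 chose c".
    m1 = {c: sum(1 if a == c else 0 for a, _ in pairs) / n for c in categories}
    m2 = {c: sum(1 if b == c else 0 for _, b in pairs) / n for c in categories}
    p_exp = sum(m1[c] * m2[c] for c in categories)
    return (p_obs - p_exp) / (1 - p_exp)

def kappa_from_table(pairs, categories):
    """Same estimate, computed by first forming the full contingency table."""
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    table = [[0] * k for _ in range(k)]
    for a, b in pairs:
        table[index[a]][index[b]] += 1
    n = len(pairs)
    p_obs = sum(table[i][i] for i in range(k)) / n
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_exp = sum(row[i] * col[i] for i in range(k))
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical raw data: (class from observer 1, class from observer 2).
pairs = [(1, 1), (1, 2), (2, 2), (2, 2), (3, 3),
         (3, 4), (4, 4), (4, 4), (1, 1), (2, 3)]
categories = [1, 2, 3, 4]

k_means = kappa_from_indicator_means(pairs, categories)
k_table = kappa_from_table(pairs, categories)
print(k_means, k_table)
```

The indicator route never allocates the L^d table, so its cost grows with the number of subjects and the number of indicator functions rather than with the number of possible response profiles.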


Acknowledgments
This research was partially supported by Research Grants GM-00038-20 and GM-70004-05
from the National Institute of General Medical Sciences and by the U. S. Bureau
of the Census through Joint Statistical Agreements JSA 74-2 and JSA 75-2. The authors
would like to thank the referees for their helpful comments on an earlier draft of this paper.
In addition, the authors are grateful to Ms. Rebecca Wesson and Ms. Lynn Wilkinson
for their conscientious typing of previous drafts of this paper, and to Ms. Linda L. Blakley
and Ms. Connie Massey for their efficient typing of the final version of this manuscript.

The Measurement of Agreement Between Observations for Categorical Data

Résumé
This article presents a general statistical methodology for the analysis of multivariate
categorical data arising from observer reliability studies. The procedure mainly involves
the construction of functions of the observed proportions that reflect the agreement of the
observers among themselves, and the construction of test statistics for hypotheses involving
these functions. Tests for bias between observers are presented in terms of first-order
marginal homogeneity, and measures of agreement between observers are constructed as
statistics generalizing those of the kappa type. These procedures are illustrated with a
clinical diagnosis example from the epidemiological literature.

References
Anderson, R. L. and Bancroft, T. A. [1952]. Statistical Theory in Research. McGraw-Hill, New York.
Bhapkar, V. P. [1966]. A note on the equivalence of two test criteria for hypotheses in categorical
    data. Journal of the American Statistical Association 61, 228-235.
Bhapkar, V. P. [1968]. On the analysis of contingency tables with a quantitative response. Biometrics
    24, 329-338.
Bhapkar, V. P. and Koch, G. G. [1968a]. Hypotheses of "no interaction" in multidimensional
    contingency tables. Technometrics 10, 107-123.
Bhapkar, V. P. and Koch, G. G. [1968b]. On the hypotheses of "no interaction" in contingency tables.
    Biometrics 24, 567-594.
Cicchetti, D. V. [1972]. A new measure of agreement between rank-ordered variables. Proceedings,
    80th Annual Convention, APA, 17-18.
Cohen, J. [1960]. A coefficient of agreement for nominal scales. Educational and Psychological
    Measurement 20, 37-46.
Cohen, J. [1968]. Weighted kappa: nominal scale agreement with provision for scaled disagreement
    or partial credit. Psychological Bulletin 70, 213-220.
Fleiss, J. L. [1966]. Assessing the accuracy of multivariate observations. Journal of the American
    Statistical Association 61, 403-412.
Fleiss, J. L., Cohen, J. and Everitt, B. S. [1969]. Large sample standard errors of kappa and weighted
    kappa. Psychological Bulletin 72, 323-337.
Fleiss, J. L. [1971]. Measuring nominal scale agreement among many raters. Psychological Bulletin
    76, 378-382.
Fleiss, J. L. and Cohen, J. [1973]. The equivalence of weighted kappa and the intraclass correlation
    coefficient as measures of reliability. Educational and Psychological Measurement 33, 613-619.
Fleiss, J. L. [1975]. Measuring agreement between two judges on the presence or absence of a trait.
    Biometrics 31, 651-659.
Forthofer, R. N. and Koch, G. G. [1973]. An analysis for compounded functions of categorical data.
    Biometrics 29, 143-157.
Grizzle, J. E., Starmer, C. F. and Koch, G. G. [1969]. Analysis of categorical data by linear models.
    Biometrics 25, 489-504.


Grubbs, F. E. [1948]. On estimating precision of measuring instruments and product variability.
    Journal of the American Statistical Association 43, 243-264.
Grubbs, F. E. [1973]. Errors of measurement, precision, accuracy and the statistical comparison of
    measuring instruments. Technometrics 15, 53-66.
Goodman, L. A. and Kruskal, W. H. [1954]. Measures of association for cross classification. Journal
    of the American Statistical Association 49, 732-764.
Koch, G. G. [1967]. A general approach to the estimation of variance components. Technometrics 9,
    93-118.
Koch, G. G. [1968]. Some further remarks concerning "A general approach to the estimation of
    variance components." Technometrics 10, 551-558.
Koch, G. G. and Reinfurt, D. W. [1971]. The analysis of categorical data from mixed models.
    Biometrics 27, 157-173.
Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H., Jr. and Lehnen, R. G. [1977]. A general
    methodology for the analysis of experiments with repeated measurement of categorical data.
    Biometrics 33, 133-158.
Landis, J. R. [1975]. A general methodology for the measurement of observer agreement when the
    data are categorical. Ph.D. Dissertation, University of North Carolina, Institute of Statistics
    Mimeo Series No. 1022.
Landis, J. R. and Koch, G. G. [1975a]. A review of statistical methods in the analysis of data arising
    from observer reliability studies (Part I). Statistica Neerlandica 29, 101-123.
Landis, J. R. and Koch, G. G. [1975b]. A review of statistical methods in the analysis of data arising
    from observer reliability studies (Part II). Statistica Neerlandica 29, 151-161.
Landis, J. R. and Koch, G. G. [1977]. An application of hierarchical kappa-type statistics in the
    assessment of majority agreement among multiple observers. Accepted for publication in
    Biometrics.
Landis, J. R., Stanish, W. M., Freeman, J. L. and Koch, G. G. [1976]. A computer program for the
    generalized chi-square analysis of categorical data using weighted least squares (GENCAT).
    University of Michigan Biostatistics Technical Report No. 8. Accepted for publication in
    Computer Programs in Biomedicine.
Light, R. J. [1971]. Measures of response agreement for qualitative data: some generalizations and
    alternatives. Psychological Bulletin 76, 365-377.
Loewenson, R. B., Bearman, J. E. and Resch, J. A. [1972]. Reliability of measurements for studies
    of cardiovascular atherosclerosis. Biometrics 28, 557-569.
Mandel, J. [1959]. The measuring process. Technometrics 1, 251-267.
Neyman, J. [1949]. Contribution to the theory of the χ² test. Proceedings of the Berkeley Symposium
    on Mathematical Statistics and Probability, Berkeley and Los Angeles, University of California
    Press, 239-272.
Overall, J. E. [1968]. Estimating individual rater reliabilities from analysis of treatment effects.
    Educational and Psychological Measurement 28, 255-264.
Scheffé, H. [1959]. The Analysis of Variance. Wiley, New York.
Searle, S. R. [1971]. Linear Models. Wiley, New York.
Wald, A. [1943]. Tests of statistical hypotheses concerning general parameters when the number of
    observations is large. Transactions of the American Mathematical Society 54, 426-482.
Westlund, K. B. and Kurland, L. T. [1953]. Studies on multiple sclerosis in Winnipeg, Manitoba and
    New Orleans, Louisiana. American Journal of Hygiene 57, 380-396.

Received April 1975, Revised November 1975
