
Source: Biometrics, Vol. 33, No. 1 (Mar., 1977), pp. 159-174

Published by: International Biometric Society

Stable URL: http://www.jstor.org/stable/2529310 .

Accessed: 19/11/2012 06:33


This content downloaded by the authorized user from 192.168.82.209 on Mon, 19 Nov 2012 06:33:45 AM

All use subject to JSTOR Terms and Conditions

BIOMETRICS 33, 159-174

March 1977

The Measurement of Observer Agreement for Categorical Data

J. RICHARD LANDIS

Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109 U.S.A.

GARY G. KOCH

Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27514 U.S.A.

Summary

This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.

1. Introduction

Researchers in many fields have become increasingly aware of the observer (rater or interviewer) as an important source of measurement error. Consequently, reliability studies are conducted in experimental or survey situations to assess the level of observer variability in the measurement procedures to be used in data acquisition. When the data arising from such studies are quantitative, tests for interobserver bias and measures of interobserver agreement are usually obtained from standard ANOVA mixed models or random effects models such as those discussed in Anderson and Bancroft [1952], Scheffé [1959], and Searle [1971]. As a result, hypothesis tests of observer effects are used to investigate interobserver bias, i.e., differences in the mean response among observers, and estimates of intraclass correlation coefficients are used to measure interobserver reliability. Modifications and extensions of these standard ANOVA models have been proposed by Grubbs [1948, 1973], Mandel [1959], Fleiss [1966], Overall [1968], and Loewenson, Bearman and Resch [1972] to evaluate the measurement error in various types of applications. Although assumptions of normality for these models may not be warranted in certain cases, the ANOVA procedures discussed in Searle [1971] and the symmetric square difference procedure in Koch [1967, 1968] still permit the estimation of the appropriate components of variance and the reliability coefficients.

On the other hand, many observer reliability studies involve categorical data in which the response variable is classified into nominal (or possibly ordinal) multinomial categories. As reviewed in Landis and Koch [1975a, 1975b], a wide variety of estimation and testing procedures have been recommended for the assessment of observer variability in these

Key Words: Observer agreement; Multivariate categorical data; Kappa statistics; Repeated measurement experiments; Weighted least squares.


for categorical data by expressing the quantities which reflect the extent to which the observers agree among themselves as functions of observed proportions obtained from underlying multidimensional contingency tables. These functions are then used to produce test statistics for the relevant hypotheses concerning interobserver bias in the overall usage of the measurement scale and interobserver agreement on the classification of individual subjects. For illustrative purposes, this general methodology is developed within the context of a typical data set which resulted from an investigation of observer variability in the clinical diagnosis of multiple sclerosis.

Let us consider the data arising from the diagnosis of multiple sclerosis reported in Westlund and Kurland [1953]. Among other things, the investigators were interested in comparing patient groups to study possible differences in the geographical distributions of the disease. For this purpose, a series of patients in Winnipeg, Manitoba and a separate series of patients in New Orleans, Louisiana were selected and were examined by a neurologist in their respective locations. After the completion of all the examinations, each neurologist was requested to review all the records without seeing his earlier summary and diagnosis and to classify them into one of the following diagnostic classes:

1. Certain multiple sclerosis;

2. Probable multiple sclerosis;

3. Possible multiple sclerosis (odds 50:50);

4. Doubtful, unlikely, or definitely not multiple sclerosis.

In order to evaluate agreement between the diagnosticians, the Winnipeg neurologist then reviewed and classified each of the New Orleans patient records, and vice versa. The data resulting from these review diagnoses are presented in Table 1.

A preliminary inspection of the Winnipeg data indicates that the Winnipeg neurologist tended to diagnose more of the patients as certain (1) or probable (2) multiple sclerosis than did his counterpart in New Orleans. As a result, they agreed on the diagnosis of only 64/149 (43 percent) of the patients. Although the differences in the overall crude distributions of the diagnoses seem to be less prominent within the New Orleans patients, the neurologists diagnosed only 33/69 (48 percent) of them into identically the same category. The statistical issues concerning these differences in diagnosis can be summarized within the framework of the following basic questions:

(1) Are there any differences between the two patient populations with respect to the overall crude distribution of the diagnoses by each of the two neurologists?

(2) Are there any differences between the overall crude distributions of the diagnoses by the two neurologists within each of the respective patient populations?

(3) Is there any neurologist × sub-population interaction in the overall crude distribution of the diagnoses?

(4) Is there any difference between the two patient populations with respect to the overall agreement of the two neurologists on the specific diagnosis of individual patients?

(5) Is the agreement of the two neurologists on the specific diagnosis of individual patients significantly different from chance agreement based on their overall crude distributions of diagnoses?


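Table 1

DIAGNOSTIC CLASSIFICATION REGARDING MULTIPLE SCLEROSIS

Winnipeg patients
                                 Winnipeg neurologist
                       Class     1     2     3     4    Total   Proportion
New Orleans              1      38     5     0     1      44      0.295
neurologist (1)          2      33    11     3     0      47      0.315
                         3      10    14     5     6      35      0.235
                         4       3     7     3    10      23      0.154
                       Total    84    37    11    17     149

New Orleans patients
                                 Winnipeg neurologist
                       Class     1     2     3     4    Total   Proportion
New Orleans              1       5     3     0     0       8      0.116
neurologist (1)          2       3    11     4     0      18      0.261
                         3       2    13     3     4      22      0.319
                         4       1     2     4    14      21      0.304
                       Total    11    29    11    18      69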

in the diagnostic criteria?
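The crude agreement figures quoted above (64/149 and 33/69) can be checked directly from the diagonal of each sub-table of Table 1. The following is a minimal sketch, not part of the original analysis. The names `WINNIPEG`, `NEW_ORLEANS`, and `crude_agreement` are illustrative, and the second row of the Winnipeg table is an assumption obtained by subtracting the other rows from the printed column totals.

```python
# Table 1 counts: rows = New Orleans neurologist, columns = Winnipeg neurologist.
# Assumption: row 2 of WINNIPEG is derived from the printed row and column totals.
WINNIPEG = [
    [38, 5, 0, 1],
    [33, 11, 3, 0],
    [10, 14, 5, 6],
    [3, 7, 3, 10],
]
NEW_ORLEANS = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]


def crude_agreement(table):
    """Return (agreeing subjects, total subjects, crude proportion of agreement)."""
    n = sum(sum(row) for row in table)
    agree = sum(table[k][k] for k in range(len(table)))  # main diagonal
    return agree, n, agree / n


for name, table in [("Winnipeg", WINNIPEG), ("New Orleans", NEW_ORLEANS)]:
    agree, n, p = crude_agreement(table)
    print(f"{name}: {agree}/{n} = {100 * p:.0f} percent")
```

The crude proportion ignores agreement expected by chance, which is why the kappa-type measures developed later are needed.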

As stated in Koch et al. [1977], questions (1)-(3) are directly analogous to the hypotheses of "no whole-plot effects," "no split-plot effects," and "no whole-plot × split-plot interaction" in standard split-plot experiments. In this context, question (1) addresses differences among the sub-populations, question (2) involves the issue of interobserver bias, and question (3) is concerned with the observer × sub-population interaction. Thus, the first-order marginal distributions of response for each of the neurologists within each sub-population contain the relevant information for dealing with these questions. In


agreement on a subject-to-subject basis; and, as such, they are directly analogous to hypotheses concerning intraclass correlation coefficients in random effects models. Hence, certain functions of the diagonal cells of various subtables are used to provide information for estimating and testing the significance of agreement on the classification of individual subjects.

In the following sections a general methodology for answering these questions is developed in terms of specific hypotheses. These procedures are then illustrated with an analysis of the data in Table 1.

3. Methodology

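Let $i = 1, 2, \ldots, s$ index a set of sub-populations from which random samples have been selected. Suppose that the same response variable is measured separately by each of $d$ observers using an $L$-point scale. Let the $r = L^d$ response profiles be indexed by a vector subscript $j = (j_1, j_2, \ldots, j_d)$, where $j_g = 1, 2, \ldots, L$ for $g = 1, 2, \ldots, d$. Furthermore, let $\pi_{ij} = \pi_{i j_1 j_2 \cdots j_d}$ represent the joint probability of response profile $j$ for randomly selected subjects from the $i$th sub-population. Then the first-order marginal probability

$$\phi_{igk} = \sum_{j \text{ with } j_g = k} \pi_{ij} \quad \text{for } g = 1, 2, \ldots, d;\ k = 1, 2, \ldots, L \qquad (3.1)$$

is the probability of classification into the $k$th category by the $g$th observer in the $i$th sub-population.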

3.1 Hypotheses Involving Marginal Distributions

Hypotheses directed at the questions of differences among sub-populations and interobserver bias involve distributions of the response profiles and can be expressed in terms of constraints on the first-order marginal probabilities $\{\phi_{igk}\}$. As a result, the specific hypotheses associated with questions (1)-(3) are directly analogous to $H_{SM}$, $H_{CM}$, and $H_{AM}$ outlined in Koch et al. [1977] in expressions (2.4), (2.5), and (2.9), respectively. In particular, the $d$ observers correspond to the $d$ conditions, and thus the hypothesis of first-order marginal symmetry (homogeneity) addresses the issue of interobserver bias. These hypotheses can also be expressed in terms of constraints on mean score functions associated with each observer such as the $\{\eta_{ig}\}$ summary indexes specified in (2.14) in Koch et al. [1977]. Further discussion of hypotheses involving marginal distributions within the context of observer agreement studies is given in Landis [1975].

Whereas the previous hypotheses concerning differences among sub-populations and interobserver bias involved only the first-order marginal probabilities, hypotheses directed at the extent to which observers agree among themselves on the classification of individual subjects must be formulated in terms of the internal elements of the table. For example, the estimate of the crude proportion of agreement between two observers is simply the sum of the observed proportions on the main diagonal of the corresponding two-way table. In addition, if partial credit is permitted for certain types of disagreement, an estimate of the weighted proportion of agreement will involve the weighted inclusion of the off-diagonal cells.


Measures of agreement have been proposed for categorical data, e.g., Goodman and Kruskal [1954], Cohen [1960, 1968], Fleiss [1971], Light [1971], and Cicchetti [1972]. Most of these quantities are of the form

$$\kappa = \frac{\pi_o - \pi_e}{1 - \pi_e} \qquad (3.2)$$

where $\pi_o$ is an observational probability of agreement and $\pi_e$ is a hypothetical expected probability of agreement under an appropriate set of baseline constraints such as total independence of observer classifications. Ranging from $-\pi_e/(1 - \pi_e)$ to $+1$, $\kappa$ indicates the extent to which the observational probability of agreement is in excess of the probability of agreement hypothetically expected under the baseline constraints. Furthermore, as shown in Fleiss and Cohen [1973] and Fleiss [1975], $\kappa$ is directly analogous to the intraclass correlation coefficient obtained from ANOVA models for quantitative measurements and can be used as a measure of the reliability of multiple determinations on the same subjects.
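The form (3.2) and its stated range can be illustrated with a few lines of code. This is a minimal sketch; the function names and the numerical values are hypothetical and serve only to show the three reference points of the scale.

```python
def kappa(p_obs, p_exp):
    """Kappa of the form (3.2): observed agreement in excess of the baseline, rescaled."""
    if not 0.0 <= p_exp < 1.0:
        raise ValueError("expected agreement must lie in [0, 1)")
    return (p_obs - p_exp) / (1.0 - p_exp)


def kappa_lower_bound(p_exp):
    """Smallest attainable kappa, reached when the observed agreement is zero."""
    return -p_exp / (1.0 - p_exp)


# Illustrative value only (hypothetical, not taken from the paper).
p_exp = 0.25
print(kappa(1.0, p_exp))    # perfect observed agreement gives +1
print(kappa(p_exp, p_exp))  # agreement equal to the baseline gives 0
print(kappa(0.0, p_exp), kappa_lower_bound(p_exp))  # both equal -pi_e / (1 - pi_e)
```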

Several kappa-type measures of interobserver agreement can be formulated to investigate selected patterns of disagreement simultaneously by choosing corresponding sets of weights which reflect the role of each response category in a given agreement index. For example, a set of weights can be chosen so that the resulting agreement measure indicates the combined performance of all the observers, such as majority or consensus agreement, or sets of weights can be directed at subsets of observers, such as all possible pairwise agreement measures. Alternatively, these weights can be chosen so that the associated kappa measures indicate the increments in agreement which result by successively combining relevant categories of the response variable. Such kappa measures are said to be in a hierarchical relationship with each other. Thus, in general, let $w_{1j}, w_{2j}, \ldots, w_{uj}$ be $u$ sets of weights assigned to the response profiles indexed by $j = (j_1, j_2, \ldots, j_d)$. Moreover let $0 \le w_{hj} \le 1$ for $h = 1, 2, \ldots, u$ over all $j$, so that the resulting estimates are interpretable as probabilities of agreement. Then the observational probability of agreement associated with the $h$th set of weights in the $i$th sub-population is the weighted sum

$$\lambda_{ih} = \sum_{j} w_{hj}\, \pi_{ij} \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u. \qquad (3.3)$$

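Correspondingly, the expected probability of agreement associated with the $h$th set of weights is the weighted sum

$$\gamma_{ih} = \sum_{j} w_{hj}\, \pi_{ij}^{(e)} \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u, \qquad (3.4)$$

where $\pi_{ij}^{(e)}$ is the expected probability of response profile $j$ for randomly selected subjects from the $i$th sub-population.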

These expected probabilities are determined by the choice of a particular set of baseline constraints assumed for the response profiles. For this purpose, let $E = \{E_1, E_2, \ldots\}$ represent such underlying constraints on the marginal probabilities $\{\phi_{igk}\}$ of (3.1). In this context, the following sets of constraints are of interest in creating interobserver agreement measures:

(i) Under the assumption of total independence among the response variables from the $d$ observers, the $\{\pi_{ij}^{(e)}\}$ satisfy


$$E_1:\ \pi^{(e)}_{i j_1 j_2 \cdots j_d} = \phi_{i1j_1}\, \phi_{i2j_2} \cdots \phi_{idj_d} = \prod_{k=1}^{d} \phi_{ikj_k} \quad \text{for } i = 1, 2, \ldots, s. \qquad (3.5)$$

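marginal homogeneity ($H_{CM}$ in Koch et al. [1977]) holds. In this situation, let the common probability of classification into the $k$th category be

$$\psi_{ik} = \phi_{i1k} = \phi_{i2k} = \cdots = \phi_{idk} \qquad (3.6)$$

for $i = 1, 2, \ldots, s$ and $k = 1, 2, \ldots, L$. Then under the baseline constraints of total independence and marginal homogeneity the $\{\pi_{ij}^{(e)}\}$ satisfy

$$E_2:\ \pi^{(e)}_{i j_1 j_2 \cdots j_d} = \psi_{ij_1}\, \psi_{ij_2} \cdots \psi_{ij_d} = \prod_{g=1}^{d} \psi_{ij_g} \quad \text{for } i = 1, 2, \ldots, s. \qquad (3.7)$$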

(3.2) can be formulated by

$$\kappa_{ih} = \frac{\lambda_{ih} - \gamma_{ih}}{1 - \gamma_{ih}} \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u \qquad (3.8)$$

the $d$ observers in the $i$th sub-population with respect to the $h$th set of weights.
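The chain from the weighted sums (3.3) and (3.4) to the kappa measure (3.8) can be sketched for a single sub-population and any number of observers. This is an illustrative sketch rather than the GSK matrix formulation: `generalized_kappa` and `all_agree` are hypothetical names, the profile probabilities are invented, categories are coded from 0, and under the homogeneity baseline the common marginal is taken as the average of the observers' marginals, which is an assumption.

```python
from itertools import product


def generalized_kappa(pi, weight, d, L, homogeneous=False):
    """Kappa of the form (3.8) for one sub-population.

    pi     : dict mapping response profiles (tuples of length d) to probabilities
    weight : function giving the weight in [0, 1] of a response profile
    """
    # First-order marginal probabilities phi[g][k], as in (3.1).
    phi = [[0.0] * L for _ in range(d)]
    for j, p in pi.items():
        for g, k in enumerate(j):
            phi[g][k] += p
    if homogeneous:
        # Baseline of independence plus marginal homogeneity: one shared marginal
        # (assumption: estimated by averaging over the d observers).
        psi = [sum(phi[g][k] for g in range(d)) / d for k in range(L)]
        phi = [psi] * d
    lam = gam = 0.0
    for j in product(range(L), repeat=d):
        expected = 1.0
        for g, k in enumerate(j):
            expected *= phi[g][k]           # product of marginals for this profile
        lam += weight(j) * pi.get(j, 0.0)   # observational agreement
        gam += weight(j) * expected         # agreement expected under the baseline
    return (lam - gam) / (1.0 - gam)


def all_agree(j):
    """Weight 1 when every observer chooses the same category, else 0."""
    return 1.0 if len(set(j)) == 1 else 0.0


# Hypothetical profile probabilities, for illustration only.
perfect = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
independent = {(a, b): pa * pb
               for a, pa in enumerate([0.3, 0.7])
               for b, pb in enumerate([0.6, 0.4])}
print(generalized_kappa(perfect, all_agree, d=3, L=2))
print(generalized_kappa(independent, all_agree, d=2, L=2))
```

Enumerating all $L^d$ profiles is only practical for small $d$ and $L$, a limitation the paper itself raises in its discussion.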

Within this framework, the specific hypotheses associated with questions (4)-(6) can now be formulated as follows:

(4) If there are no differences among the $s$ sub-populations with respect to the measures of overall specific agreement among the $d$ observers under $E$, then the $\{\kappa_{ih}\}$ satisfy the hypothesis

$$H_{SA|E}:\ \kappa_{1h} = \kappa_{2h} = \cdots = \kappa_{sh} \quad \text{for } h = 1, 2, \ldots, u, \qquad (3.9)$$

where SA denotes sub-population agreement.

(5) If the level of observed agreement is equal to that expected under $E$, then the $\{\kappa_{ih}\}$ satisfy the hypothesis

$$\kappa_{ih} = 0 \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u. \qquad (3.10)$$

(6) In some cases the weights for the kappa measures are chosen to be in a hierarchical relationship with each other in order to investigate specific disagreement patterns. In these situations, if the extent of disagreement is the same for the categories combined by the $(h + 1)$-st set of weights as for those combined by the $h$th set, then the $\{\kappa_{ih}\}$ satisfy the hypothesis

$$H_{HA|E}:\ \kappa_{ih} = \kappa_{i,h+1} \quad \text{for } i = 1, 2, \ldots, s;\ h = 1, 2, \ldots, u - 1, \qquad (3.11)$$

where HA denotes hierarchical agreement.

In order to maintain consistent nomenclature when describing the relative strength of agreement associated with kappa statistics, the following labels will be assigned to the corresponding ranges of kappa:


Kappa Statistic     Strength of Agreement
< 0.00              Poor
0.00-0.20           Slight
0.21-0.40           Fair
0.41-0.60           Moderate
0.61-0.80           Substantial
0.81-1.00           Almost Perfect

the discussion of the specific example in Table 1.
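The labelling convention above is easy to apply mechanically. The sketch below is illustrative: the function name is hypothetical, and because the printed ranges are given to two decimals, treating each upper limit as an inclusive cut-point is an assumption about values that fall between, say, 0.20 and 0.21.

```python
def strength_of_agreement(kappa):
    """Map a kappa value to the descriptive labels listed above."""
    if kappa < 0.0:
        return "Poor"
    # Assumption: each printed upper limit is treated as an inclusive cut-point.
    cut_points = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost Perfect"),
    ]
    for upper, label in cut_points:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


for value in (-0.10, 0.208, 0.297, 0.789):
    print(value, strength_of_agreement(value))
```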

3.3 Estimation and Hypothesis Testing

Test statistics for the hypotheses considered in the previous sections as well as estimators for corresponding model parameters can be obtained by using the general approach for the analysis of multivariate categorical data proposed by Grizzle, Starmer and Koch [1969] (hereafter abbreviated GSK) as outlined in Appendix 1 in Koch et al. [1977]. The hypotheses in Section 3.1 involving constraints on the first-order marginal probabilities can be tested by expressing the estimates of the $\{\phi_{igk}\}$ or the $\{\eta_{ig}\}$ as linear functions of the type given in Appendix 1 (A.14) in Koch et al. [1977]. These particular matrix expressions have already been discussed in considerable detail in Koch and Reinfurt [1971] and Koch et al. [1977], and thus they will not be elaborated here. Otherwise, their specific construction for these hypotheses in observer agreement studies is documented in Landis [1975].

In contrast to the linear functions which pertain to the hypotheses in Section 3.1, all the hypotheses involving generalized kappa-type measures require the expression of the ratio estimates of the $\{\kappa_{ih}\}$ as compounded logarithmic-exponential-linear functions of the observed proportions as formulated in Appendix 1 (A.20) in Koch et al. [1977]. As a result, the test statistics for the hypotheses in Section 3.2 can also be generated by the corresponding expression given in Appendix 1 (A.11) in Koch et al. [1977].

This section is concerned with the analysis of the multiple sclerosis data in Table 1 with primary emphasis given to illustrating the methodology in Section 3. Tests of significance are used in a descriptive context to identify important sources of variation as opposed to a rigorous inferential context; and thus issues pertaining to multiple comparisons are ignored here. These, however, can be handled by the Scheffé type procedures given in Grizzle, Starmer and Koch [1969]. The design for this example involves $s = 2$ sub-populations, $d = 2$ observers, and $L = 4$ response categories. Thus, there are $r = L^d = 16$ possible multivariate response profiles within each of the sub-populations.

4.1 Marginal Homogeneity Tests

The functions required to test the hypotheses involving marginal distributions can be generated in the formulation of (A.14) in Appendix 1 in Koch et al. [1977] with the function vector $F' = (F_1', F_2')$ where

$$F_2' = (0.116,\ 0.261,\ 0.319,\ 0.159,\ 0.420,\ 0.159),$$


Table 2

HIERARCHICAL WEIGHTS FOR AGREEMENT MEASURES

Weights                    w_1j         w_2j         w_3j         w_4j
                        Observer 2   Observer 2   Observer 2   Observer 2
              Class      1 2 3 4      1 2 3 4      1 2 3 4      1 2 3 4
                1        1 0 0 0      1 1 0 0      1 1 0 0      1 1 0 0
Observer 1      2        0 1 0 0      1 1 0 0      1 1 0 0      1 1 1 0
                3        0 0 1 0      0 0 1 0      0 0 1 1      0 1 1 1
                4        0 0 0 1      0 0 0 1      0 0 1 1      0 0 1 1

which contain the marginal proportions for diagnostic classes "1," "2" and "3" for the two observers within the two sub-populations. The test statistic for $H_{SM}$ is $Q_C = 46.37$ with d.f. = 6, which implies that there are significant ($\alpha = 0.01$) differences in the distributions of the observed response profiles between the Winnipeg and New Orleans patients. The tests of this hypothesis within each of the observers also indicate statistically significant ($\alpha = 0.01$) differences between the two sub-populations, although the Winnipeg neurologist represents the more dominant component. Similarly, the test statistic for $H_{CM}$ is $Q_C = 69.01$ with d.f. = 6, which implies that there are significant ($\alpha = 0.01$) differences in the response profiles between the two neurologists within each sub-population. Moreover, the dominant component of these observer differences is within the Winnipeg patient group. These results suggest that significant interobserver bias exists between the two neurologists in their overall usage of the diagnostic classification scale. In addition, the goodness-of-fit statistic for testing the interaction hypothesis $H_{AM}$ is $Q = 14.09$ with d.f. = 3. This significant ($\alpha = 0.01$) observer × sub-population interaction is consistent with the result that the observer differences are more substantial in the Winnipeg patient group ($Q_C = 58.47$) than in the New Orleans patient group ($Q_C = 10.54$).
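The entries of the function vector for the New Orleans patients can be reproduced from Table 1 as the row and column marginal proportions for classes 1 to 3. The sketch below covers only that step; it does not compute the $Q_C$ statistics, which require the weighted least squares machinery of GSK. The names are illustrative, and the reading of rows as the New Orleans neurologist and columns as the Winnipeg neurologist is inferred from the table layout.

```python
# New Orleans patients from Table 1:
# rows = New Orleans neurologist, columns = Winnipeg neurologist (inferred layout).
NEW_ORLEANS = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]


def marginal_function_vector(table):
    """Marginal proportions for classes 1-3 of each observer (class 4 is redundant)."""
    n = sum(sum(row) for row in table)
    row_props = [sum(row) / n for row in table]
    col_props = [sum(col) / n for col in zip(*table)]
    return row_props[:-1] + col_props[:-1]


F2 = [round(p, 3) for p in marginal_function_vector(NEW_ORLEANS)]
print(F2)
```

Dropping the last class for each observer removes the linear dependence that arises because each set of marginal proportions sums to one.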

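Table 3

DESCRIPTION OF HIERARCHICAL WEIGHTS

Weights   Disagreements Tolerated for Agreement Statistic
1         None.
2         Certain (1) with Probable (2).
3         Certain (1) with Probable (2);
          Possible (3) with Doubtful (4).
4         Certain (1) with Probable (2);
          Probable (2) with Possible (3);
          Possible (3) with Doubtful (4).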


Specific patterns of disagreement between the neurologists on the diagnostic classification of individual subjects can be investigated by selecting a hierarchy of weights which successively combine adjoining categories of diagnosis in order to create potentially less stringent reliability measures. For example, the four sets of weights in Table 2 can be used to investigate the sources of imprecise diagnostic criteria. As indicated in Table 3, these weights are chosen so that specific disagreement patterns are successively tolerated in the corresponding estimates of agreement. In particular, $w_{1j}$ represents the set of weights which generate the kappa measure of perfect agreement proposed in Cohen [1960].
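A direct way to see the effect of the hierarchy is to compute the kappa estimate for each set of weights in Table 2 from the counts in Table 1, using total independence as the baseline. This is a plain-arithmetic sketch, not the matrix formulation used in the paper. The names are illustrative, and the second row of the Winnipeg table is an assumption derived from the printed totals.

```python
# Table 1 counts: rows = New Orleans neurologist, columns = Winnipeg neurologist.
# Assumption: row 2 of WINNIPEG is derived from the printed row and column totals.
WINNIPEG = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
NEW_ORLEANS = [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]]

# Hierarchical weights of Table 2.
W = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],  # perfect agreement
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],  # merge 1 with 2
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]],  # also merge 3 with 4
    [[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]],  # tolerate all adjacent
]


def weighted_kappa(table, w):
    """Kappa with weights w, expected agreement from the product of the margins."""
    n = sum(sum(row) for row in table)
    size = len(table)
    row = [sum(r) / n for r in table]
    col = [sum(c) / n for c in zip(*table)]
    lam = sum(w[a][b] * table[a][b] / n for a in range(size) for b in range(size))
    gam = sum(w[a][b] * row[a] * col[b] for a in range(size) for b in range(size))
    return (lam - gam) / (1.0 - gam)


for name, table in [("Winnipeg", WINNIPEG), ("New Orleans", NEW_ORLEANS)]:
    print(name, [round(weighted_kappa(table, w), 3) for w in W])
```

Each successive set of weights tolerates more disagreement, so the estimates increase along the hierarchy within each patient group.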

The sequence of hierarchical kappa-type statistics within each of the two patient populations associated with the weights given in Table 2 can be expressed in the formulation (A.20) in Appendix 1 in Koch et al. [1977] under the baseline constraints of total independence $E_1$ in (3.5) by letting

[Matrix definitions (4.3) and (4.4): the entries of the coefficient matrices $A_1$, $A_2$ and $A_3$ are not recoverable from the source.]

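$\hat{\kappa}_{11} = 0.208$
$\hat{\kappa}_{12} = 0.328$
$\hat{\kappa}_{13} = 0.408$
$\hat{\kappa}_{21} = 0.297$
$\hat{\kappa}_{22} = 0.332$
$\hat{\kappa}_{23} = 0.386$
$\hat{\kappa}_{24} = 0.789$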

with the $h$th set of weights shown in Table 2. In addition, the estimated covariance matrix of $F$ is given by


0.2122   0.4005   0.3862   0.2912
0.1868   0.3862   0.5200   0.3832

0.6163   0.5582   0.5046   0.2185
0.5582   0.6879   0.6544   0.3010
0.5046   0.6544   1.0030   0.4147
0.2185   0.3010   0.4147   0.7720

These results indicate that all increases in successive agreement measures within the Winnipeg patient group are significant ($\alpha = 0.05$); but for the New Orleans patient group, the only significant ($\alpha = 0.05$) increase in agreement pertained to the final set of weights. Thus, the neurologists are exhibiting significant disagreement between diagnoses (1,2), (2,3) and (3,4) in the Winnipeg group and significant disagreement between diagnoses (2,3) in the New Orleans group, as evidenced by the inflated frequencies in these off-diagonal cells in Table 1. On the other hand, the estimates in (4.6) suggest that the hierarchical

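Table 4

STATISTICAL TESTS FOR HIERARCHICAL HYPOTHESES

Hypothesis                                              D.F.    Q_C
$\kappa_{12} = \kappa_{11}$; $\kappa_{22} = \kappa_{21}$    2      6.89*
$\kappa_{14} = \kappa_{13}$; $\kappa_{24} = \kappa_{23}$           28.13**
$\kappa_{13} = \kappa_{12}$                                 1      4.38*

** means significant at $\alpha = 0.01$.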


estimated variances of the kappa statistics are much larger for the New Orleans patient group (due to the smaller sample size), the agreement patterns may indeed be essentially the same in both patient groups.

If the two neurologists are indeed exhibiting the same agreement patterns with respect to the weights given in Table 2 within the two groups of patients, then under (3.5) the $\{\kappa_{ih}\}$ satisfy the following hypotheses from (3.9)

$$H_{SA|E_1}:\ \kappa_{1h} = \kappa_{2h} \quad \text{for } h = 1, 2, 3, 4. \qquad (4.8)$$

Test statistics for these hypotheses both individually and jointly are presented in Table 5.

The results in Tables 4 and 5 suggest that a reduced model can be used to combine parameters which are essentially equivalent. For this purpose, the agreement statistics in (4.6) can be modeled by

[Model (4.9): the entries of the design matrix X and the parameter vector are not recoverable from the source.]

is $Q = 2.27$ with d.f. = 3. Thus, this reduced model provides a satisfactory characterization of the variation among these agreement measures. Specific test statistics for the corresponding hypotheses in (3.10) and (3.11) pertaining to the model X in (4.9) are given in Table 6. These results suggest that all the parameters are significantly ($\alpha = 0.01$) different from zero, and moreover, are significantly ($\alpha = 0.05$) different from each other. Furthermore, by reducing the model to these smoothed estimates, the marginally significant ($\alpha = 0.10$) difference between $\kappa_{14}$ and $\kappa_{24}$ in Table 5 is now significant ($\alpha = 0.05$) for the

Table 5

STATISTICAL TESTS BETWEEN PATIENT SUB-POPULATIONS

Hypothesis                     D.F.    Q
$\kappa_{11} = \kappa_{21}$     1      0.90
$\kappa_{12} = \kappa_{22}$     1      0.00
$\kappa_{13} = \kappa_{23}$     1      0.03


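Table 6

STATISTICAL TESTS FOR MODEL X

Hypothesis                 D.F.    Q          Hypothesis         D.F.    Q
$\kappa_2 = \kappa_1$       1      5.40*      $\kappa_1 = 0$      1      31.05**
$\kappa_3 = \kappa_2$       1      4.92*      $\kappa_2 = 0$      1      40.71**
$\kappa_4 = \kappa_3$       1     12.33**     $\kappa_3 = 0$      1      45.49**
$\kappa_5 = \kappa_4$       1      4.88*      $\kappa_4 = 0$      1      72.44**
                                              $\kappa_5 = 0$      1      94.97**

** means significant at $\alpha = 0.01$.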

on the fitted model (4.9) are displayed in Table 7 together with their corresponding estimated standard errors.

Thus, these results suggest that the diagnostic criteria are not very distinct with respect to their usage by these two neurologists. In addition to bias at the macro stage, i.e., considering only the overall marginal proportions, these observers exhibited significant disagreement at the micro stage, i.e., considering each individual subject, in specifying a diagnosis. Only with respect to the relatively relaxed criterion corresponding to the fourth set of weights do the kappa statistics indicate a "moderate" to "substantial" level of interobserver reliability.

5. Discussion

In some applications, one may also be interested in a set of weights which assign varying degrees of partial credit to the off-diagonal cells depending on the extent of the disagreement, rather than successively combining adjoining categories as shown in Table 2. For

Table 7

SMOOTHED ESTIMATES OF AGREEMENT UNDER MODEL X

Sub-population 1 2

Weights Statistic Under X Standard Error Under X Standard Error


Table 8

ALTERNATIVE WEIGHTS FOR OVERALL AGREEMENT MEASURES

Weights                    w_1j              w_2j
                        Observer 2        Observer 2
              Class      1 2 3 4      1     2     3     4
                1        1 0 0 0      1    1/2   1/4    0
Observer 1      2        0 1 0 0     1/2    1    1/2   1/4
                3        0 0 1 0     1/4   1/2    1    1/2
                4        0 0 0 1      0    1/4   1/2    1

[1968], Fleiss, Cohen and Everitt [1969] and Cicchetti [1972], which were used to generate weighted kappa and C statistics. For the data in Table 1, these estimates are given by

$$F = (\hat{\kappa}_{11},\ \hat{\kappa}_{12},\ \hat{\kappa}_{21},\ \hat{\kappa}_{22})' = (0.208,\ 0.315,\ 0.297,\ 0.407)' \qquad (5.1)$$

where the $\{\hat{\kappa}_{i1}\}$ estimate the perfect agreement kappa measure and the $\{\hat{\kappa}_{i2}\}$ estimate the partial-credit weighted kappa agreement measure between the two neurologists in the two patient populations. A more extensive analysis of these data under the weights in Table 8 is given in Landis [1975] and Landis et al. [1976].
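The partial-credit estimates in (5.1) can be reproduced with a short calculation. The sketch below assumes that the second set of weights in Table 8 gives credit 1, 1/2, 1/4 and 0 to diagnoses that differ by 0, 1, 2 and 3 classes. That reading of the partly legible table is an assumption, although it does reproduce the reported values. The second Winnipeg row is likewise derived from the printed totals, and all names are illustrative.

```python
# Table 1 counts: rows = New Orleans neurologist, columns = Winnipeg neurologist.
# Assumption: row 2 of WINNIPEG is derived from the printed row and column totals.
WINNIPEG = [[38, 5, 0, 1], [33, 11, 3, 0], [10, 14, 5, 6], [3, 7, 3, 10]]
NEW_ORLEANS = [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]]

# Assumed partial credit by distance between the two diagnoses.
CREDIT = {0: 1.0, 1: 0.5, 2: 0.25, 3: 0.0}


def partial_credit_kappa(table, credit):
    """Weighted kappa with weights depending on |row class - column class|."""
    n = sum(sum(row) for row in table)
    size = len(table)
    row = [sum(r) / n for r in table]
    col = [sum(c) / n for c in zip(*table)]
    cells = [(a, b) for a in range(size) for b in range(size)]
    lam = sum(credit[abs(a - b)] * table[a][b] / n for a, b in cells)
    gam = sum(credit[abs(a - b)] * row[a] * col[b] for a, b in cells)
    return (lam - gam) / (1.0 - gam)


for name, table in [("Winnipeg", WINNIPEG), ("New Orleans", NEW_ORLEANS)]:
    print(name, round(partial_credit_kappa(table, CREDIT), 3))
```

Setting the credit to 1 at distance 0 and 0 elsewhere recovers the perfect agreement kappa, so the two measures in (5.1) differ only in the weights.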

Although the methodology for the assessment of observer agreement developed in this paper is quite general, these procedures have been illustrated with an example involving only two observers. However, for situations in which either the number of observers $d$ or the number of response categories $L$ is moderately large, the number of possible multivariate response profiles $r = L^d$ becomes extremely large. Consequently, the matrices required to implement the GSK procedures directly may be outside the scope of computational feasibility. In addition, for each of the $s$ sub-populations many of the $r$ possible response profiles will not necessarily be observed in the respective samples so that corresponding cell frequencies are zero. Thus, in such cases, specialized computing procedures are required to obtain the estimates of the pertinent functions.

One alternative approach for handling such very large contingency tables in which most of the observed cell frequencies are zero is discussed in Landis and Koch [1977]. In this regard, the same estimators which would need to be obtained from the conceptual multidimensional contingency table can be generated by first forming appropriate indicator variables of the raw data from each subject and then computing the across-subject arithmetic means. Subsequent to these preliminary steps, the usual matrix operations discussed in Appendix 1 in Koch et al. [1977] can then be applied to these indicator variable means to determine the required measures of observer agreement. These alternative computations involving raw data, as well as the extended GSK procedures summarized in Appendix 1 in Koch et al. [1977] can all be performed by a recently developed computer program (GENCAT) discussed in Landis et al. [1976].
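The idea of working from indicator-variable means rather than from the full $L^d$ table can be sketched for the perfect agreement measure under total independence. This is an illustration of the principle only and not a description of GENCAT. The function name is hypothetical, categories are coded from 0, and the raw records are rebuilt from the New Orleans counts in Table 1 purely to have data to run on.

```python
from math import prod


def kappa_from_raw(ratings, L):
    """Perfect-agreement kappa from per-subject records, without building the L**d table.

    ratings : list of tuples, one per subject, holding each observer's category (0-based)
    """
    n = len(ratings)
    d = len(ratings[0])
    # Mean of the indicator "all observers agree" estimates the observed agreement.
    lam = sum(1 for r in ratings if len(set(r)) == 1) / n
    # Means of the indicators "observer g chose category k" estimate the marginals.
    phi = [[sum(1 for r in ratings if r[g] == k) / n for k in range(L)]
           for g in range(d)]
    # Under total independence, chance agreement is a sum over categories only.
    gam = sum(prod(phi[g][k] for g in range(d)) for k in range(L))
    return (lam - gam) / (1.0 - gam)


# Raw records reconstructed from the New Orleans counts in Table 1
# (rows = New Orleans neurologist, columns = Winnipeg neurologist).
NEW_ORLEANS = [[5, 3, 0, 0], [3, 11, 4, 0], [2, 13, 3, 4], [1, 2, 4, 14]]
raw = [(a, b)
       for a, row in enumerate(NEW_ORLEANS)
       for b, count in enumerate(row)
       for _ in range(count)]
print(len(raw), round(kappa_from_raw(raw, L=4), 3))
```

The work grows with the number of subjects, observers and categories rather than with $L^d$, which is the practical point of the indicator-variable approach.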


Acknowledgments

This research was partially supported by Research Grants GM-00038-20 and GM-70004-05 from the National Institute of General Medical Sciences and by the U.S. Bureau of the Census through Joint Statistical Agreements JSA 74-2 and JSA 75-2. The authors would like to thank the referees for their helpful comments on an earlier draft of this paper. In addition, the authors are grateful to Ms. Rebecca Wesson and Ms. Lynn Wilkinson for their conscientious typing of previous drafts of this paper, and to Ms. Linda L. Blakley and Ms. Connie Massey for their efficient typing of the final version of this manuscript.

The Measurement of Agreement Between Observations for Categorical Data

Summary

The article sets out a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure relies mainly on the construction of functions of the observed proportions that express the agreement of the observers among themselves, and on the construction of test statistics for hypotheses involving these functions. Tests for bias between observers are presented in terms of first-order marginal homogeneity, and measures of agreement between observers are constructed as statistics generalizing those of the kappa type. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.

References

Anderson, R. L. and Bancroft, T. A. [1952]. Statistical Theory in Research. McGraw-Hill, New York.

Bhapkar, V. P. [1966]. A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association 61, 228-235.

Bhapkar, V. P. [1968]. On the analysis of contingency tables with a quantitative response. Biometrics 24, 329-338.

Bhapkar, V. P. and Koch, G. G. [1968a]. Hypotheses of "no interaction" in multidimensional contingency tables. Technometrics 10, 107-123.

Bhapkar, V. P. and Koch, G. G. [1968b]. On the hypotheses of "no interaction" in contingency tables. Biometrics 24, 567-594.

Cicchetti, D. V. [1972]. A new measure of agreement between rank-ordered variables. Proceedings, 80th Annual Convention, APA, 17-18.

Cohen, J. [1960]. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37-46.

Cohen, J. [1968]. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70, 213-220.

Fleiss, J. L. [1966]. Assessing the accuracy of multivariate observations. Journal of the American Statistical Association 61, 403-412.

Fleiss, J. L., Cohen, J. and Everitt, B. S. [1969]. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 72, 323-337.

Fleiss, J. L. [1971]. Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 378-382.

Fleiss, J. L. and Cohen, J. [1973]. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33, 613-619.

Fleiss, J. L. [1975]. Measuring agreement between two judges on the presence or absence of a trait. Biometrics 31, 651-659.

Forthofer, R. N. and Koch, G. G. [1973]. An analysis for compounded functions of categorical data. Biometrics 29, 143-157.

Grizzle, J. E., Starmer, C. F. and Koch, G. G. [1969]. Analysis of categorical data by linear models. Biometrics 25, 489-504.


Journal of the American Statistical Association 43, 243-264.

Grubbs, F. E. [1973]. Errors of measurement, precision, accuracy and the statistical comparison of measuring instruments. Technometrics 15, 53-66.

Goodman, L. A. and Kruskal, W. H. [1954]. Measures of association for cross classification. Journal of the American Statistical Association 49, 732-764.

Koch, G. G. [1967]. A general approach to the estimation of variance components. Technometrics 9, 93-118.

Koch, G. G. [1968]. Some further remarks concerning "A general approach to the estimation of variance components." Technometrics 10, 551-558.

Koch, G. G. and Reinfurt, D. W. [1971]. The analysis of categorical data from mixed models. Biometrics 27, 157-173.

Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H., Jr. and Lehnen, R. G. [1977]. A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics 33, 133-158.

Landis, J. R. [1975]. A general methodology for the measurement of observer agreement when the data are categorical. Ph.D. Dissertation, University of North Carolina, Institute of Statistics Mimeo Series No. 1022.

Landis, J. R. and Koch, G. G. [1975a]. A review of statistical methods in the analysis of data arising from observer reliability studies (Part I). Statistica Neerlandica 29, 101-123.

Landis, J. R. and Koch, G. G. [1975b]. A review of statistical methods in the analysis of data arising from observer reliability studies (Part II). Statistica Neerlandica 29, 151-161.

Landis, J. R. and Koch, G. G. [1977]. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Accepted for publication in Biometrics.

Landis, J. R., Stanish, W. M., Freeman, J. L. and Koch, G. G. [1976]. A computer program for the generalized chi-square analysis of categorical data using weighted least squares (GENCAT). University of Michigan Biostatistics Technical Report No. 8. Accepted for publication in Computer Programs in Biomedicine.

Light, R. J. [1971]. Measures of response agreement for qualitative data: some generalizations and alternatives. Psychological Bulletin 76, 365-377.

Loewenson, R. B., Bearman, J. E. and Resch, J. A. [1972]. Reliability of measurements for studies of cardiovascular atherosclerosis. Biometrics 28, 557-569.

Mandel, J. [1959]. The measuring process. Technometrics 1, 251-267.

Neyman, J. [1949]. Contribution to the theory of the χ² test. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Berkeley and Los Angeles, University of California Press, 239-272.

Overall, J. E. [1968]. Estimating individual rater reliabilities from analysis of treatment effects. Educational and Psychological Measurement 28, 255-264.

Scheffé, H. [1959]. The Analysis of Variance. Wiley, New York.

Searle, S. R. [1971]. Linear Models. Wiley, New York.

Wald, A. [1943]. Tests of statistical hypotheses concerning general parameters when the number of observations is large. Transactions of the American Mathematical Society 54, 426-482.

Westlund, K. B. and Kurland, L. T. [1953]. Studies on multiple sclerosis in Winnipeg, Manitoba and New Orleans, Louisiana. American Journal of Hygiene 57, 380-396.

Received April 1975, Revised November 1975
