
STATISTICS IN MEDICINE

Statist. Med. 19, 373-387 (2000)

TESTING THE EQUALITY OF TWO DEPENDENT KAPPA STATISTICS

ALLAN DONNER1*, MOHAMED M. SHOUKRI2, NEIL KLAR3 AND EMMA BARTFAY1


1 Department of Epidemiology and Biostatistics, The University of Western Ontario, London, Ontario, N6A 5C1, Canada
2 Department of Population Medicine, University of Guelph, Guelph, Ontario, N1G 2W1, Canada
3 Dana-Farber Cancer Institute, Department of Biostatistical Science, 44 Binney Street, Boston, MA 02115, U.S.A.

SUMMARY
Procedures are developed and compared for testing the equality of two dependent kappa statistics in the
case of two raters and a dichotomous outcome variable. Such problems may arise when each subject in
a sample is rated under two distinct settings, and it is of interest to compare the observed levels of
inter-observer and intra-observer agreement. The procedures compared are extensions of previously
developed procedures for comparing kappa statistics computed from independent samples. The results of
a Monte Carlo simulation show that adjusting for the dependency between samples tends to be worthwhile
only if the between-setting correlation is comparable in magnitude to the within-setting correlations. In this
case, a goodness-of-fit procedure that takes into account the dependency between samples is recommended.
Copyright © 2000 John Wiley & Sons, Ltd.

1. INTRODUCTION
The need to compare two or more coefficients of inter-observer agreement arises often in studies
of reliability. In some studies, as pointed out by Alsawalmeh and Feldt,1 it is natural to conduct
such comparisons using two independent groups of subjects. This would be the case, for example,
when preliminary reliability studies have been conducted in each of several centres that have
agreed to participate in a multi-centre clinical trial. Independent groups of subjects would also be
compared when assessing control-informant agreement on exposure history in different case-
control studies (for example, Korten et al.2). In other studies, however, the comparison of interest
is naturally conducted using the same set of subjects. For example, Browman et al.3 report on
a study in which four readers, two radiologists and two clinical haematologists, independently
assessed the radiographic vertebral index (VRI) on 40 radiographs from patients with myeloma.
One purpose of this investigation was to determine how coefficients measuring inter-observer
agreement varied according to expertise in radiologic diagnosis. This led to a comparison of two

* Correspondence to: Allan Donner, Department of Epidemiology and Biostatistics, Kresge Building, The University of
Western Ontario, London, Ontario, Canada N6A 5C1
Contract/grant sponsor: Natural Sciences and Engineering Research Council of Canada
Contract/grant sponsor: Harvard School of Public Health
Contract/grant sponsor: Schering-Plough Corporation

Received August 1998; Accepted April 1999



such coefficients, one for the radiologists and one for the non-radiologists, with each coefficient
computed from data collected on the same set of 40 patients. There are many other similar
examples in which it is the aim of the investigators to compare levels of inter-observer agreement
across different settings, but with each setting involving the same subjects. Provided it is feasible
to do so, it is also clear that using the same sample of subjects rather than two different samples
should lead to a more efficient comparison.
Similar problems arise when it is of interest to compare coefficients of intra-observer variability.
For example, Baker et al.4 report on a study in which each of two pathologists assessed 27
patients with respect to the presence or absence of dysplasia. Each assessment was performed in
duplicate, providing an opportunity to investigate whether the two pathologists showed comparable
levels of within-observer reproducibility. This comparison again suggests a test of equality
between two dependent kappa statistics, where each statistic may be regarded as an index of
reproducibility. Thus, although the discussion below will be framed in terms of the comparison of
coefficients of inter-observer agreement, the results and conclusions obtained will be applicable to
a wider class of comparisons involving dependent kappa statistics.
For the case of a continuous outcome variable, the intraclass correlation coefficient5,6 is
frequently used as a measure of inter-observer agreement. Tests for comparing two independent
intraclass correlations have been reviewed by Kraemer,7 who showed that a procedure due to
Feldt,8 a procedure based on Fisher's Z-transformation, and a likelihood ratio test derived by
Kraemer9 will all produce essentially the same results for large sample sizes.
For the case of a dichotomous outcome variable, the kappa statistic is the most frequently used
measure of inter-observer agreement. As pointed out by Bloch and Kraemer,10 the version of the
kappa statistic selected should depend on the population model underlying the study in question.
One commonly used model allows the marginal probabilities of success associated with the
observers (raters) to differ. This model, which allows for the estimation of 'rater effects',
leads to Cohen's kappa, the most frequently adopted version of kappa in practice.11 An
alternative model, discussed by Bloch and Kraemer,10 Hale and Fleiss12 and Dunn,13 assumes
that each rater may be characterized by the same underlying success rate. This model, referred to
here as the common correlation model, leads to the intraclass version of the kappa statistic,
obtained as the usual intraclass correlation coefficient calculated from a one-way analysis of
variance. As pointed out by Landis and Koch,14 this version of the kappa statistic, algebraically
equivalent to Scott's index of agreement,15 is most appropriate when interest focuses on the
reliability of the measurement process itself rather than on potential differences among raters. An
interesting discussion of the properties and interpretation of these two versions of the kappa
statistic is given by Zwick.16
Procedures for testing the homogeneity of two or more independent kappa statistics for the
case of two raters and a dichotomous outcome variable under the common correlation model
have been considered by Donner et al.17 Using Monte Carlo simulation, these investigators
compared two hypothesis-testing approaches with respect to type I and type II error properties.
The first approach was based on a simple comparison of the kappa statistic to its estimated large
sample standard error, regarding this ratio as an approximate standard normal deviate. The
second approach was based on a goodness-of-fit procedure18 as applied to the common correlation
model. The results of this comparison showed that the two approaches have similar
properties provided the number of subjects in each sample is large (>100) and the prevalence of
the underlying trait of interest is not extreme, while the goodness-of-fit approach is to be preferred
for comparisons involving smaller numbers of subjects, and those for which the prevalence of the

underlying trait is small (<0.3). Sample size requirements for the comparison of two or more
inter-observer agreement coefficients have been given for both the continuous and dichotomous
case by Donner.19
In this paper, we deal with the problem of comparing two intraclass kappa statistics as
computed over the same sample of subjects; that is, we relax the assumption of independent
samples. Relatively little research has been reported on this topic. As a consequence, many such
studies of inter- and intra-observer agreement are reduced to the reporting of descriptive
comparisons only (for example, Workum et al.20 and Hill et al.21). This problem was noted by
McKenzie et al.,22 who remarked that 'methods for the comparison of correlated kappa coefficients
obtained from the same sample do not appear in the literature'. These authors subsequently
developed a testing procedure which uses computationally intensive resampling methods. Attention
was limited to tests for pairwise equality among kappa coefficients in studies where each
subject is assessed by three raters. Other research on this problem has been reported by
Williamson and Manatunga.23 These authors described an estimating equations approach for the
analysis of ordinal data with the underlying assumption of a latent bivariate normal variable.
This approach, like that of McKenzie et al.,22 is computationally intensive. We focus in this paper
on the development and comparison of model-based procedures for testing the equality of two
dependent kappa statistics. Attention is restricted to extensions of those methods previously
developed for comparing kappa statistics from independent samples, including Wald tests and
goodness-of-fit procedures. The extensions are based on accounting for the covariance between
the two kappa statistics, as derived under a generalized common correlation model.

2. THE MODEL
We assume that each of N subjects is rated under two settings, where each setting involves each of
two observers assigning a binary rating to each subject. The probability model we develop can be
characterised by the parameters κ_j, j = 1, 2, and κ_B, where κ_j measures the level of inter-observer
agreement under setting j, and κ_B measures the expected level of inter-observer agreement
between any two ratings selected on the same subject from the different settings. The primary
focus of our investigation is on testing the hypothesis H_0: κ_1 = κ_2, where κ_B is primarily regarded
as a nuisance parameter.

The model is developed by allowing for two levels of nesting, the first for settings within
subjects and the second for observations (ratings) within settings. Following the approach of
Rosner,24,25 let X_ijk = 1 (0) denote the binary assignment of the ith subject under the jth setting
for rater k as a success (failure), i = 1, 2, ..., N; j = 1, 2; k = 1, 2. Furthermore let π denote the
marginal probability that an observation is recorded as a success across all subjects in the
population, and let P_i denote the probability that an observation for the ith subject is recorded as
a success as averaged over both settings. A mechanism for introducing the correlation between
ratings obtained in different settings is to assume that, conditional on π, the distribution of
P_i among all subjects is modelled as a beta distribution with parameters a and b, that is,
P_i | π ~ f(P_i) = beta(a, b) with π = a/(a + b), for a + b > 0. Furthermore let P_ij denote the
probability that an observation is recorded as a success by the ith subject under the jth setting. We
introduce the between-setting correlation by assuming that, conditional on P_i, the distribution of
P_ij is modelled as f(P_ij) ~ beta(λ_j P_i, λ_j(1 − P_i)), i = 1, 2, ..., N; j = 1, 2. Let κ_Cj = corr(X_ij1,
X_ij2 | P_i) = (1 + λ_j)^{-1}, where κ_Cj may be interpreted as the conditional within-setting correlation,
and let κ_B = corr(X_ijk, X_ij'k') = (1 + a + b)^{-1}, j ≠ j', denote the between-setting correlation.

Then, as shown in Appendix I, these assumptions lead to the following probabilities for the joint
distribution of the ratings (X_i11, X_i12) taken under setting 1:

    P_1(κ_C1) = Pr(X_i11 = 0, X_i12 = 0)
              = (1 − π)²(1 − κ_B) + κ_C1(1 − κ_B)π(1 − π) + κ_B(1 − π)

    P_2(κ_C1) = Pr(X_i11 = 1, X_i12 = 0 or X_i11 = 0, X_i12 = 1)
              = 2(1 − κ_C1)(1 − κ_B)π(1 − π)

    P_3(κ_C1) = Pr(X_i11 = 1, X_i12 = 1)
              = π²(1 − κ_B) + κ_C1(1 − κ_B)π(1 − π) + κ_B π.

We denote this model by M_1. The joint distribution of the ratings (X_i21, X_i22) taken under setting
2 has corresponding probabilities P_l(κ_C2), l = 1, 2, 3, which are identical to the P_l(κ_C1) when κ_C1 is
replaced by κ_C2. We denote this model by M_2. When κ_B = 0 the models M_1 and M_2 reduce to the
common correlation model10,18 as applied to each setting separately, with correlation parameters
κ_C1 and κ_C2, respectively. We therefore refer to the above model as the 'generalized
common correlation model'. We can further show that

    κ_j = corr(X_ij1, X_ij2) = κ_B + κ_Cj(1 − κ_B),  j = 1, 2.    (1)

Substituting

    κ_Cj = (κ_j − κ_B) / (1 − κ_B)    (2)

in M_1 we obtain

    P_1(κ_1) = (1 − π)² + κ_1 π(1 − π)
    P_2(κ_1) = 2(1 − κ_1) π(1 − π)
    P_3(κ_1) = π² + κ_1 π(1 − π)

for model M_1, and similar equations for M_2, with κ_2 replacing κ_1. Equation (2) shows that
κ_Cj may be interpreted as the degree of within-setting correlation above that predicted by the
value of κ_B. That is, it measures the reliability of ratings taken in the jth setting, after adjusting for
the correlation between settings. The three probability equations corresponding to each condition
are also seen to be those of the common correlation model, with parameters κ_1 and κ_2,
respectively.
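
The two-stage beta structure above is straightforward to simulate, which is useful both for checking relation (1) numerically and for the Monte Carlo study of Section 4. The following sketch is ours rather than the authors' code; the function name and arguments are illustrative, and it assumes 0 < κ_B ≤ κ_j < 1 so that the beta parameters of Appendix I are finite.

    import numpy as np

    def simulate_ratings(N, pi, kappa1, kappa2, kappa_B, rng=None):
        """Draw binary ratings X[i, j, k] (subject i, setting j, rater k) from the
        generalized common correlation model: P_i ~ beta(a, b) with pi = a/(a+b) and
        kappa_B = 1/(1+a+b); P_ij | P_i ~ beta(lambda_j*P_i, lambda_j*(1-P_i)) with
        kappa_Cj = 1/(1+lambda_j); ratings are conditionally independent Bernoulli(P_ij)."""
        rng = np.random.default_rng(rng)
        a = pi * (1.0 - kappa_B) / kappa_B            # Appendix I parameterization
        b = (1.0 - pi) * (1.0 - kappa_B) / kappa_B
        # Conditional within-setting correlations, equation (2).
        kappa_C = [(k - kappa_B) / (1.0 - kappa_B) for k in (kappa1, kappa2)]
        P_i = rng.beta(a, b, size=N)                  # subject-level success probabilities
        X = np.empty((N, 2, 2), dtype=int)
        for j in range(2):
            if kappa_C[j] > 0:
                lam = (1.0 - kappa_C[j]) / kappa_C[j]
                P_ij = rng.beta(lam * P_i, lam * (1.0 - P_i))
            else:
                P_ij = P_i                            # kappa_j = kappa_B: no extra setting effect
            X[:, j, 0] = rng.binomial(1, P_ij)
            X[:, j, 1] = rng.binomial(1, P_ij)
        return X

For large N, the sample correlation of X[:, j, 0] and X[:, j, 1] should be close to κ_j of equation (1), while the correlation of ratings drawn from different settings should be close to κ_B.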

3. THE GOODNESS-OF-FIT APPROACH


Table I(a) shows the marginal rating frequencies for the N subjects under each of the two settings.
Given that the frequencies n_1j, n_2j, n_3j follow a multinomial distribution conditional on N,
estimated probabilities for setting j under H_0: κ_1 = κ_2 = κ are given by P̂_1(κ̂), P̂_2(κ̂), P̂_3(κ̂), where
we obtain the P̂_l(κ̂), l = 1, 2, 3, by replacing κ_j by an estimate of the common value κ computed
under H_0 in P_l(κ_j), l = 1, 2, 3. Table I(b) shows the corresponding joint rating frequencies V_ij and
underlying probabilities θ_ij. These quantities are needed to estimate the degree of dependence
between the two samples.


Table I(a). Marginal rating frequencies (n_lj) for N subjects under two settings

Category   Probability   Ratings            Frequency of subjects      Total
                                             Setting 1    Setting 2
1          P_1(κ_j)      (0, 0)             n_11         n_12          n_1.
2          P_2(κ_j)      (1, 0) or (0, 1)   n_21         n_22          n_2.
3          P_3(κ_j)      (1, 1)             n_31         n_32          n_3.
Total                                        N            N             2N

Table I(b). Joint rating frequencies (V_ij) and underlying probabilities (θ_ij)

Setting 2 ratings     Setting 1 ratings                                          Total
                      (0, 0)           (1, 0) or (0, 1)   (1, 1)
(0, 0)                V_11 (θ_11)      V_12 (θ_12)        V_13 (θ_13)            n_12   P_1(κ_2)
(1, 0) or (0, 1)      V_21 (θ_21)      V_22 (θ_22)        V_23 (θ_23)            n_22   P_2(κ_2)
(1, 1)                V_31 (θ_31)      V_32 (θ_32)        V_33 (θ_33)            n_32   P_3(κ_2)
Total                 n_11  P_1(κ_1)   n_21  P_2(κ_1)     n_31  P_3(κ_1)         N
In the special case κ_B = 0, H_0: κ_1 = κ_2 is equivalent to testing the equality of two kappa
statistics computed from independent samples. In this case, the maximum likelihood estimator of
κ_j is given10 by

    κ̂_j = 1 − n_2j / [2N π̂_j(1 − π̂_j)],    (3)

where π̂_j = (2n_3j + n_2j)/(2N), with large sample variance derived as

    var(κ̂_j) = [(1 − κ_j)/N] [(1 − κ_j)(1 − 2κ_j) + κ_j(2 − κ_j)/(2π_j(1 − π_j))],  j = 1, 2.    (4)

An overall measure of agreement under H_0 is then given by

    κ̂ = [ Σ_{j=1}^{2} κ̂_j π̂_j(1 − π̂_j) ] / [ Σ_{j=1}^{2} π̂_j(1 − π̂_j) ],    (5)

where κ̂ is identical to the overall measure of agreement proposed by Fleiss26 and by Landis and
Koch.14
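
Estimators (3)-(5) translate directly into code. The sketch below is ours and uses the notation above; the helper names are illustrative only.

    def kappa_hat(n1, n2, n3):
        """Intraclass kappa (3) for one setting, from the marginal counts of Table I(a):
        n1 subjects rated (0,0), n2 rated (1,0) or (0,1), n3 rated (1,1).
        Returns (kappa_hat, pi_hat); undefined when pi_hat equals 0 or 1."""
        N = n1 + n2 + n3
        pi_hat = (2 * n3 + n2) / (2 * N)
        return 1.0 - n2 / (2 * N * pi_hat * (1.0 - pi_hat)), pi_hat

    def var_kappa_hat(kappa, pi, N):
        """Large-sample variance (4) of the intraclass kappa."""
        return ((1.0 - kappa) / N) * ((1.0 - kappa) * (1.0 - 2.0 * kappa)
                                      + kappa * (2.0 - kappa) / (2.0 * pi * (1.0 - pi)))

    def pooled_kappa(k1, pi1, k2, pi2):
        """Overall agreement (5) under H0: a pi_j*(1 - pi_j)-weighted average of the kappas."""
        w1, w2 = pi1 * (1.0 - pi1), pi2 * (1.0 - pi2)
        return (k1 * w1 + k2 * w2) / (w1 + w2)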
The goodness-of-fit statistic for testing H_0 in the case of independent samples is then given17 by

    χ²_G = Σ_{j=1}^{2} Σ_{l=1}^{3} [n_lj − N P̂_l(κ̂)]² / [N P̂_l(κ̂)],    (6)

where we obtain P̂_l(κ̂) by replacing π by π̂_j and κ_j by κ̂ in P_l(κ_j), l = 1, 2, 3; j = 1, 2. Under H_0,
χ²_G follows an approximate chi-square distribution with one degree of freedom.

This statistic would be expected to yield conservative results (that is, type I error lower than
nominal) when applied to a comparison involving dependent samples. However, the statistic
χ²_G may be extended to this case by considering ĉorr(κ̂_1, κ̂_2), the estimated correlation between κ̂_1
and κ̂_2, as derived in Appendix II. Then, since var(κ̂_1) = var(κ̂_2) under H_0: κ_1 = κ_2, an approximate
statistic for testing H_0 in the case of dependent samples is obtained by referring

    χ²_GD = χ²_G / [1 − ĉorr(κ̂_1, κ̂_2)]

to tables of the chi-square distribution with one degree of freedom.

Letting ĉov(κ̂_1, κ̂_2) denote the estimated covariance between κ̂_1 and κ̂_2 as given in Appendix II,
an alternative test procedure can be constructed by computing the Wald statistic

    Z_VD = (κ̂_1 − κ̂_2) / [v̂ar(κ̂_1) + v̂ar(κ̂_2) − 2 ĉov(κ̂_1, κ̂_2)]^{1/2}    (7)

and referring Z_VD to tables of the standard normal distribution. Setting ĉov(κ̂_1, κ̂_2) = 0 in this
expression gives a version of the test statistic compared to χ²_G by Donner et al.17 in the case of two
independent samples (κ_B = 0). We denote this statistic by Z_V.

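The complete testing procedure can be sketched as follows. This is our illustration rather than the authors' implementation; it assumes the estimator helpers sketched after equation (5) and takes the estimated covariance of the two kappa statistics (Appendix II) as an input.

    import numpy as np
    from scipy.stats import chi2, norm

    def P_l(kappa, pi):
        """Category probabilities P_1, P_2, P_3 of the common correlation model."""
        return np.array([(1 - pi) ** 2 + kappa * pi * (1 - pi),
                         2 * (1 - kappa) * pi * (1 - pi),
                         pi ** 2 + kappa * pi * (1 - pi)])

    def tests(marg1, marg2, cov12):
        """Unadjusted and adjusted tests of H0: kappa_1 = kappa_2.
        marg_j = (n_1j, n_2j, n_3j) are the Table I(a) counts for setting j;
        cov12 is the estimated covariance of the two kappa statistics (Appendix II)."""
        k1, pi1 = kappa_hat(*marg1)
        k2, pi2 = kappa_hat(*marg2)
        N = sum(marg1)
        v1, v2 = var_kappa_hat(k1, pi1, N), var_kappa_hat(k2, pi2, N)
        k0 = pooled_kappa(k1, pi1, k2, pi2)
        # Goodness-of-fit statistic (6): expected counts use pi_hat_j and the pooled kappa.
        g = sum(np.sum((np.array(m) - N * P_l(k0, pi)) ** 2 / (N * P_l(k0, pi)))
                for m, pi in ((marg1, pi1), (marg2, pi2)))
        corr12 = cov12 / np.sqrt(v1 * v2)
        g_adj = g / (1.0 - corr12)                        # chi-square_GD
        z_v = (k1 - k2) / np.sqrt(v1 + v2)                # Z_V (independence assumed)
        z_vd = (k1 - k2) / np.sqrt(v1 + v2 - 2 * cov12)   # Z_VD, equation (7)
        return {"chi2_G": (g, chi2.sf(g, 1)),
                "chi2_GD": (g_adj, chi2.sf(g_adj, 1)),
                "Z_V": (z_v, 2 * norm.sf(abs(z_v))),
                "Z_VD": (z_vd, 2 * norm.sf(abs(z_vd)))}
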
4. MONTE CARLO STUDY


In order to study the properties of the test statistics χ²_GD and Z_VD, a Monte Carlo study was
carried out, generating the observations from the generalized common correlation model. The
test statistics χ²_G and Z_V, expected to be conservative for κ_B > 0, were also included in this
comparison.
None of the four procedures can be computed if π̂_j = 0 or 1 for any j, because then κ̂_j is
undefined. Following Bloch and Kraemer,10 these iterations were therefore replaced until a total
of 1000 iterations were obtained for each parameter combination. Iterations having κ̂_j = 1,
j = 1, 2, were not replaced, but were regarded as not having rejected H_0. Although corr(κ̂_1, κ̂_2) is
non-negative, ĉorr(κ̂_1, κ̂_2) may be calculated as negative in any given sample, particularly when
κ_B is small. For these occurrences we did not choose to truncate the resulting value at zero.
However, when the simulation study was repeated under a policy of truncating ĉorr(κ̂_1, κ̂_2) at
zero, the results were virtually unchanged.
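
As an illustration of how each simulated data set can be reduced to the required counts and screened for undefined kappas, the following sketch (ours; names are illustrative) collapses a ratings array such as that produced by the generator sketched in Section 2 into the counts of Tables I(a) and I(b):

    import numpy as np

    def tabulate(X):
        """Collapse ratings X[i, j, k] into the per-setting marginal counts of Table I(a)
        and the 3 x 3 joint table V of Table I(b); category = number of successes."""
        cat = X.sum(axis=2)                                   # N x 2: category per setting
        marg = [np.bincount(cat[:, j], minlength=3) for j in (0, 1)]
        V = np.zeros((3, 3), dtype=int)                       # rows: setting 2, columns: setting 1
        for c1, c2 in cat:
            V[c2, c1] += 1
        return marg, V

    def degenerate(marg, N):
        """True when pi_hat_j is 0 or 1 for either setting, so kappa_hat_j is undefined
        and the iteration is replaced, as described above."""
        return any((2 * m[2] + m[1]) in (0, 2 * N) for m in marg)

The empirical type I error or power is then the proportion of retained data sets for which the statistics of Section 3 exceed their nominal critical values.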
Table II compares the type I error properties of χ²_GD, Z_VD, χ²_G and Z_V for testing
H_0: κ_1 = κ_2 = κ at α = 0.05 (two-sided) with N = 200, 100, 50, 25 for various values of κ, π and
κ_B. The main conclusion from these results is that the unadjusted tests are overly conservative
(using an arbitrary definition in which the empirical type I error is observed to be less than 0.03)
only when κ_B is equal in magnitude to κ, that is, when the 'between-setting correlation' is equal to
the null value of the 'within-setting correlation'. Otherwise the observed type I errors for both
χ²_G and Z_V tend to be close to nominal, notwithstanding the lack of independence between κ̂_1 and
κ̂_2. In fact, if κ_B is much less than κ, then adjustment of either Z_V or χ²_G can lead to inflated type
I errors, particularly in small samples, where the empirically estimated adjustment factor lacks
stability. However, the unadjusted statistic Z_V is also frequently too liberal in small samples
(N = 50 or 25), particularly when the prevalence parameter π is small (<0.3). This is consistent
with earlier results reported by Donner et al.17 in their comparison of procedures for testing the
equality of independent kappa statistics. These results led to these authors' conclusion that χ²_G,
based on goodness-of-fit theory, is to be generally preferred to the Wald statistic based on large

Table II. Type I errors for testing H_0: κ_1 = κ_2 = κ at α = 0.05 (two-sided)

                      κ_1 = κ_2 = 0.4       κ_1 = κ_2 = 0.6               κ_1 = κ_2 = 0.8
π    Statistic  κ_B = 0.1    0.4            0.1    0.4    0.6             0.3    0.6    0.8

(a) N = 200
0.1  χ²_G             0.056  0.038          0.060  0.064  0.019           0.066  0.043  0.021
     Z_V              0.057  0.040          0.058  0.063  0.018           0.062  0.043  0.019
     χ²_GD            0.056  0.058          0.057  0.075  0.047           0.072  0.059  0.069
     Z_VD             0.059  0.060          0.056  0.073  0.042           0.066  0.054  0.056
0.3  χ²_G             0.046  0.034          0.061  0.050  0.032           0.061  0.040  0.021
     Z_V              0.046  0.034          0.060  0.049  0.032           0.061  0.041  0.021
     χ²_GD            0.048  0.049          0.064  0.059  0.061           0.063  0.053  0.065
     Z_VD             0.048  0.049          0.064  0.059  0.069           0.064  0.052  0.064
0.5  χ²_G             0.053  0.042          0.057  0.034  0.029           0.050  0.029  0.026
     Z_V              0.057  0.042          0.057  0.034  0.029           0.051  0.031  0.027
     χ²_GD            0.053  0.058          0.057  0.049  0.051           0.055  0.041  0.046
     Z_VD             0.054  0.060          0.057  0.049  0.051           0.055  0.041  0.046

(b) N = 100
0.1  χ²_G             0.072  0.041          0.066  0.050  0.037           0.054  0.039  0.020
     Z_V              0.088  0.053          0.072  0.058  0.040           0.050  0.043  0.020
     χ²_GD            0.071  0.068          0.068  0.074  0.070           0.091  0.098  0.085
     Z_VD             0.088  0.072          0.075  0.075  0.070           0.054  0.055  0.077
0.3  χ²_G             0.064  0.034          0.048  0.048  0.029           0.045  0.033  0.023
     Z_V              0.069  0.036          0.049  0.049  0.029           0.046  0.034  0.025
     χ²_GD            0.074  0.054          0.052  0.057  0.048           0.054  0.051  0.058
     Z_VD             0.075  0.055          0.053  0.058  0.047           0.057  0.052  0.051
0.5  χ²_G             0.060  0.023          0.047  0.048  0.031           0.045  0.039  0.034
     Z_V              0.064  0.026          0.048  0.049  0.032           0.046  0.041  0.064
     χ²_GD            0.063  0.042          0.050  0.058  0.058           0.047  0.054  0.083
     Z_VD             0.064  0.043          0.052  0.058  0.059           0.049  0.052  0.047

(c) N = 50
0.1  χ²_G             0.061  0.034          0.077  0.059  0.039           0.060  0.030  0.022
     Z_V              0.123  0.064          0.115  0.100  0.062           0.078  0.044  0.030
     χ²_GD            0.092  0.083          0.119  0.110  0.116           0.305  0.278  0.262
     Z_VD             0.120  0.077          0.123  0.109  0.089           0.081  0.054  0.059
0.3  χ²_G             0.055  0.032          0.049  0.050  0.035           0.061  0.041  0.020
     Z_V              0.066  0.036          0.056  0.055  0.039           0.069  0.051  0.022
     χ²_GD            0.058  0.049          0.055  0.065  0.062           0.091  0.075  0.072
     Z_VD             0.069  0.050          0.057  0.071  0.063           0.080  0.065  0.055
0.5  χ²_G             0.053  0.040          0.047  0.030  0.032           0.035  0.048  0.027
     Z_V              0.062  0.045          0.049  0.036  0.038           0.040  0.054  0.030
     χ²_GD            0.058  0.053          0.056  0.045  0.067           0.050  0.065  0.071
     Z_VD             0.062  0.055          0.058  0.050  0.068           0.052  0.069  0.067

(d) N = 25
0.1  χ²_G             0.053  0.033          0.058  0.052  0.027           0.052  0.037  0.009
     Z_V              0.210  0.160          0.197  0.169  0.117           0.122  0.110  0.052
     χ²_GD            0.223  0.237          0.290  0.290  0.288           0.453  0.474  0.395
     Z_VD             0.234  0.162          0.205  0.185  0.142           0.127  0.119  0.124
0.3  χ²_G             0.050  0.043          0.048  0.046  0.037           0.047  0.038  0.014
     Z_V              0.062  0.058          0.065  0.060  0.047           0.057  0.045  0.014
     χ²_GD            0.056  0.075          0.067  0.073  0.074           0.208  0.211  0.184
     Z_VD             0.080  0.086          0.073  0.077  0.081           0.063  0.049  0.032
0.5  χ²_G             0.048  0.041          0.061  0.043  0.032           0.052  0.037  0.019
     Z_V              0.058  0.050          0.070  0.048  0.038           0.058  0.038  0.021
     χ²_GD            0.053  0.062          0.068  0.060  0.072           0.158  0.161  0.124
     Z_VD             0.068  0.069          0.081  0.068  0.079           0.065  0.043  0.044

sample normal theory. The comparative results found here for χ²_GD and the adjusted Wald statistic
Z_VD follow a similar pattern.
Table III compares the empirical powers of the four procedures at N = 200 and 100. They
show that the advantage in power gained by adjusting for the dependence between samples can
be striking, but only when κ_B is large relative to κ_1 and κ_2, where it often reaches four or five
percentage points.

5. EXAMPLES
As a first example, we consider data arising from the study referred to in Section 1, which suggests
a comparison of coefficients of intra-observer agreement for each of two pathologists with respect
to the presence or absence of dysplasia of the urothelium.4 The data for the 27 patients in this
study are given in Table IV, where the overall prevalence of dysplasia is estimated as π̂ = 0.37.
The values of the kappa statistics measuring the degree of reproducibility for each of the two
raters are given by κ̂_1 = 0.48 and κ̂_2 = 0.90, respectively.
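
As a numerical check (our illustration, reusing the kappa_hat helper sketched in Section 3; the variable names are ours), plugging the marginal totals of Table IV into estimator (3) reproduces the reported values:

    # Marginal counts (n_1j, n_2j, n_3j) read off the totals of Table IV.
    pathologist1 = (10, 7, 10)   # column totals: setting 1
    pathologist2 = (20, 1, 6)    # row totals: setting 2

    k1, pi1 = kappa_hat(*pathologist1)               # k1 = 0.481..., pi1 = 0.500
    k2, pi2 = kappa_hat(*pathologist2)               # k2 = 0.899..., pi2 = 0.241
    pi_overall = (7 + 1 + 2 * (10 + 6)) / (4 * 27)   # 0.370, the reported prevalence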
The values of the adjusted test statistics are given by Z_VD = 2.50 (p = 0.012) and χ²_GD = 14.4
(p = 0.00015), respectively, both of which are significant at the 5 per cent level. However, the
unadjusted test statistics are also significant in this case, with Z_V = 2.14 (p = 0.032) and
χ²_G = 11.12 (p = 0.00085). This similarity in results can be attributed at least in part to the
relatively small value of the between-setting correlation, given here by κ̂_B = 0.17.
As a second example, Oden27 presented data concerning the presence or absence of geographic
atrophy in the eyes of 840 patients, with each eye graded by the same two raters. In this example,
the two eyes represent the two different settings, κ_1 denotes the interrater agreement coefficient
for the left eye, and κ_2 the interrater agreement coefficient for the right eye. For these data,
presented in Table V, the overall prevalence of the condition is estimated as π̂ = 0.0185, with the

Table III. Empirical powers for testing H_0: κ_1 = κ_2 at α = 0.05 (two-sided)

                        κ_1 = 0.4                                                           κ_1 = 0.6
                        κ_2 = 0.6                κ_2 = 0.7                κ_2 = 0.8         κ_2 = 0.8                κ_2 = 0.9
π    Statistic  κ_B =   0.1   0.2   0.3   0.4    0.1   0.2   0.3   0.4    0.1   0.2   0.3   0.4    0.1   0.2   0.4   0.6    0.1   0.2   0.4   0.6

(a) N = 200
0.1  χ²_G             0.324 0.294 0.295 0.311  0.625 0.622 0.606 0.606  0.878 0.857 0.887 0.877  0.383 0.401 0.396 0.395  0.790 0.818 0.817 0.818
     Z_V              0.320 0.296 0.298 0.310  0.618 0.617 0.600 0.608  0.878 0.863 0.891 0.875  0.383 0.399 0.391 0.393  0.794 0.824 0.810 0.817
     χ²_GD            0.322 0.297 0.312 0.350  0.620 0.625 0.616 0.640  0.877 0.860 0.895 0.889  0.388 0.405 0.421 0.464  0.789 0.818 0.817 0.848
     Z_VD             0.321 0.293 0.311 0.344  0.618 0.626 0.613 0.640  0.880 0.865 0.894 0.889  0.392 0.402 0.419 0.456  0.797 0.818 0.815 0.847
0.3  χ²_G             0.572 0.561 0.583 0.570  0.931 0.906 0.921 0.928  0.996 0.999 0.998 0.999  0.748 0.769 0.722 0.752  0.991 0.990 0.992 0.994
     Z_V              0.571 0.560 0.583 0.570  0.932 0.906 0.922 0.928  0.996 0.999 0.998 0.999  0.746 0.768 0.722 0.750  0.991 0.990 0.991 0.994
     χ²_GD            0.568 0.574 0.598 0.607  0.927 0.908 0.929 0.940  0.996 0.998 0.998 0.999  0.753 0.776 0.744 0.800  0.992 0.989 0.992 0.995
     Z_VD             0.569 0.575 0.597 0.605  0.929 0.907 0.929 0.940  0.996 0.998 0.998 0.999  0.754 0.777 0.742 0.799  0.992 0.989 0.991 0.995
0.5  χ²_G             0.631 0.642 0.658 0.629  0.953 0.948 0.959 0.956  0.999 1.00  1.00  0.999  0.795 0.794 0.837 0.819  0.998 0.999 0.998 0.999
     Z_V              0.636 0.647 0.662 0.633  0.953 0.949 0.959 0.957  0.999 1.00  1.00  0.999  0.799 0.798 0.838 0.821  0.998 0.999 0.998 0.999
     χ²_GD            0.640 0.647 0.672 0.670  0.950 0.951 0.960 0.961  0.999 1.00  1.00  1.00   0.793 0.801 0.843 0.856  0.998 1.00  0.998 0.999
     Z_VD             0.642 0.651 0.679 0.672  0.951 0.951 0.961 0.961  0.999 1.00  1.00  1.00   0.794 0.802 0.846 0.860  0.998 1.00  0.998 0.999

(b) N = 100
0.1  χ²_G             0.191 0.181 0.164 0.182  0.364 0.348 0.339 0.370  0.618 0.610 0.561 0.590  0.233 0.237 0.237 0.200  0.496 0.521 0.515 0.481
     Z_V              0.206 0.208 0.178 0.200  0.381 0.377 0.359 0.394  0.647 0.635 0.584 0.614  0.243 0.247 0.253 0.218  0.519 0.546 0.538 0.491
     χ²_GD            0.193 0.194 0.186 0.219  0.372 0.361 0.357 0.415  0.626 0.619 0.574 0.619  0.241 0.261 0.262 0.283  0.524 0.551 0.522 0.557
     Z_VD             0.218 0.212 0.205 0.232  0.395 0.382 0.380 0.431  0.650 0.641 0.599 0.938  0.245 0.266 0.267 0.280  0.529 0.555 0.555 0.533
0.3  χ²_G             0.305 0.293 0.326 0.330  0.672 0.640 0.669 0.669  0.910 0.910 0.921 0.910  0.411 0.460 0.445 0.441  0.846 0.875 0.868 0.884
     Z_V              0.307 0.297 0.331 0.336  0.677 0.644 0.673 0.675  0.914 0.913 0.922 0.911  0.417 0.467 0.447 0.444  0.851 0.879 0.873 0.884
     χ²_GD            0.311 0.305 0.352 0.375  0.668 0.644 0.684 0.698  0.907 0.914 0.924 0.917  0.414 0.454 0.467 0.508  0.847 0.872 0.871 0.908
     Z_VD             0.317 0.313 0.358 0.379  0.674 0.650 0.692 0.701  0.913 0.918 0.926 0.919  0.421 0.465 0.471 0.511  0.848 0.876 0.887 0.909
0.5  χ²_G             0.386 0.256 0.395 0.377  0.715 0.735 0.724 0.728  0.945 0.940 0.953 0.968  0.487 0.507 0.505 0.488  0.909 0.934 0.908 0.915
     Z_V              0.392 0.361 0.402 0.382  0.721 0.741 0.727 0.740  0.948 0.943 0.955 0.971  0.493 0.511 0.508 0.491  0.910 0.934 0.913 0.917
     χ²_GD            0.382 0.367 0.409 0.413  0.707 0.745 0.735 0.753  0.946 0.941 0.953 0.973  0.499 0.509 0.519 0.543  0.931 0.931 0.925 0.937
     Z_VD             0.390 0.371 0.416 0.421  0.718 0.748 0.737 0.763  0.950 0.944 0.956 0.973  0.505 0.515 0.527 0.544  0.916 0.932 0.925 0.937

Table IV. Joint rating frequencies for 27 subjects obtained from replicate
assessments by each of two pathologists

Pathologist 2 Pathologist 1 Total

(0, 0) (1, 0) or (0, 1) (1, 1)

(0, 0) 9 5 6 20
(1, 0) or (0, 1) 0 1 0 1
(1, 1) 1 1 4 6
Total 10 7 10 27

Table V. Joint rating frequencies for 840 patients from assessments on each of two eyes

Right eye Left eye Total

0 1 2

0 800 14 2 816
1 12 3 0 15
2 5 0 4 9
Total 817 17 6 840

values of the kappa statistics given by κ̂_1 = 0.40, κ̂_2 = 0.54 and κ̂_B = 0.29. Note that in comparison
to the first example, the value of κ̂_B is relatively close in magnitude to κ̂_1 and κ̂_2. This result is
not unexpected, given the relatively high correlations that have been observed for ocular traits
(for example, Rosner28).
The values of the adjusted test statistics are given by Z_VD = 1.11 (p = 0.27) and χ²_GD = 1.34
(p = 0.27), both of which are non-significant at the 5 per cent level. The unadjusted test statistics
have still larger p-values, with Z_V = 0.848 (p = 0.40) and χ²_G = 0.782 (p = 0.38). Schouten,29 in his
analysis of these data, came to a similar conclusion. Note that these data may also be used to
compare the coefficients of intra-observer variability obtained by the two raters when assessing
a patient's left and right eye, as shown by Oden.27

6. DISCUSSION
When it is possible to use the same sample of subjects in testing the equality of kappa statistics,
rigorous statistical comparisons will recognize the dependence between the two sample estimates.
For the case of a normally distributed outcome variable, an approximate test procedure for
comparing two intraclass correlations, based on the distribution of the ratio of two related
variances, was developed by Alsawalmeh and Feldt.1 The purpose of this paper was to develop
and evaluate relatively simple test procedures that could be used to compare dependent kappa
statistics in the case of a dichotomous outcome variable. A major finding was that formal
adjustment for the dependence between the two kappa statistics tends to be substantively
worthwhile only when the between-setting correlation is of the same order of magnitude as the


within-setting correlation. Otherwise the type I errors for the unadjusted goodness-of-fit test
remain close to nominal and the loss in power relative to the adjusted test statistic is negligible.
This finding is consistent with results obtained by Donner et al.30 in their comparison of two
dependent intraclass correlations computed from normally distributed family data, where it was
also found that the lack of independence can be ignored for most practical purposes.
For the comparison of coefficients of inter-observer agreement, it is natural to expect the
between-setting correlation to be smaller, perhaps substantially so, than the within-setting correlations,
assuming that the settings involve different pairs of raters. Thus it would seem that little would
be lost in this case by employing conservative test procedures that, in theory, assume independence
between samples. However, the situation may be quite different for the comparison of
coefficients of reproducibility or intra-observer variability. Consider, for example, the literature
review of test-retest correlations with respect to hearing level thresholds conducted by Coren and
Hakstian.31 In most studies reviewed by these authors, a given patient may supply both a right
and a left ear for analysis. A question of interest which arises is whether the test-retest correlation
for the left ear is the same as that for the right ear. The adjusted goodness-of-fit test could be used to
compare such test-retest correlations for dichotomous aural traits, taking into account the
dependence between ears. It is interesting to note in this regard that Coren and Hakstian31
observe in their review that the 'correlations between the two ears are almost as high as the
correlations between successive measures taken on the same ear on separate occasions', corresponding
to the case in which adjustment for the dependence between ears is most worthwhile. It
also seems reasonable to assume that similar problems could arise in ophthalmologic research as
well as in other fields in which duplicate readings are obtained on each of two bilateral measures
recorded on the same subject. However, further empirical work is needed to establish the actual
levels of such within- and between-setting correlations.
A limitation of our model is that it assumes a common prevalence across settings as well as
across raters within settings. This assumption would seem reasonable when the same subjects are
rated on the presence or absence of a specified trait under settings which involve raters having the
same level of training and experience. It is also consistent with the caution expressed by many
authors (for example, Thompson and Walter32) against the comparison of two or more kappa
statistics when the population prevalence for the groups compared may differ. This caution is
reasonable given the well-known dependence of the kappa statistic on the estimated group prevalence.
Other assumptions of the model are that subject i is selected randomly from a large population
of subjects, and that the raters in each setting are drawn at random from a population of raters. If
raters are regarded as fixed, then a further assumption of marginal homogeneity (that is, a common
prevalence π) must be made for our results to hold. In addition, measurement errors within
and across settings are assumed to be independent.
An obvious extension of this work is to comparisons that involve m_i ratings for setting i, where
m_i ≥ 2, i = 1, 2, and m_1 and m_2 may differ. This would require the development of probability
models much more complex than that presented here or, alternatively, the adoption of robust
variance estimators.

APPENDIX I: JOINT PROBABILITY DISTRIBUTION OF (X_i11, X_i12, X_i21, X_i22)

P(0, 0, 0, 0) = Δ^{-1}[b(b+1)(b+2)(b+3) + (κ_C1 + κ_C2) ab(b+1)(b+2) + κ_C1 κ_C2 ab(a+1)(b+1)] = θ_11

P(1, 0, 0, 0) + P(0, 1, 0, 0) = 2Δ^{-1}(1 − κ_C1)[ab(b+1)(b+2) + κ_C2 ab(a+1)(b+1)] = θ_12

P(1, 1, 0, 0) = Δ^{-1}[(1 + κ_C1 κ_C2) ab(a+1)(b+1) + κ_C1 ab(b+1)(b+2) + κ_C2 ab(a+1)(a+2)] = θ_13

P(0, 0, 1, 0) + P(0, 0, 0, 1) = 2Δ^{-1}(1 − κ_C2)[ab(b+1)(b+2) + κ_C1 ab(a+1)(b+1)] = θ_21

P(1, 0, 1, 0) + P(1, 0, 0, 1) + P(0, 1, 1, 0) + P(0, 1, 0, 1) = 4Δ^{-1}(1 − κ_C1)(1 − κ_C2) ab(a+1)(b+1) = θ_22

P(1, 1, 1, 0) + P(1, 1, 0, 1) = 2Δ^{-1}(1 − κ_C2)[ab(a+1)(a+2) + κ_C1 ab(a+1)(b+1)] = θ_23

P(0, 0, 1, 1) = Δ^{-1}[(1 + κ_C1 κ_C2) ab(a+1)(b+1) + κ_C1 ab(a+1)(a+2) + κ_C2 ab(b+1)(b+2)] = θ_31

P(0, 1, 1, 1) + P(1, 0, 1, 1) = 2Δ^{-1}(1 − κ_C1)[ab(a+1)(a+2) + κ_C2 ab(a+1)(b+1)] = θ_32

P(1, 1, 1, 1) = Δ^{-1}[a(a+1)(a+2)(a+3) + (κ_C1 + κ_C2) ab(a+1)(a+2) + κ_C1 κ_C2 ab(a+1)(b+1)] = θ_33

P_1(κ_1) = [b(b+1) + κ_C1 ab] / [(a+b)(a+b+1)]

P_2(κ_1) = 2(1 − κ_C1) ab / [(a+b)(a+b+1)]

P_3(κ_1) = [a(a+1) + κ_C1 ab] / [(a+b)(a+b+1)]

P_1(κ_2) = [b(b+1) + κ_C2 ab] / [(a+b)(a+b+1)]

P_2(κ_2) = 2(1 − κ_C2) ab / [(a+b)(a+b+1)]

P_3(κ_2) = [a(a+1) + κ_C2 ab] / [(a+b)(a+b+1)]

where Δ = (a+b)(a+b+1)(a+b+2)(a+b+3),

a = π(1 − κ_B)/κ_B,  b = (1 − π)(1 − κ_B)/κ_B  and  κ_B = (1 + a + b)^{-1}.
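
For numerical work, the nine cell probabilities can be evaluated directly from these expressions. The function below is our sketch (the name is ours, not the authors'), with rows indexing the setting-2 category and columns the setting-1 category, as in Table I(b).

    import numpy as np

    def theta_matrix(pi, kappa_B, kappa_C1, kappa_C2):
        """Joint cell probabilities theta[l, m] of Table I(b) from the Appendix I
        expressions (rows: setting-2 category, columns: setting-1 category)."""
        a = pi * (1.0 - kappa_B) / kappa_B
        b = (1.0 - pi) * (1.0 - kappa_B) / kappa_B
        delta = (a + b) * (a + b + 1) * (a + b + 2) * (a + b + 3)
        k1, k2 = kappa_C1, kappa_C2
        t = np.empty((3, 3))
        t[0, 0] = (b*(b+1)*(b+2)*(b+3) + (k1 + k2)*a*b*(b+1)*(b+2)
                   + k1*k2*a*b*(a+1)*(b+1)) / delta
        t[0, 1] = 2*(1 - k1)*(a*b*(b+1)*(b+2) + k2*a*b*(a+1)*(b+1)) / delta
        t[0, 2] = ((1 + k1*k2)*a*b*(a+1)*(b+1) + k1*a*b*(b+1)*(b+2)
                   + k2*a*b*(a+1)*(a+2)) / delta
        t[1, 0] = 2*(1 - k2)*(a*b*(b+1)*(b+2) + k1*a*b*(a+1)*(b+1)) / delta
        t[1, 1] = 4*(1 - k1)*(1 - k2)*a*b*(a+1)*(b+1) / delta
        t[1, 2] = 2*(1 - k2)*(a*b*(a+1)*(a+2) + k1*a*b*(a+1)*(b+1)) / delta
        t[2, 0] = ((1 + k1*k2)*a*b*(a+1)*(b+1) + k1*a*b*(a+1)*(a+2)
                   + k2*a*b*(b+1)*(b+2)) / delta
        t[2, 1] = 2*(1 - k1)*(a*b*(a+1)*(a+2) + k2*a*b*(a+1)*(b+1)) / delta
        t[2, 2] = (a*(a+1)*(a+2)*(a+3) + (k1 + k2)*a*b*(a+1)*(a+2)
                   + k1*k2*a*b*(a+1)*(b+1)) / delta
        return t

Row sums of the returned matrix reproduce P_l(κ_2), column sums reproduce P_l(κ_1), and the nine entries sum to one, which gives a quick numerical check on the algebra above.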

APPENDIX II: CORRELATION BETWEEN κ̂_1 AND κ̂_2

Since the κ̂_j are functions of M = (V_11, V_12, V_13, V_21, V_22, V_23, V_31, V_32, V_33)', then to the first
order of approximation, application of the delta method gives

    cov(κ̂_1, κ̂_2) = Σ_i Σ_j cov(n_i1, n_j2) (∂κ̂_1/∂n_i1)(∂κ̂_2/∂n_j2)

(see Stuart and Ord33). Since M has a multinomial distribution, then

    cov(V_ij, V_lm) = −N θ_ij θ_lm,  (i, j) ≠ (l, m),
    var(V_ij) = N θ_ij(1 − θ_ij)

and

    cov(n_i1, n_j2) = N[θ_ji − P_i(κ_1) P_j(κ_2)],  i, j = 1, 2, 3

(see Stuart34). Hence after some algebra we obtain

    N cov(κ̂_1, κ̂_2) = d_1 A − d_2(A/2 + C) − d_3(A/2 + B) + d_4(A + 2B + 2C + 4D)    (8)

where

    d_1 = [4π²(1 − π)²]^{-1},
    d_2 = P_2(κ_1)(1 − 2π) / [4π³(1 − π)³],
    d_3 = P_2(κ_2)(1 − 2π) / [4π³(1 − π)³],
    d_4 = P_2(κ_1) P_2(κ_2)(1 − 2π)² / [16π⁴(1 − π)⁴],

    A = θ_22 − P_2(κ_1) P_2(κ_2)
    B = θ_32 − P_3(κ_2) P_2(κ_1)
    C = θ_23 − P_3(κ_1) P_2(κ_2)
    D = θ_33 − P_3(κ_2) P_3(κ_1).

A sample estimate of cov(κ̂_1, κ̂_2) may be obtained by substituting V_ij/N for θ_ij, κ̂_j for κ_j and
π̂ = (1/(4N))[n_22 + n_21 + 2(n_32 + n_31)] for π. A sample estimate of the correlation between
κ̂_1 and κ̂_2 is then given by

    ĉorr(κ̂_1, κ̂_2) = ĉov(κ̂_1, κ̂_2) / [v̂ar(κ̂_1) v̂ar(κ̂_2)]^{1/2}

where ĉov(κ̂_1, κ̂_2) is obtained by replacing κ_j by κ̂_j, κ_B by κ̂_B and π by π̂ in the terms on the right-hand side
of (8).
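
Equation (8) is straightforward to evaluate once the joint table of Table I(b) has been formed. The following is our plug-in sketch of the estimate described above (function and variable names are ours); dividing the result by [v̂ar(κ̂_1) v̂ar(κ̂_2)]^{1/2} then gives ĉorr(κ̂_1, κ̂_2).

    import numpy as np

    def cov_kappa_hats(V, k1, k2, pi):
        """Plug-in estimate of cov(kappa1_hat, kappa2_hat) from equation (8).
        V is the 3 x 3 joint table of Table I(b) (rows: setting-2 category,
        columns: setting-1 category); k1, k2, pi are the estimates substituted
        for kappa_1, kappa_2 and pi."""
        N = V.sum()
        theta = V / N                                     # V_ij / N estimates theta_ij
        P1 = np.array([(1 - pi) ** 2 + k1 * pi * (1 - pi),
                       2 * (1 - k1) * pi * (1 - pi),
                       pi ** 2 + k1 * pi * (1 - pi)])     # P_l(kappa_1)
        P2 = np.array([(1 - pi) ** 2 + k2 * pi * (1 - pi),
                       2 * (1 - k2) * pi * (1 - pi),
                       pi ** 2 + k2 * pi * (1 - pi)])     # P_l(kappa_2)
        pq = pi * (1 - pi)
        d1 = 1.0 / (4 * pq ** 2)
        d2 = P1[1] * (1 - 2 * pi) / (4 * pq ** 3)
        d3 = P2[1] * (1 - 2 * pi) / (4 * pq ** 3)
        d4 = P1[1] * P2[1] * (1 - 2 * pi) ** 2 / (16 * pq ** 4)
        A = theta[1, 1] - P1[1] * P2[1]
        B = theta[2, 1] - P2[2] * P1[1]
        C = theta[1, 2] - P1[2] * P2[1]
        D = theta[2, 2] - P2[2] * P1[2]
        n_cov = d1 * A - d2 * (A / 2 + C) - d3 * (A / 2 + B) + d4 * (A + 2 * B + 2 * C + 4 * D)
        return n_cov / N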

ACKNOWLEDGEMENTS

The work of Drs. Donner and Shoukri was partially supported by grants from the Natural
Sciences and Engineering Research Council of Canada. Dr. Klar was partially supported by
a faculty grant from the Department of Biostatistics, Harvard School of Public Health and the
Schering-Plough Corporation.


REFERENCES

1. Alsawalmeh, Y. M. and Feldt, L. S. 'Testing the equality of two related intraclass reliability coefficients',
Applied Psychological Measurement, 18, 183-190 (1994).
2. Korten, A. E., Jorm, A. F., Henderson, A. S., McCusker, E. and Creasey, H. 'Control-informant
agreement on exposure history in case-control studies of Alzheimer's disease', International Journal of
Epidemiology, 21, 1121-1131 (1992).
3. Browman, G. P., Markman, S., Thompson, G., Minuk, T., Chirawatkul, A. and Roberts, R. S.
'Assessment of observer variation in measuring the radiographic vertebral index in patients with
multiple myeloma', Journal of Clinical Epidemiology, 43, 833-840 (1990).
4. Baker, S. G., Freedman, L. S. and Parmar, M. K. B. 'Using replicate observations in observer agreement
studies with binary assessments', Biometrics, 47, 1327-1338 (1991).
5. Kraemer, H. C. and Korner, A. F. 'Statistical alternatives in assessing reliability, consistency and
individual differences for quantitative measures: applications to behavioural measures of neonates',
Psychological Bulletin, 83, 914-921 (1976).
6. Shrout, P. E. and Fleiss, J. L. 'Intraclass correlation: uses in assessing rater reliability', Psychological
Bulletin, 86, 420-428 (1979).
7. Kraemer, H. C. 'Extension of Feldt's approach to testing homogeneity of coefficients of reliability',
Psychometrika, 45, 41-45 (1981).
8. Feldt, L. S. 'A test of the hypothesis that Cronbach's alpha reliability coefficient is the same for two tests
administered to the same sample', Psychometrika, 45, 99-105 (1980).
9. Kraemer, H. C. 'On estimation and hypothesis testing problems for correlation coefficients', Psychometrika,
40, 473-485 (1975).
10. Bloch, D. A. and Kraemer, H. C. '2 x 2 kappa coefficients: measures of agreement or association',
Biometrics, 45, 269-287 (1989).
11. Cohen, J. 'A coefficient of agreement for nominal scales', Educational and Psychological Measurement,
20, 37-46 (1960).
12. Hale, C. A. and Fleiss, J. L. 'Interval estimation under two study designs for kappa with binary
classifications', Biometrics, 49, 523-534 (1993).
13. Dunn, G. Design and Analysis of Reliability Studies, Oxford University Press, Oxford, New York,
1989.
14. Landis, J. R. and Koch, G. G. 'A one-way components of variance model for categorical data',
Biometrics, 33, 671-679 (1977).
15. Scott, W. A. 'Reliability of content analysis: the case of nominal scale coding', Public Opinion Quarterly,
19, 321-325 (1955).
16. Zwick, R. 'Another look at interrater agreement', Psychological Bulletin, 103, 374-378 (1988).
17. Donner, A., Eliasziw, M. and Klar, N. 'Testing the homogeneity of kappa statistics', Biometrics, 52,
176-183 (1996).
18. Donner, A. and Eliasziw, M. 'A goodness-of-fit approach to inference procedures for the kappa statistic:
confidence interval construction, significance-testing and sample size estimation', Statistics in Medicine,
11, 1511-1519 (1992).
19. Donner, A. 'Sample size requirements for the comparison of two or more coefficients of interobserver
agreement', Statistics in Medicine, 17, 1157-1168 (1998).
20. Workum, P., BelBoro, E. A., Halford, S. K. and Murphy, R. L. H. 'Observer agreement, chest
auscultation, and crackles in asbestos-exposed workers', Chest, 89, 27-29 (1986).
21. Hill, C., Keks, S., Roberts, S., Opeskin, K., Dean, B., MacKinnon, A. and Capolov, D. 'Problem of
diagnosis in postmortem brain studies of schizophrenia', American Journal of Psychiatry, 153, 533-537
(1996).
22. McKenzie, D. P., MacKinnon, A. J., Péladeau, N., Onghena, P., Bruce, P. C., Clarke, D. M., Harrigan, S.
and McGorry, P. D. 'Comparing correlated kappas by resampling: is one level of agreement significantly
different from another?', Journal of Psychiatric Research, 30, 483-492 (1996).
23. Williamson, J. M. and Manatunga, A. K. 'The Consultant's Forum: Assessing interrater agreement from
dependent data', Biometrics, 53, 707-714 (1997).
24. Rosner, B. 'Multivariate methods for clustered binary data with more than one level of nesting', Journal
of the American Statistical Association, 84, 373-380 (1989).
25. Rosner, B. 'Multivariate methods for clustered binary data with multiple subclasses, with application to
binary longitudinal data', Biometrics, 48, 721-731 (1992).
26. Fleiss, J. L. 'Measuring nominal scale agreement among many raters', Psychological Bulletin, 76,
378-382 (1971).
27. Oden, N. 'Estimating kappa from binocular data', Statistics in Medicine, 10, 1303-1311 (1991).
28. Rosner, B. 'Statistical methods in ophthalmology: an adjustment for the intraclass correlation between
eyes', Biometrics, 38, 105-114 (1982).
29. Schouten, H. J. A. 'Estimating kappa from binocular data and comparing marginal probabilities',
Statistics in Medicine, 12, 2207-2217 (1993).
30. Donner, A., Koval, J. J. and Bull, S. 'Testing the effect of sex differences on sib-sib correlations',
Biometrics, 40, 349-356 (1984).
31. Coren, S. and Hakstian, R. 'Methodological implications of interaural correlation: count heads not
ears', Perception and Psychophysics, 48, 291-294 (1990).
32. Thompson, W. D. and Walter, S. D. 'A reappraisal of the kappa statistic', Journal of Clinical Epidemiology,
41, 949-958 (1988).
33. Stuart, A. and Ord, K. Advanced Theory of Statistics, Volume 1, 1987, p. 324.
34. Stuart, A. 'A test for homogeneity of the marginal distributions in a two-way classification', Biometrika,
42, 412-416 (1955).
