Perspective
In musculoskeletal practice and research, there is frequently a need to determine the reliability of measurements made by clinicians. Reliability here means the extent to which clinicians agree in their ratings, not merely the extent to which their ratings are associated or correlated. Defined as such, 2 types of reliability exist: (1) agreement between ratings made by 2 or more clinicians (interrater reliability) and (2) agreement between ratings made by the same clinician on 2 or more occasions (intrarater reliability). In some cases, the ratings in question are on a continuous scale.

Only agreement beyond that expected by chance can be considered "true" agreement. Kappa is such a measure of "true" agreement.14 It indicates the proportion of agreement beyond that expected by chance, that is, the achieved beyond-chance agreement as a proportion of the possible beyond-chance agreement.15
J Sim, PhD, is Professor, Primary Care Sciences Research Centre, Keele University, Keele, Staffordshire ST5 5BG, United Kingdom
(j.sim@keele.ac.uk). Address all correspondence to Dr Sim.
CC Wright, BSc, is Principal Lecturer, School of Health and Social Sciences, Coventry University, Coventry, United Kingdom.
258 . Sim and Wright Physical Therapy . Volume 85 . Number 3 . March 2005
Table 1.
Diagnostic Assessments of Relevance of Lateral Shift by 2 Clinicians, From Kilpikoski et al9 (κ=.67)a

P_c = \frac{(f_1 \times g_1)/n + (f_2 \times g_2)/n}{n} = \frac{(26 \times 24)/39 + (13 \times 15)/39}{39} = \frac{16 + 5}{39} = .5385

Substituting into the formula:

κ = \frac{P_o - P_c}{1 - P_c} = \frac{.8462 - .5385}{1 - .5385} = .67        (5)

Disagreement by a single scale point (eg, "no pain"–"mild pain") is less serious than disagreement by 2 scale points (eg, "no pain"–"moderate pain"). To reflect the degree of disagreement, kappa can be weighted, so that it attaches greater emphasis to large differences between ratings than to small differences. A number of methods of weighting are available,25 but quadratic weighting is common (Appendix). Weighted kappa penalizes disagreements in terms of their seriousness, whereas unweighted kappa treats all disagreements equally. Unweighted kappa, therefore, is inappropriate for ordinal scales.26 Because in this example most disagreements are of only a single category, the quadratic weighted kappa (.67) is higher than the unweighted kappa (.55). Different weighting schemes will produce different values of weighted kappa on the same data; for example, linear weighting gives a kappa of .61 for the data in Table 2.
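The unweighted calculation above can be reproduced in a few lines of code. The cell frequencies below are not printed in this excerpt; they are reconstructed from the marginal totals (f1=26, f2=13, g1=24, g2=15) and the observed agreement P_o=.8462 given in the text, which determine them uniquely. A minimal sketch:

```python
# Cohen's kappa for a 2x2 table of 2 raters' dichotomous ratings.
# Cells are reconstructed from the Table 1 marginals and P_o = .8462.
table = [[22, 4],
         [2, 11]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Observed agreement: proportion of cases on the main diagonal
p_o = sum(table[i][i] for i in range(2)) / n
# Chance agreement: sum of products of matching marginals, divided by n^2
p_c = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

kappa = (p_o - p_c) / (1 - p_c)
print(round(p_o, 4), round(p_c, 4), round(kappa, 2))  # 0.8462 0.5385 0.67
```

The same routine applies to any 2x2 table of dichotomous ratings by 2 raters.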
Table 4.
(A) Assessment of the Presence of Lateral Shift, From Kilpikoski et al9 (κ=.18); (B) the Same Data Adjusted to Give Equal Agreements in Cells a and d, and Thus a Low Prevalence Index (κ=.54)

(A)                          Clinician 2
                       Present   Absent   Total
Clinician 1  Present    28 (a)    3 (b)     31
             Absent      6 (c)    2 (d)      8
Total                   34        5         39

(B)                          Clinician 2
                       Present   Absent   Total
Clinician 1  Present    15 (a)    3 (b)     18
             Absent      6 (c)   15 (d)     21
Total                   21       18         39
Table 5.
(A) Contingency Table Showing Nearly Symmetrical Disagreements in Cells b and c, and Thus a Low Bias Index (κ=.12); (B) Contingency Table With Asymmetrical Disagreements in Cells b and c, and Thus a Higher Bias Index (κ=.20)a

(A)                          Clinician 2
                       Present   Absent   Total
Clinician 1  Present    29 (a)   21 (b)     50
             Absent     23 (c)   27 (d)     50
Total                   52       48        100

(B)                          Clinician 2
                       Present   Absent   Total
Clinician 1  Present    29 (a)    6 (b)     35
             Absent     38 (c)   27 (d)     65
Total                   67       33        100
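The indices referred to in the captions of Tables 4 and 5 can be checked directly. This sketch assumes the standard definitions used in the article, prevalence index PI = |a − d|/n and bias index BI = |b − c|/n:

```python
# Prevalence and bias indices for a 2x2 table of 2 raters' ratings,
# using the published cell frequencies from Tables 4 and 5.

def prevalence_index(table):
    """PI = |a - d| / n: imbalance between the two agreement cells."""
    (a, b), (c, d) = table
    return abs(a - d) / (a + b + c + d)

def bias_index(table):
    """BI = |b - c| / n: asymmetry between the two disagreement cells."""
    (a, b), (c, d) = table
    return abs(b - c) / (a + b + c + d)

print(round(prevalence_index([[28, 3], [6, 2]]), 2))   # Table 4A: 0.67 (high)
print(round(prevalence_index([[15, 3], [6, 15]]), 2))  # Table 4B: 0.0 (low)
print(round(bias_index([[29, 21], [23, 27]]), 2))      # Table 5A: 0.02 (low)
print(round(bias_index([[29, 6], [38, 27]]), 2))       # Table 5B: 0.32 (higher)
```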
Table 6.
(A) Data Reported by Kilpikoski et al9 for Judgments of Directional Preference by 2 Clinicians (κ=.54); (B) Cell Frequencies Adjusted to Minimize Prevalence and Bias Effects, Giving a Prevalence-Adjusted Bias-Adjusted Kappa (PABAK) of .79

(A)                          Clinician 2
                       Present   Absent   Total
Clinician 1  Present    32 (a)    1 (b)     33
             Absent      3 (c)    3 (d)      6
Total                   35        4         39

(B)                          Clinician 2
                       Present   Absent   Total
Clinician 1  Present    18 (a)    2 (b)     20
             Absent      2 (c)   17 (d)     19
Total                   20       19         39
The adjustment in Table 6B alters the distribution of the ratings, as indicated by the change in the marginal totals. Furthermore, all of the frequencies in the cells have changed between Table 6A and Table 6B. Therefore, the PABAK coefficient on its own is uninformative, because it relates to a hypothetical situation in which no prevalence or bias effects are present. However, if PABAK is presented in addition to, rather than in place of, the obtained value of kappa, its use may be considered appropriate, because it gives an indication of the likely effects of prevalence and bias alongside the true value of kappa derived from the specific measurement context.

Table 7.
Data on Assessments of Stiffness at C1–2 From Smedmark et al7,a

                              Clinician 2
                       Stiffness   No stiffness   Total
Clinician 1  Stiffness    2 (3)       1 (0)          3
             No stiffness 7 (6)      50 (51)        57
Total                     9          51             60

a Figures in the cells represent the observed ratings; those in parentheses are the ratings that would secure maximum agreement given the marginal totals. Observed κ=.28; maximum attainable κ=.46. Smedmark et al did not specify …
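The adjusted coefficients discussed above can be verified against the published tables: κ and PABAK for Table 6A, and the observed and maximum attainable κ for Table 7. This sketch takes the maximum attainable agreement, as is conventional, as the largest diagonal total consistent with the observed marginal totals:

```python
# Kappa, PABAK, and maximum attainable kappa for a 2x2 table,
# verified against Table 6A (kappa=.54, PABAK=.79) and
# Table 7 (observed kappa=.28, maximum attainable kappa=.46).

def kappa_stats(table):
    """Return (kappa, pabak, kappa_max) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    p_o = (a + d) / n
    p_c = sum(r * k for r, k in zip(rows, cols)) / n ** 2
    kappa = (p_o - p_c) / (1 - p_c)
    pabak = 2 * p_o - 1                       # kappa with prevalence and bias effects removed
    # Largest achievable diagonal given the marginal totals
    p_o_max = sum(min(r, k) for r, k in zip(rows, cols)) / n
    kappa_max = (p_o_max - p_c) / (1 - p_c)
    return kappa, pabak, kappa_max

k6, pabak6, _ = kappa_stats([[32, 1], [3, 3]])   # Table 6A
k7, _, kmax7 = kappa_stats([[2, 1], [7, 50]])    # Table 7
print(round(k6, 2), round(pabak6, 2))   # 0.54 0.79
print(round(k7, 2), round(kmax7, 2))    # 0.28 0.46
```

Note that PABAK for Table 6A (.79) equals the kappa obtained from the adjusted frequencies in Table 6B, as the text describes.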
… with different levels of experience, tools or assessment protocols with inherently different sensitivity,53 or examiners who can be expected to utilize classification criteria of different stringency.14 These a priori sources of disagreement will give rise to different marginal totals and will, in turn, be reflected in κ_max.

For a given reliability study, the difference between kappa and 1 indicates the total unachieved agreement beyond chance. The difference between kappa and κ_max, however, indicates the unachieved agreement beyond chance within the constraints of the marginal totals. Accordingly, the difference between κ_max and 1 indicates the unachieved agreement beyond chance that is attributable to imbalance in the marginal totals.

In a practical situation in which ratings are compared across clinicians, agreement will usually be better than that expected by chance, and specifying a zero value for kappa in the null hypothesis is therefore not very meaningful.54,55 Thus, the value in the null hypothesis should usually be set at a higher level (eg, to determine whether the population value of kappa is greater than .40, on the basis that any value lower than this might be considered clinically "unacceptable"). The minimum acceptable value of kappa will depend on the clinical context. For example, the distress and risk assessment method (DRAM)56 can be used to identify patients with low back pain who are at risk for poor outcome. If this categorization determines whether or not such patients are enrolled in a management program, the minimal acceptable level of agreement on this categorization might be set fairly high, so that decisions on patient management are made with a high degree of consistency. If agreement on the test were particularly important and the kappa were lower than acceptable, it might mean that clinicians need more training in the testing technique or that the protocol needs to be reworded.

When a value greater than zero is specified for kappa in the null hypothesis, a 2-tailed test is preferable to a 1-tailed test, because there is no theoretical reason to assume that the reliability of a test's results or a diagnosis will necessarily be superior to a stated threshold for clinical importance. One-tailed tests should be reserved for occasions when testing a null hypothesis that kappa is zero.

The statistical hypothesis test provides only binary information: Is there sufficient evidence that the population value of kappa is greater than .40, or not? A more useful approach is to construct a confidence interval around the sample estimate of kappa, using the standard error of kappa (Appendix) and the z score corresponding to the desired level of confidence.57

Prior to undertaking a reliability study, a sample size calculation should be performed so that the study has a stated probability of detecting a statistically significant kappa coefficient or of providing a confidence interval of a desired width.58,59 Table 8 gives the minimum number of participants required to detect a kappa coefficient as statistically significant, with various values of the proportion of positive ratings made on a dichotomous variable by 2 raters, specifically (f_1 + g_1)/2n in terms of the notation in Table 1. A number of points should be noted in relation to this table. First, the sample sizes given assume no bias between raters. Second, except where the value of kappa stated in the null hypothesis is zero, sample size requirements are greatest when the proportion of positive ratings is either high or low. Third, given that the minimum value of kappa deemed to be clinically important will depend on the measurement context, non-zero null values between .40 and .70 have been included in Table 8 in addition to a null value of zero. Finally, following the earlier comments on 1- and 2-tailed tests, the figures given are for 2-tailed tests at a significance level of .05, except where the value of kappa in the null hypothesis is zero, in which case figures for a 1-tailed test also are given.
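The confidence-interval approach described above can be sketched numerically for the Table 1 data (P_o=.8462, P_c=.5385, n=39). The standard error used here is Cohen's simple large-sample approximation, sqrt(P_o(1 − P_o)/n)/(1 − P_c); the standard errors referred to in the article's Appendix differ from this, so the interval is illustrative only:

```python
import math

# Hedged sketch: 95% confidence interval for kappa from the Table 1
# data, using Cohen's simple large-sample standard error. This is an
# approximation, not the Appendix formula.
p_o, p_c, n = 0.8462, 0.5385, 39

kappa = (p_o - p_c) / (1 - p_c)
se = math.sqrt(p_o * (1 - p_o) / n) / (1 - p_c)
z = 1.96                                   # z score for 95% confidence
lower, upper = kappa - z * se, kappa + z * se
print(round(kappa, 2), round(lower, 2), round(upper, 2))  # 0.67 0.42 0.91
```

The wide interval illustrates why a point estimate of kappa from a sample of this size should be interpreted cautiously.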
Table 8.
The Number of Subjects Required in a 2-Rater Study to Detect a Statistically Significant κ (P≤.05) on a Dichotomous Variable, With Either 80% or 90% Power, at Various Proportions of Positive Diagnoses, and Assuming the Null Hypothesis Value of Kappa to Be .00, .40, .50, .60, or .70a

.10 .40 39 54 50 66
.30 .40 39 54 50 66
When seeking to optimize sample size, the investigator needs to choose the appropriate balance between the number of raters examining each subject and the number of subjects.60 In some instances, it is more practical to increase the number of raters rather than the number of subjects. However, according to Shoukri,39 when seeking to detect a kappa of .40 or greater on a dichotomous variable, it is not advantageous to use more than 3 raters per subject: it can be shown that, for a fixed number of observations, increasing the number of raters beyond 3 has little effect on the power of hypothesis tests or on the width of confidence intervals. Therefore, increasing the number of subjects is the more effective strategy for maximizing power.

Conclusions
If used and interpreted appropriately, the kappa coefficient provides valuable information on the reliability of data obtained with diagnostic and other procedures used in musculoskeletal practice. We conclude with the following recommendations:

• Alongside the obtained value of kappa, report the bias and prevalence.
• Relate the magnitude of kappa to the maximum attainable kappa for the contingency table concerned, as well as to 1; this provides an indication of the effect of imbalance in the marginal totals on the magnitude of kappa.
• Construct a confidence interval around the obtained value of kappa, to reflect sampling error.
• Test the significance of kappa against a value that represents a minimum acceptable level of agreement, rather than against zero, thereby testing whether its plausible values lie above an "acceptable" threshold.
• Use weighted kappa on scales that are ordinal in their original form, but avoid its use on interval/ratio scales collapsed into ordinal categories.
• Be cautious when comparing the magnitude of kappa across variables that have different prevalence or bias, or that are measured on different scales.

References
1 Toussaint R, Gawlik CS, Rehder U, Rüther W. Sacroiliac joint diagnostics in the Hamburg Construction Workers Study. J Manipulative Physiol Ther. 1999;22:139–143.
2 Fritz JM, George S. The use of a classification approach to identify subgroups of patients with acute low back pain. Spine. 2000;25: …
18 Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378–382.
19 Fritz JM, Wainner RS. Examining diagnostic tests: an evidence-based perspective. Phys Ther. 2001;81:1546–1564.
20 Feinstein AR, Cicchetti DV. High agreement but low kappa, I: the problems of two paradoxes. J Clin Epidemiol. 1990;43:543–549.
21 Bartko JJ, Carpenter WT. On the methods and theory of reliability. J Nerv Ment Dis. 1976;163:307–317.
22 Fleiss JL, Nee JCM, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychol Bull. 1979;86:974–977.
23 Hartmann DP. Considerations in the choice of interobserver reliability estimates. J Appl Behav Anal. 1977;10:103–116.
43 Lantz C, Nebenzahl E. Behavior and interpretation of the κ statistic: resolution of the two paradoxes. J Clin Epidemiol. 1996;49:431–434.
44 Gjørup T. The kappa coefficient and the prevalence of a diagnosis. Methods Inf Med. 1988;27:184–186.
45 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
46 Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley & Sons Inc; 1981.
47 Altman DG. Practical Statistics for Medical Research. London, England: Chapman & Hall/CRC; 1991.
48 Shrout PE. Measurement reliability and agreement in psychiatry. Stat Methods Med Res. 1998;7:301–317.
… on the number of categories. Epidemiology. 1996;7:199–202.
51 Haas M. Statistical methodology for reliability studies. J Manipulative Physiol Ther. 1991;14:119–132.
52 Soeken KL, Prescott PA. Issues in the use of kappa to estimate reliability. Med Care. 1986;24:733–741.
53 Knight K, Hiller ML, Simpson DD, Broome KM. The validity of self-reported cocaine use in a criminal justice treatment sample. Am J Drug Alcohol Abuse. 1998;24:647–660.
54 Posner KL, Sampson PD, Caplan RA, et al. Measuring interrater reliability among multiple raters: an example of methods for nominal data. Stat Med. 1990;9:1103–1115.
55 Petersen IS. Using the kappa coefficient as a measure of reliability or reproducibility. Chest. 1998;114:946–947.
56 Main CJ, Wood PJ, Hollis S, et al. The distress and risk assessment method: a simple patient classification to identify distress and evaluate the risk of poor outcome. Spine. 1992;17:42–51.
57 Sim J, Reid N. Statistical inference by confidence intervals: issues of interpretation and utilization. Phys Ther. 1999;79:186–195.
58 Flack VF, Afifi AA, Lachenbruch PA, Schouten HJA. Sample size determinations for the two rater kappa statistic. Psychometrika. 1988;53:321–325.
59 Donner A, Eliasziw M. A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Stat Med. 1992;11:1511–1519.
60 Walter SD, Eliasziw M, Donner A. Sample size and optimal designs for reliability studies. Stat Med. 1998;17:101–110.

Appendix.
Weighted Kappa

Formula for weighted kappa (κ_w):

κ_w = \frac{\sum wf_o - \sum wf_c}{n - \sum wf_c}

where Σwf_o is the sum of the weighted observed frequencies in the cells of the contingency table, and Σwf_c is the sum of the weighted frequencies expected by chance in the cells of the contingency table.

Linear and quadratic weights are calculated as follows:

linear: w = 1 - \frac{|i - j|}{k - 1}        quadratic: w = 1 - \frac{(i - j)^2}{(k - 1)^2}

where i − j is the difference between the row category on the scale and the column category on the scale (the number of categories of disagreement) for the cell concerned, and k is the number of points on the scale.

The weights for unweighted, linear weighted, and quadratic weighted kappas for agreement on a 4-point ordinal scale are:

                              Unweighted   Linear weights   Quadratic weights
No disagreement                   1             1                 1
Disagreement by 1 category        0            .67               .89
Disagreement by 2 categories      0            .33               .56
Disagreement by 3 categories      0             0                 0

This method of weighting is based on agreement; a method of weighting based on disagreement also can be used.25

A number of widely available computer programs will calculate κ and κ_w through standard options. SPSS 12a will calculate κ (but not κ_w) and performs a statistical test against a null value of zero. STATA 8b and SAS 8c calculate both κ and κ_w and perform a statistical test against a null value of zero for each of these statistics. PEPI 4d calculates both κ and κ_w, and provides a value of κ_max for each of these statistics. This program performs a statistical test against a null value of zero (and, if the observed value of κ or κ_w is >.40, against a null value of .40 also), together with 90%, 95%, and 99% 2-sided confidence intervals. Additionally, the prevalence-adjusted bias-adjusted kappa (PABAK) is calculated, and a McNemar test for bias is performed. Various other stand-alone programs are available, and macros can be used to perform additional functions related to κ for some of these programs.

Most programs will calculate a standard error of κ. Two standard errors can be calculated. A standard error assuming a zero value for κ should be used for hypothesis tests against a null hypothesis that states a zero value for κ, whereas a different standard error, assuming a nonzero value of κ, should be used for hypothesis tests against a null hypothesis that states a nonzero value for κ and to construct confidence intervals.

a SPSS Inc, 233 S Wacker Dr, Chicago, IL 60606.
b StataCorp LP, 4905 Lakeway Dr, College Station, TX 77845.
c SAS Institute Inc, 100 SAS Campus Dr, Cary, NC 27513.
d Sagebrush Press, 225 10th Ave, Salt Lake City, UT 84103.
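The Appendix formula for weighted kappa can be sketched in code. The 4×4 table below is hypothetical, not taken from the article; the weighting function reproduces the Appendix weights for a 4-point scale (.67/.33/0 linear, .89/.56/0 quadratic):

```python
# Weighted kappa via the Appendix formula,
#   kappa_w = (sum(w*f_o) - sum(w*f_c)) / (n - sum(w*f_c)),
# with agreement weights w = 1 - (|i - j| / (k - 1))**power,
# where power=1 gives linear and power=2 gives quadratic weights.

def weighted_kappa(table, power=2):
    k = len(table)
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    # Agreement weights: 1 on the diagonal, 0 for maximal disagreement
    w = [[1 - (abs(i - j) / (k - 1)) ** power for j in range(k)]
         for i in range(k)]
    # Weighted observed frequencies
    wf_o = sum(w[i][j] * table[i][j] for i in range(k) for j in range(k))
    # Weighted frequencies expected by chance (row total * column total / n)
    wf_c = sum(w[i][j] * rows[i] * cols[j] / n
               for i in range(k) for j in range(k))
    return (wf_o - wf_c) / (n - wf_c)

# Hypothetical ratings of 49 patients on a 4-point ordinal scale
ratings = [[10, 2, 0, 0],
           [3, 8, 2, 1],
           [0, 2, 7, 2],
           [0, 0, 3, 9]]
print(round(weighted_kappa(ratings, power=1), 2),   # linear
      round(weighted_kappa(ratings, power=2), 2))   # quadratic
```

As the text notes for Table 2, quadratic weighting yields a higher coefficient than linear weighting when most disagreements are of only a single category.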