
Journal of Applied Psychology, 1993, Vol. 78, No. 2, 306-309
Copyright 1993 by the American Psychological Association, Inc. 0021-9010/93/$3.00

r_wg: An Assessment of Within-Group Interrater Agreement


Lawrence R. James, Robert G. Demaree, and Gerrit Wolf

Schmidt and Hunter (1989) critiqued the within-group interrater reliability statistic (r_wg) described by James, Demaree, and Wolf (1984). Kozlowski and Hattrup (1992) responded to the Schmidt and Hunter critique and argued that r_wg is a suitable index of interrater agreement. This article focuses on the interpretation of r_wg as a measure of agreement among judges' ratings of a single target. A new derivation of r_wg is given that underscores this interpretation.

Lawrence R. James, Department of Management, University of Tennessee; Robert G. Demaree, Department of Psychology, Texas Christian University; Gerrit Wolf, Department of Management, State University of New York at Stony Brook. We thank Steven Kozlowski for his helpful suggestions and advice. Correspondence concerning this article should be addressed to Lawrence R. James, Department of Management, 408 Stokely Management Center, University of Tennessee, Knoxville, Tennessee 37996-0545.

In 1984 we suggested a technique "for assessing agreement [italics added] among the judgments made by a single group of judges on a single variable in regard to a single target" (James, Demaree, & Wolf, 1984, p. 85). The technique was cast as a heuristic form of interrater reliability because our derivations built on earlier work by Finn (1970), who had characterized his approach as estimating "the proportion of non-error variance in the ratings, a reliability [italics added] coefficient" (Finn, 1970, p. 72). In accordance with framing a measure of agreement as a special form of interrater reliability, we noted explicitly that our technique was fashioned after the "interchangeability" index of interrater reliability (estimated using intraclass correlation) in multitarget designs (Shrout & Fleiss, 1979). A distinguishing feature of interchangeable forms of interrater reliability is that they are measures of agreement. The term agreement is used because the estimators are sensitive both to the similarity (among judges) on the rank orderings of target ratings and to differences in the level (i.e., mean) of each judge's ratings (cf. Lahey, Downey, & Saal, 1983). A correlational form of interrater reliability, referred to as a "consistency" index by Shrout and Fleiss (1979), attends only to similarity of rank orderings of judges' target ratings (i.e., whether judges' scores correlate with one another, irrespective of whether the scores are the same). We purposely chose to assess interchangeability or agreement because our objective was to assess whether judges gave the same rating to a target.

In 1989 Schmidt and Hunter critiqued our approach, beginning with the argument that true variance is, by definition, zero in single stimulus situations, and thus interrater reliability must also be zero. Based in part on this logic and in part on other concerns emanating from a classic model of reliability, Schmidt and Hunter (1989) described our "within-group interrater reliability" (r_wg) as neither conceptually possible nor legitimate. They stated that if one proceeded to calculate an r_wg, then the ensuing estimate "may have no meaningful interpretation" (Schmidt & Hunter, 1989, p. 369). These authors concluded that there is no real need for r_wg because interrater agreement can be assessed by using the standard deviation of the ratings across judges or the standard error of the mean or both.

Kozlowski and Hattrup (1992) argued that the Schmidt and Hunter (1989) critique

    of r_wg did not clearly distinguish the concepts of interrater consensus (i.e., agreement) and interrater consistency (i.e., reliability). When the distinction between agreement and reliability is carefully drawn, the critique of r_wg is shown to divert attention from more critical problems in the assessment of agreement. A comparison demonstrates that the approach for assessing within-group agreement proposed by Schmidt and Hunter (1989) suffers from several limitations. [This] comment concludes that r_wg should not be used as an index of interrater reliability but, within certain bounds, it is suitable as an index of within-group interrater agreement and that [the Schmidt and Hunter alternatives] are not acceptable substitutes for extant indexes of interrater agreement. (p. 161)

Kozlowski and Hattrup (1992) are correct in stating that much of the Schmidt and Hunter (1989) critique of r_wg focused on an interrater reliability (correlational) index rather than on the agreement (interchangeability) index described by James et al. (1984). To be a bit more precise and faithful to the original presentation, James et al. proposed a heuristic form of an interchangeability (agreement) index of interrater reliability, whereas Schmidt and Hunter (mis)directed much of their criticism to a consistency (correlational) index of interrater reliability. Kozlowski and Hattrup are also correct in stating that our intention was to suggest a measure of agreement, and not consistency, and that r_wg is an estimator of agreement. However, what cannot be done, at least not the way things are presently set up, is to follow Kozlowski and Hattrup's recommendation to sever all ties between interrater reliability and r_wg and to treat r_wg as strictly a measure of agreement with, in effect, no ties to classic measurement theory. It is not possible to follow this recommendation because r_wg is currently derived in terms of classic measurement theory as an interchangeability (agreement) index of interrater reliability.

Thus we had a decision to make. Do we attempt to defend the current statistical basis for r_wg, which in effect means the development of a spirited defense for our attempt to break out of the closed system of traditional measurement procedures in which true variance greater than zero is not conceptually possible in a single stimulus design? Or, do we forgo this defense in favor of deriving a less controversial statistical basis for r_wg, which as a by-product would legitimize (statistically) Kozlowski and Hattrup's (1992) recommendation to treat r_wg exclusively as a measure of agreement?

In regard to the first option, we agree that our procedure created strain for classic measurement theory, but we believe it to be a strain that could be accommodated. However, such accommodation would likely engender a long and tedious debate over the appropriate use and interpretation of agreement in relation to classic measurement terms and interrater reliability. Views in this area vary from the position that some forms of agreement are legitimately treated in the context of interrater reliability (cf. Shrout & Fleiss, 1979) to the position that a clear demarcation should be made between interrater reliability (typically a consistency index) and agreement (typically an interchangeability index; cf. Tinsley & Weiss, 1975; see also Algina, 1978; Bartko, 1976, 1978; Lawlis & Lu, 1977; Mitchell, 1979; Shrout & Fleiss, 1979; Stine, 1989). Moreover, such a debate would likely produce apprehensiveness in potential users, which defeats the purpose of proposing a needed statistic.

Thus, we decided to exercise the second option by recasting r_wg as an estimator of interrater agreement without relying on true variance or equations from classic measurement theory. Our decision to adopt this approach occurred before publication of the Kozlowski and Hattrup (1992) article and was influenced by a reviewer who suggested that precedents have already been set by generalizability theory (cf. Cronbach, Gleser, Nanda, & Rajaratnam, 1972) and item response theory (cf. Hulin, Drasgow, & Parsons, 1983; Lord, 1980) for moving beyond classical test theory to discuss measurement characteristics; and that by viewing r_wg in nonclassical terms, it is possible to examine critically this statistic on its own merits, which Schmidt and Hunter (1989) did not really do. Publication of the Kozlowski and Hattrup article reinforced our decision to proceed in this manner. Indeed, this article and the Kozlowski and Hattrup article mutually reinforce one another in several ways. In particular, the statistical basis furnished in this article is balanced by Kozlowski and Hattrup's critical examination of Schmidt and Hunter's inappropriate treatment of r_wg as a consistency index; empirical demonstrations of r_wg's usefulness as a measure of agreement and of problems with Schmidt and Hunter's suggested alternatives; and underscoring of r_wg's unique contribution to agreement indices, namely procedures for adjusting estimates of r_wg for errors engendered by response biases. These treatments in Kozlowski and Hattrup permitted us to focus on recasting r_wg as an estimator of interrater agreement without reference to interrater reliability.

Recasting the Interrater Agreement Index

All definitions and assumptions pertaining to scaling and error distributions are the same as those presented in 1984. In general, prior illustrations, caveats, and recommendations also remain applicable. The key changes introduced here pertain to the recasting of the statistical basis for r_wg and a discussion of the bounds and limitations of r_wg as an index of interrater agreement.

We shall need to review definitions of several terms to initiate the recasting process. First, remember that S_x^2 is the observed variance on rating variable x, with x representing, for example, judgments of the overall publishability of a single manuscript by a set of reviewers and editors. When judges are in complete agreement, S_x^2 = 0. However, measurement errors may occur, which would engender lack of agreement among the judges and produce an S_x^2 > 0. Because S_x^2 arises only from variation in errors, it is referred to as error variance. To develop a statistic that estimates degree of agreement among judges, it is necessary to have a benchmark to compare to S_x^2. Inasmuch as S_x^2 > 0 reflects departure from perfect agreement, we adopted a benchmark that reflected the expected value of S_x^2 in a condition in which judgments are due exclusively to random measurement errors. This expected variance is referred to as σ_E^2, and we previously used the equation for a discrete uniform distribution to determine σ_E^2 (referred to as σ_EU^2). However, the r_wg statistic is not conditional on the use of σ_EU^2. Therefore, we shall use the general σ_E^2 term to indicate random responding, whatever the ensuing distribution. The important point regarding σ_E^2 is that it is a theoretical benchmark for responses attributable totally to random measurement errors.

We shall now assess the extent to which the actual ratings given by judges on a single target resemble the theoretical benchmark for random responding. This assessment consists of ascertaining the degree to which the observed ratings reflect a reduction in error variance relative to the theoretical benchmark. The term reduction in error variance refers to the degree to which the observed error variance (S_x^2) indicates a decrease in the variation of judgments relative to σ_E^2 and is estimated by the difference between σ_E^2 and S_x^2, or σ_E^2 - S_x^2. To illustrate, if judges' ratings are due exclusively to random influences, then we would expect that their ratings would show little reduction in error variance relative to σ_E^2. Indeed, we might find that S_x^2 ≅ σ_E^2 or that σ_E^2 - S_x^2 ≅ 0, which indicates that there has been essentially no reduction in error variance relative to σ_E^2 and that judges do not agree. At the other extreme, we have a case in which judges agree perfectly and S_x^2 = 0. This condition connotes a large reduction in error variance relative to σ_E^2, as shown by a high value for σ_E^2 - S_x^2, which is

    σ_E^2 - 0 = σ_E^2.

We will now convert the scale for reduction of error variance, which varies from 0 to σ_E^2, to a more interpretable scale that in most cases varies from 0 to 1.0. This rescaling involves dividing the actual reduction in error variance by the maximum possible reduction in error variance, the latter value being σ_E^2 on the basis of the previous discussion. The ensuing equation is

    r_wg = (σ_E^2 - S_x^2)/σ_E^2 = 1 - (S_x^2/σ_E^2),   (1)

where r_wg is an interrater agreement index defined as the proportional reduction in error variance. Note that this equation is the same as that given in 1984 for r_wg.

To illustrate interpretation of r_wg, it was shown previously that S_x^2 = σ_E^2 indicates lack of agreement inasmuch as no reduction in error variance has occurred. In this situation, r_wg ≅ 0/σ_E^2 = 0, which denotes that the proportional reduction in error variance is approximately zero, and thus there is no agreement among judges. On the other hand, if judges agree totally (S_x^2 = 0), then r_wg = σ_E^2/σ_E^2 = 1.0. The ratio σ_E^2/σ_E^2 is interpreted to mean the reduction in error variance is proportionately equal to the reduction expected in a condition of perfect agreement.
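
A brief computational sketch may help fix the notation of Equation 1. The Python function below is our own illustration rather than part of the original article; it assumes the judges' ratings of the single target are supplied as a sequence, that the analyst supplies the null variance σ_E^2 expected under purely random responding, and that S_x^2 is the sample variance (division by nk - 1, as noted in the illustration given later).

import numpy as np

def rwg(ratings, sigma_e_sq):
    # Equation 1: r_wg = (sigma_E^2 - S_x^2) / sigma_E^2 = 1 - S_x^2 / sigma_E^2,
    # the proportional reduction in error variance relative to random responding.
    s_x_sq = np.var(ratings, ddof=1)   # observed error variance S_x^2 (sample value)
    return 1.0 - s_x_sq / sigma_e_sq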

Assumptions and Concerns

James et al. (1984) discussed a number of situations in which r_wg might be used and provided some assumptions and caveats that should be considered before using r_wg. In the interest of space, these points will not be repeated here. However, three additional issues stimulated by the recasting of r_wg as a heuristic agreement index do require attention. First, r_wg should not be used in equations that require classically based estimators of reliability. For example, it would not be appropriate to use r_wg in correction for attenuation equations.

Second, certain liabilities may be inherent in the use of an agreement index that is sensitive to the actual values of ratings furnished by judges (i.e., rater effects). A key issue here is that, on subjective scales in particular, judges may impute different meanings to identical segments of the scale (cf. Cronbach, 1946, 1950; Guilford, 1954; Stine, 1989; Winer, 1971). This is a form of differential response bias (e.g., response sets and styles that vary among judges), which is to be distinguished from the general response tendencies for all or most judges discussed by James et al. (1984). The result of differential response bias is to introduce a judge's "interpretation effect" as a cause of variance (cf. Winer, 1971), which serves to increase S_x^2 and thus bias r_wg in a negative direction. This finding suggests that r_wg is best used in situations in which judges are believed to be interpreting the rating scale in a similar manner.

Third, although it is possible to anticipate some questions and concerns such as those mentioned above, the actual behavior of r_wg over studies is an empirical question that will be better answered in future years when empirical data have accumulated (although several researchers have already found the procedure to be useful; see Hater & Bass, 1988; Kozlowski & Hattrup, 1992; Kozlowski & Hults, 1987; Schneider & Bowen, 1985). For the present we shall have to rely on a rational argument to respond to Schmidt and Hunter's (1989) assertion that r_wg "may have no meaningful interpretation" (p. 369).

Consider a situation in which 10 judges have rated a manuscript on a 5-point Likert scale ranging from strongly disagree (1) to strongly agree (5) in regard to whether the manuscript should be published. On the basis of James et al. (1984), we assume that σ_E^2 is best represented by the variance of a discrete, uniform distribution and thus is equal to 2.0. Illustrations are now easily developed for comparing various values of S_x^2 to σ_E^2 and then deciding whether S_x^2 is consistent with a particular level of interrater agreement. For example, if all 10 judges rate the manuscript a 5, indicating a high degree of publishability, then S_x^2 = 0 and r_wg = (2.0 - 0)/2.0 = 1.0, which denotes perfect agreement. If one half of the judges give the manuscript a 5 and the other half give it a 4, then S_x^2 = .28 and r_wg = (2.0 - .28)/2.0 = .86, which suggests a high but not perfect level of interrater agreement. However, if three judges each rate the manuscript a 5, a 4, and a 3 and one judge rates the manuscript a 2, then S_x^2 = 1.067 and r_wg = (2.0 - 1.067)/2.0 = .47, which, correctly we believe, suggests a reasonably low level of interrater agreement. Finally, if the judges appear to have responded randomly to the scale, such that two judges are represented at each value on the scale, then S_x^2 = 2.22 and r_wg = (2.0 - 2.22)/2.0 = -.11, which is set equal to .00 (cf. James et al., 1984; S_x^2 and σ_E^2 are unequal because the former is a sample value, based on division by [nk - 1], whereas the latter is a population value, based theoretically on division by nk).

It would seem that the preceding values of r_wg are consistent with the concept of interrater agreement and expectations regarding what point estimates of agreement should be for the stated conditions. Kozlowski and Hattrup (1992) furnish additional empirical illustrations that support this reasoning.
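
The four illustrative values above can be reproduced with a short script. The following Python sketch is our own check, not part of the original article; it assumes σ_E^2 = 2.0, the variance of a discrete uniform distribution over the five scale points ((A^2 - 1)/12 with A = 5, following James et al., 1984), and computes S_x^2 with division by nk - 1 = 9 as described above.

import numpy as np

def rwg(ratings, sigma_e_sq):
    # Equation 1: proportional reduction in error variance.
    return 1.0 - np.var(ratings, ddof=1) / sigma_e_sq

A = 5                             # scale points: strongly disagree (1) to strongly agree (5)
sigma_eu_sq = (A**2 - 1) / 12.0   # discrete uniform null variance = 2.0

cases = {
    "all ten judges rate 5":           [5] * 10,
    "five rate 5, five rate 4":        [5] * 5 + [4] * 5,
    "three each at 5, 4, 3; one at 2": [5, 5, 5, 4, 4, 4, 3, 3, 3, 2],
    "two judges at each scale point":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
}

for label, ratings in cases.items():
    value = rwg(ratings, sigma_eu_sq)
    # Negative values are set to .00, as described above for the random-response case.
    print(f"{label}: r_wg = {value:.2f} (reported as {max(value, 0.0):.2f})")

# Expected output: 1.00, 0.86, 0.47, and -0.11 (reported as 0.00), matching the text.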

Conclusion

The r_wg statistic proposed by James et al. (1984) does not conform to standard measurement theory. Nevertheless, we believe the theory underlying r_wg and the interpretation of r_wg as an interrater agreement index to be logical, legitimate, and meaningful. To assist in affirming this position, we have furnished a derivation and interpretation of r_wg as a measure of interrater agreement that is not formally tied to classic measurement theory. This derivation and interpretation serve to satisfy Kozlowski and Hattrup's (1992) recommendation that r_wg be used as an indicator of interrater agreement but not interrater reliability.

We do not wish to convey the impression that r_wg is without fault or is beyond debate. We believe that there are a number of unresolved and debatable issues pertaining to r_wg, such as whether theoretical error distributions should be characterized as uniform versus triangular or normal. Moreover, the conditions in which an interrater agreement index is not useful need to be more clearly specified. Thus, much is left to be done, and we hope that future research can be carried out in a constructive manner.

References

Algina, J. (1978). Comment on Bartko's "On various intraclass correlation reliability coefficients." Psychological Bulletin, 85, 135-138.
Bartko, J. J. (1976). On various intraclass correlation reliability coefficients. Psychological Bulletin, 83, 762-765.
Bartko, J. J. (1978). Reply to Algina. Psychological Bulletin, 85, 139-140.
Cronbach, L. J. (1946). Response sets and test validity. Educational & Psychological Measurement, 6, 475-494.
Cronbach, L. J. (1950). Further evidence on response sets and test design. Educational & Psychological Measurement, 10, 3-31.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability of scores and profiles. New York: Wiley.
Finn, R. H. (1970). A note on estimating the reliability of categorical data. Educational and Psychological Measurement, 30, 71-76.
Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Hater, J. J., & Bass, B. M. (1988). Superiors' evaluations of subordinates' perceptions of transformational and transactional leadership. Journal of Applied Psychology, 73, 695-702.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85-98.
Kozlowski, S. W. J., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161-167.
Kozlowski, S. W. J., & Hults, B. M. (1987). An exploration of climates for technical updating and performance. Personnel Psychology, 40, 539-563.
Lahey, M. A., Downey, R. G., & Saal, F. E. (1983). Intraclass correlations: There's more there than meets the eye. Psychological Bulletin, 93, 586-595.
Lawlis, G. F., & Lu, E. (1977). Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 78, 17-20.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Mitchell, S. K. (1979). Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychological Bulletin, 86, 376-390.
Schmidt, F. L., & Hunter, J. E. (1989). Interrater reliability coefficients cannot be computed when only one stimulus is rated. Journal of Applied Psychology, 74, 368-370.
Schneider, B., & Bowen, D. E. (1985). Employee and customer perceptions of service in banks: Replication and extension. Journal of Applied Psychology, 70, 423-433.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Stine, W. W. (1989). Interobserver relational agreement. Psychological Bulletin, 106, 341-347.
Tinsley, H. E. A., & Weiss, D. J. (1975). Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22, 358-376.
Winer, B. J. (1971). Statistical principles in experimental design. New York: McGraw-Hill.

Received February 19, 1990
Revision received August 7, 1992
Accepted August 10, 1992
