
Interrater Agreement Reconsidered:
An Alternative to the rwg Indices

REAGAN D. BROWN
Western Kentucky University

NEIL M. A. HAUENSTEIN
Virginia Polytechnic Institute and State University

For continuous constructs, the most frequently used index of interrater agreement
(rwg(1)) can be problematic. Typically, rwg(1) is estimated with the assumption that a
uniform distribution represents no agreement. The authors review the limitations
of this uniform null rwg(1) index and discuss alternative methods for measuring
interrater agreement. A new interrater agreement statistic, awg(1), is proposed. The
authors derive the awg(1) statistic and demonstrate that awg(1) is an analogue to Co-
hen’s kappa, an interrater agreement index for nominal data. A comparison is
made between agreement estimates based on the uniform rwg(1) and awg(1), and issues
such as minimum sample size and practical significance levels are discussed. The
authors close with recommendations regarding the use of rwg(1) / rwg(J) when a uni-
form null is assumed, rwg(1) / rwg(J) indices that do not assume a uniform null, awg(1) /
awg(J) indices, and generalizability estimates of interrater agreement.

Keywords: interrater agreement; interrater reliability; kappa

The assessment of interrater agreement is a critical issue in many areas, subsuming
both applied and theoretical topics. Topic areas that address interrater agreement
include job analysis (Lindell, Clause, Brandt, & Landis, 1998), performance appraisal
(Bozeman, 1997; Fleenor, Fleenor, & Grossnickle, 1996; Schrader & Steiner, 1996),
procedural justice (Mossholder, Bennett, & Martin, 1998), group performance (Eby &
Dobbins, 1997; Hyatt & Ruddy, 1997; Mulvey & Ribbens, 1999; Neuman & Wright,
1999), and perceptions of work environments (Klein, Conn, Smith, & Sorra, 2001).
Authors' Note: We thank Andrea Sinclair and John Donovan for their helpful comments and suggestions
on the article. An earlier version of this work was presented at the 15th Annual Conference of the Society for
Industrial and Organizational Psychology, New Orleans, Louisiana, April 2000. Correspondence concerning
this article should be addressed to Reagan D. Brown, Department of Psychology, Western Kentucky University,
Bowling Green, KY 42101; e-mail: reagan.brown@wku.edu.
Organizational Research Methods, Vol. 8 No. 2, April 2005 165-184
DOI: 10.1177/1094428105275376
© 2005 Sage Publications

There are several interrater agreement indices that can be used when raters are
assigning stimuli to nominal categories (e.g., J. Cohen, 1960, for two raters rating
multiple stimuli; Lawshe, 1975, for multiple raters rating a single dichotomous stimulus;
Li & Lautenschlager, 1997, for multiple raters rating multiple stimuli). However,
ambiguity arises concerning the degree of allowable disagreement among raters when
rating (i.e., the manifest variable) a target on a continuous construct (i.e., the latent
variable) (see Note 1). Our discussion of agreement will be limited to the issue of subjective rat-
ings of continuous constructs. Further difficulty arises when evaluating the quality of
the agreement estimate for a single stimulus. Because there is no true score variance
when a single stimulus is rated, psychometric theory does not underpin any of the
single-stimulus agreement indices (F. L. Schmidt & Hunter, 1989). Thus, the position that
one agreement index is superior to another is based on the argument that the superior
agreement index more closely captures consensus than other agreement indices do.
When judging a single stimulus, the simplest approach is to use a measure of the
observed variability (e.g., standard deviation) of the judges’ ratings as an index of
agreement (F. L. Schmidt & Hunter, 1989). Unfortunately, the variability of a set of
ratings is confounded with the scale of measurement (e.g., 5-point scale vs. 7-point
scale), which makes it difficult to work with observed variability as an index of agree-
ment (Kozlowski & Hattrup, 1992). Most other interrater agreement indices rely on
transformations of the observed variability of the ratings. (Burke and colleagues
[Burke & Dunlap, 2002; Burke, Finkelstein, & Dusig, 1999] recently proposed an
interrater agreement index based on average deviations from the mean/median.) The
indices that transform the observed variability of ratings subtract from one the ratio of
the observed variance to the variance of a null distribution of ratings. The shape of the
null distribution is the primary determinant of the quality of the agreement estimate. If
the null distribution fails to model disagreement properly, then the interpretability of
the resultant agreement coefficient is suspect. The most frequently used agreement
statistic is Finn’s (1970) rwg(1), for which a uniform (i.e., rectangular) distribution is
used as the null. Alternative null distributions for rwg(1) have been proposed by other
researchers and will be addressed.
With the uniform null, all scale values are rated with equal frequency (i.e., uniform
disagreement). The formula for rwg(1) is given below.

r_{wg(1)} = 1 - (s_x^2 / \sigma_{EU}^2),   (1)

where s_x^2 is the variance of ratings for a single stimulus, \sigma_{EU}^2 = (A^2 - 1)/12, and A is the
number of scale response categories.
Thus, as originally proposed and as commonly computed, rwg(1) estimates interrater
agreement by comparing the observed variance for a set of ratings to the variance of a
uniform distribution.
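
As a minimal illustration (ours, not from the article), Equation (1) can be computed as follows; the function name and example ratings are hypothetical, and the unbiased sample variance is used in the numerator, as discussed next.

```python
from statistics import variance  # unbiased sample variance (n - 1 denominator)

def rwg1_uniform(ratings, a):
    """Uniform-null r_wg(1) of Equation (1) for one stimulus rated on an A-point scale."""
    s2_x = variance(ratings)          # observed variance of the judges' ratings
    sigma2_eu = (a ** 2 - 1) / 12.0   # variance of the uniform (rectangular) null
    return 1.0 - s2_x / sigma2_eu

# Five raters using a 5-point scale
print(rwg1_uniform([3, 3, 4, 3, 4], a=5))  # 0.85
```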
It is important to note that for rwg(1), as proposed by Finn (1970) and as currently
computed (Lindell & Brandt, 1999), the observed variance of the ratings is computed
using the equation for an unbiased estimate of population variance (i.e., using the n – 1
denominator, hereafter referred to as the sample variance equation), whereas the vari-
ance of the uniform distribution is computed with the population variance equation.
Finn (1970) did not specify a range for rwg(1) but did state that 1 indicates perfect agree-
ment and 0 indicates no agreement (i.e., uniform disagreement). In cases for which rat-
ing dissensus exceeds uniform disagreement, rwg(1) will be negative. Although some
researchers (James, Demaree, & Wolf, 1984) do not interpret negative values of rwg(1)
and state that values less than zero should be set to zero, others (Lindell & Brandt,
2000) have argued that it is important to examine agreement even when it exceeds uni-
form disagreement (e.g., extreme disagreement can be very informative of organiza-
tional climate issues). In short, truncation of the range of rwg(1) values results in the loss
of useful information.
Finally, it is important to recognize that revisions to Finn’s (1970) rwg(1) statistic
have been proposed. James et al. (1984) proposed the null distribution be modeled to
differentiate systematic error from random error when estimating agreement. For
example, if a researcher has evidence that a positive leniency bias may be affecting
most of the raters, then the null distribution would be negatively skewed. As such, it is
best to think of rwg(1) as a family of interrater agreement statistics that varies as a func-
tion of the shape of the null distribution. Lindell, Brandt, and Whitney (1999) pro-
posed another agreement statistic, r*wg(1); however, in its single target form, r*wg(1)
reduces to rwg(1).

Problems With Current Variance-Based Agreement Indices

There are two problems with the current interrater agreement indices that use
observed variability in ratings as the indicator of agreement and a single population
variance estimate of the null distribution: scale dependency and the effect of sample
size on the magnitude of the agreement index. We will discuss these two problems in
relation to the rwg(1) family because of its popularity. First, the lower bound of any rwg(1)
index will be a function of the number of scale anchors. For example, Lindell and
Brandt (1997) showed that the use of the rectangular null results in the lower bound of
rwg(1) being scale dependent as a function of the number of rating scale anchors. How-
ever, this problem is not limited to using a rectangular null. As such, equal rwg(1) values,
regardless of the assumed variance of the null distribution, that are less than zero com-
puted from a 5-point rating scale versus a 9-point scale do not necessarily represent
comparable levels of agreement. In addition, if two rating scales have the same range
(e.g., 1-5) but have different numbers of scale anchors available (e.g., integers only vs.
integers and half-unit values), then rwg(1) values are again scale dependent, regardless
of whether the coefficients are positive or negative.
The second problem with the rwg(1) family of indices is that sample sizes influence
rwg(1) values and their interpretability (Kozlowski & Hattrup, 1992; Lindell et al.,
1999). This occurs because rwg(1) is computed as 1 minus the ratio of the observed sam-
ple variance to the population variance of the uniform distribution. The observed sam-
ple variance in the numerator of rwg(1) is influenced by sample size (i.e., for a fixed sum
of the squared deviations from the mean, the observed variance increases as sample
size decreases), and dividing the sample variance by the population variance of the
uniform distribution passes the sample size dependency onto the rwg(1) statistic (James
et al., 1984). Assuming a constant pattern of ratings, rwg(1) values based on small sam-
ples are smaller than rwg(1) values based on large samples. Thus, it is more likely that
agreement will be judged as greater for the larger samples, independent of the true
level of consensus. In addition, two different patterns of ratings that differ in levels of
agreement can produce the same rwg(1) values as a function of differences in sample
sizes. Finally, even in cases for which the agreement from a sample of raters is general-
ized to a population of raters, the mismatch of the type of variance equations is still
troubling. The range of rwg(1) values is affected by the mismatch, and as such, the
interpretability of the resultant coefficients is also affected. As an illustration, consider
the case for which five raters each choose a different anchor from a 5-point scale for
their rating of a stimulus. Their ratings clearly represent uniform disagreement, which
should yield an rwg(1) of zero. Because of the mismatch of variance equations, rwg(1)
equals –.25. Thus, rwg(1) coefficients indicating uniform disagreement and perfect
agreement no longer range from 0 to 1. As a point of comparison, consider coefficient
alpha, a statistic that also is based on ratios of variances. In both the numerator and
denominator of this ratio, the variances are computed using the same equation (i.e., the
sample variance equation). Although these two problems are troubling, their effects
can be anticipated and in some cases alleviated with adjustments to the rwg(1) equation.
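
A small sketch (ours) of the five-rater example above, contrasting the mismatched computation with one that uses the population variance in both the numerator and the denominator:

```python
from statistics import pvariance, variance  # population (n) and sample (n - 1) variances

ratings = [1, 2, 3, 4, 5]        # five raters each choose a different anchor on a 5-point scale
sigma2_eu = (5 ** 2 - 1) / 12.0  # uniform-null (population) variance = 2.0

# As rwg(1) is typically computed: sample variance over the population null variance
print(1 - variance(ratings) / sigma2_eu)   # -0.25, despite uniform disagreement

# With matched (population) variance equations, uniform disagreement yields 0
print(1 - pvariance(ratings) / sigma2_eu)  # 0.0
```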

Problems With the Uniform Null Distribution Assumption

Although rwg(1) is a family of agreement indices, the uniform rwg(1) is the most fre-
quently used null distribution. The uniform null distribution assumes that if there is no
variance related to agreement, then raters disagree uniformly. There are only two con-
ditions for which this assumption of uniform disagreement is valid. The first is if there
is no shared rating bias (either motivational based or cognitively based) among raters
and the raters function as random number generators. The alternative is if those biases
present in the ratings offset each other. That is, for every lenient rater, there is also a
rater making a compensatory severe rating.
Finn (1970) assumed that raters are free from rating bias and therefore produce
entirely random ratings. However, James et al. (1984) noted that the assumption that
random error is the only source of error variance is unrealistic. They argued that sys-
tematic biases affect raters, and they further outlined several factors “that render the
expected distribution nonuniform when the true interrater reliability is zero” (p. 89).
Although theoretically, it might be argued that raters asked to simply assign numbers
to a rating scale in the absence of any stimulus will produce a uniform distribution, the
reality is that ratings are gathered in relation to both a stimulus and a situation. The
presence of the stimulus alone is likely to cause some rating bias to be present, which
can readily be exacerbated by the situation (Schriesheim, 1981). Moreover, even if
there is no shared rating bias among the raters, it is extremely unlikely that the raters
will function as random number generators (and produce the uniform distribution)
when they do not agree. A. M. Schmidt and DeShon (2003) collected ratings of nonex-
istent constructs as well as other scenarios for which the ratings should be random and
found that the ratings did not display a uniform distribution. Rather, they found distri-
butions of ratings that resembled a negatively skewed bell-shaped distribution. The
distributions they found have less variability than the uniform null distribution, and as
a result, use of the uniform null distribution will result in an overestimate of
agreement.
If rating bias is present, then the only way to maintain belief in the validity of the
uniform null distribution assumption is to postulate that compensatory rating biases
are operating. That is, rating bias is present, but it mimics the effects of random error
on a set of ratings, thereby producing the uniform distribution. Such a postulation is
almost certainly indefensible. Researchers have both proposed (Feldman, 1981;
Wherry & Bartlett, 1982) and empirically demonstrated (Borman, 1987; Hauenstein
& Alexander, 1991) that there exist large commonalities among different raters' preexisting
perceptions of stimuli categories. Such shared perceptions among raters make it
unlikely that a perfect, compensating mechanism for rater bias can exist. The bottom
line is that the assumption that a uniform distribution properly models “no agreement”
is likely invalid in many situations in which ratings are rendered (James et al., 1984).
This concern is also echoed in the interrater reliability literature. Recently, Murphy
and DeShon (2000) argued that any application of classical test theory to interrater
reliability is flawed. They argued that the assumption that observed ratings are a func-
tion of true score variance and random error variance, the latter of which is orthogonal
to the true score variance, is untenable because ratings can be influenced by shared
goals and biases, shared perceptions of the organization and of the appraisal systems,
shared frames of reference, and shared relationships with ratees.
If the uniform null distribution assumption is violated, then the denominator of
rwg(1) is biased, which results in a biased estimate of interrater agreement. A further
complication to the interpretation of rwg(1) values biased in this manner is that the
observed mean of a set of ratings is related to the range of possible observed variances
for the ratings. The form of the relationship is curvilinear and resembles an inverted U
shape. This relationship can be understood by realizing that at the scale midpoint, the
range of observed variances is maximized and has an upper limit determined by the
square of the difference between the highest scale point and the scale midpoint. Natu-
rally, the lower limit is zero. In contrast, when the mean of a set of ratings is equal to
either endpoint of the rating scale, the observed variance can only be zero.
If the assumption of the uniform null distribution is valid, then the relationship
between the mean and maximum variance of a set of ratings (and the associated vary-
ing lower bound of rwg(1)) is simply a meaningless by-product of interrater agreement.
In such a case, the extent to which the mean of a set of ratings deviates from the scale
midpoint is in and of itself a reflection of increasing interrater agreement. The conse-
quences of using rwg(1) when the uniform null distribution is valid are limited to the two
problems listed above. However, if the uniform null distribution assumption is invalid
(most likely due to shared rating biases among raters), then the consequences of the
relationship between the mean and the range of observed variances for rwg(1) are
twofold.
First, in most situations, rwg(1) will be confounded with the observed mean rating,
and it is impossible to separate the extent to which an rwg(1) value is a function of the
confound with observed mean from the extent to which it represents actual interrater
agreement. The presence of rwg(1)’s confound with the mean rating can be seen in the
fact that the lower bound of rwg(1) is not constant across the entire range of observed
means, which also means that the range of possible rwg(1) values varies depending on
the mean of the observed ratings. For those means in which the maximum possible
variance is less than the denominator of rwg(1) (i.e., means toward the extremes of the
scale), the lowest value of rwg(1) must be greater than zero. In other words, the maxi-
mum possible disagreement that raters can exhibit for these mean locations is less than
the uniform null distribution’s standard of uniform disagreement. In contrast, for those
means in which the maximum observed variance is greater than the denominator of
rwg(1) (i.e., means nearer the middle of the scale), the lowest value of rwg(1) is less than
zero. Thus, the range of possible rwg(1) values decreases as the observed mean moves
from the center of the rating scale toward the extremes. The implication of this con-
found is that the uniform disagreement standard, although a constant value, actually
covaries in terms of stringency with the observed mean. The uniform disagreement
standard is more stringent for means nearer the scale midpoint and more lenient for
means nearer the scale extremes. As such, two equal rwg(1) values derived from
different observed means do not represent comparable levels of agreement.
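
A brief numerical sketch (ours, not from the article) of this first consequence: two sets of ten ratings on a 5-point scale, each as polarized as its mean allows, nevertheless receive very different uniform-null rwg(1) values.

```python
from statistics import variance  # unbiased sample variance (n - 1 denominator)

def rwg1_uniform(ratings, a):
    """Uniform-null r_wg(1) as typically computed (Equation 1)."""
    return 1 - variance(ratings) / ((a ** 2 - 1) / 12)

# Both rating sets show the maximum disagreement possible at their respective means
midpoint_mean = [1] * 5 + [5] * 5  # mean 3.0 (scale midpoint)
extreme_mean = [1] * 9 + [5]       # mean 1.4 (near the low end of the scale)

print(rwg1_uniform(midpoint_mean, a=5))  # about -1.22
print(rwg1_uniform(extreme_mean, a=5))   # 0.20, although raters disagree as much as possible
```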
The second consequence of the relationship between the mean and the range of
observed variances is that the threshold for a meaningful level of agreement is difficult
to establish and defend because the probability of meeting or exceeding any criterion
value varies as a function of the observed mean. For means nearer the scale midpoint,
the probability of failing the minimum agreement criterion is greater than the proba-
bility of failing for means nearer the extremes.
As previously mentioned, alternative distributions can be considered for use as the
null distribution (James et al., 1984). However, the James et al. (1984) procedure is dif-
ficult to apply because of problems with accurately identifying the form of the null dis-
tribution. As a result, it has not gained wide acceptance. There have, however, been
studies in the domain of multirater feedback systems in which interrater agreement
was estimated using a negatively skewed null distribution (Johnson & Ferstl, 1999;
Walker & Smither, 1999).

Overcoming All the Limitations

To summarize, there are three total limitations with the rwg(1) family of interrater
agreement indices: scale dependency, sample size dependency, and bias caused when
erroneously assuming the uniform null is valid. Although all three limitations are
problematic, the uniform distribution assumption is the most serious. In response to
these problems, we develop a new interrater agreement statistic primarily motivated
by the desire to eliminate reliance on the a priori specification of the null distribution.
In doing so, we developed the new statistic in a manner that also eliminates the
scale dependency and sample size dependency problems.
Estimating agreement without being able to accurately specify the null distribution
or without using a uniform null distribution requires an alternative way of thinking
about the null distribution. An existing alternative is to adopt the notion of estimat-
ing agreement using multiple null distributions. An agreement statistic currently
exists that uses the notion of multiple null distributions, although it was designed for
an analysis structured in a different manner. J. Cohen’s (1960) kappa is used to esti-
mate the agreement between two raters rating multiple stimuli on a categorical scale.
Alternatively, our discussions have focused on the agreement among multiple raters
rating a single continuous construct. Because of the differences in the type of agree-
ment analysis, one statistic cannot be used in the other situation. However, the princi-
ples found in kappa can be used to model agreement when multiple judges are rating a
single stimulus.
Kappa is computed as the ratio of the percentage of cases in agreement minus a null
agreement standard to 1 minus the null agreement standard. Kappa’s null agreement is
simply the sum, across categories, of the products of the proportions each rater assigns to a given category. Appli-
cations of kappa to evaluations of dichotomous stimuli allow one to see that kappa’s
null is a function of the mean rating (i.e., p). Under the same conditions for which
kappa equals the phi correlation (dichotomous data, same mean rating for both raters),
departures from a .50 mean rating for each of the two raters result in an increase in the
null agreement level. Thus, for more extreme mean ratings, it becomes more difficult

Table 1
Examples of Kappa When the Marginal Frequencies Are Split 50/50

                         Rater A
Rater B                1        0

Kappa and phi = 1.0
  1                    5        0
  0                    0        5
Kappa and phi = .6
  1                    4        1
  0                    1        4
Kappa and phi = .2
  1                    3        2
  0                    2        3

Note. Kappa's null agreement for all sets of ratings = .5.

to achieve high agreement given the same number of observed disagreements between
raters. To demonstrate this phenomenon, we present examples of kappa coefficients
for raters exhibiting different levels of agreement for ratings of 10 stimuli. The critical
difference is that in Table 1, the marginal probabilities are both .5, but in Table 2, the
marginal probabilities are split .8/.2. In each table, we start with perfect agreement and
then proceed to move one rating from each cell in the diagonal cells (the two cells indi-
cating agreement) to the two cells in the off diagonal (the two cells indicating disagree-
ment). We continue this process for a second iteration. As can be seen in Tables 1 and
2, when agreement is perfect, kappa and phi are 1.0 regardless of the mean rating.
However, the same degree of observed variation is more heavily penalized by kappa
and the phi correlation when the mean rating deviates from equal proportions per
category.
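
A sketch (ours; the helper name is hypothetical) of kappa for a 2 x 2 table, reproducing the values shown in Tables 1 and 2:

```python
def kappa_2x2(table):
    """Cohen's kappa for a 2 x 2 contingency table of two raters' judgments.

    table[i][j] = number of stimuli rater B placed in category i and rater A in category j.
    """
    n = sum(sum(row) for row in table)
    p_observed = (table[0][0] + table[1][1]) / n               # proportion of agreements
    row = [sum(table[i]) / n for i in range(2)]                # rater B marginals
    col = [(table[0][j] + table[1][j]) / n for j in range(2)]  # rater A marginals
    p_null = sum(row[i] * col[i] for i in range(2))            # chance (null) agreement
    return (p_observed - p_null) / (1 - p_null)

print(kappa_2x2([[4, 1], [1, 4]]))   # 0.60  (Table 1, 50/50 marginals)
print(kappa_2x2([[7, 1], [1, 1]]))   # 0.375 (Table 2, 80/20 marginals)
print(kappa_2x2([[6, 2], [2, 0]]))   # -0.25 (Table 2)
```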
The question remains as to whether the pattern of results generated by kappa’s null
agreement standard is actually desirable. A fixed 50% null agreement for kappa (i.e., a
uniform distribution of ratings across the response categories for dichotomous vari-
ables) would be far less complex, and perhaps more logical, than the equation defined
by J. Cohen (1960). Although a constant null agreement is indeed less complex, the
observed agreement rates are heavily influenced by the base rates of the behaviors
observed (Suen & Ary, 1989). Thus, tailoring the null agreement level to the base rate
of behavior is necessary to accurately assess agreement beyond the chance level.
The logic of kappa can be extended to create an interrater agreement index of rat-
ings intended to measure continuous constructs. Although previous researchers
(Berry & Mielke, 1988; Janson & Olsson, 2001) have extended kappa to assess the
agreement of ratings of continuous constructs across multiple rating targets, kappa’s
principles have never been applied to the single target situation. We propose an
interrater agreement index for ratings of continuous constructs that is not dependent
on one specification of the null distribution, thereby providing an alternative estimate
of interrater agreement that addresses the limitations raised above. We label this new
index awg(1). We chose to use a instead of r to convey the point that interrater agree-
ment is a measure of consensus, not consistency, and is not correlationally based
(Kozlowski & Hattrup, 1992).

Table 2
Examples of Kappa When the Marginal Frequencies Are Split 80/20

                         Rater A
Rater B                1        0

Kappa and phi = 1.0
  1                    8        0
  0                    0        2
Kappa and phi = .38
  1                    7        1
  0                    1        1
Kappa and phi = –.25
  1                    6        2
  0                    2        0

Note. Kappa's null agreement for all sets of ratings = .68.

Deriving the awg(1) Statistic

Multiple sets of ratings can yield the same mean, each with a different variance. For
the purposes of illustration, consider a 5-point scale for which four raters supply rat-
ings that yield a mean of 2.0. Table 3 displays all possible combinations of these rat-
ings and their associated variances. The maximum variance is seen in the last case.
Ratings with the maximum possible variance for a given mean can consist only of
combinations of extreme values (i.e., the minimum and maximum values of a scale).
This relationship holds for any scale, regardless of the number of scale points or actual
numerical values of the scale points (e.g., 1 to 9 vs. –4 to +4) due to the fact that vari-
ance is simply the average squared difference from the mean. Computation of the max-
imum possible variance at a scale mean begins with a calculation of the ratios of one
extreme scale value to the other. For means located at the midpoint of the scale (e.g.,
mean of 5 on a 9-point scale), the ratio is a simple one to one (i.e., one rating of 9 per
each rating of 1). For means less than the scale midpoint, the ratio will favor more of
the lowest scale value than the highest scale value. Means greater than the midpoint
will consist of more of the highest scale value than the lowest scale value. Thus, a for-
mula for the mean can be altered to reflect the ratios of the two extreme scale values.

M = \frac{b(H) + (k - b)L}{k},   (2)

where M is the observed mean rating, H is the maximum possible value of scale, L is
the minimum possible value of scale, b is the number of H ratings (k – b is the number
of L ratings), and k is the number of raters. Solving this formula for b yields the follow-
ing:

b = \frac{k(M - L)}{(H - L)}.   (3)

Table 3
All Combinations of Ratings From Four Raters Having a Mean of 2.0 on a 5-Point Scale

Stimulus    r(1)    r(2)    r(3)    r(4)    s_x^2

1            2       2       2       2      0.00
2            1       2       2       3      0.67
3            1       1       3       3      1.33
4            1       1       2       4      2.00
5            1       1       1       5      4.00

Note. Each r represents a hypothetical rater.

Equation (3) allows one to compute the number of highest scale value ratings needed
from k raters to generate any given mean M (k – b yields the number of lowest scale value
ratings). Computation of the maximum possible variance for a given mean (symbolized
as s_{mpv/m}^2) is a simple application of the variance equation to a situation for which
we have only two scores (H and L, in varying proportions).

s_{mpv/m}^2 = \frac{b(H - M)^2 + (k - b)(L - M)^2}{k - 1},   (4)

which simplifies to

s_{mpv/m}^2 = [(H + L)M - (M^2) - (H * L)] * [k/(k - 1)].   (5)

The maximum possible variance as defined by Equation (5) is the maximum vari-
ance of ratings at a given mean that could be obtained from a sample of k raters. It is
important to note that unlike rwg(1), we use a sample-based equation for the computa-
tion of both the observed variance and the maximum possible variance at a given
mean. The advantage of using matched equations is that awg(1) values have the same
range, regardless of the number of raters. If the situation arises in which the observed
variance reflects a population parameter, then the equations given above and below
can be modified to match by changing the denominators for both the observed vari-
ance computation and the maximum variance at a given mean from k – 1 to k.
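
A sketch (ours; hypothetical function names) of Equations (3) and (5), checked against the last row of Table 3:

```python
def n_high_ratings(mean, low, high, k):
    """Equation (3): the number of ratings at the scale maximum (b) such that
    b ratings of `high` and (k - b) ratings of `low` average to `mean`."""
    return k * (mean - low) / (high - low)

def max_possible_variance(mean, low, high, k):
    """Equation (5): maximum possible (sample) variance of k ratings with a given mean."""
    return ((high + low) * mean - mean ** 2 - high * low) * k / (k - 1)

# Four raters on a 1-5 scale with a mean of 2.0 (cf. Table 3)
print(n_high_ratings(2.0, 1, 5, 4))         # 1.0 -> one rating of 5 and three ratings of 1
print(max_possible_variance(2.0, 1, 5, 4))  # 4.0, matching the last row of Table 3
```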
To derive awg(1), the maximum possible variance at the mean is used as the null dis-
tribution. Agreement equals 1 minus the quotient of 2 times the observed variance
divided by the maximum possible variance.

a_{wg(1)} = 1 - \frac{2 * s_x^2}{[(H + L)M - (M^2) - (H * L)] * [k/(k - 1)]}.   (6)

As with other agreement indices, subtracting the quotient of the observed variance di-
vided by the maximum possible variance from 1 reorients the agreement index so that
1 indicates perfect agreement. The multiplication of the observed variance by 2 is arbi-
trary and is done to match the conceptual range of rwg(1) (see Note 2). An awg(1) of –1 indicates maximum
disagreement, and +1 indicates perfect agreement. An awg(1) of 0 indicates that
the observed variance in the ratings is 50% of the maximum variance at the observed
mean. In addition, awg(1) will equal rwg(1) when the mean rating equals the scale mid-
point as long as the variance equations are not mismatched for rwg(1). The ratio of the
observed variance to the maximum possible variance is a standardized variance. Note
that when the mean rating equals the highest or lowest possible rating (M = H or L), the
standardized variance component of Equation (6) results in a division by zero error, as
is true with kappa. Given that there is zero variance in these cases (all ratings are identi-
cal), standardized variance should also be set to zero, resulting in an awg(1) value of 1.0.
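
Putting the pieces together, a sketch (ours, not the authors' code) of awg(1) as defined in Equation (6), with the endpoint-mean edge case handled as described above:

```python
from statistics import variance  # unbiased sample variance (n - 1 denominator)

def awg1(ratings, low, high):
    """a_wg(1) of Equation (6) for one stimulus rated by k judges on a [low, high] scale."""
    k = len(ratings)
    m = sum(ratings) / k
    if m == low or m == high:    # all ratings identical at a scale endpoint
        return 1.0               # zero variance, so agreement is perfect
    max_var = ((high + low) * m - m ** 2 - high * low) * k / (k - 1)
    return 1.0 - 2.0 * variance(ratings) / max_var

print(awg1([3, 3, 3, 3], 1, 5))     # 1.0  (perfect agreement)
print(awg1([1, 1, 1, 5], 1, 5))     # -1.0 (maximum disagreement at a mean of 2.0)
print(awg1([1, 2, 3, 4, 5], 1, 5))  # 0.0  (observed variance is half the maximum at M = 3)
```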
To summarize, awg(1) is based on the principles found in kappa. Like kappa, awg(1)
estimates agreement as the proportion of observed agreement to the maximum dis-
agreement possible given the observed mean rating, which in practical terms means
that dissensus is more heavily penalized in the estimate of agreement if it occurs at the
extremes of the rating scale. The awg(1) index overcomes all the aforementioned limita-
tions of rwg(1). In short, two equal awg(1) values represent the same level of consensus,
regardless of whether the number of scale anchors, the number of raters, or
the location of the observed means varies. Furthermore, the likelihood of awg(1) passing
or failing a criterion to establish agreement is independent of the location of the
observed mean, unlike rwg(1).

Sample Size and awg(1)

As previously mentioned, rwg(1) has problems that, although routinely ignored,
affect the interpretability of the resultant agreement estimate. In contrast, we explicitly
recognize that depending on the number of raters, awg(1) will not achieve a consistent
interpretability for ratings with means near the extremes of the rating scale. This limi-
tation is related to the computation of maximum possible variance. Near the extremes
of the scale, the maximum possible variance for a given mean rating cannot include
both the highest and lowest scale values given the number of raters. As an illustration,
consider a 5-point scale used by 10 raters. No set of ratings with a mean of 1.3 can
include a single rating of 5, the maximum rating of the scale. The maximum rating that
will still result in a mean of 1.3 is the second highest point on the scale, 4. Although a
new version of maximum possible variance could be calculated for this mean (e.g.,
using the second-highest rating), this new version of the maximum possible variance
would represent a new standard that would cause agreement calculated with the two
equations of maximum possible variance to be incomparable. As a result, agreement
for ratings with means near the extreme locations of the scale cannot be quantified.
The portion of a scale in which agreement can be quantified is a direct function of the
number of raters. With a large sample of raters, nearly the entire scale range will pro-
duce an interpretable awg(1) value. A determination of the range of scale means that
allow for an interpretable awg(1) can be made using the following equations:

Minimum mean with interpretable awg(1) = [L(k – 1) + H]/k. (7)

Maximum mean with interpretable awg(1) = [H(k – 1) + L]/k. (8)


Equations (7) and (8) locate the lower and upper scale values beyond which awg(1) will
no longer range from –1 to 1. We recommend computing awg(1) only when the range of
scale locations that allow for an awg(1) with a –1 to 1 range includes at least the second-
lowest and the second-highest scale anchors (e.g., 2 and 4 for a 5-point scale). An easy
rule for the computation of the number of raters needed to meet that standard is A – 1
(where A is the number of response categories). Thus, four raters are needed for a 5-
point scale to be able to compute awg(1) with a –1 to 1 range for all means ranging from 2
to 4. If a researcher wishes to include all scale locations ranging from the lowest half-
point to the highest half-point (e.g., 1.5 to 4.5 for a 5-point scale), then the number of
raters needed is 2 × (A – 1), or 8 raters for a 5-point scale. Note that if the mean rating
equals the highest or lowest scale value, these ratings can be assigned an awg(1) of 1.0
because the observed variance necessarily equals 0.
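
A sketch (ours; hypothetical helper name) of Equations (7) and (8), illustrating the A – 1 and 2 × (A – 1) rules of thumb for a 5-point scale:

```python
def interpretable_mean_range(low, high, k):
    """Equations (7) and (8): range of observed means for which a_wg(1) spans -1 to 1."""
    min_mean = (low * (k - 1) + high) / k
    max_mean = (high * (k - 1) + low) / k
    return min_mean, max_mean

# Four raters (A - 1 for a 5-point scale) cover means from 2 to 4
print(interpretable_mean_range(1, 5, 4))  # (2.0, 4.0)

# Eight raters (2 * (A - 1)) cover means from 1.5 to 4.5
print(interpretable_mean_range(1, 5, 8))  # (1.5, 4.5)
```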
Researchers will inevitably face a situation in which a set of ratings has an
observed mean that falls outside of the recommended range of scale locations for
which awg(1) is interpretable. If the number of raters is equal to or greater than A – 1,
then the observed means that fall outside the defined range are only the extreme means
for which raters are likely exhibiting high degrees of agreement. To illustrate, Table 4
lists all possible combinations of ratings yielding a mean outside of the usable range of
awg(1) when four raters are using a 5-point scale. The implications of Table 4 are clear.
First, it is better to use more raters because doing so will increase the range of observed
means where awg(1) is interpretable. Second, if the number of raters is equal to or greater
than A – 1, the sets of ratings with observed means that lie outside the acceptable range
can likely be interpreted as indicating strong agreement, but that judgment will ulti-
mately be the researcher’s and cannot be quantified.
One other issue with awg(1) must be addressed. The computation of rwg(1) requires
only the observed variance of ratings, which of course is subject to sampling error. The
computation of awg(1) necessitates the observed mean and the observed variance of the
ratings, both of which are subject to sampling error. As such, it is arguable that awg(1)
may be more susceptible to the influences of sampling error than is rwg(1). The advan-
tages of awg(1), in terms of independence from the number of scale points, number of
raters, and the observed mean, far outweigh the minimum sample size requirement
and sampling error issue.

Empirical Demonstration of rwg(1)’s Relationship With the Observed Mean

Although there are several problems with the uniform null-based rwg(1), the most
limiting is the confound between rwg(1) and the observed mean if the uniform null distri-
bution assumption is violated. In a Monte Carlo analysis of measures of halo and
leniency, Alliger and Williams (1989) found a strong association between rating mean
and variability. Given that rwg(1) is a linear transformation of the observed variance as
long as the number of scale points is constant, the confound between rwg(1) and scale
mean is clear. To demonstrate that the relationship between rwg(1) and the observed
mean is also present in data from applied projects, we analyze data from a project with
a large automotive manufacturing facility for which we compute the correlation
between rwg(1) and mean rating and compare that to the correlation between awg(1) and
mean rating.
The project at the manufacturing facility involved an analysis of the agreement of
the answers from 27 experts whose mean responses constituted the answer key of a sit-

Table 4
All Combinations of Ratings From Four Raters Having a Mean Less Than 2.0 on a 5-Point Scale

Stimulus    r(1)    r(2)    r(3)    r(4)    M

1            1       1       1       1      1.0
2            1       1       1       2      1.25
3            1       1       1       3      1.5
4            1       1       2       2      1.5
5            1       2       2       2      1.75
6            1       1       2       3      1.75
7            1       1       1       4      1.75

Note. Each r represents a hypothetical rater.

uational judgment test to be used for promotional purposes. The experts rated the
desirability of 91 responses to hypothetical situations using a 7-point Likert-type
scale. Interrater agreement among the experts was critical to the study because expert
disagreement for any of the items would cast doubt on its usefulness (i.e., if experts
disagree on the appropriate course of action, then the correctness of the answer key
would be difficult to justify). The rwg(1) values ranged from .07 to .99, with a median
value of .65. We reflected the mean ratings below the scale midpoint of 4 so that devia-
tions from the scale midpoint were unidirectional. These rwg(1) coefficients were
strongly correlated with the reflected mean rating, r(89) = .63, p < .05. For the awg(1)
analysis, 2 of the 91 mean ratings were outside of the interpretable range (1.22 to 6.78
on a 7-point scale) of awg(1) values. Inspection of the data indicated strong agreement
for these 2 cases. For the 89 remaining cases, the awg(1) coefficients were not correlated
with the reflected mean rating, r(87) = –.03, p > .05.
This analysis demonstrates what almost assuredly occurs in any applied situation
for which rwg(1) is used to estimate agreement, namely, that the magnitude of the rwg(1)
values will be correlated with the observed means, and this correlation represents a
significant confound if the assumption that the null distribution is uniform is violated.

Comparing awg(1) and Uniform Null-Based rwg(1) Values

It is difficult to directly compare awg(1) and rwg(1) coefficients because each statistic is
based on a different null distribution and rwg(1) mismatches the variance equations (i.e.,
sample vs. population) in the numerator and denominator. One way to make meaning-
ful comparisons is to examine awg(1) and rwg(1) coefficients for a given level of observed
variance. Such comparisons are provided for a 5-point rating scale in Table 5. The first
column in Table 5 lists observed variances ranging from 1.0 to 0.20, which translates
directly into the corresponding rwg(1) values. The remaining columns provide the com-
parable awg(1) values at a given observed mean, starting at the scale midpoint and mov-
ing out in .5 increments. Also, to deal with the mismatch of variance equation problem
with rwg(1), we used the population variance equation for all of the Table 5 calculations.
The differences between the two agreement indices are obvious. The rwg(1) values
remain constant, but the awg(1) values, like the kappa values in Tables 1 and 2, vary as a
function of the mean rating, which in turn determines maximum possible disagree-
ment. Specifically, for awg(1), a fixed amount of observed variability in ratings results in

Table 5
Comparisons of awg(1) to rwg(1) Coefficients for a 5-Point Rating Scale

                                awg(1) when:
σ_x^2    rwg(1)    M = 3    M = 2.5 or 3.5    M = 2 or 4    M = 1.5 or 4.5

1.0       .50       .50           .47             .33            –.14
0.8       .60       .60           .57             .47             .09
0.6       .70       .70           .68             .60             .31
0.4       .80       .80           .79             .73             .54
0.2       .90       .90           .89             .87             .77

Note. To remove any sample size effects, all agreement values were computed using population
variance estimates.

less agreement as the observed mean deviates from the scale midpoint. An alternative
perspective for interpreting awg(1) values is that as observed variability increases (i.e.,
from 0.2 to 1.0), the detrimental effect of the increasing variability on agreement is
amplified as the observed means deviate from the scale midpoint (e.g., there is a .4 dif-
ference between the minimum and maximum awg(1) values when M = 3, but there is a
.91 difference between the minimum and maximum awg(1) values when M = 1.5 or M =
4.5).
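
The first row of Table 5 can be reproduced with a short sketch (ours) that, following the table's note, uses population variances throughout; the function names are hypothetical.

```python
def rwg1_pop(var_x, a):
    """Uniform-null r_wg(1) with a population-variance null (the Table 5 convention)."""
    return 1 - var_x / ((a ** 2 - 1) / 12)

def awg1_pop(var_x, mean, low, high):
    """a_wg(1) with population variances (the k/(k - 1) factor drops out)."""
    return 1 - 2 * var_x / ((high + low) * mean - mean ** 2 - high * low)

var_x = 1.0  # first row of Table 5, 5-point scale
print(rwg1_pop(var_x, a=5))                       # .50, regardless of the mean
for m in (3.0, 2.5, 2.0, 1.5):
    print(m, round(awg1_pop(var_x, m, 1, 5), 2))  # .50, .47, .33, -.14
```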

Practical Importance and Statistical Significance of rwg(1) and awg(1)

Similar to rwg(1) (A. Cohen, Doveh, & Eick, 2001; Dunlap, Burke, & Smith-Crowe,
2003), there is no sampling distribution that can be used to test awg(1) for statistical sig-
nificance. Although the chi-square test of variance has recently become the method of
choice for the determination of statistical significance of agreement indices (Lindell &
Brandt, 1997, 1999; Lindell et al., 1999), it is an inappropriate test statistic on both the-
oretical (data used to estimate agreement display greater levels of variance and
kurtosis than is indicated by the chi-square distribution) and empirical (chi-square test
for significance is likely to lead one to a Type I error when compared to Monte Carlo
analyses) grounds (A. Cohen et al., 2001; Dunlap et al., 2003). Until an appropriate
statistical test is developed or identified, it is recommended that researchers determine
statistical significance for any agreement index via critical values derived from Monte
Carlo analyses (see Note 3). Examination of Monte Carlo analyses of awg(1) reveals that critical
values increase as the number of raters decreases or the number of anchors increases. More-
over, it is important to recognize that when power is low, levels of agreement needed to
achieve statistical significance can exceed typical expectations for what constitutes
reasonable agreement among judges.
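
The critical-value tables themselves are available from the first author (see Note 3). Purely as an illustration of the general Monte Carlo logic, the sketch below (ours) approximates a critical value under an assumed null of independent, uniformly distributed ratings; the published tables may rest on a different null and a different number of replications.

```python
import random
from statistics import variance

def awg1(ratings, low, high):
    """a_wg(1) of Equation (6)."""
    k, m = len(ratings), sum(ratings) / len(ratings)
    if m == low or m == high:
        return 1.0
    max_var = ((high + low) * m - m ** 2 - high * low) * k / (k - 1)
    return 1 - 2 * variance(ratings) / max_var

def awg1_critical_value(k, a, alpha=0.05, reps=20_000, seed=0):
    """Approximate the awg(1) value exceeded by chance only alpha of the time,
    under an assumed null of k independent, uniform ratings on a 1..a scale."""
    rng = random.Random(seed)
    draws = sorted(awg1([rng.randint(1, a) for _ in range(k)], 1, a) for _ in range(reps))
    return draws[int((1 - alpha) * reps)]

# Roughly the 95th percentile of chance-level awg(1) for 10 raters on a 5-point scale
print(awg1_critical_value(k=10, a=5))
```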
Statistical significance aside, researchers need defensible thresholds indicative of
strong, moderate, and weak agreement. Because awg(1) and rwg(1) yield identical agree-
ment estimates when the mean rating equals the scale midpoint (and the variance esti-
mates are properly matched), interpretive standards for the magnitude of agreement
should be the same for rwg(1) and awg(1). Although there appears to be no clear agreement
regarding thresholds for rwg(1), LeBreton, Burgess, Kaiser, Atchley, and James (2003)
traced the most commonly used cutoff of .70 back to George (1990), who based her
use of a .70 cutoff on a personal communication from Lawrence R. James. Similarly,
Burke and Dunlap (2002) suggested .70 as a reasonable cutoff, and Judge and Bono
(2000) recently stated that “the mean rwg statistic was .74. This relatively high level of
interrater agreement appeared sufficient to justify aggregation” (p. 757). Ultimately,
the .70 value is a heuristic (LeBreton et al., 2003), and we recommend the same .70
heuristic for awg(1) as indicating a moderate level of agreement. Kozlowski and Hattrup
(1992) constructed a data set designed to indicate high agreement for rwg(J). Their high-
agreement data result in an rwg(1) of .79. Thus, for strong and weak agreement, we sug-
gest .80 and above and .60 to .69 as reasonable standards, respectively. Values from 0
to .59 should probably be considered as unacceptable levels of agreement, especially if
aggregating individual responses to represent group-level constructs.

Agreement Across Multiple Stimuli

In addition to quantifying consensus at the individual stimulus level, interrater
agreement indices are often used to quantify consensus across a number of stimuli.
James (1982) initially proposed a version of rwg(1) to cover scenarios in which multiple
judges rate multiple stimuli. The statistic, called rwg(J), was later revised by Lindell et al.
(1999) to eliminate inflation as J, the number of stimuli rated, increased. Because both
the original and revised rwg(J) compare the observed variance to the variance of a uni-
form null distribution, any problems with rwg(1) will also be associated with rwg(J). awg(1)
can also be adapted to fit the multiple stimuli model but must be done in a slightly dif-
ferent manner to satisfy the rules governing order of operations. The recommended
method of computation of awg(J) is to compute the mean of the J awg(1) coefficients. Ratings
of stimuli with means outside of the usable range of awg(1) must be treated as miss-
ing data when the mean is computed.

a_{wg(J)} = \frac{\sum a_{wg(1)}}{J}.   (9)
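
A sketch (ours; hypothetical helper names) of Equation (9) that averages the item-level awg(1) values and, per the text, treats stimuli whose means fall outside the interpretable range (Equations 7 and 8) as missing; it assumes every stimulus is rated by the same number of raters.

```python
from statistics import variance

def awg1(ratings, low, high):
    """a_wg(1) of Equation (6)."""
    k, m = len(ratings), sum(ratings) / len(ratings)
    if m == low or m == high:
        return 1.0
    max_var = ((high + low) * m - m ** 2 - high * low) * k / (k - 1)
    return 1 - 2 * variance(ratings) / max_var

def awg_j(items, low, high):
    """Equation (9): mean of the item-level awg(1) values, skipping items whose
    means fall outside the interpretable range given k raters (Equations 7 and 8)."""
    k = len(items[0])
    lo_mean = (low * (k - 1) + high) / k
    hi_mean = (high * (k - 1) + low) / k
    usable = [awg1(r, low, high) for r in items
              if lo_mean <= sum(r) / len(r) <= hi_mean]
    return sum(usable) / len(usable)

items = [[2, 3, 3, 4], [4, 4, 5, 5], [1, 2, 2, 3]]  # four raters, three stimuli, 5-point scale
print(awg_j(items, 1, 5))  # about .71; the second stimulus (mean 4.5) is outside the usable range
```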

When ratings are made by a group of judges of multiple stimuli, researchers are
faced with the issue of evaluating agreement at the individual item level or across all
items. Clearly, an analysis designed to identify stimuli for which raters are not display-
ing an adequate level of consensus (e.g., BARS development) dictates that agreement
should be computed for each stimulus. An analysis at the individual stimulus level can
provide valuable information even when the degree of agreement across all stimuli is
the main concern. The previously mentioned situational judgment test from the manu-
facturing facility illustrates such a case. The 27 expert raters responded to a total of 91
test items. Mean ratings for 2 of these items were outside the interpretable range for
awg(1) given the number of raters. Agreement analysis across items indicated weak
interrater agreement (awg(J) = .58). Analysis at the item level, however, revealed signifi-
cant variability in the pattern of individual agreement coefficients. Experts demon-
strated exceedingly poor agreement (awg(1) less than .40) on 14 of the 89 items with
interpretable awg(1) values. Of these 14 items, 5 related to the same scenario, an unlikely
coincidence given that all but one of the scenarios consisted of 10 questions each. Dis-
cussions with management revealed that this scenario posed a situation (proper course
of action when your supervisor has a closed-door policy) that is contrary to their cor-
porate culture (open-door policy). It is not surprising that experts failed to agree on the
best response to a situation they were unlikely to encounter at the organization. The
scenario and the items associated with it were subsequently considered for elimination
from the test battery.
Berry and Mielke (1988) and Janson and Olsson (2001, 2004) also have proposed
extensions of kappa for ratings of continuous constructs across multiple stimuli. Their
multiple target indices yield values greater than awg(1). Their indices perform in a simi-
lar, but not identical, fashion to one of the intraclass correlation coefficients, ICC(2,1),
but are not encumbered by as many restrictions as ICC(2,1).

Discussion

Conceptually, the quality of rwg(1) when assuming a uniform null is dependent on the
validity of the assumption that a uniform distribution accurately models no agreement.
The validity of this assumption is predicated on either assuming that the participating
raters are free from rating bias and operate as random number generators or, alterna-
tively, there is some compensating mechanism among raters that offsets any rating
bias that is present. It is clear from the work of James et al. (1984) and others that such
an assumption is often untenable. However, the procedure for modeling the bias in the
population distribution that James et al. (1984) recommend is difficult to achieve con-
sistently because, essentially, it requires researchers to establish the validity of the
agreement index (i.e., the researcher must partition the relevant systematic variance
from both the irrelevant systematic variance and random error).
In contrast, awg(1), like J. Cohen’s (1960) kappa, is not predicated on the uniform
null distribution assumption. Rather, both awg(1) and kappa assume that multiple null
distributions exist based on the conception of maximum disagreement at a given
observed mean. The assumption of multiple null distributions based on maximum dis-
agreement allows awg(1) to overcome the theoretical and practical limitations of rwg(1).
Unlike what will usually happen when using rwg(1), awg(1) will not correlate with the
observed rating mean, and the probability of awg(1) meeting or failing a minimum
agreement criterion is constant, regardless of the observed rating mean.

Agreement and Sample Size

The mismatch of type of variance equation affects the range and interpretability of
rwg(1). Both James et al. (1984) and Lindell et al. (1999) recommend a minimum sample
size of 10 when computing rwg indices, a recommendation that is typically ignored. In
contrast, our development of awg(1) explicitly takes into account the importance of sam-
ple size by recognizing that the range of interpretability for awg(1) is dependent on sam-
ple size and by computing awg(1) with a sample variance estimate in both the numerator
and the denominator. By doing so, awg(1) achieves independence from sample size
(within the range of interpretability), unlike rwg(1).

Agreement and Generalizability Theory

Recent articles (Hoyt, 2000; Murphy & DeShon, 2000) have reasserted the utility
of generalizability theory for the assessment of interrater reliability and also have dis-
cussed how generalizability theory can be used to quantify consensus, the traditional
domain of agreement statistics (James, Demaree, & Wolf, 1993). Conceptually,
generalizability theory is superior to even the best interrater agreement indices
because it is based on a theory of measurement. As such, generalizability analyses
should be considered as an alternative to multiple target agreement indices. However,
there are limitations to generalizability theory. For example, components of the model
may not be estimable due to the research design (Murphy & DeShon, 2000). Other
problems include that stimuli are often nested in raters (e.g., providing 360-degree feedback),
which limits the interpretability of the g coefficients, and that generalizability theory is
applicable only when there is variability across the stimuli being rated. Most impor-
tant, generalizability theory cannot estimate agreement at the individual stimulus
level. That is, the level of analysis for generalizability theory is limited to the cross-
stimuli level. Thus, agreement indices such as awg(1) are necessary for the estimation of
the agreement of ratings of a single stimulus.

Models That Use Observed Variability of a Set of Ratings

The implications of the points made in this article are not limited to the issue of
interrater agreement. Occasionally, the standard deviation (or some other index of
variability) across a set of ratings is used as a variable in some type of modeling effort
(e.g., Johnson & Ferstl, 1999; Klein et al., 2001; Nathan & Tippins, 1990). Typically,
for each participant in a study, the standard deviation across a set of rating scales is
used as a variable in a research study. Naturally, the relationship between the mean and
variability of a set of ratings remains relevant to the research problem.
Nathan and Tippins (1990) provided an example of the problems associated with
using the standard deviation as a variable in a model. In their performance appraisal
study, they used the standard deviation across each manager’s ratings of an employee
as a halo effect measure. Nathan and Tippins found significant relationships between
halo and rating accuracy and halo and leniency (measured as the mean rating). From
these results, Nathan and Tippins concluded that more haloed ratings are more accu-
rate and therefore lenient ratings are more accurate because leniency is related to halo.
Concerning this last conclusion, Nathan and Tippins argued that raters who pay too
much attention to “specific, but unrepresentative, critical incidents” (p. 296) are likely
to provide inaccurate performance judgments. In reference to negative critical inci-
dents, they went so far as to suggest that “rater training programs should . . . instruct
raters to beware of overgeneralizing a stable level of dimension performance from
isolated critical incidents, especially negative ones” (p. 294).
Without debating the conclusion that rating accuracy and halo are related, the evi-
dence on which Nathan and Tippins (1990) based the conclusion that halo and
leniency are related was a spurious result attributable to their use of the mean rating as
the measure of leniency and the standard deviation operationalization of halo effect.
As usually occurs in performance evaluation, the distribution for each performance
dimension in their study was negatively skewed. Given the nature of the distributions,
the positive relationship they found between leniency and halo (i.e., the mean rating
and the standard deviation, respectively) was to be expected. Although it appeared that
strong halo effects and leniency effects were desirable characteristics of performance
ratings, the relationship was not interpretable as such (cf., Alliger & Williams, 1989).
To summarize, it is recommended that when the observed standard deviation across a
set of ratings is used as a variable, the relationship between the observed means and the
standard deviation be statistically controlled. Use of awg(1) explicitly controls for this
association and is easier and more parsimonious to use than the alternative of
attempting to simultaneously model both the mean and the standard deviation.

Final Thoughts

A clear trend in recent research is to use multilevel perspectives by examining
phenomena from both the individual and group level. Multilevel research is a complex
undertaking both in terms of theoretical justifications and the accompanying analyses
(Rousseau, 1985, 2000). Clearly, the use of awg(1) or awg(J) will not resolve many of the
complexities that multilevel researchers face. However, when operationalizing vari-
ables at the group level by aggregating individual ratings, awg(1) and awg(J) provide oper-
ational definitions of consensus that are consistently interpretable in terms of the pro-
portion of consensus to maximum possible dissensus. The only limitation of awg(1) is
the minimum sample size requirement, which if not met, precludes the estimation of
agreement with awg(1). Not only does rwg(1) have limited interpretability when sample
sizes are small (James et al., 1984; Lindell et al., 1999), but the uniform null rwg(1) index
also rests on an assumption that is unlikely to be met (James et al., 1984; A. M. Schmidt &
DeShon, 2003).
As such, we make the following recommendations for researchers estimating
agreement one stimulus at a time. Review prior research to determine if an empirical
estimate of the variance of the null distribution is justified. If an empirical estimate is
justified, compute rwg(1) using the procedure recommended by James et al. (1984). If
there is no confidence in an empirical estimate of the variance of the null distribution,
consider the context in relation to the potential for motivational biases and/or shared
cognitive rating biases. If it appears that there is no reason for judges to be motivated to
distort ratings or that judges share some preconceived notion of the categories under
evaluation, then the uniform distribution rwg(1) may provide good estimates of agree-
ment. An example of a situation in which the uniform null might be justifiable is a pilot
study in which judges are evaluating unfamiliar stimuli with no apparent motivation to
distort. Even in this situation, given the previously mentioned findings of A. M.
Schmidt and DeShon (2003), a better choice may be to use a variance estimate from a
negatively skewed normal distribution than a uniform distribution. If there is not a
defensible empirical estimate of the variance of the null distribution and rater motiva-
tion and/or shared cognitive rater biases are concerns, then awg(1) will provide a better
estimate of agreement than the uniform rwg(1). Concerns about rater motivation and
shared cognitive rater biases are likely in almost any applied setting, so awg(1) is preferable
to the uniform rwg(1) in most applied situations in which agreement is estimated.
The process for deciding among the use of multi-item estimates of agreement is the
same as for estimating agreement for stimuli one at a time, except that generalizability
analyses should be considered before using rwg(J) or awg(J).

Notes
1. When rating continuous constructs, all measurement scales are technically discontinuous.
We will not differentiate scale type any further (i.e., ordinal, interval, ratio) because this classifi-
cation does not affect our arguments. We are addressing interrater agreement in any situation in
which the subjective ratings are manifest variables intended to measure latent variables that are
theoretically continuous.
2. An earlier version of this article presented at the annual meeting of the Society of Indus-
trial and Organizational Psychology (Brown, 2000) used zero as the lower bound of awg(1).
3. Tables of critical values for both awg(1) and awg(J) are available from the first author.

References
Alliger, G. M., & Williams, K. J. (1989). Confounding among measures of leniency and halo.
Educational and Psychological Measurement, 49, 1-10.
Berry, K. J., & Mielke, P. W. (1988). A generalization of Cohen’s kappa agreement measure to
interval measurement and multiple raters. Educational and Psychological Measurement,
48, 921-933.
Borman, W. C. (1987). Personal constructs, performance schemata, and “folk theories” of sub-
ordinate effectiveness: Explorations in an army officer sample. Organizational Behavior
and Human Decision Processes, 40, 307-322.
Bozeman, D. P. (1997). Interrater agreement in multi-source performance appraisal: A com-
mentary. Journal of Organizational Behavior, 18, 313-316.
Brown, R. D. (2000, April). Interrater agreement reconsidered: The role of maximum possible
variance. Paper presented at the 15th Annual Conference of the Society for Industrial and
Organizational Psychology, New Orleans, LA.
Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average devia-
tion index: A user’s guide. Organizational Research Methods, 5, 159-172.
Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for esti-
mating interrater agreement. Organizational Research Methods, 2, 49-68.
Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the rwg(J) index of agreement.
Psychological Methods, 6, 297-310.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological
Measurement, 20, 37-46.
Dunlap, W. P., Burke, M. J., & Smith-Crowe, K. (2003). Accurate tests of statistical significance
for rwg and average deviation interrater agreement indices. Journal of Applied Psychology,
88, 356-362.
Eby, L. T., & Dobbins, G. H. (1997). Collectivistic orientation in teams: An individual and
group-level analysis. Journal of Organizational Behavior, 18, 275-295.
Feldman, J. M. (1981). Beyond attribution theory: Cognitive biases in performance appraisal.
Journal of Applied Psychology, 66, 127-148.
Finn, R. H. (1970). A note on estimating the reliability of categorical data. Educational and Psy-
chological Measurement, 30, 71-76.
Fleenor, J. W., Fleenor, J. B., & Grossnickle, W. F. (1996). Interrater reliability and agreement of
performance ratings: A methodological comparison. Journal of Business and Psychology, 10,
367-380.
George, J. M. (1990). Personality, affect and behavior in groups. Journal of Applied Psychology,
75, 107-116.
Hauenstein, N. M. A., & Alexander, R. A. (1991). Rating ability in performance judgments: The
joint influence of implicit theories and intelligence. Organizational Behavior and Human
Decision Processes, 50, 300-323.
Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we
do about it? Psychological Methods, 5, 64-86.
Hyatt, D. E., & Ruddy, T. M. (1997). An examination of the relationship between work group
characteristics and performance: Once more into the breech. Personnel Psychology, 50,
553-586.
Janson, H., & Olsson, U. (2001). A measure of agreement for interval or nominal multivariate
observations. Educational and Psychological Measurement, 61, 277-289.
Janson, H., & Olsson, U. (2004). A measure of agreement for interval or nominal multivariate
observations by different sets of judges. Educational and Psychological Measurement, 64,
62-70.
James, L. R. (1982). Aggregation bias in estimates of perceptual agreement. Journal of Applied Psychology, 67, 219-229.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability
with and without response bias. Journal of Applied Psychology, 69, 85-98.
James, L. R., Demaree, R. G., & Wolf, G. (1993). rwg: An assessment of within-group interrater agreement.
Journal of Applied Psychology, 78, 306-309.
Johnson, J. W., & Ferstl, K. L. (1999). The effects of interrater and self-other agreement on per-
formance improvement following upward feedback. Personnel Psychology, 52, 271-303.
Judge, T. A., & Bono, J. E. (2000). Five factor model of personality and transformational leader-
ship. Journal of Applied Psychology, 85, 751-765.
Klein, K. J., Conn, A. B., Smith, D. B., & Sorra, J. S. (2001). Is everyone in agreement? An ex-
ploration of within-group agreement in employee perceptions of the work environment.
Journal of Applied Psychology, 86, 3-16.
Kozlowski, S. W., & Hattrup, K. (1992). A disagreement about within-group agreement: Disen-
tangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161-
167.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28,
563-575.
LeBreton, J. M., Burgess, J. R. D., Kaiser, R. B., Atchley, E. K. P., & James, L. R. (2003). The re-
striction of variance hypothesis and interrater reliability and agreement: Are ratings from
multiple sources really dissimilar? Organizational Research Methods, 6, 78-126.
Li, M. F., & Lautenschlager, G. (1997). Generalizability theory applied to categorical data. Edu-
cational and Psychological Measurement, 57, 813-822.
Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single tar-
get. Applied Psychological Measurement, 21, 271-278.
Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A comparison of the CVI, T, rWG(J), and r*WG(J) indexes. Journal of Applied Psychology, 84, 640-647.
Lindell, M. K., & Brandt, C. J. (2000). Climate quality and climate consensus as mediators of
the relationship between organizational antecedents and outcomes. Journal of Applied Psy-
chology, 85, 331-348.
Lindell, M. K., Brandt, C. J., & Whitney, D. J. (1999). A revised index of interrater agreement
for multi-item ratings of a single target. Applied Psychological Measurement, 23, 127-135.
Lindell, M. K., Clause, C. S., Brandt, C. J., & Landis, R. S. (1998). Relationship between organi-
zational context and job analysis task ratings. Journal of Applied Psychology, 83, 769-776.
Mossholder, K. W., Bennett, N., & Martin, C. L. (1998). A multilevel analysis of procedural jus-
tice context. Journal of Organizational Behavior, 19, 131-141.
Mulvey, P. W., & Ribbens, B. A. (1999). The effects of intergroup competition and assigned
group goals on group efficacy and group effectiveness. Small Group Research, 30, 651-677.
Murphy, K. R., & DeShon, R. (2000). Interrater correlations do not estimate the reliability of job
performance ratings. Personnel Psychology, 53, 873-900.
Nathan, B. R., & Tippins, N. (1990). The consequences of halo “error” in performance ratings: A
field study of the moderating effect of halo on test validation results. Journal of Applied
Psychology, 75, 290-296.
Neuman, G. A., & Wright, J. (1999). Team effectiveness: Beyond skills and cognitive ability.
Journal of Applied Psychology, 84, 376-389.
Rousseau, D. M. (1985). Issues of level in organizational research: Multi-level and cross-level
perspectives. Research in Organizational Behavior, 7, 1-37.
Rousseau, D. M. (2000). Multilevel competencies and missing linkages. In K. J. Klein & S. W.
Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations,
extensions, and new directions (pp. 572-582). San Francisco: Jossey-Bass.
Schmidt, A. M., & DeShon, R. P. (2003, April). Problems in the use of rwg for assessing inter-
rater agreement. Paper presented at the 18th Annual Conference of the Society for Indus-
trial and Organizational Psychology, Orlando, FL.
Schmidt, F. L., & Hunter, J. E. (1989). Interrater reliability coefficients cannot be computed when only one stimulus is rated. Journal of Applied Psychology, 74, 368-370.
Schrader, B. W., & Steiner, D. (1996). Common comparison standards: An approach to improv-
ing agreement between self and supervisory performance ratings. Journal of Applied Psy-
chology, 81, 813-820.
Schriesheim, C. A. (1981). The effect of grouping or randomizing items on leniency response
bias. Educational and Psychological Measurement, 41, 401-411.
Suen, H. K., & Ary, D. (1989). Analyzing quantitative behavioral observation data. Hillsdale,
NJ: Lawrence Erlbaum.
Walker, A. G., & Smither, J. W. (1999). A five year study of upward feedback: What managers
do with their results matters. Personnel Psychology, 52, 393-423.
Wherry, R. J., & Bartlett, C. J. (1982). The control of bias in ratings: A theory of ratings. Person-
nel Psychology, 35, 521-551.

Reagan D. Brown earned his doctorate in industrial and organizational psychology at Virginia Polytechnic
Institute and State University in 1997 and is currently an associate professor of psychology at Western Ken-
tucky University. His research interests include personnel selection and psychometrics.
Neil M. A. Hauenstein is an associate professor in psychology at Virginia Polytechnic Institute and State
University, and he is a member of the graduate training program in Industrial and Organizational Psychol-
ogy. His primary research interests lie in the area of performance appraisal/performance management re-
search, with a particular interest in the psychometric criteria used to evaluate subjective judgments.
