The Moderating Influence of Job Performance Dimensions on Convergence
of Supervisory and Peer Ratings of Job Performance: Unconfounding
Construct-Level Convergence and Rating Difficulty
Chockalingam Viswesvaran
Florida International University
Frank L. Schmidt
University of Iowa
Deniz S. Ones
University of Minnesota, Twin Cities Campus
Performance ratings have traditionally played a central role in the measurement of job performance in industrial-organizational psychology. In their review, Lawshe and Balma (1966) found that 66% of validity studies used ratings. Lent, Auerbach, and Levin (1971) found in their review of 1,506 validation studies that 63% used ratings as the criterion. There is evidence that the use of ratings has not decreased over the years; if there has been a discernible trend, it has been an increasing use of ratings (Cascio, 1991; Landy, 1989; Marchese & Muchinsky, 1993). For example, Bernardin and Beatty (1984) found that 83% of the respondents in a survey indicated that they used ratings as their primary source of performance appraisal. Cleveland, Murphy, and Williams (1989) also found that the use of ratings is prevalent in organizations.
Because ratings of performance are widely used, they have been widely researched. Many researchers (e.g., Saal, Downey, & Lahey, 1980) have examined the psychometric properties of ratings, such as distributional characteristics (e.g., central tendency and leniency, the skewness and kurtosis of the distribution), reliability (King, Hunter, & Schmidt, 1980; Rothstein, 1990; Salgado & Moscoso, 1996; Viswesvaran, Ones, & Schmidt, 1996), and validity (e.g., Church & Bracken, 1997). The effects of the rating
Chockalingam Viswesvaran, Department of Psychology, Florida International University; Frank L. Schmidt, Department of Management and Organizations, University of Iowa; Deniz S. Ones, Department of Psychology, University of Minnesota, Twin Cities Campus.
All three authors contributed equally to this article.
Correspondence concerning this article should be addressed to Chockalingam Viswesvaran, Department of Psychology, Florida International University, Miami, Florida 33199. E-mail: vish@fiu.edu
This article examines whether the correlations between supervisor and peer ratings differ across dimensions of job performance. The general pattern in the data suggested that raters from the same group agree more with one another than do raters from different groups. Methods were used to separate the effects of lack of construct-level convergence and of rating difficulty on the correlation between supervisor and peer ratings. The authors found complete construct-level convergence for ratings of overall job performance, productivity, effort, job knowledge, quality, and leadership, but not for ratings of administrative competence, interpersonal competence, or compliance with or acceptance of authority. Rating difficulty lowered the mean observed peer-supervisor correlations more than lack of construct-level convergence did. Implications for research and practice are discussed.
format have been extensively researched, and consensus has emerged that the contribution of format differences to rating quality is limited (Landy & Farr, 1980). The effects of training (e.g., Borman, 1978; London & Smither, 1995) and training influences on cognitive processes have also been investigated (e.g., Hauenstein & Foti, 1989). Researchers have also examined the purposes for which ratings are used in an organization (Cleveland et al., 1989), the cognitive processes involved in ratings (e.g., DeNisi, Cafferty, & Meglino, 1984), and the influence of political climate in the organization on ratings (e.g., Kravitz & Balzer, 1992).
Ratings are subjective evaluations obtained from supervisors, peers, subordinates, self, or customers. The various uses of ratings can be broadly classified as administrative, feedback, or research. Of the five sources of evaluation, Cascio (1991, p. 81) stated that only supervisory and peer ratings are used for all three purposes to which ratings can be put. Most ratings are confined to supervisory ratings. For example, Lent et al. (1971) found that 93% of all ratings were supervisor ratings. (The remaining 7% of their database was based on peer ratings.) Bernardin and Beatty (1984) estimated that over 90% of the ratings used in the literature are supervisor evaluations. Furthermore, the traditional hierarchical structure of most organizations also increases the emphasis on supervisor ratings. In contrast, in recent years there has been a movement away from traditional hierarchical organizations toward matrix structures and team- or project-based organization of work (Norman & Zawacki, 1991), a trend that has increased the use of peer ratings and even of subordinate ratings.
This article focuses on the relationship between peer and supervisor ratings of job performance. It examines whether the correlations between ratings made by supervisors and those made by peers differ across job performance dimensions (i.e., is the rating content a moderator of the convergence?). In examining the convergence in ratings between peers and supervisors, we distinguish between the effects of rating difficulty and construct-level convergence. Agreement between raters can be reduced by the absence of agreement on the nature of the construct to be rated, by the difficulty of rating a particular agreed-on construct, or by both.
The standard error of the mean observed correlation was computed as SE = SD_r / √k, where SD_r is the standard deviation of the observed correlations and k is the number of samples. This SE was used to compute confidence intervals for the mean observed correlation. If the confidence intervals for different job performance dimensions overlap, we cannot reject the hypothesis that rating difficulty is the same across dimensions.
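As a concrete illustration, the standard-error computation just described can be sketched as follows; the correlations below are hypothetical placeholders, not values from the meta-analysis:

```python
import math

def mean_r_ci(correlations, z=1.96):
    """95% confidence interval around the mean of k observed correlations.

    SE of the mean r is the SD of the observed correlations divided by
    sqrt(k), where k is the number of samples averaged (unweighted here
    for simplicity; the article uses sample-size weighting).
    """
    k = len(correlations)
    mean_r = sum(correlations) / k
    sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in correlations) / k)
    se = sd_r / math.sqrt(k)
    return mean_r - z * se, mean_r + z * se

# Hypothetical observed peer-supervisor correlations for one dimension.
low, high = mean_r_ci([0.35, 0.42, 0.38, 0.47, 0.40])
# If the (low, high) intervals for two dimensions overlap, we cannot
# reject the hypothesis that rating difficulty is the same for them.
```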
The mean observed correlation for each of the nine dimensions was next corrected using the interrater reliabilities for peer ratings and the interrater reliabilities for supervisory ratings reported in Viswesvaran et al.'s (1996) article. These interrater reliabilities are shown in Table 1. Viswesvaran et al. had summarized the interrater reliabilities for the dimensions where four or more estimates were available. For supervisor ratings of administrative competence, the interrater reliability in Viswesvaran et al.'s analyses was .42, but this estimate was based on fewer than four estimates and thus was not reported in that publication.
The mean observed correlation for each of the nine dimensions was corrected for the interrater reliability of peer ratings and for the interrater reliability of supervisory ratings. Thus, the second step in our analyses was to estimate the true-score correlations between peer and supervisor ratings for each of the nine dimensions of job performance. In estimating the true-score correlation, when the two measures being correlated are provided by different raters, interrater reliability coefficients are the appropriate reliabilities; intrarater reliability estimates are not appropriate (Schmidt & Hunter, 1996, 1999; Schmidt, Viswesvaran, & Ones, 2000; Viswesvaran et al., 1996; see Murphy & DeShon, 2000, for a contrary view). It is important to note that interrater reliability corrects for both rater-specific variance and random measurement error. As such, there is no need to separately correct for intrarater reliability.
If the confidence interval around the mean corrected correlation extends to 1.00, then we cannot reject the hypothesis that supervisors and peers are rating the same job performance construct. Confidence intervals for mean construct-level correlations are produced by correcting the endpoints of the confidence intervals for the mean observed correlations. Confidence intervals can be corrected for measurement error just as individual correlations and mean correlations can be corrected (Hunter & Schmidt, 1990, p. 120). In our tables, based on convention in psychological research, we report 95% confidence intervals, though interested readers may compute other confidence intervals of interest.
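A minimal sketch of this two-step correction, assuming illustrative values rather than the article's actual Table 1 reliabilities:

```python
import math

def disattenuate(r, rel_peer, rel_supervisor):
    """Correct an observed peer-supervisor correlation for interrater
    unreliability in both sets of ratings, estimating the construct-level
    (true-score) correlation."""
    return r / math.sqrt(rel_peer * rel_supervisor)

# Illustrative values: a mean observed correlation, its 95% CI, and
# single-rater interrater reliabilities (placeholders, not Table 1).
r_obs = 0.40
ci_obs = (0.33, 0.47)
rel_peer, rel_sup = 0.43, 0.55

rho = disattenuate(r_obs, rel_peer, rel_sup)
# The CI for the construct-level correlation comes from correcting the
# endpoints of the observed-correlation CI, capped at 1.00.
ci_rho = tuple(min(disattenuate(e, rel_peer, rel_sup), 1.00) for e in ci_obs)
```

If the upper endpoint of `ci_rho` reaches 1.00, the hypothesis that the two sources rate the same construct cannot be rejected.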
Results
Table 2 reports the results. All mean observed correlations in Table 2 are less than .50, indicating substantial disagreement
Table 1
Interrater Reliabilities: Sample-Size-Weighted Means and Standard Deviations

                                              Supervisor ratings        Peer ratings
Job performance dimension                     N    k    M    SD         N    k    M    SD
Overall job performance
Productivity
Effort
Interpersonal competence
Administrative competence
Quality
Job knowledge
Leadership
Compliance with or acceptance of authority

Note. These values are summarized from Viswesvaran et al.'s (1996) article. Values are for ratings produced by a single rater. N = total sample size across all estimates that were averaged; k = number of samples averaged.
between peers and supervisors. This disagreement may be because of rating difficulty (reflected in interpeer and intersupervisor reliability), dimension-level disagreements between peers and supervisors, or both.
To unconfound rating difficulty and construct-level convergence, we first examine the confidence interval around the estimated true-score correlation. If the interval includes 1.00, we cannot reject the hypothesis that peers and supervisors are rating the same performance construct. In such a case, we then evaluate rating difficulty by looking at the mean observed correlations for that performance dimension. If the confidence intervals around the estimated true-score correlations do not include 1.00, this suggests that peers and supervisors may be rating somewhat different constructs and rating difficulty may not be the whole explanation for less than perfect convergence.
For the nine performance dimensions summarized in Table 2, the confidence intervals around the true-score correlation do not include 1.00 for three performance dimensions: administrative competence, interpersonal competence, and compliance or acceptance of authority. Peers and supervisors may have somewhat different conceptualizations of these three dimensions. However, the underlying constructs are fairly highly correlated for peers and supervisors: .86 for interpersonal competence, .69 for administrative competence, and .78 for compliance or acceptance of authority. It is pertinent to note that the reliability reported for peer ratings of compliance or acceptance of authority (.71) is unusually high compared with the reliability estimates reported for other dimensions. For example, the range of interrater reliability for supervisors was .45-.63, and it was .33-.42 for peers (excluding the value of .71 for compliance). If we had used a value of .43 (the mean of interpeer reliabilities) instead of .71 for compliance ratings, then we would have concluded that peers and supervisors are rating the same construct when providing ratings of compliance with authority. On the basis of our decision rule, supervisors and peers appear to be rating the same construct when providing ratings of productivity, quality, job knowledge, leadership, overall job performance, and effort. For these six dimensions, confidence intervals for the mean construct-level correlations include 1.00.
Table 2
Mean Correlations Between Peer Ratings and Supervisory Ratings of Different Job Performance Dimensions

Job performance dimension                     k    N    r̄    SD    ρ̄    95% CI (r̄)    95% CI (ρ̄)
Overall job performance
Productivity
Effort
Interpersonal competence
Administrative competence
Quality
Job knowledge
Leadership
Compliance with or acceptance of authority

Note. The performance dimensions were taken from, and described in, Viswesvaran et al.'s (1996) article. k = number of samples included in the meta-analysis; N = total sample size across the k samples; SD = sample-size-weighted standard deviation of the observed correlations; 95% CI (r̄) = 95% confidence interval constructed around the mean observed correlation; 95% CI (ρ̄) = 95% confidence interval for the mean true-score (construct-level) correlation.
The mean observed peer-supervisor correlations are less than .50 for all nine dimensions. Combined with the finding that the confidence intervals for the mean correlation corrected for interpeer and intersupervisor reliability (i.e., corrected for rating difficulty) included 1.00 for six of the nine dimensions, this pattern suggests that rating difficulty (as reflected in level of measurement error) is the major factor depressing peer-supervisor convergence. Is this observed correlation convergence moderated by the rating content? The confidence intervals around the mean observed correlations overlap for the six dimensions for which the confidence interval on the true-score correlation includes 1.00, suggesting that moderating influences, if any, are weak. Confidence intervals on observed correlations also overlap for the three dimensions for which the true-score correlations may be less than 1.00.
To some extent, the present findings allow us to examine the hypotheses about specific performance dimensions found in the literature and discussed earlier. Of the performance dimensions hypothesized in the literature to be relatively easy to rate, three are contained in Table 2: productivity, effort, and job knowledge. For all three of these dimensions, the confidence interval around the mean construct-level correlation includes 1.00, suggesting that supervisors and peers may be rating the same construct. The mean true-score correlation for these dimensions is .92. The average observed correlation between supervisor and peer raters is .41. Of the performance dimensions hypothesized in the literature to be relatively difficult to rate, two are contained in Table 2: administrative competence and leadership. (A third, communication skills, was dropped because of lack of data.) For one of these, leadership, the true-score correlation confidence interval reaches up to 1.00, and the mean true-score correlation is large (.91). Also, for this dimension the mean observed score correlation between supervisor and peer raters is .41, the same as the mean value for the low-rating-difficulty dimensions of productivity, effort, and job knowledge. Hence, there is no evidence that leadership is more difficult to rate than these low-rating-difficulty performance dimensions. The other performance dimension hypothesized in the literature to be relatively difficult to rate is administrative competence. For this dimension, the mean construct-level correlation is .69, and the confidence interval does not reach up to 1.00, suggesting that supervisor and peer raters are not rating the same construct. In addition, the mean observed correlation between supervisor and peer raters is .34, the lowest value in Table 2. However, for both observed and construct-level correlations, the confidence intervals overlap those of other dimensions and, in particular, overlap the confidence intervals of the three dimensions hypothesized to be relatively easy to rate. Hence, these findings do not provide strong support for the hypothesis that administrative competence is more difficult to rate than other dimensions, including those hypothesized to be easy to rate. For the remaining performance dimensions (overall job performance, interpersonal competence, quality, and compliance or acceptance of authority), the literature contained no specific hypotheses about rating difficulty.
The mean observed correlations for peer-supervisor convergence reported here can be compared with the interrater reliabilities reported in Viswesvaran et al.'s (1996) article for these nine dimensions for supervisors and peers. These interrater reliabilities are for single raters and are presented in Table 1. Interrater reliabilities are higher for supervisors (M = .55) than for peers (M = .43), indicating there is more measurement error in peer ratings than in supervisor ratings. Across the nine dimensions, the mean construct-level correlation is .85, and the mean observed score correlation is .41. This means that, on average across the performance dimensions, measurement error reduces the observed correlation between supervisor and peer raters from .85 to .41, a reduction of .44 correlation points. In contrast, lack of complete construct convergence reduces the maximum supervisor-peer construct-level correlation from 1.00 to .85, a much smaller reduction. Hence, it is clear that measurement error is the major factor reducing correlations between supervisor and peer ratings. The effect of measurement error is almost three times larger than the impact of lack of complete construct convergence (.44 vs. .15).
If we omit the two dimensions with the lowest mean supervisor-peer construct-level correlations, administrative competence (.69) and quality (.68), these figures are even more striking. The mean true-score correlation is then .90, and the mean observed correlation is .41. Hence, measurement error reduces the mean supervisor-peer correlation even more: .90 - .41 = .49. However, the reduction in the mean construct-level correlation is only 1.00 - .90 = .10.
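The decomposition in the preceding paragraphs is simple arithmetic and can be checked directly; the summary values (.85, .90, and .41) are the article's own:

```python
def decompose(mean_obs, mean_true):
    """Split the shortfall of the mean observed peer-supervisor correlation
    from 1.00 into (a) lack of complete construct convergence and
    (b) measurement error (rating difficulty)."""
    construct_gap = round(1.00 - mean_true, 2)
    error_gap = round(mean_true - mean_obs, 2)
    return construct_gap, error_gap

# All nine dimensions: mean true-score correlation .85, mean observed .41.
gap_all = decompose(0.41, 0.85)     # construct gap .15, error gap .44
# Excluding the two lowest construct-level dimensions: .90 and .41.
gap_subset = decompose(0.41, 0.90)  # construct gap .10, error gap .49
```

The ratio of the two gaps for all nine dimensions, .44/.15, is about 2.93, which is the "almost three times larger" figure in the text.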
For six of the nine performance dimensions, we cannot reject the hypothesis that the mean true-score correlation is 1.00. However, we can evaluate the data in Table 2 from a more holistic perspective. In this connection, it is striking that none of the estimates of mean true-score correlations equal or exceed 1.00. Suppose that the real (population) value of every true-score correlation were in fact 1.00. Then half the estimated mean true-score correlations should be (at least slightly) less than 1.00 and half should be (at least slightly) above 1.00, because of sampling error in the estimates (Schmidt & Hunter, 1999; cf. also Hunter & Schmidt, 1990, chap. 9). The expected average would be exactly 1.00. Instead, none of the values exceeds 1.00, and the average value is only .85 (.90 if we exclude the two performance dimensions with the lowest mean true-score correlations). This pattern of results suggests that supervisors and peers are, on the average, rating similar but not exactly identical performance constructs. The lack of perfect agreement between supervisor and peer raters is mostly due to measurement error. However, a small part of it is due to lack of complete construct convergence between supervisor and peer raters. This is the average pattern of findings. However, for some specific dimensions, the supervisor-peer construct convergence may be complete. For example, for overall job performance, the mean true-score correlation is close to 1.00, and for the effort dimension, it is .99.
The variation in mean interrater reliabilities across performance dimensions may be viewed as reflecting differences in rating difficulty for performance dimensions. An important question is whether these differences in rating difficulty generalize across rating sources. Are dimensions that are difficult for peers to rate also difficult for supervisors to rate? We correlated the mean interrater reliabilities of supervisors and peers across the nine performance dimensions and obtained a correlation of approximately .2. Hence, the dimensions that are the most difficult to rate really are not the
same for supervisor and peer raters. However, this correlation is computed on a small sample, and hence, any inference at this point must be tentative.
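The cross-source check described above amounts to a Pearson correlation over nine paired reliability means. A sketch follows; the reliability vectors are hypothetical placeholders, not the Table 1 values:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical single-rater reliability means for nine dimensions.
sup_rel = [0.52, 0.57, 0.45, 0.55, 0.42, 0.63, 0.60, 0.50, 0.58]
peer_rel = [0.42, 0.33, 0.40, 0.36, 0.38, 0.35, 0.41, 0.37, 0.71]

r = pearson(sup_rel, peer_rel)
# With only nine pairs, any such r is a very noisy estimate, which is
# why inferences from it must remain tentative.
```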
Discussion
Ratings are widely used in performance appraisals, and peers and supervisors are the two most common sources of ratings. In this article, we integrated research examining the convergence (correlation) between peer and supervisor ratings. Our findings indicate that the moderating effect of the rating content on the convergence between peer and supervisor ratings is not as strong as is implied by some cognitively based hypotheses (e.g., evaluation difficulty) proposed to explain the rating processes. Our findings are also consistent with the conclusions of Mount et al. (1998) that source effects (i.e., peer vs. supervisor) are not strong in ratings.
Our findings indicate that peer-supervisor convergence apparently does not vary much by job performance dimension at either the level of observed correlations or the level of construct-level correlations. This calls into question the importance of processes postulated to moderate convergence (cf. Borman, 1974, 1979; Wohlers & London, 1989). Variability in peer-supervisor convergence across dimensions is a necessary but not sufficient condition to support the mechanisms postulated to moderate convergence. The meta-analytic cumulation reported here indicates that there is only limited variability in peer-supervisor convergence across dimensions. Thus, the condition (substantial variability in peer-supervisor convergence across dimensions) necessary for inferring support for the mechanisms postulated to moderate convergence appears to be only weakly satisfied.
Another interesting point that emerged from our results that could profit from future research scrutiny is the high reliability of compliance ratings compared with ratings of other dimensions. Perhaps behaviors related to compliance and accepting authority may be regarded as extensions of mores and norms encountered in society at large, providing raters with a common frame of reference. Yet another issue is the question of the construct validity of ratings. If interrater reliabilities are as low as we found in the cumulative literature, then to what extent can organizations, and we as scientist-practitioners, rely on a single supervisor's ratings to validate our interventions? It is true that when an organization bases its decisions on the ratings of a single supervisor, there is a lot of unreliability in those ratings.
Having said that, we must point out that reliability and construct validity, though related, are not the same. The fact that the reliability is only .50 implies that the construct validity of the observed score (i.e., the correlation between the observed scores on the measure and the underlying construct) has to be the square root of .50 or less. (Reliability is the ratio of true to observed variance, and the square root of the reliability coefficient is the correlation between observed scores and the scale's underlying true scores [Nunnally & Bernstein, 1994].) The correlation between the observed scores and the construct the scale is intended to measure cannot be higher than the correlation between observed scores and true scores, that is, the square root of the reliability. Therefore, if the reliability of ratings is .50, then the implication is that the construct validity cannot be higher than .71, which is still a high and respectable index of construct validity. The construct validity of ratings has been researched for many years, and even a brief summary is beyond the scope of this article.
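The ceiling argument can be verified numerically; the .50 reliability figure is the article's illustrative value:

```python
import math

def construct_validity_ceiling(reliability):
    """Upper bound on the correlation between observed scores and the
    underlying construct: the square root of the reliability, per the
    classical measurement model (cf. Nunnally & Bernstein, 1994)."""
    return math.sqrt(reliability)

ceiling = construct_validity_ceiling(0.50)  # about .71
```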
One can argue further that because the concept of interrater reliability rests on the premise that one is computing the correlation between parallel raters, interrater reliability for supervisor ratings is problematic because there is only one true supervisor. To compute the interrater reliability of supervisor ratings, a second person (in most cases, the supervisor's supervisor) is located to provide second ratings. Is the second rater a parallel measure of the first (true) supervisor? If not, is interrater reliability underestimated for supervisors? (Note that in most organizations, individuals would have reported to more than one supervisor over time. Employees rotate among supervisors, and, as such, estimating interrater reliability may not be a serious problem.)
There are two points to note in addressing this concern. First, as seen in Table 1, the average interrater reliability for supervisors is higher than that reported for peers (where presumably two parallel raters are available). It is more likely that an individual has two peers but only one immediate supervisor. Thus, intersupervisor reliability is computed by having a supervisor's supervisor, or a stand-in or a rotating supervisor, provide ratings in addition to the immediate supervisor. However, the computation is relatively more straightforward for interpeer reliability because individuals are likely to have two peers. If the various factors are influencing interrater reliabilities, then they are more likely to affect supervisors than peers. Thus, supervisor reliability should be lower than estimated peer interrater reliability. However, the cumulative evidence shows that interrater reliability is lower for peers than it is for supervisors. This seems to be evidence against the hypothesis that interrater reliability for supervisory ratings is underestimated in the literature because there are no two parallel supervisors (i.e., the supervisor is unique to each ratee).
Second, there is another factor, leniency, to take into account. Raters differ in leniency. Typically, in computing the interrater reliability, some individuals will be rated by two raters, others by another set of two raters, and so forth. Thus, rater main effects (leniency) are confounded with Rater × Ratee interactions. An argument can be made that the interrater estimate is therefore affected. However, the reliability estimate used should match the type of data being correlated in the real world. In almost all instances, we have different supervisors rating different employees. These ratings are then correlated with test or predictor data. Thus, the reliability coefficient used to correct the correlations should also include the rater main effect as error. The interrater reliability estimates that we have here accurately reflect this real-world dynamic of obtaining data, and hence are the appropriate ones to use in making the corrections.
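A small simulation can illustrate why rater main effects (leniency) behave as error when different ratees are evaluated by different rater pairs, as in the validation designs described above. All variance components here are assumed for illustration:

```python
import random

random.seed(0)

def simulate_interrater_r(n=5000, sd_true=1.0, sd_len=0.7, sd_noise=0.7):
    """Each ratee has one true score; each of the two ratings comes from
    a different rater with an individual leniency offset. Because the
    raters change from ratee to ratee, leniency acts as random error and
    depresses the interrater correlation toward
    sd_true**2 / (sd_true**2 + sd_len**2 + sd_noise**2)."""
    xs, ys = [], []
    for _ in range(n):
        t = random.gauss(0, sd_true)
        xs.append(t + random.gauss(0, sd_len) + random.gauss(0, sd_noise))
        ys.append(t + random.gauss(0, sd_len) + random.gauss(0, sd_noise))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

r = simulate_interrater_r()  # expected near 1.0 / 1.98, i.e., about .5
```

With these assumed components, the interrater correlation already treats rater leniency as error, matching the real-world case in which each employee is rated by a different supervisor.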
At this point, we state our position on another issue that was raised first by the editor-in-chief of this journal, the initial action editor of this article, and subsequently by the associate editor who served as the action editor. Both claimed that our framework is valid only if one accepts our definitions of true score and error variance. Specifically, both argued that rater-specific variance need not be construed as error. They suggested that raters may observe different episodes of behavior, resulting in true disagreements between raters that should not be interpreted as measurement error (see also Murphy & Cleveland, 1995). Even though two raters may observe different incidents of performance, as long as the different incidents of performance viewed tap into the same performance dimension (sample the same domain), raters can be viewed as parallel raters. Under the classical measurement model, rater-specific variance is measurement error, just as item-specific variance is measurement error in the computation of coefficients of equivalence and alpha. A more detailed discussion of this issue is available elsewhere (Murphy & DeShon, 2000; Schmidt et al., 2000). Here, we state that our framework is built on the principles of classical measurement theory, and, to the extent these principles are inappropriate, our framework and conclusions will also be inappropriate. However, it is our position that classical measurement theory is appropriate for use with ratings of job performance (Schmidt et al., 2000).
In this study, we focus on whether the confidence intervals for any two performance dimensions overlap. A reviewer suggested that instead we should compute the confidence interval around the difference between the correlations for each pair of performance dimensions. Although this could be done, it would result in 36