The Moderating Influence of Job Performance Dimensions on Convergence of Supervisory and Peer Ratings of Job Performance: Unconfounding Construct-Level Convergence and Rating Difficulty

Chockalingam Viswesvaran, Florida International University
Frank L. Schmidt, University of Iowa
Deniz S. Ones, University of Minnesota, Twin Cities Campus

Author note: All three authors contributed equally to this article. Correspondence concerning this article should be addressed to Chockalingam Viswesvaran, Department of Psychology, Florida International University, Miami, Florida 33199.

This article summarizes the correlations between supervisor and peer ratings for different dimensions of job performance. Methods were used to separate lack of construct-level convergence from rating difficulty in the correlation between supervisor and peer ratings. The authors found complete construct-level convergence for ratings of overall job performance, productivity, effort, job knowledge, quality, and leadership, but not for ratings of administrative competence, interpersonal competence, or compliance with or acceptance of authority. Higher rating difficulty was associated with lower mean observed peer-supervisor correlations. Implications for research and practice are discussed.

Performance ratings have traditionally played a central role in the measurement of job performance in industrial-organizational psychology. In their review, Lawshe and Balma (1966) found that 66% of validity studies used ratings. Lent, Auerbach, and Levin (1971) found in their review of 1,506 validation studies that 63% used ratings as the criterion. There is evidence that the use of ratings has not decreased over the years; if there has been a discernible trend, it has been an increasing use of ratings (Cascio, 1991; Landy, 1989; Marchese & Muchinsky, 1988). Furthermore, Bernardin and Beatty (1984) found that 83% of the respondents to a survey indicated that they used ratings as their primary source of performance appraisal. Cleveland, Murphy, and Williams (1989) also found that the use of ratings is prevalent in organizations.

Because ratings of performance are widely used, they have been widely researched. Many researchers (e.g., Saal, Downey, & Lahey, 1980) have examined the psychometric properties of ratings, such as distributional characteristics (e.g., central tendency and leniency, the skewness and kurtosis of the distribution), reliability (King, Hunter, & Schmidt, 1980; Rothstein, 1990; Salgado & Moscoso, 1996; Viswesvaran, Ones, & Schmidt, 1996), and validity (e.g., Church & Bracken, 1997). The effects of the rating format have been extensively researched, and consensus has emerged that the contribution of format differences to rating quality is limited (Landy & Farr, 1980). The effects of training (e.g., Borman, 1978; London & Smither, 1995) and training influences on cognitive processes have also been investigated (e.g., Hauenstein & Foti, 1989). Researchers have also examined the purposes for which ratings are used in an organization (Cleveland et al., 1989), the cognitive processes involved in ratings (e.g., DeNisi, Cafferty, & Meglino, 1984), and the influence of political climate in the organization on ratings (e.g., Kravitz & Balzer, 1992).

Ratings are subjective evaluations obtained from supervisors, peers, subordinates, self, or customers. The various uses of ratings can be broadly classified as administrative, feedback, or research uses.
Of the five sources of evaluation, Cascio (1991, p. 81) stated that only supervisory and peer ratings are used for all three purposes to which ratings can be put. Most ratings are confined to supervisory ratings. For example, Lent et al. (1971) found that 93% of all ratings were supervisor ratings. (The remaining 7% of their database was based on peer ratings.) Bernardin and Beatty (1984) estimated that over 90% of the ratings used in the literature are supervisor evaluations. Furthermore, the traditional hierarchical structure of most organizations also increases the emphasis on supervisor ratings. In contrast, in recent years there has been a movement away from traditional hierarchical organizations toward matrix structures and team- or project-based organization of work (Norman & Zawacki, 1991), a trend that has increased the use of peer ratings and even of subordinate ratings.

This article focuses on the relationship between peer and supervisor ratings of job performance. It examines whether the correlations between ratings made by supervisors and those made by peers differ across the job performance dimensions rated (i.e., is the rating content a moderator of the convergence?). In examining the convergence in ratings between peers and supervisors, we distinguish between the effects of rating difficulty and construct-level convergence. Agreement between raters can be reduced by the absence of agreement on the nature of the construct to be rated, by the difficulty of rating a particular agreed-on construct, or by both.

Method

The standard error of each mean observed correlation was computed as SE_r = SD_r / √k, where SD_r is the standard deviation of the observed correlations and k is the number of samples. SE_r was used to compute confidence intervals for the mean observed correlations. If the confidence intervals for different job performance dimensions overlap, we cannot reject the hypothesis that rating difficulty is the same across dimensions.

The mean observed correlation for each of the nine dimensions was next corrected using the interrater reliabilities for peer ratings and the interrater reliabilities for supervisory ratings reported in Viswesvaran et al.'s (1996) article. These interrater reliabilities are shown in Table 1. As discussed there, Viswesvaran et al. had summarized the interrater reliabilities for the dimensions for which four or more estimates were available. For supervisor ratings of administrative competence, the interrater reliability in Viswesvaran et al.'s analyses was .42, but this estimate was based on fewer than four estimates and thus was not reported in that publication.

The mean observed correlation for each of the nine dimensions was corrected for the interrater reliability of peer ratings and for the interrater reliability of supervisory ratings. Thus, the second step in our analyses was to estimate the true-score correlations between peer and supervisor ratings for each of the nine dimensions of job performance. In estimating the true-score correlation, when the two measures being correlated are provided by different raters, interrater reliability coefficients are the appropriate reliabilities; intrarater reliability estimates are not appropriate (Schmidt & Hunter, 1996, 1999; Schmidt, Viswesvaran, & Ones, 2000; Viswesvaran et al., 1996; see Murphy & DeShon, 2000, for a contrary view). It is important to note that interrater reliability corrects for both rater-specific variance and random measurement error. As such, there is no need to correct for intrarater reliability.

If the confidence interval around the mean corrected correlation extends to 1.00, then we cannot reject the hypothesis that supervisors and peers are rating the same job performance construct. Confidence intervals for mean construct-level correlations are produced by correcting the endpoints of the confidence intervals for the mean observed correlations; confidence intervals can be corrected for measurement error just as individual correlations and mean correlations can be corrected (Hunter & Schmidt, 1990, p. 120).
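To make the two analysis steps concrete, here is a minimal computational sketch in Python. It is our illustration, not the authors' code: the function names are ours, and the example inputs are hypothetical (the reliabilities loosely echo the mean values of .55 and .43 discussed later in the text).

```python
import math

def mean_r_ci(mean_r, sd_r, k, z=1.96):
    """95% confidence interval for a mean observed correlation,
    using the standard error of the mean: SE_r = SD_r / sqrt(k)."""
    se = sd_r / math.sqrt(k)
    return mean_r - z * se, mean_r + z * se

def disattenuate(r, rel_sup, rel_peer):
    """Correct a peer-supervisor correlation for interrater
    unreliability in both sources (classical attenuation formula)."""
    return r / math.sqrt(rel_sup * rel_peer)

# Hypothetical summary data for one performance dimension.
mean_r, sd_r, k = 0.41, 0.20, 20      # observed rs across k samples
rel_sup, rel_peer = 0.55, 0.43        # single-rater interrater reliabilities

lo, hi = mean_r_ci(mean_r, sd_r, k)
rho = disattenuate(mean_r, rel_sup, rel_peer)
# The CI endpoints are corrected the same way as the mean itself.
rho_lo, rho_hi = (disattenuate(x, rel_sup, rel_peer) for x in (lo, hi))

# Decision rule from the text: if the corrected interval reaches 1.00,
# we cannot reject the hypothesis that supervisors and peers are rating
# the same construct. (Corrected endpoints can exceed 1.00 because of
# sampling error in the estimates.)
print(f"rho = {rho:.2f}, CI = ({rho_lo:.2f}, {rho_hi:.2f}), "
      f"construct convergence plausible: {rho_hi >= 1.0}")
```

With these illustrative inputs the corrected interval reaches 1.00, so the decision rule would not reject complete construct-level convergence for that dimension.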
In our tables, based on convention in psychological research, we report 95% confidence intervals, though interested readers may compute other confidence intervals.

Results

Table 2 reports the results. All mean observed correlations in Table 2 are less than .50, indicating substantial disagreement between peers and supervisors. This disagreement may be because of rating difficulty (reflected in interpeer and intersupervisor reliability), dimension-level disagreements between peers and supervisors, or both.

Table 1
Interrater Reliabilities: Sample Size Weighted Means and Standard Deviations
[The table reports, separately for supervisor and peer ratings, the N, k, M, and SD of single-rater interrater reliabilities for nine job performance dimensions: overall job performance, productivity, effort, interpersonal competence, administrative competence, quality, job knowledge, leadership, and compliance with or acceptance of authority. Most cell values are not legible in the source; values cited in the text include supervisor reliabilities ranging from .45 to .63 (plus .42 for administrative competence) and peer reliabilities ranging from .33 to .42, plus .71 for compliance.]
Note. These values are summarized from Viswesvaran et al.'s (1996) article. Values are for ratings produced by a single rater. N = total sample size across all estimates that were averaged; k = number of samples averaged.

To unconfound rating difficulty and construct-level convergence, we first examine the confidence interval around the estimated true-score correlation. If the interval includes 1.00, we cannot reject the hypothesis that peers and supervisors are rating the same performance construct. In such a case, we then evaluate rating difficulty by looking at the mean observed correlations for that performance dimension. If the confidence intervals around the estimated true-score correlations do not include 1.00, this suggests that peers and supervisors may be rating somewhat different constructs, and rating difficulty may not be the whole explanation for less than perfect convergence.

For the nine performance dimensions summarized in Table 2, the confidence intervals around the true-score correlation do not include 1.00 for three performance dimensions: administrative competence, interpersonal competence, and compliance with or acceptance of authority. Peers and supervisors may have somewhat different conceptualizations of these three dimensions. However, the underlying constructs are fairly highly correlated for peers and supervisors: .86 for interpersonal competence, .69 for administrative competence, and .78 for compliance with or acceptance of authority. It is pertinent to note that the reliability reported for peer ratings of compliance with or acceptance of authority (.71) is unusually high compared with the reliability estimates reported for other dimensions. For example, the range of interrater reliability for supervisors was .45-.63, and it was .33-.42 for peers (excluding the value of .71 for compliance). If we had used a value of .43 (the mean of the interpeer reliabilities) instead of .71 for compliance ratings, then we would have concluded that peers and supervisors are rating the same construct when providing ratings of compliance with authority. On the basis of our decision rule, supervisors and peers appear to be rating the same construct when providing ratings of productivity, quality, job knowledge, leadership, overall job performance, and effort. For these six dimensions, confidence intervals for the mean construct-level correlations include 1.00.
Table 2
Mean Correlations Between Peer Ratings and Supervisory Ratings of Different Job Performance Dimensions
[For each of the nine job performance dimensions, the table reports k, N, the mean observed correlation and its 95% confidence interval, and the mean true-score correlation and its 95% confidence interval. Most cell values are not legible in the source; values cited in the text include mean true-score correlations of .91 (productivity), .99 (effort), .86 (job knowledge), .86 (interpersonal competence), .69 (administrative competence), .78 (compliance with or acceptance of authority), and .91 (leadership), and a mean observed correlation of .34 for administrative competence.]
Note. The performance dimensions were taken from, and are described in, Viswesvaran et al.'s (1996) article. k = number of samples included in the meta-analysis; N = total sample size across the k samples; 95% CI (r) = 95% confidence interval constructed around the mean observed correlation; 95% CI (p) = 95% confidence interval constructed around the mean true-score correlation.

The mean observed peer-supervisor correlations are less than .50 for all nine dimensions. Combined with the finding that the confidence intervals for the mean correlation corrected for interpeer and intersupervisor reliability (i.e., corrected for rating difficulty) included 1.00 for six of the nine dimensions, this pattern suggests that rating difficulty (as reflected in level of measurement error) is the major factor depressing peer-supervisor convergence. Is this observed correlation convergence moderated by the rating content? The confidence intervals around the mean observed correlations overlap for the six dimensions for which the confidence interval on the true-score correlation includes 1.00, suggesting that moderating influences, if any, are weak. Confidence intervals on observed correlations also overlap for the three dimensions for which the true-score correlations may be less than 1.00.

To some extent, the present findings allow us to examine the hypotheses about specific performance dimensions found in the literature and discussed earlier. Of the performance dimensions hypothesized in the literature to be relatively easy to rate, three are contained in Table 2: productivity, effort, and job knowledge. For all three of these dimensions, the confidence interval around the mean construct-level correlation includes 1.00, suggesting that supervisors and peers may be rating the same construct. The mean true-score correlation for these dimensions is .92, and the average observed correlation between supervisor and peer raters is .41. Of the performance dimensions hypothesized in the literature to be relatively difficult to rate, two are contained in Table 2: administrative competence and leadership. (A third, communication skills, was dropped because of lack of data.) For one of these, leadership, the true-score correlation confidence interval reaches up to 1.00, and the mean true-score correlation is large (.91). Also, for this dimension the mean observed correlation between supervisor and peer raters is .41, the same as the mean value for the low rating difficulty dimensions of productivity, effort, and job knowledge. Hence there is no evidence that leadership is more difficult to rate than these low rating difficulty performance dimensions. The other performance dimension hypothesized in the literature to be relatively difficult to rate is administrative competence. For this dimension, the mean construct-level correlation is .69, and the confidence interval does not reach up to 1.00, suggesting that supervisor and peer raters are not rating the same construct. In addition, the mean observed correlation between supervisor and peer raters is .34, the lowest value in Table 2.
However, for both observed and construct-level correlations, the confidence intervals overlap those of other dimensions and, in particular, overlap the confidence intervals of the three dimensions hypothesized to be relatively easy to rate. Hence, these findings do not provide strong support for the hypothesis that administrative competence is more difficult to rate than other dimensions, including those hypothesized to be easy to rate. For the remaining performance dimensions (overall job performance, interpersonal competence, quality, and compliance with or acceptance of authority), the literature contained no specific hypotheses about rating difficulty.

The mean observed correlations for peer-supervisor convergence reported here can be compared with the interrater reliabilities reported in Viswesvaran et al.'s (1996) article for these nine dimensions for supervisors and peers. These interrater reliabilities are for single raters and are presented in Table 1. Interrater reliabilities are higher for supervisors (M = .55) than for peers (M = .43), indicating that there is more measurement error in peer ratings.

Across the nine dimensions, the mean true-score correlation is .85 and the mean observed correlation is .41. This means that, on average across the performance dimensions, measurement error reduces the observed correlation between supervisor and peer raters from .85 to .41, a reduction of 44 correlation points. In contrast, lack of complete construct convergence reduces the mean supervisor-peer construct-level correlation from 1.00 to .85, a much smaller reduction. Hence, it is clear that measurement error is the major factor reducing correlations between supervisor and peer ratings. The effect of measurement error is almost three times larger than the impact of lack of complete construct convergence (.44 vs. .15, a ratio of 2.9). If we omit the two dimensions with the lowest mean supervisor-peer construct-level correlations, administrative competence (.69) and quality (.68), these figures are even more striking. The mean true-score correlation is then .90, and the mean observed correlation is .41. Hence, measurement error reduces the mean supervisor-peer correlation even more: .90 - .41 = .49. However, the reduction in the mean construct-level correlation is only 1.00 - .90 = .10.

For six of the nine performance dimensions, we cannot reject the hypothesis that the mean true-score correlation is 1.00. However, we can evaluate the data in Table 2 from a more holistic perspective. In this connection, it is striking that none of the estimates of mean true-score correlations equal or exceed 1.00. Suppose that the real (population) value of every true-score correlation were in fact 1.00. Then half the estimated mean true-score correlations should be (at least slightly) less than 1.00 and half should be (at least slightly) above 1.00, because of sampling error in the estimates (Schmidt & Hunter, 1999; cf. also Hunter & Schmidt, 1990, chap. 9). The expected average would be exactly 1.00. Instead, none of the values exceeds 1.00, and the average value is only .85 (.90 if we exclude the two performance dimensions with the lowest mean true-score correlations). This pattern of results suggests that supervisors and peers are, on average, rating similar but not exactly identical performance constructs. The lack of perfect agreement between supervisor and peer raters is mostly due to measurement error; however, a small part of it is due to lack of complete construct convergence between supervisor and peer raters. This is the average pattern of findings. For some specific dimensions, the supervisor-peer construct convergence may be complete. For example, for overall job performance and for the effort dimension, the mean true-score correlations are close to 1.00 (.99 for effort).
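To make the size of these two effects explicit, the arithmetic can be written out under the classical attenuation model. The notation below is ours, and the values are the means reported above, so the equality is approximate:

```latex
\[
\bar{r}_{\mathrm{obs}}
  = \bar{\rho}\,\sqrt{\bar{r}_{\mathrm{sup}}\,\bar{r}_{\mathrm{peer}}}
  = .85 \times \sqrt{.55 \times .43} \approx .41 ,
\]
\[
\underbrace{1.00 - .85}_{\text{construct gap}} = .15 ,
\qquad
\underbrace{.85 - .41}_{\text{attenuation due to error}} = .44 ,
\qquad
\frac{.44}{.15} \approx 2.9 .
\]
```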
The variation in mean interrater reliabilities across performance dimensions may be viewed as reflecting differences in rating difficulty across performance dimensions. An important question is whether these differences in rating difficulty generalize across rating sources. Are dimensions that are difficult for peers to rate also difficult for supervisors to rate? We correlated the mean interrater reliabilities of supervisors and peers across the nine performance dimensions and obtained a correlation of approximately .2. Hence, the dimensions that are the most difficult to rate reliably are not the same for supervisor and peer raters. However, this correlation is computed on a small sample, and therefore any inference on this basis must be tentative.

Discussion

Ratings are widely used in performance appraisals, and supervisors and peers are the two most common sources of ratings. In this article, we integrated research examining the convergence (correlation) between peer and supervisor ratings. Our findings indicate that the moderating effect of the rating content on the convergence between peer and supervisor ratings is not as strong as is implied by some cognitively based hypotheses (e.g., evaluation difficulty) proposed to explain the rating processes. Our findings are also consistent with the conclusions of Mount et al. (1998) that source effects (i.e., peer vs. supervisor) are not strong in ratings.

Our findings indicate that peer-supervisor convergence apparently does not vary much by job performance dimension at either the level of observed correlations or the level of construct-level correlations. This calls into question the importance of processes postulated to moderate convergence (cf. Borman, 1974, 1979; Wohlers & London, 1989). Variability in peer-supervisor convergence across dimensions is a necessary but not sufficient condition to support the mechanisms postulated to moderate convergence. The meta-analytic cumulation reported here indicates that there is only limited variability in peer-supervisor convergence across dimensions. Thus, the condition (substantial variability in peer-supervisor convergence across dimensions) necessary for inferring support for the mechanisms postulated to moderate convergence appears to be only weakly satisfied.

Another interesting point that emerged from our results, and that could profit from future research scrutiny, is the high reliability of compliance ratings compared with ratings of other dimensions. Perhaps behaviors related to compliance and accepting authority may be regarded as extensions of mores and norms encountered in society at large, providing raters with a common frame of reference. Yet another issue is the question of the construct validity of ratings. If interrater reliabilities are as low as we found in the cumulative literature, then to what extent can organizations, and we as scientist-practitioners, rely on a single supervisor's ratings to validate our interventions? It is true that when an organization bases its decisions on the ratings of a single supervisor, there is a great deal of unreliability in those decisions. Having said that, we must point out that reliability and construct validity, though related, are not the same. The fact that the reliability is only .50 implies that the construct validity of the observed scores (i.e., the correlation between the observed scores on a measure and the underlying construct) has to be the square root of .50 or less. (Reliability is the ratio of true to observed variance, and the square root of the reliability coefficient is the correlation between observed scores and the scale's underlying true scores; Nunnally & Bernstein, 1994.) The correlation between the observed scores on a scale and the construct the scale is intended to measure cannot be higher than the correlation between observed scores and true scores, that is, the square root of the reliability. Therefore, if the reliability of ratings is .50, then the implication is that the construct validity cannot be higher than .71, which is still a high and respectable index of construct validity. The construct validity of ratings has been researched for many years, and even a brief summary is beyond the scope of this article.
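Stated formally (the notation here is ours, not the authors'): if r_XX' denotes the interrater reliability, the correlation between observed scores X and true scores T is the square root of r_XX', and this bounds the correlation between X and any construct C:

```latex
\[
  r_{XC} \;\le\; r_{XT} = \sqrt{r_{XX'}} ,
  \qquad\text{so } r_{XX'} = .50 \;\Rightarrow\; r_{XC} \le \sqrt{.50} \approx .71 .
\]
```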
One can argue further that because the concept of interrater reliability rests on the premise that one is computing the correlation between parallel raters, interrater reliability for supervisor ratings is problematic because there is only one true supervisor. To compute the interrater reliability of supervisor ratings, a second person (in most cases, the supervisor's supervisor) is located to provide the second set of ratings. Is the second rater a parallel measure of the first (true) supervisor? If not, is interrater reliability underestimated for supervisors? (Note that in most organizations, individuals would have reported to more than one supervisor over time. Employees rotate among supervisors, and, as such, estimating interrater reliability may not be a serious problem.)

There are two points to note in addressing this concern. First, as seen in Table 1, the average interrater reliability for supervisors is higher than that reported for peers (where presumably two parallel raters are available). It is more likely that an individual has two peers but only one immediate supervisor. Thus, intersupervisor reliability is computed by having a supervisor's supervisor, or a stand-in or a rotating supervisor, provide ratings in addition to the immediate supervisor, whereas the computation is relatively more straightforward for interpeer reliability because individuals are likely to have two peers. If these various factors were depressing interrater reliabilities, then they would be more likely to affect supervisors than peers; that is, estimated supervisor interrater reliability should be lower than estimated peer interrater reliability. However, the cumulative evidence shows that interrater reliability is lower for peers than it is for supervisors. This seems to be evidence against the hypothesis that interrater reliability for supervisory ratings is underestimated in the literature because there are no two parallel supervisors (i.e., the supervisor is unique to each ratee).

Second, there is another factor, leniency, to take into account. Raters differ in leniency. Typically, in computing the interrater reliability, some individuals will be rated by one pair of raters, others by another pair of raters, and so forth. Thus, rater main effects (leniency) are confounded with Rater × Ratee interactions. An argument can be made that the interrater reliability estimate is therefore affected. However, the reliability estimate used should match the type of data being correlated in the real world. In almost all instances, we have different supervisors rating different employees. These ratings are then correlated with test or predictor data. Thus, the reliability coefficient used to correct the correlations should also include the rater main effect as error. The interrater reliability estimates that we have here accurately reflect this real-world dynamic of obtaining data, and hence they are the appropriate ones to use in making the corrections.
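The argument in the preceding paragraph can be illustrated with a short simulation. The sketch below is our own illustration with arbitrary variance components, not an analysis from the article: every employee is rated by a different supervisor, so rater leniency (the rater main effect) behaves exactly like random error in an operational validation correlation and is properly counted as error by interrater reliability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_validity = 0.50  # hypothetical predictor-performance correlation

# True performance and a predictor whose true validity is .50.
true_perf = rng.standard_normal(n)
predictor = (true_validity * true_perf
             + np.sqrt(1 - true_validity**2) * rng.standard_normal(n))

# Each employee is rated by a DIFFERENT supervisor, so each rating
# carries that rater's idiosyncratic leniency plus random error.
leniency = rng.normal(0.0, 0.6, n)   # arbitrary rater-main-effect SD
noise = rng.normal(0.0, 0.8, n)      # arbitrary residual error SD
rating = true_perf + leniency + noise

# Reliability = true variance / observed variance; leniency counts as
# error because it varies across the rater-ratee pairs being correlated.
reliability = 1.0 / (1.0 + 0.6**2 + 0.8**2)   # = .50 here

r_obs = np.corrcoef(predictor, rating)[0, 1]
print(f"observed validity:        {r_obs:.3f}")
print(f"expected rho * sqrt(rel): {true_validity * np.sqrt(reliability):.3f}")
```

The two printed values agree: correcting the observed validity with an interrater reliability that treats leniency as error recovers the true validity, whereas counting leniency as true variance would imply a higher reliability and the correction would fail to recover it.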
At this point, we state our position on another issue, which was raised first by the editor-in-chief of this journal, the initial action editor of this article, and subsequently by the associate editor who served as the action editor. Both claimed that our framework is valid only if one accepts our definitions of true-score and error variance. Specifically, both argued that rater-specific variance need not be construed as error. They suggested that raters may observe different episodes of behavior, resulting in true disagreements between raters that should not be interpreted as measurement error (see also Murphy & Cleveland, 1995). Even though two raters may observe different incidents of performance, as long as the different incidents of performance viewed tap into the same performance dimension (sample the same domain), the raters can be viewed as parallel raters. Under the classical measurement model, rater-specific variance is measurement error, just as item-specific variance is measurement error in the computation of coefficients of equivalence and coefficient alpha. A more detailed discussion of this issue is available elsewhere (Murphy & DeShon, 2000; Schmidt et al., 2000). Here, we state that our framework is built on the principles of classical measurement theory, and, to the extent that these principles are inappropriate, our framework and conclusions will also be inappropriate. However, it is our position that classical measurement theory is appropriate for use with ratings of job performance (Schmidt et al., 2000).

In this study, we focused on whether the confidence intervals for any two performance dimensions overlap. A reviewer suggested that instead we should compute the confidence interval around the difference between the correlations for each pair of performance dimensions. Although this could be done, it would result in 36 comparisons (one for each pair of the nine dimensions)
