SOURCES OF PERFORMANCE APPRAISAL INFORMATION

As stated in Chapter 3, job performance can be characterized by many criteria. Three different types of data are used: objective production data, personnel data, and judgmental data.

Objective Production Data

For a person in a production job, job performance may be measured by counting the number of objects produced per day, per week, and so forth. Similarly, salespeople are appraised by assessing (counting) their sales volume over a given period. It is even possible to evaluate the performance of firefighters by counting the number of fires they extinguish. Although each of these objective production measures has some intuitive appeal, none is usually a complete measure of job performance. Two problems in particular affect each of these measures. First, we would like to assume that differences in performance across people reflect true differences in how well these people perform their jobs. Unfortunately, one machine operator may produce more because he or she works with a better machine. A salesperson might have a larger sales volume because his or her territory is better. Firefighters who put out few fires might be responsible for an area with relatively few buildings. This problem of variation due to external factors should sound familiar; it is the problem of criterion contamination discussed in Chapter 3.

The second problem with objective performance measures is that they rarely tell the whole story. A machine operator who produces more objects per day but who also produces more defective objects should not necessarily be judged the better performer. A salesperson may spend a lot of time recruiting new customers, an aspect that must be weighed against simply making calls on established customers. Creating new customers can be as important as maintaining business with old ones. The sales volume might be lower at first, but in the long run, the new customers will increase total sales volume. Extinguishing fires is but one aspect of a firefighter's job; preventing fires is another one. The "best" firefighters conceivably may not have put out many fires but might have contributed heavily toward preventing fires in the first place.

There is growing use of electronic performance monitoring (EPM) to assess the performance of workers engaged in computer-based tasks. An EPM system records the objective aspects of performance on computer-based tasks, such as volume and speed of entries (Lund, 1992). Research (Westin, 1992) has revealed that such systems are regarded as unfair and inaccurate assessments of work performance because they ignore the discretionary or judgmental component of work, which is inherent to all jobs. In short, all of these actual criteria suffer from criterion deficiency: They are deficient measures of the conceptual criteria they seek to measure.

Relevance of Objective Production Data. For the jobs mentioned, objective production data have some relevance. It would be absurd to say that sales volume has no relationship to a salesperson's performance. The salesperson's job is indeed to sell. The issue is the degree of relevance. It is a mistake to give too much importance to objective production data in performance appraisal (see Field Note 1). This is sometimes a great temptation because these data are usually very accessible, but the meaning of those clear-cut numbers is not always so evident. Finally, for many jobs, objective performance measures do not exist or, if they do, they have little relevance to actual performance.
Personnel Data

The second type of performance appraisal information is personnel data: information retained in the employee's personnel file. The two most common indices are absenteeism and accidents. The critical issue with these variables is criterion relevance: To what extent do they reflect real differences in job performance?

Absenteeism is probably the most sensitive measure of performance. In almost all jobs, employees who have unexcused absences are judged as performing worse than others, all other factors being equal. Indeed, an employee can be fired for excessive absence. Most organizations have policies for dealing with absenteeism, which attests to its importance. However, absences can be "excused" or "unexcused" depending on many factors pertaining to both the individual (for example, seniority) and the job (for example, job level). An employee who has ten days of excused absence may still be appraised as performing better than an employee with fewer days of unexcused absence. Although measuring absenteeism is a thorny problem, it is seen as a highly relevant criterion variable in most organizations.

FIELD NOTE 1  What Is "High" Performance?

Usually we think of high performance as a positive score, a gain, or some improvement over the status quo. Conversely, when individuals perform "worse" than they did last year, it is tempting to conclude they didn't perform as well. However, such is not always the case. Performance must be judged in terms of what is under the control of the individuals being evaluated rather than those influences on performance that are beyond their control. There can be broad, pervasive factors, sometimes of an economic nature, that suppress the performance of everyone being judged. One example is in sales. If there is a general downturn in the economy and products or services are not being purchased with the same frequency as in the previous year, sales could be down, for example, by an average of 15%. This 15% figure (actually a 15% loss) would then represent "average" performance. Perhaps the best salesperson in the year had only a 3% drop in sales over the previous year. Thus, "good" performance in this situation is a smaller loss, compared with some average or norm group. This example illustrates that there is always some judgmental, or contextual, component to appraising performance and that sole reliance on objective numbers can be misleading.

Accidents can also be used as a measure of job performance. Accidents resulting in injury and accidents resulting in property damage are both used as variables. However, accidents are a relevant criterion variable for only certain jobs; they are rarely used for white-collar jobs. People who drive delivery trucks may be evaluated in part on the number of accidents they have. This variable, too, can be contaminated: Road conditions, miles driven, time of day, and condition of the truck can all contribute to accidents. While their relevance is limited to certain jobs, accidents can contribute greatly to appraisal in those jobs. Companies often give substantial pay raises to drivers with no accidents and fire those with a large number.

Relevance of Personnel Data. There is no doubt that factors such as absence and accidents are meaningful measures of job performance. Employees may be discharged for excessive absences or accidents. Employers expect employees to come to work and to work safely. However, as was the case with production data, personnel data rarely give a comprehensive picture of the employee's performance.
Other highly relevant aspects of job performance often are not revealed by personnel data. It is for this reason that judgmental data are relied upon to offer a more complete assessment of job performance.

Rating Errors

Because errors occur in making ratings, it is important to understand the major types of rating errors. In making appraisals with rating scales, the rater may unknowingly commit errors in judgment. These can be placed into three categories: halo errors, leniency errors, and central-tendency errors.

halo error: a type of rating error in which the rater assesses the ratee as performing well on a variety of performance dimensions despite having knowledge of only a limited number of performance dimensions

In a halo error, the rater has strong feelings about the employee that permeate all evaluations of this person. Typically, the rater has strong feelings about at least one important aspect of the employee's performance and judges the employee (across many factors) as uniformly good or bad. The rater who is impressed by an employee's idea might allow those feelings to carry over to the evaluation of leadership, cooperation, motivation, and so on. This occurs even though the "good idea" is not related to the other factors. The theory of person perception offers a conceptual basis to understand halo error. In the schemas we use to assess other people, it may make sense to us that they would be rated highly across many different dimensions of performance, even dimensions we have little or no opportunity to observe.

Raters who commit halo errors do not distinguish among the many dimensions of employee performance. A compounding problem is that there are two types of halo. One type is truly a rating error: It involves assigning consistent ratings to an employee whose performance across dimensions is in fact not consistent. The other type reflects consistent ratings given to an employee whose performance really is consistent across dimensions; this is sometimes referred to as valid halo. Solomonson and Lance (1997) concluded that a valid halo (actual job dimensions that are positively interrelated) does not affect halo rater error (raters who allow general impressions to influence their ratings). In general, halo errors are considered to be the most pervasive of all rating errors (Cooper, 1981). Recent research on halo error has revealed it as a more complex phenomenon than initially believed. Murphy and Anhalt (1992) concluded that halo error is not a stable characteristic of the rater or ratee, but rather is the result of an interaction of the rater, ratee, and evaluative situation. Balzer and Sulsky (1992) contended that halo error may not be a rating "error" so much as an indicator of how we cognitively process information in arriving at judgments of other people. That is, the presence of a halo does not necessarily indicate an inaccuracy in the ratings. In a related view, Lance, LaPointe, and Fisicaro (1994) noted there is disagreement as to whether halo error can be attributed to the rater or to the cognitive process of making judgments of similar objects.

leniency error: a type of rating error in which the rater assesses a disproportionately large number of ratees as performing well (positive leniency) or poorly (negative leniency) in contrast to their true level of performance

central-tendency error: a type of rating error in which the rater assesses a disproportionately large number of ratees as performing in the middle or central part of a distribution of rated performance in contrast to their true level of performance

Leniency errors are the second category. Just as some teachers are "hard graders" and others "easy graders," some raters evaluate harshly and others leniently. Harsh raters give evaluations that are lower than the "true" level of ability (if it can be ascertained); this is called severity or negative leniency.
The easy rater gives evaluations that are higher than the "true" level; this is called positive leniency. These errors usually reflect characteristics of the rater. Bernardin, Villanova, and Peyrelitte (1995) found that the tendency to make leniency errors was stable within individuals; that is, people tend to be consistently lenient or harsh in their ratings. Bernardin, Cooke, and Villanova (2000) found that the most lenient raters were those who had the personality characteristics of being low in Conscientiousness and high in Agreeableness.

Central-tendency error refers to the rater's unwillingness to assign extreme—high or low—ratings. Everyone is "average," and only the middle (central) part of the scale is used. This error may happen when raters are asked to evaluate unfamiliar aspects of performance. Rather than not respond, they play it safe and say the person is average in this "unknown" ability.

Even though we have long been aware of halo, leniency, and central-tendency errors, there is no clear consensus on how these errors are manifested in ratings. Saal, Downey, and Lahey (1980) observed that researchers define these errors in somewhat different ways. For example, leniency errors are sometimes equated with skew in the distribution of ratings; that is, positive skew is evidence of negative leniency and negative skew of positive leniency. Other researchers say that an average rating above the midpoint on a particular scale indicates positive leniency.

The exact meaning of central tendency is also unclear. Central-tendency errors occur if the average rating is around the midpoint of the scale but there is not much variance in the ratings. The amount of variance that separates central-tendency errors from "good" ratings has not been defined. Saal and associates think that more precise definitions of these errors must be developed before they can be overcome. Finally, the absence of these three types of rating errors does not necessarily indicate accuracy in the ratings. The presence of the rating errors leads to inaccurate ratings, but accuracy involves other issues besides the removal of these three error types. I/O psychologists are seeking to develop statistical indicators of rating accuracy, some based on classical issues in measurement (Cronbach et al., 1972).

Judgmental Data

Judgmental data are commonly used for performance appraisal because finding relevant objective measures is difficult. Subjective assessments can apply to almost all jobs. Those who do the assessments are usually supervisors, but some use has also been made of self-assessment and peer assessment. A wide variety of measures have been developed, all intended to provide accurate assessments of how people are performing (Pulakos, 1997). These are the major methods used in performance appraisal:

1. Graphic rating scales
2. Employee-comparison methods
   a. Rank order
   b. Paired comparison
   c. Forced distribution
3. Behavioral checklists and scales
   a. Critical incidents
   b. Behaviorally anchored rating scales (BARS)
   c. Behavior-observation scales (BOS)

Graphic Rating Scales. Graphic rating scales are the most commonly used tools in performance appraisal. Individuals are rated on a number of traits or factors. The rater judges "how much" of each factor the individual has. Usually performance is judged on a 5- or 7-point scale, and the number of factors ranges between 5 and 20. The more common dimensions rated are quantity of work, quality of work, practical judgment, job knowledge, cooperation, and motivation. Examples of typical graphic rating scales are shown in Figure 7-3.
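Ratings gathered on such graphic scales also make it possible to compute the statistical indicators of leniency and central tendency discussed earlier. The following is a minimal sketch in Python, assuming hypothetical ratings on a 7-point scale and arbitrary cutoff values; it illustrates the competing definitions noted by Saal, Downey, and Lahey, not an established diagnostic procedure.

```python
# A minimal sketch (hypothetical data, arbitrary cutoffs) of the statistical
# indicators of leniency and central tendency described above.
from statistics import mean, pvariance

def skewness(xs):
    """Population skewness; negative skew means ratings pile up at the top."""
    m, v = mean(xs), pvariance(xs)
    return 0.0 if v == 0 else sum((x - m) ** 3 for x in xs) / (len(xs) * v ** 1.5)

def rating_indicators(ratings, scale_max=7):
    midpoint = (1 + scale_max) / 2
    m, v, s = mean(ratings), pvariance(ratings), skewness(ratings)
    return {
        "mean": round(m, 2), "variance": round(v, 2), "skew": round(s, 2),
        # One definition: an average rating above the scale midpoint
        # indicates positive leniency.
        "positive_leniency_by_mean": m > midpoint,
        # Another: negative skew is evidence of positive leniency
        # (and positive skew of negative leniency).
        "positive_leniency_by_skew": s < 0,
        # Mean near the midpoint with little variance suggests central
        # tendency; the 0.5 cutoffs here are arbitrary illustrations.
        "possible_central_tendency": abs(m - midpoint) < 0.5 and v < 0.5,
    }

# One rater's ratings of 12 employees on "quality of work" (1-7 scale).
print(rating_indicators([6, 7, 6, 6, 7, 5, 6, 7, 6, 6, 7, 6]))
```

For this sample the mean (6.25) lies well above the midpoint and the skew is slightly negative, so both leniency indicators flag the rater, while the central-tendency check does not.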
FIGURE 7-3  Examples of graphic rating scales for various performance dimensions. (The figure illustrates several common formats: numeric scales for job knowledge and quantity of work, a point-range scoring scheme for dependability, anchored ratings for quality of work ranging from "consistently exceeds job requirements" to "consistently below job requirements," and a numbered scale for practical judgment.)

Employee-Comparison Methods. Rating scales provide for evaluating employees against some defined standard. With employee-comparison methods, individuals are compared with one another; variance is thereby forced into the appraisals. Thus, the concentration of ratings at one part of the scale caused by rating error is avoided. The major advantage of employee-comparison methods is the elimination of central-tendency and leniency errors because raters are compelled to differentiate among the people being rated. However, halo error is still possible because it manifests itself across multiple evaluations of the same person. Furthermore, all methods of employee comparison raise the question of whether variation represents true differences in performance or whether it creates a false impression of large differences when they are in fact small. There are three major employee-comparison methods: rank order, paired comparison, and forced distribution.

With the rank-order method, the rater ranks employees from high to low on a given performance dimension. The person ranked first is regarded as the "best" and the person ranked last as the "worst." However, we do not know how good the "best" is or how bad the "worst" is. We do not know the level of performance. For example, the Nobel Prize winners in a given year could be ranked in terms of their overall contributions to science. But we would be hard-pressed to conclude that the Nobel laureate ranked last made the worst contribution to science. Rank-order data are all relative to some standard—in this case, excellence in scientific research. Another problem is that it becomes tedious and perhaps even meaningless to rank order large numbers of people. What usually happens is that the rater can sort out the people at the top and bottom of the pile. However, for the rest with undifferentiated performance, the rankings may be somewhat arbitrary.

With the paired-comparison method, each employee is compared with every other employee in the group being evaluated. The rater's task is to select which of the two is better on the dimension being rated. The method is typically used to evaluate employees on a single dimension: overall ability to perform the job. The number of evaluation pairs is computed by the formula n(n − 1)/2, where n is the number of people to be evaluated. For example, if there are 10 people in a group, the number of paired comparisons is 10(9)/2 = 45. At the conclusion of the evaluation, the number of times each person was selected as the better of the two is tallied. The people are then ranked by the number of tallies they receive.
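Because the tallying procedure is purely mechanical, it is easy to sketch in code. The following Python fragment is a minimal illustration, assuming hypothetical employee names and a stand-in judge() function where a real application would record the rater's choices; it enumerates all n(n − 1)/2 pairs and ranks people by their tallies.

```python
# A minimal sketch of the paired-comparison tally. The employees and the
# judge() function are hypothetical; in practice the rater supplies each choice.
from itertools import combinations

def paired_comparison_ranking(employees, judge):
    """Tally how often each employee is picked as the better of a pair,
    then rank by tally. The number of pairs is n(n - 1) / 2."""
    wins = {e: 0 for e in employees}
    for a, b in combinations(employees, 2):   # all n(n - 1)/2 pairs
        wins[judge(a, b)] += 1                # judge returns the better of the two
    return sorted(wins.items(), key=lambda kv: kv[1], reverse=True)

employees = ["Ames", "Burke", "Cole", "Diaz"]             # n = 4 -> 6 pairs
# Stand-in for the rater's judgment: here, alphabetical order "wins".
ranking = paired_comparison_ranking(employees, judge=lambda a, b: min(a, b))
print(ranking)   # [('Ames', 3), ('Burke', 2), ('Cole', 1), ('Diaz', 0)]
```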
A major limitation of the paired-comparison method is that the number of comparisons mushrooms dramatically with large numbers of employees. If 50 people are to be appraised, the number of comparisons is 1,225; this obviously takes too much time. The paired-comparison method is best for relatively small samples.

The forced-distribution method is most useful when the other employee-comparison methods are most limited—that is, when the sample is large. Forced distribution is typically used when the rater must evaluate employees on a single dimension, but it can also be used with multiple dimensions. The procedure is based on the normal distribution and assumes that employee performance is normally distributed. The distribution is divided into five to seven categories. Using predetermined percentages (based on the normal distribution), the rater evaluates an employee by placing him or her into one of the categories. All employees are evaluated in this manner. The method "forces" the rater to distribute the employees across all categories (which is how the method gets its name). Thus, it is impossible for all employees to be rated excellent, average, or poor. An example of the procedure for a sample of 50 employees is illustrated in Figure 7-4.

FIGURE 7-4  The forced-distribution method of performance appraisal, showing the number of employees to be placed in each category based on 50 employees.

Some raters react negatively to this method, saying that the procedure creates artificial distinctions among employees. This is partly because the raters feel that performance is not normally distributed but rather negatively skewed; that is, most of their employees are performing very well. The dissatisfaction can be partially allayed by noting that the lowest 10% are not necessarily performing poorly, just not as well as the others. The problem (as with all comparison methods) is that performance is not compared with a defined standard. The meaning of the differences among employees must be supplied from some other source.
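A minimal sketch of the category-assignment arithmetic appears below (Python). The 10-20-40-20-10 percentage split is an assumed illustration of "predetermined percentages based on the normal distribution," not a prescribed standard, and the sketch presumes the employees are already ordered from lowest to highest; in practice the rater assigns people to categories directly.

```python
# A minimal sketch of forced distribution. The 10-20-40-20-10 split is an
# assumed example of predetermined, normal-curve-based percentages; real
# systems may use five to seven categories with other percentages.
CATEGORIES = [("lowest", 0.10), ("below average", 0.20), ("average", 0.40),
              ("above average", 0.20), ("highest", 0.10)]

def forced_distribution(employees_low_to_high):
    """Assign an ordered list of employees to fixed-quota categories."""
    n, start, result = len(employees_low_to_high), 0, {}
    for label, pct in CATEGORIES:
        count = round(n * pct)   # quota "forced" on this category
        result[label] = employees_low_to_high[start:start + count]
        start += count           # (rounding can leave a remainder for some n)
    return result

# With 50 employees these quotas come out to 5, 10, 20, 10, and 5 people.
employees = [f"employee_{i:02d}" for i in range(1, 51)]
print({label: len(group) for label, group in forced_distribution(employees).items()})
```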
Behavioral Checklists and Scales. Most recent advances in performance appraisal involve behavioral checklists and scales. The key term is behavior. Behaviors are less vague than other factors. The greater the agreement on the meaning of the performance appraised, the greater the chance that the appraisal will be accurate. All of the methods in this category have their origin directly or indirectly in the critical-incidents method.

Critical incidents are behaviors that result in good or poor job performance. Anderson and Wilson (1997) noted that the critical-incident technique is flexible and can be used for performance appraisal as well as job analysis. Supervisors record behaviors of employees that greatly influence their job performance. They either keep a running tally of these critical incidents as they occur on the job or recall them at a later time. Critical incidents are usually grouped by aspects of performance: job knowledge, decision-making ability, leadership, and so on. The end product is a list of behaviors (good and bad) that constitute effective and ineffective job performance.

The original critical-incidents method did not lend itself to quantification (that is, a score reflecting performance). It was used to guide employees in the specifics of their job performance. Each employee's performance can be described in terms of the occurrence of these critical behaviors. The supervisor can then counsel the employee to avoid the bad and continue the good behaviors. For example, a negative critical incident for a machine operator might be "leaves machine running while unattended." A positive one might be "always wears safety goggles on the job." Discussing performance in such clear terms is more understandable than using such vague statements as "poor attitude" or "careless work habits."

behaviorally anchored rating scales (BARS): a type of performance appraisal rating scale in which the points or values are descriptions of behavior

Behaviorally anchored rating scales (BARS) are a combination of the critical-incident and rating-scale methods. Performance is rated on a scale, but the scale points are anchored with behavioral incidents. The development of BARS is time-consuming, but the benefits make it worthwhile. BARS are developed in a five-step process:

1. A list of critical incidents is generated in the manner discussed previously.

2. A group of people (usually supervisors—either the same people who generated the critical incidents initially or another group) clusters the incidents into a smaller set of performance dimensions (usually five to ten) that they typically represent. The result is a given number of performance dimensions, each containing several illustrative critical incidents.

3. Another group of knowledgeable people is instructed to perform the following task: The critical incidents are "scrambled" so that they are no longer listed under the dimensions described in step 2. The critical incidents might be written on separate note cards and presented to the people in random order. The raters' task is to reassign or retranslate all the critical incidents back to the original performance dimensions. The goal is to have critical incidents that clearly represent the performance dimensions under consideration. A critical incident generally is said to be retranslated successfully if some percentage (usually 50% to 80%) of the raters reassign it back to the dimension from which it came. Incidents that are not retranslated successfully (that is, there is ample confusion as to which dimension they represent) are discarded.

4. The people who retranslated the items are asked to rate each "surviving" critical incident on a scale (typically seven or nine points) of just how effectively or ineffectively it represents performance on the appropriate dimension. The ratings given to each incident are then averaged, and the standard deviation for each item is computed. Low standard deviations indicate high rater agreement on the value of the incident; high standard deviations indicate low rater agreement. A standard deviation criterion is then set for deciding which incidents will be retained for inclusion in the final form of the BARS. Incidents that have a standard deviation in excess of 1.50 typically are discarded because the raters could not agree on their respective values.

5. The final form of the instrument consists of critical incidents that met both the retranslation and standard deviation criteria. The incidents serve as behavioral anchors for the performance dimension scales. The final BARS instrument is a series of scales listed vertically (one for each dimension) and anchored by the retained incidents. Each incident is located along the scale according to its established rating.
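The bookkeeping in steps 3 and 4 can be sketched briefly (Python). The incident, the judges' responses, and the dimension names are hypothetical; the 60% retranslation cutoff is one value chosen from the 50%-to-80% range mentioned in step 3, and the 1.50 standard deviation cutoff follows step 4.

```python
# A minimal sketch of the BARS retranslation and rating computations.
# Incident data are hypothetical; cutoffs follow the ranges in the text.
from statistics import mean, stdev

def retain_incident(reassignments, intended_dim, effectiveness_ratings,
                    retranslation_cutoff=0.60, sd_cutoff=1.50):
    """Keep an incident only if enough judges reassign it to its intended
    dimension (step 3) and judges agree on its effectiveness value (step 4).
    Returns the incident's scale position (its mean rating), or None."""
    hit_rate = reassignments.count(intended_dim) / len(reassignments)
    if hit_rate < retranslation_cutoff:
        return None                          # ambiguous dimension: discard
    if stdev(effectiveness_ratings) > sd_cutoff:
        return None                          # low agreement on value: discard
    return mean(effectiveness_ratings)       # location of the anchor

# "Leaves machine running while unattended," intended for a "safety" dimension.
anchor_value = retain_incident(
    reassignments=["safety", "safety", "job knowledge", "safety", "safety"],
    intended_dim="safety",
    effectiveness_ratings=[2, 1, 2, 2, 1],   # ineffective end of a 9-point scale
)
print(anchor_value)   # 1.6 -> the incident anchors the low end of the scale
```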
An example of BARS for patrol officer performance is shown in Figure 7-5. As can be seen, behaviors are listed with respect to what the employee is expected to do at various performance levels. For this reason, BARS are sometimes referred to as "behavioral expectation scales."

FIGURE 7-5  Example of a behaviorally anchored rating scale for appraising patrol officers. The scale shown covers the dimension "Job knowledge: Awareness of procedures, laws, and court rulings and changes in them," with anchors ranging from very high ("Could be expected to follow correct procedures for evidence preservation at the scene of a crime") through moderate ("Could be expected to occasionally have to ask other officers about points of law") to very low ("Could be expected to misinform the public on legal matters through lack of knowledge"). SOURCE: From Psychology of Work Behavior (rev. ed., p. 128), by F. J. Landy and D. A. Trumbo, 1980, Pacific Grove, CA: Brooks/Cole.

One of the major advantages of BARS is unrelated to performance appraisal: the high degree of involvement of the persons developing the scale. The participants must carefully examine specific behaviors that lead to effective performance. In so doing, they may reject false stereotypes about ineffective performance. The method has face validity for both the rater and the ratee and also appears useful for training raters. However, one disadvantage is that BARS are job specific; that is, a different behaviorally anchored rating scale must be developed for every job. Furthermore, it is possible for employees to exhibit different behaviors (on a single performance dimension) depending upon situational factors (such as the degree of urgency), and as such there is no one single expectation for the employee on that dimension. For example, consider the dimension of interpersonal relations. When conditions at work are relatively free of tension, a person may be expected to behave calmly. However, when operating under stress, a person may act irritably. Thus, the expectation of behavior depends on the circumstances in effect.

Another development in appraisal is the behavior-observation scale (BOS). Like BARS, it is based on critical incidents. With BOS the rater must rate the employee on the frequency of the critical incidents. The rater observes the employee over a certain period, such as a month. Here is an example of a five-point critical-incident scale used in appraising salespeople, as provided by Latham and Wexley (1977):

Knows the price of competitive products
Never    Seldom    Sometimes    Generally    Always
  1         2          3            4          5

Raters evaluate the employees on several such critical incidents, recording how often they observed the behavior. The total score is the sum of all the critical incidents. The final step is to correlate the response for each incident (a rating of 1, 2, 3, 4, or 5) with the total performance score. This is called item analysis. It is meant to detect the critical incidents that most influence overall performance. Those incidents that have the highest correlations with the total score are the most discriminating factors in performance. They would be retained to develop criteria for job success.
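The item analysis itself is a small computation, sketched below in Python with hypothetical ratings (statistics.correlation requires Python 3.10 or later). Each employee's item ratings are summed into a total score, and every item is then correlated with that total.

```python
# A minimal sketch of the BOS item analysis: correlate each critical-incident
# rating (1-5 frequency scale) with the total score. Data are hypothetical.
from statistics import correlation  # Pearson's r; Python 3.10+

# Rows = employees; columns = three critical-incident items.
ratings = [
    [5, 4, 3],
    [4, 4, 2],
    [2, 3, 5],
    [5, 5, 1],
    [3, 2, 4],
]
totals = [sum(row) for row in ratings]

for item in range(len(ratings[0])):
    item_scores = [row[item] for row in ratings]
    r = correlation(item_scores, totals)
    print(f"item {item + 1}: item-total r = {r:.2f}")
# Note: correlating an item with a total that includes that item inflates r
# somewhat; a stricter analysis would use the total of the remaining items.
```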
Latham, Fay, and Saari (1979) suggested advantages to performance appraisal with BOS. First, like BARS, BOS are developed by those who use the method for evaluation, who therefore understand and are committed to using the scales. Second, BOS information can be used to teach new employees the behaviors most critical to the job. Finally, BOS are content valid; the aspects of performance appraised are derived directly from the job. The authors believe this satisfies the EEOC requirement that appraisal methods be job relevant.

Relevance of Judgmental Data. The relevance of judgmental data in performance appraisal, like the relevance of any type of performance appraisal data, refers to the extent to which the observed data are accurate measures of the "true" variable being measured. The "true" variable can refer to a global construct, such as overall job performance, or a dimension of job performance, such as interpersonal relations ability. One method of assessing the relevance of judgmental data is to correlate them with performance appraisals from another method, such as objective production or personnel data. In studies that have conducted this type of analysis, the resulting correlations have been only moderate. While these results may be interpreted to mean that judgmental data have only moderate relevance, the key question is whether the objective production data or the personnel data can be assumed to represent "true" ability. Those types of data might be just as incomplete or remotely relevant as judgmental data. Since we never obtain measures of the conceptual criterion (that is, "true" ability), we are forced to deal with imperfect measures that, not surprisingly, yield imperfect results. Research (e.g., Weekley & Gier, 1989) showed the existence of rater disagreement and halo error even among such expert raters as Olympic judges who were intensely trained to make accurate evaluations. DeNisi and Peters (1996) reported that instructing raters to keep a structured diary for continuous record keeping of performance (rather than using memory recall) produced more accurate assessments of the employees. Wagner and Goffin (1997) concluded that performance appraisal ratings produced by behavioral-observation methods are no more accurate than those produced by employee-comparison methods.

Borman (1978) had another approach to assessing the relevance of judgmental data. He made videotapes of two employment situations: a manager talking with a problem employee and a recruiter interviewing a job candidate. Sixteen videotapes were made, eight of each situation. Each tape showed a different degree of performance—from a highly competent recruiter to a totally inept one. Similar degrees of performance were shown of the manager-subordinate meeting. Professional actors were used in the tapes. The same actor played the recruiter in all eight tapes, but a different actor played the new employee in each tape. Thus, the performance level (the "true" ability of the manager or recruiter) was "programmed" into the scripts.

[Cartoon: "Barkley, I perceive my role in this institution not as a judge but merely as an observer and recorder. I have observed you to be a prize boob and have so recorded it." SOURCE: Reprinted by permission: Tribune Media Services. © 1979 by Chicago Tribune-N.Y. News Syndicate Inc. All Rights Reserved.]

Raters were asked to rate the performance of the manager and recruiter with a series of rating scales.
The evaluations were correlated with the performance levels depicted. Correlations between ratings and levels of performance across several job dimensions (organizing the interview, establishing rapport, and so on) ranged from .42 to .97. The median was .69. Although this study used a simulation as opposed to actual job performance, it did show that various types of rating procedures are susceptible to differences in validity. The study also revealed that certain dimensions of performance are more accurately evaluated ("answering recruitee's questions," r = .97) than others ("reacting to stress," r = .42). Borman was led to conclude that raters are limited in their ability to appraise performance; they could not accurately evaluate the levels of "true" performance that were acted out in the scripts. He suggested a practical upper limit to validity that is less than the theoretical limit (r = 1.0) (see Field Note 2). After many years of research on various types of performance appraisal rating scales, I/O psychologists have concluded that the variance in rated performance due to the rating scale format is slight, typically less than 5%. Other sources of variance in rated performance are more substantial. These topics will be discussed next.

FIELD NOTE 2  Good Research Isn't Cheap

Many times unexpected costs are associated with performance appraisal. Here is the story of one of the more unusual expenses I have ever encountered in a research study. One of the uses of performance appraisal information is as a criterion of job performance. In turn, criteria of job performance may be used to validate selection tests. I had a colleague who needed to collect both performance appraisal (criterion) data and test score (predictor) data to develop a selection test battery for a company. He traveled to the company and had all the supervisors convene in the company cafeteria. He explained the nature of the performance ratings he wanted them to make. Then he explained that all their subordinates would be taking a 30-minute test and the scores would be correlated with the supervisors' performance appraisal ratings, as is done in a concurrent criterion-related validity study. My colleague then asked the supervisors whether they wanted to take the same test their subordinates would be taking, just to get a feel for what it was like. They agreed. He passed out the test and informed them they would have 30 minutes to complete it. He wanted the testing procedure to be very exact, giving everyone precisely 30 minutes. His watch did not have a second hand, so he was about to ask if he could borrow someone else's watch when he spied the company's microwave oven on the wall in the cafeteria. He went over to the microwave, set the timer for 30 minutes, told the supervisors to begin the test, and started the microwave.

About 20 minutes into the test, a terrible odor began to fill the cafeteria. Somebody noticed it was coming from the microwave. My colleague had failed to place anything into the microwave when he started it, so for 20 minutes the microwave cooked itself, ultimately suffering terminal meltdown. The microwave cost $800 to replace and is one of the more unusual test-validation expense items I have ever heard of. Incidentally, the test turned out to be highly predictive of the performance appraisal ratings, so the exercise was not a complete waste.

RATER TRAINING
rater training: the process of training raters to make more accurate ratings of performance, typically achieved by reducing the frequency of halo, leniency, and central-tendency errors

Can you train raters to make better performance appraisals? The answer appears to be yes. Rater training is a formal process in which appraisers are taught to make fewer halo, leniency, and central-tendency errors. For example, Latham, Wexley, and Pursell (1975) randomly assigned 60 managers who appraised performance to one of these three groups:

* Workshop group. This group was shown videotapes on evaluating individuals. Members then discussed appraisal procedures and problems in making appraisals, with the intent of reducing rating errors.
* Discussion group. This group received training similar in content, but the main method was discussion.
* Control group. This group received no training.

Six months later, the three groups were "tested." They were shown videotapes of several hypothetical job candidates along with job requirements. The managers were asked to evaluate the candidates' suitability for the jobs in question. The groups showed major differences in the rating errors they made. The workshop group had no rating errors, and the control group performed the worst, making all three types of errors.

Zedeck and Cascio (1982) considered the purposes for which performance appraisal ratings are made—merit raise, retention, and development—and found that training works better for some purposes than others. Training typically enhances the accuracy of performance appraisals as well as their acceptability to those who are being appraised.

However, not all research on rater training is positive. Bernardin and Pence (1980) reported that raters who were trained to reduce halo errors actually made less accurate ratings after training. This seems due to the fact that, as Bartlett (1983) noted, there are two types of halo: Reduction of invalid halo increases accuracy, but reduction of valid halo decreases it. Hedge and Kavanagh (1988) concluded that certain types of rater training reduce classical rating errors such as halo and leniency but do not increase rating accuracy. It is possible to reduce the occurrence of rating errors and also reduce accuracy because other factors besides the three classic types of rating errors affect accuracy. The relationship between rating errors and accuracy is uncertain because of our inability to know what "truth" is (Sulsky & Balzer, 1988).

One type of rater training appears particularly promising. Frame-of-reference training (Sulsky & Day, 1992) involves providing raters with common reference standards (i.e., frames) by which to evaluate performance. Raters are shown vignettes of good, poor, and average performance and are given feedback on the accuracy of their ratings of the vignettes. The intent of the training is to "calibrate" raters so that they agree on what constitutes varying levels of performance effectiveness for each performance dimension. Research by Woehr (1994) and Day and Sulsky (1995) supported the conclusion that frame-of-reference training increases the accuracy of individual raters on separate performance dimensions.
In a meta-analytic review of rater training for performance appraisal, Woehr and Huffcutt (1994) examined the effectiveness of rater training methods on the major dependent variables of reduced halo error, reduced leniency error, and increased rating accuracy. The authors concluded that rater training has a positive effect on each dependent variable. However, the training strategies are differentially effective in addressing the aspects of performance ratings they are designed to improve. Because raters are influenced by numerous attribution errors or biases, Kraiger and Aguinis (2001) encourage evaluators to rely on multiple sources of information and question the veracity of judgments made by raters.

RATER MOTIVATION

rater motivation: a concept that refers to organizationally induced pressures that compel raters to evaluate ratees positively

It is not unusual for the majority of employees in a company to receive very high performance evaluations. These inflated ratings are often interpreted as evidence of massive rater errors (i.e., leniency or halo) or a breakdown in the performance appraisal system. The typical organizational response to rating inflation is to make some technical adjustment in the rating scale format or to institute a new training program for raters. However, there is another explanation for rating inflation that is unrelated to rater errors.

Murphy and Cleveland (1995) posited that the tendency to give uniformly high ratings is an instance of adaptive behavior; that is, from the rater's point of view, it is an eminently logical course of action. These deficiencies in ratings are more likely to be a result of the rater's willingness (rather than capacity) to provide accurate ratings. If the situation is examined from the rater's perspective, there are many sound reasons to provide inflated ratings; rater motivation is often adjusted to achieve some particular result.

First, there are typically no rewards from the organization for accurate appraisals and few if any sanctions for inaccurate appraisals. Official company policies often emphasize the value of good performance appraisals, but organizations typically take no specific steps to reward this supposedly valued activity. Second, the most common reason cited for rating inflation is that high ratings are needed to guarantee promotions, salary increases, and other valued rewards. Low ratings, on the other hand, result in these rewards being withheld from subordinates. Raters are thus motivated to obtain valued rewards for their subordinates. Third, raters are motivated to give inflated ratings because the ratings received by subordinates are a reflection of the rater's own job performance (Latham, 1986). One of the duties of managers is to develop their subordinates. If managers consistently rate their subordinates as less than good performers, it can appear that the managers are not doing their jobs. Thus, high ratings make the rater look good and low ratings make the rater look bad. Fourth, raters tend to inflate their ratings because they wish to avoid the negative reactions that accompany low ratings (Klimoski & Inks, 1990). Negative evaluations typically result in defensive reactions from subordinates, which can create a very stressful situation for the rater. The simplest way to avoid unpleasant or defensive reactions in appraisal interviews is to give uniformly positive feedback (i.e., inflated ratings).
Kozlowski, Chao, and Morrison (1998) describe the process of "appraisal politics" in organizations. If there is a sense that most other raters are inflating their ratings of their subordinates, a good rater has to play politics to protect and enhance the careers of his or her own subordinates. To the extent that a rating inflation strategy actually enhances the prospects of the better subordinates, it may be interpreted as being in the best interests of the organization to do so. Kozlowski et al. stated: "Indeed, if rating distortions are the norm, a failure to engage in appraisal politics may be maladaptive" (p. 176). Supporting this conclusion, Jawahar and Williams (1997) meta-analyzed performance appraisals given for administrative purposes (e.g., promotions) versus developmental or research purposes. Their results showed that performance appraisals conducted for administrative purposes were one-third of a standard deviation higher than those obtained for developmental or research purposes. As these authors stated, performance appraisals will be much more lenient when those appraisals are "for keeps" (see Field Note 3).

FIELD NOTE 3  Are High Ratings a "Problem"?

Research on performance appraisal has often regarded high ratings as reflecting some kind of error. This error then becomes the focus of corrective action, as methods (i.e., different rating techniques, rater training) are applied to produce lower evaluations. However, an examination of just the statistical properties of ratings, apart from the organizational context in which they are rendered, fails to capture why they occur. As recent research has revealed, managers who give high evaluations of their employees are behaving in an eminently reasonable fashion, not making errors per se. Managers (or other supervisors) have a vested interest in the job performance of their subordinates. The subordinates are socialized and coached to exhibit desired behaviors on the job. Those who don't exhibit these behaviors are often dismissed. Those who do are rewarded with social approval, if nothing more than being allowed to retain their jobs. The performance of subordinates is also regarded as a measure of the manager's own job performance. It is then logical for a manager to cultivate an efficient work group. Finally, managers often feel a sense of sponsorship for their employees. They want their employees to do well and have in fact often invested a sizable portion of their own time and energy to produce that outcome. Thus, when it comes time for a formal performance review of their subordinates, managers often respond by rating them highly. Rather than errors of judgment, the high evaluations could represent little more than the outcome of a successful socialization process designed to achieve that very outcome.

There is no simple way to counteract a rater's motivation to inflate ratings. The problem will not be solved by just increasing the capability of raters. In addition, the environment must be modified in such a way that raters are motivated to provide accurate ratings. Murphy and Cleveland (1995) believe accurate rating is most likely to occur in an environment where the following conditions exist:

* Good and poor performance are clearly defined.
* Distinguishing among workers in terms of their levels of performance is widely accepted.
* There is a high degree of trust in the system.
* Low ratings do not automatically result in the loss of valued rewards.
* Valued rewards are clearly linked to accuracy in performance appraisal.

The authors know of organizations in which none of these conditions is met, but they don't know of any in which all of them are met. It is clear that we need more research on the organizational context in which performance appraisals are conducted. Mero and Motowidlo (1995) also reported findings that underscore the importance of the context in which ratings are made. They found that raters who are held accountable for their performance ratings make more accurate ratings than raters who are not held accountable. The problem of rating inflation will ultimately be solved by changing the context in which ratings are made, and not by changing the rater or the rating scale.

CONTEXTUAL PERFORMANCE

contextual performance: behavior exhibited by an employee that contributes to the welfare of the organization but is not a formal component of an employee's job duties

Borman and Motowidlo (1993) contended that individuals contribute to organizational effectiveness in ways that go beyond the activities that make up their jobs. They can either help or hinder efforts to accomplish organizational goals by doing many things that are not directly related to their main functions. However, these contributions are important because they shape the organizational or psychological context that serves as the catalyst for work. The authors argued that these contributions are a valid component of overall job performance, yet they transcend the assessment of performance in specific tasks. This aspect of performance is referred to as contextual performance, organizational citizenship behavior, and
