
Journal of Business and Psychology, Vol. 19, No. 3, Spring 2005. DOI: 10.1007/s10869-004-2234-y

MANAGERIAL EXPERIENCE AND THE MEASUREMENT EQUIVALENCE OF PERFORMANCE RATINGS

Gary J. Greguras
Singapore Management University

ABSTRACT: Establishing the measurement equivalence of instruments is a prerequisite to making meaningful comparisons between individuals or within individuals over time. Whereas previous research has investigated the effects of rater characteristics on the measurement equivalence of performance ratings, the current study investigated a ratee characteristic: ratee job experience. Using confirmatory factor analysis and item response theory methods with replication, the measurement equivalence of supervisor ratings of 7,200 managers with differing levels of managerial experience was assessed. Overall, results indicated a high degree of measurement equivalence, suggesting that meaningful comparisons may be made across ratees with different levels of job experience.

KEY WORDS: performance ratings; job experience; measurement equivalence; ratings.

Performance ratings often are used as criteria for validating selection assessments, as predictors of managerial potential, as input for personnel decisions, or as standards for mapping performance changes. Unfortunately, performance measurement is plagued by a host of problems (e.g., rater bias; Landy & Farr, 1980). Research generally has attempted to identify and remedy these problems by focusing on characteristics of the instrument, the appraisal process, or the users (Murphy & Cleveland, 1995). The most widely used measure of employee performance continues to be supervisor ratings. Supervisor ratings may be used to make comparisons between employees (e.g., deciding which manager should be promoted) and within employees (e.g., assessing whether a manager's performance improved over the past year). Comparing ratings between different raters, ratees, or time periods assumes that the ratings are on the same psychological metric (Maurer, Raju, & Collins, 1998), or that measurement equivalence exists.
Address correspondence to Gary J. Greguras, Lee Kong Chian School of Business, Singapore Management University, 469 Bukit Timah Road, Singapore 259756. E-mail: garygreguras@smu.edu.sg.


If performance scales represent different things for different raters (i.e., measurement nonequivalence), ratings cannot be compared or interpreted across groups (Maurer et al., 1998; Vandenberg & Lance, 2000). As such, establishing the measurement equivalence of ratings is a prerequisite to conducting meaningful comparisons (Vandenberg & Lance, 2000). Laffitte, Raju, Scott, and Fasolo (1998) suggested that future research consider whether managerial experience affects the measurement equivalence of performance ratings. Establishing the measurement equivalence of performance ratings of managers with different levels of experience is important because individual and organizational decisions often are based on comparisons between managers with different levels of experience. If the ratings are not equivalent, they cannot be meaningfully compared or interpreted (Vandenberg & Lance, 2000). As such, the purpose of the current study is to assess whether managerial (i.e., ratee) job experience influences supervisors' (i.e., raters') use and interpretation of the appraisal instrument. Given the increase in organizational interventions (e.g., multisource feedback systems) targeted at improving managerial performance (Church & Bracken, 1997; Yammarino & Atwater, 1997), assessing the effects of managerial experience on the measurement equivalence of ratings is warranted. Below, the potential effects of managerial experience on performance ratings are discussed, followed by an overview of the current literature on measurement equivalence and a description of the current study.

MANAGERIAL EXPERIENCE

There are several reasons to expect that a ratee's job experience may affect the measurement equivalence of supervisor performance ratings. First, raters may have different expectations for different employees, and these expectations likely change as a function of their job experience (Murphy, 1989). For example, managers with more experience may be expected to perform at higher levels, make fewer mistakes, require less supervision, or know the policies and procedures of the organization better than an employee with less experience (Murphy, 1989). If raters have different expectations for ratees with differing levels of experience, they also may apply different standards when evaluating these employees (Barrett & Alexander, 1989). Hence, different rater expectations or standards may contribute to measurement nonequivalence across ratees with different levels of experience.


A second reason to expect that ratee job experience may affect the measurement equivalence of rating instruments is that supervisors may make different assumptions or attributions about employees with differing amounts of experience. For example, raters may assume that employees who have more job experience are more productive than employees with less experience (McEnrue, 1988) and, as a consequence, may rate more experienced ratees higher regardless of the ratees' true performance levels (Cascio & Valenzi, 1977). In addition to assuming that more experienced workers actually are more productive, raters also may assume that only productive, good employees are retained by an organization (McEnrue, 1988). Again, such rater assumptions regarding the merits of more or less experience may affect supervisors' judgments of managers' performances (McEnrue, 1988) and may result in systematic distortions in performance ratings as a function of job experience.

Third, according to the attraction-selection-attrition (ASA) model (Schneider, 1987), raters may perceive ratees who have worked in a position or organization for a longer length of time to be more similar to themselves than ratees with less experience. Consistent with the ASA model, job experience has been shown to be related to rater perceptions of ratee similarity (e.g., Greguras & Balzer, 1999), which, in turn, has been shown to influence supervisory affect toward a subordinate (Wayne & Liden, 1995), rater attributions about ratee behaviors (Jones & Nisbett, 1972; Wexley & Klimoski, 1984), and supervisory performance ratings (Wayne & Liden, 1995). Again, any of these factors may influence how a rater interprets and uses instruments designed to assess ratee performance.

A fourth reason to expect that the experience level of ratees may influence rater behaviors and cognitions is that the tasks and responsibilities of employees likely change as a function of job experience (Murphy, 1989). Hence, raters may perceive ratee abilities, or the importance of different behaviors, as changing with one's level of job experience. Consistent with this view, related research on job analysis task ratings indicates that incumbents with different levels of experience provide varying task analysis responses (Harvey, 1991) and report carrying out tasks with differing frequencies (Borman, Dorsey, & Ackerman, 1992; Landy & Vasey, 1991). Research also has demonstrated that as individuals learn their jobs and gain more experience, some abilities become more important and others less important for optimal levels of performance (Hofmann, Jacobs, & Baratta, 1993). Because managers with different degrees of experience often have different roles, tasks, and responsibilities, raters may use different types of information when evaluating their performances. Measurement nonequivalence exists if raters disagree about the kinds of duties a ratee is expected to perform or about the relative importance of these duties (Cheung, 1999).

Taken together, there are several reasons why measurement nonequivalence might be expected as a function of ratee job experience.


MEASUREMENT EQUIVALENCE

Measurement equivalence indicates that the instrument means and functions the same across raters (Cheung, 1999; Vandenberg & Lance, 2000). Both confirmatory factor analysis (CFA) and item response theory (IRT) methods may be used to assess the measurement equivalence of ratings. The IRT and CFA methods should be considered complementary approaches in that they provide slightly different information regarding the equivalence of an instrument. IRT distinguishes itself from CFA in two important ways: (1) the relationship between an indicator and its latent construct is not constrained to be linear, which results in a greater likelihood of the correct functional form being specified, and, relatedly, (2) information on the relationship between an indicator and its latent construct is given across the range of possible values of the latent construct. In contrast to this second point, CFA methods use one number (e.g., a factor loading) to represent the relationship between an indicator and its construct along all possible values of the latent construct. On the other hand, CFA provides some distinct advantages over IRT methods for examining measurement equivalence. For example, CFA methods are specifically designed to incorporate multiple latent constructs in an examination of measurement equivalence, whereas IRT methods generally are designed for single-factor examinations. Additionally, CFA methods can easily accommodate violations of assumptions (e.g., correlated errors among indicators) that are required to hold for IRT analyses. In tandem and used judiciously, the two methods give the researcher maximal information on various potential sources of nonequivalence. Both of these methods are discussed below.

Confirmatory Factor Analysis

Vandenberg and Lance (2000) and Cheung (1999) recommend assessing the measurement equivalence of instruments by using a series of hierarchically nested models. The models become more restrictive as additional constraints are placed on them (i.e., as the testing sequence moves up the hierarchy of nested models). The change in fit of the model to the data is then compared from one step to the next to assess whether the additional constraints resulted in a significant decrease in model fit. Cheung (1999) discussed two types of equivalence: conceptual equivalence and psychometric equivalence.


Conceptual equivalence indicates that raters agree on the item loadings and factor structure of an instrument (i.e., the instrument means the same thing to the different rater sources). Because conceptual equivalence is a prerequisite to making meaningful comparisons across rater groups (Cheung & Rensvold, 1998; Reise, Widaman, & Pugh, 1993), the majority of existing studies focus on establishing the conceptual equivalence of an instrument (e.g., Facteau & Craig, 2001). In contrast, psychometric equivalence indicates that the different rater sources respond to the instrument in the same way (i.e., equivalent levels of reliability, variance, range of ratings, mean level of ratings, and intercorrelations among factors). Whereas conceptual equivalence is a prerequisite to meaningful comparisons, psychometric nonequivalence may reveal meaningful differences between raters. For a detailed discussion of CFA methods for assessing measurement equivalence, see Cheung (1999) or Vandenberg and Lance (2000).

Several studies have used CFA methods to assess the conceptual equivalence of performance ratings. For example, Cheung (1999) investigated self and supervisor ratings of 332 mid-level executives on two broad performance dimensions (i.e., internal and external roles). Results indicated that the ratings from the two rater sources were conceptually equivalent. Similarly, Maurer et al. (1998) found that peer and subordinate ratings of a team-building performance dimension were conceptually equivalent. Further, in a comprehensive study of the measurement equivalence of performance ratings, Facteau and Craig (2001) analyzed self, supervisor, peer, and subordinate ratings across seven performance dimensions. Consistent with previous research (e.g., Laffitte et al., 1998; Stennett, Johnson, Hecht, Green, Jackson, & Thomas, 1999; Woehr, Sheehan, & Bennett, 1999), results indicated that the ratings across sources and performance dimensions were conceptually invariant (with the exception of one error covariance in the self and subordinate groups). As a whole, the existing literature on the measurement equivalence of performance ratings using CFA methods has indicated that ratings from different sources across various performance dimensions generally are conceptually equivalent (for an exception, see Lance & Bennett, 1997).

Item Response Theory

IRT procedures allow the measurement equivalence of an instrument to be examined at both the item and scale levels. As such, both item and scale scores may be analyzed for differential functioning. Differential functioning refers to the difference in expected scores for individuals with the same standing on a latent construct (i.e., theta, θ) as a result of one's group membership (Raju, van der Linden, & Fleer, 1995; Reise et al., 1993). That is, if individuals possess the same level of an underlying construct, yet the expected scores for these individuals are different because they belong to different groups, then differential functioning exists. Differential item functioning (DIF) refers to differential functioning at the item level, whereas differential test functioning (DTF) refers to differential functioning at the test or scale level. Both DIF and DTF are important to examine and may produce different results. For example, if two items evidence differential functioning, but in opposite directions, the scale may not evidence DTF. Likewise, if several items have nonsignificant, but nonzero, DIF in the same direction, the cumulative impact of these items on the scale scores may result in DTF across groups (Facteau & Craig, 2001). DIF or DTF becomes more or less important depending upon whether item-level or scale-level information is of interest. With performance ratings (the focus of the current study), both item-level and scale-level information often is presented back to the ratee (Facteau & Craig, 2001), and therefore both DIF and DTF are of interest and should be examined.
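To make the DIF versus DTF distinction concrete, the following minimal sketch illustrates how two items with opposite-direction DIF can largely cancel at the scale level. The logistic expected-score form and all item parameters here are illustrative assumptions, not estimates or procedures from the current study:

```python
import numpy as np

# Grid of latent trait (theta) values at which to compare the two groups.
theta = np.linspace(-3, 3, 61)

def expected_item_score(theta, location, scale_max=3.0):
    """Toy expected item score: a logistic curve rising from 0 to scale_max.

    A real DFIT analysis would use graded-response-model item parameters;
    the location shift here simply stands in for group-specific item
    functioning (illustrative values only).
    """
    return scale_max / (1.0 + np.exp(-(theta - location)))

# Reference-group item locations for a two-item "scale".
ref_locations = [-0.5, 0.5]
# Focal-group locations: item 1 is harder (+0.4), item 2 easier (-0.4),
# i.e., DIF in opposite directions.
focal_locations = [-0.1, 0.1]

# Item-level differences in expected scores (DIF-like quantities).
item_diffs = [
    expected_item_score(theta, f) - expected_item_score(theta, r)
    for r, f in zip(ref_locations, focal_locations)
]

# Scale-level difference in expected scores (DTF-like quantity).
scale_diff = sum(item_diffs)

for i, d in enumerate(item_diffs, start=1):
    print(f"item {i}: mean absolute expected-score difference = {np.abs(d).mean():.3f}")
print(f"scale:  mean absolute expected-score difference = {np.abs(scale_diff).mean():.3f}")
# The item-level differences are clearly nonzero, but because they run in
# opposite directions they largely cancel when summed to the scale level.
```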


Two of the studies described above that used CFA methods also used IRT methods to assess the measurement equivalence of performance ratings. Maurer et al. (1998) observed that two of the seven items investigated evidenced DIF, which resulted in significant DTF. However, these findings were not replicated across samples, leading them to conclude that peer and subordinate ratings on a team-building skills performance dimension were equivalent; this conclusion is consistent with their CFA results presented above. Likewise, across 276 tests for differential functioning, Facteau and Craig (2001) observed DIF only five times and DTF only once. As such, Facteau and Craig (2001) concluded overall that supervisor, peer, subordinate, and self ratings across performance dimensions are equivalent; this conclusion also is consistent with their CFA results presented above.

Overview of the Current Study

The purpose of the current study is to assess the measurement equivalence of supervisor ratings of managers with differing levels of job experience. Previous studies that have investigated the measurement equivalence of performance ratings have focused exclusively on rater differences (e.g., rater source, rater opportunity to observe ratee performance). The current study differs from these previous studies by investigating a ratee characteristic (i.e., ratee job experience) as a factor that may affect the measurement equivalence of performance ratings. The measurement equivalence of supervisor performance ratings of subordinates (managers) with differing levels of experience should be established if meaningful comparisons between or within managers are to be made. This paper responds to the need for research investigating ratee characteristics suggested by Maurer et al. (1998) and Laffitte et al. (1998). The current study used both CFA and IRT methods to assess the measurement equivalence of supervisory ratings for groups of managers with different levels of job experience.


METHOD

Participants

A stratified random sample of 3,600 ratees was drawn from a large database. The sample was stratified such that an equal number (n = 1,200) of ratees within each of three categories of managerial experience (i.e., 0–5, 5–10, and 10–20 years of managerial experience) was randomly selected for inclusion in the current study. Managers were employed in a variety of organizations and functional units (e.g., marketing, manufacturing, and engineering). The majority of ratees had some college or technical training (93%), with many ratees holding Bachelor's degrees (33.5%) or Master's degrees (23.6%). The majority of ratees were male (67.3%), with an average age of 40.0 years. The majority of ratees were White (83.5%), with fewer being Black (3.3%), Latino or Hispanic American (2.7%), or Asian or Pacific American (2.5%). Ratees represented several different organizational levels, with the majority of ratees in middle- to upper-management positions (70.7%).

Raters were the 3,600 supervisors of the above ratees. The majority of raters had some college or technical training (88.6%), with many raters holding Bachelor's degrees (27.4%) or Master's degrees (26.9%). The majority of raters were male (76.5%) and White (74.6%), with fewer being Black (4.7%), Latino or Hispanic American (1.4%), or Asian or Pacific American (1.5%).

As suggested by Maurer et al. (1998), a hold-out sample was used to assess the robustness of the results from the original sample and thereby increase confidence in the results. From the complete database, an additional independent stratified random sample of 3,600 ratees was drawn and used in replication analyses. This replication sample did not differ significantly from the original sample on any of the demographic characteristics.
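As an illustration of this kind of sampling scheme, the sketch below draws an equal-sized stratified derivation sample and an independent hold-out replication sample with pandas. The data frame and column names (`ratings`, `experience_band`) are hypothetical placeholders; the study's actual database and tooling are not described at this level of detail:

```python
import pandas as pd

def draw_stratified_samples(ratings: pd.DataFrame, per_group: int = 1200,
                            seed: int = 42) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Draw a derivation sample and an independent replication (hold-out) sample.

    `ratings` is assumed to have one row per ratee, with an `experience_band`
    column coded as '0-5', '5-10', or '10-20' years of managerial experience.
    """
    # Derivation sample: an equal number of ratees from each experience band.
    derivation = (
        ratings.groupby("experience_band", group_keys=False)
               .sample(n=per_group, random_state=seed)
    )
    # Replication sample: drawn from the remaining ratees only, so the two
    # samples are independent of one another.
    remaining = ratings.drop(derivation.index)
    replication = (
        remaining.groupby("experience_band", group_keys=False)
                 .sample(n=per_group, random_state=seed + 1)
    )
    return derivation, replication
```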


Measures

The current study examined the measurement equivalence of two performance dimensions: Demonstrate Adaptability (seven items) and Coach and Develop (eight items). These scales were chosen because of their conceptual similarity to scales on other commonly used 360-degree feedback instruments (e.g., the Acting with Flexibility and Setting a Developmental Climate scales from Benchmarks; Lombardo & McCauley, 1994) and because of the number of items in each scale (IRT estimation increases in accuracy with increasing numbers of items). Note that items for both scales were recoded from a 5-point scale to a 4-point scale by collapsing the lower two scale points. The lower two scale points were collapsed because only approximately one-half of one percent of the raters (i.e., 5–7 out of 1,200) used the lowest response category on any one item.

Analyses

All analyses were conducted using both CFA and IRT methods. Readers interested in more details on the CFA method of testing equivalence are referred to Byrne, Shavelson, and Muthen (1989), Cheung (1999), or Vandenberg and Lance (2000). Readers interested in more details on the specific IRT method used in the present study are referred to Flowers, Oshima, and Raju (1999).

RESULTS

Table 1 presents the means, standard deviations, and reliability estimates of the two scales included in this study. All scale means were slightly above the scale mid-point; the reliability estimates for the Adaptability scale ranged from .80 to .82, and the reliability estimates for the Coach and Develop scale ranged from .87 to .89.
Table 1
Scale Characteristics Across Levels of Managerial Experience

                          0–5 Years             5–10 Years            10–20 Years
Scale                     α     M      SD       α     M      SD       α     M      SD
Adaptability (7)          .80   19.18  3.73     .81   19.03  3.68     .82   19.03  3.75
Coach and Develop (8)     .87   20.96  4.23     .88   21.18  4.25     .89   21.28  4.45

Note. N = 3,600 with equal numbers of managers within each range of managerial experience (n = 1,200). The number of items in each scale is shown in parentheses. Items were recoded from 5 scale points to 4 scale points (i.e., the lower two scale points were collapsed).
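For readers who want to reproduce descriptives like those in Table 1 from item-level data, the following sketch computes the scale mean, standard deviation, and coefficient alpha per experience group. The data layout assumed here (one row per ratee, a list of item columns, and an `experience_band` grouping column) is an illustrative assumption, not the study's actual data format:

```python
import pandas as pd

def coefficient_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def scale_characteristics(ratings: pd.DataFrame, item_cols: list[str]) -> pd.DataFrame:
    """Per experience band: coefficient alpha, scale mean, and SD (as in Table 1)."""
    rows = []
    for band, group in ratings.groupby("experience_band"):
        totals = group[item_cols].sum(axis=1)
        rows.append({"experience_band": band,
                     "alpha": coefficient_alpha(group[item_cols]),
                     "M": totals.mean(),
                     "SD": totals.std(ddof=1)})
    return pd.DataFrame(rows)
```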


Confirmatory Factor Analysis

Analyses were conducted using LISREL 8.2 following the approach outlined by Joreskog and others (Drasgow & Kanfer, 1985; Joreskog, 1971a, 1971b). The equivalence of the factor form across rater groups (i.e., raters evaluating ratees with 0–5, 5–10, and 10–20 years of experience) was first tested, followed by tests for the equivalence of factor loadings. Assessing factor form and factor loading equivalence across groups tests for conceptual equivalence. Each of these tests is briefly described below; Cheung (1999) provides a more detailed discussion.

If raters agree on the factor structure (i.e., the number of performance dimensions) and on the particular scale items associated with each dimension, then factor form equivalence exists. On the other hand, if factor form nonequivalence is observed, one can conclude that raters are considering items to be associated with different dimensions. With the constraint of invariant factor form imposed across the groups, the fit indices (see Table 2) indicate that the model provides a reasonably good fit to the data (RMSEA = .082; CFI = .91). These results are replicated in the hold-out sample (RMSEA = .086; CFI = .90; see Table 2). These results indicate that raters across the three rater groups agree on the number of performance dimensions and on the particular items associated with each dimension.

The second test of equivalence was for factor loading equivalence. This model was compared to the baseline model (i.e., the factor form model). The change in χ² indicates that this model does not differ significantly from the baseline model (Δχ²(26) = 23.80, p > .05; see Table 2), indicating that the factor loadings are equivalent across the three rater groups. The results from the replication sample are consistent with these results (Δχ²(26) = 26.27, p > .05; see Table 2). In addition, the RMSEA and CFI values also indicated a fairly good fit of the model to the data in both the derivation (RMSEA = .078; CFI = .91) and replication (RMSEA = .082; CFI = .90) samples. These results indicate that the three rater groups agree on the relative importance of the items as indicators of the latent constructs. This observed equivalence in factor loadings is referred to as factorial invariance. Given the equivalence of factor form and factor loadings, these data suggest that raters across the three rater groups possess similar conceptual models of performance and that ratings across these groups are comparable.
Table 2
CFA Results Across Managerial Experience Levels

Model                     χ²        df    p      RMSEA   CFI    Δχ²     Δdf   Δχ²/Δdf   p
Equal form                2269.68   293   <.001  .082    .91
Equal factor loadings     2293.48   319   <.001  .078    .91    23.80   26    .92       >.05
Replication Sample
Equal form                2471.32   293   <.001  .086    .90
Equal factor loadings     2497.59   319   <.001  .082    .90    26.27   26    1.01      >.05

Note. For the chi-square values, N = 3,600. RMSEA = root mean square error of approximation; CFI = comparative fit index.
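As a quick check, the nested-model comparison reported in Table 2 can be reproduced from the tabled values. The sketch below is a generic illustration of the chi-square difference test (it is not the LISREL syntax used in the study) and relies only on scipy's chi-square distribution:

```python
from scipy.stats import chi2

def chi_square_difference(chisq_constrained: float, df_constrained: int,
                          chisq_baseline: float, df_baseline: int) -> tuple[float, int, float]:
    """Likelihood-ratio (chi-square difference) test for nested CFA models."""
    delta_chisq = chisq_constrained - chisq_baseline
    delta_df = df_constrained - df_baseline
    p_value = chi2.sf(delta_chisq, delta_df)
    return delta_chisq, delta_df, p_value

# Derivation sample values from Table 2: equal factor loadings vs. equal form.
d_chisq, d_df, p = chi_square_difference(2293.48, 319, 2269.68, 293)
print(f"Derivation:  delta chi-square({d_df}) = {d_chisq:.2f}, p = {p:.2f}")

# Replication sample values from Table 2.
d_chisq, d_df, p = chi_square_difference(2497.59, 319, 2471.32, 293)
print(f"Replication: delta chi-square({d_df}) = {d_chisq:.2f}, p = {p:.2f}")
# Both differences are nonsignificant (p > .05), consistent with the
# conclusion that factor loadings are equivalent across the three groups.
```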


IRT Analyses

IRT analyses were used to assess DIF and DTF. Results from the IRT analyses for the derivation and replication samples are presented in Table 3. In the derivation sample, no items evidenced DIF for either of the two scales (i.e., Adaptability and Coach and Develop) for any of the three rater groups. In the replication sample, one item was found to evidence DIF. However, because this finding did not appear in the derivation sample, and consistent with Maurer et al. (1998), these results indicate that items did not function differently across the performance dimensions for the three groups.

The DTF index may be used to assess differential functioning at the test level. Raju et al. (1995) proposed a χ² test for assessing the statistical significance of the observed DTF index; the degrees of freedom for this test equal NF − 1, where NF is the number of examinees in the focal group. As suggested and used in Facteau and Craig (2001), a cut-off of .054 multiplied by the number of scale items was applied to the DTF index for scales composed of items with four response options. Scales with DTF values above this cut-off that also have significant χ² values at the p < .01 level are said to evidence DTF.

Table 3
IRT Results of Differential Item and Test Functioning

Scale (Comparison)         Number of DIF Items   DTF    χ²
Derivation Sample
Adaptability (1)           0                     .001   1,749.38*
Adaptability (2)           0                     .001   1,225.03
Adaptability (3)           0                     .001   1,790.49*
Coach and Develop (1)      0                     .002   1,204.83
Coach and Develop (2)      0                     .001   1,203.73
Coach and Develop (3)      0                     .000   1,201.20
Replication Sample
Adaptability (1)           0                     .003   2,299.75*
Adaptability (2)           1                     .040   22,365.54*
Adaptability (3)           0                     .001   1,876.43*
Coach and Develop (1)      0                     .002   1,204.83
Coach and Develop (2)      0                     .001   1,203.73
Coach and Develop (3)      0                     .000   1,201.20

Note. Numbers in parentheses denote the comparisons: (1) = 0–5 years vs. 5–10 years, (2) = 0–5 years vs. 10–20 years, and (3) = 5–10 years vs. 10–20 years. Degrees of freedom = 1,199 for each of the χ² tests. *p < .01. Scales with significant (p < .01) χ² values and DTF indices above the cut-off (.054 × number of scale items) would be indicative of DTF; none of the scales evidenced DTF. n = 1,200 for each group; N = 3,600 for each sample.
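The decision rule described above (a DTF index above .054 times the number of scale items and a significant χ² at p < .01) can be expressed compactly. The sketch below applies it to two rows of Table 3; it is an illustration of the screening rule, not the DFIT software used in the study:

```python
from scipy.stats import chi2

def evidences_dtf(dtf_index: float, chisq: float, n_items: int,
                  n_focal: int, alpha: float = 0.01) -> bool:
    """Flag DTF only if the index exceeds the cut-off AND the chi-square is significant."""
    cutoff = 0.054 * n_items                        # cut-off for items with four options
    critical = chi2.ppf(1 - alpha, df=n_focal - 1)  # degrees of freedom = NF - 1
    return dtf_index > cutoff and chisq > critical

# Adaptability, replication sample, comparison (2): significant chi-square,
# but the DTF index (.040) is far below the cut-off (.054 * 7 = .378).
print(evidences_dtf(dtf_index=0.040, chisq=22365.54, n_items=7, n_focal=1200))  # False

# Coach and Develop, derivation sample, comparison (1): neither criterion is met.
print(evidences_dtf(dtf_index=0.002, chisq=1204.83, n_items=8, n_focal=1200))   # False
```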


As indicated in Table 3, none of the scales for any of the groups evidenced DTF. In sum, no item- or test-level differences were observed for either scale for any group in the derivation sample. The same IRT analyses were conducted on the cross-validation sample (see Table 3), and the results are consistent with those of the derivation sample. Thus, no scales evidenced DTF in either the derivation or replication samples. These results indicate measurement equivalence across the three rater groups. The results from the IRT and CFA analyses converge: both suggest that ratings of managers with different levels of experience are equivalent.

DISCUSSION

As organizations continue to change and the characteristics of jobs and employees diversify, the equivalence of the instruments used to measure and evaluate the behaviors and attributes of employees must be established if meaningful comparisons between employees, or within employees over time, are to be made. Although several authors have discussed the importance of establishing the measurement equivalence of instruments before making such comparisons (e.g., Vandenberg & Lance, 2000), relatively few studies have explicitly assessed the measurement equivalence of performance ratings across groups (Facteau & Craig, 2001). As Austin and Villanova (1992) note, there has been a chronic lack of attention to the conceptual and psychometric characteristics of criteria.

Responding to the need for research in this area, the current study investigated whether a manager's (i.e., ratee's) job experience affects the measurement equivalence of supervisor (i.e., rater) ratings by using both CFA and IRT methods. Overall, results consistently indicated a large degree of measurement equivalence across raters evaluating ratees with differing amounts of job experience. Results from the CFA analyses indicated that ratee experience did not affect the conceptual equivalence of raters. That is, raters agreed on the number of performance dimensions, on the items associated with each performance dimension, and on the importance of each item in relation to its latent construct. Likewise, results from the IRT analyses also indicated a large degree of measurement equivalence in that there were no replicated instances of differential item or differential test functioning. These findings are consistent with previous research investigating the measurement equivalence of performance ratings, which generally has observed that performance ratings across groups are equivalent (e.g., Facteau & Craig, 2001; Maurer et al., 1998).


However, unlike previous research that has investigated rater characteristics, the current study extends the research on the measurement equivalence of performance ratings by investigating a ratee characteristic (i.e., job experience). The results of this study have several implications. First, these results suggest that ratings of employees with differing levels of experience are conceptually equivalent and that comparisons within and between employees may be interpreted from a measurement perspective. This is good news for practitioners wanting to make decisions between employees or wanting to track employee changes over time. However, as discussed below as a limitation of the current study, only two performance dimensions were examined. Because research suggests that some dimensions are easier to evaluate than others (Wohlers & London, 1989), and that some raters are more accurate on some dimensions than others (Borman, 1979), the extent to which these results generalize to other performance dimensions or instruments should be investigated.

Second, the IRT analyses revealed that the items and scales functioned the same across employees. Given that performance norms and feedback ratings often are presented to the ratee at the item and scale level, these results suggest that differences between ratings, norms, items, and scales are not the result of the instrument meaning different things to different raters.

Third, several theories have been proposed that describe how changes in individual performance may occur over time (e.g., Murphy's (1989) transition-maintenance model of individual change; Kanfer & Ackerman's (1989) model of skill acquisition). Before ratings between and within individuals can be meaningfully compared, the measurement equivalence of such ratings must be established (Drasgow & Kanfer, 1985; Vandenberg & Lance, 2000). The current study provides initial support for the conclusion that supervisory ratings are largely equivalent across the levels of ratee experience examined here.

Limitations and Future Research

The current study has several limitations. First, only one instrument was investigated. The current study found that supervisor ratings across the performance dimensions were equivalent across rater groups, but different dimensions or instruments may function differently. Future research should investigate the measurement equivalence of a larger number of performance dimensions and instruments to more robustly assess the findings of the current study.

Second, wide categories for grouping ratees (i.e., 0–5, 5–10, and 10–20 years) were used in the current study. Ratee experience might affect the measurement equivalence of ratings most within the first few months or years of employment, followed by a negligible impact after this initial period of employment.


However, if large differences actually exist within the first several months or years of employment, it is likely that these fluctuations would have been detected even given the rather wide categories used in the current study. As an aside, the analyses in the current study required large samples, and as such, grouping employees into these wider categories was necessary to conduct the appropriate analyses. As suggested by a reviewer, future research could collect perceptual data to better understand psychologically when an employee is perceived as being experienced versus inexperienced. Based on these perceptions, different categorizations might then be used to explore how perceived experience might influence the measurement equivalence of ratings.

Third, the current study only assessed the impact of ratee experience on the measurement equivalence of one rater source (i.e., supervisors). Future research should assess the measurement equivalence of self-, peer-, and subordinate ratings of ratees with different amounts of job experience.

Finally, future research should explore the effects of additional ratee (e.g., job level), rater (e.g., rater job experience), and contextual (e.g., rating purpose) characteristics on the measurement equivalence of performance ratings, both cross-sectionally and longitudinally.

REFERENCES
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77, 836–874.

Barrett, G. V., & Alexander, R. A. (1989). Rejoinder to Austin, Humphreys, and Hulin: Critical reanalysis of Barrett, Caldwell, and Alexander. Personnel Psychology, 42, 597–612.

Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 64, 410–412.

Borman, W. C., Dorsey, D., & Ackerman, L. (1992). Time spent responses as time allocation strategies: Relations with sales performance in a stockbroker sample. Personnel Psychology, 45, 763–777.

Byrne, B., Shavelson, R., & Muthen, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466.

Cascio, W. F., & Valenzi, E. R. (1977). Behaviorally anchored rating scales: Effects of education and job experience of raters and ratees. Journal of Applied Psychology, 62, 278–282.

Cheung, G. W. (1999). Multifaceted conceptions of self-other ratings disagreement. Personnel Psychology, 52, 1–36.

Cheung, G. W., & Rensvold, R. B. (1998). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1–27.

Church, A. H., & Bracken, D. W. (1997). Advancing the state of the art of 360-degree feedback. Group & Organization Management, 22, 149–161.

Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70, 662–680.

Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86, 215–227.

Flowers, C. P., Oshima, T. C., & Raju, N. S. (1999). A description and demonstration of the polytomous-DFIT framework. Applied Psychological Measurement, 23, 309–326.


Greguras, G. J., & Balzer, W. K. (1999, April). Assessing the robustness of previous supervisory performance rating models. Poster presented at the Fourteenth Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.

Harvey, R. (1991). Job analysis. In M. Dunnette & L. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed.). Palo Alto, CA: Consulting Psychologists Press.

Hofmann, D. A., Jacobs, R., & Baratta, J. E. (1993). Dynamic criteria and the measurement of change. Journal of Applied Psychology, 78, 194–204.

Jones, E. E., & Nisbett, R. E. (1972). The actor and the observer: Divergent perceptions of the causes of behavior. In E. Jones, D. Kanouse, H. Kelley, R. Nisbett, S. Valins, & B. Weiner (Eds.), Attribution: Perceiving the causes of behavior. Morristown, NJ: General Learning Press.

Joreskog, K. G. (1971a). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.

Joreskog, K. G. (1971b). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109–133.

Kanfer, R., & Ackerman, P. L. (1989). Motivation and cognitive abilities: An integrative/aptitude-treatment interaction approach to skill acquisition. Journal of Applied Psychology, 74, 657–690.

Laffitte, L. J., Raju, N. S., Scott, J. C., & Fasolo, P. M. (1998, April). Examination of the measurement equivalence of 360 feedback assessment with confirmatory factor analysis and item response theory. Paper presented at the 13th Annual Conference of the Society for Industrial and Organizational Psychology, Dallas, TX.

Lance, C. E., & Bennett, W., Jr. (1997, April). Rater source differences in cognitive representation of performance information. Paper presented at the meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.

Landy, F. J., & Farr, J. (1980). Performance rating. Psychological Bulletin, 87, 72–107.

Landy, F. L., & Vasey, J. (1991). Job analysis: The composition of SME samples. Personnel Psychology, 34, 27–50.

Lombardo, M., & McCauley, C. (1994). Benchmarks: A manual and trainer's guide. Greensboro, NC: Center for Creative Leadership.

Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate appraisal measurement equivalence. Journal of Applied Psychology, 83, 693–702.

McEnrue, M. P. (1988). Length of experience and the performance of managers in the establishment phase of their careers. Academy of Management Journal, 31, 175–185.

Murphy, K. R. (1989). Is the relationship between cognitive ability and job performance stable over time? Human Performance, 2, 183–200.

Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: Sage Publications.

Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353–368.

Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566.

Schneider, B. (1987). The people make the place. Personnel Psychology, 40, 437–454.

Stennett, R. B., Johnson, C. D., Hecht, J. E., Green, T. D., Jackson, K., & Thomas, W. (1999, August). Factorial invariance and multirater feedback. Poster presented at the Fourteenth Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–69.

Wayne, S. J., & Liden, R. C. (1995). Effects of impression management on performance ratings: A longitudinal study. Academy of Management Journal, 38, 232–260.

Wexley, K. N., & Klimoski, R. (1984). Performance appraisal: An update. In K. M. Rowland & G. R. Ferris (Eds.), Research in personnel and human resources management (Vol. 2, pp. 35–79). Greenwich, CT: JAI Press.


Wohlers, A. J., & London, M. (1989). Ratings of managerial characteristics: Evaluation difficulty, co-worker agreement, and self-awareness. Personnel Psychology, 42, 235–261.

Woehr, D. J., Sheehan, M. K., & Bennett, W., Jr. (1999, April). Understanding disagreement across rating sources: An assessment of the measurement equivalence of raters in 360 degree feedback systems. Poster presented at the Fourteenth Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.

Yammarino, F., & Atwater, L. (1997). Do managers see themselves as others see them? Implications of self-other rating agreement for human resource management. Organizational Dynamics, 25(4), 35–44.
