
BEHAVIORAL RESEARCH IN ACCOUNTING
Volume 16, 2004
pp. 75-88

Debiasing Balanced Scorecard Evaluations

Michael L. Roberts
Thomas L. Albright
Aleecia R. Hibbets
The University of Alabama

ABSTRACT: Lipe and Salterio (2000) found that superiors disregarded half of the information when using a Balanced Scorecard to evaluate the performance of two divisional managers. Only common measures affected the superiors' holistic evaluations, defeating the purpose of the Balanced Scorecard. Our study examines whether disaggregating the Balanced Scorecard results in evaluations consistent with the intent of the Balanced Scorecard approach. Results indicate the disaggregated strategy allows superiors to utilize unique as well as common measures, thus overcoming the common-measures bias. In addition, we find Balanced Scorecard performance evaluations explain more than half the variation in subsequent compensation decisions.

INTRODUCTION

Kaplan and Norton (1996) observe that many corporate managers rely on financial measures alone to evaluate subordinates' performance, disregarding key elements in the corporation's strategic mission and inadvertently emphasizing measures that lag, instead of lead, actual firm performance. Kaplan and Norton (1996) created the Balanced Scorecard (BSC) to enable managers to utilize strategically important nonfinancial as well as financial measures. A central premise behind the BSC is that each business unit of a firm should develop its own scorecard with measures that capture the unit's unique strategy. The tool is now widely used in organizations (Silk 1998). However, Lipe and Salterio (2000) demonstrated that M.B.A. students assigned to the role of superiors using the BSC disregard measures unique to particular divisions. Superiors relied only on the items appearing on both divisions' scorecards. Half of the measures included in the scorecards, which were unique or specific to a single division, were ignored.
Because all of the items on a BSC are assumed to be critically important measures of strategic performance, this common-measures bias undercuts its potential usefulness. Lipe and Salterio (2000) attribute the common-measures bias they found to the superior's need to employ simplifying cognitive strategies.

The purpose of this study is to examine a potential approach to debias performance evaluations using the BSC. We use a "disaggregated/mechanically aggregated" Balanced Scorecard (hereafter Disaggregated Balanced Scorecard) in which participants: (1) evaluate performance separately for each of 16 performance measures and then (2) mechanically aggregate the separate judgments using pre-assigned weights for each measure. Following this disaggregation-plus-mechanical-aggregation, participants make an overall evaluation. Thus, we examine whether the common-measures bias found by Lipe and Salterio (2000) when the BSC is used to make holistic judgments can be overcome by utilizing a prior disaggregated, mechanically aggregated information processing strategy.

The authors gratefully acknowledge the cooperation of Marlys Lipe in providing copies of experimental materials from Lipe and Salterio (2000).

M.B.A. students role-playing superiors in our study weighted unique measures consistently with the BSC guidelines they were given for both unique and common measures, in contrast to Lipe and Salterio (2000). Thus, our findings suggest disaggregating the steps involved in performing BSC evaluations can overcome common-measures bias. Disaggregating the process, therefore, is one approach for improving effectiveness of the BSC. Using disaggregated steps was not suggested by the BSC's originators, Kaplan and Norton (1996).

We also extend Lipe and Salterio (2000) to examine the influence of BSC performance evaluations on subsequent compensation decisions.
Although Kaplan and Norton suggest the BSC should affect compensation, they provide no guidelines for this linkage (Kaplan and Norton 1996; Lipe and Salterio 2000). We find superiors' performance evaluations using the disaggregated BSC strategy explain slightly more than half of the variation in superiors' decisions to distribute a bonus to division managers. Performance and bonus allocations are highly correlated.

The remainder of the paper is organized into five sections. The next section reviews relevant literature and presents hypotheses. In the third section, we describe our research methods. Then we present results of our experiment, tests of hypotheses, and supplemental analyses of related questions. In the fourth section, we discuss implications, limitations, and offer suggestions for future research. In the final section, we present our conclusions.

LITERATURE REVIEW AND HYPOTHESIS DEVELOPMENT

Cognitive Demands in Comparative and Individual Judgments

Prior research in psychology has shown that decision makers faced with comparative evaluations tend to use information common to both objects and to underweight information unique to each object (Slovic and MacPhillamy 1974). Dominance of the common information was found only when objects were evaluated in pairs. The same information item did not dominate when each object was evaluated individually.

Lipe and Salterio's (2000) (hereafter Lipe and Salterio) participants were older, M.B.A. students, had an average of five years of work experience, and arguably were more knowledgeable about their task than Slovic and MacPhillamy's (1974) undergraduate participants. Lipe and Salterio instructed their participants to evaluate two retail divisional managers independently, in contrast to Slovic and MacPhillamy (1974), whose task involved choosing which of two candidates would be more successful.
However, Lipe and Salterio's experimental materials presented participants with Balanced Scorecards for both division managers before they evaluated each manager's performance. Lipe and Salterio's results were consistent with Slovic and MacPhillamy (1974). Lipe and Salterio found their M.B.A. participants, role-playing the part of superiors evaluating division managers, used the common measures but disregarded the unique measures in evaluating the performance of division managers using the BSC. Thus, Lipe and Salterio demonstrate the application of common-measures bias in the BSC context, an important practical application.

Common measures may dominate in comparative evaluations for at least three related reasons. First, they form a smaller subset of the total information, and it is cognitively easier to retain and process less, rather than more, information (Anderson 1990). Second, not only does this result in less total information, but also it may result in fewer categories or types of information to process (Lipe and Salterio 2002). Third, common measures are the only information available to directly compare the managers.

An Aid to Debiasing

Lipe and Salterio (2000, 287) suggested their subjects ignored unique measures in order to reduce their effort to complete the evaluation tasks. One method for improving judgment quality when effort is insufficient is to use a decision aid (Kennedy 1995). Searching for the optimum combination of human judgment and statistical modeling, Einhorn (1972) demonstrated improved decision accuracy when human judges coded decision information into quantitative form and outputs were generated using a mechanical combination rule. Bowman (1963) suggested combining man and model by using "clinical synthesis" whereby the individual uses the output of a model as an input to the individual's final judgment.
Application of Einhorn's (1972) and Bowman's (1963) suggested approaches to the BSC would involve a two-step process: (1) disaggregate the evaluation decision into several smaller decisions and (2) aggregate the smaller decisions into an overall score based on predetermined weights (e.g., Einhorn 1972; Lyness and Cornelius 1982; Edwards and Newman 1982). Step 1, disaggregating a complex decision, would increase the extent to which each individual dimension is processed. When focusing attention on one dimension, the decision-maker's short-term working memory would be free from simultaneously keeping information about other dimensions from decaying. This shift in attention and processing capacity should facilitate greater total effort and ensure that effort is exerted on all measures. This step should overcome common-measures bias to the extent the bias is caused by failure to adequately attend to unique measures.

In Step 2, the predetermined weights used to aggregate the evaluations into an overall score should reinforce the importance of both common and unique measures to the organization. It is thus more likely that both common and unique measures will be used in subsequent holistic evaluations because decision makers will have already incurred the processing cost of evaluating each dimension.

Disaggregated judgment strategies are more advantageous the more complex the judgment required, even when "complex" judgments include as few as nine information cues (Lyness and Cornelius 1982). In comparison, the BSC typically requires four to seven performance measures in each of four categories (as suggested by Kaplan and Norton 1996). As a result, evaluators using the BSC could potentially have 16 to 28 cues to process, holistically, in assessing the performance of a firm manager. Thus, performance judgments using the BSC should be adequately complex to realize the benefits of a disaggregated, mechanically aggregated judgment.¹
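The two-step process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the instrument used in the study: the category names, the measure labels, the weights (4.0, 5.0, 7.0, and 9.0 percent per category), and the ratings are all hypothetical values chosen only so that the weights sum to 100 percent.

```python
from typing import Dict

# Hypothetical weights (percent) for a 16-measure scorecard: four categories,
# each with two common and two unique measures. Category names and weights
# are illustrative assumptions, not values taken from the study.
CATEGORIES = ["financial", "customer", "internal_process", "learning_growth"]
WEIGHTS: Dict[str, float] = {}
for cat in CATEGORIES:
    WEIGHTS[f"{cat}_common_1"] = 4.0
    WEIGHTS[f"{cat}_common_2"] = 5.0
    WEIGHTS[f"{cat}_unique_1"] = 7.0
    WEIGHTS[f"{cat}_unique_2"] = 9.0


def aggregate(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Step 2: weighted sum of the Step 1 ratings (each on a 0-100 scale)."""
    if set(ratings) != set(weights):
        raise ValueError("every weighted measure needs exactly one rating")
    if abs(sum(weights.values()) - 100.0) > 1e-9:
        raise ValueError("weights must sum to 100 percent")
    return sum(ratings[m] * weights[m] / 100.0 for m in weights)


# Step 1 (hypothetical evaluator): rate each measure separately.
ratings = {m: (80.0 if "common" in m else 70.0) for m in WEIGHTS}

# Step 2: mechanical aggregation with the pre-assigned weights.
score = aggregate(ratings, WEIGHTS)
print(f"mechanically aggregated score: {score:.1f}")
```

Because every measure must be rated before the weighted sum can be formed, the unique measures cannot be skipped; the evaluator remains free to depart from the aggregated score in a later holistic judgment.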
Disaggregated judgments plus mechanical aggregation both decrease and increase task demands. Cognitive demands at any one time are reduced because the amount of information to be considered for evaluating each individual dimension is less than the information in the entire BSC. However, the total time and effort required increases because the number of evaluations and computations increases. For example, to apply disaggregation-plus-mechanical-aggregation to Lipe and Salterio's BSC, 16 separate evaluations would be required for each of the two division managers (a total of 32 separate judgments, compared with only two holistic judgments in Lipe and Salterio). Then each of the 16 evaluations would have to be extended by its decision weight, and the total of the 16 products summed. A total of 96 evaluations and computations would be necessary.

Consistent with Kennedy's (1995) debiasing framework, we expect that providing superiors with a disaggregated BSC will increase the total cognitive effort expended to evaluate all measures prior to making holistic evaluations. Given this increased effort, we expect superiors to utilize all the BSC measures rather than the strategy chosen by Lipe and Salterio's participants of utilizing only half the BSC measures. Based on the above, we test the following hypothesis in alternative form:

H1: Presenting the BSC in a disaggregated format will result in subsequent holistic evaluations of managers' performance that reflect unique (as well as common) measures.

We describe the specific methods we use to disaggregate-mechanically-aggregate the BSC in section three. First, however, we describe an additional extension of Lipe and Salterio (2000).

¹ We note reports that a few divisions of some firms have adopted weights for each scorecard item or categories of items (Davis 2000; Kaplan 1997; Kaplan and Norton 2001, 256; Malina and Selto 2001). However, Kaplan and Norton do not advocate weighting and scoring each scorecard item separately or any particular method or algorithm for aggregating individual scores. Also, Lipe and Salterio (2002) demonstrate that when all measures within a BSC category are consistently above or below target, BSC users tend to collapse performance on the individual items into a categorical evaluation. In this situation, the total number of information cues for making an overall evaluation could be reduced considerably, e.g., from 16 to four. In addition, Bonner et al. (1996) suggest reasonableness checks should be employed when using mechanically aggregated judgments. Reasonableness checks ensure aggregation problems such as those encountered by Jiambalvo and Waller (1984) and Daniel (1988) do not occur. For example, decision makers can be directed to adjust their disaggregated judgments of risk components so total risk does not exceed 1.0. Another approach, which we employ, is to provide weights for each subtask to be mechanically aggregated (Lyness and Cornelius 1982).

Linking the BSC to Compensation

Conceptually, performance evaluation using the BSC should be linked to compensation of unit managers (Kaplan and Norton 1996, 217). However, firms traditionally have implemented the BSC on an experimental basis and have waited to become more familiar with the new performance evaluation tool before changing compensation practices (Chow et al. 1997; McWilliams 1996). As a result, Kaplan and Norton (1996) make no recommendations about how BSC evaluations should apply to compensation decisions. Supervisors may be reluctant to be "tied" to a formal evaluation tool that does not allow them to compensate subordinates at their discretion. Therefore, it is important to determine whether superiors will follow the formal BSC procedure in making compensation decisions.
Lipe and Salterio did not test the theoretical linkage between performance evaluation and compensation decisions in their study. Because the link between performance evaluation and compensation is seen as critical to ex ante decisions of managers (Lipe and Salterio 2000, 293), we test this linkage directly with the following hypothesis:

H2: Superiors' holistic performance evaluations using the disaggregated BSC will affect subsequent compensation decisions.

PROCEDURES

M.B.A. students were given a case involving two divisions of WCS Incorporated, a retail firm specializing in women's apparel. The case was administered during class, prior to any instruction on the Balanced Scorecard. No credit was given for participation, and responses were anonymous. This approach is identical to Lipe and Salterio (2000), whose 58 first-year M.B.A. participants completed a classroom case. The case was adapted from Lipe and Salterio, which had followed Kaplan and Norton's (1996) Kenyon Stores example of a BSC implementation. Participants were asked to assume the role of a senior executive of WCS who has recently participated in a Harvard Business School symposium on the Balanced Scorecard. Participants were given the mission statement of WCS² and introduced to the two division managers. The case informed participants of the individual divisions' strategies and presented each division's Balanced Scorecard.

Next, participants completed the two steps of the Disaggregated BSC: they (1) rated each manager's performance on each of the 16 Balanced Scorecard items, using a scale from 0 (Unacceptable) to 100 (Excellent), and then (2) multiplied these individual judgments by pre-determined weights and summed the weighted scores to create a total, aggregated score for each division. Pre-determined weights for the unique measures were 64 percent of the total.
These two steps were not used by Lipe and Salterio nor were they suggested by Kaplan and Norton (1996).

² The mission statement reads, "We will be an outstanding apparel supplier in each of the specialty niches served by WCS."

Participants then made a separate overall assessment of each manager's performance, measured on a scale from 0 (Reassign) to 100 (Excellent). This overall assessment was worded the same and used the same scale as Lipe and Salterio's study and is used to test H1. This separate judgment was elicited to give participants the opportunity to adjust their overall assessment if they were not satisfied with the outcome of their mechanically aggregated score for any reason. Thus, participants were not bound by the mechanical aggregation of their disaggregated judgments. They were free to disregard them completely or to use them however they saw fit in evaluating each manager's performance.

After making an overall evaluation of each manager's performance, participants allocated a total year-end bonus fund of $100,000 between the two division managers. These allocations are used to test H2. Then they completed follow-up questions about the case, provided demographic information, answered manipulation checks, and responded to questions regarding task difficulty, realism, and understandability.

Participants were given information about two of WCS's divisions, RadWear (RAD), specializing in teen clothing, and WorkWear (WORK), specializing in women's business uniforms. The strategies for each division were presented, and performance measures appropriate for the division's strategy were employed on each division's scorecard.

The design is 2 x 2 x 2, with two between-subjects factors (Common and Unique) and one within-subjects factor (Division). The first between-subjects factor, Common, indicates whether RadWear or WorkWear performs better on the common measures.
The second between-subjects factor, Unique, indicates whether RadWear or WorkWear performs better on the unique measures. Each participant evaluated managers of both divisions; thus Division is the within-subjects factor.

Each scorecard contained 16 separate measures, four in each of the four categories. In each category, two measures were common across divisions, and two measures were unique to each division. For example, in the financial category, both divisions had measures for return on sales and sales growth. The two measures unique to RadWear were new store sales and market share relative to retail space; WorkWear's two unique financial measures were revenues per sales visit and catalog profits.

Both divisions perform better than target on all 16 measures. The percentage above target, however, was varied so that either RadWear or WorkWear performed better as indicated in the experimental design described above. The percentage above target was calculated to the second: and reported in a column of the scorecard. These percentages are identical to Lipe and Salterio (2000).

With 16 common and unique measures, unit weighting would imply a weight of 6.25 percent for each measure (100/16). Total weight for each of the four categories was set at 25 percent, and within each category we varied the pre-assigned weights between 4.0 and 9.0 percent.⁴ These weights were given to participants on the face of the Disaggregated BSC. Unique measures were assigned 64 percent of the total weights.⁵ A copy of the Disaggregated BSC is shown in the Appendix.

If the unique measures are used in the evaluation as hypothesized, an interaction of Division and Unique should be observed. This is in addition to the interaction of Division and Common reported by Lipe and Salterio (2000).

⁴ Large discrepancies among individual item weights result in perception by users that the BSC is not, in fact, "balanced" and result in ignoring low-weight items (Malina and Selto 2001, 71).

⁵ Fifty percent of the 16 items on the scorecard are unique to each division. We weighted these items slightly more than 50 percent to ensure that participants were not merely resorting to unit weighting.

Participants

Eighty-one (81) M.B.A. students participated in the experiment. Seventy-nine (79) useable responses are reported below because one participant failed to complete the overall performance evaluation for both managers, and another participant did not provide disaggregated scores for WorkWear. Twenty-five (25) participants were Executive M.B.A. students; 54 were regular M.B.A. students. We tested for potential systematic differences in these two participant groups on the variables of interest by including degree program as a variable in each statistical model. No significant differences were observed for degree program; therefore, the two groups were collapsed for the analysis reported below.

Mean age was 27.6 (median, 25.0), with 5.1 years of work experience (median, 2.0). Seventy-three (73) percent of participants were male. Fifty-three (53) percent indicated prior experience in making performance evaluations.

Attention and Manipulation Checks

Overall, participants regarded the case as realistic, easy to understand, and not difficult to complete. The mean score for realism was 2.2 on a scale from -5.0 to 5.0, with 5.0 indicating participants "strongly agree" the case is realistic. The mean score for understandability was 3.1, and the mean score for difficulty was -2.6. Participants also agreed the BSC items were usefully categorized (mean of 2.6), that RadWear and WorkWear target different markets (mean score 3.7), used different measures (mean score of 3.1), and should use different measures (mean score of 3.4).
All means were significantly different from zero (p < 0.01).

We also checked each participant's multiplication and addition of the weighted scores for the mechanical aggregation (Step 2) part of the task. For RadWear, 70 of the 79 participants' mechanically aggregated scores were within +/- 1.0 of our recalculation, and all 79 were within +/- 5.0. For WorkWear, 75 participants were within +/- 1.0 of our recalculation, and all 79 were within +/- 6.0.

RESULTS

Disaggregation Strategy

Table 1 presents the results of the repeated measures ANOVA (compare to Lipe and Salterio 2000, Table 3). If the Disaggregated BSC is successful in preventing the common-measures bias observed by Lipe and Salterio, there should be a significant interaction between Unique measures and Division. As shown in Panel A, both the Division x Unique interaction (F = 30.51, p < 0.01) as well as the Division x Common interaction are significant (F = 12.81, p < 0.01). Therefore, our results provide evidence that both common and unique measures are important in explaining differences in overall evaluation scores. This result differs from Lipe and Salterio, who found significance only on common measures. (Note: None of the between-subjects tests shown in Panel A are significant, nor is the three-way within-subject interaction. This is a result of the balanced experimental design and is expected.)

Panel B of Table 1 reports means to illustrate direction and magnitude of the results. Consistent with Lipe and Salterio, when common measures favor RadWear, superiors rank RadWear's manager 2.28 points higher than WorkWear's manager. Likewise, when common measures favor WorkWear, superiors rank WorkWear's manager 2.58 points higher than RadWear's manager. These differences for common measures are marginally significant, p = 0.05.
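As a cross-check, the two interaction F statistics can be recomputed from the sums of squares in Table 1, Panel A, since an F ratio is the effect mean square divided by the error mean square. The sketch below assumes a within-subjects error term with 75 degrees of freedom (79 participants minus four between-subjects cells); that value is an inference from the design rather than a figure stated in the text.

```python
# Within-subjects sums of squares as reported in Table 1, Panel A.
# Each effect has 1 degree of freedom, so its mean square equals its
# sum of squares.
SS_EFFECT = {"Division x Common": 258.76, "Division x Unique": 616.10}
SS_ERROR = 1514.50
DF_ERROR = 75  # assumption: 79 participants minus 4 between-subjects cells

ms_error = SS_ERROR / DF_ERROR
f_stats = {name: ss / ms_error for name, ss in SS_EFFECT.items()}

for name, f_value in f_stats.items():
    print(f"{name}: F(1, {DF_ERROR}) = {f_value:.2f}")
```

Under that assumption the ratios round to the reported values of 12.81 and 30.51.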
However, in contrast to Lipe and Salterio, our results indicate that when unique measures favor RadWear, superiors rank RadWear's manager 3.75 points higher than WorkWear's manager. Likewise, when unique measures favor WorkWear, superiors rank WorkWear's manager 4.0 points higher than RadWear's manager. These differences for unique measures are significant, p < 0.01.

To further examine the relative influence of the common and unique measures, we regressed differences in superiors' overall performance evaluations on common and unique measures. Lipe and Salterio reported a significant positive slope coefficient from regression of 10.87 for Common measures (t = 3.28, p < 0.01), but an insignificant coefficient for Unique measures, 0.08 (t = 0.02, p > 0.10). In contrast, as shown in Table 2, both Common and Unique measures in our study have significantly positive slope coefficients: 5.18 (t = 3.63, p < 0.001) and 8.00 (t = 5.67, p < 0.001) for Common and Unique, respectively.

TABLE 1
Influence of Common and Unique Measures on Subjective Overall Performance Evaluations Using Disaggregated Balanced Scorecard

Panel A: Results of a 2 x 2 x 2 Repeated Measures ANOVA of Subjective Overall Performance Evaluations of RadWear's and WorkWear's Division Managers

Variable                       df   SS          MS       F       p
Between Subjects
  Common                        1   0.77        0.77     0.00    0.95
  Unique                        1   87.98       87.98    0.39    0.54
  Common x Unique               1   50.11       50.11    0.22    0.64
  Error                        75   17,004.39   226.73
Within Subjects
  Division                      1   3.36        3.36     0.17    0.68
  Division x Common             1   258.76      258.76   12.81   0.0006
  Division x Unique             1   616.10      616.10   30.51   <0.0001
  Division x Common x Unique    1   10.86       10.86    0.54    0.47
  Error                        75   1,514.50    20.19

Panel B: Mean Subjective Overall Performance Evaluations of RadWear's and WorkWear's Division Managers*

                                    Measures Favor
                                    RadWear         WorkWear
Common
  RadWear                           79.12           76.56
                                    (10.90)         (illegible)
  WorkWear                          76.85           79.12
                                    (illegible)     (10.13)
  Difference: RadWear - WorkWear    2.28            2.58
  t-test p-value                    0.05            0.05
Unique
  RadWear                           79.03           76.71
                                    (illegible)     (10.74)
  WorkWear                          75.28           80.75
                                    (illegible)     (10.03)
  Difference: RadWear - WorkWear    3.75            4.0
  t-test p-value                    0.005           <0.0001

* Overall evaluations made on a 101-point scale, with 0 labeled "Reassign" and 100 labeled "Excellent." Panel values are means (standard deviation). Common measures appear on both divisions' balanced scorecards. Unique measures appear on only one division's balanced scorecard. Favor RadWear indicates the measures were higher for the RadWear division than the WorkWear division. Favor WorkWear indicates the measures were higher for the WorkWear division than the RadWear division.

TABLE 2
Comparison of Relative Weights of Common and Unique Measures on Differences in Subjective Overall Evaluations of Division Managers: Regression Analysis Results

Source             df   Sum of Squares   Mean Square
Model               2   1,724.16         862.08
Error              76   3,050.72         40.14
Corrected Total    78   4,774.88
R-squared 0.36; Adj. R-squared 0.34

Variable     df   Parameter Estimate   Standard Error   t-value   Pr > |t|
Intercept     1   6.91                 1.30             5.33      <.0001
Common        1   5.18                 1.43             3.63      0.0005
Unique        1   8.00                 (illegible)      5.67      <.0001

We obtained the same results when using the difference between the mechanically aggregated scores as the dependent variable.
The dependent variable is the difference in the overall evaluations of RadWear's and WorkWear's division managers' performance for the past year on a 101-point scale, with 0 labeled "Reassign" and 100 labeled "Excellent."
Common = a 0/1 dummy variable indicating the particular division scored low (high) on the eight Balanced Scorecard measures that appeared on both divisions' scorecards; and
Unique = a 0/1 dummy variable indicating the particular division scored low (high) on the eight Balanced Scorecard measures that were unique to that division, i.e., did not appear on both divisions' scorecards.

Based on the results shown in Tables 1 and 2, we conclude the Disaggregated BSC is effective in eliminating the common-measures bias Lipe and Salterio found when the BSC is used for holistic performance evaluations.

Bonus Distribution (Allocation)

Our second hypothesis examines the influence of performance evaluations on the bonus allocation. We calculated the difference in managers' bonuses assigned by each participant. We regressed this difference on the differences in managers' overall performance evaluations assigned by each participant using the Disaggregated BSC (PerformDiff), controlling for differences in each manager's mechanically aggregated score (AggScDiff). Table 3 reports the regression results.

The performance-compensation model is significant, F = 48.84, p < 0.0001. Managers' overall evaluation scores were significant (p < 0.0001). Mechanically aggregated scores, included as a control variable, were marginally significant (p = 0.07). Interestingly, the model explains only 55 percent of the variance in bonus differences.
Thus, superiors appear to use the Disaggregated BSC performance evaluations as part of their judgment models for assigning bonuses, but they are either inconsistent in their application of performance evaluation information or they adjust bonus allocations for additional factors not included in the BSC.⁶

TABLE 3
Influence of Disaggregated Balanced Scorecard Performance Evaluations on Difference in Managers' Bonuses: Regression Analysis Results

Source             df   Sum of Squares    Mean Square      F-value   Pr > F
Model               2   7,071,283,943     3,535,641,971    48.84     <.0001
Error              76   5,502,293,601     72,398,600
Corrected Total    78   12,573,577,543
R-squared 0.56; Adj. R-squared 0.55

Variable       df   Parameter Estimate   Standard Error   t-value   Pr > |t|
Intercept       1   1,580.13             965.64           1.64      0.1059
PerformDiff     1   955.50               179.41           5.33      <.0001
AggScDiff       1   451.1                245.10           1.84      0.0696

The dependent variable is the difference in the dollar amounts of a total annual bonus of $100,000 that was available to allocate between the two managers.
PerformDiff = the difference between the two managers' subjective overall performance evaluations using the disaggregated Balanced Scorecard; and
AggScDiff = the difference between the two managers' mechanically aggregated scores using the disaggregated Balanced Scorecard.

⁶ Debriefing conversations following the experiment revealed that some participants may have rated WorkWear's performance as slightly better than RadWear's because WorkWear was described as more stable and less growth-oriented and, thus, may have been at a disadvantage in achieving above-target performance. A paired t-test for differences in bonus compensation awarded to each division manager, however, was not significant (p = 0.28).

Supplemental Analyses

By design, the mechanically aggregated BSC scores represent an input to the superiors' performance and compensation decisions. Superiors' final decisions were made separately from the mechanical aggregation. Importantly, their decisions were framed as an overall (holistic) evaluation. This distinction raises the question to what extent the overall performance evaluations are affected by the preliminary, mechanically aggregated BSC score.

To address this linkage, we correlated the superiors' subjective, overall evaluations of each manager's performance with their mechanically aggregated score for the same manager. Coefficients of correlation for RadWear were 0.74 (p < 0.0001) and for WorkWear, 0.84 (p < 0.0001). Thus, for each division manager, the mechanically aggregated scores are significantly correlated with the subjective, overall evaluations. Both correlations are less than 1.0, however, indicating superiors' holistic evaluations included some mental adjustment of their mechanically aggregated scores or, at least, they were not perfectly consistent.⁷

Previous research has found disaggregating decisions increases consensus and inter-judge agreement (Libby and Libby 1989; Davis 1998). We compared standard deviations for our participants' evaluations (Table 1, Panel B) with those reported by Lipe and Salterio (2000, Table 3, Panel B). F-statistics were significant for only one of the eight comparisons (p < 0.05). Thus, we conclude that disaggregating BSC evaluations does not reduce variation among evaluators. We note, however, that our participants made use of twice the number of BSC items as Lipe and Salterio's participants. Also, the standard deviations available for comparison with Lipe and Salterio are averages across two experimental cells, which would necessarily indicate less variation than individual cell means.

IMPLICATIONS, LIMITATIONS, AND SUGGESTIONS

Implications

Lipe and Salterio (2000) note the common measures employed in the BSC tend to be more traditional financial measures, like return on sales and average markdowns, and that these measures tend to lag actual performance.
In contrast, the unique measures, such as sales from new market leaders and market share relative to retail space, tend to be nontraditional and, more importantly, leading indicators of performance that capture elements of corporate and division strategic emphasis not captured elsewhere. Thus, ignoring the unique measures in the BSC is tantamount, in many cases, to ignoring many leading indicators and focusing managerial attention more on lagging indicators.

⁷ See footnote 6.

To be effective as a management control device, the BSC should result in evaluations that are accurate, objective, and verifiable (Malina and Selto 2001, 75). Significant conflict and tension between superiors and evaluatees were observed when evaluation was perceived as subjective. Perceptions of subjectivity led to rejection of the BSC and return to financial performance measures at another large firm (Ittner et al. 2002).

Using the disaggregated Balanced Scorecard, our participants utilized the unique factors to a substantial extent. While two other studies find training (Dilla and Steinbart 2002) and explicit communication of the importance of all BSC measures (Roberts et al. 2002) can improve utilization of unique measures, both these latter studies find common measures account for two to four times more variation in evaluations than unique measures. BSC items were not explicitly weighted in either of these studies. In contrast, the present study demonstrates weights established as part of the BSC design enable decision makers to place an equal or greater weight on unique measures, consistent with company strategy. To the extent unique measures represent leading indicators, the disaggregated BSC will enable managers to intervene sooner when divisions encounter problems and to attempt corrective action.

Limitations

The results of this study are limited to comparative evaluations. As discussed above, Slovic and
As discussed above, Slovic and MacPhillamy's (1974) findings of common-measures bias did not hold when individuals, rather than pairs, were evaluated. Thus, when the BSC is used to evaluate divisions individually, an important condition leading to common-measures bias will be absent. Also, the participants in this experiment did not have personal experience with the managers being evaluated nor individual accountability for their performance evaluations and compensation decisions. Accountability has positively affected decision making in some related contexts, such as when decision aids are not available (Ashton 1990) and when decision makers sequentially process several positive and negative information items (Kennedy 1993). Finally, though our participants are similar to Lipe and Salterio's (2000), i.e., M.B.A. students at a major public university, there may be other differences between our participants and/or the timing and setting of the two experiments of which we are unaware and which we have not considered.

Suggestions

We used a two-part, disaggregated-mechanically-aggregated decision aid strategy consistent with earlier research on "judgments of man versus models of man" (Ashton 1982, 34-43). In our approach, however, human decision makers perform the aggregation, as suggested by Bowman (1963), prior to making subjective, overall evaluations. Thus, common-measures bias could possibly be mitigated by (1) requiring BSC users to evaluate performance on each BSC measure and/or (2) suggesting weights for each measure. Future research could test whether common-measures bias can be reduced or overcome by one of these approaches alone. We note, however, one study found that requiring disaggregated judgments without providing a mechanism for combination resulted in decreased judgment quality compared to holistic judgments (Lyness and Cornelius 1982).
Also, providing suggested weights would likely produce a result similar to a reminder to use all the measures (Roberts et al. 2002). Additionally, superiors could be asked to evaluate performance for each BSC category, i.e., to evaluate performance on four items at a time, and then make a holistic judgment. Theoretically, this would substantially lessen the amount of information to be processed at each stage, thereby reducing the need for the cognitive simplifying strategies present in Lipe and Salterio's (2000) study.

Future research should examine the extent to which mechanical aggregation is acceptable to managers and superiors. The influences of factors extraneous to the stated BSC measures should also be addressed. In this study, the mechanically aggregated scores explained slightly more than 50 percent of the variation in overall evaluations of performance for one division (RadWear) and 70 percent of the variation in performance evaluation for the other division (WorkWear). Perhaps participants view the teenage RadWear market as more volatile than the WorkWear market, resulting in greater variance in evaluations of RadWear, or superiors could be reacting negatively to some items on the BSC. They may discount the BSC somewhat since, in this experiment, they were not active participants in developing the measures, or they could be reacting negatively to the target-setting practice of the BSC. For example, it may seem unusual to participants that both divisions exceeded their target performance on all 16 BSC measures. Superiors may be imposing their own standards of performance that differ somewhat from the BSC guidelines. This possibility is suggested by the average mechanically aggregated scores, as well as the holistic scores obtained by Lipe and Salterio, in the 70-80 range for managerial performance that exceeded target on all 16 measures.
These and other possible explanations should be investigated by future research. Since acceptance of the performance evaluation tool is critical to managers' ex ante behavior (Lipe and Salterio 2000, 293), these issues are important to understand.

CONCLUSIONS

Lipe and Salterio (2000) demonstrate an important limitation to using the BSC. Without providing a way to ease the cognitive burden on users, decision makers, when making comparative evaluations, will tend to focus attention only on measures common among managers and ignore measures unique to a division manager. Our study demonstrates an efficient method for reducing the overwhelming cognitive demands of the Balanced Scorecard, while enabling users to make evaluations consistent with all the important elements of corporate strategy and mission.

Although they circumvent the issue somewhat, Kaplan and Norton (1996) indicate employee behaviors are not likely to be modified without a definite link to compensation. If the amount of compensation to be received is determined from a superior's evaluation of the employee's performance in meeting the division's goals, then it is important to know how those superiors' evaluations are affected by the inclusion of weights on the BSC. Our results indicate decision-makers' compensation decisions are strongly supported by the overall performance evaluation scores of the disaggregated Balanced Scorecard. This evidence, and similar evidence from practice, should reassure employees in firms that have adopted the BSC approach that their bonus is, in fact, based on the messages communicated by management, but only if the weights and disaggregated scores are made explicit.

APPENDIX

RadWear Balanced Scorecard
Targets and Actuals for 1999

                                                                              % Better      Performance   Weighted
Measure                                     Weight        Target     Actual   than Target   Evaluation*   Score**
Financial:
 1. Return on sales                           4%          24%        26%        8.33%        ______        ______
 2. New store sales                           9%          30%        32.5%      8.33%        ______        ______
 3. Sales growth                              5%       [illegible]   38%     [illegible]     ______        ______
 4. Market share relative to retail space     1%          $80        $86.85     8.56%        ______        ______
Customer-Related:
 1. Mystery shopper program rating            8%          85         96        12.94%        ______        ______
 2. Repeat sales                              5%          30%        34%       13.33%        ______        ______
 3. Returns by customers as % of sales        8%          12%        11.6%      3.33%        ______        ______
 4. Customer satisfaction rating              4%          92%        95%        3.26%        ______        ______
Internal Business Processes:
 1. Returns to suppliers                      4%          6%         5%        16.67%        ______        ______
 2. Average major brand names/store           1%          32         37        15.63%        ______        ______
 3. Average markdowns                      [illegible]    16%        13.5%     15.63%        ______        ______
 4. Sales from new market leaders             8%          25%        29%       16.00%        ______        ______
Learning and Growth:
 1. Average tenure of sales personnel         9%          1.4        1.6       14.29%        ______        ______
 2. Hours of employee training/employee       4%          15         17        13.33%        ______        ______
 3. Stores computerizing                      8%          85%        90%        5.88%        ______        ______
 4. Suggestions/employee                      4%          33         35         6.06%        ______        ______

Composite Score (Aggregate of Weighted Scores): ______

* Use the following 100-point scale to indicate your evaluation of each scorecard item (place a value corresponding to this scale in the blank beside each scorecard item):

0                                   50                                   100
Unacceptable   very poor   poor   average   good   very good   Excellent

** Multiply your performance evaluation of each measure by the weighting factor corresponding to the measure.

REFERENCES

Anderson, J. R. 1990. Cognitive Psychology and its Implications. New York, NY: W. H. Freeman and Company.
Ashton, R. H. 1982. Human Information Processing in Accounting. Sarasota, FL: American Accounting Association.
Ashton, R. H. 1990. Pressure and performance in accounting decision settings: Paradoxical effects of incentives, feedback, and justification. Journal of Accounting Research 28 (Supplement): 148-180.
Bonner, S. E., R. Libby, and M. W. Nelson. 1996. Using decision aids to improve auditors' conditional probability judgments. The Accounting Review 71 (2): 221-241.
Bowman, E. H. 1963. Consistency and optimality in managerial decision making. Management Science 9 (1): 310-321.
Chow, C. W., K. M. Haddad, and J. E. Williamson.
1997. Applying the balanced scorecard to small companies. Management Accounting 79 (2): 21-27.
Daniel, S. 1988. Some empirical evidence about the assessment of audit risk in practice. Auditing: A Journal of Practice & Theory (Spring): 174-181.
Davis, E. B. 1998. Decision-aids for going concern evaluation: Expectations of partial reliance. Advances in Accounting Behavioral Research 1: 33-59.
Davis, S. 2000. An investigation of the development, implementation, and effectiveness of the balanced scorecard: A field study. Dissertation, The University of Alabama.
Dilla, W. N., and P. J. Steinbart. 2002. The effects of alternative supplementary information display formats on judgments made using the Balanced Scorecard. Working paper, Iowa State University.
Einhorn, H. J. 1972. Expert measurement and mechanical combination. Organizational Behavior and Human Decision Processes 19 (Feb): 86-106.
Edwards, W., and J. R. Newman. 1982. Multiattribute Evaluation. Beverly Hills, CA: Sage Publications, Inc.
Ittner, C. D., D. F. Larcker, and M. W. Meyer. 2003. Subjectivity and the weighting of performance measures: Evidence from a balanced scorecard. The Accounting Review 78 (3): 725-758.
Jiambalvo, J., and W. Waller. 1984. Decomposition and assessments of audit risk. Auditing: A Journal of Practice & Theory (Spring): 80-88.
Kaplan, R., and D. Norton. 1996. The Balanced Scorecard. Boston, MA: Harvard Business School Press.
Kaplan, R. 1997. Mobil USM&R. Harvard Business School case. Boston, MA: Harvard Business School Publishing.
Kaplan, R., and D. Norton. 2001. The Strategy-Focused Organization. Boston, MA: Harvard Business School Press.
Kennedy, J. 1993. Debiasing audit judgment with accountability: A framework and experimental results. Journal of Accounting Research 31 (Autumn): 231-245.
Kennedy, J. 1995. Debiasing the curse of knowledge in audit judgment. The Accounting Review 70 (2): 249-273.
Lee, C., K. S. Law, and P. Bobko. 1999.
The importance of justice perceptions on pay effectiveness: A two year study of a skill-based pay plan. Journal of Management 25 (6): 851-873.
Libby, R., and P. A. Libby. 1989. Expert measurement and mechanical combination in control reliance decisions. The Accounting Review 64 (4): 729-787.
Libby, R., and M. G. Lipe. 1992. Incentives, effort, and the cognitive processes involved in accounting-related judgments. Journal of Accounting Research 30 (2): 249-273.
Lipe, M., and S. Salterio. 2000. The balanced scorecard: Judgmental effects of common and unique performance measures. The Accounting Review 75 (3): 283-298.
Lipe, M., and S. Salterio. 2002. A note on the judgmental effects of the Balanced Scorecard's information organization. Accounting, Organizations and Society 27 (6): 531-540.
Lyness, K., and E. T. Cornelius III. 1982. A comparison of holistic and decomposed judgment strategies in a performance rating simulation. Organizational Behavior and Human Performance 29 (Feb): 1-38.
Malina, M. A., and F. H. Selto. 2001. Communicating and controlling strategy: An empirical study of the effectiveness of the Balanced Scorecard. Journal of Management Accounting Research 13: 47-90.
McWilliams, B. 1996. The measure of success. Across the Board 33 (2): 16-20.
Roberts, M. L., T. L. Albright, and A. R. Hibbets. 2002. Improving utilization of unique measures in the Balanced Scorecard: The effects of increased awareness and experience. Working paper, The University of Alabama.
Silk, S. 1998. Automating the balanced scorecard. Management Accounting (May): 38-44.
Slovic, P., and D. MacPhillamy. 1974. Dimensional commensurability and cue utilization in comparative judgment. Organizational Behavior and Human Performance 11: 172-194.
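
As an illustration only, and not part of the original study materials, the mechanical aggregation step described in the Appendix notes (each 0-100 evaluation is multiplied by its measure's weight, and the products are summed into a composite score) might be sketched in Python as follows. The function name, the four measure labels, the weights, and the evaluation scores are all hypothetical, and the sketch assumes weights are expressed as fractions that sum to 1.

```python
from math import isclose


def composite_score(weights, evaluations):
    """Return the weighted sum of per-measure evaluations (0-100 scale)."""
    # Every weighted measure needs exactly one evaluation, and vice versa.
    if set(weights) != set(evaluations):
        raise ValueError("weights and evaluations must cover the same measures")
    # Assumption of this sketch: weights are fractions summing to 1.
    if not isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    for name, score in evaluations.items():
        if not 0 <= score <= 100:
            raise ValueError(f"evaluation for {name!r} is outside 0-100")
    # Mechanical aggregation: multiply each evaluation by its weight, then sum.
    return sum(weights[name] * evaluations[name] for name in weights)


# Hypothetical four-measure scorecard (illustrative numbers only).
weights = {
    "return_on_sales": 0.4,
    "repeat_sales": 0.3,
    "returns_to_suppliers": 0.2,
    "employee_training": 0.1,
}
evaluations = {
    "return_on_sales": 80,
    "repeat_sales": 70,
    "returns_to_suppliers": 60,
    "employee_training": 90,
}

print(composite_score(weights, evaluations))
```

Because the weights are fixed before any evaluation is entered, a measure that appears on only one division's scorecard contributes to the composite in the same way as a common measure, which is the property the disaggregated approach relies on.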
Copyright of Behavioral Research in Accounting is the property of American Accounting Association and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.
