
Measuring Real Activity Management

Daniel Cohen
School of Management
University of Texas at Dallas
Richardson, TX 75080
dcohen@utdallas.edu

Shailendra Pandit
College of Business Administration
University of Illinois at Chicago
Chicago, IL 60607
shail@uic.edu

Charles Wasley
Simon Graduate School of Business
University of Rochester
Rochester, NY 14627
charles.wasley@simon.rochester.edu

Tzachi Zach
Fisher College of Business
The Ohio State University
Columbus, OH 43210
zach.7@osu.edu

July 2019

Financial support from The University of Texas at Dallas, the University of Illinois at Chicago,
the Simon School at the University of Rochester, and the Fisher College of Business at The Ohio
State University is gratefully acknowledged. We thank Zahn Bozanic, Bill Cready, Jerry
Zimmerman, participants at the European Accounting Association Annual Congress, workshop
participants at the University of Rochester and Syracuse University, and especially an anonymous
referee for helpful comments and suggestions.

Electronic copy available at: https://ssrn.com/abstract=1792639


Measuring Real Activity Management

Abstract

To test hypotheses about earnings management, many studies investigate managers' manipulation
of real activities (real earnings management, REM). Tests using measures of abnormal REM hinge
critically on the measurement of normal real activities. Yet there is no systematic evidence on the
statistical properties of commonly used REM measures. We provide such evidence by
documenting the Type I error rates and the power of tests based on the REM measures commonly
used in the literature. We find these measures are often mis-specified, with Type I error rates that deviate
from the nominal significance level of the test, especially in samples of firms with extreme
performance or firm characteristics. We also compare the specification and power of traditional
REM measures with performance-matched REM measures to see if the latter provide better-
specified and more powerful tests. While performance-matched REM measures are not immune
from mis-specification in all settings, in general, they are better specified under the null hypothesis
(i.e., in terms of Type I errors) than are traditional REM measures. Comparisons of the power to
detect abnormal REM reveal that neither approach, traditional or performance-matched, is
consistently more powerful than the other in terms of detecting abnormal REM ranging from 1%
to 10% of (lagged) total assets. The absence of a dominant approach to measure abnormal REM
leads us to recommend that future researchers report results using both traditional and
performance-matched measures so that readers are able to clearly assess the reliability of the
inferences drawn about the magnitude and significance of the abnormal REM documented in a
given study.

JEL classification: M41, C12, C15, M42.

Keywords: Real activity management; real earnings management; earnings management; real
activity models, test specification, Type I errors, Type II errors, power of the test, meet or beat,
earnings benchmarks, model specification.



1. Introduction

A vast literature in accounting examines managers’ incentives, and the actions they take,

to manage earnings (for reviews see, Healy and Wahlen 1999; Dechow and Skinner 2000; Fields,

Lys and Vincent 2001; Dechow, Ge and Schrand 2010). Early studies on earnings management

typically examined discretionary accruals (DAs). More recently, studies have begun to focus on

real earnings management (REM) (see Roychowdhury 2006; Gunny 2010). REM refers to actions

managers take to achieve financial reporting objectives (e.g., to report a profit vs. a loss) by altering

real activities including, but not limited to, sales promotions, overproduction of inventory,

delaying or accelerating discretionary expenses such as R&D, advertising, and SG&A, and selling

assets to recognize gains. Research focusing on REM is now commonplace in the earnings

management literature as an alternative, or in addition, to tests using DAs.1

Since tests of REM are joint tests of a researcher’s model of expected (normal) real

activities and earnings management, inferences about earnings management hypotheses based on

measures of REM hinge critically on a researcher’s ability to accurately model expected (normal)

real activities. An implicit assumption in prior REM studies is that, under the null hypothesis of

no REM, the REM measures utilized are well specified. That is, that their Type I error rates

correspond to the nominal significance level of the test (e.g., 5%). However, there is no systematic

evidence on whether the REM measures commonly used in the literature are, in fact, well-specified

and hence are able to deliver reliable inferences about managers’ incentives to manage earnings

by manipulating real activities. Additionally, there is no systematic evidence about the Type II error

rates for the REM measures commonly used in the literature. Stated differently, there is no

evidence on the power of the test for various REM measures, that is, their ability to detect real

1. We use the terms real earnings management, real activity management, real activity manipulation and REM interchangeably.



earnings management when it is present in the data. Given the on-going focus in the literature on

measuring REM, it is surprising that no systematic empirical evidence exists on the properties of

alternative REM measures or the statistical tests based on them. This contrasts with research on

the properties of DAs, which have been extensively studied (see Kothari, Leone and Wasley 2005;

Dechow, Sloan and Sweeney 1995). Our objective is to provide systematic evidence on the Type

I error rates and power of the test associated with a variety of REM measures; thus, our study of

these properties of alternative REM measures fills this void in the literature.2

An issue encountered in all REM studies is how economic shocks to a firm’s performance

affect a researcher’s ability to accurately model normal real activities. For example, does the shock

have a linear or non-linear effect on a firm’s normal performance? Shocks can affect managerial

decisions about the firm’s real activities in at least two ways. First, managers may engage in REM

by opportunistically altering real decisions to mask the effect of the shock on the firm’s reported

earnings to continue to report high earnings. Alternatively, managers may alter real decisions as

part of a rational response to the shock so that the firm’s reported earnings best reflect the shock’s

effect on firm value. Since the hypothesis tested in a typical REM study is one of opportunistic

managerial behavior, an issue faced by REM researchers is how to accurately separate changes in

real activities motivated by the former (opportunistic response) from the latter (rational response).

Kothari (2001, 164) makes a similar point in the context of DA models. We reiterate his point here

because it applies to the REM setting as well.

2. Other areas where researchers have provided evidence on the properties of firm performance measures include Barber and Lyon (1996) on abnormal operating performance, and Brown and Warner (1980, 1985), Bernard (1987), Kothari and Wasley (1989) and Campbell and Wasley (1993) on abnormal stock returns.



In addition to the traditional REM measures used in prior research, we also investigate the

properties of performance-matched REM measures.3 We analyze performance-matched REM

measures because they have been found to improve test specification and power in other settings,

namely for abnormal DAs. Specifically, Kothari et al. (2005) find that performance matching leads

to better specified measures of DAs when compared to traditional measures of DAs such as those

based on the Jones or modified-Jones model. Thus, instead of entertaining more complicated

models of expected (normal) real activities, we first investigate whether an approach adopted

elsewhere in the earnings management literature, namely, performance-matching, performs better

than the traditional REM measures used in the earnings management literature.4,5

Our empirical analysis unfolds as follows. First, we use simulations based on firms’ actual

real activity measures to document and compare the Type I error rates of traditional and

performance-matched REM measures. Second, we use simulations to introduce abnormal REM

into firms’ actual real activity measures. We then document and compare the power of the test of

the traditional and performance-matched REM measures. This comparison allows us to identify

which approach to specifying REM measures leads to the best-specified and most powerful tests

of earnings management-related hypotheses. A key feature of our simulations is that the results

allow us (and future researchers) to make informed tradeoffs between test specification under the

3. Examples of REM studies that have implemented some form of performance-matching in their specific setting are Cohen and Zarowin (2010) and Badertscher (2011). In a recent study, Srivastava (2019) implements matching based on industry cohorts that share the same life cycle stage and production technology.
4. Simply generalizing the results on performance matching from Kothari et al.'s (2005) analysis of DAs to real activities seems misguided because there is no a priori reason to believe that managers' DA choices would necessarily be indicative of their real activity choices.
5. Roychowdhury (2006, 361-362) expresses concern about whether models used to derive his REM measures are linear and uses the performance matching techniques of Kothari et al. (2005). However, Roychowdhury (2006) does not provide systematic evidence on the properties of traditional and performance-matched REM measures, which is the focus of our study.



null hypothesis (Type I error rates) and power of the test for both traditional and performance-matched

REM measures. Our primary tests are based on firms drawn from the “full sample” of observations,

that is, without regard to prior performance or any firm financial characteristics. However,

because managers’ choice of real activity levels is a function of their firms’ recent economic

performance and firm characteristics, we supplement our main tests using the “full sample” with

results for REM measures drawn from samples designed to capture extreme firm performance

(e.g., sales growth) or firm financial characteristics (e.g., firm size). The motivation for these

supplemental tests comes from Kothari (2001, 163) who stresses that “earnings management

studies almost invariably examine samples of firms that have experienced unusual performance.”

Our main findings are as follows. The traditional REM measures commonly used in the

earnings management literature are mis-specified in many of the settings we examine. For

example, mis-specification tends to be modest in samples drawn from the “full sample” of

observations where, while Type I error rates exceed the nominal (5%) significance level of the

test, they never exceed 15%. On the other hand, in samples of firms from the top or bottom

quartiles of size, book-to-market, and sales growth, Type I error rates (for a 5% test) often exceed

15%, and in some cases are much higher. Such evidence raises concerns about the validity of the

inferences drawn in prior earnings management studies that used these traditional REM measures.

Turning to performance-matched REM measures, while they are not well specified in each and

every setting, on balance, they tend to yield better-specified tests (i.e., lower Type I error rates)

when compared to the traditional REM measures.6

6. As discussed in section 5.2.2, conclusions about performance-matched REM measures are subject to the caveat that their lower Type I error rates vis-à-vis those of traditional REM measures may be due in part to the higher standard deviation exhibited by performance-matched REM measures.



Turning to the power of the test, the overall evidence does not yield a dominant approach

to measure REM. Stated differently, neither the traditional REM measures nor the performance-

matched REM measures consistently yield the most powerful test. Instead, the most powerful REM

measure varies depending on the type of real activity metric (e.g., abnormal SG&A, abnormal

CFO, etc.) and on the magnitude of abnormal real activity. A notable feature of the power of the

test results is that they rule out the concern that performance-matched REM measures sacrifice

power to achieve the better Type I error rates they exhibit versus the traditional REM measures.

Our study contributes to the earnings management literature in the following ways. As the

first study to systematically document the properties (i.e., Type I error rates and power of the test)

of the traditional REM measures used in prior studies, our evidence facilitates a keener

appreciation of the reliability of the inferences prior studies have drawn about REM. Second, our

comparison of the Type I error rates and power of the test for traditional and performance-matched

REM measures provides a useful guide to future researchers when evaluating the trade-offs

between Type I and Type II error rates to decide which REM measures to use in a specific setting.

Since the trade-off between Type I and Type II errors is researcher specific, the absence of a

dominant approach to generate measures of abnormal REM leads us to recommend that future

researchers report results using both traditional and performance-matched measures. This

approach will allow readers to assess the reliability of the inferences drawn about the magnitude

and significance of the abnormal REM documented in a given study. Based on our findings, it

appears that the two approaches (performance-matched and traditional) vary in their effectiveness

depending on sample characteristics. Therefore, it is important that future researchers also evaluate

their specific samples and benchmark them against the results we provide in Table 3, for example.



The remainder of the paper is organized as follows. Section 2 describes the REM measures

we study. Section 3 describes our research design. Section 4 reports preliminary results and section

5 our main results. Section 6 summarizes our sensitivity tests. Section 7 concludes.

2. Measuring Real Activity Management

2.1 Overview

Following Roychowdhury (2006), managers’ willingness to manipulate real activities to

achieve financial reporting objectives or to capture private benefits has become an active area of

earnings management research. The impetus for such research also comes from Graham, Harvey

and Rajgopal (2005) who report that surveyed managers would consider altering discretionary

expenditures, as well as take other real actions, to achieve financial reporting objectives.7 A key

feature of prior REM studies is that they invariably use the same or very similar REM measures.8

2.2 A Conceptual perspective on what drives expected real activities

Value-maximizing managers choose operating, investing and financing policies that

maximize firm value, where the specific policies they choose are a function of the firm’s

investment opportunity set (IOS). Smith and Watts (1992, 264) note that the IOS varies across

firms. The real activity measures that are the subject of REM research fall under operating policies.

Among other things, a firm’s operating policy choices relate to setting selling prices; credit terms

and cash discounts; production schedules/quantities; advertising outlays; R&D outlays; other

discretionary expenditures; (non-top management) employee compensation (salary, bonuses, etc.);

7. Prior to Graham et al. (2005) and Roychowdhury (2006), other authors had examined managers' willingness to alter real decisions such as R&D outlays (see Dechow and Sloan 1991; Baber et al. 1991; Bushee 1998; Bens et al. 2002), share repurchases (see Bens et al. 2003), asset sales (see Bartov 1993) and over-production (see Thomas and Zhang 2002) to achieve financial reporting objectives. Our point is simply that REM has become a more active area of earnings management research following Graham et al. (2005) and Roychowdhury (2006).
8. Exceptions are Cohen, Mashruwala and Zach (2011) and Eldenburg, Gunny, Hee and Soderstrom (2011). Srivastava (2019) argues traditional REM measures can be improved by benchmarking on life cycle stage and production technology.



the timing of gains from asset sales, and so on. Conceptually, the fundamental driver of the real-

activity measures used in REM-related research is firms’ IOSs. Since IOSs are firm specific,

expected real activity levels will differ across firms, even if they are in the same industry.

The discussion above has two implications for measuring expected (normal) real activities.

First, empirical models of expected (normal) real activities should ideally include variables designed to

capture the features of a firm’s IOS. Second, because the IOS is firm specific, estimation of

expected real activities should ideally be based on firm-specific models. With regard to the former,

while measures of IOSs are available (e.g., asset beta, PPE, Tobin’s q; see Skinner 1993), how

such measures drive a firm’s real activities is not well understood. As a result, to develop models

of expected real activities, prior REM studies (e.g., Roychowdhury 2006) rely instead on models

of the earnings/cash flow relation (see Dechow, Kothari, and Watts 1998) where the fundamental

driver of real activities is a firm’s sales level. With regard to the second point, while a model of

expected (normal) real activities should be firm-specific, in most accounting settings firm-specific

estimation is infeasible due to the relatively short time-series of annual or quarterly data that is

available. As a result, REM studies rely on cross-sectional estimation at the industry level. A well-

known problem with cross-sectional estimation is that such models exhibit low explanatory power

because they over-simplify the underlying economics of the relation.

2.3 Real activity measures common to the existing REM literature

Roychowdhury (2006, 344-45) developed three REM measures: abnormal cash flow from

operations, abnormal discretionary expenses, and abnormal production costs. Gunny (2010) built

on those, modifying them slightly to specify other measures of abnormal R&D, abnormal SG&A,

abnormal gains on asset sales, and abnormal production costs (see section 3.1 and the Appendix

for the details underlying estimation of all the REM measures used in our study). Most REM

studies use the measures in Roychowdhury (2006). Such reliance makes it all the more important


to document the specification of those measures with alternative approaches such as performance-

matching. The implicit assumption in prior REM studies is that all the traditional REM measures

are mean zero under the null hypothesis of no REM. However, there is no systematic evidence that

this is true, which raises the question of whether tests based on commonly used REM measures

can be relied upon to yield valid inferences about REM hypotheses. Our results provide systematic

evidence on the validity of these concerns.

2.4 Performance-matched REM measures

2.4.1 Motivation for performance-matching

Kothari (2001, 163) stresses that “earnings management studies almost invariably examine

samples of firms that have experienced unusual performance.” Relatedly, Skinner (1993, 420)

states that “… there is reason to believe that accounting procedure choice is related to how well or

badly firms are performing…” These observations motivate the need to isolate the effects of firm

performance on models that try to measure earnings management. Performance matching is one

way to achieve this. For example, in cases where the true model is unknown (perhaps linear for

some firms, but non-linear for others), performance matching can be beneficial because it does not

impose any particular functional form linking real activities to performance. Instead, the

premise underlying performance matching is simply that the impact of performance

on real activities is similar between a treatment firm and its matched control firm.

In addition, in cases where variables expected to drive real activities can only be measured

imprecisely or where empirical measures of theoretical constructs are unavailable, performance-

matching can be beneficial because it does not require the specification and measurement of

every conceivable variable expected to drive real activities. Instead, the idea behind

performance matching is simply that the impact of such variables is similar between a

treatment firm and its matched control firm. For these reasons, performance-matching



provides a viable way to control for both potential nonlinearities in the relation between real

activities and firm performance as well as for the effect of variables measured with error or

omitted entirely from a model of expected (normal) real activities. Indeed, performance

matching has proved successful in other contexts. For example, Kothari et al. (2005) find that it is

a reliable way to mitigate mis-specification in popular measures of DAs.

2.4.2 Implementation of performance-matching

While there are a number of ways to performance-match (e.g., return on sales, return on

equity) our choice of return on assets (ROA) builds on a point made by Skinner (1993, 421) that

“… firms’ recent accounting performance may be correlated with their IOS…” Given Skinner’s

observation and our discussion in section 2.2 about firms’ IOS being the fundamental driver of

real-activity levels, ROA seems like a natural choice. The specifics of our performance-matching

approach are as follows. For each abnormal REM measure (e.g., cash flow from operations,

discretionary expenses, production costs, R&D, SG&A, and gains on asset sales) we calculate a

performance-matched version for a given “treatment” firm in a given year by matching it to another

firm in the same two-digit SIC code whose ROA is within ±10%. The performance-matched REM

measure is the difference between the REM measures of the treatment firm and its matched control

firm. For example, performance-matched cash flow from operations is (where i denotes the

treatment firm and j the matched firm): PM_CFO_{i,t} = Ab_CFO_{i,t} − Ab_CFO_{j,t} (see section 3.1 and

the Appendix for the additional details underlying estimation of all REM measures).
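In code terms, the matching step just described can be sketched as follows. This is our illustrative sketch, not the authors' implementation; the column names (`firm`, `year`, `sic2`, `roa`, `ab_cfo`), the reading of "±10%" as 0.10 in absolute ROA units, and the tie-breaking rule (take the closest-ROA peer) are our assumptions.

```python
import pandas as pd

def performance_match(df):
    """For each treatment firm-year, find a control firm in the same
    two-digit SIC industry and year whose ROA is within +/-0.10, then
    difference the abnormal REM measures:
        PM_CFO = Ab_CFO(treatment) - Ab_CFO(control)."""
    out = []
    for (year, sic2), grp in df.groupby(["year", "sic2"]):
        for i, row in grp.iterrows():
            peers = grp.drop(index=i)
            # candidate controls: ROA within +/-0.10 of the treatment firm
            close = peers[(peers["roa"] - row["roa"]).abs() <= 0.10]
            if close.empty:
                continue  # no qualifying control firm this year
            # break ties by taking the closest-ROA peer as the control
            j = (close["roa"] - row["roa"]).abs().idxmin()
            out.append({"firm": row["firm"], "year": year,
                        "pm_cfo": row["ab_cfo"] - close.loc[j, "ab_cfo"]})
    return pd.DataFrame(out)
```

Analogous performance-matched measures (PM_DISC_EXP, PM_PROD, and so on) follow by swapping in the other abnormal REM measures.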

While performance-matching has potential benefits, it is not without limitations. Because it is

based on differencing two variables, the standard deviation of a performance-matched REM

measure will be roughly √2 (≈1.4) times that of an un-differenced REM measure (assuming the

two measures are independent with similar variances). As a result, performance-matched measures may sacrifice power (i.e., be more prone to Type II errors) relative to traditional

(un-differenced) REM measures. This is a non-trivial point because of the trade-off researchers


face between Type I and Type II errors in REM (and all research) settings. The simulations we

conduct provide systematic evidence on that tradeoff between specification and the power of the

test (i.e., between Type I and II errors) for traditional vs. performance-matched REM measures.
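The variance-inflation point above is easy to verify numerically. The following small simulation (our own sketch, assuming the treatment and control measures are independent with equal variances) shows the standard deviation of the differenced measure approaching √2 ≈ 1.41 times that of the un-differenced measure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# two independent "REM measures" with the same standard deviation
treat = rng.normal(0.0, 1.0, n)
control = rng.normal(0.0, 1.0, n)
diff = treat - control  # the performance-matched (differenced) measure

ratio = diff.std() / treat.std()
print(round(ratio, 2))  # close to sqrt(2) ~ 1.41
```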

3. Research design

3.1 Sample and data requirements

To calculate the REM measures of interest, we require annual financial statement data from

COMPUSTAT for the period 1986-2017. We retain all firm-year observations meeting data

availability requirements in a given year. We do not require data on all variables for all firms for

each year because doing so would introduce a severe survivorship bias. The traditional REM

measures used in prior research that we analyze are: abnormal cash flows from operations

(Ab_CFO); abnormal discretionary expenses (Ab_DISC_EXP); abnormal production costs

(Ab_PROD); abnormal R&D expenses (Ab_R&D); abnormal selling and general expenses

(Ab_SGA); and abnormal gains from sales of fixed assets (Ab_GAIN).

In addition to the traditional measures above, we analyze modified versions of Ab_CFO,

Ab_PROD, and Ab_DISC_EXP based on suggestions in Gunny (2010) and Vorst (2016). Since

the R2s of these modified real activity models are very similar to those of the traditional models, for

brevity, we do not report results for REM measures based on these modified models (available

upon request). Following prior research, we estimate the underlying models of expected (normal)

real activity using annual data at the two-digit SIC code level for all industries with at least 15

observations in a given year. To obtain abnormal REM measures we subtract the expected value

of each measure based on the expectation model from the reported COMPUSTAT value or the

value calculated using COMPUSTAT numbers (e.g., production costs = COGS + ΔINV).
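The estimation just described — fit a model of normal activity cross-sectionally within each two-digit SIC industry-year having at least 15 observations, then treat the residual as the abnormal measure — can be sketched as follows for CFO. The regressors follow the form of Roychowdhury's (2006) CFO model; the column names are our assumptions:

```python
import numpy as np
import pandas as pd

def abnormal_cfo(df):
    """Estimate normal CFO cross-sectionally for each two-digit SIC
    industry-year with at least 15 observations, in the spirit of
    Roychowdhury (2006):
        CFO_t/A_{t-1} = a0 + a1*(1/A_{t-1}) + a2*(S_t/A_{t-1})
                        + a3*(dS_t/A_{t-1}) + e_t
    The residual e_t is the abnormal (unexpected) CFO measure."""
    df = df.copy()
    df["ab_cfo"] = np.nan
    for (year, sic2), grp in df.groupby(["year", "sic2"]):
        if len(grp) < 15:
            continue  # skip thin industry-years, as in the literature
        X = np.column_stack([
            np.ones(len(grp)),               # intercept
            1.0 / grp["at_lag"],             # 1 / lagged total assets
            grp["sales"] / grp["at_lag"],    # scaled sales level
            grp["d_sales"] / grp["at_lag"],  # scaled sales change
        ])
        y = (grp["cfo"] / grp["at_lag"]).to_numpy()
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        df.loc[grp.index, "ab_cfo"] = y - X @ coef  # residual = abnormal CFO
    return df
```

The other abnormal measures (Ab_DISC_EXP, Ab_PROD, etc.) are obtained the same way, with the dependent variable and regressors swapped to the relevant expectation model.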

As stated above, we calculate a performance-matched version of each REM measure for a

given “treatment” firm in a given year by matching it to a firm in the same two-digit SIC code



whose ROA is within ±10%. The performance-matched measure is the difference between the

treatment firm’s REM measure and that of its performance-matched control firm.

In the REM literature it is common to winsorize the data to reduce the impact of extreme

data points. In the REM setting there are two points at which the data can be winsorized. First, the

data used to estimate the underlying models of expected (normal) real activities can be winsorized

(hereafter, pre-estimation). Based on our reading of the REM literature, most studies (except

Gunny 2010; and Vorst 2016) are silent about whether winsorization occurs at this stage.

Alternatively, winsorization can take the form of winsorizing the abnormal REM measures

themselves, which would be after the model of expected (normal) real activity was estimated

(hereafter, post-estimation). Based on our reading of the REM literature, it appears that roughly

half the studies we read winsorize at this stage.9 Given these differences, before we perform our

simulation analysis, we report and discuss summary statistics of the properties of REM measures

generated under all three winsorization approaches, namely, (i) pre-estimation; (ii) post-

estimation; and (iii) pre- and post-estimation winsorization. We then use the properties of the

summary descriptive statistics to decide which approach to base our simulation analysis on. As

discussed below, we adopt the first approach, namely pre-estimation winsorization (simulation

results based on the other approaches are reported in our section on sensitivity tests). We stress

that our intent here is not to argue in favor of or against any one approach, but simply to document

the sensitivity of our inferences to the choice.

3.2 Simulation procedures

The hypothesis tested in the typical REM study is that managers manipulated real activities

to boost reported earnings. Consistent with this we perform simulations for one-sided tests of the

9. For examples, see Alissa, Bonsall, Koharki and Penn (2013), Cheng, Lee and Shevlin (2016), Cohen, Dey, and Lys (2008), Kim and Park (2014), McGuire, Omer and Sharp (2012), McInnis and Collins (2011) and Zang (2012).



alternative hypothesis of positive (income-increasing) abnormal REM (we discuss results for two-

sided tests in our section on sensitivity tests). Our simulations assess proper test specification under

the null hypothesis by documenting each REM measure’s Type I error rate. Subsequent

simulations document the power of the test for each REM measure to detect abnormal REM when

it is present (i.e., has been seeded) in the data. Simulations are based on 1,000 random samples of

100 firm-year observations drawn without replacement. Our primary tests are based on simulations

where samples are drawn from the “full sample” of firms (i.e., all firm-years).
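The sampling loop can be sketched as follows (our illustration, not the authors' code; the hypothetical `rejection_rate` helper draws repeated samples without replacement and applies a one-sided t-test of zero mean abnormal REM at the 5% level):

```python
import numpy as np

def rejection_rate(measure, n_samples=1000, sample_size=100, seed=0):
    """Draw repeated random samples without replacement from the population
    of firm-year REM measures and record how often a one-sided t-test
    rejects H0: mean = 0 in favor of H1: mean > 0 at the 5% level."""
    rng = np.random.default_rng(seed)
    t_crit = 1.6604  # ~95th percentile of t with df = sample_size - 1 = 99
    rejections = 0
    for _ in range(n_samples):
        s = rng.choice(measure, size=sample_size, replace=False)
        t = s.mean() / (s.std(ddof=1) / np.sqrt(sample_size))
        rejections += t > t_crit
    return rejections / n_samples
```

Under the null (a zero-mean population) the returned rate should sit near the nominal 5%; rates well above or below that signal mis-specification.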

In supplemental tests we report Type I error rates for sub-samples formed on the basis of

recent firm performance (e.g., sales growth) or firm characteristics (e.g., firm size). The sub-

samples are designed to capture features common to samples in earnings management studies.

Such samples are often characterized by large and/or small firms; firms that are value or glamour

stocks; and/or firms exhibiting momentum in recent performance. Our market value of equity

(MVE) sample partition captures the firm size characteristic; our earnings-to-price (E/P) and book-

to-market (B/M) partitions capture value/glamour firms; while sales growth captures momentum,

or the lack thereof, in recent financial performance. These choices are not meant to exhaust all

possible scenarios, but rather to reasonably characterize settings similar to those encountered in

earnings management studies. Our supplemental tests on these sub-samples are similar to the

approach used in other studies of the properties of alternative measures of earnings management

(e.g., discretionary accruals). For example, Dechow et al. (1995) examine extreme earnings and

cash flow performance, Kothari et al. (2005) examine extreme operating cash flow, book-to-

market, sales growth, earnings-to-price, and size, and Dechow, Hutton, Kim and Sloan (2012)

examine extreme earnings growth, cash flow, size, and sales growth.



3.2.1 Simulations assessing test specification under the null hypothesis (Type I error rates)

For each of the 1,000 random samples we construct, we compute the mean value of each

abnormal REM measure and then tabulate the frequency with which the null hypothesis of zero

mean abnormal REM is rejected based on a t-test with a nominal significance level of 5%. To assess

departures from the nominal significance level of the test we construct a 95% confidence interval

for the (theoretical) nominal significance level, which for 1,000 samples is 3.65% to 6.35% for a

nominal significance level of 5%. If the observed rejection rate falls above (below) the upper

(lower) bound of this interval, the test is mis-specified in that it is rejecting the null hypothesis too

frequently (infrequently). Such cases are evidence that the REM measure is mis-specified and

biased against (in favor of) the null hypothesis. To save space, we do not tabulate or discuss in the

text the results for all possible combinations of simulated settings. Instead, section 5 presents the

results for a baseline set of simulations and we summarize the results of variations from these

baseline results in section 6 (all unreported results are available upon request).
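The 3.65% to 6.35% bounds follow from the normal approximation to the binomial distribution of the rejection count: with 1,000 independent samples and a true rejection probability of 5%, the standard error of the observed rejection rate is √(0.05 × 0.95 / 1,000) ≈ 0.69%, and the 95% confidence interval is 5% ± 1.96 standard errors. A quick check:

```python
import math

p, n = 0.05, 1000  # nominal significance level, number of simulated samples
se = math.sqrt(p * (1 - p) / n)          # binomial standard error
lo, hi = p - 1.96 * se, p + 1.96 * se    # 95% confidence interval
print(f"{lo:.2%} to {hi:.2%}")           # prints: 3.65% to 6.35%
```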

As noted above, our primary tests report the Type I error rates for all REM measures drawn

from the “full sample” of observations (i.e. from all firms). Then, in supplemental tests, we report

Type I error rates for the sub-sample partitions reflecting recent firm performance or firm

characteristics. The sub-samples consist of firms with high vs. low earnings/price (E/P) ratios, high

vs. low book-to-market (B/M) ratios, high vs. low recent sales growth, and large vs. small firms

(market value of equity). We construct sub-samples by annually ranking all firm-year observations

on the basis of each partitioning variable. For each partitioning variable we pool observations

across all sample years (1986-2017) and then draw 1,000 random samples of 100 firms each from

the top and bottom quartiles of each partitioning variable. We then test whether the mean of the

REM measure is significantly different from zero.
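The sub-sampling procedure can be sketched as follows; this is a simplified illustration with hypothetical variable names that ranks within each year, pools the tail quartiles across years, and draws random samples from each tail:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_quartile_samples(values, years, n_draws=1000, sample_size=100):
    """Rank firm-years on a partitioning variable within each year, pool
    the top and bottom quartiles across all years, and draw random
    samples of `sample_size` firm-years from each tail."""
    values, years = np.asarray(values), np.asarray(years)
    top, bottom = [], []
    for yr in np.unique(years):
        idx = np.flatnonzero(years == yr)
        v = values[idx]
        q1, q3 = np.quantile(v, [0.25, 0.75])
        bottom.extend(idx[v <= q1])   # lower-quartile firm-years
        top.extend(idx[v >= q3])      # upper-quartile firm-years
    top, bottom = np.array(top), np.array(bottom)
    top_draws = [rng.choice(top, sample_size, replace=False)
                 for _ in range(n_draws)]
    bottom_draws = [rng.choice(bottom, sample_size, replace=False)
                    for _ in range(n_draws)]
    return top_draws, bottom_draws
```

Each draw then feeds the same t-test used for the full-sample simulations.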

3.2.2 Simulations assessing the power of tests to detect abnormal REM

To assess the power of tests, we use the same 1,000 samples described above where we

“artificially” induced (i.e., seeded) abnormal real activity performance into the underlying raw real

activity variables (e.g., CFO) before estimating the models of expected (normal) real activity. We

vary the ‘seed’ from 1% to 10% of lagged total assets. For example, for a given firm i in year t,

the revised level of CFO is CFO*_{i,t} = CFO_{i,t} + p·AT_{i,t-1}, where AT denotes total assets and p varies from 1% to 10%. Seeding

abnormal REM into the raw data allows us to evaluate the power of the test exhibited by alternative

REM measures to detect abnormal REM when it is in fact present in the data.
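A stylized version of the power calculation, assuming the REM measure is already deflated by lagged assets (so seeding p·AT_{i,t-1} into the raw variable simply adds p to the deflated measure) and, for simplicity, skipping the re-estimation of the expectation model that the actual simulations perform:

```python
import numpy as np

rng = np.random.default_rng(1)
T_CRIT = 1.660  # one-sided 5% critical value of Student's t, 99 df

def power_at_seed(abn_rem, p, n_draws=1000, sample_size=100):
    """Fraction of `n_draws` random samples in which a one-sided t-test
    rejects H0: mean abnormal REM = 0 after seeding abnormal REM of
    p (a fraction of lagged total assets) into each observation."""
    abn_rem = np.asarray(abn_rem)
    rejections = 0
    for _ in range(n_draws):
        idx = rng.choice(len(abn_rem), sample_size, replace=False)
        seeded = abn_rem[idx] + p           # seed the abnormal activity
        t = seeded.mean() / (seeded.std(ddof=1) / np.sqrt(sample_size))
        rejections += t > T_CRIT
    return rejections / n_draws
```

At a seed of zero the function returns the Type I error rate; at positive seeds it traces out a power curve.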

4. Preliminary results

4.1 Distributional properties of alternative abnormal REM measures

Table 1 presents univariate descriptive statistics for all abnormal REM variables measured

at the individual firm level with performance-matched REM measures signified by a PM prefix.

As discussed above, we report separate summary statistics for three different winsorization

approaches: (i) pre-estimation winsorization; (ii) post-estimation winsorization; and (iii) pre- and

post-estimation winsorization. Means or medians that are statistically different from zero appear

in bold (or italic) font. “Stars” denote significant differences between means and medians under

the pre-estimation winsorization scheme and those under the other two winsorization schemes.

[Insert Table 1 here]

Since the abnormal REM measures (traditional and performance-matched) under pre-

estimation winsorization are residuals from first-stage regression models their means are zero by

construction. While medians of the traditional REM measures are generally different from zero,

they are indistinguishable from zero for the performance-matched versions. Turning to the results

under different winsorization approaches, for the traditional REM measures, there are differences

between the means and medians under the different winsorization schemes. For example, the

average Ab_CFO under the post-estimation winsorization is 0.023, which is significantly different

from the average of zero under the pre-estimation approach. Similar differences occur between all

traditional measures under the post-winsorization scheme, while there does not seem to be any

difference when we examine the performance-matched measures. As for the pre/post-

winsorization scheme, there does not seem to be any difference between its means and medians

and those of the pre-estimation winsorization. Turning to the standard deviations, REM measures

of discretionary expenses by far exhibit the largest variation. For example, under pre-estimation

winsorization, σ(Ab_DISC_EXP) is 0.95 while σ(PM_DISC_EXP) is 1.23. As expected, since the

performance-matched REM measures are constructed by differencing the REM measures of a

treatment and control firm, the standard deviations of performance-matched measures are higher

than those of traditional REM measures. As noted above, a potential implication of a performance-

matched REM measure’s higher standard deviation is that better specified tests (i.e., lower Type I

error rates) using performance-matched measures may come at a cost of reduced power to detect

abnormal REM.
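For concreteness, differencing against a matched control firm can be sketched as below. We match on a generic performance variable within a group (e.g., an industry-year), in the spirit of Kothari et al. (2005); the function and matching choice are illustrative rather than the exact procedure detailed in the Appendix:

```python
import numpy as np

def performance_match(rem, perf, groups):
    """Within each group (e.g., an industry-year), pair each firm with
    the firm whose performance measure is closest and difference their
    REM values: PM_REM = REM(treatment) - REM(matched control)."""
    rem, perf, groups = map(np.asarray, (rem, perf, groups))
    pm = np.full(len(rem), np.nan)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        if len(idx) < 2:
            continue  # no candidate control firm in this group
        for i in idx:
            others = idx[idx != i]
            ctrl = others[np.argmin(np.abs(perf[others] - perf[i]))]
            pm[i] = rem[i] - rem[ctrl]
    return pm
```

The differencing is what drives the higher standard deviation of the performance-matched measures discussed above.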

Overall, the summary descriptive statistics reveal that the distributions of the commonly-

used REM measures exhibit means close to zero when pre-estimation winsorization is used.

However, averages of various REM measures tend to become more extreme when samples are

formed on the basis of recent firm performance or firm financial characteristics (un-tabulated).

The simulation evidence reported next, which calibrates the Type I error rates of these REM

measures, provides systematic evidence as to the reliability of inferences drawn in prior REM

studies that have employed the traditional REM measures.10

10
Unreported summary statistics for sub-samples formed on the basis of recent firm performance or firm financial
characteristics reveal that except for Ab_GAIN, and to a lesser degree Ab_R&D, mean and median values tend to be
non-zero and very extreme in most cases.

5. Simulation Results

5.1 Overview

We first report simulations that assess test specification under the null hypothesis (Type I

error rates). We then report simulations that assess the power of the test to detect abnormal REM

when it has been seeded into the underlying data. The key aspect of the simulations assessing test

specification under the null hypothesis is that in random samples, where firms are selected without

regard to any prediction about managerial incentives to manage real activities, the expected value

of each REM measure should be zero. Evidence that a given REM measure is biased in favor of

the alternative hypothesis (i.e., against a “true” null) would lead to the conclusion that researchers

should avoid using such measures or face the risk of making a Type I error by concluding they

have documented a significant “treatment” effect when in fact they have not.

Simulations assessing the power of tests to detect abnormal REM are designed to provide

systematic evidence on the trade-off between bias reduction and power across the various REM

measures. A maintained assumption underlying the power of the test analysis and comparisons is

that each REM measure is well-specified under the null hypothesis. So long as a given REM

measure (traditional or performance-matched) is well-specified under the null hypothesis (i.e., it

exhibits an acceptable Type I error rate), the power of the test simulations can shed meaningful

light on the degree to which that REM measure will be able to detect a given level of abnormal REM

when it is present in the data. We seed levels of abnormal REM ranging from 1% to 10%, and then

tabulate the percent of times out of 1,000 samples where the null hypothesis of zero REM is

rejected at a given level of abnormal REM (i.e., the power of the test).

Before reporting the results, we provide a brief roadmap for the baseline simulations

reported below (see section 6 for sensitivity tests where we vary some of the choices below):

1) Traditional REM Measures: Ab_CFO, Ab_PROD, Ab_DISC_EXP, Ab_R&D, Ab_SGA,
and Ab_GAIN (see Appendix for details).

2) Performance-matched REM Measures: A performance-matched version of each


traditional REM measure listed above, denoted: PM_CFO, PM_PROD,
PM_DISC_EXP, PM_R&D, PM_SGA, PM_GAIN (see Appendix for details).

3) Sample composition: Primary tests are based on the “full sample” (i.e., all firms).
Supplemental tests use sub-samples that reflect recent firm performance or firm
financial characteristics. These are defined as the lower and upper quartiles of: book-
to-market ratios, past sales growth, earnings-to-price ratios, and market value of equity.

4) Hypothesis Test: A one-sided test of the alternative hypothesis that mean REM is
positive. That is, where REM is hypothesized to have been income increasing.

5) Nominal significance level of the test: 5%.

We begin by reporting (section 5.2) rejection rates (i.e., Type I error rates) for a one-tailed

test where the alternative hypothesis is of income-increasing REM. Next (section 5.3), we report

results for the power of the test. For ease of interpretation in the tables below, rejection rates that

are significantly less than the nominal significance level of the test (i.e., conservative tests) appear

in italics, while those significantly greater than the nominal significance level of the test (i.e.,

which reject a true null hypothesis too often) appear in bold.

5.2 Test specification under the null hypothesis (Type I error rates)

5.2.1 Tests based on the “full sample” (all firms)

Table 2 reports rejection rates (Type I error rates) for the null hypothesis that the mean

REM in a given sample is zero against a one-tailed alternative hypothesis of positive REM. The

earnings management setting modeled here is one where managers engaged in income-increasing

REM to achieve a financial reporting objective such as meeting or beating an earnings threshold

(e.g., reporting a profit instead of a loss). Table 2’s simulations are based on samples drawn from

the “full sample” (all firms), that is, without regard to any firm characteristic. For comparison

purposes, results are reported under different winsorization schemes: (i) pre-estimation; (ii) post-

estimation; and (iii) pre- and post-estimation winsorization.

We use two approaches to assess whether the observed rejection frequencies exhibit

evidence of mis-specification. Under the first, we (objectively) compare rejection frequencies with

the lower (upper) bound of 3.65% (6.35%) for the 95% confidence interval for the 5% nominal

significance level of the test. Second, we apply a subjective threshold of 15% to define severe

misspecification to mimic a potential researcher’s subjective choice about tolerable Type I error.11
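These two screens can be expressed as a small helper function (the bounds are those derived in section 3.2.1; the 15% cutoff is the subjective threshold just described):

```python
def classify_rejection_rate(rate, lower=0.0365, upper=0.0635, severe=0.15):
    """Classify an observed rejection rate for a 5% nominal test using
    the 95% confidence bounds plus the 15% severity cutoff."""
    if rate > severe:
        return "severely mis-specified (over-rejects)"
    if rate > upper:
        return "mis-specified (over-rejects)"
    if rate < lower:
        return "mis-specified (conservative)"
    return "well-specified"
```

Applying this to the tabulated rejection rates reproduces the bold/italic coding used in the tables.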

[Insert Table 2 here]

We first discuss the results for the traditional REM measures used in prior studies. These

results appear in the shaded rows of Table 2 and reveal evidence of mis-specification in that none

of the traditional REM measures is consistently well specified under the null hypothesis. For

example, four measures yield tests that over-reject the null hypothesis (Ab_CFO, AB_DISC_EXP,

AB_R&D, and Ab_SGA) with an average rate of 9.65% (compared to the test’s upper bound of

6.35%) while the other two (Ab_PROD and Ab_GAIN) yield conservative tests based on an

average rejection frequency of 1.80% (compared to the test’s lower bound of 3.65%). Using the

15% cutoff to define severe mis-specification would lead one to conclude that in random samples

drawn from the “full sample” (i.e., all firms), traditional REM measures are not severely mis-

specified.

A notable feature of the results using the traditional REM measures is that, based on the

behavior of the rejection rates across the columns of Table 2, the findings are similar regardless of

the winsorization approach. Thus, to the extent mis-specification exists, it does not seem to be

11
To be clear, we are not recommending 15% as a tolerable Type I error rate; it is more than double the 6.35%
upper bound of the 95% confidence interval and, in our view, represents extreme misspecification.

induced by the choice of when to winsorize. In sum, the results for the traditional REM measures

suggest that misspecification exists, but is not extremely severe.

The question at this point is whether performance-matched REM measures (the non-shaded

cells in Table 2 and directly below their corresponding traditional REM measure) yield more

reliable inferences than traditional measures. From an overall perspective, Table 2’s rejection rates

indicate that performance-matched REM measures are better specified than the traditional REM

measures. Specifically, performance-matched versions of the traditional REM measures that

tended to yield over (under)-rejections, consistently decrease (increase), and more importantly, fall

within the 3.65% and 6.35% bounds. For example, focusing on the pre-estimation winsorization

results, the average rejection rate across the four performance-matched measures whose

corresponding traditional REM versions exhibited over-rejection is 5.33%, compared to 9.65% for

the corresponding traditional measures. In addition, the average rejection rate across the two

performance-matched measures whose corresponding traditional versions exhibited evidence of

under-rejection is 4.80%, compared to 1.80% for the corresponding traditional measures. Thus, on

balance, the evidence indicates that not only do performance-matched REM measures tend to mitigate

the mis-specification of their corresponding traditional REM measures, but they also typically

yield better specified tests. This inference holds regardless of the winsorization approach.

Since Table 2’s results do not vary depending on the winsorization approach, and because

the pre-estimation winsorization REM measures exhibit more desirable univariate properties (see

Table 1), all of the remaining analysis reported in the paper (i.e., all remaining tables and figures)

use REM measures based on pre-estimation winsorization.

5.2.2. Tests based on sub-samples formed on the basis of past performance or firm characteristics

Table 2’s results were based on random samples constructed from the entire population

(i.e., “All Firms” samples). We next provide evidence on how REM measures perform in samples

where firms are not randomly drawn from the entire population of firms, which is more like the

typical earnings management setting. Table 3 reports results for sub-samples of firms with: high

vs. low earnings/price (E/P) ratios, high vs. low book-to-market (B/M) ratios, high vs. low sales

growth, and large vs. small firms (market value of equity). The motivation for this analysis is that

samples in earnings management studies are often characterized by firms with extreme

performance and/or financial characteristics. Our analysis of sub-samples follows the approach

taken in Dechow et al. (1995), Kothari et al. (2005), and Dechow et al. (2012).

Panel A of Table 3 reports summary statistics for the traditional and performance-matched

REM measures for each of the various sub-samples. Means and standard deviations are measured

across the 1,000 samples of 100 firms each used in the simulations. Panel B reports each REM

measure’s Type I error rates in the full sample (“All Firms”) and in the sub-samples we study.

[Insert Table 3 here]

Panel A confirms the notion that the standard deviations of performance-matched variables

are higher than those of traditional measures, similar to the statistics reported in Table 1. To briefly

preview the findings for the Type I error rates of the various REM measures reported in Panel B,

no set of REM measures, traditional or performance-matched, is consistently well-specified across

sub-samples of recent firm performance and/or financial characteristics. Stated differently, no set

of REM measures, traditional or performance-matched, consistently exhibits acceptable Type I

error rates in these sub-samples. This finding has two important implications. First, it prevents us

from analyzing the power of the test in these sub-samples because the power of the test is only

meaningful for well-specified tests. Thus, our power of the test analysis will be restricted to

random samples constructed from the entire population (i.e., “all firm” samples). The second

implication is that this finding supports the main conclusion of our study: there is no dominant

approach to measuring abnormal REM. This leads us to recommend that future researchers report

results using both traditional and performance-matched REM measures so that readers can assess

the reliability of the inferences drawn about the magnitude and significance of the abnormal REM

documented in a given study.

Turning to the simulation results for the specification of REM measures in the subsamples,

we first focus on the rejection frequencies (Type I error rates) for the upper quartile of each partition.

In those cases, traditional REM measures exhibit a high degree of misspecification in the size, book-

to-market (B/M), and earnings-price (E/P) sub-samples. For example, in the upper B/M quartile,

five of the traditional REM measures exhibit rejection frequencies well above the 15% level. The

average rejection frequency for these five REM measures is 64.04%. Similar results are observed

in the high E/P and large firm quartiles. For example, in the high E/P quartile and large firm

quartile, four of the traditional REM measures have rejection frequencies far above the 15%

threshold (average rejection frequencies are 63.45% for high E/P and 58.23% for large firms). In

the high sales growth sub-sample, only Ab_SGA is grossly mis-specified.

Turning to the results for the lower quartile of each partition reveals evidence of over-

rejection in the low sales growth sub-sample. Specifically, two of the traditional REM measures

(Ab_PROD and Ab_DISC_EXP) have rejection rates exceeding 15% (average rejection rate for

these two is 22.5%). On the other hand, in some sub-samples such as low B/M, we observe a high

degree of under-rejections of the null, that is, very low rejection rates. For example, in the low

B/M subsample, the average rejection frequency of traditional REM measures is 0.64%.

Turning to the findings for the performance-matched REM measures reveals that, in

general, they tend to reduce, but not eliminate the misspecification in the traditional REM

measures. More specifically, rejection frequencies tend to decline in cases where the

corresponding traditional REM measure had experienced over-rejection. For example, in the high

B/M quartile, where the average rejection rate was 64.04% across the four traditional REM

measures that exhibited high over-rejection rates, the average rejection rate for the corresponding

performance-matched REM measures declines to 25.98%. While performance matching tends to

reduce the problem of over-rejection by the traditional REM measures, the results clearly show

that performance-matching does not cure mis-specification in all cases. Finally, performance-

matching also seems to correct the under-rejection problem associated with some of the traditional

REM measures; see, for example, the results for the low B/M sub-sample. However, and to be clear,

the results reveal that the performance-matched versions of the traditional REM measures do not

completely eliminate the under-rejection tendency of some of the traditional REM measures in

these sub-samples.

In summary, our results reveal that no REM measure (traditional or performance-matched)

is well-specified in each and every setting. That said, on balance, the rejection frequencies of

performance-matched REM measures are generally lower (i.e., less mis-specified) than those of

the corresponding traditional REM measures in both the “full sample” results (i.e., all firms) and

in the sub-samples of firms with extreme performance or financial characteristics. Moreover, while

there is a tendency for all REM measures to over-reject the null hypothesis, on balance,

performance-matched REM measures are less affected by over-rejection than are the traditional

REM measures. While, in some cases, under-rejection of the null hypothesis is also a problem for

traditional as well as performance-matched REM measures, on balance, the latter REM measures

are slightly less affected by under-rejections than are the traditional measures. Conclusions about

performance-matched REM measures based on their Type I error rates are subject to the following

caveat. An alternative explanation for the lower over-rejection rates of performance-matched REM

measures relative to their traditional REM counterpart is, holding all else constant, that the higher

standard deviation of performance-matched REM measures (see Panel A of Table 3) will lead to

fewer rejections of the null hypothesis because it inflates the standard error and hence shrinks the

t-statistic.

5.3 The power of the test to detect abnormal REM

5.3.1 Overview

Since performance-matched REM measures are based on differencing two variables, the

standard deviation of a performance-matched REM measure will be at least 1.4x that of the

underlying un-differenced traditional REM measure. As a result, a concern with performance-

matched REM measures is that they may sacrifice power compared to traditional (un-differenced)

REM measures. The power of test results we report in this section provide evidence on whether

better specification of performance-matched REM measures under the null hypothesis in various

settings comes at the cost of lower power to detect abnormal REM. Given the mis-specification

plaguing both traditional and performance-matched REM measures in the sub-sample results

reported in Table 3, we do not analyze the power of the test in these sub-samples. The reason is that

the power of the test is only meaningful for well-specified tests. As a result, our power of the test

analysis is based on random samples constructed from the “full sample” (i.e., “all firms”).
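The roughly 1.4x figure follows from the variance of a difference of two random variables:

```latex
\sigma^{2}_{\mathrm{PM}}
  = \operatorname{Var}\!\left(REM^{\mathrm{treat}} - REM^{\mathrm{ctrl}}\right)
  = \sigma^{2}_{\mathrm{treat}} + \sigma^{2}_{\mathrm{ctrl}}
    - 2\,\rho\,\sigma_{\mathrm{treat}}\,\sigma_{\mathrm{ctrl}}
```

With equal variances and uncorrelated treatment and control measures (ρ = 0), the standard deviation of the performance-matched measure is √2·σ ≈ 1.41σ; the factor exceeds √2 when ρ < 0 and falls below it when the matched pair is positively correlated.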

5.3.2 Power of the test to detect abnormal REM for alternative REM measures

To save space, and to more clearly illustrate the power of the test of the various REM

measures, instead of reporting tables of detailed rejection rates, we use figures to plot the power

curves of the various REM measures across seeded levels of abnormal REM ranging from 1% to

10% (below each figure we report the numerical values of the rejection rates that underlie the

corresponding power curves appearing in the figure). If a given performance-matched REM

measure sacrifices power relative to its corresponding traditional REM counterpart, the power

curve of the former will lie below that of the latter across all levels of seeded abnormal REM.

Figures 1-6 plot the power curves for traditional and performance-matched REM measures of

abnormal cash flow from operations, abnormal discretionary expenses, abnormal production costs,

abnormal R&D, abnormal SG&A and abnormal gains, respectively.

Examination of Figures 1-6 reveals the following. In Figure 1, for abnormal operating cash

flows, the power curves exhibit lower power for the performance-matched REM measure at lower

seed levels. However, the pattern flips at higher seed levels. Turning to abnormal discretionary

expenses in Figure 2, across all levels of seeded abnormal REM, the performance-matched

measure exhibits more power than its traditional REM counterpart. The power curves in Figures 3

and 5, for abnormal production and abnormal SG&A respectively, show a slight advantage for the

performance-matched REM measure at lower seed levels, but higher power for the traditional

measures for abnormal production and abnormal SG&A at seed levels of 3% or more of lagged

total assets. A similar pattern emerges in Figures 4 and 6, for abnormal R&D and abnormal gains

respectively, except that the higher power for the corresponding traditional measure kicks in for

abnormal R&D and abnormal gains at seed levels of 1% or more (instead of 3% for abnormal

production and SG&A).

In summary, on balance, at plausible levels of seeded abnormal REM of between 1% to

3%, traditional and performance-matched REM measures do not exhibit major differences in

power to detect abnormal REM. In other words, the power of the test results do not reveal that one

approach to measure REM (traditional vs. performance-matched) systematically dominates the

other. However, the power of the test results do indicate that the tendency for performance-

matched measures to be better specified under the null hypothesis (i.e., to exhibit better Type I

error rates than their corresponding traditional REM measure; see Table 2) does not come at the

cost of systematically lower power to detect plausible levels of abnormal REM of 1% to 3%. Stated

differently, the improved test specification (i.e., Type I error rates) exhibited by performance-

matched measures will not impede their ability to detect plausible levels of abnormal REM of 1%

to 3%.

Overall, the empirical evidence in Tables 2 and 3 and Figures 1-6 indicates that no single

REM measure (traditional or performance-matched) is immune from mis-specification in each and

every setting. That said, in terms of Type I error rates (i.e., test specification), performance-

matched REM measures are somewhat less prone to falsely reject a true null hypothesis (especially in samples of

firms that exhibit extreme performance or firm financial characteristics).12 Finally, the power of

the test results do not reveal that one approach to measure REM (traditional vs. performance-

matched) systematically dominates the other across a wide range of simulated settings.

6. Additional tests

We performed a battery of sensitivity tests by varying the choices underlying the baseline

simulations described in section 5.1. This section summarizes the findings (un-tabulated and

available upon request). The results reported above are for six REM measures that have been

traditionally used in the literature (Ab_CFO, Ab_PROD, Ab_DISC_EXP, Ab_R&D, Ab_SGA, and

Ab_GAIN). We also analyzed modified versions of Ab_CFO, Ab_PROD, and Ab_DISC_EXP

based on specifications developed in Gunny (2010) and Vorst (2016). Results based on these

modified REM measures yield inferences similar to those of our main tests.

12
Conclusions about performance-matched REM measures based on their Type I error rates are subject to the caveat
stated in section 5.2.2.

Our main tests were based on pre-estimation winsorization of the underlying data. Using

the other two winsorization approaches discussed in section 3.1 (i.e., post-estimation or a

combination of pre- and post-estimation) also yield inferences similar to those of our main tests.

Our main tests focused on a one-tailed alternative hypothesis of income-increasing REM. We also

tested the null hypothesis of zero REM against a two-tail alternative hypothesis of non-zero REM.

The alternative hypothesis of interest to a researcher here is simply that REM is non-zero,

implying that managers engaged in some real earnings management, irrespective of its direction.

In simulations using the “full sample” (i.e., all firms), we continue to find some evidence of over-

rejection across all REM measures, although the degree of over-rejection is not severe (average

rejection rate is 7.2%). It is also the case that rejection frequencies are attenuated when

performance-matched REM measures are used. Finally, our main tests used a nominal significance

level of 5%. None of the conclusions of our main tests change if a 1% significance level is used.

7. Conclusions

The use of measures of real earnings management (REM) to test hypotheses related to

earnings management has become commonplace in the literature. Surprisingly, there is no

systematic evidence on the properties of REM measures commonly-used in the earnings

management literature. We provide such evidence by documenting the Type I error rates and

power of the test of REM measures commonly used in the literature, as well as corresponding

performance-matched versions of these measures.

Our main findings are the following. While performance-matched REM measures are not

immune from mis-specification in all settings, in general, and subject to the caveat that their

standard deviations are higher than those of their traditional REM counterpart, performance-

matched REM measures tend to be better specified under the null hypothesis. Comparisons of the

power to detect plausible levels (1% to 3% of total assets) of abnormal REM reveal that neither
approach, traditional or performance-matched, is consistently more powerful than the other in

terms of detecting abnormal REM. The absence of a dominant approach to measure abnormal

REM leads us to recommend that future researchers report results using both traditional and

performance-matched measures so that readers are able to clearly assess the reliability of the

inferences drawn about the magnitude and significance of the abnormal REM documented in a

given study.

Our study contributes to the earnings management literature in the following ways. As the

first study to systematically document the properties (i.e., Type I error rates and power of the test)

of the traditional REM measures used in prior studies, our evidence facilitates a keener

appreciation of the reliability of the inferences prior studies have drawn about REM. Second, our

comparison of the Type I error rates and power of the test for traditional and performance-matched

REM measures provides a useful guide to future researchers when evaluating the trade-offs

between Type I and Type II error rates to decide which REM measures to use in a given specific

setting. A fruitful avenue for future research would be to develop better theoretical and empirical

models of expected (normal) real activities.

Appendix

Real Earnings Management (REM) Variable Definitions and Measurement Procedures

We obtain abnormal REM measures by subtracting the expected value of each REM measure based on the
underlying expectation model from the actual value of the real activity measure (e.g., cash flow from
operations, R&D, SG&A, etc.) reported on COMPUSTAT (or the value calculated using COMPUSTAT
data, e.g., production costs = COGS + ΔINV). Following prior research, we estimate model parameters
using annual data at the two-digit SIC code for all industries with at least 15 observations in a given year.
REM expectation models and the resulting abnormal REM measures are:

A. REM measures used in prior REM research:

1) Ab_CFO is abnormal cash flow from operations (see Roychowdhury, 2006), computed by estimating the
following model of expected CFO (by industry and year):

CFO_{i,t} / Assets_{i,t-1} = k_0 + k_1 (1 / Assets_{i,t-1}) + k_2 (SALES_{i,t} / Assets_{i,t-1}) + k_3 (ΔSALES_{i,t} / Assets_{i,t-1}) + ε_{i,t}    (A.1)

where CFO is cash flow from operations, SALES is annual sales and Assets is total assets. Ab_CFO are the
residuals from model (A.1).
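As an illustration, the residuals from model (A.1) can be computed with ordinary least squares by industry-year as follows (variable names are ours; this is a sketch rather than our exact estimation code):

```python
import numpy as np

def ab_cfo(cfo, sales, dsales, lag_assets, industry, year, min_obs=15):
    """Ab_CFO per model (A.1): within each industry-year cell with at
    least `min_obs` firms, regress CFO/Assets on an intercept,
    1/Assets, SALES/Assets, and dSALES/Assets; keep the residuals."""
    cfo, sales, dsales, lag_assets = map(
        np.asarray, (cfo, sales, dsales, lag_assets))
    y = cfo / lag_assets
    X = np.column_stack([np.ones_like(y), 1.0 / lag_assets,
                         sales / lag_assets, dsales / lag_assets])
    resid = np.full(len(y), np.nan)
    for key in set(zip(industry, year)):
        idx = np.flatnonzero([(i, t) == key
                              for i, t in zip(industry, year)])
        if len(idx) < min_obs:
            continue  # too few observations to estimate the model
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        resid[idx] = y[idx] - X[idx] @ beta
    return resid
```

The same pattern applies to models (A.2) through (A.6) with the appropriate regressors.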

2) Ab_DISC_EXP is abnormal discretionary expenses (see Roychowdhury, 2006), computed by estimating


the following model of expected DISC_EXP (by industry and year):

DISC_EXP_{i,t} / Assets_{i,t-1} = k_0 + k_1 (1 / Assets_{i,t-1}) + k_2 (SALES_{i,t-1} / Assets_{i,t-1}) + ε_{i,t}    (A.2)

where DISC_EXP is discretionary expenses during the year defined as the sum of advertising, R&D, and
SG&A expenses, and all other variables are as previously defined. Ab_DISC_EXP are the residuals from
model (A.2).

3) Ab_PROD is abnormal production costs (see Roychowdhury, 2006), computed by estimating the
following model of expected PROD (by industry and year):

PROD_{i,t} / Assets_{i,t-1} = k_0 + k_1 (1 / Assets_{i,t-1}) + k_2 (SALES_{i,t} / Assets_{i,t-1}) + k_3 (ΔSALES_{i,t} / Assets_{i,t-1}) + k_4 (ΔSALES_{i,t-1} / Assets_{i,t-1}) + ε_{i,t}    (A.3)

where PROD is production costs defined as the sum of costs of goods sold (COGS) and change in inventory
during the year, and all other variables are as previously defined. Ab_PROD are the residuals from model
(A.3).

4) Ab_R&D is abnormal research and development costs (see Gunny, 2010), computed by estimating the
following model of expected R&D (by industry and year):

$$\frac{R\&D_{it}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2 MV_t + k_3 Q_t + k_4\frac{INT_{it}}{Assets_{i,t-1}} + k_5\frac{R\&D_{i,t-1}}{Assets_{i,t-1}} + \varepsilon_{it}, \qquad (A.4)$$



where R&D is R&D expense, MV is the natural logarithm of the market value of equity (outstanding shares
times stock price), Q is Tobin’s Q [= (market value of equity + book value of preferred stock + book value
of long-term debt + debt in current liabilities) / total assets], INT is internally generated funds (the sum of
net income before extraordinary items, R&D expense, and depreciation and amortization), and all other
variables are as previously defined. Ab_R&D are the residuals from model (A.4).
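The Tobin’s Q and INT constructions above reduce to simple arithmetic on COMPUSTAT items; a minimal sketch (function names hypothetical):

```python
def tobins_q(mve, pref_bv, ltd_bv, curr_debt, total_assets):
    """Tobin's Q as defined above: (market value of equity + book value of
    preferred stock + book value of long-term debt + debt in current
    liabilities) / total assets."""
    return (mve + pref_bv + ltd_bv + curr_debt) / total_assets

def internal_funds(ni_before_xi, rd_expense, dep_amort):
    """INT: internally generated funds = net income before extraordinary
    items + R&D expense + depreciation and amortization."""
    return ni_before_xi + rd_expense + dep_amort
```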

5) Ab_SGA is abnormal selling, general and administrative costs (see, Gunny, 2010), computed by
estimating the following model of expected SGA (by industry and year):

$$\frac{SGA_{it}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2 MV_t + k_3 Q_t + k_4\frac{INT_{it}}{Assets_{i,t-1}} + k_5\frac{\Delta SALES_{it}}{Assets_{i,t-1}} + k_6\frac{\Delta SALES_{it}}{Assets_{i,t-1}}\times DD + \varepsilon_{it}, \qquad (A.5)$$

where SGA is SG&A expense, ΔSALES is change in annual sales, and MV, Q and INT were defined above.
DD is an indicator variable equal to 1 when total sales decrease from year t-1 to t, and zero otherwise, and
all other variables are as previously defined. Ab_SGA are the residuals from model (A.5).

6) Ab_GAIN is abnormal gains (see, Gunny, 2010) computed by estimating the following model (by
industry and year):

$$\frac{GAIN_{it}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2 MV_t + k_3 Q_t + k_4\frac{INT_{it}}{Assets_{i,t-1}} + k_5\frac{ASALES_{it}}{Assets_{i,t-1}} + k_6\frac{ISALES_{it}}{Assets_{i,t-1}} + \varepsilon_{it}, \qquad (A.6)$$

where GAIN is gain from asset sales (times -1), ASALES is long-lived assets sales, ISALES is long-lived
investment sales, and all other variables are as previously defined. Ab_GAIN are the residuals from model
(A.6).

B. Modified REM measures:

The modified REM measures are based on the refinements suggested in Gunny (2010) and Vorst (2016),
which include indicator variables for a decline in sales.

7) Ab_CFO_MOD is abnormal cash from operations where the underlying model is modified to include a
separate explanatory variable to capture the effect of a decline in sales, estimated using the following model
of expected CFO (by industry and year):

$$\frac{CFO_{i,t}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2\frac{SALES_{i,t}}{Assets_{i,t-1}} + k_3\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}} + k_4\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}}\times DD + \varepsilon_{i,t}, \qquad (A.7)$$

where CFO is cash flow from operations, SALES is annual sales, ΔSALES is change in annual sales, Assets
is total assets, and DD is an indicator variable set to 1 when sales have declined from year t-1 to t, and zero
otherwise. Ab_CFO_MOD are the residuals from model (A.7).
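Relative to model (A.1), the only change is the DD interaction term. A sketch of the augmented regressor matrix for (A.7), with DD coded as ΔSALES < 0 (function name hypothetical):

```python
import numpy as np

def design_matrix_a7(sales, dsales, lag_assets):
    """Build the regressor matrix for the modified CFO model (A.7).

    DD = 1 when sales declined (ΔSALES < 0); the k4 column interacts DD
    with scaled ΔSALES, letting the slope differ for sales declines."""
    sales, dsales, lag_assets = map(np.asarray, (sales, dsales, lag_assets))
    dd = (dsales < 0).astype(float)          # sales-decline indicator
    return np.column_stack([
        np.ones(sales.size),                 # k0
        1.0 / lag_assets,                    # k1
        sales / lag_assets,                  # k2
        dsales / lag_assets,                 # k3
        dd * dsales / lag_assets,            # k4: DD interaction
    ])
```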

8) Ab_DISC_EXP_MOD is abnormal discretionary expenses where the underlying model is modified to
include a separate explanatory variable to capture the effect of a decline in sales, estimated using the
following model of expected DISC_EXP (by industry and year):

$$\frac{DISC\_EXP_{i,t}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2\frac{SALES_{i,t}}{Assets_{i,t-1}} + k_3\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}} + k_4\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}}\times DD + \varepsilon_{i,t}, \qquad (A.8)$$



where DISC_EXP is discretionary expenses during the year defined as the sum of advertising, R&D, and
SG&A expenses and all other variables are as previously defined. Ab_DISC_EXP_MOD are the residuals
from model (A.8).

9) Ab_PROD_MOD is abnormal production costs (see, Roychowdhury, 2006) where the underlying model
is modified to include a separate explanatory variable to capture the effect of a decline in sales, estimated
using the following model of expected PROD (by industry and year):

$$\frac{PROD_{i,t}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2\frac{SALES_{i,t}}{Assets_{i,t-1}} + k_3\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}} + k_4\frac{\Delta SALES_{i,t-1}}{Assets_{i,t-1}} + k_5\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}}\times DD + k_6\frac{\Delta SALES_{i,t-1}}{Assets_{i,t-1}}\times DD + \varepsilon_{i,t}, \qquad (A.9)$$

where PROD is production costs defined as the sum of costs of goods sold (COGS) and change in
inventory during the year and all other variables are as previously defined. Ab_PROD_MOD are the
residuals from model (A.9).

C. Performance-matched REM Measures:

We match treatment firms to control firms based on return on assets (ROA), where ROA is defined as
income before extraordinary items divided by lagged total assets. Each treatment firm (i) is matched to a
performance-matched control firm (j) in the same two-digit SIC code whose ROA is within ±10% of the
treatment firm’s ROA. We then define the difference between the REM measure of the treatment firm and the REM
measure of the control firm to be the resulting performance-matched REM measure. Using abnormal CFO
as an example: PM_CFOi,t = Ab_CFOi,t - Ab_CFOj,t. We define performance-matched measures for each
of the REM variables described above (i.e., Ab_CFO, Ab_PROD, Ab_DISC_EXP, Ab_R&D, Ab_SGA, and
Ab_GAIN).
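A minimal sketch of this matching step, assuming firm records are held as plain dicts and reading the ±10% band as 10 percentage points of ROA (an interpretation; the text leaves the metric ambiguous). The function name and tie-breaking by closest ROA are assumptions, not the paper’s stated procedure:

```python
def performance_match(treatment, candidates, band=0.10):
    """Match a treatment firm to the control firm in the same two-digit
    SIC industry with the closest ROA, requiring |ROA_j - ROA_i| <= band.

    Returns (control, pm_measure) or (None, None) when no candidate
    falls inside the band."""
    same_ind = [c for c in candidates
                if c["sic2"] == treatment["sic2"]
                and abs(c["roa"] - treatment["roa"]) <= band]
    if not same_ind:
        return None, None
    control = min(same_ind, key=lambda c: abs(c["roa"] - treatment["roa"]))
    # PM_CFO_{i,t} = Ab_CFO_{i,t} - Ab_CFO_{j,t}
    return control, treatment["ab_cfo"] - control["ab_cfo"]
```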

D. Variables used to form sub-samples reflecting recent firm performance or financial characteristics:

Sales Growth: (Sales – lagged Sales)/lagged Sales.

Market Value of Equity: Fiscal year-end stock price multiplied by common shares outstanding.

Book-Market: Book value of common equity divided by the market value of equity.

Earnings/Price: Diluted earnings per share excluding extraordinary items divided by the fiscal year-end
stock price.
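These definitions translate directly into code; a minimal sketch (function names hypothetical):

```python
def sales_growth(sales, lag_sales):
    """(Sales - lagged Sales) / lagged Sales."""
    return (sales - lag_sales) / lag_sales

def market_value_of_equity(price, shares):
    """Fiscal year-end stock price times common shares outstanding."""
    return price * shares

def book_to_market(book_equity, mve):
    """Book value of common equity over market value of equity."""
    return book_equity / mve

def earnings_to_price(diluted_eps_ex_xi, price):
    """Diluted EPS excluding extraordinary items over fiscal year-end price."""
    return diluted_eps_ex_xi / price
```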



References
Alissa, W., S.B. Bonsall, K. Koharki, and M.W. Penn. 2013. Firms’ use of accounting discretion
to influence their credit ratings. Journal of Accounting and Economics 59, 129-147.

Baber, W., P. Fairfield, and J. Haggard. 1991. The effect of concern about reported income on
discretionary spending decisions: The case of research and development. The Accounting
Review 66 (4): 818-829.

Badertscher, B. 2011. Overvaluation and the choice of alternative earnings management
mechanisms. The Accounting Review 86 (5): 1491-1518.

Barber, B., and J. Lyon. 1996. Detecting abnormal operating performance: The empirical power
and specification of test statistics. Journal of Financial Economics 41, 359-399.

Bartov, E. 1993. The timing of asset sales and earnings manipulation. The Accounting Review 68
(4): 840-855.

Bens, D., V. Nagar, D. Skinner, and F. Wong. 2003. Employee stock options, EPS dilution, and
stock repurchases. Journal of Accounting & Economics 36 (1-3): 51-90.

Bens, D., V. Nagar, and F. Wong. 2002. Real investment implications of employee stock option
exercises. Journal of Accounting Research 40 (2): 359-406.

Bernard, V. 1987. Cross-sectional dependence and problems in inference in market-based
accounting research. Journal of Accounting Research 25 (1): 1-48.

Brown, S., and J. Warner. 1980. Measuring security price performance. Journal of Financial
Economics 8, 205-258.

Brown, S., and J. Warner. 1985. Using daily stock returns: The case of event studies. Journal of
Financial Economics 14 (1), 3-31.

Bushee, B. 1998. The influence of institutional investors on myopic R&D investment behavior.
The Accounting Review 73 (3): 305-333.

Campbell, C. and C. Wasley. 1993. Measuring security price performance using daily NASDAQ
returns. Journal of Financial Economics 33, 73-92.

Cheng, Q., J. Lee, and T. Shevlin. 2016. Internal Governance and Real Earnings Management.
The Accounting Review 91(4): 1051-1085

Cohen, D., A. Dey, and T. Lys, 2008. Real and Accrual-based Earnings Management in the Pre-
and Post-Sarbanes Oxley Periods. The Accounting Review 83(3): 757-787.

Cohen, D., R. Mashruwala, and T. Zach. 2010. The Use of Advertising Activities to Meet
Earnings Benchmarks: Evidence from Monthly Data. Review of Accounting Studies 15(4):
808-832.



Cohen, D., and P. Zarowin. 2010. Accrual-Based and Real Earnings Management Activities
around Seasoned Equity Offerings. Journal of Accounting & Economics 50 (1): 2-19.

Dechow, P., W. Ge, and C. Schrand. 2010. Understanding earnings quality: A review of the
proxies, their determinants and their consequences. Journal of Accounting & Economics
50 (2/3), 344-401.

Dechow, P., A. Hutton, J.H. Kim, and R. Sloan. 2012. Detecting earnings management: A new
approach. Journal of Accounting Research 50 (2): 275-334.

Dechow, P., S.P. Kothari, and R. Watts. 1998. The relation between earnings and cash flows.
Journal of Accounting and Economics 25 (2): 133-169.

Dechow, P.M. and D. Skinner. 2000. Earnings management: Reconciling the views of
accounting academics, practitioners, and regulators. Accounting Horizons 14 (2): 235-250.

Dechow, P., Sloan, R., and A. Sweeney. 1995. Detecting earnings management. The Accounting
Review 70, 193-225.

Dechow, P. and R. Sloan. 1991. Executive incentives and the horizon problem. Journal of
Accounting and Economics 14 (1): 51-89.

Eldenburg, L., Gunny, K., Hee, K., Soderstrom, N. 2011. Earnings Management using real
activities: evidence from nonprofit hospitals, The Accounting Review 86, 1605-1630.

Fields, T., T. Lys, and L. Vincent. 2001. Empirical research on accounting choice. Journal of
Accounting and Economics 31 (1-3), pp. 255-307.

Graham, J., C. Harvey, and S. Rajgopal. 2005. The economic implications of corporate financial
reporting. Journal of Accounting & Economics 40 (1-3): 3-73.

Gunny, K. 2010. The relation between earnings management using real activities manipulation
and future performance: Evidence from meeting earnings benchmarks. Contemporary
Accounting Research 27 (2), 855-888.

Healy, P., and J. Wahlen. 1999. A review of the earnings management literature and its
implications for standard setting. Accounting Horizons 13, 365-383.

Kim, Y. and M.S. Park. 2014. Real activity manipulations and auditors’ client-retention decision.
The Accounting Review 89 (1): 367-401.

Kothari, S. P. 2001. Capital markets research in accounting, Journal of Accounting & Economics
31, 105-231.

Kothari, S. P., Leone, A., and C. Wasley. 2005. Performance matched discretionary accrual
measures. Journal of Accounting & Economics 39 (1), 163-197.

Kothari, S. P., and C. Wasley. 1989. Measuring security price performance in size clustered
samples. The Accounting Review 64, 228-249.



McGuire, S., T. Omer, and N. Sharp. 2012. The impact of religion on financial reporting
irregularities. The Accounting Review 87, 645-673.

McInnis, J., and D. Collins. 2011. The effect of cash flow forecasts on accrual quality and
benchmark beating. Journal of Accounting & Economics 51(3), 219-239.

Roychowdhury, S. 2006. Earnings management through real activities manipulation. Journal of
Accounting & Economics 42 (3), 335-370.

Skinner, D. 1993. The investment opportunity set and accounting procedure choice: Preliminary
evidence. Journal of Accounting and Economics 16(4), 407-445.

Smith, C., and R. Watts. 1992. The investment opportunity set and corporate financing, dividend,
and compensation policies. Journal of Financial Economics 32(3), 263-292.

Srivastava, A. 2019. Improving the measures of real earnings management. Forthcoming in the
Review of Accounting Studies.

Thomas, J., and H. Zhang. 2002. Inventory changes and future returns. Review of Accounting
Studies 7 (2-3): 163-187.

Vorst, P., 2016. Real Earnings Management and Long-Term Operating Performance: The Role
of Reversals in Discretionary Investment Cuts. The Accounting Review 91 (4): 1219-1256.

Zang, A. 2012. Evidence on the tradeoff between real activities manipulation and accrual-based
earnings management. The Accounting Review 87(2) 675-703.



Figure 1 Power of the test: traditional and performance-matched abnormal cash flow from operations
The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal cash flows from operations (CFO)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 2 Power of the test: traditional and performance-matched abnormal discretionary expenses

The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal discretionary expenses (DISC_EXP)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 3 Power of the test: traditional and performance-matched abnormal production costs

The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal production (PROD)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 4 Power of the test: traditional and performance-matched abnormal R&D

The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal research & development expenditure (R&D)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 5 Power of the test: traditional and performance-matched abnormal SG&A
The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal SG&A expenses (SGA)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 6 Power of the test: traditional and performance-matched abnormal gain

The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal gain (GAIN)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
TABLE 1
Summary descriptive statistics for traditional and performance-matched measures of abnormal real earnings management (REM)

                                 Pre-estimation Winsorization      Post-estimation Winsorization     Pre- and Post-estimation Winsorization
REM Measure        N             Mean     Median    Std. Dev.      Mean     Median    Std. Dev.      Mean     Median    Std. Dev.

Ab_CFO 204,353 0.000 0.037 0.520 0.023*** 0.039*** 0.820 0.002 0.037 0.390
PM_CFO 178,831 -0.001 0.000 0.650 -0.000 0.000 1.060 -0.001 0.000 0.490
Ab_PROD 199,954 0.000 -0.014 0.400 -0.022*** -0.020*** 0.450 -0.008*** -0.014 0.280
PM_PROD 174,036 -0.000 0.000 0.520 0.001 0.000 0.600 0.000 0.000 0.390
Ab_DISC_EXP 187,727 0.000 -0.070 0.950 -0.085*** -0.076*** 2.380 -0.007 -0.071 0.660
PM_DISC_EXP 161,982 0.000 0.000 1.230 0.003 0.000 2.520 0.001 0.000 0.890
Ab_R&D 96,225 0.000 -0.002 0.130 -0.003*** -0.002*** 0.240 -0.001 -0.002 0.110
PM_R&D 82,588 0.000 0.000 0.190 0.001 0.000 0.330 0.000 0.000 0.170
Ab_SGA 90,325 0.000 -0.031 0.450 -0.016*** -0.0339*** 0.630 -0.004 -0.031 0.370
PM_SGA 77,074 0.002 0.001 0.630 0.000 0.001 0.780 0.001 0.001 0.500
Ab_GAIN 76,334 0.000 -0.001 0.010 -0.001*** -0.001*** 0.030 -0.001 -0.001 0.010
PM_GAIN 65,202 0.000 0.000 0.020 -0.001 0.000 0.030 0.000 0.000 0.010

Summary statistics are reported for three different winsorization approaches: (i) when the independent variables in models of expected (normal) real activities are winsorized prior to
model estimation (‘pre-estimation’ winsorization); (ii) when the independent variables in models of expected (normal) real activities are not winsorized, but rather the resulting estimated
REM measures are winsorized (‘post-estimation’ winsorization) and (iii) when the independent variables in models of expected (normal) real activities are winsorized prior to model
estimation and the resulting estimated REM measures are also winsorized (‘pre- and post-estimation winsorization’). The table reports tests for differences in means and medians for
comparisons between different winsorization scenarios. The benchmark sample for which mean and median differences are compared to is the ‘pre-estimation’ winsorization sample. The
means and medians for this benchmark sample are compared to those for the ‘post-estimation’ and ‘pre- and post-estimation winsorization’ samples. *** by a mean or median value
indicates that mean or median value is significantly different from the benchmark sample’s mean or median at the 1% level. Mean and median numbers in bold denote values that
themselves are statistically different from zero. See Appendix for variable definitions and estimation methods.
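The three winsorization timings differ only in where a clamping step like the following is applied: (i) to the regressors before estimation, (ii) to the residuals after estimation, or (iii) to both. The 1%/99% cut-offs below are an assumption, since the table notes do not state them:

```python
import numpy as np

def winsorize(x, lower=0.01, upper=0.99):
    """Clamp values outside the given sample quantiles.

    Cut-offs are assumed at the 1st and 99th percentiles; the paper's
    notes describe when winsorization is applied, not the cut-offs."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [lower, upper])
    return np.clip(x, lo, hi)
```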

TABLE 2
Type I error rates for traditional and performance-matched measures of abnormal real earnings management (REM): All firms

REM Measure        (Pre-estimation winsorization)     (Post-estimation winsorization)     (Pre- and post-estimation winsorization)

Ab_CFO 8.8% 12.0% 9.0%


PM_CFO 6.0% 6.8% 5.9%
Ab_PROD 2.5% 1.4% 2.2%
PM_PROD 5.3% 4.5% 4.9%
Ab_DISC_EXP 11.4% 12.2% 11.5%


PM_DISC_EXP 4.8% 4.5% 5.2%
Ab_R&D 9.7% 7.8% 9.4%
PM_R&D 5.3% 3.5% 6.3%
Ab_SGA 8.7% 9.5% 8.9%
PM_SGA 5.2% 6.5% 5.2%
Ab_GAIN 1.1% 0.6% 0.9%
PM_GAIN 4.3% 4.7% 4.3%

Mean Over-rejection:
Traditional REM Measures 9.65% 10.38% 9.70%
PM REM Measures 5.33% 5.33% 5.65%

Mean Under-rejection:
Traditional REM Measures 1.80% 1.00% 1.55%
PM REM Measures 4.80% 4.60% 4.60%

The table reports rejection rates (Type I error rates) for traditional and performance-matched measures of abnormal real earnings management (REM)
for random samples drawn from the “full sample” (i.e., all firms). Rejection rates correspond to the percentage of 1,000 random samples of 100 firms
each where the null hypothesis of mean zero abnormal REM is rejected in favor of the alternative hypothesis of positive (i.e., income-increasing) REM
at the 5% level (one-tailed t-test). With 1,000 samples, the 95% confidence interval for the (theoretical) nominal 5% significance level of the test
ranges from 3.65% to 6.35%. Rejection rates in italic (bold) are those that fall below the lower threshold of 3.65% (above the upper threshold of
6.35%). Results are reported for three alternative approaches to winsorize the data: pre-estimation, post-estimation, and pre- and post-estimation.
Under ‘pre-estimation’ winsorization the independent variables in models of expected (normal) real activities are winsorized prior to model estimation.
Under ‘post-estimation’ winsorization the independent variables in models of expected (normal) real activities are not winsorized, but rather the
resulting estimated REM measures are winsorized. Under ‘pre- and post-estimation winsorization’ the independent variables in models of expected
(normal) real activities are winsorized prior to model estimation and the resulting estimated REM measures are also winsorized. Variables with the
“Ab” prefix are traditional REM measures and those with a “PM” prefix are performance-matched REM measures. See Appendix for variable
definitions and estimation methods.
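The confidence interval quoted above is the binomial 95% band around the nominal 5% level. A sketch (all names hypothetical) that reproduces the 3.65%–6.35% band and runs a scaled-down version of the simulation under a well-specified null:

```python
import numpy as np

T_CRIT_99 = 1.6604  # approx. one-tailed 5% critical value for t with 99 df

def rejection_rate(n_samples=1000, n_firms=100, seed=0):
    """Monte Carlo check of the Type I error rate: draw samples of
    'abnormal REM' from a mean-zero null and count one-tailed t-test
    rejections at the 5% level."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_samples):
        x = rng.normal(0.0, 0.05, n_firms)               # mean-zero null
        t = x.mean() / (x.std(ddof=1) / np.sqrt(n_firms))
        rejections += t > T_CRIT_99
    return rejections / n_samples

def nominal_ci(p=0.05, n=1000, z=1.96):
    """Binomial 95% CI around the nominal significance level, i.e.
    p +/- z * sqrt(p(1-p)/n); reproduces the 3.65%-6.35% band."""
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half
```

With a correctly specified measure, rejection_rate should land inside nominal_ci(); the tables flag measures whose rates fall outside it.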

TABLE 3

Panel A: Summary statistics for traditional and performance-matched measures of abnormal real earnings management (REM): Full sample (“All Firms”) and sub-
samples of firms formed on the basis of recent firm performance or firm financial characteristics

                                                      Sales Growth      Market Value of Equity    Book-Market         Earnings/Price
REM Measure        Statistic (N=1,000)   All Firms    Low       High    Low          High         Low       High      Low       High
Ab_CFO Mean -0.0025 -0.0485 0.0085 -0.0638 0.1041 -0.0157 0.0251 -0.1888 0.0803
Ab_CFO Std. Dev. 0.0513 0.0455 0.0582 0.0714 0.0219 0.0612 0.0196 0.0667 0.0226
PM_CFO Mean -0.0021 0.0081 0.0029 0.0470 0.0187 -0.0171 0.0143 -0.0050 -0.0137
PM_CFO Std. Dev. 0.0642 0.0674 0.0689 0.0921 0.0301 0.0735 0.0344 0.0914 0.0258
Ab_PROD Mean 0.0025 0.0349 -0.0115 0.0236 -0.0418 -0.0467 0.0275 0.0833 -0.0202
Ab_PROD Std. Dev. 0.0385 0.0411 0.0464 0.0540 0.0237 0.0462 0.0211 0.0545 0.0260
PM_PROD Mean -0.0054 0.0104 -0.0082 -0.0151 -0.0095 -0.0226 0.0201 0.0111 0.0305
PM_PROD Std. Dev. 0.0524 0.0567 0.0565 0.0726 0.0325 0.0603 0.0327 0.0749 0.0342
Ab_DISC_EXP Mean 0.0018 -0.0880 0.1746 0.0094 -0.0992 0.1289 -0.1206 0.1796 -0.1071
Ab_DISC_EXP Std. Dev. 0.0955 0.0792 0.1070 0.1281 0.0469 0.0982 0.0354 0.1260 0.0416
PM_DISC_EXP Mean 0.0062 -0.1755 0.1534 -0.1068 -0.0102 0.1066 -0.0774 -0.0230 -0.0495
PM_DISC_EXP Std. Dev. 0.1184 0.1218 0.1280 0.1631 0.0605 0.1336 0.0678 0.1745 0.0488
Ab_R&D Mean -0.0017 -0.0022 0.0039 -0.0066 -0.0075 0.0111 -0.0115 0.0122 -0.0071
Ab_R&D Std. Dev. 0.0130 0.0133 0.0165 0.0156 0.0069 0.0166 0.0059 0.0188 0.0052
PM_R&D Mean -0.0016 -0.0078 0.0042 -0.0148 -0.0014 0.0093 -0.0143 -0.0019 -0.0024
PM_R&D Std. Dev. 0.0179 0.0215 0.0214 0.0244 0.0109 0.0221 0.0120 0.0281 0.0072
Ab_SGA Mean -0.0025 0.0097 -0.0457 -0.0100 -0.0137 0.0538 -0.0617 -0.0278 0.0138
Ab_SGA Std. Dev. 0.0444 0.0430 0.0522 0.0606 0.0246 0.0466 0.0251 0.0574 0.0324
PM_SGA Mean -0.0013 0.0156 -0.0538 -0.0115 -0.0148 0.0526 -0.0590 -0.0236 -0.0187
PM_SGA Std. Dev. 0.0588 0.0637 0.0656 0.0806 0.0377 0.0638 0.0413 0.0874 0.0396
Ab_GAIN Mean -0.0001 0.0008 -0.0006 0.0001 0.0001 -0.0003 0.0002 -0.0006 0.0022
Ab_GAIN Std. Dev. 0.0014 0.0019 0.0013 0.0017 0.0012 0.0014 0.0013 0.0015 0.0019
PM_GAIN Mean -0.0001 0.0010 -0.0006 -0.0001 0.0000 -0.0003 0.0004 0.0001 0.0010
PM_GAIN Std. Dev. 0.0019 0.0024 0.0020 0.0022 0.0019 0.0020 0.0020 0.0021 0.0023

Panel B: Type I error rates for traditional and performance-matched measures of abnormal real earnings management (REM): Full sample (“All Firms”) and sub-samples
of firms formed on the basis of recent firm performance or firm financial characteristics

                                Sales Growth      Market Value of Equity    Book-Market        Earnings/Price     Average Rejection
REM Measure        All Firms    Low       High    Low          High         Low       High     Low       High     Frequency


Ab_CFO 8.8% 0.1% 12.7% 0.5% 95.5% 6.8% 47.8% 0.0% 94.5% 29.63%
PM_CFO 6.0% 4.8% 6.6% 11.7% 23.4% 3.1% 6.6% 4.5% 0.9% 7.51%
Ab_PROD 2.5% 18.0% 1.1% 7.6% 0.0% 0.0% 33.4% 46.6% 0.9% 12.23%
PM_PROD 5.3% 7.4% 3.2% 3.8% 1.7% 1.5% 15.4% 7.6% 21.7% 7.51%
Ab_DISC_EXP 11.4% 39.9% 0.2% 6.0% 76.3% 0.2% 92.5% 0.2% 84.2% 34.54%
PM_DISC_EXP 4.8% 46.3% 0.3% 15.6% 7.5% 0.2% 37.2% 5.8% 28.1% 16.20%
Ab_R&D 9.7% 9.6% 4.6% 13.4% 42.4% 1.9% 68.1% 1.2% 51.6% 22.50%
PM_R&D 5.3% 11.2% 2.2% 16.5% 7.8% 2.1% 30.7% 5.7% 12.6% 10.46%
Ab_SGA 8.7% 3.9% 28.6% 8.7% 18.7% 0.4% 78.4% 16.8% 3.9% 18.68%
PM_SGA 5.2% 2.3% 21.5% 6.7% 10.6% 0.8% 40.0% 7.6% 14.2% 12.10%
Ab_GAIN 1.1% 4.3% 0.2% 1.2% 1.3% 0.7% 1.7% 0.5% 23.5% 3.83%
PM_GAIN 4.3% 11.6% 2.6% 4.4% 6.2% 3.9% 6.8% 3.7% 10.8% 6.03%

Mean Over-rejection:
Traditional REM Measures 9.65% 22.50% 20.65% 9.90% 58.23% 6.80% 64.04% 31.70% 63.45%
PM REM Measures 5.33% 21.63% 14.05% 9.00% 12.33% 3.10% 25.98% 7.60% 13.10%

Mean Under-rejection:
Traditional REM Measures 1.80% 0.10% 0.50% 0.85% 0.65% 0.64% 1.70% 0.48% 0.90%
PM REM Measures 4.80% 4.80% 2.03% 8.05% 3.95% 1.70% 6.80% 4.93% 21.70%

Means and standard deviations are measured across 1,000 samples of 100 firms each for traditional and performance-matched measures of abnormal real earnings
management (REM) for the full sample (“All Firms”) and for random samples drawn from the top and bottom quartiles of sales growth, market value of equity, book-to-
market ratio, and earnings-to-price ratio. Variables with the “Ab” prefix are traditional REM measures and those with a “PM” prefix are performance-matched REM
measures. See Appendix for variable definitions and estimation methods.

Rejection rates (Type I error rates) for traditional and performance-matched measures of abnormal real earnings management (REM) for random samples drawn from the
top and bottom quartiles of sales growth, market value of equity, book-to-market ratio, and earnings-to-price ratio. Rejection rates correspond to the percentage of 1,000
random samples of 100 firms each where the null hypothesis of mean zero abnormal REM is rejected in favor of the alternative hypothesis of positive (i.e., income-
increasing) REM at the 5% level (one-tailed t-test). With 1,000 samples, the 95% confidence interval for the (theoretical) nominal 5% significance level of the test ranges
from 3.65% to 6.35%. Rejection rates in italic (bold) are those that fall below the lower threshold of 3.65% (above the upper threshold of 6.35%). Results are reported
for pre-estimation winsorization, under which the independent variables in models of expected (normal) real activities are winsorized prior to model estimation. See
Appendix for variable definitions and estimation methods.
