
Measuring Real Activity Management

Daniel Cohen
School of Management
University of Texas at Dallas
Richardson, TX 75080
dcohen@utdallas.edu

Shailendra Pandit
College of Business Administration
University of Illinois at Chicago
Chicago, IL 60607
shail@uic.edu

Charles Wasley
Simon Graduate School of Business
University of Rochester
Rochester, NY 14627
charles.wasley@simon.rochester.edu

Tzachi Zach
Fisher College of Business
The Ohio State University
Columbus, OH 43210
zach.7@osu.edu

July 2019

Financial support from The University of Texas at Dallas, the University of Illinois at Chicago,
the Simon School at the University of Rochester, and the Fisher College of Business at The Ohio
State University is gratefully acknowledged. We thank Zahn Bozanic, Bill Cready, Jerry
Zimmerman, participants at the European Accounting Association Annual Congress, workshop
participants at the University of Rochester and Syracuse University, and especially an anonymous
referee for helpful comments and suggestions.

Electronic copy available at: https://ssrn.com/abstract=1792639


Measuring Real Activity Management

Abstract

To test hypotheses about earnings management, many studies investigate managers' manipulation
of real activities (real earnings management, REM). Tests using measures of abnormal REM hinge
critically on the measurement of normal real activities. Yet there is no systematic evidence on the
statistical properties of commonly used REM measures. We provide such evidence by
documenting the Type I error rates and the power of tests based on the REM measures commonly
used in the literature. We find these measures are often mis-specified, with Type I error rates that deviate
from the nominal significance level of the test, especially in samples of firms with extreme
performance or firm characteristics. We also compare the specification and power of traditional
REM measures with performance-matched REM measures to see if the latter provide better-
specified and more powerful tests. While performance-matched REM measures are not immune
from mis-specification in all settings, in general, they are better specified under the null hypothesis
(i.e., in terms of Type I errors) than are traditional REM measures. Comparisons of the power to
detect abnormal REM reveal that neither approach, traditional or performance-matched, is
consistently more powerful than the other in terms of detecting abnormal REM ranging from 1%
to 10% of (lagged) total assets. The absence of a dominant approach to measure abnormal REM
leads us to recommend that future researchers report results using both traditional and
performance-matched measures so that readers are able to clearly assess the reliability of the
inferences drawn about the magnitude and significance of the abnormal REM documented in a
given study.

JEL classification: M41, C12, C15, M42.

Keywords: Real activity management; real earnings management; earnings management; real
activity models, test specification, Type I errors, Type II errors, power of the test, meet or beat,
earnings benchmarks, model specification.



1. Introduction

A vast literature in accounting examines managers’ incentives, and the actions they take,

to manage earnings (for reviews see, Healy and Wahlen 1999; Dechow and Skinner 2000; Fields,

Lys and Vincent 2001; Dechow, Ge and Schrand 2010). Early studies on earnings management

typically examined discretionary accruals (DAs). More recently, studies have begun to focus on

real earnings management (REM) (see Roychowdhury 2006; Gunny 2010). REM refers to actions

managers take to achieve financial reporting objectives (e.g., to report a profit vs. a loss) by altering

real activities including, but not limited to, sales promotions, overproduction of inventory,

delaying or accelerating discretionary expenses such as R&D, advertising, and SG&A, and selling

assets to recognize gains. Research focusing on REM is now commonplace in the earnings

management literature as an alternative, or in addition, to tests using DAs.1

Since tests of REM are joint tests of a researcher’s model of expected (normal) real

activities and earnings management, inferences about earnings management hypotheses based on

measures of REM hinge critically on a researcher’s ability to accurately model expected (normal)

real activities. An implicit assumption in prior REM studies is that, under the null hypothesis of

no REM, the REM measures utilized are well specified. That is, that their Type I error rates

correspond to the nominal significance level of the test (e.g., 5%). However, there is no systematic

evidence on whether the REM measures commonly used in the literature are, in fact, well-specified

and hence are able to deliver reliable inferences about managers’ incentives to manage earnings

by manipulating real activities. Additionally, there is no systematic evidence about the Type II error

rates for the REM measures commonly used in the literature. Stated differently, there is no

evidence on the power of the test for various REM measures, that is, their ability to detect real

1. We use the terms real earnings management, real activity management, real activity manipulation and REM interchangeably.



earnings management when it is present in the data. Given the on-going focus in the literature on

measuring REM, it is surprising that no systematic empirical evidence exists on the properties of

alternative REM measures or the statistical tests based on them. This contrasts with research on

the properties of DAs, which have been extensively studied (see Kothari, Leone and Wasley 2005;

Dechow, Sloan and Sweeney 1995). Our objective is to provide systematic evidence on the Type

I error rates and power of the test associated with a variety of REM measures; thus, our study of

these properties of alternative REM measures fills this void in the literature.2

An issue encountered in all REM studies is how economic shocks to a firm’s performance

affect a researcher’s ability to accurately model normal real activities. For example, does the shock

have a linear or non-linear effect on a firm’s normal performance? Shocks can affect managerial

decisions about the firm’s real activities in at least two ways. First, managers may engage in REM

by opportunistically altering real decisions to mask the effect of the shock on the firm’s reported

earnings to continue to report high earnings. Alternatively, managers may alter real decisions as

part of a rational response to the shock so that the firm’s reported earnings best reflect the shock’s

effect on firm value. Since the hypothesis tested in a typical REM study is one of opportunistic

managerial behavior, an issue faced by REM researchers is how to accurately separate changes in

real activities motivated by the former (opportunistic response) from the latter (rational response).

Kothari (2001, 164) makes a similar point in the context of DA models. We reiterate his point here

because it applies to the REM setting as well.

2. Other areas where researchers have provided evidence on the properties of firm performance measures include Barber and Lyon (1996) on abnormal operating performance, and Brown and Warner (1980, 1985), Bernard (1987), Kothari and Wasley (1989) and Campbell and Wasley (1993) on abnormal stock returns.



In addition to the traditional REM measures used in prior research, we also investigate the

properties of performance-matched REM measures.3 We analyze performance-matched REM

measures because they have been found to improve test specification and power in other settings,

namely for abnormal DAs. Specifically, Kothari et al. (2005) find that performance matching leads

to better specified measures of DAs when compared to traditional measures of DAs such as those

based on the Jones or modified-Jones model. Thus, instead of entertaining more complicated

models of expected (normal) real activities, we first investigate whether an approach adopted

elsewhere in the earnings management literature, namely, performance-matching, performs better

than the traditional REM measures used in the earnings management literature.4,5

Our empirical analysis unfolds as follows. First, we use simulations based on firms’ actual

real activity measures to document and compare the Type I error rates of traditional and

performance-matched REM measures. Second, we use simulations to introduce abnormal REM

into firms’ actual real activity measures. We then document and compare the power of the test of

the traditional and performance-matched REM measures. This comparison allows us to identify

which approach to specifying REM measures leads to the best-specified and most powerful tests

of earnings management-related hypotheses. A key feature of our simulations is that the results

allow us (and future researchers) to make informed tradeoffs between test specification under the

3. Examples of REM studies that have implemented some form of performance-matching in their specific setting are Cohen and Zarowin (2010) and Badertscher (2011). In a recent study, Srivastava (2019) implements matching based on industry cohorts that share the same life cycle stage and production technology.
4. Simply generalizing the results on performance matching from Kothari et al.'s (2005) analysis of DAs to real activities seems misguided because there is no a priori reason to believe that managers' DA choices would necessarily be indicative of their real activity choices.
5. Roychowdhury (2006, 361-362) expresses concern about whether models used to derive his REM measures are linear and uses the performance matching techniques of Kothari et al. (2005). However, Roychowdhury (2006) does not provide systematic evidence on the properties of traditional and performance-matched REM measures, which is the focus of our study.



null hypothesis (Type I error rates) and power of the test for both traditional and performance-matched

REM measures. Our primary tests are based on firms drawn from the “full sample” of observations,

that is, without regard to prior performance or any firm financial characteristics. However,

because managers’ choice of real activity levels is a function of their firms’ recent economic

performance and firm characteristics, we supplement our main tests using the “full sample” with

results for REM measures drawn from samples designed to capture extreme firm performance

(e.g., sales growth) or firm financial characteristics (e.g., firm size). The motivation for these

supplemental tests comes from Kothari (2001, 163) who stresses that “earnings management

studies almost invariably examine samples of firms that have experienced unusual performance.”

Our main findings are as follows. The traditional REM measures commonly used in the

earnings management literature are mis-specified in many of the settings we examine. For

example, mis-specification tends to be modest in samples drawn from the “full sample” of

observations where, while Type I error rates exceed the nominal (5%) significance level of the

test, they never exceed 15%. On the other hand, in samples of firms from the top or bottom

quartiles of size, book-to-market, and sales growth, Type I error rates (for a 5% test) often exceed

15%, and in some cases are much higher. Such evidence raises concerns about the validity of the

inferences drawn in prior earnings management studies that used these traditional REM measures.

Turning to performance-matched REM measures, while they are not well specified in each and

every setting, on balance, they tend to yield better-specified tests (i.e., lower Type I error rates)

when compared to the traditional REM measures.6

6. As discussed in section 5.2.2, conclusions about performance-matched REM measures are subject to the caveat that their lower Type I error rates vis-à-vis those of traditional REM measures may be due in part to the higher standard deviation exhibited by performance-matched REM measures.



Turning to the power of the test, the overall evidence does not yield a dominant approach

to measure REM. Stated differently, neither the traditional REM measures nor the performance-

matched REM measures consistently yield the most powerful test. Instead, the most powerful REM

measure varies depending on the type of real activity metric (e.g., abnormal SG&A, abnormal

CFO, etc.) and on the magnitude of abnormal real activity. A notable feature of the power of the

test results is that they rule out the concern that performance-matched REM measures sacrifice

power to achieve the better Type I error rates they exhibit versus the traditional REM measures.

Our study contributes to the earnings management literature in the following ways. As the

first study to systematically document the properties (i.e., Type I error rates and power of the test)

of the traditional REM measures used in prior studies, our evidence facilitates a keener

appreciation of the reliability of the inferences prior studies have drawn about REM. Second, our

comparison of the Type I error rates and power of the test for traditional and performance-matched

REM measures provides a useful guide to future researchers when evaluating the trade-offs

between Type I and Type II error rates to decide which REM measures to use in a specific setting.

Since the trade-off between Type I and Type II errors is researcher specific, the absence of a

dominant approach to generate measures of abnormal REM leads us to recommend that future

researchers report results using both traditional and performance-matched measures. This

approach will allow readers to assess the reliability of the inferences drawn about the magnitude

and significance of the abnormal REM documented in a given study. Based on our findings, it

appears that the two approaches (performance-matched and traditional) vary in their effectiveness

depending on sample characteristics. Therefore, it is important that future researchers also evaluate

their specific samples and benchmark them against the results we provide in Table 3, for example.



The remainder of the paper is organized as follows. Section 2 describes the REM measures

we study. Section 3 describes our research design. Section 4 reports preliminary results and section

5 our main results. Section 6 summarizes our sensitivity tests. Section 7 concludes.

2. Measuring Real Activity Management

2.1 Overview

Following Roychowdhury (2006), managers’ willingness to manipulate real activities to

achieve financial reporting objectives or to capture private benefits has become an active area of

earnings management research. The impetus for such research also comes from Graham, Harvey

and Rajgopal (2005) who report that surveyed managers would consider altering discretionary

expenditures, as well as take other real actions, to achieve financial reporting objectives.7 A key

feature of prior REM studies is that they invariably use the same or very similar REM measures.8

2.2 A Conceptual perspective on what drives expected real activities

Value-maximizing managers choose operating, investing and financing policies that

maximize firm value, where the specific policies they choose are a function of the firm’s

investment opportunity set (IOS). Smith and Watts (1992, 264) note that the IOS varies across

firms. The real activity measures that are the subject of REM research fall under operating policies.

Among other things, a firm’s operating policy choices relate to setting selling prices; credit terms

and cash discounts; production schedules/quantities; advertising outlays; R&D outlays; other

discretionary expenditures; (non-top management) employee compensation (salary, bonuses, etc.);

7. Prior to Graham et al. (2005) and Roychowdhury (2006), other authors had examined managers' willingness to alter real decisions such as R&D outlays (see Dechow and Sloan 1991; Baber et al. 1991; Bushee 1998; Bens et al. 2002), share repurchases (see Bens et al. 2003), asset sales (see Bartov 1993) and over-production (see Thomas and Zhang 2002) to achieve financial reporting objectives. Our point is simply that REM has become a more active area of earnings management research following Graham et al. (2005) and Roychowdhury (2006).
8. Exceptions are Cohen, Mashruwala and Zach (2011) and Eldenburg, Gunny, Hee and Soderstrom (2011). Srivastava (2019) argues traditional REM measures can be improved by benchmarking on life cycle stage and production technology.



the timing of gains from asset sales, and so on. Conceptually, the fundamental driver of the real-

activity measures used in REM-related research is firms’ IOSs. Since IOSs are firm specific,

expected real activity levels will differ across firms, even if they are in the same industry.

The discussion above has two implications for measuring expected (normal) real activities.

First, empirical models of expected (normal) real activities should ideally include variables designed to

capture the features of a firm’s IOS. Second, because the IOS is firm specific, estimation of

expected real activities should ideally be based on firm-specific models. With regard to the former,

while measures of IOSs are available (e.g., asset beta, PPE, Tobin’s q; see Skinner 1993), how

such measures drive a firm’s real activities is not well understood. As a result, to develop models

of expected real activities, prior REM studies (e.g., Roychowdhury 2006) rely instead on models

of the earnings/cash flow relation (see Dechow, Kothari, and Watts 1998) where the fundamental

driver of real activities is a firm’s sales level. With regard to the second point, while a model of

expected (normal) real activities should be firm-specific, in most accounting settings firm-specific

estimation is infeasible due to the relatively short time-series of annual or quarterly data that is

available. As a result, REM studies rely on cross-sectional estimation at the industry level. A well-

known problem with cross-sectional estimation is that such models exhibit low explanatory power

because they over-simplify the underlying economics of the relation.

2.3 Real activity measures common to the existing REM literature

Roychowdhury (2006, 344-45) developed three REM measures: abnormal cash flow from

operations, abnormal discretionary expenses, and abnormal production costs. Gunny (2010) built

on those, modifying them slightly to specify other measures of abnormal R&D, abnormal SG&A,

abnormal gains on asset sales, and abnormal production costs (see section 3.1 and the Appendix

for the details underlying estimation of all the REM measures used in our study). Most REM

studies use the measures in Roychowdhury (2006). Such reliance makes it all the more important


to document the specification of those measures with alternative approaches such as performance-

matching. The implicit assumption in prior REM studies is that all the traditional REM measures

are mean zero under the null hypothesis of no REM. However, there is no systematic evidence that

this is true, which raises the question of whether tests based on commonly used REM measures

can be relied upon to yield valid inferences about REM hypotheses. Our results provide systematic

evidence on the validity of these concerns.

2.4 Performance-matched REM measures

2.4.1 Motivation for performance-matching

Kothari (2001, 163) stresses that “earnings management studies almost invariably examine

samples of firms that have experienced unusual performance.” Relatedly, Skinner (1993, 420)

states that “… there is reason to believe that accounting procedure choice is related to how well or

badly firms are performing…” These observations motivate the need to isolate the effects of firm

performance on models that try to measure earnings management. Performance matching is one

way to achieve this. For example, in cases where the true model is unknown (perhaps linear for

some firms, but non-linear for others), performance matching can be beneficial because it does not

impose any particular functional form linking real activities to performance. Instead, the

premise underlying performance matching is simply that the impact of performance

on real activities is similar between a treatment firm and its matched control firm.

In addition, in cases where variables expected to drive real activities can only be measured

imprecisely or where empirical measures of theoretical constructs are unavailable, performance-

matching can be beneficial because it does not require the specification and measurement of

every conceivable variable expected to drive real activities. Instead, the idea behind

performance matching is simply that the impact of such variables is similar between a

treatment firm and its matched control firm. For these reasons, performance-matching



provides a viable way to control for both potential nonlinearities in the relation between real

activities and firm performance as well as for the effect of variables measured with error or

omitted entirely from a model of expected (normal) real activities. Indeed, performance

matching has proved successful in other contexts. For example, Kothari et al. (2005) find that it is

a reliable way to mitigate mis-specification in popular measures of DAs.

2.4.2 Implementation of performance-matching

While there are a number of ways to performance-match (e.g., return on sales, return on

equity) our choice of return on assets (ROA) builds on a point made by Skinner (1993, 421) that

“… firms’ recent accounting performance may be correlated with their IOS…” Given Skinner’s

observation and our discussion in section 2.2 about firms’ IOS being the fundamental driver of

real-activity levels, ROA seems like a natural choice. The specifics of our performance-matching

approach are as follows. For each abnormal REM measure (e.g., cash flow from operations,

discretionary expenses, production costs, R&D, SG&A, and gains on asset sales) we calculate a

performance-matched version for a given “treatment” firm in a given year by matching it to another

firm in the same two-digit SIC code whose ROA is within ±10%. The performance-matched REM

measure is the difference between the REM measures of the treatment firm and its matched control

firm. For example, performance-matched cash flow from operations is (where i denotes the

treatment firm and j the matched firm): PM_CFO_{i,t} = Ab_CFO_{i,t} − Ab_CFO_{j,t} (see section 3.1 and

the Appendix for the additional details underlying estimation of all REM measures).
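In code terms, the matching step just described can be sketched as follows. This is our illustrative sketch, not the authors' implementation; the column names (`firm`, `year`, `sic2`, `roa`, `ab_cfo`), the reading of "±10%" as 0.10 in absolute ROA units, and the tie-breaking rule (take the closest-ROA peer) are our assumptions.

```python
import pandas as pd

def performance_match(df):
    """For each treatment firm-year, find a control firm in the same
    two-digit SIC industry and year whose ROA is within +/-0.10, then
    difference the abnormal REM measures:
        PM_CFO = Ab_CFO(treatment) - Ab_CFO(control)."""
    out = []
    for (year, sic2), grp in df.groupby(["year", "sic2"]):
        for i, row in grp.iterrows():
            peers = grp.drop(index=i)
            # candidate controls: ROA within +/-0.10 of the treatment firm
            close = peers[(peers["roa"] - row["roa"]).abs() <= 0.10]
            if close.empty:
                continue  # no qualifying control firm this year
            # break ties by taking the closest-ROA peer as the control
            j = (close["roa"] - row["roa"]).abs().idxmin()
            out.append({"firm": row["firm"], "year": year,
                        "pm_cfo": row["ab_cfo"] - close.loc[j, "ab_cfo"]})
    return pd.DataFrame(out)
```

Analogous performance-matched measures (PM_DISC_EXP, PM_PROD, and so on) follow by swapping in the other abnormal REM measures.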

While performance-matching has potential benefits, it is not without limitations. Because it is

based on differencing two variables, the standard deviation of a performance-matched REM

measure will be roughly √2 (≈1.4) times that of an un-differenced REM measure (assuming the

two measures are independent with similar variances). As a result, performance-matched measures may sacrifice power (i.e., be more prone to Type II errors) relative to traditional

(un-differenced) REM measures. This is a non-trivial point because of the trade-off researchers


face between Type I and Type II errors in REM (and all research) settings. The simulations we

conduct provide systematic evidence on that tradeoff between specification and the power of the

test (i.e., between Type I and II errors) for traditional vs. performance-matched REM measures.
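The variance-inflation point above is easy to verify numerically. The following small simulation (our own sketch, assuming the treatment and control measures are independent with equal variances) shows the standard deviation of the differenced measure approaching √2 ≈ 1.41 times that of the un-differenced measure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# two independent "REM measures" with the same standard deviation
treat = rng.normal(0.0, 1.0, n)
control = rng.normal(0.0, 1.0, n)
diff = treat - control  # the performance-matched (differenced) measure

ratio = diff.std() / treat.std()
print(round(ratio, 2))  # close to sqrt(2) ~ 1.41
```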

3. Research design

3.1 Sample and data requirements

To calculate the REM measures of interest, we require annual financial statement data from

COMPUSTAT for the period 1986-2017. We retain all firm-year observations meeting data

availability requirements in a given year. We do not require data on all variables for all firms for

each year because doing so would introduce a severe survivorship bias. The traditional REM

measures used in prior research that we analyze are: abnormal cash flows from operations

(Ab_CFO); abnormal discretionary expenses (Ab_DISC_EXP); abnormal production costs

(Ab_PROD); abnormal R&D expenses (Ab_R&D); abnormal selling and general expenses

(Ab_SGA); and abnormal gains from sales of fixed assets (Ab_GAIN).

In addition to the traditional measures above, we analyze modified versions of Ab_CFO,

Ab_PROD, and Ab_DISC_EXP based on suggestions in Gunny (2010) and Vorst (2016). Since

the R2s of these modified real activity models are very similar to those of the traditional models, for

brevity, we do not report results for REM measures based on these modified models (available

upon request). Following prior research, we estimate the underlying models of expected (normal)

real activity using annual data at the two-digit SIC code level for all industries with at least 15

observations in a given year. To obtain abnormal REM measures we subtract the expected value

of each measure based on the expectation model from the reported COMPUSTAT value or the

value calculated using COMPUSTAT numbers (e.g., production costs = COGS + ΔINV).
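The estimation just described — fit a model of normal activity cross-sectionally within each two-digit SIC industry-year having at least 15 observations, then treat the residual as the abnormal measure — can be sketched as follows for CFO. The regressors follow the form of Roychowdhury's (2006) CFO model; the column names are our assumptions:

```python
import numpy as np
import pandas as pd

def abnormal_cfo(df):
    """Estimate normal CFO cross-sectionally for each two-digit SIC
    industry-year with at least 15 observations, in the spirit of
    Roychowdhury (2006):
        CFO_t/A_{t-1} = a0 + a1*(1/A_{t-1}) + a2*(S_t/A_{t-1})
                        + a3*(dS_t/A_{t-1}) + e_t
    The residual e_t is the abnormal (unexpected) CFO measure."""
    df = df.copy()
    df["ab_cfo"] = np.nan
    for (year, sic2), grp in df.groupby(["year", "sic2"]):
        if len(grp) < 15:
            continue  # skip thin industry-years, as in the literature
        X = np.column_stack([
            np.ones(len(grp)),               # intercept
            1.0 / grp["at_lag"],             # 1 / lagged total assets
            grp["sales"] / grp["at_lag"],    # scaled sales level
            grp["d_sales"] / grp["at_lag"],  # scaled sales change
        ])
        y = (grp["cfo"] / grp["at_lag"]).to_numpy()
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        df.loc[grp.index, "ab_cfo"] = y - X @ coef  # residual = abnormal CFO
    return df
```

The other abnormal measures (Ab_DISC_EXP, Ab_PROD, etc.) are obtained the same way, with the dependent variable and regressors swapped to the relevant expectation model.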

As stated above, we calculate a performance-matched version of each REM measure for a

given “treatment” firm in a given year by matching it to a firm in the same two-digit SIC code



whose ROA is within ±10%. The performance-matched measure is the difference between the

treatment firm’s REM measure and that of its performance-matched control firm.

In the REM literature it is common to winsorize the data to reduce the impact of extreme

data points. In the REM setting there are two points at which the data can be winsorized. First, the

data used to estimate the underlying models of expected (normal) real activities can be winsorized

(hereafter, pre-estimation). Based on our reading of the REM literature, most studies (except

Gunny 2010; and Vorst 2016) are silent about whether winsorization occurs at this stage.

Alternatively, winsorization can take the form of winsorizing the abnormal REM measures

themselves, which would be after the model of expected (normal) real activity was estimated

(hereafter, post-estimation). Based on our reading of the REM literature, it appears that roughly

half the studies we read winsorize at this stage.9 Given these differences, before we perform our

simulation analysis, we report and discuss summary statistics of the properties of REM measures

generated under all three winsorization approaches, namely, (i) pre-estimation; (ii) post-

estimation; and (iii) pre- and post-estimation winsorization. We then use the properties of the

summary descriptive statistics to decide which approach to base our simulation analysis on. As

discussed below, we adopt the first approach, namely pre-estimation winsorization (simulation

results based on the other approaches are reported in our section on sensitivity tests). We stress

that our intent here is not to argue in favor of or against any one approach, but simply to document

the sensitivity of our inferences to the choice.

3.2 Simulation procedures

The hypothesis tested in the typical REM study is that managers manipulated real activities

to boost reported earnings. Consistent with this we perform simulations for one-sided tests of the

9. For examples, see Alissa, Bonsall, Koharki and Penn (2013), Cheng, Lee and Shevlin (2016), Cohen, Dey, and Lys (2008), Kim and Park (2014), McGuire, Omer and Sharp (2012), McInnis and Collins (2011) and Zang (2012).



alternative hypothesis of positive (income-increasing) abnormal REM (we discuss results for two-

sided tests in our section on sensitivity tests). Our simulations assess proper test specification under

the null hypothesis by documenting each REM measure’s Type I error rate. Subsequent

simulations document the power of the test for each REM measure to detect abnormal REM when

it is present (i.e., has been seeded) in the data. Simulations are based on 1,000 random samples of

100 firm-year observations drawn without replacement. Our primary tests are based on simulations

where samples are drawn from the “full sample” of firms (i.e., all firm-years).
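The sampling loop can be sketched as follows (our illustration, not the authors' code; the hypothetical `rejection_rate` helper draws repeated samples without replacement and applies a one-sided t-test of zero mean abnormal REM at the 5% level):

```python
import numpy as np

def rejection_rate(measure, n_samples=1000, sample_size=100, seed=0):
    """Draw repeated random samples without replacement from the population
    of firm-year REM measures and record how often a one-sided t-test
    rejects H0: mean = 0 in favor of H1: mean > 0 at the 5% level."""
    rng = np.random.default_rng(seed)
    t_crit = 1.6604  # ~95th percentile of t with df = sample_size - 1 = 99
    rejections = 0
    for _ in range(n_samples):
        s = rng.choice(measure, size=sample_size, replace=False)
        t = s.mean() / (s.std(ddof=1) / np.sqrt(sample_size))
        rejections += t > t_crit
    return rejections / n_samples
```

Under the null (a zero-mean population) the returned rate should sit near the nominal 5%; rates well above or below that signal mis-specification.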

In supplemental tests we report Type I error rates for sub-samples formed on the basis of

recent firm performance (e.g., sales growth) or firm characteristics (e.g., firm size). The sub-

samples are designed to capture features common to samples in earnings management studies.

Such samples are often characterized by large and/or small firms; firms that are value or glamour

stocks; and/or firms exhibiting momentum in recent performance. Our market value of equity

(MVE) sample partition captures the firm size characteristic; our earnings-to-price (E/P) and book-

to-market (B/M) partitions capture value/glamour firms; while sales growth captures momentum,

or the lack thereof, in recent financial performance. These choices are not meant to exhaust all

possible scenarios, but rather to reasonably characterize settings similar to those encountered in

earnings management studies. Our supplemental tests on these sub-samples are similar to the

approach used in other studies of the properties of alternative measures of earnings management

(e.g., discretionary accruals). For example, Dechow et al. (1995) examine extreme earnings and

cash flow performance, Kothari et al. (2005) examine extreme operating cash flow, book-to-

market, sales growth, earnings-to-price, and size, and Dechow, Hutton, Kim and Sloan (2012)

examine extreme earnings growth, cash flow, size, and sales growth.



3.2.1 Simulations assessing test specification under the null hypothesis (Type I error rates)

For each of the 1,000 random samples we construct, we compute the mean value of each

abnormal REM measure and then tabulate the frequency with which the null hypothesis of zero

mean abnormal REM is rejected based on a t-test with a nominal significance level of 5%. To assess

departures from the nominal significance level of the test we construct a 95% confidence interval

for the (theoretical) nominal significance level, which for 1,000 samples is 3.65% to 6.35% for a

nominal significance level of 5%. If the observed rejection rate falls above (below) the upper

(lower) bound of this interval, the test is mis-specified in that it is rejecting the null hypothesis too

frequently (infrequently). Such cases are evidence that the REM measure is mis-specified and

biased against (in favor of) the null hypothesis. To save space, we do not tabulate or discuss in the

text the results for all possible combinations of simulated settings. Instead, section 5 presents the

results for a baseline set of simulations and we summarize the results of variations from these

baseline results in section 6 (all unreported results are available upon request).
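The 3.65% to 6.35% bounds follow from the normal approximation to the binomial distribution of the rejection count: with 1,000 independent samples and a true rejection probability of 5%, the standard error of the observed rejection rate is √(0.05 × 0.95 / 1,000) ≈ 0.69%, and the 95% confidence interval is 5% ± 1.96 standard errors. A quick check:

```python
import math

p, n = 0.05, 1000  # nominal significance level, number of simulated samples
se = math.sqrt(p * (1 - p) / n)          # binomial standard error
lo, hi = p - 1.96 * se, p + 1.96 * se    # 95% confidence interval
print(f"{lo:.2%} to {hi:.2%}")           # prints: 3.65% to 6.35%
```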

As noted above, our primary tests report the Type I error rates for all REM measures drawn

from the “full sample” of observations (i.e. from all firms). Then, in supplemental tests, we report

Type I error rates for the sub-sample partitions reflecting recent firm performance or firm

characteristics. The sub-samples consist of firms with high vs. low earnings/price (E/P) ratios, high

vs. low book-to-market (B/M) ratios, high vs. low recent sales growth, and large vs. small firms

(market value of equity). We construct sub-samples by annually ranking all firm-year observations

on the basis of each partitioning variable. For each partitioning variable we pool observations

across all sample years (1986-2017) and then draw 1,000 random samples of 100 firms each from

the top and bottom quartiles of each partitioning variable. We then test whether the mean of the

REM measure is significantly different from zero.
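The sub-sampling procedure can be sketched as follows; this is a simplified illustration with hypothetical variable names that ranks within each year, pools the tail quartiles across years, and draws random samples from each tail:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_quartile_samples(values, years, n_draws=1000, sample_size=100):
    """Rank firm-years on a partitioning variable within each year, pool
    the top and bottom quartiles across all years, and draw random
    samples of `sample_size` firm-years from each tail."""
    values, years = np.asarray(values), np.asarray(years)
    top, bottom = [], []
    for yr in np.unique(years):
        idx = np.flatnonzero(years == yr)
        v = values[idx]
        q1, q3 = np.quantile(v, [0.25, 0.75])
        bottom.extend(idx[v <= q1])   # lower-quartile firm-years
        top.extend(idx[v >= q3])      # upper-quartile firm-years
    top, bottom = np.array(top), np.array(bottom)
    top_draws = [rng.choice(top, sample_size, replace=False)
                 for _ in range(n_draws)]
    bottom_draws = [rng.choice(bottom, sample_size, replace=False)
                    for _ in range(n_draws)]
    return top_draws, bottom_draws
```

Each draw then feeds the same t-test used for the full-sample simulations.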

3.2.2 Simulations assessing the power of tests to detect abnormal REM

To assess the power of tests, we use the same 1,000 samples described above where we

“artificially” induced (i.e., seeded) abnormal real activity performance into the underlying raw real

activity variables (e.g., CFO) before estimating the models of expected (normal) real activity. We

vary the ‘seed’ from 1% to 10% of lagged total assets. For example, for a given firm i in year t,

the revised level of CFO is CFO*_{i,t} = CFO_{i,t} + p·AT_{i,t-1}, where AT denotes total assets and p varies from 1% to 10%. Seeding

abnormal REM into the raw data allows us to evaluate the power of the test exhibited by alternative

REM measures to detect abnormal REM when it is in fact present in the data.
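A stylized version of the power calculation, assuming the REM measure is already deflated by lagged assets (so seeding p·AT_{i,t-1} into the raw variable simply adds p to the deflated measure) and, for simplicity, skipping the re-estimation of the expectation model that the actual simulations perform:

```python
import numpy as np

rng = np.random.default_rng(1)
T_CRIT = 1.660  # one-sided 5% critical value of Student's t, 99 df

def power_at_seed(abn_rem, p, n_draws=1000, sample_size=100):
    """Fraction of `n_draws` random samples in which a one-sided t-test
    rejects H0: mean abnormal REM = 0 after seeding abnormal REM of
    p (a fraction of lagged total assets) into each observation."""
    abn_rem = np.asarray(abn_rem)
    rejections = 0
    for _ in range(n_draws):
        idx = rng.choice(len(abn_rem), sample_size, replace=False)
        seeded = abn_rem[idx] + p           # seed the abnormal activity
        t = seeded.mean() / (seeded.std(ddof=1) / np.sqrt(sample_size))
        rejections += t > T_CRIT
    return rejections / n_draws
```

At a seed of zero the function returns the Type I error rate; at positive seeds it traces out a power curve.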

4. Preliminary results

4.1 Distributional properties of alternative abnormal REM measures

Table 1 presents univariate descriptive statistics for all abnormal REM variables measured

at the individual firm level with performance-matched REM measures signified by a PM prefix.

As discussed above, we report separate summary statistics for three different winsorization

approaches: (i) pre-estimation winsorization; (ii) post-estimation winsorization; and (iii) pre- and

post-estimation winsorization. Means or medians that are statistically different from zero appear

in bold (or italic) font. “Stars” denote significant differences between means and medians under

the pre-estimation winsorization scheme and those under the other two winsorization schemes.

[Insert Table 1 here]

Since the abnormal REM measures (traditional and performance-matched) under pre-

estimation winsorization are residuals from first-stage regression models their means are zero by

construction. While medians of the traditional REM measures are generally different from zero,

they are indistinguishable from zero for the performance-matched versions. Turning to the results

under different winsorization approaches, for the traditional REM measures, there are differences

between the means and medians under the different winsorization schemes. For example, the

average Ab_CFO under the post-estimation winsorization is 0.023, which is significantly different

from the average of zero under the pre-estimation approach. Similar differences occur between all

traditional measures under the post-winsorization scheme, while there does not seem to be any

difference when we examine the performance-matched measures. As for the pre/post-

winsorization scheme, there does not seem to be any difference between its means and medians

and those of the pre-estimation winsorization. Turning to the standard deviations, REM measures

of discretionary expenses by far exhibit the largest variation. For example, under pre-estimation

winsorization, σ(Ab_DISC_EXP) is 0.95 while σ(PM_DISC_EXP) is 1.23. As expected, since the

performance-matched REM measures are constructed by differencing the REM measures of a

treatment and control firm, the standard deviations of performance-matched measures are higher

than those of traditional REM measures. As noted above, a potential implication of a performance-

matched REM measure’s higher standard deviation is that better specified tests (i.e., lower Type I

error rates) using performance-matched measures may come at a cost of reduced power to detect

abnormal REM.
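For concreteness, differencing against a matched control firm can be sketched as below. We match on a generic performance variable within a group (e.g., an industry-year), in the spirit of Kothari et al. (2005); the function and matching choice are illustrative rather than the exact procedure detailed in the Appendix:

```python
import numpy as np

def performance_match(rem, perf, groups):
    """Within each group (e.g., an industry-year), pair each firm with
    the firm whose performance measure is closest and difference their
    REM values: PM_REM = REM(treatment) - REM(matched control)."""
    rem, perf, groups = map(np.asarray, (rem, perf, groups))
    pm = np.full(len(rem), np.nan)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        if len(idx) < 2:
            continue  # no candidate control firm in this group
        for i in idx:
            others = idx[idx != i]
            ctrl = others[np.argmin(np.abs(perf[others] - perf[i]))]
            pm[i] = rem[i] - rem[ctrl]
    return pm
```

The differencing is what drives the higher standard deviation of the performance-matched measures discussed above.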

Overall, the summary descriptive statistics reveal that the distributions of the commonly-

used REM measures exhibit means close to zero when pre-estimation winsorization is used.

However, averages of various REM measures tend to become more extreme when samples are

formed on the basis of recent firm performance or firm financial characteristics (un-tabulated).

The simulation evidence reported next, which calibrates the Type I error rates of these REM

measures, provides systematic evidence as to the reliability of inferences drawn in prior REM

studies that have employed the traditional REM measures.10

10
Unreported summary statistics for sub-samples formed on the basis of recent firm performance or firm financial
characteristics reveal that except for Ab_GAIN, and to a lesser degree Ab_R&D, mean and median values tend to be
non-zero and very extreme in most cases.

5. Simulation Results

5.1 Overview

We first report simulations that assess test specification under the null hypothesis (Type I

error rates). We then report simulations that assess the power of the test to detect abnormal REM

when it has been seeded into the underlying data. The key aspect of the simulations assessing test

specification under the null hypothesis is that in random samples, where firms are selected without

regard to any prediction about managerial incentives to manage real activities, the expected value

of each REM measure should be zero. Evidence that a given REM measure is biased in favor of

the alternative hypothesis (i.e., against a “true” null) would lead to the conclusion that researchers

should avoid using such measures or face the risk of making a Type I error by concluding they

have documented a significant “treatment” effect when in fact they have not.

Simulations assessing the power of tests to detect abnormal REM are designed to provide

systematic evidence on the trade-off between bias reduction and power across the various REM

measures. A maintained assumption underlying the power of the test analysis and comparisons is

that each REM measure is well-specified under the null hypothesis. So long as a given REM

measure (traditional or performance-matched) is well-specified under the null hypothesis (i.e., it

exhibits an acceptable Type I error rate), the power of the test simulations can shed meaningful

light on the degree to which that REM measure will be able to detect a given level of abnormal REM

when it is present in the data. We seed levels of abnormal REM ranging from 1% to 10%, and then

tabulate the percent of times out of 1,000 samples where the null hypothesis of zero REM is

rejected at a given level of abnormal REM (i.e., the power of the test).

Before reporting the results, we provide a brief roadmap for the baseline simulations

reported below (see section 6 for sensitivity tests where we vary some of the choices below):

1) Traditional REM Measures: Ab_CFO, Ab_PROD, Ab_DISC_EXP, Ab_R&D, Ab_SGA,
and Ab_GAIN (see Appendix for details).

2) Performance-matched REM Measures: A performance-matched version of each


traditional REM measure listed above, denoted: PM_CFO, PM_PROD,
PM_DISC_EXP, PM_R&D, PM_SGA, PM_GAIN (see Appendix for details).

3) Sample composition: Primary tests are based on the “full sample” (i.e., all firms).
Supplemental tests use sub-samples that reflect recent firm performance or firm
financial characteristics. These are defined as the lower and upper quartiles of: book-
to-market ratios, past sales growth, earnings-to-price ratios, and market value of equity.

4) Hypothesis Test: A one-sided test of the alternative hypothesis that mean REM is
positive. That is, where REM is hypothesized to have been income increasing.

5) Nominal significance level of the test: 5%.

We begin by reporting (section 5.2) rejection rates (i.e., Type I error rates) for a one-tailed

test where the alternative hypothesis is of income-increasing REM. Next (section 5.3), we report

results for the power of the test. For ease of interpretation in the tables below, rejection rates that

are significantly less than the nominal significance level of the test (i.e., conservative tests) appear

in italics, while those significantly greater than the nominal significance level of the test (i.e.,

which reject a true null hypothesis too often) appear in bold.

5.2 Test specification under the null hypothesis (Type I error rates)

5.2.1 Tests based on the “full sample” (all firms)

Table 2 reports rejection rates (Type I error rates) for the null hypothesis that the mean

REM in a given sample is zero against a one-tailed alternative hypothesis of positive REM. The

earnings management setting modeled here is one where managers engaged in income-increasing

REM to achieve a financial reporting objective such as meeting or beating an earnings threshold

(e.g., reporting a profit instead of a loss). Table 2’s simulations are based on samples drawn from

the “full sample” (all firms), that is, without regard to any firm characteristic. For comparison

purposes, results are reported under different winsorization schemes: (i) pre-estimation; (ii) post-

estimation; and (iii) pre- and post-estimation winsorization.

We use two approaches to assess whether the observed rejection frequencies exhibit

evidence of mis-specification. Under the first, we (objectively) compare rejection frequencies with

the lower (upper) bound of 3.65% (6.35%) for the 95% confidence interval for the 5% nominal

significance level of the test. Second, we apply a subjective threshold of 15% to define severe

misspecification to mimic a potential researcher’s subjective choice about tolerable Type I error.11
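These two screens can be expressed as a small helper function (the bounds are those derived in section 3.2.1; the 15% cutoff is the subjective threshold just described):

```python
def classify_rejection_rate(rate, lower=0.0365, upper=0.0635, severe=0.15):
    """Classify an observed rejection rate for a 5% nominal test using
    the 95% confidence bounds plus the 15% severity cutoff."""
    if rate > severe:
        return "severely mis-specified (over-rejects)"
    if rate > upper:
        return "mis-specified (over-rejects)"
    if rate < lower:
        return "mis-specified (conservative)"
    return "well-specified"
```

Applying this to the tabulated rejection rates reproduces the bold/italic coding used in the tables.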

[Insert Table 2 here]

We first discuss the results for the traditional REM measures used in prior studies. These

results appear in the shaded rows of Table 2 and reveal evidence of mis-specification in that none

of the traditional REM measures is consistently well specified under the null hypothesis. For

example, four measures yield tests that over-reject the null hypothesis (Ab_CFO, AB_DISC_EXP,

AB_R&D, and Ab_SGA) with an average rate of 9.65% (compared to the test’s upper bound of

6.35%) while the other two (Ab_PROD and Ab_GAIN) yield conservative tests based on an

average rejection frequency of 1.80% (compared to the test’s lower bound of 3.65%). Using the

15% cutoff to define severe mis-specification would lead one to conclude that in random samples

drawn from the “full sample” (i.e., all firms), traditional REM measures are not severely mis-

specified.

A notable feature of the results using the traditional REM measures is that, based on the

behavior of the rejection rates across the columns of Table 2, the findings are similar regardless of

the winsorization approach. Thus, to the extent mis-specification exists, it does not seem to be

11
To be clear, we are not recommending 15% as a tolerable Type I error rate; it is more than double the 6.35%
upper bound of the 95% confidence interval and, in our view, represents extreme misspecification.

induced by the choice of when to winsorize. In sum, the results for the traditional REM measures

suggest that misspecification exists, but is not extremely severe.

The question at this point is whether performance-matched REM measures (the non-shaded

cells in Table 2 and directly below their corresponding traditional REM measure) yield more

reliable inferences than traditional measures. From an overall perspective, Table 2’s rejection rates

indicate that performance-matched REM measures are better specified than the traditional REM

measures. Specifically, performance-matched versions of the traditional REM measures that

tended to yield over (under)-rejections, consistently decrease (increase), and more importantly, fall

within the 3.65% and 6.35% bounds. For example, focusing on the pre-estimation winsorization

results, the average rejection rate across the four performance-matched measures whose

corresponding traditional REM versions exhibited over-rejection is 5.33%, compared to 9.65% for

the corresponding traditional measures. In addition, the average rejection rate across the two

performance-matched measures whose corresponding traditional versions exhibited evidence of

under-rejection is 4.80%, compared to 1.80% for the corresponding traditional measures. Thus, on

balance, the evidence indicates that not only do performance-matched REM measures tend to mitigate

the mis-specification of their corresponding traditional REM measures, but they also typically

yield better specified tests. This inference holds regardless of the winsorization approach.

Since Table 2’s results do not vary depending on the winsorization approach, and because

the pre-estimation winsorization REM measures exhibit more desirable univariate properties (see

Table 1), all of the remaining analysis reported in the paper (i.e., all remaining tables and figures)

use REM measures based on pre-estimation winsorization.

5.2.2. Tests based on sub-samples formed on the basis of past performance or firm characteristics

Table 2’s results were based on random samples constructed from the entire population

(i.e., “All Firms” samples). We next provide evidence on how REM measures perform in samples

where firms are not randomly drawn from the entire population of firms, which is more like the

typical earnings management setting. Table 3 reports results for sub-samples of firms with: high

vs. low earnings/price (E/P) ratios, high vs. low book-to-market (B/M) ratios, high vs. low sales

growth, and large vs. small firms (market value of equity). The motivation for this analysis is that

samples in earnings management studies are often characterized by firms with extreme

performance and/or financial characteristics. Our analysis of sub-samples follows the approach

taken in Dechow et al. (1995), Kothari et al. (2005), and Dechow et al. (2012).

Panel A of Table 3 reports summary statistics for the traditional and performance-matched

REM measures for each of the various sub-samples. Means and standard deviations are measured

across the 1,000 samples of 100 firms each used in the simulations. Panel B reports each REM

measure’s Type I error rates in the full sample (“All Firms”) and in the sub-samples we study.

[Insert Table 3 here]

Panel A confirms the notion that the standard deviations of performance-matched variables

are higher than those of traditional measures, similar to the statistics reported in Table 1. To briefly

preview the findings for the Type I error rates of the various REM measures reported in Panel B,

no set of REM measures, traditional or performance-matched, is consistently well-specified across

sub-samples of recent firm performance and/or financial characteristics. Stated differently, no set

of REM measures, traditional or performance-matched, consistently exhibits acceptable Type I

error rates in these sub-samples. This finding has two important implications. First, it prevents us

from analyzing the power of the test in these sub-samples because the power of the test is only

meaningful for well-specified tests. Thus, our power of the test analysis will be restricted to

random samples constructed from the entire population (i.e., “all firm” samples). The second

implication is that this finding supports the main conclusion of our study: there is no dominant

approach to measuring abnormal REM. This leads us to recommend that future researchers report

results using both traditional and performance-matched REM measures so that readers can assess

the reliability of the inferences drawn about the magnitude and significance of the abnormal REM

documented in a given study.

Turning to the simulation results for the specification of REM measures in the subsamples,

we first focus on the rejection frequencies (Type I error rates) for the upper quartile of each partition.

In those cases, traditional REM measures exhibit a high degree of misspecification in the size, book-

to-market (B/M), and earnings-price (E/P) sub-samples. For example, in the upper B/M quartile,

five of the traditional REM measures exhibit rejection frequencies well above the 15% level. The

average rejection frequency for these five REM measures is 64.04%. Similar results are observed

in the high E/P and large firm quartiles. For example, in the high E/P quartile and large firm

quartile, four of the traditional REM measures have rejection frequencies far above the 15%

threshold (average rejection frequencies are 63.45% for high E/P and 58.23% for large firms). In

the high sales growth sub-sample, only Ab_SGA is grossly mis-specified.

Turning to the results for the lower quartile of each partition reveals evidence of over-

rejection in the low sales growth sub-sample. Specifically, two of the traditional REM measures

(Ab_PROD and Ab_DISC_EXP) have rejection rates exceeding 15% (average rejection rate for

these two is 22.5%). On the other hand, in some sub-samples such as low B/M, we observe a high

degree of under-rejections of the null, that is, very low rejection rates. For example, in the low

B/M subsample, the average rejection frequency of traditional REM measures is 0.64%.

Turning to the findings for the performance-matched REM measures reveals that, in

general, they tend to reduce, but not eliminate the misspecification in the traditional REM

measures. More specifically, rejection frequencies tend to decline in cases where the

corresponding traditional REM measure had experienced over-rejection. For example, in the high

B/M quartile, where the average rejection rate was 64.04% across the four traditional REM

measures that exhibited high over-rejection rates, the average rejection rate for the corresponding

performance-matched REM measures declines to 25.98%. While performance matching tends to

reduce the problem of over-rejection by the traditional REM measures, the results clearly show

that performance-matching does not cure mis-specification in all cases. Finally, performance-

matching also seems to correct the under-rejection problem associated with some of the traditional

REM measures; see, for example, the results for the low B/M sub-sample. However, and to be clear,

the results reveal that the performance-matched versions of the traditional REM measures do not

completely eliminate the under-rejection tendency of some of the traditional REM measures in

these sub-samples.

In summary, our results reveal that no REM measure (traditional or performance-matched)

is well-specified in each and every setting. That said, on balance, the rejection frequencies of

performance-matched REM measures are generally lower (i.e., less mis-specified) than those of

the corresponding traditional REM measures in both the “full sample” results (i.e., all firms) and

in the sub-samples of firms with extreme performance or financial characteristics. Moreover, while

there is a tendency for all REM measures to over-reject the null hypothesis, on balance,

performance-matched REM measures are less affected by over-rejection than are the traditional

REM measures. While, in some cases, under-rejection of the null hypothesis is also a problem for

traditional as well as performance-matched REM measures, on balance, the latter REM measures

are slightly less affected by under-rejections than are the traditional measures. Conclusions about

performance-matched REM measures based on their Type I error rates are subject to the following

caveat. An alternative explanation for the lower over-rejection rates of performance-matched REM

measures relative to their traditional REM counterpart is, holding all else constant, that the higher

standard deviation of performance-matched REM measures (see Panel A of Table 3) will lead to

fewer rejections of the null hypothesis because it inflates the standard error and hence shrinks the

t-statistic.

5.3 The power of the test to detect abnormal REM

5.3.1 Overview

Since performance-matched REM measures are based on differencing two variables, the

standard deviation of a performance-matched REM measure will be at least 1.4x that of the

underlying un-differenced traditional REM measure. As a result, a concern with performance-

matched REM measures is that they may sacrifice power compared to traditional (un-differenced)

REM measures. The power of test results we report in this section provide evidence on whether

better specification of performance-matched REM measures under the null hypothesis in various

settings comes at the cost of lower power to detect abnormal REM. Given the mis-specification

plaguing both traditional and performance-matched REM measures in the sub-sample results

reported in Table 3, we do not analyze the power of the test in these sub-samples. The reason is that

the power of the test is only meaningful for well-specified tests. As a result, our power of the test

analysis is based on random samples constructed from the “full sample” (i.e., “all firms”).
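The roughly 1.4x figure follows from the variance of a difference of two random variables:

```latex
\sigma^{2}_{\mathrm{PM}}
  = \operatorname{Var}\!\left(REM^{\mathrm{treat}} - REM^{\mathrm{ctrl}}\right)
  = \sigma^{2}_{\mathrm{treat}} + \sigma^{2}_{\mathrm{ctrl}}
    - 2\,\rho\,\sigma_{\mathrm{treat}}\,\sigma_{\mathrm{ctrl}}
```

With equal variances and uncorrelated treatment and control measures (ρ = 0), the standard deviation of the performance-matched measure is √2·σ ≈ 1.41σ; the factor exceeds √2 when ρ < 0 and falls below it when the matched pair is positively correlated.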

5.3.2 Power of the test to detect abnormal REM for alternative REM measures

To save space, and to more clearly illustrate the power of the test of the various REM

measures, instead of reporting tables of detailed rejection rates, we use figures to plot the power

curves of the various REM measures across seeded levels of abnormal REM ranging from 1% to

10% (below each figure we report the numerical values of the rejection rates that underlie the

corresponding power curves appearing in the figure). If a given performance-matched REM

measure sacrifices power relative to its corresponding traditional REM counterpart, the power

curve of the former will lie below that of the latter across all levels of seeded abnormal REM.

Figures 1-6 plot the power curves for traditional and performance-matched REM measures of

abnormal cash flow from operations, abnormal discretionary expenses, abnormal production costs,

abnormal R&D, abnormal SG&A and abnormal gains, respectively.

Examination of Figures 1-6 reveals the following. In Figure 1, for abnormal operating cash

flows, the power curves exhibit lower power for the performance-matched REM measure at lower

seed levels. However, the pattern flips at higher seed levels. Turning to abnormal discretionary

expenses in Figure 2, across all levels of seeded abnormal REM, the performance-matched

measure exhibits more power than its traditional REM counterpart. The power curves in Figures 3

and 5, for abnormal production and abnormal SG&A respectively, show a slight advantage for the

performance-matched REM measure at lower seed levels, but higher power for the traditional

measures for abnormal production and abnormal SG&A at seed levels of 3% or more of lagged

total assets. A similar pattern emerges in Figures 4 and 6, for abnormal R&D and abnormal gains

respectively, except that the higher power for the corresponding traditional measure kicks in for

abnormal R&D and abnormal gains at seed levels of 1% or more (instead of 3% for abnormal

production and SG&A).

In summary, on balance, at plausible levels of seeded abnormal REM of between 1% to

3%, traditional and performance-matched REM measures do not exhibit major differences in

power to detect abnormal REM. In other words, the power of the test results do not reveal that one

approach to measure REM (traditional vs. performance-matched) systematically dominates the

other. However, the power of the test results do indicate that the tendency for performance-

matched measures to be better specified under the null hypothesis (i.e., to exhibit better Type I

error rates than their corresponding traditional REM measure; see Table 2) does not come at the

cost of systematically lower power to detect plausible levels of abnormal REM of 1% to 3%. Stated

differently, the improved test specification (i.e., Type I error rates) exhibited by performance-

matched measures will not impede their ability to detect plausible levels of abnormal REM of 1%

to 3%.

Overall, the empirical evidence in Tables 2 and 3 and Figures 1-6 indicates that no single

REM measure (traditional or performance-matched) is immune from mis-specification in each and

every setting. That said, in terms of Type I error rates (i.e., test specification), performance-

matched REM measures are somewhat less prone to falsely reject a true null hypothesis (especially in samples of

firms that exhibit extreme performance or firm financial characteristics).12 Finally, the power of

the test results do not reveal that one approach to measure REM (traditional vs. performance-

matched) systematically dominates the other across a wide range of simulated settings.

6. Additional tests

We performed a battery of sensitivity tests by varying the choices underlying the baseline

simulations described in section 5.1. This section summarizes the findings (un-tabulated and

available upon request). The results reported above are for six REM measures that have been

traditionally used in the literature (Ab_CFO, Ab_PROD, Ab_DISC_EXP, Ab_R&D, Ab_SGA, and

Ab_GAIN). We also analyzed modified versions of Ab_CFO, Ab_PROD, and Ab_DISC_EXP

based on specifications developed in Gunny (2010) and Vorst (2016). Results based on these

modified REM measures yield inferences similar to those of our main tests.

12
Conclusions about performance-matched REM measures based on their Type I error rates are subject to the caveat
stated in section 5.2.2.

Our main tests were based on pre-estimation winsorization of the underlying data. Using

the other two winsorization approaches discussed in section 3.1 (i.e., post-estimation or a

combination of pre- and post-estimation) also yield inferences similar to those of our main tests.

Our main tests focused on a one-tailed alternative hypothesis of income-increasing REM. We also

tested the null hypothesis of zero REM against a two-tail alternative hypothesis of non-zero REM.

The alternative hypothesis of interest to a researcher here is simply that REM is non-zero,

implying that managers engaged in some real earnings management, irrespective of its direction.

In simulations using the “full sample” (i.e., all firms), we continue to find some evidence of over-

rejection across all REM measures, although the degree of over-rejection is not severe (average

rejection rate is 7.2%). It is also the case that rejection frequencies are attenuated when

performance-matched REM measures are used. Finally, our main tests used a nominal significance

level of 5%. None of the conclusions of our main tests change if a 1% significance level is used.

7. Conclusions

The use of measures of real earnings management (REM) to test hypotheses related to

earnings management has become commonplace in the literature. Surprisingly, there is no

systematic evidence on the properties of REM measures commonly-used in the earnings

management literature. We provide such evidence by documenting the Type I error rates and

power of the test of REM measures commonly used in the literature, as well as corresponding

performance-matched versions of these measures.

Our main findings are the following. While performance-matched REM measures are not

immune from mis-specification in all settings, in general, and subject to the caveat that their

standard deviations are higher than those of their traditional REM counterpart, performance-

matched REM measures tend to be better specified under the null hypothesis. Comparisons of the

power to detect plausible levels (1% to 3% of total assets) of abnormal REM reveal that neither
approach, traditional or performance-matched, is consistently more powerful than the other in

terms of detecting abnormal REM. The absence of a dominant approach to measure abnormal

REM leads us to recommend that future researchers report results using both traditional and

performance-matched measures so that readers are able to clearly assess the reliability of the

inferences drawn about the magnitude and significance of the abnormal REM documented in a

given study.

Our study contributes to the earnings management literature in the following ways. As the

first study to systematically document the properties (i.e., Type I error rates and power of the test)

of the traditional REM measures used in prior studies, our evidence facilitates a keener

appreciation of the reliability of the inferences prior studies have drawn about REM. Second, our

comparison of the Type I error rates and power of the test for traditional and performance-matched

REM measures provides a useful guide to future researchers when evaluating the trade-offs

between Type I and Type II error rates to decide which REM measures to use in a given specific

setting. A fruitful avenue for future research would be to develop better theoretical and empirical

models of expected (normal) real activities.

Appendix

Real Earnings Management (REM) Variable Definitions and Measurement Procedures

We obtain abnormal REM measures by subtracting the expected value of each REM measure based on the
underlying expectation model from the actual value of the real activity measure (e.g., cash flow from
operations, R&D, SG&A, etc.) reported on COMPUSTAT (or the value calculated using COMPUSTAT
data, e.g., production costs = COGS + ΔINV). Following prior research, we estimate model parameters
using annual data at the two-digit SIC code for all industries with at least 15 observations in a given year.
REM expectation models and the resulting abnormal REM measures are:

A. REM measures used in prior REM research:

1) Ab_CFO is abnormal cash flow from operations (see Roychowdhury, 2006), computed by estimating the
following model of expected CFO (by industry and year):

CFO_{i,t} / Assets_{i,t-1} = k_0 + k_1 (1 / Assets_{i,t-1}) + k_2 (SALES_{i,t} / Assets_{i,t-1}) + k_3 (ΔSALES_{i,t} / Assets_{i,t-1}) + ε_{i,t}    (A.1)

where CFO is cash flow from operations, SALES is annual sales and Assets is total assets. Ab_CFO are the
residuals from model (A.1).
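As an illustration, the residuals from model (A.1) can be computed with ordinary least squares by industry-year as follows (variable names are ours; this is a sketch rather than our exact estimation code):

```python
import numpy as np

def ab_cfo(cfo, sales, dsales, lag_assets, industry, year, min_obs=15):
    """Ab_CFO per model (A.1): within each industry-year cell with at
    least `min_obs` firms, regress CFO/Assets on an intercept,
    1/Assets, SALES/Assets, and dSALES/Assets; keep the residuals."""
    cfo, sales, dsales, lag_assets = map(
        np.asarray, (cfo, sales, dsales, lag_assets))
    y = cfo / lag_assets
    X = np.column_stack([np.ones_like(y), 1.0 / lag_assets,
                         sales / lag_assets, dsales / lag_assets])
    resid = np.full(len(y), np.nan)
    for key in set(zip(industry, year)):
        idx = np.flatnonzero([(i, t) == key
                              for i, t in zip(industry, year)])
        if len(idx) < min_obs:
            continue  # too few observations to estimate the model
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        resid[idx] = y[idx] - X[idx] @ beta
    return resid
```

The same pattern applies to models (A.2) through (A.6) with the appropriate regressors.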

2) Ab_DISC_EXP is abnormal discretionary expenses (see Roychowdhury, 2006), computed by estimating


the following model of expected DISC_EXP (by industry and year):

DISC_EXP_{i,t} / Assets_{i,t-1} = k_0 + k_1 (1 / Assets_{i,t-1}) + k_2 (SALES_{i,t-1} / Assets_{i,t-1}) + ε_{i,t}    (A.2)

where DISC_EXP is discretionary expenses during the year defined as the sum of advertising, R&D, and
SG&A expenses, and all other variables are as previously defined. Ab_DISC_EXP are the residuals from
model (A.2).

3) Ab_PROD is abnormal production costs (see Roychowdhury, 2006), computed by estimating the
following model of expected PROD (by industry and year):

PROD_{i,t} / Assets_{i,t-1} = k_0 + k_1 (1 / Assets_{i,t-1}) + k_2 (SALES_{i,t} / Assets_{i,t-1}) + k_3 (ΔSALES_{i,t} / Assets_{i,t-1}) + k_4 (ΔSALES_{i,t-1} / Assets_{i,t-1}) + ε_{i,t}    (A.3)

where PROD is production costs defined as the sum of costs of goods sold (COGS) and change in inventory
during the year, and all other variables are as previously defined. Ab_PROD are the residuals from model
(A.3).

4) Ab_R&D is abnormal research and development costs (see Gunny, 2010), computed by estimating the
following model of expected R&D (by industry and year):

$$\frac{R\&D_{it}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2 MV_t + k_3 Q_t + k_4\frac{INT_{it}}{Assets_{i,t-1}} + k_5\frac{R\&D_{i,t-1}}{Assets_{i,t-1}} + \varepsilon_{it}, \qquad (A.4)$$



where R&D is R&D expense, MV is the natural logarithm of the market value of equity (outstanding shares
times stock price), Q is Tobin’s Q [= (market value of equity + book value of preferred stock + book value
of long-term debt + debt in current liabilities) / total assets], INT is internally generated funds (the sum of
net income before extraordinary items, R&D expense, and depreciation and amortization), and all other
variables are as previously defined. Ab_R&D are the residuals from model (A.4).
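The Tobin’s Q and INT constructions above reduce to simple arithmetic on COMPUSTAT items; a minimal sketch (function names hypothetical):

```python
def tobins_q(mve, pref_bv, ltd_bv, curr_debt, total_assets):
    """Tobin's Q as defined above: (market value of equity + book value of
    preferred stock + book value of long-term debt + debt in current
    liabilities) / total assets."""
    return (mve + pref_bv + ltd_bv + curr_debt) / total_assets

def internal_funds(ni_before_xi, rd_expense, dep_amort):
    """INT: internally generated funds = net income before extraordinary
    items + R&D expense + depreciation and amortization."""
    return ni_before_xi + rd_expense + dep_amort
```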

5) Ab_SGA is abnormal selling, general and administrative costs (see, Gunny, 2010), computed by
estimating the following model of expected SGA (by industry and year):

$$\frac{SGA_{it}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2 MV_t + k_3 Q_t + k_4\frac{INT_{it}}{Assets_{i,t-1}} + k_5\frac{\Delta SALES_{it}}{Assets_{i,t-1}} + k_6\frac{\Delta SALES_{it}}{Assets_{i,t-1}}\times DD + \varepsilon_{it}, \qquad (A.5)$$

where SGA is SG&A expense, ΔSALES is change in annual sales, and MV, Q and INT were defined above.
DD is an indicator variable equal to 1 when total sales decrease from year t-1 to t, and zero otherwise, and
all other variables are as previously defined. Ab_SGA are the residuals from model (A.5).

6) Ab_GAIN is abnormal gains (see, Gunny, 2010) computed by estimating the following model (by
industry and year):

$$\frac{GAIN_{it}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2 MV_t + k_3 Q_t + k_4\frac{INT_{it}}{Assets_{i,t-1}} + k_5\frac{ASALES_{it}}{Assets_{i,t-1}} + k_6\frac{ISALES_{it}}{Assets_{i,t-1}} + \varepsilon_{it}, \qquad (A.6)$$

where GAIN is gain from asset sales (times -1), ASALES is long-lived assets sales, ISALES is long-lived
investment sales, and all other variables are as previously defined. Ab_GAIN are the residuals from model
(A.6).

B. Modified REM measures:

The modified REM measures are based on the refinements suggested in Gunny (2010) and Vorst (2016),
which include indicator variables for a decline in sales.

7) Ab_CFO_MOD is abnormal cash from operations where the underlying model is modified to include a
separate explanatory variable to capture the effect of a decline in sales, estimated using the following model
of expected CFO (by industry and year):

$$\frac{CFO_{i,t}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2\frac{SALES_{i,t}}{Assets_{i,t-1}} + k_3\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}} + k_4\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}}\times DD + \varepsilon_{i,t}, \qquad (A.7)$$

where CFO is cash flow from operations, SALES is annual sales, ΔSALES is change in annual sales, Assets
is total assets, and DD is an indicator variable set to 1 when sales have declined from year t-1 to t, and zero
otherwise. Ab_CFO_MOD are the residuals from model (A.7).
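Relative to model (A.1), the only change is the DD interaction term. A sketch of the augmented regressor matrix for (A.7), with DD coded as ΔSALES < 0 (function name hypothetical):

```python
import numpy as np

def design_matrix_a7(sales, dsales, lag_assets):
    """Build the regressor matrix for the modified CFO model (A.7).

    DD = 1 when sales declined (ΔSALES < 0); the k4 column interacts DD
    with scaled ΔSALES, letting the slope differ for sales declines."""
    sales, dsales, lag_assets = map(np.asarray, (sales, dsales, lag_assets))
    dd = (dsales < 0).astype(float)          # sales-decline indicator
    return np.column_stack([
        np.ones(sales.size),                 # k0
        1.0 / lag_assets,                    # k1
        sales / lag_assets,                  # k2
        dsales / lag_assets,                 # k3
        dd * dsales / lag_assets,            # k4: DD interaction
    ])
```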

8) Ab_DISC_EXP_MOD is abnormal discretionary expenses where the underlying model is modified to
include a separate explanatory variable to capture the effect of a decline in sales, estimated using the
following model of expected DISC_EXP (by industry and year):

$$\frac{DISC\_EXP_{i,t}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2\frac{SALES_{i,t}}{Assets_{i,t-1}} + k_3\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}} + k_4\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}}\times DD + \varepsilon_{i,t}, \qquad (A.8)$$



where DISC_EXP is discretionary expenses during the year defined as the sum of advertising, R&D, and
SG&A expenses and all other variables are as previously defined. Ab_DISC_EXP_MOD are the residuals
from model (A.8).

9) Ab_PROD_MOD is abnormal production costs (see, Roychowdhury, 2006) where the underlying model
is modified to include a separate explanatory variable to capture the effect of a decline in sales, estimated
using the following model of expected PROD (by industry and year):

$$\frac{PROD_{i,t}}{Assets_{i,t-1}} = k_0 + k_1\frac{1}{Assets_{i,t-1}} + k_2\frac{SALES_{i,t}}{Assets_{i,t-1}} + k_3\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}} + k_4\frac{\Delta SALES_{i,t-1}}{Assets_{i,t-1}} + k_5\frac{\Delta SALES_{i,t}}{Assets_{i,t-1}}\times DD + k_6\frac{\Delta SALES_{i,t-1}}{Assets_{i,t-1}}\times DD + \varepsilon_{i,t}, \qquad (A.9)$$

where PROD is production costs defined as the sum of costs of goods sold (COGS) and change in
inventory during the year and all other variables are as previously defined. Ab_PROD_MOD are the
residuals from model (A.9).

C. Performance-matched REM Measures:

We match treatment firms to control firms based on return on assets (ROA), where ROA is defined as
income before extraordinary items divided by lagged total assets. Each treatment firm (i) is matched to a
performance-matched control firm (j) in the same two-digit SIC code whose ROA is within ±10% of the
treatment firm’s ROA. We then define the difference between the REM measure of the treatment firm and the REM
measure of the control firm to be the resulting performance-matched REM measure. Using abnormal CFO
as an example: PM_CFOi,t = Ab_CFOi,t - Ab_CFOj,t. We define performance-matched measures for each
of the REM variables described above (i.e., Ab_CFO, Ab_PROD, Ab_DISC_EXP, Ab_R&D, Ab_SGA, and
Ab_GAIN).
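A minimal sketch of this matching step, assuming firm records are held as plain dicts and reading the ±10% band as 10 percentage points of ROA (an interpretation; the text leaves the metric ambiguous). The function name and tie-breaking by closest ROA are assumptions, not the paper’s stated procedure:

```python
def performance_match(treatment, candidates, band=0.10):
    """Match a treatment firm to the control firm in the same two-digit
    SIC industry with the closest ROA, requiring |ROA_j - ROA_i| <= band.

    Returns (control, pm_measure) or (None, None) when no candidate
    falls inside the band."""
    same_ind = [c for c in candidates
                if c["sic2"] == treatment["sic2"]
                and abs(c["roa"] - treatment["roa"]) <= band]
    if not same_ind:
        return None, None
    control = min(same_ind, key=lambda c: abs(c["roa"] - treatment["roa"]))
    # PM_CFO_{i,t} = Ab_CFO_{i,t} - Ab_CFO_{j,t}
    return control, treatment["ab_cfo"] - control["ab_cfo"]
```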

D. Variables used to form sub-samples reflecting recent firm performance or financial characteristics:

Sales Growth: (Sales – lagged Sales)/lagged Sales.

Market Value of Equity: Fiscal year-end stock price multiplied by common shares outstanding.

Book-Market: Book value of common equity divided by the market value of equity.

Earnings/Price: Diluted earnings per share excluding extraordinary items divided by the fiscal year-end
stock price.
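These definitions translate directly into code; a minimal sketch (function names hypothetical):

```python
def sales_growth(sales, lag_sales):
    """(Sales - lagged Sales) / lagged Sales."""
    return (sales - lag_sales) / lag_sales

def market_value_of_equity(price, shares):
    """Fiscal year-end stock price times common shares outstanding."""
    return price * shares

def book_to_market(book_equity, mve):
    """Book value of common equity over market value of equity."""
    return book_equity / mve

def earnings_to_price(diluted_eps_ex_xi, price):
    """Diluted EPS excluding extraordinary items over fiscal year-end price."""
    return diluted_eps_ex_xi / price
```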



References
Alissa, W., S.B. Bonsall, K. Koharki, and M.W. Penn. 2013. Firms’ use of accounting discretion
to influence their credit ratings. Journal of Accounting and Economics 59, 129-147.

Baber, W., P. Fairfield, and J. Haggard. 1991. The effect of concern about reported income on
discretionary spending decisions: The case of research and development. The Accounting
Review 66 (4): 818-829.

Badertscher, B. 2011. Overvaluation and the choice of alternative earnings management
mechanisms. The Accounting Review 86 (5): 1491-1518.

Barber, B., and J. Lyon. 1996. Detecting abnormal operating performance: The empirical power
and specification of test statistics. Journal of Financial Economics 41, 359-399.

Bartov, E. 1993. The timing of asset sales and earnings manipulation. The Accounting Review 68
(4): 840-855.

Bens, D., V. Nagar, D. Skinner, and F. Wong. 2003. Employee stock options, EPS dilution, and
stock repurchases. Journal of Accounting & Economics 36 (1-3): 51-90.

Bens, D., V. Nagar, and F. Wong. 2002. Real investment implications of employee stock option
exercises. Journal of Accounting Research 40 (2): 359-406.

Bernard, V. 1987. Cross-sectional dependence and problems in inference in market-based
accounting research. Journal of Accounting Research 25 (1): 1-48.

Brown, S., and J. Warner. 1980. Measuring security price performance. Journal of Financial
Economics 8, 205-258.

Brown, S., and J. Warner. 1985. Using daily stock returns: The case of event studies. Journal of
Financial Economics 14 (1), 3-31.

Bushee, B. 1998. The influence of institutional investors on myopic R&D investment behavior.
The Accounting Review 73 (3): 305-333.

Campbell, C. and C. Wasley. 1993. Measuring security price performance using daily NASDAQ
returns. Journal of Financial Economics 33, 73-92.

Cheng, Q., J. Lee, and T. Shevlin. 2016. Internal Governance and Real Earnings Management.
The Accounting Review 91(4): 1051-1085

Cohen, D., A. Dey, and T. Lys, 2008. Real and Accrual-based Earnings Management in the Pre-
and Post-Sarbanes Oxley Periods. The Accounting Review 83(3): 757-787.

Cohen, D., R. Mashruwala, and T. Zach. 2010. The Use of Advertising Activities to Meet
Earnings Benchmarks: Evidence from Monthly Data. Review of Accounting Studies 15(4):
808-832.



Cohen, D., and P. Zarowin. 2010. Accrual-Based and Real Earnings Management Activities
around Seasoned Equity Offerings. Journal of Accounting & Economics 50 (1): 2-19.

Dechow, P., W. Ge, and C. Schrand. 2010. Understanding earnings quality: A review of the
proxies, their determinants and their consequences. Journal of Accounting & Economics
50 (2/3), 344-401.

Dechow, P., A. Hutton, J.H. Kim, and R. Sloan. 2012. Detecting earnings management: A new
approach. Journal of Accounting Research 50 (2): 275-334.

Dechow, P., S.P. Kothari, and R. Watts. 1998. The relation between earnings and cash flows.
Journal of Accounting and Economics 25 (2): 133-169.

Dechow, P.M. and D. Skinner. 2000. Earnings management: Reconciling the views of
accounting academics, practitioners, and regulators. Accounting Horizons 14 (2): 235-250.

Dechow, P., Sloan, R., and A. Sweeney. 1995. Detecting earnings management. The Accounting
Review 70, 193-225.

Dechow, P. and R. Sloan. 1991. Executive incentives and the horizon problem. Journal of
Accounting and Economics 14 (1): 51-89.

Eldenburg, L., Gunny, K., Hee, K., Soderstrom, N. 2011. Earnings Management using real
activities: evidence from nonprofit hospitals, The Accounting Review 86, 1605-1630.

Fields, T., T. Lys, and L. Vincent. 2001. Empirical research on accounting choice. Journal of
Accounting and Economics 31 (1-3), pp. 255-307.

Graham, J., C. Harvey, and S. Rajgopal. 2005. The economic implications of corporate financial
reporting. Journal of Accounting & Economics 40 (1-3): 3-73.

Gunny, K. 2010. The relation between earnings management using real activities manipulation
and future performance: Evidence from meeting earnings benchmarks. Contemporary
Accounting Research 27 (2), 855-888.

Healy, P., and J. Wahlen. 1999. A review of the earnings management literature and its
implications for standard setting. Accounting Horizons 13, 365-383.

Kim, Y. and M.S. Park. 2014. Real activity manipulations and auditors’ client-retention decision.
The Accounting Review 89 (1): 367-401.

Kothari, S. P. 2001. Capital markets research in accounting, Journal of Accounting & Economics
31, 105-231.

Kothari, S. P., Leone, A., and C. Wasley. 2005. Performance matched discretionary accrual
measures. Journal of Accounting & Economics 39 (1), 163-197.

Kothari, S. P., and C. Wasley. 1989. Measuring security price performance in size clustered
samples. The Accounting Review 64, 228-249.



McGuire, S., T. Omer, and N. Sharp. 2012. The impact of religion on financial reporting
irregularities. The Accounting Review 87, 645-673.

McInnis, J., and D. Collins. 2011. The effect of cash flow forecasts on accrual quality and
benchmark beating. Journal of Accounting & Economics 51(3), 219-239.

Roychowdhury, S. 2006. Earnings management through real activities manipulation. Journal of
Accounting & Economics 42 (3), 335-370.

Skinner, D. 1993. The investment opportunity set and accounting procedure choice: Preliminary
evidence. Journal of Accounting and Economics 16(4), 407-445.

Smith, C., and R. Watts. 1992. The investment opportunity set and corporate financing, dividend,
and compensation policies. Journal of Financial Economics 32(3), 263-292.

Srivastava, A. 2019. Improving the measures of real earnings management. Forthcoming in the
Review of Accounting Studies.

Thomas, J., and H. Zhang. 2002. Inventory changes and future returns. Review of Accounting
Studies 7 (2-3): 163-187.

Vorst, P., 2016. Real Earnings Management and Long-Term Operating Performance: The Role
of Reversals in Discretionary Investment Cuts. The Accounting Review 91 (4): 1219-1256.

Zang, A. 2012. Evidence on the tradeoff between real activities manipulation and accrual-based
earnings management. The Accounting Review 87(2) 675-703.



Figure 1 Power of the test: traditional and performance-matched abnormal cash flow from operations
The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal cash flows from operations (CFO)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 2 Power of the test: traditional and performance-matched abnormal discretionary expenses

The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal discretionary expenses (DISC_EXP)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 3 Power of the test: traditional and performance-matched abnormal production costs

The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal production (PROD)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 4 Power of the test: traditional and performance-matched abnormal R&D

The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal research & development expenditure (R&D)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 5 Power of the test: traditional and performance-matched abnormal SG&A
The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal SG&A expenses (SGA)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
Figure 6 Power of the test: traditional and performance-matched abnormal gain

The figure depicts the rejection frequencies of traditional and performance-matched REM measures for 1,000 simulated samples of 100 firms each drawn from the
general population of all firms. Along the x-axis rejection frequencies are reported for increasingly larger (1% to 10% of total assets) induced abnormal performance.

Abnormal gain (GAIN)


(Rejection rates along the y-axis and seeded level of abnormal REM along the x-axis)
TABLE 1
Summary descriptive statistics for traditional and performance-matched measures of abnormal real earnings management (REM)

                                 Pre-estimation Winsorization      Post-estimation Winsorization     Pre- and Post-estimation Winsorization
REM Measure        N             Mean     Median    Std. Dev.      Mean     Median    Std. Dev.      Mean     Median    Std. Dev.

Ab_CFO 204,353 0.000 0.037 0.520 0.023*** 0.039*** 0.820 0.002 0.037 0.390
PM_CFO 178,831 -0.001 0.000 0.650 -0.000 0.000 1.060 -0.001 0.000 0.490
Ab_PROD 199,954 0.000 -0.014 0.400 -0.022*** -0.020*** 0.450 -0.008*** -0.014 0.280
PM_PROD 174,036 -0.000 0.000 0.520 0.001 0.000 0.600 0.000 0.000 0.390
Ab_DISC_EXP 187,727 0.000 -0.070 0.950 -0.085*** -0.076*** 2.380 -0.007 -0.071 0.660
PM_DISC_EXP 161,982 0.000 0.000 1.230 0.003 0.000 2.520 0.001 0.000 0.890
Ab_R&D 96,225 0.000 -0.002 0.130 -0.003*** -0.002*** 0.240 -0.001 -0.002 0.110
PM_R&D 82,588 0.000 0.000 0.190 0.001 0.000 0.330 0.000 0.000 0.170
Ab_SGA 90,325 0.000 -0.031 0.450 -0.016*** -0.0339*** 0.630 -0.004 -0.031 0.370
PM_SGA 77,074 0.002 0.001 0.630 0.000 0.001 0.780 0.001 0.001 0.500
Ab_GAIN 76,334 0.000 -0.001 0.010 -0.001*** -0.001*** 0.030 -0.001 -0.001 0.010
PM_GAIN 65,202 0.000 0.000 0.020 -0.001 0.000 0.030 0.000 0.000 0.010

Summary statistics are reported for three different winsorization approaches: (i) when the independent variables in models of expected (normal) real activities are winsorized prior to
model estimation (‘pre-estimation’ winsorization); (ii) when the independent variables in models of expected (normal) real activities are not winsorized, but rather the resulting estimated
REM measures are winsorized (‘post-estimation’ winsorization) and (iii) when the independent variables in models of expected (normal) real activities are winsorized prior to model
estimation and the resulting estimated REM measures are also winsorized (‘pre- and post-estimation winsorization’). The table reports tests for differences in means and medians for
comparisons between different winsorization scenarios. The benchmark sample for which mean and median differences are compared to is the ‘pre-estimation’ winsorization sample. The
means and medians for this benchmark sample are compared to those for the ‘post-estimation’ and ‘pre- and post-estimation winsorization’ samples. *** by a mean or median value
indicates that mean or median value is significantly different from the benchmark sample’s mean or median at the 1% level. Mean and median numbers in bold denote values that
themselves are statistically different from zero. See Appendix for variable definitions and estimation methods.
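The three winsorization timings differ only in where a clamping step like the following is applied: (i) to the regressors before estimation, (ii) to the residuals after estimation, or (iii) to both. The 1%/99% cut-offs below are an assumption, since the table notes do not state them:

```python
import numpy as np

def winsorize(x, lower=0.01, upper=0.99):
    """Clamp values outside the given sample quantiles.

    Cut-offs are assumed at the 1st and 99th percentiles; the paper's
    notes describe when winsorization is applied, not the cut-offs."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [lower, upper])
    return np.clip(x, lo, hi)
```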

TABLE 2
Type I error rates for traditional and performance-matched measures of abnormal real earnings management (REM): All firms

REM Measure        (Pre-estimation winsorization)     (Post-estimation winsorization)     (Pre- and post-estimation winsorization)

Ab_CFO 8.8% 12.0% 9.0%


PM_CFO 6.0% 6.8% 5.9%
Ab_PROD 2.5% 1.4% 2.2%
PM_PROD 5.3% 4.5% 4.9%
Ab_DISC_EXP 11.4% 12.2% 11.5%


PM_DISC_EXP 4.8% 4.5% 5.2%
Ab_R&D 9.7% 7.8% 9.4%
PM_R&D 5.3% 3.5% 6.3%
Ab_SGA 8.7% 9.5% 8.9%
PM_SGA 5.2% 6.5% 5.2%
Ab_GAIN 1.1% 0.6% 0.9%
PM_GAIN 4.3% 4.7% 4.3%

Mean Over-rejection:
Traditional REM Measures 9.65% 10.38% 9.70%
PM REM Measures 5.33% 5.33% 5.65%

Mean Under-rejection:
Traditional REM Measures 1.80% 1.00% 1.55%
PM REM Measures 4.80% 4.60% 4.60%

The table reports rejection rates (Type I error rates) for traditional and performance-matched measures of abnormal real earnings management (REM)
for random samples drawn from the “full sample” (i.e., all firms). Rejection rates correspond to the percentage of 1,000 random samples of 100 firms
each where the null hypothesis of mean zero abnormal REM is rejected in favor of the alternative hypothesis of positive (i.e., income-increasing) REM
at the 5% level (one-tailed t-test). With 1,000 samples, the 95% confidence interval for the (theoretical) nominal 5% significance level of the test
ranges from 3.65% to 6.35%. Rejection rates in italic (bold) are those that fall below the lower threshold of 3.65% (above the upper threshold of
6.35%). Results are reported for three alternative approaches to winsorize the data: pre-estimation, post-estimation, and pre- and post-estimation.
Under ‘pre-estimation’ winsorization the independent variables in models of expected (normal) real activities are winsorized prior to model estimation.
Under ‘post-estimation’ winsorization the independent variables in models of expected (normal) real activities are not winsorized, but rather the
resulting estimated REM measures are winsorized. Under ‘pre- and post-estimation winsorization’ the independent variables in models of expected
(normal) real activities are winsorized prior to model estimation and the resulting estimated REM measures are also winsorized. Variables with the
“Ab” prefix are traditional REM measures and those with a “PM” prefix are performance-matched REM measures. See Appendix for variable
definitions and estimation methods.
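The confidence interval quoted above is the binomial 95% band around the nominal 5% level. A sketch (all names hypothetical) that reproduces the 3.65%–6.35% band and runs a scaled-down version of the simulation under a well-specified null:

```python
import numpy as np

T_CRIT_99 = 1.6604  # approx. one-tailed 5% critical value for t with 99 df

def rejection_rate(n_samples=1000, n_firms=100, seed=0):
    """Monte Carlo check of the Type I error rate: draw samples of
    'abnormal REM' from a mean-zero null and count one-tailed t-test
    rejections at the 5% level."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_samples):
        x = rng.normal(0.0, 0.05, n_firms)               # mean-zero null
        t = x.mean() / (x.std(ddof=1) / np.sqrt(n_firms))
        rejections += t > T_CRIT_99
    return rejections / n_samples

def nominal_ci(p=0.05, n=1000, z=1.96):
    """Binomial 95% CI around the nominal significance level, i.e.
    p +/- z * sqrt(p(1-p)/n); reproduces the 3.65%-6.35% band."""
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half
```

With a correctly specified measure, rejection_rate should land inside nominal_ci(); the tables flag measures whose rates fall outside it.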

TABLE 3

Panel A: Summary statistics for traditional and performance-matched measures of abnormal real earnings management (REM): Full sample (“All Firms”) and sub-
samples of firms formed on the basis of recent firm performance or firm financial characteristics

                                                      Sales Growth      Market Value of Equity    Book-Market         Earnings/Price
REM Measure        Statistic (N=1,000)   All Firms    Low       High    Low          High         Low       High      Low       High
Ab_CFO Mean -0.0025 -0.0485 0.0085 -0.0638 0.1041 -0.0157 0.0251 -0.1888 0.0803
Ab_CFO Std. Dev. 0.0513 0.0455 0.0582 0.0714 0.0219 0.0612 0.0196 0.0667 0.0226
PM_CFO Mean -0.0021 0.0081 0.0029 0.0470 0.0187 -0.0171 0.0143 -0.0050 -0.0137
PM_CFO Std. Dev. 0.0642 0.0674 0.0689 0.0921 0.0301 0.0735 0.0344 0.0914 0.0258
Ab_PROD Mean 0.0025 0.0349 -0.0115 0.0236 -0.0418 -0.0467 0.0275 0.0833 -0.0202
Ab_PROD Std. Dev. 0.0385 0.0411 0.0464 0.0540 0.0237 0.0462 0.0211 0.0545 0.0260
PM_PROD Mean -0.0054 0.0104 -0.0082 -0.0151 -0.0095 -0.0226 0.0201 0.0111 0.0305
PM_PROD Std. Dev. 0.0524 0.0567 0.0565 0.0726 0.0325 0.0603 0.0327 0.0749 0.0342
Ab_DISC_EXP Mean 0.0018 -0.0880 0.1746 0.0094 -0.0992 0.1289 -0.1206 0.1796 -0.1071
Ab_DISC_EXP Std. Dev. 0.0955 0.0792 0.1070 0.1281 0.0469 0.0982 0.0354 0.1260 0.0416
PM_DISC_EXP Mean 0.0062 -0.1755 0.1534 -0.1068 -0.0102 0.1066 -0.0774 -0.0230 -0.0495
PM_DISC_EXP Std. Dev. 0.1184 0.1218 0.1280 0.1631 0.0605 0.1336 0.0678 0.1745 0.0488
Ab_R&D Mean -0.0017 -0.0022 0.0039 -0.0066 -0.0075 0.0111 -0.0115 0.0122 -0.0071
Ab_R&D Std. Dev. 0.0130 0.0133 0.0165 0.0156 0.0069 0.0166 0.0059 0.0188 0.0052
PM_R&D Mean -0.0016 -0.0078 0.0042 -0.0148 -0.0014 0.0093 -0.0143 -0.0019 -0.0024
PM_R&D Std. Dev. 0.0179 0.0215 0.0214 0.0244 0.0109 0.0221 0.0120 0.0281 0.0072
Ab_SGA Mean -0.0025 0.0097 -0.0457 -0.0100 -0.0137 0.0538 -0.0617 -0.0278 0.0138
Ab_SGA Std. Dev. 0.0444 0.0430 0.0522 0.0606 0.0246 0.0466 0.0251 0.0574 0.0324
PM_SGA Mean -0.0013 0.0156 -0.0538 -0.0115 -0.0148 0.0526 -0.0590 -0.0236 -0.0187
PM_SGA Std. Dev. 0.0588 0.0637 0.0656 0.0806 0.0377 0.0638 0.0413 0.0874 0.0396
Ab_GAIN Mean -0.0001 0.0008 -0.0006 0.0001 0.0001 -0.0003 0.0002 -0.0006 0.0022
Ab_GAIN Std. Dev. 0.0014 0.0019 0.0013 0.0017 0.0012 0.0014 0.0013 0.0015 0.0019
PM_GAIN Mean -0.0001 0.0010 -0.0006 -0.0001 0.0000 -0.0003 0.0004 0.0001 0.0010
PM_GAIN Std. Dev. 0.0019 0.0024 0.0020 0.0022 0.0019 0.0020 0.0020 0.0021 0.0023

Panel B: Type I error rates for traditional and performance-matched measures of abnormal real earnings management (REM): Full sample (“All Firms”) and sub-samples
of firms formed on the basis of recent firm performance or firm financial characteristics

                                Sales Growth      Market Value of Equity    Book-Market        Earnings/Price     Average Rejection
REM Measure        All Firms    Low       High    Low          High         Low       High     Low       High     Frequency


Ab_CFO 8.8% 0.1% 12.7% 0.5% 95.5% 6.8% 47.8% 0.0% 94.5% 29.63%
PM_CFO 6.0% 4.8% 6.6% 11.7% 23.4% 3.1% 6.6% 4.5% 0.9% 7.51%
Ab_PROD 2.5% 18.0% 1.1% 7.6% 0.0% 0.0% 33.4% 46.6% 0.9% 12.23%
PM_PROD 5.3% 7.4% 3.2% 3.8% 1.7% 1.5% 15.4% 7.6% 21.7% 7.51%
Ab_DISC_EXP 11.4% 39.9% 0.2% 6.0% 76.3% 0.2% 92.5% 0.2% 84.2% 34.54%
PM_DISC_EXP 4.8% 46.3% 0.3% 15.6% 7.5% 0.2% 37.2% 5.8% 28.1% 16.20%
Ab_R&D 9.7% 9.6% 4.6% 13.4% 42.4% 1.9% 68.1% 1.2% 51.6% 22.50%
PM_R&D 5.3% 11.2% 2.2% 16.5% 7.8% 2.1% 30.7% 5.7% 12.6% 10.46%
Ab_SGA 8.7% 3.9% 28.6% 8.7% 18.7% 0.4% 78.4% 16.8% 3.9% 18.68%
PM_SGA 5.2% 2.3% 21.5% 6.7% 10.6% 0.8% 40.0% 7.6% 14.2% 12.10%
Ab_GAIN 1.1% 4.3% 0.2% 1.2% 1.3% 0.7% 1.7% 0.5% 23.5% 3.83%
PM_GAIN 4.3% 11.6% 2.6% 4.4% 6.2% 3.9% 6.8% 3.7% 10.8% 6.03%

Mean Over-rejection:
Traditional REM Measures 9.65% 22.50% 20.65% 9.90% 58.23% 6.80% 64.04% 31.70% 63.45%
PM REM Measures 5.33% 21.63% 14.05% 9.00% 12.33% 3.10% 25.98% 7.60% 13.10%

Mean Under-rejection:
Traditional REM Measures 1.80% 0.10% 0.50% 0.85% 0.65% 0.64% 1.70% 0.48% 0.90%
PM REM Measures 4.80% 4.80% 2.03% 8.05% 3.95% 1.70% 6.80% 4.93% 21.70%

Means and standard deviations are measured across 1,000 samples of 100 firms each for traditional and performance-matched measures of abnormal real earnings
management (REM) for the full sample (“All Firms”) and for random samples drawn from the top and bottom quartiles of sales growth, market value of equity, book-to-
market ratio, and earnings-to-price ratio. Variables with the “Ab” prefix are traditional REM measures and those with a “PM” prefix are performance-matched REM
measures. See Appendix for variable definitions and estimation methods.

Rejection rates (Type I error rates) for traditional and performance-matched measures of abnormal real earnings management (REM) for random samples drawn from the
top and bottom quartiles of sales growth, market value of equity, book-to-market ratio, and earnings-to-price ratio. Rejection rates correspond to the percentage of 1,000
random samples of 100 firms each where the null hypothesis of mean zero abnormal REM is rejected in favor of the alternative hypothesis of positive (i.e., income-
increasing) REM at the 5% level (one-tailed t-test). With 1,000 samples, the 95% confidence interval for the (theoretical) nominal 5% significance level of the test ranges
from 3.65% to 6.35%. Rejection rates in italic (bold) are those that fall below the lower threshold of 3.65% (above the upper threshold of 6.35%). Results are reported
for pre-estimation winsorization, under which the independent variables in models of expected (normal) real activities are winsorized prior to model estimation. See
Appendix for variable definitions and estimation methods.
