
THE ACCOUNTING REVIEW
Vol. 93, No. 5
September 2018
pp. 223–244
DOI: 10.2308/accr-52005
American Accounting Association

Custom Contrast Testing: Current Trends and a New Approach

Ryan D. Guggenmos
Cornell University

M. David Piercey
Christopher P. Agoglia
University of Massachusetts Amherst
ABSTRACT: Contrast analysis has become prevalent in experimental accounting research since Buckless and
Ravenscroft (1990) introduced it to the accounting literature over 25 years ago. Since its initial introduction, the scope
of contrast testing has expanded, yet guidance as to the most appropriate methods of specifying, conducting,
interpreting, and exhibiting these tests has not. We survey the use of contrast analysis in the recent literature and
propose a three-part testing approach that provides a more comprehensive picture of contrast results. Our approach
considers three pieces of complementary evidence: the visual evaluation of fit, traditional significance testing, and
quantitative evaluation of the contrast variance residual. Our measure of the contrast variance residual, q², is
proposed for the first time in this work. After proposing our approach, we walk through six common contrast testing
scenarios where current practices may fall short and our approach may guide researchers. We extend Buckless and
Ravenscroft (1990) and contribute to the accounting research methods literature by documenting current contrast
analysis practices that result in elevated Type I error and by proposing a potential solution to mitigate these
concerns.
Keywords: contrast testing; statistical analysis; ANOVA; experimental methods.

I. INTRODUCTION

A little more than 25 years ago, Buckless and Ravenscroft (1990) introduced contrast testing to the accounting literature
and demonstrated that contrast testing provides a more powerful means to test for a traditional ordinal interaction than
the ANOVA interaction term. From these beginnings, contrast testing has become a common technique in
experimental accounting research, while remaining relatively rare in other disciplines (e.g., Furr and Rosenthal 2003, 45–47;
Loftus 1996). In the time since this journal published Buckless and Ravenscroft’s (1990) seminal work, the scope and
complexity of contrast testing in accounting research have greatly expanded. While the [+3, −1, −1, −1] contrast coding
discussed by these authors certainly remains in use, researchers have increasingly turned to custom contrast weightings to test
specific patterns of means.
In our work, we present a review of all papers published in the top six accounting journals from January 2010 to December
2016. Our review reveals that 124 papers report the results of contrast analyses during this time period. These papers
collectively generate over 3,500 Google Scholar citations. While many of these papers use contrast tests to make simple
comparisons between cells, the majority of the papers (56 percent) test a specific pattern of means. However, only about a
quarter use the [+3, −1, −1, −1] contrast coding discussed by Buckless and Ravenscroft (1990). More than twice as many
papers during this period use custom contrast weights. As Buckless and Ravenscroft (1990) limit the scope of their work to the
[+3, −1, −1, −1] contrast weighting, they do not discuss the use of custom contrasts. Moreover, it is not obvious how these tests

We thank Kathryn Kadous (editor), two anonymous reviewers, Shana Clor-Proell, Ben Commerford, Rick Hatfield, Justin Leiby, Bob Libby, Eldar
Maksymov, Kristi Rennekamp, Caren Rotello, Matt Stern, and participants at the 2016 ABO Research Conference and the 2016 New England Behavioral
Accounting Research Symposium (NEBARS) at Bentley University for their helpful comments on this manuscript, and Bradley Bennett, Shana Clor-
Proell, Michael Durney, Steve Kuselias, Victor van Pelt, and Patrick Witz for feedback on the related software application. We thank Blake Steenhoven for
research assistance.
Editor’s note: Accepted by Kathryn Kadous, under the Senior Editorship of Mark L. DeFond.
Submitted: March 2017
Accepted: November 2017
Published Online: January 2018

differ from tests using the [+3, −1, −1, −1] coding. This gap in the accounting literature has left researchers on their own with
respect to developing best practices for custom contrast testing and, unfortunately, many of the norms that have emerged
increase Type I error. Some of these issues do not apply to the [+3, −1, −1, −1] contrast coding or are much less of a concern.
Making matters worse, the concerns that remain are subtle, less than intuitive, and, as a consequence, often unbeknownst to the
researcher.
In this article, we propose a comprehensive three-part approach for contrast analysis. Researchers following this approach
can harness the increased power afforded by contrast testing, while reducing exposure to Type I error. Our approach allows
readers and reviewers to look at the evidence for the result in multiple ways and to come to their own conclusions about the
strength of the reported findings in the context of the experiment conducted and the literature as a whole. Further, by
formalizing reporting norms around these tests, authors can provide readers with information in a structure that is familiar to
them. Our aim is to make the use and interpretation of these powerful tests more consistent across studies, improving the quality
of experimental accounting research for all involved. We hope that this paper can serve as a central reference for authors,
reviewers, and editors to address the complex, but fairly common, concerns that arise when using contrast analysis.
We assume that the reader has at least some basic familiarity with contrast testing (see Bobko 1986; Buckless and
Ravenscroft 1990; Furr and Rosenthal 2003; Levin and Neumann 1999; Rosnow and Rosenthal 1995, 1996), given its common
usage in the experimental accounting literature. For simplicity, we focus our discussion within the context of a 2 × 2 between-
subjects balanced (i.e., equal cell size) design (e.g., Buckless and Ravenscroft 1990). Omnibus ANOVA analyses herein are
conducted using Type III sums of squares, the default in experimental research (Maxwell and Delaney 2004).1
Our paper is organized as follows. In Section II, we provide a brief overview of current trends in contrast testing in the
accounting literature and put forth our three-part approach. Section III presents six quandaries where current practices may lead
to increased Type I error and, more importantly, how our approach can guide researchers facing these quandaries. Section IV
wraps up with a brief conclusion.

II. CURRENT TRENDS AND A NEW APPROACH


In order to assess current contrast testing trends in the accounting literature, we constructed a sample of all accounting
papers using planned contrasts that were published in the top six accounting journals in the time period from January 1, 2010 to
December 31, 2016.2 As reported in Table 1, our sample comprises 124 published papers, of which 49 observations use
contrast testing solely for making simple comparisons between cells and, therefore, are not the focus of this work. This reduces
our sample to 75 papers. Among these papers, 20 use the [+3, −1, −1, −1] contrast coding that was the focus of Buckless and
Ravenscroft’s (1990) work.
Our analysis indicates that the vast majority (70 percent) of the papers in our sample that use the [+3, −1, −1, −1] contrast
coding do not report a test of the between-cells residual, as recommended by Buckless and Ravenscroft (1990).3 Slightly less
than half of the papers (45 percent) provide a plot of the means, and a sensitivity analysis is conducted and reported about 15
percent of the time.4
In addition to the 20 papers that use the [+3, −1, −1, −1] coding, our sample contains 42 papers that use custom contrast
analysis (see Table 1). We define custom contrast papers as those papers that report a result of a contrast test that uses a coding
other than the [+3, −1, −1, −1] coding or a coding used by a term in the omnibus ANOVA. Custom contrast papers appear with
double the frequency of papers using the [+3, −1, −1, −1] contrast coding. Of the 42 custom contrast papers, only 10 (23.8
percent) present a test of the between-cells residual. About half of the custom contrast papers present a sensitivity analysis (48

1 Slightly unbalanced cell sizes have little effect on these analyses. Therefore, for ease of presentation, we assume balanced cell sizes. We discuss this in
greater detail in footnote 22. When unbalanced cell sizes are of greater concern, we suggest that researchers consult Rosenthal, Rosnow, and Rubin
(2000) for a discussion of the effect of unequal-n on inference.
2 To construct our sample, we identified candidate papers using multiple approaches, such as searching articles for the terms “contrast,” “weights,”
“coding,” “comparison,” or “planned contrast,” as well as identifying papers that cite Buckless and Ravenscroft (1990). Once we identified a pool of
candidate papers, we eliminated false positives by reading the analysis section of the paper and making sure that contrast analysis was conducted. Once
at least one contrast test was identified, we coded papers as to whether contrast testing was used solely for cell comparisons or if the technique was used
to test for an interaction or other predicted pattern of means. For each of these papers, we classified the contrast coding, recorded whether the between-
cells residual test was reported, recorded whether a results plot was provided in the paper, and noted whether the observation included a multiple
contrast-based sensitivity analysis. Finally, a research assistant downloaded and recorded the Google Scholar citation count for each observation as of
June 22, 2017.
3 Buckless and Ravenscroft (1990, 941) state, “the statistical analysis of the data . . . should also include a test of the expectation that the other three cells
are statistically equivalent, since the entire predicted effect is the difference of one cell from the other three cells. The equality of the remaining three
cells is tested by performing a semi-omnibus F-test on the remaining unexplained between-group residual.”
4 The purpose of multiple contrast-based sensitivity analyses is to demonstrate the robustness of a contrast test result. Quandary 4, found in Section III of
our paper, discusses in detail why these analyses are unable to achieve their goal.


TABLE 1
Publication Counts of Papers Using Contrast Testing in the Top Six Accounting Journals from 2010–2016

                  [+3, −1, −1, −1] Contrast Coding              Custom Contrast Coding                  Other Contrast Tests

         Between-Cells                                 Between-Cells
         Residual   Sensitivity  Results               Residual   Sensitivity  Results                Cell
Year     Test       Analysis     Plot     Total        Test       Analysis     Plot     Total         Comparisons   Other   Total
2010     1          0            1        3            0          1            1        7             9             1       20
2011     0          0            0        1            1          4            4        8             7             1       17
2012     1          1            0        2            2          3            3        5             6             1       14
2013     0          0            1        2            2          2            2        3             6             1       12
2014     2          2            3        6            4          5            3        7             7             2       22
2015     0          0            0        2            1          4            6        8             8             4       22
2016     2          0            4        4            0          1            3        4             6             3       17
Total    6          3            9        20           10         20           22       42            49            13      124
         30%        15%          45%      100%         24%        48%          52%      100%
This table presents counts of papers published in the “Top Six” accounting journals from 2010 to 2016 that report the results of contrast tests. These
journals include: The Accounting Review, Accounting, Organizations and Society, Contemporary Accounting Research, Journal of Accounting and
Economics, Journal of Accounting Research, and Review of Accounting Studies. These data were hand-collected from each journal between 2010 and
2016. Retracted papers are excluded from the analysis. Papers were classified by the type of contrast test reported. Papers classified as [+3, −1, −1, −1]
Contrast Coding papers are those papers that reported results of a contrast test using the [+3, −1, −1, −1] contrast coding discussed by Buckless and
Ravenscroft (1990), but did not report additional custom contrast tests (except within a sensitivity analysis). Custom Contrast Coding papers were defined
as those papers that reported the results of at least one contrast test of a linear combination of cell means where (1) more than two cell means were included
in the analysis, and (2) the contrast coding was not equal to an ANOVA main effect coding, ANOVA interaction term coding, or the [+3, −1, −1, −1]
contrast coding. Other Contrast Tests papers were defined as papers that included any other contrast test, such as the results of simple comparisons
between cells or trend analyses. We provide counts of publications that report the results of the between-cells residual test, sensitivity analyses, and/or a
plot of the observed means for papers classified as [+3, −1, −1, −1] Contrast Coding papers and Custom Contrast Coding papers. These publication
counts demonstrate the relatively low proportion of papers using contrast analysis that report the results of the between-cells residual test. In addition, the
counts show the relative prevalence of sensitivity analyses as part of custom contrast analysis.

percent), and a similar proportion of papers (52 percent) provide a plot of the observed means. Seven custom contrast papers
(17 percent) present both the between-cells residual and a plot of the observed means.
So what does this tell us? Contrast testing has become well established in the experimental accounting literature. An
examination of these analyses shows us that these tests are often presented without an accompanying between-cells residual test
or an evaluation of the visual fit of the pattern of observed means. This trend is concerning. As we demonstrate in our work, a
significant contrast test result by itself is not sufficient evidence as to whether an observed result matches a predicted pattern of
means. To the extent that a researcher concludes that an interaction hypothesis is supported on the basis of a significant contrast
test result alone, the possibility of Type I error increases. Further, without plotting the observed and predicted means, it may be
difficult for authors to assess the degree of fit—again giving rise to increased Type I error. And finally, because of limitations
inherent in the semi-omnibus F-test, a combination of a significant contrast test result and a non-significant semi-omnibus F-test
still may not be evidence that an observed pattern of means matches the prediction implied by the contrast coding. We discuss
these limitations more fully later.
We propose a three-part approach for testing for a predicted pattern of means and reporting the results of these tests in
order to mitigate these concerns. Our approach urges authors and readers to evaluate the convergence of three different types of
evidence: visual evaluation of fit, tests of significance, and quantitative evaluation of the residual between-cells variance, before
concluding the extent to which an interaction hypothesis is supported. Authors employing this approach are able to take
advantage of the increased power of contrast tests, but are also able to use the principles of triangulation to guard against increased
Type I error (Campbell and Fiske 1959). By relying on the convergence of multiple kinds of evidence to come to a conclusion,
researchers can have greater assurance that their result is not due to their choice of analysis (Jick 1979). In addition, when
evidence does not converge, readers are provided with the information to be able to evaluate the entire evidence set in both the
context of the experiment and the literature as a whole, and to come to their own conclusion as to the strength of the result. A
graphical depiction of our approach is included as Figure 1.


FIGURE 1
Contrast Analysis Approach
This figure illustrates the complementarity of the individual components that make up our three-part approach to custom contrast testing. The circles
indicate the three components of our approach, and the boxes describe the shortcomings of each pairing and how another component mitigates each
concern.

Component 1: Visual Evaluation of Fit


First, we recommend that researchers check the visual fit of the observed data to the predicted pattern implied by the
contrast weighting scheme by plotting both the predicted and observed means in side-by-side panels of an interaction plot. We
demonstrate this approach throughout the paper (e.g., Figure 2)5 and suggest that authors consider whether it could be useful to
include this plot in the manuscript. Inclusion comes with pros and cons and balances concerns about journal space with the
desire to provide transparency to the reader.6
We argue that many experimental accounting research studies could benefit from inclusion of these plots in published
work, but it is not clear that the benefits always outweigh the costs. As a consequence, we stop short of recommending blanket
inclusion. This does not reduce the usefulness of conducting this evaluation as an early part of the data analysis process, as
assessing visual fit before conducting other tests can help guard against confirmation bias (Nickerson 1998). We strongly
recommend that authors conduct an evaluation and assessment of visual fit and, if the figure is not included in the manuscript,
explicitly acknowledge this assessment in the paper. An author’s critical assessment as to the degree of fit found between
prediction and results is an important part of a comprehensive approach to contrast analysis, regardless of whether the plots are
included in the published paper.
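As a concrete illustration, the sketch below builds the kind of side-by-side predicted-versus-observed plot described above using ggplot2, the plotting package the authors mention in footnote 5. The factor names, cell means, and data frame layout are our own hypothetical example, not the paper's code or data.

```r
library(ggplot2)

# Hypothetical cell means implied by a [+3, -1, -1, -1] prediction for a
# balanced 2 x 2 design (illustrative values only)
predicted <- data.frame(
  factor1 = rep(c("Present", "Absent"), each = 2),
  factor2 = rep(c("High", "Low"), times = 2),
  mean    = c(3, -1, -1, -1),
  panel   = "Predicted pattern"
)

# Observed means would come from aggregating the experimental data;
# the values here are made up for the sketch
observed <- transform(predicted,
                      mean  = c(3.1, -0.8, -1.2, -0.9),
                      panel = "Observed means")

ggplot(rbind(predicted, observed),
       aes(x = factor2, y = mean, group = factor1, linetype = factor1)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ panel) +  # side-by-side panels invite the visual-fit check
  labs(x = "Factor 2", y = "Cell mean", linetype = "Factor 1")
```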

5 With the exception of Figure 1, all graphics were prepared using the R language for Statistical Computing and the ggplot2 package (R Core Team 2016;
Wickham 2009).
6 As far as an argument for inclusion is concerned, there may not be better evidence that observed cell means fit a prediction than showing a side-by-side
comparison. Further, research in the graphical statistics literature has repeatedly demonstrated that graphs are more effective than tables when readers
are making comparisons (Gelman, Pasarica, and Dodhia 2002; Loftus 1993; Meyer, Shamo, and Gopher 1999; Meyer, Shinar, and Leiser 1997; Simkin
and Hastie 1987). As an argument against inclusion of results plots, journal space is limited, the information exhibited in a results plot is
informationally redundant to what is provided in a table of results, and interaction plots can imply linear relationships between variables that may not
exist.


Component 2: Tests of Significance


Second, we suggest that researchers demonstrate both the statistical significance of the contrast test and the non-
significance of the residual between-cells variance test. These tests, evaluated as a pair, provide a reasonable and defensible
basis for concluding that differences between the observed and predicted contrast results were not due to chance. While this
echoes advice given by Buckless and Ravenscroft (1990) in terms of the [+3, −1, −1, −1] contrast, residual testing has not been
discussed with respect to custom contrasts in the accounting literature. Limitations of the semi-omnibus F-test, the most
common operationalization of the between-cells residual test, do not always allow a reader to determine if there are additional
effects present in the data or to affirmatively describe the form and type of the interaction. The effect of these limitations, and a
complementary measure, will be more fully discussed in Quandary 6.
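To make the test pair concrete, here is a minimal sketch (ours, not the authors' code) of both significance tests for a balanced four-cell design. The function name `contrast_pair` and its interface are our own; it computes the one-degree-of-freedom contrast F-test alongside the two-degree-of-freedom semi-omnibus F-test of the residual between-cells variance, using the standard sums-of-squares formulas.

```r
# Sketch (ours) of the contrast F-test plus the semi-omnibus between-cells
# residual test for a four-cell design. `y` is the response, `cell` a factor
# with levels ordered [A, B, C, D], and `w` a weight vector summing to zero.
contrast_pair <- function(y, cell, w = c(3, -1, -1, -1)) {
  stopifnot(abs(sum(w)) < 1e-8, nlevels(cell) == length(w))
  n  <- tapply(y, cell, length)                    # per-cell sample sizes
  m  <- tapply(y, cell, mean)                      # cell means
  ss_cells <- sum(n * (m - mean(y))^2)             # total between-cells SS (3 df)
  ss_con   <- sum(w * m)^2 / sum(w^2 / n)          # contrast SS (1 df)
  ss_res   <- ss_cells - ss_con                    # residual between-cells SS (2 df)
  df_err   <- length(y) - nlevels(cell)
  ms_err   <- sum((y - ave(y, cell))^2) / df_err   # within-cells mean square
  f <- c(ss_con / ms_err, (ss_res / 2) / ms_err)
  data.frame(test = c("contrast", "between-cells residual"),
             df1  = c(1, 2), df2 = df_err, F = f,
             p    = pf(f, c(1, 2), df_err, lower.tail = FALSE))
}
```

Together the two rows exhaust the three between-cells degrees of freedom of a four-cell design, which is what allows the pair, but not either test alone, to speak to the shape of the pattern of means.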
Component 3: Quantitative Evaluation of the Contrast Variance Residual


Finally, we propose that researchers combine visual inspection and tests of significance with a quantitative evaluation of
the residual between-cells contrast variance unexplained by the contrast versus the total variance explained in the experiment.
We propose a measure that quantifies the variance related to any other potential effects present in the data (i.e., besides the
contrast effect) as a function of the totality of variance that could have been explained. We refer to this measure as the contrast
variance residual, denoted q², and discuss it in Quandary 6. A strength of our measure is that it provides a tool to evaluate
residual between-cells variance independent of concerns over whether there is not enough, or too much, statistical power to
make for an appropriate test. This is important due to the increased power of contrast tests and the effects of statistical power on
the informativeness (or uninformativeness) of semi-omnibus F-test results. This measure serves to mitigate the problems caused
by relying on a non-significant semi-omnibus F-test result in order to confirm a result. Our metric allows readers to come to
their own conclusions as to how the size of the residual influences their view of the result, given the context of the specific
research question. Further, the metric allows for comparisons to be made across studies with similar experimental designs as to
the relative strength of the effect. These benefits and others will be discussed in Quandary 6.
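Quandary 6, which defines q² formally, falls outside this excerpt. As a placeholder, the sketch below computes one reading consistent with the verbal description above (the between-cells variance left unexplained by the contrast, scaled by the total between-cells variance that could have been explained); treat it as our hedged reconstruction, not the authors' exact formula.

```r
# Hedged reconstruction (ours) of a contrast variance residual consistent
# with the description above: the share of the total between-cells variance
# left unexplained by the contrast.
q_squared <- function(y, cell, w) {
  n        <- tapply(y, cell, length)
  m        <- tapply(y, cell, mean)
  ss_cells <- sum(n * (m - mean(y))^2)        # all variance that could be explained
  ss_con   <- sum(w * m)^2 / sum(w^2 / n)     # variance captured by the contrast
  (ss_cells - ss_con) / ss_cells              # share attributable to other effects
}
# For the simulated data in Table 2 (Section III), this ratio would be
# 46.67 / 100.00 = 0.47
```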
We propose this combined approach, as each of the three components has distinct strengths and provides differential, but
complementary, information regarding the presence of an effect. But each component has its limitations, as well. Researchers
can harness the benefits and control for the limitations of each component by considering the three parts in tandem. Our
combined approach makes it easier for readers to consider how the entire evidence set, taken as a whole, either lends support for
or against the contrast hypothesis.
Before we go further, we explicitly remind readers that contrast testing is a confirmatory analysis method that relies on a
priori specification of contrast weights derived from well-grounded theory (Myers, Well, and Lorch 2010). Only the careful a
priori specification of analysis choices can guard against the cognitive biases that all humans are prone to, including
researchers. Acknowledgment of this constraint is a humble and conscious choice to recognize our limitations as information
processors in order to preserve the integrity and usefulness of our work. Accordingly, our approach and advice throughout
presuppose that an intentional and mindful a priori specification of the analysis has been made, including the specific choice
of contrast weightings, and that specification has been thoughtfully considered and derived from well-grounded theory.

III. SIX CONTRAST WEIGHT TESTING QUANDARIES

1. When the Glove Does Not Fit: The Contrast Test is Significant, But the Picture Does Not Match the Prediction
Perhaps too commonly, an observed pattern of cell means does not match a predicted pattern, yet a contrast test yields a
significant result. As an example: (1) theory predicts, and we hypothesize, a classic [+3, −1, −1, −1] ordinal interaction (see
Figure 2, Panel A); (2) a plot of the results clearly shows two main effects and no interaction (see Figure 2, Panel B); and (3) a
contrast test conducted using the [+3, −1, −1, −1] set of contrast weights is statistically significant.
This can lead to two reasonable, but different, conclusions. On one hand, the contrast test result could be considered
“incorrect” since, based on a visual check of the pattern of observed cell means, no interaction exists. In this case, it appears as
though there is a spurious statistical result. On the other hand, the contrast test result is significant, and it might appear that the
visual evidence of an ordinal interaction could have become lost within the noise that surrounds each cell mean. These two
interpretations of the test results are simultaneously plausible, yet they lead to very different (and opposing) conclusions.
Interpreting the results of this specific analysis is difficult, but even worse, the validity and reliability of inferences drawn from
these tests, in general, are called into question.
We can reconcile these interpretations by looking closely at what specifically it is that the contrast test is testing. To
illustrate, we construct a dataset through simulation. These data contain two main effects and absolutely no interaction. As
shown in Figure 2, Panel B, the lines in the plot are parallel and, as is expected and shown in Table 2, Panel A, the ANOVA


FIGURE 2
Quandary 1
Interaction Plots of a Predicted Pattern of Means and Simulated Data

Panel A: [+3, −1, −1, −1]

Panel B: Simulated Data

This figure illustrates a lack of visual fit between a hypothesized contrast coding and a set of data. Panel A visually presents the pattern of means indicated
by a [+3, −1, −1, −1] set of contrast weights and Panel B presents simulated data. Cell means in Panel B are indicated by boxed letters, and individual
observations are clustered around these boxes. The ANOVA, custom contrast, and between-cells residual test results for these data are exhibited in Table 2.


TABLE 2
Quandary 1: Simulated Data in Figure 2, Panel B

Panel A: Analysis of Variance

Source of Variance                      Sums of Squares   Degrees of Freedom   Mean Square   F-Statistic   p-value
Factor 1                                      90.00               1                90.00         35.91      <0.001
Factor 2                                      10.00               1                10.00          3.99       0.053
Interaction                                    0.00               1                 0.00          0.00       1.000
Total between-cells variance                 100.00               3                33.33         13.33      <0.001
Error                                         90.22              36                 2.51
Total                                        190.22              39

Panel B: Contrast Test

Source of Variance                      Sums of Squares   Degrees of Freedom   Mean Square   F-Statistic   p-value
[+3, −1, −1, −1] for [A, B, C, D]             53.33               1                53.33         21.28      <0.001

Panel C: Contrast Test

Source of Variance                      Sums of Squares   Degrees of Freedom   Mean Square   F-Statistic   p-value
[−1, +3, −1, −1] for [A, B, C, D]             13.33               1                13.33          5.32       0.027

Panel D: Contrast and Residual Between-Cells Variance Test

Source of Variance                      Sums of Squares   Degrees of Freedom   Mean Square   F-Statistic   p-value
[+3, −1, −1, −1] for [A, B, C, D]             53.33               1                53.33         21.28      <0.001
Residual between-cells variance               46.67               2                23.33          9.31       0.001
Total between-cells variance                 100.00               3                33.33         13.33      <0.001
Error                                         90.22              36                 2.51
Total                                        190.22              39

This table presents ANOVA and custom contrast test results for the simulated data depicted in Figure 2, Panel B. These data were simulated using the two-
factor design structural model exhibited by Myers et al. (2010, 204). This model appends a stochastic residual error to an additive model that contains main
effects of each factor, as well as an interaction. The model allows for the partitioning of deviations from a grand mean into components attributable to fixed
effects and residual error.

interaction term is not significant (F1,36 = 0.00, p = 1.00). However, as Table 2, Panel B shows, a [+3, −1, −1, −1] contrast test
is statistically significant (F1,36 = 21.28, p < 0.001). In fact, even a contrast coding that assigns the highest contrast weight (+3)
to the “wrong” cell ([−1, +3, −1, −1]) yields a statistically significant result (F1,36 = 5.32, p = 0.027; see Table 2, Panel C). A
conclusion that the data fit the predicted pattern of means would clearly be a Type I error in either case.
Turning to the underlying statistics, we note that the weights [+3, −1, −1, −1] are used to formulate the following linear
combination hypothesis regarding the relationship between cell means A, B, C, and D:

3A − B − C − D > 0.    (1a)
But that inequality does not define a pattern or shape in two dimensions (i.e., as the test of a 2 × 2 pattern should); it only tells
whether a linear combination of cell means is statistically different than zero—a one-dimensional conjecture. In fact, this
inequality is algebraically equivalent to testing the hypothesis that the cell mean A is greater than the average of the cell means
B, C, and D:


A > (B + C + D) / 3.    (1b)
Thinking about our results in Figure 2, Panel B in terms of the rewritten hypothesis in inequality (1b) above, it becomes
clearer that the significant contrast test result in Table 2, Panel B is not incorrect. As shown in Panel B of Figure 2, cell A is
higher than the average of cells B, C, and D. Analogously, the contrast test result in Table 2, Panel C is not incorrect; cell B is
higher than the average of cells A, C, and D. That is, the answer is mathematically “correct,” but answers a less complex
question than we intended to ask.
A single contrast test, by itself, can only answer one question: whether a single linear combination of cell means is
statistically different than zero. As shown in inequality (1a) and rewritten as inequality (1b), all four cell means are included in
the hypothesis. However, even though the four cell means are included in the inequality, we are unable to determine the
positioning of individual cell means; we can only determine whether the left or right side of the inequality is larger. Said
differently, while we can test whether cell A is greater than the average of cells B, C, and D (as we do in inequality (1b)), we are
unable to simultaneously test whether cell B falls in between cells A and C, test whether cell D is the lowest of the four cells, or
answer any other question that would require a rewriting of the inequality. And even though the choices provided by different
contrast coding schemes allow for the flexibility to hone in on and test the exact hypothesis that our theory implies, there is
always the hard constraint that our hypothesis has to be able to be written as a single comparison between two discrete
expressions.
The reason for this limitation becomes clearer if we look at the degrees of freedom that the test is conducted against. A
single planned contrast is conducted with one degree of freedom. This means that only one value in the inequality is “free to
vary” at a time. Looking back at inequality (1a), if cell A is free to vary, then B, C, and D are constrained to whatever values
they need to be to make inequality (1a) true. This does not allow for a test of the pattern of cell means because only one value
may vary at a time. A test of shape would require letting three cell means vary in order to account for their position relative to a
fourth cell mean. Accordingly, we would need three degrees of freedom to conduct this test.7
So where does this leave us? By itself, a contrast test cannot tell us whether an observed pattern matches a predicted pattern
of cell means, but this test, paired with an examination of the between-cells residual, can help us move closer to our goal. To do
so, we must first show a significant contrast test result. Second, we must demonstrate that there is an insignificant amount of
residual between-cells variance.8 In the case of this [+3, −1, −1, −1] example, the residual between-cells variance test is
equivalent to a test of whether the three cells assigned weights of −1 are simultaneously equal to one another (Buckless and
Ravenscroft 1990). In our simulated dataset, this test is significant (F2,36 = 9.31, p = 0.001). This indicates that even after we
have accounted for the variance related to cell A being greater than the average of cells B, C, and D with the contrast test, an
additional statistically significant effect is still present in the data. Together, these two tests account for the three degrees of
freedom that constitute all between-cells variance in the set of four cells. Accordingly, to draw a conclusion as to the shape of
the pattern of cell means, the results of the two tests must be taken together.
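The point is easy to reproduce. The short simulation below (our own, mirroring the structure of the Figure 2, Panel B data rather than replicating it) generates exactly parallel lines, so no interaction exists, yet the [+3, −1, −1, −1] contrast is significant while the residual between-cells test flags the leftover main-effect structure; it reuses the hypothetical `contrast_pair` helper sketched in Section II.

```r
# Parallel lines (A - B == C - D), so there is no interaction, but cell A
# still exceeds the average of cells B, C, and D.
set.seed(1)
n    <- 10
cell <- factor(rep(c("A", "B", "C", "D"), each = n))
mu   <- c(4, 1, 2, -1)                       # A - B = 3 and C - D = 3
y    <- rnorm(4 * n, mean = mu[cell], sd = 1.5)

contrast_pair(y, cell, w = c(3, -1, -1, -1))
# The contrast row is significant, but so is the residual row: the
# between-cells variance the contrast leaves behind reveals that an
# "interaction" reading would be a Type I error.
```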

2. Hypothesizing More than We Have Bargained For: Simple Effects, Magnitude Assumptions, and the ANOVA
Interaction Term
Our second quandary concerns how the specification of a custom contrast may cause us, even unintentionally, to
hypothesize more than we have bargained for. To start this topic, we consider a hypothesis that predicts that one simple effect
will be larger than the other, which implies a [+1, −1, −1, +1] contrast coding, the same coding used for the conventional
ANOVA interaction term.
Checking our math, the [+1, −1, −1, +1] coding (given cells [A, B, C, D]) tests the hypothesis:

A − B − C + D > 0,    (2a)

which can be rewritten as a contrast of two simple effects:

A − B > C − D,    (2b)

or, equivalently, as a contrast of the other two simple effects:
7 As we will discuss in Quandary 2, for instances in which a researcher is not concerned with testing shape, the residual between-cells variance test is
unnecessary. For example, to conduct a test of the simple inequality noted in Equation (1b) (i.e., that cell A is greater than the average of cells B, C, and
D), the set of contrast weights [+3, −1, −1, −1], without a test of the residual between-cells variance, is an appropriate test of this hypothesis.
8 This test, often referred to as the residual between-cells variance test, is not perfect. Shortcomings of the between-cells residual test, and how our three-
part approach helps to compensate for these limitations, will be discussed in Quandary 6.


A − C > B − D.    (2c)
Inequalities (2b) and (2c) represent simple algebraic statements contrasting two expressions, each representing one simple
effect. As mentioned above, each single conjecture has only one degree of freedom; a test of the location of four cells relative to
one another would require three degrees of freedom. With that said, a hypothesis that one simple effect is larger than another is
often a perfectly reasonable hypothesis and it may be the most specific hypothesis that theory warrants.
On the other hand, when theory does warrant making specific predictions about the positions of cell means, custom
contrast testing can be a powerful tool. However, the power to make these more specific predictions comes at a price. As we
have noted, a custom contrast, when examined with the residual between-cells variance, tests whether the differences between a
predicted and observed pattern of means could have occurred by chance. But in addition, it also tests a joint hypothesis
regarding the relative magnitudes of the differences between cell means, as are implied by the coefficients used in the contrast
coding.
For example, a [+3, +1, −2, −2] contrast (examined with the residual between-cells variance) is not only testing the
hypothesis that the difference between cells A and C is larger than the difference between cells B and D, it is also making a
claim about how large those differences are. Specifically, this analysis explicitly and jointly tests the hypothesis that the
difference between A and C approximates 167 percent of the size of the difference between cells B and D (i.e., [3 − (−2)]/[1 −
(−2)] = 167 percent), that the difference between cells C and D approximates zero, and that the difference between cells A and
B approximates 40 percent of the difference between cells A and C (i.e., [3 − 1]/[3 − (−2)] = 40 percent) and 67 percent of the
difference between cells B and D (i.e., [3 − 1]/[1 − (−2)] = 67 percent).
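These implied ratios follow directly from the weights because, after centering and rescaling, the weights are the predicted cell means. A quick check of the arithmetic:

```r
# Contrast weights double as predicted (centered, rescaled) cell means, so
# ratios of weight differences are the implied relative effect magnitudes.
w <- c(A = 3, B = 1, C = -2, D = -2)
(w["A"] - w["C"]) / (w["B"] - w["D"])  # 1.67: A-C is 167 percent of B-D
(w["A"] - w["B"]) / (w["A"] - w["C"])  # 0.40: A-B is 40 percent of A-C
(w["A"] - w["B"]) / (w["B"] - w["D"])  # 0.67: A-B is 67 percent of B-D
```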
Even though accounting researchers may choose custom weights for the direction (and not the magnitudes) of the effects
implied by the weights, the underlying mathematics of a custom contrast test remain a test of magnitude. That is, a custom
contrast (when paired with a test of the residual between-cells variance) tests the proximity of the globally mean-centered
rescaled cell means to these specific chosen weights. As a result, the statistical tests are a comparison of the relative effect
magnitudes observed to the relative effect magnitudes predicted by the weights themselves—whether justified by theory or not.
In contrast, the ANOVA interaction term (without the residual between-cells variance test) uses only unitary weights (i.e.,
+1s and −1s), does not contain such explicit assumptions about effect size magnitude, and, more simply, tests whether one
simple effect is greater than the other (Equations (2b) and (2c)). When theory only warrants predictions about the directions of
effects, rather than their magnitude, custom contrast tests may not be well justified.9 And as Berkowitz and Donnerstein (1982)
point out, while experiments may be ideal settings for testing directional predictions about relative effect sizes (e.g., when
effects will be larger or smaller, or when the direction of the effect will reverse), theory may not warrant making firm a priori
predictions about the relative magnitudes of effect sizes.
On the other hand, when a priori theory does support predictions more specific than “one effect will be larger than
another,” then contrast testing does provide a more powerful and appropriate test. In these cases, researchers should carefully
identify the nature of the additional assumptions and justify them based on a priori theory (e.g., Bobko 1986; Buckless and
Ravenscroft 1990; Levin and Neumann 1999; Rosnow and Rosenthal 1996). The results of the analysis, and the assumptions
made as part of it, must be considered in the context of the theory advanced.
Finally, we should comment on the use of simple effects tests to support an interaction hypothesis. This type of analysis is
often suggested as an alternative to contrast testing or the ANOVA interaction term to test for the presence of an interaction.
This advice gives us pause on two fronts. First, from a practical standpoint, a series of simple effects tests may not be
diagnostic. To see this issue, one can consider the scenario of a 2 × 2 design in which all four simple effects are found to be
significant and how it is not possible to discern from simple effects alone whether an interaction is present. Furthermore, even a
combination of significant and insignificant simple effects does not indicate whether an interaction has occurred (Gelman and
Stern 2006). As a consequence, simple effects tests used in this manner do not test whether an interaction is present and may
contradict the results of contrast tests (see Rutherford 2001; Myers et al. 2010).
From a theoretical standpoint, follow-up simple effects tests use the absence or presence of statistical significance to proxy
for the absence or presence of an effect. Unfortunately, while this logic is intuitively appealing, it may be difficult to support.
For decades, statisticians have reminded researchers that p-values are not measurements of effect size and, therefore,

9 On a related note, when it appears that unwieldy contrast weights (e.g., [+7, +1, −3.5, −4.5]) are necessary, this provides another signal that the
assumptions being made may not be appropriately justified by well-developed, and appropriately modest, theory (cf. Berkowitz and Donnerstein 1982).
In contrast, [+3, −1, −1, −1] is notable for its relative lack of arbitrary specification. Other than the prediction that the one-cell directional effect exists,
the test makes no further assumptions about its relative effect size magnitude. For example, if the prediction is made that three of the four cells will be
the same and are weighted as −1 (or +1), then the requirement that contrast weighting schemes sum to zero only allows the cell of interest to be +3 (or −3).
The fourth weighting is simply a “plug” that must be −3 times the weighting of the other three cells. Researchers often encounter theories suggesting
that one cell mean will be relatively high while the other three will be relatively low, without expecting differences between the other three conditions.
Accordingly, the [+3, −1, −1, −1] contrast weighting is often a useful, powerful, and modest statistical test of such predictions.


comparisons using p-values in this manner are not appropriate (Gelman and Stern 2006; Loftus 1996; Myers et al. 2010).
Further, it is often the case that “the difference between significant and non-significant is not itself statistically significant”
(Gelman and Stern 2006, 328). Accordingly, even if comparisons between p-values were statistically sound, it would not be
possible to discern whether the results of the comparison were due to chance. Last, a series of follow-up simple effects tests
suffers from increased Type I error related to the increase in family-wise error that comes from multiple comparisons (Myers et
al. 2010). For a given interaction hypothesis, a contrast testing approach requires fewer individual tests when compared with a
simple effects approach, reducing this concern.
Taken together, these points make arguments for using simple effects tests in place of contrast tests for testing interaction
hypotheses tenuous. With this said, we do recognize that a simple effects test does provide a useful, statistically sound
comparison between two cells of interest. Accordingly, when an interaction is predicted (and demonstrated via a contrast
approach or through the significance of the ANOVA interaction term), a follow-up simple effects test could be interesting as a
supplementary test of the theory advanced in the paper. However, this does not make simple effects tests appropriate as the sole
test of an interaction, which requires the examination of the relationships between more than two cells simultaneously.

3. To Zero or Not to Zero: The Pros and Cons of Using Zero as a Contrast Weight
Continuing our discussion of how contrast weighting choices affect the results of contrast tests, we now discuss the
implications of using zeroes in contrast codings. Rosenthal et al. (2000) note that when zero is used as a contrast weight, the
cell mean is multiplied by that zero weighting, which removes it from the analysis. This tells us that the mean for the excluded
cell could be any value without changing the contrast test’s test statistic or p-value. However, in the same text, Rosenthal et al.
(2000, 102) also state that “contrast weights of 0 may be intended to function just as any other number.” At first glance, this
seems contradictory, but this is not the case.


We will use a [+2, +1, 0, −3] contrast weighting, as depicted in Figure 3, as an example. As noted above, cell C can be any
value (i.e., located anywhere on the graph) without affecting the test statistic and p-value of the contrast test in the slightest
amount. However, this is only one half of a significance test of shape. If the observed mean of condition C departs significantly
from zero, even though the contrast test statistic and p-value will be unaffected, then the departure from the zero prediction will
be detected by the residual between-cells variance test.
Why is this? Recall that (1) the total between-cells sum of squares equals the squared distance of each cell mean from the
unweighted average of all cell means (Neter, Kutner, Nachtsheim, and Wasserman 1996). Note, also, that (2) as a contrast
coding must sum to zero, a zero weight represents the unweighted average of all other cell means in the coding.10 Finally, recall
that (3) a zero weighting removes the effect of cell C from the [+2, +1, 0, −3] contrast test, and that (4) the residual between-
cells variance test captures all other between-cells variance not captured in the contrast test. Taken together, these four facts
imply that any departure of cell mean C from the overall mean of the four cells (i.e., zero) will appear in the residual between-
cells sum of squares and will be tested by its associated significance test. Thus, zero can be a perfectly reasonable choice for a
contrast weight when performing a test of shape.
This is especially noteworthy for the common case of a “pac-man” style interaction, a pattern that occurs when the two
lines slope upward and downward by the same amount from a central point. This shape implies the use of a +1 and −1 weight
for the “top” and “bottom” conditions, and zero weights for the other two conditions (e.g., [0, +1, 0, −1]). As follows from the
discussion in Quandary 1, when this coding is paired with a residual between-cells variance test, it is an appropriate significance
test of the “pac-man” interaction shape. But without being paired with a test of the residual between-cells variance, this is only
a test of the simple difference between cells B and D, as the zero weights remove cells A and C from the contrast test and there
is no other test that will pick up the effect of those cells.11 Relying on the contrast test alone could greatly inflate the chance of
Type I error.
With all of this said, the use of zero as a contrast weight places the burden to detect any departures of cell means from zero
on a single residual between-cells variance test (usually operationalized as the semi-omnibus F-test). This is a nontrivial ask of
a test that has multiple degrees of freedom and is simultaneously testing for the presence or absence of a variety of other
differences, all with a single p-value. This is one of the reasons we recommend that contrast tests of significance and between-
cells residual tests of non-significance are exhibited with the other two parts of our three-part approach—an examination of
visual fit and the quantification of the contrast variance residual.
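The division of labor between the two tests is easy to see in simulation. In the sketch below (ours, reusing the hypothetical `contrast_pair` helper from Section II), shifting cell C leaves the [+2, +1, 0, −3] contrast statistic exactly unchanged, while the residual between-cells test picks up the departure.

```r
set.seed(2)
n    <- 10
cell <- factor(rep(c("A", "B", "C", "D"), each = n))
w    <- c(2, 1, 0, -3)

y1 <- rnorm(4 * n, mean = c(2, 1, 0, -3)[cell], sd = 1)  # C at its predicted value
y2 <- y1 + (cell == "C") * 5                             # shift only cell C

contrast_pair(y1, cell, w)  # contrast significant, residual quiet
contrast_pair(y2, cell, w)  # identical contrast F and p (the zero weight
                            # removes C and within-cell error is unchanged);
                            # the residual test is now strongly significant
```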

10 For example, in the contrast used in this scenario, zero is the unweighted average of the four weights [+2, +1, 0, −3].
11 We remind the reader that we have been considering the role of zero weights with respect to a 2 × 2 between-subjects design. Because zero weights can
be used to remove cells from an analysis, multiple zero weights can also be used to test hypotheses about a subset or subsets of cells from more complex
experimental designs.


FIGURE 3
Quandary 3
Interaction Plot of a Contrast Coding Scheme that Includes a Zero Weighting

This figure presents the pattern of means for a [+2, +1, 0, −3] contrast prediction.

4. When Less is More: Multiple Coding Scheme-Based Sensitivity Analyses


As discussed in Quandary 2, selecting a single set of contrast weights to test the shape of an interaction can be difficult
because (1) contrast weightings include explicit assumptions about the magnitude of effect sizes, and (2) experiments are not
well suited for making inferences about magnitude. In the end, the set chosen may appear arbitrary, which can be concerning.
This brings us to our fourth quandary, which examines a common attempt to quell this worry: multiple contrast-based
sensitivity analyses.
In these analyses, authors conduct multiple contrast tests with weights that each differ in their specific coding, but that all
generally approximate the hypothesized shape, in order to demonstrate robustness. For example, assume that a [+1, +5, −3, −3]
contrast is used and found to be significant and the paired between-cells residual is non-significant. To demonstrate the
robustness of the results, the analysis is reperformed using a [+2, +6, −4, −4], a [+1, +5, −4, −2], and a [+1, +5, −2, −4] set of
contrast weights.12 As all four weightings and their paired residual between-cells variance tests are found to be significant and
non-significant, respectively, the results are deemed to be robust. As shown in Table 1, these procedures are common, with 47.6
percent (20/42) of papers using custom contrasts reporting sensitivity analyses, even as only 23.8 percent (10/42) report the
result of a between-cells residual test.
But what do we learn from these tests? What was intended to be a robustness test for a test of shape, defined a priori,
results in a conclusion that four different patterns of means are all statistically indistinguishable from the observed pattern of
results. The analysis is four separate tests of similar, but different, expectations of the overall pattern of cell means and the
relative magnitudes of differences between them. Further, at least three of these specifications are somewhat arbitrary. This is at
odds with the requirement, discussed earlier, that contrast weights must be selected in accordance with a well-specified a priori

12 In our review, we noted that some papers included analyses that sought to demonstrate robustness by specifying contrast weighting sets constructed
from the original vector of contrast weights multiplied by different scalars (e.g., [+1, +5, −3, −3] and [+2, +10, −6, −6] or [+0.5, +2.5, −1.5, −1.5]).
Because the relative magnitude of weights within the set remains the same, these contrast weight sets all test exactly the same hypothesis. That is, if the
first set of contrast weights tested is significant, tests conducted with a set of codings constructed through scalar multiplication of the vector of contrast
weights will also be significant with identical test statistics and p-values.
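The claim in this footnote is easy to verify numerically; in the sketch below (ours, again using the hypothetical `contrast_pair` helper from Section II), all three scaled weight sets return identical test statistics and p-values.

```r
set.seed(3)
cell <- factor(rep(c("A", "B", "C", "D"), each = 10))
y    <- rnorm(40, mean = c(1, 5, -3, -3)[cell], sd = 2)

contrast_pair(y, cell, w = c(1, 5, -3, -3))
contrast_pair(y, cell, w = c(2, 10, -6, -6))         # identical F and p
contrast_pair(y, cell, w = c(0.5, 2.5, -1.5, -1.5))  # identical F and p
# Scaling w by k multiplies both the squared contrast estimate and its
# variance by k^2, so the F-statistic is unchanged.
```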


theory, and contrast weight testing should be conducted with the sole purpose of testing that specific theory (e.g., Bobko 1986;
Buckless and Ravenscroft 1990; Levin and Neumann 1999; Rosnow and Rosenthal 1996).
Further, to the extent that different contrast coding schemes are used, but these alternative coding schemes maintain the
general shape and order of means indicated by the original weights, the contrast tests (and their accompanying residual
between-cells variance tests) will almost certainly support the primary result.13 This makes it difficult to conclude that these
alternative schemes demonstrate robustness. On the other hand, departing from a weighting that matches the general shape or
order of the original hypothesis is no longer showing robustness of the original test.
We suggest that a perceived need to perform such sensitivity analyses should be taken as a signal that a contrast test may
be requiring more explicit assumptions about relative effect magnitudes than are warranted. In these cases, the underlying
theory may not provide enough detail, or the experiment may not be designed, to warrant the specific predictions of relative
magnitude implied by a custom set of contrast weights. Echoing concerns discussed earlier in the paper, experiments are
relatively poor settings for testing hypotheses about magnitude (Berkowitz and Donnerstein 1982), so it is not surprising that
researchers may often be in situations where it is difficult to make these explicit claims. This should not be seen as a
shortcoming of the researcher or design, but rather a limitation of the experimental method.
If researchers want to provide evidence of the robustness of a contrast result, then rather than employing such sensitivity
analyses, we suggest the following approach. If the underlying theory does not provide enough detail to justify the use of a
custom contrast, then researchers should use the most specific test that theory warrants (e.g., using the ANOVA interaction term
to show that one simple effect is larger than another). Providing a larger quantity of informationally redundant test results will
not demonstrate robustness. On the other hand, if theory is sufficiently detailed to allow a prediction of an overall shape, then
showing significance of the originally hypothesized contrast (with a paired non-significant residual between-cells variance test
result) is adequate as far as tests of significance are concerned. When these tests are combined with the other two parts of the
three-part approach, a comprehensive picture of the result is available to readers and they can assess the robustness of the
reported findings themselves. This approach provides readers with a richer information set from which to derive their own
conclusions as to the strength of the claims made in the paper and, when evaluated as a whole, does not completely rely on
notions of statistical significance, qualitative visual evaluation, or effect size.

5. Who is Driving the Bus? Issues of Correlated Codings and Confounded Variance
This quandary discusses another source of Type I error: how correlation between contrast codings can result in confounded
variance and, as a consequence, an inability to cleanly test for distinct effects. As was touched on briefly as part of the last
quandary, correlated contrast codings can often lead to statistically ambiguous results. This fact becomes especially troubling
with a common type of hypothesis development: a separately hypothesized main effect (H1) and ordinal interaction (H2), as
depicted in Panel A of Figure 4.
To test the ordinal interaction hypothesis (H2) depicted in the figure, a custom contrast coding that reflects the expected
overall pattern of means, such as [+5, +1, −3, −3], could appear appropriate.
However, there may be a problem with this approach. The [+5, +1, −3, −3] custom hypothesis test is not orthogonal to the
[+1, +1, −1, −1] weighting of the ANOVA main effect test used for H1. Specifically:

correlation([+5, +1, −3, −3], [+1, +1, −1, −1]) = 0.905,    (3)

where [+5, +1, −3, −3] is the interaction contrast weighting (H2) and [+1, +1, −1, −1] is the main effect weighting of Factor 1 (H1).

Thus, with a [+5, +1, −3, −3] custom contrast test, if the results are visually ambiguous with respect to H2, such as in Figure 4,
Panel B, then it will be unclear as to whether there is an interaction or simply a main effect and some noise. In this case,
evaluation of visual fit alone does not provide enough information to come to a conclusion. In addition, because of the non-
orthogonality between these sets of weights, the significance of the [+5, +1, −3, −3] test of H2 could be entirely an artifact of a
significant main effect of Factor 1, tested by H1.14 If the two contrast weighting schemes used are orthogonal, then this is not
the case. When contrasts are orthogonal, these contrast weights have zero correlation and contrast tests specified with these
codings will pick up distinct effects (Myers et al. 2010). For example, the ANOVA interaction term is orthogonal to each of the

13 This occurs because correlated sets of contrast weights will almost always yield statistically similar results. This point is discussed in greater detail in
the next quandary.
14 To make matters worse, in reality, we are unable to discern between these three possibilities: (1) a main effect truly exists and the ordinal interaction
does not, (2) the ordinal interaction exists and the main effect does not, or (3) both effects exist. The correlation between the sets of weightings does not
allow for the variance to be cleanly attributed to any of these effects.


FIGURE 4
Quandary 5
Interaction Plots of a Predicted Pattern of Means and Simulated Data

Panel A: [+5, +1, −3, −3]

Panel B: Simulated Data

This figure illustrates visual fit ambiguity that can arise between a contrast weight set and a pattern of means when evaluating a combined main effect and
ordinal interaction hypothesis. Panel A presents the pattern of means indicated by a [+5, +1, −3, −3] set of contrast weights and Panel B presents simulated
data. Cell means in Panel B are indicated by boxed letters, and individual observations are clustered around these boxes.

For example, the ANOVA interaction term is orthogonal to each of the tests of the main effects (e.g., Equation (4) below). Since all pairwise correlations of these contrast weights are exactly zero, each term in the ANOVA tests for a distinct effect:

$$\mathrm{correlation}\left(\underbrace{\begin{bmatrix} +1 & -1 \\ -1 & +1 \end{bmatrix}}_{\substack{\text{ANOVA}\\ \text{interaction}}},\; \underbrace{\begin{bmatrix} +1 & +1 \\ -1 & -1 \end{bmatrix}}_{\substack{\text{main effect}\\ \text{of Factor 1}}}\right) = 0. \tag{4}$$
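Orthogonality checks of this kind are straightforward to reproduce. Below is a minimal sketch in R (the vector names are ours, and cells are ordered [A, B, C, D]) that recovers the correlations in Equations (3) and (4):

```r
# Minimal sketch (R): checking whether two contrast codings are orthogonal
# by correlating their weight vectors. Cells are ordered [A, B, C, D].
h2_custom <- c(+5, +1, -3, -3)  # custom pattern contrast for H2
h1_main   <- c(+1, +1, -1, -1)  # ANOVA main effect of Factor 1 (H1)
anova_int <- c(+1, -1, -1, +1)  # conventional ANOVA interaction term

cor(h2_custom, h1_main)  # 0.905 (Equation (3)): confounded with H1
cor(anova_int, h1_main)  # 0     (Equation (4)): tests a distinct effect
```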

In light of these facts, a closer look at the composition of the [+5, +1, −3, −3] contrast is warranted. We note that this weighting is a linear combination of the traditional ANOVA main effect weighting and a [+3, −1, −1, −1] ordinal interaction contrast.

Specifically:

$$\underbrace{\begin{bmatrix} +5 & +1 \\ -3 & -3 \end{bmatrix}}_{\substack{\text{contrast weights}\\ \text{initially chosen for H2}}} = 2 \times \underbrace{\begin{bmatrix} +1 & +1 \\ -1 & -1 \end{bmatrix}}_{\substack{\text{main effect}\\ \text{of Factor 1}}} + \underbrace{\begin{bmatrix} +3 & -1 \\ -1 & -1 \end{bmatrix}}_{\substack{\text{ordinal}\\ \text{interaction}}}. \tag{5}$$

This reveals that the [+5, +1, −3, −3] pattern of means could be more accurately described as the combination of a main effect and an ordinal interaction (Rosenthal and Rosnow 1985; Rosnow and Rosenthal 1995, 1996). With this in mind, we can ‘‘subtract out’’ the test of H1 from the original test of H2, leaving [+3, −1, −1, −1] as the ‘‘remaining’’ test of the ordinal interaction, and consider this as a possible new weighting for H2.15

However, the problem remains: the new set of weights for H2 is still highly correlated with and, therefore, largely informationally redundant to the ANOVA main effect test of H1:
$$\mathrm{correlation}\left(\underbrace{\begin{bmatrix} +3 & -1 \\ -1 & -1 \end{bmatrix}}_{\substack{\text{new set of}\\ \text{weights for H2}}},\; \underbrace{\begin{bmatrix} +1 & +1 \\ -1 & -1 \end{bmatrix}}_{\substack{\text{main effect of}\\ \text{Factor 1 (H1)}}}\right) = 0.577. \tag{6}$$

The new contrast weights for H2 are still not orthogonal to the H1 main effect weights. Thus, whether the [+5, +1, −3, −3] or the [+3, −1, −1, −1] weighting is used to test H2, the significance of the test could still solely be due to a main effect of Factor 1, increasing the chance of Type I error.

So how do we arrive at a better solution? We can decompose Equation (5) into the distinct, orthogonal components of the ANOVA test:

$$\begin{bmatrix} +5 & +1 \\ -3 & -3 \end{bmatrix} = 3 \times \underbrace{\begin{bmatrix} +1 & +1 \\ -1 & -1 \end{bmatrix}}_{\substack{\text{main effect}\\ \text{of Factor 1}}} + \underbrace{\begin{bmatrix} +1 & -1 \\ +1 & -1 \end{bmatrix}}_{\substack{\text{main effect}\\ \text{of Factor 2}}} + \underbrace{\begin{bmatrix} +1 & -1 \\ -1 & +1 \end{bmatrix}}_{\substack{\text{Factor 1}\times\text{Factor 2}\\ \text{interaction}}}. \tag{7}$$
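The decomposition in Equation (7), and the mutual orthogonality of its components, can be checked numerically. A minimal R sketch (again with our own vector names):

```r
# Minimal sketch (R): Equation (7) as an orthogonal decomposition of the
# [+5, +1, -3, -3] pattern into conventional ANOVA components.
f1  <- c(+1, +1, -1, -1)  # main effect of Factor 1
f2  <- c(+1, -1, +1, -1)  # main effect of Factor 2
int <- c(+1, -1, -1, +1)  # Factor 1 x Factor 2 interaction

all(3 * f1 + f2 + int == c(+5, +1, -3, -3))  # TRUE: Equation (7) holds
crossprod(cbind(f1, f2, int))  # off-diagonal entries are 0: orthogonal
```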

As discussed above, the component weightings on the right-hand side of Equation (7) are orthogonal and, therefore, are able to test for distinct effects.16 To test H1 and H2 independently of one another, a test must be used that can detect an interaction (H2) independently of a main effect (H1). The conventional ANOVA interaction term is uncorrelated with the main effect test for H1. If theory truly suggests that an overall [+5, +1, −3, −3] expected pattern of means is driven by two distinct effects, then separately testing for these effects using the conventional main effect and interaction terms from the ANOVA is a lower-power, but clean, test (see Rosenthal and Rosnow 1985). In addition, this method of analysis benefits from parsimony, as it provides a test for the two underlying effects and does not add any other assumptions or predictions about shape or magnitude.
While orthogonality is an issue in the example discussed above, there is no hard and fast rule that researchers must always decompose patterns of means into mutually orthogonal contrasts (see, e.g., Myers et al. 2010). And to be clear, we are not summarily dismissing custom contrast weights such as [+5, +1, −3, −3]. We have a more subtle point.

15 See Rosenthal and Rosnow (1985) and Furr and Rosenthal (2003) for examples of subtracting out one hypothesized contrast weight test from another.
16 Note further that [+5, +1, −3, −3] is not unique in this regard. Any pattern of means can be broken down into an orthogonal linear combination of conventional main effect and ANOVA interaction components (cf. Rosnow and Rosenthal 1995). This follows from the fact that in any balanced (i.e., equal cell size) experimental design, the between-cells variance exactly equals the sum of the ANOVA main effects and interaction term with no confounded variance (see Rosnow and Rosenthal 1995; Lane 2015).


Levin and Neumann (1999) suggest that these tests should be used when the overall pattern of means is of interest, as opposed to the components (or subcomparisons) that make up the overall pattern (see, also, Rosenthal and Rosnow 1985).17 We echo this sentiment.
When the overall pattern is of primary interest and the researcher is not concerned with demonstrating the individual components (e.g., a separate main and interactive effect) that combine to form this pattern, a theoretically justified custom contrast (paired with a between-cells residual test) such as [+5, +1, −3, −3] is an appropriate significance test of the overall pattern.18 However, when theory speaks to the individual components of a pattern of means, then being able to cleanly test for each of these components separately is imperative. Contrast tests for each component that are correlated with one another cannot be that test.

6. Thrown Under the Semi-Omnibus: Limitations of a Common Test and a New Metric
While the importance of residual between-cells variance testing is clear, its most common operationalization, the semi-
omnibus F-test, is not without limitations. The semi-omnibus F tests the null hypothesis of no additional systematic effects in
the data, after controlling for the tested contrast, by capitalizing on the fact that any other systematic effects present after
accounting for the contrast will be considered residual between-cells variance. But since the researcher conducts this test in
order to show the absence of an additional effect or effects, the test is burdened with proving a null hypothesis via a non-
significant result evaluated using a single F-statistic and associated p-value.
Further, a semi-omnibus F-test result may be hard to interpret when the test yields a p-value that is not significant, but
comes close (e.g., a p-value of 0.15). This situation is illustrated with simulated data in Table 3 and Figure 5. As shown in
Figure 5, Panel B, the means of these data are the same as the means of the simulated data in Figure 2, Panel B. The lines in Figure 5, Panel B are clearly parallel and there is no interaction (F₁,₃₆ = 0.00 exactly, p = 1.00; Table 3, Panel A). Despite this, a [+3, −1, −1, −1] contrast test is significant (F₁,₃₆ = 4.63, p = 0.038; Table 3, Panel B) and the paired residual between-cells variance is not significant (F₂,₃₆ = 2.03, p = 0.146; Table 3, Panel B). With a significant contrast test and an insignificant semi-omnibus F-test, current practice would be to consider a [+3, −1, −1, −1] contrast hypothesis supported. Said differently, current practice may lead us to conclude that the pattern of means follows a [+3, −1, −1, −1] pattern even though it very clearly does not.19
If the results are more visually ambiguous than the results in Figure 5, Panel B, then this becomes even more difficult.20 The interaction may exist and it may not be the case that there is only a pair of main effects and sampling error. On the other hand, it could be that the predicted effect does not exist, the data do not fit the expected pattern, and a small increase in sample size would give the semi-omnibus F-test enough power to become significant. And, as discussed earlier, conducting a set of simple effects tests as the sole test of an interaction does not provide a viable solution; it is possible to find simple effects test results that contradict the predicted pattern (Rutherford 2001).21 In cases like this, even the combination of significance testing and qualitative evaluation of visual fit can fall short. In order to address situations like this, we propose a metric that provides a way to examine how well a contrast explains the between-cells variance without concern that a significance test may harbor too much, or not enough, power. Our measure, q², is the proportion of the remaining between-cells variance that is left unexplained by the contrast with respect to the total explainable variance in the experiment. This measure allows for a quantitative evaluation of the relative size of the residual between-cells effect that remains unexplained by the contrast of interest. This is a more quantitative evaluation of fit than a simple visual comparison of the predicted and observed means. This relative effect size measure completes the third part of our approach.

17 By determining that the overall pattern of means is of primary interest, there is also an implicit assumption being made that should be noted. This is the fact that the extent that the main effect and ordinal interaction share variance is not of interest—even if one of the two effects is the primary driver of the overall pattern. The control inherent in a well-designed experiment allows for statistical techniques to partition variance into individual sources, such as main effects, interactions, and error. However, when testing an overall pattern of means with a custom contrast, researchers make a trade-off. In order to increase power to detect an overall effect, they are no longer able to speak to which individual sources of variance are necessary to produce the effect. While this trade-off may make sense in the context of a specific research question, it should not be taken lightly as it may reduce what can be learned from the results of a study.
18 With the caution in the previous footnote in mind, we note that the [+3, −1, −1, −1] ordinal interaction contrast and the [0, +1, 0, −1] ‘‘pac-man’’-style interaction, discussed in Quandary 3, are examples where the overall shape of means may be of greater interest than the component effects.
19 For this example, we simulate and present data that, when analyzed, result in a semi-omnibus F-test p-value that could be classified as being ‘‘nearly marginally significant’’ (p = 0.146). While we also generated data that result in a significant contrast with even larger p-values for the semi-omnibus F-test, for brevity, we chose not to include those data in the paper. However, this demonstrates that due to the limitations inherent in the semi-omnibus F-test, even a relatively non-controversial semi-omnibus p-value does not always mean that the observed means fit the predicted pattern.
20 Researchers attempting to ‘‘see’’ questionable interactions should be aware of the human tendency to see spurious effects in random noise (i.e., misperceptions of randomness; Gilovich, Vallone, and Tversky [1985]) and the widespread failure to consider disconfirming theories and ignore or misconstrue disconfirming evidence (i.e., confirmation bias; Wason [1960]).
21 For example, in Figure 5, Panel A, the expected pattern shows no difference between cells B and D. However, in the simulated data plotted as Figure 5, Panel B, this simple effect is significant (t₁₈ = 2.390, p = 0.028, two-tailed, untabulated).


TABLE 3
Quandary 6: Simulated Data in Figure 5, Panel B

Panel A: Analysis of Variance

Source of Variance               Sums of Squares   Degrees of Freedom   Mean Square   F-Statistic   p-value
Factor 1                                   90.00                    1         90.00          7.82     0.008
Factor 2                                   10.00                    1         10.00          0.87     0.357
Interaction                                 0.00                    1          0.00          0.00     1.000
Total between-cells variance              100.00                    3         33.33          2.90     0.048
Error                                     414.29                   36         11.51
Total                                     514.29                   39

Panel B: Residual Between-Cells Variance Test

Source of Variance                    Sums of Squares   Degrees of Freedom   Mean Square   F-Statistic   p-value
[+3, −1, −1, −1] for [A, B, C, D]               53.33                    1         53.33          4.63     0.038
Residual between-cells variance                 46.67                    2         23.33          2.03     0.146
Total between-cells variance                   100.00                    3         33.33          2.90     0.048
Error                                          414.29                   36         11.51
Total                                          514.29                   39
Contrast Variance Residual, q²                 46.67%

This table presents custom contrast and residual between-cells variance test results, as well as the contrast variance residual (q²) metric, for the simulated data depicted in Figure 5, Panel B. These data were simulated using the two-factor design structural model exhibited by Myers et al. (2010, 204). This model appends a stochastic residual error to an additive model that contains main effects of each factor, as well as an interaction. The model allows for the partitioning of deviations from a grand mean into components attributable to fixed effects and residual error.
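For readers who wish to trace the Panel B figures, the sketch below reproduces them in R from the reported summary quantities. The per-cell n of 10 is our inference from the reported degrees of freedom (error df = 36 = 40 − 4 with equal cells), not a value stated in the exhibit:

```r
# Minimal sketch (R): the contrast and semi-omnibus residual tests behind
# Table 3, Panel B. Cell means and MS_error are taken from the exhibit;
# n = 10 per cell is inferred from the reported df.
w        <- c(+3, -1, -1, -1)
means    <- c(7, 6, 4, 3)
n        <- 10
ms_error <- 11.51

L           <- sum(w * means)                               # contrast value: 8
ss_contrast <- L^2 / sum(w^2 / n)                           # 53.33
ss_between  <- n * sum((means - mean(means))^2)             # 100.00
F_contrast  <- ss_contrast / ms_error                       # 4.63 on (1, 36) df
F_residual  <- ((ss_between - ss_contrast) / 2) / ms_error  # 2.03 on (2, 36) df
q2          <- 1 - ss_contrast / ss_between                 # 0.467 (46.67%)
```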

We wish to make clear that q² is a measure of relative effect size and should not be considered a measure of economic significance or as a measure of the absolute size of an effect. Instead, the measure provides insight into how much of the between-cells (i.e., explained or non-random) variance within the experiment the contrast is unable to account for. In other words, our measure tells readers what proportion of explainable (non-random) variance is ‘‘left on the table’’ after accounting for the variance explained by the contrast.
Our metric is motivated by the suggestions of Rosenthal et al. (2000) and Furr (2004), who propose that, rather than focus
exclusively on the statistical significance of the residual between-cells variance, researchers should also consider the amount of
this variance relative to the total between-cells effect. This perspective recognizes that significance testing cannot provide the
entire picture. A significance test is designed to provide information about the likelihood of observing a pattern of means as
compared with chance, an important consideration. But by also considering the amount of variance remaining unexplained by
the contrast, with respect to the amount of variance that could be explained in total, an assessment as to the effectiveness of the
contrast can be made. Said differently, significance testing provides information as to whether it is more likely than chance that
an observed pattern of means fits a prediction. Relative effect size metrics measure, in the context of a particular experiment,
how good that fit between observation and prediction is.
To derive our measure, we first consider the correlation between a predicted contrast, [A, B, C, D], and the corresponding cell means actually observed in the data, [x̄A, x̄B, x̄C, x̄D] (Rosnow and Rosenthal 1995; Rosenthal et al. 2000). This correlation coefficient measures the degree to which the observed means match the predicted pattern, and can be calculated as follows:

$$r = \mathrm{correlation}\left(\begin{bmatrix} A & B \\ C & D \end{bmatrix},\; \begin{bmatrix} \bar{x}_A & \bar{x}_B \\ \bar{x}_C & \bar{x}_D \end{bmatrix}\right). \tag{8}$$


FIGURE 5
Quandary 6
Interaction Plots of a Predicted Pattern of Means and Simulated Data

Panel A: [+3, −1, −1, −1]

Panel B: Simulated Data

This figure illustrates a lack of visual fit between a hypothesized contrast coding and a simulated dataset. Panel A visually presents the pattern of means indicated by a [+3, −1, −1, −1] set of contrast weights and Panel B presents the simulated data. Cell means in Panel B are indicated by boxed letters, and individual observations are clustered around these boxes. Contrast testing results for these data are exhibited in Table 3.


The square of this correlation, r², captures the amount of variance explained by the contrast of interest as a proportion of all between-cells variance.22 Similarly, the residual between-cells variance can be expressed as the proportion of the between-cells variance that remains unexplained by the contrast. This metric, which we refer to as the contrast variance residual and denote as q², does not hinge on statistical significance and is not affected by sample size:

$$q^2 = 1 - r^2. \tag{9}$$
To illustrate how q² can be useful, we note that for the predictions and simulated observations presented in Figure 1 and Figure 5, conclusions based on significance testing and visual inspection diverge. More specifically, the contrast tests are both significant, neither set of observations matches the predicted pattern, and the semi-omnibus F-test results diverge. However, q² yields the exact same conclusion as to the extent that the data fit the hypothesized [+3, −1, −1, −1] pattern. For both sets of data, the correlation between the observed means and the contrast weights is:

$$r = \mathrm{correlation}\left(\underbrace{\begin{bmatrix} 7 & 6 \\ 4 & 3 \end{bmatrix}}_{\substack{\text{Observed}\\ \text{Means}}},\; \underbrace{\begin{bmatrix} +3 & -1 \\ -1 & -1 \end{bmatrix}}_{\substack{\text{Contrast Test}\\ \text{Weights}}}\right) = 0.730. \tag{10}$$

The square of this correlation coefficient is:

$$r^2 = 0.730^2 = 0.533, \tag{11}$$

implying that the relative size of the residual between-cells variance is:

$$q^2 = 1 - r^2 = 0.467. \tag{12}$$

This means that approximately 47 percent of the systematic (i.e., between-cells) variance is not explained by the [+3, −1, −1, −1] contrast or, alternatively, that the variance assigned to the contrast effect and the remaining non-random variance after this assignment are nearly equal. This suggests that in both datasets, ‘‘something else’’ could be going on.
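Nothing beyond the observed cell means and the a priori weights is needed to compute q². A minimal R sketch of Equations (10) through (12):

```r
# Minimal sketch (R): the contrast variance residual q^2 from Equations
# (10)-(12), using the observed means and weights shown above.
means   <- c(7, 6, 4, 3)      # observed cell means [A, B, C, D]
weights <- c(+3, -1, -1, -1)  # a priori contrast weights

r  <- cor(weights, means)     # 0.730: fit of the means to the prediction
r^2                           # 0.533: between-cells variance explained
q2 <- 1 - r^2                 # 0.467: between-cells variance left unexplained
```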
It is important for us to note that we purposefully do not propose a ‘‘bright-line’’ cutoff or guideline for the interpretation of q². This metric is designed to be exhibited within the context of our three-part approach and to complement traditional significance testing and visual inspection. Over time, we expect that reasonable expectations of q² values will emerge with respect to different research topics, much like the expectations for values of R² have formed for different regression models in the archival literature (e.g., reasonable expectations of an R² value for an audit fee regression versus an earnings response coefficient regression). However, even as norms for q² emerge, we urge readers to regard them as more of a ‘‘sniff test’’ than a hard and fast rule, given their atheoretical nature.
With this in mind, the finding that nearly half of the between-cells variance remains after accounting for the [+3, −1, −1, −1] contrast, combined with a qualitative visual inspection of the pattern of means, strongly challenges the conclusion that the observed pattern is a match to the hypothesized shape, even though significance testing (in one of our cases) suggests that a match is present. Note that q² contributes to the analysis regardless of whether there is sufficient power to find significance in the semi-omnibus F-test and that all three parts of the approach contribute to the assessment of fit.

IV. CONCLUSION
Contrast testing has become commonplace in experimental accounting research, but guidance for its use has not kept pace. Our review of papers published in the top six accounting journals from 2010 to 2016 indicates that analyses using the [+3, −1, −1, −1] ordinal interaction contrast, first introduced by Buckless and Ravenscroft (1990), remain a standard approach. However, our review also suggests that researchers have turned to more exotic custom contrast weightings. Even when derived from theory and paired with a between-cells residual test, custom contrast tests for a predicted pattern of cell means make implicit assumptions about effect size magnitude, which experiments are generally not designed to speak to. These tests often use contrast codings that are correlated with and, therefore, confound variance with other hypothesized effects,

22 When cell sizes are equal, Furr (2004) shows that this r² measure is also equal to the contrast sum of squares divided by the total between-cells sum of squares. When cell sizes are not equal, squaring the correlation given by Equation (8) no longer exactly equals this ratio of sums of squares. However, Rosenthal et al. (2000) provide adjustments for unequal cell sizes in their computation of the percentage of between-cells variance explained by the contrast, and when employing those adjustments, their calculations still yield r² exactly equal to the square of the correlation in Equation (8). Thus, researchers should only use the ratio of the contrast sum of squares divided by the total between-cells sum of squares to compute r² if cell sizes are perfectly equal. In contrast, they can correctly use the squared correlation of the means and the weights (Equation (8)) to compute r², regardless of whether cell sizes are equal or unequal and, therefore, Equation (8) is the approach we recommend in general. In any case, we provide the adjustment given by Rosenthal et al. (2000) in Appendix A.


making inference about which of the effect(s) are present nearly impossible. Further, the burden placed on the semi-omnibus F-test to show that no statistically significant effects remain after accounting for the contrast (via a non-significant result with a single p-value) makes inferences from the most common test of between-cells residual variance unreliable. And finally, multiple coding scheme-based sensitivity analyses, which were more commonly reported than the between-cells residual test during our sample period, are expected to combat some of these concerns, but do not actually provide much incremental information about robustness at all.
But even with all of these drawbacks in mind, custom contrast testing is an indispensable tool for experimental accounting researchers, as it increases statistical power and provides a more nuanced analysis of interactive effects (Bobko 1986; Buckless and Ravenscroft 1990; Levin and Neumann 1999; Rosnow and Rosenthal 1995, 1996; Furr and Rosenthal 2003). One needs only to change the example to a 3 × 3 factorial design to realize that, in isolation, the significance of the ANOVA interaction term can be rather uninformative and that researchers would be at a great loss in understanding the nature of interactions without these techniques (Buckless and Ravenscroft 1990).
Our paper contributes to the accounting literature by providing a systematic three-part approach to contrast testing that works to preserve the power of custom contrast testing, while mitigating some of the risk of Type I error that these tests bring to bear. By combining traditional significance testing (which includes testing both the significance of the contrast and the non-significance of the residual between-cells variance) with a visual examination of fit (through the plotting of the predicted and observed means) and a quantitative evaluation of the contrast variance residual (via our q² metric), our approach presents researchers with a new path forward.23 Our approach does not completely rely on any one test, but instead uses all three approaches to strengthen inference via triangulation. Our approach follows longstanding calls in the social sciences to use more than one technique or method to provide multiple viewpoints and collect different kinds of evidence, as these can provide increased comfort that reported results are valid and are not due to analysis choices (Campbell and Fiske 1959; Denzin 1978; Jick 1979). Approaches that use triangulation provide greater confidence in results (Jick 1979).
Our approach, like any other, has limitations. Some of these limitations are inherent in the individual components of the approach. For example, there is no universal bright-line cutoff for the contrast variance residual (q²). In many cases, rules of thumb for similar effect size measures have evolved in the literature over time. However, these are often followed by statistics research pointing out the arbitrary and astatistical nature of such cutoffs (e.g., Glass, McGaw, and Smith 1981; Jaccard 1998; Lance, Butts, and Michels 2006). We suggest that q² will be best evaluated within the context of the topical area, the research question asked, and the state of prior theory, and that transparent disclosure of q² in the research literature will enable researchers, editors, reviewers, and readers to see what values are normally attainable within a stream of topically and methodologically related research. This approach is similar to how the coefficient of determination (R²) is evaluated with respect to the different regression models utilized across topical areas of archival research.
Another limitation of our approach is that the evaluation of visual fit may require significant judgment. Additionally, like all null hypothesis significance tests, the significance testing component of our approach is subject to concerns about statistical power. Further, when the information from the component techniques conflicts, researchers may be left to struggle with results that are inconclusive. We argue that while this may be unsatisfying, an inconclusive result is a more reasonable outcome than an incorrectly accepted or rejected hypothesis. And, on balance, we believe that the use of our triangulation approach will increase the quality of analysis in accounting research and provide readers greater comfort with reported results. It is our hope that this approach lights a new path forward for conducting, exhibiting, and interpreting rigorous custom contrast tests for both authors and the general scientific community.

23 In addition, we provide a tool for applying our approach. This tool performs the calculations and helps with the organization of results. It is available at: http://www.comptrastsproject.org

REFERENCES
Berkowitz, L., and E. Donnerstein. 1982. External validity is more than skin deep: Some answers to criticisms of laboratory experiments.
American Psychologist 37 (3): 245–257. https://doi.org/10.1037/0003-066X.37.3.245
Bobko, P. 1986. A solution to some dilemmas when testing hypotheses about ordinal interactions. Journal of Applied Psychology 71 (2):
323–326. https://doi.org/10.1037/0021-9010.71.2.323
Buckless, F. A., and S. P. Ravenscroft. 1990. Contrast coding: A refinement of ANOVA in behavioral analysis. The Accounting Review
65 (4): 933–945.
Campbell, D. T., and D. W. Fiske. 1959. Convergent and discriminant validation by the multi-trait-multimethod matrix. Psychological
Bulletin 56 (2): 81–105. https://doi.org/10.1037/h0046016



Denzin, N. K. 1978. The Research Act. 2nd edition. New York, NY: McGraw-Hill.
Furr, R. M. 2004. Interpreting effect sizes in contrast analysis. Understanding Statistics 3 (1): 1–25. https://doi.org/10.1207/
s15328031us0301_1
Furr, R. M., and R. Rosenthal. 2003. Evaluating theories efficiently: The nuts and bolts of contrast analysis. Understanding Statistics 2
(1): 45–67. https://doi.org/10.1207/S15328031US0201_03
Gelman, A., and H. Stern. 2006. The difference between ‘‘significant’’ and ‘‘not significant’’ is not itself statistically significant. American
Statistician 60 (4): 328–331. https://doi.org/10.1198/000313006X152649
Gelman, A., C. Pasarica, and R. Dodhia. 2002. Let’s practice what we preach: Turning tables into graphs. American Statistician 56 (2):
121–130. https://doi.org/10.1198/000313002317572790
Gilovich, T., R. Vallone, and A. Tversky. 1985. The hot hand in basketball: On the misperception of random sequences. Cognitive Psychology 17 (3): 295–314. https://doi.org/10.1016/0010-0285(85)90010-6
Glass, G. V., B. McGaw, and M. L. Smith. 1981. Meta-Analysis in Social Research. Thousand Oaks, CA: Sage Publications.
Jaccard, J. 1998. Interaction Effects in Factorial Analysis of Variance. Thousand Oaks, CA: Sage Publications.
Jick, T. D. 1979. Mixing qualitative and quantitative methods: Triangulation in action. Administrative Science Quarterly 24 (4): 602–611.
https://doi.org/10.2307/2392366
Lance, C. E., M. M. Butts, and L. C. Michels. 2006. The sources of four commonly reported cutoff criteria: What did they really say?
Organizational Research Methods 9 (2): 202–220. https://doi.org/10.1177/1094428105284919
Lane, D. M. 2015. Online Statistics Education: A Multimedia Course of Study. David M. Lane, Lead Developer, Rice University.
Available at: http://onlinestatbook.com
Levin, J. R., and E. Neumann. 1999. Testing for predicted patterns: When interest in the whole is greater than the sum of its parts.
Psychological Methods 4 (1): 44–57. https://doi.org/10.1037/1082-989X.4.1.44
Loftus, G. R. 1993. Editorial comment. Memory and Cognition 21 (1): 1–3. https://doi.org/10.3758/BF03211158
Loftus, G. R. 1996. Psychology will be a much better science when we change the way we analyze data. Current Directions in
Psychological Science 5 (6): 161–171. https://doi.org/10.1111/1467-8721.ep11512376
Maxwell, S. E., and H. D. Delaney. 2004. Designing Experiments and Analyzing Data: A Model Comparison Perspective. 2nd edition.
Mahwah, NJ: Lawrence Erlbaum Associates.
Meyer, D. E., M. K. Shamo, and D. Gopher. 1999. Information structure and the relative efficacy of tables and graphs. Human Factors 41
(4): 570–587. https://doi.org/10.1518/001872099779656707
Meyer, J., D. Shinar, and D. Leiser. 1997. Multiple factors that determine performance with tables and graphs. Human Factors 39 (2):
268–286. https://doi.org/10.1518/001872097778543921
Myers, J. L., A. D. Well, and R. F. Lorch, Jr. 2010. Research Design and Statistical Analysis. 3rd edition. New York, NY: Routledge.
Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th edition. Chicago, IL: Irwin.
Nickerson, R. S. 1998. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology 2 (2): 175–220.
https://doi.org/10.1037/1089-2680.2.2.175
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: The R Foundation. Available at: https://
www.R-project.org
Rosenthal, R., and R. L. Rosnow. 1985. Contrast Analysis: Focused Comparisons in the Analysis of Variance. Cambridge, U.K.:
Cambridge University Press.
Rosenthal, R., R. L. Rosnow, and D. B. Rubin. 2000. Contrasts and Effect Sizes in Behavioral Research: A Correlational Approach.
Cambridge, U.K.: Cambridge University Press.
Rosnow, R. L., and R. Rosenthal. 1995. ‘‘Some things you learn aren’t so’’: Cohen’s paradox, Asch’s paradigm, and the interpretation of
interaction. Psychological Science 6 (1): 3–9. https://doi.org/10.1111/j.1467-9280.1995.tb00297.x
Rosnow, R. L., and R. Rosenthal. 1996. Contrasts and interactions redux: Five easy pieces. Psychological Science 7 (4): 253–257. https://
doi.org/10.1111/j.1467-9280.1996.tb00369.x
Rutherford, A. 2001. Introducing ANOVA and ANCOVA: A GLM Approach. London, U.K.: Sage Publications.
Simkin, D., and R. Hastie. 1987. An information-processing analysis of graph perception. Journal of the American Statistical Association
82 (398): 454–465. https://doi.org/10.1080/01621459.1987.10478448
Wason, P. C. 1960. On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology 12 (3): 129–140.
Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer Science+Business Media.


APPENDIX A
Key Formulas for Contrast Analysis
This appendix serves as a reference for common formulas used in contrast analysis. A 2 × 2 between-subjects design is assumed. We define x̄ᵢ as the set of observed cell means, [x̄A, x̄B, x̄C, x̄D], and wᵢ as the set of contrast weights, [A, B, C, D], derived a priori from a well-grounded theoretical prediction. N is the total number of experimental observations; nᵢ is the number of observations in cell i; and C is the number of elements in set wᵢ, corresponding to the number of experimental conditions included in the experiment.24

Contrast Significance Test for a Linear Combination of Cell Means

Given equal variance:

$$t_{\text{contrast}} = t_{df} = \frac{\sum_i w_i \bar{x}_i}{\sqrt{MS_{\text{error}} \sum_i w_i^2 / n_i}}$$

or

$$F_{\text{contrast}} = F_{df=1} = \left[\sum_i w_i \bar{x}_i\right]^2 \left[\frac{1}{MS_{\text{error}} \sum_i w_i^2 / n_i}\right]$$

with degrees of freedom, df = N − C.
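As an illustration, the following minimal R sketch implements the equal-variance test from cell summary statistics; the function name is ours, and the example inputs are the Table 3 quantities with an assumed n of 10 per cell:

```r
# Minimal sketch (R): equal-variance contrast test from cell summaries.
# w = weights, xbar = cell means, n = cell sizes; MS_error and its
# degrees of freedom come from the omnibus ANOVA.
contrast_test <- function(w, xbar, n, ms_error, df_error) {
  t <- sum(w * xbar) / sqrt(ms_error * sum(w^2 / n))
  p <- 2 * pt(-abs(t), df_error)  # two-tailed; F = t^2 on (1, df_error) df
  c(t = t, F = t^2, p = p)
}

contrast_test(w = c(+3, -1, -1, -1), xbar = c(7, 6, 4, 3),
              n = rep(10, 4), ms_error = 11.51, df_error = 36)
# t = 2.15, F = 4.63, p = 0.038 (matches Table 3, Panel B)
```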



Given unequal variance (where sᵢ² is the variance of cell i):

$$t'_{\text{contrast}} = t'_{df'} = \frac{\sum_i w_i \bar{x}_i}{\sqrt{\sum_i w_i^2 s_i^2 / n_i}}$$

with degrees of freedom,

$$df' = \frac{\left(\sum_i w_i^2 s_i^2 / n_i\right)^2}{\sum_i \dfrac{w_i^4 s_i^4}{n_i^2 (n_i - 1)}}.$$
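A corresponding sketch of the unequal-variance version, with the approximate degrees of freedom computed as above (the function name is ours; inputs are per-cell summaries):

```r
# Minimal sketch (R): contrast test allowing unequal cell variances, with
# the approximate (Welch-style) degrees of freedom from the formula above.
# w = weights, xbar = cell means, s2 = cell variances, n = cell sizes.
contrast_welch <- function(w, xbar, s2, n) {
  se <- sqrt(sum(w^2 * s2 / n))
  t  <- sum(w * xbar) / se
  df <- sum(w^2 * s2 / n)^2 / sum(w^4 * s2^2 / (n^2 * (n - 1)))
  c(t = t, df = df, p = 2 * pt(-abs(t), df))
}
```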

The Semi-Omnibus F-Test of the Between-Cells Residual Variance

$$F_{BCRV} = F_{df_1,\, df_2} = \frac{(SS_{\text{between}} - MS_{\text{contrast}})/df_1}{MS_{\text{error}}}$$

with degrees of freedom, df₁ = C − 1 − 1 and df₂ = N − C.

Evaluation of the Relative Contrast Variance Residual,25 q²

$$q^2 = 1 - r^2 \quad \text{(when } r > 0\text{)}$$

Given equal or unequal cell sizes:

$$r = \mathrm{correlation}(w_i, \bar{x}_i)$$

Alternatively, given equal cell sizes:

$$r^2 = \frac{SS_{\text{contrast}}}{SS_{\text{between}}}$$
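As a quick check with the Table 3 quantities, the two routes to r² agree when cell sizes are equal:

```r
# Minimal sketch (R): with equal cell sizes, the squared correlation of
# weights and means equals SS_contrast / SS_between (Table 3 values).
r2_cor <- cor(c(+3, -1, -1, -1), c(7, 6, 4, 3))^2  # 0.533
r2_ss  <- 53.33 / 100.00                           # 0.533
all.equal(r2_cor, r2_ss, tolerance = 1e-3)         # TRUE
```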

24 Formulas herein are adapted from Myers et al. (2010), Rosenthal et al. (2000), and Furr (2004).

25 We note that the quantity q² is not, itself, a ‘‘square’’ of a meaningful number. However, we follow the general mathematical convention that measures of variance are denoted as squares.


Alternatively, given unequal cell sizes:

$$r = \sqrt{\frac{F_{\text{unequal } n_i}}{F_{\text{unequal } n_i} + F_{BCRV}(df)}}, \quad \text{where}$$

$$F_{\text{unequal } n_i} = F_{\text{equal } n_i} \left[\frac{\sum_i w_i^2}{\bar{n}}\right] \left[\frac{1}{\sum_i w_i^2 / n_i}\right] \quad \text{and} \quad df = C - 1 - 1.$$