
Journal of Management Scientific Reports

1-35
DOI: 10.1177/27550311231202696
© The Author(s) 2023
Article reuse guidelines:
sagepub.com/journals-permissions

Current reproducibility practices in management: What they are versus what they could be
José M. Cortina*
Virginia Commonwealth University, Richmond, VA, USA
Tine Köhler*
The University of Melbourne, Parkville, VIC, Australia
Lydia C. Aulisi
George Mason University, Fairfax, VA, USA

While debates about a replication crisis in organization studies have taken up significant journal
space in recent years, the issue of reproducibility has been mostly ignored. Reproducibility
manifests when researchers draw the same conclusions from a reanalysis of the dataset used in
the original study, employing either the same (literal reproducibility) or superior (constructive repro-
ducibility) data analytic techniques. Reproducibility studies are crucial for correcting accidental
mistakes as well as intentional distortions during data preparation and analysis, thus allowing a
science to be self-correcting. In the current editorial, we define reproducibility, provide published
examples that illustrate the crucial role that reproducibility plays in scientific knowledge produc-
tion, and offer findings from a review of papers published in the 2019 volumes of Academy of
Management Journal and Journal of Management to explore how frequently different forms of
reproducibility are employed in these top management outlets. We discuss the implications of our
findings for future research and reporting practices and offer guidance for authors, reviewers,
and editors.

Keywords: Reproducibility; replication; knowledge production; self-correcting science; constructive reproducibility

Acknowledgements: Work on the article was supported by the University of Melbourne’s Faculty of Business and
Economics.
*The first two authors contributed equally to this work.
Supplemental material for this article is available with the manuscript on the MSR website.
Corresponding Author: Jose M. Cortina, Department of Management and Entrepreneurship, Virginia
Commonwealth University, 301 W. Main St., Richmond, VA, USA.
Email: jcortina@vcu.edu


There is a great deal of interest in what is often referred to as the replication crisis (e.g.,
Aguinis et al., 2017; Ioannidis, 2005; Lynch et al., 2015; Open Science Collaboration,
2015; Stroebe & Strack, 2014). The crisis generally revolves around the fact that, because
of sampling error coupled with questionable research practices (QRPs; e.g., Kunert, 2016;
Stroebe & Strack, 2014), there is reason to question many findings in our field. This has
led to calls for more replication studies, particularly those that are independent and construc-
tive (Köhler & Cortina, 2021). But there is a concept that is more fundamental than replica-
tion: Reproducibility.
We offer a more comprehensive definition later, but for the moment, a reproducibility
study can be thought of as one in which a given question is re-answered with a given
dataset. Although such studies may not “delight” us in the Davis (1971) sense, they are essen-
tial if a scientific field is to be self-correcting (Gleser, 1996).
Such studies are, perhaps, more common than many realize, but only in certain forms.
Other forms, some of which are of the utmost importance, are rare. The purposes of the
present paper are to identify the different forms that reproducibility can take, to explain
what is and isn’t accomplished by each form, to provide a rough estimate of the frequency
with which each form is currently used in the organizational sciences, and to develop
strategies for identifying candidates for a reproducibility study. We begin, however, with a
definition.

Defining reproducibility
The term “reproducibility” is an unfortunate victim of the jingle fallacy. It is sometimes
used to mean replication, generalizability, or even just the degree to which findings are con-
sistent across studies. More recently, efforts have been made to define the term in a way that
makes its niche clear. Consistent with the computational sciences, Asendorpf et al. (2013)
define reproducibility as follows: "Researcher B … obtains exactly the same results (e.g., sta-
tistics and parameter estimates) that were originally reported by Researcher A … from A’s
data when following the same methodology.” What Asendorpf et al. (2013) describe is, in
fact, one specific form of reproducibility study, viz., an independent, literal reproducibility
study in that it involves a precise repetition of the analyses of the original researcher by a
second researcher using the original researcher’s data. Although this definition makes it
easy to identify a reproducibility study, there is value in generalizing it.
Köhler and Cortina (2021) provide the basis for a more general definition of reproducibil-
ity by observing that a defining characteristic of a reproducibility study, unlike a replication or
generalizability study, is that a reproducibility study involves a single data set, that is, no new
data are collected. Let us, therefore, use that observation as a starting point for the develop-
ment of a more encompassing definition of reproducibility.
Consider a hypothetical data file such as that in Table 1. It has N = 10 rows and K = 4
columns. We can start by stipulating that any study that adds an 11th row or a 5th column
would not be a reproducibility study because this would involve data that were not in the orig-
inal data file. Now let us suppose that the effect of X on Y was estimated by the original
researcher by applying ordinary least square (OLS) regression to the data in the green
shaded cells. A study that estimated the effect of Z on Y, even if it used only the data in
this data file, would not be reproducibility because it isn’t answering the same question
(i.e., does X affect Y) as that asked by the original researcher. According to the Asendorpf
definition, the only way for there to be a reproducibility study would be for a new researcher
to apply OLS regression to the green-shaded cells just as they are, using whichever techniques
for missing data were used by the original researcher, and presumably using the same statistical
software as well.

Table 1
Data example

Subject #    X     Y     Z     W
1            2     3     4     5
2            4     5     6     7
3            6     —     8     9
4            8     9    10    11
5            —    11    12    13
6           12    13    14    15
7           14    15    16    17
8           16    17    18    19
9           18    19    20    21
10          40    41    42    43

Note: — indicates a missing value.
But what if a study estimated the magnitude of the effect of X on Y by analyzing the data in
the green-shaded cells with, say, Poisson regression? What if a study winsorized1 the variables
at their high ends prior to conducting OLS? Because these new studies do not involve new data,
they are not replications. But neither do they conform to the Asendorpf et al. (2013) definition
of reproducibility. If, however, Poisson regression represents a superior analytic approach
(e.g., the data are count data), or if winsorizing makes estimators more robust given the positive
skew of the variables, then we must have room in our field for such studies. In other words, we
must make room for constructive reproducibility studies. Likewise, although there are many
advantages to independent corroboration of findings (we touch on this subject later), our
field should (and, as we show later, does) make room for the efforts of a researcher who
wishes to discover if her conclusions would have been different had she used different data
preparation or analytic techniques on her data. In technical terms, we must make room for
dependent reproducibility studies whether or not they are literal.
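To make these distinctions concrete, the following is a minimal sketch (ours, hypothetical, and not drawn from any of the studies discussed here) of what literal versus constructive reanalysis of the Table 1 data might look like, assuming the original analysis was an OLS regression of Y on X with listwise deletion and that Y can plausibly be treated as a count variable:

```python
# Minimal hypothetical sketch: literal vs. constructive reanalysis of the toy
# data in Table 1 (X and Y columns only; missing cells handled by deletion).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats.mstats import winsorize

df = pd.DataFrame({
    "X": [2, 4, 6, 8, np.nan, 12, 14, 16, 18, 40],
    "Y": [3, 5, np.nan, 9, 11, 13, 15, 17, 19, 41],
}).dropna()                       # same missing-data handling as the original

X = sm.add_constant(df["X"])

# Literal reproducibility: repeat the original OLS of Y on X exactly.
ols_fit = sm.OLS(df["Y"], X).fit()

# Constructive option 1: Poisson regression if Y is really count data.
poisson_fit = sm.GLM(df["Y"], X, family=sm.families.Poisson()).fit()

# Constructive option 2: winsorize the long right tail (subject 10) before OLS.
df_w = df.apply(lambda col: np.asarray(winsorize(col, limits=[0, 0.2])))
ols_w_fit = sm.OLS(df_w["Y"], sm.add_constant(df_w["X"])).fit()

print(ols_fit.params, poisson_fit.params, ols_w_fit.params, sep="\n\n")
```

Because no new data are collected, each of these reanalyses falls under reproducibility; only the second and third are candidates for constructive reproducibility.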
Finally, what if a researcher realized that variable Z was a potential confound and
therefore re-estimated the X–Y relationship using Z as a control (the cells shaded
yellow)? Such a study would not fit under the definition of replication (or generalizability)
if for no other reason than that no new data are collected. Thus, if such efforts are to have a
home at all, reproducibility must be that home. As this still involves the original research
question (i.e., does X affect Y) and the N rows and K columns of the original data file, this
shouldn’t be a problem. If we make this last observation our focus, then we can define
reproducibility as follows:

Given an original study of the relationship(s) between two or more variables using a subset of the
K variables from an N by K data matrix, based on analysis of n ∈ [2, N] rows and k ∈ [2, K]
columns, a reproducibility study is any study of those same relationships based on any n ∈ [2, N]
rows and k ∈ [2, K] columns of the original data file. The reproducibility study may be con-
ducted by anyone, including the original researcher, using any analytic approach.


Different forms of reproducibility


Reproducibility studies can be distinguished along two broad categories: independence
and constructiveness (Köhler & Cortina, 2023). A reproducibility study is independent if it
is conducted by a different set of investigators from the original study. If the reanalysis is con-
ducted by the original investigators, it is dependent. If there is incomplete overlap between
the list of investigators (e.g., Edwards & Harrison, 1993), then the study is semi-independent.
A reproducibility study is constructive if it retains all of the virtues of the original analysis
but includes at least one data handling or analytic improvement. There are degrees of con-
structiveness in reproducibility studies. Introduction of a minor improvement results in an
incremental constructive reproducibility study. Resolution of all major analytic flaws is com-
prehensive, and resolution of some-but-not-all major flaws is substantial. Reproducibility
designs that do not contain an element of constructiveness include (1) literal (aka exact)
reproducibility as it repeats the original analysis in every possible way; (2) regressive
reproducibility as it contains all of the analytic flaws of the original study plus at least
one additional flaw; and (3) quasi-random reproducibility as it introduces differences
that are neither superior nor inferior. Table 2 summarizes the goals of each combination
of independence and constructiveness. The reader can consult Köhler and Cortina (2021;
2023) for more detail.

The value of reproducibility


Gleser (1996) and others have noted that reproducibility studies are essential if a field is to
be self-correcting. As was mentioned earlier, there are a variety of reasons for results to differ
across author teams. In the past, these reasons might have been considered hypothetical, but
no longer. For example, Silberzahn et al. (2018) arranged to have 29 research teams address
the same psychological science question using the same data. The teams, which were asked
only to choose what they considered to be the most appropriate analytic methods, made very
different choices and arrived at very different conclusions.
Of course, the researchers in Silberzahn et al. (2018) had no incentive to make analytic
choices that were likely to produce hypothesis-consistent results. Where such incentives do
(or are perceived to) exist, we might expect even more variability in findings. Wicherts
et al. (2011) were able to assess willingness to share data in author teams from 49 papers
in psychology and found that willingness was related to strength of evidence and number
of statistical reporting errors. In short, Silberzahn, Wicherts, and others have demonstrated
that, where reproducibility is possible, it is most illuminating. The reasons that results can differ
across analysts are cataloged in Köhler and Cortina (2023) and needn't be repeated here, but it is instructive to look at a
couple of the very small number of reproducibility studies that exist in our field.
Consider the value of Edwards and Harrison (1993). These authors reanalyzed the data
used by French, Caplan and Van Harrison (1982) to test a model of the relationship
between person-environment fit and strain. Specifically, they replaced the difference score
indices of fit used by French et al. (1982) with indices that relaxed the unnecessary constraints
implied by difference score indices. For example, rather than indexing fit with b0 + b1(E-P),
where E and P stand for Environment and Person, they used b0 + b1E + b2P which omits the
superfluous constraint that b1=-b2. These authors found, among other things, that use of their
unconstrained indices increased adjusted R2 by 146%. This and related work (e.g., Edwards &
Parry, 1993) changed the way that the field conceptualizes and tests for the effects of
congruence.

Table 2
Forms of reproducibility

Literal reproducibility: Reanalyze the same dataset, using the same analytical approach and software, to see whether the same results are obtained.
- Dependent literal reproducibility. Targets concerns about: reporting accuracy/completeness, and chance. Examples: da Motta Veiga & Gabriel (2016).
- Independent literal reproducibility. Targets concerns about: reporting accuracy/completeness, chance, researcher competence, researcher malfeasance. Examples: Kabins, Xu, Bergman, Berry & Willson (2016); Obenauer (2023).

Quasi-random reproducibility: Reanalyze the same dataset, using an approach or software that is different without necessarily being better (e.g., testing a structural equation model in LISREL and then corroborating results with MPlus).
- Dependent quasi-random reproducibility. Targets concerns about: variance of results across different data analytic approaches. Examples: Avery, Wang, Volpone & Zhou (2013); Eatough, Chang, Miloslavic & Johnson (2011).
- Independent quasi-random reproducibility. Targets concerns about: variance of results across different data analytic approaches; researcher analytic competence.

Constructive reproducibility: Reanalyze the same dataset, using superior analytic techniques or using the previously employed techniques more properly (e.g., using weighted least squares estimation because of violation of the assumptions associated with maximum likelihood estimation). Targets concerns about flaws in the analytical approach.
- Dependent constructive reproducibility. Targets concerns about: inaccurate (or inadequate) results. Examples: Martinez, White, Shapiro & Hebl (2016); Nahum-Shani, Henderson, Lim & Vinokur (2014).
- Independent constructive reproducibility. Targets concerns about: inaccurate (or inadequate) results, experimenter competence, experimenter malfeasance, inaccuracy in reporting of data analytic details. Examples: Obenauer (2023); Theissen et al. (2023); Edwards and Harrison (1993); Hollenbeck, DeRue & Mannor (2006); Judge, Thoresen, Bono & Patton (2001); Van Iddekinge, Roth, Raymark & Odle-Dusseau (2012).
As another example, consider Hollenbeck, DeRue and Mannor (2006). These authors
reanalyzed the data in the Peterson, Smith, Martorana and Owens (2003) study of the relation-
ship between CEO personality and organizational performance. Hollenbeck et al. (2006)
found that conclusions regarding most of the 17 relationships examined by Peterson et al.
(2003) would change if a single subject were omitted from the analysis. They also found that
reanalysis that includes consideration of experiment-wise error eliminated all of the signifi-
cant relationships reported by Peterson et al. (2003). In short, whereas the original authors
concluded that several CEO personality characteristics were related to top management
team dynamics, which were then related to firm performance, the Hollenbeck et al. (2006)
reanalysis showed that almost none of these conclusions were actually supported by the data.
In a search that we conducted a few years ago, these were the only two examples of inde-
pendent reproducibility studies that we could find in our field. As we show later, the existence
of the Journal of Management Scientific Reports (JOMSR) is changing this. In any case, these
examples show that such studies can force us to reconsider conclusions regarding relation-
ships of great import. It is because of this potential that reproducibility studies are
common in other disciplines. In the sections that follow, we describe two consequential exam-
ples from economics as well as two recent papers published in JOMSR. These examples help
to illustrate the sorts of problems that are addressed by reproducibility generally and construc-
tive reproducibility particularly. Although these examples are, on the surface, very different
from one another and from the two examples already mentioned, they also share some impor-
tant commonalities. Later in the paper, we will use these commonalities to build a system for
identifying reproducibility candidates.
In describing the first two of these examples, we go into some detail. We beg the reader’s
indulgence, as we felt that this detail was necessary to communicate the value of the repro-
ducibility represented in each example. Two additional examples are contained in Section
A of Supplemental Online Materials (SOM).

Levitt and List (2011) and the Hawthorne effect


We are all familiar with the “Hawthorne effect.” Indeed, it is one of the most widely known
findings in the history of the behavioral sciences (Levitt and List, 2011). Wikipedia (2022)
defines the effect as, “a type of reactivity in which individuals modify an aspect of their
behavior in response to their awareness of being observed,” a definition that is entirely con-
sistent with that in the Oxford English Dictionary. The original illumination studies, conducted at
Western Electric's Hawthorne Works in Cicero, Illinois between 1924 and 1927, were intended
to discover optimal illumination levels for worker productivity. Instead, productivity
improved with each change in illumination levels, including a return to the original levels,
simply because the workers knew that their productivity was being monitored. Or so it
was thought.
The Hawthorne studies are often cited as the foundation of the human relations movement
(e.g., Franke & Kaul, 1978) and gave rise, along with Taylor’s scientific management, to the
field of Industrial Psychology (Levitt & List, 2011). This is odd given that the original studies
were never published. The electric companies sponsoring the research had been hoping to use the
results to sell more lightbulbs, but when it appeared that illumination levels were irrelevant, they lost interest
(Gillespie, 1991). It is odder still given that the original reports of the illumination studies
were lost (Gale, 2004). In short, the famous Hawthorne Effect is based on a quasi-experiment
designed by engineers and written up in a report that was not only unpublished but also
unseen for at least 75 years.

A 1989 meta-analysis of studies examining the Hawthorne effect in education research
(Adair, Sharpe & Huynh, 1989) found that evidence in support of the Hawthorne effect is,
at best, mixed. There was, therefore, good reason to wonder if reanalysis of the original
data might prove illuminating.2
Levitt and List (2011) tracked down microfilm versions of some of the original data at the
University of Wisconsin-Milwaukee.3 This led them to the Baker Library at Harvard Business
School, where they found the rest of the data. We direct the reader to that paper for details of
the design of the original Hawthorne studies, but the basics are as follows. Data on units com-
pleted were collected for assembly workers working in three different rooms at the plant over
three waves between November 1924 and October 1927. Number of foot-candles of artificial
light was varied within and between rooms at various times over the three years, and number
of units produced was measured.
We are all aware of the conclusions that were drawn by those who originally publicized the
results, most notably Mayo (1933) and Roethlisberger and Dickson (1939). Output increased
immediately following lighting changes and tapered off after changes were halted, thus sug-
gesting that workers were responding not to illumination levels but to the fact that they were
being studied. Levitt and List (2011) conducted regression analyses in which the dependent
variable was daily output per worker, and the predictors of primary interest were dichotomous
design variables indicating whether there had been changes to artificial light in each of the
previous six days. Unlike the original analyses, these also included controls for day of the
week, month, whether the day was before or after a holiday, weather conditions, room,
and whether inputs were defective on that day.
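For readers less familiar with this kind of specification, the following is a hypothetical sketch (synthetic data and our own variable names, not Levitt and List's) of a regression of daily output on lagged lighting-change indicators plus calendar and room controls:

```python
# Hypothetical sketch (synthetic data, our own variable names) of the kind of
# specification described above: daily output regressed on indicators for
# lighting changes in each of the previous six days plus calendar/room controls.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1500
hawthorne = pd.DataFrame({
    "output": rng.normal(50, 5, n),           # daily output per worker
    "day_of_week": rng.integers(0, 6, n),     # six working days
    "month": rng.integers(1, 13, n),
    "post_holiday": rng.integers(0, 2, n),
    "room": rng.integers(1, 4, n),
    "defective_inputs": rng.integers(0, 2, n),
})
for k in range(1, 7):                         # was lighting changed k days ago?
    hawthorne[f"light_change_lag{k}"] = rng.integers(0, 2, n)

lags = " + ".join(f"light_change_lag{k}" for k in range(1, 7))
formula = ("output ~ " + lags +
           " + C(day_of_week) + C(month) + post_holiday + C(room) + defective_inputs")
fit = smf.ols(formula, data=hawthorne).fit()
print(fit.params.filter(like="light_change"))   # the design-variable coefficients
```

The substantive test is simply whether the lagged lighting-change coefficients survive once the calendar and plant controls are in the model.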
Levitt and List (2011) reported that none of the coefficients for the design variables rep-
resenting changes in lighting were significant in spite of N > 1500. These authors also
explored the possibility that longer term effects could be seen by comparing room 1 (in
which the lighting manipulations occurred in all three waves) to rooms 2 and 3 (in which
the manipulations only occurred in wave 1) and by comparing all three rooms to other
rooms in the plant (i.e., those in which no such studies were conducted). Based on the
results of these and other reanalyses, it appears that what was originally attributed to the
Hawthorne effect was instead nothing more than seasonal fluctuations in output.
Comparison of experimented-upon rooms to other rooms shows that the drop off in produc-
tivity that coincided with the halting of lighting changes was also observed in rooms in which
lighting was not manipulated. It appears instead that productivity tended to decrease in late
Spring. Moreover, the comparison of wave 2 in room 1 to wave 2 in rooms 2 and 3 shows
that, although productivity increased in room 1 after changes in lighting recommenced (sug-
gesting Hawthorne), it also increased in rooms 2 and 3. Once again, seasonal fluctuation
rather than Hawthorne appears to have been the culprit.
Finally, Levitt and List (2011) leveraged the fact that the original experimenters reported
not only the number of foot-candles of artificial light but also the number of foot-candles of
natural light. Because the attention of workers was (supposedly) drawn to changes in artificial
light but not to changes in natural light, one should see increases in productivity associated
with increases or decreases in artificial light but minimal changes in productivity associated
with daily or seasonal changes in natural light. Levitt and List (2011) found that, when the
same sorts of control variables are included, the effect of artificial light, while positive, is min-
iscule (less than a 1% increase in productivity for a 1 SD increase in foot-candles). Instead, the
big winners were the day of the week, whether or not the previous day had been a holiday, and
the number of defective inputs.
The reproducibility study by Levitt and List (2011) was clearly an example of constructive
reproducibility in that it was a reanalysis without one of the major flaws of the original anal-
ysis, viz. failure to control for various aspects of time. Through this superior analysis, Levitt
and List (2011) showed that one of the most famous effects in the history of the behavioral
sciences is illusory. Productivity differences at the Hawthorne plant were largely due to
nothing more than a tendency for productivity to drop off over the course of the week and
at certain times of the year. How might the world of work be different today if the original
Hawthorne data had been analyzed properly?

Herndon et al. (2014) and the effect of high debt ratios


The Hawthorne illumination studies were unique in the long-term impact that they had.
But even the Hawthorne studies needed more than a decade for their impact to begin to mate-
rialize. The study of the effect of debt ratios on economic growth by Reinhart and Rogoff
(2010), on the other hand, is remarkable for its immediate impact. Even though this is an eco-
nomics example, the importance of the topic and the nature of the study should be clear to all
of us.
Published in American Economic Review (generally regarded as the top journal in all of
economics), Reinhart and Rogoff (2010) analyzed country-level data over more than two cen-
turies in order to estimate the relationship between a country’s debt ratio (i.e., debt as a pro-
portion of GDP) and economic growth. Simply put, what they found was that, once a
country’s debt ratio reaches about 90%, economic growth drops off precipitously. The
lesson that the authors drew from this finding was that spending in countries at or near this
threshold must be cut if there is to be growth in the future. This was especially relevant
given that the paper was published shortly after the financial collapse of 2008 and early in
the Great Recession.
Herndon et al. (2014) note that the findings of this study served as the basis for testimony
to Congress by the authors and others regarding U.S. spending policy and were widely
reported in the popular press at the time. These findings were also the only evidence cited
in support of Congressman Paul Ryan’s 2013 federal budget proposal (an explanation of
the plan can be found at https://bipartisanpolicy.org/blog/ryan-fy14-budget/). Conservative
politicians in Europe and elsewhere used them to justify their recommendation of austerity
measures for countries whose economies were in particular trouble. In short, the authors' find-
ings regarding the dangers of a high debt ratio had an immediate impact that could be felt in
every corner of the world. What more could anyone ask?
It turned out that what anyone could ask was that the data collected by Reinhart and
Rogoff (2010) be analyzed properly, and this is what Herndon et al. (2014) set out to do.
In reanalyzing the Reinhart data, Herndon et al. (2014) found two glaring errors: selective
omission of cases and inappropriate weighting of values. We address each in turn because
each is crucial to reproducibility. We then describe the Herndon et al. (2014) findings.
Omission of cases. Herndon et al. (2014) discovered that Reinhart and Rogoff (2010)
intentionally omitted from their analyses the data from Australia, New Zealand, and
Canada for the 4 or 5 years following World War II. What is more, not only did Reinhart
and Rogoff (2010) fail to explain why they omitted these cases, they never acknowledged that
they had done so. By a remarkable coincidence, these three countries had high debt ratios in
the post-war years and experienced high growth. In other words, the data for these countries
flew in the face of the notion that high debt ratios slowed growth. Data for two of these coun-
tries appear to have been accidentally omitted from other analyses as well (as were data from
Austria, Belgium, and Denmark), but later data for New Zealand were included. Specifically,
New Zealand data reappear in the analyses once their economy began to contract.
Furthermore, there was nothing particular about the post-war years that would have caused
omission. As Herndon et al. (2014) note, the post-war data for the United States were included
in the analyses, and it will not surprise the reader to learn that the United States had a high
debt ratio at the time and that its economy was, for the most part, contracting.
Inappropriate weighting. Reinhart and Rogoff (2010) calculated GDP growth in each of
their four debt ratio categories using what we in our field would call an unweighted means
approach. As an example, Herndon et al. (2014) compare the data for the United Kingdom
to those for New Zealand. The United Kingdom spent 19 years in the high debt ratio category.
GDP growth during these 19 years was 2.4%. Therefore, the value contributed by the UK data
for the GDP growth mean in the high debt ratio category was +2.4. New Zealand was in the
high debt ratio category for only one year (in part because their other data were excluded).
Growth in that one year was −7.6%, and this was the number contributed by New Zealand
to the computation of GDP growth. Now suppose that these were the only two countries in
the high debt ratio category. According to Reinhart and Rogoff’s methodology, the mean
GDP growth would be (2.4 − 7.6)/2 = −2.6%, in spite of the fact that the United Kingdom con-
tributed 19 years of data while New Zealand contributed only one. If instead these two
country-level values were weighted by the number of years on which they are based, the
mean would be ((19 × 2.4) + (1 × −7.6))/20 = +1.9%. As with omission of cases, Reinhart and
Rogoff (2010) never explained their decision to calculate unweighted means.
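The arithmetic is easy to verify; a short sketch using the UK/New Zealand values from the illustration above:

```python
# The UK/New Zealand illustration above, in code: an unweighted mean of the two
# country-level growth rates versus a mean weighted by years of data contributed.
import numpy as np

growth = np.array([2.4, -7.6])   # mean GDP growth while in the >90% debt category
years = np.array([19, 1])        # years each country spent in that category

unweighted = growth.mean()                     # (2.4 - 7.6) / 2        = -2.6
weighted = np.average(growth, weights=years)   # (19*2.4 + 1*(-7.6))/20 = +1.9

print(f"unweighted: {unweighted:+.1f}%, weighted: {weighted:+.1f}%")
```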
Consequences of these errors and the Herndon et al. (2014) reanalysis. Reinhart and
Rogoff (2010) thus committed a variety of analysis errors, all of which pushed one
towards the conclusion that austerity measures were beneficial for high debt countries. The
Herndon et al. (2014) reanalysis was quite simple. First, they included the omitted
post-war data from Australia, New Zealand, and Canada as well as other data that appear
to have been omitted by accident. Second, they weighted country-level means by the
number of years of data contributed by that country. This was done for both the 1946–
2009 data set and the 1790–2009 data set. The differences in results are dramatic.
In omitting, in particular, the data from Australia, Belgium, and Canada, Reinhart and
Rogoff (2010) omitted data from three countries that had strong growth while carrying a
large debt ratio. Moreover, through the use of unweighted means, countries for which few
years of data were available had an undue impact on results. From the 1946–2009 data,
Reinhart and Rogoff (2010) estimated that the mean GDP growth rate for countries with
debt ratios of 90% or more was −0.1%, far lower than, for example, the +3.4% growth
that they reported for countries with debt ratios in the 60%–90% range. This was much of
the basis for their conclusion that debt ratios greater than 90% were crippling. Herndon
et al. (2014) simply included all of the relevant data and weighted by sample size (i.e.,
number of years of data). Instead of a growth rate of −0.1% as reported by Reinhart and
Rogoff (2010), Herndon et al. (2014) found a growth rate for countries with debt ratios of
0.9 or greater of +2.2%. While this is lower than the 3.2% growth for countries with debt
ratios in the 0.6–0.9 range, it is not the catastrophic drop-off reported by the original
authors. Herndon et al. (2014) found something similar when they reanalyzed the data in the
1790–2009 data set, with 2.1% growth for countries in the > 0.9 debt ratio category as
opposed to 2.5% growth for countries in the 0.6–0.9 category.
Ironically, the Herndon et al. (2014) reanalysis does show nonlinearity in the debt ratio–growth
relationship, but not at the top of the debt ratio scale as suggested by the original
authors. Instead, there appears to be something of a drop off in growth as one goes from
no debt to around 30%. The policy implications of this difference cannot be overstated.
The conclusions of the original authors suggest (and were used to justify) austerity measures
in high-debt ratio countries such as Spain and Greece, and this is exactly what was proposed
in Europe’s “Fiscal Compact” (Dullien, 2012). Proper analysis suggests instead that, insofar
as austerity measures are called for, they should be applied to very rich countries such as
Switzerland and Saudi Arabia, something that, to our knowledge, entities such as the IMF
and the World Bank have never suggested.
In short, the flawed analyses of Reinhart and Rogoff (2010) led to one set of economic and
monetary policies, while appropriate analysis of their data, i.e., a constructive reproducibility
study, suggests a very different set of policies. As was the case with the Hawthorne studies,
appropriate original analysis of the data might have led to a very different world.
Next, we discuss two very recent examples published in JOMSR. Although the topics are
very different from those of the examples that we describe above, there are also similarities
that are worth noting.

Employment discrimination and Obenauer (2023)


One of the more cited studies in the history of research on discrimination in selection
systems was conducted by economists Marianne Bertrand and Sendhil Mullainathan and pub-
lished in 2004 in American Economic Review (Obenauer, 2023). Of particular interest
were not only race differences in “callbacks” (i.e., applicants being invited for interviews)
but also race differences in the effects of resume quality on callbacks. Previous studies of
this sort of issue were limited by various factors. Fidelity/demand characteristic problems
would limit conclusions drawn from vignette studies in which applicant names were
varied, or from retrospective self-report studies. Studies relying on census and other
population-level data would be rife with endogeneity problems (Heckman, 1998). Bertrand
and Mullainathan (2004) solved this problem by sending thousands of fictitious resumes to
employers. In addition to things like the amount of experience, the authors varied the
names of the applicants so that half of the applicants had, as they put it, very White sounding
names (e.g., Emily Walsh) while the other half had African American sounding names (e.g.,
Lakisha Washington).
Bertrand and Mullainathan (2004) found that African American “applicants” had to send
about 50% more resumes to receive a callback. Perhaps more interesting is the fact that,
whereas Whites with higher quality resumes (e.g., more relevant work experience) received
30% more callbacks than did Whites with lower quality resumes, the difference was only 9%
for African Americans. In other words, there was a race-by-resume quality interaction such
that the effect of resume quality was stronger for White applicants. This is surprising given
the typical arguments regarding such phenomena, which run along the lines of "Whites get
callbacks regardless, whereas African Americans get them only if they can prove their
mettle." Similar "restricted variance" arguments have been made for male–female differences
in promotion (e.g., Lyness & Heilman, 2006), male–female differences in the effects of cloth-
ing on workplace outcomes (Chang & Cortina, in press), and Muslim-non-Muslim differ-
ences in the effects of warmth (King & Ahmad, 2010) (cf. Cortina et al., 2019). In short,
Bertrand and Mullainathan (2004) were able to conduct a true experiment of race effects in
selection systems, and they discovered some very important stuff.
Obenauer (2023) was a constructive reproducibility study. The author began, however,
with a literal (aka direct or exact) reproduction. This is an important first step in a constructive
reproducibility study as it allows the reproducers to ensure that they have the correct data and
that they understand the analyses that were conducted in the original. Obenauer was able to
reproduce almost all of the original findings.
Obenauer then endeavored to shore up some of the weaknesses in the original analyses. In
particular, the original authors reported marginal effects of, say, race on callback rates at high
and low levels of resume quality. Although the authors stated that these marginal effects were
based on moderated probit regressions, they did not report on the interaction terms them-
selves. Obenauer (2023) found that the race-by-quality interaction was not significant,
although it was in the direction reported by the original authors. Follow-up analyses
showed that, while the interaction was significant for job experience, it was generally non-
significant for other elements of resume quality.
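As a hypothetical illustration (synthetic data and our own variable names, not Bertrand and Mullainathan's) of the kind of check Obenauer performed, one can fit the moderated probit and examine the interaction coefficient directly rather than only the implied marginal effects:

```python
# Hypothetical sketch (synthetic data): fit a probit with the race-by-resume-
# quality interaction term and inspect the interaction coefficient itself.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 4000
d = pd.DataFrame({
    "black_name": rng.integers(0, 2, n),
    "high_quality": rng.integers(0, 2, n),
})
latent = (-1.2 - 0.4 * d["black_name"] + 0.3 * d["high_quality"]
          - 0.2 * d["black_name"] * d["high_quality"] + rng.normal(size=n))
d["callback"] = (latent > 0).astype(int)

fit = smf.probit("callback ~ black_name * high_quality", data=d).fit(disp=0)
print(fit.params["black_name:high_quality"],
      fit.pvalues["black_name:high_quality"])   # the interaction term itself
```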
Obenauer also controlled for factors that might have represented confounds in the original
analysis. For example, some of the job postings contained specific job requirements while
others did not. It could be that ambiguity of job requirements would allow more room for
bias against non-White names such that the race-by-quality interaction would exist only
for postings that failed to specify requirements. As another example, some of the
African-American names were more Arab-sounding, which might trigger anti-Arab biases,
so Obenauer controlled for this as well. Obenauer (2023) found that job requirement specif-
icity did indeed matter. As mentioned above, there was an experience by race interaction, but
Obenauer (2023) found that this interaction only existed for postings that did not specify an
experience requirement. He also found that much of what was attributed to African–American
names could instead be attributed to Arab-soundingness of names.
Obenauer (2023) was at pains to point out that he was able to reproduce many of the orig-
inal findings. In particular, the main effects of race remained quite strong. Some of the
race-by-resume quality interactions reported in the original, however, were diminished when
superior analyses were conducted. Either way, it is important to know whether and when
original conclusions hold under different analyses: some of the effects proved robust to
alterations in analysis, while others diminished under superior analyses.

Theissen et al. (2023) and the value of cash flow


In an influential paper published in Strategic Management Journal, Kim and Bettis (2014)
examined the relationship between cash flow and firm performance (Theissen et al., 2023).
These authors explained that there are competing perspectives regarding this relationship.

From an economic perspective, cash represents an undeployed resource and is, therefore, a
symptom of inefficiency. In addition, agency theory suggests that cash relieves pressure on
managers, allowing leeway to make less efficient/successful decisions (Jensen, 1986). On
the other hand, cash makes it easier for firms to resolve internal conflict among coalitions
(Cyert & March, 1963) and to cope with uncertainty via “strategic hedging” (Courtney,
2001). From the latter perspective, more cash is better, although there should be diminishing
returns as the benefits of cash become outweighed by the opportunity costs of holding onto it.
Thus, one perspective implies a linear, negative relationship between cash and firm perfor-
mance while the other suggests a quadratic relationship that is positive but concave-
downward. Kim and Bettis (2014) also hypothesized that cash would be more beneficial
for larger firms because it is absolute cash levels rather than relative cash levels that allow
a firm to fend off competitor threats.
The authors' analysis of 23 years of COMPUSTAT data from over 18,000 firms showed
that cash did indeed show the positive main effect cum negative quadratic effect on firm per-
formance suggested by the cash-as-benefit perspective. In other words, more cash is better,
but only up to a point. They also found that cash was indeed more beneficial for larger firms.
Theissen et al. (2023) sought to conduct a reproduction of Kim and Bettis (2014). Their
stated reasons were that Kim and Bettis (2014) covered a very important topic, was well-cited,
and contained an operationalization that may have influenced its results. It is this last reason
on which we will focus our attention.
Kim and Bettis (2014) used as their primary dependent variable Tobin’s q. Although this is
not an uncommon operationalization of firm performance, Theissen et al. (2023) argue that it
is problematic given the research question that was the focus of Kim and Bettis (2014).
Tobin’s q is market value of the firm divided by total assets, with the denominator represent-
ing an attempt to account for a firm’s replacement costs. The problem, as explained by
Theissen et al. (2023), is that the primary explanatory variable, cash, is a component of
both the numerator and denominator of Tobin’s q. If we represent cash as X and Tobin’s q
as Y1/Y2, we can say that an increase in X necessarily results in a corresponding increase
in Y1 and Y2. If Y1/Y2 is greater than 1, then adding a constant to both the numerator and
denominator necessarily decreases the ratio, whereas if Y1/Y2 is less than 1, then the same
operation increases the ratio. These algebraic facts can produce artefactual results when
regressing Tobin’s q onto cash.
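A toy numerical illustration (our numbers, not Theissen et al.'s) makes the mechanical effect clear:

```python
# Toy illustration of the mechanical effect described above: adding the same
# cash amount c to both components of Tobin's q pulls the ratio toward 1,
# independent of any real performance effect of cash.
def tobins_q(market_value: float, total_assets: float) -> float:
    return market_value / total_assets

for label, (mv, ta) in {"q > 1": (200.0, 100.0), "q < 1": (80.0, 100.0)}.items():
    for cash in (0.0, 20.0, 50.0):
        print(f"{label}, cash={cash:>4.0f}: q = {tobins_q(mv + cash, ta + cash):.3f}")
# q > 1 firm: q falls as cash rises (2.000 -> 1.833 -> 1.667)
# q < 1 firm: q rises as cash rises (0.800 -> 0.833 -> 0.867)
```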
Theissen et al.’s (2023) solution was to use an operationalization of firm performance that
did not include cash. The connection between cash and total assets is obvious, so the adjust-
ment to the denominator of Tobin’s q was relatively straightforward. Adjusting market value
(i.e., the numerator) for cash on hand was more complicated but possible through a combi-
nation of balance sheet and market information. In short, they removed cash and short-term
investments from both components of Tobin’s q.
Their first move, however, was to conduct a literal reproduction. Their initial effort was
unsuccessful, but they surmised that this was due to the treatment of outliers.
Winsorization allowed Theissen et al. (2023) to reproduce the original results.
Next came the constructive reproducibility study that provided the true hook for the paper.
Theissen et al. (2023) reproduced both the positive main effect of cash and the positive cash
by firm size interaction reported in the original study. With the cash element removed from
the dependent variable, however, Theissen et al. (2023) found not the inverted-U (i.e.,
diminishing returns) reported by Kim and Bettis (2014), but a positive quadratic effect, sug-
gesting increasing benefits of additional cash. Clearly, the advice that one would give to firms
regarding cash would be very different depending on which of these papers one was reading.
Finally, it is worth noting that Theissen et al. (2023) also incorporated some constructive
replication elements into their paper. In particular, they added additional years of data and
divided their sample into firms with high investment opportunities, firms with low investment
opportunities, and firms that are somewhere in between. They found that the diminishing
returns pattern reported in the work of Kim and Bettis (2014) did indeed hold for low- and
moderate-opportunity firms. For high-opportunity firms, however, Theissen et al. (2023)
observed a "shape flip", viz., a U-shape suggesting that not only is more cash better,
but the degree to which this is the case increases as cash increases. Overall, this
reproducibility-cum-replication suggests that low-opportunity firms should not hold onto
cash, moderate-opportunity firms should hold onto cash up to a point, and high-opportunity
firms should carry as much cash as they can!

Lessons from these examples


In the above sections, we summarized six independent constructive reproducibility exam-
ples (Edwards & Harrison, 1993; Hollenbeck et al., 2006; Herndon et al., 2014; Levitt & List,
2011; Obenauer, 2023; Theissen et al., 2023). The fact that four of them come from our field
should not lead the reader to conclude that such studies are common in our field. These were,
in fact, the only examples that we could find. In any case, although these six examples cover a
wide range of topics, it is instructive to consider some of the commonalities among these
examples as this will form a basis for our later recommendations regarding the identification
of candidates for reproducibility.
The first commonality is fairly obvious: they all involve reproductions of impactful orig-
inal studies. Any study might benefit from a reproducibility effort, but it was, for example,
especially important to correct the record on the effect of high debt ratios. The only unfortu-
nate aspect of Herndon et al. (2014) was that it wasn't published before Reinhart and Rogoff
(2010) had the impact that it did. Obenauer (2023) also tackled an influential paper in
Bertrand and Mullainathan (2004), but because his reanalysis didn’t completely overturn
the original findings, celerity was less important. In our field, one might look for papers
that have the potential to change the way that we think about a given topic. Table 3 contains
examples of papers whose findings would push the field toward an entirely new way of think-
ing about a given phenomenon. We are not criticizing these papers. Given the potential (and
in most cases actual) impact of these papers, however, independent reproducibility would be
especially important.
A second commonality is that they all involved decisions that others might not make and
that were potentially consequential. The work of Theissen et al. (2023) is an especially
good example because the authors were able to show exactly why the dependent variable
used by the original authors was likely to skew results. The trick here is that author decisions
aren’t always obvious. Reinhart and Rogoff (2010) were transparent about some decisions,
but it took the reanalysis itself to discover the data omissions, some of which were intentional
and some accidental. In our field, studies that contain large numbers of control variables
might be good candidates because authors sometimes hunt for the permutation of controls
that allows them to conclude support for hypotheses. At the same time, studies that failed to
control for important factors (e.g., Arab-soundingness of names in Bertrand & Mullainathan,
2004) would also be good candidates. Studies using SEM whose degrees of freedom (df) do
not add up (about one-third of them, according to Cortina et al., 2017) would also be good
candidates because, barring mere transcription error, df discrepancies suggest that the
model tested was different from the model reported. Table 4 contains specific examples.

Table 3
Examples of reproducibility opportunities in papers that are or are likely to be influential

Ilies et al. (2006). Topic: positive affect, agreeableness, and OCB. Hook: 29% of OCB variance is within person; this variance is explained by fluctuation in PA, but not for agreeable people. Impact: pre-Ilies, OCB was an ASA sort of problem; now we focus on the factors that impel people to engage in OCB. Possible consequence of reproducibility: two of the 66 participants were dropped because they showed no OCB variability; perhaps within-person variability would have been trivial if these had been included.

Ones et al. (1993). Topic: integrity test validity. Hook: integrity tests predict work outcomes better than almost anything else. Impact: pre-Ones, it was thought that self-report dispositional measures couldn't explain more than 4% of variance; Ones showed that it was possible to quadruple that number. Possible consequence of reproducibility: studies by test publishers were included; when Van Iddekinge left them out, average validities were very different.

Vancouver et al. (2001). Topic: self-efficacy and performance. Hook: within-person self-efficacy is negatively related to performance. Impact: pre-Vancouver, it was thought that SE was purely beneficial; Vancouver showed that an increase in one's SE can be detrimental. Possible consequence of reproducibility: as with Ilies, participants with no variability were dropped, which would have had unknown effects on results.

Kluger & DeNisi (1996). Topic: feedback and performance. Hook: feedback is detrimental when it shifts attention toward the self and away from the task. Impact: pre-Kluger, it was believed that feedback interventions had to improve performance. Possible consequence of reproducibility: various outliers, many from one particular author, were dropped; their inclusion would have changed effect size means and variances.

Thöni et al. (2021). Topic: gender differences in cooperation variance. Hook: there is more variability in male cooperation behavior than in female cooperation behavior; males are more likely to be extremely cooperative and extremely competitive. Impact: if an organization is selecting (or promoting) for high cooperation, it will select few women; if an organization is avoiding those low in cooperation, it will select very few men. Possible consequence of reproducibility: the authors included in their meta-analysis 4 studies that they themselves had conducted; thus, in addition to the decisions typical of meta-analysis, they made data handling and analysis decisions at the primary study level.
Table 4
Examples of reproducibility opportunities in published papers that involved consequential decisions

Jiang et al. (2019), AMJ. Topic: does display of joy during entrepreneurial pitches increase funding support?
- Decision: excluded all entrepreneurial pitch videos in the dataset that displayed the entrepreneur for less than 7.55 s, arguing that those videos did not show enough variance in joy. Alternative: variance in joy is the target IV; it seems important to include those cases where no variance can be observed, to ensure that those do not lead to equally high or higher funding. Possible consequence of reproducibility: finding that lack of joy display may still lead to successful funding, challenging the importance of the role of joy.
- Decision: used specific facial recognition software to read emotion displays. Alternative: could have used other software or human coders. Possible consequence of reproducibility: given that funding decisions are made by human funders, i.e., human emotion readers, there might be a discrepancy between the original effect size and that in a repro study conducted with human coders.
- Decision: controlled for 14 different control variables. Alternative: only control for factors for which there is some justification. Possible consequence of reproducibility: a repro study without those controls may produce different effect sizes for the relationships between core variables.

Lin et al. (2019), JOM. Topic: top managers' long-term orientation affects strategic decision-making processes, moderated by industry context.
- Decision: 11 different control variables. Alternative: only control for factors for which there is some justification. Possible consequence of reproducibility: a repro study without those controls may produce different effect sizes for the relationships between core variables.
- Decision: averaged scores of two informants from each firm. Alternative: could have run two models for the different informants to evaluate meaningful differences. Possible consequence of reproducibility: source of information might matter.
- Decision: moderator measure for industry complexity was based on data from 1998–2008; IV-DV data collection date unknown, but the paper was published in 2019. Alternative: the moderator (i.e., industry complexity) variable should be measured at the same time to be meaningful. Possible consequence of reproducibility: different estimates of the strength of the moderator.
- Decision: lots of other moderator measures related to industry context. Alternative: different measures for moderators. Possible consequence of reproducibility: different estimates of the strength of the moderator.

Mendoza-Abarca & Gras (2019), JOM. Topic: performance effects of product diversification on newly created non-profit organizations.
- Decision: 10 control variables. Alternative: only control for factors for which there is some justification. Possible consequence of reproducibility: different effect size estimates without irrelevant controls.
- Decision: different operationalizations of core variables. Alternative: different operationalization of efficiency and failure or of diversification. Possible consequence of reproducibility: different findings that lead to different conclusions.

AMJ: Academy of Management Journal; JOM: Journal of Management.

As with the papers in Table 3, we aren’t being critical of the papers in Table 4. They are
included simply because they involved choices such that other, equally reasonable choices
might have led to different results.
Related to decisions is the issue of default settings. All software has default settings for a
variety of decisions. Sometimes these defaults are well known (e.g., LISREL users were
aware that exogenous variables were allowed to correlate by default). Other times,
however, they aren’t. For example, lavaan (an SEM package in R) estimates covariances
among residuals of endogenous latent variables by default. Such covariances are symptomatic
of endogeneity (aka omitted variables), and estimating them inflates fit indices. But not all
lavaan users are aware of this default setting, and one can observe very different levels of
model fit when these covariances are fixed at zero. In fact, the more severe the endogeneity,
the more the fit indices are inflated by estimation of residual covariances.
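lavaan is R software, but the general point about silent defaults is easy to illustrate in any language. For instance, two widely used Python libraries compute "the" standard deviation of the same data with different default denominators, so a reanalysis that relies on defaults can report different numbers from identical data:

```python
# Simpler cross-language illustration of the same point about defaults:
# NumPy's std divides by n by default, pandas' by n - 1, so the "same"
# statistic differs unless the default is explicitly overridden.
import numpy as np
import pandas as pd

x = [2, 4, 6, 8, 10]
print(np.std(x))             # 2.828...  (ddof=0, population formula)
print(pd.Series(x).std())    # 3.162...  (ddof=1, sample formula)
print(np.std(x, ddof=1))     # 3.162...  matches once the default is overridden
```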
A third commonality is that the reproducibility studies were independent (or semi-
independent in the case of Edwards & Harrison, 1993). The people who conducted the orig-
inal Hawthorne illumination studies did not have the know-how to do what Levitt and List
(2011) did. Reinhart and Rogoff (2010) may have had the know-how to analyze their data
properly, but they may have lacked the motivation. It usually takes an independent reproduc-
ibility study to show that a superior approach leads to different conclusions or that decisions
that are simply different (e.g., Michiels et al., 2005) would have done so. Having defined and
illustrated the usefulness of reproducibility, we now turn to our examination of its use in our
field.

Methods
Selection of studies for review of current reproducibility practices
To provide an illustrative examination of current reproducibility practices, we reviewed
one year of publications in the Academy of Management Journal (AMJ) and in the Journal
of Management (JOM). Both journals are general management journals, are methodologically
agnostic, and publish manuscripts from all subdisciplines of management. Since both rank
high on a number of journal lists, they are desirable publication outlets for all academics
in management. This makes them suitable for a general review of reproducibility practices.
While this review is by no means exhaustive, its results make it quite clear which reproduc-
ibility practices are currently common and which are nearly nonexistent. For the year 2019,
we examined all 54 empirical quantitative articles in AMJ and all 79 empirical quantitative
articles in JOM using a manual search for content that related in any way to reproducibility.
In the first step, we examined the 133 papers by carefully reading through their methods
sections in order to determine if they contained attempts at reproducibility of any kind.
Evidence of a reproducibility attempt could be as minimal as providing a statement such
as “we ran our analysis again (optional: with a specified modification of the analytic
approach) and obtained the same findings (alternative: and obtained different findings).”
While this kind of minimal statement often did not allow us to code for how the reproducibil-
ity attempt was carried out or why, we did count it as evidence of a reported attempt. In most
papers, authors provided some form of data to support such a statement, such as a findings
table comparing the findings from the original analyses to those of the reproducibility
18 Journal of Management Scientific Reports

Table 5
Types of reproducibility practices for which we coded

Running analyses with and without controls. In order for a reproducibility attempt to be placed in this category, authors needed to provide findings from an analysis of an effect or model without entering control variables and also from an analysis with control variables. We very frequently found papers that reported model comparisons in which Model 1 contained all of the control variables and subsequent models contained all of the control variables plus the target variables on which the authors were conducting hypothesis testing. These studies were not coded as reproducibility attempts because the analytic strategy did not offer a reanalysis of primary relationships with and without control variables in the equation. If a study did report analyses with and without control variables, then we further coded for constructiveness.

Manipulating the data in different ways. We placed reproducibility attempts in this category if they involved reanalysis of data using a different treatment of the data themselves. For example, authors may have chosen a different treatment of missing data (such as mean replacement instead of exclusion), a different treatment for trimming the data (such as Winsorizing at different points of the data distribution), or the use of transformations of variables to change the underlying distribution. If the alternative data manipulation technique was clearly stated to be superior, we coded the reproducibility attempt as constructive.

Rerunning the same data with a different analytic technique. In order for a reproducibility attempt to fall into this category, authors needed to provide findings from a reanalysis using a different analysis technique. Examples include rerunning analyses with a different logit model, reporting regression findings with and without fixed effects for certain variables, testing an effect with ordinary least squares (OLS) regression versus a chi-square test, t-test, or z-test, and running an analysis with the mean versus the median.

Constructing variables differently or using a different measure for a variable. In order for a reproducibility attempt to fall into this category, authors needed to provide findings from a reanalysis that constructed one or more variables differently or that used a different measure for a variable. Examples include using a different scale to measure a variable, dichotomizing a variable, or using different data points to construct a variable.

Different model comparisons for the same hypothesis or set of hypotheses. We coded a reproducibility attempt as falling into this category when authors reran the analysis of a model in such a way that they were testing the same substantive model in a different way. For example, a model comparison in which the only difference between the models is the freeing of correlations among residuals or measurement errors would be coded in this category. Similarly, entering the variables of the same model sequentially versus simultaneously would constitute a reproducibility attempt in this category.
This then allowed us to make an assessment of different characteristics of the
reported reproducibility attempt, which we detail below in the description of our coding activ-
ities. The important piece of information that the authors needed to provide for us to code a
reproducibility attempt was a comparison of findings before and after the reproducibility
attempt.
Four papers in AMJ and 23 papers in JOM did not include any attempt at reproducibility,
leaving us with 106 papers that contained at least one type of reproducibility practice.4 If a
paper contained multiple independent data collections, we examined each study separately
for reproducibility practices. Our final dataset included 117 studies published across the
106 papers (see a list of these studies in Section D of the Supplemental Materials).

Coding of reproducibility practices


For each study in our dataset, we first coded the reference for the paper and the section of
the paper in which reproducibility was described. We also coded if the reproducibility attempt
focused on a single effect (e.g., relationship between two variables) or on a model (i.e., mul-
tiple relationships between variables). We then coded whether the reproducibility attempt was
dependent, semi-independent, or independent using the definitions that we offered in the pre-
vious section of our paper.
Once this basic information about a paper’s reproducibility attempt was coded, we then
proceeded to code specific reproducibility practices that authors employed in their study. If
a study included multiple reproducibility attempts, we coded each separately. We coded
the constructiveness of each reproducibility attempt as follows: If the authors provided any
reasonable supporting rationale for the superiority of their reproducibility attempt (e.g., a
rationale for including specific control variables or for the use of a different variable opera-
tionalization), then we coded the reproducibility attempt as constructive. If the authors did not
provide any rationale for the specific form of reproducibility that they included, or if the ratio-
nale essentially was “because other studies did so,” we coded the attempt as quasi-random.
When authors stated that their reproducibility approach was less suitable than their original
hypothesis test or when the authors used a statistical technique that has been shown to be infe-
rior to their original technique, then we labeled the attempt as regressive. We also distin-
guished between the five different types of reproducibility practices explained in Table 5.

Coder agreement
We assessed coder agreement at several points throughout the coding process: During the
creation of the initial coding scheme and the ensuing development of the final coding scheme
(initial coder agreement was 95.7% and 99.1%, respectively), during the main coding stage of
AMJ articles (initial coder agreement was 98.5%), during the main coding stage of JOM arti-
cles (initial coder agreement was 96.7%), and during a final random check across articles from
the entire sample (initial coder agreement was 97.5%). In all instances, discussion amongst
the coders led to 100% agreement before moving to the next step in the coding process.
Altogether, we assessed 57 out of 117 (i.e., 48.7%) of the studies in the data set for coder
agreement. A description of the full process of ensuring coder agreement as well as the list
of coded studies is accessible in the Supplemental Materials to this paper.
Table 6
Forms and purpose of reproducibility employed in 2019 AMJ and JOM papers

Columns (values shown as AMJ/JOM/Total):
(a) Running analyses without and with control variables
(b) Manipulating the same data in different ways
(c) Rerunning the same data with a different analysis technique
(d) Constructing variables differently or using a different measure for a variable
(e) Different model comparisons for the same hypothesis or set of hypotheses

Codes | (a) | (b) | (c) | (d) | (e)
Number of studies | 35/35/70 | 30/21/51 | 30/25/55 | 34/34/68 | 20/6/26
Purpose of reproducibility attempt
1—Reproduce (i.e., reanalyzing the data) | 0/1/1 | 0/1/1 | 2/5/7 | 0/2/2 | 0/1/1
2—Finding the same thing | 3/5/8 | 1/1/2 | 1/5/6 | 2/4/6 | 3/0/3
3—Reproduce and find the same thing | 32/30/62 | 29/19/48 | 27/16/43 | 32/30/62 | 17/5/22
Type of reproducibility attempt
1—Dep. literal reproducibility | 0/0/0 | 0/0/0 | 0/0/0 | 0/0/0 | 0/0/0
3—Dep. quasi-random reproducibility | 8/14/22 | 12/9/21 | 16/13/29 | 20/23/43 | 11/1/12
5—Dep. constructive reproducibility | 28/19/47 | 15/16/31 | 8/9/17 | 12/14/26 | 5/4/9
12—Dep. confounded reproducibility | 0/0/0 | 0/0/0 | 0/0/0 | 0/0/0 | 0/0/0
14—Dep. regressive reproducibility | 0/3/3 | 8/0/8 | 6/7/13 | 8/3/11 | 6/2/8

Note: Values in this table can add up to more than the number of studies provided in the top row, given that some studies referred to multiple different types and purposes of reproducibility attempts in their studies. Dep. = dependent. AMJ: Academy of Management Journal; JOM: Journal of Management.

Findings
Forms of reproducibility
In this section, we report on the forms of reproducibility that we observed in our sample of
coded studies and on the purpose for which they were employed. We further report some
observations from our coding to contextualize the quantitative findings reported in Table 6.
Firstly, all coded reproducibility efforts were dependent, i.e., within study. There were no
examples of independent reproducibility in our sample of papers. In addition, in the majority
of instances, the expressed purpose of the reproducibility practice was to confirm findings
from the main analysis. Along the same lines, it is very rare that authors report that their repro-
ducibility attempt resulted in findings that did not confirm their original tests (25 attempts).
Where authors report such discrepancies, the report is usually followed by argumentation
as to why these disconfirming findings do not contradict, and may even support, the original
findings and hypotheses (11 instances). In other cases, disconfirming findings are not
acknowledged or discussed (9 instances), might be buried in Supplemental Materials that
cannot be accessed through the journal website (at least 2 instances), or are omitted in the
presentation of findings altogether (at least 1 instance). These observations seem to indicate
that reproducibility practices as they are currently being employed almost always support the
conclusions from the main analyses or are interpreted as if they did. Consequently, reproducibil-
ity practices are used to further crystallize conclusions and thus exemplify confirmation bias as
much as anything else. We further discuss implications from this observation in the Discussion.
In the following sections, we report our findings regarding the frequency of different forms
of reproducibility practices, i.e., the influence of control variables, data manipulation, analytic
strategy, variable operationalization and measurement, and model comparisons on the find-
ings obtained.

Running analyses without and with control variables. Amongst the studies that ran anal-
yses for the same hypothesis without and with specific control variables (70 out of 117), 22
instances were quasi-random reproducibility attempts in which the authors did not specify a
rationale for inclusion of the control variable. Three attempts were regressive, in which the
original analyses included controls, and the reproducibility attempt excluded all or most of
the control variables despite the fact that the authors had argued for their importance in the
original analysis. Forty-seven reproducibility attempts were constructive, in which the
authors provided some form of explanation as to why analyses with and without controls
were beneficial for the test of the hypothesis of interest. This finding is encouraging as
there seems to be a heightened awareness that control variables included in a model test
should have a stated theoretical purpose. In fact, most of the micro studies
that employed control variables referred explicitly to the work of Becker (2005) and men-
tioned that they only included control variables for which a theoretical rationale existed.
Furthermore, if the authors of these studies did not find divergent results between the
models including and excluding the control variables, they generally focused subsequent
attention on the findings of the analyses without control variables. In macro papers,
authors almost always included all of their control variables in subsequent analyses, irrespec-
tive of whether they had any demonstrated effect. We further discuss the inclusion of control
variables in model tests in the Discussion.

Manipulating the same data in different ways. We found 51 studies that reran analyses using
a different treatment of the original data. It is important to note that practices were not coded as repro-
ducibility when they constituted the testing of a suspected moderator, for example, by testing
an effect only in the part of the data that represented leaders versus the part that represented
followers or by testing an effect in a subsample from the data constituting state-owned enter-
prises versus in a subsample constituting privately owned enterprises. These are forms of gen-
eralizability that were frequently listed amongst the “robustness checks” reported in the
studies. They are not considered reproducibility because they essentially test a different
hypothesis than the original analyses, i.e., a moderator hypothesis. The reproducibility
attempts captured in this category retested the same hypotheses using a different treatment
of the data during the reanalysis. Common amongst them were forms of truncation to coun-
teract extreme values, logarithmic transformations to remedy non-normal distributions, ana-
lyzing data from different time periods to ensure a stable effect over time, or different
approaches to managing missing data.
We observed that 21 of these data-related reproducibility practices were quasi-random,
given that there was no clear indication why the reanalysis was in any way beneficial.
Thirty-one attempts were constructive, in which the reanalysis attempt was argued to be ben-
eficial, for example, because the data indeed suffered from underlying distributional or
missing data issues that could otherwise affect the legitimacy of the statistical results. In
eight instances, we deemed the reproducibility attempt to be regressive, i.e., the reproducibil-
ity attempt represented an analysis strategy that was inferior to that in the original test. This
may again be an indication of an underlying confirmation bias, in which it is not the increased
stringency of the test that is the purpose of the reproducibility attempt but rather the oppor-
tunity to report confirmatory findings. In some cases, footnotes suggested that authors had
been “encouraged” by reviewers to run an additional analysis even though this analysis
was in fact inferior to the original.

Rerunning the same data with a different analysis technique. In our sample, we found 55
studies that reanalyzed data with a different data analysis technique. Most of these were
reanalyses of the entire model, rather than reanalyses of a single effect. We deemed 29 of
these reanalysis attempts to be quasi-random, i.e., the authors did not provide any rationale
as to the benefits of the reproducibility attempt. Although one can argue that a reanalysis
that is neither better nor worse than the original provides some robustness-related informa-
tion, quasi-random reproducibility studies cannot answer the question, Robust with respect
to what? Seventeen attempts were coded as constructive, and 13 attempts as regressive. As
implied in the previous section, some of the constructive and regressive attempts were accom-
panied by footnotes indicating that the additional analysis was suggested by a reviewer.

Constructing variables differently or using a different measure for a variable. Reproducibility attempts coded in this category operationalized variables in different ways
or used alternative measures for constructs of interest. We found 68 such attempts.
Examples include using different scales to assess the same construct, using different opera-
tionalizations of performance, or using different inclusion criteria for data points in the con-
struction of a variable. In 26 instances, the authors explained that the new measures or
operationalizations were superior to the original measure, which we subsequently coded as
constructive. In 43 instances, the alternative measure was different but not superior, which we
coded as quasi-random. In 11 cases, we deemed the new measure to be inferior to the
original measure, which we then coded as regressive. Regressive operationalizations or mea-
sures mainly consisted of instances in which the authors emphasized that they believed their
original operationalization or measure was superior to the one that they used for the reproduc-
ibility attempt.

Different model comparisons for the same hypothesis or set of hypotheses. Model com-
parisons that were coded in this category included models with uncorrelated versus correlated
measurement errors, partial versus full mediation models, or models that included or excluded
fixed effects. Many of the “robustness tests” provided by authors consisted of testing moder-
ator effects that were not originally hypothesized and that were more akin to post hoc ratio-
nalizations. These robustness tests were not coded here as they constitute either
generalizability attempts or wholly different hypothesis tests to rule out alternative explana-
tions for the observed results. Overall, we found 26 studies that offered model comparisons as
part of their reproducibility practices, out of which 12 were coded as quasi-random, i.e., the
reproduced model was not clearly better or worse than the original model, or the authors had
not provided any rationale for the additional model test. Nine attempts were coded as
constructive, e.g., the follow-up model test was more stringent than the original test or better
represented the hypothesized model, and eight were coded as regressive, i.e., the reproduced
model was worse or did not represent as suitable a test of the original model.

Discussion and recommendations


Our findings show that certain forms of reproducibility are quite common in AMJ and JOM
papers. Out of 54 quantitative papers published in AMJ in 2019, 50 papers included some
form of reproducibility, while out of 79 quantitative papers published in JOM in 2019, 56
papers included some form of reproducibility. Across those 106 papers, we found k = 117
independent studies using some form of reproducibility. This seems at first glance quite heart-
ening. For example, in reproducibility attempts that consisted of running analyses with and
without control variables and in those that manipulated data in different ways, the most
common form of reproducibility that we observed in our sample was dependent constructive
reproducibility. This indicates that authors are aware of the importance of providing evidence
that their findings are not dependent on particular methodological choices.
At the same time, amongst the reproducibility attempts including application of different
analysis techniques, constructing variables differently or using a different measure, or com-
paring different models for the same hypotheses (e.g., using different model testing steps;
allowing measurement errors to correlate), the most often observed form of reproducibility
in our sample was dependent quasi-random reproducibility. This form is unfortunately one
of the less useful ones for establishing the robustness and veracity of findings as methodolog-
ical choices for the reproducibility attempt do not improve hypothesis testing and often seem
chosen randomly from whatever was feasible. This indicates that researchers make some of
their reproducibility decisions rather haphazardly, perhaps driven by a desire to adhere to
some perceived reproducibility-related convention. Yet, it is uncommon for authors using
these particular reproducibility techniques to go the extra mile to make the reproducibility
attempt a meaningful one. Similarly, the fact that we see a number of dependent regressive
reproducibility attempts, in which the follow-up analysis is methodologically inferior to
the original analysis, seems to indicate that some authors engage in reproducibility not so
much to add to the rigor of their theory testing but rather to respond to or preempt reviewer
comments. It is worth noting that such reproducibility attempts are, in an odd way, semi-independent:
although they are executed by the author team, they are instigated by an outsider.
We can also see an additional explanation for the patterns we observed. Based on our find-
ings that the expressed intention of the majority of dependent reproducibility attempts across
the 117 studies was to confirm findings from the authors’ original data analysis, we believe
that there is some evidence for a confirmation bias. Moreover, our coding indicates that dis-
confirming findings from the reproducibility attempts, if reported at all, were usually rational-
ized by the authors as additional evidence in support of their hypotheses. Several researchers
before us have pointed out that reviewers appear to more frequently attribute inconsistent
findings to authors’ lack of competence (Chan & Arvey, 2012) or to inadequate theory or
methods. In some cases, reviewers may ask authors to remove from their papers disconfirm-
ing evidence (e.g., Bechky & O’Mahony, 2016; Harrison et al., 2017), rather than supporting
such reporting as a sign of transparency. In other cases, reviewers may simply recommend
rejection of the manuscript because they consider the evidence unconvincing. It is thus under-
standable that authors might selectively report consistent data analysis and reproducibility
attempts.
It is also noteworthy that there are a few forms of reproducibility that are missing alto-
gether. Some of those are arguably the most important forms of reproducibility as they
hold the most potential to contribute to the knowledge generation and correction cycle. In
our sample, there was no example of dependent literal reproducibility. Perhaps more impor-
tantly, there was no example of any form of independent reproducibility, i.e., those forms of
reproducibility in which a different set of researchers analyze the same data to determine if
they would obtain the same findings and arrive at the same conclusions as the original
researchers.
Given page restrictions in journal articles, it is not surprising that authors do not report on
dependent literal reproducibility attempts, i.e., those in which they ran the exact same anal-
yses again to see if they stumble on any inconsistent findings. It is likely that reviewers
and readers assume it to be common practice for authors to take such steps before submission.
Nevertheless, we will provide suggestions in our recommendations to authors regarding how
to employ dependent literal reproducibility as a meaningful check before submission, the
results of which could be reported in a Supplemental Materials section or appendix.
The lack of any form of independent reproducibility is more problematic, though, as these
types of reproducibility studies have the strongest potential for adding credibility to previ-
ously obtained findings, or for showing that previously obtained findings do not pass
muster when they are put to more appropriate tests. Neither QRPs such as those in
Reinhart and Rogoff (2010) nor misguided analytic decisions such as those in the original
Hawthorne studies are likely to be uncovered in dependent reproducibility efforts.
Consequently, it is important to explore why these reproducibility attempts are relatively
rare and how we can encourage more of them.
Table 7
Recommendations for future reproducibility studies for authors

Dependent reproducibility: Literal
1. Researcher 1 runs analyses.
2. Researcher 2 attempts to reproduce researcher 1's analyses by following the same steps.
3. Determine if the researchers obtain the same findings.
Purpose:
• Increases transparency
• Minimizes errors or oversights

Dependent reproducibility: Constructive
1. Researchers 1 and 2 run analyses separately but concurrently.
2. Identify the differences in their data handling/analysis choices.
3. Focus on the results from the superior approach but report the results of the other in an Appendix.
Purpose:
• Increases transparency
• Makes explicit different methodological choices
• Allows the research team to select more constructive ones
• Allows the research team to articulate a sound rationale for their methodological choices

Independent reproducibility: Literal (within)
1. The research team asks an independent researcher to rerun their analyses (e.g., as part of a friendly review or research panel).
2. Determine if findings align and, if not, why.
Purpose:
• Minimize errors when using complex or novel analytical techniques
• Validate or improve upon methodological choices

Independent reproducibility: Literal (across)
1. Consider studies with important implications, impact on policy, funding, managerial practices, etc.
2. Consider studies that suggest rethinking, reconceptualizing, retheorizing.
3. Obtain the original data.
4. Rerun analyses following the steps of the original study exactly.
Purpose:
• Discover whether there was sufficient transparency to obtain the same findings
• Identify errors or oversights
• Make explicit methodological incongruencies

Independent reproducibility: Constructive
1. Consider studies with important implications, impact on policy, funding, managerial practices, etc.
2. Consider studies that suggest rethinking, reconceptualizing, retheorizing.
3. Identify methodological choices of the original study and consider superior choices that might result in different findings.
4. Obtain the original data.
5. Rerun with superior data handling/analytic techniques.
6. Determine if findings change.
Purpose:
• Evaluate the effect of inferior methodological choices on findings and conclusions
• Reconfirm or challenge conclusions affected by methodological choices

Several characteristics of the publishing system in the organization sciences likely contrib-
ute to the low number of published independent reproducibility studies over the years and the
complete lack of these studies in our sample. These characteristics have been discussed
before, and we only briefly summarize their relevance for reproducibility studies here. The
first issue preventing independent reproducibility attempts is that data are seldom made available
by the authors of the original study. Furthermore, there is a lack of transparency when reporting
on study methods (Aguinis et al., 2011). Both issues impede independent reproducibility.
Additionally, previous work has noted that the obsession with novel theorizing in top man-
agement publications often prevents reanalysis or replication from being considered (e.g.,
Chan & Arvey, 2012; Köhler & Cortina, 2021; Tsang & Kwan, 1999). One possible
outcome of a reproducibility attempt is that it confirms the original findings. Even if the repro-
ducibility study in question were highly constructive it would not be sufficiently new and
would not involve novel theorizing. This would, in the minds of many, place such studies
outside the scope and mission of most top journals. Only when independent reproducibility
leads to meaningful alternative interpretations and retheorizing might (and it’s a big “might”)
top journals consider these studies. This is certainly the case for the exemplary studies that we
described in our “Introduction” section.
Of course, one cannot know where a reproducibility attempt might lead. Consequently,
engaging in an independent reproducibility attempt might be perceived as a high-risk
endeavor by authors, which likely would dissuade them from attempting such a study in
the first place. We hope that JOMSR, which focuses on empirical contributions and excludes
novel theorizing, will provide an outlet for authors in which these valuable forms of scientific
research find a home. To encourage more reproducibility studies, we highlight some strategies
for identifying highly promising original studies that would benefit from an independent
reproducibility attempt in our recommendations to authors below.

Recommendations
Having argued for the value of reproducibility studies to a self-correcting science, we now
provide advice to authors about when a reproducibility study might be warranted, to review-
ers about when to ask for reproducibility evidence and how to evaluate such evidence, and to
editors and publishers regarding journal content policy.

Advice for authors. In this section, we provide advice for conducting more constructive
and literal dependent reproducibility studies as well as for constructive and literal independent
reproducibility studies (see also Table 7). We pay specific attention to the different purposes
of different forms of reproducibility attempts and to the different strategies authors might
employ in their research designs.
Conducting more constructive and literal dependent reproducibility studies. There are a
few suggestions we would like to offer researchers who are attempting to incorporate
forms of dependent reproducibility into their work. Firstly, dependent literal reproducibility
can be extremely valuable for minimizing analytical errors or oversights. One of the ways to
use this meaningfully in a research team, for example, would be to have different researchers
on the team run analyses separately to determine if they obtain the same findings. One would
probably start with having one researcher do all of the analyses, and then have a second
researcher follow the exact same steps. This requires that the first researcher describe the steps
that they took in sufficient detail for the second researcher to follow them. Failure to arrive at
the same findings might alert the researchers to issues with transparency as well as errors in
the data analysis process. For example, in writing R code for analysis of an endogenous mod-
erator model (cf. Cortina et al., 2023), the second researcher might discover that the first
researcher had inadvertently analyzed the model as if it were a first stage moderation
model (a common mistake according to Cortina et al., 2023). The second researcher would
then function as an auditor for the data analysis process (akin to Lincoln and Guba’s
(1985) process of an external audit in qualitative research).
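To make this concrete, the following is a minimal sketch in R of what the second researcher's audit might look like; the data file, variable names, model, and the name of the saved coefficient object are hypothetical stand-ins for whatever the first researcher documented.

```r
# Hypothetical audit script for researcher 2 (dependent literal reproducibility).
# Researcher 1 has documented their steps and saved the prepared data and estimates.
dat <- read.csv("study1_prepared_data.csv")

# Rerun the focal model exactly as documented by researcher 1
fit_r2 <- lm(performance ~ stress * autonomy + tenure, data = dat)

# Compare against researcher 1's saved coefficients; TRUE means the analysis reproduces
coefs_r1 <- readRDS("researcher1_coefficients.rds")
all.equal(coef(fit_r2), coefs_r1, tolerance = 1e-8)
```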
A different strategy for dependent reproducibility could be to let two researchers in the
research team analyze the data separately from the outset, making whichever choices they
deem appropriate. This could result in some quasi-random reproducibility but might also
result in some constructive reproducibility. As there are myriad methodological decisions
that researchers make when they analyze data, having two researchers undergo the process
independently can make some of these decisions much more explicit and can lead to the dis-
covery of the decisions that affect the findings.
For both of these approaches, researchers could summarize their efforts in a paragraph and
then provide more detail and results in an appendix or in online Supplemental Materials. This
would also contribute to transparency about the data analysis process, which ultimately
enables independent reproducibility (and replication) by other author teams in the future.
We also encourage researchers to engage in more constructive dependent reproducibility
and to specify the reasons for such studies. Although our review uncovered some examples of
constructive dependent reproducibility studies, they are relatively rare. Even when such studies
do appear, however, authors rarely provide a specific rationale for their choices. In many cases, it
remains unclear why they chose the approaches that they did and forwent others.
We, therefore, recommend that authors provide more description of the purpose of their
constructive reproducibility efforts. In supplementary coding that we conducted to explore
how authors refer to (and possibly conceptualize) their reproducibility efforts (see Section
B of SOM), we observed that authors commonly engage in “robustness checks”. This term
was used 69 times across the 117 studies in our sample, 36 times in AMJ and 33 times in
JOM. Robustness checks were represented across all types of reproducibility practices.
However, it was often not clear what these (dependent) robustness tests contributed to the
conclusions drawn. Symptomatic of this issue was that the term “robustness checks” or
“robustness tests” was used for a wide range of purposes such as reproducibility, replication,
generalizability, supplementary analysis such as post hoc tests or testing other DVs, or testing
different models as alternative explanations. Furthermore, all kinds of robustness tests—such
as reanalysis with and without control variables, different data manipulations, different ana-
lytical techniques, different variable measures or operationalization, and different model com-
parisons—were often lumped together in a kind of robustness laundry list without
distinguishing what each form of robustness check says about the core conclusions. As
such, readers are often left wondering why authors ran their robustness analyses
and what readers are supposed to take away from them. On top of that, given the lack of
detailed information about what the authors did, we were seldom able to distinguish in our
coding between incremental, substantial, and comprehensive constructive reproducibility.
As a result, it was difficult to tell how much importance should be attached to these studies.
Further confusing matters is the fact that the term robustness traditionally refers to results
of analyses being insensitive to violation of assumptions (e.g., OLS is robust against nonnor-
mal conditional errors). In our reading of the AMJ and JOM papers, though, it appears that the
term is used to denote insensitivity to a wide range of unspoken assumptions, such as the
assumption that the X–Y relationship is still positive when we use a different measure of Y,
when we use the subset of the sample that issued press releases, or when we assess the con-
struct at a different point in time. These and many other “robustness checks” are in fact tests of
interactions. Consequently, readers would have an easier time comprehending the results of
these analyses if authors stated exactly what the moderator is, what sort of effect it might
have, and why. This would also help readers to delineate reproducibility attempts aimed at
robustness vis-a-vis data handling, variable construction, analysis choices, etc. from other
explorations that seem more substantive in nature and ipso facto constitute explorations of
alternative hypotheses (i.e., adding moderator effects to the hypothesized model).
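To illustrate this point, the sketch below (in R, with hypothetical data and variable names) contrasts a subsample "robustness check" with the explicit interaction test that it implicitly performs.

```r
dat <- read.csv("firm_panel.csv")   # hypothetical prepared dataset

# A "robustness check" that reruns the model only for firms that issued press releases ...
fit_subsample <- lm(firm_value ~ csr_disclosure,
                    data = subset(dat, issued_press_release == 1))

# ... is, in effect, an informal moderator test; stating the interaction explicitly
# tells readers which condition is assumed to alter the X-Y relationship and why.
fit_interaction <- lm(firm_value ~ csr_disclosure * issued_press_release, data = dat)
summary(fit_interaction)$coefficients["csr_disclosure:issued_press_release", ]
```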
Finally, an interesting observation was that while many papers, especially macro papers,
involve a large number of control variables in their model tests, findings are rarely provided
with and without these controls. As such, it remains unclear which of these control variables
actually matter and how their inclusion affects results. A subsequent question then becomes:
Why are they even included? The sense and nonsense of our field’s use of control variables is
beyond the scope of our paper (see Carlson & Wu, 2012). Yet, we would echo much of the
previous writing on the topic and urge researchers who routinely employ laundry lists of
control variables to ask if there is a clear reason for including each of them. A quick repro-
ducibility analysis of one’s model with and without controls would highlight which control
variables are needed, which are irrelevant, and which are counterproductive. Authors
should only include those control variables whose omission could be considered a mistake.
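A minimal sketch of such a check in R might look as follows; the data and variables are hypothetical, and the aim is simply to place the focal estimate from both specifications side by side.

```r
dat <- read.csv("firm_panel.csv")   # hypothetical prepared dataset

# Focal relationship estimated without controls and with the full control set
fit_without <- lm(firm_performance ~ ceo_tenure, data = dat)
fit_with    <- lm(firm_performance ~ ceo_tenure + firm_size + firm_age + rd_intensity,
                  data = dat)

# A large shift in the focal coefficient flags controls that matter;
# a negligible shift raises the question of why the controls are included at all
round(c(without_controls = coef(fit_without)[["ceo_tenure"]],
        with_controls    = coef(fit_with)[["ceo_tenure"]]), 3)
```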
When to consider an independent reproducibility study. The preceding advice related to
dependent reproducibility. Now we turn our attention to independent reproducibility
attempts. As a first option, authors might want to consider soliciting an independent literal
reproducibility attempt of their study from an independent researcher. This might be war-
ranted when they are attempting a complex piece of analysis, when they want to validate
their methodological choices, or when they are trying out a new technique. This could also
be warranted when researchers obtain findings that are highly counter-intuitive or challenge
well-established theory. This kind of independent literal reproducibility could be part of a
friendly review, i.e., a review by a different researcher whom they know and who will give
them critical, developmental feedback. The findings from this independent literal reproduc-
ibility step could be mentioned briefly in the paper and in more detail in the Supplemental
Materials. Another opportunity for an independent literal reproducibility attempt before
submission for journal review is review by a pre-established research panel, as suggested in
the ManyOrgs initiative outlined by Castille et al. (in press).
Authors who contemplate conducting an independent reproducibility study of another study
need to first determine which analyses and findings are worth attempting to reproduce. As a
general rule, studies that have important implications insofar as they may have an impact on
policy, managerial practices, funding, or other strategic decision-making would benefit from repro-
ducibility attempts to ensure that the conclusions drawn from the work are based on sound evi-
dence. For similar reasons, a reproducibility study is warranted when an original study suggests
that a field should rethink, reconceptualize, retheorize, or even abandon a stream of research.
Let’s consider the example of Barrick and Mount’s (1991) research into personality as a
predictor of performance. Having been dormant since the 1960s, this research stream was
revived by Barrick and Mount’s meta-analysis. In part because of Barrick and Mount’s con-
clusions about the validity of personality variables as predictors of performance and in part
because personality predictors create less adverse impact in selection settings, these predictors
have become more common as selection tools than they used to be. A reproducibility attempt
of Barrick and Mount’s work would have been useful to determine, for example, whether the
corrected correlations for conscientiousness that crept into the 0.20s might have remained in
the high teens with different decisions. We have absolutely no quarrel with the decisions that
were made. Our point is that the impact of the paper makes it a candidate for reproducibility.
In fact, given the influence that meta-analyses often have, any meta-analysis published in a
top journal would likely be a good candidate for a reproducibility attempt.
A paper is also a good candidate if it involved, or is likely to have involved, consequential
methodological decisions. Carlson and Wu (2012) showed that decisions related to control
variables can and do impact the results from statistical analyses. Similarly, various authors
(e.g., Wanous et al., 1989) have shown that different meta-analytic decisions (e.g., search
strategies, inclusion criteria, coding choices) can result in different findings. Overall, any
study that included data handling/analysis decisions that the research methods literature
has identified as consequential is ripe for a reproducibility attempt.
Finally, if the paper used software that has consequential default settings, then a reproduc-
ibility attempt is likely warranted. We mentioned lavaan and residual correlations earlier.
Another example would be default settings for handling missing data (often list-wise dele-
tion). Software for data collection contains defaults for respondent omission, bot detection,
etc. Any of these might be relevant for a given set of findings. Naturally, not all default set-
tings are relevant. In fact, most probably aren’t. However, in those cases in which existing
research methods literature has suggested that default settings have implications for findings
and conclusions, a reproducibility attempt can lead to different outcomes.
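As one hypothetical illustration, the sketch below uses the lavaan package in R to make the missing-data treatment explicit (listwise deletion, which many programs apply by default) and to contrast it with full-information maximum likelihood; the model, dataset, and variable names are assumptions for the sake of the example.

```r
library(lavaan)

dat <- read.csv("survey_data.csv")   # hypothetical prepared dataset

# Hypothetical two-path model
model <- "
  satisfaction ~ autonomy
  performance  ~ satisfaction
"

# Make the missing-data treatment explicit instead of relying on the default
fit_listwise <- sem(model, data = dat, missing = "listwise")
fit_fiml     <- sem(model, data = dat, missing = "ml")

# Compare the focal estimates and the number of observations actually analyzed
parameterEstimates(fit_listwise)[1:2, c("lhs", "op", "rhs", "est", "se")]
parameterEstimates(fit_fiml)[1:2, c("lhs", "op", "rhs", "est", "se")]
c(listwise_n = lavInspect(fit_listwise, "nobs"), fiml_n = lavInspect(fit_fiml, "nobs"))
```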
It is important to note at this point that a reproducibility attempt should not be required to
yield divergent findings from the original study. In fact, confirming findings
with superior methods is highly useful as it increases our confidence in the conclusions drawn
from the data and provides additional impetus for acting on the kinds of policy, managerial,
and research implications mentioned above. In this way, reproducibility attempts are a crucial
companion to replication attempts targeted at allowing our field to distinguish useful from
irrelevant or harmful research findings and theories.

Advice for publishers and editors. Our advice here is quite simple: Adopt data transpar-
ency policies that make reproducibility studies feasible, and make space in your journals for
such studies. Because most of our paper makes the case for the latter piece of advice, we focus
here on the former. As was mentioned earlier, Wicherts et al. (2011) found support for the
notion that authors whose evidence in support of hypotheses is flimsy and/or driven by ques-
tionable analytic decisions are more reluctant to share their data. Presumably, if such authors
knew that they were going to have to share their data, they would make different (i.e., better)
decisions. The question then becomes, how does an Editor/Publisher bring about this state of
affairs?
The first step is to have a policy that strongly encourages authors to share their data with
researchers who ask for them. We are well aware of the objections to such policies, but the
simple fact is that, without data sharing, our field cannot be self-correcting, and without self-
correction, it can’t be a science. So, policy changes are needed. Earlier research on the effec-
tiveness of such policies suggested that authors were quite comfortable ignoring them (e.g.,
Wicherts et al., 2006). More recent research suggests that the next generation of scholars is
more willing to abide by such policies (Hardwicke et al., 2018). Even sharing data with
the editorial team would be a step in the right direction as this would allow some level of inde-
pendent corroboration. To be effective, though, this would probably require editors to work
together to ensure transparency. One journal wouldn’t do much good on its own.
But there is more to it than that. The difficulties in reproducing previous work encountered
by Herndon et al. (2014) are not uncommon. Hardwicke et al. (2018) found that they were
able to reproduce some findings only after getting clarification from the original authors.
Thus, a standardized system for organizing data and describing analyses would be helpful.
R Markdown files (an R-based format for documenting data handling and analyses alongside
their results) might be of some use here. In short, by ensuring that data are shared in a manner that is useful
for prospective reproducers, editors can improve the quality of the work that they publish.
And for those who are concerned about such things, there is evidence that data sharing is asso-
ciated with higher citation rates (Piwowar et al., 2007).
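As a rough illustration, a reproducibility-oriented R Markdown file might look like the sketch below; the file, variable, and chunk names are hypothetical, and the point is simply that every data-handling decision is recorded as executable code alongside the reported results.

````
---
title: "Study 1: data preparation and primary analyses"
output: html_document
---

```{r prepare-data}
raw <- read.csv("study1_raw.csv")
dat <- subset(raw, attention_check == 1)   # exclusion rule stated explicitly in code
```

```{r primary-model}
summary(lm(performance ~ stress * autonomy + tenure, data = dat))
```
````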

Advice for reviewers and editors. When to request a dependent reproducibility study.
Aligned with the criteria above, reviewers need to ask themselves if the implications of a
given original study are likely to have a meaningful impact on future research or practice.
If so, they should consider whether requesting a dependent reproducibility attempt would
increase their overall confidence in the findings and conclusions provided. At the most
basic level, a reviewer should evaluate the rationale provided for the methodological
choices that the authors made and consider the possibility that different choices could have
led to different conclusions. Based on our empirical review, examples of methodological
choices that frequently lack any justification (let alone sound justification) are the inclusion
of specific control variables, the choice of a particular time lag interval or subset of the
data, or the selection of exclusion criteria. At the very least, reviewers should press authors
to communicate the reasons behind their choices. Clear communication about those
choices will facilitate future independent reproducibility and replication attempts.
As a further step, if a reviewer has reason to suppose that certain methodological choices or
default settings could lead to different findings, then they should ask authors to reanalyze the
data applying those different choices or alternative settings. These additional analyses could
be provided in supplemental analyses so as not to overcrowd the paper, but they should be
available to interested readers. We would emphasize, however, that a reviewer should
provide support from the methodological literature that said choices could lead to different
outcomes as opposed to requesting additional analyses without sound rationale. The latter
is likely responsible for many of the regressive reproducibility examples in our review.
Finally, reviewers could consider the possibility that authors did not report all of the
choices that they made. There are some telltale signs for this. For example, if a reviewer
were to go to the trouble of computing degrees of freedom (e.g., using the calculator we men-
tioned earlier) and those degrees of freedom do not add up, then they should ask the authors
about this discrepancy. As another example, the reviewer could check references regarding
the scales that were used. If the authors say that they used the X-item scale from
Rosencrantz and Guildenstern (1602), then check to make sure that the Rosencrantz and
Guildenstern scale actually had X items. Heggestad et al. (2019) showed that authors often
use modified scales without reporting the fact. If there is a discrepancy, consider the possibil-
ity that the reason is that the authors dropped items because they did not behave or because a
different selection of items led to more favorable outcomes. Overall, request more transpar-
ency from the authors.
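To illustrate the arithmetic behind the degrees-of-freedom check for a covariance-structure model, a minimal sketch (with hypothetical numbers) follows.

```r
# Expected chi-square df for a covariance-structure model: the number of unique
# variances/covariances among p observed variables minus the q free parameters
sem_df <- function(p, q) p * (p + 1) / 2 - q

# e.g., 12 indicators with 27 freely estimated parameters should yield 51 df;
# a reported value that differs suggests unreported constraints or modifications
sem_df(p = 12, q = 27)
#> [1] 51
```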
How to evaluate independent reproducibility attempts. First and foremost, we would wish
for reviewers to realize that they are encountering a unicorn. They should not spook it or shoo
it away because it doesn’t look like a normal horse. As we have demonstrated, independent
reproducibility attempts are crucial to a self-correcting science. Yet, they are the rarest form of
reproducibility. When done constructively, these studies can be more valuable than the orig-
inal study, especially when they allow the research community to re-evaluate the most influ-
ential works in our field.
When coming across an independent reproducibility attempt, reviewers should first deter-
mine the importance of the reproducibility attempt for a self-correcting science and the level
of constructiveness of the attempt. It should be irrelevant whether findings support or chal-
lenge the original results. What matters is that the reproducibility study contributes to our
knowledge regarding the phenomenon and theory at hand. This can be done either by
showing that superior analysis casts parts of the theory (and the original analysis that sup-
ported it) into doubt or by showing that the theory holds even when more appropriate
methods are used.
All of that said, our own experience suggests that reviewers find it difficult to see the value
of papers that do not contain novel theory, let alone novel data. Our final recommendation is,
therefore, directed at journal editors. If you are convinced by our arguments regarding the
importance of reproducibility, then you might consider making a place for them in your
journal. JOMSR cannot be the only outlet that publishes this kind of work. Changing a
journal policy statement is a good first step, but more is probably required. Some associate
editors and reviewers won’t be able to bring themselves to sign off on reproducibility
studies, suffering from acute cases of what Antonakis (2017) calls Neophilia and
Theorrhea (roughly, an addiction to new boxes and arrows). It would be incumbent upon
the editor, therefore, to choose AEs and reviewers who are more likely to see the value of
reproducibility, and perhaps to overrule reviewers who cannot. Too often, journal editors
leave the task of publishing replications, reproducibility studies, etc. to other editors,
fearing perhaps that such papers will drag down their impact factors. We suggest that these
fears may be misplaced. The two Economics examples and the two older Management exam-
ples that we offered earlier in the paper have, on average, been cited more than 700 times
apiece. It may be beneficial for a journal to have an editor who sees the value of getting in
on the ground floor.

Conclusion
Reproducibility studies have the potential for tremendous value, especially if one allows
for the possibility of constructive reproducibility. Constructive reproducibility allows
researchers to determine whether findings obtained by the researchers themselves or by others
withstand more appropriate tests. This in turn provides stronger forms of evidence. Yet, the
lack of independent constructive reproducibility is worrying. As we have shown, many
researchers report that decisions regarding the handling and analysis of data influence
their findings. Yet, almost no one does what Obenauer (2023) or Edwards and Harrison
(1993) did, i.e., the expert retesting of hypotheses by a different team of researchers.
Reproducibility, as it exists now, seems to be used largely to support the earlier conclusions
derived by the same authors, i.e., authors who have a vested interest in supporting their con-
clusions. This essentially means that we have few mechanisms for our science to be self-
correcting. Along the same lines as Leavitt et al.'s (2010) concern about the proliferation
of theories that have little merit, this points to the potential proliferation of empirical findings
that may prove to be metaphorical quicksand in that they might not hold up if they were put to
a more stringent test by an independent team of researchers. The problem is, we cannot know.
These studies are almost nonexistent, but they shouldn’t be.

Declaration of conflicting interests


The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication
of this article.

Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publica-
tion of this article: This work was supported by the University of Melbourne’s Faculty of Business and
Economics.

ORCID iDs
José M. Cortina https://orcid.org/0000-0003-2336-917X
Tine Köhler https://orcid.org/0000-0003-0480-472X

Notes
1. Winsorization is a technique for reducing the effects of extreme scores by setting them to some arbitrarily
chosen less extreme score, e.g., the 5th and 95th percentiles.
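A minimal R sketch of this procedure, with x standing in for any numeric variable:

```r
# Winsorize x at its 5th and 95th percentiles
cutoffs <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
x_winsorized <- pmin(pmax(x, cutoffs[1]), cutoffs[2])
```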
2. As it were.
3. Based on a statement in Franke and Kaul, 1978. These and other authors (e.g., Bloombaum, 1983) reana-
lyzed other Hawthorne data, but the original illumination data were not reanalyzed until Levitt and List because
their whereabouts were unknown.
4. We did not include Monte Carlo simulations or bootstrapping in our definition of reproducibility as these
techniques are not primarily aimed at reanalyzing findings, but rather at creating a more stable estimate of an
effect size (or another parameter).

References
Adair, J. G., Sharpe, D., & Huynh, C.-L. (1989). Hawthorne control procedures in educational experiments: A recon-
sideration of their use and effectiveness. Review of Educational Research, 59(2): 215–228.
Aguinis, H., Cascio, W. F., & Ramani, R. S. (2017). Science’s reproducibility and replicability crisis: International
business is not immune. Journal of International Business Studies, 48: 653–663.
Aguinis, H., Ramani, R. S., & Alabduljader, N. (2018). What you see is what you get? Enhancing methodological
transparency in management research. Academy of Management Annals, 12: 1–28.
Antonakis, J. (2017). Editorial: The future of the leadership quarterly. The Leadership Quarterly, 28: 1–4.
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J. A., Fiedler, K., Fiedler, S., Funder, D. C.,
Kliegl, R., Nosek, B. A., Perugini, M., Roberts, B. W., Schmitt, M., Vanaken, M. A. G., Weber, H., & Wicherts,
J. M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27:
108–119.
Avery, D. R., Wang, M., Volpone, S. D., & Zhou, L. (2013). Different strokes for different folks: The impact of sex
dissimilarity in the empowerment–performance relationship. Personnel Psychology, 66(3): 757–784.
Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis.
Personnel Psychology, 44(1): 1–26.
Bechky, B. A., & O’Mahony, S. (2016). Leveraging comparative field data for theory generation. In K. D. Elsbach &
R. M. Kramer (Eds.) Handbook of Qualitative Organizational Research: 200–208. New York, NY: Routledge.
Becker, T. E. (2005). Potential problems in the statistical control of variables in organizational research: A qualitative
analysis with recommendations. Organizational Research Methods, 8: 274–289.
Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg more employable than Lakisha and Jamal? A field
experiment on labor market discrimination. American Economic Review, 94(4): 991–1013.
Bloombaum, M. (1983). The Hawthorne experiments: A critique and reanalysis of the first statistical interpretation by
Franke and Kaul. Sociological Perspectives, 26(1): 71–88.
Carlson, K. D., & Wu, J. (2012). The illusion of statistical control: Control variable practice in management research.
Organizational Research Methods, 15(3): 413–435.
Castille, C. M., Köhler, T., & O’Boyle, E. H. (in press). A brighter vision of the potential of open science for ben-
efiting practice: A ManyOrgs proposal. Industrial and Organizational Psychology: Perspectives on Science and
Practice.
Chan, M. E., & Arvey, R. D. (2012). Meta-Analysis and the development of knowledge. Perspectives on
Psychological Science, 7: 79–92.
Chang, Y., & Cortina, J. M. (in press). What should I wear to work? An integrative review of the impact of clothing in
the workplace. Journal of Applied Psychology.
Cortina, J. M., Dormann, C., Markell, H. M., & Keener, S. K. (2023). Endogenous Moderator Models: What They
are, What They Aren’t, and Why it Matters. Organizational Research Methods, 26: 499–523. https://doi.org/10.
1177/10944281211065111
Cortina, J. M., Green, J. P., Keeler, K. R., & Vandenberg, R. J. (2017). Degrees of freedom in SEM: Are we testing
the models that we claim to test? Organizational Research Methods, 20(3): 350–378.
Cortina, J. M., Koehler, T., Keeler, K. R., & Nielsen, B. B. (2019). Restricted variance interaction effects: What they
are and why they are your friends. Journal of Management, 45(7), 2779–2806.
Courtney, H. (2001). 20/20 foresight: Crafting strategy in an uncertain world. Boston, MA: Harvard Business School
Press.
Cyert, R. M., & March, J. G. (1963). A behavioral theory of the firm (2nd ed.). Oxford, U.K.: Wiley-Blackwell.
da Motta Veiga, S. P., & Gabriel, A. S. (2016). The role of self-determined motivation in job search: A dynamic
approach. Journal of Applied Psychology, 101(3): 350.
Davis, M. S. (1971). That’s interesting! toward a phenomenology of sociology and a sociology of phenomenology.
Philosophy of Social Science, 1: 209–344.
Dullien, S. (2012). Reinventing Europe: Explaining the Fiscal Compact. Retrieved on July 7, 2022 from https://ecfr.
eu/article/commentary_reinventing_europe_explaining_the_fiscal_compact/.
Eatough, E. M., Chang, C. H., Miloslavic, S. A., & Johnson, R. E. (2011). Relationships of role stressors with orga-
nizational citizenship behavior: A meta-analysis. Journal of Applied Psychology, 96(3): 619.
Edwards, J. R., & Harrison, R. V. (1993). Job demands and worker health: Three-dimensional reexamination of the
relationship between person-environment fit and strain. Journal of Applied Psychology, 78: 628–648.
Edwards, J. R., & Parry, M. E. (1993). On the use of polynomial regression equations as an alternative to difference
scores in organizational research. Academy of Management Journal, 36(6): 1577–1613.
Franke, R. H., & Kaul, J. D. (1978). The Hawthorne experiments: First statistical interpretation. American
Sociological Review, 43, 623–643.
French, J. R., Caplan, R. D., & Van Harrison, R. (1982). The Mechanisms of Job Stress and Strain (Vol. 7).
New York, NY: J. Wiley.
Gale, E. A. M. (2004). The Hawthorne studies––A fable for our times? Quarterly Journal of Medicine, 97(7): 439–449.
Gillespie, R. (1991). Manufacturing Knowledge: A History of the Hawthorne Experiments. New York, NY:
Cambridge University Press.
Gleser, L. J. (1996). [Bootstrap confidence intervals]: Comment. Statistical Science, 11(3): 219–221.
Hardwicke, T. E., Mathur, M. B., MacDonald, K., Nilsonne, G., Banks, G. C., Kidwell, M. C., … Frank, M. C.
(2018). Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open
data policy at the journal cognition. Royal Society Open Science, 5(8): 180448.
Harrison, J. S., Banks, G. C., Pollack, J. M., O’Boyle, E. H., & Short, J. (2017). Publication bias in strategic man-
agement research. Journal of Management, 43(2): 400–425.
Heckman, J. J. (1998). Detecting discrimination. Journal of Economic Perspectives, 12(2): 101–116.
Heggestad, E. D., Scheaf, D. J., Banks, G. C., Monroe Hausfeld, M., Tonidandel, S., & Williams, E. B. (2019). Scale
adaptation in organizational science research: A review and best-practice recommendations. Journal of
Management, 45(6): 2596–2627.
Herndon, T., Ash, M., & Pollin, R. (2014). Does high public debt consistently stifle economic growth? A critique of
Reinhart and Rogoff. Cambridge Journal of Economics, 38(2): 257–279.
Hollenbeck, J. R., DeRue, D. S., & Mannor, M. (2006). Statistical power and parameter stability when subjects are
few and tests are many: Comment on Peterson, Smith, Martorana, and Owens (2003). Journal of Applied
Psychology, 91: 1–5.
Ilies, R., Scott, B. A., & Judge, T. A. (2006). The interactive effects of personal traits and experienced states on intra-
individual patterns of citizenship behavior. Academy of Management Journal, 49(3): 561–575.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med, 2(8): e124.
Jensen, M. C. (1986). Agency costs of free cash flow, corporate finance, and takeovers. American Economic Review,
76(2): 323–329.
Jiang, L., Yin, D., & Liu, D. (2019). Can joy buy you money? The impact of the strength, duration, and phases of an
entrepreneur’s peak displayed joy on funding performance. Academy of Management Journal, 62(6): 1848–1871.
Judge, T. A., Thoresen, C. J., Bono, J. E., & Patton, G. K. (2001). The job satisfaction–job performance relationship:
A qualitative and quantitative review. Psychological Bulletin, 127(3): 376.
Kabins, A. H., Xu, X., Bergman, M. E., Berry, C. M., & Willson, V. L. (2016). A profile of profiles: A meta-analysis
of the nomological net of commitment profiles. Journal of Applied Psychology, 101(6): 881.
Kim, C., & Bettis, R. A. (2014). Cash is surprisingly valuable as a strategic asset. Strategic Management Journal,
35(13): 2053–2063. doi: 10.1002/smj.2205
King, E. B., & Ahmad, A. S. (2010). An experimental field study of interpersonal discrimination toward Muslim job
applicants. Personnel Psychology, 63(4): 881–906.
Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical
review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin,
119(2): 254.
Köhler, T., & Cortina, J. M. (2021). Play it again, Sam! An analysis of constructive replication in the organizational
sciences. Journal of Management, 47(2): 488–518.
Köhler, T., & Cortina, J. M. (2023). Constructive replication, reproducibility, and generalizability: Getting theory
testing for JOMSR right. Journal of Management Scientific Reports, 1(2): 75–93.
Kunert, R. (2016). Internal conceptual replications do not increase independent replication success. Psychonomic
Bulletin & Review, 23: 1631–1638.
Leavitt, K., Mitchell, R. R., & Peterson, J. (2010). Theory pruning: Strategies to reduce our dense theoretical land-
scape. Organizational Research Methods, 13: 644–667.
Levitt, S. D., & List, J. A. (2011). Was there really a Hawthorne effect at the Hawthorne plant? An analysis
of the original illumination experiments. American Economic Journal: Applied Economics, 3(1):
224–238.
Lin, Y., Shi, W., Prescott, J. E., & Yang, H. (2019). In the eye of the beholder: Top managers’ long-term orientation,
industry context, and decision-making processes. Journal of Management, 45(8): 3114–3145.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Newbury Park, CA: Sage Publications, Inc.
Lynch, J. G., Jr., Bradlow, E. T., Huber, J. C., & Lehmann, D. R. (2015). Reflections on the replication corner: In praise
of conceptual replications. International Journal of Research in Marketing, 32(4): 333–342.
Lyness, K. S., & Heilman, M. E. (2006). When fit is fundamental: Performance evaluations and promotions of upper-
level female and male managers. Journal of Applied Psychology, 91(4): 777.
Martinez, L. R., White, C. D., Shapiro, J. R., & Hebl, M. R. (2016). Selection BIAS: Stereotypes and discrimination
related to having a history of cancer. Journal of Applied Psychology, 101(1): 122.
Mayo, E. (1933). The human problems of an industrial civilization. New York, NY: Macmillan Company.
Mendoza-Abarca, K. I., & Gras, D. (2019). The performance effects of pursuing a diversification strategy by newly
founded nonprofit organizations. Journal of Management, 45(3): 984–1008.
Michiels, S., Koscielny, S., & Hill, C. (2005). Prediction of cancer outcome with microarrays: A multiple random
validation strategy. The Lancet, 365(9458): 488–492.
Nahum-Shani, I., Henderson, M. M., Lim, S., & Vinokur, A. D. (2014). Supervisor support: Does supervisor support
buffer or exacerbate the adverse effects of supervisor undermining? Journal of Applied Psychology, 99(3): 484.
Obenauer, W. G. (2023). More on why Lakisha and Jamal didn’t get interviews: Extending previous findings through
a reproducibility study. Journal of Management Scientific Reports, 1(2): 114–145.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities:
Findings and implications for personnel selection and theories of job performance. Journal of Applied
Psychology, 78: 679–703.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251): aac4716.
Peterson, R. S., Smith, D. B., Martorana, P. V., & Owens, P. D. (2003). The impact of chief executive officer per-
sonality on top management team dynamics: One mechanism by which leadership affects organizational perfor-
mance. Journal of Applied Psychology, 88(5): 795–808.
Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased
citation rate. PLoS ONE, 2(3): e308.
Reinhart, C. M., & Rogoff, K. S. (2010). Growth in a time of debt. American Economic Review, 100(2): 573–578.
Roethlisberger, F. J., & Dickson, W. J. (1939). Management and the worker. Cambridge, MA: Harvard University
Press.
Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., … Nosek, B. A. (2018). Many ana-
lysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods
and Practices in Psychological Science, 1(3): 337–356.
Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on
Psychological Science, 9: 59–71.
Theissen, M. H., Jung, C., Theissen, H. H., & Graf-Vlachy, L. (2023). Cash holdings and firm value: Evidence for
increasing marginal returns. Journal of Management Scientific Reports, 0(0). https://doi.org/10.1177/27550311231187318
Thöni, C., Volk, S., & Cortina, J. M. (2021). Greater male variability in cooperation: Meta-analytic evidence for an
evolutionary perspective. Psychological Science, 32(1): 50–63.
Tsang, E. W. K., & Kwan, K.-M. (1999). Replication and theory development in organizational science: A critical
realist perspective. Academy of Management Review, 24: 759–780.
Vancouver, J. B., Thompson, C. M., & Williams, A. A. (2001). The changing signs in the relationships among self-
efficacy, personal goals, and performance. Journal of Applied Psychology, 86(4): 605.
Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. (2012). The criterion-related validity of
integrity tests: An updated meta-analysis. Journal of Applied Psychology, 97(3): 499–530.
Wanous, J. P., Sullivan, S. E., & Malinak, J. (1989). The role of judgment calls in meta-analysis. Journal of Applied
Psychology, 74(2): 259.
Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the
evidence and the quality of reporting of statistical results. PLoS ONE, 6(11): e26828.
Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data
for reanalysis. American Psychologist, 61(7): 726.
Wikipedia (2022). Hawthorne effect. Retrieved from https://en.wikipedia.org/wiki/Hawthorne_effect.
