
Journal of Public Administration Research and Theory, 2018, 1–18

doi:10.1093/jopart/muy047

Article

Best Practice Recommendations for Replicating Experiments in Public Administration

Richard M. Walker*, Gene A. Brewer†, M. Jin Lee*, Nicolai Petrovsky‡, Arjen van Witteloostuijn§

*City University of Hong Kong; †University of Georgia; ‡University of Kentucky; §School of Business and Economics, Vrije Universiteit Amsterdam

Abstract
Replication is an important mechanism through which broad lessons for theory and practice can be drawn in the applied interdisciplinary social science field of public administration. We suggest a common replication framework for public administration that is illustrated by experimental work in the field. Drawing on knowledge from other disciplines, together with our experience in replicating several experiments on topics such as decision making, organizational rules, and government–citizen relationships, we provide an overview of the replication process. We then distill this knowledge into seven decision points that offer a clear set of best practices on how to design and implement replications in public administration. We conclude by arguing that replication should be part of the normal scientific process in public administration to help build valid middle-range theories and provide valuable lessons for practice.

Replication is an important part of the scientific process because it can help to establish the external validity of knowledge and thus the ability to generalize a study's findings more broadly. Replication thus sits at the heart of scientific progress (e.g., Francis 2012; Freese 2007; Kuhn 1996; Nosek and Lakens 2014). The goal of replication is to confirm or reject theories under similar and dissimilar conditions to build confidence in (or falsify) them. In this way, theory testing is advanced, and knowledge is accumulated. Many disciplines rely heavily on replications to advance knowledge (Hubbard, Vetter, and Little 1998; Open Science Collaboration 2015), but they are rare in public administration and management (Walker, Brewer, and James 2017).1

Public administration is an interdisciplinary social science field that emphasizes applied ("theory to practice") research. In this article, we argue that these two factors suggest the need to craft a replication framework for public administration to advance knowledge. Over half a century ago, Simon (1996) argued that public administration is a "design science" that devises courses of action designed to change existing outcomes into preferred ones. Recommendations from public administration scholarship can have tangible consequences for citizens; thus, prescriptions need to be well-founded. If the discipline of public administration is to successfully advance policy and practice, then research findings must be generalizable. In the applied social sciences, the context within which studies are conducted is considered to be a critical variable for the development of theory (Freese 2007; Jilke, Meuleman, and Van de Walle 2014).

Acknowledgments: We would like to thank Mary Feeney, Bert George, Oliver James, Seb Jilke, Tobin Im, M. Jae Moon, Greta Nasi, Asmus Leth Olsen, Lars Tummers, Gregg Van Ryzin, and Xun Wu for contributing their ideas and thoughts to this project. Drafts of the article were presented at seminars at the Laboratory for Public Management and Policy, City University of Hong Kong; Erasmus University Rotterdam; KU Leuven; and SPEA, University of Indiana; and at three conferences: Charting New Directions in Public Management Research, University of Bocconi, January 2018; The 2018 Yonsei International Public Management Conference, Seoul, January 2018; and the 22nd International Research Symposium on Public Management, Edinburgh, April 2018. All errors remain with the authors. Address correspondence to the author at rmwalker@cityu.edu.hk.
1 In this article, the term "public administration" includes public management.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Public Management Research Association. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.


Most social and behavioral sciences are single disciplines with interdisciplinary interests, whereas public administration is an interdisciplinary discipline with many single interests. For example, much of psychology is concerned with questions about the human psyche and human behavior at the individual level. The interdisciplinary nature of public administration means that we are concerned with individual values, attitudes, and behaviors in social and political settings. Further, public administration research can include many different units of analysis (individuals, organizations, networks, countries, etc.). Although the scope of public administration is not necessarily broader or more important than that of other disciplines, the nature of public administration is interdisciplinary and spans the social sciences.

To help develop a clear set of arguments and illustrations, this article focuses primarily on the replication of experimental studies. However, much of what we say is relevant to other types of research, which also require rigorous replication. Hence, the steps we outline are generally applicable to all types of research, including qualitative studies (Mele and Belardinelli Forthcoming; Nowell and Albrecht Forthcoming).

Public administration scholars have increasingly adopted experimental research designs, which are characterized by random assignment of treatments or subjects into separate treatment and control groups and are conducted in real-world or artificial, laboratory-like settings. These designs have been adopted because of their strong internal validity and ability to identify causal relationships (Blom-Hansen, Morton, and Serritzlew 2015; James, Jilke, and Van Ryzin 2017). Yet even when the internal validity of an experimental study is strong, its external validity is often weak.

Researchers face many challenges in publishing replications. Notably, editors often prefer new findings on groundbreaking topics (van Witteloostuijn 2016) rather than the steady hum of scientific progress inching forward. These expectations are reinforced by the reward and promotion practices of universities, in which faculty are expected to publish in leading or high-impact journals that demand strong findings on innovative topics. Such practices have led some to argue that research articles contain many Type I errors in which false positives abound (Ioannidis 2005; van Witteloostuijn 2016). Replication publications are, nonetheless, on the rise across the social sciences (e.g., Francis 2012; Nosek and Lakens 2014) and in public administration (Walker, Brewer, and James 2017), but much more needs to be done. It is therefore important to develop a replication framework for conducting experimental replications, particularly considering the costliness of experiments, the number of published studies with small Ns, and the contextual nature of the findings reported therein.

This article contributes to the debate on replication in public administration and beyond by offering best practice recommendations for experimental replications with the goal of improving theory and practice. Outlining clear steps for conducting experimental replications will inform researchers' decisions on whether a study can be replicated and how the replication should be carried out. This article describes seven decision points. Central to these decision points are questions on criteria for determining the type of replication to be implemented. We rely on Tsang and Kwan's (1999) classification of different types of replications to guide these decisions. Their framework offers conceptual clarity on the type of replication needed to extend the original study to different populations, designs, and analyses. The decision points are (1) deciding whether replicating the study is feasible, (2) assessing the internal validity of the original study and the replication, (3) making choices about statistical power, (4) choosing a critical test case, (5) establishing boundary conditions, (6) establishing construct validity, and (7) deciding how to compare the findings.

The first section of this article reviews several rationales for replicating studies in the social sciences in general and catalogs replication experiences in different disciplines. In the second section, we introduce the seven decision points for designing and implementing replications, which provide the underpinnings for our best practice recommendations. Following this, we illustrate our best practice recommendations through a replication, conducted in Hong Kong in 2017, of Van Ryzin (2013). The foundations for our best practice recommendations are drawn from the replication experiences on the topics reviewed and our experience in replicating field, laboratory, and survey experiments at the Laboratory for Public Management and Policy at City University of Hong Kong (e.g., Walker, Brewer, and James 2017). We conclude by proposing a "common replication framework" that would integrate replication into the normal scientific practices of public administration to improve the quality of scholarship and help develop mid-range theories and robust lessons for practice.

Why Replication
Scientific knowledge is bolstered when research designs have high validity. Researchers may not get it right the first time, but through repeated efforts, their findings amass and converge on the truth. Campbell and colleagues (e.g., Campbell and Stanley 1963; Cook and Campbell 1979) shaped researchers' thinking when they distinguished between four essential forms of validity in social science research: internal validity, external validity, construct validity, and statistical conclusion validity.


The authors further described various threats to validity and cautioned researchers to minimize them. For many decades, this has been textbook material, and it informs our decision points.

Researchers should strive for high levels of validity, according to Campbell and Stanley (1963). However, there are tradeoffs that work against achieving a perfect balance. On the one hand, research designs tailored to optimize the testing of causal relationships with high internal validity often create artificial or contrived conditions that make the result very difficult to reproduce in other settings (low external validity). On the other hand, research designs aimed at broad generalizability (high external validity) often sacrifice control and rigor in the research design, making the results less certain (low internal validity). Public administration scholars have adopted experimental research designs in the search for stronger internal validity. Nevertheless, in an applied social science discipline like public administration, external validity is also important. Hence, replication research designs often incur a trade-off between internal and external validity. External validity can then be produced through a string of internally valid studies that test boundary conditions.

A Review of the Major Replication Efforts to Date

Reproducibility
Reproducibility is "the extent to which consistent results are observed when scientific studies are repeated" (Open Science Collaboration 2015, 657). Some disciplines have experienced a "reproducibility crisis" because they could not reproduce high-visibility research findings. In economics and finance, Dewald, Thursby, and Anderson (1986) attempted to verify findings reported in articles published in the Journal of Money, Credit, and Banking before and after the journal required authors to provide data and code to others upon request. For articles published before the mandate, 14 out of 62 authors contacted responded that they had lost or discarded their data; for articles published subsequently, the most striking observation was the high prevalence of errors, some of which led to substantially different conclusions. More recently, in psychology, the Open Science Collaboration's (2015) multisite replication of 100 experiments challenged the hitherto prevalent views of much of the accumulated knowledge in the field when only 39% of the original effects were substantively replicated.

The reproducibility debate has led to calls for openness and transparency in all aspects of the research process (Camerer et al. 2016; Ioannidis 2005; Munafo et al. 2017; Nosek et al. 2015). In our view, public administration should institutionalize these norms of openness and transparency before a replication crisis emerges in the field. This is essential if replication is to become a normal part of public administration research. Given the relative youthfulness of experimental research designs in public administration, ensuring the full transparency of all published experimental studies is an essential prerequisite for conducting replications (decision point 1).2 A leading voice in the transparency and openness debate has been the Center for Open Science, which has published Transparency and Openness Guidelines (Nosek et al. 2015). Level 1 of these standards has been adopted by some public administration journals.3 Whether journals adopt these standards or not, public administration scholars are urged to be open and transparent in their research designs, data, and code to facilitate replication.

Replication
Many existing articles provide guidance on the conduct of replications. We identified a sample of these articles by searching for titles containing the term "replicat*" in the Social Science Citation Index in the disciplines of economics, management, psychology, sociology, and public administration for the period 1970–2018. We turned up 971 articles, of which the vast majority were replication studies. Table 1 identifies three broad themes that are discernible in the literature: types of replication, academic practices, and recommendations and guidance for the conduct of replication. The table lists some useful studies and provides a brief summary of their recommendations for improving replication.

The first theme includes scholars working from an exploratory social science perspective and seeking to understand and classify types of replication—we return to this topic in the next section. A second theme involves studies focusing on recommendations to promote the conduct and publication of replications. Recommendations typically include teaching replications in PhD programs to instill replication as a norm and changing the attitudes and behaviors of journals, editors, and reviewers (Hubbard, Vetter, and Little 1998; Schmidt 2009). The third theme promotes the practice of replication. Recommendations include detailed guidance on implementing a direct replication (Brandt et al. 2014), the development of a "good-enough" replication standard (Singh, Ang, and Leong 2003), and a protocol based on the researchers involved in the replication, the population/subjects studied, and the methods and analysis (Walker, Brewer, and James 2017).

2 We encourage openness and transparency in all forms of empirical research.
3 Public Administration Review, the Journal of Behavioral Public Administration, and the International Journal of Public Sector Management have signed up to Level 1 (voluntary disclosure): see https://www.publicadministrationreview.com/guidelines/, https://journal-bpa.org/index.php/jbpa/transparency, and http://www.emeraldgrouppublishing.com/products/journals/author_guidelines.htm?id=ijpsm


Table 1. Examples of Sources Addressing Replication Practices

Bettis, Helfat, and Shaver (2016), Strategic Management Journal [T, R]. Proposes narrow and quasi-replication types. Offers guidelines to achieve high-quality replication: Is the focus of the replication an important finding? Is the goal a narrow or quasi-replication? How closely does the replication match the original study? Is it beneficial to add advances in the literature since the publication of the focal study? Are the results interpreted so as to build the literature?

Berthon et al. (2002), Information Systems Research [T, R]. Replication types: pure replication, context extension, method extension, theory extension, theory/method extension, method/context extension, theory/context extension, and pure generation. Introduced research space, a framework based on the problem, theory, method, and context to inform the process of conducting replication.

Brandt et al. (2014), Journal of Experimental Social Psychology [R]. Provides five replication ingredients for a direct replication: (1) carefully defining the effects and the methods that the researcher intends to replicate; (2) following as exactly as possible the methods of the original study; (3) having high statistical power; (4) making complete details about the replication available; (5) evaluating replication results and comparing them critically with the original study.

Fabrigar and Wegener (2016), Journal of Experimental Social Psychology [T, R]. Distinguishes between direct and conceptual replications. Makes recommendations on the measurement of variables and on the analysis of replication results using meta-analytic techniques.

Freese and Peterson (2017), Annual Review of Sociology [T, A]. Replication types based upon new and old data and similarity and differences in research design, and a replication policy for sociology.

Hubbard, Vetter, and Little (1998), Strategic Management Journal [A]. Recommendations on inculcating replication in graduate school to promote replication as normal practice and for journals to have separate sections dedicated to publishing replications.

Jilke et al. (2017), Public Management Review [R]. Provides guidance on how to empirically achieve measurement equivalence for cross-replication experiments, based on the experimental replication and survey measurement equivalence literatures.

Kerr, Shultz, and Lings (2016), Journal of Advertising [A, R]. Based on a Delphi study, an agenda for collective action to increase replication is presented: review the reviewers; normalize replication, openness, and transparency; grow publication opportunities; replicate it differently.

Makel and Plucker (2014), Educational Researcher [A]. Recommendations on third-party replication studies to improve education policy and practice.

Pedersen and Stritch (2018), Public Administration Review [R]. Provides guidance on key replication ingredients: relevance, number, internal validity, contextual realism, and external validity.

Schmidt (2009), Review of General Psychology [T, A]. Replication types: direct and conceptual. Discusses the need for replication to be a mainstream activity by achieving greater recognition of replication activities by scholars, the teaching of replication, and changing the editorial behaviors of journals.

Singh, Ang, and Leong (2003), Journal of Management [T, R]. Promotes a good-enough replication standard and the focusing of replication research on degree of theory development (limited/substantial) and degree of methodological development (limited/substantial).

Tsang and Kwan (1999), Academy of Management Review [T]. Proposed six types of replication (including reproduction) based upon the population (same and different) and measurement and analysis (same and different).

Uncles and Kwok (2013), Journal of Business Research [T, R]. Replication types: repeat studies, close replication, and differentiated replication. Focuses on how to do replications and promotes conducting replications as an integral part of the original research design.

Walker, James, and Brewer (2017), Public Management Review [T, A, R]. Provides a replication protocol based upon six types of replication and the researchers involved, population/subjects, methods, and analysis.

Note: T = discusses types of replication; A = makes recommendations to the academic community to increase the prevalence of replications; R = makes recommendations for the conduct of replications.

Bettis, Helfat, and Shaver (2016) move beyond questions of research design to consider how a replication advances knowledge and how to interpret its results, as do Fabrigar and Wegener (2016), while Jilke et al. (2017) provide detailed guidance on measurement equivalence. Others have provided replication models such as RNICE (relevance, number, internal validity, contextual realism, and external validity), which was published in a public administration journal (Pedersen and Stritch 2018). This model lists topics that need to be examined to evaluate a replication effort and is an important step forward in the emerging debate in public administration.

If replication agendas are to pay dividends, studies must be carefully chosen (Bettis, Helfat, and Shaver 2016). Pedersen and Stritch (2018, 3) suggested that "as a rule, a 'valuable' replication aims to provide supporting (or contradicting) evidence about the existence of a phenomenon that audiences care about and deem important." Detailed impact criteria have been offered by the Netherlands Organisation for Scientific Research, providing a useful guide for public administration scholars: the original study's findings must be scientifically impactful (e.g., a breakthrough finding with much subsequent work built on it, or a study that has been highly cited) and/or take an important societal perspective (e.g., a critical policy).4

4 https://www.nwo.nl/en/news-and-events/news/2017/social-sciences/second-call-for-proposals-replication-studies-open-for-applications.html

Best Practice Recommendations for Designing and Implementing a Replication
This section introduces our main contribution: best practice recommendations for conducting replications in public administration research broadly and in experimental studies specifically. The best practice recommendations consist of key "decision points" to help researchers navigate the thicket of issues they face when planning a replication, such as assessing feasibility and determining the type of replication needed. Replications connect the existing body of knowledge with new knowledge. Therefore, they should be applied systematically (Schmidt 2009). Smart decisions on these issues can increase a researcher's impact, produce new knowledge for the field, and make scientific progress more efficient. The best practice recommendations span several steps in the scientific method, such as testing and re-testing hypotheses to accumulate results that eventually converge and form scientific knowledge. Figure 1 depicts the main stages of replication: planning, implementing, and reporting results. The goals of replication and decisions on the type of replication to use (see below) influence the best practice recommendations. Thus, trade-offs between these choices are also considered.

Different Types of Replication
In selecting studies for replication, many decisions must be made, and a classification scheme of different types of replications is very helpful. Tsang and Kwan (1999) proposed a framework that identified four different forms of replication (table 2).


Figure 1.  Replication: Summary of Main Issues and Decision Points

Table 2. Different Types of Replication

                          Same Measurement and Analysis    Different Measurement and Analysis
Same population           Direct replication               Conceptual replication
Different population      Empirical generalization         Generalization and extension

Source: Adapted from Tsang and Kwan (1999).

The framework contrasts several study elements to create a typology: studies using the "same measurement and analysis" or "different measurement and/or analysis," and studies using the "same population" or a "different population." Tsang and Kwan's (1999) terminology suggests that different populations consist of different participants (e.g., diverse groups of citizens in different service areas or jurisdictions). Their ideas also apply to the additional dimension of context (e.g., the same citizens experiencing different policy domains or institutional settings).
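Read as a decision rule, the typology reduces to the two questions just described; a minimal sketch in Python (an illustrative helper, not part of Tsang and Kwan's framework):

```python
def classify_replication(same_population: bool, same_measurement_and_analysis: bool) -> str:
    """Map Tsang and Kwan's (1999) two dimensions onto the four types in table 2."""
    if same_population and same_measurement_and_analysis:
        return "direct replication"
    if same_population:
        return "conceptual replication"
    if same_measurement_and_analysis:
        return "empirical generalization"
    return "generalization and extension"

# Example: the original measures and analysis, applied to a new population.
print(classify_replication(same_population=False, same_measurement_and_analysis=True))
# -> empirical generalization
```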
Tsang and Kwan (1999) called a replication with the same research procedures, measurements, analysis, population (although perhaps a different sample), and context an "exact replication." Following Schmidt (2009), we have relabeled this a "direct" replication because an "exact" or "literal" replication is actually not possible. At a minimum, time passes, and the subjects change. However, we have moved beyond Schmidt's (2009, 91) argument that a direct replication is just a "[r]epetition of an experimental procedure." A study is repeated at a different point in time to test whether the findings hold and whether the concepts are still applicable.

A direct replication could also involve a different sample of public managers, such as frontline workers, from a larger population of public managers. An example of a direct replication was implemented by Grimmelikhuijsen and Porumbescu (2017) on Van Ryzin (2013). The replication yielded similar results and reinforced the findings reported in the original study.

The second type of replication also uses a sample from the same population, but it is labeled "conceptual." A conceptual replication is an extension of the original study because it involves different measurements and analyses. A conceptual extension introduces new procedures and explores additional ramifications of the original finding(s). For example, using the same population, Van Ryzin, Riccucci, and Li (2017) extended their earlier study of representative bureaucracy (Riccucci, Van Ryzin, and Li 2016) into a new policy domain: emergency preparedness. They were unable to verify their prior findings and concluded that "the symbolic effects of gender representation may be policy-specific" (Van Ryzin, Riccucci, and Li 2017, 1365). Scholars conducting replication studies should be aware that conceptual replication may require many repetitions using similar and dissimilar measurements and analyses to provide clear-cut answers.

The third and fourth types of replication involve varying degrees of extension because they are performed on populations that differ from those in the original study. These replications may also differ by using either the same or new measurements and/or analyses. For example, "empirical generalization" uses the same research design, measures, and analyses, but assesses whether the original findings hold up in different populations. Lee, Moon, and Kim (2017) extended Knott, Millar, and Verkuilen's (2003) study of American university students' decision making to Hong Kong and Korea, broadly confirming the original finding that individuals with limited information make incremental decisions. "Generalization and extension" replications use a different population while seeking to extend the original findings by adopting additional measurements and analyses. Under this type of replication, if the results are different from those reported in the original study, the discrepancies may be attributed to an altered research design and/or changes in the population.

The approaches to generalization and extension have varied in the public administration literature. For example, Olsen (2017) examined the hypotheses and findings from psychology to study action bias. George et al. (2017) replicated a Danish experiment on performance information and politicians' preferences for spending and reform, adding an extension onto the role of strategic goals. As discussed in decision point 7, interpreting the replication results requires finesse. Nevertheless, generalizations and extensions represent an important step in testing boundary conditions (decision point 5) to refine theory and strengthen external validity.

Moving from direct replication to generalization and extension increases the confirmatory power of the replication (Schmidt 2009). A direct replication tests whether the hypotheses and causal relationships in the original study hold up with minimal change. Adding to population, measurement, and analysis through "empirical generalization," "conceptual," and "generalization and extension" replications further tests the concepts and causal relationships in each replication. In an ideal world, with ample time and resources available, every replication would travel through the four stages shown in figure 2 to determine whether, or the extent to which, the findings can be confirmed at each stage—reflecting Singh, Ang, and Leong's (2003) call for "good enough" replications. As Schmidt (2009, 97) noted, systematically diverse populations, contexts, measures, and analyses are likely to result in variations from the original study that are suited to the multitrait–multimethod research design or a "systematic replication matrix." To overcome the obstacles of resources and time, a "many-labs" research design has been employed, a topic we return to below.

Figure 2. Building Generalizability

Planning a Replication

Decision Point 1: Deciding If It Is Feasible to Replicate the Study
The goal of the first decision point is to determine whether it is feasible to replicate a study.


Although the original study is likely to have been published in a scholarly journal, as we noted above, most public administration journals do not require all aspects of a research article's design, procedures, and analysis to be fully disclosed. Given this, attempting a replication can be challenging.

If all information on the research design, analysis, and results is accessible, the researcher can move directly to decision points 2 and 3 to check for internal and conclusion (statistical) validity and determine which type of replication should be undertaken. However, it may not be possible to replicate a study if the original materials are not available and sufficient. This includes information on participant recruitment, instructions, measures, procedures, and analysis of the final data set (Brandt et al. 2014). In some cases, materials may be available, but the original study was conducted in a different language. For example, the research materials for Weibel, Rost, and Osterloh's (2010) study were in German. To conduct a replication, these materials had to be requested from the authors and translated meticulously. Translation is costly, and translating somewhat technical research materials can result in misinterpretation, raising questions of construct validity (decision point 6).

If the full details are unavailable or the costs of replication are prohibitively high, there are three options. First, the replication can be abandoned. Second, if adequate study data and research materials are available, the researcher can undertake a "checking of analysis" (Tsang and Kwan 1999). If the original findings can be reproduced based on the replicator's ability to recreate the original study materials, the researcher can move to decision point 2 and onward to consider which type of replication to undertake. Third, if partial details are available, a final option is to use a different type of replication (e.g., a conceptual replication). This would also lead the researcher to decision point 2.

Information on the original materials may be available from the author(s). If the original authors are being consulted on how to access the materials, it may be necessary to consider a role for one or more of them in the replication. Benefits (including access to detailed knowledge on how the original study was conducted and how it can be replicated) can accrue by including the original team members. However, benefits also arise from maintaining independence from the original research team. Autonomy ensures that researchers have less emotional attachment to the findings and can be impartial: analyses of replications in education and psychology indicate that involving the original authors increases the likelihood of producing findings similar to those of the original article (Makel and Plucker 2014; Makel, Plucker, and Hegarty 2012). If the replicating researcher desires independence from the original authors, but the full materials are unavailable, replication may not be possible, or a different type of replication may need to be undertaken.

Decision Point 2: Assessing the Internal Validity of the Original Study and the Replication
Some public administration experiments are well-designed. They are guided by a clear hypothesis (or a set of hypotheses) derived from a well-specified theoretical perspective. The measures and treatments fit with the theoretical concepts. Alternative explanations are carefully considered and, to the greatest extent possible, eliminated by the design, which results in a high degree of internal validity. In principle, such cases are suitable for replication. However, replications may still encounter serious threats to internal validity if such threats were present in the original study.

Internal validity is defined by the following questions: did the experimental stimulus, in fact, make some significant difference in the result, and did this covariation result from a causal relationship (Cook and Campbell 1979; Shadish, Cook, and Campbell 2002, 37)? Validity is a property of inferences related to the design that bears on the sureness or truth of the results. Many well-known threats can weaken internal validity (e.g., ambiguous temporal precedence, selection bias, history, maturation, regression, attrition of respondents, testing, and instrumentation). Space does not permit a full review of all threats to internal validity, but full consideration should be given to them when replicating a study.

Decision making on internal validity is a two-stage process. First, the research design of the original study must be carefully examined to ensure that internal validity is high and that fatal problems, such as ambiguous temporal precedence, are removed. Many threats to internal validity, such as attrition and history, occur during the implementation of a study. Original studies should clearly document these threats, but this does not always happen. Thus, it is important to systematically assess the original study's design, procedures, and analysis. Second, when replicating, improvements to internal validity should be implemented to enable a "convincing replication" (Brandt et al. 2014; Pedersen and Stritch 2018). Any remaining limitations should be documented when the results are reported.

Decision Point 3: Making Choices about Statistical Power
In public administration, one endemic threat to internal and conclusion validity is low statistical power. Statistical power is the ability of a statistical test (at a level of statistical significance, α, specified by the researcher) to detect an effect size (δ, also specified by the researcher) that exists within the population. It follows that statistical power is always a contingent quantity—it is not defined except for specified values of statistical significance, α, and, importantly, the effect size δ.


Statistical power is a major consideration when deciding what to infer from an experimental finding. Low power can be a parsimonious explanation of nonsignificant findings (Shadish, Cook, and Campbell 2002). Although the concept of statistical power is widely used, practice could be improved, as Lindsay (2015) notes for the field of experimental psychology.

In a power calculation, the level of statistical significance (α) is typically specified by convention, often at 0.05 [but see Meyer, van Witteloostuijn, and Beugelsdijk (2017) and Cumming (2014) on how to avoid the pitfalls of such a criterion]. For the effect size δ, researchers tend to follow prior research.5 In a replication, δ can be based on the estimate(s) of the prior study. If researchers want to allow for the possibility of publication bias (which would mean the estimates of δ are greater in magnitude than the population value), they can choose a smaller value of the effect size, δ, for the power calculation (Vazire 2016, 4). Once the level of statistical significance and effect size have been specified, statistical power increases in the number of subjects and decreases in the variance of the outcome.6 We discuss these two levers in turn.

The clearest solution to low statistical power is to increase the sample size. Often, for practical reasons, the number of subjects available for the replication is relatively limited. Having a limited number of subjects means that a replication can improve on the first experiment by taking steps to enhance statistical power. However, there may be a range of constraints that restrict the researcher from doing this. These include the costs of treatments plus compensation and difficult access to subjects.7 Those who are interested in replication with a focus on maximizing power can take a number of steps to do so. Shadish, Cook, and Campbell (2002, 46–7) provided a useful catalog of ways to increase power. For one, if the treatments are costly relative to the control conditions, or if each treatment is primarily compared with the control condition, the number of subjects in the control condition can be increased. Indeed, Orr (1999) shows that for T treatments and a control condition, power is maximized if 1/(T × 2) of the subjects are allocated to each treatment condition, and one-half of the subjects are allocated to the control condition.

Beyond increasing the number of subjects and/or the optimal allocation of subjects to experimental conditions, other strategies to increase statistical power focus on reducing the variance of the outcome. For replications, the variance in the outcome can be estimated from the summary statistics of the study being replicated, if they are broken down and reported by experimental condition. The following three strategies to decrease the outcome variance and thus increase statistical power are all compatible with both direct replication and empirical generalization. These strategies enhance power by reducing heterogeneity first so that likes are compared with likes. They increase power if—and only if—a variable has been measured before the experiment that is expected to correlate with the outcome (cf. Shadish, Cook, and Campbell 2002). How does this work?

The three strategies are matching, blocking, and stratifying. We define and illustrate them using the running example of a "nudge" experiment that seeks to identify what type of electric bill information leads households to reduce their consumption. In this example, there are N households as subjects and T different types of information on the electric bill, and the outcome is electricity consumption. One variable that could be used for matching, blocking, or stratifying is the physical size of each dwelling. Matching means that there are N/T matches with the same scores on the matching variable. There would be N/T households matched on their square footage, and then the T different treatments would be randomly assigned within each match.

Blocking is conceptually similar to matching but does not necessarily occur for the exact same scores, only for similar scores. In contrast, stratifying means there are fewer than N/T strata. For instance, strata may be the type of dwelling: apartment/condominium, town home, or single-family home. The N units would first be grouped into these three strata. Then, within each stratum, the T different treatments would be randomly assigned.

In addition to using household size, we could use the pretest electricity consumption as the matching, blocking, or stratifying variable. In either case, because the matching, blocking, or stratifying variables are correlated with the outcome, the power of the test to detect the effects of the T-1 treatments (each compared against the control condition) is increased relative to simply randomly assigning the T conditions across the N households. For each of the three strategies, the analysis must consider how the treatment was assigned, especially when the percentage assigned to a condition varies (Gerber and Green 2012).
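For the hypothetical electricity nudge example, blocked (or stratified) assignment can be sketched in a few lines; the data and variable names below are illustrative only, assuming numpy and pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
T = 2  # treatment conditions, plus one control
conditions = ["control"] + [f"treatment_{t}" for t in range(1, T + 1)]

# Hypothetical pre-treatment data: dwelling size correlates with consumption.
households = pd.DataFrame({"dwelling_sqft": rng.normal(1200, 400, 600).round()})

# Blocking: sort on the pre-treatment variable, cut into blocks of T + 1
# similar households, then randomly assign one condition within each block.
households = households.sort_values("dwelling_sqft").reset_index(drop=True)
households["block"] = households.index // (T + 1)
households["condition"] = ""
for _, idx in households.groupby("block").groups.items():
    households.loc[idx, "condition"] = rng.permutation(conditions)[: len(idx)]

# Stratifying would instead group households into a few coarse strata
# (e.g., dwelling type) and randomize the T + 1 conditions within each stratum.
```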
Apart from the three strategies just discussed, the introduction of covariates is standard practice in public administration. The logic behind controlling for covariates is fundamentally the same as that behind matching, blocking, and stratifying: reducing error variance by comparing treatment and control within more homogeneous subgroups of the subject pool. The difference is that this is done mathematically in the analysis after the experiment has been conducted. Instead of conducting a t-test or ANOVA of differences between groups, the covariates (e.g., students' gender, major, and location of residence) are included as additional variables. For this purpose, it can be helpful to express the analysis as a regression with dummy variables for the treatments. This is mathematically identical to an ANOVA, but some will find it easier to interpret the findings (Angrist and Pischke 2009).8
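The equivalence of the dummy-variable regression and ANOVA, and the effect of covariate adjustment, can be illustrated with statsmodels; the simulated data here are purely illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=2)
n = 300
df = pd.DataFrame({
    "condition": rng.choice(["control", "treatment_1", "treatment_2"], n),
    "dwelling_sqft": rng.normal(1200, 400, n),
})
# Simulated outcome: driven by dwelling size plus a modest treatment effect.
df["consumption"] = (
    0.5 * df["dwelling_sqft"]
    - 30 * (df["condition"] != "control")
    + rng.normal(0, 100, n)
)

# C(condition) expands the treatment into dummy variables, with the control
# group as the omitted baseline; this is mathematically a one-way ANOVA.
unadjusted = smf.ols("consumption ~ C(condition)", data=df).fit()

# Adding a pre-treatment covariate reduces error variance, the same logic as
# matching, blocking, and stratifying, applied at the analysis stage.
adjusted = smf.ols("consumption ~ C(condition) + dwelling_sqft", data=df).fit()

print(unadjusted.summary())  # report both, with and without covariates
print(adjusted.summary())
```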
Given the limited resources and growing expertise in replication, together with threats to internal validity, many of the original studies that have been replicated were underpowered. Therefore, the comparison between original and replicated results has been inconclusive. Paying close attention to statistical power when designing a replication study is an essential ingredient for maximizing the insights to be gained from the replication.

5 Some researchers report a calculation of power using the estimate of the effect size they obtained in the same study. Cumming (2014) shows that such post hoc power can take almost any value and is thus not informative. Statistical power is most useful when calculated in reference to an effect size δ specified ex ante.
6 While power calculations are best done in software, we provide a worked, simplified example. For this, we use a between-subjects design, which is representative of the majority of experiments in public administration. Suppose there is a treatment and a control condition (more conditions mean this calculation has to be repeated). The outcome (dependent variable) is a seven-point scale ranging from 1 to 7, with a mean of 3 and standard deviation of 1 in each condition. Suppose we set the level of statistical significance at α = 0.05 in a one-tailed test, and the effect size δ as 0.2 standard deviations, which is 0.2 × 1 = 0.2 in our example. To calculate power, we use the critical values of the relevant statistical distribution, typically the t-distribution. It can be approximated by the standard normal distribution for larger sample sizes, which we use in our example. If the mean of the treated group is 3.2, the power for correctly rejecting the null hypothesis of no positive effect is approximately 1 − prob.[Z ≤ 1.64 − (0.2/√(1/N))]. For a typical N of 75 subjects for a comparison between two conditions, this yields a power of 1 − prob.[Z ≤ 1.64 − (0.2/√(1/75))] = 1 − prob.[Z ≤ 1.64 − 1.73] = 1 − prob.[Z ≤ −0.09] = 1 − 0.4641 = 0.5359. That is, at the specified level of significance and the specified effect size, the null of no effect will be rejected in 54% of trials. As noted before, for the specified target effect size, power increases with the sample size and decreases with the level of statistical significance the researcher chooses. Instead of setting the number of subjects in each condition (N) and solving for power (which is useful when assessing existing studies to be replicated), it can also be useful to set power at a given value and solve for the number of subjects required. This is useful for designing a new study or replication.
7 Sample size is also subject to diminishing returns, and excessively large samples can raise an ethical problem because they are wasteful and needlessly expose subjects to risk (Bacchetti et al. 2005).
8 The analysis should be presented with and without covariates to aid interpretation.
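Footnote 6's normal-approximation arithmetic, and the reverse exercise of fixing power and solving for the required sample size, can be checked in a few lines; a minimal sketch, assuming scipy and statsmodels:

```python
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower

# Footnote 6: alpha = 0.05 one-tailed, delta = 0.2 standard deviations,
# N = 75, with the standard error taken as sqrt(1/N).
alpha, delta, N = 0.05, 0.2, 75
z_crit = norm.ppf(1 - alpha)  # ~1.64
power = 1 - norm.cdf(z_crit - delta / (1 / N) ** 0.5)
print(round(power, 2))  # ~0.53; footnote 6 reports 0.5359 using z = 1.64 exactly

# Fix power at 0.80 instead and solve for the subjects per condition needed
# to detect delta = 0.2 in a two-sample t-test with a one-sided alternative.
n_required = TTestIndPower().solve_power(
    effect_size=delta, alpha=alpha, power=0.8, alternative="larger"
)
print(round(n_required))  # several hundred subjects per condition

# Orr (1999): with T treatments plus a control, allocating 1/(2T) of the
# subjects to each treatment and one-half to the control maximizes power.
```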
Implementing the Replication

Decision Point 4: Choosing the Critical Test Case
The next decision point is where to replicate, or to choose between replication sites. The choice of replication site requires a detailed consideration of context. Recently, several scholars have sought to develop a theory of context in public administration (e.g., Meier, Rutherford, and Avellaneda 2017; O'Toole and Meier 2015; Pollitt 2013). Their motivation stems from the equivocal findings of observational studies in different contexts, which suggests that context matters and is an important aspect of theory in public administration. The multidimensional nature of context across time and space has led to calls for a more intensive study of its key dimensions, such as the political, environmental, and internal factors affecting public organizations (O'Toole and Meier 2015). Scholars further assert that context affects the relationships between variables in different settings by interacting with public administration practice, altering the relationship between the independent and dependent variables and thus reshaping theory.9 These factors influence site selection decisions.

The first step in choosing the critical test case is to distinguish between two ideal types of studies (with much gray middle ground): confined versus universal. Confined studies produce findings that are purported to be specific to the local context, whereas universal studies claim global generalizability. In public administration, confined studies are the norm: studies in specific institutional contexts are often framed as tests of general theory. Generally, the argument is that public administration phenomena are context-specific and boundary conditions apply (see below).

One example is the role of national culture. What holds in China may not work in Germany, and vice versa. New public management (NPM) practices may be effective in similar contexts (i.e., Anglo-Saxon countries), but much less so or not at all elsewhere. In contrast, universal claims are said to hold everywhere and every time, irrespective of context. For example, public service motivation and red tape have been presented as universal constructs with local variations, whereas agencification and autonomization are viewed as Scandinavian concepts with universal implications. This is not the place to discuss the type of claims researchers should make, but a study's location on the confined-universal spectrum implies an understanding of context and a choice of replication sites.

In public administration, confined studies are the rule rather than the exception, so we take this as our stepping stone. Our advice for the second step is to apply both Mill's (1843) method of difference (dissimilarity) and method of agreement (similarity).10

9 It should be noted that cloned dependent variables may differ across settings because of unique attributes of the settings.
10 This is closely related to the issue of boundary conditions, which is elaborated in decision point 5.


The initial stage is to select sites where the study is likely to be replicated (the method of agreement) or not (the method of difference). Applying this logic to public administration, we should include theoretically selected sites in the replication design. Take the above example again. Suppose that theory predicts that NPM practices (e.g., competitive tendering) are only effective in Anglo-Saxon countries. A powerful multisite replication would redo the original study in countries such as Australia, Ireland, the United Kingdom, and the United States (where, following the logic of the method of agreement, the original finding is expected to replicate well) and in nation-states like Brazil, China, France, and Germany (where, through the lens of the method of difference, the original result is not expected to replicate). This example illustrates the difference in replicating a universal study: site selection is irrelevant because the original result is expected to replicate everywhere.

Many replications fall into the gray zone with neither fully confined nor fully universal findings. In the context of the above NPM example, an argument could be made that the effectiveness of performance pay is positively associated with a country's degree of individualism. In this case, our advice is to sample replication sites across the individualism spectrum, taking care to select diverse sites. The findings from the replication can then be used in a moderated meta-analysis with individualism as the critical moderator. The prediction is that the effect sizes will be larger with higher individualism (and perhaps small or nonsignificant at the collectivist end of the scale).
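The moderated meta-analysis described here is, in its simplest form, an inverse-variance weighted regression of site-level effect sizes on the moderator; a minimal sketch with made-up numbers, assuming statsmodels:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical multisite replication results: one effect size per site, its
# sampling variance, and the country's individualism score as the moderator.
effect = np.array([0.31, 0.28, 0.24, 0.11, 0.05, 0.02])
variance = np.array([0.010, 0.012, 0.008, 0.011, 0.009, 0.013])
individualism = np.array([91, 89, 76, 30, 20, 18])

# Fixed-effect meta-regression: weight each site by inverse variance.
X = sm.add_constant(individualism)
fit = sm.WLS(effect, X, weights=1 / variance).fit()
print(fit.params)  # a positive slope is consistent with the prediction

# A random-effects meta-regression would add a between-site variance term.
```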
Decision Point 5: Establishing Boundary Conditions
A primary motive for replication in public administration is to establish whether research findings work in different contexts and to understand what factors might affect the degree of generalization. Public administration replication often aims to make more explicit the moderators that explain why replication results vary, but boundary conditions may apply to many other aspects of a study, such as substituting a similar stimulus to achieve a desired effect, testing the cause–effect relationship on a different population, moving the experiment to a different setting, or assessing the consistency of results over time. An understanding of boundary conditions is important because, as discussed in decision point 4, the original study may have been confined—designed for a specific and limited purpose rather than for generalization. In any case, the goal is to make all relevant boundary conditions explicit because this permits an understanding of "the accuracy of theoretical predictions for any context" (Busse, Kach, and Wagner 2017, 578). Put simply, it translates theory to practice and helps implement the design science agenda of public administration. The fifth decision point, thus, requires researchers to clearly state boundary conditions and hypothesize their expected findings when replicating an original study.

Choice of replication site will influence boundary conditions. If the strategy is to replicate a study in a very similar setting, the purpose is probably to increase the likelihood of confirming the results or further refine the theory in the original study. The selection of a dissimilar setting suggests that the replication includes extensions that will push the boundaries of the original study to develop or adjust theory. Busse, Kach, and Wagner (2017) propose "inside-out" and "outside-in" approaches to exploring boundary conditions (figure 3). In an inside-out exploration (point A in figure 3), uncertainty about boundary conditions is low, and the accuracy of theoretical predictions is high. Replications testing inside-out are likely to implement replications that incrementally change populations or measures and analyses [see Grimmelikhuijsen and Porumbescu's (2017) replication of Van Ryzin (2013)].

With the outside-in approach (point B in figure 3), boundary conditions are highly uncertain, and the accuracy of theoretical predictions is low. A replication based on dissimilarity and uncertain boundary conditions seeks to test and develop a theory for a new context. In this case, the researcher would hypothesize that the results of the original study would not be present in the replication. Confirmation of this hypothesis does not equate to "replication failure" because the focus is on breaking new ground. This means that "the problem statement is less clear and the research project does not follow a fully straightforward plan but more likely involves some feedback loops and iterations" (Busse, Kach, and Wagner 2017, 584).

Figure 3. Setting Boundary Conditions. Note: Adapted from Busse, Kach, and Wagner (2017, 584).


An inside-out approach to the question of boundary conditions has been applied to the replications conducted in Hong Kong on organizational design. The Hong Kong government was designed by the British colonial powers and thus reflects the combined structures, cultures, and practices of East and West. Externally, public bureaucracy in Hong Kong looks much like the West. However, Hong Kong's government and public services are much more centralized and hierarchical than those in most Western countries because of its duty-based and leadership-centric culture, which is typical of Eastern countries. Replications of Kaufmann and Feeney's (2014) study of red tape (Walker, Lee, and James 2017, 457) suggested the following proposition: "The results of replications on organizational design will be in the same direction as the original studies, but weakened by Hong Kong's diverse culture." Notable in this view is the likely greater tolerance of rules and decisions made by more senior actors in the hierarchy. This assertion was upheld in the replication, which resulted in nonsignificant findings.

A clearer understanding of boundary conditions in replications can inform public administration on how we use knowledge to solve problems and devise solutions in different contexts. Indeed, the exploration of boundary conditions fosters theory development, strengthens research validity, and mitigates the research–practice gap (Busse, Kach, and Wagner 2017).

Decision Point 6: Establishing Construct Validity

The sixth decision point concerns questions of construct validity, a topic that has recently been discussed in studies of comparative public administration (Jilke, Meuleman, and Van de Walle 2014; Jilke et al. 2017). In cross-cultural studies, bias has resulted from systemic differences in the measurement instruments (Poortinga 1989). More specifically, the measures used "do not correspond to cross-cultural differences in the construct purportedly measured by the instrument" (Matsumoto and Van de Vijver 2011, 18). At a basic level, translation from one language into another already comes with subtle differences in meaning (Harzing, Reiche, and Pudelko 2013), implying that replication work that crosses countries and languages requires careful assessment of cross-cultural and cross-language differences, in combination with equally careful translation and interpretation procedures.

Construct validity is important in any experimental replication, and particularly so when replications extend the original study to different populations. Researchers must ensure that the conceptual construct being measured is consistent in meaning across different settings and that it can be "mapped onto a measurement scale in the same way" (Jilke et al. 2017, 1295). This means that a concept and its interpretation by respondents on a scale are equivalent when the respondents hold the same views and variation is related only to context. In generalization and extension replications, researchers can make decisions about the suitability of the measures and explicitly build into their research design measures that are appropriate for the context of the new study. If decisions are not made about construct validity in the replicated study, it may suffer from construct, method, or item bias (Jilke, Meuleman, and Van de Walle 2014). If the measures are not equivalent, the findings can suffer from construct bias (dissimilarity of latent concepts across countries), method bias (all types of biases that come from the methodological and procedural aspects of a survey), and item bias (different people understanding or interpreting the same survey item differently).

Construct validity can be challenging. The replication of Kaufmann and Feeney's (2014) study (in Walker, Lee, and James 2017) illustrates some aspects of these challenges. Kaufmann and Feeney (2014) examined red tape and outcome favorability. The term "red tape" does not readily convey the same meaning in English as it does in Chinese. The Chinese translation of "red tape" is "繁文縟禮 (Fan-Wen-Ru-Jie)." The meaning is similar in both languages, but the origins differ substantially. The roots of Fan-Wen-Ru-Jie lie in the Chinese history of Confucianism, and different schools of Confucian thought favor different interpretations. Furthermore, the direct back translation of Fan-Wen-Ru-Jie is "complicated wordings (writings or documents) and burdensome etiquette," even though its definition is accepted as meaning "red tape." To partially offset this potential bias, the replication of Kaufmann and Feeney (2014) was conducted in the English language. In short, when replicating an original study, the replication cannot simply copy or directly translate the original questionnaire because there may be cultural effects such as item inappropriateness or differential response styles (Matsumoto and Van de Vijver 2011).

Taking questions of construct validity one step further, Jilke et al. (2017) suggested that measurement equivalence should be checked during replication, particularly when populations, measurement, or analysis change (i.e., in conceptual generalization, empirical generalization, and generalization and extension). Jilke et al. (2017) offered detailed guidance on testing various levels of measurement equivalence using multiple group confirmatory factor analysis. Those attempting replications are encouraged to consult this source.
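To make the logic of such an equivalence check concrete, the sketch below fits the same one-factor measurement model separately in two samples, which corresponds to the configural step of the procedure. It is a minimal illustration only: it assumes the Python semopy package, hypothetical item names (rt1–rt3) standing in for red tape indicators, and simulated data rather than any actual survey; the full metric and scalar invariance tests with cross-group equality constraints should follow Jilke et al. (2017).

```python
# Minimal sketch of a configural measurement-equivalence check across two
# samples. Assumes the semopy package; items rt1-rt3 and the data are
# hypothetical stand-ins, not materials from any study discussed here.
import numpy as np
import pandas as pd
from semopy import Model

rng = np.random.default_rng(2017)

def simulate_items(n, loadings=(0.8, 0.7, 0.6)):
    """Simulate three indicators of one latent 'red tape' construct."""
    factor = rng.normal(size=n)
    return pd.DataFrame({
        f"rt{i + 1}": lam * factor + rng.normal(scale=0.5, size=n)
        for i, lam in enumerate(loadings)
    })

# One-factor measurement model in lavaan-style syntax
DESCRIPTION = "red_tape =~ rt1 + rt2 + rt3"

def fit_group(data):
    """Fit the same CFA in one group and return its parameter estimates."""
    model = Model(DESCRIPTION)
    model.fit(data)
    return model.inspect()

original_sample = fit_group(simulate_items(500))
replication_sample = fit_group(simulate_items(500, loadings=(0.8, 0.7, 0.3)))

# If the same model fits in both groups, configural equivalence is
# plausible; a markedly weaker loading in one group (rt3 here) would
# flag potential item bias before any substantive comparison is made.
print(original_sample)
print(replication_sample)
```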
Finally, questions of construct validity are bound up with decision point 1, regarding whether the original study can be replicated, and decision point 7, concerning the interpretation of the replication's results, to which we now turn.

Reporting Results

Decision Point 7: Deciding How to Compare the Findings

Determining how the findings should be compared depends on the number of replications, their power, and the nature and extent of their heterogeneity (see the above discussions of boundary conditions and confined studies). Working with sharp rule-of-thumb threshold differences (e.g., if the p value of the replication is above .05, or the new effect size is half the original, the replication has failed) is unwise (Cumming 2014). Indeed, we advise against referring to replications as "successes" or "failures" because these judgments can be highly subjective. As a collateral benefit, removing the success–failure classification from our replication terminology makes replications less threatening for the original authors and hence increases the likelihood that replication work will be undertaken. Describing the results of replications in more objective terms is also more consistent with the ideals of social science. In any replication, the degree of replicability should be carefully discussed using a series of benchmarks (e.g., power-corrected p values, full confidence intervals, and effect sizes). For an example of such careful reflection, we refer to Bouwmeester et al. (2017), while recalling that care and attention are needed to interpret the statistical evidence (Bordacconi and Larsen 2014). This methodology is appropriate when only a limited number of replications, perhaps one or two, are being analyzed.

As the number of conducted replications increases, scholars can compare findings using meta-analytic techniques. Fabrigar and Wegener (2016) illustrate how meta-analytic calculations can be used on a small set of experiments to show how additional replications of the original study change meta-analytic indices. In their worked example, Fabrigar and Wegener (2016, 75) show that when meta-analytic calculations are applied to "fragile" replication results (i.e., ones that fail by threshold differences), effect sizes can be detected across the population of studies. As the scale of the results being compared increases, for example, in a many-labs study that involves scholars from different research labs, more traditional meta-analytic techniques can be implemented, focusing on the overall effect size and the variation therein.
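The arithmetic behind such small-scale comparisons is easy to sketch. The fragment below is a minimal fixed-effect (inverse-variance) meta-analysis with invented effect sizes and standard errors, not figures from any study cited here; it shows how two replications that each miss the conventional .05 threshold can still yield a clearly detectable pooled effect.

```python
# Minimal fixed-effect (inverse-variance) meta-analysis of an original
# study plus two replications. All effect sizes and standard errors are
# invented for illustration; they are not taken from any actual study.
import numpy as np
from scipy import stats

effects = np.array([0.45, 0.20, 0.26])  # original study, then replications
ses = np.array([0.15, 0.12, 0.14])      # standard errors

# Study-by-study tests: the two replications miss the .05 threshold
# even though the original study does not
for d, se in zip(effects, ses):
    p = 2 * stats.norm.sf(abs(d / se))
    print(f"d = {d:.2f}, p = {p:.3f}")

# Pooling weights each estimate by its precision (1 / SE^2)
weights = 1.0 / ses**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
p_pooled = 2 * stats.norm.sf(abs(pooled / pooled_se))

print(f"pooled d = {pooled:.2f} "
      f"[{pooled - 1.96 * pooled_se:.2f}, {pooled + 1.96 * pooled_se:.2f}], "
      f"p = {p_pooled:.4f}")
```

A random-effects model, which additionally models between-study heterogeneity, would be the natural extension once replications from more diverse contexts accumulate.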
A Worked Example: Van Ryzin (2013)

In this section, we walk through the decision points to replicate Van Ryzin's (2013) experimental test of the expectation disconfirmation theory (EDT) of citizen satisfaction. In Van Ryzin's (2013) online experimental vignette methodology, subjects received either low or high expectations delivered by the mayor of an imaginary US local government. The subjects were then asked to view either low- or high-performance street cleanliness photographs. The study confirmed the central hypothesis in EDT that "citizens judge public services not only on experienced service quality but also on an implicit comparison of service quality with prior expectations" (597). EDT was selected for replication because it could better explain citizen satisfaction in a variety of locations and thus had the potential to become a valuable practical framework for governments. Four replications were conducted. The first was an empirical generalization that sought to use, to the extent possible, the same measures and analysis as the original study among a population of Hong Kong citizens. Further replications extended the original study to explore its boundary conditions.

Planning the Replication

The design of the original experiment and information on the treatments and processes were clearly reported in the article (Van Ryzin 2013, 602–6). In the replication, the decision was made to include the original study's author, who became a co-investigator on the grant funding. This provided access to all of the research materials needed for design and analysis, and to tacit knowledge on the implementation of the original study; for example, the pretest questions discussed in the article were not published (Van Ryzin 2013, 603). Decision point 1 was passed.

Decision point 2 requires an examination of the original study's internal validity. EDT is typically studied using observational research methods, such as cross-sectional surveys, that raise questions of causality (ambiguous temporal precedence), rely on subjective measures of expectations and performance, and are prone to endogeneity. Van Ryzin (2013) applied an experimental research design to address these validity concerns. Careful reading of the article and discussion with the original author suggested that the threats to internal validity were low: subjects were drawn from a nationally representative sample and randomly assigned to relatively simple treatment conditions, and issues of maturation and attrition did not arise. Decision point 2 was thus satisfied.

Decision point 3 focuses on power and sample size. The sample size in the original study was 964 respondents, randomized across four arms of the study (arm 1 = 251, arm 2 = 257, arm 3 = 226, and arm 4 = 230). The original study did not discuss power analysis in relation to sample size. Analysis of the sample sizes reported (Van Ryzin 2013, 602) showed power at 0.95 [for the effect sizes δ set equal to Van Ryzin's (2013) findings and the critical value α = 0.05]. The original study used path analyses, so the minimum sample size required for power 0.95 was 76, based on MANOVA

for repeated measures between factors. Power analysis indicated that the target response rate for the replication should be 600 subjects. However, we recruited 1,000 subjects because the replication included four experiments run over a 6-month period, and attrition was anticipated. The replication was a convenience sample, and no claims were made that it was representative of the Hong Kong population.
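For readers who wish to reproduce this kind of calculation, a sketch follows using the power routines in the Python statsmodels library. The four-arm structure and α = 0.05 follow the discussion above, but the Cohen's f of 0.25 is an illustrative placeholder, not the effect size actually computed from Van Ryzin's (2013) findings.

```python
# Sketch of an a priori power calculation for a four-arm experiment,
# using statsmodels. The effect size f = 0.25 (a conventional 'medium'
# effect) is an illustrative placeholder, not the value estimated from
# the original study.
from statsmodels.stats.power import FTestAnovaPower

solver = FTestAnovaPower()

# Total N needed for power = 0.95 at alpha = .05 across k = 4 arms
n_required = solver.solve_power(effect_size=0.25, k_groups=4,
                                alpha=0.05, power=0.95)
print(f"required total N: {n_required:.0f}")

# Power achieved by the original study's realized total N of 964
achieved = solver.solve_power(effect_size=0.25, k_groups=4,
                              alpha=0.05, nobs=964)
print(f"power with N = 964: {achieved:.3f}")
```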
Implementing the Replication

To date, all experimental tests of EDT and citizen satisfaction have been based in Western liberal democracies (e.g., Grimmelikhuijsen and Porumbescu 2017). Among these tests, a number have been replications [see Filtenborg, Gaardboe, and Sigsgaard-Rasmussen (2017) on Danish citizens; Grimmelikhuijsen and Porumbescu (2017) on US MTurk subjects]. Hong Kong was selected as a critical test case (decision point 4) because it is an Asian municipality that relies on observational research methods to study citizen satisfaction. The replications of EDT and citizen satisfaction supported the theorized causal relationships (Filtenborg, Gaardboe, and Sigsgaard-Rasmussen 2017; Grimmelikhuijsen and Porumbescu 2017). These confirmatory replications of EDT on citizen satisfaction, together with widespread evidence supporting the theory in studies of goods and products (Oliver 2010), suggested a method of agreement (that the original study would be replicated). We were further motivated to conduct the replication because, if claims of universality could be upheld, EDT offered the prospect of a new and more robust approach to measuring citizen satisfaction in the city.

The contexts in Hong Kong and the United States are very different, with large political, environmental, and internal variations. Hong Kong is governed by the Chinese principle of "one country, two systems" and reports directly to Beijing. It is, therefore, a more unitary system than the United States; environmentally, social capital is lower in Hong Kong, and the government is very centralized and hierarchical [see Scott (2010) for a thorough review of the Hong Kong context]. However, EDT has been successfully applied to information technology products in Hong Kong (Thong, Hong, and Tam 2006). Given this, we adopted an inside-out approach to the question of boundary conditions raised in decision point 5. We expected that the theory would hold in Hong Kong, despite its very different economic system. To test this expectation and confirm our decisions on the method of agreement/inside-out replication, the first stage was to conduct an empirical generalization. For this, we sought to retain the original measurement and analysis as much as possible. However, the original vignettes were developed in the American context and were based around a mayor's announcement to a medium-sized city. In 2017, Hong Kong was a city of 7.36 million people. Therefore, to make the replication as realistic as possible for Hong Kong respondents (see the discussion of construct validity below), we changed the treatment from a citywide level to a Hong Kong district. For the generalization and extension replications, the expectation and performance treatments were further changed to capture the Hong Kong context. These replications examined street cleanliness, air quality, and secondary education because these policy issues were relevant in the local context.

EDT is a well-established framework in the management and marketing literature and, as noted above, has experienced increased popularity in public administration studies. The measures and manipulations exhibit good face validity. The topic in Van Ryzin's EDT study was a technical and unambiguous public service: street cleanliness. The main concept's construct validity (decision point 6) was therefore not overly problematic. For measurement purposes, we used the same wording for the expectation and performance manipulations and variable measurements as the original study. The performance manipulation in the original study included two pictures of a New York street showing different degrees of cleanliness. In the replication, a picture was taken of a Hong Kong street. One photograph (high performance) showed the original clean street; the other was manipulated using Adobe Photoshop to include litter (low performance). This was done to limit variation in the photographs to only litter (i.e., to sharpen the treatment effect) and to enhance validity for the Hong Kong subjects. To avoid construct validity problems associated with translating English into traditional Chinese, all materials were prepared in English, which is widely spoken. Following these decisions and after finalizing the research design, the replication was implemented from February to August 2017.

Reporting the Results of the Replication

As noted above, comparing an original study with a replication can be challenging because of differences in context. However, for Van Ryzin's (2013) replication, a universal claim was advanced, implying that the results from the original study and the replication would be comparable. We analyzed the replications by implementing the same path analysis as the original study. The results showed the same pattern and direction in the replication as in the original study, and comparable levels of statistical significance: expectations ⇒ satisfaction (positive), expectations ⇒ disconfirmation (negative), performance ⇒ disconfirmation (positive), performance ⇒ satisfaction (positive), and disconfirmation ⇒ satisfaction (positive).
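As a simplified stand-in for that path analysis, the EDT model can be estimated as two linked regressions, as sketched below. The data are simulated to mimic the signs just reported, and the variable names are hypothetical rather than those of the replication dataset; the original analysis estimated the paths simultaneously in a path model rather than equation by equation.

```python
# Simplified EDT path model estimated as two linked OLS regressions:
#   disconfirmation ~ expectations + performance
#   satisfaction    ~ expectations + performance + disconfirmation
# Simulated data mimic the signs reported above; variable names are
# hypothetical, not those of the replication dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2013)
n = 1000

expectations = rng.integers(0, 2, size=n)  # 0 = low, 1 = high treatment
performance = rng.integers(0, 2, size=n)
disconfirmation = (-0.4 * expectations + 0.9 * performance
                   + rng.normal(scale=0.5, size=n))
satisfaction = (0.2 * expectations + 0.6 * performance
                + 0.5 * disconfirmation
                + rng.normal(scale=0.5, size=n))

df = pd.DataFrame({
    "expectations": expectations,
    "performance": performance,
    "disconfirmation": disconfirmation,
    "satisfaction": satisfaction,
})

for formula in ("disconfirmation ~ expectations + performance",
                "satisfaction ~ expectations + performance + disconfirmation"):
    fit = smf.ols(formula, data=df).fit()
    print(fit.params.round(2), "\n")
```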
Given that several replications were conducted, it was possible to make further comparisons between the original study, our study, and other replications (Filtenborg, Gaardboe, and Sigsgaard-Rasmussen 2017; Grimmelikhuijsen and Porumbescu 2017).

Table 3. Comparison of Expectations and Performance Manipulations on Levels of Satisfaction in the Original Study and Replications

Condition                              Expected results   United States(a)   Denmark(b)   United States   Hong Kong
Low expectations, low performance      Low                3.03               4.1          3.30            3.24
Low expectations, high performance     High               3.97               4.8          4.57            5.10
High expectations, low performance     Low                2.81               4.4          3.75            3.35
High expectations, high performance    High               4.11               5.4          4.92            5.11

Note: Mean scores, 1 = very dissatisfied to 7 = very satisfied; unequal subscripts indicate a statistically significant difference within pairs (low expectations, high expectations).
(a) Van Ryzin (2013).
(b) Filtenborg, Gaardboe, and Sigsgaard-Rasmussen (2017). One decimal place reported.
Table 3 shows the expected relationship between the expectation and performance manipulations and the mean satisfaction scores in each study. The mean scores consistently show that performance has a strong impact on satisfaction: when performance is high, satisfaction is high whether expectations are set low or high, and when performance is low, satisfaction is low, again whether expectations are set low or high. This brief preview of the results from the several replications of Van Ryzin (2013) suggests that the direction of the findings was comparable and that EDT was validated on three occasions, providing emerging evidence of the theory's universality.
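Because the cell means in Table 3 are published, this comparison can be reproduced directly. The snippet below computes the performance effect (high minus low performance within each expectations condition) for each study; note that the attribution of the second United States column to Grimmelikhuijsen and Porumbescu (2017) is our inference from the surrounding discussion, as the table itself does not label it.

```python
# Cross-study comparison built from the published cell means in Table 3.
# The second US column is attributed to Grimmelikhuijsen and Porumbescu
# (2017) based on the text, as the table does not label it explicitly.
import pandas as pd

means = pd.DataFrame(
    {
        "US (Van Ryzin 2013)": [3.03, 3.97, 2.81, 4.11],
        "Denmark (Filtenborg et al. 2017)": [4.1, 4.8, 4.4, 5.4],
        "US (Grimmelikhuijsen and Porumbescu 2017)": [3.30, 4.57, 3.75, 4.92],
        "Hong Kong (replication)": [3.24, 5.10, 3.35, 5.11],
    },
    index=["low_exp_low_perf", "low_exp_high_perf",
           "high_exp_low_perf", "high_exp_high_perf"],
)

# Performance effect: high minus low performance, within each
# expectations condition; a consistently positive gap across all four
# studies is what the text describes as the dominant performance effect.
performance_effect = pd.DataFrame({
    "low expectations": (means.loc["low_exp_high_perf"]
                         - means.loc["low_exp_low_perf"]),
    "high expectations": (means.loc["high_exp_high_perf"]
                          - means.loc["high_exp_low_perf"]),
})
print(performance_effect.round(2))
```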
Future Research On Replication

Replication needs to become a normal scientific practice in public administration to improve the quality of scholarship, accumulate knowledge, develop mid-range theories, distill lessons for practice, and identify robust effect sizes and boundary conditions. Replication failures are not necessarily failures of good scientific practice; rather, they should be seen as the scientific process in action.11 The time for action is ripe. If public administration is serious about science, validity, and generalization, scholars should view replications more positively and take up this common framework. Practically, this suggests designing and undertaking replications for knowledge building rather than for instrumental purposes, such as falsifying a study result that, itself, has little impact on theory and practice.

11 An anonymous reviewer suggested this line of argument.

The scholarly community needs to expand the meaning of what we refer to as replication, which could have a more immediate and even greater cumulative effect on our field. Replications are not limited to cloning previous studies or re-running controlled laboratory or field experiments. The findings derived from other research designs, like case studies and survey research, should also be subject to rigorous replication and verification. The same best practices we have suggested, or similar ones, can be used to more efficiently corroborate knowledge derived from non-experimental and quasi-experimental research designs (Fabrigar and Wegener 2016). Researchers should scrutinize these findings and test their boundaries in other samples and settings with different model specifications and research techniques, such as comprehensive literature reviews, meta-analysis, and longitudinal panel studies. Such thorough, persistent efforts to vet important research findings can significantly increase confidence in public administration research and grow the knowledge base in our field. Moreover, rigorous research on boundary conditions will pay big dividends for public administration practice as researchers and practitioners advance the design science ambitions of our field.

Advancing a common replication framework can be achieved through collective action. Replications are an essential part of the scientific method that serves the common interest of the scientific community and beyond. Existing reward structures need to be modified to incentivize replication (Everett and Earp 2015; van Witteloostuijn 2016). Currently, the field does not provide enough incentives for individual researchers to carry out and publish replication studies on a large enough scale. Nevertheless, public administration researchers do know something about solving this "tragedy of the commons." Collective action problems can be overcome by group interventions (e.g., by altering the incentive structure and spreading the costs) or through individual initiatives (as when some individuals subordinate their self-interest and act for the common good). Both courses of action will help generate more replication studies in public administration.

From the experimental research design perspective taken in this article, many changes have taken

place in the field since 1992, when Bozeman and Scott argued that experimental methods are not necessarily suited to studying public organizations and face ethical and logistical barriers (Bozeman and Scott 1992). The research agenda in public administration now embraces questions well suited for experimental methods, particularly questions of behavior (James, Jilke, and Van Ryzin 2017), institutional design (Meier and Smith 1994), and public service performance (Walker, Boyne, and Brewer 2010). Concerns over ethics and deception have been reduced, and logistical barriers have been lowered as research expertise in the field has grown and more resources have been dedicated to experimental studies (James, Jilke, and Van Ryzin 2017).

As resources for experiments grow, more experimental laboratories and research centers will become operational, creating opportunities for more extensive replication agendas. When these laboratories are linked and working on common agendas, progress will speed up and become more efficient. Multisite or "many-lab" public administration studies are now feasible (for instance, see Klein et al. 2014; Open Science Collaboration 2015), and we could also amass more replications by making a replication study a mandatory component of PhD programs. Multiple replications can subsequently be used to accumulate estimates that can be used in meta-analyses. Technological advances facilitate this: researchers now have access to larger, more diverse groups of subjects (e.g., through MTurk or eLancing); survey technologies permit simultaneous implementation of experiments across time and space (e.g., Qualtrics); and datasets and protocols can be easily shared and stored. Indeed, many-lab studies present an ideal way to explore boundary conditions, and they provide a real opportunity to advance the theory and practice of public administration.

Neither institutions nor individuals can be expected to subordinate their personal interest for the greater good, even though some do. Granting organizations (e.g., the Netherlands Organisation for Scientific Research) and journals (e.g., the Journal of Behavioral Public Administration)12 can incentivize replications. Researchers can choose to take up the replication agenda and might earmark a percentage of their publications for replications in their research area [LeBel (2015) recommends a ratio of 4:1]. Yet over the long term, these contributions may be insufficient to mainstream replication in public administration. This means that public administration replicators must become more strategic and resourceful, as we have alluded to in this article. The field should strive for better replications in the same way that we aim for better theory, research methods, and utilization. The field should trumpet the need for replications, try to refine and improve the process, and incentivize high-impact work in the same way we pursue other normatively important principles. The implementation of a common replication framework would place public administration at the forefront of the applied social sciences, according it a leading role in the development of solutions to complex, human-related, real-world problems that involve value judgments. Eventually, replication can become a virtuous cycle of theory testing, refinement, and application.

12 The Journal of Behavioral Public Administration is explicitly open to publishing replications of experimental work, including null findings; see www.journal-bpa.org.

Conclusions

Replication offers public administration scholars the opportunity to address some substantive questions and advance the theory and practice of public administration, in keeping with Simon's (1996) notion that public administration is a design science that devises courses of action to change existing situations into preferred ones. This major contribution is centered around testing the boundary conditions of current theory and practice using conceptual, generalization, and extension replications. In this article, we outline a set of best practices for public administration researchers who are planning to conduct replication studies. These practices have been gleaned from our experience in one of the first replication laboratories in public administration, peppered with advice from other disciplines. This protocol should help researchers avoid common problems, make wise decisions, and carry out better replications. The important elements include targeting the most impactful research findings for replication, shoring up internal validity before moving on to external validity, and identifying the most important boundary conditions to explore. Many useful practices have been suggested. We hope that this article will trigger debate on replication and that our colleagues will take up this agenda and help advance the scientific ambitions of the field, develop better mid-range theories, and glean useful lessons for practice.

Funding

This work was supported by: University Grants Committee, Research Grants Council (project no. CityU 11611516); Public Policy Research Funding Scheme from the Central Policy Unit of the Government of the Hong Kong Special Administrative Region (project no. 2015.A1.031.16A); City University of Hong Kong Department of Public Policy Conference grant and College of Liberal Arts and Social Sciences Capacity

Building grant; National Research Foundation of Korea grant funded by the Korean Government (NRF-2017S1A3A2067636).

References

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly harmless econometrics: An empiricist's companion. Princeton, NJ: Princeton Univ. Press.
Bacchetti, Peter, Leslie E. Wolf, Mark R. Segal, and Charles E. McCulloch. 2005. Ethics and sample size. American Journal of Epidemiology 161:105–10.
Berthon, Pierre, Leyland Pitt, Michael Ewing, and Christopher L. Carr. 2002. Potential research space in MIS: A framework for envisioning and evaluating research replication, extension, and generation. Information Systems Research 13:416–27.
Bettis, Richard A., Constance E. Helfat, and Myles J. Shaver. 2016. The necessity, logic and forms of replication. Strategic Management Journal 37:2193–203.
Blom-Hansen, Jens, Rebecca Morton, and Søren Serritzlew. 2015. Experiments in public management research. International Public Management Journal 18:151–70.
Bordacconi, Mats Joe, and Martin Vinæs Larsen. 2014. Regression to causality: Regression-style presentation influences causal attribution. Research and Politics 1:1–6.
Bouwmeester, Samantha, Peter P. J. L. Verkoeijen, Balazs Aczel, Fernando Barbosa, Laurent Bègue, Pablo Brañas-Garza, Thorsten G. H. Chmura, G. Cornelissen, F. S. Døssing, A. M. Espín, et al. 2017. Registered replication report: Rand, Greene, and Nowak (2012). Perspectives on Psychological Science 12:527–42.
Bozeman, Barry, and Patrick Scott. 1992. Laboratory experiments in public policy and management. Journal of Public Administration Research and Theory 2:293–313.
Brandt, Mark J., Hans Ijzerman, Ap Dijksterhuis, Frank J. Farach, Jason Geller, Roger Giner-Sorolla, James A. Grange, Marco Perugini, Jeffrey R. Spies, and Anna Van't Veer. 2014. The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology 50:217–24.
Busse, Christian, Andrew P. Kach, and Stephan M. Wagner. 2017. Boundary conditions: Why we need them, and when to consider them. Organizational Research Methods 20:574–609.
Camerer, Colin F., Anna Dreber, Eskil Forsell, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Johan Almenberg, Adam Altmejd, Taizan Chan, et al. 2016. Evaluating replicability of laboratory experiments in economics. Science 351:1433–6.
Campbell, Donald T., and Julian C. Stanley. 1963. Experimental and quasi-experimental designs for research. Boston, MA: Houghton Mifflin.
Cook, Thomas D., and Donald T. Campbell. 1979. Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin.
Cumming, Geoff. 2014. The new statistics: Why and how. Psychological Science 25:7–29.
Dewald, William G., Jerry G. Thursby, and R. G. Anderson. 1986. Replication in empirical economics: The Journal of Money, Credit, and Banking project. The American Economic Review 76:587–603.
Everett, Jim A. C., and Brian D. Earp. 2015. A tragedy of the (academic) commons: Interpreting the replication crisis in psychology as a social dilemma for early-career researchers. Frontiers in Psychology 6:1152.
Fabrigar, Leandre R., and Duane T. Wegener. 2016. Conceptualizing and evaluating the replication of research results. Journal of Experimental Social Psychology 66:68–80.
Filtenborg, Andres Foged, Frederik Gaardboe, and Jesper Sigsgaard-Rasmussen. 2017. An experimental test of the expectancy disconfirmation theory of citizen satisfaction. Public Management Review 19:1235–50.
Francis, Gregory. 2012. The psychology of replication and the replication of psychology. Perspectives on Psychological Science 7:585–94.
Freese, Jeremy. 2007. Replication standards for quantitative social science: Why not sociology? Sociological Methods & Research 36:153–72.
Freese, Jeremy, and David Peterson. 2017. Replication in social science. Annual Review of Sociology 43:147–65.
George, Bert, Sebastian Desmidt, Poul A. Nielsen, and Martin Baekgaard. 2017. Rational planning and politicians' preferences for spending and reform: Replication and extension of a survey experiment. Public Management Review 19:1251–71.
Gerber, Alan S., and Donald P. Green. 2012. Field experiments: Design, analysis, and interpretation. New York/London: Norton.
Grimmelikhuijsen, Stephan, and Gregory A. Porumbescu. 2017. Reconsidering the expectancy-disconfirmation model: Three experimental replications. Public Management Review 19:1272–92.
Harzing, Anne-Wil, Sebastian Reiche, and Markus Pudelko. 2013. Challenges in international survey research: A review with illustrations and suggested solutions for best practice. European Journal of International Management 7:112–34.
Hubbard, Raymond, Daniel E. Vetter, and Eldon L. Little. 1998. Replication in strategic management: Scientific testing for validity, generalizability and usefulness. Strategic Management Journal 19:243–54.
Ioannidis, John P. A. 2005. Why most published research findings are false. PLoS Medicine 2:696–701.
James, Oliver, Sebastian R. Jilke, and Gregg G. Van Ryzin. 2017. Experiments in public management research: Challenges and contributions. Cambridge: Cambridge Univ. Press.
Jilke, Sebastian, Bart Meuleman, and Steven Van de Walle. 2014. We need to compare, but how? Measurement equivalence in comparative public administration. Public Administration Review 75:36–48.
Jilke, Sebastian, Nicolai Petrovsky, Bart Meuleman, and Oliver James. 2017. Measurement equivalence in replications of experiments: When and why it matters and guidance on how to determine equivalence. Public Management Review 19:1293–310.
Kaufmann, Wesley, and Mary Feeney. 2014. Beyond the rules: The effect of outcome favourability on red tape perceptions. Public Administration 92:178–91.
Kerr, Gayle, Don E. Shultz, and Ian Lings. 2016. "Someone should do something": Replication and an agenda for collective action. Journal of Advertising 45:4–12.
Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams, Jr., Stephan Bahník, Michael L. Bernstein, Brian A. Nosek, et al. 2014. Investigating variation in replicability: A "many labs" replication project. Social Psychology 45:142–52.
Knott, Jack H., Gary J. Miller, and Jay Verkuilen. 2003. Adaptive incrementalism and complexity: Experiments with two-person cooperative signaling games. Journal of Public Administration Research and Theory 13:341–66.
Kuhn, Thomas. 1996. The structure of scientific revolutions. Chicago, IL: Univ. of Chicago Press.
LeBel, Etienne P. 2015. A new replication norm for psychology. Collabra 1:1–13.
Lee, M. Jin, M. Jae Moon, and Jungsook Kim. 2017. Insights from experiments with duopoly games: Rational incremental decision-making. Public Management Review 19:1328–51.
Lindsay, D. Stephen. 2015. Replication in psychological science. Psychological Science 26:1827–32.
Makel, Matthew C., and Jonathan A. Plucker. 2014. Facts are more important than novelty: Replication in the education sciences. Educational Researcher 43:304–16.
Makel, Matthew C., Jonathan A. Plucker, and Boyd Hegarty. 2012. Replications in psychology research: How often do they really occur? Perspectives on Psychological Science 7:537–42.
Matsumoto, David, and Fons J. Van de Vijver. 2011. Cross-cultural research methods in psychology. New York, NY: Cambridge Univ. Press.

Meier, Kenneth J., Amanda Rutherford, and Claudia Avellaneda, eds. 2017. Comparative public management: Why national, environmental, and organizational context matters. Washington, DC: Georgetown Univ. Press.
Meier, Kenneth J., and Kevin B. Smith. 1994. Say it ain't so Moe: Institutional design, policy effectiveness and drug policy. Journal of Public Administration Research and Theory 4:429–42.
Mele, Valentina, and Paolo Belardinelli. Forthcoming. Mixed methods in public administration research: Selecting, sequencing and connecting. Journal of Public Administration Research and Theory.
Meyer, Klaus, Arjen van Witteloostuijn, and Sjoerd Beugelsdijk. 2017. What's in a p? Reassessing best practices for reporting hypothesis-testing research. Journal of International Business Studies 48:535–51.
Mill, John S. 1843. A system of logic. London: Longman.
Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. A manifesto for reproducible science. Nature Human Behaviour 1:0021.
Nosek, B. A., G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, C. D. Chambers, G. Chin, G. Christensen, et al. 2015. Promoting an open research culture. Science 348:1422–5.
Nosek, Brian A., and Daniel Lakens. 2014. Registered reports: A method to increase the credibility of published results. Social Psychology 45:137–41.
Nowell, Branda, and Kate Albrecht. Forthcoming. A reviewer's guide to qualitative rigor. Journal of Public Administration Research and Theory.
Oliver, Richard L. 2010. Satisfaction: A behavioral perspective on the consumer, 2nd ed. Armonk, NY: M. E. Sharpe.
Olsen, Asmus Leth. 2017. Responding to problems: Actions are rewarded independently of the outcome. Public Management Review 19:1352–64.
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:aac4716.
Orr, Larry L. 1999. Social experiments: Evaluating public programs with experimental methods. Thousand Oaks, CA: Sage.
O'Toole, Laurence J., Jr., and Kenneth J. Meier. 2015. Public management, context and performance: In quest of a more general theory. Journal of Public Administration Research and Theory 25:237–56.
Pedersen, Mogens Jin, and Justin M. Stritch. 2018. RNICE model: Evaluating the contribution of replication studies in public administration and management research. Public Administration Review 78:606–12.
Pollitt, Christopher, ed. 2013. Context in public policy and management: The missing link? Cheltenham, UK: Edward Elgar.
Poortinga, Ype H. 1989. Equivalence of cross-cultural data: An overview of basic issues. International Journal of Psychology 24:737–56.
Riccucci, Norma, Gregg G. Van Ryzin, and Huafang Li. 2016. Representative bureaucracy and the willingness to co-produce: An experimental study. Public Administration Review 76:121–30.
Schmidt, Stefan. 2009. Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology 13:90–100.
Scott, Ian. 2010. The public sector in Hong Kong: Government, policy, people. Hong Kong: Univ. of Hong Kong Press.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Simon, Herbert A. 1996. The sciences of the artificial, 3rd ed. Cambridge, MA: MIT Press.
Singh, Kulwant, Siah Hwee Ang, and Siew Meng Leong. 2003. Increasing replication for knowledge accumulation in strategy research. Journal of Management 29:533–49.
Thong, James Y. L., Se-Joon Hong, and Kar Yan Tam. 2006. The effects of post-adoption beliefs on the expectation-confirmation model for information technology continuance. International Journal of Human-Computer Studies 64:799–810.
Tsang, Eric W. K., and Kai-Man Kwan. 1999. Replication and theory development in organizational science: A critical realist perspective. Academy of Management Review 24:759–80.
Uncles, Mark D., and Simon Kwok. 2013. Designing research with in-built differentiated replication. Journal of Business Research 66:1398–405.
Van Ryzin, Gregg G. 2013. An experimental test of the expectancy-disconfirmation model of citizen satisfaction. Journal of Policy Analysis and Management 32:597–614.
Van Ryzin, Gregg G., Norma Riccucci, and Huafang Li. 2017. Representative bureaucracy and its symbolic effect on citizens: A conceptual replication. Public Management Review 19:1365–80.
Vazire, Simine. 2016. Editorial. Social Psychological and Personality Science 7:3–7.
Walker, Richard M., George A. Boyne, and Gene A. Brewer. 2010. Public management and performance: Research directions. Cambridge: Cambridge Univ. Press.
Walker, Richard M., Gene A. Brewer, and Oliver James, guest eds. 2017. Special issue: Replication, experiments and knowledge in public management research. Public Management Review 19:1221–379.
Walker, Richard M., Myong Jin Lee, and Oliver James. 2017. Replications of experimental research: Implications for the study of public management. In Experiments in public management research: Challenges and contributions, eds. O. James, S. Jilke, and G. G. Van Ryzin. Cambridge: Cambridge Univ. Press.
Weibel, Antoinette, Katja Rost, and Margit Osterloh. 2010. Pay for performance in the public sector—Benefits and (hidden) costs. Journal of Public Administration Research and Theory 20:387–412.
van Witteloostuijn, Arjen. 2016. What happened to Popperian falsification? Publishing neutral and negative findings: Moving away from biased publication practices. Cross Cultural & Strategic Management 23:481–508.
