
Single-Case Research Designs
Methods for Clinical and Applied Settings
SECOND EDITION

Alan E. Kazdin
Yale University

New York Oxford


OXFORD UNIVERSITY PRESS
2011

Oxford University Press, Inc., publishes works that further Oxford University's objective of excellence in research, scholarship, and education.

Oxford New York


Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto

With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam

Copyright © 2011, 1982 by Oxford University Press, Inc.

Published by Oxford University Press, Inc.


198 Madison Avenue, New York, New York 10016
http://www.oup.com

Oxford is a registered trademark of Oxford University Press

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of Oxford University Press.

Library of Congress Cataloging-in-Publication Data


Kazdin, Alan E.
Single-case research designs : methods for clinical and applied settings / Alan E. Kazdin.—2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-19-534188-1
1. Single subject research. 2. Case method. 3. Psychology—Research. 4. Psychotherapy—Research.
5. Social sciences—Research. 6. Experimental design. 7. Single subject research. I. Title.
BF76.6.S56.K39 2011
616.890072'4—dc22 2009046998

Printed in the United States of America on acid-free paper
To Nan Taylor, PhD
CONTENTS

Preface vi
About the Author xi

BACKGROUND TO THE DESIGNS

1. INTRODUCTION: STUDY OF THE INDIVIDUAL IN CONTEXT 1
2. UNDERPINNINGS OF SCIENTIFIC RESEARCH 24

ASSESSMENT

3. BACKGROUND AND KEY MEASUREMENT CONSIDERATIONS 49
4. METHODS OF ASSESSMENT 73
5. ENSURING THE QUALITY OF MEASUREMENT 98

MAJOR DESIGN OPTIONS

6. INTRODUCTION TO SINGLE-CASE RESEARCH AND ABAB DESIGNS 121
7. MULTIPLE-BASELINE DESIGNS 144
8. CHANGING-CRITERION DESIGNS 167
9. MULTIPLE-TREATMENT DESIGNS 192
10. ADDITIONAL DESIGN OPTIONS 227
11. QUASI-SINGLE-CASE EXPERIMENTAL DESIGNS 257

EVALUATING SINGLE-CASE DATA

12. DATA EVALUATION 284
13. GRAPHIC DISPLAY OF DATA FOR VISUAL INSPECTION 323

PERSPECTIVES AND CONTRIBUTIONS OF THE DESIGNS

14. EVALUATION OF SINGLE-CASE DESIGNS: CHALLENGES, LIMITATIONS, AND DIRECTIONS 352
15. SUMMING UP: SINGLE-CASE RESEARCH IN PERSPECTIVE 383

Appendix: Statistical Analyses for Single-Case Designs: Issues and Illustrations 401

References 421

Author Index 440

Subject Index 448


PREFACE

Single-case research—is that an oxymoron or an effort to legitimize an uncontrolled case study? Also, what can one learn from studying one person? And if one can learn anything, is it limited to information about just that one person? More generally, can one really do experiments—real science—with the single case? By the end of this book, I hope these and other questions are fully answered and that you view single-case and group research in a slightly different light. Both single-case and between-group research methods have goals, methods of achieving them, and strengths and limitations. They share the goals, but their different methods, strengths, and limitations underscore their complementary roles in uncovering knowledge. The purpose of this book is to provide a concise description of single-case methodology as well as to convey its contribution to science more generally.
Single-case research has played an important role in developing and evaluating interventions that are designed to alter some facet of human functioning. Many disciplines and professions provide interventions. Prominent among these are education, medicine, psychology, counseling, social work, occupational and physical rehabilitation, nursing, and others. Public and private agencies, through policies, legislation, advertising, and public appeals, too, are always intervening to foster health, safety, and welfare. Most interventions and programs are not evaluated systematically. When they are, the usual method is based on between-group designs. These are the very familiar designs with such characteristics as comprising groups that receive different conditions (e.g., treatment, placebo, or no treatment), carefully standardizing the interventions (e.g., in relation to duration or dose), testing the null hypothesis, and using statistical significance as a criterion to draw inferences about the effects of the intervention. The randomized controlled trial, which begins with randomly assigning participants to the different conditions of the study, is the ideal and is considered to be the "gold standard" for research. Between-group research has made and continues to make enormous contributions to a range of basic and applied questions. As with any methodological approach, it has its own limitations and sources of debate, but two of these in particular are related to this book.
First, evaluation is needed in many applied or everyday situations in which controlled between-group experiments are not feasible. Virtually every school, community, hospital, prison, college, and large business has one or more interventions designed to help people or change some facet of human functioning. Are these interventions effective or worth the cost or effort, and are they having the desired impact? Using random assignment, forming a comparison group, withholding the intervention, and securing a sample size sufficient to detect group differences (when such differences exist) preclude mounting a randomized controlled trial to test the impact of most interventions in most settings. Some interventions, such as individual psychotherapy or counseling, in principle cannot be subjected to a controlled trial because only one or a few individuals are being treated or counseled. In these and a myriad of other situations, programs go unevaluated because the gold standard of a randomized trial is not possible. There is no reason to go from a gold standard to no or shoddy evaluation. Single-case designs can evaluate interventions in ways that are more compatible with the demands of programs in applied contexts such as classrooms, hospitals, clinics, businesses, and the community. Like the randomized controlled trial, single-case designs are true experiments. That means they can lead to causal knowledge about the impact of the intervention.
Second, the unique facet of single-case designs is the capacity to evaluate interventions with one or a few individuals. We want the conclusions that are obtained from randomized trials about the effects of interventions and the relative effects of various interventions. The findings from group data are essential. In addition, we very much care, as scientists but also as regular citizens, about the impact of interventions on the individual. Results from between-group studies typically do not address the individual. We want to know if individuals change and to what extent. Single-case designs are remarkably well suited to evaluating interventions with the individual and to providing information that can be used to improve the effects if the intervention is not working or not working very well.
Single-case designs provide a range of options for researchers, whether or not between-group designs are feasible. The designs provide a novel set of research tools to answer critical questions, particularly in relation to the effects of interventions. By training, many of us have learned that if the research is not between groups there are inherent flaws. Two flaws, which might also be called myths, come to mind quickly to illustrate the point: (1) one cannot have a true experiment with just one case, and (2) even if it were possible, one cannot generalize any finding from the individual to others. These two myths and others like them are addressed because they raise important problems for both between-group and single-case research. This book is not at all about one research strategy versus another. Just the opposite. Multiple methodologies are needed because of their strengths and weaknesses and because of their complementary contributions to knowledge.
Beyond investigation of individual subjects, the single-case designs greatly expand the range of options for conducting research. The designs provide a methodological approach that is well suited to the investigation of groups. Indeed, "single-case" designs have been used to study interventions applied to hundreds and thousands of subjects all in a given study. Hence, even in cases where investigation of the individual subject is not of interest, the designs greatly expand the available methods for research.
Although single-case designs have enjoyed increasingly widespread use, the methodology is rarely taught in undergraduate, graduate, or postdoctoral training. Moreover, relatively few texts are available to elaborate the methodology. Consequently, the designs are not used as widely as they might be in situations that could greatly profit from their use. This book elaborates the methodology of single-case research and illustrates its use in many areas of application including education; school, clinical, and counseling psychology; medicine; speech and language; rehabilitation; and other areas.

A central goal of scientific research is to draw valid inferences, that is, conclusions that result from minimizing, ruling out, or making implausible any alternative explanations that would obscure interpretation of the findings. Single-case methods, as with between-group methods, encompass three broad topics: assessment, experimental design, and data evaluation. These are the components of research methodology and act in concert to draw valid inferences. The central goal of this book is to convey the logic of single-case designs and to illustrate precisely how the methodology achieves this purpose. Assessment methods, different experimental design options, and methods of data evaluation are covered in detail. Although not all options and design strategies are elaborated, certainly the main designs are illustrated. For each design, the goal is to convey the underlying rationale, the logic in relation to the goals of scientific research, and strengths and limitations.
Between-group designs often are pre-planned before the first participant enters the study. Such features as the design, characteristics of the sample to be included, how many subjects, and the duration of the intervention are salient pre-planned features to illustrate the point. This pre-planned feature is a strength because many issues that can interfere with drawing valid inferences (e.g., too heterogeneous a sample, low statistical power) can be anticipated and addressed. Aspects of single-case designs are pre-planned as well (e.g., which design, what the intervention focus will be); yet, in single-case designs, key decisions about the intervention and its impact are made during the study itself, in relation to both the intervention and the design. Single-case designs require a deep appreciation of what we are trying to accomplish in research in order to understand when and how to make changes while a study is underway. Such changes can greatly strengthen the quality of the inferences that are drawn about the intervention, as examples throughout the book will convey.
The first edition of this book was published almost 30 years ago (I was 2 years old). Needless to say, many changes have transpired in the ensuing years. First, application of single-case designs has expanded along multiple dimensions including the range of disciplines in which they are used, the problem domains that are studied, and the measures used to assess individual functioning. The revision of the book reflects expansion in the description of methods (e.g., assessment) as well as in the range of examples.
Second and related, single-case designs have proliferated in basic and applied behavioral research. An area in psychology known as applied behavior analysis focuses on interventions in many applied settings. The area has firmly established the utility of single-case designs. It is still the case that research in this area continues to use single-case designs. Yet, the methodology is applicable to a variety of areas of research well beyond the initial emphases and uses. The designs specify a range of conditions that need to be met; these conditions do not necessarily entail a commitment to a particular conceptual approach, procedures, or a particular discipline. For example, many practices that in the prior edition were viewed as essential or central to single-case designs (e.g., assessment of overt behavior) have expanded without sacrificing the strength of the designs or the quality of the inferences that can be drawn. Although some people may view such changes with ambivalence, if not heresy, this expansion is an enormous advance for science and for the quality of care of individuals who receive interventions. The book makes an effort to convey the expanded applications by clarifying the central and associated features of the designs and by illustrating the expanded options for assessment and evaluation.
Third, the book has been written to incorporate several recent developments within single-case experimental research. In the area of experimental design, new design options and combinations of designs are presented that expand the range of questions that can be asked about alternative interventions. In the area of data evaluation, the underlying rationale and methods of evaluating intervention effects through visual inspection are detailed. In addition, the use of statistical tests for single-case data, controversial issues raised by these tests, and alternative statistics are presented as they were in the first edition. Each of these areas has advanced remarkably. Research on visual inspection and statistical evaluation of the single case has expanded. Although data evaluation of the single case is a core chapter, an appendix is included at the end of the book to permit an expanded discussion of the challenges, advances, and dilemmas of data evaluation in the designs.
In addition to recent developments, several topics are covered in this book that are not widely discussed in currently available texts. The topics include the use of social validation techniques to evaluate the applied importance or significance of intervention effects, quasi-single-case experimental designs as techniques to draw scientific inferences, and experimental designs to study maintenance and generalization of the changes. Many of these have emerged from behavioral research but have broad application and have been adopted more widely in education, clinical psychology, and rehabilitation, for example. In addition, the limitations and special problems of single-case designs are elaborated. The book not only seeks to elaborate single-case designs but also to place the overall methodology into a larger context. Thus, the relation of single-case and between-group designs is also discussed.
There is a new timeliness to the topic of single-case designs. In the period since the first edition, there has been a proliferation of evidence-based interventions. These are interventions that have controlled studies in their behalf and where the effects have been clearly replicated. Many disciplines have delineated evidence-based interventions (e.g., education, medicine and its many branches, psychology, social work, nursing, dentistry, policy, and economics).
How could evidence-based interventions possibly relate to single-case designs? First, among the issues in evidence-based interventions are extending such interventions from the well-controlled settings in which they have been studied, invariably in between-group designs, to individuals and to applied settings (e.g., the classroom). Will the interventions work in the many different contexts to which they could be applied? Single-case designs provide multiple opportunities for such extensions well beyond what between-group research could accomplish.
Second, there is increased accountability in many domains of intervention, including education and psychotherapy as two examples. Some of the accountability concerns (e.g., from insurance companies, third-party payers, local and federal governments) result from uncontrolled costs and little outcome benefit to show for these costs. The strength of single-case designs is evaluation and assessment of individuals and groups and use of the information to improve ongoing interventions. The designs can provide information about change or lack of it in novel ways that can determine what gains we are achieving with an intervention.

Finally, single-case designs can establish whether interventions are evidence-based. In evaluation of what interventions are effective (e.g., in education, counseling, child care, parenting practices), between-group research usually is the exclusive method that "counts." Single-case experiments are as rigorous as between-group designs, but they are not usually integrated into a larger body of findings. Science and care of individuals are not served by excluding data from a particular methodological approach. The reasons why single-case research is often omitted, and options to rectify this, are addressed in this book.
Single-case designs provide a range of options for evaluating interventions in feasible ways and for the purpose of helping people in everyday settings. The careful assessment that the designs use can make a difference not only in extending evidence-based interventions but also in improving the quality of care and services. Single-case designs represent a fascinating union of attention to client care (e.g., educational or clinical services) and systematic evaluation. Research methodology is not at odds with sensitive care but in some way it is the only way to ensure that care, a topic addressed in this edition.
In preparing a book on research methods, it is important to note what is not included. Several critical topics in scientific research are not covered. Examples include ethical issues for the protection of clients (e.g., informed consent, privacy protections), critical issues of scientific integrity (e.g., conflict of interest, fraud), and publication and communication (e.g., making data available, writing up research so it can be added to the accumulating body of knowledge). These topics are shared with the tradition of between-group methods. Not covering the topics does not gainsay their importance, but rather rests on the likelihood that with scientific training in the between-group tradition the reader has had exposure to the topics and the obligations they raise.

ACKNOWLEDGMENTS

It is not possible to list the many people who have contributed to this book, although if reviews of the book are negative and sales plummet and we enter the finger-pointing phase, I have a list ready to go. The book benefitted enormously from the comments and suggestions of Drs. Richard R. Bootzin, Thomas R. Kratochwill, and Thomas H. Ollendick, who reviewed the manuscript. I am very fortunate to have the input of such colleagues who have made major contributions to research and research methodology and who have influenced my work.

Many others have influenced this book, spanning from advisors in graduate school to current students who participate in methodology discussions. I am grateful to Oxford University Press, with whom I have been privileged to work over the years, and specifically with Patrick Lynch, Executive Editor, who helped develop this project. Also, my work has been supported in part by various grants and funding agencies (e.g., Research Scientist Award, MERIT Award, and projects from the National Institute of Mental Health, The Robert Wood Johnson Foundation, William T. Grant Foundation, Rivendell Foundation of America, and the Leon Lowenstein Foundation). I am deeply grateful for the support.

Alan E. Kazdin
New Haven, Connecticut
June 2009
ABOUT THE AUTHOR

Alan E. Kazdin is the John M. Musser Professor of Psychology and Child Psychiatry at Yale University and Director of the Yale Parenting Center and Child Conduct Clinic, an outpatient treatment service for children and families. He was 2008 President of the American Psychological Association. He received his PhD in clinical psychology from Northwestern University. Prior to coming to Yale, he was on the faculty of The Pennsylvania State University and the University of Pittsburgh School of Medicine. In 1989, he moved to Yale University, where he has been Chairman of the Psychology Department, Director and Chair of the Yale Child Study Center at the Yale School of Medicine, and Director of Child Psychiatric Services at Yale-New Haven Hospital. Dr. Kazdin is a licensed clinical psychologist, a Diplomate of the American Board of Professional Psychology (ABPP), and a Fellow of the Association for the Advancement of Science, the American Psychological Association (APA), and the Association for Psychological Science. His honors include Research Scientist Career and MERIT Awards from the NIMH and Awards for Distinguished Scientific Contribution to Clinical Psychology and Distinguished Professional Contribution to Clinical Child Psychology (APA), Outstanding Research Contribution by an Individual (Association for Behavioral and Cognitive Therapies), Distinguished Service Award (ABPP), Joseph Zubin Award for Lifetime Contributions to Understanding Psychopathology (Society for Research in Psychopathology), and Awards for Lifetime Contributions to Psychology (APA; Connecticut Psychological Association). He has served as editor of various professional journals: Journal of Consulting and Clinical Psychology, Psychological Assessment, Behavior Therapy, Clinical Psychology: Science and Practice, and Current Directions in Psychological Science.

Currently, he teaches and supervises graduate and undergraduate students and runs a clinical-research program on the treatment of oppositional, aggressive, and antisocial behavior among children and adolescents. He has authored or edited over 650 articles, chapters, and books. His 45 books focus on methodology and research design, child and adolescent psychotherapy, parenting and child rearing, and conduct problems among children. His other books with Oxford University Press include Psychotherapy for Children and Adolescents: Directions for Research and Practice (2000), The Encyclopedia of Psychology (Vols. 1-8) (2000), and Parent Management Training: Treatment for Oppositional, Aggressive, and Antisocial Behavior in Children and Adolescents (2005; reprinted 2009).
CHAPTER 1

Introduction
Study of the Individual in Context

CHAPTER OUTLINE

The Uncontrolled Case Study
Strengths and Value of the Case Study
Brief Illustrations
Methodological Limitations
General Comments
Historical Overview of Research with the Single Case
Experimental Psychology
Clinical Research
Contemporary Development of Single-Case Methodology
The Experimental Analysis of Behavior
Applied Behavior Analysis
Current Issues in Intervention Research
Evidence-Based Interventions
Increased Evaluation and Accountability
General Comments
Contexts and Perspective
Overview of the Book

Single-case designs have been used in many areas of research, including psychology, medicine, education, rehabilitation, social work, counseling, and other disciplines. The designs have been designated by different terms, such as "intrasubject-replication designs," "N = 1 research," "intensive designs," and so on.1 The unique feature of these designs is the capacity to conduct experimental investigations with the single case, that is, one subject. Of course, the designs can evaluate the effects of interventions with large groups and address many of the questions posed in between-group research; however, the methodology is distinguished by including an approach and multiple designs that rigorously evaluate interventions with one or a small number of cases.

1. Although several terms have been proposed to describe the designs, each is partially misleading. For example, "single-case" and "N = 1" designs imply that only one subject is included in an investigation. Often this is true, but more often multiple subjects are included. Moreover, "single-case" research occasionally includes very large groups of subjects (e.g., thousands). The term "intrasubject" is a useful term because it implies that the methodology focuses on performance of the same person over time, which it often does. The term is partially misleading because some of the designs depend on looking at the effects of interventions across subjects. "Intensive designs" has not grown out of the tradition of single-case research and is used infrequently. Also, the term "intensive" has the unfortunate connotation that the investigator is working intensively to study the subject, which probably is true but is beside the point. For purposes of conformity with many existing works, "single-case designs" is used in this book because it draws attention to the unique feature of the designs, that is, the capacity to experiment with individual subjects, and because it enjoys the widest use. (Of course, by referring to the single case, there is no intention of slighting married or cohabiting cases.)
Single-case research methods are rarely taught to students or utilized by investigators in the social and biological sciences. The dominant views about how research should be done still include many misconceptions about single-case research. For example, a widely held belief is that single-case investigations cannot be "true experiments" and cannot reveal "causal relations" between variables, as that term is used in scientific research. Among those who grant that causal relations can be demonstrated in such designs, a common view is that the designs cannot yield conclusions that extend beyond the one or few persons included in the investigation, leave generalizability of any effect unclear, or are inferior to group designs in establishing generality. By the end of the book, the reader can judge whether any of these concerns is just between-group methodology propaganda or has merit, but please wait until the end of the book. If you cannot make it that far—even I have trouble reading my own writing—these concerns are somewhere between flatly false and unequivocally unclear.
Consider three research methodologies: quantitative between-group research, single-case research, and qualitative research. Each of these is a rather large area of study, with books, journals, and historical traditions. Results from each type of research are scientific findings, subject to replication, and so on. If you, as the reader, have training that is traditional within the social or biological sciences, you would be the informed exception if you did not think ever so slightly that single-case research and qualitative research are poor substitutes for "real" science, that is, research in the group tradition of null hypothesis testing, random assignment, group designs, and statistical evaluation. There is no methodology that is "better" than another in some abstract sense; the methodologies are all to be viewed in the context of how they contribute to our overall goals of acquiring knowledge and our ability to use them to draw valid inferences. Single-case designs are important methodological tools that can be used to evaluate a number of research questions with individuals or groups. It is a mistake to discount them without a full appreciation of their unique characteristics and their similarities to more commonly used experimental methods. The designs should not be proposed as flawless alternatives for more commonly used research design strategies. Like any type of methodology, single-case designs have their own limitations, and it is important to identify these.
A little clarification of terminology is needed before we begin. Single-case designs are used heavily in applied research, and that will be the emphasis in the examples. By applied research I refer to research in everyday life settings including schools, the
home, in an office (e.g., dentist, physician, psychotherapist), institutions, and business and industry, and for any population in need of care, assistance, or treatment, in ways that reduce behaviors associated with impairment or that increase behaviors that improve functioning. The goal in applied research is to develop, treat, educate, change, help, or have impact in some immediate way. This is distinguished from basic research in laboratory settings where the goal is to understand, conduct tests of principle, and elaborate theory. Applied and basic research are related, and it is easy to show that some of the best clinical care, as in the context of psychotherapy, for example, can derive from basic laboratory research (e.g., see Kazdin, 2007). I emphasize applied settings because single-case designs have broad applicability to evaluating programs in settings where the usual group designs are not possible, or if possible are not feasible. For example, evaluating the effect of a program devoted to altering the reading of two special education students, preventing unprotected sex in a high school, or reducing crime in a neighborhood are areas to which single-case designs can be applied. With increased accountability and interest in evaluation, strategies are needed to help evaluate many programs that are well intentioned but will never see their way into a group design.
Also, I refer to quantitative research in the tradition of null hypothesis testing, random assignment, and statistical significance testing as "between-group research." This term better captures the difference from single-case research. Both methodologies can focus on groups of subjects and evaluate results quantitatively (with statistical analyses). However, the key difference in the methodology includes comparing groups rather than evaluating interventions within the same subject or group over time. There are ambiguities, combined designs, and opportunities to get lost in nuances. Let us not begin there.
The purpose of this book is to elaborate the methodology of single-case experimentation, to detail major design options and methods of data evaluation, and to identify problems and limitations. Single-case designs can be examined in the larger context of applied research in which various methodologies, including single-case designs and between-group designs, make unique as well as overlapping and complementary contributions. In the present text, single-case research is presented as a methodology in its own right and not necessarily as a replacement for other approaches. I also address strengths and limitations of single-case designs and the relationship of single-case to between-group designs.
In this introductory chapter, I would like to place single-case designs in four contexts: (1) the study of individuals nonrigorously (e.g., traditional uncontrolled case studies); (2) historical overview of research with the single case; (3) contemporary development of single-case research; and (4) current issues in intervention research that make the designs more useful and relevant than ever before.

THE UNCONTROLLED CASE STUDY


The uncontrolled case study serves as an important backdrop for experimental methods with the single case and for scientific research more generally. Historically, the term "case study" has broad use across multiple disciplines, and there is no single definition (Bolgar, 1965; Dukes, 1965; Robinson & Foster, 1979; Sechrest, Stewart, Stickle, & Sidani, 1996). The case study is a generic term that indicates a focus on the individual. Depending on the discipline, that can be an individual person, group, organization, institution, culture, or society. Any instance in which one of some "thing" is studied in
depth or used as an example can be a case and hence a case study. There are some general commonalities or characteristics of what a case study is, and these are highlighted in Table 1.1. As noted there, the case study refers to the intensive study of the individual. Within psychology, particularly clinical and counseling psychology, the term has been used mostly to refer to an individual client. Case studies focus on the rich details and usually make an effort to describe details, offer explanations, and make connections (e.g., early experience and current functioning).

The case study is not only defined by the focus on the individual but also has come to encompass a methodological approach. This approach is reflected in the term "anecdotal case study," which refers to descriptions of the case that use anecdotes or narrative and literary statements to describe the case (e.g., details of who, what, where, and when) and draw inferences and connections (e.g., why the person is this or that way and how some experiences led to the current situation). Systematic measures (e.g., questionnaires, direct observations, archival records) are not used in a way that would allow anyone to draw the same conclusions about the case.
The case study as a methodology has been recognized to be a weak basis for drawing inferences, in part because of the absence of any controls, systematic assessment, and procedures to corroborate what actually transpired (e.g., what happened, what caused an outcome). The relations or connections among events of the case could be in the mind of the beholder, that is, the person who wrote the case description, as much as they are in the actual characteristics of the case. The unsystematic and subjective nature of the descriptions makes the case study generally unacceptable as a way of drawing valid inferences. Methodologists have this lesson poignantly taught in childhood and learn that if they go to methodology heaven, they have unlimited access to H-net (Internet in heaven) and audio and reading materials that describe randomized controlled trials, those carefully controlled studies that are adored. When methodologists go to the other place, they have to listen to and sit in the audience with an infinitely long (literally) symposium and set of speakers describing and drawing conclusions from anecdotal case studies.

Strengths and Value of the Case Study

The lack of controlled conditions and failure to use measures that are objective (e.g., replicable, reliable, valid) have limited the traditional case study as a research tool.
Table 1.1 Major Characteristics of Case Studies

Key Characteristics

• The intensive study of the individual. However, this could be an individual person, family, group, institution, state, country, or other level that can be conceived as a unit;

• The information is richly detailed, usually in narrative form rather than as scores on dependent measures;

• Efforts are made to convey the complexity and nuances of the case (e.g., contexts, influence of other people) and special or unique features that may apply just to this case; and

• Information often is retrospective; past influences are used to account for some current situation, but one begins with the current situation.

Note: Among diverse sciences, case studies have taken many different forms so that exceptions to these characteristics can be readily found. For a broad and comprehensive view of case studies, see Sechrest et al. (1996).
Yet the naturalistic and uncontrolled characteristics also have made the case a unique source of information that complements and contributes to theory, research, and practice. Case studies, even without serving as formal research, have made important contributions. First, the case study has served as a source of ideas and hypotheses about human performance and development. For example, case studies from quite different conceptual views such as psychoanalysis and behavior therapy (e.g., case of Little Hans [Freud, 1933]; case of Little Albert [Watson & Rayner, 1920]) were remarkably influential in suggesting how fears might develop and in advancing theories of human behavior that would support these views.
Second, case studies have frequently served as the source for developing therapy techniques. Here too, remarkably influential cases within psychoanalysis and behavior therapy might be cited. In the 1880s, the treatment of a young woman (Anna O.) with several hysterical symptoms (Breuer & Freud, 1957) marked the inception of the "talking cure" and cathartic method in psychotherapy. Within behavior therapy, development of treatment for a fearful boy (Peter) followed by evaluation of a large number of different treatments to eliminate fears among children (Jones, 1924a, 1924b) exerted great influence in suggesting several different interventions, many of which remain in use in some form in clinical practice.
Third, case studies perm it the study o f rare phenom ena. M an y problem s seen in
treatment or o f interest m ay be so infrequent as to make evaluation in group research
impossible. T he individual client with a unique problem or situation can be studied
intensively with the hope o f uncovering material that m ay shed light o n the develop-
ment o f the problem as well as effective treatment. For exam ple, the stu dy o f multiple
personality, in which an individual manifests two or more different patterns o f person-
ality, em otions, thoughts, and behaviors, has been elaborated greatly by the case study.
A prom inent, historically early illustration is the well-publicized report and subsequent
m ovie o f the “ T hree Faces o f Eve” (Thigpen & Cleckley, 1954,1957). T he intensive study
o f Eve revealed quite different personalities, m annerism s, gait, psychological test per-
form ance, and other characteristics o f general demeanor. T h e analysis at the level o f the
case provided unique inform ation not accessible from large-scale group studies. Also,
this was not just an anecdotal description; the investigators were system atic in collect-
ing a range o f objective measures, and video tapes, which m ade this case description
novel and quite inform ative.
Fourth, the case is valuable in providing a counter instance for notions that are considered to be universally applicable. For example, in traditional forms of treatment such as psychoanalysis, treatment of overt symptoms was discouraged based on the notion that neglect of motivational and intrapsychic processes presumed to underlie dysfunction would be ill-advised, if not ineffective. Without treating the supposed (but never demonstrated) root or underlying cause, there might be other symptoms that emerge. This was referred to as symptom substitution. Yet, decades ago repeated case demonstrations that overt symptoms could be effectively treated without the emergence of substitute symptoms cast doubt on the original caveat (see Kazdin, 1982). Although a case can cast doubt upon a general proposition, it does not itself allow affirmative claims of a very general nature to be made. By showing a counter instance, the case study does provide a qualifier about the generality of the statement. With repeated cases, each showing a similar pattern, the applicability of the original general proposition is increasingly challenged.
Finally, case studies have persuasive and motivational value. From a methodological standpoint, uncontrolled case studies generally provide a weak basis for drawing inferences. However, this point is often academic. Even though cases may not provide strong causal knowledge on methodological grounds, a case study often provides a dramatic and persuasive demonstration and makes concrete and poignant what might otherwise serve as an abstract principle. "Did that person my age, with 30 lbs (13.6 kg) he would like to get rid of, really lose all that weight on the North Beach Dandelion Diet? Of course, I am skeptical, but the TV photos can't be wrong, and even though he does not have my looks, charm, head of hair suspiciously thick for his age, or other appeal, he was a lot like me." Even without photos, seeing is believing, even though research in cognitive psychology and neuroscience continues to teach us that perception and memory are limited in ways that interfere with drawing valid inferences and accurately perceiving connections among events in the world (e.g., Gilovich, Griffin, & Kahneman, 2002; Pohl, 2004; Roediger & McDermott, 2000). More simply stated, we see many things clearly even when they are not there.
Another reason that cases are often so dramatic is that they are usually selected systematically to illustrate a particular point. Presumably, cases selected randomly from all those available would not illustrate the dramatic type of change that typically is evident in the particular case provided by an author or advertising agency. Nor do the cases convey the relation of the alleged intervention to the alleged outcome. So our dandelion diet chap may have been 1 of 500 who agreed to the diet and the only one who showed change. Even if the illustrated case were accurately presented, it is likely to be so highly selected as to not represent the reaction of most individuals to the program. Also, no conclusions can be drawn about the causal agent—the dandelion guy—I just sent and received a text message—he vomited for 2 years straight and is now 6 inches shorter than he was. The diet did not quite work the way we were led to believe. Notwithstanding reason, logic, and methodology, the selection of extreme cases does not merely illustrate a point, but rather often compels us to believe in causal relations that reason and data would refute.
The example of losing weight conveys what might well be the four main functions that case studies often serve, namely, to inform, intrigue, inspire, and incite (Sechrest et al., 1996). Seen in this context, there is nothing like a case to convey the points, to provoke thought, to motivate others, and to move to action. It is not that the case will invariably accomplish any of these, but it will often do so better than a wonderful finding from a study with a large sample (N) and excellent controls. Large studies may not accomplish what a well-placed case study can do. Indeed, a routine failing of science is the failure to take findings and translate them in such a way that they might have broader influence (e.g., on health, climate). For example, in relation to climate changes, conveying to the public in concrete ways with cases (e.g., in relation to continents, non-human animals, and residents of a coastal city) what global warming can do trumps the persuasive appeal of all of the constituent studies that established the problem (e.g., see Gore, 2006). In general, case studies and stories about what has happened or can happen in a given instance might be a more systematic part of that translation effort in conveying "real" research findings to help benefit the public.
Brief Illustrations

Case studies can be quite useful. Indeed, even the single aspect of generating hypotheses for research would make the case study quite valuable. In clinical psychology, there have been a few dominant anecdotal case studies (e.g., some of the key cases of Freud) that have become classics in the mental health professions. The cases make wonderful sense and are cohesive (internally consistent). Yet, in many instances, in-depth analyses refute or undermine the very basis for presenting the case.
For example, recall the case from the 1880s in which Joseph Breuer (1842-1925), a Viennese physician and collaborator of Sigmund Freud (1856-1939), treated Anna O. (Breuer & Freud, 1957). Anna was 21 years old at the time and had several symptoms including paralysis and loss of sensitivity of the limbs, lapses in awareness, distortions of sight and speech, headaches, and a persistent nervous cough. These symptoms were considered to be due to anxiety rather than to medical or physical problems. As Breuer talked with Anna and used a little hypnosis, she recalled early events in her past and discussed the circumstances associated with the onset of each symptom. As these recollections were made, the symptoms disappeared. This case has had enormous impact and is credited with marking the beginning of the "talking cure" and cathartic method of psychotherapy.
Actually, this case is a wonderful illustration of the mixed blessing of cases. First, we have no really systematic information about the case, what happened, and whether and when the symptoms changed. Second, essential details of the case that readily controvert or weaken the conclusions are rarely noted. For example, talk therapy was combined with hypnosis and rather heavy doses of medication (chloral hydrate, a sleep-inducing agent), which was used on several occasions and when talk did not seem to work (see Dawes, 1994). Thus, the therapy was hardly just talk, and indeed whether talk had any impact cannot really be discerned. Also, the outcome of Anna O., including her very serious symptoms and subsequent hospitalization, raises clear questions about the effectiveness of the combined talk-hypnosis-medication treatment. Cases such as these, while powerful, engaging, and persuasive, do not permit inferences about what happened and why. Talk was not the only intervention, and the impact of treatment was not at all clear in the short and long term.
There are more compelling instances that show what a case can yield with better observations and reporting than those of the previously mentioned case. The well-known case of Phineas Gage is a better example of what an uncontrolled case study can show (Macmillan, 2009). Mr. Gage was a 25-year-old man working on the railroad (in Vermont, USA) and was going to use an explosive to fracture a rock. An accident occurred and caused a large metal bar (a tamping iron, 3 ft, 7 in. long [1.09 meters]) to blast entirely through his skull and to land some 20 meters away (for photos and further details of the story, see www.hbs.deakin.edu.au/gagepage). A physician who treated Mr. Gage within 90 minutes of the accident recorded that he spoke rationally and described what had happened. Follow-up of the case indicated that his personality had changed. Before the accident, he was regarded as capable, well balanced, and very efficient as a foreman. After the accident, he was impatient, obstinate, grossly profane, and showed little deference to others. Also, he seemed to lose his ability to plan for the future. His friends noted that he was no longer like the person they knew.
This was a tragic experiment in nature that has become a classic case within neuropsychology. The case is used to reflect the impact of a disastrous intervention on cognitive and personality functioning. The case is compelling because of the abruptness and scope of the accident, so that the causal agent is fairly clear. Also, permanent changes seemed to be induced by the accident so that other influences were not likely. Later we discuss the role of the latency between implementation of an intervention and change as a way of drawing inferences about cause in case studies and in clinical work. The circumstances of Mr. Gage made drawing inferences from the case among the clearest one might expect without formal investigation.
Case studies have played a strong role in elaborating the relation of brain and behavior. The reason is that a variety of special injuries, diseases, and interventions occur that could not be conducted experimentally. As these are carefully documented, one can examine emotional, cognitive, and behavioral functioning over time and the resulting consequences. For example, in one case a young boy had one half of his brain (one hemisphere) removed as part of treatment to control epilepsy. Tracking his development over the course of childhood revealed that several functions thought to be specific to the lost hemisphere still developed. The boy was functioning well in several spheres (e.g., academic, language learning), all of which were reasonably well documented (Battro, 2001), suggesting the brain's ability to compensate and how, with training, the boy could overcome significant deficits.
The careful assessment of a case can make the results quite persuasive. For example, a report of a 25-year-old man with a stroke revealed that he had damage to specific areas of the brain (insula and putamen) suspected to be responsible for the emotion of disgust (Calder, Keane, Manes, Antoun, & Young, 2000). The damage could be carefully documented (by fMRI [functional magnetic resonance imaging]) and hence shows an advance in specification of the assault in comparison to the Phineas Gage example noted previously. His damage could be localized to these areas. The man was systematically tested; during the testing he observed photos of people experiencing different emotions (happiness, fear, anger, sadness, and surprise). He had no difficulty identifying these emotions. However, he could not identify the photos of disgust. Disgusting photos or ideas presented to him (e.g., friends who change underwear once a week or feces-shaped chocolate [remember, I am just the messenger here; I am not making this up]) were also difficult for him to identify as disgusting. This is an interesting example, because the case was systematically evaluated and hence the strength of the inferences that were drawn is commensurately increased. Also, the investigators compared this case to male and female control subjects without brain injury to provide a baseline on each of the tasks. The demonstration becomes even more interesting by falling somewhere between a case study and a between-group (case-control) design.

Methodological Limitations

We know that there are many limitations of the uncontrolled case study. In clinical work, the usual way in which cases are described leads to their many limitations. First, case reports rely heavily on anecdotal information in which clinical judgment and interpretation play a major role. Many inferences are based upon reports of the clients; these reports are the "data" upon which interpretations are made. The client's reconstructions of the past, remembered events from one's past, particularly those laden with emotion, are likely to be distorted and highly selective. To this is added interpretation and judgment of the therapist, in which unwitting (but normal human) biases operate to weave a coherent picture of the client's predicament and the chain of events leading to the current situation.
Second, many alternative explanations usually are available to account for the current status of the individual other than those provided by the clinician. The next chapter codifies the alternative explanations and the reasons that case studies often can be seriously challenged. Postdictive or retrospective accounts try to reconstruct early events and show how they invariably led to contemporary functioning. Although such accounts frequently are persuasive, they are scientifically indefensible.
Third, a major concern about the information derived from a case study is the generalizability to other individuals or situations. Scientific research attempts to establish general "laws" of behavior that hold without respect to the identity of any individual. Can a case do this? Probably not an anecdotal case study, but we have more to say on this topic later. For now, an anecdotal case is not flawed because the findings may not be generalizable—not at all. More likely, the "findings" may not even apply or apply well to the case itself, as discussed in relation to Anna O. and the putative "talking cure." It is not so much that the anecdotal description is "wrong" but that it may omit critical details that would change the conclusions. In science, we do not talk about generality of a finding until we have a finding.
All that said, I invite the reader to consider the case at a broader level of abstraction. We do research to rule out alternative explanations of findings. For example, I had a cold, started using soap when I bathed, and got better in a week. The implied causal connection is a problem because alternative explanations abound. Yes, given my history, the shock and novelty of using soap may have jolted my system, but processes in time (healing processes within the body) might have led to getting better even without soap. In fact, this case study is so flawed we cannot tell if the soap actually impeded rather than aided recovery. The point here: methodology is not about details of design as much as it is about drawing valid inferences. My little soap case study really allows no valid inferences to be drawn. But there are cases in which inferences could be drawn that are scientifically valid or very close to it (e.g., the 25-year-old male mentioned previously). Also, there are many things one can do with a case to make it so that valid inferences can be drawn (Sechrest et al., 1996). Using single-case designs is one of these, but approximations of the designs, and better observation of cases, can accomplish this as well.

G e n e ra l C o m m en ts
It is im portant to situate single-case designs in the con text o f the u ncontrolled case
study for several reasons. First, the uncontrolled case study, especially the anecdotal
case study, has very little in com m on with single-case exp erim en tal designs, and e lim i-
nating that confusion is valuable at the outset. Second, the stu d y o f individuals, even
when as an anecdotal case, can contribute to research. H ypothesis generation is not
trivial in science, and the m eticulous observation o f in d ivid u als can help generate new
ideas to be tested m ore rigorously. T hird, research is about d raw in g scientifically valid
inferences, that is, conclusions about relations betw een and am on g variables that can
be replicated. Case studies, when uncontrolled, occasionally can suggest causal relations and make implausible other alternative explanations of the outcome. Later in the book, I talk about systematic ways of drawing inferences from instances when the single-case experimental designs cannot be used.
When we teach research design we often begin with the designs and procedures of design (e.g., randomly assigning subjects to groups, holding things constant, keeping observers naïve). It is more helpful to begin with why we do research in the first place, and the specific sources of artifact, bias, and confound we are trying to combat by going through all these rituals. We engage in methodological practices, such as using control conditions (single-case) or control groups (between-group studies), to persuade ourselves and others that critical variables cannot account for findings we observed. We are trying to draw inferences and rule out competing interpretations. Once this is grasped, as discussed in the next chapter, we have many more options for research and for drawing conclusions. One does not know merely from the fact that there is a case, or a group, or a few groups, whether valid inferences can be derived. The extremes of this are very easy to show, that is, one thoroughly documented and evaluated case that is the strongest scientific demonstration of a causal relation versus a between-group study that provides the weakest or no decent basis for demonstrating causal relations (e.g., my dissertation). Throughout the book I emphasize not only the designs that form single-case methodology but also how they accomplish what we are trying to do in science, that is, draw valid inferences.

HISTORICAL OVERVIEW OF RESEARCH WITH THE SINGLE CASE
Single-case research is often viewed as a radical departure from tradition in psychological research. The tradition rests on the between-group research approach that is deeply engrained in the biological and social sciences. Interestingly, one need not trace the history of psychological research very far to learn that much of traditional research was based on the careful investigation of individuals rather than on comparisons between groups.

Experimental Psychology

In the late 1880s and early 1900s, most investigations in experimental psychology utilized only one or a few subjects as a basis of drawing inferences. This approach is illustrated by the work of several prominent psychologists working in a number of core research areas. Wundt (1832-1920), the father of modern psychology, investigated sensory and perceptual processes in the late 1800s. Like others, Wundt believed that investigation of one or a few subjects in depth was the way to understand sensation and perception. One or two subjects (including Wundt himself) reported on their reactions and perceptions (through introspection) based on changes in stimulus conditions presented to them. Similarly, Ebbinghaus' (1850-1909) work on human memory using himself as a subject is widely known. He studied learning and recall of nonsense syllables while altering many conditions of training (e.g., type of syllables, length of list to be learned, interval between learning and recall). His carefully documented results provided fundamental knowledge about the nature of memory.
Pavlov (1849-1936), a physiologist who contributed greatly to psychology, made major breakthroughs in learning (respondent conditioning) in non-human animal research. Pavlov's experiments were based primarily on studying one or a few subjects at a time, work that apparently did not demean his research or dissuade the Nobel committee from awarding him their prize (in Physiology, 1904). An exceptional feature of Pavlov's work was the careful specification of the independent variables (e.g., conditions of training, such as the number of pairings of various stimuli) and the dependent variables (e.g., drops of saliva).
Using a different paradigm to investigate learning (instrumental conditioning), Thorndike (1874-1949) produced work that is also noteworthy for its focus on a few subjects at one time. Thorndike experimented on a variety of animals. His best-known work is the investigation of cats' escape from puzzle boxes. On repeated trials, cats learned to escape more rapidly with fewer errors over time, a process dubbed "trial and error" learning.
The preceding illustrations list only a few of the many prominent investigators who contributed greatly to early research in experimental psychology through research on one or a few subjects at a time. Other key figures in psychology could be cited as well (e.g., Bechterev, Fechner, Kohler, Yerkes). The small number of persons mentioned here should not imply that research with one or a few subjects was limited to a few investigators. Investigation with one or a few subjects was once common practice. Analyses of publications in psychological journals have shown that from the beginning of the 1900s through the 1920s and 1930s research with very small samples (e.g., one to five subjects) was the rule rather than the exception (Robinson & Foster, 1979). Research typically excluded the characteristics currently viewed as essential to experimentation, such as large sample sizes, control groups, and statistical evaluation of data.
T he accepted method o f research soon changed from the focus on one o r a few s u b -
jects to larger sample sizes. Although this history is extensive in its own right, certainly
am ong the events that stimulated this shift was the developm ent o f statistical m eth-
ods. Advances in statistical analysis accom panied greater appreciation o f the betw een-
group approach to research. Studies exam ined intact groups and obtained correlations
between variables as they naturally occurred. Thus, interrelationships between v a ri-
ables could be obtained without experim ental m anipulation.
Statistical analyses cam e to be increasingly advocated as a m ethod to perm it group
com parisons and the study o f individual differences as an alternative to exp erim en -
tation. Dissatisfaction with the yield o f sm all sam ple size research and the absence o f
controls w ithin the research (e.g., Chaddock, 1925; Dittmer, 1926) as well as develop-
ments in statistical tests (e.g., G osset’s developm ent o f th e Studentized I test in 1908)
all played a role in the move to group methods. Certainly, a m ajor im petus to increase
sam ple sizes was R. A. Fisher, whose book on statistical m ethods (R. A. Fisher, 1925)
dem onstrated the im portance o f com paring groups o f subjects and presented the now
fam iliar notions underlying the analyses o f variance. By the 1930s, journal publications
began to reflect the shift from small sample studies with no statistical evaluation to
larger sam ple studies utilizing statistical analyses (Borin g, 1957; Robinson & Foster,
1979)- Although investigations o f the single case w ere reported, it becam e clear that
they were a sm all m inority and possibly on their w ay out (D u kes, 1965).
With the advent of larger-sample-size research evaluated by statistical tests, the rules for research became clear. The basic control-group design became the paradigm for psychological research: one group that received the experimental condition was compared with another group (the control group) that did not. Also, the groups should be composed in a way that will optimize the likelihood of their equivalence and the absence of a systematic bias, that is, random assignment of subjects to groups before the experimental condition is administered. Most research consisted of variations of this basic design. Whether the experimental condition produced a reliable effect was decided by statistical significance, based on levels of confidence (probability levels) selected in advance of the study. Thus larger samples became a methodological virtue. With larger samples, experiments are more powerful, that is, better able to detect an experimental effect if there truly is one. Also, larger samples were implicitly considered to provide greater evidence for the generality of a relationship. If the relationship between the independent and dependent variables was shown across a large number of subjects, this suggested that the results were not idiosyncratic. (The idea that larger samples or between-group experiments lead to findings with greater generality than the findings from small sample or even N = 1 research has an illusory feature to which we return later in this book.) The basic rules for between-group research have not really changed, although the methodology has become increasingly sophisticated in terms of the number of design options and statistical techniques for data analysis.
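To make the notion of statistical power concrete, the following minimal simulation sketch (in Python; not from the original text, and the true effect of half a standard deviation is an assumed value chosen only for illustration) estimates how often a two-group t test detects a real effect at different sample sizes:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def detection_rate(n_per_group, true_effect=0.5, n_sims=2000, alpha=0.05):
    # Estimate power: the proportion of simulated experiments whose
    # two-sample t test detects a true effect of `true_effect` SDs.
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(true_effect, 1.0, n_per_group)
        _, p = stats.ttest_ind(treated, control)
        hits += p < alpha
    return hits / n_sims

for n in (5, 20, 80):
    print(f"n = {n:>2} per group -> estimated power {detection_rate(n):.2f}")

Under these assumed values, small groups detect the effect only a fraction of the time while large groups detect it reliably, which is precisely the sense in which larger samples became a methodological virtue.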

Clinical Research


Substantive and methodological advances in experimental psychology usually influence the development of clinical psychology. The anecdotal case study, already covered in the discussion of uncontrolled cases, is the most prominent use of the case in clinical psychology. Leaving aside uncontrolled cases, the role of the individual was specifically identified as critical to the acquisition of knowledge. The study of individual cases has been more important in clinical psychology than in other areas of psychology. At one time, the definition of clinical psychology explicitly included the study of the individual (e.g., Korchin, 1976; Watson, 1951). Information from group research is important but excludes vital information about the uniqueness of the individual. Thus, information from groups and information from individuals contribute separate but uniquely important sources of information. This point was underscored in distinguishing two approaches to research: the intensive study of the individual (the idiographic approach) as a supplement to the study of groups (called the nomothetic approach) (Allport, 1961). The rationale was to discover the uniqueness of each individual by investigating them intensively. Uniqueness, as an argument, is a two-edged sword. Yes, individuals are different from each other, but science seeks the general laws first and then how groups or subgroups may vary. The evaluation of uniqueness did not catch on within mainstream psychological research. Indeed, the distinction of research approaches may have actually held back single-case research, because it made less clear that single-case research can and does study and is interested in general laws as well.
The investigation of the individual in clinical work has a history of its own that extends beyond one or a few theorists and well beyond clinical psychology. Theories about the etiology of psychopathology, the development of personality and behavior, and treatment techniques have routinely drawn on cases, but these have been mostly
anecdotal cases as illustrated by Anna O. and scores of others not mentioned of the same ilk. More well-controlled and systematically evaluated study of the individual played a significant role historically in clinical work. A case study on the development of childhood fear also had important clinical implications. In 1920, Watson and Rayner reported the development of fear in an 11-month-old infant named Albert. Albert initially did not fear several stimuli that were presented to him, including a white rat. To develop Albert's fear, presentation of the rat was paired with a loud noise. After relatively few pairings, Albert reacted adversely when the rat was presented by itself. The adverse reaction appeared in the presence of other stimuli as well (e.g., a fur coat, cotton-wool, Santa Claus mask). This case was interpreted as implying that fear could be learned and that such reactions generalized beyond the original stimuli to which the fear had been conditioned.2 The preceding cases do not begin to exhaust the dramatic instances in which intensive study of individual cases had considerable impact on clinical work. As I mentioned, individual case reports have been influential in elaborating relatively infrequent clinical disorders, such as multiple personality (Prince, 1905; Thigpen & Cleckley, 1954, 1957), and in suggesting viable clinical treatments (e.g., Jones, 1924a).
Case studies occasionally have had remarkable impact when several cases were accumulated. Although each case is studied individually, the information is accumulated to identify more general relationships. For example, modern psychiatric diagnosis, or the classification of individuals into different diagnostic categories, began with the analysis of individual cases. Kraepelin (1855-1926), a German psychiatrist, identified specific "disease" entities or psychological disorders by systematically collecting thousands of case studies of hospitalized psychiatric patients. He described the history of each patient, the onset of the disorder, and its outcome. From this extensive clinical material, he elaborated various types of "mental illness" and provided a general model for contemporary approaches to psychiatric diagnosis (Zilboorg & Henry, 1941).
As these brief comments note, experimental work, including research and more rigorous studies of basic and applied topics, has a considerable history involving one or a few subjects. Scientific insights into critical phenomena regarding perception, memory, learning, and psychiatric disorders are a few of the examples. This is important to recognize in order to convey that the study of the individual was not invariably uncontrolled case studies with anecdotes and no clear ability to draw inferences of any kind.

CONTEMPORARY DEVELOPMENT OF SINGLE-CASE METHODOLOGY
Current single-case designs have emerged from specific areas of research within psychology. The designs and approach can be seen in bits and pieces in historical antecedents of the sort mentioned previously. However, the full emergence of a distinct methodology and approach has more direct lineage.

2 Interestingly, efforts to replicate this demonstration repeatedly failed, with only occasional exceptions. The inconsistent effects obtained among the studies did not limit the enormous influence of this demonstration on interpreting fears and their acquisition (see Kazdin, 1978).
The Experimental Analysis of Behavior

The development of single-case research, as currently practiced, can be traced to the work of Skinner (1904-1990), who developed programmatic non-human animal laboratory research to understand learning and behavior change. Skinner was interested in studying the behavior of individual organisms and determining the antecedent and consequent events that influenced behavior. In Skinner's work, it is important to distinguish between the content or substance of his theoretical account of behavior (a type of learning referred to as operant conditioning) and the methodological approach toward experimentation and data evaluation (referred to as the experimental analysis of behavior). The substantive theory and methodological approach were and continue to be intertwined. Hence, it is useful to spend a little time on the distinction.
Skinner's research goal was to discover lawful behavioral processes of the individual organism (Skinner, 1956). He focused on animal behavior and primarily on the arrangement of consequences that followed behavior and influenced subsequent performance. His research articulated a set of relationships or principles that described the processes of behavior (e.g., reinforcement, punishment, discrimination, response differentiation) that formed operant conditioning as a distinct theoretical position and area of research (e.g., Skinner, 1938, 1953a).
Skinner's approach toward research, noted already as the experimental analysis of behavior, consisted of several distinct characteristics, many of which underlie single-case experimentation (Skinner, 1953b). First, Skinner was interested in studying the frequency of performance. Frequency was selected for a variety of reasons, including the fact that it presented a continuous measure of ongoing behavior, provided orderly data, reflected immediate changes as a function of changing environmental conditions, and could be automatically recorded. Second, one or a few subjects were studied in a given experiment. The effects of the experimental manipulations could be seen clearly in the behavior of individual organisms. By studying individuals, the experimenter could see lawful behavioral processes that might be hidden in averaging performance across several subjects, as is commonly done in group research. Third, because of the lawfulness of behavior and the clarity of the data from continuous frequency measures over time, the effects of various procedures on performance could be seen directly. Statistical analyses were not needed. Rather, the changes in performance could be detected by changing the conditions presented to the subject and observing systematic changes in performance over time.
Investigations in the experimental analysis of behavior are based on using the subject, usually a rat, pigeon, or other non-human animal, as its own control. The designs, referred to as intrasubject-replication designs (Sidman, 1960), evaluate the effect of a given variable that is replicated over time for one or a few subjects. Performances before, during, and after an independent variable is presented are compared. The sequence of different experimental conditions over time is usually repeated within the same subject.
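As a concrete illustration of this intrasubject logic, consider the following minimal Python sketch. The response frequencies are hypothetical values invented for illustration, not data from any study:

import statistics

# Hypothetical daily response frequencies for one subject across
# alternating baseline (A) and intervention (B) phases.
phases = {
    "A1 (baseline)":     [12, 14, 13, 15, 12],
    "B1 (intervention)": [22, 25, 27, 26, 28],
    "A2 (withdrawal)":   [16, 14, 13, 12, 13],
    "B2 (intervention)": [24, 27, 29, 28, 30],
}

# The subject serves as its own control: each phase mean is compared
# with adjacent phases rather than with a separate group of subjects.
for label, data in phases.items():
    print(f"{label:<18} mean = {statistics.mean(data):5.1f}")

Performance rises when the intervention is introduced, returns toward baseline when it is withdrawn, and rises again on reintroduction. It is this replicated pattern within the same subject, rather than a comparison between groups, that supports the inference.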
In the 1950s and 1960s, the experimental analysis of behavior and intrasubject or single-case designs became identified with operant conditioning research. The association between operant conditioning as a theory of behavior and single-case research as a methodology became somewhat fixed, in part because of their clear connection in the various publication outlets and professional organizations. Persons who conducted research on operant conditioning usually used single-case designs, and persons who used single-case designs usually were trained and interested in operant conditioning. The connection between a particular theoretical approach and a research methodology is not a necessary one, as is discussed later, but an awareness of the connection is important for an understanding of the development and current standing of single-case methodology.

Applied Behavior Analysis

As substantive and methodological developments were made in laboratory applications of operant conditioning, the approach was extended to human behavior (see Kazdin, 1978). The initial systematic extensions were designed to demonstrate the utility of the operant approach in investigating human performance and to determine if the findings of non-human animal laboratory research could be extended to humans. The extensions began primarily with experimental laboratory research that focused on adults with psychiatric disorders and children diagnosed with developmental and intellectual disabilities (mental retardation) and autism as well as adults functioning normally in the community (e.g., Bijou, 1955, 1957; Ferster, 1961; Lindsley, 1956, 1960). Systematic behavioral processes evident in non-human animal research were replicated with humans. Indeed, lawful relations generalized across species as well as individuals. For example, manipulating the delivery of reinforcers on a particular lab task (e.g., pressing a lever) yielded curves that were similar across rats, pigeons, and monkeys.
In human applications, clinically interesting findings emerged as well, such as reduction of symptoms among patients with a diagnosis of psychoses during laboratory sessions (e.g., Lindsley, 1960) and the appearance of response deficits among individuals with developmental disabilities (e.g., Barrett & Lindsley, 1962). Aside from the methodological extensions, even the initial research suggested the utility of operant conditioning for possible therapeutic applications.
Although experimental work in operant conditioning and single-case research continued, by the late 1950s and early 1960s an applied area of research began to emerge. Behaviors of clinical and applied importance were focused on directly, including stuttering, reading, writing, and arithmetic skills, and the behavior of patients hospitalized because of severe psychiatric disorders (e.g., Ayllon, 1963; Ayllon & Michael, 1959; Goldiamond, 1962; Staats, Staats, Schutz, & Wolf, 1962). By the middle of the 1960s, several programs of research emerged for applied purposes. Applications were evident in education and special education settings, psychiatric hospitals, outpatient treatment, and other environments (Ullmann & Krasner, 1965). By the late 1960s, the extension of the experimental analysis of behavior to applied areas was recognized formally as applied behavior analysis (Baer, Wolf, & Risley, 1968, 1987). Applied behavior analysis was defined as an area of research that focused on socially and clinically important behaviors related to matters such as psychiatric disorders, education, retardation, child rearing, crime, and social functioning more generally. Substantive and methodological approaches of the experimental analyses were extended to applied questions and to virtually all settings (e.g., preschools, colleges, military bases, business and industry, institutions and hospitals) and populations (e.g., infants, the elderly, individuals diagnosed with physical disease or psychiatric disorders) (Kazdin, 1977c).
Applied behavior analysis emerged from the extensions of operant conditioning and the experimental analysis of behavior to diverse applied settings and child, adolescent, and adult populations (Cooper, Heron, & Heward, 2007; Kazdin, 2001). In applied behavior analysis, intervention techniques used to change behavior draw heavily on operant conditioning; the methodology to evaluate these techniques relies on single-case designs. Thus, operant conditioning and a methodology of evaluation continue to be connected. However, single-case designs represent an important methodological approach that extends well beyond any substantive focus, theoretical views, or discipline. The designs have been extended to a variety of interventions removed from the conceptual framework of operant conditioning and used as a methodology to evaluate interventions in diverse contexts and settings (e.g., schools, hospitals, outpatient treatment, business and industry, competitive sports, rehabilitation centers) where behavior change is of interest. This book elaborates single-case designs and draws from the diversity of uses and applications. The expansion of the designs is not merely to new areas but also reflects modifications in key features (e.g., assessments, data-evaluation methods) of how single-case designs are used.

CURRENT ISSUES IN INTERVENTION RESEARCH
Treatment and intervention work raises issues that provide an important and relatively new context for single-case designs. The context pertains to accountability in providing services and greater interest in evaluating programs, therapies, and interventions in applied settings.

Evidence-Based Interventions
There has been heightened interest in identifying treatments or interventions that are based on strong empirical evidence. In the context of "treatments" (e.g., medicine and its many branches, dentistry, nursing, clinical psychology, speech and language training, occupational and recreational therapy, and rehabilitation), evidence-based treatments (EBTs) delineate these interventions. However, many applications are not "treatment or therapy." Prominent among these are educational and school psychology, where there is considerable work in developing and delineating procedures with evidence on their behalf (e.g., Kratochwill, 2006). There are other non-treatment areas as well that are now evidence based, including social policy, law, economics, and morality (each can be easily searched on the web).
"Evidence-based interventions (EBIs)" has been used as the more generic term and applies to an expanding range of disciplines (e.g., social work, speech and language, rehabilitation) that are committed to drawing on their evidence base. Both EBTs and EBIs refer to specific interventions that have outcome studies that attest to their efficacy. The movement toward EBTs or EBIs encompasses many different disciplines, countries, and professional groups, organizations, and agencies within a country.3

3 The various efforts have used many different terms and criteria (e.g., Kratochwill et al., 2009). For example, in the context of psychotherapy, the terms have included "evidence-based treatments," "empirically validated treatments," "empirically supported treatments," "evidence-based practice," and "treatments that work." The different terms are not completely interchangeable. For example, EBTs focus on interventions with supportive research. Evidence-based practice refers to clinical
Table 1.2 Criteria Used to Establish a Treatment or Intervention as Evidence Based

• Random assignment of subjects to treatment and control or comparison conditions (e.g., no treatment, routine care, treatment as usual for the setting);
• The sample has been well specified (inclusion and exclusion criteria);
• Treatment manuals specify the intervention procedures that were used;
• Multiple outcome measures (raters, if used, are naive to conditions);
• Statistically significant differences between treatment and a comparison condition(s);
• Two or more randomized controlled studies attest to the effects of treatment; and
• The studies include replication of the findings beyond the original investigator or originator of the treatment.

Note: As noted in the text, delineating treatments as evidence based reflects a broad movement involving multiple professional groups and organizations and many different countries. There is no single set of criteria, but those selected for this table are among those most commonly invoked.

Understandably, the terminology and definitions have varied widely, and the variation continues to increase. For example, in the United States and Canada, states, provinces, and the agencies of national governments are defining EBTs to guide psychological services; the definitions can vary widely, and therefore so can the interventions that are selected. In research, the criteria to be counted as evidence-based vary; this variation was evident even from early efforts to delineate the criteria (see Chambless & Ollendick, 2001). Table 1.2 includes criteria commonly used or invoked. As evident in the table, the criteria emphasize sound research methodology, including careful specification of who the clients are and the procedures that comprise the intervention, as well as replication of intervention effects.
Establishing EBIs requires rigorous and well-controlled research. Interestingly, early efforts to specify what this research would entail included both between-group studies (randomized controlled trials) and single-case experimental designs (Chambless & Ollendick, 2001). This is remarkable given the general unfamiliarity of single-case designs. Even so, over time and across many efforts to delineate interventions as evidence based, randomized controlled trials (RCTs) have emerged as primary and more often than not the sole criterion. An RCT is a between-group study where participants are assigned randomly to one of the conditions (e.g., treatment, variants of a treatment, control). An RCT is viewed by a broad scientific community as the "gold standard" for intervention research and is usually required to establish the effectiveness of a new treatment (e.g., cancer treatment, medication). Such trials address key methodological concerns that can arise in research, as discussed in the next chapter, and can discern whether one intervention is better than another.

3 (Continued) work in which practitioners integrate evidence about interventions, their clinical judgment and experience, and contextual factors about the clients (American Psychological Association, 2005; Institute of Medicine, 2001). As of this writing, it is not clear that there is evidence for "evidence-based practice," that is, that the integration can be done reliably and makes a difference in helping patients when it does, and that it provides an increment in outcome effects that would surpass relying on evidence related to the intervention. I use EBI here as the more general term to encompass any program, treatment, or strategy with evidence that approximates the criteria discussed in this section and the many disciplines involved (e.g., education, clinical psychology).
EBIs and RCTs raise interesting challenges and concerns that set the stage for single-case designs. First, there are so many interventions that they could not all be subjected to randomized trials. For example, in the context of psychotherapy, there are hundreds of variations in use. The vast majority of these have not been evaluated at all, but they remain in use. Funds, researchers, and other resources are not available to evaluate each in an RCT. Are there alternative ways to evaluate interventions that might be feasible?
Second, intervention delivery in RCTs is very carefully controlled to ensure that the data are interpretable at the end of the study. The carefully controlled conditions of such trials have led to concerns about the generality of the results. That is, will the findings (e.g., treatment A is effective) apply to clinical situations where who receives and provides the services is not so meticulously regulated (e.g., Hunsley, 2007; Wampold, 2001; Westen, Novotny, & Thompson-Brenner, 2004)? Is there a way to test treatments as they are extended to clinical practice or service situations?
Third, whether one uses an EBT or one's own brand of individualized treatment, one cannot be sure, in principle or practice, that the treatment will be effective. Generalizing from research, experience, and their combination is always probabilistic and does not guarantee an outcome. EBTs of all sorts (e.g., aspirin, bypass surgery, plastic surgery, chemotherapy, antidepressant medication) cannot be depended on to produce the desired outcome without exception. We take systematic evaluation as pivotal in research. However, it is no less important in the application of interventions for the individual (Kazdin, 2008b).
In short, the move toward EBIs underscores the importance of using interventions with evidence in their behalf. In the process, concerns about whether the treatments will work in practice have raised the question of how one could know or tell. One needs assessment and design strategies that could be applied to individual cases, whether adults in therapy or students in special education or other classroom programs.

Increased Evaluation and Accountability

EBIs are part of a broader movement of increased accountability in intervention work. Much of this is prompted by cost control and funding agencies—managed care, insurance, third-party payers, and government agencies—in the context of medical, psychological, and educational interventions but other areas as well (e.g., rehabilitation, services for the elderly). The guiding questions are, "What are we getting for our money?" and "Do any of these interventions (e.g., educational reforms and fads) make any difference?" Third-party payers (e.g., managed-care companies in the United States) are increasingly interested in the use of interventions that have evidence in their behalf, especially if the treatments reduce costs of care. There is also a concern about increased evaluation, that is, obtaining data about what is provided and what effects it is having.
Consider the delivery of psychological services and the practice of psychotherapy to convey concerns about evaluation and accountability. In psychotherapy for children, adolescents, or adults, treatment is delivered by a practicing clinician. Patient progress is usually evaluated on the basis of clinician impressions, as opposed to systematic observations using validated measures. This is the anecdotal case study "methodology" mentioned at the outset of the chapter. There is a notoriously poor record of the reliability of such judgments (e.g., Dawes, 1994; Garb, 2005)—no fault of clinicians but rather limits of normal human perception, cognitive processes, and memory and recall.
Advances have been made in developing measures that are feasible, user (clinician, patient) friendly, and well validated in the context of clinical work. For example, one measure (the Outcome Questionnaire 45 [OQ-45]; Lambert et al., 1996) is a self-report scale for adults designed to evaluate client progress (e.g., weekly) over the course of treatment and at termination. The measure requires approximately 5 minutes to complete and provides information on four domains of functioning, including symptoms of psychological disturbance (primarily depression and anxiety), interpersonal problems, social role functioning (e.g., problems at work), and quality of life (e.g., facets of life satisfaction). The measure has been evaluated extensively and applied to over 10,000 patients (see Lambert, Hansen, & Finch, 2001; Lambert et al., 2003).4

Obtaining continuous measures over the course of treatment is pivotal for clinical care to ensure that outcomes are being achieved and to make decisions about when to alter or perhaps end treatment. The collection of continuous measures over time in this fashion is a core component of single-case designs. Once such assessment is in place, all sorts of questions can be asked about treatment. In clinical settings, these questions are about how to improve this patient's care, how to evaluate if changes are occurring in the many contexts in which they are needed, and more.
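A minimal Python sketch of what such session-by-session monitoring might look like follows. The weekly scores, clinical cutoff, and change threshold are hypothetical values invented for illustration, not published OQ-45 norms:

# Hypothetical weekly outcome scores for one client (lower = better).
weekly_scores = [82, 78, 75, 71, 74, 66, 60, 55]
CLINICAL_CUTOFF = 63    # hypothetical boundary of the clinical range
CHANGE_THRESHOLD = 14   # hypothetical magnitude for a reliable change

baseline = weekly_scores[0]
for week, score in enumerate(weekly_scores, start=1):
    reliably_improved = (baseline - score) >= CHANGE_THRESHOLD
    in_normal_range = score < CLINICAL_CUTOFF
    if reliably_improved and in_normal_range:
        status = "recovered"
    elif reliably_improved:
        status = "reliably improved"
    else:
        status = "no reliable change yet"
    print(f"week {week}: score {score:>3} -> {status}")

The value of the continuous record is visible even in this toy example: a flat or worsening run of scores would prompt the clinician to alter treatment well before termination, rather than relying on impressions alone.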
The practicing clinician is confronted with the individual case, and it is at the level of the clinical case that empirical evaluations of treatment need to be made. The problem, of course, is that the primary investigative tool for the clinician has been, and in the majority of instances continues to be, the uncontrolled case study in which anecdotal information is reported and scientifically acceptable inferences cannot be drawn. There has been a lengthy history of researchers and practitioners advocating the study of the individual to improve the quality of clinical work (e.g., Chassan, 1967; Shapiro, 1961a, 1961b; Shapiro & Ravenette, 1959). Much of this history has fallen on deaf ears, which we can tell by viewing how current practice is routinely conducted and by seeing continued pleas to evaluate clinical work by authors who know that pleading is not a very effective intervention (Borckardt et al., 2008; Kazdin, 2008b). Also, as we discuss in a later chapter, there are ways to improve the uncontrolled case study to increase the scientific yield and improve patient care.
At this point in time and in the very immediate future, the primary pressure on practice will be to justify what interventions are being used. It is still the case that an EBI is not necessarily used even when one is available. Pressure is changing on that front. In addition, increased accountability in the form of measurement of patient care and change is more salient than ever before. Single-case designs discussed in this book provide multiple options that would facilitate the evaluation of interventions in applied settings.

General Comments

An important context for considering single-case designs is the increased accountability in intervention work. Whether in the contexts of individual psychotherapy,

4 For a review and update of research on the OQ-45, an excellent overview is provided at www.nrepp.samhsa.gov/programfulldetails.asp?PROGRAM_ID=191. Also, the 45-item measure has been reduced to 30 items and also has been validated (Lambert et al., 2004).
classroom programs to prevent or alter some problem, or school- and school-district-wide interventions, some means of evaluation is likely to be requested. Methods of evaluation often used to establish the effectiveness of an intervention (randomized controlled trials) usually are not feasible in applied settings. Also, we want professionals on the everyday battlefields (e.g., clinicians, teachers, program directors) to evaluate what they are doing, to innovate and develop new interventions, and to feed into as well as draw from large-scale controlled studies. Single-case designs represent a broad methodology that not only includes many design options, but also provides a continuum of conditions that vary in their feasibility for implementation in applied settings. This book elaborates experimental designs and approximations that still permit one to draw inferences about intervention effects.

CONTEXTS AND PERSPECTIVE

I have provided four contexts as a way of introducing single-case designs. The contexts convey different points that serve as underpinnings for subsequent chapters. First, the discussion of the uncontrolled case study was not designed to lambaste uncontrolled anecdotal case reports. Such reports are an easy methodological target, but there is something more useful and important to convey than their deficiencies. Rather, it is useful to recognize that in principle an uncontrolled case study can be methodologically (not just subjectively) persuasive and can make implausible potentially competing explanations of what caused a change. A case can yield inferences that are valid; how to accomplish this serves as a basis of a later discussion (Chapter 11). The key point was to convey that a case (N = 1) or group (N = 1,000) is not the issue. Rather, the overarching methodological priority is always whether an arrangement (design, circumstance) allows valid inference to be drawn. Even in some case studies, the answer is yes.
Second, I highlighted historical issues related to research with individual subjects. Here the key point was to note that the evaluation of groups, group comparisons, and statistical analyses were not always the norm in psychology. Research with one or a few subjects was the rule and was the basis for many advances. In most graduate training programs in psychology, education, and counseling, for example, history of the field receives diminished emphasis and classroom time. With astounding advances and new information that must be added to any curriculum and a fixed or limited amount of training time, this is understandable. However, we unwittingly convey that how things are now (group research, quantitative tradition) is the way things always were and, perhaps by implication, how they rightfully ought to be for some good, but never specified, reason. Not long ago, research focused on one or a few subjects. This was considered not only a legitimate research approach but the research approach. Focus on single-case designs actually adds rigor to that tradition, a point conveyed in the next chapter.
Third, I discussed the development and proliferation of single-case designs and how they grew out of and were central to the experimental analysis of behavior. The experimental analysis includes a substantive field (basic human and non-human animal research in operant conditioning) and a methodological approach (single-case designs and all that entails). When the field began to apply key findings and techniques to individuals in everyday life, the area of applied behavior analysis emerged, proliferated, and flourished. The content (often techniques based on findings in operant conditioning) and methodology (single-case designs) began close to the basic laboratory
ties, and in many ways the ties remain. However, single-case designs and their scope of application have extended remarkably in the range of settings, populations, and intervention techniques. That said, it is still the case that applied behavior analysis utilizes single-case designs and has delineated an array of design options, assessment strategies, and combinations that are illustrated in this book. Still, the methodology is very widely applicable, and the book can convey key methodological points by sampling from the richness beyond any single field or area of research. As I alluded to, these extensions have influenced single-case methods and how they are used.
Last, I discussed contemporary issues that make single-case research especially relevant in applied settings. The emergence of EBIs has raised the question, "Can these interventions be extended to everyday settings?" Randomized controlled trials have done their job to establish these interventions as effective. Now, how can we apply these interventions and evaluate their impact on patient care in everyday settings? The luxury of controls, randomization, and so on is not possible in clinical, educational, rehabilitation, and other settings. Do individuals get better with the interventions, how do we know, and how can we show this? Single-case designs are well suited to testing the application of findings from large-scale randomized controlled trials and addressing these questions.
Single-case research is not merely a handmaiden of controlled trials and a way of testing what other methodologies show. In many applied settings (schools, clinics, camps, recreational and rehabilitation facilities), those in charge innovate creative programs that they believe are effective. The choices for evaluating any program seem to be: a randomized trial, an anecdotal case study, or an open study.5 This restricted set of choices in part may be why so few programs have any evaluation whatsoever. A randomized controlled trial is rarely feasible; the yield from an anecdotal case study or open study is minimal and scientifically unacceptable. Actually, single-case experimental designs and quasi-experimental single-case designs, all elaborated in subsequent chapters, fill a huge gap. Valid inferences can be drawn about the impact of programs in ways that are more feasible than randomized trials. Moreover, the designs allow for developing more effective programs. Ongoing data within the designs can be, and often have been, used to make programs more effective while they are being implemented.
Apart from EBIs, increased accountability in services in general inadvertently promotes single-case designs. Are the treatments, educational regimens, and policies working in this setting and with this client, patient, or student? There is increased interest in having data-based answers to these questions and weaving evaluation in with service delivery. This is important not for some methodological ideal; rather, the quality of care depends on having information about impact. This leaves aside some of the

5 An "open study," a term often used in medical research, refers to uncontrolled investigations that omit pivotal controls and hence are not true experiments. Two of the more common controls that are suspended are the use of a control group and masking of who receives the intervention. The absence of a control group is reflected in the fact that open studies are often pre-post (before-after) comparisons of patient functioning. All patients receive some treatment (e.g., medicine, psychotherapy), but there is no control group for comparison. "Masking" refers to ensuring that the investigators and treatment administrators do not know who received what condition (e.g., "double blinding"). Masking procedures are routinely used in medical studies to eschew bias in influencing subjects. In an open study all patients receive the intervention, and everyone knows about it.
driving forces behind accountability, such as costs (e.g., wasted money) on programs that seem like a good idea at the time but have no data to show that they helped. Single-case experimental designs can make a difference.

OVERVIEW OF THE BOOK
This text describes and evaluates single-case designs and is divided into five units or sections, each with its own chapters. The sections are Background to the Designs, Assessment, Major Design Options, Evaluating Single-Case Data, and Perspectives and Contributions of the Designs. Assessment, design, and data evaluation are the core ingredients of research methodology, whether single-case or more traditional between-group research. It is important to convey the flow of designs and decisions with this organization but also to help relate the underpinnings and practices of research traditions to each other. Single-case designs are not fundamentally different from group designs in terms of the goals and the means through which these are achieved. There are starkly different practices, but it is critical to convey the commonalities.
The purpose of research, whether single case or between groups, is to draw valid inferences. Experimentation consists of arranging the situation in such a way as to rule out or make implausible the impact of extraneous factors that could explain the results. Chapter 2 discusses key factors that experimentation attempts to rule out in order to permit inferences to be drawn about intervention effects.
Single-case designs depend heavily on assessment procedures. Chapters 3, 4, and 5 convey fundamentals of assessment in single-case research, commonly used strategies for observing behavior, and the different conditions and situations in which assessments are obtained. Ways to ensure and evaluate the quality, integrity, reliability, and validity of assessments also are discussed.
The precise logic and unique characteristics of single-case experimental designs are introduced in Chapter 6. The manner in which single-case designs make and test predictions about performance within the same subject underlies all of the designs, and the rationale shares features with the more familiar between-group designs. In Chapters 6 through 10, several different experimental designs and their variations, uses, and potential problems are detailed. Chapter 11 presents quasi-experimental single-case designs and includes multiple options for situations in which the conditions cannot be completely controlled but one wants to draw inferences about intervention effects.
Single-case designs have relied heavily on visual inspection of the data to evaluate the extent to which the intervention has led to and accounts for change. This is in sharp contrast to the more familiar tests of statistical significance commonly used in between-group research. The underlying rationale and methods of visual inspection are discussed and illustrated in Chapter 12. Visual inspection is facilitated by graphing the data. Options for graphing and how these options aid visual inspection serve as the basis of Chapter 13.
Although problems, considerations, and issues associated with specific designs are discussed throughout the text, it is useful to evaluate single-case research critically and more broadly. Chapter 14 provides a discussion of issues, problems, and limitations of single-case experimental designs. Finally, the contribution of single-case research to experimentation in general and the interface of alternative research methodologies are examined in Chapter 15.
Statistical analyses are infrequently used as the basis of drawing inferences about intervention effects. Instructors who teach single-case designs comment that it would be useful to note that such tests exist and to illustrate their application. However, statistical analysis is not central to the methodology as it is in between-group null hypothesis testing. Nevertheless, statistical analyses are occasionally used and readily available. The appendix at the end of the book provides a more in-depth discussion of data evaluation in single-case research to encompass the strengths and weaknesses of both visual inspection and statistical analyses, illustrate statistical analyses, and convey significant advances in statistics for the single case as well as the dilemmas these advances raise.
CHAPTER 2

Underpinnings of Scientific Research

CHAPTER OUTLINE

Drawing Valid Inferences
Parsimony
Plausible Rival Hypotheses
Threats to Validity
Internal Validity
Defined
Threats to Internal Validity
General Comments
External Validity
Defined
Threats to External Validity
General Comments
Construct Validity
Defined
Threats to Construct Validity
General Comments
Data-Evaluation Validity
Defined
Threats to Data-Evaluation Validity
General Comments
Priorities and Trade-offs in Validity
Experimental Validity in Context
Summary and Conclusions

The previous chapter placed single-case designs in different contexts to convey their lineage, roots, and precedents. Before the methodology is detailed, it is critical to provide the key context, namely, why we do research in the first place and what the designs are trying to accomplish. There are all sorts of techniques, practices, and procedures that are used in research (e.g., various control groups, measuring performance on one [post], two [pre-post], or multiple occasions). Learning the practices is the easier part of methodology. It is important to convey the rationale for why these practices have become important. This is not an intellectual exercise. There are many occasions,


especially in applied settings (classroom, school, clinic, rehabilitation center) in w hich


som e critical com ponent or two from a betw een-group o r single-case design cannot
be implemented. Will this be a disaster in draw ing inferences or persuading colleagues
that we have an effective intervention? T he answ er lies in understanding what one is
tryin g to accomplish rather than whether one can do th is or that practice. There are
m an y strategies to accom plish the goal even if an often ch erish ed practice cannot be
im plem ented.
For those trained in between-group research, here is a heart-stopping example. Suppose one wants to conduct an experiment and cannot assign subjects to conditions randomly. That is not necessarily a problem or grounds for MAD (methodologist affective disorder). What is randomization for anyway? Two of the answers include decreasing the likelihood that groups will differ on nuisance variables and addressing some of the assumptions for various statistical tests. We can accomplish the goals of randomization or selectively neglect them in ways that will not necessarily make a difference in our final results (Shadish & Ragsdale, 1996). This sounds like methodological heresy—it actually is methodological gospel. (I always have trouble separating these.) I am not "for or against" randomization; I am for drawing valid inferences and having as many methodological arrows in my quiver to shoot at artifacts and biases that are ready to interfere with that. The randomization arrow is wonderful but does not always hit the target, that is, accomplish what it is supposed to (Hsu, 1989), and other arrows can help.
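The following minimal Python sketch (hypothetical data invented for illustration) shows both what simple random assignment does and why it is no guarantee: it makes groups unlikely to differ systematically on a nuisance variable such as age, but any single draw can still be unbalanced by chance:

import random
import statistics

random.seed(1)

# Forty hypothetical participants with a nuisance variable (age) that
# we would like the groups not to differ on systematically.
ages = [random.randint(18, 65) for _ in range(40)]

# Simple random assignment: shuffle the indices and split them in half.
order = list(range(len(ages)))
random.shuffle(order)
treatment, control = order[:20], order[20:]

print("treatment mean age:", statistics.mean(ages[i] for i in treatment))
print("control mean age:  ", statistics.mean(ages[i] for i in control))

# On any single draw the two means can still differ by chance; random
# assignment only makes a systematic difference unlikely, which is why
# other arrows (e.g., checking balance) can still help.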
The methodological goal: we must arrange the situation to obtain valid inferences, that is, inferences in which we can be confident that there are no biases and artifacts that could explain the finding. In science, replication is the key protection. Within a given study, ruling out or making implausible competing explanations that could account for our intervention effect is the protection. Experimentation is needed to examine specifically why change has occurred. Through experimentation, extraneous factors that might explain the results can be ruled out to provide an unambiguous evaluation of the intervention and its effects. This chapter discusses the purposes of experimentation and the types of factors that must be ruled out if valid inferences are to be drawn.

DRAWING VALID INFERENCES

The purpose of research in general is to examine relationships between or among variables. The unique feature of experimentation is that it examines the direct influence of one variable (the independent variable) on another (the dependent variable). Experimentation usually evaluates the influence of a small number of variables under conditions that will permit unambiguous inferences to be drawn. Experiments help simplify the situation so that the influence of the variables of interest can be separated from the influence of other factors. Drawing valid inferences about the effects of an independent variable or intervention requires attention to a variety of factors that potentially obscure the findings. Three key concepts serve as a useful guide for designing experiments and interpreting the results of completed studies.

Parsimony
We are guided by concepts that pervade all science and that are central to methodology, not just the substantive areas in which we do research. Parsimony is a guiding principle
that underscores how to select among competing interpretations or explanations of a finding. Other things being equal, we select the simplest explanation among competing alternatives that explains the finding or phenomenon of interest. There are other names for the guideline, and they convey the thrust that is intended. Among the other terms are the "principle of economy," "principle of unnecessary plurality," "principle of simplicity," and "Occam's razor." The latter is the more familiar and is a useful place to begin. William of Ockham (ca. 1285-1349) was an English philosopher and Franciscan monk. He applied the notion (that plurality [of concepts] should not be posited without necessity) in the context of writing on epistemology in which he advocated that concepts ought not to be added (plurality) if they are not needed (necessity) in accounting for phenomena. Supposedly, his frequent and sharp invocation of the principle accounts for why the term "razor" was added to his (Latinized) name to form Occam's razor.
In the context of science, parsimony means that if competing views or interpretations of any phenomena can be proposed, one adopts the simplest one that can explain the data or information. The presumption of parsimony is not tantamount to saying that things are simple or can be explained simply but rather that the simplest account of the data or phenomena is the one we adopt, until there is a need to move to more complex interpretations.
A familiar issue that illustrates parsimony pertains to the existence of unidentified flying objects (UFOs, aka “flying saucers”) from extraterrestrial places. Many sightings of UFOs have been reported for centuries, with more recent sightings associated with photos and videos. There are many explanations of these sightings, but I only briefly entertain two. First, the “natural” (i.e., earthly) phenomenon view would explain the sightings as a result of ordinary phenomena in our world that give an appearance of something not so ordinary, namely, UFOs. Special atmospheric conditions, meteors shooting across the sky, weather balloons, and highly secret military aircraft would be among these phenomena and would explain such sightings.
Second, the extraterrestrial view is that beings and their flying objects come from another planet or galaxy and that they periodically visit. All else being equal, the natural phenomenon view provides a more parsimonious explanation because we do not have to introduce new complexities related to the existence of beings (we do not know of beings on other planets), equipment (novel yet-to-be-invented machines), and incomprehensible ways of travel (above land and some allegedly under water). On the other hand, the public does not have access to more information that is kept secret. That information could easily make the natural explanation simply inadequate. For example, if there really were any physical evidence of the flying vehicles (e.g., from abandoned UFOs on the side of the highways, from UFO crashes, or UFOs that were parked illegally, towed and “booted” by the police and never reclaimed) or of the beings that pilot them, or corroborated evidence unequivocally showing that the objects attained speeds and agility that exceed what we can accomplish by our current technology, then the extraterrestrial explanation becomes the parsimonious one. The natural phenomenon view would not be able to explain this new evidence without all sorts of new concepts and reasons to accommodate each sign of physical evidence. As the “natural” explanation squirms to add the many different reasons that the new information requires, we have parsimony and evidence favoring the second explanation. It might be a simpler account of all the data (requiring fewer explanations) to just state that there must be extraterrestrial life flying around and visiting us.
Some key points: A more parsimonious view is not necessarily true or accurate. Also, parsimony does not mean the explanation is simple. In fact, combined views that are complex may end up being the most parsimonious. In the preceding example, we know for certain that the natural phenomena view explains many sightings that people report, but we do not yet know if all the available evidence (to which we do not have public access) can be explained by that view. Both views combined may be the best explanation of the data. Also, parsimony combined with evidence may or may not make one of the original interpretations implausible. Aside from explaining substantive questions within a given science, parsimony is central to methodology. From the standpoint of interpreting the results of a study, we begin with a nod to parsimony by asking, “Can we explain the data with concepts and phenomena we already know?”

Plausible Rival Hypotheses
Methodology is all about the conclusions that are reached from a study. At the end of the study we find that there are group differences (between-group study) or clear changes in the individual while some intervention was in place (single-case design). There are many possible explanations for that effect (and these are discussed in this chapter). Of all the explanations, the investigator wishes to say, “It was the intervention that explains the findings.” However, the rest of the scientific community (one’s peers), by their training, is ready to say, “Maybe, but was the study designed and evaluated to handle all of the other reasonable or plausible explanations?”
Central to methodology and conclusions in a particular study is the notion of a plausible rival hypothesis (Cook & Campbell, 1979). A plausible rival hypothesis refers to an interpretation of the results of an investigation on the basis of some influence other than the one the investigator has studied or wishes to discuss. The question to ask at the completion of a study is whether there are other interpretations that are plausible to explain the findings. This sounds so much like parsimony that the distinction is worth making explicit. Parsimony refers to adopting the simpler of two or more explanations that account equally well for the data.
A plausible rival hypothesis has a slightly different thrust. At the end of the investigation, are there other plausible interpretations of the findings than the one advanced by the investigator? Simplicity of the interpretation (parsimony) may or may not be relevant. At the end of the study, there could be two or ten equally complex interpretations of the results, so parsimony is not the issue. For example, an investigator wishes to see whether ethnicity contributes to the diet (e.g., proportion of fiber and vitamins in the food) of elderly people and to how healthy these people are (e.g., instances and duration of illnesses, hospital visits in the past year). Let us say that two different ethnic groups are shown to differ on both diet and health. At the end of the study, the investigator discusses ethnic differences and how these explain the findings. A plausible rival hypothesis might be socioeconomic status (e.g., family income, occupational status, education, and living conditions). That is, if socioeconomic status (SES) was not controlled in this study, then SES becomes a plausible rival hypothesis. Is SES more parsimonious? Maybe, maybe not. The findings might be equally accounted for by posing one influence versus another (an SES or an ethnic difference), so that whether one is simpler than another is
arguable. Parsimony (the simpler explanation that can account for the data) and plausible rival hypotheses (other interpretations of the data that could readily explain the effect, whether simple or not) often overlap. Yet a plausible rival hypothesis does not necessarily invoke simplicity as a key feature but rather only asks whether some other interpretation is plausible. Plausibility derives from whether there is a reasonable basis to say that the findings could be explained in some other way. Often plausibility stems from the fact that other, prior studies have shown that a particular influence has an effect very much like the one produced in the new study. In this hypothetical example, we know in advance of the study that income, occupational status, education, and living conditions (e.g., living in crowded neighborhoods, in an area where there are more pollutants in the air) all correlate with illness. Living conditions, as much as diet, might explain any health differences unless they are ruled out.
Methodological practices are intended to rule out or to make implausible competing interpretations of the results. At the completion of a study, the explanation one wishes to provide ought to be the most plausible interpretation. This is achieved not by arguing persuasively, but rather by designing the study in such a way that other explanations do not seem very plausible or parsimonious. The better designed the experiment, the fewer the alternative plausible explanations that can be advanced to account for the findings. Ideally, only the effects of the independent variable (intervention) could be advanced as the basis for the results.

Threats to Validity
So what are all of these rival interpretations of the data that might plausibly explain the results? They are called threats to validity and refer to methodological issues that are likely to rival the explanation that it was the intervention (or experimental manipulation) that explained the effect. Four types of experimental validity address these purposes in different ways: internal, external, construct, and statistical conclusion validity (Cook & Campbell, 1979). These types of validity serve as a useful way to convey and remember several key facets of research and the rationale for many methodological practices. Table 2.1 lists each type of validity and the broad question each addresses. Each type of validity is pivotal. Together they convey many of the considerations that investigators have before them when they design an experiment. These considerations then translate to specific methodological practices (e.g., random assignment, selection of control groups).
Threats to validity have been devised and elaborated in the context of between-group research. Some of these derive from features that are more characteristic of between-group rather than single-case designs or vary in how they emerge in the different design strategies. I have adapted these here to convey the likely methodological issues and rival explanations that can interfere with drawing valid inferences in single-case designs.¹

¹ There is no single fixed list of threats to each type of validity. Methodology advances and evolves just as substantive findings do in an area of science. The threats discussed in this chapter constitute those that are most fundamental and applicable to single-case designs. For a more extensive discussion of the threats to each type of validity, other sources can be consulted (Cook & Campbell, 1979; Kazdin, 2003; Shadish, Cook, & Campbell, 2002).

Table 2.1 Types of Experimental Validity and Questions or Issues They Address

Internal validity: To what extent can the intervention, rather than extraneous influences, be considered to account for the results, changes, or differences among conditions (e.g., baseline, intervention)?

External validity: To what extent can the results be generalized or extended to people, settings, times, measures or outcomes, and characteristics other than those included in this particular demonstration?

Construct validity: Given that the intervention was responsible for change, what specific aspect of the intervention was the mechanism, process, or causal agent? What is the conceptual basis (construct) underlying the effect?

Data-evaluation validity: To what extent is a relation shown, demonstrated, or evident between the intervention and the outcome? What about the data and methods used for evaluation could mislead or obscure demonstrating or failing to demonstrate an experimental effect?

Note: These threats apply generally to all research, although they originally were identified and discussed primarily in the context of between-group studies. They are discussed in this chapter in relation to single-case designs and serve as underpinnings for several design options and strategies to strengthen inferences discussed in subsequent chapters.

The purpose of research is to reach well-founded (i.e., valid) conclusions about the effects of a given intervention and the conditions under which it operates. A useful distinction in research is the difference between findings and conclusions. The findings are a descriptive feature of the study and include what was found (e.g., when the intervention was in effect, reading went up by 100% in a single-case design, or people who drink wine have fewer heart attacks in a case-control between-group study). The conclusions refer to the basis or explanation of the finding. The investigator wishes to say that it was the intervention, but has the demonstration (measures, design, and data evaluation) made the threats to validity implausible? We address threats to help our findings and conclusions converge justifiably on the same explanation: it was the intervention. Single-case designs have many elegant ways of establishing that.

INTERNAL VALIDITY

Defined
The task for experimentation is to examine the influence of a particular independent variable or intervention in such a way that extraneous factors will not interfere with the conclusions that the investigator wishes to draw. When the results can be attributed, with little or no ambiguity, to the effects of the independent variable, the experiment is said to be internally valid. Internal validity refers to the extent to which an experiment rules out alternative explanations of the results. Factors or influences other than the independent variable that could explain the results are called threats to internal validity.

Threats to Internal Validity
An experiment ought to be designed to make major threats to internal validity implausible. Even though the changes in performance may have resulted from the intervention

Table 2.2 Major Threats to Internal Validity

History: Any event (other than the intervention) occurring at the time of the experiment that could influence the results or account for the pattern of data otherwise attributed to the intervention. Historical events might include family crises; change in job, teacher, or spouse; power blackouts; or any other events.

Maturation: Any change over time that may result from processes within the subject. Such processes may include growing older, stronger, healthier, smarter, and more tired or bored.

Instrumentation: Any change that takes place in the measuring instrument or assessment procedure over time. Such changes may result from the use of human observers whose judgments about the client or criteria for scoring behavior may change over time.

Testing: Any change that may be attributed to the effects of repeated assessment. Testing constitutes an experience that, depending on the measure, may lead to systematic changes in performance.

Statistical regression: Any change from one assessment occasion to another that might be due to a reversion of scores toward the mean. If clients score at the extremes on one assessment occasion, their scores may change in the direction toward the mean on a second testing.

Diffusion of treatment: Diffusion of treatment can occur when the intervention is inadvertently provided during times when it should not be (e.g., return-to-baseline conditions) or to persons who should not yet receive the intervention at a particular point. The effects of the intervention will be underestimated if it is unwittingly administered in intervention and nonintervention phases.
or independent variable, the factors listed in Table 2.2 might also explain the results. If inferences are to be drawn about the intervention (independent variable), then the threats to internal validity must be ruled out. To the extent that each threat is ruled out or made relatively implausible, the experiment is said to be internally valid.
History and maturation, as threats to internal validity, are relatively straightforward (see Table 2.2). History refers to events in the individual’s environment (e.g., at home, at school, a novel event in the news); maturation refers to processes, usually within the individual, over time (e.g., maturing, habituating to something in the environment). Administration of the intervention may coincide with special or unique events in the client’s life or with maturational processes within the person over time. The design must rule out the possibility that the pattern of results is likely to have resulted from either one of these threats. In single-case designs, the pattern of data over time could be consistent with history or maturation whether there were rapid or gradual changes (e.g., improvement).
The potential influence of instrumentation also must be ruled out. It is possible that the data show changes over time not because of progress in the client’s behavior but rather because the observers have gradually changed their criteria for scoring client performance. The instrument, or measuring device, has in some way changed. If it is possible that changes in the criteria observers invoke to score behavior, rather than actual changes in client performance, could account for the pattern of the results, instrumentation serves as a threat to internal validity.
Testing refers to completing a measure on more than one occasion. For many measures (e.g., personality measures, intelligence measures), performance on the second occasion is often better than on the first occasion. In much of group research, the assessment devices are administered on two occasions, before and after treatment. Improvements could be due to repeated experience with the measure. In single-case research, performance is assessed on multiple occasions. Testing is a threat if there is any reason to believe that experience and exposure to the measure alone could explain improvement.
Statistical regression refers to changes in extreme scores from one assessment occasion to another. When persons are selected on the basis of their extreme scores (e.g., those who score low on a screening measure of social interaction skills or high on a measure of hyperactivity), they can be expected on the average to show some changes in the opposite direction (toward the mean) at the second testing merely as a function of regression. If treatment has been provided between these two assessment occasions (e.g., pre- and posttreatment assessment), the investigator may believe that the improvements resulted from the treatment. However, the improvements may have occurred anyway as a function of regression toward the mean, that is, the tendency of scores at the extremes to revert toward mean levels upon repeated testing.² The effects of regression must be separated from the effects of the intervention.
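The arithmetic behind regression toward the mean can be made concrete with a small simulation. The sketch below is purely illustrative (the scores, sample size, and reliability are assumed values, not data from any study): it screens in the top 10% of scorers on a first testing and shows that, with no intervention at all, their average drops partway back toward the population mean on a second testing.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    true_score = rng.normal(50, 8, n)          # stable individual differences
    test1 = true_score + rng.normal(0, 8, n)   # occasion 1 = true score + error
    test2 = true_score + rng.normal(0, 8, n)   # occasion 2, new independent error

    high = test1 >= np.percentile(test1, 90)   # screen in extreme (top 10%) scorers
    print(round(test1[high].mean(), 1))        # roughly 70, far above the mean of 50
    print(round(test2[high].mean(), 1))        # roughly 60, partway back toward 50

The expected retest score here is the mean plus reliability × (initial score − mean); with a test-retest reliability of .50, as in this sketch, extreme scorers revert about halfway to the mean even though no one received any treatment.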
In group research, regression effects are usually ruled out by including a no-treatment group and by randomly assigning subjects to all groups. In this way, if there is regression it would be evident equally in both groups; changes above and beyond those are likely to be due to the intervention. In single-case research, inferences about behavior change are drawn on the basis of repeated assessment over time. Although fluctuations of performance from one day or session to the next may be based on regression toward the mean, this usually does not compete with drawing inferences about treatment. If there is a phase in single-case designs (e.g., baseline, or the period before the intervention begins) with only one observation occasion, we do not know if this is a reliable measurement and typical performance or one that might be extreme. If extreme, the next observation might well be in the opposite direction merely due to regression. In single-case research, regression usually cannot account for the pattern of data because of assessment on several occasions over time and across multiple conditions (e.g., intervention, no intervention).
Diffusion of treatment is one of the more subtle threats to internal validity. When the investigator is comparing treatment and no treatment, or two or more different treatments, it is important to ensure that the conditions remain distinct and include the intended intervention. Occasionally, the different conditions do not remain as distinct as intended. For example, the effects of parental praise on a child’s behavior in the home might be evaluated in a single-case design in which praise is given to the child in some phases and withdrawn in other phases. It is possible that when parents are instructed to cease the use of praise, they may continue anyway. The results may show little or no difference between treatment and “no-treatment” phases because the treatment was inadvertently administered to some extent in the no-treatment phase. The diffusion of treatment will interfere with drawing valid inferences about the impact of treatment and hence constitutes a threat to internal validity.

² Regression toward the mean is a statistical phenomenon that is related to the correlation between initial test and retest scores. The lower the correlation, the greater the amount of error in the measure, and the greater the regression toward the mean. It is important to note further that regression does not mean that all extreme scores will revert toward the mean upon retesting or that any particular person will inevitably score in a less extreme fashion on the next occasion. The phenomenon refers to changes for segments of a sample (i.e., the extremes) as a whole and how those segments, on the average, will respond.

General Comments
I have highlighted major threats to internal validity. These form a critical basis for understanding the logic of experimentation in general. The reason for arranging the situation to conform to one of the many experimental designs is to rule out the threats that serve as plausible alternative hypotheses or explanations of the results. Single-case designs can readily rule out the threats to internal validity; they do so just as well as between-group designs but quite differently from the way the threats are addressed in those designs.
Specific single-case designs rule out threats in different ways, as is discussed in subsequent chapters. Also discussed in later chapters are designs where one may need to be innovative to rule out the threats. This is why I emphasize the importance of understanding the threats and how they operate. Research is all about addressing key concepts (plausibility, threats to validity), and the practices we use are only helpful insofar as they address these.

EXTERNAL VALIDITY

Defined
Internal validity refers to the extent to which an experiment demonstrates that the intervention accounts for change. External validity addresses the broader question and refers to the extent to which the results of an experiment can be generalized or extended beyond the conditions of the experiment. In any experiment, questions can be raised about whether the results can be extended to other persons, settings, assessment devices, clinical problems, and so on, all of which are encompassed by external validity. Characteristics of the experiment that may limit the generality of the results are referred to as threats to external validity.

Threats to External Validity
A summary of the major threats to external validity is presented in Table 2.3. Each pertains to a feature within the experiment that might delimit the generality of the results. The factors that actually limit the generality of the results of an experiment may not be known until subsequent research expands on the conditions under which the relationship was originally examined. For example, the manner in which instructions are given, the age of the subjects, the setting in which the intervention was implemented, characteristics of the teachers or trainers, and other factors may contribute to the generality of a given finding. Technically, the generality of experimental findings can be a function of virtually any characteristic of the experiment. Some characteristics that may limit generality of the findings can be identified in advance.
Generality across subjects is a frequently raised concern in research, and especially in single-case research, as I discuss later. Even though the findings may be internally valid, it is possible that the results might only extend to persons very much like those included

Table 2.3 Major Threats to External Validity

Generality across subjects: The extent to which the results can be extended to subjects or clients whose characteristics may differ from those included in the investigation.

Generality across responses or measures: The extent to which the results extend to behaviors or domains not included in the program. These behaviors or domains may be similar to those focused on or may be entirely different areas of functioning.

Generality across settings: The extent to which the results extend to other situations in which the client functions beyond those included in training.

Generality across time: The extent to which the results extend beyond the times during the day that the intervention is in effect and to times after the intervention has been terminated.

Generality across behavior-change agents: The extent to which the intervention effects can be extended to other persons who can administer the intervention. The effects may be restricted to persons with special skills, training, or expertise.

Reactive experimental arrangements: The possibility that subjects may be influenced by their awareness that they are participating in an investigation or in a special program. The intervention effects may not extend to situations in which individuals are unaware of the arrangement.

Reactive assessment: The extent to which subjects are aware that their behavior is being assessed and that this awareness may influence how they respond. Persons who are aware of assessment may respond differently from how they would if they were unaware of the assessment.

Multiple-treatment interference: When the same subjects are exposed to more than one treatment, the conclusions reached about a particular treatment may be restricted. Specifically, the results may only apply to other persons who experience both of the treatments in the same way or in the same order.
in the investigation. For example, cultural, ethnic, or gender identity may somehow make a difference, and the findings might not generalize to a broad range of groups. Other features of the population, including special experiences, intelligence, age, and receptivity to the particular sort of intervention under investigation, may be potential qualifiers of the findings. For example, findings obtained with children might not apply to adolescents or adults; those obtained with individuals functioning well in the community might not apply to those with serious physical or psychiatric impairment; and those obtained with laboratory rats might not apply to other types of animals, including humans.
Generality across responses, settings, and time are potential threats to external validity. Would the same intervention achieve similar effects if other responses (e.g., completing homework, engaging in discussion), settings (e.g., at home), or times (e.g., after school) were included? Any one of these threats may place qualifiers or restrictions on the generality of the results. For example, the same intervention might not be expected to lead to the same results no matter what the behavior or problem is to which it is applied.
Generality of the behavior-change agent is a special feature, related to the setting and context, that warrants mention. As stated, the threat has special relevance for intervention research in which some persons (e.g., parents, teachers, hospital staff, peers, and spouses) attempt to alter the behaviors of others (e.g., children, students, psychiatric patients). When an intervention is effective, it is possible to raise questions about the generality of the results across behavior-change agents. For example, when parents are effective in altering behavior, could the results also be obtained by others carrying out the same procedures? Perhaps there are special characteristics of the behavior-change agents that have helped achieve the intervention effects. The clients may be more responsive to a given intervention as a function of who is carrying it out.
Reactivity of the experimental arrangement refers to the possibility that subjects are aware that they are participating in an investigation and that this knowledge may bear on the generality of the results. Reactivity refers to changes in performance resulting from that awareness or knowledge of participating. The experimental situations may be reactive, i.e., alter the behavior of the subjects because they are aware that they are being evaluated. It is possible that the results would not be evident in other situations in which persons do not know that they are being evaluated. Perhaps the results depend on the fact that subjects were responding within the context of a special situation. A familiar example of reactivity of arrangements: we are all wonderful drivers, of course, but we are usually a little more wonderful in the presence of police cars. The environmental stimuli (presence of police) constitute an arrangement that influences our behavior. The external validity question is: Does our very careful driving generalize to situations when these stimuli are not present? (Just try to squeeze into my lane in traffic and you will have my answer in a heartbeat!)
The reactivity of assessment warrants special mention even though it could be subsumed under reactivity of the experimental arrangement. If subjects are aware of the observations that are being conducted or when they are conducted, the generality of the results may be restricted. To what extent would the results be obtained if subjects were unaware that their behaviors were being assessed? Alternatively, to what extent do the results extend to other assessment situations in which subjects are unaware that they are being observed? Most assessment is conducted under conditions in which subjects are aware that their responses are being measured in some way. In such circumstances, it is possible to ask whether the results would be obtained if subjects were unaware of the assessment procedures. The reactivity of assessment too is familiar; various movies and television shows take advantage of assessing people under nonreactive conditions, that is, when they are not aware and are less likely to give their more social, guarded, and politically correct responses. The threat to external validity is whether the results of an experiment will only be evident when the individual is aware of the assessment procedures.
Multiple-treatment interference only arises when the same subject or subjects receive two or more treatments. In such an experiment, the results may be internally valid, that is, the threats to internal validity may be ruled out. However, the possibility exists that the particular sequence or order in which the interventions were given may have contributed to the results. For example, if two treatments are administered in succession, the second may be more (or less) effective than, or equally effective as, the first. The results might be due to the fact that the intervention came second and followed this particular other intervention. A different ordering of the treatments might have produced different results. Hence, the conclusions that were drawn may be restricted to the special way in which the multiple treatments were presented. There will be many examples in later chapters where single-case designs are used to evaluate more than one intervention in the same study and where multiple-treatment interference might emerge.

General Comments
The major threats to external validity do not exhaust the factors that may limit the generality of the results of a given experiment. Any feature of the experiment might be proposed to limit the circumstances under which the relationship between the independent and dependent variables operates. Of course, merely because one of the threats to external validity is applicable to the experiment does not necessarily mean that the generality of the results is jeopardized. It only means that some caution should be exercised in extending the results. One or more conditions of the experiment may restrict generality; only further investigation can attest to whether the potential threat actually limits the generality of the findings.
Also, a misunderstanding about the threats, perhaps especially those related to external validity, concerns their use and abuse. A given threat is pertinent in any given demonstration only insofar as it is a plausible rival hypothesis about restrictions on the generality of the findings. The superficial statement that begrudgingly acknowledges a finding usually goes like this: “Yes, o.k., that was found, but it may not apply to the children I see in my classroom, students with this or that background, most people, tall people, and so on.” This kind of statement is easy to make and is uninformed or superficial when stated in its vacuous form, that is, without a cogent explanation. To pose that there is a threat to validity requires a plausible explanation. That is, the consumer (colleague, peer) ought to have in mind precisely why that is a threat to external validity. A finding might not be applicable to all sorts of other conditions, people, and contexts, but one needs a bit more justification than just cavalierly raising the issue after each finding (especially findings one does not especially care for).
Consider an example. After years of research, we finally have a growing list of psychotherapies with evidence on their behalf (Nathan & Gorman, 2007; Weisz & Kazdin, 2010). Treatments have been studied in very well controlled trials, so well controlled that the issue of generality is of concern. I mention later in this chapter that there can be a trade-off between careful experimental control, such as holding as many variables constant as possible, and restrictions in the generality of the results. Careful control purposely changes conditions so that they resemble the conditions of everyday life less. Once the finding is established, it is reasonable to question whether the finding would hold when critical conditions change.
Sometimes in psychotherapy trials, patients are selected because they meet criteria for a psychiatric disorder (e.g., depression, anxiety) but are excluded if they have multiple disorders or other sources of impairment (e.g., a chronic medical condition that requires treatment). Researchers have wisely looked for homogeneous and well-described populations in their initial studies to limit variability (and the threats to data-evaluation validity noted later in the chapter). So now, with a list of internally valid studies, replications, and a pile of treatments, external validity is raised as a concern. Practitioners, for example, raise the question: “Fine, the results of the studies are internally valid, but will the results apply to my patients in clinical rather than research settings?” Again, merely asking this is not very informed unless there is a plausible reason to suspect that the proposed variable (changes in patient characteristics) might make a difference. One concern has been that patients seen in clinical work often have multiple psychiatric disorders (referred to as comorbidity), not just the one that may have served as a basis for recruiting patients for a randomized controlled trial. In other words, practitioners question the external validity of the findings. This is a reasonable, plausible challenge because patients with multiple disorders are likely to be more difficult to treat and may come from more complex personal situations due to the scope of their disorders. The scope of disorders may not merely reflect the complexity of the current situation with such patients; perhaps they have a stronger genetic and environmental “loading” or more untoward underpinnings of their more severe condition. It is plausible to challenge external validity because more severe cases of anything in life often respond less well or not at all to treatments that worked with less severe cases. Much research will be needed to address this challenge in the many contexts (e.g., many disorders, ages of clients) in which the threat could be valid. At this point, studies on the topic in the context of child therapy at least suggest that, in fact, comorbidity and complexity of the case do not limit the external validity of the findings. Evidence-based treatments tested to this point work as well with more severe cases and cases with more complex personal and family characteristics as with other cases (e.g., Doss & Weisz, 2006; Kazdin & Whitley, 2006).

CONSTRUCT VALIDITY

Defined
Construct validity has to do with interpreting the basis of the causal relation. Assume that threats to internal validity have been addressed, that is, a causal relation has been identified between an intervention and behavior change. Now we can ask the construct validity question: What is the intervention, and why did it produce the effect? Construct validity addresses the presumed cause or the explanation of the causal relation between the intervention or experimental manipulation and the outcome. Is the reason for the relation between the intervention and behavior change due to the construct (explanation, interpretation) given by the investigator? For example, let us say that in an experiment the intervention consisted of a teacher providing praise to a student for increased time working on arithmetic assignments during a free-study period. The intervention caused the change, let us say, but was it the praise, or increased attention in general, or a teacher taking special interest in this student’s arithmetic performance? The answer to these questions focuses specifically on construct validity.³
Several features within the experiment can interfere with the interpretation of the results. These are often referred to as confounds. We say an experiment is confounded, or that there is a confound, to refer to the possibility that a specific factor varied (or co-varied) with the intervention. That confound could, in whole or in part, be responsible for the results. Some component other than the one of interest to the investigator might be embedded in the intervention and account for the findings. There are many examples where construct validity is in question. For example, in adult psychotherapy, cognitive therapy is a well-established, evidence-based treatment for major depression (Hollon & Beck, 2004). We know from the studies that cognitive therapy causes the change, but we do not know how it works or why the change occurs. The proposed interpretation (changes in cognitions lead to changes in depression) has been unsupported as the likely basis for why that treatment works. In short, treatment works, but why (a construct validity issue)?

³ Construct validity also is a commonly used term in the context of test development and validation and refers to evidence that a measure (e.g., an anxiety or depression scale) actually measures the construct (concept) it purports to measure. Multiple lines of evidence are brought to bear to evaluate the construct validity of measures (Kazdin, 2003). The resemblance to the present use of the term is in underscoring a question about the explanation: in test development, what explains performance on the measure and what concept best represents the items; in methodology and design, what concept explains why the intervention worked and how it achieved its effects.
In applied work (e.g., education, treatment, prevention, skill acquisition, rehabilitation), where single-case designs are often conducted, there is less of an interest in isolating the reason why the intervention produced change. A multi-component treatment package may be designed to improve the reading or speech of children in a special-education class. The challenge is to improve the skills, and that is the goal. There might be little interest in identifying why and how specifically the intervention worked. Even so, knowing precisely why and how change occurs can be important for maximizing the impact of the intervention and extending the intervention to other settings (see Kazdin, 2007).
Construct validity tries to home in on what facet of the intervention explains the change. Consider an example from group research. We know that consuming a moderate amount of wine (e.g., one to two glasses with dinner) is associated with increased health benefits (e.g., reduced risk of heart attack). In studies of this relation, consumption of wine is the variable of interest. A construct validity question is “Is it the wine or some other construct?” This is a reasonable question because we know that wine drinking is associated with (confounded by) other characteristics. People who drink wine, compared to those who drink beer and other alcohol (spirits), tend to live healthier lifestyles, to smoke less, to have lower rates of obesity, to be lighter drinkers (total alcohol consumption), and to come from higher socioeconomic classes (probably with better health care) (e.g., Wannamethee & Shaper, 1999). These characteristics are related to disease and death. Even so, controlling these other factors reduces but does not eliminate the contribution that wine makes to lowering the mortality rate. Wine still appears to make a difference. The research has sharpened the focus to remove or evaluate the impact of influences other than wine drinking itself. More construct validity questions might be asked. For example, what specifically about the wine explains the effect? And that too has been studied (e.g., an antioxidant, resveratrol, found in red wine and grape skins is one explanation). One can see that the demonstration of a causal relation or correlation might be the beginning of research that focuses on evaluating the basis of the original finding, that is, the underlying construct that explains the relation.

Threats to Construct Validity
The reason why the independent variable has an effect raises fundamental questions about the construct the variable is designed to reflect. The independent variable may be a package of factors that ought to be broken down into components. In most single-case studies, a few factors may emerge that account for and explain the intervention effects. Two threats to construct validity are noted in Table 2.4 and highlighted here.

Table 2.4 Major Threats to Construct Validity

Attention and contact accorded the client: The extent to which an increase of attention to the client/participant during the intervention phase or lack of attention during nonintervention phases could plausibly explain the effects attributed to the intervention.

Special stimulus conditions, settings, and contexts: The extent to which special conditions in which the intervention is presented or embedded, alone or in combination with the intervention, could explain the effects attributed to the intervention by itself. The “real” influence might be “intervention x administered by wonderful person y” rather than the “intervention” free from its connection to special conditions.

Attention and contact accorded the client can serve as one of the explanations for an intervention effect. Before the intervention is implemented there is a baseline period in which observations of performance are obtained. Perhaps during baseline there is relative neglect of the client, but during the intervention phase increased attention, contact, monitoring, and feedback are provided. When these facets are not the intervention but are accoutrements, they are potential threats to construct validity. This is a kind of placebo effect in the sense that mere attention is the intervention, not necessarily attention in the form of positive reinforcement (contingent on behavior) but just involving or attending to the client more in a program, class, or intervention. The intervention consists of a package of components the investigator combines to effect change. Accoutrements of that intervention might be responsible for the effects. Attention and increased contact with the client are threats to validity if they are plausible explanations of the findings. If plausible, some phase in the design needs to accommodate the potential confound. A design that does not control for attention and contact is not necessarily flawed. But if the investigator wishes to conclude why the intervention achieved its effects, attention and contact ought to be ruled out as rival interpretations of the results.
Special conditions, settings, and contexts may also threaten the construct validity of a study. Sometimes an intervention includes features that the investigator considers irrelevant to the study, but these features may introduce ambiguity in interpreting the findings. The construct validity question is the same as we have discussed so far, namely, was the intervention (as conceived by the investigator) responsible for the outcome, or was it some seemingly irrelevant feature with which the intervention was associated? For example, the intervention may have been conducted in a special school or laboratory school affiliated with a university. In such schools, often the teachers, facilities, assistants, equipment, and other conditions are optimal. Construct validity here overlaps with external validity, but they evaluate different facets of the problem. External validity asks, “Will the program effects be generalizable to other settings where optimal conditions of administration are not as feasible?” Construct validity is merely another way to refer to the problem and asks, “Was the effect due to the intervention by itself or to the intervention in combination with a very special teacher and setting?” The investigator may discuss the program without acknowledging that the program in combination with other features may have been critical. Some of my research has included special teachers who could administer interventions extremely well and often in seamless and nuanced fashion as part of their everyday behavior. In those demonstrations with only one teacher, it is possible that the effects were a combination of the teacher and intervention, even though I discussed the results as if it were the intervention alone (e.g., Kazdin & Geesey, 1977; Kazdin & Mascitelli, 1980). (In defense of myself, I had not yet read this book and did not know any better.) Any time the intervention is administered under narrow or restricted circumstances (e.g., one behavior-change agent, one classroom, one program or institution) it may be possible to raise the threat that the program-in-special-context was the intervention. When two or more circumstances (e.g., two teachers, two classrooms) are included, one can see or show that the effect was not restricted to one set of conditions.
The use of a narrow range of stimuli and the limitations that such use imposes sounds similar to external validity. It is. Sampling a narrow range of stimuli as a threat can apply to both external and construct validity. If the investigator wishes to generalize to other stimulus conditions (e.g., other teachers, classrooms, therapists, types of clients), then the narrow range of stimulus conditions is a threat to external validity. To generalize across conditions of the experiment requires sampling across the range of these conditions, if it is plausible that the conditions may influence the results (Brunswik, 1955). If the investigator wishes to explain why a change occurred, then the problem is one of construct validity because the investigator cannot separate the construct of interest (e.g., treatment or types of treatment) from the conditions of its delivery (e.g., teacher, setting, therapist). Whenever possible in a study, it is useful to include more than one experimenter (teacher, therapist, setting) so that at the end of the study one can separate the influence of the person who administered the intervention from the effects of the intervention. Construct validity is clarified if one can show that the intervention exerted its effect under different conditions.

General Comments
In applied settings, interventions often consist of “packages,” that is, several distinguishable components that are put together. For example, an intervention to foster compliance with a medication regimen of an elderly adult at home (e.g., taking pills regularly) might consist of three components: (1) special reminders by a spouse or significant other to take the medication, (2) praise by that person if he or she does take the medication without the reminder, and (3) a cutesy ring tone or favorite song that plays whenever the person opens the special pill box to get the medication. The components are not regarded as accoutrements or artifacts or threats to construct validity. Rather, they are the intervention package. It may be useful to ask if one or more components are the key part of the intervention or what the individual contribution of each one is. This is not regarded as a methodological artifact or threat to validity.
Construct validity as a methodological concern is reserved for those instances in which a more pervasive general feature that is of no interest to the investigator may be confounded with the intervention. I have mentioned two of the likely candidates for single-case research. As with other threats, one does not vacuously criticize a study by holding up one of the construct validity threats I have mentioned. They become threats only when they are plausible rival interpretations. In the context of drug trials (e.g., for physical disease or psychiatric symptoms), a long-established threat to construct validity is the placebo effect. Any medication provided to a group might produce the outcome because of the pharmacological properties of the medicine and/or because of the act of taking medication under the supervision of a professional. We know that a placebo effect is not only a threat to construct validity but is one that is very plausible, well documented, and has neurobiological underpinnings that are increasingly understood (e.g., Price, Finniss, & Benedetti, 2008).

DATA-EVALUATION VALIDITY

Defined
In group research, data-evaluation issues are referred to as statistical conclusion validity and encompass those facets of the quantitative evaluation of the results that may mislead or misguide the investigator (Cook & Campbell, 1979; Kazdin, 2003; Shadish et al., 2002). As an example, if the investigator is comparing two or more forms of psychotherapy in a between-group study, she is likely to conduct statistical analyses to compare the treatments on posttreatment performance among the different groups. Low statistical power (e.g., having too few subjects in relation to the likely effect size) is very likely to lead her to conclude that the treatments were not different from each other. It may be that the treatments in fact are not different in their effects. However, more likely than not, given the way most psychotherapy research is conducted, the study was not statistically powerful enough to detect differences (Kazdin & Bass, 1989). Small-to-medium effects require relatively large samples to detect a difference if there is one. In other words, in this example, low power is a threat to validity.
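To give a sense of the numbers involved, the sketch below uses the power routines in the Python statsmodels library to ask how many subjects per group a two-group comparison would need; the effect size, alpha, and power values are conventional illustrative assumptions on my part, not figures from the studies cited.

    from statsmodels.stats.power import TTestIndPower

    # Subjects needed per group to detect a small-to-medium difference
    # between two treatments (Cohen's d = 0.3) at alpha = .05 with 80% power:
    n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
    print(round(n_per_group))  # about 175 per group

A study comparing two treatments with, say, 20 or 30 subjects per group has little chance of detecting a difference of that size, which is exactly the low-power problem described above.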
Visual inspection criteria rather than statistical tests are the primary means of evaluating single-case data. There are statistical tests that are occasionally used, as I discuss at much greater length later (Chapter 13 and the Appendix), but these are not regarded as the primary criteria and are used in a minority of instances of single-case research. Data-evaluation issues still emerge and can interfere with drawing inferences about the impact of the intervention. In single-case research, several aspects of the data can interfere with drawing valid inferences, and these are referred to as threats to data-evaluation validity.

Threats to Data-Evaluation Validity
I have adapted several threats raised in the context of statistical evaluation to the methods used to evaluate single-case designs. They encompass any facet of the data or the criteria used to evaluate the data (visual inspection) that may obscure identifying an intervention effect. Those facets that can serve as threats to data-evaluation validity are listed in Table 2.5.
Excessive variability in the data is a threat that can encompass different influences. Single-case designs depend on being able to discern patterns in the data within a given phase and discrepant patterns in the data from phase to phase (e.g., baseline to intervention). Much more will be said about this in relation to decision making during the course of a single-case design. The threat to validity stems from obtaining data within, but also across, phases where there is so much fluctuation in performance that a pattern cannot be reliably discerned. Did the intervention have the intended impact? It

Table 2.5 Major Threats to Data-Evaluation Validity

Excessive variability in the data: Any source of variability that can interfere with detecting a difference when there is one.

Unreliability of the measures: Error in the measurement procedures that introduces excessive variability that obscures an intervention effect. The error might result from a measure that is not a valid index of the domain of interest, a measure that is unreliable, and conditions of measurement (setting, test administrator) that influence the data in some way.

Trends in the data: The extent to which the direction of change in a given phase or a pattern across phases can interfere with drawing inferences about the effects of the intervention.

Insufficient data: Too few data points to permit conclusions about the level of performance and its likely level in the near future.

Mixed data patterns: Intervention effects are usually replicated within a study. The effect of the intervention is inferred from the overall pattern. Mixed data patterns across phases or replications within the study can interfere with drawing valid inferences about the impact of the intervention. Previously mentioned data-evaluation threats (excessive variability, trend, and insufficient data) in one or more phases may contribute to the mixed data pattern and obscure inferences about the intervention.

may be that variability is so excessive that one cannot tell. In the one-cannot-tell situation, variability is a threat to data-evaluation validity.
Variability can come from many sources, including the following (a brief simulation after the list illustrates how such variability can obscure an effect):

• uncontrolled influences in the setting that may change widely each day;
• error in measurement (unreliability);
• sloppy and inconsistent implementation of the intervention;
• genuinely high variability and inconsistency in performance (which might even be the impetus for developing an intervention);
• differences among subjects if more than one is used; and
• cycles or abrupt changes within the individual (e.g., on or off medicine, normal hormonal fluctuation) or the environment (e.g., scheduled changes in who is present in the setting or who oversees the client, or changes in the classroom activity or routine).

Consider one of the preceding sources in further detail, namely, the way in which
the intervention is implemented. Ideally, the procedures will be administered in a way
that minimizes their day-to-day variation. This means that the procedures will be
applied consistently, and those who administer the program will do so consistently.
Consistency may have to be built into the intervention whether the program is
delivered by the teacher and the teacher's aide in a classroom, by different staff or
attendants in a nursing home on different days, or, of course, by the same person on
different days. Rigor in the execution of the procedures is not a methodological nicety
for the sake of appearance. Consistency in execution of the procedures has direct
bearing on data-evaluation validity. A given difference between phases or subjects may
not be clear because of the variation or extra variation introduced by inconsistent
procedures.

Variation cannot be eliminated, especially in relation to those aspects of research
involving human participants (e.g., as students, clients, experimenters, therapists) and
in settings outside of the laboratory. However, in any experiment, extraneous variation
can be minimized by attention to the details of how the study is actually executed. If
variability is minimized, the likelihood of detecting a true difference between baseline
and intervention conditions is increased. In general, data-evaluation validity is
threatened when variability can interfere with drawing inferences. The source of
variability has implications for what the investigator can do about it, but for present
purposes, variability that is excessive interferes with identifying an intervention effect
when there is one and even with clearly identifying no effect when one has not
occurred.
How does one define excessive? It is not in standard deviation units, a familiar
measure of variability. The definition is relative to performance within a phase and
across phases and relative to the impact of the intervention. This is true of between-
group research as well. Data evaluation is obscured by excessive variability, but that is
a function of variability in relation to other influences in the design (Kazdin, 2003).
The more potent the effect of an intervention, the less likely variability is to obscure
interpretation of the effect. Even so, it is wise in experimental demonstrations to do
what one can to minimize each of the sources of variability bulleted earlier.
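
To make the relativity of "excessive" concrete, consider a minimal Python sketch (the means, standard deviations, and phase lengths here are hypothetical illustrations, not data from any study). The same 5-point drop from baseline to intervention is easy to discern when day-to-day variability is low and may be lost entirely when variability is high.

    import random

    random.seed(1)  # fixed seed so the illustration is reproducible

    def phase(mean, sd, days):
        """Generate one phase of daily observations around a mean level."""
        return [random.gauss(mean, sd) for _ in range(days)]

    for sd in (1.0, 6.0):  # low versus excessive day-to-day variability
        baseline = phase(mean=15, sd=sd, days=10)
        intervention = phase(mean=10, sd=sd, days=10)  # same 5-point drop
        overlap = min(baseline) <= max(intervention)
        print(f"sd={sd}: baseline min={min(baseline):.1f}, "
              f"intervention max={max(intervention):.1f}, "
              f"phases {'overlap' if overlap else 'are clearly separated'}")

Typically the low-variability phases separate cleanly, whereas the high-variability phases overlap even though the intervention effect is identical in both runs.
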
Unreliability of the measures is encompassed by the preceding discussion, but it
warrants separate attention. Measurement plays a special role in all scientific research
and therefore in single-case research. Also, reliability of measurement is a key part of
the design and methodology, as conveyed in subsequent chapters. For these reasons, I
delineate this as a separate threat to data-evaluation validity.
Reliability refers to the extent to which the measures assess the characteristics of
interest in a consistent fashion (and is taken up again in Chapter 5). Reliability is, of
course, a matter of degree and refers to the extent of the variability in scoring or
completing the measure. Variability in the results we obtain in the measurement from
day to day can be a direct function of multiple influences that impinge on the
individual. For example, performance varies from occasion to occasion as a function
of mood, experience, context, prior interactions with others, and many other
unspecifiable influences. Thus, even if a measure is perfectly reliable, there will still be
variability from one occasion to the next, because performance is multiply determined
and the measure is only one contributing factor.
Unreliability of the measure imposes another source of variation. Consider for a
moment that we have a poorly behaved child robot (named Automated Luke, or Al for
short) that we place in a third-grade classroom. We program Al to have 15 instances
of blurting out inappropriate statements to the teacher during the reading and writing
period. Al is made to look just like and is dressed like a child; he is carefully
programmed, tested, and calibrated, and has "his" battery charged. So assume for this
discussion that Al is doing his job: 15 randomly timed and blurted statements (e.g.,
"Hey, teach! When is recess?" "Mr. Jones, the book is a little boooooor-ing. Can we
stop now?" "When do we get to the unit on single-case designs?"). We also have Al
make five additional neutral or appropriate statements (e.g., he raises his hand and
says, "May I have some help?" or "Is it ok if my comments on the book take up more
than one page?").

We place two observers at the back of the classroom with the assignment of noting
each instance of Al's blurting out. Assume they have not mastered the observational
codes. For example, they are inconsistent in deciding what constitutes an inappropriate
statement versus a good question and when a statement is one statement or two
statements because they seem to be separated by a pause. Another way of saying that
they are not too consistent is to note that the measurement procedure has low
reliability. If we were to graph the data from one or both of the observers, we would
see that even though we know Al's performance is at 15 blurts per day, there is
fluctuation introduced by the measure: some days are at 10, others at 15, another at 17,
two at 11, and so on. The measurement codes and how they are applied may reflect
unreliability in the observations. To the extent that the measure is unreliable, a greater
portion of the subject's score is due to unsystematic and random variation.
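
The contribution of observer error can be illustrated with a short sketch (again hypothetical: the error magnitudes are invented for the example). Al's true rate is fixed at 15 blurts per day, yet the counts recorded by an unreliable observer fluctuate well away from it.

    import random

    random.seed(2)  # reproducible illustration

    TRUE_BLURTS = 15  # Al is calibrated to emit exactly this many blurts daily

    def recorded_count(error_sd):
        """One day's count as logged by an observer whose scoring adds random error."""
        return round(random.gauss(TRUE_BLURTS, error_sd))

    careful = [recorded_count(error_sd=0.5) for _ in range(7)]  # high reliability
    sloppy = [recorded_count(error_sd=3.0) for _ in range(7)]   # low reliability
    print("reliable observer:  ", careful)  # stays near the true rate of 15
    print("unreliable observer:", sloppy)   # fluctuates although Al never changes
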
Needless to say, unreliability of the measure is not unique to any specific modality.
Often, checklists, rating scales, and self-report measures are used to evaluate the
effects of interventions. These measures, too, have reliability that can vary as a
function of use, sample, and conditions of administration. The fact that they are
standardized (e.g., unvarying items on a scale such as the Beck Depression Inventory)
does not mean that their reliability is fixed. Reliability (consistency of the
measurement) is not a property of a scale alone but also a function of its use and
conditions of administration.
Some measures might be especially reliable because they are automated,
mechanical, or use equipment in some way that is unvarying and free from the
possibility of human bias in adding error variation. With automated equipment there
is an occasional catastrophic breakdown, and everything stops. This is less pernicious
than the illusion that all is going well while there is inconsistency in the observations.
In general, one wants to minimize extra, unneeded, and unsystematic variation
from the measure or measurement procedure. Any added variation introduces
unnecessary fluctuations in performance. If those fluctuations, resulting from the
measure, can interfere with evaluation of the data, unreliability or low reliability of the
measure becomes a threat to data-evaluation validity.
Trends in the data refer to the slope or the pattern of change over time based on
multiple observations. For example, we may observe for 10 days the social interactions
(number of exchanges that involve more than a hi or hello) of an elderly resident who
seems isolated from other people in her assisted-living home. For the 10 days, the
trend or slope is the line that best represents the pattern (e.g., horizontal line: no trend
in one direction rather than another). Single-case designs depend on seeing changes in
trends (slope) over time, and hence we shall return to this topic.
Occasionally, when baseline observations begin, one identifies a slope in the
therapeutic direction. That is, behavior is improving even though the intervention has
not yet begun. How could this occur? As with variability, the how is not critical in
relation to threats to validity. Yet it is sometimes the case that observation of
performance before the intervention begins exerts a change (a process I mentioned as
reactivity), and this change may continue for a while. Others in the environment who
interact with the client may change too in many ways that support improved behavior
of the client. Trends can serve as data-evaluation threats in phases other than baseline
if there is a pattern (e.g., behavior improves quickly but seems to be deteriorating over
the course
of treatment) that obscures or makes drawing inferences about treatment difficult.
Evaluation of trends is critical in data evaluation, and we shall return to this topic.
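
Because the trend is simply the best-fitting line through the observations, it can be computed directly. The following sketch fits an ordinary least-squares slope to a hypothetical 10-day baseline of the kind described for the assisted-living example (the counts are invented for illustration).

    # Hypothetical daily counts of social interactions during baseline
    days = list(range(1, 11))
    interactions = [2, 3, 2, 4, 3, 5, 4, 6, 5, 7]

    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(interactions) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(days, interactions))
             / sum((x - mean_x) ** 2 for x in days))
    print(f"baseline trend: {slope:+.2f} interactions per day")  # about +0.49

A clearly positive slope here signals improvement before the intervention even begins, which is exactly the situation that complicates later inferences.
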
Insufficient data can serve as a threat to data-evaluation validity. Single-case
designs depend on looking at current performance across phases when different
conditions (e.g., no intervention and intervention) are in effect. A "phase" consists of
consecutive observations obtained on the client while some condition (e.g., baseline) is
in place. A new phase consists of observations obtained when the condition changes
(e.g., intervention). Evaluation depends on looking at multiple characteristics of the
data within and across phases to see if there is a change. A validity threat occurs if the
data are insufficient to characterize current performance and to provide information
to project performance in the future. What constitutes insufficient data in a given
phase? This question cannot be answered in the abstract but depends on emerging
data in the design.
For example, is one data point insufficient? That is difficult to tell. Suppose
someone says her spouse has not exercised since they were married (10 years ago) and
she wishes to increase his exercise now. Hopefully our intervention is going to increase
daily exercise (in minutes). How many days of baseline do we need? One day might be
enough to ensure that the observation system is in place and working. For any later
data evaluation you will only have one data point, because the retrospective report
from the wife will not be part of the data in the study. One data point might be enough
for this initial baseline. In general, when behavior has never been performed in the
immediate past (e.g., exercising, attending activities), one or two observations may be
sufficient. In addition, when an intervention is withdrawn, sometimes seemingly
perfect performance (100% of the target behavior daily) plunges to the depths of
baseline in an instant, that is, within one observation day. One day will be fine, and I
shall give examples later to convey that. More is invariably better, so one would want
more than one day when possible, but in principle it is not the number of days but
what the phase attempts to accomplish and how many data points it takes to do that.
Single-case designs require a bit more understanding of the methodology because
decision making occurs during the study and drawing inferences depends on these
decisions. I shall provide guidelines regarding sufficient length of phases for data
collection and, more importantly, the underlying principles on which decisions are
based. I only wish to note here that insufficient data in a phase can interfere with the
evaluation and serve as a threat to validity.
Mixed data patterns within and across phases too can interfere with evaluation of
the data. Consistency in various patterns of the data is important for drawing
inferences in a single-case design. In the different designs, usually there are multiple
opportunities to evaluate whether the intervention was responsible for changes. For
example, in some designs the intervention is presented and withdrawn on two or more
occasions (ABAB designs); in other designs, the intervention is tested across multiple
behaviors one at a time (multiple-baseline design across behaviors). These and other
instances might be considered mini-replications; I add "mini" in recognition that
replication usually refers to an independent attempt to repeat the study. But in
single-case designs there is usually more than one opportunity to look at the data
pattern and draw conclusions about the intervention within the same study. One can
look at the overall pattern, but also at each individual occasion in which the
intervention is tested. From this
overall pattern, inferences are drawn. A threat to data-evaluation validity occurs if the
data pattern is mixed and interferes with drawing inferences about the intervention.
Any of the previously mentioned data-evaluation threats (excessive variability, trend,
and insufficient data) at one or more places in the design (e.g., one phase or one
individual in a multiple-baseline design across individuals) could be the basis for a
mixed data pattern.
Another source of a mixed data pattern may stem from who is included in the
demonstration. Single-case designs often include more than one subject, no matter
what the design. All subjects are different, of course (even the fingerprints and brains
of identical twins), so there is inherent variability from individual differences. In
principle, the more diverse the subjects are in a given demonstration, the greater the
possibility that responses to the intervention will vary. That is, the pattern of the data
for the intervention effects might well be mixed due to diversity and variability among
the subjects.
I say "mixed data pattern" to capitalize on the more familiar use of mixed messages
in everyday life. When someone says the three precious words, "I love you," usually
that is clear enough. When the person instead says the five slightly less precious words,
"I love you" (now a silent pause for 3 seconds, and then continues with . . .) "sort of,"
it would be wise to view the message as mixed, and the interpretation at best is
obscured. Mixed love signals are a threat to interpersonal relationships (methodologists
can handle these). Mixed data patterns in any research design are a threat to
data-evaluation validity (methodologists enter long-term psychotherapy for these).

General Comments
For both basic and applied research, it is critical to select interventions and parameters
of their administration that will maximize intervention effects. The goal of maximizing
intervention effects is a given in applied research: we wish to make a real difference
that helps people. However, consider the methodological aspect of strong interventions.
A more potent intervention decreases the likelihood that the threats to data evaluation
will emerge. Also, strong interventions not only affect means (i.e., change the average
level of performance from what it was before the intervention) but often reduce
variability (i.e., fluctuations around that average point) as well. Thus, intervention
strength can directly alleviate some of the threats noted previously.

PRIORITIES AND TRADE-OFFS IN VALIDITY
It is not possible to design a study that is perfectly attentive to all the threats of
internal, external, construct, and data-evaluation validity. This is in part because the
goals of the study may change or dictate that some threats are really more important
than others. Also, addressing some threats is inversely related to addressing others.
Consider key issues.
Internal validity is usually regarded as the highest priority. Obviously, one must
first have an unambiguously demonstrated finding or effect before other questions can
be raised (e.g., Can this be generalized?, i.e., external validity; What is the underlying
explanation?, i.e., construct validity). Yet the priorities of internal versus external
validity in any given instance depend to some extent on the purposes of the research.
Internal validity is given greater priority in basic research. Special experimental
arrangements are designed not only to rule out threats to internal validity but also to
maximize the
likelihood of demonstrating a particular relationship between independent and
dependent variables. Events in the experiment are carefully controlled, and conditions
are arranged for purposes of the demonstration. Whether the conditions represent
events ordinarily evident in everyday life is not necessarily crucial. The purpose of
such experiments is to show what can happen when the situation is arranged in a
particular way. These demonstrations are sometimes referred to as a test of principle.
Early work on stem cells, cloning, and cell reprogramming was of this type and
provided demonstrations of new biological tools and processes that were not
previously possible, that is, tests of principle. Many questions about generality of the
findings can be asked, such as whether the findings can be applied to clone a favorite
pet, to produce organs or tissues among individuals who need an organ transplant, or
to reverse a disease process (e.g., cancer) by giving cells new "instructions."
I have mentioned uncontrolled sources of variability as a potential threat to
validity, data-evaluation validity in particular. Working in applied rather than
laboratory settings (e.g., schools, clinics, hospitals, homes, and the community)
increases variability in data patterns and data collection. In a classroom, the
assignments, presence of multiple teachers, and varied classroom activities all add a
little bit of error to measurement. This is contrasted with the splendor of the
laboratory, where much can be the same, automated, and held constant, all to reduce
variability. Under the circumstances, the investigator working in an applied setting
tries to control carefully or hold constant as many influences as she can. Consider the
trade-offs.
On the one hand, very careful control reduces the likelihood of excessive variability
and mixed data patterns, and hence serves the Goddess of Data-Evaluation Validity
perfectly. Yet making applied settings like a quasi-laboratory to achieve control now
raises the wrath of the Goddess of External Validity. Will this intervention and its
effects generalize (external validity) to any other circumstance where control is not so
strong or so heavily invoked? This is not a minor question. For example, there are
many prevention demonstration projects (e.g., in the schools) where an investigative
team funded by a large grant completes a comprehensive assessment, provides an
intervention (to some classes or schools), and evaluates the short- and long-term
effects years later (e.g., reduced rates of suicide, substance use [drugs, alcohol,
cigarettes], unprotected sex, and criminal activity). It is superb to show that such
outcomes can be achieved, but a large question is whether any such programs can be
extended on a larger scale without the investigative team of researchers introduced
into the settings where the first study was implemented. Is there any external validity
to the findings? Only further research can establish that, but the threat to external
validity may become more plausible the more the research setting (with all of its
constraints and monitoring) departs from real-world settings.
All four types of validity (internal, external, construct, and data-evaluation) need
to be considered. In between-group research, many of the decisions to address these
can be made before the experiment begins, that is, at the design stage. The challenge is
slightly greater for single-case designs, because the investigator responds to the data
emerging from the study for basic design decisions (e.g., How long should this phase
be? When should a new intervention be tried?). Understanding the underpinnings of
design and what we are trying to accomplish is much more important than mastering
the procedures and techniques that constitute the designs. In any research one
must selectively neglect and attend to issues that could influence interpretation of the
findings. Being informed of what can emerge as problems, when some problems are
more relevant or important than others, and what to do about them: this is black-belt-
level methodology.

EXPERIMENTAL VALIDITY IN CONTEXT
The threats to validity are the fundamental reasons we go through design gyrations
and make very special arrangements in how the study is carried out. It is important to
place these in the context of research methodology more broadly and in this way also
preview the remaining chapters. Single-case designs, but research design more
generally, can be conceived as including three interdependent components:

Assessment: Use of systematic measures to document performance and to reflect
changes where changes are sought;

Experimental Design: Special arrangements of presenting the intervention or
conditions to participants that will help establish that the intervention rather than
other influences (threats to validity) is likely to be the cause of behavior change; and

Data Evaluation: Procedures, techniques, and criteria that are used to decide and
show that there was a reliable change and that the effect within, between, or among
conditions makes a difference.

Threats to validity enter in at many points, and interpretation of the study can be
enhanced or undermined by how each of these is addressed. The following chapters
address the three topics in turn and provide critical issues, options, and guidelines to
strengthen the quality of inferences from conducting single-case research.

SUMMARY AND CONCLUSIONS
The purpose of experimentation is to arrange the situation in such a way that
extraneous influences that might affect the results do not interfere with drawing causal
inferences about the impact of the intervention. The internal validity of an experiment
refers to the extent to which the experiment rules out or makes implausible alternative
explanations of the results. The factors or influences other than the intervention that
could explain the results are called threats to internal validity. Major threats include
the influence of history, maturation, instrumentation, testing, statistical regression, and
diffusion of treatment.
Apart from internal validity, the goal of experimentation is to demonstrate
relationships that can extend beyond the unique circumstances of a particular
experiment. External validity addresses questions of the extent to which the results of
an investigation can be generalized or extended beyond the conditions of the original
experiment. Several characteristics of the experiment may limit the generality of the
results. These characteristics are referred to as threats to external validity and include
generality across subjects, responses, settings, times, and behavior-change agents;
reactivity of experimental arrangements and the assessment procedures; and
multiple-treatment interference.
Construct validity pertains to interpreting the basis for the causal relation between
the independent variable (e.g., intervention, experimental manipulation) and the
dependent variable (e.g., outcome, performance). Factors that may interfere with or
obscure valid inferences about the reason for the effect are threats to construct validity.
Major threats include attention and contact with the client, and special stimulus
conditions, settings, and contexts.
Data-evaluation validity refers to those aspects of the data that obscure or mislead
in drawing inferences about the impact of the intervention. The factors that can
interfere with drawing conclusions are threats to data-evaluation validity. Major
threats include excessive variability in the data, unreliability of the measures, trends in
the data, insufficient data to discern a pattern within a given phase, and mixed data
patterns.
All four types of validity (internal, external, construct, and data-evaluation) are
important. Clearly internal validity rises to the top of the list and is the raison d'être
for doing research. Yet the types of validity vary in importance as a function of what
the investigator is trying to accomplish and what issues emerge during the collection
of data. It is not possible in any one experiment to address all threats well or equally
well, nor is this necessarily a goal toward which one should strive. Rather, the goal is
to address the primary questions of interest in as thorough a fashion as possible so
that clear answers can be provided for those specific questions. At the end of that
investigation, new questions may emerge, or questions about other types of validity
may increase in priority.
The obstacles in designing experiments emerge not only from the manifold types of
validity and their threats, but also from the interrelations of the different types of
validity. Factors that address one type of validity might detract from or increase
vulnerability to another type of validity. For example, factors that address
data-evaluation validity might involve controlling potential sources of variation in
relation to the experimental setting, delivery of interventions, and homogeneity of the
subjects. In the process of maximizing experimental control and making the most
sensitive test of the intervention variable, the range of conditions included in the
experiment becomes increasingly restricted. Restricting the conditions, such as the
type of subjects or measures and standardization of delivering the intervention or
independent variable, may commensurately limit the range of conditions to which the
final results can be generalized.
Single-case designs provide many options to rule out critical threats to validity and
make them implausible. They are as powerful in addressing the threats as are more
familiar between-group studies. In the chapters that follow, I elaborate design options
and how they address critical threats.
CHAPTER 3

Background and Key Measurement Considerations

CHAPTER OUTLINE

Identifying the Goals of the Program
Frequently Used Criteria
Social Validation as a Guide
Social Comparison
Subjective Evaluation
General Comments
Defining the Target Behavior or Focus
Measurement Guidelines and Considerations
Assessment Requirements for Single-Case Designs
Use of Multiple Measures
When to Administer These Additional Measures
Pre-Post Assessment
Periodic Probes During the Study
Reliability and Validity
Reliability of Observational Measures
Validity of Single-Case Measures
Summary and Conclusions

As I noted in the previous chapter, single-case methodology includes three
interrelated components: assessment, experimental design, and data evaluation.
Assessment is pivotal in all research and is the starting point to begin to answer
questions about interventions and their effects. Questions about all sorts of
interventions (Does eating certain foods prevent cancer? Will this well-intended
therapy help the patient? Is this systematic method of teaching [reading, art, music]
really effective?) require special arrangements of conditions (experimental designs) to
be answered scientifically. However, before one is able to answer what caused the
change, one needs to be sure there was a change in the outcome of interest (rate of
cancer, improvement in the patient, change in the student). Assessment is a
precondition for drawing inferences.


The importance of assessment is often underemphasized. You will recall mention
of the anecdotal case study as scientifically bereft in terms of drawing strong
inferences. This is not only due to the lack of an experimental design to evaluate some
condition but also to the lack of systematic assessment. As discussed much later in
this book (Chapter 11), systematic assessment can rescue all sorts of situations and
help draw inferences even when experiments cannot be done. As a matter of fact,
science can make astounding advances, test theory, and draw inferences, sometimes
by meticulous assessment and without the opportunity for experiments (e.g.,
astronomy and meteorology).
In this chapter and the two that follow, several characteristics of assessment in
single-case research are elaborated. This chapter focuses on background considerations
or requirements of measures for single-case designs. The next two chapters focus on
methods and strategies for assessing behavior and ensuring the integrity of the
assessment procedures. In each chapter, a distinction emerges that is worth
underscoring. I mentioned that single-case designs have flourished in an area that is
referred to as applied behavior analysis. This area has applied and evaluated
interventions broadly; it is difficult to identify a setting (e.g., schools, institutions,
business and industry, community life, colleges, nursing homes) in which, or a client
population (e.g., preschool to elderly, psychiatric and medical inpatients and
outpatients) with whom, interventions have not been developed and evaluated. In the
process of developing the substantive field of applied behavior analysis, methodological
innovations have been made as well. Some of those relate directly to key assessment
considerations, i.e., identifying what to assess and what to change. I include several of
these because they provide superb guidelines for assessment and intervention.
However, it is important to mention that these components are helpful guides but not
necessarily central to single-case research designs.

IDENTIFYING THE GOALS OF THE PROGRAM

Frequently Used Criteria


Assessment and intervention require clearly stating the goal and carefully describing
how the outcomes will be evident. Identifying the goal of the program in most cases
seems obvious and straightforward because of the direct and immediate implications
of the behavior for adjustment, impairment, and adaptive functioning of the individual
in everyday life. For example, many interventions have decreased such behaviors as
self-injury (e.g., head banging), anxiety and panic attacks, neglectful parenting among
adults, and driving under the influence of alcohol, and have increased such behaviors
as engaging in practices that promote health (e.g., exercise, consumption of healthful
foods) and academic performance among individuals performing poorly at school.
Examples are useful, but they do not address the broader issue, namely, what
makes a behavior or domain of functioning worthy or in need of intervention? There
are several overlapping criteria that serve as guidelines. First, the setting and institution
may dictate the focus and goals that are worthy of intervention by their very nature
and purpose. Schools, for example, are intended to educate youth and develop
competencies. Invariably, interventions are of interest to identify whether the current
educational program in place can be improved. Also, behavior may be focused on in a
setting because it relates to or interferes with the goal or main purpose of the setting.
Thus, in
schools, focusing on vandalism, disruptive classroom behavior, bullying, and drug use
may be quite relevant insofar as they relate to the likelihood that the academic goals
can be achieved in the setting. The goals of the setting or context have frequently
served as the basis of intervention and single-case evaluation. Examples include efforts
to increase productivity in business, improve athletic performance among individuals,
and improve acquisition of a new language.
Second, many of the criteria that guide interventions are based on dysfunction,
maladaptive behavior, or social, emotional, and behavioral problems that are associated
with impairment. Impairment means that the problem interferes with an individual's
functioning in everyday life. Examples would be failing to meet role demands at home,
at school, and at work; interacting inappropriately with others, which has deleterious
impact on one's own functioning; and being restricted in the settings, situations, and
experiences in which one can function. Facets of individual functioning (thoughts,
feelings, behavior) that lead to or are associated with impairment are likely to warrant
intervention. Impairment is a criterion invoked in defining psychiatric disorders (e.g.,
major depression, schizophrenia, attention-deficit/hyperactivity disorder) (American
Psychiatric Association, 1994). In addition to multiple symptoms (e.g., anxiety,
substance abuse), impairment in functioning is required as well.
There are all sorts of variations of functioning that fall under the broad rubric of
maladaptive behavior. Behaviors that are illegal or rule breaking serve as the impetus
for intervention. Illegal behaviors would include driving under the influence of
alcohol, using illicit drugs, and stealing. Rule breaking that is not illegal might include
a child leaving school repeatedly during the middle of the day or not adhering to a
family-imposed curfew. Also, behaviors that are dangerous to oneself or to others, or
that place the individual at risk for dangerous or untoward outcomes, often serve as
the basis for intervening. Self-injury, fighting at school, and spouse abuse are obvious
examples of dangerous behaviors because each involves physical harm; some are
life-threatening. Risk behaviors are those that may have harmful consequences and
can include unsafe sex practices, cigarette smoking, not wearing seat belts, and driving
while intoxicated. Signs of stress or distress are often grounds for intervening. Perhaps
the individual was exposed to a natural or personal disaster or trauma (e.g., loss of
home or a loved one) or an unusual stressor in relation to one's situation (work,
relationships). The signs of stress are evident in impairment in some sphere of
functioning. Perhaps the most extreme variation within the category of clinical
dysfunction would be unusual or extreme symptoms that constitute more stark
departures from everyday experiences and functioning. Signs that an individual is
hearing voices, acting on these voices, seeing things that are not there, and other
marked departures in social behavior, communication, and activity would be grounds
for evaluation and intervention.
Third, the basis for intervention often reflects behaviors that are of concern to
individuals themselves or to significant others. For example, parents bring their
children to treatment for a variety of behaviors that affect daily life but may or may
not be severe enough to reflect significant social, emotional, and behavioral problems,
impairment, or rule breaking. Nevertheless, parents wish for some help and may have
concerns. Examples include toilet training, school functioning, shyness, and mildly
bothersome behaviors that, if severe, might indeed reflect impairment. "Behaviors that
are of concern" is a broad, catchall category but is one that is meaningful nonetheless.
The goal
is to improve adaptive functioning in a particular domain and perhaps in some cases
bring individuals up to seemingly normative levels.
There are occasions in which the concerns of significant others are of questionable
relevance as targets of treatment. For example, a case at the clinic where I work
included a very aggressive and antisocial 8-year-old boy. His behavior was clearly
impaired, as reflected, for example, in his multiple and repeated expulsions from
school for fighting (physically) with children and teachers. The single parent was
extremely concerned about other behaviors that are generally annoying (e.g., he leaves
his clothes on the floor, does not always flush the toilet, leaves his shoes outside of the
closet, occasionally discards candy wrappers on the furniture, forgets to place the cap
on the toothpaste tube after brushing). These latter behaviors do not predict long-term
child adjustment, criminality, psychiatric disorder, or impairment. The aggressive and
antisocial behaviors do predict these outcomes and meet several of the criteria noted
previously.
Fourth, behaviors are focused on that may prevent problems from developing. The
focus is not on a problem but on behaviors that will avert the likelihood of a problem
or minimize its occurrence. Often children or adults are at risk for some untoward
consequence. For example, premature babies and children from economically
disadvantaged environments are at risk for school difficulties. Early interventions with
the parents and children to develop pre-academic behaviors at home are designed to
prevent later physical, psychological, and educational problems. Also, developing
behaviors that promote safety (e.g., in business and industry or in the home) or health
would qualify as efforts to prevent problems.
The previously mentioned criteria capture many of the bases for intervening.
Although the criteria focus on characteristics of behaviors of the individual, there are
critical contextual influences. A given domain of functioning might be context specific
rather than reflect pervasive features that individuals show at all times and places. For
example, children may show a particular problem only in the classroom or at home,
adults with anxiety may show this only in relation to specific situations (e.g., involving
socialization with others), and someone who stutters may do so much more in the
presence of a group and strangers than with individuals and family. Age and period of
development may make something worthy of intervention. For example, enuresis
(bedwetting) may be annoying to parents but is normative in early childhood. In
middle and later childhood (e.g., > 7 years old), it becomes a departure from normative
functioning and a risk factor for (predictor of) psychiatric disturbance (Feehan,
McGee, Stanton, & Silva, 1990; Rutter, Yule, & Graham, 1973). An intervention is likely
to be more appropriate for the older rather than younger children.
The issue of what to select for assessment and intervention is not unique to
single-case research. For example, social, emotional, and behavioral problems serve as
the frequent basis of intervention studies. Fundamental questions invariably arise
related to what constitutes "normal" and deviant functioning and at what point one
should intervene. The question has been made more salient by research showing that
in "normal" community samples, approximately one in four people meets criteria for
a psychiatric diagnosis (National Institute of Mental Health, 2008). Also, many
psychiatric diagnoses (e.g., depression, anxiety) and social, emotional, and behavioral
problems are on continua, so there is no necessary cutoff point that says you have "it"
or you do not. Thus, when to intervene can be ambiguous.

Social Validation as a Guide


Social validity or social validation occasionally has been used as a guide to both
assessment and intervention (Foster & Mash, 1999; Kazdin, 1977b; Kennedy, 2002;
Wolf, 1978). The notion of "social" validity is designed to ensure that interventions
take into account the concerns of society and the consumers of interventions (parents,
teachers, clients) (Schwartz & Baer, 1991). Social validity encompasses three questions
about interventions:

Are the goals of the interventions relevant to everyday life?
Are the intervention procedures acceptable to consumers and to the community
at large?
Are the outcomes of the intervention important, that is, do the changes make a
difference in the everyday lives of individuals?

Each of these involves assessment in some way, but for this chapter I emphasize the
first question because it relates directly to selection and assessment of the focus or
goal of the intervention. Two social validation methods can be used for identifying the
appropriate focus of the intervention, namely, the social comparison and subjective
evaluation methods. Each is an empirically based method of identifying what the
focus of the intervention, and hence the assessment, ought to be.

Social Comparison. The major feature of the social comparison method is to identify
a peer group of the client, that is, those persons who are similar to the client in subject
and demographic variables but who differ in performance of the target behavior or
characteristic (e.g., depression, anxiety) of interest. The peer group consists of persons
who are considered to be functioning adequately with respect to the target behavior.
Essentially, normative data are gathered to provide a basis for evaluating the behavior
or domain of functioning of the client. The behaviors that distinguish the client from
the normative sample, or the magnitude of the departure from the normative sample,
suggest what domains require intervention. There are broad swaths of everyday life in
which this is routinely done. For example, in education, extensive data on the reading,
language, and arithmetic progress of children at different ages and grade levels provide
the basis for identifying who is doing well and who is doing poorly. Each of these may
be used for decision making about special interventions (e.g., for individuals identified
as gifted or lagging behind in a skill area). Similarly, disabilities (e.g., in walking,
talking; in social behavior, as in autism) are identified early in life because of the
departures from normative information that is readily available. Consequently,
normative data are already used routinely and implicitly in decisions about when to
intervene.
Normative data occasionally are used in other areas where the scope of the
information is not as well developed, but where some benchmarking would help in
deciding the focus. This latter use of normative data to identify the intervention focus
was nicely illustrated in a program that trained institutionalized women with
developmental disabilities to dress themselves and select their own clothing in a way
that coincided with current fashion (Nutter & Reid, 1978). Developing skills in
dressing fashionably represents an important focus for persons preparing to enter
community living situations. The purpose of the study was to train women to
coordinate the color combinations of their clothing. To determine the specific color
combinations that constituted popular
fashion, the investigators observed over 600 women in community settings where the
institutionalized residents would be likely to interact, including a local shopping mall,
a restaurant, and sidewalks. Popular color combinations were identified, and the
residents were trained to dress according to current fashion. The skills in dressing
fashionably were maintained for several weeks after training.
In some cases, it may be useful to look at normative samples to determine precisely
what ought to be trained. In the preceding example, the investigators were interested
in focusing on specific response areas related to dressing but sought information from
normative samples to determine the precise behaviors of interest. The behavior of
persons in everyday life served as a criterion for the particular behaviors that were
trained. When the goal is to return persons to a particular setting or level of
functioning, social comparison may be especially useful. The method first identifies
the level of functioning of persons performing adequately (or well) in the situation
and then uses that information as a basis for selecting the target focus.

Subjective Evaluation. Subjective evaluation consists of soliciting the opinions of
others who, by expertise, consensus, or familiarity with the client, are in a position to
judge or evaluate the behaviors or characteristics in need of treatment. Many of the
decisions about the behaviors that warrant intervention are made by parents, teachers,
peers, or people in society at large who identify deviance and make judgments about
which social, emotional, behavioral, or learning problems do and do not require
special attention. An intervention may be sought because there is a consensus that
there is a problem. Often it is useful to evaluate the opinions of experts systematically
to identify what specific domains of functioning present a problem.
T he term “subjective” is unnecessarily touchy-feely and misrepresents aspects o f
the method. Although the inform ation is based on self-rep ort and opinion, the infor-
mation often draw s on v ery special expertise. For exam ple, subjective evaluation has
been used to identify behaviors that are critical w hen children (or others) escape from
their homes in case o f fire. O pinions o f firefighters are sought to identify what behaviors
will save their lives. It dem eans the source by referring to this as “subjective.” T he spe-
cific skills recom m ended based on expertise o f the source m ake a difference in saving
lives, hardly just an opinion or subjective view. Sim ilarly, one might ask those who train
com m ercial airline pilots what the most im portant skills are for em ergency landings,
and use that to provide rigorous training. Here too, subjective evaluation might better
be called “expert evaluation.”
Two studies nicely illustrate subjective evaluation outside the context of life-or-death
considerations. In the first study, the investigators were interested in identifying
problem situations for youths with delinquent behavior, and the responses they should
possess to handle these situations (Freedman, Rosenthal, Donahue, Schlundt, &
McFall, 1978). To identify problem situations, psychologists, social workers, counselors,
teachers, boys with a history of delinquency, and others were consulted. After the
problem situations were identified (e.g., being insulted by a peer, being harassed by a
school principal), the investigators sought to identify the appropriate responses to
these situations. The situations were presented to boys with and without prior
delinquent behavior. They were asked to respond as they typically would. Judges,
consisting of students, psychology interns, and psychologists, rated the competence of
the
responses. For each of the problem situations, responses were identified that varied in
their degree of competence. An inventory of situations was constructed that included
several problem situations and response alternatives that had been developed through
the subjective evaluations of several judges. The input is a useful basis for identifying
what to change and develop during an intervention.
In a second example, the investigators were interested in preparing young children
in day care in school readiness skills (Hanley, Heal, Tiger, & Ingvarsson, 2007). The
behaviors drew on information obtained from early elementary school teachers and
from early education experts. In one of the surveys, over 3,000 kindergarten teachers
from different regions of the country, and spanning a period of a decade, provided the
information. The results reflected a shift in how experts viewed the components of
readiness. The shift moved from a focus on academically oriented skills to social skills.
The investigators selected those social skills reported to be the most important (e.g.,
following directions, taking turns and sharing, telling what one needs, being sensitive
to others, and reducing disruptive behavior). These categories were then
operationalized, assessed, and successfully trained.
In the preceding examples, persons were consulted to h elp identify behaviors that
warranted intervention. T he persons were asked to recom m end the desired behaviors
because o f their fam iliarity with the requisite responses for the specific situations. The
recom m endations o f such persons can then be translated into training program s so
that specific perform ance goals are achieved.

General Comments. Social comparison and subjective evaluation methods provide
empirically based procedures for systematically selecting the target focus for purposes
of assessment and intervention. Of course, the methods are not without problems. For
example, the social comparison method suggests that behaviors that distinguish clients
from a community sample ought to serve as the basis for intervention. Yet it is possible
that normative samples and clients differ in many ways, some of which may have little
relevance to the functioning of the clients in their everyday lives. Just because clients
differ from a community sample in a particular behavior does not necessarily mean
that the difference is important or that ameliorating the difference in performance will
solve major problems for the clients. Also, I already mentioned that terms used
decades ago such as "normal" sample are hard to say without gulping; I mentioned
25% as an approximate rate of psychiatric disorder among community samples.
"Normal" and community samples include a lot of deviance, which means using them
as a bar or criterion requires caution.
Similarly, with subjective evaluation, the possibility exists that the behaviors
subjectively judged as important may not be the most important focus of treatment.
For example, teachers frequently identify disruptive and inattentive behavior in the
classroom as a major area in need of intervention. Yet we have known for decades
that improving attentive behavior in the classroom usually has little or no effect on
children's academic performance (e.g., Ferritor, Buckholdt, Hamblin, & Smith, 1972;
Harris & Sherman, 1974). However, focusing directly on improving academic
performance usually has inadvertent consequences on improving attentiveness (e.g.,
Ayllon & Roberts, 1974; Marholin, Steinman, McInnis, & Heads, 1975). Thus,
subjectively identified behaviors may not be the most appropriate or beneficial focus
in the classroom.

Notwithstanding the objections that might be raised, social comparison and
subjective evaluation can be useful in guiding assessment and identifying the focus of
an intervention. The objections against one of the methods of selecting the
intervention focus can be overcome by employing both methods simultaneously. That
is, normative samples can be identified and compared with a sample of clients (e.g.,
individuals with a history of delinquent behavior, people who have a developmental
disability) identified for intervention. Then the differences in specific behaviors, skills,
or other facets of functioning that distinguish the groups can be evaluated by raters to
examine the extent to which these characteristics are viewed as important.

DEFINING THE TARGET BEHAVIOR OR FOCUS


From the criteria noted previously, the general focus of the intervention is delineated.
For both assessment and intervention, the general focus (tantrums, aggressiveness,
self-injury) is translated into a more concrete definition. The move from concept
(characteristic or idea) to operations (ways in which that concept will be measured) is
a critical facet of all of the sciences and permits advances, replication of findings, and
accumulation of knowledge. Operational definitions refer to defining a concept on the
basis of the specific operations used for assessment. Paper-and-pencil measures
(questionnaires to assess the domain), interviews, reports of others (e.g., parents,
spouses) in contact with the client, physiological measures (e.g., measures of arousal,
stress), and direct observation are among the most commonly used measures in
psychological, educational, and counseling research to operationalize key concepts.
In applied research where single-case designs have been used heavily, emphasis has
been placed on direct observation of overt behavior because overt behavior is viewed
as the most direct measure of the treatment focus. So, rather than relying on parent or
teacher ratings about the severity or frequency of tantrums, investigators have assessed
tantrums directly. This requires defining what a tantrum is, or at least what will be
counted as a tantrum in the study. Parental reports about tantrums, while important,
are a step removed from the tantrums themselves. Moreover, reports are subject to
special influences and bias (e.g., the more stress the parent is experiencing in areas of
life other than the child's tantrums, the greater their perception of defiance in their
children) (see Kazdin, 1994). Consequently, if possible and feasible, it is useful to
observe the tantrums directly and to see when they occur, under what circumstances,
and whether they change in response to intervention. At the same time, the effects of
tantrums on others in the environment are not trivial. In an effective program, one
would like to see a genuine and marked reduction in tantrums but also a change that
is reflected in the perceptions of others who identified the problem. One measure
(direct observation of the behavior) does not substitute for the other (people's
perceptions that the change made a difference).
Operational definitions are essential to begin to assess and evaluate interventions.
In defining an abstract concept (e.g., anxiety or tantrums), an operational definition is
not likely to capture the entire domain of interest. Operational definitions are ways of
working with the concept by taking a slice or two of the conceptual pie to represent
critical components. Usually we are interested in the domain in its fullest definition
but use operational definitions to represent it. In such cases, we do not merely wish to
change functioning on the one measure we assess but hope to change the many
components of
the larger concept. In other situations, operational definitions may reflect virtually all
or most of the components of interest. In the case of tantrums, for example, frequency
of the tantrums may be the main aspect of interest, but we still care about how the
world sees and perceives the child's tantrums as well.
We begin by specifying the general domain (e.g., tantrums) and then by identifying
a specific definition that permits assessment. To make this transition, one ought to
ask others (e.g., parents, teachers, and clients) what the desired or undesired behaviors
are. Also, it is useful to observe the client or others informally. Descriptive notes of
what behaviors occur and which events are associated with their occurrence may be
useful in generating specific response definitions. From inquiries and informal
observations, one might be able to answer several questions about the target behavior
(e.g., when does it occur, what does it look like, under what circumstances does it
occur?).
For example, in one program the goal was to train children with autism in helping behaviors (Reeve, Reeve, Townsend, & Poulson, 2007). Children with autism have severe deficits in socializing with others; helping behaviors in a special classroom were selected because helping others tends to lead to longer social interactions than other classes of behavior (e.g., greetings). To identify where to begin, the investigators surveyed parents of typically developing children, asking them to describe instances of helping behavior. Also, another group of children was observed in a local school during many different activities (story time, free play) in order to identify helping acts. From this information, the investigators developed categories of helping activities (e.g., picking up objects, setting up an activity, sorting materials) in relation to the classroom and then moved to highly specific operational definitions of each category.
Initial canvassing of others may not be necessary for many behaviors that will be observed (e.g., completing homework, taking one's medication). However, one must move to an operational definition to specify how the behavior will be assessed for purposes of observation and intervention. As a general rule, a definition should meet three criteria: objectivity, clarity, and completeness (Hawkins & Dobes, 1977). These concepts are defined and illustrated in Table 3.1. Developing a complete definition often creates the greatest difficulty because decision rules are needed to specify how behavior should be scored. If the range of responses included in the definition is not described carefully, observers have to infer whether such a response has occurred. For example, a simple greeting response such as waving one's hand to greet someone may serve as the target behavior for a socially withdrawn child. In most instances, when a person's hand is fully extended and moving back and forth, there would be no difficulty in agreeing that the person was waving. However, ambiguous instances may require judgments on the part of observers. A child might move his or her hand once while the arm is not extended (rather than back and forth), or a child may not move his or her arm at all but simply move all of the fingers on one hand up and down (in the way that infants often learn to say good-bye). These responses are instances of waving in everyday life, because we can often see others reciprocate with similar greetings. For assessment purposes, the response definition must specify how these and related variations of waving should be scored.
Developing clear definitions requires specifying what is and what is not to be included in the behavior. For example, in one program, the focus was on reducing the frequency of talking to oneself for a hospitalized patient with schizophrenia (Wong et al., 1987).

Table 3.1 Criteria to Be Met When Defining Behaviors for Observation

Objectivity
  Defined: The measure refers to observable characteristics of the behavior or to events in the environment that can be observed.
  Example: The number of times that a child engages in tantrums (as an operational definition of tantrums); the number of cigarette butts in the ashtray or cigarettes remaining in the pack (as an operational definition of cigarette smoking).

Clarity
  Defined: A definition is so unambiguous that it could be read, repeated, and paraphrased by an observer or someone initially unfamiliar with the measure. Little explanation is needed to begin actual observations of the behavior.
  Example: A tantrum includes any time the child shouts, whines, stomps feet, throws things, or slams a door in response to a comment from his or her mother or father during the hours of 3:30 p.m. to 5:30 p.m., Monday through Friday, when the child and at least one parent are at home.

Completeness
  Defined: Delineation of the boundary conditions so that the responses to be included and excluded are enumerated.
  Example: Not included in a tantrum is a raised voice that is part of excitement while watching TV or playing a game, or an initial expression of disappointment when a request (e.g., staying up later at bedtime) is denied. A statement of disappointment that lasts less than a minute without the behaviors noted in the example of clarity (above) is not a tantrum for present purposes.

Note: These are critical assessment requirements for direct observations introduced by Hawkins and Dobes (1977).

Self-talk was defined as any vocalization not directed at another person but excluding sounds associated with physiological functions such as coughing. In another report, the assessment focused on academic behaviors for children with learning disabilities, including delays in speech, language, and motor skills (Athens, Vollmer, & Pipkin, 2007). The duration of performance was assessed as children performed various tasks (e.g., writing sentences, tracing letters). Performance on the tasks did not count as beginning until after 3 seconds had elapsed. Then the time was recorded. Performance still counted as working if the child paused for less than 3 seconds at any time. This allowed for the children to switch papers, erase what they had written, or just to pause. In a study with a 7-year-old child with autism, the goal was to reduce aberrant vocalizations that made him stand out in social interactions and contributed to his not being integrated into a regular education classroom (Pasiali, 2004). Aberrant vocalizations were defined as noises, words, or phrases without specific content or meaning. These were all tallied during dinnertime and used to reflect the effect of the intervention. Finally, in a program for veterans with problems of substance abuse and at least one other psychiatric disorder, one of the goals was to reduce illicit drug use. Abstinence from drugs was defined as a negative result on urine tests provided twice a week (Drebing et al., 2005).
The examples convey the specificity needed to conduct the observations. The specificity maximizes the reliability in observing and coding the behaviors. As observations are conducted, difficult-to-score examples may emerge, and these may be used to make more precise what is and is not to be counted. A clear definition does not eliminate judgments but allows a way to codify these judgments so that they are made relatively consistently.
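Decision rules like the 3-second pause criterion described for the Athens et al. (2007) study lend themselves to explicit codification. The sketch below is a minimal illustration only, not code from that study; the bout times, the tolerance constant, and the function name are all invented here to show how such a duration rule might be written down unambiguously.

```python
# Illustrative sketch of a duration decision rule (names and data invented).
# Work bouts are (start, end) times in seconds observed for one session.
# Pauses shorter than PAUSE_TOLERANCE do not end an episode of "working."

PAUSE_TOLERANCE = 3.0  # seconds

def total_work_duration(bouts):
    """Merge on-task bouts separated by pauses shorter than
    PAUSE_TOLERANCE, then sum the durations of the merged episodes."""
    if not bouts:
        return 0.0
    bouts = sorted(bouts)
    episodes = [list(bouts[0])]
    for start, end in bouts[1:]:
        if start - episodes[-1][1] < PAUSE_TOLERANCE:
            episodes[-1][1] = max(episodes[-1][1], end)  # brief pause: same episode
        else:
            episodes.append([start, end])                # long pause: new episode
    return sum(end - start for start, end in episodes)

# The 2-second pause is absorbed; the 10-second pause starts a new episode.
observed = [(0, 20), (22, 40), (50, 60)]
print(total_work_duration(observed))  # 50.0
```

Writing the rule this way makes the boundary cases explicit: a 2-second pause is absorbed into the ongoing episode (and, as in the rule described above, counts as working), whereas a 10-second pause ends it.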

MEASUREMENT GUIDELINES AND CONSIDERATIONS


Single-case and between-group designs share core features of scientific methods, but many of their procedures are quite different. Assessment is one feature. A useful way to remember the difference is that between-group research usually uses many subjects and few measurement occasions (e.g., pre and post), whereas single-case research uses few subjects and many measurement occasions. The statement is wonderfully useful and clear, and I would place it on the ceiling of my bedroom if it were not for the fact that it is not quite right. There are so many permutations of between-group and single-case research that the general statement, while often true, is not invariably true. Also, later in this chapter, I mention a few assessment occasions (characteristic of between-group research) that are very useful to integrate into single-case research.
In single-case designs, at least one measure is needed that will allow evaluation of performance over time in an ongoing way. Ongoing assessment is critical to the logic of single-case designs and to the methods for data evaluation, all elaborated in later chapters. Often, more than one measure will be used in a single-case design. Not every measure of interest to the investigator has to be administered in an ongoing way, but generally speaking at least one does. In this section, it is important to clarify key requirements for the primary measure, that is, the main measure that will meet the requirements of the designs. Additional measures, the rationale for their use, and the timing of their administration are covered separately.

Assessment Requirements for Single-Case Designs


The measure used to evaluate performance and meet the design and data-evaluation requirements of single-case designs must have several characteristics, highlighted in Table 3.2. First, the measure will need to be administered repeatedly, that is, on an ongoing basis over time. This means the measure will be administered or collected daily or several times a week. The collection will be over a period before the intervention is implemented (baseline observations) and then again while the intervention is in effect.
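As a minimal illustration of what this repeated administration yields, the sketch below (field names and values invented, drawn from no particular study) represents one daily primary measure labeled by phase so that baseline and intervention observations can later be compared and graphed.

```python
# Illustrative only: one way to represent the ongoing (e.g., daily)
# record of a primary measure across phases. All names are invented.
from dataclasses import dataclass

@dataclass
class Observation:
    day: int      # consecutive observation day
    phase: str    # "baseline" or "intervention"
    value: float  # the primary measure (e.g., number of tantrums)

series = [Observation(d, "baseline", v) for d, v in enumerate([7, 6, 8, 7, 9], 1)]
series += [Observation(d, "intervention", v) for d, v in enumerate([5, 4, 3, 3, 2], 6)]

baseline = [o.value for o in series if o.phase == "baseline"]
print(sum(baseline) / len(baseline))  # 7.4, the baseline mean
```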

Table 3.2 Measurement Guidelines and Considerations: For Single-Case Designs Select Instruments to Evaluate the Intervention that Include These Characteristics

Administered repeatedly
  The measure is one that can be administered continuously over time (e.g., daily or several times a week).

Consistency of measurement
  Observers or data collection procedures should have minimal error in obtaining the information.

Capacity to reflect change
  If the intervention is effective, the measure must be able to show that. This is a function of how the construct is defined and the way in which it is observed.

Dimensional scale
  Measures should reflect a continuous dimension or scale rather than a binary category (yes, no; completed, not completed) whenever possible.

Relevance of the measure
  The measure ought to assess the problem or domain of interest directly or some facet known to be highly correlated with that domain.

Importance of the measure
  The construct or domain that is measured ought to make a difference and serve as one that the client or others see as important to functioning in everyday life.

Second, behaviors or other domains that are assessed must be able to be observed consistently (reliably). This has to do with the clarity of the definitions of what is observed and the ability of the observers to invoke these definitions consistently. Consistency of measurement is also referred to as reliability. Error or fluctuations in day-to-day measurement as a result of inconsistencies in the measure can interfere with drawing inferences about change and about what caused change (a threat to data-evaluation validity). More is said about reliability and procedures for its evaluation in Chapter 5.
Third, the measure must be able to reflect change. If the intervention is effective, will this particular measure be able to show that? The answer may depend on the definition of the behaviors but also on the assessment strategy. A teacher may count the number of 1-hour periods (out of a total of three) in which a child exhibited serious aggressive behavior (e.g., throwing objects at the teacher or at other children, hitting someone, breaking something). Any hour in which one of these behaviors occurs is marked as an aggressive period, and we graph the number (or percentage) of periods each day with aggression. The measure is not likely to be very good in reflecting change. We do not get to see whether the child went from 100 to just 1 episode of aggression in any 1-hour period; both of these would count as an aggressive period. Also, we might see the number of aggressive periods go from three to two, which is not a clear change, or at least not a change likely to meet design and data-analytic requirements noted in later chapters. So much aggression would be hidden by these numbers, since so many acts could have occurred within a given period. Also, the insensitivity of the measure (a scale with a small range) will make it difficult to detect change.
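A small numerical sketch (with invented counts) makes this insensitivity concrete: the period-based measure barely moves even when the underlying frequency of aggression collapses.

```python
# Invented numbers illustrating the period-based measure described above.
# Each inner list holds the aggressive acts in each of three 1-hour periods.
before = [[100, 80, 90], [95, 70, 85]]  # two days before intervention
after = [[1, 1, 0], [1, 0, 1]]          # two days after a large real change

def aggressive_periods(day):
    """Period-based measure: a period counts if any act occurred in it."""
    return sum(1 for acts in day if acts > 0)

def total_acts(day):
    """Frequency measure over the same observations."""
    return sum(day)

print([aggressive_periods(d) for d in before])  # [3, 3]
print([aggressive_periods(d) for d in after])   # [2, 2]  change barely visible
print([total_acts(d) for d in before])          # [270, 250]
print([total_acts(d) for d in after])           # [2, 2]   change is obvious
```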
Reflecting change raises the notion of ceiling and floor effects. Ceiling and floor effects refer to the fact that change on the measure may reach an upper or lower limit, respectively, and that further change cannot be demonstrated because of that limit. There must be room on the measure that allows for evidence of change in the intended direction. If the goal is to improve some skill or to decrease some problem behavior, there must be room on the scale to show movement in that direction. Sometimes this is easy: if the individual never engages in a behavior (e.g., exercise, leisure reading in methodology and statistics), then baseline observations will be zero, and there is great room for improvement. Detecting change is a precondition to identifying what caused the change, so the measure must allow room for change in the direction the intervention is likely to promote.1 Sometimes an investigator may wish to compare two or more interventions within the same client; here ceiling or floor effects are quite possible. After one intervention leads to change, it may be more difficult for the second intervention to show much more change if the scale has a restricted range.

1. Ceiling and floor effects are not merely matters of the numerical scale. For example, if the measure ranges from 1 to 50 and performance goes from a mean of 30 during the baseline (no-intervention) phase to a mean of 40 during the intervention phase, it looks like there is still room for further change on the scale and no ceiling effect. However, change on a measure is not equally easy across its full scale. There can be a functional limit or a ceiling that is not easily detected even when the numerical upper limit is not approached. The change from 30 to 40 may be much easier than the change from 40 to 50, and no one, perhaps, ever received scores above 45. There could be a ceiling effect even though there appears to be room at the end of the scale. Prior research can be a helpful guide to identify whether the upper (or lower) limits of a measure have been approached.
Fourth, whenever possible (which is almost always), measures should reflect a continuous dimension or scale rather than a binary category. This is related to but distinguishable from detecting change. A continuous dimension can range from some low number (e.g., 0) to some higher number on a continuum (e.g., 50). Percent of arithmetic problems solved correctly, number of minutes on an exercise bike, and number of pages read are all dimensions and reflect a range that can vary widely. The same constructs could be measured in a binary fashion (yes, no) by recording whether the person solved 70% of the problems correctly (yes, no each day), got on the exercise bike at all (yes, no), or sat down and read at least five pages (yes, no). Interventions are more easily evaluated when one can see a larger range than just 0 (no) to 1 (yes each day). It is easy to begin with a dimensional scale (e.g., 1 to 100) and later convert it to a categorical scale (did or did not meet some criterion [e.g., ≥ 50 or ≤ 49, respectively]) rather than the other way around.
Often people in everyday life are primarily concerned with bottom-line, categorical events. A mother says her adolescent is developing gum problems, and the dentist says the adolescent should brush her teeth at least once a day. The mother cares about the binary measure (brush, no brush) each day. From the standpoint of assessment, we will want to be sure our measure is addressing the problem. Yet from the standpoint of assessment and evaluation, we will want a continuous measure too and will use that as our primary measure for evaluation. This might be toothbrushing behaviors and include a series of 10 activities or steps evaluated as occurring or not occurring each day so that a score of 10 is possible. In terms of intervention, breaking down the behavior into steps (a dimensional scale) can be very helpful too. More on various strategies for dimensional and categorical measures is presented in the next chapter. At this point, the recommendation is clear: assess dimensions whenever possible or develop scales that can vary from 0 or 1 to a larger number, because these measures can more readily reflect intervention effects than can categorical, binary measures.
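The asymmetry in that recommendation is easy to demonstrate: a dimensional record can always be collapsed into the bottom-line binary category afterward, but a binary record cannot be expanded back into a dimensional one. The sketch below uses an invented 10-step toothbrushing score of the kind just described.

```python
# Invented data: number of the 10 toothbrushing steps completed each day.
daily_steps = [2, 3, 5, 7, 9, 10, 10]

# The binary, bottom-line measure can be derived from the dimensional one.
CRITERION = 8  # arbitrary cutoff for "brushed well enough" today
brushed = [score >= CRITERION for score in daily_steps]
print(brushed)  # [False, False, False, False, True, True, True]

# The dimensional series shows gradual improvement for days before the
# binary series registers any change at all; the reverse derivation
# (recovering daily_steps from brushed) is impossible.
```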
Fifth, the measure must be relevant to the ultimate focus or interest. This sounds so obvious that there must be a good reason to make this explicit. There is. Huge swaths of interventions in psychology, education, and counseling focus on constructs that are assumed to be a means to an end. Interventions are directed toward these means, but the means are not very relevant. Some examples are programs in schools, wilderness camps, and some institutions that focus on self-esteem and feelings of self-worth as a way of reducing risky behavior, aggressive actions, or eating problems. Similarly, self-help and other programs directed to improve child-rearing practices of parents often emphasize feelings of parental empowerment. Self-esteem and empowerment might be useful as targets of treatment, but they are not very relevant to the goals (reducing the problem behaviors, improving parenting) for which they are often posed. The reason is that the key concepts here are not causally connected to the goals of the program; improving self-esteem may be wonderful in its own right, but it is not established as relevant to changing aggressive behavior. I shall return to this in the discussion of validity later in this chapter.
Finally, the focus of the measure ought to make a difference or be important to the client or to others. Importance and relevance, the requirement noted previously, are different. Relevance has to do with ensuring that the measure reflects the construct or domain of interest: if we care about changing tics in an adult, focus on tics, not on how one feels about tics, although both would be quite fine to include. Feelings may be important and, indeed, some of my best friends report having them, but the issue is: What are we trying to accomplish in a given project, and how does the measure relate to that goal? In some cases, feelings might be the focus (e.g., happiness, experience of pain) and direct observations may be ancillary or complementary.
To capture the importance of the focus, we ask the question, "Why do we care about this domain or measure?" Often the examples are obvious (self-destructive behavior, driving while intoxicated, not completing homework, defacing books on methodology). Yet, in everyday life, parents and teachers often are concerned with problems that are mainly annoying and temporary (e.g., adolescents overusing the words "cool" or "like"; wearing odd but allowable combinations of clothing or having red, spiked hair). In research as well as in everyday life, it is not always obvious that a measure is important or will make a difference.
Relevance and importance, as key considerations in selecting measures, are based on an assumption that the single-case demonstration will have an applied purpose. That is, the demonstration focuses on individuals who have been identified based on the criteria mentioned earlier in the chapter and who would profit from some intervention. Consequently, the measure that will be used has relevance and importance as critical criteria. Single-case designs are also used in the context of experimental work and animal laboratory research. In such cases, of course, the goal is to test theoretical hypotheses or to describe some process. Measures are needed that can reflect change but are not restricted by the applied concerns noted in the present discussion.
The six characteristics of measures (Table 3.2) are central for single-case research in applied settings. There are other considerations in deciding what measure to use, such as the ease and cost of administration and interpretability to consumers of our work, including clients and the public at large. These are important but less central for the moment in conveying design requirements for the methodology of single-case designs. Assessment and design are very much intertwined, and the characteristics noted here are pivotal for evaluating performance over time, that is, on multiple assessment occasions, so that change can be detected and so that patterns in the data, required for the various designs, can be discerned, as elaborated in the design and data-evaluation chapters.

Use of Multiple Measures


In any investigation, single-case or otherwise, there is no need to rely on one measure. Also, the primary measure used to meet the design requirements ought to be administered on an ongoing basis, but not all measures need to be. Consider the use of additional measures and how they can be administered.
Several arguments favor the use of multiple measures when possible. First, it is rare that any single measure captures the construct completely. Indeed, operational definitions translate concepts into specific measures or indices, but by their very nature they can omit key components. One could readily translate signs of love between a couple into acts or gestures of affection (e.g., smiles, holding hands, touching each other affectionately, and saying "I love you"), and all would be reasonable components of a measure that operationalizes love. Yet there is more, including the subjective feelings of each person and, less obvious, the neurobiology and brain activation (through neuroimaging) that characterize romantic love. Clearly there is no one way of operationalizing love that can encompass all of these features. Similarly, an intervention may focus on pain reduction for someone recovering from an injury. Activity, walking, and not grimacing have been used to operationalize pain. Here too, we would want some subjective evaluation by the patient as a supplementary but critically important measure.
Second, the nature of many problems encompasses multiple facets. One might increase reading or number of pages read, but reading may have more components, such as comprehension, enjoyment of reading, and speaking to others about what one has read. These latter indices may not be the primary focus, but they are relevant and potentially important outcomes. Similarly, depression among clinical samples is not merely a matter of sad mood, but rather a package of behaviors, activities, and views (e.g., changes in appetite, loss of interest in activities, negative thoughts). Multiple measures using varied methods are likely to capture and encompass more of the domain.
Third, multiple perspectives are often pertinent. We might want to know if parents, teachers, peers, or the clients themselves believe there is a difference or change and whether that change was very important. Agreement among different informants or raters on social, emotional, and behavioral problems is notoriously low (Achenbach, 2006; De Los Reyes & Kazdin, 2005). For example, when child deviant behavior (tantrums, aggressiveness, shyness) is measured by child, parent, or teacher report, the results yield somewhat different information. Conclusions vary as a function of which informant is used to provide the measure of deviance of the child. Sometimes it may be important to capture the different perspectives.
Fourth, beyond the perspectives of different informants, different methods often yield different information. Self-report or parent report, records of performance (e.g., truancy, missed classes, arrests), and direct observations can yield different conclusions even when they are designed to assess the same construct. For example, self-reports of ethnic bias or prejudice, laboratory tasks that directly ask about or subtly assess bias, and actual behaviors in everyday encounters that reflect bias can yield different results. Because the method of assessment can influence conclusions, use of multiple measures is useful. One can "replicate" the effect across measures and show the extent to which a finding is or is not dependent on one method of assessment.
Fifth, interventions, even when they are focused on specific problem domains, in fact often produce a spread of effect that would be important to know or establish. For example, in my own work with children referred for aggressive and antisocial behavior, our intervention focuses on reducing these behaviors at home, at school, and in the community. We also have shown that parental depression and stress decrease and family relationships improve over the course of treatment, even though these are not our focus (e.g., Kazdin & Wassell, 2000). We assess these because both relate to child-rearing practices. Parental depression and stress in the home can influence child-rearing practices (e.g., harsh parental reactions, attention to more deviant behavior, commands) that help promote child oppositional behavior. Favorable side effects along these dimensions are informative and important to identify.
Whether and when to use multiple measures depend on many considerations, some substantive (e.g., Are there multiple outcomes of interest?), some methodological (e.g., Could the results be confined to one method of assessment?), and some practical (e.g., Are resources available to carry out or administer, code, and evaluate multiple measures?). Often the case can be made that one is interested in one measure or that a domain is well represented by one measure. The comments here are intended to focus on the other side. One should not automatically assume that a single measure is sufficient, adequate, or the most informative. Multiple measures can improve the demonstration by showing that the effects are not restricted to a single way of evaluating the target focus. Also, intervention effects are rarely surgical, that is, narrowly specific to the target focus. Multiple measures can evaluate the spread of effects to other areas that may be related and of interest but not directly focused on. The importance of this latter point is evident in research on medication in which the main outcome is whether the clinical problem (cancer, blood pressure) is altered. However, multiple measures of outcome (immediate change in symptoms, survival) and of other domains or side effects (e.g., impact on memory and other cognitive processes) enrich the evaluation and may actually determine whether anyone uses the medication.

When to Administer These Additional Measures


Pre-Post Assessment. As long as there is a primary outcome measure that is continuous and ongoing, other measures do not have to be. It may be the case that the investigator wishes to administer a test of some domain at pre and post, merely two occasions. This is the common strategy in between-group research (e.g., randomized controlled trials), where the primary measures are not ongoing or continuous over time. I shall not illustrate pre- and post-intervention measures; these are central to between-group research and probably very familiar.
In single-case research, the primary measure may be collected daily (e.g., cigarettes smoked, homework completed, delinquent acts in the setting). At the end of the study or periodically within the study, the investigator may include other measures (e.g., subjective feelings of health, teacher report of grades, arrest records). These supplementary measures can be extraordinarily useful and informative even though assessment on one or two occasions would not meet the essential features of the designs.
For example, a class-wide program was used in a child-care setting with 16 children ages 3 to 5 (Hanley et al., 2007). The focus was on various social skills related to their adjustment in class (e.g., communicating their wishes, following directions, and more). Such skills were identified by educators as being critical to school readiness and success. Training with the children was evaluated by ongoing assessment of the skills. In addition, a pre-post assessment was added. At the beginning and end of the program, teachers completed a questionnaire for 14 of the 16 children and rated the likelihood that each child would engage in the prosocial behaviors in diverse situations. The results indicated that 11 children greatly improved from pre to post; 3 did not, with an overall difference from pre to post that was statistically significant. What have we learned from this added assessment? Teachers believed that there were changes in the children based on what they saw before and after treatment. Could this be due to testing or statistical regression (two threats to validity)? Yes, but teacher reports were in keeping with the ongoing observational data that showed changes in the skills. On these latter measures, these threats to validity are implausible. The changes on observational and teacher report measures are parsimoniously and plausibly explained by the intervention effect.
Teacher perspectives are very important as a complement to the direct observational data. One can readily imagine a situation where student behavior change occurs but the teachers do not perceive a change. In such a case, it is possible that the changes were weak or not evident in the classroom. Also, it is easy to imagine teachers believing there is a change (because of expectations) where direct observations show there is no change. It is more reassuring when change is reflected on both types of measures.

Periodic Probes During the Study. Apart from pre and post assessments, single-case designs often use periodic assessment to measure performance in settings other than those in which the intervention is conducted or in relation to responses other than those included in the intervention program. For example, an intervention may focus on child behavior in the classroom. The measure may be ongoing to meet the requirements of the design; yet the investigator may take periodic samples of behavior on the playground to see if the behavior has changed there as well. The measures administered in this way are called probes. A probe is an assessment administered occasionally during a single-case investigation. Usually the purpose of the probe is to see whether the behavior carries over to another setting, is maintained over time, or whether another behavior has changed. These measures are not administered on a continuous basis. Rather, they may only be administered a few times during or after the intervention phase.
For example, in one study, four children with autism (ages 5 to 6) were trained to give helping responses after instructions from an experimenter (Reeve et al., 2007). Helping behaviors were grouped into several categories based on interviews of the parents about areas to train. The categories included cleaning, picking up objects, sorting materials, putting objects away, and others, each with different behaviors. Although we might be interested in teaching helping behavior in specific categories, the larger interest is teaching a more general repertoire, that is, helping that extends to areas that are not trained. During training, probe assessments were used to evaluate whether the children increased in helping behaviors in categories that were not trained. Indeed, they did. Before the study, little helping behavior was evident. This increased greatly during training (in a multiple-baseline design). The probe assessments showed that the results extended to areas of helping that were not trained specifically.
Another program focused on teaching skills to prevent sexual abuse among persons with mental retardation (Lumley, Miltenberger, Long, Rapp, & Roberts, 1998). Female adolescents with mental retardation are often victims of sexual abuse. Rates of sexual abuse (attempted and coerced intercourse) have ranged from 25% to 80% in different samples. In this project, six women (ages 30 to 42) with varying degrees of disability were trained to refuse requests, leave the situation, report the incident, and other such behaviors in role-playing situations. In training sessions, several situations were presented, the client was asked what she would do, and a score was assigned to the appropriateness of the response. Training was administered in sessions by a male and female trainer and involved instructions, modeling appropriate responses, rehearsal, praise, feedback, and practice.

Probes were used to assess whether the behaviors would carry over beyond training sessions, raising a delicate ethical dilemma (but approved by the Institutional Review Committee of the university). Probes consisted of someone working for the project who approached the client, made conversation for 15 minutes, and inappropriately made a sexual advance (with a prearranged request) to see what the client would do. The probe interactions were audiotaped. The results: training developed the refusal and avoidance skills very well in the sessions. Probe assessments under more naturalistic conditions did not reflect the effects of training. Probes provided critical information. Training did not do what was needed; more is needed to protect these women once they are in naturalistic situations.
These examples illustrate the utility of assessments that go beyond the main measure that is required for the designs. Periodic assessments and assessment of domains of interest can provide very important supplementary information about the scope of the changes, their impact, and indeed their value. The importance of the multiple assessments becomes even more evident in the discussion of validity in the next section.

RELIABILITY AND VALIDITY


In all scientific measurement, reliability and validity are key concepts. Generally, reliability refers to the consistency of the measure or measurement procedure. Validity refers to the content of the measure and whether the measure assesses the domain of interest. In psychology, evaluation of reliability and validity has focused on a variety of measures, but most often on paper-and-pencil questionnaires, rating scales, and other tests covering an almost endless list of domains of functioning (e.g., personality, motivation, achievement, intelligence, anxiety, depression, marital satisfaction, coping style, quality of life). In this more traditional assessment focus, several different types of reliability and validity have been identified and are used to establish the individual measures. Commonly designated forms of reliability and validity for traditional assessment are in an appendix at the end of this chapter as a point of reference and to make a connection with single-case designs. The discussion highlights key concepts as they relate to assessments usually conducted in single-case research.

Reliability of Observational Measures


Typically, assessment in single-case designs focuses on direct observation of overt behavior. However, this is not a necessary requirement for the designs, as noted further in the next chapter. Measures are devised to assess a target behavior, as already discussed. An advantage is that measures are often individualized to each client and situation, so that there is not a "tantrum" scale or "academic deficiency" scale. The measure for a child referred for tantrums or academic deficiency may be individually tailored, as needed. In assessing behavior, the key reliability question is whether the observations are obtained consistently. Consistency here does not mean, "Does the subject perform consistently?" That is another matter and not what is meant by reliability of measurement. Rather, it means, "To what extent do observers looking at the client record in a consistent fashion?" In relation to the types of reliability listed in the appendix, this is interrater or interobserver agreement. We want consistent assessment; that is, we want the assessment not to be a function of who is doing the observing, but rather of who is being observed (the client).

There are normal fluctuations in human performance, so even if there is perfectly consistent assessment, there will be variability in performance. Any inconsistencies in measurement will add to that variability in another way. In general, a research goal is to minimize variability that has to do with extraneous features of the test. (In fact, the nasty word "error" is often used to label all sources of variability that the investigator does not really want to analyze but does want to minimize.) Variability is minimized to facilitate evaluation of the intervention effects. As I discuss later in relation to the logic of designs and data evaluation, excessive variability can interfere with and obscure veridical intervention effects in both between-group and single-case research.
Interobserver agreement is a major topic within single-case research. Researchers go to great lengths to ensure that the behaviors are well defined, the observation method is well described, and the measures can be obtained consistently. The significance of the topic, how agreement is measured across different types of measures, and sources of bias and artifact in the measurement procedures serve as the basis for a separate discussion (Chapter 5).
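Although the full treatment is reserved for Chapter 5, a brief sketch may help preview two of the agreement indices named in the appendix to this chapter (percent agreement and kappa). The interval records below are invented, and the functions are minimal illustrations of the standard formulas rather than the procedures of any particular study.

```python
# Invented interval records (1 = behavior occurred) from two independent
# observers who watched the same session.
obs_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
obs_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

def percent_agreement(a, b):
    """Percentage of intervals scored identically by the two observers."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_e = (sum(a) / n) * (sum(b) / n) + \
          (1 - sum(a) / n) * (1 - sum(b) / n)            # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(percent_agreement(obs_a, obs_b))       # 80.0
print(round(cohens_kappa(obs_a, obs_b), 2))  # 0.62, agreement beyond chance
```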

Validity of Single-Case Measures


The extent to which a measure assesses the construct of interest is the overall focus of validity. Decades ago, when direct observations were coming into widespread use, the dominant view was that one need not be concerned with validity. The usual types of validity (see the appendix) seemed relevant to psychological measures such as various scales, inventories, and tasks. For example, when one wants to evaluate intelligence with a measure (intelligence tests), it makes great sense to ask about validity, that is, whether the measure relates to performance beyond performance on the test. And we have learned from decades of research that intelligence test performance relates very highly to academic performance (concurrent and predictive validity), that is, is a valid measure of that performance. All this is going on while there is a tsunami of debates about what intelligence really is, the many types of intelligence, and whether any particular measure captures all that is intended. Behavioral observations were thought to be free from, or at least much less subject to, this kind of validity concern because the performance in situ is being observed directly. That is, we do not have to ask about the extent to which a given measure really gets at the child's bullying or tantrums. Behavioral measures are not a questionnaire or paper-and-pencil measure; they sample the bullying or tantrums directly on the battlefields (school and home).
This view has changed in light of a deeper appreciation of assessment in general and the special features of direct observations. Five points convey the issue. First, any single measure does not usually capture all of the domain of interest. Thus an operational definition of something like bullying, tantrums, positive family interaction, or compulsive behavior may be a great reflection of the domain, but it is not the entire domain. The investigator and consumer might be concerned with all sorts of related behaviors that are outside of the operational definition. Are all the behaviors of the domain well represented by the operational definition? That is a validity question. In relation to traditional assessment of validity, several types of validity (content, concurrent, face, and convergent) address the relation of a given measure of the domain of interest to other samples of behavior from that same or related domains.

Second, behavioral assessments and observations are usually limited to specific situations, such as specific times of the day, or tasks or situations that are relatively unvarying, and so on. The investigator often tries to control the assessment conditions to ensure that client behavior does not fluctuate widely because of changing conditions. Insofar as the situation is structured or controlled, one can raise the question, "Do the observations reflect performance under slightly less structured or totally unstructured situations or more 'normal' conditions of everyday life?" That too is a validity question. Here traditional types of validity (e.g., concurrent, predictive) focus on the extent to which performance on the measure relates to performance on other indices either now or at some point in the future.
Third, let us go back to the measure itself. Is the measure getting at a domain or behavior that is very important? Sometimes this is obvious because the intervention is designed to overcome a significant condition that impairs performance. Yet sometimes the focus is less clear. To what extent does this measure get at or assess something that is important and that we ought to care about? This is a validity question, in this case social validity. Social comparison and subjective evaluation (discussed previously) are designed to address this concern.
Fourth, observational measures are completed by humans, and the filter of human perception has been very well elaborated. Various filtering processes (e.g., cognition, perceptions, and beliefs) are not artifacts of human observation; they are built into our hardware (brain), our software (learning, experience), and their interactions. Even with reliability that is established among observers, there can be beliefs that influence behavioral observations. For example, observations of family interaction revealed that ratings were influenced by the ethnicity of the observers in relation to the ethnicity of the families being observed (Yasui & Dishion, 2008). European American raters viewed African American mothers as showing fewer problem-solving skills and poorer relations with their children than they did in their ratings of European American mothers. African American raters did not show this difference. Moreover, independent evaluations revealed that there were no differences in the families being rated. The study conveys that under some circumstances ethnic bias can influence the observations. The measure on which the results were obtained involved observing direct interactions for a brief period and then completing ratings. This is standard in this area of research (family interaction) but not quite the direct observational procedures usually used where concrete behaviors are coded. Even so, when human judgment is involved, core features of judgment (e.g., perceptions, bias) can be expected.
Finally, the observational measure may reflect change. However, do the changes make any difference to the clients or to those who are in contact with the clients (e.g., parents, teachers, and peers at work)? Assume that the intervention is effective and made a change in some important behavior (e.g., stuttering), physical condition (e.g., weight, blood pressure), or habit (e.g., cigarette smoking). Now we can ask, "Was the magnitude of change of the order that makes any real difference?" Sometimes the answer can be addressed easily if the measure (e.g., weight, blood pressure, blood sugar, exercise) can be connected to well-established correlates (e.g., risk for heart attack). That is, we can sometimes assess the value of the change by knowing what the outcome means in other terms. Another situation is one in which a particular behavior that is problematic (e.g., panic attacks, fighting, abusing one's child) is completely eliminated.
No one usually asks, "Yes, but is getting the parent to stop brutalizing the child really an important outcome?"
More often than not, behavioral measures do not have clear correlates that facilitate interpretation of the magnitude of the change, as in some health outcomes. Also, the behavior may not be eliminated, or indeed even need to be eliminated. In these circumstances, it is appropriate and important to ask the question, "Does the amount of change make a difference?" This too is a question of social validity. Social comparison and subjective evaluation are very pertinent.
Are there cases where the validity of measures could be questioned, or are all of these concerns merely straw men and women? Yes: in the move from the goal of a program to measurement and intervention, occasionally the validity and relevance of the measure can be challenged. For example, consider a program that is designed to reduce the weight of clients who are grossly obese (defined as more than 100 lbs [45.36 kg] overweight). In such a program, the goal may be to increase the amount of exercise (e.g., minutes walking or jogging). Exercising is not a direct measure of weight, and one can increase exercise without any change in weight. Even calorie consumption is not a direct index, because one can reduce calories and still show little or no weight loss (due to change in metabolism). Needless to say, exercise is important and calorie consumption is relevant, but weight or body mass probably is the measure of interest and ought to be included in the assessment in some way, even if not the primary measure that is assessed daily to meet the requirements of the design.
As I noted, sometimes the measures are obviously important insofar as they sample the domain of interest. Even so, it is useful to ask if the measure and the magnitude of change on the measure over the course of the intervention reflect the domain in contexts that are different from those in which the observations are obtained and make a difference either to the clients or those in contact with them. These are slightly different questions from those in relation to traditional validity (see appendix) but are captured well by social validity.
Reliability and validity are central to all measures. Consistency of observation is fundamental to experimental design and data evaluation and is discussed again in those contexts. Validity is more multifaceted and requires greater effort and attention. The reason is that direct observations are often assumed to be the "real thing" and not removed from the actual domain of interest. Even when the direct behaviors are clearly of interest, they may not reflect those same behaviors outside of the carefully controlled program situation, as evident in the discussion of probes. The use of multiple measures, probes, and social validation techniques is aimed at addressing the validity of the measures, and we shall see samples of that throughout later chapters.

SUMMARY AND CONCLUSIONS


In this chapter, fundamental issues were discussed that pertain to selecting the focus of the intervention and therefore the assessments. Identification of the focus of assessment is often obvious because of the setting and its goals (e.g., education in the schools, rehabilitation of substance abusers) and the nature of the client's problem (e.g., severe deficits or excesses in performance). Several other criteria were discussed, including social, emotional, cognitive, or behavioral characteristics that reflect impairment (e.g., illegal or rule-breaking behavior, actions that are of danger to oneself or to others, behaviors at risk for untoward outcomes, signs of stress, or unusual or extreme symptoms of clinical dysfunction). Behaviors that are not that severe but are of concern to others (e.g., parents, teachers) or that may prevent the onset of untoward problems also serve as the basis for intervening.
Social validity was introduced to convey other criteria that are pertinent to selection of the target focus. Social comparison consists of identifying target behaviors based on normative data that convey what is accepted or common in everyday life. Subjective evaluation is used to examine whether the focus is one that makes a difference in the opinions of experts or those in contact with the client. In this chapter we focused on selection of target behaviors. Social comparison and subjective evaluation are also used in evaluating change. We return to these topics in the discussion of data analyses and evaluation.
When direct observations of behaviors are used as the assessment focus, it is important that the definition of the behaviors meet three criteria: objectivity, clarity, and completeness. Meeting these criteria requires not only explicit definitions, but also decision rules about what does and does not constitute performance of the target behavior. The extent to which definitions of behavior meet these criteria determines whether the observations are obtained consistently and, indeed, whether they can be obtained at all. These criteria help reduce error (variability) in the measure that is due to the assessment procedures and in that way can increase the sensitivity of detecting changes in performance.
Several measurement requirements were noted for single-case designs. Measures need to be administered repeatedly, should be administered consistently (reliably), should be able to reflect change, should reflect a continuous dimension or scale rather than a binary category when possible, and what is measured ought to be relevant to the ultimate focus or goal of the program. A unique feature of single-case designs is the use of ongoing, continuous assessment. Many different measures can be used in a given study. At least one of these ought to meet the requirements mentioned previously because these are central to single-case designs and data evaluation. However, other measures (e.g., at pre- and post-intervention, or probes) can supplement the primary measure and do not have to be administered in an ongoing way.
Reliability and validity were discussed. Both are topics of assessment in research in general but have special facets that pertain to single-case research in the context of direct observations of behavior. Consistency of measurement is critical for evaluation, in part to minimize variability in the data due to observers and observational procedures. A careful definition of the target behavior is the starting point for consistent observations, but many more issues are involved and are taken up in a later chapter. Validity of the measure cannot invariably be assumed. The measure may obviously sample the domain of interest, but obviousness is not equivalent to having data to show that the measure is relevant, makes a difference, and that changes on the measure are important. Social validity indices are often used to address these latter concerns.
This chapter focused on critical considerations of assessment for single-case designs. Determining the focus, defining the target behavior, and using a measure that will meet the special requirements of single-case designs are key features. Ensuring that the measure can be obtained consistently and that the measure samples the domain of interest and makes a difference are fundamental features too. The next step is to move to specific procedures and techniques for assessment, the topic of the next chapter.

APPENDIX 3.1

COMMONLY REFERRED TO TYPES OF RELIABILITY AND VALIDITY

Reliability

Test-Retest Reliability
  The stability of test scores over time; the correlation of scores from one administration of the test with the scores on the same instrument after a particular time interval has elapsed.

Alternative-Form Reliability
  The correlation between different forms of the same measure. The items of the two forms are considered to represent the population of items for that measure.

Internal Consistency
  The degree of consistency or homogeneity of the items within a scale. Different reliability measures are used toward this end, such as split-half reliability, Kuder-Richardson Formula 20, and coefficient alpha.

Interrater (or Interscorer) Reliability
  The extent to which different assessors, raters, or observers agree on the scores they provide when assessing, coding, or classifying subjects' performance. Different measures are used to evaluate agreement, such as percent agreement, Pearson product-moment correlations, and kappa.

Validity

Construct Validity
  A broad concept that refers to the extent to which the measure reflects the construct (concept, domain) of interest. Other types of validity and other evidence that elaborates the correlates of the measure are relevant to construct validity. Construct validity focuses on the relation of a measure to other measures and domains of functioning of which the concept underlying the measure may be a part.

Content Validity
  Evidence that the content of the items reflects the construct or domain of interest. The relation of the items to the concept underlying the measure.

Concurrent Validity
  The correlation of a measure with performance on another measure or criterion at the same point in time.

Predictive Validity
  The correlation of a measure at one point in time with performance on another measure or criterion at some point in the future.

Criterion Validity
  Correlation of a measure with some other criterion. This can encompass concurrent or predictive validity. In addition, the notion is occasionally used in relation to a specific and often dichotomous criterion when performance on the measure is evaluated in relation to disorders (e.g., depressed vs. nondepressed patients) or status (e.g., prisoners vs. nonprisoners).

Face Validity
  The extent to which a measure appears to assess the construct of interest. Not regarded as a formal type of validation or part of the psychometric development or evaluation of a measure.

Convergent Validity
  The extent to which two measures assess similar or related constructs. The validity of a given measure is suggested if the measure correlates with other measures with which it is expected to correlate. The correlation between the measures is expected based on the overlap or relation of the constructs. This is a variation of concurrent validity but takes on special meaning in relation to discriminant validity.

Discriminant Validity
  The correlation between measures that are expected not to relate to each other or to assess dissimilar and unrelated constructs. The validity of a given measure is suggested if the measure shows little or no correlation with measures with which it is expected not to correlate. The absence of correlation is expected based on the separate and conceptually distinct constructs. This is especially meaningful in a demonstration that also examines convergent validity.

Note: The types of reliability and validity presented here refer to commonly used terms in test construction and validation and in the context of between-group research. They are broadly relevant to measurement and sensitize the researcher to multiple considerations about what will be used to reflect the outcomes of a study.
CHAPTER 4

Methods of Assessment

CHAPTER OUTLINE

Strategies of Assessment
Measures of Overt Behavior
Frequency Measures
Discrete Categorization
Number of People Who Perform the Behavior
Interval Recording
Duration
Latency
Other Strategies Briefly Noted
Response-specific Measures
Psychophysiological Assessment
Self-report Measures
Reports by Others
Selection of an Assessment Strategy
Conditions of Assessment
Natural Versus Contrived Tasks and Activities
Natural Environment Versus Laboratory (or Clinic) Settings
Obtrusive Versus Unobtrusive Assessment
Human Observers Versus Automated Recording
General Comments
Summary and Conclusions

Assessment of performance in single-case research has encompassed an extraordinarily wide range of measures and procedures that meet the requirements discussed in the previous chapter. Most measures are based on directly observing overt performance. When overt behaviors are observed directly, a major issue is selecting the measurement strategy. Although observation of overt behavior constitutes the vast bulk of assessment in single-case research, other assessment strategies are used, such as psychophysiological assessment, rating scales, and other measures unique to specific target behaviors. This chapter describes and illustrates several measurement options for single-case designs. Emphasis is given to measures of overt behavior because these have been commonly used in single-case research. In addition, overt behavioral measures are less well covered in traditional texts or other resources on assessment. The chapter samples other types of measures rather than covering them comprehensively. The reason is that any assessment modality, method, or format can be used in single-case designs as long as the requirements (e.g., ongoing assessment) mentioned in the previous chapter are met.

STRATEGIES OF ASSESSMENT

Perhaps the most significant point of departure for assessment is to convey that single-case designs do not necessarily entail any particular assessment modality. Rating scales completed by individuals or others, measures of biological processes or outcomes (e.g., blood pressure, breathing rate), automated responding (e.g., movements), and more all are suitable for the designs. The de facto connection of the designs with behavioral research has underscored their association with direct measures of overt behavior, but examples can be provided to convey that the connection is not required. And perhaps the diffusion of single-case designs into many disciplines and areas of work accounts for the expansion to measures that are less committed to direct observations of overt behavior. At the same time, direct behavioral measures have not been selected arbitrarily. They provide direct measures of many domains of interest in applied settings and warrant the bulk of the attention of the present chapter.

Measures of Overt Behavior

Assessment of overt behavior can be accomplished in different ways. In most programs, behaviors are assessed on the basis of discrete response occurrences or the amount of time that the response occurs. However, several variations and different types of measures are available.

Frequency Measures. Frequency counts require simply tallying the number of times the behavior occurs in a given period of time. A measure of the frequency of the response is particularly useful when the target response is discrete and each instance takes a relatively constant amount of time. A discrete response has a clearly delineated beginning and end so that separate instances of the response can be counted. The performance of the behavior should take a relatively constant amount of time so that the units counted are approximately equal. Ongoing behaviors, such as smiling, sitting in one's seat, reading, lying down, and talking, are difficult to record simply by counting because each response may occur for different amounts of time. For example, if a person talks to a peer for 15 seconds and to another peer for 30 minutes, these might be counted as two instances of talking. A great deal of information is lost by simply counting instances of talking, because they differ in duration.

Frequency measures have been used for a variety of behaviors. For example, a program designed to increase the daily productivity of writers of fiction used word count (automatically counted by software on a document file) to measure how much writing was completed each day (Porritt, Burt, & Poling, 2006). Another study evaluated children's compliance with instructions at home and at school (Tarbox, Wallace, Penrod, & Tarbox, 2007). The number of times the child complied (completed the request within two instructions to do so) was simply counted. In another example, a feeding program was provided for a 4-year-old child hospitalized because he could not eat and swallow food (Girolami, Boscoe, & Roscoe, 2007). His food consumption had been restricted to tube feeding placed in his stomach. During the intervention he was fed spoonfuls of food. Each spoonful was a "trial," that is, an opportunity to take the food and swallow. The measure was food expulsion and consisted of expelling the food (visible outside his mouth) after each spoonful. That is, the number of times he expelled food after feeding was counted. There are additional examples of discrete behaviors that can be easily assessed with frequency counts, such as the number of times a person attends an activity, says hello or greets someone, assaults another person, throws an object at another person, uses special vocabulary words in speaking or writing, or makes errors of one kind or another (on the job, in school performance).

The examples convey two ways frequency measures are used. One is a situation in which the behavior is free to occur on multiple occasions, that is, there is no fixed limit on the number of times the behavior could occur. For example, how often one child hits another may be measured by frequency counts. How many times the behavior (hitting) may occur has no theoretical limit. The other is a situation in which the opportunities are restricted because of specific discrete trials or in response to stimuli that are presented only a specific number of times. For example, if the child is greeted by the parent or teacher or given an instruction 10 times per day, the frequency of the child's response is restricted. The distinction is not critical, but it can be. If the opportunities are restricted, we want a large range to be possible (e.g., more than 10 opportunities) to permit evaluation of the base rate and to evaluate whether there is change once an intervention is introduced. A very small range (e.g., two or three opportunities) might make evaluation of the intervention more difficult.

Frequency measures require merely noting instances in which behavior occurs. Usually there is an additional requirement that behavior be observed for a constant amount of time. Of course, if behavior is observed for 20 minutes on one day and 30 minutes on another day, the frequencies are not directly comparable. However, the rate of response can be obtained by dividing the frequency of responses by the number of minutes observed each day. This measure will yield frequency per minute, or rate of response, which is comparable for different durations of observation. If the frequency measure is based on a fixed number of opportunities to perform the behavior, one can easily convert the frequency data to the percentage of times the behavior occurred given all of the opportunities.
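
To make the two conversions concrete, here is a minimal sketch in Python; the code and function names are illustrative additions and are not drawn from the programs described in this chapter:

    def rate_per_minute(count, minutes_observed):
        # Convert a raw tally to responses per minute so that sessions
        # of different lengths (e.g., 20 vs. 30 minutes) are comparable.
        return count / minutes_observed

    def percent_of_opportunities(count, opportunities):
        # When responding is limited to fixed occasions (e.g., 10
        # instructions per day), express the tally as a percentage.
        return 100.0 * count / opportunities

    print(rate_per_minute(12, 30))          # 12 responses in 30 minutes -> 0.4
    print(percent_of_opportunities(7, 10))  # 7 of 10 opportunities -> 70.0

Either conversion places sessions of unequal length or unequal opportunity on a common scale for graphing across days.
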
A frequency measure has several desirable features for use in applied settings. First, the frequency of a response is relatively simple to score for individuals working in everyday settings. Keeping a tally of the behavior is usually all that is required. Moreover, counting devices, such as wrist counters or calculators, including those on cell phones, are available to facilitate recording by pressing a button to keep the tally for a given observation period. Second, frequency measures readily reflect changes over time. Years of basic and applied research have shown that response frequency is sensitive to a variety of interventions. Third, frequency expresses the amount of behavior performed, which is usually of concern to individuals in applied settings. In many cases, the goal of the program is to increase or decrease the number of times a certain behavior occurs. Frequency provides a direct measure of the amount of behavior.

Discrete Categorization. Often it is very useful to classify responses into discrete categories, such as correct-incorrect, performed-not performed, or appropriate-inappropriate. In some ways, discrete categorization resembles a frequency measure because it is used for behaviors that have a clear beginning and end and a constant duration. Yet there are critical differences. First, with a frequency measure, performances of a particular behavior are tallied. The focus is on a single response (hitting, complying). In discrete categorization, several different behaviors may be included, and each is scored as having occurred or not. The behaviors go together in forming a larger unit or goal (e.g., all the steps related to getting ready for school in the morning or cleaning one's room or apartment). The constituent behaviors are all different. Second, frequency often has no real limit; the person may engage in that behavior (e.g., hitting, swearing) from zero to some higher and varying number. In discrete categorization, there is only a limited number of opportunities to perform the responses, as defined by the total number of steps involved or the number of component behaviors.

For example, discrete categorization might be used to measure the sloppiness of one's college roommate. To do this, a checklist can be devised that lists several different behaviors all related to sloppiness. These might include such items or tasks as putting away one's shoes in the closet, removing underwear from the kitchen table, putting dishes in the sink, putting food away in the refrigerator, and so on. Each morning (or some constant time each day), the behaviors on the checklist are observed; each one is categorized as performed or not performed. The total number (or percentage) of behaviors performed correctly constitutes the measure.
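
A minimal sketch of the checklist arithmetic, using hypothetical items from the roommate example (the code is illustrative, not part of the original account):

    # Each checklist item is categorized as performed (True) or not (False).
    checklist = {
        "shoes put away in closet": True,
        "underwear removed from kitchen table": False,
        "dishes placed in sink": True,
        "food put away in refrigerator": True,
    }

    performed = sum(checklist.values())
    percent = 100.0 * performed / len(checklist)
    print(f"{performed}/{len(checklist)} steps performed ({percent:.0f}%)")  # 3/4 (75%)
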
Discrete categories have been used to assess behavior in many applied programs. For example, one program focused on special education teachers who sought assistance in managing off-task and disruptive student behavior (e.g., refusing to do work, disruptive verbal statements) (DiGennaro, Martens, & Kleinmann, 2007). The procedures for managing these behaviors have been established for decades; the task is providing skills to teachers and ensuring that the skills are carried out correctly. Effective components of the program were identified and evaluated as performed or not performed correctly. Sample components included explaining the program to the student, providing praise and stickers correctly, and providing the back-up reward when enough stickers were earned. This is a good example of multiple steps, each one scored as being performed or not performed correctly. The percentage of steps completed correctly (proportion of categories with "yes") was used to evaluate teacher performance.

Discrete categorization is readily adaptable to many different situations, especially those related to completion of activities, skills, and other tasks that may include many different components. Cleaning one's room, being prepared for some activity, completing practice (e.g., music, athletic), and completing one's chores or other responsibilities are examples. In each case, the measure is made by defining each step and deciding what constitutes completed/performed or not completed/performed.

A unique feature of the method is noteworthy. The behaviors that form the list need not be related to one another or represent the flow (steps) of a single activity. Performance of one may not necessarily have anything to do with performance of another. For example, room-cleaning behaviors or a set of separate chores are not necessarily related or very similar; performing one correctly (making one's bed) may be unrelated to another (clearing away dishes). Hence, discrete categorization is a very flexible method of observation that allows one to assess all sorts of behaviors independently of whether they are necessarily related to each other. This is important because sometimes the goal of the program is not increasing the occurrence of a behavior (as in frequency counting) but rather developing execution of a task that has many different components, as in the examples noted here.

Number of People Who Perform the Behavior. Occasionally, the effectiveness of the intervention is evaluated on the basis of the number of people who perform the target response. Obviously, this measure is used in group situations such as a classroom, school, or community where the purpose is to increase the overall performance of a particular behavior, such as coming to an activity on time, completing homework, speaking up in a group, recycling waste materials, paying bills on time, and voting. Once the desired behavior is defined, observations consist of noting how many participants in the group have performed the response. As with frequency and categorization measures, the observations require classifying the response as having occurred or not. But here the individuals are counted rather than the number of times an individual performs the response.
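
A minimal sketch of the counting involved, with hypothetical daily records (the names and values are invented for illustration):

    # Which group members performed the target response today
    # (e.g., submitted homework).
    performed_today = {"student_a": True, "student_b": False,
                       "student_c": True, "student_d": True}

    n_performed = sum(performed_today.values())
    percent = 100.0 * n_performed / len(performed_today)
    # The unit counted is the person, not the number of responses per person.
    print(f"{n_performed} of {len(performed_today)} performed ({percent:.0f}%)")
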
Several programs have evaluated the impact of interventions on the number of people who are affected. For example, programs are often designed to improve driver safety and evaluate outcomes such as increasing the use of seat belts and decreasing the use of cell phones while driving, or driving in some other way to avoid accidents (e.g., Clayton, Helms, & Simpson, 2006; Van Houten, Malenfant, Zhao, Ko, & Van Houten, 2005). The number of people, cars, or drivers is the relevant metric. In these instances, there is no necessary interest in the behavior of individuals, and indeed it would be difficult to track individual cars from one day to the next. The issue is one of changing most or all individuals, whoever they may be.

In other instances the identity of the individuals may be known, but still the number who engage in the outcome is of interest. For example, one program in a college class focused on the number (percentage) of students who submitted their homework assignments (Ryan & Hemmes, 2005). A point program was designed to increase assignment completion. One might make the case for a different focus. For example, in a class where the identity of everyone is known, perhaps the focus ought to shift to those individuals who can be identified and who do not turn in homework. However, the number of individuals was quite useful in reflecting intervention effects.

Knowing the number of individuals who perform a response is very useful when the explicit goal of a program is to increase performance in a large group of subjects. Developing behaviors in an institution and in society at large is consistent with this overall goal. Increasing the number of people who exercise, give to charity, or seek treatment in early stages of serious diseases, and decreasing the number of people who smoke, overeat, speed as they drive through school zones, and commit crimes are all important goals. The number of people or some other unit (businesses, schools) that engage in behaviors or some other practice is of keen interest. Prominent examples pertain to climate change and promoting a sustainable environment, where the goals might be to increase the number of people who carpool and who elect not to have their linens changed daily during a hotel stay (and thereby reduce energy use from laundering), or the number of homes that have at least one energy-efficient appliance and the number of businesses in a community that engage in "green practices" (recycling, providing employee incentives for using public transportation). Hence, the number of people or the number of other units (e.g., homes, classrooms) that perform a response is of great interest.

Interval Recording. A frequent strategy of measuring behavior in an applied setting is to measure the behavior based on units of time rather than on discrete response units. Behavior is recorded during short periods of time for the total time that it is performed. The two main versions of time-based measurement are interval recording and response duration.

With interval recording, usually behavior is observed for a single block of time such as 30 or 60 minutes once per day. A block of time is divided into a series of short intervals (e.g., each interval equaling 10 or 15 seconds). The behavior of the client is observed during each interval. The target behavior is scored as having occurred or not occurred during each interval. If a discrete behavior, such as hitting someone, occurs one or more times in a single interval, then the response is scored as having occurred. Several response occurrences within an interval are not counted separately. If the behavior is ongoing with an unclear beginning or end, such as talking, playing, and sitting, or occurs for a long period of time, it is scored during each interval in which it is occurring.

Intervention programs in classroom settings frequently use interval recording to score whether students are paying attention, sitting in their seats, and working quietly. An individual student's behavior may be observed for 10-second intervals over a 20-minute observational period. For each interval, an observer records whether the child is in her seat working quietly. If the child remains in her seat and works for a long period of time, many intervals will be scored for attentive behavior. If the child leaves her seat (without permission) or stops working, inattentive behavior will be scored. During some intervals, she may be sitting in her seat for half of the time and running around the room for the remaining time. Since the interval has to be scored for either attentive or inattentive behavior, a rule must be devised as to how to score behavior in this instance. Often, getting out of the seat will be counted as inattentive behavior within the interval.

Interval recording for a single block of time has been used in many programs beyond the classroom setting. For example, one program trained parents to interact with their children in ways that would promote positive child behaviors at home (Phaneuf & McIntyre, 2007). The focus was on parent behavior. At home, the mother-child dyad was observed for several minutes during free play, clean-up, and during an activity. Each 30-second interval was evaluated to assess if the mother engaged in inappropriate behaviors (e.g., giving ambiguous commands, unwittingly reinforcing inappropriate child behavior, and criticizing the child). The benefits of training were evident from the decreases in the percentage of intervals with inappropriate parenting behaviors.

Interval recording was used in a study that was designed to evaluate the happiness and quality of life of three nursing home residents (over 80 years old) who were diagnosed with Alzheimer's disease and had limited verbal repertoires (Moore, Delaney, & Dixon, 2007). Ten-minute observation periods were divided into 10-second intervals, each of which was scored as reflecting happiness, unhappiness, or neither happiness nor unhappiness. These were operationalized on the basis of facial and vocal expressions (e.g., smiling, laughing vs. frowning, grimacing, crying, or yelling). If no clear affect was present, the interval would be scored as neither. Assessment was completed before, during, and after participants engaged in various activities.

In using an interval scoring method, an observer looks at the client during the interval. When one interval is over, the observer records whether the behavior occurred. If an observer is recording several behaviors in an interval, a few seconds may be needed to record all the behaviors observed during that interval. If the observer recorded a behavior as soon as it occurred (before the interval was over), he or she might miss other behaviors that occurred while the first behavior was being scored. Hence, many investigators use interval-scoring procedures that allow time to record after each interval of observation. Intervals for observing behavior might be 10 seconds, with 2 to 5 seconds after the interval for recording these observations. If a single behavior is scored in an interval, no time may be required for recording. Each interval might be 10 seconds. As soon as a behavior occurred, it would be scored. If a behavior did not occur, a quick mark could indicate this at the end of the interval. Of course, it is desirable to use short recording times, when possible, because when behavior is being recorded, it is not being observed. Recording consumes time that might be used for observing behavior.

A variation of interval recording is time sampling. This variation uses the interval method, but the observations are conducted for brief periods at different times rather than in a single block of time. For example, with an interval method, a child might be observed for a 30-minute period. The period would be broken down into small intervals such as 10 seconds. With the time-sampling method, the 30-minute period might be divided into three 10-minute periods throughout the day (e.g., morning, early afternoon, and late afternoon). During the 10-minute periods, the child is still observed for 10-second intervals just like before. Spreading out the observation periods over the entire day is likely to capture a more representative sample of performance than measuring behavior for a single period.
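
A minimal sketch of the scheduling arithmetic just described (illustrative only): three 10-minute periods, each divided into 10-second intervals, yield the same number of intervals as one 30-minute block, but spread across the day.

    periods = ["morning", "early afternoon", "late afternoon"]
    intervals_per_period = (10 * 60) // 10  # 60 ten-second intervals per period

    schedule = [(period, i) for period in periods
                for i in range(intervals_per_period)]
    print(len(schedule))  # 180 intervals, the same total as a single 30-minute block
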
Interval recording has been widely adopted as an assessment strategy. First, the method is very flexible because virtually any behavior can be recorded. The presence or absence of a response during a time interval applies to any measurable response. Whether a response is discrete and does not vary in duration, is continuous, or is sporadic, it can be classified as occurring or not occurring during a brief time period. Second, the observations resulting from interval recording can easily be converted into a percentage. The number of intervals during which the response is scored as occurring can be divided by the total number of intervals observed. This ratio multiplied by 100 yields the percentage of intervals in which the response is performed. For example, if social responses are scored as occurring in 20 of 40 intervals observed, the percentage of intervals of social behavior is 50% (20/40 × 100). A percentage is easily communicated to others by noting that a certain behavior occurs a specific percentage of time (intervals). Whenever there is doubt as to what assessment strategy should be adopted, an interval approach is almost always applicable.
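
The percentage computation, shown as a minimal sketch using the example from the text (the function name is an illustrative addition):

    def percent_of_intervals(intervals_scored, intervals_observed):
        # Intervals in which the response occurred, divided by the total
        # intervals observed, multiplied by 100.
        return 100.0 * intervals_scored / intervals_observed

    # Social responses scored as occurring in 20 of 40 intervals.
    print(percent_of_intervals(20, 40))  # 50.0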

Duration. Another time-based method of observation is duration, or the amount of time that the response is performed. This method is particularly useful for ongoing responses that are continuous rather than discrete acts or responses of extremely short duration. Programs that attempt to increase or decrease the length of time a response is performed might profit from a duration method. As an example, one investigation focused on academic tasks of school-age children with learning disabilities. Tasks were individualized to the age and goals of the children (e.g., writing time for an essay, tracing letters for a young child) (Athens et al., 2007). In another program, the focus was on six male children and adolescents (ages 8-17) with Fragile X syndrome, an inherited form of developmental disability (Hall, Maynes, & Reiss, 2009). Individuals with this disorder often show an aversion to eye contact with others. The goal of the intervention was to increase eye contact during periods in which an experimenter sat across from a child and tried to engage him in interaction. Duration of eye contact in seconds was recorded by an observer who pressed keys on a laptop computer to note the onset and offset of eye contact during the interactions. Other examples include the amount of time engaging in social interaction, remaining in situations that before treatment promoted anxiety, exercising, studying, practicing (a musical instrument, athletic skill), reading, and so on. For many interventions or programs, increasing the duration (e.g., of practicing, exercising) is a central goal.

Assessment of response duration is a fairly simple matter. The requirement is that the observer start and stop a stopwatch or note the time when the response begins and ends. However, the onset and termination of the response must be carefully defined. If these conditions are not met, duration is extremely difficult to employ. For example, in recording the duration of a tantrum, a child may cry continuously for several minutes, whimper for short periods, stop all noise for a few seconds, and begin intense crying again. In recording duration, researchers must decide how to handle changes in the intensity of the behavior (e.g., crying to whimpering) and pauses (e.g., periods of silence) so that they can be consistently recorded as part of the response or as a different (e.g., nontantrum) response.

Use of response duration is generally restricted to situations in which the length of time a behavior is performed is a major concern. Yet that may cover many areas of interest. For example, it may be desirable to increase the length of time that students study or practice a skill or to decrease the length of time an adolescent is in the shower while several siblings are waiting their turn. Duration has ease of observation in its favor if the start and stop can be well defined and lengthy pauses are not likely to be an issue. Interval assessment can be used to assess behavior when duration is of interest. For example, the number or proportion of intervals in which studying occurs reflects changes in study time, since interval recording is based on time.
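
A minimal sketch of how a duration total might be computed when an observer records onsets and offsets (as with the laptop keypresses in the eye-contact example); the times below are hypothetical:

    # Onset/offset pairs in seconds from the start of the session.
    episodes = [(12.0, 19.5), (41.0, 44.0), (80.0, 95.0)]

    total_seconds = sum(offset - onset for onset, offset in episodes)
    print(f"{total_seconds:.1f} s of the response this session")  # 25.5 s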

Latency. Latency refers to how long it takes for the client to begin the response. The amount of time that elapses between a cue (some starting point) and the response is referred to as latency. It would be easy to group latency with duration, because both involve the amount of time that elapses (before the behavior occurs, in the one case, or while the behavior is occurring, in the other), but for ease of reference and presentation the distinction is useful.

Many programs have timed response latency. For example, in one report, a 19-year-old adult diagnosed with Asperger syndrome was referred to a day-treatment center (Tiger, Bouxsein, & Fisher, 2007).¹ Among the characteristics to be addressed was the long period of time he took to respond to questions (he delayed before even starting his answer). More generally, he took excessive amounts of time to complete many activities during the day (e.g., several minutes to sign his name to a check in the grocery store), all of which limited his independent functioning. Latency was used as a measure in responding to questions. After a therapist asked a question (e.g., "What is your sister's name?"), the amount of time that elapsed (timed by a stopwatch) until the answer began constituted the measure. When the question was answered, another question was presented until 10 questions were completed or 10 minutes elapsed. The impact of an intervention program was evaluated by showing a decrease in latency (e.g., a mean of approximately 20 seconds in baseline to less than 5 seconds and eventually 3 seconds during the intervention).

¹ Asperger syndrome is a condition marked by impaired social interactions and communication, and by limited patterns of behavior. It is viewed as being on a continuum or spectrum with autism being at the more severe and extreme end.

Latency has many other uses in relation to assessment and intervention. For example, in a classroom, the teacher might count latency from the beginning of the class until a student engages in disruptive or aggressive behavior. The intervention (e.g., praise, feedback, points) can be provided for increases in the time from the start of school to the end of the day without disruptive behavior. Similarly, parents are often frustrated with getting their child up, dressed, fed, and out the door in time to obtain a ride or catch the school bus. Latency from the first reminder until the child gets out of bed or comes to the breakfast table and sits down could be one measure. The goal of the program would be to decrease the latency. An advantage of latency is that it greatly facilitates observation when individuals in applied settings (parents, teachers) do the observation. The start time is usually easy to specify (e.g., breakfast, a 10:00 a.m. activity, when a bell rings to denote the start of class). The parent and teacher now only have to keep track of when the behavior first occurs.
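
A minimal sketch of the latency arithmetic (all times hypothetical): each trial pairs the time of the cue (e.g., the end of a question, or the first reminder) with the time the response began, and a per-session summary is the mean of the elapsed times.

    # (cue_time, response_start_time) in seconds for three trials.
    trials = [(0.0, 21.0), (30.0, 48.5), (60.0, 79.0)]

    latencies = [start - cue for cue, start in trials]
    mean_latency = sum(latencies) / len(latencies)
    print(f"mean latency = {mean_latency:.1f} s")  # 19.5 s
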
Duration and latency can be very useful. Often the goal of a program is related directly to time: how much time is spent either engaging or not engaging in a particular behavior. The main constraint is in defining when the behavior does and does not count as being performed, so that when to start and stop the timing is clear. This requirement is similar to the demands in defining observations using other strategies.

Other Strategies Briefly Noted

Most assessment in single-case research has focused on overt behavior, using variations of one of the strategies mentioned previously. Three other general strategies can be delineated, including response-specific measures, psychophysiological measures, and self- and other-report measures. Although the formats of these measures sometimes overlap with the overt behavioral assessment strategies discussed earlier (e.g., frequency, duration), the strategies have unique features.

Response-specific Measures. Response-specific measures are assessment procedures that are unique to the particular behaviors under investigation. Many behaviors have specific measures peculiar to them that can be examined directly. For example, in one study, interventions were directed at increasing the eating and weight gain of three children (ages 3 to 4) who were failing to thrive (Patel, Piazza, Layer, Coleman, & Schwartzwelder, 2005). They consumed less food than is normally consumed by children their age and were admitted to an intensive pediatric feeding disorders program. The measure used to evaluate alternative feeding strategies was grams of food consumed. Ultimately, weight gain was the measure of the success of the program.

In a similar way, interventions designed to reduce overeating or cigarette smoking can be evaluated by assessing the number of calories consumed or cigarettes smoked. Calories and cigarettes could be considered simple frequency measures in the sense that they are both tallies of a particular unit of performance. However, the measures are distinguished here because they are peculiar to the target behavior of interest and can be used to assess the impact of the intervention directly.

Response-specific measures are of use because they directly assess the response or a product of the response that is recognized to be of obvious clinical, social, or applied significance. For example, efforts to have drivers conserve energy (gasoline) have measured car mileage directly from odometer readings; efforts to have individuals recycle waste or not litter have measured volume of trash. Response-specific measures are often available from existing data systems or records that are part of ongoing institutional records (e.g., crime rate, traffic accidents, hospital admissions). A cautionary note for some of the measures: data obtained in institutional records, such as crime rate or episodes of events in hospitals and schools, are not always kept reliably and may not reflect the care that investigators usually invoke when developing a measure for research. Even so, when decisions about assessment are being made, the investigator may wish to consider whether the response can be assessed in a direct and unique way that will be of clear social relevance. Response-specific measures are often of more obvious significance than specially devised overt behavioral measures to persons unfamiliar with research to whom the results may need to be communicated.

Psychophysiological Assessment. Psychophysiological responses directly reflect many problems of clinical significance or are highly correlated with the occurrence of psychological and medical conditions of interest, such as anxiety, vigilance, and attentiveness. In addition, physiological arousal and other states can be assessed directly and are of interest in their own right.

Some of the more familiar psychophysiological measures include heart or pulse rate, blood pressure, skin temperature, blood volume, muscle tension, and brain wave activity. Measures related to substance use and abuse are readily available for alcohol, drugs, and tobacco. For example, in one program with cigarette smokers, the primary measure was the level of carbon monoxide (CO) (Glenn & Dallery, 2007). Individuals breathed into a CO monitor over the course of an intervention study. A useful feature of this measure is that it has been well studied, so one can evaluate levels (parts of CO per million) that are known to reflect abstinence from cigarette smoking. Similarly, a study designed to decrease marijuana dependence among three adults used several measures (Twohig, Shoenberger, & Hayes, 2007). Among them was an oral swab to test for marijuana use. The test requires placing a special pad between the lower cheek and gum for 2 to 5 minutes. The results indicate whether marijuana was used within the past 3 days. In another single-case study, the goal was to reduce muscle tension that caused problems in vocalizing and breathing in a 16-year-old Caucasian adolescent with a 2-year history of the problem (Warnes & Allen, 2005). Biofeedback for reduced muscle tension was the intervention and was evaluated directly as measured by electromyographic responses. Electrodes placed on the neck provided the measure of muscle tension (in microvolts).

Many intervention studies focus on evaluating or altering sexual arousal in persons who experience arousal in the presence of socially inappropriate and censured stimuli (e.g., exhibitionistic, sadistic, or masochistic stimuli, or stimuli involving children, animals, or inanimate objects). Direct psychophysiological assessment of sexual arousal is possible by measuring vaginal or penile blood volume to evaluate changes in arousal. For example, penile blood volume is measured by a plethysmograph, which includes a band around the penis that registers increases in the diameter of the penis (e.g., Reyes et al., 2006). This is a well-studied and validated measure of sexual arousal and sexual preference.

More generally, psychophysiological measures are quite relevant insofar as many physical disorders and disease processes and their correlates (e.g., blood levels and chemistry) reflect change in response to individual habits and lifestyle. Blood pressure (by sphygmomanometer) and brain activity (electroencephalogram [EEG]) recordings are other examples. Psychophysiological and biological assessments are used more commonly in group research than in single-case designs. Some of the measures used to evaluate intervention effects, such as neuroimaging, dense-array recording, and scans of various sorts, are often expensive, inconvenient, and not feasible to administer in an ongoing way, although of course that could easily change. With such measures, it is much more feasible to assess functioning on one (post-intervention) or two (pre- and post-intervention) occasions only.

The preceding examples provide only a minute sample of the range of measures and disorders encompassed by psychophysiological assessment. Diverse clinical problems have been studied in single-case and between-group research, including insomnia, obsessive-compulsive disorders, pain, hyperactivity, sexual dysfunction, tics, tremors, and many others. Depending on the target focus, psychophysiological assessment permits measurement of precursors, central features, or correlates of the problem.

Self-report Measures. Historically, single-case designs have focused heavily and almost exclusively on overt performance, that is, what people do rather than what they say. A major exception has been those situations in which verbal behavior itself is the target focus (e.g., irrational speech, stuttering, threats of aggression). The emphasis on overt behavior is in sharp contrast to the measures more commonly used in between-group research in education, psychology, counseling, and psychiatry, where pre- and post-intervention assessments rely heavily on various paper-and-pencil measures (questionnaires, rating scales, and checklists) completed by the individual or others (therapists, spouses, parents, and teachers). There are many types of "self-report" measures, and some of these reflect overt behavior in important ways. For example, educational research often relies on self-report measures and on paper-and-pencil measures that reflect competence in an area (e.g., reading, comprehension, arithmetic). These are not ratings or personal views but measures of the domain of interest.

Self-report, when rating a problem or target focus, is often held to be rather suspect because it is subject to a variety of response biases and sets (e.g., responding in a socially desirable fashion, agreeing just to be agreeable, lying, and others) that distort one's own account of actual performance. Of course, self-report is not invariably inaccurate, nor is direct behavioral assessment necessarily free of response biases or distortion. When persons are aware that their behavior is being assessed, they can distort both what they say and what they do. Self-report tends to be more readily under the control of the client than more direct measures of overt behavior, however, and hence it is perhaps more readily subject to distortion.

Even when people do not attempt to distort how they present themselves, they are not necessarily good reporters of what they will do, what they have done, or what has happened. Consider an anecdotal and then a research example. People who live long or who have been married a very long time occasionally are asked by reporters, "So what is your secret or key to a long life (or marriage)?" Invariably, the person who is asked has something to say, but we cannot take this as knowledge or a statement of what really accounted for the lengthy period. Luck, a package of genetic and environmental influences, and even obscure influences that we now know a little about (e.g., the diet of one's grandparents before one was born) might combine in some novel way to explain longevity. Self-report just does not provide the data, even though the report can be interesting.

As for a research example, there are lines of research showing circumstances in which we do not and perhaps cannot report on critical facets of our experience. Here the problem is not ill will or efforts to distort, but characteristics of our reporting limits. For example, people give verbal statements about what attracts them to a mate and what characteristics they prefer, but for both males and females their actual choices are guided by characteristics that differ from what they say (e.g., Todd, Penke, Fasolo, & Lenton, 2007). There is no effort to distort here; self-report is just not up to the task of identifying key factors that may exert influence. Similarly, teenagers who pledge virginity and abstinence from sexual activity no doubt are genuine in their commitment. Yet, in fact, their statements do not relate to actual sexual behavior, that is, are not associated with reduced sexual activity (Rosenbaum, 2009). Just as intriguing, research has now established procedures for inducing false memories, that is, clear recollections people have of events that did not happen (Bjorklund, 2000; Brainerd & Reyna, 2005). Although people do not equivocate about what they remember, their reports can be shown to be completely inaccurate. Concerns about self-report are not objections in principle but reflect problems that have emerged from careful study.

Over the years, single-case designs have been applied more broadly in terms of the types of domains that are studied as well as the many disciplines that draw on the designs. That has led to the expansion of types of assessment to include greater use of self-report and other-report (e.g., clinician report), either as a complement to direct behavioral measures or as a modality of assessment valuable in its own right. In many cases self-report may represent the only modality currently available to evaluate treatment. For example, in the case of private events such as obsessive thoughts, uncontrollable urges, or hallucinations, self-report may be the only possible or feasible method of assessment. When the client is the only one with direct access to the event, self-report may be the primary assessment modality.

Private experience may not be private merely because it is the internal experience of the individual. Sometimes the behavior is not easily publicly observable because the behavior is performed privately or at times throughout the day that cannot be monitored by anyone other than the clients themselves. In such cases, self-report measures may play a central role and be complemented by other measures. Mentioned previously was a study to reduce marijuana dependence among three adults (Twohig et al., 2007). A drug test was used to assess use of marijuana in the previous 3 days. The measure was administered at different points throughout the study. However, self-report was used as well, in which each client kept a record of marijuana use and at the end of each day reported the number of times by leaving a telephone or email message. The self-report data provided the daily, continuous observations needed for single-case designs but were corroborated by the drug tests administered less often. Similarly, in a report on cigarette use, the drug test (carbon monoxide monitoring) was supplemented with self-report of cigarette smoking (Glenn & Dallery, 2007). The two measures were moderately to highly correlated (r = .72); thus reporting was quite related to actual smoking. Of course, one can question whether accuracy in reporting was partially increased because participants knew their smoking could be detected no matter what they said. There are many other actions of interest that in principle can be observed by others but end up being private events available only through self-report. Examples include sexual assault, delinquent acts (e.g., vandalism, firesetting), and bullying (e.g., ridicule or physical abuse on a playground). These behaviors may involve others who could report on the actions, but by the nature of these actions, self-report may be the only alternative.

Self-report measures should not be considered as the default m odality o f assess-
m ent when all else fails or when measures free from reporting biases cannot be used.
First, self-report is often a critical facet o f m any problem dom ain s and areas o f fu n c-
tioning (e.g., depression, marital satisfaction, quality o f life) and im portant to assess
even w hen direct behavioral m easures might be an option. Indeed, from the standpoint
o f the client or patient, self-report (one’s own subjective perception) is often the bottom
line and “true” test o f the im pact treatment. For example, m any intervention studies are
directed at reducing or elim inating headaches, debilitating muscle tension, stress, and
pain, all o f which can be assessed through psychophysiological m easures (m uscle ten -
sion, electrical activity o f the cortex, skin temperature) or behavioral indices (w alking
cautiously to avoid pain, no activity, grim acing as if in pain). A s superb as these are as
m easures, there is no substitute for asking if the client in fact is exp erien cing few er or
no headaches, less muscle tension and stress, and no pain. Perhaps even m ore clearly,
it is not rare for an adult who is v e ry depressed to have a v e ry successful life by all the
usual overt signs (e.g., relationships, work, money, leisure, and control o ver their lives).
Self-report or another type o f m easure (e.g., physiological) is needed here to assess
unhappiness in a valid way.
Similarly, many intervention studies focus on altering sexual arousal in persons who experience arousal in the presence of socially inappropriate and censured stimuli (e.g., exhibitionistic, sadistic, or masochistic stimuli, or stimuli involving children, animals, or inanimate objects). Direct psychophysiological assessment of sexual arousal is possible by measuring vaginal or penile blood volume to evaluate changes in arousal as a function of treatment, as I mentioned previously. Yet it is important as well to measure what persons actually say about what stimuli arouse them, because self-report is a significant response modality in its own right and does not always correlate with physiological arousal. Hence, it is relevant to assess self-report along with other measures of arousal.

Second, self-report measures often have been very extensively studied in ways that establish their reliability and validity, that is, that they yield consistent results and that the scores on the measures relate to other types of indices (e.g., how children are doing in school, how individuals are functioning at work). The value of a measure, in large part, is established by the extent to which the measure passes various methodological hoops. These hoops reflect different types of studies that convey that the measure assesses what it says it does, that the measure relates to other indices of the same construct, and so on. For example, a question with a seemingly obvious answer is, "What is the better or best way to measure the extent to which adolescents engage in delinquent behavior? Should one use arrest records (how many times a youth has been brought to a police station)? Or should one just ask them?" First, this is a trick question. In research one invariably wants to use measures of more than just one type (Kazdin, 2003). So here an institutional record (arrest) and a self-report scale should both be used if possible. They have different limitations, and if they converge, any conclusion would be greatly strengthened. However, to return to the main point, self-report is a valid way of measuring delinquent acts and has yielded many findings and insights about delinquency (e.g., the scope of acts, changes over the course of development, predictors) that could not have been obtained by either direct observation or institutional records (see Thornberry & Krohn, 2000).

Third, single-case designs require continuous assessment, that is, repeated administration of the measure on a daily or almost daily basis. The vast majority of self-report measures have not been used or validated as ongoing measures and may not be able to reflect changes. More attention has been accorded to this concern, and there are now examples that have redressed it. A prominent example is a measure used in the context of psychotherapy for adults. The measure is the Outcome Questionnaire 45 (OQ-45), which is a self-report measure designed to evaluate client progress (e.g., weekly) over the course of treatment (see Lambert et al., 1996, 2001, 2003, 2004). The measure requires approximately 5 minutes to complete and provides information on four domains of functioning, including symptoms of psychological disturbance (primarily depression and anxiety), interpersonal problems, social role functioning (e.g., problems at work), and quality of life (e.g., facets of life satisfaction). Total scores across the 45 items present a global assessment of functioning. The measure has been evaluated extensively, has been applied to thousands of patients, and has been shown to be useful in evaluating and predicting therapeutic changes. There are now other examples of self-report measures used in clinical work and single-case designs (e.g., Borckardt et al., 2008; Clement, 2007).

Finally, self-report measures can be used to code overt behavior in a way that circumvents some of the traditional concerns that such measures are too transparent. Questions can solicit information (self-report) but be evaluated in a way that is slightly different from what is reported. For example, in a study designed to prevent child abuse and to train parents in more effective ways to handle their children, mothers kept daily diaries (Peterson, Tremblay, Ewigman, & Popkey, 2002). The questions were open-ended and did not specifically ask about harsh discipline practices, the use of ignoring deviant behavior, or time out from reinforcement. The goal of the program was to decrease harsh discipline and to improve the use of ignoring and time out in its place. The diaries were structured so that they could be coded by observers and evaluated for reliability of the observations. The measure yielded information that could be assessed reliably, reflected changes in parenting behavior in the home over time, and reflected the impact of an intervention designed to change parenting behavior.

In light of these advances, self-report ought not to be ruled out in single-case designs. The methodology has advanced in part because of the need to address critical facets of functioning that are not easily assessed in other ways and because these facets are important in their own right. Also, even when behavior can be directly assessed, reports of others or the client may be important to evaluate whether the intervention made a difference to those who see or experience the behaviors in everyday life. Real and perceived changes are relevant, and self-report that taps either one of these, especially when supplemented by other measures, can be valuable.

Reports by Others. Reports by others refer to measures completed by individuals who have access to, observe, and interact closely with the client. Parents, teachers, spouses, partners, and significant others often serve as informants, the name used to describe raters other than the client. As with self-report, reports by others often play a critical role in intervention research. First, in some areas of research (e.g., clinical depression), other reports (ratings by clinicians) are among the standardized measures used to evaluate the impact of treatment (e.g., Levesque et al., 2004; Savard et al., 1998). We want the views of experts (e.g., those in a position to make well-based judgments). In addition, significant others (e.g., relatives, peers) in everyday life who are not "experts" but who have close contact with the client provide valuable information as well.

Second, significant others are often the first to identify a problem, as in the case of parents and teachers. Ratings on standardized measures can be used to decide whether an intervention is needed or whether an intervention has had impact. We want these views because the informant has close contact with the client and can provide valuable information. Relatedly, ratings by significant others are used as part of the process of social validation. In this context, subjective evaluation is used, as already mentioned. Individuals in contact with the client are asked to judge whether the behavior that is focused on, or more commonly whether the changes that have been made with the intervention, make a difference. The fact that these reports are subjective, based on perception, and not necessarily isomorphic with observations of behavior is not a deficiency in the measure. Rather, the purpose is to determine if changes that are reflected in observable behavior are such that they make a difference to others.

Many of the concerns raised about self-report are pertinent here. First, reports by others do not invariably reflect the actual behavior of the client, which in a given instance may or may not be a concern. For example, in the context of therapy, studies that compare direct observation with clinician ratings (e.g., in studying something as seemingly concrete as tics) note that clinician ratings may not correspond very well to direct observations (Himle et al., 2006). Second, when different informants rate the same person (e.g., a child referred for treatment), agreement among the informants is in the low to moderate range (Achenbach, 2006; De Los Reyes & Kazdin, 2005). For example, if parents, teachers, and children are asked about areas of child functioning (e.g., aggression, social interaction), they do not agree very well (e.g., parent and teacher, r ≈ .4). Third, ratings by others are influenced by factors that can be distinguished from the behavior or characteristics that are being rated. For example, parental report of deviance in their children is partially a function of what the child does but also is influenced by the stress, depression, and isolation of the parent who is doing the ratings. A parent who is stressed, depressed, and isolated from social contacts is likely to see the child as more deviant.

On balance, as with other modalities of assessment, it is important to be aware of the limitations of reports by others and to draw on multiple measures when possible. The perspectives of others are often critical, both in identifying problems or characteristics of an individual and in determining whether these characteristics have changed. Indeed, many decisions and choices in life (e.g., job opportunities) are very much determined by the perspectives (ratings) of others. To be effective, interventions often have to ensure that those perspectives are part of the evaluation.

Selection of an Assessment Strategy


In most single-case designs, the investigator selects one of the assessment strategies based on overt performance (e.g., frequency, interval measures). Some behaviors may lend themselves well to frequency counts or categorization because they are discrete, such as the number of profane words used, or the number of toileting or eating responses; others are well suited to interval recording, such as reading, working, or sitting; and still others are best assessed by duration, such as time spent studying, crying, or getting dressed. Target behaviors usually can be assessed in more than one way, so there is no single strategy that must be adopted. For example, an investigator working in an institution for delinquents may wish to record a client's aggressive behavior. Hitting others (e.g., making physical contact with another individual with a closed fist) may be the response of interest. What assessment strategy should be used?
Aggressive behavior might be measured by a frequency count by having an observer record how many times the client hits others during a certain period each day. Each hit would count as one response. The behavior also could be observed during interval recording. A block of time such as 30 minutes could be set aside for observation. The 30 minutes could be divided into 10-second intervals. During each interval, the observer records whether any hitting occurs. A duration measure might also be used. It might be difficult to time the duration of hitting, because instances of hitting are too fast to be timed with a stopwatch unless there is a series of hits (as in a fight). An easier duration measure might be to record the amount of time from the beginning of each day until the first aggressive response, that is, a latency measure. Presumably, if a program decreased aggressive behavior, the amount of time from the beginning of the day until the first aggressive response would increase.
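To make these alternatives concrete, here is a brief sketch, in Python, of how the same stream of observed hits might be summarized under each strategy. The code and its event times are hypothetical illustrations, not taken from any study described here; the sketch simply assumes observers have logged the time (in seconds from the start of the session) of each hit.

    # Hypothetical illustration: summarizing one observation session under
    # three recording strategies (frequency, interval, latency).

    # Times (in seconds from the start of a 30-minute session) at which
    # an observer logged a hit. These values are invented for illustration.
    hit_times = [12, 14, 95, 230, 231, 233, 1100, 1712]

    session_length = 30 * 60   # 30-minute observation block
    interval_length = 10       # 10-second intervals for interval recording

    # Frequency count: each hit counts as one response.
    frequency = len(hit_times)

    # Interval recording: score each 10-second interval as "hitting occurred"
    # if at least one hit fell within it, then count the scored intervals.
    n_intervals = session_length // interval_length
    scored = {t // interval_length for t in hit_times}  # intervals with a hit
    intervals_with_hitting = len(scored)

    # Latency: time from the start of the session to the first response.
    latency = min(hit_times) if hit_times else session_length

    print(f"Frequency: {frequency} hits")
    print(f"Interval recording: {intervals_with_hitting} of {n_intervals} intervals")
    print(f"Latency to first hit: {latency} seconds")

The same eight hits yield a frequency of 8, five scored intervals, and a latency of 12 seconds, which is the sense in which one response stream can be expressed under more than one strategy.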
Although many different measures can be used in a given program, the measure finally selected may be dictated by the purpose of the program. If one is developing complex behaviors that require mastery of a series of steps, discrete categorization may be of special use. The steps (performed vs. not performed, scored for each step) could be listed and assessed as well as trained over the course of the program. In group situations (e.g., camp, classroom, prison, military, nursing home, day care), trying to get most or all individuals to engage in some behavior (completing a task, showing up for an event, napping) may be the goal of the intervention. Counting the number of individuals who perform the behavior is a direct measure of that goal. Many responses may immediately suggest their own specific measures. In such cases, the investigator need not devise a special format but can merely adopt an existing measure. Measures such as calories, cigarettes smoked, and miles of jogging are obvious examples that can reflect eating, smoking, and exercising, relatively common target responses.
When the target problem involves psychophysiological functioning, direct measures are often available and of primary interest. In many cases, measures of overt behavior can reflect important physiological processes. For example, seizures, ruminative vomiting, and anxiety can be assessed through direct observation of the client. However, direct psychophysiological measures can be used as well and either provide a finer assessment of the target problem or evaluate an important and highly related component.
To a large extent, selection of an assessment strategy depends on characteristics of the target response and the goals of the intervention. In any given situation, several assessment options are likely to be available. Decisions for the final assessment format are often made on the basis of criteria other than the target response, including practical considerations such as the availability of assessment periods and observers.
It is important to mention a point from the previous chapter, namely, that multiple measures are to be encouraged. That takes the onus off the investigator for making a decision about the best, most relevant, and near-perfect measure. Constructs and domains of interest usually can be represented in multiple ways, and the yield from different measures of assessment can be different and informative. For single-case designs, one of the measures has to be obtained on an ongoing basis, but other measures can be used that are administered periodically or only once or twice to sample other indices of the domains or other domains that may be related to the target focus.

CONDITIONS OF ASSESSMENT
The strategies of assessment refer to the different methods of recording performance. Observations can vary markedly along other conditions, such as the manner in which behavior is evoked, the setting in which behaviors are assessed, whether the persons are aware that their behaviors are assessed, and whether human observers or automated apparatus are used to detect performance. These conditions of assessment can influence how the client responds and one's confidence that the data accurately reflect performance.

Natural Versus Contrived Tasks and Activities


Observations of client behavior can be obtained under a variety of conditions. A broad dimension is the extent to which performance is observed under everyday conditions without structuring the activities or under conditions that are arranged in some way to foster performance of activities so they can be counted. The performance, task, or activity that is to be observed can be natural (unstructured) or contrived (a structured task of some kind). A separate dimension is the setting in which the observations are made. These too can be natural settings (e.g., everyday life, at home, in the park) or contrived settings (e.g., laboratory). It is useful to make the distinction in dissecting assessment options for purposes of presentation, but there are all sorts of permutations and gradations.2 In this section, the focus is on the tasks or activities presented to the subject that serve as the basis of observation.

2 Psychological research has blurred and blended the activity and the setting in fascinating ways. For example, in studies with college students, individuals will enter a waiting room for some appointment. Another student already in the waiting room starts talking to the person who just arrived. This may look like an unstructured task or activity of the subject, but the person who starts talking (a confederate or "actor") is working from a practiced script; the setting looks like it is naturalistic, just a waiting room, but in fact it is all planned for observation, maybe even taping.
With natural or uncontrived tasks and activities, performance is observed without intervening or structuring the situation for the client. Ongoing performance is observed as it normally occurs, and the situation is not intentionally altered by the investigator merely to obtain the observations. For example, observations of interactions among children at school during a free period in class or on the playground would be considered natural in the sense that an ordinary activity was observed during the school day. Similarly, observation of people eating in a cafeteria or restaurant would constitute assessment under natural conditions.
Although direct observation of performance as it normally occurs is very useful, naturalistic observation often is not possible or feasible. Many of the behaviors of interest are not easily observed because they are of low frequency, require special precipitating conditions, or are prohibitively costly to assess in view of available resources (funds, observers). Consider the problem in a different context. TV shows that portray animals in the wild are often interested in the hunt, the kill, or the drama of a disappointing chase—all observations captured with photography. The difficulty is that these are natural rather than contrived activities, so the photographer is sitting in savannah bushes or trees for days to secure these photos. Analogously, interventions often focus on behaviors that rarely occur or do not occur with sufficient frequency to assess or intervene. Consequently, situations are often contrived to evoke responses so that the target behavior can be assessed.
For example, one study was designed to train young children (ages 4 to 7 years) so they would not play with guns and would respond safely if they encountered one (Gross, Miltenberger, Knudson, Bosch, & Breitwieser, 2007). Disabled guns from the police department were used. Each child was trained how to respond (not touch the gun, leave the room, contact an adult) after encountering a gun in a room at his or her home. A camera was placed in the room to watch the child's interaction and to score the child's behavior. Direct assessment of children in their homes with real guns and in natural uncontrolled situations obviously is not a possibility. Hence, a contrived situation was devised to observe and to train the behavior.
Another study focused on training preschool children to avoid abduction by strangers (Johnson et al., 2005). Assessment was conducted in diverse settings near the school and the child's home; the abduction lure was staged (i.e., contrived) by a confederate (i.e., someone working as part of the study) who approached the child to assess if the child would say "no," immediately walk or run away, and tell an adult about the abduction lure. This is another situation where merely observing the child under natural circumstances (noncontrived activities) would, like the savannah photographer, not yield the behaviors—and worse, "real" abduction efforts would need immediate intervention that would preclude training. We do not need research to make the point. If one wants to teach people not to drown, waiting until they are in the life-threatening situation and firing instructions or techniques at them probably is not going to be helpful ("Stop swallowing water and screaming!" "Stop moving your arms so aimlessly in the air!" "No need to keep screaming 'help'; I'm here." "Calm down—this is a bathtub."). We train under contrived conditions (regular swimming pools) where there are many opportunities to teach and assess the requisite skills.
Naturalistic and contrived conditions of assessment provide different advantages and disadvantages. Assessment of performance under contrived conditions provides information that often would be too difficult to obtain under naturalistic conditions. The response might be seen rarely if the situation were not arranged to evoke the behavior, as in the gun and abduction examples. Demonstrations have, with the approval and aid of parents, contrived situations so children can learn and practice appropriate responses.
In addition, contrived situations provide consistent and standardized assessment conditions. Consistent assessment conditions directly facilitate evaluation of the intervention and analysis of the data. Without structuring the situation, performance may change or fluctuate markedly as a function of the constantly changing conditions in the natural environment. In evaluation of intervention effects, whether in between-group or single-case studies, variability in performance can make evaluation more difficult (and threaten data-evaluation validity). Contrived situations minimize extraneous variability that results from constantly changing contexts and conditions of the natural environment.
The advantage of providing standardization of the assessment conditions with contrived situations bears a cost as well. When the situation is contrived, the possibility exists that performance may have little or no relation to the performance under naturalistic conditions. For example, family interaction may be observed in a clinic situation in which parents and their children are given structured tasks to perform (e.g., decide where to go on a hypothetical vacation or work on a homework problem together). The contrived tasks allow assessment of a variety of behaviors that might otherwise be difficult to observe if families were allowed to interact normally on their own. However, the possibility exists that families may interact very differently under contrived conditions from how they would under ordinary circumstances. Hence, a major consideration in assessing performance in contrived situations is whether that performance represents behavior under noncontrived conditions. In most studies, the relation between performances under contrived versus naturalistic conditions is assumed rather than demonstrated.

Natural Environment Versus Laboratory (or Clinic) Settings


A related dimension that distinguishes observations is the setting, or where the assessment is conducted. Observations can be obtained in the natural environment or in the laboratory or special clinical setting. The setting in which the observations are actually conducted can be distinguished from whether or not the observations are contrived. For example, one study on a college campus focused on getting people to come to complete stops as their cars pulled up to stop signs (Austin, Hackett, Gravina, & Lebbon, 2006). This was a naturalistic setting—a real intersection, with real people driving in real cars. The task or activity was naturalistic as well; whether the individual stopped at the intersection was the activity, and it was not contrived. Observers sat inside a parked car where they could see and code the stop. Here is a case where the setting and activity were natural. Similarly, in the abduction lure study noted previously, the activity was contrived (staged interactions with each child) but the settings were naturalistic (at home and at school).
Ideally, direct observations are made in the natural setting in which clients normally function. Such observations may be especially likely to reflect performance that the client has identified as problematic. Naturalistic settings might include the community, at work, in the classroom, in the institution, or in some other settings in which people ordinarily function. Occasionally observations are made in the homes of persons who are seen in psychological treatment. For example, to evaluate children with conduct problems (e.g., oppositional, disruptive, and aggressive behavior), observers may assess family interaction directly in the home. The home is obviously a natural setting, but the activities are not completely natural. Restrictions may be placed on the family, such as having them remain in one or a few rooms and not spend time on the phone or watch television, to help standardize the conditions of assessment. The assessment is in a naturalistic setting even though the actual circumstances of assessment are slightly contrived, that is, structured in such a way that the situation probably departs from ordinary living conditions.
Assessment of family interaction among children with conduct problems has also taken place in clinic settings in addition to the natural environment. Parents and their children are presented with tasks and games in a playroom setting, where they interact. Both the activities (games with new, special toys, or commands and requests made by the parents) and the setting (a clinic room) faintly resemble the home. Interactions during the tasks are recorded to evaluate how the parents and child respond to one another. Interestingly, the examples of children with conduct problems convey differences in whether the assessment was conducted in naturalistic (home) or clinic settings. However, in both situations, the assessment conditions were contrived in varying degrees, because arrangements were made by the investigator that were likely to influence interactions and because assessment conditions helped to ensure that the desired observations would occur.
Assessment in natural settings raises obvious problems and obstacles, such as the cost required for conducting observations and reliability checks and for ensuring and maintaining some standardization of the assessment conditions so that the relevant behaviors can be observed. Clinic and laboratory settings have been relied on heavily because of the convenience and standardization of assessment conditions they afford. In the vast majority of clinic observations, contrived situations are used. Tasks can be used that foster interactions that approximate those that might be seen at home yet that allow evaluation of the specific behaviors of interest (e.g., child compliance). When clients come to the clinic, it is difficult to observe direct samples of performance that are not under somewhat structured, simulated, or contrived conditions.
Overall, efforts are made in applied programs to standardize the situation in some way to permit observations and to ensure that the desired behavior occurs. Equally, efforts often are made to approximate natural situations. Real-life situations are used, but something about them will be contrived to allow the observations to be made.
Obtrusive Versus Unobtrusive Assessment


Another facet of assessment that can vary is whether participants are aware of the assessment. If clients are aware of the measurement process and that their behavior is being assessed, the observations are said to be obtrusive. Obtrusive only means that clients are aware of assessment. The obtrusiveness of an assessment procedure may be a matter of degree, so that subjects may be generally aware of assessment, aware that they are being observed but unsure of the target behaviors, and so on. The potential issue with obtrusive assessment is that it may be reactive, that is, that the assessment procedure may influence subject performance and provide data that do not represent how subjects would respond if they were unaware of the assessment. Awareness of the assessment process does not necessarily mean clients will perform differently, that is, obtrusiveness does not mean reactivity, but reactivity is possible. Unobtrusive assessment (when clients are not aware that any assessment is going on) of course is not likely to be reactive.3

3 One has to hedge a bit here about the relation of reactivity and unobtrusive measurement in light of psychological research findings on awareness and behavior change. It is possible that subjects will not be able to verbalize that they are aware of something, such as observers, but still be influenced by it. Many influences in the environment fall below our threshold of saying we recognize them, but they still exert influence on our behavior (Hassin, Uleman, & Bargh, 2005). For example, exposing people to the national flag subliminally (i.e., too briefly for them to be aware the flag was shown) changes their immediate political views, intentions, and vote (Hassin, Ferguson, Shidlovski, & Gross, 2007). Stated another way, it is now clear that something can be reactive even if it is not obtrusive, that is, recognized in consciousness.
Observations of overt performance may vary in the extent to which they are conducted under obtrusive or unobtrusive conditions. In many investigations that utilize direct observations, performance is assessed under obtrusive conditions. For example, observation of behavior-problem children in the home or the clinic is conducted in situations in which families are aware that they are being observed. Similarly, clients who are seen for treatment of anxiety-based problems usually are fully aware that their behavior is assessed when avoidance behavior is evaluated under contrived conditions. Does this lead to reactivity, or differences in performance? This is not well studied.
Assessment in single-case research has an advantage in relation to reactivity of assessment. Ongoing, usually daily, assessment means that clients are likely to become accustomed to the assessment procedures. I have been in scores of classrooms with two or more observers, and the first day or two students look or turn around to see the observers. As the students are ignored (to avoid fostering interactions with observers), interest in the observers tends to drop out. The extended observations over weeks or months become rather mundane and uninteresting to the students.
Consider an example in a very different context. One study focused on the performance of collegiate rugby players on a team in the United Kingdom (Mellalieu, Hanton, & O'Brien, 2006). Five collegiate players were included in the project; the goal was to increase selected behaviors (e.g., number of tackles, number of times a player stole possession of the ball from an opposing player). Participants were aware of the assessments because individual behavioral goals were identified and later feedback was provided in relation to these goals. However, the assessments were obtained from videotapes over the course of 20 rugby games. This was obtrusive assessment—was it reactive? That is, did the players change in light of knowing the behaviors were observed? Any reactive effect is likely to be short lived. In general, whether obtrusive observation in single-case research leads to reactive effects has not been well studied. The likely reactivity is minimal because of the extended observation period of repeated assessment. Participants are likely to accommodate to the novelty of the situation and see observers as part of what becomes routine.
Occasionally, observations are conducted under unobtrusive assessment conditions. I mentioned one example already in which stopping at intersections was observed directly (Austin et al., 2006). Individuals who drove through the intersection presumably did not notice observers sitting in a nearby car and performed as they normally would. In another study, the goal was to encourage people at a supermarket to donate food to a food bank (Farrimond & Leland, 2006). A food bank bin was located near the main exit door of the supermarket. The measure to evaluate the program was the number of items donated and their monetary value. The food bank bin had been in place in the store for 9 years, so nothing was introduced to make the situation contrived. This was likely to be an unobtrusive assessment, a regular food drive in a regular market.
Unobtrusive behavioral observations raise an obstacle. Participation in research usually must be disclosed to the subjects or clients. Informed consent forms, statements about privacy, and other legal requirements to protect subject rights are standard research requirements. As full a disclosure to subjects as possible means that they are likely to be aware of the assessments. Occasionally, informed consent and disclosure might not be required if the identity of the participants cannot be discerned and information about individuals cannot be connected to them. The broader point is merely to be aware of the assessment procedures. If all procedures used in a given project are obtrusive, perhaps they can be varied in the likelihood or degree of conspicuousness and hence in the reactivity they might generate.

Human Observers Versus Automated Recording


Another dimension that distinguishes how observations are obtained pertains to the data collection method. In most applied single-case research, human observers assess behavior. Observers watch the client(s) and record behavior according to one of the assessment strategies described earlier. Observers are commonly used to record behavior in the home, classroom, psychiatric hospital, laboratory, community, and clinical settings. Observers may include special persons introduced into the setting or others who are already present (e.g., teachers in class, spouses or parents in the home).
In contrast, observations can be gathered through the use of apparatuses or automated devices. Behavior is recorded through an apparatus that detects when the response has occurred, for how long it has occurred, or other features of performance.4 With automated recording, humans are involved in assessment only to the extent that the apparatus needs to be calibrated or that persons must read and transcribe the numerical values from the device, if these data are not automatically printed and summarized or added to a database.

4 Automated recording here refers to apparatuses that register the responses of the client. Apparatuses that aid human observers are often used, such as wrist counters, event recorders, stopwatches, and audio- and videotape recorders. These devices serve as useful aids in recording behavior, but they are still based on having human observers assess performance.
A major area of research in which automated measures are used routinely is biofeedback. In this case, psychophysiological recording equipment is required to assess ongoing physiological responses. I mentioned an example previously in which muscle tension was directly assessed. Direct observation by human observers could not assess the tension by merely observing, or assess the tension with the precision needed. Similarly, many other responses of interest are undetectable from merely looking at the client (e.g., brain wave activity, blood pressure, cardiac arrhythmias, and skin temperature). Some physiological signs might be monitored by observers (e.g., pulse rate by external pressure, heart rate by stethoscope), but psychophysiological assessment provides a more sensitive, accurate, and reliable recording system.
Automated assessment in single-case research has not been restricted to psychophysiological assessment. A variety of measures has been used to assess responses of applied interest, such as levels of noise from university dormitories (decibel meters) or speeding of cars (by radar). With such measures, human observers can be completely removed from assessment. In other instances, human observers have a minimal role. The apparatus registers the response in a quantitative fashion, which can be simply copied by an observer if the data cannot be automatically transferred to the computer. The observer merely transcribes the information from one source (the apparatus) to another (data sheets), a function that often is not difficult to program automatically but may be easier to achieve with human observers.
The use of automated records has the obvious advantage of reducing or eliminating errors of measurement that would otherwise be introduced by the presence of observers, a topic addressed in the next chapter. Humans must subjectively decide whether a response has begun, is completed, or has occurred at all. Limitations of the "apparatus" of human observers (e.g., the scanning capability of the eyes), subjective judgment in reaching decisions about the response, and the assessment of complex behaviors with unclear boundary conditions may increase the inaccuracies and inconsistencies of human observers. Automated apparatuses overcome many of the observational problems introduced by human observers. For example, hyperactivity of children is of keen interest in intervention studies. Many different methods have been used (e.g., direct observations, teacher report) to assess hyperactivity. Automated recording is available too. For example, one method evaluates the number of movements made by a subject. A counting unit is worn on the belt of the subject. The unit records movement detected by a set of mercury switches. The total number of times a mercury switch is opened is counted and automatically recorded. A threshold can be added for feedback (biofeedback) so that when the rate of movement exceeds some criterion, an audible signal can be provided (www.freepatentsonline.com/4112926.html).
Automated devices can be used to sample behavior in everyday situations and for extended periods. For example, in one investigation, there was interest in sampling behaviors performed throughout the day and examining the behaviors (laughing, singing, and socializing) as a measure of mood (Hasler, Mehl, Bootzin, & Vazire, 2008). Subjects wore a device (the Electronically Activated Recorder), roughly the size of a cell phone, clipped to their belts. The apparatus included a tie-clip microphone that could pick up their voice but also sounds in the environment. Approximately five times every hour the device automatically made 30-second recordings to pick up sounds from the environment. From these recordings, coders later evaluated three categories of behaviors of interest as well as sleep onset and waking up (by the absence and presence of sounds at night and in the morning). These direct observations of behavior were obtained throughout the entire day.
Apparatuses that automatically record responses overcome significant problems that can emerge with human observers. In addition, automated recordings often allow assessment of behavior for relatively long periods of time. Once the device is in place, it can record for extended periods (e.g., the entire school day, all night during sleep). The expense of human observers often prohibits such extended assessment. Another advantage may relate to the impact of the assessment procedure on the responses. The presence of human observers may be obtrusive and influence the responses that are assessed. Automatic recording apparatuses often quickly become part of the physical environment and, depending on the apparatus, may less readily convey that behavior is being monitored.
To be sure, automated recordings introduce their own problems. For example, equipment can and often does fail, or it may lose its accuracy if not periodically checked and calibrated. Also, equipment is often expensive and less flexible in terms of the range of behaviors that can be observed or the range of situations that can be assessed. Some measures may not permit evaluation of performance and functioning in the natural environment or may require a special setup that can only be conducted in a laboratory. On the other hand, advances in assessment and technology are astounding. An obvious advance is in the portability of apparatus. Devices are diminishing in size and obtrusiveness and increasingly can run with little power use. Another advantage is the range of domains that can be measured. Biological functions and processes are where some of the advances are especially salient. It seems not much of a speculative leap to suggest that domains such as body temperature, biological rhythms, and hormone and neurotransmitter processes will correlate with or serve as meaningful operational definitions of social, emotional, cognitive, and behavioral aspects of everyday functioning.
No measure can be expected to address all issues and considerations. I have already noted in the previous chapter that multiple measures are to be encouraged when possible because any measure or method of assessment has limitations. Automated devices have a special virtue that overcomes limitations of human observers and are recommended when possible.

General Comments


The conditions under which behavioral observations are obtained may vary markedly. The dimensions I have discussed do not exhaust all of the possibilities. Moreover, for purposes of presentation, three of the conditions of assessment were discussed as either naturalistic or contrived, in natural or laboratory settings, and as obtrusive or unobtrusive. It is important to reiterate that these characteristics vary along continua. For example, many laboratory or clinic situations may approximate or very much attempt to approximate a natural setting. As an illustration, the alcohol consumption of individuals hospitalized for alcohol abuse is often measured by observing individuals as they drink in a simulated bar in the hospital. The bar is in a clinic setting but looks exactly like an ordinary bar. The conditions closely resemble the physical environment in which drinking often takes place. The range of conditions under which behavioral observations can be obtained provides many options for the investigator. When the strategies for assessment (e.g., frequency, interval observations) are added, the diversity of observational practices is even more impressive.

SUMMARY AND CONCLUSIONS


Typically, single-case research focuses on direct observations of overt performance. Direct observations were emphasized because they are used frequently and are not well covered in most assessment books that focus on traditional methods of psychological assessment. Important to reiterate is that single-case designs do not require assessment of overt behavior. Any assessment method that provides ongoing data over time can be used. As for direct observations, or any other modality of assessment, reliability and validity issues are pertinent.
When direct observations are used, different strategies of assessment are available, including frequency counts, discrete categorization, number of clients who perform the behavior, interval recording, duration, and latency. Other strategies include response measures specific to the particular responses, psychophysiological recording, and self-report. Depending on the precise focus, measures other than direct observation may be essential.
Apart from the strategies of assessment, observations can be obtained under a variety of conditions. The conditions may vary according to whether behavior is observed under natural or contrived tasks and activities, in natural or laboratory settings, by obtrusive or unobtrusive means, and whether behavior is recorded by human observers or by an automated apparatus. The different conditions of assessment vary in the advantages and limitations they provide, including the extent to which performance in the assessment situation reflects performance in other situations, whether the measures of performance are comparable over time and across persons, and the convenience and cost of assessing performance.
This chapter focused on various methods of assessment. I noted that direct observations of overt behavior constitute the most frequently used method. The next chapter discusses requirements for ensuring the quality of the observational methods and the consistency with which the information is obtained. Consistency of measurement is a bridge to later discussions of data evaluation. Evaluating interventions is aided markedly by ensuring that the assessment procedures are administered reliably.
CHAPTER 5

Ensuring the Quality of Measurement

CHAPTER OUTLINE

Interobserver Agreement
  Importance of Assessing Agreement
  Agreement Versus Accuracy
  Conducting Checks on Agreement
Methods of Estimating Agreement
  Frequency Ratio
    Description
    Problems and Considerations
  Point-by-point Agreement Ratio
    Description
    Problems and Considerations
  Pearson Product-Moment Correlation
    Description
    Problems and Considerations
  General Comments
Base Rates and Chance Agreement
  Alternative Methods of Handling Expected ("Chance") Levels of Agreement
    Variations of Occurrence and Nonoccurrence Agreement
    Plotting Agreement Data
    Correlational Statistics
  General Comments
Sources of Artifact and Bias
  Reactivity of Reliability Assessment
  Observer Drift
  Observer Expectancies and Feedback
  Complexity of the Observations
Acceptable Levels of Agreement
Summary and Conclusions

When direct observations of behavior are obtained by human observers, the possibility exists that observers will not record behavior consistently. However well specified the responses are, observers may need to make judgments about whether a response occurred or may inadvertently overlook or misrecord behaviors that occur in the situation. Central to the collection of direct observational data is evaluation of agreement among observers. Interobserver agreement, also referred to as reliability, refers to the extent to which observers agree in their scoring of behavior.1 This chapter discusses interobserver agreement, conditions of evaluating agreement, and the quality of assessment procedures.

1 In applied research, "interobserver agreement" and "reliability" have been used interchangeably. For purposes of the present chapter, "interobserver agreement" is used primarily. "Reliability" as a term has an extensive history in assessment and has several different meanings. "Interobserver agreement" specifies the focus more precisely as the consistency between or among observers.

INTEROBSERVER AGREEMENT

Importance of Assessing Agreement


Agreement between different observers is assessed for three major reasons. First, assessment is useful only to the extent that it can be achieved with some consistency. Obviously, if frequency counts differ depending upon who is counting, it will be difficult to know the client's actual performance. The client may be scored as performing a response frequently on some days and infrequently on other days as a function of who scores the behavior rather than actual changes in client performance. Inconsistent measurement introduces variation into the data, which adds to the variation stemming from the ordinary, normal fluctuations in client performance I mentioned earlier. Agreement between observers ensures that one potential source of variation, namely, inconsistencies among observers, is minimal.
Second, assessing agreement minimizes, circumvents, or reveals the biases that any individual observer may have. If a single observer were used to record a behavior, any recorded change may be a result of a change in the observer's definition of the behavior over time rather than in the actual behavior of the client. Over time the observer might become lenient or stringent in applying the response definition. Alternatively, the observer might expect and perceive improvement based on the implementation of an intervention designed to alter behavior, even though no actual changes in behavior occur. Using more than one observer and checking interobserver agreement provide a partial check on the consistency with which response definitions are applied over time.
Finally, agreement between observers partially reflects whether the behavior is well defined. Interobserver agreement on the occurrences of behavior is one way to evaluate the extent to which the definition of behavior is sufficiently objective, clear, and complete—requirements for response definitions discussed in Chapter 3. Moreover, if observers readily agree on the occurrence of the response, it may be easier for persons who eventually carry out an intervention to agree on the occurrences and to apply the intervention consistently.

Agreement Versus Accuracy
Agreement between observers is assessed by having two or more persons observe the same client(s) at the same time. The observers work independently for the entire observation period, and the observations are compared when the session is over. A comparison of the observers' records reflects the consistency with which observers recorded the behavior of interest.
It is important to distinguish agreement between observers from accuracy of the observations. Agreement refers to evaluation of how well the data from separate observers correspond. High agreement means that observers correspond in the behaviors they score. Methods of quantifying the agreement are available, so that the extent to which observers do correspond in their observations can be carefully evaluated.
A major interest in assessing agreement is to evaluate whether observers are scoring behavior accurately. Accuracy refers to whether the observers' data reflect the client's actual performance. To measure the correspondence between how the client performs and observers' data, a standard or criterion is needed. Accuracy requires a firm standard that itself is reliable and valid. For example, accuracy is readily understandable in a situation in which two people score the answers that a student has made to a multiple-choice test. We can say that there really is a number that characterizes correct responses, and this might even be obtained by automated scoring of the answer sheet (e.g., by a scanner). We can see if the observers agree with each other in scoring the test (interobserver agreement) and can tell if either observer had it right, that is, accurately scored the student's exam (accuracy).
Sometimes accuracy is evaluated by developing a reference point based on consensus that certain behaviors have or have not occurred. Accuracy may be evaluated by constructing a videotape in which certain behaviors are acted out and, hence, are known to be on the tape with a particular frequency, during particular intervals, or for a particular duration. Data that observers obtain from looking at the tape can be used to assess accuracy, since "true" performance is known. Alternatively, client behavior under naturalistic conditions (e.g., children in the classroom) may be videotaped. Several observers could score the tape repeatedly and decide what behaviors were present at any particular point in time. A new observer can rate the tape, and the data, when compared with the standard, reflect accuracy. When there is agreement on a standard for how the client actually performed, a comparison of an observer's data with the standard reflects accuracy, that is, the correspondence of the observer's data to the "true" behavior.
Although investigators are interested in accuracy of observations, they usually must settle for interobserver agreement. In most settings, there are no clear criteria or permanent records of behavior to determine how the client really performed. Partially for practical reasons, the client's behavior cannot be videotaped or otherwise recorded each time a check on agreement is made. Use of equipment for automated recording without human observers circumvents this problem. However, for more commonly used procedures, human observers collect the data. Without a permanent record of the client's performance, it is difficult to determine how the client actually performed. In a check on agreement, two observers usually enter the situation and score behavior. The scores are compared, but neither score necessarily reflects how the client actually behaved. One wants performance of the client to be observed consistently (reliability) and to represent actual performance closely or well (validity), whether or not there is any index that could be used to verify exact performance.
In general, both interobserver agreement and accuracy involve comparing an observer's data with some other source. They differ in the extent to which the source of comparison can be trusted to reflect the actual behavior of the client. Although accuracy and agreement are related, they need not go together. For example, an observer may record accurately (relative to a pre-established standard) but show low interobserver agreement (with another observer whose observations are quite inaccurate). Conversely, an observer may show poor accuracy (in relation to the standard) but high interobserver agreement (with another observer who is inaccurate in a similar way). Hence, interobserver agreement is not a measure of accuracy. The general assumption is that if observers record the same behaviors, their data probably reflect what the client is doing. However, it is important to bear in mind that this is an assumption. Under special circumstances, discussed later in the chapter, the assumption may not be justified.
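To make the distinction concrete, here is a minimal sketch in Python (an invented illustration, not drawn from the text or any study) in which two observers' interval records are compared with each other and with a criterion standard. The simple interval-matching used here is only for showing the logic; the formal agreement indices are taken up below.

    # Hypothetical illustration: agreement between observers is not accuracy.
    # Each list scores 10 intervals (1 = behavior occurred, 0 = did not).

    standard   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # criterion (e.g., scored videotape)
    observer_a = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # inaccurate
    observer_b = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # inaccurate in the same way

    def percent_match(x, y):
        """Percentage of intervals scored identically (simple matching)."""
        return 100 * sum(a == b for a, b in zip(x, y)) / len(x)

    # The observers agree perfectly with each other (100%) ...
    print(percent_match(observer_a, observer_b))  # 100.0
    # ... yet neither matches the standard at all (0%), so high
    # interobserver agreement does not guarantee accuracy.
    print(percent_match(observer_a, standard))    # 0.0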

Conducting Checks on Agreement


Typically, an observer records the behavior of the client on a daily basis over the entire course of the investigation. Occasionally, another observer will also be used to check interobserver agreement. On such occasions, both observers will record the client's behavior. Obviously, it is important that the observers work independently, not look at each other's scoring sheets, and refrain from discussing their observations. The purpose of checking agreement is to determine how well observers agree when they record performance independently.
Checks on interobserver agreement are usually conducted on a regular basis throughout an investigation. If there are several different phases in the investigation, interobserver agreement usually is checked in each phase. It is possible that agreement varies over time as a function of changes in the client's behavior. The investigator is interested in having information on the consistency of observations over the course of the study. Hence, interobserver agreement is checked often and under each different condition or intervention that is in effect.
There are no precise rules for how often agreement should be checked. Several factors influence the decision. For example, with several observers or a relatively complex observational system, checks may need to be completed relatively often. Also, the extent to which observers in fact agree when agreement is checked may dictate the frequency of the checks. Initial checks on agreement may reveal that observers agree all or virtually all of the time. In such cases, agreement may need to be checked occasionally but not often. On the other hand, with other behaviors and observers, agreement may fluctuate greatly and checks will be required more often. As a general rule, agreement ought to be assessed within each phase of the investigation, preferably at least a few times within each phase. Yet checking on agreement is more complex than merely scheduling occasions in which two observers score behavior. How the checks on agreement are actually conducted may be as important as the frequency with which they are conducted, as will be evident later in the chapter.

METHODS OF ESTIMATING AGREEMENT
The methods available for estimating agreement partially depend on the assessment strategy (e.g., whether frequency or interval assessment is conducted). For any particular observational strategy, several different methods of estimating agreement are available. The major methods of computing reliability, their application to different observational formats, and considerations in their use are discussed next.

Frequency Ratio
Description. The frequency ratio is a method used to compute agreement when comparisons are made between the totals of two observers who independently record behaviors. The method is often used for frequency counts, but it can be applied to other assessment strategies as well (e.g., intervals of behavior, duration). Typically, the method is used when behavior can be freely performed and can theoretically take on any value. That is, there are no discrete trials or restricted set of opportunities for responding. For example, parents may count the number of times a child swears while at the dinner table. Theoretically, there is no limit to the frequency of the response (although laryngitis may set in if the response rate becomes too high). To assess agreement, both parents may independently keep a tally of the number of times the child says particular words. Agreement can be assessed by comparing the two totals that the parents have obtained at the end of dinner. To compute the frequency ratio, the following formula is used:

    Frequency ratio = (Smaller total / Larger total) × 100

That is, the smaller total is divided by the larger total. The ratio usually is multiplied by 100 to form a percentage. In the preceding example, one parent may have observed 20 instances of swearing and the other may have observed 18 instances. The frequency ratio would be 18/20, or .9, which, when multiplied by 100, would make agreement 90%. The number reflects the finding that the totals obtained by each parent differ from each other by only 10% (or 100% agreement minus obtained agreement). For most uses, this ratio and this margin of difference would be fine as an indicator that behavior overall was observed consistently.
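As a small worked illustration, the computation is simple enough to express in a few lines of Python; the function below and the parents' totals are hypothetical, mirroring the swearing example above.

    # Hypothetical illustration of the frequency ratio computation.

    def frequency_ratio(total_1, total_2):
        """Percentage agreement on session totals: smaller / larger * 100."""
        smaller, larger = min(total_1, total_2), max(total_1, total_2)
        if larger == 0:
            return 100.0  # neither observer recorded the behavior at all
        return 100 * smaller / larger

    # One parent tallied 18 instances of swearing, the other 20.
    print(frequency_ratio(18, 20))  # 90.0 (totals differ by only 10%)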

Problems and Considerations. Frequency ratios reflect agreement on the total number of behaviors scored by each observer. There is no way of determining within this method of agreement whether observers agreed on any particular instance of performance. It is even possible, although unlikely, that the observers may never agree on the occurrence of any particular behavior; they may see and record different instances of the behavior, even though their totals could be quite similar. In the preceding example, one parent observed 18 and the other observed 20 instances of swearing. It is possible that 38 (20 + 18) (or many more) instances occurred, and that the parents never scored the same instance of swearing. In practice, of course, large discrepancies between two observers scoring a discrete behavior such as swearing are unlikely. Nevertheless, the frequency ratio hides the fact that observers may not have actually agreed on the instances of behavior.
The absence of information on instances of behavior makes the agreement data from the frequency ratio somewhat ambiguous. The method, however, has still proved quite useful. If the totals of two observers are close (e.g., within a 10 to 20% margin of error), then the method serves as a useful guideline for ensuring that they generally agree. The major problem with the frequency ratio rests not so much with the method but with the interpretation that may be inadvertently made. When a frequency ratio yields a percentage agreement of 90%, this does not mean that observers agreed 90% of the time or on 90% of the behaviors that occurred. The ratio merely reflects how close the totals fell to each other.
The frequency ratio of calculating agreement on totals for a given observation period is not restricted to frequency counts. The method can also be used to assess agreement on duration, interval assessment, and discrete categorization. In each case the ratio is computed for each session in which reliability is assessed by dividing the smaller total by the larger total. For example, a child's tantrums may be observed by a teacher and a teacher's aide using interval (or duration) assessment. After the session is completed, the total numbers of intervals (or amounts of time in minutes) of tantrum behavior are compared, placed into the ratio, and multiplied by 100. Although the frequency ratio can be extended to different response formats, it is usually restricted to frequency counts. More exact methods of computing agreement are available for other response formats to overcome the problem of knowing whether observers agreed on particular instances or samples of the behavior.
On balance, frequency ratios are easily computed and easily communicated to others. The problems I have identified are not necessarily reasons to avoid the ratios. In many cases, the problem may not be very likely because of the clarity of the behavior. The methodological goal of assessment is minimizing variability due to unreliability in the measurement procedures. High ratios mean high agreement on the total number of events for that observation period. That may be sufficient to describe performance and reflect change during an intervention phase, even if one or a few instances of the behavior (e.g., < 10%) were missed by one or the other observer.

Point-by-point Agreement Ratio


Description. An important method for computing reliability is to assess whether there is agreement on each instance of the observed behavior. The point-by-point agreement ratio is available for this purpose whenever there are discrete opportunities (e.g., number of trials, intervals, or correct answers) for the behavior to occur (occur/not occur, present/absent, appropriate/inappropriate). Whether observers agree is assessed at each opportunity for behavior to occur. For example, the discrete categorization method consists of several opportunities to record whether specific behaviors (e.g., room-cleaning behaviors) occur. For each of several behaviors, the observer can record whether the behavior was or was not performed (e.g., picking up one's clothing, making one's bed, putting food away). For a reliability check, two observers would record whether each of the behaviors was performed. The totals could be placed into a frequency ratio, as described previously. However, when one can identify each discrete behavior and whether observers agree on its occurrence, a finer-grained agreement measure can be used.
Because there were discrete response categories, a more exact method of computing agreement can be obtained. The scoring of the observers for each response can be compared directly to see whether both observers recorded a particular response as occurring. That is, if a checklist of behaviors had 10 different behaviors as possibly occurring or not, one could compare whether observers agreed on whether behavior 1 occurred or not, behavior 2 occurred or not, and so on. Rather than looking at totals, agreement is evaluated on a response-by-response or point-by-point basis. The formula for computing point-by-point agreement consists of:

    Point-by-point agreement = [A / (A + D)] × 100

where
    A = number of agreements from examining the data on a trial-by-trial or opportunity-by-opportunity basis for the behavior to be scored
    D = number of disagreements from examining the data on a trial-by-trial or opportunity-by-opportunity basis for the behavior to be scored.
That is, agreem ents o f the observers on the specific trials are divided by the nu m -
ber o f agreements plus disagreem ents and m ultiplied by 100 to form a percentage.
Agreem ents can be defined as instances in w hich both observers record the same
thing. I f both o bservers recorded the behavior as occurring, or they both scored the
behavior as not occu rrin g, an agreem ent w ould be scored. D isagreem ents are defined
as instances in w'hich one o bserver recorded the behavior as o ccu rring and the other
did not. The agreem ents and disagreem ents are tallied by com paring each behavior on
a point-by-point basis.
A more concrete illustration of the computation of agreement by this method is provided using interval assessment, to which the point-by-point agreement ratio is applied most frequently. In interval assessment, two observers typically observe and record behavior for several intervals. In each interval (e.g., a 10-second period), observers record whether behavior (e.g., paying attention in class) occurred or not. Because each interval is recorded separately, point-by-point agreement can be evaluated. Agreement can be determined by comparing the intervals of both observers according to the formula just given.
In practice, agreements are usually defined as agreement between observers on occurrences of the behavior in interval assessment. The preceding formula is unchanged. However, agreements constitute only those intervals in which both observers marked the behavior as occurring. For example, assume observers recorded behavior for 50 10-second intervals and that both observers agreed on the occurrence of the behavior in 20 intervals and disagreed in 5 intervals. Agreement (according to the point-by-point agreement formula) would be 20/(20 + 5) × 100, or 80%. Although observers recorded behavior for 50 intervals, not all intervals were used to calculate agreement. An interval is counted only if at least one observer recorded the occurrence of the behavior.
Excluding intervals in which neither observer records the behavior is based on the following reasoning. If these intervals were counted, they would be considered as agreements, since both observers "agree" that the response did not occur. Yet in observing behavior, many intervals may be marked without the occurrence of the behavior. If these were included as agreements, the estimate would be inflated beyond the level obtained when occurrences alone were counted as agreements. In the preceding example, behavior was not scored as occurring by either observer in 25 intervals. By counting these as agreements, the point-by-point ratio would increase to 90% (that is, 45/(45 + 5) × 100) rather than the 80% obtained originally. To avoid this increase, most investigators have restricted agreements to response occurrence.
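
As an illustration only, the following Python sketch computes the point-by-point ratio for the 50-interval example above. The function name and the 0/1 encoding of intervals are hypothetical conventions, not part of any standard package.

def point_by_point(obs1, obs2, count_nonoccurrences=False):
    # obs1 and obs2 are parallel lists of 1 (occurrence) and 0 (nonoccurrence),
    # one entry per interval. Agreement = A / (A + D) x 100.
    agree_occ = sum(1 for a, b in zip(obs1, obs2) if a == 1 and b == 1)
    agree_non = sum(1 for a, b in zip(obs1, obs2) if a == 0 and b == 0)
    disagree = sum(1 for a, b in zip(obs1, obs2) if a != b)
    agreements = agree_occ + (agree_non if count_nonoccurrences else 0)
    return agreements / (agreements + disagree) * 100

# 50 intervals: 20 occurrence agreements, 5 disagreements,
# 25 intervals in which neither observer scored the behavior.
obs1 = [1] * 20 + [1] * 5 + [0] * 25
obs2 = [1] * 20 + [0] * 5 + [0] * 25
print(point_by_point(obs1, obs2))                             # 80.0
print(point_by_point(obs1, obs2, count_nonoccurrences=True))  # 90.0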

Problems and Considerations. The advantage of the method is that it provides the opportunity to evaluate observer agreement for each response trial or observation interval and is more precise than the frequency ratio, which evaluates agreement on totals. Although the method is used most often for interval observation, it can be applied to other methods as well. For example, the formula can be used with frequency counts when there are discrete trials (e.g., number of correct arithmetic responses on a test) or discrete categories (e.g., number of chores), or a count of the number of people who perform a response. In any assessment format in which agreement can be evaluated on particular units or responses, the point-by-point ratio can be used.
Despite the greater precision of assessing exact agreement, many questions have been raised about this method of computing agreement. For interval observations, investigators have questioned whether "agreements" in the formula should be restricted to intervals where both observers record an occurrence of the behavior or should also include intervals where both score a nonoccurrence. In one sense, both indicate that observers were in agreement for a particular interval. The issue is important because the estimate of reliability depends on the frequency of the client's behavior and whether occurrence and/or nonoccurrence agreements are counted. If the client performs the behavior relatively frequently or infrequently, observers are likely to have a high proportion of agreements on occurrences or nonoccurrences, respectively. Hence, the estimate of reliability may differ greatly depending on what is counted as an agreement between observers and how often behavior is scored as occurring. I return to this topic later in the chapter in the discussion of base rates.

Pearson Product-Moment Correlation


Description. The previous methods refer to procedures for estimating agreement on any particular occasion in which reliability is assessed. In each session or day in which agreement is assessed, the observers' data are entered into one of the formulas provided earlier. Typically, frequency or point-by-point agreement ratios are computed during each reliability check (that is, each day reliability is checked), and the mean level of agreement and range (low and high agreement levels) of the reliability checks over the course of the study or within different phases are reported.
One method of evaluating agreement over the entire course of an investigation is to compute a Pearson product-moment correlation (r). On each occasion in which interobserver agreement is assessed, a total for each observer is provided. This total may reflect the number of occurrences of the behavior or total intervals or duration. Essentially, each reliability occasion yields a pair of scores, one total from each observer. A correlation coefficient compares the totals across all occasions in which reliability was assessed. The correlation provides an estimate of agreement across all occasions in which reliability was checked rather than an estimate of agreement on any particular occasion.
The correlation can range from −1.00 through +1.00. A correlation of 0.00 means that the observers' scores are unrelated. That is, they tend not to go together at all. One observer may obtain a relatively high count of the behavior and the other observer's score may be high, low, or somewhere in between. A positive correlation (between 0.00 and +1.00), particularly one in the high range (e.g., .80 or .90), means that the scores tend to go together. When one observer scores a high frequency of the behavior, the other one tends to do so as well, and when one scores a lower frequency of the behavior, so does the other one. If the correlation assumes a negative value (between 0.00 and −1.00), it means that observers tend to report scores in opposite directions: when one observer scored a higher frequency, the other tended to score a lower frequency, and vice versa. (As a measure of agreement for observational data, correlations typically take on values between 0.00 and +1.00 rather than negative values.)
Table 5.1 provides hypothetical data for 10 observation periods in which the frequency of a behavior was observed. Assume that the data were collected for 20 days and that on 10 of these days (every other day) two observers independently recorded behavior (the even-numbered days). The pairs of scores from the reliability checks are listed. The correlation between these scores across all days is computed by a commonly used formula within the table. The r of .93 in the table means that the observers' scores move very much in the same direction: when one observer scores the behavior as occurring often or not often, the other does as well.

Table 5.1  Scores for Two Observers to Compute Pearson Product-Moment Correlation

Day of Agreement Check    Observer 1 Totals = X    Observer 2 Totals = Y
          2                        25                       29
          4                        12                       20
          6                        19                       17
          8                        30                       31
         10                        33                       33
         12                        18                       20
         14                        26                       28
         16                        15                       20
         18                        10                       11
         20                        17                       19

r = (NΣXY − ΣX·ΣY) / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]} = +.93

Note: Σ = sum; X = scores of Observer 1; Y = scores of Observer 2; XY = cross products of the scores; N = number of agreement checks.
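
Readers who wish to verify the arithmetic in Table 5.1 can do so with a few lines of Python; this sketch simply applies the formula given in the table to the listed totals.

from math import sqrt

# Totals from Table 5.1.
x = [25, 12, 19, 30, 33, 18, 26, 15, 10, 17]  # Observer 1
y = [29, 20, 17, 31, 33, 20, 28, 20, 11, 19]  # Observer 2

n = len(x)
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
           (n * sum(b * b for b in y) - sum(y) ** 2))
print(round(num / den, 2))  # 0.93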
Problems and Considerations. The Pearson product-moment correlation assesses the extent to which observers covary in their scores. Covariation refers to the tendency of the scores (e.g., total frequencies or total agreements) to go together. If covariation is high, it means that both observers tended to obtain high scores on the same occasions and lower scores on other occasions. That is, their scores or totals tend to fluctuate in the same direction from occasion to occasion. The correlation says nothing about whether the observers agree on the total amount of behavior in any session. In fact, it is possible that one observer always scored behavior as occurring 20 (or any constant number) times more than the other observer for each session in which agreement was checked. If this amount of error were constant across all sessions, the correlation could still be perfect (r = +1.00). The correlation merely assesses the extent to which scores go together and not whether they are close to each other in absolute terms.
Because the correlation does not necessarily reflect exact agreement on total scores for a particular reliability session, it follows that it does not necessarily say anything about point-by-point agreement. The correlation relies on totals from the individual sessions, and so the observations of particular behaviors are lost. Thus, as a method of computing interobserver agreement, the Pearson product-moment correlation for the totals of each observer across sessions provides an inexact measure of agreement.
Another issue that arises in interpretation of the product-moment correlation pertains to the use of data across different phases. In single-case research, observations are usually obtained in the different phases of the design. In the simplest case, observations may be obtained before a particular intervention is in effect, followed by a period in which an intervention is applied to alter behavior. When the intervention is implemented, behavior is likely to increase or decrease, depending on the type of intervention and the purpose of the program. From the standpoint of a product-moment correlation, the change in frequency of behavior in the different phases may affect the estimate of agreement obtained by comparing observer totals. If behavior is high in the initial phase (e.g., hyperactive behaviors) and low during the intervention, the correlation of observer scores may be somewhat misleading. Both observers may tend to have high frequencies of behavior in the initial phase and low frequencies in the intervention phase. The tendency of the observers' scores to be high or low together is partially a function of the very different rates of behavior associated with the different phases. Agreement may be inflated in part because of the effects of the different rates between the phases. Agreement within each of the phases (initial baseline [pretreatment] phase or intervention phase) may not have been as high as the calculation of agreement across both phases. For the product-moment correlation, the possible artifact introduced by different rates of performance across phases can be remedied by calculating a correlation separately for each phase. The separate correlations can be averaged (by Fisher's z′ transformation) to form an average (mean) correlation.²

² The transformation is needed because r is not normally distributed and merely averaging the raw correlations misrepresents the mean. The transformation is readily obtained on the Web by typing "Table of Fisher's z′ transformation" in any search engine. One site available at the time of this writing is http://faculty.vassar.edu/lowry/tabs.html
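
A brief Python sketch of this averaging step follows (the function name is hypothetical): each r is transformed to z′ = arctanh(r), the z′ values are averaged, and the mean is transformed back with tanh.

import math

def mean_correlation(rs):
    # Transform each r to z' = arctanh(r), average the z' values,
    # and transform the mean back with tanh.
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical per-phase agreement correlations (baseline, intervention).
print(round(mean_correlation([0.95, 0.70]), 2))  # 0.87 (the raw mean would be 0.825)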

General Comments
The previously discussed methods o f com puting agreem ent address different ch arac-
teristics o f the data. Selection o f the m ethod is determ ined in part by the observational
strategy used and the unit o f data. The unit o f data refers to what the investigator uses
as a measure to evaluate the client’s perform ance on a d ay-to-day basis, lin e investiga-
tor may plot total frequency or total num ber o f occurrences on a graphical display
o f the data in order to evaluate, dem onstrate, or convey to others the im pact of the

2 The transformation is needed because r is not normally distributed and merely averaging their
numbers misrepresents the mean. The transformation is readily obtained on the Web by typing
“ Table of Fishers z' transformation” in any search engine. There is one site available at the time of
this writing: http://faculty.vassar.edu/lowry/tabs.html
108 S I N G L E- C A S E R ES EA R C H D ES I G N S

intervention. In such a case, a frequency ratio or product-m om ent correlation may be


selected to shed light on agreem ent on the totals. On the other hand, if a more refined
and precise m easure is obtained such as point-by-point agreem ent, then high agree-
ment leads to the com fortin g conclusion that totals must also be closely approximated.
M ost investigators aim for point-by point agreem ent when possible because this is the
m ost stringent m easure o f agreem ent. Even so, I w ould argue for an additional m easure
o f agreement on the total to con vey the consistency o f m easurem ent on the unit o f data
that is plotted and from w hich conclusions about the intervention will be drawn.
Even though agreement on totals for a given observation session is usually the primary interest, the more analytic point-by-point agreement may be examined for several purposes. When point-by-point agreement is assessed, the investigator has greater information about how adequately several behaviors are defined and observed. Point-by-point agreement for different behaviors, rather than a frequency ratio for the composite total, provides information about exactly where any sources of disagreement emerge. Feedback to observers, further training, and refinement of particular definitions are likely to result from analysis of point-by-point agreement. Selection of the method of computing agreement is also based on other considerations, including the frequency of behavior and the definition of agreements, two issues that now require greater elaboration.

BASE RATES AND CHANCE AGREEMENT


The methods of assessing agreement presented previously, especially the point-by-point agreement ratio, are the most commonly used methods in single-case research when direct observations are used. Usually, when the estimates of agreement are relatively high (e.g., 80% or r = .80), investigators assume that observers generally agree in their observations. However, investigators have been alert to the fact that a given estimate such as 80 or 90% does not mean the same thing under all circumstances. The level of agreement is in part a function of how frequently the behavior is scored as occurring.
If behavior is occurring with a relatively high frequency, observers are more likely to have high levels of agreement with the usual point-by-point ratio formula than if behavior is occurring with a relatively low frequency. The base rate of behavior, that is, the level of occurrence or number of intervals in which behavior is recorded as occurring, contributes to the estimated level of agreement.³ The possible influence of high or low frequency of behavior on interobserver agreement applies to any observations in which point-by-point agreement is assessed. The interval method is used as an example because it is one of the most frequently used assessment methods.

³ The base rate should not be confused with the baseline rate. The base rate refers to the proportion of intervals or the relative frequency of the behavior. The baseline rate usually refers to the rate of performance when no intervention is in effect to alter the behavior.
A client may perform the response in most of the intervals in which he or she is observed. If two observers mark the behavior as occurring in many of the intervals, they are likely to agree merely because of the high rate of occurrence. When many occurrences are marked by both observers, high correspondence between observers is inevitable. To be more concrete, assume that the client performs the behavior in 90 of 100 intervals and that both observers coincidentally score the behavior as occurring in 90% of the intervals. Agreement between the observers is likely to be high simply because a large proportion of intervals was marked as occurrences. That is, agreement will be high as a function of chance.
Chance in this context refers to the level of agreement that would be expected by randomly marking occurrences for a given number of intervals. Agreement would be high whether or not observers saw the same behavior as occurring in each interval. Even if both observers were blindfolded but marked a large number of intervals as occurrences, agreement might be high. Exactly how high chance agreement would be depends on what is counted as an agreement. In the point-by-point ratio, recall that reliability was computed by dividing agreements by agreements plus disagreements and multiplying by 100. An agreement usually means that both observers recorded the behavior as occurring. But if behavior is occurring at a high rate, reliability may be especially high on the basis of chance.
The actual formula for computing the chance level of agreement on occurrences is:

Chance agreement on occurrences = (O₁ occurrences × O₂ occurrences) / (total intervals)² × 100

where
O₁ occurrences = the number of intervals in which Observer 1 scored the behavior as occurring
O₂ occurrences = the number of intervals in which Observer 2 scored the behavior as occurring
(total intervals)² = the total number of intervals of observation, squared.
If the client performs the behavior frequently, O₁ and O₂ occurrences are likely to be high. In the preceding hypothetical example, both observers recorded 90 occurrences of the behavior. With such frequent recording of occurrences, by just randomly marking this number of intervals, "chance" agreement would be high. In the preceding formula, chance would be 81% ([90 × 90]/100² × 100). Merely because occurrence intervals are quite frequent, agreement would appear high. When investigators report agreement at this level, it may be important to know whether this level would have been expected anyway merely as a function of chance.
Perhaps the problem of high agreement based on chance could be avoided by counting as agreements only those intervals in which observers agreed on nonoccurrences. The intervals in which they agreed on occurrences could be omitted. If only the number of intervals in which both observers agreed on the behavior not occurring were counted as agreements, the chance level of agreement would be lower. In fact, chance agreement on nonoccurrences would be calculated by the following formula:

Chance agreement on nonoccurrences = (O₁ nonoccurrences × O₂ nonoccurrences) / (total intervals)² × 100

In the previous example, both observers recorded nonoccurrences in 10 of the 100 intervals, making chance agreement on nonoccurrences 1% ([10 × 10]/100² × 100).⁴

⁴ The level of agreement expected by chance is based on the proportion of intervals in which observers report the behavior as occurring or not occurring. Although chance agreement can be calculated by the formulas provided here, other sources provide probability functions with which chance agreement can be determined simply and directly (e.g., Ary, Covalt, & Suen, 1990; Hartmann, 1982).
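
The two chance calculations can be expressed in a few lines of Python; the function name is hypothetical, and the arguments are simply the counts defined above.

def chance_agreement(count_1, count_2, total_intervals):
    # (count for Observer 1 x count for Observer 2) / total intervals squared x 100.
    # Pass occurrence counts for chance agreement on occurrences,
    # nonoccurrence counts for chance agreement on nonoccurrences.
    return count_1 * count_2 / total_intervals ** 2 * 100

# The 100-interval example: 90 occurrences and 10 nonoccurrences per observer.
print(chance_agreement(90, 90, 100))  # 81.0 (occurrences)
print(chance_agreement(10, 10, 100))  # 1.0  (nonoccurrences)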

When agreements are defined as nonoccurrences that are scored at a low frequency, chance agreement is low. Hence, if the point-by-point ratio were computed and observers agreed 80% of the time on nonoccurrences, this would clearly mean they agreed well above the level expected by chance.
Defining agreements on the basis of nonoccurrences is not a general solution, because in many cases nonoccurrences may be relatively high (e.g., when the behavior rarely occurs). Moreover, as an intervention project proceeds, it is likely that in different phases occurrences will be relatively high and nonoccurrences relatively low, and that this pattern will then be reversed. The question for investigators that has received considerable attention is how to compute agreement between observers over the course of an experiment while taking into account the changing level of agreement that would be expected by chance. Several methods of addressing this question have been suggested.⁵

⁵ A series of articles on the topic of interobserver agreement and alternative methods of computing agreement based on estimates of chance appeared several years ago and remains applicable. See separate issues of the Journal of Applied Behavior Analysis (1977, Volume 10, pp. 97–150; and 1979, Volume 12, pp. 523–571).

ALTERNATIVE METHODS OF HANDLING EXPECTED ("CHANCE") LEVELS OF AGREEMENT

Variations of Occurrence and Nonoccurrence Agreement


The problem of base rates occurs when the intervals that are counted as agreements in a reliability check are the ones scored at a high rate. Typically, agreements are defined as instances in which both observers record the behavior as occurring. If occurrences are scored relatively often, the expected level of agreement on the basis of chance is relatively high. One solution is to vary the definition of agreements in the point-by-point ratio to reduce the expected level of agreement based on "chance." Agreements on occurrences would be calculated only when the rate of behavior is low, that is, when relatively few intervals are scored as occurrences of the response. This is somewhat different from the usual practice, in which agreements on occurrences are counted even when occurrences are scored frequently. Hence, with low rates of occurrences, point-by-point agreement on occurrences provides a stringent measure of how well observers agree without a high level expected by chance. Conversely, when the occurrences of behavior are relatively high, agreement can be computed on intervals in which both observers record the behavior as not occurring. With a high rate of occurrences, agreement on nonoccurrences is not likely to be inflated by chance.
Although the recommendation is sound, the solution is somewhat cumbersome. First, over time in a given investigation, it is likely that the rate of occurrence of the response will change, so that high and low rates occur in different phases. The definition of agreement would then also change at different times. The primary interest in assessing agreement is determining whether observers see the behavior as occurring. Constantly changing the definition of agreements within a study handles the problem of chance agreement but does not provide a clear and direct measure of agreement on scoring the behavior.
Another problem with the proposed solution is that agreement estimates tend to fluctuate markedly when the intervals that define agreement are infrequent. For example, if 100 intervals are observed and behavior occurs in only 2 intervals, the recommendation would be to compute agreement on occurrence intervals. Assume that one observer records two occurrences, that the other records only one, and that they both agree on this one. Reliability will be based on computing agreement for only the two intervals and will be 50% (agreements = 1, disagreements = 1, and overall reliability equals agreements divided by agreements plus disagreements). If the observer who provided the check on reliability had scored 0, 1, or 2 occurrences in agreement with the primary observer, agreement would be 0, 50, or 100%, respectively. Thus, with a small number of intervals counted as agreements, reliability estimates fluctuate widely and are subject to misinterpretation in their own right. Perhaps the simplest solution is to report reliability separately for occurrence and nonoccurrence intervals throughout the investigation. This allows the reader or consumer of research to judge whether agreement is discrepant based on the ways in which it is computed and whether the rate of occurrences of the behavior, which presumably changes over different phases of the design, influences agreement.
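
Reusing the hypothetical point_by_point function sketched earlier, the instability can be shown directly: with only two occurrence intervals, a single record moves occurrence agreement from 0 to 50 to 100%.

# Only two occurrence intervals out of 100; the primary observer scores both.
rare_primary = [1, 1] + [0] * 98

for agreed in (0, 1, 2):  # occurrences the checking observer scores in agreement
    checker = [1] * agreed + [0] * (100 - agreed)
    print(point_by_point(rare_primary, checker))  # 0.0, then 50.0, then 100.0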

Plotting Agreement Data


A major purpose of obtaining agreement estimates is to ensure that there is sufficient consistency in the measure to represent the effects of the intervention. Whether one way of computing agreement versus another misrepresents the data can be addressed by plotting the data separately for both the primary observer and the secondary observer. Usually, only the data for the primary observer are plotted. However, the data obtained from the secondary observer can also be plotted so that the similarity in the scores from the observers can be seen on the graphic display.
An interesting advantage of this recommendation is that one can determine whether the observers disagree to such an extent that the conclusions drawn from the data would differ because of the extent of the disagreement. For example, Figure 5.1 shows hypothetical data for baseline and intervention phases. The data for the primary observer are plotted for each day of observation (circles). The occasional reliability checks by a second observer are also plotted (squares). The data in the upper panel show that both observers were relatively close in their estimates of performance. If the data of the second observer were substituted for those of the first, the pattern of data showing superior performance during the intervention phase would not be altered.
In contrast, the lower panel shows marked discrepancies between the primary and secondary observer. The discrepancy is referred to as "marked" because of the impact that the differences would have on the conclusions reached about the changes in behavior. If the data of the second observer were used, it would not be clear that performance really improved during the intervention phase. The data for the second observer suggest that perhaps there was no change in performance over the two phases or, alternatively, that there is bias in the observations and that no clear conclusion can be reached.

[Figure 5.1 appears here: interobserver agreement plotted across days of observations, in two panels.]

Figure 5.1. Hypothetical data showing observations from the primary observer (circles connected by lines) and the second observer, whose data are used to check agreement (squares). The upper panel shows close correspondence between observers; the conclusions about behavior change from baseline to intervention phases would not vary if the data from the second observer were substituted for the data from the primary observer. The lower panel shows marked discrepancies between observers; the conclusions about behavior change would be very different depending on which observer's data were used.
In any case, plotting the data from both observers provides useful information about how closely the observers actually agreed in their totals for occurrences of the response. Independently of the numerical estimate of agreement, the graphic display permits one to examine whether the scores from each observer would lead to different conclusions about the effects of an intervention, which is a very important reason for evaluating agreement in the first place. Plotting data from a second observer whose data are used to evaluate agreement provides an important source of information that could be hidden by agreement ratios that are potentially inflated by "chance."
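
For investigators who graph their data with software, a minimal sketch using Python's matplotlib plotting library (one of several options) conveys the idea; all data values and the phase boundary here are hypothetical and merely mimic the upper panel of Figure 5.1.

import matplotlib.pyplot as plt

# Hypothetical daily totals; the secondary observer checked on days 3, 8, 13, 18.
days = list(range(1, 21))
primary = [5, 6, 5, 7, 6, 5, 6, 7, 6, 5, 12, 13, 14, 13, 15, 14, 15, 16, 15, 16]
check_days = [3, 8, 13, 18]
secondary = [5, 6, 14, 15]

plt.plot(days, primary, "o-", label="Primary observer")
plt.plot(check_days, secondary, "s", label="Reliability checks")
plt.axvline(10.5, linestyle="--", color="gray")  # baseline/intervention boundary
plt.xlabel("Days of observations")
plt.ylabel("Frequency of behavior")
plt.legend()
plt.show()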

Correlational Statistics
Another means o f addressing the problem o f chance agreem ent and the m isleading
interpretations that might result from high percentage agreem ent is to use correlational
statistics. One correlational statistic that is com m on ly used is kappa (k ) (Cohen, 1965).
Kappa is especially suited for categorical data such as interval o bservation or discrete
categorization when each response o r interval is recorded as o ccu rring or not. Kappa
provides an estimate o f agreement between o bservers that is corrected for chance.
W hen observers agree at the sam e level one w ould expect o n the basis ofchance, k = o.
If agreement surpasses the expected chance level, k exceeds o and approaches a m axi-
mum o f -H.oo.6
Kappa is computed by the following formula:

κ = (Po − Pe) / (1 − Pe)

where
Po = the proportion of observed agreements between observers on occurrences and nonoccurrences (i.e., agreements on occurrences and nonoccurrences divided by the total number of agreements and disagreements)
Pe = the proportion of agreements expected on the basis of chance (computed by multiplying the number of occurrences for Observer 1 times the number of occurrences for Observer 2, plus the number of nonoccurrences for Observer 1 times the number of nonoccurrences for Observer 2; the sum of these is divided by the total number of intervals squared).
For example, two observers may observe a child for 100 intervals. Observer 1 scores 80 intervals of occurrence of aggressive behavior and 20 intervals of nonoccurrence. Observer 2 scores 70 intervals of aggressive behavior and 30 intervals of nonoccurrence. Assume that the observers agree on 70 of the occurrence intervals and on 20 nonoccurrence intervals and disagree on the remaining 10 intervals. Using the preceding formula, Po = .90 and Pe = .62, with kappa = .74. Although there is no firm or universally agreed upon rule, generally kappa ≥ .70 is viewed as acceptable agreement.
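
A small Python sketch, with a hypothetical function name and argument layout, reproduces the example.

def kappa(both_occ, both_non, only_1, only_2):
    # Cells of the two-observer table: both score an occurrence, both score
    # a nonoccurrence, only Observer 1 scores an occurrence, only Observer 2 does.
    total = both_occ + both_non + only_1 + only_2
    occ_1, occ_2 = both_occ + only_1, both_occ + only_2
    non_1, non_2 = total - occ_1, total - occ_2
    p_o = (both_occ + both_non) / total
    p_e = (occ_1 * occ_2 + non_1 * non_2) / total ** 2
    return (p_o - p_e) / (1 - p_e)

# The text's example: 70 occurrence agreements, 20 nonoccurrence agreements,
# 10 intervals scored as occurrences only by Observer 1.
print(round(kappa(70, 20, 10, 0), 2))  # 0.74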
The advantage of kappa is that it corrects for chance based on the observed frequency of occurrence and nonoccurrence intervals. Other agreement measures are difficult to interpret because chance agreement may yield a high positive value (e.g., 80%), which gives the impression that high agreement has been obtained. For example, with the preceding data used in the computation of κ, a point-by-point ratio agreement on occurrence and nonoccurrence intervals combined would yield 90% agreement. However, on the basis of chance alone, the percent agreement would be 62. Kappa provides a measure of agreement over and above chance.⁷

⁶ Kappa values can be negative and move from 0.00 to −1.00 in the unlikely event that agreement between observers is less than the level expected by chance.
⁷ Kappa is not the only correlational statistic that can estimate agreement on categorical data (see Hartmann, 1982). For example, another estimate very similar to kappa is phi (φ), which also extends from −1.00 through +1.00 and yields 0.00 when agreement is at the chance level. The advantage of phi is that a conversion table has been provided to convey levels of phi based on obtained agreement on occurrences and nonoccurrences (Lewin & Wakefield, 1979). Thus, investigators can convert their usual data into phi equivalents without computational difficulties.

General Comments


Several alternatives have been suggested to take into account chance or expected levels of agreement. A few of the solutions were highlighted here, which merely served to raise the issue and to convey major options. There is no clear consensus on which of the solutions adequately resolves the problem without introducing new complexities. And, in the applied literature, investigators have not uniformly adopted one particular way of handling the problem. At this point, there is consensus on the problem: chance agreement can obscure estimates of reliability. Further, there is general agreement that in reporting reliability, it is useful to consider one of the many different ways of conveying or incorporating chance agreement. Hence, as a general guideline, it is probably useful to compute and report agreement expected on the basis of chance or to compute agreement in alternative formats (e.g., separately for occurrences and nonoccurrences) to provide additional data that convey how observers actually concur in their observations.
As alternatives are considered for computing agreement and weighing the impact of base rates and chance, it is important to keep in mind the primary goal. Reliability computation and assessment are not ends in their own right, but rather serve a critical methodological purpose. Do the data reflect the performance of the client, and will the data permit evaluation of the intervention? High levels of agreement optimize the extent to which these ends can be achieved. In any given demonstration, whether agreement is 70 or 80% depending on the method of computation may or may not make a difference. The impact of the intervention, for example, can be such that the difference in how reliability is computed is nugatory. This can be made evident by plotting the data of both primary and secondary observers to address the question directly: do the different data sources appreciably alter the conclusions that would be reached about the client's behavior or the effect of the intervention?
There are many ways of measuring interrater agreement. They include measures I did not mention (e.g., the intraclass correlation) and variations of others (e.g., kappa) I have not covered (see Broemeling, 2009; Shoukri, 2005). I have included those methods of evaluating agreement that are most commonly used. The goal was to convey specific practices and how and when they are used. In addition, the methods of evaluating interrater agreement served as a way of discussing critical issues that need to be considered (e.g., the goals of assessing agreement to begin with, base rates, interpretation of agreement statistics). Understanding these latter issues is critical. Agreement is not an end in itself but a means to enhance interpretation of the data, and this depends on understanding these other issues as well as how reliability information is obtained, as discussed next.

SOURCES OF ARTIFACT AND BIAS
Interpretation of agreement estimates depends on knowing several features about the circumstances in which agreement is assessed. Sources of bias that can obscure interpretation of interobserver agreement include reactivity of reliability assessment, observer drift, observer expectancies and experimenter feedback, and complexity of the observations.⁸

⁸ Many of the sources of bias and artifact in conducting observations were evaluated in the 1970s and 1980s. Thus many "classic" (but also old) references constitute the primary literature. Key findings and recommendations of the early studies remain cogent and applicable. Original studies are only highlighted in this section; the reader is referred to reviews of the studies in these secondary sources (Hartmann, Barrios, & Wood, 2004; Kazdin, 1977a; Kent & Foster, 1977).

Reactivity of Reliability Assessment


Interobserver agreement is usually checked periodically during an investigation. Typically, observers are aware that their observations are being checked, if for no other reason than that another observer is present on just those occasions and both observers must coordinate their recording to observe the same person at the same time. Because observers are aware that reliability is being checked, the situation is potentially reactive. Reactivity refers to changes that people make when their behavior is being monitored or evaluated. Awareness does not invariably lead to reactivity, yet one must be alert to the possibility of an influence. Indeed, research has shown that observer awareness that reliability is being checked influences the observations they make. When observers are led to believe that agreement is being assessed on some occasions and not on others, their agreement is higher on those occasions, even though their agreement is in fact checked on all occasions (see Kent & Foster, 1977, for a review). That is, agreement was assessed even when they did not believe they were being checked. The general findings are consistent: observers show higher interobserver agreement when they are aware that reliability is being checked than when they are unaware.
It is not entirely clear why agreement is higher under conditions in which observers are aware that reliability is being checked. When observers are aware of reliability checks, they may modify the behavioral definitions or codes slightly to concur with the other observer with whom their data are compared, and thus observe differently. One study found that observers recorded much less disruptive student behavior in the classroom during occasions on which they thought agreement was being checked. The broader point is the more critical one: interpretation of estimates of agreement depends very much on the conditions of reliability assessment. Estimates obtained when observers are unaware of agreement checks tend to be lower than those obtained when they are aware of these checks.
Awareness of assessing agreement can be handled in different ways. As a general rule, the conditions of reliability assessment should be similar to the conditions in which data are ordinarily obtained. If observers ordinarily believe that their behaviors are not being monitored, these conditions should be maintained during reliability checks. In practice, it may be difficult to conduct agreement checks without observers being aware of the checks. Measuring interobserver agreement usually involves special arrangements that are not ordinarily in effect each day. For example, in most investigations two observers usually do not record the behavior of the same target subject at the same time unless agreement is being assessed. Hence, it may be difficult to conduct checks without alerting observers to this fact. An alternative might be to lead observers to believe that all their observations are being monitored over the course of the investigation. This latter alternative would appear to be advantageous, given evidence that observers tend to be more accurate when they believe their agreement is being assessed.

Observer Drift
Observers usually receive extensive instruction and feedback regarding accuracy in applying the definitions for recording behavior. Training is designed to ensure that observers adhere to the definitions of behavior and record behavior at a consistent level of accuracy. Once mastery is achieved and estimates of agreement are consistently high, it is assumed that observers continue to apply the same definition of behavior over time. However, observers may "drift" from the original definition of behavior. Observer drift refers to the tendency of observers to change the manner in which they apply definitions of behavior over time.
The hazard of drift is that it is not easily detected. Interobserver agreement may remain high even though the observers are deviating from the original definitions of behavior. If observers consistently work together and communicate with each other, they may develop similar variations of the original definitions (Hawkins & Dobes, 1977). Thus, high levels of agreement can be maintained even if accuracy declines. Drift is detected by comparing interobserver agreement among a subgroup of observers who constantly work together with agreement across subgroups who have not worked with each other. Over time, subgroups of observers may modify and apply the definitions of behavior differently, which can only be detected by comparing data from observers who have not worked together.
If observers modify the definitions of behavior over time, the data from different phases may not be comparable. For example, if disruptive behaviors in the classroom or at home are observed, the data from different days in the study may not reflect precisely the same behaviors, owing to observer drift. And, as already noted, the differences in the definitions of behavior may occur even though observers continue to show high interobserver agreement.
Observer drift can be controlled in a variety of ways. First, observers can undergo continuous training over the course of the investigation. Videotapes of the clients can be shown in periodic retraining sessions in which the codes are discussed among all observers. Observers can meet as a group, rate behavior in the situation, and receive feedback regarding the accuracy of their observations, that is, their adherence to the original codes. The feedback can convey the extent to which observers correctly invoke the definitions for scoring behavior. Feedback for accuracy in applying the definitions helps reduce drift from the original behavioral codes.
Second, all observations of the client can be videotaped. Observers can score the tapes in random order at the end of the investigation. Drift would not differentially bias data in different phases because the tapes are rated in random order. This alternative is somewhat impractical because of the time and expense of taping the client's behavior for several observation sessions. Moreover, the investigator needs the data on a day-to-day basis to make decisions regarding when to implement or withdraw the intervention, a characteristic of single-case designs that will become clearer in subsequent chapters. Yet taped samples of behavior from selected occasions could be compared with actual observations obtained by observers in the setting to assess whether drift has occurred over time.

Finally, drift might also be controlled by periodically bringing newly trained observers into the setting to assess interobserver agreement. Comparison of newly trained observers with observers who have continuously participated in the investigation can reveal whether the codes are applied differently over time. Presumably, new observers would adhere more closely to the original definitions than would other observers who have had the opportunity to drift from the original definitions.

Observer Expectancies and Feedback


Another potential source of bias is the expectancies of observers regarding the client's behavior and the feedback observers receive from the experimenter in relation to that behavior. If observers are led to expect change (e.g., an increase or decrease in behavior), these expectancies do not usually bias observational data (see Kent & Foster, 1977). Yet expectancies can influence the observations when combined with feedback from the experimenter. Instructions to expect change combined with feedback for scoring reductions can lead to decreases in the disruptive behavior that the observers report (O'Leary, Kent, & Kanowitz, 1975). In this study, observers were only rating a videotape of classroom behavior in which no changes in the disruptive behaviors occurred over time. Thus, the expectancies and feedback about the effects of treatment affected the data generated by the observers. Investigators probably rarely apply feedback to observers in this fashion, but the study is instructive in showing the malleability of observer data to such external influences.
It is reassuring that research suggests that expectancies alone are not likely to influence behavioral observations. However, it may be crucial to control the feedback that observers obtain about the data and whether the investigator's expectations are confirmed. Obviously, any feedback provided to observers should be restricted to information about the accuracy of their observations, rather than information about changes in the client's behavior, in order to prevent or minimize drift.

Complexity of the Observations


In the situations discussed up to this point, the assumption has been made that observers score only one behavior at a time. Often observers record several behaviors within a given observational period. For example, with interval assessment, the observers may score several different behaviors during a particular interval. Research has shown that the complexity of the observations influences the agreement and accuracy of the observations.
Complexity has been investigated in different ways. For example, complexity can refer to the number of different responses that are scored in a given period. Observational codes that consist of several categories of responses are more complex than those with fewer categories. As might be expected, observers have been found to be more accurate and to show higher agreement when there are fewer categories of behavior to score (see Kazdin, 1977a, for a review). Complexity can also refer to the range of client behaviors that are performed. Within a given scoring system, clients may perform many different behaviors over time. The greater the number of different behaviors that clients perform, the lower the interobserver agreement is likely to be. The precise reasons why complexity of observations and interobserver agreement are inversely related are not entirely clear. Presumably, with complex observational systems in which several behaviors must be scored, observers may have difficulty making discriminations among all of the codes and definitions or are more likely to make errors. With much more information to process and code, errors in applying the codes and scoring would be expected to increase.
In training observers, the temptation is to provide relatively simplified conditions of assessment in order to ensure that observers understand each of the definitions and apply them consistently. When several codes, behaviors, or subjects are to be observed in the investigation, observers need to be trained to record behavior with the same level of complexity. High levels of interobserver agreement need to be established for the exact conditions under which observers will be required to perform.

ACCEPTABLE LEVELS OF AGREEMENT


The question for researchers invariably is, "When all is said and done, what is an acceptable level of agreement?" As the discussion conveys, the number by itself is not easily interpreted, given the method of computation, chance and base rates, the number and complexity of behaviors, and the various sources of bias that may be present. The level of agreement that is acceptable is one that indicates to the researcher that the observers are sufficiently consistent in their recordings of behavior, that behaviors are adequately defined, and that the measure will be sensitive to changes in the client's performance over time. This general statement may be unsatisfying, but the goal of checking agreement is to address design requirements: describing stable patterns of performance and identifying change when change occurs. If very small changes in behavior are likely to occur, an error in assessment (e.g., disagreement) could obscure the outcome. The small changes may be obscured by even a little variability (error) in assessment. In contrast, if large changes are likely to occur, slightly more disagreement can be tolerated. The added variability in the observations would not obscure the marked changes in performance.
The magnitude of change is only one influence. The variability in the performance of the client is also critical. For example, assume that the client's "real" behavior (free from any observer bias) shows relatively little variability over time. Also, assume that across baseline and intervention phases, dramatic changes in behavior occur. Under conditions of slight variability and marked changes, moderate inconsistencies in the data may not interfere with drawing conclusions about intervention effects. On the other hand, if the variability in the client's behavior is relatively large and the changes over time are not especially dramatic, a moderate amount of inconsistency among observers may hide the change. Hence, although high agreement between observers is always a goal, the level of agreement that is acceptable to detect systematic changes in the client's performance depends on the client's behavior and the effects of the intervention.
Traditionally, agreement was regarded as acceptable if it met or surpassed .80, or 80%, computed by frequency or point-by-point agreement ratios. Of course, high levels of agreement may not necessarily be acceptable if the formula for computing agreement or the conditions for evaluating agreement introduce potential biases or artifacts. Conversely, lower levels of agreement may be quite useful and acceptable if the conditions under which they were obtained minimize sources of bias and artifact. Hence, it is not only the quantitative estimate that ought to be evaluated, but also how that estimate was obtained and under what conditions.

In light of the large number of considerations embedded in the estimate of interobserver agreement, concrete guidelines that apply to all methods of computing agreement, all conditions in which agreement is assessed, and all patterns of data are difficult to provide. The traditional guideline of seeking agreement at or above .80 is useful, but attaining this criterion is not necessarily meaningful or acceptable, given the other conditions that could contribute to the estimate. As a general recommendation, it is important not to lose sight of why one is measuring agreement in the first place, namely, to minimize error in the data and variability that might obscure drawing inferences about change. This consideration leads one to evaluate client variability and the likelihood of being able to detect change rather than to focus, possibly mindlessly, on a number. As a more concrete recommendation, given the current status of views of agreement, I would encourage investigators to consider more than one method for estimating agreement and to specify carefully the conditions in which the checks on agreement are conducted. With this added information, the investigator and those who read reports of applied research will be in a better position to evaluate the assessment procedures and whether and how the procedures influence the conclusions.

SUMMARY AND CONCLUSIONS
Reliability of assessment is critical in all scientific research and for all of the different methods (e.g., ratings, questionnaires, direct observations, automated recordings) that are used in single-case research. Direct observation, the most frequently used method of assessment in single-case research, has its own challenges in both the computation of agreement and the conditions under which agreement is assessed.
A crucial component of direct observation of behavior is to ensure that observers score behavior consistently. Consistent assessment is essential to ensure that minimal variation is introduced into the data by observers and to check on the adequacy of the response definition(s). Interobserver agreement is assessed periodically by having two or more persons simultaneously but independently observe the client and record behavior. The resulting scores are compared to evaluate the consistency of the observations.
Commonly used methods to assess agreement consist of the frequency ratio, the point-by-point agreement ratio, and the Pearson product-moment correlation. These methods provide different information, including, respectively, the correspondence of observers on the total frequency of behavior for a given observational session, the exact agreement of observers on specific occurrences of the behavior within a session, and the covariation of observer data across several sessions.
A major issue in evaluating agreement data pertains to the base rate of the client's performance. As the frequency of behavior or occurrences increases, the level of agreement on these occurrences between observers increases as a function of chance. Thus, if behavior is recorded as relatively frequent, agreement between the observers is likely to be high. Without calculating the expected or chance level of agreement, investigators may believe that high observer agreement is a function of well-defined behaviors and high levels of consistency between observers. Point-by-point agreement ratios as usually calculated do not consider the chance level of agreement and may be misleading. Hence, alternative methods of calculating agreement have been proposed, based on the relative frequency of occurrences or nonoccurrences of the response, graphic displays of the data from the observer who serves to check reliability, and computation of correlational measures (e.g., kappa, phi).
Several sources of bias and artifact have been identified that may influence the agreement data. These include reactivity of assessment, observer drift, expectancies of the observers and feedback from the experimenter, and complexity of the observations. In general, observers tend to agree more and to be more accurate when they are aware, rather than unaware, that their observations are being checked. The definitions that observers apply to behavior may depart ("drift") from the original definitions they held at the beginning of the investigation. Under some conditions, observers' expectancies of changes in the client's behavior and feedback indicating that the experimenter's expectancies are confirmed may bias the observations. Finally, accuracy of observations and interobserver agreement tend to decrease as a function of the complexity of the observational system (e.g., the number of different categories to be observed and the number of different behaviors clients perform within a given observational system).
It is important to keep in mind that the purpose of assessing agreement is to ensure that observers are consistent in their observations and that sufficient agreement exists to reflect change in the client's behavior over time. In conducting and reporting assessments of agreement, it may be advisable to consider more than one way to estimate and report agreement. Also, it is just as important to ensure that the conditions under which agreement is obtained circumvent or minimize the sources of bias and artifact. These sources are critical because they show that agreement can be high and yet the data can very much misrepresent client behavior.
CHAPTER 6

Introduction to Single-Case Research and ABAB Designs

CHAPTER OUTLINE

General Requirements of Single-Case Designs
    Continuous Assessment
    Baseline Assessment
        Stability of Performance
        Trend in the Data
        Variability in the Data
ABAB Designs: Basic Characteristics
    Description and Underlying Rationale
    Illustrations
Design Variations
    "Reversal" Phase
    Order of the Phases
    Number of Phases
    Number of Different Interventions
    General Comments
Problems and Limitations
    Absence of a "Reversal" of Behavior
    Undesirability of "Reversing" Behavior
Evaluation of the Design
Summary and Conclusions

I mentioned that research design in general includes three interdependent components: assessment, experimental design, and data evaluation. Their interdependence derives from how they contribute to the clarity of the conclusions and how each helps reduce, eliminate, or make implausible threats to validity. In the assessment chapters, I noted not only strategies for assessing outcomes but also features of assessment (e.g., inconsistencies, biases in conditions of assessment) that interfere with drawing conclusions about the intervention. Beginning with this chapter, we now discuss the major design options of single-case research. Before discussing the first design option, some preliminary comments will be helpful to convey the goals of the designs and how these are shared with more familiar between-group research.
In both between-group and single-case designs the intervention is arranged in such a way as to make alternative threats to internal validity implausible. All experiments compare the effects of different conditions (independent variables) on performance (dependent variables). In traditional between-group experimentation, the comparison is made between groups of subjects who receive or who are exposed to different conditions. The “gold standard” for between-group research that evaluates interventions is the randomized controlled trial (RCT). Participants in the study are assigned randomly to one of two (or more) groups. In the simple case, there might be two groups: intervention and no intervention. The effect of the intervention is evaluated by comparing the performance of the different groups at the end of the study. In single-case research, inferences are usually made about the effects of the intervention by comparing different conditions presented to the same participant over time. Here too, in the simple case, performance of the participant may be compared under two conditions, intervention and no intervention. However, details of how single-case designs accomplish this comparison are novel and depart considerably from traditional between-group designs. The purpose of this chapter is to identify key characteristics of all single-case designs and how they contribute to the logic of drawing causal inferences. One family of single-case designs, referred to as ABAB designs, is also presented, not only to convey the logic of single-case research but also to present options that can be used to evaluate intervention programs.

GENERAL REQUIREMENTS OF SINGLE-CASE DESIGNS

Continuous Assessment
The most fundamental design requirement of single-case experimentation is the reliance on repeated observations of performance over time. The client’s performance is observed on several occasions, usually before the intervention is applied and continuously over the period while the intervention is in effect. Typically, observations are conducted on a daily basis or at least on multiple occasions each week. Continuous assessment is a basic requirement because single-case designs examine the effects of interventions on performance over time. Continuous assessment allows the investigator to examine the pattern and stability of performance before treatment is initiated. The pretreatment information over an extended period provides a picture of what performance is like without the intervention. When the intervention eventually is implemented, the observations are continued and the investigator can examine whether behavior changes coincide with administration of the intervention.
The role of continuous assessment in single-case research can be illustrated by examining a basic difference between between-group and single-case research. In both types of research, as already noted, the effects of a particular intervention on performance are examined. In the most basic case, the intervention is examined by comparing performance when the intervention is presented versus performance when it is withheld. In between-group research, the question is addressed by giving the intervention to some persons (treatment group) but not to others (no treatment group). One or two observations (e.g., pre- and post-treatment assessment) are obtained for several different persons. In single-case research, the effects of the intervention are examined by observing
the influence of the intervention and no intervention on the performance of the same person(s). Instead of one or two observations of several persons, several observations are obtained for one or a few persons. Continuous assessment refers to those several observations that are needed to make the comparison of interest with the individual subject.

Baseline Assessment
Each of the single-case experimental designs usually begins with observing behavior for several days before the intervention is implemented. This initial period of observation, referred to as the baseline phase, provides information about the level of behavior before a special intervention begins. The baseline phase serves two critical functions. The first is referred to as the descriptive function. The data collected during the baseline phase describe the existing level of performance or the extent to which the client engages in the behavior or domain that is to be altered. The second is referred to as the predictive function. The baseline data serve as the basis for predicting the level of performance for the immediate future if the intervention is not provided. Of course, a description of present performance does not necessarily provide a statement of what performance would really be like in the future. Performance might change even without the intervention (e.g., from history or maturation, as two influences). The only way to be certain of future performance without the intervention would be to continue baseline observations without implementing the intervention. This cannot be done because the purpose is to implement and evaluate the intervention in order to improve the client’s functioning in some way. What can be done is to observe baseline performance for several days to provide a sufficient or reasonable basis for making a prediction of future performance. The prediction is achieved by projecting or extrapolating a continuation of baseline performance into the future.
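
The projection itself can be made explicit with a small computation (a minimal sketch; the baseline numbers, and the use of an ordinary least-squares line as the projection, are illustrative assumptions, not a procedure this book prescribes):

def fit_trend(days, values):
    # Ordinary least-squares line through the baseline observations.
    n = len(days)
    mx = sum(days) / n
    my = sum(values) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(days, values))
             / sum((x - mx) ** 2 for x in days))
    return my - slope * mx, slope          # (intercept, slope)

baseline = [42, 40, 43, 41, 44, 42, 40, 43, 41, 42]   # 10 hypothetical baseline days
intercept, slope = fit_trend(list(range(1, 11)), baseline)

# Extrapolate the fitted line to project Days 11-15 if nothing changes.
for day in range(11, 16):
    print(day, round(intercept + slope * day, 1))

A slope near zero supports using the baseline mean as the projected level; a clearly nonzero slope signals the trend problems taken up in the next section.
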
A hypothetical example can be used to illustrate how observations during the baseline phase are used to predict future performance and how this prediction is pivotal to drawing inferences about the effects of the intervention. Figure 6.1 illustrates a hypothetical case in which observations were collected on a child in a special education class and focused on frequency of shouting out complaints or comments to a teacher.

Figure 6.1. Hypothetical example of baseline observations of frequency of complaining. The data in baseline (solid line) are used to predict the likely rate of performance in the future (dashed line).

As evident in the figure, observations during the baseline (pretreatment) phase were obtained for 10 days. The hypothetical baseline data suggest a reasonably consistent pattern of shouting out complaints each day in the classroom.
We do not really know what performance will be like on Days 11, 12, and so on—all those days after baseline that were not yet observed. Yet the baseline level can be used to project the likely level of performance in the immediate future if conditions continue as they are. The projected (dashed) line suggests the approximate level of future performance. This projected level is essential for single-case experimentation because it serves as one criterion to evaluate whether the intervention leads to change. Presumably, if treatment is effective, performance will differ from the projected level of baseline. For example, if a program is designed to reduce shouting and is successful in doing so, the line (data points) for shouting out should be well below the projected line that represents the level of baseline. In any case, continuous assessment in the beginning of single-case experimental designs consists of observation of baseline or pretreatment performance. As the individual single-case designs are described later, the importance of initial baseline assessment will become especially clear.

Stability of Performance


Since baseline performance is used to predict how the client will behave in the future, it is important that the data are stable. A stable rate of performance is characterized by the absence of a trend (or slope) in the data and relatively little variability in performance. The notions of trend and variability raise separate issues, even though they both relate to stability.

Trend in the Data. A trend, also called slope, refers to the tendency for performance to decrease or increase systematically or consistently over time. One of three simple data patterns might be evident during baseline observations. First, baseline data may show no trend or slope. In this case, performance is best represented by a horizontal line indicating that it is not increasing or decreasing over time. As a hypothetical example, observations may be obtained on the disruptive and inappropriate classroom behaviors of a child who is identified because he is hyperactive (e.g., rarely in his seat, disrupts, handles the work of others while they are working, and blurts out comments during class). The upper panel of Figure 6.2 shows baseline performance with no trend. The absence of trend in baseline provides a relatively clear basis for evaluating subsequent intervention effects. Improvements in performance are likely to be reflected in a trend that departs from the horizontal line of baseline performance.
If behavior does show a trend during baseline, behavior would be increasing or decreasing over time. The trend during baseline may or may not present problems for evaluating intervention effects, depending on the direction of the trend in relation to the desired change in behavior. Performance may be changing in the direction opposite from that which treatment is designed to achieve. For example, our child with disruptive behavior may show an increase in the behavior during baseline observations. The middle panel of Figure 6.2 shows how baseline data might appear; over the period of observations the client’s behavior is becoming worse, that is, more disruptive. Because the intervention will attempt to alter behavior in the opposite direction, that is, improve behavior, this initial trend is not likely to interfere with evaluating intervention effects.

Figure 6.2. Hypothetical data for disruptive behavior of a hyperactive child. The upper panel shows a stable rate of performance with no systematic trend over time. The middle panel shows a systematic trend with behavior becoming worse over time. The lower panel shows a systematic trend with behavior becoming better over time. This latter pattern of data (lower panel) is the most likely one to interfere with evaluation of interventions, because the change is in the same direction as the change anticipated with treatment.

In contrast, the baseline trend may be in the same direction that the intervention is likely to produce. Essentially, the baseline phase may show improvements in behavior. For example, the behavior of the child may improve over the course of baseline as disruptive and inappropriate behaviors decrease, as shown in the lower panel of Figure 6.2. Because the intervention attempts to improve performance, it may be
difficult to evaluate the effect of the subsequent intervention. The projected level of performance for baseline is toward improvement. A very strong intervention effect would be needed to show clearly that treatment surpassed this projected level from baseline.
If baseline is showing an improvement, one might raise the question of why an intervention should be provided at all. Yet even when behavior is improving during baseline, it may not be improving quickly enough. For example, a child with autism may show a gradual decrease in headbanging during baseline observations. The reduction may be so gradual that serious self-injury might be inflicted unless the behavior is treated quickly. At a broader level, such as a school or an entire city, rates of vandalism and robbery, respectively, may be declining but still be too high or be declining too slowly to allow their courses to unfold. Hence, even though behavior is changing in the desired direction, additional changes may be needed.
Occasionally, a trend may exist in the data and still not interfere with evaluating treatments. Also, when trends do exist, several design options and data-evaluation procedures can help clarify the effects of the intervention (see Chapters 12 and 14 and the appendix at the end of the book). For present purposes, it is important to convey that one feature of a stable baseline is little or no trend, and that the absence of trend provides a clear basis for evaluating intervention effects. Presumably, when the intervention is implemented, a trend toward improvement in behavior will be evident. This is readily detected with an initial baseline that does not already show a trend toward improvement.¹

Variability in the Data. In addition to trend, stability of the data refers to the fluctuation or variability in the subject’s performance over time. Excessive variability in the data during baseline or other phases can interfere with drawing conclusions about treatment. As a general rule, the greater the variability in the data, the more difficult it is to draw conclusions about the effects of the intervention.
Excessive variability is a relative notion. Whether the variability is excessive and interferes with drawing conclusions about the intervention depends on many factors, such as the initial level of behavior during the baseline phase and the magnitude of behavior change when the intervention is implemented. In the extreme case, baseline performance may fluctuate daily from extremely high to extremely low levels (e.g., 0 to 100%). Such a pattern of performance is illustrated in Figure 6.3 (upper panel), in which hypothetical baseline data are provided. With such extreme fluctuations in performance, it is difficult to predict any particular level of future performance.
Alternatively, baseline data may show relatively little variability. A typical example is represented in the hypothetical data in the lower panel of Figure 6.3. Performance fluctuates, but the extent of the fluctuation is small compared with the upper panel. With relatively slight fluctuations, the projected pattern of future performance is relatively clear, and hence intervention effects will be less difficult to evaluate. Sometimes

¹ This section presents simple trends in the data (e.g., no slope, accelerating slope, decelerating slope) and is a useful point of departure. A more subtle point is that trends in the data can be more complex and not readily visible by just looking at a graph. The appendix at the end of the book conveys the complexities such trends can raise for data evaluation.

Figure 6.3. Baseline data showing relatively large variability (upper panel) and relatively small variability (lower panel). Intervention effects are more readily evaluated with little variability in the data.

there is no variability in performance during baseline because the behavior never occurs (or, less likely, it occurs every time). In many programs, the behavior one wishes to develop (e.g., exercising at home or at a gym, taking one’s medication, practicing a musical instrument, initiating conversations with others in an assisted-living home) does not occur at all before the intervention. The baseline observations might show zero occurrences each day and of course no variability.
Ideally, baseline data will show little variability. Actually, this is not merely the ideal but usually the case. Variability in the data can be due to variability in the performance of the individual but can also be due to error or variation in the observations (low reliability of the measure). This is one of the reasons why it is important to ensure the reliability of the observations, as detailed in the previous chapter. Occasionally relatively large variability may exist in the data: Several options are available to minimize the impact of such variability on drawing conclusions about intervention effects (see Chapter 14). However, the evaluation of intervention effects is greatly facilitated by relatively consistent performance during baseline.
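
Variability, like trend, can be summarized numerically before deciding whether the baseline is a workable basis for prediction. The following sketch is hypothetical (the data are invented, and the 20% band around the mean is an arbitrary illustrative choice, not a criterion this book sets out):

from statistics import mean, stdev

def stability_summary(values, band=0.20):
    # Share of observations falling within a fixed band around the mean.
    m = mean(values)
    lo, hi = m * (1 - band), m * (1 + band)
    inside = sum(lo <= v <= hi for v in values)
    return m, stdev(values), inside / len(values)

steady = [42, 40, 43, 41, 44, 42, 40, 43, 41, 42]   # lower-panel pattern
erratic = [5, 95, 10, 90, 0, 100, 15, 85, 5, 95]    # upper-panel pattern

for label, data in (("steady", steady), ("erratic", erratic)):
    m, s, frac = stability_summary(data)
    print(label, round(m, 1), round(s, 1), f"{frac:.0%} within band")

For the steady series every point lies within the band; for the erratic series none does, which is another way of saying that no particular level of future performance can be projected from it.
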

ABAB DESIGNS: BASIC CHARACTERISTICS
In applied settings, there are many design options. Assessing performance continuously over time and obtaining stable rates of performance are pivotal to all of the designs.
Precisely how these features are essential for demonstrating intervention effects can be conveyed by discussing ABAB designs, which are the most basic experimental designs in single-case research. ABAB designs consist of a family of experimental arrangements in which observations of performance are made over time for a given client (or group of clients). Over the course of the investigation, changes are made in the experimental conditions to which the client is exposed.

Description and Underlying Rationale


The ABAB design examines the effects of an intervention by alternating the baseline condition (A phase), when no intervention is in effect, with the intervention condition (B phase). The A and B phases are repeated again to complete the four phases. The effects of the intervention are clear if performance improves during the first intervention phase, reverts to or approaches original baseline levels of performance when treatment is withdrawn, and improves when treatment is reinstated in the second intervention phase.²
The simple description of the ABAB design does not convey the underlying rationale that accounts for its experimental utility. The rationale is crucial to convey because it underlies all variations of the ABAB designs and indeed all single-case designs. The initial phase begins with baseline observations, when behavior is observed under conditions before treatment is implemented. This phase is continued until the rate of the response appears to be stable or until it is evident that the response does not improve over time. As mentioned previously, baseline observations serve two purposes, namely, to describe the current level of behavior and to predict what behavior would be like in the future if no intervention were implemented. The description of behavior before treatment is obviously necessary to give the investigator an idea of the nature of the problem. From the standpoint of the design, the crucial feature of baseline is the prediction of behavior in the future. A stable rate of behavior is needed to project what behavior would probably be like in the immediate future. Figure 6.4 shows hypothetical data for an ABAB design. During baseline, the level of behavior is assessed (solid line), and this line is projected to predict the level of behavior into the future (dashed line). When a projection can be made with some degree of confidence, the intervention (B) phase is implemented.

² Another way to say this is: We begin with observation (first A phase) in which we collect data and do not at this point try to help the client; then we do something to make the client better in some way (improve behavior in the first B phase); take away the effective intervention to make the client worse (second A phase); and then put the intervention back in place to make the client better (second B phase). Returning to baseline conditions is unacceptable if this means making the client worse. One can and ought to have the client’s interest in mind in single-case research that is conducted in any applied setting. (In basic laboratory research, the goals are different; the client participates to help with some scientific question, and any direct benefit usually is not an intended or expected outcome. Reversal of behavior in such contexts is not likely to be objectionable.) At this point in the discussion, it is important only to focus on the logic of the design. That logic will be needed to elaborate other designs and to grasp why single-case experiments are as rigorous as any other methodology.

Figure 6.4. Hypothetical data for an ABAB design. The solid lines in each phase reflect the actual data. The dashed lines indicate the projection or predicted level of performance from the previous phase.

The intervention phase has similar purposes to the baseline phase, namely, to describe current performance and to predict performance in the future if conditions were unchanged. However, there is an added purpose of the intervention phase. In the baseline phase a prediction was made about future performance. In the intervention phase, the investigator can test whether performance during the intervention phase (phase B, solid line) actually departs from the projected level of baseline (phase B, dashed line). In effect, baseline observations were used to make a prediction about performance. During the first intervention phase, data can test the prediction. Do the data during the intervention phase depart from the projected level of baseline? If the answer is yes, this shows that there is a change in performance. In Figure 6.4, it is clear that performance changed during the first intervention phase. At this point in the design, it is not entirely clear that the intervention was responsible for change. Other factors, such as history and maturation, might be proposed to account for change and cannot be convincingly ruled out. I mentioned that a critical goal of research is making threats to validity implausible or as implausible as possible. Generally, just the first two (AB) phases would not do this very well. We need at least the second A phase (to have ABA) to carry out the three functions I have noted: describe, predict, and test the prediction.
In the third phase (the second A of ABA), the intervention is usually withdrawn and the conditions of baseline are restored. This second A phase has three purposes, as I just mentioned. The two purposes common to the other phases are included, namely, to describe current performance and to predict what performance would be like in the future if this phase were continued. A third purpose is similar to that of the intervention phase, namely, to test the prediction from a prior phase. Let us break this down a bit. One purpose of the intervention phase was to make a prediction of what performance
would be like in the future if the conditions remained unchanged (see dashed line, second A phase). The second A phase tests to see whether this level of performance in fact occurred. By comparing the solid and dashed lines in the second A phase, it is clear that the predicted and obtained levels of performance differ. Thus, the change that occurs suggests that something altered performance from its projected course.
There is one final and unique purpose of the second A phase that is rarely discussed. The first A phase made a prediction of what performance would be like in the future (the dashed line in the first B phase). This was the first prediction in the design, and, like any prediction, it may be incorrect. The second A phase restores the conditions of baseline and can test the first prediction. If behavior had continued without an intervention, would it have continued at the same level as the original baseline or would it have changed markedly? The second A phase examines whether performance would have been at or near the level predicted originally. A comparison of the solid line of the second A phase with the dashed line of the first B phase, in Figure 6.4, shows that the lines really are no different. Thus, performance predicted by the original baseline phase was generally accurate. Performance would have remained at this level without the intervention.
In the final phase of the ABAB design, the intervention is reinstated again. This phase serves the same purposes as the previous phase, namely, to describe performance, to test whether performance departs from the projected level of the previous phase, and to test whether performance is the same as predicted from the previous intervention phase. (If additional phases were added to the design, the purpose of the second B phase would of course be to predict future performance.)
In short, the logic of the ABAB design and its variations consists of making and testing predictions about performance under different conditions. Essentially, data in the separate phases provide information about present performance, predict the probable level of future performance, and test the extent to which predictions of performance from previous phases were accurate. By repeatedly altering experimental conditions in the design, there are several opportunities to compare phases and to test whether performance is altered by the intervention. If behavior changes when the intervention is introduced, reverts to or near baseline levels after the intervention is withdrawn, and again improves when treatment is reinstated, then the pattern of results suggests rather strongly that the intervention was responsible for change. Various threats to internal validity, outlined earlier, might have accounted for change in one of the phases. For example, coincidental changes in the behavior of others (e.g., parents, teachers, spouses, an annoying peer, bosses at work), external events (e.g., in the news, a traffic ticket), or changes in the internal states of the individual (e.g., an allergy reaction, a change in medication, onset of a worsening cold) might be responsible for changing behavior. However, these events or potential influences or any particular threat or set of threats to validity are not very plausible in explaining the pattern of data across phases. The most plausible explanation is that the intervention and its withdrawal accounted for changes.
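
The describe-predict-test sequence can also be expressed as a small computation (a hypothetical sketch with invented phase data; the flat-mean projection ignores trend, and the departure threshold is an arbitrary illustrative choice):

phases = {
    "A1": [44, 42, 45, 43, 44],   # baseline
    "B1": [30, 24, 18, 15, 14],   # intervention
    "A2": [38, 41, 43, 42, 44],   # return to baseline
    "B2": [22, 16, 13, 12, 11],   # intervention reinstated
}

def mean(values):
    return sum(values) / len(values)

names = list(phases)
for prev, nxt in zip(names, names[1:]):
    projected = mean(phases[prev])            # level predicted if conditions continued
    observed = mean(phases[nxt])              # level actually obtained
    departs = abs(observed - projected) > 10  # illustrative threshold
    print(prev, "->", nxt, round(projected, 1), round(observed, 1),
          "departs" if departs else "no change")

Each phase both tests the projection from the phase before it and supplies the projection for the phase after it; the repeated departures across the B1, A2, and B2 transitions are what make the intervention, rather than history or maturation, the plausible account of change.
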

Illustrations
The ABAB design and its underlying rationale are nicely illustrated in an investigation that focused on TV watching of an 11-year-old Latina girl who was obese (height = 5'3"; weight 172 lbs [1.6 meters, 78.2 kg]) (Jason & Brackshaw, 1999). The 50th percentile
(median) height and weight for a girl this age are approximately 4'6" and 80 lbs (1.4 meters, 36.3 kg). The family had tried several different treatments that had not worked. The girl watched 6 hours of TV during the week and 10 on the weekend and often ate while watching TV. Needless to say, many factors contribute to obesity. Yet eating while watching TV among children is recognized as a key contributor. A program was devised in conjunction with the family in which exercise was used to earn TV time. TV watching was observed daily. The girl pedaled on a stationary bicycle connected to the TV. The TV could be programmed so that a predetermined amount of pedaling would be accumulated to provide a fixed amount of time on the TV. Riding and watching TV were not possible at the same time, but the girl could earn TV time.
Figure 6.5 shows that an ABAB design was used to evaluate the effects of the program, beginning with baseline. During the intervention phase, TV viewing was contingent on exercising. Baseline (A phase) was reinstated, followed again by the intervention (B) phase. Approximately 2½ months later, a follow-up assessment was made without the bicycle attached to the TV, as shown in the figure. TV viewing had dropped from a mean of 4.4 hours per day during baseline to less than 1 hour per day at the end of the program. Also, the girl had lost 20 lbs (9.1 kg). A follow-up assessment 1 year later, not in the graph, indicated that TV viewing remained at the same level as it was at the end of treatment (about 1 hour) and that the weight loss had been maintained. From the standpoint of the design, the graph conveys rather clearly that TV watching changed markedly in response to changes in the program.

Figure 6.5. TV watching over different phases (ABAB, follow-up) of the study. (Source: Jason & Brackshaw, 1999.)

In this next example, the focus was on vocal stereotypy among children diagnosed with autism spectrum disorder who were referred because their vocalizations interfered with their participation in other special educational activities (Ahearn, Clark, MacDonald, & Chung, 2007). Vocal stereotypy refers to vocalizations such as singing, babbling, repetitive grunts, squeals, and other phrases (e.g., “ee, ee, ee, ee”) that are not related to contextual cues in the situation and appear to serve no communication function as part of interaction with others. Individual sessions were conducted, 5 minutes in duration, in which a child sat in the room with the teacher. Both stereotypic and appropriate vocalizations (e.g., “I want a tickle,” “Could I have a chip?”) that were functional and communicated content were recorded. Baseline (no intervention) was followed with an intervention phase that included response interruption and redirection. This consisted of immediately interrupting any vocal stereotypic statement and redirecting the child to other vocalizations. The teacher would state the child’s name and then ask a question that required an appropriate response (e.g., “What is your name?” “What color is your shirt?”). Any spontaneous appropriate verbal statement was praised (e.g., “Super job talking!”). Observations were obtained using interval assessment to score the presence or absence of the stereotypic and appropriate vocalizations. The response interruption and redirection procedure was evaluated in an ABAB design.

Figure 6.6. The percentage of each session with stereotypic behavior (top) and appropriate speech (bottom). (Source: Ahearn et al., 2007.)

Figure 6.6 provides data for one of the children, a 3-year-old boy named Mitch. The top graph shows vocal stereotypic sounds; the bottom graph shows appropriate vocalizations. As evident in both graphs, whenever the response interruption and redirection intervention was implemented there was a dramatic reduction in stereotypic statements and an increase in appropriate vocalizations.
The design reveals that the intervention was responsible for changes in verbal behavior. But the immediate response might be, “Why is this demonstration important if the gains in child behavior are lost immediately when there is a return-to-baseline condition?” Within this study and prior to the intervention, the investigators evaluated different ways in which child behavior might be altered and then tested the response interruption and redirection alternative that emerged from that in order to see if it in fact would control behavior. The demonstration confirmed that the intervention did have impact on behavior. Once this was identified, teachers of the children were trained to implement the intervention. Teachers reviewed videotapes of the session and were given instructions on how to implement the procedures. Assessments were conducted in the classroom periodically (probes) to see if the procedure was carried out there. The results, not graphed, indicated large reductions when the children were in the natural environment with their teachers. That is, the treatments were immediately extended to situations where no further reversals were needed.
The illustrations convey how the ABAB designs achieve the goals of ruling out or making threats to validity implausible. The changes when the phase was shifted from no intervention (A) to intervention (B) and back and forth again make the intervention the most plausible explanation for what led to the change. If one invokes the logic of single-case designs (describe, predict, and test a prior prediction), then the interventions are very likely to be the reason for the change. There is no certainty in science from any single empirical test, but the preceding illustrations are strong demonstrations of intervention effects.

DESIGN VARIATIONS
ABAB designs vary as a function of several factors, including the procedures that are implemented to “reverse” behavior in the second A phase, the order of the phases, the number of phases, and the number of different interventions included in the design. Although the underlying rationale for all of the variations is the same, it is important to illustrate major design options.

“Reversal” Phase
A characteristic of the ABAB design is that the intervention is terminated or withdrawn during the second A or reversal phase in order to determine whether behavior change can be attributed to the intervention.³ Withdrawing the intervention (e.g., a reinforcement procedure, a drug) and thereby returning to baseline conditions is the most frequently used variation to achieve this reversal of performance.

³ The term “reversal” is used to note that the pattern of behavior reverses or goes in the opposite direction from what was achieved during the intervention phase. The language is a little loose in noting that behavior “reverses.”

Returning to baseline conditions during the second A phase is only one way to show a relation between performance and treatment. Another alternative is to continue the intervention in some way but in a way that will make it ineffective or less effective. For example, praise is a very effective intervention when administered immediately after behavior. The behavior that is praised is likely to increase over time. If praise for a specific behavior were given during the intervention (B phase), one could change how the praise is delivered during the second A or reversal phase. Praise could be given randomly based on whatever the child is doing or based on a certain amount of time that elapsed. The intervention (B) depends on praise following a specific behavior. Merely praising randomly or for just any behavior during the second A phase is likely to lead to a return of the behavior to baseline levels—the equivalent of removing the program. This strategy is selected to show that it is not the event (e.g., praise) per se that leads to behavior change but rather the relation between the event and the behavior.
A third variation of the reversal phase is to continue consequences but alter the behaviors that are associated with the consequences. For example, if the intervention consists of reinforcing (providing rewarding consequences for) a particular behavior, the reversal phase can consist of reinforcing all behaviors except the one that was reinforced during the intervention phase. The effect of this is to apply the intervention to foster a return to baseline behaviors.
In many uses of single-case designs, consequences such as praise and tokens are not used. Consequently, applying these differently or less effectively in a return-to-baseline phase compared to how they were applied in an intervention phase may not be applicable. In these cases, suspension or withdrawal of the intervention would serve as the return-to-baseline or reversal phase. For example, stimulant medication is an effective (evidence-based) intervention for Attention-Deficit/Hyperactivity Disorder (hyperactivity) in children. If this were evaluated for a given child in an ABAB design, it would be reasonable to apply the medication in each B phase and no medication in the A phases. There might be a placebo in the second A phase if that were a plausible concern.

Order of the Phases


The ABAB version suggests that observing behavior under baseline conditions (A phase) is the first step in the design. Once the logic of the design is grasped, that is, how each phase describes, predicts, and tests an earlier prediction, the rigid ordering of the phases can be seen as somewhat arbitrary. In many circumstances, the design may begin with the intervention (or B) phase. The intervention may need to be implemented immediately because of the severity of the behavior (e.g., self-destructive behavior, endangering one’s peers). In cases where clinical or educational considerations dictate immediate interventions, it may be unreasonable to insist on collecting baseline data. (Of course, return-to-baseline phases might not be possible either, a problem discussed later.)
Second, in many cases, baseline levels of performance are obvious because the behavior may never have occurred. For example, when behavior has never been performed (e.g., exercise for many of us, practicing a musical instrument, reading, eating healthful foods), the intervention may begin without baseline. When a behavior is fairly well known to be performed at a zero rate over an extended period, beginning
with a baseline phase may serve no useful purpose. (One could always check with a few days of baseline if there were any doubt. Also, one might conduct assessment for a day or two to address or resolve any logistical or practical issues raised by the assessment procedures.) The design would still require a reversal of treatment conditions at some point.
In each of the previous cases, the design may begin with the intervention phase and continue as a BABA design. The logic of the design and the methodological functions of the alternating phases are unchanged. Drawing inferences about the impact of treatment depends on the pattern of results discussed earlier. For example, in one investigation a BABA design was used to evaluate the effects of token reinforcement delivered to two young men with mental retardation who engaged in little social interaction (Kazdin & Polster, 1973). The program, conducted in a sheltered workshop, consisted of providing tokens to each man when he conversed with another person, among the other 40 to 50 peers who were working there. Conversing was defined as a verbal exchange in which the client and peer made informative comments to each other (e.g., about news, television, sports) rather than just general greetings and replies (e.g., “Hi, how are you?” “Fine.”). Because social behaviors were considered by staff to be consistently low during the periods before the program, staff wished to begin an intervention immediately. Hence, the reinforcement program was begun in the first phase and evaluated in a BABA design, as illustrated for one of the clients in Figure 6.7.

Figure 6.7. Mean frequency of interactions per day as a function of a social and token reinforcement program evaluated in a BABA design. The initial intervention phase (B) was followed by no-intervention or baseline (A), followed by B again. In the second B phase, the praise and tokens were given out less frequently in order to fade the program. When the program was completely eliminated (final phase A), behaviors were maintained. (Source: Kazdin & Polster, 1973.)

Social interaction steadily increased in the first phase (reinforcement) and ceased almost completely when the program was withdrawn (reversal). When reinforcement was reinstated, social interaction was again high. The pattern of the first three phases suggested that the intervention was responsible for change. Hence, in the second reinforcement phase, the consequences were given less frequently (intermittently) to gradually remove the program. The goal of achieving change had been met, so the priority switched to trying to maintain behavior when the program was ultimately discontinued. Behavior tended to be maintained in the final reversal phase even though the program was withdrawn. The BAB part of the design clearly reflects the logic of the design (describe, predict, test a prior prediction).

Number of Phases
A basic dimension that distinguishes variations of the ABAB design is the number of phases. The ABAB design with four phases elaborated earlier has been a very commonly used version and also is very useful to convey the logic of the design and its ability to make threats to experimental validity very implausible. Several other options are available. As an absolute minimum, the design must include at least three phases, such as the ABA (baseline, intervention, baseline) or BAB (intervention, baseline, intervention). Four phases are clearly better. The basis for saying so is the importance of replicating the effects of the intervention within the study (Horner et al., 2005). The “describe, predict, and test” logic of the design begins in full bloom in the second phase of the ABAB version of the design. During the first intervention (B), there is a test to see if performance departs from what was predicted by baseline (A). Extending this, each of the last three phases of the ABAB design tests predictions made from extrapolations of the prior phase. In this way the effects of the intervention are replicated. That is, is there consistency in the impact of the intervention in each test? ABA without the final B phase does not do this nearly as well. In a three-phase version, we really do not see the replicated effect of the intervention in a final B phase.
More generally, a demonstration with two phases (AB) is not a true experiment and does not rule out the threats to validity. At the other extreme, several phases may be included, as in an ABABAB design in which the intervention effect is repeatedly demonstrated. Confidence that the intervention exerted impact and was responsible for the effect increases as the number of AB phases increases in the design and the describe, predict, and test sequence is replicated in a consistent fashion.

Number of Different Interventions


ABAB designs can vary in the number of different interventions they include. As usually discussed, the design consists of one intervention that is implemented in the two B phases (ABAB) in the investigation. Occasionally, investigators may include separate interventions (B and C phases) in the same design. Separate interventions may be needed in situations where the first one does not alter behavior or does not achieve a sufficient change for the desired result. In applied settings such as schools or clinics, the ability to see effects of the intervention and then to change or add a new intervention to improve the outcome is one of the strengths of single-case designs. Another reason for using two interventions in the same design is to evaluate their relative effectiveness. The interventions (B, C) may be administered at different points in the design, as represented by ABCBCA or ABCABC designs.

An illustration of a design that evaluated two intervention plans was a study designed to reduce the problem behaviors of school students (Ingram, Lewis-Palmer, & Sugai, 2005). Two sixth-grade boys participated and were observed in their classrooms while science or math lessons were being taught. The boys were identified because of their problem behaviors (e.g., not engaging in the task, playing with unrelated objects, staring away from the teacher). Two interventions were evaluated. The first is called functional behavioral assessment and is a procedure designed to identify precisely what is controlling behavior in terms of antecedents that trigger the behavior, consequences that might maintain behavior, and events that might exacerbate the behaviors. The information is obtained from systematically interviewing the teachers and the students.
Two brief comments place this in context. There is a large empirical literature on functional assessment, which is an empirical way to identify what factors are controlling behavior; once these are identified, they can be used for intervention (Austin & Carr, 2000). In addition, many schools are encouraging or mandating functional behavioral assessments in light of evidence that interventions based on such assessments can be very effective.
In this study, the intervention derived from the functional assessment varied for each child and can only be highlighted here. For Carter, one of the boys, the assessment identified specific tasks likely to be associated with not being engaged, a specific condition (being tired) that appeared to exacerbate this, and consequences such as the ability to escape from the task and peer attention that were maintaining not being engaged. The function-based intervention for Carter included him self-assessing the extent to which he was on task (every 5 minutes), raising his hand and recruiting teacher help for difficult work, being allowed to take 10 minutes if he indicated he was tired (he never did), and earning positive consequences (e.g., 5 minutes of computer time) for his work if he remained on task. If he was not engaged in work, the teacher would redirect him by providing a prompt to restart his work. This individualized intervention has broad research behind it, but is it needed? A second intervention, referred to as the nonfunction-based intervention, was still evidence-based but less intricate. At the beginning of the class for the nonfunction-based intervention, Carter was told he could earn tokens (points) for appropriate behavior (points could be exchanged for snacks, pencils). He monitored (self-assessed) his behavior but was not allowed breaks, and anytime he was not working this was simply ignored rather than redirected. Problem behaviors were assessed once per day in a 10-minute period at a time when off-task behaviors had been identified as a problem.
Figure 6.8 shows the effects of the two interventions on problem behavior in an ABCBC design for one of the students. The results convey that the function-based intervention was consistently associated with sharp reductions in behavioral problems. The nonfunction-based intervention had little or no effect. At the very final phase the function-based intervention was modified slightly (to extend his self-assessment period to 10 rather than 5 minutes). The findings are clear. The intervention that considered more of the factors that prompted and sustained problem behavior had clear impact. From the standpoint of the design, it is clear that the “describe, predict, and test” elements of single-case designs were met and that the effects of the intervention were reproduced at different points in the design.

Figure 6.8. Function-based and nonfunction-based behavior interventions designed to reduce problem behavior in a sixth-grade boy. (Source: Ingram et al., 2005.)

General Comments


I have mentioned some of the dimensions on which ABAB designs vary. I noted that this is not a design but a family of designs. It is not possible to mention all of the possible variants—an infinite number based on the number of phases, interventions, ordering of phases, and types of reversal. Yet it is not critical to review all the variants even if the number were smaller. The most important point is to convey the logic of the design and to convey what each phase is trying to accomplish (e.g., describe, predict, and test). The specific design variation that the investigator selects is partially determined by the purposes of the project, the results evident during the course of treatment (e.g., little or no behavior change with the first intervention), and the exigencies or constraints of the situation (e.g., limited time in which to complete the investigation). It is important to keep the overall central purpose in mind, namely, to make inferences that the intervention is the most if not the only plausible explanation of the effect.

PROBLEMS AND LIMITATIONS


The defining characteristic of the ABAB designs and their variations consists of alternating phases in such a way that performance is expected to improve at some points and to return to or to approach baseline rates at other points. The need to show a “reversal” of behavior is pivotal if causal inferences are to be drawn about the impact of the intervention. Several problems arise with the designs as a result of this requirement.

Absence of a “Reversal” of Behavior


It is quite possible that behavior will not revert toward baseline levels once the intervention is withdrawn or altered. In such cases, it is not clear that the intervention was responsible for change. Threats to validity such as history and maturation now enter as possible explanations of why the change occurred. Extraneous factors associated with
the intervention may have led to change. These factors (e.g., changes in the home or school situation, illness or improvement from an illness, better sleep at night) may have coincidentally occurred when the intervention was implemented and remained in effect after the intervention was withdrawn.
There are sound reasons why an intervention may be responsible for change in a situation in which the behavior does not revert to or near baseline levels as needed by the design. First, the intervention may have led to change initially, but behavior may have come under the control of other influences. For example, social interaction skills in a withdrawn child or reading in children might well be increased with an intervention. Yet in a reversal phase there may not be a return of behavior to baseline levels. Social interaction brings about interactions with others, and these may “trap” or lock in the behavior; the same holds for reading, which can introduce a child to a world of adventure, suspense, travel, and more. Behaviors sometimes have their own consequences that sustain them. We want behaviors and other domains we train to be maintained, and there are ways to program that (Kazdin, 2001), but sometimes it happens without special efforts. From the standpoint of the design, absence of a reversal makes the demonstration ambiguous. Second and related, some interventions focus on skills (e.g., reading, math, athletic, music, or dance skills). As skills develop with an intervention, they are not erased by a return to baseline. Performance (e.g., number of problems completed) might return to baseline if that is the measure, but if skill is the measure, it would not be erased by stopping a program.
Third, behaviors may not revert to baseline levels of performance because the people who administered the program continue the intervention in some way. Many intervention programs evaluated in ABAB designs consist of altering the behavior of persons (parents, teachers, and staff) who will influence the client’s target behavior. After behavior change in the client has been achieved, it may be difficult to convince behavior-change agents to alter their performance to approximate their behavior during the original baseline. It may not even be a matter of convincing behavior-change agents; their behavior may be permanently altered in some fashion, even locked in by the consequences it had.
Finally, behavior may not revert to baseline levels if the change has been dramatic and permanent. It is readily conceivable that the intervention reduced or eliminated hitting, tantrums, or shouting. Yet returning to baseline conditions in an ABAB design does not bring the behavior back. From the standpoint of the client or student whose behavior has been changed, this is wonderful news—the behavior change has been nicely established and merely changing conditions does not bring the original problematic behavior back.
In each of the preceding instances, the intervention may be withdrawn and the behavior or domain of functioning does not return to baseline or near-baseline levels. The intervention may have been responsible for change, but we cannot tell. There are options that one can invoke and draw on elements of other designs, but strictly speaking the design requires the data pattern we have discussed.

Undesirability of “Reversing” Behavior


Certainly a major issue in evaluating ABAB designs is whether reversal phases should be used at all. If behavior could be returned to baseline levels as part of the design, is
such a change ethical? Attempting to return behavior to baseline levels is tantamount to making the client worse. In many cases, it is obvious that a withdrawal of treatment is clearly not in the interest of the client; a reversal of behavior would be difficult if not impossible to defend ethically. For example, children with autism or developmental disabilities sometimes injure themselves severely by hitting their heads for extended periods of time. If a program decreased this behavior, it would be ethically unacceptable to show that headbanging would return in a phase in which treatment were withdrawn. Extensive physical damage to the child might result. Even in situations where the behavior is not dangerous, it may be difficult to justify suspension of the program on ethical grounds. A phase in which treatment is withdrawn is essentially designed to make the person’s behavior worse in some way. Whether behavior should be made worse and when such a goal would be justified are difficult issues to resolve.
Returning to baseline conditions also has implications for those responsible for the client. During the intervention, behavior-change agents (teachers, parents) may see and benefit from changes in their student or child. Suspending their improved intervention efforts threatens to erase gains they have made in their actions to improve the client. Thus, reintroducing the conditions of baseline raises the same concerns and questions for them as it does for the client.
Ethical considerations may be enough, but there are additional considerations. The client, those responsible for the client, and investigators (like this author) may not find the design acceptable in many circumstances. Single-case designs are often used in applied settings where the goal of helping people and having an important impact is central. Design issues need to be weighed against considerations that make designs acceptable to the various persons involved.
There is another side in favor of a return-to-baseline phase that is relevant to the client. There is value in knowing what is responsible for client change and knowing that it was the intervention rather than other influences (e.g., placebo effects, increased attention to the client, conducting observations that made the client temporarily behave differently). The client may require further intervention in the future; certainly there will be other clients who are likely to profit as well. Knowing that the intervention specifically led to change has great applied importance rather than serving as a mere academic exercise or effort to appease some god of research design who likes sacrifices. An example was provided previously (Ahearn et al., 2007) in which the ABAB design was used to identify an effective intervention for changing verbal behavior (stereotypy) of children with autism spectrum disorder. Once identified, the teachers were trained to use the intervention, and further checks were done in the classroom to ensure that the intervention continued to be effective.
In the general case, the decision about w hether to use an A B A B design with its
return-to-baseline phase has multiple considerations. I f in doubt about the desirability
o f such a phase, there are other designs that do not require a reversal phase. T he goal
to establish that the intervention was responsible for change can be achieved in m any
ways, as discussed in the next chapters. In fact, there are m any options.

EVALUATION OF THE DESIGN
The ABAB design and its variations can provide convincing evidence that an intervention was responsible for change. Indeed, when the data pattern shows that performance
changes consistently as the phases are altered, the evidence is dramatic. Nevertheless, there are limitations peculiar to ABAB designs, particularly when they are considered for use in applied and clinical settings.

In ABAB designs, the methodological and applied priorities of the investigator may compete. The investigator has an explicit hope that behavior will revert toward baseline levels when the intervention is withdrawn. Such a reversal is required to demonstrate an effect of the intervention. The educator, clinician, counselor, or parent, on the other hand, hopes that the behavior will be maintained after treatment is withdrawn. Indeed, the intended purpose of most interventions or treatments is to attain a permanent change even after the intervention is withdrawn. The interests in achieving a reversal and not achieving a reversal are obviously contradictory.

Of course, showing a reversal in behavior is not always a problem in applied settings. Reversal phases often are very brief, lasting only one or two sessions or days (e.g., Brooks, Todd, Tofflemoyer, & Horner, 2003; Wehby & Hollahan, 2000). Occasionally, especially when the intervention only has been implemented for a brief period, a brief reversal phase shows an immediate and dramatic return of behavior to baseline levels. The requirements of the design, namely, to describe, predict, and test a prior prediction, are readily met without keeping the non-intervention phase in place. However, short reversal phases are usually possible only when behavior shows rapid reversals, that is, becomes worse relatively quickly after the intervention is withdrawn. To have behaviors become worse even for short periods is usually undesirable. The goal of the treatment is to achieve changes that are maintained rather than quickly lost as soon as the intervention is withdrawn.

It is possible to include a reversal phase in the design to show that the intervention was responsible for change and still attempt to maintain behavior. After experimental control has been demonstrated in a return-to-baseline phase, procedures can be included to maintain performance after all treatment has been withdrawn. Thus, the design is ABABC, where C adds special procedures that can maintain behavior. The ABAB part already established that the B phase was responsible for change. Now one adds procedures (e.g., gradually withdrawing the intervention) that have been established to maintain behavior. ABAB designs and their variations are not necessarily incompatible with achieving maintenance of behavior.
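To make the phase logic concrete, the sketch below tabulates a hypothetical ABABC session record and reports the mean of the measure in each phase. It is only an illustration: the phase labels, session counts, and frequencies are invented rather than taken from any study discussed here, and Python is used simply as convenient notation.

```python
# Minimal sketch: summarizing a hypothetical ABABC sequence by phase.
# All labels and frequencies are invented for illustration.
from statistics import mean

# One entry per session: (phase, frequency of the target behavior)
sessions = [
    ("A1", 9), ("A1", 8), ("A1", 9),   # baseline
    ("B1", 4), ("B1", 3), ("B1", 2),   # intervention
    ("A2", 7), ("A2", 8),              # return to baseline
    ("B2", 2), ("B2", 1), ("B2", 2),   # intervention reinstated
    ("C", 2), ("C", 1), ("C", 1),      # maintenance procedures (e.g., fading)
]

for phase in ["A1", "B1", "A2", "B2", "C"]:
    values = [freq for p, freq in sessions if p == phase]
    print(f"{phase}: mean = {mean(values):.1f} over {len(values)} sessions")
```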
Finally, in some circumstances the intervention can become permanent so that there is no need for a reversal in the usual way. For example, elimination of thumb sucking was the goal of a program for a 9-year-old boy (Watson, Meeks, Dufrene, & Lindsay, 2002). His thumb sucking was exacerbating an already existing dental problem. Several interventions (positive reinforcement for not thumb sucking, placing pepper sauce on the boy's thumb) had not worked. The boy sucked his thumb while holding a pillow (a transitional object like a stuffed animal). Occasionally, a behavior (e.g., thumb sucking) can be altered by changing another response (e.g., holding the pillow) with which it is highly associated. The intervention consisted of removal of the pillow. Figure 6.9 conveys an ABAB design. During baseline and return to baseline, the boy had the pillow available to hold and thumb sucking was relatively frequent. The data pattern clearly shows the impact of the intervention. In the final treatment phase, the pillow was no longer available and the behavior was eliminated. The effect was maintained 8 weeks later. The point of this example is to convey that maintaining
Figure 6.9. The frequency of thumb sucking in which removal of the transitional object (pillow) served as the intervention, as evaluated in an ABAB design. The final phase, referred to as follow-up, checked on the frequency of thumb sucking 8 weeks after the end of the intervention phase. (Source: T. S. Watson et al., 2002.)

behavior is not incompatible with an ABAB design. In this instance, the intervention became permanent. Sometimes experimental control can be demonstrated, followed by different ways of maintaining behavior, a topic discussed further in Chapter 10.

Despite the ways in which concerns about the reversal phase can be combated, this does not change the demands of the design. The usual requirement of returning behavior to baseline levels or implementing a less effective intervention when a more effective one seems to be available raises potential problems for applied settings. Hence, in many situations, the investigator may wish to select one of the many other designs that do not require undoing the apparent benefits of treatment even if only for a short period.

SUMMARY AND CONCLUSIONS
The chapter began with a discussion of general requirements of single-case designs. We have discussed the requirements and logic of all single-case designs, not just the ABAB designs. These requirements of single-case designs included continuous assessment, evaluation of the client's behavior during baseline (no-intervention) and intervention phases, and achieving stable measures of performance. These assessment and data requirements are pivotal to the logic of single-case designs, that is, how the designs can be used to draw causal conclusions about intervention effects.

The underlying logic of single-case designs and that of more traditional between-group designs is the same, namely, to make and test predictions about performance. In single-case designs, data from the continuous assessment in different phases are used to describe current performance and predict what it would be like in the future without an intervention. Then the data during the intervention phase test that prediction. The most critical part of the chapter may be conveying that logic rather than focusing on any particular design. The logic conveys what the researcher is trying to accomplish by
altering phases, and that logic serves as a useful guide in using a given design and making decisions about changing phases.
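As a rough computational analogue of this describe-predict-test logic, one can project the baseline forward and ask whether the intervention data depart from the projection. The sketch below uses hypothetical numbers, and the least-squares trend line is my choice for the example rather than a prescribed procedure, since these designs rely chiefly on visual inspection of the graphed data.

```python
# Sketch of describe-predict-test: describe the baseline, project it
# forward, and test whether intervention data depart from the projection.
# The data and the trend-line choice are illustrative assumptions.
import numpy as np

baseline = np.array([10, 9, 10, 11, 10, 9])   # observed baseline sessions
intervention = np.array([6, 5, 4, 4, 3])      # observed intervention sessions

# Describe and predict: fit a trend to the baseline, project it forward.
t = np.arange(len(baseline))
slope, intercept = np.polyfit(t, baseline, 1)
t_next = np.arange(len(baseline), len(baseline) + len(intervention))
projected = intercept + slope * t_next

# Test: compare what was observed during intervention with the projection.
print("projected baseline:", np.round(projected, 1))
print("observed - projected:", np.round(intervention - projected, 1))
```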
With ABAB designs, the effect of an intervention is demonstrated by alternating intervention and baseline conditions in separate phases over time. The designs may vary in the procedures that are used to cause behavior to return to or approach baseline levels. Withdrawing the intervention or altering the intervention in a way that critical ingredients are omitted is commonly used in this return-to-baseline phase. ABAB designs can vary in many ways, including the order in which the baseline and intervention phases are presented, the number of phases, and the number of different interventions that are presented in the design. Given the different dimensions, an infinite number of ABAB design options is available. However, the underlying rationale and the manner in which intervention effects are demonstrated remain the same.

ABAB designs represent methodologically powerful experimental tools for demonstrating intervention effects. When the pattern of the data reveals shifts in performance as phases are altered, the evidence for intervention effects is very dramatic. For research in applied settings such as schools, the home, and business, the central feature of the designs may raise special problems. Specifically, the designs require that phases be alternated so that performance improves at some points and reverts toward baseline levels at other points. In some cases, a reversal of behavior does not occur, which creates problems in drawing inferences about the intervention. In other cases, it may be undesirable to withdraw or alter treatment, and serious ethical questions may be raised. When the requirements of the design compete with applied priorities, other designs may be more appropriate for demonstrating intervention effects.
CHAPTER 7

Multiple-Baseline Designs

CHAPTER OUTLINE

Basic Characteristics of the Designs
Description and Underlying Rationale
Illustrations
Design Variations
Multiple-baseline Design Across Individuals
Multiple-baseline Design Across Situations, Settings, or Time
Number of Baselines
Partial Applications of Treatment
General Comments
Problems and Limitations
Interdependence of the Baselines
Inconsistent Effects of the Intervention
Prolonged Baselines
Evaluation of the Design
Summary and Conclusions

With multiple-baseline designs, intervention effects are evaluated by a method quite different from that described for ABAB designs. The effects are demonstrated by introducing the intervention to different baselines (e.g., behaviors or persons) at different points in time. If each baseline changes when the intervention is introduced, the effects can be attributed to the intervention rather than to extraneous events. Once the intervention is implemented to alter a particular behavior, it need not be withdrawn. Thus, within the design, there is no need to return behavior to or near baseline levels of performance. Hence, multiple-baseline designs do not share the practical or ethical concerns raised in ABAB designs by temporarily withdrawing the intervention.

BASIC CHARACTERISTICS OF THE DESIGNS

Description and Underlying Rationale

In the multiple-baseline design, inferences are based on examining performance across several different baselines. The manner in which inferences are drawn is illustrated
by discussing the multiple-baseline design across behaviors. This is a commonly used variation in which the different baselines refer to several different behaviors of a particular person or group of persons.
Baseline data are gathered on two or more behaviors. Consider a hypothetical example in which three separate behaviors are observed, as portrayed in Figure 7.1. The data gathered on each of the behaviors serve the purposes common to each single-case design. That is, the baseline data for each behavior describe the current level of performance and predict future performance. After performance is stable for all of the behaviors, the intervention is applied to the first behavior. Data continue to be gathered for each behavior. If the intervention is effective, one would expect changes in the behavior to which the intervention is applied. On the other hand, the behaviors that have yet to receive the intervention should remain at baseline levels. After all, no intervention was implemented to alter these behaviors. When the first behavior changes and the others remain at their baseline levels, this suggests that the intervention was responsible for the change. However, the data are not entirely clear at this point. It might well be that some historical or maturational event (threat to validity) coincidentally led to a change in the first behavior. So, after performance stabilizes across all behaviors, the intervention is applied to the second behavior. At this point both the first and second behaviors are receiving the intervention, and data continue to be gathered for all behaviors. As evident in Figure 7.1, the second behavior in this hypothetical example also improved when the intervention was introduced. Finally, after continuing observation of all behaviors, the intervention is applied to the final behavior, and it too changed when the intervention was introduced.

The multiple-baseline design demonstrates the effect of an intervention by showing that behavior changes when and only when the intervention is applied. The pattern of data in Figure 7.1 argues strongly that the intervention, rather than some extraneous event, was responsible for change. Extraneous factors might have influenced performance. For example, it is possible that some event at home, school, or work coincided with the onset of the intervention and altered behavior. Yet one would not expect this extraneous influence to alter only one of the behaviors and at the exact point that the intervention was applied. A coincidence of this sort is possible, so the intervention is applied at different points in time to two or more behaviors. The pattern of results illustrates that whenever the intervention is applied, behavior changes. The repeated demonstration that behavior changes in response to staggered applications of the intervention usually makes the influence of extraneous factors implausible.
As in the ABAB designs, the multiple-baseline designs are based on testing of predictions. Each time the intervention is introduced, a test is made between the level of performance during the intervention and the projected level of the previous baseline. Essentially, each behavior is a "mini" AB experiment that tests a prediction of the projected baseline performance and whether performance continues at the same level after treatment is applied. Predicting and testing of predictions over time for a single baseline is similar for ABAB and multiple-baseline designs.
A unique feature of multiple-baseline designs is the testing of predictions across different behaviors. Essentially, the different behaviors in the design serve as control conditions to evaluate what changes can be expected without the application of treatment. At any point in which the intervention is applied to one behavior and not to
Figure 7.1. Hypothetical data for a multiple-baseline design across behaviors in which the intervention was introduced to three behaviors at different points in time.

remaining behaviors, a comparison exists between treatment and no-treatment conditions. The behavior that receives treatment should change, that is, show a clear departure from the level of performance predicted by baseline. Yet it is important to examine whether other baselines that have yet to receive treatment show any changes during the same period. The comparison of performance across the behaviors at the same points in time is critical to the multiple-baseline design. The baselines that do not receive treatment show the likely fluctuations of performance if no changes occur in the environment. When only the treated behavior changes, this suggests that normal fluctuations in performance would not account for the change. The repeated demonstration of changes in specific behaviors when the intervention is applied provides a convincing demonstration that the intervention was responsible for change.
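This "when and only when" pattern can also be conveyed with a small simulation. In the sketch below, three hypothetical baselines share a common level until each one's own staggered intervention point, after which the level drops; every number here is invented for illustration, not taken from any study.

```python
# Sketch of the multiple-baseline data pattern: each series changes level
# only at its own (staggered) intervention point. Hypothetical data.
import random

random.seed(1)
onsets = {"Behavior 1": 5, "Behavior 2": 9, "Behavior 3": 13}
n_sessions = 18

for name, onset in onsets.items():
    series = [
        (10 if session < onset else 3) + random.choice([-1, 0, 1])
        for session in range(n_sessions)
    ]
    print(f"{name} (intervention begins at session {onset}): {series}")
```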
An important question is, "How long does one wait after introducing the intervention to the first behavior before introducing the intervention to the second and third behavior, that is, what is the lapse in time?" There is no fixed answer in terms of days or observations. One waits for performance to be stable with little or no trend in the behaviors yet to receive the intervention. The hypothetical data in Figure 7.1 suggest
that 4 days (please count the data points) after the intervention was introduced to Behavior 1, it was introduced to Behavior 2; then 4 days later to Behavior 3. There is no reason stemming from the logic of the design to keep the days or intervals constant before introducing the intervention to the next baseline. The critical feature is that the introduction of the intervention is staggered, not that there is any consistency in the amount of time before introducing the intervention to the next baseline. One waits for baselines to be stable. So if the intervention has been applied to the first baseline, one continues without implementing the intervention for the second and third baselines for a few days. Ideally, one will see the first behavior change and the others (still without the intervention) continue without showing a change. As these baselines are stable (no new trend or high variability), the intervention is extended to the second baseline. Number of days does not determine the decision; clarity of the pattern does.
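There is no numerical rule for this judgment, which is made by inspecting the plotted data. Purely as an illustration of what "stable, with little or no trend" is getting at, the sketch below screens the most recent baseline points with thresholds that I have assumed for the example; they are arbitrary, not criteria from the design literature.

```python
# Rough, assumed heuristic for "stable with little or no trend" in the
# last few baseline points; visual inspection is the actual criterion.
import numpy as np

def looks_stable(points, window=4, max_range=2.0, max_slope=0.25):
    recent = np.asarray(points[-window:], dtype=float)
    spread = recent.max() - recent.min()                   # variability
    slope = np.polyfit(np.arange(len(recent)), recent, 1)[0]  # trend
    return spread <= max_range and abs(slope) <= max_slope

print(looks_stable([10, 9, 10, 10, 9]))   # True: flat, low variability
print(looks_stable([10, 9, 7, 6, 4]))     # False: clear downward trend
```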

Illustrations

Multiple-baseline designs across behaviors or domains of functioning have been used frequently. This first example focused on a business application, specifically a grocery store with 60 employees (Carter, Holmstrom, Simpanen, & Melin, 1988). The project was designed to reduce theft of items in the store, and it focused the intervention on employees. The domains of function or behaviors selected were stealing candy, personal hygiene products, and trinket jewelry. These products were selected on the basis of industry statistics about frequently stolen items but also confirmed by initial evaluation of thefts from the store's records. Each type of item included many different items (i.e., 10 types of candy, 6 hygiene items, 27 trinket jewelry items). Computerized scanning of items delivered to the store (inventory) and sold from the register could be readily monitored and checked by observers who tracked the inventory from this information. Items not accounted for by inventory or sales were considered thefts. The investigators were not interested in who did the stealing but merely in reducing or eliminating it. The intervention was introduced to employees, who were informed about the project and its focus. The project began with observation across the three areas of stealing or domains of responding.
The intervention consisted of listing the items (e.g., all the candy types) and then posting a graph in the employee lunchroom of the number of thefts for the entire group of items (e.g., all candy items). When the intervention was introduced for candy theft, there was no mention of hygiene or jewelry items. After 2 weeks of the intervention, hygiene items were added. The products in this category were also listed and a graph was added to the candy graph to convey the amount of theft. After 1½ weeks the intervention was extended to the final baseline, theft of jewelry.

Figure 7.2 conveys the effect of the intervention across the three baselines. The effects are reasonably clear. When the items were identified and graphed for employees (intervention) there was a reduction in theft for each of the baselines. The daily rates of theft during baseline for candy, hygiene, and jewelry were means of 4.7, 1.6, and 2 items, respectively; these means fell to 1.2, 0.8, and only 1 item for the whole 3-week intervention period. The pattern shows that the intervention was associated with the change and that change did not occur before the intervention was implemented. What we know from the demonstration is that the intervention led to reductions in theft. As the authors noted, it is likely that employee theft decreased. Only employees were
Figure 7.2. Number of thefts (biweekly) of candy items, hygiene items, and jewelry items. The intervention (identifying items and providing graphed information for the items as presented) was introduced at different points in time. (Source: Carter et al., 1988.)

made aware of the intervention. On the other hand, customer theft may have declined if employees were more vigilant in their monitoring of the customers. Reducing overall theft rather than identifying the source was the primary goal of the project.
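Reading all three of the intervention figures just reported as daily means over the intervention period (the report is compressed, so that reading is an assumption on my part), the reductions work out as a quick arithmetic check:

```python
# Quick arithmetic check on the phase means reported above (items per day),
# treating all three intervention figures as daily means.
baseline = {"candy": 4.7, "hygiene": 1.6, "jewelry": 2.0}
intervention = {"candy": 1.2, "hygiene": 0.8, "jewelry": 1.0}

for item in baseline:
    drop = 100 * (baseline[item] - intervention[item]) / baseline[item]
    print(f"{item}: {baseline[item]} -> {intervention[item]} items per day "
          f"({drop:.0f}% reduction)")
```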
In this next example, the focus was on teaching reading to a 6-year-old girl with multiple psychiatric diagnoses including hyperactivity and language disorders (Attention-Deficit/Hyperactivity Disorder, Receptive-Expressive Language Disorder) (McCollough, Weber, Derby, & McLaughlin, 2008). She was reading below grade level and made many errors while reading. Individual sessions were designed to teach reading by focusing on sounds and providing feedback and praise during the intervention phases for responses to cue cards with the words on them. The intervention was based on a method (referred to as Direct Instruction) and a commercially available book (Teach Your Child to Read in 100 Easy Lessons) that provided structured lessons on how to teach reading (Engelmann, Haddox, & Bruner, 1983). Across the lessons the student
traverses several important steps (e.g., learning sounds, blends, and words; reading) to develop reading. The intervention was evaluated in a multiple-baseline design across different sets of words. Each set here can be viewed as different responses, that is, word groups that served as baselines. In each session, words from all sets were tested to assess the number of words from that set that were correctly read. The intervention was introduced to the first, second, and third set in a multiple-baseline design.

Figure 7.3 provides a graphical display of the number of words that were correctly pronounced. As evident in the figure, the effects of training were evident when the intervention (feedback, praise) was introduced for correct responding but not before. Two points are worth underscoring with this example. First, the first baseline in the figure (top) included only one session. Usually, one would want more than one data point for the baseline because of the logic of the design in which data from each phase are used to describe and predict performance. While it might have been useful to have another session or two, it is clear when all baseline phases (three sets of words) are considered that the criteria for the multiple-baseline design were met. The effects are fairly clear. The second point pertains to the goal of any reading program. While training the child to read specific words may be of interest, the goal is to develop reading, a broader skill that transfers to behaviors not included in training. In this case, each session also included words that were not specifically trained for each set. The untrained words showed a similar pattern. Thus, when sounds and other skills were trained within a given word set, reading other words in that set also improved.

The examples convey the utility of multiple-baseline designs in demonstrating the effect of an intervention. They also show the practical utility of the designs. One can intervene on a small scale (e.g., one baseline) to establish that the intervention is working or working sufficiently well. Then, as change is evident, the intervention can be extended more broadly.

DESIGN VARIATIONS

The underlying rationale of the design has been discussed by elaborating the multiple-baseline design across behaviors. The several baselines need not refer to different behaviors of a particular person or group of persons. Variations of the design include observations across different individuals or across different situations, settings, or time periods. In addition, multiple-baseline designs may vary along other dimensions, such as the number of baselines and the manner in which a particular intervention is applied to these baselines.

Multiple-baseline Design Across Individuals

In this variation of the design, baseline data are gathered for a particular behavior performed by two or more persons. The multiple baselines refer to the number of persons whose behaviors are observed. The design begins with observations of baseline performance of the same behavior for each person. After the behavior of each person has reached a stable rate, the intervention is applied to only one of them while baseline conditions are continued for the other(s). The behavior of the person exposed to the intervention would be expected to change; the behaviors of the others would be expected to continue at their baseline levels. When behaviors stabilize for all persons, the intervention is extended to another person. This procedure is
Figure 7.3. The number of target words that were pronounced correctly across baseline and intervention (Teach Your Child) phases. Three sets of words were delineated to serve as the baselines in the multiple-baseline design. The intervention was introduced to each in a staggered fashion and was effective in altering the baseline for which it was introduced. (Source: McCollough et al., 2008.)

continued until all of the persons for whom baseline data were collected receive the intervention. The effect of the intervention is demonstrated when a change in each person's performance is obtained at the point when the intervention is introduced and not before.
In an application you might never anticipate, cocktail servers in a bar were trained to carry their serving trays correctly (Scherrer & Wilder, 2008). They had reported sore muscles and joints, and this appeared to be due in part to the way they carried their trays while serving drinks to customers. In consultation with an occupational therapist, appropriate carrying positions were identified (e.g., tray resting on fingertips, keeping the wrist straight, holding the tray next to the body, keeping shoulders down from the neck, not reaching across the table to unload drinks). Eight such specific behaviors were identified and considered to reflect a safer way of serving, that is, one that might reduce risk of injury. The behaviors were included in a checklist to note correct (safe) and incorrect carrying and were recorded by observers who were in the bar 3 to 4 nights per week over the course of 8 weeks. Baseline observations were made of three cocktail servers (one male, two females), 21 to 24 years of age. Observation of each server was made for 15 minutes, during which three separate checks (or serving opportunities) could be made to assess the percentage of correct behaviors.

The intervention consisted of a single training session outside of the time when the people were serving. The training included these components: explaining the correct positions to carry trays with drinks and serve them, a trainer modeling the correct positions, and having the cocktail server describe and then demonstrate the correct positions to the trainer. The server had to show mastery by demonstrating the techniques correctly four times. The training session took 45 minutes to 1 hour for each cocktail server. The training was done individually (separately) because a multiple-baseline design across individuals was used. Consequently, training was introduced at different points in time as each server was trained.

Figure 7.4 shows the evening observations during the shift at the bar in which the percentage of behaviors performed correctly was graphed. As evident in the figure, the training session was given to Sara first. Her behavior changed markedly, while there was no change in the behavior of Mike or Tanya, who had not yet had training. Training was introduced at different points to Mike and Tanya, and at each point there were marked increases in correct tray carrying. The pattern of results provides a strong demonstration that it was the intervention that led to change. Informal comments after the demonstration suggested that the effects were maintained and that the servers reported less soreness or greater ease in completing the tasks when they used the better tray-carrying procedures.
The multiple-baseline design across individuals is well suited to situations in which a single behavior or set of behaviors is to be changed among different persons and can be introduced to one or a few individuals at a time. The previous example was conducted in a bar! Any setting where there is a group or a few individuals (e.g., classroom, playground, athletic practice) and where the intervention can be introduced sequentially can accommodate the design. As with other variations of the design, no reversal of experimental conditions is required to demonstrate the effects of the intervention.

Multiple-baseline Design Across Situations, Settings, or Time

In this variation of the design, baseline data are gathered for a particular behavior performed by one or more persons. The multiple baselines refer to the different situations, settings, or time periods of the day in which observations are obtained. The design begins with observations of baseline performance in each of the situations.
Figure 7.4. Percentage of behaviors performed safely among three cocktail servers during baseline and after the intervention (a single training session) introduced in a multiple-baseline design across individuals. (Source: Scherrer & Wilder, 2008.)
After the behavior is stable in each situation, the intervention is applied to alter behavior in one of the situations while baseline conditions are continued for the others. Performance in the situation to which the intervention has been applied should show a change; performance in the other situations should not. When behavior stabilizes in all of the situations, the intervention is extended to performance in the other situations. This procedure is continued until performance in all of the situations for which baseline data were collected receives the intervention.
This illustration focused on the safety of health-care workers in the context of performing surgery (Cunningham & Austin, 2007). Health-care workers suffer many injuries as a function of working with hazardous procedures or materials. Some states have enacted laws to help protect workers from "sharps injuries" (e.g., being stuck with a needle) given the special risk of such injuries for employees (e.g., HIV/AIDS). This study focused on the exchange of instruments between the surgeon and scrub nurse. The goal was to increase the use of the "hands-free technique," which requires that a neutral zone be established between the surgeon and nurse. This neutral zone is a place where the instruments are put as the instruments are exchanged. In this way, the two people do not touch the instrument at the same time, and the risk of sharps injuries is greatly reduced. This was a multiple-baseline design across settings: two settings were selected, namely, an operating room of an inpatient surgery unit and an operating room of an outpatient surgery unit of a hospital serving a nine-county region in a Midwestern state in the United States.

Observations were conducted during surgical procedures for 30 minutes, beginning at the time an opening incision was made in the patient. Observers were in the operating room, collected information, and recorded all exchanges as either hand-to-hand (unsafe) or neutral zone (safe, hands-free procedure). The percentage of these exchanges that were hands-free constituted the dependent measure. The intervention consisted of goal setting, task clarification, and feedback to use the safe-exchange procedure. At the beginning of the intervention phase, staff members were informed of the hospital policy, which included use of the hands-free procedure, and the goal was set to increase the percentage of hands-free exchanges. Hospital policy aimed at 75%, but the rate was only at 32%. Modeling was used to convey the exact ways of making the exchanges, and feedback was provided to the staff regarding the weekly percentages and whether the goal was met. Also, praise was provided for improvements at a weekly staff meeting.
Figure 7.5 shows that the intervention was introduced in a multiple-baseline design across two surgery settings. When the intervention was introduced to the inpatient operating room (top of figure), the percentage of safe exchanges (hands-free, using the neutral zone) increased sharply, so to speak. No changes were evident in the outpatient operating room, where the intervention had yet to be introduced. When the intervention was introduced there, improvements were evident as well. There was one day when the surgeon could not reach for the instrument in the neutral zone, as noted on the figure. Overall, the results convey that behavior changed when the intervention was introduced and not before. The design added a third phase in order to check to see if the behaviors were maintained. Approximately 5 months after the end of the intervention phase (the intervention had been suspended) both units were observed for a week. As evident in the figure, the effects of the intervention were maintained.
Figure 7.5. Percentage of sharp instruments exchanged using the neutral zone (hands-free safe procedure) across inpatient and outpatient operating rooms. The solid lines in each phase represent the mean (average) for that phase. (Source: T. R. Cunningham & Austin, 2007.)

When a particular behavior needs to be altered in two or more situations (e.g., home, school), the multiple-baseline design across situations or settings is especially useful. The intervention is first implemented in one situation and, if effective, is extended gradually to other situations as well. The intervention is extended until all situations in which baseline data were gathered are included.

The design can be used even if there is only one situation, but separate periods can be designated. For example, if the goal is to change behavior in a particular setting, two (or more) time periods can be delineated, such as in the morning and afternoon. The multiple-baseline design focuses on the same individual and same behavior but gathers (and graphs) data separately for these time periods. Thus, a program to develop the behavior of a child in an elementary school classroom might delineate two periods (e.g., before lunch and after lunch) and evaluate the intervention in a multiple-baseline design across these two periods. These periods could provide data for the baselines.

Number of Baselines

A major dimension that distinguishes variations of the multiple-baseline design is the number of baselines (i.e., behaviors, persons, or situations). I included an example
(surgery settings) with two baselines (please see Figure 7.5). Clearly in that example, two baselines served the purpose of enabling inferences to be drawn about the role of the intervention. With two baselines, the data pattern may need to be especially clear, indeed perfect, to make implausible other influences that may have caused the effect. Although two might meet the design criteria in principle, three baselines is the recommended minimum, and often that number is exceeded. Other things being equal, the demonstration that the intervention was responsible for change is clearer the larger the number of baselines that show the predicted pattern of performance. By "clearer" I mean the extent to which the change is likely to be attributed to the intervention rather than to extraneous influences and various threats to validity.

There is another, more practical reason to include more rather than fewer baselines. It is always possible that one of the baselines may not change or change very much when the intervention is introduced. If only two baselines were included and one of them did not change, the results cannot easily be attributed to the intervention because the requisite pattern of data was not obtained. On the other hand, if several (e.g., five) baselines were included in the design and one of them did not change, the effects of the intervention may still be very clear. The remaining baselines may show that whenever the intervention was introduced, performance changed, with the one exception. The clear pattern of performance for most of the behaviors still strongly suggests that the intervention was responsible for change rather than the threats to internal validity. The problem of inconsistent effects of the intervention across different baselines is addressed later in the chapter. At this point it is important only to note that the inclusion of several baselines beyond the minimum of two or three may clarify the effects of the intervention. Occasionally, baseline data are obtained and intervention effects are evident across several (e.g., eight or nine) behaviors, persons, or situations.

The adequacy of the demonstration that the intervention was responsible for change is not merely a function of the number of baselines assessed. Other factors, such as the stability of the behaviors during the baseline phases and the magnitude and rapidity of change once the intervention is applied, also determine the ease with which inferences can be drawn about the role of the intervention.

Partial Applications of Treatment

Multiple-baseline designs vary in the manner in which treatment is applied to the various baselines. For the variations discussed thus far, a particular intervention is applied to the different behaviors at different points in time. Several variations of the designs depart from this procedure. In some circumstances, the intervention may be applied to the first behavior (or individual or situation) and produce little or no change. It may not be useful to continue applying this intervention to other behaviors. The intervention may not achieve enough change in the first behavior to warrant further use. Hence, a second intervention may be applied, following a sort of ABC design for the first behavior. If the second intervention (C) produces change, it is applied to other behaviors in the usual fashion of the multiple-baseline design. The design is different only in the fact that the first intervention was not applied to all of the behaviors, persons, or situations.

For example, a simple intervention was used to increase the frequency with which drivers would stop at stop signs at three separate intersections in different parts of a city (in
Florida) (Van Houten & Retting, 2001). Over 700,000 accidents occur at stop signs, and 3,000 of these are fatal (1998 statistics). In this study two prompting procedures were used to increase full stops. Video cameras recorded stopping, and the videotapes were scored later by observers. A multiple-baseline design across three sites (intersections) was the design. The first intervention consisted of a sign posted under the stop sign that said in black letters on a white background, "LOOK BOTH WAYS." The second intervention consisted of animated eyes on a screen that scanned left and right once per second. The animated-eyes screen was placed in front of the stop sign and included a microwave sensor that detected approaching vehicles. Once a vehicle was detected, the eyes moved from side to side, that is, looking both ways, for 6 seconds.

Figure 7.6 shows that the "look both ways" sign did not have much of an effect. This intervention was not implemented for the other intersections. Presumably, if it were very effective, this would be the intervention extended to other baselines. The animated-eyes intervention was presented, and that increased the percentage of drivers

"L o o k Both W a ys''


Baseline Sign Prompt Animated L E D Eyes

80

60

40

20
Site A

I 3 5 7 9 II 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43
Sessions

Fig u re 7 .6 . The percentage of vehicles coming to a complete stop at each of the sites (intersec-
tions) in a multiple-baseline design across sites. (Source: Van Houten & Retting, 2001.)
coming to a complete stop. As evident from the middle and lower panels, this latter intervention was extended to each of the other intersections and was associated with change. The example conveys a strength of single-case designs, namely, trying out an intervention, evaluating its impact in real time, and making a decision to try something else if further change is needed.
Another variation of the design that involves partial application of treatment is the case in which one (or more) of the baselines never receives treatment. Essentially, the final baseline (behavior, person, or situation) is observed over the course of the investigation and serves as a control for extraneous changes that might occur because of events (history) or changes in measurement. For example, a program was devised to alter the disruptive behavior of three African American students (ages 10 to 11) in a special education classroom comprised of eight students (Musser, Bray, Kehle, & Jenson, 2001). The three students met criteria for psychiatric disorders, namely, Oppositional Defiant Disorder (extremes of stubbornness, noncompliance) and Attention-Deficit/Hyperactivity Disorder (inattention, hyperactivity). The focus was on reducing disruptive behaviors in class (e.g., talking out, making noises, being out of one's seat, swearing, and name calling). Daily baseline observations of disruptive behavior were made in class. The intervention included several components: posting classroom rules on the student's desk (e.g., sit in your seat unless you have permission to leave, raise your hand for permission to speak), special instructions/requests by the teacher (e.g., using the word "please" before a request was made of the student, standing close to the student), rewards for compliance and good behavior (e.g., praise, stickers exchangeable for prizes), and mild punishment (e.g., taking away a sticker).
Figure 7.7 shows that the program was introduced in a multiple-baseline design across the three students. Two other students in the same class and of the same age and ethnicity, and also with diagnoses of disruptive behavior disorders, were assessed over the course of the study but never received the intervention. As the figure shows, the intervention led to change for each of the three students at each point that the intervention was introduced and not before. The pattern strongly suggests that the intervention rather than any extraneous influences accounted for the change. This conclusion is further bolstered by the two control students who were observed over time in the same class. Essentially these students remained in the baseline phase over the course of the study and continued to perform at the same level over time. The two control baselines (students) are not needed but provide yet another way of showing that extraneous influences are not likely to explain the pattern of the data across all of the baselines (students). This is a very clear demonstration, aided by showing that with the intervention there was change and without the intervention (for each of the three students and for the two control students) there was no change. One point to mention in passing: in the follow-up (final) phase, the program was removed completely and behavior was maintained. That is, return to baseline did not lead to a loss of the gains.

General Comments

The preceding discussion highlights major variations of the multiple-baseline design. Perhaps the major source of diversity is whether the multiple baselines refer to the behaviors of a particular person, to different persons, or to performance in different situations.
Figure 7.7. Disruptive behavior (percentage of intervals) of special education students. The intervention was introduced in a multiple-baseline design across three students. Two similar children (bottom two graphs) served as controls; their behavior was assessed over the course of the program, but they never received the intervention. In the final follow-up phase the program was completely withdrawn. (Source: Musser et al., 2001.)

There are many permutations of multiple-baseline designs that stem from considering contexts or settings. For example, government programs (e.g., money for implementing a special educational intervention, novel ideas for improving the proportion of children who obtain vaccinations) at the local or federal level might be introduced across schools, school districts, agencies, and states in a multiple-baseline fashion to evaluate
their impact. Apart from contexts or settings, numerous other variations of multiple-baseline designs exist. The variations usually involve combinations of the dimensions discussed previously. Variations also occasionally involve components of ABAB designs; these will be addressed in Chapter 10, in which combined designs are discussed.

PROBLEMS AND LIMITATIONS

Several sources of ambiguity can arise in drawing inferences about intervention effects using multiple-baseline designs. Ambiguities can result from the interdependence of the behaviors, persons, or situations that serve as the baselines or from inconsistent effects of the intervention on the different baselines. Finally, both practical and methodological problems may arise when the intervention is withheld from one or more of the behaviors, persons, or situations for a protracted period of time.

Interdependence of the Baselines

The critical requirement for demonstrating unambiguous effects of the intervention in a multiple-baseline design is that each baseline (behavior, person, or situation) changes only when the intervention is introduced and not before. Sometimes the baselines may be interdependent, so that change in one of the baselines carries over to another baseline even though the intervention has not been extended to that latter baseline. This effect can interfere with drawing conclusions about the intervention in each version of the multiple-baseline design.

In the design across behaviors, changing the first behavior may be associated with changes in one of the other behaviors even though those behaviors have yet to be included in the intervention (e.g., Whalen, Schreibman, & Ingersoll, 2006). Common experience would suggest interdependence of behaviors. Some behaviors (e.g., communication, social interaction) may be pivotal to other activities and have ripple effects in changing other behaviors (cf. Koegel & Kern-Koegel, 2006; Rosales-Ruiz & Baer, 1997). In situations where generalization across responses occurs, the multiple-baseline design across behaviors may not show a clear relation between the intervention and behavior change.

In the multiple-baseline design across individuals, it is possible that altering the behavior of one person influences other persons who have yet to receive the intervention. In investigations in situations where one person can observe the performance of others, such as classmates at school or siblings at home, changes in the behavior of one person occasionally result in changes in other persons. For example, a program designed to reduce thumb sucking in a 9-year-old boy was very effective in eliminating the behavior (Watson et al., 2002). No intervention was provided to the boy's 5-year-old brother, whose behavior also changed and for whom thumb sucking was also eliminated. It could have been that the brother who received the intervention was a cue for the behavior or modeled the behavior. The interpretation is not clear, but the effects are. The intervention in this case spread to another person for whom no direct intervention was provided. Similarly, in the multiple-baseline design across situations, settings, or time, altering the behavior of the person in one situation may lead to generalization of performance across other situations. The specific effect of the intervention may not be clear.
In each of the preceding cases, if intervention effects extended beyond the specific baseline to which the intervention was applied, the results would be ambiguous. It is possible that extraneous events coincided with the application of the intervention and led to general changes in performance. Alternatively, it is possible that the intervention accounted for the changes in several behaviors, persons, or situations even though it was only applied to one. The problem is not that the intervention failed to produce the change; it may have. Rather, the problem lies in unambiguously inferring that the intervention was the causal agent.

Interdependence of the baselines is a potential problem in each of the multiple-baseline designs. However, three points provide a perspective on the threat of interdependence of the baselines. First, few demonstrations report the interdependence of baselines. Yes, it could be that it occurs all of the time but these papers do not get published. In my own experience, the interdependence rarely occurs. Second, when changes do occur prematurely for the baselines (behaviors, situations) that have yet to receive the intervention, this does not necessarily mean that the demonstration is ambiguous. The specific effect of the demonstration may be clear for a few but not all of the baselines. Third, single-case designs allow for design changes and improvisation during the demonstration. Thus, the investigator may introduce features of other designs, such as a return-to-baseline phase for one or more of the behaviors, to show that the intervention was responsible for change. I discuss combined designs later to convey this option.

Inconsistent Effects of the Intervention

Another potential problem of multiple-baseline designs is that the intervention may produce inconsistent effects on the behaviors, persons, or situations to which it is introduced. "Inconsistent effects" means that some behaviors are altered when the intervention is introduced and others are not. The inconsistent effects of an intervention in a multiple-baseline design raise obvious problems. In the most serious case, the design might include only two behaviors, the minimum (but not recommended) number of baselines required. The intervention is introduced to both behaviors at different points in time, but only one of these behaviors changes. The results are usually too ambiguous to meet the requirements of the design. Extraneous factors other than the intervention might well account for behavior changes, so the internal validity of the investigation has not been achieved. This concern is one reason why three baselines is a recommended minimum for multiple-baseline designs, and more can make the investigator merrier.

Alternatively, if several behaviors are included in the design and one or two do not change when the intervention is introduced, this may be an entirely different matter. The effects of the intervention may still be quite clear from the two, three, or more behaviors that did change when the intervention was introduced. The behaviors that did not change are exceptions. Of course, the fact that some behaviors changed and others did not raises questions about the generality or strength of the intervention. But the internal validity of the demonstration, namely, that the intervention was responsible for change, is not an issue. In short, the pattern of the data need not be perfect to permit the inference that the intervention was responsible for change. If several of the baselines show the intended effect, an exception may not necessarily interfere with
drawing causal inferences about the role of the intervention. As I noted, experimental design is about making competing interpretations of the intervention effect implausible. An overall pattern in the data with an exception of one of the baselines may still leave the intervention as the most reasonable explanation of the results.

P ro lo n g e d B a se lin e s
M ultiple-baseline designs depend on w ithholding the intervention froin each baseline
(behavior, person, or situation) for a period o f time. T he intervention is applied to
the first behavior while it is tem porarily w ithheld from the second, third, and other
behaviors. Eventually, o f course, the intervention is extended to each o f the baselines. I f
several behaviors (or persons, or situations) are included in the design, the p o ssibility
exists that several days or weeks might elapse before the final behavior receives treat-
ment. Several issues arise when the intervention is withheld for extended periods.
O bviously, applied and ethical considerations m ay argue against w ithholding the
intervention. I f the intervention im proves behavior when it is applied initially, perhaps
it should be extended im m ediately to other behaviors. W ithholdin g the intervention
m ay be unethical, especially i f there is a hint in the data from the initial baselines that
the intervention influences behavior. O f course, the ethical issu e here is not unique
to m ultiple-baseline or single-case designs but can be raised in virtually any area o f
experim entation in which an intervention o f unknow n effectiveness is under evalu a-
tion or where a prom ising intervention is withheld. W hether it is ethical to withhold an
intervention or treatment m ay depend on som e assurances that th e treatm en t is helpful
and is responsible for change. These latter questions, o f course, are the basis o f usin g
experim ental designs to evaluate interventions in the first place.
Although some justification may exist for temporarily withholding interventions for purposes of evaluation, concerns increase when the period of withholding the intervention is protracted. If the final behaviors in the design will not receive the intervention for several days or weeks, this may be unacceptable in light of applied considerations. As discussed later, there are ways to use the multiple-baseline design so that the final behaviors receive the intervention with relatively little delay.
Aside from ethical and applied considerations, methodological problems may arise when baseline phases are prolonged for one or more of the behaviors. As noted earlier, the multiple-baseline design depends on showing that performance changes when and only when the intervention is introduced. When baseline phases are extended for a prolonged period, performance may sometimes improve slightly even before the intervention is applied. Several reasons may account for the improvement. First, the interdependence of the various behaviors that are included in the design may be responsible for changes in a behavior that has yet to receive the intervention. Indeed, as more and more behaviors receive the intervention in the design, the likelihood that other behaviors yet to receive treatment will show the indirect or generalized benefits of the treatment may increase.
Second, over an extended period, clients may have increased opportunities to develop the desired behaviors either through direct practice or the observation of others. For example, if persons are measured each day on their social behavior, play skills, or compliance to instructions, improvements may eventually appear in baseline phases for behaviors (or persons) that have yet to receive the intervention. The prolonged baseline assessment may provide some opportunities to improve performance through repeated practice or modeling.
Third, the social environment of the child may have changed in direct response to the individual's changes in one or more behaviors. Others in the environment may respond differently, and that could affect a variety of the individual's behaviors, whether or not the intervention is introduced. Collateral changes are always possible but might be more likely with protracted baselines, where the effects of indirect influence might increase over time. In any case, when some behaviors (or persons, or situations) show improvements before the intervention is introduced, the requirements of the multiple-baseline design may not be met.
The ethical, applied, and methodological problems that may result from prolonged baselines can usually be avoided. To begin with, multiple-baseline designs usually do not include a large number of behaviors (e.g., six or more), so the delays in applying the intervention to the final behavior are not great. Even if several baselines are used, the problems of prolonged baselines can be avoided in a number of ways. First, when several behaviors are observed, few data points may be needed for the baseline phases for some of the behaviors. For example, if six behaviors are observed, baseline phases for the first few behaviors may last only one or a few days. Also, the delay or lag period between implementing treatment for one behavior and implementing the same treatment for the next behavior need not be very long. A lag of a few days may be all that is necessary, so that the total period of the baseline phase before the final behavior receives treatment may not be particularly long.
Also, when several behaviors are included in the multiple-baseline design, treatment can be introduced for two (or more) behaviors at the same point in time. The demonstration still takes advantage of the multiple-baseline design, but it does not require implementing the treatment for only one behavior at a time. For example, a hypothetical multiple-baseline design is presented in Figure 7.8 in which six behaviors are observed. In a multiple-baseline design, treatment might be applied to each of the behaviors, one at a time (see left panel of figure). It might take several days before the final behavior could be included in treatment. Alternatively, the treatment could be extended to each of the behaviors two at a time (see right panel of the figure). This variation of the design does not decrease the strength of the demonstration, because the intervention is still introduced at two (or more) different points in time. The obvious advantage is that the final behavior is treated much sooner in this version of the design than in the version in which each behavior is treated separately. In short, delays in applying the intervention to the final behavior (or person, or situation) can be reduced by applying the treatment to more than one behavior at a time, as the brief sketch below illustrates.
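To make the scheduling concrete, here is a minimal sketch in Python (not from the text; the 10-day first baseline and the 5-day lag are hypothetical values chosen only for illustration) comparing when the intervention would begin for six baselines introduced one at a time versus two at a time:

    # Day on which the intervention begins for each of n baselines when the
    # intervention is extended to `per_step` baselines at each staggered step.
    def start_days(n_baselines, per_step, first_day=10, lag=5):
        starts = []
        for i in range(n_baselines):
            step = i // per_step  # which staggered step this baseline joins
            starts.append(first_day + step * lag)
        return starts

    print(start_days(6, per_step=1))  # one at a time: [10, 15, 20, 25, 30, 35]
    print(start_days(6, per_step=2))  # two at a time: [10, 10, 15, 15, 20, 20]

Pairing the baselines cuts the wait for the final behavior from 35 days to 20 in this hypothetical case, without sacrificing the staggered introduction on which the design depends.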
Prolonged baselines raise assessment obstacles if one has to conduct daily assessment for several baselines for an extended period. It may be a burden or simply not feasible to observe all of the baselines daily. Two options have been used to handle observational challenges and still allow the design to proceed. First, instead of every day, observations could be made only occasionally for some of the baselines, especially if the baselines are stable. It is critical here to keep in mind not the design per se but the purpose of the design. For each baseline, we need a stable estimate to describe present performance and predict future performance. This can be accomplished in a feasible way that may involve only intermittent assessment. The periodic or intermittent

Figure 7.8. Hypothetical example of a multiple-baseline design across six behaviors. The left panel shows a design in which the intervention is introduced to each behavior, one at a time. The right panel shows a design in which the intervention is introduced to two behaviors at a time. The shaded area conveys the different durations of baseline phases in each version of the design. The illustration is a multiple-baseline across behaviors, but of course the same point applies if the baselines were across people or settings.

assessment of behavior when interventions are not in effect for that behavior is referred to as probes or probe assessment. Probes provide an estimate of what daily performance would be like. For example, hypothetical data are presented in Figure 7.9, which illustrates a multiple-baseline design across behaviors. Instead of assessing behavior every day, probes are illustrated in two of the baseline phases. The probes provide a sample of data and avoid the burden of daily assessment for an extended period. Certainly an advantage of probe assessment is the reduction in cost in terms of the time the observer must spend collecting baseline data. If probes are to be used to reduce the number of assessment occasions, the investigator needs to have an a priori presumption that performance is stable. The clearest instance of stability would be if behavior never occurs or reflects a complex skill that is not likely to change over time without special training.1
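As a simple sketch of how probes thin out the observation load, the following Python fragment (not from the text; the every-third-day interval is purely hypothetical) lists the days on which probe observations would be taken for a presumably stable baseline:

    # Days (1-indexed) on which a probe observation is taken, assuming one
    # probe every `every` days rather than daily assessment.
    def probe_days(total_days, every=3):
        return [day for day in range(1, total_days + 1) if (day - 1) % every == 0]

    print(probe_days(15))  # -> [1, 4, 7, 10, 13]

Five observations stand in for fifteen, which is acceptable only under the a priori presumption of stability noted above.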
Another option is to begin assessments at different points. In this instance, there may be multiple baselines, but observation does not begin at the same point for each one. For example, two baselines may be observed in the usual way, but additional baselines may be added. Those additions may be planned from the beginning but postponed because of the lack of resources for assessing all the baselines, or they may be unplanned. As the intervention begins, it may become clear or advisable to add new behaviors or new situations. I have explained the logic of the design to help with these practical situations. The data within the phases of single-case designs are intended to serve the describe, predict, and test prediction functions. Thus, one can depart from rigid application of designs as long as the functions are served.

1 Probes can be used for other purposes, such as the assessment of maintenance or the transfer of behavior to other situations or settings (see Chapter 10).

Figure 7.9. Hypothetical data for a multiple-baseline design across behaviors. Daily observations were conducted and are plotted for the first and second behaviors. Probes (intermittent assessment) were conducted for the baselines of the third and fourth behaviors.


EVALUATION OF THE DESIGN
Multiple-baseline designs have a number of advantages that make them very useful in applied settings. To begin with, the designs do not depend on withdrawing treatment to show that behavior change is a function of the intervention. Hence, there is no need to reduce or temporarily suspend treatment effects for purposes of the design. This characteristic makes multiple-baseline designs a highly preferred alternative to ABAB designs and their variations.
Another feature of the designs makes them quite suited to practical considerations and demands of applied settings. The designs require applying the intervention to one behavior (person or situation) at a time. The gradual application of the intervention across the different behaviors has practical benefits. In many applied settings, parents, teachers, supervisors, institutional staff, or other change agents are responsible for applying the intervention. Considerable skill may be required to apply the intervention effectively. Implementing the intervention on a small scale (one behavior or individual) allows change agents to proceed gradually and extend the scope of the intervention (to other behaviors and individuals) only after having mastered the initial applications. In situations in which behavior-change agents are learning new skills in applying an intervention, the gradual application can be very useful.
A related advantage is that the application to only one behavior at a time permits a test of the effectiveness of the procedure. Before the intervention is applied widely, the preliminary effects on the first behavior can be examined. If treatment effects are not sufficiently strong, or if the procedure is not implemented correctly, it is useful to learn this early, before applying the procedure widely across all behaviors, persons, or situations of interest.
In specific variations of the multiple-baseline design, the gradual manner in which treatment is extended can also be useful for the clients (e.g., students, children, employees). For example, in the multiple-baseline design across behaviors or situations, the intervention is first applied to only one behavior or to behavior in only one situation. Gradually, other behaviors and situations are incorporated into the program. This follows a useful model of developing behaviors gradually (shaping) for the client, since early in the program changes are required for only one behavior or in one situation. As the client improves, increased demands are placed on performance.
In an era of increased emphasis on evidence-based interventions, many programs are under increased pressure to document the effects of their interventions. Classrooms; colleges; institutions for children, adolescents, and adults; communities; state and local agencies; and businesses have an endless stream of programs directed to wonderful causes but with no data on their behalf. Those who administer the programs recognize that a randomized controlled trial, or assigning some units (classes, wards, districts) to the intervention condition and withholding the intervention from other units, is not possible. This has led to methodological helplessness, that is, the cognition that doing a careful evaluation of the program and its effects is not possible.
A multiple-baseline design is likely to be quite feasible. There is no need for a no-intervention control group (as in a between-group study), nor to withhold some special program, nor to withhold it for very long (as in a multiple-baseline design). In settings where control groups are not possible, the multiple-baseline design is a viable alternative. I would argue further that even if a control group were possible, multiple-baseline designs are likely to be preferred. In applied settings, we are interested in seeing individuals change and tinkering with our interventions to make sure this happens. Single-case designs allow for decision making in response to client performance (e.g., when to change phases, whether a new intervention ought to be tried). In contrast, in most between-group studies, the treatment regimen is fixed in duration. Overall, the manner in which treatment is implemented to meet the methodological requirements of the multiple-baseline design may be quite harmonious with practical considerations regarding how behavior-change agents and clients perform. Gradual introduction of the intervention across baselines and no need for withdrawal of the intervention make methodological and client considerations quite compatible.

SUMMARY AND CONCLUSIONS
Multiple-baseline designs demonstrate the effects of an intervention by presenting the intervention to each of several different baselines at different points in time. A clear effect is evident if performance changes when and only when the intervention is applied. Several variations of the design exist, depending primarily on whether the multiple-baseline data are collected across behaviors, persons, situations, settings, or time. The designs may also vary as a function of the number of baselines and the manner in
which treatment is applied. The designs require a minimum of two baselines, but three or more are strongly recommended to optimize clarity of the intervention effect. The strength of the demonstration that the intervention, rather than extraneous events, was responsible for change is a function of the number of behaviors to which treatment is applied, the stability of baseline performance for each of the behaviors, and the magnitude and rapidity of the changes in behavior once treatment is applied.
Sources of ambiguity may make it difficult to draw inferences about the effects of the intervention. First, problems may arise when different baselines are interdependent, so that implementation of treatment for one behavior (or person, or situation) leads to changes in other behaviors (or persons, or situations) as well, even though these latter behaviors have not received treatment. Another problem may arise in the designs if the intervention appears to alter some behaviors but does not alter other behaviors when the intervention is applied. If several behaviors are included in the design, a failure of one of the behaviors to change may not raise a problem. The effects may still be quite clear from the several behaviors that did change when the intervention was introduced.
A final problem that may arise with multiple-baseline designs pertains to withholding treatment for a prolonged period while the investigator is waiting to apply the intervention to the final behavior, person, or situation. Practical and ethical considerations may create difficulties in withholding treatment for a protracted period. Also, it is possible that extended baselines will introduce ambiguity into the demonstration. In cases in which persons are retested on several occasions or have the opportunity to observe the desired behavior among other subjects before the intervention is applied to them, extended baseline assessment may lead to systematic improvements or decrements in behavior. Demonstration of the effects of the intervention on extended baselines may be difficult. Prolonged baselines can be avoided by utilizing short baseline phases or brief lags before applying treatment to the next baseline, and by implementing the intervention across two or more behaviors (or persons, or situations) simultaneously in the design. Thus, the intervention need not be withheld even for the final behaviors in the multiple-baseline design. Multiple-baseline designs are quite popular, in part, because they do not require reversals of performance. Also, the designs are consistent with many of the demands of applied settings in which the intervention is implemented on a small scale first before being extended widely.
CHAPTER 8

Changing-Criterion Designs

CHAPTER OUTLINE

Basic Characteristics of the Design
Description and Underlying Rationale
Illustrations
Design Variations
Subphases During the Intervention
How Many Are Required?
How Long (Days, Sessions) Should the Subphases Be?
How Large Should the Changes Be from One Criterion Shift to the Next?
Point or Range of Responses as the Criterion
Directionality of Change
Other Variations
General Comments
Problems and Limitations
Gradual Improvement Not Clearly Connected to Shifts in the Criterion
Rapid Changes in Performance
Correspondence of the Criterion and Behavior
Magnitude of Criterion Shifts
General Comments
Evaluation of the Design
Summary and Conclusions

With a changing-criterion design, the effect of the intervention is demonstrated by showing that behavior changes gradually over the course of the intervention phase. The behavior improves in increments or steps to match a criterion for performance that is specified as part of the intervention. For example, if praise or points are provided to a child for practicing a musical instrument, a criterion (e.g., amount of time spent practicing) is specified to the child as the requirement for earning the rewarding consequences. As the child's performance matches or meets that criterion with some consistency, the criterion is shifted to a new level (e.g., more minutes) to earn the consequences. The required level of performance in a changing-criterion design is altered
repeatedly over the course of the intervention to improve performance over time. The effects of the intervention are shown when performance repeatedly changes to meet the criterion. Graphically, this appears as a step-like function in which performance matches the criterion, the criterion is shifted, performance matches the new criterion, and so on, until the desired level of performance is achieved.
Unlike the ABAB designs, the changing-criterion design does not require withdrawing or temporarily suspending the intervention to demonstrate the relation between the intervention and behavior. Unlike multiple-baseline designs, the design does not require multiple behaviors (settings, or situations) or require withholding the intervention temporarily so that it can be introduced sequentially across baselines. The changing-criterion design neither withdraws nor withholds treatment as part of the demonstration.

BASIC CHARACTERISTICS OF THE DESIGN

Description and Underlying Rationale


The changing-criterion design begins with a baseline phase in which continuous observations of a single behavior are made for one or more persons. After the baseline (or A) phase, the intervention (or B) phase is begun. The unique feature of a changing-criterion design is the use of several subphases (b1, b2, to bn). I refer to them as subphases (little b) because they are all in the intervention phase; the number of these subphases can vary up to any number (n) within the intervention phase. During the intervention phase, a criterion is set for performance. For example, in programs based on the use of reinforcing consequences, the client is instructed that he or she will receive the consequences if a certain level of performance is achieved. If performance meets or surpasses the criterion, the consequence is provided. As performance meets that criterion, the criterion is made slightly more stringent. This continues in a few subphases in which the criterion is repeatedly changed.
As an illustration, a person may be interested in exercising more. Baseline may reveal that the person never exercises (i.e., zero minutes per day). The intervention phase may begin by setting a criterion such as 10 minutes of exercise per day. If the criterion is met or exceeded (10 or more minutes of exercise), the client may earn a reinforcing consequence (e.g., a special privilege at home, money toward purchasing a desired item). Whether the criterion is met is determined each day. Only if performance meets or surpasses the criterion will the consequence be earned. If performance consistently meets the criterion for several days, the criterion is increased slightly (e.g., 20 minutes of exercise). As performance stabilizes at this new level, the criterion is again shifted upward to another level. The criterion continues to be altered in this manner until the desired level of performance (e.g., exercise) is met.
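The decision rule in this hypothetical exercise example can be written out in a few lines of Python. The sketch below is illustrative only; the starting criterion of 10 minutes, the 10-minute increments, and the rule of shifting after 3 consecutive days at or above the criterion are assumptions for the example, not prescriptions from the design:

    # Hypothetical daily exercise data and a simple changing-criterion rule:
    # the consequence is earned when minutes meet or exceed the criterion, and
    # the criterion is raised after 3 consecutive days of meeting it.
    criterion, step, met_streak = 10, 10, 0
    for day, minutes in enumerate([12, 10, 15, 8, 22, 25, 20, 31, 30, 35], start=1):
        earned = minutes >= criterion
        met_streak = met_streak + 1 if earned else 0
        print(f"Day {day}: {minutes} min, criterion {criterion}, earned={earned}")
        if met_streak == 3:  # performance is stable at this level; shift upward
            criterion += step
            met_streak = 0

Run on these invented data, the criterion steps from 10 to 20 after Day 3 and from 20 to 30 after Day 7, producing exactly the step-like pattern the design looks for.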
A hypothetical example of the changing-criterion design is illustrated in Figure 8.1, which shows a baseline phase that is followed by an intervention phase. Within the intervention phase, several subphases are delineated (by vertical dashed lines). In each subphase, a different criterion for performance is specified (dashed horizontal line within each subphase). As performance stabilizes and consistently meets the criterion, the criterion is made more stringent, and criterion changes are made repeatedly over the course of the design.
Figure 8.1. Hypothetical example of a changing-criterion design in which several subphases are presented during the intervention phase. The subphases differ in the criterion (dashed line) for performance that is required of the subject.

The underlying rationale of the changing-criterion design resembles that of designs discussed previously. As in the ABAB and multiple-baseline designs, the baseline phase serves to describe current performance and to predict performance in the future. The intervention phases of these designs test that prediction to see if performance departs from what would be expected if baseline were to continue. Similarly, the subphases of the changing-criterion design also make and test predictions. In each subphase, a criterion or performance standard is set. If the intervention is responsible for change, performance would be expected to follow the shifts in the criterion from subphase to subphase. In contrast, if behavior fluctuates "randomly" (no systematic pattern) or tends to increase or decrease without following the criterion shifts, it is more likely or at least plausible to assume that extraneous factors rather than the intervention controlled behavior. In such instances, the intervention cannot be accorded a causal role in accounting for performance. On the other hand, if performance corresponds closely to the changes in the criterion, then the intervention can be considered to be responsible for the change.

Illustrations
An illustration of the design was provided in a program designed to develop reading in Craig, a 26-year-old European American man diagnosed with paranoid schizophrenia and living in a community-based treatment program (Skinner, Skinner, & Armstrong, 2000). The client had many symptoms, including delusional thoughts, paranoid ideation, auditory hallucinations, and flat affect. The goal was to develop leisure reading as part of his activities. The client asked the staff to help improve his reading. He could read but did not sustain the task. For this project, Craig selected the reading material from a local library. The goal of the program was to increase the number of pages he would read out loud. This was accomplished by increasing the number of pages read over time. The number of pages read continuously was counted, that is, with no pause of more than 30 seconds. A delay due to asking about the meaning of a word or looking a
word up in a dictionary was excluded from counting as a pause. Each day Craig continued reading where he left off on the previous day. There was only one session per day. Craig did not have to read and could say no or not have a session if he did not want one. A changing-criterion design was used for the demonstration.
During baseline, Craig or a staff member could initiate his reading. No special program was implemented to improve reading. During the intervention phase, Craig could earn a soft drink if he met the criterion for the number of pages read. Before the session, a staff member showed Craig the end point of the pages he needed to read in order to meet the criterion. The criterion increased by one page after Craig met the required number of pages on 3 days. Early in the program, the soft drink was given immediately after reading. Soon, he asked to walk to the store for this. Walking to the store, discussing the reading, and selecting the drink were considered good ways to integrate him more into the community. After 6 weeks of the program, Craig asked to end the program; he no longer wanted to read out loud. Seven weeks later, after he expressed interest in a book someone else was reading, there was an unprompted and unplanned assessment, which is identified in the final phase as a maintenance probe (M).
Figure 8.2 shows that the number of pages read during baseline was low. During the intervention phase, the number of pages read increased and showed a step-like function as the criterion (number of pages) was increased. Reading very closely followed the criterion shifts. During baseline, Craig elected to read only 40% of the time (2 of 5 days); during the intervention phase he elected to read on 76% of the days. Moreover, on those days, he met the criterion for reinforcement on 25 of 26 (96%) of the occasions. Reading was maintained 7 weeks after the program had ended. Although the out-loud reading program was discontinued, Craig continued to borrow books from the public library, kept a book on his night stand, and was observed reading silently. The program began with out-loud reading, Craig's initial preference, but silent reading appeared to continue after that particular program ended.

Figure 8.2. Number of pages read each day across conditions and criteria. (Source: Skinner, Skinner, & Armstrong, 2000.)

As another illustration, a study focused on a 15-year-old girl named Amy with insulin-dependent diabetes. She had been instructed to check her blood sugar 6 to 12 times per day (Allen & Evans, 2001). Among the challenges is avoiding hypoglycemia (low blood sugar), which is extremely unpleasant and characterized by symptoms of dizziness, sweating, headaches, and impaired vision. It can also lead to seizures and loss of consciousness. Children and their parents often are hypervigilant to do anything to avoid low blood sugar, including deliberately maintaining high blood glucose levels. The result of maintaining high levels can be poor metabolic control and increased health risk for complications (e.g., blindness, renal failure, nerve damage, and heart disease). Amy was checking her blood glucose levels more than 80 times per day (at a cost of about $600 per week) and was maintaining her blood glucose levels too high.
A blood glucose monitor was used that automatically recorded the number of checks (up to 100 checks) and then downloaded the information to a computer. The test included a finger prick, application of the blood to a reagent test strip, insertion of the strip into the monitor, and a display of glucose levels. An intervention was used to decrease the number of times blood glucose checks were made each day. Amy's parents gradually reduced access to the materials (test strips) that were needed for the test. A changing-criterion design was used in which fewer and fewer tests were allowed. If Amy met the criterion, she was allowed to earn a maximum of five additional tests (blood glucose checks). Access to the test materials was reduced gradually over time. The parents selected the criterion of how many tests (test strips) would be available in each subphase. As shown in Figure 8.3, the criterion first dropped by 20 checks and then by smaller increments. Over a 9-month period, Amy decreased her use of monitoring from over 80 to 12 times per day. Better metabolic control was also achieved; by the end of the 9 months, blood glucose levels were at or near the target levels (i.e., neither hypo- nor hyper-glucose levels).

DESIGN VARIATIONS
Most applications of the changing-criterion design closely follow the basic design just illustrated. Features of the basic design can vary, including the number of changes that are made in the criterion, the duration of the subphases at each criterion, the amount or magnitude of change in the criterion at each step, whether the criterion is specified as a point or a range, and the directionality of the shifts in the criterion.

Subphases During the Intervention


There are critical decision points in using the design that make for variation. Three critical questions need to be answered: How many subphases are required? How long should each subphase be? And how large should the changes be from one criterion shift to the next? In traditional between-group research, questions like these (e.g., how long is the intervention provided?) are answered before the study begins; in single-case designs the questions are usually not answered ahead of time. In single-case designs both the intervention and the design have a flexibility that individualizes each to the client based on how the client is responding and whether the criteria are met (e.g., stable performance) in a given phase. Even so, some guidelines can be provided.


Figure 8.3. Number of blood glucose monitoring checks conducted during the last 10 days at each criterion level. Maximum test strips allotted at each level are indicated by dashed lines (the changing criteria) and corresponding numbers of checks. The number of checks above the criterion level reflects the number of additional test strips earned by Amy. (Source: Allen & Evans, 2001.)

How Many Are Required? After baseline, the intervention in which the criterion is changed is implemented. Minimally, at least two changes in the criterion, and therefore two subphases of B, are needed. During the intervention one sets the criterion level, sees if performance comes to or near that level, looks for a stable or clear pattern at that first criterion level, and then makes at least one more shift in that criterion. Stated another way, this simple version of the design has a baseline phase (A) and an intervention phase (B) that has b1 and b2 as subphases. The design depends on showing that criterion shifts lead to performance shifts, and in principle two might be sufficient. In practice, and as illustrated by the examples in this chapter, many more criterion shifts (e.g., ranging from 3 up to 25) are used (e.g., Allen & Evans, 2001; Facon, Sahiri, & Riviere, 2008; McDougall, 2005). Although two are minimal, three or more are much more common and are recommended here.

How Long (Days, Sessions) Should the Subphases Be? A separate decision from how many criterion shifts is how long each shift should be, that is, how many days or sessions, before moving to the next criterion level. Each of these criterion shifts ought to be long enough to achieve at least one of two characteristics: stable responding and correspondence of behavior and the criterion. As to stable responding (little or no trend, minimal variability), the criterion shifts serve as subphases (e.g., b1, b2, ... bn) in relation to how the design works. In the logic of single-case designs, data in a phase or subphase
are intended to describe current performance, predict what performance would be like in the immediate future if conditions were not changed again, and test whether the new level of performance departs from a prior phase (or in this case a prior criterion level or subphase). Consequently, a criterion shift should be in place long enough to permit one to see stable performance that can be used for these purposes. As an exception, if the criterion shifts are large and performance of the client leaps to match these, the phases can be brief.
If behavior meets the new criterion quickly and shows low variability, the subphases can be relatively brief (two to five days or sessions). If behavior does not change quite so clearly when the criterion is shifted, the subphase may need to be longer. As is invariably the case in single-case designs, the duration of any phase or subphase varies in response to the client's performance. In this design, the subphases do not need to be of the same duration, but they do need to meet the larger goal of allowing the investigator to discern a pattern and to infer how likely it is that the pattern of behavior during one subphase and across a few subphases is due to the criterion shifts and intervention.
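One informal way to think about "stable enough to shift" can be sketched in Python. This is purely illustrative and not a rule from the design literature; the slope and spread thresholds are arbitrary placeholders that an investigator would set for a given measure:

    # Crude stability check on the most recent sessions of a subphase:
    # a small least-squares trend and a small spread suggest stable responding.
    def looks_stable(values, max_slope=0.5, max_spread=3.0):
        n = len(values)
        xbar = (n - 1) / 2
        ybar = sum(values) / n
        slope = sum((i - xbar) * (y - ybar) for i, y in enumerate(values)) / \
                sum((i - xbar) ** 2 for i in range(n))
        spread = max(values) - min(values)
        return abs(slope) <= max_slope and spread <= max_spread

    print(looks_stable([19, 21, 20, 22, 20]))  # True: flat and tight
    print(looks_stable([10, 14, 18, 22, 26]))  # False: clear upward trend

In practice, visual inspection of the plotted data serves this purpose; the code merely makes explicit the two features (trend and variability) that the paragraph above asks the investigator to judge.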

How Large Should the Changes Be from One Criterion Shift to the Next? A concern in drawing inferences is that one should not be able to look at the graphical display and say, "There seems to be a gradual trend toward improvement while the intervention is in effect." This pattern might suggest that some other influence (history, maturation, responses of others in the setting that are not part of the intervention) rather than the specific intervention might account for the effect.
The design requires showing shifts in performance in response to shifts in the criterion. The larger the shifts (the bigger the steps) required and the more immediate the shifts in client performance, the greater the clarity of the demonstration. That is, with larger criterion shifts and performance changes that match them, the intervention as the source of influence becomes the most plausible account of the results. I return to this later because there is a potential conflict in meeting the ideals of the design (several large criterion shifts) and developing behavior gradually (shaping) along some continuum (more time, more responses).
As a general rule, one would like to show a step-like function in performance during the intervention phase, with subphases moving client behavior in ways that are not smooth or consistent with some overall gradual change that could be readily due to some other event. For example, if one is increasing the duration of performance on a task (completing homework, practicing a musical instrument, walking for a person whose medical regimen requires that as part of treatment for an injury or recovery), one would not change the criterion in the subphases by 1 or 2 minutes but would make larger jumps if possible (e.g., 10 or 15 minutes). More examples will better convey how this is accomplished.

Point or Range of Responses as the Criterion


In the usual version of the design, the criterion shifts are based on a specific point. For example, to earn some consequence, the first criterion may be set as smoking fewer than 15 cigarettes per day or reading for 20 or more minutes. These criteria (e.g., 15 and 20) are specific points. Each criterion is changed over the course of the intervention phase. A design variation is to use a range with an upper and lower criterion level rather than a single point (McDougall, 2005; McDougall, Hawkins, Brady, & Jenkins, 2006). Performance of the client must fall within the range that has been specified for the person to receive the reinforcer or whatever the intervention is. When the criterion is changed, that new criterion also is a range. For example, the range for minutes of studying might move from 10 to 20 in the first criterion subphase to 25 to 35 in the second criterion subphase. The criterion is met for any performance that falls within that range.
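The range rule is simple enough to state in code. The following Python fragment is a minimal sketch using the studying example just given; the specific values are hypothetical:

    # The range-bound rule: the criterion is met only when performance falls
    # inside the current band (inclusive at both ends).
    def meets_range(minutes, lo, hi):
        return lo <= minutes <= hi

    print(meets_range(17, 10, 20))  # True: inside the first band
    print(meets_range(22, 10, 20))  # False: overshoots the band
    print(meets_range(30, 25, 35))  # True: inside the shifted band

Note that overshooting the band fails the criterion just as undershooting does, which is what distinguishes the range-bound variation from a simple "at or above" point criterion.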
This version of the design, referred to as the range-bound changing-criterion design, is nicely illustrated by the innovator of the design (McDougall, 2005). The study focused on improving the exercise and cardiovascular functioning of an obese adult. A behavioral self-management program was used (self-monitoring, graphing, goal setting, all completed by the client) to increase minutes of running. After baseline, there were several subphases in which the client selected the average number of minutes to be run that week. This began with 20 minutes as an average for the first criterion. A range was selected as ±10% of the average. The client was considered to have met the goal if the number of minutes fell within the band or range of 20 ± 2 (or 10%). Figure 8.4 shows several phases of the design and the band of acceptable performance considered to meet the criterion. The bands (ranges) become wider because 10% becomes a larger number as the average number of minutes moves upward in the criterion shifts. As evident in the figure, the criterion shifts exerted control over behavior. Performance followed the step-like function one seeks in the changing-criterion design.


Figure 8.4. Duration in minutes of daily exercise (running) during baseline, intervention, and maintenance phases. Baseline is truncated to permit display of all of the data from the intervention and maintenance phases. The participant ran on 3 days during the 19-week baseline phase. Parallel horizontal lines within the seven intervention phases depict performance criteria that the participant established for himself to define the range of acceptable performance for each phase. Upper horizontal lines indicate the maximum number of minutes of running permitted; lower horizontal lines indicate the minimum number of minutes of running permitted. (Source: McDougall, 2005.)

A few advantages of this design are worth underscoring. First, a range allows for greater flexibility in performance. In the example, the client did not want to run too much (and risk injury before being in better physical condition) but still wanted to be sure to run at least a minimum each day. The range was well suited to these competing interests. Related, human performance is variable from day to day, which is normal even if not completely understood. Allowing a range to serve as the criterion accommodates this characteristic better than aiming for an unvarying specific point as the criterion. The client's performance can fluctuate within a range and still meet the criterion for the intervention (e.g., reinforcer). The greater flexibility of a range might make the intervention and overall program more acceptable to the client as well as to investigators contemplating use of the design.
Second, improved consistency in performance is a characteristic one often seeks in behavior. For example, having a child study or practice a skill each day for a relatively constant amount of time (e.g., 15 to 20 minutes) is much better for learning a topic or skill than practicing all in one session for 1 day a week with little or no practice at all on the other days. The range criterion requires improved performance but within a narrow range. Focusing on a range fosters consistency as well as an improved level of performance.
Third, the design allows one to quantify, for descriptive purposes, the proportion of days on which the client performed within the criterion. Across all subphases, there were many ranges as the criterion shifted. One can report the proportion of days on which performance fell within the range. There is no formal use of this information for drawing inferences, but it provides a useful descriptive index. The higher the proportion of days falling within the criterion range, the greater the clarity of the results.
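Computing that descriptive index takes only a few lines. The sketch below is illustrative; the data pairs (each day's value together with that day's criterion range) are invented:

    # Proportion of intervention days on which performance fell within the
    # criterion range in effect that day. Each entry: (value, (low, high)).
    days = [(21, (18, 22)), (19, (18, 22)), (30, (27, 33)), (26, (27, 33))]
    hits = sum(lo <= value <= hi for value, (lo, hi) in days)
    print(f"{hits}/{len(days)} days within range = {hits / len(days):.0%}")  # 75%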

Directionality of Change


In the subphases of the design the criterion is usually made more stringent over the course of the intervention. For example, the criterion may be altered to decrease cigarette smoking or to increase the amount of time spent studying. The effects of the intervention are evaluated by examining a change in behavior in a particular direction over time; that is, in each case the performance is moving in the direction of improvement (fewer cigarettes smoked and more time studying) in a step-like fashion as behavior moves to each new step (criterion). The expected changes are unidirectional, that is, either an increase or a decrease in behavior.
Difficulties may arise in evaluating unidirectional changes over the course of the intervention phase in a changing-criterion design. Behavior may improve systematically as a function of extraneous factors rather than the intervention. It may be difficult to conclude that the intervention was responsible for change unless performance closely follows the criterion that is set in each subphase and unless there is a step-like pattern that suggests that the intervention and criterion shifts controlled behavior. The experimental control exerted by the intervention can be more readily detected by altering the criterion so that there are bidirectional changes in performance, that is, both increases and decreases in behavior.
In this variation of the design, the criterion is made increasingly more stringent in the usual fashion. However, during one of the subphases, the criterion is temporarily

made slightly less stringent. For example, the criterion may be raised throughout the intervention phase. During one subphase, though, the criterion is lowered slightly to a previous criterion level. This subphase constitutes a sort of mini-reversal phase and draws on the same logic as the return-to-baseline phase of the ABAB design.
I refer to these as mini-reversal phases because the phase does not really return to baseline conditions or level of performance. We are still in the intervention phase, and something is in place to change behavior. However, the criterion is altered so that the direction of the expected change in behavior is opposite from the changes in the previous subphase. If the intervention is responsible for change, one would expect performance to follow the criterion rather than to continue to improve in the same way it was improving in the prior subphase. In the previous example of exercise (see Figure 8.4), there was a reversal phase in the design (see the 60-minute-per-day average phase). This phase was not really needed insofar as the shifts in performance seemed to match the criterion shifts closely. Even so, the data pattern with the added bidirectional changes removes ambiguity about whether the intervention was responsible for change.
Another illustration of the use of a bidirectional change focused on an 11-year-old boy, George, with Separation Anxiety Disorder, a psychiatric disorder in which the child is very disturbed by separating from a parent or caregiver (Flood & Wilder, 2004). Difficulties in separating from parents (caregivers) at a young age are common and part of normal development. For some children, this may continue beyond early childhood and reflect more severe reactions that impair their daily functioning. These latter criteria, broadly speaking, influence whether the condition warrants treatment. George had intense emotional reactions and could not allow his mother to leave without displaying them.
Treatment was provided on an outpatient basis twice per week. Each of the sessions lasted up to 90 minutes. The intervention consisted of providing reinforcers for the absence of emotional behaviors and increases in the amount of time George could separate from his mother without these reactions. During baseline, George and his mother were in the treatment room, and the mother attempted to leave by saying she had something to do and would be back soon. Because George showed strong emotional reactions, she stayed. During the intervention sessions, the mother began in the room but left for varying periods. A duration was selected, in discussion with George, about how much time he could remain apart from his mother. If George met this time and did not cry, whine, or show other emotional behavior, he could have access for 30 minutes to various toys and games or could receive a small piece of candy or a gift certificate that could be exchanged at a local toy store. If he did not meet the time, he would have a chance in the next session. While the mother was away (outside of the room or later off the premises) she would be called back (by cell phone) if George had an emotional reaction to the separation. That ended the session.
Figure 8.5 shows a baseline phase and the intervention phases. More and more minutes free from emotional reactions were required to earn the reinforcer. Although the demonstration seemed clear (in fact, the criterion was matched for all but 1 day, Day 30), a mini-reversal was introduced by decreasing the requirement to earn the reinforcer from 24 to 18 minutes. Behavior declined to the new criterion. In the final phase, the criterion was lowered on four occasions, and behavior fell to that level too. Throughout the study, performance matched the criterion. The demonstration is


Figure 8.5. Minutes without emotional behavior while George's mother is out of the room. Solid lines in the data represent jointly established therapist and participant goals. (Source: Flood & Wilder, 2004.)

particularly strong by showing changes in both directions, that is, bidirectional changes, as a function of the changing criteria.
In this example, there was little ambiguity about the effect of the intervention. In changing-criterion designs where behavior does not show this close correspondence, a bidirectional change may be particularly useful. When performance does not closely correspond to the criteria, the influence of the intervention may be difficult to detect. Adding a phase in which behavior changes in the opposite direction to follow a criterion reduces the ambiguity about the influence of treatment. Bidirectional changes are much less plausibly explained by extraneous factors (history, maturation) than are unidirectional changes.
The use of a mini-reversal phase in the design is helpful because of the bidirectional change it allows. The strength of this variation of the design is based on the underlying rationale of the ABAB designs. The mini-reversal usually does not raise all of the objections that characterize reversal phases of the ABAB design. The mini-reversal does not consist of completely withdrawing treatment to achieve baseline performance. Rather, the intervention remains in effect, and the expected level of performance still represents an improvement over baseline. The amount of improvement is decreased slightly to show that behavior change depends on the criterion that is set. Of course, in a given case, the treatment goal may be to approach the terminal behavior as soon as possible. Examination of bidirectional changes or a mini-reversal might not be feasible. Yet, this is not usually a return to baseline but just a temporary lower level of performance. That lower level is still likely to be well above the baseline rates and still reflect improvement.1

1 As implied in the discussion, a mini-reversal could be the second A phase in an ABAB design. Rather than withdrawing the intervention to return to baseline, the intervention might be modified to foster a slight change in the target behavior.

Other Variations
Another variation of the design is more esoteric in its applicability and is mentioned briefly. In this variation, referred to as the distributed-criterion design (McDougall, 2006), the application is for occasions in which multiple behaviors are of interest. The key feature of this variation is that multiple baselines, rather than just one baseline for a single target behavior, are incorporated into the changing-criterion design. These multiple behaviors are interrelated so that performance of one of the behaviors is known in some way to be related to the others.
Baseline data are gathered on two or more behaviors, very much as in a multiple-baseline design, where each is graphed separately. In the intervention phase, a separate criterion is specified for each of the baselines. All of the separate behaviors that are observed cannot respond or improve to the same extent, because of the relation among the behaviors. That is, if one performs the first behavior (e.g., studying arithmetic) for several minutes, one might not be able to do the same for the second (and third) behavior (e.g., studying English, history). There is only so much study time that might be available. In this variation, the criterion (e.g., amount of time studying) is distributed or allocated across the interdependent behaviors. The effect of the intervention is demonstrated in two ways: by showing that behavior matches a criterion and by showing that two (or more) behaviors change as the criterion shift is made for one of the behaviors.
An example of the design, perhaps appreciated by readers of the book, focused on a professional who wanted to increase research productivity, defined as time devoted to working on three manuscripts for journal publication (McDougall, 2006). The tasks included analyzing data, making charts, and writing/editing the manuscript. The intervention was self-management (e.g., self-monitoring, goal setting, graphing of performance). A total time per day to be allocated across all of the tasks was decided at the outset to be a mean (average) of 3 hours per day. This feature is what made the behaviors, working on three separate manuscripts, interdependent. Thus, if one worked on one manuscript for all 3 hours, no more time would be available to work on the others. The design is called a distributed changing-criterion design because the criterion (how much time to focus on each task) was spread (distributed) among the three tasks.
Time spent working on the three manuscripts was graphed, and a criterion was specified for working on each manuscript. In the first phase, 3 hours (180 minutes) were allocated as the criterion for the first manuscript, with no hours or minutes for the second and third manuscripts. Once progress was made on the first manuscript, the criterion was changed to 2 hours, 1 hour, and no time for Manuscripts A, B, and C, respectively. Progress consisted of the manuscript being nearly complete, waiting for feedback from others, or being submitted to a journal for publication. Once one manuscript was nearly complete or finished, attention (time) could be allocated to the other manuscripts.
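The allocation logic can be sketched compactly in Python. This is an illustration only; the phase labels and splits echo the example above, and the fixed 180-minute daily budget is the feature that makes the behaviors interdependent:

    # A fixed daily budget distributed across three interdependent tasks;
    # the split (the criterion for each task) is reallocated at each phase.
    phases = {
        "phase 1": {"A": 180, "B": 0,   "C": 0},
        "phase 2": {"A": 120, "B": 60,  "C": 0},
        "phase 3": {"A": 0,   "B": 120, "C": 60},
    }
    for label, split in phases.items():
        assert sum(split.values()) == 180  # distributed, never exceeded
        print(label, split)

Because the budget is fixed, raising the criterion for one manuscript necessarily lowers it for another, which is exactly the interdependence the distributed-criterion variation is designed to exploit.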
Figure 8.6 conveys the three criteria for Manuscripts A, B, and C, respectively, at the top of each phase after baseline. Thus, the criterion (180, 0, 0 minutes) means that the criterion was 3 hours (180 minutes) for the time devoted to Manuscript A and no time for each of the other manuscripts. One can see from this description and the graphical display that this is a combination of features of changing-criterion and multiple-baseline designs. Each baseline (time for Manuscript A, B, and C) changed when the criterion shift was made for that manuscript. The very stable baselines, the large changes

Figure 8.6. Moving average for research productivity (mean number of minutes expended daily) within baseline and intervention phases for Manuscripts A, B, and C. Horizontal lines indicate within-phase productivity criteria (i.e., minimum number of minutes to be expended on a manuscript) that the participant set. Labels for intervention phases are noted as numbered sequences (e.g., 120-60-0). The first number in the sequence indicates the within-phase criterion in minutes that the participant established for Manuscript A; the second and third numbers, for Manuscripts B and C, respectively. "X" indicates that a criterion was no longer pertinent because work on the manuscript was completed. (Source: McDougall, 2006.)


when the criterion was shifted, and the changes associated with each manuscript when and only when the criterion was changed all make this demonstration very strong.
The distributed-changing-criterion design is applicable to those instances in which time, effort, or some other dimension cannot be applied in an unlimited fashion to all the behaviors or goals of the program and when engaging in one behavior would affect how much time or effort could be placed into engaging in another. In this case, one "distributes" the time or effort across all of the behaviors. The effects of the intervention are shown by performance matching the criterion for each of the baselines and matching the criterion among baselines as the time or effort criterion changes. In most instances, there is no need to focus on separate behaviors that are interrelated in the fashion noted in this example. If there are multiple tasks that are interdependent and cannot all be improved at the same time, this design variation would be quite suitable. A multiple-baseline design might accomplish the same end. However, a broader point is worth conveying as well. Single-case designs need not be rigid, and elements from different designs can readily be combined to strengthen inferences about the effects of the intervention.

General Comments


Changing-criterion designs can vary along several dimensions, such as the number of times the criterion is changed, the duration of the phases in which the criterion is altered, the magnitude of the criterion change, whether the criterion is specified as a point or range, and whether there are unidirectional or bidirectional changes in the criterion. In each of these variations, the logic of the design remains the same. One looks for the step-like pattern that reflects shifts in client performance as a function of shifts in the criterion that are related to the intervention (e.g., receiving some consequence). These steps can provide a dramatic illustration of the impact of the intervention as performance closely approximates each criterion shift. The design provides optimal clarity when bidirectional changes are sought. This variation borrows features of the ABAB design by utilizing a mini-reversal phase. Overall improvements or the impact of extraneous events cannot plausibly explain a step-like function, nor explain changes in different directions as the criterion is made increasingly stringent and then, for a brief subphase, less stringent. The strength of this latter variation has led some investigators to include a bidirectional phase even when the results might be clear from the step-like function moving in one rather than two directions over the course of the intervention phase.

PROBLEMS AND LIMITATIONS
The unique feature of the changing-criterion design is the intervention phase in which performance is expected to change in response to different criteria. Ambiguity may arise in drawing inferences about the intervention if performance does not follow or correspond to the shifts of the criterion. There are different ways in which this ambiguity can be manifested.

Gradual Improvement Not Clearly Connected to Shifts in the Criterion


There may be an overall improvement in behavior that is due to some general influence (e.g., the client finally starting some intervention to work on a problem, novelty effects). Perhaps it is not the specific intervention per se, but participating in any program or structured activity that led to the change. For example, in the context of psychotherapy, expectations that the client will improve on the part of the therapist or the client may lead to improvement in a way that is analogous to a placebo effect in medicine. It is not the specific treatment per se but some general influence that causes the improvement. Alternatively, a new measurement procedure or introducing a new behavior-change agent may alter the motivation or performance of a client. That is not really the effect of the specific intervention but a more general influence on performance that causes the change. In such cases, one looks for a pattern in which performance improves overall and is not clearly changing in a step-like function in response to the criterion changes. In terms of threats to validity, the influence of history and maturation, as two examples, might account for a gradual change over time. This is the reason we want to see a step-like function on the graph that charts performance in response to the changing criteria. The greater the clarity of that step-like pattern, the more plausible it is that the intervention was responsible for change.
An investigator usually wants to see performance clearly change in response to criterion changes. Consider an early use of a changing-criterion design in an innovative program that reduced the cigarette smoking of a 24-year-old male (Friedman & Axelrod, 1973). During baseline, the client observed his own rate of cigarette smoking with a wrist counter. (His fiancée also independently counted smoking to assess reliability.) During the intervention phase, the client was instructed to set a criterion level of smoking each day that he thought he could follow. When he was able to smoke only the number of cigarettes specified by the self-imposed criterion, he was instructed to lower the criterion further.
The results are presented in Figure 8.7, in which the reduction and eventual termination of smoking are evident. In the intervention phase, several different criterion levels (short horizontal lines with the criterion number as superscript) were used. Twenty-five different criterion levels were included in the intervention phase. Although it is quite obvious that smoking decreased, performance did not clearly follow the criteria that were set. One could argue that there is an overall pattern of decreased smoking—clearly something important and desirable. The criterion levels were not really followed closely until Day 40 (criterion set at eight), after which close correspondence is evident. Yet the overall pattern of performance (getting better and better) and the very small changes in the criterion make the inferences about the intervention open to dispute. The results might have been much clearer if a given criterion level were in effect for a longer period of time and if the criterion shifts were slightly larger to see if that level really influenced performance. Of course we ought not to lose sight of a large change on an important problem, but we have less clarity about the basis for the change.

Figure 8.7. The number of cigarettes smoked each day during each of two experimental conditions. Baseline—the client kept a record of the number of cigarettes smoked during a 7-day period. Self-recording of cigarettes smoked—the client recorded the number of cigarettes smoked daily and attempted not to smoke more than the criterion level. The client set the original criterion and lowered the criteria at his own discretion. (The horizontal lines represent the criteria.) (Source: Friedman & Axelrod, 1973.)

Rapid Changes in Performance
Gradual change in performance that cannot be separated from the impact of criterion shifts is one problem in drawing inferences about the impact of the intervention. The other is when behavior changes rapidly and often exceeds the criterion. For example, consider a hypothetical program that is designed to reduce the daily calorie consumption of an overweight adult male. Baseline reveals that this person has been consuming 4,000 to 5,000 calories daily. (Recommended calories per day vary but typically are approximately 2,000 for an adult female and 2,500 for an adult male.) You have developed an intervention (e.g., a spouse-controlled points chart with little non-food treats and privileges as backup reinforcers) and a changing-criterion design; you intend to make shifts in the criterion in the usual way to eventually get closer to 2,500 calories. Your first criterion, as the intervention phase begins, sets calories at 3,800 or below—any calorie day below that earns points tracked by the spouse (as well as lavish praise). You begin the program, and improvements immediately exceed the initial criterion set for calorie intake. The person earns the points because calories fell below the criterion. Yet, let us say that each of the days shows that the person was below 2,800 calories—well below the criterion of 3,800 calories. One might well shift to a new criterion, say, 2,500 calories per day, and there just continues to be a rapid reduction, say, to 2,200 each day.
Rapid and large changes create two problems for the design. First, the changes make it unclear that there is correspondence between the criterion and the change. Second, the changes make it difficult to demonstrate change with a higher or more stringent criterion. That is, the person's performance already reaches new heights and provides data points that may well spill over to a higher (more stringent) criterion than was intended. There may be little room to show further changes in the criterion and few remaining opportunities to show that performance matches the criterion as that criterion is changed.
The changing-criterion design is especially well suited to situations in which behavior is to be altered gradually toward a terminal goal. This is the underlying rationale behind starting out with a relatively easy criterion and progressing over several different criterion levels. The rationale is sound. However, even though a criterion may only require small changes in behavior (e.g., calorie consumption, minutes of studying), it is possible that performance changes rapidly and greatly exceeds that criterion. In such cases, it may be difficult to evaluate intervention effects.

The effects of rapid changes in behavior that exceed criterion performance can be seen in a program designed to alter the fear in an 8-year-old boy named Rich. He met criteria for Autistic Disorder and was hospitalized at a facility for children with developmental disabilities (Ricciardi, Luiselli, & Camare, 2006).1 Rich showed a very intense fear of animated figures (e.g., electronic animated toys that blinked and lighted, such as a dancing Elmo doll, blinking holiday decorations). When seeing these stimuli, he would scream, try to escape, and hit anyone blocking his escape. He also met psychiatric diagnostic criteria for a phobia, which denotes persistent, excessive, and unreasonable fear and a strong response (e.g., in children, crying, tantrums, or clinging) in anticipation of exposure to the feared event or object. Several medications were tried and did not improve this behavior. An intervention program was used based on graduated exposure to fear- or anxiety-provoking stimuli, one of the most well-established evidence-based treatments for avoidance and phobias. The intervention provided access to preferred toys if he remained in proximity to the feared objects. The toys were placed at varying distances over the course of the intervention, and the distance in meters (marked in units on the floor) was the criterion that constantly changed in a changing-criterion design. Proximity of the toys to the feared objects changed and gradually exposed him more closely to the materials over the course of the sessions. Rich could leave the session any time he wanted. Intervals (each 15 seconds) were observed to assess how long he remained at the specified distance criteria during baseline and intervention phases.
Figure 8.8 shows criteria during subphases of the intervention phase (horizontal dashed line) where the criterion of meters away from the feared object was assessed (5 meters, 4 meters, etc.). Performance clearly changed from baseline. Rich tolerated exposure to the feared objects. Other measures during treatment (not graphed) showed he could approach and touch the objects when asked to do so. At discharge from the hospital, Rich's mother was encouraged to take him to stores with objects like the ones he feared. She did, and she reported that he tolerated these experiences well.
Clinically, the results are excellent. Access to the preferred objects was associated with improvements, but it is not clear that the intervention was responsible for change given the requirements of the design and their stringent application. The behavior did not follow the criterion shifts at all. There is no step-like function because of the rapid changes in behavior. In short, the rapid shift in performance and departure from criterion levels make the role of the intervention somewhat unclear.
In practice, one might expect that criterion levels will often be surpassed. If the behavior is not easy for the client to monitor or does not have discrete cut points (e.g., number of steps in a complex task with discrete units, such as getting dressed, might be easier to track than minutes of the overall activity), it may be difficult for him or her to perform the behavior at the exact point that the criterion is met. The response pattern that tends to exceed the criterion level slightly will guarantee earning of the consequence. To the extent that the criterion is consistently exceeded, ambiguity in drawing inferences about the intervention may result. One could select a range as the criterion, as discussed previously. In this case, the client would earn the consequence if performance fell within the range rather than above or below it. However, in school, clinic, or other applied settings, we do not want to put any ceiling on good performance (e.g., you only earn this if your math scores fall between 70% and 80% correct, but not higher!). Again, a mini-reversal can reduce ambiguity on the few occasions it is likely to arise.

1 Autistic Disorder is a formally recognized psychiatric disorder that emerges in early childhood and includes significant impairment in social interactions (e.g., avoids contact with others, lacks interest in others), communication (e.g., stereotyped or repetitive use of language, inability to initiate or sustain a conversation), and repetitive and stereotyped patterns of behavior (e.g., repetitive routines and rituals). It has been referred to as a pervasive developmental disorder to convey the scope of impact it can have on functioning. However, there are varying degrees, and it is part of a spectrum that reflects a range and severity of impairment.

Figure 8.8. Percentage of recording intervals in which Rich remained at a specific distance criterion during baseline and intervention sessions. The distance criteria are depicted by the triangle data path; arrows indicate sessions in which he left the room. [Panel title: SPECIFIC PHOBIA; phases labeled Baseline and Intervention.] (Source: Ricciardi, Luiselli, & Camare, 2006.)

Correspondence of the Criterion and Behavior
The strength of the demonstration depends on showing a close correspondence between the criterion and behavior over the course of the intervention phase. In some of the examples in this chapter, behavior fell exactly at the criterion levels or range on virtually all occasions of the intervention phase. In such instances, there is little ambiguity regarding the impact of the intervention. Typically, behavior will not fall exactly at the criterion level. When correspondence is not exact, it may be difficult to evaluate whether the intervention accounts for the change. Correspondence is a matter of degree, and the ability to attribute the changes to the intervention is as well.
Consider an example of a changing-criterion design to evaluate a feeding program in a 3-year-old boy, Sam, who was born prematurely and had several medical conditions (e.g., lung disease, esophageal reflux) (Luiselli, 2000). He also refused to eat and required tube feeding through a pump activated through waking and sleeping hours. He rejected all food. The intervention focused on developing self-feeding, a behavior divided into several steps for teaching and data evaluation (e.g., grasping a spoon, scooping up food, placing the spoon in his mouth, swallowing). Baby food was used as the meal, as suggested by the primary-care physician. Number of bites of food was used as the target behavior during the meal; a changing-criterion design was used to specify the number of bites Sam would have to eat to earn reinforcers at the end of the meal (access to toys for 30 minutes). If he did not meet the criterion, he was merely told he would have the opportunity at the next meal. The meals began with one of Sam's parents reminding him of how many bites were required to earn the reinforcer, placing a card next to his bowl showing the number of bites that was the criterion during that time, and praising taking the bites of food.
Figure 8.9 shows the number of bites per meal over several days. The horizontal line specifies the criterion, and the data points show days above (higher than) or below (lower than) that criterion. If one looks at the baseline (no feeding) and the one-bite criterion phase, it is clear that something happened when treatment began. It is also clear that over the course of treatment there was great change. Most would agree that the intervention was responsible for change, but many days are above or below the criterion, and the precise role of the intervention might be questioned. Did the behavior follow the criterion shifts? Clearly in some subphases the criterion was matched. The data are striking in the subphase in which eight bites was the criterion and every day met that criterion.

Figure 8.9. The average (mean) number of self-feeding responses recorded during daily lunch and supper meals. Horizontal dashed lines preceded by numbers indicate the imposed self-feeding response criterion during meals. (Source: Luiselli, 2000.)

Currently, no clearly accepted measure is available to evaluate the extent to which the criterion level and behavior correspond. Hence, a potential problem in changing-criterion designs is deciding when the criterion and performance correspond closely enough to allow the inference that treatment was responsible for change.3 More is discussed on evaluating change in the data analysis chapter (and appendix) in relation to all of the single-case designs.
In some cases in which correspondence is not close between performance and the criterion for each phase, authors refer to the fact that average (mean) levels of performance across subphases show a stepwise relationship. Even though actual performance does not follow the criterion closely, the mean rate of performance within each subphase may in fact change with each change in the criterion. Alternatively, investigators may note that performance fell at or near the criterion in each subphase on all or most of the occasions, and they may provide the proportion of instances. Hence, even though performance levels did not fall exactly at the criterion level, it is clear that the criterion was associated with a shift or new level of performance. As of yet, consistent procedures for evaluating correspondence between behavior and the criterion have not been adopted.
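The two informal summaries just described (stepwise subphase means, and the proportion of occasions falling at or near each criterion) are straightforward to compute. The following minimal Python sketch uses hypothetical data; the subphase values, the tolerance of one unit, and the function name are illustrative assumptions:

    def subphase_summaries(performance, criteria, tolerance=1):
        """For each subphase, report the mean level of performance and the
        proportion of days at or near (within `tolerance` of) the criterion.
        All data here are hypothetical."""
        summaries = []
        for days, criterion in zip(performance, criteria):
            mean_level = sum(days) / len(days)
            near = sum(1 for d in days if abs(d - criterion) <= tolerance)
            summaries.append((criterion, round(mean_level, 2),
                              round(near / len(days), 2)))
        return summaries

    # Hypothetical bites-per-meal data across four subphases.
    performance = [[1, 1, 2, 1], [2, 3, 2, 2], [4, 5, 3, 4], [8, 8, 8, 8]]
    criteria = [1, 2, 4, 8]
    for criterion, mean_level, prop_near in subphase_summaries(performance, criteria):
        print(criterion, mean_level, prop_near)

A clear stepwise rise in the subphase means, or a high proportion of days near each criterion, is the kind of pattern investigators point to when exact point-by-point correspondence is absent.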
The ambiguities that arise when the criterion and performance levels do not closely correspond are largely resolved by examining bidirectional rather than unidirectional changes in the intervention phase. When bidirectional changes are made, the criterion may be more stringent and less stringent at different points during the intervention phase. It is easier to evaluate the impact of the intervention when looking for changes in different directions (decrease followed by an increase in performance) than when looking for a point-by-point correspondence between the criterion and performance. The bidirectional changes draw on the logic of single-case designs more generally, where one describes, predicts, and tests predictions of likely performance, as elaborated in the presentation of ABAB designs. Showing bidirectional changes makes it implausible that extraneous factors could explain the pattern of results. Hence, when ambiguity exists in any particular case about the correspondence between the changing criterion and behavior, a mini-reversal over one of the subphases of the design can be very useful.

Magnitude of Criterion Shifts


The previous comments note that too gradual a change, too rapid a change, and too large a change can raise problems in drawing inferences in the design. What is the Goldilocks (just right) level? That is not quite the right question. The issue is showing a step-like function so that the investigator and those who consider the demonstration are persuaded that it was the intervention that provides the best explanation of the pattern in the data.

3 One suggestion to evaluate the correspondence between performance and the criterion over the course of the intervention phase might be to compute a Pearson product-moment correlation. The criterion level and actual performance would be paired each day to calculate a correlation. Unfortunately, a product-moment correlation may provide little or no information about the extent to which the criterion is matched. Actual performance may never match the changing criterion during the intervention phase, and the correlation could still be perfect (r = 1.00). The correlation could result from the fact that the differences between the criterion and performance were constant and always in the same direction. The product-moment correlation provides information about the extent to which the two data points (criterion and actual performance) covary over assessment occasions but not whether one matches the other in absolute value.
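The limitation described in the footnote is easy to verify directly. In the brief Python sketch below (all numbers are hypothetical), performance always misses the criterion by a constant two units, yet the product-moment correlation is perfect:

    def pearson_r(x, y):
        """Pearson product-moment correlation, computed from first
        principles to keep the example self-contained."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    criterion = [10, 12, 14, 16, 18]           # hypothetical changing criteria
    performance = [c + 2 for c in criterion]   # always 2 units off the criterion
    print(pearson_r(criterion, performance))   # 1.0, despite zero exact matches

Because the offset is constant, the two series covary perfectly even though performance never lands on the criterion in absolute value, which is exactly the ambiguity the footnote warns about.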
In designing the study, perhaps the most important design consideration is the magnitude of the criterion shift that is made over the subphases when the intervention is in effect. The basic design specifies that the criterion is changed at several different points. Yet no clear guidelines are inherent in the design that convey how much the criterion should be changed at any given point. The particular clinical problem or focus and the client's performance determine the amount of change made in the criterion over the course of the intervention phase. The client's ability to meet initial criterion levels and relatively small shifts in the criterion may signal the investigator that larger shifts (i.e., more stringent criteria) might be attempted. Alternatively, failure of the client to meet the constantly changing criteria may suggest that smaller changes might be required.
Even deciding the criterion that should be set at the inception of the intervention phase may pose questions. For example, if decreasing the consumption of cigarettes is the target focus, the intervention phase may begin by setting the criterion slightly below baseline levels. The lowest or near-lowest baseline data point might serve as the first criterion for the intervention phase. Alternatively, the investigator might specify that a 10 or 15% reduction of the mean baseline level would be the first criterion. In either case, it is important to set a criterion that the client can meet. The appropriate place to begin, that is, the initial criterion, may need to be negotiated with the client. As performance meets the criterion, the client may need to be consulted again to decide the next criterion level. At each step, the client may be consulted to help decide the criterion level that represents the next subphase of the design. In many cases, of course, the client may not be able to negotiate the procedures and changes in the criterion (e.g., children and adolescents with severe developmental disabilities; elderly with severe cognitive impairment).
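Both of the starting rules just mentioned are simple to compute. A brief Python sketch follows; the baseline numbers are hypothetical and the rounding is an illustrative choice:

    # Hypothetical baseline: cigarettes smoked per day over one week.
    baseline = [22, 25, 19, 24, 21, 26, 23]

    # Rule 1: the lowest (or near-lowest) baseline data point.
    first_criterion_low = min(baseline)            # 19

    # Rule 2: a 10% or 15% reduction from the mean baseline level.
    mean_level = sum(baseline) / len(baseline)     # about 22.9
    first_criterion_10 = round(mean_level * 0.90)  # 21
    first_criterion_15 = round(mean_level * 0.85)  # 19

    print(first_criterion_low, first_criterion_10, first_criterion_15)

Either value would make an attainable first criterion; as noted above, the starting point (and each later shift) can be negotiated with the client where feasible.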
With or without the aid of the client, the investigator decides the steps or changes in the criterion. Three general guidelines can be provided. First, the investigator usually should proceed gradually in changing the criterion to maximize the likelihood that the client can meet each criterion. Abrupt and large shifts in the criterion may mean that relatively stringent performance demands are placed on the client. The client may be less likely to meet stringent criterion levels than more graduated criterion levels. Thus, the magnitude of the change in the criterion should be relatively modest to maximize the likelihood that the client can successfully meet that level.

Second, the investigator should change the criteria over the course of the intervention phase so that correspondence between the criteria and behavior can be detected. The change in each criterion must be large enough so that one can discern that performance changes when the criterion is altered. The investigator may make very small changes in the criterion. However, if variability in performance is relatively large, it may be difficult to discern that the performance followed the criterion. Hence, there is a general relationship between the variability in the client's performance and the amount of change in the criterion that may need to be made. The more variability in day-to-day performance during the intervention phase, the greater the change needed in the criterion from subphase to subphase to reflect change.
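One concrete way to apply this guideline is to compare a planned criterion step with the day-to-day variability observed so far. The sketch below is only a heuristic; the data, the step size, and the use of the standard deviation as the yardstick are illustrative assumptions rather than a rule from the design:

    import statistics

    # Hypothetical minutes-of-studying data from the current subphase.
    current_subphase = [32, 38, 30, 41, 35, 39]

    day_to_day_sd = statistics.stdev(current_subphase)
    planned_step = 5  # proposed change in the criterion for the next subphase

    # Heuristic: a step smaller than the day-to-day spread will be hard
    # to discern on the graph; a larger step should yield a visible shift.
    if planned_step < day_to_day_sd:
        print(f"Step of {planned_step} is within noise (SD = {day_to_day_sd:.1f}).")
    else:
        print(f"Step of {planned_step} exceeds the SD of {day_to_day_sd:.1f}.")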
The relationship between variability in performance and the changes in the criteria necessary to reflect change is illustrated in two hypothetical changing-criterion designs displayed in Figure 8.10. The upper panel shows that subject variability is relatively high during the intervention phase, and it is relatively difficult to detect that the performance follows the changing criterion. The lower panel shows that subject variability is relatively small during the intervention phase and follows the criterion closely. In fact, for the lower panel, smaller changes in the criteria probably would have been adequate and the correspondence between performance and criteria would have been clear. In contrast, the upper panel shows that much larger shifts in the criterion would be needed to demonstrate unambiguously that performance changed systematically.
It is important to bear in mind that changes in the criterion need not be in equal steps over the course of the intervention. And there is no virtue in consistency in single-case designs. So a 10% increase in the criterion need not be a guiding principle—just the opposite. As I have noted, the strength of the single-case designs is making decisions in response to the data. Unequal steps up or down (bidirectional) are fine.

Figure 8.10. Hypothetical examples of changing-criterion designs. The upper panel shows data with relatively high variability (fluctuations). The lower panel shows relatively low variability. Greater variability makes it more difficult to show that performance matches or is influenced by the changing criterion. In both of these graphs, the mean level of performance increased with each subphase during the intervention phase. The influence of the criterion is clearer in the lower panel because the data points hover more closely to the criterion in each subphase.

It is important never to lose sight of the goal, namely, to rule out the likelihood that extraneous influences (e.g., history, maturation, novelty) could explain the data pattern. We do not need equal steps in criterion changes to accomplish the goal. If anything, one might argue that the goal is better achieved by not changing in equal-size steps. For example, in the beginning or first two subphases of the intervention phase, smaller changes in the criterion may be needed to maximize opportunities for the client's success. As progress is made, the client may be able to make larger steps in reducing or increasing the behavior. Correspondence of client behavior and criteria that vary by large steps can strengthen the demonstration because few other interpretations (e.g., history, maturation) are likely to compete with the intervention in plausibly explaining the results.

General Comments
Ambiguities that arise in the changing-criterion design usually pertain to the correspondence between the multiple criteria (across subphases) and behavior. Some of the potential problems of the lack of correspondence can be anticipated and possibly circumvented by the investigator as a function of how and when the criteria are changed. The purpose of changing the criterion from the standpoint of the design is to provide several subphases during the intervention phase. In each subphase, it is important to be able to assess the extent to which performance meets the criterion. Across all subphases, it is crucial to be able to evaluate the extent to which the criteria have been followed in general. These specific and overall judgments can be facilitated by keeping individual subphases in effect until performance stabilizes. Also, the magnitude of the criterion shifts should be made so that the association between performance and the criterion can be detected. The criterion should be changed so that performance at the new criterion level will clearly depart from performance at the previous criterion level. Finally, a change in the intervention phase to a previous criterion level will often be very helpful in determining the relation between the intervention and behavior change.

EVALUATION OF THE DESIGN
The changing-criterion design has several features that make it useful in applied settings as well as methodologically sound. The design does not require withdrawing treatment, as in the ABAB design. The multiple problems related to reverting behavior to baseline levels are avoided. (The mini-reversals are not a return-to-baseline condition but a return to slightly lower levels of improvement.) Also, the design does not require withholding treatment from some of the different behaviors, persons, or situations in need of the intervention, as is the case with variations of the multiple-baseline design. A convincing demonstration of the effect of the intervention is provided if the level of performance in the intervention phase matches the criterion as that criterion is changed.
The most salient practical advantage of the design is the gradual approximation of the final level of the desired performance. Repeatedly changing the criterion means that the goal of the program is approached gradually. A large number of behaviors in education and treatment may be approached in this gradual fashion (e.g., amount of reading, time studying, time without disruptive behavior). Increased demands are placed on the client (i.e., more stringent criteria) only after the client has shown mastery of performance at an easier level.4

4 A behavior-change technique referred to as shaping is important to mention. Shaping consists of providing reinforcers (e.g., praise, approval, feedback, points) for successive approximations of a final response. One identifies a final goal response (e.g., completing 45 minutes of homework or music practice, cleaning all toys from the floor of one's room, eating < 3,000 calories per day, exercising 5 days a week, going to bed at a specific time earlier than the current bedtime, reading a book on methodology each week). Also, a reinforcer is identified (e.g., praise, graphed feedback, point system with backup rewards). Shaping begins by providing the reinforcer for a small change in the goal response (e.g., 10 minutes of homework if baseline showed 0 minutes). As behavior changes and is consistent, the criterion is increased. This is an effective technique to alter behavior and is consistent with the methodological requirements of changing-criterion designs.
There is a potential conflict between developing behavior gradually and meeting the requirements of the changing-criterion design. In developing behavior, progress or the requirements may be small in response to how the client is doing. The design usually requires changes in the criterion in steps that are large enough to show that performance clearly corresponds to the criterion level and continues to do so as the criterion is altered. In fact, in principle, if large changes are made (the steps are large) and correspondence of performance matches these large steps, the design is maximally persuasive as an experimental demonstration. So the potential conflict is changing the criterion gradually enough to constitute sound training, that is, encouraging approximations of behavior, and changing the criterion in steps that allow the investigator to see and show that performance really does change in response to the changing criterion. One way to resolve this is to begin with small shifts in the criterion for the purposes of training, as needed. However, along the way, include one phase that lowers rather than increases the criterion slightly to show a mini-reversal.
There are many excellent features of the changing-criterion design. Even so, the design has been used much less often than have other designs. One can only speculate as to why. First, the guidelines for using the design, such as where to set criteria, when to change, and how to decide whether there is correspondence of the criterion and performance, are slightly less clear than guidelines for using other designs. Second and related, developing behavior gradually does not have clear guidelines and has an "art" feature. One moves the client's behavior forward progressively, but how much progress and at what increments of steps? Here is a case where the design connects with substantive issues about behavior change more generally, that is, how to develop performance gradually and to reach a criterion. The design is quite useful because of the shaping feature and flexibility in changing the criteria for performance. With mini-reversals (that are not a complete return to baseline levels) the designs can reflect unequivocal control over behavior and make implausible the influence of extraneous events in explaining the results.

SUMMARY AND CONCLUSIONS
The changing-criterion design demonstrates the effect of an intervention by showing that performance changes at several different points during the intervention phase as the criterion is altered. A clear effect is evident if performance closely follows the changing criterion. In most uses of the design, the criterion for performance is made increasingly more stringent over the course of the intervention phase. Hence, behavior continues to change in the same direction. In one variation of the design, the criterion may be made slightly less stringent at some point in the intervention phase to determine whether the direction of performance changes. The use of a mini-reversal phase to show that behavior increases and decreases depending on the criterion can clarify the demonstration when close correspondence between performance and the criterion level is not achieved.

An important issue in evaluating the changing-criterion design is deciding when correspondence between the criterion and performance has been achieved. Unless there is a close point-by-point correspondence between the criterion level and performance, it may be difficult to infer that the intervention was responsible for change. Typically, investigators have inferred a causal relationship if performance follows a step-like function so that changes in the criterion are followed by changes in performance, even if performance does not exactly meet the criterion level.
Drawing inferences may be especially difficult if performance changes rapidly and in a leap beyond the criterion as soon as the intervention is implemented. The design depends on showing gradual changes in performance as the terminal goal is approached. If performance greatly exceeds the criterion level, the intervention may still be responsible for change. Yet because the underlying rationale of the design depends on showing a close relationship between performance and criterion levels, conclusions about the impact of treatment will be difficult to infer.

Certainly a noteworthy feature of the design is that it is based on gradual changes in behavior. The design is consistent with developing performance gradually; few performance requirements are made initially, and these requirements are gradually increased as the client masters earlier criterion levels. In many educational and clinical situations, the investigator may wish to change client performance gradually. For behaviors involving complex skills or where improvements require relatively large departures from how the client usually behaves, gradual approximations may be especially useful. Hence, the changing-criterion design may be well suited to a variety of problems, clients, and settings.
CHAPTER 9

Multiple-Treatment Designs

CHAPTER OUTLINE

Basic Characteristics of the Designs
Major Design Variations
Multi-element Design
Description and Underlying Rationale
Illustration
Alternating-treatments or Simultaneous-treatment Design
Description and Underlying Rationale
Versions without Initial Baseline
Other Multiple-treatment Design Options
Simultaneous Availability of All Conditions
Randomization Design
Combining Components
Additional Design Variations
Conditions Included in the Design
Final Phase of the Design
General Comments
Problems and Considerations
Omitting the Initial Baseline
Type of Intervention and Behaviors
Discriminability of the Interventions
Number of Interventions and Stimulus Conditions
Multiple-treatment Interference
Evaluation of the Designs
Summary and Conclusions

The designs discussed in previous chapters usually restrict themselves to the evaluation of a single intervention or treatment. In applied settings such as the classroom, home, or health-care setting, the investigator often is interested in comparing and testing two or more interventions, identifying which one is more or most effective, and applying that to optimize change in the client. Difficulties arise when the investigator is interested in comparing two or more interventions within the same subject. If two or more treatments are applied to the same subject in ABAB or multiple-baseline designs, they are given in separate phases so that one comes before the other at some point in the design. The sequence in which the interventions appear partially restricts the conclusions that can be reached about the relative effects of two or more treatments. In an ABCABC design, for example, the effects of C may be better (or worse) because it followed B. The effects of the two interventions (B and C) may be very different if they were each administered by themselves without one being preceded by the other. Also, evaluating multiple interventions in a design such as ABAB or multiple-baseline takes time (many days), as each intervention (B, C) requires several days or more to show stable levels to meet the describe, predict, and test functions over time. Multiple-treatment designs allow the comparison of two or more treatments, usually within the same intervention phase. This chapter presents characteristics of the designs and highlights some of their many variations.1

BASIC CHARACTERISTICS OF THE DESIGNS
Many variants of multiple-treatment designs have been used. They share some overall characteristics regarding the manner in which separate treatments are compared. In each of the designs, a single behavior of one or more persons is observed to obtain baseline data. After baseline, the intervention phase is implemented, in which the behavior is subjected to two or more interventions. These interventions are implemented in the same intervention phase. Both are not in effect at the same time. For example, two procedures such as praise and reprimands might be compared to determine their separate effects on disruptive behavior in an elementary school classroom. Both interventions would not be implemented at the same moment. The interventions have to be administered separately in some way so their separate impact can be evaluated and compared. In a manner of speaking, the interventions must "take turns" in terms of when they are applied. The variations of multiple-treatment designs depend primarily on the precise manner in which the different interventions are scheduled so they can be evaluated.

MAJOR DESIGN VARIATIONS

Multi-element Design
Description and Underlying Rationale. The multi-element design consists of implementation of two or more interventions in the same phase. The unique and defining feature of the multi-element design is that the separate interventions are associated or consistently paired with distinct stimulus conditions. The major purpose of the design is to show that the client performs differently under the different treatment conditions, as reflected in differences in performance associated with the different stimulus conditions.

1 Many terms have been used to represent multiple-treatment designs, and these could occupy their own chapter. The terminology reflects variations in how multiple treatments are compared. Some of the inconsistencies in terminology stem from how the designs were first used (e.g., basic research on reinforcement schedules) and efforts to craft new variations of how to compare treatments. The critical issue for this book is understanding how the design variations can be used in applied settings and what the critical components are to draw inferences about the impact of the interventions. The key feature of the designs in this chapter is the use of more than one intervention in a way that allows relatively rapid comparison of their impact.
The multi-element design has been used extensively in laboratory research with non-human animals in which the effects of different reinforcement schedules have been examined. That use helped establish "multiple-schedule design" as another term. Different reinforcement schedules were administered at different times during an intervention phase. Each schedule is associated with a distinct stimulus (e.g., a light that is on or off). After the stimulus has been associated with its respective intervention, a clear discrimination is evident in performance. When one stimulus is presented, one pattern of performance is obtained. When the other stimulus is presented, a different pattern of performance is obtained. The difference in performance among the stimulus conditions is a function of the different interventions associated with each stimulus. The design is used to demonstrate that the client or organism can discriminate in response to the different stimulus conditions. The multi-element term reflects in part the broad use of this design beyond examining schedules of reinforcement.
The underlying rationale unique to this design pertains to the differences in responding that are evident under the different stimulus conditions. If the client makes a discrimination in performance between the different stimulus conditions and their respective interventions, the data should show clear differences in performance. On any given day, the different stimulus conditions and their respective interventions are implemented. Performance may vary markedly depending on the precise condition in effect at that time. If the stimulus conditions and interventions do not differentially influence performance, one would expect an unsystematic pattern across the different intervention conditions, and performance will not differ. Similarly, if extraneous events (and threats to validity) rather than the treatment conditions were influencing performance, one might see a general improvement or decrement over time. Improvements due to extraneous events would be likely to appear under each of the different stimulus conditions. When performance does differ across the stimulus conditions, this is plausibly explained by the differential effectiveness of the interventions.

Illustration. The design emphasizes the control that certain stimulus conditions exert after being paired with various interventions. An excellent example of the design is from a study that focused on compliance of two 4-year-old children; the children were occasionally noncompliant (Wilder, Atwell, & Wine, 2006). The study focused on implementation of the intervention, that is, how faithfully implementation was carried out or, stated another way, whether the intervention was delivered as it was intended. The extent to which the intervention is delivered as intended is referred to as treatment integrity or treatment fidelity. This is a critical topic in any area of intervention research (e.g., educational programs, surgery, medication, cognitive behavioral techniques, rehabilitation) and is discussed in single-case and between-group designs (see McIntyre, Gresham, DiGennaro, & Reed, 2007; Perepletchikova & Kazdin, 2005). In education and clinical psychology, for example, developing effective (evidence-based) interventions is only part of the challenge. Once they are developed, getting individuals (teachers, therapists) to implement them at all or to implement them correctly is a challenge. In any case, an effective treatment is not likely to be very effective if it is not implemented carefully.
This study focused on the issue by measuring compliance with instructions administered by an adult in three different situations or contexts: in a small room, in the classroom, and on the playground. To obtain compliance, the intervention consisted of asking the child to perform one of three instructions (e.g., give me the snack item, put the toy away, and come here). If the child did this within 10 seconds, this was counted as compliance; if not, this was counted as noncompliance. Praise was provided for immediate compliance during the intervention phase. In addition, if the child did not comply immediately, the trainer went through a sequence of steps to promote compliance. These steps included making eye contact with the child, stating the child's name, repeating the instruction, modeling correct performance, and guiding the child to perform the activity.
The purpose of the study was to compare three variations of carrying out this procedure that reflect treatment integrity, that is, how faithfully the procedures were executed by the trainer, namely, 100%, 50%, or 0% of the time. In the 100% condition, the trainer followed the procedure each time the child did not comply; that is, the praise and prompting were implemented as intended. In the 50% level, the trainer did the procedure as specified on only half of the opportunities. On the other half of the trials (instruction opportunities), the therapist did not do any aspect of the procedure. In the 0% level, the procedure was never done.
A multi-element design requires different interventions. I have described these (100%, 50%, and 0%) and noted that the interventions have to be consistently associated with a particular stimulus condition. In this case, the instructions were the stimulus conditions. A particular instruction (e.g., pick up a toy) was always associated with one of the interventions (e.g., 100%) and the other interventions (50%, 0%) with one of the other instructions. There were two children, so which instruction was associated with which intervention was varied. In short, interventions were associated with specific conditions (in this case specific instructions) to see if the three interventions made any difference. If they did make a difference, one would expect to see different patterns of responding on the part of the child.
Figure 9.1 presents the compliance data for two children. During baseline, no intervention was provided and the separate lines reflect the different instructions. That is, for all three instructions, there was no intervention during baseline. During the intervention phase the three levels of integrity were compared. When the intervention was implemented 100% of the time, compliance was very high; 50% of the time, above baseline but not as high as 100%; 0% of the time, essentially a continuation of baseline, and no real change occurred in child behavior. All of the changes can be seen in relation to the respective instructions with which each one was associated.
The design clearly conveys different intervention effects associated with the instructions. In passing, it is useful to punctuate the significance of the demonstration in two contexts. First, the use of behavioral techniques (e.g., use of antecedents, behaviors, and consequences) to change behavior is very effective in diverse contexts (home, school) and constitutes an evidence-based treatment for oppositional behavior among children. This demonstration underscores the importance of treatment integrity; the techniques convey what to do (e.g., reinforce and prompt) but no less important is the how, that is, whether the procedures are faithfully rendered. Second, noncompliance in children is often talked about as if it is "in" the child. It is true that some individuals are more oppositional than others. Yet, this study helps to convey that contexts and interventions of others can greatly influence compliance and noncompliance as well.

Figure 9.1. Percentage of trials with compliance for two 4-year-old children. A prompting procedure was used to foster compliance if the child did not comply with the instruction immediately. The prompting procedure (accompanied by praise whenever compliance occurred) was done exactly the same way. During the intervention phase, three levels of implementing the procedures were compared: the trainer implemented the intervention 100%, 50%, or 0% of the time. These levels of carrying out the procedure were associated with specific instructions. [Panels plot percentage of compliance across sessions, with data paths labeled by instruction and integrity level (e.g., Come Here - 100%, Toy Away - 50%, Give Snack - 0%; lower panel: Cara).] (Source: Wilder et al., 2006.)

Alternating-treatments or Simultaneous-treatment Design

Description and Underlying Rationale. In the multi-element design, separate interventions are applied under different stimulus conditions. Typically, each intervention is associated (paired) with a particular stimulus (e.g., adult, time period, or "instruction" in the previous example) to show that performance varies systematically as a function of the stimulus that is presented. Usually the multi-element design is reserved for instances in which the interventions are purposely paired with particular stimuli. In this way intervention effects are seen when performance varies as a function of that stimulus or context.
As noted earlier, in applied research the usual priority is to evaluate the relative impact of two or more treatments free from the influence of any particular stimulus condition or context. That is, we want to find out what intervention (e.g., to develop reading) is the more or most effective, and we are not interested in showing that this is associated with some unique stimuli (e.g., a teacher, class period). Multiple treatments can be readily compared in single-case research without associating the treatments with a particular stimulus or context. In fact, the goal is to evaluate the impact of treatments in a way to be sure they are not merely tied to, connected with, or confounded by a particular condition.

When different treatment conditions are varied or alternated across different stimulus conditions (e.g., times of the day, teachers, or settings), the design can be distinguished from a multi-element design. The treatments are administered across different stimulus conditions, but the interventions are balanced (equally distributed) across these conditions. At the end of the intervention phase, one can examine the effects of the interventions in a way that is not confounded by or uniquely associated with a particular stimulus condition or context. Stated another way, in a multi-element design the interventions are purposely connected to a particular stimulus or context; in an alternating-treatments design, the interventions are purposely balanced across and disconnected from a particular stimulus.
The underlying rationale of the design is similar to that of the multi-element design. After baseline observations, two or more interventions are implemented in the same phase to alter a particular behavior. The distinguishing feature is that the different interventions are distributed or varied across stimulus conditions in such a way that the influence of these interventions can be separated from the influence associated with the different stimulus conditions. The different names of the design reflect how to best represent what is actually done. The different interventions are alternated during the intervention phases, explaining why some have chosen to refer to this as an alternating-conditions or alternating-treatments design (Barlow & Hayes, 1979; Ulman & Sulzer-Azaroff, 1975). The different conditions are administered in the same phase, usually on the same day, and thus the design has also been referred to as a simultaneous-treatment or concurrent-schedule design (Hersen & Barlow, 1976; Kazdin & Hartmann, 1978).2

2 None of the terms for this design quite accurately describes its unique features. "Alternating-treatments" design incorrectly suggests that the interventions must be active interventions. Yet, "no treatment" or continuation of baseline can be used as one of the conditions that is alternated so there is only one treatment. The word "treatment" too is odd. Many if not most applications of the design have been in education where nothing is being "treated" (e.g., in any medical or psychological sense). Also, alternating treatments is sufficiently broad to encompass multi-element designs in which treatments are alternated. "Simultaneous-treatment" design incorrectly implies that the interventions are implemented simultaneously. If this were true, the effectiveness of the separate interventions could not be independently evaluated. They are usually administered concurrently. "Concurrent schedule" design implies that the interventions are restricted to reinforcement schedules, from basic and applied research within behavior analysis. As noted in a prior chapter, many of the designs grew out of behavior analysis but now have novel applications where such topics as reinforcement schedules are not a focus. For present purposes, the term "alternating-treatments design" is used because it has been adopted by the majority of investigators reporting the design.

The design usually begins with baseline observation of the target response. The observations are obtained daily under two or more conditions, such as two times per day (e.g., morning or afternoon) or in two different locations (e.g., classroom and playground). During the baseline phase, the target behavior is observed daily under each of the conditions or settings. After baseline observations, the intervention phase is begun. In the usual case, two different interventions are compared. Both interventions are implemented each day. However, the interventions are administered under the different stimulus conditions (e.g., times of the day, situations or settings). The interventions are administered an equal number of times across each of the conditions of administration so that, unlike the multi-element design, the interventions are not uniquely associated with a particular stimulus. The intervention phase is continued until the response stabilizes under the separate interventions.
The crucial feature of the design is the unique intervention phase in which separate interventions are administered concurrently. Hence, it is worthwhile to detail how the interventions are varied during this phase. Consider as a hypothetical example a design in which two interventions (I1 and I2) are to be compared. The interventions are to be implemented daily but across two separate time periods (T1 and T2) or sessions during the day. The interventions are balanced across the time periods. Balancing refers to the fact that each intervention is administered under each of the conditions an equal number of times. On any given day the interventions are administered under separate conditions, but both are administered.
Table 9.1 illustrates different ways in which the interventions m ight be ad m inis-
tered on a daily basis. A s evident from Table 9. i A , each intervention is adm inistered
each day, and the tim e period in which a particular intervention is in effect is alter-
nated daily. In Table 9.1 A , the alternating pattern is accom plished by sim ply having
one intervention adm inistered first on one day, second 011 the next, and then just
switching the order every day for the rest o f the intervention phase. At the end o f
the intervention phase, each intervention was first (and second) an equal num ber o f
times o r close to that ( if the intervention phase were an odd rather than even num ber
o f days).
Table 9.1B shows that the alternating pattern could be randomly determined, with the restriction that throughout the intervention phase each intervention appears equally often in the first and second time period. This randomly ordered sequence can be generated from a table of random numbers by reading down a long list of numbers and pulling out the 1s and 2s to determine the order of presenting Interventions 1 and 2. (Of course, in true randomness, it is possible



that a long string of 1s would appear in the table.) Consequently, one selects pairs of numbers with the restriction that a 1 and a 2 have to be in each pair. That way, at the end of the intervention phase each intervention appears equally often, although the order was random.
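To make this scheduling logic concrete, here is a minimal sketch in Python (my illustration, not a procedure from the sources cited; the function names are invented). It generates the strict alternation of Table 9.1A and a random order with the balancing restriction of Table 9.1B, assuming two interventions and two daily time periods:

import random

def alternating_schedule(n_days):
    """Strict alternation (Table 9.1A): the intervention given first
    on one day is given second on the next, and so on."""
    return [("I1", "I2") if day % 2 == 0 else ("I2", "I1")
            for day in range(n_days)]

def randomized_schedule(n_days):
    """Random order (Table 9.1B): each intervention occupies each
    time period equally often, but which ordering falls on which
    day is randomly determined."""
    assert n_days % 2 == 0, "balancing requires an even number of days"
    half = n_days // 2
    days = [("I1", "I2")] * half + [("I2", "I1")] * half
    random.shuffle(days)
    return days

# Each tuple is (time period 1, time period 2) for one day.
print(alternating_schedule(6))
print(randomized_schedule(6))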
The table refers to the schedule of administering the different interventions during the first intervention phase. If one of the interventions is more effective than the other(s), the design usually concludes with a final phase in which that intervention is administered across all conditions. That is, the more (or most) effective intervention is applied across all time periods or situations included in the design.
A hypothetical example of the data plotted from a simple version of the alternating-treatments design is illustrated in Figure 9.2. In the example, observations were made daily for two time periods. The baseline data are plotted separately for these periods. During the intervention phase, two separate interventions were implemented and were balanced across the time periods. In this phase, data are plotted according to the interventions so that the differential effects of the interventions can be seen. Because Intervention 1 was more effective than Intervention 2, it was implemented across both time periods in the final phase. This last phase provides an opportunity to see if behavior improves in the periods in which the less effective intervention had been administered. Hence, in this last phase, data are plotted according to the different time periods as they were balanced across the interventions, even though both receive the more effective procedure. As evident in the figure, performance improved in those time periods that previously had been associated with the less effective intervention.

Figure 9.2. Hypothetical example of an alternating-treatments design. In baseline, the observations are plotted across the two different time periods. In the first intervention phase, both interventions are administered and balanced across the time periods. The data are plotted according to the different interventions. In the final phase, the more effective intervention (Intervention 1) was implemented across both time periods.
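As a rough sketch of these plotting conventions, the following Python fragment (hypothetical numbers of my own; nothing here comes from an actual study) groups the same daily observations by time period in the baseline and final phases but by intervention in the middle phase:

import matplotlib.pyplot as plt

# Hypothetical daily observations for a 15-day demonstration.
baseline_t1 = [20, 22, 18, 21, 20]   # days 1-5, time period 1
baseline_t2 = [19, 23, 20, 22, 21]   # days 1-5, time period 2
interv_1    = [45, 50, 55, 58, 60]   # days 6-10, Intervention 1
interv_2    = [30, 32, 33, 35, 34]   # days 6-10, Intervention 2
final_t1    = [60, 62, 63, 64, 65]   # days 11-15, period 1 (now I1)
final_t2    = [55, 58, 61, 63, 64]   # days 11-15, period 2 (now I1)

plt.plot(range(1, 6), baseline_t1, "o-", label="Period 1 (baseline)")
plt.plot(range(1, 6), baseline_t2, "s--", label="Period 2 (baseline)")
plt.plot(range(6, 11), interv_1, "o-", label="Intervention 1")
plt.plot(range(6, 11), interv_2, "s--", label="Intervention 2")
plt.plot(range(11, 16), final_t1, "o-", label="Period 1 (I1)")
plt.plot(range(11, 16), final_t2, "s--", label="Period 2 (I1)")
plt.axvline(5.5, color="gray")    # baseline / intervention boundary
plt.axvline(10.5, color="gray")   # intervention / final-phase boundary
plt.xlabel("Days")
plt.ylabel("Response measure")
plt.legend()
plt.show()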

As an example, an alternating-treatments design was used to evaluate the classroom behavior of two boys with developmental disabilities who attended a special education classroom (Kazdin & Geesey, 1977). Both boys were identified because of their disruptive behaviors. The goal was to increase attentive behavior in class. Each day baseline data were gathered at two time periods in the morning when separate academic tasks were assigned by the teacher. After the baseline phase, the intervention was implemented, which consisted of two variations of providing reinforcement. A token system was administered in which each child could earn marks (the tokens) on a card placed on his desk. The two variations of the program consisted of the manner in which the reinforcers would be dispensed. The programs differed according to whether the tokens could be exchanged for rewards that only the child would receive (self-exchange) or whether they could be exchanged for rewards for the child and the entire class (class-exchange). Thus, the child could earn for himself or for everyone. The expectation was that earning for many individuals might be more effective because that exchange method mobilizes peer support (and possible pressure) for desirable behavior. Tokens were earned during the two observation periods each day. Different-colored cards were used to record the tokens in each period to separate the self- and the class-reward programs. When a predetermined number of tokens was earned on a card, the child selected from a lottery jar to determine which of the available rewards he would receive. This reward was given to the child or to everyone in class depending on which card had earned the reinforcers. Each program was implemented daily in one of the two observation periods. The programs were alternated daily so that one appeared during the first period on one day and during the second period on the next, and so on.

Figure 9.3 provides the results for Max, a 7-year-old boy. The data are plotted in two ways to show the overall effect of the program (upper panel) and the different effects of the separate interventions (lower panel). The upper portion of the figure shows that attentive behavior improved during the first and second observation periods with data combined (the two morning academic lessons). The lower portion illustrates the design but requires explanation. In baseline, the figure shows the two observation periods each day in which there was no intervention. Clearly the two periods did not vary. Nor would they be expected to, because the condition (no intervention) was the same. In the first intervention phase, data are plotted according to whether the self-exchange or class-exchange program was in effect. The results indicated that Max was more attentive when he was working for rewards for the entire class rather than just for himself. Hence, in the third and final phase, the class-exchange program was implemented daily across both time periods. He no longer earned for himself alone, since this proved to be the less effective intervention. In the final phase, attentive behavior was consistently high across both time periods. This last phase suggests further that the class-exchange method was indeed the more effective intervention, because it raised the level of performance for the time periods previously devoted to self-exchange.

Figure 9.3. Attentive behavior of Max across experimental conditions. Baseline (base)—no experimental intervention. Token reinforcement (token rft)—implementation of the token program where tokens earned could purchase rewards for himself (self) or the entire class (class). Second phase of token reinforcement (token rft 2)—implementation of the class-exchange intervention across both time periods. The upper panel presents the overall data collapsed across time periods and interventions. The lower panel presents the data according to the time periods across which the interventions were balanced, although the interventions were presented only in the last two phases. (Source: Kazdin & Geesey, 1977.)
Occasionally, continuation of baseline serves as one of the "interventions" or "treatments" during the intervention phase (e.g., Hughes & Carter, 2002; Pluck, Ghafari, Glynn, & McNaughton, 1984). An example of an alternating-treatments design in which baseline constituted one of the alternating conditions was used to evaluate an intervention designed to reduce the frequency of stereotyped repetitive movements among hospitalized children with developmental and intellectual disabilities (Ollendick, Shapiro, & Barrett, 1981). Three children, ages 7 to 8, exhibited stereotypic behaviors such as repetitive hand gestures and hair twirling. Observations of the children were made in a classroom setting while each child performed various visual-motor tasks (e.g., puzzles). Behavior was observed each day for three sessions, after which the intervention phase was implemented. During the intervention phase, three conditions were compared, including two active interventions and a continuation of baseline. One intervention consisted of physically restraining the child's hands on the table for 30 seconds so he or she could not perform the repetitive behaviors (physical restraint). The second intervention consisted of physically guiding the child to engage in the appropriate use of the task materials (positive practice). Instead of merely restraining the child, this procedure was designed to develop appropriate alternative behaviors the children could perform with their hands. The final condition during the intervention phase was a continuation of baseline. Physical restraint, positive practice, and continuation of baseline were implemented each day across the three different time periods.
Figure 9.4 shows the results for one child who engaged in hand-posturing gestures. As evident from the first intervention phase, both physical restraint and positive practice led to reductions in performance; the practice procedure was more effective. The extent of the reduction is especially clear in light of the continuation of baseline as a third condition during the intervention phase. When baseline (no-treatment) conditions were in effect during the intervention phase, performance remained at the approximate level of the original baseline phase. In the final phase, positive practice was applied to all of the time periods each day. Practice proved to be the more effective of the two interventions in the previous phase and was implemented in the final phase across all time periods. Thus, the strength of this intervention is especially clear from the design.

Figure 9.4. Stereotypic hand posturing across experimental conditions. The three separate lines in each phase represent three separate time periods in each session. Only in the initial intervention phase were the three separate conditions in effect, balanced across the time periods. In the second intervention phase, positive practice was in effect for all three periods. (Source: Ollendick, Shapiro, & Barrett, 1981.)
Continuation of baseline in the intervention phase allows direct assessment of what performance is like without treatment. Of course, including baseline as another condition in the intervention phase introduces a new complexity to the design. As discussed later in the chapter, increasing the number of conditions compared in the intervention phase raises potential problems. Yet if performance during the initial baseline phase is unstable or shows a trend that the investigator believes may interfere with the evaluation of the interventions, it may be especially useful to continue baseline as one of the conditions in the design.

Versions without Initial Baseline. Alternating-treatments designs are often used without an initial baseline phase, and this version warrants comment and illustration. In many programs, the goal is to identify quickly which among multiple alternatives is

effective and to do so in a rapidly changing design. A common version is used to identify what controls self-stimulatory or self-destructive behavior among individuals with developmental disabilities. Often individual one-to-one sessions are used in which 5 or 10 minutes are provided for each of three or more interventions. The interventions might be how the investigator responds to the behavior (e.g., attend, ignore, make some demand to work on a task). The goal is to identify what might be controlling the problem behavior and to do so very quickly. Remarkable successes have been achieved in showing that one or more of such conditions, varied quickly in a brief period, have an impact on reducing the problem behavior.³ There might be little or no baseline, and the conditions are just compared to see if one emerges as markedly influencing


performance. Essentially, the paradigm is to test hypotheses about what might be controlling behavior. The intervention that emerges might well serve as the basis for developing an intervention that can be evaluated in an additional or other design.

³ A very important area of work in behavioral research (applied behavior analysis) is referred to as functional analysis, a topic beyond the scope of the present book. Briefly, functional analysis is an effort to understand what factors are controlling behavior. Through careful assessment in advance and then through direct empirical tests, the investigator hypothesizes what influences might be operating. These are then tested empirically. For example, in a one-to-one situation with a child who hits himself, an investigator sitting across from the child might alternate three interventions: attending to the child when he hits himself (to see if that is controlling the behavior), turning away from the child, and giving the child a task or making a demand. These interventions might be provided all on the same day and all within a laboratory session. Very quickly one can determine which one is associated with a reduction of self-hitting, and that information can now form the basis of an intervention that is not conducted in a one-to-one laboratory session. For further details of functional analysis, see other sources (e.g., Cooper et al., 2007; Iwata, Kahng, Wallace, & Lindberg, 2000).
There are many occasions in which the absence of an initial baseline raises special issues and ambiguities. For example, in one program, the goal was to identify whether positive teacher affect (e.g., smiling, showing a very positive vocal tone, and showing enthusiasm) influenced the affect of the children (e.g., smiling, laughing) as well as their correct task performance on classroom work (Park, Singer, & Gibson, 2005). Four children, ages 6 to 11 and with various disabilities (e.g., Down syndrome, cerebral palsy), participated. Each was brought individually to a room during recess, seated in front of a table, and given a task. There were two "conditions": one with teacher positive affect and the other with neutral affect (e.g., flat voice tone, expressionless face, low enthusiasm). Children were videotaped, and the tapes were scored for affect; correct responses were scored from the task they completed.
Figure 9.5 provides the results for the four children and shows that teacher positive affect was associated with more positive affect of the children. As noted in the graph, there was no baseline, and the interventions showed a difference. Although not graphed, three of these four students also showed a slightly higher percentage of correct responses on their assignments in the positive-affect condition. What can we conclude? We can conclude that one treatment was more effective than the other, which was the main goal of the demonstration. However, the absence of baseline raises an ambiguity. It is possible that baseline was even higher (more positive affect of the children) than the performance achieved with these two conditions. More plausibly, it is possible that the neutral condition was not really neutral but actually negative and made the children less happy than usual. After all, the neutral condition included a style of presentation that is probably more negative than the way most people interact with children. It may be that baseline and positive affect were not really different, and that the neutral condition made the child look worse (less positive affect). I am not saying this is accurate. I am merely saying that the interpretation I provide cannot be addressed by the design. Is the concern just some esoteric methodological nuance? Not really. Other alternating-treatments designs have shown that when the interventions have different effects, one intervention might make children worse (e.g., Washington, Deitz, White, & Schwartz, 2002). The evaluation of intervention effects is aided greatly by the collection of initial baseline data.

Figure 9.5. Students' level of happiness (percentage of intervals) under two conditions in an alternating-treatments design. The conditions were positive affect and neutral affect on the part of the teacher. (Source: Park, Singer, & Gibson, 2005.)
A compromise to provide some baseline data is to use the alternating-treatments design where one of the conditions is baseline. In this version, there is no initial baseline phase. Rather, the intervention and baseline conditions begin in one phase. An example of this was reported with a 9-year-old girl referred for off-task and disruptive behavior and failure to complete her work (McCurdy, Skinner, Grantham, Watson, & Hindman, 2001). Each day in-seat mathematics assignments were provided to her from a workbook. The two conditions included regular, unaltered workbook assignments (control or baseline assignments) and the experimental condition in which some additional and easier problems were interspersed with the usual set of problems. The rationale is derived from prior work noting that completing tasks successfully may augment the rewarding value of the assignment and increase the likelihood of engaging in the assignments and paying attention. The two conditions were alternated across days in the classroom.

Figure 9.6 shows the effects of the experimental condition. Performance was higher for the experimental sessions than for the control (baseline) sessions even though no baseline phase was provided. The differences between the two conditions seemed to attenuate over time, an effect that could not be detected without some baseline (before or during the intervention phase). In principle it is possible that a "real" baseline (with no intervention or alternating conditions) would be different from the control sessions. However, the inclusion of baseline as one of the conditions helps one judge the magnitude of the impact of the experimental condition.

Figure 9.6. Percentage of intervals of on-task behavior in an alternating-treatments design in which experimental and control (no intervention) conditions were compared. The student worked on mathematics assignments. The experimental assignments interspersed some easy problems to make the task more rewarding. (Source: McCurdy et al., 2001.)
The different variations of the alternating-treatments design are given special attention here in part because of their different uses and strengths. The strongest demonstration is one that begins with a baseline phase. This allows application of the logic of the design (describe, predict, test) and the ability to judge the magnitude of the intervention effect. That is, did the interventions help or harm relative to baseline? On the other hand, in one area of behavioral research (functional analysis), there is often interest in isolating very quickly which among multiple interventions is controlling behavior in a laboratory session. An extensive baseline phase is not used here. The goal is to identify one intervention that might be controlling behavior and to extend that to everyday settings to serve as the intervention.

Other Multiple-Treatment Design Options

The multi-element and alternating-treatments designs discussed already are the more commonly used multiple-treatment designs. A few other options are briefly noted to convey flexibility in the designs.

Simultaneous Availability of All Conditions. As noted previously, in the usual alternating-treatments or simultaneous-treatment design, the interventions are scheduled


at different periods each day. The pattern of performance during each of the different interventions is used as the basis for inferring their relative effectiveness. Almost always, the interventions are scheduled at entirely different times during the day. It is possible, however, to make each intervention available at the same time. The different interventions are available but are in some way selected by the client.
A historically early and very clear demonstration of this variation compared the effects of three procedures (praise and attention, verbal admonishment, and ignoring) on reducing the bragging of a 9-year-old hospitalized boy (Browning, 1967). One of the boy's problem behaviors was extensive bragging that consisted of untrue and grandiose stories about himself. After baseline observations, the staff implemented the procedures just mentioned in an alternating-treatments design. The different treatments were balanced across three groups of staff members (two persons in each group). Each week, the staff members associated with a particular intervention were rotated so that all the staff eventually administered each of the interventions.
The unique feature of the design is that during the day, all of the staff were available to the child. The specific consequence the child received for bragging depended on the staff members with whom he was in contact. The boy had access to and could seek out the staff members of his choosing. And the staff provided the different consequences to the child according to the interventions to which they had been assigned for that week. The measure of treatment effects was the frequency and duration of bragging directed at the various staff members. The results indicated that bragging incidents tended to diminish in duration in the presence of staff members who ignored the behavior relative to those who administered attention or admonishment.
This design variation is slightly different from the previous ones because all treatments were available simultaneously. The intervention that was implemented was determined by the child, who approached particular staff members. This variation of the design is useful for measuring a client's preference for a particular intervention. The client can seek out those staff members who perform a particular intervention. Since all staff members are equally available, the extent to which those who administer a particular intervention are sought out may be of interest in its own right.
The variation of the design in which all interventions are actually available at the same time and the client selects the persons with whom he or she interacts has been rarely used. This design has features of multi-element designs because the interventions are associated with particular conditions. Methodologically, this variation is best suited to measuring preferences for a particular condition, which is somewhat different from the usual question of interest, namely, the effectiveness of different conditions.

Randomization Design. Multiple-treatment designs for single subjects alternate the interventions or conditions in various ways during the intervention phase. The designs discussed previously resemble a randomization design (Edgington, 1996), which refers to a way of presenting the different treatments. The design developed largely through concern with the requirements for statistical evaluation of alternative treatments rather than from the mainstream of single-case experimental research.⁴
The randomization design, as applied to one subject or a group of subjects, refers to presentation of alternative interventions in a random order. For example, baseline (A) and treatment (B) conditions could be presented to subjects on a daily basis in the following order: A B B A B A B A A B. Each day a different condition is presented, usually with the restriction that each is presented an equal number of times. Because the condition administered on any particular day is randomly determined, the results are amenable to several statistical tests (Edgington, 1996; Edgington & Onghena, 2007; Todman & Dugard, 2001).
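As a minimal sketch of how such a daily order might be generated (my illustration; the function name is invented, and nothing here is drawn from the sources just cited), one can shuffle an equal number of condition labels:

import random

def randomization_design_order(n_per_condition, conditions=("A", "B")):
    """Daily presentation order for a randomization design: each
    condition occurs equally often, but the condition administered
    on any particular day is randomly determined."""
    order = list(conditions) * n_per_condition
    random.shuffle(order)
    return " ".join(order)

print(randomization_design_order(5))  # e.g., A B B A B A B A A B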
Features of the randomization design are included in versions of an alternating-treatments design. For example, in the intervention phase of an alternating-treatments design, two or more interventions must be balanced across stimulus conditions (e.g., time periods). When the order in which the treatments are applied is determined randomly (see Table 9.1B), the phase meets the requirements of a randomization design. Essentially, a randomization design consists of one way of ordering the treatments in the intervention phase of a multiple-treatment design.
Technically, the design can be used without an initial baseline if two or more treatments (B, C) or baseline and one or more conditions are compared during the intervention phase. It is almost always better to have a baseline phase to strengthen the basis of the describe, predict, and test functions of the various phases in any single-case design. However, in a multiple-treatment design, "baseline" (no intervention) can be one of the conditions that is varied and presented in a randomized order if a randomization design is used. This is discussed further a bit later in the chapter, under "Additional Design Variations."
Randomization designs have not been reported very frequently in applied work, although excellent examples are available (e.g., Washington et al., 2002). The methodological concern when comparing interventions in the same phase is to ensure that one can separate the intervention effects from any conditions with which they might be associated. Hence investigators are concerned with balancing interventions across the conditions (e.g., time periods, settings) so inferences can be drawn about the interventions, unconfounded by other factors. Randomization is useful in deciding how to order the presentation of conditions if there is some reason that bias could enter into the pairing of conditions with the intervention, and in other situations in which randomization is just as convenient given the applied situation and the demands of administering interventions. For example, it might make a difference in implementing a program if interventions were alternated in ways that were more predictable than randomization but still not biased (e.g., so treatment A is not always in the morning and B always in the afternoon).

⁴ It is useful to distinguish a randomization design from randomization tests, although they are related. Design refers to the arrangement of the experimental conditions, in this case the order in which they are presented to the client. Tests refer to the statistical techniques that are used to analyze the data. A randomization design does not have to be analyzed statistically, and if the data are analyzed statistically the tests need not be randomization tests. This chapter focuses on design; statistical tests are discussed later, in the Appendix (see also Edgington & Onghena, 2007; Todman & Dugard, 2001).

As I noted, randomization designs are used infrequently, for reasons one can only surmise. A plausible view is the difficulties randomization thrusts on the logic of the design and the practical issues of implementation. The logic of the designs depends on the describe, predict, and test functions of the data patterns. These are aided by having stable rates of performance to see trends and the likely patterns that will characterize the future. A constantly changing back-and-forth intervention might well show patterns, but this requires intervention conditions that can shift daily or the equivalent (as dictated by the random order) and highly responsive clients who can show quite different effects as soon as the intervention is implemented. The describe, predict, and test logic can be readily invoked, but the data have to be unusually favorable to see the immediate flipping of performance. The practical challenges of randomization are another issue that may have restricted use of the design. In most applied settings (e.g., schools, special education classrooms) one cannot easily flip the intervention randomly, at least on a daily basis. The demands of randomization make planning and implementing interventions more complex than they might otherwise be. Integrity of the individual interventions and responsiveness of the client to change—always important in any intervention study—are slightly more demanding when daily shifts might be needed because of randomization. Does it sound as if I am "against" randomization? It is not a matter of for or against. Randomization is a tool that may be quite useful in single-case designs and can contribute to ruling out threats to validity—the reason we are discussing and using designs in the first place.
In this chapter, randomization was discussed primarily as a way of arranging interventions in multiple-treatment designs. There are broader roles for randomization in single-case research beyond these specific designs. First, randomization of conditions facilitates the application of special statistical techniques that can be used to evaluate interventions in single-case designs (Todman & Dugard, 2001). Second, randomization has been discussed in a broader context of single-case research and as a method to infuse many different types of designs (Kratochwill & Levin, in press). We will return to randomization in these other contexts later in the book.

Combining Components. Many of the designs I have presented in this and previous chapters convey clear, well-delineated examples of the designs (e.g., ABAB, multiple-baseline). In learning the designs, clear examples are very helpful so one can present the fundamentals without too many distractions. Researchers who work within the single-case approach often combine designs and pull together components (e.g., randomization). The result is often an excellent demonstration, but it is not clear what the design was or how to classify it. Invariably in such cases, the designs are hybrids and combinations. I shall convey combined designs more fully in the next chapter. However, it is useful here to illustrate how several features discussed in this chapter can be combined.
This study was conducted on a large college campus and was designed to see if prompts and feedback could increase the proportion of drivers who came to a complete stop at a stop sign (Austin et al., 2006). Drivers who came to two intersections (called Stop A and Stop B) were observed each morning by observers parked in a car adjacent to the intersections. Each driver approaching these separate stop signs was coded according to whether he or she stopped or did not stop. Stopped meant all visible tires were not rotating, a measure that was highly reliable. During the intervention phase a volunteer stood near Stop sign A with a large poster that read, "Please Stop—I Care." This instruction, of course, was designed to promote stopping. If the driver stopped, the volunteer flashed the reverse side of the poster, which said "Thank You for Stopping." This provided feedback and social reinforcement. The intervention was only associated with Stop sign A and never with Stop sign B. Stop signs A and B were at opposite sides of an intersection, and apparently the arrangement made it possible for only those drivers coming to the A Stop sign to see the poster and receive the instructions and praise, although drivers coming to either sign could see the volunteer standing there.
The design began with baseline observations, and these were graphed separately for Stop signs A and B as the percentage of drivers who made complete stops. During the intervention phase, the poster (instructions and social reinforcement) was provided by a volunteer standing near the intersection at Stop sign A. Prompts and reinforcement were never provided by anyone at Stop sign B. So, because the interventions are associated with one condition (Stop sign A) but not another (Stop sign B), this is a multi-element feature of the design.

There is more in the design—there were two alternating conditions during the intervention phase. One condition was the volunteer present with the poster, as already mentioned. The other condition during intervention was continuation of baseline. Some days during the intervention phase there was no volunteer at all. That is, during the intervention phase some days had the volunteer present (the intervention), and other days had no volunteer present (so conditions at the intersection were just like baseline). Intervention and continuation-of-baseline days were randomly ordered. Thus, randomization of the two conditions during the intervention phase was another component in the design.
Figure 9.7 nicely conveys the design and results. The top and bottom portions of the figure show that during baseline (solid line with filled dark circles) only a small percentage of drivers stopped at either Stop sign A (upper portion) or B (bottom portion). The solid line with dark circles shows baseline during the "true" baseline phase and then during the randomly determined baseline days throughout the study. The line with the open circles shows the effect of the intervention (top figure). During the intervention phase, when the volunteer flashed the poster (sign-visible days), the percentage of drivers who stopped increased markedly—this could not have been a historical event or other such influence, because some of the days, which were randomly determined, did not have the volunteer present, and performance remained at baseline levels on those days. The top figure alone provides a strong demonstration. Stop sign B never had the intervention, but the data were plotted to distinguish days when the volunteer was visible from the additional baseline days. For reasons that are not clear, seeing the volunteer in the distance was associated with a slight increase in the percentage of drivers who stopped. It is possible that the volunteer was mistaken for a pedestrian and that fostered increased stopping, or that some individuals at Stop sign B had previously experienced the intersection at Stop sign A.

Figure 9.7. Percentage of complete stops made at Stop A (top) and Stop B (bottom). (Source: Austin et al., 2006.)
Overall, the design combines multiple features of different designs (multi-element, alternating-treatments design, randomization). The multi-element component is based on separate stimulus conditions (Stop signs A and B) associated with different interventions (instructions/praise vs. none). The alternating-treatments feature stems from varying two conditions (intervention and baseline) during the intervention phase. And

randomization refers to how the two conditions were varied daily. As to the results, the graph (top portion) conveys the clear impact of the intervention. There is some ambiguity as to why drivers at Stop sign B improved (continued baseline). Inclusion of this condition was very helpful. It was important to show that the intervention was better than no intervention, and the continuation of baseline helps establish that.

ADDITIONAL DESIGN VARIATIONS

Conditions Included in the Design


The primary purpose of employing a multiple-treatment design is to evaluate the relative effectiveness of two or more interventions and to do so relatively quickly. Thus,

variations discussed up to this point have emphasized the comparison of interventions, that is, viable treatments or methods designed to alter behavior. Not all of the conditions compared in the intervention phase need to be active treatments. In some variations, one of the conditions included in the intervention phase is a continuation of baseline conditions, that is, no intervention. The previous example of combined components included baseline as one of the interventions, but it is worth underscoring this use more explicitly and apart from the combination of the many other elegant procedures included in that example.

As you will recall regarding single-case designs more generally, one purpose of the initial baseline is to project (predict) what performance would be like in the future if no treatment were implemented. In a multiple-treatment design, it is possible to implement one or more interventions and to continue baseline conditions, all in the same phase. In addition to projecting what baseline would be like in the future, it is possible to assess baseline levels of performance concurrently with the intervention(s). If performance changes under those time periods in which the interventions are in effect but remains at the original baseline level during the periods in which baseline conditions are continued, this provides a dramatic and persuasive demonstration that behavior changes resulted from the intervention. Because the baseline conditions are continued in the intervention phase, the investigator has a direct measure of performance without the intervention. Any extraneous influences that might be confounded with the onset of the intervention phase should affect the baseline conditions that have been continued. By continuing baseline in the intervention phase, greater assurance is provided that the intervention accounts for change. Moreover, the investigator can judge the magnitude of the changes due to the intervention by directly comparing performance during the intervention phase under baseline and intervention conditions that are assessed concurrently.
In general, in a multiple-treatment design, we usually think of comparing two or more treatments. However, one variation is to compare two conditions concurrently: one that is an intervention and the other that is a continuation of baseline. This makes multiple-treatment designs useful even when the investigator is interested in only one intervention. For example, if a teacher or school administrator has an idea for generating enthusiasm and involvement especially well on classroom assignments, that method could be the intervention, and it could be placed in an alternating-treatments design in which it was implemented once each day, in either the morning or the afternoon. Whichever period does not receive the special intervention that day is teaching as usual (baseline).
The continuation of baseline during an intervention phase (as one of the conditions) has yet another use. If performance during the initial baseline phase is unstable or shows a trend that the investigator believes may interfere with the evaluation of the interventions, it may be especially useful to continue baseline as one of the conditions in the design. This provides a direct test of whether the intervention has an impact on the pattern of behavior. Rather than predicting what performance would be like if baseline continued, we have a direct record.

Final Phase of the Design


The alternating-treatments design is defined by a baseline phase followed by an intervention phase in which two or more interventions are presented. The designs often

include a third and final phase that contributes to the strength of the demonstration. If one of the two conditions is shown to be more effective than the other during the intervention phase, it is often implemented on all occasions and under all stimulus conditions in the final phase of the design. When the final phase of the alternating-treatments design consists of applying the more (or most) effective intervention across all of the stimulus conditions, the design bears some resemblance to a multiple-baseline design.
Essentially, the design includes two intervention phases, one in which two (or more) interventions are compared and one in which the more (most) effective one is applied. The "multiple baselines" do not refer to different behaviors or settings but rather to the different time periods each day in which the observations are obtained. The more (most) effective intervention is applied to one time period (or balanced across time periods or situations) during the first intervention phase. In the second intervention phase, the more (most) effective intervention is extended to all of the time periods. Thus, the more (most) effective intervention is introduced to the time periods at different points in the design (first intervention phase, then second intervention phase).
Of course, the design is not exactly like a multiple-baseline design, because the more (or most) effective intervention is introduced to time periods that may not have continued under baseline conditions. Rather, less effective interventions have been applied to these time periods during the first intervention phase. On the other hand, when the alternating-treatments design compares one intervention with a continuation of baseline, then the two intervention phases correspond closely to a multiple-baseline design. The intervention is introduced to one of the daily time periods in the first intervention phase while the other time period continues under baseline conditions. In the second intervention phase, the intervention is extended to all time periods in exactly the manner of a multiple-baseline design.
I mention the final phase and the relations among different designs not just to confuse the reader. There will be other opportunities for that. Rather, it is important to convey the elements or components (phases, options) of single-case designs. It is tempting to ask or even demand, "What design is that?" when the design cannot be easily classified. The design per se and its classification are not really so critical. Rather, the logic of single-case designs (describe, predict, test) and the goal of experiments (ruling out threats to validity) are critical. The logic and goals need to be satisfied, and these trump all other considerations. The many different elements or components provide valuable tools for improvising single-case designs to handle different situations and also for adding or altering later phases in a design if needed to clarify the demonstration.

General Comments

Multiple-treatment designs can vary along more dimensions than the conditions that are implemented in the first and second intervention phases, as discussed previously. For example, designs differ in the number of interventions or conditions that are compared and the number of conditions or contexts across which the interventions are balanced. However important these dimensions are, they do not alter the basic features of the designs.
The designs are not as commonly used as ABAB and multiple-baseline designs. When they are used, the demonstrations tend to be experimentally strong. The main

reason is that as often as not investigators include a feature in the design that draws on some other design. For example, the design requires only the baseline and intervention phases. The different interventions during the intervention phase provide the data points to see if performance under each intervention departs from baseline and from the data points of the other intervention. The describe, predict, and test features of the different phases of single-case designs are accommodated within the two phases. Investigators often include a third phase that further strengthens the demonstration. I mentioned the multiple-baseline feature of adding a third phase in which the more (or most) effective intervention is introduced across all conditions.
In addition, a third or final phase of the design may consist of withdrawing all of the treatments. Thus, a reversal phase is included, and the logic of the design follows that of the ABAB designs discussed earlier. Of course, an attractive feature of the multiple-treatment designs is the ability to demonstrate an experimental effect without withdrawing treatment. I mention the use of a reversal phase here not to advocate that addition specifically but rather to convey variations that are used occasionally to strengthen further the conclusions that can be drawn about the impact of the intervention. The full range of variations of the designs becomes clearer as we turn to the problems that may emerge in multiple-treatment designs and how these problems can be addressed.

PROBLEMS AND CONSIDERATIONS

Among single-case experimental designs, those in which multiple treatments are compared are relatively complex. Hence, several considerations are raised by their use in terms of the types of interventions and behaviors suitable for the designs, the extent to which the interventions can be discriminated by the clients, the number of interventions and conditions (time periods, settings), and the possibility that multiple-treatment interference may influence the results.

Omitting the Initial Baseline

Two or more interventions are compared during the intervention phase. Occasionally, a variation is used in which the investigator begins with the two interventions (e.g., an alternating-treatments design) without a baseline (e.g., Reinhartsen, Garfinkle, & Wolery, 2002; Wacker et al., 1990). If one intervention greatly exceeds the effect of the other, the absence of baseline is not so much of an issue. The goal is to see if one intervention is more effective than another, and this can be accomplished without an initial baseline. However, if the two interventions are not different in their effects, there is an important question that cannot be addressed: Were both interventions effective (i.e., better than baseline) or was neither effective (i.e., no different from baseline)? Also, I mentioned before that it is possible that the intervention that looks better may be no different from baseline but looks good because the other intervention made things worse (Washington et al., 2002). It is even possible that both interventions made behavior worse than baseline. This is not merely an intellectual possibility; interventions occasionally make people worse (e.g., Dodge, Dishion, & Lansford, 2006; Feldman, Caplinger, & Wodarski, 1983).
For example, in one project, two interventions were compared with the goal of increasing the physical activity of four elementary school students (Grissom, Ward,

Martin, & Leenders, 2005). Students wore a small device (an accelerometer) every day to assess level of activity; the goal was to increase activity, and the measure automatically recorded the information, which was downloaded to a computer at the end of the lesson. The two interventions were (1) wearing a heart-rate monitor that provided an audible prompt when student activity increased beyond a specific level and (2) not wearing a monitor and not receiving feedback. The alternating-treatments design showed these interventions to be no different from each other in their effects. This is not surprising; among all of the interventions psychology has to offer, feedback all by itself is not one of the stronger interventions in terms of magnitude of effect and number of individuals usually affected. However, we cannot judge a critical question: did both interventions improve activity beyond baseline? It would have been good to have a pre-intervention (baseline) phase to provide the descriptive and predictive functions of baseline. This might have been a brief or extended period of wearing the accelerometers for several days before any intervention began—it could be that the heart-rate device raised activity under both conditions during the intervention phase and in fact made a difference.
A baseline is not always feasible and is not always needed. Even in an ABAB design, one can begin with the intervention and make this a BAB or BABA design. Yet, in multiple-treatment designs, the absence of differences between the two treatments can introduce an ambiguity that would not otherwise be present with even a very brief baseline. As a general rule for single-case designs, begin with a baseline phase whenever possible. In the case of this example, even two data points in baseline may not be completely sufficient to describe and predict performance, but they still would have been an excellent addition.

Type of Intervention and Behaviors

Multiple-treatment designs depend on showing changes for a given behavior across daily sessions or time periods. If two (or more) interventions are alternated on a given day, behavior must be able to shift rapidly to demonstrate differential effects of the interventions. The need for behavior to change rapidly dictates both the types of interventions and the behaviors that can be studied in multiple-treatment designs.
Interventions suitable for multiple-treatment designs may need to show rapid effects initially and to have little or no carryover effects when terminated. Consider the initial requirement of rapid start-up effects. Because two (or more) interventions are usually implemented on the same day, it is important that an intervention not take too long within a given session to begin to show its effects. For example, if each intervention is administered in one of two 1-hour time periods each day, relatively little time exists to show a change in behavior before the intervention is terminated for that day. Not all treatments may produce effects relatively quickly. This problem is obvious with some forms of medication used to treat clinical problems in adults and children (e.g., depression), in which days or weeks may be required before therapeutic effects can be observed. However, the problem is likely to be just as evident with psychosocial and educational influences (e.g., instructions and praise vs. a group program, teaching strategies, or special educational materials that build skills) that would be compared (e.g., in a classroom). The moment or first few time periods that one intervention is in place do not necessarily lead to an immediate impact on the behavior of the students. The impact of each intervention may emerge only slowly. In addition, the differential

impact of the interventions (on the assumption that there is one) may also take time to emerge.
In many behavioral programs in which intervention effects are based on reinforcement and punishment, the effects of the intervention have been evident within a relatively short period. If several opportunities (occurrences of the behavior) exist to apply the consequences within a given time period, intervention effects may be relatively rapid. Yet many interventions based on consequences (e.g., extinction, where consequences are not provided) may not show rapidly changing effects. Also, many other interventions (e.g., academic skill building, cognitive therapy for a clinical disorder, modeling, and peer tutoring) are studied in single-case designs, and changes are expected to accrue slowly. Rapid shifts in performance as a function of changing treatments may be difficult to demonstrate. The slow "start-up" time for intervention effects depends on the intervention as well as the behavior or domain of interest.
When treatments are alternated within a single day, as is often the case in multiple-treatment designs, there is an initial start-up time necessary for a treatment to demonstrate an effect. This is not invariably a problem. However, investigators might ask themselves in advance, "Is it reasonable to expect the different interventions to show their effects and any differential effects within the time periods each is provided?" The answer is influenced by many considerations. Prominent among these are the two interventions that are compared. In general, the greater the contrast between the interventions (e.g., stark procedural differences) and the stronger the expected differences in impact, the less likely it is that the brevity of the time period will make a difference. As a clear case, if the two conditions compared are intervention versus baseline in the same phase, as mentioned previously, the discrimination and differences between the conditions are more likely to be evident. Any time two (or more) interventions are compared, each may need time to have impact and to show differential impact. The clients are exposed to the balancing of interventions across different conditions (e.g., time periods), and that can make discriminating between the different interventions slightly more complex. In contrast, in a multi-element design, each intervention is paired with a particular stimulus or context, which can help clients discriminate between the interventions.
Another requirement is that interventions must have little or no carryover effects after they are terminated. If the effects of the first intervention linger after it is no longer presented, the intervention that follows would be confounded by the previous one. For example, it might be difficult to compare medication and behavioral procedures in an alternating-treatments design. It might be impossible to administer both treatments on the same day (e.g., morning and afternoon periods) because of the carryover that most medications have. The lingering effects of the medication might influence the effectiveness of the other intervention. One could only evaluate whether the other intervention (and medication) is more or less effective when it was preceded by or when it followed medication.
Pharmacological interventions are not the only ones that can have carryover effects. Interventions based on environmental changes and interpersonal interaction also may have carryover effects and thus may obscure evaluation of the separate effects of the interventions, a point to which I return later in the discussion of multiple-treatment interference. In any case, if two or more interventions are to be compared, it is important to be able to terminate each of the interventions quickly, with little or no lingering

effect that will blur and mix their impact. If interventions cannot be removed quickly, they will be difficult to compare with each other in an alternating-treatments design.
Apart from the interventions, the behaviors or outcomes of interest studied in multiple-treatment designs must be susceptible to rapid changes. Behaviors that depend upon improvements over an extended period may not be able to shift rapidly in response to session-by-session changes in the intervention. For example, it would be difficult to evaluate two or more interventions for reducing the weight of persons who are obese. Changes in the measure (weight in kilograms or pounds) would not vary to a significant degree unless an effective treatment was continued without interruption over an extended period. Constantly alternating the interventions on a daily basis might not affect weight at all. On the other hand, alternative measures (e.g., calories consumed at different times during the day) may well permit use of the design.
Aside from being able to change rapidly, the frequency of the behavior may also be a determinant of the extent to which interventions can show changes in multiple-treatment designs. For example, if the purpose of the interventions is to decrease the occurrence of low-frequency behaviors (e.g., severe aggressive acts), it may be difficult to show a differential effect of the interventions. Too few opportunities may exist for the intervention to be applied in any particular session. Indeed, the behavior may not even occur in some of the sessions. Thus, even though a session may be devoted to a particular intervention, that intervention may not actually be applied. Such a session cannot be fairly represented as one in which this particular intervention was employed.
High frequency of occurrences of the behavior also may present problems for reflecting differences among interventions. If there is an upper limit to the number of responses because of a limited set of discrete opportunities for the behavior to occur, it may be difficult to show differential improvements. For example, a child may receive two different programs to improve academic performance. Each day, the child receives a worksheet with 20 problems at two different times as the basis for assessing change. During each time period, there are only 20 opportunities for correct responding. If baseline performance is 50% correct (10 problems completed), this means that the differences between treatments can only be detected, on average, in responses to the 10 remaining problems. If each intervention is moderately effective, there is likely to be a ceiling (or floor) effect, that is, an absence of differences because of the restricted upper (or lower) limit of the measure. Perhaps the interventions would have differed in effectiveness if the measure were not restricted to a limited number of response opportunities.
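To make the arithmetic of the restricted range concrete, the following minimal sketch (in Python, with hypothetical numbers rather than data from any study) shows how a 20-problem ceiling can make two genuinely different interventions appear identical:

TOTAL_PROBLEMS = 20          # worksheet length sets the ceiling
baseline_correct = 10        # 50% correct at baseline

def observed_correct(true_gain):
    # Observed correct responses are truncated by the worksheet ceiling.
    return min(TOTAL_PROBLEMS, baseline_correct + true_gain)

# Two interventions that truly differ (gains worth 12 vs. 18 problems)
# both print 20/20: the ceiling erases the difference between them.
for gain in (12, 18):
    print(f"true gain {gain:2d} -> observed {observed_correct(gain)}/20")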
The restricted range as a problem when multiple interventions are compared is illustrated in a study designed to decrease absenteeism among employees working at a human service organization (Luiselli et al., 2009). The employees included approximately 60 (the number varied over time) teachers and child-care staff members in a private school serving individuals with autism and related developmental disabilities. The measure was the percentage of staff absent each day, independent of the reason for the absence. The investigators wished to compare three interventions in what might be referred to as an ABCD evaluation, beginning with baseline (A). The interventions included: B — providing an informational brochure to convey information and then to preview a lottery in which attendance increased the possibility of earning a monetary
bonus; C — the actual lottery, in which this was implemented; and D — the lottery plus public posting of graphs that charted daily and weekly absences.
Figure 9.8 shows that absenteeism during baseline was increasing, a direction opposite from the goal of the program. The first intervention (brochure) led to a marked drop in absenteeism. The second (lottery) and third (lottery + public posting) were implemented, but it is very difficult to evaluate whether these latter interventions made any difference. The brochure reduced absenteeism so strongly that there may have been a floor effect, that is, not much room on the measure to show the effects of any other intervention. In the brochure phase, absenteeism was close to 3.6% of the staff. This may be near or at a genuine limit, as people are absent for legitimate reasons (e.g., illness, child holidays that require a parent to stay home). When multiple interventions are compared, it is important to be sure that the measure allows for differentiation among the interventions if the interventions are differentially effective. This cannot be known in advance, but providing multiple interventions to the same subjects is a situation in which this is more likely to be a problem than providing different interventions to different groups of subjects (between-group design).
In general, differential effectiveness of the interventions is likely to depend on several opportunities for the behavior to occur. If two or more active interventions are compared that are likely to change behavior, the differences in their effects on performance are relatively smaller than those evident if one intervention is simply compared to a continuation of baseline. In order for the design to be sensitive to relatively less marked differences between or among interventions, the frequency of the behavior must be such that differences could be shown. If the goal is to improve some behavior, a low frequency of behavior may present problems if it means that there are few

Figure 9.8. Percentage of daily staff absences each week across multiple treatments. (Source: Luiselli et al., 2009.)
opportunities to apply the procedures being compared. High frequency of behavior may be a problem if the range of responses is restricted by an upper limit that impedes demonstration of differences among effective interventions designed to increase that behavior.

Discriminability of the Interventions


When multiple interventions are administered to one client in the same phase, the client must be able to make at least two sorts of discriminations. First, the client must be able to discriminate whether those who administer the interventions, or the time periods, are associated with a particular intervention. In the multi-element design, this discrimination may not be very difficult, because the interventions are constantly associated with a particular stimulus. In the alternating-treatments design, the client must be able to discern that the specific interventions constantly vary across the different stimulus conditions. In the beginning of the intervention phase, the client may inadvertently associate a particular intervention with a particular stimulus condition (e.g., time period, staff member, or setting). If the interventions are to show different effects on performance, it will be important for the client to respond to the interventions that are in effect independently of who administers them. Second, the client must be able to distinguish the separate interventions. Since the design is aimed at showing that the interventions can produce different effects, the client must be able to tell which intervention is in effect at any particular time. Discriminating the different interventions may depend on the procedures themselves.
The ease of making a discrimination, of course, depends on the similarity of the interventions or procedures that are compared. If two very different procedures are compared, the clients are more likely to be able to discriminate which intervention is in effect than if subtle variations of the same procedure are compared. For example, if the investigation compared the effects of 5 versus 15 minutes of isolation as a punishment technique, it might be difficult for the client to discriminate which intervention was in effect. Although the interventions might produce different effects if they were administered to separate groups of subjects or to the same subject in different phases over time, they may not produce a difference, or may produce smaller differences, when alternated daily, in part because the client cannot consistently discriminate which one is in effect at any particular point in time.
The discriminability of the different interventions may depend on the frequency with which each intervention is actually invoked, as alluded to earlier. The more frequently the intervention is applied during a given time period, the more likely the client will be able to tell which intervention is in effect. If in a given time interval the intervention is applied rarely, the procedures are not likely to show a difference across the observation periods. In some special circumstances where the goal of treatment is to reduce the frequency of behavior, the number of times the intervention is applied may decrease over time, as behavior improves and the problem occurs less often. As behavior decreases in frequency, the different treatments will be applied less often, and the client may be less able to tell which treatment is in effect. For example, if reprimands and isolation are compared as two procedures to decrease behavior, each procedure might show some effect within the first few days of treatment. As the behaviors decrease in frequency, so will the opportunities to administer the interventions. The client may
have increased difficulty in determining at any point which of the different interventions is in effect.
To ensure that clients can discriminate which intervention is in effect at any particular point in time, investigators can provide daily instructions before each of the treatments administered in an alternating-treatments design. The instructions tell the client explicitly which condition will be in effect at a particular point in time. As a general guideline, instructions might be very valuable to enhance the discrimination of the different treatments, especially if there are several different treatments, if the balancing of treatments across conditions is complex, or if the interventions are only in effect for brief periods during the day.5

Number of Interventions and Stimulus Conditions


A central feature of the alternating-treatments design is balancing the conditions of administration with the separate interventions so that the intervention effects can be evaluated separately from the effects of the conditions. Theoretically, any number of different interventions can be compared during the intervention phase. In practice, only a few interventions usually can be compared. The problem is that as the number of interventions increases, so does the number of sessions or days needed to balance interventions across the conditions of administration. If several interventions are compared, an extraordinarily large number of days would be required to balance the interventions across all of the conditions. As a general rule, two or three interventions or conditions are optimal for avoiding the complexities of balancing the interventions across the conditions of administration. Indeed, most multiple-treatment designs have compared two or three interventions.
The difficulty of balancing interventions also depends on the number of stimulus conditions included in the design. In the usual variation, the two interventions are varied across two levels (e.g., morning or afternoon) of one stimulus dimension (e.g., time periods). In some variations, the interventions may be varied across two stimulus dimensions (e.g., time periods and staff members). Thus, two interventions (I1 and I2) might be balanced across two time periods (T1 and T2) and two staff members (S1 and S2). The interventions must be paired equally often with all time period and staff combinations (T1S1, T1S2, T2S1, T2S2) during the intervention phase. As the number of dimensions or stimulus conditions increases, longer periods are needed to ensure that balancing is complete. The number of interventions and stimulus conditions included in the design may be limited by practical constraints or the duration of the intervention phase. In general, most alternating-treatments designs balance the interventions across two levels of a particular dimension (e.g., time periods). Some variations have included more levels of a particular dimension (e.g., three time periods) or two or more separate dimensions (e.g., time periods and staff) (e.g., Ollendick et al., 1981). From a practical standpoint, the investigation can be simplified by balancing interventions across only two levels of one dimension.

5 Interestingly, if instructions precede each intervention to convey to the clients exactly which procedure is in effect, the distinction between multi-element and alternating-treatments designs becomes blurred. In effect, the instructions become stimuli that are consistently associated with particular interventions. However, the blurred distinction need not become an issue. In the alternating-treatments design, an attempt is made to balance the interventions across diverse stimulus conditions (with the exception of instructions), and in the multi-element design such balancing is not usually attempted. Indeed, in the latter design, the purpose is to show that particular stimuli come to exert control over behavior because of their constant association with particular treatments.
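As a minimal sketch of the balancing requirement (in Python, with hypothetical labels for the interventions, time periods, and staff), the following enumerates the condition combinations with which each intervention must be paired equally often, and shows how quickly the session count grows as interventions are added:

from itertools import product

interventions = ["I1", "I2"]     # hypothetical intervention labels
time_periods = ["T1", "T2"]      # two levels of one stimulus dimension
staff = ["S1", "S2"]             # two levels of a second dimension

# Every intervention must appear in every time-by-staff combination.
cells = list(product(time_periods, staff))
print(f"{len(cells)} combinations per intervention:", cells)

# One complete balancing cycle needs (number of interventions) x (cells)
# session slots; adding interventions or dimensions multiplies the days needed.
for k in (2, 3, 4):
    print(f"{k} interventions -> {k * len(cells)} session slots per cycle")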

Multiple-treatment Interference
In any design in which two or more treatments (interventions) are provided to the same subject, multiple-treatment interference may limit the conclusions that can be drawn (Hains & Baer, 1989; Kazdin, 2003). As noted previously, multiple-treatment interference refers to the effect of one treatment being influenced by the effects of the other. The concept can be illustrated by the simple case in which participants receive two interventions, one right after the other. For example, training in behavioral management skills was provided to parents of children with developmental disabilities to reduce inappropriate parenting behaviors (giving ambiguous commands, unwittingly reinforcing inappropriate child behavior, physical aggression) (Phaneuf & McIntyre, 2007). Parents received group treatment (Intervention 1) for several sessions. This was followed by adding individual video feedback for parent behavior (Intervention 2). The results indicated that inappropriate parenting behaviors were lower with the combined (second) intervention than with the first intervention. Does this mean the second intervention is generally more effective? It is possible that the effects are due to beginning with the group intervention alone; the results may not apply if Intervention 2 were introduced first. The effects of Intervention 2 may be quite different depending on whether it followed the prior intervention or was provided without that prior intervention. In short, when one treatment is preceded or followed by another, the effects of the latter treatment may be in part a function of what preceded it, that is, multiple-treatment interference.
Multiple-treatment interference may result from many different arrangements of administering treatments. For example, if two treatments are examined in an ABAB-type design (e.g., ABCBC), multiple-treatment interference may result from the sequence in which the treatments are administered. The effects of the different interventions (B, C) may be due to the sequence in which they appeared. It is not possible to evaluate and draw conclusions about the effects of C alone, because it was preceded by B, which may have influenced all subsequent performance.
Occasionally, investigators include a reversal phase in ABAB designs with multiple treatments (e.g., ABAC), with the belief that recovery of baseline levels of performance removes the possibility of multiple-treatment interference. However, intervening reversal phases (e.g., ABACABAC) do not alter the possible influence of sequence effects. Even though baseline levels of performance are recovered, it is still possible that the effects of C are influenced in part by the previous history of Condition B. Behavior may be more (or less) easily altered by the second intervention because of the intervention that preceded it. An intervening reversal (or A) phase does not eliminate that possibility.
Multiple-treatment interference is a methodological issue, but an example from everyday life is useful to convey how sequence effects or embedding one intervention in another can make a difference in practical ways. In making requests of children, there are some requests a given child is unlikely to comply with. These are individual for each child and are readily measured. Examples of these requests might be asking him or her
to do a chore, to get ready for school, or to stop doing this or that. One can show that for a given child the likelihood of complying is low for a given set of requests. These are called low-probability requests because they are very unlikely to get compliance. One can get much better compliance with these low-probability requests by preceding them with requests the child will more readily complete. Immediately preceding a low-probability request (please go pick up your toys, work on your math homework while I am out of the classroom) with a small number of high-probability requests (e.g., please give me a high five, clap your hands, put your name on your homework sheet, read the first math problem) increases the likelihood that the low-probability request will be completed. Stated another way to make it more relevant to the present discussion, low-probability requests (by definition) do not yield very much compliance. However, compliance with these requests can be greatly increased by preceding them with some high-probability requests. The sequence of high- then low-probability requests makes the latter requests much more likely to be completed, an example that conveys how juxtaposing different interventions or conditions can alter the impact of one of them. Developing compliance by using high-probability requests is an intervention strategy to develop and shape compliance (e.g., Humm, Blampied, & Liberty, 2005; Wehby & Hollahan, 2000). This is a strategic use of multiple-treatment interference.
In multi-element and alternating-treatments designs, multiple-treatment interference refers to the possibility that the effect of any intervention may be influenced by the other intervention(s) with which it is juxtaposed. Here the influence is not the sequence of B appearing before C. On any given day, both B and C might be provided. Multiple-treatment interference refers to the possibility that the effects obtained for a given intervention may differ from what they would be if the intervention were administered by itself in a separate phase without the juxtaposition of other treatments. As an example, one alternating-treatments design compared the effects of token reinforcement and response cost (fines or loss of tokens) on attentive behavior of children (ages 9 to 12 and with developmental disabilities) (Shapiro, Kazdin, & McGonigle, 1982). In the design, token reinforcement, response cost, and continuation of baseline were implemented at different points. Token reinforcement was more effective (and showed less variability in its impact) when compared to a continuation of baseline during the same intervention phase than when it was compared to response cost.
More generally, two or more interventions might be administered in a multi-element or alternating-treatments design. The methodological point: the effects of each intervention might be "interfered with" (altered) by the presentation of the other either before or during the other intervention. Stated more generally, the results of a particular intervention in a multiple-treatment design may be determined in part by the other intervention(s) to which it is compared.
It is possible that juxtaposing two interventions will dilute their unique effects. Neither may be in place sufficiently long to exert its unique influence, if one exists. Alternatively, the uniqueness could be lost because two similar variants are put next to each other. In alternating-treatments designs, there are examples in which the interventions that are compared produce few or no differences within subjects or are not consistent across subjects (e.g., in teaching reading, or in using symbols for communication for individuals with communication disorders) (Ardoin, McCall, & Klubnik, 2007; Hetzroni, Quist, & Lloyd, 2002). It is possible that the interventions in fact are equally
effective, but it is also possible that using similar methods in the same phase in an alternating fashion made them less distinct in impact than they would have been if given to separate subjects or provided sequentially (ABCABC) in the design.
It is important to be aware of multiple-treatment interference as a methodological issue because it may affect the conclusions reached about the effects of treatment for a specific client and also affect the generality of the findings to others. The issue applies if two or more interventions are alternated in some way, as in the case of the multiple-treatment designs presented in this chapter, or in other designs in which two interventions are presented in sequence over time (e.g., ABCAC). In this latter case, for example, investigators are wont to note in their discussion of the results that Intervention C (e.g., in an ABCAC design) may be more effective than B, and then to state that Intervention C ought to be used more generally. All this might be true, but the intervention was not "really" C but rather C-preceded-by-B. Conclusions about the effect of C alone require a different and further study.
In general, researchers using single-case designs have not given much attention to multiple-treatment interference, and perhaps understandably so (but see Hains & Baer, 1989). The designs are most often used in applied settings with clients in need of change in an important area of functioning (e.g., self-injury of a child with developmental disability, greatly disruptive behavior in a special education class). In such instances, the investigator is interested in identifying an effective intervention and one that has palpable and immediate impact. This might lead to plans for an ABAB or multiple-baseline design, but Intervention B is not as effective as needed. Consequently, a second intervention (C) is introduced. It is quite possible that the second intervention produces an effect due in part to the other condition with which it is compared (multiple-treatment design) or the intervention that preceded it (ABCBC design), but the priority of evaluating this possibility has been overshadowed by the applied goal of the program.
It is important at least to mention, if not elaborate, that applied goals are usually served by the best science, that is, when we have an understanding of how interventions work. In the present discussion, we want to know whether an intervention is truly effective (or more or less effective) based on whether it is connected to some other condition or whether it can be provided by itself. A given application might give priority to helping one child, one classroom, or one group, but our other goal is to extend findings to help many. The best investment in that goal is to understand what interventions work, whether their effects depend on special conditions or other variables, and how the interventions can be administered to help the many who might profit from their application. Knowing whether multiple-treatment interference makes a difference is quite relevant to this broader applied goal.

EVALUATION OF THE DESIGNS
Multiple-treatment designs have several advantages that make them especially useful for applied research. To begin with, the designs do not depend on a reversal of conditions, as do the ABAB designs. Hence, problems of behavior failing to reverse or the undesirability of reversing behavior are avoided. Similarly, the designs do not depend on temporarily withholding treatment, as is the case in multiple-baseline designs in which the intervention is applied to one behavior (or person, or situation) at a time, while the remaining behaviors continue in extended baseline phases. In multiple-treatment designs, the interventions are applied and continued throughout the investigation. The strength of the demonstration depends on showing that treatments produce differential effects across the time periods or situations in which performance is observed.
A second advantage of the design is particularly noteworthy. Most of the single-case experimental designs depend heavily on obtaining baseline data that are relatively stable and show no trend in the therapeutic direction. A stable baseline with no trend is the ideal pattern in virtually all circumstances. If baseline data show improvements, special difficulties can arise in evaluating the impact of subsequent interventions. In multiple-treatment designs, interventions can be implemented and evaluated even when baseline data show initial trends. The designs rely on comparing performance associated with the alternating conditions. The differences can still be detected when superimposed on any existing trend in the data.
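One simple way to see this point numerically is sketched below (in Python, with fabricated illustrative numbers, not a method prescribed by the designs themselves): even when an improving trend runs through the whole series, fitting the trend and comparing the residuals by condition can reveal a consistent difference between alternated conditions B and C.

days = list(range(10))
condition = ["B", "C"] * 5                      # conditions alternate daily
scores = [4, 7, 5, 8, 6, 9, 7, 10, 8, 11]       # upward trend plus a B/C gap

# Fit a simple least-squares line to the full series.
n = len(days)
mx = sum(days) / n
my = sum(scores) / n
slope = sum((x - mx) * (y - my) for x, y in zip(days, scores)) / \
        sum((x - mx) ** 2 for x in days)
intercept = my - slope * mx

# Residuals (observed minus trend) separate cleanly by condition:
# condition B averages about -1.2, condition C about +1.2.
residuals = [y - (intercept + slope * x) for x, y in zip(days, scores)]
for c in ("B", "C"):
    vals = [r for r, lab in zip(residuals, condition) if lab == c]
    print(f"condition {c}: mean residual {sum(vals) / len(vals):+.2f}")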
A third main advantage of the design is that it can compare different treatments for a given individual within a relatively short period. If two or more interventions were compared in an ABAB or multiple-baseline design, the interventions must follow one another in separate phases. Providing each intervention in a separate phase could greatly extend the duration of the investigation. In multiple-treatment designs, the interventions can be compared in the same phase, so that within a relatively short period one can assess whether two or more interventions have different impact. The phase in which both interventions are compared need not necessarily be longer than intervention phases of other single-case designs. Yet only one intervention phase is needed in the alternating-treatments design to compare separate interventions. In classroom and institutional settings, when time is at a premium, the need to identify the more or most effective intervention relatively quickly among available alternatives can be extremely important.
Of course, in discussing the comparison of two or more treatments in a single-case design, the topic of multiple-treatment interference cannot be ignored. When two or more treatments are compared in sequence, as in an ABAB design, the possibility exists that the effects of one intervention are partially attributable to the sequence in which it appeared. In a multiple-treatment design, these specific sequence effects are not a problem, because separate phases with different interventions do not follow each other. However, multiple-treatment interference may take another form. As discussed earlier, the effects of one treatment may be due in part to the other condition with which it is juxtaposed. Hence, in all of the single-case experimental designs in which two or more treatments are given to the same subject, multiple-treatment interference remains an issue, even though it may take different forms. The advantage of the multiple-treatment designs is not in the elimination of multiple-treatment interference. Rather, the advantage stems from the efficiency of comparing different treatments in a single phase. As soon as one intervention emerges as more effective than another, it can be implemented across all time periods or settings.
There is yet another advantage of multiple-treatment designs that has not been addressed. In the alternating-treatments design, the interventions are balanced across various stimulus conditions (e.g., time periods, staff members, or settings). The data are usually plotted according to the interventions so that one can determine which among the alternatives is the more effective. It is possible to plot the data in another way to examine the impact of the stimulus conditions on client behavior. For example,
if the interventions are balanced across two teachers or staff members or two time periods (e.g., morning and afternoon classes), then the data can be plotted to examine whether client behavior varies as a function of teachers (or time periods). In many situations, it may be valuable to identify whether some staff members are having greater effects on client performance than others, independently of the particular intervention they are administering. Because the staff members are balanced across the interventions, the separate effects of the staff and interventions can be plotted. If the data are plotted according to the staff members who administer the interventions in the different periods each day, one can identify those who might warrant additional training, assistance, or emulation.
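A minimal sketch of these two ways of summarizing the same data follows (in Python, with hypothetical session records, not data from any study described here): because the balancing pairs each intervention with each staff member equally often, the same records can be averaged by intervention or by staff member.

from collections import defaultdict

# Hypothetical session records from a balanced alternating-treatments phase.
sessions = [
    {"intervention": "token", "staff": "S1", "score": 8},
    {"intervention": "cost",  "staff": "S2", "score": 5},
    {"intervention": "token", "staff": "S2", "score": 9},
    {"intervention": "cost",  "staff": "S1", "score": 4},
]

def mean_by(key):
    # Group scores by the chosen attribute and average each group.
    groups = defaultdict(list)
    for s in sessions:
        groups[s[key]].append(s["score"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

print("by intervention:", mean_by("intervention"))  # which treatment is stronger
print("by staff member:", mean_by("staff"))         # which staff member has larger effects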

SUMMARY AND CONCLUSIONS
Multiple-treatment designs are used to compare the effectiveness of two or more interventions or conditions that are administered to the same subject or group of subjects. The designs demonstrate an effect of the interventions by presenting each of them in a single intervention phase after an initial baseline phase. The manner in which the separate interventions are administered during the intervention phase serves as the basis for distinguishing various multiple-treatment designs.
In the multi-element design, two or more interventions are usually administered in the intervention phase. Each intervention is consistently associated with a particular stimulus or context (e.g., a teacher or staff member, setting, or time). The purpose of the design is to demonstrate that a particular stimulus, because of its consistent association with one of the interventions, exerts control over performance. The differential effectiveness of the intervention is evident if performance is superior under the stimulus condition or context with which a particular intervention has been connected.
In the alternating-treatments design (also referred to as the simultaneous-treatment design), two or more interventions or conditions are also administered in the same intervention phase. Each of the interventions is balanced across the various stimulus conditions (e.g., teacher or staff member, setting, or time) so that the effects of the interventions can be separated from these conditions of administration. When one of the interventions emerges as the more (or most) effective during the intervention phase, a final phase is usually included in the design in which that intervention is implemented across all stimulus conditions or occasions. Alternating-treatments designs usually evaluate two or more interventions. However, the interventions can also be compared with no treatment or a continuation of baseline conditions.
Several considerations are relevant for evaluating whether a multiple-treatment design will be appropriate in any given situation. First, because the designs depend on showing rapid changes in performance for a given behavior in response to interventions that may change on the same day, special restrictions may be placed on the types of interventions and behaviors that can be included. Second, because multiple treatments are often administered in close proximity (e.g., on the same day), it is important to ensure that the interventions will be discriminable to the clients so that they know when each is in effect. Relatedly, the interventions must be applied and experienced by the client, perhaps even frequently, while each is in place, so that behavior can respond differently to the different interventions. Third, the number of interventions and stimulus conditions employed in the investigation may have practical limits. The
requirements for balancing the interventions across stimulus conditions become more demanding as the number of interventions and stimulus conditions increases.
Finally, a major issue for designs in which two or more conditions are provided to the same subjects is multiple-treatment interference. Multiple-treatment designs avoid sequence effects, that is, the effects of following one intervention with another in separate phases, which is a potential problem when two or more treatments are evaluated in ABAB designs. However, multiple-treatment designs juxtapose the different treatments in a way that still influences the inferences that can be drawn about the treatment. The possibility remains that the effect of a particular intervention may result in part from the particular intervention with which it is contrasted. The extent to which multiple-treatment interference influences the results of the designs described in this chapter has not been well studied.
Multiple-treatment designs have several advantages. The intervention need not be withdrawn or withheld from the clients as part of the methodological requirements of the design. Also, the effects of different treatments can be compared relatively quickly (i.e., in a single phase), so that the more (or most) effective intervention can be applied. In addition, because the designs depend on differential effects of varied conditions on behavior, trends during the initial baseline phase need not impede initiating the interventions. Finally, when the interventions are balanced across stimulus conditions (e.g., staff, teachers, and classrooms), the separate effects of the interventions and these conditions can be examined. In general, the designs are often well suited to the demands of clinical, educational, and rehabilitation settings. Where there are viable treatments to be explored, an interest in identifying which might be more or most effective, and a need to avoid withdrawal or reversal phases, the designs can be extremely useful.
CHAPTER 10

Additional Design Options

CHAPTER OUTLINE

Combined Designs
Description and Underlying Rationale
Variations
Problems and Considerations
Design Additions to Examine Transfer of Training and
Response Maintenance
Probes
Graduated Withdrawal of the Intervention
General Comments
Between-group Designs
Utility of Between-group Designs in Relation to
Intervention Research
Illustrations of Between-group Designs
Illustrations of Single-case and Between-group
Designs Combined
General Comments
Summary and Conclusions

Variations of the designs discussed to this point constitute the majority of evaluation strategies used in single-case research. Several other options are available that represent novel variations on single-case designs, special design features to address questions about the maintenance or generalization of behavior, and combinations of single-case and between-group design strategies. The total population of design variations would be impossible to convey. Moreover, it might not even be useful. The variations derive from understanding how the designs work — what I have referred to as the logic of single-case designs and how they rule out or make implausible various threats to validity. Even so, this chapter conveys some of the available variations because of their special uses and their bridges to traditional between-group designs. The chapter discusses several design options, the rationales for their use, and the benefits of different strategies for applied research.

COMBINED DESIGNS

Description and Underlying Rationale


Although the designs discussed in previous chapters are most often used in their "pure" forms, features from two or more designs are frequently combined. I provided some examples in variations of other designs, but it is important to illustrate and discuss these more explicitly. Combined designs are those that include features from two or more designs within the same investigation. The purpose of using combined designs is to increase the strength of the experimental demonstration. The clarity of the results can be enhanced by showing that the intervention effects meet the requirements of more than one design. For example, an intervention may be evaluated in a multiple-baseline design across individuals. The intervention is introduced to subjects at different points in time and shows the expected pattern of results. The investigator may include a reversal phase for one or more of the subjects to show that behavior reverts to or near the original baseline level. Demonstration of the impact of the intervention may be especially persuasive, because requirements of multiple-baseline and ABAB designs were met.
The use of combined designs would seem to be an example of methodological overkill. That is, the design may include more features than necessary for clearly demonstrating an experimental effect. Yet combined designs are not used merely for experimental elegance. Rather, the designs often address genuine problems that are anticipated or actually emerge within an investigation.
The investigator may anticipate a problem that could compete with drawing valid inferences about intervention effects. For example, the investigator may select a multiple-baseline design (e.g., across behaviors) and may believe that altering one of the baselines might well influence other baselines. A combined design may be selected. If baselines are likely to be interdependent, which the investigator may have good reason to suspect, he or she may want to plan some other feature in the design to reduce ambiguities if requirements of the multiple-baseline design are not met. A reversal phase might be planned in the event that the effects of the intervention across the multiple baselines are not clear. Perhaps the reversal phase would not be used but would be kept as an option in case the effects from the multiple-baseline portion of the design are unclear. And if it were used, it would be possible to apply it to only one of the baselines.
Alternatively, in the discussion of changing-criterion designs, I mentioned that performance must match the changing criterion rather closely to provide a clear demonstration. Many investigators add a mini-reversal to show bidirectional changes in performance during the intervention phase. This combination gains the strength of features of both changing-criterion and ABAB designs, and it nicely overcomes the less-than-perfect correspondence of performance and criteria across subphases.
Combined designs do not necessarily result from plans that the investigator makes in advance of the investigation. Unexpected ambiguities often emerge over the course of the investigation. Ambiguity refers to the possibility that extraneous events rather than the intervention may have led to change. Perhaps the extraneous events cannot be so easily ruled out in light of the data pattern. The investigator decides whether a feature from some other design might be added to clarify the demonstration. Combined
designs often reflect the fact that the investigator is reacting to the data by invoking elements of different designs to resolve the ambiguity of the demonstration. The ability of single-case designs to respond to the data is a methodological strength of the approach. This ability is a strength from an applied perspective as well. Clients (students, patients, and children) are likely to benefit if changes in the intervention can be made based on emerging data.

Variations
Combined designs incorporate features from different designs. Perhaps the most commonly used combined design integrates features of ABAB and multiple-baseline designs. An excellent and still timely example of combining features of an ABAB design and a multiple-baseline design across behaviors was reported in an investigation designed to help an 82-year-old man who had suffered a massive heart attack (Dapcich-Miura & Hovell, 1979). After leaving the hospital, the patient was instructed to increase his physical activity, to eat foods high in potassium (e.g., orange juice and bananas), and to take medication.1 A reinforcement program was implemented in which he received tokens (poker chips) each time he walked around the block, drank juice, and took his medication. The tokens could be saved and exchanged for selecting the dinner menu at home or for going out to a restaurant of his choice.
The results, illustrated in Figure 10.1, show that the reinforcement program was gradually extended to each of the behaviors over time in the usual multiple-baseline design. Also, baseline conditions were temporarily reinstated to follow an ABAB design. The results are quite clear. The data met the experimental criteria for each of the designs. With such clear effects of the multiple-baseline portion of the design, one might wonder why a reversal phase was implemented at all. Actually, the investigators were interested in evaluating whether the behaviors would be maintained without the intervention. Temporarily withdrawing the intervention resulted in immediate losses of the desired behaviors. There are procedures that can be used to maintain behavior (Kazdin, 2001); the reversal phase suggests that something is needed to accomplish that.
In another illustration, features of an ABAB design and a multiple-baseline design were used to evaluate interventions for high school students with moderate mental retardation (IQs ranging from 40 to 53) (Hughes, Alberto, & Fredrick, 2006). The students worked in various settings (e.g., book warehouse, nursing home, YMCA) where they were assigned to complete activities (e.g., emptying boxes, cleaning). Each student was identified because of problem behaviors (e.g., not starting work immediately, being off task or talking on multiple occasions within an observational session, asking questions repeatedly unrelated to one's work, not complying with requests, and others), which were individualized for each student. The intervention consisted of providing verbal instructions (prompts) to work (e.g., "It's time to start your work.") along with praise (e.g., "You are doing a good job." "Nice work — keep it up."). The effect of the intervention was tested in brief sessions before the intervention

' A diet high in potassium was encouraged becausc the patients medication prohahly included
diuretics (medications that increase the flow o f urine). With such medication, potassium is often
lost from the body and has to be consumed in extra quantities.
Figure 10.1. Number of adherence behaviors (walking, orange juice drinking, and pill taking) per day under baseline and token reinforcement conditions. (Source: Dapcich-Miura & Hovell, 1979.)

phase was implemented more fully. That brief demonstration conveyed that prompts and praise definitely improved behavior, so these were implemented in the daily work regimen. The prompts and praise were prerecorded on a tape that each student was instructed to listen to while working. Each received a tape player; the statements on the tape were personalized with the name of each person and included a set of prompts or praise every 2 minutes. Data were gathered in the work setting for 20 minutes to evaluate the effects of the prompting procedure on reducing problem behavior. Sessions were conducted across two time periods (AM and PM) that served as the basis for the multiple-baseline part of the design.
Figure 10.2 shows the effects of the taped prompts to pay attention on problem behaviors. The intervention was introduced in the morning (AM) and afternoon (PM) time periods in a multiple-baseline design. As is evident, the prompting procedure led to change when it was introduced to the first baseline (AM). No change was evident in the second baseline (with missing sessions of observation) until the intervention was introduced. Thus, the requirements of a multiple-baseline demonstration were met. In addition, there was a return-to-baseline (Baseline 2) and then a reintroduction of the intervention (Attention Prompts 2). The first four phases in the AM and the PM form an ABAB design. The final follow-up phase included the Attention Prompts phase again, 2 weeks later. The results of the demonstration are very clear. Problematic behaviors were reduced by the prompting and praise procedures. The next task, not the goal of this study, would be to see if the behaviors can be maintained without the intervention.
When ABAB and multiple-baseline designs are combined, there is no need to extend the reversal or return-to-baseline phase across all of the behaviors, persons, or situations. One example is the case of an intervention that focused on four institutionalized individuals (ages 9 to 21) with developmental disabilities (Favell, McGimsey, & Jones, 1980). The individuals ate their food rapidly, which is not merely socially unacceptable but can raise health problems (e.g., vomiting or aspiration). To develop slower eating, the investigators provided praise and a bite of a favorite food to residents who paused between bites. Verbal instructions and physical guidance (physical prompts) were used initially by stating "wait" and by manually guiding the persons to wait. These prompts were removed and praise was given less frequently as eating rates became stable.
A multiple-baseline design across two individuals illustrates the effects of the intervention, as shown in Figure 10.3. A reversal phase was used with the first person, which further demonstrated the effects of the intervention. The design is interesting to note because the reversal phase was employed for only one of the baselines (people). Because multiple-baseline designs are often selected to circumvent use of return-to-baseline phases, the partial application of a reversal phase in a combined design may be more useful than the withdrawal of the intervention across all of the behaviors, persons, or situations.
Although features of ABAB and multiple-baseline designs are commonly combined, other design combinations have been used as well. In the usual case, reversal phases are added to other designs, as noted in the chapters on the changing-criterion and multiple-treatment designs. Yet other variations are easily found. A combined alternating-treatments and multiple-baseline design was illustrated in a report of a child with severe developmental disability who engaged in self-injurious behavior (Wacker et al., 1990). The primary question was whether making requests of the child would influence the amount of self-injurious behavior. In the active demand or request condition, the child was asked specifically to participate in an activity; in the passive demand condition, the child was not asked but was allowed to engage in activities on his own. The two demand conditions were the "treatments" in an alternating-treatments design and were administered at different times each session. No baseline was included, in part because demanding conditions were already in place and hence there might be no "true" baseline without some demand condition already going on. The example begins with a comparison of two interventions without baselines.
Figure 10.2. Frequency of problem (target) behaviors per 30-minute session for one child across two different time periods (AM, PM). The intervention (listening to prompts to pay attention on a prerecorded tape) was evaluated in a combined multiple-baseline design (across time periods) and ABAB design. (Source: Hughes et al., 2006.)
Figure 10.3. Rate of eating for subjects 1 and 2 across baseline and treatment conditions. (Solid data points represent data from two daily meals; open data points represent data from a single meal.) (Source: Favell, McGimsey, & Jones, 1980.)

Figure 10.4 shows that self-injurious behavior was higher in the passive than in the active condition in each of the four contexts. In an alternating-treatments design, the more or most effective intervention is often implemented in the final phase. The figure shows that the active demand condition was introduced in the final phase in the fashion of a multiple-baseline design across different activities. Self-injury remained low once active demands were used for all observation periods, and this is evident across each of the baselines as the active-only demands phase was introduced. This is a strong demonstration of an intervention effect without any baseline.

Problems and Considerations


The use of combined designs can greatly enhance the clarity of intervention effects in single-case designs. Features of different designs complement each other, so that the weaknesses of any particular design are not likely to interfere with drawing valid inferences. For example, it would not be a problem if behavior did not perfectly match a criterion or range of criteria in a changing-criterion design if that design also included components of a multiple-baseline or ABAB design; nor would it be a problem if each behavior did not show a change when and only when the intervention was introduced in a multiple-baseline design if intervention effects were clearly shown through the use of a return-to-baseline phase. Thus, within a single demonstration, combined designs provide different opportunities for showing that the intervention is responsible for the change.
Figure 10.4. Self-injurious behavior in a design that combines alternating-treatments and multiple-baseline features. Passive and active demands were made of the child to evaluate their impact on self-injury (phase 2); active demands were associated with lower self-injury and were provided for all observation periods in the second phase. This second phase was introduced in a multiple-baseline fashion across different types of activities and contexts. Self-injury remained at the low rate achieved during the period in which active demands were compared to passive demands. (Source: Wacker et al., 1990.)
Most combined designs consist of adding a reversal or return-to-baseline phase to another type of design. A reversal phase can clarify the conclusions that are drawn from multiple-baseline, changing-criterion, and multiple-treatment designs. Interestingly, when the basic design is an ABAB design, components from other designs are often difficult to add to form a combined design if they are not planned in advance. In an ABAB design, components of multiple-baseline or multiple-treatment designs may be difficult to include, because special features ordinarily included in other designs (e.g., different baselines or observation periods) are required. On the other hand, it may be possible to use changing criteria during the intervention phase of an ABAB design to help demonstrate control over behavior.
T he advantages o f com bined designs bear some costs. T h e problem s o r concerns
from each o f the constituent designs can emerge. For exam ple, Ln a com m on ly used
com bined A B A B and multiple-baseline design, the investigator has to contend with
the disadvantages o f reversal phases (from an A B A B design) and with the possibility ot
extended baseline phases for behaviors that are the last to receive the intervention (from
a multiple-baseline design). These potential problems do not interfere with draw ing
inferences about the intervention, because in one way or another a causal relationship
can be demonstrated. However, practical considerations m ay introduce difficulties in
meeting criteria for both o f the designs. Indeed, such considerations often dictate the
selection o f one design (e.g., multiple baseline) over another (e.g., A B A B ). G iven the
range o f options available within a particular type o f design and the com binations o f
different designs, it is not possible to state flatly what disadvantages or advantages will
emerge in a com bined design.
In combining designs, I have mentioned combinations of ABAB with other designs. In Chapter 6, where ABAB designs were first introduced, I noted that one reason the design may not be appropriate in many clinical, educational, and institutional settings is precisely because of the reversal phase and suspending the benefits of treatment for purposes of the design. There is an option that has not received sufficient attention in combined designs, namely, the use of the mini-reversal phase, as I called it in the discussion of changing-criterion designs. In these latter designs, the criterion for performance is made increasingly more stringent, in keeping with the core feature of the designs. Investigators occasionally make a mini-reversal in which the intervention is not withdrawn, but rather for a brief period the criterion is made less stringent. This is a reversal that allows demonstration of a change in direction of performance as the criterion is made more and less stringent across subphases of the changing-criterion design.
Mini-reversal phases have two distinct advantages. First, they are reversal phases and accomplish the purposes of such phases as they are used in ABAB designs (i.e., describe, predict, and test prior predictions), in keeping with the logic of single-case designs. Second, the reversal is not complete in the sense of a return, or hoped-for return, to baseline levels of performance. In fact, the change in criterion in a mini-reversal still keeps performance at a level of improvement well above baseline levels. Thus one does not withdraw treatment or eliminate the gains by a complete suspension of the intervention. The mini-reversals of changing-criterion designs might be the first choice of investigators seeking to draw on the logic and benefits of an ABAB design in using a combined design.
DESIGN ADDITIONS TO EXAMINE TRANSFER OF TRAINING AND RESPONSE MAINTENANCE
The discussions of designs in previous chapters have focused primarily on techniques to evaluate whether an intervention was responsible for change. Typically, the effects of an intervention are replicated in some way in the design to demonstrate that the intervention rather than extraneous factors produced the results. Early in the development of effective behavior-change techniques (e.g., in the 1960s), the priority obviously was on showing that change could be obtained in several areas pertinent to work in clinics, schools (e.g., in education and special education), hospitals (medical and psychiatric), institutions for various populations (e.g., individuals with developmental disability, pervasive developmental disorders), the community (e.g., use of energy in the home, wearing seat belts while driving), and many others (Kazdin, 2001). The priority has not changed. It is still the case that most programs in the schools, rehabilitation settings, and juvenile and adult justice systems are not evaluated empirically and are not evidence based. Thus, we still need evaluations to show that much of what we are doing leads to change and does not harm.
That said, for m any interventions evidence has accum ulated that changes can be
achieved with children, adolescents, and adults in the diverse settings in w hich they
function (e.g., at home, at school, work, and in the com m unity) (e.g., Austin & Carr,
2000; C o o p er et al., 2007). As the ability to change behavior becam e well docum ented,
priority shifted to the obvious next questions: I f behavior can be changed, can we get
those changes to extend to other situations and settings (referred to as transfer o f train-
ing), and can we get the behaviors to continue over time even after the intervention
has been w ithdrawn (referred to as response m aintenance)? There are procedures to
accom plish both in terms o f techniques for developing change (see C ooper et al., 2007;
Kazdin, 2001). T he investigation o f transfer o f trainin g and response maintenance can
be facilitated by several design options, including the use o f probe techniques and w ith -
draw al o f treatment, as discussed next.

Probes
Probes were introduced and illustrated earlier (Chapter 3) and were defined as the assessment of behavior on selected occasions, usually when no intervention is in place to alter or influence that behavior. Probes are commonly used to determine whether a behavior not focused on directly has changed over the course of the investigation. For example, a successful program may have helped a very shy and withdrawn child interact in the classroom under the careful control of a program implemented by the teacher, but do the changes carry over to the playground? Probe assessments can help answer the question. Because the intervention is not in effect during the periods or situations in which the probes are assessed, the data from probe assessment address the generality of behavior across responses and situations.
Consider probes as a design addition that can be applied to any of the single-case designs. We use this tool if we want to evaluate the spread of effects of our intervention, usually to other situations, but this could also include other behaviors of the same individual. Probes could also be used to evaluate change over time by conducting an assessment occasionally at separate points over the course of follow-up, although this is less often their use.

I mentioned in the discussion of multiple-baseline designs that occasionally the intervention might be applied to one behavior (individual or situation) and that the change might spread to other behaviors even before the intervention is applied to them. In a multiple-baseline design across two or more situations, the clarity of the demonstration depends on showing that change occurred when and only when the intervention was introduced and not before. The generality of change can be an obstacle in versions of the design. In most situations, we want transfer of the behavior to other situations. We change behavior in one classroom, but we also want the changes to transfer or extend to other classrooms as needed; we train a child to say “thank you” when a relative provides a birthday gift, but we want “thank yous” to extend well beyond those occasions and beyond relatives. Probes can be used to sample behavior outside of the situation in which an intervention is applied to see if there is generality of the behavior.
Probes have been used to evaluate different aspects of generality. Typically, the investigator trains a particular response and examines whether the response occurs under slightly different conditions from those included in training or at different times. The use of probes was nicely illustrated in a study designed to increase bicycle helmet use among middle school students (Van Houten, Van Houten, & Malenfant, 2007). Approximately 70% of children ages 5 to 14 ride bicycles. Annually, several hundred children are killed, and over 40,000 are injured, in bicycle accidents. Head injuries are the chief cause of hospital admissions and death. Bicycle helmet wearing reduces the risk of death and injury by over 85%.
In this study, the goal was to increase helmet use among children who commuted by bike to school. Adult and peer observers were trained to observe helmet use daily, including whether the helmet was worn and worn correctly (e.g., buckled snugly, buckled in the correct place). As students left school, observers recorded the students who were wearing helmets, and this was converted to a percentage of riders. After baseline, the intervention consisted of having an assembly where instructions about the use of helmets were provided, goal setting by the group (what percentage should they aim for), and posting the percentage of correct helmet use each week in the school cafeteria and school entrance. Also, students were told that if they met their goal, there would be a party (with pizza, ice cream, and small prizes).
The intervention was evaluated in a multiple-baseline design across three schools, and the results are presented in Figure 10.5. Probes were used to evaluate the extent to which helmet use was carried out at locations other than the school where the primary program and observations had been conducted. Locations were selected approximately one-half mile from the school, where most students needed to pass on their way home. This was referred to as the distance probe (because it was some distance from the school). In addition, generalization probes were included to see if helmet use extended to the ride to school. (Recall that the program was based on whether students wore helmets on their way home.) The multiple-baseline data suggest that the intervention was responsible for change. Perhaps the third school is slightly ambiguous because there appeared to be a slight trend toward improvement at the end of baseline. Even so, the overall pattern supports the effects of the intervention. Both the distance probes (diamonds in the figure) and the generalization probes for the morning period (triangles) indicated that behavior during the probe assessment was consistent with

Figure 10.5. The percentage of students wearing bicycle helmets correctly at all three middle schools, plotted across daily sessions. The last 4 days of the helmet program at Riviera were after the party, as was the last day of the helmet program at Meadowlawn. Gray diamonds show the percentage of helmet use during the distance probes taken after school. Open triangles show the percentage of helmet use during the morning probes. (Source: Van Houten et al., 2007.)

the behavior during the time the children left school. Helmet use when leaving school was very similar to helmet use farther away from school and in coming to school, even though neither of these was specifically included in the program.
The use of probes represents a relatively economical way of evaluating the generality of responses across a variety of conditions. The use is economical because assessment is conducted only on some occasions rather than on a continuous (e.g., daily) basis. An important feature of probe assessment is that it provides a preview of what can be expected beyond the conditions of training. Often training is conducted in one setting (e.g., classroom) with the hope that it will carry over to other settings (e.g., playground, home). Probes can provide assessment of performance across settings and yield information on the extent to which generalization occurs. If generalization does occur, this should be evident in probe assessment. If generalization does not occur, the investigator can then implement procedures designed to promote generality and can evaluate their effects through changes on the probe assessment.
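To make the probe logic concrete, consider a minimal sketch in Python. Everything here is a hypothetical illustration of my own (invented numbers and variable names), not data or procedures from any study discussed in this chapter: occasional probe occasions are compared against the continuous record from the training setting to gauge generality.

    # Hypothetical illustration: comparing continuous training-session data
    # with occasional probe data to gauge generality of behavior.
    training_sessions = [72, 75, 80, 83, 85, 88, 90]  # % of desired behavior in the training setting
    probe_sessions = {10: 78, 20: 84, 30: 89}         # session number -> % on probe occasions

    mean_training = sum(training_sessions) / len(training_sessions)
    mean_probe = sum(probe_sessions.values()) / len(probe_sessions)

    # If probe performance tracks training performance, generalization is
    # suggested; a large gap signals the need to program for it explicitly.
    print(f"Training mean: {mean_training:.1f}%; probe mean: {mean_probe:.1f}%")
    print(f"Gap: {mean_training - mean_probe:.1f} percentage points")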

Graduated Withdrawal of the Intervention


Evaluating whether behavior change transfers to other settings and conditions is nicely handled by probe designs. Of equal interest is evaluating whether behavior is maintained after the program is terminated. A component that can be added to the design is a graduated withdrawal of the intervention to evaluate maintenance of behavior.
The study of maintaining behavior after withdrawing treatment seems to conflict with a core design we have discussed, namely, the ABAB design. We expect (and methodologists long for) return-to-baseline levels of behavior once the intervention (B) is withdrawn and there is a return-to-baseline (second A) phase. Of course, for the sake of the client and for all clients to whom our interventions might be applied, we want behavior to be maintained. When we change performance in the classroom with a temporary intervention for a child or the class as a whole, we want the effects to continue after removing the intervention. As mentioned earlier, we might train a child to say “thank you” by praising this early in life. We certainly want “thank you” to continue (be maintained) long after our intervention has ended, and long after the child is out of the home.
In many programs, the intervention is withdrawn abruptly, either during an ABAB design or after the investigation is terminated. As might be expected, under such circumstances behaviors often revert to or near baseline levels. The rapidity of the return of behavior to baseline levels may in part be a function of the manner in which the intervention is withdrawn. Short intervention periods (e.g., a few days) and abruptly returning to baseline conditions would be expected to lead to abrupt loss of behavioral gains. Under these circumstances, the behaviors have not been allowed to occur often, to move toward actions that are routine, or possibly to be maintained by the everyday environment. Clearly we want intervention effects to be maintained. Graduated withdrawal of the intervention can be added to the design to assess whether responses are maintained. As with probes, this is an element that can be added to any design. After the intervention effects have been demonstrated unambiguously, withdrawal procedures can be added to evaluate response maintenance (see Rusch & Kazdin, 1981).
As an illustration, in one program eight children (ages 6 to 8) were identified and referred to a special classroom for their oppositional, aggressive, and antisocial
behavior (Ducharme, Folino, & DeRosie, 2008). An intervention referred to as “errorless acquiescence training” was used to develop social skills. The authors developed acquiescence in the children (e.g., flexibility in responding to peers, sharing, taking turns, going along with someone else’s ideas, letting others go first, and others) with the notion that this would be a keystone skill, that is, a behavior that when developed would be associated with a broad range of other desirable behaviors not specifically developed. Training consisted of gradually introducing conditions associated with the problem behavior, moving to increasingly challenging situations, and providing reinforcement for managing these situations. Training was conducted within the classroom in a special area where a skill was taught. The eight children were taught in two groups of four. Observers recorded prosocial and antisocial behaviors, cleaning up (putting away toys), and acquiescing. Observations were made in the classroom, and training was introduced in a multiple-baseline design across the two groups of children.
The intervention phase consisted of developing acquiescence and then gradually withdrawing the intervention procedures. During the intervention phase, training was conducted (Phase 1) in which there was discussion of the skill, incorrect and correct modeling of the skill (with discussion), and role playing. The skills were part of acquiescence and included behaviors mentioned previously (e.g., sharing, taking turns). Children had the opportunity to play as part of the session and received prompts, feedback, and praise for use of the skills. The intervention was gradually withdrawn. In Phase 2, the amount of instruction was reduced; the discussion and modeling of the incorrect skills were dropped, and modeling of the correct behavior was decreased over this phase. Role playing was the only training component that remained. Yet, during the play activity part of the sessions, prompts and other components of Phase 1 continued. In Phase 3, all instructional components including role playing were dropped, as were other features such as prompts, feedback, and reinforcement during play. The final phase of the study essentially was a return to baseline.
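The general shape of such a graduated withdrawal can be summarized as a simple phase schedule. The Python sketch below is only a hypothetical paraphrase of the idea; the component names loosely echo the procedures just described and are not the investigators’ actual protocol:

    # Hypothetical sketch of a graduated-withdrawal schedule: components
    # are dropped phase by phase rather than all at once, so that
    # maintenance can be checked at each step.
    phases = [
        ("Phase 1", ["discussion", "modeling", "role playing", "prompts", "feedback", "praise"]),
        ("Phase 2", ["role playing", "prompts", "feedback", "praise"]),  # instruction faded
        ("Phase 3", []),                                                 # all components withdrawn
    ]
    for name, components in phases:
        active = ", ".join(components) if components else "none (return to baseline)"
        print(f"{name}: active components -> {active}")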
Figure 10.6 presents the data for changes in the frequency of antisocial behavior. As the figure shows, the intervention was introduced in a multiple-baseline design across the children (two groups). The effects are clear in showing that the antisocial behaviors decreased when the intervention was introduced and not before. (Data are not presented here for developing acquiescence behaviors, prosocial behaviors, and cleaning up, which also showed this pattern.) Withdrawal of the intervention during the subphases of the intervention phase was complete in the final phase, namely, a return to baseline. For all but one or two of the children, antisocial behavior was maintained at the low level achieved during treatment.
This is an excellent demonstration insofar as the design established that the intervention led to change (multiple-baseline design). The investigators wanted to develop and maintain the behaviors, so the intervention was gradually withdrawn. The return-to-baseline phase did not show a reversal. One was not needed for the design. Rather, this was a test of how to maintain the behavior. When an ABAB design is used, there is a need to show a reversal. In such designs, the second B phase can be one in which the intervention is faded or gradually withdrawn. For example, exposure and reinforcement were used to overcome a needle phobia of an 18-year-old adolescent diagnosed with Type 2 diabetes, autism, and mental retardation (Shabani & Fisher, 2006). The person had not allowed others to draw the blood necessary for the monitoring of insulin

Figure 10.6. Frequency of antisocial behaviors across baseline, intervention, and return-to-baseline phases. The intervention was evaluated in a multiple-baseline design across children (divided into two groups). The intervention implemented in Phase 1 was then gradually faded or withdrawn. (Source: Ducharme et al., 2008.)

for a period of more than 2 years. In an ABAB design, the intervention was shown to reduce his avoidance in the testing procedure. Strong effects were demonstrated in the ABA portion of the design. Then a lengthy second B phase was used to fade the procedure. As the intervention was gradually withdrawn, the behavior was maintained at high levels. A 2-month follow-up indicated that blood could be drawn at home with no problems.

General Comments


Generalization of intervention effects across situations and maintenance of these effects over time reflect important substantive issues in any area of intervention research. I highlight design opportunities to study these. Probes and withdrawal of interventions seem to depart from critical design features I have emphasized. Continuous assessment is essential in single-case designs. Probes represent strategic use of noncontinuous assessment to answer questions about generalization of effects across situations as well as behaviors. Similarly, withdrawal of interventions has been discussed as a reversal phase where one expects a return of behavior to baseline levels. Yet return-to-baseline phases sometimes are used to evaluate maintenance. Gradually withdrawing an intervention is a strategy that allows for a return to baseline with the expectation that behavior will be maintained.
As with other designs and other features, probes and withdrawal phases are tools to add to single-case evaluation. Each is an element that can be incorporated into specific designs. However, neither is a substitute for the requirements of a design. The investigator demonstrates change and the effects of the intervention in the usual way. These other components are superimposed on that demonstration to address questions of generalization or maintenance.

BETWEEN-GROUP DESIGNS
Traditionally, research in psychology, education, medicine, counseling, and other scientific disciplines in which interventions are evaluated focuses on comparing different groups. This is research in the tradition of quantitative research in which there is null hypothesis testing and statistical evaluation of the data. I refer to the studies as traditional between-group designs here to emphasize that major comparisons are made between or among groups receiving different interventions or intervention and control conditions. I use “between-group” because “group design” alone might not convey the point. Many “single-case” studies use groups of individuals (e.g., an entire classroom or community), in which case the group is combined and counted as a “single case.” (One has to forego terminological purity to even enter the door to methodology. Between-group research: must it be about just two groups? When there are three groups, should that not be “among-group research”?)
In the simplest between-group study, one group receives an intervention and another group does not. Participants are assigned to groups randomly. When interventions are compared and evaluated (e.g., educational program, psychotherapy, chemotherapy), this basic design is referred to as a randomized controlled trial (RCT), as mentioned earlier in the book. The random assignment of participants to conditions increases the likelihood that the groups are equivalent on potentially critical variables that may relate to outcome as well as on the measures that are to be used to evaluate
the intervention. Differences at the end of the study are more likely to be due to intervention effects than to pre-existing characteristics of the groups. Traditional between-group designs, their variations, and unique methodological features and problems have been described in numerous sources (e.g., Kazdin, 2003; Rosenthal & Rosnow, 2007; Shadish, Cook, & Campbell, 2002) and are not elaborated here. I mention between-group studies because they are often used in combination with single-case designs. Hence it is useful to discuss the contribution of between-group methodology to single-case designs.
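As an aside, the mechanics of random assignment are simple to express. The Python sketch below is a hypothetical illustration (the participant labels and group sizes are invented, and a real trial would typically add safeguards such as blocking or stratification):

    # Hypothetical sketch of simple random assignment to two conditions,
    # the defining feature of a randomized controlled trial (RCT).
    import random

    participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 invented participant IDs
    random.shuffle(participants)                        # randomize the order

    half = len(participants) // 2
    assignment = {
        "intervention": participants[:half],
        "control": participants[half:],
    }
    # Shuffling before splitting gives each participant an equal chance of
    # entering either condition, which is what makes the groups comparable
    # (in expectation) on measured and unmeasured characteristics alike.
    for condition, members in assignment.items():
        print(condition, members)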

Utility of Between-group Designs in Relation to Intervention Research


Between-group designs often provide important information that is not easily obtained or is not obtained in the same way as it is in single-case designs. Between-group methodology provides alternative ways to gather information of applied interest and provides an important way to replicate findings obtained from research using the subjects as their own controls.
Between-group research is well suited to several types of questions in relation to interventions (e.g., treatment, educational, rehabilitation, school-based prevention programs). Table 10.1 highlights these briefly, and each is discussed here. First, between-group comparisons are especially useful when the investigator is interested in comparing two or more treatments. Difficulties occasionally arise in comparing different treatments with the same subject. Difficulties are obvious if the investigator is interested in comparing interventions with theoretically discrepant or conflicting rationales (e.g., family therapy, individual therapy). One treatment would appear to contradict or undermine the rationale of the other treatment, and the credibility of the second treatment would be in question. Even if two treatments are applied that appear to be consistent, their juxtaposition in different phases for the same subject may be difficult. As already discussed, when two or more treatments are given to the same subjects, multiple-treatment interference is a methodological risk; that is, the effects of one treatment may be influenced by the other treatment(s). Multiple-treatment interference is a concern if treatments are implemented in different phases (e.g., as in variations of ABCABC designs) or are implemented in the same phase (e.g., as in multiple-treatment designs). Comparison of treatments between groups provides an evaluation of each intervention without the possible influence of the other.

Table 10.1 Key Contributions of Between-Group Research


Between-group designs have special strengths in evaluating interventions. Group studies are particularly well suited for:
• Comparing the effects of two or more interventions;
• Identifying the magnitude of change relative to no intervention;
• Examining the prevalence or incidence of a particular condition or disorder;
• Evaluating the change and course of functioning over extended time periods;
• Evaluating correlates (concurrent features), risk factors (correlates that predict), and protective factors (correlates that attenuate or moderate the influence of risk factors);
• Testing the feasibility and generality of implementing interventions across multiple sites;
• Evaluating moderators and factors that may interact (statistically) with an intervention; and
• Evaluating mediators and mechanisms, that is, processes that may account for or explain how changes come about.

See text for terminology embedded within this list.



A second contribution of between-group methodology to applied research is to provide information about the magnitude of change between groups that do and do not receive the intervention. Often the investigator is not only interested in demonstrating that change has occurred but is also interested in measuring the magnitude of change in relation to persons who have yet to receive the intervention. Magnitude is often evaluated in research in terms of effect size, a measure of impact of the intervention in standard deviation units in which intervention and non-intervention groups are compared.¹ Also, one can estimate magnitude by noting and comparing the percentage of individuals in each group that meet some predetermined and meaningful criterion. For example, in psychological and medical research, two such measures might be the percentage of treated individuals, relative to controls, who were free from all symptoms of panic disorder by the end of treatment or years later, or who survived for 5 years without any recurrence of cancer, respectively. Essentially, a no-treatment group provides an estimate of these outcomes that occur without intervention, and that is a very much needed baseline comparison against which the performance of the intervention group is evaluated.
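For concreteness, the effect size formula given in the footnote can be computed in a few lines of Python. The scores below are invented for illustration, and the degrees-of-freedom weighting used for the pooled standard deviation is the usual convention (an assumption on my part, since the footnote leaves the pooling formula implicit):

    # Hypothetical illustration of the effect size (ES) computation:
    # difference between group means divided by the pooled standard deviation.
    from statistics import mean, variance

    intervention = [12, 15, 14, 18, 16, 17]  # invented outcome scores
    control = [10, 11, 9, 13, 12, 10]

    n1, n2 = len(intervention), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(intervention) +
                  (n2 - 1) * variance(control)) / (n1 + n2 - 2)

    es = (mean(intervention) - mean(control)) / pooled_var ** 0.5
    print(f"ES = {es:.2f} standard deviation units")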
A third contribution is to identify the rates of dysfunction or other characteristics in the population. For example, studies that are designed to assess prevalence rates (how many people have a particular condition such as a disease or psychiatric disorder) and incidence rates (how many new cases with the condition emerge in a given period) are often large-scale group studies. Group studies and large-scale studies that sample broadly to represent diverse types of individuals are needed. From that information, analyses can identify subgroups as especially likely or unlikely to develop a condition.
A fourth contribution of between-group research is to study changes over extended periods of time (e.g., decades). Longitudinal studies often delineate groups (e.g., at risk or not at risk for some adverse mental or physical health outcome in adulthood, or those with and without preacademic skills before they enter school) and follow them for decades to identify and understand factors over the course of development that may predict the outcome or avoidance of the outcome.
A fifth contribution is elaborating features associated (correlated) with a disorder or condition of interest. This is accomplished concurrently by examining what characteristics go together in a large population. For example, children who have difficulties with math and reading have what other characteristics (e.g., in other academic subjects, in social behavior, in family characteristics)? In addition, the characteristics can be studied prospectively in a sample to identify early characteristics (e.g., in the home, neighborhood, prenatally) that correlate with some later outcome of interest (e.g., psychiatric disorder, genius, achievement). Early predictors of later outcomes are called “risk factors” (even when the outcomes are positive). Also, a group known to be at risk for

¹ Effect size (ES) refers to the magnitude of the difference between two (or more) conditions or groups and is expressed in standard deviation units. For the case in which there are two groups in the study, effect size equals the difference between means, divided by the standard deviation:

ES = (m₁ − m₂) / s

where m₁ and m₂ are the sample means for the two groups or conditions (e.g., intervention and control groups), and s equals the pooled standard deviation for these groups.

some outcome (e.g., because of exposure to cigarette smoking in utero or to violence in childhood) is often studied to identify who does not later show the expected outcome. These correlates are called “protective factors” (even though they are not known to really have a direct role in protecting). Concurrent or prospective characteristics (e.g., risk and protective factors) studied in a large group or population are all correlates or associated features; they can lead to a deeper understanding of important pathways to or processes toward an outcome.
A variation of correlational research is worth highlighting in passing. This work focuses on naturalistic interventions that are not under the control of the experimenter. Between-group comparisons are exceedingly important to address questions about differences between or among groups that are distinguished on the basis of circumstances outside the experimenter’s control. Such research can address important applied questions such as: Does the consumption of cigarettes, alcohol, or coffee contribute to certain diseases? Do some family characteristics predispose children to psychiatric disorders? Does television viewing have an impact on children? These are correlational longitudinal studies that require one or more groups of individuals.
A sixth use of between-group methodology for applied research arises when multi-site studies are used to evaluate generality of findings across settings. With large-scale investigations, several settings and locations may be employed to evaluate a particular intervention or to compare competing interventions. Because of the magnitude of the project (e.g., several schools, cities, hospitals), some of the central characteristics of single-case methodology may not be feasible. For example, in large-scale applications across schools, resources may not permit such luxuries as continuous assessment on a daily basis over time. By virtue of costs of assessment, observers, and travel to and from schools, assessment may be made at a few points in time (e.g., pretreatment, posttreatment, and follow-up). In such cases, between-group research may be the more feasible strategy because it requires fewer resources for assessment.
A seventh contribution of between-group research is to examine moderators or moderating variables. Moderators are those variables that may interact with the intervention to produce an outcome. A moderator is any variable that influences the magnitude or direction of the relation between the intervention and the outcome. For example, if the intervention is more effective with younger rather than older children or with males rather than females, then age and sex, respectively, would be moderators. Moderators are also discussed as statistical interactions in which the effect of one variable (e.g., treatment) depends on the level or characteristic of another variable (e.g., ethnicity). Group studies can evaluate these interactions in ways that are not readily available in single-case research. For example, being physically abused as a child does not greatly increase the likelihood of engaging in criminal or antisocial behavior (e.g., aggression, stealing) as an adult. However, if one is physically abused and also has a subtle genetic characteristic (polymorphism) that affects a receptor of a brain neurotransmitter, the likelihood of criminal and antisocial behavior in adulthood is greatly increased (Caspi et al., 2002; Kim-Cohen et al., 2006).² That is, the effects of

² The genetic characteristic (polymorphism) relates to the enzyme monoamine oxidase A (MAO-A), which metabolizes serotonin. Other human (e.g., natural mutations) and non-human animal research (e.g., genetic studies) has shown this enzyme to be implicated in aggressive behavior. In

child abuse on later antisocial behavior are moderated by a genetic characteristic. This research requires between-group designs to identify individuals with and without the various combinations of the characteristics of interest.
Studying the separate and combined effects of two or more interventions is another example of moderator research. The investigator may be interested in studying two or more variables simultaneously. For example, the investigator may wish to examine the effects of feedback and reinforcement alone and in combination. Two levels of feedback (feedback vs. no feedback) and two levels of reinforcement (contingent praise vs. no praise) may be combined to produce four different combinations of the variables. Four groups are included in the design; each group receives one of the four different combinations. Between-group research is required to study such combinations. In single-case research it is difficult to explore interactions of the interventions with other variables to ask questions about generality of intervention effects, that is, the extent to which intervention effects extend across other variables.
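In analytic terms, the feedback-and-praise example is a 2 × 2 factorial, and the moderation question is the interaction term in a regression model. The Python sketch below simulates such a study; the data, the effect values, and the choice of the statsmodels library are all illustrative assumptions of mine, not an analysis drawn from the text:

    # Hypothetical sketch: testing a moderator as a statistical interaction
    # in a 2 x 2 between-group design (feedback x contingent praise).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 200
    feedback = rng.integers(0, 2, n)  # 0 = no feedback, 1 = feedback
    praise = rng.integers(0, 2, n)    # 0 = no praise, 1 = contingent praise
    # Simulated outcome with main effects plus an interaction
    # (here, praise amplifies the effect of feedback).
    outcome = 50 + 5 * feedback + 3 * praise + 4 * feedback * praise + rng.normal(0, 5, n)

    data = pd.DataFrame({"feedback": feedback, "praise": praise, "outcome": outcome})
    model = smf.ols("outcome ~ feedback * praise", data=data).fit()
    # The feedback:praise coefficient estimates the moderation (interaction) effect.
    print(model.params)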
A final contribution of between-group intervention research is the study of mediators and mechanisms of change (Kazdin, 2007). In our research and applications, we begin with the idea that the intervention will be effective but also have an underlying view as to why it will exert impact; we have a “small theory,” as this is sometimes called (Lipsey, 1996). That theory of what might be going on influences what we are to measure. That is, we can test our small theory rather than just assume why the intervention works by including measures of the processes we consider to be responsible for change. The focus on mediators and mechanisms of change directly reflects this interest. Mediator refers to an intervening process that may explain why the effect occurred. Showing that some intervening process (e.g., changes in cognitions) is correlated with therapeutic change and that therapeutic change is not likely to occur without these changes would be an example of a mediator. Establishing a mediator is based on a statistical relation between some intervening process and some outcome. A mediator does not necessarily explain precisely how some outcome comes about. Mechanism refers to a specific process that shows more precisely how the change comes about. A mechanism reflects a deeper level of knowledge by showing not only that change depends on the presence of an intervening process but also how that process unfolds to produce an outcome. Between-group research has been quite useful in studying mediators and mechanisms.
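The regression logic behind such mediation tests can be sketched briefly. The Python code below simulates a treatment whose effect runs entirely through a mediator and then estimates the two paths; the data, the path values, and the simple product-of-coefficients estimator are illustrative assumptions, not the specific analyses reviewed in Kazdin (2007):

    # Hypothetical sketch of regression-based mediation: does an intervening
    # process (mediator) statistically carry the treatment's effect?
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 300
    treatment = rng.integers(0, 2, n).astype(float)
    mediator = 2.0 * treatment + rng.normal(0, 1, n)  # path a: treatment -> mediator
    outcome = 1.5 * mediator + rng.normal(0, 1, n)    # path b: mediator -> outcome

    # Path a: regress the mediator on treatment.
    a = sm.OLS(mediator, sm.add_constant(treatment)).fit().params[1]
    # Path b: regress the outcome on treatment and mediator jointly.
    X = sm.add_constant(np.column_stack([treatment, mediator]))
    b = sm.OLS(outcome, X).fit().params[2]

    # The product a * b estimates the indirect (mediated) effect.
    print(f"path a = {a:.2f}, path b = {b:.2f}, indirect effect = {a * b:.2f}")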
It is useful to highlight several of the strengths of between-group research. At the same time, the danger is to imply or foster the belief that one design strategy can answer a particular question and the other design strategy cannot. This might be true in a

² (Continued) the initial demonstration, Caspi et al. (2002) found that abused children with a genetic polymorphism related to the metabolism of serotonin have much higher rates of antisocial behaviors than those without this polymorphism. Among boys with the genetic characteristic and maltreatment, 85% developed some form of antisocial behavior (diagnosis of Conduct Disorder, personality assessment of aggression, symptoms of adult personality disorder, or court conviction of violent crime) by the age of 26. Individuals with the combined allele and maltreatment constituted only 12% of the sample but accounted for 44% of the cohort’s violent convictions. Further research has replicated and extended the finding by noting that parental neglect as well as abuse in conjunction with the polymorphism increases risk for conduct problems and violence (Foley et al., 2004; Jaffee et al., 2005).

given instance, but the contributions of different designs are more nuanced. Consider an important case in point.
There has been increased interest in understanding mediators and mechanisms of change, as highlighted above as a contribution of between-group research. To understand mediators, usually a group study is done in which there is a pretreatment measure (e.g., anxiety among patients referred for that), some measure in the middle of treatment (e.g., thought processes that might be considered by the investigator to be responsible for, i.e., mediate, therapeutic change), and posttreatment measures. Stated generally, the goal of the study is to see if changes in the thought processes might be the basis of improvements at the end of treatment. A variety of statistical tests are applied in an effort to see if patients improve (in anxiety) only if they show these cognitive changes (thought processes). Between-group studies are the standard way of conducting research on mediators (see Kazdin, 2007). The statistical tests used in such research require a group (e.g., the treatment group in a treatment-group and no-treatment-group study) to evaluate the statistical relation of cognitive processes and changes in anxiety. It is nearly impossible to identify single-case experiments in applied settings that have studied mediators.
Consider for a moment the following. Single-case designs can not only study mediators but bring unique and needed advantages. In group research, mediators (thought processes in the above example) are measured at a fixed and predetermined point in time, or let us say even two points in time. So if there are 15 sessions of therapy, the investigator may measure thought processes somewhere in the middle (e.g., session 8) or at the end (e.g., session 15) or perhaps at two places during treatment. The point of “when” does not matter for this discussion. Here is an enormous problem that group investigations of mediators encounter. The mediator of change for each person might well be changes in thought processes. However, when that change occurs might be different for each individual. Your change in the putatively critical thought processes might be at session 5, mine at session 11. Your change might be rapid if we saw a graph of thought processes assessed continuously over time (e.g., each session). My changes might be gradual and slow. In short, in proposing a mediator of change, investigators do not really believe that change in the mediator can be assessed at only one or two points and accurately capture each subject. The weakness of group studies in evaluating mediators is assuming there is a fixed point that is appropriate to assess the mediator. In fact, single-case designs can come to the rescue.
We need ongoing assessment of each subject to see the relation of the mediator to outcome. A mediator may show little or no relation to some outcome in a between-group study merely because the mediator was not assessed at the optimum point for each subject in the study. An assumption in the study is that a particular point adequately sampled the change in the mediator for all subjects. This assumption is almost certainly false. Mediators could be readily studied in single-case designs that allowed examination of the mediator-outcome relation for each subject and then combined the data if some larger purpose would be served. Glossing over individual differences as if they did not exist can actually hamper the identification of mediators of change.
I use the point about mediators not to advocate, or not only to advocate, for the use of single-case designs in this context. Rather, I wish to note that between-group research can evaluate many critical questions, but these are not all unique to that
research strategy. The mediator example might be a strong illustration of the point because mediators are rarely studied in single-case designs but would draw on one of the strengths of the designs, namely, continuous and ongoing assessment.

Illustrations of Between-group Designs


It is useful to illustrate between-group designs with a few key studies to convey several of the contributions they make. First, comparing different treatments, or treatment and control conditions, is often more feasible in a between-group study. Consider an example of interventions for cigarette smoking, an obviously significant health problem. All sorts of interventions have been tried and evaluated. One opportunity that exists for intervention is to have physicians advise their patients who smoke to stop smoking. What if physicians just told their patients to stop smoking? This seems very naive as an intervention. We know fairly well that instructions, requests, understanding, insight, knowledge, and other interventions like this are generally very weak. By weak I mean they affect few people, produce variable effects, and are not among the more effective ways of changing behavior. All that said, it would be useful to know if comments from physicians to stop smoking would make a difference.
In the United States, the average physician visit is 12 to 15 minutes in duration. In controlled trials, patients who are cigarette smokers have been assigned randomly to receive or not receive an intervention that merely consists of statements to stop smoking from the physician or nurse. Two examples of these statements used in research are: (1) “I think it is important for you to quit tobacco use now,” and (2) “As your clinician, I want you to know that quitting tobacco is the most important thing you can do to protect your health.” Comments like these have a small but reliable effect in leading to abstinence. Individuals who receive the message have a 2.5% greater abstinence rate than those assigned to no intervention. Although very brief comments (1 minute) are sufficient to achieve an effect, there is a dose-response relation; that is, more time and advice lead to slightly greater abstinence rates (Fiore et al., 2000; Rice & Stead, 2008; Stead, Bergson, & Lancaster, 2008).
No doubt the intervention might be tested in single-case designs (e.g., multiple-baseline across doctors, clinics, and perhaps patients within a setting). However, the between-group study is well suited to the question because a large number of people are required to obtain an estimate of abstinence rates. Also, the key question requires a control group that gives the base rate of abstinence without any special physician intervention. The question of interest was nicely addressed between groups. Several between-group studies have replicated the effects of brief comments, and such comments have now become standard physical-exam practice among physicians.
The advantage of between-group studies can be seen where there is interest in large-scale, multi-site evaluations of an intervention. Among the questions are the relative effectiveness of separate and combined interventions and the shorter- and longer-term effects of such interventions. For example, the largest study of treatments for child hyperactivity is the NIMH Multimodal Treatment Study of Children with Attention-Deficit/Hyperactivity Disorder (MTA Study). In seven sites, 579 children (ages 7 to 9) were included in a 14-month regimen of treatment (MTA Cooperative Group, 1999a, 1999b; Swanson et al., 2002). All children met diagnostic criteria for ADHD, the diagnostic category that includes excessive activity and impulsiveness in contemporary
psychiatric diagnosis. The standard treatment is stimulant medication, which has been well studied and shown to decrease hyperactivity while the child is on medication. Behavioral management treatment has not been as effective. In this study, treatment conditions were compared that included: (1) medication management; (2) behavioral treatment involving parents, school, and child programs; (3) medication and behavioral treatment combined; and (4) treatment as usual in the community. Treatment as usual in the community consisted mostly of medication (for two-thirds of the children) but did not include the careful management and titration of medication as did the medication management condition.
Among the key findings, participants in both the medication and the combined treatment showed greater improvement than those in the behavioral-treatment-only or treatment-as-usual groups. On core symptoms of ADHD, medication and combined treatments were no different. There was some superiority of the combined treatment in relation to non-ADHD symptoms and prosocial functioning (e.g., internalizing symptoms, prosocial skill at school, parent-child relations), but these effects were not strong. The pattern of results was similar up to 2 years later, after treatment had been terminated. This study illustrates several advantages of a between-group design, including comparison of different treatment conditions (without the concern of multiple-treatment interference) and evaluation of treatment on a large scale and across many sites. Also, because of the large sample, subsequent reports could examine child characteristics (moderators) that might influence responsiveness to treatment.
A final example conveys the utility of between-group research in testing mechanisms of action in the context of intervention research. Background for this work stems from years of careful non-human animal research on mechanisms of learning and extinction of fear responses. Briefly, elimination of fear appears to depend on a particular receptor in the brain (N-methyl-D-aspartate in the amygdala) (see Davis, Myers, Chhatwal, & Ressler, 2006). Non-human animal research has shown that chemically blocking the receptor interferes with extinction and that making the receptor work better enhances the extinction process. The laboratory research has moved to psychotherapy trials for the treatment of anxiety. Exposure therapy, based on an extinction model, is one of the most well-studied treatments for anxiety. There are variations of the treatment, but the essential ingredient is repeated or extended contact with the anxiety-provoking stimuli. Such exposure leads to extinction; that is, the stimuli no longer evoke an anxiety reaction. Drawing on laboratory findings, investigators have evaluated whether manipulating the mechanism that influences extinction can be used to enhance the benefits of exposure-based treatment. Controlled trials have compared two forms of exposure therapy: the regular version of the therapy and that same version with use of a medication (D-cycloserine) that activates the receptor mentioned previously. Activation of the receptor would be expected to augment extinction of anxiety and improve the effectiveness of treatment. As expected, the enhanced exposure treatment is more effective, and this finding has now been replicated with samples seen for different types of anxiety, including acrophobia (fear of heights), obsessive-compulsive disorder, and social anxiety (e.g., Hofmann et al., 2006; Kushner et al., 2007; Ressler et al., 2004; Wilhelm et al., 2008). These studies have been completed between groups in RCTs in which some individuals received the enhanced intervention and others did not, or they received exposure therapy with a placebo (to control for taking a medication and any expectancy that it might invoke).

These illustrations highlight the very special contributions of between-group research, which has dimensions that cannot be readily addressed in single-case designs. More generally, between-group and single-case designs have their unique strengths but also share in the questions they can address. Design features from these different traditions are occasionally combined.

Illustrations of Single-case and Between-group Designs Combined


Between-group and single-case designs reflect different methodological approaches, but they occasionally are combined. There are many reasons to combine the designs. One would be to overcome the possibility of multiple-treatment interference when more than one intervention is provided for the same people. Between-group studies provide interventions to separate sets of individuals. Another reason is that between-group studies usually require many subjects (for statistical power), and sometimes not nearly enough subjects are available to detect a difference between conditions (e.g., see Kazdin & Bass, 1989). Single-case designs with continuous assessment (many assessments, few subjects) can be used here. Yet combined designs can do more. Consider some examples of combined between-group and single-case designs.
An example of a combined between-group and single-case design focused on 10 individuals (ages 6 to 36) who were diagnosed with Tourette’s syndrome (Azrin & Peterson, 1990). The disorder, considered to be neurologically based, consists of multiple motor and vocal tics such as head, neck, and hand twitching; eye rolling; shouting of profane words; grunting; repetitive coughing or throat clearing; or other utterances. The tics, which begin in childhood, usually are quite conspicuous. In this program, features of a between-group design and multiple-baseline design were combined. The 10 cases were assigned either to receive treatment or to wait for a 3-month period (waiting-list control group). The assignments were made on a random basis. Tics were assessed daily with recordings at home and periodically with videotapes of each person at the clinic. The treatment consisted of habit reversal, which includes several different behavior therapy treatments, such as being made more aware of the behavior, self-monitoring, relaxation training, and practicing a response that competes with the tic (i.e., is incompatible with the tic, such as contracting the muscles in a different way, or breathing in a way that prevents making certain sounds). Also, family members praised improved performance at home. Many of the components have been used as separate interventions for various problems.
The results are shown in Figure 10.7, which graphs the number of tics per hour at home and at the clinic for the treatment group and the waiting-list control group. The baseline phase for the waiting-list group is longer because of the wait period before treatment was given to them. Hence, the continuous observations over baseline and treatment also meet criteria for a multiple-baseline design (across groups). The results indicate that the intervention clearly showed marked impact on tics. This is important because Tourette’s syndrome has not been effectively treated with various psychotherapies or pharmacological treatments. The demonstration is persuasive in large part because of the combination of multiple-baseline and group-design strategies. Would either design alone have been persuasive? Probably the multiple-baseline portion is clear, but in group research 10 cases is not enough to assign to even one group, let alone two. To detect differences between groups usually requires larger sample sizes

Figure 10.7. Monthly mean (average) of Tourette’s syndrome tics per hour measured in the clinic and home settings, plotted over months, for subjects in the immediate treatment group (upper panel) and waiting-list group (lower panel). The data illustrate the combined multiple-baseline design across two groups: one group that received treatment immediately and one group that waited for the initial period. (Source: Azrin & Peterson, 1990.)

(statistical power again). However, in this combined study, the group data provide very useful information about the likely changes over time without an intervention during the waiting period.
Another combined design evaluated a parenting program to prevent child abuse (Peterson et al., 2002). Women who had young children, who used physical discipline, and who were high in anger toward their children (self-report scale) participated. They were assigned randomly to receive either the 16-week program (that taught parent-management skills and anger control) or no treatment. Women in both groups completed daily diaries that included answering open-ended questions about what their children did and how they responded. No specific questions were asked about harsh punishment (slapping, pushing, screaming) or about whether the child’s disruptive behavior was ignored or followed with time out from reinforcement, two of the many parenting skills that were trained. Observers coded the diaries to assess frequency of harsh punishment and use of better strategies (ignoring or time out). Eighty-one
women (approximately 65% European American, 28% African American, and 7% other minorities) completed the study.
Figure 10.8 shows the continuous data on parent harsh discipline over the weeks of the study. Quite clearly, physical punishment declined in the group that received the training, as demonstrated in the between-group part of the design. Figure 10.9 adds to the information by showing the impact of training a parent in the use of time out. In this latter figure, baseline observations of the frequency of using time out were quite similar for the intervention and nonintervention groups. At the point that training was introduced (vertical line) in this procedure for the intervention group, frequency of using time out increased. This single-case feature of the design closely resembles a multiple-baseline design across groups in which the intervention is introduced to one group (to one baseline) but not to the other. The results show that change in the use of time out occurred when the intervention was introduced and not before. In a complete multiple-baseline

Figure 10.8. Number of occurrences of physical punishment reportedly used as a strategy of discipline by the mothers each day, plotted by week number, for the experimental (intervention) group (bottom line in the graph) and the nonintervention control group (top line). The observations were made daily; however, the graph shows the mean for each week over the 16 weeks. (Source: Peterson et al., 2002.)

Figure 10.9. Number of occurrences of time out reportedly used as a strategy of discipline by the mothers each day, plotted by week number, for the experimental (intervention) and control groups. The vertical line indicates the point at which time out was trained as a parenting skill for the intervention group. At that point there is an increase in the use of time out for that group but not for the control group. The observations were made daily; however, the graph shows the mean for each week over the 16 weeks. (Source: Peterson et al., 2002.)

design, the intervention would have been introduced to the second group (baseline). Even without that, the effects are clear. The demonstration also resembles a multiple-treatment design. After baseline, two conditions were evaluated (training in time out vs. no training). The figure shows that the two “treatments” (intervention, no-intervention) vary in their impact. Overall, elements of single-case design convey rather clearly that the intervention led to change. The continuous data allow one to see the progress over the course of training (Figure 10.8), which is not otherwise evident in the usual between-group design where only a pretest and posttest are used.

General Comments


Historically, between-group designs are often criticized by proponents of single-case research. Conversely, advocates of between-group research rarely acknowledge that
single-case research can make a contribution to science. Both positions are difficult to defend, are unnecessary, and ignore why we do research and what research design accomplishes. First, often alternative design methodologies are differentially suited to different research questions. Between-group designs appear to be particularly appropriate for larger scale investigations, for comparative studies, and for the evaluation of moderators. Single-case designs are especially useful in demonstrating the impact of interventions on individuals and in making decisions to improve intervention effects while the intervention is in place. Single-case designs also permit the experimental study of rare phenomena. Accumulating cases with unique or low-frequency problems so that they can be studied in between-group research is not feasible.
Second, on the many occasions in which both methodologies might address similar or identical questions (e.g., Does this intervention have any effect?), the yield may be different. In our own training and teaching of students, we have not conveyed in our presentation of research design and methodology more generally that the findings one obtains can vary as a function of assessment methods (of the same construct) as well as a function of the design. So, for example, we know that studying a phenomenon in longitudinal designs (group studies with the same cases studied over time) can yield findings that differ from those obtained in cross-sectional designs (group studies with different cases studied at a single point in time but with different participants representing the different age groups). If one is interested in characteristics of individuals at ages 5, 15, and 25, this could be studied by sampling three groups of people at these different ages (cross-sectional study) or by sampling a large group of 5-year-old children and measuring them at different ages (at ages 5, 15, and 25) over time. Findings from such studies are often different, in part because the three different groups (cohorts) in the first study are exposed to many different influences (e.g., in the culture, in health care) and those influences are not the same when one group (the 5-year-olds) is followed over time. There are many other reasons, but this conveys the point, namely, that different methods can have different yields, and often do (e.g., Aldwin & Gilmer, 2004; Chassin, Presson, Sherman, Montello, & McGrew, 1986). We want between-group designs and single-case designs because of the different facets of a phenomenon they may reveal. It is very informative when there are consistencies among findings that are studied with different methods. Also, when there are inconsistencies, this raises important questions about why. Answers to the why questions greatly enhance our understanding of phenomena.
There are unique virtues of each of the design strategies. Without drawing on different methodologies, one loses entire pockets of research. For example, virtually every school district, school, or classroom in the United States has some "program" designed to help students read, write, do better in some way, and prevent some problem (e.g., suicide, unprotected sex, the use of drugs). In almost every case, the program is not evaluated, not really known or shown to work, and cannot be evaluated if one has to wait for a between-group study and an RCT. However, many of these programs (within one classroom, one school) could be more easily evaluated in a design that did not require groups, larger numbers, and random assignment. A multiple-baseline design across behaviors, children, or class periods is merely one set of viable options. Similarly, in the context of psychotherapy, most treatments used in practice are not evidence-based treatments, that is, shown from research to be effective, a topic warranting a separate book. Even if they were, we would not know in any given case whether a patient receiving such a treatment would benefit. Single-case methods (e.g., ongoing assessment, probes) and designs can improve the quality of patient care and the demonstration of change, which is not possible with the usual group research.
Overall, the issue of research is not a question of the superiority of one type of design over another. Different methodologies are means of addressing the overall goal, namely, understanding the influence of the variety of variables that affect functioning. Alternative design and data-evaluation strategies are not in competition but rather address particular questions in service of the overall goal.

SUMMARY AND CONCLUSIONS

Although single-case designs are usually implemented in the manner described in previous chapters, elements from different designs are frequently combined. Combined designs can increase the strength of the experimental demonstration. The use of combined designs may be planned in advance or decided on the basis of the emerging data. If the conditions of a particular design fail to be met or are not met convincingly, components from other designs may be introduced to improve the clarity of the demonstration. A common example might be a multiple-baseline design where there is a little ambiguity about whether each behavior changed when and only when the intervention was introduced, or a changing-criterion design where there was a general improvement in behavior that sort of met the criterion changes but did not do so very persuasively. In each case, a mini-reversal in one of the baselines in a multiple-baseline design or a bidirectional criterion shift in a changing-criterion design could be improvised to clarify the demonstration.
Apart from combined designs, special features may be added to existing designs to evaluate whether or the extent to which intervention effects generalize or extend to responses, situations, and settings that are not included in training. Probes were discussed as a valuable tool to explore generality across responses and settings. With probes, assessment is conducted for responses other than those included in training or for the target response in settings where training has not taken place. Assessment can provide information about the extent to which training effects extend to other areas of performance. Graduated withdrawal of the intervention was also discussed as a way of evaluating maintenance of the intervention effects. The intervention is gradually withdrawn to see if behavior continues to be performed (maintained). Withdrawal can be graduated in many ways and may include reducing the components of the intervention or how it is implemented, to resemble the conditions of everyday life. The goal is to remove the intervention completely while the behavior remains at the level achieved during the intervention phase.
Finally, the contribution of between-group designs to questions of applied research was discussed. Between-group designs are especially useful in comparing two or more treatments; identifying the magnitude of change relative to no treatment; examining the prevalence or incidence of a particular condition or disorder; evaluating the change and course of functioning over extended time periods; evaluating correlates, risk factors, and protective factors in relation to a characteristic or outcome of interest; testing the feasibility and generality of implementing interventions across multiple sites; evaluating moderators and factors that may interact (statistically) with an intervention; and evaluating mediators and mechanisms that may account for or explain how changes come about. Between-group designs were discussed because they are often used in conjunction with single-case designs.

In general, the present chapter discussed some of the complexities in combining design strategies and adding elements from different methodologies to address applied questions. The combinations of various design strategies convey the diverse alternatives available in single-case research beyond the individual design variations discussed in previous chapters.
Quasi-Single-Case Experimental Designs

CHAPTER OUTLINE

Background
  Why We Need Quasi-experiments
  Methodology as Problem Solving
What to Do to Improve the Quality of Inferences
  Collect Systematic Data
  Assess Behavior (or Program Outcomes) on Multiple Occasions
  Consider Past and Future Projections of Performance
  Consider the Type of Effect Associated with Treatment
  Use Multiple and Heterogeneous Participants
  General Comments
Illustrations of Quasi-experimental Designs
  Selected Variations and How They Address Threats to Validity
    Study 1: With Pre- and Post-assessment
    Study 2: With Repeated Assessment and Marked Changes
    Study 3: With Multiple Cases, Continuous Assessment, and Stability Information
  Examples of Quasi-experiments
    Pre-post Assessment
    Continuous Assessment Helps to Evaluate Change
    Continuous Assessment Over Baseline and Intervention Phases Helps Further
    Continuous Assessment and Marked Changes
    Continuous Assessment, Marked Changes, and Multiple Subjects
    Other Variations and Illustrations
  General Comments
Perspective on Quasi-experiments
Summary and Conclusions


The previous chapters described single-case experimental designs. These are true experiments, which consist of investigations that permit maximum control over the independent variable or manipulation of interest. This control permits one to rule out or make very implausible threats to internal validity and to make causal statements about the impact of the intervention. The term "true experiment" has been defined by leading methodologists (and my mentors!) as an arrangement in which randomization is central (Campbell & Stanley, 1963; Cook & Campbell, 1979). Randomization is possible in both between-group and single-case research, as I have noted previously and will revisit. Randomization has a well-deserved respect in distributing possibly confounding variables across groups or conditions so they are less likely to bias results and let those threats to validity run freely and get into all sorts of trouble. Even with randomization, groups or conditions are not necessarily equivalent before the intervention or after, if there is a loss of subjects (e.g., Hsu, 1989; Kazdin, 2003). Randomization in any design is never a guarantee that valid inferences can be drawn.
For the present chapter, if not more broadly, it is useful to consider all design arrangements on a continuum. I refer to this as a continuum of confidence, in which the confidence reflects the extent to which one can be assured that the intervention was responsible for change. On the left side of the continuum, we can place the anecdotal case study, in which skepticism, disbelief, and low confidence in the conclusions are generally well earned. On the right side of the continuum is an arrangement that allows a strong inference that the intervention was the reason that change occurred and that all or most of the threats to validity are not at all plausible in explaining the findings. In considering experiments of any kind, the emphasis must be on the strength of the allowable conclusions rather than on any specific procedure that comprises the design. Randomization can contribute greatly to the right side of the continuum and has deservedly special status, as I have noted.
A true experiment is a study in which the arrangement is such that ambiguity of the finding is absolutely minimal. Most often randomization helps enormously, but it is not essential. Consider an ABABABAB design where the ideal pattern in the data is obtained and the describe, predict, and test functions of all the phases are met. There might not be randomization, but the allowable inference about the intervention (B) is unusually strong and the best one can hope for in one study. (The salvation of science is replication, and our confidence in a finding stems largely from multiple instances that it can be obtained.) As we see later, randomization can be part of many single-case designs (Kratochwill & Levin, in press) beyond my coverage of the procedure in the discussion of multiple-treatment designs. In any case, true experiments in the present context refer to demonstrations that permit the strongest possible inferences that empirical studies allow. In single-case research, true experiments refer to arrangements in which the investigator is able to control assessment occasions (e.g., over time, across subjects) and implementation and withdrawal of the intervention to meet the requirements of the design. The level of control is evident in presenting or withdrawing the intervention over time (e.g., ABAB designs) or across behaviors or participants (e.g., multiple-baseline designs) and so on for each of the other designs.

Quasi-experiments refer to those designs in which the conditions of true experiments are approximated (Campbell & Stanley, 1963).¹ Some facet of the study, such as presentation, withdrawal, or alteration of the intervention, is not readily under the control of the investigator. For example, an investigator may be asked to implement a reading program for students placed in a special education class. The investigator may be able to obtain a pretest assessment of reading levels of the students (one data point at pretreatment) but have no resources for continuous daily assessment of reading over time. Also, design features that make the arrangement a true experiment (withdrawal of the intervention as in an ABAB design or sequential implementation of the intervention in a multiple-baseline fashion) may not be feasible because of some practical or ethical consideration. The luxuries of a true experiment (whether between-group or single-case) are not possible. What can one do? A great deal. Quasi-experiments can provide very strong bases for drawing inferences; these designs, their logic, use, and application serve as the basis of the present chapter.

BACKGROUND

Why We Need Quasi-experiments


An initial question might be posed: Why do we even need quasi-experimental designs, especially if we know from the outset that we will not be able to draw strong causal inferences? Consider extremes of methodological practices as two ends of a continuum. On one end, say the right side of the continuum, we have true experiments in all their pristine elegance, with careful assessment and careful control over the intervention and its presentation and withdrawal. On the left side of the continuum, we have the anecdotal case study, a narrative description of what happened (in therapy for an individual, in the classroom for a teacher, or with a diet for a person). The assessment is not systematic or replicable, and the intervention was not well described, controlled, or presented in a way that would permit demonstration of change. It is rare that inferences can be drawn from such case studies. Quasi-experiments serve as a useful way of conceptualizing the middle ground of the continuum. These include all of the arrangements that might be developed or used to draw inferences.
As to why we need them, consider the following. The intervention world is filled with "programs." These are extremely well-intended interventions to help people in relation to physical health, mental health, education, rehabilitation, summer camp for children with all sorts of goals, the elderly, and so on. In the United States, for example, there is an endless array of federal, state, county, and city programs. These are interventions designed for groups with important and special needs, which might include children with disability or in poverty, partners who are abused or in need of assistance, homeless persons, and support groups of all sorts. There is nothing in my comment that impugns programs per se or these foci in particular. I love programs (and whenever I go to a baseball game, I buy one). We are fortunate to be in a society where resources and genuine efforts are deployed to help.

¹ True experiments and quasi-experiments do not exhaust the ways in which experiments are classified. Observational studies refer to several designs in which the investigator selects groups or conditions and makes comparisons (e.g., between individuals with depression vs. those without depression or between individuals with early educational disadvantage who later graduate vs. do not graduate college). In these studies, groups are selected who received some intervention not under control of the investigator. Observational studies have generated enormous advances in generating and testing hypotheses in key areas we often take for granted (e.g., risk factors for cancer, effects of cigarette smoking, impact of divorce on children) (see Kazdin, 2001). These designs are not discussed because the focus of the book and chapter is on intervention research where the investigator manipulates some condition. Sometimes the investigator can control how that intervention is provided and to whom (true experiments) or can only approximate control of these conditions (quasi-experiments).
All of that said, it is rare that programs include the means to evaluate their impact and to provide information that would justify continuation of the program. Many programs (e.g., delivering meals to those without food) have immediate goals that may not require intricate outcome assessment—delivery of the program itself is viewed as improving life. Many other programs are designed to have broader impact (e.g., abuse prevention, parent-teacher meetings in the schools, wilderness programs for youth with delinquent behavior, special education, late-night basketball to control juvenile mischief and crime) but are rarely evaluated. The options for evaluation (true experiments) are recognized as too costly and not feasible; agencies that fund the program rarely fund evaluation of the effects of the program. Consequently, most programs are not evaluated at all. There are anecdotal reports about all the good the programs seem to be accomplishing. Yet, we know that programs sometimes are not effective and that sometimes programs actually harm (i.e., are known to make people worse as demonstrated in true experiments). For example, group programs and therapies that place children with aggressive and other antisocial behaviors together as part of treatment have made the children worse (see Dishion, McCord, & Poulin, 1999; Dodge et al., 2006; Feldman et al., 1983), as I have mentioned previously. Placing such children in groups, even if these groups are designed for treatment purposes, can foster further bonding to deviant peers, which in turn increases subsequent deviant behavior. This is important to know.²
The prospect of well-intended programs not working or making people worse in some way is not restricted to a particular problem domain or sample. For example, programs have been designed to foster abstinence among teenagers with the goal of preventing pregnancy, sexually transmitted diseases, and risky sexual behavior early in life. These programs involve a curriculum and conclude with individuals taking a virginity pledge, that is, pledging abstinence from sexual intercourse. At one point in the United States, 13% of adolescents had taken the pledge (Bearman & Bruckner, 2005). Do the programs work, that is, do they have the intended effects? We would want to know not only because of the importance of the consequences (e.g., sexually transmitted diseases, teen pregnancy) but also to determine whether the money allocated to such programs ($204 million in 2008) is the best use of the funds for that goal.

² Group treatment of youths with antisocial behavior does not invariably lead to those individuals becoming worse. A selective review of the evidence shows this is not automatic at all, and when it does occur not all relevant measures may reflect deterioration of performance (Weiss et al., 2005). Also, some group treatments designed for youth with disruptive, even if not delinquent, behaviors are well established as effective (Lochman, 2010). The point here is merely to recognize that deleterious effects of treatment can occur and have documented instances in controlled studies (e.g., Feldman et al., 1983, still an excellent example).

One cannot do an RCT very easily, but one can look at those who pledge and those who do not and match them on all sorts of variables that make competing rival hypotheses implausible.
In one such exemplary study, the groups were selected and matched on 112 variables (e.g., sex, ethnicity, religion, vocabulary, and many more) (Rosenbaum, 2009). Five years after the pledge, the results indicated that pledgers and nonpledgers did not differ in level of premarital sex, sexually transmitted diseases, anal or oral sex, age of first having sex, or number of sexual partners. Pledgers used birth control and condoms less often than nonpledgers in the past year or when they had their last sex. In short, the intervention does not look like it was effective, and if anything it may have decreased taking precautions during sex. As an ancillary but not irrelevant finding, 5 years after taking the virginity pledge, 82% of the pledgers denied having ever pledged. The findings are disappointing, but the evaluation is critical. Other interventions are needed to attain the goals of pledging; the next $200+ million we spend should pursue other alternatives. We want our programs to be effective; the only thing worse than failing would be continuing the illusion that the programs are working.
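To make the matching logic concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than taken from the study: the covariates, sample sizes, outcome scores, and the simple nearest-neighbor rule are invented, whereas Rosenbaum (2009) matched on 112 variables with far more careful procedures. The point is only to show how forming comparable groups makes selection-based rival hypotheses less plausible before outcomes are compared.

```python
# Illustrative sketch only: match each "pledger" to a comparable
# "nonpledger" on a few invented covariates, then compare outcomes.
import random

random.seed(1)

def make_person(pledged):
    return {
        "pledged": pledged,
        "age": random.randint(13, 18),       # covariates used for matching
        "religiosity": random.randint(1, 5),
        "outcome": random.gauss(0, 1),       # e.g., a risk-behavior score
    }

people = [make_person(True) for _ in range(50)] + \
         [make_person(False) for _ in range(200)]

def distance(a, b):
    # Crude covariate distance; real studies use propensity scores or
    # similar composites across many more variables.
    return abs(a["age"] - b["age"]) + abs(a["religiosity"] - b["religiosity"])

pledgers = [p for p in people if p["pledged"]]
pool = [p for p in people if not p["pledged"]]

pairs = []
for p in pledgers:
    match = min(pool, key=lambda q: distance(p, q))
    pool.remove(match)                       # match without replacement
    pairs.append((p, match))

# With matched groups, a mean outcome difference is harder to attribute
# to the matched covariates and easier to attribute to pledging itself.
diff = sum(p["outcome"] - q["outcome"] for p, q in pairs) / len(pairs)
print(f"Mean matched-pair outcome difference: {diff:.3f}")
```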
There are many other examples that could be cited to convey the importance of evaluation. In so many instances, assigning individuals to different conditions (pledging vs. no pledging) is not possible. It is reassuring methodologically to know that it is not necessary all of the time. Important answers to important questions can come from understanding how to increase the interpretability of findings.
Evaluation is not a luxury; it is related to quality intervention (education, medicine, psychotherapy). Indeed, there is a dangerous irresponsibility in not evaluating our interventions (e.g., harming our clientele, using resources, money, and professional time that could be better spent). We want evaluation to ensure that programs are having their intended effects and, if they are, to determine whether we can make the effects better. Some form of evaluation is needed. Quasi-experiments might well be an option. If you were trained in traditional between-group methodology, then you are likely to be skeptical of the scientific contributions of single-case experiments. Now I am asking for more: single-case experiments that are not the best controlled, that is, not true experiments. Bear with me. As in the preceding example, we want to draw inferences even when conditions do not allow for true experiments.

Methodology as Problem Solving

Prior chapters presented major single-case experimental designs and their variations. As a general rule, when one uses a true-experimental design, whether single-case or between-group, one is assured that many threats to validity are well handled, addressed, and made implausible. Even so, a true-experimental design does not guarantee that key threats to internal validity are addressed or ruled out. For example, in a group study (RCT), participants are randomly assigned to treatment and control conditions and complete measures before and after the study. Random assignment does not guarantee that the groups are equivalent before treatment begins. Even with random assignment, groups could be different from each other on critical characteristics, and that difference could readily explain why at the end of the study treatment was better (or worse, or no different from the control condition) (see Kazdin, 2001). Even so, true experiments are likely to address the threats, and we use them with that in mind. Actually, we do not use them very often "with that in mind." We use design practices in a rote way because they are so strong as a basis for drawing inferences. Indeed, after years of investigation, many researchers might not be able to specify the threats we are using the designs to rule out. That is probably fine. I refer to this here as "rote methodology" to convey that we select designs and strategies of true experiments automatically. This is not to demean or judge the designs or the process of their use but rather to make a point about quasi-experimental designs.
Quasi-experimental designs cannot be done by rote. Strategies are pieced together to address likely threats to validity. Methodology at its best is problem solving, a cognitive strategy designed to address circumstances that are less than ideal and to devise solutions to achieve the goal. From a methodological standpoint, the goal is to draw well-based conclusions about the impact of the intervention by ruling out or making threats to validity implausible. Quasi-experiments can do this well, but one must keep in mind what one is trying to accomplish and then improvise in a way that true experiments do not require.
The skill of the investigator is required because he or she is placed in situations in which all of the ideal conditions (of true experiments) are not available, but the goal (drawing valid inferences) is the same. Consider this situation for a moment: one morning someone comes to your home or apartment and says right now you are going to be transported to a lush tropical island. You are told that your goal is to survive (feed, clothe, and protect yourself) for 2 weeks, after which you will be brought back home. You are told that all you can take with you is what you are wearing. (Hearing that, you quickly dress with a few more clothes to sneak by with an extra shirt, jacket, and socks.) Hours later (first-class flight, four movies, five rich meals, eight crying infants) you land on the island. You are told that you will be staying at a five-star hotel and that you have unlimited access to all of the facilities (beach, pool, several restaurants, room service, gift shops, etc.); you may use the credit card handed to you, but there probably will be no need for it. As you walk down the stairs of the airplane to the tarmac, you can see lush and beautiful palms swaying gently in the ocean breeze and are relieved. This scenario is equivalent to the control of a true experiment. Consider the same scenario in which you are transported and given the same goal, namely to survive (feed, clothe, and protect yourself for 2 weeks). On your flight over, you are in the middle seat, between two people who just realized they went to high school together and are trashing peers and teachers from the old days. You finally land on the makeshift runway, and you are told the island is all nature—no hotels, no roads, no gift shop, no restaurants, and so on. As you leave the plane you look at the palms and they are really sweaty—not the trees, your hands. Your goal is to survive—the goal has not changed. But you have to scrape together all of your skills, knowledge, ingenuity, and talent to survive. Welcome to the world of quasi-experiments—the methodology reality show. If you are new to them, we can call them queasy experiments.
Our goal is to draw valid inferences and rule out or make implausible threats to validity. True experiments do that—but one can arrange situations to do that fairly well in quasi-experiments. It is a matter of thinking and problem solving—we are on the lush island—but now rather than survive, we must evaluate a program or intervention of some kind. Does the intervention have an effect, make any difference, or help anyone? The question is what can be done on the part of the investigator to improve the quality of the inferences that can be drawn. Stated another way, what can the investigator do to help make competing interpretations of the results implausible?
Consider a between-group study that is a quasi-experiment. There were not quite perfect controls, and people were not assigned randomly to groups. This is a study that focused on the impact of secondhand smoke, which is known to have adverse cardiovascular effects including heart disease. Eliminating smoking in indoor spaces is the best way to protect nonsmokers. Some cities have instituted smoke-free ordinances—do they make a difference? The best way to test this would be to randomly select cities in the country and then randomly assign a subset of these to be smoke free and others not to be smoke free. This RCT is not going to happen, for a host of reasons. What can one do? A quasi-experiment with the idea of making implausible the threats to validity is a good answer.
In one such quasi-experiment (referred to as the Pueblo Heart Study) with several reports, the question has been examined by selecting and comparing three cities (Centers for Disease Control, 2009). Pueblo, Colorado, had a smoke-free ordinance and was compared to two nearby cities over a 3-year period. The two nearby cities did not have smoke-free ordinances and served as comparison cities. The results: In Pueblo, with implementation of its smoke-free ordinance, hospitalization rates for acute myocardial infarction (heart attacks) markedly decreased from before to after the ordinance was implemented. No changes in hospitalization rates were evident in the two comparison cities. Does this finding establish and prove that secondhand smoking leads to increased heart attacks? No, but no one study does that. Also, we would want to know more about the comparability of the three cities and their hospitals, demographic composition of the cities, and more. Furthermore, was it reduced secondhand smoking or more people just quitting smoking, which also results from a ban? All these and more are good questions, but one should not lose sight of the strength of the evaluation. The findings suggest that bans do make a difference. Of course, it must be replicated. It has been. The findings hold. Threats to validity (e.g., history, maturation, testing) are not very plausible as an explanation of the findings. Still, we need to learn more about what facets of smoking changed and what their specific impact was.
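The reasoning behind the city comparison can be made concrete with a little arithmetic. The sketch below uses invented hospitalization rates and a simple difference-in-differences calculation; the actual reports involved more detailed analyses, so treat this only as an illustration of why the comparison cities matter.

```python
# Invented rates (per 100,000) of heart-attack hospitalizations before
# and after the ordinance took effect. The comparison cities estimate
# what would have happened in the ordinance city without a ban.
rates = {
    "Ordinance city":    {"before": 257, "after": 187},
    "Comparison city 1": {"before": 246, "after": 243},
    "Comparison city 2": {"before": 232, "after": 236},
}

def change(city):
    return rates[city]["after"] - rates[city]["before"]

# Shared influences (history, secular trends in heart disease) should
# move all three cities similarly, so subtracting the comparison-city
# change isolates what is plausibly due to the ordinance.
comparison_change = (change("Comparison city 1") +
                     change("Comparison city 2")) / 2
effect = change("Ordinance city") - comparison_change
print(f"Ordinance-city change:          {change('Ordinance city')}")
print(f"Average comparison-city change: {comparison_change:+.1f}")
print(f"Difference-in-differences:      {effect:+.1f}")
```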
There are many situations in which we believe we are helping or we have an idea that we think will make an important difference in society. The challenge is to add evaluation to that. If the most rigorous research can be done, yes, always, we seize that opportunity. But the other side is the problem. When the most rigorous study cannot be done, this is not the time to go by our anecdotal experience. Many threats to validity can be made implausible to help draw valid inferences. This is methodology at its best (using ingenuity to improve the inferences that can be drawn), not methodology at its easiest (random assignment and careful control).

WHAT TO DO TO IMPROVE THE QUALITY OF INFERENCES

We begin with a situation in which a true experiment cannot be used and take the challenge as follows: What can be added or utilized from what we know about research methodology in general and single-case designs in particular to improve the information and quality of inferences that can be drawn? The default position (no systematic evaluation) is the anecdotal case study with its full bloom of ambiguity. We must do better to draw inferences. We have a "case" that may involve an individual, a classroom, a business, or some program for children or adolescents in a school or community setting or for victims of domestic violence in a women's shelter. Our goal is to draw inferences about the impact of an intervention. The enemy, as it were, is ambiguity and all of those threats to validity that usually make the anecdotal case study a poor basis for drawing inferences about interventions. There are several things we can do and information we can bring to bear that greatly increase the extent to which threats to validity are ruled out or made implausible (Kazdin, 1981; Sechrest et al., 1996). To evaluate a program and improve inferences, the following are key steps, even if they cannot all be followed.

Collect Systematic Data

As a point of departure for quasi-experiments, we begin with systematic assessment information. We could use self-report inventories, ratings by other persons, and direct measures of overt behavior. All systematic measures have their own problems and limitations (e.g., reactivity, response biases), but still they provide a stronger basis for determining whether change has occurred after an intervention than anecdotal narrative reports. If more standardized information is available, at least the investigator (educator, therapist, teacher) has a better basis for claiming that change has been achieved. The data do not allow one to infer the basis for the change. Yet, systematic assessment and the resulting data serve as a prerequisite, because they provide information that change has in fact occurred.

Assess Behavior (or Program Outcomes) on Multiple Occasions

Another dimension that can distinguish single-case demonstrations is the number and timing of the assessment occasions. Major options consist of collecting information on a one- or two-shot basis (e.g., posttreatment only or pre- and posttreatment) or continuously over time (e.g., every day, a few times per week, or right before each intervention session). When information is collected on one or two occasions (pre, post), threats to internal validity associated with assessment (e.g., testing, instrumentation, statistical regression) can be especially difficult to rule out. With continuous assessment over time, these threats are much less plausible, especially if continuous assessment begins before treatment (baseline) and continues over the course of treatment (intervention phase). Continuous assessment allows one to examine the pattern of the data and whether the pattern appears to have been altered at the point at which the intervention was introduced. That is, the "describe, predict, and test" aspects of single-case designs begin to come into play by having at least the first two (AB) phases of a true experiment. If a single-case demonstration includes continuous assessment on several occasions over time, often the threats to internal validity related to assessment can be ruled out.
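The describe, predict, and test logic can be illustrated with a small worked example. The numbers below are invented, and fitting a simple linear trend to the baseline is only one crude way to form the "predict" step; it is a sketch of the reasoning, not a recommended analysis.

```python
# Sketch of describe/predict/test with invented daily counts.
# The baseline (A) phase "describes" performance; a trend fit to it
# "predicts" what should continue absent intervention; intervention (B)
# observations "test" whether performance departs from that projection.
baseline = [12, 11, 13, 12, 12, 13]      # A phase
intervention = [8, 6, 5, 4, 4, 3]        # B phase

n = len(baseline)
xs = range(n)
mean_x = sum(xs) / n
mean_y = sum(baseline) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, baseline))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Project the baseline trend into the intervention phase.
projected = [intercept + slope * (n + i) for i in range(len(intervention))]
departures = [obs - proj for obs, proj in zip(intervention, projected)]

print("projected:", [round(p, 1) for p in projected])
print("departures:", [round(d, 1) for d in departures])
# Large, consistent departures beginning at the phase change are what
# make assessment-related threats (testing, regression) implausible.
```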

Consider Past and Future Projections of Performance

Inferences about the impact of an intervention can be aided by information about performance in the past and likely performance in the near future. In single-case experiments, baseline observations provide this information, and shifts from baseline to intervention phases provide further information. Without the luxury of baseline observations, we can sometimes bring to bear information that is a rough but still helpful approximation. For some behaviors or problems, there may be an extended history that is fairly reliable even without rigorous baseline assessment. This might be true if there has been no occurrence of the behavior (e.g., exercise) or if the characteristic is likely to have been stable, based on the previous weeks or months (e.g., weight of an obese person). A history of stable performance inferred in this way is not as perfect as continuous days of observation, but it may be close. To the extent that there is such a history, one can assume that the behavior or characteristic would continue unless some special event (e.g., treatment) altered its course. Consequently, if performance changes when treatment is applied, the likelihood that treatment caused the change is increased. Thus, the history of the problem may influence the plausibility that extraneous events or other processes (history, maturation), rather than treatment, account for the change.
Apart from the history of the problem, projections about the likely performance in the future or the likely course and outcome are relevant as well. For example, the problem may be one that would not improve without intervention (e.g., terminal illness, reading deficit). Knowing the likely outcome strengthens the inferences that can be drawn about the impact of an intervention that alters this course. The client's improvement controverts the expected prediction of the course of the problem and bolsters the likelihood that the intervention led to change.

The course of clinical problems is important to know because the present level of the problem by itself can be deceiving or incomplete information. A given presenting problem may look the same (e.g., same degree of severity), but projection of the immediate future may depend on knowing a bit about the past. For example, rates of recovery from an episode of depression in adults are very high within the first few weeks or months. However, the probability of recovery diminishes as the episode becomes longer (Patten, 2006). Thus, if a quasi-experiment shows a reduction in depression associated with the onset of treatment, the demonstration may be more or less persuasive based on information about the duration of the episode (past) and therefore the likelihood that it would have changed in the future.

Consider the Type of Effect Associated with Treatment


Dem onstrations vary in terms o f the type o f effects or changes that are evident as treat-
ment is applied. T he im m ediacy and magnitude o f change contribute to the inferences
that can be drawn about the role o f treatment. Usually, the m ore im m ediate the change
after the onset o f the intervention, the stronger a case can be m ade that the intervention,
rather than other events, was responsible for change. A historical event (som ething in
the news, event in the individuals personal life) might o ccu r coinciden tally with the
onset o f the intervention, and might explain the pattern, but u su ally that is not v ery
likely. On the other hand, gradual changes or changes that begin after the intervention
m ay raise greater ambiguity. M any maturational changes and changes over time are
gradual, and we look for an intervention to show a pattern not likely to be confused
with such changes.
Aside from the im m ediacy o f change, the magnitude o f the change is important
as well. W hen marked changes in perform ance are achieved, this suggests that o nly
a special event, probably the intervention, could be responsible. O f course, the m ag -
nitude and im m ediacy o f change, when combined, increase the con fiden ce one can
place in according the intervention a causal role. Rapid and dram atic changes provide a
strong basis for attributing the effects to the intervention. G radual and relatively sm all
266 S I N G L E- C A S Ê R ES EA R C H D ES I G N S

changes might m ore easily be discounted by random fluctuations o f perform ance, n o r-


mal cycles o f behavior, or developm ental changes. (The criteria for inferring change are
discussed further in Chapter 12 on data evaluation.)

Use Multiple and Heterogeneous Participants

The number of clients included in a quasi-experiment can influence the confidence that can be placed in any inferences drawn about the intervention. Demonstrations with two or more cases, rather than with one case, provide a stronger basis for inferring the effects of the intervention. Essentially, each case can be viewed as a replication of the original effect that seemed to result from intervening. If two or more cases improve, it is unlikely that any particular extraneous event (history) or internal process (maturation) could be responsible for change. Historical events and maturation probably varied among the cases, and the common experience, namely, the intervention, may be the most plausible reason for the changes.

The heterogeneity of the cases or diversity of the types of persons may also contribute to inferences about the cause of change. If change is demonstrated among several clients who differ in subject and demographic variables (e.g., age, ethnicity, gender, social class, clinical problems), the inferences that can be made about the intervention are stronger than if this diversity does not exist. With a heterogeneous set of clients, the likelihood is diminished that they share history or maturational influences. They do share exposure to the intervention, and thus the intervention becomes the most plausible explanation of the results.

In methodology, when more and more diverse participants are discussed, the underlying concern is usually external validity, that is, the extent to which the results generalize. The common concern is how one can generalize to others with only one subject, a topic taken up in a later chapter. Here we are using more than one subject and diverse subjects to address internal validity, that is, the likelihood that the intervention rather than extraneous events could explain the change. With more and more diverse subjects, the same threat to validity (e.g., same history, same maturational rate) is not very plausible.

General Comments

The characteristics I have mentioned can be used to strengthen the inferences drawn from situations where experimental control is not possible, that is, quasi-experiments. Depending on how the different characteristics are addressed within a particular demonstration, it is quite possible that the inferences closely approximate those that could be obtained from a true single-case experiment. Not all of the dimensions are under the control of the investigator (e.g., immediacy and strength of intervention effects). On the other hand, critical features upon which conclusions depend, such as the use of replicable measures and assessment on multiple occasions, can be controlled in the situation and can greatly enhance the demonstration.

ILLUSTRATIONS OF QUASI-EXPERIMENTAL DESIGNS

Selected Variations and How They Address Threats to Validity

It is useful to consider a few examples of quasi-experiments with the single case that vary on the characteristics mentioned previously. These convey how the quality of the inferences that are drawn can vary and what the investigator can do to strengthen the demonstration. Table 11.1 illustrates a few types of single-case demonstration studies that differ on some of the dimensions mentioned previously. Also, the extent to which each type of case rules out the specific threats to internal validity is presented. For each type of case the collection of data was included because, as noted earlier, the absence of objective or quantifiable data usually precludes drawing conclusions about whether change occurred.

Study 1: With Pre- and Post-assessment. Use of pre- and posttreatment assessment for the individual increases the informational yield well beyond unsystematic anecdotal reports. Table 11.1 illustrates a single case (column noted as Study 1) with pre- and post-assessment but without other characteristics that would help rule out threats to internal validity. Improved assessment permits comments about whether change has occurred. This is not trivial. The goal of intervention programs in the context of education, rehabilitation, medicine, psychotherapy, and counseling is to effect some change (e.g., in affect, behavior, and/or cognition, academic performance). A basic requirement is to put into place a system in which change can be assessed systematically and routinely. Increasingly, accountability for delivery of services (e.g., education, health care) has focused on improved outcome assessment. Out of concern for the people we serve

Table 11.1 Selected Types of Hypothetical Cases and the Threats to Internal Validity They Address

                                    Study 1:        Study 2:            Study 3:
                                    Pre- &          Repeated            Multiple cases,
                                    post-           assessment &        continuous assessment,
Type of N = 1 study                 assessment      marked changes      stable performance

Characteristics of case present (yes) or absent/not specified (no)
  Objective data                    yes             yes                 yes
  Continuous assessment             no              yes                 yes
  Stability of problem              no              no                  yes
  Immediate and marked effects      no              yes                 no
  Multiple cases                    no              no                  yes

Major threats to internal validity ruled out (+) or not ruled out (-)
  History                           -               ?                   +
  Maturation                        -               ?                   +
  Testing                           -               +                   +
  Instrumentation                   -               +                   +
  Statistical regression            -               +                   +

Note: In the table, a "+" indicates that the threat to internal validity is probably controlled, a "-" indicates that the threat remains a problem, and a "?" indicates that the threat may remain uncontrolled. In preparation of the table, selected threats were omitted because they arise primarily in the comparison of different groups in experiments. They are not usually a problem for a case study, which, of course, does not rely on group comparisons.

and with enormous expenditures of funds, we should always ask, "Is there any impact at all?" Systematic assessment is a good first step because many programs may not be producing change, or change of a magnitude that makes a difference.

Assessment alone is valuable for identifying change, but determining the basis of the change is another matter. Ruling out various threats to internal validity and concluding that treatment led to change depend on other dimensions (listed in the table) than the assessment procedures alone. It is quite possible that events occurring in time (history), processes of change within the individual (maturation), repeated exposure to assessment (testing), changes in the scoring criteria (instrumentation), or reversion of the score to the mean (regression), rather than treatment, led to change. In short, threats to internal validity are not ruled out in this situation, so the basis for change remains a matter of surmise.

Study 2: With Repeated Assessment and Marked Changes. If the single-case demonstration includes assessment on multiple occasions before and after treatment, and the changes associated with the intervention are relatively marked, the inferences that can be drawn about treatment are vastly improved. Table 11.1 illustrates the characteristics of such a case, along with the extent to which specific threats to internal validity are addressed. The fact that continuous assessment is included is important in ruling out the specific threats to internal validity related to assessment. Also, changes coincide with the onset of treatment. This pattern of change is not likely to result from exposure to repeated testing or changes in the instrument. When continuous assessment is used, any changes due to testing or instrumentation would be evident before treatment began. Similarly, regression to the mean from one data point to another, a potential problem with assessment conducted at only two points in time, is eliminated. Repeated observation over time shows a pattern in the data. Extreme scores may be a problem for any particular assessment occasion in relation to the immediately prior occasion. However, these changes cannot account for the pattern of performance for an extended period.

Aside from continuous assessment, this illustration includes relatively marked treatment effects, that is, changes that are relatively immediate and large. These types of changes produced in treatment help reduce the possibility that history and maturation explain the results. Maturation in particular may be relatively implausible because maturational changes are not likely to be abrupt and large. Nevertheless, a "?" was placed in the table because maturation cannot be ruled out completely. In this case example, information on the stability of the problem in the past and future was not included. Hence, it is not known whether the clinical problem might ordinarily change on its own and whether maturational influences are plausible. Some problems that are episodic (e.g., depression) in nature conceivably could show marked changes that have little to do with treatment. With immediate and large changes in behavior, history and maturation may be ruled out too, although these are likely to depend on other dimensions in the table that were omitted from this case.

Study 3: With Multiple Cases, Continuous Assessment, and Stability Information. Several cases rather than only one may be studied. The cases may be treated one at a time and accumulated into a final summary statement of treatment effects, or may be treated as a single group at the same time. In this illustration, assessment information is available on repeated occasions before and during treatment. Also, the stability of the problem is known in this example. Stability refers to the dimension of past-future projections and denotes that other research suggests that the problem does not usually change over time. When the problem is known to be highly stable or to follow a particular course without treatment, the investigator has an implicit prediction of the effects of no treatment. The results can be compared with this predicted level of performance.
As is evident in Table 11.1, several threats to internal validity are addressed by a single-case demonstration meeting the specified characteristics. History and maturation are not likely to interfere with drawing conclusions about the causal role of treatment because several different cases are included. All cases are not likely to have a single historical event or maturational process in common that could account for the results. Knowledge about the stability of the problem in the future also helps to rule out the influence of history and maturation. If the problem is known to be stable over time, this means that ordinary historical events and maturational processes do not provide a strong enough influence in their own right. Because of the use of multiple subjects and the knowledge about the stability of the problem, history and maturation probably are implausible explanations of change in behavior.
The threats to internal validity related to testing are handled largely by continuous assessment over time. Repeated testing, changes in the instrument, and reversion of scores toward the mean may influence performance from one occasion to another. Yet problems associated with testing are not likely to influence the pattern of data over a large number of occasions. Also, information about the stability of the problem helps to further make changes due to testing implausible. The fact that the problem is known to be stable means that it probably would not change merely as a function of repeated assessment.

In general, a single-case demonstration of the type illustrated in this example provides a strong basis for drawing valid inferences about the impact of treatment. The manner in which the multiple-case report is designed does not constitute an experiment, as usually conceived, because each case represents an uncontrolled demonstration. However, characteristics of the type of case study can rule out specific threats to internal validity in a manner approaching that of true experiments.

Examples of Quasi-experiments

A few illustrations convey more concretely the continuum of confidence one might place in the notion that the intervention was responsible for change. Each illustration qualifies as a quasi-experiment because it captures features of true experiments and varies in the extent to which specific threats can be made implausible.

Pre-post Assessment. In single-case research, assessment usually is continuous, that is, repeated observations for the participant or group within a phase (e.g., baseline) and over more than one phase. And, as the reader knows all too well by now, the multiple data points permit one to apply the logic of single-case designs (describe, predict, test). In between-group research, there is usually pre- and posttreatment assessment, and the strength of the demonstration stems from comparing an intervention group with a control (e.g., no treatment) group. Occasionally, researchers report evaluation of a single group with just pre- and postassessment. Although assessment can document that change occurred, this is usually a weak basis on which to draw any inferences about the intervention.
As an example, a treatment study was reported that focused on treating panic disorder among adults (Milrod et al., 2001). Twenty-one adults participated (ages 18 to 50; 66% female; 76% European American, 19% African American, and 4% Asian American) who experienced panic disorder. Key symptoms include a period of intense fear or discomfort that may include pounding heart, sweating, trembling or shaking, shortness of breath, feeling of choking, fear of losing control, chest pain, and others. Apart from the importance of alleviating the disorder itself, individuals with panic disorder report poor physical health and have higher rates of substance and alcohol abuse, and suicide. Also, individuals with panic disorder use health-care services more frequently than patients with any other psychiatric diagnosis. All individuals received psychodynamic psychotherapy for 24 sessions, delivered two times per week. Before starting therapy, several pretreatment measures were taken (related to panic, anxiety, and depression); at the end of treatment, the measures were completed again.

At the end of treatment, the group (17 of 21 individuals completed treatment) showed a statistically significant improvement, including reduced anxiety, depression, panic, and other domains on several self-report scales. The benefit of the assessment is that we know that the patients changed. The difficulty in drawing inferences is that several threats to internal validity are not ruled out by the design. It is possible that patients improved just because of taking the test on separate occasions (testing), that they came into the study at an extreme point and showed changes at posttreatment as a result (statistical regression), or that they just got better over time after they came to treatment independently of receiving sessions (history and maturation). Patients often show great changes when assessed on two occasions before treatment begins. The act of coming to treatment (history) alone may promote improvement.
The investigators referred to this as pilot work and an open study and recognized that the design did not permit inferences about the intervention.³ I cite the example here because one group (or one subject) with pre- and postassessment is only a step up from an anecdotal case study, but inferences about the intervention really cannot be drawn. It is a step in the sense that change is demonstrated, usually by a measure that has some validity to it, as distinguished from narrative, unsystematic anecdotes. Even so, continuous assessment would strengthen the same demonstration because seeing a pattern of data over time would rule out testing, regression, maturation, and history as causes, and we might be able to draw on some aspect of the describe, predict, test features that such assessment allows.

³ An open study (also called an open-label study) in medicine refers to a study in which a medication is used but there is no attempt to hide what the drug is from those who administer it. Usually in controlled trials, those who administer and even receive the medication may be "blinded," that is, not informed whether they are receiving the drug or the placebo. This procedure reduces bias associated with knowing who received what and possibly influencing their reactions, behaviors, or outcomes. However, the term is often used for an uncontrolled study in which there is one group and there is pre- and postassessment.

Continuous Assessment Helps to Evaluate Change. More assessment points help rule out some of the threats related to testing. Repeated testing and regression are effects likely to be evident when there are just two occasions of assessment (pre, post), and one cannot tell whether these effects or the intervention led to change. Continuous assessment helps a little more than just pre-post assessment.
For example, treatment was applied to decrease the weight of an obese 55-year-old woman (180 lb. [81.8 kg], 5'5" [1.65 meters]) (Martin & Sachs, 1973). The woman had been advised to lose weight, a recommendation of some urgency in light of her recent heart attack. The woman was treated as an outpatient. The treatment consisted of developing a contract or agreement with the therapist based on adherence to a variety of rules and recommendations that would alter her eating habits. Several rules were developed pertaining to rewarding herself for resisting tempting foods, self-recording what was eaten after meals and snacks, weighing herself frequently each day, chewing foods slowly, and others. The patient had been weighed before treatment, and therapy began with weekly assessment for a 4½-week period.
T he results o f the program , which appear in Figure 11.1, indicate that the w o m -
an’s initial weight o f 180 lb. was followed by a gradual decline in. weight over the next
few weeks before treatment was terminated. For present purposes, what can be said
about the impact o f treatment? Actually, statements about the effects o f the treatm ent
in accounting for the changes would be tentative at best. T he stability o f her pretreat-
ment weight is unclear. The first data point indicated that the w om an weighed 180 lb.

Figure 11.1. Weight in pounds per week (x-axis: Weeks). The line represents the connecting of the weights, respectively, on Days 0, 7, 14, 21, 28, and 31 of the weight loss program. (Source: Martin & Sachs, 1973.)

Perhaps this weight would have declined over the next few weeks even without a special weight-reduction program. The absence of clear information regarding the stability of the woman's weight before treatment makes evaluation of her subsequent loss rather difficult. The fact that the decline is gradual and modest, albeit understandable given the expected course of weight reduction, introduces further ambiguity. The weight loss is clear, but it would be difficult to argue strongly that the historical events, maturational processes, or repeated assessment could not have led to the same results.

Continuous Assessment Over Baseline and Intervention Phases Helps Further. The next illustration provides a slightly more persuasive demonstration that treatment may have led to the results. This case included a 45-year-old female with low back pain traced to degeneration of several disks in her spine (as demonstrated by magnetic resonance imaging, or MRI) (Vlaeyen, de Jong, Onghena, Kerckhoffs-Hanssen, & Kole-Snijders, 2002). Injections, medications, and physical therapy produced little improvement in her pain. As background to this, fear of physical movement among individuals experiencing pain fosters avoiding activities, which in turn predicts future disability. Fear of further pain is a significant problem among many such patients and promotes a downward course in their pain, disability, and mood. The study evaluated the effects of exposure therapy, an evidence-based treatment for anxiety and fear. Exposure therapy consisted of practice activities and movements with a physical therapist. The activities were graded according to how much fear they evoked; the exposure sessions began with the less fearful actions. Several measures were used to assess pain. Prior to coming to treatment, the patient filled out and mailed forms about pain experience to the researcher. Other measures of the patient included viewing photos of movements and rating fear and concern over harm the activity would produce. The exposure treatment was evaluated in an AB design. After 1 week of baseline, the intervention consisted of fifteen 90-minute sessions over a 5-week period. The results are plotted in Figure 11.2 for two separate measures (fear of movement and pain intensity).
The results suggest that the intervention may have been responsible for change. The inference is aided by continuous assessment over time before and during the intervention phase and the pattern of the data. Baseline showed high levels of fear and pain intensity and suggests that no change was likely to occur with continued observations alone. When the intervention was introduced, pain fear and intensity declined and continued to show a marked reduction by the end of the intervention phase.
A few features of the demonstration may detract from the confidence one might place in according treatment a causal role. The gradual decline evident in the figure might also have resulted from influences other than the treatment, including some event not in the report that may have occurred with the onset of treatment (history) or boredom with continuing the assessment procedure (maturation). As it turns out, another case was included in the report and showed similar effects, which makes these threats less likely for this report. Also, the fact that the patient was responsible for providing the ratings, even though these were standard measures, raises concerns about whether accuracy of scoring changed over time (instrumentation), rather than actual fear.

Figure 11.2. Daily measures of pain-related fear and pain intensity (y-axis: VAS score) during baseline (period of A to B, see bottom of the graph) and exposure treatment (B to C). The series or lines of data points marked by diamonds and squares represent ratings of fear of movement and pain intensity, respectively. The Visual Analogue Scale (VAS) was used to obtain these ratings. (Source: Vlaeyen et al., 2002.)

Yet the data can be taken as presented without undue methodological skepticism. As such, the intervention appears to have led to change, but the quasi-experimental nature of the design and the pattern of results make it difficult to rule out threats to internal validity with great confidence.

Continuous Assessment and Marked Changes. In the next illustration, the effects of the intervention appeared even clearer than in the previous example. In this report, a female adult with agoraphobia and panic attacks participated in outpatient treatment to overcome her fear of leaving home and her self-imposed restriction to her home (O'Donohue, Plaud, & Hecker, 1992). The patient kept a record of her activities and the time devoted to them. Also, at the beginning of treatment, activities that might be reinforcing were identified. The intervention consisted of instructing her to engage in rewarding activities (e.g., time with her pet, reading, entertaining visitors) only when outside of the home. Examples included walking down the street, socializing with neighbors, and watching TV at a neighbor's home.
The effects of the procedure in increasing time out of the home are illustrated in Figure 11.3. The baseline period indicated a consistent pattern of no time spent outside of the home. When the intervention began, time outside the home sharply increased and remained high at 2- and 18-month follow-up assessments.


Figure 11.3. Total time an adult patient with agoraphobia and panic spent in activities outside of the home over baseline, intervention, and two follow-up assessment periods (2-month and 18-month follow-ups). (Source: O'Donohue et al., 1992.)

Acquaintances and relatives, who reported on specific activities in which the patient had engaged, corroborated these changes. The stable and very clear baseline and the marked changes with onset of the intervention suggest that history, maturation, or other threats could not readily account for the results. Within the limits of quasi-experimental designs, the results are relatively clear.

Continuous Assessment, Marked Changes, and Multiple Subjects. Among the previous examples, the likelihood that the intervention accounted for change was increasingly plausible in light of characteristics of the report. In this final illustration, the effects of the intervention are extremely clear, although clearly not from a true experiment. The purpose of this report was to investigate a novel method of treating bedwetting (enuresis) among children (Azrin, Hontos, & Besalel-Azrin, 1979). Forty-four children, ranging in age from 3 to 15 years, were included. Their families collected data on the number of nighttime bedwetting accidents for 7 days before treatment. After baseline, the training procedure was implemented: the child was required to practice getting up from bed at night, remaking the bed after he or she wet the bed, and changing clothes. Other procedures were included as well, such as waking the child early at night in the beginning of training and developing increased bladder capacity by reinforcing increases in urine volume. The parents and children practiced some of the procedures in the training session, but the intervention was essentially carried out at home when the child wet his or her bed.
The effects of training are illustrated in Figure 11.4, which shows bedwetting during the pretraining (baseline) and training periods. The demonstration is a quasi-experimental design with several of the conditions discussed previously included to make threats to internal validity implausible.

Figure 11.4. Bedwetting by 44 enuretic children after office instruction in an operant learning method (x-axis: Days, Weeks, Months). Each data point designates the percentage of nights on which bedwetting occurred. The data prior to the dotted line are for a 7-day period prior to training. The data are presented daily for the first week, weekly for the first month, and monthly for the first 6 months and for the 12th month. (Source: Azrin et al., 1979.)

The data suggest that the problem was relatively stable for the group as a whole during the baseline period. Also, the changes in performance at the onset of treatment were immediate and marked. Finally, several subjects were included who probably were not very homogeneous because their ages encompassed young children through teenagers. In light of these characteristics of the demonstration, it is not very plausible that the changes could be accounted for by history, maturation, repeated assessment, changes in the assessment procedures, or statistical regression.

Other Variations and Illustrations. It is in the very nature of quasi-experimental designs that they will not fall into neat categories. My prior comments focus on dimensions or features that can strengthen the inferences that are drawn. However, there are many variants in which judgments have to be made about whether threats to validity are plausible and whether the logic of the single-case designs is met. Consider brief examples to convey the point.
This initial example focused on alcohol consumption and binge drinking among college students (Fournier, Ehrhart, Glindemann, & Geller, 2004). Surveys suggest that most students (80 to 90%) consume alcoholic beverages. Excessive drinking or binge drinking is associated with many problems that colleges seek to control, including sexual assault, unplanned and unsafe sex, property damage, violence, auto accidents, and poor academic performance.

This program was conducted in a university setting and encompassed 356 college students (ages 19 to 24) who participated while attending one of four consecutive parties hosted by the same fraternity on campus. Blood alcohol concentration (BAC) was the dependent measure and was assessed with a hand-held breathalyzer. Students were evaluated for BAC at the end of each party. The four parties were divided into two parties of baseline (no intervention, or A phase) and two parties with the intervention (or B phase). The design was AABB, clearly placing this within the realm of a quasi-experiment. This arrangement is like an extended AB design. Baseline parties included the assessment. Intervention parties included providing fliers to each person noting that individuals with BAC levels below .05 would be entered into a $100 cash raffle. (The legal limit of intoxication in Virginia, the state in which this was conducted, is .08, and the study selected a level below that.) The flier also included information on how to maintain low intoxication (e.g., consume water between alcoholic beverages, snack on food) and a chart (referred to as a nomogram) that showed how to calculate BAC from body weight, number of drinks consumed, and how long one has been drinking, with separate charts for males and females (because of differences in metabolizing alcohol). At the end of the two intervention parties, there was a drawing, and cash was provided to the winner.
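The chart described above compresses a standard piece of arithmetic. As a rough illustration (not taken from the Fournier et al. study), the widely cited Widmark formula estimates BAC from the same inputs the nomogram used; the constants below (alcohol-distribution ratios, hourly elimination rate) are textbook approximations, and the function itself is a hypothetical sketch in Python:

def estimate_bac(drinks, body_weight_lb, hours, sex):
    """Rough Widmark estimate of blood alcohol concentration (%).
    drinks: standard drinks (~0.6 oz of pure ethanol each)
    hours: time spent drinking; sex: 'male' or 'female'
    """
    alcohol_oz = drinks * 0.6                          # ounces of ethanol consumed
    r = 0.73 if sex == 'male' else 0.66                # Widmark distribution ratio
    bac = (alcohol_oz * 5.14) / (body_weight_lb * r)   # peak BAC (%)
    bac -= 0.015 * hours                               # average elimination per hour
    return max(bac, 0.0)

# A 160-lb male after 4 drinks over 3 hours:
print(round(estimate_bac(4, 160, 3, 'male'), 3))  # ~0.061: above the .05 raffle cutoff, below .08

Separate male and female charts in the fliers make sense under a formula like this because the distribution ratio r differs by sex.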
Figure 11.5 shows three measures to evaluate the program across baseline and intervention parties. Each side of the graph measures something slightly different, including the percentage of individuals who met the goal (left side) and the mean blood alcohol concentrations. Consider the dashed line with triangles. It is clear that the mean BAC was lower during the two intervention parties than during the two baseline parties. The other lines in the graph also convey that the percentage of individuals who fell under the criterion for the raffle (.05) as well as the legal limit (.08) increased during the intervention phase. The changes from AA to BB were statistically significant in a number of tests (for each of the measures in the figure). Of course, statistical significance is about whether there is change and not about whether the change was caused by the intervention, which is about the design. What can we conclude? Are there plausible threats to validity that might explain the pattern of data?
Selection is a threat to validity here and is usually only mentioned in the context of group designs. In this case, it is possible that the participants in the baseline (AA) parties were different in systematic ways from those in the intervention (BB) parties. Perhaps people who really liked to drink stopped going to the parties or left very soon after arriving. Those remaining to be tested could be less likely to drink or drink very much even without an intervention. The authors note that the mix of party goers (e.g., age, sex, and proportion of fraternity and nonfraternity members) did not vary among the parties, and that partially speaks to the concern. History could be a threat too. It is possible that between AA and BB phases something happened on campus. It is not quite fair to raise this vacuously without having some idea of an event, but still it is worth noting. An ABAB design rules out history in most cases because history (maturation and other threats) does not reverse behavior in a back-and-forth fashion. With an AABB design, a historical event could explain the data. Overall and in my view, it remains parsimonious to interpret the demonstration as showing that the intervention led to change.
Q u a s i-S in g le -C s .s e E x p e r im e n t a l t » e s ig r s 277

Figure 11.5. Mean blood alcohol concentration (BAC) levels and percentages of students below BACs at the four consecutive fraternity parties in an AABB quasi-experimental design. (Source: Fournier et al., 2004.)

For each line in the graph, the two baseline points do not overlap with the two intervention points.
A final example focused on the treatment of Posttraumatic Stress Disorder (PTSD) among earthquake survivors. PTSD is a severe anxiety reaction precipitated by a traumatic event including natural disasters (e.g., earthquake, flood), combat or military exposure, terrorist attacks, sexual or physical assault, child sexual or physical abuse, and accidents (e.g., automobile). Severe anxiety continues long after the event as the individual relives the event, dreams about it, and can have the full anxiety reaction when exposed to any reminder of the event.
Effective and brief treatment (one to four sessions) of earthquake survivors with PTSD has been developed and carefully evaluated in RCTs (e.g., Basoglu, Salcioglu, & Livanou, 2007; Basoglu, Salcioglu, Livanou, Kalender, & Acar, 2005). The treatment encourages individuals to expose themselves to fear-evoking cues to develop a sense of control and to develop cognitive strategies to cope with avoidance and anxiety. In this report, a highly structured self-help manual was provided to see if the benefits of the treatment could be obtained by individuals themselves when provided after an initial therapist contact (Basoglu, Salcioglu, & Livanou, 2009). Quasi-single-case evaluations were made of a self-help manual. The manual (51 pages) discussed PTSD and depression, self-assessment, how to conduct self-exposure sessions, how to cope with avoidance of anxiety, and other components. The treatment was a 9-week program, all administered individually by the clients themselves. Outcome evaluation was based on several clinician-administered and self-administered measures. This report included eight adults (mean age of 40) who survived the 1999 earthquake in Turkey and suffered PTSD. (The intervention was conducted in 2003-2004, so the symptoms continued long after the initial trauma.) To illustrate the results, I have sampled one of the
measures (the Clinician-Administered PTSD Scale) that is a standard measure in this area of work.
This was a quasi-experiment that might be simply referred to as an AB design. However, it is a little more complex. In baseline, there were two assessments (4 weeks apart). Then the self-help manual was provided for the 9 weeks of treatment. No assessments were conducted during the treatment, but immediately after, there was a posttreatment assessment. Follow-up assessments were then conducted at 1, 3, and 6 months. Figure 11.6 provides the data for the eight cases. The measure is one of PTSD symptoms, so improvement would be reflected in decreases in anxiety. Two questions are before us: Was there any change from baseline to posttreatment? And what is the likelihood that the intervention accounted for change? Looking at the data from eight cases, we can see that, with the exception of Case 6, the change from baseline to posttreatment assessment is consistent. Patients improved. It helps to have two assessment occasions rather than one in baseline to give us a better idea of trend and stability (and to rule out regression to the mean as an explanation of the change from baseline to posttreatment). We can see that in baseline some cases improved from the first to the second assessment (Cases 2, 5, 7, 8). Repeated assessment on measures often leads to slight improvements. However, for these cases, the data at post- and follow-up assessments still suggest a change through visual examination of the data, that is, not like the data in baseline. This was bolstered by statistical evaluation that showed that change from Baseline 1 to Baseline 2 for all the cases combined was not statistically significant. However, changes from baseline to posttreatment and to the follow-up assessments were statistically significant.
This is a quasi-experiment; if conceived as an ABAB design, it is "missing" two phases. However, the design has multiple-baseline features in which the intervention was implemented across different people. The implementation was not at the same time, in all likelihood, so no clear historical event could explain the pattern of change after the intervention. Maturational events, too, are not a likely explanation; the clients were all different ages, and the symptoms were still strong long after the earthquake. Both of these points make maturation not very plausible as an explanation of the change in symptoms. Also, the interpretation of the intervention is facilitated by other information, namely, the authors had developed versions of this treatment and tested them carefully in other contexts. The present demonstration is a valuable extension by testing the manual with individual cases. We know that individuals can administer treatment themselves, which is a critical finding because the scope of natural disasters (e.g., earthquakes, tsunamis, hurricanes) does not permit individual therapy administered by mental health professionals to be provided to the survivors who suffer trauma.

General Comments. I have emphasized the goals of research design (to make various threats to validity implausible) and the logic of single-case designs (e.g., describe, predict, test the prediction) for reasons that are particularly salient for this chapter. Experimental design practices serve the preceding goals and are not ends in themselves. Thus, one evaluates quasi-experiments (and true experiments) in terms of how well they make implausible competing interpretations of the findings. Many quasi-experiments illustrated previously make a very strong case. It is not always the case that a true experiment or even an RCT automatically makes a strong case because of the design.

Figure 11.6. Clinician-administered measure of PTSD symptoms (Clinician-Administered PTSD Scale) for each of eight cases; Cases 1 through 3 are shown here. The design included two assessments during baseline, one assessment immediately after the self-administered treatment, and three assessments over the course of follow-up. (Source: Basoglu et al., 2009.)

Figure 11.6. Continued (Cases 4 through 6; y-axis: Clinician-Administered PTSD Scale; x-axis: Baseline 1, Baseline 2, Post-Treatment, 1-month FU, 3-month FU, 6-month FU).

Figure 11.6. Continued (Cases 7 and 8).

When there is loss of subjects in some groups or diffusion of the intervention, whether the study is an ABAB design or RCT does not automatically correct the problems. Understanding what competes with drawing inferences is important in all designs. Quasi-experimental designs merely make the importance of this understanding more salient.
If one focuses on practices rather than the goals of research, there can be a methodological helplessness that emerges. For example, a teacher or superintendent has a creative intervention idea for a classroom or school. How to evaluate this? If one believes only a controlled study can be used with randomized assignment, most likely the program will never be evaluated. The goals of research are to see if the novel intervention makes a difference when other influences are controlled. In any given case, the arrangement of approximations or quasi-experiments (single-case or between-group) can reduce the plausibility of threats to validity in ways that approach true experiments. I discussed and illustrated key dimensions that can greatly strengthen inferences that can be drawn about intervention effects.

Design features from single-case experiments are not mere methodological niceties. When integrated with applied work in educational, clinic, and other settings, they can also improve the quality of care (Horner et al., 2005; Kazdin, 2008b). Continuous assessment, for example, provides feedback about whether the program is having its intended effect. This is basic. At the very least, programs ought to have an assessment component. Quasi-experiments go beyond assessment, of course, and focus further on drawing inferences about what caused the change and when change occurred.

PERSPECTIVE ON QUASI-EXPERIMENTS
In seeking answers to critical questions, we want to begin with true experiments when possible. In applied settings, often such experiments are not possible. Constraints on the situation (e.g., limitations of withholding the math or reading program from some classes but not others) and clinical issues (e.g., the patient ought to have the intervention immediately) are some of the impediments. We ought to be consoled by the knowledge that most of the sciences (e.g., geology, anthropology, meteorology, astronomy, epidemiology, zoology) do not rely on true experiments for critical findings and yet have made enormous advances. This is worth mentioning to begin with the premise that much can be learned in a cumulative and rigorous way without being able to control the situation experimentally.
For those sciences and disciplines involved in intervention research (e.g., medicine, education, psychology, counseling, social work) research training focuses on true experiments. The RCT is endlessly referred to as the "gold standard," but it has been the only standard included in training. Understandably, with that training, many of us are bothered by less well-controlled studies, which is why I sometimes refer to the designs of this chapter as queasy-experimental designs. No matter what the designs are called, quasi or queasy, very strong inferences can be drawn from them. However, the designs often require greater ingenuity in selecting controls or analyzing the data to make various threats to validity implausible.
The ingenuity requires thought about the specific situation and what might be done to combat this or that threat to validity that is especially problematic. We begin by adding assessment and then consider what might help us show that it was the intervention rather than extraneous influences that accounted for the change. Facets of single-case designs might be brought to bear. Perhaps assessment on multiple occasions will help us draw on the describe, predict, and test features of single-case experiments; perhaps we can infer what performance in the past and future would be like without the luxury of baseline observations; or perhaps we can show a similar intervention effect in two individuals (two AB designs with a brief baseline and intervention phase, but not staggered in a multiple-baseline fashion).

SUMMARY AND CONCLUSIONS
It is useful to consider the present chapter in relation to a continuum of rigor and clarity of research methods. I mentioned a continuum that has the uncontrolled anecdotal case study on the left side and true experiments on the right side. There is general agreement that the anecdotal case study is not an acceptable basis for drawing inferences in science. The absence of systematic assessment and any effort to control the

situation makes the anecdotal case study a resort where all the threats to valid ity frolic
and join together. With true experiments, these threats are reined in, controlled, m ade
implausible, or com pletely ruled out. All the threats are kept in their cages where they
cannot get out and harm our ability to draw inferences. T h e vast tirrito ry in the ttiiddle
o f the continuum — quasi-experim ents— was the focus o f this chapter. W ith q u asi-ex-
periments, there are threats running around here and there and we shepherd them to
places where we can keep an eye on them and make them more or less im plausible.
Q uasi-experim ental arrangements can vastly improve on the anecdotal case study
and provide a strong basis o f knowledge even though these arrangem ents are not true
experiments.
In schools, clinics, rehabilitation facilities, and other applied settings, the conditions for using true single-case experiments cannot always be met. Nevertheless, selected features of the designs can be used to form quasi-single-case experiments. The use of key features such as assessment over time and consideration of some of the criteria for data evaluation can strengthen the inferences that can be drawn about intervention effects. In this chapter, several types of single-case demonstrations were presented that included critical components or approximations of true experiments. These components included drawing on information about the nature of change, the abruptness of change, and the likely course without treatment that can be used to make threats to internal validity implausible.
Unlike prior chapters, quasi-experiments are not formally recognized designs (e.g., as in ABAB or multiple-baseline designs). Quasi-experiments are defined by those situations in which the investigator does not have control over critical facets of the arrangement of the intervention. The challenge is drawing on components of design to help improve the quality of our inferences. A range of situations was presented to show how information can be brought to bear in which change probably could be attributed to the intervention. Examples were presented that began with systematic assessment and that showed change over the course of intervention. The demonstrations vary in what can be brought to bear to draw inferences. In each case the use of systematic assessment alone was valuable. Systematic assessment at least provides the basis for inferring that change occurred, and this is a critical step in evaluating all programs whether at the level of states, schools, or individuals.
CHAPTER 12

Data Evaluation

CHAPTER OUTLINE

Visual Inspection
Description and Underlying Rationale
Criteria for Visual Inspection
Illustrations
Problems and Considerations
Lack of Concrete Decision-making Rules
Multiple Influences in Reaching a Decision
Search for Only Marked Effects
General Comments
Statistical Evaluation
Reasons for Using Statistical Tests
Trend in the Data
Increased Intrasubject Variability
Investigation of New Research Areas
Small Changes May Be Important
Replicability in Data Evaluation
Evaluation of the Applied or Clinical Significance of the Change
Social Validation
Social Comparison
Subjective Evaluation
Clinical Significance
Falling within a Normative Range
Departure from Dysfunctional Behavior
No Longer Meeting Diagnostic Criteria
Social Impact Measures
Problems and Considerations
Social Comparison and Return to Normative Levels
Subjective Evaluation
General Comments
Summary and Conclusions


Previous chapters have discussed fundamental issues about assessment and experimental design for single-case research. I mentioned that the third component of methodology after assessment and design is data evaluation. The three components work in concert to permit one to draw inferences about the intervention. Assuming that the behavior (or domain of interest) has been adequately assessed and that the intervention was evaluated in an appropriate experimental design, one important matter remains: evaluating the data. Data evaluation consists of describing and making inferences about the changes. These inferences are not about what caused the change; that facet has much to do with the experimental design and whether the conditions were arranged to allow for that conclusion. Data evaluation focuses on whether the change is likely to be reliable and not likely to be due to chance fluctuations in performance.
In applied investigations where single-case designs are used, separate criteria are invoked to evaluate the data (Risley, 1970). The experimental criterion refers to whether the changes in behavior or the domain of interest are reliable. The applied criterion refers to whether the changes are important, that is, whether they make a difference that has applied significance. It is possible that reliable effects (experimental criterion) would be produced but that these effects would not have made an important change in the clients' lives (applied criterion). At first blush, these criteria might not seem to be unique to single-case research. In between-group studies we do not merely want to show that one intervention (e.g., method of teaching reading or math; form of psychotherapy) is better than another statistically (statistical significance). We want to show that the intervention also makes a difference that is of applied importance. Experimental and applied criteria are nicely intertwined in single-case research as a way to evaluate change; both are addressed in this chapter.
The primary method of data evaluation for single-case research is based on visual inspection. The chapter presents the underlying rationale, key criteria, and critical issues. Researchers trained in traditional between-group methods (i.e., all of us!) are more on home turf with statistical methods of data evaluation. Statistical tests are used in single-case research but not routinely. Statistical analyses of single-case data are still not common, but there are important advances that cast visual inspection and statistical tests in a new light. The appendix at the end of the book elaborates statistical analyses for the single case and considerations that influence both visual inspection and statistical tests. (If you like drama, intrigue, and methodological twists and turns, the appendix is a must, and the written version is much better than the movie.)
Evaluation of the data through visual inspection depends heavily on graphing of the data. In single-case research graphing is not only or merely a descriptive tool; it is part of the inferential process. Were reliable effects demonstrated by the intervention? Data are displayed graphically and evaluated. This chapter covers the methods and rationale of data evaluation in single-case designs, and the next chapter highlights commonly used methods of graphing to aid data evaluation.

VISUAL INSPECTION
The experimental criterion refers to a comparison of performance during the intervention with what performance would be like if the intervention had not been implemented. The purpose of the experimental criterion is to decide whether a veridical or reliable change has been demonstrated as well as whether that change can be attributed

to the intervention. In traditional between-group research, the experimental criterion is met primarily by comparing performance between or among groups and examining the differences statistically. Groups receive different conditions (e.g., treatment vs. no treatment), and statistical tests are used to evaluate whether performance after treatment is sufficiently different to attain conventional levels of statistical significance (p < .05, .01).
In single-case research, the experimental criterion is met by examining the effects of the intervention at different points over time. The effects of the intervention are replicated (reproduced) at different points so that a judgment can be made based on the overall pattern of data. The manner in which intervention effects are replicated depends on the specific design. The underlying rationale of each design, outlined in previous chapters, conveys the ways in which baseline performance is used to predict future performance, and subsequent applications of the intervention test whether the predicted level is violated. For example, in the ABAB design the intervention effect is replicated over time for a single subject or group of subjects. The effect of the intervention is clear when systematic changes in behavior occur during each phase in which the intervention is presented or withdrawn. Similarly, in a multiple-baseline design, the intervention effect is replicated across the dimension for which multiple-baseline data have been gathered. The experimental criterion is met by determining whether performance shifts at each point that the intervention is introduced.
The manner in which a decision is reached about whether the data pattern reflects a systematic intervention effect is referred to as visual inspection. Visual inspection refers to reaching a judgment about the reliability or consistency of intervention effects by visually examining the graphed data. Visual examination of the data would seem to be subject to a tremendous amount of bias and subjectivity. If data evaluation is based on visually examining the pattern of the data, intervention effects (like beauty) might be in the eyes of the beholder.1 To be sure, several problems can emerge with visual inspection, and these are highlighted in this chapter and detailed further in the appendix. However, it is important to convey the underlying rationale of visual inspection, how the method is carried out, and its strengths and weaknesses.

Description and Underlying Rationale

Visual inspection can be used in part because of the sorts of intervention effects that are sought in applied research. The underlying rationale is to encourage investigators to focus on interventions that produce potent effects and effects that would be obvious from merely inspecting the data (Baer, 1977; Michael, 1974; Sidman, 1960). Weak results are not regarded as meeting the stringent criteria of visual inspection. Hence, visual inspection is intended to serve as a filter or screening device to allow only clear and potent interventions to be interpreted as producing reliable effects. In contrast, in traditional between-group research, statistical evaluation is used to decide whether the effects (differences between groups) are reliable. Statistical evaluation is often more sensitive than visual inspection in detecting intervention effects. Intervention effects

1 As the reader may well know, the expression that "beauty is in the eye of the beholder" is not quite accurate. Actually, research shows that there is considerable agreement in what beauty is, and who is beautiful, although there are individual taste preferences as well (e.g., Honekopp, 2006).

may be statistically significant even if they are relatively weak. The same effect might not be detected by visual inspection. Traditionally, the insensitivity of visual inspection for detecting weak effects has been viewed as an advantage rather than a disadvantage because it encourages investigators to look for potent interventions or to develop weak interventions to the point that large effects are produced (Parsonson & Baer, 1978, 1992).
Statistical evaluation and visual inspection are not fundamentally different with respect to their underlying rationale (Baer, 1977). Both methods of data evaluation attempt to avoid committing what have been referred to in statistics as Type I and Type II errors:

Type I error refers to concluding that the intervention (or variable) produced an effect when, in fact, the results could be due to chance.
Type II error refers to concluding that the intervention did not produce an effect when, in fact, it did.

Researchers typically give higher priority to avoiding a Type I error, concluding that a variable has an effect when the findings may have occurred by chance. In statistical analyses the probability of committing a Type I error can be specified by the level of confidence of the statistical test or alpha (e.g., p < .05). Specifically, one can say that if the investigation were carried out 100 times (or actually an infinite number of times), 5 (or 5%) of these would show a statistically significant result by "chance."
With visual inspection, the probability of a Type I error is not known. Hence, to avoid chance effects, the investigator looks for highly consistent effects that can be readily seen. By minimizing the probability of a Type I error, the probability of a Type II error is increased. Investigators relying on visual inspection are more likely than are those relying on statistical analyses to commit more Type II errors, that is, discarding or discounting effects that may be real but are not clear. Thus, reliance on visual inspection will overlook or discount many reliable but weak effects.

Criteria for Visual Inspection

Several situations arise in applied research in which intervention effects are likely to be so dramatic that one can easily see (from visual inspection) the change. For example, the behavior of interest (e.g., reading, exercising) may never be present. This can be stated by characterizing the data; the mean (e.g., number of times, number of minutes) and the standard deviation are zero. In such circumstances, even a minor increase in the target behavior during the intervention phase would be easily detected. Similarly, when the behavior of interest occurs very frequently during the baseline phase (e.g., reports of hallucinations, aggressive acts, cigarette smoking) and stops completely during the intervention phase, the magnitude of change usually permits clear judgments based on visual inspection. In short, in cases in which behavior is at the opposite extremes of the assessment range before and during treatment, the ease of invoking visual inspection can be readily understood. Indeed, the changes are self-evident and there is almost no need to make explicit the criteria for visual inspection. In any type of research, whether single-case or between-group, very dramatic changes might be so stark that there is no question that something important, reliable, and veridical took place. This type of change has been referred to as "slam bang" effects (Gilbert, Light, & Mosteller, 1975).

Table 12.1 Visual Inspection Criteria to Evaluate Data from Single-Case Experiments

Characteristics Related to Magnitude of the Change
1. Changes in Means across Phases: shifts in the average rate of performance on the continuous measure as phases (e.g., A, B) are changed.
2. Changes in Level across Phases: shift or discontinuity of performance (a leap, jump) from the end of one phase (e.g., A) to the very beginning of the next phase (e.g., B) and back again as the phase shifts again.

Characteristics Related to the Rate of Change
3. Changes in Trend or Slope: the trend line that characterizes the data within each phase (e.g., A, B) and that reflects a change from the trend line of a prior or subsequent phase.
4. Latency of the Change: the amount of time or the period between the onset of a condition (e.g., B or intervention) and the changes in performance. A short latency (immediate) change after the change in conditions (A, B) contributes to inferring that the condition was responsible for that change.

Overall Pattern
5. Nonoverlapping Data across Phases: this is a combined criterion involving some or all of the above criteria. The effects are unusually clear because the numerical data points on the graph in one phase (e.g., tantrums that lasted between 20 and 40 minutes on all the days of baseline) do not overlap (share any same numerical values) with the days during the intervention phases (e.g., tantrums lasted from 1 to 5 minutes).

Note: Visual inspection usually consists of invoking the first four criteria. The nonoverlapping-data criterion is equivalent to the "slam bang" effect mentioned in the text. From a data-evaluation perspective this effect is immediately evident from a graphic display of the data and would be readily acknowledged by all or most individuals perusing the data. The four other criteria are a matter of degree and the investigator must examine each one as well as the overall pattern they reflect.

In most situations, the data do not show a change from one extreme of the assessment scale to the other, and the criteria for making judgments by visual inspection need to be considered deliberately. Table 12.1 provides the criteria for visual inspection for easy reference, but each criterion is elaborated next.
Visual inspection primarily depends on four characteristics of the data that are related to the magnitude and the rate of the changes across phases. The two characteristics related to magnitude are changes in mean and level. The two characteristics related to rate are changes in trend and latency of the change. It is important to examine each of these characteristics separately, even though in any applied set of data they act in concert.
Changes in means across phases refer to shifts in the average rate, level, or number on the measure. Consistent changes in means across phases can serve as a basis for deciding whether the data pattern meets the requirements of the design. A hypothetical example showing changes in means across the intervention phase is illustrated in an ABAB design in Figure 12.1. As evident in the figure, performance on the average (horizontal dashed line in each phase) changed in response to the different baseline and intervention phases. Visual inspection of this pattern suggests that the intervention was associated with changes as reflected on this particular criterion.
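To make the criterion concrete, here is a minimal sketch (my hypothetical data, not from the text) that computes the phase means one would compare when inspecting an ABAB graph:

import numpy as np

# Hypothetical daily frequencies of a problem behavior across ABAB phases
phases = {
    'A1 (baseline)':     [9, 8, 10, 9, 9],
    'B1 (intervention)': [4, 3, 3, 2, 3],
    'A2 (baseline)':     [8, 9, 8, 9, 8],
    'B2 (intervention)': [2, 2, 3, 1, 2],
}

for name, data in phases.items():
    print(name, np.mean(data))  # means shift down in each B phase and recover in A2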
Changes in level are a little less familiar and refer to the shift or discontinuity of performance from the end of one phase to the beginning of the next phase. A change in level is independent of the change in mean. When one asks about what happened immediately after the intervention was implemented or withdrawn, the question is about the level of performance. Figure 12.2 shows changes in level across phases in an ABAB design. The figure shows that whenever the phase was altered, behavior assumed a new rate and shifted up or down rather quickly. The arrows in the figure show a space between the last day of one phase and the first day of the next phase.

Figure 12.1. Hypothetical example of performance in an ABAB design with means in each phase represented with dashed lines.


Figure 12.2. Hypothetical example of performance in an ABAB design (x-axis: Days; phases: Baseline, Intervention, Base 2, Intervention 2). The arrows point to the changes in level or discontinuities associated with a change from one phase to another.

It so happens that a change in level in this latter example would also be accompanied by a change in mean across the phases. However, level and mean changes do not necessarily go together. It is possible but not usually the case that a rapid change in level occurs but that the mean remains the same across phases or that the mean changes but no abrupt shift in level has occurred.
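A level change can be expressed as the difference between the last observation of one phase and the first observation of the next; a small sketch with hypothetical data (my illustration, not a formula from the text):

a1 = [9, 8, 10, 9, 9]   # last baseline day: 9
b1 = [4, 3, 3, 2, 3]    # first intervention day: 4

def level_change(prev_phase, next_phase):
    """Discontinuity: first point of the new phase minus last point of the old."""
    return next_phase[0] - prev_phase[-1]

print(level_change(a1, b1))   # -5: an abrupt drop in level at the phase change
# Note: the phase means could stay the same even when level jumps, and vice versa.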
Changes in trend are of obvious importance in applying visual inspection. Trend or slope refers to the tendency for the data to show systematic increases or decreases over time. Altering phases within the design may show that the direction of behavior changes as the intervention is applied or withdrawn. Figure 12.3 illustrates a hypothetical example in which trends have changed over the course of the phases in an ABAB design. Discussing the figure with this criterion in mind, we might say there was no trend (horizontal line or zero slope) in baseline, an accelerating slope during the intervention phase, a decelerating slope during the return to baseline, and an accelerating slope again in the second intervention phase. The effects are dramatic, as if to suggest that behavior is turned off and on based on what occurred during the phases. The slopes moving in different directions fit well with the logic of single-case designs I mentioned previously (describe, predict, and test). A marked change in slope conveys that something happened that is reliable and changed the predicted pattern (slope) of performance from each prior phase.

Figure 12.3. Hypothetical example of performance in an ABAB design with changes in trend across phases. Baseline shows a relatively stable or possibly decreasing trend. When the intervention is introduced, an accelerating trend is evident. This trend is reversed when the intervention is withdrawn (Base 2) and is reinstated when the intervention is reintroduced.
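Trend within a phase is simply the slope of a line fitted to that phase's data. A minimal sketch (hypothetical data; the least-squares fit is one common way to quantify slope, not a procedure prescribed by the text):

import numpy as np

baseline = [9, 9, 8, 9, 10, 9]       # roughly flat
intervention = [9, 8, 6, 5, 3, 2]    # decelerating

def slope(series):
    """Slope of the least-squares trend line for one phase."""
    days = np.arange(len(series))
    return np.polyfit(days, series, deg=1)[0]

print(round(slope(baseline), 2))      # ~0.11: near-zero trend
print(round(slope(intervention), 2))  # ~-1.46: clear decelerating trend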
Finally, the latency of the change that occurs when phases are altered is an important characteristic of the data for invoking visual inspection. Latency refers to the period between the onset or termination of one condition (e.g., intervention, return to baseline) and changes in performance. The more closely in time that the change occurs after the experimental conditions have been altered, the clearer the intervention effect. There is a commonsense feature of this. If I tell my child to clean her room, and she does this immediately, the chances are my request was the intervention responsible for change. If I tell my child to clean her room, and she does this a month later, my request could have been responsible but the delay very much suggests that something else was involved.
A hypothetical example is provided in Figure 12.4, showing only the first two phases of separate ABAB designs. In the top panel, implementation of the intervention after baseline was associated with a rapid change in performance. In the bottom panel, the intervention did not immediately lead to change. The time between the onset of the intervention and behavior change was longer than in the top panel, and it is slightly less clear that the intervention may have led to the change. As a general rule, the shorter the period between the onset of the intervention and behavior change, the easier it is to infer that the intervention led to change.

Figure 12.4. Hypothetical examples of first AB phases as part of larger ABAB designs. Upper panel shows that when the intervention was introduced, behavior changed rapidly. Lower panel shows that when the intervention was introduced, behavior change was delayed. The changes in both upper and lower panels are reasonably clear. Yet as a general rule, as the latency between the onset of the intervention and behavior change increases, questions are more likely to arise about whether the intervention or extraneous factors accounted for change.

The importance of the latency of the change after the onset of the intervention depends on the type of intervention and domain of functioning. For example, one would not expect rapid changes in applying a diet or exercise regimen to treat obesity. Weight reduction usually reflects gradual changes after interventions begin. In contrast, stimulant medication is the primary treatment used to control hyperactivity among children diagnosed with Attention-Deficit/Hyperactivity Disorder. The medication usually produces rapid effects, and one can see changes on the day the medication is provided (one often sees a return to baseline levels on the same day, as the stimulant is metabolized). More generally, drawing inferences about the intervention also includes considerations about how the intervention is likely to work (e.g., rapidly, gradually) and how that expectation fits the data pattern.
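One way to put a number on latency is to count the days from phase onset until performance first leaves the baseline range. The cutoff used below (two baseline standard deviations) is an illustrative rule of thumb of my own, not a standard from the text:

def latency_to_change(series, baseline_mean, baseline_sd, k=2):
    """Days from phase onset until a point falls beyond k SDs of the baseline mean."""
    for day, value in enumerate(series, start=1):
        if abs(value - baseline_mean) > k * baseline_sd:
            return day
    return None   # no detectable change within the phase

rapid = [3, 2, 2, 1, 2]     # departs from baseline on day 1
delayed = [9, 9, 8, 4, 2]   # departs from baseline on day 4

print(latency_to_change(rapid, baseline_mean=9, baseline_sd=1))    # 1
print(latency_to_change(delayed, baseline_mean=9, baseline_sd=1))  # 4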
Visual inspection is conducted by judging the extent of changes in means, levels, and trends, and the latency of change evident across phases and whether the changes are consistent with the requirements of the particular design. The individual components are important, but one looks at the gestalt too, that is, the parts all together and the whole they provide. Figure 12.5 illustrates some hypothetical patterns for AB phases only to convey some of the combinations of the criteria and how they may or may not go together. The figure conveys permutations of the four criteria. This is only a small set of the range of possibilities because each criterion is a matter of degree and there will be more phases (e.g., ABAB rather than just AB or more opportunities to evaluate phase changes as in a multiple-baseline design). Even so, with the relatively simple

Figure 12.5. Examples of data patterns over two (AB) phases illustrating changes in means, levels, trends, and latency. Panel labels: change in mean and level, no change in trend (slope), immediate latency of change; change in mean and trend, no change in level, rapid latency of change; change in mean, trend, and level, rapid latency of change; no change of mean or level, change in trend, rapid latency of change.

graphs, one can begin to see potential complexities and how it might not be quite so easy to apply the criteria. I have crafted the graphs to convey in each case that something happened during the B phase, but sometimes the slope remained the same (top left); sometimes the means remained the same (bottom right), and so on. Shortly we will look at some real examples that better represent full designs (not just AB phases).
The ease of applying the criteria and examining the data pattern as a whole depend on background characteristics I mentioned when describing the logic of single-case designs. Whether a particular effect will be considered reliable through visual inspection depends on the variability of performance, trends within a particular phase, and whether the phases provide clear estimates for the describe, predict, and test functions of the design. Data that present minimal variability, show consistent patterns over relatively extended phases, and show that the changes in means, levels, or trends are replicable across phases for a given subject or across several subjects are more easily interpreted than data in which one or more of these characteristics are not obtained.
Changes in means, levels, and trends, and latencies of change across phases may go together, thereby making visual inspection easy to invoke. For example, data across phases may not overlap. Nonoverlapping data refers to the finding that the values of the data points during the baseline phase do not approach any of the values of the data points attained during the intervention phase. For example, if one looks at Figure 12.5, the two graphs on the left show that not one data point in baseline (A) was the same as or within the range of data points during the intervention (B). Those were hypothetical graphs and omitted the normal day-to-day variability in performance. Nonoverlapping data where that variability is evident, that is, in real data, are even more impressive. In short, if there are changes in the means, levels, trends, and latencies across phases and the data do not overlap, there is little quibble about whether the changes are reliable and meet the experimental criteria. The challenges come when data are not perfect, and these challenges are evident whether visual inspection or statistical analyses are used (please see the appendix for a more extended discussion of these challenges).
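Nonoverlap is the easiest criterion to check mechanically. A sketch with hypothetical tantrum data (the percentage-of-nonoverlapping-data summary at the end is a metric from the broader single-case literature, offered here as an aside rather than something this chapter prescribes):

baseline = [22, 30, 25, 35, 40, 28]   # tantrum minutes per baseline day
intervention = [5, 3, 1, 4, 2, 5]     # tantrum minutes per intervention day

# Do the two phases share any values, or even touch as ranges?
print(set(baseline) & set(intervention))    # set(): no shared values
print(min(baseline) <= max(intervention))   # False: the ranges do not touch

# Percentage of nonoverlapping data (PND): share of intervention points that
# fall beyond the most extreme baseline point (below the minimum, since a
# decrease is the goal here).
pnd = sum(x < min(baseline) for x in intervention) / len(intervention)
print(pnd)   # 1.0: every intervention day beat the best baseline day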

Illustrations

A few illustrations are helpful to convey application of the criteria in situations one is likely to encounter, that is, an imperfect world in which the participants have not read this chapter and are not doing what we would like to see (e.g., nonoverlapping data across phases). Visual inspection does not require perfection. Here are some examples.
In this example, the focus was on children undergoing restorative dental treatment (O'Callaghan, Allen, Powell, & Salama, 2006). It is common for young children to be disruptive and difficult to manage during their dental care, with estimates of 20 to 25% of all children showing disruptive behavior during such work. This not only makes administration of the activities of cleaning and scraping of one's teeth difficult, but actually can expose children to injury as they suddenly move about during invasive procedures. Allowing brief breaks or periods of escape from the procedures can reduce the difficult-to-manage behaviors. This program included five children (ages 4 to 7 years, three girls, two boys) with disruptive behavior and in need of at least three visits for tooth preparation and restoration. Each visit lasted 45 to 90 minutes. The intervention consisted of giving breaks to the child based on a fixed-time schedule. A

small apparatus was attached to the dentist's waistband to signal when these breaks would occur. A video camera in the room recorded all sessions. Child behavior, including bodily movements, complaining, moaning, gagging, and crying, was recorded in intervals (of 15 seconds). The goal was to reduce the disruptive behaviors.
A multiple-baseline design was used across subjects with baseline (measured in minutes) across all of the visits combined. The intervention consisted of introducing brief periods in which there was a 10-second pause or break in the procedure. At first the breaks were frequent (every 15 seconds), but then they extended to every minute. The procedure for giving breaks was explained to the child and initially practiced. Also, each break was cued by the dentist as, "It's break time."
Figure 12.6 displays the results across children. Disruptive behavior decreased for all children during the intervention period. However, as one looks at the data (graph) for each child, it is clear that not all four criteria for visual inspection are met in each graph. Across all five children the means change from baseline to intervention. As for changes in level (discontinuity at point of intervention for each child), possibly two (Elaine and George) show this effect. As for changes in trend, perhaps all but one (George) show a different slope from baseline through intervention phases. Finally, as for latency of the change, the change was immediate for all but two children (Melissa and Kevin). Clearly, the criteria differentially apply to the children. This might be analogous to between-group research that shows significant differences among or between groups on some measures but not on others. (Between-group studies do not usually examine the effects that were evident with some individuals but not others.) How do we interpret this? If one looks at the overall graph, it is reasonably safe to conclude that the intervention probably accounts for the change. Each child's behavior changed as the intervention was introduced and not before.
Consider another example where invoking the criteria is a little more difficult. In this example, the goal was to change the behavior of individuals who took orders in the drive-through window of fast-food restaurants (Wiesman, 2006). The goal was to get the order takers to ask customers to "upsize" their orders. Upsizing (sometimes called "supersizing") or up-selling is a common practice in many businesses (sales of computer hardware and software; travel, hotels) where customers who are buying are encouraged to buy a little more of what they are getting or something closely related. A familiar example occurs at movie theaters. If we order a small-size (12 oz., or 340.19 grams) soft drink, usually we are immediately informed that for 25 cents more we can have 32 oz. (907.18 g)—and a catheter. This study in the fast-food restaurant wanted the order takers to ask customers to upsize their orders as they communicated through speakers and headsets, in keeping with fast-food restaurant procedures. The upsizing was asking individuals to order a combination meal (sandwich, fries, and beverage) that involved larger portions of what they ordered at an additional price. Two different fast-food restaurants were studied: one on an interstate highway and another in a small college town. They were affiliated with the same national food chain but were 11 miles apart. Observations were made by observers who were near enough to hear the conversation of the customer and order taker and record yes or no if upsizing was offered. A multiple-baseline design across the two settings was used to evaluate an intervention. The intervention consisted of praise by managers as they saw the behavior of the order takers throughout the shift, weekly
Figure 12.6. Percentage of (15-second) intervals containing disruptive behavior for each child across visits, plotted over consecutive 3-minute intervals by visit for baseline and noncontingent escape phases. (Source: O'Callaghan et al., 2006.)

praise for keeping the upsizing requests of customers above 80% (of the opportunities), and graphed feedback. The purpose here is to discuss the data that are presented in Figure 12.7.
Was the intervention effective, and were the changes likely to be due to the intervention? Probably most would agree but perhaps express a minute reservation. During the intervention phase two changes are clear. The mean increased over baseline, and the variability (up and down range of the data points) decreased. In addition, changes in level (when the intervention was introduced) were also evident. Slope changed as well, at least in the top graph. However, the top graph shows a clear trend toward improvement without the intervention, and that introduces a slight ambiguity. The second graph shows no trend or a slight trend toward decrements in the desired behavior. Overall I would say that the intervention accounts for the change in light of the demonstration. Experimental control might have been helped by one more setting. As I noted in the chapter on multiple-baseline designs, two baselines are an absolute minimum but require a perfectly clear data pattern. A third baseline can make a difference in

Figure 12.7. The percentage of opportunities in which employees asked customers to upsize a meal during baseline and intervention (performance feedback and social reinforcement) phases in a multiple-baseline design across two drive-through restaurants. (Source: Wiesman, 2006.)

case one of the baselines is not quite so clear, and that is largely why three baselines are a recommended minimum.
Other points are important to note about the demonstration. First, employee behavior (asking to upsize) was evident in baseline at a high rate (e.g., mean of 65% in baseline for the top graph). This increased to 96%. There was a clear change in mean, and the change made a difference to the business. Although exact profit gains could not be estimated, thousands of combination (upsized) meals were sold, and a profit of $3.25 was made each time. Over time, every month, this could make a huge profit difference. Seemingly small changes, perhaps like those graphed, can be large when impact accumulates over time and across many people (customers).
These two examples can be placed in the context of other examples in this and prior chapters. Because the designs may have many phases (ABAB, ABCABC) or baselines (multiple baselines), there are many opportunities to invoke the criteria. Invoking the criteria for visual inspection requires judgments about the pattern of data in the entire design and not merely changes across one or two phases. Unambiguous effects require that the criteria mentioned previously be met throughout the design, that is, across the different phase shifts. To the extent that the criteria are not consistently met, conclusions about the reliability of intervention effects become tentative. For example, changes in an ABAB design may show nonoverlapping data points for the first AB phases but no clear differences across the second AB phases. The absence of a consistent pattern of data that meets the criteria mentioned previously limits the conclusions that can be drawn.

Problems and Considerations


Lack of Concrete Decision-making Rules. The use of visual inspection as the primary basis for evaluating data in single-case designs has raised major concerns. First, there are no concrete decision rules for determining whether a particular demonstration shows or fails to show a reliable effect. The process of visual inspection would seem to permit, if not actively foster, subjectivity and inconsistency in the evaluation of intervention effects. The situation can be contrasted with statistical analyses as used in between-group research. Groups are compared and cutoff points (e.g., p < .05) are invoked to decide whether or not an effect was reliable or whether an effect is small, medium, or large (e.g., effect size). Statistical significance gives a binary decision-making tool (significant or not), although we investigators squeeze subjectivity in by using phrases with no real legitimacy in null hypothesis testing, as reflected in such desperate terms as "approached significance," "was almost statistically significant," or "showed a trend toward significance." Apart from these, statistical indices have their own degree of inherent arbitrary factors, problems, and objections.2 Yet, decision making based on statistical significance seems

2 Researchers trained in the tradition of quantitative research (between-group designs, statistical analyses, null hypothesis testing) often object to the subjectivity of visual inspection. Occasionally there is an extreme view that visual inspection is purely in the eye of the beholder, but there are explicit criteria as I have noted. As important, perhaps, statistical analyses are not free from subjectivity, that is, views about what to do that have pivotal consequences for drawing conclusions. Many statistical tests (e.g., factor analysis, regression, cluster analyses, time-series analysis, path analyses)

more straightforward than decision making based on visual inspection. Two or more investigators applying a given statistical test to the data ought to reach the same conclusion about the impact of the intervention, that is, whether the difference was statistically significant or not.
There is an empirical question underlying the concern. Can different judges viewing single-case data graphed in the usual way reach similar decisions about whether there was an effect? It is one thing to say there are criteria, but quite another to see if they can be reliably invoked. In fact, several studies have shown that judges, even when they are experts in single-case research, often disagree about particular data patterns and whether the effects were reliable (e.g., DeProspero & Cohen, 1979; Franklin, Gorman, Beasley, & Allison, 1997; Jones, Weinrott, & Vaught, 1978; Normand & Bailey, 2006; Park, Marascuilo, & Gaylord-Ross, 1990). (The appendix elaborates the concern insofar as it has served as an impetus for statistical evaluation.) Thus the absence of clear decision rules that can be reliably invoked is a problem when the results are not crystal clear.
Efforts have been made to improve reliability in judging data by visual inspection by providing explicit training (e.g., lectures, instructions), by using visual aids in graphing the results (e.g., novel ways of presenting trend lines), and by specifying criteria in relation to those aids to make the process of visual inspection replicable and more explicit (Fisher, Kelley, & Lomas, 2003; Harbst, Ottenbacher, & Harris, 1991; Normand & Bailey, 2006; Skiba et al., 1989; Stewart, Carr, Brandt, & McHenry, 2007; Swoboda, Kratochwill, & Levin, 2009). These efforts have produced mixed results. Sometimes their impact still does not lead to high agreement; sometimes their effects are not maintained. At this point, no training regimen has emerged that redresses the unreliability in making judgments with visual inspection criteria.

Multiple Influences in Reaching a Decision. One of the difficulties of visual inspection is that multiple factors contribute to judgments about the data. The range of factors and how they are integrated to reach a decision are not clear. The extent of agreement among judges using visual inspection is a complex function of changes in means, levels, and trends as well as the background variables, such as variability, stability, and replication of effects within or across subjects (DeProspero & Cohen, 1979; Matyas & Greenwood, 1990). All of these criteria, and perhaps others yet to be made explicit, are combined to reach a final judgment about the effects of the intervention. How to weight the different variables and criteria to make a decision could be very subjective.

2 (Continued) include a number of decision points about various solutions, parameter estimates, and levels or criteria to continue or include variables in the analysis or model. These decisions are rarely made explicit in the data analyses. In many instances "default" criteria in the data-analytic programs do not convey that a critical choice has been made and that the basis of this choice can be readily challenged because there is no necessary objective reason that one choice is better than another. These are not minor facets of statistical analyses. For example, meta-analyses of psychotherapy are extremely popular, and those less familiar with the methods of conducting meta-analyses may view the analyses as straightforward. Yet conclusions about the effectiveness of psychotherapy can vary widely depending precisely on how the statistics for the meta-analyses (e.g., effect size) are computed (Matt, 1989; Matt & Navarro, 1997; Weisz, Weiss, Han, Granger, & Morton, 1995).

The decision is really a complex mental multiple regression equation where the set of variables all have some weight in leading to the decision. That is, even though we might not be using statistics to make the decision, there is a way in which the decision is fundamentally a statistical one. We as observers are weighing the variables differently and subjectively (subjective beta weights, so to speak). In cases in which the effects of the intervention are not dramatic, it is not surprising that judges disagree.
There are special characteristics of single-case data that cannot be detected visually and considered suitably when invoking visual inspection. Two of these deserve brief mention. The first is that the data from one occasion to the next can be correlated. This phenomenon, referred to as serial dependence (detailed in the appendix), cannot be easily "seen" but requires statistical evaluation to detect.* However, data that are correlated in this way are associated with even less agreement when invoking visual inspection. In addition, trends in the data (e.g., baseline) are not all straight lines that are easily detected. Some trends can be picked up only through statistical evaluation of the data. Hidden trends can obscure or mislead when trying to determine if the intervention produced a reliable change. Interestingly, such trends can interfere with both visual inspection and statistical data-evaluation methods. (Please see the appendix for more lengthy discussions on these points.)
A pertinent side comment pertains to research on decision making outside of the context of visual inspection and data evaluation. We have learned that decision making and judgment are influenced by all sorts of factors of which we are unaware and cannot weigh explicitly. For example, influences in the environment (e.g., smells, sights, stimuli placed in the background) register insofar as they have direct influence on us, but do not register at a level of consciousness; that is, we cannot state that we noticed them, and we categorically state they had no influence on us when challenged (Bargh & Morsella, 2008). I mention this because advances in other areas of psychology have elaborated many nuances of decision making. In the context of visual inspection, I doubt that smells in the air influence the decision as to whether the intervention altered a patient's depression when introduced in a multiple-baseline design. Yet the absence of a starkly clear demonstration where the visual inspection criteria are met allows for more influences to operate. For example, in one project, undergraduates (with no experience in single-case designs) were asked to judge changes in AB graphs (hypothetical data) (Spirrison & Mauney, 1994). There was a small correlation (r = .3) between whether they said there were changes and the extent to which they thought the intervention was acceptable (reasonable, appropriate). That is, characteristics or attributes of the intervention influenced judgments about the impact of that intervention. This is one demonstration to show that visual inspection can unwittingly reflect factors other than whether the criteria for visual inspection were met.

* When the subject has multiple data points, there can be significant serial dependence or autocorrelation, a property of the data in which the error terms from one data point to the next are correlated. That is, data are autocorrelated when performance of behavior on one day is influenced by or correlated with performance on the next. This feature introduces an unobservable characteristic that can influence judgment from visual inspection and also influence the results of statistical evaluation. These and related matters are elaborated in the appendix.
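To make serial dependence concrete, the following is a minimal sketch, with hypothetical daily scores, of the lag-1 autocorrelation, that is, the correlation between each observation and the one that follows it (Python):

def lag1_autocorrelation(y):
    n = len(y)
    m = sum(y) / n
    num = sum((y[t] - m) * (y[t + 1] - m) for t in range(n - 1))
    den = sum((v - m) ** 2 for v in y)
    return num / den

daily = [12, 14, 15, 15, 17, 16, 18, 19, 18, 20]   # hypothetical daily scores
print(round(lag1_autocorrelation(daily), 2))       # positive: adjacent days resemble each other

A value near zero suggests little dependence from day to day; sizable positive values indicate the kind of serial dependence that complicates both visual and statistical evaluation.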

The disagreement among judges using visual inspection has been used as an argument to favor statistical analysis of the data as a supplement to or replacement of visual inspection. The attractive feature of statistical analysis is that once the statistic is decided, the result that is achieved is usually consistent across investigators. And the final result (statistical significance) is not altered by the judgment of the investigator. Yet statistics are not the arbiter of what is a "true" effect, and there are scores of statistical options that do not yield the same result (please see the appendix).

Search for Only Marked Effects. Another criticism levied against visual inspection is that it regards only those effects that are very marked as significant. Many interventions might prove to be consistent in the effects they produce but are relatively weak. Such effects might not be detected by visual inspection and would be overlooked. I mentioned that single-case designs began to flourish in behavioral research, and that there was an explicit interest in identifying only those interventions whose effects were unequivocal. Visual inspection was seen as a filter well suited to this goal (Baer, 1977). Interventions that pass the stringent criteria of visual inspection without equivocation are likely to be powerful and consistent.
With a perspective of time, some changes in intervention priorities, and extension of single-case designs to novel areas, the original rationale is more readily challenged. First, analyses of published single-case data reveal that many studies do not produce strong intervention effects. Indeed, it is debatable whether some of these effects are present at all (Glass, 1997; Parker, Cryer, & Byrns, 2006). So the rationale of using visual inspection as a filter to detect only strong effects is an ideal not routinely met.
Second, judgments about whether there was a change are somewhat opposite from the notion of identifying marked effects. When judges make errors from visual inspection, they are more likely to say there is an effect when there is not one rather than fail to detect existent effects (Matyas & Greenwood, 1991; Normand & Bailey, 2006). That is, the search for marked effects is based on the view that unclear and iffy effects will not be identified. Actually, judges "detect" nonexistent effects, the opposite of the intended goals of visual inspection. This is the equivalent of a Type I error, mentioned previously.
Third, looking for marked effects may confuse the experimental and applied criteria for evaluating data. As I noted, the experimental criterion focuses on the reliability of the finding and whether change can be explained by normal fluctuations, preexisting patterns in the data, and chance effects. For this criterion, the strength of the effect is not critical per se. The applied criterion refers to whether the impact is so large as to make a palpable difference. These criteria are related; an effect that meets the applied criterion by definition is likely to be so strong as to reflect a reliable change and meet the experimental criterion. The applied criterion (e.g., teaching reading to children who could not read; eliminating self-injury in an adolescent with a developmental disability) is a stringent bar. Even so, an intervention effect could be reliable and meet the experimental criterion without having genuine impact on the life of an individual. The concern that we only want to identify potent effects mixes these criteria a bit. We want to identify reliable effects (experimental criterion) and, as shown later, we can even meet an important applied criterion with very weak effects.

Overlooking weak but reliable effects can have unfortunate consequences. The possibility exists that interventions when first developed may have weak effects. It would be unfortunate if these interventions were prematurely discarded before they could be developed further. Interventions with reliable but weak effects might eventually achieve potent effects if investigators developed them further. Insofar as the stringent criteria of visual inspection discourage the pursuit of interventions that do not have potent effects, it may be a detriment to developing effective interventions.

General Comments. The concerns I have noted are the primary ones that are invoked when visual inspection is evaluated. A caveat ought to be mentioned about the disagreement or apparent unreliability of visual inspection. First, in fact the conditions of visual inspection often are met. Examples throughout this book are not randomly drawn from the literature, but their sheer number conveys that change can be detected and it is often quite clear that the change can be attributed to the intervention because the design requirements and data-evaluation criteria were met.
Second, many of the studies that evaluate visual inspection have raters evaluate changes across AB designs. The data are sometimes real or sometimes computer generated; the raters often, but not always, are students or individuals without training. Thus, the data criticizing visual inspection have their own problems. In single-case research in everyday settings, AB designs are not sufficient either for the design or for data evaluation. One looks for the "describe, predict, and test" logic of the design in multiple phases (ABAB design) or in other multiple replications of the effect (multiple-baseline designs). That is, we profit from the full design in making the judgments and not just two phases. Also, data from naive raters not trained in or familiar with visual inspection are not quite direct tests of how visual inspection is applied. Again, not all studies of visual inspection rely on inexperienced raters and present AB designs. I mention these caveats not as a defense of visual inspection, but rather to convey that the research on visual inspection has its own issues and sources of ambiguity.
On balance, what conclusions might be reasonable to make? First, visual inspection requires that a particular pattern of data is present in baseline and across subsequent phases. These include little or no trend, or a trend in the direction opposite from the trend expected in the following phase, and slight variability. Also, the specific criteria, when met (e.g., change in means, level, etc.), readily allow application of visual inspection. Often the criteria are not met or are incompletely met and the effects are debatable.
Second, as the data pattern is less persuasive and moves away from little or no trend and less clear changes in mean, level, and so on, extraneous factors are more likely to enter into the judgment about whether there is a reliable effect. There are two extremes that are easily picked up by visual inspection: (1) something really large happened and all the criteria for visual inspection were met, including nonoverlapping data across phases; and (2) nothing at all happened and none of the criteria was met. In this latter case, the data just looked like one baseline even though several interventions were tried. The middle ground, but not all of the middle ground, could invite subjectivity. However, design and data evaluation act in concert, so if four of five behaviors in a multiple-baseline design across behaviors show changes in most of the visual
inspection criteria, the "subjectivity" is likely to be reliable. Visual inspection ought not to be cast aside.
When single-case designs were coming into their own, it made sense perhaps to carve out an identity by showing how visual inspection was unique and accomplished things that were not achieved by statistical tests. Visual inspection and statistical evaluation, very much like single-case designs and traditional between-group designs, are tools for drawing inferences. There is no need to limit one's tools, and indeed there are several disadvantages in doing so. Few studies provide both statistical evaluation and evaluation of the individual data via visual inspection. When they do, they convey the critical point: the methods have a slightly different yield, and each provides information the other did not provide (e.g., Brossart, Meythaler, Parker, McNamara, & Elliot, 2008; Feather & Ronan, 2006; Molloy, 1990).

STATISTICAL EVALUATION
Visual inspection constitutes the criterion used most frequently to evaluate data from single-case experiments. The reason for this pertains to the historical development of the designs and the larger methodological approach of which they are a part, namely, the experimental analysis of behavior (Kazdin, 1978). Systematic investigation of the single subject began in laboratory research with nonhuman animals. The careful control afforded by laboratory conditions helped to meet major requirements of the designs, including minimal variability and stable rates of performance. Potent variables were examined (e.g., schedules of reinforcement) with effects that could be easily detected against the highly stable baseline levels. Indeed, one could readily see immediate changes in behavior (e.g., lever pressing) in response to shifts in conditions across phases. The lawfulness and regularity of behavior in relation to selected variables obviated the need for statistical tests.
As the single-case experimental approach was extended to human behavior, applications began to encompass a variety of populations, behaviors, and settings. The interest in investigating and identifying potent variables has not changed. Invariably, we want to identify interventions that exert impact and have clinically important outcomes. However, the complexity of the situations in which applied investigations are conducted occasionally has made evaluations of intervention effects more difficult. Control over and standardization of the assessment of behaviors, extraneous factors that can influence performance, and characteristics of the organisms (humans) themselves are reduced, compared with laboratory conditions. As evident in many of the examples throughout the book, interventions that draw on single-case designs are used in restaurants, at traffic intersections, in operating rooms where surgery is performed, and all sorts of other settings of everyday life. Hence, the potential sources of variation that may make interventions more difficult to evaluate are increased in applied research. In selected situations, the criteria for invoking visual inspection are not invariably or unequivocally met. Against that backdrop, statistical tests began to be used to compare data from different phases and to answer such questions as: Are the means from all baseline phases different from the means of all intervention phases in an ABAB design? Are the changes over time within an intervention phase statistically significant? The use of statistics was seen as an aid rather than an approach to replace visual inspection.
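As a rough illustration only, and not an endorsed analysis, the first of these questions might be approached by pooling the phases of a hypothetical ABAB series and comparing the baseline and intervention means with a crude t-like statistic; serial dependence in the data can bias such a comparison, a point elaborated in the appendix:

from statistics import mean, stdev
from math import sqrt

a = [9, 10, 8, 11, 10, 9, 11, 10]   # hypothetical data from both baseline (A) phases
b = [5, 4, 4, 3, 4, 3, 2, 3]        # hypothetical data from both intervention (B) phases

diff = mean(a) - mean(b)
se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
print(f"mean difference = {diff:.2f}, t = {diff / se:.2f}")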

Reasons for Using Statistical Tests


When the data meet the criteria for visual inspection outlined earlier, there is little need to corroborate the results with statistical tests, except to comfort those not as familiar with the approach. In many situations, however, the ideal data patterns may not emerge, and statistical tests may provide important advantages. Consider a few of the circumstances in which statistical analyses may be especially useful.

Trend in the Data. Visual inspection depends on having stable baseline phases in which no trend in the direction of the expected change is evident. Evaluation of intervention effects is extremely difficult when baseline performance is systematically improving. In this case, the intervention still may be required to accelerate the rate of improvement. Rates of crime, HIV, motorcycle accidents, and cigarette smoking, for example, all might be declining in a given city or county, but still will be high and warrant intervention. As a real example, heart attacks and death from heart disease are declining in the United States, although this is still the single greatest cause of death. In these examples, improvements in baseline are not a reason for doing nothing. An intervention might still be important to accelerate the process. Visual inspection criteria may be difficult to invoke with initial improvements already underway during baseline. As a more general statement, trend during baselines may interfere with applying visual inspection and drawing inferences. What we have learned is that complex patterns can form trends that are not detectable visually. Statistical analyses can examine whether a reliable change has occurred during the intervention phase over and above what would be expected by some trend, whether that trend could or could not be detected visually (please see the appendix). Hence, statistical analyses can provide information that may be difficult to obtain through visual inspection.
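One simple version of this idea, sketched here with hypothetical data and not presented as the appendix's method, is to regress the outcome on time plus a phase indicator, so that the phase coefficient estimates the shift over and above the ongoing baseline trend:

import numpy as np

y = np.array([20, 22, 23, 25, 26,       # baseline that is already improving
              34, 36, 37, 39, 41])      # intervention sessions
time = np.arange(len(y))
phase = np.array([0] * 5 + [1] * 5)     # 0 = baseline, 1 = intervention

X = np.column_stack([np.ones(len(y)), time, phase])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"baseline trend = {coef[1]:.2f} per session, phase shift = {coef[2]:.2f}")

Here the series improves during baseline, yet the phase coefficient still isolates the additional jump that coincides with the intervention.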

Increased Intrasubject Variability. Single-case designs have routinely been used in applied settings (e.g., psychiatric hospitals, institutions for persons with developmental disabilities, classrooms, day-care centers, juvenile detention centers, foster-care facilities). Rather amazingly, investigators in these settings have been able to control (experimentally) several features of the environment, including behavior of the staff and events occurring during the day other than the intervention, that may influence performance and implementation of the intervention. Careful control also reduces extra, unnecessary, and excess variability in the data that serves as a threat to data-evaluation validity. For example, in a classroom study, the investigator may carefully monitor the intervention so that it is implemented with little or no variation over time. Also, teacher interactions with the children may be carefully monitored and controlled. Students may receive the same or similar tasks while the observations are in effect. Because extraneous factors are held relatively constant for purposes of experimental control, variability in subject performance can be held to a minimum. As noted earlier, visual inspection is more easily applied to single-case data when variability is small.
Now that interventions and single-case designs have been extended to everyday life settings (e.g., business, restaurants, etc.), control over the environment and potential influences on behavior are reduced, and variability in subject performance may be relatively large. With larger variability, visual inspection may be more difficult to apply than in well-controlled settings. Statistical evaluation may be of greater use in
examining whether reliable changes have been obtained. Statistical evaluation is not necessarily "better" in any way, but rather provides a tool that might reduce ambiguity about the reliability of the effect.

Investigation of New Research Areas. Applied research has stressed the importance of investigating interventions that produce marked effects on behavior. In many instances, especially in new areas of research, intervention effects may be relatively weak. The investigator working in a new area is likely to be unfamiliar with the intervention and the conditions that maximize its efficacy. As the investigator learns more about the intervention, he or she can change the procedure to improve its efficacy. Also, an intervention may appear weak when applied to everyone but strong as we learn for whom that intervention is especially well suited.
In the initial stages of research, it may be important to identify promising interventions that warrant further scrutiny. Visual inspection may be too stringent a criterion and lead to rejection or discounting of interventions that produce reliable but weak effects. Such interventions should not be abandoned because they do not achieve large changes initially. These interventions may be developed further through subsequent research and eventually produce large effects that could be detected through visual inspection. Even if such interventions would not eventually produce strong effects in their own right, they may be important because they can enhance or contribute to the effectiveness of other procedures. Hence, statistical analyses may serve a useful purpose in detecting reliable but weaker influences that could prove to be quite valuable.

Small Changes May Be Important. The rationale underlying visual inspection has been the search for large changes in the performance of individual subjects. Over the years, single-case designs and the interventions typically evaluated by these designs have been extended to a wide range of problems. In the process, a public-health perspective has assumed increased significance. That perspective reflects concern for the public at large—large numbers of people—and what can be done for them. From a public-health perspective, little things (intervention effects) can mean a lot. For selected problems, it is not always the case that the value of the intervention effect can be determined on the basis of the magnitude of change in an individual person's performance. Small changes in the behavior of individual subjects or in the behaviors of large groups of subjects often are very important.
Consider three related situations in which small changes are important. First, small changes can be especially important when the effort and costs of delivering treatment are low and treatment can be administered on a large scale. Previously (Chapter 10), I mentioned an example of a brief intervention conducted during a physical examination and designed to reduce cigarette smoking among patients. Physician visits are relatively brief (median = 12 to 15 minutes) in the United States. Scores of controlled trials have shown that physician (or nurse) advice during the visit can have reliable but small impact on smoking. The physician says something like the following to patients who are cigarette smokers: "I think it is important for you to quit tobacco use now," or "As your clinician, I want you to know that quitting tobacco is the most important thing you can do to protect your health." The outcome effects are small but consistent (e.g., Fiore et al., 2000; Rice & Stead, 2008; Stead, Bergson, & Lancaster, 2008). The
comments led to approximately a 2.5% increment in abstinence rates of smoking compared to no intervention. Internal medicine practice guidelines now recommend that physicians include in their visit specific advice to stop smoking. We know from psychological research that advice, feedback, recommendations, raising awareness, and so on are not among the strong interventions. Even so, a reliable effect that is low in cost and can be administered on a large scale can have enormous benefits. The benefits are for the individuals who stopped smoking, for their families (secondary and tertiary smoking are now known to have deleterious health effects), and for society at large (the cost burden of smoking). Graphing of cigarette smoking (e.g., proportion of individuals who become abstinent during baseline and after the physician-advice intervention) might not satisfy visual inspection criteria. The effect is weak and small, but important.
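A back-of-the-envelope calculation, using a hypothetical number of patients, shows why an increment of this size matters at scale:

increment = 0.025        # added abstinence rate from brief physician advice
patients = 1_000_000     # hypothetical number of smokers receiving advice
print(f"{increment * patients:,.0f} additional people abstinent")   # 25,000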
Second, and related to that, in many social contexts a small change is significant because of the focus and its qualitative rather than quantitative features. For example, a small and perhaps even minute reduction in violent crimes (e.g., murder, rape) in a community would be important. An intervention program (e.g., a special police patrol program, improved lighting and video surveillance) may produce only small quantitative changes (e.g., five fewer murders per year). When graphed against the larger baseline background (e.g., 100 murders per year), perhaps visual inspection would not be able to detect the small changes. Loss of life and the trauma to victims and their families are significant if reliably reduced in any amount. In many areas of public life (e.g., a terrorist attack, death of police and firefighters, loss of mother or child at childbirth, and train, plane, and other vehicle accidents), the value stems from having any impact. Loss of a life makes a difference, and a "weak" intervention that "only" saves a few lives is hugely important.
Third, small changes in the behavior of individuals can accumulate to produce large effects. Depending on what is graphed (individual behavior or outcomes accumulated across many individuals), visual inspection might or might not be up to the task of detecting an effect. For example, an intervention designed to reduce energy consumption (e.g., use of one's personal car, use of energy-saving appliances) may show relatively weak effects on the behavior of individual subjects—just a little bit of energy saved. The results may not be dramatic by visual inspection criteria. As a dramatic illustration, use of one energy-efficient (fluorescent) light bulb in each of approximately 110 million households in the United States would have the impact of reducing greenhouse gases equivalent to taking 1.3 million cars off the road (Fishman, 2006; www.energystar.gov). In short, small changes, when accrued over a large number of individuals, can be important because of the larger changes these would signal for an entire group. To the extent that statistical analyses can contribute to data evaluation in these circumstances, they provide an important contribution.
In general, there are many circumstances in which we may want to be able to detect intervention effects that may be small at the level of individual behavior or indeed relatively small at the larger level of the group. Small but reliable changes may be very noteworthy given the significance of the focus, ease of delivery of the intervention, and the larger impact these changes have across many people. Visual inspection may not detect small changes that are reliable. Statistical analyses may help determine whether the intervention had a reliable, even though undramatic, effect on behavior.

Replicability in Data Evaluation. Perhaps the most general argument that might favor the use of statistical tests pertains to what we are trying to do in science. We have invented the scientific method that is designed to reduce subjectivity and bias as much as possible. We develop assessments and validate our instruments, we use control groups and conditions, we keep experimenters naïve with respect to intervention conditions, and we have in mind that the threats to validity are stalking us and our findings like hungry ants at a picnic. The goal is not some methodological elegance for its own sake, but to understand our world in such a way that the findings are replicable by others and do not depend on opinion and subjectivity. These are aspirations. Humans, and perhaps consequently visual inspection and statistics, are infused with subjectivity, but we try to control or limit the impact of that subjectivity on our science.
The goal of producing replicable findings raises two points about visual inspection. First, we need data-analytic tools whenever possible that lead to consistency in reaching conclusions. Visual inspection and statistical analysis sometimes do this and sometimes do not. However, for visual inspection, especially with less than very clear data patterns, it is difficult to convey what the exact rules are. That is one reason that interjudge agreement on visual inspection data is not high. Second, in the early years of elaborating single-case designs, visual inspection was pitted against statistical analyses. There are many reasons to use statistical tests, as I have elaborated previously. But another one has to do with replication. We can be more confident of a finding by showing that the finding is obtained under different conditions. Some of the conditions include: when we use a different measure of the same construct (e.g., self-report and direct observation), different designs (single-case and between-group), and different data-analytic techniques (e.g., visual inspection, statistics; statistical significance and clinical significance). Now that single-case designs have diffused into many areas of work and across many disciplines, more and more investigators report the use of both visual inspection and statistical techniques to examine their data (e.g., Cox, Cox, & Cox, 2000; Levesque et al., 2004; Quesnel, Savard, Simard, Ivers, & Morin, 2003; Savard et al., 1998). The added information can be helpful. In many cases, the different methods confirm that the intervention was effective. In some cases, the different methods confirm that not much has happened, that is, little or no effect of the intervention (e.g., Pasiali, 2004).
It is still the case that visual inspection is the primary criterion to evaluate single-case data. There are many statistical tests that can and have been applied to single-case data. I highlight these, provide examples, and convey further issues about statistical evaluation and visual inspection in the appendix at the end of the book. Reserving that discussion for later allows me to provide details here about the practices more frequently used in single-case data evaluation.

EVALUATION OF THE APPLIED OR CLINICAL SIGNIFICANCE OF THE CHANGE
Visual inspection and statistical data evaluation methods address the experimental criterion for evaluating change, that is, whether the changes in performance are reliable and beyond what might be considered chance fluctuations. As noted earlier, an applied criterion is also invoked to evaluate the intervention. This criterion refers to the applied significance of the changes in behavior or whether the intervention makes a genuine
difference in the everyday functioning of the client. By "genuine difference" I refer to one in which the client and others would see a change that positively affects the individual's life.
In many instances, the criterion for deciding whether a clinically significant change has been achieved may be obvious. Perhaps the clearest instances are when a maladaptive, deviant, or personally harmful behavior is occurring frequently (baseline) and the intervention completely eliminates the behavior or when a positive behavior (reading) never occurs and is at a high rate after the intervention. As an example of the former, an intervention may eliminate head banging of a child with autism or a developmental disability. If the baseline was a mean of 100 instances of head banging per hour, and the behavior was eliminated, most would agree that this was an important and clinically significant change. If the intervention reduced the rate from 100 to 75 or 50 instances per hour, the change may be reliable (by visual inspection or statistical evaluation) but it is probably not clinically important. Self-injurious behavior is maladaptive and potentially dangerous if it occurs at all. Thus, without a virtual or complete elimination of self-injurious behavior, the clinical value of the treatment may be challenged. Similarly, for many other behaviors (e.g., use of illicit substances, driving while intoxicated, experiencing panic in social situations, or reading methodology articles during family Thanksgiving reunions) the complete elimination would make it reasonable to infer that the impact was important.
An intervention program will not invariably make such dramatic changes (e.g., from a lot of problem behavior to none or from no occurrences of a positive behavior to a high rate of that behavior). Even when change is dramatic in relation to the criteria for visual inspection (e.g., changes in mean, level, latency, etc.), this does not mean that the change is important or makes a difference. Other criteria must be invoked.
Two broad and related strategies have been used. Social validation and clinical significance have emerged from single-case and between-group methodologies, respectively. Each presents multiple options for evaluating the changes following intervention effects, and these are highlighted in Table 12.2. I have added a third category called social impact when the goal is on broader outcomes than individual client performance. I highlight and illustrate the different methods next.

Social Validation

Social validation refers generally to consideration of social criteria for evaluating the focus of treatment, the procedures that are used, and the effects that these treatments have on performance (Schwartz & Baer, 1991; Wolf, 1978). For present purposes, the feature of the effects or impact of treatment is especially relevant to this discussion. As noted in the table, there are two methods of social validation used to evaluate the impact of the intervention: social comparison and subjective evaluation.

Social Comparison. With the social comparison method, the behavior of the client before and after treatment is compared with the behavior of nondeviant ("normal") peers who are functioning well in the community or context in which they are evaluated. The question asked by this comparison is whether the client's behavior after treatment is distinguishable from the behavior of his or her peers who are functioning adequately in the environment. Presumably, if the client's behavior warrants intervention, that
Table 12.2. Means of Evaluating the Applied or Clinical Significance of Change in Intervention Studies

A. Social Validation
1. Social Comparison Method. Defined: The client's performance is evaluated in relation to the performance of a normative sample before and after treatment. Criterion: At the end of treatment, client functioning falls within the range of the normative sample.
2. Subjective Evaluation. Defined: Impressions of the client, or of those who interact with the client, that the treatment change makes a perceptible difference. Criterion: Ratings at the end of treatment indicate that current functioning is better or that the original problem is minimal or not evident.

B. Clinical Significance
1. Normative Comparison. Defined and evaluated the same as the Social Comparison Method.
2. Departure from a Dysfunctional Level of Functioning. Defined: The individual makes a large change on a measure; the change departs from the pretreatment mean. Criterion: A large change (e.g., two or more standard deviations) from the pretreatment mean, a clear departure from dysfunctional levels. The departure can be from another sample or set of untreated individuals or from one's original score on the measure.
3. No Longer Meeting Criteria for a Psychiatric Diagnosis. Defined: Before treatment the individual met formal criteria for a psychiatric diagnosis (e.g., Major Depression, Panic Disorder, Conduct Disorder) but no longer does after the intervention. Criterion: Reassessment after the intervention, using the same measure and criteria, shows that individuals no longer meet diagnostic criteria for the disorder that was treated.

C. Social Impact. Defined: Change on a measure that is recognized or considered to be critically important in everyday life, usually not a psychological scale or measure devised for the purposes of research. Criterion: Change reflected on such measures as arrest, truancy, hospitalization, disease, and death.
behavior should initially deviate from normative levels of performance. If treatment produces a clinically important change, at least with many clinical problems, the client's behavior should be brought within normative levels.
The essential feature of social comparison is to identify the client's peers, that is, persons who are similar to the client in such variables as age, gender, ethnicity, and socioeconomic class, but who are functioning adequately or well and whose behaviors do not warrant intervention. Presumably, a clinically important change would be evident if the intervention brought the clients to within the level of their peers whose behaviors are considered to be adequate.
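Two of the criteria in Table 12.2 lend themselves to simple computation. The following minimal sketch, with hypothetical scores, checks whether a posttreatment score falls within a normative range (taken here as the normative mean plus or minus two standard deviations, one common convention) and how far the score departs from the pretreatment mean in pretreatment standard deviation units:

from statistics import mean, stdev

normative = [3, 5, 4, 6, 2, 5, 4, 3]   # hypothetical peers' scores on the measure
pre = [22, 25, 20, 24, 23]             # client's pretreatment scores
post = 6                               # client's posttreatment score

low = mean(normative) - 2 * stdev(normative)
high = mean(normative) + 2 * stdev(normative)
departure = abs(post - mean(pre)) / stdev(pre)

print(f"within normative range: {low <= post <= high}; "
      f"departure = {departure:.1f} pretreatment SDs")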
The critical feature is obtaining information that can be used as a benchmark. For example, one program focused on training appropriate eating behaviors among individuals with developmental disabilities who seldom used utensils, constantly spilled food on themselves, stole food from others, and ate food previously spilled on the floor (O'Brien & Azrin, 1972). A behavioral intervention (using prompts, praise, and food reinforcement for appropriate eating) was used to develop appropriate eating behaviors. Although training increased appropriate eating behaviors, one can still ask whether the improvements really were important and whether behavior approached the eating skills of persons who are regarded as functioning well or normally in everyday life. To address these questions, the investigators compared the group that received
training with the eating habits of 12 customers in a local restaurant. These customers were watched by observers, who recorded their eating behavior, using the same measure of inappropriate eating. This represents a benchmark of people who have not been identified as having deficits in their eating behaviors. The mean of this group is represented by the dashed line in Figure 12.8. As evident in the figure, after training, the level of inappropriate mealtime behaviors among the persons with developmental disabilities was even lower than the rate of inappropriate eating by customers in the restaurant. These results suggest that the magnitude of changes achieved with training brought behavior to acceptable levels of persons functioning in everyday life.
Consider as another example a program designed to develop helping behaviors among four children (5 to 6 years of age) in a private school and with a diagnosis of autism (Reeve et al., 2007). The goal was to teach a social behavior (helping in the classroom). This was selected because, among social behaviors, helping acts tend to lead to more prolonged interactions. Prior to the program the children engaged in no helping behaviors. The intervention, which consisted of training in each of several activities (e.g., modeling, instructions, and guidance), led to changes in helping behavior. The intervention was evaluated in a multiple-baseline design across children. The effects

Figure 12.8. The mean number of inappropriate responses (eating errors) per meal performed by the training group of individuals with developmental disabilities, plotted across weeks of maintenance. The dashed horizontal line represents the mean number of these same responses performed by 12 customers in a restaurant under ordinary eating conditions (normative sample). (Source: O'Brien & Azrin, 1972.)

were very clear, in part because baseline was consistently at a rate of zero (no helping) for all the children. To evaluate the changes further, the investigators videotaped the helping behaviors of the four children toward the end of their training and then obtained videotapes of other, typically developing children asked to engage in the same behaviors. The question to be addressed is whether the level of helping achieved with training the children with autism placed them near or closer to the helping behavior of children not identified as showing impairment. A total of 80 videotapes were made; college students rated the tapes in random order; the instances of helping were similar for the two groups.
In the preceding examples, social comparison is intended to further inform what the changes mean. Do the social comparison data convey that the changes were important? That is difficult to answer and may involve other considerations, such as how the clients or others viewed the changes and whether the changes carry over to other settings and are maintained over time. Yet the examples convey how social comparison data provide a very useful benchmark for evaluating intervention effects and placing changes in an important social context of how other people are behaving. The challenge is obtaining a suitable sample for comparison purposes, a topic to which I return later in the chapter.

Subjective Evaluation. With the subjective evaluation method, the client's behavior is evaluated by persons who are likely to have contact with him or her in everyday life and who evaluate whether distinct improvements in performance can be seen. The question addressed by this method is whether behavior changes have led to qualitative or clearly perceptible differences in how the client is viewed by others.
Subjective evaluation as a means of validating the effects of treatment usually consists of global evaluations of behavior. The behaviors that have been altered are observed by persons who interact with the client or who are in a special position (e.g., through expertise) to judge those behaviors. Global evaluations are made to provide an overall appraisal of the client's performance after treatment. It is possible that systematic changes in behavior are demonstrated, but that persons in everyday life cannot see a "real" difference in performance. If the client has made an important change, this should be obvious to persons who are in a position to judge the client. Hence, judgments by persons in everyday contact with the client add a crucial dimension for evaluating the applied or clinical significance of the change.
An excellent example is the case of Steven, a college student who sought treatment to eliminate two muscle tics (uncontrolled movements) (Wright & Miltenberger, 1987). The example conveys the complementary role of objective information and subjective evaluation to establish not only that there was a change, but also that the change is one others can see as making a difference. Steven's tics involved head movements and excessive eyebrow raising. Individual treatment sessions were conducted in which he was trained to monitor and identify when the tics occurred and in general to be more aware of their occurrence. In addition, he self-monitored tics throughout the day. Assessment sessions were conducted in which Steven read at the clinic or college library and observers recorded the tics. Self-monitoring and awareness training procedures were evaluated in a multiple-baseline design in which each tic declined in frequency as treatment was applied.

Was the reduction important, or did it make a difference either to Steven or to others? At the end of treatment Steven's responses to a questionnaire indicated that he no longer was distressed by the tics and that he felt they were no longer "very noticeable" to others. In addition, four observers rated randomly selected videotapes of Steven without knowing which tapes came from before or after treatment. Observers rated the tics from the posttreatment tapes as not at all distracting, normal to very normal in appearance, and small to very small in magnitude. In contrast, they had rated tics on the pretreatment tapes as much more severe on these dimensions. Observers were then informed which were the posttreatment tapes and asked to report how satisfied they would be if they had achieved the same results as Steven had. All observers reported that they would have been satisfied with the treatment results. The subjective evaluations from Steven and independent observers help attest to the importance of the changes, that is, they made a difference to the client and to others.
Subjective evaluation usually refers to global evaluations by individuals in contact with the client, but as evident in the example of Steven, the client's evaluations can be sought as well to evaluate whether the difference obtained with the intervention makes a genuine difference in his or her perception. Self-evaluation as a means for social validation has been used much less frequently than has been evaluation by others, for at least two reasons. First, interventions evaluated with single-case designs often focus on populations with severe disability or impairment (e.g., children with developmental disabilities, severe intellectual impairment, autism) or on community behaviors and settings (e.g., seat-belt use among drivers and their passengers). Self-evaluation in these cases is not feasible.
Second, self-evaluations often reflect overall satisfaction with an intervention and may not relate well to actual changes in behaviors or problems for which the intervention was provided. Of course, subjective evaluation often is a critical focus. How one feels about one's own mood (e.g., depression), one's relationship (e.g., feelings about one's partner or spouse), environmental events (e.g., sources of stress), and physical and mental condition (e.g., overall health, happiness) are important facets of life. In some situations, self-report provides critical data that have no substitute. For example, in one program, an adolescent was treated for muscle tension that impaired her functioning and caused her severe pain (Warnes & Allen, 2005). The program was evaluated by an automated psychophysiological measure (electromyography). Yet, daily self-report on pain also was assessed. Getting better on an objective measure of muscle tension is no substitute for asking individuals if they are feeling better. Both measures showed the effects of the program. As studies like these show, assessment of overt behavior and subjective evaluation need not compete, but provide important complementary information.
It is important to underscore the role of subjective evaluation in another context. When single-case designs emerged in applied settings, as I have noted, overt behavior was the primary and often the exclusive measure to reflect the effects of an intervention. This has changed over the years as outcomes reflect other aspects of functioning (e.g., depression, anxiety, cognitions) where nonovert behavioral measures are used. Subjective evaluation draws attention to the importance of other dimensions and is a check on the impact of what a study might show. For example, in one program, an intervention was designed to reduce the disruptive behavior of boys identified because


of their behavioral problems (Reitman, Murphy, Hupp, & O'Callaghan, 2004). The intervention was very effective in altering child behavior, with impressive graphs, in keeping with many other examples I have provided. Subjective evaluation was added at the end of the program to determine whether the teachers noticed a difference. Improvement was not noted by teachers on standardized teacher ratings. This is an instructive finding. Were the changes potent enough? Were the operational definitions targeting the problems relevant? Were the teachers merely biased or oblivious to the changes? Subjective evaluation is an important supplement to convey how the intervention effects are perceived and are affecting others.

Clinical Significance


Social validation grew out of single-case research. A related development grew out of psychotherapy research and the tradition of between-group designs. Psychotherapy research has a long history—treatments are evaluated in relation to changing clinical problems such as depression, anxiety, panic disorder, and many other problems. Data are evaluated by comparing means of groups that receive various conditions. At the end of a study, one treatment may be more effective than another, and patients may have improved. "More effective" and "improved" are statistical concepts even though they sound like clinical concepts, that is, that something has made a real difference. "More effective" only means that groups were different on statistical tests; "improved" only means that change from pre- to posttreatment was statistically significant for the group. These both reflect the experimental criterion that evaluates the reliability of the finding. Neither concept necessarily reflects an applied criterion, namely, whether the change makes any real difference in the lives of individuals. Clinical significance has been introduced as a criterion to supplement statistical evaluation in the same way social validation was introduced to supplement visual inspection.
The concept of clinical significance in group research is not new or unique to psychotherapy or to psychology. The concept has been recognized and is easily conveyed in medical research. For example, in cancer treatment studies, one treatment might be considered better than another because it extends life (survival after the treatment). With a measure such as "survival," it is easy to mix experimental and applied criteria of change. As a measure, survival sounds and is important. However, "better" in this context refers to statistical significance (experimental criterion). The difference may be 5 days, 5 weeks, or 5 months more of survival. Is that difference important (applied criterion)? Perhaps this is for the patient to judge. Perhaps the judgment for us or for the patient depends on further information such as the quality of life during that additional survival time. If that extra survival time is of poor quality (greater pain, immobility, being comatose, or side effects), that may contribute to the evaluation. Concerns about quality of life in such treatment studies reflect interest in elaborating the importance of the treatment effects in ways that are relevant to patients and that complement survival data.
Although the ideas underlying clinical significance are not new, the topic arose formally in the context of psychotherapy research and the tradition of between-group research. Three indices have been used to evaluate whether changes on key measures make a difference, that is, are clinically significant. These measures were highlighted in Table 12.2.

Falling within a Normative Range. The first index relates to the use of a normative sample. The question asked by this method is, "To what extent do patients or clients after completing treatment (or some other intervention) fall within the normative range of performance?" Prior to treatment, the clients presumably would depart considerably from their well-functioning peers on the measures and in the domain that led to their selection (e.g., anxiety, depression, social withdrawal). Demonstrating that after treatment these same persons were indistinguishable from or within the range of a normative, well-functioning sample on the measures of interest would be a reasonable definition of a clinically important change (Kazdin, 1977b; Kendall & Grove, 1988). This is the same as the social comparison method mentioned in the context of social validation. I mention the index again here because it has been used more often in between-group research than in single-case research. Use of this index of clinical significance is facilitated by using outcome measures that have extensive normative data on patients by age and sex. Hence one can easily compare the results for a treated sample with that normative base. Single-case designs often use measures of overt behavior; between-group designs more often use self- and other-report scales. These latter scales are more easily standardized by administering the measures to thousands of individuals who may vary by age, ethnicity, socioeconomic standing, nationality, and other such factors. Large databases of samples of individuals functioning in the community have been used frequently in between-group research to evaluate the clinical significance of change.
As a rather typical example, one of our own studies evaluated treatments for children ages 7 to 13 referred for aggressive and antisocial behavior (Kazdin, Siegel, & Bass, 1992). The effectiveness of three conditions was examined, including problem-solving skills training (PSST), parent management training (PMT), and PSST + PMT combined. PSST is a cognitively based procedure in which children are trained to approach interpersonal situations in ways that help them identify and carry out prosocial solutions. PMT is a procedure designed to change parent-child interactions in the home in concrete ways that promote prosocial child behavior. Two outcome measures are plotted for the three groups at pretreatment, posttreatment, and a 1-year follow-up (see Figure 12.9). The measures were the parent- and teacher-completed versions of the Child Behavior Checklist (Achenbach, 1991), which assess a wide range of emotional and behavioral problems. Extensive normative data (of nonreferred, community children) available for boys and girls within the age group have indicated that the 90th percentile score on overall (total) symptoms is the score that best distinguishes clinic from community (normative) samples of children. As shown in Figure 12.9, scores at this percentile from community youths were used to define the upper limit of the "normal range" of emotional and behavioral problems. Clinically significant change was defined as whether children's scores fell below this cutoff, that is, within the normative range. The figure shows that children's scores were well above this range before treatment on the parent (left panel) and teacher (right panel) measures. Each group approached or fell within the "normal" range at posttreatment, although the combined treatment was superior in this regard.
The results in the figure provide group means (average performance of each group). One also can compute how many individuals fall within the normative range at the end of treatment. In the present example, for the parent-based measure referred to in the figure, results at posttreatment indicated that 33%, 39%, and 64%

Figure 12.9. Mean scores (T scores) for Problem-Solving Skills Training (PSST), Parent Management Training (PMT), and both combined (PSST + PMT) for the total behavior problem scales of the parent-completed Child Behavior Checklist (CBCL, left panel) and the teacher-completed Child Behavior Checklist—Teacher Report Form (TRF-CBCL, right panel). The horizontal line reflects the upper limit of the nonclinical ("normal") range of children of the same age and sex. Scores below this line fall within the normal range. (Source: Kazdin, Siegel, & Bass, 1992.)

of youths from PSST, PMT, and combined treatment, respectively, achieved scores that fell within the normative range. These percentages are different (statistically significant) and suggest the superiority of the combined treatment on the percentage of youths returned to "normative" levels of functioning. The results underscore the importance of evaluating clinical significance. In this study, even with statistically significant changes within groups and differences between groups, most youths who received treatment continued to fall outside of the normative range of their nonclinically referred peers.
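The counting logic involved is simple enough to state in a few lines of code. The sketch below is a minimal illustration rather than anything from the study: it tallies how many hypothetical posttreatment T scores fall at or below a hypothetical normative cutoff. Both the scores and the cutoff value are invented for the example.

# Minimal sketch: percentage of treated cases falling within the normative
# range. The scores and the cutoff are hypothetical, not values from the study.
CUTOFF = 63  # illustrative upper limit of the "normal range" on a T-score scale

post_scores = [58, 66, 61, 70, 55, 62, 59, 68, 60, 64]
within = sum(score <= CUTOFF for score in post_scores)
pct = 100 * within / len(post_scores)
print(f"{within} of {len(post_scores)} cases ({pct:.0f}%) fall within the normative range")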

Departure from Dysfunctional Behavior. Another method to define clinical significance uses a dysfunctional level for comparison. In clinical work, patients are selected because of their dysfunction in the area of focus. Perhaps they were recruited for extreme scores on measures of depression. At the end of treatment, if a clinically important change is made, scores of the clients ought to depart markedly from their original scores. The departure of course ought to be in the direction of improvement (e.g., reduced symptoms). There is no logical justification for deciding how much of a change or reduction in symptoms is needed, and different criteria have been suggested and used (Jacobson & Revenstorf, 1988; Jacobson, Roberts, Berns, & McGlinchey, 1999). One variant denotes intervention effects as clinically significant when there is a

departure of two standard deviations from the mean of the pretreatment performance (i.e., the so-called dysfunctional sample). Thus, at posttreatment, individuals whose scores depart at least two standard deviations from the mean of the dysfunctional group (e.g., untreated cases from a no-treatment control group) would be regarded as having changed in an important way.
Why a criterion of two standard deviations? First, if the individual is two standard deviations away from the mean of the original group, this suggests that he or she is not represented by that mean and the distribution from which that sample was drawn; indeed, two standard deviations above (or below) the mean reflects the 98th (or 2nd) percentile. Second and related, two standard deviations approximates the criterion used for statistical significance when groups are compared (e.g., 1.96 standard deviations for a two-tailed t test that compares groups at the p < .05 level of significance).
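To make the arithmetic of the criterion concrete, here is a minimal sketch under stated assumptions: the scores are hypothetical, lower scores indicate improvement, and the pretreatment (dysfunctional) sample supplies the reference mean and standard deviation.

from statistics import mean, stdev

# Hypothetical symptom scores; lower = better.
pre_scores = [28, 31, 25, 30, 27, 33, 29, 26]   # dysfunctional (pretreatment) sample
post_scores = [12, 24, 9, 15, 27, 11, 14, 10]   # the same clients at posttreatment

ref_mean = mean(pre_scores)
ref_sd = stdev(pre_scores)
cutoff = ref_mean - 2 * ref_sd  # improvement here means falling below the reference mean

changed = [s for s in post_scores if s <= cutoff]
print(f"reference mean = {ref_mean:.1f}, SD = {ref_sd:.1f}, cutoff = {cutoff:.1f}")
print(f"{len(changed)} of {len(post_scores)} cases meet the two-SD criterion")

Whether the reference distribution comes from the treated clients at pretreatment or from a separate untreated sample is a substantive choice; once the reference is fixed, the criterion reduces to a simple cutoff comparison, as the sketch shows.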
As an illustration, a study for the treatment of depression among adults compared two variations of problem-solving strategies (Nezu & Perri, 1989). To evaluate the clinical significance of change, the investigators examined the proportion of cases in each group whose score on measures of depression fell two or more standard deviations below (i.e., less depressed) the mean of the untreated sample. For example, on one measure (the Beck Depression Inventory), 85.7% of the cases that received the full problem-solving condition achieved this level of change. In contrast, 50% of the cases that received the abbreviated problem-solving condition achieved this level of change. The more effective treatment led to a clinically significant change for the large majority of the cases, and clearly one treatment was better than the other in this regard. The comparisons add important information about the impact of treatment.
For many measures used to evaluate treatment or other interventions, normative data that could serve as a criterion for evaluating clinical significance are not available. That is, we cannot tell whether at the end of treatment cases fall within a normative range. However, one can still evaluate how much change the individual made and whether that change is so large as to reflect a score that is quite different from the mean of a dysfunctional level (pretreatment) or sample (no-treatment group). Of course, if normative data are available, one can evaluate the clinical significance of change by assessing whether the client's behavior returns to normative levels and also departs from dysfunctional levels.

No Longer Meeting Diagnostic Criteria. Occasionally, clinical significance is evaluated by determining whether the diagnostic status of the individual has changed with treatment. In many treatment studies, individuals are recruited and screened on the basis of whether they meet criteria for a psychiatric diagnosis (e.g., Major Depression, Posttraumatic Stress Disorder). Those with a diagnosis are included in the study and assigned to various treatment and control conditions. Clinical significance has been defined by evaluating whether the individual, at the end of treatment, continues to meet criteria for the original (or other) diagnoses. Presumably, if treatment has achieved a sufficient change, the individual no longer meets criteria for the diagnosis. Sometimes this is referred to as showing that the individual has recovered.
For example, in one study, adolescents who met standard psychiatric diagnostic criteria for clinical depression were assigned to one of three groups: adolescent treatment, adolescent and parent treatment, or a wait-list condition (Lewinsohn, Clarke,

Hops, & Andrews, 1990). At the end of treatment, 57% and 52% of the cases in the two treatment groups, respectively, and 95% of the cases in the control group continued to meet diagnostic criteria for depression. A smaller proportion of cases in the treatment groups continued to meet diagnostic criteria for the disorder.
There is something appealing about showing that after treatment the individual no longer meets diagnostic criteria for the disorder that was treated. It suggests that the condition (problem, disorder) is gone or "cured." Yet, many clinical problems that have formal psychiatric diagnoses (e.g., Depression, Autism, Attention-Deficit/Hyperactivity Disorder, Generalized Anxiety Disorder) are on a continuum (sometimes referred to as a spectrum). So no longer meeting the diagnostic criteria for a disorder is not a cure. (As we occasionally tell our students, only hams are "cured.") Not meeting the criteria for the diagnosis of a disorder (e.g., depression) can be achieved by showing a change in only one or two symptoms.

Social Impact Measures


Single-case research often focuses on important social problems or on individual behaviors (e.g., use of seat belts, driving safely, keeping poisonous household cleaners out of the reach of children) that if altered on a large scale could have important personal as well as social consequences (e.g., injury and death). For example, prevention programs often focus on infants or young children from socioeconomically disadvantaged homes who are at risk for later mental and physical health problems (Mrazek & Haggerty, 1994). Occasionally follow-up data are obtained 10 to 20 years later. Social impact measures such as higher rates of school attendance, high-school graduation, and employment, and lower rates of arrest and reliance on public assistance are evident among those who received an early intervention, compared to nonintervention controls. These measures and outcomes are clearly significant to society as well as to the individuals who benefit directly.
In health care and education, cost is often of interest and used as a basis for deciding social impact. Presumably an intervention has significant social impact if it can reduce monetary costs. One cost question, both for evaluating large-scale social interventions and for individual treatment, is what benefits derive from the costs. Cost-benefit analysis is designed to weigh the monetary costs of an intervention against the benefits that are obtained. The benefits must also be measured in monetary terms. Outcomes such as clients returning to work, missing fewer days of work, having fewer car accidents, or staying out of hospitals or prisons are examples of benefits that can be translated into monetary terms. Of course, many important outcomes (e.g., personal happiness, family harmony) are not readily translated to monetary gains.
Cost-effectiveness analysis does not require placing a monetary value on the benefits and can be more readily used for evaluating treatment. Cost-effectiveness analysis examines the costs of treatment relative to a particular outcome. The analysis permits comparison of different treatment techniques if the treatment benefits are designed to be the same (e.g., reduction of drinking, increase in family harmony). For example, one study compared two variations of parent training for parents of kindergarten children with behavior problems (Cunningham, Bremner, & Boyle, 1995). One variation consisted of individual treatment provided at a clinical service; the other consisted of group-based treatment conducted in the community (at community

centers or schools). Both treatments were better than a wait-list control condition. On several outcome measures, the community-based treatment was more effective. Even if the treatments were equally effective, the monetary costs (e.g., start-up costs, travel time of families, costs of the therapist/trainer in providing treatment) of individual treatment were approximately six times greater per family than for the group treatment. Also, the community-based treatment is much more likely to have impact on individuals and society at large because it can be disseminated more broadly than individual therapy.
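The form of such a comparison can be shown in a few lines. The sketch below computes a simple cost-effectiveness ratio (dollars per unit of improvement on a shared outcome measure) for two hypothetical treatment formats; all figures are invented for illustration and are not drawn from the Cunningham et al. study.

# Minimal sketch of a cost-effectiveness comparison; all numbers hypothetical.
treatments = {
    "individual, clinic-based": {"cost_per_family": 3000.0, "mean_improvement": 10.0},
    "group, community-based": {"cost_per_family": 500.0, "mean_improvement": 12.0},
}

for name, t in treatments.items():
    ratio = t["cost_per_family"] / t["mean_improvement"]  # dollars per unit of improvement
    print(f"{name}: ${ratio:,.2f} per unit of improvement")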
Within psychotherapy research, costs, cost-benefit, and cost-effectiveness measures are infrequently used, with notable exceptions (e.g., Cunningham et al., 1995; Simon et al., 2001). Cost is deceptively simple as a measure, which is one of the reasons why the measure is infrequently used in intervention studies. The complexities include the fact that cost is not merely the price of a few items on a list. What is included in cost estimates, how to place a price on them, and how to evaluate the costs of not providing treatment hint at a few of the challenges.
The interest in cost often stems from evaluating the cost of treatment in relation to the alternatives. So, for example, one could or could not treat alcoholism among workers at a major company. Both of these have costs. Treating alcoholism has the costs of providing services. Yet, not treating alcoholism has costs as well, because alcoholism leads to many missed days at work, reduced worker productivity, and increased illness and injury. Hence the cost of treatment is not the issue but rather the cost of treatment versus no treatment, both of which may be expensive. A common statement that better reflects this is the notion that the cost of education in society is very high, but not that high when compared to the cost of ignorance. Similarly, the costs of treating some problems (e.g., depression, substance use, conduct disorder) are enormous but must be weighed against the cost of not providing treatment. For example, the annual cost of anxiety disorders in the United States is estimated at $42.3 billion (or $1,542 per sufferer) (Greenberg et al., 1999). Psychotherapy can reduce costs by reducing work impairment and lost days at work (see Gabbard, Lazar, Hornberger, & Spiegel, 1997).
Measures of social impact are infrequently used in single-case research, but they are in keeping with the goal of extending the evaluation to measures that are of applied significance. The measures are relevant when the goal is beyond individual behavior or interventions on a small scale. In some cases, the focus of a study awaits extension on a larger scale. For example, a demonstration project mentioned previously showed that safe driving (stopping at a stop sign) could be improved at one intersection on a college campus (Austin et al., 2006). Other measures (e.g., fewer accidents, reduced injuries and death) either within the demonstration or from a large-scale extension would be excellent indices of social impact.

Problems and Considerations


Measures of applied significance (social validity and clinical significance) have taken on increased importance over the years in many disciplines such as education, counseling, psychology, and health care. We have learned that our interventions can lead to statistically significant change on all sorts of measures, and that they can produce seemingly impressive results on a small scale and on well-validated measures (e.g., direct observations, various psychological scales). A key question asked by consumers (e.g., legislators, mental health agencies, insurance companies) is whether any of the work makes a real difference. Measures of social validity and clinical significance in varying degrees are designed to address this legitimate interest. Although social validity, clinical significance, and social impact measures are important, each raises critical issues pertaining to interpretation of the data. Let me highlight a few of the issues in relation to social validity.

Social Comparison and Return to Normative Levels. One criterion (for social validity and clinical significance) was showing that at the end of the interventions individuals fell within the normative range of performance of the behavior. Defining a normative sample is not quite so easy. To begin with, among adults (18 years of age and older) in the United States functioning "normally" in the community (i.e., a normative sample), approximately 1 in 4 (25%) meet criteria for a diagnosable psychiatric disorder (National Institute of Mental Health, 2008). Approximately one half of the individuals in that 25% meet criteria for two or more psychiatric disorders. Again, these statistics are based on individuals in everyday life, walking the streets, coming to classes, teaching classes, and writing methodology books—well, maybe not the latter, but you get the point. A sample of people from everyday life includes individuals with significant social, emotional, and behavioral problems. Normative functioning also varies as a function of several sample characteristics. Age, sex, identity, ethnicity, and culture are some of the features. So any comparison ought to match on key factors, but what are the key factors and how many ought to be included?
Second, to whom should individuals with severe developmental disability, chronic psychiatric impairment, or extensive prison records be compared in evaluating treatment or rehabilitation programs? Developing normative levels of performance might be an unrealistic ideal in treatment, if that level is based on individuals functioning well in the community. Defining and identifying a normative population raises additional problems. Presumably, forming a normative group ought to take such moderators into account.
Third, even if a normative group can be identified, exactly what range of their behaviors would be defined as within the normative level? Among individuals whose behaviors are not identified as problematic there will be a range of acceptable behaviors. Defining the upper and lower limits of that range (e.g., ± one standard deviation) is somewhat arbitrary unless data show that scores above or below a particular cutoff have different short- or long-term consequences on other measures of interest (e.g., hospitalization, showing another disorder).
Fourth, for many measures of interest, bringing individuals into the normative range is a questionable goal. Consider, for example, reading skills of elementary school children. A clinically significant change might well be to move children with reading dysfunction so that they fall within the normative range. However, perhaps the normative range itself should be questioned as a goal. The reading of most children might be accelerated from current normative levels. Thus, a normative criterion itself needs to be considered. More extreme would be bringing youth who abuse drugs and alcohol to the level of their peers. For some groups, the peer group itself might be engaging in a level of deviant behavior that is potentially maladaptive.
Finally, it is quite possible that performance falls within the normative range or departs markedly from a deviant group but does not reflect how the individual is

functioning in everyday life. Paper-and-pencil measures, questionnaires, interviews, and other frequently used measures may not reflect adaptive functioning for a given individual. Even for measures with high levels of established validity, performance of a given individual does not mean that he or she is happy, doing well, or adjusting in different spheres of life. There is a difference between what is shown on a psychological measure and what is evident in everyday life, so that falling into the normative range on such measures does not really have a clear meaning (Blanton & Jaccard, 2006; Kazdin, 2006).

Subjective Evaluation. The subjective evaluation method as a means of examining the clinical importance of intervention effects also raises critical issues. First, global rating scales usually serve as the basis for obtaining subjective evaluations. Such scales are more readily susceptible to biases on the part of raters than are questionnaires and interviews or direct observations in which the items are more concrete and anchored to clearer descriptors. Because subjective evaluations are global rather than concrete, they are likely to be highly variable (e.g., have different meanings and interpretations) among those who respond. Also, subjective evaluations, whether completed by the clients or others in contact with the clients, are likely to be fairly nonspecific in their ability to differentiate among different treatments.
Second, the fact that the client or persons associated with a client notice a difference in behavior as a function of the client's treatment does not mean that the client in fact has changed or has changed very much. Persons in contact with the client may perceive a small change and report this in their ratings. But this does not necessarily mean that treatment has alleviated the problem for which treatment was sought or brings the client within normative levels of functioning.
In general, one must treat subjective evaluations cautiously; it is possible that subjective evaluations will reflect change when other measures of change do not. Subjective evaluations might be especially limited as the sole or primary outcome measure for most clinical dysfunctions. For example, clients might really believe they are doing much better (subjective evaluation) but continue to have the dependence or addiction (e.g., alcohol) or impairment that they experienced at the beginning of treatment. It is quite possible that one feels better about something without having changed at all. This concern raises caution about interpretation but does not condemn subjective evaluation.
Subjective ratings provide important information. It really does make a difference how people feel and think (e.g., about themselves, their lives, their marriages, their partners). When all is said and done, whether treatment makes people experience life as better is no less central as an outcome criterion than performance on the best available psychological measure. Subjective evaluation is designed to supplement other measures and to address these broader issues.

General Comments. I have highlighted concerns with measures of social validation. There are parallel issues in relation to individual indices of clinical significance and social impact measures that involve lengthy discussions beyond the present scope (see Kazdin, 2001). The unifying issue of the concerns pertains to assessment validity: How does one know that the index genuinely reflects client functioning in everyday life? The usual way of measuring validity is showing that scores on a measure correlate with

performance elsewhere, but this does not address the matter (see Blanton & Jaccard, 2006). Does the measure selected for social validation, clinical significance, or social impact clearly reflect a difference that is important in the lives of the clients? How does one know? For some of the measures, such as subjective evaluation, perceiving that there is a difference defines an important change. For other measures, very little assessment work has been completed to show that huge changes on a measure, or being closer to a normative sample and further away from a dysfunctional sample, have palpably improved the client's everyday functioning (see Kazdin, 2006).
There is no single way to measure the applied or clinical significance of intervention effects. Despite the problems I have outlined, it is important to include one or more measures when possible. Measures other than those highlighted previously might be devised to evaluate clinical significance. It is not difficult to conceive of other ways to operationalize clinical significance. For example, in psychotherapy research, measures of symptoms are usually used to evaluate clinical significance. Yet, one might assess other constructs such as quality of life, impairment, or participation in life (e.g., activities, relationships). The investigator involved in intervention research ought to attend to the question, "Did the treatment make a genuine difference in the lives of the recipients?" and ought to select one or more measures to provide an answer.
Among the dilemmas is that we are not always looking for large change. A small change often can be sufficient or just enough to make a difference. If treatment makes people a little better (e.g., a little less depressed or anxious, a little more confident, or they drink or smoke less), this may be enough to have them enter parts of life (e.g., social events, relationships, work, new hobbies) that they would not otherwise enter. It is easy to conceive of instances in which a small effect of therapy could have just enough impact to affect people's lives in important and palpable ways. As an example, making a dysfunctional marriage a little better may have an important impact on the couple and the individual spouses (e.g., deciding to remain married) even though on a measure of marital bliss the couple still falls outside the normative range or has not changed two or so standard deviations.
Notwithstanding these considerations, the researcher is encouraged to include one or more measures of social validity or clinical significance in any intervention study. The purpose of the addition is to move beyond establishing the reliability of the intervention effect (e.g., through visual inspection or statistical analyses) and to show that the impact makes a difference. Measures of social validity, clinical significance, or social impact address a critical question, and in applied work, arguably the most critical question: Do our interventions make a difference in ways that the public cares about? This might be answered in all sorts of ways, but some attempt should be made that goes beyond showing reliable effects through visual inspection and statistical tests.

SUMMARY AND CONCLUSIONS
Data from single-case experiments are evaluated according to experimental and applied criteria. The experimental criterion refers to judgments about whether behavior change has occurred and is a reliable effect, that is, one that is not likely to be due to fluctuations in behavior. The applied criterion refers to whether the effects of the intervention are important or make a genuine difference.

In single-case experiments, visual inspection usually is used to evaluate whether the experimental criterion has been met. Data from the experiment are graphed and judgments are made about whether change has occurred and whether the data pattern meets the requirements of the design. Several characteristics of the data contribute to judging whether behavior has changed and whether that change can be attributed to the intervention. Changes in the mean (average) performance across phases, changes in the level of performance (shift at the point that the phase is changed), changes in trend (differences in the direction and rate of change or slope across phases), and latency of change (rapidity of change at the point that the intervention is introduced or withdrawn) all contribute to judging whether a reliable effect has occurred. Invoking these criteria is greatly facilitated by stable baselines and minimal day-to-day variability, which allow the changes in the data to be detected.
The primary basis for using visual inspection is that it serves as a filter that may allow only especially potent interventions to be agreed on as significant. Yet objections have been raised about the use of visual inspection in situations where intervention effects are not spectacular. Judges occasionally disagree about whether reliable effects were obtained. Also, the decision rules for inferring that a change has been demonstrated are not always explicit or consistently invoked for visual inspection.
Statistical analyses occasionally are used as an alternative way of analyzing the data, although they continue to serve as an ancillary way of evaluating interventions in single-case research. Statistical tests may be especially useful when several of the desired characteristics of the data required for visual inspection are not met. For example, when baselines are unstable and show a systematic trend in a therapeutic direction, selected statistical analyses can more readily evaluate intervention effects than visual inspection. The search for reliable albeit weak intervention effects is especially difficult with visual inspection. These interventions may be important to detect, especially in the early stages of research before the intervention is well understood and developed. Finally, there are several situations in which detecting small changes may be important, and statistical analyses may be especially useful here.
Data evaluation via visual inspection or statistical analyses is designed to identify the reliability of the effect and whether a change has occurred that could not be attributed to chance. There also is interest in the applied criterion, that is, whether behavior changes are clinically significant. Examining the importance of intervention effects entails social validation, that is, considering social criteria for evaluating treatment outcomes. Two methods of social validation are relevant for evaluating intervention effects. The social comparison method considers whether the intervention has brought the client's behavior to the level of his or her peers who are functioning adequately in everyday life. The subjective evaluation method consists of having persons who interact with the client or who are in a special position (e.g., through expertise) judge whether the changes evident in treatment reflect a noticeable difference in everyday life.
In group research, clinical significance has emerged as a parallel focus, namely, to evaluate the importance of change. Social comparison has been one of the methods. Other methods include identifying whether the degree of change for individuals is so large (e.g., two standard deviations away from the pretreatment mean) as to depart markedly from dysfunctional behavior and whether individuals at the end of the treatment no longer meet criteria for psychiatric diagnosis.

Social impact measures were also mentioned as a way of identifying whether the intervention procedure made an important difference. These measures focus on important social problems or on the large-scale impact of an intervention beyond changing the behavior of one or more individuals. Measures of impact might reflect safety, health, energy use, cost, rates of arrest or hospitalization, and other such indices that are of broader social interest.

Measures of social validity, clinical significance, and social impact, as any set of measures, raise their own interpretive challenges. Nevertheless they address an important issue in applied work, namely, what genuine impact has the intervention had to show that it makes a difference to the individual, to those with whom he or she is in contact, and to society at large? The measures I have outlined, or others that might be generated with similar goals, provide important information often neglected in applied work.

CHAPTER 13

Graphic Display of Data for Visual Inspection

CHAPTER OUTLINE

Basic Types of Graphs


Simple Line Graph
Cumulative Graph
Bar Graph
Graphs Are Not Neutral
Descriptive Aids for Visual Inspection
Changes in Means
Plot a Measure of Variability
Presentation of the Data to Clarify Patterns
Changes in Level
Changes in Trend
Rapidity of Change
General Comments
Summary and Conclusions

Chapter 12 provided a discussion of visual inspection, its underlying rationale, and how it is invoked in single-case experimental research. Several characteristics of the data are crucial for reaching this decision, including the changes in means, levels, and trends across phases and the rapidity of the changes when experimental conditions (phases) are changed. In all research, whether single-case or between-group, graphical display of the data can be a useful aid in conveying at a glance what a flood of numbers and mind-numbing tables might obscure. In single-case research graphical display assumes even greater importance.
Visual inspection requires that the data be graphically displayed so that various characteristics of the data and criteria for data evaluation can be examined. Thus, it is important to keep the criteria for visual inspection in mind when one plots the data or uses aids to enhance or facilitate application of the criteria. Graphing of data in research, design, mathematics, marketing, and many other disciplines is a topic of significance in its own right with scores of options for data presentation (e.g., Henry, 1995;


Kosslyn, 2006; Tufte, 2001; Wilkinson, Wills, Rope, Norton, & Dubbs, 2005). Single-case research relies on a small number of the options. This chapter discusses major options for displaying the data graphically to help the investigator apply the criteria of visual inspection to single-case data.1 Commonly used graphs and descriptive aids that can be added to simple graphs to facilitate interpretation of the results are discussed and illustrated.

BASIC TYPES OF GRAPHS
Data from single-case research can be displayed in several different types of graphs. In each type, the data are plotted so that the dependent measure is on the ordinate (vertical or y-axis) and the data are plotted over time, represented by the abscissa (horizontal or x-axis). Typical ordinate values include such labels as frequency of responses, percentage of intervals, number of correct responses, and so on. Typical abscissa values or labels include sessions, days, weeks, or months.
As noted in Figure 13.1, four quadrants of the graph can be identified in the general case. The quadrants vary as a function of whether the values are negative or positive on each axis. In single-case research, almost all graphs would fit into the top right quadrant (marked by bold lines), where the y-axis (ordinate) and x-axis (abscissa) values are positive. The values for the ordinate range from zero to some higher positive number. For example, single-case research focuses on many areas (e.g., number of obsessive thoughts, arithmetic problems completed, years of survival, or amount of energy conserved in the home) where the goal is to increase or decrease the occurrence of some behavior or domain of functioning. Negative numbers or response values are usually not possible. Similarly, the focus is usually on performance over time from Day 1 to some point in the future. Hence, the x-axis usually does not take negative values, which would go back into history.

A variety of types of graphs can be used to present single-case data. For present purposes, three major types of graphs are discussed and illustrated. Emphasis is placed on the use of the graphs in relation to the criteria for invoking visual inspection.

Simple Line Graph
The most commonly used method of plotting data in single-case research consists of noting the day-by-day (or session-by-session) performance of the subject over time. The data for the subject are plotted each day in a noncumulative fashion. The score for that day can take on any value of the dependent measure and may be higher or lower than values obtained on previous occasions. Data points within each phase are connected to produce a line. This method of plotting the data is represented in virtually all of the examples of graphs in previous chapters. However, it is useful to illustrate briefly this type of figure in the general case to examine its characteristics more closely.

1 The chapter provides an overview of the main types of graphs in single-case designs and is not about how to construct or prepare graphs or how to utilize readily available software and database management programs to do that. Several other helpful resources are available for these facets of graphing in relation to single-case research (Barton et al., 2007; Carr & Burkholder, 1998; Moran & Hirschbine, 2002; Riley-Tillman & Burns, 2009).


Figure 13.1. X and Y axes for graphic display of data. Bold lines indicate the quadrant used in the majority of graphs in single-case research.


Figure 13.2. Hypothetical example of ABAB design as plotted on a simple line graph in which frequency of responses is the ordinate and days is the abscissa.

Figure 13.2 provides a hypothetical example in which data are plotted in a simple line graph. The crucial feature to note is that the data on different days can show an increase or decrease over time. That is, the data points on a given day can be higher or lower than the data points of other days. The actual score that the subject receives for a given day is plotted as such. Hence, performance on a particular occasion is easily discerned from

the graph. For example, on Day 10 of Figure 13.2, the reader can easily discern that the target response occurred 40 times and on the next day the frequency increased to 50 responses. Hence, the daily level of performance and the pattern of how well or poorly the subject is doing in relation to the dependent values are easily detected.

The obvious advantage of the simple line graph is that one can immediately determine how the subject is performing at a glance. The simple line graph represents a relatively nontechnical format for presenting the session-by-session data. Much of single-case research is conducted in applied settings where the need exists to communicate the results of the intervention to parents, teachers, employees, nurses, and others who are unfamiliar with alternative data presentation techniques. The simple line graph provides a format that is relatively easy to grasp.

An important feature of the simple line graph, even for the better trained eye, is that it facilitates the evaluation of various characteristics of the data as they relate to visual inspection. Changes in mean, level, slope, and the rapidity of changes in performance are especially easy to examine in simple line graphs. And, as discussed later in this chapter, several descriptive aids can be added to simple line graphs to facilitate decisions about mean, level, and trend changes over time.
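For readers who prepare such figures with general-purpose software, the following sketch plots hypothetical ABAB data as a simple line graph, with data points connected within phases and dashed vertical lines marking phase changes. The data values, the phase labels, and the use of matplotlib are illustrative assumptions rather than a prescribed format.

# Minimal sketch of a simple line graph for an ABAB design; data hypothetical.
import matplotlib.pyplot as plt

phases = {
    "Baseline": [42, 45, 40, 44, 43],
    "Intervention": [30, 25, 22, 20, 18],
    "Baseline 2": [35, 38, 41, 40, 42],
    "Intervention 2": [24, 19, 15, 14, 12],
}

fig, ax = plt.subplots()
day = 1
for label, scores in phases.items():
    days = list(range(day, day + len(scores)))
    ax.plot(days, scores, marker="o", color="black")  # points connected within a phase only
    ax.annotate(label, (days[0], max(scores) + 2), fontsize=8)
    day += len(scores)
    if label != "Intervention 2":
        ax.axvline(day - 0.5, linestyle="--", color="gray")  # mark the phase change

ax.set_xlabel("Days")                     # abscissa: time
ax.set_ylabel("Frequency of responses")   # ordinate: dependent measure
ax.set_ylim(bottom=0)
plt.show()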

Cumulative Graph
The cumulative graph consists of noting the level of performance of the subject over time in an additive fashion. The score the subject receives on one occasion is added to the value of the scores plotted on previous occasions. The score obtained for the subject on a given day may assume any value of the dependent measure. Yet the value of the score that is plotted is the accumulated total for all previous days. Consider as a hypothetical example the data plotted in Figure 13.3, the same data that were plotted in

Figure 13.3. Hypothetical example of ABAB design as plotted on a cumulative graph. Each data point consists of the data for that day plus the total for all previous days.

Figure 13.2. On the first day, the subject obtained a score of 30. On the next day the subject received a score of 15. The 15 is not plotted as such. Rather, it is added to the 30 so that the cumulative graph shows a 45 for Day 2. The graph continues in this fashion so that all data are plotted in relation to all previous data.
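The transformation itself is just a running sum. Here is a minimal sketch, starting from the same hypothetical values of 30 and 15 used above; the remaining daily scores are invented for the example.

# Converting daily scores to the values plotted in a cumulative graph.
from itertools import accumulate

daily = [30, 15, 22, 18, 25]           # hypothetical daily scores
cumulative = list(accumulate(daily))   # running total across days
print(cumulative)                      # [30, 45, 67, 85, 110]

Plotting the cumulative values rather than the daily values against days yields the cumulative graph; a horizontal stretch means no responding, and a steeper slope means a higher rate of responding.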
As an example, one investigation was designed to help fiction writers increase their writing productivity (Porritt et al., 2006). The first 10 individuals from an Internet group of writers (approximately 4,000) who volunteered participated in an Internet-based program. To be included they also had to indicate that they were working on a manuscript and were dissatisfied with their productivity. Participants submitted a copy of their manuscripts via the Internet each day. Words of the manuscript were automatically counted (by software, Microsoft Word®) each day and served as the outcome measure of intervention effects. (Also, the content was judged to ensure the new material was relevant and related to the story, so the additions could not be meaningless.) All communication, including the intervention, was completed via the Web, including a Web page for the group and individual Web pages for each participant. The intervention included providing individual goals for writing, graphic display of the number of words written, public acknowledgment (email that went to all participants), personalized emails to give feedback, special recognition in email if the person met his or her goals for that week, and points earned for writing that could be used to obtain critiques of one's manuscript from another writer. The 10 participants were divided into two groups, and the intervention package was evaluated in a multiple-baseline design across groups.
Figure 13.4 presents the impact of the intervention on the cumulative number of words written. The intervention was introduced to the two groups at different points in time. In the figure the solid line in the intervention phase represents the predicted level of performance from baseline. This is a useful addition to illustrate the logic of the design, but also because many people are unfamiliar with cumulative graphs. Little or no change in word productivity would be represented by a low slope or no slope (horizontal line). More and a lot of change would be reflected in a steeper slope because each day's words add to the previous days'—more words, steeper slope. The figure conveys very clearly that baseline was not moving very steeply, as illustrated by the baseline data and the extrapolation of what would be likely to happen if baseline were continued. During the intervention there was a clear change in the cumulative words as the angle of the slope became steeper and departed from the projected line of baseline. Writing productivity increased during the intervention phase. The graph is clear once one accommodates to the notion of accumulated data.
In an educational application, a multiple-treatment design was used to evaluate two procedures to develop mastery of words among adults (ages 10 to 48) with developmental disabilities who worked in a day vocational program (Worsdell et al., 2005). Most were involved in continuing education and were selected for their need to improve sight-word reading. Individual sessions were conducted in which words were presented (based on assessment of the person's reading level). The number of words mastered consisted of words read without mistakes. During the intervention two ways of correcting errors were compared across multiple and varying word lists. Briefly, when an error was made, the individual was asked to repeat the word once. The trainer said, "No, the word is __. Say __." This was called the Single-Repetition (SR)


Figure 13.4. Cumulative number of words submitted by two groups of fiction writers who participated in the Internet-based intervention. The solid line represents the trend extrapolated from the rate of change during the baseline. (Source: Porritt et al., 2006.)

condition. The other condition provided multiple opportunities to repeat the word. After an error, the correction was made and the client was asked to repeat the word five times, a procedure referred to as the Multiple-Repetition (MR) condition. Once a word was mastered, it was pulled from the list and new words were added. Figure 13.5 shows the multiple-baseline design across the six clients included in the demonstration. It is clear from the graph that changes occurred when the intervention was introduced, with especially conspicuous slope changes, meeting the criteria of the multiple-baseline design. Also, the comparison of MR and SR sessions conveys that the MR condition was better in leading to word mastery. The cumulative graph is quite useful. We would


Figure 13.5. The cumulative number of words mastered during baseline (no correction of errors) and during the intervention in which two interventions were compared: single-response repetition (SR) and multiple-response repetition (MR). This is a combined multiple-baseline and multiple-treatment design. (Source: Worsdell et al., 2005.)

like to know which method has led to mastery of more words. The sheer number of words a person masters is meaningful, and the cumulative graph is clear in showing the different totals overall.

Historically, the use of cumulative graphs in single-case research can be traced primarily to nonhuman animal laboratory research in the experimental analysis of behavior (see Kazdin, 1978). The frequency of responses was often plotted as a function of time (rate) and accumulated over the course of the experiment. Data were recorded automatically on a cumulative record, an apparatus that records accumulated response rates. The cumulative record was a convenient way to plot large numbers of responses over time. The focus of much of the research was on the rate of responding rather than on absolute numbers of responses on discrete occasions such as days or sessions (Skinner, 1938). A simple line graph is not as useful to study rate over time, because the time periods of the investigation are not divided into discrete sessions (e.g., days). The experimenter might study changes in rate over the course of varying time periods rather than discrete sessions.2
In applied research, cumulative graphs are used only occasionally, but examples can be readily found (e.g., Mueller, Moore, Doggett, & Tingstrom, 2000; Sundberg, Endicott, & Eigenheer, 2000). Part of the reason is that they are not as familiar or as easily interpreted as are noncumulative graphs. The cumulative graph does not quickly convey the level of performance on a given day for the subject. For example, a teacher may wish to know how many arithmetic problems or what percentage of problems a child answered correctly on a particular day. This is not easy to cull from a cumulative graph. The absolute number of responses on a given day may be important to detect and communicate quickly to others. Noncumulative graphs are likely to be more helpful in this regard.
The move away from cumulative graphs also is associated with an expanded range of dependent measures. Cumulative graphs have been used in basic laboratory research to study rate of responding. The parameter of time (frequency/time) was very important to consider in evaluating the effects of the independent variable. In applied research, responses per minute or per session usually are not as crucial as the total number of responses alone. For example, an intervention may be directed toward reducing violent acts in a special school for violent youth. Although the rate of aggressive responses over time and the changes in rate may be of interest, the primary interest usually is simply in the total number of these responses for a given day. The analysis of moment-to-moment changes, often of great interest in basic laboratory research, usually is of less interest in applied research. Even so, cumulative graphs ought not to be ruled out. There are many measures, especially in relation to social impact, in which the accumulation of multiple responses and the cumulative impact are of concern. For example, cumulative


injuries and death from accidents, cumulative energy saved, and cumulative donations to philanthropic causes are all of interest because in the end it is the total accumulation we care about. Even at the individual level, cumulative responses often are of interest as individuals set goals (e.g., cumulative miles of jogging for an exercise buff or pages written for an author).

2 A cumulative graph was especially useful in detecting patterns of responding and immediate changes over time. For example, in much early work in operant conditioning, schedules of reinforcement were studied in which variations in presenting reinforcing consequences served as the independent variable. Schedule effects can be easily detected in a cumulative graph in which the rate of response changes in response to alterations of reinforcement schedules. The increases in rate are reflected in changes of the slope of the cumulative record; absence of responding is reflected in a horizontal line (see Ferster & Skinner, 1957).

Bar Graph
A bar graph provides a simple and relatively clear way of presenting data.3 The graph presents the data in vertical or occasionally horizontal columns (bars) to represent performance under different conditions. Each bar or column represents the mean or average level of performance for a separate phase. For example, the mean of all of the data points for baseline would be plotted as a single column; the mean for the intervention and for subsequent phases would be obtained and presented separately in the same fashion. Figure 13.6 illustrates a hypothetical ABAB design in which the data are presented in a simple line graph (upper panel); the same data are presented as a bar graph (lower panel).
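As a small illustration, not from the original text, the sketch below computes phase means from hypothetical ABAB data and draws the kind of bar graph shown in the lower panel of Figure 13.6; all names and numbers are invented.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ABAB data: daily observations within each phase
phases = {
    "Base":  [14, 12, 13, 15, 13],
    "Int":   [18, 20, 21, 19, 22],
    "Base2": [15, 14, 16, 15, 14],
    "Int2":  [22, 23, 21, 24, 23],
}

# Each bar is simply the mean of all data points within a phase
means = [np.mean(values) for values in phases.values()]

plt.bar(list(phases.keys()), means)
plt.xlabel("Phases")
plt.ylabel("Mean level of performance")
plt.show()

Note how much is lost in the translation: the bar heights say nothing about trend, variability, or phase duration, which is exactly the limitation discussed below.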
Bar graphs are occasionally used to present data in single-case research. In a previous figure (13.5), data were presented in a cumulative graph to show the effects of two different procedures to correct reading errors. That figure showed the accumulated words mastered on a session-by-session basis. The same data are replotted in Figure 13.7 in a bar graph. The information is not redundant with the cumulative graphs. From the bar graph, we see the means for the two procedures that were used and whether they made a difference and if so how much.

Figure 13.7. The mean number of correct words read per session during single-response repetition (SR) and multiple-response repetition (MR) methods of correcting errors, shown separately for each of the six clients (see also Figure 13.5). This is a combined multiple-baseline and multiple-treatment design. (Source: Worsdell et al., 2005.)
The advantage of bar graphs is that they present the results in one of the easiest formats to interpret. Day-to-day performance within a given phase is averaged, and that average (mean) is reflected in the height of the bar. From the standpoint of data evaluation, the graph presents only one of the characteristics used for visual inspection (changes in means). Fluctuations in performance, trends, and information about duration of the phases are usually omitted. The advantage in simplifying the format for presenting the data has a price. The interpretation of data from single-case experiments very much depends on seeing several characteristics (e.g., changes in level, mean, trend). Insofar as bar graphs exclude portions of the original data, less information is presented to the naive reader from which well-based conclusions can be reached.

3 In single-case research, "bar graph" and "histogram" occasionally are used interchangeably, although in other contexts they are readily distinguished (http://education.mit.edu/starlogo/graphing/graphing.html#Histogram, http://www.ncsu.edu/labwrite/res/gh/gh-bargraph.html). Both refer to displays in which the data are represented by columns (usually vertical, but occasionally horizontal). Histogram is a type of bar graph, often plotting the frequency of different values in a population (e.g., how many people in the country are ages 1–10, 11–20). Each bar represents the number of people in each age group. The text will use bar graph as the more general term. In single-case research, the use of such graphs is primarily in the context of presenting means (averages) of the data. For example, the results of an ABAB design could be presented so that each bar represents the mean of each phase on the measure of interest (e.g., items answered correctly). Alternatively, sometimes bars are used to present the means for assessments administered once or twice (e.g., pre- and postassessment for several subjects).
Figure 13.6. Hypothetical example of an ABAB design in which the data are represented in a simple line graph (upper panel) and a bar graph (lower panel).

The features of the data not revealed by a bar graph may contribute to misinterpretations about the pattern of change over time. For example, trends in baseline and/or intervention phases may not be represented in bar graphs, which could have implications for the conclusions that are reached. Hypothetical data are plotted in Figure 13.8 to show the sorts of problems that can arise. In the upper left panel, a continuous improvement is shown over baseline and intervention phases in the simple line graph. Clearly the upper left shows that something is going on, that is, that there is a strong trend toward improvement that began in baseline and that there is no reason to think the intervention made any difference. Remember, we use the baseline in the upper left graph to predict what performance would be like in the future if baseline were to continue. The data in the intervention phase are perfectly in keeping with the baseline projection, and we could not conclude there is any intervention effect. Replotting the same data in a bar graph, shown in the upper right panel, suggests that

the intervention had a large effect. The bar graph plots the means. The differences in means are merely a product of the overall trend, but the trend requires a simple line graph to detect.
In the lower panel, another set of data is plotted; this time behavior was increasing during baseline (e.g., became worse) and changing in its trend with the intervention. The simple line graph suggests that the intervention reversed the direction of change. Yet the bar graph shows that the averages from the phases are virtually identical. Consider the lower right and left panels and how they might be misused. Suppose the measure were crime rate in a city and were plotted over a year. The left lower panel graph shows that things were getting worse in baseline but better during the intervention. The intervention might be the policy and practices of a new mayor who claimed that she would reduce crime. When running for re-election, her aides plot the data on the left lower panel, which immediately conveys that crime was increasing before she took office and that during her year the trend was completely reversed. However, her opponent running for office might plot the data differently with the bar graph on the lower right. It is clear from looking at the bar graph that the mean crime rate has not changed at all. The incumbent mayor could say, "Look what I did for the city" (and point to the left lower panel), and the challenger running for office could say, "Yes, I am looking and there was no change overall" (and point to the lower right panel). Both graphs are perfectly accurate, but they differ in how much information they reveal and how they display the information. Nothing in the bar graph is inaccurate, but here is a case where incomplete data are critical (and politically mischievous).
(The methodologically informed contender running for office would emphasize the AB design, as a quasi-experiment, and how other hypotheses [threats to validity] could account for all of the effects. He would lose the election but have the admiration of all methodologists.)

Figure 13.8. Hypothetical data from AB phases. The upper panel shows the same data in a simple line graph (left) and replotted as a bar graph (right). The bar graph suggests large changes in behavior, but the simple line graph suggests the changes were due to a trend beginning in baseline and continuing during the intervention phase. The lower panel provides an example in which the intervention was associated with a marked change as shown in the simple line graph (left), but the bar graph (right) suggests no change from baseline to intervention phases.
Bar graphs are also useful for simplifying data. The audience for one's results may dictate the use of a bar graph because of this simplifying feature. We do not teach most researchers about changes in variability and stability of baselines and criteria for visual inspection, let alone informing the public about any of this. When presenting information to the public, loss of many of the details (trends, shift in level) has a virtue we seek, namely, conveying a message that is at once accurate but as free from nuances as possible. In short, bar graphs have a use both in research as well as in communicating information broadly. In this chapter, I am emphasizing the importance of graphing continuous data so that one can apply criteria of visual inspection. In this context, bar

graphs are usually a helpful complement to data plotted in ways that provide the day-to-day performance. Occasionally investigators present the data in the usual simple line graph to show the day-to-day performance (e.g., in a multiple-baseline design) but then summarize the effects by presenting a bar graph that just conveys means. For example, in a project that provided treatment of posttraumatic stress symptoms in children, cognitive behavior therapy was introduced in a multiple-baseline design across four children (Feather & Ronan, 2006). Single-case data were plotted in the usual multiple-baseline design fashion to show the impact of the intervention. In the study there were assessments over three follow-up periods (3-, 6-, and 12-month follow-up). The authors also summarized the results nicely by providing a bar graph that gave mean symptom scores of all four children in baseline, intervention phases, and follow-up. The presentation of the data in different ways as in this example can reflect the best of all possible worlds because the detailed, continuous assessment for the multiple-baseline designs conveys the needed information for visual inspection; the summary bar graph simplifies the presentation to consumers who are interested in a summary/bottom-line statement.

GRAPHS ARE NOT NEUTRAL
A misunderstanding about data presentation and analyses is the supercynical view that data and statistics lie. It is true that data can misrepresent, as illustrated in the hypothetical example of crime rates plotted as a simple line graph or bar graph that led to conclusions that crime rate has improved or has not changed. More generally, data from any individual study are incomplete and only represent partial information. Apart from issues related to graphing, any single study may not represent findings that would apply to varied samples (e.g., different ethnic or cultural groups), different measures of the target focus, and different settings to which the intervention might be applied. Protections in science about misrepresentation include replication of findings and their extension to new populations, measures, and circumstances. In any given study, one cannot evaluate all domains that might be relevant. Also, when one does have extensive data, an investigator is rarely allowed to present all of the data (e.g., each client's performance, each day, etc.), although more and more publication outlets, investigators, and funding agencies wish to make all data available for reanalysis. So we begin with the notion that all of the data are rarely presented. Add to that pressures (e.g., our own, sponsors of the research) that can influence data presentation as well, and there is more than merely omission caused by the mass amount of information. As we are called on to summarize the data, we want to provide the reader with maximum information to convey how we drew inferences and what the data show. It is possible to distort by leaving out information. In single-case research, the graph should make available all of the data that permit evaluation of the criteria for visual inspection, at the very least.
Another consideration in presenting the data pertains to the audience. If one is presenting the information to other researchers, the complexities of the data in their full bloom might be presented. This might include data to permit visual inspection and statistical analyses (for the researchers) but summary descriptive statistics (for the consumer). For example, one study focused on oppositional, aggressive, and antisocial behaviors of eight children (ages 6 to 8) who were referred to a special classroom because of their problem behavior (Ducharme et al., 2008). An intervention was

introduced to train them to interact differently with others. The intervention was introduced in a multiple-baseline design across children and was shown to be responsible for change. (Although unnecessary for the present discussion, the reader may wish to see the original multiple-baseline graph, Figure 10.6 in Chapter 10.) For the consumer of the research (parents, teachers, school administrators), it would be helpful to have a bottom line. Did the intervention help anyone? Did antisocial behavior decrease? The investigators assessed play in naturally occurring sessions before and after treatment and plotted these data in a bar graph for each child, as shown in Figure 13.9. The data are very straightforward; the intervention made a difference. We do not need to


be concerned about whether the intervention led to change or whether omission of so much information (level, trends) distorts the data. All of these concerns were addressed in the simple line graph. The bar graph as a supplement to other information can be extremely useful and can add clarity.

Figure 13.9. Mean frequency of antisocial behaviors during pretreatment and posttreatment (generalization sessions in a classroom without intervention). The impact of the intervention was demonstrated separately in a multiple-baseline design across children. (Source: Ducharme et al., 2008.)

In presenting data, the goal is to provide all the information feasible to allow the reader to make an evaluation. In the case of graphing and visual inspection, this means allowing the reader to apply the criteria or to see how the investigator applied the criteria. In addition, the goal may be to communicate the major findings. These different goals may require different presentations of the same data.

DESCRIPTIVE AIDS FOR VISUAL INSPECTION
As noted earlier, inferences based on visual inspection rely on several characteristics of single-case data. In the usual case, simple line graphs are used to represent the data over time and across phases. The ease of inferring reliable intervention effects depends, among other things, on evaluating changes in the mean, level, and trend across phases, and the rapidity of changes when conditions are altered. Several aids are available that can permit the investigator to present more information on the simple line graph to address these characteristics and also to communicate the results more completely and clearly.

Changes in Means
The easiest source of information to add to a simple line graph that can facilitate visual inspection is the plotting of means. The data are presented in the usual way so that day-to-day performance is displayed. The mean for each phase is plotted as a horizontal line within the phase. Plotting these means as horizontal lines or in an equivalent way for each phase readily permits the reader to compare the overall effects of the different conditions, that is, provides a summary statement.
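A minimal sketch, not from the original text, of this aid; the AB data below are hypothetical (7 baseline days followed by 8 intervention days), and a dashed horizontal mean line is drawn within each phase.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical AB data
baseline = [16, 12, 13, 17, 14, 13, 14]
intervention = [16, 18, 19, 20, 18, 21, 21, 20]
days = np.arange(1, len(baseline) + len(intervention) + 1)

plt.plot(days, baseline + intervention, marker="o")

# Horizontal line at the mean of each phase, spanning only that phase
plt.hlines(np.mean(baseline), 1, len(baseline), linestyles="dashed")
plt.hlines(np.mean(intervention), len(baseline) + 1, days[-1], linestyles="dashed")

plt.axvline(len(baseline) + 0.5)  # marks the phase change
plt.xlabel("Days")
plt.ylabel("Frequency of behavior")
plt.show()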
For example, a multiple-baseline design was used to evaluate a program designed to teach reading to three elementary school students (7 years of age) diagnosed with learning disabilities (Gilbert, Williams, & McLaughlin, 1996). Training included discussion of vocabulary, teaching of phonetic rules (sounds), and practice. Students read into a recorder for 4 minutes; the number of correct words read per minute was assessed from that. Figure 13.10 shows the effects of the program. Within each phase for each child, the horizontal line reflects the mean (average) correct words per minute. This is a very simple but helpful addition.

Figure 13.10. Correct reading rates during baseline and assisted reading as evaluated in a multiple-baseline design across subjects. The horizontal line in each phase represents the mean. (Source: Gilbert, Williams, & McLaughlin, 1996.)
Another example provides a demonstration with effects that are much less clear than the previous example. In this demonstration, feedback was used to improve the performance of boys (9 to 10 years old) who participated in a football team and league (Komaki & Barnett, 1977). The goal was to improve execution of the plays by selected members of the team (backfield and center). A checklist of players' behaviors was scored after each play to measure if each player did what he was supposed to. During the feedback phase, the coach pointed out what was done correctly and incorrectly after each play. The feedback from the coach was introduced in a multiple-baseline design across various plays. Figure 13.11 shows that performance tended to improve at each point at which the intervention was introduced. The means are represented in each phase by the horizontal dotted lines. In this example, the means are especially useful because intervention effects are not very strong. Changes in level or trend are not apparent from

baseline to intervention phases. Also, rapid effects associated with implementation of the intervention are not evident either. The plot of means shows a weak but seemingly consistent effect across the baselines. Without the means, it might be much less clear that any change occurred at all.

Figure 13.11. Percentage of stages successfully completed for Plays A, B, and C during football practice (M = Monday, T = Tuesday, Th = Thursday) and game (G) situations. Each data point refers to the execution of a single play. (Source: Komaki & Barnett, 1977.)

The plotting of means represents an easy tool for conveying slightly more information in simple line graphs than would otherwise be available. Essentially, plotting of means combines the advantages of simple line graphs and bar graphs. The advantage of plotting means in a simple line graph rather than using a bar graph is that the day-to-day performance can be taken into account when interpreting the means.

Plot a Measure of Variability
Variability of the data is critical for evaluating change, whether in means or trends. Although variability per se is not explicitly one of the criteria for visual inspection, I mention this here in relation to aids for graphing. When researchers who use single-case designs report the results, means are usually mentioned among phases, even if they are not plotted graphically. In contrast, mention or illustration of a measure of variability about those means is rarely included. However, the meaning of means, so to speak, derives in part from their departure from each other across phases. Departure is not merely a difference in absolute numerical terms. For example, if the mean of baseline is 20 and the mean of the intervention phase is 24, how do we interpret that? If all the data in baseline were identical scores of 20 and all the data in the intervention phase were scores of 24, those would be hugely different (and nonoverlapping). More likely than not there are fluctuations day to day, and hence it might be useful to add a

descriptive aid to look at where means lie in relation to each other in light of that fluctuation, that is, a measure of variability.
A research practice that spans single-case and between-group research that is useful in this regard is to plot a measure of variability. In single-case designs, different options have been used, such as plus and minus one standard deviation above the mean or occasionally the range (highest and lowest score). One measure that is useful and common across many areas of research is to use error bars that reflect the standard error of the mean of that phase.4
Consider an example to illustrate how this is done and what we gain. Table 13.1 provides hypothetical data for the first two phases (AB) of an ABAB design. The baseline phase was 7 days, followed immediately by the intervention phase of 8 days. From these data, we can compute measures of variability that will be used in our graphs of the data. First we compute the means and standard deviations in each phase, as provided in the table. Second, from the standard deviation, we compute the standard error of the mean, by the formula provided. Error bars consist of the mean plus or minus one standard error.

Table 13.1 Hypothetical Data for AB (Baseline, Intervention) Phases in Which Percent or Frequency of Some Behavior Is Occurring. (Numbers in the Columns Are the Data for Days of Baseline and the Intervention.)

                             Baseline    Intervention
                                 16            16
                                 12            18
                                 13            19
                                 17            20
                                 14            18
                                 13            21
                                 14            21
                                               20

N days of observation             7             8
Mean score for the phase      14.14         19.13
Standard Deviation (s)         1.77          1.73
Standard Error (se)             .67           .61

Standard deviation: s = √[Σ(xᵢ − x̄)²/(N − 1)], where x̄ is the value of the mean, N is the sample size, and xᵢ represents each data value from i = 1 (the first) to i = N (the last). The Σ symbol indicates that the squared deviations are summed.
Standard error: se = s/√n, where s is the sample standard deviation and n is the size (number of items) of the sample.
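A minimal sketch, not from the original text, of these computations using the Table 13.1 values; the plotting choices are illustrative only.

import numpy as np
import matplotlib.pyplot as plt

baseline = np.array([16, 12, 13, 17, 14, 13, 14])
intervention = np.array([16, 18, 19, 20, 18, 21, 21, 20])

def phase_stats(x):
    mean = x.mean()
    sd = x.std(ddof=1)          # sample standard deviation (N - 1 in denominator)
    se = sd / np.sqrt(len(x))   # standard error of the mean
    return mean, sd, se

means, ses = [], []
for phase in (baseline, intervention):
    mean, sd, se = phase_stats(phase)
    means.append(mean)
    ses.append(se)

# Bar graph of phase means with error bars of plus/minus one standard error
plt.bar(["Baseline", "Intervention"], means, yerr=ses, capsize=6)
plt.ylabel("Mean frequency of behavior")
plt.show()

Running phase_stats on the two phases reproduces the table entries: means of 14.14 and 19.13, standard deviations of 1.77 and 1.73, and standard errors of .67 and .61.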
Figure 13.12 presents the data from both AB phases. The mean for each phase is represented by a horizontal line (in the upper panel) or by the height of the bar (in the lower panel). The results convey that performance increased in the intervention phase. In the lower panel (bar graph) I have added error bars that convey one standard error above and below the mean. The height of each bar reflects that one standard error in terms of the range of scores. This range is relatively narrow and adds useful information about fluctuation. We can see that the mean of one phase is not close to the range (one standard error above or below the mean) of the other phase. We cannot tell from the bar graph if the individual data points overlap very much from baseline to intervention phase (although we can see that they do not from the line graph above that). The narrow error bars suggest that data points within a phase are in close proximity of each other, that is, variability is not that large.

Figure 13.12. Hypothetical data from AB phases that are plotted as a line graph (upper panel) or bar graph (lower panel). Horizontal lines in the line graph represent the means; the bars in the bar graph represent means. The vertical lines above and below the means represent the error bars plus and minus one standard error around the mean.
How does this help us in any way? First, the range helps standardize our evaluations of variability. Data can vary, and visual inspection alone cannot capture that variability in a precise or standard way. Indeed, the appearance of variability in a graph


can be quite different depending on the range of the ordinate of the graph. If the scores range from 25 to 45, the scores will look like there is less variability if they are plotted on a graph that can range from 0 to 200, as compared with a graph that can range from 20 to 50. Error bars provide an objective and readily interpretable metric.
Second, for many readers there might be comfort in the use of error bars because they bridge visual inspection and statistical analyses. The reader trained in quantitative methods can immediately recognize that error bars are similar to confidence intervals that are used as part of statistical analyses to convey the likelihood that the true mean falls within a particular range.5

4 The standard error of the mean refers to the estimate of the standard deviation of a sampling distribution of means. The mean in any study (or the mean of the phase in the hypothetical example) is an estimate of the true mean in the population. Consider for a moment that we conduct the study in the hypothetical example many different times and assess performance under the conditions (e.g., baseline). Let us say we do this an infinite number of times. Each time we do, the 7 days of baseline draws from this larger pool of all the performances under baseline. In each study or on each occasion that we sampled 7 days of baseline, we would get a mean. These means form a sampling distribution of means, that is, each mean is a data point. The overall mean or the mean of these means would provide the "real" or population mean (μ). But not all the means that were sampled would be the same; they would vary a bit. The standard error of the mean is the standard deviation of the sampling distribution of the means and reflects how much sample means may be expected to depart from the population mean. In a single study we conduct, the standard error of the mean helps us to estimate, with some level of confidence, the likelihood that the population mean will fall within the range we present.

5 Confidence intervals (CIs) provide a range of values within which the true mean is likely to lie. Even though this is a range, it also includes the information that one obtains from a statistical test of significance, because z values used for significance testing (e.g., z score of 1.96 for p = .05, or 2.58 for p = .01) are used to form the upper and lower CI. The formula for computing CIs is:

CI = m ± zα(sm),

where
m = the mean score (e.g., for a given phase in a single-case design);
zα = the z score value (two-tailed) under the normal curve, depending on the confidence level (e.g., z = 1.96 and 2.58 for p = .05 and p = .01, respectively); and
sm = the standard error of the mean, i.e., the estimate of the standard deviation of a sampling distribution of means, or the standard deviation divided by the square root of N (sm = s/√N).
As the formula notes, to obtain these values, one multiplies the standard error of the mean (as used in computing error bars) by the z value (e.g., 1.96), adds this to the mean for the upper limit of the interval, and subtracts that same value from the mean for the lower limit of the interval. This is exactly the procedure used to compute error bars. The error bars were 1 standard error above and below the mean; with CIs we are using 1.96 (or 2.58) standard errors above and below the mean. These latter numbers are used because they reflect the p = .05 (or p = .01) thresholds when statistical tests (e.g., t or F) are used to define statistical significance.
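To connect the footnote's formula to the error-bar computation above, here is a tiny sketch, not from the original text, using the Table 13.1 baseline values:

import numpy as np

baseline = np.array([16, 12, 13, 17, 14, 13, 14])
m = baseline.mean()
sm = baseline.std(ddof=1) / np.sqrt(len(baseline))  # standard error of the mean

z = 1.96  # two-tailed z for p = .05 (use 2.58 for p = .01)
print(f"95% CI: {m - z * sm:.2f} to {m + z * sm:.2f}")

For these data the 95% confidence interval runs from roughly 12.83 to 15.46, a wider band than the one-standard-error bars shown in Figure 13.12.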



Presentation of the Data to Clarify Patterns
The previously discussed indices provide measures of variability (e.g., standard deviation, standard error, confidence interval) that provide a standardized way of communicating fluctuation within phases. There is another issue that can emerge in the data and in relation to graphing. There may be excessive variability in the data. There is no standard definition for "excessive variability." To convey the concept, consider the data within a phase to reflect scores that vary day to day and range from the lowest to the highest (e.g., 0 to 100% of the time) possible scores on the measure or at least to approximate use of the full range of available scores. For example, baseline performance might fluctuate from 10 to 90% of the intervals of some observed behavior, even though no effort was made to intervene. Invariably, the investigator wants to understand the sources of variability and control them if possible, a topic to which we return in the next chapter. However, there is a graphing option that is occasionally used to present the data when excessive variability cannot be controlled or no attempt has been made to control it. The graphing solution does not reduce variability per se but rather the appearance of variability in the data.
The appearance of day-to-day variability can be reduced by plotting the data in blocks of time rather than on a daily basis. For example, if data are collected every day, they need not be plotted on a daily basis. Data can be aggregated over consecutive days (blocks), and the average of each block can be plotted. By representing two or more days with a single averaged data point, the data appear more stable. Figure 13.13 presents hypothetical data in one phase that show day-to-day performance that is highly variable (upper panel). The same data appear in the middle panel in which the averages for 2-day blocks are plotted. The fluctuation in performance is greatly reduced in the middle panel, giving the appearance of much more stable data. Finally, in the bottom panel the data are aggregated into 5-day blocks. That is, performances for 5 consecutive days are averaged and plotted as single data points, and the appearance of variability is reduced even further.


In single-case research, consecutive data points can be aggregated in the fashion illustrated in the figure. In general, the larger the number of days included in a block, the lower the variability that will appear in the graph. Of course, once the size of the block is decided (e.g., 2 or 3 days), all data throughout the investigation ought to be plotted in this fashion so the data are treated in this transformed (averaged) way in a consistent fashion. It is important to note that the aggregating and averaging procedure only affects the appearance of variability in the data. When the appearance is altered,



changes in means, levels, and trends across phases may be easier to detect than when the original data are examined.

Figure 13.13. Hypothetical data for one phase of a single-case design. Upper panel shows data plotted on a daily basis. Middle panel shows the same data plotted in 2-day blocks. Lower panel shows the same data plotted in 5-day blocks. Together the figures show that the appearance of variability can be reduced by plotting data into blocks.
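A small sketch, not from the original text, of the blocking procedure illustrated in Figure 13.13; the daily values are randomly generated for the example, and the block size is a parameter one would choose in advance.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical highly variable daily data for one phase (20 days)
rng = np.random.default_rng(0)
daily = rng.integers(low=10, high=90, size=20)

def block_means(data, block_size):
    """Average consecutive observations into blocks of block_size days."""
    data = np.asarray(data, dtype=float)
    n_blocks = len(data) // block_size
    return data[: n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)

# The same data plotted daily, in 2-day blocks, and in 5-day blocks;
# only the appearance of variability changes, not the underlying data.
for size, label in [(1, "daily"), (2, "2-day blocks"), (5, "5-day blocks")]:
    values = block_means(daily, size)
    plt.plot(np.arange(1, len(values) + 1) * size, values, marker="o", label=label)

plt.xlabel("Days")
plt.ylabel("Percentage of intervals")
plt.legend()
plt.show()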
A few cautions are worth noting. First, aggregating data points into blocks reduces the number of data points in the graph for each of the phases. If 10 days of baseline are observed but plotted in blocks of 5 days, then only two data points (number of days/block size, or 10/5 = 2) will appear in baseline. Unless the data are quite stable, these few data points may not serve as a sufficient basis for predicting performance in subsequent phases. (But if the data were "quite stable" there might be no need to place them into 5-day blocks and average them.) Although blocking the data in the fashion described

reduces the number of data points, the resulting data are usually markedly more stable than the daily data. Thus, what one loses in number of points is compensated for by the stability of the data points based on averages obtained from aggregated days.
Second and related, aggregating days into blocks will reduce the number of data points and could undermine the key purposes of collecting continuous data. For example, one study reduced the frequent aggression of a male in a psychiatric hospital (Bisconer, Green, Mallon-Czajka, & Johnson, 2006). The patient met criteria for multiple psychiatric diagnoses, including many symptoms of psychoses. An intervention program provided praise and tangible rewards each day by psychiatric nurses on the ward for periods of appropriate and nonaggressive behavior. Figure 13.14 shows the results. Baseline lasted for 3 months, and the data were averaged over a 3-month block. That means only the first data point in the graph (#1) was baseline. By averaging in 3-month blocks we only have that data point. One data point is not a good basis for the main reasons we conduct assessment in a phase (describe, predict, and test). That is, from this one point we really cannot see if there are any trends, and we cannot really get a good picture to invoke the data-evaluation criteria. Was the program effective? We can say that there was a change in aggression toward oneself and others. Here is a case where the data are available but the averaging of them introduced ambiguity. It would be fine to plot baseline on a month-by-month basis (with three data points) and a couple of months of intervention in a similar way, and then to switch to 3-month blocks if one wished. Averaging led to one data point for baseline and leaves unclear key characteristics of the data.

Figure 13.14. Number of physically aggressive actions on the hospital ward during baseline (first data point) and over the course of treatment. Each data point (1–14) represents the average score of daily aggression averaged over a 3-month block. One baseline data point represents 3 months of observations averaged. (Source: Bisconer et al., 2006.)
Third, the actual data plotted in blocks can distort daily performance. Plotting data on a daily basis rather than in blocks is not inherently superior or more veridical. However, variability in the data evident in daily observations may represent a meaningful, important, or interesting characteristic of performance. Averaging hides this variability, which may be of interest in its


own right. For example, a hyperactive child in a classroom situation may show marked differences in how he or she performs from day to day. On some days the child may show very high levels of activity and inappropriate behavior, while on other days his or her behavior may be no different from that of peers functioning well in class. The variability in behavior may be important in its own right or important to alter. The overall activity of the child but also the marked inconsistency (variability) over days represent characteristics that may have implications for designing treatments. For example, there might be influences that could be identified on the days of normative rather than hyperactive behavior, and these influences might be harnessed to help the child.

Changes in Level
Another source of information on which visual inspection often relies is changes in level across phases, that is, the discontinuity or shift in the data at each point that the experimental condition is changed (e.g., change from A to B or from B to A phases). Typically this change refers to the difference in the last day of one phase and the first day of the next. No special technique is needed to describe this change. (One technique to describe the changes in level in ratio form has been devised as part of the split-middle technique of estimating trends and is mentioned in the next section.)
Of course, the investigator may be interested in going beyond merely describing changes in level. The issue is not whether there is simply a shift in performance from the last day of one phase to the first day of the next. Performance normally varies on a daily basis, so it is unlikely that performance will be at the same level two days in a row (unless the behavior never occurs). When conditions are changed, the major interest is whether the change in level is beyond what would be expected from ordinary fluctuations in performance. That is, is the shift in performance large enough to depart from what would be expected given the usual variability in performance?
The evaluation of the change in level is different from the description of the change. First, the measure of variability discussed previously can provide useful information. One can determine if performance is within or outside of the plotted range of variability from the prior phase. There are no fixed decision-making guidelines, but viewing the shift in relation to normal variability (e.g., plus or minus one or two standard deviations) is helpful in making the judgment about a change in level when that change is not obvious.

Changes in Trend
Inferring a change of trend does not have to rely on merely looking at the data within a phase. That is unsystematic, to say the least, because of the task, namely, looking at the data points and imagining what line (vector) best represents the angle of the slope. Again, when trends are clear, flat (zero slope) in baseline, or sharply accelerating and decelerating lines across phases, one may need no special aid. However, there are options, some of which are easily implemented, to plot the trend within each phase. The trend line computed in some standardized fashion is a much more defensible procedure for addressing the logic of the designs (describe, predict, test). Trend lines allow one to see the extent to which trends have changed across phases.
A fairly easy visual aid is to use a spreadsheet that allows one to move from a database where the numbers are entered for each phase to a graph (e.g., Carr & Burkholder,


1998). For example, if one has an ABAB design, the daily number is entered for whatever behavior or measure (e.g., percent of intervals or frequency). The data are then plotted in a simple line graph. Within the database program, one can shade (click on, mark, highlight) the data within the phase and click on compute trend line or regression line. Figure 13.15 presents hypothetical data for the first two phases of an ABAB design. A trend line was computed separately for each phase that easily allows the investigator and reader to see changes in trend. There are many options, but the simplest is a linear (least squares) regression that is readily available in database programs. A linear regression line is fit to a series of data points such as those plotted in Figure 13.15. A regression line is drawn through the individual data points that comes as close as possible to the points. A separate line is drawn for the data in baseline and intervention phases for each phase of the design. As noted in the figure, the trend within each phase can be readily seen and the trends across phases can be compared.6

Figure 13.15. Hypothetical data during which a trend line was computed for data within two phases (A, B). In this example, the data were entered in Microsoft Excel™; a line graph was selected as the option to chart the data, and then a computed trend line was added to give the linear slope that best represented the data within each phase.

6 In the appendix at the end of the book, problems with linear regression will be raised. Briefly, a characteristic of single-case data (called serial dependence) is that data points and their error terms from one day to the next can be correlated (called autocorrelation). This characteristic has rather significant implications for visual inspection and statistical evaluation of single-case data. Linear regression can ignore complex relations within the data and assume that those relations are not present.
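For readers who prefer a script to a spreadsheet, here is a minimal sketch, not from the original text, of fitting a separate least-squares trend line to each phase; the data are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical AB data
baseline = np.array([12, 14, 13, 15, 14, 13, 15])
intervention = np.array([16, 17, 19, 18, 20, 22, 21, 23])

def plot_phase_with_trend(days, values):
    plt.plot(days, values, marker="o")
    # Least-squares linear fit (slope, intercept) within the phase only
    slope, intercept = np.polyfit(days, values, deg=1)
    plt.plot(days, slope * days + intercept, linestyle="--")

days_a = np.arange(1, len(baseline) + 1)
days_b = np.arange(len(baseline) + 1, len(baseline) + len(intervention) + 1)

plot_phase_with_trend(days_a, baseline)
plot_phase_with_trend(days_b, intervention)
plt.axvline(len(baseline) + 0.5)  # marks the phase change
plt.xlabel("Days")
plt.ylabel("Frequency of behavior")
plt.show()

Fitting each phase separately, rather than one line through all the data, is what lets the reader compare slopes across the phase change.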

An example to show what a trend line can add is provided in a study that focused on three women (28, 43, and 44 years old, ethnicity not noted) (Billette, Guay, & Marchand, 2008). How much they were disturbed by their symptoms was one of the measures and was rated from 0 (not at all) to 100 (extremely). The treatment was cognitive behavior therapy (CBT) with 22 to 27 individually administered sessions, each lasting from 60 to 90 minutes. The treatment included many components (e.g., psychoeducation about trauma and sexual assault, anxiety management techniques, cognitive restructuring, and others beyond the scope of the illustration). A few sessions were provided to the spouse to increase support and involve the husband more in the treatment.
Figure 13.16 conveys the results for the measure I mentioned. The first phase was baseline, the second phase was CBT. The graph conveys that the intervention was introduced in a multiple-baseline design across the three women. After the last person was treated, there was a 3-month follow-up assessment for all individuals. The individual day-to-day data show fluctuation. Not shown in the figure, the means changed for each participant. For example, the means for baseline, intervention, and follow-up phases for Participant 1 were 54.5, 19.6, and 9.5, respectively. A regression slope (trend line) was used as a visual aid to characterize the trends within each phase for each participant. The lines help visually with the presentation because of the marked day-to-day variability. We see that the slope during the intervention phase was different from the slope in baseline. The slopes at follow-up are not too informative, largely because the symptoms were clearly low. The trend lines are very helpful. Also, by the nature of the intervention (multiple sessions) one would expect a gradual change over time rather than an abrupt change when some incentive is altered on a more malleable problem.

Figure 13.16. Level of disturbance related to PTSD symptoms reported by participants in their daily self-monitoring at baseline, during treatment, and at follow-up (separate panels for Participants 1, 2, and 3). The intervention (cognitive behavior therapy sessions) was introduced in a multiple-baseline design across the three women. (Source: Billette, Guay, & Marchand, 2008.)
There are many techniques to evaluate trends in the data. For example, one technique to describe trends in single-case research is the split-middle technique (White, 1972, 1974). This technique permits examination of the trend within each phase and allows comparison of trends across phases. The method has been developed in the context of assessing rate of behavior (frequency/time). Another approach is to use time-series analyses, a statistical data-evaluation method to incorporate trends in the data in deciding whether there is any change across phases (Box, Jenkins, & Reinsel, 1994) (see appendix for discussion and illustration). In the case of graphing data in single-case research, trend within each phase is infrequently plotted and no one method has dominated the few instances in which such information is provided.
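As a rough illustration, not from the original text and simplified relative to White's full procedure, the core split-middle idea can be sketched as follows: split the phase into halves, find the median day and median value of each half, and connect those two median points.

import numpy as np

def split_middle_line(values):
    """Return slope and intercept of a simplified split-middle trend line.

    The phase is split into two halves; for each half we take the median
    day and the median value, then connect the two median points with a
    straight line.
    """
    values = np.asarray(values, dtype=float)
    days = np.arange(1, len(values) + 1)
    half = len(values) // 2

    x1, y1 = np.median(days[:half]), np.median(values[:half])
    x2, y2 = np.median(days[-half:]), np.median(values[-half:])

    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

slope, intercept = split_middle_line([12, 15, 13, 16, 14, 18, 17, 19])
print(f"trend: y = {slope:.2f} * day + {intercept:.2f}")

Because medians rather than means anchor the line, a single extreme day has little influence on the estimated trend.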

Rapidity of Change
Another criterion for visual inspection discussed earlier refers to the latency between the change in experimental conditions and a change in performance. Relatively rapid changes in performance after the intervention is applied or withdrawn contribute to the decision, based on visual inspection, that the intervention probably led to the change.


One of the difficulties in specifying rapidity of change as a descriptive characteristic of the data pertains to defining a change. Behavior usually changes from one day to the next. But this fluctuation represents ordinary variability. At what point can the change be confidently identified as a departure from this ordinary variability?
When experimental conditions are altered, it may be difficult to define objectively the point or points at which changes in performance are evident. Without an agreed-upon criterion, the points that define change may be quite subjective. A plot of variability of performance provides a guideline. One can see at the point of intervention if the amount of change exceeds the variability range for baseline data. That is, one can look at the error bar of variability above and below the mean of baseline to see if the first few days of the intervention phase are outside of that range. Again, there are no objective criteria to make the decision by visual inspection.
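A minimal sketch, not from the original text, of the kind of check just described: compute the baseline mean and a variability band, then ask whether the first few intervention days fall outside it. The band of one standard error is an illustrative assumption.

import numpy as np

baseline = np.array([16, 12, 13, 17, 14, 13, 14])
intervention = np.array([16, 18, 19, 20, 18, 21, 21, 20])

mean = baseline.mean()
se = baseline.std(ddof=1) / np.sqrt(len(baseline))
lower, upper = mean - se, mean + se

# Do the first few intervention days fall outside the baseline band?
first_days = intervention[:3]
outside = (first_days < lower) | (first_days > upper)
print(f"baseline band: {lower:.2f} to {upper:.2f}")
print(f"first intervention days outside the band: {outside}")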
Rapidity of change is a difficult notion to specify because it is a joint function of changes in level and slope. A marked change in level and in slope usually reflects a rapid change. For example, baseline may show a stable rate and no trend. The onset of the intervention may show a shift in level of 50 percentage points and a steep accelerating trend indicating that the change has occurred quickly and the rate of behavior change from day to day is marked.

General Comments
There is no standard practice or rule about what aids to provide in presenting data. Indeed, sometimes practices in single-case as well as between-group designs are dictated by publication (e.g., journal) policies rather than by methodological principles. However, it is useful to consider the overall goal, namely, to foster and facilitate comprehension of the results by the audience. So the question for the investigator is what information can be provided to the audience(s) to enhance that goal. An investigator is part of the audience and would want to know as many facets as available to understand nuances within a given subject and among many subjects. Other consumers (one's peers, lay consumers) are likely to be less interested in nuances. One guide is to consider: what will aid application of the criteria for visual inspection? What will aid deciding whether the changes were reliable (e.g., by visual inspection or statistical criteria)? There are often different audiences to consider (other researchers, teachers), and different decisions might be made about what to present to each one. People in everyday life do not talk about the criteria for visual inspection very much (although you should have been at last year's Methodologists' New Year's Eve bash in which this discussion got very heated). People in everyday life want to know, did the intervention work; did performance improve?
Additional information can err on the side of too much or of information that is not needed. For example, in one of the examples (Figure 13.10), means were plotted across phases. No harm was done, but the results were so clear that perhaps they were not needed there. In the next example (Figure 13.11), plotting the means was helpful because the changes and visual inspection criteria were not so clear. In short, consider the goal to aid application of the data-evaluation criteria. More generally, consider what will help the consumer of the evaluation understand the results, which of course may vary by who the consumers are. I have highlighted some of the basic aids that can enhance inspection of the day-to-day data points.

SUMMARY AND CONCLUSIONS
This chapter has discussed basic options for graphing data to facilitate application of visual inspection. Simple line graphs, cumulative graphs, and bar graphs were discussed briefly. Virtually all of the graphs in single-case research derive from these three types or their combinations. Among the available options and combinations, the simple line graph is the most commonly reported.
As noted in the earlier discussion of data evaluation (Chapter 12), visual inspection is more than simply looking at plotted data and arbitrarily deciding whether the data reflect a reliable effect. Several characteristics of the data should be examined, including changes in means, levels, and trends, and the rapidity of changes. Selected descriptive aids are available that can be incorporated into simple graphing procedures to facilitate examination of some of these data characteristics. The chapter has discussed plotting means, computing ratios to express changes in level, and plotting trends as some of the aids to facilitate visual inspection.
This chapter and the prior one elaborate data-evaluation methods. The statistical aids in this chapter included those that describe the data. Statistical evaluation of course goes beyond description. Significance testing is used to draw inferences about the reliability (experimental criterion) of an effect or change. Statistical analyses are covered in the appendix.
CHAPTER 14

Evaluation of Single-Case Designs: Challenges, Limitations, and Directions

CHAPTER OUTLINE

Common Methodological Problems and Obstacles
Trends in the Data
Variability
Duration of the Phases
Criteria for Shifting Phases
General Issues and Limitations
Range of Intervention Outcome Questions
Evaluating Intervention Packages
Evaluating Intervention Components
Comparing Different Interventions
Studying Variables that Influence the Impact of the Intervention
Studying Mediators and Mechanisms: Largely Unexplored
General Comments
Generality of the Findings
Single-Case Research
Between-Group Research
Replication
General Comments
Directions and Opportunities on the Horizon
Randomization: Motherhood and Apple Pie of Methodology
Integration of Research Findings
Summary and Conclusions

Previous chapters have discussed issues and potential problems associated with specific single-case designs. This chapter discusses more general issues, concerns, and limitations that transcend individual designs, including trend and variability in the data, duration of phases, and when to shift phases. Other issues that are covered include the range of research questions that single-case designs can and cannot address, generality of results from the designs, and replication of effects. Although this chapter and the next emphasize issues in relation to single-case research, some of these can be addressed better when placed in the context of between-group research.

COMMON METHODOLOGICAL PROBLEMS AND OBSTACLES
Traditionally, research designs are preplanned so that most of the details about who receives the intervention and often the duration (or dose) of the intervention are decided before the subjects participate in the study. In single-case designs, many crucial decisions about the design are made only as the data are collected. Examples include deciding how long a baseline phase should be and when to present or withdraw the intervention. Inferences about the intervention depend on the data pattern within and across phases. The pattern of the data determines the decisions made within the design.
Each single-case design usually begins with a baseline phase followed by the intervention phase. The intervention is evaluated by comparing performance across phases. For these comparisons to be made easily, the investigator has to be sure that the changes from one phase to another are likely to be due to the intervention rather than to a continuation of an existing trend or to chance fluctuations (high or low points) in the data. A fundamental design issue is deciding when to change phases so as to maximize the clarity of data interpretation.
There are no widely agreed upon rules for altering phases. However, there is general agreement that the point at which the conditions are changed in the design is extremely important because subsequent evaluation of intervention effects depends on how clear the changes are across phases. The usual rule of thumb is to alter conditions (phases) only when the data are stable. As noted earlier, stability refers to the absence of trend and relatively small variability in performance. Trends and excessive variability during any of the phases, particularly during baseline, can interfere with evaluating intervention effects. Although both trend and variability were discussed earlier, it is important to build on that earlier discussion and address problems that may arise and various solutions that can facilitate drawing inferences about intervention effects.

Trends in the Data


As noted earlier, draw ing inferences about intervention effects is greatly facilitated
when baseline levels show no trend (slope) or a trend in the direction opposite from
that predicted by the intervention. A problem m ay emerge, at least from the standpoint
o f the design, when baseline data show a trend in the sam e direction as expected to
result from the intervention. Changes in level and trend are more difficult to detect
during the intervention phase if performance is already im proving during baseline.
It may be possible to continue baseline for a little longer to see if the data pattern
will stabilize and whether a seeming trend (e.g., increase in the behavior 2 days in a
row) really is a trend. O f course, in many applied settings, there are external pressures
to begin the intervention as soon as possible, and m ore baseline days are a luxury. It
is one matter for an investigator to understand what baseline is trying to accomplish,
one purpose of this book, but quite another to convey the message and evoke sympathy from a client or a client's family experiencing a need for an intervention or from an
administrator whose programs, funding, and future are on the line to intervene now.
Also, the problem (e.g., drinking alcohol in a college dorm itory, dangerous behavior
in a classroom) may require immediate intervention. Behavior may require interven-
tion even though some improvements are occurring. If prolonged baselines cannot be
invoked to wait for stable data, other options are available.
First, the intervention can be implemented even though there is a trend toward
improved performance during baseline. I f an A B A B design, or in this case a B A B A
design, is possible in light o f ethical or practical issues, the A phase can consist o f an
active intervention that is designed not to return to baseline but to actively change the
direction o f the behavior and the trend. The design begins with an intervention (B)
and then during the A phase som e intervention is used to change the direction o f the
behavior opposite from that of the direction of the intervention phase. For example,
in a pioneering application o f behavioral techniques, staff on a ward o f a psychiatric
hospital unwittingly provided attention (social reinforcement) for "irrational talk" of a patient with psychoses (Ayllon & Haughton, 1964). The patient would comment on
delusions or make other statements that had no clear external referents. The staff, try -
ing to be helpful, com forting, and sympathetic, would provide attention to the patient.
The investigators raised the question, could staff attention actually influence such
statements, or were the statements merely uncontrollable expressions o f the psychiatric
disorder?
After baseline observations o f patient com ments, the intervention phase began in
which attention from the staff was used to reinforce rational com ments. W henever the
patient spoke more normally, the staff provided immediate attention. Any irrational
com ments were ignored. A reversal phase was introduced. T he investigators could have
m erely withdrawn all attention during the reversal phase. Instead, during the reversal
phase attention was provided for all com m ents that were not rational, that is, behavior
other than the focus during the intervention phase. D uring the reversal phase, rein-
forcement was given for all behavior except the original one that was targeted (referred
to as differential reinforcement o f other behavior). This has the advantage o f quickly
reversing the direction (trend) o f perform ance, as it did in this example. Hence, across
an A B A B design, for example, the effects o f the intervention on behavior are likely to
be readily apparent. In general, other procedures that will alter the direction o f perfor-
mance can help reduce ambiguities caused by initial baseline performance that shows a
trend in a therapeutic direction. O f course, this design option may be methodologically
sound but clinically untenable because it includes specific provisions for making the
client’s behavior worse.
A second alternative for reducing the am biguity that initial trends in the data may
present is to select design options in which such a trend in a therapeutic direction will
have little or no impact on drawing conclusions about intervention effects. For exam -
ple, a multiple-baseline design is usually not impeded by initial trends in baseline. It is
unlikely that all o f the baselines (behaviors, persons, or behaviors in different situations)
will show a trend in a therapeutic direction. The intervention can be invoked for those
behaviors that are relatively stable while baseline conditions are continued for other
behaviors in which trends appear. If the need exists to intervene for the behaviors that
do show an initial trend, this too is unlikely to interfere with drawing inferences about
intervention effects. Conclusions about intervention effects are reached on the basis o f
the pattern o f data across all o f the behaviors or baselines in the multiple-baseline design.
Ambiguity of the changes across one or two of the baselines may not necessarily impede
drawing an overall conclusion, depending on the number o f baselines, the magnitude o f
intervention effects, and how other criteria for visual inspection are met.
Similarly, drawing inferences about intervention effects is usually not threatened by
an initial baseline trend in a therapeutic direction in an alternating-treatm ents design.
In this design, conclusions are reached on the basis o f the effects o f different conditions
usually implemented in the same phase. T he differential effects o f alternative inter-
ventions can be detected even though there may be an overall trend in the data. The
main question is whether differences between or am ong the alternative interventions
occur, and this need not be interfered with by an overall trend in the data. If one o f
the conditions included in an intervention phase o f an alternating-treatm ents design
is a continuation o f baseline, the investigator can directly assess whether the inter-
ventions surpass performance obtained concurrently under the continued baseline
conditions. A trend need not impede conclusions about the intervention that is evaluated in a changing-criterion design. This design depends on evaluating whether the performance matches a changing criterion in a step-like fashion. Even if performance improves during baseline, control exerted by the intervention can still be evaluated by comparing the criterion level with performance throughout the design, and if necessary by using bidirectional changes in the criteria, as discussed in an earlier chapter.
Another option for handling initial trend in baseline is to utilize statistical techniques to evaluate the effects of the intervention relative to baseline performance. Techniques that describe and plot initial baseline trends, such as computing trend lines as discussed in the previous chapter on graphing, can help visually examine whether an initial trend in baseline is similar to trends during the intervention phase(s). In addition, statistical techniques (e.g., time-series analysis) can assess whether the intervention has made reliable changes over and above what would be expected from a continuation of initial trend. This is a more sophisticated solution that solves a variety of problems, including identifying trends that visual inspection cannot detect (please see the appendix).
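As a rough illustration of the trend-projection idea (not the formal time-series methods treated in the appendix), the following sketch fits a least-squares trend line to hypothetical baseline data and projects it into the intervention phase; the data and variable names are invented for illustration only:

```python
import numpy as np

# Hypothetical daily frequencies of a target behavior (invented data).
baseline = np.array([12, 13, 13, 15, 14, 16, 16])   # days 1-7
intervention = np.array([18, 21, 24, 26, 27, 29])   # days 8-13

# Fit a least-squares trend line to the baseline phase.
base_days = np.arange(1, len(baseline) + 1)
slope, intercept = np.polyfit(base_days, baseline, deg=1)

# Project that baseline trend forward into the intervention phase.
int_days = np.arange(len(baseline) + 1, len(baseline) + len(intervention) + 1)
projected = slope * int_days + intercept

# Departures of observed intervention data from the projected trend.
print(f"Baseline slope: {slope:.2f} units per day")
print("Departures from projected trend:", np.round(intervention - projected, 1))
```

Consistently large departures in the predicted direction make it less plausible that the intervention-phase data are a mere continuation of the baseline trend; a formal time-series analysis would additionally model serial dependence among the observations.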
In general, an initial trend during baseline may not necessarily interfere with d raw -
ing inferences about the intervention. Various design options and data-evaluation tech-
niques can be used to reduce or eliminate ambiguity about intervention effects. It is
crucial for the investigator to have in mind one o f the alternatives for reducing am bigu -
ity if an initial trend is evident in baseline. Without taking explicit steps in altering the
design or applying special data-evaluation techniques, trend in a therapeutic direction
during baseline or return-to-baseline phases may compete with obtaining clear effects.

Variability
Evaluation o f intervention effects is facilitated by having relatively little variability in
the data in a given phase and across all phases. The larger the daily fluctuations, the
larger the behavior change required to dem onstrate an effect. Large fluctuations in the
data do not always make evaluation o f the intervention difficult. For example, som e-
times baseline perform ance may show large fluctuations about (around) the mean
value. When the intervention is implemented, not only may the mean perform ance
change, but variability m ay becom e markedly less as well. Hence, the intervention effect
is very clear, because both change in means and a reduction in variability occurred. T he
difficulties arise prim arily when baseline and intervention conditions both show rel-
atively large fluctuations in performance. In the previous chapter I discussed a grap h -
ing option about reducing the appearance o f excessive variability by blocking the data
into groups o f sessions and plotting the mean o f each block. T he present discussion
addresses the more fundamental task o f understanding and controlling or reducing
excessive variability.
Variability in the data is a property of ordinary behavior in virtually all settings. Presumably variability is due to all sorts of factors that affect our functioning. Where there is excessive variability that may interfere with drawing inferences or obtaining a clear pattern in the data, we then try to identify some of these factors. Excessive variability in the data indicates absence of experimental control over the behavior and lack of understanding of the factors that contribute to performance (Sidman, 1960).
When baseline perform ance appears highly variable, several factors may be iden-
tified that contribute to variability. First, the client may be performing relatively consistently, that is, shows little variability in performance, although this is not accurately
reflected in the data. One factor that might hide consistency is the manner in which
observations are conducted. Observers may introduce variability in perform ance to the
extent that they score inconsistently or depart (drift) from the original definitions o f
behavior. Careful checks on interobserver agreement and periodic retraining sessions
m ay help reduce observer deviations from the intended procedures.
Second, the conditions under which observations are obtained may contribute to
and increase variability in perform ance. Excessive variability m ay suggest that greater
standardization is needed over the conditions in which the observations are obtained.
Client performance may vary as a function of the persons present in the situation, the
time o f day in which observations are obtained, and events preceding the observation
period or events anticipated after the observation period. These other influences are
“ interventions” that may vary daily (e.g., another child in class who is very disruptive,
regular assignments on som e fixed days o f the week that generate or control variability
in perform ance, activities on som e days that precede or follow recess). Normally, such
factors that naturally vary from day to day can be ignored and baseline observations
m ay still show relatively low variability. On the other hand, when variability is exces-
sive, the investigator may wish to identify or attempt to identify features of the setting
that can be standardized further.
Standardization am ounts to making the day-to-day situation more homogeneous,
which is likely to decrease factors that influence variability. An occasional objection to
this recommendation is that we want to see behavior under “ realistic” circum stances
and so the variability and the natural environment ought not to be altered or m anaged.
T his concern is questionable in all sorts o f contexts (e.g., training a musician to play a
piece, teaching a pilot to learn how to fly, and developing skills in an athlete). In each
context, perform ance and practice under controlled conditions are part o f the early
learning process. Our goal is to foster the desired behaviors in the natural environm ent,
but that goal does not require introducing all o f the conditions (e.g., for a m usician—
the audience, the pressure, the full musical piece rather than sections) early in observa-
tions and training. The initial goal is to assess behavior and then see i f we can change
it. Extending the newly acquired behavior to other situations where m ore factors are
allowed to vary can follow the initial dem onstration (see Kazdin, 2001). If variability is
excessive, look to see if the environment includes som e influences that m ight be con -
tributing. Obviously, some factors that vary on a daily basis (e.g., client’s diet, weather)
may be less easily controlled than others (e.g., presence of peers in the same room, use
o f the sam e or sim ilar activities while the client is being observed).
For whatever reason, behavior may simply be quite variable even after the just-mentioned procedures have been explored. Indeed, the goal of an intervention program may be to alter the variability of the client's performance (i.e., make performance more consistent by reinforcing behavior within a range, as discussed in the chapter on changing-criterion designs), rather than or in addition to changing the mean rate.
Variability may remain relatively large, and the need to intervene cannot be postponed
to identify contributory sources. In such cases, the investigator m ay use aids such as
plotting data into blocks, graphing mean perform ance in each o f the phases, and ad d-
ing trend lines within phases to help clarify the pattern and facilitate data evaluation, as
mentioned in the previous chapter.
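As a minimal sketch of the blocking aid just mentioned (hypothetical data; the block size is an arbitrary analyst choice), consecutive sessions are averaged before plotting:

```python
import numpy as np

def block_means(data, block_size):
    """Average consecutive observations into blocks of block_size,
    dropping any incomplete final block."""
    data = np.asarray(data, dtype=float)
    n_blocks = len(data) // block_size
    return data[: n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)

sessions = [22, 9, 18, 11, 25, 8, 20, 13, 24, 10]   # highly variable raw data
print(block_means(sessions, block_size=2))          # [15.5 14.5 16.5 16.5 17. ]
```

The blocked means reveal a fairly steady level underneath large session-to-session swings; the trade-off is the loss of day-to-day detail that blocking entails.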
Whether the variability will interfere with evaluation of the intervention effects is determined in part by the type of changes produced by the intervention and the sensitivity of the measure to reflecting such changes (e.g., no ceiling or floor effects). Marked changes in performance may be very clear because of simultaneous changes in the mean, level, and trend across phases and fast latency of change. So the extent to which variability interferes with drawing inferences is a function of the magnitude and type of change produced by the intervention. The main point is that with relatively large variability, stronger intervention effects are needed to infer that a systematic change has occurred.

Duration of the Phases


A n important issue in single-case research is deciding how long the phases will be over
the course o f the design. The duration o f the phases usually is not specified in advance
o f the investigation. The reason is that the investigator exam ines the data and d eter-
mines whether the information is sufficiently clear to make predictions about perfo r-
mance. T he presence or suggestion o f trends o r excessive variability during the baseline
phase or tentative, weak, or delayed treatment effects during the intervention phase
may require more prolonged phases.
A com m on methodological problem is altering phases before a clear pattern
emerges. For example, most o f the data may indicate a clear pattern for the baseline
phase. Yet, after a few days o f relatively stable baseline perform ance, one o r two data
points may be higher or lower than all o f the previous data. T he questions that im m e-
diately arise are whether a trend is em erging in baseline, whether one o f the threats
to validity (e.g., history) could explain the data, or whether the data points are merely
part o f random (unsystematic) variability. It is prudent to continue the condition with -
out shifting phases. If one or two more days o f data reveal that there is no trend, the
intervention can be implemented as planned. The few “extra” data points provide
increased confidence that there was no em erging trend and can greatly facilitate subse-
quent evaluation o f the intervention.
Occasionally, an investigator may obtain an extreme data point during baseline in
the opposite direction o f the change anticipated with the intervention. This extrem e
point may be interpreted as suggesting that if there is any trend, it is in the opposite
direction of intervention effects. Investigators may shift phases when an extreme point
is noted in the previous phase in the direction opposite from the predicted effects o f
the phase. Yet an extreme score in one direction is likely to be followed by a score that
reverts in the direction o f the mean, a characteristic known as statistical regression (see
Chapter 2).
It is important to be alert to the possibility o f regression. I f one single extrem e
score occurs, it may be unwise to shift phases. Such a shift might capitalize on regres-
sion. This im m ediate “ im provem ent” in perform ance might be interpreted to be the
result o f shifting from one condition to another (change in level) when in fact it might
be accounted for by regression. A s data continue to be collected in the new phase, the
investigator could, o f course, see i f the intervention is having an effect on behavior. Yet,
if changes in level or means are exam ined across phases, shifting phases at points o f
extreme scores could systematically bias the conclusions.
In general, phases in single-case experim ental designs ought to be continued until
data patterns are relatively clear. This does not always mean that phases are long. For
example, in som e cases, return-to-baseline or reversal phases in A B A B designs m ay be
very brief, such as only 1 or 2 days or sessions (e.g., Kodak, Grow, & Northrup, 2004;
Stricker, Miltenberger, Garlinghouse, Deaver, & Anderson, 2001). The brevity of each
phase is determ ined in part by the clarity o f the data within that phase and in relation
to adjacent phases.
Suggesting a requisite num ber o f data points is useful as a practical guideline. As
a minimum, 3 to 5 days is probably useful as a general rule. However, it is much more
important to convey the rationale underlying the recomm endation, namely, to pro -
vide a clear basis for predicting and testing predictions about performance. A simple
rule has m any problems. For one, it is likely that some phases require longer d u ra-
tions than others. For example, it is usually important to have the initial baseline o f a
slightly longer duration than return-to-baseline phases in A B A B designs. T he initial
baseline o f any design provides the first information about trends and variability in the
data and serves uniquely as an important point o f reference for all subsequent phases.
Consequently, this is not a phase to be rushed.
All that said, it is always im portant to bear in mind the goals o f any m ethodolog-
ical practice rather than rigid practices. Thus, it is not only conceivable that there will
be brief baseline phases, but there are examples where they m ake sense. For exam ple,
in a program designed to teach bicycle riding to a 9-year-old boy with Asperger's syndrome, data were collected on various steps that entailed bicycle riding (Cameron, Shapiro, & Ainsleigh, 2005). The initial baseline phase consisted of one session that
confirm ed that the child could not ride a bicycle (and his score was zero). Also, the
child reacted with an extrem e em otional response to the task in baseline. The data
showed a clear pattern o f developing bike riding in a changing-criterion design, and
the one-session baseline was quite fine. Yes, two data points are much better than
one (because from two one can see initial variability) and three is much better than two (to help infer a trend), but when behavior is at zero and that confirms the view of all parties (e.g., parents or teachers saying that "the child cannot do this, has never done this, and never will unless you do something pretty soon"), one day is fine. More
commonly, multiple-baseline designs are likely to have short baseline phases (e.g., one or a few sessions) because the strength of a demonstration does not depend on any single phase, and the short baseline of one subject (or behavior or setting) is offset by the longer baselines of the subjects yet to receive the intervention. More generally, rules about the duration of experimental phases in single-case research are difficult to specify and, when specified, are often difficult to justify without great qualification.
Aside from the duration o f individual phases, investigators occasionally ask
whether phases in a study ought to be o f equal or approxim ately equal duration w ithin
a given investigation (e.g., each phase 10 days). Hopefully, the reader at this point will
answer with an unequivocal "no." The rationale for posing phases of equal duration is sound, so before justifying "no" the reasons are important to acknowledge. First, posing phases of equal duration occasionally is based on the view that in a given period of time (e.g., a week or month), maturational or cyclical influences may lead to a certain pattern of performance. For example, if the setting (e.g., classroom, business activities at a company) has a fixed routine for each day or each week, perhaps one would want to be sure that each phase included the full routine so that extraneous events may be roughly constant or equal in each phase and do not bias the data in one phase. Essentially, the investigator may want to be sure that conditions (the routines) are relatively constant across baseline and intervention phases. This is reasonable.
Second, the investigator may be planning on the use of statistical analyses to evaluate the results. Statistical analyses often are more powerful (able to detect differences
where there are differences) when there are several observations and an equal n u m -
ber o f observations. Thus, this would be a reason to plan o n phases o f equal duration.
Finally, communicating the program to others is facilitated by satisfying the legitimate query, how long will the baseline and treatment phases last? Everyone's anxiety is allayed when one can give a clear answer the way one usually can in a pre-arranged and planned between-group study.
Although phases o f equal or nearly equal duration m ight be convenient for all o f
the preceding reasons, try to resist whenever possible. The reason is fundam ental and
pertains to the logic of single-case designs. We are interested in obtaining data that help with the describe, predict, and test prediction features of the design. An equal number of days in each phase is not elegant or prized and could even interfere. Phases of equal duration do not necessarily strengthen the design. In fact, if duration is given primacy as a consideration, ambiguity may be introduced by altering or waiting to alter conditions merely because a criterion number of days has or has not elapsed.
Typically, the duration o f the phases is determined by judgm ent on the part o f the
investigator based on his or her view that a clear data pattern is evident. O f course,
practical considerations often operate as well (e.g., end o f the school year) that place
constraints on durations o f the phases. From the standpoint o f the design, the pattern
o f the data should dictate decisions to alter the phases. That pattern is judged by how
it may serve our overall goal o f obtaining data in the phases that adequately describe
performance, predict perform ance in the immediate future if conditions were not to
change, and test the prediction o f a previous phase.

Criteria for Shifting Phases


Currently, no agreed-upon objective decision rules exist for altering phases. The du ra-
tion o f phases depends on having stable data. Typically, stability o f performance in a
particular phase can be defined by two characteristics o f the data, namely, trend and
variability. A criterion or decision rule for shifting phases usually needs to take into
account these parameters. One criterion is to define stability o f the data in a given
phase in terms o f a num ber o f consecutive sessions or days that fall within a prespeci-
fied range of the mean, such as plus or minus 1 or 1½ standard deviations around the
mean. The m ethod can ensure that data do not show a systematic increase or decrease
over time (trend) and fall within a particular range (variability). W hen the specified
criteria are met, the phase is terminated and the next condition can be presented. An
obvious obstacle is that one has to accum ulate som e observations within a phase to
have a reasonable estimate o f the standard deviation. With only a few observations in
the first few days o f a phase, both the m ean and the standard deviation may be chang-
ing markedly.
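A minimal sketch of the standard-deviation band criterion just described follows (invented data; the window, band width, and slope tolerance are assumptions an investigator would set in advance, and the slope check is one simple way to operationalize "no trend"):

```python
import numpy as np

def stable_by_sd_band(data, k=1.5, window=5, max_slope=0.25):
    """Return True if the last `window` observations fall within plus or
    minus k standard deviations of the phase mean and show essentially
    no trend (small fitted slope)."""
    data = np.asarray(data, dtype=float)
    if len(data) < window:
        return False  # too few observations to estimate the band
    mean, sd = data.mean(), data.std(ddof=1)
    recent = data[-window:]
    in_band = np.all(np.abs(recent - mean) <= k * sd)
    slope = np.polyfit(np.arange(window), recent, deg=1)[0]
    return bool(in_band and abs(slope) <= max_slope)

print(stable_by_sd_band([12, 14, 13, 14, 13, 14, 13]))  # True
```

Note that the function simply declines to judge stability until the phase has at least `window` observations, which reflects the obstacle noted above.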
Another criterion might be requiring that so many days in a row fall within a particular range. For example, in one investigation of an ABAB design, a change from one phase to the other was made if 3 consecutive days of data were obtained that did not depart more than 10 percent from the mean of all previous days of that phase (Wilson, Robertson, Herlong, & Haynes, 1979). To obtain the mean within a given phase, a cumulative average was continually obtained. That is, each successive day was added to all previous days of that phase to obtain a new mean. When 3 consecutive days fell
within 10% of that mean, the phase was changed. One could specify 2 days within 20%; there are no rules or guidelines here. The advantage is making the decision criteria explicit. That is always of value in science.
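The cumulative-mean rule can be made explicit in the same way. A sketch in the spirit of the Wilson et al. (1979) criterion follows (hypothetical data; the 3-day window and 10% band are the parameters the investigator would declare in advance):

```python
import numpy as np

def phase_shift_ready(data, window=3, tolerance=0.10):
    """Return True once each of the last `window` observations falls within
    `tolerance` (as a proportion) of the cumulative mean of all preceding
    days in the phase."""
    data = np.asarray(data, dtype=float)
    if len(data) <= window:
        return False
    for i in range(len(data) - window, len(data)):
        running_mean = data[:i].mean()   # mean of all days before day i
        if abs(data[i] - running_mean) > tolerance * running_mean:
            return False
    return True

print(phase_shift_ready([14, 16, 15, 15, 14, 15]))  # True: ready to shift
```

Checking the rule day by day this way mirrors the cumulative averaging in the original study: each new observation is compared against the mean of everything that preceded it.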
Specification o f criteria for deciding when to alter conditions (phases) is fine. If c ri-
teria are specified in advance, alteration o f conditions is less likely to take advantage o f
chance fluctuations in the data. In general, specifiable criteria will reduce the subjectiv-
ity o f decision m aking within the design. As important, most investigators are trained
in the between-group designs where design decisions are made in advance. Decisions
made in advance seem more im m une to bias than those m ade on the fly during the
study. Specifying criteria for shifting phases is designed to combat that potential bias.
However, specification o f criteria in advance has its risks. A few shifts in performance
during a given phase may cause the criteria not to be met. Behavior often oscillates,
that is, goes back and forth between particular values. It may be difficult in advance
of the baseline data to determine for a given subject what that range of oscillation or fluctuation will be. Waiting for the subject's performance to fall within a prespecified range may cause the investigator to "spend a lifetime" on the same study (Sidman, 1960, p. 260).
Problems m ay arise when multiple subjects are used. For example, in a multiple-
baseline design across subjects (or behaviors, or situations), the data from different
baselines may be quite different, and a single criterion for deciding when to change
phases may not be easily met. Moreover, waiting for all the methodological stars to
align (little variability, no trend, and no last-minute possibilities of a trend just starting to emerge) will raise practical obstacles and delay interventions in ways that are not feasible in most applied settings. We do not need the stars to align to draw valid inferences.
The purpose of specifying criteria is to have an objective definition of stability before shifting phases. But it is the stability of the data rather than meeting any particular prespecified criterion that is important. Stability is required to predict performance
in subsequent phases. T he prediction serves as a basis for detecting departures from
this prediction from one phase to the next. It is conceivable that a criterion for shifting
phases may not be met even though a reasonably clear pattern is evident that could serve as an adequate basis for predicting future performance. Stated more simply, specification of a criterion is a means toward an end, that is, defining stability, and not an end in itself. Data points may fall close to but not exactly within the criterion for shifting phases and progress through the investigation may be delayed. In the general case, and perhaps for applied settings in particular, it may be important to specify two or more criteria or rules for shifting phases within a given design so that if the data meet
one o f the criteria, the phase can be altered. A m ore flexible criterion or set o f criteria
may reduce the likelihood that a few data points could continually delay shifting o f the
phases. (You will know when you have not done this well if you say som ething like,
“I think only one more year o f baseline ought to do it.” )
The previous comments are not intended to argue against use o f objective criteria
for defining stability in the data or for changing phases. Most investigators using single-case designs have not invoked specific criteria for shifting phases or for duration of the phases. It is not clear if the benefits of having objective criteria are compensated for by delaying shifts in phases or meeting the describe, predict, and test functions of data within each phase. As a general strategy, it might be advisable to specify criteria that are flexible and leave open the possibility of abandoning them if the logic of the design is jeopardized by a data pattern that somehow interferes with invoking the criteria.

GENERAL ISSUES AND LIMITATIONS
The methodological issues discussed herein refer to considerations that arise while conducting individual single-case experiments. The methodology of single-case research
and its limitations can be examined from a more general perspective. The present d is-
cussion addresses major issues and limitations that apply to single-case experim ental
research.

Range of Intervention Outcome Questions


Single-case designs have been used in applied research primarily to evaluate the effectiveness of a variety of interventions. Typically, the interventions are designed to ameliorate a particular problem or to improve performance in the context of educational, clinical, community, and other applied settings. The focus is on the impact of the intervention on functioning of the individual or group. This type of research focuses on outcomes and is often called outcome research.
Several different types o f outcome questions can be delineated in applied research.
T he questions vary in terms of what they ask about a particular intervention (e.g., spe-
cial education, psychotherapy) and the impact that the intervention has on behavior or
other domains of functioning (e.g., academic performance, symptoms of anxiety).
Table 14.1 Strategies to Develop and Identify Effective Interventions

Intervention Strategy: Question Asked

Intervention Package Strategy: Does the intervention produce or lead to change?

Dismantling Strategy: What components are necessary, sufficient, and facilitative for change?

Parametric Strategy: What changes can be made in the specific treatment to increase its effectiveness?

Constructive Strategy: What components or other interventions can be added to enhance outcome?

Comparative Outcome Strategy: Which intervention is the more or most effective for a given problem and population?

Intervention-Moderator Strategy: Upon what subject, trainer, setting, context, or other characteristics does the effectiveness of the intervention depend?

Intervention-Mediator/Mechanism Strategy: What processes, mediators, or mechanisms explain how intervention effects are produced, i.e., why the change occurred?

Note. Intervention includes any program (treatment, educational, preventive, psychosocial, and medical) that is designed to produce change.

The different questions are addressed by various intervention evaluation strategies. Major strategies are listed briefly in Table 14.1. As is evident in the table, the strategies raise questions about the outcome of a particular intervention and the manner in which the intervention influences behavior change. Between-group and single-case designs vary in how and in some cases how well they address these questions. The strengths and limitations of single-case designs can be conveyed by elaborating these questions.

Evaluating Intervention Packages. Most single-case research fits into the intervention
package strategy in which a particular intervention (program , treatment) is com pared
with no intervention (baseline). T he intervention package usually has multiple ingredi-
ents or com ponents. For example, behavioral interventions often include instructions,
modeling, feedback, and direct reinforcement to alter behavior. For purposes o f evalu-
ation, the intervention package is exam ined as a whole. T he basic question is whether
the intervention achieves change and does so reliably. T he vast m ajority o f examples of
designs throughout the book illustrate the intervention package strategy.
In general, single-case research designs are highly suited to evaluating intervention
packages and their effects on performance. This is a critical strength. The health care,
educational, and service worlds are dominated by interventions with foci such as diet-
ing, academic functioning, parenting, psychotherapy, care o f the elderly, and prevention
of this or that physical or psychological malady. Also, various agencies (family services,
educational), organizations, and professionals have a program that promises to have
an effect. For example, there is an endless stream o f wilderness program s designed to
provide interventions for children and adolescents with various form s o f social, em o -
tional, behavioral, and psychiatric problems (see http://www.wildernessprograms.org/
Programs.html). What is lacking is rigorous (or even not so rigorous) evaluation to
identify if any o f these actually help people.
There are m any reasons there are no evaluations, but one o f them is that we have
learned that our options are limited. By our training, most individuals have learned that
what is needed is a randomized controlled clinical trial, that is, a between-group study. In a classroom, school, school district, and state, where there are many "programs," evaluation does not seem feasible and is too costly. The options look like a large-scale controlled group study or no evaluation.
Single-case designs are very well suited to the intervention package question and
can address this question without objectionable no-treatment control groups often required in between-group research. Multiple groups are not needed. Moreover, the
collection o f continuous data allows one to see if there is change em erging and alter
the intervention as needed if it is not. In a between-group evaluation o f a program , one
does not know the outcome until it is over and cannot do m uch about m ediocre effects.
For program s in real-world settings, we want to have feedback (data) to see i f we are
on the right track or if a new, different, or m odified intervention is needed. For exam -
ple, consider an intervention (B) designed to help students participate in science fairs
available in the school. An intervention might be evaluated in a multiple-baseline design across classrooms or grade levels. We may learn quickly that B is not very effective so
we move to intervention C and use that across all o f the baselines (classroom s) that
have yet to receive the intervention. Implementation on a sm all or circum scribed scale
also is an advantage o f single-case designs.

Evaluating Intervention Components. Let me group three strategies here because they share strengths and limitations. The dismantling, parametric, and constructive strategies listed in Table 14.1 are similar to each other in that they attempt to analyze aspects of interventions that contribute to change. In its own way, each strategy examines what can be done to make the intervention more effective. Variations of the intervention are presented to the same subject to examine their relative impact.
The dismantling strategy attempts to compare the full intervention package with another condition, such as the package minus selected ingredients. An example in a between-group study would be evaluating an intervention package that consisted of a special curriculum for elementary school children, weekly testing of the children on what they have learned, and feedback to teachers on child performance. The dismantling strategy might test the entire package (provided to one group) versus the package minus the feedback to teachers (provided to another group).
The parametric strategy attempts to compare variations of the same intervention in which one particular dimension is altered to determine if it influences outcome. An example might be to evaluate an intervention of exercise three times a week on health (e.g., as reflected in heart rate, blood pressure, and mood). The parametric strategy would compare variations (e.g., three times vs. six times per week).
With the constructive strategy, a given intervention package is evaluated that u su -
ally is already known to be effective. The question is w hether adding som ething new to
that intervention such as another intervention makes a difference. For exam ple, colo-
rectal cancer is the third most common cancer and the second most frequent cause of
cancer-related deaths in the United States. Two approved drugs for treatment operate
somewhat differently in how they attack the cells, so it w as reasonable to do a con -
trolled trial com paring one o f the drugs (Avastin) by itself and with the addition o f the
other approved drug (Avastin and Erbitux). The combined treatment made the cancer worse and produced more side effects as well (Mayer, 2009). More is not always better, but sometimes it is, and the constructive strategy is the way to find out.
Single-case designs can ask the questions of all three strategies. Comparing different interventions or variations of an intervention was discussed and illustrated in Chapter 9 on multiple-treatment designs. Yet, from a methodological standpoint, a
problem is multiple-treatment interference. That is, if two or m ore variations o f treat-
ment (or components) are presented to the same subject, one cannot really make a
judgment about which treatment is more or less effective because the order or prior
variation o f treatment might well influence the impact o f a subsequent treatment. For
example, in a constructive strategy the investigator wishes to know if feedback, praise,
and punishment (three interventions) are better in com bination. She begins with a
phase o f feedback, then in a second phase adds praise, then in the final phase adds p u n-
ishment. A n y effects or lack o f effects might be due to this special history or ordering o f
the interventions (e.g., which intervention was introduced first and second, and their
gradual introduction). In a betw een-group study, separate groups would receive the
various combinations and provide a dem onstration free from the possibility o f m ulti-
ple-treatment interference. With a single case, there is no unam biguous w ay to evaluate
interventions given in consecutive phases because of the intervention × sequence confound; that is, the effects of the intervention may interact with (i.e., the × term) where the intervention was placed in the design.
A n apparent solution to the problem would be to adm inister two or more inter-
vention conditions in a different order to different subjects. A m inim um of two sub-
jects would be needed (if two interventions were compared) so that each subject could
receive the alternative interventions but in a different order. For example, there might
be ABCBC and ACBCB phases provided for two or more subjects. If both (or all) sub-
jects respond to the interventions consistently no matter what order in which they
appeared (e.g., C was always much better than B), the effects o f the sequence in which
the interventions appeared (and multiple-treatment interference) can be ruled out as a
significant influence. If presentation o f the different conditions in different order yields
inconsistent effects, then considerable ambiguity is introduced. If two subjects respond
differently as a function o f the order in which they received the interventions, the
investigator cannot determ ine whether it was the sequence that each person received or
characteristics o f that particular person. The possible interaction (differential effects)
o f intervention and sequence needs to be evaluated am ong several subjects to ensure
that a particular treatment-sequence com bination is not unique to (i.e., does not inter-
act with) characteristics o f a particular subject. Sim ply altering the sequence among a
few subjects does not necessarily avoid the sequence problem unless there is a way in
the final analyses to separate the effects o f interventions, sequences, subjects, and their
interactions. In this discussion, I have assum ed that there are two interventions (B, C)
and that the investigator changes their order for two or more subjects. If there are more
than two interventions, then it is difficult in single-case research to balance (alternate)
the different conditions to include all the possible orders (each follow ing and preceding
the other) to draw inferences that are not confounded by order effects.
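To see why complete counterbalancing quickly becomes impractical, a small sketch (pure combinatorics, not tied to any particular study) counts the orders that would each need at least one subject:

```python
from itertools import permutations

# Number of distinct orders that complete counterbalancing would require.
for interventions in (["B", "C"], ["B", "C", "D"], ["B", "C", "D", "E"]):
    orders = ["".join(p) for p in permutations(interventions)]
    print(f"{len(interventions)} interventions: {len(orders)} possible orders")
```

Two interventions require only two orders, but three require six and four require twenty-four, far more subjects than single-case studies typically include.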
These m ethodological considerations should be tempered by considerations that
lead researchers to add interventions in single-case research, that is, why they use a
constructive strategy. More than one intervention or variation may be needed to achieve the desired changes. If the intervention produces mediocre effects and the participant gets only a little better with the intervention, what does the investigator do next? The answer is to tinker with the intervention a bit, which might consist of adding a new component (constructive strategy), adding more of some facet of the treatment (parametric strategy), or less likely removing some component that seems to be interfering with the effects of the intervention (dismantling). In applied work, researchers give higher priority to having impact than to the risk of misinterpreting the data because of multiple-treatment interference.
Multiple-treatment interference remains a threat to validity in any situation in which more than one intervention is presented to the same subject. Yet the threat sounds esoteric and of little concern in relation to the practical challenge facing the investigator and setting. To alter the performance of the individual, the priority ought to be to place concerns about multiple-treatment interference in the back seat. Yet we ought not to ignore the threat to validity. We want an intervention that also can be extended to many individuals, and to do that we need to know whether this intervention works by itself or requires some other component that is provided with or before the intervention.
In applied settings the critical challenge is to develop interventions that have immediate impact. Worrying about multiple-treatment interference in this context is a much lower priority. Thus, single-case designs are quite useful in developing effective interventions because these component-based strategies are needed to maximize impact. Yes, they leave open the possibility that the effects would not be evident if multiple-treatment interference were controlled (e.g., as in a between-group design). I hasten to add that multiple-treatment interference reflects the possibility that the order of the interventions or what preceded the intervention may make a difference. The fact that it could make a difference does not mean that it invariably does.

Comparing Different Interventions. The problem of evaluating variations of interventions as part of the dismantling, parametric, and constructive strategies extends to the comparative strategy as well. The comparative strategy examines the relative effectiveness of two or more different interventions. In most single-case experimental designs, comparisons of different interventions are obfuscated by the multiple-treatment interference effects noted earlier. I mention the comparative strategy separately, although the strategy and its concerns could be absorbed in the previous section. I do so because researchers often are keenly interested in the question, "Which intervention is better or best?" I also do so because single-case research has designs specifically devoted to comparing different treatments.
The multi-element and alternating-treatments designs attempt to provide an alter-
native in which two or more interventions or intervention variations can be com pared
in the same phase but under different or constantly changing stim ulus conditions.
These designs can resolve the sequence effects associated with presenting different c o n -
ditions in consecutive phases. However, it is possible that the results are influenced by
multiple-treatment interference, that is, the effects o f introducing m ore than one treat-
ment, as discussed in the chapter on these designs. Interventions, when juxtaposed to other
interventions, m ay have different effects from those that would be obtained if they were
administered to entirely different subjects.
Overall, evaluating different interventions introduces ambiguity for single-case
research. The possible influence o f adm inistering one intervention on all subsequent
interventions exists for A B A B , multiple-baseline, and changing-criterion designs.
Similarly, the possibility that juxtaposing two or more interventions influences the effects
that either treatment exerts is a potential problem for multiple-treatment designs. This
ambiguity has not deterred researchers from raising questions that fit into the dism an-
tling, param etric, constructive, or comparative strategies. Single-case designs are often
used in applied settings where there is a practical issue o f critical concern and there are
views about what is the better or best strategy am ong those available. So, for example,
to foster improved reading, better completion o f homework, and higher achievement
among students in a special education class, two viable intervention options may be
worth testing. One could do a between-group random ized controlled trial—actually,
that is a problem, as I have noted before. One cannot usually do one o f these because
o f feasibility and cost. Alternatively, one could test the simpler, less costly intervention
first in a single-case design and then add the more com plex, possibly more costly inter-
vention if the first one does not achieve the goals.
Multiple-treatment interference means that the results may only apply to other
individuals who receive the two (or more) treatments juxtaposed in the w ay the design
presented them (e.g., alternating treatments). This is a lower priority concern than the
applied task (which treatment will stop inmates from stabbing each other, how do I get
special education students to read, how can we get my relatives to stop throwing food during holiday meals). Give me the more or most effective intervention any day; once I have that, I will worry about testing for multiple-treatment interference.

Studying Variables that Influence the Impact of the Intervention. The intervention-
moderator strategy asks questions about characteristics o f the clients or other factors
that influence the effectiveness o f the intervention. A moderator is a variable that influ-
ences the relation o f two (or more) variables o f interest. That is, the relation between
the intervention and perform ance or outcome varies as a function o f some other char-
acteristic. If an intervention is more effective for boys than for girls, child sex is called
a moderator— it som ehow relates to the effects o f treatment. M oderators often are
characteristics o f people (e.g., age, sex, cultural background), but they can refer to any
characteristic o f the setting or context (e.g., classroom , class size) or even features o f
the intervention (e.g., duration o f the program, who administers it such as parents vs.
therapists) that have an influence on outcome.
Moderators are quite important, because there is a fairly safe statement to make
about treatment (e.g., medical, psychological), education, rehabilitation, and other areas
where interventions are designed to help. The statement, “An intervention, however
effective, is not likely to work with everyone,” is important to bear in mind. As familiar
examples, aspirin (for headache), chemotherapy (for a given cancer), and insulin m oni-
toring and injections (for diabetes) do not work for everyone. Granted, some interven-
tions w ork for a higher proportion o f people than others. But the scientific task is not to
just leave it at the fact that the treatment did not work for everyone. The task is to find
out for whom it did not work and why. The study o f moderators is the beginning.
In everyday life, normal parenting raises the issue of moderators. If a parent or family has two or more children, at some point they go through the moderator amazement phase. We raised our children identically (same house, opportunities, foibles of our child rearing, and so on), so why are the children so different? For methodologically informed parents, the question actually goes like this: My child-rearing intervention was roughly the same for these children, but the outcomes (in kindness to me at holidays) are very different, so what might be the moderating variable(s)?¹
As a research example, menopausal women have routinely been given estrogen,
which is depleted during menopause. Depletion o f the horm one contributes to m any
o f the symptoms that emerge (e.g., hot flashes, flushes, night sweats and/or cold flashes,
clam m y feeling, irregular heartbeat, irritability, m ood swings, sudden tears, difficulty
sleeping, anxiety, feelings o f dread, difficulty concentrating). W arnings em erged from
the research when it was initially presented, noting that women who received estrogen
replacement therapy were at higher risk for heart attacks. Further analyses o f the data
indicated that the relation between treatment and outcome was moderated by age. For women ages 50 to 59, risk of heart attack was reduced when compared to women ages
70 to 79, where the risk increased (Manson et al., 2007). Understanding the m odera-
tor and how it works can improve treatment and can direct individuals who are likely
not to profit from the treatment or to experience untoward other effects to som e other
intervention.
More generally, one can see moderators emerge in intervention research. A common situation from research: one gives the same treatment or educational regimen to 50 individuals and, let us say, 30 respond really well, 10 respond pretty well, and 10
do not seem to change at all. A researcher does not just shrug her shoulders and say,
“Okay, most o f the news is good.” Rather, the researcher invariably wants to understand
why there were different responses and for whom treatment is likely to be effective.
Ultimately, understanding can improve treatments for those who did not change at all
or very much.

¹ Although there are obvious similarities in how parents raise their two or more children, there are critical differences as well. Siblings (e.g., ages 3 and 6) in the same home are under the influence of somewhat different factors including the biological health of the parents when they were conceived and born, child-rearing practices, often slight differences in socioeconomic status of the family, and the impact of the presence of a sibling in the home when the second (but not the first) child was born, among other factors. If for a moment you are skeptical that such sibling differences exist, permit me to mention my flash-photo theory. In most homes, photos of a first child are much more extensive than photos of a second child. Milestones for the first child (e.g., breathing, crawling, walking) are treated like the big bang that may have originated the universe. Indeed, for a couple without a child, the first child is a big bang. The second child is a slightly smaller bang in many ways in many homes, and this differential response is evident in the photo record of early childhood. That differential photo record might reflect larger issues related to child-parent contact in the environment that are not necessarily good or bad (e.g., parents are more relaxed, busier with two children) but make the environments of siblings different. Even if the home life and parenting were identical, the different biological make-up of the siblings might make them differentially sensitive to the same influences (e.g., hugging, shouting, instruction, lessons, peers, etc.). In short, siblings do not really grow up in the same environment because the physical and social environments actually are a bit different and the siblings vary in how any given influence has impact on them (e.g., Bouchard, Lykken, McGue, Segal, & Tellegen, 1990; Dunn & Plomin, 1990; Plomin, McClearn, McGuffin, & DeFries, 2000).
Moderators are of great interest in intervention research. Among the most salient areas is the role of ethnicity and culture. For example, in the context of evidence-based
psychotherapies, most o f the research has been based on European Caucasian sam -
ples. Even so, many o f the findings, when evaluated empirically, show sim ilar effects
with (generalize to) other ethnic and cultural groups (e.g., A frican Am erican, Latin
Am erican) (e.g., M iranda et al., 2005). However, ethnicity and culture make a differ-
ence, that is, moderate outcome effects. For example, psychotherapy for adults som e-
times is more effective when treatment is provided in the native languages of the clients
and is specifically designed for m inority groups (G riner & Smith, 2006). The important
role of ethnicity and culture in delivering and providing treatment services is beyond the scope of this chapter (see Kazdin, 2008a). Yet, the issue conveys that moderators
are not minor afterthoughts. There are different paths toward m aking treatment more
effective. One is developing more potent interventions. Another is to do better triage,
that is, direct people to those treatments from which they are likely to profit. T his latter
strategy requires understanding moderators.
The intervention-m oderator strategy addresses whether the intervention is more
or less effective as a function o f some other variable, usually client characteristics. The
usual way that between-group research approaches this question is through large-scale
studies with data analyses (e.g., factorial designs, multi-way analyses o f variance, and
multiple regression analyses) that exam ine whether the effectiveness o f treatment inter-
acts with (is moderated by) the types o f clients, where clients are grouped according to
such variables as age, ethnicity, diagnosis, socioeconom ic status, severity of behavior,
or other dim ensions that appear to be relevant to treatment.
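As a hedged sketch of the between-group approach just described (simulated data; the variable names and the moderator are invented for illustration), a regression with a treatment × age interaction term asks whether the treatment effect depends on age:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 70, n)
treatment = rng.integers(0, 2, n)  # 0 = control, 1 = intervention
# Simulated outcome: the treatment effect weakens as age increases.
outcome = 5 + 2 * treatment - 0.03 * treatment * age + rng.normal(0, 1, n)

df = pd.DataFrame({"outcome": outcome, "treatment": treatment, "age": age})
model = smf.ols("outcome ~ treatment * age", data=df).fit()
# A reliable treatment:age coefficient indicates moderation by age.
print(model.params)
```

With only a handful of subjects, as in single-case work, such an interaction term cannot be estimated with any precision, which is precisely the limitation discussed next.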
Single-case research usually does not address questions o f the characteristics o f the
client that may interact with treatment effects. I f a few subjects are studied and respond
differently, the investigator has no systematic way of determining whether treatment
was more or less effective as a function o f the treatment or the particular characteris-
tics o f the subjects. For example, in one study, four adolescents (ages 13 and 14) with
developmental and intellectual disabilities were exposed to interventions to develop
their word repertoire (e.g., defining words) (Riesen, M cDonnell, Johnson, Polychronis,
& Jameson, 2003). The individuals varied in psychiatric and physical disabilities; all
had IQs at or below 70, one criterion included in defining intellectual disability (or
mental retardation). T hey were functioning in a classroom where two interventions
were evaluated. Briefly, the interventions consisted o f different ways o f presenting and
fading instructions to help them learn and recite the definitions. Each adolescent was
exposed to the two treatments in an alternating-treatments design. O ne intervention
was better for two o f the cases; the other intervention was better for the other two cases.
The small sample does not permit analyses o f these cases that might shed light on why
they responded differently or the characteristics that correlate with responding to one
treatment rather than another.
Many other examples evaluated in single-case designs can be cited where one intervention was more effective with some cases and the other intervention was more effective for others; where individuals responded differently or not at all to a given intervention; and where a few or most participants showed the predicted pattern of performance but one or some small number did not (e.g., Ardoin et al., 2007; Park et al., 2005). The results are not surprising and support the broad conclusion we have
learned from psychology, namely, that there are individual differences. The issue for single-case designs is that when there are individual differences, we have no easy way of testing or evaluating what factor may have moderated the outcome. Were some individuals (e.g., those who responded) older, smarter, taller, less seriously impaired, etc.? Investigators are wont to speculate, but a data-based interpretation is not possible with few subjects. Between-group research is able to explore possible moderators (and their combinations) because of the sample sizes or to test a priori hypotheses about who will respond to the different treatments.
In general, testing and evaluating moderators is a weakness of single-case research. This has not hampered the research. Interventions selected for study often have been very potent and have had generality across many subjects (and species). Even so, moderation is a critical question inherent in all interventions, medical, psychological, educational, parental, and so on, because not everyone responds to even our best treatments. Understanding who does not respond can be a precursor to understanding why. Understanding why is often a precursor to being able to do something about it.

Studying Mediators and Mechanisms: Largely Unexplored. The intervention-mediator/mechanism strategy, the last one listed in Table 14.1, addresses the question pertaining to the reasons why changes come about, the mechanism of change, and the specific process through which the intervention works. Mediators, mechanisms, moderators, and causes are difficult to keep straight and are occasionally defined inconsistently in professional writings. As a point of reference, Table 14.2 summarizes key concepts that are interrelated to facilitate the discussion. I group “mediator” and “mechanism” together here in part because the distinction is not critical to the point I wish to make in this discussion, that is, how single-case and between-group research can address processes that explain how change comes about.
Single-case and between-group experimental designs can show a causal relation between an intervention and outcome. A causal relation does not establish how or why the effect occurred, that is, the specific reason or underlying process. Cause can be readily distinguished from mechanism of action. Consider cigarette smoking and

Table 14.2 Key Terms and Concepts

Cause: A variable or intervention that leads to and is responsible for the outcome or change.

Mediator: An intervening variable that may account (statistically) for the relation between the independent and dependent variable. Something that mediates change may not necessarily explain the processes of how change came about. Also, the mediator could be a proxy for one or more other variables or be a general construct that is not necessarily intended to explain the mechanisms of change. A mediator may be a guide that points to possible mechanisms but is not necessarily a mechanism.

Mechanism: The basis for the effect, that is, the processes or events that are responsible for the change; the reasons why change occurred or how change came about.

Moderator: A characteristic that influences the direction or magnitude of the relation between an independent and dependent variable. For example, if the relation between variable x and y is different for males and females, sex is a moderator of the relation. Moderators are related to mediators and mechanisms because they suggest that different processes might be involved (e.g., for males or females).

Note: Several sources can be consulted for further discussion of these concepts (e.g., Campbell & Stanley, 1963; Kazdin, 2007; Kraemer, Stice, Kazdin, Offord, & Kupfer, 2001; Kraemer, Wilson, Fairburn, & Agras, 2002).
lung cancer to help convey the distinction of cause and mechanism. Spanning decades, cross-sectional and longitudinal studies, research with humans, and experiments with non-human animals have established a causal role between cigarette smoking and lung cancer. (The causal role means that smoking can cause cancer; it does not mean that smoking is the only cause or that smoking invariably leads to cancer.) Establishing a causal relation does not automatically explain the mechanisms, that is, the process(es) through which lung cancer comes about. What is it specifically about cigarette smoking that leads to cancer, and what are the steps along the way? The mechanism has been uncovered by describing what happens in a sequence from smoking to mutation of cells into cancer (Denissenko, Pao, Tang, & Pfeifer, 1996). A chemical (benzo[a]pyrene) found in cigarette smoke induces genetic mutation at specific regions of the genes' DNA that is identical to the damage evident in lung cancer cells. This finding is considered to convey precisely how cigarette smoking leads to cancer at the molecular level.
A mechanism of action need not be biological. For example, for major depression among adults, cognitive therapy is the most well-established and researched form of psychotherapy (Hollon & Beck, 2004). Randomized controlled trials have shown repeatedly that the treatment is effective (i.e., causes change), but why is it effective and how does it work, that is, what mediates the effects, and what is the mechanism of change? Changes in specific cognitive processes during treatment have been proposed to account for the change. Group studies have challenged this interpretation because we know now that the benefits of treatment occur (i.e., treatment works) even without changes in the supposed cognitive processes (see Kazdin, 2007). In short, that the treatment works is clear; how the treatment works is not so clear.
Single-case research in applied settings is concerned primarily with identifying causes of change, especially intervention packages that make a difference. In this research, there has been less interest in understanding the mechanisms in the sense of processes that explain how the change comes about. However, many interventions used in single-case designs are based on learning (e.g., reinforcement, practice, acquisition, and extinction), and basic human and non-human animal research has examined molecular and neurological changes that these interventions cause (e.g., Brembs, Lorenzetti, Reyes, Baxter, & Byrne, 2002; Pagnoni, Zink, Montague, & Berns, 2002). Much of this work is based on the intensive study of individuals (e.g., in neuroimaging studies of a few individuals). It is not as if single-case designs with one or a few individuals cannot evaluate mechanisms. Indeed, as I noted previously (Chapter 10), there are ways in which the study of mechanisms may actually require the study of the single case and adoption of essential features (e.g., continuous assessment) of the methodology. However, the study of mechanisms has not been the priority of single-case research in applied settings.
The study of mediators has received increased attention in intervention research. Mediators are not the same as mechanisms but can be an important starting point to identify key constructs that might begin to explain how the changes come about. Mechanisms would be the more concrete and specific level of analysis that shows precisely how those constructs operate (Table 14.2). The study of mediators is accomplished by hypothesizing what that mediator is, assessing the mediator while the intervention is in place, assessing outcome or changes in the domain of interest (e.g., symptoms, problem behaviors), and establishing a connection between change in the mediator
and change in the outcome. Although this can be achieved with the single case and multiple single cases, between-group research has given this focus much more attention. This is due in part to the increased attention to the topic of mediation in the context of group research and advances in statistical techniques designed to evaluate and test for mediation (e.g., Baron & Kenny, 1986; Kenny, Kashy, & Bolger, 1998; Kraemer, Kiernan, Essex, & Kupfer, 2008; MacKinnon, 2008).
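As an illustration of the steps just described, the following sketch runs a classic regression-based mediation analysis in the spirit of Baron and Kenny (1986) on simulated data. The sample size, variable names, and effect sizes are all invented for the example; this is a sketch of the logic, not a full mediation analysis (which would add, e.g., a significance test for the indirect effect).

    # Sketch of regression-based mediation on fabricated data (after Baron & Kenny, 1986).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 300
    x = rng.integers(0, 2, n).astype(float)      # intervention (0 = control, 1 = treated)
    m = 0.6 * x + rng.normal(0, 1, n)            # hypothesized mediator, assessed during treatment
    y = 0.4 * m + 0.1 * x + rng.normal(0, 1, n)  # outcome in the domain of interest

    def ols(dep, predictors):
        X = sm.add_constant(np.column_stack(predictors))
        return sm.OLS(dep, X).fit()

    a = ols(m, [x]).params[1]          # path a: intervention -> mediator
    full = ols(y, [x, m])
    b = full.params[2]                 # path b: mediator -> outcome, controlling for intervention
    direct = full.params[1]            # direct effect of the intervention on the outcome
    print(f"a = {a:.2f}, b = {b:.2f}, indirect (a*b) = {a * b:.2f}, direct = {direct:.2f}")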

General Comments. A discussion of single-case research in relation to the range of intervention questions that are usually of interest is helpful at a broader level than merely noting strengths and weaknesses or threats to validity here and there. Any research design is merely a tool for the investigator to answer a question or to test or generate a hypothesis. Sometimes one reads the literature in an area and is impressed by how one design is slavishly adhered to in an almost rote way without considering options, some of which would be easier, others of which would be better in relation to the purpose the investigators have set for themselves. The intervention strategies help sensitize us as investigators to what we are trying to accomplish and how designs, single-case, between-group, and variations within them, are or are not suited optimally to their purpose.
I stated before that in the intervention world (education, rehabilitation, therapy, counseling, health care, prevention) there is an endless array of programs designed to help people. No matter what the focus, intervention, or profession behind them, three key characteristics are surprisingly common:

1. The interventions are well intended.
2. They are unevaluated (no systematic assessment).
3. They have absolutely no evidence (research) indicating that they actually help.

Interventions in most contexts (local schools, state prisons, day-hospital treatment programs for individuals with mental illness) cannot easily be evaluated in a between-group study because of limited resources, feasibility, and time constraints. Single-case designs have their own constraints but could be much more feasibly used to evaluate programs in applied settings. The intervention package strategy alone would be a strong justification for ensuring that single-case research were part of the graduate training curriculum for anyone trying to change the behavior of anyone else (e.g., teachers, therapists, counselors, social workers, military generals, physicians).

Generality of the Findings

Single-Case Research. A major objection levied against single-case research is that the results may not be generalizable to persons other than those included in the study. After all, if we are studying just a few cases (e.g., one subject in an ABAB design or three subjects in a multiple-baseline design across subjects), how do we know the results apply to anyone else in the world? This objection raises several important issues. To begin with, single-case research grew out of an experimental philosophy that attempts to discover laws of individual performance (Kazdin, 1978). There is a methodological heritage of examining variables that affect performance of individuals rather than groups of persons. So the interest was in understanding individuals, and single-case
research was based on the assumption that lawful relations would not be idiosyncratic. Hence, the ultimate goal, even of single-case research, is to discover generalizable relations. There is nothing special here about the principle involved. Astronomy is quite concerned about studying individual planets, galaxies, and comets; Egyptologists are eager to find and elaborate individual tombs and pyramids; geneticists are keen to elaborate individual families as a way to reveal disease patterns; and so on for most of the sciences. Study of the individual often reveals generalizable and unique information, and understanding requires an elaboration of both.
I mentioned earlier in the book the distinction between single-case designs (as a methodology) and behavior analysis (as an experimental and applied substantive area where learning-based interventions are usually used to alter behavior). They overlap because those engaged in behavior analysis rely on single-case designs. Yet the distinction is important here. The development of behavior analysis in laboratory or applied contexts focused on identifying interventions that led to marked changes within the individual and variables that generalized across many individuals. Indeed, early experimental work (on schedules of reinforcement) demonstrated how changes in behavior generalized (were very similar) across species (e.g., humans, pigeons, rats, and monkeys) (Kazdin, 1978). Thus, generality was not an issue.
In applied work, the interventions (e.g., variations of reinforcement) have had very robust effects. Investigators who use single-case designs have emphasized the need to seek interventions that produce dramatic changes in performance. Interventions that produce dramatic effects are likely to be more generalizable across individuals than are effects that meet the relatively weaker criterion of statistical significance. Indeed, in any particular between-group investigation, the possibility remains that a statistically significant difference was obtained on the basis of chance. The results may not generalize to other attempts to replicate the study, not to mention to different sorts of subjects.
Single-case designs (as a methodology) do not inherently produce more or less generalizable effects. Findings obtained in single-case demonstrations appear to be highly generalizable because of the types of interventions that are commonly investigated. Over the years of reporting single-case designs, we have not learned, and there is no evidence to my knowledge to support, the view that findings from single-case research are more or less generalizable than findings from other research. This does not allay the natural reaction to wonder if a change in one subject represents a change in 100 subjects.
The problem of single-case research is not that the results lack generality among subjects. Rather, the problem is that there are difficulties largely inherent in the methodology for assessing the dimensions that may dictate generality of the results. In the prior discussion of the intervention-moderator strategy, I noted that single-case designs cannot easily address intervention x subject interactions, that is, whether treatments are differentially effective as a function of certain subject characteristics. This differential effectiveness comment pertains to generality because the moderator conveys for whom the intervention was or was not effective, that is, the characteristics of persons across which the effects generalize.

Between-Group Research. The generality of findings from single-case research is often discussed in relation to between-group research. Between-group research uses
larger numbers of subjects than single-case research. Surely that must produce more generalizable findings. Actually, between-group research does not necessarily yield generalizable findings or more generalizable findings than single-case research, for at least four reasons.

First, we as researchers are often comforted by the fact that group research includes many individuals. Unfortunately, given the way the results are analyzed (usually a comparison of means among groups), we have no idea how many individuals in the group showed a change or a change of any importance (applied significance). Results are evaluated on the basis of average group performance. For example, if a group of 20 patients who received treatment show greater change than 20 patients who did not receive treatment, little information is available about the generality of the results. We do not know by this group analysis alone how many persons in the treatment group were affected or affected in an important way (e.g., made a large change). Also, it is quite possible that a very small change among many individuals could lead to a significant effect favoring treatment even though no one really changed in any palpable way. In short, we do not know how many changed or changed in a way that makes a difference, and the extent to which the mean for the group represents individual members of the group. Ambiguity about the generality of findings from between-group research is not inherent in this research approach. However, investigators rarely look at the individual subject data as well as the group data to make inferences about the generality of effects among subjects within a given treatment condition. Certainly, if the individual data were examined in between-group research, a great deal more might be said about the generality of the findings than what can be said in most instances now.
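A small sketch of what that individual-level look could involve, assuming invented pre/post scores for a treated group and a hypothetical criterion that a 5-point drop counts as a change of applied importance:

    # Supplementing the group mean with individual-level data: how many subjects
    # changed by a clinically meaningful amount? Scores and the 5-point criterion
    # are invented for illustration.
    import numpy as np

    pre = np.array([20, 22, 19, 25, 21, 23, 24, 20, 26, 22], dtype=float)
    post = np.array([14, 21, 12, 24, 15, 22, 16, 13, 25, 16], dtype=float)
    change = pre - post
    responders = change >= 5                 # per-subject criterion, not just the mean
    print(f"mean change = {change.mean():.1f}")
    print(f"responders: {responders.sum()} of {len(change)} ({100 * responders.mean():.0f}%)")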
Second and related, as researchers we are encouraged (and for some funding agencies required) to include individuals of different ethnicities and cultures in our groups. This has unwittingly fostered the view that any finding might be more generalizable if a diverse set of participants is included. This is not an informed view. Merely including more diverse subjects alone does not establish, test, or demonstrate generality. Typically, there are insufficient numbers of various groups in the study to test generality, that is, whether diversity, ethnicity, or identity act as moderators of the intervention. So merely including a diverse sample in a group study does not by that fact alone mean the results will be more generalizable across subject characteristics. The generality could be tested within a study: Does each group respond similarly? If they do, we have shed light on the generality of findings among the different groups. But the tests are rarely requested (by the funding agencies) or reported by investigators. In short, merely including a more diverse sample does not automatically make the results more generalizable.
Third, subjects in between-group research are rarely sampled in such a way that they are random across a large population. Between-group research uses random assignment of subjects to groups but not random selection of the sample from the population (e.g., all college students, people from different parts of the country). Random assignment is not especially pertinent to generality of effects, although random selection is. There are exceptions in studies where random samples are drawn from a given country. For example, epidemiological studies sample randomly from individuals throughout communities (e.g., in studying disease, eating patterns, psychiatric diagnoses) with the goal in mind to represent the population. Sometimes studies sample many geographical locations, even though these are not chosen randomly. Multi-site intervention studies
purposely carry out the study in several locations (e.g., a few regions of the country). Rarely do psychological and educational studies use random selection of cases or selection from diverse locations. Thus, the generality of group research too is in question.
Finally, between-group research often uses careful inclusion and exclusion criteria for selection of subjects. For example, if one wishes to test an intervention for clinical depression, not everyone who is depressed is allowed to participate. Depression occurs in childhood, adolescence, and adulthood. The group is likely to be restricted in age (e.g., let us say we just select adults 20 to 45 years of age). Many people who are depressed have other psychiatric disorders, and that makes depressed adults very different from each other (e.g., let us say we select those who have only major depression without other psychiatric disorders). We want subjects who can come to treatment for the 10 sessions we are planning (e.g., so we only take those who have transportation and who are not so depressed that they cannot leave their homes or need to be hospitalized). Some depressed patients are suicidal; it is likely we want to exclude those too and refer them to immediate care. This example could continue to show that between-group studies often screen who participates. Indeed it is very wise to do so because the broader the range of sample characteristics, the greater the variability in the study. Variability in the sample can make it much more difficult to demonstrate an effect of the intervention (Kazdin, 2003). In general, between-group research in schools, clinics, and other settings and for the purposes of education, treatment, and prevention often selects samples with extreme care and excludes many individuals purposely. This practice, while methodologically prudent for providing a strong test of intervention effects, is not the path to producing generalizable findings. This is not a criticism of between-group research but rather of the unexamined view that group studies produce findings that are generalizable or more generalizable than findings from another research tradition (e.g., single-case or qualitative research).
The redemption from this situation is that the generality of the findings in between-group research is readily albeit infrequently evaluated by the intervention-moderator strategy, as outlined previously. The performance of subgroups of persons within the study is not examined to assess whether intervention(s) are differentially effective as a function of some subject variable. Within single-case demonstrations with one or a few subjects, by definition, there is no immediate possibility to assess characteristics (moderators) that help explain the generality of effects, that is, for whom the effect did and did not occur. Hence, between-group research certainly can shed more light on the generality of the results than can single-case research. A factorial design (or other analysis such as multiple regression) examining intervention x subject interactions can provide information about the suitability of treatment for various subject populations. Also, with a large number of subjects in an intervention group, the investigator using a between-group study can comment on the percentage of subjects who show the pattern of change represented by the group mean. However, because subjects in any study are not selected randomly from a population, the percentage cannot be used as an estimate of the likely percentage of people to whom the findings would apply in the population.

Replication

Replication refers to repetition of the effect of an intervention and emerges as a critical concept in two ways. The first is related to the logic of single-case designs and the
describe, predict, and test functions of collecting data in separate phases (e.g., ABAB), across separate baselines (e.g., variations of multiple-baseline designs), and so on with other designs. The impact of the intervention is repeatedly tested within a given demonstration, which is another way of saying the effect is replicated (Horner et al., 2005). In this sense replication is central to the designs and clarity of the demonstration.
The second aspect of replication pertains to evaluating generality of the intervention effect across subjects or conditions and serves as the basis of the present discussion. In relation to generality of a finding, replication is a critical ingredient for all research. Replication can examine the extent to which results obtained in one study extend (can be generalized) across a variety of settings, behaviors, measures, investigators, and other variables that conceivably could influence outcome. Direct or exact replication and systematic or approximate replication provide a useful way to convey critical points (Sidman, 1960). Direct replication refers to an attempt to repeat an experiment exactly as it was conducted originally. Ideally, the conditions and procedures (e.g., setting, measures, intervention, design) across the replication and original experiment are identical. Systematic replication refers to repetition of the experiment by systematically allowing features to vary. The conditions and procedures of the replication are deliberately designed only to approximate those of the original experiment.
It is useful to consider direct and systematic replication as on opposite ends of a continuum. A replication at the direct end of the spectrum would follow the original procedures as closely as possible. This is the easiest to do for the researcher who conducted the original investigation, since he or she has complete access to all of the procedures, the population from which the original sample was drawn, and nuances of the laboratory procedures (e.g., tasks for experimenters and subjects, all instructions, and data collection and reliability procedures) that optimize similarity with the original study. An exact replication is not possible, even by the original investigator, since repetition of the experiment involves new subjects tested at a different point in time and by different experimenters, all of which conceivably could lead to different results. Thus, all replications necessarily allow some factors to vary; the issue is the extent to which the replication study departs from the original investigation.
Direct and systematic replications add to knowledge in different ways. Replications that closely approximate the conditions of the original experiment increase one's confidence that the original finding is reliable and not likely to have resulted from chance, a particular artifact, or a unique moment in time. Replications that deviate from the original conditions suggest that the findings hold across a wider range of conditions. Essentially, the greater the divergence of the replication from the conditions of the original experiment, the greater the generality of the finding.
If the results of direct and systematic replication research show that the intervention affects behaviors or other domains in new subjects across different conditions, the generality of the results has been demonstrated. The extent of the generality of the findings, of course, is a function of the characteristics of the subjects (e.g., age, ethnicity), applied or clinical focus, settings, and other conditions included in the replication studies. In any particular systematic replication study, it is useful to vary only one or a few of the dimensions along which the study could depart from the original experiment. If the results of a replication attempt differ from the original experiment, it is desirable to have a limited number of differences between the experiments so the possible reason(s)
for the discrepancy of the results might be more easily identified. If there are multiple differences between the original experiment and replication experiments, discrepancies in results might be due to a host of factors not easily discerned without extensive further experimentation.
A limitation of single-case research occurs in replication attempts in which the results are inconsistent across subjects. For example, the effects of the intervention may be evaluated across several subjects in direct replication attempts. The results may be inconsistent or mixed; that is, some subjects may have shown clear changes and others may not. In fact, it is likely that replication attempts, whether direct or systematic replications, will yield inconsistent results because one would not expect all persons to respond in the same way. We have learned from intervention studies that not all persons respond to treatment even when these treatments are well known, well studied, and effective (e.g., aspirin for headache, chemotherapy for cancer, exposure therapy for anxiety, reading of methodology books for depression). Invariably there are always some individuals who do not respond, and a finding does not usually generalize to everyone, whether in single-case or between-group research.
The problem with inconsistent effects is understanding for whom and why the results did not generalize across subjects. Herein rests the potential limitation of single-case research. When replication attempts reveal that some subjects did not respond, the investigator has to speculate on the reasons for the lack of generality. There is no systematic or formal way within the investigation, or even in a series of single-case investigations, to identify the basis for the lack of generality. This is the problem already elaborated and illustrated in the previous discussion of evaluating moderators of intervention effects. Between-group designs, with their larger numbers of subjects, permit analyses to evaluate characteristics that may delineate who responds well or poorly. In such designs, one can form subgroups or place individuals on a dimension with regard to some characteristic (e.g., how severe, how smart, how socially skilled) and evaluate the impact of that variable. In this way, between-group studies can identify the characteristics of individuals to whom the intervention effects do and do not apply. The effects of treatment apparently do not generalize to everyone; once the variables are identified, one can begin to test more specific hypotheses about why.
Single-case designs could evaluate generality and replication more systematically. In principle, investigators, educators, and clinicians who collect data on cases seen at different settings could catalogue or code subject (or other) variables (e.g., age, severity of dysfunction) as well as behavior changes. The information, when accumulated across several cases and investigators, would form a data bank that could be analyzed for moderating variables. In the context of psychotherapy, there are rare examples where systematic data have been collected on individual clients seen in treatment (e.g., Clement, 2008). The accumulated data were used to describe treatment effects for subgroups as well as for individuals and to identify moderators of treatment outcome. This same concept on a larger scale and across many investigators might be able to address generality of effects.

General Comments

There are many questions to which single-case designs are well suited. The intervention package question perhaps is the one of most universal concern, and that alone would
give the designs a special place in research methodology and in the training of researchers. The vast majority of programs for groups in everyday life (e.g., in education, medicine, law enforcement) cannot be subjected to an RCT for a variety of reasons (e.g., cost, needed sample sizes, availability of control conditions). Similarly, interventions designed for individuals (e.g., psychotherapy, special education, remedial or rehabilitation programs targeted to one person at a time) cannot be evaluated in group designs. The choice has been an RCT or no systematic evaluation (e.g., anecdotal case study). This is unfortunate because of the loss of time and money (in tinkering with programs that may not do very much) and even more so because of subjecting us to potentially wasteful and ineffective interventions. We (our children, our relatives) are all subjects in a set of unevaluated, well-intentioned, anecdotal case studies. Single-case methods, either experiments or quasi-experiments, provide an alternative to anecdotal claims that the program seems to be working or was a good idea at the time! In relation to evaluation of intervention packages, single-case methods are very strong and often are more viable as options than are between-group studies.
Generality of the findings from single-case research often emerges as a concern—understandably so if only one or two cases are included in a study. Generality is a problem that is not overcome in group research, despite our comfort with the fact that with more people involved any effect must be more generalizable. Rarely are group studies evaluated in ways that allow us to know anything about how any individual responded and how many individuals (what proportion) responded and to what extent. The evaluation of moderators, not easily accomplished by single-case research, is one way to examine generality of findings, that is, for whom the effect does or does not apply or applies in varying degrees. Another way to evaluate generality of a finding, important in all research, is whether the results can be repeated in subsequent tests.
Replication is the key in science to address reliability of findings but also to evaluate generality across conditions (subjects, investigators). We want to identify those interventions that are robust across many different conditions. When interventions are not robust across conditions, we want to know the restrictions and the moderating variables. Many studies and many different methods of study are needed. After decades of research and decades of concerns, in fact there is no clear evidence that findings from single-case and between-group experiments and quasi-experiments are any more or less generalizable across individuals and new conditions.

DIRECTIONS AND OPPORTUNITIES ON THE HORIZON

It is likely that the use of single-case designs will continue to expand to new settings, populations, and specialty areas or disciplines. The need for evaluation in many different settings and renewed social interest and pressure to identify “what works” alone would increase the demand for such designs. At the same time, single-case designs are not usually included in the training of researchers. Aspects of training in the quantitative tradition may even interfere with drawing on single-case methods. For example, as I have mentioned previously, the RCT is referred to as the “gold standard” for intervention research. After a while that view seems to have evolved to a more tacit and extreme version, namely, that it is the only standard and that anything else is “fool’s gold.” There are hazards in relying on, yet worshipping, any single method or approach to research design, a topic for the next chapter. Even so, is there a better way to integrate single-case methodology into the mainstream? Two opportunities are highlighted here.

Randomization: Motherhood and Apple Pie of Methodology

There is no question that among methodologists randomization arguably is the most valued concept. Indeed, the concept and use of randomization pervade our lives well beyond our research. (I myself use randomization for such things as determining whom to invite to a party at my home, selecting people from the phone book to whom I send season’s greetings cards each December, and deciding the order of my meals, breakfast, lunch, and dinner on a given day.) Integration of single-case research into mainstream thinking and training might be enhanced by drawing on randomization a bit more.

Randomization is not routinely part of single-case designs or in fact used very much at all. The exception I noted was in one context, namely, how to order the way in which treatments are presented to subjects in multiple-treatment designs. Even that use is somewhat rare and represents only one way of presenting multiple treatments to the same subjects. That said, randomization may have a larger place in single-case research, improve features of the designs, and improve how the designs are viewed and accepted by investigators trained in the between-group (quantitative) research tradition.
Single-case designs compare different conditions, usually baseline and some intervention (A, B phases). Randomization could be integrated in such designs in many ways. In fact, randomization has been advocated for single-case research for over four decades (e.g., Edgington, 1969; Edgington & Onghena, 2007; Kratochwill & Levin, in press; Onghena, 1994). In the most recent reference, Kratochwill and Levin (in press) note many different ways in which randomization can be used. In between-group research, randomization usually means how subjects are assigned to conditions. In single-case research, randomization might be used to influence other facets of the study, such as deciding the order of delivering the intervention or condition. For example, for each day (or week) which condition (e.g., A, B, or A, B, C) is to be administered could be determined randomly. Also, the point in time that an intervention is implemented or begins after a period of several days of baseline can be randomly determined. In a multiple-baseline design, both the when and to whom (which baseline is selected if there are multiple baselines across individuals) can also be randomly decided.
Randomization does not require that the condition presented to the subjects change every day (e.g., ABBAAABBABBAA). Blocks of days could be assigned randomly. For example, consider a 3-day block, which would mean 3 days in a row of a particular condition. Block 1 might be baseline and Block 2 the intervention, with each block consisting of 3 days of that condition. The blocks could be assigned randomly multiple times, as determined from a random numbers table. The result might look like this: Blocks 1 1 2 1 2 2 (i.e., 6 blocks, each with 3 days). A more familiar way to represent this would be an AABABB design. Each block includes 3 days (but blocks could be any number of days) of the condition, so some of the practical issues and obstacles of changing conditions daily are circumvented. Also, randomization could be restricted so that each block appeared equally often.
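A minimal sketch of this restricted block randomization, assuming six 3-day blocks with three blocks per condition (the seed and block size are arbitrary choices for the illustration):

    # Randomly order six 3-day blocks of baseline (A) and intervention (B),
    # restricted so each condition contributes the same number of blocks.
    import random

    random.seed(42)                        # for a reproducible illustration only
    blocks = ["A"] * 3 + ["B"] * 3         # six blocks, three per condition
    random.shuffle(blocks)                 # restricted randomization: equal counts guaranteed
    schedule = [cond for cond in blocks for _ in range(3)]  # expand each block into 3 days
    print("block order:", "".join(blocks))  # e.g., something like AABABB
    print("day-by-day:", schedule)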
Why consider randomization as a feature of single-case designs? First, randomization strengthens the conclusions. The goal of research design is to reduce or eliminate the plausibility that influences other than the intervention could account for the
change. Those influences are threats to validity; they can be made implausible in single-case research even without using randomization. However, randomization strengthens the case.
Second and related, randomization might well augment the credibility and adoption of single-case designs. I mentioned before that in some classic writings on methodology, randomization is a defining feature of a true experiment (e.g., Campbell & Stanley, 1963; Cook & Campbell, 1979). Randomization helps make threats to validity implausible. The view of this book is that single-case designs are true experiments. For example, in experimental and applied single-case research, one can see variations of an ABAB design such as ABABABAB, i.e., many phases in which the criteria of the methodology (describe, predict, test) are obviously met. If behavior can be turned on and off, so to speak, demonstrating a causal relation and ruling out threats to validity reflect extraordinary clarity. Indeed, such clarity, in my mind, arguably exceeds the clarity of an RCT that shows some statistically significant differences between groups. Even so, adding randomization to single-case research would not only be genuinely helpful in addressing threats to validity but might be the spoonful of sugar that makes the single-case medicine go down.
Finally, randomization of how conditions are assigned or when interventions are introduced increases the range of statistical tests that can be applied to single-case research. Several statistical tests for single-case designs are not straightforward, and research has identified many ambiguities surrounding their use and yield (please see the appendix). Randomization opens a range of options for statistical evaluation, including a number of tests that are more familiar and better studied than those still being developed in single-case research (e.g., Edgington & Onghena, 2007; Onghena, 1994; Todman & Dugard, 2001).
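To convey the flavor of such tests, here is a sketch of a randomization test for an AB design in which the intervention start point was selected at random from a set of eligible days (in the spirit of Edgington & Onghena, 2007). The data, the actual start point, and the eligible range are all invented; note that the smallest attainable p value depends on how many start points the design's randomization allowed.

    # Randomization test for a randomly chosen intervention start point (fabricated data).
    import numpy as np

    data = np.array([3, 4, 3, 5, 4, 4, 7, 8, 7, 9, 8, 9], dtype=float)
    actual_start = 6                   # intervention actually began on day 7 (index 6)
    eligible_starts = range(3, 10)     # start points the randomization could have produced

    def mean_diff(start):
        return data[start:].mean() - data[:start].mean()

    observed = mean_diff(actual_start)
    null_dist = [mean_diff(s) for s in eligible_starts]
    p = sum(abs(d) >= abs(observed) for d in null_dist) / len(null_dist)
    print(f"observed difference = {observed:.2f}, randomization p = {p:.3f}")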
I would not expect randomization to be adopted extensively in single-case research. The ability to shift interventions within subjects is not so easy to do in applied settings (schools, hospitals). Moreover, shifting randomly or quickly could influence the impact of the intervention (e.g., diluting intervention effects if there is any obstacle in subjects being able to discriminate what condition is in effect). Also, flexibility within the design has been a core part of single-case research for making decisions about when to shift phases. All that said, randomization could be used more in single-case research, and there would be advantages to that (see Kratochwill & Levin, in press). The main advantage might be more widespread acceptance of the designs, and that in turn might lead to better and more frequent evaluation in many settings in which programs are used without any empirical basis as to their impact.

Integration of Research Findings

Research proliferates in so many areas that it is difficult to keep up and to integrate many studies on a given topic. Decades ago, one could read or write a review of the literature in which one sifted through all the available studies and drew conclusions about the knowledge, limitations, and so on. These were called qualitative reviews and still dominate many journals that publish reviews. A clear breakthrough was the addition of reviews that were quantitative, especially those based on meta-analysis. (There were other options, such as box-score counting of studies and whether individual studies supported one claim versus another.)
Meta-analyses have been adopted as a set of procedures that permit one to look at a body of research and to combine studies. The studies can be combined by translating the results of each individual investigation to a common metric referred to as effect size (please see the appendix for a further discussion). If 2, 10, or 100 studies can be identified on a given topic, and if they use slightly or completely different measures of outcome, the measures can still be converted to effect size. Once converted, results from the studies can be combined. The reviewer now can draw conclusions about the findings from many studies, quantify the strength of the findings, and even ask and answer questions that were not asked in the individual studies themselves (e.g., Do self-report measures yield different results from those obtained by direct observation? Does random assignment make a difference in the results?).
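A toy sketch of the conversion-and-combination idea, using the standardized mean difference as the common metric. The three "studies" are fabricated, and the simple sample-size weighting merely stands in for the inverse-variance weighting typical of actual meta-analyses:

    # Convert each (fabricated) study to an effect size, then pool across studies.
    import numpy as np

    # (mean_treatment, mean_control, pooled_sd, total_n) for three invented studies
    studies = [(12.0, 9.0, 4.0, 40), (30.0, 26.0, 6.0, 60), (8.0, 7.0, 2.5, 25)]

    effect_sizes = np.array([(mt - mc) / sd for mt, mc, sd, _ in studies])
    weights = np.array([n for *_, n in studies], dtype=float)  # crude n-based weights
    pooled = np.average(effect_sizes, weights=weights)
    print(f"per-study d: {np.round(effect_sizes, 2)}, weighted mean d = {pooled:.2f}")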
Single-case research has been largely excluded from the process of integrating many studies quantitatively. The reason is that there is no clear way to translate the results of single-case designs into a common metric such as effect size. Scores of ways of measuring effect size have been proposed. Indeed, over 40 different ways of calculating effect size have been identified for single-case research (Swaminathan et al., 2008). Obviously this prompts one to ask, “What is the problem here?” As I have noted in the discussion of data evaluation (Chapter 12, appendix), single-case data have special characteristics because of the collection of observations on the same subject over time (serial dependence). Also, the designs have such features as sometimes relatively short phases and many occasions in which the effects of the intervention are replicated (e.g., changes across phases in an ABAB design or across subjects, behaviors, and settings in a multiple-baseline design). The net effect of these and other considerations means that effect size and its computation and interpretation are not at all straightforward. Based on several comparisons and studies (see appendix), no method of effect size has emerged for widespread use in single-case research. Several issues need to be addressed and resolved. Researchers are working on the topic, and one can hope that further developments will yield acceptable alternatives.
The integration of single-case research is critically important. Bodies of research are being neglected. As importantly, occasionally the methodology used to study a phenomenon influences the conclusions. We would want to know that, and we can know it only by including in any quantitative review studies that rely on different methods, where a method means assessments, experimental designs, and data analyses. As with randomization, use of effect size measures ultimately will facilitate acceptance and integration of single-case studies.

SUMMARY AND CONCLUSIONS

In single-case designs, several problems may emerge that compete with drawing clear conclusions about the effects of the intervention. Major problems common to each of the designs include ambiguity introduced by trends and variability in the data, particularly during the baseline phases. Baseline trends toward improved performance may be handled in various ways, including continuing observations for protracted periods, using procedures to reverse the direction of the trend (e.g., a brief period of actively fostering the opposite behavior that was the focus of the intervention phase), selecting designs that do not depend on the absence of trends in baseline, or using statistical techniques that take into account initial trends.
Excessive variability in performance may obscure intervention effects. The appearance of variability can be improved by blocking consecutive data points and plotting averages (means for that block of days) rather than plotting day-to-day performance. Of course, it is desirable, even if not always feasible, to search for possible contributors to variability, such as characteristics of the assessment procedures (e.g., low interobserver agreement) or the situation (e.g., variation among the environmental stimuli, activities, and people present in the setting).
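A minimal sketch of the blocking procedure, assuming twelve days of invented observations averaged in consecutive 3-day blocks:

    # Smooth day-to-day variability by plotting block means instead of daily points.
    import numpy as np

    daily = np.array([4, 7, 3, 6, 8, 4, 9, 6, 8, 7, 10, 7], dtype=float)
    block_size = 3
    block_means = daily.reshape(-1, block_size).mean(axis=1)  # one value per 3-day block
    print(block_means)  # [4.67 6.   7.67 8.  ] approximately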
A major issue for single-case research is deciding the duration of phases, an issue that encompasses problems related to trend and variability. It is difficult to identify rigid rules about the minimum number of data points necessary within a phase because the clarity and utility of a set of observations is a function of the data pattern in adjacent phases. Occasionally, objective criteria have been specified for deciding when to shift phases. Such criteria have the advantage of reducing the subjectivity that can enter into decisions about shifting phases. However, single-case designs depend on making decisions about what the data show and how well they describe, predict, and test predictions about performance. These criteria require making decisions based on the data and not on preset criteria. Compromises are possible by allowing flexible rules for changing phases.
Another issue is the range of questions about intervention effects that can be addressed easily by single-case research. Among the many intervention outcome questions that serve as a basis for research, single-case designs are well suited to the intervention package strategy, that is, investigation of the effects of an overall intervention and comparison of that intervention with no treatment (baseline). Dismantling, parametric, constructive, and comparative intervention strategies raise potential problems because they require more than one intervention given to the same subject. The prospect and effects of multiple-treatment interference may lead to ambiguity about the relative merits of different interventions or variations of the same intervention. Intervention-moderator and intervention-mediator/mechanism strategies focus on characteristics that interact with the intervention or explain precisely for whom and why the intervention achieved changes. These strategies are evaluated in between-group research. Single-case studies cannot identify moderators as readily as can between-group research. Both single-case and between-group designs can identify mediators. Yet, as I noted previously, interest as well as advances in statistical techniques have fostered such work almost exclusively in between-group studies.
The generality of results from single-case research is also an issue. Concerns often have been voiced about the fact that only one or two subjects are studied at a time. This immediately raises the question about the extent to which findings extend to other persons or to a larger group. Actually, there is no evidence that findings from single-case research are any less generalizable than findings from between-group research. In fact, because of the type of interventions studied in single-case research, the case is sometimes made that the results may be more generalizable than those obtained in between-group research. While that is arguable, it conveys that generalizability is not automatically better as a function of methodology, that is, single-case or between-group. There is nothing inherent in the designs, including the use of one subject, that makes the results less generalizable. The generality of findings from group research is often assumed because the research includes many people. This assumption is easily
challenged based on how group studies analyze data. Such studies rarely look at the extent to which individuals changed or are represented by mean changes for the overall group. Also, how subjects are selected for inclusion (e.g., nonrandom selection, further exclusion and inclusion criteria) in between-group research can further limit the generality of the findings.
The area in which generality is a problem for single-case research is the investigation of the variables or subject characteristics that contribute to generality, i.e., moderators. In single-case research, it is difficult to evaluate interactions between treatments and subject characteristics. Statistical analyses especially well suited to group research (e.g., factorial designs, multiple regression analyses) are more appropriate for such questions and address the generality or external validity of the results directly. For all research, generality is better assured through replication of the effects of the intervention across subjects, situations, areas of functioning (e.g., academic performance, clinical problems), and other dimensions of interest. Direct and systematic replication were discussed to illustrate the ways in which this can be accomplished.
Finally, the chapter discussed randomization and quantitative integration of multiple studies (meta-analyses and effect size). These are two topics central to research in the quantitative tradition. They can also play a role in single-case research. Advantages of integrating these concepts, procedures, and practices were discussed.
CHAPTER 15

Summing Up: Single-Case Research in Perspective

CHAPTER OUTLINE

Characteristics of Single-Case Research
Essential Features of the Designs
Associated but Non-essential Features of the Designs
Focus on One or a Few Subjects
Focus on Overt Behavior
Use of Visual Inspection
Psychological or Behavioral Interventions
General Comments
Special Strengths of Single-Case Designs
Evaluation
Ongoing Feedback While the Intervention Is Applied
Tests of Generality: Extending Interventions
We Care About Individuals
Multiple Approaches Are Essential for Understanding
Levels of Analysis
Multiple Methodologies
Closing Comments
The individual subject has been used throughout history as the basis for drawing inferences both in basic and applied research, as highlighted in the introductory chapter of the book. Development of single-case designs as a distinct method of experimentation has emerged relatively recently. The various designs discussed in previous chapters provide alternative ways of ruling out or making implausible threats to validity. They constitute true experiments and reflect a methodology squarely in the realm of science.
This final chapter provides a perspective that clarifies essential features of the designs that permit their broad applicability. Also, it is important to consider the designs in the contexts of other approaches to research. Different methodologies include different levels and types of analysis of a given phenomenon. Single-case is one design tradition
and can be viewed in the context of others (quantitative between-group research, qualitative research). The perspectives convey why any single methodology or tradition is inherently limiting and how valuable it would be to take advantage of all the methods to address the many crucial questions and challenges before us.

CHARACTERISTICS OF SINGLE-CASE RESEARCH

Single-case designs have been intertwined with a substantive focus and specific area of investigation within psychology. That focus is the experimental and applied analysis of behavior, which includes the methodology of single-case experiments as well as a conceptual and research focus that emphasizes operant conditioning. The designs have been extended to many areas of work and to many disciplines including education, clinical psychology, psychiatry, medicine, business and industry, counseling, social work, law enforcement and corrections, among others. The common feature of these many areas is the interest in developing interventions that make a difference to some facet of human functioning. The scope of the interventions has expanded greatly and well beyond the conceptual and research focus from which single-case designs emerged.

Despite the extension of the methodology to diverse disciplines and areas of research, the tendency exists to regard single-case designs as restricted to focusing on behavior analysis and operant conditioning in laboratory and applied settings. This is a completely understandable view because of the pairing of methods (single case) and substance (operant conditioning) in thousands of studies over a period spanning decades. Excellent work in basic and applied areas continues with these methods and substances paired.¹ However, the pairing may hamper extension of single-case designs in the many contexts in which they could be used. It is useful to discuss the essential and defining features of single-case designs and to distinguish those from characteristics with which the designs are often associated.

Essential Features of the Designs

By essential features of the designs, I refer to the defining ingredients. Two characteristics are defining features of single-case designs. First, the designs require continuous assessment over time. Measures are administered on multiple occasions within separate phases. Continuous assessment is used as a basis for drawing inferences about intervention effects. Patterns of performance can be detected by obtaining several data points under different conditions. I mentioned at the outset of the book that a useful way of differentiating between single-case and between-group methods is to remember that single-case designs usually assess few subjects on many occasions and between-group research usually assesses many subjects on few occasions. There are exceptions and combinations that make this only a useful mnemonic rather than a rule.

1 Prominent journals that publish research in these areas are the Journal of the Experimental Analysis of Behavior and the Journal of Applied Behavior Analysis; and the professional organizations in which proponents of single-case research are especially active include Division 25 of the American Psychological Association and the Society for the Advancement of Behavior Analysis.

Second, intervention effects are replicated within the same subject over time.2 Subjects serve as their own controls, and comparisons of the subject's performances are made as different conditions are implemented over time. Of course, the designs differ in the precise way in which intervention effects are replicated, but each design takes advantage of continuous assessment over time and evaluation of the subject's behavior under different conditions. The replication or within-subject study of the intervention addresses the logic and requirements of the design, namely, to describe, predict, and test predictions across phases.

These two characteristics are the basics and serve to convey how the designs rule out threats to validity, demonstrate causal relations, and build a knowledge base. There are many specific designs and design combinations in which these characteristics are structured. However, the essential components of the methodology are small in number.
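To make these two features concrete, here is a minimal sketch in Python with entirely hypothetical data (not from any study in this book). One subject is assessed repeatedly across the four phases of an ABAB design; the within-subject replication is visible when the shift from A1 to B1 recurs from A2 to B2.

```python
# Hypothetical ABAB data for one subject: many observations, few subjects.
scores = {
    "A1": [8, 9, 8, 10, 9],   # baseline: repeated observations across sessions
    "B1": [5, 4, 4, 3, 3],    # intervention
    "A2": [7, 8, 8, 9],       # return to baseline
    "B2": [3, 2, 3, 2, 2],    # intervention reinstated (within-subject replication)
}

for phase, data in scores.items():
    mean = sum(data) / len(data)
    print(f"{phase}: {len(data)} observations, mean = {mean:.1f}")
```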

Associated but Non-essential Features of the Designs

Several other characteristics often are associated with single-case designs but do not necessarily constitute defining characteristics. These are important to mention briefly to dispel misconceptions about the designs and their applicability.

Focus on One or a Few Subjects. Perhaps a characteristic that would seem to be central to the designs is the focus on one or a few subjects. After all, the designs are often referred to as "small-N research," "N-of-one research," or "single-case designs," as in the present text. Certainly it is true that the designs have developed to study the behavior of individual subjects intensively over time. However, investigation of one or a few subjects is not an essential or necessary feature of the methodology. The designs refer to particular types of experimental arrangements.
The number of subjects included in the design is somewhat arbitrary. So-called single-case research can use a group of subjects (e.g., in a community, in a state or province) in any design (e.g., ABAB) in which the entire group is treated as a subject or in which the number of subjects who engage in the behavior (e.g., recycle, use seat belts, pay their bills on time) is the outcome measure of interest. Also, one can use several different groups in one of the designs (e.g., multiple-baseline design across classrooms, schools, families, or communities). Single-case designs have evaluated interventions in which multiple schools, classrooms, and students participate and in which the actual or potential subjects included hundreds, thousands, or even more than a million subjects (e.g., Cox et al., 2000; Fournier et al., 2004; McSweeney, 1978; Parsons, Schepis, Reid, McCarn, & Green, 1987; Schnelle et al., 1978). In some studies, the number of subjects is not even known, as for example when monitoring compliance with a law (stopping at a stop sign) over a several-day period in which instances of an event (stopping) are counted without knowing how many different subjects or repeat subjects are included. In short, although single-case research can be and usually has been employed with one or a few subjects, this is not a necessary characteristic of the designs.

2 An exception to the replication of intervention effects within the same subject is the multiple-baseline design across subjects. In this instance, subjects serve as their own control, in the sense that each subject represents a separate AB design, and the replication of intervention effects is across subjects.

Focus on Overt Behavior. Another characteristic of single-case research has been the evaluation of the impact of interventions on overt behavior. The data for single-case research often consist of direct observations of performance. The association of single-case research with assessment of overt behavior is easily understandable from a historical standpoint. Single-case research grew out of the research on the behavior of organisms (e.g., B. F. Skinner, 1938). Behavior was defined in experimental research as overt performance on such measures as frequency or rate of responding (e.g., number of times a lever was pressed). The lawfulness of relations with different experimental manipulations and the similarities among different species (humans and non-humans) were easily seen in this laboratory paradigm.
As single-case designs were extended in applied settings (e.g., schools, hospitals, nursing homes), assessment of overt behavior has continued to be associated with the methodology. Yet single-case research designs are not necessarily restricted to overt performance. The methodology does require continuous assessment, and any measures that can be obtained to meet this requirement can be employed. Measures other than overt performance can be found in single-case investigations. For example, self-report and psychophysiological measures have been included in single-case research (e.g., Glenn & Dallery, 2007; Twohig et al., 2007; Warnes & Allen, 2005). Also, how one feels about oneself, one's mood, one's ability to be in control of one's life, and similar non-behaviors are no less important in life and can be evaluated in single-case designs. It was once thought that self-report measures might not be useful or valid when continuously administered over time. That has long been dispelled by very well-developed measures completed by clients or therapists (e.g., Lambert et al., 1996). Also, self-report measures obtained by asking clients very specific questions about what happened in a confined period (e.g., the past 24 hours) have been of demonstrated value in many studies (e.g., Chamberlain & Reid, 1987; Peterson et al., 2002). In any case, the assessment of overt behavior is not a necessary characteristic of single-case research. The designs require observations over time but not one method of assessment (e.g., direct observations) in particular.

Use of Visual Inspection. Another characteristic of research that would seem to be pivotal to single-case designs is the evaluation of data through visual inspection rather than statistical analyses. A major purpose of continuous measurement over time is to allow the investigator to see changes in the data as a function of stable patterns of performance within different conditions. Certainly as the designs started to enjoy more frequent use in applied settings, proponents made a strong case for visual inspection as a crucial characteristic of the methodology (Baer, 1977). That case was based on filtering out weak interventions to ensure that something with potent effects would be readily apparent. No doubt many proponents would see visual inspection as being in the "essential" features category. However, there is no necessary connection between single-case research and visual inspection of the data.
Single-case designs refer to the manner in which the experimental situation is arranged to evaluate intervention effects and to rule out threats to validity. There is no fixed or necessary relationship between how the situation is arranged (the experimental design) and how the resulting information is evaluated (data analysis). Statistical analyses have been applied to single-case investigations. Although visual inspection continues to be the primary method of data evaluation for single-case research, this is not a necessary connection.

Psychological or Behavioral Interventions. A final characteristic is that single-case designs are used to investigate interventions derived from psychology and specifically learning (operant conditioning). As I mentioned, operant conditioning and single-case designs developed together, and the substantive content of the former was inextricably bound with the evaluative techniques of the latter. This connection has continued, as noted in the journals identified previously (please see footnote 1). Even so, there is no necessary connection between single-case designs and operant conditioning techniques.

A number of different types of interventions derived from clinical psychology, medicine, pharmacology, social psychology, and other areas not central to or derived from operant conditioning have been included in single-case research, as illustrated in examples in prior chapters. It is important to emphasize the breadth of applicability because the designs are relevant to virtually all situations in which the goal is to alter some facet of functioning. For example, there is keen interest in promoting sustainable environmental practices to mitigate and adapt to climate change. Interventions draw from many areas, with direct attempts to alter the public's perception of risk from global warming, to educate and provide instruction, to alter attitudes, to influence by drawing on economics and decision making, and to implement social policy changes (Kazdin, 2009). The area is rich in interventions ripe for empirical evaluation, and single-case designs would be quite useful there. The designs can be used with interventions drawn from other disciplines with little or no knowledge of operant conditioning.

General Comments

Many arguments about the utility and limitations of single-case designs focus on features not central to the designs. For example, objections focus on non-statistical data evaluation, the use of only one or two subjects, and restricting the evaluation to overt behavior. These features are part of single-case methodology, but they are not essential features. The designs can be used without them. I am not in any way advocating that these nonessential features be abandoned. Just the opposite. Our knowledge base is greatly enhanced by drawing on a broader array of research methods than we currently use. The essential and nonessential features of single-case designs can greatly expand our approaches, the domains of what we study (individual functioning), and the situations in which we can introduce evaluation. I mention the features as not essential because those trained in between-group research can easily identify one of the nonessential features and cast aside the entire methodology as not useful, rigorous, or indeed scientific. For example, the "failure" of single-case designs to use statistics routinely to decide whether the effect of an intervention is reliable sounds like scientific heresy and all by itself could taint the methodology for some. However, the use of single-case designs does not require automatic adoption of visual inspection rather than statistical analyses, or of the other facets I have noted here as nonessential. I would encourage adoption of the package of essential and nonessential components, but not if these held back researchers from exploring the methodology.
Another source of objection to single-case research has been the association of the methodology with a conceptual focus and area of work. Within psychology, operant conditioning (conceptual domain) and the analysis of behavior (approach to evaluation including single-case methods) have elaborated and continue to elaborate a range of influences on behavior in humans and non-human animals. Operant conditioning underscored the role of environmental influences (e.g., stimuli, consequences) on functioning. One of the phrases to capture portions of the model noted that our behavior is selected by the consequences it has on the environment. The position never was that "only the environment and consequences are important and nothing else." However, it was easy to isolate operant conditioning as it moved to explain human behavior, culture, government and law, and more domains from the standpoint of the model (e.g., Skinner, 1953a). Operant conditioning, years ago, was labeled "radical behaviorism" because of its heavy focus on environmental antecedents and consequences as key influences on behavior. Little or no attention or emphasis was accorded other influences such as cognitive processes (e.g., thoughts, beliefs, attributions, perceptions), emotions (e.g., mood, affect), characteristics of the individuals (e.g., personality, temperament), and biological underpinnings (e.g., brain and now genetic influences). One could quibble at the margins about exceptions, but these were exceptions and definitely perceived as exceptions. While operant conditioning in both basic and applied domains flourished, it also became isolated, in part because of the model and the view from the outside that operant conditioning too strongly rejected these other areas of functioning. In more recent years these other influences have been accorded much more attention and integration, including, for example, cognition and neuroscience (e.g., Timberlake, Schaal, & Steinmetz, 2005; White, McCarthy, & Fantino, 1989). Even so, early minimization of cognition, emotion, individual differences, and biological underpinnings led to isolation and lack of integration of the many contributions of operant conditioning into much of mainstream psychology. The net effect is that rejection of a "radical" approach to human functioning has also diminished appreciation of a methodology that has no necessary connection to any particular conceptual view. It would be unfortunate if investigators eschewed a methodology with potentially broad utility because of historical antipathy toward a particular theoretical position that need not be embraced. (It is for another book to lament the rejection of a theoretical position because of its prior characterization as unusually narrow. That theoretical position has generated very effective intervention approaches that are rarely taught.)

SPECIAL STRENGTHS OF SINGLE-CASE DESIGNS

Evaluation
I have mentioned that our world is filled with programs and interventions designed to help people. Special education schools, classrooms, and teachers, for example, develop innovative interventions or variations all of the time to have impact on students. In schools (elementary through university), interventions are aimed at improving academic performance and participation in some activities (e.g., engaging in sports and volunteer activities) and reducing participation in others (e.g., eating unhealthful foods, abusing alcohol, harassing and assaulting others). In hospitals, interventions are aimed at reducing medical errors and increasing safety practices to reduce the spread of illness while patients are in the hospital. In business, interventions are designed to improve health (via exercise programs), productivity, safety, and morale. Indeed, in virtually all institutional settings, key facets of the day-to-day routine involve interventions with a goal in mind. In everyday life, interventions are directed at improving compliance with the laws (e.g., speed limit, use of safety belts) and our support of broad social goals (e.g., not littering, behaving in ways that promote a sustainable environment). Invariably, new interventions (programs, plans, efforts, initiatives) are designed to advance the goals in these and other contexts. Do any of these interventions have any impact, and is that impact in the intended direction, or do they make people worse? These questions cannot be answered suitably by anecdotal reports of what the program developer or consumers believe. One really needs systematic data.
Evaluating such programs in between-group studies with pre- and post-intervention testing, random assignment of individuals to conditions (e.g., including a no-intervention control group), and other features of traditional group comparison studies (e.g., invoking exclusion and inclusion recruitment criteria, using large samples) usually is not possible. Single-case designs, including quasi-single-case designs, provide viable alternatives. Continuous data collected over A and B phases (e.g., AB design) can be collected across different students, hospital units, and departments within an organization. Of course, if the implementation of the intervention (B) is staggered so each individual, unit, or section does not receive the intervention at the same time, this becomes a multiple-baseline design. A key feature of the designs, many observations but few subjects, can be very helpful in evaluating impact, that is, whether there is a change and whether the intervention is likely to account for that change. Most settings have a group already in place that can serve as the subject; that is, the group can be evaluated as if it were an individual. What is plotted as the data is the performance of the group as a whole (e.g., percentage of children in the school who complete their homework, number of families or homes that recycle). In short, single-case designs permit a broader range of opportunities to evaluate what we do.
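As a sketch of this logic, the short Python example below uses invented units and invented stagger points (nothing here comes from a real program). Each unit's record is an AB series, and staggering the week in which B begins is what turns the set of AB series into a multiple-baseline design.

```python
# Hypothetical multiple-baseline layout: three hospital units, intervention (B)
# introduced in a staggered fashion across 12 weeks of continuous assessment.
weeks = range(1, 13)
start_of_b = {"Unit 1": 4, "Unit 2": 7, "Unit 3": 10}  # assumed stagger points

for unit, start in start_of_b.items():
    phases = ["A" if week < start else "B" for week in weeks]
    print(unit, " ".join(phases))
```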

Ongoing Feedback While the Intervention Is Applied

In single-case research, the continuous feedback from the data and the fluid decision making while the program is in play have distinct advantages. In between-group research, the intervention is preplanned and administered in keeping with that plan. The impact of treatment is evaluated at the end when the full treatment has been delivered (posttest assessment). This makes sense for research but not for people receiving the intervention. Single-case designs allow for evaluation of impact while the intervention is in place. We can evaluate whether the intervention is achieving change and whether the change is at the level we desire or need. Decisions can be made during the intervention to improve outcome. The continuous assessment during the intervention phase makes the designs quite user friendly for the investigator (teacher, doctor, or other person responsible for the intervention) and the client (person or group intended to benefit). If something is not working or not working sufficiently well, the investigator can make the change and continue to evaluate whether change comes about right away, without waiting to see mediocre or no impact at posttest.
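One way to picture this ongoing use of the data is a running check during the intervention phase. The sketch below is illustrative only: the numbers are invented, and the 25% criterion and three-observation window are arbitrary assumptions, not standards from the single-case literature.

```python
def needs_modification(baseline, intervention, k=3, criterion=0.25):
    """Flag the program if the last k intervention observations have not
    moved from the baseline mean by at least the stated proportion.
    (Illustrative rule only; criterion and window are assumptions.)"""
    baseline_mean = sum(baseline) / len(baseline)
    recent_mean = sum(intervention[-k:]) / k
    change = abs(baseline_mean - recent_mean) / baseline_mean
    return change < criterion

baseline = [20, 22, 21, 23]          # hypothetical problem-behavior counts
intervention = [21, 20, 19, 20, 19]  # little movement so far
print(needs_modification(baseline, intervention))  # True -> consider altering the program
```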

For example, in one project the goal was to train six boys and girls (6 to 7 years of age) to not play with handguns (Miltenberger et al., 2004). A real but disabled handgun was used; assessments were completed at home and at school while the child was left alone with the gun. Assessments were videotaped and later scored for the extent to which each child engaged in the appropriate behaviors on a 0-to-3-point scale in which 0 = touching the gun and 3 = not touching the gun, leaving the room, and telling an adult about the gun. In a multiple-baseline design across children, a behavioral skills training program was used that provided instructions, modeling, rehearsal, and feedback, all conducted in a simulated training situation rather than at home or at school. A few sessions of this were very effective in altering the behavior of three of the six children; the effects of training carried over to home or school.
Figure 15.1 shows the data for the three children who responded positively (Nigel, Brigitte, and Ned) in a multiple-baseline fashion in which change occurred when the intervention was introduced but not before. The data also show that when this training was introduced, three other children did not respond. A second intervention was added in which more intensive practice and rehearsal training were added and conducted in the school setting (in the situation, or in situ, training). The figure shows that two of the remaining three children (Ricky, Tina) responded well to this enhanced intervention. To alter the behavior of Jake, a third condition was provided, namely, an incentive. He could receive a treat for rehearsing the correct behaviors. Jake too achieved the criterion of engaging in the appropriate behaviors. A final 5-month follow-up assessment showed that the appropriate behaviors were maintained in the home, where the assessment was then conducted.
In many ways, this demonstration illustrates a very special strength of single-case designs. Interventions were implemented and evaluated. Decisions were made based on the data. New interventions were added to achieve the desired outcome. Pre-post data from a between-group study might have shown that the first intervention (behavioral skills program) worked (a statistically significant difference if compared to a no-intervention control condition). Yet we see that the intervention would have left a significant proportion of people stranded, that is, with no change in the desired outcome.
Apart from the information provided, single-case designs allow for the gradual or small-scale implementation of the intervention. With one or a few cases, one can implement the intervention and see in a preliminary way whether it is having an effect. This allows the investigator to modify the intervention on a small scale if needed before applying the intervention to the entire class, school, or other larger-scale setting. The investigator is posing and answering a set of questions: "Does the effect of the intervention look promising? If so, let us continue and extend the intervention to others. Alternatively, should the intervention be altered or changed completely?" If there is a strong intervention effect in the small-scale application with one or a few subjects, this does not necessarily mean that the effect will extend across all subjects or baselines. But the point here is that first starting out on a modest scale, across one phase for one or two individuals (ABAB) or across one baseline (in a multiple-baseline design across individuals, situations, or responses), helps the investigator preview the impact of treatment as well as master implementation and some of the practical issues that may relate to its effectiveness.
[Figure 15.1 appears here: multiple-baseline graphs for six children, with phase labels including baseline and BST; x-axis: Sessions 1-24.]
Figure 15.1. Rating scale scores (derived from videotapes of child behavior) in a multiple-baseline design across six children. Behavioral skills training (BST) was implemented and effective for three of the children (Nigel, Brigitte, Ned). In situation (in situ) training was introduced and effective for two of the three children (Ricky, Tina) who had not responded to BST. Finally, an incentive was added and was effective in altering the behavior of the child (Jake) who had not responded to the previous interventions. For all children, a 5-month follow-up assessment in the home showed that the behaviors were maintained. (Source: Miltenberger et al., 2004.)

Tests of Generality: Extending Interventions
In the previous chapter I mentioned moderators, those variables that influence the effectiveness of an intervention. Actually, they can influence the direction of the effect (e.g., some people get better or worse) or the magnitude of the effect (e.g., some get a little better, some a lot better). Between-group research evaluates moderators well because subgroups (by ethnicity, by sex or identity, by severity or duration of some characteristic) can be identified in the data analysis and one can analyze whether the subgroups respond differently. Moderators refer to variables that affect the generality of a result: Do the results generalize across all subjects, conditions, contexts, and situations?
There is an important way in which single-case designs study generality.
First, the generalizability of a finding from the group to the individual is always a question. Between-group research, for example, focuses on means rather than on individuals. We may know that a treatment is better overall, but will the finding be true of, or generalize to, any particular individual? For example, an evidence-based intervention developed in special education might well be good for a particular group. Now there is interest in helping a child in a very different context. No matter how well established an intervention, there is never a guarantee that it will work with any particular individual. Continuous evaluation and use of single-case experimental or quasi-experimental designs permit evaluation as the intervention is administered to individuals and test the generality of a finding from the group to the individual.
Second, whether a finding is from between-group or single-case designs, it may have been restricted to a particular group of people (e.g., of a certain age or ethnicity). One can do large-scale studies with different groups. Actually, that is not very feasible. For example, evidence-based psychotherapies rarely have been tested with the diverse ethnic groups in the United States.3 As I have shown elsewhere, it would be impossible to do the necessary between-group studies to test the available treatments with each ethnic group and across a range of clinical problems (Kazdin, 2008a). Alternatively, one can begin to apply the intervention to other, novel groups not included in the original demonstrations, whether group or single-case, to see if the effects are similar. Single-case designs would be one viable means to accomplish this with small-scale extensions to test the generality of findings. Consider the use of single-case designs to test generality as probes. These were assessments to test for generality within a study. We want to see how findings from research extend to other groups.
Third, generalizability across conditions is very important. Whether focusing on the individual or the group, often one wants to know whether a change achieved in some setting (e.g., home or military base) extends to other places (e.g., school, music lesson, or battlefield). The use of probes as part of assessment when such questions arise is another strength of the designs. Probes, or occasional and intermittent assessments, can complement the continuous data and provide information about whether changes extend to other settings and contexts or whether some other intervention might be needed in those settings.

3 In North America (Canada, Mexico, United States), there are hundreds of ethnic and cultural groups and, of course, hundreds more elsewhere in the world (www.infoplease.com/ipa/A0855617.html).

Analyzing moderator effects is difficult in single-case research, as noted before. However, extending the intervention to other samples and conditions and testing whether the effects continue to be evident are readily achieved by single-case designs. Indeed, the gradual extension of an intervention across samples (e.g., one or a few individuals) is a very feasible way to proceed. Single-case tests of generality are not instead of, better than, or replacements for other tests that might be completed in group studies. Rather, single-case designs increase our options and the range of contexts in which generality can be evaluated.

We Care About Individuals


The special strengths mentioned previously address methodological and substantive issues about what we can know (e.g., effects of a program) and how we can test generality to others. I would like to add to the special strengths one that is laced with empathy, concern for others, and the priorities of our daily lives. I noted how single-case designs can evaluate groups (e.g., in classrooms or schools, the community at large), but let us return to the focus on the individual. A unique strength of single-case designs is providing a way to evaluate change and the impact of interventions on a particular person. This is very important, as many of us have experienced or will experience in life. For many questions that guide our everyday life, we care very much about the individual. It is interesting to us personally and intellectually to ask about the group data. For example, we hear about a new treatment (e.g., for obesity, diabetes, blood pressure, or hair loss). Does it work? Is a change with the treatment real change, or is it another one of those bait-and-switch television ads that says "clinical evidence shows" (which does not mean randomized controlled trials), where we see two models, one cast as "before" and one cast as "after" receiving/taking the intervention?
I note the obvious, namely, that we owe so many advances in clinical work (e.g., psychological and medical science) to between-group research methods, to help underscore the point about individuals. In our daily lives it is about individuals—ourselves, our loved ones, and our friends—and not about group data. As an illustration, my annual physical exam could easily turn into a methodological brawl. (I take heavy medication and have my impersonal trainer with me in the waiting room to help me restrain myself.) Once in the room with my physician, toward the end of my 9-minute appointment, we have an exchange that is pretty much like this:

My Physician: "You probably ought to have that medical test in about 5 years, just to make sure you don't have... [reader—insert your favorite serious disorder—my physician rotates several variations of cancers, heart disease, diabetes]."

Me: "That sounds serious; maybe I should have the test now."

My P: "Actually the data (he means group) show that you probably do not need that medical test because the rate of that problem is pretty low at your age right now and for most people does not pick up for a few more years."

Me: "Just speaking generally, is it possible that I am one of the cases in the group that gets that disease on the early side?"

My P: "Yes, of course, but not very likely."

Me: "I would really like the test, because what happened to the big group might not be what happens to me. Also, the test doesn't hurt (I am a medical coward) and the information could help with one of my personal priorities (staying alive)."

My P: "Uh—you are probably fine without it, but in 3 or 4 years it would be pretty important. Well, I see that my next appointment is here—God willing, let's continue this discussion at your next physical."

Although not quite relevant, the astute reader will note how I refrained from challenging the group data. I held back on, "Were the findings replicated, what ethnic and cultural groups were included, were there any moderators, were the findings based on one- or two-tailed statistical tests, what precisely was the miss rate, that is, the rate of not identifying people my age whose problem went undetected and who happened to die," and so on. (Actually, I have not really restrained myself—each of these questions in various combinations with the others has been asked at least twice—I vary the order with the hope of reducing multiple-question interference.)
To return to the point: I (and I expect you) respect the group data—some of my best friends even collect the stuff—but these data do not tell me whether I will be one of those cases and a person who would have profited from taking the medical test a little early. Single-case designs cannot solve my problem, and my physician is a superbly well-trained between-group guy. However, my story conveys the point. Critical questions—life and death questions—are often, if not almost always, about individuals. Consider a better example to convey the point.
Cancers that spread (metastatic cancers) have a poor prognosis. Understandably, people with such cancers experience depression, stress, anxiety, and related problems that impair quality of life, life satisfaction, and compliance with cancer treatment. Any intervention that could help manage depression and ease their course would be remarkable. In this example, cognitive behavior therapy was provided to women (ages 42 to 66) with breast cancer (one also had ovarian cancer) with metastases to at least one other site (e.g., liver, bone, lung, brain) (Levesque et al., 2004). The goal was to reduce depression and to develop an optimistic but realistic attitude toward their situation, as opposed to negative thinking (e.g., only about death) or overly positive thinking (e.g., hoping to be cured). Eight individual weekly sessions and three booster sessions of treatment were provided. Many measures were administered to assess depression, suicidality, anxiety, and quality of life. Two measures of depression are presented here. Figure 15.2 presents the group data on a clinician-completed measure of depression (The Structured Interview Guide for the Hamilton Depression Rating Scale) obtained from an interview of approximately 2 hours. The clinician was "blind" (naive) to the procedures and goals of the study. Each bar in the graph represents the mean for the cases before, during, and after treatment and then again at two follow-up assessments. The group data convey rather clearly that there was a change on a clinician rating measure of depression over the course of treatment.
If I were one of the patients or a relative of one of the patients, I would not be very interested in the group data. We want to know about any intervention effects for the individual. Quality clinical care depends on that.

[Figure 15.2 appears here: bar graph of mean depression scores at pre-, mid-, and post-treatment (n = 4) and at two follow-ups (n = 3); x-axis: Time.]

Figure 15.2. Mean depression scores from clinician interviews (The Structured Interview Guide for the Hamilton Depression Rating Scale; SIGH-D) for the group (N = 4, who completed treatment) at pre-, mid-, and post-treatment and two follow-up periods. Follow-up indicated depression higher than during treatment but still well below baseline. (Source: Levesque et al., 2004.)

Figure 15.3 presents the individual data for another standardized measure of depression (Hospital Anxiety and Depression Scale) that is specifically designed to assess depression among patients with physical illness (e.g., by removing somatic items that could be confused with manifestations of the physical illness). The figure shows the four participants who completed treatment (Participant 3 had to be withdrawn for medical complications). Treatment was staggered in the multiple-baseline fashion. The focus here is beyond the criteria for visual inspection: we see that the individual data show the effects of the intervention and the follow-up assessments. After the intervention, depression for some cases had decreased to the lowest points possible. This is important to know and important as a basis for seeing if the patients who did not respond so well require further attention.
We very much need single-case methods, experiments, and quasi-experiments to provide systematic information about the effects of interventions on individuals. Clinical judgment is not up to the task of providing the data we need in order to help, a point to which I return. Group studies, so essential in identifying and evaluating interventions, by themselves do not provide the proper tool to help us evaluate individuals in an ongoing way, to chart the effects of our interventions, and to help decide whether we ought to continue what we are doing or try something different. Single-case designs have as their strength the ability to do all of these.

MULTIPLE APPROACHES ARE ESSENTIAL FOR UNDERSTANDING

Single-case designs and their contribution can be showcased better by placing them in two broad contexts. These include the different levels of analysis of social research and the different methodologies, including between-group and qualitative research.
[Figure 15.3 appears here: weekly HADS-D scores for four participants in multiple-baseline fashion, with phases labeled Baseline, Treatment, and Boosters and 6-month follow-up; x-axis: Weeks.]

Figure 15.3. Weekly scores on a patient-completed measure of depression (Hospital Anxiety and Depression Scale; HADS-D for the depression items). After baseline, the treatment phase included weekly cognitive behavior therapy sessions; further sessions were provided during the booster period 3 months later; follow-up included no intervention. (Participant 3 had to withdraw due to medical complications.) (Source: Levesque et al., 2004.)

Levels of Analysis

By levels of analysis I refer to the scale of investigation. For example, in learning research, one can focus on individual neurons or molecular changes in the brain. At another extreme, one can study learning by asking individuals to recall material they have read. Both levels of analysis focus on learning, but of course they are considerably different in their focus.
Extend the notion of levels and one can see the need for and role of different methodologies. In relation to research on human and non-human animal functioning, many questions are of interest at the level of the individual subject or case. Single-case designs can answer questions about what is effective and whether change has come about for one or more individuals. This is very critical. For example, in the context of psychotherapy, several evidence-based treatments have emerged. These are psychosocial interventions that have been strongly supported in empirical research. Exposure therapy for anxiety, cognitive therapy for depression, and parent management training for conduct problems in children are three of many examples (see Nathan & Gorman, 2007; Weisz & Kazdin, 2010). In clinical practice with individual patients, assume that an evidence-based treatment is used. Will the treatment help a particular patient? We cannot really know, because effective treatments do not always work, whether in medicine or psychology. Can we just ask the therapist about his or her opinion of whether it worked? We could, but research is not kind about the credibility of the information we are likely to obtain. Cognitive processes and perception lead us to misjudge often. In fact, therapists often are simply inaccurate in their evaluations of the impact of their treatments on patients (Love, Koob, & Hill, 2007). It would be helpful to have a methodology that can evaluate whether changes occur with individual patients. The quality of clinical care could be greatly improved (Borckardt et al., 2008; Kazdin, 2008b). Single-case experiments and quasi-experiments provide such a methodology and could contribute enormously to the delivery of services. Their unique contribution is to provide the means to evaluate interventions for the individual client.
Second, many research questions we care about can be, and actually must be, addressed at the level of groups. In between-group research, one group is compared with one or more other groups. The unique contribution of between-group research is to examine the separate and combined effects of different variables (e.g., characteristics of the subjects that may moderate treatment). Large sample sizes and delineation of subgroups (by a categorical variable) or a range of some characteristic (by a dimensional variable) require a group study. Also, I mentioned that evaluations of interventions (e.g., parametric, dismantling, comparative) are well suited to group studies because they can test treatment variations without multiple-treatment interference.

Many questions we have that are not part of intervention studies also require a group-level focus. For example, we want to know about factors that contribute to problems, successes, and outcomes that occur when we do not intervene. Among children who are physically or sexually abused or who experience mental or physical health problems, what are the characteristics of those who turned out problem-free? What can be done in early childhood that decreases the likelihood that teenagers will engage in high-risk behaviors (e.g., driving while intoxicated, unprotected sex, cigarette smoking)? What is the relation of physical health (e.g., onset of colds, days lost from work due to health) and mental health (e.g., stress, adjustment)? Each of these questions requires the study of groups. Also, we want an estimate of the strength of relations among variables (e.g., correlation), and this requires groups.
Third, many questions we care about are beyond groups of individuals and actually refer to groups of studies. For example, questions about the effectiveness of interventions (e.g., in education, psychology, and medicine) are routinely addressed by examining and combining many different between-group studies. Meta-analysis is an approach to research that draws on individual studies (rather than subjects) as the unit of analysis, characterizes and combines studies, and draws conclusions based on this combined literature.4 The questions focus on drawing from a body of literature (multiple studies) to see what new conclusions can be drawn, some of which could not be asked or were not asked by any of the individual studies included in the meta-analysis.
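For a concrete, if highly simplified, picture of what treating studies as the unit of analysis means, the sketch below averages standardized mean differences across five invented studies, weighting each by its sample size. Real meta-analyses use more refined weighting (e.g., inverse-variance weights) and model choices; this is offered only as an illustration of the basic idea.

```python
# Hypothetical (effect size d, sample size N) pairs from five invented studies.
studies = [(0.45, 60), (0.30, 120), (0.80, 24), (0.10, 200), (0.55, 48)]

total_n = sum(n for _, n in studies)
mean_d = sum(d * n for d, n in studies) / total_n
print(f"Sample-size-weighted mean effect size across studies: d = {mean_d:.2f}")
```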
Each of the preceding levels of analysis for research focuses on important issues. It is difficult to argue convincingly in favor of one level of analysis to the exclusion of the others. And there would be no point. Uncovering the secrets of nature and developing strategies to address critical problems would profit from using the widest range of tools we have available. Assessments, experimental designs, and data-analytic strategies associated with the different levels of analysis extend our range.

Multiple Methodologies

A different way to place single-case designs in context is to mention three research traditions. First and most familiar is between-group research, which dominates the training of students and researchers in the social and biological sciences. As the reader well knows, this involves groups, null hypothesis testing, and statistical tests, and is referred to as quantitative research. Second, in need of no elaboration at the end of the book, is single-case research. This type of research (groups not essential, no null hypothesis testing in the same way, and no need for statistical tests) is rarely included in training in the social or biological sciences and departs from research in the quantitative tradition.
Third is qualitative research, which consists of systematic, replicable, and rigorous ways of studying individuals and human experience much more intensively than either between-group or single-case research. In terms of methodology, qualitative research often considers a small number of subjects, evaluates their experience in rich detail, often with lengthy narrative descriptions, and may or may not use special software and statistical techniques to evaluate the content.5 Qualitative methods, well beyond the scope of this chapter, convey yet another methodological approach to studying phenomena. These methods look at phenomena in ways that reveal many facets of human experience that the between-group quantitative tradition has been partially designed to circumvent—in-depth evaluation, subjective views, and how individuals represent (perceive, feel) and react to their situations and contexts. For example, qualitative research can look at the experience of those who go through treatment and the thematic ways in which their lives and the lives of their partners are influenced. As the reader well knows, qualitative research methods are rarely covered in training in the social sciences.
4 For the reader unfamiliar with meta-analysis, many excellent sources are available and can be consulted (Cooper, Hedges, & Valentine, 2009; Hunter & Schmidt, 2004).

5 It is important to note that the term "qualitative" is occasionally and mistakenly used to mean an unsystematic case or anecdotes. Qualitative methods meet the desiderata of science; the methods are systematic, replicable, and cumulative (see Berg, 2001; Denzin & Lincoln, 2005).
The vast majority of research in the social sciences falls within the quantitative between-group tradition. If the work is intervention research (treatment, prevention, services), randomized controlled trials (RCTs) are recognized as the epitome of this tradition. Typically, RCTs include pre- and posttreatment assessment, multiple measures, rigorous control over the administration of treatment, and holding constant or controlling as much as possible to maximize the likelihood of identifying an effect if one truly exists (Kazdin, 2003). Quantitative research, with RCTs as its poster child, accounts for enormous gains in so many areas in medicine, education, counseling, psychology, rehabilitation, and more. Evidence-based interventions in education and psychotherapy are two prominent examples where great strides have been made.
The dominance of and close-to-exclusive reliance on the quantitative tradition constrain our knowledge. The perspective and yield from a study are very much influenced by the methods we use. The level of analysis (individual, group, groups of studies), the types of measures that are used, the number of occasions on which they are administered (e.g., self-report at pre and post; behavioral measures continuously assessed over time; in-depth narratives), and other features either essential to or correlated with different methodologies reveal different facets of a phenomenon. Multiple methodologies and perspectives are essential, without implying that one is better than the other. For example, the argument that tests of statistical significance are "better" than visual inspection criteria is difficult to make. Both means of evaluating the data include subjectivity and some arbitrariness, but in different ways. Both should be treated cautiously and not worshipped (see the appendix at the end of the book). And both have humorous features, as we investigators squirm when the criteria are not quite met but we want them to be. A familiar example is from quantitative research, where statistical tests are used to decide whether there is a difference or the intervention had an effect. In many articles, one can easily find such comments as, "My findings 'approached significance' or were 'almost statistically significant,' or there was a 'trend toward significance.'" None of these terms is legitimate within the quantitative tradition of null hypothesis testing and the rules adopted for statistical significance.
The importance of multiple ways of examining phenomena is conveyed better by looking at other areas of science. As an illustration, most readers are familiar with the Hubble Space Telescope and its remarkable yield of information about space. The Hubble has been one of four orbiting observatories that look at space in a different light (visible, infrared, gamma rays, and X rays) (www.stsci.edu/science/goods). Each yields unique information and complements information provided by the others. Pointing the observatories to the same object or point in space yields entirely different pictures because of what is being assessed. Research methodology too can influence the yield; phenomena can vary as a function of the methodological lens through which they are viewed. For example, we have known for decades that even within the quantitative between-group tradition, findings for a given phenomenon can vary as a function of the type of design (e.g., cross-sectional vs. longitudinal; between vs. within subjects) (e.g., Chassin et al., 1986; Grice & Hunter, 1964). The findings are all real and veridical but convey different facets of the topic. Our view of the cosmos has grown enormously by expanding the ways in which we can look at it. Methodological ecumenicism that recognizes and utilizes the quantitative, single-case, and qualitative traditions would have the same benefit.

CLOSING COMMENTS

If you are a researcher already, you will have seen me describe several features of single-case designs that clash with your (and my!) training. I have emphasized what we are trying to accomplish by doing research to allow evaluation of single-case methods more generally. We do studies to draw inferences and understand phenomena better than ordinary casual observations allow. The problems that emerge and compete with drawing clear conclusions are well codified as various sources of bias and artifact. The threats to validity highlighted early in the book reflect major sources of bias and artifact that can mislead us. The goals of research are to combat all the sources of ambiguity we can and to design a project that will permit valid (clear, verifiable, replicable) inferences.

Between-group designs, single-case designs, and qualitative research designs all can rule out various sources of artifact and bias. In their most rigorous forms, they can demonstrate causal relations. It is unfortunate that our training has usually limited our exposure, so we cannot readily draw from each of them. In the case of single-case designs, among the unfortunate consequences is that we live in a world of well-intentioned interventions that are rarely evaluated. Single-case designs, even if used in this one context, would permit more programs to examine whether or not we are helping when we think we are. But there is more to single-case designs than program evaluation. The designs greatly increase the armamentarium of the researcher interested in science, whether basic or applied.

Consider single-case designs in your work. Try one with an intervention that you believe is working well. The feedback that the data provide on a continuous basis is an excellent way to evaluate an intervention but also to make decisions as to whether the desired effects are obtained while the intervention is still in process and can be changed. This book was designed to elaborate single-case methodology and to describe design options, their utility, and their limitations. I hope you will explore this methodology to complement other design and evaluation methods you are using.
APPENDIX

Statistical Analyses for Single-Case Designs: Issues and Illustrations

APPENDIX OUTLINE

Background and Context


Visual Inspection: Application and Applicability
Statistical Analyses
Serial Dependence in Single-Case Data
Statistical Tests for Single-Case Research
Sampling of the Many Tests
Time-Series Analysis: Illustration
Description
Important Considerations
Obstacles to Using Statistical Tests
Many Options, Few Guidelines
Recommendations
Conclusions

Data evaluation focuses on drawing inferences about whether the change is reliable and not likely to be due to chance fluctuations in the data. The experimental design determines largely whether the intervention can be identified as responsible for change, but the data-evaluation method handles the burden of making the decision about the change itself. Visual inspection has been and continues to be the dominant method of data evaluation of single-case designs, as discussed previously (Chapter 12). Indeed, estimates have placed the reliance on visual inspection as characterizing approximately 90% of studies from 1978 to 2003, with no clear changes over time (see Parker & Hagan-Burke, 2007b). That said, there has been increased interest over the same span of years in the use of statistical tests and advances in the tests themselves.
This appendix has four goals. First, the context for carrying out statistical tests in single-case data is discussed. This context includes but goes well beyond concerns about visual inspection. Research on visual inspection and how it is applied, and the elaboration of novel statistical analyses for single-case research, are parallel lines of research in the past few decades. Each is important to highlight because of the central issues and dilemmas it raises for data evaluation in single-case research.
Second, the appendix clarifies what is special about single-case data that dictates the use of, but also provides challenges for, statistical evaluation. Characteristics of the data and specific statistical tests for data evaluation of the single case include surprises. The statistical tests used in single-case research are different from the usual tests that have been taught in graduate training and in the between-group research tradition. In those instances in which the tests are familiar (e.g., t and F), there are some novel considerations and issues that emerge. Statistical analysis is more than some "new" tests. The data and the designs present challenges for statistical evaluation that are important to cover.
Third, the appendix enumerates statistical tests for the single case and provides an illustration of one of the more prominently used tests. In the first edition of the book, several available tests were presented and illustrated. These included conventional t and F tests, time-series analysis, randomization tests, a test of ranks, and the split-middle technique. Since that writing, many more tests and variations of some of these have been developed. No one test has captured the hearts or method sections of those who use statistical tests. Also, quite recent work has begun to evaluate available tests more analytically by comparing the strengths, varied results, and requirements of different tests. In light of the scope of advances, this appendix can only illustrate and highlight statistical evaluation and its yield. There are now many resources that detail individual statistical techniques and the relative merits of various options.1
Finally, perhaps the most central goal and possible contribution of this appendix is to convey the status and dilemma of data evaluation in single-case designs. There are weighty considerations, including clear strengths and limitations, of visual inspection and statistical evaluation. Considerations are presented to guide the researcher on the decision to use one or both types of analyses.

BACKGROUND AND CONTEXT

Visual Inspection: Application and Applicability


It is true to state that visual inspection remains the primary method of evaluating single-case designs, but in the past two or three decades many influences have placed this method under further scrutiny. The scrutiny has focused concretely on the applicability of the method but also on the rationale for its use. First, several studies have shown that visual inspection criteria can be difficult to invoke reliably (e.g., Franklin et al., 1997; Matyas & Greenwood, 1990; Park et al., 1990). In the idealized data pattern, or close approximations of it, all the criteria for visual inspection (e.g., mean, slope, level, latency criteria) are met; baselines (e.g., A phases in an ABAB design, or initial baselines in a multiple-baseline design) are stable (e.g., little variability, no trend), and the data points from phase to phase (e.g., ABAB) do not overlap. This pattern is not the one that raises concern; judges can identify large effects (Knapp, 1983; Matyas & Greenwood, 1990).
' Books are available that discuss statistical tests for single-case designs (e.g., Edgington & Onghena,
2007; Franklin, Allison, & Gorman, 1997; Satake, Maxwell, & )agaroo, 2008). Several articles that
present different tests or that compare multiple tests are referred to throughout this appendix.
S t a t is t ic a l A n a l y s e s f o r S in g le -C a s e D e s ig n s 403

Greenwood, 1990). Once one begins to depart from the ideal data pattern, incon-
sistency emerges when two or more judges are asked to evaluate the impact o f the
intervention. The unreliability o f judging the data has been replicated am ong raters or
evaluators with little or no experience with single-case designs (e.g., undergraduates),
as well as among individuals in training and expert judges with direct experience with
the designs (see Brossart, Parker, Olson, & Mahadevan, 2006; H arbst et al., 1.991; Park
et al., 1990). Can one train the judges and surmount the u nreliability o f judging the
data? Training can help but does not eliminate the problem (e.g., Fisher et al., 2003;
Harbst et al., 1991; Skiba et al., 1989) for reasons noted later. In any case, over time with
a small but accumulating literature, there is increased recognition that m aking judg-
ments about the data using visual inspection criteria can be unreliable.
Second, a key rationale for the use of visual inspection has undergone challenge as well. The method of data evaluation for applied research is based in part on the view that only large effects ought to be considered as reliable (Parsonson & Baer, 1978, 1992). Early in the development and use of the designs in applied settings, this filtering aspect of visual inspection was viewed as a strength. The goal was to use single-case designs as part of the development of a technology of behavior change. Iffy, weak, and other such effects were not of interest or to be counted. This is easily understood when between-group research is used as a comparison. In between-group research, the goal is to show statistically significant differences between groups. But such differences can be obtained with minor changes on the measures or outcomes. Indeed, statistical significance is largely a function of sample size. That is, groups in any study will invariably differ to some extent; the larger the sample size, the more likely that difference will reach statistical significance.
Single-case research began with the notion that a detected difference should be a "real" difference, one that meets an experimental criterion for reliability (a genuine difference not due to some trend or chance fluctuation) and an applied criterion (some practical, educational, or therapeutic goal is achieved). The rationale is easily defensible in principle. In practice, the situation is rather different. Empirical analyses of publications of single-case research designs have shown that many intervention effects (e.g., over 25% in one study) in fact are small and debatable, if not completely illusory (Glass, 1997; Parker et al., 2006). The specific percentage is not critical and is subject to debate. Yet the broader point may not be, namely, that the rationale for using visual inspection and its advantage of filtering out weak effects do not match how the criteria are used in many instances.
Third, the rationale of searching for large effects has another concern that has become clearer in the past decades. Small effects or changes can be very important for many reasons (Kazdin, 2001). Arguably, we want to know about reliable changes, whether they pass some stringent standard or not. Small effects might lead to a better understanding, and better understanding is a five-lane, freshly paved and painted highway to more effective interventions that can produce large effects. In addition, large intervention effects can actually be hidden in the data and appear as a small effect. If there is a moderator variable operating (e.g., ethnicity, age), the overall effect might look weak, because within that effect are strong and weak effects or even effects that are in the opposite direction for subgroups. If a single-case demonstration obtains a weak effect, that might not be a universal finding; for another group and in another context, the effect might well be large. Early filtering or exclusion of weak but reliable effects is unnecessary and potentially harmful to science and technology.2 Finally on this point, we very much want weak interventions in our armamentarium, especially if they are low-cost and can be widely applied. Application of weak treatments (e.g., antismoking campaigns, TV ads to reduce child abuse) on a large scale can have important effects (Kazdin, 2008a).

2. The importance of small effects is easy to illustrate. For example, in searching for fuels that will replace fossil fuel, sunlight has been thought to hold unrealized promise. The ability to convert sunlight to electricity has been a very small effect, reliable but small. That is, there was only 15% efficiency; in other words, turning sunlight into energy is hardly worth it, given the cost it takes to produce. However, that this conversion of sunlight to energy could be accomplished at any scale was important as a test of principle in looking for options for energy alternatives to fossil fuel. More recently, as the very small effect was studied more, improvements were made that now make conversion a large effect (approximately 40% efficiency). That is, there is the ability to convert sunlight to electricity on a scale of efficiency that is quite different from the early demonstrations. It would have harmed progress to use a filter for data evaluation that identified only large effects, both for basic research and application.
Fourth, there are features of single-case data that elude visual detection but can greatly influence judgments about the data. One of these is referred to as serial dependence: data from one occasion to the next (Day 1, Day 2) in the continuous observations over time may correlate with each other. I elaborate this more fully later because of its relevance to statistical evaluation. I mention it here to note that whether the data are serially dependent cannot always be "seen" by just looking at the graph. However, serial dependence can influence judgments about the effects of the intervention (Jones et al., 1978; Matyas & Greenwood, 1990).
A related feature of single-case data is the possibility of a slight trend that obscures evaluation of subsequent phases. Baseline trend is the more likely culprit. Again, in the idealized pattern of single-case data (e.g., sharp changes in trend from phase to phase), the issue does not emerge. However, from published research we have learned that a high percentage of studies have a trend toward improvement in baseline (Parker et al., 2006). The trend may not be easily observable visually because it is not simply a straight line. There may be perturbations, random influences, and cycles that suggest to an observer (visual inspection) that there is no systematic trend. However, a trend can be quantified (modeled) and shown to characterize the data. In short, we have learned that there are characteristics of single-case data, patterns across many findings of single-case studies, and constraints on perception and visualization of what happened across phases that limit visual inspection. In many cases, visualization and invoking the criteria for visual inspection may not be up to the task of deciding whether the intervention effect was reliable.
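To make the point about quantifying trend concrete, consider a minimal sketch. This is my illustration rather than a procedure from the studies cited, and the data are hypothetical; it shows only that an ordinary least-squares line can expose a baseline drift the eye may dismiss as noise.

    import numpy as np

    # Hypothetical baseline observations (e.g., daily frequency of a behavior).
    # Day-to-day perturbations make any drift hard to judge by eye.
    baseline = np.array([12, 15, 11, 14, 16, 13, 17, 15, 18, 16, 19, 17])
    days = np.arange(1, len(baseline) + 1)

    # Fit a least-squares straight line: behavior = intercept + slope * day.
    slope, intercept = np.polyfit(days, baseline, deg=1)

    # The correlation between day and behavior gauges how systematic the
    # drift is relative to the noise around it.
    r = np.corrcoef(days, baseline)[0, 1]
    print(f"slope = {slope:.2f} units per day, r = {r:.2f}")

A slope reliably above zero flags an improving baseline even when perturbations and cycles keep the series from looking like a straight line.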
These considerations convey challenges to visual inspection. Evaluation of the data is not merely a matter of holding the graphed data in one hand and a list of the criteria for visual inspection in the other and drawing conclusions by clapping both hands together. Lack of stark intervention effects, and nuances of the data (serial dependence) that are difficult to integrate visually, occur and degrade the reliability of invoking visual inspection. Visualization and graphing of data include a variety of options that have not been explored in single-case research. Some of the options include ways to present the data based on nonlinear smoothing, transformations, and methods to detect cyclical patterns, noise, and outliers among the data points (e.g., Clarke, Fokoue, & Zhang, 2009; Cleveland, 1993, 1994; Velleman, 1980). Many of the options have mathematical underpinnings that might have made Euclid change careers. However, there are user-friendly software graphing packages that allow exploration of many different ways to graph the data and to understand and reveal underlying properties (e.g., www.datadesk.com, www.wavemetrics.com). The options for presentation of data are beyond the scope of this discussion. I mention them because visual inspection in single-case research has yet to exploit the graphing and transformation options that might well provide more reliable ways of evaluating the data.
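As a flavor of what such options involve, here is a sketch of one simple nonlinear smoother, a running median, applied to hypothetical data. The published smoothers cited above are more sophisticated; this illustrates the idea rather than any specific method.

    import numpy as np

    def running_median(y, window=3):
        """Replace each point with the median of its neighborhood.
        Medians resist outliers better than moving averages do."""
        y = np.asarray(y, dtype=float)
        half = window // 2
        out = np.empty_like(y)
        for i in range(len(y)):
            lo, hi = max(0, i - half), min(len(y), i + half + 1)
            out[i] = np.median(y[lo:hi])
        return out

    # A hypothetical series with one aberrant day; smoothing reveals the
    # underlying level more clearly than the raw points do.
    print(running_median([10, 11, 25, 12, 13, 12, 14, 13, 15]))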
The ongoing research on visual inspection within single-case designs has not reduced reliance on this method of data evaluation. From the issues related to unreliability of visual inspection, many researchers have concluded that some alternatives are needed. One alternative not yet pursued, mentioned previously, is the many options for graphing, modeling, and presenting data. The option that has been pursued is the use of statistical tests for the single case.

Statistical Analyses
Concerns about applying visual inspection criteria reliably are not the sole impetus for turning more to statistical analyses. Other developments have set the stage for increased use of statistics. First, single-case designs have expanded beyond the boundaries of behavioral research, and that has led to some changes directly related to the use of statistical tests. The designs were restricted to one or two journals and disciplines in the late 1960s when the field (applied behavior analysis) that made these designs prominent emerged. Currently, many journals publish single-case research reflecting a range of disciplines (education, school psychology, clinical psychology, rehabilitation, occupational therapy, recreational therapy, internal medicine, psychiatry, social work, and more). Expansion of the methodology to new areas of work has led to expansion of other aspects of the methodology as well.
I mentioned that it is useful to consider research methodology as comprising three broad components: assessment, experimental design, and data evaluation. Traditionally, these were very tightly intertwined. Single-case methodology included assessment meaning direct observation of behavior, experimental design meaning single-case designs, of course, and data evaluation meaning visual inspection. Yet the extension of the designs to other areas has led to expansion of assessments in terms of format (e.g., not just behavioral measures but also self-report and clinician ratings) and domains of assessment (e.g., cognitive processes, the experience of anxiety), as illustrated in earlier chapters. The expansion includes a willingness to consider using statistical tests and slight discomfort with the somewhat ambiguous decision-making guidelines for visual inspection. Also, for some researchers, the utility of single-case designs stems in part from not having to adopt behavioral assessment (which may not be the goal) or visual inspection (which violates the training of most researchers).
Second, changes within statistical evaluation in between-group research have had implications for the evaluation of visual inspection. From the very inception of tests of statistical significance in between-group research, there has been concern about their limitations. Statistical significance is dependent on sample size, gives a binary decision, and does not say anything about the magnitude or strength of the effect. One can readily conclude there is no effect (not statistically significant) when in fact there is (called a Type II error). Spanning decades but exerting influence more recently has been the view that statistical significance should be supplemented by, if not replaced with, some measure of the magnitude of effect (Kirk, 1996). How large an effect is can be distinguished from whether the effect was statistically significant. Effect size has been the measure of magnitude frequently advocated and does not suffer the same problems as statistical significance. Many journals that publish mostly between-group intervention research encourage or require that the results include measures of effect size.3

3. Effect size can be measured in many ways. The advantages of using effect size and the alternative ways it can be measured and computed are discussed in several resources (e.g., Kazdin, 2003; Kirk, 1996; Schmidt, 1996; Wilkinson and the Task Force on Statistical Inference, 1999). The present discussion does not depend on the use of a specific index of effect size.
A related development has been the proliferation of meta-analysis. Meta-analysis is a way of reviewing and integrating empirical studies on a given topic by translating the results of these studies (e.g., changes on outcome measures, differences among groups) to a common metric (effect size). This allows the reviewer (meta-analyst) to draw conclusions about the findings in a given area and to quantify the strength of effects. In addition, one can ask questions about the data from many studies combined that were not addressed in any of the individual studies included in the meta-analysis. Thus, novel findings can emerge from a meta-analysis. Between-group researchers engaged in intervention research are encouraged or required to provide effect size information, depending on the journal and discipline. Even when researchers do not provide that information, effect sizes can often be computed from other statistics in the original article (e.g., means and standard deviations for various measures). This information is used for meta-analyses.
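For readers who have not computed one, the most common between-group index, the standardized mean difference (Cohen's d), can be sketched in a few lines. The data are hypothetical, and, as noted in footnote 3, this is only one of many possible indices.

    import numpy as np

    def cohens_d(group1, group2):
        """Standardized mean difference: the gap between two group means
        expressed in pooled standard deviation units."""
        g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
        n1, n2 = len(g1), len(g2)
        pooled_var = ((n1 - 1) * g1.var(ddof=1) +
                      (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
        return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

    # Hypothetical outcome scores for treated vs. control participants.
    print(cohens_d([24, 28, 30, 27, 31], [20, 22, 25, 21, 23]))

Because d is a pure number of standard deviation units, results from studies with different measures can be placed on a common scale, which is exactly what meta-analysis requires.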
Contrast the situation with single-case designs and the use of visual inspection. Visual inspection from one study to the next does not provide a systematic way of integrating and combining many studies or of asking new questions based on a large integrated database. Without some formal, replicable way of combining studies, much of the single-case work is neglected or viewed as difficult to integrate. Over the years, many researchers have proposed effect size measures for single-case designs (e.g., Busse, Kratochwill, & Elliott, 1995; Kromrey & Foster-Johnson, 1996; White et al., 1989). In fact, a recent review noted that over 40 different approaches for measuring effect size have been proposed for single-case research (Swaminathan et al., 2008). None has been widely adopted, and only recently have some of the alternatives been carefully evaluated and compared (e.g., Manolov & Solanas, 2008a; Parker & Brossart, 2003; Parker & Hagan-Burke, 2007b). In short, there is no recommended method of computing effect size in single-case designs that is readily available and ready for prime time. The absence of a clear and widely used way of computing effect size limits the ability to accumulate and combine findings from single-case studies and to integrate findings of single-case and between-group studies. This, too, is quite relevant background to the current interest in using statistical tests.
Finally, more and more studies use statistical tests to analyze single-case data, sometimes along with visual inspection (e.g., Bradshaw, 2003; Feather & Ronan, 2006; Levesque et al., 2004; Molloy, 1990; Quesnel, Savard, Simard, Ivers, & Morin, 2003). Also, many articles have emerged that present new statistical tests for the single case, reanalyze prior data from published studies, or present new data to illustrate the analyses. Some of these articles compare multiple single-case statistical tests (e.g., Brossart et al., 2006; Parker & Brossart, 2003; Parker & Hagan-Burke, 2007a, 2007b). While it remains the case that visual inspection dominates, statistical evaluation has been on the march.
Two summary points ought to be emphasized in relation to the use of statistical tests. First, such tests represent an alternative to, or a complementary method of, evaluating the results of a single-case experiment. Second, statistical evaluation can permit accumulation of knowledge from many different investigations, even if they do not all use the same statistical tests. Enormous gains have been made in between-group research by looking at large literatures and drawing quantitatively based conclusions. Combining studies that use visual inspection, so as to reach conclusions and to pose and answer new questions from such a data set, has yet to emerge in single-case research. Findings from visual inspection risk continued neglect from the broad scientific community if they cannot be integrated in the way that effect size has permitted in between-group research. The solution is not merely taking currently used effect size estimates and applying them to single-case research. Characteristics of single-case assessments and data (e.g., ongoing assessments, the influence of the number of data points, serial dependence, as discussed below) make the usual formulas not directly applicable (see Shadish, Rindskopf, & Hedges, 2008).

Serial Dependence in Single-Case Data


The context for statistical analysis pertains to the nature of single-case data. Single-case data are based on continuous observations over time for the same subject. This results in a characteristic of the data that is different from the one or two observations collected from many subjects in between-group research. The difference has implications for what statistics can be applied and how they can be applied.
The beginning point for statistical evaluation in single-case research is recognition of the structural feature referred to as serial dependence. Serial dependence refers to the relation of the data points to each other in the series of continuous observations. The dependence reflects the fact that the residuals (error) in the data points are correlated (or can be) from one occasion to the next. The dependence is measured by evaluating whether the data are correlated over time. This can be accomplished in different ways. The usual method is correlating the data by pairing adjacent data points (Days 1 and 2, Days 2 and 3, Days 3 and 4, etc.) and computing a correlation coefficient. The correlation is referred to as autocorrelation and is a measure of serial dependence. Correlating Days 1 and 2, 2 and 3, and so on is only one way to compute the correlation. To understand the data and how serial dependence operates, one can compute the relation of data points with different amounts of time (called lags) between them. For example, one could have a 2-day lag and correlate Days 1 and 3, 2 and 4, 3 and 5, and so on. The
point to make here is conveyed by just considering the adjacent points, but many different lags (1, 2, and 3 days, etc.) can give a complete picture of the series. The different lags can detect characteristics of the data, such as cycles, that may be repetitive but not captured merely by correlating adjacent data points (1-day lag).
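The pairing logic is easy to express in code. The sketch below is my illustration with hypothetical data; it computes lag-k autocorrelations of the raw scores, whereas formal analyses, as noted shortly, work with the residuals.

    import numpy as np

    def autocorrelation(series, lag=1):
        """Lag-k autocorrelation: correlate the series with itself shifted
        by `lag` occasions (lag=1 pairs Day 1 with Day 2, Day 2 with
        Day 3, and so on)."""
        y = np.asarray(series, dtype=float)
        return np.corrcoef(y[:-lag], y[lag:])[0, 1]

    # Hypothetical daily observations from one phase.
    data = [5, 6, 6, 7, 6, 8, 7, 9, 8, 9, 10, 9]
    for lag in (1, 2, 3):
        print(f"lag {lag}: r = {autocorrelation(data, lag):+.2f}")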
What is the autocorrelation? Autocorrelation is a correlation and can range from -1.00 to +1.00. Trends in the data tend to indicate that there is autocorrelation, but, as mentioned previously, not all trends in the data can be detected visually (Parker et al., 2006). Autocorrelation can be negative as well as positive. A statistically significant autocorrelation can be used to define whether there is serial dependence in the data.4

4. Reliance on a statistically significant correlation to make a decision about serial dependence has its risks. The significance of a correlation is highly dependent on the number of observations (degrees of freedom). If few observations (e.g., a baseline of 10 days) are available to compute the autocorrelation, it is quite possible that the resulting correlation would not be statistically significant. Serial dependence might be evident in the series (if that series were continued), but the limited number of observations may make the obtained correlation fail to reach significance. Autocorrelation is properly calculated on the residuals, which is accomplished by some statistical tests. In this discussion, I am focusing on autocorrelation of the raw data rather than of the residuals. Autocorrelation of the raw data is very likely to reflect dependence in the residuals.
Serial dependence is important to know about for two reasons. First, the presence of serial dependence precludes the straightforward application of the statistical techniques with which we are most familiar. For example, conventional t and F tests make several assumptions that we learned in our early training (e.g., homogeneity of the variances of the groups, normally distributed data, independence of the error terms). We also learned that the data analyses are "robust" (from the Latin robustus, "strong and hardy," and derived further from the word for oak). Although few of us grasped what that meant, we knew that violating the assumptions was not a huge problem and that a little violation of a statistical assumption here or there would not lead to an arrest. Moreover, for some of the assumptions, there are tricks (e.g., data transformation) one can use if the assumption is violated. In other words, we learned quickly that there are assumptions but that violating them is not a problem. There is an exception to all of this, namely, the assumption that the error terms (residuals) of the observations are uncorrelated. Violation of this assumption does make a difference and precludes the appropriate use of conventional t and F tests. If the autocorrelation is positive, the standard errors used as part of statistical tests (i.e., the error term or denominator for the statistic) become smaller than they ought to be, and the results (t or F test) will be larger, or positively biased. That is, more Type I errors will be made (i.e., showing a statistically significant effect when there would not have been one). If the autocorrelation is negative, standard errors will be too large, and the overall t or F will be smaller than it would otherwise have been. This will lead to more Type II errors (i.e., showing no significant effect when there actually was one).
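A small simulation makes the first point vivid. The sketch below is my illustration with arbitrary parameter values: it generates series with positive lag-1 autocorrelation and no true phase difference, then applies a naive t test, and the false-positive rate comes out well above the nominal .05 level.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n_per_phase, rho = 5000, 20, 0.5  # rho: positive lag-1 autocorrelation
    false_positives = 0
    for _ in range(n_sims):
        # One autocorrelated series with NO real "intervention" effect.
        e = rng.normal(size=2 * n_per_phase)
        y = np.empty_like(e)
        y[0] = e[0]
        for i in range(1, len(e)):
            y[i] = rho * y[i - 1] + e[i]
        # Naive t test comparing the "baseline" and "intervention" halves.
        _, p = stats.ttest_ind(y[:n_per_phase], y[n_per_phase:])
        false_positives += p < .05
    print(false_positives / n_sims)  # noticeably larger than .05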
Second and related, if serial dependence exists in the data, the analysis needs to take it into account. The dependence reflects some trend or pattern in the underlying data. It may not be a simple linear trend, but a trend perhaps jolted by random effects and detected only across different lags. A data-analytic technique is needed to account for the dependence and to discern whether any intervention effect is evident over and above some overarching but possibly subtle pattern. As I noted previously, vision and visual inspection are not up to the task.
Although it would be useful to begin with clear rules, there are important caveats instead. Not all time-series data show autocorrelation. This is an empirical matter, and one needs to test it. A difficulty is that single-case designs often do not include a sufficiently long baseline phase to compute the correlations and understand the model underlying the data (Matyas & Greenwood, 1997). Continuous data (also called time-series data) in many other disciplines (e.g., climate change, economics, dendrology [the study of trees]) often come in huge data sets that allow for understanding the structure of the series. In single-case research one usually wants to draw on the baseline data for computing autocorrelations, because the intervention phase introduces a new influence that could change the pattern.
Some time ago, the extent to which single-case data are likely to show serial dependence was debated (see Matyas & Greenwood, 1991; Sideridis & Greenwood, 1997), with estimates ranging from approximately 10% to 80% of single-case data sets in published studies. Such variability conveys that other issues are at play, such as how the autocorrelation is computed and the length of the phases included in the study. The current verdict after several studies is that serial dependence is likely to be present and ought to be taken into account in evaluating the data. Doing so requires statistical techniques that are not straightforward, as highlighted shortly. What has been of special interest, perhaps, is that serial dependence influences both visual inspection and statistical evaluation of the data.

STATISTICAL TESTS FOR SINGLE-CASE RESEARCH

Visual inspection criteria alert us to the types of changes we would like to detect, including changes in means (across phases), trends, changes in level (discontinuity of performance from one phase to the next), and latency of change. Characteristics of single-case data (e.g., serial dependence) and the likelihood of brief phases (few data points) alert us to features that have to be accommodated. A single statistical test or family of statistical tests that can accommodate these criteria and characteristics satisfactorily has yet to be identified. In between-group research, the dominance of t, F, and all the related least-squares statistical tests has brought uniformity, indeed complacency, to research studies. That does not seem likely in single-case research in the immediate future.

Sampling of the Many Tests


There has been an enormous expansion of statistical tests for the single case in the past few decades. Table A.1 lists several of the tests to convey the point. The tests vary in precisely what aspects of the data they evaluate (e.g., changes in means, trend), how or whether they handle serial dependence, and whether they can be applied without lengthy phases that provide sufficient data within each phase.
Table A.1 Selected Statistical Tests Available for Single-Case Designs

Name: Resources

Binomial Test: White & Haring, 1980
C Statistic: Jones, 2003; Satake et al., 2008; Tryon, 1982
Clinical Outcome Indices: Parker & Hagan-Burke, 2007a
Conventional t, F, & χ2 tests: Virtually any statistics book on between-group research; Satake et al., 2008
Double Bootstrap Method: McKnight, McKean, & Huitema, 2000
Last Treatment Day Technique: White et al., 1989
Logistic Regression: Brossart et al., 2008
Mean Baseline Reduction/Increment (for decreases and increases in behavior, respectively): Lundervold & Bourland, 1988
Mean-only/Mean-plus-Trend Models: Allison & Gorman, 1993; Center, Skiba, & Casey, 1985-1986; Faith, Allison, & Gorman, 1997
Percent Zero Score (or 100%, for decreases and increases in behavior, respectively): Scotti, Evans, Meyer, & Walker, 1991
Percentage of Nonoverlapping Data Points: Busk & Serlin, 1992; Ma, 2006; Mastropieri & Scruggs, 1985-1986; Wolery et al., 2008
Randomization Tests: Edgington & Onghena, 2007; Lall & Levin, 2004; Levin & Wampold, 1999
Rn Test of Ranks: Revusky, 1967
Split-Middle Technique: Fisher, Kelley, & Lomas, 2003; White, 1972, 1974
Time-Series Analyses: Borckardt et al., 2008; Box et al., 1994; Glass et al., 1975; Hartmann et al., 1980
Trend Analysis Effect Size: Faith et al., 1997; Gorsuch, 1983

Notes: The list is not intended to be comprehensive. Several listings in the table are not individual tests but rather tests with multiple variations. In cases where more than one citation is provided, multiple variations of that test can be found. Where possible, I have tried to give recent or readily available references rather than necessarily identify the first source in which a particular analysis was identified or recommended. Some tests in the table may be familiar because they are used with group data and in between-group research. Inclusion of such tests here means that they have been applied to individual subjects. In several articles, the yields from multiple tests and their sensitivity to detecting change are compared under different conditions (see Brossart et al., 2006; J. M. Campbell, 2004; Lall & Levin, 2004; Manolov & Solanas, 2008a; Parker & Brossart, 2003; Parker et al., 2005).

Recent work conveys the flux and advances in statistical tests for the single case. The strengths and limitations of many of the tests have only begun to be scrutinized. A literature has emerged in which the tests are evaluated by applying them across different conditions. Often hypothetical data are generated to permit evaluation of how the statistic performs under multiple data patterns (strength of intervention effect) and with diverse characteristics of the data (serial dependence, long vs. short phase durations). When data are taken from published studies, sometimes only the AB phases are evaluated. Several studies have compared multiple statistical tests on the same data set (e.g., Brossart et al., 2006; Campbell, 2004; Lall & Levin, 2004; Manolov & Solanas, 2008a; Parker & Brossart, 2003; Parker et al., 2005). For example, in one excellent evaluation of five different statistical tests applied to the same data, the authors concluded that the results of the different methods varied so much as to preclude clear guidelines (Brossart et al., 2006). Different statistical tests emphasize different characteristics of the data (e.g., means, trends, the last data point in the phase) and vary in their susceptibility to autocorrelation. What is important to note is that no one method emerges as suitable under all conditions, given variations in the magnitude of intervention effects and of autocorrelation, the duration of the phases, and different designs. This is an area of active research, and perhaps from it some subset of analyses will emerge. However, clear guidelines for what tests to use and when to use them are not easy to draw from this literature at this time.
Progress has been made in the past few decades, and more recently, in clarifying the problems. Also, among what has been learned is not only that conventional t and F tests are subject to misinterpretation because of serial dependence, but also that other tests (randomization tests, rank tests, the split-middle technique) can be influenced as well. In short, since the first writing of this book, research has elaborated the characteristics of single-case data, clarified the scope of the impact of serial dependence, and begun to evaluate alternative data-analytic strategies to accomplish the goals of visual inspection more reliably. The burgeoning literature has not helped in generating guidelines regarding what tests to use and when to use them. If anything, the literature points to caution in relying on any one of the methods currently available.

Time-Series Analysis: Illustration


Many of the available tests are still undergoing evaluation, and new variations are emerging. Consequently, I have referred the reader to other sources where details of the analyses and their variations are provided (please see Table A.1). As an exception, I am highlighting time-series analysis, for several reasons. First, variations of time-series analysis have been used and continue to be used for single-case data over a period spanning decades (e.g., Borckardt et al., 2008; Levesque et al., 2004; McSweeney, 1978; Savard et al., 1998; Schnelle, Kirchner, McNees, & Lawler, 1975). Consequently, while many statistical tests for the single case are only now being proposed and evaluated, time-series analysis stands out with a literature that can be consulted by interested researchers. Second, the method has been used in other disciplines and hence is developed well beyond the special use of single-case research. Third, statistical software packages (e.g., SPSS, SAS, Systat, Statistica, Stata) include time-series analyses; hence the method is readily available. Finally, time-series analysis directly addresses serial dependence in the data and accommodates its impact.

Description. Time-series analysis compares data over time for separate phases for an individual subject or a group of subjects (see Box & Jenkins, 1976; Box et al., 1994; Glass, Willson, & Gottman, 1975; Hartmann et al., 1980; Jones, Vaught, & Weinrott, 1977). The analysis examines whether there is a statistically significant change in level and trend from one phase to the next. Thus, the change is from phase A to B. The analysis can be applied to single-case designs in which there is a change in conditions across phases. All the phases can be examined, not just the first two phases of a design. For example, in ABAB designs, separate comparisons can be made for each set of adjacent phases (e.g., A1B1, B1A2, A2B2). In multiple-baseline designs, baseline (A) and treatment (B) phases may be implemented across different responses, persons, or situations.
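The full model-building procedure cited above cannot be compressed into a few lines, but the level-and-slope logic can be conveyed with a simplified segmented regression on hypothetical AB data. This sketch is my illustration, not the cited procedure: it estimates a baseline level and trend plus a change in level and a change in slope at the intervention point, and a genuine time-series analysis would additionally model the autocorrelation of the residuals before testing those terms.

    import numpy as np

    # Hypothetical AB data: 12 baseline and 12 intervention observations.
    y = np.array([20, 22, 21, 23, 22, 24, 23, 25, 24, 26, 25, 27,
                  18, 16, 15, 14, 13, 12, 12, 11, 10, 10, 9, 8], dtype=float)
    n_a = 12
    t = np.arange(len(y))
    phase = (t >= n_a).astype(float)            # 0 = baseline, 1 = intervention
    time_in_b = np.where(t >= n_a, t - n_a, 0)  # time elapsed since onset

    # Design matrix: intercept, overall trend, change in level, change in slope.
    X = np.column_stack([np.ones(len(y)), t, phase, time_in_b])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(dict(zip(["intercept", "trend", "level_change", "slope_change"], coef)))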
Consider an example that permits evaluation of the data via visual inspection as well as time-series analysis. This study focused on the effectiveness of a cognitive-behavioral treatment (CBT) for insomnia among women treated for nonmetastatic breast cancer (Quesnel et al., 2003). Sleep disturbances are one of many psychological problems associated with the impact of cancer and characterize 30 to 50% of patients. Patients participated if they had completed radiation or chemotherapy and met diagnostic criteria for Chronic Insomnia Disorder (by criteria of the International Classification of Diseases or the Diagnostic and Statistical Manual of Mental Disorders). Several measures were used, involving multiple assessment methods including clinical interviews, self-report daily diaries and questionnaires, and electrophysiology (polysomnography) of sleep evaluated in a sleep lab. The intervention consisted of CBT conducted in eight weekly group sessions of approximately 90 minutes each. CBT included several components (stimulus control for insomnia, coping strategies, restructuring of dysfunctional thoughts). At pretreatment, posttreatment, and each follow-up assessment, an extensive battery of measures was completed. Electrophysiological measures of sleep were obtained at pretreatment, posttreatment, and the 6-month follow-up.
Figure A.1 charts one of the continuous measures, a daily sleep diary kept by patients. The measure was used to report several characteristics of sleep (e.g., use of alcohol or medication, bedtime hour, duration of awakenings, and others). As evident in the figure, CBT was introduced in a multiple-baseline design across participants. The results suggest through visual inspection that the introduction of treatment was associated with decreases in total wake time, although the effects are less clear for Participants 6 and 7. One can see that gains for those who responded to treatment appeared to be maintained at the follow-up periods.
Time-series analyses evaluated the statistical significance of the change for each participant across the AB (baseline, treatment) phases. The analysis was selected because it takes into account serial dependence, can detect reliable intervention effects even if the effects are small, and evaluates changes in level and slope. Table A.2 reproduces information that corresponds to the data graphed in Figure A.1. For present purposes, consider the final two columns, which convey whether the changes in level or slope were statistically significant. For all eight participants, either level or slope changed significantly when treatment was introduced. The statistical analyses convey that there was a reliable treatment effect; the complexity of the effect (level for some, slope for others) provides information that would be difficult to discern from visual inspection. Several other analyses were completed (and are not discussed here) that demonstrated reductions in depression and physical fatigue and improved cognitive functioning. Measured but not part of the intervention, patients who had used sleep medication during the study stopped on their own while receiving CBT. At posttreatment and again at the 6-month follow-up, electrophysiological measures in the sleep lab revealed significant decreases in time awake and increases in sleep efficiency (the proportion of time sleeping out of time in bed).
Time-series analysis was very helpful in evaluating data in which there was considerable variability for some of the participants in both baseline and intervention phases. Also, possible trends in the data and autocorrelation were modeled and handled by the analysis. Any trends in baseline, whether or not they could be easily detected by visual inspection, were readily incorporated in evaluating changes from A to B phases. Perhaps one might argue that visual inspection would have been able to detect changes, perhaps for Participants 2, 3, and 5, where the effects are among the clearest. Even here some statistic is needed to handle the invisible autocorrelation and trends that are not simple ascending or descending straight lines. Baseline trends need to be modeled in the analysis. By "modeled" I mean an algorithm is needed that best describes any pattern of data in baseline; this is required to determine whether the intervention reflects a significant change. This is more than visual inspection can accomplish.

[Figure A.1 (graph not reproduced): Daily total wake time obtained by sleep diaries for each of eight participants who completed treatment, plotted across baseline, treatment, posttreatment, and follow-up phases. Missing data (e.g., baseline, Participant 7) reflect absence of the diary for those days. Treatment was cognitive-behavior therapy introduced in a multiple-baseline design across participants. (Source: Quesnel et al., 2003.)]

Table A.2 Summary of the Results of Time-Series Analysis for One of the Measures (Total Wake Time) for Eight Participants

Participant   df    Outliers   R2 (%)   Level (t test)   Slope (t test)
1             85    4          42.6     -2.44*           0.30
2             92    5          73.6     -5.95**          0.23
3             87    5          77.2     -5.20**          1.34
4             112   3          59.2     1.48             -4.31**
5             114   2          73.0     -0.48            -7.44**
6             97    5          75.4     -0.23            -5.45**
7             100   2          42.9     -5.64**          1.17
8             114   1          40.5     1.35             -3.45**

*p < .05. **p < .01.
(Source: Quesnel et al., 2003.)

Important Considerations. The analysis makes some demands on the investigator that may dictate the utility of time-series analysis in any particular instance. To begin with, the analysis depends on having a sufficient number of data points. The data points are needed to determine the existence and pattern of serial dependence in the data and to derive the appropriate time-series model for the data. The actual number of data points needed within each phase has been debated, with estimates ranging from 20 to 100 (e.g., Box & Jenkins, 1970; Glass et al., 1975; Hartmann et al., 1980; Jones et al., 1977). Variations of time-series analysis have been used in which a smaller number of observations is required. For example, in one variation a minimum of 10 to 16 observations totaled across baseline and intervention phases (e.g., five to eight observations in each phase) is required (Borckardt et al., 2008). Consequently, shorter phase durations have not precluded application of the analysis.
In many single-case experiments, phases are relatively brief. For example, in an ABAB design, the second A phase may be relatively brief because of the problems associated with returning behavior to baseline levels. Similarly, in a multiple-baseline design, the initial baseline phases for some of the behaviors (or individuals or situations) may be brief so that the intervention is not withheld for a very long time. In these instances, too few data points may preclude the application of time-series analysis.
Second and related, time-series analysis is not a matter of plugging numbers into a formula. There are steps performed on the data (by the computer program) that include model building, model estimation, and checking of the model against the data. Within these steps are contained such tasks as how best to describe the pattern of autocorrelation, what estimates of parameters are needed to maximize the fit of the model to the data, and, once estimated, how the model has contained, addressed, or removed autocorrelation. Once these are complete, the analysis can test changes in level and slope that are associated with the intervention. Returning to a prior point, one reason many data points are recommended for the analysis is to execute these initial steps and provide a good estimate of the model and parameters that fit the data. From this very cursory description, one can see that there is much to understand about time-series analyses. Multiple models are available. Although software allows one to enter the data, it is important to understand the steps along the way to the final result and the selection of the model. Results can vary as a function of accepting or not accepting default options within a program. Significance or lack of significance could easily be de fault of the investigator if she or he is not informed.
Time-series analysis is especially useful when the idealized data requirements and criteria for visual inspection are not met. When there is a trend in the therapeutic direction in baseline, when variability is large, or when treatment effects are neither rapid nor marked, time-series analysis may be especially useful. Also, the analysis is especially useful when the investigator is interested in drawing conclusions about changes in either level or trend. As reviewed previously, considerable data suggest that trend is not easily detected by visual inspection once one moves beyond simple ascending or descending straight lines.
Time-series analysis represents an option, but I do not wish to imply that it is a panacea for data evaluation for the single case. Among the cautions: time-series is not an analysis but rather a family of options, and therein lies the rub. There is no single way of doing the analysis that can be recommended as widely applicable across most designs or data sets. There are multiple options and decisions (e.g., how to model the data), and it is likely that these will yield different results. The many examples already published can provide concrete illustrations of some of the options.

OBSTACLES TO USING STATISTICAL TESTS

Many Options, Few Guidelines


Several options are available for statistical analyses, and any investigator interested in an analysis has choices. It is important to note that although more analyses and more reports of statistical analyses of single-case research have become available, there is not a groundswell movement among those who use single-case designs to adopt any of them. It is not difficult to explain why. First, there are very few resources available that explain the statistics for the single case in a straightforward way and show how to apply them. Some of the analyses I have listed (Table A.1) have only one or two references showing how to use the analysis with single-case data. Yes, there are exceptions for some of the tests, where they are clearly described and illustrated; randomization tests and time-series analysis, already mentioned, are two examples. For many other tests listed in the table, there are only a few studies to provide guidance, even though without such guidance the procedures are not all that complex to apply. Even so, more resources are needed, especially when considered against the backdrop of what is available for other statistics more commonly used in between-group research. For between-group researchers, popular statistical packages (e.g., SPSS, SAS) include multiple statistical techniques for data evaluation, are constantly revised, and serve as one-stop shopping for many faculty members, postdocs, and graduate students doing empirical research. Software packages compete in their comprehensiveness of coverage and ease of use. In the case of single-case research, software is available to address specific tests, but there is nothing with the scope of coverage and ease of use that the more familiar statistical packages provide.
Second, single-case statistical tests and their application are not straightforward. There are many different tests, and within a test (e.g., randomization, time-series analysis) many different versions. The options are daunting in part because which test is selected can make a huge difference. When different statistical tests are used to evaluate the same single-case data, quite different conclusions can be reached (e.g., Nourbakhsh & Ottenbacher, 1994; Parker & Brossart, 2003). Even if the "same" test is used but makes different assumptions or focuses on slightly different features of the data, the results can be quite different (e.g., Lall & Levin, 2004; Manolov & Solanas, 2008a, 2008b). Sometimes seeming nuances of the data set, such as the number of observations in a phase or the degree of autocorrelation, determine the extent to which there is likely to be bias (e.g., Type I error) in the conclusion and make a particular single-case statistic ill-advised or inappropriate (Sierra, Solanas, & Quera, 2005). Such issues are only now being elaborated, but for the person who seeks guidance in selecting a statistical test, I regret I cannot be more helpful.
One of the objections to visual inspection, noted in Chapter 12, was that the method is too subjective. Statistical tests, so the argument went, provide a more objective way of evaluating the data. There is a way in which this is definitely true. For the most part, once a statistical test is selected, the yield (the decision rule about statistical significance) is more objective (e.g., automatic) than visual inspection. Yet, as the research highlighted in this appendix conveys, which statistical test is selected makes an enormous difference and can lead to varied and opposing conclusions about the impact of the intervention.
Related, matching specific tests to specific designs is not straightforward. Studies that have elaborated statistical analyses for the single case commonly use AB phases as the paradigm to illustrate what the statistic does and how it works. Many graphs, sometimes hundreds, are generated by a computer to allow the systematic inclusion of various characteristics of the data. It is important to understand how various statistics operate, how autocorrelations of various magnitudes influence conclusions, and how different statistical tests compare with each other. However, in the trenches, we do single-case designs that are ABAB, changing-criterion, and multiple-treatment designs; we have subphases (e.g., as we change the criterion repeatedly) or many different baselines (e.g., multiple baselines across behaviors, individuals, and contexts); and we introduce another intervention (C) to see if we can have a stronger impact than our first intervention (B). At this point in the emerging field of statistics for single-case designs, there are no clear guidelines to match designs and statistical tests. AB-phase methods of data analysis do not automatically map onto or transfer to the design applications that are used. For example, is a multiple-baseline design across three behaviors merely three AB designs with staggered applications of B? Do we evaluate each AB separately? That is not what the design intends to show. Also, the correlations in the data (autocorrelation) are not just within the data of one baseline. If this is a multiple baseline across three (or more) behaviors (for the same individual), or one behavior for one individual in three (or more) settings, are all the baselines likely to bear some relation to one another? Is there multiple autocorrelation? Or, to borrow from the coffee shops, is there a "double autocorrelation grande with latte" we should worry about? As these questions emerge in relation to real data sets from single-case designs, one-eyed visual inspection in a very dark room while wearing a sleep mask starts to look not so bad.
Finally, there are few training opportunities in single-case research methods, leaving aside the more esoteric topic of the statistics that might be used to analyze single-case data. Graduate training programs in education, psychology, counseling, school psychology, occupational and physical therapy, and other areas where such designs are applied are not likely even to mention single-case research designs, let alone actively teach a course in them. The challenge in preparing students for research careers is making sure they are skilled and fluent in quantitative, null-hypothesis-testing research methods, that is, the Esperanto of science. This means between-group designs and data-analytic techniques. Within the quantitative tradition, there is so much to teach. Ongoing advances (e.g., mediational analyses, growth curves, hierarchical linear modeling, structural equation modeling, instrumental variable techniques, propensity score matching, latent transition analysis) must constantly be added to the canon to prepare students competently. There is little time to train in other traditions (single-case experimental designs, qualitative research) given the scope of courses already required.
One of my prior arguments favoring the use of statistical tests was to identify reliable effects in single-case designs that did not meet the requirements of visual inspection. I argued that small but reliable effects could be important for all sorts of reasons. We know from research on visual inspection that judges often disagree when intervention effects are not very strong. Less clear is whether statistics for single-case designs can do appreciably better. Statistical analyses often raise issues such as power, where sample size (e.g., the number of observations) may be important and detecting a difference is a function of many factors beyond the impact of the intervention. How well will single-case statistical tests fare when trying to detect smaller effects that cannot be readily agreed on by visual inspection? Sufficient work is not available to answer this, but perhaps a preview can be seen from a few studies indicating that some statistical tests will not be very useful or have sufficient statistical power unless effect sizes are very large (> 2.0) (e.g., Ferron & Sentovich, 2002; Manolov & Solanas, 2009).5 That may be an exception or be restricted to all sorts of other conditions, but we are still in need of basic research that can result in practical advice to guide us.

5. An effect size of 2.0 would be considered very large. To place this in context, arbitrary but still commonly used guidelines note that small, medium, and large effect sizes correspond to .2, .5, and .8 (Cohen, 1988). Recall that in between-group research this is the difference between the intervention and control groups expressed in standard deviation units. As a benchmark, psychotherapy for adults, when compared to no treatment, produces an effect size of approximately .7, meaning that the distributions for these groups would have their means this far apart in standard deviation units.

Recommendations
Many years ago, in evaluating statistical tests for single-case designs, I recommended the use of time-series analyses whenever possible (Kazdin, 1976, 1984). I have very much tried to resist making the same recommendation this time, because change in one's position at least conveys the illusion of progress in one's thinking. Also, in fact, since that time statistical tests for single-case designs have become an extraordinarily active area of research. I have mentioned many studies that compare alternative tests in evaluating real and simulated single-case data to understand and elaborate our options. At this time, at least from my reading, no statistical analysis has emerged clearly to recommend. Indeed, some of my earlier recommendations in the first edition of this book (e.g., rank tests, randomization tests) over the years have been shown to be influenced by serial dependence and hence raise more cautions. Time-series analysis is a very reasonable option that is better understood and used than other analyses. Thus, one has a literature from which to draw and compare. Also, as evident in the illustration provided previously with cancer patients, the analysis can be used with real data, in real designs, with real clinical problems. Other such applications could readily be provided (e.g., Levesque et al., 2004; Savard et al., 1998). All that said, as I mentioned, time-series analyses require multiple steps, including the estimation of models and parameters to fit the data and the use of these to make the tests that evaluate change. A commitment of time is needed to understand what is going on under the hood as the computer spews out estimates and tables and whether the analysis is a good test in light of the constraints of the data and the options within the analysis. Dare I say it: there is some visual inspection needed of the options to decide which among them is the most appropriate.
We know that when intervention effects in single-case data are not crystal clear, application of visual inspection criteria may not be reliable. We also know that single-case data are likely to be serially dependent and that this can misguide both visual inspection and statistical evaluation. In particular, trend is especially difficult to detect in visual inspection unless there is a straight ascending or descending line formed by the data points. With these considerations in mind, it would be prudent to use more than one means of evaluating the data as the design permits, that is, both visual inspection and statistical analysis, perhaps including time-series analysis if available as an option.
There is a compromise position I have not elaborated. Perhaps visual inspection will remain the primary method of evaluating single-case data but with statistical aids to facilitate its application. The aids are not statistical tests that operate as an independent way to test the reliability of the finding. Rather, they may help apply visual inspection in a manner that is more reliable. For example, visual inspection is particularly weak in identifying complex trends and taking trends into account in evaluating the intervention (see Fisch, 2001). Perhaps we could provide techniques to aid visual inspection that plotted trends or used alternative ways of making visually hidden trends more apparent. Many different ways of computing and evaluating trends are available with the goal of aiding visual inspection. At this point, some applications have helped enhance the reliability of visual inspection (W. W. Fisher et al., 2003), but others have not (e.g., Borckardt, Murphy, Nash, & Shaw, 2004). Novel methods continue to be sought and might well bear fruit (e.g., Parker et al., 2006), but no firm recommendation is yet available to guide us to a compromise position. Visual aids have their own merit but cannot take the place of analyses that assess the reliability of the change.
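One concrete form such an aid might take, sketched here hypothetically rather than as any single published procedure, is the familiar split-middle trend line: a line through the medians of the first and second halves of a phase that can then be superimposed on the plotted data.

```python
# A minimal sketch of a split-middle (celeration-style) trend line for one
# phase; one of many possible ways of computing a trend to aid inspection.
import numpy as np

def split_middle_trend(y):
    """Fit a line through the medians of the first and second halves."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    first, second = y[: n // 2], y[(n + 1) // 2 :]
    x1 = (n // 2 - 1) / 2.0                       # median x of first half
    x2 = (n + 1) // 2 + (len(second) - 1) / 2.0   # median x of second half
    slope = (np.median(second) - np.median(first)) / (x2 - x1)
    intercept = np.median(first) - slope * x1
    return slope, intercept

phase = [5, 7, 6, 8, 9, 8, 10, 11, 10, 12]  # invented data with gradual drift
slope, intercept = split_middle_trend(phase)
trend = [intercept + slope * t for t in range(len(phase))]
print(f"slope per observation = {slope:.2f}")
```

Superimposing the resulting line on the graph makes gradual drift visible that the unaided eye might miss, which is all such an aid is meant to do; it does not by itself assess the reliability of the change.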

CONCLUSIONS

The entire area of statistical evaluation for single-case designs has received increased attention in the past 10 years. The use of these statistical tests, discussion of the problems they raise, and suggestions for the development of alternative statistical techniques are likely to increase greatly in the future. We will need that work, because little in the way of concrete recommendations can be provided.
The issue of major significance is suiting the statistic to the design. Statistical tests for any research may impose special requirements on the design in terms of how, when, to whom, and how long the intervention is to be applied. In basic laboratory research with nonhuman and human animals, the requirements of the designs can largely dictate how the experiment is arranged and conducted. In applied settings where many single-case designs are used, practical constraints (e.g., in the classroom) often make it difficult to implement various design requirements such as reversal phases or withholding treatment for an extended period on one of the several baselines. Some of the statistical tests mentioned in this appendix also make special design requirements, such as including extended phases (time-series analysis) or randomly alternating treatment and no-treatment conditions (randomization tests). A decision must be made well in advance of a single-case investigation as to whether these and other requirements imposed by the design or by a statistical evaluation technique can be implemented.
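To illustrate the kind of requirement a randomization test imposes, the hypothetical sketch below assumes that five treatment and five no-treatment days were randomly scheduled in advance, a design decision that must be made before the data are collected; the data and schedule are invented for illustration.

```python
# A minimal sketch of a randomization test for randomly scheduled A/B days.
import itertools
import numpy as np

scores = np.array([7, 4, 8, 3, 6, 2, 7, 3, 8, 4], dtype=float)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # 0 = no-treatment, 1 = treatment

def mean_difference(scores, labels):
    return scores[labels == 0].mean() - scores[labels == 1].mean()

observed = mean_difference(scores, labels)

# Reference distribution: every way five of the ten days could have been
# assigned to the no-treatment condition (10 choose 5 = 252 assignments).
count, total = 0, 0
for combo in itertools.combinations(range(len(scores)), 5):
    perm = np.ones(len(scores), dtype=int)
    perm[list(combo)] = 0
    total += 1
    if abs(mean_difference(scores, perm)) >= abs(observed):
        count += 1

print(f"observed A-B difference = {observed:.2f}")
print(f"randomization test p = {count / total:.3f}")
```

The test is valid precisely because the schedule was randomized; had the days been chosen after inspecting the data, the reference distribution would not apply.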
The appendix raises multiple considerations for evaluating data in single-case research. A blind adoption of visual inspection or statistical analysis and too strong a preference for one instead of the other, in my opinion, is difficult to justify in light of current data. Each broad method has multiple strengths and weaknesses, and these vary under all sorts of other conditions. We would like simple rules to guide us and to teach our students. We have a couple, perhaps: (1) consider more than one means of evaluating the data, and (2) in relation to visual inspection and statistical analysis, do not take an "either/or" position. Either/or may work well in philosophy (Kierkegaard, 1843), but may not be wise in science.
REFERENCES

Achenbach, T. M. (1991). Manual for the Child Behavior Checklist/4-18 and 1991 Profile. Burlington: University of Vermont.
Achenbach, T. M. (2006). As others see us: Clinical and research implications of cross-informant correlations for psychopathology. Current Directions in Psychological Science, 15, 94-98.
Ahearn, W. H., Clark, K. M., MacDonald, R. P. F., & Chung, B. I. (2007). Assessing and treating vocal stereotypy in children with autism. Journal of Applied Behavior Analysis, 40, 263-275.
Aldwin, C. M., & Gilmer, D. F. (2004). Health, illness and optimal aging: Biological and psychosocial perspectives. Thousand Oaks, CA: Sage Publications.
Allen, K. D., & Evans, J. H. (2001). Exposure-based treatment to control excessive blood glucose monitoring. Journal of Applied Behavior Analysis, 34, 497-500.
Allison, D. B., & Gorman, B. S. (1993). Calculating effect sizes for meta-analysis: The case of the single case. Behaviour Research and Therapy, 31, 621-631.
Allport, G. W. (1961). Pattern and growth in personality. New York: Holt, Rinehart & Winston.
American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: American Psychiatric Association.
American Psychological Association. (2005). Policy statement on evidence-based practice in psychology. Washington, DC: American Psychological Association.
Ardoin, S. P., McCall, M., & Klubnik, C. (2007). Promoting generalization of oral reading fluency: Providing drill versus practice opportunities. Journal of Behavioral Education, 16, 55-70.
Ary, D., Covalt, W. C., & Suen, H. K. (1990). Graphic comparisons of interobserver agreement indices. Journal of Psychopathology and Behavioral Assessment, 12, 151-156.
Athens, E. S., Vollmer, T. R., & Pipkin, C. C. S. P. (2007). Shaping academic task engagement with percentile schedules. Journal of Applied Behavior Analysis, 40, 475-488.
Austin, J., & Carr, J. E. (Eds.). (2000). Handbook of applied behavior analysis. Reno, NV: Context Press.
Austin, J., Hackett, S., Gravina, N., & Lebbon, A. (2006). The effects of prompting and feedback on drivers' stopping at stop signs. Journal of Applied Behavior Analysis, 39, 117-121.
Ayllon, T. (1963). Intensive treatment of psychotic behavior by stimulus satiation and food reinforcement. Behaviour Research and Therapy, 1, 53-61.
Ayllon, T., & Haughton, E. (1964). Modification of symptomatic verbal behavior of mental patients. Behaviour Research and Therapy, 2, 87-97.
Ayllon, T., & Michael, J. (1959). The psychiatric nurse as a behavioral engineer. Journal of the Experimental Analysis of Behavior, 2, 323-334.
Ayllon, T., & Roberts, M. D. (1974). Eliminating discipline problems by strengthening academic performance. Journal of Applied Behavior Analysis, 7, 71-76.
Azrin, N. H., Hontos, P. T., & Besalel-Azrin, V. (1979). Elimination of enuresis without a conditioning apparatus: An extension by office instruction of the child and parents. Behavior Therapy, 10, 14-19.
Azrin, N. H., & Peterson, A. L. (1990). Treatment of Tourette's syndrome by habit reversal: A waiting-list control group. Behavior Therapy, 21, 305-318.
Baer, D. M. (1977). Perhaps it would be better not to know everything. Journal of Applied Behavior Analysis, 10, 167-172.
Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91-97.
Baer, D. M., Wolf, M. M., & Risley, T. R. (1987). Some still-current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 20, 313-327.
Bargh, J. A., & Morsella, E. (2008). The unconscious mind. Perspectives on Psychological Science, 3, 73-79.
Barlow, D. H., & Hayes, S. C. (1979). Alternating treatments design: One strategy for comparing the effects of two treatments in a single subject. Journal of Applied Behavior Analysis, 12, 199-210.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173-1182.
Barton, E. E., Reichow, B., & Wolery, M. (2007). Guidelines for graphing data with Microsoft PowerPoint. Journal of Early Intervention, 29, 320-336.
Basoglu, M., Salcioglu, E., & Livanou, M. (2007). A randomized controlled study of single-session behavioural treatment of earthquake-related posttraumatic stress disorder using an earthquake simulator. Psychological Medicine, 37, 203-214.
Basoglu, M., Salcioglu, E., & Livanou, M. (2009). Single-case experimental studies of a self-help manual for traumatic stress in earthquake survivors. Journal of Behavior Therapy and Experimental Psychiatry, 40, 50-58.
Basoglu, M., Salcioglu, E., Livanou, M., Kalender, D., & Acar, G. (2005). Single-session behavioral treatment of earthquake-related posttraumatic stress disorder: A randomized waiting list controlled trial. Journal of Traumatic Stress, 18, 1-11.
Battro, A. M. (2001). Half a brain is enough: The story of Nico. Cambridge, UK: Cambridge University Press.
Bearman, P. S., & Bruckner, H. (2005). After the promise: The STD consequences of adolescent virginity pledges. Journal of Adolescent Health, 36, 271-278.
Berg, B. L. (2001). Qualitative research methods for the social sciences (4th ed.). Needham Heights, MA: Allyn & Bacon.
Bijou, S. W. (1955). A systematic approach to an experimental analysis of young children. Child Development, 26, 161-168.
Bijou, S. W. (1957). Patterns of reinforcement and resistance to extinction in young children. Child Development, 28, 47-54.
Billette, V., Guay, S., & Marchand, A. (2008). Posttraumatic stress disorder and social support in female victims of sexual assault: The impact of spousal involvement on the efficacy of cognitive-behavioral therapy. Behavior Modification, 32, 876-896.
Bisconer, S. W., Green, M., Mallon-Czajka, J., & Johnson, J. S. (2006). Managing aggression in a psychiatric hospital using a behavioural plan: A case study. Journal of Psychiatric and Mental Health Nursing, 13, 515-521.
Bjorklund, D. F. (Ed.). (2000). False-memory creation in children and adults: Theory, research, and implications. Mahwah, NJ: Lawrence Erlbaum Associates.
Blanton, H., & Jaccard, J. (2006). Arbitrary metrics in psychology. American Psychologist, 61, 27-41.
Bolgar, H. (1965). The case study method. In B. B. Wolman (Ed.), Handbook of clinical psychology. New York: McGraw-Hill.
Borckardt, J. J., Murphy, M. D., Nash, M. R., & Shaw, D. (2004). An empirical examination of visual analysis procedures for clinical practice evaluation. Journal of Social Service Research, 30, 55-73.
Borckardt, J. J., Nash, M. R., Murphy, M. D., Moore, M., Shaw, D., & O'Neil, P. (2008). Clinical practice as natural laboratory for psychotherapy research: A guide to case-based time-series analysis. American Psychologist, 63, 77-95.
Boring, E. G. (1957). A history of experimental psychology (2nd ed.). New York: Appleton-Century-Crofts.
Bouchard, T. J., Jr., Lykken, D. T., McGue, M., Segal, N. L., & Tellegen, A. (1990). Sources of human psychological differences: The Minnesota study of twins reared apart. Science, 250, 223-228.
Box, G. E. P., & Jenkins, G. M. (1970). Time-series analysis: Forecasting and control. San Francisco: Holden-Day.
Box, G. E. P., & Jenkins, G. (1976). Time-series analysis: Forecasting and control (Rev. ed.). San Francisco: Holden-Day.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). Time-series analysis: Forecasting and control (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Bradshaw, W. (2003). Use of single-system research to evaluate the effectiveness of cognitive-behavioural treatment of schizophrenia. British Journal of Social Work, 33, 885-899.
Brainerd, C. J., & Reyna, V. F. (2005). The science of false memory. New York: Oxford University Press.
Brembs, B., Lorenzetti, F. D., Reyes, F. D., Baxter, D. A., & Byrne, J. H. (2002). Operant reward learning in Aplysia: Neuronal correlates and mechanisms. Science, 296, 1706-1709.
Breuer, J., & Freud, S. (1957). Studies in hysteria. New York: Basic Books.
Broemeling, L. D. (2009). Bayesian methods for measures of agreement. Boca Raton, FL: Chapman & Hall/Taylor & Francis.
Brooks, A., Todd, A. W., Tofflemoyer, S., & Horner, R. H. (2003). Use of functional assessment and a self-management system to increase academic engagement and work completion. Journal of Positive Behavior Interventions, 5, 144-152.
Brossart, D. F., Meythaler, J. M., Parker, R. I., McNamara, J., & Elliott, T. R. (2008). Advanced regression methods for single-case designs: Studying propranolol in the treatment for agitation associated with traumatic brain injury. Journal of Rehabilitation Psychology, 53, 357-369.
Brossart, D. F., Parker, R. I., Olson, E. A., & Mahadevan, L. (2006). The relationship between visual analysis and five statistical analyses in a simple AB single-case research design. Behavior Modification, 30, 531-563.
Browning, R. M. (1967). A same-subject design for simultaneous comparison of three reinforcement contingencies. Behaviour Research and Therapy, 5, 237-243.
Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193-217.
Busk, P., & Serlin, R. (1992). Meta-analysis for single-participant research. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case research design and analysis: New directions for psychology and education. Mahwah, NJ: Lawrence Erlbaum.
Busse, R. T., Kratochwill, T. R., & Elliott, S. N. (1995). Meta-analysis for single-case consultation outcomes: Applications to research and practice. Journal of School Psychology, 33, 269-285.
Calder, A. J., Keane, J., Manes, F., Antoun, N., & Young, A. W. (2000). Impaired recognition and experience of disgust following brain injury. Nature Neuroscience, 3, 1077-1078.
Cameron, M. J., Shapiro, R. L., & Ainsleigh, S. A. (2005). Bicycle riding: Pedaling made possible through positive behavioral interventions. Journal of Positive Behavior Interventions, 7, 153-158.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research and teaching. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally.
Campbell, J. M. (2004). Statistical comparison of four effect sizes for single-subject designs. Behavior Modification, 28, 234-246.
Carr, J. E., & Burkholder, E. O. (1998). Creating single-subject design graphs with Microsoft Excel. Journal of Applied Behavior Analysis, 31, 245-251.
Carter, N., Holmstrom, A., Simpanen, M., & Melin, L. (1988). Theft reduction in a grocery store through product identification and graphing of losses for employees. Journal of Applied Behavior Analysis, 21, 385-389.
Caspi, A., McClay, J., Moffitt, T. E., Mill, J., Martin, J., Craig, I., Taylor, A., & Poulton, R. (2002). Role of genotype in the cycle of violence in maltreated children. Science, 297, 851-854.
Centers for Disease Control and Prevention. (2009). Reduced hospitalizations for acute myocardial infarction after implementation of a smoke-free ordinance: City of Pueblo, Colorado, 2002-2006. Morbidity and Mortality Weekly Report, 57(51), 1373-1377.
Center, B. A., Skiba, R. J., & Casey, A. (1985-1986). A methodology for the quantitative synthesis of intra-subject design research. Journal of Special Education, 19, 387-400.
Chaddock, R. E. (1925). Principles and methods of statistics. Boston: Houghton Mifflin.
Chamberlain, P., & Reid, J. B. (1987). Parent observation and report of child symptoms. Behavioral Assessment, 9, 97-109.
Chambless, D. L., & Ollendick, T. H. (2001). Empirically supported psychological interventions: Controversies and evidence. Annual Review of Psychology, 52, 685-716.
Chassan, J. B. (1967). Research design in clinical psychology and psychiatry. New York: Appleton-Century-Crofts.
Chassin, L., Presson, C. C., Sherman, S. J., Montello, D., & McGrew, J. (1986). Changes in peer and parent influence during adolescence: Longitudinal versus cross-sectional perspectives on smoking initiation. Developmental Psychology, 22, 327-334.
Clarke, B., Fokoue, E., & Zhang, H. H. (2009). Principles and theory of data mining and machine learning. New York: Springer.
Clayton, M., Helms, B., & Simpson, C. (2006). Active prompting to decrease cell phone use and increase seat belt use while driving. Journal of Applied Behavior Analysis, 39, 341-349.
Clement, P. W. (2007). Story of "Hope": Successful treatment of obsessive-compulsive disorder. Pragmatic Case Studies in Psychotherapy, 3, 1-36. (Online at http://hdl.rutgers.edu/1782.1/pcsp_journal)
Clement, P. W. (2008). Outcomes from 40 years of psychotherapy in a private practice. American Journal of Psychotherapy, 62, 215-239.
Cleveland, W. S. (1993). Visualizing data. Summit, NJ: Hobart Press.
Cleveland, W. S. (1994). The elements of graphing data. Summit, NJ: Hobart Press.
Cohen, J. (1965). Some statistical issues in psychological research. In B. B. Wolman (Ed.), Handbook of clinical psychology. New York: McGraw-Hill.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cook, T. D., & Campbell, D. T. (Eds.). (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand-McNally.
Cooper, H., Hedges, L. V., & Valentine, J. C. (Eds.). (2009). The handbook of research synthesis and meta-analysis. New York: Russell Sage Foundation.
Cooper, J. O., Heron, T. E., & Heward, W. L. (2007). Applied behavior analysis (2nd ed.). Upper Saddle River, NJ: Pearson Education.
Cox, B. S., Cox, A. B., & Cox, D. J. (2000). Motivating signage prompts safety belt use among drivers exiting senior communities. Journal of Applied Behavior Analysis, 33, 635-638.
Cunningham, C. E., Bremner, R., & Boyle, M. (1995). Large group community-based parenting programs for families of preschoolers at risk for disruptive behaviour disorders: Utilization, cost effectiveness, and outcome. Journal of Child Psychology and Psychiatry, 36, 1141-1159.
Cunningham, T. R., & Austin, J. (2007). Using goal setting, task clarification, and feedback to increase the use of the hands-free technique by hospital operating room staff. Journal of Applied Behavior Analysis, 40, 673-677.
Dapcich-Miura, E., & Hovell, M. F. (1979). Contingency management of adherence to a complex medical regimen in an elderly heart patient. Behavior Therapy, 10, 193-201.
Davis, M., Myers, K. M., Chhatwal, J., & Ressler, K. J. (2006). Pharmacological treatments that facilitate extinction of fear: Relevance to psychotherapy. NeuroRx, 3, 82-96.
Dawes, R. M. (1994). House of cards: Psychology and psychotherapy built on myth. New York: Free Press.
De Los Reyes, A., & Kazdin, A. E. (2005). Informant discrepancies in the assessment of childhood psychopathology: A critical review, theoretical framework, and recommendations for further study. Psychological Bulletin, 131, 483-509.
DeMaster, B., Reid, J., & Twentyman, C. (1977). The effects of different amounts of feedback on observers' reliability. Behavior Therapy, 8, 317-329.
Denissenko, M. F., Pao, A., Tang, M., & Pfeifer, G. P. (1996). Preferential formation of benzo[a]pyrene adducts at lung cancer mutational hotspots in P53. Science, 274, 430-432.
Denzin, N. K., & Lincoln, Y. S. (Eds.). (2005). The SAGE handbook of qualitative research (3rd ed.). Thousand Oaks, CA: Sage.
DeProspero, A., & Cohen, S. (1979). Inconsistent visual analysis of intrasubject data. Journal of Applied Behavior Analysis, 12, 573-579.
DiGennaro, F. D., Martens, B. K., & Kleinmann, A. E. (2007). A comparison of performance feedback procedures on teachers' treatment implementation integrity and students' inappropriate behavior in special education classrooms. Journal of Applied Behavior Analysis, 40, 447-461.
Dishion, T. J., McCord, J., & Poulin, F. (1999). When interventions harm: Peer groups and problem behavior. American Psychologist, 54, 755-764.
Dittmer, C. G. (1926). Introduction to social statistics. Chicago: Shaw.
Dodge, K. A., Dishion, T. J., & Lansford, J. E. (Eds.). (2006). Deviant peer influences in programs for youth: Problems and solutions. New York: Guilford.
Doss, A. J., & Weisz, J. R. (2006). Syndrome co-occurrence and treatment outcomes in youth mental health clinics. Journal of Consulting and Clinical Psychology, 74, 416-425.
Drebing, C. E., Van Ormer, E. A., Krebs, C., Rosenheck, R., Rounsaville, B., Herz, L., & Penk, W. (2005). The impact of enhanced incentives for dually diagnosed veterans. Journal of Applied Behavior Analysis, 38, 359-372.
Ducharme, J. M., Folino, A., & DeRosie, J. (2008). Errorless acquiescence training: A potential "keystone" approach to building peer interaction skills in children with severe problem behavior. Behavior Modification, 32, 39-60.
Dukes, W. F. (1965). N = 1. Psychological Bulletin, 64, 74-79.
Dunn, J., & Plomin, R. (1990). Separate lives: Why siblings are so different. New York: Basic Books.
Edgington, E. S. (1969). Statistical inference: The distribution free approach. New York: McGraw-Hill.
Edgington, E. S. (1996). Randomized single-subject experimental designs. Behaviour Research and Therapy, 34, 567-574.
Edgington, E. S., & Onghena, P. (2007). Randomization tests (4th ed.). Boca Raton, FL: Chapman & Hall/CRC.
Engelmann, S., Haddox, P., & Bruner, E. (1983). Teach your child to read in 100 easy lessons. New York: Simon & Schuster.
Façon, B., Sahiri, S., & Rivière, V. (2008). A controlled single-case treatment of severe long-term selective mutism in a child with mental retardation. Behavior Therapy, 39, 313-321.
Faith, M. S., Gorman, B. G., & Allison, D. B. (1997). Meta-analytic evaluations of single-case designs. In D. B. Allison, B. Gorman, & R. Franklin (Eds.), Methods for the design and analysis of single-case research. Hillsdale, NJ: Lawrence Erlbaum.
Farrimond, S. J., & Leland, L. S., Jr. (2006). Increasing donations to supermarket food-bank bins using proximal prompts. Journal of Applied Behavior Analysis, 39, 249-251.
Favell, J. E., McGimsey, J. F., & Jones, M. L. (1980). Rapid eating in the retarded: Reduction by nonaversive procedures. Behavior Modification, 4, 481-492.
Feather, J. S., & Ronan, K. R. (2006). Trauma-focused cognitive-behavioural therapy for abused children with posttraumatic stress disorder. New Zealand Journal of Psychology, 35, 132-145.
Feehan, M., McGee, R., Stanton, W., & Silva, P. A. (1990). A 6-year follow-up of childhood enuresis: Prevalence in adolescence and consequences for mental health. Journal of Paediatrics and Child Health, 26, 75-79.
Feldman, R. A., Caplinger, T. E., & Wodarski, J. S. (1983). The St. Louis conundrum: The effective treatment of antisocial youths. Englewood Cliffs, NJ: Prentice-Hall.
Ferritor, D. E., Buckholdt, D., Hamblin, R. L., & Smith, L. (1972). The noneffects of contingent reinforcement for attending behavior on work accomplished. Journal of Applied Behavior Analysis, 5, 7-17.
Ferron, J., & Sentovich, C. (2002). Statistical power of randomization tests used with multiple-baseline designs. Journal of Experimental Education, 70, 165-178.
Ferster, C. B. (1961). Positive reinforcement and behavioral deficits in autistic children. Child Development, 32, 437-456.
Ferster, C. B., & Skinner, B. F. (1957). Schedules of reinforcement. New York: Appleton-Century-Crofts.
Fiore, M. C., Bailey, W. C., Cohen, S. J., Dorfman, S. F., Goldstein, M. G., Gritz, E. R., et al. (2000). Treating tobacco use and dependence: Clinical practice guideline. Rockville, MD: U.S. Department of Health and Human Services, Public Health Service.
Fisch, G. S. (2001). Evaluating data from behavioral analysis: Visual inspection or statistical models? Behavioural Processes, 54, 137-154.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, UK: Oliver & Boyd.
Fisher, W. W., Kelley, M. E., & Lomas, J. E. (2003). Visual aids and structured criteria for improving visual inspection and interpretation of single-case designs. Journal of Applied Behavior Analysis, 36, 387-406.
Fishman, C. (2006). How many light bulbs does it take to change the world? One. And you're looking at it. Fast Company, 108 (September), p. 74.
Flood, W. A., & Wilder, D. A. (2004). The use of differential reinforcement and fading to increase time away from a caregiver in a child with separation anxiety disorder. Education and Treatment of Children, 27, 1-8.
Foley, D., Wormley, B., Silberg, J., Maes, H., Hewitt, J., Eaves, L., & Riley, B. (2004). Childhood adversity, MAOA genotype, and risk for conduct disorder. Archives of General Psychiatry, 61, 738-744.
Foster, S. L., & Mash, E. J. (1999). Assessing social validity in clinical treatment research: Issues and procedures. Journal of Consulting and Clinical Psychology, 67, 308-319.
Fournier, A. K., Ehrhart, I. J., Glindemann, K. E., & Geller, E. S. (2004). Intervening to decrease alcohol abuse at university parties: Differential reinforcement of intoxication level. Behavior Modification, 28, 167-181.
Franklin, R. D., Allison, D. B., & Gorman, B. S. (Eds.). (1997). Design and analysis of single-case research. Mahwah, NJ: Lawrence Erlbaum Associates.
Franklin, R. D., Gorman, B. S., Beasley, T. M., & Allison, D. B. (1997). Graphical display and visual analysis. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research. Mahwah, NJ: Lawrence Erlbaum Associates.
Freedman, B. J., Rosenthal, L., Donahoe, C. P., Schlundt, D. G., & McFall, R. (1978). A social-behavioral analysis of skills deficits in delinquent and nondelinquent adolescent boys. Journal of Consulting and Clinical Psychology, 46, 1448-1462.
Freud, S. (1933). New introductory lectures in psychoanalysis. New York: Norton.
Friedman, J., & Axelrod, S. (1973). The use of a changing-criterion procedure to reduce the frequency of smoking behavior. Unpublished manuscript, Temple University.
Gabbard, G. O., Lazar, S. G., Hornberger, J., & Spiegel, D. (1997). The economic impact of psychotherapy: A review. American Journal of Psychiatry, 154, 147-155.
Garb, H. N. (2005). Clinical judgment and decision making. Annual Review of Clinical Psychology, 1, 67-89.
Gilbert, J. P., Light, R. J., & Mosteller, F. (1975). Assessing social interventions: An empirical base for policy. In C. A. Bennett & A. A. Lumsdaine (Eds.), Evaluation and experiment: Some critical issues in assessing social programs. New York: Academic Press.
Gilbert, L. M., Williams, R. L., & McLaughlin, T. F. (1996). Use of assisted reading to increase correct reading rates and decrease error rates of students with learning disabilities. Journal of Applied Behavior Analysis, 29, 255-257.
Gilovich, T., Griffin, D., & Kahneman, D. (2002). Heuristics and biases: The psychology of intuitive judgment. Cambridge, UK: Cambridge University Press.
Girolami, P. A., Boscoe, J. H., & Roscoe, N. (2007). Decreasing expulsions by a child with a feeding disorder: Using a brush to present and re-present food. Journal of Applied Behavior Analysis, 40, 749-753.
Glass, G. V. (1997). Interrupted time series quasi-experiments: Complementary methods for research in education (2nd ed.). Washington, DC: American Educational Research Association.
Glass, G. V., Willson, V. L., & Gottman, J. M. (1975). Design and analysis of time-series experiments. Boulder: Colorado Associated University Press.
Glenn, I. M., & Dallery, J. (2007). Effects of internet-based voucher reinforcement and a transdermal nicotine patch on cigarette smoking. Journal of Applied Behavior Analysis, 40, 1-13.
Goldiamond, I. (1962). The maintenance of ongoing fluent verbal behavior and stuttering. Journal of Mathetics, 1, 57-95.
Gore, A. (2006). An inconvenient truth: The planetary emergency of global warming and what we can do about it. New York: Rodale.
Gorsuch, R. L. (1983). Three methods for analyzing limited time-series (N of 1) data. Behavioral Assessment, 5, 141-154.
Greenberg, P. E., Sisitsky, T., Kessler, R. C., Finkelstein, S. N., Berndt, E. R., Davidson, J. R. T., Ballenger, J. C., & Fyer, A. J. (1999). The economic burden of anxiety disorders in the 1990s. Journal of Clinical Psychiatry, 60, 427-435.
Grice, C. R., & Hunter, J. J. (1964). Stimulus intensity time-series analysis and its application to behav-
effects depend upon the type o f experimental ioral data. Journal o f Applied Kehavior Analysis,
design. Psychological Review, 71, 247-256. ,
13 543-559-
Griner, D.. & Smith, T. B. (2006). Culturally adapted Hasler, B. P., Mehl, M R., B ootiin, R. R .» & Vazire, S.
mental health interventions: A meta-analytic (2008). Preliminary evidence o f diurnal rhythms
review. Psychotherapy: Theory, Research, Practice, in everyday behaviors associated with positive
Training, 43,531-54 8 . affect. Journal o f Research in Personality, 42,
Grissom , T., Ward, R, M artin, B., & Leenders, N. Y. - -
J. M . (2005). Physical activity in physical educa- Hassin, R. R ., Ferguson, M . J., Shidlovski, D., &
tion. Family Community Health, 2 8 ,125-129. Gross, T. (2007). Subliminal exposure to national
Gross, A ., Miltenberger, R „ Knudson, P., Bosch, A., & flags affects political thought and behavior.
Breitwieser, C. B. (2007). Preliminary evalua- Proceedings o f the National Academy o f Sciences,
tion o f a parent training program to prevent gun 104,19757-19761-
play. Journal o f Applied Behavior Analysis, 40, Hassin, R., Ulem an, J., & Bargh. F. (Eds.). (200;). The
691-695. new unconscious. New York: Oxford University
Hains, A. H., & Baer, D. M. (1989). Interaction effects Press.
in multielement designs: Inevitable, desirable, Hawkins, R. P., & Dobes, R. W. ( l977). Behavioral
and ignorable. Journal o f Applied Behavior definitions in applied behavior analysis: Explicit
Analysis, 2 2 ,57-69. or implicit. In B. C. Etzel, ]. M . LeBIane, & D.
Hall, S. S., Maynes, N. P., & Reiss, A. L. (2009). Using M. Baer (Eds.), New developments in behavioral
percentile schedules to increase eye contact in research: Theory, methods, a n d applications. In
children with Fragile X syndrome. Journal o f honor o f Sidney W. Bijou. Hillsdale, N J: Lawrence
A pplied Behavior Analysis, 4 2 ,171-176 . Erlbaum Associates.
Hanley, G. P., Heal, N. A..Tiger, J. H „ & Ingvarsson, E. T. Henry, G. T. (1995). Graphing data: Techniques fo r
(2007). Evaluation o f a classwide teaching pro- display a n d analysis. Thousand Oaks, C A : Sage
gram for developing preschool life skills. Journal Publications.
o f Applied Behavior Analysis, 40, 277-300. Hersen, M ., & Barlow, D. H. (1976)- Single-case exper-
Harbst, K. B., Ottenbacher, K. J., & Harris, S. R. (1991). imental designs: Strategies fo r Undying ie h a v io r
Interrater reliability o f therapists’ judgments of change. N ew York: Pergamon.
graphed data. Physical Therapy, 7 1 , 107-115. Hetzroni, O. E., Quis-t, R. W., & LLoyd, L. L. (2002).
Harris, V. W„ & Sherman, J. A. (1974)- Homework Translucency and complexity: Effects on blis-
assignments, consequences, and classroom per- symbol learning; using computer and teacher
formance in social studies and mathematics. presentations. Language, Speech, and Hearing
Journal o f Applied Behavior Analysis, 7 , 505-519. Services in Schook, 33,291-503.
Hartmann, D. P. (1982). Assessing the dependability Hinile, M. B.„ Chang, S., Woods, L). W„ Pearlman, A.,
o f observational data. In D. P. Hartmann (Ed.), Buzzella, B., Bunaciu, L., & Piacentlni, J. C.
New directions for the methodology of behavioral (2006). Establishing the feasibility o f direct
sciences: Using observers to study behavior. San observation in the assessment of tics in children
Francisco: Jossey-Bass. with chronic tic disorders. Journal o f A pplied
Hartmann, D. P., Barrios, B. A., & Wood, D. D. (2004). Behavior Analysis, 39, 429-4.40.
Principles o f behavioral observation. In S. N. Hofmann, S. G., Meuret, A. E., Smits, J. A ., Simon, N.
Haynes & E. M. Hieby (Eds.), Comprehensive M „ Pollack, M. H ., Eisenmenger, K.,Shiekh, M &
handbook of psychological assessment (Vol. 3, Otto, M . W. (200^). Augmentation o f exposure
Behavioral assessment). New York: John Wiley & therapy with D-cycIoserine for social anxiety dis-
Sons. order. Archives o f G a u r a l Psychiatry, 63, >98-304.
Hartmann, D. P., Gottman, J. iM., Jones, R. R „ Gardner, Hollon, S. D., & Beck, A. T. (2004). Cognitive and
W., Kazdin, A. E., & Vaught, R. (1980). Interrupted cognitive behavioral therapies, tn M. J. Lambert
Honekopp, J. (2006). Once more: Is beauty in the eye of the beholder? Relative contributions of private and shared taste to judgments of facial attractiveness. Journal of Experimental Psychology: Human Perception and Performance, 32, 199-209.
Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165-179.
Hsu, L. M. (1989). Random sampling, randomization, and equivalence of contrasted groups in psychotherapy outcome research. Journal of Consulting and Clinical Psychology, 57, 131-137.
Hughes, C. A. O., & Carter, M. (2002). Toys and materials as setting events for the social interaction of preschool children with special needs. Educational Psychology, 22, 429-444.
Hughes, M. A., Alberto, P. A., & Fredrick, L. L. (2006). Self-operated auditory prompting systems as a function-based intervention in public community settings. Journal of Positive Behavior Interventions, 8, 230-243.
Humm, S. P., Blampied, N. M., & Liberty, K. A. (2005). Effects of parent-administered, home-based, high-probability request sequences on compliance by children with developmental disabilities. Child and Family Behavior Therapy, 27, 327-345.
Hunsley, J. (2007). Addressing key challenges in evidence-based practice in psychology. Professional Psychology: Research and Practice, 38, 113-121.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage Publications.
Ingram, K., Lewis-Palmer, T., & Sugai, G. (2005). Function-based intervention planning: Comparing the effectiveness of FBA function-based and non-function-based intervention plans. Journal of Positive Behavior Interventions, 7, 224-236.
Institute of Medicine. (2001). Crossing the quality chasm: A new health system for the 21st century. Washington, DC: National Academy Press.
Iwata, B. A., Kahng, S. W., Wallace, M. D., & Lindberg, J. S. (2000). The functional analysis model of behavioral assessment. In J. Austin & J. E. Carr (Eds.), Handbook of applied behavior analysis. Reno, NV: Context Press.
Jacobson, N. S., & Revenstorf, D. (1988). Statistics for assessing the clinical significance of psychotherapy techniques: Issues, problems, and new developments. Behavioral Assessment, 10, 133-145.
Jacobson, N. S., Roberts, L. J., Berns, S. B., & McGlinchey, J. (1999). Methods for defining and determining the clinical significance of treatment effects in mental health research: Current status, new applications, and future directions. Journal of Consulting and Clinical Psychology, 67, 300-307.
Jaffee, S. R., Caspi, A., Moffitt, T. E., Dodge, K., Rutter, M., Taylor, A., & Tully, L. (2005). Nature x nurture: Genetic vulnerabilities interact with physical maltreatment to promote behavior problems. Development and Psychopathology, 17, 67-84.
Jason, L. A., & Brackshaw, E. (1999). Access to TV contingent on physical activity: Effects on reducing TV-viewing and body-weight. Journal of Behavior Therapy and Experimental Psychiatry, 30, 145-151.
Johnson, B. M., Miltenberger, R. G., Egemo-Helm, K., Jostad, C. M., Flessner, C., & Gatheridge, B. (2005). Evaluation of behavioral skills training for teaching abduction-prevention skills to young children. Journal of Applied Behavior Analysis, 38, 67-78.
Jones, M. C. (1924a). A laboratory study of fear: The case of Peter. Pedagogical Seminary and Journal of Genetic Psychology, 31, 308-315.
Jones, M. C. (1924b). The elimination of children's fears. Journal of Experimental Psychology, 7, 382-390.
Jones, R. R., Vaught, R. S., & Weinrott, M. (1977). Time-series analysis in operant research. Journal of Applied Behavior Analysis, 10, 151-166.
Jones, R. R., Weinrott, M. R., & Vaught, R. S. (1978). Effects of serial dependency on the agreement between visual and statistical inference. Journal of Applied Behavior Analysis, 11, 277-283.
Jones, W. P. (2003). Single-case time series with Bayesian analysis: A practitioner's guide. Measurement and Evaluation in Counseling and Development, 36, 28-39.
Kazdin, A. E. (1976). Statistical analysis for single-case experimental designs. In M. Hersen & D. H. Barlow, Single-case experimental designs: Strategies for studying behavior change. Elmsford, NY: Pergamon.
Kazdin, A. E. (1977a). Artifact, bias, and complexity of assessment: The ABC's of reliability. Journal of Applied Behavior Analysis, 10, 141-150.
Kazdin, A. E. (1977b). Assessing the clinical or applied significance of behavior change through social validation. Behavior Modification, 1, 427-452.
Kazdin, A. E. (1977c). The token economy: A review and evaluation. New York: Plenum.
Kazdin, A. E. (1978). History of behavior modification: Experimental foundations of contemporary research. Baltimore: University Park Press.
Kazdin, A. E. (1981). Drawing valid inferences from case studies. Journal of Consulting and Clinical Psychology, 49, 183-192.
Kazdin, A. E. (1982). Symptom substitution, generalization, and response covariation: Implications for psychotherapy outcome. Psychological Bulletin, 91, 349-365.
Kazdin, A. E. (1984). Statistical analyses for single-case experimental designs. In D. H. Barlow & M. Hersen, Single-case experimental designs: Strategies for studying behavior change (2nd ed.). Elmsford, NY: Pergamon.
Kazdin, A. E. (1994). Informant variability in the assessment of childhood depression. In W. M. Reynolds & H. Johnston (Eds.), Handbook of depression in children and adolescents. New York: Plenum.
Kazdin, A. E. (2001). Behavior modification in applied settings (6th ed.). Long Grove, IL: Waveland Press.
Kazdin, A. E. (2003). Research design in clinical psychology (4th ed.). Boston: Allyn & Bacon.
Kazdin, A. E. (2006). Arbitrary metrics: Implications for identifying evidence-based treatments. American Psychologist, 61, 42-49.
Kazdin, A. E. (2007). Mediators and mechanisms of change in psychotherapy research. Annual Review of Clinical Psychology, 3, 1-27.
Kazdin, A. E. (2008a). Evidence-based treatments and delivery of psychological services: Shifting our emphases to increase impact. Psychological Services, 5, 201-215.
Kazdin, A. E. (2008b). Evidence-based treatment and practice: New opportunities to bridge clinical research and practice, enhance the knowledge base, and improve patient care. American Psychologist, 63, 146-159.
Kazdin, A. E. (2009). Psychological science's contributions to a sustainable environment: Extending our reach to a grand challenge of society. American Psychologist, 64, 339-356.
Kazdin, A. E., & Bass, D. (1989). Power to detect differences between alternative treatments in comparative psychotherapy outcome research. Journal of Consulting and Clinical Psychology, 57, 138-147.
Kazdin, A. E., & Geesey, S. (1977). Simultaneous-treatment design comparisons of the effects of earning reinforcers for one's peers versus for oneself. Behavior Therapy, 8, 682-693.
Kazdin, A. E., & Hartmann, D. P. (1978). The simultaneous-treatment design. Behavior Therapy, 9, 912-922.
Kazdin, A. E., & Mascitelli, S. (1980). The opportunity to earn oneself off a token system as a reinforcer for attentive behavior. Behavior Therapy, 11, 68-78.
Kazdin, A. E., & Polster, R. (1973). Intermittent token reinforcement and response maintenance in extinction. Behavior Therapy, 4, 386-391.
Kazdin, A. E., Siegel, T., & Bass, D. (1992). Cognitive problem-solving skills training and parent management training in the treatment of antisocial behavior in children. Journal of Consulting and Clinical Psychology, 60, 733-747.
Kazdin, A. E., & Wassell, G. (2000). Therapeutic changes in children, parents, and families resulting from treatment of children with conduct problems. Journal of the American Academy of Child and Adolescent Psychiatry, 39, 414-420.
Kazdin, A. E., & Whitley, M. K. (2006). Comorbidity, case complexity, and effects of evidence-based treatment for children referred for disruptive behavior. Journal of Consulting and Clinical Psychology, 74, 455-467.
Kendall, P. C., & Grove, W. M. (1988). Normative comparisons in therapy outcome. Behavioral Assessment, 10, 147-158.
Kennedy, C. H. (2002). The maintenance of behavior change as an indicator of social validity. Behavior Modification, 26, 627-647.
Kenny, D. A., Kashy, D. A., & Bolger, N. (1998). Data analysis in social psychology. In D. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), Handbook of social psychology (4th ed., Vol. 1).
Kent, R. N., & Foster, S. L. (1977). Direct observational procedures: Methodological issues in naturalistic settings. In A. R. Ciminero, K. S. Calhoun, & H. E. Adams (Eds.), Handbook of behavioral assessment. New York: Wiley.
Kierkegaard, S. (1843). Either/or: A fragment of life. Copenhagen: University Bookshop Reitzel. (Translated 1944, H. Milford, Oxford University Press)
Kim-Cohen, J., Caspi, A., Taylor, A., Williams, B., Newcombe, R., Craig, I. W., & Moffitt, T. E. (2006). MAOA, maltreatment, and gene-environment interaction predicting children's mental health: New evidence and a meta-analysis. Molecular Psychiatry, 11, 903-913.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
Knapp, T. J. (1983). Behavior analysts' visual appraisal of behavior change in graphic display. Behavioral Assessment, 5, 155-164.
Kodak, T., Grow, L., & Northrup, J. (2004). Functional analysis and treatment of elopement for a child with Attention-Deficit Hyperactivity Disorder. Journal of Applied Behavior Analysis, 37, 229-232.
Koegel, R. L., & Koegel, L. K. (2006). Pivotal response treatments for autism: Communication, social, and academic development. Baltimore: Brookes Publishing Company.
Komaki, J., & Barnett, F. T. (1977). A behavioral approach to coaching football: Improving the play execution of the offensive backfield on a youth football team. Journal of Applied Behavior Analysis, 10, 657-664.
Korchin, S. J. (1976). Modern clinical psychology. New York: Basic Books.
Kosslyn, S. M. (2006). Graph design for the eye and mind. New York: Oxford University Press.
Kraemer, H. C., Kiernan, M., Essex, M., & Kupfer, D. J. (2008). How and why criteria defining moderators and mediators differ between the Baron & Kenny and MacArthur approaches. Health Psychology, 27(Suppl.), S101-S108.
Kraemer, H. C., Stice, E., Kazdin, A. E., Offord, D. R., & Kupfer, D. J. (2001). How do risk factors work together? Mediators, moderators, independent, overlapping, and proxy-risk factors. American Journal of Psychiatry, 158, 848-856.
Kraemer, H. C., Wilson, G. T., Fairburn, C. G., & Agras, W. S. (2002). Mediators and moderators of treatment effects in randomized clinical trials. Archives of General Psychiatry, 59, 877-883.
Kratochwill, T. R. (2006). Evidence-based interventions and practices in school psychology: The scientific basis of the profession. In R. F. Subotnik & H. J. Walberg (Eds.), The scientific basis of educational productivity. Charlotte, NC: Information Age Publishing.
Kratochwill, T. R., Hoagwood, K. E., Frank, J. L., Levitt, J. M., Olin, S., Romanelli, L. H., & Saka, N. (2009). Evidence-based interventions and practices in school psychology: Challenges and opportunities. In T. B. Gutkin & C. R. Reynolds (Eds.), The handbook of school psychology (4th ed.). New York: John Wiley & Sons.
Kratochwill, T. R., & Levin, J. R. (in press). Enhancing the scientific credibility of single-case intervention research: Randomization to the rescue. Psychological Methods.
Kromrey, J. D., & Foster-Johnson, L. (1996). Determining the efficacy of intervention: The use of effect sizes for data analysis in single-subject research. The Journal of Experimental Education, 65, 73-93.
Kushner, M. G., Kim, S. W., Donahue, C., Thuras, P., Adson, D., Kotlyar, M., McCabe, J., Peterson, J., & Foa, E. B. (2007). D-cycloserine augmented exposure therapy for Obsessive-Compulsive Disorder. Biological Psychiatry, 62, 835-838.
Lall, V. F., & Levin, J. R. (2004). An empirical investigation of the statistical properties of generalized single-case randomization tests. Journal of School Psychology, 42, 61-86.
Lambert, M. J., Hansen, N. B., & Finch, A. E. (2001). Client-focused research: Using client outcome data to enhance treatment effects. Journal of Consulting and Clinical Psychology, 69, 159-172.
Lambert, M. J., Hansen, N. B., Umphress, V., Lunnen, K., Okiishi, J., Burlingame, G., Huefner, J. C., & Reisinger, C. W. (1996). Administration and scoring manual for the Outcome Questionnaire (OQ 45.2). Wilmington, DE: American Professional Credentialing Services.
Lambert, M. J., Vermeersch, D. A., Brown, G. S., & Burlingame, G. M. (2004). Administration and scoring manual for the OQ-30.2. Orem, UT: American Professional Credentialing Services.
Lambert, M. J., Whipple, J. L., Hawkins, E. J., Vermeersch, D. A., Nielsen, S. L., & Smart, D. W. (2003). Is it time for clinicians to routinely track patient outcome? A meta-analysis. Clinical Psychology: Science and Practice, 10, 288-301.
Levesque, M., Savard, J., Simard, S., Gauthier, J. G., & Ivers, H. (2004). Efficacy of cognitive therapy for depression among women with metastatic cancer: A single-case experimental study. Journal of Behavior Therapy and Experimental Psychiatry, 35, 287-305.
Levin, J. R., & Wampold, B. E. (1999). Generalized single-case randomization tests: Flexible analyses for a variety of situations. School Psychology Quarterly, 14, 59-93.
Lewin, L. M., & Wakefield, J. A., Jr. (1979). Percentage agreement and phi: A conversion table. Journal of Applied Behavior Analysis, 12, 299-301.
Lewinsohn, P. M., Clarke, G. N., Hops, H., & Andrews, J. (1990). Cognitive-behavioral treatment for depressed adolescents. Behavior Therapy, 21, 385-401.
Lindsley, O. R. (1956). Operant conditioning methods applied to research in chronic schizophrenia. Psychiatric Research Reports, 5, 118-139.
Lindsley, O. R. (1960). Characteristics of the behavior of chronic psychotics as revealed by free-operant conditioning methods. Diseases of the Nervous System (Monograph Supplement), 21, 66-78.
Lipsey, M. W. (1996). Theory as method: Small theories of treatments. In L. Sechrest & A. G. Scott (Eds.), New directions in program evaluation: Understanding causes and generalizing about them (Serial No. 57). New York: Jossey-Bass.
Lochman, J. E. (2010). Anger control training for aggressive youth. In J. R. Weisz & A. E. Kazdin (Eds.), Evidence-based psychotherapies for children and adolescents (2nd ed.). New York: Guilford.
Love, S. M., Koob, J. J., & Hill, L. E. (2007). Meeting the challenges of evidence-based practice: Can mental health therapists evaluate their practice? Brief Treatment and Crisis Intervention, 7, 184-193.
Luiselli, J. K. (2000). Cueing, demand fading, and positive reinforcement to establish self-feeding and oral consumption in a child with chronic food refusal. Behavior Modification, 24, 348-358.
Luiselli, J. K., Reed, F. D. D., Christian, W. P., Markowski, A., Rue, H. C., St. Amand, C., & Ryan, C. J. (2009). Effects of an informational brochure, lottery-based financial incentive, and public posting on absenteeism of direct-care human service employees. Behavior Modification, 33, 175-181.
Lumley, V. A., Miltenberger, R. G., Long, E. S., Rapp, J. T., & Roberts, J. A. (1998). Evaluation of a sexual abuse prevention program for adults with mental retardation. Journal of Applied Behavior Analysis, 31, 91-101.
Lundervold, D., & Bourland, G. (1988). Quantitative analysis of treatment of aggression, self-injury, and property destruction. Behavior Modification, 12, 591-617.
Ma, H. (2006). An alternative method for quantitative synthesis of single-subject researches: Percentage of data points exceeding the median. Behavior Modification, 30, 598-617.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. Mahwah, NJ: Erlbaum.
Macmillan, M. (2002). An odd kind of fame: Stories of Phineas Gage. Boston: MIT Press.
Manolov, R., & Solanas, A. (2008a). Comparing N = 1 effect size indices in presence of autocorrelation. Behavior Modification, 32, 860-875.
Manolov, R., & Solanas, A. (2008b). Randomization tests for ABAB designs: Comparing data-division-specific and common distributions. Psicothema, 20, 297-303.
Manolov, R., & Solanas, A. (2009). Problems of the randomization tests for AB designs. Psicologica, 30, 137-154.
Manson, J. E., Allison, M. A., Rossouw, J. E., Carr, J. J., Langer, R. D., Hsia, J., et al. (2007). Estrogen therapy and coronary-artery calcification. New England Journal of Medicine, 356, 2591-2602.
Marholin, D. H., Steinman, W. M., McInnis, E. T., & Heads, T. B. (1975). The effect of a teacher's presence on the classroom behavior of conduct-problem children. Journal of Abnormal Child Psychology, 3, 11-25.
Martin, J. E., & Sachs, D. A. (1973). The effects of a self-control weight loss program on an obese woman. Journal of Behavior Therapy and Experimental Psychiatry, 4, 155-159.
Mastropieri, M. A., & Scruggs, T. E. (1985-86). Early intervention for socially withdrawn children. The Journal of Special Education, 19, 429-441.
Matt, G. E. (1989). Decision rules for selecting effect sizes in meta-analysis: A review and reanalysis of psychotherapy outcome studies. Psychological Bulletin, 105, 106-115.
Matt, G. E., & Navarro, A. M. (1997). What meta-analyses have and have not taught us about psychotherapy effects: A review and future directions. Clinical Psychology Review, 17, 1-32.
Matyas, T. A., & Greenwood, K. M. (1990). Visual analysis of single-case time series: Effects of variability, serial dependence, and magnitude of intervention effects. Journal of Applied Behavior Analysis, 23, 341-351.
Matyas, T. A., & Greenwood, K. M. (1991). Problems in the estimation of autocorrelation in brief time series and some implications for behavioral data. Behavioral Assessment, 13, 137-157.
Matyas, T. A., & Greenwood, K. M. (1997). Serial dependency in single-case time series. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research. Mahwah, NJ: Lawrence Erlbaum Associates.
Mayer, R. J. (2009). Targeted therapy for advanced colorectal cancer: More is not always better. New England Journal of Medicine, 360, 623-625.
McCollough, D., Weber, K., Derby, K. M., & McLaughlin, T. F. (2008). The effects of Teach Your Child to Read in 100 Easy Lessons on the acquisition and generalization of reading skills with a primary student with ADHD and PI. Child and Family Behavior Therapy, 30, 61-68.
McCurdy, M., Skinner, C. H., Grantham, K., Watson, T. S., & Hindman, P. M. (2001). Increasing on-task behavior in an elementary student during mathematics seatwork by interspersing additional brief problems. School Psychology Review, 30, 23-32.
McDougall, D. (2005). The range-bound changing criterion design. Behavioral Interventions, 20, 129-137.
McDougall, D. (2006). The distributed criterion design. Journal of Behavioral Education, 15.
McDougall, D., Hawkins, J., Brady, M., & Jenkins, A. (2006). Recent innovations in the changing criterion design: Implications for research and practice in special education. Journal of Special Education, 40, 2-15.
McIntyre, L. L., Gresham, F. M., DiGennaro, F. D., & Reed, D. D. (2007). Treatment integrity of school-based interventions with children in the Journal of Applied Behavior Analysis 1991-2005. Journal of Applied Behavior Analysis, 40, 659-672.
McKnight, S., McKean, J. W., & Huitema, B. E. (2000). A double bootstrapping method to analyze linear models with autoregressive error terms. Psychological Methods, 5, 87-101.
McSweeney, A. J. (1978). Effects of response cost on the behavior of a million persons: Charging for directory assistance in Cincinnati. Journal of Applied Behavior Analysis, 11, 47-51.
Mellalieu, S. D., Hanton, S., & O'Brien, M. (2006). The effects of goal setting on rugby performance. Journal of Applied Behavior Analysis, 39, 257-261.
Michael, J. (1974). Statistical inference for individual organism research: Mixed blessing or curse? Journal of Applied Behavior Analysis, 7, 647-653.
Milrod, B., Busch, F., Leon, A. C., Aronson, A., Roiphe, J., Rudden, M., Singer, M., Shapiro, T., Goldman, H., Richter, D., & Shear, M. K. (2001). A pilot open trial of brief psychodynamic psychotherapy for panic disorder. Journal of Psychotherapy Practice and Research, 10, 239-245.
Miltenberger, R. G., Flessner, C., Gatheridge, B., Johnson, B., Satterlund, M., & Egemo, K. (2004). Evaluation of behavioral skills training to prevent gun play in children. Journal of Applied Behavior Analysis, 37, 513-516.
Miranda, J., Bernal, G., Lau, A. S., Kohn, L., Hwang, W. C., & LaFromboise, T. (2005). State of the science on psychosocial interventions for ethnic minorities. Annual Review of Clinical Psychology, 1, 113-142.
Molloy, G. N. (1990). An illustrative case for the value of individual analysis following a between-group experimental design. Behaviour Change, 7, 172-178.
Moore, K., Delaney, J. A., & Dixon, M. R. (2007). Using indices of happiness to examine the influence of environmental enhancements for nursing home residents with Alzheimer's disease. Journal of Applied Behavior Analysis, 40, 541-544.
Moran, D. J., & Hirschbine, B. (2002). Constructing single-subject reversal design graphs using Microsoft Excel: A comprehensive tutorial. The Behavior Analyst Today, 3, 62-70.
Mrazek, P. J., & Haggerty, R. J. (Eds.). (1994). Reducing risks for mental disorders: Frontiers of preventive intervention research. Washington, DC: National Academy Press.
MTA Cooperative Group. (1999a). A 14-month randomized clinical trial of treatment strategies for attention-deficit/hyperactivity disorder. Archives of General Psychiatry, 56, 1073-1086.
MTA Cooperative Group. (1999b). Moderators and mediators of treatment response for children with attention-deficit/hyperactivity disorder. Archives of General Psychiatry, 56, 1088-1096.
Mueller, M. M., Moore, J., Doggett, R. A., & Tingstrom, D. (2000). The effectiveness of contingency specific and contingency nonspecific prompts in controlling bathroom graffiti. Journal of Applied Behavior Analysis, 33, 89-92.
Musser, E. H., Bray, M. A., Kehle, T. J., & Jenson, W. R. (2001). Reducing disruptive behaviors in students with serious emotional disturbance. School Psychology Review, 30, 294-304.
Nathan, P. E., & Gorman, J. M. (Eds.). (2007). A guide to treatments that work. New York: Oxford University Press.
National Institute of Mental Health. (2008). The numbers count: Mental disorders in America. http://www.nimh.nih.gov/health/publications/the-numbers-count-mental-disorders-in-america/index.shtml
Nezu, A. M., & Perri, M. G. (1989). Social problem-solving therapy for unipolar depression: An initial dismantling investigation. Journal of Consulting and Clinical Psychology, 57, 408-413.
Normand, M. P., & Bailey, J. S. (2006). The effects of celeration lines on visual data analysis. Behavior Modification, 30, 295-314.
Nourbakhsh, M. R., & Ottenbacher, K. J. (1994). The statistical analysis of single-subject data: A comparative examination. Physical Therapy, 74, 80-88.
Nutter, D., & Reid, D. H. (1978). Teaching retarded women a clothing selection skill using community norms. Journal of Applied Behavior Analysis, 11, 475-487.
O'Brien, F., & Azrin, N. H. (1972). Developing proper mealtime behaviors of the institutionalized retarded. Journal of Applied Behavior Analysis, 5, 389-399.
O'Callaghan, P. M., Allen, K. D., Powell, S., & Salama, F. (2006). The efficacy of noncontingent escape for decreasing children's disruptive behavior during restorative dental treatment. Journal of Applied Behavior Analysis, 39, 161-171.
O'Donohue, W., Plaud, J. J., & Hecker, J. E. (1992). The possible function of positive reinforcement in home-bound agoraphobia: A case study. Journal of Behavior Therapy and Experimental Psychiatry, 23, 303-312.
O'Leary, K. D., Kent, R. N., & Kanowitz, J. (1975). Shaping data collection congruent with experimental hypotheses. Journal of Applied Behavior Analysis, 8, 43-51.
Ollendick, T. H., Shapiro, E. S., & Barrett, R. P. (1981). Reducing stereotypic behaviors: An analysis of treatment procedures using an alternating-treatments design. Behavior Therapy, 12, 570-577.
Onghena, P. (1994). The power of randomization tests for single-case designs. Leuven, Belgium: Katholieke Universiteit Leuven.
Pagnoni, G., Zink, C. F., Montague, P. R., & Berns, G. S. (2002). Activity in human ventral striatum locked to errors of reward prediction. Nature Neuroscience, 5, 97-98.
Park, H., Marascuilo, L., & Gaylord-Ross, R. (1990). Visual inspection and statistical analysis of single-case designs. Journal of Experimental Education, 58, 311-320.
Park, S., Singer, G. H. S., & Gibson, M. (2005). The functional effect of teacher positive and neutral affect on task performance of students with significant disabilities. Journal of Positive Behavior Interventions, 7, 237-246.
434 REFERENCES

affect on task performance o f students with sig- Patten, S. B. (2006). A major depression prognosis
nificant disabilities, journal o f Positive Behavior calculator based on episode duration. Clinical
Interventions, 7, 237-246. Practice and Epidemiology in Mental Health, 2.
Parker, R. I., & Brossart, D. F. (2003). Evaluating http://www.cpementalhealth.com/content/pdf/
single-case research data: A com parison of 174 5-0 179-2-13.p d f
seven statistical methods. Behavior Therapy, 34, Perepletchikova, F., & Kazdin, A. E. (2005). Treatment
189 -211. integrity and therapeutic change: Issues and
Parker, R. I., Brossart, D. F„ Callicott, K. J„ Long, J. R „ research recommendations. Clinical Psychology:
Garcia de Alba, R., Baugh, F. G., & Sullivan, J. R. Science und Practice, 12,365-383.
(2005). Effect sizes in single-case research: How Peterson, L., Tremblay, G., Ewigman, B., & Popkey,
large is large? School Psychology Review, 34, C. (2002). The Parental Daily Diary: A sensi-
116-132. tive measure o f the process o f change in a child
Parker, R. I., Cryer, J., & Byrns, G. (2006). Controlling maltreatment prevention program. Behavior
baseline trend in single-case research. School Modification, 26, 594-604.
Psychology Quarterly, 21, 418-443. Phaneuf, L„ & McIntyre, L. L. (2007). Effects of
Parker, R. I., & Hagan Burke, S. (2007a). Single-case individualized video feedback combined with
research results as clinical outcomes. Journal o f group parent training on inappropriate maternal
School Psychology, 45, 637-653. behavior. Journal o f Applied Behavior Analysis,
Parker, R. ., & Hagan-Burke, S. (2007b). Useful effect 40, - -
size interpretations for single-case research. Plomin, R „ M cClearn, G. E., iMcGuffin, P., & Defries,
Behavior Therapy, 38, 95-105. f. C. (2000). Behavioral genetics (4lh ed). New
Parsons, M. B., Schepis, M. M „ Reid, D. H., M cCarn, York: W. H. Freeman.
J. E., & Green, C. VV. (1987). Expanding the Pluck, M ., Ghafari, E „ Glynn, T., & McNaughton,
impact o f behavioral staff management: A large- S. (1984). Teacher and parent modeling of rec-
scale, long-term application in schools serving reational reading. N ew Zealand Journal of
severely handicapped students. Journal o f Applied Educational Studies, 19 ,114 -12 3.
Behavior Analysis, 2 0 , 139-150. Pohl, R. F. (Ed.). (2004). Cognitive illusions: A hand-
Parsonson, B. S., & Baer, D. M. (1978). The analysis and book on fallacies and biases in thinking, judgment,
presentation o f graphic data. In T. R. Kratochsvill and memory. New York: Psychology Press.
(Ed.), Single-subject research: Strategies fo r evalu- Porritt, M., Burt, A., & Poling, A. (2006). Increasing
ating change. New York: Academ ic Press. fiction writers’ productivity through an Internet
Parsonson, B. S., & Baer, D. M. (1992). The visual anal- based intervention. Journal o f Applied Behavior
ysis o f data and current research into the stimuli Analysis, 3 9 , - -
controlling it. In T. R. Kratochwill & f. R. Levin Price, D. D., Finniss, D. G., & Benedetti, B. (2008).
(Eds.), Single-subject research design and analysis. A comprehensive review o f the placebo effect:
Hillsdale, Nf: Lawrence Erlbaum Associates. Recent advances and current thought. Annual
Pasiali, V'. (2004). The use o f prescriptive therapeutic Review o f Psychology, 59, 565-590.
songs in a home-based environment to promote Prince, M. (1905). The dissociation o f a personality.
social skills acquisition by children with autism: New York: Longmans, Green.
Three case studies. Music Therapy Perspectives, Quesnel, C „ Savard, J., Simard, S., Ivers, H., & Morin,
2 2 ,11-22. C. M. (2003). Efficacy of cognitive-behavioral
Patel, M . R., Piazza, C. C „ Layer, S. A., Coleman, R., & therapy for insomnia in women treated for non-
Swartzwelder, D. M. (2005). A systematic eval- metastatic breast cancer. Journal o f Consulting
uation o f food textures to decrease packing and and Clinical Psychology, 7 1 , 189-200.
increase oral intake in children with pediatric Reeve, S. A., Reeve, K. F„ Townsend, D. B„ &
feeding disorders. Journal o f Applied Behavior Poulson, C. L. (2007). Establishing a generalized
Analysis, }8, 89-100. repertoire o f helping behavior in children with
References 435

autism. Journal o f A pplied Behavior Analysis, 40, Hamerlynck, P. O. Davidson, & L. E. Acker
123- 136. (Eds.), Behavior modification and ideal mental
Reinhartsen, D. R „ Garfinkle, A. N., & Wolery, M. health services. Calgary, Alberta: University o f
(2002). Engagement with toys in two-year-old Calgary Press.
children with autism: Teacher selection and child Robinson, P. W., 8: Foster, D. F. (1979)- Experimental
choice. Journal o f the Association fo r Persons with psychology: A sm all-N approach. New York:
Severe Handicaps, 27,175-187. Harper 8c R ev .
Reitman, D „ Murphy, M. A., Hupp, S. D. A., & Roediger, H. L „ III, c McDermott, K. B. (2000).
O’Callaghan, P. M . (2004). Behavior change Distortions o f memory. In E. Tulving & F. I. M .
and perceptions o f change: Evaluating the effec- Craik (Eds.), The Oxford handbook o f memory.
tiveness o f a token economy. Child and Family New York: Oxford University Press.
Behavior Therapy, 2 6 ,17-36. Rosales-Ruiz, J., & Baer, D. M . (1997)- Behavioral
Ressler, K. f., Rothbaum, B. O., Tannenbaum, L., cusps: A developmental and pragm atic concept
Anderson, P., Graap, K., Zimand, E., Hodges, for behavior analysis. Journal o f A pplied Behavior
L., 8c Davis, M. (2004). Cognitive enhancers as Analysis, 3 o, 533-544-
adjuncts to psychotherapy: Use of D-cycloserine Rosenbaum, f. E. (2009). Patient teenagers? A com -
in phobic individuals to facilitate extinction o f parison o f the sexual behavior o f virgin ity pledg-
fear. Archives o f General Psychiatry, 6 1 ,1136-1144. ers and matched nonpledgers. Pediatrics, 123,
Revusky, S. H. (1967). Some statistical treatments com - eii0 -ei2 0 .
patible with individual organism methodology. Rosenthal, R., & Rosnow, R. L. <2007). Essentials o f
Journal o f the Experimental Analysis o f Behavior, behavioral research: Methods and data analysis
10, 319-330. (3rd ed.). Boston: M cGraw -H ill.
Reyes, J. R., Vollmer, T. R „ Sloman, K. N., Hall, A., Rusch, F. R., c Kazdin, A. E. (19S1). Toward a m ethod-
Reed, R., fansen, G., et al. (2006). Assessment of ology o f withdrawal designs for the assessment
deviant arousal in adult male sex offenders with o f response maintenance. Journal of Applied
developmental disabilities. Journal o f Applied Behavior Analysis, 1 4 , 13 1-14 0 .
Behavior Analysis, 39 ,173-18 8. Rutter, M., Yule, W., 8c Graham , P. (1973). Enuresis
Ricciardi, J. N., Luiselli, J. K., & Camare, M. (2006). and behavioural deviance: Som e epidem iologi-
Shaping approach responses as intervention for cal considerations. In I. K o h in , R. M acKeith, 8c
specific phobia in a child with autism. Journal o f S. R. M eadow(Eds.), Biadder control and enuresis:
Applied Behavior Analysis, 39,445-448. Clinics in developmental m edicine (Vol. 48/49).
Rice, V. H., & Stead, L. E (2008). Nursing interven- London: HeinejnannfSIMP.
tions for smoking cessation. Cochrane Database of Ryan, C. S., Ik Hernmes, N. S. (2005). Effects o f the
Systematic Reviews, Issue 1 (Art. No. CD001188). contingency forhom ew orksubm ission on hom e-
Riesen, T., M cDonnell, I„ Johnson, J. W„ Polychronis, work submission and quiz perform ance in a col-
S., & Jameson, M. (2003). A comparison o f con- lege course. Journal of Applied Behavior Analysis,
stant time delay and simultaneous prompting 3 8 ,79-88.
within embedded instruction in general educa- Satake. E., Maxwell, D. L., & lagaroo, V. (>008).
tion classes with students with moderate to severe Handbook o f statistical methods: Single subject
disabilities. Journal o f Behavioral Education, 12, design. San Diego, CA: Plural Publishing.
241-259. Savard, f„ Labege, B., Gauthier, [. G., Fournier, J.,
Riley-Tillman.T. C., 8c Burns, M. K. (2009). Evaluating Bourchard, S., Barit, 8c Bergeron, M . (1998).
educational interventions: Single-case design fo r Com bination of Fluoxetine and cognitive ther-
measuring response to intervention. New York: apy for the treatment o f m ajor depression am ong
Guilford Press. people with H IV infection: A time-series an aly-
Risley, T. R. (1970). Behavior modification: An sis investigation. Cognitive Therapy and Research,
experimental-therapeutic endeavor. In L. A. 22, 21-46.
436 R EF ER EN C ES

Scherrer, M. D „ & Wilder, D. A. (2008). Training to Evidence-Based Communication Assessment and


increase safe tray carrying among cocktail serv- Intervention, 3,18 8 -19 6 .
ers. Journal o f Applied Behavior Analysis, 41, Shapiro, E. S., Kazdin, A. E., & McGonigle, I. J. (1982).
131-135- Multiple-treatment interference in the simulta-
Schmidt, F. L. (1996). Statistical significance test- neous-oralternating-treatmentsdesign.Behaviora/
ing and cumulative knowledge in psychol- Assessment, 4 , 105-115.
ogy: Implications for training of researchers. Shapiro, M. B. (1961a). A method o f m easuring psy-
Psychological Methods, 1,115 -12 9 . chological changes specific to the individual
Schnelle, J. F., Kirchner, R. E., Macrae, J. W., McNees, psychiatric patient. British Journal o f Medical
M. P., Eck, R. H., Snodgrass, S., et al. (1978). Psychology, 3 4 , 1 5 1 - 1 5 5 -
Police evaluation research: An experimental Shapiro, M. B. (1961b). The singlecase in fundamental
and cost-benefit analysis o f a helicopter patrol clinical psychological research. British Journal o f
in high-crinie area. Journal o f Applied Behavior Medical Psychology, 34, 255-262.
Analysis, 1 1 , 11-2 1. Shapiro, M. B., 8c Ravenette, T. (1959). A prelim inary
Schnelle, J. F , Kirchner, R. E „ McNees, M. P., & Lawler, experiment o f paranoid delusions. Journal o f
J. M. (1975). Social evaluation research: The eval- Mental Science, 105, 295-312.
uation o f two police patrolling strategies. Journal Shoukri, M. M. (2005). Measures o f interobserver
o f Applied Behavior Analysis, 8 ,353-365. agreement. Boca Raton, FL: Taylor 8c Francis.
Schwartz, I. S., & Baer, D. M. (1991). Social valid- Sideridis, G. D., 8c Greenwood, C. R. (1997). Is human
ity assessments: Is current practice state o f the behavior autocorrelated? An em pirical analysis.
art? Journal o f Applied Behavior Analysis, 24, Journal o f Behavioral Education, 7, 273-293.
189-204. Sidman, M. (1960). Tactics o f scientific research. New
Scotti, |. R., Evans, I. M., Meyer, L. H., & Walker, P York: Basic Books.
(1991). A meta-analysis o f intervention research Sierra, V., Solanas, A., 8c Quera, V. (2005).
with problem behavior: Treatment validity Randomization tests for systematic single-case
and standards o f practice. American Journal on designs are not always appropriate. Journal o f
Mental Retardation, 96, 233-256. Experimental Education, 7 3,14 0 -16 0 .
Sechrest, L., Stewart, M ., Stickle, T. R., & Sidani, S. Simon, G. E., Manning, W. G., Katzelnick, D. f„
(1996). Effective and persuasive case studies. Pearson, S. D„ Henk, H. I., 8c Helstad, C. P. (2001).
Cambridge, M A: Human Services Research Cost-effectiveness of systematic depression treat-
Institute. ment o f high utilizers o f general medical care.
Shabani, D. B., & Fisher, W. W. (2006). Stimulus fading Archives o f General Psychiatry, 5 8 ,181-187.
and differential reinforcement for the treatment Skiba, R., Deno, S., Marston, D., 8c Casey, A. (1989).
o f needle phobia in a youth with autism. Journal Influence o f trend estimation and subject familiarity
o f Applied Behavior Analysis, 39, 449-452. on practitioners’ judgments of intervention effec-
Shadish, W. R „ Cook, T. D., & Campbell, D. T. (2002). tiveness. lournal o f Special Education, 2 2 ,433-446.
Experimental and quasi-experimental designs fo r Skinner, B. F. (1938). The behavior o f organisms. New
generalized causal inference. Boston: Houghton York: Applcton-Century-C.rofts.
M ifflin. Skinner, B. F. (1953a). Science and human behavior.
Shadish, W. R., & Ragsdale, K. (1996). Random New York: Free Press.
versus nonrandom assignment in controlled Skinner, B. F. (1953b). Some contributions o f an exper-
experiments. Do you get the same answer? imental analysis o f behavior to psychology as a
Journal o f Consulting and Clinical Psychology, 64, whole. American Psychologist,8 , 69-78.
1290-1305. Skinner, B. F. (1956). A case history in scientific
Shadish, W. R „ Rindskopf, D. M., & Hedges, L. V’. method. American Psychologist, 11, 221-233.
(2008). The state o f the science in the meta- Skinner, C. H., Skinner, A. I.., & Arm strong, K. I.
analysis of single-case experim ental designs. (2000). Analysis of a client-staff developed
R&fcrences 437

shaping program to enhance reading persis- A B, ABAB, and multiple- baseline designs. Madison:
tence in an adult diagnosed with schizophrenia. University o f W isconsLn-Madison.
Psychiatric Rehabilitation, 2 4 ,52-57. Tarbox, R. S. F„ Wallace, M. D., Penrod, B , & Tar box, ].
Spirrison, C. L., & Mauney, L. T. (1994). Acceptability (2007). Effects o f three-step prom pting on com -
bias: The effects o f treatment acceptability pliance w ith<a regiver requ e sts. Jo urnal i f A ppli ed
on visual analysis o f graphed data. Journal o f Behavior Analysis, 40,7 0 3 -706.
Psychopathology and Behavioral Assessment, 16, Thigpen, C. H., & Clecldey, H. M. ( 19 5 4 ). A case o f
85-94. multiple personality. Journal o f Abnorm al and
Staats, A. W„ Staats, C. K., Schütz, R. E., & Wolf, M. Social Psychology, 4 9 ,135-151.
(1962). The conditioning o f textual responses Thigpen, C. H., & Cleckley, H. M. (1 9 5 7 )- The three
using “extrinsic" reinforcers, Journal o f the faces o f Eve. N ew York: M cGraw- Hill.
Experimental Analysis o f Behavior, 5 , 33-40. Thornberry, T. P., Sc Krohn, M. D (2000). The self-
Stead, L. F„ Bergson, G., & Lancaster, T. (2008). report m ethod for m easuring delinquency. In
Physician advice for smoking cessation. Cochrane D. Duffee (Ed.), Measurement and analysis o f
Database o f Systematic Reviews, Issue 2 (Art. No. crime and justice: Crim inal justice 2000 (Vol. 4).
CD000165). W ashington,D C: National Institute o f Justice.
Stewart, K. K., Carr, J. E., Brandt, C. W„ & McHenry, Tiger, J. H „ Bouxsein, K. J., & Fisher, "W. W. (2007).
M. M. (2007). An evaluation o f the conserva- Treating excessively slow responding of a young
tive dual-criterion method for teaching uni- man with Asperger syndrome using differential
versity students to visually inspect AB-design reinforcement o f short response latencies, Journal
graphs, lournal o f Applied Behavior Analysis, 40, o f Applied Behavior Analysis, 40, 559—563.
713-718. Timberlake, W , Scliaal, D. W „ k Steinmetx, ). E.
Strieker, J. M., Miltenberger, R. G., Garlinghouse.M . A., (Eds.). (2005). Relating behavior and neurosci-
Deaver, C. M „ & Anderson, C. A. (2001). ence: Introduction and synopsis. Journal o f the
Evaluation o f an awareness enhancement device Experimental Analysis o f Behavior, $4, 305-311.
tor the treatment o f thumb sucking in children. Todd, P. M., Fenke, L., Fasolo, B., & Lenton, A. P.
lournal o f Applied Behavior Analysis, 3 7 ,229-232. (2007). Different cognitive processes under-
Sundberg, M. L., Endicott, K., 8c Eigenheer, R (2000). lie human mate choices and mate preferences.
Using intraverbal prompts to establish tacts for Proceedings o f the National A cadem y o f Sciences,
children with autism. The Analysis o f Verbal 1 0 4 ,150 11-1501S.
Behavior, 17 ,89-104. Todman, J. B., & D ugard, P. (2001). Single-case and
Swaminathan, H., Horner, R. H., Sugai, G., stnall-n experim ental designs. A practical guide
Smolkowski, L., Hedges, L., 8c Spaulding, S. A. to randomization tests. M ahw ah, NT Lawrence
(2008). Application o f generalized least squares Erlbaum Associates.
regression to measure effect size in single-case Tryon, W. W. (1982). A simplified time-series analysis
research: A technical report (Institute o f Education for evaluating treatment interventions. Journal of
Sciences Technical Report). Washington, D C U.S. Applied Behavior Analysis, 15, 423-429 .
Department o f Education. Tufte, E. R. (2001). The visual display o f tj iiantitative infor-
Swanson, ). M., Arnold, L. E., Vitiello, B„ Abikoff, mation (2^ e d ). Cheshi re. C l : Graphics Press.
H. B., Wells, K. C , Pelham, W. E., et al. (2002). Twohig, M. P., Shoenberger, D., 8c Hayes, S. C. (2007).
Response to commentary on the Multimodal A prelim inary investigation o f acceptance and
Treatment Study o f AD H D (MTA): Mining the commitment therapy as a treatment for mari
meaning o f the MTA. Journal o f Abnormal Child juana dependence in adults, [turned o f Applied
Psychology, 30, 327-332. Behavior Analysis, 4 0 ,1 19-632..
Swoboda, C „ Kratochwill, T. R., & Levin, J. R. (2009). Ullmann, L. P., & Krasner, L. A . (Eds.). (1965). Case
Conservative Dual-Criterion (CDC) method fo r studies in behavior modification. “New York: Holt,
single-case research: A guide fo r visual analysis o f Rinehart & Winston
438 R EF ER EN C ES

Ulm aa, J. D., & Sulzer-Azaroff, B. (1975). Multielement Watson, J. B., & Rayner, R. (1920). Conditioned
baseline design in educational research. In E. emotional reactions. Journal o f Experimental
Ram p & G. Semb (Eds.), Behavior analysis: Areas Psychology, 3 , 1-14 .
o f research and application. Englewood Cliffs, NJ: Watson, R. I. (1951). The clinical method in psychology.
Prentice-Hall. N ew York: Harper.
Van I louten, R., Malenfant, J. E. L „ Zhao, N., Ko, B „ & Watson, T. S., Meeks, C , Dufrene, B., & Lindsay, C.
V'an Houten, J. (2005). Evaluation o f two methods (2002). Sibling thumb sucking: Effects o f treat-
o f prompting drivers to use specific exits on con- ment for targeted and untargeted siblings.
flicts between vehicles at the critical exit. Journal Behavior Modification, 2 6 ,412-423.
o f A pplied Behavior Analysis, 38, 289-302. Wehby, J. H., & Hollahan, M. S. (2000). Effects of
Van Houten, R., & Retting, R. A. (2001). Increasing high-probability requests on latency to initiate
motorist compliance and caution at stop signs. academic tasks. Journal of Applied Behavior
Journal o f Applied Behavior Analysis, 3 4 , 185-193. Analysis, 33, 259-262.
V'an Houten, R., Van Houten, J., & Malenfant, J. E. L. Weiss, B., Caron, A., Ball, S., Tapp, J., Johnson, M „ &
(2007). Impact o f a comprehensive safety pro- Weisz, J. R. (2005). Iatrogeniceffectsofgroup treat-
gram on bicycle helmet use among middle-school ment for antisocial youth. Journal o f Consulting
children. Journal o f Applied Behavior Analysis, and Clinical Psychology, 7 3 ,1036-1044.
40, 239 -247 Weisz, J. R „ & Kazdin, A. E. (Eds.). (2010). Evidence-
Velleman, P. E (1980). Definition and comparison based psychotherapies fo r children and adolescents
o f robust nonlinear data smoothing algorithms. (2nJ ed.). New York Guilford Press.
Journal o f the American Statistical Association, 75, Weisz, J. R „ Weiss, B., Han, S. S., Granger, D. A., &
609-615. M orton, T. (1995). Effects of psychotherapy with
Vlaeyen, J. W. S., de Jong, J. R „ Onghena, P., children and adolescents revisited: A meta-anal-
Kerckhoffs-Hanssen, M., & Kole-Sniiders, A. M. I. vsis o f treatment outcome studies. Psychological
(2002). Can pain-related fear be reduced? The Bulletin, 117, 450-468.
application o f cognitive-behavioral exposure in Westen, D „ Novotny, C. M ., & Thompson-Brenne, H.
vivo. Pain Research Management, 7 ,14 4 -15 3 . (2004). The empirical status o f em pirically sup-
Wacker, D., M cMahon, C., Stecge, M „ Berg, W„ Sasso, ported psychotherapies: Assumptions, find-
G., & Melloy, K. (1990). Applications o f a sequen- ings, and reporting in controlled clinical trials.
tial alternating treatments design. Journal o f Psychological Bulletin, 130, 631-663.
Applied Behavior Analysis, 2 3 ,333-339. Whalen, C., Schreibman, L „ & Ingersoll, B. (2006).
Wampold, B. E. (2001). The great psychotherapy The collateral effects o f joint attention training
debate: Models, methods, and findings. Mahwah, on social initiations, positive affect, imitation,
NJ: Lawrence Erlbaum Associates. and spontaneous speech for young children with
YVannamethee, S. G., & Sharper, A . G. (1999). Type of autism. Journal o f Autism and Developmental
alcoholic drink and risk o f major coronary heart Disorders, 36,655-664.
disease events and all-cause mortality. American White, K. G., McCarthy, D., & Fantino, E. (Eds.).
Journal o f Public Health, 89, 685-690. (1989). Cognition and behavior analysis. Journal
Warnes, E „ & Allen, K. D. (2005). Biofeedback treat- o f the Experimental Analysis o f Behavior, 52,
ment o f paradoxical vocal fold motion and respi- 197-198.
ratory distress in an adolescent girl. Journal o f White, O. R. (1972). A manual fo r the calculation
Applied Behavior Analysis, 38, 529-532. and use o f the median slope—A technique of
Washington, K., D eitz,). C , White, O. R., & Schwartz, progress estimation and prediction in the sin-
I. S. (2002). The effects o f a contoured foam seat gle case. Eugene: Regional Resource Center for
on postural alignment and upper-extremity func- Handicapped Children, University o f Oregon.
tion in infants with neuromotor impairments. White, O. R. (1974). The “split middle”: A “quickie”
Physical Therapy, 8 2 ,1064-1076. method o f trend estimation. University of
Ref er en ces 439

Washington, Experimental Education Unit, Child Wolery, M „ Busick, M., Reichow, B „ & Barton, E.
Development and Mental Retardation Center. E. (2008). Comparison o f overlap methods for
White, O. R., & Haring, N. G. (1980). Exceptional quantitatively synthesizing single-subject data.
teaching (2"'1 ed.). Colum bus, OH: Merrill. Journal o fS p tcia l Education. (Online document:
Wiesman, D. W. (2006). The effects o f performance 101177/00224.6690832^009).
feedback and social reinforcement on up-selling Wolf, M. M. (1978). Social validity: The case for sub-
at fast-food restaurants. Journal o f Organizational jective measurement or how applied behavior
Behavior Management, 2 6 ,1-18 . analysis is finding its heart. Journal o f Applied
Wilder, D. A., Atwell, J., & Wine, B. (2006). The effects Behavior Analysis, 1 1 , 203.-214.
of varying levels o f treatment integrity on child Wong, S. E., Terranova, M. D., Bowen, L., Zarate, R.,
compliance during treatment with a three-step Massey, H. K., 8c Liberm aa R. P. (1987)- Providing
prompting procedure. Journal o f Applied Behavior independent recreational activities to reduce ste-
Analysis, 3 9 ,369-373. reotypic vocalizations in chronic schizophrenics.
W ilhelm, S., Buhlmann, U., Tolin, D. F., Meunier, S. Journal o f A pplied Behavior Analysis, 20, 77-8 2.
A., Pearlson, G. D., Reese, H. E., Cannistraro, P., Worsdell, A. S., hwata, B. Dozier, C. L., Johnson,
(enike, M. A., & Rauch, S. L. (2008). Augmentation A. D., N eidert. P. L., & Thom ason, J. L. (2005).
o f behavior therapy with D-cycloserine for obses- Analysis o f response repetition as an error-
sive-compulsive disorder. American Journal o f correction strategy during sight-w ord read -
Psychiatry, 16 5 ,335-341. ing. Journal o f A pplied S eh a vio r Analysis, 3S,
W ilkinson, L., and the Task Force on Statistical 511-527.
Inference. (1999). Statistical methods in psy- Wright, K. M „ & .Uiltenberger, R.G . (1987)- Awareness
chology journals: Guidelines and explanations. training in the treatment o f head and facial tics.
American Psychologist, 54, 594-604. Journal o j B th avw r Therapy and Experim ental
W ilkinson, L „ Wills, D., Rope, D„ Norton, A., & Psychiatry, iS, 269-274.
Dubbs, R. (2005). The gram mar o f graphics. Yasui, M., & Dishion, T. f. (200S). Direct observation
Chicago: SPSS Inc. o f family management: Validity and reliability
W ilson, D. D., Robertson, S. J., Herlong, L. H., & as a function o f coder ethnicity and training.
Haynes, S. N. (1979). Vicarious effects of time Behavior Therapy, 3?, 336-347.
out in the modification o f aggression in the class- Zilboorg, G., & Henry, G. (194.1). A history o f medical
room. Behavior M odification, 3 ,9 7 - 111. psychology. N ew York: Morton.
AUTHOR INDEX

Acar, G., 277
Achenbach, T.M., 87, 314
Agras, W.S., 369
Ahearn, W.H., 132, 140
Ainsleigh, S.A., 358
Alberto, P.A., 229
Aldwin, C.M., 254
Allen, K.D., 83, 171-172, 293, 311, 386
Allison, D.B., 298, 410
Allport, G.W., 12
American Psychiatric Association, 51
American Psychological Association, 17fn
Anderson, C.A., 358
Andrews, J., 316
Antoun, N., 8
Ardoin, S.P., 222, 368
Armstrong, K.J., 169-170
Ary, D., 110fn
Athens, E.S., 58, 80
Atwell, J., 194
Austin, J., 91, 94, 137, 151-152, 209, 211, 236, 317
Axelrod, S., 181
Ayllon, T., 15, 55, 354
Azrin, N.H., 250-251, 274-275, 308-309
Baer, D.M., 15, 53, 159, 221, 223, 286-287, 300, 307, 386, 403
Bailey, J.S., 298, 300
Bargh, J.A., 93fn, 299
Barlow, D.H., 197
Barnett, F.T., 337
Baron, R.M., 371
Barrett, R.P., 15, 202-203
Barrios, B.A., 115fn
Barton, E.E., 324fn
Basoglu, M., 277, 279
Bass, D., 40, 250, 313-314
Battro, A.M., 8
Baxter, D.A., 370
Bearman, P.S., 260
Beasley, T.M., 298
Beck, A.T., 37, 370
Benedetti, B., 40
Berg, B.L., 398fn
Bergson, G., 248, 304
Berns, G.S., 370
Berns, S.B., 314
Besalel-Azrin, V., 274
Bijou, S.W., 15
Billette, V., 348-349
Bisconer, S.W., 345
Bjorklund, D.F., 84
Blampied, N.M., 222
Blanton, H., 319-320
Bolgar, H., 3
Bolger, N., 371
Bootzin, R.R., xiii, 95
Borckardt, J.J., 19, 86, 397, 410-411, 414, 418
Boring, E., 11
Bosch, A., 90
Boscoe, J.H., 75
Bouchard, T.J., Jr., 367fn
Bourland, G., 410
Bouxsein, K.J., 81
Box, G.E.P., 348, 410-411, 414
Boyle, M., 316
Brackshaw, E., 130-131
Bradshaw, W., 407
Brady, M., 174
Brainerd, C.J., 84
Brandt, C.W., 298
Bray, M.A., 157
Breitwieser, C.B., 90
Brembs, B., 370
Bremner, R., 316
Breuer, J., 5, 7
Broemeling, L.D., 114
Brooks, A., 141
Brossart, D.F., 302, 403, 406-407, 410, 416
Browning, R.M., 207
Bruckner, H., 260
Bruner, E., 148
Brunswik, E., 39
Buckholdt, D., 55
Burkholder, E.O., 324fn, 346
Burns, M.K., 324fn
Burt, A., 74
Busk, P., 410
Busse, R.T., 406


Byrns, G., 300
Calder, A.J., 8
Camare, M., 183-184
Cameron, M.J., 358
Campbell, D.T., 27-28, 40, 258-259, 369, 379
Campbell, J.M., 410
Caplinger, T.E., 214
Carr, J.E., 137, 236, 298, 324fn, 346
Carter, M., 202
Carter, N., 147-148
Casey, A., 410
Caspi, A., 245
Center for Disease Control, 263
Center, B.A., 410
Chaddock, R.E., 11
Chamberlain, P., 386
Chambless, D.L., 17
Chassan, J.B., 19, 399
Chassin, L., 254
Chhatwal, J., 249
Chung, B.I., 132
Clark, K.M., 132
Clarke, B., 405
Clarke, G.N., 315
Clayton, M., 77
Cleckley, H.M., 5, 13
Clement, P.W., 86
Cleveland, W.S., 405
Cohen, J., 113, 417fn
Cohen, S., 298
Coleman, R., 82
Cook, T.D., 27-28, 40, 258, 379
Cooper, H., 398fn
Cooper, J.O., 16, 203fn, 236
Covalt, W.C., 110fn
Cox, A.B., 306
Cox, B.S., 306, 385
Cox, D.J., 306
Cryer, J., 300
Cunningham, C.E., 316-317
Cunningham, T.R., 151-152
Dallery, J., 82, 85, 386
Dapcich-Miura, E., 229-230
Davis, M., 249
Dawes, R.M., 7, 18
de Jong, J.R., 272
De Los Reyes, A., 87
Deaver, C.M., 358
Defries, J.C., 367fn
Deitz, J.C., 204
Delaney, J.A., 78
Denissenko, M.F., 370
Denzin, N.K., 398fn
DeProspero, A., 298
Derby, K.M., 148
DeRosie, J., 240
DiGennaro, F.D., 76, 194
Dishion, T.J., 68, 214, 260
Dittmer, C.G., 11
Dixon, M.R., 78
Dobes, R.W., 57-58, 116
Dodge, K.A., 214, 260
Doggett, R.A., 330
Donahue, C., 54
Doss, A.J., 36
Drebing, C.E., 58
Dubbs, R., 324
Ducharme, J.M., 240-241
Dufrene, B., 141
Dugard, P., 208-209, 379
Dukes, W.F., 3, 11
Dunn, J., 367fn
Edgington, E.S., 207-208, 378-379, 410
Ehrhart, I.J., 275
Eigenheer, P., 330
Elliott, S.N., 406
Elliott, T.R., 302
Endicott, K., 330
Engleman, S., 148
Essex, M., 371
Evans, I.M., 410
Evans, J.H., 171-172
Ewigman, B., 86
Facon, B., 172
Fairburn, C.G., 369
Faith, M.S., 410
Fantino, E., 388
Farrimond, S.J., 94
Fasolo, B., 84
Favell, J.E., 231, 233
Feather, J.S., 302, 336, 407
Feehan, M., 52
Feldman, R.A., 214, 260fn
Ferguson, M.J., 93fn
Ferritor, D.E., 55
Ferron, J., 417
Ferster, C.B., 15, 330fn
Finch, A.E., 19

Finniss, D.G., 40
Fiore, M.C., 248, 304
Fisch, G.S., 418
Fisher, R.A., 11
Fisher, W.W., 81, 240, 298, 403, 410, 418
Fishman, C., 305
Flood, W.A., 176-177
Fokoue, E., 405
Foley, D., 246fn
Folino, A., 240
Foster, D.F., 3, 11
Foster, S.L., 53, 115, 117
Foster-Johnson, L., 406
Fournier, A.K., 275, 277, 385
Franklin, R.D., 298, 402
Fredrick, L.L., 229
Freedman, B.J., 54
Freud, S., 5, 7
Friedman, J., 181
Gabbard, G.O., 317
Garb, H.N., 18
Garfinkle, A.N., 214
Garlinghouse, M.A., 358
Gaylord-Ross, R., 298
Geesey, S., 39, 200-201
Geller, E.S., 275
Ghafari, E., 202
Gibson, M., 204-205
Gilbert, J.P., 287
Gilbert, L.M., 337
Gilmer, D.F., 254
Gilovich, T., 6
Girolami, P.A., 75
Glass, G.V., 300, 403, 411, 414
Glenn, I.M., 82, 85, 386
Glindemann, K.E., 275
Glynn, T., 82, 202
Goldiamond, I., 15
Gore, A., 6
Gorman, B.S., 298, 410
Gorman, J.M., 35, 397
Gorsuch, R.L., 410
Gottman, J.M., 411
Graham, P., 52
Granger, D.A., 298fn
Grantham, K., 204
Gravina, N., 91
Green, C.W., 385
Green, M., 345
Greenberg, P.E., 317
Greenwood, K.M., 298, 300, 402-404, 409
Gresham, F.M., 194
Grice, C.R., 399
Griffin, D., 6
Griner, D., 368
Grissom, T., 214
Gross, A., 90
Gross, T., 93fn
Grove, W.M., 313
Grow, L., 358
Guay, S., 348-349
Hackett, S., 91
Haddox, P., 148
Hagan-Burke, S., 401, 406-407, 410
Haggerty, R.J., 314
Hains, A.H., 221, 223
Hall, S.S., 80
Hamblin, R.L., 55
Han, S.S., 298fn
Hanley, G.P., 55, 64
Hansen, N.B., 19
Hanton, S., 93
Harbst, K.B., 298, 402
Haring, N.G., 410
Harris, S.R., 298
Harris, V.W., 55
Hartmann, D.P., 110fn, 113fn, 115fn, 197, 410-411, 414
Hasler, B.P., 95
Hassin, R., 93fn
Haughton, E., 354
Hawkins, J., 174
Hawkins, R.P., 57-58, 116
Hayes, S.C., 82, 197
Haynes, S.N., 360
Heads, T.B., 55
Heal, N.A., 55
Hecker, J.E., 273
Hedges, L.V., 398fn, 407, 410
Helms, B., 77
Hemmes, N.S., 77
Henry, G.T., 13, 323
Herlong, L.H., 360
Heron, T.E., 16
Hersen, M., 197
Hetzroni, O.E., 222
Heward, W.L., 16
Hill, L.E., 397
Hindman, P.M., 204
Hirschbine, B., 324fn
Hofmann, S.G., 249

Hollahan, M.S., 141, 222
Hollon, S.D., 37, 370
Holmström, A., 147
Honekopp, J., 286fn
Hontos, P.T., 274
Hops, H., 316
Hornberger, J., 317
Horner, R.H., 136, 142, 282, 375
Hovell, M.F., 229-230
Hsu, L.M., 258
Hughes, C.A.O., 202
Hughes, M.A., 229, 232
Huitema, B.E., 410
Humm, S.P., 222
Hunsley, J., 18
Hunter, J.E., 398fn
Hunter, J.J., 399
Hupp, S.D.A., 312
Ingersoll, B., 159
Ingram, K., 137-138
Ingvarsson, E.T., 55
Institute of Medicine, 17fn
Ivers, H., 306, 407
Iwata, B.A., 203fn
Jaccard, J., 319-320
Jacobson, N.S., 314
Jaffee, S.R., 246fn
Jameson, M., 368
Jason, L.A., 130-131
Jenkins, G.M., 348, 411, 414
Jenson, W.R., 157
Johnson, B.M., 90
Johnson, J.S., 345
Johnson, J.W., 368
Jones, M.C., 5, 13
Jones, M.L., 231, 233
Jones, R.R., 298, 404, 411, 414
Jones, W.P., 410
Kahneman, D., 6
Kalender, D., 277
Kanowitz, J., 117
Kashy, D.A., 371
Katzelnick, D.J.
Kazdin, A.E., 3, 5, 15-16, 18, 35-37, 40, 53, 56, 65, 86, 115, 117, 135, 139, 195, 200, 221, 236, 243, 246, 250, 258, 261, 264, 282, 302, 313, 319-320, 330, 357, 368, 392, 403, 417
Keane, J., 8
Kehle, T.J., 157
Kelley, M.E., 298, 410
Kendall, P.C., 313
Kennedy, C.H., 53
Kenny, D.A., 371
Kent, R.N., 115, 117
Kerckhoffs-Hanssen, M., 272
Kern-Koegel, L., 159
Khang, S.W., 203fn
Kiernan, M., 371
Kim-Cohen, J., 245
Kirchner, R.E., 411
Kirk, R.E., 406
Kleinmann, A.E., 76
Klubnik, C., 222
Knapp, T.J., 402
Knudson, P., 90
Ko, B., 77
Kodak, T., 358
Koegel, R.L., 159
Kole-Snijders, A.M.J., 272
Komaki, J., 337, 339
Koob, J.J., 397
Korchin, S.J., 12
Kosslyn, S.M., 324
Kraemer, H.C., 369-370
Krasner, L.A., 15
Kratochwill, T.R., xiii, 16, 209, 258, 298, 378-379, 406
Krohn, M.D., 86
Kromrey, J.D., 406
Kupfer, D.J., 369-370
Kushner, M.G., 249
Lall, V.F., 410, 416
Lambert, M.J., 19, 86, 386
Lancaster, T., 248, 304
Lansford, J.E., 214
Lawler, J.M., 411
Layer, S.A., 82
Lazar, S.G., 317
Lebbon, A., 91
Leenders, N.T.J.M., 215
Leland, L.S., Jr., 94
Lenton, A.P., 84
Levesque, M., 87, 306, 394-396, 407, 411, 418
Levin, J.R., 209, 258, 298, 378-379, 410, 416
Lewin, L.M., 114fn
Lewinsohn, P.M., 315
Lewis-Palmer, T., 137
Liberty, K.A., 222
Light, R.J., 287

Lincoln, Y.S., 398fn
Lindberg, J.S., 203fn
Lindsay, C., 141
Lindsley, O.R., 15
Lipsey, M.W., 246
Livanou, M., 277
Lloyd, L.L., 222
Lochman, J.E., 260fn
Lomas, J.E., 298, 410
Long, E.S., 65
Lorenzetti, F.D., 370
Love, S.M., 397
Luiselli, J.K., 183-185, 217-218
Lumley, V.A., 65
Lundervold, D., 410
Lykken, D.T., 367fn
Ma, H., 410
MacDonald, R.P.F., 132
MacKinnon, D.P., 371
Mahadevan, L., 403
Malenfant, J.E.L., 77, 237
Mallon-Czajka, J., 345
Manes, F., 8
Manolov, R., 406, 410, 416-417
Manson, J.E., 367
Marascuilo, L., 298
Marchand, A., 348-349
Marholin, D.H., 55
Martens, B.K., 76
Martin, B., 215
Martin, J.E., 271
Mascitelli, S., 39
Mash, E.J., 53
Mastropieri, M.A., 410
Matt, G.E., 298fn
Matyas, T.A., 298, 300, 402, 404, 409
Mauney, L.T., 299
Mayer, R.J., 364
Maynes, N.P., 80
McCall, M., 222
McCarn, J.E., 385
McCarthy, D., 388
McClearn, G.E., 367fn
McCollough, D., 148, 150
McCord, J., 260
McCurdy, M., 204, 206
McDermott, K.B., 6
McDonnell, J., 368
McDougall, D., 172, 174, 178-179
McFall, R., 54
McGee, R., 52
McGimsey, J.F., 231, 233
McGlinchey, J., 314
McGonigle, J.J., 222
McGrew, J., 254
McGue, M., 367fn
McGuffin, P., 367fn
McHenry, M.M., 298
McInnis, E.T., 55
McIntyre, L.L., 78, 194, 221
McKean, J.W., 410
McKnight, S., 410
McLaughlin, T.F., 148, 337
McMahon, C., 214
McNamara, J., 302
McNaughton, S., 202
McNees, M.P., 411
McSweeney, A.J., 385, 411
Meeks, C., 141
Mehl, M.R., 95
Melin, L., 17
Mellalieu, S.D., 93
Meyer, L.H., 410
Meythaler, J.M., 302
Michael, J., 15, 286
Milrod, B., 270
Miltenberger, R.G., 65, 90, 310, 358, 390-391
Miranda, J., 368
Molloy, G.N., 302, 407
Montague, P.R., 370
Moore, J., 330
Moore, K., 78
Moran, D.J., 324fn
Morin, C.M., 306, 407
Morsella, E., 299
Morton, T., 298fn
Mosteller, F., 287
Mrazek, P.J., 316
MTA Cooperative Group, 248
Mueller, M.M., 330
Murphy, M.A., 312
Murphy, M.D., 418
Musser, E.H., 157-158
Myers, K.M., 249
Nash, M.R., 418
Nathan, P.E., 35, 397
National Institute of Mental Health, 52, 318
Navarro, A.M., 298fn
Nezu, A.M., 315
Normand, M.P., 298, 300
Northrup, J., 358
Norton, A., 324
Nourbakhsh, M.R., 416
Novotny, C.M., 18
Nutter, D., 53
O'Brien, F., 308-309
O'Brien, M., 93
O'Callaghan, P.M., 293, 295, 312
O'Donohue, W., 273-274
O'Leary, K.D., 117
Offord, D.R., 369
Ollendick, T.H., xiii, 17, 202-203
Olson, E.A., 403
Onghena, P., 208, 272, 378-379, 410
Ottenbacher, K.J., 298, 416
Pagnoni, G., 370
Pao, A., 370
Park, S., 204-205, 368, 402
Park, H., 298
Parker, R.I., 300, 302, 401, 403-404, 406-408, 410, 416, 418
Parsons, M.B., 385
Parsonson, B.S., 287, 402
Pasiali, V., 58, 306
Patel, M.R., 82
Patten, S.B., 265
Penke, L., 84
Penrod, B., 74
Perepletchikova, F., 194
Perri, M.G., 315
Peterson, A.L., 250-251
Peterson, L., 86, 251-253, 386
Pfeifer, G.P., 370
Phaneuf, L., 78, 221
Piazza, C.C., 82
Pipkin, C.C.S.P., 58
Plaud, J.J., 273
Plomin, R., 367fn
Pluck, M., 202
Pohl, R.F., 6
Poling, A., 74
Polster, R., 135
Polychronis, S., 368
Popkey, C., 86
Porritt, M., 74, 327-328
Poulson, C.L., 57
Powell, S., 293
Presson, C.C., 254
Price, D.D., 13, 40
Quera, V., 416
Quesnel, C., 306, 407, 412-414
Quist, R.W., 222
Ragsdale, K., 25
Rapp, J.T., 65
Ravenette, T., 19
Rayner, R., 5, 13
Reed, D.D., 44
Reeve, K.F., 57
Reeve, S.A., 57, 65
Reid, D.H., 53
Reid, J.B., 386
Reinhartsen, D.R., 214
Reinsel, G.C., 348
Reiss, A.L., 80
Reitman, D., 312
Ressler, K.J., 249
Retting, R.A., 156
Revenstorf, D., 314
Revusky, S.H., 410
Reyes, F.D., 370
Reyes, J.R., 83
Reyna, V.F., 84
Ricciardi, J.N., 183-184
Rice, V.H., 248, 304
Riesen, T., 368
Riley-Tillman, T.C., 324fn
Rindskopf, D.M., 407
Risley, T.R., 15, 285
Rivière, V., 172
Roberts, J.A., 65
Roberts, L.J., 314
Roberts, M.D., 55
Robinson, P.W., 3, 11
Roediger, H.L., III, 6
Ronan, K.R., 302, 336, 407
Rope, D., 324
Rosales-Ruiz, J., 159
Roscoe, N., 75
Rosenbaum, J.E., 34, 261
Rosenthal, L., 54
Rosenthal, R., 243
Rosnow, R.L., 243
Rusch, F.R., 239
Rutter, M., 52
Ryan, C.S., 77
Sachs, D.A., 271
Sahiri, S., 172
Salama, F., 293

Salcioglu, E., 277
Satake, E., 410
Savard, J., 87, 306, 407, 411, 418
Schaal, D.W., 388
Schepis, M.M., 385
Scherrer, M.D., 151-152
Schlundt, D.G., 4
Schmidt, F.L., 398fn, 406fn
Schnelle, J.F., 385, 411
Schreibman, L., 159
Schutz, R.E., 15
Schwartz, I.S., 53, 204, 307
Scotti, J.R., 410
Scruggs, T.E., 410
Sechrest, L., 3-4, 6, 9, 264
Segal, N.L., 367fn
Sentovich, C., 417
Serlin, R., 410
Shabani, D.B., 240
Shadish, W.R., 25, 28fn, 40, 407
Shapiro, E.S., 202-203, 222
Shapiro, M.B., 19
Shapiro, R.L., 358
Sharper, A.G., 37
Shaw, D., 418
Sherman, J.A., 55
Sherman, S.J., 254
Shidlovski, D., 93fn
Shoenberger, D., 82
Shoukri, M.M., 114
Sidani, S., 3
Sideridis, G.D., 409
Sidman, M., 14, 286, 375
Siegel, T., 313-314
Sierra, V., 416
Silva, P.A., 52
Simard, S., 306, 407
Simon, G.E., 317
Simpanen, M., 147
Simpson, C., 77
Singer, G.H.S., 204-205
Skiba, R.J., 298, 403, 410
Skinner, A.L., 169-170
Skinner, B.F., 14, 330, 386, 388
Skinner, C.H., 169-170
Smith, L., 55
Smith, T.B., 368
Solanas, A., 406, 410, 416-417
Spiegel, D., 317
Spirrison, C.L., 299
Staats, A.W., 15
Staats, C.K., 15
Stanley, J.C., 258-259, 369, 379
Stanton, W., 52
Stead, L.F., 248, 304
Steege, M., 214
Steinman, W.M., 55
Steinmetz, J.E., 388
Stewart, K.K., 298
Stice, E., 369
Stickle, T.R., 3
Stricker, J.M., 358
Suen, H.K., 110fn
Sugai, G., 137
Sulzer-Azaroff, B., 197
Sundberg, M.L., 330
Swaminathan, H., 380
Swanson, J.M., 248
Swartzwelder, D.M., 82
Swoboda, C., 298
Tang, M., 370
Tarbox, J., 74
Tarbox, R.S.F., 74
Tellegen, A., 367fn
Thigpen, C.H., 5, 13
Thompson-Brenner, H., 18
Thornberry, T.P., 86
Tiger, J.H., 55, 81
Timberlake, W., 388
Tingstrom, D., 330
Todd, A.W., 141
Todd, P.M., 84
Todman, J.B., 208-209, 379
Tofflemoyer, S., 141
Townsend, D.B., 57
Tremblay, G., 86
Tryon, W.W., 410
Tufte, E.R., 324
Twohig, M.P., 82, 85, 386
Uleman, J., 93
Ullmann, L.P., 15
Ulman, J.D., 197
Valentine, J.C., 398fn
Van Houten, J., 77, 237
Van Houten, R., 77, 156, 237-238
Vaught, R.S., 298, 411
Vazire, S., 95
Velleman, P.F., 405
Vlaeyen, J.W.S., 272-273
Vollmer, T.R., 58
Wacker, D., 214, 231, 234
Wakefield, J.A., Jr., 114fn
Walker, P., 410
Wallace, M.D., 74, 203fn
Wampold, B.E., 18, 410
Wannamethee, S.G., 37
Ward, P., 214
Warnes, E., 83, 311, 386
Washington, K., 204, 208, 214
Wassell, G., 63
Watson, J.B., 5, 13
Watson, R.I., 12
Watson, T.S., 141-142, 159, 204
Weber, K., 147
Wehby, J.H., 141, 222
Weinrott, M.R., 298, 411
Weiss, B., 298fn
Weisz, J.R., 35-36, 298fn, 397
Westen, D., 18
Whalen, C., 159
White, K.G., 388, 406
White, O.R., 204, 348, 410
Whitley, M.K., 36
Wiesman, D.W., 294, 296
Wilder, D.A., 151-152, 176-177, 194, 196
Wilhelm, S., 249
Wilkinson, L., 324, 406fn
Williams, B., 411
Williams, R.L., 337
Wills, D., 324
Wilson, D.D., 360
Wilson, G.T., 369
Wine, B., 194
Wodarski, J.S., 214
Wolery, M., 214, 410
Wolf, M.M., 15, 53, 307
Wong, S.E., 58
Wood, D.D., 115fn
Worsdell, A.S., 327, 329, 333
Wright, K.M., 310
Yasui, M., 68
Young, A.W., 8
Yule, W., 52
Zhang, H.H., 405
Zhao, N., 77
Zilboorg, G., 13
Zink, C.F., 370
SUBJECT INDEX

ABAB designs, 127-138
  changing phases, 353
  characteristics of, 127-128
  in combined designs, 229-231
  ethical considerations, 140
  multiple interventions in, 136-138, 141
  number of phases, 136
  order of phases, 134-136
  problems of, 138-140
  reversal phase, 133-134, 138-142, 177
  underlying rationale, 128-130
  variations of, 133-138
Abscissa, 324-325
Alternating-treatments design, 197-206, 208, 355 See also Multiple-treatment designs
  balancing conditions, 198-199, 220-221, 224
  continuation of baseline, 202-203, 208
  discriminability of treatments, 219-220
  multiple-treatment interference in, 221-223
  omitting the baseline phase, 202-206, 214-215
  randomization and, 198-199
  types of interventions and behaviors, 215-219
  underlying rationale, 197
Anecdotal information, 8-9
Anecdotal case study, 4, 12 See also Case study
Applied behavior analysis, 15-16, 20, 50, 384
  journal, 110fn, 384fn
Assessment, 49-50, 59-62, 73-97, 264 See also Behavioral assessment; Strategies of assessment
  automated recording, 43, 94-96
  conditions of, 89-97
  continuous, 122-124, 271-275, 384, 386
  discrete categorization, 76-77, 88, 103-104
  duration, 79-80
  frequency, 74-76, 88
  interval recording, 78-79, 88
  latency, 88
  multiple measures, 62-64
  natural environment vs. lab, 91-92
  natural vs. contrived, 89-91
  number of people, 77-78
  obtrusive vs. unobtrusive, 93-94
  overt behavior, 74-81, 386
  pre- and post-assessment, 64-65, 267-271
  probes, 65-66, 236-239, 242
  psychophysiological, 82-85
  reactivity of
  reports by others, 65, 87-88
  requirements, 59-62
  selection of measures, 59-62
  self-report, 63, 83-87
  strategies of, 74-78
Autocorrelation, 299fn, 347fn, 407-409, 411-412
  statistical tests and, 408-409
BABA design, 135-136, 215, 354
Baseline assessment, 59, 108fn, 123-130
  extrapolation of, 123-124
  functions of, 123
Baseline phase, 108fn
  continuation of, 202, 212
  omitting of, 214-215, 231
  trends in, 224
Behavioral assessment, 56-58, 73-86, 384 See also Conditions of assessment
  conditions of, 89-97
  defining behaviors, 56-58
  strategies of, 74-81
Between-group research, 2-3, 12, 122, 242-250, 407
  combined with single-case designs, 250-253
  contributions of, 243-248, 397, 407
  evaluating statistical interactions, 246, 397
  generality of findings and, 245, 372-374
  single-case research in relation to, 122-123, 253-255, 384
Carryover effects, 216 See also Multiple-treatment interference
Calculating interobserver agreement, 101-108
  base rates and chance, 105, 108-111, 114
  frequency ratio, 102-103
  point-by-point agreement ratio, 103-105, 108
  product-moment correlation, 105-107
Case studies, 3-10, 50
  anecdotal case study, 4, 12
  defined, 3-4
  drawing inferences from, 4


  illustrations of, 7-8
  limitations of, 8-9
  strength and value of, 4-7
Ceiling effects, 60, 217
Chance agreement, 108-114
  base rates, 108-110
  estimates of, 109
  methods of handling, 110-114
Changing-criterion design, 167-190, 233
  bidirectional changes, 175-177, 180
  characteristics of, 168-169
  correspondence of criteria and behavior, 180-187
  distributed-criterion design, 178-180
  magnitude of criterion shifts, 173, 186-190
  mini-reversals in, 176-177, 184, 189, 235
  number of criterion shifts, 172
  problems of, 180-184, 189, 233
  range-bound design, 173-175
  shaping and, 190fn
  subphases in, 168-169, 171-177, 189
  underlying rationale, 168-169
  variations of, 171-180
Clinical or applied significance, 306-308, 312-316 See also Social validation
  diagnostic criteria and, 315-316
  dysfunctional behavior and, 308, 314-315
  measures of, 308
  normative behavior and, 308, 313-314
  problems with, 317-320
  social impact and, 317-318
Combined designs, 209-213, 228-235
  between-group research in, 250-253
  description of, 228-229
  problems of, 233, 235
  underlying rationale, 228
  variations of, 229-233
Concurrent schedule design, 197, 198fn See Alternating-treatments designs
Conditions of assessment, 89-97
  human observers vs. automated recording, 94-96
  natural vs. laboratory settings, 91-92
  naturalistic vs. contrived, 89-91
  obtrusive vs. unobtrusive, 93-94
Confound, 36-38
Construct validity, 29, 36-40
  in assessment, 36
  in experimentation, 36
  threats to, 37-40
Correlational statistics, 114
  kappa, 113
  Pearson product-moment correlation, 105-108, 186fn
  phi, 113, 114fn
Cumulative graph, 326-331
Data, 394 See also Data evaluation
  group vs. individual, 393-395
  integration across studies, 406-407
  quality of care and
Data evaluation, 284-320, 401-419 See also Statistical evaluation; Visual inspection
  applied criterion, 285, 300, 306-307
  changes in level, 288, 291, 346
  changes in means, 288, 291, 337-338, 346
  changes in trend, 288-290, 292, 346-348
  experimental criterion, 285-300
  latency of change, 288-292, 348, 350
Data-evaluation validity, 29, 40-45
  threats to, 40-45
Defining behavior, 56-58
  operational definition, 56-57, 62-63
Drawing valid inferences, 25-47
  construct validity, 36-40
  data-evaluation validity, 40-46
  external validity, 32-36, 39
  internal validity, 29-32, 45-46
  parsimony, 25-28
  plausible rival hypotheses, 27-28
  threats to validity, 28-35, 37-45, 129-130
Duration of phases, 357-360 See also Stability of performance
  criteria for shifting phases, 360-361
Duration of response, 79-80
Effect size, 244, 297, 380, 406-407, 417
Evidence-based interventions, 16-19, 21, 165
Evidence-based practice, 16-17fn
Evidence-based treatments, 16-18, 35-36
Experimental analysis of behavior, 14-15, 371, 384
  journal, 384fn
Experimental psychology, 10-11
External validity, 29, 32-36, 38-39
  priority of, 45
  threats to, 32-36
Floor effects See Ceiling effects
Frequency measures, 74-76, 88, 102-103
  frequency ratio, 102-103, 108
  rate of response, 75
Functional analysis, 137, 203fn, 206

Generality of results, 2, 9, 32-36, 38-39, 46, 245, 392 See also Moderators; Replication
  single-case research and, 376-377, 392-393
  external validity and, 32-36, 39
Generalization, 243, 292
  designs to evaluate, 239-242
  probes and, 237-239, 242
Graphical display of data, 323-351, 406
  bar graph, 331-335
  cumulative graph, 326-331
  descriptive aids, 337-350
  histogram, 331fn
  simple line graph, 324-326
  types of graphs, 324-335
Idiographic approach, 12
Internal validity, 29, 45-47
  priority of, 45
  threats to, 29-32, 130, 267-269
Interobserver agreement, 66-67, 99-114 See also Calculating interobserver agreement
  acceptable levels of, 118-119
  accuracy vs. agreement, 99-101
  base rates and chance, 105, 108-111, 114
  checking, 101
  frequency ratio, 102-103
  kappa, 113
  methods of estimating, 101-111, 113-114
  plotting agreement data, 111-112
  point-by-point agreement ratio, 103-105, 108
  product-moment correlation, 105-108
  sources of bias in, 114-118
Interval recording, 78-79, 88, 104-105
  time sampling, 79
Intervention outcome questions, 361-371, 397
Measures, 42-43 See also Assessment
  multiple, 62-64
  unreliability of, 42-43
Mechanisms, 246-247, 362, 369-371
Mediators, 246-248, 362, 366-369
  between-group studies and, 246-247
  single-case research and, 247-248
Meta-analysis, 379-380, 398, 406-407
Moderators, 245-246, 362, 366-369, 392-394
Multielement design, 193-197, 216, 219, 220fn, 222 See also Multiple-treatment designs
  underlying rationale, 193-195
Multiple-baseline designs, 144-165
  across behaviors, 145-149, 159
  across individuals, 149-152, 159
  across situations, 151-154, 159
  characteristics of, 144-147
  clinical utility of, 149, 164-165
  combined designs and, 229-233
  ethical issues, 161-162
  gradual applications of treatment, 164-165
  number of baselines, 154-155
  partial treatment applications, 155-158
  problems of, 159-164
  prolonged baselines, 161-164
  underlying rationale, 144-146
  variations of, 149-154
Multiple-treatment designs, 192-225, 237
  advantages of, 193, 223-225
  alternating-treatments design, 197-208, 215, 219-220, 224, 233
  balancing conditions, 198-199, 216, 220
  characteristics of, 193
  concurrent schedule, 197, 198fn
  continuation of baseline, 212
  discriminability of interventions, 219-220
  multielement design, 193-197, 216, 220fn, 222
  multiple-treatment interference, 221-225
  number of interventions, 220-221
  problems of, 214-223
  randomization design, 207-209
  simultaneous-treatment design, 197-198
  variations of, 206-214
Multiple-treatment interference, 34-35, 221-225, 242, 365-366
Nomothetic approach, 12
Normative data, 53-54
Observational data, 56-58, 67-69
  complexity of, 117-118
Observer drift, 116-117
Open study, 21fn, 270fn
Operant conditioning, 14-16, 384, 386, 388
Operational definition, 54-55
Ordinate, 324-325
Parsimony, 25-28
  Occam's razor, 26
Pearson product-moment correlation, 105-108, 186fn
Placebo, 38, 40, 249, 270fn
Plausible rival hypotheses, 27-28
Probes, 65-66, 236-239, 242
  assessment, 65-66, 163
  designs, 236-239
Psychophysiological assessment, 82-83, 85, 89

Psychotherapy, 7, 18-19, 37, 181, 317, 320 See also Evidence-based treatments
Qualitative research, 398-399
Quantitative research, 20, 399 See also Between-group research
Quasi-experimental designs, 257-282
  illustrations of, 266-278
  improve the quality of inferences, 263-266
  why needed, 259-263
Randomization, 12, 17, 25, 198-199, 209, 258, 261
Randomization design, 207-209
Randomized controlled trial, 17-18, 20-21, 122, 242, 249, 261, 281, 366, 377, 399
Reactivity, 34
  of assessment, 34, 93, 115-116
  of experimental arrangements, 34
Regression toward the mean See Statistical regression
Reliability, 41-43, 60, 66-67, 69, 71-72, 99fn, 100, 114 See also Interobserver agreement
Replication, 25, 44, 306, 374-376, 385
  direct
  inconsistent effects in, 376
  systematic, 375
  within subjects, 44, 385
Reports by others, 65, 87-88
Response maintenance, 236
  designs to examine, 239-242
Reversal phase, 133-134, 136-142, 231, 235, 249
  absence of reversal in behavior, 138-139
  in combined designs, 214, 229-233, 235
  mini-reversal, 176-177, 184, 189-190, 228, 235
  procedural options for, 133-134
  undesirability of using, 128fn, 139-140
Self-report measures, 63, 83-87, 311, 386
Serial dependence, 299, 404, 407-409, 411
Shifting phases, 360
  criteria for, 360-361
  duration of the phases, 357-360
Simultaneous-treatment design See Alternating-treatments design
Single-case research, 1-2
  associated or nonessential features, 74, 385-387
  characteristics of, 384-388
  contemporary development of, 13-16
  decision-making in, 44, 389-390
  essential features, 384-385, 387
  experimental psychology and, 10-11
  generality of findings, 371-372, 376-377
  logic of, 123-124, 128-130, 138, 205, 301
  methodological issues, 353-361
  number of subjects in, 1, 385-386, 393
  requirements of, 122-128
  special strengths of, 388-395
  terminology of, 1-2
Social comparison, 53-56, 307-310, 318-319 See also Social validation
Social validation, 53-56, 307-312
  problems with, 317-320
  social comparison, 53-56, 307-310, 318-319
  subjective evaluation, 54-56, 308, 310-312, 319
Stability of performance, 124-127
Statistical evaluation, 11-12, 40, 297, 302-306, 359, 387, 402, 405-419 See also Statistical tests
  confidence intervals, 341-343
  measures of variability, 339-342
  standard error of the mean, 340fn
  reasons for using, 303-306
  subjectivity in, 297-298fn
  tests for the single case, 409-419
Statistical regression, 31, 358
Statistical tests, 286-287, 409-418
  guidelines, 410-411, 415-416
  inconsistencies of, 410, 416-417
  obstacles for using, 415-417
  randomization tests, 208
  single-case research and, 409-419
  statistical significance and, 406
  time-series analysis, 348, 411-415
Strategies of assessment, 74-89
  discrete categories, 75-76
  duration, 79-80, 88
  frequency, 74-75, 88
  interval recording, 78-79, 88
  latency, 80-81, 88
  number of people, 77-78
  psychophysiological measures, 82-83, 85-86, 89
  reports by others, 65, 87-88
  response-specific measures, 81-82
  self-report, 65, 83-87, 311, 386
  time sampling, 79
Subjective evaluation, 54-56, 308, 310-312, 319 See also Social validation
Threats to validity, 28-45 See also Construct, External, Internal, and Data-evaluation validity

Time-series analysis, 411-415, 417-418
  considerations in using, 414-415
  illustration of, 411-414
Transfer of training, 236
Treatment integrity, 194-196
Trends in the data, 41, 43, 124-126, 299, 303, 353-355, 404
  difficult to detect, 126fn, 404, 408, 412
  options to address, 353-354
Type I error, 287, 300, 408, 416
Type II error, 287, 406, 408
Types of graphs
  bar graph, 331-335
  cumulative graph, 326-331
  histogram, 331fn
  simple line graph, 324-326
Validity, 67-69, 71-72, 100
  of measurement, 67-69, 71
  of experiments, 25-29
Variability, 40-42, 46, 126-127, 303-304, 355-357
  error bars and, 341-342
  excessive, 40-42, 46, 126, 343, 356
  plotting of data and, 339-340, 343-346
Visual inspection, 40, 285-302, 386-387, 401-405, 409, 415, 417-418
  changes in level, 288, 292, 346-347
  changes in means, 288, 292, 337-338, 346-347
  changes in trend, 288-290, 292, 346-348
  consistency of, 297-298, 402-404, 418
  criteria for, 287-297
  descriptive aids, 337-350
  graphical display and, 323-324
  influences on, 298-300
  latency of change, 288-292, 348, 350
  problems with, 297-302, 402-405
  reliability of, 297-298, 402-403, 418
  underlying rationale, 286-287, 300-301, 403-404
  trends in the data, 299
