
sequential analysis and observational methods

for the behavioral sciences


Behavioral scientists – including those in psychology, infant and child
development, education, animal behavior, marketing, and usability studies – use
many methods to measure behavior. Systematic observation is used to study
relatively natural, spontaneous behavior as it unfolds sequentially in time. This
book emphasizes digital means to record and code such behavior; although
observational methods do not require them, they work better with them. Key
topics include devising coding schemes, training observers, and assessing reli-
ability, as well as recording, representing, and analyzing observational data. In
clear and straightforward language, this book provides a thorough grounding
in observational methods along with considerable practical advice. It describes
standard conventions for sequential data and details how to perform sequential
analysis with a computer program developed by the authors. The book is rich
with examples of coding schemes and different approaches to sequential ana-
lysis, including both statistical and graphical means.
Roger Bakeman is professor emeritus in the Psychology Department at Georgia
State University. He is a fellow of the American Psychological Association and the
Association for Psychological Science, and has served as program co-chair for
biennial meetings of the Society for Research in Child Development (SRCD)
and the International Conference of Infant Studies (ICIS). He is author, with
John M. Gottman, of Observing Interaction: An Introduction to Sequential
Analysis; with Vicenç Quera, of Analyzing Interaction: Sequential Analysis with
SDIS and GSEQ; and with Byron F. Robinson, of Understanding Statistics in the
Behavioral Sciences and Understanding Log-linear Analysis with ILOG. He was
an associate editor for Infancy and has served on editorial boards for several
other journals.
Vicenç Quera is a professor in the Department of Behavioral Science Methods,
Faculty of Psychology, at the University of Barcelona; director of the Master
and Doctorate Programme in Primatology; a member of the Institute for Brain,
Cognition and Behavior; and leads the Adaptive Behavior and Interaction
Research Group at the University of Barcelona. He is co-author, with Roger
Bakeman, of Analyzing Interaction: Sequential Analysis with SDIS and GSEQ.
He has served on the editorial boards of Behavior Research Methods, Psicológica,
and other journals, and his articles have appeared in numerous journals, includ-
ing Psychological Bulletin, Psychological Methods, Behavior Research Methods,
and Social Science & Medicine.
Sequential Analysis and Observational
Methods for the Behavioral Sciences

Roger Bakeman
Georgia State University

Vicenç Quera
Universidad de Barcelona
cambridge university press
Cambridge, New York, Melbourne, Madrid, Cape Town,
Singapore, São Paulo, Delhi, Tokyo, Mexico City

Cambridge University Press


32 Avenue of the Americas, New York, NY 10013-2473, USA
www.cambridge.org
Information on this title: www.cambridge.org/9780521171816

© Roger Bakeman and Vicenç Quera 2011

This publication is in copyright. Subject to statutory exception


and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.

First published 2011

Printed in the United States of America

A catalog record for this publication is available from the British Library.

Library of Congress Cataloging in Publication Data


Bakeman, Roger.
Sequential analysis and observational methods for the behavioral sciences / Roger Bakeman,
Vicenç Quera.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-107-00124-4 (hardback) – ISBN 978-0-521-17181-6 (paperback)
1. Psychology – Research. 2. Social sciences – Research. 3. Sequential analysis.
4. Observation (Psychology) – Methodology. I. Quera, Vicenç. II. Title.
BF76.5.B354 2011
150.72′3–dc23 2011019707

ISBN 978-1-107-00124-4 Hardback


ISBN 978-0-521-17181-6 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of urls
for external or third-party Internet Web sites referred to in this publication and does not
guarantee that any content on such Web sites is, or will remain, accurate or appropriate.
Contents

List of Figures page ix


Preface xiii

1. Introduction to Observational Methods 1


Systematic Quantitative Measurement versus Qualitative Narrative 1
Correlational versus Experimental Designs 3
Predictor versus Outcome Variables 4
Variables, Units, and Sessions 4
Why Use Observational Methods? 6
Sequential Analysis of Behavior 7
Summary 11

2. Coding Schemes and Observational Measurement 13


Where Do Coding Schemes Come From? 13
Must Codes be Mutually Exclusive and Exhaustive? 14
Granularity: Micro to Macro 18
Concreteness: Physically to Socially Based Codes 19
Codes versus Rating Scales 21
The Coding Manual 22
Summary 24

3. Recording Observational Data 26


Untimed-Event Recording 28
Timed-Event Recording 29
Interval Recording 30
Partial-Interval or One-Zero Sampling 32
Momentary or Instantaneous Sampling 32
Whole-Interval Sampling 32


Selected-Interval Recording 34
Live Observation versus Recorded Behavior 35
Digital Recording and Computer-Assisted Coding 37
Summary 40

4. Representing Observational Data 43


A Sequential Data Interchange Standard (SDIS) 43
Representing Time 44
Single-Code Event Sequences 46
Timed-Event and State Sequences 48
Interval and Multicode Event Sequences 50
A Universal Code-Unit Grid 51
Alternatives: Spreadsheet and Statistical Package Grids 53
Data Management and File Formats 54
Summary 55

5. Observer Agreement and Cohen’s Kappa 57


Point-By-Point versus Summary Agreement 58
The Classic Cohen’s Kappa 59
When is Kappa Big Enough? 62
Is Statistical Significance Useful? 63
Observer Bias and Kappa Maximum 64
Observer Accuracy, Number of Codes, and Their Prevalence 65
Standards for Kappa (Number of Codes Matters) 66
Comparing an Observer with a Gold Standard 68
Agreement and Reliability 69
Errors of Commission and Omission 69
Summary 70

6. Kappas for Point-By-Point Agreement 72


Event-Based Agreement: The Alignment Problem 72
Time-Based Agreement: Inflated Counts? 77
Event-Based Agreement for Timed-Event Sequences 78
Interval-Based Agreement Using Cohen’s Kappa 81
Weighted Kappa: When Disagreements Differ in Severity 81
Are All Kappas Overrated? 83
Summary 84

7. The Intraclass Correlation Coefficient (ICC) for Summary Measures 87

Relative versus Absolute Agreement 87

Targets and Sessions 88


Relative and Absolute ICCs 89
Summary 92

8. Summary Statistics for Individual Codes 93


Basic Statistics for Individual Codes 95
Frequency 95
Relative Frequency 96
Rate 96
Duration 97
Relative Duration 98
Probability 98
Mean Event Durations, Gaps, and Latencies 99
Mean Event Duration 100
Mean Gap 100
Latency 100
Recommended Statistics for Individual Codes 101
Summary 102

9. Cell and Summary Statistics for Contingency Tables 104


Individual Cell Statistics 105
Observed Joint Frequencies and Hierarchical Tallying 105
Lagged Tallies for Single-Coded Events When Codes Can
and Cannot Repeat 107
Conditional and Transitional Probabilities 108
Expected Frequencies and Adjusted Residuals 109
Indices of Association for Two-Dimensional Tables 111
Contingency Indices for 2×2 Tables 111
Odds Ratio and Log Odds 112
Yule’s Q 114
Vulnerability to Zero Cells 115
Summary 116

10. Preparing for Sequential and Other Analyses 118


Creating New Codes from Existing Codes 118
Logical Combinations for Timed-Event, Interval, and
Multicode Event Data 119
RECODE for All Data Types 120
EVENT and BOUT for Timed-Event Data 121
RECODE, LUMP, and CHAIN for Single-Code Event Data 122
REMOVE and RENAME for All Data Types 123

Creating New Codes as “Windows” Anchored to Existing Codes 124


Pooled versus Individual Analyses 125
Preparing Export Files and Using Statistical Packages 126
Deviant Cells, Type I Error, and Winnowing 128
Summary 131

11. Time-Window and Log-Linear Sequential Analysis 134


Time-Window Sequential Analysis of Timed-Event Data 135
The Sign Test: A Nonparametric Alternative 137
Lag-Sequential and Log-Linear Analysis of Single-Code
Event Data 138
Overlapped and Nonoverlapped Tallying of m-Event Chains 139
An Illustration of Log-Linear Basics 141
Log-Linear Analysis of Interval and Multicode Event Data 144
Summary 146

12. Recurrence Analysis and Permutation Tests 148


Recurrence Analysis 148
Permutation Tests for Short Event Sequences 156
Summary 160

Epilogue 163
Appendix A: Expected Values for Kappa Comparing Two Observers 165
Appendix B: Expected Values for Kappa Comparing with a
    Gold Standard 167
References 169
Index 179
Figures

1.1. Parten’s (1932) coding scheme for social engagement. page 8


1.2. The evolution of three similar coding schemes for social
participation as discussed in the text. 9
2.1. Three coding schemes; each consists of a set of mutually
exclusive and exhaustive codes. 15
2.2. A coding scheme consisting of two sets of mutually exclusive
and exhaustive codes. 17
2.3. Codes for chimpanzee mother and infant food transfer. 18
2.4. Examples of coding schemes, one more physically based
(infant) and one more socially based (maternal). 20
2.5. Definitions for three types of mountain gorilla vocalizations. 24
3.1. Recording strategies described in the text. 27
3.2. A paper form for untimed-event recording with two sets of
ME&E codes. 29
3.3. A paper form for timed-event recording. 30
3.4. A paper form for interval recording. 32
4.1. Recording strategies, data types, and coding and universal
grid units. 44
4.2. An example of an SDIS single-code event sequential data file. 47
4.3. An example of an SDIS timed-event sequential data file, with
data shown in the grid at the top. 49
4.4. An example of an SDIS state sequential data file for the data
shown in Figure 4.3. 50
4.5. Examples of an SDIS interval sequential data file (based on
Figure 3.4) and an SDIS multicode event sequential data file
(based on Figure 3.2). 51


4.6. An example of a code-unit grid for which rows represent


codes and successive columns could represent either events,
time units, or intervals. 52
5.1. A kappa table tallying the frequency of agreements and
disagreements by two observers coding infant state for 120
intervals. 60
5.2. The five 2×2 tables produced by collapsing the 5×5 table in
Figure 5.1. 62
5.3. Expected values for kappa when number of codes and their
prevalence varies as shown for observers who are 95%
accurate (top set of lines), 90% accurate (second set), 85%
accurate (third set), and 80% accurate (bottom set). 66
5.4. Sensitivity-specificity table. 70
6.1. Sequential data types and the appropriate kappa variant for
each. 73
6.2. Two single-code event sequences, their alignment per the
dynamic programming algorithm as implemented in GSEQ,
and the kappa table resulting from tallying agreement between
successive pairs of aligned events. 76
6.3. Two timed-event 20-minute sequences (in this case, state
sequences) with durations in seconds, and the kappa table
resulting from tallying agreement between successive pairs of
seconds with no tolerance. 77
6.4. Alignment of the two timed-event sequences shown in Figure
6.3 per the dynamic programming algorithm as implemented
in GSEQ (with 10-second tolerance for onset times and 80%
overlap for agreements-disagreements), and the kappa table
resulting from tallying agreement between successive pairs of
aligned events. 80
6.5. Two sets of weights for computing weighted kappa given four
ordered codes. 83
7.1. Summary contingency indices for ten targets (sessions)
derived from data coded by two observers, their analysis of
variance statistics, and the formulas and computations for
ICCrel and ICCabs, respectively. 90
8.1. An SDIS timed-event data file with 1-second precision (top)
and an SDIS interval data file with 1-second intervals (bottom)
describing the same events. 94
8.2. A code-unit grid for the timed-event data (60 seconds) and
the interval data (60 intervals) shown in Figure 8.1. 95

8.3. Formulas for six basic simple statistics. 99


9.1. Definitions for five basic cell statistics and the notation used to
describe them. 105
9.2. Cell statistics for Figure 8.1 data. 107
9.3. Observed Lag 1 counts and transitional probabilities for
Figure 8.1 data after being converted into single-code event
data with Assure, Explain, and Touch removed. 108
9.4. Definitions for two chi-square table statistics. 111
9.5. Notation and definitions for three basic 2×2 contingency
indices. 112
9.6. Two 2×2 contingency tables for the Figure 8.1 data with
their associated odds ratios (95% CIs for the ORs are given in
parentheses), log odds, and Yule’s Qs. 113
10.1. Use of logical combinations and the RECODE command
to create new codes from existing ones, assuming 1-second
precision for timed-event data or 1-second intervals for
interval sequential data. 120
10.2. Resulting sequences when applying the RECODE and
LUMP data modification commands to the single-code
event sequence shown and applying CHAIN to the sequence
resulting from the LUMP command. 122
10.3. Existing data and WINDOW command specifications for new
codes anchored to onsets and offsets of the existing code. 125
10.4. Table-fit statistics and adjusted residuals for four models
illustrating winnowing. 131
11.1. Scores are mean odds ratios, n = 16 for males and 14 for
females. 137
11.2. Two three-dimensional, Lag 0×Lag 1×Lag 2 contingency
tables showing, on the left, tallies for 3-event chains using
overlapped sampling derived from a sequence of 250 events
when codes can repeat and, on the right, from a sequence of
122 events when codes cannot repeat. 140
11.3. Log-linear analysis of the three-dimensional table shown on
the left in Figure 11.2 (codes can repeat). 142
11.4. Log-linear analysis of the three-dimensional table shown on
the right in Figure 11.2 (codes cannot repeat). 143
11.5. Four-dimensional Age × Dominance × Prior possession ×
Resistance contingency table. 145
11.6. Log-linear analysis of the four-dimensional table for the data
given in Figure 11.5. 146

12.1. Examples of recurrence plots. 149


12.2. Two recurrence plots for a single-code event sequence of a
couple’s verbal interaction. 151
12.3. Recurrence plots for a random event sequence (top) and a
highly patterned event sequence of verbal interactions (bottom). 153
12.4. At bottom, a timed-event sequence of a child’s crying and
fussing episodes, and at top, its recurrence plot. 154
12.5. A recurrence plot for an interval sequence of mother-
infant interaction, and above it the novelty score time series
indicating sequence segmentation. 155
12.6. The first number in each cell (top) is the observed count
for 2-event chains (i.e., Lag 1 transitions) computed for the
single-code event sequence shown at the bottom (N = 75).
The second number in each cell is the exact p-value for each
2-event chain, estimated using sampled permutations. 159
12.7. The sampling distribution for the Attentive-Write transition,
based on shuffling an event sequence (N = 75) 1,000 times. 160
Preface

We wrote this book because it’s time. The TLA (three-letter acronym) for
because it’s time is BIT, and what used to be called the bit-net (now the
Internet) let the authors begin their long-distance collaboration between
Atlanta and Barcelona. When we began working together in the early 1990s,
many investigators believed – with some justification – that observational
methods were appealing but too expensive and too time-consuming. At
that time, analog video recording on tape had replaced film, and electronic
means of recording observational data were replacing paper and pencil; yet
most electronic and computer systems were specialized, expensive, and a
bit cumbersome. We knew the digital revolution had begun, but we had no
idea it would have the reach and impact it has today.
As we begin the second decade of this century, times have indeed changed.
We now live in an image-saturated world where no moment seems private
and everything seems available for instant download. Thus it is no wonder
that researchers increasingly see merit in digitally recording behavior for sub-
sequent systematic observation. Indeed, for recording behavior, digital has
become the standard and preferred method. And although the systematic
observation of the sort described in this book can still be done live, it works
far better when behavior is digitally recorded for later replay, reflection, and
review. Digital multimedia (audio-video) files can be created, copied, played,
and stored with relative ease – and increasingly at minimal expense.
Coding behavior for subsequent quantitative analysis has likewise been
transformed by the digital revolution. Computer-assisted coding programs
remove much of the tedium and potential for error from the coding task –
and can even make coding fun. Once such programs were a bit exotic, few in
number, and required relatively expensive equipment. Now – given digital
multimedia files€– such programs are easier to implement, and the kind of
computer capability they require has become ubiquitous and inexpensive.

As a consequence, users have more choices than formerly, and some soft-
ware has become less expensive or even free.
Spurred by the advent of digital recording and coding and by their
greater ease and accessibility, we think it is time to revisit matters first dis-
cussed in our 1995 book, Analyzing Interaction: Sequential Analysis with
SDIS and GSEQ. In the early 1990s – recognizing the power of standard
formats such as those underlying almost everything the Internet touches –
we defined a standard set of conventions for sequential observational
data: the Sequential Data Interchange Standard, or SDIS. We then wrote
a general-purpose computer program for analyzing sequential observa-
tional data that relied on those standards: the General Sequential Querier,
or GSEQ. Our 1995 book had described how to run this program in the
dominant computer system of the day; that system (the Disk Operating
System, or DOS) is now essentially extinct, and the book is out of print.
GSEQ, however, has now been updated to run in the Windows environ-
ment (the current version is available at www.gsu.edu/~psyrab/gseq or
www.ub.edu/gcai/gseq).
The present book differs from our 1995 book in several ways. Primarily,
it is greatly expanded in scope: it focuses on observational methods gen-
erally and is not confined to the details of GSEQ. It also offers consid-
erable practical advice regarding sequential analysis and data analytic
strategies for sequential observational data – advice that applies whether
or not GSEQ is used. At the same time, we have striven to write a relatively
brief and nonencyclopedic book that is characterized by straightforward,
reader-friendly prose. Here, the interested reader may still learn how to
use GSEQ effectively with sequential observational data, if desired, but
should also be able to gain a sound conceptual overview of observational
methods – a view grounded in the contemporary digital world.
It is the grounding in the digital world and its explication of GSEQ cap-
abilities that most distinguishes this volume from the book Roger Bakeman
wrote with John Gottman, Observing Interaction: An Introduction to
Sequential Analysis (1st ed. 1986, 2nd ed. 1997). Granted some conceptual
overlap, the topics covered in the two volumes are sufficiently different that
Observing Interaction can easily be read with profit as a companion to this
one. Certainly the intended audience is the same.
The audience we have in mind consists of behavioral and social science
researchers, of whatever level, who think observational methods might be
useful and who want to know more about them, or who have some famil-
iarity with observational methods and want to further hone their skills and
understanding. Apart from an interest in behavioral research, we assume
that readers of this volume will be familiar with research methods and stat-
istical analysis, at least at the level presented in introductory courses in these
topics. Such knowledge may not be needed for the first chapter – which is
intended as a basic introduction to observational methods generally (and
which more knowledgeable readers may skim) – but is required for subse-
quent chapters.
As with our 1995 book, many people have helped us in our task. One
author, Roger Bakeman (RB), recognizes the debt owed his graduate school
advisor, Robert L. Helmreich, who first encouraged him to learn more about
observational methods, and his debt to Gene P. Sackett, who introduced him
to sequential analysis. For RB, those interests were honed in collaborative
work at Georgia State University, beginning first in the 1970s with Josephine
V. Brown, a lifelong friend; and continuing since the 1980s with Lauren
B. Adamson, an invaluable friend, supporter, and research partner. More
recently, Augusto Gnisci of the Second University of Naples and Eugene H.
Buder and D. Kimbrough Oller of the University of Memphis have helped
us improve GSEQ, our computer program for sequential analysis. Eugene
H. Buder also offered many thoughtful and cogent suggestions for improv-
ing an earlier draft; we appreciate his contribution to the clarity of the final
volume, while taking responsibility for any murkiness that remains. The
other author, Vicenç Quera (VQ), recognizes the debt owed the late Jordi
Sabater-Pi, who transmitted his enthusiasm for naturalistic research to VQ
and first taught him how to observe and analyze behavior systematically;
and his debt to his early mentor, colleague, and friend, Rafael López-Feal,
who supported and encouraged his teaching and research. RB would also
like to acknowledge Maria Teresa Anguera, who translated Bakeman and
Gottman (1986) into Spanish, invited RB to speak at the University of
Barcelona in 1991, and introduced us. Our collaboration began immedi-
ately and has now continued through almost two decades.
As is always the case, colleagues and friends – too many to mention –
have contributed to our thinking and work over the years. RB would like to
thank, in particular, Daryl W. Nenstiel, who – in addition to being a lifelong
critic and partner – attempted to improve the prose of the current volume
(any remaining flaws, of course, remain ours), and Kenneth D. Clark, who
manages to keep RB on target and humble. VQ would like to thank Esther
Estany, who from time to time manages to extract him from writing papers
and computer code to visit distant deserts and other exotic regions, and
his colleagues from the Adaptive Behavior and Interaction Research Group
at the University of Barcelona for sharing good and bad academic times and
for their irreplaceable friendship and collaboration.
1

Introduction to Observational Methods

Observing behavior – the central concern of this book – is an ancient
human endeavor without which even our survival could become problem-
atic. What will the beast we hope to stalk, kill, and bring back to the tribe do
next? Is that attractive and suitable mate open to my advances? Is that child
in trouble and in need of our help?
Not all questions modern researchers pose will be as dramatic as these,
and as behavioral scientists search for answers, self-conscious, systematic
observational methods will come to supplant raw observation. But what
exactly do we mean by observational methods? A definition is in order. In an
expansive vein, the eighteenth-century historian William Douglass wrote,
“As an historian, every thing is in my province” (1760, p. 230). Similarly, fol-
lowing the nineteenth-century physiologist Claude Bernard (1865/1927), the
present-day behavioral scientist could say: Everything I know and do begins
with observation. I observe and describe the gait of the horse. I observe and
record the infant’s weight. I observe whether my participants check strongly
agree, simply agree, or some other choice on a questionnaire.
This chapter is intended as a basic introduction to observational meth-
ods. In it we introduce concepts and terms that will be familiar to readers
with some experience of observational methods, but that nonetheless pro-
vide a foundation for the chapters that follow.

systematic quantitative measurement versus qualitative narrative
Clearly, a definition of observational methods that includes any and all obser-
vation colonizes too much territory – although some students arrive on the
first day of our observational methods courses thinking that observation only
involves looking and then creating narrative descriptions. True, insightful
and informed narratives have a long and important history in such fields as
history, journalism, and anthropology, and what are usually called qualitative
methods have contributed to a number of fields in the behavioral sciences
(see Cooper et al., 2012). Moreover, as we describe in the next chapter, quali-
tative methods do play a role when developing the coding schemes used for
systematic observation. For example, Marjorie Shostak’s Nisa: The Life and
Words of a !Kung Woman (1981) provides an excellent example of qualitative
methods at work. In it she organizes interviews around such themes as earli-
est memories, discovering sex, first birth, and motherhood and loss; and she
provides texture, nuance, and insight that would largely elude quantitative
approaches. Another classic example is Barker and Wright’s (1951) One Boy’s
Day: A Specimen Record of Behavior, which provides intimate and poignant
minute-by-minute, morning-to-night observations of one boy’s life during a
single mid-twentieth-century Kansas day.
In contrast, as we understand the term, observational methods for behav-
ior are unabashedly quantitative. They provide measurement. Measurement
is usually understood as the act of assigning numbers or labels to things
(Walford, Tucker, & Viswanathan, 2010). In principle, the thing measured
could be any discrete behavioral entity. In observational practice, that entity
is typically an event or a time interval within which events can occur (see
Chapter 3). As you will see in subsequent chapters, event is a key term€–
we apply it to both relatively instantaneous behaviors and behaviors that
have appreciable duration. Some authors€– for example, Altmann (1974)€–
reserve the term for more momentary behaviors and use state for behaviors
of greater duration.
Measurement implies a measuring instrument: A thermometer gauges a
person’s temperature, a scale a person’s weight. For systematic observation
of behavior, the measuring instrument consists of coding schemes – which
we discuss at length in the next chapter – used by trained observers. As you
will see, unlike more familiar measuring devices, coding schemes are more
conceptual. They are based on mental distinctions and not on physical
materials like thermometers and rulers, and they involve a human compo-
nent (i.e., the observers). Melvin Konner’s work (e.g., 1976) with Harvard’s
study of the !Kung in Botswana in the late 1960s and early 1970s provides an
example. An electronic device delivered a click to his ear every 15 seconds.
He then recorded which of several mother, infant, adult, and child behav-
iors defined by his coding scheme had occurred since the last click. One
result of his work was a quantitative description of how often others in the
environment (e.g., mothers, fathers, other adults, siblings, other children)
paid attention to and played with !Kung infants.

Measurement also implies a measurement scale. The distinctions we
usually make were introduced by S. S. Stevens (1946) some time ago. He
categorized measurement scales as: (a) nominal or categorical – the names
assigned to the entities of interest have no natural order, like agreeable,
extroverted, open; (b) ordinal – the integers assigned to entities can only be
ranked or ordered, like first, second, and third in the race; (c) interval – an
increment anywhere on the scale involves the same amount of whatever is
measured, but zero is arbitrary, like degrees Celsius; and (d) ratio – every
increment on the scale denotes an identical amount and zero indicates truly
none of the quantity measured, like kilograms or many of the summary sta-
tistics for individual codes we describe in Chapter 8. As you will see in the
next chapter, the coding schemes of observational methods typically rely on
categorical measurement.
Perhaps the best way to distinguish the methods described in this book
from observation generally would be to call them systematic. Thus when
we refer to observational methods, it is systematic observation we have in
mind. Systematic differs from more informal observation in a number of
ways. First and foremost, it involves preplanning. Research questions and
key underlying constructs are articulated, and coding schemes developed
(see Chapter 2), with the research questions and constructs in mind before
observation begins. Observers are then trained, with special attention paid
to their accuracy (see Chapters 5 and 6) and the strategies they use to code
behavior (see Chapter 3). As Bakeman and Gottman (1997) summarized the
matter, central to systematic observation is (a) the use of coding schemes
that have been defined and piloted beforehand (b) by trained observers of
demonstrated reliability. At heart, it is this approach to measurement that
makes observational methods systematic.

correlational versus experimental designs


In the world of scientific investigation, measurements are embedded in
research designs. A key distinction is between correlational and experimen-
tal designs. With correlational designs, values of variables (i.e., constructs)
are simply measured (like a person’s gender or self-esteem), which allows
only weak or no causal inference. In contrast, with “true” experimental
designs, values of key variables are manipulated, which allows causal infer-
ence. For example, a confederate could be instructed to display either a fear
expression or a happy expression during a session, thereby manipulating the
type of emotion to which a participant is exposed. In common use (e.g., The
New York Times), the word observational is often used as synonymous with
correlational. Perhaps for this reason, students sometimes think that obser-
vational methods are inherently correlational, but this is not so. True, many
experimental studies are performed in laboratories and behavioral observa-
tions are often employed in field settings not involving manipulation. But
correlational studies can also be performed in laboratories and experimen-
tal ones in the field; and behavioral observations can be employed for either
type of study in either setting. It is the design that makes a study correl-
ational or experimental, not the measurement technique used.

predictor versus outcome variables


Whether or not values of some variables are manipulated, another key dis-
tinction is between predictor and outcome variables, which in the context of
experimental studies are often called independent and dependent variables.
Other terms are possible; for example, when studies posit more complex
causal models, variables whose presumed causes are unspecified and lie
outside the model are called exogenous, whereas other variables are called
endogenous.
Typically, but not necessarily, observational methods are used for meas-
uring outcome or endogenous variables for both experimental and correl-
ational studies. As detailed in later chapters, observational variables often
detail how much or how often some behavior occurred or whether behav-
iors were contingent. Often investigators want to know next whether these
outcomes were affected by (or associated with, for those who eschew causal
language) such predictors as gender, age, diagnostic group, environmental
context, type of teacher or instruction; or, in experimental studies, whether
they were affected by values of some manipulated variables. Thus in both
experimental and correlational contexts, observational methods are often
used to determine values for those variables that the investigator hopes can
be accounted for by other variables of interest.

variables, units, and sessions


Variables attach to something and a useful term for that something is
analytic unit. As we plan an investigation, describe it for others, and think
forward to subsequent data analysis, it is important at the outset to spe-
cify two key components: not just our basic analytic units but also our
research factors. This is true whether or not observational methods are used
to determine values for some or all of our variables. Research factors are
usually described as between-subjects (e.g., gender with two levels, male and
female) or within-subjects (e.g., age with repeated observations at 1, 2, and
3 years of age). Between-subject analytic units are, for example, the individ-
ual participants, parent-child dyads, families, or other groups (often called
cases in standard statistical packages, subjects in older literature, or simply
basic sampling units), whose scores are organized by our between-subject
research factors. When repeated measures exist, additional analytic units,
each identified with a level of a repeated measure, are nested within cases.
When observational methods are used, the observational session almost
always serves as the basic analytic unit, where a session is defined by a
sequence of coded behavioral events for which continuity can generally be
assumed (although either planned or unplanned breaks might occur during
an observational session). Summary statistics and indices derived from the
coded data for an observational session constitute scores. Scores from the
various sessions (i.e., analytic units) are then organized by any between-
and within-subjects factors and are analyzed subsequently using conven-
tional statistical techniques as guided by the design of the study.
Typically an observational study involves two steps. First, either behav-
ioral events or time units within which events may occur are coded for a ses-
sion. As noted earlier, this usually involves nominal measurement; although
as discussed later, rating successive segments of a session or the entire ses-
sion using ordinal scales is another possibility. Second, summary scores are
derived from the coded nominal data for the session. These scores represent
variables of interest and attach to the session. Such scores usually represent
equal-interval ratio-scale measurement (e.g., the summary frequencies and
other statistics described in Chapter 8) and – taking into account whether
variables are between- or within-subjects – can be analyzed (assuming
appropriate distributions) with standard statistical techniques such as correl-
ation, multiple regression, and analysis of variance.
In sum, systematic observation is simply one of many possible meas-
urement methods. In common with other methods, systematic observation
provides scores for subsequent statistical analysis. In fact, it is common for
scores in any given research project to derive from a variety of methods –
for example, gender and age from a questionnaire, maternal depression
from a self-report scale, and maternal responsiveness to her infant’s cries
from systematic observation. What distinguishes observational from other
methods is that, unlike questionnaires in which responses to a manage-
able series of questions are elicited, observation is carried out by trained
observers who typically code behavior over relatively long sessions. As a
consequence, behavioral observation is often quite time-consuming. When
coding live, observers need to be present during sessions that can vary from
a few minutes to several hours. More typically, sessions are recorded, which
can absorb even more time as observers spend hours coding just a few min-
utes of behavior. Compared to the few items of a typical self-report meas-
ure, data collected from observation can be voluminous and their analysis
seemingly intractable. Why then would an investigator bother with such a
time-consuming method?

why use observational methods?


There are many good reasons for using observational methods, but we
believe three are particularly compelling (Bakeman & Quera, 2012). First,
when research participants cannot tell us what they think or when they
cannot read and respond to questionnaires or when they cannot make
entries in a diary – as is true of preverbal infants, preliterate children, and
animals generally – observational methods provide a way to measure indir-
ectly what is “on their mind.” Thus it is not surprising that many early classic
examples of observational research involved animals and human infants
(e.g., Altmann, 1965; Parten, 1932). Moreover, even when our research par-
ticipants are verbal, observational methods may still be the best choice if
the focus of our research is their nonverbal behavior. In fact, in some cases
(e.g., marital interaction studies), it may be interesting to gather data by
observational methods about how people actually behave, and then com-
pare those data with other data collected by questionnaires or self-reports
about how they say they behave.
The second reason is that spontaneous behavior often seems more nat-
ural than elicited behavior. Natural is a relative and perhaps slippery term,
but when research participants whose behavior is not elicited are observed –
and it does not matter if it is in laboratory or field settings – we assume
that their observed behavior reflects their proclivities and untutored rep-
ertoire. We do not make similar assumptions when the behavior is elicited
by the experimenter, for example, when asking a participant to fill out a
questionnaire. Participants might be asked to soothe a crying infant in a
contrived setting, but somehow the behavior we then observe seems more
natural than responses made to a questionnaire asking how they would
soothe a crying infant. Nonetheless, we may still wonder whether behavior
is changed by being observed – like observer effects in physics. The answer
seems to be that humans habituate rapidly to being observed. For example,
as reported by Bakeman and Helmreich (1975), marine scientists living
in a space-station-like habitat fifty feet below the surface of Coral Bay in
the Virgin Islands were on-camera continuously; yet as they went about
their work, awareness of the cameras seemingly disappeared within the first
several minutes of their two- to three-week stay in the habitat.
The third reason is that when investigators are interested in process – how
things work and not just outcomes – observational methods have the ability
to capture behavior unfolding in time (which is essential to understanding
process) in a way that more static measures do not. An important feature of
behavior is its functionality: What happens before? What next? Which are
causes and which consequences? Are there lagged effects between certain
behaviors? Only by studying behavior as a process can investigators address
such questions. A good example is Gottman’s work on marital interaction
(1979), which, based on characterizations of moment-to-moment inter-
action sequences, predicted whether relationships would dissolve or not.
Also, process questions almost always concern contingency. For example,
when nurses reassure children undergoing a painful procedure, is the chil-
dren’s distress lessened? Or, when children are distressed, do nurses reassure
them more? In fact, contingency analyses designed to answer questions like
these may be one of the more common and useful applications of observa-
tional methods (for details, see Chapters 9 and 11).

sequential analysis of behavior


The third reason just given for using observational methods – an interest
in process – motivates much of this book. Understanding process means
looking at behavior in sequence as it unfolds in time, but – although the
terms sequential analysis and observational methods both occur in this
book's title – not all studies that are observational are sequential. The
difference is perhaps best conveyed by examples. Three paradigmatic studies
that illustrate how observational studies may or may not be sequential
were cited by Bakeman and Gottman (1997). These studies all involved
preschool children observed in relatively natural contexts and are worth
revisiting.
The first study is Mildred Parten’s (1932) study of social participation
among preschool children conducted at the University of Minnesota’s
Institute of Child Welfare in the late 1920s. During the 1926–27 school year,
some forty-two children whose ages ranged from not quite two to almost
five years of age were observed seventy different times, on average. The
daily observations occurred during indoor free play and lasted 1 minute
for each child; the order in which children were observed varied so that the
1-minute samples for each child would be distributed more or less evenly
throughout the hour-long free-play period.

Unoccupied: Child not engaged with anything specific; seems aimless.
Onlooker: Child watches other children playing, but does not join in.
Solitary or independent play: Child plays alone and independently, seemingly unaffected by others.
Parallel activity: Child plays independently beside, but not with, other children but with similar toys; no attempt to control who is in the group.
Associative play: Child plays with other children, with some sharing of play materials and mild attempts to control who is in the group.
Cooperative play: Child plays in a group that is organized for some purpose, for example, playing house or a formal game or to attain a goal.

Figure 1.1. Parten's (1932) coding scheme for social engagement.

Parten was interested in the development of social behavior in young
children. Accordingly, she asked observers to code each 1-minute sample
by the level of social engagement that predominantly characterized it. Her
six codes are detailed in Figure 1.1. From the coded intervals, Parten com-
puted the percentage of samples assigned each code, separately for each
child. Over the school year, each child was observed for only 70 minutes,
on average. Still, her sampling plan let Parten use these percentage scores
as estimates of how much time each child devoted to a particular level of
social engagement during free play that year. In turn, this let her evaluate
hypotheses such as that older children would spend more time in associa-
tive and cooperative play than younger children.
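
To make the computation concrete, here is a minimal sketch in Python of how such percentage scores can be derived, first coding a child's samples and then summarizing them. The code names follow Figure 1.1, but the child's coded samples and the helper function percentage_scores are hypothetical illustrations of the arithmetic, not Parten's actual data or procedure.

from collections import Counter

PARTEN_CODES = ["Unoccupied", "Onlooker", "Solitary", "Parallel",
                "Associative", "Cooperative"]

def percentage_scores(samples):
    # Percentage of a child's coded 1-minute samples assigned each code.
    tally = Counter(samples)
    return {code: 100.0 * tally[code] / len(samples) for code in PARTEN_CODES}

# Hypothetical child: 70 one-minute samples coded over a school year.
child = (["Solitary"] * 20 + ["Parallel"] * 30 +
         ["Associative"] * 15 + ["Cooperative"] * 5)
print(percentage_scores(child))
# Parallel 42.9%, Solitary 28.6%, Associative 21.4%, Cooperative 7.1%

Scores like these, one set per child, are what enter any subsequent statistical analysis.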
However, her data do not let us ask how any of these play states were
sequenced in the stream of behavior (to use Roger Barker’s [1963] felicitous
phrase). We cannot determine, for example, whether Parallel often preceded
Associative and Associative often preceded Cooperative play, not because
Parten’s codes are not up to the task but because her recording method€–
coding daily, isolated 1-minute samples€– does not capture sequential infor-
mation. This is not a criticism of Parten€– her research questions did not
require examining moment-by-moment sequences of behavior. Instead,
our intent is to make the point that when sequential data are collected, not
just questions like Parten’s, but a whole other array of interesting questions
can be addressed.

Parten (1932)   Smith (1978)   Bakeman & Brownlee (1980)
(none)          (none)         Together (new code)
Unoccupied      Alone          Unoccupied
Onlooker        Alone          Unoccupied
Solitary        Alone          Solitary
Parallel        Parallel       Parallel
Associative     Group          Group
Cooperative     Group          Group

Figure 1.2. The evolution of three similar coding schemes for social participation
as discussed in the text (adapted from Bakeman & Gottman, 1997).

The second paradigmatic study is provided by Peter Smith (1978).
Parten’s study had established an association between age and social par-
ticipation: As children became older, they tended to participate more at
higher levels. As ordered in Figure 1.1, each code suggests a higher level of
participation than the one before it, so it is tempting to view her codes as
suggesting a developmental progression in which parallel activity is a stage
through which children pass as they develop from solitary to social group
players; that is, Parten’s coding scheme could be viewed as an ordinal scale
of social participation and not just a categorical one. Smith, however, sought
to test that notion of developmental progression directly. For our present
purpose – asking what makes a study sequential – his study is useful not
so much for what he found out as for the way his modification of Parten’s
method challenges our sense of what we mean by a sequential analysis.
Simplifying Smith’s (1978) methods some, he reduced Parten’s six codes
to three (see Figure 1.2). He wanted to test explicitly the idea that parallel
play is an intermediate stage in social development. As a result, there was no
need to distinguish between the presumed precursor stages of Unoccupied,
Onlooker, and Solitary. Consequently, he lumped these three into a single
code, Alone. Likewise, there was no need to distinguish between Associative
and Cooperative; he lumped these two into a single code, Group. Smith’s
recording method was similar to Parten’s: He used a sampling strategy to
code brief, isolated intervals for the forty-eight children in his study. From
these coded intervals Smith computed for each child the percentage of
samples assigned each code, separately for each of his study’s six successive
five-week periods. Then, the code with the highest percent score became
the code assigned to the entire five-week period. Examining these coded
sequences of six five-week periods, Smith reported that many children
moved directly from a five-week period in which Alone predominated to
one in which Group play did without an intervening period during which
Parallel play was most frequent. Note, however, that Smith’s results can
mask the fact that periods in which only Alone and Parallel occurred (but
Alone predominated) could be followed by periods in which only Parallel
and Group did (but Group predominated); by dividing time into shorter
periods, Alone-to-Parallel-to-Group transitions might have been revealed.
Smith’s question was sequential as was his analysis, although at one step
removed from most examples we give in this book. He used information
derived from nonsequential behavioral coding to then code much longer
segments of time (five weeks), whereas most examples we present in this
book€– and the sense in which we usually use the term sequential analysis€–
code moment-by-moment, event-by-event sequences of behavior.
The third paradigmatic study is Bakeman and Brownlee’s (1980) study
of parallel play. Parten seemed to suggest that parallel play characterized an
obligatory development phase, whereas Smith suggested the phase might
be optional. This discussion caused Bakeman and Brownlee to think that
the question itself might be misleading and that parallel play might better
be regarded not as a stage, but as a strategy – important because of how
it was positioned and functioned in the moment-by-moment stream of
children’s play behavior. Therefore, they posed what is clearly a question of
behavioral sequencing.
Like Smith, Bakeman and Brownlee (1980) modified Parten’s codes (see
Figure 1.2). They kept Parten’s and Smith’s Parallel, Parten’s Solitary, and
Smith’s lumped Group, but they lumped Parten’s Onlooker with Unoccupied
and created a distinct new code (Together) defined as essentially unoccu-
pied with a focus on others, but without the focus on objects or activities
required for Parallel and Group. Forty-three three-year-old children were
video-recorded for about 100 minutes each during free play over sev-
eral mornings of a three-week summer camp. Observers then viewed the
recordings and coded successive 15-second intervals using the scheme just
described.
Later we will have more to say about Bakeman and Brownlee’s (1980)
method of interval recording and will explain why we regard it as less
than optimal, but for now we will assume that their data provided a rea-
sonably accurate estimate of how the play states (the codes representing
levels of social participation) defined in Figure 1.2 were sequenced in
time for each child. Using techniques explained in Chapter 9, Bakeman
and Brownlee counted how often various play states followed each other
and then compared the actual counts to those expected by chance, based
on how often each type of play state occurred. Of particular interest was
the Parallel-to-Group transition, which Bakeman and Brownlee thought
should be especially frequent if parallel play serves as a bridge to group
play. By chance alone, observed values for this transition should exceed their
expected values for about half of the children. In fact, observed values for the
Parallel-to-Group transition exceeded chance for thirty-two of the forty-two
children (p < .01 per two-tailed sign test; see the section titled “The Sign Test:
A Nonparametric Alternative” in Chapter 11). Given this result, Bakeman
and Brownlee concluded that movement from parallel to group play may
be more a matter of moments than of months, and that parallel play may
indeed serve as a bridge – a brief interlude during which young children
“size up” those to whom they are proximal before deciding whether to
become involved in the group activity.
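
The arithmetic behind that sign test is simple enough to show. What follows is a minimal sketch in Python; the exact binomial computation is standard, but the function itself is our own illustration, not code from GSEQ or from the original study.

from math import comb

def sign_test_two_tailed(k, n):
    # Probability of k or more successes out of n when chance is .5,
    # doubled for a two-tailed test.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

# Thirty-two of forty-two children exceeded their chance values.
print(sign_test_two_tailed(32, 42))  # roughly .001, hence p < .01

Chapter 11 returns to this test as a nonparametric alternative for analyzing results pooled across individuals.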
This sequence of three studies helps define what we mean, not just by
observational methods but by sequential analysis. All employed observa-
tional measurement – that is, observers applied predefined coding schemes
to events observed either live or recorded. And all used the categorical data
recorded initially to compute various summary scores, which were then
analyzed further. Parten used her cross-sectional data to suggest develop-
mental progressions over years, Smith used his longitudinal data to suggest
developmental progressions over months, but of the three, only Bakeman
and Brownlee used continuously recorded data to suggest how behavior was
sequenced over moments. True, the sequential analytic methods described
in this book can be applied at larger time scales (as Smith demonstrates),
but almost always the time scales we have in mind involve seconds and
minutes – only rarely days, months, or years – because the more immediate
time scales seem appropriate for studying social processes that happen in
the moment.

summary
The topic of this book is systematic observation – observational methods,
generally – which is one of several useful methods for measuring behavior.
We say systematic to distinguish quantitative observational methods from
other approaches to understanding behavior that rely less on quantifica-
tion and more on qualitative description. Quantification is provided by
measurement, which usually is understood as the act of assigning numbers
or labels to things. Consequently, coding schemes, as detailed in the next
chapter, are central to observational methods. Measurement requires an
instrument; and as thermometers are to temperature, so coding schemes
applied by trained observers are to behavior. Measurement occurs when a
code is assigned to an entity, which in observational practice is typically a
behavioral event or a time interval within which such events may occur.
In popular use, sometimes observational is used as a synonym for cor-
relational, as opposed to experimental. In fact, as an approach to meas-
urement, scores produced by observational means are neither inherently
correlational nor experimental, and observational methods can be used in
either correlational or experimental studies. Dependent or outcome vari-
ables are more likely to be observational than predictor variables, how-
ever; and experimental independent variables, which require experimenter
manipulation, cannot be observational.
In observational research, a basic analytic unit is the session, defined by
a sequence of coded events for which continuity can generally be assumed.
Typically an observational study involves two steps. First, observers code
behavior for a session (which could represent a particular participant or
dyad or a participant at a particular age). Second, summary statistics are
derived from the session’s data; these serve as scores for the between-sub-
jects and any within-subjects variables of the research design and can be
analyzed with standard statistical techniques.
Compared to other measurement methods (e.g., direct physical meas-
urement or deriving a summary score from a set of rated items), obser-
vational measurement is often labor-intensive and time-consuming.
Nonetheless, observational measurement is often the method of choice
when nonverbal behavior specifically or nonverbal organisms generally are
studied; when more natural, spontaneous, “real-world” behavior is of inter-
est; and when processes, and not outcomes, are the focus (e.g., questions of
contingency).
Understanding process means looking at behavior in sequence as it
unfolds in time. When behavior is observed and coded continuously,
sequential data result. By sequential analysis we mean techniques of data
analysis that, as detailed in later chapters, capture pattern and contingency
in those sequences. Data at various time scales could be viewed sequen-
tially, but in this book we are primarily concerned with time scales on the
order of seconds and minutes, not days, months, or years, because, as just
noted, these seem more appropriate for studying social processes that hap-
pen in the moment.
2

Coding Schemes and Observational Measurement

As telescopes are for astronomy and microscopes for biology, so coding
schemes are for observational methods: They bring the phenomena of
interest into focus for systematic observation. However, unlike telescopes
and microscopes, which are clearly physical, coding schemes are primar-
ily conceptual. They consist of codes (i.e., names, labels, or categories) that
can be applied to whatever behavior is being studied. By naming behavior,
coding schemes limit the attention of our observers and state, in effect, that
the aspects of behavior identified by the codes are important for our inves-
tigation and are the aspects on which we should focus.

where do coding schemes come from?


In addition to being conceptual, coding schemes necessarily make theoret-
ical commitments. Implicit in any coding scheme is the understanding that
certain behaviors are important and certain distinctions are worth making.
Necessarily, coding schemes reflect the investigator’s theory about what is
important and why, even when investigators do not make the links between
theories and codes explicit. Bakeman and Gottman (1986, 1997) wrote that
using someone else’s coding scheme was like wearing someone else’s under-
wear. They used this attention-grabbing simile to make a point: Codes and
underlying theories need to connect. Borrow, or more typically adapt, cod-
ing schemes from others only when you share theories, underlying theoret-
ical orientations, and common research goals.
There are two basic approaches to developing coding schemes. They are
not mutually exclusive, and both can be – and often are – used together.
First, as just suggested, it makes sense to begin with coding schemes that
others with similar interests and theories have used and then to adapt them
to your specific research questions. Second, simply watch examples of the

behavior of interest repeatedly (video recordings help greatly) with the eye
of a qualitative researcher. Try to identify themes and name codes. Then try
to imagine how analysis of these codes will help you later when you attempt
to answer your research questions.
In any case, developing coding schemes is necessarily an iterative pro-
cess, a matter of repeated trial and error and successive refinement. Whether
you begin with coding schemes others have used or start de novo, you will
pilot-test each successive version against examples of behavior (here video
recordings help greatly). Such checking may reveal that codes that seemed
important initially simply do not occur, so you will remove them from your
list. It may also reveal that distinctions that seemed important theoretically
cannot be made reliably, in which case you will probably define a single,
lumped code that avoids the unreliable distinction. Or you may find ini-
tial codes that lump too much and miss important distinctions, in which
case you will split the initial codes and define new, more fine-grained ones.
Expect to spend many hours over many weeks; shortchanging the development of the
measuring instruments on which a research project rests can be perilous.

must codes be mutually exclusive and exhaustive?


As you plan your research and develop your coding schemes, it is essen-
tial to ask whether your proposed coding schemes are adequate. The most
important consideration concerns fit: Will analysis of these codes result in
clear answers to your research questions? The second most important con-
sideration – and the one we address in this section – concerns structure.
When coding schemes are well structured and clearly organized, both data
collection and data analysis are facilitated.
Structure is perhaps best illustrated by example. The previous chapter
presented three coding schemes for young children’s social participation
(see Figure 1.2) and an additional three schemes are presented here (see
Figure 2.1). The first scheme categorizes the daily activity of marine sci-
entists living in a space-station-like habitat 50 feet underwater (Bakeman
& Helmreich, 1975) and is typical of coding schemes that seek to describe
how individuals spend their day (time-budget information). The second
categorizes infant state (Wolff, 1966) in a way that has become standard in
the field. And the third, which reflects Parten’s (1932) influence, is the basis
of research by Adamson and her colleagues (e.g., Adamson & Bakeman,
1984; Adamson, Bakeman, & Deckner, 2004; Bakeman & Adamson, 1984).
In the research by Adamson and her colleagues, Object is coded when only
the infant is engaged with an object; Supported and Coordinated are coded
when caregiver and infant are both engaged with the same object; but
Supported is coded when the infant shows no awareness of the caregiver's
engagement, whereas Coordinated is coded when the infant does show such
awareness, often with glances to the caregiver.

Daily activity          Infant state   Engagement state

Doing scientific work   Quiet alert    Unoccupied
At leisure              Crying         Onlooker
Eating                  Fussy          Object
Habitat-maintenance     REM sleep      Person
Self-maintenance        Deep sleep     Supported
Asleep                                 Coordinated

Figure 2.1. Three coding schemes; each consists of a set of mutually exclusive and
exhaustive codes (see text for citations).

Each of these three coding schemes consists of a set of mutually exclu-
sive and exhaustive (ME&E) codes; that is, for every entity coded there
is one code in the set that applies (exhaustive), but only one (mutually
exclusive). These are desirable and easily achieved properties of coding
schemes. Organizing codes into ME&E sets often helps clarify our codes
when we are first defining and developing them and almost always simpli-
fies and facilitates subsequent recording and analysis of those codes. But
often research questions concern co-occurrence. So what should we do
with codes that can co-occur, like mother holding infant and mother look-
ing at infant?
Co-occurrence can be addressed and mutual exclusivity of codes within
sets achieved in one of two ways. First, when sets of codes are not mutu-
ally exclusive by definition, any set of codes can be made mutually exclu-
sive by defining new codes as combinations of existing codes. For example,
imagine that two codes are mother looks at infant and infant looks at mother.
Their co-occurrence could be defined as a third code, mutual gaze. Second,
codes that can co-occur can be assigned to different sets of codes, each of
which is mutually exclusive in itself. As a general rule, the coding schemes
defined for a given research project work best – from coding, to recording,
to analysis – when they consist of sets of coding schemes: Each set describes
an aspect of interest (e.g., mother’s gaze behavior, infant’s motor behavior).
For this example, two sets of codes could be defined. The first concerns
mother’s gaze and includes the code mother looks at infant, whereas the
second concerns infant’s gaze and includes the code infant looks at mother.
Mutual gaze, instead of being an explicit code, could then be determined
later analytically.
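To see how such a later analytic determination might work, consider a
minimal sketch (ours, in Python; the gaze streams are invented (onset,
offset) times in seconds, and the helper function is hypothetical):

    # Derive mutual gaze analytically as the overlap of two
    # independently coded gaze streams (hypothetical times, in seconds).
    mom_looks_at_kid = [(0.0, 4.2), (7.5, 12.0), (15.3, 20.0)]
    kid_looks_at_mom = [(3.0, 5.1), (11.0, 16.4)]

    def overlaps(a, b):
        """Return the spans during which codes from a and b co-occur."""
        result = []
        for a_on, a_off in a:
            for b_on, b_off in b:
                on, off = max(a_on, b_on), min(a_off, b_off)
                if on < off:  # a genuine co-occurrence
                    result.append((on, off))
        return result

    mutual_gaze = overlaps(mom_looks_at_kid, kid_looks_at_mom)
    print(mutual_gaze)  # [(3.0, 4.2), (11.0, 12.0), (15.3, 16.4)]
    print(sum(off - on for on, off in mutual_gaze))  # total time, 3.3 s

Overlap logic of this kind is all that is needed to recover a co-occurrence,
like mutual gaze, that was never coded explicitly.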
Exhaustiveness is likewise easy to achieve. Any set of codes can be made
exhaustive by adding a nil or none-of-the-above code. Thus
mother and infant looking behavior could be coded using a single set of
four codes (mother looks at infant, infant looks at mother, mutual gaze, or
none) or two sets of two codes (mother looks at infant or none, and infant
looks at mother or none). But which of these two coding strategies is prefer-
able? More generally, is it better to define fewer sets of codes – which means
some of those codes may represent combinations? Or is it better to define
more sets – which means those sets need contain few, if any, combination
codes? Note, for this example, the two strategies differ in the number of sets
of codes, but the total number of codes is the same for both strategies.
The matter of more versus fewer sets of codes mainly affects coding. It
does not affect analysis because similar information can be derived from the
coded data in either case. The choice may depend on personal preference;
but, especially when working with video records, there may be advantages
to more versus fewer sets because coders can make several passes and attend
to one set of codes on each pass (e.g., first mother then infant). Moreover,
different coders can be assigned different sets – which gives greater cred-
ibility to any patterns we detect later between codes in different sets. If a
single observer uses the four-code strategy or codes both mother and infant
gaze, a skeptic might claim that mutual gaze was as much in the head of the
observer as in the stream of events; when mother and infant gaze are coded
independently by different observers, this claim is more difficult to make.
Still, a good rule may be to choose whichever strategy your observers find
easier to work with (and can do reliably).
The hierarchical rule or some variant of it is yet another way to make the
codes in a set mutually exclusive, and is useful when codes can be ordered
in some way. When more than one code could apply to an event, obser-
vers are instructed to select the highest code that applies. Consider Belsky
and Most’s (1981) study of exploratory object play in infants. They defined
twelve codes, with each characterizing a more advanced level of infant
object play than the one before it; for example, the lowest five levels in order
were Mouthing, Simple manipulation (e.g., banging, shaking), Functional
(appropriate action such as turning a dial), Relational (simply bringing
together two items), and Functional-relational (bringing together two items
as appropriate for the items – e.g., setting a cup on a saucer); a higher level
was Pretend. Observers viewed records of 7½- to 21-month-old infants
and characterized 10-second intervals according to the highest level of
object play during the interval. Even when a lower-level code applied (e.g.,
Functional-relational), the highest-level behavior present (e.g., Pretend, as
bringing cup to doll's mouth) was assigned to the interval. Thus the hier-
archical rule makes the codes mutually exclusive.

Mother encourages attention to:   Infant looking at:

Herself                           Mother's face, face-to-face interaction
Another person                    Mother, not face-to-face interaction
Object or event                   Another person
None of the above                 Object
                                  Unclear

Figure 2.2. A coding scheme consisting of two sets of mutually exclusive and
exhaustive codes (Cote et al., 2008).

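Stated computationally, the hierarchical rule amounts to taking a maximum
over an ordered list. A minimal sketch (ours, in Python; the abridged code
names follow Belsky and Most, 1981, but the list and function are
hypothetical):

    # The hierarchical rule: of the codes observed in an interval,
    # keep only the highest in a predefined ordering (lowest first).
    HIERARCHY = ["Mouthing", "SimpleManipulation", "Functional",
                 "Relational", "FunctionalRelational", "Pretend"]

    def highest_code(observed):
        """Return the highest-ranked code among those observed."""
        return max(observed, key=HIERARCHY.index)

    print(highest_code({"Mouthing", "FunctionalRelational", "Pretend"}))
    # prints: Pretend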
If codes are as simple as mother and infant looking at each other, a four-
code scheme with a combination and a nil code might be fine; but when
more looking categories are considered, two separate ME&E schemes prob-
ably make more sense. Marc Bornstein’s work provides an example. He and
his colleagues were interested in documenting cultural variations in person-
and object-directed interaction (e.g., Cote, Bornstein, Haynes, & Bakeman,
2008). Observers coded continuously from video recordings using the two
ME&E sets of codes shown in Figure 2.2 and a computer-assisted system
that recorded times. Indices of contingency were then computed using the
computer program and procedures described later (see “Indices of associ-
ation for two-dimensional tables” in Chapter 9). Analysis of these indices
led Cote and colleagues to conclude that mothers were significantly more
responsive than infants to their partner’s person-directed behavior in each
of the three cultural groups studied, but that European American mothers
were significantly more responsive to their infants’ person-directed behav-
ior than Latina immigrant mothers, while neither group differed from non-
immigrant Latina mothers.
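Such indices are the topic of Chapter 9; purely by way of illustration, here
is a sketch (ours, with invented tallies) of one common index of association
for a 2 × 2 table, the odds ratio:

    # Illustration only: an odds ratio as an index of association for
    # a 2 x 2 contingency table. Rows: infant person-directed behavior
    # present or absent; columns: mother responds or does not.
    # All tallies are hypothetical.
    a, b = 40, 20   # behavior present: mother responds / does not
    c, d = 30, 60   # behavior absent:  mother responds / does not

    odds_ratio = (a * d) / (b * c)
    print(odds_ratio)  # 4.0: the odds that the mother responds are
                       # 4 times greater when the behavior just occurred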
Should all your codes be segregated into ME&E sets, each represent-
ing some coherent dimension important for your research as illustrated in
Figure€2.2? The answer is: Not always. Imagine, for example, that your codes
identify five types of events that are central to your research questions and
that any and all can co-occur. Should you define five sets, each with two
codes – the behavior of interest and its absence? Or would it simply be better
to list the five codes and ask observers to note their occurrence? If you wanted
duration information, you could record onset and offset times for each.
Either strategy offers the same analytic options; thus, which you choose is a
matter of personal preference. As with the fewer-versus-more combination
codes question in an earlier paragraph, a good rule is to choose whichever
strategy your observers find easier to work with (and can do reliably). Thus
a brief answer to the question posed by this section (Must codes be mutually
exclusive and exhaustive?) might be: Often yes, but not necessarily.
Even when codes are mutually exclusive, breaking them into smaller sets
can simplify coding. For example, when coding similar, mutually exclusive
actions by different actors, we could define a separate code for each actor-
action combination (MomLooks, KidLooks, SibLooks, etc.), but it is simpler
to define two sets of codes, one for actors (Mom, Kid, Sib, etc.) and another
for actions (Looks, Talks, etc.). This results in fewer, less redundant codes
overall. And even when different actors perform different actions, organ-
izing them into separate sets still has the advantage of focusing observer
attention first on one actor and then on the other. The codes in the separate
sets could be, but do not need to be, ME&E. An example is provided in
Ueno and Matsuzawa's (2004) study of food transfer between chimpanzee
mothers and infants (see Figure 2.3). True, several of the thirteen infant
codes might co-occur never or rarely, but they are not designed to be mutu-
ally exclusive (although they could be with a hierarchical rule); likewise
with the eight mother codes. Here is an example when the answer to the
question of whether codes must be ME&E is: Not necessarily.

Mother code                Infant code

Show empty hand            Approach
Allow infant's attempts    Inspect
Reject infant's attempts   Attempt to take food
Drive off infant           Extend hands
Conceal food               Try to mouth food
Keep away food             Grasp mother
Poke infant                Try to open mother's palm
Offer food                 Point at mother's food
                           Show facial expression
                           Vocalize
                           Receive offered food
                           Ignore offered food
                           Scrounge

Figure 2.3. Codes for chimpanzee mother and infant food transfer (for definitions
see Ueno & Matsuzawa, 2004).

granularity: micro to macro


Codes vary considerably in their level of granularity or specificity. Some
capture minute details, whereas others paint with broad strokes. They can
vary from micro to macro (or molecular to molar) – from detailed and
fine-grained to relatively broad and coarse-grained. The
appropriate level of granularity for a research project depends on the ques-
tions addressed. For example, if you are more interested in moment-to-
moment changes in expressed emotion than in global emotional state, you
might opt to use a fine-grained scheme like the facial action coding system
developed by Paul Ekman (Ekman & Friesen, 1978), which relates different
facial movements to their underlying muscles.
One useful guideline is that, when in doubt, you should define codes at a
somewhat finer level of granularity than your research questions require (i.e.,
when in doubt, split, do not lump). You can always analytically lump later,
but to state the obvious, you cannot recover distinctions never made (Suomi,
1979, pp. 121–122). Another useful guideline is that your codes’ granularity
should be roughly similar. Usually questions that guide a research project
are of a piece and require either more molar or more molecular codes. Mix
levels only if your research questions clearly require it.

concreteness: physically to socially based codes


Codes also vary in their degree of concreteness. Some are physically palp-
able, whereas others seemingly carve thin air with abstraction – and their
relative popularity seems to vary decade by decade. For example, speaking
of the mid-twentieth century, Altmann (1974, p. 252) noted a trend to use
interpretive categories such as seeks attention, rather than relatively nonin-
terpretive motor patterns, such as hits. Bakeman and Gottman (1986, 1997)
suggested an ordered continuum of coding schemes with one end anchored
by physically based schemes with codes like hits and the other by socially
based ones with codes like seeks attention.
Even this may be too simple. Seemingly physically based codes can be
defined by either their morphology or their function (e.g., Bekoff, 1979).
For example, Approach could be defined morphologically by specifying
the movements involved (by legs, arms, wings, etc.) or functionally by its
consequences (by proximity established, no matter how accomplished).
Functional definitions are usually more general or macro than morpho-
logical ones; and, although the two definitions of Approach suggested here
seem physically based, functional codes may lie toward the middle of the
proposed physical-social continuum.
Nonetheless, more physically based codes generally reflect attributes
that are easily seen, whereas more socially based codes rely on abstrac-
tions and require some inference. Another example of a physically based
code might be infant crying; another example of a more socially based code
might be child engaged in cooperative play. Some ethologists and behavior-
ists might regard the former as objective and the latter subjective (and so
less scientific), but the physically based–socially based distinction may mat-
ter primarily when selecting and training observers. Are observers detec-
tors of things “really” there? Or are they more like cultural informants,
able through experience to “see” the distinctions embodied in our coding
schemes? Perhaps the most important consideration is whether observers
can be trained to apply coding schemes reliably (see Chapters 5 and 6) – no
matter how concrete the coding scheme.

Infant vocalization                     Maternal response

Quasi-resonant vocalizations            Naming
Fully resonant vowels                   Questions
Marginal syllables                      Acknowledgments
Canonical syllables                     Imitations
Babble (repeated canonical syllables)   Attributions
Other (cry, laugh, vegetative sounds)   Directives
                                        Play vocalizations

Figure 2.4. Examples of coding schemes, one more physically based (infant; Oller,
2000) and one more socially based (maternal; Gros-Louis et al., 2006).

Ekman’s facial action coding system (Ekman & Friesen, 1978), cited in
the previous section as an example of a molecular approach, also provides
an excellent example of a concrete coding scheme. Yet another example of a
concrete coding scheme is provided by Oller (2000), who categorizes young
infants’ vocalization as shown in Figure 2.4 (left). Oller and his colleagues
provide precise, acoustically based definitions that distinguish between, for
example, quasi-resonant vowel-like vocalizations, fully resonant vowels,
marginal syllables, and canonical syllables; and like Belsky and Most's
(1981) infant object play codes, Oller’s first five codes describe a develop-
mental progression.
Both Ekman’s and Oller’s coding schemes are sufficiently concrete that
we can imagine their coding might be automated. Computer coding – dis-
pensing with human observers – has tantalized investigators for some time,
but remains mainly out of reach. True, computer scientists are attempting
to automate the process – and some success has been achieved with auto-
matic computer detection of Ekman-like facial action patterns (Cohn &
Kanade, 2007; Cohn & Sayette, 2010; Messinger, Mahoor, Chow, & Cohn,
2009) – but it still seems that as codes become more socially based, any
kind of computer automation becomes more elusive. As an example,
consider the coding scheme in Figure 2.4 (right) used to code maternal
responses to infant vocalizations (Gros-Louis, West, Goldstein, & King,
2006). It is difficult to see how these maternal response codes, or other simi-
lar socially based codes, could be automated. For the foreseeable future at
least, a human coder – a perceiver – will likely remain an essential part of
behavioral observation, at least when codes are more socially based than
physically based.
Still, even if we underestimate the potential of automatic coding, another
possibility is worth mentioning: the integration of observer’s categorical
coding with continuous data streams produced more automatically. This
is exemplified by Oller and Buder’s work integrating infants’ observed
behavior with aspects of the acoustic signal (e.g., Warlaumont, Oller, Buder,
Dale, & Kozma, 2010).
The two sets of codes in Figure 2.4 demonstrate a point we have made
repeatedly in this chapter – which is the usefulness of defining sets of ME&E
codes with each set representing an important dimension of the research.
For example, after applying these codes, Gros-Louis et al. (2006) were able
to report that mothers responded mainly with (1) acknowledgments to both
vowel-like sounds and consonant-vowel clusters, (2) play vocalizations to
vowel-like vocalizations significantly more than to consonant-vowel clus-
ters, and (3) imitations to consonant-vowel clusters more than to vowel-like
sounds. Demonstrations of such contingency are important theoretically
and represent one of the strengths of observational methods generally, as
we detail in later chapters.

codes versus rating scales


Most examples of codes we have presented in this chapter have been nomi-
nal-scale categories for which order is arbitrary. However, Belsky and Most’s
codes were ordinal, Parten’s can be viewed as ordinal, and Oller’s were
mixed (the first five infant vocalization codes were ordinal, but the sixth
was a miscellaneous category). Another possibility is to use, not nominal
codes, but conventional rating scales. The entity rated could be an event,
a time interval (e.g., successive 15-second intervals), or an entire session.
One or more items could be rated using 1-to-5 scales (or however many
scale points seem useful), appropriately anchored.
Often descriptions of observational methods assume categorical meas-
urement and ignore rating scales (e.g., Bakeman & Gottman, 1997), but
rating scales can prove useful in at least two ways. First, rating scales can tap
socially based dimensions in ways that make use of raters’ judgments as cul-
tural informants even more so than do behavioral codes, which necessarily
are more tied to visible behavior. True, the items rated could be quite con-
crete, but given the nature of rating scales, they are more likely to be socially
based – for example, asking an observer to rate how happy a child was, 1
to 7, during a 1-minute period of interaction instead of coding how often a
child smiled or laughed during the same interval. Second, rating can be less
time-consuming than coding. Comparable codes and ratings may remain
at the same level of granularity, but the entities rated (e.g., 1-minute inter-
vals or whole session) can be longer than the entity coded (events) – which
requires fewer judgments from the observers and hence less time.
An example is provided by the work of Adamson and her colleagues.
First, observers coded video records of 108 children during structured-play
sessions (Adamson et al., 2004). The coding scheme was an extension of
the engagement scheme shown in Figure 2.1 that – as appropriate for chil-
dren learning language – included a symbol-infused joint engagement code
(joint engagement that incorporates use of symbols such as words). Second,
observers rated 1 to 7 the amount of total joint engagement and the amount
of supported, coordinated, and symbol-infused joint engagement in the six
5-minute segments that constituted the session. Mean ratings correlated
strongly with percentages derived from the coded data (.75, .67, .86, and
.89 for total, supported, coordinated, and symbol-infused joint engagement,
respectively), but rating took considerably less time than coding. Their
comparison demonstrates that when process questions are not at issue and
summary measures of outcome are sufficient to address research questions,
rating – instead of coding – the same behavior can provide the same quality
results with much less effort and for that reason is well worth considering.

the coding manual


Examples of codes presented in this chapter have relied mainly on a single
word or phrase. For illustrative purposes, at least in the context of this chap-
ter, we assume these words or phrases have been sufficient to convey in
general terms what the code means. In the context of actual research, how-
ever, simple words or phrases would not be sufficient. A more extensive,
well-organized, and carefully drafted coding manual is needed and is an
essential part of any observational research project.
The coding manual explains the structure of the coding scheme – for
example, are codes organized into ME&E sets and, if so, what is the ration-
ale for each set? It provides names and clearly stated definitions for each
code along with examples. The definitions are similar to those found in a
dictionary and may be relatively short. But often they are supplemented
with more elaborate and extended definitions that stress similarities and
differences between the particular code and other codes with which it could
be confused. Examples of the behavior to which the codes apply are help-
ful and might consist of verbal descriptions, pictures or other graphic aids,
sound or video recordings, or some combination of these.
The coding manual also explains any special coding rules (e.g., only
engagement states that last at least 3 seconds are to be coded). Like devel-
oping a coding scheme, drafting a coding manual is an iterative process.
Ideally the two processes occur in tandem; the coding manual should be
drafted as the coding schemes are evolving. Once completed – and with
the understanding that additional refinements remain a possibility – the
coding manual stands as a reference for training new coders and an anchor
against observer drift (i.e., any change over time in observers’ implicit defi-
nitions of codes). It also documents procedures in ways that can be shared
with other researchers. For all these reasons the coding manual is central
and essential to observational research and deserves considerable care and
attention. Published reports should routinely note that copies are avail-
able on request. (For further comments concerning coding manuals, see
Yoder & Symons, 2010, chapter 3.)
Research articles often abstract the coding manual, in whole or in part –
which is helpful, if not essential, for readers. For example, Dian Fossey
(1972) provided extensive definitions for nine types of mountain gorilla
vocalizations. Three of her definitions are given in Figure 2.5 and show the
sort of detailed definition desirable in a coding manual.
Codes – that is, the names or labels you select to identify your codes –
have many uses. They appear in coding manuals, may be entered on data-
recording forms or used by computer-assisted coding programs, and appear
again as variable names in spreadsheet programs and statistical packages.
It is worth forming them with care. They should be consistent; for example,
do not use jt as an abbreviation for joint in one code and jnt in another. As
a general rule, briefer, mnemonic names are better; codes longer than eight
or so characters can clutter the screen or page. Not all programs treat codes
as case-sensitive, but using upper- and lowercase letters is often helpful.
Underscores distract the eye and are best avoided (they are a holdover from
a time when many computer programs did not allow embedded blanks
in, e.g., file names). Thus MomLook is a better choice than Mom_Look;
both use upper- and lowercase letters, but the first is shorter and avoids an
underscore. Names are a matter of taste, so some variability is expected.
Nonetheless, a good test is whether codes are readily understood by others
at first glance or seem dense and idiosyncratic.

Code     Definition

Roar     Monosyllabic loud outburst of low-pitched harsh
         sound, lasting from .20 to .65 s, beginning and
         ending abruptly. Individual differences in frequency
         concentrations. Heard only from silverbacks in
         situations of stress or threat, and primarily directed at
         human beings, although occasionally at buffalo herds.
         Always followed, on the part of the emitter, with
         varying degrees of display, ranging from bluff charges
         to small forward lunges.

Scream   Shrill and prolonged emission of extremely loud
         sound, lasting up to 2.13 s and repeated as
         often as 10 times. Individual differences not denoted.
         Screams heard from all age and sex classes, but
         most frequently from silverbacks. Vocalization heard
         most often during intragroup disputes, though could
         be directed toward human beings or ravens if alarm
         rather than threat was motivation for call.

Wraagh   Explosive monosyllabic loud vocal outburst not as
         deep as a roar nor as shrill as a scream. Began and
         ended abruptly and lasted between .2 and .8 s.
         Individual differences in sound, which were more
         harmonically structured than roars. Heard from all
         adults but most frequently from silverbacks. Usually
         precipitated by sudden situations of stress – the
         unexpected arrival of an observer, etc. Most effective
         in scattering group members and never accompanied
         by aggressive display behavior.

Figure 2.5. Definitions for three types of mountain gorilla vocalizations
(Fossey, 1972).

summary
Coding schemes are the primary measuring instrument of observational
methods. Like the lens in telescopes and microscopes, they both limit and
focus observers’ attention. Coding schemes consist of lists of names or cat-
egories (or, less often, ratings) that observers then assign to the observed
behavior. Often codes are organized into mutually exclusive (only one code
applies to each entity coded) and exhaustive (some code applies to every
entity coded) sets (ME&E) – a practice we recommend because of the clar-
ity it provides.
Coding schemes can be adapted from others with similar theoretical con-
cerns and assumptions or developed de novo; in either case, development
is necessarily an iterative process and well worth the time it takes. Coding
schemes necessarily reflect underlying theoretical assumptions and clarity
results when links between theory and codes are made explicit. Several
examples of coding schemes have been given throughout this chapter. They
varied in degree of granularity from fairly molecular to quite molar, from
finer-grained to coarser-grained. They also varied in degree of concreteness
from clearly physically based to socially based codes for which observers
can be regarded more as social informants than simple detectors.
The coding manual is the end product of coding scheme development
and an essential component of observational research. It provides defini-
tions and examples for codes, details how coding schemes are organized,
and explains various coding rules. It documents procedures, is essential for
observer training, and can be shared with other researchers who want to
replicate or adapt similar procedures.
3

Recording Observational Data

In the previous chapter, we discussed coding schemes and gave several
examples. We emphasized that coding schemes are primarily conceptual,
rooted in and reflecting theoretical assumptions; and, although we regard
coding schemes as instruments that focus observers’ attention, they are not
physical in the sense that microscopes are. In contrast, applying those coding
schemes to the passing stream of behavior – transforming observed behav-
ior into data – combines physical and conceptual components. Recording
observational data requires physical materials ranging from simple paper
and pencil to sophisticated video and computer systems. It also requires
selection of an appropriate recording strategy. This chapter addresses both
the strategies and materials used to record observational data.
Recording strategies provide rules that observers must follow when
applying coding schemes to the stream of behavior; they serve both meas-
urement and control. For experimental variables, control means variable
manipulation or neutralization – for example, via random assignment. For
observational variables, control is exerted by requiring observers to adhere
to specific recording rules as detailed in this chapter.
Observers assign codes to things – that is, to some entity. As noted in
Chapter 1, in observational practice, that entity is typically a behavioral
event or a time interval within which events can occur. Accordingly, record-
ing strategies are of two kinds: event recording – for which observers assign
codes to particular events; and interval recording – for which observers
assign codes to specified time intervals. Additionally, when coding events,
observers may or may not record those events’ durations: We call the strat-
egy untimed-event recording when duration is not recorded and timed-event
recording when it is (usually by coding onset and offset times). Moreover,
intervals may or may not be contiguous: We call the strategy interval recording
(often called time sampling) when they are contiguous and selected-interval
recording when they are not. These four strategies (see Figure 3.1) are dis-
cussed in the next four sections. In the remainder of the book, we explore
the first three (untimed-event, timed-event, and interval recording) further
because, granted assumptions, they produce data appropriate for sequen-
tial analysis.

Codes assigned to:   Attribute                   Recording strategy

Behavioral event     Duration recorded? No       Untimed-event
                     Duration recorded? Yes      Timed-event
Time interval        Intervals contiguous? Yes   Interval
                     Intervals contiguous? No    Selected-interval

Figure 3.1. Recording strategies described in the text.

Classifications are not so much perfect as useful, and we find that this
division of recording strategies into four kinds generally seems to describe
what investigators do. It also agrees reasonably with other authors, who
may use somewhat different terms but generally make the same distinc-
tions. For example, Martin and Bateson (2007) define two basic types of
recording rules: continuous recording and time sampling. Their continuous
recording – like our timed-event recording – “aims to provide an exact
and faithful record of the behavior, measuring true frequencies and dura-
tions”; and for their time sampling (a term with a long history in the field, as
detailed in the “Interval recording” section later in this chapter) – as for our
interval recording – “the observation session is divided up into successive,
short periods of time” (p. 51). Similarly, Yoder and Symons (2010) define
three kinds of behavior sampling: continuous behavior sampling, intermit-
tent behavior sampling, and interval sampling. Their continuous behavior
sampling – like our event recording – is divided into two types: untimed-
event and timed-event. Their interval sampling is the same as our interval
recording. And their intermittent sampling is similar to our selected-inter-
val recording.
Another example where authors make similar distinctions but use differ-
ent terms was given in Chapter 1. We use the term event generally, both for
behaviors that are relatively instantaneous and those that have appreciable
duration – but then use the term momentary event to identify relatively
brief behaviors for which duration is not of interest (see “Timed-event and
state sequences” in Chapter 4), whereas Altmann (1974) reserved the term
event for what we call momentary events and the term state for behaviors
that have appreciable duration.

untimed-event recording
Detecting events as they occur in the stream of behavior and coding one or
more of their aspects, but not recording their duration – which is what we
usually mean by untimed-event recording€– seems deceptively simple, but
still places demands on the observer. In such cases, observers are asked not
just to code the events, but to detect them – which requires that they be con-
tinuously alert, note when a new event first begins and then ends, and only
then assign a code (or codes) to it. If the events to be coded were demar-
cated beforehand (e.g., as utterances or turns-of-talk in a transcript), the
observer's task would be less demanding, but such cases are relatively rare.
One merit of untimed-event recording is how well it works with mate-
rials no more complex than paper and pencil. A lined paper tablet, with
columns representing codes and rows successive events, can be helpful;
see Figure 3.2 for an example of what an untimed-event recording form
could look like. When recording untimed events, codes are necessarily
organized as one or more ME&E sets. The sample form shows two sets:
the first codes statement type as command, declaration, or question, and
the second codes whether the statement was accompanied by a look, a ges-
ture, both, or neither. Each successive event is checked for one of the state-
ment types and for one of the look-gesture codes. For this example, events
are cross-classified by statement and gesture-look and their counts could
be summarized with a contingency table. With only one ME&E set, the
data would be represented with a single string of codes – a format that is
both common in the literature and that has attracted considerable analytic
attention as we will see in subsequent chapters. For example, the sequence
for the type of statement codes in Figure 3.2 would be: declr, declr, quest,
comnd, comnd, and quest.
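As a minimal sketch (ours, in Python), even a single string of codes like
this already supports the counts, proportions, and transition tallies
discussed in the next paragraph and in later chapters:

    # Summaries derivable from untimed-event data: counts,
    # proportions, and transition (adjacent-pair) tallies.
    from collections import Counter

    sequence = ["declr", "declr", "quest", "comnd", "comnd", "quest"]

    counts = Counter(sequence)
    proportions = {code: n / len(sequence) for code, n in counts.items()}
    transitions = Counter(zip(sequence, sequence[1:]))

    print(counts)       # Counter({'declr': 2, 'quest': 2, 'comnd': 2})
    print(proportions)  # each code's share of all events
    print(transitions)  # e.g., ('declr', 'quest') occurred once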
Untimed-event recording is simple and inexpensive to implement. The
disadvantage is that knowing only the sequence of events, but not their dur-
ation, limits the kinds of information that can be derived from the data.
You can report how many times each code was assigned, the proportion of
events for each code, how the codes were sequenced, and how many times
codes from different ME&E sets co-occurred if more than one ME&E set
was used, but you cannot report much more – although you could report
rates (i.e., how often each code was used per unit of time) if you know the
duration of the session (either because you arbitrarily limited it to a fixed
length or recorded its start and stop times). However, if your research
questions require nothing more than information about frequency and
sequence – and possibly cross-classification – then the simplicity and low
cost of continuous untimed-event recording could cause you to adopt this
approach. One final simplification is possible. If sequence is not of interest,
you could dispense with the rows and simply add a tally to the appropri-
ate column when an event occurred – but then you would no longer have
sequential data.

[Form not reproduced: one row per successive event (1–6), with check-mark
columns for statement type (comnd, declr, quest) and for accompanied by
(look, gest, both, none); one check per set for each event.]
Figure 3.2. A paper form for untimed-event recording with two sets of ME&E
codes.

timed-event recording
Often duration matters. You may want to know, for example, not just how
often the mother soothed her infant (i.e., how many times), but the percent-
age of time she spent soothing during the session, or even the percentage
of time her infant cried during the time she spent soothing. In such cases,
you need to record not just that events occurred, but how long they lasted.
In general, this is the approach we recommend because of the richness of
the data collected and the analytic options such data afford. As we noted
earlier in this chapter, Martin and Bateson (2007, p. 51) wrote that timed-
event recording provides “an exact and faithful record” of behavior. Not sur-
prisingly, increased options have their price. Recording duration, usually by
noting the onset and offset times for events, increases the burden on obser-
vers, requires more expensive and complex recording equipment, or both –
and typically there is a trade-off involved. True, the burden on observers
decreases substantially with computer-assisted recording systems, but such
systems take more to acquire and maintain than paper and pencil.
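To make the analytic payoff concrete, here is a minimal sketch (ours, in
Python; all times are invented) of the soothing-and-crying percentages just
mentioned, computed from (onset, offset) records in seconds:

    # Duration statistics from timed-event data (hypothetical times).
    session_start, session_end = 0.0, 300.0    # a 5-minute session
    soothe = [(10.0, 55.0), (120.0, 180.0)]    # mother soothes
    cry = [(5.0, 30.0), (150.0, 200.0)]        # infant cries

    def total(spans):
        return sum(off - on for on, off in spans)

    def overlap(a, b):
        return sum(max(0.0, min(a_off, b_off) - max(a_on, b_on))
                   for a_on, a_off in a for b_on, b_off in b)

    print(total(soothe) / (session_end - session_start))
    # 0.35: proportion of the session spent soothing
    print(overlap(cry, soothe) / total(soothe))
    # about 0.48: proportion of soothing time during which infant cried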
Timed-event recording does not absolutely require computer technology,
but works best with it. Nor does it absolutely require that observers work with
recorded materials, but it is considerably more difficult to do live than either
untimed-event recording or the interval recording described in the next sec-
tion. Consider what a paper-and-pencil method for timed-event recording
might look like when observers are asked to code mother-and-child behav-
ior using a form like that shown in Figure 3.3. Working with video record-
ings – and assuming that time is displayed on the screen – observers would
play the video until an event included in their list of codes occurred. They
would then stop the video, probably replay the video backward and forward
to home in on the exact onset and offset times, and then enter those times
along with the appropriate code on the form. Such an approach is certainly
possible, and has been used; but it can be tedious and error-prone.

Event   Onset time   Offset time   Mom code   Kid code   Comment

1
2
3
4
5
6

Figure 3.3. A paper form for timed-event recording.

Some simplification is possible. For example, when all codes are assigned
to one or more ME&E sets, only onset times need to be recorded, because
the onset of a code necessarily implies the offset of any previous code from
the same set. If durations were relatively long, even live coding using paper
and pencil (which requires the observer to first look at a clock or other
timing device and then write the time) might be feasible. However, when
durations are often short, codes are many, or both, or when it is undesirable
to look away while recording live, some sort of automatic recording of times
is highly desirable, if not essential. In a later section of this chapter (“Digital
recording and computer-assisted coding”), we describe the sorts of capabil-
ities computer-assisted coding programs can offer.
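The onset-only simplification is easy to state as code. In this minimal
sketch (ours, in Python, with invented engagement-state data), each onset
also serves as the offset of the previous code in the same ME&E set:

    # Within an ME&E set, onsets alone suffice: each onset is also
    # the offset of the previous code (hypothetical data, in seconds).
    onsets = [(0.0, "Unoccupied"), (12.5, "Onlooker"), (40.0, "Object")]
    session_end = 60.0

    spans = [(on, next_on, code)
             for (on, code), (next_on, _)
             in zip(onsets, onsets[1:] + [(session_end, None)])]

    print(spans)
    # [(0.0, 12.5, 'Unoccupied'), (12.5, 40.0, 'Onlooker'),
    #  (40.0, 60.0, 'Object')]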

interval recording
Like untimed-event recording, interval recording is relatively easy and inex-
pensive to implement and, perhaps for this reason, has been much used in the
past. At the same time, it is associated with a more complex terminology and
series of choices than either untimed-event or timed-event recording and,
beginning in the 1920s, has spawned a rather large, sometimes arcane litera-
ture that seems increasingly dated. Typically it is referred to as time sampling
(e.g., Arrington, 1943; Hutt & Hutt, 1970). It has often seemed something
of a compromise; it is easier to implement but less exact than timed-event
recording. As Martin and Bateson (2007) wrote, “Less information is pre-
served and an exact record of behavior is not necessarily obtained” (p. 51).
And even though 34 percent of all articles published in Child Development in
the 1980s used interval recording (Mann, Ten Have, Plunkett, & Meisels, 1991;
cited in Yoder & Symons, 2010), it is our belief that if timed-event recording
had been easily available, many of those investigators would have preferred
it (certainly this was true for Bakeman & Brownlee, 1980).
The essence of interval recording is this: The stream of behavior is seg-
mented into relatively brief, fixed time intervals (e.g., 10 or 15 seconds in
duration), and one or more codes are assigned to each successive inter-
val. Unlike untimed-event and timed-event recording, which require that
observers pay attention to events and their boundaries, interval recording
requires that observers pay attention to time boundaries, which could be
noted with simple clicks (like a metronome, only slower). Konner’s (1976)
study of the !Kung cited in Chapter 1 is an example. Recall that an electronic
device delivered a click to his ear every 15 seconds and that he then checked
which of several mother, infant, adult, and child behaviors had occurred
since the last click. As this example suggests, interval recording can be
effected with simple and inexpensive materials. Some timing device that
demarcates intervals is needed. Otherwise, paper and pencil suffice. Lined
paper tablets are helpful – each row can represent a successive time inter-
val and each column a specific code. Figure 3.4 gives an example of what
an interval recording form might look like. The codes used as examples
here are selected from a re-analysis of Konner’s data (Bakeman, Adamson,
Konner, & Barr, 1990) and are defined as infant manipulates object, explores
object, relates two objects, and vocalizes; and mother vocalizes to, encourages,
and entertains her infant.
However, interval recording (or time sampling, which is the more com-
mon term in the literature) is more complex than the example just pre-
sented suggests. Three kinds of sampling strategies are usually identified:
partial-interval (or one-zero), momentary (or instantaneous), and whole-
interval sampling (Powell, Martindale, & Kulp, 1975; Quera, 1990; Suen &
Ary, 1989; Yoder & Symons, 2010). The recording rules for each are some-
what different, as detailed in subsequent paragraphs.
[Form not reproduced: one row per successive interval (1–6), with check-
mark columns for infant codes (Manip, Explr, Relate, IVoc) and mother
codes (MVoc, MEnc, MEnt); each code that occurred during an interval
is checked.]
Figure 3.4. A paper form for interval recording.

Partial-Interval or One-Zero Sampling


Partial-interval or one-zero sampling has a long history, dating back to the
1920s (see Altmann, 1974), and is probably the most common of the three
sampling strategies. Its recording rule is: Check an interval if the behavior
indicated by the code occurred at some point, once or more, during the
interval. This is the rule Konner used. Like untimed-event and timed-event
recording, it requires that observers be continuously alert (although once an
interval is checked for a code, they can stop looking for additional instances
of that behavior). It is called one-zero sampling because it requires obser-
vers to apply a simple binary code – that is, the behavior occurred (1 or yes)
or did not (0 or no).

Momentary or Instantaneous Sampling


Momentary sampling (or instantaneous sampling or point sampling) is
probably the second most common of the three sampling strategies. Its
recording rule is: Check an interval only if the behavior was occurring at
a defined moment (e.g., in response to a beep at the end of the interval).
Using Altmann’s (1974) event-state distinction, it is appropriate only for
states (for further discussion, see Martin & Bateson, 2007). Unlike event
recording, it does not require that observers be continuously alert; the the-
ory is that observations are made only at the sampling point.

Whole-Interval Sampling
Whole-interval sampling is probably the least common of the three sam-
pling strategies. Its recording rule is: Check an interval only if the behavior
occurred for the duration of that interval; do not check if the behavior did
not occur or occurred for only part of the interval. Like event recording, it
requires that observers be continuously alert. A variant of whole-interval
sampling is: Check the behavior that predominated during the inter-
val (called predominant activity sampling by Hutt & Hutt, 1970) – which
seems similar to whole-interval sampling but gives approximations simi-
lar to those of momentary sampling (Tyler, 1979). Momentary and whole-
interval sampling are alike in that intervals are checked for one, and only
one, code (codes are regarded as mutually exclusive by definition), whereas
with one-zero sampling, intervals may be – and often are – checked for
more than one code.
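The three recording rules are compact enough to state as code. In the
following sketch (ours, in Python; the spans, 10-second intervals, and
30-second session are all invented), the same timed-event record for a
single behavior yields three different interval records:

    # One-zero, momentary, and whole-interval sampling applied to
    # the same timed-event record of one behavior.
    spans = [(2.0, 6.0), (8.0, 23.0), (27.0, 29.0)]  # (onset, offset), s
    interval, session = 10.0, 30.0

    def occupied(t0, t1):
        """Seconds within [t0, t1) occupied by the behavior."""
        return sum(max(0.0, min(off, t1) - max(on, t0))
                   for on, off in spans)

    for i in range(int(session / interval)):
        t0, t1 = i * interval, (i + 1) * interval
        one_zero = occupied(t0, t1) > 0   # occurred at all?
        momentary = any(on <= t1 < off for on, off in spans)  # at beep?
        whole = occupied(t0, t1) == interval  # filled whole interval?
        print(i + 1, one_zero, momentary, whole)
    # interval 1: True  True  False
    # interval 2: True  True  True
    # interval 3: True  False False

Note how the single occurrence spanning seconds 8 to 23 is checked in
three successive intervals under the one-zero rule – exactly the kind of
ambiguity discussed next.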
As noted earlier, the advantages of interval recording are primarily
practical; it is easy and inexpensive to implement. The disadvantage is that
summary statistics may be estimated only approximately. For example,
with partial-interval sampling, frequencies are likely underestimated (a
check can indicate more than one occurrence in an interval), proportions
are likely overestimated (a check does not mean the event occupied the
entire interval), and sequences can be muddled (if more than one code is
checked for an interval, which occurred first?). Moreover, with moment-
ary or whole-interval sampling, two successive checks could indicate either
one continuing or two separate occurrences of the behavior. There are pos-
sible fixes to these problems, but none seem completely satisfactory (see
Altmann & Wagner, 1970; Quera, 1990; Sackett, 1978).
Interval duration is a key parameter of interval recording. When the
interval duration is small relative to event durations and the gaps between
events, estimates of code frequencies, durations, and sequences will be bet-
ter and more precise (Suen & Ary, 1989). Decreasing the duration of the
intervals used for interval recording, however, increases the number of deci-
sions observers need to make and thereby loses the simplifying advantage
of interval recording. To take matters to the limit, if the interval duration
is decreased to the precision with which time is recorded (e.g., if 1-second
intervals are defined), interval-recorded data become indistinguishable
from timed-event-recorded data – their code-unit grid becomes the same
(see “A universal code-unit grid” in Chapter 4). Better to record timed-event
data in the first place than to use interval recording with intervals that are too small.
In sum, we recommend interval recording only if approximate estimates
are sufficient to answer your research questions and the practical advan-
tages are decisive (e.g., you have limited resources and the cost of paper and
pencil is attractive). An additional advantage of interval recording is that it
fits standard statistical models for observer agreement statistics better than
other methods, as we discuss in Chapter 6 – but this alone is not a good
reason for selecting this method.

selected-interval recording
The previous section has described methods for assigning codes to contigu-
ous fixed intervals – an observational session was segmented into intervals
of a specified duration, usually relatively brief, and the intervals were then
assigned codes per the recording rules for partial, momentary, or whole
intervals. With such methods, continuity was usually assumed, the data
were regarded as sequential, and sequential data analytic techniques such
as those described in Chapter 11 could be applied.
In contrast to interval recording, what we call selected-interval recording
methods code noncontiguous intervals; summary statistics such as those
we describe in Chapter 8 can still be computed and, when research ques-
tions do not concern contingency or require sequential data, these record-
ing methods can be useful. In fact, selected-interval recording is something
of a residual category. We mean to include in it any methods that assign
codes to noncontiguous intervals. However, we recognize that when those
intervals are equally spaced (every hour, every day, every month, etc.),
momentary sampling is an equally appropriate term; and when every n-th
interval is coded per partial- or whole-interval rules, the method remains
interval recording (which is equivalent to separating observation inter-
vals with recording intervals; see Bass & Aserlind, 1984; Rojahn & Kanoy,
1985). It is also a rather heterogeneous category; thus, instead of attempting
to exhaustively describe all the many variants in the literature, we will sim-
ply give a few examples.
Generally, whenever the intent is to describe how individual animals or
humans distribute their time among different types of activities (time-bud-
get information), selected-interval recording can be an efficient approach.
For example, both Parten (1932) and Smith (1978), cited in Chapter 1,
coded isolated, noncontiguous selected intervals. Florence Goodenough
(1928) called Parten’s the method of repeated short samples. Arrington
(1943) defined time sampling as “a method of observing the behavior of
individuals or groups under ordinary conditions of everyday life in which
observations are made in a series of short periods so distributed as to afford
representative sampling of the behavior under observation” (p. 82). She
credited Olson (1929, cited in Arrington) with its first use (his observers
made a check when specified behaviors for grade-school students occurred
during a 5-minute interval – thus one-zero sampling) and wrote that
Parten used a modification of the method. We would label Parten's method
selected-interval recording, but it could easily be called time-sampling in the
literal sense that intervals of (noncontiguous) time were sampled.


It is worth revisiting Arrington’s rationale for the method, which was
to provide representative sampling of the behavior under investigation.
In this section – indeed throughout the book – we have downplayed the
many sampling issues investigators face. Time sampling enters the discus-
sion because historically it is the term used for what we call interval record-
ing (and what Yoder & Symons, 2010, call interval sampling). Today, we
assume that investigators can and have addressed the many sampling issues
involved in selecting participants, contexts, and times for their observa-
tional sessions. For further discussion of some of these issues see Altmann
(1974), Martin and Bateson (2007, chapters 4 and 5), and Yoder and Symons
(2010, chapter 4).
Another example of selected-interval recording – this one in an educa-
tional context – is provided by the work of Robert Pianta and colleagues,
including the National Institute of Child Health and Human Development
(NICHD) Early Child Care Research Network (e.g., Pianta et al., 2007).
Their research used, among others, two classroom observation schemes.
One focused on individual children. Observers coded the presence of forty-
four behavioral events for eighty 1-minute intervals; 30-second periods of
observation alternated with 30-second periods of recording, grouped into
eight 10-minute cycles during a school day. The other focused on class-
rooms. Observers rated nine items that described the classroom’s emotional
and instructional climate quality; 20 minutes of observation alternated with
10 minutes of recording, again grouped into eight cycles during the day.
The interrupted method of recording Pianta and colleagues used is espe-
cially useful when recording live (see next section) because it allows obser-
vers time to look away from the behavior, perhaps reflect on what they have
just seen, and attend solely to the data-recording task. For observations of
this type, actual recording could be with paper and pencil or with a portable
electronic recording device. As mentioned earlier, this method might also be
regarded as an example of split-interval recording – which only emphasizes
that what we call a recording method is less important than whether or not
we think successive coded intervals are appropriate for sequential analyses.

live observation versus recorded behavior


A question that can affect recording and sampling strategies alike is whether
coders are observing live or working with recorded materials (e.g., just audio,
video with audio, or written transcripts). Almost always we think

coding from recorded materials is preferable to live observation (Bakeman &
Quera, 2012). First, and perhaps most importantly, recorded material can be
played and replayed – literally re-viewed – and at various speeds. Unlike live
behavior – which moves forward in real time and then is gone – recorded
behavior allows for reflection before codes are assigned. It can be replayed
and savored, considered and re-considered. Second, because recorded mater-
ial can be replayed, observers do not need to code everything all at once, but
can focus on different aspects of behavior in different passes – for example,
coding a mother’s behavior in one pass and her infant’s in another. Third,
when observing live, it is difficult (but not impossible) to check interobserver
reliability without the observers’ awareness, whereas with recorded behav-
ior it is easy. Moreover, only with recorded behavior is it possible to check
intraobserver reliability by comparing an observer’s coding with that same
observer’s earlier coding of the same material (see Chapters 5 and 6). Fourth,
contemporary computer systems for capturing coded data work best with
recorded material (especially digital files). And finally, thanks to techno-
logical developments, both audio-video recording devices and computer-
assisted coding systems are increasingly becoming routine and affordable;
complexity and cost are no longer the barrier they have been in the past.
Coding recorded behavior instead of coding live also simplifies many
sampling decisions. When writing about observational methodology more
from the point of view of paper and pencil than electronic recording, authors
often describe a variety of sampling rules, including ad lib, focal, and scan
sampling (e.g., Altmann, 1974; also Martin & Bateson, 2007, pp. 48–51;
Yoder & Symons, 2010, pp. 59–60), but given recorded behavior, all this
can be accomplished with or subsumed under a focal strategy – observing
one individual (or dyad, etc.) at a time and observing other individuals in
separate passes.
Nonetheless, recording behavior for later coding is not always feasible.
In some settings (e.g., school classrooms), audio-video recording devices
may be regarded as too intrusive, or permanent recordings may be unwel-
come for ethical or political reasons. In some circumstances (e.g., observing
animal behavior in the field or observing groups of animals or humans),
trained human observers – because they are embedded in the situation –
may be able to detect behaviors that are unclear on recordings: no behav-
ior will be “off-camera” to them. Moreover, although electronic recording
devices for live observation exist, live observation can be done with a min-
imum of electronic assistance. Often there is no need to purchase, learn
about, or maintain audio-video recording devices.
One final comment: Investigators sometimes refer to an archive of
recordings (whether the magnetic tapes of past decades or the digital files
more common today) as data, but this is inaccurate. Archived recordings,
like a baseball player’s swing that only becomes a strike when the umpire
calls it, become data only when observers code them.

digital recording and computer-assisted coding


In previous sections of this chapter, we have described how observational
data could be recorded using nothing more than paper and pencil, partly
because this is the most economical means, but also the most common
historically. In the last several decades, paralleling technological devel-
opments, behavior observation has increasingly taken advantage of elec-
tronic devices – first analog, but now increasingly digital – first to record
behavior (both audio and video) and now increasingly to code it by using
computer-assisted coding systems of varying degrees of capability, cost,
and sophistication. We hesitate to offer a list of such systems because it
would likely be incomplete and soon outdated. Nonetheless, and by way of
example, probably two of the most widely known full-service commercial
systems currently available are Mangold International’s INTERACT (www.
mangold-international.com) and Noldus Information Technology’s The
Observer (www.noldus.com).
In this section, we describe what is increasingly becoming the norm,
which is coding behavior captured in digital files using a computer-assisted
coding system. This is a relatively new possibility; in previous decades,
behavioral records were stored mainly on magnetic tape – and many still
are – but digital files represent a real advance. In contrast to previous record-
ing technologies, replay can now jump to any point almost instantly with-
out waiting for reels to wind and unwind. Thus, the typical coding station
includes a screen to display the video, speakers for audio, storage for the
digital files, and a computer and appropriate software to control playback
and manage coding.
Key to computer-assisted coding is the software. Even stand-alone play-
back software provides on-screen controls that let you position and play
digital files. Both stand-alone playback software and software designed
to assist coding typically let you play at various speeds forward and back-
ward, pause, or move through the file frame by frame, while displaying
the current time. These playback capabilities€– which allow easy repeated
replay and thereby promote reflection and discussion€– are invaluable for
both observer training and coding. They are also extremely helpful when
first developing coding schemes. In addition to ease of navigation, digital
has other advantages. Some programs synchronize multiple files – which
means that multiple views recorded at the same time can be displayed and
coded together. Such synchronized playback was technically demanding
with videotapes, but is much less so for digital files. Moreover, taped images
displayed on a video monitor are fixed in size, whereas digital images dis-
played on a computer screen can be resized as you wish.
Keeping track of time has always been important. Stand-alone playback
software often rounds time to the nearest second and displays it as mm:ss
(indicating minutes and seconds). Computer-assisted coding software may
round time to the millisecond or hundredth of a second (tenth of a second
and integer second are other possibilities) or to the nearest frame. Thus
common formats are hh:mm:ss.d… or hh:mm:ss:ff (indicating hours, min-
utes, seconds, and either decimal digits or frames). A technical note: The
number of frames per second matters primarily for computer program-
mers. It is approximately 30 (actually 29.97) per second under the NTSC
standard used in North America, much of South America, and Japan, and
25 per second under the PAL standard used in most of Europe, the Near
East, South Asia, and China.
On first use, computer-assisted coding software usually requires infor-
mation about your project – for example, what are your codes? How are they
structured? Or what kind of recording method do you want to use? Often
you are asked to associate codes with particular keyboard keys. Optionally,
a list of your codes may be displayed on-screen. Then at the moment dur-
ing playback when you want to code the onset of an event, you need only
press the appropriate key or point and click that code on the screen with the
mouse. Another possibility€– although it may not appeal to everyone€– is to
dispense with the keyboard and mouse and use voice recognition software;
the observer would then simply speak the appropriate code into a micro-
phone (White, King, & Duncan, 2002). The software will note the infor-
mation, as appropriate, in a data file (also displayed on screen), perhaps
organized like Figure 3.3 or something similar.
Computer-assisted coding software makes the process less error-
prone. The human observers do not need to worry about clerical details
like noting the time on a clock and then writing it digit by digit on a
paper form or entering it key by key into a computer file; the computer
attends to these tasks. Moreover, if you make a mistake or change your
mind – and human observers will do both – and want to add or delete a
code or change a time, edits can usually be accomplished on-screen with
minimal effort.
The result is a data file in which each line represents a coded event along
with its onset and (optionally) offset time. Software programs vary in their
conventions and capabilities, but when sets of ME&E codes are defined,
many such programs automatically supply the offset time for a code when
the onset time of another code in the same set is recorded. Alternatively, off-
set times can be entered explicitly. Another useful feature, present in most
coding software, lets you apply additional codes to events already coded. For
example, after coding an event MomSpeaks, you might then want to code
its tone as Positive or Negative, note its function (e.g., Question, Demand,
Comment), and so forth.
Some software programs permit what we call post-hoc coding – in other
words, they allow you to first detect an event and only code it afterward, once
the whole event has transpired. Compared to systems that require you to
provide a code at the onset of an event, post-hoc coding can minimize back-
and-forth playback and so speed up coding considerably. For example (with
appropriate options set), when you think an event is beginning, you would
hold down the space bar; and when it ends, you would release it, which will
pause playback. You can then decide what the code should be and enter it
with a keystroke or a mouse click. At that point, you can restart playback
and wait for the next event to begin. Alternatively, if you are segmenting the
stream of behavior with a single set of ME&E codes (e.g., Wolff’s, 1966, infant
state or Adamson and Bakeman’s, 1984, engagement state codes, as cited in
Chapter 1), you would simply restart play by depressing the space bar after
entering a code. When that state ends, release the space bar, enter the appro-
priate code, and continue. You can always back up and replay events and edit
both times and codes, of course, but post-hoc coding offers a quite natural
and relatively quick way to segment a session into ME&E states.
Another sophisticated coding possibility we call branched-chain cod-
ing (called lexical chaining by INTERACT), which is useful if you wish to
assign multiple codes to an event. For example, Bakeman and Brownlee
(1982) asked coders to detect possession struggles – that is, times when one
preschool child (the holder) possessed an object and another (the taker)
attempted to take it away. With appropriate software, coding could proceed
as follows: Once a possession episode is identified, coders are asked (via an
on-screen window) to select whether the taker had prior possession (had
played with the object during the previous minute, yes or no). A second
window asks whether the holder offered resistance (yes or no), and a third
whether the taker gained possession of the contested object (yes or no).
Thus coders are presented successive sets of codes; after selecting a code
from one set, they are presented with codes from the next set. The present
example used three sets of two codes each, but you could use as many sets
with as many codes as needed – which makes this a very flexible approach.
It also illustrates how appropriate software can manage clerical coding
details while letting observers focus solely on the task of coding.
The appeal of branched-chain coding is that observers need focus on
only one decision (i.e., one set of codes) at a time, recording each decision
with a keystroke or mouse click. Often the set presented next is the same, no
matter the code just selected (as when cross-classifying an event on several
dimensions). However, the set of codes presented next can be determined
by the code just selected (as when coding decisions are structured hierarch-
ically in a tree diagram). For example, imagine that observers are first asked
to detect communicative acts and code them as involving speech, gesture,
or gesture-speech combination (based on Özçalışkan & Goldin-Meadow,
2009). If either gesture or gesture-speech is coded, next observers code the
type of gesture (conventional, deictic, iconic). And if gesture-speech is coded,
observers could also be asked to code the type of information conveyed by
the combination (reinforcing, disambiguating, supplementary). Again, the
ability to chain codes in this way offers impressive flexibility.
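By way of illustration, the branching logic can be sketched in a few lines of
Python (the language we use for illustrative sketches here; the structure and
names below are our own invention, not INTERACT’s or any particular
program’s actual interface):

# Hypothetical branched-chain coding logic for the gesture example:
# the code chosen from one set determines which sets come next.
CODE_SETS = {
    "act": ["speech", "gesture", "gesture-speech"],
    "gesture type": ["conventional", "deictic", "iconic"],
    "information": ["reinforcing", "disambiguating", "supplementary"],
}

def sets_after_act(act_code: str) -> list[str]:
    """Return the code sets still to be presented after the 'act' decision."""
    if act_code == "speech":
        return []                              # no further decisions
    if act_code == "gesture":
        return ["gesture type"]                # one further decision
    return ["gesture type", "information"]    # gesture-speech: two decisions

print(sets_after_act("gesture-speech"))        # ['gesture type', 'information']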
Finally, with most software for computer-assisted coding (and video-
editing software generally), you can assemble lists of particular episodes
that can then be played sequentially, ignoring other material. Such capabil-
ities are not only useful for coding and training, but for educational and
presentation purposes as well. Still, we do not think that investigators who
require continuous timed-event recording need despair if their resources
are limited. Digital files can be played with standard and free software on
standard computers, or videotapes can be played on the usual tape playback
devices. Observers can write codes and the times they occur on a paper form
and enter the information into a computer file later, or they can enter the
information directly as they code (e.g., using a spreadsheet program run-
ning on the same or a separate computer). Times can even be written when
coding live using only paper, pencil, and a clock. Such low-tech approaches
can be tedious and error-prone – and affordable – but when used well, they
can produce timed-event data that are indistinguishable from that collected
with systems costing far more. More time and effort may be required, but
the end result can be the same.

summary
When setting out to record observational data, you need to select not just
appropriate materials – which can range from simple paper and pencil to
sophisticated video and computer systems – but also appropriate recording
strategies. Depending on circumstances, you may decide to code behav-
ior live or work from recorded materials (audio-visual, just audio, or tran-
scripts). Almost always, recordings are preferable, because observers can
view behavior repeatedly, at normal speed or slow motion, and at times of
their choosing. However, in some settings, recording may be deemed too
intrusive. In either case, observers assign codes to something, either events
or time intervals, which defines the two primary recording strategies for
coded data: event recording and interval recording.
For event recording, times can be recorded or not. This results in two pos-
sibilities. First, untimed-event recording – detecting events as they occur in
the stream of behavior and coding one or more aspects about those events,
but not recording their duration – can be inexpensive to implement, but the
kinds of information derived from its data are limited. Second, timed-event
recording – detecting, coding, and recording duration (often in the form
of onset and offset times) – results in rich data that provide many analytic
options, but works better with recorded materials than live observation.
For interval recording, intervals can be contiguous or not, and differ-
ent sampling rules can be used. For partial-interval or one-zero sampling,
an interval is checked if the behavior occurs at any point (once or more)
during it; for momentary (or instantaneous or point) sampling, an interval
is checked if the behavior is occurring at a specified point; and for whole-
interval sampling, an interval is checked if the behavior continues through-
out the interval. Interval recording – coding successive fixed, often brief,
time intervals – can be relatively easy and inexpensive to implement and
has been much used in the past, but statistics derived from its data may lack
precision. We use the term selected-interval recording when the intervals to
be coded are not contiguous, although this term includes fairly heteroge-
neous methods. Such methods can be useful when research questions are
not concerned with process or contingency – and can be especially useful
when coding live – but untimed-event, timed-event, and interval recording
have the advantage that their data are usually appropriate for sequential
analyses described later in this book.
Coding from recorded material, as opposed to live, can be useful no
matter the method, but is almost essential for timed-event recording.
Using paper and pencil methods to record the coded data works best with
untimed-event and interval recording. Computer-assisted systems can be
used with all recording strategies, but are an especially good fit for timed-
event recording. For timed-event recording, the ideal includes observers
coding behavior, not live, but recorded previously and now stored in digital
files, using a sophisticated computer system to assist and manage coding.
Such computer-assisted coding has the potential to make coding more effi-
cient, more fun, and less error-prone. Still, even when resources are limited,
timed-event recording may be preferred to other methods because of the
analytic possibilities its data provide.
4

Representing Observational Data

Once observers have done their work – that is, once their assignment of
codes to events or intervals has been committed to paper or electronic files –
it is tempting to think that you can now move directly to analysis of the
coded data. Almost always this is premature because it bypasses two import-
ant intervening steps. The second step involves reducing sequential data for
a session into summary scores for subsequent analysis and is relatively well
understood; for details see Chapters 8 and 9. The first step is equally import-
ant but often receives less attention. It is the subject of this chapter and
involves representing – literally, re-presenting – the data as recorded initially
into formats that facilitate subsequent data reduction and analysis.
When recording observational data, as described in the preceding chap-
ter, observer ease and accuracy are paramount, and methods and formats
for recording data appropriately accommodate these concerns. But when
progressing to data preparation, reduction, and analysis, different formats
may work better. In this chapter, we consider two levels of data represen-
tation. The first is a standard format – that is, a set of conventions – for
sequential data that defines five basic data types and reflects the recording
strategies described in the previous chapter. The second is more conceptual;
it is a way of thinking about sequential data in terms of a universal code-
unit grid that applies to all data types and that facilitates data analysis and
data modification, as demonstrated in subsequent chapters (especially in
Chapter 10).

a sequential data interchange standard (sdis)


Knowing that investigators use different recording methods as described in
Chapter 3 – yet also recognizing the advantages of a universal representa-
tional standard as described in this chapter – some time ago we defined

Recording         Sequential     Coding      Universal
strategy          data type      unit        grid unit

Untimed-event     Single-code    Event       Event
                  Multicode
Timed-event       Timed-event    Event       Time unit
                  State
Interval          Interval       Interval    Interval

Figure 4.1. Recording strategies, data types, and coding and universal grid units;
see text for definitions.

conventions for representing sequential observational data (Bakeman &
Quera, 1992). We named this set of conventions the Sequential Data
Interchange Standard (SDIS) and defined five basic data types. The two sim-
plest are single-code event sequential data, which result from untimed-event
recording using a single set of ME&E codes; and interval sequential data,
which result from interval recording. A third type, and the type we find most
useful, is timed-event sequential data, resulting from timed-event recording.
The two remaining types are state sequential data, which is simply a variant
of timed-event sequential data for which data entry is simplified if all codes
are assigned to one or more ME&E sets; and multicode event sequential
data, which result from untimed-event recording when events are coded on
more than one dimension (i.e., are cross-classified using more than one set
of ME&E codes). Figure 4.1 lists the sequential data types and shows their
associated recording strategies along with their coding and universal grid
units (see the “Universal code-unit grid” section later in this chapter).
Data formatted according to SDIS conventions can be analyzed with any
general-purpose computer program designed to read SDIS-formatted data:
One such program is the Generalized Sequential Querier (GSEQ; Bakeman
& Quera, 1995a). We designed GSEQ specifically for data analysis (as dis-
tinct from programs designed primarily for coding and data collection as
described in the preceding chapter), and examples of GSEQ’s analytic cap-
abilities are detailed in subsequent chapters. First, however, before describ-
ing SDIS conventions for its different data types, we discuss the general
issue of how time is represented.

representing time
If duration matters, even if only for an observation session and not the
events within it, then time must be recorded whether you use SDIS or some
other set of conventions to represent your data. And exactly how time is
represented is not always a simple matter. It can be more complicated than
simply using integers to count, for example, the number of camels or using
real numbers to gauge the weight of a camel in pounds represented with a
fractional component.
Our conventional way of representing time (60 seconds to a minute, 60
minutes to an hour, 24 hours to a day) goes back to Babylon and before.
Contemporary digital timekeeping devices represent fractional parts of a
second with digits after the decimal point. Visual recordings provide a new
wrinkle – moving images are represented by a series of still frames, with the
number of frames per second varying according to the standards used by
the recording device. One common time format used for visual recordings
is hh:mm:ss:ff, where hh is the number of hours (0–23), mm the number
of minutes (0–59), ss the number of seconds (0–59), and ff the number of
frames (0–29 for the NTSC standard used in the United States and 0–24 for
the PAL standard used in much of Europe), although exactly what a frame
means becomes less clear for digital recording.
For historical reasons, hh:mm:ss, mm:ss, and ss are all reasonable ways
to represent time, and in fact most computer systems accommodate any
of these formats. Given different standards for number of frames per
second, it makes sense to convert frames to fractional parts of a second,
thus replacing hh:mm:ss:ff with hh:mm:ss.d… as a standard format for time,
where d is a digit indicating a fractional part of a second. Then the question
becomes how many digits to use after the decimal. For most observational
coding, we would argue for no more than two – unless specialized equip-
ment is used that records many more frames per second than is standard.
When recording live, human reaction time typically averages 0.3 second,
and the precision of video recording is limited by the number of frames per
second – which is approximately 0.033 and 0.040 second per frame for the
NTSC and PAL standards, respectively. Given these considerations, claiming a
tenth of a second accuracy seems reasonable, a hundredth of a second accur-
acy dubious, and any greater accuracy futile.
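By way of illustration, the frame arithmetic can be sketched in a few lines of
Python (the function name and the two-digit rounding are our choices for
the sketch; GSEQ and similar programs do such conversions internally):

# Convert an hh:mm:ss:ff timecode to seconds, given a frame rate:
# approximately 29.97 frames per second for NTSC, 25 for PAL.
def timecode_to_seconds(timecode: str, fps: float = 29.97) -> float:
    hh, mm, ss, ff = (int(part) for part in timecode.split(":"))
    return round(hh * 3600 + mm * 60 + ss + ff / fps, 2)

print(timecode_to_seconds("00:02:43:15"))        # NTSC: 163.5
print(timecode_to_seconds("00:02:43:15", 25.0))  # PAL: 163.6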
However, for many behavioral research questions, accuracy to the near-
est second is sufficient, and for that reason we often recommend rounding
all times to the nearest second in the first place. Although some computer
programs may display multiple digits after the decimal point (three digits is
fairly common), there is no reason for you to take them seriously – unless,
as noted, you have specialized equipment and concerns (e.g., phonetic-level
coding of the acoustic signal). The SDIS compiler included in GSEQ does
allow the hh:mm:ss, mm:ss, and ss formats to be followed by a decimal point
with one, two, or three digits, but GSEQ also includes a utility for rounding
those times if you think less precision is more reasonable.
To avoid possible confusion, exclusive and inclusive offset times should
be distinguished. Time units are considered discrete by the SDIS compiler
and GSEQ. As a result€– but also as you would expect€– duration is com-
puted by subtracting an event’s onset time from its offset time. For example,
if the onset for KidTalks is 02:43:05 and its offset is 02:43:09, then – because
the offset time is assumed to be exclusive – its duration is 4 seconds. For
this example, the inclusive offset time would be 02:43:08. Unless explicitly
stated otherwise, it is usually safe to assume that offset times are exclusive.
If we always said 5 to 9 (exclusive) and 5 through 8 (inclusive), this might be
clear enough, but often to and through are used interchangeably in everyday
English, which loses the exclusive-inclusive distinction. Some languages,
like Spanish, lack the to-through distinction. The safest course is always to
say either inclusive or exclusive, whichever is appropriate.
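In code, the exclusive-offset convention reduces to a single subtraction; a
minimal sketch (ours, not GSEQ’s):

# Duration in discrete time units when the offset is exclusive.
def duration(onset: int, exclusive_offset: int) -> int:
    return exclusive_offset - onset

# KidTalks from 02:43:05 to 02:43:09 (exclusive), times in seconds:
print(duration(2*3600 + 43*60 + 5, 2*3600 + 43*60 + 9))  # 4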

single-code event sequences


In this and the next two sections, we give examples of SDIS formatting
conventions, both those that apply generally to all data types and others
that are specific to a particular data type. A more formal presentation of
SDIS syntax is provided in the Help File included with GSEQ (available at
www.gsu.edu/~psyrab/gseq/ or www.ub.edu/gcai/gseq; should these links
become broken in the future, a Web search for GSEQ should provide a cur-
rent link).
The SDIS format for single-code event sequences is simply the codes
listed as they occurred, in sequence (one or more per line as desired). For
example, as noted earlier, the sequence for the codes shown in Figure 3.2
would be: declr, declr, quest, comnd, comnd, and quest. An example of SDIS
syntax for which the first session begins with this event sequence is shown
in Figure 4.2.
Whatever the data type, the first line or lines of an SDIS data file are a
declaration. The first word on the first line indicates the data type – Event,
Timed, State, Interval, or Multi. (The single letter E, T, S, I, or M is also accepted,
but note that single-code event data is specified as Event or E). This may be
followed by a list of codes permitted in the data; if you provide this list, any
codes in the data that are not on this list will be flagged as errors. Code names
cannot have embedded spaces; they can include letters, digits, and some spe-
cial characters (see GSEQ help file); their length is not limited, but generally
shorter names work better. The declaration ends with a semicolon. Any line
Event comnd declr quest ;
% codes indicate type of statement
% comnd = command
% declr = declare
% quest = question

<Case #1>
declr
declr
quest
comnd
comnd
quest … /
<Case #2>
,02:57:12 quest declr declr comnd … ,03:02:28/
<Case #3>
…/

Figure 4.2. An example of an SDIS single-code event sequential data file; % indi-
cates a comment. Codes may be listed one (Case #1) or more (Case #2) per line, as
desired. Session start and stop times may be included (as for Case #2) but are not
required. See text for other details.

that begins with a percent sign (%) is treated as a comment and otherwise
ignored; comments enclosed in percent signs may also appear anywhere in a
line (% is the default comment character; it can be changed).
The data for each session is terminated with a forward slash. The ses-
sion may begin with a session label enclosed in angle brackets (this is
optional). If interruptions occur during sessions, thus segmenting them,
segment boundaries are indicated with semicolons (and so interruptions
can be taken into account when computing sequential statistics). Explicit
session start and stop times are optional. If given, they consist of a comma
followed by the time (see Case #2 in Figure 4.2). If start and stop times are
given, then rates can be computed later (see Chapter 8). Case #2 also shows
several codes on a line, which is a format some users may prefer. Spaces, as
well as tabs and line breaks, separate different elements; otherwise, you may
enter spaces and line breaks to format the file as you wish. In single-code
event sequences, when all codes are a single character, they do not need to
be separated (e.g., ABC is the same as A B C), provided you have checked
the single-character SDIS compiler option. This option makes manual data
entry easier for single-code event sequences (and is also valid for the inter-
val and multicode event sequences described subsequently).
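To suggest how easily such files can be processed, here is a deliberately
simplified Python reader for single-code event sequences (our illustration
only; it ignores session start and stop times, segment semicolons within
sessions, and the other SDIS features that the real compiler handles):

import re

def read_single_code_events(sdis_text: str) -> list[list[str]]:
    """Toy reader: declaration ending in ';', optional <labels>,
    '%' comments, ','-prefixed times ignored, '/' ends a session."""
    text = re.sub(r"%[^\n]*", "", sdis_text)   # strip comments
    text = re.sub(r"<[^>]*>", "", text)        # strip session labels
    body = text.split(";", 1)[1]               # skip the declaration
    return [[t for t in chunk.split() if not t.startswith(",")]
            for chunk in body.split("/")[:-1]]

example = "Event comnd declr quest ;\n<Case #1> declr declr quest comnd comnd quest /"
print(read_single_code_events(example))
# [['declr', 'declr', 'quest', 'comnd', 'comnd', 'quest']]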
timed-event and state sequences


Conventions for timed-event sequences are relatively simple. As for all data
types, the file begins with a declaration terminated with a semicolon and is
followed by data for each session, with each session’s data terminated with a
forward slash. Codes that form ME&E sets may be enclosed in parentheses
in the SDIS declaration. Figure 4.3 gives an example of SDIS timed-event
syntax for which MomTalks, KidTalks, and Quiet form a ME&E set (for this
example, assume that mother and child never talk at the same time).
For all data types, sessions can be assigned to levels of one or more fac-
tors. The declaration in Figure 4.3 shows two factors – sex with two levels
(male and female) and age with three levels (6, 7, and 8). For each session,
levels for the factors are enclosed in parentheses, listed in the same order
as in the declaration, and follow the session identifier (if one is present), or
they are placed just before the session’s terminating forward slash. In this
case, Session #1 is for a seven-year-old male and Session #2 for a six-year-
old female.
The Figure 4.3 example assumes times recorded to the nearest second
and, for all three sessions, a session start time of 1 and a session stop time
of 31 (exclusive, so the session lasted 30 s). As for event data, session start
and stop times are optional. If, as in this example, start and stop times are
given, session duration equals stop time minus start time; if not given, ses-
sion duration equals the offset time for the last code minus the onset time
for first code in the session. For all data types, the same time format must
be used for all times throughout a file; any times that differ in format from
ones appearing earlier in the file are flagged as warnings or errors.
Events within a session are represented with the appropriate code fol-
lowed by a comma, its onset time, and optionally a hyphen followed by the
offset time for the event. The three sessions in Figure 4.3 exemplify differ-
ent ways events and their times can be represented with SDIS timed-event
syntax.
Session #1 shows the Code,OnsetTime-OffsetTime format. With this for-
mat, which gives both event onset and offset times explicitly, there is no need
to enter times for Quiet, the code that completes the ME&E set. Any time not
coded MomTalks or KidTalks necessarily must indicate neither talking.
Session #2 shows the Code,OnsetTime- format (omitting the offset time
but not the hyphen). With a hyphen but no offset time following, the SDIS
compiler assumes that the offset for that code is the onset of the next code
in the file, which is why we explicitly coded Quiet for this example. Which
of these two formats you use is a matter of taste; both represent the same
                     1         2         3
Time unit:  123456789012345678901234567890
MomTalks    .√√√......√√√√√√.......√√√√...
KidTalks    ....√√√√........√√............
Quiet       √.......√√........√√√√√....√√√
Burp        .............√................

Timed (MomTalks KidTalks Quiet) Burp
* sex (male female) age (6 7 8) ;

<Session #1> (male 7)
,1 MomTalks,2-5 KidTalks,5-9 MomTalks,11-17
KidTalks,17-19 MomTalks,24-28 ,31/
<Session #2> (female 6)
,1 MomTalks,2-
KidTalks,5-
Quiet,9-
MomTalks,11-
KidTalks,17-
Quiet,19-
MomTalks,24-
Quiet,28- ,31/
<Session #3> (male 8)
,1 MomTalks,2-5 MomTalks,11-17 MomTalks,24-28 &
KidTalks,5-9 Burp,14 KidTalks,17-19 ,31/
<Session #4>
…/

Figure 4.3. An example of an SDIS timed-event sequential data file, with data
shown in the grid at the top. Events may be listed one (Session #2) or more per line
(Sessions #1 and #3), as desired. See text for other details.

data. Session #2 also shows one event per line instead of several, which
some users may prefer – but, as noted earlier, line breaks and spaces may be
entered where you wish. (Additional possibilities include what we call com-
bination codes and context codes; these potentially useful, but less frequently
used, options are described in the GSEQ help file.)
Session #3 shows how codes can be entered in more than one stream
and also shows the Code,OnsetTime format (omitting both offset time and
hyphen). The SDIS compiler expects codes within a session to be time-or-
dered, as in a single forward-flowing stream; that is, later onset times cannot
occur before earlier ones (this can be useful for finding data entry errors).
However, just as coders often find it convenient to make multiple passes
through a video record, entering data for each pass separately (e.g., coding
first mother and then child behavior), you may find it useful to enter data
in the SDIS file in more than one stream. SDIS conventions allow timed-
event data (and state data, described in the next paragraph) to be listed
State (MomTalks KidTalks Quiet);

<Session #1>
Quiet,1 MomTalks,2 KidTalks,5 Quiet,9 MomTalks,11
KidTalks,17 Quiet,19 MomTalks,24 Quiet,28 ,31/
<Session #2>
Quiet=1 MomTalks=3 KidTalks=4 Quiet=2 MomTalks=6
KidTalks=2 Quiet=5 MomTalks=4 Quiet=3/
<Session #3>
…/

Figure 4.4. An example of an SDIS state sequential data file for the data shown in
Figure 4.3. See text for details.

in separate passes or streams (each itself time-ordered), separating each
stream within a session with an ampersand. Session #3 in Figure 4.3 illus-
trates this convention. It also illustrates a third code-time format (Burp,14).
If no hyphen follows OnsetTime, the SDIS compiler assumes the event is
regarded as essentially momentary, meaning its duration is not of interest
(although in fact, it is given a duration of one time unit).
The State data type is a simplified version of timed-event recording and
may be used when all codes are assigned to one or more ME&E sets. Its two
formats are Code,OnsetTime (no hyphen) and Code=Duration. If there is
more than one ME&E set, the codes for each ME&E set are entered in sep-
arate streams within a session, separated by ampersands. Figure 4.4 shows
SDIS state syntax for the data shown at the top of Figure 4.3 (omitting
Burp). Session #1 illustrates the Code,OnsetTime format and Session #2 the
Code=Duration format. The state data type is simply a data-entry conveni-
ence; any references to timed-event sequential data in this book implicitly
include its state sequential variant.
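The equivalence of the two state formats is easy to demonstrate in code;
the sketch below (names ours) converts Code=Duration pairs into
Code,OnsetTime pairs and, run on Session #2 of Figure 4.4, reproduces the
onsets of Session #1 along with the exclusive stop time of 31:

# Convert (code, duration) pairs into (code, onset) pairs.
def durations_to_onsets(states, start=1):
    onset, result = start, []
    for code, dur in states:
        result.append((code, onset))
        onset += dur
    return result, onset   # second value = exclusive session stop time

session2 = [("Quiet", 1), ("MomTalks", 3), ("KidTalks", 4), ("Quiet", 2),
            ("MomTalks", 6), ("KidTalks", 2), ("Quiet", 5),
            ("MomTalks", 4), ("Quiet", 3)]
onsets, stop = durations_to_onsets(session2)
print(onsets[:3], stop)
# [('Quiet', 1), ('MomTalks', 2), ('KidTalks', 5)] 31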

interval and multicode event sequences


SDIS conventions for interval sequential data and multicode event sequen-
tial data are essentially similar (see Figure 4.5 for examples). For interval
data, the codes in each interval are entered followed by a comma. For multi-
code data, the codes for each event are entered followed by a period. In other
words, for interval data, commas demarcate successive time intervals (of
specified duration), whereas for multicode data, periods indicate successive
events. For both, if successive intervals or events contain the same codes,
they need not be repeated. Their repetition can be indicated by entering
the number of repeated intervals or events followed by an asterisk followed
Interval Manip Explr Relate IVoc MVoc MEnc MEnt ;


<Infant #208>
Manip IVoc MVoc MEnc ,
IVoc MEnc ,
Explr Relate MVoc MEnc ,
Manip Explr MVoc ,
,
Explr MVoc MEnt , … /
<Infant #212>
…/

Multi (comnd declr quest) (look gest both none) ;


<Case #1>
declr gest . declr look . quest none .
comnd gest . comnd gest . quest both . … /
<Case #2>
declr gest . declr look . quest none .
2* comnd gest . quest both . … /
<Case #3>
…/

Figure 4.5. Examples of an SDIS interval sequential data file (based on Figure 3.4)
and an SDIS multicode event sequential data file (based on Figure 3.2). See text
for details.

by the code (or codes). For interval data, an empty interval is indicated by
two commas with nothing between them. The repetition asterisk can also
be used with empty intervals (e.g., 5 empty intervals would be indicated
as … , 5* , …). Multicode data does not contain empty events, by definition.
Other conventions (e.g., for sessions and factors) have been described in
preceding paragraphs.

a universal code-unit grid


Universal forms have considerable appeal; they can introduce efficien-
cies and clarify and organize thinking. In the previous three sections,
we described formats for representing sequential observational data that
reflected the recording strategies described in the previous chapter. As we
now progress from data preparation to data reduction and analysis, we have
found it useful to define a common underlying format that applies to all five
data types. This common format not only introduces efficiencies in GSEQ’s
computer code, but also facilitates the way we think about data analysis and,
especially, data modification (see Chapter 10).
The basis for this common format is both simple and venerable
(Bakeman, 2010). It is a grid, a structure in two-dimensional Euclidean
Event or time unit or interval


Code 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ...

Alpha
Beta
...

Figure 4.6. An example of a code-unit grid for which rows represent codes and
successive columns could represent either events, time units, or intervals.

or Cartesian space, as evident in at least some cities throughout history
(e.g., Babylon, Beijing, Barcelona’s Eixample, Manhattan) and in common
computer applications such as spreadsheets and statistical packages today.
Applied to observational data, rows represent codes and columns represent
units – which are either events for untimed-event recorded data, or time
units for timed-event recorded data, or intervals for interval recorded data
(see Figure 4.6). For untimed-event recorded data, the units used to record
data and the units used to represent data in the grid are the same (events).
Likewise, for interval recorded data, the units used to record data and the
units used to represent data in the grid are the same (intervals). However,
for timed-event recorded data, the units differ – events for recording and
time units for representing. Nonetheless, all three types of recorded data
can be represented with a universal code-unit grid; all that differs is how
columns are labeled (see Figure 4.1).
The interval for interval recorded data (i.e., its duration as measured in
time units) is defined by the investigator prior to coding. In contrast, the
time unit for timed-event data is defined by the precision with which time
is recorded. If onset and offset times for events are recorded to the nearest
second,€ each column of the grid would represent a second; if rounded to
tenths€of a second, each column would represent a tenth of a second; and
so forth. Notice that the universal code-unit grid reveals the underlying
commonality of the five data types. No matter how recorded, data can be
represented by a grid in which time flows from left to right (per cultural con-
vention). Columns are demarcated by successive events or time units or inter-
vals; rows are labeled with the codes used; and cells are checked to indicate
a code occurring in that event, time unit, or interval. An example for timed-
event data was given in Figure 4.3; if the contiguous checks were replaced
with horizontal bars, the figure would look like a common timeline graph.
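As a sketch of such a representation (ours, not GSEQ’s actual internals),
timed events with exclusive offsets expand into a code-by-time-unit grid
like this:

# Expand (code, onset, exclusive offset) triples into a code-unit grid:
# one row of Booleans per code, one column per discrete time unit.
def build_grid(events, codes, start, stop):
    grid = {code: [False] * (stop - start) for code in codes}
    for code, onset, offset in events:
        for t in range(onset, offset):
            grid[code][t - start] = True
    return grid

# Session #1 of Figure 4.3 (Burp omitted); the session runs from 1 to 31.
events = [("MomTalks", 2, 5), ("KidTalks", 5, 9), ("MomTalks", 11, 17),
          ("KidTalks", 17, 19), ("MomTalks", 24, 28)]
grid = build_grid(events, ["MomTalks", "KidTalks", "Quiet"], 1, 31)
for t in range(30):   # any unit not coded MomTalks or KidTalks is Quiet
    grid["Quiet"][t] = not (grid["MomTalks"][t] or grid["KidTalks"][t])
print(sum(grid["MomTalks"]), sum(grid["KidTalks"]), sum(grid["Quiet"]))  # 13 6 11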
Much of GSEQ’s power and usefulness results from representing sequen-
tial observational data internally in terms of the universal code-unit grid.
Three advantages are worth mentioning. First, representing observational
data as a code-unit grid, as GSEQ does internally, facilitates computation of
standard frequency or contingency table statistics (the column is the tally-
ing unit). Second, the grid representation makes data modification easy
to use and understand. New codes (i.e., rows in the grid) can be defined
that are keyed to onsets or offsets of existing codes or new codes can be
formed from existing codes using standard logical operations (and, or, not,
etc.); this introduces considerable data analytic flexibility (see “Creating
new codes from existing codes” in Chapter 10). Third, the discrete time
unit view of timed-event sequential data implied by the code-unit grid (i.e.,
time is segmented into discrete units defined by precision) solves some
problems in gauging observer agreement (and an alignment algorithm we
describe solves others); see discussion of event- and time-based agreement
in Chapter 6.

alternatives: spreadsheet and statistical package grids
Conceptualizing sequential observational data as a code-unit grid has
advantages, as noted in the previous section. For untimed-event and inter-
val recording, a code-unit grid could even be used to record data initially
using forms like those in Figures 3.2 and 3.4. In such cases, the data as
initially recorded could even be entered directly into the grids provided
by spreadsheet or statistical package programs. However, for timed-event
recording, using a form modeled on the schematic in Figure 4.3 – actually
checking successive time intervals – would be somewhere between tedi-
ous and impossible. True, timed-event sequential data could be entered
directly into grids provided by spreadsheet programs using a format like
that shown in Figure 3.3 – and at least some commercially available com-
puter-assisted coding programs do exactly this – but then other programs
would be needed to extrapolate from the onset and offset times recorded
to occurrences and co-occurrences in time.
Effectively, this is what the SDIS conventions and data types provide. Its
conventions are designed to provide choices from among data types that are
very close to the way observational data are recorded, and so are relatively
easy to enter into computer files (see Figures 4.2, 4.3, and 4.4). The SDIS
compiler then does the work of transforming (re-presenting) the data rep-
resented in SDIS format into a universal, grid-based format, thereby facili-
tating analysis by GSEQ and other similar programs.
data management and file formats


As with most advances, the digital age has both simplified and complicated
our research lives. Once records consisted primarily of marks on paper;
they were easy to read and annotate, but still had to be organized and filed
in ways that allowed for easy retrieval. File cabinets lined our laboratory
walls. During the decades when behavioral recordings were primarily con-
signed to magnetic tape, shelves of boxes containing tapes were added to
the walls. They might be deteriorating at unknown rates, but they were in
sight, clearly labeled, a marker of our progress. Computer-assisted coding of
magnetic tapes required a fair bit of technical acumen and relatively expen-
sive equipment and software, especially when multiple tapes were synchro-
nized; but independent of the extent of computer assistance for coding, data
files increasingly took electronic form. Neither visible nor palpable, they
came to be consigned to electronic storage devices, either in our laboratory
or in remote locations – as technical experts might dictate.
With the advent of digital storage, not just of data files, but also of the
initial behavioral records themselves (i.e., the audio-visual files), two issues
have assumed increased importance – data management and file formats.
We have presented the observational session as the basic analytic unit of
observational methods. With digital recording, one or more – often quite
large – files are associated with each session. These files need to be orga-
nized and stored, and both backup and security provided. Once sessions
are coded, the coded data likewise need to be organized and stored. These
are no small tasks. While coding is in progress, it may be convenient to
keep the coded data for each session in a single computer file. Later, data for
several sessions can be merged into one file for further processing, as many
computer programs require. In any event, data management will require
organization, thought, and care.
Digital recording opens exciting new opportunities. Technically, creating
computer systems to assist coding is easier with digital files than it was with
magnetic tape. As a result, more such coding systems are becoming avail-
able – some tailored to the work of a single laboratory, some intended for a
particular field, some designed to be quite general, some available commer-
cially, and some offered at no cost. Of course, each system varies in its cap-
abilities. For the purposes of this book, we decided not to provide a list of
names of specific programs, partly because any such list would omit worthy
contenders and could soon be outdated. As mentioned earlier, and again by
way of example, probably the best-known full-service commercial systems
currently are Mangold International’s INTERACT and Noldus Information
Technology’s The Observer, but you can easily assemble a more extended
list with a few minutes of Internet search.
No matter what computerized systems you decide to use, once coding is
done, you are left with data files in a particular format. The question then
becomes, what is that format and what can you do with it? Many computer-
assisted coding programs provide summary statistics, and these may be suf-
ficient for your purpose. If not, or if you are not using such a program, you
may want to use GSEQ, which was designed not for initial data collection
but specifically for data analysis. GSEQ requires that observational data
be formatted per SDIS conventions, as described in earlier sections of this
chapter. Thus, unless a program you are using has the capability to write
SDIS-formatted files, you will need to convert the data files you have into
SDIS format. This is usually quite straightforward. Many programs prod-
uce files similar to that shown in Figure 3.3, listing codes along with their
onset and offset times. Such files can be converted into SDIS format using
search-and-replace and other editing capabilities of standard spreadsheet
and word-processing programs, or with custom conversion programs such
as those we have written (e.g., Bakeman & Quera, 2008).
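For instance, if a program exported one line per event as code,onset,offset
(a file layout we assume purely for illustration), a few lines of Python would
yield SDIS timed-event syntax:

import csv

def csv_to_sdis(csv_path, codes):
    """Convert a one-session code,onset,offset CSV into SDIS timed-event syntax."""
    lines = ["Timed " + " ".join(codes) + " ;", "<Session #1>"]
    with open(csv_path, newline="") as f:
        for code, onset, offset in csv.reader(f):
            lines.append(f"{code},{onset}-{offset}")
    lines.append("/")
    return "\n".join(lines)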

summary
As observers code, they record their decisions, perhaps with marks on paper
that are transferred to electronic files later, although increasingly with key-
strokes that create electronic data files directly. When designing formats
for recording the observer’s initial decisions, ease and accuracy are import-
ant considerations. Once recorded, however, it makes sense to represent
(literally re-present) the recorded data in formats that facilitate subsequent
analysis.
We have defined a set of formatting conventions – the Sequential Data
Interchange Standard or SDIS format – that can be used to represent
observational data. These conventions, which were detailed in this chap-
ter, accommodate different recording strategies. The SDIS compiler trans-
forms SDIS-formatted data into a format based on a common code-unit
grid for subsequent analysis by computer programs such as the Generalized
Sequential Querier (GSEQ), a program we designed specifically for analysis
of sequential observational data.
Whatever system you use for recording coded data, you are left with data
files. If you enter data files yourself, perhaps from paper records, it is a sim-
ple matter to put them in SDIS format. If you use programs that produce
data files in their own format, it is usually a simple matter to convert such
files into SDIS-formatted ones. If summary statistics produced by whatever
program you use are adequate to your use, there may be no need for such
a conversion; but if you find you want the extensive analytic and especially
data modification capabilities provided by GSEQ (as described in later
chapters), it is probably worth the effort.
5

Observer Agreement and Cohen’s Kappa

As noted in earlier chapters, measuring instruments for observational meth-
ods consist of coding schemes in the hands (and minds and eyes) of trained
observers. Like all measuring instruments, they need to be calibrated; we
need to be assured that the instruments are accurate. With human obser-
vers, this means demonstrating that two observers independently coding
the same stream of behavior produce essentially the same results – or, per-
haps better, that an observer agrees with a gold standard, a version that has
been prepared by experts and is presumed accurate. It is not an exagger-
ation to call such demonstrations of observer agreement the sine qua non
of observational methods. Without such demonstrations, we are left with
individual narratives – perhaps fascinating, perhaps insightful, perhaps
useful for generating research questions – but nonetheless narratives of
unknown reliability.
Nothing is absolute, and a suitable level of agreement between two inde-
pendent observers does not by itself guarantee accuracy: After all, two
observers could share similar deviant views of the world. But almost always
in the behavioral research world, observational methods require attention
to and demonstration of observer agreement – either two observers with
each other or one observer with a gold standard.
There are at least three major reasons to be concerned with observer
accuracy. First, for ourselves: Presumably we have spent considerable time
defining and fine-tuning our coding schemes and want to assure ourselves
that the coders are faithful to our vision and are performing as we expect.
Second, for the coders: They need the accurate feedback that comparison
with each other or with a gold standard provides to correct errors dur-
ing training and ultimately to perform as expected. And third, for others:
Without demonstrations of observer accuracy, our colleagues and others –
including journal editors and reviewers – have little reason to take our
results seriously. Such demonstrations give us and others reason to think
that our observers are adequately trained and that the data they produce are
trustworthy and reliable.

point-by-point versus summary agreement


Point-by-point agreement or summary agreement or both may be import-
ant, depending on your purpose. Point-by-point agreement – which is
addressed in this chapter and the next – focuses on whether observers
agree with respect to the successive intervals or events coded (or rated). It
assumes nominal measurement (assigning codes) or ordinal measurement
(assigning ratings) and relies primarily on some form of the kappa statistic
(Cohen, 1960). Point-by-point agreement is especially useful for observer
training prior to data collection and for ongoing checking of observers once
data collection begins.
In contrast, summary agreement – which is addressed in Chapter 7 –
focuses on whether corresponding summary statistics agree when statistics
are based on records of the same events produced by different observers. It
assumes that these summary statistics are represented on at least an ordinal
or – more typically – an interval or ratio scale, and it relies primarily on
a form of the intraclass correlation coefficient (ICC; see e.g., McGraw &
Wong, 1996). Summary agreement is especially useful when a project has
moved from data collection to analysis and you want to demonstrate the
reliability of the summary scores analyzed.
At a minimum, observer agreement must assure us that the data are
of sufficient quality for whatever analyses are subsequently performed.
As Kaye (1980) has observed, “The reliability that matters is the reliabil-
ity of the data that will actually be used in the analysis, after it has been
recoded, transformed, combined, concatenated, or smoothed in prelimin-
ary ways” (p. 467).
If reliability of summary scores were the only concern, ICCs would be
sufficient and kappas could be ignored. In fact, the reverse is true: Kappas
are usually regarded as sufficient and ICCs are ignored. Observer training
and data collection come first, so necessarily investigators focus first on
point-by-point agreement. After all, if we ignored any checks on observer
agreement until after data collection and the data did not pass muster, what
then? Moreover, point-by-point agreement is usually regarded as more
stringent than summary agreement. As a result, it has become routine for
journal editors and other arbiters of the scientific literature to accept point-
by-point agreement as sufficient. Presumably, if point-by-point agreement
for the data as collected meets accepted standards, so too will summary
scores derived from them. As a result, additional statistical proof for the
reliability of summary scores is rarely requested or provided.
In fact, different types of errors matter for point-by-point and for sum-
mary agreement. For point-by-point agreement, errors are qualitative and
may consist of disagreements – an observer applies a different code from
the other observer or the gold standard; omissions – an observer fails to
detect an event that the other observer or the gold standard does; and
commissions – an observer detects an event that the other observer or the
gold standard does not (see “Errors of commission and omission” later in
this chapter). In contrast, for summary agreement, errors are quantitative
and occur when summary statistics are not the same. For point-by-point
agreement, the greater the qualitative error, the lower the kappa. For sum-
mary agreement, the greater the quantitative error, the lower the ICC. Issues
related to point-by-point agreement are the focus of this chapter and the
next; issues related to summary agreement are considered in Chapter 7.

the classic cohen’s kappa


Probably the most commonly used statistic for point-by-point observer
agreement is Cohen’s kappa (1960) or some variant of it. It is a summary
statistic that assesses how well two observers agree when asked to inde-
pendently assign one of K codes from a ME&E set to N discrete entities. In
other words, kappa is a summary statistic characterizing how well observers
agree when applying a set of ME&E codes. Usually the codes are nominal,
but they could be ordinal. Kappa is designed to answer questions like:
How well do two doctors agree when assigning one of K mutually exclu-
sive diagnostic categories to N patients? Or, in the case of interval-recorded
data: How well do two observers agree when assigning one of K codes to N
successive intervals?
The N decisions the two observers make are tallied in a square contin-
gency table. Individual rows and columns are labeled with the K codes, as
shown in Figure 5.1 for the infant state codes. Usually rows are labeled for
Observer 1 and columns for Observer 2. Each decision adds a tally to the
appropriate cell; for example, if an interval was coded Alert by Observer 1
but Fussy by Observer 2, a tally would be added to the Alert-Fussy cell (1st
row, 3rd column). Note that agreements (both observers code an interval
the same) are tallied in one of the upper-left to lower-right diagonal cells
(hereafter called the diagonal) and disagreements are tallied in one of the
off-diagonal cells. Sometimes such tables are called agreement or confusion
Obs. 1’s                 Observer 2’s codes
codes      Alert    Cry  Fussy    REM  Sleep  Total      p
Alert         19      0      1      2      0     22    .18
Cry            2     20      2      1      3     28    .23
Fussy          1      5     11      2      2     21    .18
REM            4      2      3     18      1     28    .23
Sleep          1      3      0      2     15     21    .18
Total         27     30     17     25     21    120     1.
p            .23    .25    .14    .21    .18     1.    .69

Figure 5.1. A kappa table tallying the frequency of agreements and disagreements by
two observers coding infant state for 120 intervals; p represents the marginal prob-
abilities and the lower-right .69 indicates the probability of observed agreement.

matrices; but because they represent both agreements and confusions, we
think it is more accurate to call them simply kappa tables.
For the data in Figure 5.1, if we add the tallies on the diagonal and div-
ide by the total number of tallies, we get .69, which is the probability of
observed agreement for this example – or equivalently, multiplying by 100,
we get 69 percent, the percentage of observed agreement (hereafter we use
both terms, probability and percentage agreement, equivalently). However,
percentage agreement is a problematic index. Some agreement can be
expected simply by chance, and percentage agreement does not correct for
it. In contrast, Cohen’s kappa does, which is why it is preferable to percent-
age agreement.
Cohen’s kappa is defined as

$$\kappa = \frac{P_o - P_c}{1 - P_c}\,, \qquad \text{with} \quad P_o = \sum_{i=1}^{K} p_{ii} \quad \text{and} \quad P_c = \sum_{i=1}^{K} p_{+i}\,p_{i+}$$

where Po = probability of observed agreement, Pc = probability of chance
agreement, K = number of codes, pij = probability for the cell in the i-th row
and j-th column (i.e., the cell count divided by the total number of tallies or N),
pi+ = probability for the i-th row, and p+j = probability for the j-th col-
umn. Po is the sum of the probabilities on the diagonal and Pc is the sum
of their chance probabilities, which for each diagonal cell is the product of
its row probability multiplied by its column probability. Kappa is then the
observed probability of agreement that exceeds chance (Po – Pc) relative to
(divided by) the possible probability that exceeds chance (1 – Pc). In other
words, 1 – Pc represents maximum nonchance agreement – perfect agree-
ment from which chance agreement has been subtracted. To understand
how Cohen’s kappa corrects for chance, write the probability of agreement
as Po/1 and then subtract Pc from both numerator and denominator – which
yields the formula for Cohen’s kappa.
Almost always, kappa will be lower than the probability of observed
agreement (they are equal only when agreement is perfect, i.e., all tallies are
in cells on the diagonal). For example, for the data in Figure 5.1, Po = .69,
Pc = .20, and so κ = .61 – i.e., (.69 – .20)/(1 – .20). Positive values of kappa
indicate agreement better than expected by chance, near-zero values indi-
cate agreement about the level expected just by guessing, and negative
values indicate agreement even worse than expected by chance. Computed
values of kappa from –1 to +1 are possible. Negative values are relatively
rare, but when they occur, they require us to ask why observers are disagreeing
worse than chance levels.
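The computation is short enough to sketch in full; the Python below (an
illustration of the formula, not GSEQ’s code) reproduces the Figure 5.1
result:

# Cohen's kappa for a square K x K agreement table
# (rows = Observer 1, columns = Observer 2).
def cohens_kappa(table):
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_p = [sum(row) / n for row in table]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_c = sum(row_p[i] * col_p[i] for i in range(k))
    return (p_o - p_c) / (1 - p_c)

# The infant-state table of Figure 5.1:
table = [[19, 0, 1, 2, 0],
         [2, 20, 2, 1, 3],
         [1, 5, 11, 2, 2],
         [4, 2, 3, 18, 1],
         [1, 3, 0, 2, 15]]
print(round(cohens_kappa(table), 2))  # 0.61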
Note that if one observer never assigns one (or more) of the codes, the
result is a row or column sum of zero and a zero in the cell on the diagonal
for that code. Kappa is reduced in such cases (there was no agreement for
that code), but its computation is unaffected. Similarly, if both observers
never assign one (or more) of the codes, the effective size of the kappa table
is reduced, but again the kappa computation is unaffected.
As emphasized earlier, kappa is a summary statistic that assesses
the average agreement of two observers when applying an ME&E coding scheme consisting of K codes (for extension to more than two observers, see Uebersax, 1982). It does not indicate how agreement for the individ-
ual codes in the set varies. For purposes of training and checking obser-
vers, probably the best guide to their disagreement is the kappa table itself.
Are disagreements distributed more or less evenly in off-diagonal cells? If
not, it suggests the observers vary in their sensitivity and need to discuss
thresholds further. Or, do only a few cells account for most of the disagree-
ments? If so, the confusions revealed by such cells will suggest codes that
need more discussion and perhaps better definition€– and in some cases, if
confusion between two particular codes persists, it may even be better to
collapse them into a single code.
Still, what can we do if agreement statistics for the individual codes in
an ME&E set are desired, or if a journal editor unfamiliar with the average
nature of kappa requests such statistics? Assuming more than two codes,
the K×K table can be collapsed into K separate 2×2 tables, as shown in
Figure 5.2 for the 5×5 table given in Figure 5.1. Each 2×2 table compares
one code in the set against all others, with counts for other codes lumped
        Alert  Other  Total              Cry  Other  Total
Alert     19      3     22      Cry       20      8     28
Other      8     90     98      Other     10     82     92
Total     27     93    120      Total     30     90    120

        Fussy  Other  Total              REM  Other  Total
Fussy     11     10     21      REM       18     10     28
Other      6     93     99      Other      7     85     92
Total     17    103    120      Total     25     95    120

        Sleep  Other  Total
Sleep     15      6     21      κ(Alert) = .72    κ(Cry) = .59
Other      6     93     99      κ(Fussy) = .50    κ(REM) = .59
Total     21     99    120      κ(Sleep) = .65    κ(Total) = .61

Figure 5.2. The five 2×2 tables produced by collapsing the 5×5 table in Figure 5.1.

into a single Other category. Kappas for the five separate tables are given in
Figure 5.2 and indicate that, for this example, agreement was best for Alert,
worst for Fussy. However, observers did not in fact make binary decisions;
thus the collapsed 2×2 tables do not necessarily reflect the agreement that
would result had a binary coding scheme been applied. When collapsing
any kappa table into 2×2 tables in this way, you can be sure that some of
the kappas for the 2×2 tables will be greater and some less than the kappa
for the parent K×K table. This is because kappa is a weighted average of the
kappas for the K separate 2×2 tables (Fleiss, 1981). If you multiply each 2×2 kappa by its denominator (i.e., weight each kappa by its 1 – Pc), sum these
products, and then divide by the sum of the weights, the result will be the
kappa computed for the parent table.
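To illustrate, here is a brief sketch (ours; the 3×3 tally table is made up, and cohens_kappa() is the function from the earlier sketch) that collapses a parent table and verifies the weighted-average identity:

    def collapse_to_2x2(table, i):
        # Collapse a K x K tally table to the 2 x 2 table for code i versus Other
        a = table[i][i]
        b = sum(table[i]) - a                            # Obs 1 coded i, Obs 2 other
        c = sum(row[i] for row in table) - a             # Obs 1 other, Obs 2 coded i
        d = sum(sum(row) for row in table) - a - b - c   # both coded other
        return [[a, b], [c, d]]

    def chance_agreement(table):
        n = float(sum(sum(row) for row in table))
        k = len(table)
        return sum((sum(table[i]) / n) *
                   (sum(row[i] for row in table) / n) for i in range(k))

    parent = [[20, 3, 2], [4, 15, 1], [2, 2, 11]]        # made-up 3 x 3 tallies
    kappas = [cohens_kappa(collapse_to_2x2(parent, i)) for i in range(3)]
    weights = [1 - chance_agreement(collapse_to_2x2(parent, i)) for i in range(3)]
    avg = sum(w * k for w, k in zip(weights, kappas)) / sum(weights)
    # avg equals cohens_kappa(parent), up to floating-point rounding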
Note that partial-interval sampling (see “Partial-interval or one-zero
sampling” in Chapter 3) also requires a 2×2 table approach. With partial-
interval sampling, an interval can be coded for more than one code from an ME&E set, yet the kappa computation requires that each interval contribute
only one tally to the kappa table. In this case, define 2×2 tables, one for each
code; then tally for each code separately and compute kappas separately for
each table.

when is kappa big enough?


Once observers independently code the same sample of behavior, or
one observer codes behavior for which gold-standard coding exists, and
once agreements and disagreements are tallied and kappa computed, two
questions remain. First, is the value of kappa sufficient to satisfy us and others
that our observers agree – if not perfectly, at least well enough? And second,
what standards should we apply to the values of kappa we compute?

Is Statistical Significance Useful?


Statistical significance is a criterion often used by behavioral scientists. If we compare the performance of one group with another and find that the difference between means computed for the two groups would occur by chance (i.e., sampling error) less than 5 percent of the time – if in fact there were no difference between the groups – we claim an effect statistically significant at the .05 level and proceed to publish. Will this work for kappa? The brief answer is no. True, the standard error of kappa has been described (Fleiss, Cohen, & Everitt, 1969; see also Bakeman & Gottman, 1997, p. 65) – which means that a standardized kappa could easily be computed. However, statistical significance for kappa is rarely reported; as Bakeman and Gottman note, even relatively low values of kappa can still be significantly different from zero, but not of sufficient magnitude to satisfy investigators.
If statistical significance is not a useful guide, what is? The literature does, in fact, contain categorical terms for specific magnitudes. For example, Landis and Koch (1977) characterized values less than 0 as indicating no agreement, 0–.20 as slight, .21–.40 as fair, .41–.60 as moderate, .61–.80 as substantial, and .81–1 as almost perfect agreement – but offered no rationale for these terms. Fleiss’s (1981) guidelines seem more reasonable; he characterized kappas less than .40 as poor, .40–.75 as fair to good, and over .75 as excellent – but again provided no rationale. In fact, no absolute guideline is satisfactory. Factors other than chance agreement can affect kappa’s magnitude, as we detail shortly, and so the interpretation of a particular magnitude of kappa is not straightforward (Sim & Wright, 2005).
Factors that can affect the value of kappa include: (1) observer accuracy – the very thing we hope to assess, (2) the number of codes in the set, (3) the prevalence of individual codes – are events equiprobable or are some events more likely than others (i.e., independently of the observed row and column probabilities for the kappa table, are the “true” population probabilities for the different codes fairly similar or more variable), and (4) observer bias – do observers distribute the K codes similarly or do they favor codes differently (i.e., are the observed row probabilities for the kappa table similar to or different from the corresponding column probabilities)? Yet another factor is: (5) the procedural matter of observer independence (after prevalence and bias, this is the third factor affecting magnitude that Sim and Wright, 2005,
list). To state the obvious, when assessing observer agreement, observers must code without knowledge of the other observer’s coding (or knowledge of the gold standard’s coding).

Observer Bias and Kappa Maximum


If observers were accurate, they would distribute codes similarly and the corre-
sponding row and column probabilities (i.e., the marginal probabilities) of the
kappa table would match. Figure 5.1 demonstrates observer bias: 18 percent
of the intervals were coded Alert by Observer 1, but 23 percent by Observer
2, for example, whereas without observer bias, these percentages would be the
same. Observer bias is important because it limits the maximum value kappa
can attain. Although possible values of kappa vary from –1 to +1, kappa can
equal 1 only when there is perfect agreement and no observer bias.
Kappa maximum (Umesh, Peterson, & Sauber, 1989) is the value of
kappa when observers agree to the maximum extent possible given obser-
ver bias. It is defined as

\kappa_{max} = \frac{P_{max} - P_c}{1 - P_c}, \quad \text{where} \quad P_{max} = \sum_{i=1}^{K} \min(p_{+i}, p_{i+})

where Pc = the probability of chance agreement, as defined earlier, and the min function selects the minimum of either the corresponding row or column probability. For the data in Figure 5.1, κmax = .92675 and Pmax = .94167
(five significant digits shown for illustration). Even if the observers agreed
at every opportunity afforded by the differing marginals, kappa cannot
exceed this value. Kappa cannot be 1 because, when the observers’ prob-
abilities for corresponding codes differ, some tallies necessarily spill over
into off-diagonal cells.
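In sketch form (ours, not GSEQ code), kappa maximum needs only the two observers’ marginal probabilities; those below are the Figure 5.1 marginals:

    def kappa_max(row_p, col_p):
        # Kappa maximum from row (Observer 1) and column (Observer 2) marginals
        p_max = sum(min(r, c) for r, c in zip(row_p, col_p))
        p_chance = sum(r * c for r, c in zip(row_p, col_p))
        return (p_max - p_chance) / (1 - p_chance)

    rows = [22/120, 28/120, 21/120, 28/120, 21/120]   # Observer 1: Alert ... Sleep
    cols = [27/120, 30/120, 17/120, 25/120, 21/120]   # Observer 2: Alert ... Sleep
    print(kappa_max(rows, cols))                      # about .92675, as in the text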
Other things being equal, the more observer bias (i.e., the more row and
column probabilities differ), the lower the computed value of kappa will be:
Its effect on magnitude is simple and direct. The difference between 1 and
κmax reflects the extent to which the observers’ ability to agree is constrained
by observer bias – and can be regarded as a measure of observer bias. Low
values of κmax suggest observer retraining may be in order. Whatever the
value of κmax, its value helps interpret the magnitude of kappa; thus, it makes
sense to report both. One temptation to resist: Dividing a value of kappa by
its κmax, hoping thereby to “adjust” for its maximum value, serves only to
hide the observer bias that exists.
Observer Accuracy, Number of Codes, and Their Prevalence


Especially when training and checking observers, our main concern
should not be the magnitude of kappa, but the level of observer accur-
acy we regard as acceptable. (A quantitative definition of accuracy is
the probability that X was coded, given that X occurred.) Any judgment
is arbitrary, but 80 percent is a good candidate for a minimum level of
acceptability. Gardner (1995) characterized 80 percent as discouragingly
low “but possibly representative of the accuracy of classification for some
social behaviors or expressions of affect” (p. 347). It seems reasonable to
expect better, and – although 100 percent accuracy will likely elude us –
85 percent, 90 percent, or even 95 percent accuracy may represent rea-
sonable goals.
To provide a principled guide for acceptable values of kappa, we and
our colleagues computed expected values of kappa given various circum-
stances (Bakeman, Quera, McArthur, & Robinson, 1997; computations in
the article, and also in this section, were effected with the FalliObs pro-
gram described in the article). Our computations were based on Gardner
(1995) and, in effect, simulated observer decision making. The simulated
observers were fallible; their accuracy was determined with a parameter
whose value we set at 80 percent, 85 percent, 90 percent, and 95 percent.
Two other parameters were the number of codes (K) and their prevalence.
We let K vary from 2 through 10 because values in this range are often
encountered in the literature. The lower values established a baseline (K
= 2 turned out to represent the worst-case scenario), and the other values
seemed sufficient to establish trends, which presumably would extrapolate
to larger values.
We then defined three types of prevalence (i.e., codes’ assumed popu-
lation probabilities): equiprobable, moderately varied, and highly varied.
For equiprobable, all codes had the same underlying probability, specific-
ally 1/K. For moderately varied and highly varied, p = 0.5/K and 0.25/K for
the least likely and 1.50/K and 1.75/K for the most likely code, respect-
ively, with other probabilities assigned graduated intermediate values. We
then computed expected values for kappa for various combinations of
observer accuracy, number of codes, and prevalence variability. Results
of the Bakeman et al. (1997) computations are described in the next few
paragraphs and are the basis for the standards for kappa we recommend.
(Bakeman et al. also describe a downloadable program, FalliObs, that lets
users enter their own values for key parameters.)
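As an illustrative sketch (ours; the function and its spread parameter are our own notation, not FalliObs’s), evenly graduated prevalence values of the kind just described can be generated like this:

    def graduated_prevalence(k, spread):
        # Probabilities stepping evenly from (1 - spread)/k up to (1 + spread)/k;
        # they are symmetric around 1/k, so they always sum to 1 (requires k >= 2)
        low = (1 - spread) / k
        step = 2 * spread / (k * (k - 1))
        return [low + i * step for i in range(k)]

    print(graduated_prevalence(4, 0.50))   # moderately varied: 0.5/K ... 1.5/K
    print(graduated_prevalence(2, 0.75))   # highly varied, K = 2: [0.125, 0.875]

With K = 2 and a spread of 0.75, the least likely code has probability .125 – consistent with the 12.5 percent figure used later in this chapter for the highly variable, two-code case.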
[Figure 5.3 is a line graph: expected kappa (y-axis, .00 to 1.00) plotted against number of codes (x-axis, 1 to 10), with separate lines for equiprobable, moderately variable, and highly variable prevalence.]

Figure 5.3. Expected values for kappa when number of codes and their prevalence varies as shown for observers who are 95% accurate (top set of lines), 90% accurate (2nd set), 85% accurate (3rd set), and 80% accurate (bottom set).

Standards for Kappa (Number of Codes Matters)


Computations that are incorporated in a computer program, with its blood-
less observers of known accuracy, can provide answers to questions that
would prove difficult or impossible to address in the flesh-and-blood world.
The computations detailed in Bakeman et al. (1997) and the results shown
in Figure 5.3 (the figure is not from the article; it was prepared specifically
for this book) are a case in point. They indicate, in particular, the import-
ance of the number of codes when interpreting kappa.
When the number of codes is five, six, or greater, prevalence variability
matters little, and increasing the number of codes results in increasingly
small increases in kappa. Values of kappa appear to reach an asymptote
of approximately .60, .70, .80, and .90 for observers who are 80 percent,
85 percent, 90 percent, and 95 percent accurate, respectively (if anything,
these values are a bit conservative, especially for smaller values of K). For
example, with five codes, if you want observers who are at least 85 percent
accurate, you should require kappas of about .65.
On the other hand, when the number of codes is less than five, and espe-
cially when K = 2, lower values of kappa are acceptable but prevalence vari-
ability also needs to be considered. For only two codes, expected values
of kappa for observers who are 80 percent, 85 percent, 90 percent, and 95
percent accurate are .20, .30, .44, and .65, respectively, when prevalence
is highly variable. Corresponding values when prevalence is moderately
variable are .30, .42, .57, and .76; and when prevalence is equiprobable are .36, .49, .64, and .81 (for other values, see Appendix A).
Our computations make the simplifying assumptions that both observers
were equally accurate and unbiased, that codes were detected with equal
accuracy, that disagreements were equally likely, and that when prevalence
varied it did so with evenly graduated probabilities. To the extent these
assumptions seem reasonable, even when not met perfectly in practice, the
computed values should provide reasonable estimates for expected values
of kappa.
The low values of kappa expected with reasonable observer accuracy, but
with only two codes, may surprise some readers (but probably not investi-
gators actively involved in observer training). They certainly give encour-
agement to observers in training who have been told that an acceptable
kappa must be at least .60 (or some other arbitrary value). Note also that it
is the effective K – not the actual K – that may be at issue. If by chance two
observers were asked to code an infant session during which the infant was
asleep, the 5-category infant state scheme becomes in effect a 2-category
system (REM and Sleep), for which a kappa of .44 would suggest observer
accuracy of 90 percent (assuming REM occurred no less than 12.5 percent – our definition of highly variable when K = 2).
As Figure 5.3 shows, values of expected kappa increase with the number of codes, other things being equal. Consequently, it is puzzling that the opposite has been claimed, and it is instructive to understand why.
Sim and Wright (2005), citing Maclure and Willett (1987), wrote that “The
larger the number of scale categories, the greater the potential for disagree-
ment, with the result that unweighted kappa will be lower with many cat-
egories than with few” (p. 264). Maclure and Willett presented a 12×12
kappa table tallying agreements and disagreements for ordinal codes. They
then collapsed adjacent rows and columns, producing first a 6×6, then a
4×4, a 3×3, and a 2×2 table. As expected with ordinal codes, disagreements
were not randomly distributed in off-diagonal cells but clustered more
around the diagonal and became less likely in lower-left and upper-right
cells. Not surprisingly, kappas computed for this series of collapsed tables
increased (values were .10, .23, .33, .38, and .55 for the range of tables from
12×12 to 2×2, respectively). Maclure and Willett wrote that “Clearly, in
this situation, the values for Kappa are so greatly influenced by the num-
ber of categories that a four-category-Kappa for ordinal data cannot be
compared with a three-category Kappa” (p. 163). Note that Maclure and
Willett did not claim that kappa would be lower with more codes gener-
ally, but only in the situation where ordinal codes are collapsed. In terms
of the expected kappa computations presented in previous paragraphs, we understand in this case that collapsing does not preserve, but in fact creates, the appearance of increased observer accuracy – which was Maclure and Willett’s point. To illustrate, accuracies were 38 percent, 57 percent, 68 percent, 74 percent, and 87 percent for the range of tables from 12×12 to 2×2, respectively (computed with the FalliObs program described in Bakeman et al., 1997).

comparing an observer with a gold standard


Throughout this chapter we have noted that observer agreement could be
assessed either by comparing two observers with each other or by com-
paring one observer with a gold standard. In either case, a kappa could be
computed. But the expected kappa computations presented in the previous
paragraphs apply only when two independent and equally fallible observers
are compared. Expected values of kappa when one observer is compared to
a gold standard are higher because, instead of comparing two equally fal-
lible observers, only one fallible observer is compared with a gold standard
assumed to be 100 percent accurate. See Appendix B for estimated values
in this case.
Which approach is better? The gold-standard approach is much less com-
mon for two primary reasons. First, preparation of a gold-standard version
requires that we assume to know the truth. Second, it also requires a consid-
erable investment of resources. Preparing sufficient gold-standard sessions
for training and subsequent reliability checks can be time-consuming for
both investigators and observers. Comparing observers, who are to some
extent fallible, is the far more common approach – and may reflect, not just
greater humility, but also a reasonable allocation of resources. Nonetheless,
there is one type of error that only a gold-standard approach can detect.
Imagine that two observers tend to make the same mistakes (perhaps due
to observer drift; see Boice, 1983), in which case interobserver agreement
would be misleadingly high. Only comparison with a gold standard (or, as
is certainly desirable, occasional random review of all coding by an investi-
gator) would detect this circumstance.
Arguably, our science might be better served by greater reliance on
a gold-standard approach. When a research endeavor spans years and
involves multiple investigators and sites, training new observers and check-
ing old ones is greatly facilitated when their work can be compared to an
unvarying, archived standard. This requires that the multiple investigators
share common concepts and devote time and effort to common training
and coordination (e.g., the National Institute of Child Health and Human
Development’s Study of Early Child Care, https://secc.rti.org). The poten-
tial reward can be a coherent, cumulative contribution to knowledge. From
this point of view, comparing fallible observers is the more common case,
because our research endeavors rarely represent sustained, coordinated
group efforts.

Agreement and Reliability


A terminological comment: According to Bakeman et al. (1997), when the
kappa table summarizes judgments of two independent observers, kappa
is most conservatively regarded as an index of interobserver agreement. It
could be regarded as an index of observer reliability – that is, as an index of the extent to which measurement error is absent from the data (Nunnally, 1978) – but this would require meeting the essentially untestable classical
parallel test assumptions (Suen, 1988). In contrast, when an observer is
compared with a gold-standard protocol, kappa would indeed be an index
of observer reliability. Notwithstanding, kappa for two observers is com-
monly referred to as an index of reliability.

Errors of Commission and Omission


Gold standards allow for identification of errors of commission (the
observer detected a code not included in the standard) and omission (the
observer failed to detect a code included in the standard). Without a gold
standard, we cannot be certain whether commission-omission errors have
occurred. With a gold standard, we can tally the number of hits or cor-
rect detections (cell a in Figure 5.4), commission errors or false positives
(cell b), and omission errors or false negatives (cell c). If data are interval recorded, we can also tally the number of intervals for which the
observer agreed with the standard that the code did not occur (cell d).
We are now in a position to compute sensitivity, specificity, or both – statistics adopted from epidemiology that can provide useful information about the observer’s performance. Sensitivity – a/(a+c) – is the proportion of X events correctly detected (or intervals correctly coded X) by the observer; it can be computed for untimed-event, timed-event, and interval recorded data. Omission errors decrease sensitivity. Specificity – d/(b+d) – is the proportion of intervals correctly coded by the observer as not X; it can be computed only for interval recorded data. Commission errors decrease specificity.
                             Gold standard lists event X or codes interval for X
Observer detected event X
or coded interval for X      Yes                          No

Yes                          (a) Correct, exists;         (b) Commission error,
                             event detected or            false positive; event
                             interval coded correctly     detected or interval
                                                          coded incorrectly

No                           (c) Omission error,          (d) Correct, does not
                             false negative; event        exist; event not detected
                             not detected or interval     or interval not coded
                             not coded incorrectly        correctly

Figure 5.4. Sensitivity-specificity table. Cell d can be tallied only for interval recorded data.
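In code (a minimal sketch of ours, with cell counts labeled as in Figure 5.4):

    def sensitivity(a, c):
        # Proportion of the gold standard's X entries the observer also coded X
        return a / (a + c)

    def specificity(b, d):
        # Proportion of the gold standard's not-X intervals the observer coded not-X
        return d / (b + d)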

summary
Observer accuracy is rightly called the sine qua non of observational
research, and there are at least three major reasons why it is essential: first,
to assure ourselves that the coders we train are performing as expected;
second, to provide coders with the accurate feedback they need to improve
(and ourselves with information that may lead us to modify coding
schemes); and third, to convince others, including our colleagues and jour-
nal editors, that they have good reason to take our results seriously.
To assess observer accuracy, usually two observers are compared with
each other, but an observer’s coding could also be compared with a gold-
standard protocol that is presumed accurate. Gold standards take time to
prepare and confirm, but have advantages when coding spans considerable
time or different research teams and venues.
In either case, agreement is of two kinds. Point-by-point agreement
focuses on whether observers agree with respect to the successive intervals
or events coded (or rated), assumes nominal (or ordinal) measurement,
and primarily relies on some form of the kappa statistic. Point-by-point
agreement is especially useful for observer training prior to data collection
and for ongoing checking once data collection begins. In contrast, summary
agreement focuses on whether corresponding summary statistics agree. It
assumes at least ordinal or, more typically, interval or ratio scale measure-
ment and primarily relies on some form of the intraclass correlation coef-
ficient (ICC). Summary agreement is especially useful when a project has
moved from data collection to analysis and when the reliability of particular
summary scores is at issue. Point-by-point agreement may be sufficient; it is
often accepted as evidence that summary measures derived from sequential
data will be reliable – probably because point-by-point agreement seems
the more stringent approach.
The statistic most commonly used for point-by-point agreement is
Cohen’s kappa (Cohen, 1960) or some variant of it. Cohen’s kappa is a
summary statistic that assesses how well two observers agree when asked
to independently assign one of K codes from a ME&E set to N discrete
entities. The N observer decisions are tallied in a K×K contingency table,
called a kappa table (also, agreement or confusion matrix). Cohen’s kappa
corrects for agreement due to chance – which makes it preferable to percentage agreement, which does not. Values from –1 to +1 are possible;
positive values indicate better-than-chance agreement, near-zero values
indicate near-chance agreement, and negative values indicate worse-than-
chance disagreement.
Factors that affect values of kappa include observer accuracy and the
number of codes (the two most important), as well as codes’ individual
population prevalences and observer bias (how observers distribute individ-
ual codes). The maximum value of kappa is limited by observer bias; kappa
can equal 1 only when observers distribute codes equally. There is no one
value of kappa that can be regarded as universally acceptable; it depends of
the level of observer accuracy you want and the number of codes (i.e., num-
ber of alternatives among which observers select). Tables in Appendixes A
and B provide expected values of kappa for different numbers of codes and
varying observer accuracy.
6

Kappas for Point-by-Point Agreement

In the previous chapter we introduced Cohen’s (1960) classic kappa and discussed a number of issues associated with it and its use. In this chapter
we describe how kappa can be used with each of the data types described in
Chapter 4. The basic kappa computation remains essentially unchanged, but
some data types require matching and tallying procedures that differ from
the classic Cohen’s kappa. Where procedures differ, we regard the result as a
variant of Cohen’s kappa and, to make the distinction clear, provide a name
for that variant. To provide orientation (and to let you select which sections
of this chapter may best fit your circumstances), Figure 6.1 links data types
with their appropriate kappa variant. The remaining sections of this chapter
detail appropriate kappa procedures for each data type.

event-based agreement: the alignment problem


When events are demarcated before coding, as turns in a transcript are, the classic Cohen’s kappa described in the previous chapter is appropriate. However, when observers first segment the stream of behavior into events – that is, decide where the seams between events occur – and only then code the events, the situation is more complicated.
Specifically, when observers simply code events in sequence without
recording duration or other timing information, the result is a single string
of codes. Such single-code event sequential data seem the simplest sort of
sequential data possible, and you might think that assessing their agree-
ment would be equally simple – but surface simplicity can be misleading.
The problem is one of assumptions. The classic Cohen’s kappa assumes that
pairs of coders make decisions about already demarcated entities (units of
some sort) and that the number of decisions is the same as the number of
tallies in the kappa table. This decision-making model fits event-recorded

72
Data type             Kappa variant

Single-code event     Classic Cohen’s kappa
Multicode event         (if events are previously demarcated)
                      Alignment kappa
                        (if they are not)

Timed-event           Time-unit kappa
State                 Time-unit kappa with tolerance
                      Timed-event alignment kappa

Interval              Classic Cohen’s kappa
                        (for each ME&E set)

Figure 6.1. Sequential data types and the appropriate kappa variant for each. To apply the event alignment algorithm (see text for details) to multicoded events, co-occurring events must first be recoded into single ones.

data (untimed or timed) only when previously demarcated events are pre-
sented to coders – for example, as turns of talk in a transcript. Usually, however, events are not “prepackaged.” Instead, as noted in the previous paragraph, observers first are asked to segment the stream of behavior into events and only then to code those events. The two observers’ records frequently contain different numbers of events due to commission-omission errors – one observer claims an event, the other does not. But even if commission-omission errors are absent, exactly how the events align is not always certain. And when alignment is uncertain, how to pair and tally events in the kappa table is unclear.
This is a long-standing problem. Bakeman and Gottman (1997) wrote that, especially when agreement is not high, alignment is difficult and requires subjective judgment. However, we have now developed an algorithm that
determines the optimal global alignment between two single-code event
sequences without subjective judgment (Quera, Bakeman, & Gnisci, 2007).
The problem is not unique to behavioral observation. In fact, an algorithm
that provides the optimal matching or alignment between two sequences
was developed independently by several researchers from different fields
during the 1970s (Sankoff & Kruskal, 1999) and has been re-invented sub-
sequently (e.g., Mannila & Ronkainen, 1997). Molecular biologists know
it as the Needleman-Wunsch algorithm and use it routinely for genetic
sequence alignment and comparison (Needleman & Wunsch, 1970; see also
Durbin, Eddy, Krogh, & Mitchison, 1998, and Sankoff & Kruskal, 1999).
The Needleman-Wunsch algorithm, on which our alignment algorithm is
based, belongs to a broad class of methods known as dynamic programming.
With these methods, the solution for a specific subproblem can be derived
from the solution for another immediately preceding subproblem. This approach provides a practical way to select an optimal alignment from
among the almost astronomical number of all possible alignments. It can
be demonstrated that the Needleman-Wunsch algorithm guarantees an
optimal solution: It finds the alignment with the highest possible number
of agreements between sequences (Sankoff & Kruskal, 1999, p. 48). And it
does so without being exhaustive: It does not need to explore all possible
alignments (Galisson, 2000). (Note, Dijkstra and Taris, 1995, proposed an
alternative to the dynamic programming algorithm, but it yields a measure
of agreement only, not the optimal alignment between the sequences.)
The way our alignment algorithm works is elegant and principled – and takes time to understand in detail. In this and the following three paragraphs we present an overview, but for further particulars, see Quera et al.
(2007). Assume that single-code event sequences coded by Observers 1
and 2 are to be aligned. The sequences are referenced as S1 and S2, respect-
ively, and consist of n1 and n2 events, with each event assigned one of K
codes. Alignment proceeds step by step. It starts with S1, transforms an
element at each step, and stops when the sequence has been converted
into S2. The distance between the two sequences is gauged by the number
of steps. The algorithm is designed to find an optimal distance€– in other
words, to minimize steps. Alignment results from the transformations; it
pairs agreements and disagreements and identifies events coded only by
Observer 1 or only by Observer 2 (in transformation terms, events coded
only by Observer 1 are called deletions and events coded only by Observer
2 are called insertions).
Four transformations are possible: (1) agreement or identity transformation – a code from S1 is paired with an identical code from S2 and the common code is inserted in the new sequence; (2) disagreement or substitution – a code from S1 is paired with a different code from S2 and the S2 code is inserted in the new sequence; (3) deletion – a code from S1 is paired with no code from S2 and a hyphen (instead of the S1 code) is inserted in the new sequence; and (4) insertion – no code from S1 is paired with a code from S2 and the S2 code is inserted in the new sequence (but a hyphen is inserted in the S1 sequence). From the point of view of Observer 1, a deletion is an error of omission and an insertion is an error of commission on the part of Observer 2.
Four matrices guide the process. The first is the weight or cost matrix, W,
whose rows and columns are indexed r and c with values zero to K (Column
0 is for deletions, Row 0 for insertions). In principle, the weight matrix gives
us the option of assigning different weights for each substitution, deletion,
and insertion. In practice, and in the context of aligning event sequences, we think one set of weights makes the most sense. All agreements on the diagonal are set to zero, of course (wrc = 0 for r = c = 1 ... K), all disagreements to one (wrc = 1 for r, c = 1 ... K, r ≠ c), and deletions and insertions (omissions and commissions) to two (wr0 = 2 for r = 1 ... K; w0c = 2 for c = 1 ... K). Thus commission-omission errors are regarded more seriously than disagreements – which we think best reflects what investigators expect of observer agreement (for other options and further discussion, see Quera et al., 2007).
The three additional matrices accumulate distances and lengths (D and L)
and record pointers (P) from which, at the end, the alignment is recovered;
these matrices have n1+1 rows and n2+1 columns, indexed from zero. Row
0 indicates insertions, Rows 1 to n1 are labeled with the codes in the S1
sequence, Column 0 indicates deletions, and Columns 1 to n2 are labeled
with the codes in the S2 sequence. Which sequence is labeled S1 and which
S2 is arbitrary; results are the same no matter which sequence is labeled S1.
Accumulated distance and length values determine which transformation is
selected at each step. Again, for further details, see Quera et al. (2007).
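To convey the flavor of the dynamic programming step, here is a generic Needleman-Wunsch-style sketch (ours, using the 0/1/2 weights described earlier; it is not the GSEQ implementation, which also tracks alignment lengths and resolves ties differently, so its alignments can differ in detail). The example sequences are S1 and S2 from Figure 6.2:

    def align(s1, s2, w_sub=1, w_indel=2):
        # Optimal alignment minimizing total cost: 0 for an agreement,
        # w_sub for a disagreement, w_indel for an omission or commission
        n1, n2 = len(s1), len(s2)
        D = [[0] * (n2 + 1) for _ in range(n1 + 1)]  # D[i][j]: cost of s1[:i] vs s2[:j]
        for i in range(1, n1 + 1):
            D[i][0] = i * w_indel
        for j in range(1, n2 + 1):
            D[0][j] = j * w_indel
        for i in range(1, n1 + 1):
            for j in range(1, n2 + 1):
                sub = 0 if s1[i - 1] == s2[j - 1] else w_sub
                D[i][j] = min(D[i - 1][j - 1] + sub,
                              D[i - 1][j] + w_indel, D[i][j - 1] + w_indel)
        a1, a2, i, j = [], [], n1, n2                # trace back one optimal path
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and D[i][j] ==
                    D[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else w_sub)):
                a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
            elif i > 0 and D[i][j] == D[i - 1][j] + w_indel:
                a1.append(s1[i - 1]); a2.append('-'); i -= 1   # omission by Obs 2
            else:
                a1.append('-'); a2.append(s2[j - 1]); j -= 1   # commission by Obs 2
        return ''.join(reversed(a1)), ''.join(reversed(a2))

    print(align('ASFSFASCFAFSCAFSA', 'ASCFSCASCFCAFASFAFA'))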
Step-by-step application of the algorithm becomes detailed and complex,
and is best left to computers (it is implemented in GSEQ; given single-code
event data, its Compute kappa procedure produces an event alignment), but
a simple example can at least show what results. Assume that two observers
coded infant state using the four codes Alert, Cry, Fussy, and Sleep (ignoring
the sleep-REM distinction in an earlier example). Assume further that the
first observer detected and coded seventeen states (this event sequence is
labeled S1 in Figure 6.2), and the second observer detected and coded nine-
teen states (labeled S2 in Figure 6.2). Due to commission-omission errors,
the optimal alignment produced by the algorithm for these two sequences
consists of twenty states (see Figure 6.2). The fourteen agreements are indi-
cated with vertical bars and the two actual disagreements with two dots
(i.e., a colon), but there were four additional errors: the algorithm estimated
that Observer 1 missed three states that Observer 2 detected (these three
errors are indicated with hyphens in the top line of the alignment plot), and
Observer 2 missed one state that Observer 1 detected (this error is indi-
cated with a hyphen in the bottom line of the plot).
Now that the two event sequences are aligned, we can tally agreements
and disagreements in the usual kappa table and compute kappa (see Figure
6.2) – but with two caveats. First, the nil-nil cell is a structural zero (indi-
cated with a dash); logically it cannot occur. Without a gold standard, we
cannot know how often both observers missed a state that “really” existed,
Observer 2’s codes

Obs. 1’s codes   Nil  Alert  Cry  Fussy  Sleep  Total
Nil               —     1     2     0      0      3
Alert             0     5     0     0      0      5
Cry               0     0     1     1      0      2
Fussy             0     0     1     4      0      5
Sleep             1     0     0     0      4      5
Total             1     6     4     5      4     20

Event sequences:
S1 = ASFSFASCFAFSCAFSA
S2 = ASCFSCASCFCAFASFAFA

Event alignment:
AS-FSFASCF-AF-SCAFSA
|| ||:|||| || |:|| |
ASCFSCASCFCAFASFAF-A

Figure 6.2. Two single-code event sequences, their alignment per the dynamic programming algorithm as implemented in GSEQ, and the kappa table resulting from tallying agreement between successive pairs of aligned events. For this example, alignment kappa = .62. See text for other details.

so necessarily we regard this cell as structurally zero. Consequently, the expected frequencies required by the kappa computation cannot be estimated with the usual formula (see “The classic Cohen’s kappa” in Chapter 5), but require an iterative proportional fitting (IPF) algorithm instead; see Bakeman & Robinson, 1994. Second, because Cohen’s assumptions are not
met, we should not call this a classic Cohen’s kappa: alignment kappa seems
the better term (specifically, a dynamic programming alignment kappa
for single-code event sequences). As noted in the previous paragraph, the
appropriate algorithm and computations are implemented in GSEQ, which
was used to determine the alignment and compute alignment kappa for this
example.
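To give a flavor of the IPF step mentioned above (a generic sketch of ours, not the GSEQ code): expected counts are fit to the observed row and column totals while the nil-nil cell stays fixed at zero.

    def ipf_expected(table, zero_cell=(0, 0), iterations=50):
        # Expected counts matching the observed margins, with one structural zero
        k = len(table)
        row_tot = [sum(row) for row in table]
        col_tot = [sum(row[j] for row in table) for j in range(k)]
        e = [[0.0 if (i, j) == zero_cell else 1.0
              for j in range(k)] for i in range(k)]
        for _ in range(iterations):
            for i in range(k):                     # scale each row to its total
                s = sum(e[i])
                if s > 0:
                    e[i] = [x * row_tot[i] / s for x in e[i]]
            for j in range(k):                     # scale each column to its total
                s = sum(e[i][j] for i in range(k))
                if s > 0:
                    for i in range(k):
                        e[i][j] *= col_tot[j] / s
        return e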
The term structural zero appeared above. The distinction between empirical and structural zeros is important, not just here but later,
and so is worth highlighting. Empirical zeros occur when a value happens
to be zero, perhaps due to a small sample size; it might have been some
other value, perhaps in a larger sample, but in the case at hand it is zero.
In contrast, zeros are called structural when logically they cannot occur;
the structure of the situation is such that no other value is possible. For
example, the number of pregnant men in a sample provides an example of
a structural zero.
Observer 2’s codes

Obs. 1’s codes  Alert  Cry  Fussy  Sleep  Total
Alert            479     6    11      0    496
Cry                2    21    62      2     87
Fussy             68    34   104     11    217
Sleep              9     3    30    358    400
Total            558    64   207    371   1200

Timed state sequences:
<S1> A=12 S=27 F=10 S=106 F=14 A=260 S=198 C=32
     F=23 A=76 F=122 S=52 C=55 A=106 F=48 S=17 A=42 /
<S2> A=14 S=22 C=13 F=5 S=108 C=11 A=263 S=193
     C=21 F=15 C=19 A=68 F=61 A=63 S=48 F=61 A=113
     F=65 A=37 /

Figure 6.3. Two timed-event 20-minute sequences (in this case, state sequences) with durations in seconds, and the kappa table resulting from tallying agreement between successive pairs of seconds with no tolerance. For this example, time-unit kappa = .70. See text for other details.

time-based agreement: inflated counts?


Compared to single-code event sequential data, timed-event sequential
data present somewhat different issues. Given timed-event sequential data,
observer agreement can be approached in two ways: one time-based and
the other event-based. The time-based approach was presented in Bakeman
and Gottman (1997). It depends on the discrete view of time reflected in
the code-unit grid described in Chapter 4. If both observers’ data are rep-
resented as a sequence of coded time intervals, where the interval duration
is determined by the precision with which time was recorded (see “Timed-
event and state sequences” in Chapter 4), then the data are formally iden-
tical with interval recorded data, and there is no problem with alignment.
Agreements and disagreements between successive pairs of time units are
tallied and entered in a kappa table, and kappa is then computed.
An example is presented in Figure 6.3. Assume, as in the previous
example, that two observers detected infant states and coded them as Alert,
Cry, Fussy, and Sleep; this time, however, they recorded duration in seconds,
as shown in Figure 6.3. Tallies for the 1,200 seconds (20 minutes) are shown
in the kappa table. For this example, kappa is .70, but again, we would not
call this a classic Cohen’s kappa: time-unit kappa seems better.
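A small sketch of the time-unit tally (ours; the input format mimics the Figure 6.3 records, and cohens_kappa() is the sketch from Chapter 5):

    def expand(states):
        # [('A', 12), ('S', 27), ...] -> 'AAAA...SSS...', one code per second
        return ''.join(code * duration for code, duration in states)

    def tally(seq1, seq2, codes):
        # Tally second-by-second pairs into a kappa table (rows: Observer 1)
        index = {c: i for i, c in enumerate(codes)}
        table = [[0] * len(codes) for _ in codes]
        for c1, c2 in zip(seq1, seq2):             # both full records span 1,200 s
            table[index[c1]][index[c2]] += 1
        return table

    s1 = expand([('A', 12), ('S', 27), ('F', 10), ('S', 106)])  # start of <S1>
    s2 = expand([('A', 14), ('S', 22), ('C', 13), ('F', 5)])    # start of <S2>
    # With the full Figure 6.3 records, cohens_kappa(tally(s1, s2, 'ACFS'))
    # returns the time-unit kappa of .70 reported above.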
Exact second-by-second (or, more correctly, time-unit-by-time-unit)
matching may seem too stringent. Some tolerance may be desirable. When
no tolerance is specified, we examine each successive pair of time units for the two observers and add a tally to the appropriate cell of the kappa table, as in Figure 6.3. When some tolerance is specified, we examine each suc-
as in Figure 6.3. When some tolerance is specified, we examine each suc-
cessive time unit for the first observer and tally an agreement if there is
a match with any time unit for the second observer that falls within the
stated tolerance. The effect is to move some tallies of the agreement matrix
from off-diagonal to on-diagonal cells, thereby giving credit for near misses
and increasing the magnitude of kappa. A tolerance of two time units, for
example, results in a 5-second window (if time units are seconds): the cur-
rent time unit, two before, and two after.
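One plausible reading of this rule in sketch form (ours; it assumes two equal-length second-by-second records, such as those produced by the expand() sketch earlier):

    def tally_with_tolerance(seq1, seq2, codes, tol):
        # Credit Observer 1's unit t as an agreement if Observer 2 used the same
        # code anywhere within units t - tol ... t + tol; otherwise tally the two
        # codes actually recorded at unit t as a disagreement
        index = {c: i for i, c in enumerate(codes)}
        table = [[0] * len(codes) for _ in codes]
        for t, c1 in enumerate(seq1):
            window = seq2[max(0, t - tol): t + tol + 1]
            c2 = c1 if c1 in window else seq2[t]
            table[index[c1]][index[c2]] += 1
        return table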
Consistent with previous terminology, time-unit kappa with tolerance
seems the appropriate term for kappas computed from tolerance tallies. A
concern is that the value of time-unit kappa with tolerance varies slightly
depending on which of the two observers is regarded as the first. To address
this concern, you can compute the value of kappa twice – once with each observer as the first – and report both values (as GSEQ does). Typically the two values vary only slightly. For example, for the sequences given in Figure 6.3, when a 2-second tolerance was specified, the two kappas were .74 and .73.
One aspect of time-unit kappa seems troublesome. As noted earlier,
with the classic Cohen model, the number of tallies represents the number
of decisions coders make, whereas with time-unit kappa, the number of tal-
lies represents the length of the session (e.g., when time units are seconds,
a 20-minute session generates 1,200 tallies). With timed-event recording,
observers are continuously looking for the seams between events, but how
often they are making decisions is arguable – and most likely unknowable. One decision per seam seems too few – the observers are continuously alert – but one per time unit seems far too many.
Another possible concern is the arbitrariness of the time unit and its
relation to the number of tallies. For example, if we increased precision
from seconds to tenths of a second, the number of tallies would increase
tenfold. Still, it is some comfort that multiplying all cells in a kappa table
by the same factor does not change the value of kappa, as Bakeman and
Gottman (1997) demonstrated.

event-based agreement for timed-event sequences


As noted earlier, observer agreement for timed-event sequential data can
be approached in two ways. The first is time-based, as described in the
previous section. The second is event-based, which is described in this
section, and has the merit of bringing the number of tallies more in line
with the number of decisions, but usually results in lower values for kappa.
As with single-code event sequential data, first events would need to be
aligned – but taking time into account. Compared to tallying time units,
tallying agreements and disagreements between aligned events probably
underestimates the number of decisions observers actually make, but at
least the number of tallies is closer to the number of events coded than for
time-unit kappa.
To effect the required alignment, we modified our single-code event
alignment algorithm to take onset and duration times into account so
that it would work with timed-event sequential data (Bakeman, Quera, &
Gnisci, 2009). The modified algorithm requires that you specify values for
two parameters. The first is an onset tolerance – events with onset times that differ by no more than this amount are aligned; thus even identically coded events whose onsets differ by more than this amount can generate commission-omission errors. The second is a percent overlap – events that
overlap by this percent are regarded as agreements if identically coded and
regarded as disagreements if differently coded; but an event coded by one
observer that does not overlap by this percent with an event coded by the
other observer is regarded as an event the second observer missed.
Once aligned, events are tallied in a kappa table and kappa is computed
in the usual way (but using iterative proportional fitting for expected fre-
quencies due to the nil-nil structural zero). The result is a timed-event align-
ment kappa. Results for the data given in Figure 6.3 are shown in Figure 6.4, along with a time plot. For the event plot, vertical bars indicate exact agreement, two dots (colon) indicate disagreements, and hyphens indicate events coded by one observer but not the other. For the time plot (the last 600 seconds of the Figure 6.3 data are shown), horizontal segments indicate event durations, solid vertical lines between events indicate agreements, dotted lines between events indicate disagreements, and dotted lines to top or bottom indicate commission-omission errors. In this case, the number of events aligned was the same as for the event alignment (see Figure 6.2), but the alignment was somewhat different. Due to differences in onset times, Observer 2’s Fussy was regarded as an omission error on the part of Observer 1, and thus Observer 1’s subsequent Fussy was paired with Observer 2’s Cry, which counted as a disagreement.
With different data, however, or with different values for the tolerance
and overlap parameters, different alignments could have resulted – perhaps
with more errors and lower kappas. As a general rule, unless two obser-
vers’ timed events align quite well, timed-event alignment kappas will be
Observer 2’s codes

Obs. 1’s codes   Nil  Alert  Cry  Fussy  Sleep  Total
Nil               —     1     1     1      0      3
Alert             0     5     0     0      0      5
Cry               0     0     1     1      0      2
Fussy             0     0     2     3      0      5
Sleep             1     0     0     0      4      5
Total             1     6     4     5      4     20

Timed-event alignment:
AS-FSFASC-FAF-SCAFSA
|| ||:||| :|| |:|| |
ASCFSCASCFCAFASFAF-A

[Time plot omitted: it shows the last 600 seconds (600 to 1200) of the Figure 6.3 data, one row per code (Alert, Cry, Fussy, Sleep) for each observer’s record.]

Figure 6.4. Alignment of the two timed-event sequences shown in Figure 6.3 per the dynamic programming algorithm as implemented in GSEQ (with 10-second tolerance for onset times and 80% overlap for agreements-disagreements), and the kappa table resulting from tallying agreement between successive pairs of aligned events. For this example, timed-event alignment kappa = .56. See text for other details.

lower than single-code alignment kappas, because when time is taken into
account, more commission-omission errors and more disagreements typ-
ically result. For this example, the alignment differed somewhat and the
value of the timed-event alignment kappa was .56 compared to .62 for the
single-code alignment kappa.
An alternative to the time-based alignment algorithm for timed-event
sequential data presented here is one proposed by Haccou and Meelis
(1992). It consists of a cascade of rules for alignment instead of the more
mathematically based approach to alignment of the Needleman-Wunsch
algorithm. It has been implemented in at least two commercially available
programs; for a comparison between Haccou-Meelis-based algorithms and the one presented here, see Bakeman, Quera, and Gnisci (2009).

interval-based agreement using cohen’s kappa


The classic Cohen’s kappa, described in the previous chapter, and inter-
val recording, as described in Chapter 3, almost seem designed to work
together. Cohen’s kappa assumes that two judges (in our case, observers)
independently assign one of K ME&E categories to N discrete entities or
demarcated units. The judges could be medical doctors, the codes diagnos-
tic categories, and the units patients – which defines a classic application of
Cohen’s kappa. In the context of interval recording, the demarcated units
are successive fixed time intervals. Cohen’s kappa assumes units already
demarcated before coding commences, like patients or the fixed time inter-
vals of interval recording or, as mentioned earlier when discussing single-
code event data, turns in a transcript.
Following standard procedures, kappa tables are defined, one for each
set of ME&E codes. Then, for each set, each pair of decisions (i.e., decisions
about which code from that set applied to the interval) adds a tally to the
K×K kappa table for that set. (As noted earlier, for partial-interval sam-
pling, K 2×2 tables would be used.) The observer decision model is straight-
forward: The total number of tallies indicates the number of decisions each
observer made. If both observers assign the same code to an interval, a tally
is added to the appropriate diagonal cell; if they disagree, a tally is added to
the appropriate off-diagonal cell. Then, as described for Cohen’s kappa in
the previous chapter, large counts in off-diagonal cells identify particular
points of disagreement, the counts in the cells on the diagonal can be used
to compute percentage agreement, and the kappa computation provides an
index of observer agreement for the set of ME&E codes used to define the
rows and columns of the kappa table, corrected for chance.

weighted kappa: when disagreements differ in severity
In the previous four sections we discussed kappa variants concerned with
matching and tallying procedures. In this section we describe a variant of
the basic kappa computation. The standard Cohen’s kappa regards all dis-
agreements as equally serious, but for a variety of reasons you may regard
some as more serious than others. This is most likely to occur when inter-
vals or previously demarcated events are rated instead of coded, but first we
consider a coding case. Imagine that you categorize disagreements as not so serious, of concern, and serious, and decide to weight them 1, 2, and 3,
respectively. To formalize this, you would prepare a K×K matrix of weights
corresponding to your observed counts (i.e., the kappa table). Each cell on
the diagonal represents an agreement and thus is weighted zero. You would
then assign weights of 1, 2, and 3 to the off-diagonal cells of the weight
matrix, depending on how seriously you regarded the specific disagree-
ment, and compute weighted kappa (Cohen, 1968). It is defined as

\kappa_{wt} = 1 - \frac{\sum_{i=1}^{K} \sum_{j=1}^{K} w_{ij}\, x_{ij}}{\sum_{i=1}^{K} \sum_{j=1}^{K} w_{ij}\, e_{ij}}

where wij, xij, and eij are elements (i-th row, j-th column) of the weight, observed, and expected matrices, respectively; eij = p+j xi+, where xi+ is the sum for the i-th row and p+j is the probability for the j-th column (and where p+j = x+j /N). Because the diagonal is weighted zero and disagreements carry positive weights, the ratio compares observed to chance-expected weighted disagreement, and subtracting it from 1 yields an agreement index.
Any use of weighted kappa requires that you can convince others that the
weights you assign are sound and not unduly arbitrary. One set of weights
requires little rationale. If you weight all disagreements equally as 1, then
weighted kappa will have the same value as unweighted kappa. Otherwise,
if you weight different disagreements differently, be ready with convincing
reasons for your different weights.
In contrast to nominal codes, weights for ordinal ratings (or codes that
can be ordered) are much easier to justify. It makes sense that disagreements
between codes or ratings farther apart should be weighted more heav-
ily. The usual choice is either linear weights or, if you want disagreements
further apart treated even more severely, quadratic weights. Specifically,
wij = |Ci – Cj| for linear and wij = |Ci – Cj|² for quadratic weights, where Ci and
Cj are ordinal numbers for the i-th row and j-th column, respectively, and
wij is the weight for the cell in the i-th row and j-th column (see Figure 6.5).
For the Figure 5.1 kappa table, values for unweighted kappa, weighted kappa
with linear weights, and weighted kappa with quadratic weights (shown to
five significant digits) are .61284, .62765, and .64379, respectively.
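In sketch form (ours, not GSEQ code), with the weight functions generated from the formula just given:

    def weighted_kappa(table, weight):
        # Cohen's (1968) weighted kappa; weight(i, j) is the disagreement weight
        n = float(sum(sum(row) for row in table))
        k = len(table)
        col_p = [sum(row[j] for row in table) / n for j in range(k)]
        observed = sum(weight(i, j) * table[i][j]
                       for i in range(k) for j in range(k))
        expected = sum(weight(i, j) * col_p[j] * sum(table[i])  # e_ij = p+j * xi+
                       for i in range(k) for j in range(k))
        return 1 - observed / expected

    linear = lambda i, j: abs(i - j)
    quadratic = lambda i, j: abs(i - j) ** 2
    # Weighting every disagreement 1 (weight = lambda i, j: 0 if i == j else 1)
    # reproduces the unweighted kappa, as noted earlier.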
As Cohen appreciated (1968; also Fleiss & Cohen, 1973), when quad-
ratic weights are used and there is no observer bias, weighted kappa and
the intraclass correlation coefficient (ICC, as discussed in the next chapter)
have the same value. Perhaps for this reason, the statistically inclined find
quadratic weights “intuitively appealing” (Maclure & Willett, 1987, p. 164),
        Rate1  Rate2  Rate3  Rate4            Rate1  Rate2  Rate3  Rate4
Rate1     0      1      2      3      Rate1     0      1      4      9
Rate2     1      0      1      2      Rate2     1      0      1      4
Rate3     2      1      0      1      Rate3     4      1      0      1
Rate4     3      2      1      0      Rate4     9      4      1      0

Figure 6.5. Two sets of weights for computing weighted kappa given four ordered codes. Linear weights are on the left and quadratic weights on the right.

but either quadratic or linear weights could be used; and when disagree-
ments in fact cluster around the diagonal, values for weighted kappa com-
pared to unweighted kappa will be higher. More to the point, with ordinal
codes (and rating scales), weighted kappa may be a more accurate reflec-
tion of observer agreement than unweighted kappa because it treats dis-
agreements between ordinal codes farther apart as more severe than those
between codes closer together.

are all kappas overrated?


As earlier sections in this chapter demonstrate, both time-unit kappa and
the alignment kappa for timed-event sequential data have advantages, but
raise concerns. Each has elements of arbitrariness – specifically the preci-
sion of time units and the tolerance allowed for time-unit kappa and the
onset tolerance and overlap parameters for timed-event alignment kappa.
The classic Cohen’s kappa for interval sequential data and alignment kappa
for single-code event sequential data seem less arbitrary, but all kappas
used in the service of assessing observer agreement, as discussed through-
out this chapter, raise a fundamental question: What is kappa for?
As a summary index of agreement, any of the kappas described in this
chapter can be useful – but none should be taken too seriously. We noted earlier (see “Standards for kappa (number of codes matters)” in Chapter 5) that no one value of kappa can be regarded as a threshold of acceptability, and that the number of codes in the ME&E set among which the observers choose is important – as is, to a lesser degree, the variability among those codes’ simple
probabilities and differences in simple probabilities for corresponding codes
between observers. Consequently, it is misleading to claim, for example, that
kappa values of .80 and above are acceptable whereas values below are not.
Arguably, more important than the value of kappa is the kappa table
on which it is based. Early in the previous chapter we noted that point-by-
point agreement, which primarily relies on some form of the kappa statistic,
is especially useful for observer training prior to data collection and for
ongoing checking of observers once data collection begins. For these pur-
poses, the kappa table is more useful than the value of kappa; the kappa stat-
istic, by reducing the information in the table to a single number, obscures
the sources of agreement and disagreement identified in the table.
Presumably, observers want to perform well and get better. Presenting
them with a single value of kappa does not help much, but examining a
kappa table can. It is also a useful exercise for investigators. For example,
when the observers’ marginal probabilities for a given code differ, we know
that one observer overgeneralizes, and so overuses, that code relative to the
other observer – which means that we need to spend more time sharpening
the definition of such codes and working with our observers to assure they
share a common understanding. Further, as we work with our observers to
understand why some off-diagonal cells of a kappa table contain many tal-
lies and others do not, we identify codes that are causing confusion, defini-
tions that require greater precision, concepts that need to be defined with
greater clarity, and possibly even codes that should be lumped or eliminated
from the coding scheme. Only by looking beyond kappa to its table do we
unlock its usefulness. Kappa can be used to mark progress in the course of
observer training; working with observers to understand discrepancies in
the kappa table can facilitate that progress.
From this point of view, the answer to the question – Given timed-event sequential data, which is better, a time-unit kappa or a timed-event alignment kappa? – is both. Both are better because, first, their range likely cap-
tures the “true” value of kappa; but, second and more importantly, the two
kappa tables provide different but valuable information about agreements
and disagreements. The time-unit kappa table emphasizes how long agree-
ments and disagreements lasted, whereas the timed-event alignment kappa
table emphasizes agreements and disagreements with respect to the onsets
of events. A thoughtful examination of both tables can only help observers
as they strive to improve their accuracy.

summary
The classic Cohen’s kappa works well with single-code event data when
events are previously demarcated and also works well with interval sequen-
tial data because in both cases the entities to which codes are assigned are
identified before coding begins. However, when observers are asked to first
segment the stream of behavior into events (i.e., detect the seams between
events) and only then code those events, agreement is more complicated.
Frequently the two observers’ records contain different numbers of events
due to commission-omission errors – one observer claims an event, the
other does not – but even when the records contain the same number of
events, exactly how the events align is not always certain. And when align-
ment is uncertain, how events should be paired and tallied in the kappa
table is unclear.
To solve this problem, we developed an algorithm based on the Needleman-
Wunsch algorithm used by molecular biologists for genetic sequence align-
ment and comparison. It can be demonstrated that the method guarantees
an optimal solution – that is, it finds the alignment with the highest pos-
sible number of agreements between sequences. Another advantage is that
the algorithm identifies commission-omission errors. Once aligned, paired
events can be tallied in a kappa table, and what we call an alignment kappa
can be computed for single-code event sequential data.
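
To make the dynamic-programming idea concrete, here is a minimal sketch in
Python; the scoring values (match = 1, mismatch = 0, gap = –1) and the two
short event sequences are illustrative assumptions on our part, not the
parameters the algorithm in GSEQ actually uses.

def align(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    # Needleman-Wunsch: score[i][j] holds the best score for aligning
    # the first i events of seq_a with the first j events of seq_b.
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (
                match if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back to recover aligned pairs; a None member marks a
    # commission-omission error (an event only one observer coded).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if seq_a[i - 1] == seq_b[j - 1] else mismatch):
            pairs.append((seq_a[i - 1], seq_b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((seq_a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, seq_b[j - 1]))
            j -= 1
    return list(reversed(pairs))

# Hypothetical records: Observer 2 missed the first Cry.
print(align(["Fuss", "Cry", "Calm", "Fuss", "Cry"],
            ["Fuss", "Calm", "Fuss", "Cry"]))

Pairs with a None member flag commission-omission errors; the fully paired
events are then tallied into the kappa table from which an alignment kappa
is computed.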
For timed-event sequential data, one possibility is to tally successive pairs
of time units as defined by the precision with which times were recorded
(recall the code–time-unit grid representation) and compute what we call a
time-unit kappa. Another possibility is to code as agreements any time units
coded similarly within some stated tolerance – which results in what we call
a time-unit kappa with tolerance. A possible concern is that with the clas-
sic Cohen model, the number of tallies represents the number of decisions
coders make, but with time-unit kappa, the number of tallies reflects the
length of the session. With timed-event recording, observers are continu-
ously looking for the seams between events, but how often they are making
decisions is arguable; one decision per seam seems too few, but one per time
unit seems far too many. To address this concern, we adapted our event
alignment algorithm for use with timed-event data. With it, timed events
can be aligned, and what we call a timed-event alignment kappa can be com-
puted. Then both time-unit kappa and timed-event alignment kappa can be
reported for timed-event data.
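
For readers who want the tallying logic spelled out, here is a minimal
sketch of time-unit kappa in Python; the two per-second code sequences are
hypothetical, and a tolerance variant would simply relax the exact
unit-by-unit matching.

from collections import Counter

codes = ["Calm", "Cry", "Fuss"]
# Hypothetical ME&E records expanded onto a 1-second code-time-unit grid.
obs1 = ["Calm"] * 5 + ["Cry"] * 4 + ["Fuss"] * 6
obs2 = ["Calm"] * 4 + ["Cry"] * 6 + ["Fuss"] * 5

tally = Counter(zip(obs1, obs2))        # one tally per time unit
n = sum(tally.values())
p_obs = sum(tally[(c, c)] for c in codes) / n
p_exp = sum(
    (sum(v for (a, _), v in tally.items() if a == c) / n) *
    (sum(v for (_, b), v in tally.items() if b == c) / n)
    for c in codes)
kappa = (p_obs - p_exp) / (1 - p_exp)   # Cohen's formula applied to units
print(f"time-unit kappa = {kappa:.2f}")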
When some disagreements are regarded as more serious than others,
weighted kappa may be useful; this computational variant allows the user to
provide different weights for each possible disagreement. When codes are
ordered or represent ratings, disagreements between codes farther from the
diagonal may be assigned higher weights. In such cases, arguably weighted
kappa may be a more accurate reflection of observer agreement.
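
A minimal sketch of the computation, assuming linear weights for a set of
ordered codes and a hypothetical kappa table, might look like this:

def weighted_kappa(table, weights):
    # table[i][j]: tallies, Observer 1 in rows, Observer 2 in columns;
    # weights[i][j]: penalty for the i-j disagreement (0 on the diagonal).
    k = len(table)
    n = float(sum(sum(row) for row in table))
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    w_obs = sum(weights[i][j] * table[i][j]
                for i in range(k) for j in range(k)) / n
    w_exp = sum(weights[i][j] * row[i] * col[j] / n ** 2
                for i in range(k) for j in range(k))
    return 1 - w_obs / w_exp

k = 3  # e.g., three ordered rating codes
linear = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
table = [[20, 5, 1], [4, 18, 6], [0, 3, 13]]   # hypothetical tallies
print(round(weighted_kappa(table, linear), 2))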
The value of kappa may be overemphasized. Especially for observer
training, the kappa table and its marginal probabilities are more useful than
the simple summary value kappa represents. Examining whether marginal
probabilities for the same code differ between observers, or whether some
off-diagonal cells of a kappa table contain many tallies and others do not,
helps us identify codes that cause confusion, definitions that require greater
precision, concepts that need to be defined with greater clarity, and possibly
even codes that should be lumped or eliminated from the coding scheme.
7

The Intraclass Correlation Coefficient (ICC) for Summary Measures

As noted early in Chapter 5, point-by-point agreement is often accepted as
evidence that summary measures derived from sequential data will be reli-
able, probably because point-by-point agreement seems the more stringent
approach. Kappa-based demonstrations of observer agreement are essential
for observer training and for checking throughout coding; but once data
collection is complete and summary measures have been derived from the
sequential data, their reliability can be assessed directly with an intraclass
correlation coefficient (ICC).

relative versus absolute agreement


Summary measures in observational research are typically of two kinds: first
are frequencies, rates, and proportions for individual codes (see Chapter
8); and second are contingency indices and other contingency table statis-
tics involving two or more codes (see Chapter 9). Both are continuous (i.e.,
measured on an interval or ratio scale), and so an ICC measure of reliability,
which requires continuous or arguably ordinal measurement, is appropriate.
The general question is: When values are derived from records of the same
events coded independently by different observers, do the observers agree
sufficiently for our purposes? Can we regard the observers as essentially
interchangeable? In this regard, one important distinction concerns rela-
tive, as opposed to absolute, judgment (i.e., relative consistency vs. absolute
agreement, or norm-referenced vs. criterion-referenced; see Suen, 1988).
Do we want scores derived from two observers’ records to be rank-ordered
the same way – that is, to be in relative agreement – or do we want them to
agree regarding absolute magnitude?
One way to address relative agreement is with a Pearson product-moment
correlation coefficient (r), which is one of the oldest and best-known
statistics used by behavioral scientists. As is widely appreciated, r gauges the
strength of the linear relationship between two sets of scores (here, for
Observer 1 and Observer 2). As may be less widely appreciated, it gives the
same magnitude when scores are rank-ordered the same even if their means
differ. For example, because r indicates relative but not absolute agreement,
a high value of r results when two observers agree on which infants cried the
most, even if they disagree as to the amount of crying. However, even when
a relative, and not absolute, agreement index is desired – and even though
investigators occasionally use r as a reliability index – a better choice is an
ICC. It is firmly rooted in classical test theory and generalizability theory
(Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Wiggins, 1973) in ways
that€r is not.
The familiar correlation coefficient differs from an intraclass correl-
ation coefficient in at least two ways. First, for an ICC, the measures being
correlated belong to the same class – technically, the measures have the
same metric and variance; see McGraw & Wong, 1996 – which is why it
is called an intraclass correlation. The measures are interchangeable and –
as is clear when measures for identical twins are considered – their scores
could be swapped with no effect. In contrast, for the usual interclass correl-
ation coefficient, the measures being correlated belong to different classes
(e.g., height and weight). Second, because ICCs are variance ratios, they are
never negative.

targets and sessions


Data for an ICC’s reliability sample are arranged in the usual rows-by-
columns grid. In ICC terms, rows represent different targets and columns
represent different judges; cells then contain values for summary scores.
Targets could represent different sessions, different individuals, or even dif-
ferent codes – depending on what aspect of reliability we want to empha-
size. For example, if the reliability sample consisted of values for a particular
summary statistic derived for several individuals, a high ICC would indicate
that the observers were interchangeable with respect to that statistic when
measured in different individuals. If the sample consisted of values of that
same statistic derived for several codes (for a single individual, for example),
then a high ICC would indicate that the observers were interchangeable with
respect to that statistic when measured in different codes, and for a single
individual. Finally, if the sample consisted of values derived for several ses-
sions, a high ICC would indicate that the observers were interchangeable
with respect to that statistic when measured in different sessions. Because
the session is such an important and salient unit in observational research,
the session is probably the most frequent target used for computing ICCs.
A practical question is: How many targets are required? Fewer targets
mean less coding work, but what is a bare minimum? Yoder and Symons
(2010) suggest that it takes at least five for a reasonable ICC, but many
investigators may prefer ten or more. Certainly more targets will result in
smaller confidence intervals.
ICCs are cast in analysis-of-variance terms. Most fundamentally, an ICC
gauges the proportion of a variance (variously defined) that is attributable
to objects of measurement (McGraw & Wong, 1996). For example, with
sessions as targets, larger ICCs (values can vary from 0 to 1) suggest that
other sources of variability in the reliability sample – in particular that due
to observers (judges) – are small relative to intersession variability of the
summary score being considered. In this case, the ICC is usually viewed as
a reliability index (Hartmann, 1982; Suen, 1988).

relative and absolute iccs


There is not one ICC, but many – which can be a potential source of confu-
sion. Shrout and Fleiss (1979) identify six; McGraw & Wong (1996) likewise
identify six, but associate them with ten models that differ in interpretation.
However, only two of the ICCs are generally relevant for assessing observer
reliability of summary measures. They are the two McGraw and Wong –
using notation we explain shortly – call ICC(C,1) and ICC(A,1). They are
based on an analysis of variance for repeated measures and assume that k
judges rated or scored each of n targets. As applied to observer agreement,
this means that typically two observers (judges) coded n targets (often ses-
sions), and that values for a particular summary score were computed for
each observer and target.
The C in ICC(C,1) indicates relative consistency, the A in ICC(A,1) indi-
cates absolute agreement, and the 1 in both indicates a single-score ICC. Of
the ten definitions given by McGraw and Wong (1996), five are single-score
and five are average-score ICCs. The average forms make sense when the
scores analyzed in the substantive study are the means of k judges – which,
simply due to limited resources, is rarely the case in observational research.
Of the five single forms, one is for a one-way model, two for a two-way ran-
dom-effects model, and two for a two-way mixed-effects model. ICC(C,1)
and ICC(A,1) – hereafter referred to as ICCrel and ICCabs – are defined for
both random-effects and mixed-effects models and are computed the same
for both, although their interpretations differ somewhat.
Analysis of variance:

Source                    SS        df    MS
Target (between)         .574211     9    .063801
Observer (repeated)      .117902     1    .117902
Error (T × O)            .120572     9    .013397
TOTAL                    .812685

Summary indices (Yule’s Qs):

Target    Obs 1     Obs 2
  1       .4293     .3303
  2      –.0157    –.1476
  3       .3096     .3780
  4       .1383     .1306
  5       .3755     .0933
  6      –.1053    –.1104
  7       .5548     .2053
  8       .5038     .1292
  9       .3984     .0796
 10       .1326     .0974

ICCrel = (MStarg – MSerr) ÷ [MStarg + (k – 1)MSerr] = .65

ICCabs = (MStarg – MSerr) ÷ [MStarg + (k – 1)MSerr + (k/n)(MSobs – MSerr)] = .51

Figure 7.1. Summary contingency indices for ten targets (sessions) derived from
data coded by two observers, their analysis of variance statistics, and the formulas
and computations for ICCrel and ICCabs, respectively; k = number of observers and
n = number of targets.

In sum, ICCrel assesses the degree of relative consistency among mea-
surements and ICCabs assesses the degree of absolute agreement (as Suen,
1988, noted, the former was proposed by Hartmann, 1982, and the latter by
Brennan & Kane, 1977, and Berk, 1979). For the two-way random-effects
model, ICCrel is also known as norm-referenced reliability and ICCabs as
criterion-referenced reliability (McGraw & Wong, 1996, p. 35). Statistical
packages let you specify model (e.g., one-way random, two-way mixed,
two-way random) and type for two-way models (consistency or absolute
agreement); results differ by type but, as noted, are identical for both two-
way mixed and two-way random models.
An example is presented in Figure 7.1. Assume that two observers inde-
pendently coded ten sessions, that indices of contingency were computed,
and that their values were as shown in Figure 7.1. Assume further that the
indices are Yule’s Qs (see “Yule’s Q” in Chapter 9) representing the likeli-
hood that a mother would respond with a change in affect level within 3
seconds of her infant changing affect level (as in Goodman, Thompson,
Rouse, & Bakeman, 2010). An analysis of variance partitioned the total sum
of the squared deviations from the grand mean (sum of squares or SS) as
shown (see Bakeman & Robinson, 2005). The first component (source of
variance) is traditionally called between-subjects (here, between-targets
or between-sessions), the second represents the repeated measure (here,
observer), and the third is the error term (here, the target-by-observer or
session-by-observer interaction).
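
The entire computation can be reproduced in a few lines. The following
minimal Python sketch (our own illustration, not GSEQ or a statistical
package) partitions the sums of squares for the Figure 7.1 data and then
applies the two ICC formulas:

# Ten Yule's Q values per observer, from Figure 7.1.
obs1 = [.4293, -.0157, .3096, .1383, .3755, -.1053, .5548, .5038, .3984, .1326]
obs2 = [.3303, -.1476, .3780, .1306, .0933, -.1104, .2053, .1292, .0796, .0974]

n, k = len(obs1), 2                      # n targets, k observers
data = list(zip(obs1, obs2))
grand = sum(obs1 + obs2) / (n * k)
target_means = [sum(row) / k for row in data]
obs_means = [sum(obs1) / n, sum(obs2) / n]

# Two-way repeated-measures partition of the total sum of squares.
ss_targ = k * sum((m - grand) ** 2 for m in target_means)
ss_obs = n * sum((m - grand) ** 2 for m in obs_means)
ss_tot = sum((v - grand) ** 2 for row in data for v in row)
ss_err = ss_tot - ss_targ - ss_obs

ms_targ = ss_targ / (n - 1)
ms_obs = ss_obs / (k - 1)
ms_err = ss_err / ((n - 1) * (k - 1))

icc_rel = (ms_targ - ms_err) / (ms_targ + (k - 1) * ms_err)
icc_abs = (ms_targ - ms_err) / (
    ms_targ + (k - 1) * ms_err + (k / n) * (ms_obs - ms_err))
print(f"ICCrel = {icc_rel:.2f}, ICCabs = {icc_abs:.2f}")   # .65 and .51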
Formulas and values for the relative-consistency and absolute-agreement
ICCs – ICCrel and ICCabs – are given in Figure 7.1. Essentially, an ICC is a
ratio of the variance of interest over the sum of the variance of interest plus
error. Note that for ICCrel, observer variance is not considered, whereas it is
considered for the ICCabs – as indicated by the inclusion of the third term in
the denominator of the ICCabs formula in Figure 7.1.
Which of these two ICCs should be used? Shrout and Fleiss (1979) note
that, when ignoring judge variance and so treating the judges as fixed effects,
ICCrel can be interpreted in terms of rater (relative) consistency rather than
rater (absolute) agreement. But that when judges are considered random
effects, ICCabs addresses the question of whether judges are interchange-
able. Thus, when the entire corpus of a study is coded by one observer, it
may make sense to use ICCrel when establishing reliability; but when several
observers are employed, each coding a part of the corpus, it may make more
sense to use ICCabs instead. For further discussion, see Shrout and Fleiss
(1979, p. 425) and McGraw and Wong (1996, pp. 33–34).
More generally, the reason for choosing between criterion- and norm-
referenced indices should be based on what the summary measure is used
for. If we want to discriminate individuals (or sessions, or codes) according
to a certain criterion value, then we need to know whether observers dis-
criminate identically (absolute agreement). If only relative values or differ-
ences among individuals (or sessions, or codes) are interesting, then the less
strict, relative agreement may suffice.
For the data in Figure 7.1, ICCrel was larger in magnitude than ICCabs
(.65 vs. .51). This makes conceptual sense and is the usual case (Shrout &
Fleiss, 1979) – although if MSerr is greater than MSobs, thereby reducing the
magnitude of the ICCabs denominator relative to ICCrel, ICCabs will be larger.
As a matter of comparison – and recognizing that when it is appropriate to
compute an ICC, it is generally not appropriate to compute statistics with
different purposes – the value for the Pearson product-moment correlation
for the Figure 7.1 data is .68 and the value for Cronbach’s internal consist-
ency alpha – (MStarg – MSerr)/MStarg – if ten persons had responded to a two-
item scale, is .79.
It goes without saying that a compelling rationale – and not magnitude –
should be the reason for choosing a relative or an absolute ICC. In either
case, large ICCs are desired; we want variance between sessions to be large
relative to other sources such as the variance due to the target-by-observer
interaction. As error variance becomes small, the ICC ratio approaches 1 –
but at what value does it become acceptable? Like standards for Cronbach’s
internal consistency alpha, and correlation coefficients generally, this is a
judgment that investigators must make. In this case, we do not know of
a sound rationale, like the one we present for kappa (see “Standards for
kappa (number of codes matters)” in Chapter 5). Statistical significance
and confidence intervals can be taken into account (McGraw & Wong,
1996), although Yoder and Symons (2010) regard probability values as only
“minimally relevant” (p. 179) and suggest that a minimally acceptable value
is relative to the area of study – though they cite Mitchell (1979) to note
that some consider an ICC of .70 as very good.

summary
Once data collection is complete and summary scores have been computed
from sequential data (e.g., rates and proportions for individual codes,
contingency table indices involving two or more codes), reliability of par-
ticular summary scores can be assessed with an intraclass correlation coef-
ficient (ICC). Computation of an ICC requires a reliability sample – paired
values for a particular summary statistic derived for several targets (often
sessions) – with scores computed from data coded independently by two or
more observers. ICCs come in several forms; of the two relevant for obser-
ver reliability, one assesses relative consistency and one absolute agreement.
When the entire corpus of a study is coded by one observer, it may make
sense to use the relative consistency ICC when establishing reliability; but
when several observers are employed, each coding a part of the corpus, it
may make more sense to use the absolute-agreement ICC instead. In either
case, a rationale is required.
8

Summary Statistics for Individual Codes

After data collection is complete, and before data are analyzed, many
measurement methods require intervening steps – some sort of data
reduction – even if it is only computing a score from the items of a self-
report measure. Behavioral observation, however, seems to require more
data reduction than most measurement methods. Rarely are the coded
data analyzed directly without intervening steps. First, summary scores
of various sorts are computed from the event, timed-event, and inter-
val sequential data produced by the coders. In other words, the data-as-
collected, which usually reflect categorical measurement, are transformed
into summary scores for which ratio-scale measurement can usually
be€assumed.
As with scores generally, the first analytic steps for summary scores
derived from behavioral observation involve quantitative description.
Descriptive results for individual variables (e.g., means, medians, and
standard deviations, as well as skewness and distribution generally) are
important – first, of course, for what they tell us about the behavior we
observed, but also because they may define and limit subsequent analyses.
Limited values or skewed distributions, for example, may argue against
analysis of variance or other parametric statistical techniques. But what
summary scores should be derived and described first?
When taking these first descriptive steps, it is useful to distinguish
between simple statistics for individual codes that do not take sequencing
or contingency into account (described in this chapter) and contingency
and other table statistics involving two or more codes that do (described in
the next chapter). It makes sense to describe statistics for individual codes
first – if their values are not appropriate, computation of some contin-
gency and other table statistics may be precluded or at best be questionable.
Statistics that characterize individual codes are relatively few in number,

Timed <second> (Calm Cry Fuss) Assure Explain Touch ;
<dyad #1>
Fuss,1- Cry,3- Calm,10- Fuss,13- Calm,22- Cry,26-
Fuss,31- Calm,39- Fuss,43- Cry,46- Calm,53- Fuss,58- &
Explain,8-15 Assure,15-19 Explain,20-26 Assure,32-35
Explain,35-40 Assure,42-49 Explain,51-55 &
Touch,12-20 Touch,27-35 Touch,45-56, 61/
Interval=1 <second> (Calm Cry Fuss) Assure Explain Touch ;
<dyad #1>
Fuss *2, Cry *5, Cry Explain * 2, Calm Explain * 2,
Calm Explain Touch, Fuss Explain Touch * 2,
Fuss Assure Touch * 4, Fuss Touch, Fuss Explain * 2,
Calm Explain * 4, Cry, Cry Touch *4, Fuss Touch,
Fuss Assure Touch * 3, Fuss Explain *4, Calm Explain,
Calm *2, Calm Assure, Fuss Assure * 2,
Fuss Assure Touch, Cry Assure Touch * 3, Cry Touch * 2,
Cry Explain Touch *2, Calm Explain Touch *2, Calm Touch,
Calm *2, Fuss *3 /

Figure 8.1. An SDIS timed-event data file with 1-second precision (top) and
an SDIS interval data file with 1-second intervals (bottom) describing the same
events.

although their interpretation can vary depending on the data type (event,
timed-event, and interval, where event includes single-code and multicode
variants). In this chapter we describe these simple statistics, note how data
type affects their interpretation, and recommend which are most useful for
each type of sequential data.
We illustrate these simple statistics with two brief data files that describe
exactly the same events (see Figure 8.1). The codes are modeled on ones
used to describe children undergoing potentially painful medical proce-
dures as in Chorney, Garcia, Berlin, Bakeman, and Kain (2010). Calm, Cry,
and Fuss form a ME&E set used to describe a child’s behavior; and Assure,
Explain, and Touch are three behaviors that nurses or other medical pro-
fessionals might use in attempts to quiet an upset child. The timed-event
file (top) is formatted for timed-event data with 1-second precision, and
the interval file (bottom) is formatted for interval data with 1-second
intervals. Both files code 60 seconds of behavior as shown graphically in
Figure 8.2. Together they illustrate the point that, when the precision used
for timed-event data is the same as the interval duration for interval data, the
code-unit grid representation is the same for both data types (in fact, rarely
would a 1-second interval be used for interval recording). One caveat: this
60-second example is not based on actual data; there is no reason to think
that its sequences or durations reflect real results. The two example data
files are here simply to illustrate statistical definitions and computations
[Figure 8.2 is a code-unit grid: one row for each of the codes Calm, Cry,
Fuss, Assure, Explain, and Touch, and one column for each of the 60 seconds
(or 1-second intervals); check marks show the units during which each code
occurred.]

Figure 8.2. A code-unit grid for the timed-event data (60 seconds) and the interval
data (60 intervals) shown in Figure 8.1.

and to show that these definitions produce identical results for the timed-
event and interval data files shown in Figure 8.1.

basic statistics for individual codes


Probably the six most frequently encountered and useful statistics for indi-
vidual codes are frequency, relative frequency, rate, duration, relative dur-
ation, and probability. These six are described in the following paragraphs.

Frequency
Frequency indicates how often. It is not adjusted for sessions that vary in
length. In contrast, rate (defined shortly) is the frequency for a specified
time interval. In defining frequency and rate this way, we are following
standard statistical and behavioral usage. Some other fields (e.g., math-
ematics and physics) define frequency as the number of times a specified
phenomenon occurs within a specified interval; thus where we say fre-
quency and rate, they would say occurrence and frequency. (Note: Martin
and Bateson, 2007, like physicists, define frequency as the number of
occurrences per unit time, but acknowledge that defining frequency as the
number of occurrences is common in the behavioral literature.) As always,
readers need to be alert; words can be used with different meanings in dif-
ferent disciplines (and even within the same discipline).
For single-code event or timed-event data, frequency is the num-
ber of times an event occurred€– that is, the number of times the obser-
ver assigned a particular code. For interval and multicode event data, to
maintain comparability with timed-event data, frequency is the number
of episodes – where an episode is an unbroken sequence of intervals (or
multicoded events) containing the same code. In other words, when the
same code occurs in successive intervals (or multicoded events), only one
is added to its frequency tally. For example, for both the timed-event and
the interval data shown in Figure 8.1, frequencies for Calm, Cry, and Fuss
were 4, 3, and 5, respectively.
You might think that frequency for interval data would be the number of
intervals checked, but as illustrated by Figure 8.2 and discussed shortly, the
number of intervals checked is duration instead.

Relative frequency
Relative frequency indicates proportionate use of codes and is defined rela-
tive to a specified set of codes.
For all data types, relative frequency is a code’s frequency, as just
defined, divided by the sum of frequencies for the codes specified – hence
the relative frequencies computed for a specified set necessarily sum to 1.
(Alternatively, relative frequencies can be expressed as percentages sum-
ming to 100 percent.) For example, when Calm, Cry, and Fuss are specified,
their relative frequencies for the Figure 8.1 data are 33 percent, 25 percent,
and 42 percent, respectively. Some research questions require relative fre-
quency and some do not, and occasionally investigators report relative fre-
quency when another statistic is needed. Consequently, you should provide
an explicit rationale when using this statistic, along with the set of codes
used to compute it.

Rate
Rate, like frequency, indicates how often. It is the frequency per a specified
amount of time. Because it can be compared across sessions, rate is pref-
erable to frequency when sessions vary in length. (As noted, this is how
Martin and Bateson, 2007 – and mathematicians and physicists generally –
define frequency.)
For all data types, rate is computed by dividing frequency, as just defined,
by the duration of the session. Its units may be adjusted (i.e., multiplied by
the appropriate factor) to be per minute, per hour, or per any other time
unit that makes sense. Rate cannot be computed unless the session duration
is known. For event data (both single-code and multicode), duration is
session stop time minus session start time. For timed-event data, duration
can be derived directly from the data. For interval data, the session’s duration
is the number of intervals multiplied by the interval duration. For example,
because 1-second intervals were used for the data shown in Figure 8.1, rates
for Calm, Cry, and Fuss were 4, 3, and 5 per minute, respectively. With the
same frequencies and 5-second intervals, rates (i.e., episodes per
unit time) would be computed as 0.80, 0.60, and 1.00 per minute or 24, 18,
and 30 per half hour.

Duration
Duration indicates how long or how many. Like frequency, it is not adjusted
for sessions that vary in length. For single-code event data, duration is the
same as frequency. (Note: Martin and Bateson, 2007, define duration differ-
ently. Their definition for duration is the length of time for which a single
occurrence of the behavior pattern lasts – what we call duration they call
total duration.)
For timed-event data, duration indicates the amount of time devoted to
a particular code during the session. In contrast, for interval and multicode
event data, duration indicates the number of intervals or the number of
multicoded events checked for a particular code (i.e., the units are intervals
or events, not standard time units like seconds). For example, for the inter-
val data shown in Figure 8.1, durations for Calm, Cry, and Fuss were 16, 19,
and 25 intervals.
Further – and this is something of a technical digression – an estimated
duration that takes interval duration (symbolized as w for width) and sam-
pling method into account can be computed from interval data by applying
a correction suggested by Suen and Ary (1989). Using the definitions for
code duration (d) and frequency (f) given in the preceding paragraphs,
estimates for momentary, partial, and whole-interval sampling (see “Interval
recording” in Chapter 3) will then be wd, w(d – f), and w(d + f), respect-
ively. Both the current and earlier versions of GSEQ compute this estimated
duration, but it has rarely been reported in the published literature. Usually
the number of checked intervals is reported as the duration, but it is then
interpreted in light of the interval duration and sampling method.
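
A worked sketch of the correction, with hypothetical values, shows how
little arithmetic is involved:

# Suen and Ary (1989) estimated durations from interval data; the values
# of w (interval width), d (intervals checked), and f (episodes) are
# hypothetical.
w, d, f = 10, 25, 5
print("momentary sampling: ", w * d)        # 250 seconds
print("partial-interval:   ", w * (d - f))  # 200 seconds
print("whole-interval:     ", w * (d + f))  # 300 seconds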
Relative duration
Relative duration indicates proportionate use of time for timed-event data
and proportionate use of intervals or events for interval and multicode
event data. Like relative frequency, it is defined relative to a specified set of
codes. For single-code event data, relative duration is the same as relative
frequency.
For timed-event, interval, and multicode event data, relative duration is
a code’s duration, as just defined, divided by the sum of the durations for
the codes specified – hence the relative durations computed for a specified
set necessarily sum to 1. (Alternatively, relative durations can be expressed
as percentages summing to 100 percent.) To be interpretable, the specified
codes should be at least mutually exclusive, if not ME&E. For example, when
the ME&E Calm, Cry, and Fuss are specified, their relative durations for
the Figure 8.1 data are 27 percent, 32 percent, and 42 percent, respectively.
Because these codes are ME&E, we know that the child was fussing for 42
percent of the session (or was coded fussing for 42 percent of the intervals).
When the codes specified are mutually exclusive, but not exhaust-
ive, the interpretation is somewhat different. If a mutually exclusive – but
not exhaustive – set of codes for mother vocalizations including Naming
were defined, we might discover, for example, that 24 percent of the time
when mother’s vocalizations were occurring they were coded Naming. But
Naming was not coded for 24 percent of the total session because the codes
were not exhaustive. As with relative frequency, to avoid possible misuse,
you should provide an explicit rationale when using this statistic.

Probability
Probability indicates likelihood and can be expressed as either a proportion
or a percentage. Just as rate is preferable to frequency, so too probability
is preferable to duration when sessions vary in length, and for the same
reason – both rate and probability can be compared across sessions. For
single-code event data, probability is the same as relative frequency.
Probability, as we define it here, indicates the proportion or percentage
of a session devoted to a particular code (as with other statistics defined
here, we treat it as a sample statistic, not an estimate of a population par-
ameter as statisticians might). For timed-event data, it is the code’s dur-
ation divided by the session’s duration; and for interval and multicode
event data, it is the code’s duration as just defined divided by the total
number of intervals or multicoded events. For example, probabilities for
                               Data Type

Statistic            Single-code event   Timed-event         Interval                 Multicode event

Frequency            F                   F                   E                        E
Relative frequency   F/ΣF                F/ΣF                E/ΣE                     E/ΣE
Rate                 F/T                 F/T                 E/T                      E/T
Duration             —                   D = time            D = #checked             D = #checked
Relative duration    —                   D/ΣD                D/ΣD                     D/ΣD
Probability          —                   D/T                 D/Nintervals             D/Nevents
Total duration (T)   stop – start time   stop – start time   Nintervals × int. dur.   stop – start time

Figure 8.3. Formulas for six basic simple statistics. See text for explanation
and details.

Calm, Cry, and Fuss for the Figure 8.1 data are 27 percent, 32 percent, and
42 percent, respectively. These are the same values as for relative duration,
but only because these three codes are ME&E. If additional codes were
specified for the probability calculation, the probability values would stay
the same, but values for relative duration would change (as would their
interpretation).
Figure 8.3 summarizes these six statistics and shows how they are com-
puted for each data type. For interval and multicode event data: (1) E, and
not F, is used for frequency to remind us that, when tallying their fre-
quency, episodes are tallied – that is, uninterrupted stretches of intervals or
multicoded events checked for the same code; (2) duration is the number of
intervals or multicoded events checked for a given code; and (3) probability
is computed relative to the total number of intervals or multicoded events.
Rate, however, is computed relative to total duration, which is inferred from
the data for timed-event data, from explicit start and stop times for event
data (both single-code and multicode), or from the number of intervals
multiplied by the interval duration for interval data.
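
To see these formulas at work, here is a minimal Python sketch that
computes the six statistics for one code from a per-unit (code-unit grid)
representation of the Figure 8.1 data; the episode lists are read off the
onsets and offsets in the timed-event file, and the function names are
our own.

def episodes(checked):
    # Count unbroken runs of True (episodes) in a per-unit boolean list.
    return sum(1 for i, on in enumerate(checked)
               if on and (i == 0 or not checked[i - 1]))

def basic_stats(grids, code, unit_seconds=1.0):
    checked = grids[code]
    f = episodes(checked)                # frequency (episodes)
    d = sum(checked)                     # duration (units checked)
    session_seconds = len(checked) * unit_seconds
    return {"frequency": f,
            "rate per min": f / (session_seconds / 60),
            "rel frequency": f / sum(episodes(g) for g in grids.values()),
            "duration": d,
            "rel duration": d / sum(sum(g) for g in grids.values()),
            "probability": d / len(checked)}

# Onset-offset spans (offsets exclusive) for the Figure 8.1 ME&E codes.
spans = {"Calm": [(10, 13), (22, 26), (39, 43), (53, 58)],
         "Cry":  [(3, 10), (26, 31), (46, 53)],
         "Fuss": [(1, 3), (13, 22), (31, 39), (43, 46), (58, 61)]}
grids = {}
for code, eps in spans.items():
    g = [False] * 60                     # one cell per second, times 1-60
    for on, off in eps:
        for t in range(on, off):
            g[t - 1] = True
    grids[code] = g
print(basic_stats(grids, "Calm"))
# frequency 4, rate 4 per minute, relative frequency .33,
# duration 16, relative duration .27, probability .27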

mean event durations, gaps, and latencies


Three additional statistics for individual codes – mean event duration, mean
gap, and latency – make sense primarily for timed-event data and can have
descriptive value. They do not make sense for single-code event data.
Mean Event Duration


For timed-event data, mean event duration indicates how long individual
events lasted for a particular code on average. It is a code’s duration divided
by its frequency. When computed for interval or multicode event data,
mean event duration indicates the mean number of successive intervals or
multicoded events checked for a particular code.

Mean Gap
Gaps are of two kinds: between consecutive occurrences of events – that is,
from one offset to the next onset of the same code; and between onsets of
those events – that is, from one onset to the next onset of the same code.
For timed-event data, mean gap indicates the average time between events
or between onsets of those events. When computed for interval or multi-
code event data, mean gap indicates the mean number of successive inter-
vals or multicoded events between ones checked for a particular code (i.e.,
the€ mean number between episodes) or between ones checked for that
code’s onsets.

Latency
For timed-event data, latency is defined as the time from the start of a ses-
sion to the first onset of a particular code (not as the time between a pair
of behaviors generally, a meaning that is also encountered in the literature).
For interval or multicode event data, it is the number of intervals or mul-
ticoded events from the start of the session to the first interval or event
checked for a particular code. If a session consists of more than one stream
(see “Timed-event and state sequences” in Chapter 4), latency is computed
as the mean of the latencies for the separate streams.
GSEQ can compute values for the six basic statistics described in the
previous section for whichever codes you specify. It can also compute
values for mean event duration, gap, and latency, as well as their minimum
and maximum values (GSEQ calls these simple statistics to distinguish them
from the table statistics described in the next chapter). For example, for the
Figure 8.1 data, the mean event duration for Calm is 4 (min = 3, max = 5), the
mean gap between times (or intervals) coded Calm is 10.7 (min = 9, max =
13), the mean gap between onsets of Calm is 14.3 (min = 12, max = 17),
and the latency for Calm is 9.
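
These statistics are straightforward to compute once a code’s events are
represented as onset-offset pairs. Here is a minimal Python sketch – the
function name is ours, and offsets are exclusive – that reproduces the Calm
values just given:

def event_stats(eps, session_start):
    # eps is a list of (onset, offset) pairs for one code, in order.
    durations = [off - on for on, off in eps]
    gaps = [eps[i + 1][0] - eps[i][1]        # offset to next onset
            for i in range(len(eps) - 1)]
    onset_gaps = [eps[i + 1][0] - eps[i][0]  # onset to next onset
                  for i in range(len(eps) - 1)]
    return {"mean event duration": sum(durations) / len(durations),
            "mean gap": sum(gaps) / len(gaps),
            "mean onset gap": sum(onset_gaps) / len(onset_gaps),
            "latency": eps[0][0] - session_start}

calm = [(10, 13), (22, 26), (39, 43), (53, 58)]   # Calm in Figure 8.1
print(event_stats(calm, session_start=1))
# mean event duration 4.0, mean gap 10.7, mean onset gap 14.3, latency 9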
recommended statistics for individual codes


As noted earlier in this chapter, analyzing simple statistics for individual
codes should be your first step. But which of the statistics just described
should you emphasize? The answer can vary by data type, but depends pri-
marily on your research questions.
For single-code event data, the answer is easy: The only options are fre-
quency, relative frequency, and rate (assuming you have recorded session
start and stop times). When sessions you want to compare are all equal
in length, then frequency is fine; otherwise use rate. If the Figure 8.1 data
were single-code event data and the session lasted 1 minute, then you could
report, for example, that Calm was coded either 4 times or at a rate of 4 per
minute, and that 33 percent of the events were coded Calm.
For timed-event data, there are more options. When sessions you want
to compare are all equal in length, then frequency and duration are fine;
otherwise use rate and probability – unless you believe that variability in
session length is inconsequential and for conceptual reasons you want to
emphasize the actual number of times and the exact elapsed time that par-
ticipants experienced particular types of events. Thus for the Figure 8.1 data
you would report either that 4 events were coded Calm and that they lasted
a total of 16 seconds or you would report that an event was coded Calm 4
times per minute and 27 percent of the session was coded Calm.
But should you report both frequency and duration, or both rate and
probability? Frequency and duration are not the same conceptually; they
are independent and may or may not be correlated. The same is true for rate
and probability. For these reasons, whether you report one or both depends
on your research question. Do you think that how often a mother corrects
her child is important? Then use frequency or rate. Or do you think that
the proportion or percentage of time a mother spends correcting her child
(or a child experiences being corrected) is important? Then use duration
or probability. Whether you analyze one or both, articulating an explicit
rationale is both clarifying and appreciated.
Relative frequency and mean event duration are two further options for
timed-event data. With regard to relative frequency, consider its relation
to probability. These two are not necessarily correlated and, like frequency
and duration, answer different questions. Use relative frequency if you
think what matters is how often a participant experienced one kind of event
relative to other events (e.g., 25 percent of the child’s events were coded
Cry); but use probability if you think the proportion of time a participant
experienced one kind of event relative to other events matters (e.g., 32 per-
cent of the time the child was coded Cry). Whether you analyze one or
both, once again explicit rationales are clarifying and appreciated, and show
others that your decisions were well thought out.
With regard to mean event duration, consider the triad of mean event
duration, frequency (or rate), and duration (or probability). These three are
correlated because mean event duration is (total) duration divided by fre-
quency; in effect, the three yield two degrees of freedom. Due to this redun-
dancy, it does not make sense to analyze all three. Instead pick the two that
seem most meaningful in the context of your research questions – or if you
describe all three, be aware that analyses of them are not independent.
For interval and multicode event data, the options and choices are simi-
lar to those for timed-event data; however, because the underlying units
are intervals or events, and not time units, the interpretation is somewhat
different. For example, for the Figure 8.1 data either you would report that
4 episodes were coded Calm (successive intervals or multicoded events all
coded Calm) and they lasted for a total of 16 intervals or multicoded events,
or you would report that an episode was coded Calm 4 times per minute
and 27 percent of the intervals or multicoded events were coded Calm. You
could also report that 25 percent of the child’s episodes were coded Cry, but
32 percent of the intervals or multicoded events were coded Cry.

summary
Data reduction – computing summary scores from the coded data – is
an essential first step when analyzing observational data. Such summary
scores, which typically assume ratio-scale measurement, are important
both in themselves and because they may limit subsequent analyses. For
example, when the distribution for a summary score is skewed, some sort
of recode or transformation, use of nonparametric statistics, or both may be
indicated. Summary scores are of two kinds: those computed for individual
codes (as described in this chapter) and those computed from contingency
tables (as described in the next chapter). Statistics for individual code sum-
mary scores are described first because, if their values are not appropriate
(e.g., many zeros, excessively skewed, or both), computation of some con-
tingency and other table statistics may be questionable.
Basic statistics for individual codes include frequency, relative frequency,
rate, duration, relative duration, and probability. For single-code event data,
only frequency, relative frequency, and rate need be computed (because they
are the same as duration, relative duration, and probability, respectively).
Basic statistics for timed-event, interval, and multicode event data are simi-
lar, but their interpretation varies because the units are different (time units,
intervals, and multicoded events, respectively). Which statistics you choose
to report depends on your research questions. Essentially, frequency and
rate indicate how often, duration indicates how long (time units) or how
many (intervals or multicoded events), and probability indicates likelihood
(i.e., proportion or percentage).
Additional statistics useful for describing individual codes are mean
event duration, mean gap (between codes and between their onsets), and
latency – and minimum and maximum values for each. Whatever statis-
tics you select to characterize individual codes, explicit rationales for your
choices are clarifying and appreciated.
9

Cell and Summary Statistics for Contingency Tables

The summary statistics described in the previous chapter could be called
one-dimensional because they were computed for individual codes. In con-
trast, the statistics described in this chapter could be called two-dimensional
because they are derived from two-dimensional contingency tables whose
rows and columns are defined with two or more codes. Still, the overall pur-
pose is the same: Summary statistics are computed for individual sessions,
and those scores are then described and analyzed using whatever design
and statistical procedures are appropriate.
Statistics derived from two-dimensional contingency tables are of three
kinds. First are statistics for individual cells; these are primarily descriptive.
Second are summary indices of independence and association for tables of
varying dimensions (e.g., Pearson chi-square and Cohen’s kappa); these are
generally well known or, in the case of kappa, already described in Chapters
5 and 6. Third, and most important for sequential analyses, are sum-
mary statistics specifically for 2 × 2 tables; these contingency indices often
turn out to be the best way to address sequential questions as detailed in
subsequent chapters.
Like kappa tables, contingency tables consist of tallies. In the con-
text of behavioral observation, the entities or units tallied are events for
single-code event and multicode event data, time units for timed-event
data, and intervals for interval data. Unlike kappa tables, however, rows
and columns may, and often do, include different codes. Columns may
also be lagged relative to rows, but this is useful primarily for single-code
event data. In the following sections we define cell statistics, table sta-
tistics, and 2 × 2 contingency indices and note how their use varies by
data type.

R           number of rows (givens)
C           number of columns (targets)
xrc         observed joint frequency for the cell in the r-th row and c-th column of an R × C table
x+c         sum of the counts in the c-th column
xr+         sum of the counts in the r-th row
N = x++     total number of counts for the R × C table
pc          probability for the c-th column = x+c ÷ N
pr          probability for the r-th row = xr+ ÷ N
erc         expected frequency, by chance = pc × xr+
gr          code for the r-th row (the given)
tc          code for the c-th column (the target)
P(tc|gr)    conditional probability of tc given gr = xrc ÷ xr+
yrc         raw residual = xrc – erc
zrc         adjusted residual = (xrc – erc) ÷ √[erc(1 – pc)(1 – pr)]

Figure 9.1. Definitions for five basic cell statistics and the notation used to
describe them.

individual cell statistics


As the name implies, cell statistics are computed for each cell of a two-
dimensional row-by-column contingency table. Definitions for the five basic
cell statistics – observed joint frequencies, expected frequencies (i.e., joint fre-
quencies expected by chance), conditional probabilities, raw residuals, and
adjusted residuals – are given in Figure 9.1 along with the notation used to
define them.

Observed Joint Frequencies and Hierarchical Tallying


The observed joint frequencies are the tallies that result from cross classifi-
cation. Cross classification proceeds somewhat differently for single-code
event data, so first we consider just timed-event, interval, and multicode
event data. When cross-classifying such data, each of the session’s units (i.e.,
time units, intervals, or multicoded events) is considered in turn. For each
unit, a tally is added to one of the cells of the contingency table – which cell
depends on the codes checked for that unit and the codes used to label the
rows and columns of the cross-classification table. The order of the row and
column codes is consequential. All units must add one, and only one, tally
to the table; in other words, the total number of tallies equals the number
of units cross-classified.
The codes that define the rows (the given codes) of the R × C table must
constitute a mutually exclusive set – or at least be treated as though they
did by following the hierarchical tallying rule explained shortly. In most
cases, they must also be exhaustive (a narrowly targeted research question
that restricts the codes considered might be an exception). The codes that
define the columns (the target codes) must likewise be mutually exclusive
and usually exhaustive. Almost always with timed-event, interval, and mul-
ticode event data, columns are not lagged relative to rows. It is expected that
some row codes will co-occur with some column codes; indeed, such co-
occurrences, or their lack, are often central to our research questions. For
interval, multicode event, and especially timed-event data, lagged sequen-
tial associations are best analyzed by defining time windows anchored to
existing codes (see “Creating new codes as ‘windows’ anchored to existing
codes” in Chapter 10 and “Time-window sequential analysis of timed-event
data” in Chapter 11). There is one strategy we would not recommend: It
makes little sense to mindlessly include all codes as both givens and targets
just “to see what happens.” For one thing, codes will, of course, co-occur
with themselves. When selecting given and target codes, a thoughtful and
an explicit rationale is important.
Tallying proceeds as follows. Each successive unit is scanned for a given
(row) code. Given codes are considered in the order listed. If one is found,
scanning for given codes stops and scanning for target (column) codes
begins. Consequently, if a unit contains more than one given code, only
the one encountered first in the list is used to cross-classify it. Target codes
are likewise considered in the order listed. If one is found, scanning for
target codes stops and a tally is added to the table for that given-target
code pair. This is what we mean by hierarchical tallying – which, in effect,
makes any list of codes mutually exclusive. For each successive unit, if no
given code is found – or if a given code but no target code is found – no
tally is added to the table. Therefore, to ensure that the total number of
tallies equals the total number of units, both given and target codes must
be exhaustive – which in GSEQ may be accomplished with a residual code,
indicating anything-else or none-of-the-above, and signified with an &
(i.e., an ampersand).
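
The tallying rule is short enough to express directly. In the following
minimal sketch, each unit is the set of codes checked for one time unit or
interval; the four example units are hypothetical, and “&” plays the same
none-of-the-above role it does in GSEQ.

from collections import Counter

givens = ["Calm", "Cry", "Fuss"]       # ME&E row codes
targets = ["Assure", "Explain", "&"]   # "&" makes the columns exhaustive

def tally(units):
    table = Counter()
    for unit in units:                 # unit = set of codes checked
        # First given and first target found, in the order listed.
        g = next((c for c in givens if c in unit), None)
        t = next((c for c in targets if c in unit), "&")
        if g is not None:
            table[(g, t)] += 1
    return table

# Hypothetical units; the full Figure 8.1 grid would supply 60 of these.
units = [{"Fuss"}, {"Cry", "Explain"}, {"Calm", "Explain", "Touch"},
         {"Fuss", "Assure", "Touch"}]
print(tally(units))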
To illustrate the computations in this chapter, we again use the data in
Figure 8.1. Assume that we tallied the time units or intervals for these data
in a 3 × 3 table whose rows were Calm, Cry, and Fuss (a ME&E set) and
whose columns were Assure, Explain, and a third column labeled &, which
makes the column codes exhaustive (here, by indicating any time unit or
Observed counts:

         Assure   Explain    &     Total
Calm        1       10       5       16
Cry         3        4      12       19
Fuss       10        8       7       25
Total      14       22      24       60

Conditional probabilities:

         Assure   Explain    &
Calm       .06      .63     .31
Cry        .16      .21     .63
Fuss       .40      .32     .28
Total      .23      .37     .40

Expected counts:

         Assure   Explain    &
Calm       3.7      5.9     6.4
Cry        4.4      7.0     7.6
Fuss       5.8      9.2    10.0

Adjusted residuals:

         Assure   Explain    &
Calm     –1.89     2.50   –0.83
Cry      –0.94    –1.71    2.49
Fuss      2.58    –0.63   –1.60

Figure 9.2. Cell statistics for the Figure 8.1 data.

interval in which neither Assure nor Explain were coded). The tallies for
the 60 time units or intervals are shown in Figure 9.2. For example, of 60
seconds (or intervals), only 1 was coded both Calm and Assure but 10 were
coded both Calm and Explain. (As a general rule, row and column codes
should be exhaustive so that all units are tallied. Occasionally, however, a
narrowly targeted research question may only require that a subset be tal-
lied – for example, if our only interest were in comparing Calm-Assure and
Calm-Explain associations.)

Lagged Tallies for Single-Code Events When Codes Can and Cannot Repeat
For single-code event data with its single stream of codes, events cannot
co-occur; therefore, an unlagged (i.e., Lag 0) contingency table would
contain structural zeros in off-diagonal cells and code frequencies on the
diagonal. As a result, instead of considering co-occurrences – which is the
norm with timed-event, interval, and multicode event data – transitions
are considered instead. The usual two-dimensional table for single-code
event data labels both rows and columns with the possible codes (a K × K
table), which for single-code event data necessarily constitute a ME&E set.
Columns are lagged relative to rows (Lag 1 is typical), and then tallying
proceeds as follows: Each successive pair of codes is considered (if ci is
the i-th code, first c1-c2, then c2-c3, etc.), with the first member of the pair
determining the row and the second determining the column to which a
tally is added. Thus, for Lag 1, the total number of tallies for the table is the
number of events minus 1.
Observed counts, Lag 1:

         Calm   Cry   Fuss   Total
Calm      —      1      3      4
Cry       2      —      1      3
Fuss      2      2      —      4
Total     4      3      4     11

Transitional probabilities:

         Calm   Cry   Fuss
Calm      —     .25    .75
Cry      .67     —     .33
Fuss     .50    .50     —
Total    .36    .27    .36

Figure 9.3. Observed Lag 1 counts and transitional probabilities for the Figure 8.1
data after being converted into single-code event data with Assure, Explain, and
Touch removed; codes cannot repeat. Dashes indicate structural zeros.

A restriction, important primarily for single-code event data, concerns
whether or not codes may repeat. Depending on the codes used and the
sorts of events coded, codes that repeat may be a logical impossibility –
that is, by definition adjacent events may not be assigned the same code.
For example, when an observer segments an infant’s behavior into states
and then assigns a code to each state, two successive states logically can-
not be assigned the same code. If they were, it would indicate one state,
and not two. In such cases, the Lag 1 K × K table for single-code event
data (where K is the number of codes) would have structural zeros on the
diagonal, indicating circumstances (a transition from one code to itself)
that logically cannot occur. Such logical zeros can affect other cell statis-
tics, as noted shortly. For example, if the Figure 8.1 data were converted
into single-code event data with Assure, Explain, and Touch removed (see
“EVENT and BOUT for timed-event data” in Chapter 10), the observed
counts and transitional probabilities (see next paragraph) for the resulting
sequence of twelve events (for which codes cannot repeat) would be as
shown in Figure 9.3.
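
Tallying transitions is nearly a one-liner once the event sequence is in
hand. The following minimal sketch uses the twelve-event sequence just
described (event onsets read off Figure 8.1) and reproduces the counts in
Figure 9.3:

from collections import Counter

events = ["Fuss", "Cry", "Calm", "Fuss", "Calm", "Cry",
          "Fuss", "Calm", "Fuss", "Cry", "Calm", "Fuss"]

lag1 = Counter(zip(events, events[1:]))   # 11 tallies for 12 events
for (g, t), count in sorted(lag1.items()):
    print(f"{g} -> {t}: {count}")
# Transitional probability = count / row sum; e.g.,
# p(Fuss | Calm) = 3 / 4 = .75, matching Figure 9.3.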

Conditional and Transitional Probabilities


The observed joint frequencies of a contingency table have descriptive
value, but are used primarily to compute other statistics. One statistic com-
puted from them, likewise important primarily for its descriptive value, is
the conditional probability. Conditional probabilities indicate the probabil-
ity of a target (column) behavior being coded when the given (row) behav-
ior is coded. They are computed row-wise. The joint frequency for each
cell is divided by the sum for its row; consequently, the conditional prob-
abilities on each row necessarily sum to 1. For example, as shown in Figure
9.2, although the simple probability of Assure was .23, the conditional
probability of Assure given Calm was .06, but of Assure given Fuss was .40.
Descriptively, it appears that Fuss and Assure were often associated, but
Calm and Assure seldom were.
For single-code event data, conditional probabilities are computed the
same way, but are called transitional probabilities because they indicate
transitions of some lag. For Lag 1, when codes cannot repeat, conditional
probabilities on the diagonal – like the observed joint frequencies from
which they are computed – are structural zeros. For example, as shown in
Figure 9.3, the transitional probability of Fuss at Lag 1 after Calm is .75, but
the simple probability of Fuss is .36.
Conditional probabilities reflect target behavior frequencies and so
are not comparable over sessions. For this reason, although it may seem
tempting to regard particular conditional probabilities as outcome scores
for individual sessions and to analyze them using standard statistical tech-
niques, this is not recommended. There are better 2 × 2 contingency index
alternatives, as discussed shortly.

Expected Frequencies and Adjusted Residuals


As shown in Figure 9.1, the expected frequency for a cell is the probabil-
ity of its target (column) behavior multiplied by the frequency of its given
(row) behavior. This is the frequency expected if, in fact, there is no associa-
tion between given and target codes. With no association, we would expect
the proportion of column counts in each row to be the same for all rows
and to reflect the proportions based on the column sums. In other words,
the expected value for a cell is its column probability multiplied by its row
count€– which is often presented as the column sum multiplied by the row
sum divided by the total sum or N. For example, although 10 units were
coded both Fuss and Assure, we would expect only 5.8 based on chance (i.e.,
simple probability of Assure × simple frequency of Fuss). This definition
(i.e., closed-form formula) for expected frequencies is appropriate as long
as a table contains no structural zeros. Given structural zeros (e.g., as occur
when computing expected frequencies for single-code event Lag 1 transi-
tional probabilities with codes that cannot repeat), expected frequencies
should instead be computed with an iterative proportional fitting algorithm
(IPF; see Bakeman & Robinson, 1994) – as GSEQ does when the appropriate
option is checked.
Like observed joint frequencies, expected frequencies and their raw
residuals (the difference between observed and expected joint frequen-
cies) are of interest descriptively, although they are primarily used to
compute other statistics – the most useful of which is the adjusted resid-
ual. The adjusted residual indicates the extent to which an observed joint
frequency differs from chance: It is positive if the observed is greater than
chance and negative if the observed is less than chance. If there is no asso-
ciation between given and target codes, then adjusted residuals are dis-
tributed approximately normally with a mean of 0 and variance equal to 1
(granted assumptions), and so their magnitudes can be compared across
various pairs of codes within the same contingency table – this is perhaps
their major merit (Allison & Liker, 1982; Haberman, 1978). For example,
Explain was more likely given Calm (z = 2.50) and Assure was more likely
given Fuss (z = 2.58), whereas Explain was less likely given Fuss (z = –0.63)
and Assure less likely given Calm (z = –1.89). Note, assuming the normal
approximation is justified, the first two – but not the last two – adjusted
residuals reached the 1.96 (absolute) criterion required to claim signifi-
cance at the .05 level (see Figure 9.2).
Adjusted residuals have limitations. One limitation is that the distri-
bution of adjusted residuals only approximates the standard normal dis-
tribution. The larger the row sum (xr+) and the less extreme the expected
probability (erc/xr+), the better the approximation. A reasonable guideline
is to assume that adjusted residuals are approximately normally distributed
only when the row sum is at least 30, and then only when the expected
probability is > .1 and < .9 (Haberman, 1979). Even when these guidelines
are met, a second limitation involves type I error; a single table may contain
several adjusted residuals, each of which is tested for significance. Because
comparing each to a 1.96 absolute, p < .05, criterion invites type I error,
some investigators may prefer a more stringent criterion, for example, 2.58
absolute, p < .01, or even an arbitrary criterion like 3 absolute (Bakeman &
Quera, 2012). Another possibility is a winnowing strategy that identifies the
most consequential adjusted residuals (see “Deviant cells, type I error, and
winnowing” in Chapter 10).
One final cell statistic, not included in Figure 9.1 and seldom encoun-
tered, is the standardized residual – defined as the raw residual divided by
the square root of the expected frequency, (xrc – erc) ÷ √erc. However, the
adjusted residual as defined in Figure 9.1 offers a better approximation to
a normal distribution and, for that reason, is preferable (Haberman, 1978).
It would make sense to call it standardized, but when the adjusted residual
was defined, the term standardized residual was already in use with the def-
inition just given (Haberman, 1979) – and so the better approximation is
known as the adjusted residual.
Pearson chi-square:  $\chi^2 = \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{(x_{rc} - e_{rc})^2}{e_{rc}}$

Likelihood-ratio chi-square:  $G^2 = 2 \sum_{r=1}^{R} \sum_{c=1}^{C} x_{rc} (\ln x_{rc} - \ln e_{rc})$

Figure 9.4. Definitions for two chi-square table statistics.

indices of association for two-dimensional tables


Kappa as an index for K × K tables was discussed extensively in Chapters 5
and 6. For completeness, here we mention two statistics for R × C tables, both
of which gauge the association or independence of the row and column dimen-
sions. The first statistic is the Pearson chi-square, which usually appears in
introductory statistics courses. Denoted as χ2 (although some texts use X2 for
the sample statistic), it is distributed as χ2 with (R – 1) × (C – 1) degrees of free-
dom (minus the number of structural zeros, if any). Its computation involves
summing over all cells with nonzero expected values (see Figure 9.4).
In fact, it makes sense to compute χ2 before computing adjusted residu-
als, and to pay attention to “significant” adjusted residuals only if the χ2
is itself significantly different from zero. Just as follow-up analyses are not
performed without a significant interaction in an analysis of variance, so
too a nonsignificant χ2 can protect us from engaging in a “fishing exped-
ition” with the separate adjusted residuals.
The second, and less well known, statistic is the likelihood-ratio chi-square.
Denoted as G2, it is also distributed as χ2 with (R – 1) × (C – 1) degrees of
freedom (minus the number of structural zeros, if any). Its computation
involves natural logarithms (ln) and summing over all cells with nonzero
observed and expected values (see Figure 9.4). For technical reasons (which
matter primarily for contingency tables with more than two dimensions),
G2 is preferable to χ2 for log-linear analyses (Fienberg, 1980; see also the
log-linear analysis sections in Chapter 11).
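The Figure 9.4 definitions are a few lines of code. In this sketch the masks implement the summation rules just described – Pearson chi-square sums over cells with nonzero expected values, G2 over cells with nonzero observed and expected values – and scipy merely converts the statistic to a p value:

    import numpy as np
    from scipy.stats import chi2

    x = np.array([[10.0, 15.0],
                  [ 4.0, 31.0]])
    e = np.outer(x.sum(axis=1), x.sum(axis=0)) / x.sum()

    ok = e > 0                          # sum only over nonzero expected values
    pearson = ((x[ok] - e[ok]) ** 2 / e[ok]).sum()

    ok_g = ok & (x > 0)                 # ln 0 is undefined; skip observed zeros
    g2 = 2.0 * (x[ok_g] * (np.log(x[ok_g]) - np.log(e[ok_g]))).sum()

    df = (x.shape[0] - 1) * (x.shape[1] - 1)   # minus any structural zeros
    print(pearson, g2, chi2.sf(pearson, df))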

contingency indices for 2 × 2 tables


When research questions involve the contingency between two behaviors
(e.g., a specific given and target, or a specific row and column), any R × C
table can be reduced to a 2 × 2 table. Row 1 is labeled with the given behav-
ior, Column 1 with the target behavior, and Row 2 and Column 2 with the
                    Target:
                   Yes    No
    Given:  Yes     a      b
            No      c      d

    OR      Odds ratio = $\frac{a/b}{c/d} = \frac{ad}{bc}$
    lnOR    Log odds ratio = ln(odds ratio)
    Q       Yule's Q = $\frac{ad - bc}{ad + bc}$

Figure 9.5. Notation and definitions for three basic 2 × 2 contingency indices.

residual code. This 2 × 2 table arrangement is advantageous because then
the given behavior-target behavior contingency can be assessed with stand-
ard summary statistics for 2 × 2 tables. Such tables and the indices derived
from them have long been a source of fascination among statisticians (e.g.,
Cochran, 1954), but here we focus on three indices that seem especially use-
ful for assessing contingency: the odds ratio, the log odds (i.e., the natural
logarithm of the odds ratio), and Yule's Q. These indices are derived from the
counts of a 2 × 2 table whose cells are conventionally labeled a, b, c, and d, as
shown in Figure 9.5.

Odds Ratio and Log Odds


The odds ratio is a measure of effect size. Its interpretation is straightfor-
ward and concrete. It is useful descriptively and deserves to be used more
by behavioral scientists. (It is already widely used by epidemiologists.) As
the name implies, it is the ratio of two odds derived from the top and bot-
tom rows of a 2 × 2 table (see Figure 9.5). The odds ratio can assume values
from 0 to infinity. A value of 1 indicates no effect (Row 1 odds = Row 2
odds), a value greater than 1 indicates that the target behavior is more
likely in the presence of the given behavior than in its absence (Row 1 odds
> Row 2 odds), and a value less than 1 indicates that the target behavior is
more likely in the absence of the given behavior than in its presence (Row
1 odds < Row 2 odds). Because the odds ratio varies from 0 to infinity with
1 indicating no effect, its distributions are often skewed. Consequently,
although useful descriptively, the odds ratio is not as useful analytically.
As an example, consider the two 2 × 2 tables in Figure 9.6; these tables
were formed by collapsing the 3 × 3 table of observed counts in Figure 9.2
            Assure    &   Total                Assure    &   Total
    Fuss      10     15      25        Cry        3     16      19
    &          4     31      35        &         11     30      41
    Total     14     46      60        Total     14     46      60

    OR = 5.17 (1.39–19.2)              OR = 0.51 (0.12–2.10)
    lnOR = 1.64                        lnOR = –0.67
    Yule's Q = .68                     Yule's Q = –.32

Figure 9.6. Two 2 × 2 contingency tables for the Figure 8.1 data with their associated
odds ratios (95% CIs for the ORs are given in parentheses), log odds, and Yule's Qs.

in two different ways. For the left-hand table, the odds of Assure to any code
not Assure (&) when Fuss was coded were 10 to 15, but the odds were 4 to
31 when Fuss was not coded. For this example, the odds ratio is 5.17 (10
÷ 15 divided by 4 ÷ 31). Concretely, this means that the odds of the nurse
offering assurance when the child was fussing were more than five times
greater than when the child was not fussing. Moreover, because the 95 per-
cent confidence interval (CI) does not include 1 – which is the no-effect
value – this result is statistically significant, p < .05. In contrast, assurance
was about half as likely when the child was crying than when not, but the
association was not statistically significant (see Figure 9.6, right).
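The computations behind Figure 9.6 are equally brief. In this sketch the 95 percent CI is the common Wald interval – formed on the log-odds scale with standard error √(1/a + 1/b + 1/c + 1/d), then exponentiated. We assume this standard construction, which reproduces the 1.39–19.2 interval shown in the figure:

    import math

    def indices_2x2(a, b, c, d, z=1.96):
        """Odds ratio, log odds, Yule's Q, and a Wald CI for the OR."""
        odds_ratio = (a * d) / (b * c)
        ln_or = math.log(odds_ratio)
        q = (a * d - b * c) / (a * d + b * c)
        se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of the log odds
        ci = (math.exp(ln_or - z * se), math.exp(ln_or + z * se))
        return odds_ratio, ln_or, q, ci

    # Left-hand table of Figure 9.6 (Fuss and Assure):
    print(indices_2x2(10, 15, 4, 31))
    # OR = 5.17, lnOR = 1.64, Q = .68, 95% CI = (1.39, 19.2)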
The log odds is the natural logarithm of the odds ratio. It varies from
negative infinity to positive infinity, with zero indicating no effect.
Compared to the odds ratio, its distributions are less likely to be skewed.
(Note, as with all scores, skew should be checked before analysis; if scores
are skewed, nonparametric analyses should be considered or scores should
be recoded or transformed before parametric analyses). The log odds is
expressed in difficult-to-interpret logarithmic units – which can be a limi-
tation. For example, for the Fuss-Assure association, the natural logarithm
of the odds ratio is ln 5.17 = 1.64 (i.e., 2.718…^1.64 = 5.17), which is diffi-
cult to interpret concretely. Thus the odds ratio is better descriptively, but
the log odds is often the better choice when using standard parametric
statistical techniques such as correlation, multiple regression, and analysis
of variance.
Investigators sometimes ask whether an individual odds ratio is statis-
tically significant – meaning significantly different from 1 in a sample of the
size used for computation. We are of three minds on this. First, we think –
as do others (e.g., Wilkinson and the Task Force on Statistical Inference,
1999) – that statistical significance is often overrated and overemphasized,
and that equal emphasis on effect size is desirable. Second, it is nonetheless
useful to compute and report 95 percent CIs for odds ratios (as GSEQ and
most statistical programs do); if 1 is not included in the CI, the odds ratio
differs from 1, p < .05. Third, guidelines – understood with the appropriate
grain of salt – can be useful (e.g., Cohen's, 1977, suggestion that Pearson
product-moment correlations of .10, .30, and .50 represent small, medium,
and large effect sizes).
With regard to odds ratios, a general guideline suggested by Haddock,
Rindskopf, and Shadish (1998) is that odds ratios close to 1.0 indicate
weak relationships, whereas odds ratios over 3.0 for positive associations
or less than 0.33 for negative associations indicate strong relationships.
Additionally, we think that odds ratios between 1.25 and 2.00 (or 0.50–
0.80) should be regarded as weak, and those between 2.00 and 3.00 (or
0.33–0.50) should be regarded as moderate. Our rationale is that increas-
ing the odds 100 percent, which is what an odds ratio of 2.00 does, is
a reasonable definition for moderate (Parrott, Gallagher, Vincent, &
Bakeman, 2010). Moreover, our cut points for the odds ratio correspond
to values of .11, .33, and .50 absolute for Yule's Q, an index of association
for 2 × 2 tables that ranges from –1 to +1 and is discussed next. Note that
these cut points for Yule's Q are almost the same as Cohen's for r (1977,
see previous paragraph).

Yule’s Q
Yule’s Q is an index of effect size (see Figure 9.5). A straightforward alge-
braic transformation of the odds ratio (see Bakeman, 2000), it is like the
familiar correlation coefficient in two ways€– it varies from€–1 to +1 with 0
indicating no effect, and its units have no natural meaning. Consequently,
its interpretation is not as concrete or clearly descriptive as the odds ratio.
On the other hand, compared to the odds ratio, its distributions are less
likely to be skewed, and so it can be used both descriptively and analyt-
ically (assuming distributions are not badly skewed). It is also somewhat
less vulnerable to a zero cell count than the odds ratio, as described in the
next section.
One final 2 × 2 cell statistic, not included in Figure 9.1, but often listed
in older texts, is the phi coefficient (Hays, 1963; see also Bakeman, 2000). It
is a Pearson product-moment correlation coefficient computed with bin-
ary data. Like Yule's Q, it can potentially assume values from –1 to +1, but
can only achieve its maximum value when pr = pc = .5; thus Yule's Q almost
always seems preferable.
Vulnerability to Zero Cells


Which contingency index should you use – the odds ratio descriptively and
the log odds analytically, or Yule's Q for both? It is probably a matter of
personal preference. We think the odds ratio is more concretely descrip-
tive, but Yule's Q may seem more natural to some, especially those schooled
in correlation coefficients. Another consideration is computational vulner-
ability to zero cells. A large positive effect (Column 1 behavior more likely
given Row 1 behavior) occurs as the count in cell b (or c) tends toward
zero relative to other counts, and a large negative effect (Column 1 behavior
less likely given Row 1 behavior) occurs as the count in cell a (or d) tends
toward zero.
A zero in a cell is ambiguous. If we know the zero is not structural (i.e., a
tally is logically possible), we still don't know whether the zero is empirical
(although possible, the circumstance doesn't occur) or simply a result of
not observing long enough (i.e., the true value falls outside the range of the
measuring circumstance and so is censored). For whichever reason, one or
more zero cells can cause computational difficulties – more so for the odds
and log odds ratios than for Yule's Q.
If cell b or c (or both) are zero, the value of Yule's Q will be +1 – but the
odds ratio and the log odds will be undefined (divide by 0). If cell a or d
(or both) are zero, the value of Yule's Q will be –1 and the odds ratio will
be 0 (using the computational formula, ad/bc) – but the log odds will be
undefined (divide by 0). Thus, Yule's Q is not vulnerable to a zero cell (or
even to two zero cells, if they are cater-cornered). The odds ratio is vulner-
able only if cell b or c (or both) are zero. And the log odds is vulnerable if
any cell is zero. This vulnerability of the log odds to zero cells leads many
experts to advocate adding a small constant – typically ½ – to each cell
before computing a log odds (e.g., Wickens, 1989), although a smaller con-
stant may be advisable when some numbers in the 2 × 2 table are very large
and some very small.
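A sketch of the flattening-constant remedy just described; the default of ½ follows the convention cited above, and the choice of a smaller constant is left to the caller:

    import math

    def log_odds(a, b, c, d, constant=0.5):
        # Add a small constant to every cell so that a zero cell
        # cannot make the log odds undefined.
        a, b, c, d = (v + constant for v in (a, b, c, d))
        return math.log((a * d) / (b * c))

    print(log_odds(10, 0, 4, 31))   # defined even though cell b is zero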
One circumstance is always fatal. If two cells in the same row or the
same column are zero – which means that a row or column sum is zero –
no contingency index can be computed. Subsequent analyses should treat
the value of the index as missing. After all, if one of the behaviors does
not occur, no contingency can be observed. Even when row or column
sums are not zero, but simply small, it may still be wise to treat the value
as missing. With very small numbers, there is little reason to have confi-
dence in the value of the index, even when a value can be computed. We
think that the value for a contingency index should be treated as missing
if any row or column sum is less than 5 (this is the default value supplied
by GSEQ), but some investigators may prefer a more stringent guideline.

summary
The summary statistics described in the previous chapter were computed
for individual codes. In contrast, the statistics described in this chapter are
derived from two-dimensional contingency tables whose rows and columns
are defined with two or more codes. These statistics are of three kinds. The
first kind are primarily descriptive statistics for individual cells, the second
kind are summary indices of association for tables of varying dimensions,
and the third kind€– and most important for sequential analyses€– are sum-
mary statistics specifically for 2â•›×â•›2 tables. By convention, we call the row
codes the givens and the column codes the targets.
Individual cell statistics include observed joint frequencies, joint fre-
quencies expected by chance (i.e., assuming no association between row
and column codes), conditional probabilities, raw residuals (differences
between observed and expected frequencies), and adjusted residuals.
Adjusted residuals are especially useful because they allow comparisons of
given-target code pairs within a particular table; granted assumptions and
sufficient counts, they are distributed approximately normally.
The observed cell frequencies for timed-event, interval, and multicode
event data are the tallies that result from cross-classification. Each of the
session’s units is considered in turn and, depending on how it is coded, a
tally is added to one of the cells of the contingency table. The order in which
the row and column codes of the R × C table are listed matters. All units
must add one, and only one, tally to the table so that the total number of
tallies equals the number of units cross-classified. Tallying follows a hier-
archical rule: If a unit contains more than one given (or target) code, only
the one encountered first in the list is used to cross-classify the unit. Usually
columns are not lagged relative to rows (Lag 0).
In contrast, for single-code event data, events cannot co-occur. Lag 0
would result in a table with structural zeros in off-diagonal cells and code
frequencies on the diagonal, so typically columns are lagged relative to
rows. When codes cannot repeat, zeros on the diagonal of a Lag 1 table are
structural, and so expected frequencies need to be computed with an itera-
tive proportional fitting algorithm instead of the usual formula.
Indices for two-dimensional tables include the well-known χ2 (Pearson
chi-square); the similar G2 (likelihood-ratio chi-square), which is used in
log-linear analysis; and kappa, as discussed in Chapters 5 and 6. Adjusted


residuals within a table are best considered only when the chi-square for the
table is significant.
Contingency indices for 2 × 2 tables include the odds ratio, the log odds,
and Yule's Q. They are especially useful in sequential analyses. The odds
ratio is especially useful descriptively, although the log odds is often better
suited for analysis, and Yule's Q is less vulnerable to cells that contain
zero. Even so, any of these indices might be better regarded as missing if any
row or column sums are less than some value (e.g., 5).
10

Preparing for Sequential and Other Analyses

It makes sense to define codes and use recording procedures that work best
for the observers. After all, if we expect good work, we should accommodate
observers’ comfort; the data as recorded can be modified later into forms
that facilitate analysis. In Chapter 4, we argued the utility of representing
observational data with a few standard formats (i.e., single-code event,
timed-event, interval, and multicode event data) and then conceptualizing
such data as a code-unit grid – partly for the order and organization doing
so brings to observational data, but also for the opportunities it presents for
later data modification. In this chapter we describe, among other matters,
data modification, that is, specific ways new codes can be created from exist-
ing codes – new codes that are faithful to and accurately reflect our research
questions and that extend the range of our data-analytic efforts.
Given the benefits, it is a bit puzzling that data modification of obser-
vational data is not more common. Perhaps it is because data modifica-
tion occupies something of a middle ground. On the one hand, there are
a number of systems for computer-assisted coding that facilitate the initial
recording of observational data; most produce the sorts of summary scores
described in Chapter 8. On the other hand, there are a number of statistical
packages that permit often quite complex data recoding and transformation
of summary scores. But no coding or analysis programs we know of address
the need to modify sequential data before summary scores are computed.
In this respect, GSEQ, with its extensive and flexible data modification cap-
abilities, may be uniquely helpful.

creating new codes from existing codes


The SDIS compiler creates a modified data file (called an MDS file) from the
initial SDIS-formatted data file (or SDS file). Most data modifications create

new codes from existing ones; they add these new codes to the MDS file but
leave the initial SDS file unchanged.
In this section we define the data modification commands that are imple-
mented in GSEQ. Several depend on standard logical operations; a few are
simply housekeeping; several work only with timed-event, interval, and
multicode event data; and a few others work only with single-code event
data. In GSEQ, after making a series of data modifications (including the
WINDOW command described in the next section), you have the option to
overwrite the existing MDS file or to create a new MDS file. If you choose to
create a new file, it contains the modifications you just made and the earlier
MDS file remains intact. The new codes created by the data modification
commands can be used in any subsequent analyses (summary statistics,
contingency tables, etc.).

Logical Combinations for Timed-Event, Interval,


and Multicode Event Data
The standard logical operators are AND, OR, NOT, NOR, and XOR. All are
implemented in GSEQ and each is potentially useful, but in practice you
will probably find AND and OR the most useful.
Especially for timed-event data, and especially when codes have been
defined at a more molecular level than is strictly necessary for some subse-
quent analyses (see Chapter 2), you may find it useful to create new super-
ordinate codes from existing ones. This requires a logical OR: the new code is
checked (coded) for any time unit (or interval or multicoded event) already
coded for one or more of a list of specified existing codes. For example, as
shown in Figure 10.1, the command – OR FETor = Fuss Explain Touch –
would create three new events: one from the existing Fuss, the second a
stretch of time coded for one or more of these codes, and the third a stretch
of time coded only for Touch. The GSEQ MDS file would now contain the
new code FETor in addition to the previously existing codes; the initial SDS
file is unchanged.
It may also be desirable to create a new code that reflects times when all
of the specified codes co-occur. This requires a logical AND: the new code is
checked (coded) for any time unit (or interval or multicoded event) already
coded for all of a list of specified existing codes. For example, as shown in
Figure 10.1, the command – AND FETand = Fuss Explain Touch – would create
one new event for the two seconds when all three of these codes co-occurred.
Although you may never find occasion to use them, the remaining three
logical commands are also shown in Figure 10.1. NOT is the complement of
[Figure: a code-unit grid, one column per second (1–30), showing three existing
codes (Fuss, Explain, Touch) and, below them, the new codes produced by the
AND, OR, NOT, NOR, XOR, and RECODE commands.]

Figure 10.1. Use of logical combinations and the RECODE command to create
new codes from existing ones, assuming 1-second precision for timed-event data
or 1-second intervals for interval sequential data; results would be the same for
multicode event data. Each of the six commands at the bottom specified the three
codes at the top. A check mark indicates the onset of an event (or episode) and a
plus sign its continuation.

AND: The new code is checked for any time unit (or interval or multicoded
event) that is not coded for all of the existing codes specified. NOR is the com-
plement of OR: The new code is checked for any time unit (or interval or
multicoded event) that is not already coded for one or more of the existing
codes specified. Finally, XOR is the exclusive OR: The new code is checked
for any time unit (or interval or multicoded event) that is already coded for
one, and only one, of the existing codes specified.
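Conceptually, each of these commands is a Boolean operation down the columns of the code-unit grid. A minimal Python sketch, with made-up onsets and offsets rather than the exact Figure 10.1 values:

    import numpy as np

    # Hypothetical code-unit grid: one Boolean row per existing code,
    # one column per second.
    units = 30
    fuss = np.zeros(units, dtype=bool)
    fuss[2:4] = True; fuss[14:24] = True
    explain = np.zeros(units, dtype=bool)
    explain[3:10] = True; explain[17:23] = True
    touch = np.zeros(units, dtype=bool)
    touch[8:19] = True; touch[26:30] = True
    codes = np.vstack([fuss, explain, touch])

    or_new  = codes.any(axis=0)        # OR:  one or more of the codes occur
    and_new = codes.all(axis=0)        # AND: all of the codes co-occur
    not_new = ~and_new                 # NOT: complement of AND
    nor_new = ~or_new                  # NOR: complement of OR
    xor_new = codes.sum(axis=0) == 1   # XOR: exactly one of the codes occurs

With these made-up times, all three codes happen to co-occur for just two time units, so the AND code marks only those two seconds.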

RECODE for All Data Types


An additional data modification command, not strictly a logical combin-
ation, is RECODE. Unlike the five logical commands just described, it can
also be applied to single-code event data; we describe how shortly, but here
we describe its use with timed-event, interval, and multicode event data. As
shown in Figure 10.1, RECODE is like OR but preserves individual code
onsets and thus code frequency. The new codes created by both OR and
RECODE check the same time units (or intervals or multicoded events)
and so have the same duration, but the number of new events created by
RECODE is the same as the number of existing events renamed by the
RECODE command. For this example, the frequency of the new code is
3 with OR, but 5 with RECODE – the number of existing events com-
bined to create the new code. When combining existing events into a new
code, use OR if you think of the merged event as becoming a single occur-
rence, but use RECODE if you think of the merged events as remaining
separate but contiguous events. RECODE differs from OR in another way
as well. Existing codes used to define OR remain in the GSEQ MDS file,
whereas existing codes used to define RECODE are deleted from the MDS
file and so are unavailable for subsequent analyses; the initial SDS file is
unchanged.

EVENT and BOUT for Timed-Event Data


The five logical combination commands can be applied to timed-event,
interval, and multicode event data, whereas EVENT and BOUT can be
applied only to timed-event data.
Occasionally it is useful – as always, depending on your research ques-
tion – to reduce a timed-event data file into a single-code event data file in
which all information about event duration is removed, leaving you with a
data file in which only the sequence of events is preserved. In such cases,
use the EVENT command. If no events in the timed-event data file began
at the same time, a single-code event data file results; but if some events
began at the same time, a multicode event data file results with events shar-
ing a common onset time assigned to the same multicoded event. Having
two versions of the same session – one in timed-event and one in untimed-
event form (either single-code or multicode) – gives you the option of
selecting the version that best lets you answer a particular research ques-
tion. Moreover, when creating an untimed-event version of your timed-
event data file, GSEQ gives you the option of considering only some events.
Thus you can select which codes appear in the single-code event data file
that EVENT creates; the events you select appear in sequence but others
are ignored.
Many behaviors occur in bursts – for example, sneezing, barking, or
growling – and you may be interested in defining a series of closely spaced
events (identically coded or not) as a single event, for which bout seems a
reasonable term (see e.g., Becker, Buder, Bakeman, Price, & Ward, 2003). In
such cases, use the BOUT command. First, you are asked to enter a max-
imum gap duration. GSEQ then creates a new code that comprises the burst
of old codes and any gaps between them no longer than the maximum gap
length you specified. The units for the gap duration depend on the number
of decimal digits used for time. For example, if your time units are seconds
Single-code event sequence:
cm dcl tch chk dcl qst dcl chk tch chk cm dcl tch cm

RECODE tlk = cm dcl qst:
tlk tlk tch chk tlk tlk tlk chk tch chk tlk tlk tch tlk

LUMP tlk = cm dcl qst:
tlk tch chk tlk chk tch chk tlk tch tlk

CHAIN resp = chk tlk tch:
tlk tch chk tlk chk tch resp tlk

Figure 10.2. Resulting sequences when applying the RECODE and LUMP data
modification commands to the single-code event sequence shown and applying
CHAIN to the sequence resulting from the LUMP command.

and are specified in hundredths, you would enter 100 to indicate that you
wanted gaps of 1 second filled in.
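The gap-filling logic is simple enough to sketch; this illustrates the idea rather than GSEQ's implementation, with events represented as (onset, offset) pairs in whatever time units the data use:

    def bout(events, max_gap):
        """Merge a sorted list of (onset, offset) events into bouts,
        filling gaps no longer than max_gap."""
        bouts = []
        for onset, offset in sorted(events):
            if bouts and onset - bouts[-1][1] <= max_gap:
                bouts[-1][1] = max(bouts[-1][1], offset)   # extend current bout
            else:
                bouts.append([onset, offset])              # start a new bout
        return [tuple(b) for b in bouts]

    # Three barks less than a second apart become one bout:
    print(bout([(0.0, 0.4), (1.2, 1.5), (2.3, 2.6), (9.0, 9.3)], max_gap=1.0))
    # [(0.0, 2.6), (9.0, 9.3)]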

RECODE, LUMP, and CHAIN for Single-Code Event Data


LUMP and CHAIN can be applied only to single-code event data. As dis-
cussed previously, RECODE can be applied both to single-code event data
and also to timed-event, interval, and multicode event data. One reason
for using the EVENT command just discussed is to allow you to then
use, in particular, the CHAIN command with the resulting single-code
event data.
RECODE applied to single-code event data is essentially the same as
RECODE applied to other data types. The frequency of individual events is
preserved by the new code (although, as noted earlier, existing codes used
to define RECODE are deleted from the MDS file). Assume that the codes
used are cm, dcl, qst, chk, and tch for mother commands, declares, and
questions, and for infant checks (with a look to the mother) and touches
(a toy). In the example sequence shown in Figure 10.2, RECODE results in
each cm, dcl, and qst being replaced with tlk, but the frequency of mother
utterances is not changed.
In contrast, LUMP results in one or more successive cm, dcl, or qst
codes being replaced with a single tlk. The frequency of tlk with RECODE
is 8 (the same as the sum of the frequencies for cm, dcl, and qst in the ori-
ginal sequence), whereas the frequency of tlk with LUMP is 4. Use LUMP
instead of RECODE when you regard successive events assigned any of
the existing codes specified by the LUMP command as a continuation of the
same event.
The CHAIN command is unique in that sequence matters. Whenever


the sequence specified with the CHAIN command is found, occurrences
of the old codes that constitute the chain are replaced in the MDS file with
the new code specified on the CHAIN command. The old occurrences are
deleted from the MDS file, but other occurrences of the old codes that are
not part of a chain are, of course, retained. Again, the initial SDS file is
unchanged. In Figure 10.2, CHAIN has been applied to the sequence that
resulted from the LUMP command. If CHAIN had been applied instead
to the sequence resulting from RECODE, no sequence matching the one
on the CHAIN command would have been found. For this example, we
would use RECODE if we were interested simply in how often the infant
checked with the mother and then didn’t touch a toy until the mother spoke
to her infant€– effectively ignoring type of speech. On the other hand, if we
thought the type of mother speech between chk and tch mattered, we would
define a specific chain with the CHAIN command and apply it to the initial
event sequence, not to the one resulting from the RECODE.
Two or more existing codes may be listed on the CHAIN command, pro-
viding a way to identify how often chains of two or more events occurred.
Here the example included three events in the chain, but it can also be use-
ful to define just the first two (in this case, chk then tlk). We would then
define a table for which the given was the chain just defined and targets
were any codes of interest (&s could be added to make the given and targets
exhaustive). We would then be able to state what codes followed the chk-tlk
chain with greater or less than expected frequency, as described in the pre-
vious chapter. As you can see, RECODE, LUMP, and CHAIN, as applied to
single-code event data, often offer a clear way to address specific research
questions. For this reason, and as just noted, investigators may find it useful
to have two versions of their data – one timed-event for some questions and
one single-code event for others.
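All three commands are easy to mimic on a list of codes. The sketch below reproduces the Figure 10.2 sequences; the function names are ours, not GSEQ syntax:

    def recode(seq, new, old):
        # RECODE: rename every occurrence, preserving event frequency.
        old = set(old)
        return [new if c in old else c for c in seq]

    def lump(seq, new, old):
        # LUMP: collapse runs of successive listed codes into one event.
        out, old = [], set(old)
        for c in seq:
            if c in old:
                if not (out and out[-1] == new):
                    out.append(new)
            else:
                out.append(c)
        return out

    def chain(seq, new, pattern):
        # CHAIN: replace each occurrence of the exact subsequence.
        out, i, k = [], 0, len(pattern)
        while i < len(seq):
            if seq[i:i + k] == pattern:
                out.append(new); i += k
            else:
                out.append(seq[i]); i += 1
        return out

    events = "cm dcl tch chk dcl qst dcl chk tch chk cm dcl tch cm".split()
    print(recode(events, "tlk", ["cm", "dcl", "qst"]))   # 14 events, 8 of them tlk
    lumped = lump(events, "tlk", ["cm", "dcl", "qst"])
    print(lumped)                                        # 10 events, 4 of them tlk
    print(chain(lumped, "resp", ["chk", "tlk", "tch"]))  # chk tlk tch -> resp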

REMOVE and RENAME for All Data Types


REMOVE and RENAME are simple housekeeping commands that work
for all data types. REMOVE deletes a code or codes from the MDS file.
You might use it to reduce clutter if, after creating new codes from existing
codes, you no longer plan to use those original codes. Or you might use it
to remove a newly created code – perhaps because you made an error or, on
reflection, changed your mind regarding its definition or need.
RENAME lets you give a new name to an existing code in the MDS file (as
usual, the SDS file is unchanged). You might use it if you decide that existing
codes could be named in ways that are more mnemonic or consistent or, after
creating a new code, you decide a different name would work better.

creating new codes as “windows” anchored


to existing codes
Commands like AND, OR, RECODE, LUMP, and CHAIN make it pos-
sible to create new codes in very flexible ways. The WINDOW command
creates yet more possibilities. It can be applied to any data type, but is used
primarily with timed-event data. Often the new codes created with data
modification commands, including the WINDOW command, are used to
define contingency table targets or givens. We can then examine their co-
occurrence with each other and with other behaviors of interest in ways
that directly address our research questions.
The WINDOW command defines a new event as a stretch of time (or
intervals, or even events) anchored to onsets and offsets of existing codes (see
Figure 10.3). For example, we can define a new code consisting only of the
onset unit of the existing code (the notation is the code preceded by a left
parenthesis) or consisting only of the offset unit (the notation is the code fol-
lowed by a right parenthesis). A new code can also be defined as the existing
code expanded to include three units before its onset (e.g., Cry-3) or the three
units after its offset (e.g., Cry+3) or any stretch of units anchored to existing
onsets, offsets, or both. For example, we can define a new code as extending
from four units before the onset of the existing event to four units after its off-
set. For a graphic presentation of this and other examples, see Figure 10.3.
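In code, a window is just shifted copies of each event's onset and offset. A sketch using our own notation rather than GSEQ's, with events again as (onset, offset) pairs in time units:

    def window(events, start_anchor, start_shift, end_anchor, end_shift):
        """New events anchored to onsets/offsets of existing events.
        Anchors are 'onset' or 'offset'; shifts are signed time units.
        E.g., the spec (Cry-4,Cry)+4 is ('onset', -4, 'offset', +4)."""
        new = []
        for onset, offset in events:
            start = (onset if start_anchor == 'onset' else offset) + start_shift
            end = (onset if end_anchor == 'onset' else offset) + end_shift
            if end >= start:
                new.append((start, end))
        return new

    cry = [(8, 15)]                                # Cry,8-15 as in Figure 10.3
    print(window(cry, 'onset', 0, 'onset', 0))     # (Cry: [(8, 8)]
    print(window(cry, 'onset', -3, 'offset', 0))   # Cry-3: [(5, 15)]
    print(window(cry, 'onset', -4, 'offset', 4))   # (Cry-4,Cry)+4: [(4, 19)]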
Given how useful data modification commands are – especially but not
limited to AND, OR, and WINDOW with timed-event data, and LUMP
and CHAIN with single-code event data – and how they let you address
research questions with great fidelity, it is somewhat surprising that rela-
tively few packaged programs provide data modification of sequential data.
True, statistical packages often include extensive data recoding and trans-
formation capabilities – but directed to summary and not sequential data.
And some computer-assisted coding systems include some data modifica-
tion capabilities – but more as a peripheral than a central feature and none as
extensively as GSEQ. This may reflect their initial purpose and subsequent
development. Most computer-assisted coding systems have been designed
primarily to collect data, and only later were data-analytic capabilities
added. In contrast, GSEQ was designed primarily as a data-analytic tool in
the first place with the assumption that other means and other programs
would be used for collecting data.
Existing data: Cry,8-15 (an event coded for seconds 8 through 15)

    WINDOW specification    New code occupies seconds
    (Cry                    8 (the onset unit only)
    Cry)                    15 (the offset unit only)
    Cry-3                   5–15
    Cry+3                   8–18
    (Cry-3,(Cry-1           5–7
    Cry)+1,Cry)+3           16–18
    (Cry+3                  8–11
    Cry)-3                  12–15
    (Cry-4                  4–8
    Cry)+4                  15–19
    (Cry-2,(Cry+2           6–10
    Cry)-2,Cry)+2           13–17
    (Cry-4,Cry)+4           4–19

Figure 10.3. Existing data and WINDOW command specifications for new codes
anchored to onsets and offsets of the existing code; the names for the new codes
are not shown.

pooled versus individual analyses


Behavioral scientists who study human behavior and who are interested
in individual variation almost always expect that data will not be pooled
over the individuals studied (or dyads or families or whatever analytic unit
is used). If no research factors exist, pooling would include all individuals
studied. If individuals are assigned to levels of a single factor, then pooling
could be across all levels of that factor or within its levels (e.g., pooling across
all males and females, or across males and females separately). If more than
one factor is defined, pooling could ignore or respect level boundaries and
be crossed as desired (e.g., pooling all urban males, all urban females, all
rural males, all rural females). It is worth noting that summary statistics
computed from pooled data may differ from the same statistics computed
as the mean of individual cases, simply because individual cases may con-
tribute differentially to the pooled data.
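A two-session illustration with made-up tallies shows why: when sessions contribute unequal amounts of data, the pooled conditional probability and the mean of the per-session probabilities diverge.

    # (joint tallies, given tallies) for two hypothetical sessions
    sessions = [(9, 10), (2, 40)]

    pooled = sum(a for a, n in sessions) / sum(n for _, n in sessions)
    mean_of_sessions = sum(a / n for a, n in sessions) / len(sessions)

    print(pooled)             # 0.22: session 2 dominates the pooled table
    print(mean_of_sessions)   # 0.475: each session counts equally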
The usual aversion among behavioral scientists to pooling may be
because, as humans, we value our individuality; we might not have the
same objection if data were pooled over grasshoppers. Certainly, individual


variation and the factors that explain it have long been central concerns of
behavioral science; these concerns are reflected in the statistics we use most
and expect our students to learn (e.g., correlation, multiple regression, ana-
lyses of variance). Nonetheless, pooling – by which we mean computing
summary statistics or entering tallies into a contingency table across all ses-
sions, or across all sessions for a specified factor – has its uses (which is why
it is a GSEQ option). We mention two such uses.
First, examining summary statistics or a particular contingency table
with tallies contributed by all individuals (we use individual in a generic
sense, meaning whatever the analytic unit is) can be a useful exploratory
technique – useful for you the investigator to know even if your pooled
results will never appear in the research reports you prepare for others.
Pooled results let you know, for example, whether events or co-occurrences
you expected theoretically happened in fact. If not, certain analytic options
may not be worth pursuing or some sort of corrective action may be required
(e.g., using data modification commands to combine existing codes into a
new, more frequently occurring one).
Second – and this can be more controversial – pooling may be useful
when the events of interest are sufficiently rare that the accuracy of summary
statistics computed for individual contingency tables becomes questionable
(see “Vulnerability to zero cells” in Chapter 9). In such cases, the percentage
of individuals with sufficient data to compute a particular summary index is
itself an important result. Moreover, pooling data over all such individuals,
or over individuals within a factor level, and then computing a summary
index like Yule’s Q for the pooled data can be informative. True, different
individuals contribute different numbers of tallies to this result and some
contribute not at all. Nonetheless, you can still report that for the group as a
whole the odds of a particular target were twice as great in the presence of a
particular given than in its absence. Because individuals were not sampled,
generalization is limited to similar groups, but not to similar individuals.
Moreover, when presenting pooled results, be prepared to argue your case
with journal editors, reviewers, and other arbiters of our literature. Still,
used judiciously, pooling can be an important data-analytic tool.

preparing export files and using


statistical packages
For most investigators, most of the time, the material presented in this
chapter and the two preceding ones is valuable because it suggests various
summary scores – including indices of contingency – that can be computed
separately for each session. Following such computations and guided by the
design (i.e., how the sessions are organized by any between- and within-
subjects factors), these summary scores can next be combined with other
variables (demographic information, scores from questionnaires, etc.).
Then, standard descriptive measures like means, medians, and standard
deviations can be computed and whatever analytic approaches you find
appropriate employed – including, but certainly not limited to, correlation,
multiple regression, and analysis of variance.
It makes little sense to include standard, widely available statistical ana-
lytic capabilities in computer-assisted coding programs or programs like
GSEQ that extract contingency tables and summary scores from sequen-
tial data. A number of widely used and readily available computer pro-
grams are designed specifically for the standard sorts of statistical analyses,
and there is no reason to reinvent them. On the other hand, it does make
sense for programs that collect and deal with observational data to prepare
export files of summary scores – files that can then be read and processed
by other programs. One common format, and the one used by GSEQ, is a
tab-delimited file. In such a file, the first row contains names of variables
separated by tabs and successive rows contain the data for those variables,
likewise separated by tabs. Thus a tab-delimited file contains a conven-
tional cases-by-variables grid. We do not know of a major statistical or
spreadsheet program that cannot read a tab-delimited file – which is why
this format is useful when you want to export data from one program and
import it into another.
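Writing (or reading) such a file takes only a few lines in most languages; here is a sketch using Python's standard library, with invented session scores:

    import csv

    rows = [
        {"session": "S01", "yuleQ": 0.68, "lnOR": 1.64},
        {"session": "S02", "yuleQ": -0.32, "lnOR": -0.67},
    ]

    with open("summary_scores.txt", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["session", "yuleQ", "lnOR"],
                                delimiter="\t")
        writer.writeheader()     # first row: variable names, tab-separated
        writer.writerows(rows)   # one case (session) per row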
Note that we have said nothing about statistical significance here. The
summary scores exported for subsequent analysis are only scores. And
unlike the matters discussed in the next section, other issues of statistical
significance await future analysis. However, there are two issues of immedi-
ate relevance for export scores. First, if the score is a contingency index, it
may be defined as missing when tallies were insufficient either to compute
its value or to have confidence in its value even when one can be computed
(see “Vulnerability to zero cells” in Chapter 9).
Second, before analyses are performed, the distributional properties of
all scores should be examined. No matter the type of statistical analysis used
subsequently – parametric or not – an important question is whether one
value (often zero) predominates. If as many as 40 percent of the scores –
and this is only a guideline – are a single value, a binary recode may be in
order (i.e., transform scores to 0 or 1). If the score in question is an outcome
variable, this can affect the analyses – e.g., requiring a shift from multiple
to logistic regression. And for parametric statistics (whose p values become
better approximations as variables are more normal-like – i.e., unimodal and
not badly skewed), scores whose standardized skew exceeds some criterion
(often 2.58 absolute) should be recoded or transformed. The weakest trans-
formation that produces a standardized skew less than the criterion should
be used; transformations such as the natural logarithm, the square root, or the
reciprocal are common. However, a recode may be more interpretable (e.g.,
0 → 0; 1 → 1; 2–3 → 2; 4–6 → 3; 7–10 → 4; >10 → 5) and thus preferable.
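A sketch of the skew check described above. The standard error of skewness is approximated by the common rule of thumb √(6/N), and the +1 offset before taking logarithms (to accommodate zeros) is our assumption, not a fixed convention:

    import numpy as np
    from scipy.stats import skew

    scores = np.array([0, 0, 1, 1, 2, 2, 3, 5, 8, 21], dtype=float)

    se = np.sqrt(6.0 / len(scores))          # approximate SE of skewness
    standardized_skew = skew(scores) / se

    if abs(standardized_skew) > 2.58:        # too skewed for parametric use
        scores = np.log(scores + 1)          # try a weak transformation first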

deviant cells, type i error, and winnowing


Statistical significance can be an issue when our interest is focused, not on
scores destined for export and later analysis, but on a single contingency
table – perhaps one for a particular session or one including data pooled
over several sessions. The contingency table may result from an analysis of
co-occurrence of events in timed-event data or from an analysis of anteced-
ent-consequent events (Lag 1 transitions) in single-code event data. In these
and other cases, it is important to identify those cells containing deviant
counts – counts differing from expected sufficiently that we should regard
them with interest and attempt to interpret their associated co-occurrences
or transitions.
Consider the associations between a child's Calm, Cry, and Fuss and the
adult's Assure and Explain during a painful medical procedure (Figure 9.2).
Judging by the adjusted residuals, we would say that three of the nine asso-
ciations deviated "significantly" from their chance expected value – Fuss-
Assure, Calm-Explain, and Cry-& (neither Assure nor Explain). However,
this “courts type I error,” as statisticians (and journal reviewers and edi-
tors) are fond of pointing out. After all, if there were genuinely no effect “in
the population” and if alpha were set to the conventional .05, then accord-
ing to statistical theory, 5 percent of the time statistically significant effects
would occur by chance. (Considerable ink has been spilled discussing null-
hypothesis significance testing or NHST in the past couple of decades; for
an excellent introduction, see Wilkinson and the Task Force on Statistical
Inference, 1999; also Bakeman, 2006, and Cohen, 1990, 1994.)
Thus the more tests we make in the course of an analysis, the more likely
we are to identify such significant-by-chance effects. But which ones are
these? Better, how can we keep our study-wise alpha level to .05 in the face
of multiple tests? A common recommendation is to apply the Bonferroni
correction (Miller, 1966), adjusting the alpha level by the inverse of the
number of tests. In this case, instead of asking which adjusted residual
probabilities are less than .05, we would ask instead which are less than
.0056 (.05/9). However, as Cohen (1990) has noted, in practice, the prob-
ability of type I error is almost always zero because effect sizes, even when
small, are almost never exactly zero. He argued that researchers are unlikely
to solve our “multiple tests problem with the Bonferroni maneuver” (p.
1304); for one thing, applied zealously, almost nothing is ever significant
(see also Wilkinson and the Task Force on Statistical Inference, 1999). Both
Cohen and Wilkinson recommend that investigators interpret overall pat-
terns of significant effects, not just individual ones; that they be guided by
predictions ordered from most to least important; and, above all, that they
focus on effect sizes. This advice has considerable merit.
Moreover, probabilities for individual tests are always approxima-
tions to some degree. This is certainly true for the present example of
nine adjusted residuals, none of which meet Haberman’s criteria for a
good approximation (no row sums are 30 or greater; see “Expected fre-
quencies and adjusted residuals” in Chapter 9). It is also true that none of
the approximate probabilities are less than the Bonferroni value of .0056
either – illustrating Cohen's point. Nonetheless, something appears pat-
terned about the joint frequencies shown in Figure 9.2; but still noting
only which adjusted residuals are large is too piecemeal an approach. A
principled way of examining table cells, not individually piece by piece but
as a whole, would be welcome.
The counts in a contingency table form something of an interconnected
web – as values in one cell change, expected frequencies in all cells are
affected. For this reason, a whole-table approach to identifying deviant cells
makes considerable sense, especially as the number of cells becomes large.
We call our whole-table approach winnowing (Bakeman & Gottman, 1997;
Bakeman & Quera, 1995b; Bakeman, Robinson, & Quera, 1996; see also
Fagen & Mankovich, 1980, and Rechten & Fernald, 1978), but to explain
it requires introducing a couple of log-linear analysis concepts that we
develop further in the next chapter.
Winnowing is based on the familiar chi-square test of independence
(see “Indices of association for two-dimensional tables” in Chapter 9). For
this test, expected values are computed based on a model of row and col-
umn independence (as reflected in the usual erc = pc × xr+ formula). The
chi-square (using either χ2 or G2) is a goodness-of-fit test; smaller values,
which indicate less discrepancy between observed and expected frequen-
cies, indicate better fit. Substantively, we usually want to show that the row
and column factors are associated; that is, usually we want fit to fail and
so desire large values of chi-square – ones with p values less than .05. In
contrast, to indicate fit, we want small values – ones with p values greater
than .05 (i.e., values of chi-square less than the critical .05 value for the
appropriate degrees of freedom).
When some adjusted residuals are large, the omnibus chi-square for the
table will be large as well – fit will fail. Winnowing attempts to determine
which cells are causing table fit to fail and how few cells we can ignore (i.e.,
replace with structural zeros) before achieving a model of independence
that fits (technically, a model of quasi-independence because it includes
structural zeros; see Wickens, 1989, pp. 251–253). Almost always, a fitting
model is achieved before all cells with large adjusted residuals are replaced,
meaning that interpretation can then focus on fewer cells – adding any one
of them back would cause fit to fail significantly. If this process of removing
outlandish cells seems counterintuitive, think of it this way: To determine
who is responsible for a too-noisy room, we remove the loudest person first,
the next loudest second, and so forth, until reasonable quiet reigns.
Winnowing is an iterative process. We delete cells (i.e., replace them
with structural zeros) one by one until we find a model of quasi-indepen-
dence that fits. Winnowing can proceed either theoretically (delete cells in
a prespecified order until a fitting model is found) or empirically (at each
step, remove the cell with the largest absolute residual until a fitting model
is found). An alternative empirical method is to order cells based on the
absolute magnitude of the adjusted residuals from the initial model and
then to delete them in that order. Both empirical approaches are illustrated
in Figure 10.4.
The adjusted residuals for the model of independence – [R][C] for rows
and columns using conventional log-linear notation – are shown as Model
#1 at the top of Figure 10.4. With G2(4) = 13.4 and p = .01, this model fails to
fit. If we took an empirical approach, we would first delete the Fuss-Assure
cell. The resulting Model #2, with G2(3) = 6.7 and p = .09, fits. The Fuss-
Assure cell was responsible for failure to fit. Models #1 and #2 are related
hierarchically; the difference between their chi-squares is 6.7 (13.4 – 6.7)
and is distributed as chi-square with 1 degree of freedom (4 – 3 = 1, the
difference between their respective dfs). The value of 6.7 exceeds 3.84, the
.05 critical value for 1 df, thus adding the Fuss-Assure cell back to the model
causes goodness-of-fit to deteriorate significantly.
If Model #2 had not fit, again proceeding empirically, next we would
have deleted the Calm-Explain cell (the largest absolute adjusted residual
for #1) or the Cry-Explain cell (the largest absolute adjusted residual for #2);
see Models #3 and #4, respectively. Either would have resulted in a model
that fits (although #4 has smaller adjusted residuals overall); and, although
Model #1: [R][C]; (df, N) = (4, 60); G2 = 13.4; p = .01

            Assure  Explain     &
    Calm     –1.89    2.54    –0.83
    Cry      –0.94   –1.71     2.49
    Fuss      2.58   –0.63    –1.60

Model #2: [R][C] –FusAs; (df, N) = (3, 50); G2 = 6.7; p = .09

            Assure  Explain     &
    Calm     –0.77    2.01    –1.48
    Cry       0.74   –2.42     1.91
    Fuss       —      0.49    –0.49

Model #3: [R][C] –FusAs –ClmEx; (df, N) = (2, 40); G2 = 2.7; p = .26

            Assure  Explain     &
    Calm     –0.28     —       0.22
    Cry       0.25   –1.48     1.35
    Fuss       —     –1.41    –1.48

Model #4: [R][C] –FusAs –CryEx; (df, N) = (2, 46); G2 = 0.67; p = .72

            Assure  Explain     &
    Calm     –0.33    0.58    –0.60
    Cry      –0.22     —      –0.14
    Fuss       —     –0.56     0.78

Figure 10.4. Table-fit statistics and adjusted residuals for four models illustrating
winnowing; dashes indicate structural zeros.

both began with a fitting model (#2), both resulted in a significant increase
in goodness-of-fit (G2[1] for #2 – #3 = 6.7 – 2.7 = 4.0; and G2[1] for #2 – #4
= 6.7 – 0.67 = 6.0; both G2s > 3.84).
If we took a conceptual approach, and had theoretical reasons for want-
ing to delete first Calm-Explain and then Fuss-Assure, we would have dis-
covered that deleting the Calm-Explain cell resulted in a fitting model
(G2[3] = 7.0, p = .07), albeit marginally, but that deleting the Fuss-Assure
cell next resulted in a model that fit significantly better (Model #3 in Figure
10.4, hierarchical test G2[1] = 4.3, p < .05). (Winnowing is implemented in
our ILOG 3 program; see our GSEQ web pages for download).
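To make the procedure concrete, here is a compact Python sketch – an illustration of the logic, not ILOG itself. Expected frequencies under quasi-independence come from iterative proportional fitting; at each step the most deviant remaining cell becomes a structural zero (for brevity, cells are ranked here by standardized rather than adjusted residuals), and the loop stops when G2 indicates fit. Because df falls by one with each deletion, the loop always terminates.

    import numpy as np
    from scipy.stats import chi2

    def ipf_expected(x, zeros, iters=200):
        """Expected counts under quasi-independence, with the cells in
        `zeros` treated as structural zeros, via iterative proportional
        fitting."""
        xm = x.astype(float).copy()
        e = np.ones_like(xm)
        for r, c in zeros:
            xm[r, c] = 0.0              # deleted tallies leave the table
            e[r, c] = 0.0               # ... and stay zero through fitting
        for _ in range(iters):
            rs = e.sum(axis=1, keepdims=True); rs[rs == 0] = 1.0
            e *= xm.sum(axis=1, keepdims=True) / rs
            cs = e.sum(axis=0, keepdims=True); cs[cs == 0] = 1.0
            e *= xm.sum(axis=0, keepdims=True) / cs
        return xm, e

    def winnow(x, alpha=.05):
        zeros = []
        while True:
            xm, e = ipf_expected(x, zeros)
            ok = (xm > 0) & (e > 0)
            g2 = 2.0 * (xm[ok] * np.log(xm[ok] / e[ok])).sum()
            df = (x.shape[0] - 1) * (x.shape[1] - 1) - len(zeros)
            if df <= 0 or chi2.sf(g2, df) > alpha:
                return zeros, g2, df    # a fitting model: stop deleting
            resid = np.zeros_like(e)
            nz = e > 0
            resid[nz] = (xm[nz] - e[nz]) / np.sqrt(e[nz])
            r, c = np.unravel_index(np.abs(resid).argmax(), x.shape)
            zeros.append((r, c))        # delete the most deviant cell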

summary
Data modification – mainly creating new codes from existing codes – offers
a flexible and powerful way to create variables that are particularly faithful
to our research questions; thus, it is a bit surprising that data modification
does not receive more attention. Data modifications are of several kinds.
Among the most useful are such logical combinations as AND and OR (for
timed-event, interval, and multicode event data), which let you define a
new code as a stretch of time (or intervals, or multicoded events) whenever


all (AND), or any one or more (OR), of a list of existing codes occur. Other
logical modifications include NOT, NOR, and XOR. RECODE is similar to
OR, but preserves the frequency of existing codes.
Other commands include EVENT and BOUT (for timed-event data),
which allow you to convert a timed-event into an untimed-event data file
and to combine bursts of codes into a single bout; RECODE, LUMP, and
CHAIN (for single-code event data), which allow you to assign a new code
to any of a list of existing events, to merge together as a new code any suc-
cessive events that appear on the list of existing codes, and to replace an
existing chain of events with a new code; and REMOVE and RENAME (for
all data types), which allow you to perform housekeeping.
Especially useful is the WINDOW command (for all data types, but usu-
ally used with timed-event data), which allows you to define a new event as
a stretch of time (or intervals, or even events) anchored to onsets and offsets
of existing codes. For example, you can define a new code as just the onset
unit (e.g., second) of an existing code or as the five units just before its onset
or, for example, even a stretch of time extending from five seconds before an
onset to five seconds after the offset of an existing code.
Pooling data over several sessions, either over all levels of a factor or pre-
serving factor levels (e.g., pooling separately for males and females) can be
useful for two reasons. First, pooling can be a useful exploratory technique,
even if pooled results never enter into subsequent research reports. Second,
pooling may be useful when the events of interest are sufficiently rare that
the accuracy of summary statistics computed for individual contingency
tables becomes questionable.
Often the summary scores described in the preceding two chapters will
be collected into export files and then imported into standard statistical
packages for subsequent analysis; such export files often consist of tab-
delimited files, which is a widely used format for data exchange. Two rele-
vant issues are: When scores are contingency indices, should they be defined
as missing due to insufficient tallies? And are distributions for a particular
summary score so skewed that a transformation or a recode is warranted?
A recode (e.g., a binary recode when many scores have a single value) may
be better than a transformation because it is more interpretable.
When our interest is focused on a single contingency table – perhaps for
a particular session or perhaps pooled over several sessions – it is import-
ant to identify those cells that contain greater than expected (or less than
expected) counts and therefore are worthy candidates for interpretation.
residual applied piecemeal to all cells courts type I error, yet a Bonferroni
correction may be too stringent. We recommend a log-linear-based
approach that we call winnowing; cells are replaced with structural zeros,
one by one, until a fitting model (of quasi-independence) is found, thus
identifying those “deviant” cells that caused fit to fail.
11

Time-Window and Log-Linear Sequential Analysis

The phrase, sequential analysis – which appears in the title of this book
as well as earlier ones (Bakeman & Gottman, 1997; Bakeman & Quera,
1995a) – admits to more than one meaning. In the context of microbiology,
it can refer to the description of genetic material. In the context of statistics,
it can mean sequential hypothesis testing – that is, evaluating data as they
are collected and terminating the study in accordance with a predefined
stopping rule once significant results are obtained (Siegmund, 1985).
However, in the context of observational methods generally, and in the
context of this book specifically, sequential analysis refers to attempts to
detect patterns and temporal associations among behaviors within observa-
tional sessions. As such, sequential analysis is more a toolbox of techniques
than one particular technique. It can include any of a variety of techniques
that serve its goals. Some of these techniques have already been discussed
(e.g., "Contingency indices for 2 × 2 tables" in Chapter 9). The unifying fac-
tor is the data used; by definition, sequential analysis is based on sequen-
tial data – data for which some sort of continuity between data points can
be assumed. Indeed, a common thread throughout this book has been the
description and use of the four basic sequential data types we defined in
Chapter 4 – single-code event, timed-event, interval, and multicode event
sequential data.
As we have emphasized in earlier chapters, such sequential data result
from coding or rating – that is, from nominal and occasionally ordinal
measurement. There is another kind of sequential data that does not appear
in these pages – not because we think it is unimportant, but because its
collection and analysis require quite different approaches from those
described here. It is usually called time series data, is especially common in
fields such as economics and astronomy, and is characterized by a lengthy
series of numbers (often in the 100s and 1,000s), usually measured on a ratio
scale and collected at equal intervals (anywhere from minutes to months
to years). Readers interested in the analysis of time series data in a behavioral
context should consult Gottman (1981; also Gottman & Roy, 1990;
and Bakeman & Gottman, 1997, Chapter 10).
Given that sequential analysis is concerned with detecting pattern in
sequential data, it is not surprising that we give particular attention to
contingency tables and indices derived from them (see Chapter 9). Time-based
contingency tables – which tally time units – allow us to examine patterns
of co-occurrence in timed-event data. Event-based contingency tables –
where columns are lagged relative to rows – allow us to examine sequential
patterns in single-code event data. Similar possibilities exist for interval
and multicode event data. Moreover, any of these analyses can be based
on new codes created from existing ones (see "Creating new codes from
existing codes" in Chapter 10). We can then use standard statistical techniques
to determine whether indices derived from contingency tables vary
by session or by other research factors of interest. From this point of view,
previous chapters have already described sequential analytic approaches
and techniques. This chapter and the next one describe additional techniques –
including both statistical and graphical approaches – useful for
understanding pattern in sequential data.

time-window sequential analysis of timed-event data
Some early approaches to detecting sequential patterns – of which
Sackett's lag-sequential analysis (e.g., Bakeman, 1978; Sackett, 1979; see
also Bakeman & Gottman, 1986) may be the best known – were designed
primarily for single-code event data. When applied to timed-event data
(with time units, and not events, defining lags) lag-sequential analysis
did not work very well, perhaps because differing durations of some key
events pushed onsets of other key events into various lag positions, thereby
obscuring patterns.
A more useful and flexible approach to detecting contingency with
timed-event data has come to be called time-window sequential analysis
(Bakeman, 2004; Bakeman et al., 2005; Yoder & Symons, 2010; Yoder &
Tapp, 2004; see also Chorney et al., 2010; Hall & Oliver, 1997). It uses
contingency indices (see "Contingency indices for 2 × 2 tables" in Chapter 9)
and relies on data modification – specifically the WINDOW command in
GSEQ (see "Creating new codes as 'windows' anchored to existing codes" in
Chapter 10) – and so builds on material already presented.
The generic question we wish to address is whether the target behavior
is contingent on the given (or criterion) behavior. Here are the three steps.
First, we define a window of opportunity or time window for the given
behavior using the WINDOW command. For example, we might say that
for a behavior to be contingent we need to see a response within 5 seconds,
so we would code the onset second of the given behavior and the following
4 seconds as a given window (assuming 1-second precision). Second, we
code any second in which the target behavior starts as a target onset, again
using the WINDOW command. Third, we tally time units for the session
into a 2 × 2 table and compute a contingency index for the table (either a
Yule's Q or an odds ratio, as discussed in Chapter 9).
Typically, there is a fourth step, which involves analyzing the Yule's Qs
or log odds computed for different sessions as appropriate for our research
design. A study by Deborah Deckner illustrates this approach. She wanted
to know whether mothers and infants matched each other's rhythmic vocalizations
(Deckner, Adamson, & Bakeman, 2003). Working with video-recorded
material, she coded onset and offset times for such vocalizations
for 30 mother-infant pairs who were observed for 20-minute sessions when
infants were 18 and 24 months of age. To determine whether mothers
responded to infants, she used the WINDOW command to define two new
codes: one coded the five seconds at the start of an infant vocalization (i.e.,
the given window of opportunity), while the other coded just the onset
second of a mother vocalization (the target). Seconds were tallied in a 2 × 2
table for each session, and odds ratios were computed for each dyad at each
age (log odds were used for analysis).
For this example, an odds ratio greater than 1 indicated that a mother was
more likely to begin a rhythmic vocalization during the first five seconds of
an infant rhythmic vocalization – that is, to match her infant – than at other
times. A similar strategy was used to determine whether infants responded
to mothers. Session scores (log odds) were analyzed with a standard one-between
(male or female), one-within (18 or 24 months) mixed-design
analysis of variance. As shown in Figure 11.1, Deckner et al. (2003) found
that mothers matched their infants (mean odds ratios were all greater than
1), but more so for female than male infants regardless of age. For their
part, only 24-month-old females matched their mothers; mean odds ratios
for males at both ages and females at 18 months were all near 1 and significantly
less than the mean for females at 24 months.
The Deckner et al. (2003) study illustrates nicely how a time-window
sequential analysis can address substantive questions of interest – in this
case gender and age effects on whether mothers, infants, or both would
                                       Males            Females
Contingency index                  18-mo   24-mo    18-mo   24-mo
Mothers matching their infants     1.53a   1.40a    1.83b   2.50b
Infants matching their mothers     0.94ab  1.10b    0.71a   1.99c

Figure 11.1. Scores are mean odds ratios, n = 16 for males and 14 for females;
analyses were conducted with log odds. For each row, means sharing a common
subscript do not differ significantly, p < .05, per Tukey post-hoc test.

match their partners' rhythmic vocalizations. More generally, given timed-event
sequential data, time-window sequential analysis allows you to
address targeted and conceptually based questions of sequence and contingency
in ways that preserve the importance of timing. True, the width (i.e.,
duration) of the window is an arbitrary judgment – and no doubt many
investigators will explore how results are affected when different window
widths are tried – but there is almost always some empirical or conceptual
basis for the choice, if only that many studies of humans have found sensible
and interpretable results with 3- or 5-second windows.

the sign test: a nonparametric alternative
The sign test (also called the binomial test) is so simple and straightforward
that it is often overlooked. This is unfortunate. It can be used whenever
the outcome of interest is binary (e.g., yes or no, true or false, observed
frequency greater than zero or not, Yule’s Q above zero or not, odds ratio
above 1 or not) and we want to know whether the proportion of cases with
one of the two outcomes deviates from a proportion we predict. We spe-
cify the predicted probability (or percentage) beforehand; and usually we
hope to show that the proportion observed is unlikely, given the proportion
predicted. For example, we could test whether the percentage coded true
deviated significantly from 50 percent.
Assume P is the predicted probability for one outcome and Q for the
other, thus Q = 1 – P. Often we specify that P = Q = .5 – that is, that one
outcome is as likely as the other when, in fact, we want to disprove this "null
hypothesis" value. Substantively, we want to show that one outcome was
significantly more likely than the other, that the observed outcome would
happen 5 percent of the time or less if P really were .5. As a nonparametric
test – specifically, an exact permutation test (see "Permutation tests for short
event sequences" in Chapter 12) – the sign test makes few assumptions. It
generates probabilities for all possible outcomes using the binomial distri-
bution and so can determine exactly where in this distribution the observed
value lies (for details, see Bakeman & Robinson, 2005). Although the sign
test can analyze binary outcomes generally, in the following two paragraphs
we present an example showing how the sign test is useful specifically in the
context of sequential analysis.
Consider the Deckner et al. (2003) study described a few paragraphs
earlier. The group means and the results of the parametric analysis pre-
sented in Figure 11.1 answer questions about mean differences, but leave
other questions unanswered. For example, when the infants were 18 months
of age, how many tended to match their mothers? In other words, for how
many were the odds ratio over 1? We know that the means for both males
and females were less than 1, but the mean both summarizes and obscures
how individuals performed. In contrast, the sign test – which requires that
we count individual cases – makes it easy to highlight how individuals
performed, allowing us to report, for example, the percentage of individuals
with a particular outcome.
For the current example, odds ratios exceeded 1 for only 6 of the
30 18-month-olds (4 of 16 males and 2 of 14 females), which is significant,
p < .01, with either a one- or two-tailed sign test (separately by sex,
p < .05 one-tailed for males, p < .05 two-tailed or p < .01 one-tailed for
females). A similar analysis indicates that odds ratios exceeded 1 for 7
of 16 24-month-old males and for 12 of 14 24-month-old females. The
effect for 24-month-olds overall and for 24-month-old males was not
significant, but the effect for 24-month-old females was (p < .05 two-tailed
or p < .01 one-tailed). As this example shows, the sign test not only
is useful analytically when evaluating contingency indices, but also provides
a level of individually based descriptive detail that gets lost when
only group means are presented.
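Because the binomial distribution is simple to compute directly, the sign test takes only a few lines of code. The sketch below (illustrative Python; the two-tailed value uses the common doubling convention) reproduces the 18-month result just reported.

from math import comb

def sign_test(k, n, p=0.5):
    # Exact binomial (sign) test for k "successes" in n binary outcomes.
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    one_tailed = min(sum(pmf[: k + 1]), sum(pmf[k:]))   # smaller tail
    two_tailed = min(1.0, 2 * one_tailed)               # doubling convention
    return one_tailed, two_tailed

# 6 of 30 18-month-olds had odds ratios above 1:
print(sign_test(6, 30))   # roughly (.0007, .0014); p < .01 either way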

lag-sequential and log-linear analysis of single-code event data
As noted when discussing time-window sequential analysis, an early
approach to detecting sequential patterns – applied primarily to single-code
event data – was Sackett's lag-sequential analysis (for details see
Bakeman & Gottman, 1997). The intent was to identify sequences whose
occurrence was more likely than their simple frequencies would suggest
and which were longer than two events. However, this approach with its
multiple tables of adjusted residuals at various lags and multiple tests of
significance can seem a bit piecemeal. Log-linear analysis offers a more stat-
istically grounded, whole-table approach. Among the standard references
are Bishop, Fienberg, and Holland (1975), Fienberg (1980), and Haberman
(1978, 1979), although more accessible alternatives are Bakeman and
Robinson (1994), Kennedy (1992), and Wickens (1989).
Log-linear analysis can be regarded as an extension of the traditional
2-dimensional chi-square test of independence or association. And while
traditional chi-square analysis is limited to contingency tables of just two
dimensions, log-linear analysis can handle tables of more dimensions – and
so can handle chains longer than just two events.

Overlapped and Nonoverlapped Tallying of m-Event Chains


As you might expect of a contingency-table approach, the first step is to
tally chains into m-dimensional tables, where m is the length of the chain
we wish to investigate. For example, if we are interested in 3-event chains
and the number of codes is 5, we would define a three-dimensional 5 × 5 × 5
table. We will label the first dimension of such tables as Lag 0, the second
Lag 1, the third Lag 2, and so forth, because these numbers identify the lag
positions in the chain.
Chains can be sampled in two ways – overlapped and nonoverlapped.
If the population of interest consisted only of m-event chains, each chain
would add one tally to the table – and the issue of overlapped versus
nonoverlapped sampling would not arise. But if the sequence is longer than
the chain length of interest, a choice must be made; and almost always
the choice is overlapped sampling. For example, if ei is the i-th event in
a sequence of N events and if our interest is in 3-event chains (m = 3),
then that sequence contains N – m + 1 overlapped chains – e1e2e3, e2e3e4,
e3e4e5, and so forth. (More generally, the number of m-event overlapped
chains in a sequence of N events, divided into S segments, is N – mS +
S.) Alternatively, nonoverlapped sampling could be used – selecting
e1e2e3, e4e5e6, e7e8e9, and so forth – but this reduces the number of chains
tallied. A nonoverlapping strategy might be used if you were concerned
about the sampling independence of overlapped chains, but our studies
have shown that this concern is not consequential (Bakeman & Dorval,
1989). As noted, overlapped sampling is the usual choice and is usually
assumed – for example, Lag 1 tables in GSEQ assume overlapped sampling
(see "Lagged tallies for single-coded events when codes can and cannot
repeat" in Chapter 9), as do the analyses in our Psychological Bulletin
article (Bakeman & Quera, 1995b).
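In sketch form (illustrative Python, not GSEQ), tallying chains either way is a one-line difference – the step by which the sampling position advances:

from collections import Counter

def tally_chains(events, m=3, overlapped=True):
    # Tally m-event chains into a Counter keyed by (Lag 0, ..., Lag m-1).
    step = 1 if overlapped else m
    return Counter(tuple(events[i:i + m])
                   for i in range(0, len(events) - m + 1, step))

seq = list("ACBACBACB")                       # N = 9
print(tally_chains(seq))                      # overlapped: N - m + 1 = 7 chains
print(tally_chains(seq, overlapped=False))    # nonoverlapped: 3 chains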
            2: Assr  Exp  Cajole               2: Alert  Fuss  Cry
0:A  1:A         7     6       8      0:A  1:A        —     —    —
     E          10     9       6           F         15     —    6
     C          31     8      10           C         10     9    —
                              95                                40

0:E  1:A         3     6      14      0:F  1:A        —    12    8
     E           6    11       9           F          —     —    —
     C          12     4       4           C          9    12    —
                              69                                41

0:C  1:A        11    13      26      0:C  1:A        —     8   11
     E           7     6       6           F          5     —   15
     C           7     7       1           C          —     —    —
                              84                                39

Figure 11.2. Two three-dimensional, Lag 0 × Lag 1 × Lag 2 contingency tables
showing, on the left, tallies for 3-event chains using overlapped sampling derived
from a sequence of 250 events when codes can repeat and, on the right, from a
sequence of 122 events when codes cannot repeat (adapted from Bakeman &
Gottman, 1997). For Lag 0 and Lag 1, codes are represented with their first letter (A
for Assr or Alert, etc.).

However sampled, each chain considered adds a tally to the m-way table.
For example, assume an interest in 3-event chains and three codes – Assr,
Exp, and Cajole for assure, explain, and cajole – that might be applied to
parents' or medical professionals' turns of talk. These codes can repeat, and
thus a 3-event chain could add a tally to any of the 27 cells of the 3 × 3 × 3,
Lag 0 × Lag 1 × Lag 2 contingency table. Specifically, an Assr-Exp-Exp chain
would add one tally to the middle cell in the table at the top left in Figure
11.2. However, when codes cannot repeat, some of the cells will be structurally
zero. For example, again assume an interest in 3-event chains, but a
different set of three codes – Alert, Fuss, and Cry – that might be applied
to a child's state and that cannot repeat. Instead of 27 possible 3-event
sequences, now there are only 12 – with the pattern of structural zeros
shown in Figure 11.2 (right). At first glance, the large number of structural
zeros might seem problematic, but in fact they are handled routinely by log-linear
analysis – which is one strength of this approach when attempting to
detect chains in single-code event data.
An Illustration of Log-Linear Basics


A common way to approach log-linear analysis is hierarchically. We begin with
a saturated model – one that includes all possible effects and thus fits the data
perfectly. We then delete effects, one by one, searching for a model whose generated
data still fit the observed counts – if not perfectly, at least tolerably well.
Typically, the most parsimonious model that still fits tolerably is interpreted.
Searching for a model to interpret is an iterative process, similar to winnowing
(see "Deviant cells, type I error, and winnowing" in Chapter 10).
We present two examples to illustrate hierarchical log-linear procedures.
These examples are based on the three-dimensional tables shown
in Figure 11.2. The 3 × 3 tables at the top, middle, and bottom show how
often specific Lag 2 codes (Assr, Exp, Cajole, or Alert, Fuss, Cry) followed
the same Lag 1 codes when Lag 0 was coded Assr, Exp, Cajole, or Alert,
Fuss, Cry, respectively.
For our first example, consider the three-dimensional table on the left
in Figure 11.2, for which codes can repeat. Using conventional log-linear
bracket notation, a saturated Lag 0 × Lag 1 × Lag 2 model is represented as
[012]. Expected frequencies for this and other models can be computed
with iterative proportional fitting (the same method as used in Chapter 9
for two-dimensional tables with structural zeros). For a three-dimensional
table, the [012] model constrains expected frequencies to be exactly as
observed; hence the G2 for the saturated model is 0 and has zero degrees
of freedom. (G2 is used instead of χ2 because of its suitability for hierarchical
tests – i.e., tests of the difference between G2s for hierarchically related
models.) In fact, the saturated [012] model contains 3 two-way terms and
3 one-way terms in addition to the three-way term – thus the full model is
[012] [01] [02] [12] [0] [1] [2]. However, expected frequencies that fit the
[012] constraint will necessarily fit constraints imposed by any lower-order
constituent terms, and so the saturated model is identified simply as [012].
If we then delete the [012] term, the resulting model – [01] [02] [12] –
consists of all two-way terms; again, there is no need to list the implied
lower-order one-way terms. This model generates expected frequencies
that fit the three collapsed two-way tables implied by the three two-way
terms. It specifies that events have Lag 1 associations (indicated by the [01]
and [12] terms) and Lag 2 associations (indicated by the [02] term), but that
the way the events are associated at Lag 1 does not depend on the preceding
event (as indicated by the absence of the [012] term). You can think of
the [012] term as indicating moderation: if required for fit, it indicates that
Model                          Deleted
Terms              G2    df    Term      ∆G2     ∆df
[012]               0     0    —         —       —
[01][02][12]      7.7     8    [012]     7.7     8
[01][12]         11.5    12    [02]      3.9     4
[0][12]        47.0**    16    [01]    35.5**    4
[0][1][2]      81.3**    20    [12]    34.4**    4

Figure 11.3. Log-linear analysis of the three-dimensional table shown on the left in
Figure 11.2 (codes can repeat).
** p < .01

the association between Lag 1 and Lag 2 is moderated by (i.e., depends on)
what the Lag 0 term is.
For the counts given in Figure 11.2 (left), the generated counts for the
[01] [02] [12] model fit tolerably; the G2 of 7.7 with 8 df is small and not
statistically significant (see Figure 11.3). In other words, when we deleted
the saturated term, the G2 indicating model fit deteriorated from 0 to 7.7;
that is, ΔG2 (the change in G2) was 7.7, which with 8 degrees of freedom is
not a significant change in fit.
Now we have a choice: Which two-way term should be deleted next? As
you would expect, this could be decided statistically – delete first the term
that causes the smallest change in G2 – or conceptually. For conceptual reasons,
we decided to delete the [02] term. A model that included the [01] but
not the [12] term – or vice versa – would make little sense in a sequential
context; if a Lag 1 association exists, both terms would be necessary. This
model – [01][12] – generated expected frequencies that still fit
reasonably (G2 = 11.5, df = 12) – and ΔG2 = 3.9, which with 4 df was not
a significant change. If we had proceeded next to delete the [01] term, the
resulting model – [0][12] – would not fit, the ΔG2 of 35.5 with 4 df would
indicate a significant deterioration in fit, and moreover, as just noted, this
model would make little sense in a sequential context with the [01] but not
the [12] term. Thus we would select the [01][12] model to interpret.
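For readers who want to see the mechanics, here is a brief sketch (illustrative Python with numpy, not GSEQ) of iterative proportional fitting for the [01][12] model, applied to the Figure 11.2 (left) counts, together with the G2 computation:

import numpy as np

def fit_01_12(observed, max_iter=1000, tol=1e-10):
    # Iterative proportional fitting: expected counts that match the
    # observed [01] (Lag0 x Lag1) and [12] (Lag1 x Lag2) margins.
    expected = np.ones_like(observed, dtype=float)
    for _ in range(max_iter):
        expected *= (observed.sum(axis=2) / expected.sum(axis=2))[:, :, None]
        expected *= (observed.sum(axis=0) / expected.sum(axis=0))[None, :, :]
        if np.allclose(observed.sum(axis=2), expected.sum(axis=2), atol=tol):
            break
    return expected

def g_squared(observed, expected):
    mask = observed > 0                 # 0 * ln(0) contributes nothing
    return 2 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

# Lag0 x Lag1 x Lag2 counts from Figure 11.2 (left); order Assr, Exp, Cajole.
obs = np.array([[[7, 6, 8], [10, 9, 6], [31, 8, 10]],
                [[3, 6, 14], [6, 11, 9], [12, 4, 4]],
                [[11, 13, 26], [7, 6, 6], [7, 7, 1]]], dtype=float)
exp = fit_01_12(obs)
print(round(g_squared(obs, exp), 1))    # should match the 11.5 (12 df) of Figure 11.3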
The [01][12] model implies that when events at Lag 1 are taken into
account, events at Lag 0 and Lag 2 are not associated, but are in fact independent.
The ability to detect such conditional independence – the independence
of Lag 0 and Lag 2 conditional on Lag 1 – is a strength of the log-linear
approach applied to single-code event sequential data; it is a strength not
shared with more piecemeal approaches. (Conditional independence is
Model                        Deleted
Terms               G2   df  Term        ∆G2   ∆df
[01][02][12]         0    0  —           —     —
[01][12] – CFC     1.6    2  [02]–CFC    1.6   2
[01][12]         10.8*    3  [02]       9.2*   1

Figure 11.4. Log-linear analysis of the three-dimensional table shown on the right
in Figure 11.2 (codes cannot repeat).
* p < .05

symbolized 0 ⊥ 2 | 1 by Wickens, 1989; see also Bakeman & Quera, 1995b.)
In other words, we conclude that two-event chain patterning characterizes
this particular sequence. Knowing that chains are not patterned at Lag 2,
we could then proceed to a Lag 0–Lag 1 follow-up analysis, using winnowing.
For these data, we would discover that when the Assure-Cajole and
the Cajole-Assure cells are deleted, the resulting model of quasi-independence
fits: G2(2, N = 149) = 0.02, p = .98. Apparently for these (generated,
not empirically collected) data, patterning consisted of Assure-Cajole and
Cajole-Assure transitions.
For our second example, consider the three-dimensional table on the
right in Figure 11.2, for which codes cannot repeat. This example illustrates
an additional advantage of the log-linear approach – its ability to deal with
sequences when codes cannot repeat and their attendant structural zeros.
When the number of codes is three, the [01][02][12] model is completely
determined – that is, both G2 and df are 0; in effect, it is the saturated model.
Removing the [02] term causes fit to fail: G2 = 10.8 for the resulting [01]
[12] model, which with 3 df is significant (i.e., significantly bad fit), and the
change in fit (ΔG2 = 10.8, df = 3) was likewise significant (compare the first
and last lines in Figure 11.4). We tentatively accept the [01][02][12] saturated
model. Unlike the previous analysis, in this case the model of conditional
independence fails to fit, and so we conclude that 3-event chain patterning
characterizes this particular sequence.
However, these data can be subject to winnowing (again, see "Deviant
cells, type I error, and winnowing" in Chapter 10). Imagine, for example,
that we have theoretical reasons to think that the Cry-Fuss-Cry chain is of
particular importance. To test its importance, we replace its count of 15
(see lower-right table in Figure 11.2) with a structural zero. As shown in
Figure 11.4 (middle line), the model of conditional independence with the
Cry-Fuss-Cry cell removed – [01][12] – CFC – fits the data (G2 = 1.6, df = 2).
But the [01][12] model with the Cry-Fuss-Cry cell restored (bottom line)
fails to fit. The change in G2 from the [01][12] model with the structural
zero to one without the structural zero is significant. We conclude that the
Cry-Fuss-Cry chain causes the failure of the [01][12] model to fit. Note,
however, that we chose the Cry-Fuss-Cry chain for theoretical reasons;
replacing other chains with structural zeros might also have resulted in a
[01][12] model that fit – which simply underlines the importance of
conceptually guided data analysis.
The illustration of log-linear methods applied to exploring sequencing
in single-code event data presented in the previous several paragraphs
should be regarded as brief and introductory and in no way exhaustive.
If these techniques seem applicable to your work, we encourage you to
read further (e.g., Bakeman & Gottman, 1997; Bakeman & Quera, 1995b;
Wickens, 1989).

log-linear analysis of interval and multicode event data
Log-linear analysis is useful for uncovering pattern in single-code event
sequences, as discussed in the previous section, but was developed primarily
as a method for the analysis of multidimensional contingency tables generally.
As such, it is especially suited for the analysis of interval and multicode
event data. When intervals or events are coded on several dimensions – that
is, when intervals or events are cross-classified – multidimensional tables of
the sort suited for log-linear analysis necessarily result.
An example may clarify. Consider the Bakeman and Brownlee (1982)
study cited in Chapter 3. Recall that they asked coders to detect possession
struggles – that is, times when one preschool child (the holder) possessed
an object and another (the taker) attempted to take it away. Each possession
struggle was coded on four dimensions – Age (whether the children
were toddlers or preschoolers), Dominance (whether the taker had been
judged dominant to the holder), Prior possession (whether the taker had
had prior possession of the contested object within the previous minute),
and Resistance (whether the taker encountered resistance from the holder).
Success (whether the taker gained possession of the contested object) was
also coded, but for the present example we regard resistance as the outcome
of interest and ignore success. The counts derived from the multicode
event data for the resulting four-dimensional contingency table are
given in Figure 11.5.
As you can see from Figure 11.5, more possession struggles involved tod-
dlers than preschoolers (71 percent vs. 29 percent); takers were dominant in
                    Taker dominant                Taker not dominant
              Prior          Resistance      Prior          Resistance
              possession     Yes    No       possession     Yes    No
Toddlers      Yes             19     7       Yes             16     4
              No              42    30       No              61    13
                                    98                             94
Preschoolers  Yes              6     5       Yes              9     6
              No              18     5       No              27     4
                                    34                             46

Figure 11.5. Four-dimensional Age × Dominance × Prior possession × Resistance
contingency table (data from Bakeman & Brownlee, 1982).

49 percent of the episodes, had had prior possession in 26 percent of them,
and encountered resistance in 73 percent. These percentages provide simple
descriptive information, but the analytic question of interest is whether
age, dominance, or prior possession – singly or in combination – is associated
with, or accounts for, resistance. Following procedures described in
the previous section on log-linear basics, we begin with a saturated model –
[ADPR] – which fits perfectly, by definition. We then proceed to delete
terms, searching for a more parsimonious model that still fits tolerably.
We retain the [ADP] term in all models. In effect, this term reflects our
design; it includes our predictor variables (age, dominance, prior possession)
but excludes the outcome variable of resistance. The [ADP] term
constrains cell frequencies generated by the model to reflect the age by
dominance by prior possession cross-classification frequencies which were
actually observed. When we delete the saturated term, but retain the [ADP]
term, the model that results – [ADP][AR][DR][PR] – contains, in addition
to the design term, three terms that reflect associations of age, dominance,
and prior possession with resistance. If this model fit, but simpler ones did
not, we would conclude that all three predictor variables were associated
with resistance. On the other hand, if the model formed by deleting all three
of these terms – [ADP][R] – fit, we would conclude that none of the predictor
variables was associated with resistance. In fact, as Figure 11.6 shows,
the [AR] and [PR] terms, but not the [DR] term, can be deleted; the most
parsimonious model that still fits acceptably is [ADP][DR]. We conclude
that resistance is associated with dominance – being dominant decreases
the odds of resistance by more than half – but that age and prior possession
are not.
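As a quick, informal check on the size of that dominance effect, we can collapse the Figure 11.5 counts over age and prior possession into a single 2 × 2 Dominance × Resistance table (a rough marginal computation; for the [ADP][DR] model, the fitted conditional odds ratio equals this marginal one):

# Collapse Figure 11.5 over age and prior possession (a rough check).
resist_dom,  no_resist_dom  = 19 + 42 + 6 + 18, 7 + 30 + 5 + 5    # 85, 47
resist_ndom, no_resist_ndom = 16 + 61 + 9 + 27, 4 + 13 + 6 + 4    # 113, 27

odds_dom  = resist_dom / no_resist_dom      # ~1.81
odds_ndom = resist_ndom / no_resist_ndom    # ~4.19
print(odds_dom / odds_ndom)                 # ~0.43: being dominant cuts the
                                            # odds of resistance by more than half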
Model                             Deleted
Terms                  G2   df    Term      ∆G2    ∆df
[ADPR]                  0    0    —         —      —
[ADP][AR][DR][PR]     8.5    4    [ADPR]    8.5    4
[ADP][DR][PR]         8.6    5    [AR]      0.1    1
[ADP][DR]             9.0    6    [PR]      0.4    1
[ADP][R]           18.2**    7    [DR]    9.2**    1

Figure 11.6. Log-linear analysis of the four-dimensional table for the data given in
Figure 11.5.
** p < .01

The analysis presented here is based on Bakeman and Brownlee's (1982)
data, but is somewhat different from the analyses they reported. The point
of this example is to demonstrate how log-linear analysis of interval or mul-
ticode event data can yield substantively interesting results. Again, as noted
at the end of the previous section, if log-linear analysis seems of interest to
you, we encourage you to read further.

summary
Sequential analysis refers to attempts to detect patterns and temporal asso-
ciations among behaviors within observational sessions. As such, sequen-
tial analysis is more a toolbox of techniques than one particular technique.
Contingency tables are given particular attention. Time-based contingency
tables allow us to examine patterns of co-occurrence in timed-event data,
while lagged event-based contingency tables allow us to examine sequen-
tial patterns in single-code event data. Some sequential analytic approaches
and techniques have already been described in previous chapters.
Time-window sequential analysis offers a way to examine lagged asso-
ciations in timed-event data. You define a window of opportunity keyed to
a given behavior (e.g., the five seconds after a given behavior begins) and
then tally how often a target begins within such a window. The association
of a given window and a target onset can be summarized with a statistic
such as the odds ratio or Yule’s Q, computed for each session separately and
analyzed as appropriate.
The sign test (or binomial test) is a simple statistic for binary outcomes.
It makes few assumptions, provides useful descriptive detail, and lets
you determine whether, for example, a particular contingency occurred
in more individuals than expected (e.g., in more than 50 percent of the
individuals).
Log-linear analysis offers a way to examine lagged associations in single-code
event data. If you are interested in m-event chains, you would tally
chains in an m-dimensional table; dimensions would be labeled Lag 0 to
Lag m–1. Assuming overlapped sampling, which is the usual choice, each
overlapped chain – e1e2e3, e2e3e4, etc. – would add a tally to the table, depending
on the code in each lag position. Log-linear techniques would identify
the most parsimonious model whose generated data fit the observed data
acceptably; this model would then be interpreted. A strength of the log-linear
approach is its ability to detect conditional independence – that is,
the independence of Lag 0 and Lag 2 conditional on Lag 1. The approach
can also confirm whether influence extends just 1 lag or further, and it can
take into account whether or not codes are allowed to repeat.
In addition to their ability to uncover pattern in single-code event
sequential data, log-linear techniques are equally useful with interval and
multicode event data. Coding intervals or events on several dimensions –
that is, cross-classifying intervals or events – produces multidimensional
contingency tables of the sort suited for log-linear analysis.
12

Recurrence Analysis and Permutation Tests

The analytic techniques presented so far have relied primarily on summary
scores derived from sequential data; depending on the technique, they
could be applied to one or more of the data types that we have described
(i.e., single-code event, timed-event, interval, and multicode event data). In
this final chapter we describe two additional techniques for detecting pat-
tern. One is primarily graphic and can be applied to any kind of sequence;
the other is a statistical approach to detecting pattern in relatively short
single-code event sequences and requires few assumptions.

recurrence analysis
In this section we consider techniques that rely on whole sequences to
display patterns graphically. Exploring a sequence as a whole can pro-
vide new insight into any patterns that may exist, where in the sequence
they occur, and even whether they tend to repeat in different, but com-
parable, sequences. Moreover, such explorations can also be applied to two
sequences to determine whether certain codes tend to repeat in both, thus
revealing a possible synchronicity.
Eckmann, Kamphorst, and Ruelle (1987) proposed using a kind of
similarity map – called a recurrence plot or dot plot – to detect patterns and
structural changes in time series for quantitative variables that describe
the behavior of dynamic systems (e.g., weather, stock market). A recurrence
plot is an array of dots arranged in an N×N square. Values for both
horizontal and vertical axes are associated with the successive values of a
time series of N elements. The color assigned to a dot (i.e., cell rc of the
N×N matrix, where r = 1...N, bottom to top, for the Y axis and c = 1...N,
left to right, for the X axis) depends on the similarity between the r-th
and c-th elements. Either a single color like black or different colors or
different shades of gray could be assigned. An example of a black-and-white
recurrence plot for an ECG measurement of a heart beat is shown in
Figure 12.1 (left).

[Figure 12.1 appears here. Left panel: time in seconds (1.5–2.2) on both axes.
Right panel: adult utterance number (X axis) by infant utterance number (Y axis).]

Figure 12.1. Examples of recurrence plots. At left, a recurrence plot of the ECG
measurement of a heart beat (Marwan, 2003; retrieved from www.recurrence-plot.
tk). At right, a cross-recurrence plot of mother and infant utterances at 10 months
of age (from Buder et al., 2010). See text for details.
Similarity (or closeness) between xr and xc could be absolute (as in Figure
12.1, left), in which case the recurrence plot would consist of a black diagonal
line indicating that every element is identical to itself – and any black
off-diagonal elements would indicate recurrence later or earlier in time.
Similarity between xr and xc could also be computed as the Euclidean distance
between two vectors defined as time windows, [xr-k, xr-k+1, ..., xr] and
[xc-k, xc-k+1, ..., xc], where k is the embedding dimension. The Euclidean distance
could be represented graphically as a black dot (for sufficiently close)
or, more accurately, with different colors or levels of gray.
Similar methods have been proposed for representing similarity patterns
in code sequences (i.e., single-code event sequences), particularly genomic
sequences (Maizel & Lenk, 1981; Pearson & Lipman, 1988) and text
sequences (Helfman, 1996; Marwan, Moreno, Thiel, & Kurths, 2007) – and
likewise in timed-event sequences such as sequences of telecommunication
alarms (Mannila & Seppänen, 2001). Recurrence plots have also been
used for the analysis of pattern and rhythm in sound and music (Cooper
& Foote, 2002), coupling between speakers' and listeners' eye movements
(Richardson & Dale, 2005), linguistic pattern matching between caregivers
and children in conversation (Dale & Spivey, 2006), and vocal feature
matching in mother-infant face-to-face interaction (Buder, Warlaumont,
Oller, & Chorna, 2010; Warlaumont, Oller, Dale, et al., 2010).
A second example of a recurrence plot, this one with shades of gray –
Buder et al.'s (2010) cross-recurrence plot for a sequence of mother and
infant utterances – is shown in Figure 12.1 (right). Adult utterances are on
the X-axis and infant utterances on the Y-axis. This figure compares two
different, but simultaneously occurring, sequences and displays the simi-
larity between every pair of mother and infant utterances as a function
of the Euclidean distance between their pitch frequencies with shades of
gray: the higher the similarity, the darker the gray. Vertical dark bands in
the plot indicate that a certain pitch in a mother’s utterance tended to be
matched by her infant’s utterances at different moments during their inter-
action, whereas horizontal dark bands indicate when the mother’s pitch
tended to match the infant’s; light bands indicate utterances whose pitch
was largely unmatched by the partner’s. The bands in the plot suggest that
the relationship between adult and infant sequences was patterned; if the
sequences were unrelated, only scattered dots with different levels of gray
would occur.
How can recurrence analysis be applied to SDIS sequences? As in
genomic sequences or text sequences, the events of a single-code event
sequence define the rows and columns of the recurrence plot grid. Rows
represent the events vertically (either bottom to top as in Figure 12.1 or
top to bottom as in Figure 12.2) and the columns represent events hori-
zontally (left-to-right). Each row and each column represents one code; a
grid cell is black if the row and the column codes are identical and white
otherwise. A recurrence plot for a single-code event sequence of a couple’s
verbal interaction is illustrated in Figure 12.2 (left). For this example, ver-
bal utterances are coded as wa (wife approves), wc (wife complains), we
(wife emotes), wp (wife empathizes), wn (wife negates), wo (other wife
utterances), ha (husband approves), hc (husband complains), he (husband
emotes), hp (husband empathizes), hn (husband negates), and ho (other
husband utterances); every wife utterance is followed by one husband utter-
ance and vice versa. Utterances run top to bottom on the Y-axis, so the plot
is symmetrical around the upper-left to lower-right diagonal. Checkered
regions indicate alternations of identical data, as in the 6×6 region in the
upper left corner, which corresponds to the first six codes in the sequence,
we he we he we he.
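In sketch form (illustrative Python, with an ASCII rendering standing in for the dot plot), the all-or-nothing version is simply an equality test between every pair of positions; the example below uses the first twelve codes shown in the Figure 12.2 plot.

# Minimal sketch: an all-or-nothing recurrence plot for a single-code
# event sequence, printed as ASCII ('#' = identical codes, '.' = not).
def recurrence_plot(events):
    return [[a == b for b in events] for a in events]

seq = "we he we he we he we hc wp he we he".split()
for row in recurrence_plot(seq):
    print("".join("#" if cell else "." for cell in row))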
Alternative plots can be generated by defining time windows (whose
width is analogous to an embedding dimension) along the sequence and
computing a similarity measure among them. Successive time windows
[Figure 12.2 appears here; the axis labels of the left plot are the first twelve
codes of the sequence: we he we he we he we hc wp he we he.]

Event (wa wc we wp wn wo) (ha hc he hp hn ho); we he we he we he we hc
wa hp we hc wc hc wc hc wc he we he we he we he we he we hc wc hc wc hc
wc hc wc hc wc hc wc hc wc hc wc ho wc hc we he wn hn wo ha wo ho wo hn
wo ho we he we he we hp wp ha wa ha wa hp wa ha wa hp wa ha wa ha wa hn
wn hn wn hn wn he we he we he we he we he wp hp wp he we he /

Figure 12.2. Two recurrence plots for a single-code event sequence of a couple's
verbal interaction. For both plots, the sequence is represented top to bottom and
left to right. At left, each row (and each column) of the plot corresponds to one code
and similarities are all-or-nothing. At right, each row (and each column) corresponds
to a time window containing a chain of three successive codes; time windows
are shifted one event forward and are overlapped; similarities are quantitative and
represented with levels of gray.

may overlap or not, each providing slightly different plots. When time windows
are used, similarity is no longer all-or-nothing because quantitative
measures of similarity can be represented with different levels of gray (as
in Buder et al.'s, 2010, example). To illustrate, the verbal interaction data
shown at the bottom of Figure 12.2 create a recurrence plot in which time
windows containing three codes were moved along the sequence (Figure
12.2, right). Successive windows were shifted one event forward and thus
overlapped; the first three windows started at the first, second, and third
events in the sequence – i.e., [we he we], [he we he], and [we he we]. Gray
dots indicate that certain windows have a nonperfect similarity – e.g., [ha
wa hp] and [ha wa ha].
In this case, repetitions of similar three-code chains occurred in differ-
ent parts of the sequence as indicated by diagonal segments that are parallel
to the main diagonal; notice the run of five diagonal black dots close to the
main diagonal in the lower right quarter of the plot. They correspond to
five successive windows of the section wa ha wa hp wa ha wa hp wa ha wa
in the last quarter of the sequence, which starts at position 67 and ends at
position 77. Specifically, window [wa ha wa] starting at 67 is repeated at 71


and 75; window [ha wa hp] starting at 68 is repeated at 72 and 74; and so
on. Of course, alternative recurrence plots may be obtained by varying both
the window width and the degree of window overlapping.
Other indications of possible patterns are vertical and horizontal lines,
either continuous or fragmented, which show that a certain code (Figure 12.2,
left) or chain of codes (Figure 12.2, right) repeats in several other positions. In
Figure 12.2 (left) a fragmented horizontal line composed of eight dots in the
upper left quarter of the plot indicates that a particular code (hc) repeats eight
times forward in the sequence – specifically in section hc wc hc wc hc wc hc wc
hc wc hc wc hc wc hc wc. On the other hand, horizontal or vertical white bands
would indicate that a certain code or chain of codes does not repeat at those
positions in the sequence; in particular, a horizontal white band as wide as the
plot (except for the black diagonal dot) would show that the code or chain of
codes corresponding to that row is unique and never repeats.
Interpreted correctly, recurrence plots reveal general features of sequences
and are useful for classifying them as patterned or random. Figure 12.3 (top)
shows three plots for a hypothetical sequence of couple interaction in which
husband responds to wife randomly and vice versa, whereas Figure 12.3
(bottom) shows three plots for another hypothetical sequence containing
long runs of reciprocal interactions – specifically cross-approvals (wa ha
wa ha ...), cross-empathizing (we he we he ...), and cross-complaining (wc
hc wc hc ...) – which correspond to three large checkered squares along the
main diagonal, indicating a development from positive to negative reciprocation.
The plots from left to right were generated using time windows
containing one, two, and three codes, respectively (successive windows
overlapped in the latter cases). For the random sequence, black dots in the
similarity map tend to vanish rapidly as the window width increases (top),
whereas for the highly patterned sequence, the proportion of black dots
remains quite stable (bottom).
As stated earlier, when time windows instead of single codes are used,
a similarity measure must be computed. Shifting time windows along the
sequence is a method that can be applied to any data type, but is especially
suited for timed-event sequences. For timed-event sequences, it makes
sense that similarity should depend more on the codes' onset times than on
time units when codes are merely continuing (already on) or no code is on
(e.g., Mannila & Seppänen, 2001). Given two time windows with identical
width in the same sequence – or in different, but comparable sequences –
their similarity can be computed; the computations are a bit complex and
interested readers should consult Quera (2008).
Figure 12.3. Recurrence plots for a random event sequence (top) and a highly patterned
event sequence of verbal interactions (bottom). Left to right, plots are based on
time windows 1, 2, and 3 codes wide. See text for details.

An example of a recurrence plot for a timed-event sequence is presented
in Figure 12.4. The plot was created using the RAP program (Quera, 2008);
overlapped time windows 20 time units wide were specified and succes-
sive windows were shifted one unit forward. Window self-similarities are
displayed as a black line along the diagonal. Horizontal gray/black bands
indicate a group of successive windows (in the vertical sequence) contain-
ing the same codes (ditto for vertical bands and the horizontal sequence).
Therefore, a square containing dots with a gradation of grays indicates that
a group of successive windows in the vertical sequence has contents that
are similar to those of a group of successive windows in the horizontal one.
For example, Cry,46- Calm,53- Fuss,63 is contained within a window start-
ing at 45 and ending at 65, and in two contiguous windows as well (44−64
and 46−66); those three windows with the same codes correspond to the
narrow horizontal band pointed to by the upper cursor arrow. Similarly,
Cry,97- Calm,107- is contained within a window starting at 95 and ending
at 115 and in several contiguous windows as well (93−113, 94−114, 96−116,
and so on)€– these correspond to the narrow vertical band pointed to by
the same cursor arrow. In fact, the arrow points to the dot for the intersec-
tion of windows 45−65 and 95−115, whose similarity is 0.703. Several other
Timed Calm Cry Fuss ;
Fuss,1- Cry,5- Calm,17- Cry,26- Fuss,31- Calm,39-
Fuss,43- Cry,46- Calm,53- Fuss,63 Fuss+Cry,65-
Calm,77- Fuss,91- Cry,97- Calm,107- Cry,125-
Fuss+Cry,140- Fuss,187- Cry,190- Calm,204- Fuss,223
Cry,265- Calm,290- Cry,314- ,320 /

Figure 12.4. At bottom, a timed-event sequence of a child's crying and fussing episodes,
and at top, its recurrence plot. Contents for two pairs of windows with high
similarities are shown. See text for details.

intersections along those horizontal and vertical bands indicate high similarities –
that is, similar repetitions of codes Cry, Calm, and Fuss at different
positions along the sequence.
A recurrence plot can be further processed to reveal the temporal struc-
ture of the sequence; for example, the RAP program can detect segments
in a sequence by filtering similarity values along and close to the diagonal.
Following Foote and Cooper (2003), a square matrix (smaller in size than
the plot itself) whose cells contain values from a two-dimensional normal
distribution (a Gaussian checkerboard filter) is moved along the diagonal,
centered on each of its dots; for every diagonal dot, its surrounding regions
are multiplied cell-wise by the filter and all the products are added, yielding
one single measure called a novelty score. Novelty scores are a time series
indicating where important changes in the sequence occur; by applying the
Figure 12.5. A recurrence plot for an interval sequence of mother-infant interaction,
and above it the novelty score time series indicating sequence segmentation.
See text for details.

filter, abrupt changes in the sequence are highlighted. Figure 12.5 shows
a recurrence plot for an interval-recorded sequence of mother-infant interaction
558 intervals long; codes include approach, obey, instruct, complain,
and so on. The plot was generated by applying a moving time window two
intervals wide (successive windows overlapped by one interval); the resulting
novelty score is shown at the top, its peaks indicating segment boundaries –
that is, temporal points at which significant changes were detected
in the sequence.
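In outline, the computation looks like the sketch below (illustrative Python with numpy, following Foote and Cooper's idea rather than RAP's actual code; the kernel half-width w and the Gaussian spread are arbitrary choices here):

import numpy as np

def checkerboard_filter(w, sigma=None):
    # Gaussian checkerboard kernel, size 2w x 2w: positive weights for
    # same-quadrant cells, negative for cross-quadrant cells.
    sigma = sigma if sigma is not None else w / 2
    idx = np.arange(2 * w) - w + 0.5
    gauss = np.exp(-(idx[:, None] ** 2 + idx[None, :] ** 2) / (2 * sigma ** 2))
    return gauss * np.sign(idx)[:, None] * np.sign(idx)[None, :]

def novelty_scores(similarity, w=4):
    # Slide the kernel along the diagonal of a similarity (recurrence)
    # matrix; peaks in the result suggest segment boundaries.
    kernel = checkerboard_filter(w)
    n = similarity.shape[0]
    scores = np.zeros(n)
    for i in range(w, n - w):
        scores[i] = np.sum(kernel * similarity[i - w:i + w, i - w:i + w])
    return scores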
The intent of these examples has been to whet your appetite. Recurrence
analysis offers many more possibilities than the few illustrated here,
including meaningful summary measures of entire patterns. Once again,
interested readers are encouraged to read further (e.g., Marwan, Romano,
Thiel, & Kurths, 2007; Riley & Van Orden, 2005; Zbilut & Webber,
2007).

permutation tests for short event sequences


Interpreting adjusted residuals in a sequential contingency table may be
problematic when the normal approximation is not justified; specifically,
claiming statistical significance when an adjusted residual exceeds a cri-
terion like 1.96 absolute (p < .05) may not be warranted when either the
row sum is small or the expected probability is extreme (see “Expected
frequencies and adjusted residuals” in Chapter 9). Assigning p values
to adjusted residuals assumes that the way the residuals are distributed
(hypergeometric distribution; see Allison & Liker, 1982; Gottman, 1980)
asymptotically approaches the normal distribution when conditions are
met. As assumptions become untenable (e.g., as sequences become short
or marginal distributions become skewed, yielding small row totals and
extreme expected probabilities, respectively), those asymptotic p values
become doubtful. In such cases – for example, when Lag 1 transitions are
tallied for relatively short single-code event sequences – permutation
tests provide a better way to assign significance (Bakeman, Robinson, &
Quera, 1996).
In contrast to standard asymptotic tests, permutation tests yield exact
p values and do not require distributional assumptions; however, they
are less powerful. When an asymptotic test is possible – that is, when its
assumptions are met – and a sequential association exists, a permutation
test requires more data than an asymptotic test in order to claim a given level
of statistical significance. Consequently, when single-code event sequences
are short, only relatively strong sequential associations may be found significant.
But with permutation tests, users need no longer worry whether
data are sufficient for a reasonable approximation to normality because no
such assumption is required.
Permutation tests construct the actual sampling distribution, or reference
set, based on the observed data and assign a p value to a statistic according
to its position in the distribution (Edgington & Onghena, 2007); well-known
examples of permutation tests are the sign, or binomial, test (see
"The sign test: A nonparametric alternative" in this chapter) and Fisher's
exact test for 2×2 contingency tables (e.g., Hays, 1963). Given a short event
sequence like ACBACBACB (length N = 9), a permutation test for the Lag 1
BA transition would proceed in five steps, as follows (a code sketch appears
after the list):
1. The observed transition frequency – xBA – is tallied; in this case, its
value is 2.
2. All possible permutations of the sequence are listed; this example
yields N! = 9! = 362,880 permuted sequences. One of them, of course,
is the sequence observed, and the simple code frequencies are the
same for all sequences.
3. For each permuted sequence the frequency of the BA transition –
xBA(s), where s = 1...N! – is tallied. Values for xBA(s) can vary between
0 (for those permuted sequences in which A never follows B; e.g.,
ACBCABCAB) and 3 (for those in which A follows B three times,
which is the maximum possible given that B's simple frequency is
3; e.g., CBACBACBA). The number of sequences that contain 0, 1,
2, and 3 BA transitions – in this case, 86,400, 194,400, 77,760, and
4,320, or 23.8 percent, 53.6 percent, 21.4 percent, and 1.19 percent,
respectively – constitute a sampling distribution for the number of
BA transitions expected by chance.
4. The distribution median – mBA – is computed; in this case its value
is 0.99.
5. If xBA > mBA, then the one-tailed exact p value for xBA = 2 is the proportion
of permuted sequences in which xBA ≥ 2; if xBA < mBA, then
the p value for xBA = 2 is the proportion of permuted sequences in
which xBA ≤ 2. In this case, the exact p value for the observed value of
2 is .226 (.214 + .012).
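The sketch below (illustrative Python, not the authors' PSEQ program) implements these five steps literally, enumerating all 9! orderings. For simplicity it uses the plain 50th percentile as the distribution median, whereas the text uses an interpolated median (0.99 here); either choice selects the same tail for this example.

from itertools import permutations
from collections import Counter

def count_pair(s, pair):
    # Tally Lag 1 occurrences of `pair` in sequence s.
    return sum(1 for a, b in zip(s, s[1:]) if (a, b) == pair)

def exact_transition_p(seq, pair):
    # Steps 1-5: observed tally, full reference set, sampling
    # distribution, median, and exact one-tailed p value.
    observed = count_pair(seq, pair)
    dist = Counter(count_pair(p, pair) for p in permutations(seq))
    total = sum(dist.values())                      # N! permutations
    cum, median = 0, None
    for k in sorted(dist):                          # plain 50th percentile
        cum += dist[k]
        if median is None and cum >= total / 2:
            median = k
    if observed > median:
        p = sum(v for k, v in dist.items() if k >= observed) / total
    else:
        p = sum(v for k, v in dist.items() if k <= observed) / total
    return observed, p

print(exact_transition_p("ACBACBACB", ("B", "A")))  # (2, ~.226)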
The procedure for computing a one-tailed exact p value for every possible
transition among codes A, B, and C in the ACBACBACB sequence is similar:
(1) the observed transition frequencies xrc are tallied (r,c = 1, 2, 3); (2) for
each permuted sequence, the frequency of every transition is tallied, xrc(s);
(3) one sampling distribution of N! values and one median, mrc, is obtained
for every cell (r,c) in the Lag 1 table; (4) the exact one-tailed p value for
cell (r,c) is then the proportion of values in its sampling distribution that
are equal to or greater than xrc (if xrc > mrc) or are equal to or less than xrc
(if xrc < mrc). Note that in the observed sequence, ACBACBACB, no code
repeats, whereas in many of the N! permuted sequences codes may repeat
(e.g., ACBBCBCAA). When codes may repeat in the observed sequence, the
sampling distributions are constructed using the N! permuted sequences,
even if no code happened to repeat in the observed one. However, when
codes cannot logically repeat, all permuted sequences containing repeated
codes should be discarded when constructing the sampling distribution.
Even for sequences as short as the preceding example, the number of
possible permutations is huge; for anything other than very short sequences,
constructing the sampling distribution can be time-consuming, even for
relatively powerful computers. There is a relatively simple solution: Instead
of actually constructing the complete reference set, a smaller set can be
formed by sampling from the full set using random or Monte Carlo proce-
dures. For example, Mehta and Patel’s (1992) StatXact program (www.cytel.
com) uses Monte Carlo estimates for exact p values for a variety of tests
(e.g., chi-square, Kolmogorov, Wilcoxon) when a data set is too large for
exact algorithms. A random permutation test may be effective with as few
as 1,000 data permutations (Edgington & Onghena, 2007).
A sampled permutation test of sequential association follows the same
steps as described before – except that the sequence is permuted randomly,
or shuffled, many times (e.g., 1,000; for a reliable shuffling algorithm, see
Castellan, 1992). If codes cannot repeat, any permuted sequence in which at
least one code repeats is discarded; alternatively, an algorithm that shuffles
the sequence with the restriction that codes cannot repeat can be used. Note
that different shuffles can yield accidentally identical permuted sequences –
which can nonetheless be considered when constructing the sampling
distributions. The one-tailed exact p value for cell (r,c) in the Lag 1 table is
estimated by evaluating the position of the observed frequency xrc in the
sampling distribution for that cell.
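A Monte Carlo version replaces the full enumeration with random shuffles. The sketch below (again illustrative, not PSEQ) estimates the p value for the same BA transition used earlier; with 1,000 shuffles the estimate should land near the exact .226.

import random

def sampled_transition_p(seq, pair, n_shuffles=1000, seed=1):
    # Estimate the exact p value from random shuffles instead of all N!
    # orderings; `seed` makes the illustration reproducible.
    def count_pair(s):
        return sum(1 for a, b in zip(s, s[1:]) if (a, b) == pair)

    rng = random.Random(seed)
    observed = count_pair(seq)
    events = list(seq)
    counts = []
    for _ in range(n_shuffles):
        rng.shuffle(events)                    # in-place random shuffle
        counts.append(count_pair(events))
    upper = sum(c >= observed for c in counts) / n_shuffles
    lower = sum(c <= observed for c in counts) / n_shuffles
    return observed, min(upper, lower)         # smaller one-tailed proportion

print(sampled_transition_p("ACBACBACB", ("B", "A")))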
Exact permutation results in exact p values; the result is the same each
time the procedure is repeated. Sampled permutation results in estimates
of exact p values. If we repeat the procedure several times, the results will
vary some due to the random shuffling. This is hardly problematic because,
as Mehta and Patel note (1992, pp. 4.16–4.17), with enough runs estimates
can be computed to any accuracy desired. For example, we might permute
a sequence 1,000 times before estimating exact probabilities, but then repli-
cate the procedure ten times. Based on the ten replications, we would next
compute a mean for the ten estimates along with its 95 percent confidence
interval. To narrow the confidence interval, and so provide greater accur-
acy, we need only repeat the procedure more times (e.g., 50 or 100 times
instead of 10).
An example of sampled permutation tests is shown in Figure 12.6 (the
PSEQ program was used; see Bakeman, Robinson, & Quera, 1996). We
tallied Lag 1 frequencies for an observed short event sequence (N = 75) of
a student’s school activities; these codes may repeat. The likelihood ratio
                          2nd code in 2-event chain
1st code in
2-event chain    Chat    Write   Read    Ask     Attentive   TOTAL
Chat             1       3       4       6       2           16
                 .099-   .638+   .574-   .055+   .659+
Write            5       2       5       0       2           14
                 .151+   .550-   .266+   .031-   .622+
Read             5       0       9       4       1           19
                 .426+   .011-   .022+   .612+   .213-
Ask              5       2       2       3       3           15
                 .176+   .440-   .163-   .587+   .310+
Attentive        0       6       0       2       2           10
                 .069-   .002+   .034-   .656+   .357+
TOTAL            16      13      20      15      10          74

Event (Chat=1 Write=2 Read=3 Ask=4 Attentive=5);
2143343421445413453221412
5212331441413331521131412
3345523352544231552213333/

Figure 12.6. The first number in each cell (top) is the observed count for 2-event
chains (i.e., Lag 1 transitions) computed for the single-code event sequence shown
at the bottom (N = 75). The second number in each cell is the exact p value for each
2-event chain, estimated using sampled permutations. See text for details.

chi-square for this table is G²(16) = 40.96, with asymptotic p = .0006 (for
comparison, its exact p value as computed by PSEQ was .0221). Because
the chi-square indicates sequential association in the data, we decided
to probe further. Note that the number of possible permutations for this
sequence is N! = 75!, which is approximately 2.48·10¹⁰⁹ (i.e., 248 followed
by 107 zeros). The sequence was shuffled 1,000 times, and a sampling dis-
tribution was obtained for each cell in the table. For each Lag 1 transition
(i.e., each cell in the table), Figure 12.6 shows – in addition to its count –
its one-tailed p value based on the sampling distribution. For example,
the observed count for the Attentive-Write transition is 6. Only 2 out of
the 1,000 shuffled sequences contained 6 or more Attentive-Write transi-
tions and so its estimated p value is .002 (see Figure 12.7). If the observed
count for this transition had been 5 instead, its estimated p value would
have been .014 (i.e., the probability of obtaining 5 or 6, which equals .012
+ .002). Other transitions with significant results were Write-Ask, Read-
Write, Read-Read, and Attentive-Read.
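The tail arithmetic can be read directly off the Figure 12.7 distribution; a
few lines of Python (ours, for illustration) reproduce the two sums just
described.

# Sampling distribution for Attentive-Write chains across 1,000 shuffles
# (Figure 12.7): counts 0-6 occurred 337, 288, 198, 103, 60, 12, and 2 times.
dist = {0: 337, 1: 288, 2: 198, 3: 103, 4: 60, 5: 12, 6: 2}

def upper_tail_p(dist, observed, n_shuffles=1000):
    # One-tailed p: proportion of shuffles with at least the observed count.
    return sum(f for count, f in dist.items() if count >= observed) / n_shuffles

print(upper_tail_p(dist, 6))  # 0.002
print(upper_tail_p(dist, 5))  # 0.014, i.e., .012 + .002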
In Figure 12.6, probabilities above and below their medians are indi-
cated with plus and minus signs, indicating that the transition occurred
more or less often than expected by chance, respectively. In this case,
[Figure 12.7, a bar chart: of 1,000 shuffles, the number containing 0, 1, 2,
3, 4, 5, and 6 Attentive-Write chains was 337, 288, 198, 103, 60, 12, and 2,
respectively.]

Figure 12.7. The sampling distribution for the Attentive-Write transition, based on
shuffling an event sequence (N = 75) 1,000 times. See text for details.

the Attentive-Write and Read-Read transitions occurred more often than
expected by chance. As a check on the Figure 12.6 p values, we replicated
the process 9 more times; we then computed mean p values and their 95 per-
cent confidence intervals for the 25 Lag 1 transitions based on the 10 sets of
results. Means for the five significant transitions were similar to those shown
in Figure 12.6, and none of their confidence intervals extended to include
.05. In general, CIs were narrow (e.g., for Attentive-Read, M = .036 and
CI = .032–.039). The signs of replicated p values were either all above, or
all below, the median and reflected those given in Figure 12.6 with three
exceptions – Chat-Attentive, Read-Ask, and Attentive-Ask – all of whose
p values exceeded .60. In general, the replications confirmed the sampled
permutation results presented in Figure 12.6.
To identify significant Lag 1 transitions, we recommend using sampled
permutation tests when event sequences are short; they engender greater
confidence than asymptotic tests. True, when sequences are short, only
relatively strong effects may be detected as significant (which illustrates
the general rule that power increases with more data). But with permuta-
tion tests, users need not worry whether data are sufficient for a reasonable
approximation to normality because no such assumption is required. As
with asymptotic tests, results from sampled permutation tests can be win-
nowed to discover which cells are responsible for the sequential association
and which depend on them or are side effects of them (see “Deviant cells,
type I error, and winnowing” in Chapter 10).

summary
In addition to approaches discussed in earlier chapters, two additional tech-
niques for detecting pattern are recurrence analysis and permutation tests.
Recurrence analysis is primarily graphic and applies to any of the data types
we have described. An entire sequence defines both the horizontal and the
vertical axes of a recurrence plot; units are events or time units or intervals
(or windows containing them). Cells are colored – either black and white,
shades of gray, or different colors – depending on the similarity (various
definitions are possible) of each cell’s respective row and column codes. In
this way, patterns in the sequences – for example, repeated runs of the same
short sequences – are revealed by patterns in the plot. Matching may be
revealed when horizontal and vertical axes represent different sequences
(cross-recurrence – e.g., mother and infant). Moreover, meaningful sum-
mary measures of entire patterns can be derived from individual plots.
Permutation tests, the second approach, can detect patterns in rela-
tively short single-code event sequences. Such tests generate all possible
permutations of the observed sequence, create a sampling distribution for
the observed test statistic from the permuted sequences, and then deter-
mine the exact probability for the test statistic from this distribution. For
example, given a sequence of nine events – each of which is coded A, B, or
C – we can determine the exact probability of observing two A-B sequences
in the 9! permutations of the observed sequence of nine events.
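For a sequence this short, the exact test can be carried out by brute force.
The following sketch (ours, with an arbitrary nine-event sequence) enumerates
all 9! orderings; because equal codes make some orderings repeat, the
enumeration reproduces exactly the N! reference set described above.

from itertools import permutations

def count_AB(seq):
    # Number of A-B transitions (an A immediately followed by a B).
    return sum(a == 'A' and b == 'B' for a, b in zip(seq, seq[1:]))

def exact_upper_tail_p(seq):
    # Exact one-tailed p: the proportion of all N! orderings that contain
    # at least as many A-B transitions as the observed sequence does.
    observed = count_AB(seq)
    hits = total = 0
    for perm in permutations(seq):  # 9! = 362,880 orderings for nine events
        total += 1
        hits += count_AB(perm) >= observed
    return hits / total

print(exact_upper_tail_p(list('ABACBABCC')))  # nine events, two A-B chains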
The number of permutations can be very large – N! where N is the length
of the sequence; as a result, constructing the sampling distribution can be
time consuming, even for relatively powerful computers. A solution is to
sample permutations instead of generating all possible permutations –
which results in an estimated p value instead of an exact one. Nonetheless,
satisfactory results can be obtained with as few as 1,000 samples (i.e., shuf-
fles of the observed sequence). Moreover, the procedure could be replicated
several times, which produces mean p values along with their 95 percent
confidence intervals. We recommend using sampled permutation tests
to identify significant Lag 1 transitions when event sequences are short;
because they require fewer assumptions, they engender greater confidence
than asymptotic tests.
Epilogue

Sometimes a book merely rests unnoticed or waits unused or stands lifeless
on a shelf – although authors usually hope for more. In the hands of read-
ers, a book becomes many books; and even in the hands of a single reader,
it can be a different book each time it is picked up anew – like the same
river never stepped in twice. Readers bring to a book a personal history and
understanding of a topic, guided by necessity and motivation. Some, for
their own edification or on a teacher’s instruction, may have read this book
from start to finish. Others may have skipped around a bit, looking for parts
relevant to them. And still others may have used it simply for reference.
Whatever your previous history or current motivation – whether a nov-
ice seeking an overall understanding, a curious researcher seeking to expand
your repertoire, a relatively experienced investigator wanting answers to
specific questions, or even a fellow author examining how and how well
we have addressed various topics – we hope this book has expanded your
knowledge and understanding of sequential analysis and observational
methods. Used appropriately, they can be cost effective and allow you to
address research questions with unique fidelity. If you choose to use them,
we hope this book will speed you in your work.

Appendix A

Expected Values for Kappa Comparing Two Observers

Observer                            Variability
accuracy   # codes   Equiprobable   Moderately variable   Highly variable
80%           2      .36            .30                   .20
              3      .49            .47                   .44
              4      .54            .53                   .51
              5      .56            .55                   .54
              6      .58            .57                   .56
              8      .60            .59                   .59
             10      .60            .60                   .60
             15      .62            .62                   .61
             20      .62            .62                   .62
85%           2      .49            .42                   .30
              3      .60            .58                   .55
              4      .64            .63                   .61
              5      .66            .65                   .64
              6      .67            .67                   .66
              8      .69            .68                   .68
             10      .69            .69                   .69
             15      .70            .70                   .70
             20      .71            .71                   .71
90%           2      .64            .57                   .44
              3      .72            .70                   .68
              4      .75            .74                   .73
              5      .77            .76                   .75
              6      .77            .77                   .76
              8      .78            .78                   .78
             10      .79            .79                   .79
             15      .80            .80                   .79
             20      .80            .80                   .80
95%           2      .81            .76                   .65
              3      .86            .84                   .83
              4      .87            .87                   .86
              5      .88            .88                   .87
              6      .88            .88                   .88
              8      .89            .89                   .89
             10      .89            .89                   .89
             15      .90            .90                   .89
             20      .90            .90                   .90

Note. Table entries indicate the expected value of kappa when comparing two observers, both
accurate at the indicated level, using a scheme with K codes. For example, minimum acceptable
value of kappa is .76 if you want 90% accuracy, K = 5, and code probabilities (prevalence) are
moderately variable. For details, see Bakeman et al. (1997).
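For readers who prefer the table in code, here is its 90 percent accuracy
block transcribed into a small Python lookup; this is our own convenience
helper, not part of GSEQ, and the other blocks (and the Appendix B table)
transcribe the same way.

# Expected kappa for two observers who are both 90% accurate (Appendix A),
# keyed by number of codes K; each tuple holds the (equiprobable,
# moderately variable, highly variable) code-prevalence columns.
EXPECTED_KAPPA_90 = {
    2: (.64, .57, .44), 3: (.72, .70, .68), 4: (.75, .74, .73),
    5: (.77, .76, .75), 6: (.77, .77, .76), 8: (.78, .78, .78),
    10: (.79, .79, .79), 15: (.80, .80, .79), 20: (.80, .80, .80),
}

VARIABILITY = {'equiprobable': 0, 'moderate': 1, 'high': 2}

def minimum_acceptable_kappa(k, variability='moderate'):
    # Look up the expected kappa for K codes at 90% observer accuracy.
    return EXPECTED_KAPPA_90[k][VARIABILITY[variability]]

print(minimum_acceptable_kappa(5))  # .76, matching the example in the note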
Appendix B

Expected Values for Kappa Comparing with a Gold Standard

Observer                            Variability
accuracy   # codes   Equiprobable   Moderately variable   Highly variable
80%           2      .60            .53                   .40
              3      .70            .68                   .65
              4      .73            .72                   .71
              5      .75            .74                   .74
              6      .76            .76                   .75
              8      .77            .77                   .77
             10      .78            .78                   .77
             15      .79            .78                   .78
             20      .79            .79                   .79
85%           2      .70            .64                   .51
              3      .78            .76                   .74
              4      .80            .79                   .78
              5      .81            .81                   .80
              6      .82            .82                   .81
              8      .83            .83                   .82
             10      .83            .83                   .83
             15      .84            .84                   .84
             20      .84            .84                   .84
90%           2      .80            .75                   .64
              3      .85            .84                   .82
              4      .87            .86                   .85
              5      .88            .87                   .87
              6      .88            .88                   .87
              8      .89            .88                   .88
             10      .89            .89                   .89
             15      .89            .89                   .89
             20      .89            .89                   .89
95%           2      .90            .87                   .80
              3      .93            .92                   .91
              4      .93            .93                   .93
              5      .94            .94                   .93
              6      .94            .94                   .94
              8      .94            .94                   .94
             10      .94            .94                   .94
             15      .95            .95                   .95
             20      .95            .95                   .95

Note. Table entries indicate the expected value of kappa when comparing an observer accurate
at the indicated level with a gold standard, using a scheme with K codes. For example, minimum
acceptable value of kappa is .87 if you want 90% accuracy, K = 5, and code probabilities (prevalence)
are moderately variable. For details, see Bakeman et al. (1997).
References

Adamson, L. B., & Bakeman, R. (1984). Mothers’ communicative actions: Changes
during infancy. Infant Behavior and Development, 7, 467–478.
Adamson, L. B., Bakeman, R., & Deckner, D. F. (2004). The development of symbol-
infused joint engagement. Child Development, 75, 1171–1187.
Allison, P. D., & Liker, J. K. (1982). Analyzing sequential categorical data on dyadic
interaction: A comment on Gottman. Psychological Bulletin, 91, 393–403.
Altmann, J. (1974). Observational study of behaviour: Sampling methods. Behaviour,
49, 227–267.
Altmann, S. A. (1965). Sociobiology of rhesus monkeys. II. Stochastics of social
communication. Journal of Theoretical Biology, 8, 490–522.
Altmann, S. A., & Wagner, S. S. (1970). Estimating rates of behaviour from Hansen
frequencies. Primates, 11, 181–183.
Arrington, R. E. (1943). Time sampling in studies of social behavior: A critical
review of techniques and results with research suggestions. Psychological
Bulletin, 40, 81–124.
Bakeman, R. (1978). Untangling streams of behavior: Sequential analyses of obser-
vation data. In G. P. Sackett (Ed.), Observing behavior (Vol. 2, Data collection
and analysis methods, pp. 63–78). Baltimore: University Park Press.
Bakeman, R. (2000). Behavioral observations and coding. In H. T. Reis & C. K. Judd (Eds.),
Handbook of research methods in social psychology (pp. 138–159). Cambridge:
Cambridge University Press.
Bakeman, R. (2004). Sequential analysis. In M. Lewis-Beck, A. E. Bryman, & T. F. Liao (Eds.),
The SAGE encyclopedia of social science research methods (Vol. 3, pp. 1024–
1026). Thousand Oaks, CA: SAGE Publications.
Bakeman, R. (2006). The practical importance of findings. In K. McCartney, M. R. Burchinal, &
K. L. Bub (Eds.), Best practices in quantitative methods for developmentalists
(pp. 127–145). Monographs of the Society for Research in Child Development,
71(3, Serial No. 285).
Bakeman, R. (2010). Reflections on measuring behavior: Time and the grid. In G. Walford, E.
Tucker, & M. Viswanathan (Eds.). The SAGE handbook of measurement (pp.
221–237). Thousand Oaks, CA: SAGE Publications.
Bakeman, R., & Adamson, L. B. (1984). Coordinating attention to people and
objects in mother-infant interaction. Child Development, 55, 1278–1289.

Bakeman, R., Adamson, L. B., Konner, M., & Barr, R. (1990). !Kung infancy: The
social context of object exploration. Child Development, 61, 794–809.
Bakeman, R., & Brownlee, J. R. (1980). The strategic use of parallel play: A sequen-
tial analysis. Child Development, 51, 873–878.
Bakeman, R., & Brownlee, J. R. (1982). Social rules governing object conflicts in toddlers and preschoolers. In K.
H. Rubin & H. S. Ross (Eds.), Peer relationships and social skills in childhood
(pp. 99–111). New York: Springer-Verlag.
Bakeman, R., Deckner, D. F., & Quera, V. (2005). Analysis of behavioral streams.
In D. M. Teti (Ed.), Handbook of research methods in developmental science
(pp.€394–420). Oxford: Blackwell Publishers.
Bakeman, R., & Dorval, B. (1989). The distinction between sampling independence
and empirical independence in sequential analysis. Behavioral Assessment, 11,
31–37.
Bakeman, R., & Gottman, J. M. (1986). Observing interaction: An introduction to
sequential analysis. Cambridge: Cambridge University Press.
Bakeman, R., & Gottman, J. M. (1997). Observing interaction: An introduction to sequential analysis (2nd ed.).
Cambridge: Cambridge University Press.
Bakeman, R., & Helmreich, R. (1975). Cohesiveness and performance: Covariation
and causality in an undersea environment. Journal of Experimental Social
Psychology, 11, 478–489.
Bakeman, R., & Quera, V. (1992). SDIS: A sequential data interchange standard.
Behavior Research Methods, Instruments, and Computers, 24, 554–559.
Bakeman, R., & Quera, V. (1995a). Analyzing interaction: Sequential analysis with SDIS and GSEQ.
Cambridge: Cambridge University Press.
Bakeman, R., & Quera, V. (1995b). Log-linear approaches to lag-sequential analysis when consecutive
codes may and cannot repeat. Psychological Bulletin, 118, 272–284.
Bakeman, R., & Quera, V. (2008). ActSds and OdfSds: Programs for converting INTERACT and The
Observer data files into SDIS timed-event sequential data files. Behavior
Research Methods, 40, 869–872.
Bakeman, R., & Quera, V. (2009). GSEQ 5 [Computer software and manual]. Retrieved
from www.gsu.edu/~psyrab/gseq/gseq.html
Bakeman, R., & Quera, V. (2012). Behavioral observation. In H. Cooper (Ed.-in-Chief), P. Camic, D. Long,
A. Panter, D. Rindskopf, & K. J. Sher (Assoc. Eds.), APA handbooks in psych-
ology: Vol. 1. APA handbook of research methods in psychology: Psychological
research: Foundations, planning, methods, and psychometrics. Washington, DC:
American Psychological Association.
Bakeman, R., Quera, V., & Gnisci A. (2009). Observer agreement for timed-event
sequential data: A comparison of time-based and event-based algorithms.
Behavior Research Methods, 41, 137–147.
Bakeman, R., Quera, V., McArthur, D., & Robinson, B. F. (1997). Detecting
sequential patterns and determining their reliability with fallible observers.
Psychological Methods, 2, 357–370.
Bakeman, R., & Robinson, B. F. (1994). Understanding log-linear analysis with ILOG:
An interactive approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Bakeman, R., & Robinson, B. F. (2005). Understanding statistics in the behavioral sciences. Mahwah, NJ: Lawrence
Erlbaum Associates.
Bakeman, R., Robinson, B. F., & Quera, V. (1996). Testing sequential association:
Estimating exact p values using sampled permutations. Psychological Methods,
1, 4–15.
Barker, R. G. (1963). The stream of behavior: Explorations of its structure and con-
tent. New York: Appleton-Century-Crofts.
Barker, R. G., & Wright, H. (1951). One boy’s day: A specimen record of behavior.
New York: Harper.
Bass, R. F., & Aserlind, L. (1984). Interval and time-sample data collection proce-
dures: Methodological issues. Advances in Learning and Behavioral Disabilities,
3, 1–9.
Becker, M., Buder, E., Bakeman, R., Price, M., & Ward, J. (2003). Infant response
to mother call patterns in Otolemur garnettii. Folia Primatologica, 74,
301–311.
Bekoff, M. (1979). Behavioral acts: Description, classification, ethogram ana-
lysis, and measurement. In R. B. Cairns (Ed.), The analysis of social interac-
tions: Methods issues, and illustrations (pp. 67–80). Hillsdale, NJ: Lawrence
Erlbaum€Associates.
Belsky, J., & Most, R. K. (1981). From exploration to play: A cross-sectional study of
infant free play behavior. Developmental Psychology, 17, 630–639.
Berk, R. A. (1979). Generalizability of behavioral observations: A clarification of
interobserver agreement and interobserver reliability. American Journal of
Mental Deficiency, 83, 460–472.
Bernard, C. (1927). An introduction to the study of experimental medicine. New
York: Macmillan. [Introduction a l’étude de la médecine experimentale. Paris:
J.-B. Baillière, 1865].
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate ana-
lysis: Theory and practice. Cambridge, MA: MIT Press.
Boice, R. (1983). Observational skills. Psychological Bulletin, 93, 3–29.
Brennan, R. L., & Kane, M. T. (1977). An index of dependability for mastery tests.
Journal of Educational Measurement, 14, 277–289.
Buder, E. H., Warlaumont, A. S., Oller, D. K., & Chorna, L. B. (May, 2010). Dynamic
indicators of mother-infant prosodic and illocutionary coordination.
Proceedings of Speech Prosody 2010, Chicago, IL.
Castellan, N. J., Jr. (1992). Shuffling arrays: Appearances may be deceiving. Behavior
Research Methods, Instruments, and Computers, 24, 72–77.
Chorney, J. M., Garcia, A. M., Berlin, K., Bakeman, R., & Kain, Z. N. (2010). Time-
window sequential analysis: An introduction for pediatric psychologists.
Journal of Pediatric Psychology, 35, 1060–1070. doi: 10.1093/jpepsy/jsq022.
Cochran, W. G. (1954). Some methods for strengthening the common χ² tests.
Biometrics, 10, 417–451.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled dis-
agreement or partial credit. Psychological Bulletin, 70, 213–220.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (revised edition). New
York: Academic Press.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cohn, J. F. & Kanade, T. (2007). Use of automated facial image analysis for meas-
urement of emotion expression. In J. A. Coan & J. J. B. Allen (Eds.), Handbook
of emotion elicitation and assessment. Oxford University Press Series in Affective
Science (pp. 222–238). New York: Oxford.
Cohn, J. F., & Sayette, M. A. (2010). Spontaneous facial expression in a small group
can be automatically measured: An initial demonstration. Behavioral Research
Methods, 42, 1079–1086.
Cooper, H. (Ed.-in-Chief), Camic, P., Long, D., Panter, A., Rindskopf, D., & Sher,
K. J. (Assoc. Eds.). (2012). APA handbooks in psychology: Vol. 3. APA hand-
book of research methods in psychology: Data analysis and research publication.
Washington, DC: American Psychological Association.
Cooper, M., & Foote, J. (2002). Automatic music summarization via similarity
analysis. Proceedings of the International Symposium on Music Information
Retrieval, 81–85.
Cote, L. R., Bornstein, M. H., Haynes, O. M., & Bakeman, R. (2008). Mother-infant
person- and object-directed interactions in Latino immigrant families: A com-
parative approach. Infancy, 13, 338–365.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependabil-
ity of behavioral measures. New York: Wiley.
Dale, R., & Spivey, M. J. (2006). Unraveling the dyad: Using recurrence analysis to
explore patterns of syntactic coordination between children and caregivers in
conversation. Language Learning, 56, 391–430.
Deckner, D. F., Adamson, L. B., & Bakeman, R. (2003). Rhythm in mother-toddler
interactions. Infancy, 4, 201–217.
Dijkstra, W., & Taris, T. (1995). Measuring the agreement between sequences.
Sociological Methods and Research, 24, 214–231.
Douglass, W. (1760). A summary, historical and political, of the first planting, pro-
gressive improvements, and present state of the British settlements in North-
America (Vol. 1). London: R. and J. Dodsley.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence ana-
lysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge
University Press.
Eckmann, J.-P., Kamphorst, S. O., & Ruelle, D. (1987). Recurrence plots of dynamical
systems. Europhysics Letters, 5, 973–977.
Edgington, E. S., & Onghena, P. (2007). Randomization tests (4th ed.). Boca Raton,
FL: Chapman and Hall/CRC.
Ekman, P. W., & Friesen, W. (1978). Facial Action Coding System: A technique for the
measurement of facial movement. Palo Alto, CA: Consulting Psychologist Press.
Fagen, R. M., & Mankovich, N. J. (1980). Two-act transitions, partitioned con-
tingency tables, and the ‘significant cells’ problem. Animal Behaviour, 28,
1017–1023.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge,
MA: MIT Press.
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York:
Wiley.
Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass
correlation coefficient as measures of reliability. Educational and Psychological
Measurement, 33, 613–619.
Fleiss, J. L., Cohen, J., & Everitt, B. S. (1969). Large sample standard errors of kappa
and weighted kappa. Psychological Bulletin, 72, 323–327.
Foote, J., & Cooper, M. (2003). Media segmentation using self-similarity decom-
position. Proceedings of the Society of Photo-Optical Instrumentation Engineers
(SPIE), 5021, 167–175.
Fossey, D. (1972). Vocalizations of the mountain gorilla (Gorilla gorilla beringei).
Animal Behaviour, 20, 36–53.
Gardner, W. (1995). On the reliability of sequential data: Measurement, meaning,
and correction. In J. M. Gottman (Ed.), The analysis of change (pp. 339–359).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Galisson, F. (2000). Introduction to computational sequence analysis. Tutorial, ISMB
2000, 8th International Conference on Intelligent Systems for Molecular
Biology, August, San Diego, CA. Available at www.iscb.org/ismb2000/tutorial_pdf/galisson4.pdf
Goodenough, F. (1928). Measuring behavior traits by means of repeated short sam-
ples. Journal of Juvenile Research, 12, 230–235.
Goodman, S. H., Thompson, S. F., Rouse, M. H., & Bakeman, R. (2010). Extending
models of sensitive parenting of infants to women at risk for perinatal depression.
Unpublished manuscript.
Gottman, J. M. (1979). Marital interaction: Experimental investigations. New York:
Academic Press.
Gottman, J. M. (1980). On analyzing for sequential connection and assessing interobserver reli-
ability for the sequential analysis of observational data. Behavioral Assessment,
2, 361–368.
Gottman, J. M. (1981). Time-series analysis: A comprehensive introduction for social scientists.
Cambridge: Cambridge University Press.
Gottman, J. M., & Roy, A. K. (1990). Sequential analysis: A guide for behavioral
research. Cambridge: Cambridge University Press.
Gros-Louis, J., West, M. J., Goldstein, M. H., & King, A. P. (2006). Mothers provide
differential feedback to infants’ prelinguistic sounds. International Journal of
Behavioral Development, 30, 509–516.
Haberman, S. J. (1978). Analysis of qualitative data (Vol. 1). New York: Academic
Press.
Haberman, S. J. (1979). Analysis of qualitative data (Vol. 2). New York: Academic Press.
Haccou, P., & Meelis, E. (1992). Statistical analysis of behavioural data: An approach
based on time-structured models. Oxford: Oxford University Press.
Haddock, C., Rindskopf, D., & Shadish, W. (1998). Using odds ratios as effect sizes
for meta-analysis of dichotomous data: A primer on methods and issues.
Psychological Methods, 3, 339–353.
Hall, S., & Oliver, C. (1997). A graphical method to aid the sequential analysis of
observational data. Behavior Research Methods, Instruments, and Computers,
29, 563–573.
Hartmann, D. P. (1982). Assessing the dependability of observational data. In
D. P. Hartmann (Ed.), Using observers to study behavior: New directions for
methodology of social and behavioral science (No. 14, pp. 51–65). San Francisco:
Jossey-Bass.
Hays, W. L. (1963). Statistics (1st ed.). New York: Holt, Rinehart, & Winston.
Helfman, J. I. (1996). Dotplot patterns: A literal look at pattern languages. Theory
and Practice of Object Systems, 2, 31–41.
Hutt, S. J., & Hutt, C. (1970). Direct observation and measurement of behaviour.
Springfield, IL: Thomas.
Kaye, K. (1980). Estimating false alarms and missed events from interobserver
agreement: A rationale. Psychological Bulletin, 88, 458–468.
Kennedy, J. J. (1992). Analyzing qualitative data: Log-linear analysis for behavioral
research (2nd ed.). New York: Praeger.
Konner, M. J. (1976). Maternal care, infant behavior, and development among
the !Kung. In R. B. Lee & I. DeVore (Eds.), Kalahari hunter-gatherers (pp. 218–245).
Cambridge, MA: Harvard University Press.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for
categorical data. Biometrics, 33, 159–174.
Maclure, M., & Willett, W. C. (1987). Misinterpretation and misuse of the kappa
statistic. American Journal of Epidemiology, 126, 161–169.
Mann, J., Ten Have, T., Plunkett, J. W., & Meisels, S. J. (1991). Time sampling: A
methodological critique. Child Development, 62, 227–241.
Martin, P., & Bateson, P. (2007). Measuring behaviour: An introductory guide (3rd
ed.). Cambridge: Cambridge University Press.
Maizel, J. V., Jr., & Lenk, R. P. (1981). Enhanced graphic matrix analysis of nucleic
acid and protein sequences. Proceedings of the National Academy of Sciences,
78, 7665–7669.
Mannila, H., & Ronkainen, P. (1997). Similarity of event sequences. In Proceedings of
the Fourth International Workshop on Temporal Representation and Reasoning.
TIME’97 (pp. 136–139). Daytona Beach, Florida.
Mannila, H., & Seppänen, J. (2001). Recognizing similar situations from event
sequences. In Proceedings of the First SIAM Conference on Data Mining,
Chicago. Available at www.cs.helsinki.fi/~mannila/postscripts/mannilaseppanensiam.pdf.
Marwan, N. (2003). Encounters with neighbours – Current developments of concepts
based on recurrence plots and their applications. Ph.D. Thesis, University of
Potsdam, ISBN 3–00–012347–4.
Marwan, N., Romano, M. C., Thiel, M., & Kurths, J. (2007). Recurrence plots for
the analysis of complex systems. Physics Reports, 438, 237–329. doi:10.1016/j.
physrep.2006.11.001.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass
correlation coefficients. Psychological Methods, 1, 30–46.
Mehta, C., & Patel, N. (1992). StatXact: Statistical software for exact nonparametric
inference. Cambridge, MA: Cytel Software Corporation.
Messinger, D. S., Mahoor, M. H., Chow, S., & Cohn, J. F. (2009). Automated meas-
urement of facial expression in infant–mother interaction: A pilot study.
Infancy, 14, 285–305.
Miller, R. G., Jr. (1966). Simultaneous statistical inference. New York: McGraw-
Hill.
Mitchell, S. (1979). Interobserver agreement, reliability, and generalizability of data
collected in observational studies. Psychological Bulletin, 86, 376–390.
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the
search for similarities in the amino acid sequence of two proteins. Journal of
Molecular Biology, 48, 443–453.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Oller, D. K. (2000). The emergence of the speech capacity. Mahwah, NJ: Lawrence
Erlbaum Associates.
Olson, W. C. (1929). The measurement of nervous habits in normal children. University
of Minnesota Institute of Child Welfare Monograph, No. 3.
Özçalışkan, Ş., & Goldin-Meadow, S. (2009). When gesture-speech combinations
do and do not index linguistic change. Language and Cognitive Processes, 24,
190–217.
Parrott, D. J., Gallagher, K. E., Vincent, W., & Bakeman, R. (2010). The link between
alcohol use and aggression toward sexual minorities: An event-based analysis.
Psychology of Addictive Behaviors, 24, 516–521.
Parten, M. B. (1932). Social participation among preschool children. Journal of
Abnormal and Social Psychology, 27, 243–269.
Pearson, W. R., & Lipman, D. J. (1988). Improved tools for biological sequence com-
parison. Proceedings of the National Academy of Sciences, 85, 2444–2448.
Pianta, R. C., Belsky, J., Houts, R., Morrison, F., & The National Institute of Child
Health and Human Development (NICHD) Early Child Care Research
Network (2007). Teaching: Opportunities to learn in America’s elementary
classrooms. Science, 315, 1795–1796. doi: 10.1126/science.1139719.
Powell, J., Martindale, A., & Kulp, S. (1975). An evaluation of time-sample mea-
sures of behavior. Journal of Applied Behavior Analysis, 8, 463–469.
Quera, V. (1990). A generalized technique to estimate frequency and duration in
time sampling. Behavioral Assessment, 12, 409–424.
Quera, V. (2008). RAP: A computer program for exploring similarities in behavior
sequences using random projections. Behavior Research Methods, 40, 21–32.
Quera, V., Bakeman, R., & Gnisci, A. (2007). Observer agreement for event
sequences: Methods and software for sequence alignment and reliability esti-
mates. Behavior Research Methods, 39, 39–49.
Rechten C., & Fernald, R. D. (1978). A sampled randomization test for examining
single cells of behavioural transition matrices. Behaviour, 69, 217–227.
Richardson, D. C., & Dale, R. (2005). Looking to understand: The coupling between
speakers’ and listeners’ eye movements and its relationship to discourse com-
prehension. Cognitive Science, 29, 39–54.
Riley, M. A., & Van Orden, G. C. (2005). Tutorials in contemporary nonlinear methods
for the behavioral sciences. Digital publication available through the National
Science Foundation. Available at www.nsf.gov/sbe/bcs/pac/nmbs/nmbs.jsp.
Robinson, B. F., & Bakeman, R. (1998). ComKappa: A Windows 95 program
for calculating kappa and related statistics. Behavior Research Methods,
Instruments, and Computers, 30, 731–732.
Rojahn, J., & Kanoy, R. C. (1985). Toward an empirically based parameter selec-
tion for time-sampling observation systems. Journal of Psychopathology and
Behavioral Assessment, 7, 99–120.
Sackett, G. P. (1978). Measurement in observational research. In G. P. Sackett (Ed.),
Observing behavior (Vol. 2, pp. 25–43). Baltimore: University Park Press.
Sackett, G. P. (1979). Lag sequential analysis of contingency and cyclicity in behavioral inter-
action research. In J. D. Osofsky (Ed.), Handbook of infant development (pp.
623–649). New York: Wiley.
Sankoff, D., & Kruskal, J. (Eds.). (1999). Time warps, string edits, and macromol-
ecules: The theory and practice of sequence comparison (2nd ed.). Stanford, CA:
CSLI Publications.
Shostak, M. (1981). Nisa: The life and words of a !Kung woman. Cambridge, MA:
Harvard University Press.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater
reliability. Psychological Bulletin, 86, 420–428.
Siegmund, D. (1985). Sequential analysis: Tests and confidence intervals. New York:
Springer-Verlag.
Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: Use, inter-
pretation, and sample size requirements. Physical Therapy, 85, 257–268.
Smith, P. K. (1978). A longitudinal study of social participation in preschool children:
Solitary and parallel play reexamined. Developmental Psychology, 14, 517–523.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103,
677–680.
Suen, H. K. (1988). Agreement, reliability, accuracy, and validity: Toward a clarifi-
cation. Behavioral Assessment, 10, 343–366.
Suen, H. K., & Ary, D. (1989). Analyzing quantitative behavioral observation data.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Suomi, S. J. (1979). Levels of analysis for interactive data collected on monkeys liv-
ing in complex social groups. In M. E. Lamb, S. J. Suomi, & G. R. Stephenson
(Eds.), Social interaction analysis: Methodological issues (pp. 119–135).
Madison: University of Wisconsin Press.
Tyler, S. (1979). Time sampling: A matter of convention. Animal Behaviour, 27,
801–810.
Ubersax, J. S. (1982). A generalized kappa coefficient. Educational and Psychological
Measurement, 42, 181–183.
Ueno, A., & Matsuzawa, T. (2004). Food transfer between chimpanzee mothers and
their infants. Primates, 45, 231–239.
Umesh, U. N., Peterson, R. A., & Sauber, M. H. (1989). Interjudge agreement and
the maximum value of kappa. Educational and Psychological Measurement, 49,
835–850.
Walford, G., Tucker, E., & Viswanathan, M., (Eds.). (2010). The SAGE handbook of
measurement. Thousand Oaks, CA: SAGE Publications.
Warlaumont, A. S., Oller, D. K., Buder, E. H., Dale, R., & Kozma, R. (2010). Data-driven
automated acoustic analysis of human infant vocalizations using neural net-
work tools. Journal of the Acoustical Society of America, 127, 2563–2577.
Warlaumont, A. S., Oller, D. K., Dale, R., Richards, J. A., Gilkerson, J., & Dongxin,
X. (2010). Vocal interaction dynamics of children with and without autism. In
S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Meeting
of the Cognitive Science Society (pp. 121–126). Austin, TX: Cognitive Science
Society.
White, D. P., King, A. P., & Duncan, S. D. (2002). Voice recognition technology
as a tool for behavioral research. Behavior Research Methods, Instruments, &
Computers, 34, 1–5.
Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Wiggins, J. S. (1973). Personality and prediction. Reading, MA: Addison-Wesley.
Wilkinson, L., & Task Force on Statistical Inference (1999). Statistical methods in
psychology journals: Guidelines and explanations. American Psychologist, 54,
594–604.
Wolff, P. (1966). The causes, controls, and organization of the neonate. Psychological
Issues, 5 (whole No. 17).
Yoder, P., & Symons, F. (2010). Observational measurement of behavior. New York:
Springer.
Yoder, P. J., & Tapp, J. (2004). Empirical guidance for time-window sequential ana-
lysis of single cases. Journal of Behavioral Education, 13, 227–246.
Zbilut, J. P., & Webber, C. L., Jr. (2007). Recurrence quantification analysis:
Introduction and historical context. International Journal of Bifurcation and
Chaos, 17, 3477–3481.
Index

absolute agreement ICC, 87, 89, 90
Adamson, Lauren B.
  joint engagement codes, 15
  joint engagement ratings, 22
adjusted residual, 110, 128
  how distributed, 110
agreement. See event-based, interval-based, observer, point-by-point,
  time-based, summary
agreement matrix, 59
alignment
  algorithm for single-code event data, 74
  algorithm for timed-event data, 79
  example for single-coded events, 75
  matrices, 74
  transformations, 74
  why needed, 73
alignment kappa (for single-coded events), 76
Altmann, Jeanne, 19, 32
  definitions for event, state, 2, 32
  on sampling, 36
analytic units, 4
AND command, 119
asymptotic p values, 156
automated coding, 20
Bakeman & Brownlee
  object struggle codes, 39, 144
  parallel play codes, 10
Bakeman & Helmreich, daily activity codes, 14
Barker, Roger, 2, 8
Belsky & Most, infant play codes, 16
between-subjects. See factors
binary recode, 127
binomial test. See sign test
Bonferroni correction, 128
Bornstein, Marc, people-object codes, 17
BOUT command, 121
bout duration
  mean, 100
  min and max, 100
categorical-scale measurement. See nominal-scale
CHAIN command, 123
chi-square, 129. See likelihood-ratio, Pearson
code names
  recommendations for, 23
  rules for, SDIS, 46
  single-character, SDIS, 47
code prevalence, 63, 65
codes
  can and cannot repeat, 108, 140
  concreteness (physically, socially based), 19
  granularity (micro to macro), 18
  mutually exclusive, exhaustive, 21
  number of, effect on kappa, 66
  vs. rating scales, 21
code-unit grid, 53
coding manual, 22
coding methods. See computer-assisted, lexical chaining, post hoc
coding schemes
  development of, 13
  examples. See Bakeman & Brownlee, Bakeman & Helmreich, Belsky & Most,
  Bornstein & Cote, Ekman & Friesen, Fossey, Konner, Oller, Parten, Pianta,
  Smith, Ueno & Matsuzawa, Wolff
Cohen’s kappa, 59, 72, 81
combination codes, SDIS, 49
commission errors, 59, 69
computer-assisted coding systems, 37, 54, 124. See Mangold INTERACT,
  Noldus The Observer
conditional independence in log-linear analysis, 142
conditional probability, 108
confidence intervals. See odds ratio
confusion matrix, 60
context codes, SDIS, 49
contingency indices. See log odds, odds ratio, Yule’s Q
co-occurrence, 106
correlational studies, 3
criterion-referenced ICC, 87, 90
Cronbach’s internal consistency alpha, 91
cross-recurrence plot, 150
data management, 54
data modification commands, 119. See AND, BOUT, CHAIN, EVENT, LUMP,
  NOR, NOT, OR, RECODE, REMOVE, RENAME, WINDOW, XOR
data modification, benefits of, 118, 124
data reduction, 43, 93
data transformations and recodes, 128
data types in SDIS, 44. See interval, multicode, single-code, state, timed-event
declaration, SDIS, 46
degrees of freedom
  for chi-square, 111
  in log-linear analysis, 141
deviant cells, 128
digital recording, advantages of, 37
dot plot. See recurrence plot
duration, 97
  for interval and multicode data, 97
  for single-code event data, 97
  for timed-event data, 97
  relative. See relative duration
dynamic programming, 73
Ekman, Paul, facial action coding system (FACS), 20
embedding dimension in recurrence plots, 149
empirical zeros, 76
episode for interval and multicode data, 96, 99, 100, 102
estimated duration for interval data, 97
event
  J. Altmann’s definition. See Altmann
  onset and offset times, 48
EVENT command, 121
event data, 72. See multicode, single-code, state, timed-event
event recording, 26
event-based agreement, 72, 78
exact p values, 156
exclusive offset times, 46
exhaustive. See mutually exclusive and exhaustive
expected frequency, 109
experimental studies, 3
export files, 127
factors, 4
  between-subjects, 4
  in SDIS, 48
  pooling over. See pooling
  within-subjects, 5
file formats, 55
files. See export, MDS, SDS, tab-delimited
Fossey, Dian, gorilla vocalization codes, 23
frames, number per second, 38, 45
frequency, 95
  for interval or multicode data, 96
  for single-code or timed-event data, 96
  relative. See relative frequency
G² difference test, 130, 141
gap
  mean between event onsets, 100
  mean between events, 100
  min and max, 100
Generalized Sequential Querier. See GSEQ
gold standard, 57, 68, 75
  advantages, 68, 69
  disadvantages, 68
GSEQ, 44, 118
Haccou & Meelis, alignment algorithm, 80
hierarchical rule
  when coding, 16
  when tallying, 106
hypergeometric distribution, 156
ICC, 58, 87
  and weighted kappa, 82
  formulas for, 91
  models for, 90
  reliability sample, 88
  standards for, 91
  vs. kappa, 58
  vs. r, 88
independence, model of, 129
instantaneous sampling, 32
interval data, 50, 77
  agreement for, 81
interval duration, 33
interval recording, 26, 30, 31, 81
  advantages and disadvantages, 33
interval sequential data. See interval data
interval-based agreement, 81
interval-scale measurement, 3
intraclass correlation coefficient. See ICC
iterative proportional fitting (IPF), 76, 79, 116, 141
joint frequency, 105
  tallying interval, multicode, and time-event data, 105, 106
  tallying lagged single-code event data, 107
kappa. See alignment, Cohen’s, timed-event alignment, time unit, weighted
  agreement vs. reliability, 69
  factors affecting magnitude, 63
  formula for, 60
  no single acceptable value, 83
  number of codes, effect on, 66
  standard error, 63
  standards for, 66, 166, 168
  statistical significance, 63
  unsatisfactory guidelines for, 63
  vs. ICC, 58
  weighted average of 2×2 tables, 62
kappa maximum, formula for, 64
kappa table, 60, 61, 83
  collapsing into 2×2 tables, 61
Konner, Melvin, !Kung study, 2, 31
lagged and unlagged tallies, 107
lag-sequential analysis, 135, 138
latency, 100
  min and max, 100
lexical chaining, 39
likelihood-ratio chi-square (G²), 116
live observation
  advantages, 36
  vs. recorded behavior, 36
log odds, 113
logical zeros. See structural zeros
log-linear analysis, 141, 144
  example, multicoded events, 144
  of interval and multicode event data, 144
  of single-code event data, 139
LUMP command, 122
  vs. RECODE, 122
macro. See codes, granularity
Mangold INTERACT, 37, 54
Martin & Bateson
  definition for duration, 97
  definition for frequency, 96
  on recording rules, 27
  on sampling, 36
MDS file (modified SDS), 118
mean bout duration. See bout duration
mean gap. See gap
mean latency. See latency
measurement scale. See nominal, interval, ordinal, ratio
micro. See codes, granularity
missing data, 115, 127
molar. See codes, granularity
molecular. See codes, granularity
momentary events, 50
momentary sampling, 32, 34
  vs. one-zero, 33
multicode event data, 50
multiple streams, SDIS, 50
mutually exclusive and exhaustive, 59
  defined, 15
  making codes exhaustive, 16
  making codes mutually exclusive, 15
  should codes be ME&E, 17
narrative reports, 2, 57
Needleman-Wunsch algorithm, 73
Noldus The Observer, 37, 55
nominal-scale measurement, 3, 58, 93
NOR command, 120
norm-referenced ICC, 87, 90
NOT command, 119
novelty scores in recurrence plots, 154
observational methods, reasons for, 6
observed joint frequency. See joint frequency
observer accuracy, 63, 65
  reasons why important, 57
observer agreement, 57
observer bias, 63
  and kappa maximum, 64
observer drift, 23
observer training, 58, 65, 84
observers
  as cultural informants, detectors, 20
odds ratio, 112
  95% confidence intervals, 114
  guidelines, 114
offset times. See inclusive, exclusive
Oller, Kimbrough, infant vocalization codes, 20
omission errors, 59, 69
one-zero sampling, 32
  vs. momentary and whole-interval, 33
onset and offset times. See event
optimal global alignment, 73
OR command, 119
  vs. RECODE, 120
ordinal-scale measurement, 3, 58
outcome variables, 4
overlapped sampling of m-event chains, 139
  vs. nonoverlapped, 139
paper and pencil, when best used, 41
Parten, Mildred, social participation codes, 8
partial-interval sampling, 32, 62
Pearson chi-square (χ²), 116
Pearson correlation coefficient, 87
percentage agreement, 60
permutation tests, 156
  example of, 157
  sampled. See sampled permutation tests
  vs. asymptotic tests, 156, 160
phi coefficient (ϕ), 114
physically based codes. See codes
Pianta, Robert, classroom codes, 35
point sampling, 32
point-by-point agreement, 58
  vs. summary, 59
pooling
  advantages of, 126
  aversion to, 125
  over sessions, factor levels, 125
post-hoc coding, 39
predictor variables, 4
predominant activity sampling, 33
probability, 98. See conditional, transitional
  for interval and multicode data, 98
  for single-code event data, 98
  for timed-event data, 98
  vs. duration, 98
PSEQ computer program, 158
quasi-independence, model of, 130
RAP computer program, 153
rate, 96
  vs. frequency, 96
rating scales, 21
  and weighted kappa, 83
ratio-scale measurement, 3, 58, 93
raw residual, 109
RECODE command
  for interval, multicode, and timed-event data, 120
  for single-code data, 122
  vs. LUMP, 122
  vs. OR, 120
recorded behavior
  vs. live observation, 36
recording strategies, 26. See event (timed event, untimed-event), interval,
  selected interval
recurrence analysis, 148, 156
recurrence plot, 148
reference set in permutation tests, 156
relative consistency ICC, 87, 89, 90
relative duration, 98
  for single-code event data, 98
  for timed-event, interval, and multicode data, 98
relative frequency, 96
REMOVE command, 123
RENAME command, 123
repeat. See codes, can and cannot
research factors. See factors
residual. See adjusted, raw, standardized
residual code, in GSEQ, 106
sampled permutation tests, 158
  example of, 158
sampling
  of intervals. See instantaneous, momentary, one-zero, partial interval,
  point, predominant activity, whole-interval
  of single-coded events. See overlapped
sampling unit, 5
saturated model, 141
SDIS, 44
SDS file (SDIS formatted), 118
selected-interval recording, 27, 34
sensitivity, 69
sequential analysis, defined, 134
Sequential Data Interchange Standard. See SDIS
sequential data types. See data types in SDIS
session
  as basic analytic unit, 5
  defined, 5
  SDIS notation, 47
  start and stop times, 47
sessions, pooling over. See pooling
sign test, 137
  example, 138
similarity map. See recurrence plot
single-code event data
  agreement for, 72
skew. See standardized
Smith, Peter, developmental progression codes, 9
socially-based codes. See codes
specificity, 69
spreadsheets, 53
standardized residual, 110
standardized skew, 128
start and stop times. See session
state data, 50
state, J. Altmann’s definition. See Altmann
statistical packages, 53, 118, 124
statistical significance, 128
statistics for 2×2 tables, 104. See log odds, odds ratio, Yule’s Q
statistics for contingency table cells, 104, 105. See adjusted residual,
  conditional probability, expected frequency, joint frequency, raw residual,
  transitional probability
statistics for contingency tables, 104. See likelihood, Pearson chi-square
statistics for individual codes, 93, 99. See bout duration, duration,
  frequency, gap, latency, probability, rate, relative duration, relative
  frequency
  frequency vs. duration, 101
  rate vs. probability, 101
  recommendations for single-code event data, 101
  recommendations for timed-event data, 101
  redundancies among, 102
  relative frequency vs. probability, 101
Stevens, S. S., measurement scales, 3
streams, multiple. See multiple streams
structural zeros, 75, 76, 79, 108, 130
summary agreement, 58
  vs. point-by-point, 59
summary scores, 93, 104
systematic observation defined, 3
tab-delimited file, 127
time
  accuracy of, 45
  discrete view of, 53
  formats for, 38, 45
  SDIS formats, 46
time sampling, 26, 27, 31. See interval recording
  as defined by Arrington, 34
time series data, 134
time windows
  for time-window sequential analysis, 136
  in recurrence plots, 149, 150
time-based agreement, 77
time-budget information, 14
timed-event alignment kappa, 79, 84
timed-event data, 77
  agreement for, 77, 78
timed-event recording, 26, 29
times
  onset and offset. See event
  start and stop. See session
time-unit kappa, 77, 84
time-unit kappa with tolerance, 78
time-window sequential analysis, 135
  example, 136
transitional probability, 109
type I error, 110, 128
Ueno & Matsuzawa, food transfer codes, 18
untimed-event recording, 26, 28
vulnerability to zero cells, 2×2 table statistics, 115
weighted kappa
  and ICC, 82
  and rating scales, 83
  formula for, 82
whole-interval sampling, 32
  vs. one-zero, 33
WINDOW command, 124, 136
winnowing
  for adjusted residuals, 129
  in log-linear analysis, 143
within-subjects. See factors
Wolff, Peter, infant state codes, 14
XOR command, 120
Yoder & Symons, 23, 135
  on ICCs, 89, 92
  on sampling, 27, 35, 36
Yule’s Q, 114