A Researcher's Handbook
Fourth Edition
Geoffrey Keppel
Thomas D. Wickens
University of California, Berkeley

PEARSON
Prentice
Hall
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data
Keppel, Geoffrey.
Design and analysis : a researcher's handbook. 4th ed. / Geoffrey Keppel, Thomas D. Wickens.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-515941-5 (alk. paper)
1. Social sciences--Statistical methods. 2. Factorial experiment designs. 3. Social
sciences--Research. I. Wickens, Thomas D., 1942- . II. Title.
HA29.K44 2004
300'.72'4--dc22 2003064774
Credits and acknowledgments borrowed from other sources and reproduced, with permission, in this
textbook appear on the appropriate page within the text.
Copyright © 2004, 1991, 1982, 1973 by Pearson Education, Inc., Upper Saddle River,
New Jersey, 07458.
Pearson Prentice Hall. All rights reserved. Printed in the United States of America. This publication
is protected by Copyright and permission should be obtained from the publisher prior to any
prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means,
electronic, mechanical, photocopying, recording, or likewise. For information regarding permission(s),
write to: Rights and Permissions Department.
10 9 8 7 6 5 4 3 2
ISBN 0-13-515941-5
To Sheila and Lucia
Contents
I INTRODUCTION 1

1 Experimental Design 2
1.1 Variables in Experimental Research 2
1.2 Control in Experimentation 7
1.3 Populations and Generalizing 9
1.4 The Basic Experimental Designs 10

II SINGLE-FACTOR EXPERIMENTS 13

5 Analysis of Trend 88
5.1 Analysis of Linear Trend 88
5.2 Analysis of Quadratic Trend 95
5.3 Higher-Order Trend Components 98
5.4 Theoretical Prediction of Trend Components 100
5.5 Planning a Trend Analysis 103
5.6 Monotonic Trend Analysis 108
Exercises 109
10.4 Further Examples of Interaction 204
10.5 Measurement of the Dependent Variable 206
Exercises 209

11 The Overall Two-Factor Analysis 211
11.1 Component Deviations 211
11.2 Computations in the Two-Way Analysis 214
11.3 A Numerical Example 220
11.4 The Statistical Model 224
11.5 Designs with a Blocking Factor 228
11.6 Measuring Effect Size 232
11.7 Sample Size and Power 235
Exercises 239

12 Main Effects and Simple Effects 242
12.1 Interpreting a Two-Way Design 242
12.2 Comparisons for the Marginal Means 244
12.3 Interpreting the Interaction 246
12.4 Testing the Simple Effects 249
12.5 Simple Comparisons 256
12.6 Effect Sizes and Power for Simple Effects 258
12.7 Controlling Familywise Type I Error 262
Exercises 264

13 The Analysis of Interaction Components 266
13.1 Types of Interaction Components 266
13.2 Analyzing Interaction Contrasts 271
13.3 Orthogonal Interaction Contrasts 276
13.4 Testing Contrast-by-Factor Interactions 277
13.5 Contrasts Outside the Factorial Structure 282
13.6 Multiple Tests and Type I Error 285
Exercises 286

IV THE GENERAL LINEAR MODEL 289

14 The General Linear Model 291
14.1 The General Linear Model 292
14.2 The Two-Factor Analysis 298
14.3 Averaging of Groups and Individuals 304
14.4 Contrasts and Other Analytical Analyses 307
14.5 Sensitivity to Assumptions 309
Exercises 310

15 The Analysis of Covariance 311
15.1 Covariance and Linear Regression 312
15.2 The Analysis of Covariance 318
22.3 Effects Based on Marginal Means 495
22.4 Three-Factor Interaction Components 499

Preface
This fourth edition of Design and Analysis is the result of a collaboration that began
informally about 25 years ago, when TDW was spending a sabbatical year at Berkeley
and attended GK's seminar on experimental design. We have continued our discussions
of research design and statistical analysis over the years. TDW was intimately
familiar with the earlier editions of this book, using it as a text in his graduate courses
at UCLA and contributing substantially to the second and third editions. Our collaboration
on this new revision is the natural extension of our discussions. With our
equal participation, we have created an essentially new book. We have rewritten it
throughout, reorganized the chapters, and introduced much new material.
Our topic is the analysis of variance, the major statistical method used to analyze
experimental data in psychology and the behavioral sciences. It gives the researcher
a way to compare the means from a series of sets of scores and to determine whether
they differ. More than just identifying overall differences, however, the analysis of
variance provides a very specific set of procedures that allow a researcher to express
particular questions about the study and to measure and test them. Much of the
power and the subtlety of the analysis of variance lies in these analytical techniques,
which are necessary both when planning a study and when interpreting its results.
Accordingly, we devote over a third of the book to them.
The early part of the book contains a detailed discussion of the foundation block of
the analysis of variance: the experimental design with independent groups of subjects,
each of whom contributes a single observation to the analysis. We then elaborate this
structure by expanding it in several ways: to studies in which the groups are classified
in two or more ways, to studies with multiple observations, and to studies with several
sources of random variability. By the end of this book, a student or researcher will be
able to put these parts together and apply them to the analysis of any complex design.
We cover the major designs commonly used in psychological research and present,
with numerical examples, the analyses appropriate to them. Where possible, we cover
statistical arguments in the context of data analysis and interpretation. For example,
when we discuss the statistical assumptions that underlie an analysis, we both give
their mathematical form and describe how they relate to experimental procedures.
To the extent possible while keeping the size of our book within bounds, we describe
and illustrate how the analytical techniques are applied in both simple and complex
designs; too many books, in our opinion, fail the reader at this point.
We wrote the book with a particular reader in mind: a student at the advanced
undergraduate or beginning graduate level who is about to engage in experimental
research, but who has little or no formal mathematical or statistical background. We
therefore include a discussion of statistical inference, and we limit our assumption of
mathematical skills to basic algebra. We give enough detail in our numerical examples
for a reader at that level to follow them and to adapt them to new data. Our reader
must also be able to use the quantitative methods as part of an overall research plan.
We have, therefore, set our numerical examples in the context of plausible, if greatly
simplified, research designs, so that we can describe both the statistical analysis and
how that analysis serves as an inferential tool.
Although we have structured this book as a text, we also envision a second use.
The books from which one learns technical material are those to which one returns to
review and apply them. We know (or at least hope) that many of our readers will use
this book as the handbook of its title and will return to it over the years when using
the analysis of variance. With this application in mind, we have included extensive
coverage of the analytical analysis of the larger designs. Indeed, we expect that some
of the later portions (Part VI, in particular) will gain their full usefulness only once
the reader has had some practical research experience.
The earlier editions of this book emphasized the computational aspect of the analysis
of variance. Since they were written, the accessibility of computer programs has
greatly increased. Quite sophisticated and comprehensive packages of statistical programs
are available for use on personal computers without great expense. A researcher
now will use a computer for any major analysis. We have considerably modified this
edition of the book to accommodate this change. Thus, we have added Chapter 9 on
computer applications, and we have completely rewritten our treatments of unequal
sample sizes and the analysis of covariance (Chapters 14 and 15) to emphasize the
conceptual basis of these techniques. Our treatment of the multifactor designs has
likewise been adjusted to be concordant with computer use. The changes here have
let us emphasize the principles of the analysis instead of the arithmetic. On the other
hand, we have kept our discussions of statistical software general and have not focused
on any particular statistical package. Different laboratories are, quite appropriately,
committed to different programs. Moreover, the programs differ too much from one to
another and change too frequently for us to tie our discussion to a single program or
package. A description of a computer package that is nearly current can be extraordinarily
confusing to a learner. We suggest that an instructor supplement our treatment
with examples of whatever computer programs are locally available.
Notwithstanding the importance of the computer, we feel strongly that real facility
with the statistical methods comes only when one understands how they work, an idea
that is gained from both computer applications and hands-on calculation. Without
experience with the actual computation (at least of the simpler designs), a user's
skills are brittle; he or she has little ability to go outside the standard analyses or
to notice and accommodate unusual characteristics of the data. With the greater
flexibility that comes from understanding the techniques well, a researcher can direct
the analyses performed by a software package, is more able to use them appropriately,
and is not restricted to the particular approach favored by the software developers.
Real data sets require this flexibility.
This book is intended for use in a one-semester course or a two-quarter sequence
in experimental design and statistical analysis; for a shorter course, an instructor
might treat just the single-factor and two-factor designs (Chapters 1-13 and 16-20).
The material we cover constitutes what we feel graduate students should master as
1
Experimental Design
An experiment consists of a carefully worked-out and executed plan for data collection
and analysis. Treatment conditions are chosen to focus on particular features
of the testing environment. An appropriate source of subjects is established. These
conditions are administered to subjects in such a way that observed differences in the
average values can be unambiguously attributed to the differences among the various
treatment conditions. In essence, a welldesigned experiment permits the inference of
causation. This chapter describes a number of important features of experimentation.
Many of these points will be amplified throughout the remainder of the book. We will
consider the interdependent problems of experimental design throughout, along with
our presentation of the formal statistical procedures.
focuses on the addition of water to the food reward. A comparison between conditions
2 and 3 permits a similar determination of the effects of the addition of food to the
water reward. Most experiments involving a qualitative independent variable can be
analyzed as a set of smaller, more focused experiments, a topic we will discuss in detail
in Chapter 4.
A second type of manipulation, quantitative independent variables or continuous
independent variables, consists of variables that represent variation in
amount: amount of food deprivation, variations in dosage, loudness of the masking
noise, length of the learning task. Variables of this sort usually include treatment
conditions that define the range of values of interest to the researcher and several
intermediate conditions to provide a picture of the effects of the variable between the
two extremes. The effects of quantitative independent variables are usually analyzed
differently from those produced by qualitative manipulations. Here, researchers concentrate
on the overall relationship between the variation of the independent variable
and changes in behavior. The goal of the analysis is to determine the nature or shape
of this relationship. Suppose that the independent variable is the number of hours of
food deprivation and that we will be measuring the trials required to learn a difficult
maze. How are we to describe the relationship? Presumably we will find an increase in
performance as the animals become more hungry. But will the increase occur steadily
as the number of hours increases, or will there be no effect at first, then a steady
increase? Will very high levels of deprivation lead to poorer performance? Specialized
procedures known as trend analysis are available that permit us to distinguish among
these (and other) possibilities. We will discuss them in Chapter 5.
A third type of independent variable is encountered when researchers are interested
in the systematic variation of characteristics that are intrinsic to the subjects.
Variables of this sort are variously referred to as classification variables, subject
variables, organismic variables, and individual-difference variables. Examples
of this type of study are the effect of accidental damage to different areas of the brain
on learning and the effect of gender on solving arithmetic problems. Classification
variables are created by selecting or classifying subjects on some natural dimension.
A manipulation of this sort does not constitute a true experiment since the administration
of the "treatments" is obviously not under the control of the experimenter.
In an experiment, the independent variable is the only feature of the situation that
is allowed to vary systematically from condition to condition. It is this characteristic
of an experiment that permits us to infer that a particular manipulation caused systematic
differences in the behavior observed among the different groups. But when
a classification variable is involved, the subjects may also differ systematically from
group to group in characteristics other than the classification variable. Thus, the
inferences that can be made regarding a classification variable are weaker than those
that can be made from an experimental variable. The statistical procedures in this
book allow us to say (or not to say) that the groups differ, but the nature of the
design does not let us know why. For example, female and male subjects differ in
many ways: physiological, cultural, experiential, psychological, and so on. A researcher can
find that females and males differ on a particular dependent variable, but cannot easily
say which of the many differences between the sexes causes the effect.
Studies in some fields of psychology examine naturally occurring groups that have
received different treatments. For example, an organizational psychologist may compare
the incentive systems used in two different factories, or a health researcher may
compare the relative effectiveness of two diet programs administered in different clinics.
Although these studies have much in common with an experiment (in particular, two
different treatments are compared), they lack a critical component of a true experiment.
The treatments are applied to naturally occurring groups (the factories or the
clinics), and these may differ for various other reasons than the particular treatment.
Differences in productivity or in weight loss may have been influenced by other factors,
such as the kind of person who works at a particular factory or attends a specific clinic.
Studies of this sort are called quasi-experiments or nonrandomized experiments.
As with the classification factors, the difficulties are not with the statistical methods
per se, but with the interpretation of their outcome. When using a quasi-experimental
factor, a researcher must try to establish that the groups do not differ on whatever
incidental characteristics there may be that also affect the final scores.
for them to differ appreciably. The researcher should have used a longer list of words
or a more difficult test, so that the scores were not close to the performance limit.
Nuisance Variables
In any research situation, there are many factors that influence the value of the dependent
variable other than the treatment of interest. These characteristics constitute
nuisance variables. They cause the scores to differ, but they are not of interest
in the particular study. For example, in the experiment with the word recall, some
word lists are easier to recall than others, and subjects tested in the morning may be
fresher and recall more words than those run late in the day. These variables could
be of interest to some other researcher, who could manipulate them separately, but as
they affect the study of incentive, they are simply in the way.
The presence of nuisance variables can have two types of effect. First, if they are
systematically related to the treatment conditions, a situation known as confounding,
they can alter the apparent effects of the treatments. This is what makes quasi-experiments
so hard to interpret. Second, even if they have no systematic relationship
to the independent variable, their accidental, or nonsystematic, effects will increase
the variability of the scores and obscure whatever treatment effects there may be. A
researcher must consider both possibilities.
Consider a study designed to examine the effects of three experimental treatments,
call them t1, t2, and t3. For convenience in setting up the equipment, the
experimenter decides to run all subjects in one condition before turning to the next.
As a result, condition t1 is run in the morning, t2 in the afternoon, and t3 in the
evening. The nuisance variable time of day is now confounded with the experimental
treatment. The confounding can have serious consequences. If rats are the subjects,
for example, systematic differences in their performance may arise from where in
their diurnal cycles they are run. If the subjects are student volunteers, differences
may arise because different types of subject volunteer at different times of day, for
example, because of their availability on campus (which in turn depends on their class
schedule). In either case, the influences on performance are intertwined with those of
the independent variable and cannot be separated.
There are many ways to deal with the effects of nuisance variables. Four are
common, and many experiments include combinations of all of them.
1. A nuisance variable can be held to a constant value throughout an experiment,
say by running all subjects at the same time of day.
2. A nuisance variable may be counterbalanced by including all of its values
equally often in each condition, say by running one-third of the subjects from
each treatment level in each time slot.
3. A nuisance variable may be included in the design as an explicit factor, known
as a nuisance factor.
4. The systematic relationship between the nuisance variable and the independent
variable may be destroyed by randomization. Each subject is randomly
assigned to a condition to break any possibility of confounding.
We will discuss all of these approaches, particularly randomization, shortly.
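To make the second of these strategies concrete, here is a small Python sketch of a counterbalanced assignment. The treatment and time-slot labels are our own invented example, not the book's:

```python
import itertools
import random
from collections import Counter

treatments = ["t1", "t2", "t3"]
time_slots = ["morning", "afternoon", "evening"]
subjects_per_cell = 5  # 3 treatments x 3 slots x 5 subjects = 45 subjects

# Build a schedule containing every treatment-by-slot pairing equally
# often, so time of day cannot be confounded with treatment.
schedule = [(t, s)
            for t, s in itertools.product(treatments, time_slots)
            for _ in range(subjects_per_cell)]
random.shuffle(schedule)  # the order in which subjects arrive is randomized

# Counterbalancing holds: each treatment occurs equally often in each slot.
counts = Counter(schedule)
```

The shuffle randomizes which subject receives which pairing without disturbing the balance itself.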
In any study, there is an infinitude of potential nuisance variables, some of them
important, most not. In many studies, all four of the strategies we just mentioned are
employed. An index of the skill of a researcher is the subtlety with which this is done.
Only after considerable experience in research will you acquire the sensitivity to treat
nuisance variables well and to identify places where potential confounding may occur.
Control by Design
One way to address these ends is to design the experiment carefully. For the greater
part, doing so involves applying the first three of the four principles we listed at
the end of our discussion of nuisance variables. To start with, certain features can
in fact be held constant across the levels of the experiment. All the testing can be
done at the same time, in the same experimental room, by the same experimenter,
and with the same equipment and testing procedures. Control of other features of
the experiment, though not absolute, is sufficiently close to be considered essentially
constant. Consider the mechanical devices that are used to hold various features of
the environment constant. A thermostat does not achieve absolute control of the
temperature in the room, but it reduces its variation. An uncontrolled room would
be subjected to a wider range of temperatures during the course of an experiment
than a controlled room, although variation is still present. However, this variation is
typically small enough to allow us to view the temperature as constant.
What about the subjects? Can they be held constant across treatment conditions?
Sometimes that is possible. Each subject in the study can serve in all the conditions
instead of just one. In this way, any differences among subjects apply to all conditions
equally, greatly increasing their comparability. There are some limitations, however.
For one thing, in some studies a subject cannot provide more than one score. Certain
independent variables cause changes in the subject that cannot be undone. Most
obviously, a treatment as drastic as removal of brain tissue is irreversible and renders
the subject unusable for another treatment. More subtly, whatever may have been
learned during the course of a treatment cannot be unlearned, making it impossible
to return subjects to their original state. There are a host of things to consider when
designs with repeated observations are used, which we will describe in Chapter 17.
Suppose the same subject can be used only once. It is still possible to make the
groups more comparable by grouping the subjects into sets based on some nuisance
variable and then assigning an equal number of members from each set to each of the
experimental groups. This procedure, known as blocking, has two effects. First, it
renders the treatment groups more comparable by assuring that each has the same
distribution of the nuisance variable, and, second, it reduces the variability of the scores
within a block. The operation of blocking depends on having a way to classify the
subjects and thereby create a classification factor. However, do not think that all classification
factors are used for blocking. Classification variables such as gender or age usually
appear in a study because their effect is intrinsically of interest. We will discuss
blocking in Section 11.5 and the related analysis of covariance in Chapter 15.
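The blocking procedure just described can be sketched in Python. The function name and the pretest scores below are our own illustration, not the book's notation, and the sketch assumes the number of subjects divides evenly into blocks:

```python
import random

def blocked_assignment(scores, n_groups, seed=0):
    """Rank subjects on a nuisance measure, cut the ranking into blocks
    of n_groups subjects, and randomly assign one member of each block
    to each treatment group. Assumes len(scores) % n_groups == 0."""
    rng = random.Random(seed)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = {g: [] for g in range(n_groups)}
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]
        rng.shuffle(block)  # randomization still operates within each block
        for g, subject in enumerate(block):
            groups[g].append(subject)
    return groups

# Twelve subjects ranked on, say, a pretest; three treatment groups.
pretest = [23, 11, 30, 17, 25, 9, 28, 14, 20, 26, 12, 19]
groups = blocked_assignment(pretest, n_groups=3)
# Every group receives exactly one subject from each block of three,
# so all groups have similar distributions of the pretest scores.
```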
Properly designing the study can go a long way towards making the conditions
comparable. However, inevitably there are many characteristics that influence the
behavior that we cannot control. Subjects vary one from another in ways that we can
neither explain nor classify. For these we must take another approach.
Control by Randomization
The second way to address the comparability of the groups is to stop trying to achieve
an exact balance and adopt an approach that assures that any differences that remain
are not systematically related to the treatments. To use this procedure, known as
randomization, we simply assign each subject to a treatment condition completely
at random. If we had two conditions, we might flip a coin to determine the group
assignment for each subject. Or, if we were planning to run 75 subjects in a three-group
experiment, we could start by writing "Group 1" on 25 index cards, "Group
2" on 25 cards, and "Group 3" on 25 cards. Then we would shuffle the deck to put
the cards in a random order. When each subject came in, we would pull a card
off the stack and let the number on it determine the treatment that subject would
receive. There are more sophisticated methods of randomization, including the use of
random-number tables, but conceptually they amount to the same thing.
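The shuffled-deck procedure is easy to express in a few lines of Python (a sketch of the idea only; the subject labels are our own):

```python
import random
from collections import Counter

# 25 "index cards" per group, shuffled once, then dealt to subjects
# in their order of arrival: the deck, not the experimenter, decides
# each subject's condition.
deck = ["Group 1"] * 25 + ["Group 2"] * 25 + ["Group 3"] * 25
random.shuffle(deck)

# Subject i (in arrival order) receives the i-th card.
assignments = {f"subject {i + 1}": card for i, card in enumerate(deck)}

# Shuffling changes the order but not the composition: each group
# still receives exactly 25 subjects.
counts = Counter(deck)
```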
Consider again the control of the time of testing. Earlier we suggested that all
treatments could be administered at the same time or that an equal number of them
could be administered at each of the available time periods. Suppose, instead, that
we randomly assigned the subjects to the treatments regardless of what time they
were tested. A subject tested at 9 a.m. is equally likely to receive treatment t1, t2,
or t3, and the same goes for a subject tested at noon or at 6 p.m. Thus, every
subject, regardless of the time of testing, has an equal chance of being assigned to
any treatment condition. If we follow this procedure with enough subjects, statistical
theory tells us that the average testing times for the three treatment groups will be
almost equal. Under these circumstances, then, the nuisance variable time will have
little or no differential effect.
What about other features of the experimental environment that also change? The
grand thing about randomization is that once we have controlled one environmental
feature by randomization, we have controlled all others as well. Let's list some of
the characteristics of the testing session. The room will be at a certain temperature;
there will be a certain humidity; the room illumination will be at a particular level;
the noise from the outside filtering into the room will be of a certain intensity; the
experiment will be given at a particular time of day, on a particular day, and by a
particular experimenter; and so on. All of these factors are controlled simultaneously
by random assignment. We do not need to consider each one separately, as we would
have to if we tried to control them by design.
Differences among the treatment conditions in the characteristics of the subjects
are also controlled by randomization. The subjects who are chosen to participate in an
1.3 Populations and Generalizing
students at one college will produce effects that are much like those at a different
school or that animals from one breeding stock will perform like those from another.
To put it another way, the statistical methods allow us to generalize from our sample to
an idealized population from which it could have been sampled, and the extrastatistical
generalization lets us conclude that this hypothetical population is similar to the actual
population that we want to study. Of course, there are limits to the extent of such
nonstatistical generalizations. For example, generalizing results obtained from college
students to a population of elderly subjects is often not justified. Concerns like this,
in fact, often justify the selection of subject variables as an object of study.
Our distinction, then, is between statistical generalization, which depends on
random sampling, and nonstatistical generalization, which depends on our knowledge
of a particular research area. Cornfield and Tukey (1956, pp. 912-913) make this
point quite clearly:
In almost any practical situation where analytical statistics is applied, the
inference from the observations to the real conclusion has two parts, only
the first of which is statistical. A genetic experiment on Drosophila will
usually involve flies of a certain race of a certain species. The statistically
based conclusions cannot extend beyond this race, yet the geneticist will
usually, and often wisely, extend the conclusion to (a) the whole species,
(b) all Drosophila, or (c) a larger group of insects. This wider extension
may be implicit or explicit, but it is almost always present.
We have spoken about generalizing to a population of subjects, but there are other
forms of generalization that are possible. Various other aspects of the experimental
situation can also be thought of as "sampled" from a larger population. Suppose you
are studying how fifth graders solve mathematics problems and can draw them from
a number of classes within a school system. You cannot include all the classes in
your study, but you would like to generalize your findings beyond those you choose
to include. You have, in effect, sampled certain classes from a larger set of them. Or,
consider the math problems that your subjects are to solve. There are many possible
problems that you might have used, and the ones you actually use are only a sample
of these. To take a different example, suppose you wish to compare the memory for
active and passive sentences. You can't possibly have your subjects memorize every
possible active or passive sentence. Only a few instances must suffice. You are not
interested in the difference in memory for these particular sentences, but rather for all
sentences in general. The type of generalization that goes on here is quite similar to
that involving subjects, as you will see in Chapters 24 and 25.
II
SINGLE-FACTOR EXPERIMENTS
fact that she is using each set of data over and over again to make these comparisons.
Each set is part of nine different comparisons. We cannot think of these comparisons
as constituting 45 independent experiments; if one group is distorted for some reason
or other, this distortion will be present in every comparison in which it enters. Finally,
a set of 45 tests is difficult to comprehend or interpret. When a researcher wants to
test certain differences between individual groups, there are more direct and efficient
ways to do so.
The single-factor, or one-way, analysis of variance allows researchers to consider
all of the treatments in a single assessment. Without going into the details yet, this
analysis sets in perspective any interpretations one may want to make concerning
the differences that have been observed. More specifically, the analysis will tell the
researcher whether or not it will be worthwhile to conduct any additional analyses
comparing specific treatment groups.
We will first consider the logic behind the analysis of variance and then worry
about translating these intuitive notions into mathematical expressions and actual
numbers.
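As a preview of the computations developed in the chapters that follow, the sketch below shows how a one-way analysis of variance reduces several groups of scores to a single F ratio. The scores are made up, and the code simply implements the standard between-groups and within-groups sums of squares, not any of this book's numerical examples:

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [y for g in groups for y in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: squared deviations of the group
    # means from the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups sum of squares: squared deviations of each score
    # from its own group's mean.
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    # The F ratio compares the two resulting mean squares.
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

f, df1, df2 = one_way_anova([[3, 5, 4], [6, 8, 7], [9, 11, 10]])
# f is 27.0 here: the group means differ far more than the scores
# within each group do.
```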
2
Sources of Variability
and Sums of Squares
We will begin our discussion of the design and analysis of experiments with an example
of a simple memory experiment.1 Suppose you have devised a procedure that you
think helps people to learn and remember things. You believe that when people use
your procedure, they will learn them faster and remember them longer. Although you
believe that this procedure works for you, you also know that others will want more
compelling evidence than simply your testimony or the testimony of your friends.
You decide to give people a list of 20 obscure facts to learn and then test them the
next day to see how many facts they can recall. To make a convincing case for your
method, you compare two groups of people, an experimental group in which individuals
study your memory procedure and a control group in which individuals are given the
same amount of time to study something irrelevant to the task, such as a lesson on how
to solve fraction problems. Following this study period, all individuals are presented
the same set of facts to learn and then are tested the next day. You expect that
the experimental group will remember more facts than the control group. Following
the procedures described in the last chapter, you assemble 32 subjects, assign them
randomly to the two conditions, and run your study.
Table 2.1 gives the results of such an experiment. Each row of the table indicates
which of the 20 facts a particular subject recalled, identified by a 0 when the fact was
not recalled, and by a 1 when it was recalled. In the right column, the total number of
facts recalled by each subject is tabulated. There are 16 totals from each group, each
potentially consisting of a number between 0 (recalling no facts) and 20 (recalling all
of the facts). A glance at these data is not particularly revealing. There are differences
among the subjects (the column on the right) and among the facts (the data within
the body of the table), which obscure any overall difference between the two groups.
If the data are to convey useful information, they must be summarized.
The place to start is with a picture that focuses on the total number of facts
recalled by each subject, disregarding the differences among the facts. Figure 2.1
shows a representation of the data, known as a histogram. The abscissa of this
1 This section is based on an example developed by Wickens (1998).
Table 2.1: Recall of 20 facts by the 32 subjects in the two groups of a memory
experiment. Correct recall and failure are indicated by 1 and 0, respectively.
Subject                       Fact                              Total
         1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Control Group
1        0 1 1 1 1 1 1 1 1  1  1  1  1  0  1  1  0  1  0  1    16
2        0 0 1 1 1 0 0 1 1  0  1  1  1  1  1  0  1  0  0  1    12
3        1 1 0 0 0 1 1 0 1  1  1  0  1  1  0  0  0  1  1  0    11
4        1 1 1 1 0 1 0 1 1  1  0  0  1  1  1  1  1  0  1  1    15
5        1 0 1 0 0 0 0 1 0  1  1  0  1  0  0  0  1  0  1  1     9
6        0 0 0 1 0 0 1 0 1  0  0  1  1  1  0  0  0  0  0  0     6
7        1 1 0 1 0 1 0 1 0  1  1  1  1  0  1  1  1  0  0  0    12
8        1 0 1 0 1 1 1 1 1  0  0  1  0  0  1  0  1  1  0  1    12
9        1 0 0 1 1 1 1 0 0  1  1  1  1  1  0  0  1  0  1  0    12
10       1 1 1 0 0 1 0 1 0  0  1  0  1  0  1  0  0  1  1  0    10
11       1 1 1 0 0 1 0 0 0  1  0  1  0  1  0  1  0  1  1  1    11
12       1 1 1 0 0 1 0 0 1  0  0  1  1  0  0  1  0  0  1  1    10
13       1 1 1 1 1 0 1 0 1  1  1  0  1  1  1  0  0  0  0  1    13
14       0 0 0 0 0 1 0 1 1  1  1  1  0  1  0  0  1  1  0  0     9
15       1 0 0 1 0 1 0 1 1  0  0  0  1  0  1  1  1  1  1  0    11
16       0 1 1 1 1 1 1 0 1  1  1  0  0  1  0  1  0  1  1  1    14
Total   11 9 10 9 6 12 7 9 11 10 10  9 12  9  8  7  8  8  9  9
Experimental Group
1        1 1 1 1 0 1 1 0 0  0  0  1  1  1  1  0  0  1  0  1    12
2        1 1 0 1 1 1 0 0 1  0  1  0  1  1  0  0  0  1  1  1    12
3        1 1 0 0 1 1 1 0 0  0  0  0  0  0  1  1  1  1  0  1    10
4        0 1 0 1 1 1 1 1 0  0  0  1  1  0  0  0  1  0  0  0     9
5        1 1 1 0 1 1 1 1 1  1  1  1  1  1  1  0  1  1  0  0    16
6        1 1 0 1 1 1 1 0 1  0  1  1  1  1  1  0  1  1  1  1    16
7        1 1 0 1 0 1 1 1 1  1  1  1  0  1  1  1  1  0  1  1    16
8        1 1 0 1 0 1 1 1 1  1  1  1  1  0  1  1  1  1  0  1    16
9        1 1 1 0 1 1 1 0 1  0  0  0  0  0  0  0  0  1  1  1    10
10       1 0 1 1 1 1 0 0 1  0  1  1  1  1  1  1  1  1  1  1    16
11       1 0 0 0 1 1 0 0 0  1  1  1  0  1  0  1  0  1  1  1    11
12       1 1 1 0 1 1 1 1 1  1  1  1  0  0  1  1  0  1  1  1    16
13       1 0 1 1 0 1 1 1 1  1  1  1  1  0  1  1  1  0  0  1    15
14       1 1 1 1 1 1 1 0 1  0  0  1  0  1  0  0  1  1  0  1    13
15       1 1 1 1 0 1 0 1 0  1  0  0  0  1  1  0  0  1  1  1    12
16       1 1 1 1 1 1 1 1 1  1  1  1  1  0  1  1  1  1  1  1    19
Total   15 13 9 11 11 16 12 8 11 8 10 12  9  9 11  8 10 13  9 14
Figure 2.1: Two histograms showing the distributions of the number of facts
remembered for the two conditions of the memory experiment. The arrows mark the
means.
graph (the horizontal or X axis) is marked with the number of facts that could be
recalled, a range from 0 to 20. For each subject's total, a block is stacked up over the
appropriate score on the abscissa. For example, three subjects in the control group
remembered 11 facts, so the plot has three blocks above this point. One can easily see
from these two histograms where each distribution of scores is approximately centered
or located, how much they spread out from that center, and how much the distributions
of the two groups overlap. Like every summary representation, however, this display
emphasizes one aspect of the data over others. Here, the overall performance of the
individual subjects is apparent, but the identity of the individual subjects and the
particular facts they remembered have vanished.
The two most important characteristics of the distributions of scores in Figure 2.1
are their centers and how widely they spread out. The centers reflect the performance
levels of the groups as a whole, and their spreads reflect the variability of the scores, that is,
how the individual subjects scatter about the center of their respective distributions.
The centers of the distributions are commonly summarized either by selecting the
middle score in the distribution, known as the median, or by averaging the scores
to obtain the arithmetical mean, or mean for short. For the statistical procedures
in this book, the mean is the measure of choice. You will recall that the mean for a
group, which we will denote by the letter Y with a bar over it, is found by adding
up all the scores and dividing by the number of observations. The means of your two
groups are
ȲC = (16 + 12 + ... + 11 + 14)/16 = 183/16 = 11.44,
ȲE = (12 + 12 + ... + 12 + 19)/16 = 219/16 = 13.69.
There is a difference of 2.25 facts, favoring the experimental group.
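As a check on this arithmetic, the two means can be computed directly from the subject totals in Table 2.1. The Python sketch below is ours, not the text's; the variable names are illustrative:

```python
# Totals recalled by each of the 16 subjects in each group,
# read from the right-hand column of Table 2.1.
control = [16, 12, 11, 15, 9, 6, 12, 12, 12, 10, 11, 10, 13, 9, 11, 14]
experimental = [12, 12, 10, 9, 16, 16, 16, 16, 10, 16, 11, 16, 15, 13, 12, 19]

def mean(scores):
    """Arithmetic mean: the sum of the scores divided by their number."""
    return sum(scores) / len(scores)

y_c = mean(control)       # 183 / 16 = 11.4375, reported as 11.44
y_e = mean(experimental)  # 219 / 16 = 13.6875, reported as 13.69
difference = y_e - y_c    # 2.25 facts, favoring the experimental group
```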
What can you now conclude? Was the special memory activity responsible for the
2.25-fact difference you observed? Unfortunately, you cannot tell yet. The problem is
that the random assignment of subjects to the two conditions does not fully control
the influence of nuisance variables; that is, it does not fully equate the groups before
the start of the experiment. Specifically, the observed difference between the two
group means is influenced jointly by the actual difference between the control and
experimental treatments and by any accidental factors arising from the randomization.
The decision confronting you now is whether the difference between the treatment
conditions is entirely or just partly due to chance. We will consider, in general terms,
the statistical solution to this disturbing problem next.
Statistical Hypotheses
A research hypothesis is a fairly general statement about the assumed nature of the
world that a researcher translates into an experiment. Typically, but not always,
a research hypothesis asserts that the treatments will produce an effect. Statistical
hypotheses consist of a set of precise hypotheses about the parameters of the differ-
ent treatment populations. We usually formulate two statistical hypotheses that are
mutually exclusive or incompatible statements about the treatment parameters.
The statistical hypothesis we test is called the null hypothesis, often symbolized
H0. Its function is to specify the values of a particular parameter (the mean, for ex-
ample, symbolized as μ and pronounced "mu") in the different treatment populations
(μ1, μ2, μ3, and so on). The null hypothesis typically chosen gives the parameters the
same value in the different populations:

H0: μ1 = μ2 = μ3 = etc.

This is tantamount to saying that no treatment effects are present in the population.
2.1. THE LOGIC OF HYPOTHESIS TESTING 19
Most researchers realize that the null hypothesis is rarely true. Almost any treat-
ment manipulation is bound to produce some effect, even if it is small. You might
wonder, then, if the hypothesis is almost always false, why anyone would bother to
test it. What researchers really mean by the null hypothesis is a set of differences that
are sufficiently small to be considered functionally equivalent to zero, to be treated
as if they did not exist.2 However, we will continue to refer to H0 in the traditional
manner as specifying the absence of treatment differences, realizing that this condition
rarely, if ever, holds exactly.
When the actual means obtained from the treatment groups are too deviant from
those specified by the null hypothesis, it is rejected in favor of the other statistical
hypothesis, called the alternative hypothesis, H1. The alternative hypothesis spec-
ifies values for the parameter that are incompatible with the null hypothesis. Usually,
the alternative hypothesis states that the values of the parameter in the different
treatment populations are not all equal. Specifically,

H1: not all μ's are equal.

A decision to reject H0 implies an acceptance of H1, which, in essence, constitutes
support of an experimenter's original research hypothesis. On the other hand, if the
treatment means are reasonably close to those specified by the null hypothesis (the
band around zero mentioned above), H0 is retained and not rejected. This latter
decision can be thought of as a failure of the experiment to support the research
hypothesis. As you will see, a decision to retain the null hypothesis is not simple.
Depending on the true state of the world (the equality or inequality of the actual
treatment population means), one can make an error of inference with either decision,
rejecting H0 or retaining H0. We will have much more to say about these errors.
Experimental Error
The crux of the problem is the fact that one can always attribute some portion of
the differences we observe among the treatment means to the operation of chance
factors. Every nuisance variable in an experiment that one controls through random
assignment of subjects to the treatment conditions is a potential contributor to the
results. These influences are called experimental error, error variability, or error
variance; we will use these terms interchangeably.
Let's consider some of the components that contribute to the score that we record
from each subject.
• Permanent or stable abilities. This component remains a constant and is a
permanent part of the subject's performance during the study. This component
is usually the greatest contributor to a subject's score.
• The treatment effect. This component expresses the influence of each treatment
condition on the subjects. It is assumed to be identical for all subjects in any
given treatment condition. It is absent when H0 is true.
• Internal variability. This component refers to temporary changes in the subject,
such as variations in mood, motivation, attention, and so on.
2Serlin and Lapsley (1985) suggest that a better approach is to establish a band around zero that
represents the range of outcomes that are consistent with the absence of a treatment effect.
• External variability. This component refers to changes outside the subject that
vary from one subject to the next. We mentioned one of these sources in Chapter
1, namely, variations in the testing environment. Another external source is
measurement error, the fact that, however accurate our instruments, they are
not perfect. For example, an experimenter timing some behavior will be faster
or slower at recognizing that the behavior has started from one occurrence to the
next and so may not start the clock equally quickly. Even automated measuring
devices have some measurement error. Another type of external variability arises
from variations in the actual treatment conditions. An experimental apparatus
cannot be counted on to administer the same treatment to each subject, and
an experimenter cannot construct an identical testing environment (the reading
of instructions, the experimenter-subject interaction, etc.) for every subject.
Finally, there are measurement mistakes such as misreading a dial or incorrectly
transcribing observations recorded in the laboratory to summary worksheets.
The point is that, except for the treatment effect, any given subject's score is a composite
of all these components. But because subjects are randomly assigned to treatment
groups, these components exert an unsystematic effect on the treatment conditions.
Their influence is independent of the treatment effects.
Experimental Error and Treatment Effects
Suppose you were able to estimate the extent to which the differences you observe
among the group means are due to experimental error. You would then be in a
position to evaluate the hypothesis that the means of the treatment populations are
equal. Consider the scores of subjects in any one of the treatment conditions. In any
actual experiment, all the sources of uncontrolled variability we mentioned above will
contribute to a subject's score, producing differences in performance among subjects
who are given the same treatment. The variability of subjects treated alike, that is,
within the same treatment level, provides an estimate of experimental error. By the
same argument, the variability of subjects within each of the other treatment levels
also offers estimates of experimental error.
Suppose you have drawn random samples from a population of subjects, admin-
istered the different treatments, recorded each subject's performance, and calculated
the means of the treatment groups. Further assume for the moment that the null
hypothesis is true and the population means associated with the treatment conditions
are equal. Would you expect the sample means you observe to be equal? Certainly
not. From our discussion of the use of randomization to control unwanted factors,
it should be clear that the means will rarely be exactly equal. If the sample means
are not equal, the only reasonable explanation is the operation of experimental error.
The sources of unsystematic variability that contribute to the differences among
subjects within a given treatment condition will also operate to produce differences
among the sample means.
In summary, consider again what can happen by randomly assigning subjects to
the treatment conditions. If the assignment procedure is truly random, each subject
has an equal chance of being assigned to any one of the different treatments. But
this in no way guarantees that the averages of the subjects assigned to these groups
will be equal. The same applies to the other components of experimental error. There is
no reason to expect any randomized sources of error to balance out perfectly across
the treatment conditions. In short, even when the null hypothesis is true, differences
among the sample means will reflect the presence of experimental error.3
So far our discussion has considered only the case in which the null hypothesis is
true. Researchers almost always hope that it will be false. Suppose there are real
differences among the means of the treatment populations. We will refer to these
differences as treatment effects, a systematic source contributing to the differences
we observe among the treatment means of an experiment. Does this imply that these
latter differences reflect only the differences in the population treatment means? Un-
fortunately, the answer is "no." The mere fact that the null hypothesis is false does
not imply that experimental error has vanished. In short, differences among treatment
means may reflect both a systematic component (the treatment effects) and a random
or unsystematic component (experimental error).
Evaluation of the Null Hypothesis
You have seen that there are two ways in which experimental error is reflected in the
results of an experiment: as differences among subjects given the same treatment and
as differences among groups of subjects given different treatments. We can express
these differences as a ratio:

differences among different groups of subjects
differences among subjects in the same groups

When the null hypothesis is true, you can think of this ratio as contrasting two sets
of differences that each reflect the presence of experimental error:

experimental error
experimental error

If you were to repeat this experiment a large number of times on new samples of
subjects drawn from the same population, you would expect that the average value of
this ratio would be about 1.0.
Now consider the same ratio when the null hypothesis is false. Under these cir-
cumstances, there is an additional component in the numerator, one that reflects the
treatment effects. Symbolically, the ratio becomes

(treatment effects) + (experimental error)
experimental error

If you were to repeat the experiment a large number of times, you would expect to
find an average value of this ratio that is greater than 1.0.
You can see, then, that when H0 is true (that is, the means are equal), the average
value of this ratio, obtained from a large number of replications of the experiment, will
be approximately 1.0, but when H1 is true (that is, the means are not equal), the average
value will exceed 1.0. We could use the value of the ratio to decide whether H0 is true.
A problem remains, however, because in any single experiment, it is always
3 You may be asking: Why not assign subjects to the groups systematically so that there are no
accidental differences among them? The answer is that we sometimes can, to the advantage of our
inferences (see our discussion of blocking in Section 11.5). But there will always be something else
that makes the individuals differ. Ultimately, we still have to fall back on random assignment as our
primary method for creating equivalent treatment groups.
possible to obtain a value that is greater than 1.0 when H0 is true and one that is
equal to or less than 1.0 when H1 is true. Thus, merely checking to see whether the
ratio is greater than 1.0 could not tell you which statistical hypothesis is correct.
What researchers do about this problem is to make a decision concerning the
acceptability of the null hypothesis that is based on the chance probability associated
with the ratio actually found in the experiment. If the probability of obtaining a ratio
of this size or larger just from experimental error is reasonably low, they reject the
null hypothesis; if this probability is high, they do not reject it. In the next chapter,
we will say more about the actual decision rules you use to make this decision.
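The behavior of this ratio can be illustrated with a small Monte Carlo sketch. The code below is ours, not the text's: the normal populations, the standard deviation, the sample size, and the 3-point effect are assumptions chosen purely for demonstration, and the ratio is computed in the usual mean-square form of the ratio described above.

```python
import random

def ms_ratio(group_means, n=16, sd=3.0):
    """Draw n scores per group from normal populations and return the ratio of
    between-groups variability to pooled within-groups variability."""
    groups = [[random.gauss(mu, sd) for _ in range(n)] for mu in group_means]
    means = [sum(g) / n for g in groups]
    grand = sum(means) / len(means)
    a = len(means)
    # Between-groups mean square: n times the variance of the sample means.
    ms_between = n * sum((m - grand) ** 2 for m in means) / (a - 1)
    # Within-groups mean square: squared deviations pooled over all groups.
    ms_within = sum(sum((y - m) ** 2 for y in g)
                    for g, m in zip(groups, means)) / (a * (n - 1))
    return ms_between / ms_within

random.seed(1)
reps = 2000
null_avg = sum(ms_ratio([10, 10]) for _ in range(reps)) / reps  # H0 true
alt_avg = sum(ms_ratio([10, 13]) for _ in range(reps)) / reps   # H0 false
# null_avg hovers near 1.0; alt_avg is well above 1.0
```

Averaged over many replications, the ratio sits near 1.0 when the two population means are equal and climbs well above 1.0 when they differ, just as the argument above predicts.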
Table 2.2: Comprehension scores for three groups of 5 boys who received either
one of two drugs or a placebo.

            Factor A
Control   Drug A   Drug B
  a1        a2       a3
  16         4        2
  18         7       10
  10         8        9
  12        10       13
  19         1       11
2.2. THE COMPONENT DEVIATIONS 23
Figure 2.2: Pictorial representation of the component deviations.
by Ȳ with a numbered subscript to designate the group, most generally as Ȳj. For
these data, the three means are Ȳ1 = 15, Ȳ2 = 6, and Ȳ3 = 9. The grand mean of all
three conditions, obtained by taking the average of the three treatment means, is ȲT =
(15 + 6 + 9)/3 = 10. As explained above, we cannot conclude that the differences among
the group means represent the "real" effects of the different experimental treatments.
They could have resulted from experimental error, that is, uncontrolled sources of
variability that happened to favor one condition and not another. You saw that the
solution to this problem is to compare the differences among the group means against
the differences obtained from subjects within each of the individual groups.
For the moment, focus on the worst score in condition a2, which is Y5,2 = 1.
Consider the deviation of this score from the grand mean ȲT. This deviation, Y5,2 − ȲT,
is represented pictorially at the bottom of Figure 2.2 as the distance between the two
vertical lines drawn from the score Y5,2 = 1 on the left and the grand mean ȲT = 10 on
the right. Now consider the vertical line drawn through the group mean for condition
a2 at Ȳ2 = 6. From the figure, it is obvious that the deviation Y5,2 − ȲT is made
up of two components. One component is the deviation of the group mean from the
grand mean, that is, Ȳ2 − ȲT, the component on the right. The other component is
the deviation of the score from the mean of the group from which it was selected, that
is, Y5,2 − Ȳ2, the component deviation on the left. We can write this relationship as

(Y5,2 − ȲT) = (Ȳ2 − ȲT) + (Y5,2 − Ȳ2).
Traditionally, each of the three deviations has a name:
Y5,2 − ȲT is called the total deviation.
Ȳ2 − ȲT is called the between-groups deviation.4
Y5,2 − Ȳ2 is called the within-groups deviation.
This subdividing of the total deviation, known as partitioning, is illustrated with
actual numbers in Table 2.3. Substituting Y5,2 = 1, Ȳ2 = 6, and ȲT = 10 for the
symbols in the deviations above, we obtain

1 − 10 = (6 − 10) + (1 − 6),
−9 = (−4) + (−5).
Table 2.3 summarizes these calculations for each subject in the experiment.
4 The name between-groups deviation appears to be something of a misnomer, in that the difference
Ȳj − ȲT involves only one group's mean. It is possible to develop formulas based on those differences
(as we will do in Equation 4.1, p. 61), but here we use the more traditional form.
Table 2.3: Component deviation scores for the data in Table 2.2.

                        Deviations
                Total      =  Between    +  Within
Score   Value  (Yij − ȲT)  =  (Ȳj − ȲT)  +  (Yij − Ȳj)
Group a1
Y1,1     16      (6)       =    (5)      +    (1)
Y2,1     18      (8)       =    (5)      +    (3)
Y3,1     10      (0)       =    (5)      +   (−5)
Y4,1     12      (2)       =    (5)      +   (−3)
Y5,1     19      (9)       =    (5)      +    (4)
Group a2
Y1,2      4     (−6)       =   (−4)      +   (−2)
Y2,2      7     (−3)       =   (−4)      +    (1)
Y3,2      8     (−2)       =   (−4)      +    (2)
Y4,2     10      (0)       =   (−4)      +    (4)
Y5,2      1     (−9)       =   (−4)      +   (−5)
Group a3
Y1,3      2     (−8)       =   (−1)      +   (−7)
Y2,3     10      (0)       =   (−1)      +    (1)
Y3,3      9     (−1)       =   (−1)      +    (0)
Y4,3     13      (3)       =   (−1)      +    (4)
Y5,3     11      (1)       =   (−1)      +    (2)
Thus, the score for each subject can be expressed as a deviation from the grand
mean, and this deviation can be partitioned into two components, a between-groups
deviation and a within-groups deviation. These two component deviations are what
we have been after: one quantity that reflects both treatment effects in the popula-
tion and experimental error (the between-groups deviation), and another that reflects
only experimental error (the within-groups deviation). We will now see how these
deviations are turned into measures of variability.
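The partition holds for every score in the experiment, not just Y5,2. A short Python sketch (ours, not the text's) verifies it for all 15 scores in Table 2.2:

```python
# Scores from Table 2.2, keyed by group.
scores = {
    "a1": [16, 18, 10, 12, 19],
    "a2": [4, 7, 8, 10, 1],
    "a3": [2, 10, 9, 13, 11],
}
group_means = {g: sum(ys) / len(ys) for g, ys in scores.items()}  # 15, 6, 9
grand_mean = sum(group_means.values()) / len(group_means)         # 10

# For every score, the total deviation splits exactly into the
# between-groups and within-groups pieces.
for g, ys in scores.items():
    for y in ys:
        total = y - grand_mean
        between = group_means[g] - grand_mean
        within = y - group_means[g]
        assert total == between + within
```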
SST = Σ(Yij − ȲT)². (2.2)

As you undoubtedly know, the capital Greek letter sigma, Σ, means "the sum of ..."
Thus, Σ(Yij − ȲT)² is read "the sum of the squared deviations formed by subtracting
the grand mean of the Y scores from all the Y scores in the experiment." We can
calculate SST for the numerical example we have been considering by squaring each
of the 15 total deviations presented in Table 2.3:

SST = Σ(Yij − ȲT)² = 6² + 8² + ... + 3² + 1² = 390.
Between-Groups Sum of Squares
You saw in Figure 2.2 that one of the components of a subject's total deviation is the
deviation of the subject's group mean from the grand mean, Ȳj − ȲT. If we square
this component and sum the squares for all the subjects in the experiment, we will
obtain the between-groups sum of squares. We will refer to this quantity as SSA,
indicating that it is based on the deviations involving the means of factor A. From
the between-groups deviations listed in Table 2.3,

SSA = 5² + 5² + ... + (−1)² + (−1)² = 210.
A more compact way to express this calculation takes advantage of the fact that the
between-groups deviation is the same for all subjects in a given group. In this example,
the deviation is 5 for a1, −4 for a2, and −1 for a3. We can square these three between-
groups deviations, multiply the squared deviations by the number of subjects in each
group (n), and sum the three products:

SSA = n(Ȳ1 − ȲT)² + n(Ȳ2 − ȲT)² + n(Ȳ3 − ȲT)².

Using the sigma notation, this sum of squares is written:

SSA = Σ n(Ȳj − ȲT)².

When the samples all have the same size (we consider unequal sample sizes in Section
3.5), it is simpler to place the n outside (to the left) of the summation sign, that is,
to square and sum the deviations first and then to multiply by n:

SSA = n Σ(Ȳj − ȲT)². (2.3)

For our numerical example, this formula gives the same value we calculated above:

SSA = (5)[5² + (−4)² + (−1)²] = (5)(25 + 16 + 1) = (5)(42) = 210.
Within-Groups Sum of Squares
The final sum of squares is the within-groups sum of squares, denoted by SSS/A. The
subscript S/A is read "S within A," which says that we are dealing with the deviations
of subjects from their own group means (i.e., within their own groups). In Figure 2.2,
the basic deviation involved is Yij − Ȳj. The first step in finding SSS/A is to obtain a
sum of squares for each group using these within-group deviations. From Table 2.3,

SSS at a1 = 1² + 3² + (−5)² + (−3)² + 4² = 60,
SSS at a2 = (−2)² + 1² + 2² + 4² + (−5)² = 50,
SSS at a3 = (−7)² + 1² + 0² + 4² + 2² = 70.
In the analysis of variance, the different within-group variances are combined to obtain
a more stable estimate of experimental error. Our next step, then, is to add together
the separate within-group sums of squares, a process known as pooling:

SSS/A = SSS at a1 + SSS at a2 + SSS at a3 (2.4)
      = 60 + 50 + 70 = 180.

In the form of an equation, the pooled sum of squares is

SSS/A = Σ(Yij − Ȳj)², (2.5)

where the summation extends over both the subjects and the groups.
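All three sums of squares can be computed from the defining formulas in a few lines. The Python sketch below (ours, not the text's) uses the data of Table 2.2:

```python
# Scores from Table 2.2, keyed by group.
scores = {
    "a1": [16, 18, 10, 12, 19],
    "a2": [4, 7, 8, 10, 1],
    "a3": [2, 10, 9, 13, 11],
}
means = {g: sum(ys) / len(ys) for g, ys in scores.items()}
grand = sum(means.values()) / len(means)

# Defining formulas: each sum of squares is a sum of squared deviations.
ss_total = sum((y - grand) ** 2 for ys in scores.values() for y in ys)
ss_between = sum(len(ys) * (means[g] - grand) ** 2
                 for g, ys in scores.items())
ss_within = sum((y - means[g]) ** 2
                for g, ys in scores.items() for y in ys)
```

The results match the worked values, and the between-groups and within-groups pieces add up to the total, 390 = 210 + 180.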
2.4 Sums of Squares: Computational Formulas
Although the defining formulas for the three sums of squares preserve the logic by
which the component deviations are derived, they are not convenient for calculation
by hand. Instead, it is more practical to calculate sums of squares using formulas that
are equivalent algebraically, but much simpler to use.
Notation
Before presenting the different formulas, we should say a few words about our nota-
tional system. The basic job of any notational system is to express unambiguously
the arithmetic operations in the most complex of designs as well as in the simplest.
The system we use in this book is designed specifically to facilitate the calculation of
sums of squares in the analysis of variance.
The system has several advantages. Probably the most important is that formulas
written in it emphasize the logic of what is being calculated. A second advantage is
that it is relatively simple, and it avoids the confusion of subscripts, parentheses, and
multiple summation signs used in some other systems. Finally, it makes the common
aspects of the calculations in different designs apparent. Quantities that mean the
same thing are described the same way. As you will see throughout this book, it
works in conjunction with a general analysis scheme that applies to nearly all the
designs we will consider. You will come to appreciate this aspect of the system more
and more as we turn to the analysis of more complicated designs.
A notational system is essentially a code. The symbols constitute a shorthand for
specifying the operations to be performed on a set of data. In the analysis of the single-
factor design with independent groups, we need to designate three basic quantities:
the individual scores or observations (the raw data), the sum of these scores for each
treatment condition (the treatment sums or treatment subtotals), and the sum
of all the scores or observations (the grand sum or grand total).
The Individual Scores. Each subject provides us with a single numerical value
that reflects his or her performance on the response measure. As we stated earlier,
this basic score or observation is designated by the capital letter Y, with subscripts
to indicate the subject number (given first) and the group (given second). We will be
consistent in this use throughout this book; that is, the first subscript always refers
to subjects and the second always refers to the factor A treatments.5 When we write
Yij, the letter i stands for the subject and j for the group. Table 2.4 illustrates this
notation with numbers from Table 2.2. Here i takes values from 1 to n and j takes
values from 1 to a. In Tables 2.3 and 2.4, we have used commas to separate the two
numerical parts of the subscript, but from now on we will suppress the comma when
it can be done without ambiguity. For example, we will write Y23 instead of Y2,3 for
the second score in the third group, but Y12,3 for the twelfth score in that group.
The Treatment Sums. Our first step is to calculate the sums of the scores in
each of the treatment groups. These subtotals, or treatment sums, are designated
by a capital A, which indicates that they are the sums of the scores for the levels of
factor A, with a subscript to indicate the particular treatment condition. To designate
a treatment sum in general we use the subscript j, writing it as Aj. For the three
groups in Table 2.4, the treatment sums for groups a1, a2, and a3 are

A1 = 16 + 18 + 10 + 12 + 19 = 75,
A2 = 4 + 7 + 8 + 10 + 1 = 30,
A3 = 2 + 10 + 9 + 13 + 11 = 45.
5 We have reversed the order of the subscripts from that in previous editions of this book to obtain
this consistency and to make our notation conform to that adopted by others.
28 CHAPTER 2. SOURCES OF VARIABILITY AND SUMS OF SQUARES
We use the treatment sums to calculate the treatment means, Ȳj, by dividing
each treatment sum by the number of scores in that condition. The lowercase letter
nj denotes the number of subjects, or sample size, for group aj. As a formula:

Ȳj = Aj/nj. (2.6)

To refer to specific treatment means, we add a numerical subscript, for example,
writing Ȳ2 for the mean of the second group. From the data in Table 2.4,

Ȳ1 = A1/n1 = 75/5 = 15.0, Ȳ2 = A2/n2 = 30/5 = 6.0, and Ȳ3 = A3/n3 = 45/5 = 9.0.

In much of this book, the sizes of the samples are identical in all groups, in which case
we will write the sample size as simply n instead of nj.
The Grand Sum. The grand sum is the sum of all the scores in the experiment,
which we designate by T. Computationally, T may be calculated either by summing
the entire set of Y scores or by summing the treatment subtotals (A). Expressing
these operations in symbols,

T = ΣYij = ΣAj.

The first summation, ΣYij, is read "the sum of all the Y scores," and ΣAj is read
"the sum of all the A treatment sums." When we turn this sum into a mean (the
grand mean), the result is designated ȲT. It is calculated by dividing the sum of all
the scores by the total number of scores. With equal-sized groups, this number can
be calculated by multiplying the number of scores in each treatment group (n) by the
number of treatment groups, which we will designate a. In symbols, the total number
of scores is a × n, or an, and the grand mean is

ȲT = T/an. (2.7)

From the numbers in Table 2.4, where T = 75 + 30 + 45 = 150, a = 3 treatment
groups, and n = 5 subjects per group,

ȲT = 150/(3)(5) = 10.0.
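These notational quantities map directly onto simple computations. A Python sketch for the Table 2.2 data (the variable names are ours, chosen to mirror the notation):

```python
# Treatment sums A_j, grand sum T, and the derived means, for Table 2.2.
data = {
    "a1": [16, 18, 10, 12, 19],
    "a2": [4, 7, 8, 10, 1],
    "a3": [2, 10, 9, 13, 11],
}

A = {g: sum(ys) for g, ys in data.items()}  # A1 = 75, A2 = 30, A3 = 45
T = sum(A.values())                          # grand sum: 150
n = 5                                        # subjects per group
a = len(data)                                # number of treatment groups
treatment_means = {g: A[g] / n for g in A}   # 15.0, 6.0, 9.0
grand_mean = T / (a * n)                     # 150 / 15 = 10.0
```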
Bracket Terms
Each of the sums of squares is calculated by combining special quantities that we call
bracket terms (in the previous edition of this book they were called basic ratios).
Bracket terms are a common step in the computational formulas for sums of squares
in the analysis of variance. An important advantage of writing sums of squares with
bracket terms is that they preserve the logic of the defining formulas for the sums of
squares. We do not expect you to appreciate this particular advantage until we turn
to the analysis of more complex designs in later chapters of this book.
All bracket terms have the same form. For the present design, there are three
bracket terms, one involving the individual observations (Y), another involving group
information (either the treatment subtotals Aj or the treatment means Ȳj), and
the third involving either the grand total (T) or the overall mean (ȲT). For conve-
nience in presenting computational formulas, we designate each term by putting an
identifying letter between a pair of brackets, that is, [Y] for the bracket term based on
the Y scores, [A] for the bracket term based on either the treatment sums or treatment
means, and [T] for the bracket term based on the grand total or the grand mean. These
bracket terms may be read as "bracket Y," "bracket A," and "bracket T," respectively.
We will consider first the bracket terms calculated from sums, followed by the bracket
terms calculated from means.
Using Sums. Calculating a bracket term involves three simple steps:
1. Square all quantities in a given set (the Yij's, the Aj's, or T).
2. Sum these squared quantities, if more than one is present.
3. Divide this sum by the number of scores that went into each of its components.
Let's apply these rules to the three bracket terms. For Y, the first step squares each
individual score and the second step sums these squares:

    ΣY²_ij = 16² + 18² + ... + 13² + 11² = 1,890.

We will refer to this quantity as the raw sum of squares. Each score Y involves
only itself, so the third step simply divides this sum by 1 to give [Y] = 1,890. For the
bracket term involving A, the first step squares the group subtotals A from Table 2.4,
the second step sums them, and the third step divides by 5 because this is the number
of scores that contributes to each A sum. The result is

    [A] = (75² + 30² + 45²)/5 = (5,625 + 900 + 2,025)/5 = 8,550/5 = 1,710.

The third bracket term is based on the grand total T, which must be squared (step one)
and divided by the (3)(5) = 15 scores that contribute to it (step three):

    [T] = 150²/15 = 22,500/15 = 1,500.
We can combine the three steps to give a single formula for each bracket term:

    [Y] = ΣY²_ij,     (2.8)

    [A] = ΣA²_j/n,    (2.9)

    [T] = T²/an.      (2.10)
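The three bracket-term formulas can be checked in a few lines of code. The sketch below uses the quantities given in the text for Table 2.4 (the group sums 75, 30, and 45, with n = 5 scores per group); because the individual scores are not reproduced here, [Y] is taken directly from the raw sum of squares of 1,890 computed above rather than from the raw data.

```python
# Bracket terms for the Table 2.4 example (Eqs. 2.8-2.10).
A = [75, 30, 45]        # treatment sums A_j from Table 2.4
n = 5                   # scores per group
a = len(A)              # number of treatment groups
bracket_Y = 1890        # [Y]: raw sum of squares, given in the text

bracket_A = sum(Aj ** 2 for Aj in A) / n   # [A] = (sum of A_j^2) / n
T = sum(A)                                 # grand total T = 150
bracket_T = T ** 2 / (a * n)               # [T] = T^2 / an

print(bracket_Y, bracket_A, bracket_T)     # 1890 1710.0 1500.0
```

The division step is what makes the three terms comparable: each squared quantity is divided by the number of scores it summarizes.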
CHAPTER 2. SOURCES OF VARIABILITY AND SUMS OF SQUARES
Sums of Squares
All that remains is to use the three bracket terms to find the three sums of squares,
SS_T, SS_A, and SS_S/A. They are combined by addition or subtraction, following the
same form as the three deviations in the defining formulas of Section 2.3. Specifically, the
total sum of squares is based on the deviation Y_ij − Ȳ_T, and the computational formula
for SS_T combines the bracket terms [Y] and [T] in the same way:

    SS_T = [Y] − [T].    (2.13)

The treatment sum of squares, SS_A, is based on the deviation of the treatment means
from the grand mean, Ȳ_j − Ȳ_T, and again the computational formula has the same
form:

    SS_A = [A] − [T].    (2.14)

Finally, the within-groups sum of squares is based on the deviation of individual
observations from the relevant treatment mean, Y_ij − Ȳ_j, leading to the computational
formula

    SS_S/A = [Y] − [A].    (2.15)
Substituting the values of the bracket terms that we just found, we obtain numerical
values for the three sums of squares:

    SS_T = [Y] − [T] = 1,890.0 − 1,500.0 = 390.0,
    SS_A = [A] − [T] = 1,710.0 − 1,500.0 = 210.0,
    SS_S/A = [Y] − [A] = 1,890.0 − 1,710.0 = 180.0.

As a computational check and as a demonstration of the relationship among these
three sums of squares, we should verify that Equation 2.1 holds for these calculations:

    SS_T = SS_A + SS_S/A = 210.0 + 180.0 = 390.0.
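The whole computation, including the additivity check of Equation 2.1, can be sketched from the bracket-term values found above:

```python
# Sums of squares from the bracket terms (Eqs. 2.13-2.15), using the
# values already computed for Table 2.4.
bracket_Y, bracket_A, bracket_T = 1890.0, 1710.0, 1500.0

SS_T = bracket_Y - bracket_T    # total:         [Y] - [T]
SS_A = bracket_A - bracket_T    # treatment:     [A] - [T]
SS_SA = bracket_Y - bracket_A   # within groups: [Y] - [A]

# Computational check (Eq. 2.1): the component sums of squares add
# up to the total sum of squares.
assert SS_T == SS_A + SS_SA
print(SS_T, SS_A, SS_SA)        # 390.0 210.0 180.0
```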
Comment
In our calculations, we will focus almost entirely on the computational versions of the
formulas because they are easy to use and can be generated by a simple set of rules.
However, you should remember that they are equivalent to the defining versions,
which show more closely the logic behind the sums of squares. You can verify for
yourself here that the computational formulas produce exactly the same answers as
those obtained with the defining formulas. The only time different answers are found
is when rounding errors introduced in calculating the treatment means and the grand
mean become magnified in squaring the deviations or when too few places are carried
in calculating the intermediate bracket termsa common error when learning to use
the formulas. It is not possible to state in advance how many places you need to
carry your calculations because that depends on the sizes of both the bracket terms
and their differences. If you find that your results have lost accuracy, then you need
to go back to the data and carry more figures in your calculations. For example, if
you calculated SSA = 1,738. 1,735. = 3. and the data were not really integers, you
have been left with only one figure of accuracy. You might find that more accurate
calculations gave SSA = 1,738.31 1,734.66 = 3.65, which is an appreciably different
result. We will follow these recommendations in our examples.
Exercises
2.1. In an experiment with a = 5 treatment conditions, the following scores were
obtained.
13 9 7 4 12 11 10 12 13 6
8 7 4 1 4 9 9 7 14 12
8 6 10 7 5 10 15 14 13 10
6 7 5 9 2 8 10 17 8 4
6 10 5 8 3 6 14 12 9 11
a. Calculate the sum of the scores and the sum of the squared scores for each of the
treatment groups.
b. Calculate the treatment means.
c. Calculate the sums of squares for each of the sources of variability normally
identified in the analysis of variance, that is, SS_A, SS_S/A, and SS_T. Reserve this
information for Problem 3.6 in Chapter 3 and Problem 11.4 in Chapter 11.
2.2. Consider the scores from the memory experiment we presented in Table 2.1 at the
beginning of this chapter. Using the total number of facts recalled by the subjects,
calculate SS_A, SS_S/A, and SS_T. Calculate the same three quantities using the
treatment means and the grand mean instead.
3
Variance Estimates
and the F Ratio
We are ready to complete the analysis of variance. The remaining calculations are
simple and easy to follow, although their theoretical justification is considerably more
complicated. Consequently, we will present the calculations quickly and devote the
major portion of this chapter to describing the meaning of the null-hypothesis tests
and their consequences. We will conclude by describing what to do when the treatment
groups do not have equal sizes.
Degrees of Freedom
The degrees of freedom associated with a sum of squares correspond to the number of
independent pieces of information that enter into the calculation of the sum of squares.
Consider, for example, the use of a single sample mean to estimate the population
mean. If we want to estimate the population variance as well, we must take account
of the fact that we have already used up some of the independent information in
estimating the population mean.
Let's examine a concrete example. Suppose we have five observations in our
experiment and that we determine that the mean of the scores is 7.0. This sample mean
is our estimate of the population mean. With five observations and the population
mean estimated to be 7.0, how much independent information remains to estimate
the population variance? The answer is the number of observations that are free to
vary, that is, to take on any value whatsoever. In this example, there are four, one
less than the total number of observations. The reason for this loss of "freedom" is
that although we are free to select any value for the first four scores, the final score
is then determined by our choice of mean. More specifically, the total sum of all five
scores must equal 35, so that the sample mean is 7.0. As soon as four scores are
selected, the fifth score is determined and can be obtained by subtraction. Estimating
the population mean places a constraint on the values that the scores are free to take.
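The loss of one degree of freedom is easy to see numerically: once the mean of five scores is fixed at 7.0, any four freely chosen scores determine the fifth. The particular four scores below are illustrative values, not taken from the text.

```python
# With the mean fixed at 7.0, the total of five scores must be 35;
# four scores can take any values, but the fifth is then forced.
mean, n = 7.0, 5
free_scores = [9, 5, 8, 6]             # chosen freely (hypothetical values)
fifth = mean * n - sum(free_scores)    # determined by subtraction
print(fifth)                           # 7.0
```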
A general rule for computing the degrees of freedom of any sum of squares is:

    df = (number of independent observations)
         − (number of population estimates).    (3.2)
The degrees of freedom associated with each sum of squares in the analysis of
variance are presented in the fourth column of Table 3.1. We can calculate them by
applying Equation 3.2. For factor A, there are a basic observations, the a different
sample means. Because one degree of freedom is lost by using the grand mean Ȳ_T to
estimate the overall population mean μ_T,

    df_A = a − 1.

The degrees of freedom for SS_S/A may be found by noting that the within-group sum
of squares for any one group (SS_S at a_j) is based on n observations and that one degree
of freedom is lost by using the group mean Ȳ_j to estimate the population mean μ_j.
That is, df_S at a_j = n − 1. When we combine the within-group sums of squares for the
a treatment groups, we combine this value a times:

    df_S/A = a(n − 1).

Finally, for SS_T, there are an observations, but since one degree of freedom is lost by
estimating the overall population mean,

    df_T = an − 1.

As a check, we can verify that the degrees of freedom associated with the component
sums of squares add up to df_T:

    df_T = df_A + df_S/A.

Substituting the values on the left and right sides, we find that

    an − 1 = (a − 1) + a(n − 1) = a − 1 + an − a = an − 1.
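The algebraic identity above can also be confirmed numerically for a range of designs:

```python
# Numerical check that the degrees of freedom add up for any a and n:
# (a - 1) + a(n - 1) = an - 1.
for a in range(2, 8):           # number of treatment groups
    for n in range(2, 12):      # subjects per group
        df_A = a - 1
        df_SA = a * (n - 1)
        df_T = a * n - 1
        assert df_A + df_SA == df_T
print("df identity holds for all a, n tested")
```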
Mean Squares
As we have said, the actual variance estimates are called mean squares (abbreviated
MS). They appear in the next column of Table 3.1. The mean squares for the two
component sources of variance are given by Equation 3.1:
    MS_A = SS_A/df_A   and   MS_S/A = SS_S/A/df_S/A.    (3.3)
The first mean square estimates the combined presence of treatment effects plus error
variance; the second independently estimates error variance alone.
As an example, consider the experiment we presented in Chapter 2 in which a = 3
groups, each containing n = 5 hyperactive boys, were administered either one of two
drugs or a placebo and later given a test of reading comprehension. Using the data
from Table 2.4 (p. 28), we found SS_A = 210.00 and SS_S/A = 180.00. Substituting in
Equations 3.3, we obtain

    MS_A = SS_A/(a − 1) = 210.00/(3 − 1) = 105.00,
    MS_S/A = SS_S/A/[a(n − 1)] = 180.00/[(3)(5 − 1)] = 180.00/12 = 15.00.
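These mean squares follow directly from the sums of squares and degrees of freedom of the example:

```python
# Mean squares for the three-group example (Eq. 3.3).
a, n = 3, 5
SS_A, SS_SA = 210.00, 180.00

MS_A = SS_A / (a - 1)            # 210 / 2
MS_SA = SS_SA / (a * (n - 1))    # 180 / 12
print(MS_A, MS_SA)               # 105.0 15.0
```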
Confidence Intervals for Treatment Means
Before we use these two mean squares to complete the analysis of variance, we will
consider how to establish a plausible range for each population treatment mean, known
as a confidence interval. These ranges are estimates and, as such, are not guaranteed
to contain the population treatment mean. But the procedure we will use lets us
specify the percentage of times we will be successful in bracketing the
population treatment means:

    confidence = (100)(1 − α) percent.

By choosing a value of α between 0 and 1.0, we can control our long-term success rate
in establishing confidence intervals. To illustrate, if we were to select a relatively low
value for α such as α = .05, then we would obtain a

    (100)(1 − .05) = (100)(.95) = 95 percent confidence interval.

This means that 95 percent of the time the range will include the population
treatment mean μ_j, and 5 percent of the time it will not.¹ You will sometimes find
confidence intervals (usually 95%) reported in tables along with the treatment means,
and they appear in the output of many of the computer packages. We will discuss the
interpretation of confidence intervals further in the next chapter (see p. 75).
The confidence interval for a population mean is given by the formula

    Ȳ_j − t s_Mj ≤ μ_j ≤ Ȳ_j + t s_Mj.    (3.4)
There are two ways that we can estimate the standard error of the mean. The
simplest and safest method is to base the estimate on the data from condition a_j only.
We obtain this estimate by dividing the standard deviation of that group (s_j) by the
square root of the sample size:

    individual s_Mj = s_j/√n_j.    (3.5)

When the conditions all have the same variability, an assumption we will consider in
Chapter 7, a somewhat better estimate is obtained by using information from all the
groups. In this case, we estimate the standard error of the mean by

    pooled s_M = √(MS_S/A/n).    (3.6)

The numerator of this ratio is the within-group mean square MS_S/A from the overall
analysis, which is actually an average of the individual variances for the treatment
groups. Where the assumption of equal variance seems satisfied or the sample sizes
are so small that the individual variances are very poorly estimated, we can use s_M
instead of s_Mj in Equation 3.4.
As an example, we will calculate two confidence intervals for the group receiving
drug B (a3) for the current numerical example, the first using an estimated standard
error based only on the data for a3 and the second using one based on the pooled
within-groups variance. Equation 3.4 indicates that we need three quantities to find a
confidence interval: the group mean Ȳ_j, the value of t, and the estimate of the standard
error of the mean. First we will illustrate the calculation using the data from group a3
alone. From Table 2.4, we find that the mean for the drug B group is Ȳ_3 = 9.0 and the
sample size is n_3 = 5. The value of t depends on the level α and the degrees of freedom
associated with the standard deviation, which here is df = n_3 − 1 = 5 − 1 = 4. Using
α = .05, we obtain the value t = 2.78 from the table of the t distribution in Appendix
A.2 at the intersection of the row where df = 4 and the column where α = .05. The
last term is the estimated standard error of the mean, s_M3, which we calculate from the
formula for the individual standard error (Equation 3.5). First we find the standard
deviation for the drug B group (s_3), using the deviations in Table 2.3 (p. 24):
deviation) differ from one treatment population to another. However, if we are willing
to assume that these variances are equal, we can obtain a better interval, more
accurate and usually narrower, by using the pooled within-groups variance MS_S/A
and Equation 3.6. We already have the group mean, Ȳ_3 = 9.0. The value of t is
smaller than the one we obtained above because MS_S/A has more degrees of freedom
than s_j. For this pooled estimate, df_S/A = a(n − 1) = (3)(5 − 1) = 12, and so, from
Appendix A.2, we find t = 2.18. Finally, we substitute in Equation 3.6 to get the
pooled standard error of the mean:
    s_M = √(MS_S/A/n) = √(15.00/5) = √3.00 = 1.73.
Figure 3.1: The sampling distribution obtained from 1,000 F ratios, each calculated
from three groups of 5 scores randomly drawn from the same population. Four
scores with values greater than 8 lie to the right of the plot. The smooth curve is
the theoretical F distribution with 2 and 12 degrees of freedom. The percentage
points refer to this distribution.
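The pooled-variance interval described above can be sketched in a few lines; the mean, MS_S/A, and the tabled value t = 2.18 for 12 degrees of freedom are the quantities quoted in the text.

```python
import math

# 95% confidence interval for the drug B mean using the pooled
# within-groups variance (Eqs. 3.4 and 3.6).
mean_3 = 9.0
MS_SA, n = 15.00, 5
t_crit = 2.18                     # t(.05) with df_S/A = 12, Appendix A.2

s_M = math.sqrt(MS_SA / n)        # pooled standard error, about 1.73
lower = mean_3 - t_crit * s_M
upper = mean_3 + t_crit * s_M
print(round(lower, 2), round(upper, 2))   # 5.22 12.78
```

Note how the pooled interval benefits from the larger degrees of freedom: the critical t of 2.18 is smaller than the 2.78 required when only the five scores of group a3 are used.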
3.2. EVALUATING THE F RATIO
(Figure 3.2 appears here: a theoretical F distribution plotted against the value of F, with a percentage point marked at F = 3.48.)
A Monte Carlo simulation like this helps us understand sampling effects and aids in
evaluating testing procedures (we will use them again in Chapter 7). However, we do
not need to rely on it for our basic test. The exact distribution of the F statistic when
the group means are equal can be derived from statistical theory. This distribution
is shown by the smooth curve in Figure 3.1. As you can see, it closely matches the
empirical distribution that we found in our Monte Carlo simulation. In effect, the
theoretical F distribution is the sampling distribution of F that we would obtain from
an infinitely large number of simulated experiments of this sort. Knowing the exact
shape of the F distribution lets us determine the value of F that will be exceeded by a
given percentage of the experiments when the null hypothesis is true. Several of these
values are marked in Figure 3.1. Consider the value F = 2.81. Theory tells us that
90 percent of the sampling distribution is less than this value and that 10 percent is
greater. If there were no differences between the means, we would expect to obtain
a value of F equal to or greater than 2.81 no more than 10 percent of the time. We
have located several other values in Figure 3.1.
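A small version of such a Monte Carlo simulation is easy to run: draw a = 3 groups of n = 5 scores from a single normal population (so the null hypothesis is true), compute F for each simulated experiment, and see how often F exceeds the 5 percent point of F(2, 12). The population mean of 10 and standard deviation of 3 are arbitrary choices for the sketch, not values from the text.

```python
import random
import statistics

# Monte Carlo sampling distribution of F under a true null hypothesis:
# a = 3 groups of n = 5 scores from the same normal population.
random.seed(1)
a, n, reps = 3, 5, 1000

def one_F():
    groups = [[random.gauss(10, 3) for _ in range(n)] for _ in range(a)]
    means = [statistics.mean(g) for g in groups]
    grand = statistics.mean(means)
    MS_A = n * sum((m - grand) ** 2 for m in means) / (a - 1)
    MS_SA = sum((y - m) ** 2
                for g, m in zip(groups, means) for y in g) / (a * (n - 1))
    return MS_A / MS_SA

Fs = [one_F() for _ in range(reps)]
prop = sum(F >= 3.89 for F in Fs) / reps
print(prop)   # close to .05, the area beyond F.05(2, 12) = 3.89
```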
The F Table
There is not just one F distribution but a family of them. The exact shape of any
one of the curves is determined by the number of degrees of freedom associated with
the numerator and denominator mean squares in the F ratio. An F distribution with
quite a different shape appears in Figure 3.2. It is the theoretical sampling distribution
of F (under the null hypothesis) for a = 5 groups of n = 3 subjects each, giving
df_num = 4 and df_denom = 10. As a shorthand way of referring to a particular F
distribution, we will use the expression F(df_num, df_denom), or here, F(4, 10).
To draw conclusions from an F test, we do not have to know the exact shape
of the F distribution. All we need are the values of F to the right of which certain
proportions of the area under the curve fall. These are the percentage points in Figures
3.1 and 3.2. These values are tabulated in the F table in Appendix A.1. A particular
value in this table is specified by three quantities: (1) the numerator degrees
of freedom, (2) the denominator degrees of freedom, and (3) the significance level α.
Figure 3.3: Sampling distributions of the F ratio when the null hypothesis is true
(upper panel) and when the alternative hypothesis is true (lower panel). The
distributions have 2 and 12 degrees of freedom. The vertical line is the value of
F.05(2, 12) = 3.89, and the shaded areas represent errors.
It would be nice to be able to draw the F' distribution and to compare it with a
corresponding F distribution. Unfortunately, however, this is difficult to do, because
the shape of this distribution depends on the magnitude of the treatment effects as
well as the numerator and denominator degrees of freedom. Although there is only
one central F distribution for any combination of numerator and denominator degrees
of freedom, there is a family of noncentral F' distributions, one distribution for each
value that the treatment effects may take. All we know, in general, is that the F'
distribution is shifted to the right of the F distribution.
We are now in a position to state more formally the decision rule that is followed
after the calculation of the F ratio. The significance level α defines a critical value
F_α(df_num, df_denom) that cuts off this proportion at the top of the F distribution. Let's
say we have calculated the value F_observed from the data. If F_observed falls within the
region of incompatibility determined by F_α, then the null hypothesis is rejected and the
alternative hypothesis is accepted. If F_observed falls within the region of compatibility,
then the null hypothesis is not rejected. Symbolically, this rule is
    If F_observed ≥ F_α(df_num, df_denom), then reject H0;
    if F_observed < F_α(df_num, df_denom), then retain H0.
There is often confusion about the exact wording of the decision rule in null-hypothesis
testing. The problem is that the null hypothesis is a very precise statement
(the μ_j's are equal), but the alternative hypothesis is a vague one (the μ_j's are not
equal), so we cannot talk about them the same way. When we fail to reject the null
hypothesis, we mean that the null hypothesis is consistent with our data. We use the
word "retained," rather than "accepted" or "proved," because of the near certainty
that the treatment has had at least some effect, even if it is so small as to be essentially
undetectable by our study. For the same reason, we do not say that we have "rejected"
the alternative hypothesis, only that our data do not support it. In contrast, when
there are large differences between the means and the F is significant, we believe
that some form of the many outcomes subsumed under the alternative hypothesis is
almost certainly true. This is why we are much happier in saying that we "reject" the
null hypothesis and "accept" the alternative hypothesis. We will consider the issue of
proving statistical hypotheses again in Section 8.5.
We will illustrate the decision rule with the F we calculated earlier in this chapter
(see Table 3.2). In this example, df_num = 2 and df_denom = 12. If we set α = .05, the
critical value of F, which we find in the tabled values of the F distribution (Table A.1
of Appendix A), is 3.89. The rejection region consists of all values of F equal to or
greater than 3.89. The decision rule is

    If F_observed ≥ 3.89, then reject H0;
    if F_observed < 3.89, then retain H0.

Since the F obtained in this example exceeded this value (F = 7.00), we would
conclude that treatment effects were present in this experiment. Although not of
particular interest, we also calculated an F for the null hypothesis μ_T = 0. For this
test, the critical value is F.05(1, 12) = 4.75. Because the observed value of F = 100.00
exceeded this value, we would conclude that the overall mean was greater than zero.
This statement indicates that the observed F falls in the top 1 percent of the theoretical
distribution and that the null hypothesis would be rejected by any researcher who had
adopted a significance level equal to 1 percent or larger. The main problem with this
practice is that it tends to reinforce the false belief that a finding significant at p < .01
is somehow "better," "stronger," or "larger" than a finding significant at p < .05.
As you will see in Chapter 8, comparisons of strength are more appropriately made
by obtaining some measure of effect magnitude rather than by comparing significance
levels.
Many statistical computer programs eliminate the need for an F table by reporting
the probability of a result greater than F_observed under the null hypothesis. These
programs calculate the exact proportion of the sampling distribution of the F statistic
falling at or above F_observed. For example, the F we calculated in Table 3.2, F = 7.00,
has an exact probability of p = .009666. Armed with this information, all one needs
to do is to compare this value to whatever level of significance one is using, and reject
H0 when the exact probability is smaller than that level. We can state the decision
rule quite simply, without specific reference to F; for example, if α = .05, the rule is

    If p ≤ .05, then reject H0;
    if p > .05, then retain H0.
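The exact probability quoted above can be reproduced without a table. For the special case df_num = 2, a standard identity (not a formula from this chapter) gives the area above any value x of the F distribution in closed form: (1 + 2x/df_denom)^(−df_denom/2).

```python
# Exact p value for F = 7.00 with 2 and 12 degrees of freedom, using
# the closed-form survival function available when df_num = 2.
def p_value_dfnum2(F, df_denom):
    return (1 + 2 * F / df_denom) ** (-df_denom / 2)

p = p_value_dfnum2(7.00, 12)
print(round(p, 6))        # 0.009666

# The decision rule: reject H0 when p is at or below the chosen level.
alpha = .05
print("reject H0" if p <= alpha else "retain H0")   # reject H0
```

As a check, the same formula evaluated at the tabled critical value 3.89 gives a probability of almost exactly .05.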
We want to emphasize that however you report a statistical test, you should choose
your significance level before the start of an experiment. Alternative ways of reporting
probabilities do not change this fundamental point. Some researchers make a practice
of announcing their significance level before they present the results of an experiment
in a research article, allowing them to report the outcome of statistical tests simply
as "significant" or "not significant." Doing so avoids the potential confusion created
when different significance levels appear among a string of statistical tests.
Another mistake that is all too often made is to interpret the probability p as the
probability that an effect is present or that the results will be replicated by another
study.⁴ All that the exact probability provides is a quantitative measure of evidence
against the null hypothesis. It should not be used to imply either the strength of a
treatment effect or the probability of a successful replication, still less the probability
that the effect is "really there."
Comment
Dissatisfaction with hypothesis testing has been expressed ever since the concept was
introduced early in the last century. There is a substantial (and tendentious)
contemporary literature on the topic. Some individuals support the procedure; others
take such extreme stands as recommending that the procedure be "banned" from the
journals altogether.⁵ There are reasons to be concerned. The paradoxes of hypothesis
⁴Such statements are the domain of Bayesian statistics, an area that has recently received
much attention in the statistical literature (e.g., see Carlin & Louis, 2000). However, the Bayesian
techniques are not yet developed to the point where they are practical alternatives to the techniques
we describe in this book. In any case, most of the design issues that we discuss do not depend on the
type of inference used.
⁵An excellent summary of the discussion of the vices and virtues of hypothesis testing is given
by Nickerson (2000). Because Nickerson includes a comprehensive bibliography, we will not cite
individual articles here.
testing and its possibilities for abuse are manifold: Why do we test the null hypothesis
when we know that it is rarely true and that some differences, even if small, probably
result from any treatment manipulation? Why do researchers evaluate the null
hypothesis using an unvarying significance level of 5 percent? Certainly some results are
interesting and important that miss this arbitrary criterion. Should not a researcher
reserve judgment on a result for which p = .06, rather than turning his or her back on
it? Another problem is the ease with which null-hypothesis testing can be (and often
is) misinterpreted. Many researchers treat the exact p value associated with F_observed
as a measure of the size of the observed treatment effect or the probability that an
effect is present, although it is neither of these.
We believe that these concerns and misunderstandings should not obscure the
useful aspects of hypothesis testing. Any procedure that takes limited data about a
variable phenomenon and arrives at an assertion about scientific truth is based on
many assumptions, all of which are open to question and to misuse. This problem
is magnified when that assertion is a dichotomous, yes-no decision. What is essential
is to understand what the test of a null hypothesis does and does not provide.
Fundamentally, it serves the important function of allowing a researcher to decide
whether the outcome of an experiment could have reasonably happened by chance.
Thus, it provides a filter to prevent us from wasting our time talking about things
that may well be unstable or irreproducible. It helps us to get others to pay attention
to our findings. However, a hypothesis test does not complete the task of inference.
In particular, it does not tell us what we have found. Suppose the null hypothesis has
been rejected. It is still necessary to interpret the outcome and to reconcile it with the
body of findings that led us to do the experiment. The rejected null hypothesis does
not prove the results agree with the original theory. On the other hand, suppose it had
been retained. It is still necessary to think about the outcome. One of the persistent
errors among beginning users of hypothesis tests is to cease thinking about the data
at this point, in effect, to accept the null hypothesis as true. All the treatment
means from a study convey information, and even though we may be appropriately
cautious when reporting an effect, the pattern of values we found should influence
our understanding. Perhaps the effect we were looking for was present but smaller
than expected; perhaps our study was less sensitive than we intended. Or perhaps our
initial research hypothesis was completely wrong. The lack of significance itself does
not sort out these possibilities.
To close our discussion of nullhypothesis testing, we offer five suggestions that we
feel will help you to eliminate its most common abuses.
1. Never lose sight of your data. Plot them, with measures of variability, and study
the plots. Use confidence intervals (p. 34) to establish reasonable ranges for
your effects. Keep asking yourself what your numbers mean.
2. Remember that not all null hypotheses represent interesting or plausible
outcomes of the experiment (the hypothesis that the grand mean is zero is often an
example) and that their rejection is not inevitably informative.
3. Remember that the null-hypothesis test only gives information about how likely
the chance operations are to have produced your effect. Do not confuse the
significance level with the size of an effect or the probability that another study
can replicate it (we discuss proper procedures in Chapter 8).
4. When you find a significant effect, ask what it means. Use the effect-size
measures (Chapter 8) to assess its magnitude and the tests of contrasts and subsidiary
hypotheses (Chapters 4 and 5) to interpret it.
5. Always interpret your results in the context of other studies in the field.
Remember that theoretical understanding is an incremental process in which no
study stands alone, but each extends and refines the others.
when the null hypothesis is false, that is, when the population means have values that
are not equal and so treatment effects are present.⁶ The distribution we've drawn is a
noncentral F distribution, which we mentioned on page 40. It is more spread out and
more symmetric than the F distribution. The probability β is determined from this
distribution. The critical value of F remains at the value F_α determined by the level
of α, but the different placement and shape of the alternative-hypothesis distribution
changes the size of the areas on either side of the critical value. In this distribution,
the shaded area to the left of F_α represents the probability β of making a Type II
error, and the unshaded area to the right of F_α represents the probability of correctly
rejecting the null hypothesis.
The probability of correctly rejecting a false null hypothesis is so important that
it is given a special name. The power of a test is the probability of rejecting the null
hypothesis when a specific alternative hypothesis is true. It is the probability that
a test of the null hypothesis will lead to the conclusion that the phenomenon under
study exists. Power is the complement of the probability of making a Type II error:

    power = 1 − β.    (3.8)

Thus, the less likely is a Type II error, the greater the power and the greater the
sensitivity of the test. It is important to give a study as much power as possible, and
these important matters are the topic of Chapter 8.
How can we control the two types of errors? First, we must realize that complete
control is not possible. We cannot avoid the risk of committing some sort of error,
as long as we must make our decisions in the face of incomplete knowledge. We can,
however, try to minimize the chance of an error. It is true that we directly control the
Type I error probability in our selection of the significance level, and that by setting the
rejection region we are taking a calculated risk that a certain proportion of the time
(for example, α = .05) we will obtain F's that fall into the rejection region when the
null hypothesis is true. But however cautious we are, the risk of Type I errors
will not disappear no matter how small we make the rejection region. And we must
understand the relationship between these errors and the Type II errors.
What about Type II errors? One way to control them is to increase the size of the
rejection region. In Figure 3.3, suppose we increased α to .10. Doing so moves the
vertical line at F_α to the left and increases the shaded area in the top distribution.
6 The exact shape of this distribution is determined by a number of factors, including the specific
pattern of differences among the f.li 's assumed in the population. Remember that the alternative hy
pothesis H1 is general and includes many other possibilities besides the specific alternative hypothesis
on which this noncentral distribution is based.
48 CHAPTER 3. VARIANCE ESTIMATES ANO THE F RATIO
This change has the opposite effect on the shaded area in the bottom distribution. It
shrinks as the unshaded area expands. The chance of a Type II error would thus be
reduced and power would be increased. We have a trade-off: Increasing the chance
of a Type I error decreases the chance of a Type II error, and decreasing the chance
of a Type I error increases the chance of a Type II error.⁷ Every researcher must
strike a balance between the two types of error. When it is important to discover new
facts, we may be willing to accept more Type I errors and thus enlarge the rejection
region by increasing α. But when we want to avoid Type I errors, for example, not
to get started on false leads or to clog up the literature with false findings, we may
be willing to accept more Type II errors and decrease the rejection region. As we
will discuss in Chapter 6, striking the right balance is particularly important when we
plan to run several statistical tests.
The best way to control Type II error is not through a trade-off with Type I error, but through the proper design of the study. Many aspects of the design affect the power, and throughout this book we will consider how powerful experiments are designed. For the time being, we will just mention two useful ways to decrease Type II error (or increase power). One is to add to the number of observations in each treatment condition, and the other is to reduce error variance with a more precisely controlled experiment. Both approaches increase the accuracy with which we measure the means, and hence the degree to which we can trust differences in their sample values.
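The two routes to power mentioned above can be illustrated with a small simulation. The sketch below (Python, assuming NumPy and SciPy are available; the effect size, error standard deviation, and α = .05 are arbitrary illustrative choices, not values from the text) estimates the power of a two-group F test at two sample sizes:

```python
import numpy as np
from scipy import stats

def estimated_power(n, effect=1.0, sigma=1.0, alpha=.05, reps=2000, seed=0):
    """Estimate the power of a two-group F test by repeated simulation."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        g1 = rng.normal(0.0, sigma, n)       # control condition
        g2 = rng.normal(effect, sigma, n)    # treatment shifted by `effect`
        _, p = stats.f_oneway(g1, g2)        # one-way ANOVA on the two groups
        rejections += p < alpha
    return rejections / reps

power_small = estimated_power(n=5)   # few observations per condition
power_large = estimated_power(n=20)  # same effect, larger samples
```

Raising n (or shrinking the error standard deviation) raises the estimated power, that is, lowers the Type II error rate, without touching α.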
Student                                    Student
No.                      F0                No.                      F0
1 9.40 8.60 0.28 2.13 24 7.87 10.13 2.43 13.13
2 9.67 8.33 0.80 1.24 25 8.07 9.93 1.61 10.91
3 9.07 8.93 0.01 3.58 26 9.33 8.67 0.19 2.39
4 8.27 9.73 0.97 8.99 27 8.13 9.87 1.37 10.24
5 9.20 8.80 0.07 2.95 28 8.73 9.27 0.12 5.46
6 9.47 8.53 0.38 1.89 29 8.33 9.67 0.80 8.41
7 9.47 8.53 0.38 1.89 30 8.67 9.33 0.19 5.90
8 9.53 8.47 0.50 1.66 31 9.20 8.80 0.07 2.95
9 9.60 8.40 0.64 1.44 32 9.80 8.20 1.16 0.89
10 9.33 8.67 0.19 2.39 33 8.47 9.53 0.50 7.34
11 9.00 9.00 0.00 3.92 34 9.47 8.53 0.38 1.89
12 10.53 7.47 4.80 0.00 35 7.40 10.60 5.31 19.92
13 10.00 8.00 1.86 10.46 36 9.80 8.20 1.16 0.89
14 10.40 7.60 3.89 0.02 37 8.60 9.40 0.28 6.35
15 9.80 8.20 1.16 0.89 38 10.13 7.87 2.43 0.25
16 9.40 8.60 0.28 2.13 39 7.60 10.40 3.89 16.69
17 9.13 8.87 0.03 3.26 40 8.20 9.80 1.16 9.60
18 9.20 8.80 0.07 2.95 41 10.07 7.93 2.13 0.35
19 8.73 9.27 0.12 5.46 42 8.47 9.53 0.50 7.34
20 10.07 7.93 2.13 0.35 43 11.07 6.93 10.14 0.76
21 8.87 9.13 0.03 4.65 44 10.27 7.73 3.11 0.11
22 8.33 9.67 0.80 8.41 45 10.20 7.80 2.76 0.17
23 9.73 8.27 0.97 1.06 Mean 9.20 8.80
They are scored on the number of failures to spot objects on a radar screen during a 30-minute test period. The scores for each subject are presented in Table 3.5.
After arranging the scores into groups, it is wise to closely examine the data, looking for problems that might affect their interpretation or analysis. In Chapter 7,
we will consider the assumptions on which the analysis of variance is based, describe
how to check for their violation, and discuss how you might cope with any violations
that you find. The computer software packages, which we mention in Chapter 9, are
very helpful here. For the moment, we just note that you should look over your data
for scores that might have been incorrectly recorded. Nothing stands out here.
Next, you summarize the data. At the bottom of Table 3.5 are a number of useful
summary statistics, including the mean and several measures of the variability of the
groups.⁹ Although most of these latter quantities will be familiar from elementary statistics, we summarize their calculation in the following equations:

Sum of squares: SS_j = ΣY²_ij − A²_j/n_j,
Degrees of freedom: df_j = n_j − 1,
Variance: s²_j = SS_j/df_j,
Standard deviation: s_j = √s²_j,
Standard error of the mean: s_Mj = s_j/√n_j.
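These formulas translate directly into code. A minimal sketch (Python; the four scores form a hypothetical group, not data quoted from the text):

```python
import math

def group_summary(scores):
    """Summary statistics for one treatment group, following the text's formulas."""
    n = len(scores)
    A = sum(scores)                                 # treatment sum A_j
    ss = sum(y * y for y in scores) - A * A / n     # SS_j = sum(Y^2) - A^2/n
    df = n - 1                                      # df_j = n_j - 1
    var = ss / df                                   # s_j^2 = SS_j / df_j
    sd = math.sqrt(var)                             # s_j
    sem = sd / math.sqrt(n)                         # standard error of the mean
    return {"mean": A / n, "SS": ss, "df": df, "var": var, "sd": sd, "sem": sem}

# A hypothetical group of n = 4 scores
summary = group_summary([3, 6, 5, 6])
```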
[Figure 3.4 appears here: mean number of failures (vertical axis, 0 to 60) plotted against hours without sleep (horizontal axis: 4, 12, 20, 28).]
Figure 3.4: The average number of failures as a function of hours of sleep deprivation. The vertical bars extend above and below the mean by one standard error of the mean.
You can verify these calculations in Table 3.5. They are also easily obtained from computer programs for the analysis of variance.¹⁰ You will note that we have calculated separate standard errors of the mean, as is conventional when summarizing data, rather than the combined standard error of the mean based on MS_S/A (Equation 3.6).
Once you have obtained this summary information, you should plot it, as we have done in Figure 3.4. The vertical lines at each point, known as error bars, extend upward and downward by one standard error of the mean. They give a feeling for the accuracy with which each mean is estimated.¹¹ Now examine the means and standard errors for any interesting outcomes or patterns. Here we see that the average number of failures increases steadily with the number of hours of sleep deprivation. There appears to be some increase in the variability with the amount of deprivation, although, as we will discuss in Section 7.4, the amount of data is too small to definitively support this indication. After taking note of our observations, we turn to the analysis of variance.
The steps in the analysis are summarized in Table 3.6. The upper half of the
table details the operations for calculating the bracket terms [Y], [A], and [T], and the
lower half presents the summary table of the analysis of variance. From the formulas
¹⁰When you are writing up your results, you should present summaries of this information in the text, tables, or figures. In its Publication Manual, the American Psychological Association states that researchers must report sufficient statistical information so that readers can corroborate or enlarge upon the analyses reported in a study (American Psychological Association, 2001). The information on means, sample sizes, and variability of the groups is necessary, and, as a convenience, the error term MS_S/A from the overall analysis.
¹¹For large samples, you can think of the range marked out by the error bars as a confidence interval calculated at the 68 percent level or as the interval that is likely to contain the mean of a replication about half the time; for the difference in these interpretations, see Estes (1997).
3.4. A COMPLETE NUMERICAL EXAMPLE 53
in Section 2.4 (see p. 30), we combine the bracket terms to obtain the sums of squares
as follows:
SS_A = [A] − [T] = 36,986.500 − 33,672.250 = 3,314.250,
SS_S/A = [Y] − [A] = 38,792 − 36,986.500 = 1,805.500,
SS_T = [Y] − [T] = 38,792 − 33,672.250 = 5,119.750.
We check our calculations by assuring that they sum up properly. The sum of the between-groups and within-groups sums of squares should equal the total sum of squares SS_T:

SS_A + SS_S/A = 3,314.250 + 1,805.500 = 5,119.750 = SS_T.
The remainder of the analysis uses the formulas in Table 3.1. The sums of squares
we have just calculated are entered in Table 3.6. The degrees of freedom associated
with the different sums of squares are
df_A = a − 1 = 4 − 1 = 3,
df_S/A = a(n − 1) = (4)(4 − 1) = 12,
df_T = an − 1 = (4)(4) − 1 = 15.
We also check these values by summing:
df_A + df_S/A = 3 + 12 = 15 = df_T.
The betweengroups and withingroups mean squares are formed by dividing the
relevant sum of squares by the corresponding degrees of freedom. Specifically,
MS_A = SS_A/df_A = 3,314.250/3 = 1,104.750,
MS_S/A = SS_S/A/df_S/A = 1,805.500/12 = 150.458.

The F ratio is obtained by dividing the first mean square by the second:

F = MS_A/MS_S/A = 1,104.750/150.458 = 7.34.
The results of each of these steps are entered in the summary table. We will assume that the α level has been set at .05 before the start of the experiment. In order to evaluate the significance of the F, we locate the critical value of F at α = .05 and df_num = 3 and df_denom = 12. The F table in Appendix A.1 gives F.05(3, 12) = 3.49. Because F_observed = 7.34 exceeds this value, we reject the null hypothesis and conclude that the independent variable has produced an effect. The result of the statistical test is indicated by an asterisk in the summary table.
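The full chain of calculations, from bracket terms to the significance decision, can be sketched in code (Python, assuming SciPy is available for the critical value; the bracket terms are the ones computed above):

```python
from scipy import stats

# Bracket terms from the sleep-deprivation example (a = 4 groups, n = 4 each)
Y, A, T = 38792.000, 36986.500, 33672.250
a, n = 4, 4

ss_a = A - T                   # between-groups sum of squares
ss_sa = Y - A                  # within-groups sum of squares
df_a, df_sa = a - 1, a * (n - 1)

ms_a = ss_a / df_a
ms_sa = ss_sa / df_sa
F = ms_a / ms_sa                              # observed F ratio
F_crit = stats.f.ppf(1 - .05, df_a, df_sa)    # critical value at alpha = .05
significant = F > F_crit                      # reject H0?
```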
In Section 2.3 (see p. 26), you saw that the within-groups sum of squares is in effect the sum of the individual sums of squares for the treatment groups; that is, SS_S/A = Σ SS_S at a_j. The within-groups mean square, which is based on SS_S/A, can also be viewed as a composite or pooled source of variance. More specifically, the error term for the analysis of variance with equal-sized groups is literally an average of the variances of the separate groups (s²_j):

MS_S/A = Σs²_j / a.    (3.9)
As an example, we can calculate the average within-groups variance with the variances presented in Table 3.5:

MS_S/A = (51.000 + 119.583 + 240.333 + 190.917)/4 = 601.833/4 = 150.458.
This value is identical to the value we calculated with the standard formulas. Viewing
the error term as an average of the group variances emphasizes the point that each treatment condition produces its own estimate of experimental error, and that these estimates are combined to provide a more stable estimate of experimental error for
the entire experiment.
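Equation 3.9 is easy to check numerically; a one-line sketch using the four group variances reported in Table 3.5:

```python
# Group variances from Table 3.5 (equal group sizes, n = 4 each)
variances = [51.000, 119.583, 240.333, 190.917]

# Equation 3.9: with equal n, the error term is the plain average of the variances
ms_s_a = sum(variances) / len(variances)
```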
Analysis of Variance
A complete treatment of the analysis of unequal groups is complex. There are several ways to conduct the analysis; we describe a more general approach in Chapter 14. However, in a single-factor design, the presence of unequal groups does not seriously complicate the analysis. The approach we used for equal samples may be adapted to unequal samples with only minor changes.
First, some changes in notation. We will use n_j to represent the number of observations in treatment condition a_j. The total sample size is N = Σn_j, and the grand mean (Ȳ_T) is the mean of all the Y_ij scores; that is,

Ȳ_T = ΣY_ij/N = T/N.    (3.10)

Let A_j be the sum of the Y scores in group a_j. The mean of this group is

Ȳ_j = A_j/n_j.    (3.11)
Just as in the equal-sample case, the total sum of squares is partitioned into a between-groups portion and a within-groups portion. The defining formulas for these three sums of squares are:
SS_A = Σn_j(Ȳ_j − Ȳ_T)²,    (3.12)
SS_S/A = Σ(Y_ij − Ȳ_j)²,    (3.13)
SS_T = Σ(Y_ij − Ȳ_T)².    (3.14)
The primary difference between these formulas and those that apply to studies with
equal sample sizes is in the formula for the treatment sum of squares SSA (Equation
2.3). With unequal sample sizes each squared deviation must be multiplied by nj
before the summation takes place, while with equal sample sizes the multiplication by
n can take place either before or after the summation.
The computational formulas for calculating the sums of squares also require a
change to accommodate the unequal sample sizes. These changes occur in the bracket
terms we add and subtract to calculate the three sums of squares. The bracket term
[Y] remains unchanged:
The bracket term [T] is unchanged if we recall that N = an with equal sample sizes. More specifically, the formulas for calculating [T] with sums or means are

[T] = T²/N  and  [T] = NȲ²_T.

This leaves only the bracket term [A] to be modified. It is revised to move the sample sizes n_j within the summation signs:

[A] = ΣA²_j/n_j  and  [A] = Σn_jȲ²_j.
There are a − 1 degrees of freedom for the groups. The number of within-groups degrees of freedom is the sum of the individual degrees of freedom for each group:

df_S/A = Σ(n_j − 1) = N − a.
The remainder of the analysis proceeds as usual, through the mean squares, the F
ratio, and the test of significance.
For a numerical example, consider the data in the top portion of Table 3.7. This
table also contains the information we need to summarize the data and run the analysis
(calculated using the equations on p. 51). We also need the grand mean Ȳ_T, which we calculate using Equation 3.10:

Ȳ_T = T/N = 101/13 = 7.769.
You should note that Ȳ_T is not obtained by averaging the treatment means. With equal sample sizes, either way of calculating Ȳ_T produces the same answer. With unequal sample sizes, these two definitions are not equivalent. Here the mean of the three treatment-group means is (7 + 10 + 5)/3 = 7.333.
In addition to the preliminary squaring and summing, we have also included in Table 3.7 the calculation of the three bracket terms and the completed analysis summary table.¹² You may also calculate SS_A using the defining formula, Equation 3.12:

SS_A = (3)(7.0 − 7.769)² + (6)(10.0 − 7.769)² + (4)(5.0 − 7.769)²
     = (3)(−.769)² + (6)(2.231)² + (4)(−2.769)²
     = 1.774 + 29.864 + 30.669 = 62.307,

which is the same as the value in Table 3.7, except for rounding error. We complete the analysis by calculating the mean squares and the F ratio. The treatment and within-groups mean squares are obtained in the usual manner:

MS_A = SS_A/df_A = 62.308/(3 − 1) = 31.154  and  MS_S/A = SS_S/A/df_S/A = 24.000/10 = 2.400.
¹²We calculated the bracket terms [A] and [T] using sums. Except for rounding error, we obtain the same information from the means: [A] = Σn_jȲ²_j = (3)(7.0²) + (6)(10.0²) + (4)(5.0²) = 847.0 and [T] = NȲ²_T = (13)(7.769²) = 784.65.
Table 3.7

                                 a_1      a_2      a_3
                                  5       12        3
                                  9       10        6
                                  7       10        5
                                           8        6
                                          11
                                           9                  Total
Sum (A_j)                        21       60       20         T = 101
Sample size (n_j)                 3        6        4         N = 13
Mean (Ȳ_j)                      7.0     10.0      5.0
Raw sum of squares (ΣY²_ij)     155      610      106         [Y] = 871
Sum of squares (SS_j)           8.0     10.0      6.0         SS_S/A = 24.0
Degrees of freedom (df_j)         2        5        3         df_S/A = 10
Standard deviation (s_j)      2.000    1.414    1.414
Std. error of the mean (s_Mj) 1.155    0.577    0.707

Calculation of the bracket terms:
[Y] = 871
[A] = 21²/3 + 60²/6 + 20²/4 = 147.0 + 600.0 + 100.0 = 847.0
[T] = 101²/13 = 784.692
dividing by the sum of the degrees of freedom. To illustrate, we have included these sums among the summary information calculated in Table 3.7. As you can see,

MS_S/A = (Σ SS_S at a_j)/(Σ df_S at a_j) = 24.0/10 = 2.40.    (3.15)
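The unequal-n bracket terms can be sketched in code. Using the thirteen scores from Table 3.7 (three groups of sizes 3, 6, and 4):

```python
# Scores from Table 3.7 (three groups with unequal sizes)
groups = [[5, 9, 7], [12, 10, 10, 8, 11, 9], [3, 6, 5, 6]]

N = sum(len(g) for g in groups)                            # total sample size
T = sum(sum(g) for g in groups)                            # grand total

bracket_Y = sum(y * y for g in groups for y in g)          # [Y] = sum of squared scores
bracket_A = sum(sum(g) ** 2 / len(g) for g in groups)      # [A] = sum of A_j^2 / n_j
bracket_T = T ** 2 / N                                     # [T] = T^2 / N

ss_a = bracket_A - bracket_T                               # between-groups SS
ss_sa = bracket_Y - bracket_A                              # within-groups SS
df_sa = N - len(groups)                                    # df_S/A = N - a
ms_sa = ss_sa / df_sa                                      # pooled error term
```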
Comment
The method we have just presented is widely recommended by applied methodologists
and is equivalent to the one implemented by most statistical software programs. Other
approaches have been proposed, however, that adapt the equalgroups analysis to
unequal groups. One possibility is to average the sample sizes and use that value
for all groups when calculating SSA; another is to drop subjects at random from the
more populous groups until all sample sizes are the same. We will not discuss these
approaches because they have less power than the method we have described, and so
are not currently advocated.
As we noted, the calculations involved in conducting a one-way analysis of variance with unequal groups are not much different from those in the equal-sample case. This
similarity should not obscure the fact that the analysis of unequal samples is a more delicate procedure, in the sense that the inferences are more sensitive to violations of the assumptions. It also raises some substantial issues of interpretation, particularly in more complex designs. We will return to these matters at several points.
Exercises
3.1. Find the critical values of F for the following situations: a. F(4, 30) at α = .05; b. F(1, 120) at α = .001; c. a = 7, n = 5, α = .10; and d. a = 3, n = 9, α = .01.
3.2. In an experiment comparing the rate of learning following a brain lesion, the number of trials to learn a T-maze was measured for three groups of n = 4 rats. Group a_1 had a lesion in one location of the brain, group a_2 in another location, and group a_3 was a sham-operated control (that is, without a lesion). The numbers of trials to criterion for the three groups were:
such as the group standard deviations. Suppose you wanted to compare the results of
the experiment reported in Table 3.5 (p. 51) with another study that included only
the first three conditions (4 hr., 12 hr., and 20 hr.). Perform an analysis of variance
on these three groups, using only the means and standard deviations.
3.5. Consider the following set of scores:
8 4 9 1 2 7
0 2 4 8 0 7
9 8 5
a. Calculate the mean and standard deviation for each of the groups.
b. Perform an analysis of variance on these data.
c. What can you conclude from this study?
3.6. Complete the analysis of variance of the data presented in Problem 2.1 in Chapter 2.
3.7. An experiment is conducted with n = 5 subjects in each of the a = 6 treatment conditions; in the following table, the means are given in the first row and the sums of the squared Y scores in the second:
13 17 10 7 11 3
14 16 13 13 4 4
8 10 8
9
where j and k are two different levels of factor A, and the summation is over all unique pairs of Ȳ_j and Ȳ_k (there are a(a − 1)/2 of these pairs).

To illustrate, consider the data presented in Table 2.4 (p. 28). There were three levels of factor A and n = 5 observations per group. The three means were Ȳ_1 = 15, Ȳ_2 = 6, and Ȳ_3 = 9. There are only three unique differences:

Ȳ_1 − Ȳ_2,  Ȳ_1 − Ȳ_3,  and  Ȳ_2 − Ȳ_3.

Substituting these differences into Equation 4.1 gives

SS_A = (5/3)[(15 − 6)² + (15 − 9)² + (6 − 9)²]
     = (1.667)[9² + 6² + (−3)²] = 210.
This value is identical to the one obtained in Chapter 2 from the original formula.
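It is worth verifying that the pairwise form of Equation 4.1 agrees with the usual defining formula; a quick sketch using the three means from Table 2.4:

```python
from itertools import combinations

means = [15, 6, 9]        # treatment means from Table 2.4
n, a = 5, len(means)
grand = sum(means) / a    # equal group sizes, so the grand mean is the mean of means

# Standard defining formula: SS_A = n * sum over groups of (Ybar_j - Ybar_T)^2
ss_standard = n * sum((m - grand) ** 2 for m in means)

# Pairwise form (Equation 4.1): SS_A = (n/a) * sum over pairs of (Ybar_j - Ybar_k)^2
ss_pairwise = (n / a) * sum((mj - mk) ** 2 for mj, mk in combinations(means, 2))
```

Both routes give the same between-groups sum of squares, which is the sense in which SS_A averages the pairwise differences.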
Thus, the between-groups sum of squares averages the differences between the different pairs of means. As such, it cannot provide unambiguous information about the exact nature of the treatment effects. Because SS_A is a composite, an F ratio based on more than two treatment levels is known as an omnibus or overall test. The omnibus test is a blunt instrument that only partially addresses the needs of the researcher. The test assesses the overall or average differences among the treatment means, but a researcher's interest often concerns specific differences between means.
Planned Comparisons
Most research begins with the formulation of one or more research hypotheses: statements about how behavior will be influenced if such and such treatment differences are manipulated. Experiments are designed specifically to test these hypotheses. Most typically, it is possible to design an experiment that provides information relevant to several research hypotheses. Tests designed to shed light on these questions are planned before the start of an experiment and clearly represent the primary focus of analysis. The generality of the omnibus F test makes it an inefficient way to study these specific effects. Thus, most researchers form, or at least imply, an analysis plan when they specify their research hypotheses and design experiments to test them. This plan can be expressed by a set of specific comparisons among the group means, chosen to focus on the research questions. As you will see in Chapter 6, these planned comparisons typically are examined directly, without reference to the significance or nonsignificance of the omnibus F test.
Table 4.1: An example of planned comparisons from the McGovern (1964) study.
[Table body not legible in reproduction; its columns are the treatment groups a_1 through a_5.]
some groups are used in several contrasts. Comparisons between contrasts are less ambiguous because all groups were formed under a common randomization scheme and all testing occurred within the same experimental context, circumstances that generally do not hold when separate experiments are conducted.
Control   Drug A   Drug B
   16        4        2
   18        7       10
   10        8        9
   12       10       13
   19        1       11
researchers want to test the significance of subsets of three or more means, simply
asking whether the means differ among themselves. For instance, suppose we have
a = 4 treatment conditions, one control condition and three experimental conditions.
It might make sense to ask two questions: (1) Does the control condition differ from the
combined experimental conditionsthat is, is there a general or average experimental
effect? (2) Do the experimental groups differ among themselves? In the first case, we
have a single-df comparison between two means (the control mean versus the average
of the three experimental means), and in the second case, we have a comparison among
the subset of three experimental means (a threegroup analysis derived from the larger
a= 4 design). We will consider the first type of comparison now and the second type
in Section 4.7.
Linear Contrasts
To test the significance of single-df comparisons, we need to write each one as a null hypothesis asserting that the effect is absent. Consider the four questions above. The first three questions involve pairwise comparisons. Each null hypothesis asserts that the difference between a pair of means is zero:

H01: μ1 − μ2 = 0,
H02: μ1 − μ3 = 0,
H03: μ2 − μ3 = 0.
The fourth question compares the control group with the average of the two experimental groups. It is represented by the null hypothesis

H04: μ1 − (μ2 + μ3)/2 = 0.

These four null hypotheses have a common form: One group mean (or an average of group means) is subtracted from another group mean (or average of group means).
66 CHAPTER 4. ANALYTICAL COMPARISONS AMONG MEANS
The coefficients for this comparison are {+1, −½, −½}. Using these contrasts, all four questions refer to the same null hypothesis:

H0: ψ = 0.

There are several reasons to express comparisons in terms of contrasts, that is, by Equation 4.2. First, the procedure is general: It can compare the means of two groups, or it can evaluate complicated comparisons involving many groups. Second, special sets of coefficients are available, particularly for experiments that have a quantitative independent variable. Third, the statistical tests follow a fixed procedure, involving the calculation of a sum of squares and an F ratio. Fourth, as you will see in Section 4.5, the independence of two or more contrasts is readily determined by comparing their coefficients. We will be able to see whether a set of planned comparisons cleanly extracts the information available in a set of means. Finally, the contrasts can be extended to express questions about designs with two or more independent variables. We will consider these contrasts in Chapter 13.
Many contrasts can be written for an experiment, but only a few are actually tested. These are the comparisons that express meaningful questions. Technically, there is an infinite number of logically different and valid contrasts possible for any experiment with three or more groups. Because the only constraint on the choice of coefficients in a contrast is that they sum to zero (Equation 4.3), a perfectly valid contrast for the McGovern experiment is

ψ = (0.42)μ1 + (−0.26)μ2 + (−1.13)μ3 + (0.73)μ4 + (0.24)μ5.

Although the coefficients of this contrast sum to 0, we would not consider it because it does not express an interesting question. In short, the logical constraint that a set of coefficients sum to zero does not guarantee that the contrast expresses a meaningful question.
4.3. COMPARISONS AMONG TREATMENT MEANS 67
−1 for the other group, and 0 for the groups not entering into the comparison.
This observed contrast is easily converted into a sum of squares:

SS_ψ = nψ̂²/Σc²_j,    (4.5)

where n = the number of subjects per treatment condition,
ψ̂ = the observed difference between the two means being compared,
Σc²_j = the sum of the squared coefficients.

The denominator of Equation 4.5 adjusts the size of the sum of squares so that it does not depend on the overall magnitude of the coefficients.
To illustrate these calculations, we will use data from Table 2.4 (p. 28). In this
example, there were a = 3 groups (control, drug A, and drug B) and n = 5 subjects per
group. We will focus on two comparisons, a comparison between the two drug groups
and a comparison between the control group and an average of the two drug groups.
Table 4.3 presents the means for the three groups again, along with the coefficients of
the two comparisons.
Using Equation 4.4, we can calculate the values of the two comparisons. For the first contrast, we find a difference favoring drug B over drug A:

ψ̂1 = Σc_jȲ_j = (0)(15) + (+1)(6) + (−1)(9) = 0 + 6 − 9 = −3.

We convert this difference into a sum of squares by substituting it into Equation 4.5:

SS_ψ1 = nψ̂²1/Σc²_j = (5)(−3)²/[(0)² + (+1)² + (−1)²] = 45/2 = 22.50.
Evaluating a Comparison
We now can test the significance of the contrasts. In Table 4.4, the two sums of squares are integrated into the original summary table. Because each comparison has one degree of freedom, MS_ψ = SS_ψ/1 = SS_ψ. An F ratio is formed by dividing the comparison mean square by the error term from the overall analysis;² that is,

F = MS_ψ/MS_S/A.    (4.6)

The F is evaluated in the usual manner by comparing it with the value listed in the F table, with df_num = 1 and df_denom = a(n − 1), in this case F.05(1, 12) = 4.75.
The two F tests are shown in Table 4.4. They provide statistical justification for the observations offered in the last section: There is no evidence that the two drugs differentially affect reading comprehension (comparison 1), and there is clear evidence that, considered jointly, the two drugs disrupted the boys' performance (comparison 2).
The advantage of this analysis is the additional information it provides. If we had calculated only the overall F ratio, as we did in Chapter 3, we would have rejected the null hypothesis and concluded only that the treatment means were not the same. Now we can pinpoint the source of the differences that contribute to the significant omnibus F.
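The contrast calculations behind Table 4.4 can be reproduced with a short helper (Python; the means, n, and error term MS_S/A = 15.00 are the Table 2.4 values cited in this chapter):

```python
def contrast_F(means, coeffs, n, ms_error):
    """Value, sum of squares, and F for a single-df contrast (equal group sizes)."""
    psi = sum(c * m for c, m in zip(coeffs, means))    # Equation 4.4
    ss = n * psi ** 2 / sum(c * c for c in coeffs)     # Equation 4.5
    return psi, ss, ss / ms_error                      # MS_psi = SS_psi since df = 1

means, n, ms_error = [15, 6, 9], 5, 15.00              # Table 2.4 example

psi1, ss1, F1 = contrast_F(means, [0, 1, -1], n, ms_error)       # drug A vs. drug B
psi2, ss2, F2 = contrast_F(means, [1, -0.5, -0.5], n, ms_error)  # control vs. drugs
```

Against the critical value F.05(1, 12) = 4.75, comparison 1 is not significant and comparison 2 is, matching the conclusions above.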
and 4, respectively. Suppose we want to test the null hypothesis that the average of the means of groups a_1 and a_3 equals the mean of group a_2:

H0: ½(μ1 + μ3) − μ2 = 0.

This hypothesis leads to a complex comparison with the coefficient set {+½, −1, +½}. Using the more convenient set of integers {+1, −2, +1}, we find

ψ̂ = (+1)(7.00) + (−2)(10.00) + (+1)(5.00) = −8.00.

Substituting in Equation 4.7, we obtain

SS_ψ = (−8.00)² / [(+1)²/3 + (−2)²/6 + (+1)²/4] = 64.00/(.333 + .667 + .250) = 64.00/1.25 = 51.20.
The F test for this comparison is calculated the same way as with equal sample sizes. Substituting MS_ψ = 51.20, which we just calculated, and MS_S/A = 2.40 from Table 3.7 (p. 57) in Equation 4.6, we find:

F = MS_ψ/MS_S/A = 51.20/2.40 = 21.33.

With df_num = 1 and df_denom = 10, this value is significant (F.05 = 4.96).
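The unequal-n contrast formula (Equation 4.7) can be sketched the same way, using the means and sample sizes from Table 3.7:

```python
means = [7.00, 10.00, 5.00]     # group means from Table 3.7
sizes = [3, 6, 4]               # unequal sample sizes
coeffs = [1, -2, 1]             # integer form of {+1/2, -1, +1/2}

psi = sum(c * m for c, m in zip(coeffs, means))                     # observed contrast
ss_psi = psi ** 2 / sum(c * c / n for c, n in zip(coeffs, sizes))   # Equation 4.7
F = ss_psi / 2.40                                                   # MS_S/A from Table 3.7
```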
The t Test
As you may recall, the t test for the hypothesis that the mean of a single group of n scores is equal to a value μ0 is conducted by finding the observed difference between the sample mean Ȳ and μ0 and dividing by the standard error of the mean:

t = (Ȳ − μ0)/s_M.    (4.8)

The standard error of the mean, s_M, equals s/√n (Equation 3.5, p. 35). Appendix A.2 gives critical values for the t statistic, which depend on the significance level α and the number of degrees of freedom in the variance, which here is n − 1.
The t statistic is also used to test the null hypothesis that a contrast in the analysis of variance is equal to zero. The test statistic now is the observed value of the contrast divided by an estimate of its standard error:

t = ψ̂/s_ψ̂.    (4.9)
s_ψ̂ = √(MS_S/A Σc²_j / n).    (4.11)
Let's apply this test to the second comparison in Table 4.3, which contrasts the control mean with the average of the means for the two drug conditions. We need the following information to calculate t:
• the value of the contrast: ψ̂ = 7.50,
• the sample size: n = 5,
• the overall error term: MS_S/A = 15.00,
• the sum of the squared coefficients: Σc²_j = (+1)² + (−½)² + (−½)² = 1.5.
Substituting these quantities in Equation 4.11, we find the standard error:

s_ψ̂ = √[(15.00)(1.5)/5] = √4.5 = 2.121,

and the t statistic is t = ψ̂/s_ψ̂ = 7.50/2.121 = 3.536.
F = t²  and  t = √F.
From the value of t we just calculated, we find that t² = 3.536² = 12.50, exactly the same as the value we calculated for F in Table 4.4. Because the two tests are algebraically equivalent, the conclusions are interchangeable: If the t is significant, then F is significant; if t is not significant, then F is not significant.
Up to this point, we have framed the null hypothesis H0 as the absence of an effect, that is, as an assertion that the population treatment means are equal (e.g., H0: μ1 = μ2) or the contrast is zero (H0: ψ = 0). We can also test null hypotheses that specify outcomes other than the absence of an effect. Suppose that a standard drug treatment is known to decrease feelings of depression in a clinical population by 2 points relative to a placebo condition. Any new drug should be evaluated with respect to this value. In statistical terms, you want to test the hypothesis H0: ψ = 2,
4.4. EVALUATING CONTRASTS WITH A t TEST 73
not H0: ψ = 0. This null hypothesis would be rejected only if the difference between the new drug and the placebo was incompatible with ψ = 2.³
Hypotheses of this more general sort, that ψ is equal to some value ψ0, are evaluated by an extension of the t statistic. It is only necessary to include the hypothesized value in the numerator:

t = (ψ̂ − ψ0)/s_ψ̂.    (4.12)

When ψ0 = 0, Equation 4.12 is identical to Equation 4.9. All other terms in Equation 4.12 are the same, and the t statistic is evaluated in the usual fashion. We will return to this use of the t statistic when we discuss confidence intervals.
Directional Hypotheses
The alternative hypotheses we have considered so far are nondirectional, in the sense that differences between two means in either the positive or negative direction are incompatible with the null hypothesis. That is, with H0: μ1 = μ2 and H1: μ1 ≠ μ2, the null hypothesis will be rejected whenever F falls within the rejection region, which includes both μ1 > μ2 and μ2 > μ1. This type of alternative hypothesis, which is most common in psychology, is called a nondirectional hypothesis or an omnidirectional hypothesis.

When tested with a t statistic, a test with a nondirectional hypothesis is sometimes known as a two-tailed test. The "tails" in this case refer to the two rejection regions used in conjunction with the t test, shown by the two shaded areas on the t distribution in the upper-left portion of Figure 4.1. Because a value of t can be either positive or negative depending on the direction of the difference between the two means, the rejection region is located in both tails of the t distribution, with one-half of the significance level (α/2 = .025) located in the extreme positive portion of the t distribution and the other half in the extreme negative portion. Consider the corresponding F distribution on the right. The F test deals with squared values, which treat positive and negative values equivalently, so there is only one rejection region with an area of α = .05. Half of the F's in this region are produced by positive differences between the two means and half by negative differences. The critical values directly relate to each other. For example, the critical value of t for the example in the preceding section was t(12) = 2.18 for α = .05, while the corresponding value of F is F(1, 12) = [t(12)]² = 2.18² = 4.75, the same value found in the F table.
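The relation between the two-tailed t critical value and the one-tailed F critical value can be checked directly (Python, assuming SciPy is available):

```python
from scipy import stats

alpha, df = .05, 12

# Two-tailed t critical value: alpha/2 in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Corresponding F critical value with 1 numerator df: all of alpha in one tail
F_crit = stats.f.ppf(1 - alpha, 1, df)
```

Squaring the t critical value reproduces the F critical value, as the text states.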
In contrast to a nondirectional alternative hypothesis that allows differences in either direction, it is possible to choose an alternative hypothesis that specifies a particular direction of the outcome. If we wished to show that the mean of group a_1 was greater than that of group a_2, we would test the hypothesis pair

H0: μ1 ≤ μ2,
H1: μ1 > μ2;
³Null hypotheses like this are not common in the behavioral sciences. A two-group study in which one group was given the old drug and the other was given the new drug would be more convincing, because it would assure that the testing conditions were the same.
[Figure 4.1 appears here: rejection regions for t and F tests. Upper panels show a nondirectional test, plotted against the value of t (two tails) and the value of F (one tail); lower panels show directional lower-tail and upper-tail tests, plotted against the value of t.]
and if we were interested in showing the reverse, we would test the hypothesis pair
H0: μ1 ≥ μ2,
H1: μ1 < μ2.
Note that in both cases, we have put the relationship we want to show as the alternative hypothesis. Because the direction of the outcome is important in these hypotheses, we refer to them as directional hypotheses. In either case, if the results came out in the unexpected direction (with Ȳ1 below Ȳ2 in the first case and above it in the second), we would retain the null hypothesis regardless of how large the difference between the two means might be.
A t test with a directional hypothesis is a one-tailed test because it has a single rejection region located in the positive or the negative tail. The two possibilities are depicted in the lower part of Figure 4.1. As you can see, the entire rejection region is located in either the positive or negative extreme of the t distribution. We concentrate the entire α chosen for the study in either the positive or the negative tail rather than spreading it equally in the two tails.⁴
Directional tests are best suited for applied research in which the purpose of a
study is to compare a new or different treatment (a new therapy, drug, or teaching
method) with an established one, with an eye toward adopting the new treatment only
if it improves on the one currently in use. Here a researcher will not care that the
directional test only detects differences in the "positive" direction. A new treatment
that is no different from the present one is just as useless as one that is worse. Both
would be treated the same way, namely dropped from consideration.

⁴If you want to test these hypotheses with an F test, then you would pick a critical value based on twice your α (e.g., α = .10 instead of α = .05), but only reject the null hypothesis when the observed means are in the direction specified by the alternative hypothesis.

However, directional tests are not well suited to basic or theoretical research. Generally, theories
are only temporary explanations, tentative ideas about cause and effect. A significant effect in a predicted direction might provide positive support for one theory, while a significant effect in the negative direction can either be evidence against that theory or support for another. Both directions are important. Given the transitional nature of theoretical research, we usually can justify only nondirectional tests. Conducting a directional test under these circumstances is equivalent to adopting a more liberal significance level (for example, α = .10 rather than .05), which is why many journal editors question research that is based largely on directional tests.
Confidence Intervals for a Contrast
We introduced confidence intervals for the population treatment means μj in Section 3.1. Similar intervals may be constructed for contrasts. These interval estimates consist of a range of values for ψ, and they are constructed in such a way that they are expected to include the population difference a certain percentage of the time. This percentage or confidence is given by the expression

confidence = (100)(1 − α) percent.

Most commonly, researchers use α = .05, which defines 95 percent confidence intervals, but other values are popular.
The formula for a confidence interval for a contrast is similar to the one for a population mean (Equation 3.4, p. 34):

ψ̂ − t s_ψ̂ ≤ ψ ≤ ψ̂ + t s_ψ̂. (4.13)

The standard error s_ψ̂ is the same one that we used for a t test: Equation 4.10 when the samples have unequal sizes or Equation 4.11 when they are equal.
As an example, we will return to the experiment with hyperactive boys and calculate the 95 percent confidence interval for the significant difference we found between the control condition and the combined drug conditions. The means are Ȳ1 = 15 for the control group and Ȳ2 = 6 and Ȳ3 = 9 for the two drug groups. We will use the standard form of this contrast, which expresses the contrast in terms of the means rather than some multiple of the means, to construct a confidence interval. Thus, we calculate ψ̂ = (1)(15) + (−½)(6) + (−½)(9) = 7.50. From the t table, we find t(12) = 2.18 at α = .05. The final quantity in Equation 4.13 is the standard error of the contrast, which we found to be s_ψ̂ = 2.121 when we conducted the t test in Section 4.4. We now substitute these values into Equation 4.13:

7.50 − (2.18)(2.121) ≤ ψ ≤ 7.50 + (2.18)(2.121)
7.50 − 4.62 ≤ ψ ≤ 7.50 + 4.62
2.88 ≤ ψ ≤ 12.12.
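Arithmetic like this is easy to check from the summary quantities alone. A minimal Python sketch (the function name and the hard-coded values are ours, taken from the worked example above):

```python
# Sketch of the confidence-interval calculation for a contrast (Equation 4.13),
# using the summary values from the hyperactive-boys example.

def contrast_ci(means, coeffs, se_psi, t_crit):
    """Return (lower, upper) confidence bounds for the contrast psi."""
    psi_hat = sum(c * m for c, m in zip(coeffs, means))
    half_width = t_crit * se_psi
    return psi_hat - half_width, psi_hat + half_width

means = [15.0, 6.0, 9.0]      # control, drug A, drug B
coeffs = [1.0, -0.5, -0.5]    # standard form of the contrast
lower, upper = contrast_ci(means, coeffs, se_psi=2.121, t_crit=2.18)
print(round(lower, 2), round(upper, 2))   # 2.88 12.12
```

The same function tests any null value ψ0 at once: ψ0 is rejected exactly when it falls outside the returned bounds.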
We have described the confidence interval above as a range that, over many studies, will include the true value of the parameter as often as indicated by the confidence level (for example, 95 percent of the time). Another interpretation is also useful: A confidence interval is the set of values of ψ that are not rejected by hypothesis tests based on the data. Thus, a 95 percent confidence interval indicates that every value included in the interval will be retained by a hypothesis test at α = .05 and every
76 CHAPTER 4. ANALYTICAL COMPARISONS AMONG MEANS
value outside it will be rejected. Applying this interpretation to the most common null hypothesis, we would retain H0: ψ = 0 when a confidence interval included 0 and reject it when the interval did not. As we indicated in Section 4.4, however, a null hypothesis need not be 0, which is why a confidence interval is particularly useful: it provides a decision rule for testing all hypotheses about ψ:

If ψ0 falls within the confidence interval, then retain H0: ψ = ψ0;
If ψ0 falls outside the confidence interval, then reject H0: ψ = ψ0.

A confidence interval is particularly valuable when we want to say something about ψ after we have rejected the hypothesis that ψ = 0, or when we are trying to relate our findings to those of other investigators. Suppose other research had suggested a 15-point difference between the control and the combined drug conditions. A glance at the confidence interval for this difference (2.88 ≤ ψ ≤ 12.12) tells us, without further calculation, that that value is inconsistent with our data. We reject the hypothesis that ψ = 15 using the same standards of evidence that we used to test the hypothesis that ψ = 0.
…gests, there is a geometric metaphor that underlies the analysis of single-df contrasts. We don't use this metaphor here, but you can look at Wickens (1995) for a discussion.
4.5. ORTHOGONAL CONTRASTS 77
First, we enter the coefficients of the two contrasts in the rows, then we calculate the product of the coefficients for each treatment condition and write these beneath the box. Finally, we sum them and check whether the sum of products equals zero; in this case it does: 0 − ½ + ½ = 0. We have shown that these two contrasts are orthogonal.
In contrast, consider two contrasts that are not orthogonal, such as pairwise comparisons between the control condition and each of the drug conditions. Let the contrasts ψ3 and ψ4 compare the control group to the first and second drug groups, respectively. The coefficients for these contrasts are c3j: {+1, −1, 0} and c4j: {+1, 0, −1}. Although these two comparisons both express meaningful questions, they are not orthogonal, as we discover by substituting their coefficients into Equation 4.14:

Σ c3j c4j = (+1)(+1) + (−1)(0) + (0)(−1) = 1 + 0 + 0 = 1.

Orthogonality is a relationship between two contrasts (Σ c1j c2j = 0). If you have a set of three or more contrasts, you must verify that this rule holds for each pair of contrasts. With three contrasts, there are three relationships to check (ψ1 to ψ2, ψ1 to ψ3, and ψ2 to ψ3), and with four contrasts, there are six. If all contrasts in a set are orthogonal to one another, they are said to be mutually orthogonal.
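The pairwise rule is easy to mechanize. A small sketch (the function name is ours; it assumes equal sample sizes, where orthogonality reduces to a zero sum of coefficient products):

```python
from itertools import combinations

def mutually_orthogonal(contrasts):
    """True if every pair of contrasts has a zero sum of coefficient
    products (the equal-n orthogonality criterion)."""
    return all(
        sum(c1 * c2 for c1, c2 in zip(a, b)) == 0
        for a, b in combinations(contrasts, 2)
    )

psi1 = [1, -0.5, -0.5]   # control vs. combined drug groups
psi2 = [0, 1, -1]        # drug A vs. drug B
psi3 = [1, -1, 0]        # control vs. drug A
print(mutually_orthogonal([psi1, psi2]))  # True
print(mutually_orthogonal([psi1, psi3]))  # False
```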
There can be no more mutually orthogonal contrasts than the degrees of freedom associated with the omnibus effect.

If we have three means, we can write only two orthogonal contrasts, while if we have four means, we can construct sets of three mutually orthogonal contrasts, and so on. The second important point is that any set of a − 1 mutually orthogonal contrasts completely expresses the variability of the means, that is, the omnibus effect.
Now we find the sums of squares, using Equation 4.5:

SSψ3 = n ψ̂3² / Σ c3j² = (5)(9.00²) / [(1)² + (−1)² + (0)²] = 202.50,
SSψ4 = n ψ̂4² / Σ c4j² = (5)(6.00²) / [(1)² + (0)² + (−1)²] = 90.00.

The sum of these two sums of squares is 202.50 + 90.00 = 292.50, which greatly exceeds the omnibus sum of squares SSA = 210.00 given in Table 4.4.
Only orthogonal contrasts are assured of having their sums of squares add up to SSA.⁶ In this sense, the sum of squares obtained from a set of a treatment means is a composite of the sums of squares associated with a − 1 mutually orthogonal contrasts. A detailed analysis involving a properly constructed set of orthogonal contrasts is an efficient way to examine the results of an experiment. But efficiency is not everything, and researchers will form incomplete sets of orthogonal contrasts and sets including nonorthogonal contrasts when they are dictated by the nature of the questions they want to ask of their data. We will pursue this point in the next section.
Before moving on, we should mention one easy way to construct a set of a − 1 orthogonal contrasts. To begin, you compare the first group to the average of the other a − 1 groups. In all other contrasts you ignore the first group by giving it the coefficient c1 = 0. For the second contrast, you compare the second group to the average of all the higher-numbered groups, again giving it the coefficient c2 = 0 in any further contrasts. You continue this process, comparing each successive group to the average of the later ones until you reach a comparison between the last two groups. For example, with four groups this process yields three contrasts:

ψ1 = μ1 − (μ2 + μ3 + μ4)/3, or c1j: {1, −⅓, −⅓, −⅓},
ψ2 = μ2 − (μ3 + μ4)/2, or c2j: {0, 1, −½, −½},
ψ3 = μ3 − μ4, or c3j: {0, 0, 1, −1}.
6 When the sample sizes are unequal, the sums of squares of orthogonal contrasts (calculated by Equation 4.7) may not add up to SSA. We will discuss the reason for this fundamental nonadditivity in Section 14.4.
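The construction just described (each group compared with the average of all later groups, a scheme sometimes called a set of Helmert contrasts) can be sketched as a short routine; the function name is ours, and exact fractions are used so that the zero-sum and orthogonality checks are exact:

```python
from fractions import Fraction
from itertools import combinations

def helmert_contrasts(a):
    """Build the a - 1 orthogonal contrasts described in the text:
    group j is compared with the average of all later groups."""
    contrasts = []
    for j in range(a - 1):
        coeffs = [Fraction(0)] * a
        coeffs[j] = Fraction(1)
        for k in range(j + 1, a):
            coeffs[k] = Fraction(-1, a - j - 1)  # -1/(number of later groups)
        contrasts.append(coeffs)
    return contrasts

cs = helmert_contrasts(4)
# every contrast sums to zero and every pair is orthogonal
assert all(sum(c) == 0 for c in cs)
assert all(sum(x * y for x, y in zip(p, q)) == 0
           for p, q in combinations(cs, 2))
print(" ".join(str(c) for c in cs[0]))   # 1 -1/3 -1/3 -1/3
```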
4.6. COMPOSITE CONTRASTS DERIVED FROM THEORY 79
Table 4.5: Tests of mutual orthogonality for the contrasts in Table 4.1.

                            Treatment condition
                          a1    a2    a3    a4    a5    Sum
Comparisons
  Comp. 1 (c1j)           +1    −1     0     0     0     0
  Comp. 2 (c2j)           +1     0     0    −1     0     0
  Comp. 3 (c3j)            0    +1    −1     0     0     0
  Comp. 4 (c4j)            0     0     0    +1    −1     0
Tests for orthogonality
  Comp. 1 vs. 2 (c1j c2j) +1     0     0     0     0    +1
  Comp. 1 vs. 3 (c1j c3j)  0    −1     0     0     0    −1
  Comp. 1 vs. 4 (c1j c4j)  0     0     0     0     0     0
  Comp. 2 vs. 3 (c2j c3j)  0     0     0     0     0     0
  Comp. 2 vs. 4 (c2j c4j)  0     0     0    −1     0    −1
  Comp. 3 vs. 4 (c3j c4j)  0     0     0     0     0     0
Each of the four comparisons qualifies as a contrast; that is, the coefficients sum to zero, as shown in the last column on the right. Now we test for orthogonality between all different pairs of comparisons. In the bottom part of Table 4.5, we have calculated the products of the coefficients from each pair of contrasts. For example, the first row gives the products formed from the coefficients of contrasts 1 and 2 (c1j c2j). The sums of these products are on the right of the table. These sums reveal that three pairs of contrasts are orthogonal (1 and 4, 2 and 3, and 3 and 4) and three pairs are not orthogonal (1 and 2, 1 and 3, and 2 and 4).
We have established that as a set, the comparisons are not mutually orthogonal.
This result implies that these comparisons do not divide the SSA into tidy independent
pieces of information, which would have been true had the set of comparisons been
mutually orthogonal (Equation 4.15). On the other hand, orthogonality is only one of
several considerations in the selection of contrasts. In the present example, the fact
that each comparison tests for the loss of a single memory component overrides the
need for orthogonality.
Let's return to our original question: Do planned contrasts need to be orthogonal? We would say "no," but with some qualification. You should strive to look at orthogonal comparisons whenever you can, but not to the extent of ignoring substantively important comparisons. Some important questions simply are not orthogonal, and that fact should not stop us from looking at them. Attaining a proper balance is important here. Too many nonorthogonal contrasts will clutter up your analysis and make it difficult to understand, but ignoring meaningful contrasts will make your discussion less direct and to the point.
theory. This type of analysis is somewhat different from those we have discussed above.
Before we discuss it, however, we will consider the simpler problem of constructing a
set of contrast coefficients to test for a particular pattern of means.
Contrast Coefficients that Match a Pattern
Suppose we have hypothetical values for the means of two or more groups, and we want to find the contrast ψ that is most sensitive to that particular pattern. The coefficients cj for this contrast are found from a simple principle:

The contrast most sensitive to a pattern of means is the one whose coefficients reflect the same pattern.

To obtain these coefficients, all we do is start with values that duplicate the exact patterning of means that we want to detect. Then, we make sure that they obey the fundamental rule for contrast coefficients, that they sum to zero. In more detail, the steps we follow are
1. Use each predicted mean as a starting coefficient.
2. Subtract the average of these means from the predicted means so that they sum to zero.
3. Optionally, simplify the coefficients by multiplying or dividing all of them by the same number to turn them into integers or to give the contrast standard form.
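Steps 1 and 2 are mechanical enough to sketch in a few lines (the function name is ours; exact fractions keep the arithmetic clean, and step 3's rescaling is left optional):

```python
from fractions import Fraction

def pattern_coefficients(predicted_means):
    """Steps 1-2: subtract the mean of the predicted values so the
    coefficients sum to zero. Step 3 (rescaling) is optional."""
    avg = Fraction(sum(predicted_means), len(predicted_means))
    return [Fraction(m) - avg for m in predicted_means]

# Two-group example from the text: predicted means 10 and 5.
print([str(c) for c in pattern_coefficients([10, 5])])       # ['5/2', '-5/2']
# Three-group example: control 10, drug A 17, drug B 12.
print([str(c) for c in pattern_coefficients([10, 17, 12])])  # ['-3', '4', '-1']
```

Dividing the first pair by 5/2 recovers the familiar {+1, −1}; the second set is already in integers.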
First, we will show that this procedure works for a two-group comparison. Suppose we predict that a control group will score higher than an experimental group. We start by assigning numbers that reflect this expectation, say means of 10 for the control group and 5 for the experimental group (ignore for the moment any other groups not relevant to this comparison). We can't use these numbers as coefficients because their sum is 15, not 0. We solve this problem by subtracting their average, which is (10 + 5)/2 = 7.5, from each of them:

c1 = 10 − 7.5 = 2.5 and c2 = 5 − 7.5 = −2.5.

Now Σ cj = 0. For convenience, we can turn these coefficients into integers by dividing each one by 2.5, giving us the familiar values of c1 = +1 and c2 = −1. The coefficients for groups not entering into this comparison are 0. If you look over these calculations, you can see that the particular starting points (10 and 5) were not critical. What was important was that we chose different values for the two groups.
Now consider a more complex prediction involving several groups. Suppose for our study of hyperactive boys, we predicted that both drugs would increase reading comprehension on the test, and that drug A would increase performance more than drug B, say by 7 points instead of 2 points. Both of these values are relative to the control group. We start the process by choosing a convenient value for the control group mean, for example, 10 (this value actually does not affect our result). We would then expect means of 10, 17, and 12 for the control, drug A, and drug B groups, respectively. These values are our starting point in constructing customized coefficients. In step 2, we subtract their average, (10 + 17 + 12)/3 = 13, to get

c1 = 10 − 13 = −3, c2 = 17 − 13 = +4, and c3 = 12 − 13 = −1.
These coefficients are maximally sensitive to the hypothesized pattern. If the actual means are such that drug A is 7 points above the control and drug B is 2 points above, then SSψ will equal SSA, fully capturing the observed variability among the three groups. Problem 4.6 demonstrates this property.
When you apply the procedure for finding custom coefficients to these means, you will discover that these values will be detected by a contrast with the coefficient set {6, 1, −4, 1, −4} (see Problem 4.5):

ψfit = 6μ1 + μ2 − 4μ3 + μ4 − 4μ5.
7 See, for example, Rosnow and Rosenthal (1995) and the commentaries by Abelson (1996), Petty, Fabrigar, Wegener, and Priester (1996), and Rosnow and Rosenthal (1996). An excellent discussion of the procedure is in Levin and Neumann (1999).
This contrast does not look like any of the conventional forms we presented above, but it embodies our particular predictions about the study. If we can reject the hypothesis that it is zero, then we know that the means agree, in some part, with our predictions. Our next job is to determine just how well the prediction fits.
A theoretical prediction in a multigroup experiment involves many distinct aspects. With the five groups in the McGovern experiment, there are 10 different pairwise comparisons that could be written, and the theory makes assertions about all of them. It is possible for ψfit to be significantly different from zero (a good thing), yet for the theory to fail badly in some important respects. For example, we might find that even though the data looked something like the prediction, our assumption that μ2 = μ4 or that μ3 = μ5 was wrong. We need a way to look for such failures of our expectations. If we find them, then we would need to qualify the theory or reject it altogether. What we need is a second test that determines the extent to which the theory fails. We obtain it by looking at the variability of the means that is not accounted for by ψfit. More specifically, we calculate the omnibus sum of squares SSA and the sum of squares SSfit for our custom contrast, and then take the difference between them:

SSfailure = SSA − SSfit. (4.16)

This difference reflects any variation among the means that is not part of our composite prediction. If SSfailure is small, then the theory is in good shape; if SSfailure is large, then something is wrong with it.
We will illustrate the difference between fit and failure with a simple example. Suppose we had a theory that predicted the pattern of means μ1 = 12, μ2 = 9, and μ3 = 9. As you can verify, this pattern leads to the contrast ψfit = 2μ1 − μ2 − μ3. Now consider two outcomes of an experiment. In the first, the means exactly agree with the predictions: Ȳ1 = 12, Ȳ2 = 9, and Ȳ3 = 9. These values lead to ψ̂fit = 6. In the second, the means are Ȳ1 = 12, Ȳ2 = 6, and Ȳ3 = 12. Again ψ̂fit = 6, but clearly these means are far less concordant with the theory than the first set. The difference in outcome is reflected by SSfailure, which is small (actually zero) in the first case but substantial in the second.
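A small sketch reproduces this comparison numerically. The group size n = 5 is our assumption, purely for illustration (SSfailure enters only through a common factor of n, so the contrast between the two outcomes is unaffected):

```python
def contrast_ss(means, coeffs, n):
    """Sum of squares for a contrast (Equation 4.5), equal group sizes."""
    psi_hat = sum(c * m for c, m in zip(coeffs, means))
    return n * psi_hat ** 2 / sum(c ** 2 for c in coeffs)

def between_ss(means, n):
    """Omnibus between-groups sum of squares from the treatment means."""
    grand = sum(means) / len(means)
    return n * sum((m - grand) ** 2 for m in means)

coeffs = [2, -1, -1]   # psi_fit for the predicted pattern 12, 9, 9
n = 5                  # assumed group size, for illustration only

for means in ([12, 9, 9], [12, 6, 12]):
    ss_fit = contrast_ss(means, coeffs, n)
    ss_failure = between_ss(means, n) - ss_fit
    print(means, round(ss_fit, 2), round(ss_failure, 2))
```

Both outcomes give the same SSfit, but SSfailure is 0 for the first set of means and large for the second, just as the text describes.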
We test SSfailure in the same way that we test any other effect. The degrees of freedom are obtained by subtraction:

dffailure = dfA − dffit. (4.17)

Here, the omnibus sum of squares has a − 1 degrees of freedom and dffit = 1, so dffailure = (a − 1) − 1 = a − 2. We divide the sum of squares by its degrees of freedom to get a mean square, then calculate an F ratio as usual:

MSfailure = SSfailure / (a − 2) and Ffailure = MSfailure / MS_S/A.

Our hope here is that the result will not be significant, so that we can conclude that nothing other than that predicted by the theory is contributing to the differences among the means.
You will notice that the strategy of testing MSfailure we just described is quite different from the way we usually use hypothesis tests. Instead of hoping to reject the null hypothesis, we want to retain it. When you use this approach, you must keep some
4.7. COMPARING THREE OR MORE MEANS 83
cautions in mind. The first is, of course, that we can never prove a null hypothesis to be true. A significant Ffailure tells us that our theory is in trouble, but a nonsignificant value does not tell us that it is true. Second, the test is not informative unless it has enough power. Lack of power here is a serious problem. Particularly in a large experiment, in which the data are generally in agreement with our predictions, the power of the test of failure may be quite low for interesting alternatives to the theory.⁸ The solution is not to stop after looking at Ffailure, but also to investigate specific hypotheses that represent important ways that the theory could be incorrect. For example, we could test the McGovern data to see whether groups a2 and a4 differ and whether groups a3 and a5 differ using the contrasts

ψ = μ2 − μ4 and ψ = μ3 − μ5.

You will note that these contrasts are orthogonal both to ψfit and to each other, so they pick up distinct parts of the variability of SSfailure. They are much more specific than the overall failure test, and they are also much easier to interpret.
Let us summarize our recommendations for the use of composite contrasts derived from theory. First, you should determine ψfit and show that it is significantly nonzero and in the correct direction. Next, you should subtract its sum of squares from SSA and establish that the remaining variation, SSfailure, is not significant. Finally, you should follow up this result by testing contrasts orthogonal to ψfit that probe the particular places where the theory is vulnerable. You will also need these tests to determine what went wrong when the test of SSfailure is significant. The combination of these tests (one significant and the others not) provides the best support for theoretical predictions.
The simplest way to create a sum of squares to test this hypothesis is to ignore the control group and treat the three incentive groups as if they constituted an independent experiment. A between-groups sum of squares captures the variability among these means. More specifically, there are three conditions in this calculation (levels a1, a2, and a3). The treatment sums are A1, A2, and A3, the new grand sum is T′ = A1 + A2 + A3, the effective number of groups is a′ = 3, and the sample size is still n. The computational formula for the sum of squares representing variation among these three conditions (Equation 2.14, p. 30, with substitutions) is

SS_A123 = (A1² + A2² + A3²)/n − (T′)²/(a′n),
where the subscript "123" refers to the set of means in the analysis. The degrees of freedom are 1 less than the number of means being compared; that is, df_A123 = a′ − 1 = 2. The mean square is formed in the usual way, and the overall MS_S/A from the original analysis serves as the error term, just as it did in the test of contrasts:

MS_A123 = SS_A123 / df_A123 and F_123 = MS_A123 / MS_S/A.

The critical value of F is found in Appendix A.1 using 2 degrees of freedom for the numerator and those of MS_S/A for the denominator (i.e., based on 4 groups, not the 3 being compared).⁹ Problem 4.4 gives an example of this analysis.
This type of analysis is appropriate when the formation of subgroups makes logical sense. If the F is significant, we treat this finding as we would any omnibus test and probe further to determine in what ways the incentive conditions differ among themselves. Any further tests involving the control group would include the individual experimental means. On the other hand, if the F is not significant, we might (with suitable caution about accepting null hypotheses) average the three incentive conditions and compare this combined mean with the mean for the single control condition.
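The subset calculation can be sketched directly from the formulas above. The treatment sums below are hypothetical, and the value MS_S/A = 11.25 is simply a stand-in for the error term that would come from the full analysis:

```python
# Sketch of the subset F test: variation among a' groups, tested against
# the error term from the full analysis. All numbers here are illustrative.
def subset_f(treatment_sums, n, ms_error):
    """Return (F, df_num) for the between-groups test on a subset."""
    t_prime = sum(treatment_sums)
    a_prime = len(treatment_sums)
    ss = (sum(a_j ** 2 for a_j in treatment_sums) / n
          - t_prime ** 2 / (a_prime * n))
    df = a_prime - 1
    return (ss / df) / ms_error, df

f, df = subset_f([40, 70, 95], n=5, ms_error=11.25)
print(df, round(f, 2))   # 2 13.48
```

The critical value would still be read with the denominator degrees of freedom of the full MS_S/A, as the text notes.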
Exercises
4.1. An experimenter is investigating the effects of two drugs on the activity of rats. Drug A is a depressant, and drug B is a stimulant. Half of the subjects receiving either drug are given a low dosage, and half a high dosage. The experimenter also includes a control group that is given an injection of an inert substance, such as a saline solution. In total, 5 groups are formed, each containing n = 4 rats assigned randomly from the stock of laboratory rats on hand. The animals are injected, and their activity is observed for a fixed period. The activity scores for the animals and the average activity scores follow:

               Drug A           Drug B
  Control    Low    High      Low    High
9 On a computer, you would run an ordinary analysis-of-variance program with the data restricted to the groups that you want to compare (for example, groups 1, 2, and 3). You will need to construct the final F test by hand, using the error term from the full analysis.
Predicted effect         ++    +    +    0   …
Mean reaction time Ȳj   543  552  547  554  559  558  565

a. Conduct an overall analysis of variance on these data. What can you conclude? Did the data correspond to the predictions? Remember that a smaller reaction time is a faster response.
b. Create a custom hypothesis on the basis of the prediction, assuming that a plus and a minus speed up or slow down reaction time by the same degree relative to 0 and that two pluses and two minuses have twice the effect of a single plus or minus. What is the relationship among the means, and what coefficients would you use to reflect this relationship?
c. Test the hypothesis that ψfit = 0. What can you conclude from this test?
d. Test for any failure of the theory to predict differences in mean reaction time.
e. What additional tests are important in an evaluation of the theory?
5
Analysis of Trend
A Numerical Example
Consider an experiment designed to test the proposition that subjects learn better when training is distributed or spaced over a period of time than when the training is massed all at once. We could investigate this question with just two groups, one group that receives massed training and another that receives training at a spaced interval. However, this study would provide only a restricted glimpse of the phenomenon. We could say nothing about spacing longer or shorter than that used. Instead we could conduct a more comprehensive study that included conditions with several different spacings. It would provide information about the shape of the relationship between learning and the degree of distributed training, giving a much clearer picture of the spacing phenomenon.
The subject's task is to track a spot moving in a random path around a computer screen, using a control stick to position a circle over it. Each subject has 10 one-minute practice trials learning how to track the spot. The independent variable is the time between practice trials: For one group the trials immediately follow one another (0 seconds spacing, the massed condition); for a second group there is a 20-second pause between trials; for a third group there is a 40-second pause; and for the fourth group there is a 60-second pause. Twenty minutes after the last practice trial, each subject is given a single 30-second test trial, and the dependent variable is the number of seconds that the subject is able to keep on the target. Suppose n = 5 subjects are randomly assigned to each of the four groups. The scores, with some summary statistics, are given in the top part of Table 5.1, and the means are plotted in Figure 5.1. Although there is a general tendency for the tracking scores to increase with the spacing, this is true only up to a point (40 seconds), after which they drop. A trend analysis investigates these observations in a systematic fashion.
Typically, we would begin with the standard analysis of variance to determine whether the variation among the treatment means is significant. This analysis is summarized in the lower parts of the table, and the overall F = 11.07 is significant. Having found that there are differences to explain, we turn to the trend analysis.
[Figure 5.1: Mean tracking score plotted against intertrial interval (0 to 60 sec.).]
Table 5.1: Tracking in four groups with different intertrial intervals: Overall analysis.

Data and summary statistics
                    Intertrial interval (A)
             0 sec.   20 sec.   40 sec.   60 sec.
               a1       a2        a3        a4
                4       18        24        16
                6       13        19        17
               10       15        21        13
                9       11        16        23
               11       13        15        21
Sum (Aj)       40       70        95        90
n               5        5         5         5
Mean (Ȳj)    8.00    14.00     19.00     18.00
Σ Y²ij        354    1,008     1,859     1,684
s_j         2.915    2.646     3.674     4.000
Bracket terms
[Y] = Σ Y²ij = 354 + 1,008 + 1,859 + 1,684 = 4,905
[A] = Σ A²j / n = (40² + 70² + 95² + 90²)/5 = 4,725.000
[T] = T²/(an) = (40 + 70 + 95 + 90)²/[(4)(5)] = 4,351.250
Summary of the analysis of variance
Source   SS                       df    MS        F
A        [A] − [T] = 373.750       3   124.583   11.07*
S/A      [Y] − [A] = 180.000      16    11.250
Total    [Y] − [T] = 553.750      19
* p < .05
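The bracket-term arithmetic in Table 5.1 can be reproduced directly from the raw scores; a minimal sketch:

```python
# Reproduces the bracket terms and F ratio of Table 5.1.
groups = {
    0:  [4, 6, 10, 9, 11],
    20: [18, 13, 15, 11, 13],
    40: [24, 19, 21, 16, 15],
    60: [16, 17, 13, 23, 21],
}

scores = [y for ys in groups.values() for y in ys]
a, n = len(groups), 5
Y = sum(y ** 2 for y in scores)                       # [Y]
A = sum(sum(ys) ** 2 for ys in groups.values()) / n   # [A]
T = sum(scores) ** 2 / (a * n)                        # [T]

ss_a, ss_sa = A - T, Y - A
f = (ss_a / (a - 1)) / (ss_sa / (a * (n - 1)))
print(round(ss_a, 3), round(ss_sa, 3), round(f, 2))   # 373.75 180.0 11.07
```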
1 More technically, the line was found by minimizing the sum of squared deviations between the treatment means Ȳj and corresponding points on the line. We will discuss this least squares criterion in Chapter 15. We will show how to find this line later in this section.
5.1. ANALYSIS OF LINEAR TREND 91
independent variable. The data plotted in Figure 5.1 are moderately well described
by a linear function. However, the linear function does not fit the data perfectly. We
will consider this observation after we complete our discussion of linear trend.
Calculating the Linear Trend Coefficients. The test for linear trend is an application of the tests of single-df comparisons in the last chapter. More specifically, we start with a set of coefficients, which in this case reflects linear trend; next, we determine the degree to which these coefficients describe the relationship among the means in our experiment; and finally, we test the significance of this relationship. What coefficients should we use? Because a linear trend implies a difference between the two end points, we could simply assess their difference, using the coefficients {−1, 0, 0, 1}. Alternatively, we could use sets that concentrate on other differences that are also implied by the presence of linear trend, for example, {−1, −1, 1, 1} or {0, −1, 1, 0}. In each case, we would expect to find a significant contrast if linear trend is present.
A better way of assessing linear trend is to use a set of coefficients that mimics a straight line. We can do this by creating a set of customized coefficients, using the steps we explained on pages 80-81. We start with a set of temporary coefficients (Xj) that reflect a linear function, for example, the group numbers {1, 2, 3, 4} or the actual values for the independent variable {0, 20, 40, 60}. We will use the latter set. These temporary coefficients are not satisfactory because they do not satisfy the criterion that contrast coefficients must sum to zero (Equation 4.3); that is, Σ Xj = 0 + 20 + 40 + 60 = 120. We correct this problem by subtracting from each value of Xj the mean of the entire set of values, X̄ = 120/4 = 30. The result is a set of valid coefficients that do sum to zero. We will refer to the deviations of the values of Xj from their mean by xj. For the four conditions, we obtain the values −30, −10, 10, and 30, which can be divided by 10 to give the linear trend coefficients {−3, −1, 1, 3}.
components shortly. You won't want to go through these calculations each time you decide to investigate linear trend. The set of linear trend coefficients we constructed above, along with those for other numbers of treatments, is given in Appendix A.3.
Before leaving the linear trend coefficients, we want to call your attention to two points. First, as a way of assessing the rise or fall of a set of means, the linear trend contrast is preferable to contrasts based on pairwise comparisons such as those given by the sets {−1, 0, 0, 1} or {0, −1, 1, 0}. Its coefficients are more sensitive to the presence of linear trend because they give progressively more weight to the treatment conditions located farther away from the center of the independent variable. The other sets of coefficients only incidentally represent the full extent of the linear trend. Second, the coefficients in Appendix A.3 are appropriate only when there are equal intervals between successive levels of the independent variable. They are appropriate for our example because the levels 0, 20, 40, and 60 are equally spaced (20 seconds between each level), but would not be usable if the levels were 0, 15, 30, and 60 or 0, 10, 45, and 60, because the spacing is uneven. Under these circumstances, you would need to construct your own coefficients using the actual values for the independent variable, as we have illustrated.³
Completing the Analysis. We can now return to the numerical example. From here on, the analysis duplicates the analysis of single-df comparisons considered in the last chapter. We use the linear trend coefficients in conjunction with the general formula for single-df comparisons (Equation 4.5) and calculate a sum of squares that reflects the degree to which the variation among the treatment means follows the straight line. The first step is to calculate ψ̂ using the observed treatment means and the linear trend coefficients c1j:

ψ̂linear = Σ c1j Ȳj = (−3)(8.00) + (−1)(14.00) + (1)(19.00) + (3)(18.00) = 35.00.

We next substitute the relevant values into Equation 4.5 (p. 69):

SSlinear = n ψ̂²linear / Σ c²1j = (5)(35.00²) / [(−3)² + (−1)² + 1² + 3²] = 306.25.

Finally, we test the significance of this comparison with an F statistic:

F = MSlinear / MS_S/A = 306.25 / 11.25 = 27.22.

The F has df_num = 1 and df_denom = 16. It is significant (F.05 = 4.49).
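The same arithmetic, sketched numerically (MS_S/A = 11.25 comes from the overall analysis in Table 5.1):

```python
def trend_ss(means, coeffs, n):
    """Sum of squares for a trend contrast (Equation 4.5), equal n."""
    psi_hat = sum(c * m for c, m in zip(coeffs, means))
    return n * psi_hat ** 2 / sum(c ** 2 for c in coeffs)

means = [8.0, 14.0, 19.0, 18.0]   # group means from Table 5.1
linear = [-3, -1, 1, 3]           # linear trend coefficients
ss_linear = trend_ss(means, linear, n=5)
f = ss_linear / 11.25             # MS_S/A from the overall analysis
print(round(ss_linear, 2), round(f, 2))   # 306.25 27.22
```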
The test just conducted is not sensitive to whether the points go up or down, only to the amount of linear trend. If the means had been 18, 19, 14, and 8, instead of the other way around, we would have obtained ψ̂linear = (−3)(18) + (−1)(19) + (1)(14) + (3)(8) = −35.00 instead of 35.00. When both ψ̂linear and the coefficients are squared in Equation 4.5, we obtain exactly the same SSlinear = 306.25. You can also verify that changing the sign of the coefficients makes no difference, so that you obtain the same sum of squares with the coefficients {3, 1, −1, −3}. To determine the direction, simply look at a plot of the means.
3 Most statistical programs can properly analyze experiments with unequal intervals, but you will
probably have to modify the default behavior of the program.
b0 = Ȳ_T − b1 X̄. (5.3)
From our calculations on page 91, the deviations x_j are −30, −10, 10, and 30.
Substituting these values and the means from Table 5.1 in Equations 5.2 and 5.3 gives
b1 = 0.175 and b0 = 9.50, so the fitted line is Y′_j = 9.50 + 0.175X_j. To draw this line,
we need only connect two of our predicted points. Usually the points from the most
extreme groups are good choices [here (0, 9.50) and (60, 20.00)], but another
choice is the intercept (0, b0) and the means (X̄, Ȳ_T) [here (0, 9.50) and (30, 14.75)].
As a check, you should plot one of the intermediate points to be sure the line passes
through it. Many researchers draw the regression line even when intermediate values
have no meaning. They simply want to display the linear relationship on a graph, to
show how well the straight line describes the data.
Evaluating the Linear Fit

Having detected a significant linear trend, we need to determine if this linear function
provides a complete summary of the relationship between X and Y or if a more
complicated mathematical function is needed. We raised a similar question in Section
4.6 (see pp. 81–83), where a test of a theoretical pattern of the treatment means was
followed by an assessment of the variation missed by that ordering (F_fit and F_failure,
respectively). We use the same approach here. The linear function corresponds to
the fitted model, and the extent to which it fails is obtained by subtracting its sum of
squares from that of the overall effect (Equations 4.16 and 4.17):
SS_failure = SS_A − SS_linear with df_failure = df_A − df_linear.
The analysis is completed with the usual F test:

MS_failure = SS_failure / df_failure and F_failure = MS_failure / MS_S/A.
How does our numerical example fare? Applying the equations above, we find

SS_failure = 373.75 − 306.25 = 67.50,
df_failure = 3 − 1 = 2,
MS_failure = 67.50/2 = 33.75,
F_failure = 33.75/11.25 = 3.00.

With df_num = 2 and df_denom = 16, this result is not significant (F.05 = 3.63).
In addition to a test of this remaining, or residual, variation, it is often revealing to
examine this information in a graph. To do this, we simply calculate the discrepancy
between the actual group means and the group means specified by the regression
equation, that is, Ȳ_j − Y′_j, and plot them in the graph. Using the values that we
calculated previously, the differences are

Ȳ1 − Y′1 = 8.00 − 9.50 = −1.50,
Ȳ2 − Y′2 = 14.00 − 13.00 = 1.00,
Ȳ3 − Y′3 = 19.00 − 16.50 = 2.50,
Ȳ4 − Y′4 = 18.00 − 20.00 = −2.00.
These residual values are plotted in Figure 5.2. They appear to have a simple pattern:
a curve that first rises, then falls. Although the test of SS_failure suggested that these
deviations do not vary more than could be accounted for by accidental variation,
many researchers would want to test for the possibility of curvature, regardless of the
outcome of the F_failure test. We will investigate this possibility next.
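The residuals just listed are easy to reproduce. A sketch of our own, using the fitted values implied by the straight line with intercept 9.50 and slope 0.175 (the values 9.50, 13.00, 16.50, and 20.00 above):

```python
# Residuals of the group means from the fitted straight line
intercept, slope = 9.50, 0.175
xs = [0, 20, 40, 60]                     # intertrial intervals
means = [8.00, 14.00, 19.00, 18.00]      # observed group means
residuals = [m - (intercept + slope * x) for m, x in zip(means, xs)]
print(residuals)   # matches the four differences above
```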
Figure 5.2: Differences in learning between the group means (Ȳ_j) and the group
means associated with the linear regression equation (Y′_j).

5.2 Analysis of Quadratic Trend
idealized form. The linear trend coefficients exhibit a perfect linear trend, a steady
rise of two units from −3 through 3, and the quadratic trend coefficients exhibit a
perfect quadratic trend, a falling of two units downward and a rising of two units
upward. It does not matter that the coefficients reflect a trend opposite to that
exhibited by the data because, as we have seen in the last section, the sum of squares
does not depend on the sign of ψ̂.
As with the analysis of linear trend, the sets of trend coefficients in Appendix
A.3 are based on equal spacing of the levels of the independent variable. If you have
unequal intervals, you must construct sets of orthogonal trend coefficients that are
appropriate for the particular spacings of the levels represented in your experiment.
We will discuss this matter briefly on pages 105–106.
Testing for Quadratic Trend

The procedure for testing the quadratic trend is identical to that for linear trend. We
start with the quadratic coefficients {1, −1, −1, 1} from Appendix A.3. Next, we
estimate the contrast using Equation 4.4:

ψ̂_quadratic = Σ c_2j Ȳ_j = (1)(8.00) + (−1)(14.00) + (−1)(19.00) + (1)(18.00) = −7.00.

We then use Equation 4.5 to find the sum of squares:

SS_quadratic = n ψ̂²_quadratic / Σ c²_2j = (5)(−7.00)² / [1² + (−1)² + (−1)² + 1²] = 245.00/4 = 61.25.
The F statistic is F_quadratic = 61.25/11.25 = 5.44, which exceeds F.05(1, 16) = 4.49
and is significant, even though the test of F_failure in Section 5.1 was not. An
explanation of this apparent conflict is that the residual variation
evaluated by F_failure is an average of all remaining orthogonal trend components. In
this case, the significant quadratic trend component was masked by averaging it with
the remaining nonsignificant variation. We will return to this matter in Section 5.3.
Evaluating the Quadratic Fit

If we are attempting to find the simplest function that describes our data, we would
want to know now whether there is any significant residual variation remaining beyond
the linear and quadratic components. We use the same strategy that we used for the
linear trend (p. 94). To find F_failure here, we subtract both linear and quadratic effects
from SS_A,
SS_failure = SS_A − SS_linear − SS_quadratic = 373.75 − 306.25 − 61.25 = 6.25,

and from the degrees of freedom,

df_failure = df_A − df_linear − df_quadratic = 3 − 1 − 1 = 1.

We can now calculate

F_failure = MS_failure/MS_S/A = 6.25/11.25 = 0.56.
Clearly, no trend components beyond the quadratic are present. We conclude that the
relationship between tracking performance (Y) and intertrial interval (X) is quadratic
in form, with a significant linear component.
Plotting the Quadratic Equation

Because we have found the quadratic component to be significant, we know that the
function is more complicated than the linear regression equation that we calculated
above (Equation 5.1). The function that we should plot is quadratic:

Y′_j = b0 + b1X_j + b2X²_j. (5.5)

To use this equation, we must determine values for the coefficients b0, b1, and b2.
Calculating the values of the quadratic regression coefficients by hand is a tedious
and error-prone process. We suggest instead that you use one of the many multiple
regression programs available for small computers. The idea is the following: For each
group determine the values for the three variables X_j, X²_j, and Ȳ_j. Now enter these
numbers as data and use the program to regress Ȳ_j onto X_j and X²_j. The output
is the correct regression equation.5 For our example, the numbers that go into the
program are
     X_j    X²_j     Ȳ_j
a1     0       0    8.00
a2    20     400   14.00
a3    40   1,600   19.00
a4    60   3,600   18.00
5 The same answer is obtained from an analysis of the individual scores Y_ij when the sizes of the
groups are equal, but not when they are unequal; we discuss the difference in Section 14.3. The
function found from the means is more appropriate for experimental designs. If you need to do the
calculations by hand, you can look up the formulas for two-predictor regression and apply them to
these values and obtain the same answer.
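As a sketch of the multiple-regression route (our own illustration, using NumPy's `polyfit` rather than any particular statistics package), fitting a second-degree polynomial to the four group means reproduces the quadratic equation plotted in Figure 5.3:

```python
import numpy as np

x = np.array([0.0, 20.0, 40.0, 60.0])       # X_j
y = np.array([8.00, 14.00, 19.00, 18.00])   # group means, Y-bar_j
b2, b1, b0 = np.polyfit(x, y, deg=2)        # coefficients, highest power first
print(b0, b1, b2)   # approximately 7.75, 0.4375, -0.004375
```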
Figure 5.3: The quadratic regression function Y′_j = 7.750 + 0.4375X_j − 0.004375X²_j
fitted to the data from Table 5.1.
5.3 Higher-Order Trend Components

We have now tested for trend using the linear function Y = b0 + b1X and the quadratic
function Y = b0 + b1X + b2X². Their use does not depend on either the significance
of the omnibus F test or on the test of failure to fit following the test for linear trend.
The quadratic function implied by the tests for linear and quadratic trend does
not exhaust the patterns that are possible from a set of four or more means. When
F_failure is significant after including the quadratic term, we know something more is
going on. What to do about it is less clear. The conventional approach is to stick
with polynomial equations. After the quadratic comes the cubic function,

Y = b0 + b1X + b2X² + b3X³, (5.7)

and following that is the quartic function,

Y = b0 + b1X + b2X² + b3X³ + b4X⁴, (5.8)
and so on. These functions allow more complicated shapes than the linear or quadratic
function. The cubic function can rise, level out (or even reverse), then rise again. Each
of these equations has a corresponding trend contrast that determines whether a power
of X of that order is needed. For example, if you found that a test of the cubic trend
contrast gave a significant result, you would know that fitting the data required a
polynomial of at least the complexity of Equation 5.7, i.e., one that included a term
involving X³. You will find that these trend contrasts are tested by several of the
computer packages.
We should point out, however, that the higher trend components are frequently
less useful to the behavioral researcher than are the linear and quadratic components.
Many behavioral studies do not have a theoretical basis strong enough to support
questions any more complex than those tested by the linear and quadratic trends,
that is, slope and curvature (we will see an exception in Figure 5.6). Once those
components have been tested, tests of the other components give little further
information. Compounding this is the fact that the form implied by any of the polynomial
equations is very specific. The quadratic equation in Equation 5.4 has a particular
curved form, a parabola, and there are many other curved functions that don't
exactly agree with it. Even though these functions may be equivalent from a researcher's
point of view, they can give different results from a trend analysis. One function will
require a cubic equation, and another will not. A set of data that curves, then levels
out, is a common example of this problem. Suppose we had continued the spacing
experiment with conditions at 80 and 100 seconds, and the tracking scores remained
at about 18 seconds. The hypothesis of curvature is still important, but a quadratic
function like Equation 5.4 could be rejected in favor of some higher function, such as
a cubic or quartic. The particular values of the trend coefficients would tell us little
in themselves, however. For these reasons, we suggest that you not be too quick to
turn to these higher components. If your questions only concern slope and curvature,
then stop with the quadratic.
Testing for Higher-Order Trends

That being said, how should you investigate these complex trend components?
Mathematically, the only restriction to the number of possible orthogonal trend components
is the number of treatment conditions. When a = 2, only a linear trend can be tested;
when a = 3, linear and quadratic trends are testable; when a = 4, linear, quadratic,
and cubic trends; and when a = 5, linear, quadratic, cubic, and quartic trends. We
stated the equivalent restriction in Section 4.5, when we pointed out that the maximum
number of mutually orthogonal comparisons is one less than the number of treatment
conditions; that is, to df_A = a − 1.
Because the trend contrasts are mutually orthogonal, each can be tested without
interfering with the others. However, it is better to concentrate on the simpler
components and resort to the complex ones only if necessary. One approach is to
work upwards from the simple regression equations to the more complex higher ones.
Start by testing the linear and quadratic components. Then look at the unexplained
variability by calculating and testing SS_failure = SS_A − SS_linear − SS_quadratic, as
described on page 97. If this test is significant, it implies that you need at least a cubic
function (Equation 5.7) to describe your data. So you calculate and test the cubic
trend component. Regardless of whether that test is significant, you conduct another
test of the unexplained variability, this time subtracting the linear, quadratic, and
cubic sums of squares from SS_A. If this test is significant, you need to go on to the
quartic function (Equation 5.8). You continue this process until no more variation is
available to be fitted, either because F_failure is small or because you have extracted
a − 1 components.
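This work-upwards procedure can be sketched as a loop. The sketch below is our own illustration for the a = 4 numerical example; the coefficient sets are the equal-spacing values from Appendix A.3, and significance decisions would still come from an F table.

```python
# Stepwise trend analysis for the a = 4 numerical example
n, ms_error, ss_a = 5, 11.25, 373.75
means = [8.00, 14.00, 19.00, 18.00]
components = {
    "linear":    [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic":     [-1, 3, -3, 1],
}

ss_used, df_used = 0.0, 0
for name, coefs in components.items():
    psi = sum(c * m for c, m in zip(coefs, means))
    ss = n * psi ** 2 / sum(c ** 2 for c in coefs)
    print(f"{name}: SS = {ss:.2f}, F = {ss / ms_error:.2f}")
    ss_used += ss
    df_used += 1
    df_fail = (len(means) - 1) - df_used   # df remaining after this component
    if df_fail > 0:                        # residual (failure) test
        ms_fail = (ss_a - ss_used) / df_fail
        print(f"  failure: F = {ms_fail / ms_error:.2f}")
```

Because the three contrasts exhaust the a − 1 = 3 degrees of freedom, the component sums of squares accumulate to SS_A by the final step.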
The actual calculations follow the steps we outlined for the analysis of the linear
and quadratic components; all that changes is the set of coefficients. Appendix A.3
gives sets of coefficients for components ranging in complexity from linear to quartic
for experiments in which the number of equally spaced treatment conditions ranges
from 3 to 10. More extensive listings can be found in books of statistical tables, such
as Beyer (1968), Fisher and Yates (1953), and Pearson and Hartley (1970).
We will illustrate the test for cubic trend with our numerical example, even though
a test for fit indicates that it was not necessary. We first need the cubic coefficients
c_3j, which we determine from Appendix A.3 to be {−1, 3, −3, 1}. Substituting in
Equations 4.4 and 4.5, we find

ψ̂_cubic = (−1)(8.00) + (3)(14.00) + (−3)(19.00) + (1)(18.00) = −5.00,

SS_cubic = (5)(−5.00)² / [(−1)² + 3² + (−3)² + 1²] = 125.00/20 = 6.25.
The statistic F_cubic = 6.25/11.25 = 0.56 is not significant. Because a = 4, we can
only extract three trend components: linear, quadratic, and cubic. We have already
calculated sums of squares for the first two, finding SS_linear = 306.25 and SS_quadratic =
61.25. Together, we have fully accounted for the variability among the group means,
and there is no residual variability to detect. We can see this by showing that the
sums of squares combine to equal SS_A:

SS_linear + SS_quadratic + SS_cubic = 306.25 + 61.25 + 6.25 = 373.75 = SS_A.
5.4 Theoretical Prediction of Trend Components

Figure 5.4: Response time as a function of the number of items in a memorized list.
Redrawn with permission from S. Sternberg, High-speed scanning in human memory,
Science, 1966, 153, 652–654. Copyright 1966 by AAAS.
should be directly related to the length of the list, a linear function. His findings,
which have been replicated numerous times, are illustrated in Figure 5.4. There is an
unequivocal straight-line relationship between the time to respond and the number of
items in the memorized list, and nothing else.
Another example of the theoretical prediction of trend is the classic study of stimulus
generalization reported by Grant (1956). He classically conditioned a physiological
response (change in skin resistance) to a training stimulus, a 12-inch circle. Once the
response was conditioned, subjects were randomly assigned to a = 7 different groups,
each of which was tested with a different generalization stimulus. One group was tested
with the original circle (12 inches), three were tested with smaller circles (diameters
of 9, 10, and 11 inches), and three were tested with larger circles (diameters of 13,
14, and 15 inches). For theoretical reasons, Grant expected to find both linear and
quadratic components relating the independent variable (size of the testing stimulus)
to the dependent variable (the response to the test stimulus). He predicted that
subjects would tend to respond more vigorously as the size of the testing circle increased
(the linear component). His main interest, however, was in stimulus generalization.
He predicted that the response would be greatest for subjects for whom the testing
stimulus was the training stimulus itself (the 12-inch circle) and would taper off for
subjects tested with circles increasingly different in size (the quadratic component).
Figure 5.5 shows his results. Statistical analysis (see Problem 5.1) revealed that only
the quadratic trend component was significant. We have drawn the best-fitting linear
and quadratic functions on the plot. As you can see, the trend analysis supported the
predictions of stimulus generalization that motivated the study.
Our final example is a study reported by Hayes-Roth (1977). Subjects learned a
set of propositions of the form "Alice bought the cat before Jane sold the painting"
(actually, nonsense syllables were used instead of nouns). After correctly responding
Figure 5.5: Response to a test circle of various sizes after training with a 12-inch
circle (after Grant, 1956). The best-fitting linear and quadratic functions are shown.
Figure 5.6: A complex function requiring higher-order trend components. From
B. Hayes-Roth, Evolution of cognitive structures and processes, Psychological Review,
1977, 84, 260–270. Copyright 1977 by the American Psychological Association.
Adapted by permission.
to the propositions a predetermined number of times (0, 2, 4, 6, 8, 10, 20, or 30
times), they were given a new, but related, set of propositions to master. Hayes-Roth
proposed a theory that predicted a complex relationship between the trials to
learn the new set of propositions (the dependent variable) and the number of correct
responses given to the first set (the independent variable). She predicted two changes
in direction in the function, a characteristic that requires at least a cubic equation.
A trend analysis of her results (plotted in Figure 5.6) showed that a function more
complex than quadratic was necessary.
Each of these examples demonstrates a different aspect of trend analysis. For both
Sternberg and Grant, the linear and quadratic tests were planned comparisons, but
they served different purposes. Sternberg's theory predicted only a linear function,
with the quadratic test serving as a test of deviations from the theory. For Grant, the
test of the quadratic component was his central prediction. The Hayes-Roth study
tested the complex trend component predicted by her theory. In her study, the linear
and quadratic components formed the baseline against which she found additional
components.
5.5 Planning a Trend Analysis
Figure 5.7: Mean weight of the testes of golden hamsters as a function of the number
of hours of light per 24-hour period. A trend analysis is uninformative here,
and the points have simply been connected. Adapted from Figure 7 in Elliott, 1976,
Federation Proceedings, 35. Reproduced with permission of Federation of American
Societies for Experimental Biology; permission conveyed through Copyright
Clearance Center, Inc.
Suppose we were planning an experiment to study this function. We might choose
a = 9 treatment conditions:

• Conditions a1 to a3 are spaced between 0 and 11 hours of light (say, at 0, 5½,
and 11 hours).
• Conditions a4 to a6 are concentrated around the duration when most change is
expected (say, at 11½, 12, and 12½ hours).
• Conditions a7 to a9 are spaced between 13 and 24 hours of light (say, at 13,
18½, and 24 hours).
Elliott's theory suggested we should observe three major findings: (1) a low and
relatively constant weight for the shorter periods of light (0–11 hours), (2) a rapid
increase in weight for the next hour or so, and (3) stability of this increased weight
between 13 and 24 hours.
Each of these predictions can be evaluated by a different and orthogonal statistical
test. The expectation of a constant weight for the shorter periods is assessed with an
omnibus F testing the hypothesis

H01: μ1 = μ2 = μ3.

This hypothesis can be tested by the procedure we described in Section 4.7. Similarly,
the expectation of a constant weight for the longer periods can be assessed with an
omnibus F testing the hypothesis

H02: μ7 = μ8 = μ9.
5.5. PLANNING A TRENO ANALYSIS 105
For the theory to be supported, both of these null hypotheses need to be retained. A
test of the prediction that the testes weigh less during the shorter periods of light
than during the longer periods could be established by rejecting

H03: (μ1 + μ2 + μ3)/3 = (μ7 + μ8 + μ9)/3.
Finally, the rapid increase in weight between the short and long periods of light could
be evaluated by an omnibus F test involving the relevant conditions,

H04: μ4 = μ5 = μ6,

or, even better, by a test for linear trend also based on these conditions. At the
expense of orthogonality, we might choose to include in this trend analysis the two
conditions at the beginning and end of the expected rapid increase in weight, giving
us five points, a3 through a7, spaced at half-hour intervals.
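To make the structure of these hypotheses concrete, here is a sketch of our own showing the contrast weights for H03 over the a = 9 conditions, with a check that the weights form a legitimate contrast and are orthogonal to a comparison made entirely within the short-day subset (one piece of the omnibus test of H01):

```python
# Contrast weights over the a = 9 conditions for H03:
# mean of the short-day groups versus mean of the long-day groups.
c_h03 = [1/3, 1/3, 1/3, 0, 0, 0, -1/3, -1/3, -1/3]

# A comparison made entirely within the short-day subset.
c_within = [1, -1, 0, 0, 0, 0, 0, 0, 0]

assert abs(sum(c_h03)) < 1e-12                       # weights sum to zero
dot = sum(u * v for u, v in zip(c_h03, c_within))
assert abs(dot) < 1e-12                              # orthogonal to H03
```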
Figure 5.8: Upper graph: Intervals spaced according to actual time values. Lower
graph: Intervals spaced evenly.
Finding quadratic coefficients is more difficult because of the need to make them
orthogonal to the linear coefficients, and we do not have space to describe the
procedures here. If you need these coefficients, you can consult one of the first two editions
of this book (Keppel, 1973, 1982) or one of several other advanced texts (e.g., Kirk,
1995, Appendix D; Myers & Well, 1991, Appendix 7.2). Conveniently, many computer
programs will accommodate unequal intervals, typically by using the General Linear
Model testing procedures described in Chapter 14.
However, in a few more complicated cases, you will need to use nonlinear regression or
some other model-fitting procedure. You may want to seek technical help.
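For readers with access to a language like Python, the construction for unequal intervals can be sketched by orthogonalizing the powers of X. This is our own illustration, not the book's procedure; the function name and scaling are arbitrary (published tables rescale the coefficients to small integers).

```python
import numpy as np

def trend_contrasts(levels, max_order=None):
    """Orthogonal polynomial trend coefficients for possibly
    unequally spaced levels, by orthogonalizing 1, X, X^2, ..."""
    x = np.asarray(levels, dtype=float)
    k = (len(x) - 1) if max_order is None else max_order
    powers = np.vander(x, N=k + 1, increasing=True)  # columns 1, x, x^2, ...
    q, _ = np.linalg.qr(powers)                      # Gram-Schmidt, in effect
    return q[:, 1:].T                                # drop the constant column

c = trend_contrasts([0, 15, 30, 60])   # an unequally spaced example
# Every set sums to zero, and the sets are mutually orthogonal:
print(np.round(c @ np.ones(4), 10))
print(np.round(c @ c.T, 10))
```

With equally spaced levels the same routine reproduces the tabled coefficient patterns up to scale and sign.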
All three sets of means are monotonic, but the shapes of the functions are very
different. The evenly spaced coefficients would not be equally sensitive to each of them. In
fact, these coefficients are most sensitive to a set that increases regularly, say
10, 12, 14, 16, 18.

6 For a good discussion of this approach, including ancillary testing strategies, see Braver and
Sheets (1993).
Exercises
The mean square error is MS_S/A = 21.83.
a. Can you detect an upward or downward linear trend?
b. Is the function curved?
5.3. Theorists have postulated a complicated relationship between anxiety level and
performance on a complex task. To investigate the nature of this relationship, an
investigator measured the backward memory span (a rather difficult task) of subjects
who were given special instructions designed to create various degrees of anxiety.
There were six sets of instructions, from low-anxiety instructions in a1, through
intermediate levels, to high-anxiety instructions in a6. There were five subjects in each
of the anxiety levels. The span lengths obtained were as follows:
a1  a2  a3  a4  a5  a6
 1   2   1   3   2   3
 2   4   2   3   3   1
 5   3   4   2   4   0
 1   1   2   4   5   5
 1   2   3   3   3   2
a. Perform an overall analysis of variance on these data and draw appropriate
conclusions.
b. Conduct a trend analysis using the fit-failure procedure, beginning with the linear
component and ending when the test of the residual variability F_residual is no longer
significant. Again draw conclusions.
c. Suppose you questioned the assumption that the levels of anxiety were equally
spaced and decided to test for the presence of monotonic trend. Is a monotonic
trend present? Does it completely describe the outcome of this study?
5.4. It is often found in motor-skill research that practice that varies in nature is
more effective than practice that is all of one type. One such experiment looks at
the efficacy of practice in basketball free-throw shooting using practice with balls of
different weights. One group of six subjects practices free throws using a standard-weight
ball. A second group practices using two balls of different weights, and a third
practices with balls of four different weights. After four weeks of practice sessions,
each subject was given 10 chances to shoot free throws using a ball with a weight that
had not been used before. The number of successful shots was recorded:
1 ball  2 balls  4 balls
   7       4        5
   8      10       10
   2       7        6
   7       6        9
   3       8        6
   8       6        7
a. Is there a significant effect of the number of training balls on the mean number
of successful shots?
b. Examine the linear and nonlinear effects, taking into consideration the unequal
intervals.
6
Simultaneous Comparisons and the Control of Type I Errors
with a confusingly large number of alternatives.1 An important step in distinguishing
among them is to understand the different ways that contrasts are used in the typical
experiment. We look at several possibilities in the next section, and will organize our
later discussion around those distinctions.
A look at Equation 6.1 reveals that any change in the size of the per-comparison
error rate α also changes the familywise error rate α_FW. For example, if you set α = .01
for c = 3 independent tests, the familywise error rate falls below 5 percent:

α_FW = 1 − (1 − .01)³ = 1 − (.99)³ = 1 − 0.970 = 0.030.

Thus, if you were interested in limiting the familywise error to 5 percent for these tests,
you could do so by setting α to some intermediate value between α = .05 (α_FW = 0.143)
and α = .01 (α_FW = 0.030). By reducing the chance of a Type I error on every test
(assuming the omnibus null hypothesis is true), you also reduce the chance of making
a Type I error within the family of tests. This idea of controlling familywise error by
reducing per-comparison error α underlies all of the procedures we consider.
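These calculations are easy to script. The sketch below is our own (the function names are arbitrary); it includes the Šidák-style inversion that finds the per-comparison α yielding a desired familywise rate for c independent tests.

```python
def familywise_alpha(alpha_pc, c):
    """Familywise error rate for c independent tests at per-comparison alpha."""
    return 1 - (1 - alpha_pc) ** c

def per_comparison_alpha(alpha_fw, c):
    """Per-comparison alpha giving the desired familywise rate (Sidak)."""
    return 1 - (1 - alpha_fw) ** (1 / c)

print(round(familywise_alpha(0.05, 3), 3))       # 0.143
print(round(familywise_alpha(0.01, 3), 3))       # 0.03
print(round(per_comparison_alpha(0.05, 3), 4))   # 0.017
```

The last value is the "intermediate" α mentioned above: testing each of the three comparisons at roughly α = .017 holds the familywise rate at 5 percent.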
Many schemes to correct the familywise error rates have been devised. They differ
in the size and nature of the family that is to be investigated and the status of the
research questions. Our goal here is to introduce you to the most common methods and
to give you a basis for choosing among them. Because the appropriate way to control
familywise error depends on the role of a given test in the general plan of analysis, we
will look first at how analytical comparisons are used in the typical experiment, then
turn to the specific techniques used to control familywise error.
We will roughly distinguish among three parts in the analysis of the typical
experiment.

1. Testing the primary questions. Experiments have a purpose. They are generally
undertaken to answer a few specific questions, as we illustrated in Chapter 4 and
Chapter 5. The McGovern experiment outlined in Table 4.1 (p. 63) focused on
detecting interference with three different memory components, and the Grant
study summarized in Figure 5.5 (p. 102) was designed to detect increasing trend
and a reversal of the response. Most experiments involve only a few such
contrasts, often only one or two and never more than the degrees of freedom for the
effect. Because these questions are the principal point of the study, they are best
considered in isolation. Each is a major reason why the study was conducted,
so each is tested as the unique member of its family. To give these tests the
maximum power, their per-comparison error rate is fixed at α, and no other
adjustment is made.
2. ... are often treated as separate entities, each given its own allocation of
familywise error.

3. Exploring the data for unexpected relationships. Finally, a researcher may want
to examine all of the data, looking for any interesting or suggestive relationships
among the groups. The pool of potential tests in this case is the entire collection
of contrasts that could be written. Any contrast that can be measured might
for most researchers. Because this post-hoc search could, potentially, find any effect,
every conceivable contrast is part of the family, even though only a few are formally
tested.

As might be expected, the stringency with which the error rate should be
controlled differs in the three situations. Planned tests bring the largest amount of extra
6.2 Planned Comparisons

not be what you intend. Always be sure you know what your program is doing!3

3 Strictly, requiring an omnibus F test first makes this procedure a form of sequential testing,
which we discuss in Section 6.4.

6.3 Restricted Sets of Contrasts
This statistic exceeds its critical value F.05(1, 24) = 4.26 and is significant. In a
situation like this, where the extra group is incidental, it makes more sense to ignore
the omnibus F test and test the planned comparison.
α_FW = 1 − (1 − α)^c.

We can't apply this formula directly because our tests are almost never mutually
orthogonal. For example, the sets of contrasts comparing each of several experimental
groups to a control group are not orthogonal because the same control group is used
for each. However, it turns out that independent tests have the largest familywise
error, and so, for any set of tests,
6.4 Pairwise Comparisons

Researchers often want to examine all pairwise comparisons, that is, to compare each
group with every other group, a family containing a total of a(a − 1)/2 comparisons.
For example, there are (5)(5 − 1)/2 = 10 pairwise comparisons among a = 5 groups.
Although the Bonferroni or Šidák-Bonferroni procedures can be used here, the large
number of comparisons forces them to use smaller α values (producing lower power)
than techniques specifically developed for this situation. We will discuss three of
these in detail, covering procedures that are both widely applied and conceptually
different from each other, and note some variants more briefly. For details of these
other methods, see Seaman, Levin, and Serlin (1991) or Toothaker (1991).
Table 6.1: Mean comfort ratings for five different headache remedies.

                   a1      a2      a3      a4      a5
                   16      10       9       9      11
                   11       5       3      12       9
                   22       8       4      12      10
                    8      12       8       4      13
                   15       0       8       5      13
                    6      16       6      10      12
                   16       8      13       7      11
                   15      12       5      12      18
                   12      10       9       9      14
                   11       6       7       8      12
Means (Ȳ_j)     13.20    8.70    7.20    8.80   12.30
Std. Dev. (s_j)  4.59    4.42    2.90    2.86    2.50
Analysis of variance

Source     SS        df    MS       F
A          264.920    4    66.230   5.22 *
S/A        571.000   45    12.689
Total      835.920   49
* p < .05
To use the Studentized range table you need to know the number of groups in the
study, designated by the subscript attached to q_a, the value of df_error, and the desired
level of error control α_FW. Any pair of groups whose means differ by more than this
amount is significantly different.

As an example, suppose a researcher wants to determine whether there are any
differences among a = 5 remedies for headaches. Students with headaches reporting
to a health clinic are randomly assigned to the five conditions (n = 10 per condition).
Eight hours after taking the remedy, the students rate their feeling of physical comfort
on a scale from 0 (severe discomfort) to 30 (normal comfort). Table 6.1 gives the data,
with the summary statistics and omnibus analysis of variance. Given the absence of
any theoretical expectations, this experiment is ideally suited to a systematic
examination of the 10 pairwise contrasts. We will use Tukey's test to control the familywise
error over these 10 contrasts at α_FW = .05.

The first step is to calculate D_Tukey. We start by looking up the Studentized range
statistic in Appendix A.6 for a = 5 groups with df_S/A = 45 and α_FW = .05. The tabled
value is q5 = 4.02. Then using Equation 6.7, we calculate the critical difference:
D_Tukey = q5 √(MS_S/A / n) = 4.02 √(12.689/10) = 4.53.
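If you do not have the Studentized range table at hand, the critical value can be computed. A sketch of our own, assuming SciPy 1.7 or later (which provides `scipy.stats.studentized_range`):

```python
import math
from scipy.stats import studentized_range

a, n = 5, 10                  # groups and subjects per group
df_error, ms_error = 45, 12.689
q = studentized_range.ppf(0.95, a, df_error)   # the tabled q5 is 4.02
d_tukey = q * math.sqrt(ms_error / n)
print(round(q, 2), round(d_tukey, 2))
```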
To determine whether any two means are different using this diagram, you locate
the two groups in question and check whether they are covered by a single line. For
example, groups a2 and a5 do not differ, because the second line covers them both,
while groups a3 and a5 do differ, because no one line covers them.

To construct one of these diagrams, you first list the groups in order of their means
(as we did in Table 6.2), then test the most extreme groups first. In our example, the
most extreme pair of groups is a3 and a1. These means differ significantly, so they are not
members of an equivalent subset. We now look at the pairs that are the next most
distantly separated and span four groups, namely the pair a3 and a5 and the pair a2
and a1. The first of these pairs is still significant, but the second pair is not. Because
6.4. PAIRWISE COMPARISONS 123
a2 and a1 do not differ significantly, they are members of an equivalent subset, which
we indicate by drawing a line under them:
Note that the means of the groups within this underlined subset can never differ
significantlyany pair within the subset (e.g., az and a5 or a4 and a5) must differ
less than the groups at the end (a 2 and a1), and these were not significantly different.
We continue the testing procedure by looking at the pairs that span sets of three
groups. Two of these sets (the pair az and a5 and the pair a4 and a1) fall within the
equivalent set we just found, and won't differ significantly. The third pair (a 3 and
a 4 ) falls outside the equivalent set and needs to be checked further. The difference
between these means is not significant, which indicates that they are members of
another equivalent subset, and we add a line under them:
If necessary, we could go on to look at the adjacent pairs (a3 and a2, etc.), but the
diagram shows that all of these are already included in an equivalent subset and need
not be checked further.
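The construction procedure just described is mechanical enough to sketch in code. The function below is a minimal illustration, not the book's algorithm verbatim: it assumes a single fixed critical difference, as in Tukey's test, using D_Tukey = 4.02 √(12.689/10) ≈ 4.53 computed from the Table 6.1 values.

```python
def equivalent_subsets(means, d_crit):
    """Maximal runs of mean-ordered groups whose extreme members differ
    by less than d_crit; each run corresponds to one underline."""
    ordered = sorted(means, key=means.get)  # group labels, ascending means
    subsets = []
    for i in range(len(ordered)):
        j = i
        # extend the run while the endpoints still fail to differ significantly
        while j + 1 < len(ordered) and means[ordered[j + 1]] - means[ordered[i]] < d_crit:
            j += 1
        run = tuple(ordered[i:j + 1])
        if len(run) > 1 and not any(set(run) <= set(s) for s in subsets):
            subsets.append(run)
    return subsets

means = {"a1": 13.20, "a2": 8.70, "a3": 7.20, "a4": 8.80, "a5": 12.30}
print(equivalent_subsets(means, 4.53))
# [('a3', 'a2', 'a4'), ('a2', 'a4', 'a5', 'a1')]: the two underlines in the text
```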
You may see these subsets presented in another form. In it, members of the same
equivalent subset are marked by a superscript or other symbol. For example, the
results shown by the line diagram above are indicated by

a3^a, a2^ab, a4^ab, a5^b, a1^b.

Here the superscript a refers to those groups covered by the first line above, and
the superscript b to those covered by the second line. Any two groups that share a
superscript are not significantly different, while any two groups that do not have a
common superscript are significantly different. Thus, a1 and a3 are significantly different
because they have different superscripts, b and a, respectively, while a1 and a5 are not
significantly different because they share the superscript b.
How do we interpret the results of this analysis? If your interest is in determining
which pairs differ significantly, you can locate them in Table 6.2. Only the pairs
(a1, a3) and (a5, a3) differ. If you want to establish the most effective treatment,
the equivalent subsets are more revealing. The second band drawn below the ordered
treatment conditions indicates that none of the four best headache remedies (a1, a5,
a4, or a2) emerged as significantly more effective than the others.
When the sample sizes are unequal or the variances are heterogeneous, Equation
6.7 should not be used. Under either of these circumstances, the best solution is to
replace MS_S/A/n in Equation 6.7 with a value that takes into account the unequal
samples or the unequal variances. For unequal sample sizes, Kramer (1956) suggests
replacing the n in Equation 6.7 with a special average of the individual sample sizes,
called the harmonic mean and written ñ. For example, to compare group a1 with
n1 subjects to group a2 with n2 subjects, we calculate the harmonic mean of their
sample sizes by6
6 The reciprocal of a harmonic mean (1/ñ) is the average of the reciprocals of its parts (1/nj).
124 CHAPTER 6. SIMULTANEOUS COMPARISONS
ñ = 2 / (1/n1 + 1/n2).
This number is used in place of n in Equation 6.7. Each pair of groups has a different
critical difference because of different combinations of sample sizes. This analysis is
in agreement with the analysis of unequal samples in Section 3.5.
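The Tukey-Kramer substitution is a one-line change. In the sketch below, the sample sizes are hypothetical; the q and MS_S/A values are carried over from the headache example.

```python
import math

def harmonic_mean_n(n1, n2):
    """Harmonic mean of two sample sizes: 2 / (1/n1 + 1/n2)."""
    return 2 / (1 / n1 + 1 / n2)

# Hypothetical unequal groups of n1 = 8 and n2 = 12 subjects
n_tilde = harmonic_mean_n(8, 12)
print(round(n_tilde, 2))  # 9.6

# Tukey-Kramer: use n_tilde in place of n in Equation 6.7
d_kramer = 4.02 * math.sqrt(12.689 / n_tilde)
```

Note that each pair of groups gets its own ñ, and therefore its own critical difference.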
When the variances are heterogeneous, the critical differences must take into ac-
count the actual variances of the particular groups being tested. The common standard
error of the mean, MS_S/A/n, is replaced by an average of two squared standard errors
of the individual groups being tested (Games & Howell, 1976). The critical difference,
again for groups a1 and a2, is

D_Games-Howell = qa √[(s1²/n1 + s2²/n2)/2].
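The Games-Howell substitution can be sketched as follows. The standard deviations and sample sizes come from Table 6.1, but the q value passed in is illustrative only; in practice it must be looked up at the appropriately adjusted degrees of freedom for the two groups being compared.

```python
import math

def games_howell_difference(q, var1, n1, var2, n2):
    """Critical difference with the pooled MS_S/A / n replaced by the average
    of the two groups' squared standard errors (Games & Howell, 1976)."""
    avg_sq_se = (var1 / n1 + var2 / n2) / 2
    return q * math.sqrt(avg_sq_se)

# Groups a1 and a2 from Table 6.1: s1 = 4.59, s2 = 4.42, n = 10 each.
# The q of 4.02 is a stand-in; the proper value uses the adjusted df.
d = games_howell_difference(4.02, 4.59**2, 10, 4.42**2, 10)
print(round(d, 2))  # 5.73
```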
Obviously, the Newman-Keuls procedure is much more complex than the other
procedures. On the positive side, the changing sequence of criteria gives it more
power, which makes it attractive to many researchers. On the negative side, this
increase in power often comes with an increase in the familywise Type I error rate.
Simulation studies show that when only some of the group means differ (a state of
affairs more common than not in research), the error rate for the pairs that
do not differ is considerably greater than the nominal αFW, in some cases as much as
double the value it should be (Ramsey, 1981, 1993; Seaman et al., 1991). What goes
wrong is that rejecting the false null hypotheses lets the critical difference shrink too
rapidly. In many typical studies, a researcher who is trying to control the error rate at
5 percent with the Newman-Keuls test has an actual error rate much closer to 10 percent
for those pairs of means that do not differ.10
There is a better solution to the inflation of the error rate in the sequential tests
than to tolerate an unknown increase in αFW. Ryan (1960) suggested a modification of
the Newman-Keuls procedure in which the critical difference shrinks a little less rapidly
over the sequence of tests. With further improvements by Einot and Gabriel (1975) and Welsch
(1977), the procedure holds the error rate close to its nominal level, even when some of
the population means differ and some do not. The steps in this Ryan-Einot-Gabriel-
Welsch (REGW) procedure are complicated enough that we will not describe how
it works. However, it poses no difficulties for a computer, and it is now an option in
some of the packages. It is substantially superior to the Newman-Keuls procedure.
Recommendations
We have presented several ways to test the complete set of pairwise differences. They
differ in their complexity and in the degree to which they maintain the familywise
error at the chosen value of αFW. Tukey's procedure, which uses a single criterion for
all differences, is the most accurate in its control of familywise Type I error. Use it
9 If we had applied the last critical range to all the two-step pairs, we would have rejected equality
for the difference Ȳ5 − Ȳ4 = 3.50. But because both a4 and a5 fall in the equivalent set from a2 to a5
(which we established in an earlier step), we cannot declare them different. If we did not respect the
sequential nature of the Newman-Keuls test, we would end up with the incoherent result that μ2 does not
differ from μ5 but that μ4, which falls between them, is different from μ5.
10 The Newman-Keuls test is not the worst of the sequential tests that many programs provide.
Some inflate the Type I error rate even more than the Newman-Keuls; Seaman et al. (1991) found
them to be over three times larger than their purported level.
when you want a simple, easily defensible procedure that maintains the familywise
error within the chosen level.
The sequential procedures are more subtle. On the one hand, they have greater
power, but on the other, they control the familywise error less well when only some
of the groups have different means. The Fisher-Hayter procedure achieves a modest
increase in power without much weakening of its error control. We recommend it.
Although the Newman-Keuls procedure fails to control the familywise error rate at
the chosen level, it is too well entrenched in practice to ignore. However, we do not
recommend that you use it, and when you see it in research reports, you should take
account of its weaker level of familywise error control when you interpret the findings.
If you want the increase in power associated with this type of sequential testing and
the Ryan-Einot-Gabriel-Welsch procedure is available on your computer, you can use
it with more confidence that your familywise error rate is properly controlled.
criterion Fα(1, df_S/A) obtained from Appendix A.1. Scheffé's procedure replaces this
critical value by

FScheffé = (a − 1)FαFW(dfA, df_S/A).   (6.11)
6.5. POSTHOC ERROR CORRECTION 129
The value of FαFW(dfA, df_S/A) in this formula, also obtained from Appendix A.1, is
the one that would be used to test for an omnibus effect at the αFW level. Note that
it uses dfnum = a − 1 instead of dfnum = 1. Unless there are only two groups (and
only one comparison), the value of FScheffé is larger than the uncorrected value. For
example, with a = 5 groups and n = 10 subjects per group, the uncorrected criterion
F.05(1, 45) is 4.06, but Scheffé's critical value at the same level of significance is

FScheffé = (a − 1)FαFW(dfA, df_S/A) = (5 − 1)F.05(4, 45) = (4)(2.58) = 10.32.

This substantially larger value makes the test take account of the many possible
contrasts that could have been tested.
Scheffé's procedure is easy to apply when the contrast has been tested by a t test
instead of an F. The Scheffé critical value for the t statistic is equal to the square
root of Equation 6.11:

tScheffé = √FScheffé = √[(a − 1)FαFW(dfA, df_S/A)].   (6.12)

For example, applied to a study with 5 groups of 10 subjects each, we take the square
root of the value we calculated above, giving tScheffé = √10.32 = 3.21. Approach-
ing the problem through the t test lets one accommodate unequal sample sizes and
heterogeneous variances, as we will discuss in Section 7.5.
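Both Scheffé critical values can be computed in a few lines. The sketch below uses only the tabled value F.05(4, 45) = 2.58 given in the text; nothing else is assumed.

```python
import math

def scheffe_criteria(a, f_alpha_omnibus):
    """Scheffé critical values from Equations 6.11 and 6.12, given the
    omnibus criterion F_alpha(df_A, df_S/A)."""
    f_crit = (a - 1) * f_alpha_omnibus
    return f_crit, math.sqrt(f_crit)

# a = 5 groups, F_.05(4, 45) = 2.58 from Appendix A.1
f_s, t_s = scheffe_criteria(5, 2.58)
print(round(f_s, 2), round(t_s, 2))  # 10.32 3.21
```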
The way that the Scheffé criterion is chosen links it closely to the omnibus test:
When the omnibus F is significant, there is at least one contrast 'ljJ that
is significant by Scheffé's criterion; when the omnibus F is not significant,
no contrast is significant by Scheffé's criterion. 11
This connection between the omnibus test and the corrected test for the contrast is
an attractive feature of Scheffé's procedure. However, it is a strong requirement, and
the test is very conservative, with less power than the other methods.
Because it controls error over the set of all simple and complex contrasts, Scheffé's
procedure is poorly suited to an examination of pairwise differences alone. Applied to
our example of the headache remedies, it finds that only the largest of the differences
(Ȳ1 − Ȳ3 = 6.00, for which F = 14.19) exceeds the Scheffé criterion of FScheffé = 10.32.
We do not recommend using it for a study such as this one, where the emphasis
is on pairwise differences. The Scheffé test is more appropriate for an experiment
where complex comparisons involving posthoc relationships among many groups are
easier to postulate. Look back at the McGovern experiment (Section 4.2, especially
Table 4.1, p. 63). Here the groups differed in various complex, but meaningful, ways.
Although we introduced this study to show how the structure of the design leads
to theoretically motivated contrasts, you can imagine that if the observed pattern
of means differed from the initial expectations, an ingenious researcher could devise
other interpretations. These ideas would suggest new contrasts, probably involving
several groups, that should be tested. Such discoveries are how new psychological
theory evolves. However, because these explanations are post hoc, it is advisable to
apply the type of control implied by Scheffé's procedure.
11 You may wonder whether there is a way to find a contrast that is significant by Scheffé's criterion,
given a significant omnibus F. There is. Simply determine a custom contrast based on the observed
pattern of the means, in the manner we described in Section 4.6. Whether it represents a useful or
interpretable combination is another matter.
CHAPTER 6. SIMULTANEOUS COMPARISONS
The error protection provided by the Scheffé procedure is severe, and it is common
for researchers to adopt an increased αFW for such tests (perhaps one greater than
αFW = .10). However, doing so results in an increased chance of a false finding,
particularly because the temptation to pounce on any significant contrast and devise
an interpretation for it is so strong. We suggest that unusual posthoc results be
given most credence when they are consistent with the general pattern of results from
both the current study and previous work. You should put only a little faith in
an inconsistent and isolated posthoc finding. It is often best to suspend judgment
concerning unexpected but exciting findings, especially those frustrating results that
are significant by an uncorrected test but not when subjected to posthoc correction.
This strategy allows you to maintain your standards for evaluating posthoc compar
isons, while not ignoring something that could become a major contribution. The best
course is to try to replicate the finding, either by repeating the original study or by
conducting one that is designed to take a more accurate or comprehensive look at the
novel result.
Exercises
6.1. The contrasts in Problem 4.1 were tested as planned comparisons with no control
of familywise Type I error. Use the .10 level for the following tests:
a. Evaluate each contrast using the Bonferroni procedure.
b. Evaluate each contrast using the SidákBonferroni procedure.
c. Suppose, instead, that these tests were not planned. Evaluate each contrast using
the Scheffé procedure.
6.2. Another possible family of contrasts for the example in Problem 4.1 is the set
comparing the control against each of the other four groups.
a. Evaluate these contrasts using the Dunnett procedure and the .05 level for all tests.
b. By way of comparison, use Equation 6.6 with the standard t statistic to evaluate
them without correction.
6.3. As one more analysis of the example from Problem 4.1, let's consider some of
the correction procedures for pairwise comparisons. (Use the .05 level for all tests.)
a. Use the Tukey method to evaluate the pairwise contrasts from Problem 4.1.
b. Use the FisherHayter method to evaluate these pairwise contrasts.
c. Compare the results to those using a Scheffé correction.
6.4. An article reports a oneway analysis of variance (with equal sample sizes) by
the summary statement "F(4, 20) = 4.5, p < 0.01" and the graph in Figure 6.2.
a. Which pairs of means actually differ from each other? Make posthoc comparisons
at the .05 level using the FisherHayter method.
b. Suppose the independent variable was quantitative with equal spacing between the
levels. Perform a complete analysis of trend using a SidákBonferroni adjustment
and the .05 level.
6.5. You have run an experiment with four groups and 11 subjects per group. An
analysis of variance gives a significant result. You wish to test a comparison '1/J, so you
calculate its F ratio and obtain F = 7.0.
a. Assume that you planned the experiment to test this contrast. What is the critical
value for F at the 0.05 level? Is it significant?
[Figure 6.2: Plot of the group means (on a scale from 0 to 10) at the five levels a1 through a5 of factor A.]
b. Assume that you noticed this effect when you were looking through the data after
completing the study. Test the contrast using Scheffé's method at the .05 level.
c. Explain the discrepancy between the two answers. Why should a result be
significant or not depending on when you decide to look at it?
7
The Linear Model
and Its Assumptions
A researcher needs to understand the most likely violations, their effects on the analysis,
and ways to avoid them. Depending on their size and nature, these violations can
have negligible effects or can seriously compromise the tests. We will discuss these
topics in Sections 7.2 and 7.3. One characteristic of many sets of data is a particular
problem: different variability in the different treatment groups. In the final section,
we discuss how to discover and circumvent this problem.
1 We mention two pieces of terminology that are not central here, but that come up later. First,
the separation of Yij into a fixed treatment effect and a random error makes it a fixed-effect model.
We will consider random-effect models in Chapters 24 and 25. Second, because the grand mean μT
is defined as the average of the μj, Equation 7.2 is known as a means model.
134 CHAPTER 7. THE LINEAR MODEL AND ITS ASSUMPTIONS
A benefit of using the linear model to describe a score is that our null hypothesis
is associated with a single term of the model. When all the αj are zero, the means
are identical. Instead of a null hypothesis about the means,

H0: μ1 = μ2 = μ3 = etc.,

we can write a statement about the treatment effects:

H0: α1 = 0, α2 = 0, α3 = 0, etc.

This hypothesis is more compactly written by asserting that the sum of the squared
treatment effects is zero:

H0: Σαj² = 0.

At this point you won't see much difference between these three versions of the null
hypothesis, but the advantage of the forms that use αj will become apparent when
we discuss factorial designs in Part III and particularly when we describe the general
linear model in Chapter 14.
value for their particular group. In terms of the linear model of Equation 7.3, this
variability comes only from the experimental error Eij. Some statistical calculation
(which we will spare you) shows that the expected value of the within-groups mean
square is exactly equal to the error variance σ²error, which we assumed was the same
for all the treatment conditions. Stated in symbols:

E(MS_S/A) = σ²error.   (7.4)

The expected value of the treatment mean square in the numerator of the F ratio
is a little more complicated. It is influenced by three factors: the sample size n, the size
of the treatment effects αj, and the amount of experimental error σ²error. Another set
of statistical calculations shows that the expected treatment mean square is

E(MSA) = n Σαj² / (a − 1) + σ²error.   (7.5)

The first term is the product of the number of subjects per group and the sum of the
squared treatment effects divided by the degrees of freedom, and the second term is
the error variance.
Having the expected mean squares lets us discuss the evaluation of the null hy-
pothesis in mathematical terms. The null hypothesis, as we wrote it earlier, states
that Σαj² = 0. When this hypothesis holds, the first term in Equation 7.5 vanishes
and the two expected mean squares are the same size: E(MSA) = E(MS_S/A). The
actual values of the two mean squares will generally not be equal because they derive
from different aspects of the data (differences among the means and differences among
subjects within groups), but on average they will be. When the null hypothesis is false,
Σαj² is positive and E(MSA) is greater than E(MS_S/A). This difference increases both
with how much the null hypothesis is violated (i.e., with the magnitude of the αj) and
with the sample size n. Now consider the F ratio. When the null hypothesis is true,
its numerator and denominator have, on average, the same size and F ≈ 1. When
the null hypothesis is false, the numerator is, on average, larger, more so when the
differences between groups are large or when the sample size is large.
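The behavior described in this paragraph can be seen by evaluating Equations 7.4 and 7.5 directly. The numbers below are hypothetical; the point is only that E(MSA) equals E(MS_S/A) when all the αj are zero and exceeds it otherwise, more so for larger effects or larger n.

```python
def expected_mean_squares(alphas, sigma_sq, n):
    """Expected MS_A and MS_S/A under the linear model (Equations 7.4 and 7.5)."""
    a = len(alphas)
    e_ms_a = n * sum(t**2 for t in alphas) / (a - 1) + sigma_sq
    e_ms_error = sigma_sq
    return e_ms_a, e_ms_error

# Null hypothesis true: all treatment effects zero, so E(MS_A) = E(MS_S/A)
print(expected_mean_squares([0, 0, 0, 0], sigma_sq=10.0, n=8))   # (10.0, 10.0)
# Null false: E(MS_A) exceeds E(MS_S/A); note the alphas sum to zero
print(expected_mean_squares([2, -1, -1, 0], sigma_sq=10.0, n=8))  # (26.0, 10.0)
```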
Violations of the Assumptions
We stated five assumptions about the distribution of the experimental error. These
assumptions are part of the mathematical model, which, as we said, is an idealization
that is never exactly satisfied by data. How seriously should we take them? If the
assumptions are violated, then the F we calculate from our data might not have the
F distribution we use to decide whether to reject the null hypothesis. Fortunately,
research has shown that the F test is insensitive to some violations. We say
that the F test is robust to those violations. What we need to know is which aspects
of the F test are robust and which are not, when we can ignore the assumptions, and
when their violations are serious enough to compromise our statistical tests.
Not surprisingly, individual researchers (and statisticians) differ in their concern
for these violations. Some ignore them, some feel that they invalidate the standard
analyses, and most fall in between; as in any other human endeavor, there are opti-
mists and pessimists. Wherever you fall on this spectrum, it is important for you to
understand how the assumptions may fail and to know how to look for these failures
and what their consequences are. We feel strongly that a close examination of the
7.2. SAMPLING BIAS AND THE LOSS OF SUBJECTS 137
data and a consideration of the assumptions of the statistical design are important
parts of any analysis. You cannot simply assume that things will work out well, nor
can you leave the checking of the assumptions to a computer program (although the
programs can help you check them). You should acquire the habit of looking at how
well your study satisfies the assumptions of the analysis, and you should keep them
in mind when interpreting your results and those of others.
Substantially violating the assumptions of the analysis of variance can have one of
two types of effects. Some violations, particularly those affecting the randomness of
the sampling, compromise the entire set of inferences drawn from the study. Others,
such as concerns about the distributional shape, affect mainly the accuracy of the
statistical tests themselves: their Type I error probability and their power. The two
categories of violations should not be confused. Clearly, the first category needs to
be taken very seriously. We will discuss the consequences of violating the random
sampling assumptions in Section 7.2 and the ways for dealing with violations of the
distributional assumptions in Sections 7.3 and 7.4.
to all groups. Differences among the groups are due to their treatments, not to the
exclusion of subjects.
The dangers arise in nonexperimental designs, that is, studies in which
the groups constitute natural populations, to which random assignment is impossible.
The analysis of variance can tell us that the groups differ, but it cannot tell us why.
Consider a study of college students that includes groups of men and of women. We
find a difference between the two group means: Ȳmen ≠ Ȳwomen. But in understanding
these means, we must consider whether we have recruited different portions of the
actual population of men and women at the college. If the samples of men and
women were biased in different ways, that alone could account for the difference we
observe between the two means. Our concern would be greater if we found that, for
example, it was much easier to recruit women for the study than men. The potential
for bias of this type is one of the reasons that studies using preexisting groups are
more difficult to interpret without ambiguity than those where random assignment is
possible. Conclusions about such factors are frequently controversial.
Faced with these concerns, researchers sometimes try to find evidence that no
differential sampling bias has occurred. To do this, it is necessary to find some ancillary
quantity that might be related to the scores and then try to show that it does not differ
among the groups. For example, studies in which a clinical population is compared to
a group of nonclinical control subjects often start by showing that the two groups do
not differ in such things as age, education, and so forth. Although an analysis such as
this is essential, you should note that it depends on accepting a null hypothesis, and
as such, can never lead to as clear an outcome as the more usual rejection strategy.
At the least, it adds a step to the logical chain that leads to a conclusion.2
When differences among the initial samples are suspected, it may be possible to
apply some form of statistical correction to ameliorate the situation. Like the demon-
stration of equivalent groups we described in the preceding paragraph, this approach
also depends on the availability of some ancillary variable that can be used to "adjust"
the scores. Sometimes the new factor can be introduced as a blocking factor (as we
discuss in Section 11.5); other times the analysis of covariance is used (see Section
15.7). As we describe there, either of these two approaches is fraught with danger,
and must be used carefully. In our view, even though the corrections used in these
analyses often are better than nothing, no statistical adjustment can entirely eliminate
differential sampling biases.
Loss of Subjects
What are the consequences of losing subjects after the groups have been formed? Sup
pose we have used random assignment to create treatment groups with equal numbers
of subjects, but subjects are lost or their data become unusable. The researcher now
has an unbalanced design, one with unequal sample sizes, and must face two prob
lems, one minar and one serious. The minar problem is that the calculations needed
to analyze the design are more complex. For a oneway design, we showed the overall
analysis with unequal groups in Section 3.5 and the test of analytical comparisons
in Section 4.4. The calculations for unbalanced designs with more than one factor
are harder, but modern computer programs can perform them using the approach we
discuss in Chapter 14. The more serious problem is that the loss of subjects potentially
damages the equivalence of groups that were originally created randomly.
Examples of Subject Loss. There are many reasons for data to be lost or sub-
jects to be discarded from an experiment. In animal studies, subjects are frequently
lost through death and sickness. In a human study in which testing continues over sev-
eral days, subjects are lost when they fail to complete the experimental sequence. For
example, in a memory study requiring subjects to return a week later for a retention
test, some subjects will not appear, perhaps because of illness, a scheduling conflict,
or simple forgetfulness. Subjects are also lost when they fail to reach a required
performance criterion, such as a certain level of mastery; those who cannot perform
the task are not included in the analysis. Suppose we are interested in the time that
subjects take to solve a problem. Those subjects who fail to solve it cannot contribute
to the analysis and must be eliminated.
When observations are lost or unavailable for some reason, we should ask what ef-
fects the exclusion has on the results. The most important questions here are whether
the loss depends on the behavior of the subject and whether the loss threatens the
assumption of equivalent groups. For example, consider the time-to-solve measure
mentioned above. The subjects who are lost from this study are those whose so-
lution times (if they could eventually solve the problem) are greater than those of
subjects who solve within the time limit. Can we still assume that the groups contain
subjects of equivalent ability? If roughly the same number of subjects fail to reach
the criterion in all of the conditions, then we may still be able to assume equivalency.
Although the groups would contain slightly better problem solvers than the original
population, the critical assumption of equivalent groups would be largely intact. We
would, however, need to make clear that our conclusions applied to a population
from which poor problem solvers had been excluded. On the other hand, when
different proportions of subjects are lost, we cannot convincingly assume that the
resulting samples are equivalent. Groups with a greater proportion of lost subjects
would contain, on average, better problem solvers. Contrast this case with the sit-
uation when the loss is due to the occasional malfunctioning of the experimental
equipment. Such losses are unrelated to the behavior of the subjects and thus pose
no threat to the assumption of equivalent groups. With random loss, the inferences
are not affected.
to the unrecorded score, and the loss is ignorable. Inferences based on such data are
unaffected by the loss. In contrast, consider the solution-time example. The only
subjects who fail to provide scores are those that would have had long solution times.
This is a nonignorable loss. Because the high scores are missing, the observed mean
time Ȳj is less than the population value μj. More seriously, this bias is greater for
groups that received a hard problem, where more subjects failed to solve within the
time limit, than it is for an easy problem, where most subjects solved. The comparison
between groups is contaminated.
Faced with subject loss, a researcher must decide whether it is random, that is,
whether it is ignorable or nonignorable. The answer depends on what causes the
subjects to leave the experiment, and this can be almost impossible to know for certain.
For example, consider a memory experiment in which each subject is tested either
15 minutes or 2 days after learning. Fifty subjects are randomly assigned to each
condition. Of the 50 subjects in the 15-minute condition, all are available for the
retention test. However, only 38 of the 50 subjects in the two-day group return for the
final test. Whether this loss is ignorable depends on why the subjects did not show
up. If the failure was because of an unanticipated scheduling conflict or illness, then it
is random and in the same category as equipment failure or an experimenter's error.
If the subjects who failed to return were mostly poorer learners who were dissatisfied
with how they were doing, then it would not be. The delay group would have a
greater proportion of the better subjects than the 15-minute group, and we would
overestimate the mean of the two-day group.
Let's take another example. Regrettably, in animal studies, particularly those
involving a physiological manipulation, the animals in highly stressful conditions (such
as operations, drugs, high levels of food or water deprivation, or exhausting training
procedures) are sometimes unable to complete the study. If this were the case, only the
strongest and healthiest animals would give valid data. The difficult or more stressful
conditions would contain a larger proportion of healthy animals than the less severe
conditions. Thus, a potentially critical selection factor, the health of the animals, may
make this loss nonignorable.
Sometimes it is possible to solve the problem of nonignorable loss by taking cor-
rective action. A useful strategy is to make the same selection pressures apply to all
groups. In the memory experiment, we could require that all subjects return after
two days, even those who had already received the 15-minute test. The data from
any subject in the short retention condition who failed to show up later would also
be discarded. In an animal study, we could apply similar stresses to animals in all
conditions so as to equalize the chance that an animal would be dropped.
If the dropout rates cannot be equalized, it may still be possible to make the groups
more equivalent by artificially excluding some members of certain groups. Suppose that
in the study that measured time to solve a problem, 5 of 20 subjects in one condition
had failed to solve within the time limit, while all of the 20 subjects in the other
condition had solved. To make the groups more comparable, we could drop the 5
slowest solvers from the second condition, so that each group included the 15 fastest
solvers. For this technique to work, it is necessary to assume that the subjects whose
data are discarded are those who would have failed to complete the study had they
been assigned to the condition that produced the failures. This assumption may be
7.3. VIOLATIONS OF DISTRIBUTIONAL ASSUMPTIONS 141
questionable, but it must be made before any meaningful inferences can be drawn
from data adjusted in this manner. It is usually preferable to assuming that all losses
are ignorable and making no correction at all.
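The trimming correction just described can be sketched in a few lines. The solution times below are made up for illustration; the function simply drops the k slowest solvers from the complete group, where k is the number who failed in the other condition.

```python
def equalize_by_trimming(times_complete, times_with_failures, n_failed):
    """Drop the slowest solvers from the group with no failures, mirroring the
    selection pressure in the group that lost n_failed subjects."""
    kept = sorted(times_complete)[:len(times_complete) - n_failed]
    return kept, times_with_failures

# Hypothetical solution times (seconds); 2 of 6 subjects in the hard
# condition failed to solve and are already absent from its list.
easy = [31, 44, 27, 52, 39, 60]
hard = [55, 71, 63, 48]
easy_trimmed, hard_kept = equalize_by_trimming(easy, hard, n_failed=2)
print(easy_trimmed)  # [27, 31, 39, 44]: the two slowest easy-condition solvers dropped
```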
We should reemphasize that the critical problem with nonignorable subject loss is
the interpretation of the results, not any imbalance in the sample sizes (with which
it has sometimes been confused). When subjects have been Jost nonignorably, we can
still perform an analysis of variance and it can still show that the means differ, but
we will have trouble interpreting that result. You can see that the sample sizes per se
have nothing to do with the problem by considering what would happen if you ran
additional subjects to make up for the ones who were missing, so that all the groups
ended up the same size. The new subjects would still be affected by the same selection
biases as the original ones, and the problems of interpretation would still remain. Do
the different sample means imply differences in the population or are they simply the
consequence of uneven loss of subjects? We would not know.
Clearly a loss of subjects should be of paramount concern to the experimenter. You
have seen that when the loss is related to the phenomenon under study, randomness
is destroyed and a systematic bias may be added to the differences among the means.
This bias cannot be disentangled from the treatment effects. You must solve this
problem before you can draw meaningful conclusions from your experiment. However,
if you can be convinced (and, crucially, you can convince others) that the subject loss
has not resulted in a bias, then you can go ahead to analyze the data and interpret
your results unequivocally.
test is sensitive to this violation. When the observed proportion exceeds the nominal
α level, the violation of the assumption has given the test a positive bias and it
is liberal with regard to Type I error; when the proportion is less than the nominal
level, it has a negative bias and is conservative.
sample size. Large samples do not cure dependencies in the data. As should be clear,
the assumption of independence is fundamental to the entire statistical enterprise.
Independence is best approached in the original design, through careful experimental
procedures, tight control of the design, random assignment of subjects to the treatment
groups, and by being especially careful whenever group testing is used. There are few
ways to recover from dependent observations after the data have been collected.
Identical Within-Group Error Distribution
The second and third assumptions of the statistical model state that the random
part E_ij of every score has the same distribution. This assumption does not say that
the scores are the same (that would surely be unrealistic), but that the probability
distribution of E_ij does not depend on either j (the particular group) or i (the particular subject). Every subject's error component comes from the same distribution.
We examine the within-group assumption first, and will look at the between-group
aspects later.
The most common violation of the within-group identical-distribution assumption
in experimental studies occurs when the population contains subgroups of subjects
with substantially different statistical properties. Several examples illustrate the range
of problems that can arise. Suppose a study is designed that uses samples of college
students as subjects and that a systematic difference exists in the population between
the average scores of the men and the women. As a result, the samples contain
members from the two distinct subpopulations with different means and, perhaps,
different variances. Changes in distribution can also be created by outside events:
college students tested at the start of the term may be different from those tested just
before final exams, and political opinions recorded before an election may be different
from those recorded after it. Even when the subjects are homogeneous, changes in the
experimental procedure can create changes in distribution. Perhaps the equipment
changes as it ages, or the researcher becomes more skilled in executing the procedure.
Subjects run by different research assistants may have different distributions.
A failure of the identical-distribution assumption has two types of effect on the
analysis. One effect is on the interpretation of the results. When the samples are a
mixture of subpopulations whose means are different, a sample mean is not an estimate
of the mean of any of these subpopulations, and, if the subpopulations are sufficiently
different, then it may not be a good description of any individual. For example, the
sample mean in a study where the scores of men and women differ greatly does not
describe either the men or the women. The other effect involves the statistical tests.
For one thing, a distribution that is a mixture of several subpopulations with different
properties usually does not satisfy the normal-distribution assumption, a topic that
we will discuss later. More critically, because the study contains a systematic source
of variability that is not taken into account (the difference between subpopulations),
the error term is inflated and the power of the test is reduced. A very small value
of F, one much less than 1, is sometimes an indication that the error term has been
inflated by a mixture of subgroups with different properties.
Where they can be identified, many failures of identical distributions can be corrected by introducing additional factors into the design. For example, if there appear
to be gender differences in the scores, a gender factor can be added. We discuss these
designs in Part III. When the distributions shift over time in a simple way, we can
use analysis of covariance (Chapter 15). Both approaches reduce the size of the error
term and, consequently, increase the power of the tests.
Figure 7.1: A normal distribution (top) and four nonnormal distributions, all with
the same variance. The first two nonnormal distributions are asymmetrical, with
skew coefficients of -1.5 on the left and +1.0 on the right. The second two are
symmetrical (skew coefficients of 0), but with a kurtosis of -1 on the left and
+2.5 on the right. These distributions are said to be platykurtic and leptokurtic,
respectively.
This deviation, or residual, is then plotted in a histogram of the residuals E_ij for all
groups at once. Most computer programs can construct some version of this plot and
also offer a variety of other residual and diagnostic plots.
The assumption of the normal distribution is the foundation on which the F distribution is based. Thus, when it fails, the sample statistic F = MS_A/MS_S/A might not
have the theoretical F distribution used to obtain the critical values in Appendix A.1.
However, the situation is much improved by the fact that the hypotheses we want to
test concern the means Ȳ_j, and the sampling distribution of a mean is always much
closer to normal than is that of the original scores. In fact, an important theoretical
result, known as the central limit theorem, states that as the size of the sample
increases, the mean comes to have a normal distribution. Consequently, once the samples become as large as a dozen or so, we need not worry much about the assumption
of normality. Simulation studies using a wide variety of nonnormal distributions show
that the true Type I error rates deviate little from their nominal values, and when they
do, it is often in a conservative direction (Clinch & Keselman, 1982; Sawilowsky &
Blair, 1992; Tan, 1982). The one place where appreciable deviation from the nominal
values occurs is when the underlying distribution is skewed and the sample sizes are
unequal. In this case, the Type I errors are asymmetrical, with the mean of the smaller
group tending to deviate in the direction of the skew. Tests of directional hypotheses
(see p. 73) are inaccurate and should be avoided.
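The simulation findings summarized above are easy to check directly. The following Python sketch (ours; the group size, number of replications, and the tabled critical value F.05(2, 27) ≈ 3.35 are illustrative choices) draws every group from the same skewed exponential population, so the null hypothesis is true and the rejection proportion estimates the actual Type I error rate:

```python
import random

def f_statistic(groups):
    """One-way ANOVA F = MS_A / MS_S/A for a list of equal-size groups."""
    a, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (a * n)
    ms_between = n * sum((sum(g) / n - grand) ** 2 for g in groups) / (a - 1)
    ms_within = sum((y - sum(g) / n) ** 2 for g in groups for y in g) / (a * (n - 1))
    return ms_between / ms_within

random.seed(0)
a, n, nsim = 3, 10, 4000
crit = 3.35                       # tabled F.05(2, 27), an illustrative choice
rejections = 0
for _ in range(nsim):
    # All groups come from the same skewed (exponential) population,
    # so the null hypothesis is true and every rejection is a Type I error.
    groups = [[random.expovariate(1.0) for _ in range(n)] for _ in range(a)]
    if f_statistic(groups) > crit:
        rejections += 1
print(rejections / nsim)
```

Even with markedly skewed scores, the estimated rate stays near the nominal .05, in line with the studies cited above.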
Although the F test is robust with regard to violations of the normality assumption,
you still need to be on the lookout for the presence of aberrant scores or outliers
146 CHAPTER 7. THE LINEAR MODEL AND ITS ASSUMPTIONS
that seem to come from a completely different population. These abnormal scores
tend to have a disproportionate influence on the mean and to inflate the variance.
Outliers are relatively easy to spot. You can think of them as scores that fall outside
the range of most of the other data, say more than three standard deviations from the
mean. The more comprehensive computer packages contain procedures for identifying
them. What is more difficult is to decide what to do with them. You should examine
these extreme scores in detail to see if something, other than just being extreme,
makes them exceptional. Often they turn out to be errors made when copying the
data. Alternatively, a subject may have failed to follow instructions or refused to
cooperate. Experimenter error, such as giving a subject the wrong treatment, also
can produce outliers. After identifying the problem, or when the score is palpably
absurd, it is appropriate to exclude an outlier from the analysis (but you should note
this fact when you describe your study). However, any distribution of data is likely
to contain some extreme scores. Real data often are a little more scattered than a
normal distribution; the technical term is overdispersed. These observations are a
valid part of the distribution and should be included in the analysis.5
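A screen of the kind just described is simple to script. This sketch (ours; the three-standard-deviation criterion follows the rule of thumb above, and the scores are invented) flags candidates for inspection rather than deleting them automatically:

```python
from statistics import mean, stdev

def flag_outliers(scores, criterion=3.0):
    """Return scores lying more than `criterion` standard deviations from the
    mean: candidates for closer inspection, not for automatic removal."""
    m, s = mean(scores), stdev(scores)
    return [y for y in scores if abs(y - m) > criterion * s]

# Twenty invented scores; the 95 is the kind of copying error described above.
data = [12, 13, 14, 15, 16, 12, 13, 14, 15, 16,
        12, 13, 14, 15, 16, 12, 13, 14, 15, 95]
print(flag_outliers(data))        # [95]
```

Note that with very few observations a large outlier inflates the standard deviation enough to mask itself, one reason to examine the data rather than rely on any single automatic rule.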
The treatment of nonnormality varies with the way in which the assumption is
violated. Several ways that the data can deviate from normality are relatively easy
to spot: multiple modes, extreme scores or outliers, and asymmetrical or skewed
distributions. The discovery of multiple modes is usually the easiest to deal with.
They suggest that the data are a mixture of two or more underlying distributions,
which implies that the assumption of an identical distribution of the random variable
E_ij for all subjects is violated. The presence of subpopulations is best treated by
identifying the source of the heterogeneity (for example, gender) and including it as
an explicit factor in the design.
Another way to deal with data that are not normal is to abandon the analysis of
variance and to switch to a form of testing that does not depend on the normality of
the scores. There are several procedures, known collectively as nonparametric tests,
that examine null hypotheses about location under weaker distributional assumptions.
For data from independent groups, as we discuss here, the relevant null hypothesis
is that the medians of the distributions are the same,6 and the alternative is that
some of the medians differ. If the groups differ only in that their distributions are
shifted left or right relative to each other, then this hypothesis can be tested by the
Kruskal-Wallis test or its two-group equivalents, the Mann-Whitney U test and
the Wilcoxon rank-sum test. When the underlying distributions are substantially
skewed or have many extreme scores, these tests are less sensitive to distributional
differences and have greater power than does the standard analysis of variance.7
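To show the mechanics of the first of these procedures, the Kruskal-Wallis statistic can be computed directly from the pooled ranks. This sketch (ours) omits the usual correction for tied scores, so it assumes no ties; the resulting H is referred to a chi-square distribution with a - 1 degrees of freedom:

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H from pooled ranks (assumes no tied scores)."""
    pooled = sorted((y, g) for g, scores in enumerate(groups) for y in scores)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(pooled, start=1):
        rank_sums[g] += rank
    term = sum(r ** 2 / len(scores)
               for r, scores in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)

# Three completely separated groups of three scores each.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h)                          # about 7.2; refer to chi-square with 2 df
```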
5Some data analysts suggest routinely trimming the data, that is, excluding the most deviant
scores from all treatment conditions. We do not recommend such a wholesale exclusion of data.
Throwing out extreme scores that are actually part of the distribution reduces the error mean square
and gives the F test a positive bias.
6The median, you will recall, is the score that divides a distribution such that half of the scores
are smaller than it and half are larger. Without getting into details, when the number of scores is
odd, the median is the value of the middlemost score, and when the number of scores is even, it is
the average of the two scores in the middle.
7Space limitations prevent us from describing the nonparametric tests here. You can turn to any
of several good books on the subject, all of which cover much the same ground (psychologists often
do older ones, who have already encountered instances where similar problems were
solved. The wider variety of strategies employed by the younger children results in
a more variable distribution of scores. Second, a clinical population may differ in
variance from a set of normal controls, as a result of whatever characteristics make it
a clinical population. Which group is the more variable here depends on the particular
population that is being studied and how the controls were chosen.
The second systematic cause of differences in variance arises when the treatment
itself affects subjects differently, producing differences among the group variances.
The experimental manipulation can consolidate or scatter subjects within a group
by making them behave more similarly or more disparately. For example, a stressful
condition might lead some subjects to buckle down and try harder and others to give
up, increasing the variance of that group relative to a less stressful condition. The floor
effects and ceiling effects that we mentioned on page 5 have the effect of compressing
the range of scores and creating differences in variance.
The third source of variance differences is more intimately tied to the scores than
to the treatments. Observations that are bounded or restricted at one side of the
scale (often by zero) frequently have greater variability the farther they are from that
boundary or point of restriction. Data that are counts of events have this character
istic. For example, the number of nerve impulses recorded in a fixed interval or the
number of speech errors made while reading a given text have greater variability when
the number of impulses or number of errors is large than when it is small. Response
latencythe time that a subject takes to make a responsealso has this property:
Conditions leading to long latencies usually have larger variances than conditions
leading to shorter latencies. For example, the variability of subjects solving problems
will almost certainly be greater for the groups that take more time to solve them.
With any of these measures, a manipulation that changes the mean also changes the
variance, producing a situation in which the variances are related to the means.
Having seen that differences in variance can arise in many types of study, we need to
consider their effects on the analysis.8 It used to be thought (based on an investigation
by Box, 1954b) that the F test was relatively insensitive to the presence of variance
heterogeneity, except when unequal sample sizes were involved (for an excellent review of this older literature, see Glass, Peckham, & Sanders, 1972). However, more
recent work, summarized by Wilcox (1987), has shown that even with equal samples,
substantial differences in variance bias the F test positively. The null hypothesis is
rejected too often.
To demonstrate the effects of heterogeneity of variance, we conducted the series
of Monte Carlo simulations in Table 7.1. The table shows three sets of simulations,
separated by horizontal lines. The four right-hand columns give the proportion of
100,000 simulated experiments that were rejected at different nominal significance
levels. The first four rows give the properties of three-group analyses with n = 10
subjects per group, no differences between the means, and, except in the first row,
different variances. In the first row, the variances are homogeneous, and you can see
that the obtained or actual alpha levels are almost exactly equal to their nominal

8The effects of heterogeneity of variance on the analysis of variance are sometimes called the
Behrens-Fisher problem, after early workers on the topic.
Table 7.1: Monte Carlo significance levels estimated from sets of 100,000 simulated
experiments with various combinations of heterogeneity of variance, sample size,
and number of groups.

Number     Sample   Pattern of                   Nominal significance level α
of Groups   Size    Variances                    .10     .05     .025    .01
3           10      1, 1, 1                      .100    .050    .026    .010
3           10      1, 2, 3                      .104    .054    .029    .013
3           10      1, 4, 9                      .112    .063    .037    .018
3           10      1, 9, 25                     .116    .069    .041    .022
3            5      1, 4, 9                      .116    .066    .038    .019
3           10      1, 4, 9                      .112    .064    .036    .018
3           15      1, 4, 9                      .111    .062    .035    .017
3           25      1, 4, 9                      .107    .061    .035    .017
3           50      1, 4, 9                      .107    .060    .035    .017
3           10      1, 4, 9                      .110    .060    .036    .018
6           10      1, 1, 4, 4, 9, 9             .123    .073    .044    .023
9           10      1, 1, 1, 4, 4, 4, 9, 9, 9    .130    .078    .047    .023
values. The next three rows have increasing degrees of heterogeneity of variance,
and there is a corresponding increase in the proportion of null hypotheses rejected at
each significance level. This positive bias is largest at the smaller significance levels;
for example, when the largest variance is 25 times the smallest (a very substantial
difference), the actual rejection rates are more than twice the nominal α = .01 level.
In the second series of simulations, the degree of heterogeneity is held constant
and the sample size changes. From these simulations we see that although there is
some improvement with larger samples, the positive bias does not go away, even when
n is as large as 50.9 In the third series, we varied the number of groups, keeping the
differences in variability among the groups the same. The positive bias is somewhat
larger in the experiments with many groups. In short, the presence of heterogeneity
leads to an inflation of the actual Type I error that is reduced, but not eliminated, by
an increase in sample size.
The analysis of variance is more susceptible to the problems of unequal variance
when the samples do not have equal sizes. We do not have space here to discuss these
effects in detail, but you should be aware of a few points. First, the biases we have
demonstrated are greater when the sample sizes are unequal than when they are nearly
equal. Second, the problem is much more serious when the smaller samples have the
greatest variability. Third, tests that use a directional alternative hypothesis are more

9The row with n = 10 replicates a condition from the first series, which is replicated again in the
third series. The proportion of rejections is similar in the three rows, although not exactly
the same. This variability reflects the random behavior of Monte Carlo simulations and gives an idea
of how closely you can trust each individual result.
at risk than are those that use the standard omnidirectional alternative hypothesis.
Many researchers feel that these facts make it important for a good study to have
samples of approximately equal size. We concur with this recommendation.
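The positive bias demonstrated in Table 7.1 can be reproduced in miniature. In this Python sketch (ours; the seed, number of replications, and tabled critical value F.05(2, 27) ≈ 3.35 are illustrative choices), the three groups have equal means but variances in the 1, 4, 9 pattern:

```python
import random

def f_statistic(groups):
    """One-way ANOVA F = MS_A / MS_S/A for equal-size groups."""
    a, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (a * n)
    ms_between = n * sum((sum(g) / n - grand) ** 2 for g in groups) / (a - 1)
    ms_within = sum((y - sum(g) / n) ** 2 for g in groups for y in g) / (a * (n - 1))
    return ms_between / ms_within

random.seed(1)
n, nsim = 10, 8000
sds = [1.0, 2.0, 3.0]             # variances 1, 4, 9, the pattern of Table 7.1
crit = 3.35                       # tabled F.05(2, 27)
rejections = 0
for _ in range(nsim):
    # Equal means, unequal variances: rejections are again Type I errors.
    groups = [[random.gauss(0.0, sd) for _ in range(n)] for sd in sds]
    if f_statistic(groups) > crit:
        rejections += 1
print(rejections / nsim)
```

The estimated rate should land in the vicinity of the .063 that Table 7.1 reports for this pattern, well above the nominal .05.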
A simple descriptive index of heterogeneity compares the largest and smallest of the
within-group variances:

Fmax = (largest s_j^2) / (smallest s_j^2).
A large value of Fmax indicates that the variances differ. Critical values for the Fmax statistic have been calculated and tabled (Pearson & Hartley, 1970, 1972). Unfortunately,
in spite of its simplicity and of the fact that it is provided by many packaged computer
programs, the Fmax statistic is unsatisfactory. Its sampling distribution, as reflected
in the Pearson-Hartley tables, is extremely sensitive to the assumption that the scores
have a normal distribution. It lacks the robustness of the analysis of variance.
7.4. DEALING WITH HETEROGENEITY OF VARIANCE 151
Among the many alternatives proposed to the Fmax test, two stand out in the
Conover et al. (1981) survey as both accurate and relatively easy. Both are based on
the same simple idea. First, we find some way to measure how much each score deviates
from the center of its group (its mean or median). If the variances are homogeneous,
then the average size of these deviations, disregarding their sign, will be about the
same for all the groups, while if the variances are heterogeneous, then the deviations
will be large in the most variable groups. In this way, a question about variances is
turned into a question about average deviations, and that can be investigated with an
ordinary analysis of variance.
The better of the two tests is due to Brown and Forsythe (1974a). They measure
the deviation of a score from the center of its distribution by the absolute value of its
difference from the median of that group, Md_j:

Z_ij = |Y_ij - Md_j|   (7.7)

(the two vertical lines indicate to ignore the sign of the deviation). The other version
of the test, proposed earlier by Levene (1960), uses the absolute deviation from the
mean, |Y_ij - Ȳ_j|. Either procedure works well, but the Brown-Forsythe procedure is
slightly more robust.
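The computation just described is easy to carry out: convert each score to Z_ij = |Y_ij - Md_j| and submit the Z's to an ordinary one-way analysis. A minimal sketch (ours, with invented data):

```python
from statistics import median

def brown_forsythe_f(groups):
    """One-way ANOVA F applied to the deviations Z_ij = |Y_ij - Md_j| (Eq. 7.7)."""
    z = [[abs(y - median(g)) for y in g] for g in groups]
    a = len(z)
    n_total = sum(len(g) for g in z)
    grand = sum(sum(g) for g in z) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in z)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in z for v in g)
    return (ss_between / (a - 1)) / (ss_within / (n_total - a))

# Invented groups with equal medians but increasingly scattered scores.
print(brown_forsythe_f([[9, 10, 11, 10], [5, 10, 15, 10], [0, 10, 20, 10]]))
```

Here the groups share a common median but differ in spread, so the Z_ij, and hence the F computed from them, pick up the heterogeneity.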
Table 7.2 shows a numerical example of this procedure, using the data originally
given in Table 2.2 (p. 22). The upper part of the table presents the initial calculations.
There are three columns for each group. The first of these columns reproduces the
original scores, with the median of each group, 16, 7, and 10, marked by a box. In
the second column, this median is subtracted from each score. As a check on our
calculation, note that each column contains the same number of positive and negative
entries, as must be the case unless there are ties at the median. The third column
contains the absolute values of these differences, that is, the Z_ij of Equation 7.7. We
now conduct a standard analysis of variance on the Z_ij, treating them exactly as we
would any other set of scores. Below the columns of Z_ij are the sums of the Z_ij and
the sums of the squared Z_ij; in our standard notation, these are the sums A_j and
each group's contribution to the raw sum of squares [Y], respectively. From these
values we calculate the three bracket terms:

[Y] = Σ Z_ij^2 = 65 + 55 + 75 = 195,
Only in situations more extreme than those covered in the table (more groups and
larger variance differences) did the actual error rate exceed the 5 percent level. So,
unless the situation is extreme, the Type I error will be kept below the 5 percent level by
setting α = .025. If your concern is to keep the Type I error under control, then a
conservative α level is the easiest and simplest way to eliminate concerns about
heterogeneity. Because it does not require access to the data, it is easy to apply to
the results reported in a published experiment. The cost of this strategy is that the
more stringent significance level makes it harder to reject null hypotheses (in the
language of Chapter 8, it reduces the power
of the test). Reducing the α level is appropriate when there really are differences in
variance, because this reduction simply brings the actual Type I error rate back to
where it belongs. However, when the variances really do not differ, or they differ by
only a small amount, reducing α diminishes the power. This lost power can be restored
by adding more subjects. How many are needed can be calculated using the methods
that we discuss in Chapter 8. In the type of designs most common in psychology, the
difference amounts to about a 20 percent increase in sample size. Thus, when you
are designing a study in which unequal variances are a possibility, you might consider
reducing α and raising the sample size by this amount.
Table 7.3: Transformation of the data by the square root (Equation 7.8).

Analysis of variance of the transformed scores:

Source    SS        df    MS       F
A         11.998     2    5.999    19.33*
transformed values are in the right three columns of Table 7.3. The differences among
the variances now are smaller, are not related to the means in any simple way, and
no longer approach significance by the Brown-Forsythe test (F(2, 18) = 1.8). We
can now perform an analysis of variance on these transformed data. We don't show
the details of these calculations because they are just like the one-way calculations
elsewhere in this book, but the result is given at the bottom of the table. The analysis
shows that the averages of the transformed scores differ among the groups. What
does this analysis tell us about the original scores, which, of course, are the ones that
we care most about? For the most part, it supports the same conclusions. Although
the transformation changes the scores, it does not change their order. The original
and transformed scores are also very highly correlated; for Table 7.3, r = .994.
Consequently, groups that differ on the transformed scores also differ on the original
scores.10 Thus, we suggest discussing the results in terms of the original scores. In
our example, we would talk about the number of errors, not their square root.
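The effect of the square-root transformation on count-like data can be seen in a small simulation (ours; the Poisson sampler and the two condition means are illustrative choices). Raw Poisson counts have variances proportional to their means; after the transformation the two variances are nearly equal:

```python
import math
import random
from statistics import variance

def poisson(mu, rng):
    """Knuth's simple Poisson sampler (adequate for small mu)."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(7)
low = [poisson(2.0, rng) for _ in range(500)]    # condition with few errors
high = [poisson(8.0, rng) for _ in range(500)]   # condition with many errors
ratio_raw = variance(high) / variance(low)
ratio_sqrt = variance([math.sqrt(y) for y in high]) / \
             variance([math.sqrt(y) for y in low])
print(ratio_raw, ratio_sqrt)      # the transformation pulls the ratio toward 1
```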
We will mention two other situations in which transformations are helpful. The
first arises when data are generated by the timing of events, for example, the latency
or time taken to respond. With such data, the standard deviations (rather than the
variances) tend to be roughly proportional to the treatment means. Here a logarithmic
transformation is appropriate.

10It is possible to find exceptions to this assertion, but they almost never occur when a transformation is used in the type of theoretically-motivated circumstance we have described.
All these tests accommodate unequal variances, although, when the variances are
equal, they have less power than the standard analysis. Among them (and others),
the best choice according to current research appears to be the second version of
James's method, usually known as James's second-order method (Alexander &
Govern, 1994; Wilcox, 1988; see Coombs et al., 1996, for a review of the voluminous literature leading to this recommendation). Unfortunately, the computational
demands of James's procedure are substantial and, as of this writing, it has not been
incorporated into the standard statistical software packages (some code to run it with
the SAS package is mentioned by Lix & Keselman, 1995). The Welch procedure is
more tractable, but is somewhat less accurate.
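Although we cannot reproduce James's second-order method here, the Welch procedure is compact enough to sketch. This implementation (ours, written from the published formulas rather than taken from any package, with invented data) weights each group by n_j/s_j^2 and adjusts both the F ratio and its denominator degrees of freedom:

```python
from statistics import mean, variance

def welch_anova(groups):
    """Welch's heterogeneity-tolerant one-way test.
    Returns (F, df1, df2); refer F to the F distribution with those df."""
    a = len(groups)
    w = [len(g) / variance(g) for g in groups]          # weights n_j / s_j^2
    w_sum = sum(w)
    grand = sum(wj * mean(g) for wj, g in zip(w, groups)) / w_sum
    num = sum(wj * (mean(g) - grand) ** 2
              for wj, g in zip(w, groups)) / (a - 1)
    lam = sum((1 - wj / w_sum) ** 2 / (len(g) - 1)
              for wj, g in zip(w, groups))
    f = num / (1 + 2 * (a - 2) * lam / (a * a - 1))
    return f, a - 1, (a * a - 1) / (3 * lam)

# Three invented groups with equal spreads but different means.
print(welch_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```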
Equation 7.12 uses the variances for the individual groups, and so is more accurate
than an average standard error based on MS_S/A. For example, suppose a study has
a = 3 groups and we want to compare μ2 to μ3. The coefficients for this contrast are
{0, 1, -1}. The variability is

s_ψ̂ = √[(1)^2 s_M2^2 + (-1)^2 s_M3^2] = √(s_M2^2 + s_M3^2).

Thus, s_ψ̂ is based on the squared standard errors of the two conditions entering into
this comparison, and the variability of the remaining group is not involved. For a
complex comparison with the coefficients {2, -1, -1},

s_ψ̂ = √[2^2 s_M1^2 + (-1)^2 s_M2^2 + (-1)^2 s_M3^2] = √(4s_M1^2 + s_M2^2 + s_M3^2).

Here all three squared standard errors are used, but in unequal parts. The squared
standard error for the first group is multiplied by c_1^2 = 2^2 = 4, and those for the other
two groups are multiplied by (-1)^2 = 1.
There is one further complication. The unequal variances give the test statistic a
sampling distribution that differs from a standard t distribution. The critical value
must be approximated by entering the t table in Appendix A.2 with a somewhat
ungainly special value for the degrees of freedom (Satterthwaite, 1941, 1946):

df' = s_ψ̂^4 / Σ [c_j^4 s_Mj^4 / (n_j - 1)].   (7.13)

This value will generally not be an integer, in which case, the next smaller integer is
used.
For a numerical example, let's return to the unequal-variance data in Table 3.7
(p. 57). Suppose you wanted to examine the contrast ψ = μ1 - 2μ2 + μ3. The
means of the three groups are 7.0, 10.0, and 5.0, based on samples of size 3, 6, and 4,
respectively, and the corresponding standard errors are 1.155, 0.577, and 0.707. From
these, you calculate the observed contrast, the estimated standard error of ψ̂, and the
test statistic:

ψ̂ = Σ c_j Ȳ_j = (1)(7.0) + (-2)(10.0) + (1)(5.0) = -8.0,
s_ψ̂ = √(Σ c_j^2 s_Mj^2) = √[(1^2)(1.155)^2 + (-2)^2(0.577)^2 + (1^2)(0.707)^2]
    = √[1.334 + (4)(0.333) + 0.500] = √3.166 = 1.779,
t = ψ̂ / s_ψ̂ = -8.0 / 1.779 = -4.497.

The approximate degrees of freedom (Equation 7.13) are

df' = s_ψ̂^4 / Σ [c_j^4 s_Mj^4 / (n_j - 1)]
    = (1.779)^4 / [(1^4)(1.155)^4/(3 - 1) + (-2)^4(0.577)^4/(6 - 1) + (1^4)(0.707)^4/(4 - 1)]
    = 10.016 / [(1)(1.780)/2 + (16)(0.111)/5 + (1)(0.250)/3] = 10.016/1.329 = 7.536.
The next smaller integer is 7, so we compare the observed t = -4.50 to t.05(7) = 2.36.
The result is significant, indicating that the mean for group a2 (10.0) is greater than
the combined means for a1 and a3 ((7.0 + 5.0)/2 = 6.0).
When testing several hypotheses, you may want to employ some form of control
for Type I error. All the procedures described in Chapter 6 work as before,
except that the value calculated from Equation 7.13 replaces df_S/A as the degrees of
freedom used to enter the appropriate statistical tables. Because the combination of
groups is different for each contrast, the approximate degrees of freedom may also
differ. With the procedures that require the specialized Šidák-Bonferroni, Dunnett,
or Tukey tables, you may have to look up critical values for each contrast. Similarly,
Scheffé's procedure can be applied by using Equation 6.12 to obtain a critical value
from the F tables with df_S/A replaced by the adjusted degrees of freedom (Brown &
Forsythe, 1974c). In the Tukey and Fisher-Hayter procedures, the critical difference
D also depends on the standard error, and it must be calculated separately for each
pair, using Equation 6.8 (p. 124).
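The example is easy to script. This sketch (ours) reproduces the contrast, its standard error (Equation 7.12), the t statistic, and the Satterthwaite degrees of freedom (Equation 7.13); the df it returns (about 7.55) differs slightly from the hand value 7.536 because the hand calculation rounds its intermediate quantities:

```python
import math

def contrast_t(coefs, means, std_errs, ns):
    """Contrast psi-hat, its standard error (Eq. 7.12), the t statistic,
    and Satterthwaite's approximate degrees of freedom (Eq. 7.13)."""
    psi = sum(c * m for c, m in zip(coefs, means))
    var = sum(c * c * se * se for c, se in zip(coefs, std_errs))
    df = var ** 2 / sum(c ** 4 * se ** 4 / (n - 1)
                        for c, se, n in zip(coefs, std_errs, ns))
    return psi, math.sqrt(var), psi / math.sqrt(var), df

# Means, standard errors, and sample sizes from the Table 3.7 example.
print(contrast_t([1, -2, 1], [7.0, 10.0, 5.0], [1.155, 0.577, 0.707], [3, 6, 4]))
```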
The results of an experiment are expressed most directly by the values of the sample
means. The report of any study should include a description of the pattern of means
and estimates of their variability, usually the standard deviations of the scores. However, it is also useful to have an overall measure of the magnitude of the effect that
incorporates all the groups at once. Such measures of effect size have risen to greater
prominence recently, when they have been used as adjuncts to the conventional
hypothesis tests (or even as alternatives to them). They circumvent some, although
not all, of the problems with hypothesis tests that we discussed at the end of Section
3.2 (p. 44). We describe some of these measures in Sections 8.1 and 8.2.
The remainder of this chapter concerns the planning of experiments. It is unwise
for a researcher to begin a study unless it is likely to obtain reportable results. An
important part of the planning process, one that is now required by many granting
agencies, is an analysis of the power of a proposed experiment, the probability that a
given alternative to the null hypothesis will lead to a significant outcome. We discuss how power is calculated, and, of greater value to the researcher, how to determine
the appropriate sample size for a projected experiment.
If you read the literature in this area, you will discover it is plagued by a disparate
collection of procedures, measures, and recommendations (Richardson, 1996, gives
some useful history). We will describe a consistent approach to this topic, but will
mention in footnotes some alternatives that you might encounter.
With unequal sample sizes, we must pool the sums of squares and degrees of freedom
separately, as we did when finding the mean square in a similar situation (Equation
3.15, p. 58):

s12 = √[(SS1 + SS2)/(df1 + df2)].   (8.3)

In our example, the two groups contain the same number of boys, so we can use either
formula to calculate s12. After calculating the variances from the data in Table 2.2
and entering them in Equation 8.2, we find

s12 = √[(s1^2 + s2^2)/2] = √[(15 + 12.5)/2] = 3.708.

Substituting this information in Equation 8.1, we find that the effect size is

d12 = (Ȳ1 - Ȳ2)/s12 = 9/3.708 = 2.43.

1Do not confuse this measure with a t test statistic (Equation 4.8, p. 71), in which the divisor
is an estimate of the standard error of the difference between means.
8.1. DESCRIPTIVE MEASURES OF EFFECT SIZE 161
Thus, the groups differ by more than two standard-deviation units. We could calculate
the statistics d13 and d23 in the same way for the other pairs of groups.
The measure d is quite popular, because it is simple to use and makes good intuitive
sense. If any measure can be called simply "the effect size," it is this. It is zero when
there is no difference between the two means, and increases without bound when the
difference becomes large. It is particularly useful in the field of meta-analysis,
which involves combining the findings from several studies (e.g., Hedges & Olkin, 1988;
Rosenthal, 1991). By working with d instead of the actual differences between means,
the disparate studies are put on a common basis.
Substituting these variances into Equation 8.6 gives a measure of effect size known as omega squared:
ω² = (σ²_total − σ²_error)/σ²_total. (8.8)
Another way to characterize ω² uses the parameters of the linear model in Section 7.1. In that model, αⱼ is the deviation of the group mean μⱼ from the grand mean μ. The variability of the group means around the grand mean is measured by the average of the squares of these deviations:
σ²_A = Σαⱼ²/a. (8.9)
A nasty little bit of algebra (which we won't show) determines that the total variability σ²_total is the sum of this between-group variance and the within-group error σ²_error:
σ²_total = σ²_A + σ²_error.
Using these quantities, Equation 8.8 becomes
ω² = σ²_A/σ²_total = σ²_A/(σ²_A + σ²_error). (8.10)
This equation shows clearly that ω² is the ratio of the variability of the treatment conditions (i.e., σ²_A) to the total variability (i.e., the sum of the explained and error variability).
⁷A related quantity used in much of Cohen's writing (e.g., Cohen, 1988) is f = σ_A/σ_error. You can easily translate this measure to and from ω²:
f = √[ω²/(1 − ω²)] and ω² = f²/(f² + 1).
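The conversion between f and ω² in this footnote is a one-liner in either direction. A minimal sketch (the function names are ours) that also checks that the two formulas are inverses:

```python
from math import sqrt

def f_from_w2(w2):
    """Cohen's f from omega squared: f = sqrt(w2 / (1 - w2))."""
    return sqrt(w2 / (1 - w2))

def w2_from_f(f):
    """Omega squared from Cohen's f: w2 = f^2 / (f^2 + 1)."""
    return f**2 / (f**2 + 1)

f = f_from_w2(0.15)             # a "large" effect in Cohen's terms
print(round(f, 2))              # 0.42
print(round(w2_from_f(f), 2))   # 0.15, recovering the original value
```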
CHAPTER 8. EFFECT SIZE, POWER, AND SAMPLE SIZE
ω̂² = (a − 1)(F − 1) / [(a − 1)(F − 1) + an]. (8.12)
The way that these formulas correct for sampling error is most easily seen in the second formula. You will recall that when no differences exist among the groups, the chance value of F is approximately 1. In Equation 8.12, this chance value is subtracted from F everywhere it appears. This adjustment does not appear in the formula for R² (Equation 8.5), which is why it is always larger than ω̂². We will always use ω̂² when we are trying to infer population values or to use this information to estimate power and sample size.
When the within-groups variability exceeds the between-groups variability (that is, when F < 1), Equations 8.11 and 8.12 give negative values. Of course, the population effect size cannot be less than zero. What the negative value tells us is that the subject-to-subject error variability has swamped any differences among the treatment conditions that might have been present in the population. Under these circumstances, it is best to set ω̂² = 0. This value does not prove that treatment effects are absent (nothing can prove that), only that what differences there are can reasonably be attributed to random variation.
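Equation 8.12, together with the convention of setting negative estimates to zero, can be sketched as a short function (the name is ours):

```python
def omega_squared_hat(F, a, n):
    """Estimate omega squared from an omnibus F statistic (Equation 8.12).
    a is the number of groups and n the per-group sample size.
    Negative estimates are set to zero, as discussed in the text."""
    w2 = (a - 1) * (F - 1) / ((a - 1) * (F - 1) + a * n)
    return max(w2, 0.0)

print(omega_squared_hat(0.8, 3, 10))             # 0.0: F < 1 gives a zero estimate
print(round(omega_squared_hat(2.90, 4, 15), 3))  # 0.087, a modest positive estimate
```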
We will illustrate the estimation of ω² using the same data for the hyperactive boys that we used for the descriptive measures. Both Equation 8.11 and Equation 8.12 give the same value:
⁸A similar estimation formula can be written to estimate d, but it is of less use. As we noted above, an important use of d is for meta-analysis, and there the nature of the procedures makes the descriptive measure appropriate. The correction is most important for the power and sample-size calculations we discuss below, and for them the ω² measure is necessary.
8.2. EFFECT SIZES IN THE POPULATION 165
ω²⟨ψ⟩ = σ²_ψ/(σ²_ψ + σ²_error). (8.15)
We have indicated the partial measure by the subscript ⟨ψ⟩. Because the partial statistic has σ²_ψ + σ²_error, rather than σ²_total, in its denominator, it can be larger than the corresponding complete measure.
Source     SS   df      MS       F
A         490    3   163.33  10.21*
along with their error mean square. The effect sizes you report are adjuncts to this information, not substitutes for it.
If you wish to summarize the effects for the experiment as a whole, we recommend that you use ω̂². This index takes the sampling variability into account and so is most relevant to the population you are studying. Because so many other measures are available, you should make it clear what you are reporting: don't just say "the effect size was 0.04" and leave a reader to guess whether you are talking about R², ω̂², d, or something else. When you are discussing the difference between two groups, you might consider reporting d, because it is widely used and closer to the data than the proportion of variability. However, you should remember that this measure is influenced by sampling error. Cohen's division into large, medium, and small effects is helpful in thinking about how readily an effect stands out from the background variability, but you should remember that this classification says nothing about the practical importance of an effect nor its usefulness to psychological theory.
Finally, remember that specific results are often more useful than overall ones and
tests of contrasts are frequently more informative than omnibus tests. The same goes
for measures of effect sizes. If you plan to discuss a particular aspect of the results,
represented by a contrast, then you should give the effect size for that contrast. When you do so, you will probably find the partial omega squared more useful than the complete statistic.
know ahead of time if our planned study is capable of detecting our anticipated results. If it is, then we can proceed as planned; otherwise, we need to rethink the design of the study.
Determinants of Power
We can control power by changing any of its three determinants: the significance level α, the size of the treatment effects ω², and the sample size n. Changing the α level, however, is not an option, because it is set (typically to .05) by the standards in the research domain. Increasing the size of the effect is more promising. The use of stronger manipulations to increase σ²_A will increase ω², as will decreasing the error variance σ²_error, either through better experimental control or by selecting subjects from a more homogeneous population. Nevertheless, by far the most direct way to control the power of an experiment is sample size, and we will concentrate on it here.
We will illustrate the interplay between power, sample size, and effect size using an experiment with a = 4 treatment conditions. Figure 8.1 shows the relationship between power (on the abscissa) and the sample size needed to attain it (on the ordinate) for two significance levels (α = .05 and α = .01) and three effect sizes (ω² = .01, .06, and .15). To determine the sample size n for a particular combination of significance level, effect size, and power, select a value for power, extend a vertical line upward until it intersects the appropriate combination of significance level and effect size, and read the required group sample size on the ordinate.
The dashed lines show the sample sizes needed when the significance level is α = .05. Look at the line for small effect sizes (ω² = .01). Here, the sample sizes needed to obtain a substantial level of power are outrageously large: about 270 subjects per group to achieve a power of .80 and 350 per group for a power of .90. With medium-sized effects (ω² = .06), the corresponding numbers are 45 subjects and 60 subjects per group, still surprisingly large. Finally, the sample sizes needed to find a large effect (ω² = .15) are more like those we typically find in psychology research journals, namely, n = 17 and 22, for powers of .80 and .90, respectively.
Now compare the solid lines for the more stringent significance level of α = .01 to the dashed lines for α = .05. To achieve power of .80 in an experiment with a medium-sized effect, for example, the sample size must be increased from about 45 per group when α = .05 to over 60 when α = .01. By making it more difficult to reject false null hypotheses, power is reduced, and to offset this decrease, we must increase the sample size.
What do these numbers tell us? For one thing, you need to have substantial resources, both of subjects and of the cost of testing them (probably more than are available to most researchers), before you can have an appreciable chance of detecting
8.4. DETERMINING SAMPLE SIZE 169
[Figure 8.1: The relationship between power and the per-group sample size needed to attain it, for significance levels α = .05 (dashed lines) and α = .01 (solid lines) and for small (ω² = .01), medium (ω² = .06), and large (ω² = .15) effect sizes.]
a small effect. For another, when you have only a small number of subjects, you are unlikely to detect anything but a large effect. It would be foolish to undertake a study that had only a small chance of finding a result. Psychologists often conduct studies that are underpowered, as has been pointed out by Cohen (1962, 1992) and expanded upon by Rossi (1990), Sedlmeier and Gigerenzer (1989), and others. The trade-offs involved here are complicated. For example, if you have only a given number of subjects available, are you better off expending them all in one strong study, or in using them over a series of weaker studies where you can refine your procedure and questions as you go along? We won't attempt to answer such questions, for the most productive research course surely depends on the domain in which the research is done. What is clear is that you should know how to choose the sample size that will produce the level of power you want.
need about 30 subjects per group and that you cannot get by with only 15 is very
important, but the difference between n = 28 and n = 30 is not. You certainly need
to know when the study demands more subjects than you have available, but not the
exact number.
Steps in Determining Sample Size
We have found that the best way to perform the power calculations correctly is to be systematic. We suggest that you follow this sequence of steps.
1. Determine the form of the experiment to be analyzed. So far, we have discussed only the single-factor design, but after completing this book, you will be able to consider a number of design issues at this stage, such as the number of independent variables, the nature of these variables, and the use of subjects in more than one treatment condition.
2. Decide on the hypothesis test to be conducted. Choose the null hypothesis to be tested, the type of statistical test to be used, and the Type I error probability α. Your null hypotheses may be of omnibus form, μ₁ = μ₂ = μ₃ = etc., or may concern the value of a contrast.
3. State the alternative to the null hypothesis that you wish to detect. This step requires you to find the effect size ω² for the desired effect using the formulas in Section 8.1.
4. Select the desired power. This value will usually be somewhere between about .50 and .90. When you have a good idea of the effect you are seeking, such as when you are replicating an existing study, you should choose a fairly high level, say at least .80. When you have merely specified the smallest effect that you would find interesting (and hope to find a larger one), you might pick a smaller value, such as .50 or .60.
5. Combine this information to find the sample size. We describe these calculations
below.
Finding the Size of the Target Effect
Estimates of the effect size for a projected study are usually obtained in one of two ways. Either the researcher has some idea of what the outcome of the study is likely to be, or he or she decides on the smallest effect that is of practical interest. One way to obtain an estimate of the first type is to review the literature and come up with a set of plausible means for your conditions. Another is to find some earlier experiment that you believe to be like yours and use it as a model. These estimates have in common that the projected effect is one you think is likely to occur. The second approach is more useful when you don't have enough information to pick a plausible outcome. You may still be able to decide on the smallest differences among the means that you would care about. In our experiment with hyperactive boys, for example, the
researcher could decide that a difference of 3 points on the comprehension test between the control condition and either of the two drug conditions was the minimum effect needed to justify conducting further research with this therapy. A smaller difference, even if significant, would not be practically important.
However you come up with a target effect, translate it into a value of ω². When your target effect is expressed as means, you will have to make an estimate of σ_error
and use Equations 8.9 and 8.10 to obtain ω². When your target effect comes from the results of a previous experiment, you should use either Equation 8.11 or Equation 8.12 to correct your estimate for chance differences.
Consider three examples that illustrate this process. Suppose you are planning a three-group experiment which, from the reading of several papers in the literature, you expect to have means of 25, 28, and 16. Because these means do not come from any particular experiment, you can treat them as given exactly (otherwise you would need to correct the effect size for sampling error). In your survey, you noted that the within-groups standard deviation σ_error was about 12. From this information, you calculate the effect size. You find the grand mean μ_T, subtract it from the treatment means μⱼ to find the effects αⱼ, and from these calculate σ²_A (Equation 8.9). All these calculations are easily done in a little table:
              a₁    a₂    a₃    Sum   Mean
Means, μⱼ     25    28    16     69     23
Effects, αⱼ    2     5    −7      0
αⱼ²            4    25    49     78     26
We listed the means in the first row, and, on their right, summed and averaged them to find the grand mean μ_T = 23. In the next row, we subtracted the grand mean from the treatment means to obtain the treatment effects αⱼ (these should sum to 0). Finally, in the third row, we squared, summed, and averaged the effects to obtain σ²_A = 26. Now, using Equation 8.10,
ω² = σ²_A/(σ²_A + σ²_error) = 26/(26 + 12²) = 26/170 = 0.153.
the table; after all, we hope the effect is larger than our minimum. The table gives n = 60, which implies we must commit substantial resources to the study.
Using the Power Charts
Although Table 8.1 is extremely useful, it is limited by the number of entries it contains. If you need sample sizes for levels of power not included in the table or for different effect sizes, you must use another method of calculation. A set of charts that applies to a much wider range of effect size and power was developed by Pearson and Hartley (1951) and is widely reproduced in books on the analysis of variance. Appendix A.7 contains modified versions of these charts, appropriate for tests at the 5 percent level and studies with df_num between 1 and 6.¹¹
The power charts make use of a quantity known as the noncentrality parameter, which measures the extent to which the experiment gives evidence for differences among the population means. It is denoted here by φ (the Greek letter phi). The charts let us translate power into a value of the noncentrality parameter. Once we have φ, we combine it with the effect size ω² to calculate the sample size using the equation
n = φ²(1 − ω²)/ω². (8.18)
The relationship between power and φ depends on the degrees of freedom in the analysis. Appendix A.7 contains a separate chart for each value of df_num, and each chart contains power functions for several values of df_denom. You start by finding the chart and line that applies to your study. You will know df_num, and we suggest you start with df_denom = 50. Next, you find your desired power on the ordinate and extend a horizontal line from that point to the right until it intersects with the power function. From the point of intersection, drop down to the abscissa and read off the noncentrality parameter φ. Finally, put this value and the effect size into Equation 8.18 and calculate the sample size. There is one minor complication. You may find when you are done that the sample size you find implies a value of df_S/A different from the starting value of df_denom = 50. When that happens, just pick a line based on it and repeat the calculation. In most cases, the change in n will be small.
We can use the charts to recalculate the sample sizes for our three example studies. The first experiment had a = 3 groups and an effect size of ω² = 0.153. On the chart for df_num = a − 1 = 2, we draw a horizontal line from our target power of .80 on the ordinate to the function labeled df_denom = 50 and note that the scale value immediately below this point is φ ≈ 1.85. We obtain the sample size from Equation 8.18:
n = φ²(1 − ω²)/ω² = (1.85)²(1 − 0.153)/0.153 = (3.423)(0.847)/0.153 = 18.95,
or 19 when we round up to the nearest whole subject. This sample size implies df_S/A = a(n − 1) = (3)(19 − 1) = 54, which is close to our starting value of df_denom = 50. We have used the correct line.
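Once φ has been read from the chart, Equation 8.18 is easy to script; the chart lookup itself remains manual. A sketch (the function name is ours):

```python
from math import ceil

def sample_size_from_phi(phi, w2):
    """Per-group sample size from the noncentrality parameter phi and the
    effect size omega squared (Equation 8.18), rounded up to whole subjects."""
    return ceil(phi ** 2 * (1 - w2) / w2)

# First example: a = 3 groups, omega squared = 0.153, phi = 1.85 from the chart.
n = sample_size_from_phi(1.85, 0.153)
print(n)             # 19
print(3 * (n - 1))   # df_S/A = 54, close to the starting df_denom = 50
```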
The second experiment involved a = 4 groups and had an estimated effect size of ω² = 0.079. To determine the sample size needed to attain a power of .80, we use
¹¹The Pearson-Hartley charts plot the power on a nonlinear scale that expands the upper end of the power axis. They are more accurate than our charts when the power is near one, but are less so for smaller values.
the chart labeled df_num = 3 and the power function for df_denom = 50. The value of φ that it associates with a power of .80 is φ ≈ 1.7. Equation 8.18 gives
n = φ²(1 − ω²)/ω² = (1.7)²(1 − 0.079)/0.079 = (2.89)(0.921)/0.079 = 33.69,
or 34 subjects. With this sample size, df_S/A = 132, so our use of the power function for df_denom = 50 was a little off. Redoing the calculation with the line for df_denom ≈ 100 changes the noncentrality parameter to φ = 1.68 and reduces the sample size to n = 32.9, or 33 subjects. The impact is very small.
The power charts allow you to find sample sizes for powers other than the three in Table 8.1. Suppose you wished to perform the test in the third example (two groups and a minimum interesting effect of ω² = 0.038) with a power of .50. From the chart for df_num = 1 and the line for df_denom = 50, we find the noncentrality φ ≈ 1.42. Based on this value, the sample size is
n = φ²(1 − ω²)/ω² = (1.42)²(1 − 0.038)/0.038 = (2.016)(0.962)/0.038 = 51.04.
This value of n implies df_S/A = 102, which indicates that we should have used the power function for df_denom ≈ 100. Returning to the power chart, we find the new values φ = 1.40 and n = 49.6, or 50 subjects. Again, you can see that the refined estimate of φ produced only a slight change.
is much less than that for a test of overall significance. Some of the component hypotheses will be of much greater interest than others. As a practical matter, sample-size calculations should be based on the parts of the design where the effects are most important. If discrimination between two closely spaced groups is crucial to the research, then the power calculations should be based on that difference, not on the omnibus analysis.
To find the sample size for a component hypothesis, we start by expressing it as a contrast ψ and determining the partial effect size ω²⟨ψ⟩ that it will have. When you have information about the population means μⱼ and the standard deviation σ_error, use Equation 8.15; when you are basing the estimate on F from a previous experiment, use Equation 8.17. Note that the partial effect size, not the complete effect size, is used. Now you are on familiar ground: You enter either the a = 2 column of Table 8.1 or the power charts for df_num = 1 in Appendix A.7 and convert the value of this partial omega squared to a sample size.¹²
As an illustration, recall that the first of our three examples was a three-group study with population treatment means of 25, 28, and 16, and σ_error = 12. We found that the omnibus test with a power of .80 needed a sample size of 19 or 20 per group. Suppose
¹²Sometimes you may want to test a null hypothesis that ψ equals a value other than 0 (this procedure was described with Equation 4.12, p. 73). To find the power or sample size for these hypotheses, you replace ψ² in Equation 8.13 by (ψ − ψ₀)² when calculating σ²_ψ.
it were critical to detect the difference between the first two groups. Because this is the smallest of the three pairwise differences, we would expect to need more subjects than our estimate for the omnibus effect. This pairwise difference is the contrast ψ = μ₁ − μ₂. Finding the hypothesized value of this contrast and using Equations 8.13 and 8.15, we estimate the effect size:
ψ = μ₁ − μ₂ = 25 − 28 = −3,
σ²_ψ = ψ²/(2Σcⱼ²) = (−3)²/(2[1² + (−1)²]) = 2.25,
ω²⟨ψ⟩ = σ²_ψ/(σ²_ψ + σ²_error) = 2.25/(2.25 + 144) = 0.015.
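The contrast calculation above can be sketched as a short function (the name is ours). It computes σ²_ψ and the partial omega squared from the hypothesized means, the contrast coefficients, and σ_error:

```python
def partial_omega_squared(means, coeffs, sd_error):
    """Partial omega squared for a contrast (Equations 8.13 and 8.15)."""
    psi = sum(c * m for c, m in zip(coeffs, means))
    var_psi = psi ** 2 / (2 * sum(c ** 2 for c in coeffs))  # Equation 8.13
    return var_psi / (var_psi + sd_error ** 2)              # Equation 8.15

# Contrast between the first two of the means 25, 28, 16, with sd_error = 12.
print(round(partial_omega_squared([25, 28, 16], [1, -1, 0], 12), 3))   # 0.015
```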
Comments
Power analyses should be an integral part of designing any experiment. It does not
make sense to conduct an experiment that is seriously underpowered. You probably
won't obtain a significant result, and you won't be able to interpret that outcome. The
effect you expected may have been there, but your experiment couldn't detect it. Your
study is all too likely to end up in your file drawer of unpublishable research. Finding
that the implied sample size is unrealistically large should alert you to redesign your
study. Perhaps you can make your treatments more effective, so that the differences
between means increase; perhaps you can reduce the variability of the scores through
better experimental control or the use of more homogeneous subjects. On the other
hand, perhaps you will find that the sample size you require is reasonable, and you
can proceed with the study in greater confidence that, significant or not, the outcome
will be meaningful.
You should be sure that your expectations regarding sample size are realistic. The functions in Figure 8.1 relating sample size and power are not straight lines, particularly the ones for small effect sizes. They imply that it is much harder to increase power by even a few percentage points when it is already high than when it is in the middle or lower part of its range. As a result, a disproportionate amount of resources is required to raise the probability of finding an effect to near certainty, that is, to levels of power above .95. A power of .90 is attainable in some designs, but near certainty is not a realistic goal, particularly when the effect you are searching for is small in magnitude. Interestingly, methodologists seem to agree that a power of about .80 is a reasonable value for researchers in the behavioral sciences. It is a sensible compromise between
8.5. DETERMINING POWER 177
the need for adequate power and the finite resources available to most psychological researchers.
Determining Power
In this section, we briefly consider a second use of power analysis in research, namely, determining the power of an experiment designed to discover a given effect. In this application, the sample size has already been chosen, and the researcher wants to evaluate the power of the experiment as it has been designed. Sometimes the study in question has not yet been performed, and the goal of the analysis is to decide if it has sufficient power to detect an interesting effect. Sometimes the experiment has been conducted, and it is desired to see whether it would have been able to detect a particular effect. We will return to these uses after we have described how the calculations are made.
φ = √[nω²/(1 − ω²)]. (8.19)
Once we have calculated φ, we can convert this value to a power by using the appropriate power chart.
To illustrate this form of power analysis, consider again the three-group experiment we mentioned first on page 171, for which we estimated an effect size of ω² = 0.153. Suppose you had only 30 subjects, 10 for each group. Rather than going ahead with the study immediately, you check to see if this particular sample size gives the study sufficient power to justify the effort of conducting it. You use Equation 8.19 to estimate the noncentrality parameter:
φ = √[nω²/(1 − ω²)] = √[(10)(0.153)/(1 − 0.153)] = √1.806 = 1.34.
The power chart for df_num = 2 translates this value into the power. Because df_denom = a(n − 1) = (3)(10 − 1) = 27, we use the function for df_denom = 30. We locate φ = 1.34 on the abscissa, extend a vertical line upward until it intersects the appropriate power function, then extend the line horizontally to the ordinate, where we read the power to be slightly below .50.
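Equation 8.19 simply inverts the sample-size formula; the conversion of φ to power is still a chart lookup. A sketch of the φ calculation (the function name is ours):

```python
from math import sqrt

def phi_from_n(n, w2):
    """Noncentrality parameter phi for a per-group sample size n and
    effect size omega squared (Equation 8.19)."""
    return sqrt(n * w2 / (1 - w2))

# n = 10 per group with omega squared = 0.153, as in the text.
print(round(phi_from_n(10, 0.153), 2))    # 1.34
# The contrast example below: n = 20 with the partial effect size 0.015.
print(round(phi_from_n(20, 0.015), 3))    # 0.552
```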
This analysis reveals that when the true means are as hypothesized, the experiment will reject the null hypothesis about half the time. Whether this power is sufficient to go ahead with the experiment depends on how the specific alternative hypothesis was chosen. If the three means are the values that you thought were the most likely to be true, then the design is not very satisfactory. You should search for ways to increase its power. On the other hand, if the three means describe the smallest effect that you would be interested in finding, and the actual effect is likely to be larger, then the power may be sufficient to proceed.
As a second example, look at the comparison of the means μ₁ and μ₂ in this same experiment. On page 175, we found that the sample size required to achieve a power of .80 for a test of the contrast ψ = μ₁ − μ₂ was much greater than that needed for the omnibus effect (n = 250 vs. n = 20). But suppose the study were conducted with that smaller sample size? What is the power to detect the difference between μ₁ and μ₂? Again we work from the effect size ω². In our earlier calculations, we found the effect size for the contrast to be ω²⟨ψ⟩ = 0.015. Again using Equation 8.19,
φ = √[nω²/(1 − ω²)] = √[(20)(0.015)/(1 − 0.015)] = √(0.30/0.985) = 0.552.
Turning to the power chart for df_num = 1, we see that power is only about .10, an abysmally small value. Thus, although a study with this sample size has adequate power to detect the omnibus effect, it has very little power to detect the anticipated difference between these two groups. If this difference were important or critical, then the experiment as planned is inadequate. As this example illustrates, it is simply not enough to conduct a power analysis; you need also to consider the specific effects you are interested in detecting.
Comments
You will notice that in our two examples, the effect sizes were determined
the experiment was undertaken. Wherever the means 25, 28, and 16 carne from in
our example, they were not based on information obtained after the study had
conducted. This separation between the effect that is to be found and the study
will look for it is characteristic of power analyses. When planning a study we
with an effect and project to a future study. A power analysis should never be
after a study has been completed, simply as a descriptive measure. It makes no
to apply the power formulas to a set of observed means, calculating what arrtounts"
to the power of the study to find the results it has just found. The calculation is
circular as it sounds. It gives no information that is not better expressed by one
the measures of effect size. 13
There is one circumstance where power calculations can be made after a study has been completed. When the null hypothesis is retained, a researcher may want assurance that the failure to reject the null hypothesis was not due to a lack of power. A power calculation here can show that had the population means followed some particular pattern other than the observed sample values, the study was likely to have detected it. Consider our three-group example, and suppose that after conducting the study using n = 20 subjects per group, the omnibus F test was not significant. We could return to the population means we had originally hypothesized (25, 28, and 16) to see whether the study was likely to have come out significant, had those been the true values. We have already performed these calculations above and found
¹³You should be aware that some of the computer packages include an option to make
that the study had a fairly high power, about 80 percent. Armed with this information, we can be more confident that the failure to find this result was not just due to low power. To take an example that comes out the other way, we saw that the power of that study to reject the null hypothesis when in fact ψ = μ₁ − μ₂ = 25 − 28 = −3 was only about 10 percent. With this finding, we could question the adequacy of this experiment to detect this particular difference between the two means. Another example of these calculations appears in Problem 8.5.
Exercises
8.1. Two groups of 8 subjects were tested for errors in depth perception. Group a₁ was tested after a good night's sleep, and group a₂ was tested after sleep deprivation. Each subject received a score indicating the average error in judgment of distance to a target. The data, with some summary statistics, are:

         Rested        Deprived
       11.8   5.3    13.8   12.3
        8.8  14.8    11.5   12.8
       11.8  16.3    17.0   15.0
       10.0  14.0    15.3   11.5
Sum      92.8          109.2
ΣY²   1,164.7        1,518.2
a. Perform an analysis of variance on these data and draw a conclusion.
b. How large is the difference between the groups, measured relative to their variability? What proportion of the total observed variability can be attributed to the treatment? What is your estimate of this value in the population?
c. Suppose you wished to repeat the experiment, increasing the power but not changing α. Suggest at least two realistic things that you could do.
8.2. In Problem 3.2 of Chapter 3, an experiment was reported in which the rates of learning a T maze by three groups of rats were compared following different operations: Group a₁ received a lesion in one location in the brain, group a₂ a lesion in another location, and group a₃ a sham operation. Using the data in that problem, estimate omega squared for the following:
a. The overall effect of the three conditions.
b. The effects of the following comparisons:
i. The comparison between the two groups receiving brain lesions.
ii. The comparison between the control (a₃) and an average of the other two groups.
8.3. Consider two almost identical experiments that have found the same means, Ȳ₁ = 12.3, Ȳ₂ = 10.6, Ȳ₃ = …, and Ȳ₄ = 9.8. Their analyses of variance are different:

Source     SS   df    MS     F        Source     SS   df    MS     F
A       266.7    3  88.9  2.90        A       266.7    3  88.9  1.32
S/A   1,713.6   56  30.6              S/A   3,763.2   56  67.2
Total 1,980.3   59                    Total 4,029.9   59
Calculate the squared correlation ratio R² and the estimate of ω̂² for the two experiments. Explain why the four values you have found are not the same, even though the calculations are based on the same means.
8.4. Suppose you are preparing to conduct a four-group experiment that you will analyze with an analysis of variance. You expect that your group means will show no less deviation than the set μ₁ = 66, μ₂ = 71, μ₃ = 74, and μ₄ = 85. A plausible guess is that the standard deviation is σ_error = 11. How many subjects should you choose if you want a power of .9 to detect the following effects at the α = .05 level?
a. The omnibus effect.
b. Linear trend.
A fourgroup analysis of variance gives
8.5. Source SS dj MS F
A 30 3 10 2
300 60 5
SfA
Total 330 63
ho performed the analysis wishes
. ·¡¡ t and the person w . . b d on
Th .IS result is not S1gn1 can ' b pted This assertwn 1S ase
h . n now e acce . d'ffi b
claim that the null hypot es1s ca . f . terest the four group means I er y
notion that in the smallest alter~ative o ~n ate ~f the power of the analysis to
. ( 9 10 11 12). What IS your es 1ill
pmnt e.g., ' ' ' . able?
this effect? Is the cla1ill reason . . t group gives the means 10.6,
· t with 30 subJec s per ·ff ig;niflcantljl:
8.6. A fourgroup expenmen . fi d that these means do not di er o.
5 d 12 1 An analysis of var.ance n s
9. ' an . . SS df MS F
Source
127.8 3 42.6 0.71
A
S/A 6,925.2 116 59.7
Total 7,053.0 119
. h ex eriment with a larger sample. Wha\
The experimenter decides t~ rephcdatedtt e Yicate this study with a power of .95.
mple size nee e o rep . . F
you say about the sa . . d obtains the stat1st1C =
alysis of variance an . d
8 ·7· A researcher performs
ff d
an an . t the effect size for biS observe
Heestlmaes . · nt
on 2 and 27 degrees o ree om. h t to find the power of his expenme
to be GP = 0.021, th~ uses ~he pow~: ~a:u~ation gives a power. of less than .10.,
detect an effect of this magmtu~e. H d determines that 160 subJects per group
so he runs a samplesize calculat10n an h ff t In his report of the study he
f b ut 80todetectt e e ec.
give him a power o a o . . t lacked sufficient power to
. h d th t the expenmen 160
A power analysis s owe a f d Had a sample size of n =
f eans that was oun .
detect the pattern o ':ould have differed significantly. .
been used, the groups . . t Explain what IS wrong
t but his interpretatwn 1S no .
His calculations are corree '
this statement.
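For readers who prefer to check such power calculations numerically rather than with the power charts, here is a minimal sketch using the noncentral F distribution. It assumes SciPy is available; the means, standard deviation, and sample sizes below are hypothetical illustrations, not the values of any exercise.

```python
from scipy.stats import f, ncf

def anova_power(means, sigma, n, alpha=0.05):
    """Power of the omnibus F test for len(means) groups of size n,
    given hypothesized population means and error standard deviation."""
    a = len(means)
    grand = sum(means) / a
    # noncentrality parameter: lambda = n * sum((mu_j - grand)^2) / sigma^2
    lam = n * sum((m - grand) ** 2 for m in means) / sigma ** 2
    df_num, df_den = a - 1, a * (n - 1)
    f_crit = f.ppf(1 - alpha, df_num, df_den)   # critical value of F
    return 1 - ncf.cdf(f_crit, df_num, df_den, lam)

# Hypothetical four-group example: power grows with the sample size
print(round(anova_power([10, 12, 14, 16], sigma=6, n=10), 3))
print(round(anova_power([10, 12, 14, 16], sigma=6, n=20), 3))
```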
9
Using Statistical Software

... by their users. What we will do, then, is to make some higher-level points, in this chapter and scattered throughout this book, that will help you employ whatever program you use effectively. When you begin working with a particular program, you will want to seek documentation tailored to your particular needs and level of experience. Many other books are available to fill this need; they

¹One of the authors recalls a summer job in college during which a five- or six-way mixed analysis of variance (a design we will discuss later) was conducted (and checked!) using a desk calculator. It took two people, working full time, the better part of a week. It would now take at most a few seconds of computer time.
provide tutorials, detailed instructions, and even introductions to data analysis that
are particularly directed to one or another program or package.
One reason that we don't emphasize any particular program is that we do not have
any idea which one you will use. Any of the major packages, and many of the minor
ones, will do all the analyses in this book, and the major ones, such as SPSS (SPSS,
Inc., 1999) or SAS (SAS Institute, Inc., 1998), will do more. Which you end up using
will depend on what might otherwise be secondary factors, such as their availability
and the fact that your colleagues are using them. Do not underestimate the latter
factor: Support from somebody who has become proficient with a particular package
goes a long way toward solving the difficulties that arise in understanding the documentation.
Table 9.2: Control language to analyze Table 9.1 using Release 10 of SPSS.
1. UNIANOVA
2. score BY spacing
3. /CONTRAST (spacing)=Polynomial
4. /METHOD = SSTYPE(3)
5. /INTERCEPT = INCLUDE
6. /POSTHOC = spacing ( TUKEY )
7. /PRINT = DESCRIPTIVE ETASQ HOMOGENEITY
8. /CRITERIA = ALPHA(.05)
9. /DESIGN = spacing .
9.2 An Example
We will illustrate some of these ideas by following through the analysis of the data in
Table 9.1. Because we are not writing a manual for computer usage, we will not offer
recommendations as to which program to use, and we certainly will not attempt to
explain the graphical interface (which like as not will have changed by the time this
book is printed). Instead, we will offer an explanation that is general enough to help
you understand the output of whatever program you happen to use. This approach is
possible for this material because the analysis of variance is fairly well standardized,
so the bulk of the analysis has a common form, regardless of the program used.
For our particular example, we will use Release 10 of SPSS. We hasten to add that
our choice of this program was made for the not very profound reason that it was
available to us as we were writing this chapter. Another program or version would
have served us as well. We began by entering the data in the form, and with the
variable names, in Table 9.1. We then used the graphical interface to set up and run
the program. We had the program save the control statements so that we could study
them more closely. They are given in Table 9.2, with line numbers added so that we
can refer to them easily.
Let's go through these statements in detail. The first line, UNIANOVA, contains the
name of the particular module of SPSS that we are using. It conducts an analysis of
variance (strictly, a univariate analysis of variance, which explains the name). Line 2
has the very important function of specifying the dependent and independent variables,
score and spacing, respectively. The remaining lines, those starting with a slash, tell
the program more details about what we want it to do. You can ignore lines 4 and 5
(we will mention the Type III sums of squares in Chapter 14, and line 5 simply tells
the program that the grand mean can be different from zero). Line 8 says to use the
.05 level for any operations that require a significance level. Line 7 tells the program
that we want it to calculate (and "print") several things: the descriptive statistics,
measures of effect size, and a test for homogeneity of variance. Lines 3 and 6 request
two types of analytical analysis, namely, a trend analysis and comparisons of the group
means using the Tukey test. Finally, line 9 indicates the particular form of design that
is to be used, which at this point is simply a standard one-way analysis of variance,
in which the means depend on the variable spacing.
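Although we use SPSS here, the same one-way analysis can be run in almost any environment. As a rough analogue of the UNIANOVA command above, here is a sketch in Python (it assumes SciPy is available; the scores are hypothetical stand-ins, not the actual data of Table 9.1):

```python
# A rough analogue of the one-way analysis requested in Table 9.2.
from scipy.stats import f_oneway

spacing_groups = {            # one list of scores per level of spacing (made up)
    "s1": [27, 25, 30, 29, 30],
    "s2": [32, 31, 29, 30, 33],
    "s3": [35, 33, 36, 32, 34],
    "s4": [38, 39, 34, 37, 36],
}
F, p = f_oneway(*spacing_groups.values())   # omnibus F test across the groups
print(f"F = {F:.2f}, p = {p:.4f}")
```

Contrasts, effect-size measures, and the Tukey test would each need additional calls or hand calculation; the point is only that the omnibus test itself is a one-liner in most packages.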
For one thing, there are more sources listed here; for another, the labels are unfamiliar.
You can figure out what is what by comparing the program's results to those of our
previous analysis. Comparing the two tables tells us (correctly) that the three sources
you expected to find, A, S/A, and Total, appear here as Spacing (the name we gave
to our independent variable), Error, and Corrected Total, respectively. The source
labeled Intercept is a test of the hypothesis that the grand mean is equal to zero (see
p. 37), which is not very interesting here. The source labeled Corrected Model is,
for this design, the same as the effect of spacing (it will differ for designs with more
factors), and that labeled Total is simply the sum of the Y²ᵢⱼ. Ignore for the moment
that the sums of squares are labeled Type III Sums of Squares; we will explain that
in Chapter 14. The remainder of the F table should be familiar. Note that, as for the
Levene test, the program calculates the exact descriptive level for the three F tests
(Spacing, Intercept, and Corrected Model), which here are so small as to be reported
as zero. Because they are all less than α, all the null hypotheses associated with these
effects are easily rejected. The measure of effect size reported by SPSS is the R² we
described in our discussion of effect size (Equation 8.4, p. 161), indicating that approximately
68 percent of the total sum of squares is associated with the differences among the
treatment conditions.³
The request for analytical analyses (lines 3 and 6 in Table 9.2) produced a considerable volume of output, of which we reproduce only a few parts in Table 9.4. The first section concerns the trend analysis. Because there are four levels of spacing, three trend coefficients (linear, quadratic, and cubic) can be tested, and the program tests all of them; we give only the output from the linear part.⁴ The analysis appears straightforward: The program reports the observed value of the contrast and its standard error. Although it does not give the calculation itself, the ratio of the observed contrast and its standard error is a t statistic (Equation 4.9, p. 71):

t = ψ̂/s_ψ̂ = 7.826/1.500 = 5.217.

This value is significant, as indicated by the descriptive level of .000 in the row labeled Sig. As you can verify, this t statistic is the square root of the statistic F = 27.22 that we calculated for these data in Section 5.1, a property that always holds when the two statistics are applied to the same contrast. The program also calculates a 95 percent confidence interval for ψ_linear (from 4.646 to 11.006). This interval does not include zero, which again shows that the hypothesis that ψ_linear = 0 can be rejected.
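You can check the relationship between the contrast t and F directly from the reported values; a small sketch:

```python
# Verify that the contrast t statistic squares to the F from Section 5.1.
psi_hat = 7.826      # contrast estimate from the SPSS output
se_psi = 1.500       # its standard error
t = psi_hat / se_psi
print(round(t, 3))       # 5.217, the t statistic
print(round(t ** 2, 2))  # 27.22, the F for the same contrast
```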
The last analysis in Table 9.4 is the Tukey test (described in Section 6.4, pp. 120-124). The output includes differences between means and confidence intervals for each pair of means (which we do not show), and a summary of the result using the equivalent subset notation (pp. 122-123). The output identifies two sets of equivalent groups, with group a₁ alone in one subset and the remaining groups in the other. The significance information in the last row refers to a test of significance for differences
³This program, like many others, does not calculate ω̂², but you can easily determine it from the sums of squares or F statistic using Equations 8.11 and 8.12. The adjusted R² is a less substantial correction of a similar sort.
⁴SPSS assumes equal intervals for this analysis; special steps must be taken when the intervals are not equal.
among the means within each subset. The value of 0.076 for the second subset exceeds α = .05, so the means within this subset do not differ significantly.

Table 9.4: Selected output from SPSS for Tables 9.1 and 9.2. (The first section, the Spacing polynomial contrast, reports a linear contrast estimate of 7.826.)

On closer inspection, Table 9.4 contains a surprise. The trend coefficients we used in Chapter 5 (those in Appendix A.3) are {−3, −1, 1, 3}, but when we apply them to these data we obtain ψ̂_linear = 35.00, not the value given by the program. Neither value is wrong; both lead to the same test statistic and the same conclusions, but their particular values are different. You have to do a little exploring with the numbers to discover what happened. It turns out that the computer adjusts the coefficients so that their squared values sum to 1 before it calculates the value of ψ̂. The original coefficients have Σc²ᵢ = 20, so the program, in effect, divides each of them by √20. It uses the coefficients {−0.6708, −0.2236, 0.2236, 0.6708} instead of {−3, −1, 1, 3}.
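The rescaling is easy to reproduce in a few lines; a sketch using the raw coefficients and the value ψ̂_linear = 35.00 obtained with them:

```python
import math

c = [-3, -1, 1, 3]                           # raw linear trend coefficients
norm = math.sqrt(sum(ci ** 2 for ci in c))   # sqrt(20)
c_scaled = [ci / norm for ci in c]           # squared values now sum to 1
print([round(ci, 4) for ci in c_scaled])     # [-0.6708, -0.2236, 0.2236, 0.6708]

psi_raw = 35.00                              # contrast with the raw coefficients
print(round(psi_raw / norm, 3))              # 7.826, the value SPSS reports
```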
Our purpose in pointing out the different calculations is to alert you that software
packages differ in the output they provide and the way they do the calculations.
Although the tests are correct, sorne of the intermediate values may be different from
the values obtained using the procedures and formulas in this book or, for that matter,
from those given by a different program or a different part of the same program.
One reason we bring this matter up here is that sometimes you need to combine the
results from several parts of the computer output. For example, in our discussion of
theoretical predictions (Section 4.6), we pointed out that it is often useful to determine
the variability that is not explained by a particular contrast or set of contrasts. Thus,
in trend analysis, we used this approach to evaluate whether a straight line is sufficient
to describe the relationship between the independent variable and the group means
(p. 94). The relevant sum of squares is the difference between the sums of squares for
the overall effect and those for the contrasts (Equation 4.16, p. 82). It is essential that
these sums of squares are calculated on a consistent basis.
Let's complete the analysis here as we did in Section 5.1. When the sum of the squared coefficients equals 1, the equation for the sum of squares (Equation 4.5) is very simple:

SS_ψ = nψ̂²/Σc²ᵢ = nψ̂².

Applying this formula to the value of ψ̂ calculated by the computer, we obtain⁵

SS_linear = nψ̂²_linear = (5)(7.826)² = 306.231.

This value can properly be combined with the sum of squares from the overall analysis to obtain a measure of how much the linear function fails to fit the data:

SS_failure = SS_A − SS_linear = 373.750 − 306.231 = 67.519.

We found SS_failure = 67.50 when we calculated this value in Section 5.1; the difference is attributable to rounding. The remainder of the analysis proceeds as we showed it in Chapter 5.
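Assembled as a short computation (the numbers are exactly those used above):

```python
n = 5                 # subjects per group
psi_hat = 7.826       # normalized contrast from the SPSS output
ss_a = 373.750        # overall between-groups sum of squares

ss_linear = n * psi_hat ** 2     # valid because the squared coefficients sum to 1
ss_failure = ss_a - ss_linear    # variability not explained by the linear trend
print(round(ss_linear, 3))       # 306.231
print(round(ss_failure, 3))      # 67.519
```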
Before leaving this illustration, we want to make a larger point. There are frequently several different computational approaches to solving a particular problem, and a given computer program may use any one of them. We have presented the analysis of variance in the form we believe is easiest to understand, but it is frequently (and rightly) not the one that a programmer would choose. When you are going beyond the simple results of the significance tests, it is important to verify that the computer's output is what you think it is by applying it to an example whose values you already know and understand.
9.3 Hints, Cautions, and Advice
We will close this chapter with a few observations and suggestions on the use of statistical software packages.

⁵Another way to calculate the sum of squares is to work from the test statistic. Because F = MS_ψ/MS_S/A, t² = F, and df_ψ = 1, we have SS_ψ = t²MS_S/A. This formula is useful when the program provides t statistics but you don't know what coefficients it used.
III
TWO-WAY FACTORIAL EXPERIMENTS
In this part, we will introduce experiments in which treatment conditions are classified
with respect to the levels of two independent variables. We will continue to assume
that subjects serve in only one of the treatment conditions, that they provide only a
single score or observation, and that the sample sizes n are equal. In later chapters,
we will relax these restrictions and consider designs in which the same subjects receive
some or all of the treatment conditions and designs with unequal sample sizes. The
designs in this part are the simplest examples of experiments with two independent
variables, computationally as well as conceptually, and they serve as the basis for the
more complicated designs we consider later.
The most common way by which two or more independent variables are manipulated in an experiment is a factorial arrangement of the treatments or, more simply, a factorial experiment or factorial design. We will use these terms interchangeably. We will refer to the two independent variables as factor A and factor B and to the design itself as an A-by-B factorial. Factorial designs are those in which the independent variables are completely crossed, which refers to the fact that the design includes every possible combination of the levels of the independent variables. Suppose, for example, that two variables are both varied in a factorial design: the magnitude of the food reward given to a hungry rat for completing a run through a maze and the difficulty of the maze the rat must learn. There are three levels of food magnitude (small, medium, and large) and two levels of maze difficulty (easy and hard). We create a factorial design by combining each of the three levels of reward with each of the two levels of difficulty to produce six cells, one for each combination of a reward level with a difficulty level. We will often refer to a factorial design by the number of levels of the factors involved. In this case, the experiment is a 3×2 factorial design, with the times sign indicating that the levels of the two factors are crossed to produce a total of 3 × 2 = 6 treatment combinations.
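The complete crossing of two factors can be generated mechanically; a sketch using the reward and difficulty levels of the example:

```python
from itertools import product

reward = ["small", "medium", "large"]      # factor A: 3 levels
difficulty = ["easy", "hard"]              # factor B: 2 levels

cells = list(product(reward, difficulty))  # every combination of levels
print(len(cells))                          # 6 treatment combinations
```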
In comparison with the single-factor design, the factorial has advantages of economy, control, and generality. Factorial designs are economical in the sense that they provide considerably more information than separate single-factor experiments, often at reduced cost of subjects, time, and effort. They achieve experimental control by providing a way to remove important but unwanted sources of variability that otherwise would be part of the estimate of error variance. Finally, factorial designs allow us to
assess the generality of a particular finding by studying the effects of one independent
variable under different conditions or with different types of subjects. We evaluate
this generality by examining the results of a factorial experiment for an interaction
between the factors.
There are four chapters in Part III. Chapter 10 considers the important characteristics of factorial designs, particularly the concept of interaction, without regard to numerical calculation; Chapter 11 covers the standard analysis of the two-factor experiment; and Chapters 12 and 13 present analytical comparisons that are particularly useful in the detailed analysis of this type of experimental design.
10
Introduction to Factorial Designs
Table 10.1: A factorial experiment (left) and its interpretation as a set of single-factor experiments (right). (The diagram crosses three levels of contrast, low, medium, and high, with three line lengths, 3, 5, and 7 in.)
An Example of No Interaction
Table 10.3 presents some hypothetical results for our experiment on reading speed. Assume that equal numbers of children have been randomly assigned to each of the nine conditions and that the values in the table are the average reading scores. We will refer to the contrast variable as factor A and to the three degrees of contrast as levels a₁, a₂, and a₃. Likewise, the line-length variable is factor B, and the three line lengths are levels b₁, b₂, and b₃.

The main effect of contrast (factor A) is obtained by averaging the three cell means over the three length conditions. The last column of the table gives these means for the three contrast conditions. These averages are called the row marginal means because they can be written in the margin of the table. Thus, the average reading speed Ȳ_A1 in the low-contrast condition is found by combining the results from the three line-length conditions and calculating an average:

Ȳ_A1 = (0.89 + 2.22 + 2.89)/3 = 2.00.

This mean represents the average performance of all the subjects in the experiment who read the material under low contrast; the specific conditions (i.e., the three line lengths) are unimportant at this point. We can obtain similar averages for the subjects receiving the materials under medium and high contrast. These two marginal means are given in the other two rows.

In like fashion, marginal averages in the columns give information concerning the main effect of the different line lengths (factor B). Specifically, the average reading
10.2. THE CONCEPT OF INTERACTION
speed for subjects in the 3-inch condition is an average of the means for the three contrast conditions:

Ȳ_B1 = (0.89 + 3.89 + 4.22)/3 = 3.00.

This average, with those for the other length conditions, appears as a column marginal mean in Table 10.3. Each of these marginal means represents the average performance of all subjects who received the specified length condition, ignoring the particular level of contrast (factor A) that each subject received.
You will usually want to plot the cell and marginal means in a graph. A graphical display often reveals aspects of the data that were less apparent in a table of means. Figure 10.1 shows several ways to plot the means in Table 10.3. First, look at the marginal averages for contrast, which are plotted in the upper left panel. This plot shows that low contrast produces the slowest reading scores and that medium and high contrast produce higher but similar averages. It provides a general description of the overall or main effect of factor A. The graph does not tell us whether the effects of contrast are the same for all line lengths. That question requires the double-classification plot of the cell means in the upper right panel of Figure 10.1. We draw this graph by marking off contrast on the baseline, plotting the cell means, and connecting the points with the same value of line length. The functions for the three component experiments in this graph are parallel, which means that the pattern of differences obtained with contrast is exactly the same at each level of line length. Parallel lines indicate that there is no interaction.
We arrive at the same conclusion if we focus on the other independent variable, line length. The graph at the lower left of Figure 10.1 plots the main (or average) effect of factor B. You can see that reading speed increases steadily as the length of the lines increases from 3 to 7 inches. The component single-factor experiments in the three rows of Table 10.3 are plotted in the right-hand graph. Again, the patterns of differences associated with the length variable (factor B) are the same for the three contrasts. The lines are parallel, and there is no interaction.

In the absence of interaction, we usually focus our attention on the main effects, the two sets of marginal means, rather than on the cell means. Since the effects of contrast, for example, do not depend on any particular line length, we can combine the results from the relevant component experiments without distorting their outcomes. A similar argument applies to line length.
An Example of an Interaction
Table 10.4 presents a second set of hypothetical results for the reading example. The same main effects are present; that is, the means in the row and column margins of Table 10.4 are identical to the corresponding means in the margins of Table 10.3. There is a big difference, however, in the simple effects. Figure 10.2 shows plots of the individual means like those on the right of Figure 10.1. Contrast is on the baseline of the left-hand graph and line length is on the baseline of the right-hand graph. In both plots, you can see that the patterns of differences for the simple effects are not the same at all levels of the other independent variable. Thus, an interaction is present.

To be more specific, consider the simple effect of contrast at level b₁, the cell means in the first column of Table 10.4. These three means are plotted in the left
Figure 10.1: Reading scores of Table 10.3 plotted as a function of contrast (upper panels) and as a function of line length (lower panels). No interaction is present.
Table 10.4: A hypothetical set of means for the reading experiment that shows an interaction.

                         Line Length (Factor B)
Contrast (Factor A)   3 in. (b₁)   5 in. (b₂)   7 in. (b₃)   Mean
Low (a₁)                 1.00         2.00         3.00       2.00
Medium (a₂)              3.00         5.00         7.00       5.00
High (a₃)                5.00         6.00         5.00       5.33
Mean                     3.00         4.33         5.00       4.11
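The marginal means in Table 10.4 can be recomputed directly from the cell means; a short sketch:

```python
# Cell means of Table 10.4: rows = contrast (low, medium, high),
# columns = line length (3, 5, 7 in.)
cells = [
    [1.00, 2.00, 3.00],
    [3.00, 5.00, 7.00],
    [5.00, 6.00, 5.00],
]
row_means = [sum(row) / 3 for row in cells]                     # factor A marginals
col_means = [sum(row[j] for row in cells) / 3 for j in range(3)]  # factor B marginals
print([round(m, 2) for m in row_means])   # [2.0, 5.0, 5.33]
print([round(m, 2) for m in col_means])   # [3.0, 4.33, 5.0]
```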
THE DEFINITION OF AN INTERACTION

Figure 10.2: The cell means of Table 10.4 plotted as a function of contrast (left panel) and of line length (right panel).
For the simple effect of line length under low contrast (the first row), the relationship is linear. The simple effect at level a₂ (medium contrast) shows a steeper linear trend. But look at what happens in the high-contrast condition. The relationship in this third component experiment is curved: the reading scores first increase and then decrease with line length. Again, the lines are not parallel.
With either plot, you can see at a glance that the form of the relationship between one independent variable (contrast or line length) and the mean reading score depends on the level of the other independent variable. You can now understand why we are less interested in the main effects when an interaction is present than when it is absent. In either plot, the patterns of differences in the simple effects are not the same as the pattern expressed in the corresponding main effects. The main effects are not representative of the simple effects and, thus, distort the single-factor experiments that make up the factorial design. There was no such distortion in Figure 10.1, where the interaction was absent.
In the second case, the combination is complex: it will take theoretical ingenuity to
explain why the relationship between contrast and reading speed is different for the
different line lengths or why the relationship between line length and reading speed is
different for the three levels of contrast. The discovery of an unexpected interaction
usually puts a strain on the current theories and forces them to be modified.
Some theories, however, are rich enough to predict interactions, and when they do, the detection of this interaction is often the major point of the experiment. A good illustration comes from the program of research conducted by John Garcia, who studied the interplay between different classes of stimuli and reinforcers in a classical conditioning paradigm. In classical conditioning, a previously neutral cue (e.g., a flashing light) is paired with a reinforcer (food reward) that reliably elicits a particular behavior (salivation). Formally, these are called the conditioned stimulus, the unconditioned stimulus, and the unconditioned response, respectively. Conditioning is said to occur when the conditioned stimulus comes to reliably elicit the response, now called the conditioned response, in the absence of the reinforcer.
Garcia suggested (e.g., Garcia, McGowan, Ervin, & Koelling, 1968) that animals are equipped with two independent learning systems, one promoting the identification of cues in the external environment associated with sustenance (food and water) or danger, the other sensitive to serious changes in the animal's internal environment (cues associated with ingested dangers such as poisons and other noxious stimuli that produce illness). He designed a series of experiments that followed the same basic strategy: a gustatory stimulus (the flavor of a food pellet) was paired for one group with an "internal" reinforcer that resulted in sickness (a nonlethal poison) and for another group with an "external" reinforcer (an electric shock). He hypothesized that conditioning would occur only with the internal reinforcer. The other part of these experiments involved two more groups, one receiving a nongustatory stimulus (a flashing light) paired with the internal reinforcer and the other receiving the same stimulus paired with the external reinforcer. Here, he hypothesized the opposite result, namely, that conditioning would occur only with the external reinforcer. The basic design of this experiment is a 2 × 2 factorial design in which type of conditioned stimulus (flavor and light) is crossed with type of reinforcer (poison and shock). These theoretical expectations are summarized in a simple table, in which "Yes" and "No" indicate whether conditioning is successful:

                          Reinforcer
Stimulus               Poison    Shock
Gustatory (flavor)      Yes       No
Nongustatory (light)    No        Yes
¡,
If you assigned numbers to the four cells, say 5 to the "Yes" cells to represent condi
tioning and O to the "No" cells to represent no conditioning, and plot the results, you
will see there are no main effects, but that the simple effects go in opposite directions.
Garcia and his associates found this pattern in many experimental settings, with dif
ferent external and interna! reinforcers and with different gustatory and nongustatory
stimuli, thus confirming their theoretical analysis.
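The arithmetic behind this claim is easy to check; a minimal sketch, assigning 5 to the "Yes" cells and 0 to the "No" cells as suggested above:

```python
# Rows: stimulus (gustatory, nongustatory); columns: reinforcer (poison, shock)
cells = [[5, 0],
         [0, 5]]

row_means = [sum(row) / 2 for row in cells]
col_means = [(cells[0][j] + cells[1][j]) / 2 for j in range(2)]
print(row_means, col_means)   # [2.5, 2.5] [2.5, 2.5] -- no main effects

# The simple effects of reinforcer at each stimulus go in opposite directions:
print(cells[0][0] - cells[0][1], cells[1][0] - cells[1][1])   # 5 -5
```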
Figure 10.3: Plot of the cell means from Table 10.5.
In the first four panels of Figure 10.3, the lines for the simple effects at the levels of B are either superimposed or parallel. This form characterizes the lack of an interaction. The first example shows a completely negative outcome: neither main effects nor an interaction are present. In the second example, there is only a main effect of factor A, as you can verify by noting that the row marginal means are different (4 − 6 = −2) and that the column marginal means are equal (5 − 5 = 0). You can also verify the absence of interaction by checking that the difference between the two means at level b₁ (4 − 6 = −2) is identical to the corresponding difference at level b₂. In the third example, the marginal means show a main effect of factor B (7 − 3 = 4), but no effect of factor A (5 − 5 = 0). Again, no interaction is present: the difference between the two means at level b₁ (7 − 7 = 0) is identical to the difference at level b₂ (3 − 3 = 0). In example 4, both independent variables have main effects: the differences between the marginal means are 4 − 6 = −2 and 7 − 3 = 4. However, there is no interaction. The two simple effects of factor A are identical (6 − 8 = −2 and 2 − 4 = −2), as are the two simple effects of factor B (6 − 2 = 4 and 8 − 4 = 4).

The last four examples contain A × B interactions. In the corresponding panels of Figure 10.3, the lines for the simple effects are not parallel, a form that characterizes
an interaction. The marginal means for example 5 show no difference between the two row means and no difference between the two column means; hence, there are no main effects of either factor. On average, neither factor has an effect, and on this basis someone might conclude that the manipulations were ineffective. But when you look at the cell means, you see that the independent variables produce quite striking effects. When factor A is manipulated at level b₁, a₁ is superior to a₂, but when it is manipulated at level b₂, the effect is reversed. There wasn't a main effect because the two simple effects canceled each other out. This pattern, an interaction without main effects, is sometimes called a pure interaction. We saw this type of interaction in Garcia's experiment.

The next two examples illustrate situations in which there is an interaction and one main effect. The main effect in example 6 is revealed in the row marginal means (4 − 6 = −2), but not in the column marginal means (5 − 5 = 0). An interaction is apparent in the plot of the cell means and can be verified by comparing either set of simple effects. In example 7, the situation is reversed; the marginal means indicate a main effect of factor B (7 − 3 = 4) and no main effect for factor A (5 − 5 = 0), and the nonparallel lines indicate an interaction. Finally, in example 8 all three effects are present. There is a main effect for both independent variables and an A × B interaction, as you can see from the nonparallel lines or by comparing either set of simple effects.
             Study Time
           10 min   20 min
Younger       8       14
Older        30       36
The two simple effects here exhibit the same 6-point change, so, mathematically, there is no interaction. But for us to interpret these equal increases as indicating that age and study time do not interact, we would have to believe that a 6-item change represents the same amount of additional learning at the bottom of the range (the younger children) as it does in the middle (the older children). One might well question this assumption: after all, the young children nearly doubled their scores, while the older ones increased theirs by only 20 percent. Moreover, we have no assurance that the same lack of interaction would be obtained if learning were measured in another way, say by the number of questions the children could answer in a fixed test period or the time they would take to answer a question based on the studied material. Even though another measure may place the groups in the same order (the older children learn more than the younger, and increasing the study time helps both groups), we cannot tell whether an interaction would be present until we actually performed the study. In a case like this, we cannot trust the comparability of the differences, so the interpretation of the interaction (or its absence) is clouded.

What this example shows is that for the size of an interaction to be precisely interpretable, we need to know that the simple effects, in this case the differences between means, mean the same thing regardless of where they lie in the range of the dependent variable. Strictly speaking, the following principle must hold:
Themeaningofth d""' lng prmc¡p]e must hold·
d , e 111erence betwee t ·
Thi oes not depend on the size o[ the s:/or wo values o[ the dependent variable
s statement concerns the rela . . es.
(the amount of learnin . t!Onship between the quantit
number of correct an g m the example) and the specific y we want to measure
measurement theor ;wers or the number of answers in thway w~ measure it (the
measured on an in¡~r:~:~:~t :a'Oiables for which this princ~p~~~~;te peri?d). In
4 e. n an llltervai scale . . s are srud to be
⁴The concept of an interval scale was introduced into psychology by Stevens (1951). There is a
substantial literature (frequently polemical) on the appropriateness of these assumptions on statistical
6-point simple effects) has the same meaning regardless of the particular value of the
numbers (8 and 14 or 30 and 36). If we could be sure that our potential dependent
variables were all on an interval scale, we could use any of them and reach the same
conclusions. However, if we cannot, we need to look more cautiously at the interaction.
In extreme cases, such as the differences in the example above, it may be impossible
to say whether an interaction really exists or not in more than a superficial way.
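How fragile the equal 6-point changes are can be seen by applying any monotonic rescaling to the four means; the sketch below uses a logarithm, an arbitrary choice made purely for illustration:

```python
import math

# Cell means from the age-by-study-time example.
younger = {"10 min": 8.0, "20 min": 14.0}
older = {"10 min": 30.0, "20 min": 36.0}

# On the original scale the two simple effects are identical.
raw_young = younger["20 min"] - younger["10 min"]  # 6.0
raw_old = older["20 min"] - older["10 min"]        # 6.0: no interaction

# A logarithm preserves the order of the scores but not the differences.
log_young = math.log(younger["20 min"]) - math.log(younger["10 min"])
log_old = math.log(older["20 min"]) - math.log(older["10 min"])
# log(14/8) is about 0.56, log(36/30) about 0.18: the formerly equal
# simple effects now differ, so an interaction appears on the new scale.
```

Any other order-preserving rescaling (a square root, for instance) would change the simple effects in its own way, which is exactly why the interpretation in the example is clouded.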
Most real situations are not as problematic as our fictitious example. When the
means fall in a roughly equivalent range, the presence or absence of an interaction
is usually clear. In fact, there are some interactions for which concerns about the
scaling of the dependent variable are irrelevant. These are interactions that cannot
be eliminated by changing the dependent variable, as long as the order of the means
remains the same. Such interactions are said to be nonremovable. To determine
whether an interaction is nonremovable, we plot the data twice, once with each of the
independent variables on the abscissa. If the lines in either of these two plots cross,
then the interaction is nonremovable; for this reason, a nonremovable interaction is
sometimes known as a crossover interaction. To illustrate, consider two different
outcomes for a 2 x 2 factorial study:
[Two 2 x 2 tables of cell means, one per study, with rows a1, a2 and columns b1, b2;
the means themselves are plotted in Figure 10.4.]
The results for the study on the left are plotted in the top part of Figure 10.4.
You can see that the lines cross when the levels of factor A are placed on the abscissa
(left plot). This crossover implies that the interaction is nonremovable. The fact
that the lines do not cross when plotted with factor B on the abscissa (right plot)
does not alter this conclusion. The means from the other study are quite different.
They are plotted in the bottom part of Figure 10.4, and the lines in neither of the
two plots cross. The interaction here is a removable interaction. There exists some
equivalent dependent variable that keeps the scores (and the means) in the same order,
but for which the factors do not interact. The fact that an interaction is removable
does not imply that you should ignore the finding, of course. Some removable inter-
actions are quite convincing, particularly those involving well-established dependent
variables. Nevertheless, many researchers consider nonremovable interactions more
compelling and less ambiguous because they are resistant to every possible change in
the measurement scale.
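For a 2 x 2 table, the double-plotting test reduces to a sign check: the lines in one of the plots cross exactly when a pair of simple effects reverses sign. A minimal sketch (the means in the examples are hypothetical, not those of Figure 10.4):

```python
def is_crossover(m11, m12, m21, m22):
    """Test whether 2 x 2 cell means m[aj][bk] show a nonremovable
    (crossover) interaction: True if the lines cross in at least one
    of the two plots described in the text."""
    # Plot with factor A on the abscissa: one line per level of B.
    # The lines cross if the simple effects of B reverse sign.
    cross_in_A_plot = (m11 - m12) * (m21 - m22) < 0
    # Plot with factor B on the abscissa: one line per level of A.
    # The lines cross if the simple effects of A reverse sign.
    cross_in_B_plot = (m11 - m21) * (m12 - m22) < 0
    return cross_in_A_plot or cross_in_B_plot

nonremovable = is_crossover(2, 8, 6, 4)   # lines cross: True
removable = is_crossover(2, 4, 6, 12)     # same ordering throughout: False
```

A zero simple effect (lines that touch without reversing) is treated here as not crossing; the strict inequality can be relaxed if ties should count.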
Scaling issues of this type also need to be considered whenever transformations
are used to reduce the impact of violations of the distributional assumptions, such
as heterogeneity of variance, as we discussed in Section 7.4 (see pp. 153-155). For
example, we might apply a square root or logarithmic transformation (Equations 7.8
and 7.9) when the groups with the larger means also have larger variances. Because
transforming the scores changes the size of the differences between groups, it changes
both the sizes of the simple effects and the size of the interaction. It may create or
(less often) eliminate a removable interaction. Whenever we apply a transformation
analysis. Hand (1996) reviews much of this literature in a balanced way, and Loftus (1978) gives a
straightforward discussion of scaling issues in the measurement of learning.
[Figure 10.4: A nonremovable, or crossover, interaction (top) and a removable
interaction (bottom). Each set of means is plotted twice, once with factor A on the
abscissa and once with factor B.]
You saw in Chapter 2 that the total sum of squares could be partitioned into two
parts: the between-groups sum of squares SSbetween reflecting the deviation of the
treatment groups from the overall mean and the within-groups sum of squares SSwithin
reflecting the variability of subjects treated alike. In subsequent chapters, we asked
more refined questions of the data by dividing the SSbetween into component sums of
squares. The analysis of the factorial experiment follows a similar pattern, except that
the SSbetween is rarely of systematic interest. Instead, we are primarily interested in
three components of SSbetween: (1) a sum of squares SSA reflecting the main effect of
factor A, (2) a sum of squares SSB reflecting the main effect of factor B, and (3) a sum
of squares SSAxB representing the AxB interaction. In this chapter, we will consider
only designs in which each treatment condition contains the same number of subjects.
The analysis of experiments with unequal sample sizes is described in Chapter 14.
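The partition can be checked numerically. A minimal sketch with hypothetical data (a 2 x 2 design, n = 2 subjects per cell), computing each sum of squares directly from deviations:

```python
# Hypothetical scores for a 2 x 2 factorial with n = 2 per cell.
data = {
    ("a1", "b1"): [3, 5], ("a1", "b2"): [4, 6],
    ("a2", "b1"): [7, 9], ("a2", "b2"): [12, 14],
}
n = 2
b_levels = 2  # levels of factor B in this sketch

def mean(xs):
    return sum(xs) / len(xs)

scores = [y for ys in data.values() for y in ys]
grand = mean(scores)
cell_means = {c: mean(ys) for c, ys in data.items()}
a_means = {a: mean(data[(a, "b1")] + data[(a, "b2")]) for a in ("a1", "a2")}
b_means = {b: mean(data[("a1", b)] + data[("a2", b)]) for b in ("b1", "b2")}

# Each squared deviation is weighted by the number of scores behind it.
SS_A = b_levels * n * sum((m - grand) ** 2 for m in a_means.values())
SS_B = 2 * n * sum((m - grand) ** 2 for m in b_means.values())  # 2 = levels of A
SS_between = n * sum((m - grand) ** 2 for m in cell_means.values())
SS_AxB = SS_between - SS_A - SS_B
SS_within = sum((y - cell_means[c]) ** 2 for c, ys in data.items() for y in ys)
```

With these numbers, SS_between = 98 splits into SS_A = 72, SS_B = 18, and SS_AxB = 8, and SS_within = 8 captures the variability of subjects treated alike.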
Table 11.1: Design and notation for the two-factor design.

Experimental design: factor A with levels a1 and a2, factor B with levels b1, b2,
and b3, and n = 4 subjects in each treatment combination.

Data table (rows are subjects, columns are the treatment combinations):

  a1b1   a1b2   a1b3   a2b1   a2b2   a2b3
  Y111   Y112   Y113   Y121   Y122   Y123
  Y211   Y212   Y213   Y221   Y222   Y223
  Y311   Y312   Y313   Y321   Y322   Y323
  Y411   Y412   Y413   Y421   Y422   Y423

AB table of sums:

         b1     b2     b3    Sum
  a1    AB11   AB12   AB13    A1
  a2    AB21   AB22   AB23    A2
  Sum    B1     B2     B3      T
(11.1)
We could complete the analysis using the deviations. However, performing the calcu
lations this way is impractical and an easy place to make arithmetic mistakes. So we
turn to computational formulas that are easier to use.
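The chapter presents those formulas next; as a preview, the sketch below assumes the standard equal-sample-size bracket-term forms, in which each sum is squared and divided by the number of scores it is based on (the data, a 2 x 3 design with n = 2 per cell, are hypothetical):

```python
# Sketch of the computational route: sums of squares from treatment
# sums rather than deviations. Assumes the standard equal-n bracket-term
# formulas; the data (2 x 3 design, n = 2 per cell) are hypothetical.
AB_scores = {
    ("a1", "b1"): [1, 3], ("a1", "b2"): [2, 4], ("a1", "b3"): [5, 7],
    ("a2", "b1"): [4, 6], ("a2", "b2"): [6, 8], ("a2", "b3"): [9, 11],
}
a, b, n = 2, 3, 2  # numbers of levels and subjects per cell

AB = {c: sum(ys) for c, ys in AB_scores.items()}  # cell sums
A = {ai: sum(s for (aj, _), s in AB.items() if aj == ai) for ai in ("a1", "a2")}
B = {bi: sum(s for (_, bj), s in AB.items() if bj == bi) for bi in ("b1", "b2", "b3")}
T = sum(AB.values())                              # grand sum

# Bracket terms: squared sums divided by the number of scores per sum.
bracket_Y = sum(y * y for ys in AB_scores.values() for y in ys)
bracket_T = T ** 2 / (a * b * n)
bracket_A = sum(s ** 2 for s in A.values()) / (b * n)
bracket_B = sum(s ** 2 for s in B.values()) / (a * n)
bracket_AB = sum(s ** 2 for s in AB.values()) / n

SS_A = bracket_A - bracket_T
SS_B = bracket_B - bracket_T
SS_AxB = bracket_AB - bracket_A - bracket_B + bracket_T
SS_within = bracket_Y - bracket_AB
```

Computing the same quantities from deviations gives identical values, which is a useful check when working by hand.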