
Understanding Statistics

in the
Behavioral Sciences
Understanding Statistics
in the
Behavioral Sciences

by

Roger Bakeman
Byron F. Robinson
Georgia State University

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS


2005 Mahwah, New Jersey London
Senior Editor: Debra Riegert
Editorial Assistant: Kerry Breen
Cover Design: Kathryn Houghtaling Lacey
Textbook Production Manager: Paul Smolenski
Text and Cover Printer: Hamilton Printing Company

Camera ready copy for this book was provided by the authors

Copyright © 2005 by Lawrence Erlbaum Associates, Inc.

All rights reserved. No part of this book may be reproduced in any form, by
photostat, microform, retrieval system, or any other means, without prior written
permission of the publisher.
Lawrence Erlbaum Associates, Inc., Publishers
10 Industrial Avenue
Mahwah, New Jersey 07430
www.erlbaum.com

Library of Congress Cataloging-in-Publication Data

Bakeman, Roger.
Understanding statistics in the behavioral sciences / by Roger Bakeman and
Byron F. Robinson.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-4944-0 (casebound : alk. paper)
1. Psychology—Statistical methods—Textbooks. 2. Social sciences—Statistical
methods—Textbooks. 3. Psychometrics—Textbooks. I. Robinson, Byron F.
II. Title.
BF39.B325 2004
150'.1'5195—dc22 2004056417
CIP
Books published by Lawrence Erlbaum Associates are printed on acid-free paper,
and their bindings are chosen for strength and durability.
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Disclaimer:
This eBook does not include the ancillary media that was
packaged with the original printed version of the book.
Contents

Preface xi
1 Preliminaries: How to Use This Book 1
1.1 Statistics and the Behavioral Sciences 1
1.2 Computing Statistics by Hand and Computer 3
1.3 An Integrated Approach to Learning Statistics 12
2 Getting Started: The Logic of Hypothesis Testing 17
2.1 Statistics, Samples, and Populations 17
2.2 Hypothesis Testing: An Introduction 20
2.3 False Claims, Real Effects, and Power 24
2.4 Why Discuss Inferential Before Descriptive Statistics? 30
3 Inferring From a Sample: The Binomial Distribution 31
3.1 The Binomial Distribution 31
3.2 The Sign Test 39
4 Measuring Variables: Some Basic Vocabulary 45
4.1 Scales of Measurement 45
4.2 Designing a Study: Independent and Dependent Variables 48
4.3 Matching Study Designs With Statistical Procedures 49
5 Describing a Sample: Basic Descriptive Statistics 53
5.1 The Mean 54
5.2 The Variance 60
5.3 The Standard Deviation 63
5.4 Standard Scores 66

6 Describing a Sample: Graphical Techniques 71
6.1 Principles of Good Design 72
6.2 Graphical Techniques Explained 73
7 Inferring From a Sample: The Normal and t Distributions 83
7.1 The Normal Approximation for the Binomial 84
7.2 The Normal Distribution 87
7.3 The Central Limit Theorem 91
7.4 The t Distribution 92
7.5 Single-Sample Tests 93
7.6 Ninety-Five Percent Confidence Intervals 98
8 Accounting for Variance: A Single Predictor 103
8.1 Simple Regression and Correlation 103
8.2 What Accounting for Variance Means 113
9 Bivariate Relations: The Regression and Correlation Coefficients 117
9.1 Computing the Slope and the Y Intercept 119
9.2 Computing the Correlation Coefficient 124
9.3 Detecting Group Differences with a Binary Predictor 127
9.4 Graphing the Regression Line 132
10 Inferring From a Sample: The F Distribution 137
10.1 Estimating Population Variance 137
10.2 The F Distribution 140
10.3 The F Test 142
10.4 The Analysis of Variance: Two Independent Groups 149
10.5 Assumptions of the F Test 152
11 Accounting for Variance: Multiple Predictors 155
11.1 Multiple Regression and Correlation 156
11.2 Significance Testing With Multiple Predictors 166
11.3 Accounting For Unique Additional Variance 168
11.4 Hierarchic MRC and the Analysis of Covariance 171
11.5 More Than Two Predictors 178
12 Single-Factor Between-Subjects Studies 181
12.1 Coding Categorical Predictor Variables 182
12.2 One-Way Analysis of Variance 194
12.3 Trend Analysis 197

13 Planned Comparisons, Post Hoc Tests, and Adjusted Means 201
13.1 Organizing Stepwise Statistics 203
13.2 Planned Comparisons 205
13.3 Post Hoc Tests 206
13.4 Unequal Numbers of Subjects Per Group 210
13.5 Adjusted Means and the Analysis of Covariance 212
14 Studies With Multiple Between-Subjects Factors 223
14.1 Between-Subjects Factorial Studies 224
14.2 Significance Testing for Main Effects and Interactions 233
14.3 Interpreting Significant Main Effects and Interactions 235
14.4 Magnitude of Effects and Partial Eta Squared 238

15 Single-Factor Within-Subjects Studies 245
15.1 Within-Subjects or Repeated-Measures Factors 245
15.2 Controlling Between-Subjects Variability 250
15.3 Modifying the Source Table for Repeated Measures 258
15.4 Assumptions of the Repeated Measures ANOVA 266
16 Two-Factor Studies With Repeated Measures 269
16.1 One Between- and One Within-Subjects Factor 269
16.2 Two Within-Subjects Factors 278
16.3 Explicating Interactions With Repeated Measures 284
16.4 Generalizing to More Complex Designs 286

17 Power, Pitfalls, and Practical Matters 289
17.1 Pretest, Posttest: Repeated Measure or Covariate? 289
17.2 Power Analysis: How Many Subjects Are Enough? 295
References 301
Glossary of Symbols and Key Terms 303
Appendix A: SAS Exercises 309
Appendix B: Answers to Selected Exercises 325
Appendix C: Statistical Tables

A. Critical Values for the Binomial Distribution, P = 0.5 345
B. Areas Under the Normal Curve 347
C. Critical Values for the t Distribution 350
D.1 Critical Values for the F Distribution, α = .05 351
D.2 Critical Values for the F Distribution, α = .01 352
E.1 Distribution of the Studentized Range Statistic, α = .05 353
E.2 Distribution of the Studentized Range Statistic, α = .01 354
F.1 L Values for α = .05 355
F.2 L Values for α = .01 356
Author Index 357
Subject Index 359
Preface

There are at least three reasons why you might buy this book:

1. You are a student in the behavioral sciences taking a statistics course and
the instructor has assigned this book as the primary text.
2. You are a student taking a statistics course and the instructor has
assigned this book for enrichment and for the exercises it provides.
3. You are a behavioral science researcher who feels a need to brush up on
your statistics and you find working on exercises at your own pace
appealing.

If you are an instructor, your reasons for assigning this book might be similar:

1. You assigned this book as the primary text because its unified and
economic approach appeals to you. You also like the idea of presenting
to students a book that they can truly master in its entirety.
2. You assigned another text because of the encyclopedic coverage of
statistical topics it offers, but you also assigned this book because you
thought the straightforward prose and the conceptually based exercises
would prove useful to your students.

When sitting in statistics classes or when trying to read and understand
statistical material, too many otherwise intelligent and capable students and
researchers feel dumb. This book is intended as an antidote. It is designed to
make you feel smart and competent. Its approach is conservative in that it
attempts to identify and present the essentials of data analysis as developed by
statisticians over the last two or three centuries. But the selection and
organization of topics and the manner in which they are presented result from a
radical rethinking of how basic statistics should be taught to students whose
primary concern lies with behavioral science research or clinical practice, and not
with the formal, mathematical properties of statistical theory.
This book is designed to develop your conceptual and practical
understanding of basic data analysis. It is not a cookbook, although by reading
the text and working through the exercises you will gain considerable practical
knowledge and the ability to analyze data using multiple regression and the
analysis of variance. We assume that your goal is basic statistical literacy—the
ability to understand and critique the statistical analyses others perform, and the
ability to proceed knowledgeably with your own data analyses. We also assume
that you are far more interested in the results and empirical implications of your
analyses than in the formal properties of the statistics used—and therefore will
welcome the practical, intuitive approach to statistics taken here.
Several sorts of audiences, from advanced undergraduates, to beginning and
advanced graduate students, to seasoned researchers, should find this book
useful and appealing. The primary audience we had in mind when writing it
consisted of graduate students in the behavioral sciences, students who were
required to take a statistics course as undergraduates, did reasonably well at the
time, but frankly never quite "got it." When reading journal articles and other
research reports, they tend to skip over the results section. They see ps and Fs
and lots of other numbers and their eyes glaze over. They look forward to the
required graduate statistics courses with a mixture of terror and loathing and
cannot quite imagine how they will analyze data for their required thesis or
dissertation on their own. We hope that such students will find this book
something of a relief, perhaps even a delight, and as they read it and work
through the exercises their past coursework will suddenly make sense in new
ways and they will increasingly feel competent to read journal articles and to
pursue data analyses on their own.
For such students, the present volume can serve as the basis for a semester or
quarter course. We recommend that as students read the text, they faithfully
perform the exercises, and not skip over them as some readers might a results
section in a research report. But we also recommend that readers seek out data
sets relevant to their area of interest, and use them to practice the statistical
techniques presented here. Usually it is not too difficult to borrow a small data
set; if you cannot find one relevant to your area, a practice data set is provided on
the accompanying CD. Any attempt to apply the principles learned here to new
data is helpful. There is something particularly motivating when you create
exercises for yourself and when they are relevant to your areas of interest.
Similarly, learning is enhanced and consolidated immensely when, in the context
of reading journal articles relevant to your own interests, you suddenly recognize
that the author knows, and has used, what you have just come to understand.
A second audience for which this book should prove useful consists of active
and productive researchers. Such researchers, like the graduate students we
described earlier, should also find this book a relief and a delight—and for most
of the same reasons. As they read the text and work through the exercises, much
of the data analyses they have been doing fairly mechanically will suddenly make
more sense to them and they will be infused with a new sense of understanding
and control. The exercises will serve them especially well. These are designed to
be done at the reader's own pace, in the privacy of the office or home as desired.
Researchers who read this text in order to upgrade their statistical understanding
may miss the help and nudging that fellow students in a class often provide, but
the book is intended to work equally well as a self-study guide and a classroom
text.
For all of these audiences, the radical redefining of the topics required for a
basic understanding of statistics that this book exemplifies will matter. It is a
cliche of modern architecture (usually attributed to Mies van der Rohe) that less
is more. Jacob Cohen (1990), whose writings on statistics have had considerable
influence on us, argued that the same principle applies equally well when
considering the application of statistics to the behavioral sciences generally.
Certainly it is a precept that has guided us throughout the design and writing of
this book.

CONTENTS

There are a staggering number of undergraduate statistics books on the market
today, several dozen of which are designed for behavioral science majors, and a
fair number designed specifically for graduate courses, many of which are
encyclopedic in scope. Over the past several decades there has come to be a
"standard content" for these texts, a content from which few texts deviate. We
have come to believe that many of the topics presented in traditional behavioral
science statistics texts are nice but not necessary for a basic understanding of
data analysis as practiced by researchers—and that several of the more essential
topics could be presented in a far more unified manner than is usual.
In writing this book, we began by asking what students and researchers need
to know in order to understand and perform most of the basic kinds of data
analyses used in the work-a-day world of the behavioral sciences. We then set
out to provide a conceptually based but practice-oriented description of the
requisite knowledge. As a result, some, but by no means all, of the topics found
in traditional texts are covered here. And although students are hardly harmed
by exposure to the full array of traditional topics, the "less is more" approach
exemplified here has several advantages, not the least of which is efficiency. Here
students are exposed to fewer topics, but those presented possess considerable
power, generality, and practical importance.
Two relatively recent developments have made this book possible, one
conceptual and one technical. Over the past few decades, there has been a
growing realization that statistical topics formerly kept in quite separate
compartments can be viewed in a unified way. Thus the analysis of variance is
conveniently, and economically, viewed as simply another application of multiple
regression. This unified view permits considerable savings. Students learn
multiple regression, which is extremely useful in its own right, and at the same
time—and with little additional cost—learn how to analyze data associated with
the basic analysis-of-variance designs. Moreover, analysis of covariance—a topic
that traditional experimental design texts make quite complex—is thrown in
almost for free. The unified view has been expressed in journal articles for
several decades (e.g., Cohen, 1968) and has greatly influenced general-purpose
statistical computer packages. But only now is it beginning to have an influence
on textbooks. The result can be a dramatic trimming of traditional topics with
little sacrifice in data analysis capability.
The technical development concerns the now almost universal availability of
personal computers. We assume that readers will have access to a
microcomputer and a spreadsheet program such as Microsoft's Excel and a
statistical package such as SPSS or SAS. Statistical packages are powerful tools
for rapidly conducting complex analyses, and any serious student or researcher
should be familiar with their operation. The ease with which analyses can be
carried out in a statistical package, however, can also promote the mechanical
application of analytical procedures without the background knowledge
necessary to select the correct analysis and interpret the results appropriately.
Spreadsheets, on the other hand, make the formulas used in an analysis explicit
and allow for the exploration of the conceptual underpinnings of the analyses and
their constituent formulas. Spreadsheets relieve the user of laborious
computations; this allowed us to develop exercises that use more meaningful
definitional formulas rather than the sometimes opaque computational formulas
necessary when problem sets are to be worked by hand. Together, spreadsheets
and statistical packages provide a ready means for teaching conceptual
knowledge and concrete skills to readers interested in learning how to conduct
and interpret analyses appropriately.
For readers who have already had statistics, the first seven chapters should
constitute a brief review of some important, introductory topics (hypothesis
testing, descriptive statistics, the normal and t distributions). For students new
to statistics, chapters 1-7 should provide a sufficient basis for the material to
follow. The foundation of the unified view (simple and multiple regression,
accounting for variance) is presented in chapters 8-11. Chapters 12-16 present
the basic analysis-of-variance designs—single and multifactor factorial designs,
including designs with repeated measures—and also discuss post hoc tests.
Mastery of these topics should make it possible to understand most of the simpler
analyses appearing in journal articles and to perform basic analyses of data. In
addition, mastery of this material should adequately prepare readers to move on
to more advanced texts and topics, such as the analysis of categorical data. The
interested reader is referred to our companion to this text, Understanding Log-
linear Analysis with ILOG (Bakeman & Robinson, 1994).

LEARNING TOOLS

This book contains a number of features that should aid learning. Most chapters
contain exercises, and answers to selected exercises are provided in an appendix.
Many of the exercises use spreadsheets or statistical software. The computer-
based exercises, which allow you to learn by doing, are a central and essential
feature of this book. The spreadsheet exercises, in particular, promote
conceptual understanding. Spreadsheets allow you to perform meaningful
analyses quickly and efficiently and in conceptually informative ways. The
spreadsheet-based exercises are almost as central to this book as the unified view
of basic statistics. In fact, as you will see, the two dovetail remarkably.
Naturally, we wish that all the examples and exercises were relevant to each
reader's particular interests and concerns. We could never create enough
different examples to cover the breadth of topics readers will bring with them.
And so, instead of even attempting this, we have based most of the examples and
exercises on just a few paradigmatic studies. Readers will rapidly become
familiar with these few studies (the money-cure study, the lie-detection study,
etc.), which should allow them to focus more on new statistical concepts as they
are presented without being distracted by details associated with a different
study.
The spreadsheet exercises are a powerful learning tool and allow you to
conduct basic analyses. When data sets are large or more complex analyses are
necessary, however, it is useful to know how to use one of the more powerful
statistical packages. To this end, the chapters include exercises that walk the
reader through the procedures necessary to conduct the analyses using either
SPSS or SAS. SPSS exercises are presented in the chapters, with corresponding
SAS exercises in an appendix.
Key terms and symbols, which usually are italicized when they first appear in
the text, are defined in boxes throughout the chapters and are collected together
into a glossary at the end of the book. In addition, most chapters contain
considerable graphic and tabular information, including spreadsheet printouts
and figures. Finally, a CD is provided that contains all of the spreadsheets used in
the book, SPSS and SAS output and data files, and practice data sets.
ACKNOWLEDGEMENTS

Several classes of graduate students at Georgia State University have read and
used earlier versions of this book, and we have benefited greatly from their
comments and their experience. In fact, without them—without their efforts,
their needs, and their anxieties—we would probably never have come to write this
book. Among our former students, we would especially like to thank Anna
Williams, Kim Deffebach, Robert Casey, Carli Jacobs, and P. Brooke Robertson
for their diligence, good humor, and many insightful comments. Carli Jacobs
also assisted with creating the answers appendix and the SAS exercises, and
Brooke Robertson assisted with the indexes. We would also like to thank our
colleagues and friends for their many helpful comments and support (Roger
Bakeman would especially like to thank Lauren B. Adamson, Josephine V.
Brown, Kenneth D. Clark, Alvin D. Hall, Daryl W. Nenstiel, and James L. Pate,
and Byron Robinson would especially like to thank Carolyn B. Mervis, Irwin D.
Waldman, and Bronwyn W. Robinson).

Roger Bakeman & Byron F. Robinson


1 Preliminaries: How to Use This Book

In this chapter you will:

1. Be introduced in a broad, general way to the uses of statistics in the
behavioral sciences and the scope of this book.
2. Be introduced to a simple tool—a spreadsheet program (e.g.,
Microsoft's Excel)—that can perform statistical computations in a way
that is both conceptually meaningful and practically useful.
3. Be introduced to an integrated approach to basic statistics, one that relies
on multiple regression and emphasizes a few powerful ideas common to a
wide range of basic statistical analyses.
4. Be introduced to some of the assumptions, notations, and strategies for
learning used in this book.

1.1 STATISTICS AND THE BEHAVIORAL SCIENCES

What is the effect of early experience with the mother on an infant's subsequent
development? What sort of person makes a good manager? What kinds of
appeals are most likely to affect a person's attitudes? Why do people buy a
certain kind of soap? Why do some children do better in school than others? Are
there programs that can affect children's academic performance? Are there
therapies that can affect an adult's mental health? When seeking answers to
questions like these, behavioral scientists almost always resort to statistics. Why
is this true? Why should science in general, and the behavioral sciences in
particular, be so dependent on statistics?

Why Behavioral Scientists Need Statistics


Simplifying considerably, assume for the moment that the goal of any science is
to understand a phenomenon well enough so that such characteristics as when it
occurs and how often it occurs can be predicted with some precision. Some
phenomena studied by scientists (e.g., simple physical events) might arise from
only a few causes. If this were so, then it would be relatively easy to include in
one study all the important causes, making it at least easier to understand and
predict that phenomenon. Other phenomena (especially phenomena studied by
behavioral scientists), however, might result from a multitude of causes. If
important causes were not identified, or for any other reason were not included
in a study, then prediction would suffer. But prediction could suffer for another
reason as well. It may be that at least some complex phenomena are affected by
genuinely random processes.
For scientists who hold a deterministic view of human behavior, this is a
nightmarish possibility, as William Kessen (1979) noted in a speech to the
American Psychological Association (Kessen was speaking to child psychologists
but what he said also applies to students of human behavior in general):

To be sure, most expert students of children continue to assert the truth
of the positivistic dream—that we have not yet found the underlying
structural simplicities that will reveal the child entire, that we have not
yet cut nature at the joints—but it may be wise ... to peer into the abyss of
the positivistic nightmare—that ... the variety of the child's definition is
not the removable error of an incomplete science. (p. 815)

Kessen raised two possibilities that students of human behavior should
consider: Predictions of human behavior may be imprecise because the
behavioral sciences are not yet mature enough to identify all determining causes,
or—and this is the "positivistic nightmare"—precise predictions may be
impossible in principle because of random or chaotic elements in human
behavior.
An ancient and important philosophic question echoes in these two
possibilities, but statistics are neutral in this debate. Prediction may fail to be
perfect because important causes of a completely determined phenomenon were
not taken into account. Or prediction may fail because indeterminate elements
shape the phenomenon. In either case, the observed association of the presumed
causes (or explanatory factors) with the resultant phenomenon will be less than
perfect. When prediction is essentially perfect, statistics are not necessary.
When prediction is not perfect, statistical techniques help us determine whether
our observations should be ignored because they probably reflect just a chance
occurrence, or whether they should be emphasized because they probably reflect
a real effect. In other words, when results are ambiguous, statistical analysis
allows us to determine whether or not the effects we observe are statistically
significant.
Thus, the need for statistical analysis is not a matter of whether it is physical
or social phenomena that are under investigation, but rather a matter of how
strongly effects are linked to their presumed causes, that is, of how well the
phenomenon under study can be predicted. For example, when studying the
effect of certain toxins on the body, predictability can approach 100%. If the
toxin is introduced into a person's body, that person dies. In such cases, statistics
are hardly needed. The effects of psychoactive drugs often appear to occupy a
middle ground between complete determinism and complete chaos. Even though
there are broad regularities, different drugs in different concentrations affect
individuals differently—a state of affairs that can complicate both research and
clinical practice.
Behavioral regularities of the sort students of human behavior study are
typically weaker than the effects of toxins or psychoactive drugs. Usually the
rules uncovered by behavioral research suggest probabilities, not certainties,
which can be quite frustrating for judges, clinical psychologists, and others who
must make decisions in individual cases. The behavioral regularities may be real,
but often they are not all that powerful and there will be many exceptions to the
general rules. And, from the point of view of the researcher, many behavioral
regularities may not be at all obvious to the naked eye. These regularities are
probabilistic and the evidence for them may appear ambiguous, so skeptics will
naturally suspect that an invested researcher who claims effects is led more by
desire than evidence.
Statistical techniques are important, indeed essential, because they provide
ways to resolve such ambiguities. Thus, to return to the question posed at the
beginning of this section, scientists need statistics when they deal with
phenomena that can be predicted only imperfectly. Behavioral scientists make a
career of dealing with imperfectly predicted phenomena and thus often have
greater need of statistics than other kinds of scientists. But any scientist, when
faced with ambiguous results, needs techniques that control for the
all-too-human tendency to see what we want or expect to see. Happily, statistical
techniques provide ways of making decisions about ambiguous evidence, ways
that insulate those decisions from an investigator's potential bias. No wonder,
then, that statistics has come to play an especially important role in the thinking
and research of behavioral scientists.

How Much Statistical Knowledge Do Behavioral Scientists Need?


Few would deny that behavioral scientists need some knowledge of statistics.
However, there is room for a considerable diversity of opinion as to how much
knowledge is enough. Certainly they need to be able to read journals in their field
and analyze their data. At a minimum, diligent readers of this book, who read the
material and faithfully do the exercises, should be able to understand much of
what they find in the results sections of journal articles and should be able to
perform and present competent data analyses of their own. In no sense is this
book a compendium of all the statistical knowledge a behavioral scientist will
ever need, but it will provide a basis for learning about more advanced topics like
multivariate analysis of variance, advanced multiple regression, advanced
experimental design, factor analysis, log-linear analysis, logistic regression, and
the like later on.
This book is perhaps best viewed as an advanced-level introductory text. We
assume that many readers will have had an undergraduate introductory statistics
course but they are not now comfortable applying whatever they learned earlier.
In fact, as noted in the preface, that earlier material may be remembered only in
the most vague and shadowy of forms and as a result cannot form the basis for
confident practice. Reading this book and working its exercises should allow
previous knowledge to coalesce with present learning and provide many
"moments of recognition" when terms and concepts heard before suddenly make
sense in a way they had not previously.
One caveat is necessary: It is as important for us to realize what we do not
know as it is to understand what we do know. An attempt is made to present
topics in this book in as straightforward and simple a manner as possible. Yet
any one of them could be elaborated and, if probed, might reveal subtleties,
complexities, and depths not even hinted at here. Such successive refinement
seems central to academic work and to scholarly endeavor generally. We hope
this book provides you with a sufficiently elaborated view of basic statistics to
allow you to realize what a small corner of the statistical world you have
conquered.

1.2 COMPUTING STATISTICS BY HAND AND COMPUTER

We learn new skills—from broad jumping to piano playing to statistics—best by
doing. Watching others perform any of these activities may be helpful initially,
but in the end there is no substitute for individual exercise. Thus, watching an
instructor perform statistical computations on a blackboard may be helpful at
first, but at some point we need to work through computations ourselves, at our
own speed. The question then is, what is the best way to practice the required
computations? And, once mastered, what is the best way to do the computations
required for a real data analysis?
Traditionally, pencil and paper (or something similar) have been used. Later,
mechanical calculators offered some relief from the tedium and error-prone ways
of hand methods. These have now been almost totally replaced by amazingly
inexpensive and seemingly ubiquitous hand-held electronic calculators. At the
same time, personal computers have become as commonly used by students as
typewriters once were, and computer labs have sprouted apparently overnight on
most college campuses. It is now reasonable to assume that students will have
easy access to microcomputers.
In fact, this book assumes that readers will use a personal computer for
computations, sometimes when reading the text, certainly when doing the
exercises. In addition to accuracy and speed, personal computers offer
advantages not easily realized even with electronic calculators. Historically,
considerable space in statistics texts, and considerable student time, was devoted
to deriving computational from theoretical formulas. The computational
formulas were then used for all subsequent exercises. In other words, the
formulas used in exercises were designed for hand computation but were not
necessarily conceptually meaningful.
The only justification for this state of affairs was practical. Given earlier
technology, theoretical formulas were simply too cumbersome to use. But with
the advent of personal computers, this has all changed. It is simply no longer
necessary for anyone (other than specialists who develop the general-purpose
computer packages the rest of us use) to know computational formulas. For basic
work, the definitional methods presented in this book can be used, and for more
advanced work, several excellent packages are available, such as SPSS.
As a happy result, less rote learning is required and concepts, not mechanics,
can be used to solve both the exercises presented in this book and practical
problems as well. The advantages for the reader are considerable. Free from the
need to spend time mastering unnecessary details, the same degree of statistical
mastery now takes less time. But it is the personal computer, specifically a
personal computer running a spreadsheet program, that makes such an approach
practical.

Spreadsheets: A Basic Computing Tool for Statistics


The computer programs that spurred so many sales of personal computers
initially were electronic spreadsheets and today, after word processors, they
remain the second most common kind of application run on personal computers.
Although designed initially for business applications, it turns out that
spreadsheets are extremely well-suited for statistical use as well.
The idea behind spreadsheets—like so many innovative ideas—is
astonishingly simple. A spreadsheet consists of rows (usually labeled 1, 2, 3, etc.)
and columns (usually labeled A, B, C, etc.) that together form a field or matrix of
individual cells (identified as A1, A2, B1, etc.; see Fig. 1.1). Among other things,
cells can contain formulas or arithmetic rules. For example, a cell might contain
a formula for adding together all the numbers in the column above it. That cell
would then display the total for the column. Several entries in a row might all
contain formulas for column totals and an additional formula in that row might
sum the column totals, providing a grand sum of all the numbers in the table.
Rows can represent such diverse items as expense categories for a small
business, lines in a tax form, participants in an experiment, or students in a class.
Columns can represent months or years, interest or other kinds of rates, or test
scores or scores of any kind. The whole might then represent a model for a small
business, a tax computation, a statistical analysis, or a class grade book.
Whatever is represented, spreadsheets typically consist of linked and
interdependent computations, which means that a change in one cell will affect
the values of all the other cells that depend on it. For example, if cell B3 in Fig.
1.1 were changed from 12 to 22, the column total in cell B5 would change from 52
to 62, the row total in cell F3 would change from 67 to 77, and the total of the
column totals in cell F5 would change from 232 to 242.
Electronic spreadsheets ensure that making such changes, whether they
result from incorrect entries or simply a desire to experiment with different
values, does not necessitate a series of frantic erasures and error-prone hand
recalculations. Instead, the effects of a change in the value of one cell can ripple
through the spreadsheet seemingly instantaneously as the computer performs the
required recalculations. This means that a template (a particular set of labels and
formulas) can be developed for a series of calculations (e.g., a particular
statistical analysis) and then reused with different data by entering the new data
into the cells in place of the old data.
As though this were not enough, the use of spreadsheets for statistical
analysis offers yet another distinct advantage. For whatever reason, many
students encountering statistics for the first (or even second) time find the usual
sort of statistical notation—subscripts, superscripts, big and little Greek letters,
various squiggles—intimidating or worse. Too often, notation seems to form a
barrier that impedes the understanding of too many students. Much of the
notational burden is carried by the spreadsheet itself because of the graphic way
spreadsheets are laid out with rows representing subjects (i.e., participants or,
more generally, cases) and columns representing various statistical elements. As
a result, in this book the need for elaborate statistical notation is greatly reduced
and the notation remaining is relatively straightforward and its meaning easily
tied to the spreadsheet layout. Moreover, given mastery of a spreadsheet
approach to statistics as presented in this book, readers should then have little
trouble understanding and applying more complex notation in subsequent, more
advanced courses.
Another advantage of a spreadsheet approach to basic data analysis should
be mentioned. Although there is already a limited graphic aspect to data laid out

A B C D E F
1 Name Test 1 Test 2 Test 3 Test 4 Total
2 Alex 21 22 19 23 85
3 Betty 12 16 20 19 67
4 Carlos 19 18 21 22 80
5 Total 52 56 60 64 232
6 N 3 3 3 3 3
7 Mean 17.3 18.7 20.0 21.3 77.3
FIG. 1.1. An example of a simple spreadsheet. For example, cells in row 1
and column A contain labels, cell B3 contains the value 12, cell B5 contains a
formula for the sum of the three cells above it, and cell B7 contains a formula
for dividing cell B5 by B6 to get the mean.
in a spreadsheet format, other kinds of graphic portrayals can considerably
enhance understanding of our data. Fortunately, most spreadsheet programs
have some graphing capability, and in subsequent chapters we demonstrate just
how useful various graphs produced by spreadsheet programs can be for
understanding the concepts as well as the results of data analysis.

A Brief Introduction to Spreadsheet Programs


It is characteristic of tools, from simple hammers to high-speed electronic
calculators, that advantages are reaped only after an entry price is paid. The
advantages of spreadsheets when learning and doing statistics is clear. They
automate calculations, reinforce concepts, and simplify notation. Yet to gain
these advantages, first the tool needs to be mastered. In the case of spreadsheets,
this entry cost is neither unduly high nor completely negligible. It takes some
effort to learn how to use spreadsheets, more than for a hand-held calculator, but
less than for a typical word processor. But the payoff can be considerable, not
just for statistics but for a host of other calculating tasks as well. Many of you are
probably already reasonably expert with spreadsheets, but if you are a complete
novice, do not despair. Spreadsheets are not very difficult to master.
Almost all spreadsheet programs are remarkably similar, far more similar
than word-processing programs. Whatever spreadsheet program you use, your
first task is to become familiar with it, a process that is usually aided by the
tutorial programs provided with many programs. Nonetheless, some general
orienting comments that apply to spreadsheets in general may be helpful.
Specific examples are also provided. These examples follow the conventions of
Microsoft's Excel, which is one of the most widely available spreadsheets. But
even if you use a spreadsheet with somewhat different conventions, this
combination of general discussion and specific examples should prove helpful.
As already noted, a spreadsheet consists of a matrix of rows and columns.
Columns are identified with letters, rows with numbers. Each cell in the matrix
has an address that identifies that cell. For example, A1 is the address for the cell
in the upper left-hand corner and cell C5 is over two columns and down four rows
from A1. When the program is first invoked, one cell will be highlighted. The
highlight can be moved by pressing the directional arrows. A few moments of
experimentation with your spreadsheet program should demonstrate how this
works.
Initially, all cells of the spreadsheet are empty, but potentially any cell can
contain a
(1) value,
(2) label, or
(3) formula.
In other words, cells can contain numbers, identifying information, or
computational rules. To enter something in a cell, first highlight that cell. Then
simply type. If the first character you type is a number, most spreadsheets assume
you mean a value. If the first character you type is a letter (or a quote mark), most
spreadsheets assume you mean a label. A label can be any identifying
information you want and is usually used to make the spreadsheet readable.
Conventions for formulas can vary, but the usual assumption is that if the first
character is an equals sign, a formula follows.

Spreadsheet Formulas and Functions


Whatever the convention used for your spreadsheet, formulas are very general
and powerful and it is important to understand how they work. If a number is
entered in a cell, that number is displayed on your spreadsheet. For example, in
Fig. 1.2 you see the number 12 in cell B3. Similarly, if a label is entered in a cell,
that label is displayed on your spreadsheet, for example, "Alex" in cell A2. But if
a formula is entered in a cell, the spreadsheet displays the value rather than the
formula. For example, in Fig. 1.2 we have indicated the formula that is in cell B5,
=SUM(B2:B4), but in fact the value 52 and not the formula would be displayed in
cell B5 on your screen (when you highlight a cell that contains a formula, the
formula is displayed, usually in a special box at the top of your screen). If the
value of any cell used in computing that formula is changed subsequently, the
formula will be recomputed and the new value displayed, which, as noted earlier,
gives spreadsheets a notable advantage over pencil-and-paper techniques.
A formula can indicate simple arithmetic. For example,

=B5/B6
entered in cell B7 divides the cell two above B7 by the cell immediately above it
(see Fig. 1.2). A formula can be as simple as a pointer to another cell. For
example,

=F7
entered in cell F11 would cause cell F11 to display the value of cell F7. Or a
formula can combine values in other cells, constants, or both as required. For
example,

=F6-3
entered in cell G6 would subtract a constant value of three from the value of the
cell immediately to the left of G6. Or,

=(G5*3.14)/F6

entered in cell G6 would multiply the value of the cell immediately above by a
constant value of 3.14 and would then divide the product by the value of the cell

A B C D E F
1 Name Test 1 Test 2 Test 3 Test 4 Total
2 Alex 21 22 19 23 85
3 Betty 12 16 20 19 67
4 Carlos 19 18 21 22 80
5 Total =SUM(B2:B4) 56 60 64 232
6 N =COUNT(B2:B4) 3 3 3 3
7 Mean =B5/B6 18.7 20.0 21.3 77.3
FIG. 1.2. The spreadsheet shown in Fig. 1.1 with Column B expanded to
show the formulas in Cells B5, B6, and B7. Row 1 and Column A contain
labels, Cells B2 through E4 contain numbers, Rows 5-7 and Column F
contain formulas.
immediately to the left of G6. In other words, formulas can combine cell addresses
and constants with operators. Operators include:

1. + (for addition)
2. - (for subtraction)
3. / (for division)
4. * (for multiplication)
5. ^ (for exponentiation)

Left and right parentheses are used to group addresses and constants as needed
to indicate the order of computation. For example, =A7+B7/B8 is not the same
as =(A7+B7)/B8. A formula can contain any address except that of its host cell,
because this would result in a circular definition.
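To make the parenthesis example concrete (the values here are our own invention,
not from the figures): if A7 contained 6, B7 contained 4, and B8 contained 2, then
=A7+B7/B8 would display 8, because the division is performed first, whereas
=(A7+B7)/B8 would display 5, because the parentheses force the addition to be
performed first.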
A formula can also indicate a function, or can combine functions with cell
addresses, constants, or both. Functions are predefined and the particular
functions included vary some among spreadsheets. Only three functions are
required for the exercises in this book:

1. =SUM(range)
2. =COUNT(range)
3. =SQRT(cell)

although others you may find useful include:

4. =AVERAGE(range)
5. =STDEV(range)

(more on the standard deviation function later). For example,


=SUM(D2:D4)

sums the numbers in column D in rows 2 through 4. And


=COUNT(D2:D4)

counts the number of cells containing numeric values, which in this case
(assuming there is a value in each of the cells in the range indicated) would be 3.
Finally,
=SQRT(H23) or =H23^.5

computes the square root of whatever value is in cell H23. (Recall that the square
root of a number is the same as that number raised to the one-half power. Thus
=SQRT(H23) and =H23^.5 give identical results.) One final note: We have
shown functions here with capital letters to make them stand out, but function
names are not case sensitive. =SUM and =sum have the same effect.
Some functions operate on a single cell, like the square root function; others,
like the summation function, operate on a range of cells. For example,

C1:C10
specifies a range incorporating 10 cells in column C. Again,

D8:H8
specifies a range of five cells in row 8. Similarly,

A1:D5
indicates a rectangular range, four columns across by five rows down, containing
20 cells. The ability to specify a range turns out to be very useful. Not only can a
range be specified for a function in a formula, a range, like an individual cell, can
also be copied or moved (more on this later). Further, you do not need to type in
the range explicitly. For example, you might point to a cell, select it with a left
mouse click, begin typing "=sum(" and then, holding the left mouse button down,
"paint" a range of cells. This range (whether a single column, a single row, or a
rectangle of cells) will be entered into the formula for you.
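As a small illustration of how a function and simple arithmetic combine (the cell
layout here is our own, not one from the figures): if ten scores occupy cells C1
through C10, then

=SUM(C1:C10)/COUNT(C1:C10)

entered in cell C12 computes their mean. Because COUNT tallies only cells that
actually contain numbers, this formula still yields the correct mean even if some
cells in the range are left empty.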

Relative Versus Absolute Addressing in Spreadsheets


Spreadsheets distinguish between relative and absolute addresses, for both
individual cells and ranges. This distinction may seem rather academic at first,
but in fact the flexibility and power of spreadsheets depends in large measure on
the way addresses are copied, preserving relative addresses. When a formula is
copied from one cell to another, the addresses in the new cell are relocated,
preserving the relative, not the absolute address.
For example, if the formula in cell B7, which indicates that the value in cell
B5 is to be divided by the value in cell B6 (=B5/B6), is copied to cell C7, the
addresses are relocated so that the formula in cell C7 is now =C5/C6, exactly as
you would like if you wanted to copy this formula for computing the mean (see
Fig. 1.2). What is preserved is the relative address (like the old formula, the new
formula refers to the two cells immediately above it), not the absolute address
(B5 and B6 in the original formula). This allows for considerable efficiency in
defining spreadsheets. For example, a formula that operates on other values in
its row can be copied to cells in the same column, and each of those formulas will
then operate on the values in the appropriate row. In addition, if a formula in cell
C12 indicates summation for cells C1 through C10, and if the formula in C12 is
copied to cells D12 through G12, then the formulas in cells D12 through G12 will
indicate summation for rows 1 through 10 for that column.
All of this is extremely helpful for statistical analysis. Raw data for
behavioral science studies often consist of scores for different subjects. (When
discussing spreadsheets, we use the briefer term subjects, although usually
participants would be used when writing results for publication.) Each row
could represent a different subject. Subject identification numbers might then be
entered in column A and raw data for these subjects in column B. Data for the
first participant might be on row 1, for the second participant on row 2, and so
forth. Various statistical manipulations of the raw scores could be accomplished
in various columns, and summations for all columns could be entered in an
appropriate row after the data for the last subject. Formulas need be entered
fully only once and then copied. Taking advantage of the way copying works
allows spreadsheet users to rapidly create a spreadsheet template that performs
the required statistical computations.
Sometimes, of course, we want an address in a formula to be copied
absolutely, not relatively, which is easy enough to do. An address, like D13, is
always understood to be relative, unless the column letter and row number is
preceded by a dollar sign, like $D$13. For example, the formula,

=B1-$B$24
entered in cell D1 indicates that the value in cell B24 is to be subtracted from the
value in cell B1, the cell two columns to the left of D1. If this formula is copied
to cell D2, the formula in cell D2 will be
=B2-$B$24

The address for B1 was relocated—the first term in the initial formula and in the
copied formula points to a cell two to the left of the host cell—but the address for
B24 was not relocated (because of the dollar signs). In other words, the initial
formula in cell D1 indicates relative addressing for the first term but absolute
addressing for the second.
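Here is a sketch of how mixed addressing pays off statistically (the layout is our
own; the statistics themselves are developed in chapter 5): suppose raw scores
occupy cells B1 through B20 and their mean has been computed in cell B24. Entering

=B1-$B$24

in cell C1 and copying it to cells C2 through C20 produces a column of deviations
from the mean. Each copied formula points to the score in its own row (a relative
address) but always to the mean in B24 (an absolute address). A second copyable
formula, =C1^2, entered in cell D1, would then yield the squared deviations that
define the variance.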
So far we have talked about copy, but move is a different matter. When a cell
or a range is copied, addresses are relocated, preserving their relative position
(unless the formula being copied indicates an absolute address). When a cell or a
range is moved, on the other hand, the addresses are not relocated, but the initial
addresses are preserved instead. Furthermore, and consistent with what the
words copy and move usually mean, after a copy two versions exist, the initial cell
or range and the copied cell or range. After a move, however, only the moved cell
or range exists; the initial cell or range is erased. Copying is very useful when you
want to expand a formula throughout a table, whereas moving is useful when you
want to reorganize the layout of your spreadsheet, perhaps to make room for new
variables or formulas.
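A concrete instance may help (our own, based on Fig. 1.2): if cell B7, which
contains =B5/B6, is copied to cell C7, then C7 contains =C5/C6 and both cells
display column means. If cell B7 is instead moved to cell C7, then C7 contains the
original formula =B5/B6, still displays the mean for column B, and cell B7 itself
is left empty.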
The number of significant digits maintained internally by spreadsheet
programs for a particular cell can vary, depending on the particular computer,
but typically the internal representation is accurate to several digits.
Consequently, accuracy is not usually a concern of spreadsheet users. The
number of digits displayed on your screen, however, will depend on the column
width you allow and the format you have specified. For example, in Figs. 1.1 and
1.2, we specified a numeric format with one digit after the decimal point for row
7. Consequently cell C7 displays 18.7, but the number held internally, and used
for any computations, is 18.6666... Again, it is important to distinguish between
what a cell contains and what a cell displays. If your answers to the exercises in
this book do not agree exactly with the answers given in the figures or in the
Answers to Selected Exercises section, do not be unduly alarmed. The
discrepancy may be due simply to the format of the display.
The brief introduction to spreadsheets contained in the preceding paragraphs
is far from exhaustive and most spreadsheets have many more capabilities than
those mentioned here. Moreover, we have not discussed how to save a
spreadsheet once defined or how to retrieve that spreadsheet for subsequent use,
nor have we discussed machine-specific or systems-level matters, for example,
how one invokes the spreadsheet program in the first place. Still, we have
mentioned most of the spreadsheet capabilities necessary for the exercises and
demonstrations contained in this book. As a next step, and as preparation for
material presented later, readers unfamiliar with spreadsheets should now spend
some time familiarizing themselves with whichever spreadsheet they plan to use.

Exercise 1.1
Using a Spreadsheet Program
This first exercise provides brief and introductory practice in using a spreadsheet
program. If you are not sure how to perform any of these operations, ask
someone to show you or experiment with the help provided by the program.
1. Invoke your spreadsheet program. Experiment with moving the highlight
around using the cursor control keys (marked with arrows).
2. Enter five numbers in rows 2-6 of column B. Enter five different numbers in
rows 2-6 of columns C and D. Enter a label for column B (like "Test
1") in row 1 of column B; likewise for columns C and D. Enter labels in
column A as appropriate.
3. Enter a formula in cell B7 that sums the five numbers in B2 through B6.
Copy the formula from cell B7 to cells C7 through D7. Are the computed
values in row 7 correct?
4. Enter a formula in cell E2 that sums the values in row 2. Copy this formula
from cell E2 to cells E3 through E6. Are the totals in these cells correct?
Change one of the numbers in the range B2:D6. Are the totals updated
correctly?
5. Enter a formula in cell B8 that counts the numbers in B2 through B6. Copy
the formula from cell B8 to cells C8 through D8. Are the computed values in
row 8 correct? Delete one of the numbers in the range B2:D6. Is the count
updated correctly?
6. Insert three blank rows between rows 5 and 6 (this moves row 6 to row 9).
Enter numbers in some of the blank cells in what are now rows 6 and 7. Are
the sums and counts updated correctly?
7. Print a hard copy of your spreadsheet (assuming that a printer is attached to
your microcomputer).
8. Save the template you have created, giving it some meaningful name. Exit
the spreadsheet program.

A Brief Introduction to SPSS


A spreadsheet is a powerful tool for manipulating data and conducting basic
statistical analyses. Spreadsheets also allow students to explore the inner
workings of meaningful definitional formulas and avoid the drudgery of hand
calculations using opaque computational formulas. All of the computations
described in this book can be accomplished by most spreadsheet packages with a
regression function. As the analyses you wish to conduct become larger (e.g.,
contain many cases) and more complex (e.g., analysis of repeated measures),
however, it will be to your advantage to learn a computer package dedicated to
statistical analysis. To this end we have included exercises using SPSS (and SAS)
that parallel and extend the spreadsheet exercises.
SPSS is a system of commands and procedures for data definition,
manipulation, and analysis. Recent versions of SPSS use a Windows-based menu
and a point-and-click-driven interface. We assume the reader has basic
familiarity with such Windows-based program features. SPSS includes a data
interface that is similar in many respects to a spreadsheet. Rows represent cases
and columns represent different variables. You can manipulate data using cut,
paste, and copy commands in a manner similar to spreadsheets. SPSS also
provides the ability to import spreadsheet files directly into the SPSS data editor.
The SPSS exercises will provide enough information to navigate the
commands and procedures necessary to complete each exercise. You should,
however, consult the SPSS manuals and tutorial to gain a more thorough
background in all of the options and short-cuts available in SPSS. The following
exercise will familiarize you with the SPSS interface and basic commands
(comparable SAS exercises for this and subsequent SPSS exercises are contained
in an appendix).

Exercise 1.2
Using SPSS
This first exercise provides brief and introductory practice in using SPSS. If you
are not sure how to perform any of these operations, ask someone to show you
or experiment with the help provided by the program. Prior to importing data into
SPSS, open the spreadsheet from Exercise 1.1, delete lines 5-7 (leaving just the
data), and save the file with a new name.
1. Invoke SPSS. Open the spreadsheet file you just created (data from
Exercise 1.1). From the main menu, select File->Open->Data. In the Open
File dialog box find the Files of Type window and change the file type from
an SPSS (.sav) file to an Excel (.xls) file. Find your spreadsheet file and
open it by either double clicking on the filename or highlighting the file and
clicking Open. In the open data source window, make sure that Read
variable names in the first row of data is checked and then click on OK.
Confirm that the variable names and data are the same as your spreadsheet.
2. Change the name of the variable in column A. At the bottom of the SPSS
data editor, click on the Variable View tab. In variable view, experiment by
changing the name of one of the variables. Note that SPSS variables are
limited to eight characters. You can, however, click on the Label column and
provide a more descriptive label that will appear in the output. Labels can be
up to 256 characters in length, but be aware that labels that are too long will
make the output difficult to read.
3. Create a new variable by entering a new name in the Name column.
Designate the number of decimal places you would like to display and give
the new variable a meaningful label.
4. Enter data for the new variable. Click on the Data View tab and enter five
values in the rows below the new variable name.
5. From the main menu, select Analyze->Descriptive Statistics
->Descriptives. In the Descriptives window, highlight all three variables in
the left-hand window and move them to the right hand window by clicking on
the right-pointing triangle. Click on the Options button, check the box next to
Sum. Click Continue and then OK. This should open the output viewer and
display basic descriptive statistics. Confirm that the N (i.e., count) and sum
are the same values you obtained in Exercise 1.1.
6. Return to the data editor and change some of the values. Run descriptive
statistics again. Were the descriptive statistics updated correctly?
7. Save the output and data files. To save the output file, make sure you are in
the SPSS output viewer, and then select File->Save from the main menu.
Give the file a meaningful name and click Save. The output will be saved in a
file with an .spo extension. Now return to the data editor and save the data
using a meaningful name. SPSS data files are saved with a .sav extension.
Exit SPSS.
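
Although this book relies on spreadsheets and SPSS, output like that of
Exercise 1.2 can also be cross-checked in a general-purpose language. The
following Python sketch is purely illustrative; the column names and values are
hypothetical stand-ins for whatever your own file contains.

import pandas as pd

# Hypothetical data standing in for the three columns of your file:
# two columns imported from Exercise 1.1 plus the variable added in step 3.
data = pd.DataFrame({
    "score_x": [2, 4, 6, 8, 10],
    "score_y": [1, 3, 5, 7, 9],
    "new_var": [7, 7, 7, 7, 7],
})

# N (count) and Sum, the statistics reported by SPSS Descriptives in step 5.
print(data.count())
print(data.sum())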

1.3 AN INTEGRATED APPROACH TO LEARNING STATISTICS

To a large extent, statistical practice has developed separately in a number of
different academic disciplines or areas, many of which hardly spoke the same
language. Often students in some areas would learn only analyses of variance
and students in other areas would learn only multiple regression. And even when
students learned about both, the two would typically be presented as quite
unrelated topics. As a result, students in different areas learned seemingly
separate statistics in different ways. Even introductory statistical texts often
presented statistics in a piecemeal fashion, as a series of disparate topics.
During the last several decades there has been a growing realization that
considerable commonality underlies seemingly different statistical approaches,
that much of what constitutes basic statistical analysis can be understood as
applications of a general linear model. This has fortunate implications for the
student new to statistics. Formerly, for example, multiple regression was
typically taught as a predictive technique, appropriate for correlational studies in
education and business, whereas the analysis of variance was taught as a way to
analyze experimental results in agriculture and psychology. Increasingly, we now
understand that both use the same concepts and computations, and analysis of
variance is simply a specific application of multiple regression, which is the more
general or overarching approach. As Cohen and Cohen (1983) noted in their
excellent text on multiple regression, multiple regression/correlation is a general
data-analytic system, "a versatile, all-purpose system for analyzing the data of the
behavioral, social, and biological sciences" (p. 4).
Reflecting this new understanding, correlation, multiple regression, and the
analysis of variance for standard experimental designs are treated here in a single
unified and integrated way. This results in considerable economy. Students
confront fewer separate topics, but those they do encounter possess broad
generality and application, which is consistent with the general "less-is-more"
philosophy on which this book is predicated. Earlier in this chapter we argued
that only theoretical, not computational, statistical formulas need be learned. It
may not be necessary to learn many of the traditional theoretical formulas
either—at least not in an introductory course—because an integrated approach
renders many of them unnecessary. Thus, an integrated approach reduces the
amount to be learned but does not sacrifice understanding of basic statistical
concepts or limit the kinds of analyses that can be performed. This may sound
like an extravagant claim to you now (although certainly appealing), but once you
understand the material and work the exercises in this book, you should be in a
position to judge for yourself.

Multiple Regression: A Second Basic Computing Tool


The conceptual framework for the statistical analyses presented in this book is
provided mainly by multiple regression. Both concepts and computations for
particular analyses are embodied in spreadsheets. The two make for a happy
marriage, almost as though spreadsheets were invented to make a multiple-
regression approach to basic statistical analysis comprehensible and practical. As
demonstrated repeatedly throughout this book, the layout of spreadsheets allows
the logic of multiple regression to be expressed faithfully and, as noted earlier,
with a minimum of intimidating notation.
By way of introduction, it seems worthwhile at this point to describe multiple
regression in a very brief and general way. In introducing multiple
regression/correlation, Cohen and Cohen (1983) wrote that it "is a highly general
and therefore very flexible data-analytic system that may be used whenever a
quantitative variable (the dependent variable) is to be studied as a function of, or
in relationship to, any factors of interest (expressed as independent variables)"
(p. 3).
An amazing number of research questions in the behavioral sciences fit this
simple description. Usually there is something we want to explain, or account for
(like IQ scores, income, or health), which we want to account for in terms of
various factors. To be sure, a study that attempts to account for a categorical
instead of a quantitative dependent variable (these terms are defined in chap. 4)
does not fit this description, but techniques for analyzing categorical variables are
mentioned in chapter 3. Moreover, many of the same concepts apply whether the
dependent variable is categorical or quantitative. Thus the material presented
here applies broadly to most simple statistical analytic techniques.
Multiple regression, used as a basic computing tool, can be understood in
black-box terms. Without even knowing what is inside the black box, or how it
works, we can simply provide the black box with the appropriate information.
We would tell it what we want to account for (the dependent variable or DV),
what the factors of interest are (the independent variables or the IVs), and what
the independent and dependent variable scores are for the subjects in our study.
In other words, we would provide the multiple regression program with a
rectangular data matrix. Rows would represent subjects (or some other unit of
analysis like a dyad or a family), the first column would represent the dependent
variable, and subsequent columns would represent independent variables. The
matrix would then consist of scores for individual subjects for the various
variables. For its part, the multiple-regression program would return to us
coefficients or "weights" for each independent variable, indicating the relative
importance of each, along with an overall constant and other statistics. The
coefficients are then used as part of the basic machinery for carrying out the
statistical analyses presented in this book.
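
As a sketch of the black box just described (this is not the spreadsheet method
developed in later chapters, and the data matrix is invented), the following
Python fragment lays out subjects as rows, the DV in the first column, and the
IVs in the remaining columns, then asks an ordinary least-squares routine for
the overall constant and the weight for each IV.

import numpy as np

# Invented data matrix: each row is a subject; column 0 is the DV,
# columns 1 and 2 are IVs.
data = np.array([
    [10.0, 1.0, 3.0],
    [12.0, 2.0, 2.0],
    [14.0, 3.0, 4.0],
    [15.0, 4.0, 3.0],
    [19.0, 5.0, 5.0],
])
y = data[:, 0]                                            # DV scores
X = np.column_stack([np.ones(len(data)), data[:, 1:]])    # constant + IVs

# Least squares returns the overall constant and a weight per IV.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("constant:", coef[0])
print("IV weights:", coef[1:])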
Multiple-regression statistics are discussed in greater detail in chapter 11. As
a practical matter, and in order to perform the exercises throughout this book,
the reader needs some way of determining multiple regression coefficients. If
there is only one independent variable, the regression statistics can be computed
directly in the spreadsheet according to formulas given in chapters 8 and 11. If
there is more than one, it becomes much easier to let a general-purpose multiple-
regression program do the computational work.
Happily, most spreadsheet programs have an internal or built-in ability to do
multiple regression. (In Excel, specify Tools | Data Analysis, usually after a one-
time Tools | Add-ins specifying the Analysis ToolPak.) All you need do is request
regression and then specify the range for the dependent variable, the range for
the independent variable or variables, and the place in the spreadsheet where you
wish the regression output placed. And for more advanced cases, and also as a
check on the spreadsheet exercises and operations, you will use the multiple-
regression procedure in a statistical package such as SPSS.
However, do not let the technical details of how, or with which program, you
will compute multiple-regression statistics obscure the power and sweep of the
argument being made here. Two basic computing tools have been discussed.
Multiple regression provides coefficients. These coefficients are then
incorporated in spreadsheets, which complete the necessary statistical
computations. As you read this book and come to understand the integrated
approach pursued here, you will also be developing specific spreadsheet
templates that allow you to carry out the analyses you have learned. In order to
perform competently what is presented here, in the future you could use either
these templates or a specific statistical package. In fact, when you are done, and
as a necessary consequence of doing the exercises in this book, you will have
created a library of spreadsheet templates that can be, in effect, your own
personal statistical package.
A Note About Statistical Notation
As noted earlier, spreadsheets provide a graphic notation of sorts and allow
notation to be simplified, but notation is hardly eliminated entirely. Some
notation is needed. Yet notation can be a tricky subject that raises passions
seemingly out of proportion to its importance. Part of the problem may be prior
experience. Once we have learned one set of notation, we are often reluctant to
learn another, just as we are likely to favor the first word-processing program we
learned and reject subsequent ones as ridiculous or unnatural. This presents
problems. What sort of notation should be used here? Is it possible to select
symbols that seem useful, clear, consistent, and reasonably natural to most
readers? Moreover, for this book we want notation that can easily be used for
labeling spreadsheets, which rules out a fair number of nonstandard symbols,
including something as simple as a bar over a letter.
Consider, for example, one of the most common statistics, the arithmetic
mean. When raw scores are symbolized with X, their mean is often symbolized as
an X with a bar above it (read "X-bar"), but there is no easy way to enter X-bar as
a spreadsheet label. Another symbol often used to designate the mean is M, and
because of its typographical simplicity, we decided to use it in this book. Thus MX
(read "mean-of-X") indicates the mean of the X scores, MY indicates the mean of
the Y scores, and so forth. For purposes of labeling spreadsheets, you may use
either MX and MY (no italics and no subscripts) or M_X and M_Y, whichever you
find easier with your spreadsheet. Some might prefer X-bar and Y-bar, but what
is easy for a pen often becomes complex for a computer.
Our use of notation in this book has been guided by a few simple principles.
Whenever possible, and recognizing that there is some but by no means universal
standardization of statistical notation among textbooks, we have usually opted
for forms in common use. However, as mentioned in the preceding paragraph,
usability as a spreadsheet label has affected our notational choices. We have also
tried to avoid complex forms. For example, if no ambiguity is introduced, we
often avoid subscripts, and thus X may be used instead of Xij to indicate a
particular datum if context (and location in spreadsheet) indicate clearly which
datum is under discussion. Above all, we have endeavored to be consistent
within this book and to define all notation when first introduced. Statistical
symbols and key terms are set off from the text in boxes (labeled Notes) and, for
ease of reference, statistical symbols and key terms are collected together in a
glossary at the back of the book. Finally, and in order to facilitate comparison
with other notation the reader may have learned or may encounter, we usually
mention any common alternative forms.

The Importance of Spreadsheet Exercises


This book is intended to be one leg of a tripod. The reader and the computer
form the other two legs. When using this book, we assume that readers will have
their computer at hand and they will use it to check examples and perform all
exercises given in the text. Answers for most exercises are provided, either in the
text itself, in figures in the text, or in the Answers to Selected Exercises section at
the end of the book. This should be understood as a device to provide immediate
feedback, not as an invitation to short-circuit the kind of learning that comes only
by doing things for yourself. Tripods are useful because their three legs provide
stability, but all three legs working together are required. When deprived of one
of its legs, a tripod becomes unstable and falls. Likewise, unless you ground your
work firmly on the three legs we intend— the material presented in this book,
yourself, and your computer— your progress could be unstable and your
understanding might fall short. The moral is clear: You need to read
thoughtfully and carefully, do all exercises, and think about how the exercises and
the written material relate.
Readers will come to this text with different degrees of spreadsheet
experience, including none, which is why detailed instructions are provided for
many of the exercises, especially those earlier in the book. For many readers,
such instructions will prove welcome. But for readers who begin with more
experience, and for readers who become experienced in the course of doing these
exercises, detailed instructions can become more of an annoyance than a help.
For that reason, general instructions are provided for all exercises, allowing more
practiced or venturesome readers the option of working more on their own.
Many of the exercises build on previous problems, modifying previous
spreadsheets for new purposes. This is advantageous because it minimizes the
tedium of data entry and demonstrates in a clear way how your work, and your
learning, are accumulating. Nonetheless, we recommend that you save a copy of
each spreadsheet template that you develop, even if only a slight modification of
an earlier template. You may want to print a copy of each different template for
future reference and you will want to save a copy on disk or other storage
medium. To avoid confusion, give each template that you save a name that
relates to the exercise, one that allows you to identify and retrieve it later. As you
will soon discover, you will often have occasion to refer back to previous
spreadsheets and you will want to retrieve earlier attempts for later modification.
Now, on to the work at hand. The next chapter begins where researchers
typically begin, with a research question, some research data, and the need to
evaluate the hypothesis that stimulated the research in the first place.
2 Getting Started:
The Logic of Hypothesis Testing

In this chapter you will:

1. Learn the difference between descriptive and inferential statistics.
2. Learn the difference between samples and populations.
3. Learn the difference between statistics and parameters.
4. Be introduced to the logic of statistical tests.
5. Learn about type I errors (false claims).
6. Learn about type II errors (missed effects).
7. Learn about the power of statistical tests to detect real effects.

Students in the behavioral sciences study statistics for a variety of reasons: the
course is required, they need to analyze their thesis data in order to get a degree,
and so forth. Thinking positively, let us assume that learning how to analyze,
understand, and interpret research data is the paramount motivation. Then,
rather than starting with long digressions into descriptive statistics and basic
probability theory, it makes sense to begin immediately showing how statistics
can be used to answer research questions.

2.1 STATISTICS, SAMPLES, AND POPULATIONS

In discussing our need for statistics, imagine as our starting point a very simple
study. Assume that a maverick psychotherapist decides to try an unusual
treatment. The next 10 people who show up at her office are included in the
study. Each is referred to an outside consultant for an initial evaluation. Then
appointments are scheduled weekly for the next 3 months, but instead of talking
to the clients, the psychotherapist simply gives them $50 and sends them on their
way. After 3 months, the patients are again evaluated by an outside consultant
who knows nothing of this unusual treatment. Based on the consultant's reports,
the therapist determines that 8 of 10 clients improved. Her data are shown in
Fig. 2.1.
The question is, should we be impressed by any of this? After all, all 10 did
not get better. And some would probably have gotten better just by chance alone.
Is 8 out of 10 just a chance happening? Or do these results merit some attention?

Subject   Outcome
1         +
2         +
3         +
4         +
5         -
6         +
7         +
8         +
9         -
10        +
FIG. 2.1. Data for the money treatment study:
+ = improved, - = not improved.

Specifically, do they suggest that the "money cure" the therapist attempted is
effective? A knowledge of statistics lets us answer questions like these and reach
conclusions even when our knowledge about the world is incomplete. In this
case, all we know is what happened with a sample of 10 people. But what can we
say about the effect of the money cure on people in general?

Generalizing From Sample to Population


Ultimately we want to know not just the datum for a single subject (whether the
subject improved) and not just the group summary statistic (8 of 10 improved),
but what happens to therapeutic clients in general. What we really want to know
is, for the population of all possible clients, how many would get better—that is,
what is the probability that the money cure would prove effective in general?
Assessing all possible clients to find this out is a practical impossibility. Even if
we could identify potential clients, the number would be overwhelming. So
instead we select a limited but manageable number to study and hope that this
sample represents the population at large.

The Importance of a Random Sample


The study sample and how it is selected are the foundation on which the logic of
statistical reasoning rests. In the most narrow sense, the results of the money-
cure study tell us only about the 10 people actually studied. Yet normally we
want to claim that these results generalize to a wider population. Such claims are
justified if our sample is truly a random sample from the population of interest.
We would first need to define the population of interest (e.g., all potential
psychotherapy clients), put their names in a hat, and then draw 10 names. This
would be a random sample because each and every name has an equal chance of
being selected.
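
To make the names-in-a-hat procedure concrete, here is a minimal Python
sketch; the population of 1,000 client names is hypothetical, and all that
matters is that every name has the same chance of being drawn.

import random

# The hat: a list naming every member of the defined population.
population = ["client_%d" % i for i in range(1, 1001)]

# Ten draws, each name having an equal chance of selection.
sample = random.sample(population, 10)
print(sample)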
In practice, however, the requirement of random sampling is almost never
met in behavioral science research. Some survey research studies that use
random-digit dialing of respondents come very close. But almost always
psychological studies use "samples of convenience" and then argue that the
sample, if not randomly selected, at least appears to represent middle-class
mothers, college sophomores, happily married couples, and so forth. This rather
casual and after-the-fact approach to sampling is less than ideal from a statistical
point of view, and although commonly condoned, it does place a burden on
individual researchers to justify any generalizations.

What Are Statistics Anyway?


The term statistics is used in two quite different ways. In a narrow sense, a
statistic is simply a number that characterizes or summarizes group data. The
statistic of interest for the present example is a count or tally— the number of
people in the study who improved— but other statistics for the group could be the
average age of the subjects or the variability of their income. In a broader sense,
statistics, as a body of knowledge and as an academic discipline, comprises
concepts and techniques that allow researchers to move beyond the information
provided by one study and generalize, or make inferences about, a larger
population. Thus statistics, in the broad sense, makes extensive use of statistics,
in the narrow sense.

Descriptive and Inferential Statistics


It is common to distinguish between descriptive statistics and inferential
statistics. Descriptive statistics are statistics in the narrow sense. Examples
include the ratio of men to women in a graduate statistics class, the average age
of students at our university, the median income of households in the United
States, and the variability of Scholastic Aptitude Test (SAT) scores for 2004.
Descriptive statistics are simply summary scores for sets of data and as such
characterize various aspects of the units studied. In psychological studies, units
are often people (e.g., students in a graduate statistics class or the people who
took the SAT in 2004) and are often referred to as participants in research
reports. But other kinds of units (e.g., a mother-infant dyad, a family, a school, a
city, a year, etc.) are possible.
Inferential statistics, on the other hand, implies using statistics in a broader
sense. The term refers to a body of reasoning and techniques, called statistical
decision theory or hypothesis testing, that allow researchers to make decisions
about populations based on the incomplete information provided by samples.
Thus the techniques of inferential statistics are designed to address a practical
problem. If we cannot assess all units (individuals, dyads, schools, states, etc.) in
the population, how can we still make decisions about various aspects of the
population that we believe are important and interesting?

Populations and Parameters, Samples and Statistics


Aspects of interest are identified with population parameters. For example, one
parameter could be the probability that a client in the population would improve.
Another could be the average age of clients in the population. Another could be
the strength of the association between clients' age and their income, again in the
population, not just in the sample. We assume that population parameters are
quantifiable and, in theory at least, knowable in some sense even if not directly
observable. Some writers, for example, regard populations as essentially infinite
by definition, which means that determination of parameter values by direct
measurement would forever elude us. But even if we cannot assess population
parameters directly, we can at least estimate values for them from sample data
instead. Indeed, statisticians spend considerable care and time deriving
appropriate ways to estimate different parameters and demonstrating that those
estimates have a variety of desirable properties.

FIG. 2.2. Schematic representation of the relations between
statistics, samples, parameters, and populations.

In the sample, aspects of interest are identified with statistics (in the narrow
sense). In other words, a statistic is to a sample as a parameter is to a population
(see Fig. 2.2). Often (but not always) a sample statistic can be used to estimate a
population parameter directly. For example, the population mean is usually
represented with u (the Greek lower case mu), the sample mean is often
represented with M (as discussed in the last chapter), and as it turns out the value
of u can be estimated with the sample M. In other words,

"estimated = M (2.l)

Simplifying considerably, inferential statistics implies:

1. Drawing a random sample from a population.
2. Computing statistics from the sample.
3. Estimating population parameters from the sample statistics.
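
The three steps can be simulated if we pretend, for the moment, that the entire
population is available to us. A sketch in Python, with an invented population
of ages (assumed here, purely for illustration, to average about 35 years):

import random
import statistics

# Step 0 (pretense): a complete, directly measurable population of ages.
population_ages = [random.gauss(35, 10) for _ in range(100000)]

sample = random.sample(population_ages, 25)   # 1. draw a random sample
M = statistics.mean(sample)                   # 2. compute the sample statistic
print("estimated mu:", M)                     # 3. estimate mu with M (Eq. 2.1)
print("actual mu:   ", statistics.mean(population_ages))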

In the next section we discuss ways of comparing values of population
parameters estimated from a sample with values assumed by a particular theory.
As you will see, this is the way we test our research hypotheses.

Note 2.1
M Roman letters are usually used to represent sample statistics.
For example, M is often used for the sample mean.
μ Greek letters are usually used to represent population
parameters. For example, μ (lower case Greek mu) is often
used for the population mean.
2.2 HYPOTHESIS TESTING: AN INTRODUCTION

With reference to the money-cure study described at the beginning of this
chapter, is 8 out of 10 clients improving a significant result or just a chance
happening? Just by chance we could have selected 8 out of 10 patients who
improved even though in the population from which those subjects were selected
the money cure has no effect on therapeutic outcome. If we assume that the
money cure has no effect—that patients, left to their own devices, are as likely to
improve as not—then the value for the population parameter of interest (the
probability of improving) would be .5 (or 1 in 2). This is not a matter of
computation, but simply an assumption, based on the theory that the money cure
has no effect.
But even if the true value of the population parameter were .5, it is still
possible, just by chance alone, to draw a sample in which the value of the relevant
statistic is .8. It may be unlikely, but it could happen. Even so, if an observed
result (.8) is sufficiently unlikely, naturally we question whether or not the
assumed value for the population parameter (.5) is reasonable.
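
Just how unlikely is a sample value of .8 when the population parameter is
really .5? Chapter 3 develops the binomial distribution that answers this
formally; the following sketch simply does the arithmetic for 8, and for 8 or
more, improvers out of 10 when each client independently improves with
probability .5.

from math import comb

n, p = 10, 0.5

# Probability of exactly 8 improvers out of 10 under the .5 assumption.
p_exactly_8 = comb(n, 8) * p**8 * (1 - p)**2

# Probability of a result at least this extreme (8, 9, or 10 improvers).
p_8_or_more = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8, n + 1))

print("P(exactly 8 improve) = %.4f" % p_exactly_8)   # about .044
print("P(8 or more improve) = %.4f" % p_8_or_more)   # about .055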
Over the course of the past century, a method of inference termed statistical
decision making or hypothesis testing has been developed, largely by Fisher but
with generous assists from Neyman and Pearson. Hypothesis testing provides a
way of deciding whether or not postulated or assumed values for various
population parameters are tenable given the evidence provided by the statistics
computed from a particular sample. Five elements central to this process, listed
in the order in which they need to be addressed during hypothesis testing, are:

1. The null hypothesis, which provides a theoretical basis for constructing a
sampling distribution for a particular test statistic.
2. The sampling distribution, which allows us to determine how probable
particular values of a test statistic would be if in fact the null hypothesis
were true.
3. The alpha level, a probability value that is used for the statistical test.
4. The test statistic, the value for which is computed from the sample.
5. The statistical test, which allows us to determine whether it is probable,
or improbable (the exact probability value is the alpha level), that the
sample was drawn from a population for which the null hypothesis is
true.

The Null Hypothesis

The null hypothesis (usually symbolized H0) is one guess about the true state of
affairs in the population of interest. It is contrasted with the alternative
hypothesis (usually symbolized H1). The null and alternative hypotheses are
usually formulated so that, as a matter of logic, they are mutually exclusive (only
one of them can be true) and exhaustive (one of them must be true). In other
words, if one is false, the other necessarily must be true. Thus if the null
hypothesis states that it is not raining, then the alternative hypothesis states that
it is raining. Textbooks in statistics often begin by introducing hypotheses about
population means. For example, if the null hypothesis states that the population
mean is zero, then an alternative hypothesis could be that the population mean is
not zero. Symbolically:

H0: μ = 0.
H1: μ ≠ 0.

In this case, the alternative hypothesis commits to a mean different from zero no
matter whether that mean is smaller or larger than zero. Thus this H1 is
nondirectional (sometimes called two-tailed). Another null hypothesis could
state that the population mean is not greater than zero, whereas the alternative
hypothesis could be that the population mean is greater than zero.
Symbolically:

H0: μ ≤ 0.
H1: μ > 0.

Thus this H1 is directional (sometimes called one-tailed) because it commits only
to differences from zero that, in this case, are larger than zero.
Usually, the alternative hypothesis corresponds to the investigator's hunch
regarding the true state of affairs; that is, it corresponds to what the investigator
hopes to demonstrate (for the current example, that the money treatment has an
effect). However, it is the tenability of the null hypothesis that is actually tested
(that the money treatment has no effect). If the null hypothesis is found
untenable, then as a matter of logic the alternative must be tenable, in which case
investigators "reject the null" and, for that reason, "accept the alternative."
When encountered for the first time, this way of reasoning often seems
indirect and even a bit tortured to many students. Why not simply demonstrate
that the alternative hypothesis is tenable in the first place? This is partly a matter
of convention, but there is good reason for it. There is a crucial difference
between null and alternative hypotheses. In its typical form, a null hypothesis is
exact. It commits to a particular value for a population parameter of interest.
The alternative hypothesis, on the other hand, usually is inexact. It only claims
that the parameter is not the value claimed by the null hypothesis but does not
commit to a specific value (although, as we just saw, it may commit to a
particular direction). (Exceptions to the foregoing exist, but they are rare.)
In order to construct a sampling distribution for the test statistic, which is the
next step in hypothesis testing, an exact value for the population parameter of
interest is needed. In most cases, it is provided by the null hypothesis. Thus we
test, and perhaps reject, not the hypothesis that embodies our research concern
and that postulates an effect, but another hypothesis that postulates exactly no
effect and that, from a substantive point of view, is less interesting.
According to the no-effect or null hypothesis for the present example, exactly
half of the clients should get better just by chance alone. If this null hypothesis
were true, then the population parameter for the probability of improving would
be .5. This may or may not be the true value of the population parameter, but at
least it provides a basis for predicting how probable various values of the test
statistic would be, if the true value for the population parameter were indeed .5.

The Sampling Distribution


Once the null hypothesis is defined, the sampling distribution appropriate for the
test statistic can be specified. Sampling distributions are theoretical constructs
and as such are based on logical and formal considerations. They resemble but
should not be confused with frequency histograms, which are based on data and
indicate the empirical frequency for the scores in a sample. Sampling
distributions are important for hypothesis testing because we can use them to
derive how likely (i.e., how probable) a particular value of a test statistic would
be, in theory at least, if the null hypothesis were true. The binomial, which is
described in chapter 3, is the appropriate sampling distribution for the current
example.
The phrase "in theory" as it applies to sampling distributions is important.
The probabilities provided by a sampling distribution are accurate for the test
statistic to the extent that the data and the data collection procedures meet the
assumptions used to generate the theoretical sampling distribution in the first
place. Assumptions for different tests vary. And in practice many assumptions
can be violated without severe consequences. One assumption, however, called
independence of measurements (or simply, independence), is basic to most
sampling distributions and, if violated, raises serious questions about any
conclusions.
This key assumption requires that during data collection scores are assigned
to each unit in the sample (subject, dyad, family, and so forth.) independently. In
other words, each score must represent an independent draw from the
population. For the present example, this means that the evaluation of one client
cannot be linked, or affected by, the evaluation given another client. To use the
classic example, imagine an urn filled with tickets, each of which has a number
printed on it. We shake the urn, reach in, remove one ticket, and note its
number. We then replace the ticket and repeat the procedure until we have
accumulated N numbers, the size of our sample. (If the population of tickets is
large enough, it may not matter much whether we draw with, or without,
replacement of tickets previously drawn.) This constitutes an independent
sample because presumably tickets pulled on previous draws do not influence
which ticket we pull on the next draw. We would not, however, select two tickets
at a time because then pairs of tickets would be linked and the total of all tickets
drawn would not constitute an independent sample. And if the assumption of
independence is violated, then we can no longer be confident that a probability
value derived from a theoretical sampling distribution provides us with the
correct probability value for the particular test statistic being examined.

The Alpha Level


The alpha level is the probability value used for the statistical test. By
convention, it is usually set to .05 or, more stringently, to .01. If the probability
of the results observed in the sample occurring by chance alone (given that the
null hypothesis is true) is equal to or less than the alpha level, then we declare the
null hypothesis untenable.

The Test Statistic


In contrast to the sampling distribution, which is based on theory, the value of
the test statistic depends on the data collected, and thus in general terms a test
statistic is a score computed from sample data. For example, if our null
hypothesis involved the age of our clients, the average age might be used as a test
statistic. In theory, any of a wide array of summary statistics could be used as
test statistics, but in practice, statistical attention has focused on just a few. For
the present example, the test statistic is the probability that clients who received
the money cure would improve and its value, as determined from the sample of
10 subjects, is .8.

The Statistical Test


Given a particular null hypothesis, an appropriate sampling distribution, and a
value for the appropriate test statistic, we are in a position to determine whether
or not the result we observed in our sample would be probable if the null
hypothesis is in fact true. If it turns out that our result would occur only rarely,
given that the null hypothesis is true, we may decide that the null hypothesis is
untenable. But how rare is rare? As noted a few paragraphs back, by convention
and somewhat arbitrarily, 5% is generally accepted as a reasonable cutoff point.
Certainly other percentages could be justified, but in general behavioral scientists
are willing to reject the null hypothesis and accept the alternative only if the
results actually obtained would occur 5% of the time or less by chance alone if the
null hypothesis were true. This process of deciding what level of risk is
acceptable and, on that basis, deciding whether or not to reject the null
hypothesis constitutes the statistical test. For the present example (testing the
effectiveness of the money cure), the appropriate test is called a sign test or a
binomial test. In the next chapter we describe this test and demonstrate its use.

2.3 FALSE CLAIMS, REAL EFFECTS, AND POWER

Type I Error: The Risk of Making a False Claim


Earlier in this chapter we claimed that knowledge of statistics allows us to make
decisions from incomplete information. Thus we may make decisions about a
population based only on a sample selected from the relevant population. For the
money-cure example, only 10 subjects were examined—which falls far short of a
complete survey of psychotherapy clients. Yet based on this sample of 10 subjects
(and pending presentation of the sign test in the next chapter), we might
conclude that the money cure affected so many clients in the sample positively
that the null hypothesis (the hypothesis that the clients were selected from a
population in which the money cure has no effect) is untenable, which leads us to
conclude that, yes, the money cure does have a beneficial effect.
Basing decisions on incomplete information entails a certain amount of risk.
What if, for example, in the population from which our subjects were selected,
the money cure had no effect even though, in this one study, we just happened to
select a high proportion of clients who got better? In this case, if we claimed an
effect based on the particular sample we happened to draw, we would be wrong.
We would be making what is called a type I error, which means we would have
rejected the null hypothesis when in fact the null hypothesis is true. We would
have made a false claim.
Given the nature of statistical inference, we can never eliminate type I errors,
but at least we can control how likely they are to occur. As noted earlier, the
probability cutoff point for rejecting the null hypothesis is called the alpha level.
If we set our alpha level to the conventional .05, then the probability that we will
reject the null hypothesis wrongly, that is, make a type I error, is also .05. After
all, by setting the alpha level to .05 for a statistical test we commit ourselves to
rejecting the null hypothesis if the results we obtain would occur 5% of the time
or less given that the null hypothesis is true. If we did the same experiment again
and again, and if in fact there is no effect in the population, over the long run 95%
of the time we would correctly claim no effect. But 5% of the time, just by the
luck of the draw, we would wrongly claim an effect. As noted earlier, by
convention most behavioral scientists find this level of risk acceptable.

Type II Error: The Risk of Missing a Real Effect


Making a false claim is not the only error that can result from statistical decision
making. If, for the population from which our subjects were selected, the money
cure indeed has an effect but, based on the particular sample drawn, we claimed
that there was none, we would be making what is called a type II error. That is,
we would have failed to reject the null hypothesis when in fact the null hypothesis
is false. We would have missed a real effect.
Under most circumstances, we do not know the exact probability of a type II
error. The probability of a type II error depends on the actual state of affairs in
the population, which we do not know exactly, and not on the state of affairs
assumed by the null hypothesis, which we define and hence know exactly. A type I error
occurs when the magnitude of the effect in the population (indexed by an
appropriate population parameter) is zero (or some other specific value) and yet,
based on the sample selected, we claim an effect (thereby making a false claim).
The probability that this will occur is determined by the alpha level, which we set,
and depends on the sampling distribution for the test statistic under the null
hypothesis, which we assume. Hence we can specify and control the probability
of type I error.
In contrast, a type II error occurs when the magnitude of the effect in the
population is different from zero (or another specific value) by some unspecified
amount and yet, based on the sample selected, we do not claim an effect (thereby
missing a real effect). The probability of a type II error can be determined only if
the exact magnitude of the effect in the population is known, which means that,
under most circumstances, we cannot determine the probability of a type II error
exactly. However, even if we do not know the probability of a type II error, we
can affect its magnitude by changing the alpha level. If we select a more stringent
alpha level (.01 instead of .05, for example), which has the effect of decreasing the
probability of making a false claim, we necessarily increase the probability of
missing a real effect. It is a trade-off. Decreasing the probability of a type I error
increases the probability of a type II error and vice versa. If we are less likely to
make a false claim, we are more likely to miss a real effect. If we are less likely to
miss a real effect, we are more likely to make a false claim.

Power: The Probability of Detecting Real Effects


Perhaps too pessimistically, the preceding paragraphs have focused on the ways
statistical decision making can go wrong. But just as there are two ways we can
be wrong, there are also two ways we can be right. (All four possibilities are
shown schematically in Fig. 2.3.) One way that we can be right occurs when there
genuinely is no effect in the population and we claim none. The probability of
making a false claim is alpha (α), so the probability of correctly claiming no effect
is 1 - alpha. If there is no effect, if our alpha level is .05, and if we conduct study
after study, over the long run 95% of the time we will be right when we claim no
effect.
A second way that we can be right occurs when we claim an effect and there
genuinely is one. Just as alpha (a) is the probability of making a false claim
(Type I error), so the probability of missing a genuine effect (Type II error) is
often symbolized as beta (B). And just as the probability of correctly claiming no
effect is 1 - alpha, so the probability of correctly claiming an effect is 1 - beta.
This probability, the ability to detect real effects, is called the power of a
statistical test.
The power of statistical tests is affected by three factors. First, the magnitude
of the effect in the population affects power. Other things being equal, bigger
effects are more likely to be detected than smaller effects and hence when effects
are large we are more likely to claim an effect. Second, as is discussed in chapter
17, we are more likely to detect an effect of a particular magnitude when sample
sizes are larger than when they are smaller. Third, alpha level affects power. If
the alpha level is made less stringent (changed from .05 to .10, for example),

FIG. 2.3. Possible outcomes of the statistical decision-making process
and their associated probabilities (P symbolizes probability).

power will be increased. We will be more likely to detect real effects if they exist
but we will also be more likely to make false claims if H0 is true. Conversely, if we
make the alpha level more stringent, power will be decreased. We will be less
likely to make false claims if H0 is true but we will also be less likely to detect real
effects if they exist.
Of these three, only sample size is under our control and so the only practical
way to increase power is to increase sample size. Alpha is almost always set at
.05 (or less) by convention, so choosing a less stringent alpha level is usually not a
practical way to increase power; in any case, such increases would be balanced by
the concomitant increase in type I error. And the magnitude of the effect in the
population is hardly under our control. In fact, as we noted earlier, normally we
do not even know its actual value and thus we cannot compute a value for either
beta (the probability of missing a real effect) or power (the probability of
detecting a real effect).

Note 2.2
Type I Error   Making a false claim, or incorrectly rejecting the null
hypothesis (also called an alpha error). The probability of
making a Type I error is alpha.
Type II Error   Missing a real effect, or incorrectly accepting the null
hypothesis (also called a beta error). The probability of
making a Type II error is beta.
Power The probability of detecting a real effect, or the probability of
correctly rejecting the null hypothesis. The power of a test is
one minus the probability of missing a real effect, or 1 - beta.

Alpha, Beta, and Power: A Numerical Example


Under normal circumstances, we do not know the true state of affairs for the
population. All we know (because we assume it) is what the sampling
distribution for the test statistic would be if the null hypothesis were true. For
example, imagine that we have developed a test statistic whose sampling
distribution under the null hypothesis is as shown in the first histogram (top) of
Fig. 2.4. Altogether there are 20 possible outcomes, each represented by a
square. One of the outcomes is 1, which is why there is one square or box above
the value 1. Two of the outcomes are 2, which is why there are two boxes above
the value 2. Similarly, three of the possible outcomes are 3, four are 4, four are 5,
three are 6, two are 7, and one is 8. Area is proportional to probability, and thus
the probability that the outcome actually will be 1 on any one trial is 1/20, that it
will be 6 is 3/20, and so forth. The most likely outcomes for the test statistic are
either 4 or 5. For each, the probability is 4/20 or .2; hence the probability of
either a 4 or a 5 is .4 (because the probability of at least one of a set of events
occurring is the sum of their separate probabilities). Symbolically,

P(4 or 5) = P(4) + P(5)
= .2 + .2 = .4.

(Recall that P symbolizes probability.)


Imagine further that we have decided beforehand that we will reject the null
hypothesis only if our study yields a test statistic whose value is larger than
expected (this specifies a directional or one-tailed test as opposed to a
nondirectional or two-tailed test) and that we have set our alpha level to an
unconventionally generous .15. Note that the theoretical probability of a 7 is
2/20 or .10 and the probability of an 8 is 1/20 or .05; hence the probability of
either a 7 or 8— the largest values in this sampling distribution— is the sum of .10
and .05 or .15. Symbolically,

P(7 or 8) = P(7) + P(8)
= .10 + .05 = .15.
Thus if we conduct a study and discover that the value of the test statistic is 7 or
larger, we would reject the null hypothesis. This is, after all, a relatively unlikely
outcome, given the sampling distribution portrayed at the top of Fig. 2.4. If we
performed the study hundreds of times, and if the null hypothesis were true, the
value for the test statistic would be as big as 7 or 8 only 15% of the time.
In order to demonstrate the relations between alpha, beta, and power, it is
convenient (in fact, necessary) to assume that we actually know the true state of
affairs for the population. For example, imagine that we know the true value of
the appropriate population parameter and it generates a sampling distribution
like the one shown in the second histogram (bottom) of Fig. 2.4. Given this
information we can compute beta.
The sampling distributions shown in Fig. 2.4, like all sampling distributions,
indicate the likelihood of various outcomes. For example, the first histogram
(top) indicates that if the null hypothesis is true then a test statistic as large as 7
or 8 would occur only 15% of the time or less. Each time we do a study, of course,
only one outcome is produced. If that outcome were 6 or less, we would not
reject the null hypothesis. After all, outcomes of 6 or less would occur 85% of the
time (17 times out of 20; top histogram). In other words:
P(1 or 2 or 3 or 4 or 5 or 6)
= P(1) + P(2) + P(3) + P(4) + P(5) + P(6)
= .05 + .10 + .15 + .20 + .20 + .15 = .85.

However, if the population probabilities for the test statistic are as indicated
in the second histogram (bottom), then outcomes of 6 or less would actually
occur 6 times out of 20, or just 30% of the time:

P(4 or 5 or 6)
= P(4) + P(5) + P(6)
= .05 + .10 + .15 = .30.
If we conducted the study repeatedly, over the long run 30% of the time we would
not reject the null hypothesis (because the value of the test statistic would be 6 or
less). In other words, 30% of the time we would fail to detect the effect. Hence, if
the true state of affairs is as indicated in the bottom histogram, and our null
hypothesis is based on the top histogram, then the probability of a type II error
(of missing the effect) would be .3. However, 70% of the time we would correctly
claim an effect. In this case the power of the statistical test would be .7.

FIG. 2.4. Sampling distributions for a hypothetical test statistic assumed by
the null hypothesis (top) and actually occurring in the population (bottom).
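
The arithmetic of this example is simple enough to check directly. In the
Python sketch below, the null-hypothesis distribution is the one given in the
text; the bottom distribution is assumed here to be the same shape shifted
three units upward, which reproduces the 30% and 70% figures just computed.

# Top histogram of Fig. 2.4: values 1-8 with 1,2,3,4,4,3,2,1 boxes of 20.
null_dist = {v: c / 20 for v, c in zip(range(1, 9), [1, 2, 3, 4, 4, 3, 2, 1])}

# Bottom histogram (assumed): the same shape shifted up by three units.
true_dist = {v + 3: p for v, p in null_dist.items()}

critical = 7   # reject the null when the test statistic is 7 or more

alpha = sum(p for v, p in null_dist.items() if v >= critical)   # false claims
beta = sum(p for v, p in true_dist.items() if v < critical)     # missed effects
power = 1 - beta                                                # detections

print("alpha = %.2f" % alpha)   # .15
print("beta  = %.2f" % beta)    # .30
print("power = %.2f" % power)   # .70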
The Region of Rejection
Sampling distributions like those portrayed in Fig. 2.4 show the relations among
alpha, beta, and power graphically. Consider the unshaded and shaded areas.
For the null-hypothesis-generated sampling distribution (top), and an alpha level
of .15, values in the shaded area to the right would cause us to reject the null
hypothesis (in this case, values of 7 or more), whereas values in the unshaded
area to the left would not cause us to reject the null hypothesis (in this case,
values of 6 or less). Recall that for a sampling distribution, area is equivalent to
probability. Thus, for an alpha level of .15, 15% of the area is shaded. This area
is called the region of rejection or the critical region. If values computed for our
test statistic fall within this rejection region (in this case, if our test statistic is 7 or
greater), we would reject the null hypothesis.
In the top panel, 85% of the area under the null-hypothesis-generated
sampling distribution is unshaded, to the left. This proportion is 1 - alpha, the
probability of correctly making no claim if the null hypothesis is true.
But if the null hypothesis is false, and if we know that the true state of affairs
is as depicted in the bottom panel, then matters are different. The unshaded 30%
of the area to the left represents beta, the probability of missing an effect (the
probability of a type II error), whereas the shaded 70% of the area to the right is
1 - beta and represents power. In this case, we have a 70% chance of detecting an
effect as significant, given an alpha level of .15.
The shaded area can help us visualize the relation between alpha and beta.
Imagine that we reduce the alpha level to .05, reducing the shaded portion to
values of 8 or more. This makes alpha (the probability of making a false claim, a
type I error) smaller (top panel). But at the same time, beta (the probability of
missing an effect, a type II error) becomes larger while 1 - beta (the power of the
test or the probability of correctly detecting a genuine effect) necessarily
decreases (bottom panel).

Exercise 2.1
Type I Errors, Type II Errors, and Power
The problems in this exercise refer to Fig. 2.4. They provide practice in
determining the probability of making false claims (type I error), of missing real
effects (type II error), and of detecting real effects (power). All questions assume
that if the null hypothesis is true, the test statistic is distributed as shown in the
top panel.
1. If the alpha level were .05, instead of .15, what values for the test statistic
would fall in the critical region, that is, lead us to reject the null hypothesis?
2. If the alpha level were .30, what values for the test statistic would fall in the
critical region?
3. If the alpha level were .05, and the true state of affairs were as depicted in
the bottom panel, what is the probability of missing a real effect? What is the
probability of correctly claiming an effect?
4. If the alpha level were .30, and the true state of affairs were as depicted in
the bottom panel, what is the probability of missing a real effect? What is the
probability of correctly claiming an effect?
2.4 WHY DISCUSS INFERENTIAL BEFORE DESCRIPTIVE STATISTICS?

The differences between samples and populations, between statistics and
parameters, and between descriptive and inferential statistics were discussed in
this chapter. Hypothesis testing, which represents an important application of
inferential statistics, was presented and the ways a statistical decision could be
either correct or incorrect were noted. Many textbooks spend the first several
chapters detailing descriptive statistics—a topic not addressed in earnest in this
book until chapter 5. Descriptive statistics usually seem more concrete than
inferential statistics (there are no type I and type II errors to puzzle about and no
parameters to estimate, for example), and as a result many students find
presentations of descriptive statistics less challenging and easier to learn.
Nonetheless, we prefer to begin with material that is more abstract and certainly
more intellectually exciting (hypothesis testing, for example) because it
immediately and clearly demonstrates the usefulness of statistical knowledge and
techniques. Continuing in this vein, a common and useful statistical test, the sign
test, is presented in chapter 3.

Note 2.3
Alpha Level Maximum acceptable probability for type I error (claiming an
effect based on sample data when there is none in the
population). Conventionally set to .05 or .01.
Rejection Region   Designates values for the test statistic whose probability of
occurrence is equal to or less than the alpha level, assuming
that the null hypothesis is true. If the test statistic assumes
any of these values (falls in the region of rejection), the null
hypothesis is rejected.
3 Inferring From a Sample:
The Binomial Distribution

In this chapter you will:


1. Learn about a simple sampling distribution, the binomial.
2. Learn how to perform a simple statistical test, the binomial or sign test.

This chapter continues the discussion of hypothesis testing begun in chapter 2
and illustrates how a common statistical test, the binomial or sign test, can be
used to answer simple research questions. The sign test makes use of one of the
easiest distributions to understand, the binomial.

3.1 THE BINOMIAL DISTRIBUTION


Interest in the binomial distribution began with Blaise Pascal in the 17th century
and has continued to be a source of interest for mathematicians and statisticians
ever since. The binomial is easy to understand (at least now, after several
brilliant mathematicians have shown us how), easy to generate, and lends itself to
a clear and straightforward demonstration of hypothesis testing. Moreover, a
simple and useful statistical test—the sign test—is based on the binomial. For all
these reasons, the first sampling distribution introduced in this book is the
binomial.
The binomial is the appropriate sampling distribution to consider whenever a
trial results in one of two events—that is, whenever the results of a single trial (or
treatment) can be categorized in one, and only one, of two ways. For example, in
the money-cure study described in the last chapter, clients were categorized as
either improved or not improved. In more general terms, whenever a subject (or
a family, a coin toss, or whatever constitutes the sampling unit) can be assigned
to one of two possible categories, the binomial distribution can be used. From it,
we can figure out how probable the results for a particular study would be if the
null hypothesis (and not the alternative hypothesis) were in fact true.
Consider the money-cure study. If we examine only one subject (symbolized
S), then there are only two possible outcomes for the study—the events defined
for a single trial (because, in fact, this study consists of only one trial):

1. The subject improves.
2. The subject does not improve.

If we assume that both events are equally likely, then the probability that the
subject improves is 1/2 or .5. The probabilities for both events must sum to 1, so
the probability that the subject does not improve must be 1 minus .5.
Symbolically:

P = P(S improves) = .5
Q = P(S does not improve) = 1 - .5 = .5

(P is often used to represent the probability for the first event, Q for its
complement; thus Q = 1 - P.)
However, if we examine two subjects (S1 and S2), then there are four possible
outcomes for the study:

1. Both subjects improve.
2. The first improves but the second does not improve.
3. The first does not improve but the second does improve.
4. Neither subject improves.

The probability that the first subject will improve is .5 and the probability that the
second subject will improve is also .5; therefore, because the probability that all
events in a set will occur is the product of their separate probabilities, the
probability that both subjects will improve is .25. Symbolically (imp = improves),

P(both improve)
= P(S1 imp and S2 imp)
= P(S1 imp) x P(S2 imp)
= 0.5 x 0.5 = 0.25
Similarly, because the probability of a subject not improving is also .5, the
probability for the second, third, and fourth outcomes will also be .25. This
makes sense, because the probabilities for all outcomes must sum to 1.
Although there are four different possible outcomes for a study with two
subjects, there are only three outcome classes:

1. Two improve (outcome 1).
2. One improves (outcomes 2 and 3).
3. None improve (outcome 4).

The probability for the first outcome class is .25. This class contains only one
outcome, outcome 1, whose probability was computed in the previous paragraph.
Likewise, the probability for the third class is also .25, again because this
outcome class contains only one outcome, outcome 4, whose probability is .25.
But the probability for the second outcome class is .5. This class contains two
outcomes, outcomes 2 and 3, both of whose probabilities are .25. Therefore,
because the probability of any one of a set of events occurring is the sum of their
separate probabilities, the probability that either will occur is twice .25 or .5.
Symbolically,
P(one improves)
= P(S1 imp and S2 not) + P(S1 not and S2 imp)
= P(S1 imp) x P(S2 not) + P(S1 not) x P(S2 imp)
= 0.5 x 0.5 + 0.5 x 0.5 = 0.25 + 0.25 = 0.5
This formula illustrates the two basic probability rules we have been using:
(a) The and rule, which states that the probability that two events will both occur
is the product of their individual probabilities; and (b) the or rule, which states
that the probability that either of two events will occur is the sum of their
individual probabilities.
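Although this book's exercises use a spreadsheet, the two rules are easy to check in a few lines of Python as well. The sketch below is ours, purely for illustration; the variable names are not part of any exercise.

    # The "and" rule (multiply) and "or" rule (add) applied to the
    # two-subject example; p_imp is the assumed probability of improving.
    p_imp = 0.5
    p_both = p_imp * p_imp                             # and rule: .25
    p_one = p_imp * (1 - p_imp) + (1 - p_imp) * p_imp  # or rule: .50
    print(p_both, p_one)                               # 0.25 0.5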
Often discussions of the binomial are cast in terms of coins. After all, under
normal circumstances a tossed coin must come up either heads (H) or tails (T).
Continuing our intuitive introduction to the binomial, if we toss a coin three
times (which is equivalent to examining three subjects), then there are eight
different possible outcomes for a three-trial study (T = tails, H = heads):

1. TTT
2. TTH
3. THT
4. THH
5. HTT
6. HTH
7. HHT
8. HHH

and four outcome classes:

1. 0 heads (1 above)
2. 1 head (2, 3, and 5 above)
3. 2 heads (4, 6, and 7 above)
4. 3 heads (8 above)

A convenient way to represent all of the possible outcomes is with a branching tree diagram (see Fig. 3.1). The first branch at the top represents the
first toss, which could be tails or heads. The second set of branches represents
the second toss, and so forth. If you tried to list all possible outcomes without
some mnemonic device, you might miss a few. If you use the branching diagram,
you should be able to list all without omission.

FIG. 3.1. The eight possible outcomes if a coin is tossed three times.
At this point you may begin to see (or remember) a pattern. With one toss
there are two outcomes and two classes. With two tosses there are four outcomes
and three classes. With three tosses there are eight outcomes and four classes.
And in general, if there are N tosses, there will be N + 1 classes (e.g., for 7 tosses
there will be 8 classes: 0 heads, 1 head, 2 heads, ..., 7 heads) and there will be 2^N
outcomes (because each toss doubles the number of outcomes for the previous
toss). Thus for 2 tosses there are 4 outcomes, for 3 tosses, 8 outcomes, for 4
tosses, 16 outcomes, for 5 tosses, 32 outcomes, and so forth (see Fig. 3.2).
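If you prefer to verify this pattern by brute force, a short Python sketch (ours, not part of the spreadsheet exercises) enumerates every outcome and tallies the outcome classes:

    # List all 2**N outcomes for N coin tosses, then count outcomes per
    # class (class = number of heads in the sequence).
    from itertools import product
    from collections import Counter

    N = 3
    outcomes = ["".join(seq) for seq in product("TH", repeat=N)]
    classes = Counter(o.count("H") for o in outcomes)
    print(len(outcomes))    # 8 outcomes (2**3)
    print(dict(classes))    # {0: 1, 1: 3, 2: 3, 3: 1} -- N + 1 = 4 classes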

Exercise 3.1
Tree Diagrams
This exercise provides practice in generating tree diagrams and in determining
theoretical probabilities for different outcomes.
1. Generate a tree diagram for 4 tosses. Count the number of outcomes in
each outcome class. Have you generated the correct number of outcomes
and outcome classes? If this seems easy, generate a tree diagram for 5 or 6
tosses.
2. How many outcomes, and outcome classes, are there for 7 tosses? For 8
tosses? For 10 tosses? For 20 tosses?

The principles described in the preceding several paragraphs form the basis
for generating a binomial distribution, which is then used as the sampling
distribution for the sign test. If we know the number of subjects (N) in a study—
and if it is appropriate to assign each subject to one of two categories—then we
know that there are 2^N different outcomes, or 2^N different ways the study might
turn out. If according to our null hypothesis the two events for a single subject
are equally likely (i.e., if the null hypothesis states that the population parameter
for the probability of one of two events is .5—which means that the probability for
the other outcome must also be .5), it follows that each of the 2^N possible
outcomes for the study will be equally likely also. For example, if a coin is tossed
two times, then there are four possible outcomes (TT, TH, HT, HH), and if the
assumed probability for a head is .5, then the probability for any one of the four
outcomes is .25. In general, if the null hypothesis assumes that P = .5, and if
there are N trials (subjects, coin tosses, and so forth), then in order to compute
probabilities for the different outcome classes we need only know how many of
the 2^N outcomes fall in each outcome class.

FIG. 3.2. Relation between number of trials, number of outcomes, and
number of outcome classes.
For example, as described previously, with three subjects there are eight
possible outcomes. Three of the possible outcomes involve two heads (THH,
HTH, HHT); thus the probability of tossing two out of three heads is 3/8 or .375.
One of the possible outcomes involves three heads (HHH), thus the probability of
tossing all three heads is 1/8 or .125. Putting these together, four of the outcomes
involve two or more heads (THH, HTH, HHT, HHH) and so the probability of
getting at least two heads in three tosses is 4/8 or .5. But remember, these are all
theoretical probabilities based on the null hypothesis that the coin is fair.

Pascal's Triangle
For just five or six subjects, you could generate a list of all the simple outcomes
using a tree diagram and count the number of outcomes in each class, but with
larger numbers of subjects this becomes unwieldy. Fortunately, there is a simple
way to compute the number of outcomes in each outcome class. Named for its
discoverer, this device is called Pascal's triangle, and although its role in the
history of mathematics has been profound, our use of it here is quite simple.
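Before turning to the spreadsheet version, it may help to see the same "sum of the two neighbors" logic in a few lines of Python. This is our sketch, not a replacement for Exercise 3.2:

    # Build row N of Pascal's triangle from row N-1: each entry is the sum
    # of the two entries above it (with zeros padded at the ends).
    def pascal_row(n):
        row = [1]                    # row for n = 0
        for _ in range(n):
            row = [a + b for a, b in zip([0] + row, row + [0])]
        return row

    for n in range(1, 6):
        print(n, pascal_row(n))      # 1 [1, 1] ... 5 [1, 5, 10, 10, 5, 1]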

Exercise 3.2
Pascal's Triangle
This is the second exercise to require a spreadsheet program. The purpose is to
develop a spreadsheet template that generates values for Pascal's triangle. Due
to its triangular shape, this spreadsheet is unique, unlike any other in the book.
The formulas in the cells of the triangle are essentially alike. Instead of
laboriously and tediously entering similar formulas in different cells, you should
use this exercise to acquaint yourself with the use and utility of the spreadsheet
copy command (see chap. 1).

General Instructions
1. Let columns represent different numbers of trials. Use values beginning with
1 and continuing through 10. Label the columns appropriately.
2. Indicate the number of simple outcomes that would occur for each number of
trials.
3. Normally Pascal's triangle is represented with the top of the triangle pointing
up. In this case, lay the triangle on its side, letting the "top" point to the left.
The 1-trial column will have two entries (because there are two outcome
classes), the 2-trials column will have three entries (because there are three
outcome classes), and so forth. Entries in each column will be staggered,
which means that no cell will have an entry immediately to its left or right.
Instead, entries will be located on what are, in effect, diagonal lines.
4. Enter formulas that will generate the correct values for Pascal's triangle.
Remember that the value of a target entry is the sum of the two entries to the
left, the one diagonally above and the one diagonally below the target entry.
Once all formulas are correctly entered (which can be done by copying
formulas to the appropriate cells), enter the "seed value" of 1 in the apex or
top cell of the triangle. Observe the way the values for Pascal's triangle
ripple through your spreadsheet.
5. Sum the entries (number of simple outcomes per outcome class) in each
column of the triangle. The sum of the entries in each column should be
exactly the same as the number of simple outcomes you computed
previously in step 2. Why?
6. The numbers in the 10-trials column represent the number of outcomes per
outcome class (0 heads, 1 head, 2 heads, ..., 10 heads). Compute the
probability for each outcome class.
7. Sum the probabilities for these 11 outcome classes. This sum must be 1.
Why?
Detailed Instructions
1. We want to display columns A-L, so set the default column width to 5. Set
the width for column L to 8.
2. Label cell A1 "N" for number of trials. Label cell A2 "N + 1" for number of
classes. Label cell A3 (and A27) "2^N" for number of outcomes.
3. Put the formula "=B1+1" in cell C1. (This is the Excel convention. Other
spreadsheets may have different conventions for entering formulas.) This
causes C1 to display a number one larger than the number in the cell
immediately to its left. Copy this formula from cell C1 to cells D1 through K1.
Now put the value 1 in cell B1 and watch the results "ripple" through the first
row. Cells B1-K1 should now display the values 1, 2, 3, ..., 10.
4. Put the formula "=B1+1" in the cell B2. This causes B2 to display a number
one greater than the value contained in B1. Copy this formula from cell B2 to
cells C2 through K2.
5. Put the formula "=B3*2" in cell C3. This causes C3 to display a number
double the number in the cell immediately to its left. Copy this formula from
cell C3 to cells D3 through K3. Now "seed" cell B3 with the value 2. Cells
C3-K3 should now display the values 4, 8, 16, ..., 1024.
6. Put the formula "=A13+A15" in cell B14. This causes cell B14 to display the
sum of the cells diagonally up to the left and diagonally down to the left.
Copy this formula to cell B16. When formulas are copied, the cells pointed to
are relative to the cell copied. Therefore cell B16 will display the sum of A15
and A17. And because no values have yet been put in these A-column
cells, B14 and B16 will, for now at least, display zeros.
7. Copy the "sum of up and down left diagonal cells" formula to cells C13, C15,
C17 (omitting C14 and C16); to D12, D14, D16, D18 (omitting D13, D15,
D17); to E11 through E19 (omitting even rows); to F10 through F20 (omitting
odd rows); ...; to K5 through K25 (omitting even rows). At this point, your
spreadsheet should display a triangle of zeros. The triangle is lying on its
side. Its base spans cells K5 through K25 and its apex, cell A15, has yet to
be filled in.
8. Put the value 1 in cell A15. Observe the way the values for Pascal's triangle
ripple through your spreadsheet.
9. Put a formula for the sum of cells B5-B25 in cell B27. Copy this formula to
cells C27-L27. Rows 3 and 27 should now display the same values. Why?
10. In cells L5, L7, L9, ..., L25, put a formula for dividing the cell immediately to
the left by cell K27 (absolute). This gives the probabilities for each outcome
class when there are 10 trials. (You may want to format column L to have
four decimal places.) The value in cell L27 should be 1. Why?
At this point your spreadsheet should look like the one displayed in Fig. 3.3.
Each column provides you with the number of outcomes in each outcome class
for a particular number of trials. These are the binomial coefficients for that
number of trials. For this exercise, you stopped at 10 trials, but in principle you
could keep generating more of the triangle forever, simply by copying columns to
the right. In practice, of course, you would be limited by the maximum
dimensions allowed by your spreadsheet program. But in theory, you should now
understand how to generate the number of outcomes in each outcome class for a
binomial sampling distribution involving any number of trials.
For example, given five trials and a fair coin, your spreadsheet should
indicate there are 32 different outcomes and six outcome classes (column F in
Fig. 3.3). The binomial coefficients in this column (1, 5, 10, 10, 5, 1) represent the
number of outcomes in each class; thus there is 1 outcome that contains zero
heads, 5 that contain one head, 10 for two heads, 10 again for three heads, 5 for

N:     1   2   3   4   5   6    7    8    9    10
N+1:   2   3   4   5   6   7    8    9    10   11
2^N:   2   4   8   16  32  64   128  256  512  1024

Binomial coefficients (number of outcomes per outcome class, from 0 heads
to N heads), one row per number of trials:
N = 1:   1  1
N = 2:   1  2  1
N = 3:   1  3  3  1
N = 4:   1  4  6  4  1
N = 5:   1  5  10  10  5  1
N = 6:   1  6  15  20  15  6  1
N = 7:   1  7  21  35  35  21  7  1
N = 8:   1  8  28  56  70  56  28  8  1
N = 9:   1  9  36  84  126  126  84  36  9  1
N = 10:  1  10  45  120  210  252  210  120  45  10  1

Probabilities for the 11 outcome classes when N = 10 (coefficient/1024):
.0010  .0098  .0439  .1172  .2051  .2461  .2051  .1172  .0439  .0098  .0010
(The coefficients for each number of trials sum to 2^N; the 11 probabilities sum to 1.)
FIG. 3.3. Pascal's triangle: number of outcomes in each outcome class for 1
to 10 trials.
four heads, and 1 for all five heads. The probability, given a fair coin (P = .5), of
getting five heads is 1/32 (.03125), of getting four heads is 5/32 (.15625), of
getting four or more heads is 6/32 (.1875 or .15625 + .03125), and of getting
either no heads or all heads is 2/32 (.0625 or .03125 + .03125). This sampling
distribution for five trials is displayed graphically on the left side of Fig. 3.4.
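These tail probabilities are also easy to verify with Python's standard library (math.comb gives the binomial coefficients; the snippet is ours, offered only as a check):

    # Outcome-class probabilities for N = 5 tosses of a fair coin.
    from math import comb

    N = 5
    probs = [comb(N, k) / 2**N for k in range(N + 1)]
    print(probs)                # [0.03125, 0.15625, 0.3125, 0.3125, 0.15625, 0.03125]
    print(probs[4] + probs[5])  # four or more heads: 0.1875
    print(probs[0] + probs[5])  # no heads or all heads: 0.0625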

Exercise 3.3
Binomial Probabilities
The problems in this exercise refer to Fig. 3.3. They provide practice in
computing theoretical probabilities using the binomial.
1. The sampling distribution given a fair coin for six tosses is portrayed on the
right side of Fig. 3.4. (The coefficients are listed in column G of Fig. 3.3.)
What is the probability of getting all heads? Of getting either all heads or all
tails? Of getting five or more heads? Of getting either two or three or four
heads? Of getting five or more all the same (either five or more heads or five
or more tails)?
2. Draw a sampling distribution for eight tosses. What is the probability of eight
heads? Of either eight heads or eight tails? Of seven or more heads? Of
six or more heads? Of seven or more all the same? Of six or more all the
same?
3. Given N trials, there are N ways of getting just one head. Why does this
make sense?
4. If your spreadsheet program can produce graphs, use it to graph the
binomial sampling distributions for 8, 9, and 10 trials. You may want to
produce both bar and line graph versions. Which is more correct? Why?

FIG. 3.4. Binomial sampling distributions for five (left) and six (right) tosses, given a
fair coin.
Binomial Parameters
The binomial distribution can be thought of as a family of possible distributions,
only two of which are displayed in Fig. 3.4. The binomial is often specified with
three parameters: P is the probability of the first outcome, Q the probability of
the second outcome, and N is the number of trials. However, only two
parameters are necessary. Because Q = 1 - P necessarily (the probability of the
second outcome must be one minus the probability of the first outcome), a
particular instance of the binomial is specified completely by P and N. In other
words, two parameters are associated with the binomial distribution. Given
values for these parameters, a particular member of the family can be generated.
The examples given in this chapter have all assumed P = .5 because this is the
usual case. But imagine that P = .2 and N = 3. If according to our null hypothesis
the probability of a head is .2 (and thus the probability of a tail is .8), then how
likely are we to toss three heads, three tails, or any of the other possible
outcomes? The probabilities for each of the eight possible outcomes are:

P(TTT) = P(T) x P(T) x P(T) = .8 x .8 x .8 = .512
P(TTH) = P(T) x P(T) x P(H) = .8 x .8 x .2 = .128
P(THT) = P(T) x P(H) x P(T) = .8 x .2 x .8 = .128
P(THH) = P(T) x P(H) x P(H) = .8 x .2 x .2 = .032
P(HTT) = P(H) x P(T) x P(T) = .2 x .8 x .8 = .128
P(HTH) = P(H) x P(T) x P(H) = .2 x .8 x .2 = .032
P(HHT) = P(H) x P(H) x P(T) = .2 x .2 x .8 = .032
P(HHH) = P(H) x P(H) x P(H) = .2 x .2 x .2 = .008
Hence the probabilities for the four outcome classes are:

P(0 heads) = P(TTT) = 1 x .512 = .512
P(1 head) = P(TTH) + P(THT) + P(HTT) = 3 x .128 = .384
P(2 heads) = P(THH) + P(HTH) + P(HHT) = 3 x .032 = .096
P(3 heads) = P(HHH) = 1 x .008 = .008
Note that the probabilities for the eight outcomes and for the four outcome
classes add to one, as they must. Note also that you are far more likely to toss
three tails than three heads, assuming that the true probability of tossing one
head is .2. Given this one example, you should be able to generate binomial
distributions for various values of P and N (for additional examples and
formulas, see Loftus & Loftus, 1988). More to the point, you should have a sense
for how binomial distributions with various values for their P and N parameters
are generated in the first place.
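The general rule at work in these computations is P(k heads) = C(N, k) x P^k x Q^(N - k), where C(N, k) is the binomial coefficient. A minimal Python sketch (ours, using the P = .2, N = 3 example above) makes the rule concrete:

    # Binomial probability of k heads in n trials when P(head) = p.
    from math import comb

    def binom_prob(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print([round(binom_prob(k, 3, 0.2), 3) for k in range(4)])
    # [0.512, 0.384, 0.096, 0.008] -- the four outcome-class probabilities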

3.2 THE SIGN TEST


This simple and common statistical test is called the sign test because, as
commonly presented, it requires that a sign (usually a + or - sign, as in Fig. 2.1)
be assigned to subjects on the basis of some assessment procedure. It is based on
the binomial distribution; consequently, it is also called the binomial test. It is
useful for determining if there are more (or fewer) plus signs than would be
expected, if subjects were sampled from a population in which the probability of
assigning a + were some specified value, usually .5.
Consider the money-cure study. There were 10 subjects in the sample, and 8
improved. The null hypothesis purports that subjects were as likely to improve as
not (which is analogous to assuming that a coin is fair) and thus the values for the
relevant population parameters are P = .5 and N = 10. This would generate the
sampling distribution with binomial coefficients as shown in column K, and
probabilities as shown in column L, of the spreadsheet portrayed in Fig. 3.3.

One-Tailed and Two-Tailed Tests


For the sign test (and for certain other statistical tests too), it is important to
distinguish between one- and two-tailed tests. If a coin is fair, we would expect
the number of heads to be one-half the number of trials—a value in the center of
the distribution. Any extreme score for the test statistic—for example, one head
and nine tails or nine heads and one tail—would suggest that the coin might not
be fair. Note that extreme values fall at either the left or the right ends of the
distribution (see, e.g., Fig. 3.4), under one or the other of the distribution's two
"tails." If we reject the null hypothesis whenever the test statistic is either
considerably less or considerably more than expected, then the test is called
nondirectional or two-tailed. Symbolically:

H0: P = .5.
H1: P ≠ .5.
On the other hand, if we think that only more heads than expected, or more
clients improving than expected, is of interest, and so choose to reject the null
hypothesis only when values of the test statistic fall at one end of the distribution
but not the other, then the test is called directional or one-tailed. Symbolically:

H0: P ≤ .5.
H1: P > .5.
Note that null and alternative hypotheses are formulated somewhat differently
for one- and two-tailed tests.
For example, based on the values shown in Fig. 3.3 for N = 10 trials and
P = .5:
1. The probability of all heads is .001 (one-tailed) but the probability of
either all heads or all tails is .002 (two-tailed).
2. The probability of nine or more heads is .011 (one-tailed) but the
probability of nine or more heads or nine or more tails is .021
(two-tailed).
3. The probability of eight or more heads is .055 (one-tailed) but the
probability of eight or more heads or eight or more tails is .109 (two-
tailed).

These probabilities were computed using the or rule, which means that
individual probabilities were summed. Thus .001 + .010 + .044 = .055. If you try
to recreate these values on a hand-held calculator from those displayed in Fig.
3.3, and you round at intermediate steps, your answers may vary slightly from
those given here. Numeric results given here and throughout the text were
computed with a spreadsheet and intermediate results were not rounded. Only
the final scores are rounded and then usually three significant digits are retained.
If the alpha level were .05, if the test were two-tailed, and if the value of the
test statistic were 0, 1, 9, or 10, we would reject the null hypothesis. These
outcomes, which are termed critical values and constitute the region of rejection
or the critical region (see chap. 2), would occur 2.1% of the time or less by chance
alone. If the alpha level were again .05, but the test one-tailed, again we would
reject if the value of the test statistic were 9 or 10. (Arbitrarily, and just as a
convenient convention, if a test is one-tailed we assume values that might cause
rejection are in the right-hand tail unless explicitly stated otherwise.) These
outcomes would occur 1.1% of the time by chance alone, whereas values of 8 or
higher would occur 5.5% of the time—just a little too often for a value of 8 to be
included in the rejection region for an alpha level of .05, one-tailed. Although it
is true that the probability of tossing exactly 8 heads with a fair coin is .044,
regions of rejection are constructed by accumulating probabilities from the
extremities of the tails inward.
If the alpha level were .10, however, critical values for a one-tailed test would
be 8, 9, or 10. These outcomes would occur 5.5% of the time by chance alone.
Critical values for a two-tailed test would be 0, 1, 9, or 10. These outcomes would
occur 2.1% of the time, as noted earlier. The critical values would not include 2
and 8. The probability of tossing 8 or more heads or 8 or more tails is .109,
which is too high for values of 2 and 8 to be included in the critical region for an
alpha level of .10, two-tailed.
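The "accumulate from the tails inward" rule can also be expressed as a short procedure. The Python sketch below is ours; it assumes a fair coin (P = .5) and builds a one-tailed rejection region for given N and alpha:

    # Accumulate class probabilities from the upper tail inward; stop before
    # the cumulative probability would exceed alpha.
    from math import comb

    def upper_tail_region(n, alpha):
        region, total = [], 0.0
        for k in range(n, -1, -1):
            p = comb(n, k) / 2**n
            if total + p > alpha:
                break
            region.append(k)
            total += p
        return sorted(region), total

    print(upper_tail_region(10, 0.05))   # ([9, 10], 0.0107...)
    print(upper_tail_region(10, 0.10))   # ([8, 9, 10], 0.0546...)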
Finally we can answer the question posed at the beginning of chapter 2: Is 8
out of 10 clients improving worthy of attention? If our alpha level were .05, it
would not matter whether the test was one- or two-tailed—we would not reject
the null hypothesis. However, if our alpha level were .10, we would reject for a
one- but not for a two-tailed test. A test statistic as high as 8 (one-tailed) would
occur 5.5%, and a test statistic as extreme as 8 (two-tailed) would occur 10.9% of
the time if the null hypothesis were true. This is not strong enough evidence to
reject the null hypothesis at an alpha level of .05, one- or two-tailed, or at an
alpha level of .10, two-tailed. But a value of 8 falls in the critical region for an
alpha = .10, one-tailed sign test. But note, an alpha level of .10 is used in this
paragraph largely for illustrative purposes. In research reports published in
scientific journals, alpha levels are almost always .05 or less.

Exercise 3.4
Critical Values for the Sign Test
This exercise provides practice in determining critical values for the sign test.
1. Verify that all probabilities and percentages given in the previous several
paragraphs are correct (if you were not already doing this).
2. Using Pascal's triangle, compute the number of outcomes in each class,
along with their associated probabilities, for N = 11 and N = 12.
3. Construct a table of critical values for the sign test. (A critical value is one
that causes you to reject the null hypothesis.) Include nine rows, one each
for number of trials = 4 through 12. Include four columns, two for one-tailed
and two for two-tailed tests. Within each kind of test, include a column for
alpha = .05 and for alpha = .10. Indicate with an "x" combinations for which
no values exist that would justify rejecting the null hypothesis. For example,
the N = 10 row would look like this:
          One-tailed              Two-tailed
N       a = .05    a = .10     a = .05       a = .10
10      9-10       8-10        0-1, 9-10     0-1, 9-10
Using the Sign Test
The table you have just constructed (Exercise 3.4, part 3) works as long as there
are no more than 12 subjects in a study. But if the need arose, you know enough
so that, in principle at least, you could lengthen this table. More importantly, the
process of computing values for this table should have helped you gain insight
into how such tables of critical values for sampling distributions are constructed
in general. In practice, of course, investigators rarely compute critical values.
Instead, they consult tables constructed by others and indeed such a table is
provided here for the sign test (see Table A in the statistical tables section).
When contemplating a statistical test, the first step is to decide which test is
appropriate. The sign test is appropriate if a binomial (literally, bi = two, nomen
= name) code can be meaningfully assigned to each subject (e.g., assigning each
subject a + or a - based on some assessment procedure) and if the assessments
are made independently so that the outcome of one cannot affect the outcomes of
others. Assuming that a sign test is appropriate, the next steps require that you
commit to an alpha level, decide on an appropriate null and alternative
hypothesis (which will imply certain values for the relevant population
parameters), and (as a related matter) decide whether a one- or two-tailed test is
appropriate.
The final step is the statistical decision. From the appropriate table,
determine which values of the test statistic would occur rarely (their combined
probabilities cannot exceed your preselected alpha level) if your null hypothesis
were true. If the test statistic derived from your sample is one of those rare
values (i.e., if it falls in the critical region) then you reject the null and necessarily
accept the alternative hypothesis.
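For small N these steps can be carried out by hand with Table A, but the computation itself is simple enough to sketch in Python. The function below is our illustration; it assumes H0: P = .5 and the right-hand-tail convention used in this chapter:

    # Sign test p values for k successes in n trials under H0: P = .5.
    from math import comb

    def sign_test(k, n):
        def upper_tail(j):                        # P(X >= j)
            return sum(comb(n, i) for i in range(j, n + 1)) / 2**n
        one_tailed = upper_tail(k)                # assumes k is in the upper tail
        two_tailed = min(1.0, 2 * upper_tail(max(k, n - k)))
        return one_tailed, two_tailed

    print(sign_test(8, 10))   # (0.0546..., 0.1093...) -- the money-cure result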

Exercise 3.5
Applying the Sign Test
This exercise provides practice in applying the sign test. You will need to refer to
Table A in the Statistical Tables section.
1. An expanded study testing the efficacy of the money cure contains 40
subjects. By chance, you expect 20 to respond to treatment, but in fact 30
do respond. You decide beforehand that you will reject the null hypothesis
that treatment has no effect only if more subjects improved than expected
and only if the specific result would occur less than 5% of the time if the null
hypothesis were in fact true. Do you reject the null hypothesis? Why or why
not?
2. What if only 26 improved? Would you reject the null hypothesis? Why or
why not?
3. What if 30 improved, but you had decided beforehand to reject the null
hypothesis if the number of subjects was different from expected, either more
improving or more worsening than expected. Keeping the alpha level at .05,
would you reject the null hypothesis now? Why or why not? Given these
same circumstances, what would your decision be if 26 improved? Explain
your decision.
4. Given 50 subjects, an alpha level of .05, and a two-tailed test, what is the
smallest number of subjects who, even if they all improved, would not allow
you to reject the null hypothesis? What is the largest number of subjects
who, even if they all improved, would not allow you to reject the null
hypothesis?
5. Given 50 subjects, an alpha level of .05, but a one-tailed test, what is the
largest number who, even if they all improved, would not allow you to reject
the null hypothesis? What is the smallest number that would allow you to
reject the null hypothesis?
6. Now answer question 4, but for an alpha level of .01.
7. Now answer question 5, but for an alpha level of .01.

At this point you are now able to use a simple test, the sign test, and the logic
of hypothesis testing in order to answer research questions like the one posed at
the beginning of chapter 2: How many patients would need to get better before
we could claim statistical significance for the result? The next exercise will
instruct you in how to conduct a sign test using SPSS.

Exercise 3.6
The Sign Test in SPSS
This exercise provides practice in using the sign test in SPSS.
1. Invoke SPSS and create two new variables in the Variable View window.
Name the first variable "outcome" and the second variable "freq." Set the
number of decimal places for both variables to 0.
2. Provide appropriate value labels for the outcome variable. In row one, click on
the values cell and then click again on the grey box that appears to the right
of the cell. This will open the Value Labels dialog box. Enter "1" and
"improved" in the Value and Value Label windows, respectively. Then click
on Add to enter the information in the bottom window. Now do the same, but
this time enter "0" and "no improvement". After you enter labels for "0" and
"1", click on OK. Now, the labels improved and no improvement will appear in
the SPSS output, instead of the rather cryptic codes 1 and 0.
3. Weight the cases by the frequency variable. Select Data->Weight Cases
from the main menu. In the Weight Cases dialog box, select the Weight
cases by radio button and move the freq variable to the Frequency variable
window. Click on OK. This command tells SPSS to weight each value for
outcome (i.e., improved or no improvement) by the number indicated in the
freq variable. You could simply enter a 1 or 0 for each case in the outcome
column, but doing so would become tedious when N is large.
4. In Data View enter a 1 in the first row of the outcome column and a 0 in the
second row. In the freq column enter 30 in the first row and 10 in the second
row. Because you weighted the cases by the freq variable in step 3, this is
the same as entering 30 ones and 10 zeros in the outcome column.
5. Select Analyze->Nonparametric Tests->Binomial from the main menu.
Move the outcome variable to the Test Variable List box. Note that the Test
Proportion window reads .50 by default. This indicates that you expect 50%
of the outcomes to be positive. Click on OK.
6. Look at the output. Is the N correct? Look at the Observed Prop. column.
What proportion improved? What was the expected proportion based on the
Test Prop. column? Now look at the final column labeled Asymp. Sig (two-
tailed). This tells you the probability of finding 30 out of 40 positive outcomes
if you expected only 20. If the value is less than alpha, then you reject the
null hypothesis that the treatment has no effect.
7. Change the values in the freq column to reflect that 26 improved and 14 did
not. Would you reject the null hypothesis with an alpha of .05?
8. What if 30 improved, but you expected 80% to improve? Keeping the alpha
level at .05, would you reject the null hypothesis?
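If you also have Python with SciPy available (version 1.7 or later), you can check the two-tailed result from step 6 directly. This sketch is ours and is not part of the SPSS exercise:

    # Exact binomial test: 30 improved out of 40, test proportion .50.
    from scipy.stats import binomtest

    result = binomtest(30, n=40, p=0.5, alternative='two-sided')
    print(result.pvalue)   # a small p value (well below .05), so reject H0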

You are now able to conduct a simple sign test by hand and using SPSS. In
subsequent chapters other tests for statistical significance are described—those
based on the normal, t, and F distributions instead of the binomial—but first
some matters of basic terminology (chap. 4) and descriptive statistics (chap. 5)
need to be discussed.

Note 3.1
Critical   Values of a test statistic that would occur 5% of the time or
values     less (assuming an alpha level of .05) if the null hypothesis
           were true. Depending on the test statistic used and how the
           null hypothesis is stated, these values could fall only at the
           extreme end of one tail of a distribution or at the extreme
           ends of both tails. If a test statistic assumes a critical value,
           the null hypothesis is rejected. For that reason, critical values
           define the critical region or the region of rejection of a
           sampling distribution.
4 Measuring Variables: Some
Basic Vocabulary

In this chapter you will:


1. Learn the difference between qualitative and quantitative variables.
2. Learn the difference between nominal, ordinal, interval, and ratio scales.
3. Learn how to describe a study design in terms of independent and
dependent variables.
4. Begin to learn how to match study designs with statistical procedures.

In talking with others, whether describing your study to fellow researchers or attempting to gain statistical advice, it is important to use relatively standard,
clear vocabulary. This chapter describes some basic terms and concepts used to
talk about research studies. Some of these terms deal with the variables under
study and how they are measured, and others deal with how the variables are
related to each other in the study design.

4.1 SCALES OF MEASUREMENT

All studies, including simple studies like the money-cure study used as a running
example in chapters 2 and 3, incorporate one or more variables. The variables—
so named because, unlike constants, they can take on various values at different
times—identify the concepts the investigator thinks important for a given study.
The variable of interest for the money-cure study was outcome: Did patients get
better or not? The design included only one kind of treatment (and so this was
constant), but an expanded study might well have defined two different kinds or
levels of treatment (e.g., the money cure and a standard talk cure, or the money
cure and no treatment). In this case, two variables would be identified, type of
treatment and outcome, and the research question would involve the connection,
if any, between the two.
An important attribute of variables is their scale of measurement. Knowing
the kind of scale used to assign values to variables helps the investigator decide
which statistical procedures are appropriate. Following a scheme first introduced
by Stevens (1946), four kinds of scales are identified:

1. Nominal (or categorical).
2. Ordinal.
3. Interval.
4. Ratio.

Nominal or Categorical Scales


Nominal variables assume named values, that is, their permissible values are
qualitative instead of quantitative. And although cases can be assigned to
different categories or levels, there is no inherent reason to order the categories
in any particular way. Sex (female/male), religion (Catholic/Jew/Muslim/
Protestant/Other), and outcome (improved/did not improve) are examples.

Ordinal Scales
Values for ordinal variables are also named. In other words, measurement
results in assigning a particular name, level, or category to the unit (subject,
family, trial, etc.) under consideration. But unlike nominal variables, for ordinal
variables there is some rationale for ordering the categories in a particular way.
Examples are freshman/sophomore/junior/senior, and first/second/third, and
so forth. There is no reason, however, to assume that the distance between the
first and second categories is in any way equivalent to the distance between the
second and third pair, or any other adjacent pair, of categories. In sum, ordinal-
scale values represent degree, whereas nominal-scale values represent only kind.
Ordinal data can be regarded as quantitative (in a relatively loose sense), but
nominal data remain forever qualitative.

Interval and Ratio Scales


Like ordinal scales, both interval and ratio scales reflect an underlying and
ordered continuum, but unlike ordinal, these scales are characterized by equal
intervals. That is, an interval anywhere on the scale (e.g., the interval between 4
and 5) is assumed equivalent to any other interval (e.g., the one between 12 and
13). Interval and ratio scales differ in only one respect. For an interval scale the
placement of zero is arbitrary, whereas for a ratio scale zero indicates truly none
of the quantity measured. Thus if the ratio of two numbers is 3:1, the first value
represents three times more of the quantity measured than the second, but only for
ratio, not interval, data. Examples of interval scales are IQ, scores on most attitude and
personality tests, and temperature measured on the Fahrenheit or Celsius scales.
Examples of ratio scales are age, weight, and temperature measured on the
Kelvin scale (for which zero implies no molecular movement).
The scale of measurement can itself be viewed as a nominal (or, arguably,
ordinal) variable. It can assume any of four values and the value for a particular
case can be determined by asking, at most, three yes/no questions (later on, in
chap. 10, we will learn to say that three degrees of freedom are associated with
the four categories). These questions or distinctions, and their permissible
values, can be represented with a tree diagram (see Fig. 4.1).

FIG. 4.1. Tree diagram for the four types of measurement scales as
represented by three binary questions.

In practice, the distinction between nominal (qualitative) and other (quantitative) scales of measurement is more important than the distinctions
among ordinal, interval, and ratio scales. For most statistical procedures, the
difference between interval and ratio variables is not consequential. In addition,
under many circumstances it is often regarded as acceptable to apply statistical
procedures that were developed for equal-interval to ordinal data. However, it is
never acceptable to apply statistical procedures designed for equal-interval data
to any numbers arbitrarily associated with nominal categories (with the possible
exception of binary categorical variables, if one category can be regarded as
representing more of some attribute than the other).

Note 4.1
Nominal The levels of a nominal or categorical scale are category
names, like Catholic | Jew | Muslim | Other for religion or
male | female for sex. Their order is arbitrary; there is no
obviously correct way to order the names.
Ordinal The values or levels of an ordinal scale are named but also
have an obvious order, like first | second | third or
freshman | sophomore | junior | senior. However, there is no
obvious way to quantify the distance between levels or ranks.
Interval The intervals of an interval scale are equal, no matter where
they fall on the measurement continuum. The placement of
zero, however, is arbitrary, like zero degrees Fahrenheit.
Ratio The intervals of a ratio scale are also equal, but zero indicates
truly none of the quantity, like zero degrees Kelvin. Thus the
ratio of two numbers is meaningful for numbers measured on
a ratio but not an interval scale.

4.2 DESIGNING A STUDY: INDEPENDENT AND DEPENDENT VARIABLES

In addition to the distinction between qualitative and quantitative variables, a second distinction, likewise important because it helps the investigator decide
which statistical procedures are appropriate, is between independent and
dependent variables. Whether a variable is regarded as independent or
dependent depends on the thinking of the investigator. Thus, a particular
variable might be viewed as independent in one study but dependent in another.
The dependent variable (also called the criterion or response variable) identifies
the variable the investigator wants to account for or explain, whereas the
independent variable (also called the predictor or explanatory variable) identifies
the variable (or variables) that the investigator thinks may account for, or affect,
the different values the dependent variable assumes.
Strictly speaking, it is not always necessary that variables in a study be
segregated into these two classes. In some cases an investigator may be
interested only in the association among a set of variables and may not think of
any variables as prior in some way to others. More typically, however,
investigators are concerned with explanation. Reflecting this concern, the studies
used as examples in this book all contain variables that can be regarded either as
independent (i.e., logically prior and so independent of other variables named in
the study) or as dependent (i.e., their values presumably depend on values for the
independent variable or variables).
The distinction between independent and dependent variables is more
important than the actual words used to express it. We could just as well refer to
explanatory and response variables instead. These terms have the merit of being
more general but are less used (except when discussing log-linear analyses). Or,
we could refer to predictor and criterion variables, which are terms commonly
used in the multiple regression literature. Still, the terms independent variable
(IV) and dependent variable (DV) have the merit of being the most widely used—
and, as used by most writers, no longer have the once exclusively experimental
connotation.

Experimental and Observational Studies


In traditional usage the terms independent variable and dependent variable
have been reserved for true experimental studies. That is, values or levels
(usually called treatment conditions) for the independent variable would be
determined (manipulated) by the experimenter, subjects would be randomly
assigned to a particular level or condition, and then the subjects' scores for the
dependent variable would be measured.
As students in introductory statistics courses learn, experimental studies
support causal conclusions in a different way than merely correlational or
observational studies. In such studies, variation among variables is merely
observed; there is no manipulation of independent variables and no random
assignment of subjects to treatments. As a result, if some association is observed
between independent and dependent variables, there is always the suspicion that
their covariation reflects, not a causal link from one variable to the other, but the
influence of a third variable, unmeasured in the study, that affects them both. In
an experimental study, on the other hand, because values for any possibly
influential third variables should be distributed by random assignment equally
among the experimental conditions, the third variable alternative to a causal
interpretation is rendered unlikely.
Note 4.2
Dependent The variable a researcher wants to explain or account for is
called the dependent variable or DV. It is also called the
criterion variable or the response variable.
Independent Variables a researcher thinks account for or affect the values
of the dependent variable are called independent variables or
IVs. They are also called predictor variables or explanatory
variables.

In contemporary usage, the terms independent and dependent variable indicate not necessarily experimental manipulation, but simply how investigators
conceptualize relations among their variables. Such usage, however, should not
lull us into making stronger causal claims than our procedures justify. In
general, only experimental studies can justify strong causal claims.

4.3 MATCHING STUDY DESIGNS WITH STATISTICAL PROCEDURES

The qualitative versus quantitative and independent versus dependent distinctions are used when describing a study's design. The question "What is
the design of your study?" means: What variables do you think are important for
this study? Which do you regard as dependent, which independent? What is
their scale of measurement? Answering these questions, along with one or two
more involving how independent variables are related and whether or not
dependent variables involve repeated measurements, both specifies the design
and aids in selecting statistical procedures appropriate for that design. In the
following paragraphs, a number of common design considerations are described
and appropriate statistical procedures named. Almost all of the remaining
chapters in this book then describe how the more basic of these procedures are
put into practice.
For example, the two variables identified as important for the money-cure
study were treatment and outcome. Outcome is viewed as the dependent
variable, treatment as the independent variable. Both are nominal. There were
two values for outcome: improved or not. However, there was only one value for
the treatment—that is, all subjects received the same treatment. As we saw, for
such a design (a single group, a binary dependent variable) a sign test was
appropriate. However, this is a relatively weak design.
A stronger design would include a control or comparison group. For
example, the independent variable could be represented by two levels: the money
treatment and a standard talk treatment. This would result in a design involving
qualitative independent and dependent variables, both scored with more than
one level or category. Data for such designs are the counts associated with the
cells of the design—that is, the number of subjects who received the standard
treatment who improved, the number who did not, the number who received the
money treatment who improved, and the number who did not—resulting in a two
by two table of counts. These counts are typically analyzed with chi-square
statistics for two-dimensional tables (like the present example) and with log-
linear approaches (Bakeman & Robinson, 1994; Bishop, Fienberg, & Holland,
1975) when tables include two or more dimensions (each categorical variable
defines a dimension of the cross-classification or contingency table). As noted in
the preface, such analyses, which require computational tools not provided by
spreadsheets, are discussed in Bakeman & Robinson (1994).
Perhaps the most common design, insofar as there is one in the behavioral
sciences, involves a quantitative dependent variable and one or more qualitative
independent variables. Such data are analyzed with the analysis of variance
(ANOVA), a statistical approach that has received considerable attention and has
been much used in the behavioral sciences. If the independent variables were
quantitative, correlation or multiple regression would be used instead. And if
the independent variables were a mixture of qualitative and quantitative, again
multiple regression would be used. In fact, as noted in chapter 1, the analysis of
variance can be presented as one straightforward application of multiple
regression, although not all texts emphasize this simplifying fact.
For completeness, logistic regression and discriminant function analysis
should be mentioned because they are appropriate techniques to use when the
dependent variable is qualitative and the independent variable quantitative.
Multivariate analysis of variance (MANOVA) and multivariate regression should
also be mentioned in recognition of the fact that some designs involve multiple
dependent quantitative measures. These can be important techniques, but their
use is regarded as an advanced topic in statistics and hence beyond the scope of
this book. For a clear explication of these and other multivariate techniques, see
Tabachnick and Fidell (2001).
The preceding discussion (summarized in Fig. 4.2) has emphasized the scale
of measurement for the independent and dependent variables (and has
mentioned the number of dependent variables), but other distinctions are
important and are often made. For example, it is common to distinguish between
parametric and nonparametric tests. Parametric tests require assumptions
about the distributions underlying the data, whereas nonparametric tests do not
require such assumptions. In practice, nonparametric tests are typically applied
to categorical or ordinal data, and parametric tests, which are emphasized here,
to quantitative data. The traditional reference for a number of nonparametric
techniques is Siegel (1956).
Another distinction commonly made concerns the nature of the research
question. Many authorities (e.g., Tabachnick & Fidell, 2001) find it convenient to
distinguish questions that ask about relations among variables (for which
correlation and regression are appropriate), from questions that ask about the
significance of group differences (for which analysis of variance is appropriate),
from questions that attempt to predict group membership (for which
discriminant function analysis is appropriate), from questions that ask about the
underlying structure of the data (for which factor analysis is appropriate).
Explication of the techniques required for the first and second kinds of
questions constitutes the bulk of this book, but in practice—because of the
integrated multiple regression approach adopted here—the distinction between
the two is somewhat blurred. Technically, however, the assumptions required for
tests of association and tests of group differences are somewhat different (again
see Tabachnick & Fidell, 2001). As noted previously, the statistical techniques
required for the third and fourth kinds of questions constitute an advanced topic
in statistical analysis and are not covered here.

                          Dependent variable
Independent variables     Categorical              Quantitative
Single categorical        Chi-square               ANOVA, t test
Multiple categorical      Log-linear analysis      ANOVA
Single quantitative       Discriminant function    Simple correlation
Multiple quantitative     Discriminant function    Multiple regression
Categorical and
  quantitative            Logistic regression      Multiple regression
FIG. 4.2. Appropriate statistical techniques for studies with categorical
and quantitative dependent variables and various kinds and numbers of
independent variables. Techniques appropriate for analyzing
quantitative dependent variables are emphasized in this book.

One technique that usually appears in introductory texts but is not discussed
here in any detail is Student's t test. The t test is appropriate for designs with a
quantitative dependent variable and a single binary independent variable. Thus
it can be used to answer questions like, "Is the mean IQ at 3 years of age different
for preterm and full-term infants?" The t test was developed earlier than the
analysis of variance and is limited to two groups. The analysis of variance can
analyze for differences between two groups and more besides. In the interest of
economy (why learn two different tests when only one is needed?), we have
decided to present only the more general ANOVA technique for analyzing
problems that earlier would have been subjected to a t test. As Keppel and
Saufley (1980) noted:

The t test is a special case of the F test. ... If you were to conduct a t test
and an F test on the data from the same two-group experiment, you
would obtain exactly the same information. ... The two statistical tests
are algebraically equivalent, that is, t^2 = F. Due to this equivalency, we
decided not to develop the t test. ... The F test can be applied to almost
any situation in which the t test can be used, but it can also be applied in
situations where the t test cannot be used. (p. 109)
Do not misunderstand. We are not claiming that the t test is of no use and should
not be used. Moreover, in order to understand much of the literature, its
application needs to be understood. And pedagogically, there is much to be
gained by a thorough explication of the t test and its variants. In a longer volume,
it would have a place. Our intent in writing this book, however, was to present
basic statistics in as economical and efficient a manner as possible, which means
that concepts and techniques that have wide application (like multiple regression
and analysis of variance) are emphasized and other topics with more limited
application (like chi-square and the different kinds of t tests) are omitted.
The purpose of this chapter was first to introduce some basic vocabulary that
will be used in this book, and second, to introduce the names of several useful
statistical techniques with which you should be familiar, even though explication
of these techniques is beyond the scope of this book. It is important, of course, to
know how to apply particular statistical techniques, but it is even more important
to know which techniques should be applied in the first place—and in order to do
that, readers need to at least be aware of a wide range of possibilities.
The importance of matching statistical procedures to study designs has been
emphasized in the last several paragraphs. It is a key issue that will occupy us
throughout the remainder of this book. In the next chapter, however, we discuss
some simple and basic ways to describe data.
5 Describing a Sample:
Basic Descriptive Statistics

In this chapter you will:


1. Learn how to summarize a sample (a particular set of scores) with a
single number or statistic, the arithmetic mean.
2. Learn what it means to say that the mean is the best fit, in a least-squares
sense, to the scores in the sample.
3. Learn how to describe the variability in a sample and in a population
using the variance and the standard deviation.
4. Learn how to compare scores from different samples or populations
using standard scores (Z scores).
5. Learn how to identify scores so extreme that they might possibly be
measuring or copying mistakes.

If you never wanted to generalize beyond a sample, inferential statistical techniques like hypothesis testing would be unnecessary. But very likely you
would still want to describe what your sample was like. Basic descriptive
statistical techniques, like those described in this chapter, allow you to do exactly
that.
In the last chapter we distinguished among variables measured on nominal,
ordinal, interval, and ratio scales. Descriptive statistics for nominal data, as the
running example of the money-cure study demonstrates, are extremely simple,
consisting of little more than counts (or proportions), for example, the number
(or percentage) of people who improved. Descriptive statistics for ordinal data
(unless we think it justified to regard differences between ranks as equal) are no
more complex. Again, we simply report the numbers and proportions of subjects
who fall in the different categories. Describing interval or ratio variables offers
more possibilities. In this chapter we discuss the mean, the variance, and the
standard deviation, which are basic descriptive statistics appropriate for
summarizing quantitative data.

5.1 THE MEAN

For economy of presentation, if for no other reason, it is often useful to characterize an entire set of scores with just one typical score. There are several
different ways to compute this typical, or average, score. For example, the mode,
the median, the arithmetic mean, and the geometric mean all involve different
computations and often result in different values. Each of these typical scores
has its uses, as students in introductory statistics courses learn, but by far the
most commonly used average score in statistical work is the arithmetic mean. In
fact, whenever the word mean is used in this book, it is safe to assume that it
means the arithmetic mean.

The Mean Defined


Almost certainly you learned how to compute the arithmetic mean long ago. The
rule is (a) add the scores together and (b) divide their sum by the number of
scores. Symbolically (using Σ, the Greek upper case sigma, to indicate
summation, Y to represent a raw score, and N to indicate the number of scores),
the mean for the Y scores is

    mean of Y = ΣY / N

We could leave matters there, but there is something to be learned if the definition of the mean is probed a bit more. You may find this probing rather
overwrought for something so basic, but our purpose is to introduce an important
concept (the method of least squares) within the simplest context possible. In
developing this definition of the mean as a best-fit statistic in a least-squares
sense, we make use of data derived from a simple (imaginary) study, which is
described next and is used as an example in this and subsequent chapters.

An Example: The Lie Detection Study


Imagine that we invite 10 subjects into our laboratory for a polygraph
examination. We provide them with the script of the questions the examiner will
ask and we instruct them to lie when answering certain of those questions.
Different subjects provide false answers to different questions, but all subjects tell
the same number of lies. Then we ask the examiner to identify untruthful
responses. Thus the dependent variable (DV) for this study is the number of lies
detected by the examiner. The data for this study are given in Fig. 5.1 and, as you
can see, the number of lies detected ranged from two to nine. The next exercise
shows you how to manipulate these data using a spreadsheet. Although this
exercise uses the data from the lie detection study, it could easily be adapted for
use with data from other studies as well.
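Before building the spreadsheet, you may find it useful to see the same computations sketched in Python. This is our illustration only, using the Fig. 5.1 scores and the arbitrary predicted score of 5 that the exercise specifies:

    # Raw scores Y, constant predicted score Y', deviations y = Y - Y',
    # and the sum of squared deviations (SSy).
    Y = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]   # lies detected, Fig. 5.1
    Y_pred = 5                           # predicted score, as in Exercise 5.1
    dev = [y - Y_pred for y in Y]
    SSy = sum(d * d for d in dev)
    mean = sum(Y) / len(Y)
    print(mean, SSy)                     # 5.3 41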

Exercise 5.1
The Mean
The spreadsheet template developed during this exercise computes basic
descriptive statistics for the lie detection study. Moreover, it will form the basis
for subsequent exercises. Give some attention to the notation used and defined
in this exercise. It is used throughout the book.
General Instructions
1. Set up and label five columns. Each row represents a subject and the first
column contains that subject's number (1-10).
2. The second column contains values for the dependent variable, which is the
number of lies detected for each subject as given in Fig. 5.1. Label this Y,
which represents a generic DV.
3. The third column contains predicted scores. Label this Y' (read as "Y-prime")
and point each cell in this column to a cell that contains 5 (use an absolute
address). You could enter the value 5 (which is an arbitrary value selected
for this exercise) into the cells in this column directly, but if you point the cells
to another, single cell you have the option of later changing the predicted
values for all subjects by just changing one cell.
4. In the fourth column, enter a formula for the differences between the raw and
the predicted scores. This difference score is symbolized with a lower case
y, and is often called a deviation or residual score. Label this column Y-Y'.
5. In the fifth column, enter a formula for the square of the difference scores.
Label it y*y (where y = Y-Y').
6. On separate rows, enter formulas that compute the sum, the count, and the
mean for the raw, the predicted, the difference, and squared difference
scores. You may use Fig. 5.2 as a guide if you wish.
Detailed Instructions
1. In row 2, label the columns as follows:
Label Column Meaning
s A The subject number or index. In this case it ranges from 1 to
10. In general, the subject could be an individual, a dyad, a
family, or any other appropriate unit of analysis.
Y B The number of lies detected, actually observed. In general,
Y indicates an observed score. For the present example,
seven lies were detected for the fifth subject.
Y' C The number of lies detected, predicted by you (read as "Y-prime"). Usually the basis for prediction will be clear from
context.

Subject No. lies detected
1 3
2 2
3 4
4 6
5 6
6 4
7 5
8 7
9 7
10 9
FIG. 5.1. Data for the lie detection study.
Y-Y' D The difference between observed and predicted. This is the
amount left when the predicted score is subtracted from the
observed score. This difference score is symbolized with a
lower case y, and is often called a deviation or residual
score.
y*y E The square of this difference, that is,
Y-Y' multiplied by Y-Y'. The sum of the entries in this
column is called a sum of squares (symbolized as SS)
because the difference scores are squared and then
summed.
2. To remind ourselves that the sum of the column E entries is the sum of the Y
deviation scores squared, enter the label "SSy" in cell E1. In addition, enter
the labels "Lies" in cell B1 and "y=" in cell D1.
3. In column A, label rows 13-16 as follows:
Label Row Meaning
Sum= 13 The sum of the scores in rows 3-12.
N= 14 The number of scores in rows 3-12.
Mean= 15 The mean of those scores.
a= 16 The predicted score.
4. Enter the values 1 through 10 in column A, rows 3-12.
5. Enter the observed scores in column B, rows 3-12.
6. Put a value in cell B16. For now, make that value 5.
7. Point the cells in the Y' column (cells C3-C12) to the predicted cell (B16),
that is, put the address B16 in C3-C12. Cells C3-C12 and B16 should now
all display the same value.
8. Enter a formula for subtracting Y' (the column C entry) from Y (the column B
entry) in cells D3-D12. These are the deviation scores or the residuals.
9. Enter a formula for multiplying Y-Y' (the column D entry) by itself in cells E3-
E12. These are the deviations squared.
10. In cells B13-E13, enter a function that sums the entries in rows 3-12.
11. In cells B14 and C14, enter a function that counts the entries in rows 3-12.
12. In cells B15 and C15, enter a formula that computes the mean (the sum
divided by N) for that column. At this point, your spreadsheet should look like
the one portrayed in Fig. 5.2.

Predicting the Mean


At this point, your spreadsheet should look like the one portrayed in Fig. 5.2. It
shows the raw data, the arithmetic mean for those data, and the sum of the
squared deviation scores, where a deviation consists of the difference between the
raw score and the value 5. You probably wonder why we chose 5 as the predicted value in Exercise 5.1. We took a quick look at the numbers and noted they ranged from 2 to 9. Then hurriedly and somewhat
arbitrarily, we decided to guess that a typical value for this set of 10 numbers was
5. This was only a first guess, selected for purposes of exposition; we will revise it
shortly.
Actually, we had in mind a formal model for these scores:

Yi = α + εi (5.2)

This particular model states that the score for the ith subject in the population, Yi, consists of some population parameter, represented with α (lower case Greek alpha), plus an error component, εi (lower case Greek epsilon), specific to that subject. This is a very simple model. It suggests that the value of a score is
determined by a single parameter and nothing more. Not all scores will have
exactly the same value as the parameter, which is what the error component
signals, but if asked to predict the value for any score, our prediction would
always be α, the value of the parameter. If Y' (read Y-prime, for Y-predicted)
represents the predicted value for Y, and if a represents the estimated value for
the population parameter a, then the prediction equation associated with this
model is:

Yi' = a (5.3)
For Exercise 5.1, we guessed that a was 5; thus the specific prediction equation became Yi' = 5 (or, omitting the subscript, simply Y' = 5). For this prediction
model, characteristics of individual subjects are not considered. The same
prediction is applied to all, which is why all the predicted values in column C (see
Fig. 5.2) are the same.
In general, a model should indicate concisely how we think a particular
dependent or criterion variable is generated. In that sense, it is an idealization of
some presumed underlying reality, and presumably it reflects the mechanism
that produces the criterion variable. Usually, a model contains one or more
population parameters (like the a in Equation 5.2). In addition, and unlike
Equation 5.2, it often contains one or more independent or predictor variables as
well and so indicates how we think those predictor variables are related to the
criterion variable under consideration.

A B C D E
1 Lies y= SSy
2 s Y Y' Y-Y' y*y
3 1 3 5 -2 4
4 2 2 5 -3 9
5 3 4 5 -1 1
6 4 6 5 1 1
7 5 6 5 1 1
8 6 4 5 -1 1
9 7 5 5 0 0
10 8 7 5 2 4
11 9 7 5 2 4
12 10 9 5 4 16
13 Sum= 53 50 3 41
14 N= 10 10
15 Mean= 5.3 5
16 a= 5
17

FIG. 5.2. Spreadsheet for the lie detection study using Y' = 5 as the prediction equation.
The Equation 5.2 model makes sense because we have not yet really thought
about the lie detection study. We have not yet considered how characteristics of
individual subjects or the circumstances surrounding their testing sessions might
have affected their responses. In other words, we have yet to identify research
factors (independent variables) that we want to investigate. Thus, this simple
model represents a starting point. Later on we will ask whether other, more
complex models allow us to make better predictions than the initial model posed
here, and it will prove useful then to have this simple model as a basis for
comparison.
But even if we provisionally accept this simple model and its associated
prediction equation, have we found the best estimate for the parameter? The
value 5 was after all a rather offhand guess. Can we do better? Is there a best
value? The answer is yes, if we first define what we mean by best. A common
criterion for best, one that is often used in statistics, is the least-squares
criterion. According to this guide, whatever value provides us with the lowest (or
least) possible value for the sum of the squared error scores (or squares) is best.
The Equation 5.2 model, rewritten using estimates instead of parameters, is:

Yi = a + ei

Substituting Yi' for a (because Yi' = a, Equation 5.3), this can be rewritten as:

Yi = Yi' + ei

Rearranging terms, this becomes:

ei = Yi - Yi'

In other words, the error score for each subject is the difference between that
subject's actual score and the predicted score for that subject. In the last exercise
the error scores (also called deviation scores or residuals because they indicate
what is left after the predicted is subtracted from the observed) were symbolized
with a lower case y (omitting subscripts, y = Y - Y'). When we used 5 as our
predicted value, the sum of the squared error scores, which is the quantity we
wish to minimize, was 41 (see Fig. 5.2). If no other value yields a sum of squares
(as the sum of the squared residuals is usually called) lower than this, then we
would regard 5 as the best estimate for the parameter, in the least-squares sense.
As you have probably already guessed, 5 is not the best estimate. For
pedagogic purposes, we deliberately chose a number that was close but not best.
The best estimate, based on our sample of 10 scores, is 5.3, which is the mean
value for these scores. Statisticians have proved to their satisfaction that the
mean provides the best-fit, in the least-squares sense, to a set of numbers. When
the mean is used to predict scores (when Yi' = M), the sum of the squared
deviations between observed and predicted scores will assume its lowest possible
value. This is easy to demonstrate, as the next exercise shows.

Exercise 5.2
The Sum of Squares
The purpose of this exercise is to demonstrate that the minimum value for the
sum of the squared deviations is obtained when the mean is used to form
deviation scores.
1. Using the spreadsheet from the last exercise, enter different values in the cell
that now contains 5, the predicted value or the value for the parameter a.
For example, enter 3.5, 4, 4.5, 5, 5.5, 6, and so forth, and note the different
values that are computed for the sum of squares, that is, the sum of the
squared deviation scores. You should observe that the minimum value is
achieved only when the mean (in this case, the value 5.3) is used.
2. Verify that using the mean indeed yields the smallest value for the sum of
squares by entering values like 5.31, 5.32, 5.29, 5.28, and so forth.

When the value 5.3 is used to form deviation scores, the sum of squares
(usually abbreviated SS) should be 40.1, which is its lowest possible value for
these data, and your spreadsheet should look like the one given in Fig. 5.3. The
principle of least squares—which, once presented, may seem like an obvious way
to minimize errors of prediction—was first articulated in 1805 by a French
mathematician, Adrien Marie Legendre. He wrote, "We see, therefore, that the
method of least squares reveals, in a manner of speaking, the center around
which the results of observations arrange themselves, so that the deviations from
that center are as small as possible" (cited in Stigler, 1986). The presentation
here may have seemed like a long and involved way to present something as
seemingly simple as the arithmetic mean, but consider what you have learned.
On the practical side, you have had practice setting up and using a spreadsheet
for a simple study. You have also learned to view the mean as the value that best
represents or fits a set of numbers, using the least-squares criterion. This
concept will serve you well later when the least-squares criterion is used in more
interesting and more complex contexts. And, as you will discover in the next two
sections, the groundwork required for computing the variance and the standard
deviation is already insinuated in the spreadsheet you have just developed.
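
The same search can be scripted. The following minimal Python sketch evaluates the sum of squares for several candidate predictions; notice that the minimum, 40.1, occurs at the mean, 5.3:

    Y = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]  # lie detection data, Fig. 5.1

    def sum_of_squares(y_prime):
        # SS for a constant prediction: sum of (Y - Y')^2 over subjects
        return sum((y - y_prime) ** 2 for y in Y)

    for guess in (5.0, 5.2, 5.3, 5.4, 6.0):
        print(guess, round(sum_of_squares(guess), 2))
    # prints 41.0, 40.2, 40.1, 40.2, and 45.0, respectively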

A B C D E
1 Lies y= SSy
2 s Y Y' Y-Y' y*y
3 1 3 5.3 -2.3 5.29
4 2 2 5.3 -3.3 10.89
5 3 4 5.3 -1.3 1.69
6 4 6 5.3 0.7 0.49
7 5 6 5.3 0.7 0.49
8 6 4 5.3 -1.3 1.69
9 7 5 5.3 -0.3 0.09
10 8 7 5.3 1.7 2.89
11 9 7 5.3 1.7 2.89
12 10 9 5.3 3.7 13.69
13 Sum= 53 53 0 40.1
14 N= 10 10
15 Mean= 5.3 5.3
16 a= 5.3
17
FIG. 5.3. Spreadsheet for the lie detection study using Y' = 5.3 for the
prediction equation.

Note 5.1
Y An upper case Y is used to indicate a generic raw score,
usually for the dependent variable. Thus Y represents any
score in the set generally, and Yi indicates the score for the ith
subject.
Y' Y' (read Y-prime) indicates a predicted score (think of
"prime" as standing for "predicted"). This is relatively
standard regression notation, although sometimes a Y with a
circumflex or "hat" (i.e., a ^) over it is used instead, Ŷ.
Usually the basis for prediction will be clear from the context.
Individual predicted scores are symbolized Yi', but often the
subscript is omitted.
Y- Y' The difference or deviation between an observed score and a
predicted score is called a residual or error score. Again the
subscripts are often omitted. Often lower case letters are
used to represent residuals. For example, in Fig. 5.3 a lower
case y represents the deviation between a raw Y score and the
mean for the Y scores.
SS The sum of squares is formed by first squaring each residual, and then summing the resulting squares; thus it is the sum of the squared residuals. In other words, SS = Σ(Yi - Yi')², where i = 1, N.

5.2 THE VARIANCE

The mean is a point estimate because it identifies one typical score, whereas
variability indicates range, which is an area within which scores typically fall.
Consider the kind of practical problem that first caused mathematicians to
ponder variability of scores. Imagine you are a ship's navigator at sea attempting
to read the angle of the North Star above the horizon from a moving deck. In a
desperate attempt at accuracy, you take successive readings, but each one is
somewhat different from the others. Still, you reason, the best estimate you can
make for the true value is the mean of your various measurements. You know
there is only one true value, so you regard any deviation between that value and
one of your measurements as error.
Under these circumstances, variability might not be your major concern,
unless, of course, you wanted to compare the variability of your measurements
with someone else's. But now imagine you are concerned with various
anthropometric measurements, for example, the heights of military recruits who
come from different geographic areas. In this case, you might well find it
interesting to see if the height of Scotsmen, for example, is more variable than
that of Japanese. Or, given the lie detection study described earlier, you might
want to know if the number of lies detected for the first five subjects (who might
have been treated differently in some way) was more variable than the number
for the last five.
No matter whether scores are several measurements of the same subject, as
for the navigation example, or single measurements of different subjects, as for
the anthropometric or the lie detection study example, it can be useful to quantify
the amount of variability in a set of scores. One common measure of variability is
the variance. The variance for a sample of scores is simply the mean squared
deviation, that is, the average of the squared residuals. In other words, to
compute the variance for a sample of scores:

1. Subtract the mean from each score.
2. Square each difference.
3. Sum the squared deviation scores.
4. Divide that sum by the number of scores.

Symbolically (letting My represent the mean for the Y scores), the variance for the Y scores is:

VARy = Σ(Yi - My)² / N (5.4)

Most of the necessary computations are already contained in the spreadsheet shown in Fig. 5.3.
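
The same four steps translate directly into a few lines of Python; a minimal sketch using the lie detection data:

    Y = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
    N = len(Y)
    mean = sum(Y) / N                                  # 5.3
    squared_deviations = [(y - mean) ** 2 for y in Y]  # steps 1 and 2
    variance = sum(squared_deviations) / N             # steps 3 and 4
    print(round(variance, 2))                          # prints 4.01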

Exercise 5.3
The Sample Variance
The purpose of this exercise is to modify the current template so that, in addition
to the mean, it computes the variance for the sample data as well.

General Instructions
1. Make sure that the predicted score is always the mean. In other words, point
the cell that contains the predicted score to the cell that contains the mean of
the raw scores. This allows you to change the raw scores, and hence the
mean, ensuring that deviation scores will remain deviations from the mean.
2. Add a formula for the sample variance to the current template. Provide an
appropriate label.
3. Selectively change the values for the raw scores and note how these
changes affect the value of the variance. First, change the raw scores so
that they are all quite similar to each other. Then, make just one value very
different from the others (e.g., most scores might be 3s and 4s whereas one
score might be 94). Finally, select values that are somewhat different, and
then a lot different, from each other. What do you observe?
Detailed Instructions
1. Put the label "VAR=" for variance in cell D15.
2. Enter 10 (or a counting function) in cell E14. In cell E15 enter a formula for
dividing the sum of the squared deviation scores (cell E13) by the number of
scores (cell E14). This is the variance.
3. Ensure that the predicted value will always be the mean of the raw scores,
even if you change the raw scores (i.e., point cell B16 to cell B15).
4. The more scores are spread out around the mean, the greater the variance.
This is easy to demonstrate. First replace the most extreme numbers of lies
with numbers nearer the mean (in column B). For example, replace the 2
with a 4, the 7s with 6s, and the 9 with a 7. What is the variance for this new
set of scores? How does it compare with the original variance of 4.01?
5. Next undo the changes you just made (restore the original raw scores) and
replace numbers near the mean with more extreme scores. For example,
replace the 4 with a 1, the 5 with a 2, and the two 7s with 9s. What is the
variance now? Why should it be larger than the initial 4.01?

Sample and Population Variance


For the last exercise you computed a quantity called the sample variance.
According to Equation 5.4, this is the sum of squares divided by N. There is a
second definition for the variance, which divides the sum of squares not by N, but
by N - 1 instead. Symbolically:

Many writers call this the population variance. It is an estimate of the true value
of the population variance, based on sample data, and is symbolized here as VAR'
to distinguish it from VAR (the sample statistic). As a general rule, in this book
addition of the apostrophe or prime indicates an estimate; many texts use the
circumflex (i.e., the ^ or hat) over the symbol for the same purpose.
Generations of students have been perplexed at first by the difference
between the sample and population variance (and the sample and population
standard deviation), which is not surprising once you realize that different
writers define matters differently. One question concerns the N -1 divisor: Why
is it N - 1 instead of N? For now accept that one of the divisors should be N -1
and wait until chapter 10, where we offer an intuitive explanation for the N -1. A
second question concerns which divisor is used for the sample, which for the
population variance: For example, should N or N-1 be used for the sample
variance? Here we have defined the sample variance as the one with the N
divisor and the population variance as the one with the N -1 divisor, but some
texts and programs do just the opposite. The VAR function in Excel, for example,
returns the sums of squares divided by N -1, whereas the VARP function (P for
population) returns the SS divided by N.
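
If you work in Python rather than Excel, NumPy rolls both definitions into one function: its var function takes a ddof ("delta degrees of freedom") argument that is subtracted from the divisor. A brief sketch, assuming NumPy is available:

    import numpy as np

    Y = np.array([3, 2, 4, 6, 6, 4, 5, 7, 7, 9])
    print(round(np.var(Y, ddof=0), 3))  # SS / N     -> 4.01  (VAR, like Excel's VARP)
    print(round(np.var(Y, ddof=1), 3))  # SS / (N-1) -> 4.456 (VAR', like Excel's VAR)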
On one point there is no confusion: If one wants to describe the variance for
a group, and if scores are available for all members of the group, then one divides
the sum of squares by N. Some writers, usually those of a more empirical bent,
think primarily in terms of samples, assume they can never know all the scores in
the population, identify the group with a sample, and therefore call the SS
divided by N the sample variance. If they divide the same SS by N -1, they call it
the population variance, presumably because it is an estimate of population
variance based on sample data. Other writers, usually those of a more formal or
mathematical bent, think primarily in terms of populations, assume they can
know all scores in the population, identify the group with the population, and
therefore call the SS divided by N the population variance. If for some reason
they do not have access to all scores in the population, they divide the SS by N -1
and call this the sample variance, presumably because it is an estimate of the
population variance based on sample data.
In this book, reflecting our preoccupation with empirical data analysis as
opposed to formal mathematics, we side with the empiricists and call the SS
divided by N the sample variance, and the SS divided by N -1 the population
variance, or more correctly, the estimated population variance. But once you are
aware of this distinction, and understand why some writers define matters one
way whereas other writers reverse the definition, you should be able to
understand different writers and use different computer routines correctly.

5.3 THE STANDARD DEVIATION

A second useful measure of variability used in statistics is the standard deviation. To compute the standard deviation for a sample of scores, first compute (a) the variance and then compute (b) its square root. Symbolically, the sample standard deviation for the Y scores is:

SDy = √VARy (5.6)

The variance is an average sum of squares, so the units for variance are the units
for the initial scores squared. For example, if the initial scores were feet, then
variance would be measured in square feet and analogous to a measurement of
area. Taking the square root of the variance, which is how the standard
deviation is computed, means the units for the standard deviation are again the
same as those used initially. Thus the standard deviation is like an average error,
expressed in the same units as those used for the raw scores. The larger the
deviations from the mean are (i.e., the more scores are spread out instead of
clustering near the mean), the larger is the standard deviation.
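
In code, the standard deviation is a one-line extension of the variance computation. A minimal Python sketch:

    import math

    Y = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
    N = len(Y)
    mean = sum(Y) / N
    variance = sum((y - mean) ** 2 for y in Y) / N  # 4.01, in squared units
    sd = math.sqrt(variance)                        # back to the original units
    print(round(sd, 4))                             # prints 2.0025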
You may wonder why the square root of the variance has come to be called
the standard deviation, and why it is used to represent an average error. Certainly
the average of the absolute values of the deviation scores (column D in the Fig.
5.3 spreadsheet) would be a logical candidate for a measure of the typical
deviation or error. The reason the standard deviation has become standard has
to do with its technical statistical properties. These properties are not shared by
the mean absolute deviation. For now, and until you read further, accept this on
faith but know that there are reasons for this that experts find acceptable.

Sample and Population Standard Deviation


Equation 5.6 defines the sample standard deviation. The formula for the
population standard deviation, estimated from sample data, is:

SD' = √VAR' (5.7)

Again, as with the sample and population variance, you should be aware that
some writers (and spreadsheets) reverse the definitions used here.

Exercise 5.4
The Sample Standard Deviation
The purpose of this exercise is to modify the current template so that, in addition
to the mean and the sample variance, it computes the sample standard deviation
as well.
General Instructions
1. Add a formula for the sample standard deviation to the current template.
Provide an appropriate label.
2. Again, selectively change the values for the raw scores and note how these
changes affect the value of the standard deviation. What do you observe?
3. Finally, restore the raw scores to their initial values as given in Fig. 5.1.
Detailed Instructions
1. Put the label "SD=" for standard deviation in cell D16. Then in cell E16, enter
a formula for the square root of the variance (cell E15). This is the standard
deviation.
2. Now change the raw data (cells B3-B12) and note the effect on the standard
deviation (cell E16). First try the two alternative sets given in Exercise 5.3.
Recall that one set was less spread out, and one more, than the initial data.
Then try some unusually distributed data sets. For example, make one score
24, one 35, and all the rest 2s and 3s. Then make half the scores 3s and the
other half 12s. What do you observe?
3. Finally, restore the original data as given in Fig. 5.1.

Note 5.2
M The sample mean, which is the sum of the scores divided by
N. The mean for the Y scores is often symbolized as Y with a
bar above it (read Y-bar), the mean for the X scores as X with
a bar above, and so forth. In this book, to avoid symbols
difficult to portray in spreadsheets, My represents the mean
of the Y scores, Mx the mean of the X scores, and so forth.
VAR The sample variance, which is the sum of squares divided by
N. Often it is symbolized S² or s², which makes sense because
variance is a squared measure. If not clear from context,
VARy indicates the variance for the Y scores, VARx the
variance for the X scores, and so forth.
VAR' The population variance as estimated from sample data,
which is the sum of squares divided by N -1. Often it is
symbolized with a circumflex (a ^ or hat) above s².
SD The sample standard deviation, which is the square root of
the sample variance. Often it is symbolized S or s. If not
clear from context, SDy indicates the standard deviation for
the Y scores, SDx the standard deviation for the X scores, and
so forth.
SD' The population standard deviation as estimated from sample
data, which is the square root of the estimated population
variance. Often it is symbolized with a circumflex (a ^ or hat)
above s.
' (prime) In this book, a prime or apostrophe after a symbol indicates
an estimate, for example, VAR' and SD'. Some texts use the
circumflex (the ^ or hat) for this purpose.
After you have restored the raw data to the initial values, your spreadsheet
should look like the one given in Fig. 5.4. There are two additional points
concerning this spreadsheet. First, because Y, the predicted value, is the mean of
the raw scores, the mean of the predicted scores will of course be the same as the
raw score mean. Second, given that the predicted score is the mean, the sum of
the deviation scores must be zero. This follows from the way the mean is defined.
However, due to rounding errors, entries that should sum exactly to zero, like the
sum of the deviation scores, may sum to an extremely small number instead, like
4E-16, which in scientific notation means 4 divided by a 1 followed by 16 zeros.
Such values may appear if a general (G) format is used for a cell. If you specify a
format with, for example, two decimal places, then the value ".00" will be
displayed instead.
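
The same floating-point residue appears outside spreadsheets. In this minimal Python sketch, the deviations should sum to exactly zero but typically leave a vanishingly small remainder, which disappears once the value is formatted to two decimal places:

    Y = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
    mean = sum(Y) / len(Y)         # 5.3 is not exactly representable in binary
    total = sum(y - mean for y in Y)
    print(total)                   # a tiny value such as 8.88e-16, not exactly 0
    print(f"{total:.2f}")          # displays as 0.00 (or -0.00)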

Exercise 5.5
SPSS Descriptive Statistics
The purpose of this exercise is to familiarize you with the Descriptives command
in SPSS.
1. Invoke SPSS. Create two variables, s and Y. Give the Y variable the label
"Lies". Enter the data from the lie detection study. You could do this by hand,
or you could cut and paste from your spreadsheet as displayed in Fig. 5.4.
2. Select Analyze-> Descriptive Statistics-> Descriptives from the main
menu. In the Descriptives window, move the Lies variable to the right-hand
window. Click on the Options button, check the box next to Variance. Click
Continue and then OK.

A B C D E
1 Lies y= SSy
2 s Y Y' Y-Y' y*y
3 1 3 5.3 -2.3 5.29
4 2 2 5.3 -3.3 10.89
5 3 4 5.3 -1.3 1.69
6 4 6 5.3 0.7 0.49
7 5 6 5.3 0.7 0.49
8 6 4 5.3 -1.3 1.69
9 7 5 5.3 -0.3 0.09
10 8 7 5.3 1.7 2.89
11 9 7 5.3 1.7 2.89
12 10 9 5.3 3.7 13.69
13 Sum= 53 53 0 40.1
14 N= 10 10 10
15 Mean= 5.3 5.3 VAR= 4.01
16 a= 5.3 SD= 2.0025
17
FIG. 5.4. Spreadsheet for the lie detection study after variance and standard
deviation calculations have been added.
3. Examine the output. Do the values you obtained for N, the mean, variance,
and standard deviation agree with the results from the spreadsheet you
created in Exercise 5.4?
4. Save the SPSS data file.

5.4 STANDARD SCORES

Adolphe Quetelet, a Belgian 19th-century statistician, is noted for his work with
the way various measurements are distributed. Among the scores he worked with
were the chest circumferences of Scottish soldiers (see Stigler, 1986). Assume the
mean is 40 inches and one particular soldier has a chest circumference of 45.
Thus, this soldier's chest measured 5 inches more than the mean. Now from our
lie detection study recall that the mean number of lies detected was 5.3 and for the 10th subject nine lies were detected, which means that nearly four more lies than average were detected for that subject. It makes little sense to ask who is more
extreme, the subject with four detected lies above the mean or the soldier whose
chest circumference is 5 inches more than the mean. The two scales, number of
lies and inches, are as alike as goats and apples. What is needed is a common or
standard scale of measurement. Then both scores could be rescaled and, once the
transformed scores were expressed in the same units, these new scores could be
compared.
The common scale is provided by the standard deviations, and the
transformed scores are called standard scores or Z scores. To compute a
standard score, (a) subtract the mean from an individual score and (b) divide that
difference by the standard deviation for the set of scores. Symbolically (letting
My and SDy represent the mean and standard deviation for the Y scores), the Z
score corresponding to the Y score for the ith subject is:

Zi = (Yi - My) / SDy (5.8)

Equation 5.8 is expressed in terms of sample statistics. The corresponding definition expressed in terms of population parameters is:

Zi = (Yi - μ) / σ (5.9)

Note that both the deviation (Yi - My or Yi - μ) and the standard deviation (SDy or σ) are measured in the units used for the initial scores. Thus when one is
divided by the other, the units cancel, resulting in a unit-free score, neither lies
nor inches. Actually, because deviation scores are divided by the appropriate
standard deviation, we can regard the resulting standard scores as being
expressed in standard deviation units. A standard score of 1.0, then, means the
corresponding raw score is exactly one standard deviation above its mean,
whereas a standard score of -1.0 indicates a raw score one standard deviation
below the mean, a standard score of 2.5 implies a raw score two and a half
standard deviations above the mean, and so forth.
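
A minimal Python sketch of this computation, applied to the lie detection data, reproduces the Z column you will build into the spreadsheet shortly (compare Fig. 5.5):

    import math

    Y = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
    N = len(Y)
    mean = sum(Y) / N
    sd = math.sqrt(sum((y - mean) ** 2 for y in Y) / N)  # sample SD, 2.0025
    Z = [(y - mean) / sd for y in Y]
    print([round(z, 2) for z in Z])
    # [-1.15, -1.65, -0.65, 0.35, 0.35, -0.65, -0.15, 0.85, 0.85, 1.85]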

Exercise 5.6
Standard Scores
This exercise modifies the current spreadsheet so that, in addition to the mean,
variance, and standard deviation, it now computes standard scores as well.

General Instructions
1. Add two columns to the current spreadsheet. In the first column enter
formulas for Z scores and in the second column enter formulas for the square of the Z scores.
2. Enter formulas for the sum, count, and mean of the Z scores. Enter formulas for the sum, count, mean, and square root of the mean for the squared Z score. What is the mean Z score? Why must it be zero? What is the variance and standard deviation for the Z scores? Why must they be one?
Detailed Instructions
1. Label column F (cell F2) with "Z" (for Z or standard score) and column G (cell
G2) with "z*z."
2. Enter a formula in cells F3-F12 for dividing the residual (Y-Y' in column D) by
the standard deviation (cell E16). This is the standard score.
3. Enter a formula in cells G3-G12 for multiplying the standard score (Z, the
column F entry) by itself. (You may want to format columns F and G so that
only two or three places after the decimal point are given.)
4. Copy the formulas from cells E13-E16 to F13-F16 and to G13-G16.
5. What is the mean Z score? Why must it be zero? What is the variance and
standard deviation for the Z scores? Why must they be one?

At this point, your spreadsheet should look like the one given in Fig. 5.5. Note
that the mean standard or Z score is zero. This follows from the fact that
deviations from the mean must sum to zero. Note also that the standard
deviation for the standard scores (the square root of the average squared Z score)
is 1. This follows from the fact that differences were divided by the standard
deviation; hence, a raw score that initially was a standard deviation above the
mean would have been transformed into a standard score of 1.

Exercise 5.7
More Standard Scores
This exercise provides additional practice in computing standard scores. The
current template is used, but with different data. This time chest circumference
measurements are used.
1. Using the spreadsheet for the lie detection study as a base, create a new
spreadsheet for the chest circumference of Scottish soldiers. All that needs
to be changed is some labeling (e.g., "Chest" for chest circumference,
instead of "Lies") and the raw data. Replace the number of lies with the
values 34, 38, 39, 39, 40, 40, 41, 41, 43, and 45, respectively. These
represent the chest measurements, in inches, for 10 Scottish soldiers.
2. Nothing else needs to be done. All formulas for computing residuals, means,
variances, standard deviations, and standard scores are already in place.
Your new spreadsheet should look like the one shown in Fig. 5.6. You are
now in a position to compare scores in the two distributions. Note that the
standard deviation for lies was 2.00 (Fig. 5.5) and the standard deviation for
chest measurements was 2.79 (rounded to two decimal places). This means that
the chest measurements were somewhat more widely distributed about their
mean than the number of lies were around their mean.
Now examine the standard or Z scores. As you can see, the standard scores
associated with 9 lies and 45 inches were 1.85 (Fig. 5.5) and 1.79 (Fig. 5.6),
respectively, almost the same. This means that the two highest scores from these
two distributions are almost equally extreme. Note also that many standard
scores were between -1 and +1 and almost no scores were less than -2 or greater
than +2. However, the soldier with the smallest chest circumference (34 inches)
was more extreme in his distribution (Z score = -2.15) than was the subject who had only two lies detected (Z score = -1.65).

Exercise 5.8
Obtaining Standard Scores in SPSS
Open the data file you created in Exercise 5.5. Run Descriptive statistics for
the Lies variable as you did in the previous exercise, but this time check the
Save standardized values as variables box in the Descriptives window.
After you run the Descriptives command, return to the Data editor. Notice
that SPSS created a new variable called "zy." Why do these z-scores differ
slightly from the scores you calculated with your spreadsheet?

A B C D E F G
1 Lies y= SSy
2 s Y Y' Y-Y' y*y Z z*z
3 1 3 5.3 -2.3 5.29 -1.15 1.3192
4 2 2 5.3 -3.3 10.89 -1.65 2.7157
5 3 4 5.3 -1.3 1.69 -0.65 0.4214
6 4 6 5.3 0.7 0.49 0.35 0.1222
7 5 6 5.3 0.7 0.49 0.35 0.1222
8 6 4 5.3 -1.3 1.69 -0.65 0.4214
9 7 5 5.3 -0.3 0.09 -0.15 0.0224
10 8 7 5.3 1.7 2.89 0.85 0.7207
11 9 7 5.3 1.7 2.89 0.85 0.7207
12 10 9 5.3 3.7 13.69 1.85 3.4140
13 Sum= 53 53 0 40.1 0 10
14 N= 10 10 10 10 10 10
15 Mean= 5.3 5.3 VAR= 4.01 0 1
16 a= 5.3 SD= 2.0025 0 1
17
FIG. 5.5. Spreadsheet for the lie detection study including standard or
Z-score calculations.
Identifying Outliers
Transforming or rescaling raw scores into standard scores is useful, not only
because it allows us to compare scores in different distributions, but also because
it can reveal comforting or disturbing information about a particular sample of
scores. Sometimes a data recording instrument malfunctions or sometimes a
person makes a copying or data entry mistake. If such mistakes result in a data
point quite discrepant from other, legitimate data, then its standard score will be
extreme.
Opinions vary as to how extreme a score needs to be before it should be
regarded as an outlier, a possibly illegitimate data point that should be modified
or deleted. A common rule of thumb suggests that any data point whose standard
score is less than -3 or greater than +3 should, at the very least, be subjected to
scrutiny. Scores this extreme may be important and valid observations, but they
also may indicate that something (a subject, a piece of recording apparatus, a
procedure) may have malfunctioned, or the subject does not legitimately belong
to the population from which the investigator intended to sample. Outliers can
exert undue influence on the statistics computed (recall Exercise 5.3) and can
render suspect the assumptions required for parametric statistical tests. As a
matter of course, then, it is good practice always to compute and examine
standard scores and to consider whether any outliers represent influential and
legitimate data or unwanted noise.
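
Such screening is easy to automate. The following minimal Python sketch flags scores by their standard scores. One caution, which you can verify algebraically: with the N divisor, no standard score in a sample of N cases can exceed √(N - 1) in absolute value, so with only 10 scores the ±3 rule can never trigger, and the demonstration below uses a lower threshold instead:

    import numpy as np

    def flag_outliers(scores, threshold=3.0):
        # Return the scores whose standard score exceeds the threshold.
        scores = np.asarray(scores, dtype=float)
        z = (scores - scores.mean()) / scores.std()  # std() divides by N
        return scores[np.abs(z) > threshold]

    # Chest data with the 45 replaced by a wild 88, as in Exercise 5.9 below
    data = [34, 38, 39, 39, 40, 40, 41, 41, 43, 88]
    print(flag_outliers(data, threshold=2.0))  # prints [88.]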
The mean of standard scores will always be zero and their standard deviation
and variance will always be 1 because of the way standard scores are defined.
Note, however, that the shape of the distribution of the scores is unaffected by
being standardized. If we begin with a sample of scores, and use the sample
mean and standard deviation to compute standard scores, the shape of the

A B C D E F G
1 Chest y= SSy
2 s Y Y' Y-Y' y*y Z z*z
3 1 34 40 -6 36 -2.15 4.62
4 2 38 40 -2 4 -0.72 0.51
5 3 39 40 -1 1 -0.36 0.13
6 4 39 40 -1 1 -0.36 0.13
7 5 40 40 0 0 0.00 0.00
8 6 40 40 0 0 0.00 0.00
9 7 41 40 1 1 0.36 0.13
10 8 41 40 1 1 0.36 0.13
11 9 43 40 3 9 1.07 1.15
12 10 45 40 5 25 1.79 3.21
13 Sum= 400 400 0 78 0 10
14 N= 10 10 10 10 10 10
15 Mean= 40 40 VAR= 7.80 0 1
16 a= 40 SD= 2.79 0 1
17
FIG. 5.6. Spreadsheet giving the chest circumferences in inches and their
associated standard scores for 10 Scottish soldiers.
distribution of the transformed scores will be identical to the shape of the
distribution for the raw scores. If the raw scores are highly skewed, for example,
the standardized scores will also be skewed. Many students come to a course in
statistics believing that standardizing scores somehow makes them normally
distributed. This is simply not so; their distribution is unaffected by
standardizing.
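
This is easy to demonstrate in a few lines of Python. In the following minimal sketch, skewness is computed as the mean cubed standard score; because standardizing merely shifts and rescales the scores, the raw and standardized versions have identical skewness:

    import numpy as np

    def skewness(x):
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std()
        return np.mean(z ** 3)  # third moment of the standardized scores

    raw = np.array([1, 1, 2, 2, 3, 4, 9, 15, 40])  # strongly skewed sample
    standardized = (raw - raw.mean()) / raw.std()
    print(skewness(raw), skewness(standardized))   # prints the same value twice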

Exercise 5.9
Outliers
For this exercise you again use the current template and modify the data. The
purpose is to demonstrate the effect different distributions of raw scores,
especially those that include outliers, can have on their means, standard
deviations, and standardized scores.
1. Change the raw scores for chest measurements so that the scores are more
spread out from the mean. Then change the raw scores to represent a
skewed distribution, for example, one with many more low than high scores.
In each case, how are the standard scores affected?
2. Change one or two of the raw scores for chest measurements to a very large
number, like 88 (likely an outlier). Then change one or two of the raw scores
to a very small number (an outlier in the other direction). In each case, how
is the mean and standard deviation affected? How are the standard scores
affected?

Descriptive Statistics Revisited


This chapter has been concerned primarily with description, with statistics that
tell us something about the scores in a sample (or in a population, if all scores in
the population are available to us). Specifically we have learned that the mean
number of lies detected for the 10 subjects in the lie detection study was 5.3, and
if the mean were used as the best guess for the number of lies detected for each
subject, a typical or average error (i.e., the standard deviation) would be 2.00.
Likewise, we have learned that the average chest circumference for a sample of 10
Scotsmen was 40 inches, and if the mean were used as the best guess for each
Scotsman's chest circumference, the standard deviation or error would be 2.79.
We have also learned that the mean can be thought of as the typical score for
a set of scores, whereas the variance and standard deviation indicate whether
scores are widely dispersed or clustered more narrowly about that mean. The
standard deviation, moreover, is used when computing standard scores, which
provides a way of determining whether scores in different distributions are
equally extreme. This may also aid in identifying measurement or data
transcription errors. Descriptive statistics such as these can indicate to others
exactly what we observed and measured; consequently, they are interesting and
valuable in their own right. In addition, they can also be used as a basis for
inference, as we begin to demonstrate in the next chapter.

Note 5.3
z A standardized or z score. It is the difference between the
raw score and the mean divided by the standard deviation.
6 Describing a Sample:
Graphical Techniques

In this chapter you will:


1. Learn basic principles of good graphical design.
2. Learn how to create histograms, stem-and-leaf plots, and box plots.

In the previous chapter you learned basic statistics designed to describe the
characteristics of central tendency (e.g., the mean) and variability (e.g., the
variance and standard deviation). These statistics help define two important
aspects of a distribution. Indeed, they form the foundation for the analysis of
variance. Before applying inferential statistics, however, it is always important to
know the subtleties inherent in any data set. The mean and standard deviation do
not provide a comprehensive view of the data. The application of good graphical
techniques allows you to determine the shape of your data distribution, discover
outliers, and uncover subtle patterns in your data that may not be readily
detectable when viewing only numerical values. As Edward Tufte summarized in
the epilogue of his classic text, The Visual Display of Quantitative Information
(1983, p. 191), "What is to be sought in designs for the display of information is
the clear portrayal of complexity... the task of the designer is to give visual access
to the subtle and the difficult—that is, the revelation of the complex."
At this point we should differentiate two purposes of graphing data. Most of
you are likely familiar with the bar charts typically found in professional journals.
The purpose of such figures is to present the reader with the results of an analysis
and they are therefore designed to tell a specific story; perhaps to the exclusion of
unimportant information. The graphical techniques in this chapter, on the other
hand, are more in the spirit of exploratory data analysis described by John Tukey
(1977). These graphical techniques will allow you to get to know your data before
you conduct confirmatory or inferential statistics to assure that you have selected
the right models to test, and to help you determine if your data meet the
assumptions (e.g., normality) for each of the inferential analyses you plan to
conduct. These techniques can also be used at any time, before or after
confirmatory analysis, to simply explore the data in search of new and interesting
patterns that you may have not considered during study design. The techniques
are also critical to data cleanup. Strange patterns in your data or outliers may

become evident in your graphs. These may be due to true variability in the
population, or they may be errors. Therefore, it is always important to check.

6.1 PRINCIPLES OF GOOD DESIGN

Before proceeding immediately to the different types of graphs used in exploratory data analysis, it is helpful to present a few principles that will help
you organize your data in the most efficient manner possible. Graphical
techniques are useful for showing large amounts of data in one readily
interpretable unit. What would take you many words and paragraphs to describe
in written text can be presented in a single well-designed figure. Thus, it is
important to produce graphs that present your data in a coherent format. At the
same time, however, you should avoid covering up patterns in your data or, worse
yet, distorting the data in ways that would lead to erroneous conclusions. If you
follow a few simple principles, the graphs you produce should give you vital
insight into your data that will help you make appropriate decisions concerning
the more complex inferential analyses you will learn about shortly.

Have a Clear Goal


Good expository writing always begins with a clear thesis and then uses the
literary style most suitable for presenting that thesis. Similarly, when producing
a graph, you should have a clear goal in mind, and then select the best technique
for accomplishing that goal. For instance, a histogram may be the best format for
getting an overall feel for the way univariate data are distributed, while a stem-and-leaf plot may be better when trying to uncover outliers or unusual clusters of
values. When comparing two or more groups, box plots may be the best
graphical form.
In your initial attempts to explore a data set, you will need to use various
different graphical techniques with different goals in mind. As your analyses
progress, however, the goal of each figure should become clearer. Eventually,
when presenting your data to others, you should ensure that the figures are
clearly integrated with the text and written interpretations of your analyses. In
later chapters we present graphs that are designed to explicate the results of your
analyses.

Show the Data


The preceding principle does not imply that a good figure can have only one
point. In fact, as the "show the data" principle suggests, a single graph can, and
should, present as many aspects of the data as can be viewed in a coherent
fashion. Thus, a single graph could provide the viewer with information
concerning central tendency, variability, the shape of the distribution, and even
individual data points. By doing so, for instance, it is possible to provide
information concerning group comparisons, but also how valid the comparisons
are, and how to best interpret any differences and similarities between the
groups.
Consider the typical bar chart ubiquitous in many professional publications.
If two groups are compared, then the figure conveys only two pieces of
information—the mean of each group. In some cases you may also be provided
with error bars that give you some idea of variability within the groups. If a graph
is to be worth a thousand words, however, then it needs to contain more than
four values (i.e., two means and two standard errors). The stem-and-leaf diagram
(see the next section) is an excellent example of a graphical technique that
presents large amounts of data (in some cases all of the raw data itself) in a
format that allows the quick apperception of summary characteristics.

Keep It Simple
The explosion of relatively simple to use software packages such as Excel and
Power Point in the 1990s gave great power to the user to conduct statistical
analyses and prepare presentation quality graphics. Along with that power,
however, comes responsibility. Just as you would not conduct a statistical
analysis in SPSS or Excel simply because it was available to you as a menu item,
you should not add graphical decoration to your figures just because the option is
available. Charts in Excel frequently default to the use of garish color schemes,
grid lines, and background color fills that, at best, direct attention away from the
main point of the figure, and, at worst, obscure or distort the data. Some authors
refer to this as "chartjunk" (Tufte, 1983; Wainer, 1984). Chartjunk should be kept
to a minimum and information to a maximum. You can use some of the design
options, such as holding one aspect of a figure to highlight an important point,
but you should always ask yourself before selecting an option, such as three-
dimensional shadows for a bar chart, if it adds any informational value to the
figure.

6.2 GRAPHICAL TECHNIQUES EXPLAINED


The following section will present three graphical techniques, the stem-and-leaf
plot, the histogram, and the box plot, that will help you get to know your data.
Using a combination of these techniques will allow you to uncover characteristics
of your data such as the shape of the distribution (i.e., skew), central tendency
(the mean, median, and/or mode), and spread (i.e., variance). Some of the
techniques are also useful for detecting gaps and clusters within the data and for
finding possible outliers. It is also possible to adapt these techniques to aid in the
comparison of distributions.

The Stem-and-Leaf Plot


The stem-and-leaf plot, or stem plot for short, embodies the principle of "show
the data." It is actually a hybrid of a text table and a true figure. The stem-and-
leaf plot organizes individual numerical data in ways that highlight summary
features of the distribution. Consider the following set of numbers—40, 49, 47,
46, 64, 64, 46, 66, 48, 46, 46, 38, 64, 63, 45, 65, 34, 60, 57, 55, 50, 41, 65, 47, 49,
63, 47, 46, 56, 40, 49, 60, 19, 47, 42, 46, 63, 61, 63, 37. Try to quickly identify the
characteristics of this data set. What is the minimum value? What is the
maximum value? Are there any clusters in the data? Are there any outliers? An
unordered case listing of the data is not very helpful. A first step to organizing
the data is to sort it in numerical order—19, 34, 37, 38, 40, 40, 41, 42, 45, 46, 46,
46, 46, 46, 46, 47, 47, 47, 47, 48, 49, 49, 49, 50, 55, 56, 57, 60, 60, 61, 63, 63, 63,
63, 64, 64, 64, 65, 65, 66. It is now readily apparent that the minimum value is 19
and the maximum is 66. Upon closer inspection there is a large number of values
at 46. The careful reader may have also noticed that 19 may be an outlier. Thus,
when data sets are relatively small (say, less than 50 or so cases), simply sorting
your data in Excel or a statistical package and then scrolling through the values
will give you a hint about some of the important features of your data. The shape
of the distribution, however, is hard to extract from this sort of presentation.
We could also group the data into a frequency table. To do so, we simply
create class intervals and then count the number of values within each interval
(see Fig. 6.1). When viewing the figure it is obvious that there also are two
modes—one between 40 and 49, and the other between 60 and 69. The
distribution also appears asymmetrical—there are fewer numbers in the lower
ranges and more in the higher ranges (i.e., a negative skew). By creating a
frequency table we have gained a better understanding of the shape of the data,
but we have lost access to the original data. We don't know, for instance, if the
minimum value is 10 or 19. We also don't know if the mode is due to the
existence of 19 scores of 40, or if the 19 scores are distributed evenly throughout
the 40-49 range.
The stem-and-leaf plot provides the opportunity to organize the data into a
frequency table, yet it maintains access to the original values. The plot is made up
of stems, which represent the class intervals from a frequency table. Each leaf
represents an individual value. To construct a stem plot you separate each
number at a place value. The stem becomes the value to the left of the place
value, and the leaf becomes the number to the right of the place value. In the
previous example, because the values all fall between 10 and 100, it makes sense
to create stems in the tens and let each leaf represent the ones place holder.
Typically stems and leaves are represented by one-digit numbers, but depending
on the range of your data, it is sometimes useful to have two-digit stems or leaves.
To read the plot presented in Fig. 6.2, you combine the stem with the leaf to
recreate each data point. The first value has a stem of 1 and a leaf of 9; thus it represents the value 19. Similarly, on the third line we combine a stem of 3
with the leaves 4, 7, and 8 to create 34, 37, and 38. The frequency column is
optional, but provides a good check to make sure you have the correct N for your
data set. The stem plot provides the same information as the frequency table, but
retains the original data values. From this plot, we can still see that the data
clusters between 40 and 50, but we also know that 46 is the true mode. An added
advantage is that the stem plot is more graphical in nature. The shape of the
distribution (remember the negative skew) is visually perceptible in this format.
It should be noted that it is necessary to use a fixed font, such as Courier, rather
than a proportional font, to create an accurate visual impression of horizontal
bars extending from each leaf.
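
Although most statistical packages will draw a stem plot for you, one is also easy to script. A minimal Python sketch for two-digit data such as these (note that stems with no leaves, here the 2 stem, are simply skipped by this simple version):

    from collections import defaultdict

    data = [40, 49, 47, 46, 64, 64, 46, 66, 48, 46, 46, 38, 64, 63,
            45, 65, 34, 60, 57, 55, 50, 41, 65, 47, 49, 63, 47, 46,
            56, 40, 49, 60, 19, 47, 42, 46, 63, 61, 63, 37]

    stems = defaultdict(list)
    for value in sorted(data):
        stems[value // 10].append(value % 10)  # split at the tens place

    for stem in sorted(stems):
        print(stem, "|", "".join(str(leaf) for leaf in stems[stem]))
    # 1 | 9
    # 3 | 478
    # 4 | 0012566666677778999
    # ...and so on for the 5 and 6 stems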
The stem-and-leaf plot can be adapted in a number of ways. First, if your
data are in a different range, say from 100 to 999, you could round or truncate
the values so that the stem represents the hundreds place holder and each leaf
would represent the tens. For instance, a value of 324 would have a stem of 3 and
a leaf of 2. Of course, given this method, you could not differentiate between 324
and 327. If you desired to retain the full resolution of each number, you could use

FIG. 6.1. A frequency table.



FIG. 6.2. A stem-and-leaf plot.

a stem of 3 (hundred) and leaves of 24 and 27. If you have a large number of
scores clustered on one stem, you could alter the number of stems by splitting
each one in half. For instance, Fig. 6.2 could be recreated with each stem split
into those values below five and those five and above. By doing so, the shape of
the distribution may sometimes become more obvious (see Fig. 6.3).

The Histogram
The stem plot is an excellent graphical technique for displaying all of the data in a
fashion that allows the viewer to perceive the distributional characteristics with
little effort. The histogram does not retain information about individual scores,
but as it is more graphical in nature, it is particularly good for providing a feel for
the distribution of data at a quick glance. Given the design principle of show the
data, however, the stem plot is the preferred technique, but most statistical
programs do not provide the flexibility in selection of interval sizes and the
manipulation of the number of digits in the stem and/or leaves necessary to take
full advantage of the features of the stem plot. It is for this reason that we also
present the histogram.
A histogram is a graphical presentation of a frequency table. Class intervals
are plotted on the x axis and frequency is plotted on the y axis. Individual bars
indicate the number of values contained within each class interval. A histogram
looks very much like the stem plot turned on its side. Compare the stem plot in

FIG. 6.3. A stem-and-leaf plot with each stem split in two.



FIG. 6.4. An example of a histogram.

Fig. 6.3 and the histogram of the same data presented in Fig. 6.4. The two modes
are clearly visible, and so is the outlier. We can also see the negative skew in the
shape of the distribution. What is missing is the ability to access individual data
points.
When creating a histogram it is important to be aware of the number and size
of the bins (i.e., the class intervals). In Fig. 6.4 there are 10 bins, which seems a
reasonable number given the range of the data. Fig. 6.5 shows the same data as
Fig. 6.4, but with only five bins. The two modes are still apparent in this figure,
but the outlier is no longer readily apparent and the gap between the two modes
is not as dramatic.
Some consider the optimum width of each bin or class interval to be the one
that most closely resembles the probability distribution of the data. Scott (1979)
suggested that this can be accomplished by calculating the width of each bin as

W = 3.49 * SD * N^(-1/3) (6.1)

where W is the bin width, SD is the standard deviation, and N is the sample size. Given Formula 6.1 and using the estimated standard deviation, the appropriate
width of the bins for the data is 10.85. Given the range of the data this would
produce a histogram very similar to the one in Fig. 6.5. Using Scott's formula will
often result in noninteger bin widths, which might be confusing to some viewers.
Therefore, histograms based on Formula 6.1 provide a good exploratory tool for

FIG. 6.5. A histogram with only five bins.


determining the overall shape of a distribution, but may not be as well suited for
presenting your data to others. Moreover, such histograms may sometimes omit
detailed information, such as a small number of outliers that do not have great
influence on the overall distribution. It is important to play with the size of the
bins to give yourself the opportunity to explore all aspects of a distribution. This
can be easily accomplished in most statistical packages.
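If your statistical package does not implement Scott's rule, it is simple to compute yourself. Here is a minimal Python sketch (our own illustration, assuming the raw scores are available in a list):

    import math

    def scott_bin_width(scores):
        # Scott's (1979) rule: W = 3.49 * SD * N**(-1/3),
        # with SD estimated using N - 1 in the denominator, as in the text.
        n = len(scores)
        mean = sum(scores) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
        return 3.49 * sd * n ** (-1 / 3)

Dividing the range of the data by W then gives an approximate number of bins, which you can round to a whole number if noninteger widths would confuse viewers.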

The Box-and-Whiskers Plot


The box-and-whiskers plot, or box plot, contains less information than the
histogram or stem plot, but provides a graphical representation of important
descriptive statistics related to the distribution of a data set. The box plot is also
a good graphical technique for comparing two or more distributions. It is usually
comprised of a box, a line within the box, and whiskers, which are two lines
emanating from either side of the box. These features of the box plot describe
five characteristics of a distribution—the 25th and 75th percentiles, maximum
and minimum values (excluding outliers), and the 50th percentile (i.e., the
median). When oriented vertically, the lower and upper edges of the box
represent the 25th and 75th percentiles. These two values define the interquartile
range (IQR), which is the 75th percentile minus the 25th percentile, or the middle
50% of scores. The whiskers extend to the maximum and minimum values,
excluding outliers. A line within the box is also drawn to indicate the position of
the median.
The box plot uses the median instead of the mean as a measure of central
tendency, and the IQR instead of the standard deviation as a measure of
variability. This is because the median and IQR are more resistant to the effects
of individual scores, especially extreme scores, than the mean and standard
deviation. If we changed the 66 to 166 in the data set, the median and IQR would
remain unchanged at 48.5 and 16.5, respectively. The mean and standard
deviation, on the other hand, would change from 50.85 to 53.35 and from 10.64
to 21.00, respectively. Thus, when first exploring your data, the median and IQR
are more useful. Later, after you have cleaned your data and determined the
appropriate inferential statistics, the mean and standard deviation become more
important.
Fig. 6.6 presents a box plot for the data set. It is readily apparent that 50% of
values fall within the box between the mid 40s and the low 60s. The median is
around 48, and the minimum and maximum values are in the mid 30s and 60s.
Skew is also apparent in these data as the bottom whisker is longer than the top,
and a larger number of scores within the box are found above the median. The
dot at the bottom of the plot represents an outlier. In chapter 5 you learned that
an outlier could be defined as any value that was greater than three standard
deviations away from the mean. In fact, this outlier is 19 and has a z score of
-2.99, so it falls just under the threshold of 3.00 absolute—although the
definition of an outlier as ±3.00 standard deviations is just a rule of thumb and
any value approaching 3 should probably be examined carefully. In the box plot,
however, outliers are defined as values that lie 1.5 IQRs or more away from the
median. SPSS also identifies extreme values as those that are 3 IQRs or more
away from the median. If there were any extreme values in these data, SPSS
would identify them with an asterisk. Not all statistical packages define outliers
in the same manner, so it is important that you check.
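If you want to check how your package computed a box plot, the relevant statistics are easy to produce directly. The Python sketch below is our own illustration; note that quartile conventions vary across packages, and that this sketch draws the fences 1.5 IQRs beyond the quartiles (the box edges), one common convention, whereas distances could also be measured from the median, so results may differ slightly from a given program's.

    def box_plot_stats(scores):
        xs = sorted(scores)

        def median(v):
            n = len(v)
            mid = n // 2
            return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2

        n = len(xs)
        q1 = median(xs[: n // 2])        # 25th percentile (lower half)
        q2 = median(xs)                  # 50th percentile (the median)
        q3 = median(xs[(n + 1) // 2 :])  # 75th percentile (upper half)
        iqr = q3 - q1
        # Values beyond these fences would be flagged as outliers.
        fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        return q1, q2, q3, iqr, fences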
Occasionally, you may come across a box plot that uses the terms hinges and
H spread. Hinges are approximations of the 25th and 75th percentiles designed
to be easier to calculate. The H spread is the equivalent of the IQR and is defined
as the upper hinge minus the lower hinge. Hinges were useful before the days of

FIG. 6.6. A box-plot representation of the data.

statistical packages implemented on a computer. Today they are rarely seen, but
if you do come across a box plot that uses hinges, they can be interpreted in the
same way as plots based on percentiles and the IQR.

Comparing Distributions Graphically


Thus far we have discussed the graphical representation of univariate data, or
data based on a single sample. Often, however, it is useful to evaluate the
distributions of two or more groups of data on the same figure. In such cases, the
box plot allows for the quick comparison of distributions. The median appears
prominently in a box plot and allows one to quickly perceive differences between
groups with reference to central tendency. At the same time, however,
information concerning variability and the shape of each group's distribution is
maintained, making it a much better representation of group differences than a
text table of means and standard deviations. The box plot is therefore an
excellent graphical technique to use when your eventual goal is to apply
inferential statistics that test group differences.
In Fig. 6.7, Group 1 is the same sample we have been working with
throughout the chapter. Compare Group 1 to Group 2. It is easy to see that the
two groups have a similar median value, but there is more variability in Group 2.
Group 2 also seems to be more symmetrical and has an outlier in the 120 range
that should be checked. Group 3, on the other hand, has a median much higher
than the other two groups. As with Group 1, the third group is negatively
skewed—the bottom whisker is longer than the top whisker—but the skew is not
as large as the first group.
The previous example demonstrates the usefulness of side-by-side box plots
for comparison of the distributional characteristics of different samples. The box
plot for comparing groups can also be easily created in most statistical packages.
Ease of computation is important, but you should also not limit yourself to only
those graphical techniques that have already been implemented by the common
statistical packages. For instance, you could use many small histograms in one
figure to compare distributions. This is what Tufte (1983) referred to as the small
multiple. It is also possible to adapt the stem plot to grouped data. Compare the
side-by-side stem plot in Fig. 6.8 to the box plot in Fig. 6.7. From both plots we
can immediately determine that the central tendency of the two distributions is
about the same. The mode of both is in the 40s. With small sample sizes it is also
relatively easy to find the median of the two groups by counting leaves in order

FIG. 6.7. A box plot comparing the distribution of three groups.

until you reach the midpoint of the sample. In this case, because there is an even
number of scores, it is necessary to take the mean of the two scores closest to the
center of the plot. From this figure, we can also see that Group 2's distribution is
more symmetrical than Group 1's distribution, if you exclude the outliers. Finally,
even though this figure may be a little busier than the box plot, it retains
information concerning individual data points that allow you to extract
characteristics of the samples, such as the outliers and the large number of 46s
and 47s in Group 1.
These variations of the graphical techniques can easily be created by hand, on
a word processor (as with Fig. 6.8), or in common draw programs that allow you
to manipulate the output from statistical packages such as SPSS. Although they
may take a little more effort, such graphs give you insight into your data and
uncover subtle patterns that may not be readily apparent using the graphical
techniques already implemented in software packages. Remember the quote
from Edward Tufte: "the task of the designer is to give visual access to the subtle
and the difficult—that is, the revelation of the complex." Your ability to reveal the
complex patterns in your data is limited only by your creativity and skill in

FIG. 6.8. The side-by-side stem plot.


applying good principles of exploratory data analysis and the armamentarium of
graphical techniques that you have mastered.

Exercise 6.1
Graphical Data Exploration
This exercise provides practice in the creation and interpretation of stem plots,
histograms, and box plots. You will use the data provided in the table here.
These data come from a study of children with Williams syndrome, a syndrome
that often results in mild mental retardation (Robinson, Mervis, & Robinson,
2003). Each child was given a vocabulary test (Voc) and a test of grammar
(Gram). In this exercise you will explore the univariate distributions of the
vocabulary and the grammar scores. You will also generate a box-plot to explore
possible differences between the vocabulary and grammar scores.

Sub Voc Gram Sub Voc Gram Sub Voc Gram


1 84 79 11 80 67 21 56 66
2 70 79 12 92 76 22 45 66
3 75 69 13 60 64 23 68 82
4 62 68 14 70 75 24 66 63
5 93 91 15 75 76 25 54
6 83 75 16 35 55 26 44 67
7 98 94 17 31 55 27 89 72
8 59 61 18 64 63 28 80 75
9 73 67 19 69 67 29 70 67
10 67 57 20 83 82 30 80

1. For the vocabulary data, sort them in numerical order. Without calculating
any statistics, what are your first impressions of the distribution? What are
the minimum and maximum values? Are there any possible outliers? Do the
scores seem to cluster in any particular way?
2. By hand, using graph paper, create a stem plot of the data. Were your first
impressions concerning central tendency, variability, and outliers correct?
What about the shape of the distribution? Is it skewed? If so, describe the
skew.
3. Now create a histogram of the vocabulary data using eight equally spaced
bins. Compare and contrast the histogram to the stem plot. Which do you
find more informative and why?
4. Create another histogram by calculating the width according to Formula 6.1.
What is the width? Compare this histogram to the one you created in part 3.
How do your impressions of the data differ based on the histograms with
differing bin widths?
5. Redo parts 1-4 using the grammar data. You may do this by hand, or on a
computer, if you are familiar with the graphing functions in Excel, SPSS (see
the next exercise), or another statistical package.
6. Create side-by-side box plots of the vocabulary and grammar data. How do
the two sets of scores differ with respect to central tendency, variability, and
the shapes of their distributions?
The preceding exercise gave you valuable experience in creating graphs by
hand. In the next exercise, you will be introduced to some of the graphs available
in the Explore procedure of SPSS. Statistical packages provide a quick and
efficient means of summarizing graphical data, but it is often the case that
graphing data by hand gives you a better "feel" for the data. Moreover, most
statistical packages do not provide options for creating some of the graphs
presented in this chapter (e.g., the side-by-side stem plot), nor do they provide
you with the flexibility to change all of the parameters in any particular graph
(e.g., the stem size of a stem plot). For these reasons you should not limit yourself
to any one statistical package. Often, using a statistical package to create
summary statistics that you then graph by hand, or using a draw program, is the
most flexible option available for exploratory data analysis.

Exercise 6.2
Graphing in SPSS
In this exercise you will learn to use the Explore command to obtain descriptive
statistics, histograms, stem-and-leaf plots, and box plots.
1. Open SPSS, create variables, and enter the data from Exercise 6.1.
2. Select Analyze->Descriptive Statistics->Explore from the main menu.
3. Move the vocabulary and grammar variables to the Dependent List box.
Click on Plots and select the histogram option in the Explore: Plots dialog
box. Click Continue.
4. Click on Options and select the exclude cases pairwise option and click
continue. Note that some SPSS commands give you the option of excluding
cases listwise or pairwise. Selecting listwise exclusion will cause SPSS to
delete any case that is missing data on at least one variable. Pairwise
exclusion, on the other hand, will exclude only those cases that have missing
data on variables involved in the current analysis.
5. Click on OK to run the Explore command.
6. Examine the output. What are the Ns for the vocabulary and grammar
variables? Why are they different? What would they be if you selected
listwise exclusion?
7. Examine the Descriptives output. Are the mean, median, standard deviation,
interquartile range, and skewness statistics what you would expect given
your results from Exercise 6.1?
8. Examine the histogram for the vocabulary scores. Double click on the
histogram to open the chart editor. In the chart editor you can change the bin
widths. Select Chart->Axis from the main menu. In the intervals box, select
custom and click on Define. You can now select either the number of
intervals you would like to display, or the interval width. Change the number
of bins and/or the bin width. Observe how doing so affects the shape of the
histogram. Do this with the gram variable as well.
9. Examine the stem-and-leaf plots and the box plots. Do they look like the plots
you generated by hand?
10. Change some of the values in the data and rerun the Explore command. See
if you can predict how the plots will look based on the changed values.
11. Return the numbers you changed to their original values and save the data
file.
7 Inferring From a Sample:
The Normal and t Distributions

In this chapter you will:


1. Learn how the normal can be used as an approximation for the binomial
distribution if N (the number of trials) is large.
2. Learn what the normal curve is, how it arose historically, and what kind
of circumstances produce it.
3. Learn what the central limit theorem is and how the normal distribution
can be used as an approximation for the distribution of sample means if
N (the sample size) is large.
4. Be introduced to the t distribution and learn how it can be used for the
distribution of sample means if N is small.
5. Be introduced to the standard error of the mean and learn how to
perform a single-sample test, determining whether a sample was
probably drawn from a population with specified parameters.
6. Learn how to determine 95% confidence intervals for the population
mean.

In chapter 3 you were asked to derive critical values for the binomial sampling
distribution for larger and larger numbers of trials. This was not too difficult to
do for small values like 10 or even 15, but you can see how tedious and error-
prone this could become for larger numbers. This concerned at least some 18th-
century English and French mathematicians. Lacking today's high-speed
electronic computers, they set out to reduce tedium in a different way, and so
sought approximations for the binomial. The goal was to find a way for
computing probability values that did not require deriving probabilities for all the
separate outcomes of series of different numbers of binomial trials.
The binomial is a discrete distribution. No matter the number of trials, the
number of outcomes that contain, for example, no heads, 1 head, 2 heads, ..., N
heads are always whole numbers. As a result, the distribution, portrayed
correctly, will always be jagged and will not be described by a smooth, continuous
line. Recall the binomial distributions graphed in chapter 3, but now imagine the
distributions for ever larger values of N. As N becomes large, the discreteness
matters less. The graphs for the distributions begin to look almost smooth, which

suggests that as N approaches infinity, the distribution might approach a smooth
line, one that could be represented with a continuous function. And if a function
could be found that gave frequencies for the various outcome classes, then the
entire distribution could be generated relatively easily.
Desired is a function of X, symbolized as Y=f(X), that would generate all
possible values of Y. The function would be supplied with the binomial
parameters and produce the appropriate binomial distribution, or at least a close
approximation. It is doubtful that the function would be as simple as Y = 2 + .5X,
which would generate a straight line, or Y = X², a quadratic equation that would
generate a parabola (or U-shaped curve). But whatever the correct function, once
defined, it can be viewed as a generating device. Supplied with continuously
changing values of X, it would produce (i.e., generate) the corresponding values
of Y as defined by the particular function.

7.1 THE NORMAL APPROXIMATION FOR THE BINOMIAL

The person usually credited with first describing a functional approximation for
the binomial distribution is a French mathematician, De Moivre, although it is
doubtful if he fully appreciated the implications of his discovery (Stigler, 1986).
Writing in the 1730s, De Moivre defined a curve that has come to be called the
normal. If the number of trials is large (over 50, for example, but an exact rule of
thumb is discussed subsequently), then the discrete binomial can be
approximated with the continuously varying curve specified by De Moivre.
This has an immediate and practical implication. As you may have noted,
Table A in the statistical tables appendix (Critical Values for the Binomial
Distribution) does not give values for N greater than 50. The reason is, for values
over 50 the normal approximation is usually quite sufficient. The normal
distribution is well known and values for the standard normal distribution (this is
defined in a few paragraphs) are widely available (see Table B, Areas Under the
Normal Curve, in the statistical tables appendix). Thus it seems easier and more
economical to use the normal distribution for any large-sample sign or binomial
tests.
The first step is to standardize the test statistic, which is the number of trials
characterized by the first of the two possible outcomes (e.g., the number of coin
tosses that came up heads) and is symbolized here as T. To standardize T, we
need to know (or assume) values for the three parameters that define its binomial
distribution:

1. N, the total number of trials conducted.


2. P, the probability for the first outcome according to the null hypothesis
under investigation (e.g., that the coin is fair, which means that P, the
probability of a head, would be .5).
3. Q, the probability for the second outcome (e.g., the probability of a tail),
which necessarily equals one minus P (i.e., Q = 1 - P).

In standardizing T, we compare T against all possible outcomes represented
by the binomial distribution whose parameters are the values given for N and P,
hence we are dealing with a population. Recall from the chapter before last
(Equation 5.9) that, for a population, the definition of a standard score (omitting
subscripts, and understanding μ and σ to be the mean and standard deviation for
the T scores) is

Z = (T - μ) / σ (7.1)

This is the standardized value for T, or the binomial test Z score.


In order to compute Z, we need to know both μ and σ. For the binomial
distribution, the mean is

μ = NP (7.2)

This makes intuitive sense. If N were 6 and P were .5, then the mean or expected
value would be 3, meaning three heads out of six tosses. Similarly, if N were 5
and P were .5, the mean value would be 2.5 or if N were 6 and P were .33, the
mean value would be 2. The standard deviation for the binomial distribution is

σ = √(NPQ) (7.3)

This makes less intuitive sense. For now, accept that it is correct, but if you
would like proof, check a mathematically oriented statistics text (e.g., Marascuilo
& Serlin, 1988). In any case, we can now rewrite Equation 7.1, the binomial test Z
score, as follows:

Z = (T - NP) / √(NPQ) (7.4)

If N is large, statisticians have proved that the Z defined by Equation 7.4 will be
distributed approximately normally. That is, if the population parameters are as
claimed, then the sampling distribution for Z should be the normal curve first
described by De Moivre. The parameter N is determined by our procedures, Q is
determined by P, and P is determined by the null hypothesis we wish to test.
Now, and this is the point of the foregoing discussion, we have a way of
determining if the null hypothesis is tenable, given the value actually observed for
T.
For example, if N = 86 and P = .5, we would expect that half of the trials, or
43, would result in successes. If the actual number of successes were 53 then,
applying Equation 7.4:

Z = (53 - 43) / √(86 x .5 x .5) = 10 / 4.64 = 2.16

Five percent of the area under the normal curve is demarcated with a Z value of
1.96 (two-tailed), and 2.16 is bigger than that; thus we conclude that a value as
extreme as 53 would occur by chance alone less than 5% of the time, if the
present sample were drawn from a population whose value for P is .5. (The most
common null hypothesis for binary trials assumes that P = .5 and for that reason
our discussion of the sign or binomial test in chapter 3 was confined to such
cases. Some null hypotheses may demand other values, however, and equation
7.4 can easily accommodate them.)
As already noted, the normal approximation for Z requires a large value of N.
But how large is large? It depends not just on N, but on P as well. The usual rule
of thumb is provided by Siegel (1956). He suggested that if the value of NPQ is at
least 9, then the approximation provided by the normal will fit the true binomial
distribution adequately enough for statistical hypothesis testing. This means that
if P = .5, then N could be as low as 36 (36 x .5 x .5 = 9) but that if P = .8, then N
should be at least 57 (57 x .8 x .2 = 9.12). This rule of thumb, however, only
establishes minimum standards; the approximation becomes better of course for
larger values of N.
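As a concrete check on these computations, here is a brief Python sketch (our own illustration) that standardizes T using Equation 7.4, applies Siegel's rule of thumb, and reproduces the Z of 2.16 from the example above:

    import math

    def binomial_z(t, n, p):
        # Equation 7.4: Z = (T - NP) / sqrt(NPQ), where Q = 1 - P.
        q = 1 - p
        if n * p * q < 9:  # Siegel's rule of thumb for the approximation
            print("Warning: NPQ < 9; the normal approximation may be poor.")
        return (t - n * p) / math.sqrt(n * p * q)

    z = binomial_z(53, 86, .5)
    print(round(z, 2), abs(z) > 1.96)  # 2.16 True: reject, alpha = .05, two-tailed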
The preceding paragraphs have introduced some useful material. If a study
consists of a series of trials (e.g., if several different individuals are assessed), and
if there are only two possible outcomes for each trial (e.g., a patient gets better or
fails to get better), then the tenability of a particular value for P can be evaluated.
This value predicts how many subjects should improve according to the null
hypothesis. The actual number of subjects who improved can then be compared
against this predicted value. The discrepancy between observed and predicted is
standardized (i.e., divided by its standard deviation) and the resulting Z score is
compared against the normal distribution. The assumption that such Z scores
are approximately normally distributed, however, is reasonable only if N is
sufficiently large (specifically, the product NPQ should be greater than nine).
If the computed Z score is extreme, that is, if it falls in the region of rejection
as defined by the alpha level and type of test (one- versus two-tailed), then the
null hypothesis is rejected. We conclude that the sample of study subjects was
probably drawn from a population whose value for P is not the value we derived
from the null hypothesis and then used to compute the Z score. As our
alternative hypothesis suggests, probably the true population P is a different
value. To return to an example used earlier, if 86 patients were tested, and if 53
got better, we would conclude that this excess of 10 over the expected 43 was not
likely a chance happenstance. Similarly, if only 33 got better, 10 less than
expected, we would again conclude that this was likely not just chance (assuming
a two-tailed test). It seems reasonable to conclude, therefore, that the treatment
made a difference.

Exercise 7.1
The Binomial Test using the Normal Approximation
This exercise provides practice in using the large sample normal approximation
for the binomial test.
1. Draw a graph, using your spreadsheet program if you like (this would be
what spreadsheet programs usually call an XY graph). Label the X axis with
values of P from 0 to 1. Label the Y axis N. Using Siegel's rule of thumb,
compute enough values of N for various values of P so you can draw the
graph. There are an infinite number of values for P, you need only select
enough to give you confidence that you have accurately portrayed the shape
of the graph. In drawing lines between the points you plotted, what
assumptions did you make? You probably drew a smooth graph; is this
justified? Should it be jagged instead? Why? What is the shape of your
graph? It should be symmetric; why must it be?
2. The critical values for the normal distribution are:

Two-tailed, alpha = .05: Z= ±1.96


Two-tailed, alpha = .01: Z= ±2.58

Memorize these values; you will need them later as well as for this and the
next few exercises. Now, if N = 40, P= .5, and T= 26, what is the value of
Z? Would you reject the null hypothesis, alpha = .05, two-tailed? Why or
why not? If T= 28? If T= 30? Again, why or why not?
3. Now answer the same questions as in part 2, but for an alpha level of .01.
4. If N = 40, P = .75, and T= 25, would you reject the null hypothesis, alpha =
.05, two-tailed? Why or why not? If T = 35? If T = 23? If T = 37?
5. If P = .5, alpha = .05, two-tailed, and N = 80, what values of T would allow
you to reject the null hypothesis? What values would not allow you to reject?
6. Now answer part 5, but for N = 100 and N = 200.
7. If P = .2, alpha = .05, two-tailed, and N = 80, what values of T would allow
you to reject the null hypothesis? What values would not allow you to reject?

7.2 THE NORMAL DISTRIBUTION


In the preceding section we referred to the normal distribution and showed one
way it could be used, but we have yet to describe it in any detail. There is some
historical justification. Only early in the 19th century did Pierre Simon Laplace
and Carl Friedrich Gauss, in a series of seminal papers, integrate De Moivre's
earlier work, Legendre's method of least squares, Laplace's central limit theorem
(discussed in the next section), and the normal curve into a synthesis that
provided the basis for modern statistics.

Historical Considerations
Primarily problems of astronomy and geodesy (the geological science concerned
with the size and shape of the earth), in addition to sheer intellectual pleasure,
motivated these early statistical theorists. One practical problem was how to
combine several observations of the same astronomical or geodetic phenomenon,
and from consideration of this problem both the method of least squares
(discussed in chap. 5) and the realization that errors were distributed in a
particular way resulted. This distribution later became called the normal or
Gaussian distribution. Intended at first as a description for errors of observation,
only later in the 19th century did investigators realize how common, or normal, it
was. Adolphe Quetelet, in particular, became entranced by the normal curve and
demonstrated repeatedly that it could be found throughout nature. His fame in
statistics, Stigler (1986) wrote, "depends primarily on two hypnotic ideas: the
concept of the average man, and the notion that all naturally occurring
distributions of properly collected and sorted data follow a normal curve" (p.
201).
As noted briefly in chapter 5, Quetelet analyzed chest measurements for
Scottish soldiers. These were normally distributed, which Quetelet interpreted as
evidence that nature was striving for an ideal type, the average man. The ideal is
the mean; deviations from it can only be due to accidental causes, to "errors."
Quetelet is hardly the first person in the history of science to become enamored of
one simple idea to the detriment of others, and the notion that the mean has
some mystical force or moral imperative can still be found. Fortunately, the
normal curve can be understood without recourse to underlying mystical
principles.

The Normal Function


The normal curve is a particular curve. It is not any bell-shaped curve, as
students sometimes think, but a curve defined by a particular function. The
function is not especially simple. It involves pi (π), which equals approximately
3.1416, the base of the natural logarithm (e), which equals approximately 2.7183,
88 INFERRING FROM A SAMPLE: THE NORMAL AND t
and a negative exponent. Like the binomial, the normal curve is actually a family
of curves. The parameters for the normal curve are mu (μ) and sigma (σ), the
population mean and standard deviation, respectively, and each pair of
parameters generates a particular instance of a normal curve. The function that
defines the normal curve is:

Y = (1 / (σ√(2π))) x e^(-(X - μ)² / (2σ²)) (7.5)

(X can range from minus to plus infinity.) One instance of the normal curve is so
useful and is used so often that it has its own name. When the mean is zero and
the standard deviation is one, the resulting curve is called the standard normal.
Substituting μ = 0 and σ = 1 in Equation 7.5 simplifies it somewhat and yields the
definition of the standard normal curve:

Y = (1 / √(2π)) x e^(-Z² / 2) (7.6)

(Again, Z can range from minus to plus infinity.) Any normal distribution can be
transformed into the standard normal just by rescaling the X axis, in effect
forcing the mean to be zero and the standard deviation one. To standardize the
values on the X axis, the difference between each value and the mean would be
divided by the standard deviation. By definition, standardizing produces a
distribution whose mean is zero and whose standard deviation is one. The curve
defined by Equation 7.6 (i.e., the standard normal curve) is depicted in Fig. 7.1.
The function for the normal distribution may appear complex or even
intimidating. In practice, however, this formula is almost never used. Tables
giving the area under the standard normal curve to the left and/or right of
various values of Z are used instead. If you look at the standard normal table
provided here (see Table B in the statistical tables appendix), you will find that
the area to the left of Z = -1 is .1587 and to the left of Z = +1 is .8413. This means
that the area to the right of Z = +1 is .1587, that the area either less than -1 or
greater than +1 is .3174, and that the area between -1 and +1 is .6826. In other
words, if scores are normally distributed, a little over two-thirds of them will be
within a standard deviation of the mean, and just under one-third will be greater
than a standard deviation from the mean.
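Tables are traditional, but the same areas can be computed directly. The standard normal's cumulative area follows from the error function, available in most languages; here is a minimal Python sketch (our own illustration):

    import math

    def phi(z):
        # Area under the standard normal curve to the left of z,
        # computed from the error function (equivalent to Table B).
        return (1 + math.erf(z / math.sqrt(2))) / 2

    print(round(phi(-1), 4))           # 0.1587
    print(round(phi(1), 4))            # 0.8413
    print(round(phi(1) - phi(-1), 4))  # 0.6827; Table B's rounded values
                                       # give .8413 - .1587 = .6826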

Note 7.1
μ As noted earlier, the population mean is usually represented
with a lower case Greek mu and the sample mean with M or
X-bar.
σ The population standard deviation is usually represented
with a lower case Greek sigma and the sample standard
deviation with S or SD.

FIG. 7.1. The standard normal distribution.

Underlying Circumstances
The underlying circumstances that produce normally distributed numbers are
not difficult to understand. Whenever the thing measured (e.g., the chest
circumference of Scottish soldiers, the height of college freshmen, the length of
bats' wings, the IQ of military recruits) is the result of multiple, independent,
random causes, the scores will be normally distributed. That is, their
distribution will fit the form of the function described in the preceding
paragraph. This is not mysterious, but simply a straightforward mathematical
consequence. These circumstances necessarily produce normally distributed
scores. And because so many phenomena are indeed the result of multiple,
independent, random genetic and environmental influences, normally
distributed scores are frequently encountered.
Occasionally students think that, for example, IQ scores are normally
distributed because of the way IQ tests are constructed. This is not true.
Although the mean and standard deviation can be manipulated, the shape of the
distribution cannot be changed. As noted in chapter 5, standard scores are
distributed the same way the scores were distributed before standardization.
Standardization only changes the scale markers on the X axis, replacing the mean
of the raw scores with 0 (100 for IQ scores) and labeling one standard deviation
below and above the mean -1 and +1 respectively (85 and 115 for IQ scores).
To reiterate, standardization does not change the shape of a distribution.
Moreover, we should remember that not all scores are normally distributed. For
example, the distribution for individual annual income is quite skewed: Many
people make very little, and only a few people make a lot. However, when
circumstances are right—when a phenomenon is the result of many independent,
random influences—then scores will be normally distributed. Of course, this is a
theoretical statement. The distribution of a small sample of scores might not
appear normal, just due to the luck of that particular draw or sampling error, but
if enough scores were sampled, the empirical distribution would closely
approximate the theoretically expected normal distribution.

Normal and Rectangular Distributions Compared


Why is any of this useful? Playing a bit with popular imagination, imagine we
have developed a test to detect aliens from outer space, something like the tests
used by British intelligence during World War II to trip up German spies who
were attempting to pass themselves off as English. Assume we know the
theoretical distribution of scores for terrestrials, that is, we know the shape of the
distribution as well as its mean and standard deviation. Now, even if we do not
know how aliens perform, confronted with a single score, we can at least
determine how likely that score is if the person tested is a terrestrial.
Assume, for example, that possible scores for the test are 80, 81, 82 ..., 120,
and each of the 41 possible scores is equally likely. This represents not a normal
but a rectangular distribution. In this case, the probability for a score as high as
120 would be 1/41 or .0244 (one-tailed) and the probability for scores as extreme
as 80 or 120 would be 2/41 or .0488 (two-tailed). The mean for this distribution
is 100 and its standard deviation is 11.8. Thus the standard score for a raw score
of 120 is 1.69. If the scores for the test had been normally distributed instead, the
probability of a standard or Z score as high as 1.69 would have been .0455 (one-
tailed) and the probability of a Z score as extreme as 1.69 would have been .0909
(two-tailed; see Table B in the statistical tables appendix).
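These figures are easy to verify by computer. The Python sketch below (our own illustration) reproduces the mean, standard deviation, tail probabilities, and the standard score of 1.69 for the rectangular distribution just described:

    # The rectangular distribution: scores 80, 81, ..., 120, all equally likely.
    scores = list(range(80, 121))
    n = len(scores)                                          # 41
    mean = sum(scores) / n                                   # 100.0
    sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5
    print(round(sd, 1))                                      # 11.8
    print(round(1 / n, 4), round(2 / n, 4))                  # .0244 and .0488
    print(round((120 - mean) / sd, 2))                       # 1.69, the standard score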
This example was introduced for three reasons. First, it demonstrates that a
standard score by itself does not tell us how probable the corresponding raw
score is: We need to know the shape of the distribution as well. In addition, it
provides yet another demonstration of the logic of hypothesis testing, in which
the probability for a particular value of a test statistic is determined with
reference to the appropriate theoretical sampling distribution. Finally, it lays
some groundwork for the next exercise and for discussion of the central limit
theorem, which is presented in the next section.

Exercise 7.2
The Normal Distribution
Primarily, this exercise provides practice using the table of areas under the
normal curve (Table B in the statistical tables appendix).
1. As in the example just presented, assume M = 100, S = 11.8, but that scores
are normally distributed. What proportion of scores are greater than 105?
Greater than 115? Less than 80? What proportion differ from the mean by
more than 3?
2. For normally distributed scores, what proportion are greater than two
standard deviations above the mean? Greater than three standard
deviations? More extreme than two standard deviations from the mean?
More extreme than three standard deviations?
3. Assuming normal distributions, what are the critical values of Z for alpha =
.05 and for alpha = .01 for one-tailed tests? For two-tailed tests?
4. Why might it make sense to regard any standardized scores less than -3 or
greater than 3 as outliers?
5. For the rectangular distribution just presented, M = 100 and SD = 11.8. What
proportion of the scores are greater than .678 of a standard deviation above
the mean? Are within .678 of a standard deviation from the mean?
6. (Optional) For this rectangular distribution, whose scores ranged from 80 to
120, why does SD= 11.8?

7.3 THE CENTRAL LIMIT THEOREM

What has come to be called the central limit theorem (central in this context
means fundamental, not middle) was first described by Laplace in 1810 and
represented a major advance over De Moivre's work a century earlier. Laplace
proved the following theorem: Any sum or mean (not just the number of
successes in N trials) will be approximately normally distributed if the number of
terms is large. The larger N is, the better the approximation. In theory, as the
limit approaches infinity, the approximation approaches perfection. But long
before that, the approximation becomes good enough to be useful.
The implications for statistical analysis are profound. If we draw a single
score at random, unless we know exactly how those scores are distributed in the
population, we have no basis for determining the probability of drawing that
score. But if we draw a sample of scores and compute the mean for the sample,
we now know how sample means (if not individual scores) should be distributed.
According to the central limit theorem, no matter how the raw scores in a given
population are distributed, sample means will be distributed normally. For
example, no matter whether the distribution of raw scores is rectangular, or
bimodal, or whatever, if we drew thousands of samples of size N, computed the
mean for each, and graphed the distribution, the distribution of the sample
means would approximate a normal distribution.
Given what you learned about the normal distribution in the preceding
section, this makes sense. Earlier we defined the mean as the sum of the raw
scores divided by N (Equation 5.1). But this is simply an economical way of
saying that the mean is a weighted summary score. Each raw score is multiplied
(or weighted) by a weighting coefficient, which for the mean is 1 divided by N for
all scores, and the resulting products are summed. Symbolically, the weighted
summary score formulation for the mean is:

When drawing a sample of N scores, each X can be regarded as representing an


independent, random influence. If this is so, then of course the summary score
(or mean) for these multiple scores will be distributed normally. After all, the
mean simply combines into one score each of the presumably independent,
random separate scores.
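The central limit theorem is also easy to see in a simulation. The Python sketch below (our own illustration) draws 10,000 samples of size 25 from the rectangular distribution used earlier in this chapter, a decidedly non-normal population, and summarizes the resulting sample means:

    import random
    import statistics

    random.seed(1)  # fixed seed so the sketch is reproducible
    means = [statistics.mean(random.randint(80, 120) for _ in range(25))
             for _ in range(10_000)]
    print(round(statistics.mean(means), 1))   # near the population mean, 100
    print(round(statistics.stdev(means), 2))  # near 11.8 / sqrt(25) = 2.37
    # Graphing these means (with the chapter 6 techniques) would show an
    # approximately normal distribution, despite the rectangular raw scores.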

7.4 THE t DISTRIBUTION


As noted earlier in this chapter, for only 4, 5, or even 10 trials, the normal is not a
very good approximation for the binomial distribution. However, as the number
of trials increases, the approximation becomes increasingly better. As a general
rule of thumb, and assuming the value of P is not extremely small or extremely
large, the normal approximation for the binomial is usually good enough to be
practically useful if N is greater than 50, certainly if N is greater than 100. And in
theory, as N approaches infinity, the binomial comes ever closer to being exactly
like the normal.
The same is true for the distribution of sample means. According to the
central limit theorem, sample means are distributed approximately normally, and
the approximation becomes better as the sample size becomes larger. The
general rule of thumb is the same. The normal approximation for the
distribution of sample means is good enough to be practically useful if the sample
size (or N) is greater than 50. And as with the binomial, as N approaches infinity,
the distribution of sample means comes ever closer to being exactly like the
normal.
For smaller numbers of trials, sampling distributions can be constructed for
the binomial, which is exactly what you did for an exercise in chapter 3. The
distribution of sample means for smaller sample sizes is likewise known. The
distribution is called the t distribution and has been extremely important in the
history of statistics. It can be applied to many problems and tests, and indeed
many introductory texts in statistics spend considerable time developing the
many applications of t tests. As noted at the end of chapter 4, we have elected to
emphasize other, more general techniques that accomplish many of the same
ends, but nonetheless students should be familiar with the t distribution. In any
case, when dealing with small samples (i.e., samples for which N is less than 50,
certainly less than 30), it is essential to use the t distribution for solving the two
kinds of problems described in the next two sections: single-sample tests and
95% confidence intervals.
The t distribution itself looks very much like the normal distribution, only
flatter. Imagine the normal distribution is a tent, held up with a single wire. As
the sample size decreases, the wire is lowered and the tent spreads out, assuming
the shape of the t distribution for smaller and smaller N. The tails for the t
distribution extend further than the tails for the normal distribution, so critical
values for t (values that demarcate the most extreme 5% or 1% of the area under
the curve) are larger than the corresponding values would be for the normal
distribution; they become larger as N becomes smaller. For example, the alpha
.05 nondirectional critical value for the normal distribution is 1.96. The
corresponding value for a t distribution with a sample size of 30 is 2.05, for N =
20 is 2.09, and for N = 10 is 2.26. (See Table C, Critical Values for the t
Distribution, in the statistical tables appendix. In these cases, degrees of freedom
are N - 1 so for N = 30, df = 29; for N = 20, df = 19; and for N = 10, df = 9. What
degrees of freedom are and how they are determined is explained in chapter 10.)

7.5 SINGLE-SAMPLE TESTS


An understanding of the central limit theorem and the normal and t distributions
allows you to perform what are usually called single-sample tests, or tests that
determine if a single sample was likely drawn from a particular population.
Consider the lie detection study. The only data available to us are the number of
lies detected for the 10 subjects in the lie detection study sample. From these
sample data, we can compute the sample mean (M = 5.3) and standard deviation
(SD = 2.00). Imagine our null hypothesis states that the sample is drawn from a
population whose mean is 7, so H0: u = 7). Is the sample mean of 5.3 sufficiently
different from the assumed population mean of 7 to render the null hypothesis
that the sample was drawn from this particular population untenable?
The statistic we have in hand is a single sample mean, M. What we want to
know is whether or not M is especially deviant, compared to the distribution of
sample means. In other words, if we drew thousands of samples of 10 scores each
from a population whose mean score is 7, computed the mean for each sample,
and examined the distribution of these sample means, would a value of 5.3 be
near the center of this distribution of values for sample means or far out in one of
the tails? We do know from the central limit theorem that, no matter how the
raw scores were distributed in the population, the distribution of sample means
will be approximately normal, or in this case distributed as t with N - 1 or 9
degrees of freedom. Hence, if we standardize the sample mean, we can evaluate
the resulting standard score using the t distribution, which will tell us how likely
a mean score of 5.3 is, if the population raw score mean is 7.
The sample mean is standardized as follows:

ZM = (M - μM) / σM (7.8)

It is important to remember that we are dealing with means, not with raw
scores. Thus ZM indicates a standard score for means, not raw scores. And in
order to compute it we need to know the mean for the distribution of sample
means (symbolized μM), which conceptually is not the same as the mean for the
scores in the population (symbolized μ), and we need to know the standard
deviation for the distribution of sample means (symbolized σM), which is not the
same as the standard deviation for the scores in the population (which is
symbolized σ).
If we know the mean and standard deviation for scores in the population,
then the mean and standard deviation for the distribution of sample means can
be determined. It seems reasonable to assume (and statisticians can prove) that
the mean for the distribution of sample means is the same as the mean for the
scores in the population. In other words,

μM = μ (7.9)

In the present case, because μ is assumed to be 7 (by our null hypothesis), we can
assume that μM is also 7. In contrast, we would expect the standard deviation for
sample means to be less than the standard deviation for the population scores. If
we draw a sample of 10 scores, for example, some will likely be less than the
mean, and some will be larger, in effect canceling each other out. As a result, the
values for sample means will tend to cluster more tightly around the mean than
the raw scores, often by quite a bit.
The Standard Error of the Mean
What is needed is a way of estimating the standard deviation for sample means
(σM). We know that the standard error of the mean depends on the sample size.
If we draw hundreds of samples of size 20, for example, their means will cluster
more tightly around the population mean than if we had drawn samples of size
10. We also know that the standard error of the mean will be less than the
population standard deviation (σ). Statisticians have proven to their satisfaction
that an accurate estimate of the standard error of the mean is the population
standard deviation divided by the square root of the sample size:

σM = σ / √N (7.10)

However, how do we determine the value of σ in the first place?


For single-sample tests, the value of μ is assumed (determined by the null
hypothesis) but the value of σ is typically estimated from the sample. We already
know how to compute the sample standard deviation (see Equation 5.6). It is the
square root of the quotient of the sum of the deviation scores squared
(symbolized SS for sum of squares) divided by N (the sample size):

SD = √(SS / N) (7.11)

The population standard deviation, as estimated from the sample, however, is not
quite the same (see Equation 5.7). It is the square root of the sum of squares
divided by N - 1:

SD' = √(SS / (N - 1)) (7.12)

Because N - 1 is smaller than N, the estimated value for the population standard
deviation will be somewhat larger than the value computed for the sample
standard deviation. This is explained in chapter 10. For now, note that although
the sample standard deviation or SD is 2.00 (the square root of 40.1 divided by
10) for the 10 data points from the money-cure study, the estimated population
standard deviation or SD' is 2.11 (the square root of 40.1 divided by 9).
Now we can compute the standard deviation for the distribution of sample
means. From Equation 7.10, the standard error of the mean (σM) equals σ
(estimated as 2.11) divided by the square root of N. Thus, letting SD'M represent
the estimated value for the standard error of the mean,

SD'M = SD' / √N (7.13)

For the data from the money-cure study,

SD'M = 2.11 / √10 = .667

(computed retaining unshown significant digits for SD'). Or, combining
Equations 7.10 and 7.12 and simplifying algebraically:

SD'M = √(SS / (N x (N - 1))) (7.14)

Again, the result for our current example is .667 (the square root of the quotient
of 40.1 divided by 90). This formula (Equation 7.14) for the standard error of the
mean is convenient when the sum of the squared deviation scores is readily
available.
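As a quick check that Equations 7.13 and 7.14 agree, here is a short Python sketch (our own illustration) using the SS of 40.1 and N of 10 from the text:

    import math

    SS, N = 40.1, 10
    sd_prime = math.sqrt(SS / (N - 1))              # SD', about 2.11
    print(round(sd_prime / math.sqrt(N), 3))        # Equation 7.13: 0.667
    print(round(math.sqrt(SS / (N * (N - 1))), 3))  # Equation 7.14: 0.667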
Finally we can answer the question posed at the beginning of this section: Is
the sample whose mean is 5.3 likely drawn from a population whose mean is 7?
The standard score representing the deviation of the sample mean from the
assumed population mean (see Equation 7.8) is 5.3 minus 7 divided by 0.667, the
standard error of the mean (recall that SD'M is the estimated value for σM).
Hence:

ZM = (5.3 - 7) / .667 = -2.55

The sample size is small (N = 10); thus ZM will be distributed as t in the preceding
equation, so we refer to Table C, not Table B, in the statistical tables appendix.
According to this table, the critical value for t (alpha = .05, two-tailed) with nine
degrees of freedom is 2.26. Therefore a sample mean as small as 5.3, whose
standardized value is -2.55 (and is distributed as t with nine degrees of freedom),
would occur less than 5% of the time if the population mean were in fact seven.
We conclude that the sample is likely not drawn from the null hypothesis
assumed population.
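The entire single-sample test reduces to a few lines of code. The Python sketch below (our own illustration) reproduces the computation from the summary statistics given above:

    import math

    def single_sample_t(mean, ss, n, mu):
        # Standard error from Equation 7.14, then the test statistic of
        # Equation 7.8: (M - mu) divided by the standard error of the mean.
        se = math.sqrt(ss / (n * (n - 1)))
        return (mean - mu) / se

    t = single_sample_t(5.3, 40.1, 10, 7)
    print(round(t, 2))  # -2.55; exceeds the critical 2.26 (df = 9) in absolute
                        # value, so reject the null hypothesis, alpha = .05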
As a second example of a single-sample test, let us return to our earlier
problem of alien identification, assuming this time that instead of sampling just
one individual, we sample groups of eight instead. In this case, we know that the
population from which we are sampling is homogeneous, either all aliens or all
terrestrials. As before, we know that the population mean and standard deviation
for terrestrials are 100 and 11.8, respectively (in this case, the population
standard deviation is given, not estimated from a sample). If the mean for the
sample is 94, should we reject the null hypothesis (alpha = .05, two-tailed) that
the sample is drawn from a population of terrestrials?
The mean for the distribution of sample means is

μM = μ = 100

and the standard deviation for the distribution of sample means is

σM = σ / √N = 11.8 / √8 = 4.17

Thus the standardized value for the deviation of the sample mean from the
population mean is:

ZM = (94 - 100) / 4.17 = -1.44

Again, because the sample size is small, ZM is distributed as t (and we could have
replaced ZM with t in the equation). According to Table C in the statistical tables
appendix, the critical value for t (alpha = .05, two-tailed) with N - 1 or seven
degrees of freedom is 2.37. Thus a value of -1.44 is not sufficient to reject the
null hypothesis. Probably the sample consists of terrestrials.

Note 7.2
M The sample mean. Computed by summing the scores in the
sample and dividing by N.
μ The population mean. It can be estimated by the sample
mean, or its value can be assumed based on theoretical
considerations.
μM The mean of the distribution of sample means. It can be
estimated by the population mean.
SD The sample standard deviation. Computed by taking the
square root of the sum of the squared deviation scores
divided by N.
σ The population standard deviation. It can be estimated by
taking the square root of the quotient of the sum of the
squared deviation scores divided by N - 1. This estimate is
symbolized SD'.
σM The standard error of the mean, or the standard deviation for
the distribution of sample means. It can be estimated by
dividing σ by the square root of N, or by dividing SD' by the
square root of N, or by taking the square root of the quotient
of the SS divided by N times N - 1. This estimate is
symbolized SD'M.
t A test statistic similar to Z but for small samples. The shape
of the t distribution is similar to the normal distribution but
flatter. As a result, critical values for t are larger than
corresponding critical values for Z.

Exercise 7.3
Single-sample tests
This exercise provides practice in single-sample tests and in computing the
standard error of the mean.
1. If the alien identification sample size were 100 instead of 8, what would be
the value for the standard error of the mean? What is the critical value for t
(alpha = .01, two-tailed)? Would you conclude that the sample consisted of
aliens or terrestrials?
2. Assume the sample size for the money-cure study is sufficiently large that
you decide to use the standard normal (Table B) instead of the t table (Table
C). Assume further that the standard error of the mean retains the value
.667, so that ZM retains the value -2.55 computed earlier. What percentage
of the time would a sample mean as low as 5.3 occur if the population mean
really were 7?
3. A sample consists of six scores: 32, 25, 45, 38, 52, and 29. How likely is it
that this sample was drawn from a population whose mean is 26? Assume a
two-tailed test, alpha = .05. What would your decision be if alpha were .01?
4. A second sample likewise consists of six scores: 32, 25, 97, 38, 52, and 29.
How likely is it that this sample was drawn from a population whose mean is
26 (alpha = .05, two-tailed)?
5. As in the last example given in the text, assume the population mean is 100
and the population standard deviation is 11.8. If the sample size were 16
(instead of 8), what values for the sample mean would allow you to reject the
null hypothesis (alpha = .05, two-tailed)? If the sample size were 32? 64?
As sample size gets larger, what happens to the value of the standard error
of the mean?
6. The standard error of the mean is often used in conjunction with bar graphs
to indicate variability. Imagine you have selected two samples, one of five
boys and one of seven girls, and have determined their weights. The mean
weight is 72 pounds for the boys and 64 pounds for the girls. The population
standard deviations, estimated from the sample data, are 9 for the boys and
8 for the girls. Compute standard errors of the mean for both the boys and
the girls. Then create a bar graph. One bar will represent the boys, one the
girls; the Y axis will indicate pounds. From the middle of the top of each bar
draw two lines, one extending up and one extending down, each marking off
exactly one standard error of the mean. Draw small horizontal lines to
demarcate the top and bottom of these error bars. Adding error bars to a
conventional bar graph is desirable because of the graphic way they indicate
variability—and remind the viewer that the top of the bar is just the mean
obtained from one sample, but other samples might locate the mean
differently. (See Fig. 7.2 for an example.)
7. (Optional) Using the functional definition for the standard normal distribution
(Equation 7.6), set up a spreadsheet to compute Y values for values of X that
range from -4.0 to +4.0 in steps of 0.2. Now graph this normal distribution
(using your spreadsheet's graphing capability, if it has one). Your graph
should look like Fig. 7.1. Note that probability is associated with the area
under the curve; the exact height of the curve at various points is not
especially meaningful in this context.
8. (Optional) Starting with Equation 7.10, provide an algebraic derivation for
Equation 7.14.
9. (Optional) A third definition for the estimated standard error of the mean or
SD'M (in addition to Equations 7.13 and 7.14) is the sample standard
deviation divided by the square root of N - 1. Again starting with Equation
7.10, provide an algebraic derivation for this formula.

At this point, a confession is in order. In actual practice, single-sample tests
are rarely used. They were introduced here, as they typically are in statistical
texts, because of the particularly clear way they demonstrate principles of
statistical inference and the use of the normal and t distributions. The standard
error of the mean, however, is often used in conjunction with bar graphs to
indicate variability, as is demonstrated in the next exercise.

FIG. 7.2. A bar graph of the group means described in Exercise 7.3 showing
error bars. The distance the error bars extend above and below a particular
group mean represents an estimate of the standard error of the mean
computed from data for that group.

7.6 NINETY-FIVE PERCENT CONFIDENCE INTERVALS

Usually error bars indicate one standard error of the mean above and below the
sample mean, but other intervals could be used. A particularly useful one is the
95% confidence interval. The interval is based on data from a single sample, but
if we drew sample after sample repeatedly, 95% of the time the true population
mean should fall within the indicated confidence intervals. Confidence intervals,
whether or not displayed graphically, are valuable for the same reason that error
bars are valuable: They serve to remind us that information gleaned from sample
data provides probabilistic and not absolute information about the population.
We may never know the value of the population mean with absolute certainty, but
at least we can say that there is a 95% chance that the true population mean falls
within the interval indicated.
You know from the central limit theorem that sample means are distributed
approximately normally and means for small samples follow the t distribution.
You also know that the critical values for t, alpha = .05, two-tailed, given in Table
C in the statistical tables appendix, demarcate the 5% of the area under the curve
in the tails from the 95% of the area in the center under the hump, and that for
large samples, this critical value is 1.96 (from Table B, the standard normal
table). The t values (or Z values, for large samples) are standardized, which is
why you divided by the standard error of the mean when computing them. In
order to compute 95% confidence intervals, you need to reverse the process,
multiplying (instead of dividing) by the standard error of the mean. Remember
that t or Z scores are in standard units. By definition, the number of raw score
units that correspond to one standard unit is the value of the standard error of
the mean. Thus the upper 95% confidence interval, expressed in whatever units
were used for the raw scores in the first place, is the mean plus the standard error
of the mean multiplied by 1.96 for large samples, or multiplied by the appropriate
value from Table C for small samples; the lower 95% confidence interval is the
mean minus the standard error of the mean times the appropriate value.
Symbolically (where CI stands for confidence interval),

95% CI = M ± SD'M x t(df).05,two-tailed

Thus

95% error bar = SD'M x t(df).05,two-tailed

Lower 95% confidence limit = M - SD'M x t(df).05,two-tailed

Upper 95% confidence limit = M + SD'M x t(df).05,two-tailed


As an example, consider the samples of boys and girls shown in Fig. 7.2. The
standard error of the mean, estimated from sample data, was 4.02 for the boys
and 3.02 for the girls, whereas the mean weights were 72 pounds for the five
boys, 64 for the seven girls. The confidence intervals for the boys are:

95% error bar = SD'M x t(4) = 4.02 x 2.78 = 11.2

Lower 95% confidence limit = M - SD'M x t(4) = 72 - 11.2 = 60.8

Upper 95% confidence limit = M + SD'M x t(4) = 72 + 11.2 = 83.2


For the girls the confidence intervals are:

95% error bar = SD'M x t(6) = 3.02 x 2.45 = 7.4

Lower 95% confidence limit = M - SD'M x t(6) = 64 - 7.4 = 56.6

Upper 95% confidence limit = M + SD'M x t(6) = 64 + 7.4 = 71.4
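
These computations are easy to check outside a spreadsheet as well; the
following is a minimal Python sketch (our illustration, assuming the SciPy
library is available for the critical t values):

from scipy import stats

def ci95(mean, se, df):
    """Return the lower and upper 95% confidence limits for a sample mean."""
    t_crit = stats.t.ppf(0.975, df)   # two-tailed critical t, alpha = .05
    half_width = se * t_crit          # the 95% error bar
    return mean - half_width, mean + half_width

print(ci95(72, 4.02, df=4))   # boys:  approximately (60.8, 83.2)
print(ci95(64, 3.02, df=6))   # girls: approximately (56.6, 71.4)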


Often researchers want to know whether the means for two groups are
significantly different. For example, is the average weight for the girls different
from the average weight for the boys? Techniques designed to answer such
questions are introduced in chapter 10. For now, it is worth noting that because
the 95% confidence interval for the boys includes within its range the mean value
for the girls, and the girls' interval falls only just short of the boys' mean, it is
unlikely that the difference between their mean weights is statistically significant
(at the .05 level).

Exercise 7.4
Ninety-five Percent Confidence Intervals
This exercise provides practice in computing 95% confidence intervals. Fig. 7.3
illustrates the results.
1. Compute the 95% confidence interval for the population mean using the
sample of six scores from part 3 in the last exercise.
2. Compute the 95% confidence interval for the population mean using the
sample of six scores from part 4 in the last exercise.
3. Do the results of the single-sample tests performed for the last exercise
agree with the 95% confidence intervals computed for this exercise?

FIG. 7.3. A bar graph of the group means described in Exercise 7.4,
showing 95% confidence intervals.

Exercise 7.5
SPSS: Single-sample tests and confidence intervals
This exercise provides practice in using SPSS to conduct single-sample tests.
You will use the data from Exercise 7.3, parts 3 and 4, to determine whether each
sample is drawn from a population with a mean of 20.
1. Open SPSS and create a variable called scores. Enter the data from
Exercise 7.3, part 3.
2. Select Analyze->Compare Means->One-Sample T Test from the main
menu. Move the scores variable to the Test Variable(s): window. Enter 20 in
the Test Value window. Click on OK.
3. Examine the output. Are the values for the standard deviation and the
standard error the same as the values you calculated in Exercise 7.3? What
is the significance level of the t test? Do you conclude that the sample was
drawn from a population with a mean of 20 at the alpha = .05 level? What
would your decision be if alpha were .01?
4. Conduct a single-sample test using the data from Exercise 7.3, part 4. Why is
this test not significant at the .05 level even though this sample's mean is
higher than the mean of the part 3 data?
5. To compute confidence intervals for the scores variable, select
Analyze->Descriptive Statistics->Explore from the main menu. Move the
scores variable to the Dependent List window and click OK. The 95%
confidence interval for the mean will appear in the Descriptives box. If you
would like to calculate confidence intervals based on a value other than 95%,
click the Statistics button in the Explore dialog box and change the 95 to
another value in the Confidence Interval for Mean window.
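
For readers who want to check the SPSS output independently, the same
single-sample t test can be sketched in Python with SciPy. The scores below
are placeholders only; substitute the actual six values from Exercise 7.3,
part 3.

from scipy import stats

scores = [18, 21, 24, 25, 26, 28]   # hypothetical placeholder data
t_stat, p_value = stats.ttest_1samp(scores, popmean=20)
print(t_stat, p_value)   # reject H0 at the .05 level only if p_value < .05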
Some of the material presented in this chapter is more important
conceptually than practically. Practically, knowing how to present figures with
error bars representing the standard error of the mean, or 95% confidence
intervals, allows you to present your data visually. It is often useful to know how
to approximate a binomial with a normal distribution. It is also important, as a
matter of basic statistical literacy, to have some familiarity with the t distribution.
Single-sample tests, in particular, are probably more important conceptually
than practically. Still, their principles provide a particularly clear
foundation for the material to follow. This chapter, in fact,
represents something of a turning point. Having finished with basic topics like
hypothesis testing and simple descriptive statistics, we now turn to the topic that
will occupy us for the remainder of the book: accounting for and analyzing
variance.
8 Accounting for Variance:
A Single Predictor

In this chapter you will:


1. Learn how to determine empirically the best-fit line (the line that, when
used as a basis for predicting one variable from another, minimizes the
error sum of squares).
2. Learn what the phrase accounting for variance means.
3. Learn to distinguish measures that indicate the exact nature of the
relation between two variables from measures that indicate the strength
of their relation.

In this chapter we begin laying the groundwork required for most of the analytic
techniques described in the remainder of this book. Many attributes of interest
to behavioral scientists are measured with interval, ratio, or at least ordinal
scales. In such cases, almost all specific research questions can be reduced to a
single general form. The general question is, if the values of some quantifiable
attribute (a dependent or criterion variable measured on a quantitative scale)
vary from subject to subject, can that variability be explained, or accounted for,
by research factors identified by the investigator (independent variables
measured on either qualitative or quantitative scales)? In other words, does the
independent variable have an effect on the dependent variable? Is the number of
lies the expert detects affected by the subject's mood, for example, or by whether
or not the subject received a drug? The techniques described in this chapter
explain how such questions can be answered.

8.1 SIMPLE REGRESSION AND CORRELATION

Historically, both simple regression and correlation (one criterion and one
predictor variable) and multiple regression and correlation (one criterion and
multiple predictor variables; often abbreviated as MRC) have been taught as
techniques for nonexperimental data exclusively. Indeed, the term correlational
has come to signify a nonexperimental or observational study. The only reason
for the segregation of MRC into the nonexperimental world, and the dominance
of analysis of variance or ANOVA in the experimental world, however, is
historical precedent and tradition. There is no compelling statistical basis for this
bifurcation (Cohen, 1968).
In fact, there is much to be gained by a more unitary approach. From the
point of view of the general linear model, analysis of variance is simply a subset of
multiple regression and correlation. Giving prominent play to this fact, as this
book does, simplifies learning considerably. As noted in chapter 1, there are
fewer topics to master initially and those topics possess considerable generality
and power. The unified approach lets you see the forest as a whole without being
tripped up by individual trees.
Two numeric examples are presented in this chapter and the next. In both
cases, the quantitative dependent variable accounted for is number of lies
detected. In one case, the independent variable is quantitative (mood, used in
this chapter), and in the other, qualitative (drug vs. no drug, used in chap. 9).
These two examples are used first to demonstrate the basic concepts required to
analyze such data, and second to demonstrate that no matter whether the
predictor variable is quantitative (mood score) or qualitative (drug group), its
effect can be analyzed using multiple-regression procedures. But remember,
whether qualitative or quantitative, throughout this chapter we assume just a
single predictor. Methods for dealing with two or more predictors are deferred
until chapter 11.

Finding the Best-Fit Line


In chapter 5, you learned how a set of scores (like the number of lies detected for
the 10 subjects) could be described. In particular, you learned how, using the
method of least squares, a single point (the mean) could be fit to the set, and how
the variability of the scores around the mean could be described (using the
variance and the standard deviation). Research questions rarely involve just one
variable, however. Typically we want to know, at the very least, how much one
variable (the independent or predictor variable) affects another (the dependent
or criterion variable). For example, imagine that before determining the number
of lies detected we had asked our subjects to fill out a mood scale (higher scores
indicate a better mood). We do this because we want to find out whether subjects
who are in a good mood (and who perhaps are less guarded) are more likely to
have their lies detected.
The question now is, what relation, if any, is there between mood and
number of lies detected? In other words, if we know a subject's mood, can we
predict the number of lies detected more accurately than if we did not know that
individual's mood score, if we only knew the average number of lies detected
irrespective of mood? In order to answer this question, we would first postulate a
prediction equation that ignores mood, and a second one that takes mood into
account, and then compare the two equations. If mood affects the number of lies
detected, then the second equation should provide a better fit to the data (i.e.,
should make more accurate predictions) than the first. However, if mood has no
effect, then there should be little difference between the predictions made by the
two equations.
In chapter 5 (Equation 5.3), we described a prediction or regression equation
that consisted of a single constant:

Y' = a

This prediction equation contains only a constant and no variables, so the
predicted scores will be the same for all subjects. Of particular interest is the
equation for which a is set equal to the mean:

Y' = MY     (8.2)

When the mean for a group of subjects serves as the predicted score for each
subject, then the sum of the squared deviation scores (SS, or the sum of
differences between observed and predicted scores for each subject) will be the
minimum value possible for that group's data. This is the total SS, which is
defined as the sum of the squared differences between each raw score and the
group's mean. Similarly, the variance for the sample (or the total variance) is
defined as the total SS divided by the sample size or N. The preceding sentences
are nothing more than a restatement of material first presented in chapter 5.
Recall, for example, that for the lie detection study the mean number of lies was
5.3, the SS was 40.1, the number of subjects was 10, and so the sample variance
was 4.01.
It is this variance, the SS divided by N, to which accounting for variance
refers, and it is this variance or at least portions of it that we hope to account for
with our research factor (or factors). The first prediction equation (Equation 8.2)
considered only the mean. A second prediction equation, one that takes the
values of a single predictor variable into account and assumes a linear relation
between the predictor and the criterion variable, is

Y' = a + bX     (8.3)

Following the usual convention, Y is used for the DV or criterion variable and X
for the IV or predictor variable. This regression equation generates an expected
or predicted score for each subject equal to the sum of a (which is called the
regression constant) and the product of b (which is called the regression
coefficient) times the value of the variable X for that subject; thus, the value of
the predictor variable is multiplied or "weighted" by b. Graphically, Equation 8.3
is an equation for the straight line defined by a and b. The regression constant, a,
is the Y intercept of the line—that is, it is the value of Y at the point the line
crosses the Y axis. The regression coefficient, b, is the slope of the line—that is, it
is the ratio of rise (a vertical distance) to run (a horizontal distance) for the line.
If values of Y tend to increase as values of X increase, the line will tilt up to the
right and the value of b will be positive. If values of Y tend to decrease with
increases in X, however, then the line will tilt down to the right and the value of b
will be negative.
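
As a concrete sketch of Equation 8.3 in Python (using, purely for
illustration, the starting values a = 2.5 and b = 0.5 that appear in the next
exercise rather than best-fit values):

def predict(x, a=2.5, b=0.5):
    # Y' = a + bX: the regression constant plus the weighted predictor
    return a + b * x

print(predict(5.5))   # predicts 5.25 lies for a mood score of 5.5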
The mood scores for the 10 subjects in the lie detection study are given in Fig.
8.1. An obvious question at this point is, how do we even know if an equation
that attempts to fit a straight line to these data is reasonable? The best advice is,
first graph the data. Prepare a simple scattergram like that shown in Fig. 8.2
and decide, on the basis of visual inspection, whether the data appear to reflect a
roughly linear relationship—that is, as a general rule, are the increases (or
decreases) along the Y axis about the same anywhere along the X axis? (The
graph in Fig. 8.2 was prepared with Excel. If your spreadsheet has a similar
graphing capability, your next exercise should be to prepare a similar graph.)
If number of lies detected were perfectly predicted by mood, then these
points would all fall on a straight line. In this case (and in most cases
encountered in the behavioral sciences) they do not form a straight line. On the
other hand, there appears to be some orderliness to these data, a general
tendency for higher mood scores to be associated with more lies detected; thus, it
makes some sense to assume linearity and proceed with Equation 8.3. We could
even attempt to fit a line to these data points visually. As an exercise, you should
draw a straight line on Fig. 8.2 that you think provides the best fit for these data.

FIG. 8.1. Mood data for the lie detection study.

Once you have drawn a straight line, you can figure out the values of a (the Y
intercept) and b (the slope) directly from the graph and then use these values and
Equation 8.3 to compute Y', the number of lies predicted for each subject based
on that subject's mood score. Alternatively, you could read the predicted values
directly from the graph. To do so, you would locate each subject's mood score on
the X axis, draw a vertical line from it to your fitted line, and then extend a
horizontal line from that point over to the Y axis. The predicted number of lies is
the Y value at the point where the horizontal line crosses the Y axis. And now
that you know both observed and predicted scores for each subject, you can
compute each subject's error or residual score, which is the difference between
the number of detected lies actually observed and the number predicted by the
equation.
For example, three lies were detected for the first subject. This individual's
mood score was 5.5. The predicted Y value (number of lies) will be the height of
your visually fitted line above the X axis at the point on the X axis where X (mood
score) equals 5.5. If this height is 5, then the error of prediction (or the residual)
for the first subject would be 3 (the observed score) minus 5 (the predicted
score), which equals -2. Earlier you computed a total sum of squares (the sum of
the squared differences between observed scores and the group mean). You can

FIG. 8.2. Scattergram for predicting number of lies detected from mood.
now compute an error sum of squares. As you would expect, it is the sum of the
squared differences between observed scores and the scores predicted by the
Equation 8.3 regression line. (And, as demonstrated in the next exercise, the
model sum of squares is the sum of the squared differences between the scores
predicted by the regression line and the group mean.)
In your attempt to find the best-fit line, you might draw several slightly
different lines. Each would be uniquely specified by the values for its intercept
and slope, but only one line would be best, in the sense that the sum of the
squared deviations or errors (the error sum of squares) would be minimized.
This best-fit line could be determined empirically, which is the purpose of the
next spreadsheet exercise.

Note 8.1
a The regression constant. When Y, the criterion variable, is
regressed on X, a single predictor variable, a is the Y
intercept of the best-fit line.
b The simple regression coefficient. Given two variables, b is
the slope of the best-fit line.

Exercise 8.1
The Error Sum of Squares
The template developed for this exercise computes the error sum of squares (as
well as the total and model sums of squares) for specified values of the
regression constant a and the regression coefficient b. This allows you to
determine the best values for these two regression statistics. You may find it
convenient to begin with the spreadsheet shown earlier in Fig. 5.4 and modify it
according to the following instructions.

General Instructions
1. Establish columns (columns like these will be used repeatedly in the
exercises to follow), labeled appropriately and containing the appropriate
data or formulas, for:
a. The subject number.
b. The number of lies detected (Y).
c. The mood score (X).
d. The predicted score (Y' = a + bX).
e. The deviation between Y and the Y mean (y = Y - MY).
f. The deviation between Y' and the Y mean (m = Y' - MY).
g. The deviation between Y and Y' (e = Y - Y').
h. y*y, or the square of y; their sum is SStotal.
i. m*m, or the square of m; their sum is SSmodel.
j. e*e, or the square of e; their sum is SSerror.
2. Establish rows that compute sums, counts, and means for the columns as
appropriate. Enter a formula for the SDY. Reserve space for the regression
statistics a and b.
3. Experiment with various values for a and b and note the effect on the various
deviation scores and sums of squares. What exactly do the deviation scores
represent? Finally, enter 2.5 for a and 0.5 for b. Do you get the values
shown in Fig. 8.3?
Detailed Instructions
1. In row 2, label the columns as follows:
Label Col Meaning
s A The subject number or index; ranges from 1-10.
Y B The observed number of lies detected (the DV).
X C The mood score (the IV).
Y' D The predicted number of lies detected.
Y-My E The difference between the observed Y score and the mean
Y score (represented here with a lower case y).
Y'-My F The difference between the predicted Y score and the mean
Y score (represented with m for model).
Y-Y' G The difference between the observed Y and the predicted Y
score (represented with e for error or residual).
y*y H The square of the difference between the observed raw
score and the mean.
m*m I The square of the difference between the score predicted by
the model and the mean.
e*e J The square of the difference between the raw score and the
one predicted by the model.
2. To remind ourselves what the sums of the squares in columns H-J
represent, in row 1 label columns H-J as follows:
Label Column Meaning
SStot H Total sum of squares--the sum of the differences between
the raw scores and the mean, squared and summed. (On
some previous spreadsheets this same quantity was labeled
SSY.)
SSmod I Model sum of squares or SS explained by the regression
equation (or model)--it is the differences between the
predicted scores and the mean, squared and summed.
SSerr J Error or residual sum of squares or SS unexplained by the
regression equation--it is the differences between observed
and predicted scores, squared and summed.
3. In addition, enter the labels "Lies" in cell B1, "Mood" in cell C1, "y=" in cell E1,
"m=" in cell F1, and "e=" in cell G1.
4. In column A, label rows 13-16 as follows:
Label Row Meaning
Sum= 13 The sum of the scores in rows 3-12.
N= 14 The number of scores in rows 3-12.
Mean= 15 The mean of those scores.
a,b= 16 The values for the regression statistics (a in column B; b in
column C).
5. Enter the observed lie and mood scores in columns B and C, rows 3-12.
6. For the time being, enter a value of 2.5 for a (the Y intercept) in cell B16 and
a value of 0.5 for b (the slope) in cell C16.
7. Enter a formula for the predicted value in column D, cells 3-12. The
predicted value is a (cell B16) plus b (cell C16) times this subject's mood
score (X, in column C).
8. In cells E3-E12, enter a formula for subtracting the mean number of lies
detected (MY, in cell B15) from the observed Y score (Y, in column B).
9. In cells F3-F12, enter a formula for subtracting MY from the predicted Y
score (Y,' in column D).
10. In cells G3-G12, enter a formula for subtracting Y' from the raw score
observed for Y (column B).
11. In cells H3-H12 enter a formula for the total sum of squares (Y-MY, in
column E, multiplied by itself).
12. In cells I3-I12, enter a formula for the model or regression sum of squares
(Y'-MY, in column F, multiplied by itself).
13. In cells J3-J12, enter a formula for the error or residual sum of squares
(Y-Y', in column G, multiplied by itself).
14. In cells B13-J13, enter a function that sums the entries in rows 3-12.
15. In cells B14-J14, enter a function that counts the entries in rows 3-12.
16. Enter the label "VAR=" in cell G15 and the label "SD=" in cell G16.
17. In cells B15-D15 and H15-J15 enter a formula that computes the mean for
the column. In cell H16 enter the formula for the standard deviation for the Y
scores (the square root of the Y variance). The means for the SStotal, SSmodel,
and SSerror (columns H-J) are total Y score, model, and error variances,
respectively.
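
For readers working outside a spreadsheet, a minimal Python sketch of the
same computation follows; the data and the starting values of a and b are
those used in this exercise.

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]               # Y, the criterion
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]     # X, the predictor
a, b = 2.5, 0.5

mean_y = sum(lies) / len(lies)
pred = [a + b * x for x in mood]                    # Y' = a + bX
ss_total = sum((y - mean_y) ** 2 for y in lies)     # 40.1
ss_model = sum((yp - mean_y) ** 2 for yp in pred)   # about 14.16
ss_error = sum((y - yp) ** 2 for y, yp in zip(lies, pred))   # about 30.31
print(ss_total, ss_model, ss_error)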

At this point, your spreadsheet should look like the one shown in Fig. 8.3. Of
particular interest in this spreadsheet are the errors, which are given in column
G. These are the deviations of the observed values from those predicted by the
line whose intercept is 2.5 and whose slope is 0.5. This particular straight line,
however, was only an initial guess. It may or may not be the best-fit line. One
way to find out is by trial and error. Try other values of a and b until you find
those that yield a lower error sum of squares (cell J13) than the others. These
values then define the best-fit line.
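
The trial-and-error search is also easy to automate. The toy grid search
below (our illustration, not a required part of the exercise) scans candidate
values of a and b and keeps the pair with the smallest error sum of squares:

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]

best = None
for a_tenths in range(20, 41):            # a from 2.0 to 4.0 by 0.1
    for b_hundredths in range(30, 61):    # b from 0.30 to 0.60 by 0.01
        a, b = a_tenths / 10, b_hundredths / 100
        ss_err = sum((y - (a + b * x)) ** 2 for y, x in zip(lies, mood))
        if best is None or ss_err < best[0]:
            best = (ss_err, a, b)
print(best)   # about (28.86, 3.0, 0.47) for this grid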

Exercise 8.2
Finding the Best-Fit Line Empirically
The template shown in Fig. 8.3, which was developed during the last exercise, is
used again for this exercise. Only the values for the parameters a and b are
changed. The purpose is to find the best fit line by trial and error.
1. By a process of trial and error, and guided by an inspection of the
scattergram in Fig. 8.2, replace values for a and b in the previous
spreadsheet until you find values that give the smallest value you can
discover for the sum of the squared differences between observed and
predicted values (SSerror or error sum of squares). For the time being, limit
your search to numbers that have only two significant digits.
2. Now set a to 3.0 (if this was not the value you ended up with in number 1)
and try to find a value for b that minimizes the error sum of squares. Is this
sum smaller than the one you found before?
3. Now set a to 3.0 and b to 0.47 (if this was not the value you ended up with in
number 2). Are you able to find other parameter values that yield a smaller
sum of squares than these?
At this point your spreadsheet should look like the one shown in Fig. 8.4.
The values computed for the various sums of squares (total, model, and error) are
based on values of 3.0 and 0.47 for a and b, and although these values are used
for discussion purposes in the next several paragraphs, they are not quite correct.
As becomes evident, more than two significant digits are required to express the
exact values for these regression statistics. Still, several aspects of this last
spreadsheet deserve comment. As you would expect, the sums and means of the
raw Y scores and the predicted Y scores are almost the same. Also as you would
expect, the sums of the deviation scores based on the predicted score (predicted
minus mean, raw minus predicted) are close to zero. If we had found the best
values for a and b, not just values that were close, the means for the raw and the
predicted scores would be identical and the sums of these two deviation scores
would be exactly zero.
An error sum of squares of 28.862625 (Fig. 8.4) is better than 30.3125
(Fig. 8.3), but it is still not the minimum possible. These are more accurate
values, given to considerably more precision than is needed, not the rounded
values of 28.86 and 30.31 displayed in the figures.
displayed by your spreadsheet depend on the format you specify and the width of
the column you select, but that values as accurate as your computer allows are
retained and used by the spreadsheet program.
Also note that, for each subject, the raw minus the predicted (the error
deviation score or Yi - Yi') and the predicted minus the mean (the model
deviation score or Yi' - MY) sum to the total deviation score (Yi - MY), as they
must and as you can prove algebraically. That is, for rows 3-12, the score in
column E is the sum of the scores in columns F and G. This will be true no matter
what the values of a and b are; it follows from the way the deviation scores are
defined.
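
Spelled out, the algebraic proof mentioned above takes one line: for any
subject,

(Y' - MY) + (Y - Y') = Y - MY,

because the two Y' terms cancel when the left side is expanded, whatever the
values of a and b.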

     A     B      C      D      E      F      G      H      I      J
1          Lies   Mood          y=     m=     e=     SStot  SSmod  SSerr
2    s     Y      X      Y'     Y-My   Y'-My  Y-Y'   y*y    m*m    e*e
3    1     3      5.5    5.25   -2.3   -0.05  -2.25  5.29   0.002  5.063
4    2     2      2      3.5    -3.3   -1.8   -1.5   10.89  3.24   2.25
5    3     4      4.5    4.75   -1.3   -0.55  -0.75  1.69   0.303  0.563
6    4     6      3      4      0.7    -1.3   2      0.49   1.69   4
7    5     6      1.5    3.25   0.7    -2.05  2.75   0.49   4.203  7.563
8    6     4      6.5    5.75   -1.3   0.45   -1.75  1.69   0.203  3.063
9    7     5      3.5    4.25   -0.3   -1.05  0.75   0.09   1.103  0.563
10   8     7      7      6      1.7    0.7    1      2.89   0.49   1
11   9     7      6      5.5    1.7    0.2    1.5    2.89   0.04   2.25
12   10    9      9      7      3.7    1.7    2      13.69  2.89   4
13   Sum=  53     48.5   49.25  0      -3.75  3.75   40.1   14.16  30.31
14   N=    10     10     10     10     10     10     10     10     10
15   Mean= 5.3    4.85   4.925                VAR=   4.01   1.416  3.031
16   a,b=  2.5    0.5                         SD=    2.002
17
FIG. 8.3. Spreadsheet for finding the best-fit line by trial and error: results
when a = 2.5 and b = 0.5.
However, only for the best-fit line will the model sum of squares and the
error sum of squares add up exactly to the total sum of squares. That is, for the
best-fit line,

SStotal = SSmodel + SSerror     (8.4)

The total sum of squares for the present example is 40.1 (see Fig. 8.4). The model
(or regression, or explained) sum of squares and the error (or residual, or
unexplained) sum of squares would add up exactly to 40.1 if we had found the
best fit values for a and b. In this case, they sum to 40.13825 (using nonrounded
values), a value just slightly over 40.1. That means we are close but still not
exactly on target with our guesses for a and b.

Exercise 8.3
Deviation Scores
The purpose of this exercise is to demonstrate graphically how the various
deviation scores are formed and how they are related.
1. On a piece of graph paper, graph the mood/lies data, that is, reproduce the
scattergram shown in Fig. 8.2. Be sure to label the X and Y axes
appropriately. By convention, the dependent variable is usually graphed on
the Y axis.
2. Draw the line whose Y intercept is 3.0 and whose slope is 0.47. Two points
uniquely define a straight line; consequently, an easy way to do this is to
compute values of Y associated with two arbitrarily chosen X values using
the formula Y' = 3.0 + 0.47X.

     A     B      C      D      E      F      G      H      I      J
1          Lies   Mood          y=     m=     e=     SStot  SSmod  SSerr
2    s     Y      X      Y'     Y-My   Y'-My  Y-Y'   y*y    m*m    e*e
3    1     3      5.5    5.585  -2.3   0.285  -2.59  5.29   0.081  6.682
4    2     2      2      3.94   -3.3   -1.36  -1.94  10.89  1.85   3.764
5    3     4      4.5    5.115  -1.3   -0.19  -1.12  1.69   0.034  1.243
6    4     6      3      4.41   0.7    -0.89  1.59   0.49   0.792  2.528
7    5     6      1.5    3.705  0.7    -1.6   2.295  0.49   2.544  5.267
8    6     4      6.5    6.055  -1.3   0.755  -2.06  1.69   0.57   4.223
9    7     5      3.5    4.645  -0.3   -0.66  0.355  0.09   0.429  0.126
10   8     7      7      6.29   1.7    0.99   0.71   2.89   0.98   0.504
11   9     7      6      5.82   1.7    0.52   1.18   2.89   0.27   1.392
12   10    9      9      7.23   3.7    1.93   1.77   13.69  3.725  3.133
13   Sum=  53     48.5   52.8   0      -0.2   0.205  40.1   11.28  28.86
14   N=    10     10     10     10     10     10     10     10     10
15   Mean= 5.3    4.85   5.28                 VAR=   4.01   1.128  2.886
16   a,b=  3      0.47                        SD=    2.002
17
FIG. 8.4. Spreadsheet for finding the best-fit line by trial and error: results
when a = 3 and b = 0.47.
3. Draw arrows representing each deviation score. There are three kinds of
deviation scores: total (Y- MY), model (Y'- MY), and error (Y- Y'). Arrows
will begin at points where Y = MY for the total and model deviations and at
points where Y= Y' 'for the error deviations. They will go up (+) or down (-)
depending on the sign of the deviation scores. Their associated mood
scores will determine how far away from the Y axis they are. For example,
for subject 3 the mood score is 4.5 and the deviation between raw score and
mean is 4 - 5.3, which equals -1.3. Therefore draw an arrow beginning at
the point on the graph where X = 4.5 and Y= 5.3 and ending with an
arrowhead at the point where X = 4.5 and Y= 4. Note that this arrow, which
represents a deviation of -1.3, extends 1.3 units and points down. You will
probably want to use three different colors (or some other stylistic device) to
represent the three kinds of deviations.

The three kinds of deviations (total, model, and error) and their associated
sums of squares are basic to an understanding of much of the material presented
in this book. It is worthwhile to ponder the figure you produced for the previous
exercise (see Fig. 8.5) and think about what it represents. What do total
deviations represent? Model deviations? Error deviations? Based on the figure,
how strong is the relation between mood and number of lies detected? Why?

Note 8.2
Y - MY The total deviation score, that is, the difference between each
subject's observed score and the group mean. (Subscripts
indicating subject are not shown.)
Y' - MY The model (or regression) deviation score, that is, the
difference between the score predicted by the regression
equation (or model) for each subject and the group mean.
(Subscripts indicating subject are not shown.)
Y - Y' The error (or residual) deviation score, that is, the difference
between each subject's observed score and each subject's
predicted score. (Subscripts indicating subject are not
shown.)
SStotal The total sum of squares, or Y - MY for each subject, squared
and summed. This can also be symbolized SSY.
SSmodel The model sum of squares, or Y' - MY for each subject,
squared and summed. This can also be symbolized SSreg, for
regression sum of squares, or SSexp, for sum of squares
explained by the model.
SSerror The error sum of squares, or Y - Y' for each subject, squared
and summed. This can also be symbolized SSres, for residual
sum of squares, or SSunexp, for sum of squares left
unexplained by the model.

8.2 WHAT ACCOUNTING FOR VARIANCE MEANS

From a best-fit line (and remembering that we have not yet determined the
absolute best-fit line for the lie detection data, although we will do so in the next
chapter), two important facts can be determined.

1. The exact nature of the linear relation between predictor and criterion
variable.
2. The strength of the linear relation between predictor and criterion
variable.

The nature of the linear relation is indicated by the regression statistics for
the best-fit line, whereas the strength of the linear relation is indicated by the
ratio of variability explained by the regression model to total variability. Exactly
what this means is explained in the next several paragraphs.

The Exact Nature of the Relation


Assume the values for a and b given in Fig. 8.4 are accurate. (We know they are
accurate to two significant digits.) The value of a indicates that, for mood scores
of zero, 3.0 lies would be detected (Y' = 3.0 + 0.47X = 3.0 when X= 0). This
statement is meaningful only if mood scores of zero are meaningful. Although we
can imagine any best-fit line stretching to infinity in both directions, any actual
line we draw will be based on a particular sample of X values; values that will
necessarily be confined to a particular range. In this case, mood scores ranged
from 2 to 9, not minus to plus infinity, so predictions of number of lies based on
mood scores outside this range (like zero) may not make much sense.
More interesting is the information conveyed by the value for b. It indicates
that, for each increase of one mood point, the number of lies detected increases
0.47. Numbers less than 1 often seem difficult for many people to grasp, thus in
this case we might instead say that for each increase of 2.13 mood points, the
number of lies detected increases by 1. (The algebra is as follows: b = rise:run =
0.47/1; 0.47/1 = 1/X; 0.47X = 1; X = 1/0.47 = 2.13 rounded.)
It is important to remember that the exact nature of the relation between a
predictor and criterion variable—the change in value for the criterion expected
for a given change in the predictor—is an idealization based on the model. It is
an ideal statement about an ideal world, one in which error is ignored and the
model is assumed to be true, when in fact it is better viewed only as a
probabilistic guide. In the behavioral sciences, observed values often deviate
strikingly from those predicted by a particular model, even though that model is
based on the best possible fit to the data. Thus it becomes important to assess,
not just what the best-fit model is (i.e., what are the best-fit values for a and b),
but also how well that model fits a particular set of data. In other words,
although we may have found a best-fit line, the question remains, how good is
that fit?

The Strength of the Relation


The strength of the relation (or association) between a particular predictor
variable and its criterion, or the goodness of fit, can be assessed by the proportion
of total criterion variance that is accounted for when predictions take the
predictor variable into account, that is, when predictions are based on the
regression equation (Equation 8.3) instead of just on the group mean (Equation
8.2). In effect, two variances are compared. First is total variance or the extent
to which raw scores deviate from the mean (see Fig. 8.5, top). Second is error
variance or the extent to which raw scores deviate from the best-fit line (see Fig.
8.5, bottom).
Error variance is represented graphically by the dispersion of points around
the best-fit line, whereas total variance is represented by the dispersion of points
around the horizontal line representing the Y mean (recall Exercise 8.3). When
the relation between X and Y is weak, the dispersion of data points around the
best-fit line is almost as great as the dispersion around the mean line. When the
relation between X and Y is strong, however, the data points are clustered far
more tightly around the best-fit line than around the mean line; basing
predictions on the linear model (Yi' = a + bXi) instead of the mean model (Yi' =
MY) results in a clear reduction in variability. For the present example, the total
variance for the mean model (the SStotal divided by N) was 4.01, whereas the error
variance (the SSerror divided by N) was 2.89 (see Fig. 8.4).
We are now in a position to answer the question posed earlier in this chapter:
How much do we gain by stepping up from the one-parameter mean model (which
requires a value only for MY) to the two-parameter linear model (which requires
values for a and b)? Recall that for the best-fit linear model, the total sum of
squares (SStotal) can be divided (or partitioned) into two pieces, the sum of
squares associated with or explained by the regression model (SSmodel) and the
error or residual sum of squares (SSerror), which is the sum of squares left
unexplained by the model (Equation 8.4). The total variance, which is simply the
total sum of squares divided by N, can also be partitioned in the same way:

VARtotal = VARmodel + VARerror

The mean model generates the same value for all predicted scores. As a result,
there is no variability in predicted scores and no variability to be associated with
the model. In the case of the mean model, total variance is all error variance,
portions of which we hope to account for by more complex models.
Usually the error variance for the linear model will be less than the error (i.e.,
total) variance for the mean model. The amount by which it is less is the variance
due to the model and represents the effect of the additional information (i.e.,
each subject's values for the predictor variable) used in the linear model as
compared to the mean model. For the present example, the error variance for the
mean model is 4.01 and the error variance for the linear model is 2.89 (assuming
a = 3.0 and b = 0.47). This represents an improvement of approximately 4.01
minus 2.89, or an improvement of 1.12 (rounded). This improvement is due to
the variance accounted for by the linear model (VARmodel), which according to
Fig. 8.4 is 1.13 (rounded; exactly 1.1275625 for these values of a and b). In any
case, compared to the mean model, the linear model reduces error variance by
approximately 28% (1.13 divided by 4.01 times 100). In other words, for the
present example the linear model accounts for approximately 28% of the variance
in criterion variable scores.
The formula for proportion of variance accounted for can be expressed in
terms of either variances or sums of squares. In terms of variances,

proportion of variance accounted for = VARmodel / VARtotal
                                     = (VARtotal - VARerror) / VARtotal     (8.6)

This formulation clearly expresses the notion that what is being computed is the
proportion of total variance accounted for by the model. The second variant is
given because sometimes total and error variances are more readily available
than model variance, and clearly, if VARtotal = VARmodel + VARerror then VARmodel
= VARtotal - VARerror. The various variances given in Equation 8.6 are all the
appropriate sums of squares divided by N, the sample size, and therefore all Ns
implicit in Equation 8.6 cancel, leaving


proportion of variance accounted for = SSmodel / SStotal = (SStotal - SSerror) / SStotal

FIG. 8.5. Total (top), model (middle), and error (bottom) deviations; when
squared and summed, the result is the sum of squares total, for the model,
and for error.

This formulation bypasses the divisions by N required for computing the
variances in Equation 8.6 and for that reason may occasionally be more
convenient. But both formulas yield identical results.
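
To make the cancellation of the Ns concrete, here is a small Python sketch
(our illustration, not part of the text's exercises):

def proportion_accounted_for(ss_total, ss_error, n):
    # Variance form (Equation 8.6) and sum-of-squares form: the Ns cancel
    p_from_var = (ss_total / n - ss_error / n) / (ss_total / n)
    p_from_ss = (ss_total - ss_error) / ss_total
    return p_from_var, p_from_ss

print(proportion_accounted_for(40.1, 28.86, 10))   # about .28 either way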
As an additional aid to understanding the concept of accounting for variance,
it can be helpful to consider extreme, even degenerate circumstances. For
example, imagine that prediction were perfect, that every data point fell on the
best-fit line. As always, the total sum of the deviation scores squared (deviations
of raw scores from the mean) would reflect variability for the criterion variable,
the very thing we want to account for. But in this case, each error deviation
would be zero (because every point is perfectly on the line), so the sum of the
squared residuals would be zero. As a result, total and model sum of squares
would be identical. With no error variance, all of the total variance is accounted
for by the model. In other words, the SSmodel divided by the SStotal would equal
exactly 1. When prediction is perfect, 100% of the variability in criterion scores is
accounted for by the model; there is no error variability.
Now imagine there is no association between predictor and criterion
variables. In this case, the slope for the best-fit line would equal zero, indicating
no relation, and the Y intercept would equal the mean of the Y scores. That is, the
best-fit line would be horizontal (zero slope) and would intercept the Y axis at the
Y mean. Given no relation, the best prediction for a Y score is simply the Y mean;
there is no way to improve it. As a result, the sum of squares explained by the
model is zero (because the model predicts the mean, the deviations between
predicted and mean scores are zero). With no model variance, none of the total
variance can be due to the model. In other words, the SSmodel divided by the SStotal
would equal exactly zero. When prediction is nil, 0% of the variability in criterion
scores is accounted for by the model.
Normally, of course, the proportion of variance accounted for falls
somewhere between 0 and 1. For the present example (Fig. 8.4), the sum of
squares for the number of lies detected is 40.10. The sum of squares explained by
the model is 11.28 and the error sum of squares is 28.86 (rounding to four
significant digits). Thus the proportion of variability in number of lies detected
that is accounted for by mood scores is .281 (11.28 divided by 40.1; but remember,
this is based on values for a and b determined by trial and error and accurate
only to two significant digits, which is why the present .281 differs slightly from
the more accurate .280 we compute in the next chapter).
The purpose of the past several paragraphs has been to develop a clear
understanding of what it means to say that a given proportion of criterion
variable variance is accounted for by a particular predictor variable. Many
readers will already have recognized that the proportion of variance accounted
for is an important and widely used descriptive statistic, known as the coefficient
of determination or r2 (read r-squared). It indicates the degree or strength of
association between two variables, it can vary from 0 to 1, and it is symbolized
and defined as follows:

r2 = SSmodel / SStotal = VARmodel / VARtotal

Historically, another statistic, the correlation coefficient or r, was developed
before r2. In traditional statistics texts, r is usually introduced first and receives
more attention than r2. We prefer to emphasize r2, partly because it has such a
clear and concrete meaning (proportion of variance accounted for) and partly
because accounting for variance is a general and powerful concept, one not
limited to cases involving single predictor variables. In any case, and ignoring for
a moment the matter of sign, you can always compute the value of r simply by
taking the square root of r2.
In this chapter you have been introduced to four important descriptive
statistics: the regression constant or a, the regression coefficient or b, the
correlation coefficient or r, and the coefficient of determination or r2. The
regression coefficient indexes the exact nature of the relation between two
variables, r indexes the strength of association between the predictor and the
criterion variables, and r2 indicates the proportion of criterion variable variance
accounted for by the model that includes the predictor variable. You have also
learned how to determine the value of a and b by trial and error and how to use
them to compute first the SStotal, SSmodel, and SSerror, and then r2. At this point,
you should have a good understanding of accounting for variance. In chapter 9
you will learn how to compute the exact, best-fit values for a and b.

Note 8.3
VARtotal The variance for the scores in a sample. It indicates the
variability of raw scores relative to the group mean. To
compute, divide SStotal by N. The subscript indicates the
group of scores in question, e.g., VARx for the X scores, VARy
for the Y scores, and so forth.
VARmodel The model variance. It indicates the variability of scores
predicted by a specified regression equation or model relative
to the group mean. To compute, divide SSmodel by N.
VARerror The error variance. It indicates the variability of raw scores
relative to scores predicted by a specified regression equation
or model. To compute, divide SSerror by N.
9 Bivariate Relations:
The Regression and Correlation Coefficients

In this chapter you will:

1. Learn how to compute the slope and Y intercept for the best-fit line.
2. Learn how to compute and interpret the correlation coefficient.
3. Learn how to account for variance with a single predictor, either
quantitative or binary.
4. Learn how to graph the regression line for both a quantitative and a
binary predictor variable.

In the preceding chapter, as a way of describing the relation between two
variables, you determined the best-fit line by trial and error. You also learned
that the slope of that line (the regression coefficient) indicates the exact nature of
the linear relation between the independent and dependent variable. Topics
introduced in the last chapter are discussed further in this chapter. The focus
remains on descriptive statistics, and the major question remains: How can the
strength of the relation between a quantitative dependent variable and an
independent variable (either quantitative or qualitative) be assessed? Other ways
to state this question are: How can we describe the effect of one variable on
another, and how much variability in one variable can be accounted for by
another?
In the previous chapter, the numeric example used a quantitative
independent or predictor variable (mood). In this chapter, after describing
procedures for computing the regression and correlation coefficients, we
demonstrate how the methods learned in the last chapter also apply when the
predictor variable is binary. This is useful because researchers often compare
two groups and in such cases the variable indicating group membership will be
binary.

9.1 COMPUTING THE SLOPE AND THE Y INTERCEPT

As you probably know, or certainly have guessed, in practice the slope and Y
intercept are computed using formulas developed by statisticians, not discovered
empirically by refined guesswork as demonstrated in the previous chapter. Still,
it is instructive to learn the trial-and-error method first, partly because it is so
simple and it does work, but mainly because it helps develop a clear and intuitive
sense of a best-fit line. Statisticians can prove that the formulas for the slope and
Y intercept given in this section do indeed result in best-fit lines. We do not
reproduce their arguments here, preferring to leave such discussion for advanced
and more mathematically oriented statistics texts. However, you can use the
spreadsheets developed in the last chapter to demonstrate to yourself that the
computed values for a and b really do describe a best-fit line. You will not find it
possible to find any other values for a and b that yield a smaller value for the
error sum of squares. Keep in mind, however, that the computations for the Y
intercept and slope presented here apply only when a criterion variable is being
accounted for by a single predictor variable. Procedures used when more than
one predictor variable is under consideration are presented in subsequent
sections.
If we let a lower case y represent the total deviation score for a Y score, so
that yi = Yi - MY, then the sum of squares for Y is:

SSy = Σ yi² = Σ (Yi - MY)²


In other words, the sum of squares for Y is the sum of the squared differences
between the raw Y scores and the Y mean. Likewise, if we let a lower case x
represent a total deviation score for an X score, so that xi = Xi - MX, then the sum
of squares for X is:

SSx = Σ xi² = Σ (Xi - MX)²

As you already know, the sum of squares for Y indicates the variability of the Y
scores. Similarly, the sum of squares for X indicates how variable the X scores
are. Now we can define a new sum of squares, the XY sum of squares, that
indicates the extent to which the X and Y scores covary:

SSxy = Σ xiyi = Σ (Xi - MX)(Yi - MY)

Note that SSx and SSy are always positive (because deviations are multiplied by
themselves) but SSxy can be negative as well as positive.
If X and Y covary in a positive direction—that is, if small values of Y tend to
be paired with small values of X and large values of Y tend to be paired with large
values of X— then most of the products of the paired x and y deviation scores will
be positive (because most paired deviations will have the same sign), resulting in
a large, positive value for SSxy. On the other hand, if X and Y covary in a negative
direction—that is, if large values of Y tend to be paired with small values of X, and
small value of Y tend to be paired with large values of X—then most of the
products of the paired x and y deviation scores will be negative (because most
paired deviations will have opposite signs) and thus the value of SSxy will be large
and negative. There is a third possibility, of course: X and Y may not covary at
all. In that case, about half of the xy products would be positive, half negative,
and SSxy would be some value close to zero.
You already know how to compute the variance for the X and Y scores in a
sample. These variances are:

VARx = SSx / N        VARy = SSy / N

Similarly, the sample covariance is the XY sum of squares divided by the sample
size:

COVxy = SSxy / N

We remind you about variances, and introduce the covariance, because the slope
for the best-fit line can be defined in terms of these statistics. If Y is the criterion
variable and X the predictor variable, then the slope for the best-fit line is the
ratio of the covariance for X and Y to the variance for X:

b = COVxy / VARx     (9.7)

(The Ns used to compute the covariance and variance cancel, so the slope could
also be expressed as the ratio of SSxy to SSx.) If the covariance is positive, then
the slope is positive (and tilts up to the right); if the covariance is negative, then
the slope is negative (and tilts down to the right).
In terms of units, Equation 9.7 makes sense. If Y is lies and X is mood, then
the numerator, which is the product of lie and mood deviation scores, is
measured in lie-mood units (or lies times mood). Similarly, the denominator is
measured in mood-squared units (like area is measured in square feet). When
lie-mood units are divided by mood-squared units, the mood units cancel, and
the units for the resulting quotient are lies/mood (or lies per mood), which is
exactly what the slope is: a ratio of rise to run or the increase in Y (number of
lies) per unit increase in X (mood). If you remember this, you should have no
trouble remembering that the slope is the covariance for X (run) and Y (rise)
divided by the variance for X (run-squared).
Given the slope, and the mean values for X and Y, the Y intercept can be
determined by geometry. The formula for the Y intercept is:

a = MY - b(MX)     (9.8)

By definition, the best-fit line must go through the point where X = MX and
Y = My. Its slope, or the ratio of rise to run, is the increase in Y units for each X
unit. At the point where the line intercepts the Y axis, X equals zero. In running
from X = 0 to X = MX, a distance of MX X units, we rise b times MX Y units. In
order to get back to the Y value corresponding to X = 0 (the Y intercept), we in
effect run backward, subtracting the total rise (b times MX) from the mean for Y,
as indicated by Equation 9.8. If this seems unclear, try making a simple figure
that depicts Equation 9.8.
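
Here is a brief Python sketch of Equations 9.7 and 9.8 applied to the lie
detection data (an illustration; the next exercise builds the same
computation into the spreadsheet):

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]
n = len(lies)
mx, my = sum(mood) / n, sum(lies) / n

cov_xy = sum((x - mx) * (y - my) for x, y in zip(mood, lies)) / n
var_x = sum((x - mx) ** 2 for x in mood) / n
b = cov_xy / var_x    # slope: about 0.46938
a = my - b * mx       # Y intercept: about 3.0235
print(a, b)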

Exercise 9.1
Computing the Y intercept and Slope
The template developed for this exercise adds the ability to compute the exact
values for the regression constant (the Y intercept or a) and the regression
coefficient (the slope or b) to the template shown in Fig. 8.4.
General Instructions
1. Add three columns, labeled appropriately and containing the appropriate
formulas, to your Exercise 8.2 spreadsheet (shown in Fig. 8.4). These columns
are for:
a. The deviation between X and the X mean (x = X - MX).
b. x*x, or the square of x; their sum is SSx.
c. x*y, or the product of x and y; their sum is SSxy.
2. Establish rows that compute sums, counts, and means for these columns as
appropriate. Enter a formula for the SDX.
3. Now that your spreadsheet contains all the elements needed, enter formulas
for a and b, the Y intercept and slope. Do your computed values agree with
those shown in Fig. 9.1? How does the current value for the error sum of
squares compare with the value computed for the last exercise?
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 8.4. Insert a blank column
between columns G and H, or, alternatively, move the block H1-J16 over
one column, to I1-K16. This opens up column H, which will be used for
labeling.
2. Move the VAR= and SD= labels in cells G15-G16 to H15-H16.
3. In row 2, add the following labels to columns L-N:
Label Column Meaning
X-Mx L The difference between the observed mood score and its
mean (symbolized x).
x*x M The square of the mood deviation score. The sum is SSx
and indicates variance in X scores.
x*y N The cross product of the mood and lie deviation scores. The
sum is SSxy and indicates covariance in X and Y scores.
4. In addition, enter the labels "x=" in L1, "SSx" in M1, and "SSxy" in N1.
5. In column L (cells 3-12), enter a formula for subtracting the mean mood
score (Mx, in cell C15) from the observed X score (in column C).
6. In column M (cells 3-12), enter a formula for squaring the mood deviation
score (in column L).
7. In column N (cells 3-12), enter a formula for the cross product of the mood
deviation score (column L) and the lie deviation score (column E).
8. In cells L13-N13, enter a function that sums the entries in rows 3-12.
9. In cells L14-N14, enter a function that counts the entries in rows 3-12.
10. In cells M15-N15, enter a formula that computes the mean for the column.
The column M mean is the variance for X and the column N mean is the XY
covariance.
11. In cell M16, enter a formula for the standard deviation for the mood or X
scores.
12. In cell C16, enter the formula for the slope. This is the covariance (cell N15)
divided by the X variance (cell M15).
13. In cell B16, enter a formula for the Y intercept. This is the product of the
slope (cell C16) and the X mean (cell C15) subtracted from the Y mean (cell
B15).

At this point, your spreadsheet should look like the one shown in Fig. 9.1. Now
we can see that the computed values for a and b (rounded to five significant
digits) are 3.0235 and 0.46938 respectively. (Remember, the number of digits
displayed depends on the format and column width you select; the number of
digits stored internally is not affected by the display.)
The guesses we used in the previous chapter for a and b, 3.0 and .47 (see Fig.
8.4), were close but not the best possible. Now that our estimates for a and b are
more accurate (Fig. 9.1), the error sum of squares (cell K13) is 28.85840274,

     A     B      C      D      E      F      G
1          Lies   Mood          y=     m=     e=
2    s     Y      X      Y'     Y-My   Y'-My  Y-Y'
3    1     3      5.5    5.605  -2.3   0.305  -2.61
4    2     2      2      3.962  -3.3   -1.34  -1.96
5    3     4      4.5    5.136  -1.3   -0.16  -1.14
6    4     6      3      4.432  0.7    -0.87  1.568
7    5     6      1.5    3.728  0.7    -1.57  2.272
8    6     4      6.5    6.074  -1.3   0.774  -2.07
9    7     5      3.5    4.666  -0.3   -0.63  0.334
10   8     7      7      6.309  1.7    1.009  0.691
11   9     7      6      5.84   1.7    0.54   1.16
12   10    9      9      7.248  3.7    1.948  1.752
13   Sum=  53     48.5   53     0      0      0
14   N=    10     10     10     10     10     10
15   Mean= 5.3    4.85   5.3
16   a,b=  3.024  0.469
17

     H     I      J      K      L      M      N
1          SStot  SSmod  SSerr  x=     SSx    SSxy
2          y*y    m*m    e*e    X-Mx   x*x    x*y
3          5.29   0.093  6.787  0.65   0.423  -1.5
4          10.89  1.79   3.851  -2.85  8.123  9.405
5          1.69   0.027  1.29   -0.35  0.123  0.455
6          0.49   0.754  2.46   -1.85  3.423  -1.3
7          0.49   2.472  5.164  -3.35  11.22  -2.35
8          1.69   0.6    4.303  1.65   2.723  -2.15
9          0.09   0.402  0.111  -1.35  1.823  0.405
10         2.89   1.018  0.477  2.15   4.623  3.655
11         2.89   0.291  1.346  1.15   1.323  1.955
12         13.69  3.794  3.07   4.15   17.22  15.36
13   Sum=  40.1   11.24  28.86  0      51.03  23.95
14   N=    10     10     10     10     10     10
15   VAR=  4.01   1.124  2.886         5.103  2.395
16   SD=   2.002                       2.259
17
FIG. 9.1. Spreadsheet for finding the Y intercept and slope of a best-fit line
predicting number of lies detected from mood scores.
which is very slightly less than the 28.862625 from Fig. 8.4 (when we adjust the
column width to see more significant digits than the 28.86 displayed in the
figures). This value from Fig. 9.1 should be the smallest value possible.
Moreover, the model and error deviation scores (columns F-G) now sum to zero,
as they should (or else to a number like -2E-15, which in scientific notation means
-2 divided by 1 followed by 15 zeros and which is essentially zero). And the
SSmodel and the SSerror (cells J13-K13) sum exactly to SStotal (values, rounded to
four significant digits, for SSmodel and SSerror are 11.24 and 28.86, which sum
exactly to 40.10). Given a single predictor, then, the spreadsheet shown in Fig.
9.1 can be used to compute the regression constant (a or the Y intercept) and the
regression coefficient (b or the slope for the best-fit line). In the next section, we
show how to compute the correlation coefficient (r) as well.

Note 9.1
SSxy The XY sum of squares, or Σ(Xi – Mx)(Yi – My), where i = 1,N.
Thus SSxy is the sum of the cross products of the X and Y
deviation scores.
COVxy The covariance. For the X and Y scores, it is SSxy divided by
N. Thus COVxy is the average cross product of the X and Y
deviation scores. It indicates the direction (plus or minus) and
the extent to which the X and Y scores covary, and is related to
both the correlation and the regression coefficients.

9.2 COMPUTING THE CORRELATION COEFFICIENT

The Pearson product-moment correlation coefficient, which is symbolized by r, is
an historically important and widely used index of association usually applied to
quantitative variables. It was first defined by Karl Pearson in 1895. Although
Pearson approached the matter somewhat differently, we have already noted that
r2 is the proportion of variance accounted for in a criterion variable when values
for the predictor variable are taken into account. Thus, one way to compute r is
simply to take the square root of r2:

r = √r2 = √(VARmodel/VARtotal)    (9.9)
This gives the absolute value of r; in order to determine its sign (whether the
relation is positive or negative) you would need to look at the scattergram or the
sign of the covariance. For the present example, relating number of lies detected
to mood:

r = √(VARmodel/VARtotal) = √(1.124/4.01) = √.2803 = .529
The values for VARmodel , VARtotal, and COVxy (which is positive) are taken
from Fig. 9.1. The Ns in the variance formula cancel, so we could have defined r2
as the SSmodel divided by SStotal instead (11.24 divided by 40.1), which gives the
same result.
The correlation coefficient can vary from –1 (a perfect negative relation) to
zero (no relation) to +1 (a perfect positive relation), whereas r2 can vary only
from zero (no variance accounted for) to +1 (all variance accounted for). If you
compute r as the square root of r2, you will need to supply the appropriate sign.
As noted earlier, the covariance (cell N15 in Fig. 9.1) indicates the direction of the
relation, so its sign can be used if you compute r as the square root of r2. In
addition, there are several other ways to compute the standard Pearson
correlation coefficient. One of these is demonstrated in the next exercise.

Exercise 9.2
Computing the Correlation Coefficient Using Z Scores
The purpose of this exercise is to add the ability to compute the correlation
coefficient (r) to the template shown in Fig. 9.1. The method used for computing
the correlation coefficient is based on Z or standardized scores.

General Instructions
1. Add three columns, labeled appropriately and containing the appropriate
formulas, to your Exercise 9.1 spreadsheet. These columns are for:
a. Zy, the standardized Y score.
b. Zx, the standardized X score.
c. Zy Zx, the product of ZY and Zx.
2. Establish rows that compute sums, counts, and means for these columns as
appropriate.
3. Provide labels and formulas for r and r2. Note that r is the average cross
product of the Z scores. Compute r2 by dividing the variance accounted for
by the model by the total variance. Is the square root of this value the same
as the average cross product of the Z scores? Would it be if the average
cross product were negative?
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 9.1. In row 2, add the following
labels to columns O-Q:
Label Column Meaning
ZY O Standard or Z scores for number of lies detected.
Zx P Standard or Z scores for mood.
ZY* Zx Q The cross product of ZY and Zx.
2. In column O, enter the formula for a standardized lie score. (Use the
standard deviation for Y in cell I16.) If you do not remember the formula,
reread the section on standard scores in Chapter 5.
3. In column P, enter the formula for a standardized mood score. (Use the
standard deviation for X in cell M16.)
4. In column Q, enter a formula for the cross product of the ZY and Zx scores.
5. In cells O13-Q13, enter a function for summing the data entries in the
column. In cells O14-Q14, enter the function for counting the number of
data entries in columns O to Q. In cells O15-Q15, enter a formula for the
mean of the data entries in columns O to Q. The value in cell Q15 is the
mean of the cross products of the Z scores.
6. Enter the label "r=" (for correlation coefficient) in cell A17 and the label "R2="
(for r2 or r-squared) in cell H17.
7. Point cell B17 to cell Q15 (the mean cross product of the standard scores).
8. In cell I17, enter a formula for dividing model variance (cell J15) by total
variance (cell I15). This is r2.
At this point, your spreadsheet should look like the one shown in Fig. 9.2.
You already know from Equation 9.9 that the value of the correlation coefficient
for our current running example is 0.529 (the square root of VARmodel divided by
VARtotal). An examination of the exercise just completed should reveal a second
way to define and compute the correlation coefficient. As you can see, the
correlation coefficient is the average cross product of the standard or Z scores:

r = Σ Zxi Zyi/N, where i = 1,N    (9.10)
This formulation emphasizes the fact that the correlation coefficient is an
average statistic that, like the mean, summarizes data for a group. For the
present example (and rounding the correlation coefficient to two significant
digits),

r = Σ Zxi Zyi/N = 5.295/10 = .53
There is another way to define and compute the correlation coefficient.


Again, the values needed are given in the spreadsheet you just prepared. This
definition, like the one for the regression coefficient given in Equation 9.7, uses
the variances and covariances for the two variables involved. It is:

r = COVxy/√(VARx × VARy)    (9.11)
A B H I O P Q
1 Lies SSY
2 s Y y*y Zy Zx Zy*Zx
3 1 3 5.29 -1.15 0.288 -0.33
4 2 2 10.89 -1.65 -1.26 2.079
5 3 4 1.69 -0.65 -0.15 0.101
6 4 6 0.49 0.35 -0.82 -0.29
7 5 6 0.49 0.35 -1.48 -0.52
8 6 4 1.69 -0.65 0.73 -0.47
9 7 5 0.09 -0.15 -0.6 0.09
10 8 7 2.89 0.849 0.952 0.808
11 9 7 2.89 0.849 0.509 0.432
12 10 9 13.69 1.848 1.837 3.395
13 Sum= 53 Sum= 40.1 0 2E-15 5.295
14 N= 10 N= 10 10 10 10
15 Mean= 5.3 VAR= 4.01 0 2E-16 0.529
16 a,b= 3.024 SD= 2.002
17 r= 0.529 R2= 0.28
FIG. 9.2. Spreadsheet for computing the correlation between number of lies
detected and mood scores. Columns that are the same as Fig. 9.1 are not
shown.
For the present example,

r = COVxy/√(VARx × VARy) = 2.395/√(5.103 × 4.01) = 2.395/4.524 = .529
(The values for these variances and covariances are given in cells I15, M15, and
N15 in Fig. 9.1.) In other words, the correlation coefficient can be defined as the
ratio of the covariation of X and Y, relative to the square root of the variation of X
and Y considered separately. Both variance and covariance are group summary
statistics, reflecting average variability about group means, which again serves to
emphasize that r is an average, group-based statistic. It does not indicate the
relation between two variables for an individual, but rather indicates the average
relation between those variables for a group of individuals.
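These equivalences are easy to verify by direct computation. The following Python sketch (an optional aside; all names are our own) computes r both as the average cross product of the Z scores and as the covariance relative to the two variances:

    import math

    lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
    mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]
    n = len(lies)
    mx, my = sum(mood) / n, sum(lies) / n

    var_x = sum((x - mx) ** 2 for x in mood) / n
    var_y = sum((y - my) ** 2 for y in lies) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(mood, lies)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    # r as the average cross product of the Z scores (Equation 9.10)
    r_z = sum(((x - mx) / sd_x) * ((y - my) / sd_y)
              for x, y in zip(mood, lies)) / n
    # r as the covariance relative to the X and Y variances (Equation 9.11)
    r_cov = cov_xy / math.sqrt(var_x * var_y)
    print(r_z, r_cov)                      # both about 0.529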
However defined or computed, the correlation coefficient is by far the most
common index of association for two quantitative variables. Moreover, the
identical computations can be applied when both variables are ranked (ordinal)
data, in which case the statistic is called the Spearman rank-order correlation
coefficient; when one of the variables is quantitative (i.e., interval-scaled) and the
other is binary, in which case the statistic is called a point-biserial correlation;
and when both variables are binary data, in which case the statistic is called the
phi coefficient. Historically, special computational formulas have been developed
for each of these cases, but they are merely computationally convenient
derivatives of the formulas presented here. Unfortunately, the separate formulas
have led generations of students to think these variants of the correlation
coefficient were somehow different things. Because of their widespread use, it is
important for you to know the different names given these correlation
coefficients, but at the same time to understand that they are fundamentally the
same.
One final comment: We think it often makes more sense to report r2 instead
of r. The units of the correlation coefficient have no particular meaning. An r of
.4, for example, is not twice as much correlation as an r of .2, and the difference
between an r of .8 and an r of .9 is not the same as the difference between an r of
.4 and an r of .5. The units of r2, however, do have a more intuitively grasped
meaning: the proportion of variance accounted for.

The Correlation and Regression Coefficients Compared


The correlation coefficient (r) and the regression coefficient (b, or the slope
of the best-fit line) appear very similar. According to Equation 9.11, one
definition for r is:

r = COVxy/√(VARx × VARy)
And according to Equation 9.7, the definition for b is:

b = COVxy/VARx
The similarity between the two equations is made more evident if we substitute
the square root of the X variance squared for VARx in the denominator:
b = COVxy/√(VARx × VARx)
This superficial similarity between r and b, however, masks a very real difference.
As noted earlier, the regression coefficient indicates the exact nature of the
relation between X and Y. Its value indicates how much Y increases (or
decreases, if b is negative) for each unit change in X. Its units are Y units per X
unit (e.g., lies per mood point). The correlation coefficient, on the other hand, is
"unit free" (numerator and denominator units cancel). It is a pure index of the
strength of the association between X and Y.
Both correlation and regression coefficients convey important, but different,
pieces of information. When reporting the results of the lie detection study, it
would be traditional to say that mood and number of lies detected correlate
0.529. As noted earlier, our preference is to report r2, not r, because we think it
is more informative to say that 28% of the variance in number of lies detected is
accounted for when mood scores are taken into account. Either way, r or r2
provides information about the strength of the relation. In addition, in order to
provide information about the exact nature of the relation, you would report that
for each increase of 1 mood point, 0.469 more lies were detected on average.
Alternatively, because decimals less than 1 often seem difficult to grasp, you
might report instead that for each increase of about 2.1 mood points, one more
lie is detected (recall the algebra used in Chap. 8).

Note 9.2
b The simple regression coefficient. Given a single predictor
variable, b is the slope of the best-fit line. It indicates the exact
nature of the relation between the predictor and criterion
variables. Specifically, b is the change in the criterion variable
for each one unit change in the predictor variable. To compute
(assuming that Y is the criterion or dependent variable), divide
the XY covariance by the X variance (b = COVxy/VARx).
a The regression constant. When Y, the criterion variable, is
regressed on X, a single predictor variable, a is the Y intercept
of the best-fit line. To compute, subtract the product of b (the
simple regression coefficient) and the mean of X from the
mean of Y (a = My – b Mx).
r The correlation coefficient or, more formally, the Pearson
product-moment correlation coefficient. It is an index of the
strength of the relation between two variables, and its values
can range from -1 to +1. To compute, find the average cross
product of the Zx and Zy scores (r = Σ Zxi Zyi/N, where i = 1,N),
or take the square root of r2 and assign it the same sign as the
covariance.
r2 The coefficient of determination, or r-squared. It indicates the
proportion of criterion variable variance that can be accounted
for given knowledge of the predictor variable: To compute,
divide model variance by total variance (r2 =
VARmodel/VARtotal).

9.3 DETECTING GROUP DIFFERENCES WITH A BINARY PREDICTOR

Perhaps one of the most common research questions asked is, do two groups
differ? The groups may differ naturally in some way or they may have been
formed by random assignment and then treated differently, but in either the
correlational or experimental case, the investigator wants to know whether the
average of some measure of interest is different for the two groups. Imagine, for
example, that 5 of the 10 subjects in the lie detection study had received a drug
that made them less physiologically reactive than normal whereas the other 5 had
received a placebo. Such a study design could be motivated by the hypothesis
that a tranquilizing drug will make lies more difficult to detect.
A question like this can be answered easily with only slight modification to
the template shown in Fig. 9.2, as you will see in the next exercise. Then, in the
next chapter, you will learn how the statistical significance of this information
can be assessed. You can then use the template you are about to develop
whenever subjects fall into two groups and you want to ask whether the
difference between the two group means is statistically significant.
The procedure developed here gives results identical with those obtained
using Student's t test, a test that was developed early in the 20th century by
Gosset (who used the pen name Student). However, as we mentioned in chapter 4 and
again in the next chapter, we prefer to emphasize a more general approach
appropriate not only for situations in which the historically much-used t test
could be applied, but one that is also appropriate for many other situations as
well. This general approach uses regression computations to perform what is
called a one-way analysis of variance and requires that a subject's group
membership (in the present case, whether the subject received a drug or placebo)
be treated as a predictor variable. Exactly how a categorical variable like group
membership can be coded so that its qualitative information is represented
quantitatively is an important topic discussed in some detail in chapter 12. For
present purposes, and without further justification, we use a single predictor
variable to represent group membership. Arbitrarily, values for this variable are
set to 1 for subjects who received the drug and to 0 for subjects who received the
placebo.

Exercise 9.3
Predicting Lies Detected From Drug Group
The template developed during this exercise can be used whenever you want to
describe the association between a predictor and a criterion variable. The
predictor variable can be binary (indicating membership in one of two groups, for
example) or quantitative (like mood score).
1. Begin with the spreadsheet shown in Fig. 9.2. Change the formula for r from
the average cross products of the Z scores to the covariance divided by the
square root of the product of the X and Y variances. This eliminates the
need for the columns containing ZY, Zx, and their cross product. You can
erase these columns.
2. Replace the mood scores with drug group scores (and change labels
appropriately). Enter 1 for the first five subjects (indicating that they belong
to the drug group) and 0 for the last five (indicating that they belong to the
placebo group).
3. You are now done. All other formulas carried over from the previous
template should be in place and correct.
4. As a final exercise, verify that Equation 9.9 gives the same value for r as
Equation 9.11. The value for r currently displayed is based on Equation 9.11
(r = the XY covariance divided by the square root of the product of the X and
Y variances). In another cell, enter a formula based on Equation 9.9 (r = the
square root of the quotient resulting from dividing the model variance by total
variance). Except for one detail, the two values you just computed should be
identical. In what way are they different? Why?

At this point, your spreadsheet should look like the one shown in Fig. 9.3.
From it you know that knowledge of drug group (drug or placebo) allows you to
account for 30% of the variance in number of lies detected. This is a reasonable
amount, and a bit higher than the 28% accounted for by mood scores (see Fig.
9.2). You also know that the value of the correlation coefficient describing the
association between drug group and number of lies detected is -.55 (rounded to
two significant digits).
Drug group is a binary categorical variable. When the predictor variable is
quantitative, the sign of the correlation coefficient indicates the direction of the
relation (a positive sign indicates that increases in the IV tend to be associated
with increases in the DV; a negative sign indicates that increases in the IV tend to
be associated with decreases in the DV) and the Y intercept and slope have the
meanings described in chapter 8 and in the previous section of this chapter.
When the predictor variable is categorical, however, the sign of the correlation
coefficient and the values for Y intercept and slope need to be interpreted in light
of the arbitrary codes assigned to the two levels of the binary variable.
In this case the drug group was coded 1 and the placebo group was coded 0.
There are only two values for the predictor variable; thus there are only two
values for the predicted scores. The predicted score for all subjects in the drug
group was 4.2, which is the mean of the observed scores for the five drug group
subjects, whereas the predicted score for all subjects in the placebo group was
6.4, which is the mean of the observed scores for the five placebo group subjects
(see Fig. 9.3). You may want to ponder why, using a binary predictor variable,
predicted values based on the best-fit line are the mean scores for the two groups.
For now, however, note that because the lower code (zero for the placebo group)
was assigned to the group with the higher mean (M = 6.4), the best-fit line (which
you will graph during the next exercise) tilts upward to the left and as a result the
correlation coefficient is negative (-0.55). If codes assigning a higher number to
the placebo than to the drug group were chosen, the correlation coefficient would
have been positive instead.
Just as the sign of the regression coefficient depends on the codes chosen for
the levels of the binary variable, so do the values computed for the Y intercept
and slope. In this case, the Y intercept (or regression constant) is 6.4, which is
the mean for the group coded zero (the placebo group). The slope (or regression
coefficient) is -2.2, which means that for a unit increase in drug group, the
number of lies detected decreases by 2.2. The drug group was coded 1, which is a
unit increase from the code for the placebo group, so this means that the mean
number of lies for the drug group is 2.2 less than the mean for the placebo group,
which it is (6.4 - 4.2 = 2.2). However, because the values computed for a and b
are affected by the codes used, the exact nature of the relation between a binary
predictor and a quantitative criterion variable is best conveyed by simply
reporting the group means in the first place, by saying that the mean number of
lies detected was 4.2 for subjects who had the drug and 6.4 for those who
received the placebo.
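A short Python sketch (again an optional aside; names are our own) makes the point concrete: with a 0/1 predictor, the regression computations simply reproduce the two group means as the predicted scores:

    lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
    drug = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = drug, 0 = placebo
    n = len(lies)
    mx, my = sum(drug) / n, sum(lies) / n

    cov_xy = sum((x - mx) * (y - my) for x, y in zip(drug, lies)) / n
    var_x = sum((x - mx) ** 2 for x in drug) / n
    b = cov_xy / var_x                     # -2.2, drug mean minus placebo mean
    a = my - b * mx                        # 6.4, mean for the group coded 0

    print(a + b * 0)                       # predicted score for placebo: 6.4
    print(a + b * 1)                       # predicted score for drug: 4.2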
Whether the predictor variable is qualitative or quantitative, r and r2 are

A B C D E F G
1 Lies Drug y= m= e=
2 s Y X Y' Y-My Y'-My Y-Y'
3 1 3 1 4.2 -2.3 -1.1 -1.2
4 2 2 1 4.2 -3.3 -1.1 -2.2
5 3 4 1 4.2 -1.3 -1.1 -0.2
6 4 6 1 4.2 0.7 -1.1 1.8
7 5 6 1 4.2 0.7 -1.1 1.8
8 6 4 0 6.4 -1.3 1.1 -2.4
9 7 5 0 6.4 -0.3 1.1 -1.4
10 8 7 0 6.4 1.7 1.1 0.6
11 9 7 0 6.4 1.7 1.1 0.6
12 10 9 0 6.4 3.7 1.1 2.6
13 Sum= 53 5 53 0 0 0
14 N= 10 10 10 10 10 10
15 Mean= 5.3 0.5 5.3
16 a,b= 6.4 -2.2
17 r= -0.55

H I J K L M N
1 SStot SSmod SSerr x= SSx SSxy
2 y*y m*m e*e X-Mx x*x x*y
3 5.29 1.21 1.44 0.5 0.25 -1.15
4 10.89 1.21 4.84 0.5 0.25 -1.65
5 1.69 1.21 0.04 0.5 0.25 -0.65
6 0.49 1.21 3.24 0.5 0.25 0.35
7 0.49 1.21 3.24 0.5 0.25 0.35
8 1.69 1.21 5.76 -0.5 0.25 0.65
9 0.09 1.21 1.96 -0.5 0.25 0.15
10 2.89 1.21 0.36 -0.5 0.25 -0.85
11 2.89 1.21 0.36 -0.5 0.25 -0.85
12 13.69 1.21 6.76 -0.5 0.25 -1.85
13 Sum= 40.1 12.1 28 0 2.5 -5.5
14 N= 10 10 10 10 10 10
15 VAR= 4.01 1.21 2.8 0.25 -0.55
16 SD= 2.002 0.5
17 R2= 0.302
FIG. 9.3. Spreadsheet for predicting lies detected from drug used. Drug is
treated as a binary categorical variable: 0 = no drug, 1 = drug.
important statistics to report because they indicate the strength of the
association. In addition, for a quantitative predictor, the slope of the best-fit line
indicates the exact nature of the relation, whereas for a qualitative predictor, the
same information is conveyed by group means. In either case, whenever you
want to know if one variable affects another, the template developed for the last
exercise provides the basic descriptive statistics. For now, we consider cases for
which the single independent variable, if qualitative, has only two values or levels
(drug vs. placebo group, for example), although later we show how to deal with
more than two levels. How those two levels are coded is not especially critical, as
the next exercise demonstrates.

Exercise 9.4
Codes for a Two-Group Predictor Variable
This exercise uses the template developed for the last exercise. Its purpose is to
demonstrate some interesting properties of coded variables. (A sketch for
checking your conclusions follows the exercise.)
1. Begin with the spreadsheet shown in Fig. 9.3. Arbitrarily we had decided to
code the drug group 1 and the placebo 0. Change the codes to -1 for the
drug group and +1 for the placebo. How does this affect the values for a, b,
and r2?
2. Imagine that the drug variable is quantitative, not qualitative, and that two
levels of drug were selected for study. The first five subjects received 4 mg,
the last five, 8 mg of the drug. Change the data in column C. What are the
values of a, b, and r2 now?
3. Select one other pair of codes (in addition to 1/0 and -1/+1) for the two
groups and recompute a, b, and r2. What do you conclude about the effect
of the values selected to represent the two groups on the values computed
for a, b, and r2?
4. Now turn your attention to Y, the predicted values. How are they affected by
different ways of coding drug group? When a single binary predictor variable
is used, what will the predicted scores always be?
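If you want to check the conclusions you reach, the following Python sketch (the fit helper is our own) refits the model under several codings. The values for a and b change with the codes, but r2 does not, and the predicted scores remain the two group means:

    def fit(x, y):
        # Return a, b, and r squared for a single predictor
        n = len(y)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
        vx = sum((xi - mx) ** 2 for xi in x) / n
        vy = sum((yi - my) ** 2 for yi in y) / n
        b = cov / vx
        return my - b * mx, b, cov ** 2 / (vx * vy)

    lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
    for codes in ([1] * 5 + [0] * 5,       # original 1/0 coding
                  [-1] * 5 + [1] * 5,      # -1/+1 coding
                  [4] * 5 + [8] * 5):      # 4 mg versus 8 mg
        print(fit(codes, lies))            # a and b change; r2 stays 0.302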

9.4 GRAPHING THE REGRESSION LINE

As mentioned earlier, if the independent variable is quantitative, the regression
coefficient has a clear interpretation. It indicates the change in DV units
expected for each one unit change in the IV. If the independent variable is
qualitative, as in the last two exercises, the interpretation of the slope is less clear.
This is because the numeric codes assigned the two categories of a binary
nominal variable are arbitrary. Still, graphing the relation between a quantitative
criterion variable (like number of lies detected) and a qualitative predictor
variable (like drug group) can be meaningful, as the next exercise demonstrates.
Exercise 9.5
Graphing the Regression Line
For this exercise, you graph two regression lines. One shows the relation
between number of lies detected and mood score, the other shows the relation
between number of lies detected and drug group.
1. Graph the lies/mood data given in Fig. 9.2. (This will look like the
scattergram shown in Fig. 8.2.) Now, using the parameters given in Fig. 9.2,
graph the regression line. If you are using your spreadsheet to draw this "xy"
graph, graph the raw data using symbols (this gives the data points) and then
graph the predicted scores using lines (this gives the regression line; in Excel
terms, x = mood scores, a = raw data, b = predicted data).
2. Note the mean value for the mood scores on the X axis. Draw a vertical line
through it. Now note the mean value for number of lies detected and draw a
horizontal line through it. If you have done this correctly, all three lines (the
regression line and the two mean lines) cross at a common point. Why must
this be so?
3. Using the graph, estimate about how many more lies are detected if mood
increases from 4 to 6. (Draw a vertical line from mood = 4 to the regression
line, then draw a horizontal line from that point to the Y axis. Do the same for
mood = 6.) If you did not want to use this graphic method, which is only
approximate, how could you have computed the value exactly?
4. Now graph the lies/drug data given in Fig. 9.3. (Use 1 to indicate the drug
and 0 the placebo group.) In this case the slope is negative. How do you
interpret this?
5. What is the predicted number of lies detected when drug = 0? When drug =
1? (Draw a line vertically from drug = 0 to the regression line and then
horizontally to the Y axis. Do the same for drug = 1.) How else might you
have determined these values?
6. In this case, what concrete meaning can you give the Y intercept and slope?

As you can see, whether a predictor variable is quantitative (like mood score)
or binary (like drug group), the regression line superimposed on the raw data
provides a way to visualize the relation between the independent and dependent
variables (see Figs. 9.4 and 9.5). The slope indicates the exact nature of the
relation between criterion and predictor variables. It indicates the change in the
criterion variable occasioned by a unit change in the predictor. If it is positive,
increases in the predictor are associated with increases in the criterion variable, if
negative, with decreases in the criterion variable. For a binary predictor, the
regression line passes through the means for the two groups and the slope
indicates the amount of
difference between them. No matter whether the predictor is quantitative or
qualitative, r2 indicates the strength of the relation between criterion and
predictor variables. In other words, it indicates the proportion of variance
accounted for when predictions for the dependent variable are made, based not
just on the mean, but with knowledge of the value of the independent variable as
well.
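If you prefer to draw such graphs outside the spreadsheet, the sketch below uses Python's matplotlib library (assuming it is installed; this is not part of the original exercise) to reproduce the elements of Fig. 9.4:

    import matplotlib.pyplot as plt

    mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]
    lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
    a, b = 3.0235, 0.46938                 # intercept and slope from Fig. 9.1

    plt.scatter(mood, lies)                # the raw data points
    xs = [min(mood), max(mood)]
    plt.plot(xs, [a + b * x for x in xs])  # the best-fit (regression) line
    plt.xlabel("Mood score")
    plt.ylabel("Number of lies detected")
    plt.show()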

FIG. 9.4. Regression line for predicting number of lies detected from mood
scores: a = 3.024, b = 0.469, r = .529.

FIG. 9.5. Regression line for predicting number of lies detected from drug
group: a = 6.40, b = -2.20, r = -.549. On average, 4.2 lies were detected
for drug-group subjects compared to 6.4 for placebo-group subjects.

Exercise 9.6
More Accounting for Variance
This exercise provides additional practice in variance accounting. You can easily
modify the last spreadsheet in order to do these computations. (A sketch for
checking your work follows the exercise.)
1. Sixteen infants were recruited for a study. The number of different words
each infant spoke at 18 months of age was 32, 27, 48, 34, 33, 30, 39, 23, 24,
25, 36, 31, 19, 28, 32, and 22 for the 1st through the 16th infant, respectively.
The first 7 infants all had no older siblings. The remaining 9 infants (the 8th
through the 16th) had 2, 1, 4, 1, 1, 3, 2, 1, and 5 older siblings, respectively.
What is the correlation between number of words spoken and number of
older siblings? How do you interpret this correlation? How much variance in
number of words spoken is accounted for by the number of older siblings?
2. What if you had made a data entry error and had entered 4 instead of 48 for
the 3rd subject's number of words? What would r and r2 be then? What if
you had entered 77 instead of 22 for the last subject's number of words? Or
222 instead of 22? Or 10 instead of 0 for the 7th subject's number of older
siblings? What do you infer about the potential effect of a single incorrect
datum?
3. What if you had divided the infants into two groups, those with no older
siblings (coded 0) and those with one or more older siblings (coded 1)?
What is the correlation between the number of words spoken and the binary
code for none versus one or more older siblings? How do you interpret this
correlation? How much variance in number of words spoken is accounted for
by knowing whether infants had at least one as opposed to no older siblings?
4. How many words are predicted for the 7 infants who had no older siblings?
For the 9 who did? Do these values surprise you?
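As noted above, a short Python sketch like the following (names are our own) can serve as a check on your spreadsheet work for part 1; the data lines are easy to edit for the remaining parts:

    import math

    words = [32, 27, 48, 34, 33, 30, 39, 23,
             24, 25, 36, 31, 19, 28, 32, 22]
    sibs = [0, 0, 0, 0, 0, 0, 0, 2, 1, 4, 1, 1, 3, 2, 1, 5]
    n = len(words)
    mx, my = sum(sibs) / n, sum(words) / n

    cov = sum((x - mx) * (y - my) for x, y in zip(sibs, words)) / n
    vx = sum((x - mx) ** 2 for x in sibs) / n
    vy = sum((y - my) ** 2 for y in words) / n
    r = cov / math.sqrt(vx * vy)
    print(r, r ** 2)                       # compare with your spreadsheet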

Exercise 9.7
Regression in SPSS
In this exercise you will use SPSS to conduct a regression analysis and create a
graph of the lies and mood data.
1. Invoke SPSS. Create variables and enter the Lies and Mood data, or copy
the data from the spreadsheet you last used in Exercise 9.2.
2. Select Analyze->Regression->Linear from the main menu. Move Lies to
the Dependent window and Mood to the Independent(s) window. Click on
Save and check the Unstandardized boxes under Predicted Values and
Residuals. Click on Continue and then OK.
3. Examine the Model Summary in the output. Do the values for R and R2
agree with your spreadsheet? Now look at the ANOVA box. The Sums of
Squares Regression, Residual, and Total should agree with the values from
your spreadsheet for SSmod, SSerr, and SStot, respectively.
4. Examine the coefficients in the output. Can you find the values for a and b?
5. Finally return to the Data Editor. In step two you instructed SPSS to create
two new variables. The variable pre_1 contains the Y' predicted values and
res_1 contains the residuals (Y-Y'). Do these values agree with your
spreadsheet?
6. Select Graphs->Scatter from the main menu. Click on Simple and then
Define. Move Lies to the Y axis window and Mood to the X axis window.
Click on OK.
7. To create a scatter plot and regression line for the lies and mood data,
double click on the scatter plot to open the chart editor. Select
Chart->Options from the main menu and check Total under Fit Line.
8. Save the lies and mood data.
9. For additional practice you should try running the SPSS Regression
procedure and creating scatter plots for the lies and drug data. Remember to
save the data in a separate file.
Both the exact nature and the strength of a relation or association are
important descriptive topics. In the next chapter we consider a third and
different topic: the statistical significance of an observed association. There we
ask the key inferential question: Is the association observed between two
variables in a sample large enough to merit attention? In other words, is the
observed association statistically significant: Is it likely a real effect and not just a
chance result?
10 Inferring From a Sample:
The F Distribution

In this chapter you will:

1. Learn how to estimate the population variance from the value computed
for the sample.
2. Be introduced to the concept of degrees of freedom.
3. Learn about the F distribution.
4. Learn how to determine whether the proportion of variance accounted
for in a sample by a single predictor is statistically significant.
5. Learn how to do a one-way analysis of variance with two independent
groups.

In chapter 5 you learned how to compute the variance for a sample of scores, and
in the last chapter you learned how a portion of that sample variance could be
accounted for, using the best prediction equation (in the least-squares sense) and
values for a single independent variable. These topics belong to the realm of
descriptive statistics. In this chapter, on the other hand, we discuss ways of
moving beyond the sample, inferring facts and relations that probably
characterize the population from which the sample was drawn. Thus the material
presented here, like that in chapter 7, concerns inferential statistics.

10.1 ESTIMATING POPULATION VARIANCE

Equation 5.4 in chapter 5 defined the sample variance as the sum of squares
divided by the sample size. That is:

VAR = SS/N
Subscripts such as "X" or "Y" or "total" can be added to VAR and SS to identify
the group of scores or the sample in question, but in general terms, the variance
for a sample of scores is the average squared deviation from the mean for scores
in that sample. It is computed by subtracting the sample mean from each score,
squaring each difference, summing the squares, and dividing the sum by the

number of scores in the sample. This is a perfectly reasonable way to describe the
extent to which scores in a sample are dispersed about the mean but as an
estimate of the value of the population variance (σ2), the formula for the sample
variance has one undesirable characteristic. It is biased.

Biased and Unbiased Estimates


Not all estimates are biased. For example, imagine that we drew 100 samples
and computed the mean for each. Each sample mean would be an unbiased
estimate of the population mean. Rarely would a particular sample mean exactly
equal the value for the population mean. Some values might be too high, some
too low, but overall there should be no pattern to the direction of the errors—for
example, no tendency for the sample means to be smaller than the population
mean. A statistic that is biased, on the other hand, will tend to give values that
are either too low, on average, or else too high: The values will be consistently off
in a particular direction. For example, if we computed the variance for each of
the same 100 samples, many more of the values computed for the sample
variances would underestimate the population variance than would overestimate
it. Thus the sample variance is a biased estimate of the population variance
because it consistently underestimates the true population value.
This makes intuitive sense. When drawing samples from a population, we
are likely to miss some of the rarer extreme scores, so it is likely that the
variability in the sample will be less than the variability in the population.
Statisticians have proved to their satisfaction that the following formula
provides an unbiased estimate of the population variance (recall from chapter 5
that the apostrophe or prime after a symbol indicates an estimate):

VAR' = SS/(N – 1)    (10.2)
This formula was first introduced in chapter 5 (Equation 5.5), but without
explaining that the purpose of dividing by N –1 was to provide an unbiased
estimate. When first encountered, this formula may seem too neat. Why should
dividing the sum of the squared deviation scores by N – 1 instead of N produce an
unbiased estimate? Accept for now that proof is possible (e.g., see Hays, 1981),
that the sample variance consistently underestimates the population value, and
that in order to provide an unbiased estimate a correction factor is needed, some
number that will remove the bias from the sample statistic. This correction factor
must be greater than 1 (because the population variance is larger than the sample
variance) and must become smaller as the sample size becomes larger (because
bias decreases with sample size). Statisticians agree that this correction factor is
exactly N divided by N – 1. Hence:

VAR' = VAR × N/(N – 1)
This equation reduces to Equation 10.2 because VAR equals SS divided by N, and
because the Ns cancel. In other words:

VAR' = (SS/N) × N/(N – 1) = SS/(N – 1)
Similarly, the sample standard deviation (SD) is a biased estimate of the true
population standard deviation (σ). Like the sample variance, the sample
standard deviation underestimates the true population value. A better estimate
of the population standard deviation is the square root of the estimated
population variance or the square root of the quotient of the sum of squares
divided by N – 1:

SD' = √VAR' = √(SS/(N – 1))    (10.4)
Technically, a second correction factor is necessary to make this a truly unbiased
estimate. However, for sample sizes larger than 10 the amount of bias is small
and thus in the interests of simplicity—and remembering that the analysis of
variance is based on variances, not standard deviations—the correction is usually
ignored (again, see Hays, 1981). Indeed, many texts ignore this nicety and state
without qualification that Equation 10.4 provides an unbiased estimate of σ.
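A small simulation makes the bias visible. The Python sketch below (an optional aside; it assumes a standard normal population, so the true variance σ2 is 1) draws many samples of size 5 and averages the two estimates:

    import random

    random.seed(1)                         # population: normal, variance 1
    n, reps = 5, 10000
    biased = unbiased = 0.0

    for _ in range(reps):
        sample = [random.gauss(0, 1) for _ in range(n)]
        m = sum(sample) / n
        ss = sum((x - m) ** 2 for x in sample)
        biased += (ss / n) / reps          # VAR: divide by N
        unbiased += (ss / (n - 1)) / reps  # VAR': divide by N - 1

    print(biased)                          # about 0.8, consistently too low
    print(unbiased)                        # about 1.0, on target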

Exercise 10.1
The Estimated Population Variance
The purpose of this exercise is to examine the correction factor, N divided by
N – 1, and to provide practice computing the estimated population variance.
1. What is the correction factor for a sample size of 5? 10? 20? 40? 100?
2. Graph sample size (on the X axis) against the correction factor (on the Y
axis). Compute the correction factor for sufficient sample sizes in the range
1-200 so that the shape of the curve is accurately portrayed. As a practical
matter, beginning with what sample size does it make sense to ignore the
correction?
3. What is the estimated population variance for the number of lies detected
(see Exercise 9.5)?
4. What is the estimated population variance for the number of words spoken
(see Exercise 9.6)?

Note 10.1
VAR' The estimated population variance. It is an unbiased
estimate of σ2, the true population value. It can be estimated
by multiplying the sample variance, VAR, by the quotient of
N divided by N – 1, or by dividing the sum of squares by
N – 1. Often it is symbolized with a circumflex (a ^ or hat)
above σ2, or by s2.
SD' The estimated standard deviation. It is almost an unbiased
estimate (especially if N > 10) of σ, the true population value.
It can be estimated by taking the square root of VAR', or by
taking the square root of the quotient of the sum of squares
divided by N – 1. Often it is symbolized with a circumflex (a
^ or hat) above σ, or by s.

Mean Squares and Degrees of Freedom


There is another more general way to view the equation defining the estimated
population variance:

VAR' = SS/df    (10.5)
Here N – 1 is replaced with df, which stands for degrees of freedom. In the
analysis of variance tradition, an estimate of variance is usually called a
mean square because it is a sum of squares divided by its appropriate degrees of
freedom. Hence Equation 10.5 is often written as:

MS = SS/df    (10.6)
Subscripts such as X or Y or total (and, as you will see, model and error) can be
added to MS (and SS) to identify a particular variance estimate. But in general, a
mean square is a particular sum of squares divided by its appropriate degrees of
freedom. Mean squares, and the degrees of freedom used to compute them, are
basic to inferential statistics. To begin with, and as described in the next section,
the ratio of two mean squares, called the F ratio, plays an important role in the
analysis of variance.

10.2 THE F DISTRIBUTION

Judging from articles in professional journals, perhaps the sampling distribution
behavioral scientists use most often in their day-to-day work is the F distribution.
It was developed primarily by R. A. Fisher during the first few decades of the
20th century and was named in his honor by another important 20th-century
statistician, George Snedecor. The F statistic, or F ratio, is the ratio of two
independent variance estimates or mean squares. Both the numerator and the
denominator of the F ratio are mean squares, that is, sums of squares divided by
their appropriate degrees of freedom. Thus, in general:

F = MSnumerator/MSdenominator    (10.7)
Because sums of squares cannot be negative, neither can the F statistic. Its
possible values range from zero to infinity.
The F distribution is in fact a family of distributions. Each member of the
family is characterized by its numerator and denominator degrees of freedom.
Thus df might equal 2,8 for one distribution (meaning 2 df associated with the
numerator SS and 8 df with the denominator SS) and 6,24 for another. The
actual shape of the F distribution depends on the values for the numerator and
denominator degrees of freedom (see Fig. 10.1). With two degrees of freedom in
the numerator, the distribution looks like a diagonal line that sagged; values are
high near zero and steadily decline as F increases. With higher degrees of
freedom in the numerator, the distribution rises sharply from F = 0, rapidly
reaches a peak, and then falls back more slowly as F increases. And, as you might
guess from chapter 7, as the numerator and denominator degrees of freedom

FIG. 10.1. F distributions for 2 and 8 (lighter line) and for 6 and 24 (heavier
line) degrees of freedom. Critical values (marked with arrows) are 4.46 for
F(2,8) and 2.51 for F(6,24).

become extremely large, the shape of the F distribution approaches that of the
normal.
In practice, null hypotheses are usually formulated so that if the null
hypothesis is true then the two independent variance estimates under
consideration will be equal. And if two variances are equal, their ratio will be 1,
which is the null-hypothesis expected value for the F ratio. Moreover, the mean
squares selected for the numerator and the denominator of the F ratio usually
require large values (certainly greater than 1) of F to disconfirm the null
hypothesis. In other words, in practice the F test is always and inherently one-
tailed. Although values for the F ratio less than 1 can and do occur, they usually
indicate deviant data that likely violate assumptions required for an appropriate
use of the F test in the first place. We could overstate the matter as follows:
There are two kinds of Fs, big Fs (that allow us to reject a particular null
hypothesis) and small Fs (that do not).
Of primary interest to the researcher is the region of rejection, that 5% (if
alpha = .05) or 1% (if alpha = .01) of the area under the sampling distribution
curve that falls in the right-hand tail of the distribution and is demarcated by the
critical value of F. If the computed F ratio falls in this area, that is, if the
computed F ratio exceeds the critical value, then the null hypothesis is rejected.
Thus the result usually desired by researchers is an F ratio big enough to reject
the null hypothesis. Critical values for various members of the F distribution
family, as defined by various pairs of values for the numerator and denominator
degrees of freedom, are given in Table D in the statistical tables appendix. The
next exercise gives you practice in using this table.
At this point you know enough to answer a simple and common preliminary
question. Imagine that you have two sets of scores (e.g., one from a group of
subjects who received a drug and one from another group of subjects who
received a placebo) and you want to know if the variability of the scores in one
group is significantly different from the variability of the other group. (This is
useful to know before we ask whether the group means differ significantly.)
First you would compute estimated variances (mean squares) for each group
separately; then you would divide the larger by the smaller, computing an F ratio.
According to the null hypothesis, the variances for the two groups are equal
(presumably because the groups were sampled from the same and not different
populations). If in fact the variances are roughly equal, the F ratio will be small,
not much larger than 1. But if it exceeds the critical value, as determined by the
numerator and denominator degrees of freedom, then you would reject the null
hypothesis. For example, if the estimated variance or mean square (VAR' or MS)
is 9.62 for a group containing 16 subjects and 5.17 for a second group containing
19 subjects, the F ratio (with 15 degrees of freedom in the numerator and 18 in
the denominator) is 1.86. The critical value for F(15,18) is 2.27 (alpha = .05);
thus we would not reject the null hypothesis. (By convention, numbers in
parentheses after an F ratio indicate degrees of freedom for the numerator and
denominator, respectively.)
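If you have Python with the SciPy library available, critical values of F can be computed rather than read from Table D. A sketch for the example just given:

    from scipy.stats import f

    critical = f.ppf(0.95, 15, 18)         # critical value, alpha = .05
    print(critical)                        # about 2.27

    F_ratio = 9.62 / 5.17                  # larger mean square over smaller
    print(F_ratio, F_ratio > critical)     # 1.86, False: do not reject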

Exercise 10.2
Statistical Significance of the F statistic
The purpose of this exercise is to provide practice in using Table D in the
statistical tables appendix to determine critical values for the F statistic or F ratio.
1. What is the critical value for the F ratio if alpha = .05 and degrees of freedom
= 3,24? If df = 6,24?
2. Would you reject the null hypothesis if F(1,8) = 5.24 and alpha = .05? Why
or why not? Note that F(1, 8) indicates an F statistic with 1 df in the
numerator, 8 in the denominator.
3. If degrees of freedom = 2,30, what is the critical value for alpha = .05? For
alpha = .01? Why must the latter be a larger number than the former?
4. If your computed F were 4.42 and the degrees of freedom were 1 for the
numerator and N - 2 for the denominator, what is the smallest sample size
(N) that would allow you to reject the null hypothesis, alpha = .05?
5. Can an F ratio of 3.8 with one degree of freedom in the numerator ever be
significant at the .05 level? Why or why not?
6. Reaction time scores (in milliseconds) for subjects who received an
experimental drug are 348, 237, 681, 532, 218, 823, 798, 644, 734, 583 and
for subjects who received a placebo are 186, 463, 248, 461, 379, 436, 304,
212. Are the variances for the two groups significantly different?

10.3 THE F TEST

In the preceding chapter we found that 28% of the sample variance in number of
lies detected could be accounted for by knowing subjects' mood scores and 30%
could be accounted for by knowing whether subjects were in the drug or placebo
group (see Figs. 9.2 and 9.3). That was straightforward description. But how can
we decide whether or not the proportion of variance accounted for is statistically
significant? For example, if drug has no effect on the number of lies detected in
the population from which a sample is drawn (as the null hypothesis claims), how
often, just by the luck of the draw, would we select a sample in which we were
able to account for 28% of the variance? Is the probability low enough so that we
can reject the null hypothesis? The statistic used to answer these and similar
questions is the F ratio, which was introduced in the previous section (Equation
10.7), and a test that uses the F ratio is called an F test.
Recall that in chapter 8 we noted that the total sum of squares could be
divided, or partitioned, into two pieces, the sum of squares associated with the
model and the error sum of squares (Equation 8.3). Again, this is simply a
descriptive statement. But now, given our present inferential concern (Were the
10 subjects in the lie detection study drawn from a population in which drug has
no effect on number of lies detected?), it is useful to examine the mean squares,
or variance estimates, associated with the model and error sums of squares. As
you might guess, the ratio of these two mean squares will be distributed as F if
the null hypothesis is true. Thus if the F we compute is large, we have a basis for
rejecting the null hypothesis and can conclude that in general, when people who
have taken this drug lie, their lies are less easily detected by experts.
Degrees of Freedom for the Total Sum of Squares
As you already know, mean squares are sums of squares divided by their
appropriate degrees of freedom (Equation 10.6). Determining the correct
degrees of freedom for different sums of squares is relatively straightforward and
a basic statistical skill you must master. There are two common approaches that
can be understood on a relatively nontechnical level. The first emphasizes how
many scores are free to vary, whereas the second, which is used somewhat more
in this book, emphasizes the number of parameter estimates used in computing
the particular sum of squares. For simplicity of presentation and ease of
understanding, these two approaches are demonstrated first for total degrees of
freedom and then for model and error degrees of freedom, even though it is the
model and error mean squares that are needed to compute the F ratio.
Assume a sample size of 10. If taking the first approach, we accept the
sample size and the sample mean as fixed. If we did not know the actual scores,
we could specify any number we like for the first score— in that sense it is free to
vary; likewise for the second score, the third, and so forth. However only nine
scores are free to vary in this manner. Once the first nine scores are in place, the
10th is determined. There is one value, and only one value, that will insure that
the mean of the 10 numbers is the value specified initially. Thus, because nine
scores are free to vary, there are nine degrees of freedom.
If taking the second approach, we note that an error sum of squares is
computed by summing the squares of the deviations of the raw scores from the
predicted scores. For the total sum of squares, the predicted scores are always
the group's mean. Symbolically:

SStotal = Σ(Yi – MY)2, where i = 1,N
The first element of the deviation (Yi) represents a raw score and because there
are 10 of them, we begin with 10 degrees of freedom. From them we subtract the
predicted scores. The equation for a predicted score (which first appeared in
chap. 5) is

Yi' = a
There is just one parameter estimate, a, which for the total sum of squares is the
sample mean (a = MY). Thus when computing SStotal, we begin with 10 degrees of
freedom and use 1 estimating a parameter, which leaves 9 degrees of freedom.
In general,

dftotal = N – 1
This formula applies whenever the sum of squares represents deviations from the
sample mean.
Moreover, the total degrees of freedom, like the total sum of squares, can be
partitioned into two pieces. In other words,

dftotal = dfmodel + dferror    (10.10)
But do not overextend this notion. Although the constituent sums of squares add
up to the total sum of squares, and although the constituent degrees of freedom
add up to the total degrees of freedom, the constituent mean squares do not sum
to the total mean square (for the drug data, for example, MSmodel + MSerror =
12.1/1 + 28/8 = 15.6, whereas MStotal = 40.1/9 = 4.46).

Degrees of Freedom for the Error Sum of Squares


The error sum of squares is computed by summing the squares of the deviations
of the raw scores from the predicted scores. Symbolically:

SSerror = Σ(Yi – Yi')2, where i = 1,N
Again, the first element of the deviation (Yi) represents a raw score and because
there are 10 of them, we begin with 10 degrees of freedom. From them we again
subtract the predicted scores. In this case, the prediction equation for Yi' is

Yi' = a + bXi    (10.11)
There are two parameter estimates, a and b, which are computed using
regression procedures (Equations 8.7 and 8.8). Thus when computing SSerror, we
begin with 10 degrees of freedom and use 2 estimating parameters, which leaves
8 degrees of freedom. In general,

dferror = N – 2
This formula applies whenever a single predictor variable is used. One degree of
freedom is "lost" to the regression constant (a), the other to the regression
coefficient (b) for the single predictor variable. More generally, the degrees of
freedom for error will always be N minus 1 (for the constant) minus the number
of predictor variables.
Alternatively, and more traditionally, we would accept that there are five
scores in each of two groups and the means for the two groups are fixed. If we
did not know the actual scores, we could specify any number we like for the first
score in the first group— in that sense, it is free to vary— and likewise for the
second, third, and fourth, but not the fifth score in the first group. The fifth score
is constrained by the mean for the first group. The same logic applies to the
second group. Thus, because eight scores are free to vary (four in each group),
there are eight degrees of freedom.
Degrees of Freedom for the Model Sum of Squares
The model sum of squares is computed by summing the squares of the deviations
of the predicted scores from the mean. Symbolically:

SSmodel = Σ(Yi' – MY)2, where i = 1,N
In this case, the first element of the deviation (Yi') represents a predicted score,
which as we already know (from Equation 10.11) requires two parameter
estimates; thus we begin with two degrees of freedom. From the predicted scores
we subtract the mean, which requires one estimate (see Equation 10.8). Thus
when computing SSmodel, we begin with two degrees of freedom and lose one,
which leaves one degree of freedom. In general,

dfmodel = 1

This formula applies whenever a single predictor variable is used. More
generally, the degrees of freedom for the model will always be the number of
predictor variables.
Alternatively, we would accept that there are two groups whose means are
fixed, and an overall or grand mean, which is also fixed. If we did not
know the actual group means, we could specify any number we like for the first
group's mean—in that sense, it is free to vary. The mean for the second group,
however, is constrained (or determined) by the grand mean. Thus, because only
one mean is free to vary, there is only one degree of freedom.
For a sample size of 10 with a single predictor variable, we have shown that
the total degrees of freedom is 9, the degrees of freedom for error is 8, and the
degrees of freedom for the model is 1. Thus for the present example, degrees of
freedom for error and for the model add up to the total degrees of freedom, as
they should (Equation 10.10).

Model and Error Mean Squares


In general terms, a mean square is a sum of squares divided by its appropriate
degrees of freedom. For a single predictor variable and a total sample size of N,
the formulas for the model and error mean squares are:

MSmodel = SSmodel/1    and    MSerror = SSerror/(N – 2)
Do not confuse the SSmodel and SSerror, which are descriptive statistics and sum to
SStotal, with MSmodel and MSerror, which are estimates of population parameters
and do not sum to MStotal. But how are these estimates of variance used? What,
exactly, do they estimate? As it turns out, their use in inferential statistics
depends on what they mean when the null hypothesis is true. Let us now
consider each in turn.
Mean squares are population estimates based on sample data. We know that
such estimates are not perfect reflections of a population but reflect some degree
of sampling error. If the null hypothesis is true, MSmodel and MSerror provide two
different and independent ways to estimate that sampling error, that is, the
population error variance. If the null hypothesis is not true, then the mean
square for the model reflects variability between groups in addition to sampling
error:

MSmodel = effect of predictor variable + sampling error

However, if the means for the two groups are equal as claimed by the null
hypothesis (i.e., if the effect of the predictor variable is nil), then the model mean
square is a pure estimate of sampling error. Again, do not confuse SStotal, which is
a descriptive statistic and can be decomposed into sums of squares based on
between group and within group deviations, with MSmodel, which is a population
estimate and is affected by sampling error as well as by variability associated with
the model if any.
No matter whether or not the null hypothesis is true, the mean square for
error is a pure estimate of sampling error:

MSerror = sampling error


It is not influenced by the effect of the predictor variable. Subtracting predicted
scores from raw scores removes any such effect from the within group deviations
used to compute the error mean square.
Both MSmodel and MSerror provide independent estimates of sampling variance
when the null hypothesis is true, so the ratio of MSmodel to MSerror should be
distributed according to the sampling distribution for F, which provides us with a
way to evaluate the null hypothesis, as described in the next section. (For a
formal development of this argument, see Winer, 1971.)

The F Ratio
The F ratio that tests the null hypothesis that claims no effect for the predictor
variable (or no difference between group means) is

F = MSmodel/MSerror
Theoretically MSmodel should be equal to or larger than MSerror because MSmodel
estimates variability associated with the effect of the predictor on the criterion
variable, if any, plus sampling error, whereas MSerror estimates only sampling
error. Therefore if the null hypothesis is true, F should equal 1. But if it is not
true, F should be larger than 1 (although as noted later it sometimes may be less
than 1). Values of F that exceed the critical value associated with dfmodel and dferror
are grounds for rejecting the null hypothesis.
For the lie detection study, one null hypothesis suggests that treatment group
(drug versus placebo, represented with a single predictor variable) has no effect
on number of lies detected. If this null hypothesis were true, and if the sample
data reflected it exactly, then the proportion of criterion variance accounted for
by the predictor variable would be zero (and the regression coefficients,
correlation coefficients, and r2s would all be zero). Moreover, again if the null
hypothesis were true, the F ratio—the ratio of MSmodel to MSerror—would be
distributed as F with one and eight degrees of freedom (because N = 10).
According to the null hypothesis, MSmodel and the MSerror are equal (they both
estimate the same population parameter); thus the expected value for F is 1.
Note 10.2
MSmodel Mean square for the model. Computed by dividing the sum
of squares associated with the model by the degrees of
freedom for the model. It is also called mean square between
groups or mean square due to regression.
dfmodel Degrees of freedom for the model. It is equal to the number
of predictor variables.
MSerror Mean square for error. Computed by dividing the error sum
of squares by the error degrees of freedom. It is also called
the mean square within groups or the residual mean square.
dferror Degrees of freedom for error. It is equal to N minus 1 (for the
regression constant) minus the number of predictor
variables.
F ratio Usually the ratio of MSmodel to MSerror; more generally, the
ratio of two variances. If the null hypothesis is true, it will be
distributed as F with the degrees of freedom associated with
the numerator and denominator sums of squares.

Needless to say, the ratio of MSmodel to MSerror is rarely exactly 1. A computed


F may occasionally even be less than 1, which suggests that the F test was
probably not appropriate for those data in the first place. However, because the
ratio of two theoretically equal variances is distributed as F with the appropriate
degrees of freedom, the statistical significance of a computed F ratio can be
determined—and if the F ratio is big enough, the null hypothesis can be rejected.
In the next two exercises you are asked to use the F test to evaluate, first, the null
hypothesis that drug does not affect number of lies detected and, second, the null
hypothesis that mood does not affect number of lies detected.
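Although the exercises that follow are spreadsheet based, the same arithmetic can be sketched in a few lines of Python (our illustration, not part of the original exercises); the lies and drug codes are those shown later in Fig. 11.2.

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]   # Y, number of lies detected
drug = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # X, coded 1 = drug, 0 = placebo

n = len(lies)
my = sum(lies) / n
mx = sum(drug) / n

# Simple regression: b = SSxy / SSxx, a = My - b*Mx
ss_xx = sum((x - mx) ** 2 for x in drug)
ss_xy = sum((x - mx) * (y - my) for x, y in zip(drug, lies))
b = ss_xy / ss_xx                    # -2.2, as in Fig. 10.2
a = my - b * mx                      # 6.4
pred = [a + b * x for x in drug]

ss_total = sum((y - my) ** 2 for y in lies)               # 40.1
ss_model = sum((p - my) ** 2 for p in pred)               # 12.1
ss_error = sum((y - p) ** 2 for y, p in zip(lies, pred))  # 28.0

ms_model = ss_model / 1              # df for the model = 1
ms_error = ss_error / (n - 2)        # df for error = N - 2 = 8
print(ms_model / ms_error)           # F is about 3.46, as in Fig. 10.2
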

Exercise 10.3
The Significance of a Single Binary Predictor
The template developed for this exercise allows you to evaluate the statistical
significance of a single predictor variable. The predictor variable examined is
drug group and the null hypothesis states that drug has no effect on number of
lies detected. As you will see, only a few modifications to earlier spreadsheets
are required.

General Instructions
1. A spreadsheet you developed earlier computed the proportion of variance in
number of lies detected that is accounted for by knowing the subject's drug
group (see Fig. 9.3). Modify this spreadsheet so that it computes MStotal,
MSmodel, MSerror, and the F statistic that evaluates whether the proportion of
variance is statistically significant. You may use Fig. 10.2 as a guide.
2. What is the value of the F ratio you computed? What is the critical value for
this F (alpha = .05)? Would you reject the null hypothesis? Why or why not?
Is the effect of drug treatment (drug or placebo) on the number of lies
detected significant at the .05 level or better?
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 9.3. In order to make room for the
inferential statistics, move the summary descriptive statistics from the sum-
of-squares columns (I-K) to the deviation-score columns (E-G). Specifically,
move (use the cut and paste functions, rather than the copy function) the
labels and the formulas for N, the variance, the standard deviation, and R2
from block H14–K17 to block D14–G17. This will erase any formulas
currently in cells D14-G15.
2. Check to make sure that all function references and formulas are correct.
Due to the way spreadsheets execute a move (as opposed to a copy), the
count functions in cells E14-G14 will refer to columns I-K, the variance
formulas in cells E15-G15 will divide the sums of squares in cells I13-K13 by
the Ns now in cells E14–G14, and the formulas for the standard deviation
(cell E16), R2 (cell E17), and r (cell B17) will likewise point to the correct
cells. Other formulas (predicted values, deviation scores, sums of squares,
parameter values) should still be correct from before.
3. Provide labels for the inferential statistics. In column H, label rows 13–16 as
indicated:
Label Row Meaning
ss= 13 The sum of squares for this column.
df= 14 The degrees of freedom for this SS.
MS= 15 The mean square (estimated population variance).
SD'= 16 The estimated population standard deviation.
4. Enter the correct degrees of freedom for the SStotal, SSmodel, and SSerror in
cells I14–K14, respectively.
5. Enter formulas for the mean squares (the sums of squares divided by the
appropriate degrees of freedom) in cells I15–K15.
6. Enter the formula for the estimated population standard deviation in cell I16.
7. Enter the label "F=" in cell J17 and enter the formula for the F ratio (the
MSmodel divided by the MSerror) in cell K17.
8. What is the value of the F ratio you computed? What is the critical value for
this F (alpha = .05)? Would you reject the null hypothesis? Why or why not?
Is the effect of drug treatment (drug or placebo) on the number of lies
detected significant at the .05 level or better?

At this point your spreadsheet should look like the one shown in Fig. 10.2.
You have now produced a template that allows you to determine the significance
of a single predictor variable and can easily be modified for other data, for
example, the mood score data, as we demonstrate in the next exercise. Finally, in
the next section, we discuss how to interpret the results of both the drug group
and mood score analyses.

Exercise 10.4
The Significance of a Single Quantitative Predictor
The template you developed for the last exercise is used for this exercise too.
Only the data are different. The predictor variable examined is mood and the null
hypothesis suggests that mood has no effect on number of lies detected.
1. Change the data entered in the previous spreadsheet (see Fig. 10.2) from a
binary code for drug group to mood scores (the mood scores were last
shown in Fig. 9.1). All other formulas from the previous spreadsheet should
still be in place and correct.
2. What is the value of the F ratio you computed? What is the critical value for
this F (alpha = .05)? Would you reject the null hypothesis? Why or why not?
Is the effect of mood on the number of lies detected significant at the .05
level or better?
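In code, the change is equally small; a sketch with the mood scores substituted (values as shown in Fig. 11.2) should reproduce the statistics in Fig. 10.3:

lies = [3, 2, 4, 6, 6, 4, 5, 7, 7, 9]
mood = [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9]
n = len(lies)
my, mx = sum(lies) / n, sum(mood) / n
b = sum((x - mx) * (y - my) for x, y in zip(mood, lies)) / \
    sum((x - mx) ** 2 for x in mood)                      # about 0.469
a = my - b * mx                                           # about 3.024
pred = [a + b * x for x in mood]
ss_model = sum((p - my) ** 2 for p in pred)               # about 11.24
ss_error = sum((y - p) ** 2 for y, p in zip(lies, pred))  # about 28.86
print((ss_model / 1) / (ss_error / (n - 2)))              # F about 3.12
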

10.4 THE ANALYSIS OF VARIANCE: TWO INDEPENDENT GROUPS

At this point your spreadsheet should look like the one shown in Fig. 10.3. It is
worth reflecting for a moment on what these last two exercises have
accomplished. The analysis performed for Exercise 10.3 (and shown in Fig. 10.2)
is an analysis of variance with two independent groups. This analysis allows you
to determine whether the means for two different groups of subjects are
sufficiently different so that, in all likelihood, subjects in the two groups were
sampled from two different populations, not one. In practical terms, this analysis
allows you to determine whether there is a significant difference between the
means computed for the two groups. The analysis performed in Exercise 10.4
(and shown in Fig. 10.3) evaluates the significance of a regression or correlation
coefficient. This analysis allows you to determine whether the observed
association is sufficiently great so that, in all likelihood, subjects were not
sampled from a population in which there is no association. In practical terms,
this analysis allows you to determine whether the two variables are significantly

A B C D E F G
1 Lies Drug y= m= e=
2 s Y X Y' Y-My Y'-My Y-Y'

13 Sum= 53 5 53 0 4E–15 0
14 N= 10 10 N= 10 10 10
15 Mean= 5.3 0.5 VAR= 4.01 1.21 2.8
16 a,b= 6.4 -2.2 SD= 2.002
17 r= -0.55 R2= 0.302

H I J K L M N
1 SStot SSmod SSerr x= SSX SSxY

2 y*y m*m e*e X-Mx x*x x*y

13 SS= 40.1 12.1 28 0 2.5 -5.5


14 df= 9 1 8 10 10 10
15 MS= 4.456 12.1 3.5 0.25 -0.55
16 SD'= 2.111 0.5
17 F= 3.457
FIG. 10.2. Spreadsheet for computing the F ratio that evaluates the effect of
drug treatment on number of lies detected. Rows 3-12 are the same as Fig.
9.3 so are not shown.
It should now be evident to you that both kinds of questions can be answered
with the same analysis. The overarching question concerns the proportion of
variance accounted for by a single predictor variable. Is it sufficiently greater
than zero so that, in all likelihood, subjects were sampled from a population in
which that predictor has an effect, and not sampled from a population in which
the predictor has no effect as the null hypothesis claims? In other words, does
the predictor variable matter? Can we predict individual outcome significantly
better if we know how subjects scored on the predictor variable?

Statistical Significance and Effect Size


You computed two F ratios for the last two exercises, one evaluating the
contribution of drug group and one evaluating mood. Based on the sample data
provided, neither mood nor drug group significantly affected the number of lies
experts could detect. Although not finding effects significant may seem
disappointing from a research point of view, remember that our example
included data for only 10 subjects, which is not very many. Still, the magnitude of
the effect was not trivial: In this sample, drug group accounted for 30% and
mood scores accounted for 28% of the variance in number of lies detected. Thus
the data from the lie detection study rather dramatically demonstrate the
difference between real-world significance, or the magnitude of the effect, and
statistical significance. In fact, with five additional subjects, both effects would
have been statistically significant at the .05 level (assuming that the additional

A B C D E F G
1 Lies Mood y= m= e=
2 s Y X Y' Y-My Y'-My Y-Y'

13 Sum= 53 48.5 53 0 0 0
14 N= 10 10 N= 10 10 10
15 Mean= 5.3 4.85 VAR= 4.01 1.124 2.886
16 a,b= 3.024 0.469 SD= 2.002
17 r= 0.529 R2= 0.28

H I J K L M N
1 SStot SSmod SSerr x= SSX SSxY

2 y*y m*m e*e X-Mx x*x x*y

13 SS= 40.1 11.24 28.86 0 51.03 23.95


14 df= 9 1 8 10 10 10
15 MS= 4.456 11.24 3.607 5.103 2.395
16 SD'= 2.111 2.259
17 F= 3.116
FIG. 10.3. Spreadsheet for computing the F ratio that evaluates the effect of
mood on number of lies detected. Rows 3-12 are the same as Fig. 9.1 so
are not shown.
Often, perhaps all too often, researchers pay attention only to statistical
significance and ignore effect sizes (see Cohen, 1990; Rosnow & Rosenthal, 1989;
Wilkinson et al., 1999). But effect sizes provide important descriptive
information and should always be reported (using, for example, an index of effect
size such as r2, which was discussed in the previous chapter and used in this one,
or R2, which is discussed in the next chapter). There is no reason to be
particularly dazzled by statistical significance or depressed by its absence. As the
present example demonstrates, when the sample size is small even a relatively
hefty effect may not be statistically significant. Worse, the reverse is also true.
Given a large enough sample size, even a minuscule effect can achieve statistical
significance. You now can understand why throughout this book r2 (and its
multivariate extension, R2) are emphasized. Reflecting as they do the size of real-
world effects, they provide an excellent correction to what might otherwise be an
exclusive focus on the statistical significance of F ratios.

Exercise 10.5
The Significance of a Single Predictor: More Examples
The purpose of this exercise is to provide additional practice in determining the
statistical significance of the proportion of variance accounted for.
1. Recall Exercise 9.6 for which you computed the proportion of variance in the
number of words 18-month-old infants spoke that was accounted for by
knowing the number of older siblings. This proportion was 43.1%. Would
you reject the null hypothesis that this proportion is really zero at the .05 level
of significance? At the .01 level?
2. You also computed the proportion of variance accounted for by knowing
whether an infant had (a) no older siblings or (b) one or more older siblings.
This proportion was 32.6%. Would you reject the null hypothesis that this
proportion is really zero at the .05 level of significance? At the .01 level?
3. The 32.6% in question 2 is a proportion of variance accounted for by a binary
predictor. If significant, it means that the means for the two groups differ
significantly. What is the name of the test that, especially in older literature,
is often used to determine if the means for two groups differ?
4. For the exercises in chapters 8 and 9, you computed the variances and
covariance for X and Y and used them to compute the regression coefficient
and constant. The regression statistics were then used to compute predicted
values for Y, which in turn allowed you to compute the model and error sum
of squares and R2. Now use the multiple regression data analysis procedure
in your spreadsheet to verify the values you have computed for a, b, and R2.
5. For these data, the mean number of words spoken by infants with no older
siblings was significantly different from the mean number for infants with one
or more older siblings. Compute the standard error of the mean for these two
groups and prepare a bar graph (similar to Fig. 7.2) with error bars extending
one standard error of the mean above and below the top of each bar. Now
compute 95% confidence intervals for the two means (and graph them). Do
the 95% confidence intervals suggest that the means are significantly
different from each other?
The figure you prepared for part 5 in the last exercise shows how to present
the results from a simple analysis of variance of two groups graphically (see Fig.
10.4). The top of each bar indicates the mean value for one of the two groups of
infants. But remember, the means obtained for these two groups are derived
from sample data. They only represent an estimate for the true population mean,
which need not be exactly the sample mean. The error bars serve to remind us of
the probabilistic nature of sample data, and so serve as some protection against
taking our results too absolutely.
Usually error bars indicate standard errors of the mean, but as Fig. 10.4
shows, it is also possible to have error bars represent 95% confidence intervals
instead. In the context of an analysis of variance of two independent groups, 95%
confidence intervals do have one advantage: They suggest statistical significance.
Usually, if the difference between two means is statistically significant (at the .05
level), then neither mean will fall within the 95% confidence interval for the
other—but to verify the suggestion you need to do the formal statistical test.
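If you want to check the error-bar arithmetic outside the spreadsheet, here is a minimal Python sketch; it uses the two lie detection groups as stand-in data (the infant word data are not reproduced here) and assumes SciPy is available for the critical t value.

import statistics as st
from scipy import stats   # assumed available; used only for the critical t

drug_group = [3, 2, 4, 6, 6]      # lies detected, drug subjects
placebo_group = [4, 5, 7, 7, 9]   # lies detected, placebo subjects

for group in (drug_group, placebo_group):
    n = len(group)
    mean = st.mean(group)
    sem = st.stdev(group) / n ** 0.5      # standard error of the mean
    t_crit = stats.t.ppf(0.975, n - 1)    # two-tailed critical t, alpha = .05
    print(mean, mean - t_crit * sem, mean + t_crit * sem)
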

10.5 ASSUMPTIONS OF THE F TEST

Three assumptions need to be met before you can be confident that the
probability value of your observed F ratio is valid. The first is that observations
are independent and that scores across groups are independent. If you randomly
sample participants from the population and then randomly assign them to
groups, this assumption will be satisfied. The second assumption is that the Y
values are normally distributed within groups. You can check this assumption
using the graphical techniques described in chapter 6. You can also examine the
skew statistic provided in the Descriptives procedure of SPSS. A general rule of
thumb is that if the skew divided by its standard error is >2.00, then you may
have violated the normality assumption. The final assumption requires that the variances of the groups be roughly equal.

FIG. 10.4. A bar graph showing the mean number of words infants with no
older siblings, and infants with one or more older siblings, spoke at 18
months of age. The error bars on the left represent standard errors of the
mean for the two groups. The error bars on the right represent 95%
confidence intervals for the means of the two groups.
This assumption, known as
homogeneity of variances, can be tested using the F test described earlier in this
chapter. SPSS also provides a test, described in the next exercise, that examines
the homogeneity assumption. It should be noted that, when the groups are of
equal size, analysis of variance (ANOVA) is fairly robust to minor violations of
normality and homogeneity of variances. If, however, the ns are not equal or you
violate both assumptions simultaneously, then corrective actions may be
necessary.
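Levene's test, described in the next exercise, is also available outside SPSS; a minimal sketch assuming SciPy, again using the two lie detection groups:

from scipy.stats import levene   # assumes SciPy is available

drug_group = [3, 2, 4, 6, 6]
placebo_group = [4, 5, 7, 7, 9]
stat, p = levene(drug_group, placebo_group)
print(stat, p)   # a p value above .05 is consistent with equal variances
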

Exercise 10.6
The F Test in SPSS Regression and One-Way ANOVA
This will show you how to find the significance of the amount of variance
accounted for by Drug in Lies Scores.
1. Open the Mood and Drug data file you created in Exercise 9.7 and rerun the
Regression procedure.
2. Examine the ANOVA box in the output. Does this value agree with the F you
calculated in Exercise 10.3?
3. Select Analyze->Compare Means->One-Way ANOVA from the main menu.
Move lies to the Dependent List and Drug to the Factor window. Click on
Options and then, under Statistics, check the Descriptive and Homogeneity
of variance boxes. Click on Continue and then OK.
4. Examine the output. In the Descriptives box, make sure the N, means, and
standard deviations agree with your spreadsheet. You will find the sums of
squares and the F value reported in the ANOVA table. The statistics should be
identical to the ANOVA output from the regression procedure with the exception
that the regression statistics are now termed between groups and the residual
statistics are called within groups. Thus, you may analyze a single-factor study
with a categorical independent variable using either the Regression or One-Way
ANOVA procedures.
5. The One-way ANOVA procedure does, however, also provide a test of the
homogeneity of variances. Find Levene's statistic in the box labeled Test of
Homogeneity of Variances. Levene's statistic provides a test that the variances
of groups formed by the categorical independent variable are equal. This test is
similar to the F test for equal variances presented earlier in this chapter.
Levene's, however, is less likely to be biased by departures from normality.

Considerable basic and important material has been presented in this


chapter. You have been introduced to degrees of freedom, a topic that is
straightforward and yet often seems perplexing to students at first. If you
remember that degrees of freedom relate to the deviations used to compute sums
of squares and are determined by the total number of scores analyzed as
constrained by predictor variables—and if you memorize all the formulas when
presented—then you should have little difficulty. You have also been introduced
to the F distribution and have learned how to perform an F test to determine
whether the means (or variances) of two groups are significantly different from
each other. More importantly, you have learned to distinguish between real-
world importance and statistical significance.
You now know that nontrivial effects may not be found statistically
significant in small samples, but that trivial effects can be found significant if the
sample size is large enough. Clearly, when reporting research results, it is
important to describe both the magnitude of any effects investigated as well as
their statistical significance. Just as clearly, when planning any research, it is
important to determine exactly how large a sample should be in order to detect as
significant effects the researcher regards as big enough to be nontrivial. This
determination is called power analysis and is discussed in chapter 17.
At this point you now know how to perform an analysis of variance for two
independent groups (which is equivalent to a t test for independent groups).
First you determine the proportion of variance accounted for by a single predictor
variable, and then you determine its statistical significance using an F test. This
approach, as you will see in the next chapter, is easily generalized to situations
that require more than one predictor variable. However, no matter whether one
or more than one predictor variables are considered, the central descriptive issue
is, how much variance do they account for (i.e., what is the magnitude of the
effect)? And the central inferential question remains, is that amount sufficiently
different from zero so that it is unlikely to have occurred in our sample just by
chance, given that the null hypothesis is true (i.e., is the effect statistically
significant)?
11 Accounting for Variance:
Multiple Predictors

In this chapter you will:


1. Be introduced to multiple regression and learn how to interpret the basic
multiple-regression statistics.
2. Learn how to estimate R2 for a population from the R2 computed for a
particular sample.
3. Learn how to determine the amount of variance two predictor variables
account for together and whether that amount is statistically significant.
4. Learn how to determine the amount of additional variance a second
predictor variable accounts for, above and beyond the portion of variance
accounted for by the first, and whether that additional amount is
statistically significant.
5. Be introduced in a basic way to concepts underlying the analysis of
covariance.
6. Learn how to generalize the techniques discussed in this chapter from
two to more than two predictor variables.

Beginning in chapter 8, and continuing in chapters 9 and 10, our discussion was
limited to simple, as opposed to multiple, regression and correlation. Simple
correlation is concerned with the degree of association between two variables and
is typically indexed with the simple correlation coefficient, r (whose values can
range from -1 to +1). Simple regression is concerned with the exact relation
between a predictor and criterion variable, as indexed by the regression
coefficient b (whose values are affected by values for the two variables involved
and, in theory at least, can range from minus to plus infinity), and with the
proportion of criterion variance that can be accounted for, given knowledge of the
predictor variable, as indexed by r2 (whose values can range from 0 to +1).
Simple regression and correlation are conveniently viewed as a single topic.
As noted previously, definitional formulas for r and b (Equations 8.11 and 8.12)
appear quite similar. In fact, one way to compute r is as follows:

r = b(SDX / SDY)

This serves to remind us that r is a standardized version of b. The regression
coefficient is expressed in Y units per unit change in X. Multiplying b by the ratio
of the X to Y standard deviations cancels the units and provides a unit-free,
standardized measure of the association between the two variables. In other
words, correlation coefficients can be compared for different pairs of variables
(similar values indicate similar degrees of relation), whereas regression
coefficients cannot be compared in this way. Their values and units reflect the
raw-score scales of the variables involved.

11.1 MULTIPLE REGRESSION AND CORRELATION


Multiple regression and correlation (MRC) is a straightforward extension of
simple regression and correlation. Instead of being concerned with only one
predictor variable, MRC is concerned with the effects of two or more predictor
variables, working in concert. The techniques of MRC provide answers to
questions like, how much criterion variance is accounted for by knowledge of a
set of predictor variables working together? And, what is the unique contribution
of a particular predictor—that is, how much additional criterion variance is
accounted for when that predictor variable is added to the existing set? If mood
affects the number of lies detected, for example, does knowing the drug
treatment, in addition to knowing the mood score, increase the proportion of
variance accounted for significantly?
More generally, the techniques of MRC allow answers to many of the simpler
research questions typically asked by behavioral scientists. Analysis of variance
(ANOVA) and analysis of covariance (ANCOVA), for example, can be understood
as particular instances or applications of multiple regression. Thus, whenever
investigators wish to account for or explain a given quantitative criterion variable
in terms of a few predictor variables, a basic understanding of MRC, as presented
in this book, is often all that is required.

Multiple Regression Parameters


In chapter 8 we noted that the regression equation for a single predictor was:

Y' = a + bX

We showed that the parameters for this equation, a and b, could be determined
by trial and error, but we also showed a way to compute the values for the
parameters. As you might guess, the equation for two predictors is

Y' = a + b1X1 + b2X2   (11.2)

This equation has three parameters, a, b1, and b2. More generally, the equation
for K predictors is

Y' = a + b1X1 + b2X2 + ... + bKXK   (11.4)

In chapter 8 we also interpreted the parameters graphically, as the Y intercept


and slope of a straight line, in the belief that such visualization can aid
understanding. Although it is possible to visualize the two-predictor case
spatially in three dimensions, the more than three dimensions required for three
or more predictors are difficult, if not impossible, for most of us to visualize.
Hence from now on we rely on the more general Equation 11.4 and drop the
spatial metaphor, which only works for Equation 11.2.
The trial-and-error method for determining parameter values, however,
remains appropriate, or at least theoretically possible, no matter the number of
predictors. The best values for the parameters (in the least-squares sense)
remain those that minimize the sum of the squared errors (the deviations
between predicted and observed scores, squared and summed). And, in theory at
least, they could be found by trying different combinations of values for the
parameters and noting their effect on the error sum of squares. However, as the
number of variables increases, the concomitant increase in the number of
combinations becomes astronomical, which renders the trial-and-error method
tedious if not completely impractical.
Similarly, computing the values for the multiple-regression parameters with
simple spreadsheet formulas is easy enough with one predictor, not much more
difficult with two, but becomes increasingly impractical as the number of
predictors increases. Happily, there are general methods for computing the best-
fit parameters. Learning such methods (which involve matrix algebra) is beyond
the scope of this book, but the methods are embodied in widely available
multiple-regression computing programs. Indeed, as noted in chapter 1, most
spreadsheet programs include built-in routines for computing multiple-
regression statistics.
For the exercises in the remainder of this book, instead of computing
multiple-regression statistics directly, you will use a multiple-regression routine.
(In Excel this is done by entering Tools/Data Analysis/Regression, although the
first time you may need to specify Tools/Add-Ins/Analysis ToolPak.) You then
specify the range for the dependent variable and the range for the independent
variable or variables, indicate where the output should be placed, and instruct the
routine to compute the statistics. The routine then computes and places in your
spreadsheet values for various regression statistics including values for the
constant (a) and the regression coefficients (b1, b2, etc.). These values can then
be referenced by the formula you define to compute predicted values for Y.
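As an illustration (ours; the book's exercises use the spreadsheet routine), the same least-squares solution can be computed with NumPy, assuming it is available; the coefficients should match those reported in Fig. 11.1.

import numpy as np

y = np.array([3, 2, 4, 6, 6, 4, 5, 7, 7, 9], dtype=float)    # lies detected
x1 = np.array([5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9])      # mood scores
x2 = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)   # drug group

# The column of 1s carries the regression constant a
X = np.column_stack([np.ones_like(y), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef                  # about 4.764, 0.256, and -1.41
y_pred = X @ coef                 # predicted scores, Y'
print(a, b1, b2)
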

Partial Regression Coefficients


Perhaps too informally, we have referred to the b1s, b2s, and so forth, of multiple
regression as regression coefficients. This is to deprive them of their full and
correct name, which is partial regression coefficients. Only in the case of a single
predictor variable is it correct to speak of a regression coefficient. Whenever
there is more than one predictor variable, it is important to remember the
qualifying partial because it serves to remind us that each partial regression
coefficient is just one member of a family and we cannot forget the family. Each
bk (where K is the number of variables and k = 1, ..., K) describes the relation
between a particular predictor variable and the criterion, not for that predictor
variable in isolation—the simple regression coefficient does that—but when the
other predictor variables in the set are taken into account.
In statistical terms, the matter is often put as follows: The partial regression
coefficient describes the change in the criterion variable per unit change in the
predictor variable when other variables in the set are held constant. It is
important to keep in mind that the partial regression coefficients of multiple
regression (the b1s, b2s, and so forth) cannot be treated as if they described a
simple bivariate relation between a single predictor variable and the criterion.
They describe instead the relation between a predictor variable and the criterion
when that predictor is part of, or works in concert with, the other predictors in
the set.
Multiple regression routines typically compute values for a number of other
statistics in addition to the a, b1, b2, and so forth, needed to compute predicted
scores and so it makes some sense to describe them briefly. The output produced
by a typical spreadsheet multiple-regression routine is shown in Fig. 11.1. The
data provided as input were from our running example (Y = number of lies
detected, X1 = mood score, X2 = drug group, coded 1 for drug and 0 for placebo).
Some of the statistics shown in Fig. 11.1 are already familiar to you. For example,
the regression constant, labeled intercept, and the regression coefficients, labeled
X Variable 1 and X Variable 2, appear on the last three lines. But others, described in
subsequent paragraphs, may be new to you.

R, R2, and Adjusted R2


Multiple R, on the first line after Regression Statistics, is the multiple
correlation coefficient, the correlation between a criterion variable and a set of
predictor variables. R Square, on the second line, which is often written R2, is
similar to r2, which was introduced earlier (Equation 7.8). When only one
predictor variable is used, the proportion of variance accounted for is written r2
(lower case) but when more than one predictor variable is used, this is written R2
(upper case). But the definition remains the same: R2, like r2, is the proportion
of total variance accounted for in a sample by a particular model. That is:

R2 = SSmodel / SStotal

The multiple correlation coefficient squared (R2), like the coefficient of
determination (r2), can assume values ranging from 0 to 1.
Before proceeding further we should point out that no matter the value or
significance of R2, its use as a descriptive statistic has limitations.

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.588
R Square 0.346
Adjusted R Square 0.159
Standard Error 1.936
Observations 10

ANOVA
df SS MS F Significance F
Regression 2 13.86 6.932 1.849 0.227
Residual 7 26.24 3.748
Total 9 40.1

Coefficients Standard Error t Stat P-value


Intercept 4.764 2.537 1.878 0.102
X Variable 1 0.256 0.373 0.686 0.515
X Variable 2 -1.41 1.683 -0.84 0.431
FIG. 11.1. Typical output from a spreadsheet multiple-regression routine.
The multiple
correlation coefficient squared is a sample statistic and, like the sample variance
and standard deviation, it provides a biased estimate of the population value. But
unlike the sample variance and standard deviation, the sample R2 overestimates
the population value. This is because it capitalizes on any chance variation in the
sample. If R2 were .5, for example, claiming that the predictor variables
accounted for half of the variance would likely overstate the true state of affairs, if
one meant this to apply to the population generally.
A formula for R2, adjusted to provide an unbiased estimate of the population
value, is

R2adj = R2 - K(1 - R2) / (N - K - 1)   (11.6)

The adjustment appears on the right-hand side of the equation after the minus
sign. For constant numbers of predictor variables and subjects, the adjustment
becomes smaller as R2 becomes larger. And for constant values of R2, the
adjustment becomes smaller as the ratio of predictor variables to subjects
becomes smaller (a ratio of 1:12 is smaller than a ratio of 1:3). In other words, the
more subjects there are for each variable, the smaller is the adjustment. If there
are few subjects for each predictor variable—if the ratio of subjects to variables is
small (3:1 is smaller than 12:1)—the adjustment can be quite substantial,
especially if R2 is not very large. The adjustment can even be greater than the
sample R2, which results in an adjusted R2 less than zero. If that should happen,
however, report the value as zero: A negative R2 makes no sense.
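In code the adjustment is a one-liner; this sketch assumes the form of Equation 11.6 shown above and reproduces the adjusted value in Fig. 11.1.

def adjusted_r2(r2, n, k):
    """Estimate the population R2 from a sample R2 with k predictors."""
    adj = r2 - k * (1 - r2) / (n - k - 1)
    return max(adj, 0.0)          # report a negative adjusted R2 as zero

print(adjusted_r2(0.346, n=10, k=2))   # about 0.159, as in Fig. 11.1
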
A question is, which value should you report, R2 or adjusted R2? Often only
R2 is reported, but if you wish your results to characterize the population from
which your sample was drawn, it makes more sense to report the adjusted R2 and
not the sample R2. If you have any doubts, you can always report both.
At this point, a brief comment about notation is in order. Throughout this
text, an apostrophe or prime is used to indicate an estimated value (e.g., Y', VAR',
SD', SD'M, and so forth). Thus it would make sense to apply the same convention
to the adjusted R2. However, perhaps because R2' looks somewhat awkward,
R2adjusted or R2adj is conventional so we use it here.

The Standard Error of Estimate


The fourth line after Regression Statistics in Fig. 11.1 is Standard Error. More
fully, this is the estimated standard error of estimate, which may be new to you
but is not new in concept. For the present sample of 10 scores, the total sum of
squares is 40.1. Therefore the standard deviation for raw scores in the population
as estimated from sample data is

SD' = √(SStotal / (N - 1)) = √(40.1 / 9) = 2.11

(See Equation 5.7 and Fig. 10.3.) This quantity, which has the same units as
those used to measure Y (in this case, number of lies), can be thought of as the
average error—for the population—when guessing the number of lies knowing
just the mean.
Similarly, the standard error of estimate is the square root of the error sum
of squares divided by N, and the estimated standard error of estimate (the
population value estimated from sample data) is the square root of the sum of
squares for error divided by its degrees of freedom. Subscripting SD' with Y-Y'
indicates that SD'Y-Y' represents the estimated standard deviation for the
deviations between raw and predicted scores (instead of between raw scores and
the mean as for SD'):

SD'Y-Y' = √(SSerror / dferror)   (11.7)

This quantity can be thought of as the average error, for the population, when
guessing the number of lies using the prediction equation. In this text, standard
error of estimate refers to the sample statistic and estimated standard error of
estimate refers to the population estimate. Some texts drop the "estimated" and
use standard error of estimate to refer to the quantity defined in Equation 11.7.
As you will see in the next exercise, the SSerror when prediction is based on
two variables is 26.24. Thus for the lie detection study the estimated standard
error of estimate, when predicting number of lies detected from both mood and
drug group, is

SD'Y-Y' = √(26.24 / 7) = √3.748 = 1.94

Note that 1.94, the estimated standard error of estimate or the average error in
the population when prediction is based on two variables, is less than 2.11, the
estimated value for the population standard deviation or the average error when
prediction is based only on the mean number of lies detected. In other words,
when using mood and drug group to predict number of lies, and not just the
mean number of lies, the average error decreases from 2.11 to 1.94, a reduction of
about 8.3%. The arithmetic is as follows:

(2.111 - 1.936) / 2.111 = .083

Thus SD'Y-Y', especially when compared with SD', gives a sense of how important
a particular set of predictor variables is in general. The more effective a set of
predictors is in improving prediction and hence reducing error, the smaller the
estimated standard error of estimate will be relative to the estimated standard
deviation.
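A short sketch of the same comparison, using the sums of squares from Fig. 11.2:

ss_total, ss_error = 40.1, 26.24
n, k = 10, 2

sd_est = (ss_total / (n - 1)) ** 0.5        # estimated SD, about 2.11
see_est = (ss_error / (n - k - 1)) ** 0.5   # estimated SE of estimate, about 1.94
print((sd_est - see_est) / sd_est)          # about .083, an 8.3% reduction
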
For some reason, perhaps historical, perhaps reflecting researchers' urge to
generalize, most general-purpose regression routines print only SD'Y-Y' (and call it
the standard error of estimate or simply the standard error) and not SDY-Y' (the
square root of the SSerror divided by N). At the same time, as exemplified by Fig.
11.1, often they print only the sample R2 and not the adjusted R2.
The degrees of freedom (df) are also given in Fig. 11.1. For the present
example, three parameters (a, b1, and b2) were used to compute the predicted
scores. The regression constant (a) is subtracted from N, the number of scores,
which gives 9 degrees of freedom Total. The model then contains two parameters
(b 1 , and b2), hence two degrees of freedom for the Regression, which leaves 7
degrees of freedom for error or Residual.
A second brief comment about notation is now in order and concerns SD', the
estimated standard deviation, and SD'M, the standard error of the mean, as
compared with SD'Y-Y', the estimated standard error of estimate. Each of these
three represents a population standard deviation, that is, the average deviation of
scores from a specified reference point. For SD'Y-Y', the deviations are the
regression residuals—the deviations of raw scores from predicted ones—as
reflected by the subscript Y-Y'. Yet when deviations are the Y scores from their
mean, SD'Y is used and not SD'Y-M. Similarly, when deviations are sample means
from the population mean, SD'M is used and not SD'M-μ. This usage both is
conventional and makes sense. When the reference point for a deviation is a
mean, and when the particular mean used is clear from context, notation for
standard deviations of any sort typically omits the reference point from the
subscript.

Exercise 11.1
The Standard Error of Estimate and Adjusted R2
This exercise adds the capability to compute the estimated standard error of
estimate and the adjusted R2 to the spreadsheets shown in Figs. 9.2 and 9.3.
1. Add a formula for the estimated standard error of estimate to the
spreadsheets shown in Figs. 9.2 and 9.3. Note the similarity between this
formula and the formula for the estimated standard deviation for the
population.
2. What is the estimated standard error of estimate when drug treatment alone
is used to predict number of lies detected? What percentage reduction in
average error, beyond using just the mean as a predictor, does this
represent? What is the estimated standard error of estimate when mood
scores alone are used as the single predictor variable? Again, what
percentage reduction does this represent?
3. Add a label and a formula for the adjusted R2 to the spreadsheets shown in
Figs. 9.2 and 9.3. For each spreadsheet, what is the value for R2adj? How
does each compare with the corresponding value for R2?

Standardized Partial Regression Coefficients


In common with the adjusted R2, standardized partial regression coefficients
(usually symbolized as lowercase Greek betas or βs) are not computed by most
spreadsheet multiple-regression routines. (Do not confuse this use of beta with
its use as a symbol for the probability of type II error.) Like correlation
coefficients, and unlike unstandardized partial regression coefficients or bs, βs
are standardized, so their values can be compared even when variables within
a study have quite different scales. For that reason they figure prominently in
most discussions and many applications of multiple regression.
For the analyses presented in this book, however, the change in R2 when
additional variables or sets of variables are added to a multiple-regression
equation is emphasized. Little attention is given to the βs and only some to the bs
(which are used to compute Y'), because for almost all of the analyses described
in this book the only multiple-regression statistic actually needed is R2.
Nonetheless, you should be aware of other multiple-regression statistics so that
you recognize them when you encounter them, and you should also be aware that
there is much to learn about multiple regression other than the brief and
introductory material presented in this book.
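Although most spreadsheet routines do not print them, each β is easily computed as b multiplied by the ratio of the predictor's standard deviation to the criterion's; a sketch using the coefficients from Fig. 11.1 and standard deviations computed from the data (our values, rounded):

b1, b2 = 0.256, -1.41                          # coefficients from Fig. 11.1
sd_y, sd_mood, sd_drug = 2.111, 2.381, 0.527   # estimated SDs from the data

beta1 = b1 * sd_mood / sd_y    # about 0.29
beta2 = b2 * sd_drug / sd_y    # about -0.35
print(beta1, beta2)
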
Other Regression Statistics
We have yet to comment further on the last few lines in Fig. 11.1. These include
the standard errors for the X coefficients (the unstandardized partial regression
coefficients or the bs). They are useful because they can be used to compute a t
score, which is also shown in Fig. 11.1 and which can be used to determine
whether or not the individual partial regression coefficients are significantly
different from zero. For present purposes, the statistical significance of
regression coefficients is not emphasized. Instead, the analyses developed here
utilize the significance of the increase in R2 that occurs when variables (or sets of
variables) are added, step by step, to the regression equation. Exactly how this
works is demonstrated later in this chapter, but first you need to know how to
determine the amount of variance two predictors, working in concert, account
for, and how to evaluate whether that amount of variance is statistically
significant.

Exercise 11.2
Significance of Multiple R2
The template developed for this exercise allows you to evaluate the statistical
significance of, and describe the proportion of criterion variance accounted for
by, two predictor variables acting together. The predictor variables examined are
mood and drug group, and the null hypothesis suggests that these two variables
together have no effect on number of lies detected. In addition, using information
from this and the previous two spreadsheets (Figs. 9.2 and 9.3), the proportion of
variance accounted for uniquely by each predictor variable, separate from the
other, can be computed.

General Instructions
1. This spreadsheet will have one column for the dependent variable (number
of lies detected) and two for the independent variables (mood and drug
group). Beginning with the spreadsheet shown in Fig. 10.3, insert a column
for drug group and enter the appropriate data (1 for the first five subjects, 0
for the last five).
2. Beginning with this spreadsheet, you will use a multiple-regression routine to
compute regression statistics. Run the routine and insure that the cells for a,
b1, and b2 display the values computed by it. Then correct the formula for Y'
so that is based on the linear combination of both predictor variables (Y'= a
+ b1X1 + b 2 X 2 ). Also check that the value for R2 computed by the
spreadsheet (as proportion of total variance due to the model) is the same as
the R2 computed by the program.
3. From now on we will usually be concerned with more than one predictor
variable, so change the label for the correlation coefficient from r (for simple
correlation) to R (for multiple correlation). Compute R as the square root of
R2. This eliminates the need for the three columns containing the deviation
scores for X, their square, and the cross products for the X and Y deviation
scores. They can be erased.
4. Finally, enter the correct degrees of freedom. At this point, all statistics,
including the F ratio, should be displayed correctly, based on formulas
already in place.
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 10.3. Insert a new column
between columns C and D. This has the effect of moving the old columns D-
K to columns E– L, opening up a new (and for the moment blank) column D.
Alternatively, you may want to move the block D1-K17 to E1-L17, which
also has the effect of opening up column D.
2. Label the new column D "Drug" (in cell D1). Enter the label "X1" (instead of
"X") in cell C2 and the label "X2" in cell D2. Then enter the coded data for
drug group in cells D3–D12 (1 for the first five subjects, 0 for the last five).
3. Enter the value for the parameter a (given in Fig. 11.1) in cell B16. Enter the
values for the parameters b1 and b2 (again from Fig. 11.1) in cells C16 and
D16. Alternatively, you may want to invoke whatever multiple-regression
routine you plan to use and verify these values yourself.
4. From now on, we will usually be concerned with more than one predictor
variable, which means we will compute multiple R, not r. Therefore enter the
label "R=" (instead of "r=") in cell A17. In addition, replace the formula
currently in cell B17 with a function that computes the square root of R2 (cell
F17). The formula for R2, however, remains the same.
5. The R is computed from R2, and the regression statistics are computed by a
multiple-regression routine; thus we no longer need the information in
columns M–O (which were columns L-N in Fig. 10.3). These columns
should be deleted.
6. Enter a formula for the equation, Y'= a + b1X1 + b2X2, in cells E3-E12, using
the parameter values in cells B16-D16.
7. Enter the correct degrees of freedom in cells J14-L14. At this point, all of the
statistics indicated in the spreadsheet should be computed correctly, using
formulas already in place from the previous spreadsheet.

The spreadsheet you just completed should look like the one shown in Fig.
11.2. If everything was done correctly, then the R2 computed as the proportion of
total variance accounted for by the model should be the same as the R2 in Fig.
11.1. Similarly, the square root of the mean square for error, which is the
standard error of estimate, should have the value given in Fig. 11.1.

Multiple R Revisited
Multiple R is the correlation between Y and Y', between the observed scores for Y
and the predicted scores, which are the optimally weighted sum of a set of
predictor variables, X1, X2, and so forth. Earlier in this chapter, multiple
correlation was defined as the relation between a criterion variable and a set of
variables working in concert. What was meant by that phrase was the optimally
weighted sum of a set of predictor variables, Y', as computed by Equation 11.4,
the general prediction equation:

Y' = a + b1X1 + b2X2 + ... + bKXK

The weights used to compute Y' (that is, the regression coefficients b1 b2, and so
forth) are selected by multiple-regression computations so that the error sum of
squares will be the smallest value possible given the data, which is why the
weighted sum is called optimal.
As a consequence of minimizing the error sum of squares, the model sum of
squares and hence R2 (and multiple R) will be maximized, that is, will assume the
largest values possible for the data. As noted earlier, multiple regression
inherently capitalizes on chance variation in data, producing the highest possible
values of R and R2 for a given data set. This bias can be corrected by adjusting R2
(Equation 11.6), but even so readers should be aware that multiple-regression
statistics are especially vulnerable to chance fluctuations in sampling and,
especially with small data sets, results seen in one sample may not replicate in
others.
For spreadsheet purposes, the value of multiple R can be computed in one of
two ways: as the square root of R2 (as in the last exercise) or as the correlation
between Y and Y' (although this could require that deviation scores and their
squares and cross products be computed for both Y and Y'). The correlation
method is demonstrated in the next exercise.

A B C D E F G H
1 Lies Mood Drug y= m= e=
2 s Y X1 X2 Y' Y-My Y'-My Y-Y'
3 1 3 5.5 1 4.762 -2.3 -0.54 -1.76
4 2 2 2 1 3.868 -3.3 -1.43 -1.87
5 3 4 4.5 1 4.507 -1.3 -0.79 -0.51
6 4 6 3 1 4.123 0.7 -1.18 1.877
7 5 6 1.5 1 3.74 0.7 -1.56 2.26
8 6 4 6.5 0 6.426 -1.3 1.126 -2.43
9 7 5 3.5 0 5.659 -0.3 0.359 -0.66
10 8 7 7 0 6.553 1.7 1.253 0.447
11 9 7 6 0 6.298 1.7 0.998 0.702
12 10 9 9 0 7.064 3.7 1.764 1.936
13 Sum= 53 48.5 5 53 0 0 0
14 N= 10 10 10 N= 10 10 10
15 Mean= 5.3 4.85 0.5 VAR= 4.01 1.386 2.624
16 a,b= 4.764 0.256 -1.41 SD= 2.002
17 R= 0.588 R2= 0.346

I J K L
1 SStot SSmod SSerr
2 y*y m*m e*e

13 SS= 40.1 13.86 26.24


14 df= 9 2 7
15 MS= 4.456 6.932 3.748
16 SD'= 2.111 1.936
17 R2adj = 0.159 F= 1.849
FIG. 11.2. Spreadsheet for computing the F ratio that evaluates the effect of
mood and drug group together on number of lies detected. Rows 3-12 for
columns I-L are not shown.

Exercise 11.3
The Correlation Between Observed and Predicted Scores
This exercise shows that multiple R is the correlation between Y and Y' scores.
1. Beginning with the spreadsheet shown in Fig. 11.2, compute the correlation
between Y and Y'. There are two ways to do this: You could add columns
for the appropriate deviations and their squares and cross products, and
compute R according to Equation 9.11, or you could regress Y on Y' and
compute the square root of the R2 provided by the multiple-regression
routine. In other words, regressing Y on either Y', or on X1 and X2, should
produce the same R2. Does it?
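In code, the same check looks like this (a sketch; it recomputes Y' with NumPy and correlates it with Y):

import numpy as np

y = np.array([3, 2, 4, 6, 6, 4, 5, 7, 7, 9], dtype=float)
X = np.column_stack([np.ones(10),
                     [5.5, 2, 4.5, 3, 1.5, 6.5, 3.5, 7, 6, 9],   # mood
                     [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])            # drug
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef

r = np.corrcoef(y, y_pred)[0, 1]
print(r, r ** 2)   # about 0.588 and 0.346, matching Fig. 11.1
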

Note 11.1
R Multiple R. Just as r is the correlation between a criterion
variable (Y) and a single predictor variable (X), so R is the
correlation between a criterion variable and an optimally
weighted sum of a set of predictor variables (X1, X2, and so
forth).
R2 Multiple R squared. It is the proportion of criterion variable
variance accounted for by a set of predictor variables,
working in concert.
R2adj Adjusted multiple R squared. It is the population value for R2
estimated from a sample, which will be somewhat smaller
than R2. Consistent with the notion used in this text, it could
also be symbolized as jR 2 ' with the prime indicating an
estimated value, but the adjusted subscript is used far more
frequently in multiple-regression texts.
SD'Y-Y' The estimated standard error of estimate. It is the estimated
population standard deviation for the regression residuals,
that is, the differences between raw and predicted scores. It
can be regarded as an estimate of the average error made in
the population when prediction is based on a particular
regression equation. Sometimes it is called the standard
error of estimate, leaving off the qualifying estimated. Often
it is symbolized with a circumflex or ^ above the SD and
above the second Y subscript instead of a prime after them.

11.2 SIGNIFICANCE TESTING WITH MULTIPLE PREDICTORS

Exercise 11.2 should have convinced you how easy it is to generalize from one
predictor to more than one predictor variable. All you did was add the data for a
second variable and change the prediction equation and degrees of freedom to
take the second variable into account. All other computations you needed were
held over from the earlier spreadsheet, including the statistics needed for
significance testing.
In chapter 10 you learned that the F ratio is:

F = MSmodel / MSerror = (SSmodel / dfmodel) / (SSerror / dferror)   (11.8)

For the current example, the value for this F ratio is 1.85 (see Fig. 11.2), and it has
2 and 7 degrees of freedom. The critical value for F(2,7), alpha = .05, is 4.74, so
this result fails to reach a conventional level of significance. We conclude that we
cannot reject the null hypothesis that mood and drug acting together have no
effect on number of lies detected, at least not with a sample of 10 subjects.
Not all regression routines print SSmodel and SSerror, but all give R2. For that
reason, an alternate but equivalent formula for the F ratio that often proves
useful is

F = (R2model / dfmodel) / (R2error / dferror)   (11.9)

The R2 for error is itself rarely printed but it is easily computed. The R2model is
the proportion of variance accounted for by the model; thus R2error, the proportion
unaccounted for, is:

R2error = 1 - R2model

And so we could rewrite Equation 11.9 as follows:

F = (R2 / K) / ((1 - R2) / (N - K - 1))   (11.10)

All we need to remember is

dfmodel = K

and

dferror = N - K - 1

(N scores initially; one df used by the regression constant, the rest by the
predictor variables). Thus, using Equation 11.10, you can test whether two (or
more) variables together account for a statistically significant amount of criterion
score variance.
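Equation 11.10 translates directly into code; this sketch reproduces the F ratio in Fig. 11.2 from R2 alone.

def f_from_r2(r2, n, k):
    """F ratio for a model with k predictors and N subjects (Equation 11.10)."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

print(f_from_r2(0.346, n=10, k=2))   # about 1.85, as in Fig. 11.2
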

Exercise 11.4
Computing the F Ratio Using R2
This exercise provides practice in using the R2 formulation for the F ratio.
1. Demonstrate that, for the data shown in Fig. 11.2, the F ratios computed
using equations 11.8 (SS) and 11.9 (R2) are identical.
2. Demonstrate algebraically that the two equations must give equivalent
results. (Optional)
3. Assume that a single predictor variable accounts for 40% of the variance.
What is the minimum number of subjects that would need to be included in a
study in order to conclude that this amount of variance is significant at the .05
level? This exercise is more challenging than most. One way to determine
the answer to this question is to set up a spreadsheet. Columns would
indicate number of subjects, degrees of freedom, R2, and F. Rows would
indicate successively greater numbers of subjects. In this way, the fewest
number of subjects that would nonetheless yield a significant F ratio can be
found.
4. Now determine the minimum number of subjects required for significance
when the proportion of variance accounted for is 30%, 20%, 10%, and 5%.
Note that if a statistical table does not contain an entry for the exact number
of degrees of freedom you need, use the entry for the next fewer degrees of
freedom. It is acceptable (and conservative) to claim fewer degrees of
freedom than you have, but it is not acceptable to claim more.
5. Based on your answers to parts 3 and 4, what appears to be the relation
between proportion of variance and number of subjects needed to find that
proportion significant?
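If you prefer a program to a spreadsheet for this search, a sketch follows (it assumes SciPy for the critical values of F):

from scipy.stats import f

def min_n(r2, k=1, alpha=0.05):
    """Smallest N for which r2 with k predictors is significant at alpha."""
    n = k + 2                               # start at the smallest usable N
    while True:
        df_error = n - k - 1
        f_obs = (r2 / k) / ((1 - r2) / df_error)
        if f_obs > f.ppf(1 - alpha, k, df_error):
            return n
        n += 1

print(min_n(0.40))   # part 3 of this exercise
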

11.3 ACCOUNTING FOR UNIQUE ADDITIONAL VARIANCE

From the lie detection study analyses conducted in this chapter and in chapter 10
we have learned that, with respect to the DV, number of lies detected:

1. 28.0% of the variance is accounted for when mood scores alone are
considered (see Fig. 10.3).
2. 30.2% of the variance is accounted for when drug group alone is
considered (see Fig. 10.2).
3. 34.6% of the variance is accounted for by drug group and mood scores
considered together in concert (see Fig. 11.2).

We could organize these three analyses into two different hierarchic series.
For the first, we would begin with mood and then add drug group. From it we
would learn that drug group accounts uniquely for 6.5% of the variance above
and beyond that already accounted for by mood:

.34572 - .28034 = .06538

(We used numbers accurate to five digits for this computation, and then rounded
the result.) For the second, we would begin with drug group and then would add
mood. From it we would learn that mood accounts uniquely for 4.4% of the
variance above and beyond that already accounted for by drug group:

.34572 - .30175 = .04397

Not surprisingly, the unique contribution of each predictor variable is less


than its contribution alone. The two predictor variables are correlated, and as a
result, part of their influence is joint; that is, it cannot be assigned uniquely to
one variable or the other. This overlap is easy to compute. It must be 23.6%,
which, for each predictor, is the difference between the variance it accounts for
alone and the variance it accounts for uniquely:

.30175 - .06538 = .23637

Likewise:

.28034 - .04397 = .23637

In other words, given two predictor variables, variability associated with the
criterion variable can be divided into four pieces (see Fig. 11.3):

1. Variance accounted for uniquely by the first predictor (in this case 4.4%
for mood scores, represented by the nonoverlapped part of the mood
circle).
2. Variance accounted for uniquely by the second predictor (in this case
6.5% for drug group, represented by the nonoverlapped part of the drug
circle).
3. Variance accounted for jointly by the two predictors (in this case 23.6%,
represented by the overlap between the mood and drug circles).
4. Variance left unaccounted for by the predictor variables (in this case
65.5%, represented by the area outside the mood and drug group circles).
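These four pieces are simple sums and differences of the R2s; a sketch using five-digit values computed from the sums of squares reported in this chapter:

r2_mood, r2_drug, r2_both = 0.28034, 0.30175, 0.34572

unique_mood = r2_both - r2_drug       # about .044
unique_drug = r2_both - r2_mood       # about .065
joint = r2_mood + r2_drug - r2_both   # about .236
error = 1 - r2_both                   # about .654
print(unique_mood, unique_drug, joint, error)
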

Occasionally, the optimally weighted sum of two predictors will produce a


higher R2 than the sum of the r2s for the two predictors separately. This pattern
occurs when the relation between two predictor variables hides or suppresses
their real relation with the criterion variable, as can happen when the correlation
between two predictor variables is negative but both correlate positively with the
criterion. In the presence of such a suppressor effect the overlap between the two
predictor variables is negative, which makes drawing a figure like Fig. 11.3
untenable. For further discussion of suppression see Cohen and Cohen (1983).
Even in the (relatively rare) presence of suppression, determining the unique
contribution of a particular variable—the additional proportion of variance
accounted for when that variable is added to the equation—is straightforward.
The variable is added to the regression equation and a new R2 is computed. Its
contribution is the difference between the new R2 and the R2 for the previous
equation, the one that did not include the new variable. If we call the new
equation the larger model and the previous equation the smaller model, then the
change in R2 due to the new variable added is

R2change = R2larger - R2smaller    (11.11)
and the degrees of freedom associated with this increase are
dfchange = dflarger - dfsmaller    (11.12)
The Significance of the Second Predictor


Determining the statistical significance associated with this new variable is also
straightforward. Again, an F ratio is used, but this F ratio tests the increase
in R2 (dfchange is the numerator df and dferror is the denominator df):

F = (R2change / dfchange) / ((1 - R2larger) / dferror)    (11.13)

FIG. 11.3. Venn diagram partitioning variance for number of lies detected
into four portions: variance accounted for uniquely by drug, uniquely by
mood, by drug and mood jointly, and unaccounted or error variance.

In this case,
dfchange = 2 - 1 = 1 and dferror = 10 - 2 - 1 = 7
Equation 11.13 is simply a more general version of Equation 11.10. It has the
advantage of reminding us that an addition or change in R2 is being tested.
An alternative version of Equation 11.13, formed by expanding R2change and
dfchange (Equations 11.11-11.12) and dferror, is

F = ((R2larger - R2smaller) / (dflarger - dfsmaller)) / ((1 - R2larger) / (N - dflarger - 1))    (11.14)
This formulation has the advantage of bypassing the intermediate steps
represented by Equations 11.11 and 11.12.
When the effect of a model is compared to no model (as in Equation 11.10),
the numerator degrees of freedom are the number of predictor variables
associated with that model. Similarly, when the effect of a larger model is
compared to a smaller model (as in Equations 11.13 and 11.14), the numerator
degrees of freedom for the change in R2 are the number of variables added to the
equation, which is the difference in degrees of freedom between the larger and
smaller models. If variables are added one at a time, as they have been in the
examples presented so far, dfchange = 1, but as you can see, Equations 11.11-11.14
apply when more than one variable is added to a preexisting set as well. We
mention this matter again at the end of the chapter.
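Equations 11.13 and 11.14 translate directly into a small helper function. The
sketch below is ours, not the authors' (f_change is an invented name), and its
printed values follow from the three-digit R2s used here:

    def f_change(r2_larger, r2_smaller, df_larger, df_smaller, n):
        # F ratio for the increase in R2 when predictors are added
        # (Equation 11.14). df_larger and df_smaller are the numbers of
        # predictors in the two models; the change is tested on
        # (df_larger - df_smaller) and (n - df_larger - 1) df.
        df_change = df_larger - df_smaller
        df_error = n - df_larger - 1
        return ((r2_larger - r2_smaller) / df_change) / \
               ((1 - r2_larger) / df_error)

    # Adding drug group (step 2) to mood (step 1), N = 10:
    print(f"{f_change(0.346, 0.280, 2, 1, 10):.2f}")  # 0.71 here; the
                                                      # five-digit numbers
                                                      # in the text give 0.70
    # Adding mood to drug group instead:
    print(f"{f_change(0.346, 0.302, 2, 1, 10):.2f}")  # 0.47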
For the present example, the F ratio used to evaluate the significance of the
unique contribution of drug group, above and beyond any contribution already
made by mood, is

F(1,7) = (.065 / 1) / ((1 - .346) / 7) = .065 / .0934 = 0.70
The critical value of F with 1 and 7 degrees of freedom, alpha = .05, is 5.59; thus
we conclude that the unique contribution of drug group observed for this sample
would occur by chance more than 5% of the time. With a sample size of 10 and
two predictor variables it is not rare enough to reject the null hypothesis that the
unique contribution of drug group is zero.
It is helpful to organize the results of analyses like these as shown in Fig. 11.4.
Each row represents a step and indicates the proportion of variance accounted
for by all variables in the equation as of that step (R2total), as well as the unique
proportion accounted for by the variable (or variables) added at that step
(R2change). In addition, the table shows the F ratios and their associated
degrees of freedom for the various R2s.
It may seem disappointing that none of the results for our running example
reached a conventional level of significance. Actually, this is quite realistic. For
convenience, the current example includes a relatively small number of subjects
(N = 10) and, as we discuss in chapter 17, statistically significant results are found
for samples this small only when effects are very strong. Still, the proportion of
variance accounted for uniquely by drug group, above and beyond the
contribution of mood, was a nontrivial 6.5%. With a larger sample size, an effect
of this magnitude would be statistically significant. As the present example
demonstrates, once again, there is no reason to be overimpressed with statistical
significance or to be depressed by the lack of it—and there is every reason to pay
attention to the magnitude of effects.
The previous analysis evaluated the unique contribution of drug group given
mood scores, but different underlying theoretical concerns might lead us to
evaluate the unique contribution of mood given drug group. The F ratio used to
evaluate the significance of the unique contribution of mood, above and beyond
the contribution already made by drug group, is
F(1,7) = (.044 / 1) / ((1 - .346) / 7) = .044 / .0934 = 0.47
Again this F ratio is not significant, so we decide that, in addition to knowledge of
a subject's drug group, knowledge of mood scores would not allow us to make
significantly better predictions for the number of lies detected.
Statistical significance aside, there is much in the way of technique to be
gained from the current analyses. For each variable (or set of variables) added to
a multiple-regression equation, you now understand how to determine its
contribution to variance accounting, its R2change, and you know how to decide
whether or not that contribution is statistically significant. Moreover, using Fig.
11.4 as a model, you now know how to organize the results of such hierarchic
regression analyses. Such analyses are very general, as you will soon see.
                        Total                 Change
Step   Variable added   R2      df    F       R2      df    F
  1    Mood             0.280   1,8   3.12    0.280   1,8   3.12
  2    Drug             0.346   2,7   1.85    0.065   1,7   0.70
FIG. 11.4. Predicting lies detected: Adding drug to mood.

Exercise 11.5
Significance of Increases in R2
For this exercise you are asked to organize the hierarchic multiple-regression
results for two previous examples.
1. Fig. 11.4 shows a hierarchic analysis that adds the drug variable to mood.
Prepare a table like the one shown in Fig. 11.4, but for the analysis that adds
mood to drug group. How much additional variance in number of lies
detected can you account for if you know mood scores in addition to drug
group membership? Is this amount statistically significant?
2. Recall the example that examined the effect of number of older siblings on
number of words infants spoke (Exercises 9.6 and 10.5). First we regressed
number of words on the actual number of older siblings; then in a separate
analysis, number of words was regressed on a binary variable representing
simply whether or not infants had at least one older sibling. The number of
siblings accounted for more variance than the binary variable, but these were
two separate analyses and so we do not know whether knowing the actual
number of siblings, in addition to knowing simply whether or not infants had
older sibling(s), accounts for a significant increase in variance accounted for.
To find out, we would regress number of words, first on the binary variable,
then on both the binary and quantitative variables together, and evaluate the
increase in R2. Do this analysis and prepare a table modeled after Fig. 11.4
that presents the results. How do you interpret these results?

11.4 HIERARCHIC MRC AND THE ANALYSIS OF COVARIANCE

The way MRC was used in the last section provides an example of what is
called hierarchic multiple regression. In the case of hierarchic multiple
regression, variables (or sets of variables) are added to the equation one step at a
time, and ordered in a way that makes sense to the investigator. It would make
sense to call this procedure stepwise regression. Unfortunately, that term has
been preempted and usually refers to an approach whereby variables that
account for more variance in a particular data set are selected automatically by
the computer program and are added to the regression equation before variables
that account for less variance, no matter their meaning or interpretability. This
kind of automatic stepwise selection capitalizes on chance variation and,
especially in small data sets, may lead to results that are difficult to interpret or
replicate. It is well suited to certain technical predictive tasks, but is usually not
very useful if you are interested in explanation or substantive interpretation of
your predictor variables. Still, because the term does occur with some frequency,
it is important for you to know what stepwise regression usually means, even
though it is not used here.
With hierarchic multiple regression, the researcher and not the computer
determines the order with which variables are entered into the multiple-
regression equation. Thus, variables regarded as antecedent in some way are
added before variables regarded as consequent, and the significance of each new
variable (or set of variables) added is tested in turn. This approach is very
general and powerful—indeed, it forms the basis for most of the analyses
presented in this book—but one particular use to which this approach can be put
is traditionally called the analysis of covariance.
Whether a predictor variable is regarded as a covariate depends not on the
particular variable, but rather on the researcher's view of the matter. Within the
experimental tradition, a covariate is a variable that is thought to be associated
with (i.e., to covary with) the dependent variable but whose influence is not of
primary concern. It is more of a nuisance, obscuring a clear evaluation of the
effect of the experimental independent variable on the dependent variable. The
purpose of the analysis of covariance is to control for the effect of the covariate
statistically, in effect neutralizing it. As you can see, the hierarchic regression
analysis discussed in the last section does exactly that. Imagine, for example,
that our major concern is the effect of drug on number of lies detected. However,
we have reason to think that mood might also have an effect; that is, we think of
mood as a covariate. And even though subjects were randomly assigned to the
two treatment conditions (drug and placebo), we still note higher mood scores in
one of them.
In this case we would add mood scores to the regression equation first (step
1; see Fig. 11.4). Then the increment in R2 that occurs when drug group is added
to the equation (step 2) gives us the effect of drug, above and beyond any effect of
mood, on the number of lies detected. In this way, the effect of mood is
controlled statistically and we can evaluate the effect of the drug manipulation
uncontaminated by variability in mood scores. In the present case, we see that,
in addition to the 28.0% of the variance in number of lies detected accounted for
by mood scores, an additional 6.5% is accounted for uniquely by drug group.
The example just presented assumed an experimental study (presumably,
subjects were randomly assigned to the two drug groups), but the need to control
for background variables, or other covariates, arises in nonexperimental studies
as well. In fact, the need is usually greater because nonexperimental studies,
lacking experimental control, must rely more on statistical control. Whether or
not studies are experimental, however, the approach is the same, which
demonstrates the generality of a hierarchic multiple-regression framework.

An Example: The Button-Pushing Study


In chapter 13 we return to the analysis of covariance in greater detail, but first a
new study is introduced. This study is used to exemplify both an analysis of
covariance and other analytic strategies discussed in this and subsequent
chapters. It has more subjects and more research factors than the lie detection
study we have been using up to now for our running example.
Imagine we are interested in how people with different experiences perceive
infants. In particular, we want to know if parents see, that is, detect, more
communicative attempts on the part of preverbal infants than nonparents. For
this study we prepare videotapes of preverbal infants playing. We show them to
some participants who are parents and some who are not, and we ask the subjects
to push a button every time they believe the infant has done something the adult
participant views as communicative. The button is attached to a microcomputer
that records the number of presses.
The dependent variable for this study, then, is the number of button pushes
for each subject. One independent variable is parental status (parent vs.
nonparent), and one research hypothesis posits that parents will push the
button more than nonparents. Another variable is the subject's age. Not
surprisingly, after recruiting subjects we note that parents tend to be older than
the nonparents and we worry that older subjects may push the button more than
younger subjects, introducing a confound into our study. If parents are more
frequent button pushers than nonparents, it might be due to their age and not
their parental status.
One solution to the age-parental status confound is to analyze the data using
hierarchical multiple regression. First the subject's age would be entered into the
regression equation (step 1), then parental status (step 2). In this way the effect
of parental status on the number of button pushes, above and beyond any effect
of age, can be evaluated. This is, in effect, an analysis of covariance in which age
serves as the covariate. The data for this study are given in Fig. 11.5 and the
necessary computations for the analysis of covariance just described are
performed during the course of the next exercise.

Exercise 11.6
An Analysis of Covariance
This exercise uses the template shown in Fig. 11.2, but incorporates data from
the button-pushing instead of the lie detection study. The predictor variables are
age and parental status. These are examined hierarchically, first subject's age
(step 1), then parental status (step 2). Age can be regarded as a covariate and
the entire analysis, an analysis of covariance. The null hypothesis suggests that
parental status does not account for a significant increase in R2, above any
variance already accounted for by the subject's age.

           No. of    Subject's   Parental
Subject    pushes    age         status
   1        102        35           1
   2        125        38           1
   3         95        40           1
   4        130        32           1
   5         79        29           1
   6         93        44           1
   7         75        26           1
   8         69        36           1
   9         43        27           0
  10         82        26           0
  11         69        18           0
  12         66        22           0
  13        101        31           0
  14         94        21           0
  15         84        27           0
  16         69        28           0
FIG. 11.5. Data for the button-pushing study: 1 = parent, 0 = nonparent.
General Instructions
1. Modify the spreadsheet shown in Fig. 11.2 to accommodate data from the
button-pushing study. This involves relabeling the data columns, inserting
rows for additional subjects, and entering new data. The required formulas
are either already in place or can be copied into new rows.
2. Do step 1 of the hierarchic regression analysis. Regress number of button
pushes on age and note its R2 and associated F ratio.
3. Do step 2 of the hierarchic regression analysis. Regress number of button
pushes on both age and parental status. Note both the final R2 and the
change in R2.
4. To complete the analysis, organize your results in a stepwise table modeled
after Fig. 11.4. Then answer the questions posed in part 10 of the detailed
instructions.

Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 11.2. Relabel columns B–D, row 2,
as follows:
Label Column Meaning
Y B The number of button pushes. In general, "Y" is used to
indicate the dependent variable.
X C Subject's age. In general, "X" is used to indicate an
independent variable or covariate. (In Fig. 11.2 this column
was labeled "X1.")
A D Parental status. (Often, A, B, C, and so forth, are used to
indicate categorical independent between-subjects variables,
although in Fig. 11.2 this column was labeled "X2.")
2. Enter the labels "#BPs" for number of button pushes in B1, "Age" in C1, and
"PSt" for parental status in D1.
3. Insert six rows before row 13 (or move the block A13-L18 to A19-L24). This
makes room for the 16 (instead of 10) subjects used for the button-pushing
study.
4. Extend the subject numbers from 10 to 16 in column A and enter the data
from Fig. 11.5 (number of pushes, age, parental status) in columns B, C, and
D respectively.
5. Using a multiple-regression routine, find the values for a and b for the step 1
model, #BPs = a + b Age, ignoring parental status. Enter the values for the
parameters a and b in cells B22 and C22 respectively. Enter the predictive
formula, Y' = a + bX in cells E3–E18.
6. Copy the formulas in row 12, columns F–L, to the new rows 13-18. Rows 3-
12, columns F–L, should already contain the correct formulas from the
previous spreadsheet, as should rows 19–23, columns A–L.
7. Enter the correct values for the total, model, and error degrees of freedom in
cells J20-L20 respectively. At this point, the correct R2 for the step 1 model
(the one using only age as a predictor) should be displayed in cell F23. This
completes step 1 of the hierarchic regression analysis. At this point, your
spreadsheet should look like the one given in Fig. 11.6.
8. For step 2 of the analysis, the spreadsheet is modified as follows. Using a
multiple-regression routine, find the values for a, b1, and b2 for the model,
#BPs = a + b1 Age + b2 PSt. Enter the values for the parameters a, b1, and
b2 in cells B22, C22, and D22 respectively. Enter the predictive formula Y' =
a + b1X1 + b2X2 in cells E3–E18.
9. Now enter the correct values for the total, model, and error degrees of
freedom for this model in cells J20-L20 respectively. At this point, the
correct R2 for the step 2 model (the one using both age and status as
predictors) should be displayed in cell F23 and your step 2 spreadsheet
should look like the one shown in Fig. 11.7.
10. To complete the exercise, organize your results in a hierarchic regression
table organized like Fig. 11.4. In this case, what is the critical value of F,
alpha = .05, for one predictor? Is age a significant predictor for the number
of button pushes? What is the critical value of F, alpha = .05, for two
predictors? Do age and status together account for a significant amount of
variance? Finally, does parental status make a significant unique
contribution to prediction, above any contribution made by age?

After step 1, your spreadsheet should look like Fig. 11.6, and after step 2, like
Fig. 11.7. The hierarchic results from the last exercise should look like those
shown in Fig. 11.8. Even though age accounted for 21.5% of the variance, with 16
subjects this amount approached, but did not reach, conventional levels of
significance. The critical value for F(1,14), alpha .05, is 4.60 and the obtained F
was 3.83. If the effect of age had been our primary concern, we would have
learned from this study that an N of 16 is simply inadequate to find significant an
effect that accounts for approximately 20% of the sample variance.
Age and parental status together accounted for 25.0% of the variance (F(2,13)
= 2.16, NS), and parental status accounted uniquely for 3.5% of the variance
above that accounted for by age alone (F(1,13) = 0.61, NS). Again, neither effect
was statistically significant. Still, there is an important lesson to be learned here.
It is important, if we are not to waste time and resources, to decide before any
study begins how large an effect we think is important and to insure that
sufficient subjects are studied so that we have a reasonable chance of detecting
effects of that size as statistically significant. How to do this is explained in
chapter 17. Otherwise we are left, as in the present case, knowing that it is more
probable than we would like (and more probable than journal editors usually
allow) that the size of the effects we see could be just a lucky draw from a
population in which the true values of the R2s under investigation are zero.
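For readers who prefer code to a spreadsheet, the entire hierarchic analysis
can be reproduced with a general-purpose least-squares routine. The NumPy
sketch below is our illustration (r_squared is an invented helper); it should
reproduce the values in Figs. 11.6-11.8 within rounding:

    import numpy as np

    # Button-pushing data from Fig. 11.5.
    pushes = np.array([102, 125, 95, 130, 79, 93, 75, 69,
                       43, 82, 69, 66, 101, 94, 84, 69], dtype=float)
    age = np.array([35, 38, 40, 32, 29, 44, 26, 36,
                    27, 26, 18, 22, 31, 21, 27, 28], dtype=float)
    status = np.array([1] * 8 + [0] * 8, dtype=float)

    def r_squared(y, *predictors):
        # R2 from an ordinary least-squares fit with an intercept.
        X = np.column_stack([np.ones_like(y)] + list(predictors))
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return 1 - resid.var() / y.var()

    n = len(pushes)
    r2_age = r_squared(pushes, age)           # step 1: age only
    r2_both = r_squared(pushes, age, status)  # step 2: age and status

    f1 = (r2_age / 1) / ((1 - r2_age) / (n - 2))
    f2 = (r2_both / 2) / ((1 - r2_both) / (n - 3))
    fchg = ((r2_both - r2_age) / 1) / ((1 - r2_both) / (n - 3))

    print(f"Step 1: R2 = {r2_age:.3f},  F(1,{n - 2}) = {f1:.2f}")
    print(f"Step 2: R2 = {r2_both:.3f}, F(2,{n - 3}) = {f2:.2f}")
    print(f"Change: R2 = {r2_both - r2_age:.3f}, F(1,{n - 3}) = {fchg:.2f}")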

Exercise 11.7
Hierarchical Regression in SPSS
In this exercise you will learn how to conduct a hierarchical regression in SPSS.
1. Open the Lies and Drug data file you created in Exercise 10.6. Create a new
variable for mood and enter the data.
2. Select Analyze-> Regression-> Linear from the main menu. Move Lies to
the Dependent window and Mood to the Independent(s) window. Click on
Next under Block 1 of 1 and move Drug to the Independent(s) window. Click
on Statistics and check the R squared change box. Click Continue and then
OK.
3. Examine the Model Summary in the output. On the left hand side you will find
the statistics for the Total model. Thus Model 1 represents the first step that
includes only mood scores. Model 2 includes both mood and drug as
predictors. In the right-hand box of the Model Summary you will find the R2,
F, df, and significance of the change. Thus Model 2 represents the amount
of variance and significance of adding drug to mood scores. Make sure that
these values agree with Fig. 11.4.
4. In the ANOVA table you will find the test of each model. Thus, Model 1 tests
if mood alone is a significant predictor of lies, and Model 2 represents the
significance of mood and drug together as predictors of lies. Finally, in the

       A      B      C     D     E       F      G       H
 1            #BPs   Age   PSt           y=     m=      e=
 2     s      Y      X     A     Y'      Y-My   Y'-My   Y-Y'
 3     1      102    35    1     93.2    16     7.201   8.799
 4     2      125    38    1     97.52   39     11.52   27.48
 5     3      95     40    1     100.4   9      14.4    -5.4
 6     4      130    32    1     88.88   44     2.881   41.12
 7     5      79     29    1     84.56   -7     -1.44   -5.56
 8     6      93     44    1     106.2   7      20.16   -13.2
 9     7      75     26    1     80.24   -11    -5.76   -5.24
10     8      69     36    1     94.64   -17    8.642   -25.6
11     9      43     27    0     81.68   -43    -4.32   -38.7
12     10     82     26    0     80.24   -4     -5.76   1.761
13     11     69     18    0     68.72   -17    -17.3   0.283
14     12     66     22    0     74.48   -20    -11.5   -8.48
15     13     101    31    0     87.44   15     1.44    13.56
16     14     94     21    0     73.04   8      -13     20.96
17     15     84     27    0     81.68   -2     -4.32   2.321
18     16     69     28    0     83.12   -17    -2.88   -14.1
19     Sum=   1376   480   8     1376    0      -0      3E-14
20     N=     16     16    16    N=      16     16      16
21     Mean=  86     30    0.5   VAR=    464.9  99.83   365
22     a,b=   42.79  1.44  0     SD=     21.56
23     R=     0.463              R2=     0.215

       I        J       K       L
 1              sstot   ssmod   sserr
 2              y*y     m*m     e*e

19     SS=      7438    1597    5841
20     df=      15      1       14
21     MS=      495.9   1597    417.2
22     SD'=     22.27           20.43
23     R2adj=   0.159   F=      3.829

FIG. 11.6. Spreadsheet for evaluating the effect of age on number of button
pushes. Rows 3-18 for columns I-L are not shown.
Coefficients table you will find the values of the unstandardized and
standardized regression coefficients.
5. As additional practice, use the SPSS Regression procedure to reanalyze the
button-pushing study presented in Exercise 11.6.

       A      B      C      D      E       F      G       H
 1            #BPs   Age    PSt            y=     m=      e=
 2     s      Y      X      A      Y'      Y-My   Y'-My   Y-Y'
 3     1      102    35     1      96      16     10      6
 4     2      125    38     1      98.51   39     12.51   26.49
 5     3      95     40     1      100.2   9      14.18   -5.18
 6     4      130    32     1      93.49   44     7.495   36.51
 7     5      79     29     1      90.99   -7     4.989   -12
 8     6      93     44     1      103.5   7      17.52   -10.5
 9     7      75     26     1      88.48   -11    2.484   -13.5
10     8      69     36     1      96.84   -17    10.84   -27.8
11     9      43     27     0      77.67   -43    -8.33   -34.7
12     10     82     26     0      76.84   -4     -9.16   5.165
13     11     69     18     0      70.15   -17    -15.8   -1.15
14     12     66     22     0      73.49   -20    -12.5   -7.49
15     13     101    31     0      81.01   15     -4.99   19.99
16     14     94     21     0      72.66   8      -13.3   21.34
17     15     84     27     0      77.67   -2     -8.33   6.33
18     16     69     28     0      78.51   -17    -7.49   -9.51
19     Sum=   1376   480    8      1376    0      -0      6E-14
20     N=     16     16     16     N=      16     16      16
21     Mean=  86     30     0.5    VAR=    464.9  116.1   348.7
22     a,b=   55.12  0.835  11.65  SD=     21.56
23     R=     0.5                  R2=     0.25

       I        J       K       L
 1              sstot   ssmod   sserr
 2              y*y     m*m     e*e

19     SS=      7438    1858    5580
20     df=      15      2       13
21     MS=      495.9   929     429.2
22     SD'=     22.27           20.72
23     R2adj=   0.134   F=      2.164

FIG. 11.7. Spreadsheet for evaluating the effect of age and parental status
together on number of button pushes. Rows 3-18 for columns I-L are not
shown.
                        Total                  Change
Step   Variable added   R2     df     F        R2     df     F
  1    Age              .215   1,14   3.83     .215   1,14   3.83
  2    Status           .250   2,13   2.16     .035   1,13   <1
FIG. 11.8. Predicting number of button pushes: adding parental status to
age. The value of the F ratio is actually 0.608, but usually values less than 1
are indicated simply as <1.

11.5 MORE THAN TWO PREDICTORS

The discussion and examples used in this chapter have been cast in terms of just
two predictor variables. In chapter 8 we learned how to compute the proportion
of criterion variance accounted for by one predictor variable, and in chapter 10
we learned how to determine if that proportion of variance was statistically
significant. In this chapter we learned how to compute the proportion of
criterion variance accounted for by two predictor variables considered together as
a set. We also learned how to compute the proportion of criterion variance
uniquely accounted for by the second predictor variable when added to the first.
Finally, we learned how to evaluate the statistical significance of the R2 accounted
for by the set of two variables and the increment in R2 that occurred when the
second predictor was added to the first.
These concepts and computations apply with only slight modification when
three or more predictor variables are under consideration. For each additional
predictor, another column is added to the spreadsheet. The partial regression
coefficient for that predictor is computed and incorporated into the predictive
equation. The computation for R2 (the proportion of total variance accounted for
by the model), with the evaluation of its significance, is the same whether two or
more predictors are used. No matter the number of predictor variables, the F
ratio (to restate Equation 11.10) is
F = (R2 / dfmodel) / ((1 - R2) / dferror)
Similarly, hierarchic regression is easily extended beyond two variables.
Moreover, a step can consist of adding either a single variable or a set of
variables. Thus, at each step, an additional variable, or a set of variables, is
added. When evaluating the significance of that variable (or set), the smaller is
the previous model, the larger is the new model, and the increase in R2 is the
change between R2larger and R2smaller. Thus, for example, the larger model for step
2 becomes the smaller model for step 3, and so forth. As noted earlier, the F ratio
for evaluating the significance of the variable (or variables) entered at each step
(to restate Equation 11.14) is
F = ((R2larger - R2smaller) / (dflarger - dfsmaller)) / ((1 - R2larger) / (N - dflarger - 1))
Again as noted earlier, the two F ratios given in this and the preceding paragraph
are really the same. The first is simply a special and more limited case of the
second.
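To see the reduction, let the smaller model be no model at all, so that
R2smaller = 0 and dfsmaller = 0; the second F ratio then becomes

F = ((R2larger - 0) / (dflarger - 0)) / ((1 - R2larger) / (N - dflarger - 1))

which is the first F ratio with R2larger and dflarger playing the roles of R2
and dfmodel.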
The material summarized in the previous three paragraphs may seem clear
and simple enough. However, unless one has had some experience with data
analysis, the stunning simplicity and power of these concepts and computations
may not be apparent. Essentially, whenever the criterion or dependent variable
is quantitative, the procedures just outlined apply. In other words, almost all of
the basic sorts of analyses used by behavioral scientists can be understood, and
performed, based on the material presented in this chapter. It only remains to
define the independent variables, or the research factors of interest. In
substantive terms, this means identifying those factors or variables that the
investigators think might affect or account for variation in the variable of interest
(i.e., in the criterion or dependent variable).
There are caveats, of course. For example, the independent variables should
be well measured and reasonably distributed. And, certainly for
nonexperimental studies, these variables should not exclude variables that
strongly influence the criterion being studied. Such cautions are important and
usually are amply discussed in multiple-regression texts (see, e.g., Cohen &
Cohen, 1983). We do not discuss these cautions further here but urge anyone
using multiple-regression techniques to be aware that there are pitfalls for
unwary users, and any seemingly nonsensical results should be discussed with
knowledgeable colleagues.
The usefulness of defining several independent variables is probably self-
evident. Typically researchers are interested in more than one factor: Variables
other than just age may warrant investigation, for example. Moreover, because
not all independent variables are quantitative, the usefulness of evaluating
categorical as well as quantitative independent variables is probably also self-
evident. What may not yet be evident is how productively these two strategies
can be combined.
As we will see in the next chapter, categorical independent variables (other
than binary variables) are represented with more than one predictor variable.
Such sets of coded variables, each set representing a categorical or nominal
variable, when combined with multiple-regression techniques, provide a
straightforward and general way to analyze data from a variety of different
studies. In the next chapter we describe how to perform a one-way analysis of
variance, analyzing data from a single-factor between-subjects study.
12 Single-Factor
Between-Subjects Studies

In this chapter you will:


1. Learn how to render a categorical variable that has more than two
categories as a set of coded variables. The set can then be used to
represent the categorical variable in multiple-regression equations.
2. Learn two different ways of coding categorical variables: dummy coding
and contrast coding.
3. Learn how to perform a one-way analysis of variance using coded
variables. This allows you to analyze data from single-factor between-
subjects studies when the single factor encompasses more than two levels
or groups.

This chapter describes how to analyze data from single-factor between-subjects
studies. A between-subjects analysis (as opposed to a within-subjects analysis,
which is discussed in chap. 15) is appropriate whenever a score for the dependent
variable is determined once, and only once (as opposed to repeatedly), for each
subject. In the interests of reliability, several measurements might be made and
then combined into one score, but such data would still require a between-
subjects analysis. For example, in the button-pushing study, the total number of
times the button was pushed was tallied once for each subject. This total number
was then analyzed to see if it was affected by the subject's age or parental status.
A single-factor study, as the name implies, is concerned with the effect of a
single research factor on the dependent variable. As such, single-factor studies
are among the simplest researchers use. For example, if we confine our attention
solely to parental status, then the button-pushing study could be an example of a
single-factor study. Moreover, it would be the simplest single-factor between-
subjects study possible, one involving only two groups or two levels of the factor.
We have already discussed in chapters 8 and 9 how to analyze data from single-
factor between-subjects studies when only two groups are involved, but studies
involving more than two groups are common. For example, a researcher might
want to investigate the effect of three or four different kinds of treatment on
patients' neuroticism or the effect of religious background on individual attitudes
toward abortion.

In general terms, the analysis of data from a single-factor study allows us to
determine whether knowing the group to which a subject belongs (which could be
a self-assigned natural group or an experimenter-assigned experimental group)
allows us to make better predictions than otherwise as to how that individual will
respond. In statistical terms, the question is whether the proportion of variance
accounted for by the research factor is significantly different from zero. The null
hypothesis, which claims no effect for the research factor, would predict an R2 of
zero. A sample will almost always yield a value of R2 higher than that, but the key
statistical question is whether the obtained value of R2 would occur less than 5%
of the time just due to the luck of draw, if the population value for R2 really were
zero. If the F ratio associated with R2 exceeds the appropriate critical value, we
declare the null hypothesis untenable and label our obtained value statistically
significant: In other words, we claim a statistically significant effect.

12.1 CODING CATEGORICAL PREDICTOR VARIABLES

In order to test the significance of a single research factor, we need some way to
represent the group to which each subject belongs. If the single research factor
were a quantitative variable, like maternal age, then one predictor variable would
suffice and for each subject the value for this variable would be that subject's age.
This variable could be entered into a multiple-regression equation, where it
would account for one degree of freedom.
Similarly, if the single research factor were a binary qualitative variable, like
parental status (parent vs. nonparent), again one predictor variable would suffice.
Parents would be assigned one value, nonparents a second value. As we saw in
Exercise 9.4, the values selected to represent the two groups are not critical; the
two values must be different and that is all. A binary research factor, then, is
represented with one coded variable and hence accounts for one degree of
freedom.
Qualitative research factors comprising more than two groups or levels are
handled somewhat differently. The general rule is, a factor with G groups or
levels must be represented with G –1 coded variables. There are several ways to
do this, but we emphasize two here: dummy variable coding, because it is so
simple, and contrast coding, because contrasts can be interpreted as planned
comparisons.

Dummy Variable Coding


Dummy variable coding reduces the information concerning group membership
into a series of binary distinctions (see Fig. 12.1). It is particularly appropriate
when one group is regarded as a comparison or control group and other groups
are compared to it. For example, individuals with Catholic, Jewish, or Protestant
religious backgrounds might be compared to all others. In this case, religious
background would be represented with four levels, the three just listed plus a
fourth category for Other. Yet group membership can be represented
unambiguously with just three (G –1) predictor variables. Values for the first
variable, X1, would be 1 = Catholic and 0 = Non-Catholic. Similarly, values for
the second variable, X2, would be 1 = Jewish and 0 = Non-Jewish, and the values
for the third variable, X3, would be 1 = Protestant and 0 = Non-Protestant. Each
variable represents a religious group and subjects receive a code of one for the
group to which they belong. Subjects belonging to the Other category would be
assigned a code of zero for all three predictor variables.

Dummy coded variables for 2-5 groups

Group           2      3          4             5
               X1    X1 X2     X1 X2 X3     X1 X2 X3 X4
G1     S1       1     1  0      1  0  0      1  0  0  0
                1     1  0      1  0  0      1  0  0  0
G2     .        0     0  1      0  1  0      0  1  0  0
       .        0     0  1      0  1  0      0  1  0  0
G3     .              0  0      0  0  1      0  0  1  0
       .              0  0      0  0  1      0  0  1  0
G4     .                        0  0  0      0  0  0  1
       .                        0  0  0      0  0  0  1
G5     .                                     0  0  0  0
       SN                                    0  0  0  0

FIG. 12.1. Values for dummy-coded variables for research factors
comprising two, three, four, and five levels or groups. Two subjects per
group are shown, although usually groups would have more subjects.

As you can see from Fig. 12.1, three binary-coded predictor variables allow for
four patterns (100, 010, 001, and 000). A fourth predictor variable (presumably
coded 1 for Other) would be redundant and including it in our set of predictor
variables would result in an error message from the multiple-regression routine.
Likewise, two binary-coded predictor variables allow for three patterns, four
allow for five patterns, and so forth. In general, G - 1 predictor variables are
required to indicate to which of G groups a subject belongs.
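The G - 1 rule is easy to mechanize. The Python sketch below is ours, not the
authors' (dummy_codes is an invented name); it builds dummy-coded predictors
for any list of group labels, coding a chosen comparison group as all zeros:

    def dummy_codes(groups, reference):
        # One predictor per non-reference level, in order of first
        # appearance; the reference group is coded 0 on every predictor.
        levels = [g for g in dict.fromkeys(groups) if g != reference]
        return [[1 if g == level else 0 for level in levels] for g in groups]

    religion = ["Catholic", "Jewish", "Protestant", "Other", "Jewish"]
    for subject, row in zip(religion, dummy_codes(religion, "Other")):
        print(subject, row)  # Catholic [1, 0, 0] ... Other [0, 0, 0]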
In chapter 10 we noted that the degrees of freedom associated with a
categorical variable were one less than the number of categories (or levels or
groups) and argued that, once the means for all groups but the last one had been
determined, the last was not free to vary. Here the same argument is made in a
somewhat different way. In order to unambiguously represent group
membership for G groups, G – 1 predictor variables are needed. Each predictor
variable represents a degree of freedom, so again we see that the degrees of
freedom associated with a categorical variable is one less than the number of
categories.
Occasionally a student will ask, since a quantitative variable like maternal age
requires only one predictor variable, why should a qualitative variable like
religious background require more than one? Why not code religious
background 1 = Catholic, 2 = Jewish, 3 = Protestant, and 4 = Other and use only
one predictor variable and hence one degree of freedom? To do so would violate
what a categorical variable means and would deny the amount of unique
information a categorical variable conveys. If we indeed scaled religious
background—that is, assigned values 1-4 to the four categories—we would be
saying, in effect, that religious background could be quantified and Protestants
had more of it than Jews who had more of it than Catholics. Not only would such
scaling be controversial, but we would also fail to code group membership.
We would be confusing qualitative measurement with quantitative
measurement, in effect forcing an interval (or ratio, or ordinal) scale of
measurement on a categorical scale inappropriately (recall the discussion of
scales of measurement in chap. 4). The information conveyed by a qualitative as
compared to a quantitative variable is of kind, not degree, and, as it turns out,
requires G – 1 independent coded variables to represent the distinctions among G
groups.
Imagine, for example, that the 16 subjects from the button-pushing study
(see Fig. 11.5) were divided into the following four groups representing what we
will call want/has children status:

1. Has no desire for children.
2. Has none but would like children.
3. Has one child.
4. Has more than one child.

Because the categorical variable of children status consists of four levels or
four groups, it can be represented with a set of three dummy-coded predictor
variables. If we wanted to know whether the mean numbers of button pushes for
these four groups were significantly different from each other, we would perform
an analysis of variance, first regressing the number of button pushes on these
three dummy-coded variables taken together as a set, and then testing whether
the computed R2 was significantly different from zero. The next exercise gives
you the opportunity to do this analysis.

Exercise 12.1
A One-Way Analysis of Variance Using Dummy-Coded Variables
The template that results from this exercise allows you to analyze data from a
single-factor between-subjects study, performing a one-way analysis of variance.
The effect of a single factor, marital/parental status, is analyzed. This factor is
represented with four groups or levels and hence with three predictor variables.
The data and many of the computations needed are already present in the
spreadsheet shown in Fig. 11.7, which is modified for this exercise.
1. Modify the spreadsheet shown in Fig. 11.7 to accommodate a third predictor
variable. The three predictor variables for this spreadsheet represent X1, X2,
and X3, the dummy-coded variables for factor A, marital/parental status.
There is no need to compute the means or counts for coded variables, so
these formulas can be erased.
2. Enter values for the dummy-coded variables. Assume that subjects 1-4, 5-
8, 9-12, and 13-16 belong to the four groups listed in the previous
paragraph. Guided by Fig. 12.1, enter the appropriate values for the
predictor variables.
3. Regress the dependent variable, Y or number of button pushes, on the
dummy-coded variables X1, X2, and X3.
4. Enter the correct formula for Y'. Remember that the exercise shown in Fig.
11.7 involved two predictor variables whereas the present exercise involves
three predictor variables. How does this change the prediction equation?
The degrees of freedom for the model sum of squares? For the error sum of
squares?
5. What are the values for the predicted scores? Why are there only four
different values for the predicted scores?
6. What is the value of F? Would you reject the null hypothesis at the .05 level
of significance? Why or why not?
7. How would the values of R2 and F be changed if you had mistakenly entered
1 instead of 102 for subject 1? If there had been a 17th subject with no
desire for children who pushed the button 200 times?
The summary statistics for this one-way ANOVA (one-way because only a
single factor is involved) are provided by the spreadsheet just constructed. This
spreadsheet is given in Fig. 12.2. From it we know that 66% of the variance in
number of button pushes is accounted for by children status group. The critical
value for F, for 3 and 12 degrees of freedom, alpha = .05, is 3.49. The obtained
value was 7.63, consequently we can say there is a significant difference, p < .05,

       A      B      C     D     E     F      G      H      I
 1            #BPs                           y=     m=     e=
 2     s      Y      X1    X2    X3    Y'    Y-My   Y'-My  Y-Y'
 3     1      102    1     0     0     113   16     27     -11
 4     2      125    1     0     0     113   39     27     12
 5     3      95     1     0     0     113   9      27     -18
 6     4      130    1     0     0     113   44     27     17
 7     5      79     0     1     0     79    -7     -7     0
 8     6      93     0     1     0     79    7      -7     14
 9     7      75     0     1     0     79    -11    -7     -4
10     8      69     0     1     0     79    -17    -7     -10
11     9      43     0     0     1     65    -43    -21    -22
12     10     82     0     0     1     65    -4     -21    17
13     11     69     0     0     1     65    -17    -21    4
14     12     66     0     0     1     65    -20    -21    1
15     13     101    0     0     0     87    15     1      14
16     14     94     0     0     0     87    8      1      7
17     15     84     0     0     0     87    -2     1      -3
18     16     69     0     0     0     87    -17    1      -18
19     Sum=   1376   4     4     4     1376  0      0      0
20     N=     16     16    16    16    N=    16     16     16
21     Mean=  86     0.25  0.25  0.25  VAR=  464.9  305    159.9
22     a,b=   87     26    -8    -22   SD=   21.56
23     R=     0.81                     R2=   0.656

       J        K       L       M
 1              sstot   ssmod   sserr
 2              y*y     m*m     e*e

19     SS=      7438    4880    2558
20     df=      15      3       12
21     MS=      495.9   1627    213.2
22     SD'=     22.27           14.6
23     R2adj=   0.57    F=      7.631

FIG. 12.2. Spreadsheet for determining the effect of want/has children status
on number of button pushes using dummy-coded variables. Rows 3-18 for
columns J-M are not shown.
in the mean values for number of button pushes for the four groups. (The p < .05
means that the obtained F exceeded the alpha .05 critical value.) However,
pending further analyses, we cannot yet say exactly which groups account for this
significant difference or exactly how the groups differ. Such a general or overall
test is usually called an omnibus test (omnibus means "for all" in Latin).
If we want to know whether groups differ, but lack specific predictions
concerning exactly how they differ, then first we would perform an overall or
omnibus test. Instead of regressing the dependent variable on each predictor in
turn, we regress it on a set of predictor variables all at once, in a single step (as in
Exercise 12.1), which is usually called a simultaneous regression. If the R2change
associated with this step is significant, then we can claim that at least some of the
group means are different from others. We are then justified in probing
differences among the group means further, a topic discussed under the heading
of post hoc tests in the next chapter, and in reporting the means broken down
separately for the different groups, as is done in Fig. 12.3 for the present example.
However, if the omnibus test is not significant, then we should report only
the mean for all subjects (pooled over all groups) and not report separate means
for the different groups. After all, if they do not differ significantly then there
usually is no point in reporting them. In the present case, if the F ratio had not
been significant, we would simply report that subjects pushed the button 86
times on average.
When more than two groups are under consideration, omnibus F tests (which
necessarily have more than one degree of freedom in the numerator) are the rule.
Occasionally, however, researchers have a specific rationale for predicting an
exact pattern of differences among groups. Those predictions (each of which is
associated with a particular predictor variable and hence one degree of freedom)
can be tested directly, proceeding step by step. This matter is discussed in the
section on planned comparisons presented in the next chapter.
An examination of the present spreadsheet (see Fig. 12.2), along with the
means for the four groups (see Fig. 12.3), tells us something about how a one-way
analysis of variance works with dummy-coded variables. Note that a, the regression
constant, is 87, which is the mean for group 4, the comparison group (the group
whose values for all three dummy-coded variables were zero). Note also that the
regression coefficients represent deviations from the comparison group. The first
variable is coded 1 for group 1; its regression coefficient is 26, which is 113 (the
group 1 mean) minus 87 (the group 4 mean). Similarly, the second variable codes
group 2; its regression coefficient is -8, which is 79 (the group 2 mean) minus 87.
Likewise, the regression coefficient for the third variable is -22, or 65 minus 87.
Also note that Y' or the predicted value for a score is always the mean of the
group in which that score lies. This will be true no matter how categorical
variables are coded or how many subjects are in each group, as subsequent
exercises demonstrate. Thus the error scores for a one-way ANOVA represent
deviations of observed scores from their group's mean (see the Y - Y' column).
That is why error variance for a single-factor between-subjects study is often
called error within groups.
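The same analysis can be checked outside the spreadsheet. In the NumPy sketch
below (our illustration, not part of the chapter), the regression constant and
coefficients recover the group means and deviations just described, along with
the R2 and F ratio from Fig. 12.2:

    import numpy as np

    pushes = np.array([102, 125, 95, 130, 79, 93, 75, 69,
                       43, 82, 69, 66, 101, 94, 84, 69], dtype=float)
    # Dummy codes for the four want/has children groups (subjects 1-4,
    # 5-8, 9-12, 13-16); group 4 is the comparison group, coded 0 0 0.
    X = np.column_stack([
        np.ones(16),                 # regression constant a
        np.repeat([1, 0, 0, 0], 4),  # X1: codes group 1
        np.repeat([0, 1, 0, 0], 4),  # X2: codes group 2
        np.repeat([0, 0, 1, 0], 4),  # X3: codes group 3
    ])
    b, *_ = np.linalg.lstsq(X, pushes, rcond=None)
    pred = X @ b                     # always the group means
    ss_tot = np.sum((pushes - pushes.mean()) ** 2)
    ss_err = np.sum((pushes - pred) ** 2)
    r2 = 1 - ss_err / ss_tot
    f = (r2 / 3) / ((1 - r2) / 12)
    print("a, b1, b2, b3 =", np.round(b, 1))    # 87, 26, -8, -22
    print(f"R2 = {r2:.3f}, F(3,12) = {f:.2f}")  # 0.656 and 7.63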

Contrast Coding
Contrast coding is a second method for representing categorical information.
Like dummy variable coding, it reduces information about group membership to
a series of binary distinctions, but the distinctions or contrasts are structured
hierarchically—in a way that can be represented with a tree diagram. This
method is particularly appropriate whenever a researcher has some a priori basis
for arranging groups according to a series of yes-no questions or contrasts. In
Want or have children                  Number of pushes
Has no desire for children (ND)              113
Has none but would like (C0)                  79
Has 1 child (C1)                              65
Has more than 1 child (C2)                    87
FIG. 12.3. Mean number of button pushes, computed separately for each
want/has children status group. Each group contains four subjects.

fact, contrast coding automatically provides for the analysis of what are usually
called planned comparisons or a priori tests. (In Latin, a priori means "from the
previous" and implies deduction from a hypothesis or cause.) Both planned
comparisons and post hoc tests are described in chapter 13.
What a contrast is, and what constitutes a set of contrast-coded predictor
variables, is perhaps best conveyed by example. Recall the four groups defined
earlier in this chapter:

1. ND, has no desire for children.
2. C0, has none but would like.
3. C1, has 1 child.
4. C2, has more than one child.

One possible set of three contrasts could be:

1. ND and C0 versus C1 and C2.
2. ND versus C0.
3. C1 versus C2.

In other words, first we would contrast subjects who have no children (ND and
C0) with those that do (C1 and C2), then those that have no desire with those that
do among the former (ND vs. C0), and finally those that have one and those that
have more than one child among the latter (C1 vs. C2).
A second set of possible contrasts is:

1. ND versus C0, C1, and C2.
2. C0 versus C1 and C2.
3. C1 versus C2.

In other words, first we would contrast subjects with no desire (ND) with those
who desire children (C0, C1, and C2), then subjects who have no children (C0)
with those that do (C1 and C2), and finally those that have one and those that
have more than one child (C1 vs. C2).
Speaking generally, given G groups, several different sets of G - 1 contrasts
are possible. For example, with three groups one set of contrasts is group 1
versus groups 2, 3 and group 2 versus 3; another set is groups 1, 2 versus group 3
and group 1 versus 2; yet another is group 2 versus groups 1, 3 and group 1 versus
3. As you can see, for a simple three-group analysis, several different sets of
contrasts could be defined. But for any one analysis, the researcher must commit
to just one set. For an analysis involving three groups, a set will consist of two
contrasts. The general rule is, one contrast is allowed for each degree of freedom,
just as with dummy-coded variables. More would be redundant because G -1
contrasts exhaust information about group membership (assuming G groups).
Is there any reason to prefer contrast to dummy codes? It depends. If you
have decided beforehand to compare all groups to an explicit comparison group,
then use dummy-coded variables. If the increase in R2 associated with a
particular predictor variable (when predictors are entered one at a time) is
significant, then you can conclude that the mean for the group associated with
that particular predictor variable is significantly different from the mean for the
comparison group.
Alternatively, if you have decided for theoretical reasons to limit your
investigation to G - 1 particular contrasts, then use contrast codes. With contrast
codes, each predictor variable is associated with a particular contrast (or
question, or planned comparison) and hence, if that predictor variable is
associated with a significant increase in R2 (when added to the regression
equation), you can conclude that the means for the two sets of groups defined by
the contrast differ significantly (e.g., all subjects with no desire for children
compared to all others, or all subjects who desire children but have none
compared to all subjects with one or more child).
However, if you only want to know whether the dependent variable is
affected by the research factor and so regress the dependent variable on the G - 1
predictor variables required to represent that categorical independent variable
simultaneously, then it does not matter which coding scheme you choose for the
categorical independent variable. You may use either dummy or contrast codes,
and if you choose contrast codes, you may use any of the possible sets of G - 1
contrasts. The increase in the value of R2 and the value of the omnibus F test
when the G - 1 predictor variables are entered in the equation will be the same no
matter what kind or which set of coded predictor variables are used. (A third way
to code categorical variables is called effects coding—see Cohen & Cohen, 1983—
but in the interest of simplicity, only dummy and contrast coding are described
here.)
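This equivalence is easy to demonstrate. The NumPy sketch below is ours; it
reuses the button-pushing scores and the Set I contrasts shown later in Fig.
12.5, fitting the same four-group factor once with dummy codes and once with
contrast codes and obtaining the same R2:

    import numpy as np

    pushes = np.array([102, 125, 95, 130, 79, 93, 75, 69,
                       43, 82, 69, 66, 101, 94, 84, 69], dtype=float)

    def r_squared(y, X):
        X = np.column_stack([np.ones(len(y)), X])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

    # Four groups of four subjects each; dummy codes on the left,
    # the Set I contrast codes from Fig. 12.5 on the right.
    dummy = np.repeat([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]],
                      4, axis=0)
    contrast = np.repeat([[-2, -1, 0], [-2, 1, 0], [2, 0, -1], [2, 0, 1]],
                         4, axis=0)

    print(round(r_squared(pushes, dummy), 3))     # 0.656
    print(round(r_squared(pushes, contrast), 3))  # 0.656 again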
Rules for forming contrast codes are relatively simple. It is helpful to begin
by organizing a set of contrasts into hierarchic tree structures like those shown in
Fig. 12.4. Consider our current four-group example. We might ask first whether
a subject has children, second whether those who have no children want children,
and third whether those who have children have one or more than one (see top of
Fig. 12.4). In other words, the first contrast is between subjects who have no
children and those who do, the second between those with no children who do
and do not desire children, and the third between those who have one or more
than one child. However, given a different set of theoretic concerns, we might
instead ask first whether a subject wanted children, then if yes ask whether the
subject had children, and again if yes whether the subject had more than one (see
bottom of Fig. 12.4).
Other sets of contrasts could also be defined, but no one set would be more
correct than the other. The correct set is always the one that most faithfully
reflects the theoretical concerns of the study. However, whichever set is selected,
it is desirable to depict that set with a tree diagram. Properly constructed, this
insures that the number of (binary or yes-no) questions asked and the degrees of
freedom are the same. It also insures that the questions are independent of each
other (or orthogonal) and not confounded. Again, we should emphasize that if
the dependent variable is regressed on all G - 1 predictor variables in a single
step, and if contrast codes instead of dummy codes are used, then the particular
set of contrasts used does not matter. It only matters that the coded predictor
variables in the set are orthogonal, that is, reflect a tree structure as exemplified
in Fig. 12.4.
12.1 CODING CATEGORICAL PREDICTOR VARIABLES 189

FIG. 12.4. Tree diagrams indicating two different ways of contrasting the four
groups that represent the want/has children factor.

Each predictor variable in a set of orthogonal contrast codes is associated
with a node of the tree diagram, which represents a particular contrast or
question. Consider the set of contrasts at the top of Fig. 12.4. The first question
asks whether a subject has children. Thus for the first predictor variable, subjects
in the two groups without children will be assigned one code (i.e., one contrast
coefficient) and subjects in the two groups with children, another code. The
second question asks whether subjects with no children desire them. Thus for
the second predictor variable, subjects with no desire would be assigned one
code, subjects who desire children would be assigned another code, and subjects
in groups not included in this contrast would be assigned a third code, usually
zero. Similarly, for the third predictor variable subjects with one child would be
assigned one code, subjects with more than one child another code, and subjects
in the groups not included in this contrast would be assigned zero.

Rules for Forming Orthogonal Contrasts


Two rules guide the selection of codes used to represent orthogonal contrasts.
First, the codes selected for each contrast must sum to zero groupwise. That is,
codes are not summed across all subjects (because there might be uneven
numbers of subjects per group) but instead only one code per group is included in
the sum. An example should clarify. Consider the first contrast in the set at the
top of Fig. 12.4, which is between subjects who have no children and subjects who
do. If we assigned a contrast code of -2 to subjects in groups without children
and a code of +2 to subjects in groups with children, then the coefficients for the
four groups would be -2, -2, +2, and +2, which sum to zero as required. There
are an equal number of groups in each contrasting condition for the contrast just
described (two groups without children, two with children); consequently any
pair of numbers equal in size but opposite in sign would sum to zero (e.g., -1,
+1) and could be used as
contrast coefficients. The coefficients -2 and +2 are suggested here because an
appropriate contrast coefficient for the groups included in one contrasting
condition is always the number of groups included in the other. This ensures that
the codes selected for a contrast sum to zero as required. Other general rules for
selecting contrast codes are possible, but this one has the merit of avoiding
fractions, which many people find more confusing than integers.
As a second example of the contrasts-must-sum-to-zero rule, consider the
first contrast in the set at the bottom of Fig. 12.4, which is between subjects who
do and do not want children. If we assigned a contrast code of -3 to those who
do not desire children and a code of +1 to subjects in the other three groups (who
we are assuming want children), then the coefficients for the four groups would
be -3, +1, +1, and +1, which again sum to zero as required.
Groups not included in a contrast are assigned a code of zero. For example,
consider the second contrast in the set at the top of Fig. 12.4, which is between
subjects without children who do and do not desire them. If we assigned a code
of -1 to subjects who have no children and no desire for them and a code of +1 to
subjects who have no children but who want them, then the coefficients for the
four groups would be -1, +1, 0, and 0, which again sum to zero as required. As
an exercise, you should now determine the codes for the remaining contrasts
associated with the two tree diagrams in Fig. 12.4. Your codes should agree with
the contrast codes given in Fig. 12.5.
Orthogonal contrast codes are also constrained by a second rule, specifically
that the cross products for all possible pairs of contrasts must sum to zero
groupwise. This insures that the contrasts will represent independent
comparisons. (Technically it means that the correlations between pairs of
contrast coefficients will be zero, considered groupwise.) For example, consider
the group cross products between variables X1 and X2 in Set I of Fig. 12.5. The
cross products for the four groups are 2, -2, 0, and 0, which sum to zero as
required:

-2 x -1 = +2
-2 x +1 = -2
+2 x 0 = 0
+2 x 0 = 0

                                 Coded Variables
Group                            X1     X2     X3
Set I
  Has no desire for children     -2     -1      0
  Has none but would like        -2     +1      0
  Has 1 child                    +2      0     -1
  Has more than 1 child          +2      0     +1
Set II
  Has no desire for children     -3      0      0
  Has none but would like        +1     -2      0
  Has 1 child                    +1     +1     -1
  Has more than 1 child          +1     +1     +1
FIG. 12.5. Two sets of contrast codes for the want/has children factor.
In the next exercise, you are asked to verify for yourself that the cross products
for the other possible pairs of contrasts shown in Fig. 12.5 sum to zero as the
second rule for forming contrast codes requires. (See Fig. 12.6 for a restatement
of the two rules for forming contrast codes.)
One final point regarding contrast codes should be mentioned. When
defining a set of orthogonal contrast codes, it is important to keep in mind the
distinction between individuals and the groups to which they belong. Ultimately
the codes or contrast coefficients determine the values assigned to the predictor
variables for individuals within groups (see Figs. 12.7 and 12.8). However, first
the contrast coefficients are determined for the groups, whether or not
the numbers of subjects in the various groups are equal. It is these group-level
contrast coefficients that must sum to zero and whose cross products must sum
to zero. In other words, a set of contrast codes is first determined for groups and
then applied to individuals in those groups.
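
Both rules are purely arithmetic and thus easy to check mechanically. Here is a minimal Python sketch (ours, not part of the book's spreadsheet exercises) that verifies both rules for the two sets of codes given in Fig. 12.5:

```python
from itertools import combinations

# Contrast code sets from Fig. 12.5: one tuple per coded variable
# (X1, X2, X3), one coefficient per group, in the order the groups
# are listed in the figure.
set_I = [(-2, -2, +2, +2), (-1, +1, 0, 0), (0, 0, -1, +1)]
set_II = [(-3, +1, +1, +1), (0, -2, +1, +1), (0, 0, -1, +1)]

def check_orthogonal(contrasts):
    # Rule 1: each contrast's coefficients sum to zero across groups.
    for c in contrasts:
        assert sum(c) == 0, f"contrast {c} does not sum to zero"
    # Rule 2: cross products for every pair of contrasts sum to zero.
    for c1, c2 in combinations(contrasts, 2):
        assert sum(a * b for a, b in zip(c1, c2)) == 0, (c1, c2)
    print("Both rules satisfied.")

check_orthogonal(set_I)
check_orthogonal(set_II)
```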

Exercise 12.2
Applying Rules for Forming a Set of Contrast Coefficients
This exercise provides practice in determining whether contrast coefficients and
their cross products sum to zero as required.
1. For the contrast coefficients in Set I, Fig. 12.5, verify that the cross products
for X1 and X3 and for X2 and X3 sum to zero.
2. For the contrast coefficients in Set II, Fig. 12.5, verify that the sum of the
contrast coefficients for the three contrasts sum to zero. Also verify that the
three possible cross products sum to zero.
3. Relabel the 4th group as 2 children, and add a 5th group, 3 or more children.
Draw a tree diagram showing one set of contrasts that could be used for
such a study. Indicate the contrast coefficients that would be used and verify
that these coefficients and their cross products all sum to zero. Now draw a
tree diagram for a second set of contrasts and indicate the contrast
coefficients that you would use for this second set.
4. Describe a study with six groups. Name the groups and provide a rationale
for a particular set of contrasts. Indicate the contrast codes you would use
and verify that they and their cross products sum to zero.

It is important to know how to form contrast coefficients correctly. They can
be used to analyze planned comparisons, as described in the next chapter, and
they can be used instead of dummy-coded variables to perform one-way analyses
of variance (and other, more complex analyses, as you will see in subsequent
chapters). Dummy-coded variables were used in Exercise 12.1 to analyze the
number of button pushes. In the next exercise, you will use contrast codes to
perform a similar analysis.

Rules for forming orthogonal contrast codes


Rule 1 The codes selected for each contrast must sum to
zero across groups.
Rule 2 The cross products for all possible pairs of contrasts
must sum to zero across groups.

FIG. 12.6. Rules for forming orthogonal contrast codes.



Exercise 12.3
A One-Way Analysis of Variance Using Contrast Codes
This exercise uses the template developed for Exercise 12.1. Again, the effect of
a single factor, want/has children status, on number of button pushes is
analyzed, but this time, instead of dummy-coded variables, the two sets of
contrast codes given in Fig. 12.5 are used as predictor variables.
1. Begin with the spreadsheet shown in Fig. 12.2. Replace the dummy-coded
variables used there with the contrast codes for set I, Fig. 12.5. Remember
that subjects 1-4 belong to the first group, subjects 5-8 to the second, and
so forth.
2. Regress the number of button pushes on the three predictor variables that
code for group membership. What proportion of variance is accounted for?
What is its associated F ratio?
3. Next replace the contrast codes with those shown for set II, Fig. 12.5. Again
regress the number of button pushes on these three new predictor variables.
What proportion of variance is accounted for? What is its associated F ratio?

A B C D E
1 #BPs
2 s Y X1 X2 X3
3 1 102 -2 -1 0
4 2 125 -2 -1 0
5 3 95 -2 -1 0
6 4 130 -2 -1 0
7 5 79 -2 1 0
8 6 93 -2 1 0
9 7 75 -2 1 0
10 8 69 -2 1 0
11 9 43 2 0 -1
12 10 82 2 0 -1
13 11 69 2 0 -1
14 12 66 2 0 -1
15 13 101 2 0 1
16 14 94 2 0 1
17 15 84 2 0 1
18 16 69 2 0 1
19 Sum= 1376 0 0 0
20 N= 16 16 16 16
21 Mean= 86 0 0 0
22 a,b= 86 -5 -17 11
23 R= 0.81
FIG. 12.7. Spreadsheet for determining the effect of want/has children status
on number of button pushes using contrast code set I. Columns F-M are the
same as columns F-M in Fig. 12.2, so are not repeated here.
4. Why are the predicted values the same no matter which set of contrast codes
are used?
5. If there had been a 17th no-desire (ND) subject who pushed the button 200
times, what proportion of variance would be accounted for? What is its
associated F ratio?

After completing the last exercise, your spreadsheets for the two sets of
contrast codes should look like those shown in Figs.12.7 and 12.8. For both the
last exercise and for Exercise 12.1 earlier in the chapter, you determined that 66%
of the variance in number of button pushes was accounted for by group
membership. No matter whether you used dummy-coded variables or contrast
codes to represent group membership, and no matter which set of contrast codes
you used, SS, R, R2, and the F ratio were the same. We concluded that the
difference among the means for these four groups is significant at the .01 level.
In the next section, we show how information from these spreadsheets can be
organized in an ANOVA source table. And in the next chapter, we show how, if
the predictor variables had been added one at a time instead of all at once,
information from the separate steps can be used to analyze planned comparisons.

A B C D E
1 #BPs
2 s Y X1 X2 X3
3 1 102 -3 0 0
4 2 125 -3 0 0
5 3 95 -3 0 0
6 4 130 -3 0 0
7 5 79 1 -2 0
8 6 93 1 -2 0
9 7 75 1 -2 0
10 8 69 1 -2 0
11 9 43 1 1 -1
12 10 82 1 1 -1
13 11 69 1 1 -1
14 12 66 1 1 -1
15 13 101 1 1 1
16 14 94 1 1 1
17 15 84 1 1 1
18 16 69 1 1 1
19 Sum= 1376 0 0 0
20 N= 16 16 16 16
21 Mean= 86 0 0 0
22 a,b= 86 -9 -1 11
23 R= 0.81
FIG. 12.8. Spreadsheet for determining the effect of want/has children status
on number of button pushes using contrast code set II. Columns F-M are
the same as columns F-M in Fig. 12.2, so are not repeated here.

12.2 ONE-WAY ANALYSIS OF VARIANCE

For a one-way analysis of variance, the G groups must be identified with a unique
set of values for G - 1 predictor variables. Regressing the dependent measure on
the coded predictor variables yields an R2. If this R2 is significantly different
from zero, then we say that there is a significant effect of group, that the group
means differ in some way. No matter whether dummy-coded variables, contrast
codes, or some other scheme is used to distinguish the groups, the various
statistics (sums of squares, mean squares, F ratios, and R2s) are identical, as
demonstrated by the analyses shown in Figs.12.2, 12.7, and 12.8. In all three
cases (dummy-coded variables, contrast code set I, contrast code set II) the value
of the F ratio was 7.63. Only the parameter values appear affected by the way the
predictor variables are coded, a matter discussed in the next chapter.
Traditionally, results of an analysis of variance are organized into an ANOVA
source table, so called because it lists the sources of variability examined and
notes the portions of variance associated with each (see Fig. 12.9). A typical
ANOVA source table gives, first, the sources of variance. For the present
example, total variance is divided into a between-groups component and a
within-groups component. The between-group (or treatment, or model) sum of
squares, the within-group (or error, or residual) sum of squares, and the total
sum of squares are computed by the spreadsheet. For the present example, with
its four groups and three predictor variables, Y predicted is computed as a + b1X1
+ b2X2 + b3X3. Note that the sums of squares for the model and for error add up
to the total sum of squares (4880 + 2558 = 7438), as they should.
To complete the ANOVA source table, sums of squares are divided by their
associated degrees of freedom, which gives mean squares (variance estimates) for
the various effects. Finally, the significance of the overall (or omnibus) group
effect is evaluated with the appropriate F ratio, dividing the between-group mean
square by the mean square for error (i.e., the within group mean square). As
before, the df model is the number of predictor variables (which in this case is 3)
and the df error is N minus 1 minus the number of predictor variables (which in this
case is 16 - 1 - 3 = 12).
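
The same omnibus statistics can be computed outside the spreadsheet. The following Python sketch (ours) applies NumPy's least-squares routine to the data and contrast codes of Fig. 12.7 and should reproduce the source table values:

```python
import numpy as np

# Button pushes for the 16 subjects of Fig. 12.7, four per group.
y = np.array([102, 125, 95, 130, 79, 93, 75, 69,
              43, 82, 69, 66, 101, 94, 84, 69], dtype=float)

# Intercept plus contrast code set I (X1, X2, X3).
X = np.column_stack([
    np.ones(16),
    np.repeat([-2, -2, 2, 2], 4),  # X1: no children vs. children
    np.repeat([-1, 1, 0, 0], 4),   # X2: no desire vs. would like
    np.repeat([0, 0, -1, 1], 4),   # X3: one child vs. more than one
])

b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b
ss_total = np.sum((y - y.mean()) ** 2)   # 7438
ss_error = np.sum((y - yhat) ** 2)       # 2558
ss_model = ss_total - ss_error           # 4880
df_model, df_error = 3, len(y) - 1 - 3   # 3 and 12
F = (ss_model / df_model) / (ss_error / df_error)
print(round(F, 2))                       # about 7.63
```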
General formulas for the degrees of freedom for a one-way analysis of
variance are given in Fig. 12.10.

1. The total degrees of freedom between subjects is N - 1, the number of
subjects minus one. You can regard that one degree of freedom as
consumed by the grand mean.
2. The degrees of freedom for the between-groups effect (the A main effect)
is a - 1, where a symbolizes the number of levels or groups for the
between-subjects factor A. This is the same as G - 1 (the number of
groups minus one or the number of predictor variables) used earlier in
this chapter.
3. The degrees of freedom within groups, or the degrees of freedom for
error, is N - a, and is symbolized here as S/G or S/A, which is read
"subjects within groups" or "subjects within A."

Source                     SS     df    MS      F
Between groups             4880    3    1627    7.63
Within groups              2558   12     213
TOTAL between subjects     7438   15
FIG. 12.9. Analysis of variance source table for contrast code Set I.

Source                      Degrees of Freedom
A main effect               a - 1
S/A, subjects within A      N - a
TOTAL between subjects      N - 1
FIG. 12.10. Degrees of freedom for a one-way analysis of variance. The
number of levels for between-subjects factor A is symbolized with a and the
number of subjects is symbolized with N.

There are two ways to explain algebraically why the degrees of freedom for
error is N - a. First, recall that

dftotal = dfbetween + dferror

Therefore,

dferror = dftotal - dfbetween
        = (N - 1) - (a - 1) = N - 1 - a + 1 = N - a

In addition, recall that for each of the a groups, one degree of freedom is lost
within groups estimating the group mean. Assuming equal number of subjects
per group,

dferror = a(N/a - 1) = N - a

Within each group there are N/a subjects and N/a - 1 degrees of freedom (one
consumed by the group mean). The degrees of freedom for error pooled over all
groups is the N/a - 1 for each group multiplied by a, the number of groups. After
routine algebraic manipulation, this yields N - a.

Exercise 12.4
A One-Way ANOVA: Effect of Sibling Status on Number of Words
The purpose of this exercise is to provide additional practice in performing a one-
way analysis of variance.
1. Recall the example study that examined the effect of number of older siblings
on number of words infants spoke. In Exercise 9.6 you determined that
32.6% of the variance in number of words spoken at 18 months of age could
be accounted for by knowing whether infants had just one or more older
siblings, and in Exercise 10.5 you determined that this R2 was significant at
the .01 level. The previous analysis was, in effect, a one-way analysis of
variance with two groups. For this exercise, categorize infants in three rather
than two groups: those with no older siblings, those with just one older
sibling, and those with more than one older sibling.
2. Define and use contrast codes. The first predictor variable contrasts infants
who have no older siblings with those who have one or more older siblings,
and the second contrasts infants who have just one older sibling with those
who have more than one older sibling.
3. What proportion of variance is accounted for by knowing to which of these
three groups infants belong? Is the effect of group membership statistically
significant?
Your spreadsheet for the previous exercise should look like the one shown in
Fig. 12.11. Three facts are worth noting. First, for an analysis of variance, the
numbers of subjects in the various groups need not be equal. In this case, there
were seven, four, and five subjects in the three groups, respectively. Second, the
order of the subjects in the spreadsheet is immaterial. Often it is convenient to
place subjects who are in the same group together, and in this case all subjects
who had no older siblings were in fact listed together. But it is not necessary to
group subjects, and in this case subjects in the second and third groups were
mingled together. Values for the predictor variables must accurately reflect the
group to which a subject belongs, but that is all. Finally, recall that contrast
codes are required to sum to zero at the group, but not at the individual, level. In
fact, when the number of subjects per group is unequal, contrast codes will not
sum to zero at the individual level, as the numbers in Fig. 12.11 demonstrate.
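
A few lines of Python make the distinction concrete (a sketch using the codes and group sizes from Fig. 12.11):

```python
# X1 contrasts no older siblings (-2) with any older siblings (+1);
# group sizes are those of Fig. 12.11 (seven, four, and five subjects).
codes = {"no sibs": -2, "one sib": +1, "more than one": +1}
ns = {"no sibs": 7, "one sib": 4, "more than one": 5}

# Group level: one coefficient per group sums to zero, as Rule 1 requires.
print(sum(codes.values()))                       # 0
# Individual level: weighting by group size need not sum to zero.
print(sum(ns[g] * c for g, c in codes.items()))  # -5, as in Fig. 12.11
```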

Exercise 12.5
Analyzing a Single-Factor Between-Subjects Study in SPSS
In this exercise you will analyze the button-pushing study using the four groups
defined in Fig. 12.3. In SPSS you could analyze these data using the Regression
procedure and dummy or contrast codes. If using dummy codes, you would run a
regression and enter all of the coded vectors, X1 through X3 in a single block,
and then examine the output to determine if the model containing all three
predictor variables is significant. If you want to test individual contrast codes, you
would enter each of the coded predictor variables in separate blocks and check
the R squared change option under Statistics. The resulting hierarchical
analysis (see Exercise 11.7) will allow you to determine the significance of each
contrast. Finally, as presented in this exercise, you could use the One-Way
ANOVA procedure to determine if group is a significant predictor of number of
pushes.
1. Invoke SPSS. Create a variable called Pushes and enter, or cut and paste,
the data from Fig. 12.2.
2. When using the One-way ANOVA procedure, you do not need to use
dummy or contrast coding. Instead, create one variable called group and
enter 0s for all the cases in the ND group, 1s for the CO group, 2s for the C1
group, and 3s for the C2 group. The output will be more readable if you create
value labels for each of these groups.
3. Select Analyze->Compare Means->One-way ANOVA from the main menu.
Enter pushes in the Dependent List window and group in the Factor
window. Click on Options and check the Descriptive and Homogeneity of
Variance Test boxes. Click Continue and OK.
4. Examine the Descriptives output and confirm that the Ns, means, and
standard deviations are correct. Next look at Levene's test. Is the assumption
of equal variances met? Finally look at the output in the ANOVA table. Do
the sums of squares and df agree with values you calculated using the
spreadsheets? Based on the F and significance values, do you reject the null
hypothesis that the mean number of button pushes is equal across groups?
12.3 TREND ANALYSIS

As you have seen in this chapter, with G groups, you can define G - 1 coded
variables or contrasts. The examples we have given of contrast codes (e.g., Fig.
12.4) ordered the groups in a hierarchic or tree structure. But there is another

A B C D E F G H I
1 #words #sibs 0vs>0 1vs>1 y= m= e=
2 s Y X1 X2 Y' Y-My Y'-My Y-Y'
3 1 32 0 -2 0 34.71 1.813 4.527 -2.71
4 2 27 0 -2 0 34.71 -3.19 4.527 -7.71
5 3 48 0 -2 0 34.71 17.81 4.527 13.29
6 4 34 0 -2 0 34.71 3.813 4.527 -0.71
7 5 33 0 -2 0 34.71 2.813 4.527 -1.71
8 6 30 0 -2 0 34.71 -0.19 4.527 -4.71
9 7 39 0 -2 0 34.71 8.813 4.527 4.286
10 8 23 2 1 1 23.4 -7.19 -6.79 -0.4
11 9 24 1 1 -1 30.75 -6.19 0.563 -6.75
12 10 25 4 1 1 23.4 -5.19 -6.79 1.6
13 11 36 1 1 -1 30.75 5.813 0.563 5.25
14 12 31 1 1 -1 30.75 0.813 0.563 0.25
15 13 19 3 1 1 23.4 -11.2 -6.79 -4.4
16 14 28 2 1 1 23.4 -2.19 -6.79 4.6
17 15 32 1 1 -1 30.75 1.813 0.563 1.25
18 16 22 5 1 1 23.4 -8.19 -6.79 -1.4
19 Sum= 483 20 -5 1 483 0 0 0
20 N= 16 16 16 16 N= 16 16 16
21 Mean= 30.19 1.25 -0.31 0.063 VAR= 48.9 23.44 25.46
22 a,b= 29.62 -2.55 -3.68 SD= 6.993
23 R= 0.692 R2= 0.479

J K L M
1 SStot SSmod SSerr
2 y*y m*m e*e
19 SS= 782.4 375.1 407.4
20 df= 15 2 13
21 MS= 52.16 187.5 31.34
22 SD'= 7.222 5.598
23 R2adj= 0.399 F= 5.984
FIG. 12.11. Spreadsheet for determining the effect of sibling group (no older
siblings, one older sibling, more than one older sibling) on number of words
spoken at 18 months of age.
way to define contrast codes that turns out to be very useful. Contrast codes can
also be used to define trends.
Often groups are ordered in some way. For example, imagine you have
assembled three groups of children and that the three groups comprise children
who are within a month of their 3rd, 4th, and 5th birthdays, respectively. You
have collected motor dexterity scores for the children and, not surprisingly, you
expect older children to score higher. You could form contrast codes that
compare, first, 3-year-olds with the combined group of 4- and 5-year-olds and,
second, 4- with 5-year-olds. But the more relevant question may be, of the
variance accounted for by group membership, what proportion is due to the
linear trend of the group means and what proportion to the remaining variance.
Variables that code the various trends possible with G groups let you do just that.
With G groups, you can define G - 1 trends, each one of which corresponds to
a coded predictor variable. With three groups, you can define linear (i.e., a
straight line) and quadratic (i.e., a parabola or a curve with one bend) trends.
With four groups you can define linear, quadratic, and cubic (an S-shaped curve
with two bends) trends. With five groups you can define linear, quadratic, cubic,
and quartic (a W-shaped curve with three bends) trends, and so forth.
The values for the predictor variables that code for these trends are called
orthogonal polynomials, and advanced texts often have tables of them for
various numbers of groups (e.g., Kirk, 1982). Values for three, four, and five
groups are given in Fig. 12.12. Orthogonal polynomials obey the rules we have
already defined for orthogonal contrast codes generally (Fig. 12.6). You should
now verify for yourself that the values given in Fig. 12.12 obey these rules.
If you use orthogonal polynomials for your contrast codes instead of some
other set of contrast codes, you will still account for exactly the same amount of
variance: The value of all of your summary statistics (R2, F ratio, etc.) will be the
same (as they would be if you used any other set of coded variables including
dummy codes). However, if you use trend contrasts (i.e., orthogonal
polynomials), then, with two groups, you can say what proportion of variance was
accounted for by a linear and what proportion by a quadratic trend; with three
groups, what proportion of variance was accounted for by a linear trend, what
proportion by a quadratic trend, and what proportion by a cubic trend; and so
forth. When research hypotheses predict that most of the variance should be
accounted for by particular trend (often a linear trend), this can be very useful.
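
Here is a short Python sketch (ours) that lists the conventional orthogonal polynomial coefficients for three, four, and five groups and confirms that they obey both contrast rules:

```python
from itertools import combinations

# Conventional orthogonal polynomial coefficients (tabled in, e.g., Kirk, 1982).
trends = {
    3: [(-1, 0, 1), (1, -2, 1)],                          # linear, quadratic
    4: [(-3, -1, 1, 3), (1, -1, -1, 1), (-1, 3, -3, 1)],  # ... plus cubic
    5: [(-2, -1, 0, 1, 2), (2, -1, -2, -1, 2),
        (-1, 2, 0, -2, 1), (1, -4, 6, -4, 1)],            # ... plus quartic
}

for g, contrasts in trends.items():
    assert all(sum(c) == 0 for c in contrasts)            # Rule 1
    assert all(sum(a * b for a, b in zip(c1, c2)) == 0    # Rule 2
               for c1, c2 in combinations(contrasts, 2))
    print(g, "groups: trend codes are orthogonal")
```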

Exercise 12.6
A Trend Analysis: Effect of Sibling Status on Number of Words
The purpose of this exercise is to modify the last exercise for a trend analysis.
1. Replace values for the contrast codes used in the last exercise with values
for a linear and a quadratic contrast. For the linear trend, code subjects with
no sibs -1, those with one sib 0, and those with more than one sib +1. For
the quadratic trend, code subjects with no sibs +1, those with one sib -2, and
those with more than one sib +1.
2. Verify that all summary statistics are the same as for Exercise 12.4.
3. What proportion of variance is accounted for by knowing only the linear
trend? How much additional variance is accounted for by the quadratic
trend?

FIG. 12.12. Values for coded variables for a trend analysis comprising
three, four or five groups. Two subjects per group are shown, although
usually groups would have more subjects.

In this chapter you have learned how to analyze data from single-factor
between-subjects studies. The single factor defines levels or groups, and the
statistical question is, do the means for the various groups differ significantly? In
other words, is the proportion of variance accounted for when the dependent
variable is regressed on coded variables representing group membership (the
research factor) significantly different from zero? Can significantly better
predictions for the value of the criterion or dependent measure be made when
subjects' group membership is taken into account? Most succinctly, does group
membership matter? If the answer is yes, that is, if the F ratio is statistically
significant, the next question is, exactly how do the means for the groups differ?
The next chapter addresses this and related questions concerning how results of
analyses like those presented in this chapter should be described.
13 Planned Comparisons, Post Hoc Tests, and Adjusted Means

In this chapter you will:


1. Learn how to use contrast codes to analyze the significance of any group
comparisons you have planned prior to analysis.
2. Learn how to use a post hoc test to determine, after the fact, exactly
which group means differ significantly from each other, given that an
omnibus F test was significant.
3. Learn how to adjust group means when reporting results, given that an
analysis of covariance yielded significant results.
4. Learn how to test whether the assumption of homogeneity of regression,
which is required for an analysis of covariance, is warranted.

Analyzing variance, or determining which factors account for significant
proportions of criterion variance, whether performing a one-way analysis of
variance as in the last chapter or an analysis of covariance as in chapter 11, is only
half of the data analytic task. The task is not complete until the specific results
have been described. What should be described, and how it should be described,
is the topic of this chapter.
First we consider one-way ANOVAs. As mentioned earlier, if the test for a
factor is not significant, only the grand mean for the dependent variable need be
given. If the test is significant, however, then the means for the different groups
analyzed should be presented. If there are only two groups, no further analysis is
required. We know that the means for those two groups differ significantly.
However, if there are more than two groups—and planned comparisons or trend
analyses were not defined—then post hoc tests should be performed in order to
explicate the exact nature of the differences among the groups. Post hoc tests
(tests performed after overall significance has been established) examine each
possible pair of group means and tell you which pairs of means are significantly
different from each other.
You might ask, why bother with this combination of an omnibus F test
(regressing the dependent variable on a set of G - 1 predictor variables that
represents the categorical variable) followed by post hoc tests? Why not proceed
directly to pairwise tests instead? For each test, you would select a different pair

of groups and would then determine whether the single-coded variable required
to represent the distinction between those two groups accounted for a significant
proportion of variance. For example, given four groups, six separate tests would
be required to represent the possible pairwise comparisons:

1 versus 2.
1 versus 3.
1 versus 4.
2 versus 3.
2 versus 4.
3 versus 4.

In other words, there are six combinations of four things taken two at a time. The
information provided by this pairwise approach seems to be the same as that
provided by post hoc tests; that is, we find out whether or not the means in each
pair are significantly different from each other.
Pairwise testing, however, is not recommended; in fact, it is usually roundly
condemned. An omnibus test represents a single test with a set alpha level. The
exhaustive pairwise approach multiplies the number of separate tests of
significance, which increases the probability of Type I error (the exact amount of
the increase is defined in the section on post hoc tests later in this chapter). Thus
the reason we perform an omnibus F test first, and only proceed to post hoc tests
if the omnibus test is significant, is to provide ourselves with some protection
from Type I error.
There is an alternative to the omnibus followed by post hoc test strategy.
(Remember, the present discussion applies to one-way analyses of variance
involving more than two groups.) In the last chapter we mentioned briefly that
there are two ways an analysis of group effects can proceed. Although it is
relatively uncommon to do so, the investigator may commit beforehand to
particular contrasts or a priori tests. As described in the previous chapter, as
many contrasts as there are degrees of freedom are permitted, and each contrast
is represented with a coded predictor variable. In this case, each predictor
variable is added to the regression equation one step at a time, and the increase
in R2 associated with each step is tested for significance. In this way, the
investigator tests the significance of each (single degree of freedom) contrast
directly, instead of first asking whether all G - 1 coded predictor variables
together account for a significant proportion of variance.
The contrasts of a planned comparison approach are more powerful (i.e.,
more likely to be statistically significant) than the comparable post hoc
comparisons, which is the main advantage of this approach. However, the
number of comparisons is limited to G - 1 and they must be specified
beforehand, which is the main disadvantage. Given four groups, for example,
only three planned comparisons are permitted, but post hoc tests could examine
differences between all six possible pairs of groups.
Typically investigators do not commit to particular contrasts before analyzing
their data but first perform an omnibus F test as described in the last chapter.
Contrast codes can still be used for the G - 1 predictor variables, of course—and
in this book, often will be—but the dependent variable is regressed on all of them
in one step. Thus an omnibus F test necessarily involves more than one predictor
variable. If the R2 for the final model (e.g., the one including all three of the
predictor variables that code group membership given four groups) is significant,
you would conclude that somehow the group means differ among themselves.
But in order to determine exactly how they differ, you would then perform what
are called post hoc tests, or tests performed after overall significance has been
established. One common post hoc test, the Tukey test, is described later in this
chapter.
Given a significant group effect, you would then present group means
separately, as we mentioned earlier. Normally, these would be the simple
arithmetic means for the groups. However, imagine that variables representing
two research factors are involved: the first is a covariate, as discussed in chapter
11 (like subject's age), and the second is a group membership variable (like
children status). In this case, you would perform the analysis in two steps. The
first would test for the significance of the covariate and the second for the
significance of the group membership variable. Note that, depending on the
number of groups, the group membership variable might be represented with one
or more coded predictor variables. If the group effect were significant
(determined by testing the significance of the R2Change for step 2), it would mean
that group membership affects the dependent variable even when the covariate is
taken into account. In this case you would report not the simple group means, but
the group means adjusted for the effect of the covariate. At the end of this
chapter we show how such an adjustment is effected.

13.1 ORGANIZING STEPWISE STATISTICS

In chapter 11 we showed how results of a hierarchic multiple-regression analysis
can be portrayed. In this chapter we discuss hierarchic analyses further and
show how, for an analysis of planned comparisons, each step represents a
different contrast. And in the next chapter we will see that the hierarchic
approach has even broader application still, that successive steps may each
represent a different research factor.
In chapter 12 the number of button pushes for 16 subjects—data we have
been using as a running example—was regressed on different sets of predictor
variables. For Exercise 12.3, you were asked to perform an omnibus F test,
regressing number of button pushes on two different sets of three predictor
variables. But you could have proceeded step by step, determining the
significance of the increase in R2 associated with each predictor variable as it was
added to the regression equation in turn.
If you had proceeded step by step, the results after the third step would be
identical with those shown in Figs. 12.7 and 12.8; that is, you would account for
66% of the variance with three predictor variables. But the results after the first
and second step, and comparisons of each step with the previous step, would
provide us with information about the three predictor variables used. You
already know how to organize the results of such an analysis in a step-by-step
table (see Fig. 11.8) and how to test the significance of the change in R2 at each
step.
The F ratio used to test the significance of an increase in R2 was introduced in
chapter 11 (Equation 11.13). It is

F = ((R2large - R2small) / (dflarge - dfsmall)) / ((1 - R2large) / (N - 1 - dflarge))   (13.1)

According to this formulation, the error term (the denominator in Equation 13.1)
is different at each step. It is the variance left unaccounted for at each step
(1 - R2large) divided by the number of subjects minus one minus the number of
predictor variables used in that step (i.e., dferror = N - 1 - dflarge). Thus, R2large
refers to the proportion of variance accounted for at each step, step by step.
For our current running example—predicting number of button pushes from
children status for 16 subjects—three predictor variables are defined. Entering
each predictor variable one at a time, the error terms for the first, second, and
third steps are 1 minus the R2 for that step divided by 14, 13, and 12 error degrees
of freedom, respectively (see Fig. 13.1). However, using a different error term at
each step makes sense only if we regard each step as representing a separate and
progressively less important question.
More often, as is the case with our running example, we conceive of the
predictor variables as a set (together they represent the group variable), and
regard all as important. The overarching or omnibus question is, do the groups
differ? This question is answered by the proportion of variance accounted for by
all three variables. In such cases, it both makes sense and is traditional to use the
final error term (one minus the step 3 or final R2 divided by its degrees of
freedom) to assess not just the change in R2 from step 2 to 3 but the change in R2
from no model to step 1, and from step 1 to 2 as well.
Reflecting these considerations, Equation 13.1 could be rewritten as follows:

F = ((R2large - R2small) / (dflarge - dfsmall)) / ((1 - R2final) / (N - 1 - dffinal))

According to this formulation, although which model is large and which is small
varies with each step, the error term is the same for all steps. It is the proportion
of variance left unaccounted after the final step divided by the number of subjects
minus one minus the number of predictor variables in the final model.
Traditional analyses of variance use a single final, not a progressively changing,
error term. For that reason, the final error term approach is used in this book
whenever increments in proportions of variance accounted for are tested for
statistical significance. Still, there are occasions when the progressive approach
may be justified and interested readers should consult Cohen and Cohen (1983).
The stepwise statistics for the first set of contrast codes used in Exercise 12.3,
using the final error term approach as described in the preceding paragraph, are
organized and presented in Fig. 13.1. Note that the error degrees of freedom
associated with the F ratios used to test the significance of the changes in R2 are
all 12. As an exercise, you should now verify that the numbers given in Fig. 13.1
are correct. The next exercise provides you with an opportunity to organize the
results from the second set of contrast codes into a similar table. And in the next
section, we discuss how the results given in Fig. 13.1 (and in the table you will
prepare for Exercise 13.1) can be interpreted.

                      Total                 Change
Step   Contrast    R2     df     F       R2     df     F
1      Children?   .215   1,14   3.84    .215   1,12   7.51
2      Want?       .526   2,13   7.21    .311   1,12   10.85
3      >1?         .656   3,12   7.63    .130   1,12   4.54
FIG. 13.1. Stepwise statistics for a planned comparison analysis using
contrast code Set I.
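
The final-error-term computation is simple enough to script. The following Python sketch (ours) reproduces the change F ratios of Fig. 13.1 from the rounded R2 values reported there; small discrepancies from the tabled values reflect that rounding:

```python
def f_change(r2_large, r2_small, df_large, df_small, r2_final, df_final, n):
    # Final-error-term version of Equation 13.1: the denominator always
    # uses the variance left unaccounted for after the final step.
    numerator = (r2_large - r2_small) / (df_large - df_small)
    denominator = (1 - r2_final) / (n - 1 - df_final)
    return numerator / denominator

N, r2_final, df_final = 16, 0.656, 3
steps = [(0.215, 0.000, 1, 0),   # step 1: Children?
         (0.526, 0.215, 2, 1),   # step 2: Want?
         (0.656, 0.526, 3, 2)]   # step 3: >1?
for r2_l, r2_s, df_l, df_s in steps:
    print(round(f_change(r2_l, r2_s, df_l, df_s, r2_final, df_final, N), 2))
# prints approximately 7.50, 10.85, 4.53 (cf. 7.51, 10.85, 4.54 in Fig. 13.1)
```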

Exercise 13.1
Presenting Results of Planned Comparisons
For this exercise you are asked to organize a stepwise results table like that
shown in Fig. 13.1 but for the second contrast set used in Exercise 12.3.
1. Prepare a table like the one shown in Fig. 13.1, but for contrast set II (Fig.
12.5). You have already completed step 3 for this analysis (see Fig. 12.8).
Return now to your previous spreadsheet and perform steps 1 and 2 of a
planned comparison analysis and compute the stepwise statistics.

13.2 PLANNED COMPARISONS

From the one-way analysis of variance performed in chapter 12, you know that
the mean number of button pushes differs for the four want/has children status
groups. The critical value of F(3,12) is 3.49 at the .05 level and 5.95 at the .01
level. Thus, no matter whether we had selected the .05 or the .01 level of
significance, because F(3,12) = 7.63 for our running example, we would have
claimed statistical significance for these results. (Recall from chap. 10 that
numbers in parentheses after an F ratio indicate degrees of freedom for the
numerator and denominator respectively.)
The overall ANOVA, however, only tells us that the mean number of button
pushes differs for the different groups but it does not tell us how. We could
examine the group means (see Fig. 11.3) but this would be merely
impressionistic—far better to fortify our impressions with statistical tests. As
mentioned earlier, two approaches are possible. If we have no particular
theoretical rationale for suspecting how the groups might differ beforehand—that
is, if we are unwilling to commit to G - 1 specific contrasts—then we should
perform post hoc tests as described in the next section. (Post hoc means "after
this" or "after the event" in Latin.) Otherwise, although it is done less frequently,
we can analyze the contrasts or comparisons as planned.
A contrast is regarded as significant if R2 increases significantly when its
associated predictor variable is added to the equation. Consider the first set of
contrast codes we have been using. The critical value of F(1,12) = 4.75, alpha =
.05, so both the first (F(1,12) = 7.51) and the second (F(1,12) = 10.85) contrasts
are significant, but the third is not significant (F(1,12) = 4.54; see Fig.13.1). This
indicates that the mean number of button presses for the groups of subjects
without children is significantly different from the groups of subjects with
children (the first contrast). In other words, 96—the mean of the first two group
means—is significantly different from 76—the mean of the last two group means.
Also, the mean number of presses for the group of subjects with no desire for
children was significantly different from the group who desired children (the
second contrast). In other words, 113 differed significantly from 79. However,
the means for the group of subjects with one and more than one child, which
were 65 and 87, respectively, did not differ significantly.

Exercise 13.2
Interpreting Planned Comparisons
This exercise provides practice in interpreting the results of planned
comparisons.
1. Using the table you prepared for the last exercise, describe and interpret the
results of the planned comparison analysis using the second set of contrast
codes. Be sure to describe the group means or means of groups means, as
appropriate (i.e., as embodied in the significant contrasts).
2. Recall the study examining number of words infants spoke at 18 months of
age, last used in Exercise 12.4. Prepare a stepwise table for a planned
comparison analysis using the contrasts shown in Fig. 12.11. Then describe
and interpret the results.
3. Organize the results of the trend analysis from Exercise 12.6 in a stepwise
table. Then describe and interpret the results.

13.3 POST HOC TESTS

Planned comparisons are disciplined by forethought and limited in number by
the degrees of freedom. But lacking an a priori basis for specific comparisons
(which, judging from the research literature, is the usual case), a significant
analysis of variance omnibus result can be—and should be—explicated with post
hoc tests. Typically, post hoc tests determine which of all possible pairs of group
means differ significantly. For the present four-group example, this generates six
pairs:

groups 1 vs. 2.
groups 1 vs. 3.
groups 1 vs. 4.
groups 2 vs. 3.
groups 2 vs. 4.
groups 3 vs. 4.

In order to determine significance, we could treat each pair of groups as a
separate two-group analysis of variance, but this is problematic because the more
such comparisons we make, the more likely we are to make type I errors if the
null hypothesis is true. Assume an alpha level of .05. Then, although the
probability of a type I error per comparison is .05, the familywise error rate for a
set of c independent tests or comparisons is

alphaFW = 1 - (1 - alpha)^c

The present example requires six comparisons, thus

alphaFW = 1 - (1 - .05)^6 = 1 - .95^6 = 1 - .735 = .265

Clearly, a familywise probability for type I error that equals .265 is a far cry from
the .05 that an unwary investigator might (mistakenly) expect.
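
The arithmetic is easy to confirm:

```python
alpha, c = 0.05, 6
alpha_familywise = 1 - (1 - alpha) ** c  # error rate over c independent tests
print(round(alpha_familywise, 3))        # 0.265
```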
Over the years, statisticians have suggested a number of post hoc procedures,
all designed to control the type I error rate. The number of different procedures,
and the arguments pro and con for each, can seem confusing to the novice (and to
many seasoned investigators as well). The Tukey test, however, seems the one
test most often preferred, primarily because it is computationally straightforward
and appears to strike the best balance between type I and type II errors. Still,
other tests have their partisans. For simplicity, the only post hoc test presented
here is the Tukey, although the reader is strongly urged to read Keppel's (1982,
chap. 9) excellent and informative discussion concerning multiple comparisons.

The Tukey Test


The Tukey test can be decomposed into four essential steps. First, order the
group means from smallest to largest. Consider the button-pushing study (the
means for the four groups were last given in Fig. 12.8). The smallest mean is 65
for the has one child or C1 group, the next smallest is 79 for the desire but have
no children or CO group, the next is 87 for the has more than one child or C2
group, and the largest mean is 113 for the no desire for children or ND group.
Then, compute differences between pairs of means. Arrange these
differences in a G x G table, where G is the number of groups (see Fig. 13.2).
Differences on the diagonal are necessarily zero. There is no reason to display
differences below the diagonal because they necessarily mirror the differences
above the diagonal. If the difference between a pair of group means exceeds a
specified critical value (symbolized here as TCD for Tukey critical difference),
then the means for those two groups are said to differ significantly.
Third, compute the value for the Tukey critical difference. Its formula is

TCD = q * sqrt(MSerror / n)

In this formula
q represents the studentized range statistic,
G is the number of groups,
dferror is from the omnibus F test,
MSerror is computed for the omnibus F test, and
n is the number of subjects per group.
(Do not confuse n with N, the total number of subjects.) Values for the
studentized range statistic for various values of G and dferror, for alpha = .05 and
.01, are given in Table E in the statistical tables appendix. For the present
example, G = 4 and dferror = 12; hence, assuming the usual .05 level of
significance, q(4,12).05 = 4.20 (from Table E), so

TCD = 4.20 * sqrt(213.6 / 4) = 30.7

                                     Group Means
Group                       Mean     C1     CO     C2     ND
                                     65     79     87    113
Has 1 child                   65      0     14     22     48
Has none but would like       79             0      8     34
Has more than 1 child         87                    0     26
Has no desire for children   113                           0
FIG. 13.2. Differences between ordered group means for four want/have
child groups. Group means are ordered from smallest to largest.

Having computed the TCD, it is a simple matter to move on to the fourth
step: Determine which differences exceed the critical value. If the difference
between the means for two specific groups exceeds the value of the Tukey critical
difference, then those means are significantly different. For the present example,
only two differences exceed 30.7, the difference between the means for the C1 and
ND groups (48) and between the means for the CO and ND groups (34). All other
differences are not significant. (These four steps are summarized in Fig. 13.3.)
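
These four steps translate directly into code. The sketch below (ours) uses the group means, MSerror, and tabled q from the running example; recent versions of SciPy could supply q via scipy.stats.studentized_range instead of a printed table.

```python
from itertools import combinations
from math import sqrt

means = {"C1": 65, "CO": 79, "C2": 87, "ND": 113}  # smallest to largest
q = 4.20          # studentized range statistic, q(4,12) at alpha = .05
ms_error = 213.6  # mean square error from the omnibus ANOVA
n = 4             # subjects per group

tcd = q * sqrt(ms_error / n)
print(round(tcd, 1))  # 30.7

for (g1, m1), (g2, m2) in combinations(means.items(), 2):
    diff = abs(m1 - m2)
    print(g1, "vs", g2, diff, "significant" if diff > tcd else "ns")
# Only C1 vs ND (48) and CO vs ND (34) exceed the TCD.
```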
It is conventional to identify group means that do not differ significantly with
a common subscript. Thus, means that do differ significantly will not share a
common subscript and can be readily identified. Fig. 13.4 demonstrates a
graphic way to determine which subscripts should be applied to which means.
First, draw a diagonal line through the cells that contain zero. Second, draw
vertical lines between cells so as to separate differences that exceed the Tukey
critical difference from those that do not (here, between 22 and 48, and between
8 and 34). Third, likewise draw horizontal lines between cells so as to separate
differences that exceed the Tukey critical difference from those that do not (here,
between 34 and 26). Fourth, draw a horizontal line above the first row of
differences from the top of the first vertical line back to the diagonal line, and
likewise extend any other horizontal lines back to the diagonal.
These lines accomplish two things. First, differences that exceed the TCD are
segregated from those that do not (here, 34 and 48 exceed 30.7). Second, the
horizontal lines identify groups whose means do not differ significantly. Label
each such horizontal line with a different letter, and attach these letters as
subscripts to the means above them. Here the two lines are labeled a and b.
Arbitrarily, we labeled the first line, which includes groups C1, CO, and C2, with a
b, and the second line, which includes groups C2 and ND, with an a, but only
because we knew that later we would table these means and list the ND group
mean first. The important point is that the post hoc analysis identified two
groups: The means for groups C1, CO, and C2 do not differ among themselves;
likewise, the means for C2 and ND do not differ between themselves.
Of particular interest are horizontal group lines that do not overlap. Their
associated group means do not share a common subscript and so differ
significantly. Here, the means for both the C1 and CO groups (65 and 79) differ
significantly from the mean for the ND group (113).
According to these post hoc test results, persons with no desire for children
noted significantly more communicative acts (i.e., pushed the button more), on
the average, than persons who had one child and persons who had no children
but desired them—but not more than persons who had two or more children.

Steps for the Tukey post hoc test


Step 1 Order the group means from smallest to largest.
Step 2 Compute differences between pairs of means.
Step 3 Compute the value for the Tukey critical difference.
Step 4 Determine which differences exceed the critical value.
FIG. 13.3. Steps for the Tukey post hoc test.

FIG. 13.4. A graphic way for displaying results of a post hoc analysis. The
gray area indicates differences between group means that exceed the TCD.
The solid horizontal lines underscore means that do not differ significantly.

The mean number of button pushes by persons who had two or more children did
not differ significantly from the means noted for the other three groups. Thus,
for these data, we now understand not just that the means for the four groups
differed (R2 = 0.66, F(3,12) = 7.63, p < .01) but exactly how they differed as well.
The lines of Fig. 13.4 are useful as a first step in understanding the pattern of
differences revealed by a post hoc test. In a fairly mechanical way, they quickly
identify which means do not differ among themselves and should therefore be
identified with common subscripts. More formally, post hoc results are often
presented in written reports (articles, dissertations, and so forth) as shown in Fig.
13.5. In such a table, the means can be ordered in any way that makes conceptual
sense. They need not be ordered from smallest to largest. For consistency, the
order here is the same as the one used in chapter 11 when the button-pushing
study was first introduced. Nonetheless, as noted earlier, groups that differ
significantly can be readily identified by their lack of a common subscript.

Exercise 13.3
Presenting Results of Post Hoc Tests
This exercise provides practice in interpreting the results of post hoc tests.
1. Prepare a bar graph with error bars (like the one shown in Fig. 10.4) for the
means given in Fig. 13.5. Prepare a second graph showing 95% confidence
intervals, as discussed at the end of chapter 10. (See Fig. 13.6)
2. Given different data for the button-pushing study, the results of the post hoc
analyses could have been quite different. Assume, for example, that the
MSerror is 92.5 instead of 213.6 but that all other details remain the same.
Prepare a figure like Fig. 13.4 and a table like Fig. 13.5 for this post hoc
analysis. Explain how these results would be interpreted.
3. Now do the same, assuming this time that the MSerror is 23.2.

                                     Groups
Variable                      ND     CO     C1     C2
Number of button pushes     113a    79b    65b   87ab
FIG. 13.5. Mean number of button pushes for four want/has child status
groups. Means that do not differ significantly per a Tukey post hoc test, p <
.05, share a common subscript.

FIG. 13.6. A bar graph showing the mean number of button pushes for the
four groups; error bars indicate 95% confidence intervals.

13.4 UNEQUAL NUMBERS OF SUBJECTS PER GROUP


Traditional analysis of variance texts often discuss at some length solutions to
what is usually called the unequal n problem, noting adjustments required when
groups contain different numbers of subjects, which is a common situation in
behavioral science research (n is used here to symbolize number of subjects per
group, whereas N indicates the total number of subjects in all groups). With
multiple regression, such adjustments are handled automatically and so no
computational problems arise. Our running example involves four subjects per
group. For generality, we probably should have had unequal numbers of subjects
in each group, but none of the procedures or computations would have been
changed by that, as you have demonstrated in a number of the exercises.
Widely discrepant ns can affect interpretation of results, however. For one
thing, the discrepant proportions of subjects in different groups can affect
correlations between the criterion and coded variables and can limit the
maximum value for a number of multiple-regression statistics. There is probably
little reason to worry if group ns are only somewhat discrepant, but if widely
different it makes sense to consult with more knowledgeable colleagues because
interpretation may require some qualification. If you want to know more about
this matter, read Cohen and Cohen's (1983, pp. 186-189) discussion concerning
the relationships of dummy variables to Y.

Unequal ns and Post Hoc Tests


Before leaving this chapter, one detail remains to be addressed. Recall that the
formula for the Tukey critical difference used for post hoc tests includes n, the
number of subjects per group. But what if the ns for the groups are different?
What value should be used then? One obvious choice might seem the arithmetic
mean of the group ns, but Winer (1971) recommended that the harmonic mean be
used instead. The harmonic mean is

nh = G / (1/n1 + 1/n2 + ... + 1/nG)

For example, imagine four groups with 3, 4, 4, and 5 subjects in each. Their
arithmetic mean would be 4.00, but the harmonic mean would be

nh = 4 / (1/3 + 1/4 + 1/4 + 1/5) = 4 / 1.033 = 3.87
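
Python's standard library computes the harmonic mean directly:

```python
from statistics import harmonic_mean

ns = [3, 4, 4, 5]                    # group sizes
print(round(harmonic_mean(ns), 2))   # 3.87, versus an arithmetic mean of 4.00
```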

Exercise 13.4
Post Hoc Tests with Unequal Sized Groups
This exercise provides practice in performing a one-way analysis of variance and
post hoc tests when groups contain unequal numbers of subjects.
1. Begin with either the spreadsheet shown in Fig. 12.2 or 11.7. For this
exercise you will perform an omnibus test followed by post hoc tests, so it
does not matter which set of coded variables you use to represent group
membership. Omit subjects 4 (#BPs = 130) and 14 (#BPs = 94). Perform a
one-way analysis of variance on the remaining 14 subjects. Note that groups
1 and 4 now contain three subjects. Prepare a source table for this analysis,
modeled on the one shown in Fig. 12.9.
2. Now perform a post hoc test analysis and organize your results as shown in
Figs. 13.4 and 13.5. Explain how these results would be interpreted.
3. The mean number of words spoken by three groups of infants was analyzed
in Exercise 12.4 and the group effect was found to be significant. Now
perform a post hoc analysis and present and interpret your results.

Exercise 13.5
Post Hoc Tests in SPSS
In this exercise you will learn how to use SPSS to conduct post hoc tests for the
button-pushing study presented in chapter 12.
1. Open the data file for the button pushing study you created in Exercise 12.5.
2. Redo the one-way ANOVA analysis. This time, however, click on the Post-
hoc button and check the Tukey box before you click on OK.
3. Selecting the Post-hoc option under the one-way procedure produces two
tables in addition to the typical One-way ANOVA output. The first displays all
possible comparisons between group means and their significance level. The
second table, called Homogeneous Subsets, groups together all means
that do not differ from each other at the selected alpha level. This grouping
should make it easy for you to create the subscripts necessary to present the
results of your post hoc tests in the format of Fig. 13.5. Note that SPSS
always uses the harmonic mean of the group sample sizes.
4. To create a bar graph of the means, select Graphs->Interactive->Bar from
the main menu. Move the pushes variable to the y axis window and the
group variable to the x axis window. Click on the Error bars tab and select
either Standard Error of Mean or Confidence Interval of Mean from the
pull-down menu under Confidence Interval. If you select the Confidence
Interval of Mean, you should also input the size of the confidence interval
(e.g., 95%, 99% or 90%).

13.5 ADJUSTED MEANS AND THE ANALYSIS OF COVARIANCE

In chapter 9 we learned that if a criterion variable is regressed on a single
quantitative predictor variable, the relation between them is described by
reporting the proportion of variance accounted for by the predictor (r2) and the
change in the criterion per unit change in the predictor (b). We also learned that
if a criterion is regressed on a binary categorical variable, the relation is described
by reporting the means for the two groups (as for the logically equivalent t test).
And in this chapter we have learned how to report results from a one-way
analysis of variance when more than two groups are involved, that is, when a
criterion is regressed on a set of predictor variables that together represent a
multilevel categorical variable. Specifically, we have learned that either we report
group means lumped to reflect the significant contrasts (planned comparison
analysis) or else we report all group means and identify those that differ
significantly with different subscripts (post hoc analysis).
A single study, however, may include some quantitative and some qualitative
research variables. This set of circumstances presents no problems for a
multiple-regression-based analysis, and, in fact, such an analysis was first
demonstrated in chapter 11. Within the experimental tradition, the name given
an analysis involving mixed quantitative and categorical predictors is the
analysis of covariance. In the simplest instance, an analysis of covariance
involves two predictor variables, a quantitative covariate and a binary group
variable (assuming just two groups). Of primary concern is whether group
membership (defined by the categorical variable) affects the dependent measure.
However, if the groups vary with respect to some other attribute (the covariate),
which may also affect the dependent measure, then it may be desirable to control
for the effect of that other variable statistically.

An Example: The Gender Smiling Study


Imagine, for example, that we want to test the hypothesis that during routine
caregiving male infants smile more than female infants. We obtain a sample of
10 male and 10 female infants, observe each infant for 15 minutes during routine
caregiving in the home, and count each time the infant smiles during that time.
Examination of our raw data (see Fig. 13.7) reveals that the mean number of
smiles is indeed higher for male infants. But this, of course, could mean nothing.
We need to determine whether the difference between the mean number of
smiles for male and female infants is statistically significant.
Procedures first described in chapter 9 are appropriate for these data. First
we regress the number of smiles on the (binary categorical) variable coded for
sex. For the data in Fig. 13.7, the r2 is .150 and the corresponding F(1,18) is 3.19.
The critical value for F(1,18), alpha = .05, is 4.41, so we cannot reject the null
hypothesis. Based on this analysis, we could not claim that the difference
between a mean of 6.4 smiles for males and a mean of 4.1 smiles for females is
significant. However, we note two interesting and disturbing aspects of our data.
First, for some reason, the female infants were older than the male infants. Their
mean age was 16.1 months compared to an average age of 10.9 months for the
males. Second, it appears that older infants may have smiled somewhat more
than younger infants.
This set of circumstances suggests that an analysis of covariance might be
warranted. Age may be related to the number of smiles, and the mean age for the
two groups of infants appears somewhat different; consequently, before testing
for a sex effect perhaps we should first remove (or control) statistically any effect
age might have on number of smiles. This you already know how to do, as you
are asked to demonstrate in the next exercise.

A B C D
1 Smiles Age Sex
2 s Y X A
3 1 5 5 0
4 2 7 5 0
5 3 4 5 0
6 4 8 6 0
7 5 6 9 0
8 6 3 13 0
9 7 4 14 0
10 8 11 17 0
11 9 12 17 0
12 10 4 18 0
13 11 1 4 1
14 12 1 8 1
15 13 5 13 1
16 14 2 17 1
17 15 1 18 1
18 16 5 18 1
19 17 5 20 1
20 18 8 20 1
21 19 7 21 1
22 20 6 22 1
23 Sum= 105 270
24 N= 20 20
25 Mean= 5.25 13.5
26 a,b=
27 R=
28 Mean= 6.4 10.9 males
29 Mean= 4.1 16.1 females
FIG. 13.7. Spreadsheet showing raw data and group means for the gender
smiling study. Male infants are coded 0 and female infants 1.

Exercise 13.6
Analyzing Variance Using Age as a Covariate
The template that results from this exercise allows you to perform an analysis of
covariance for the gender smiling study. The question is, once differences in age
are statistically controlled, do male infants smile significantly more than female
infants?
1. Modify one of your previous spreadsheets (e.g., Fig. 12.8 or Fig. 12.11) to
accommodate the gender smiling study. Enter data (shown in Fig. 13.7) for
the number of smiles (Y), the infant's age (X), and the infant's sex (A, a coded
predictor variable). Reserve space for a third predictor variable, in addition to
the covariate (age) and the group variable (sex), which represents the
interaction between age and sex. The interaction variable will not be used
until Exercise 13.7. Exactly what interaction means is discussed in greater
detail in chapter 14.
2. Extend the spreadsheet to accommodate the 20 subjects in the gender
smiling study and fill in all formulas as appropriate.
3. Do step 1. Regress number of smiles on age. Note the R2 and its
corresponding F. You will need them later.
4. Do step 2. Regress number of smiles on age and the variable coded for sex
together. Organize your results into a stepwise table like Fig. 13.1.
5. Does sex account for a significant unique proportion of variance, beyond that
already accounted for by age?

The table you prepared should look like the one shown in Fig. 13.8. As you
can see, the increase in R2 when sex is added to the equation is .290 and its
corresponding F is 7.46. The critical value of F(1,17), alpha = .05, is 4.45, so this
effect is significant and we conclude that, controlling statistically for the effect of
age, sex of infant significantly affects number of smiles. But how do we describe
this effect? It does not make sense to report the mean number of smiles for male
and female infants, which were 6.4 and 4.1, respectively. We know from an
earlier analysis that those means do not differ significantly. But what means do?
An analysis of covariance adjusts the individual raw scores for the effect of
the covariate. That is, any effect the covariate might have on the criterion
measure is removed (step 1) and then, in effect, these adjusted scores are
analyzed to see if they are affected by the categorical variable of interest (step 2).
This is simply another way to state the function of a hierarchic regression
analysis. A change in R2, after all, tells us how much influence the last variable
entered has, after any effects of variables already in the equation have been
removed from the criterion variable. Thus, if the effect of the categorical variable
entered at step 2 is significant, the means for the scores adjusted for the step 1
covariate, the scores on which the effect is based, differ significantly and thus it is
the adjusted and not the raw means that should be reported.

Step  Variable        Total                 Change
      added           R2    df     F        R2    df     F
1 age .051 1,18 <1 .051 1,17 1.31
2 sex .340 2,17 4.39 .290 1,17 7.46
FIG. 13.8. Stepwise results for an analysis of covariance of the gender
smiling study.
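The same stepwise table can be checked in code. The sketch below (again ours,
assuming NumPy) computes R2 at each step and the change F, using the error term
from the final model:

    import numpy as np

    smiles = np.array([5, 7, 4, 8, 6, 3, 4, 11, 12, 4,
                       1, 1, 5, 2, 1, 5, 5, 8, 7, 6], dtype=float)
    age = np.array([5, 5, 5, 6, 9, 13, 14, 17, 17, 18,
                    4, 8, 13, 17, 18, 18, 20, 20, 21, 22], dtype=float)
    sex = np.array([0.0] * 10 + [1.0] * 10)

    def r_squared(y, *predictors):
        # Regress y on an intercept plus the given predictors; return R2.
        X = np.column_stack([np.ones_like(y), *predictors])
        yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

    r2_1 = r_squared(smiles, age)            # step 1: about .051
    r2_2 = r_squared(smiles, age, sex)       # step 2: about .340
    df_error = len(smiles) - 2 - 1           # N - k - 1 = 17 for the final model
    F_change = (r2_2 - r2_1) / ((1 - r2_2) / df_error)
    print(round(r2_2 - r2_1, 3), round(F_change, 2))   # .290 and 7.46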
The statistically insignificant difference between the means of the raw scores
for female and male infants can be portrayed graphically (see Fig. 13.9). The top
horizontal line represents the mean number of smiles for male infants. Its
equation is Y = 6.4 (the mean number of smiles for male infants), which means
that the predicted scores for male infants would all fall on this line. The bottom
horizontal line represents the mean number of smiles for female infants. Its
equation is Y = 4.1 (the mean number of smiles for female infants), which means
that the predicted scores for female infants would all fall on this line. The
distance between the two lines (2.3 smiles) indicates the (nonsignificant)
difference between the means for males and females.
Fig. 13.9 also graphically portrays aspects about the data that we noted
earlier, specifically, that females are older on average than males (the four oldest
infants are female whereas four of the five youngest infants are male) and there is
a tendency, albeit somewhat weak, for older infants to smile more than younger
infants.
The statistically significant difference between the means of the scores for
female and male infants that take age into account can also be portrayed
graphically (see Fig. 13.10). First, we regress the number of smiles on age and sex
using the prediction equation
Yi' = a + b1 agei + b2 sexi. (13.6)
Then we compute the predicted values for each infant and graph them. These
predicted values will fall on two sloping parallel lines, representing the predicted
number of smiles for males and females adjusted for age (see Fig. 13.10). The top
sloping line represents males, who smiled more on average, and the bottom
sloping line represents females, who smiled less. Note that the difference
between the parallel lines representing predicted scores for males and females is

FIG. 13.9. Graphic presentation of observed scores for the gender smiling
study. Triangles represent observed number of smiles for males, circles, the
number for females. The top horizontal line (black) represents the mean
number of smiles for males (Y = 6.4), the bottom horizontal line (gray), the
mean number of smiles for females (Y = 4.1).
greater in Fig. 13.10 (which takes the infant's age into account) than in Fig. 13.9
(which does not take infant's age into account).

Adjusting Individual Scores


The top and bottom sloping lines in Fig. 13.10 represent predicted scores that
take age and sex into account. These lines slope up to the right, indicating that
on average older infants smiled more than younger infants. Earlier we said that if
we wanted to describe the difference between male and female scores we should
use adjusted and not raw means. But how should adjusted scores be computed?
The task is to remove statistically the effect of age. In effect, we ask, if all infants
were the same age (in this case, the mean age for the sample or 13.5 months),
how many times would they smile? Given the upward sloping line in Fig. 13.10,
this means that we will adjust younger infants' number of smiles upward and
older infants' downward, and that larger adjustments will be made for the
youngest and oldest infants than for those whose age is nearer the mean.
In fact, the size of the adjustment for an infant of a specified age will be the
distance at that age between the light gray horizontal line (representing the
sample mean) and the light gray parallel line midway between the male and
female upward slanting lines (representing predicted scores for infants whose sex
is not specified) in Fig. 13.10. This distance can be computed with the regression
equation (Equation 13.6). We use the regression coefficients we computed for
age and sex (b1 and b2), and, when computing the adjustment for a particular
infant, we use that infant's age (agei). However, because we want to compute an
adjustment for the average infant of that age, we do not supply that particular

FIG. 13.10. Graphic presentation of predicted scores for the gender smiling
study. Triangles represent observed number of smiles for males, circles, the
number for females. The horizontal gray line represents the mean number of
smiles for all infants (Y= 5.25). Predicted scores for males and for females
taking age and sex into account fall on the top (black) and bottom (gray)
sloping parallel lines, respectively. The middle sloping line (light gray)
represents predicted scores that take age into account when sex is not
specified; scores are adjusted by the difference between it and the horizontal
mean line.
infant's sex (sexi, in this case 0 for males and 1 for females) but instead the
average of the codes we used for sex (Msex, in this case 0.5):

Yi" = a + b1 agei + b2 Msex (13.7)
The symbol Y" is used to remind us that the prediction equation is for an average
subject, not one who belongs to a specific group, and thus the individual variable
(sexi, in this case) is replaced with a constant that is the average of the codes
used.
As noted earlier, adjustments to individual infants' scores are represented by
differences between the middle sloping line in Fig. 13.10 and the horizontal line
representing the mean. In other words

adjustmenti = Yi" - MY (13.8)


As you can see from Fig. 13.10, this adjustment is negative for the youngest ages,
shrinks toward zero with increasing age, equals zero at 13.5 months (the mean age
for the infants in this study), and then becomes positive and progressively larger for
older infants. The adjusted score— the individual's raw score adjusted for the
covariate— is then the initial score from which this adjustment is subtracted:

adjusted Yi = Yi - adjustmenti = Yi - (Yi" - MY) (13.9)
Thus in the present case the number of smiles is increased for younger infants (a
negative adjustment is subtracted from their initial scores) and decreased for
older infants (a positive adjustment is subtracted from their initial scores).
You may wonder, why define the adjustment so that you end up subtracting
negative adjustments, which really means adding? Why not define the
adjustment as MY - Yi" in the first place, in which case the adjusted score would
be Yi + MY - Yi", which can be derived algebraically from Equation 13.9 in any
case? The answer lies with convention, which defines any deviation from a mean
as the mean subtracted from a score and hence Yi" - MY is used.
In any event, you should now have grasped a key and central concept of the
analysis of covariance: In terms of the present example, controlling for age
means, in effect, adjusting raw scores statistically so that the adjusted scores now
reflect what the scores would be if all infants had been observed at 13.5 months of
age, the mean age for the sample. This idea should become clearer to you during
the course of the next exercise, which asks you to compute adjusted scores.

Exercise 13.7
Computing Adjusted Means for an Analysis of Covariance
The template developed for this exercise adds the ability to compute adjusted
scores to the template developed for the last exercise. It allows you to compute
the mean number of smiles for male and female infants adjusted for age, which is
necessary information to describe the significant sex effect revealed by your last
analysis.
1. Modify the spreadsheet developed for the last exercise. Add three new
columns to it. These columns are for the predicted number of smiles based
on the truncated prediction equation, or Y"; the difference between Y" and
the mean number of smiles for all infants, Y" - MY, which is the adjustment
for the raw score; and the adjusted score, Y - d, where d = Y" - MY.
2. The correct values for the parameters, a, b1, and b2 should be left over from
the previous exercise. Now enter correct formulas in the three new columns.
Do the Y" scores fall on the middle sloped line in Fig. 13.10? Are the
adjusted compared to the raw scores larger for younger infants and smaller
for older infants, as they should be?
3. Finally, compute the adjusted means for the male and female infants. Do
these means make sense?
4. Your spreadsheet should contain a column for the age x sex interaction.
Enter a formula in this column that multiplies age by the coded value for sex.
This value represents the age x sex interaction and will be used in the next
exercise.

At this point your spreadsheet should look like the one shown in Fig. 13.11. It
is instructive to compare the adjusted scores, which are portrayed graphically in
Fig. 13.12, with the unadjusted scores shown in Figs. 13.9 and 13.10. In general,
scores for younger infants have increased and scores for older infants have
decreased. As noted earlier, for a given age, the amount of change is the
difference between the middle sloped line (Yi" = a + b1 agei + b2 Msex) and the
horizontal line (Yi = MY) shown in Fig. 13.10. More of the younger infants were
male, and more of the older female; thus the males' scores were more often
increased and the females' scores more often decreased.
Moreover, the mean for the males' scores was higher than the mean for the
females' to begin with, so as a result the means for the adjusted scores are even
further apart, as indicated by the horizontal lines in Fig. 13.12. Indeed, as we
know from the analysis of covariance conducted earlier (Exercise 13.6), the
difference between the means for the adjusted scores is statistically significant
(R2 change = .290, F(1,17) = 7.46, p < .05), and it is these adjusted means (males = 7.02,
females = 3.48) that we would report.
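The adjusted scores and adjusted means can likewise be computed in a few lines
of Python. This sketch (ours, assuming NumPy) implements Equations 13.7 through
13.9 directly:

    import numpy as np

    smiles = np.array([5, 7, 4, 8, 6, 3, 4, 11, 12, 4,
                       1, 1, 5, 2, 1, 5, 5, 8, 7, 6], dtype=float)
    age = np.array([5, 5, 5, 6, 9, 13, 14, 17, 17, 18,
                    4, 8, 13, 17, 18, 18, 20, 20, 21, 22], dtype=float)
    sex = np.array([0.0] * 10 + [1.0] * 10)

    # Step 2 regression: smiles on age and sex together.
    X = np.column_stack([np.ones(20), age, sex])
    a, b1, b2 = np.linalg.lstsq(X, smiles, rcond=None)[0]  # about 3.807, 0.238, -3.54

    y_avg = a + b1 * age + b2 * sex.mean()  # Y" for the average infant (Eq. 13.7)
    adjustment = y_avg - smiles.mean()      # Equation 13.8
    adjusted = smiles - adjustment          # Equation 13.9
    print(round(adjusted[sex == 0].mean(), 2),   # males: about 7.02
          round(adjusted[sex == 1].mean(), 2))   # females: about 3.48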

Homogeneity of Regression
In the course of performing the analysis of covariance just described, we made an
assumption. By imposing parallel best fit or regression lines on the two groups
(see Fig. 13.10) we assumed homogeneity of regression—that is, we assumed that
the regression lines relating age to outcome had the same slope for both groups.
Thus we proceeded to use the same age coefficient when adjusting both males'
and females' scores (the b1 in Equation 13.7; in experimental texts this b1 is often
called the average within-groups regression coefficient).
However, if the regression lines relating age to outcome were quite different
for males and females, it would not be appropriate to make a common
adjustment. In traditional analysis of covariance terms, if the slopes relating the
covariate to outcome varied across the groups analyzed, the assumption of
homogeneity of regression would not be warranted. Instead of a model (e.g.,
Equation 13.6) that includes just terms for the covariate and for group
membership, a more complex model that includes terms reflecting different
slopes across groups would be required instead.
It is an easy matter to test whether the assumption of homogeneity of
regression is warranted. First, a third variable is formed, one that represents the
interaction between the covariate and the categorical variable (this will be a set of
variables if more than two groups are under consideration). We have more to say
about interaction in the next chapter. For now, it is sufficient to know that a
variable representing the interaction between two other variables is formed by
multiplying the values for those two variables together. A covariate is
continuous; thus, one variable is always sufficient to represent it. For the present
example, the categorical variable is binary, and hence both it and the interaction
can be represented with one variable each. In fact, you have already formed
values for the interaction variable in the current spreadsheet by multiplying
values for age and the coded values for sex together. Now the question is, does
the interaction variable account for a significant increase in R2 when added to age
and sex, the two variables already in the equation?

A B C D E F G H
1 Smiles Age Sex Yadj=
2 s Y X A XxA Y" Y"-My Y-d
3 1 5 5 0 0 3.228 -2.02 7.022
4 2 7 5 0 0 3.228 -2.02 9.022
5 3 4 5 0 0 3.228 -2.02 6.022
6 4 8 6 0 0 3.466 -1.78 9.784
7 5 6 9 0 0 4.18 -1.07 7.07
8 6 3 13 0 0 5.131 -0.12 3.119
9 7 4 14 0 0 5.369 0.119 3.881
10 8 11 17 0 0 6.083 0.833 10.17
11 9 12 17 0 0 6.083 0.833 11.17
12 10 4 18 0 0 6.32 1.07 2.93
13 11 1 4 1 4 2.99 -2.26 3.26
14 12 1 8 1 8 3.942 -1.31 2.308
15 13 5 13 1 13 5.131 -0.12 5.119
16 14 2 17 1 17 6.083 0.833 1.167
17 15 1 18 1 18 6.32 1.07 -0.07
18 16 5 18 1 18 6.32 1.07 3.93
19 17 5 20 1 20 6.796 1.546 3.454
20 18 8 20 1 20 6.796 1.546 6.454
21 19 7 21 1 21 7.034 1.784 5.216
22 20 6 22 1 22 7.272 2.022 3.978
23 Sum= 105 270 10 161 105 7E-15 105
24 N= 20 20 20 20 20 20 20
25 Mean= 5.25 13.5 0.5 5.25 4E-16 5.25
26 a,b= 3.807 0.2379 -3.54
27 R= 0.583
28 Mean= 6.4 10.9 males MMadj 7.018
29 Mean= 4.1 16.1 females FMadj 3.482
FIG. 13.11. Spreadsheet for computing analysis of covariance adjusted
means for the gender smiling study. Columns I-P are not shown.
Exercise 13.8
Testing Homogeneity of Regression
This exercise uses the template developed for the last exercise to test whether
the assumption of homogeneity of regression, which is required for an analysis of
covariance, is warranted for the gender smiling study.
1. This exercise can be viewed as an extension of Exercise 13.6. For that
exercise, steps 1 and 2 consisted of adding first age and then the coded
variable for sex to the regression equation.
2. Now do step 3. Regress number of smiles on age, the coded variable for
sex, and the variable representing the age x sex interaction and make other
appropriate changes. What is the value of R2 for this equation? What is the
change in R2 from step 2 to step 3? Is this increase in R2 significant?

As you can see, for the current example the assumption of homogeneity of
regression is warranted. The increase in R2 due to the interaction variable was
not significant (R2 change = .019, F(1,16) < 1, NS), which means that the regression lines
for the two groups do not differ significantly. If the increase in R2 had been
significant, reporting adjusted means would not have been informative. In such a
case, we would have instead described the (significantly different) slopes for the
two groups separately.
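In code, the homogeneity test is one more call to the same routine: add the
age x sex product as a third predictor and test the resulting increase in R2.
A sketch (ours, continuing the earlier stepwise sketch, so smiles, age, sex,
and r_squared are assumed to be defined):

    r2_2 = r_squared(smiles, age, sex)              # step 2: about .340
    r2_3 = r_squared(smiles, age, sex, age * sex)   # step 3 adds the interaction
    F_change = (r2_3 - r2_2) / ((1 - r2_3) / (20 - 3 - 1))   # F(1,16)
    print(round(r2_3 - r2_2, 3), round(F_change, 2))         # about .019; F < 1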
The categorical variable used in the present analysis of covariance example
was sex, which is a binary variable. But even if the categorical variable were more
complex, requiring more than one coded predictor variable, the analysis still
proceeds hierarchically. In general, if a categorical variable comprises G levels,
then G - 1 predictor variables are required to represent it, as discussed in the last
chapter. In order to test the significance of any categorical variable, the covariate

FIG. 13.12. Graphic presentation of adjusted scores for the gender smiling
study. Triangles represent adjusted number of smiles for males, circles, the
adjusted number for females. The top horizontal line represents the mean of
the adjusted number of smiles for males (Y = 7.02), the bottom horizontal
line, the mean of the adjusted number of smiles for females (Y= 3.48).
would be added to the regression equation at step 1 and the set of predictor
variables representing the categorical variable, however many are required,
would be entered at step 2. Similarly, G - 1 predictor variables would be required
to represent the interaction. They would be formed by multiplying each predictor
variable required for the coded variable by age. These predictor variables would
then be entered at step 3, as a set.
If the increase in R2 for step 3, however many interaction variables are
involved, is not significant, then you can assume that the assumption of
homogeneity of regression is warranted. Then, if the increase in R2 for step 2 is
significant, you would claim a significant group effect, compute and report
adjusted means, and explicate the pattern of group differences with a Tukey post
hoc test based on the adjusted means.
However, if the increase in R2 for step 3 were significant, a more complex
explication that takes the interaction into account is called for. In such cases,
more knowledgeable colleagues should be consulted.

Exercise 13.9
ANCOVA in SPSS
In this exercise you will learn how to conduct an analysis of covariance in SPSS.
1. Create a new SPSS data file from the spreadsheet you created in Exercise
13.6.
2. Conduct a hierarchical regression as you did in Exercise 11.7. Enter smiles
as the dependent variable and age in the first block and sex in the second
block. Make sure you select R squared change statistics as an option. The
model summary output should correspond to the values found in Fig. 13.8.
3. Test the homogeneity of regression assumption by running a second
analysis and entering age, sex, and the age by sex interaction term in the
third block.
4. You could also conduct an ANCOVA using the General Linear Model (GLM)
procedure. To do this select Analyze->General Linear Model->Univariate
from the main menu. Move smiles to the Dependent Variable window, sex
to the Fixed Factor(s) window, and age to the Covariate(s) window.
5. Click on Model and select Type I from the Sums of Squares pull down
menu. Click on Continue. Click on Options and move the sex variable to the
Display Means for window. Click on Continue and then OK.
6. Examine the output. The statistics for age, sex, error, and total should be
identical to the regression you ran in step 2.
7. To test the homogeneity of regression assumption, you must create a custom
model. Rerun the GLM procedure, but this time click on Model and check the
Custom button under Specify Model. Click on the age variable in the
Factor(s) and Covariate(s) window to highlight it, then click on the arrow
button to move the variable to the Model window. Do the same with sex. Then
click on both age and sex while holding down the Ctrl key, so that both are
highlighted, and then click on the arrow button. This moves the age * sex
interaction term to the Model window.
8. Examine the Test of Between Subjects Effects box in the output. Check
the F value and significance of the interaction term. Do these statistics agree
with the values you calculated in part 3?
9. Examine the Estimated Marginal Means box in the output from the model
that does not include the interaction. SPSS automatically calculates the
adjusted group means. Do they agree with the values you calculated in
Exercise 13.7?

Single- and Multifactor Studies


This has been an important skill-building chapter, filled with "And after the
initial analyses, what then?" kinds of considerations. It is also something of a
bridge chapter, standing midway between discussion of single-factor (chap. 12)
and multiple-factor (chap. 14) between-subjects studies. The chapter began with
a discussion of planned comparisons, which probably seemed simply a
straightforward application of hierarchic multiple regression. But as you will see
in the next and subsequent chapters, the ideas used in an analysis of planned
comparisons apply to analyses of multifactor studies as well.
Significance testing, which is usually embodied by big Fs, is greatly
emphasized in our present statistical tradition. And although it is important to
determine whether apparently impressive (or even puny) results might be just
chance happenings, it is equally important to describe results once significance
tests have given us license to do so. Post hoc tests are especially important in this
regard. Whenever more than two groups are included in a one-way analysis of
variance, for example, a single significant omnibus F is insufficient to tell us
exactly which groups differ from which others.
Analogously, in the presence of a significant analysis of covariance result, the
raw score means do not provide an accurate picture of how the groups vary once
scores have been adjusted for the covariate. Only the adjusted means do that.
You now know how to compute adjusted means, which ordinarily is regarded as
an advanced topic but is rendered quite simple, even elementary, by the
integrated multiple-regression approach of this book. You also know how to
perform one widely used post hoc test, the Tukey, and you should be comfortable
interpreting post hoc test results. This is an important skill. As you will see in
the next and subsequent chapters, post hoc analysis applies not just to the single-
factor studies discussed so far but to all of the other analysis of variance designs
you will encounter.
14 Studies With Multiple
Between-Subjects Factors

In this chapter you will:


1. Learn what between-subjects factorial studies are and what advantages
they offer.
2. Learn how to construct predictor variables for analyzing data from
between-subjects factorial studies.
3. Learn how to compute the degrees of freedom associated with the main
effects and interactions of these studies.
4. Learn how to determine whether the main effects and interactions (i.e.,
conditional relationships) of these studies are statistically significant and
how to characterize the magnitude of such effects.
5. Learn how to interpret any statistically significant main effects and
interactions.

In chapter 10 you learned how to decide if two groups differ significantly and in
chapter 12 you learned how to perform an analysis of variance with more than
two groups. As you now know, no matter the number of groups, you need only
determine whether a single between-subjects factor significantly increases
predictability. The single factor in such studies indicates group membership—for
example, whether subjects were assigned to an experimental or a control group,
or to which of four different want/have children status groups subjects belonged.
Subjects in these studies are assigned or belong to one, and only one, of the
groups; that is, the groups in the studies are formed independently. Such studies
can be visualized as a single row of cells, with each cell representing a group and
containing the subjects belonging to that group (see Fig. 14.1).
When subjects belong to (or are assigned to) two or more groups (a single-
factor study), the usual research question is, are the groups different in some
way? With respect to some measure of interest, are men different from women,
or are people who received one treatment different from those who received
another treatment (or treatments)? In statistical terms, are the criterion score
means for the various groups so discrepant from one another that it is unlikely
that the subjects were sampled from a population in which there is no association
between group membership and the criterion variable? As you know, the usual

analytic technique for detecting such group differences is called a one-way
analysis of variance. In chapter 12 you learned how to conduct a one-way
ANOVA using multiple regression and coded predictor variables, and in the last
chapter you learned how to describe significant results.
Researchers' interests, however, are rarely limited to just a single
independent factor or variable. In this chapter we discuss a common and more
general situation, one for which a two-way, three-way, or even higher way
analysis of variance is appropriate. This situation can be characterized as follows:

1. More than one research factor is of interest (and for the time being we
assume that all factors define groups, i.e., are categorical).
2. The factors operate between subjects (i.e., subjects serve in and hence
contribute criterion scores to only one group, not repeatedly to more
than one group).
3. The factors are completely crossed (i.e., each level of a factor is
represented at all levels of the other factors).

14.1 BETWEEN-SUBJECTS FACTORIAL STUDIES

Some writers use the term factorial broadly for any study involving more than
one factor. Others use the term more narrowly just for studies with completely
crossed factors. (Some alternatives to completely crossed factors are named at
the end of the next section.) Usage in this book adheres to the narrow tradition:
If a study is called factorial, its factors are understood to be completely crossed
(see Fig. 14.2). In this chapter those factors are understood to be between
subjects as well, although in the next chapter the discussion will be extended to
studies involving repeated factors (i.e., factors that operate within subjects).
Factorial studies, in the sense just defined, are deservedly one of the most
common arrangements used in behavioral science research. Their hallmark is a
factorial or crossed arrangement of the treatments (the levels of the independent
variables), which means that groups of subjects receiving every possible
combination of the levels of the independent variables are included in the study.
For example, if factor A were sex (with two levels, male and female) and factor B
were instruction set (again two levels, set I and set II), then the corresponding
2 x 2 factorial study would include four groups: males exposed to set I, males to
set II, females to set I, and females to set II. More generally, if factor A includes a
levels and factor B includes b levels, then the study includes a times b groups.
As is conventional, we indicate the first categorical between-subjects variable
with A, the second with B, the third with C, and so forth. Then the first level for
factor A would be A1, the second level would be A2, and so forth, whereas the first
level for factor B would be B1, the second level would be B2, and so forth. If there
were only two factors, and if each factor had two levels each (as in the
male/female, instruction set I/set II example just given), then the four groups for
that study (male and I, male and II, female and I, female and II) would be

FIG. 14.1. Schematic for a single-factor between-subjects study. The factor
is symbolized A and there are G levels of the factor, hence groups are
labeled A1 through AG. Each of the N subjects would be assigned to one,
and only one, of the groups.
designated A1B1, A1B2, A2B1, and A2B2 (see top, Fig. 14.2). Similarly, if there were
two factors, but if factor A had three levels and factor B had four, then the study
would include 12 groups (see bottom, Fig. 14.2).
As you can see, the number of groups included in a factorial study is
determined by multiplying the number of levels for each factor together. We just
saw that 2 x 2 = 4 groups, 2 x 3 = 6 groups, and 3 x 4 = 12 groups. Additional
examples are:

1. If factor A, B, and C are each represented with 2 levels, then the complete
2 x 2 x 2 study includes 8 groups.
2. If factors A, B, and C contain 3, 4, and 5 levels respectively, then the
complete 3 x 4 x 5 study includes 60 groups.

In some circumstances not all groups might be necessary to answer the
research question of interest. Moreover, not all research situations match the
factorial study described here. Readers should be aware that, in addition to a
factorial arrangement, there are other ways to combine two or more factors in a
study. Designs for such studies (e.g., nested or hierarchical designs, Latin square
designs, and other kinds of incomplete designs) are regarded as an advanced
topic in statistical analysis, beyond the scope of this book, but interested readers
should consult authorities like Hays (1981), Keppel (1982), Kirk (1982), and
Winer (1971).

FIG. 14.2. Schematics for three different two-factor between-subjects
studies: a 2 x 2 (top), a 2 x 3 (middle), and a 3 x 4 (bottom) two-factor
factorial study.
Advantages of Factorial Studies
There are two major reasons why factorial studies are so popular. First, the effect
of more than one variable can be investigated in an economical way that requires
no more data. For example, imagine that the sex (male/female) by instruction
(set I/set II) 2 x 2 factorial design mentioned a few paragraphs ago were used for
the button-pushing study described in chapter 11. Ignoring instruction set, we
could ask whether men and women differed in how often they pushed the button.
In addition, this time ignoring sex, we could ask whether the instructions given
affected how often subjects pushed the button. In analysis of variance terms, we
are asking whether there is a main effect for sex and also whether there is a main
effect for instruction set.
As appealing as this first advantage is—allowing us, in effect, to address two
questions for the price of one—the second advantage of factorial studies may be
the more interesting. Factorial studies allow us to test for conditional relations
or what in analysis of variance terms are called statistical interactions. It may
be, for example, that the effect of instruction set is conditional on (i.e., depends
on) the sex of the subject. Perhaps instruction set only affects men, not women
(or vice versa). Factorial studies let us test for such interesting possibilities in a
simple and straightforward way.

Coding Predictor Variables for Factorial Studies


Between-subjects single-factor and multifactor factorial studies are alike in that
both consist of a number of cells or groups to which subjects are assigned. And
for both, the number of predictor variables required to code for group
membership is G - 1, one less than the total number of groups. In chapter 12 we
learned how to use contrast (and dummy) coding to create predictor variables for
single-factor studies. We also learned that the G - 1 predictors could be coded in
a number of different ways (as long as they satisfied certain rules), that the
particular set selected did not affect the overall variance accounted for, but other
things being equal it makes sense to select a set that reflects the research
questions and the design of the study.
In this chapter we learn how to form sets of predictor variables appropriate
for factorial studies. First we define predictor variables that code for factor A.
This is done exactly as it was done in chapter 12 for single-factor studies. Two
groups would require one predictor variable, three groups would require two
predictor variables, and so forth. Moreover, the subset of predictor variables that
code for factor A should follow the two rules defined in chapter 12. That is:

1. The codes selected for each contrast (i.e., each predictor variable) must
sum to zero.
2. The cross products for all possible pairs of contrasts must likewise sum to
zero.

Next the predictor variables that code for factor B are formed in exactly the
same way, just as though they also derived from a single-factor study. The same
is true for any other factors. In other words, each factor is associated with a
subset of predictor variables. For each factor, the number of predictor variables
is one less than the number of levels for that factor.
Finally, predictor variables that code for interactions are formed.
Interactions exhaustively (i.e., completely) combine the factors. For example, a
two-factor factorial would include a main effect for A, a main effect for B, and an
AB interaction, whereas a three-factor factorial would include main effects for A,
B, and C, two-way (or first-order) interactions for AB, AC, and BC, and a three-
way (or second-order) interaction for ABC. Main effects and interactions for
two-, three-, and four-factor factorial studies are shown in Fig. 14.3. At this
point, the reader should grasp the logic of factorial designs and should
understand how to list all the higher order interaction terms for any factorial
design. Exactly what those interactions mean in substantive terms will become
clear later.
As noted a few paragraphs earlier, coded predictor variables for main effects
are formed as though each factor were the single factor in a single-factor study.
Consider a simple 2 x 2 factorial (two factors, each represented with two levels).
Such a study consists of four groups, labeled as follows (see Fig. 14.2):

A1B1
A1B2
A2B1
A2B2
There are four groups, so three predictor variables are required. The first codes
for factor A: Using contrast codes, subjects in the first two or A1 groups could be
coded -1 and subjects in the last two or A2 groups could be coded +1 (see Fig.
14.4). The second predictor variable codes for factor B: Subjects in the first and
third or B1 groups could be coded -1 and subjects in the second and fourth or B2
groups could be coded +1.
The third predictor variable codes for the AB interaction. The code for the
subjects in each group would be formed by multiplying the A and B codes for that
group. Thus the codes for the first, second, third, and fourth groups would be +1,
-1, -1, and +1, respectively (again, see Fig. 14.4).

Effects                 Number of Factors
                        2        3             4
Main effects:           A, B     A, B, C       A, B, C, D
First-order
interactions:           AB       AB, AC, BC    AB, AC, AD, BC, BD, CD
Second-order
interactions:                    ABC           ABC, ABD, ACD, BCD
Third-order
interactions:                                  ABCD
FIG. 14.3. Main effects and interactions for factorial studies with two, three,
and four factors.

FIG. 14.4. Contrast codes for a 2 x 2 factorial study.

When codes are formed in this
way, the entire set of three predictor variables will obey the two formation rules
described in chapter 12—that is, the codes selected for each predictor variable will
sum to zero and the cross products for all possible pairs of contrasts will also sum
to zero.
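These two rules are easy to verify mechanically. The short Python sketch below
(ours, assuming NumPy) builds the Fig. 14.4 codes and asserts both rules:

    import numpy as np
    from itertools import combinations

    A  = np.array([-1, -1, +1, +1])   # groups A1B1, A1B2, A2B1, A2B2
    B  = np.array([-1, +1, -1, +1])
    AB = A * B                        # interaction codes: +1, -1, -1, +1

    assert all(v.sum() == 0 for v in (A, B, AB))            # rule 1
    assert all((v * w).sum() == 0
               for v, w in combinations((A, B, AB), 2))     # rule 2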
Once the predictor variables are formed, analysis of the 2 x 2 factorial is
straightforward. Three steps are required. Each step adds a predictor variable,
first the one that codes for the A main effect, then the B main effect, and finally
the AB interaction. The significance for the A main effect, for the B main effect,
and for the AB interaction is the significance of the increase in R2 associated with
step 1, 2, and 3, respectively. Thus analysis of a 2 x 2 factorial is identical with a
planned comparison analysis of a study involving four groups. For any planned
comparison analysis, selection of the contrasts used requires some thought and
justification. In the case of a 2 x 2 factorial (or any other factorial) the contrasts
to be used are determined by the factorial design. Again as for any planned
comparison analysis, interest lies with the increases in R2 between successive
steps and so it makes no sense to regress the dependent measure on all three
predictor variables at once in a single step as for an omnibus F test.
As a second example, consider a 2 x 3 factorial. The six groups are shown in
Fig. 14.5, and again contrast codes are used. For factor A (predictor variable X1),
subjects in the first three or A1 groups are coded -1 and subjects in the last three
groups or A2 groups are coded +1. Factor B has three levels and hence two
predictor variables are required (X2 and X3). The first contrasts group 1 with
groups 2 and 3, and the second contrasts group 2 with group 3. For predictor
variable X2, subjects in the first and fourth or B1 groups are coded -2 and all
other subjects are coded +1. For predictor variable X3, subjects in the second and
fifth or B2 groups are coded -1 and subjects in the third and sixth or B3 groups
are coded +1.
Factor A is represented with one and factor B with two predictor variables,
thus the AB interaction requires two predictor variables as well (one times two).
Predictor variable X4 is formed by multiplying values for X1 and X2 together, and

FIG. 14.5. Contrast codes for a 2 x 3 factorial study.


predictor variable X5 is formed by multiplying values for X1 and X3. As an
exercise you should verify that the X4 and X5 products shown in Fig. 14.5 are
correct.
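A quick way to do that verification is in code. This sketch (ours) rebuilds the
codes just described and checks both formation rules:

    import numpy as np
    from itertools import combinations

    X1 = np.array([-1, -1, -1, +1, +1, +1])   # A1 vs. A2
    X2 = np.array([-2, +1, +1, -2, +1, +1])   # B1 vs. B2 and B3
    X3 = np.array([ 0, -1, +1,  0, -1, +1])   # B2 vs. B3
    X4, X5 = X1 * X2, X1 * X3                 # the two AB interaction codes

    codes = (X1, X2, X3, X4, X5)
    assert all(v.sum() == 0 for v in codes)                 # rule 1
    assert all((v * w).sum() == 0
               for v, w in combinations(codes, 2))          # rule 2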
Analysis of this 2 x 3 factorial, like all two-way analyses, requires three steps.
In step 1, the first coded predictor variable is entered. Factor A has only two
levels, so only one predictor variable is required to represent it. In step 2, the two
predictor variables required to code for the three levels of factor B are added.
Thus step 2 constitutes what is in effect an omnibus F test for factor B. Finally,
the two predictor variables representing the AB interaction are added. Again, the
significance for the increase in R2 for steps 1, 2, and 3 is the significance for the A
main effect, the B main effect, and the AB interaction respectively.
Assuming the predictor variables representing factor A follow contrast code
rules, and those for factor B do too, and the codes for the interaction are formed
by multiplication, then the entire set of five predictor variables will also fulfill
contrast code requirements, as you can easily verify. Analysis of the factorial
design requires a hierarchical procedure, just like a planned comparison analysis,
although more than one predictor variable may be added on some steps.
Specifically, if a main effect or any component factor of an interaction involves
more than two groups, then more than one predictor variable will be required to
represent it.
As a third example, consider a 2 x 2 x 2 factorial. The eight groups are as
shown in Fig. 14.6. The first predictor variable codes for factor A, the second for
factor B, and the third for factor C. The codes for the AB interaction are formed
by multiplying the A and B codes. Similarly, multiplying the A and C codes yields
the codes for the AC interaction and multiplying the B and C codes gives the
codes for the BC interaction. Finally, multiplying A, B, and C codes together gives
the codes for the ABC interaction. As you can see, a three-way factorial analysis
of variance considers seven effects, hence seven steps are required. In this case,
because the A, the B, and the C factors all comprise only two levels, each of the
seven steps is represented by only one predictor variable.

Exercise 14.1
Coding Predictor Variables for Factorial Studies I
This exercise provides preliminary practice with coding predictor variables for
factorial studies.
1. Verify that the values for the coded variables for the first- and second-order
interactions (AB, AC, BC, and ABC) given in Fig. 14.6 are correct.
2. Verify that the values for the coded predictor variables given in Figs. 14.4
and 14.5 obey the two formation rules for such variables (group contrast
codes and all possible pairs of their cross products sum to zero).

One final example should help clarify further how contrast codes are formed
for factorial studies. Two of the examples just presented, the 2 x 2 and the 2 x 2 x
2 factorial, were relatively simple. For both, all factors were represented with
only two levels; therefore only one predictor variable was required for each main
effect and interaction. When factors are represented by more than two levels,
however, additional predictor variables are required.
Specifically, as we already know from the one-way case, if factor A consists of
a levels, then a - 1 predictor variables are required to code the A main effect.
Similarly, if factor B consists of b levels, then b - 1 predictor variables are

FIG. 14.6. Contrast codes for a 2 x 2 x 2 factorial study.

required for the B main effect, and so forth. That is why two predictor variables
were required to represent the B factor for the 2 x 3 example presented in Fig.
14.5. Furthermore, because codes for interaction terms are formed by
multiplying the values for the predictor variables representing the constituent
main effect codes together, the number of predictor variables required to
represent the AB interaction is a - 1 times b - 1. For example, in Fig. 14.4, one
predictor variable was required to represent the AB interaction for the 2 x 2
example, and in Fig. 14.5 two predictor variables were required for the AB
interaction for the 2 x 3 example.
As a further and more complex example, consider a 3 x 4 factorial study.
Two predictor variables are required for the A main effect and three for the B
main effect. This means that six predictor variables are required to code the AB
interaction:

(a - 1)(b - 1) = 2 x 3 = 6
Thus 11 predictor variables in all are needed for this 12 group study. Predictor
variables X1 and X2 would code factor A, predictor variables X3, X4 and X5 would
code factor B, and predictor variables X6 through X11 would code for the AB
interaction (see Fig. 14.7). Specifically, variables X6 through X11 would be formed
as follows:

X6 = X1 x X3, X7 = X1 x X4, X8 = X1 x X5,
X9 = X2 x X3, X10 = X2 x X4, X11 = X2 x X5.

Again, as an exercise you should verify that the values given in Fig. 14.7 are
correct.
This example, with its 12 groups and 11 predictor variables, may seem
somewhat cumbersome. But imagine, for example, a 3 x 4 x 5 analysis of
variance, which would have 60 different groups and hence 59 predictor variables!
True, studies with so many groups are quite rare, but in any case, once the codes
for the predictor variables representing the main effects have been determined,
the codes for the predictor variables associated with the different interactions can
be generated easily using a spreadsheet, which is exactly how codes for predictor
variables X6 through X11 shown in Fig. 14.7 were computed.
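In a general-purpose language the same bookkeeping looks like this. The sketch
below (ours; the particular main-effect codes are hypothetical, and any set
obeying the two rules would do) generates all 11 predictor variables for a
3 x 4 factorial and checks the rules:

    import numpy as np

    a_codes = {1: (-1, -1), 2: (0, 2), 3: (1, -1)}        # X1, X2 for factor A
    b_codes = {1: (3, 0, 0), 2: (-1, 2, 0),               # X3, X4, X5 for factor B
               3: (-1, -1, 1), 4: (-1, -1, -1)}

    rows = []
    for i in (1, 2, 3):            # level of A
        for j in (1, 2, 3, 4):     # level of B
            inter = [xa * xb for xa in a_codes[i] for xb in b_codes[j]]  # X6-X11
            rows.append(list(a_codes[i]) + list(b_codes[j]) + inter)

    codes = np.array(rows)         # 12 groups x 11 predictor variables
    assert (codes.sum(axis=0) == 0).all()                   # rule 1, every column
    gram = codes.T @ codes
    assert (gram[~np.eye(11, dtype=bool)] == 0).all()       # rule 2, every pair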

FIG. 14.7. Possible contrast codes for a 3 x 4 factorial study. For this
example, factor A comprises three and factor B four levels. Therefore factor
A is represented with two, factor B with three, and the AB interaction with six
predictor variables.

Exercise 14.2
Coding Predictor Variables for Factorial Studies II
This exercise provides additional practice with coding predictor variables for
factorial studies.
1. Think of a plausible research application for a 3 x 4 factorial design. What
are the research factors? What are the levels for those factors? How do you
plan to code the factors and what is your rationale for the coding you
selected? (Your coding should be different from that used for Fig. 14.7.)
Following the format used in Fig. 14.7, indicate the groups included in the
study and show how the predictor variables would be coded for each group.
Verify that these predictor variables obey the two formation rules for contrast
codes.

Degrees of Freedom for Factorial Studies


In this chapter the discussion of one-way or single-factor between-subjects
studies begun in chapter 12 has been extended and generalized. You have
learned how to generate the higher order interaction terms implied by
multifactor studies, and you have learned how to code the predictor variables
associated with the various main effects and interactions of such studies. In this
subsection, you will learn how to determine degrees of freedom for the various
effects—the main effects and interactions—of factorial studies. In order to
perform tests of significance—the topic of the next section—you will need to know
the correct degrees of freedom for these effects.
Recall from chapter 10 that, given N scores derived from N subjects, the
degrees of freedom associated with the total sum of squares was N - 1. Recall
further that the total degrees of freedom, like the total sum of squares, can be
partitioned into two parts: the portion due to the model and the portion
remaining, or residual, due to error. In other words,

dftotal = dfmodel + dferror.
Finally, recall that the degrees of freedom for the model were simply the number
of predictor variables included in the model whereas the degrees of freedom for
error were, as the term residual implies, the degrees of freedom left over or
remaining. Thus, for a factorial study that includes a total of G groups, the
following is true:

dftotal = N - 1 and dfmodel = G - 1.

The degrees of freedom for the model is G - 1 because, given G groups, G - 1
predictor variables are required to code group membership. The degrees of
freedom for error, then, can be determined by simple algebraic manipulation:

dferror = dftotal - dfmodel = (N - 1) - (G - 1) = N - G.

This agrees with Equation 14.4.


Just as the total degrees of freedom can be partitioned into model and error
components, so too the degrees of freedom due to the model can be further
subdivided. A consideration of Fig. 14.7 and material presented earlier suggests
some general principles. Let a symbolize the number of levels of A, b the number
of levels of B, and so forth. Then the degrees of freedom associated with the A
main effect are a - 1 (because a - 1 predictor variables are required to code for
factor A) and b - 1 degrees of freedom are associated with the B main effect.
Then, because a - 1 times b - 1 predictor variables are required to code the
AB interaction, the degrees of freedom associated with the AB interaction are a -
1 times b - 1, the degrees of freedom for A times the degrees of freedom for B.
For the example portrayed in Fig. 14.7:

dfA = a - 1 = 2, dfB = b - 1 = 3, and dfAB = 2 x 3 = 6.

In general, the degrees of freedom for any interaction will be the product of the
degrees of freedom of its constituents. For example, if dfA = 2, dfB = 3, and
dfC = 4, then

dfABC = 2 x 3 x 4 = 24.
Generalized degree of freedom computations for any two- or three-way factorial
study are given in Figs. 14.8 and 14.9.
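The same computations can be expressed as a short function. This sketch (ours)
returns the degrees of freedom for every main effect and interaction of a
between-subjects factorial, plus the error and total terms:

    from itertools import combinations
    from math import prod

    def factorial_dfs(levels, N):
        # levels: number of levels per factor, e.g., (3, 4); N: total subjects.
        k = len(levels)
        names = [chr(ord('A') + i) for i in range(k)]
        dfs = {}
        for r in range(1, k + 1):
            for combo in combinations(range(k), r):
                effect = ''.join(names[i] for i in combo)
                dfs[effect] = prod(levels[i] - 1 for i in combo)
        dfs['S/' + ''.join(names)] = N - prod(levels)   # error: subjects within groups
        dfs['TOTAL'] = N - 1
        return dfs

    print(factorial_dfs((3, 4), 108))
    # {'A': 2, 'B': 3, 'AB': 6, 'S/AB': 96, 'TOTAL': 107}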
As noted earlier, the total degrees of freedom between N subjects is N - 1.
(Recall from chapter 10 that one degree of freedom is lost when scores are
Source                        Degrees of freedom
A main effect                 a - 1
B main effect                 b - 1
AB interaction                (a - 1)(b - 1)
S/AB, subjects within AB      N - ab
TOTAL between subjects        N - 1
FIG. 14.8. Degrees of freedom for a two-way factorial study. The number of
levels for factors A and B is symbolized with a and b respectively and the
number of subjects is symbolized with N. For other comments, see text.

constrained by the overall or grand mean.) These N - 1 degrees of freedom
between subjects can be divided into two parts: those concerned with between-
group variability and those concerned with how subjects vary within groups:

N - 1 = (G - 1) + (N - G).

The between-groups degrees of freedom are associated with the main effects and
interactions of the factorial study and add up to G - 1, the number of predictor
variables. (G = ab for a two-way study, abc for a three-way study, etc.) The
remaining N - G degrees of freedom are associated with how subjects differ
within groups. For that reason, the residual or error terms in Figs. 14.8 and 14.9
are symbolized S/group (the virgule or slash is read as within). S/group (read,
subjects within groups) indicates subjects within the groups or cells defined by
the factorial study—for example, S/AB and S/ABC for a two- and three-way
factorial study, respectively.

Exercise 14.3
Degrees of Freedom for Factorial Studies
This exercise provides practice with computing degrees of freedom for between-
subjects factorial studies.
1. Create a table, modeled after the ones shown in Figs. 14.8 and 14.9, but for
a four-way factorial study.
2. Based on this table, and assuming that a = 3, b = 2, c = 4, d = 2, and N =
240, compute the degrees of freedom for each main effect and interaction.
3. Compute degrees of freedom for the 3 x 4 factorial study you described in
the last exercise. Assume that N = 108. Organize your results in a table like
that shown in Fig. 14.8, but instead of the symbols give the computed
degrees of freedom. Label the effects with the names you supplied for the
last exercise. Verify that the degrees of freedom for the main effects and
interactions sum to G - 1 and that all degrees of freedom sum to N - 1.

14.2 SIGNIFICANCE TESTING FOR MAIN EFFECTS AND INTERACTIONS

The statistical significance of the main effects and interactions appearing in
factorial studies can be tested using the hierarchical techniques first presented in
chapter 11 and elaborated in chapter 13. Consider first a simple 2 x 2 study. Any
2 x 2 study, like the button-pushing study we have been using as an example,
Source                          Degrees of freedom
A main effect                   a - 1
B main effect                   b - 1
C main effect                   c - 1
AB interaction                  (a - 1)(b - 1)
AC interaction                  (a - 1)(c - 1)
BC interaction                  (b - 1)(c - 1)
ABC interaction                 (a - 1)(b - 1)(c - 1)
S/ABC, subjects within ABC      N - abc
TOTAL between subjects          N - 1
FIG. 14.9. Degrees of freedom for a three-way factorial study. The number
of levels for factors A, B, and C is symbolized with a, b, and c, respectively
and the number of subjects is symbolized with N. For other comments, see
text.

comprises four groups and for that reason will require three predictor variables.
Two sets of predictor variables, each representing a different way to contrast the
four groups for a planned comparison analysis, were presented in chapter 13. In
Fig. 14.4, we presented a third set of predictor variables, one appropriate for a
2 x 2 factorial study. In this case, the three contrast codes represent an A main
effect, a B main effect, and an AB interaction.
There is little new here. In fact, as noted earlier a multifactorial study can be
viewed simply as a special case of a single-factor planned-comparison study. For
both, G - 1 coded predictor variables are required. The planned-comparison
study allows some latitude as to which contrasts are selected, whereas the
multifactorial study essentially dictates how contrasts are formed, but that is the
only difference. For single-factor planned comparison analyses and for analyses
of multifactorial studies, procedures for significance testing are the same. The
predictor variables are arranged in a hierarchy of steps; a new predictor variable
(or set of variables) is added at each step; and the increase in variance accounted
for at each step is tested for significance. A two-way factorial analysis of variance
requires three steps. The predictor variable or variables that represent factor A
are entered on the first step, those for factor B on the second step, and the
variable or variables representing the AB interaction on the third step. The next
exercise demonstrates how data would be analyzed for a simple 2 x 2 factorial.

Exercise 14.4
Analysis of a 2 x 2 Factorial Study
The template that results from this exercise allows you to analyze data from a
2 x 2 factorial study. The data are from the button-pushing study but for this
exercise you assume that the four groups are formed by crossing subject's
gender and instruction set. The resulting analysis tells you whether any of the
effects—the gender main effect, the instruction main effect, or the gender x
instruction interaction—are statistically significant.
1. For this exercise, you will modify the spreadsheet shown in Fig. 12.8.
Assume that the data shown there resulted from a 2 x 2 factorial study. Let
factor A be gender of subject (male or female) and factor B the instructions
given (set I or set II). Thus the resulting four groups could be symbolized M-
I, M-II, F-I, and F-II. Assume that subjects 1-4 are in the M-I, subjects 5-8
in the M-II, 9-12 in the F-I, and 13-16 in the F-II group. Label columns
appropriately.
2. Enter contrast codes for gender (-1 = male, +1 = female), instruction set (-1
= set I, +1 = set II), and their interaction (gender x instruction) in the
appropriate columns.
3. Do step 1 (the A main effect) of the hierarchic analysis. That is, regress
number of button pushes on the coded predictor variable representing
gender. Determine the values for the parameters a and b1, and use them to
compute predicted scores. What are the values for the predicted scores?
Why do they have these values? What are the values of R2 and F for step
1?
4. Do step 2 (adding the B main effect), that is, regress number of button
pushes on the predictor variables representing both gender and instruction.
Correct the prediction equation so that it now takes both the A and B main
effects into account. Now what are the values for the predicted scores?
Again, why do they have these values? What are the values of R2 and F for
step 2?
5. Do step 3 (adding the AB interaction), that is, regress the number of buttons
pushed on predictor variables representing the gender and instruction main
effects and the gender x instruction interaction. Again, correct the prediction
equation so that it now takes the A and B main effects and the AB interaction
into account. Now what are the values for the predicted scores and why do
they assume these values? What are the values of R2 and F for step 3?
6. Summarize the results of all three steps of this hierarchic analysis in a table,
organized like that shown in Fig. 13.1, showing R2 change and its significance
for each step.

The spreadsheet resulting from step 3 of the last exercise is given in Fig.
14.10. Spreadsheets for steps 1 and 2 differ from this one in the proportion of
variance accounted for and in the values for the predicted values (ask yourself
why the predicted values at each step make sense). This analysis of a 2 x 2
factorial study and the planned comparison analysis pursued in the last chapter
are comparable in a number of ways. For both, predictor variables are added to
the regression equation step by step, increases in R2 are computed, and the
increases are tested for statistical significance using the error term associated
with the last step (the final model).
Only the predictor variables—how they are formed and what they mean—
differentiate the two analyses. For the 2 x 2 factorial analysis, the three contrasts
represent the A main effect, the B main effect, and the AB interaction, and the
significance of each is determined by whether its associated R2 change is
significantly different from zero. Note that an omnibus test, one testing the
significance of all three predictor variables entered in one step, is not performed.
The questions of interest for this 2 x 2 factorial are embodied in the three
separate predictor variables.

14.3 INTERPRETING SIGNIFICANT MAIN EFFECTS AND INTERACTIONS

The results of the two-way ANOVA you just performed reveal a significant main
effect for subject's gender (F(1,12) = 7.51, p < .05) and a significant gender x
instruction interaction (F(1,12) = 14.71, p < .01). (The critical value for F(1,12).05
= 4.75 and for F(1,12).01 = 9.33.) The mean number of button pushes for all
subjects was 86; means for men and women separately were 96 and 76, whereas
means for instruction sets I and II were 89 and 83. The analysis of variance
results indicate that the means for men and women were significantly different
but the means for instruction were not significantly different.
The main effect for subject's gender, however, is qualified by an interaction
with instruction and, as a general rule, such main effects should not be
emphasized, or even discussed, until the qualifying interaction is understood.

A B C D E F G H I
1 #BPS Sex Inst AB y= m= e=
2 s Y X1 X2 X3 Y' Y-My Y'-My Y-Y'
3 1 102 -1 -1 1 113 16 27 -11
4 2 125 -1 -1 1 113 39 27 12
5 3 95 -1 -1 1 113 9 27 -18
6 4 130 -1 -1 1 113 44 27 17
7 5 79 -1 1 -1 79 -7 -7 0
8 6 93 -1 1 -1 79 7 -7 14
9 7 75 -1 1 -1 79 -11 -7 -4
10 8 69 -1 1 -1 79 -17 -7 -10
11 9 43 1 -1 -1 65 -43 -21 -22
12 10 82 1 -1 -1 65 -4 -21 17
13 11 69 1 -1 -1 65 -17 -21 4
14 12 66 1 -1 -1 65 -20 -21 1
15 13 101 1 1 1 87 15 1 14
16 14 94 1 1 1 87 8 1 7
17 15 84 1 1 1 87 -2 1 -3
18 16 69 1 1 1 87 -17 1 -18
19 Sum= 1376 0 0 0 1376 0 0 0
20 N= 16 16 16 16 N= 16 16 16
21 Mean= 86 0 0 0 VAR= 464.9 305 159.9
22 a,b= 86 -10 -3 14 SD= 21.56
23 R= 0.81 R2= 0.656

J K L M
1 sstot ssmod sserr
2 y*y m*m e*e
19 SS= 7438 4880 2558
20 df= 15 3 12
21 MS= 495.9 1627 213.2
22 SD'= 22.27 14.6
23 R2= 0.57 F= 7.631
FIG. 14.10. Spreadsheet for determining the effect of gender, instruction,
and their interaction on number of button pushes after step 3. Rows 3-18 for
columns J-M are not shown.
After all, a significant interaction indicates that the means for the groups formed
by crossing the constituent factors differ among themselves in ways that cannot
be described by invoking the significant main effects alone. An interaction
signals a conditional relation, which signifies that means vary among themselves
with respect to one factor in ways that depend on the level of the other factor.
For example, instead of women scoring higher than men in all circumstances (a
main effect), they might score higher only in some circumstances (a gender x
circumstance interaction).
The problem, then, is to understand exactly how the groups identified by the
significant interaction—the four groups formed by crossing gender and
instruction set—differed. This is not an entirely new problem for us. Whether
groups are identified by a significant omnibus F test in a one-way analysis of
variance, as in chapter 12, or by a significant interaction, as in the last exercise,
differences among them can be described using the Tukey post hoc test detailed
in chapter 13.
In fact, because the data used to demonstrate the Tukey post hoc test were
used in the last exercise as well, the computations needed to understand the
significant gender x instruction interaction detected in the last exercise have
already been done. It is only a matter of relabeling the want/has children groups
used in chapter 13 (see Fig. 13.5) to conform with the gender by instruction
design used in this chapter. Group means, labeled for the present example and
subscripted to indicate Tukey post hoc results, are given in Fig. 14.11.
These results could be interpreted as follows. Apparently men were
particularly responsive to instruction set I. The instruction set used did not affect
how often women pushed the button (65 was not significantly different from 87),
nor did gender affect button-pushing for those instructed with set II (79 was not
significantly different from 87). However, men instructed with set I pushed the
button 113 times, on average, which was significantly different from the means
for women instructed with set I (M = 65) and for men instructed with set II (M =
79). Given these results, it would be misleading to emphasize the significant
gender main effect. True, the mean number of button presses was greater for
men than women, but this effect was confined to subjects who were exposed to
instruction set I. There was no significant gender difference for subjects exposed
to instruction set II.
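The Tukey computations that these conclusions rest on are easy to reproduce. The following sketch is our illustration (again assuming Python, not the book's spreadsheet), using the usual HSD statistic, q = (Mi - Mj) / sqrt(MSerror / n); the critical value of roughly 4.20 for four groups and 12 error degrees of freedom at the .05 level is taken from a studentized range table.

import itertools
import math

means = {"M-I": 113, "M-II": 79, "F-I": 65, "F-II": 87}   # Fig. 14.11
ms_error, n = 213.2, 4            # error term and cases per cell
q_crit = 4.20                     # studentized range, 4 groups, 12 df, alpha = .05

se = math.sqrt(ms_error / n)
for (g1, m1), (g2, m2) in itertools.combinations(means.items(), 2):
    q = abs(m1 - m2) / se
    print(f"{g1} vs {g2}: q = {q:.2f}", "significant" if q > q_crit else "ns")
# only M-I vs M-II and M-I vs F-I are significant, matching the subscripts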

Exercise 14.5
Interpreting Interaction in a 2 x 2 Factorial Study
This exercise provides practice in interpreting significant interaction results for a
2 x 2 factorial study.
1. If the MSerror were 92.5 instead of 213.2, the post hoc results would have
been different, as you demonstrated when doing Exercise 13.3. If the groups
were those defined by the present gender by instruction factorial, not the
marital/parental groups of chapter 13, how would you interpret the post hoc
results?
2. Now interpret the post hoc results for the significant gender x instruction
interaction if the MSerror were 23.2.

In the last several paragraphs we have considered how to explicate significant
interactions detected by an analysis of variance. Let us return for a moment to
the previous problem: the initial analysis of the main effects and interactions of a
                 Instruction
Gender       Set I      Set II
Male         113a       79b
Female       65b        87ab
FIG. 14.11. Post hoc analysis for the gender x instruction interaction.
Scores are means based on four subjects each. Means that do not differ
significantly according to the Tukey test, alpha = .05, share a common
subscript.

factorial study. For Exercise 14.4 you were asked to organize the results in a table
showing R2 change and its significance for each step. It may have occurred to you,
as you performed the necessary computations, that such work could be done
more easily with a spreadsheet. For the next exercise, you are asked to create a
template that displays and summarizes the results of the gender by instruction
analysis. This new format merges the best features of the hierarchic table we
have been using (e.g., Fig. 13.1 and the table you created for Exercise 14.4) with a
typical analysis of variance source table (e.g., Fig. 12.9) and is used from now on
to summarize all ANOVA results.

14.4 MAGNITUDE OF EFFECTS AND PARTIAL ETA SQUARED

When using multiple-regression computations to analyze the effects of a factorial
design, the change in R2 at each step (or the change in the sum of squares) is of
primary interest because it indicates the variance accounted for by the main
effects and interactions associated with each step. In earlier tables, we displayed
the total R2 at each step along with its degrees of freedom and associated F ratio,
and also the change in R2 at each step along with the degrees of freedom and the
associated F ratio for R2 change. We did this as a learning device, to show how
hierarchic regression worked.
The analysis of variance source tables we now introduce are more economical
and more conventional. They also add a new and useful magnitude-of-effect
statistic, partial η2 (a Greek lower case eta, squared). These new tables retain the
total R2s as of each step (the proportion of criterion variance accounted for by all
variables in the equation as of this step), but otherwise the statistics displayed
refer to the step—that is, they characterize the source of variance (A main effect,
B main effect, AB interaction, etc.) whose predictor variables were added at that
step.
The statistics in the source table include each step's R2 change, in part because
this is a magnitude of effect statistic with which you have become familiar but
also because it is used to compute the SS values associated with the step. SS are
traditional in an analysis of variance source table and are computed by
multiplying R2 change for the step by the total SS for the criterion variable. The
degrees of freedom for each step are the number of predictor variables entered at
that step, and the residual or error degrees of freedom for the last step are, as
usual, the number of cases minus 1 minus the number of predictor variables
(N - 1 - K). Finally, the traditional ANOVA mean square (MS) for each step is
computed by dividing the SS by its degrees of freedom, and the F ratio for each
effect is computed by dividing the effect MS by the error MS.
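As a quick worked case (our sketch; the values are read from Figs. 14.10 and 14.12), the gender line of such a source table follows directly from these rules:

ss_total = 7438           # total SS for button pushes (Fig. 14.10)
r2_change = 0.215         # step 1 (A, gender) R2 change
ms_error = 213.2          # error MS from the final step
df_effect = 1             # one predictor entered at this step

ss_effect = r2_change * ss_total      # about 1600
ms_effect = ss_effect / df_effect     # about 1600
F = ms_effect / ms_error              # about 7.5
print(ss_effect, ms_effect, round(F, 2))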
A statistic new to you is η2 (eta squared). In the traditional analysis of
variance literature it is defined as

η2 = SSeffect / SStotal
In other words, it is identical to R2 change, which is why we have not introduced it
earlier. (The estimated population value, analogous to R2 adjusted, is symbolized
with a Greek lower case omega squared, ω2; see Hays, 1973, and Tabachnick &
Fidell, 2001.) Thus we could label the same statistic either R2 change, reflecting a
multiple-regression heritage, or η2, reflecting an analysis of variance heritage; the
value would be the same.
More useful is a variant of η2, the partial η2, which is defined as

partial η2 = SSeffect / (SSeffect + SSerror)
As Tabachnick and Fidell (2001) noted, the value of η2 for a particular variable
depends on the number and significance of other variables in the model, whereas
partial η2 isolates the effect for a particular variable more. Further, partial η2
makes more sense than η2 or R2 change as a magnitude of effect statistic in the
context of repeated-measures designs (see next two chapters). Thus partial η2
lends itself to comparison within and across studies in a way η2 does not, and for
that reason we recommend its use and incorporate it in our analysis of variance
source tables (where we label it pη2). However, be aware that partial η2 may not
be the best statistic when comparing effects of a particular variable across studies
that use different designs; the recently introduced generalized η2 may be better,
especially for repeated measures designs (see Olejnik & Algina, 2003, for details).
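In code (our sketch, using the sums of squares from Fig. 14.12), the two statistics can be computed side by side; note that partial η2 for each effect uses only that effect's SS and the error SS:

ss_effects = {"gender": 1600, "instruction": 144, "interaction": 3136}
ss_error, ss_total = 2558, 7438

for name, ss in ss_effects.items():
    eta2 = ss / ss_total              # identical to R2 change
    p_eta2 = ss / (ss + ss_error)     # partial eta squared
    print(f"{name}: eta2 = {eta2:.3f}, partial eta2 = {p_eta2:.3f}")
# gender .215/.385, instruction .019/.053, interaction .422/.551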
One comment about η2 is in order: SPSS and other statistical packages
optionally provide values for partial η2. Nonetheless, when investigators report
those values in research reports, they are often (incorrectly) labeled η2. If you see
a value labeled η2 in an article, chances are very good it is actually a partial η2.
From here on, we incorporate partial η2s, sums of squares, mean squares, and
F ratios in a more traditional analysis of variance source table, which is the point
of the next exercise.

Exercise 14.6
ANOVA Source Table for a 2 x 2 Factorial Study
The template developed for this exercise adds an analysis of variance source
table to the template you developed for Exercise 14.4 to analyze a 2 x 2 factorial.
It provides a summary of the results for the button-pushing study, assuming the
four groups were formed by crossing the two factors, gender and instruction set.
1. Add a source table to the spreadsheet shown in Fig. 14.10. You may use
Fig. 14.12 as a guide if you wish.
2. Establish cells that contain the number of subjects and the number of levels
for factors A and B. Enter formulas, not values, in cells displaying the
degrees of freedom. Thus if your design changes, you can change just a few
cells and the appropriate degrees of freedom for your new design will be
computed and displayed automatically.
3. If the multiple-regression output from steps 1, 2, and 3 is still in your
spreadsheet, then the cells in your source table that display the total R2 for
each step can point to the appropriate cells of the multiple-regression output
and you will not need to reenter these values.
4. Earlier you did only three steps, but there was an implicit fourth step, which
you should add now. Remember that R2total = R2model + R2error and SStotal =
SSmodel + SSerror. If at step 4 you had added coded predictor variables for the
error term (one for each degree of freedom associated with the error term, a
matter explained in the next chapter), you would have accounted for all of the
variance. Accordingly, you can enter 1 (all variance accounted for) as the
step 4 entry in the total (accumulative) R2 column. Likewise, you can enter
the overall sum of squares (or a pointer to it) as the step 4 entry in the total
(accumulative) SS column.
5. For each step, enter the appropriate formulas for total sums of squares, for
changes in R2, for changes in sums of squares, and for partial η2. Do the
changes in R2 and SS sum to 1 and SStotal as they should?
6. Finally, enter formulas for mean squares and formulas for the F ratios that
test the significance of the mean squares.

At this point, your spreadsheet should look like the one shown in Fig. 14.12.
From it you can determine significance for the A and B main effects and the AB
interaction for a two-way factorial, along with the magnitude of each of these
effects as assessed with a partial η2. Significance testing for the main effects and
interactions of more complex factorial studies follows the same strategy
demonstrated here. These procedures can be summarized as follows. When
completely crossed, the levels of the factors of a factorial study define G groups.
G - 1 predictor variables, formed following the rules and principles presented
earlier, are required to code the information implied by such designs. For a 2 x 2
(or any factorial with only two levels per factor), main effects and interactions are
associated with one predictor variable each.

Exercise 14.7
SPSS Analysis of a 2 x 2 Factorial Study
In this exercise you will learn how to use the General Linear Model procedure in
SPSS to conduct a 2 x 2 analysis of variance for the button-pushing study.
1. Create a new SPSS data file containing variables for the number of button
pushes (bps), gender (sex), and instruction set (inst). Enter the data from
Fig. 14.10. You should create value labels for the sex and instruction set to
make the output more readable.
2. Select Analyze->General Linear Model->Univariate from the main menu.
Move bps to the Dependent Variable window, and sex and inst to the Fixed
Factors(s) window.
3. Click on Options and check the boxes next to Descriptive statistics,
Estimates of effect size, and Homogeneity tests. Also, in the Estimated
Marginal Means box move [Overall], sex, inst, and sex*inst to the window
labeled Display means for. Click Continue and then OK.
4. Examine the Descriptive Statistics box, where means are displayed for
each of the cells and marginals. Make sure these values agree with your
spreadsheets. Now scroll down to the boxes under Estimated Marginal
Means. The values for the grand mean, sex and instruction set main effects,
and the interaction should be the same as those found in the Descriptive
Statistics box. If the design were unbalanced (i.e., one or more of the
cells were of different size), the descriptive statistics would provide
traditional weighted means while the estimated marginal means would be
unweighted. Typically, when cells are unequal due to subject attrition or
random factors, you would want to report the unweighted means for any
significant effects resulting from your analysis.
5. Scroll back up to the box labeled Levene's Test of the Equality of Error
Variances. Notice that this test is not statistically significant, indicating that
the assumption of equal variances is met.
6. Finally examine the box labeled Tests of Between-Subjects Effects. Look
at the lines for the SEX, INST, SEX*INST, Error, and corrected total. Check
that the sums of squares, df, mean square, F, and partial eta-squared values
correspond to your spreadsheet calculations.
7. For additional practice you should try reanalyzing the data from Exercise
14.8 using SPSS. Do all of the relevant statistics agree with your
spreadsheet analysis?

Factors with more than two levels are associated, not with a single predictor,
but with a set of predictor variables instead. Similarly, interactions involving
such factors will also be associated with a set of predictor variables. In such cases
the question remains, how much additional variance is accounted for when the
set of variables associated with a particular main effect or interaction is added to
the regression equation? Increases in R2 are tested exactly the same way for
individual variables or sets. As always, the degrees of freedom in the numerator
of the F ratio reflect the number of predictor variables added, whether one or
more than one (see Equation 13.2), whereas degrees of freedom in the
denominator reflect the residual degrees of freedom for the final model (N - 1 -
number of predictor variables used for the final step). Results from any two-way
factorial analysis of variance can be summarized and presented, as shown in Fig.
14.12.
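As one concrete illustration (ours, and the particular orthogonal contrasts shown are one acceptable choice among several), here is how the six cells of a 2 x 3 design such as that in Exercise 14.8 reduce to G - 1 = 5 coded predictors, with the B main effect and the AB interaction each carried by a set of two, along with a small helper for testing the R2 change of a set:

import numpy as np

gender = [-1, 1]                        # A: -1 = male, +1 = female
partner = [(1, -1, 0), (1, 1, -2)]      # two orthogonal contrasts for B

# build the five codes (A, B1, B2, AB1, AB2) for each of the six cells
cells = []
for a in gender:
    for level in range(3):              # mother, father, stranger
        b1, b2 = partner[0][level], partner[1][level]
        cells.append([a, b1, b2, a * b1, a * b2])
print(np.array(cells))

def f_change(r2_reduced, r2_full, df_added, n, k_final):
    # F ratio for an R2 increase, tested against the final model's error
    return ((r2_full - r2_reduced) / df_added) / ((1 - r2_full) / (n - 1 - k_final))

# e.g., with hypothetical R2s of .10 before and .45 after the two B codes
# enter (N = 20, 5 predictors in the final model): f_change(.10, .45, 2, 20, 5)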
Before any significant main effects are emphasized, any qualifying
interactions should first be analyzed and understood. Significant interactions can
be analyzed using the Tukey post hoc test procedures described in chapter 13.
This is not the only approach, of course. Just as there are several variants of post
hoc tests, so too there are several approaches to analyzing interactions. The
Tukey is emphasized here because of its generality, general acceptability, and
simplicity. (For additional approaches to analyzing significant interactions see
Winer, 1971, and Keppel, 1982.) Main effects require post hoc tests only if more
than two groups are involved. Interactions, however, necessarily require post hoc

A B C D E F G H I
1 Step Source R2 R2 change SS df MS F pη2
2 1 A, gender 0.215 0.215 1600 1 1600 7.506 0.385
3 2 B, instruction 0.234 0.019 144 1 144 0.676 0.053
4 3 AB, gend x inst 0.656 0.422 3136 1 3136 14.71 0.551
5 4 S/AB, error 1 0.344 2558 12 213.2
6 TOTAL btwn Ss 1 7438
FIG. 14.12. Spreadsheet showing an analysis of variance source table for a
2 x 2 factorial study analyzing the effect of gender, instruction, and their
interaction on number of button pushes.
explication. After all, even if factors A and B both involve only two groups (as in
the previous exercise), their interaction defines four groups, and hence a post hoc
test is required to understand the nature of the interaction. For the next exercise,
you will analyze a 2 x 3 factorial study and will be asked to explain a significant
main effect involving three groups.

Exercise 14.8
Analysis of a 2 x 3 Factorial Study
The exercise provides additional practice in a two-way analysis of variance. In
this case, factor A has two levels and factor B has three levels. This exercise
also provides additional practice in interpreting post hoc results.
1. For this exercise, you reanalyze the data from the gender smiling study last
shown in Fig. 13.11. Retain the 20 subjects and the number of smiles shown
for each infant but ignore age. This study involves two factors, gender (factor
A) and partner (factor B). The first 10 subjects are males and the last 10
females, as before. However, this time assume that the first three male
(subjects 1-3) and the first four female infants (subjects 11-14) interacted
with their mother, that the second four male (subjects 4-7) and the second
three female infants (subjects 15-17) interacted with their father, and the
rest interacted with a stranger (subjects 8-10 and 18-20). Set up contrast
codes to represent the A main effect, the B main effect, and the AB
interaction.
2. Analyze the data for this 2 x 3 factorial study. You will want to incorporate an
analysis of variance source table like that shown in Fig. 14.12, so you may
find it easier to modify the spreadsheet used in the last exercise rather than
modifying the spreadsheet shown in Fig. 13.11. Or, if you are becoming
adept with your spreadsheet program, you may combine the two
spreadsheets. Or, for practice and to assure yourself that you understand all
the required formulas, you could create this spreadsheet from scratch.
However you do it, the goal is to determine whether the gender effect, the
partner effect, and their interaction are statistically significant, organizing your
results in a source table like that shown in Fig. 14.12.
3. If you have performed the analysis correctly, you should have found out that
the gender effect was not significant (it approached but did not reach the .05
level of significance), that the partner effect was significant, and that their
interaction was not significant. Perform a post hoc test on the three means
representing the number of smiles seen when interacting with mothers,
fathers, and strangers and interpret your results. Think about this carefully.
This exercise is a stringent, but quite realistic, test of your understanding of
post hoc analysis.

From the previous exercise you should have gained an understanding of the
general analysis of variance approach to factorial studies. To begin with, you
create contrast codes for each main effect and interaction. Then, using hierarchic
multiple-regression procedures, you determine the statistical significance and
magnitude of effect for each main effect and interaction. Your first task is to
understand and interpret any significant interactions. In order to do this you will
need to use post hoc tests. The second task is to explicate any significant main
effects that are not qualified by higher order interactions. You will need post hoc
tests for this only if a main effect involves more than two groups.
Analyses of variance appropriate for studies including more than one
between-subjects factor have been described in this chapter. You have learned
how to generate higher order terms for two-way, three-way, and so forth, factorial
studies, how to represent main effects and interactions with coded predictor
variables, how to compute the appropriate degrees of freedom for each effect, and
how to test the statistical significance of each effect using hierarchic multiple-
regression procedures. In addition, you have also learned how to interpret
significant conditional (interactive) effects and when post hoc tests should be
used to explicate significant main effects and interactions.
At this point, you should be able to analyze completely crossed factorial
studies of any order and understand others' analyses. Factorial studies represent
by far the most commonly used analysis of variance design. As noted earlier in
this chapter, there are other more complex possibilities (e.g., various kinds of
incomplete designs), but these are regarded as advanced topics beyond the scope
of this book. All of these analyses are alike in that factors are between subjects.
However, there is another fairly common alternative. Factors can also be within
subjects—that is, subjects can contribute repeated measurements to a study.
How data from such studies are analyzed constitutes the topic of the next two
chapters.
15 Single-Factor Within-Subjects Studies

In this chapter you will:


1. Learn about repeated measures or within-subjects factors.
2. Learn about studies that include within-subjects factors and when they
should be used.
3. Learn how to analyze data from a single-factor within-subjects study.

The statistical procedures discussed in the previous chapters are appropriate for
assessing either the effect of a single factor on a dependent measure of interest or
the effects of multiple factors (their main effects and interactions) combined in a
completely crossed design. A restriction has been that factors must operate
between subjects; that is, subjects may contribute one, and only one, score to the
analysis. Thus, for these between-subjects studies, no subject is represented at
more than one level of any factor. In other words, each subject appears in exactly
one group or cell of the study.
As suggested earlier, there is another possibility. A subject might be assessed
repeatedly and the repeated assessments could represent levels of a within-
subjects (or repeated measures) factor. Such factors would allow us to
investigate, for example, changes over time, either those occurring naturally or
perhaps with an intervening experimental treatment. This chapter explores how
within-subjects factors can be incorporated into factorial studies, and how their
effects can be analyzed.

15.1 WITHIN-SUBJECTS OR REPEATED-MEASURES FACTORS


A factor is said to be within subjects when the subjects (or dyads, families, or
whatever sampling unit is used) in a study are assessed more than once and when
those repeated assessments form the levels of the factor. For example, the
within-subjects factor could be time and the levels might represent time 1 and
time 2. Such repeated assessments are rightfully popular because they allow
researchers to address questions like, do individuals tend to score higher at time
2, after a particular treatment (or event)? Or, the within-subjects factor could be
setting and the levels might be laboratory and home, or day and night. Again, the
purpose is to determine if the factor has any effect on the dependent measure.
For example, we might want to determine if scores are systematically higher at
night, in the home, and so forth.
The unit of analysis (or sampling unit) is not always an individual. If married
couples were studied, for example, a husband's and wife's scores would form
repeated assessments. The factor would be spouse, whose two levels are husband
and wife. Marital satisfaction scores, assessed separately for husbands and wives,
would then represent repeated measures for the couple. Analysis of these scores
would let us determine whether husbands and wives differed significantly with
respect to marital satisfaction.
When selecting an appropriate design for analysis, there may occasionally be
some question as to whether a particular factor is between or within subjects.
Such questions can usually be resolved by considering how units are sampled.
For example, a single-factor two-group study might consist of husbands and
wives. If the wife group is formed by selecting wives at random and the husband
group in a similar way, with no restriction that the husbands be matched to the
wives already selected, then the spouse factor is between subjects. The spouse
factor is within subjects, however, if husbands are linked to wives—that is, if the
husband group consists of the husbands of the wives previously selected.
As you can see from the preceding paragraph, factors are not inherently
between or within subjects. Depending on the sampling design, the same factor
can be between subjects in one study, within subjects in another. Moreover,
several factors—some between subjects, some within subjects—can be combined
in a single factorial study. In the previous chapter we discussed factorial studies
and assumed that all factors were between subjects. In this chapter and the next
we consider the possibility that some or all of the factors in a factorial study can
be within subjects instead.

Factorial Studies With Within-Subjects Factors


A within-subjects factor, as already noted, is also called a repeated-measures
factor. Similarly, a study including any within-subjects factors is called a
repeated-measures study. Such studies could consist solely of repeated-measures
factors, or could be mixed, containing both between- and within-subjects factors.
Possibilities include a one-between, one-within two-factor study; a no-between,
two-within two-factor study; a two-between, one-within three-factor study; a
one-between, two-within three-factor study; and so forth. The general case, then,
is a u-between, v-within factorial study, where u represents the number of
between-subjects and v the number of within-subjects factors.
Consider the 2 x 2 between-subjects study analyzed in the previous chapter.
The two dimensions were gender of subject (male/female) and instruction (set
I/set II). The male and female subjects were not linked in any way, and different
subjects received different instructions, so both factors were between subjects.
This is a straightforward example of a two-factor between subjects (two-between,
no-within) factorial study.
However, if the male and female subjects received two treatments, first one
instruction set and then the other, instruction would be a within-subjects factor.
The number of button pushes would be assessed after each instruction set, and
each subject would contribute two scores to the analysis. This is an example of a
mixed two-factor (one-between, one-within) factorial study: The gender of
subject is the between-subjects factor and the instruction set is the within-
subjects factor.
There is still another variant of the basic 2 x 2 study. Instead of sampling
men and women randomly, we might instead select married couples. If husband
and wife were each exposed to instruction set I and instruction set II, then both
gender (husband/wife) and instruction (set I/set II) would be within-subjects
factors and each couple would contribute four scores to the analysis (number of
button pushes for both spouses for both instruction sets). This is an example of a
two-factor, within-subjects (no-between, two-within) factorial study.
For simplicity, single-factor within-subjects studies are emphasized in this
chapter, leaving discussion of multifactor studies that include repeated factors,
such as those described in the preceding paragraphs, until chapter 16. Still you
should be aware that the more complex designs are quite common. For example,
when studies include a treatment like instruction set as a within-subjects factor,
order of presentation is commonly included as a between-subjects factor because
if it is not, the results are difficult to interpret unambiguously.
Imagine that in a single-factor within-subjects study instruction set I was
always presented first, instruction set II second, and the analysis of variance
detected a main effect for instruction set. Then we would not know whether
subjects always pushed the button more during the first session or whether they
were reacting in particular to instruction set I. In such a case, we would say that
order and instruction set were confounded, combined in such a way that we
cannot disentangle their separate effects. The solution is to add a between-
subjects factor of order to the study design. For the present example, half of the
males and half of the females would be exposed to instruction set I first and then
set II (order 1), whereas the remaining half would be exposed first to instruction
set II and then set I (order 2), thus counterbalancing instruction set. Then if
there is a main effect for instruction set, we know that subjects are reacting
differently to the different instruction sets, no matter which is presented first. On
the other hand, if subjects are always more reactive to whichever instructions are
presented first, then there would be a main effect for order. In the next chapter
we describe how to analyze data from mixed two-factor studies such as this, but
for now we focus on the simpler single-factor case.

Advantages of Within-Subjects Factors


Studies including within-subjects factors have one notable strength: Subjects
assessed repeatedly serve as their own controls. This means that variability
between subjects is irrelevant to testing within-subject effects. Hence variance
due to between-subject variability is removed from consideration at the outset,
which reduces the residual or error term used for tests of within-subject effects
considerably. A smaller MSerror usually results in a larger F ratio, so, as a general
rule, we are more likely to find an effect significant if tests involve within instead
of between-subjects variables. In other words, for the same number of subjects,
tests involving within instead of between-subjects factors are usually more
powerful (see chap. 16). This set of circumstances presents an interesting
opportunity. The same power afforded by a test of a between-subjects factor can
be achieved with considerably fewer subjects if that factor reasonably can be
assessed within instead of between subjects. Fewer subjects typically means
greater economy of effort, so researchers are well-advised to consider whether
repeated measures procedures might be appropriate for a particular study.
Unfortunately, repeated assessments are not always possible. There are two
major stumbling blocks to their exclusive use. First, some tests are reactive
(subjects react differently the second time, not due to different circumstances,
but due to having been assessed previously), and if subsequent scores are indeed
affected by earlier administrations of the same test, repeated measures are ruled
out. For example, imagine we wanted to study the effect of three different drugs
on memory. We could use either a between-subjects design (one-third of the
subjects receive drug A, one-third B, one-third C) or a within-subjects design (all
subjects receive all three drugs at different times and are tested after each).
Given the same number of subjects, the within-subjects design is usually more
powerful. But if subjects remember elements of the test, and so subsequent
scores are affected by previous scores, the investigator may have no choice but to
use a between-subjects design. If tests are reactive, meaning that subsequent
scores are contaminated by previous scores, fresh subjects may be needed.
Further, some factors, by their very nature, can only be between subjects,
which is a second stumbling block to the exclusive use of within-subjects factors.
Under usual conditions, families are either lower income or middle income
(assuming this is a reasonable categorization), and usually we assume this status
is relatively enduring. A repeated-measures design would not be possible. Or, if
it were, it would study only that (perhaps atypical) subset of families who change
status and would undoubtedly focus on specific questions concerning social
mobility.
Other examples of inherently between-subjects factors are relationship and
parental status. At a given time, a person either is in a committed relationship or
not, and either has children or not, so study of either of these factors requires a
between-subjects design. Of course, one could select subjects and wait until they
had children, in which case parental status could be a within-subjects factor.
Such a study would take considerable time, although if the investigator were
interested specifically in questions concerning the transition to parenthood, the
time might be justified.
As we have seen, factors can be either between or within subjects, and any
number of factors of either kind can be combined in a factorial design. The
advantages of factorial designs that include within-subjects factors (either all-
within or mixed designs) are the same as the advantages of purely between-
subjects designs: Any effects the different factors have on the dependent
variable, either singly or in interaction with other factors, can be evaluated.
Sometimes logical or practical considerations require that a particular factor be
between subjects, sometimes that it be within subjects. Given a choice, though, it
usually makes sense to opt for the within-subjects version of a factor. As noted
earlier, the reasons are largely economic. If an effect (of a particular magnitude)
exists, usually it will be detected with fewer subjects if the statistical tests involve
within-subjects instead of between-subjects factors.

Partitioning Variance Between and Within Subjects


The analysis of studies including repeated measures requires no major new
concepts, only an extension and application of ideas and techniques already
presented in previous chapters. Two ideas in particular are central to an
understanding of any analysis of variance involving within-subjects factors:

1. The first idea concerns how variance is partitioned into between-subjects
and within-subjects components.
2. The second conceptualizes subject as a control variable or covariate so
that between-subjects variance, which is irrelevant to analyzing within-
subjects factors, can be removed from within-subject error terms.
Beginning in chapter 8, we learned that the total sum of squares (or variance)
could be partitioned into two portions, one part due to the best-fitting model and
the other due to error (the residual sum of squares). In other words,

SStotal = SSmodel + SSerror (15.1)
In earlier chapters, we ignored the possibility of within-subjects factors, but
now we can recognize that Equation 15.1 tells only part of the story. It applies
only to studies consisting solely of between-subjects factors. An expanded and
more accurate formulation for Equation 15.1 would be:

SStotal between subjects = SSmodel + SSbetween-subjects error (15.2)
(Note that what is termed "between-subjects error" in Equation 15.2 has also
been termed "error within groups" in earlier chapters because they are the same
thing.) Purely between-subjects studies have N subjects and N scores, so the total
sum of squares consists only of variability between subjects. We have always
referred to this as the total sum of squares before, although it would have been
more accurate to call it the total sum of squares between subjects. This serves to
remind us that the SStotal in Equation 15.1 is total only if no within-subjects
factors are present.
Studies including within-subjects factors, however, have more than N scores,
so, as you might guess, their total sum of squares is greater than the sum of
squares due to variability between subjects. Variability within subjects also
contributes. Specifically, for repeated-measures studies:
SStotal (between + within) = SStotal between subjects + SStotal within subjects (15.3)

Consider for a moment how you would compute the total (between + within) sum
of squares. If three assessments were made for each of 12 subjects, you would
compute the mean of the 36 scores, subtract that mean from each score, square
the deviation, and sum the 36 squared deviations. But as you know from
Equation 15.3, this sum of squares can be partitioned into two pieces, one
representing variability between subjects and one representing variability within
subjects.
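The identity in Equation 15.3 is easy to demonstrate numerically. In the sketch below (ours; arbitrary generated scores stand in for real data), the 36 scores from 12 subjects assessed three times are partitioned exactly as just described:

import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(50, 10, size=(12, 3))   # 12 subjects x 3 assessments

grand = scores.mean()
ss_total = ((scores - grand) ** 2).sum()    # all 36 squared deviations

subj_means = scores.mean(axis=1, keepdims=True)
ss_between = (3 * (subj_means - grand) ** 2).sum()   # between subjects
ss_within = ((scores - subj_means) ** 2).sum()       # within subjects

assert np.isclose(ss_total, ss_between + ss_within)  # Equation 15.3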
Moreover, each of these two sums of squares (SStotal between subjects and SStotal
within subjects) can be subdivided further. From chapter 14 you know that the total
between subjects sum of squares can be partitioned into a series of between-
subjects main effects and interactions and a between-subjects (or within-groups)
error term. Similarly, the total within-subjects sum of squares can be partitioned
into sums of squares and error terms that allow us to evaluate the effects, if any,
of within-subjects variables on the criterion measure.
It is useful to view a repeated-measures analysis as consisting of two (or
more) separate analyses. First, there is the between-subjects analysis, which is
associated with the sum of squares total between subjects and which includes
only between-subjects factors and their interaction with each other. Second,
there is the within-subjects analysis (or analyses, if there is more than one
repeated factor), which is associated with the sum of squares total within-subjects
and includes within-subjects factors, their interactions with each other, and their
interactions with the between-subjects factor or factors. Exactly how repeated-
measures analyses are ordered and organized will become clearer when we
describe in detail how the total sum of squares and degrees of freedom are
partitioned for specific repeated-measures studies.

15.2 CONTROLLING BETWEEN-SUBJECTS VARIABILITY

The second basic idea required for an understanding of repeated measures
analysis of variance is based on the realization that subject, like age or group, can
itself be a research factor and can be treated as a covariate for repeated-measures
analyses. In the previous section we noted that the total (between + within) sum
of squares for 36 scores, derived from three assessments of 12 subjects, is easily
computed. But how would we compute the sum of squares between subjects?
You could compute a mean for each subject's three scores and then compute a
sum of squares for the 12 mean scores. But there is another way to compute the
sum of squares between subjects. It requires that you apply concepts learned in
previous chapters and has direct application to the analysis of repeated measures
factors.
Just as G groups can be represented with G - 1 predictor variables (see chap.
12), so too N subjects can be represented with N - 1 predictor variables. After all,
as we learned in chapter 10, N scores are associated with N - 1 degrees of
freedom. Typically, dummy-coded variables are used. Thus the first subject
would be coded one for the first predictor variable, zero otherwise, the second
subject would be coded one for the second variable, zero otherwise, and so forth,
until the final subject would be coded zero for all N - 1 predictor variables.
Once subjects are represented with predictor variables, we can determine the
proportion of total variability accounted for by the subject factor. Again consider
the N = 12 example. We would regress all 36 scores on the 11 predictor variables
for subject. The resulting R2 is the proportion of variance accounted for by the
subject factor. Then we can compute the sum of squares between subjects. It is
the total (between + within) sum of squares multiplied by the proportion
accounted for by between-subject variability—the total SS multiplied by the R2
just computed.
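Continuing the N = 12 example, the following sketch (ours, reusing the placeholder scores from the previous sketch) codes the 12 subjects with 11 dummy predictors and recovers the between-subjects sum of squares by regression, exactly as described:

import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(50, 10, size=(12, 3))
y = scores.ravel()                          # 36 scores, subject by subject
subject = np.repeat(np.arange(12), 3)       # which subject produced each score

dummies = (subject[:, None] == np.arange(11)).astype(float)   # N - 1 columns
X = np.column_stack([np.ones(len(y)), dummies])
yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
r2 = 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

ss_total = ((y - y.mean()) ** 2).sum()
ss_between = r2 * ss_total     # equals the direct subject-means computation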
The analysis of studies involving repeated measures proceeds hierarchically,
just like the factorial studies described in the previous chapter. If there are no
between-subjects factors, then the first step consists of regressing criterion scores
on coded predictor variables representing subjects. But if there are between-
subjects factors, then some of the N - 1 predictor variables representing total
between-subject variability will be identified with the between-subjects factor or
factors and their interactions (just as in chap. 14). Thus the total set of N - 1
between-subjects predictor variables allows us to evaluate the significance of any
between-subjects factors. Exactly how this works is demonstrated in chapter 16.
In any event, once the N - 1 between-subjects predictor variables are entered
(whether that takes one or more than one step), the next steps involve adding
predictor variables associated with the repeated measures factor (or factors).
Note the overall strategy. First we control for between-subject variability,
entering coded variables representing subjects—exactly as though the subject
factor were a covariate (see chaps. 10 and 12). Then, after removing purely
between-subject variance, which is irrelevant to the analysis of within-subject
effects, we proceed to analyze the variance remaining, the variance associated
with within-subject effects.
Whether an analysis of variance includes one or more repeated measures, or
includes between-subjects factors as well, the basic steps are the same although
some details vary. Exact procedures are demonstrated in the remainder of this
chapter and in the next. Specific topics include the way variables (including
subject variables) are coded, the sequencing of the multiple-regression steps
required, the way the total sum of squares (and the total degrees of freedom) is
partitioned among the various effects, and the way the various effects are tested
for significance. The next section illustrates a no-between, one-within study,
whereas a one-between, one-within and a no-between, two-within study are
described in the next chapter. An understanding of these three exemplars should
make it possible for you to understand how to analyze other, more complex
repeated-measures designs.

A Study With a Single Within-Subjects Factor


Recall the lie detection study first introduced in chapter 5. There were 10
subjects and the dependent variable was the number of lies an expert detected.
Beginning in chapter 9, we imposed a single-factor between-subjects design on
these data. We divided the 10 subjects into a drug group and a placebo group and
represented the single factor with a dummy-coded variable. There were 10 scores
and 9 degrees of freedom total. We would now say that there were 9 degrees of
freedom between subjects and these were divided into 1 degree of freedom for the
between-subjects model (one predictor variable) with 8 remaining for between-
subjects error. The partitioning for the total sum of squares and degrees of
freedom was:

SStotal between subjects = SSdrug + SSbetween-subjects error

and

dftotal between subjects = dfdrug + dferror

Because dfbetween subjects = N - 1,

9 = 1 + 8

Finally, the F test for the between-subjects drug effect was:

F(1, 8) = MSdrug / MSerror
But now imagine this was a within-subjects study instead and there were only
five subjects and each subject received both drug treatments (the actual drug and
a placebo). There would still be 10 scores total but now each subject would
contribute two scores. If those scores were the same as those given earlier (Fig.
5.1), and if we assume the scores initially given for subjects 6-10 are now scores
for subjects 1-5 after the second drug treatment (the placebo condition), the total
sum of squares for the 10 scores would still be 52 and the total degrees of
freedom would still be 9, the same values we computed earlier (see Fig. 10.3).
After all, we have not changed the scores for this example, only the number of
subjects and the fact that each subject now contributes two scores.
Although the total sum of squares and degrees of freedom would remain the
same, the way they are partitioned would be quite different. Five subjects require
four predictor variables. The first step would regress the number of lies detected
on the coded variables representing the subject factor. This would give us the R2
between subjects and (multiplying by the total sum of squares for lies detected)
the total between subjects sum of squares, which is associated with 4 degrees of
freedom (because five subjects require four predictor variables). Having removed
or accounted for between-subjects variance (step 1), we would now proceed to
analyze the remaining within-subjects variance, which is associated with 5
degrees of freedom (9 initially minus 4 for the subject factor).
On the second step, a coded variable for drug treatment would be entered.
This gives us the increase in R2 (and the increase in SS) due to drug and is
associated with 1 degree of freedom (because two drug groups require 1 predictor
variable). The residual or error variance is associated with 4 degrees of freedom
(5 for total within-subject variability minus 1 for drug). In this case the
partitioning for the total sum of squares and degrees of freedom is:

SStotal = SSsubjects + SSdrug + SSresidual

Because dftotal = 2N - 1 (there are two scores per subject),

9 = 4 + 1 + 4

Finally, the F test for the within-subjects drug effect is:

F(1, 4) = MSdrug / MSresidual
If the obtained F ratio exceeds its critical value, we would say that the drug
effect was statistically significant. The next exercise demonstrates in detail how
this works, so do not worry if the procedure is not yet completely clear. For the
moment, focus on the overall strategy. First total variability is partitioned into
two pieces: variability between subjects and variability within subjects. Then
variability within subjects is subdivided into variability due to the repeated factor
and residual variability. Finally, the significance of the repeated factor is
determined by comparing a mean square (variance estimate) for the repeated
factor with the mean square for the residual.
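The sketch below (ours, not the book's spreadsheet; Python with numpy again assumed) carries out these two steps on scores consistent with Figs. 15.3 and 15.4; the fifth subject's pair of scores is inferred from the group and grand means reported there.

import numpy as np

drug = np.array([3, 2, 4, 6, 6])          # fifth score inferred
placebo = np.array([4, 5, 7, 7, 9])       # inferred from the subject means
y = np.concatenate([drug, placebo])

subj = np.tile(np.arange(5), 2)           # the same five subjects, twice
s_codes = (subj[:, None] == np.arange(4)).astype(float)   # S1-S4 dummies
p = np.concatenate([np.ones(5), np.zeros(5)])             # 1 = drug, 0 = placebo

def r2(*predictors):
    X = np.column_stack([np.ones(10), *predictors])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

step1 = r2(s_codes)            # subjects only: about .638
step2 = r2(s_codes, p)         # subjects plus drug: about .940
F = (step2 - step1) / ((1 - step2) / 4)   # F(1, 4), about 20.17
print(round(step1, 3), round(step2, 3), round(F, 2))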
In the preceding paragraphs, the error mean square for the between-subjects
F test was labeled MSerror and the corresponding error term for the within-
subjects test was labeled MSresidual. This was done to signal that the "proper"
error terms for these two tests are somewhat different—although both represent
residual error and could quite correctly be called either MSerror or MSresidual. In
mathematically oriented texts, formal justification is offered for the error terms
used for various F tests. For present purposes and in the more informal spirit of
this book, it is sufficient to know that statisticians can prove to their satisfaction
that the error terms given here are correct. However, it is useful to consider some
ways in which the mean squares for the between-subjects and within-subjects F
tests described in the previous several paragraphs differ.
The proportion of total variance associated with the mean square for the
within-subjects F test is smaller than the proportion associated with the between-
subjects F test. There is an easy way to visualize this situation. Consider the drug
versus placebo independent groups version of the lie detection study. There are
10 subjects, 10 scores, and 9 degrees of freedom total. Therefore total variability
can be represented with nine predictor variables, one representing drug group
and the remaining eight representing residual error between subjects or error
within groups. Before, we had formed only the first predictor variable, the one
that represents drug group (labeled X in Fig. 10.2), but it is possible to complete
the set, forming the remaining eight predictor variables (see Fig. 15.1).
In Fig. 15.1 the predictor variable representing drug group is labeled A and
the eight predictor variables representing subjects are labeled S1-S8. Predictor
variable A is coded 1 for the drug subjects and 0 for the placebo subjects. Five
subjects are nested within each group. Predictor variables S1-S4 represent the
subjects nested within the drug group, and predictor variables S5-S8 represent
the subjects nested within the placebo group. Variables S1-S4 are coded as
though the first five subjects each represented different groups. Variable S1 is
coded 1 for subject one, variable S2 is coded 1 for subject two, S3 is coded 1 for
subject three, and S4 is 1 for subject four. Subject five, whose values for S1-S4
are all 0, represents, in effect, a comparison subject for the drug group. Similarly,
variable S5 is coded 1 for subject six (the first subject in the placebo group), S6 is
coded 1 for subject seven, S7 is coded 1 for subject eight, and S8 is coded 1 for
subject nine. Subject ten, like subject five, represents a comparison subject but
for the placebo group. However, subjects five and ten differ in their value for
variable A. Thus a different pattern of predictor variable values is associated with
each of the 10 scores.
It might seem economical to use only four predictor variables for subject,
repeating the S1-S4 pattern for both drug and placebo subjects. Codes for
variable A distinguish between drug and placebo groups, so this scheme would
also assure a different pattern of predictor variable values for each of the 10
scores. But it would not reflect the reality of the situation. Variable S1, for
example, would be coded 1 for both subjects one and six only if they were the
same subject, and for the independent groups design they are not.
When analyzing for the between-subjects effect of group, first the criterion
scores are regressed on the predictor variable representing drug group (step 1).
We do not bother to perform what is in effect a phantom step 2, regressing the
criterion on all of the N - 1 predictor variables, for two reasons. First, we do not
need to. We know that the additional variance accounted for by step 2 must be
1 - R2, one minus the variance accounted for by step 1, because that is what is
left. Second, multiple-regression routines do not allow N scores to be regressed
on N - 1 predictor variables. In such cases they return an error message.
Now consider the drug versus placebo repeated measures version of the lie
detection study. There are 5 subjects, 10 scores, and 9 degrees of freedom total.
Again, total variability can be represented with nine predictor variables. This
time, however, only four represent subject. Variable S1 is coded 1 for subject one,

s A S1 S2 S3 S4 S5 S6 S7 S8
1 1 1 0 0 0 0 0 0 0
2 1 0 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0 0 0
4 1 0 0 0 1 0 0 0 0
5 1 0 0 0 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0
7 0 0 0 0 0 0 1 0 0
8 0 0 0 0 0 0 0 1 0
9 0 0 0 0 0 0 0 0 1
10 0 0 0 0 0 0 0 0 0
FIG. 15.1. Predictor variables representing all nine degrees of freedom for a
single-factor between-subjects study. There are 10 subjects, the between-
subjects predictor variable is labeled A, and variables S1 through S8 code for
subjects nested within groups.
variable S2 is coded 1 for subject two, S3 is coded 1 for subject three, and S4 is 1
for subject four. As before, subject five, whose values for S1-S4 are all 0,
represents a comparison subject (see Fig. 15.2). Again, subject is treated as a
factor and each subject represents a level of that factor, which is why N - 1
predictor variables are required for the subject factor. In this case, however, the
pattern of codes for S1-S4 is repeated because each subject contributes two
scores. For example, the codes for S1-S4 are the same for the first and sixth rows
because those rows represent the same subject.
The fifth predictor variable, labeled P in Fig. 15.2, represents drug group.
Again, it is coded 1 for drug group and 0 for placebo, but now it represents a
within-subjects variable. At this point, we have coded two factors, subject and
drug. Next, and as you would expect from the factorial designs discussed in the
previous chapter, is the interaction term. It is represented with four predictor
variables formed by multiplying each predictor variable for subject with the
predictor variable for drug group (see Fig. 15.2). At this point, it is not possible to
form further predictor variables. The four associated with the S x P (subject x
within-subjects drug group) interaction have exhausted the nine original degrees
of freedom. In effect, there are no degrees of freedom within groups (i.e.,
subjects), which is a consequence of having "groups" that contain a single subject.
Thus it should not be surprising that the proper error term for analysis of a
within-subjects factor is the subject x within-subjects factor interaction.
The analysis of the within-subjects effect of group would again proceed
hierarchically. First the criterion scores are regressed on the predictor variables
representing subject (step 1), and then on the predictor variables representing
subject and drug group (step 2). Again, we can regard the step that adds the
predictor variables representing the drug by subject interaction as a phantom
third step, one that completes the present analysis. The R2 after this step
necessarily must be one—with 9 predictor variables and 9 degrees of freedom, all
variance is accounted for. Similarly, the final sum of squares must be 52, the
total sum of squares for the 10 scores. The next exercise asks you to complete
steps 1 and 2 for this analysis of a single within-subjects factor.

s S1 S2 S3 S4 P PS1 PS2 PS3 PS4
1 1 0 0 0 1 1 0 0 0
2 0 1 0 0 1 0 1 0 0
3 0 0 1 0 1 0 0 1 0
4 0 0 0 1 1 0 0 0 1
5 0 0 0 0 1 0 0 0 0
1 1 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
FIG. 15.2. Predictor variables representing all nine degrees of freedom for a
single-factor within-subjects study. There are 5 subjects and 10 scores,
variables S1 through S4 code for subject, the within-subjects predictor
variable is labeled P, and variables PS1 through PS4 represent the PxS
interaction.

Exercise 15.1
A Single-Factor Repeated-Measures Study
The template that results from this exercise allows you to analyze data from a
single-factor within-subjects study. You will use data from the lie detection study
but will assume that five subjects were tested twice.
1. Modify the spreadsheet shown in Fig. 11.2. Add columns for the dummy-
coded subject variables, enter values for the dummy codes, and change
labels as appropriate. In particular, rows for subjects should be labeled 1-5,
not 1-10. Each subject occupies two rows (because there are two scores for
each subject) and the dummy codes for each subject's two rows are the
same (it is the same subject, after all). What distinguishes each of the two
rows is the level of the repeated factor (drug versus placebo). Consistent
with our earlier spreadsheet, the drug factor can be dummy coded as well.
2. Do step 1. Regress number of lies on the four coded subject variables.
Enter the correct formula for Y'. What is the value of R2 and SS for this
model? The significance of this covariate (the subject factor) is not
especially important. If R2 is large, it only means that subjects behaved fairly
consistently over trials.
3. Do step 2. Regress number of lies on the four coded subject variables plus
the coded variable for drug treatment. Enter the correct formula for Y'. What
is the value of R2 and SS for this model?
4. What is the increase in R2 and SS from step 1 to step 2? How many
degrees of freedom does the error (residual) term have? What is the
significance of the change in R2?
5. Examine the predicted scores after steps 1 and 2. How do they appear to be
computed? How do the values for the regression constant and coefficients
appear to be computed?

After steps 1 and 2 your spreadsheets should look like those shown in Figs.
15.3 and 15.4. Careful scrutiny of these spreadsheets can help you understand
exactly how an analysis of a within-subjects factor proceeds. The primary
question, of course, is whether drug matters—if we know whether a person
received the drug or a placebo, can we more accurately predict the number of lies
the expert will detect? If we do not know which treatment the subject received,
we would simply guess the mean number of lies detected for that subject,
averaging over both treatment conditions. And indeed, these are the predicted
scores at step 1, when only coded variables for subjects are in the equation. Note
that the predicted scores are the same for each subject across both treatment
conditions, and that the predicted scores for each subject are the number of lies
detected for that subject, averaged over both conditions. Thus the predicted
score for subject 1, whose scores were 3.0 and 4.0 for the two treatment
conditions, is 3.5; the predicted score for subject 4, whose scores were 6.0 and
7.0, is 6.5; and so forth (see Fig. 15.3). These predictions are not perfect, of
course, but predictions made with knowledge of the particular subject involved
allow us to account for 63.8% of the criterion variance. In other words, 63.8% of
the variability in number of lies detected is due to variation between subjects.
We already know that the mean score for subjects exposed to the drug is 4.2
and for subjects exposed to the placebo is 6.4. In other words, subjects who
received the drug had 1.1 fewer, and subjects who received the placebo had 1.1
more, lies detected than average (the grand mean is 5.3). Thus, if predicting
scores knowing both the individual involved and the treatment condition, instead
of just the individual as in the last paragraph, we would refine our previous
predictions by subtracting 1.1 from the mean for the drug condition and adding
1.1 for the placebo condition, which is exactly how the predicted scores in Fig.
15.4 were formed. Thus instead of predicting 3.5 for subject 1, we predict 2.4 for
the drug and 4.6 for the placebo condition; instead of predicting 6.5 for subject 4,
we predict 5.4 for the drug and 7.6 for the placebo condition, and so forth (see
Fig. 15.4). This allows us to account for 94% of criterion variability, which
represents a statistically significant increase of 30.2% (F(1,4) = 20.17, p < .05).
Thus drug matters. In addition to knowing the subjects involved, if we also know
whether they received a drug or a placebo, our predicted scores for the number of
lies detected will improve significantly.
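
Readers who would like an independent check on these numbers can reproduce the
two regression steps in a few lines of code. The following minimal sketch is in
Python with NumPy (our own illustration, not part of the spreadsheet template;
the helper name r_squared is ours):

import numpy as np

# Lie-detection scores from Fig. 15.3: five subjects, each tested twice.
y = np.array([3, 2, 4, 6, 6,      # drug condition, subjects 1-5
              4, 5, 7, 7, 9],     # placebo condition, subjects 1-5
             dtype=float)

# Coded predictors as in Fig. 15.2: four subject dummies (S1-S4) repeated
# for both conditions, plus one code for drug group (P).
S = np.vstack([np.eye(5)[:, :4]] * 2)
P = np.array([1]*5 + [0]*5, dtype=float)

def r_squared(X, y):
    # Proportion of criterion variance accounted for by an ordinary
    # least-squares fit of y on X (an intercept is added automatically).
    X = np.column_stack([np.ones(len(y)), X])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

r2_1 = r_squared(S, y)                          # step 1, subjects only: ~.638
r2_2 = r_squared(np.column_stack([S, P]), y)    # step 2, plus drug:     ~.940
ss_total = np.sum((y - y.mean())**2)            # 40.1
F = ((r2_2 - r2_1) * ss_total / 1) / ((1 - r2_2) * ss_total / 4)
print(round(r2_1, 3), round(r2_2, 3), round(F, 2))   # 0.638 0.94 20.17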

A B C D E F G H I J K
1 Lies Drug y= m= e=
2 s Y S1 S2 S3 S4 P Y' Y-My Y'-My Y-Y'
3 1 3 1 0 0 0 1 3.5 -2.3 -1.8 -0.5
4 2 2 0 1 0 0 1 3.5 -3.3 -1.8 -1.5
5 3 4 0 0 1 0 1 5.5 -1.3 0.2 -1.5
6 4 6 0 0 0 1 1 6.5 0.7 1.2 -0.5
7 5 6 0 0 0 0 1 7.5 0.7 2.2 -1.5
8 1 4 1 0 0 0 0 3.5 -1.3 -1.8 0.5
9 2 5 0 1 0 0 0 3.5 -0.3 -1.8 1.5
10 3 7 0 0 1 0 0 5.5 1.7 0.2 1.5
11 4 7 0 0 0 1 0 6.5 1.7 1.2 0.5
12 5 9 0 0 0 0 0 7.5 3.7 2.2 1.5
13 Sum= 53 53 0 0 0
14 N= 10 N= 10 10 10
15 Mean= 5.3 VAR= 4.01 2.56 1.45
16 a,b= 7.5 -4 -4 -2 -1 SD= 2.002
17 R= 0.799 R2= 0.638

L M N O
1 sstot ssmod sserr
2 y*y m*m e*e

13 SS= 40.1 25.6 14.5


14 df= 9 4 5
15 MS= 4.456 6.4 2.9
16 SD'= 2.111 1.703
17 R2adj= 0.349 F= 2.207
FIG. 15.3. Spreadsheet for analyzing the effect of drug (within-subjects
factor P) on number of lies told. Only predictor variables for subject have
been entered (step 1).
It is both helpful and illuminating to view the analysis of the variance
accounted for by a within-subjects factor as an analysis of covariance. In chapter
11 we analyzed whether drug contributed significantly to the variance in lies
detected, once the effect of mood was taken into account (i.e., controlled
statistically). In chapter 13 we analyzed whether gender of infant contributed
significantly to variance in number of smiles, once the age of the infant was taken
into account. And in this chapter, we analyzed whether drug contributed
significantly to the variance in lies detected, once the individual subject was taken
into account. When a within-subjects factor is under consideration, the fact that
individuals vary between themselves is not of interest. That is why between-
subject variability is controlled or, in other words, removed from consideration at
the outset.

A B C D E F G H I J K
1 Lies Drug y= m= e=
2 s Y S1 S2 S3 S4 P Y' Y-My Y'-My Y-Y'
3 1 3 1 0 0 0 1 2.4 -2.3 -2.9 0.6
4 2 2 0 1 0 0 1 2.4 -3.3 -2.9 -0.4
5 3 4 0 0 1 0 1 4.4 -1.3 -0.9 -0.4
6 4 6 0 0 0 1 1 5.4 0.7 0.1 0.6
7 5 6 0 0 0 0 1 6.4 0.7 1.1 -0.4
8 1 4 1 0 0 0 0 4.6 -1.3 -0.7 -0.6
9 2 5 0 1 0 0 0 4.6 -0.3 -0.7 0.4
10 3 7 0 0 1 0 0 6.6 1.7 1.3 0.4
11 4 7 0 0 0 1 0 7.6 1.7 2.3 -0.6
12 5 9 0 0 0 0 0 8.6 3.7 3.3 0.4
13 Sum= 53 53 0 0 4E-15
14 N= 10 N= 10 10 10
15 Mean= 5.3 VAR= 4.01 3.77 0.24
16 a,b= 8.6 -4 -4 -2 -1 -2.2 SD= 2.002
17 R= 0.97 R2= 0.94

L M N O
1 SStot SSmod SSerr
2 y*y m*m e*e

13 SS= 40.1 37.7 2.4


14 df= 9 5 4
15 MS= 4.456 7.54 0.6
16 SD'= 2.111 0.775
17 R2adj= 0.865 F= 12.57
FIG. 15.4. Spreadsheet for analyzing the effect of drug (within-subjects
factor P) on number of lies told. Predictor variables for subject and drug
have been entered (step 2).
But note, that in all these cases the analytic strategy is the same. First the
criterion scores are regressed on the covariate, whether that covariate is a
quantitative variable like age or a categorical variable like subject (step 1). Then
variables representing the research factor of interest (e.g., gender of infant or
drug group) are added to the equation (step 2) and the increase in R2 is tested for
significance.

15.3 MODIFYING THE SOURCE TABLE FOR REPEATED MEASURES

The general layout used for the template in the last exercise (and in a number of
previous exercises) has served us well. But now, especially as we increasingly
emphasize analyses consisting of several steps, and are primarily concerned with
the significance of the increases in R2 associated with different steps, some
modifications are in order. After all, neither of the spreadsheets shown in
Figs. 15.3 and 15.4 gives us directly the information we want for the present
analysis, which is whether or not the drug effect is significant. Thus it makes
sense to incorporate a table giving step-by-step statistics into any template used
for an analysis of variance, just as we did for the two-factor between-subjects
factorial study in Exercise 14.7 (see Fig. 14.12) and as we will do for the present
single-factor within-subjects study in the next exercise.
The analysis of variance templates we have developed in the last several
exercises have used a multiple-regression routine to compute the regression
constant and coefficients and then have used those statistics to compute the
predicted values for Y (column H in Figs. 15.3 and 15.4). The predicted value of Y
(i.e., Y') was then used to compute the SSmodel and the SSerror. These, in turn, were
used to compute R2, the proportion of variance accounted for by the variables
whose regression coefficients were included in the calculation of Y'. All of this
was not only useful but seemed pedagogically justified. However, there is a
shortcut to R2. Almost all multiple-regression programs give its value. We could
use this value directly, rather than computing R2 ourselves as we have been
doing. This would simplify the procedure. After identifying the dependent
variable, we would invoke the multiple-regression procedure repeatedly,
specifying the independent variables and the output range for each step.
Computing Y', its associated statistics, and R2 after each step could be bypassed.
Thus values of R2 can be taken directly from the multiple-regression routine
and from them the R2 change for each step can be computed. Analysis of variance
source tables, however, traditionally give the sum of squares for each step, not the
R2 change. However, as we have already noted, the SSchange is easy to compute. The
model sum of squares for each step is easily computed from R2, as we noted in
the previous chapter. Recall that:

R2 = SSmodel / SStotal

Therefore, by simple algebra:

SSmodel = R2 × SStotal

The SStotal depends only on Y and the mean of Y, not on Y', thus it does not vary
from step to step. For example, SStotal = 40.1 in both Figs. 15.3 and 15.4. Thus,
given the value of R2 for each step, it is a simple matter to compute both the
corresponding model sum of squares for that step and the change in sum of
squares from the previous step. It is not necessary to compute predicted values
at each step in order to derive the corresponding model sum of squares.
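
For example, given the R2 values just computed, a few lines suffice to recover
every sum of squares in the source table (an illustrative Python sketch of ours,
not part of the template):

ss_total = 40.1                       # the same at every step (Figs. 15.3, 15.4)
r2 = {1: 0.638, 2: 0.940, 3: 1.0}     # R2 after each step
ss_model = {step: r * ss_total for step, r in r2.items()}
ss_drug  = ss_model[2] - ss_model[1]  # ~12.1, SS for the drug effect
ss_error = ss_model[3] - ss_model[2]  # ~2.4, SS for the PS error term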
The major elements of an ANOVA source table are SSs, dfs, MSs, and Fs.
Sums of squares we have just discussed and MSs and Fs are easily computed
given the SSs and their corresponding dfs. Moreover, spreadsheets are such
versatile computing tools that we might as well let them compute degrees of
freedom instead of entering values explicitly as we have been doing. And we
might as well make the formulas used as general as possible. In chapter 14 we
gave general formulas for degrees of freedom for two- and three-way between-
subjects factorial studies (Figs. 14.8 and 14.9). Now, in Fig. 15.5, we provide
general formulas for degrees of freedom for a single-factor within-subjects study.
These are useful, not only for the template developed during the course of the
next exercise, but also as a basis for understanding how the single-factor within-
subjects design accounts for variance.
Earlier we introduced A, B, C to symbolize between-subjects factors and a, b,
c to symbolize the number of levels for each of those factors. Now we introduce
P, Q, R to symbolize within-subjects factors and p, q, r to symbolize the number
of levels for those factors (see Fig. 15.5). As always, N symbolizes the number of
subjects. For a single-factor within-subjects study, the total (between + within)
degrees of freedom is the number of scores minus one (N subjects each
contribute p scores so there are a total of Np scores):

dftotal = Np - 1

As before, the degrees of freedom between subjects is the number of subjects
minus one:

dfbetween subjects = N - 1

There are three ways to derive the degrees of freedom within subjects. Taking the
scores free to vary approach, we would note that for a single subject p - 1 scores
can vary (because there are p scores for each subject) and because there are N
subjects, the total degrees of freedom within subjects is N times p - 1:

dfwithin subjects = N(p - 1) = Np - N

Or, taking the number of parameters approach, we would note that there are Np
scores initially but that in computing deviations we need to estimate means for N
subjects, and hence there are Np - N degrees of freedom left. Finally, we could
determine the degrees of freedom within subjects by subtraction:

dfwithin subjects = dftotal - dfbetween subjects = (Np - 1) - (N - 1) = Np - N

Source Degrees of freedom


S, subjects N-1
TOTAL between subjects N-1
P main effect p - 1
PS interaction (p - 1)(N - 1)
TOTAL within subjects Np-N
TOTAL (between + within) Np-1
FIG. 15.5. Degrees of freedom for a single-factor within-subjects study. The
number of levels for within-subjects factor P is symbolized with p and the
number of subjects is symbolized with N.

Degrees of freedom for the components of total within-subjects variance (P and
PS) also make sense. The degrees of freedom for p levels is p - 1, the number of
predictor variables:

dfP = p - 1

And the degrees of freedom for the PS interaction (the error term) is simply the
degrees of freedom for P multiplied by the degrees of freedom for S (the degrees
of freedom between subjects):

dfPS = (p - 1)(N - 1)

For the next exercise, these formulas are incorporated into an analysis of variance
source table.
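
These formulas are simple enough to check mechanically. The following small
Python function (an illustration of ours, not part of the exercise templates)
partitions the degrees of freedom for any N and p:

def df_one_within(N, p):
    # Degrees-of-freedom partition for a single-factor within-subjects
    # study: N subjects, p levels of within-subjects factor P (Fig. 15.5).
    return {"between subjects": N - 1,
            "P main effect": p - 1,
            "PS (error)": (p - 1) * (N - 1),
            "total within": N * (p - 1),
            "total (between + within)": N * p - 1}

print(df_one_within(5, 2))   # the lie-detection example: 4, 1, 4, 5, 9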

Exercise 15.2
Source Table for a Single-Factor Repeated-Measures Study
The template developed for this exercise creates an analysis of variance source
table for the template you developed for Exercise 15.1 to analyze data from a
single-factor within-subjects study. It provides a summary of the results for the
lie detection study, assuming that five subjects were each tested twice, once with
a drug and once with a placebo.

General Instructions
1. On a new worksheet, add a source table that summarizes the analysis in
Exercise 15.1. You may want to use Fig. 15.6 as a guide. Use general
formulas for degrees of freedom. Experiment with different values for N, a,
and p before you enter the correct values for the present example. Do all
degrees of freedom sum as they should?
2. Invoke the multiple-regression routine in order to compute step 1 and 2
values for R2. Provide formulas to compute other sums of squares, changes
in R2 and SS, the required mean squares, and the F ratio.
3. Answer question 9 of the detailed instructions.
Detailed Instructions
1. Enter labels in cells A1 through I1 as indicated in Fig. 15.6. Label the steps in
cells A2 through A4 and the sources of variance in cells B2 through B5. In
cells A7, A8, and A9 enter the labels "N=", "a=", and "p=", respectively.
2. Enter the number of subjects in B7 and the number of levels for the single
within-subjects variable in B9. There is no between-subjects variable, hence
no levels for it, so enter zero in cell B8.
3. Enter the appropriate formulas for degrees of freedom in column F (see Fig.
15.5). If done correctly, df for between subjects, the P main effect and the
PS error term should sum to the df for the Total (between + within).
4. At this point, you have a template that partitions degrees of freedom for any
no-between, one-within study. Experiment with different values for N and p
(cells B7 and B9) and note the effect on degrees of freedom. Now replace N
and p with 5 and 2, the values for this exercise.
5. Do step 1. Invoke the multiple-regression routine using the data from your
exercise 15.1 spreadsheet, specifying lies (column B) as the dependent
variable and the coded variables for subjects (columns C-F) as the
independent variables. Point the R2 for step 1 (cell C2) to the R2 computed
for this step by the multiple-regression routine. This is the proportion of total
variance accounted for by the subject factor.
6. Do step 2. Again invoke the multiple-regression program, this time
specifying subject plus drug variables (columns C-G) as the independent
variables. Point the R2 for step 2 (cell C3) to the R2 just computed. This is
the proportion of total variance accounted for by the subject and drug factors
together.
7. Do step 3. As previously noted, actually performing the regression implied
by the implicit or phantom step 3 would exhaust the degrees of freedom.
Therefore simply enter 1 (all variance accounted for) in cell C4. This is the
value of R2 for step 3. Then point cell E4 to cell O13 on the exercise 15.1
spreadsheet. This is the error sum of squares remaining after step 2, which
is the SS for the phantom step 3.
8. The change in R2 and SS for each step can now be computed. Enter
appropriate formulas in columns D and E. As a check, enter summation
formulas in cells D5 and E5. Do the sums of the changes in R2 and SS equal
1 and SStotal, as they should?
9. Finally, enter the appropriate formulas for MSs, F, and pη2 in cells G3, G4,
H3, and I3. What is the critical value for this F ratio? Is it significant?
What is pη2? How do you interpret the results?
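
The quantities asked about in question 9 can also be checked independently of
the spreadsheet; for instance, assuming SciPy is available (this brief sketch is
our addition):

from scipy.stats import f as f_dist

F, df_num, df_den = 20.17, 1, 4
f_crit = f_dist.ppf(.95, df_num, df_den)   # critical F(1, 4) at alpha = .05, ~7.71
p_eta2 = 12.1 / (12.1 + 2.4)               # SS_P / (SS_P + SS_error), ~.834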

It is instructive to compare the analysis of variance results when drug is a
between-subjects factor (Fig. 10.2) with those produced when it is within subjects
(Fig. 15.6). Although the same 10 scores are analyzed, the F ratios for the drug
effect are dramatically different. The F ratio for the between-subjects study is
3.46 whereas the F ratio for the within-subjects study is 20.17. The total sum of
squares (SStotal = 40.1) and the sum of squares associated with the drug effect
(SSmodel = SSP main effect = 12.1) are the same in both cases, of course. But for the
between-subjects study, with its 10 different subjects, the remaining sum of
squares (SSerror = 28) is all within-group or between-subjects error.
For the within-subjects study, on the other hand, with its five subjects
assessed twice, the sum of squares between subjects (whose value is 25.6) is
irrelevant and is removed at the outset, during step 1. Once the drug effect sum
of squares (whose value is 12.1) is also removed, the sum of squares remaining—

A B C D E F G H I
1 Step Source R2 R2change SS df MS F pη2
2 1 TOTAL btwn Ss 0.638 0.638 25.6 4
3 2 P (Drug) 0.94 0.302 12.1 1 12.1 20.17 0.834
4 3 PS (error) 1 0.06 2.4 4 0.6
5 TOTAL (B+W) 1 40.1 9
FIG. 15.6. Spreadsheet showing an analysis of variance source table for a
single-factor within-subjects study analyzing the effect of drug on number
of lies detected.
the error or PS interaction sum of squares—is only 2.4. There are fewer degrees of
freedom for the within subjects as compared to the between-subjects error term
(4 vs. 8); still, because the within-subjects error term is so much smaller, the
mean square for error is considerably smaller for the within-subjects study (0.6
vs. 3.5), so what was an insignificant drug effect with a between-subjects (or
independent groups) study becomes, given identical data but half as many
subjects, a significant drug effect with a within-subjects (or repeated-measures)
study. The analyses of these particular data dramatically demonstrate the
advantage of using a repeated-measures factor.

Exercise 15.3
A Single-Factor Repeated-Measures Study in SPSS
In this exercise you will analyze the lie-detection study, last described in Exercise
15.2, using the repeated measures option under the General Linear Model
procedure of SPSS.
1. Create a new SPSS data file with three variables. Create one variable for
subject number (s), one variable for number of lies detected in the drug
condition (drug), and a third variable for number of lies detected in the
placebo condition (placebo). Give each variable a meaningful label.
2. Enter 1-5 in the subject number column. In the drug column, enter the
number of lies detected for subjects 1-5 in the drug condition. Do the same
in the placebo condition. Thus, the SPSS file set up for a repeated-measures
study will have a single row for each subject with the scores from each level
of the repeated measures variable in separate columns.
3. Select Analyze->General Linear Model->Repeated Measures from the
main menu. In the Repeated Measures Define Factor(s) window, type in a
name to describe the repeated measures factor in the Within-Subject
Factor Name box (e.g., group). Enter 2 for the number of levels of the factor
and click Add to move this factor definition to the lower box. Click the Define
button.
4. After the Repeated Measures window opens, move the drug and placebo
variables to the right-hand Within Subjects Variables box. Click on
Options and check the Estimates of Effect Size box. Click Continue and
then OK.
5. Examine the output. For the moment, ignore the Multivariate Tests and
Mauchly's Test of Sphericity. Look at the Sphericity Assumed lines in the
Tests of Within-Subjects Effects box. The SS, df, MS, F, and pη2 values
should agree with your spreadsheet results. Do they?

A Second Study With a Single Within-Subjects Factor


The previous example consisted of a study with a single within-subjects factor
measured on two levels. The single factor was drug group and the two levels were
drug or placebo treatments. In the interest of generality, a second example is
presented. This study, like the previous one, includes a single within-subjects
factor but it is measured on four instead of two levels. Recall the button-pushing
study first described in chapter 11. There it was presented as a single-factor
between-subjects study: Sixteen subjects categorized into four groups were
exposed to videotapes and each time they thought the infant had done something
communicative they pushed a button. But imagine instead that the study
includes only four subjects and each subject is exposed to four different
videotapes, each portraying an infant with a different diagnosis (Down syndrome,
fetal alcohol syndrome, very low birth weight, and a no diagnosis comparison
group). Again subjects are asked to push a button whenever they think the infant
has done something communicative.
For the previous example, the between-subjects factor was want/have
children status. For the present example, the within-subjects factor is infant's
diagnosis. But the research question remains the same: Is the mean number of
button pushes significantly different for different levels of the factor? The
analysis required to answer this question, assuming that infant's diagnosis is a
within-subjects factor, is accomplished in the next exercise.

Exercise 15.4
A Second Single-Factor Repeated-Measures Study
This exercise provides additional practice in analyzing data from a single-factor
within-subjects study. The single factor, infant's diagnosis, comprises four levels.
You will use data from the button-pushing study but will assume that four
subjects were each exposed to four videotapes, each videotape portraying an
infant with a different diagnosis.

General Instructions
1. Analyze the 16 scores shown in Fig. 11.5. Assume that four subjects
contributed four scores each, that the 1st, 5th, 9th, and 13th scores were
contributed by one subject, the 2nd, 6th, 10th, and 14th by a second subject,
and so forth. Assume further that the first score for each subject represents
the number of button pushes when viewing a Down syndrome infant, the
second when viewing a fetal alcohol syndrome infant, the third a very low
birth weight infant, and the fourth a no diagnosis comparison infant. Use
dummy-coded variables to represent subjects and type of infant.
2. First regress number of button pushes on the coded variables for subjects,
then on the coded variables for subjects plus the coded variables for
diagnostic group. Enter results, including those for the phantom third step, in
a source table like that shown in Fig. 15.6.

At this point your spreadsheets should look like those shown in Figs. 15.7
and 15.8. Again, it is instructive to compare the analysis of variance results when
group (want/have children status) is a between-subjects factor (Fig. 12.2) with
those produced when group (infant's diagnosis) is within subjects (Fig. 15.8).
The total sum of squares (SStotal = 7438) and the sum of squares associated with
the group effect (SSmodel = SSP main effect = 4880) are the same in both cases.
However, in contrast to the lie detection example, the variability between subjects
for the repeated-measures button-pushing study was not especially large
(SSbetween subjects = 850.5), which means that little variability in button pushing can
be accounted for by knowing the particular subject involved. As a result, when
the group factor was redefined from between- to within-subjects, there was a
dramatic reduction in the error sum of squares for the lie detection study but not
for the button-pushing study.
For the between-subjects version of the button-pushing study (16 subjects
categorized by status group), the error term (the mean square for error) was
2558/12 = 213.2 and for the within-subjects version (4 subjects exposed to
videotapes of 4 infants representing different diagnostic groups), it was 1707.5/9
= 189.7. The F ratio was somewhat higher for the within- compared to the
between-subjects versions (8.57 vs. 7.63), but this difference is hardly significant.
The partitioning of the total sum of squares for the two versions of the button-
pushing study, both of which analyze the same data, is compared explicitly in Fig.
15.9.
In general it is true that within-subjects factors result in more powerful tests
and so, other things being equal, are to be preferred. However, as the present
comparison illustrates, the extent of the advantage depends on the nature of the
data analyzed. Specifically, if the correlation between the repeated measures is
small, and if little variance is accounted for by the subject factor, then the usual
advantage is diminished. The research question (Are there group differences?) is
the same, of course, no matter whether a between-subjects or a within-subjects
factor is analyzed. For the button-pushing data the group effect was significant
no matter whether group was a between-subjects or a within-subjects factor—no
matter whether subjects belonged to different want/have children status groups
or infants from different diagnostic groups were viewed by the same subjects.
No matter whether effects are between or within subjects, if more than two
groups are defined and if the group effect is significant, next we would want to
know exactly how the groups differed among themselves. Post hoc tests for the

A B C D E F G H I
1 #BPs SS tot
2 s Y S1 S2 S3 P1 P2 P3
3 1 102 1 0 0 1 0 0 256
4 2 125 0 1 0 1 0 0 1521
5 3 95 0 0 1 1 0 0 81
6 4 130 0 0 0 1 0 0 1936
7 1 79 1 0 0 0 1 0 49
8 2 93 0 1 0 0 1 0 49
9 3 75 0 0 1 0 1 0 121
10 4 69 0 0 0 0 1 0 289
11 1 43 1 0 0 0 0 1 1849
12 2 82 0 1 0 0 0 1 16
13 3 69 0 0 1 0 0 1 289
14 4 66 0 0 0 0 0 1 400
15 1 101 1 0 0 0 0 0 225
16 2 94 0 1 0 0 0 0 64
17 3 84 0 0 1 0 0 0 4
18 4 69 0 0 0 0 0 0 289
19 Sum= 1376 7438
20 N= 16
21 Mean= 86
FIG. 15.7. Spreadsheet for analyzing the effect of infant's diagnosis (within-
subjects factor P) on number of button pushes (coded predictor variables).
A B C D E F G H I
1 Step Source R2 R2change SS df MS F pη2
2 1 TOTAL btwn Ss 0.114 0.114 850.5 3 283.5
3 2 P (Diagnosis) 0.77 0.656 4880.0 3 1626.7 8.574 0.741
4 3 PS (error) 1 0.23 1707.5 9 189.7
5 TOTAL (B+W) 1 7438 15 495.9
FIG. 15.8. Spreadsheet for analyzing the effect of infant's diagnosis (within-
subjects factor P) on number of button pushes (source table).
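
The step 1 and step 2 values in Fig. 15.8 can be verified with the same
least-squares approach sketched earlier for the lie-detection data (again an
illustrative Python sketch of ours, assuming NumPy):

import numpy as np

def r_squared(X, y):
    # Same helper as in the lie-detection sketch.
    X = np.column_stack([np.ones(len(y)), X])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

# Button pushes from Fig. 15.7, four subjects by four diagnoses.
y = np.array([102, 125, 95, 130,    # Down syndrome
               79,  93, 75,  69,    # fetal alcohol syndrome
               43,  82, 69,  66,    # very low birth weight
              101,  94, 84,  69],   # no-diagnosis comparison
             dtype=float)
S = np.tile(np.eye(4)[:, :3], (4, 1))        # subject dummies S1-S3
P = np.repeat(np.eye(4)[:, :3], 4, axis=0)   # diagnosis dummies P1-P3
print(r_squared(S, y))                       # step 1: ~0.114
print(r_squared(np.hstack([S, P]), y))       # step 2: ~0.770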

between-subjects version of the button-pushing study were illustrated in chapter
13. For the within-subjects version, the Tukey test is again applied. Recall from
chapter 13 (Equation 13.4) that the formula for the Tukey critical difference is

TCD = q √(MSerror / n)

The number of groups or G is still 4, as is the number of scores per group or n.
The degrees of freedom for error is now 9, not 12, which changes the value of q,
and the MSerror is now 189.7. Thus the value of TCD in this case is 30.4, which
happens to be not much different from the 30.7 we computed earlier for the
between-subjects version.
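
For readers checking this arithmetic in code, SciPy's studentized range
distribution (available in SciPy 1.7 and later; this sketch is our addition)
gives both q and the TCD:

from math import sqrt
from scipy.stats import studentized_range   # requires SciPy >= 1.7

G, n = 4, 4                    # number of groups, scores per group
df_error, ms_error = 9, 189.7
q = studentized_range.ppf(.95, G, df_error)   # ~4.41
tcd = q * sqrt(ms_error / n)                  # ~30.4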
For these data, it happens that the post hoc results are the same for both
between- and within-subjects versions (see Fig. 15.10). Specifically, for the
within-subjects study, subjects pushed the button significantly more for Down
syndrome infants (M = 113) than for either fetal alcohol syndrome (M = 79) or
very low birthweight infants (M = 65). However, the number of button pushes
for these three groups did not differ significantly from the number for the no
diagnosis comparison group (M = 87).
In earlier chapters you learned how to decide whether mean scores for the
levels or groups defined by a research factor differed significantly, assuming that
different subjects were assessed at each level (i.e., within each group). In this
chapter you have learned how to decide whether mean scores for the groups
defined by levels of a research factor differ, assuming instead that subjects are
assessed repeatedly—that each subject contributes a score to each level of the
research factor. The general research question, however, remains the same: Can
predictions concerning a subject's criterion score be significantly improved if they
are made with knowledge of the particular level of the research factor associated
with that score? Does the research factor matter?

Analysis
1-between (Fig. 12.9) 1-within (Fig. 15.8)
Source SS df Source SS df
S 851 3
A 4880 3 P 4880 3
S/A 2558 12 PxS 1708 9
FIG. 15.9. Partitioning total sums of squares and degrees of freedom for a
one-between and a one-within study with the same data.
Analysis
Post-hoc statistic 1-between (Fig. 12.9) 1-within (Fig. 15.8)
G 4 4
dferror 12 9
q 4.20 4.41
MSerror 213.1 189.7
n 4 4
TCD 30.7 30.4
FIG. 15.10. Statistics needed for a post hoc test for a one-between and a
one-within study with the same data.

Within-subjects as opposed to between-subjects studies have certain
advantages. Other things being equal, analysis of within-subjects factors is more
powerful, which means that fewer subjects are required to decide that effects of a
specified size are statistically significant. But not all factors are good candidates
for within-subjects factors. Some factors are too reactive. For example, a subject
may remember a test and so react differently at a later assessment only because it
is a later assessment and not because the second assessment occurs after an
experimental treatment. And some factors, like socioeconomic status, describe
individual differences and are inherently between subjects. Nonetheless, studies
can often use within-subjects factors to good advantage. Levels of commonly
used within-subjects factors can represent different instructions, different forms
of a test, or different settings (night vs. day, home vs. laboratory, and so forth).
They can also distinguish between pre- and posttreatment assessments (but see
chap. 17) and can represent age in longitudinal studies (e.g., 9-, 12-, and 15-
month-old infants).

15.4 ASSUMPTIONS OF THE REPEATED-MEASURES ANOVA


Repeated-measures ANOVA requires that the typical assumptions of random
sampling, normality, and equal variances are met. Given that there are the same
number of observations at each treatment level and the data come from the same
subjects, the homogeneity of variances assumption is unlikely to be violated, and
often not even tested. When you have more than two levels of the repeated
measures factor, however, an additional assumption, sphericity, is required to
ensure that the F ratio is unbiased. Sphericity holds when the variances of the
difference scores between all levels of a factor are the same. A difference score is
simply a subject's score at one level subtracted from his or her score at another
level. For a four-level factor there are six possible pairs of levels: 1-2, 1-3, 1-4,
2-3, 2-4, and 3-4. The variances of these six sets of difference scores should be
roughly equal to satisfy the sphericity assumption.
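
For data arranged one row per subject and one column per level, these
difference-score variances are easy to compute; a brief illustrative sketch
(ours, assuming NumPy) follows:

import itertools
import numpy as np

def diff_score_variances(scores):
    # scores: one row per subject, one column per level of the repeated
    # factor. Returns the variance of the difference scores for every
    # pair of levels; sphericity holds when these are roughly equal.
    scores = np.asarray(scores, dtype=float)
    pairs = itertools.combinations(range(scores.shape[1]), 2)
    return {(i + 1, j + 1): np.var(scores[:, i] - scores[:, j], ddof=1)
            for i, j in pairs}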
Sphericity is estimated by a statistic called epsilon (ε). If sphericity is met
perfectly, then ε will have a value of 1. Epsilon has a lower bound of 1/(k - 1).
Thus, any value between 1 and the lower bound indicates that sphericity is
violated to some extent. There are two generally accepted methods for
calculating ε. The first, the Greenhouse-Geisser estimate (Greenhouse &
Geisser, 1959), is the more conservative approach. The second, the Huynh-Feldt
estimate (Huynh & Feldt, 1976), is more liberal. When violations of sphericity
are not extreme, the Huynh-Feldt is considered more accurate. As the severity
of the violation increases,
however, the Greenhouse-Geisser is the more appropriate estimate. The

calculations for the two estimates of ε are beyond the scope of this book, but
appear in the output of most major statistical packages.
Many statistical packages also provide a statistical test of sphericity. SPSS
calculates Mauchly's W. Such tests have very little power when sample sizes are
small. For this reason, most researchers assume that sphericity is violated and
proceed as described in the next subsection.

The Modified Univariate Approach


The spreadsheet calculations for a repeated-measures design hold when
sphericity is assumed. They also provide a nice demonstration of the logic behind
a repeated-measures design. However, when sphericity is violated (and it often is
to some extent), then it is necessary to make some adjustments to your
calculations. When sphericity is violated, the F ratio becomes positively biased.
Therefore one way to correct for sphericity violations is to adjust your degrees of
freedom based on the estimate of ε. To do this, Box (1954) suggested multiplying
the df numerator and df denominator by ε, and then evaluating the significance
of F using these new values. Another, extremely conservative, approach is to
evaluate F at the lower bound. In other words, set df for the numerator equal to
1, and df for the denominator equal to N - 1.
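
In code, the Box correction amounts to scaling both degrees of freedom before
looking up the p value; a minimal sketch (ours, assuming SciPy; the function
name box_corrected_p is hypothetical):

from scipy.stats import f as f_dist

def box_corrected_p(F, df1, df2, eps):
    # p value for F with both dfs multiplied by an epsilon estimate.
    # eps = 1 gives the sphericity-assumed test; eps = 1/(k - 1) gives the
    # lower-bound test; a Greenhouse-Geisser or Huynh-Feldt estimate of
    # epsilon gives the Box-corrected test.
    return f_dist.sf(F, eps * df1, eps * df2)

# Lower-bound test for the diagnosis example of Exercise 15.4:
# box_corrected_p(8.574, 3, 9, 1/3)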
Given that there are two methods of estimating ε, there are four possible
levels of significance for the repeated-measures factor. One assumes sphericity
and is the method presented in the spreadsheet; the other three, just described,
involve corrections to the degrees of freedom. It is possible to select one method
based on the importance of a Type I error relative to a Type II error, but most
researchers use a strategy known as the modified univariate approach. This
strategy consists of three steps:

1. Evaluate the significance of F assuming sphericity. If the F is not
significant then retain the null hypothesis. If the F is significant then
proceed to step 2.
2. Evaluate the significance of F at the lower bound. If the F is significant,
then the null hypothesis is rejected. If the F is not significant, then
proceed to step 3.
3. Evaluate the significance of F using the Box correction (i.e., multiply df
by e). You can use either the Greenhouse-Geisser or Huynh-Feldt
estimate.

The Multivariate Approach


An alternative to the modified univariate approach is to treat each set of
difference scores as a separate dependent variable and then conduct a
multivariate analysis of variance. Sphericity is not an assumption of the
multivariate tests, but they tend to be less powerful than the univariate approach.
A general rule of thumb is that if there appears to be some violation of sphericity
(e.g., ε < .80 or so) and N > 30 plus the number of levels of the repeated-
measures factor, then the multivariate approach is recommended. When N is
small and the multivariate tests are not significant, then the modified univariate
approach may be more powerful.

Exercise 15.5
SPSS Analysis of a Repeated-Measures Study with Four Levels
In this exercise you will analyze the data for the four-level repeated-measures
study presented in Exercise 15.4.
1. Create a new SPSS data file for the button pushes study that is set up for a
repeated-measures analysis. You will need five variables, one for the subject
number (s), and four for each level of the infant diagnosis factor: Down
syndrome (ds), fetal alcohol syndrome (fas), low birth weight (lbw), and no
diagnosis comparison (cntrl). Give each variable a meaningful label.
2. Enter 1-4 in the s column and the appropriate number of button presses for
each of the four cases in the remaining columns.
3. Select Analyze->General Linear Model->Repeated Measures from the
main menu. In the Repeated Measures Define Factor(s) window, type in a
name to describe the repeated measures factor in the Within-Subject
Factor Name box (e.g., diag for diagnosis). Enter 4 for the number of levels
of the factor and click Add to move this factor definition to the lower box.
Click the Define button.
4. Move each of the diagnosis levels to the Within Subjects Variables box,
and check the Descriptive Statistics and Estimates of Effect Size boxes
under Options. Run the analysis.
5. Check your descriptive statistics. Do they agree with your spreadsheet?
Examine the Sphericity Assumed lines in the output. Do the SS, df, MS, F,
and pη2 values agree with your spreadsheet?
6. Look in the box labeled Mauchly's Test of Sphericity. Mauchly's test is not
significant, indicating that the sphericity assumption has been met.
Remember, however, that Mauchly's test is underpowered when the sample
size is small. It would therefore be prudent, in this case, to assume that there
is at least some violation of sphericity. Note that the lower bound is .33:
1/(k - 1) = 1/(4 - 1) = .33. Also note that the Greenhouse-Geisser and Huynh-Feldt
estimates of e differ by a large amount. This is due to the small number of
cases in the study. Typically the two estimates would be closer.
7. Examine the Multivariate Tests. None are significant, but remember that N is
small and the multivariate tests are not very powerful under these
circumstances. Apply the modified univariate approach. What is your
statistical decision based on this approach?

From this chapter, you have learned specifically how to evaluate the effect of
a single within-subjects factor. Little new was needed conceptually, just
additional applications of the hierarchical approach emphasized throughout this
text, an understanding that variability due to subjects can be regarded as a
covariate and so can be dispensed with at step 1, and an understanding that
subject can serve as a factor and can be represented with dummy-coded
variables. The next chapter extends your understanding of repeated-measures
studies. Studies with more than one factor, at least one of which is repeated, are
described and additional post hoc tests for within-subjects studies are
demonstrated.
16 Two-Factor Studies With
Repeated Measures

In this chapter you will:

1. Learn how to analyze data from a two-factor study when factors are
mixed, one between subjects and one within subjects.
2. Learn how to analyze data from a two-factor study when both factors are
within subjects.
3. Learn how to compute the degrees of freedom associated with the main
effects and interactions for these two kinds of repeated-measures studies.
4. Learn how to perform post hoc tests for analyses involving repeated
measures.

The two studies presented in the last chapter included a single repeated measure
or within-subjects factor. Subjects were assessed more than once and the levels
of the within-subjects factor formed the repeated assessments or groups for
analysis. Thus for the last example, subjects were presented with videotapes of
four infants, each with a different diagnosis. The purpose of the analysis was to
determine whether the mean number of times subjects pushed the button was
significantly different for infants with different diagnoses. Each subject saw
videotapes representing all four diagnostic groups, so our only concern was with
variability within subjects. Variability between subjects was irrelevant to the
question at hand, which involved within-subject effects, and for that reason it was
removed from consideration, or statistically controlled, at the outset.
In this chapter we consider more complex possibilities. Two studies are
examined: Both studies involve two factors, at least one of which is within
subjects. But as you will see, the general approach developed in the last chapter
applies to these more complex repeated-measures studies as well. The common
feature is controlling for between-subject variability, exactly as though the
subject factor were a covariate.

16.1 ONE BETWEEN- AND ONE WITHIN-SUBJECTS FACTOR

In this section we consider one example of a mixed study. This is a study that
has a between-subjects factor in addition to a repeated-measures factor.

We again use data from the button-pushing study but change the assumed
procedural arrangements to reflect a mixed two-factor study. In chapter 14 we
used these data to exemplify a 2 x 2 between-subjects factorial. The 16 subjects
were arranged in four groups and the four groups represented males and females
(factor A) given either instruction set I or II (factor B). In this and the following
section, two new examples are presented. All three of the examples represent
2 x 2 factorials, and for all three, the four groups are the same (i.e., group 1
consists of males exposed to set I; group 2, males given set II; etc.). However,
this section and the following section demonstrate how these data would be
analyzed if one or both of the factors were within subjects.
For the mixed example presented in this section, imagine there were only 8
subjects instead of 16, that 4 were male and 4 female, and all subjects were tested
twice, once after receiving instruction set I, once after set II. Thus, for the
present example, we are assuming the study consists of a single between-subjects
factor, gender of subject, and a single within-subjects factor, instruction set.
For the two single-factor repeated-measures studies presented in the last
chapter, the subjects constituted a single group and were not subdivided further.
Dummy-coded variables were defined for subjects and each subject received a
unique code—in effect, constituting each subject as a group of one—but the
subjects were not grouped in any other way. In other words, there were no
between-subjects factors as such, only the subject factor nested within the single
group, which was used to represent between-subject variability. Here, however,
subjects are divided into a male and a female group, which introduces gender as a
between-subjects factor.
For all repeated-measures studies, part of the total variance is between
subjects (recall Equation 15.3):
SStotal (between + within) = SStotal between subjects + SStotal within subjects

The degrees of freedom associated with the between subjects variance is N - 1.
For a single-factor within-subjects study, no further subdividing of total between-
subjects variance is possible (see Fig. 15.5). However, for a mixed design like the
current example, total between-subjects variance is partitioned into components
representing the between-subjects main effect (or effects), their interactions (if
there is more than one between-subjects factor), and the residual, which is
between-subjects error. The present example includes one between-subjects
factor (not counting the nested subject factor). In this case, total between-
subjects variance is partitioned into two portions (see Fig. 16.1). One part
(symbolized A) is associated with the between-subjects factor (in this case,
gender of subject) and the rest with residual between-subjects error (subjects
within groups, symbolized S/A).
The other part of total variance for repeated-measures studies is within
subjects. The degrees of freedom associated with the within-subjects variance is
Np - N (where p is the number of levels for the within-subjects factor P). For the
single-factor repeated-measures studies presented in the last chapter, one
portion (symbolized P) was associated with the effect of the repeated measure
and the remaining portion, the error term, with the repeated factor by subjects
interaction (symbolized PS, i.e., P x S). For the present example, the effect of the
repeated measure is still symbolized P, but now an additional portion of within-
subjects variance is associated with the interaction of the between and within-
subjects factors. This is usually symbolized as AP, but in this book often PA is
used instead because it makes the partitioning of variance easier to follow (see
Fig. 16.1).
The residual or error variance is again the interaction of the repeated factor
with subjects—or, more correctly, the interaction of the repeated factor with
subjects within groups (symbolized PS/A, i.e., P x S/A). For repeated-measures
studies, subject is treated as a factor nested within groups. For the single-group
repeated-measures studies described in the last chapter, within-subjects error
variance was symbolized as PS. With only one group, it did not seem necessary to
specify that subjects were nested within the one group, although we could have
used a notation like PS/G instead of PS. But for the present example with its two
independent groups (males and females), within-subject error is associated with
the repeated measure by subjects within groups interaction. As just noted, this is
symbolized as PS/A and signals the presence of a single between-subjects factor,
A, in addition to the within-subjects factor, P.
The point of all this, of course, is to be able to determine whether the
subject's gender, the instructions given, and/or the gender x instruction
interaction affect the number of times subjects believe they have noted infants
engaging in communicative acts. As noted in chapter 14 when discussing
factorial designs generally, significant interactions are often informative. This is
as true for interactions involving repeated measures as for interactions involving
only between-subjects factors. For example, in the present case we might learn
that the effect of the instructions given was conditional on the gender of the
person receiving the instruction. However, in order to test the various main
effects and interactions for significance, we need to know first how to partition
variance and degrees of freedom for mixed two-factor studies, and second, which
error terms are used to test which effects.
The preceding discussion is summarized in Fig. 16.1, which lists sources of
variance and general formulas for their associated degrees of freedom for a mixed
two-factor study. It is worth noting several points of similarity with the
corresponding table for a single-factor within-subjects study (see Fig. 15.5). As
before, the total degrees of freedom is the total number of scores minus one, Np -
1. Likewise, the total degrees of freedom between subjects is N - 1 and the total
degrees of freedom within subjects is N(p - 1) or Np - N.
The subdivisions of total between and total within variance, however, are
somewhat different for the mixed two-factor design. The N - 1 degrees of
freedom between subjects are divided into a - 1 degrees of freedom for the A
main effect (a - 1 predictor variables), which leaves N - a degrees of freedom for
subjects within groups (S/A). This is exactly how degrees of freedom were
computed earlier for the between-subjects error term of a single-factor between-

Source Degrees of freedom


A main effect a -1
S/A, subjects within A N -a
TOTAL between subjects N - 1
P main effect p -1
PA interaction (p - 1)(a - 1)
PS/A interaction (p - 1)(N - a)
TOTAL within subjects Np - N
TOTAL (between + within) Np -1
FIG. 16.1. Degrees of freedom for a mixed two-factor study. The number of
levels for between-subjects factor A is symbolized with a, and for within-
subjects factor P, with p. The number of subjects is symbolized with N.
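
As with Fig. 15.5, these formulas can be verified mechanically; a small Python
function (our illustration, not part of the exercise templates) partitions the
degrees of freedom for any N, a, and p:

def df_mixed_two_factor(N, a, p):
    # Degrees-of-freedom partition for a mixed two-factor study (Fig. 16.1):
    # N subjects, a levels of between factor A, p levels of within factor P.
    return {"A": a - 1,
            "S/A": N - a,
            "P": p - 1,
            "PA": (p - 1) * (a - 1),
            "PS/A": (p - 1) * (N - a),
            "total (between + within)": N * p - 1}

print(df_mixed_two_factor(8, 2, 2))   # Exercise 16.1: 1, 6, 1, 1, 6, 15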
subjects study (see Fig. 12.10). Further, the Np - N degrees of freedom within
subjects are divided into p - 1 degrees of freedom for the P main effect, p - 1
times a - 1 degrees of freedom for the PA interaction, and p - 1 times N - a
degrees of freedom for the P by subjects within groups interaction. The value of
p - 1 (degrees of freedom for the repeated measure) times N - a (degrees of
freedom for subjects within groups) is the degrees of freedom for the error term
used to evaluate the significance of the P main effect and the PA interaction.
In the previous chapter, variance for a single-factor repeated-measures study
was partitioned into three components—symbolized as S (for subjects within
groups), P, and PS (see Fig. 15.5)—and each was associated with a step in the
regression procedure used to analyze criterion variance. For the current mixed
two-factor study, variance is partitioned into five components, symbolized as A,
S/A, P, PA, and PS/A (see Fig. 16.1):

1. A, between-subjects main effect.
2. S/A, between-subjects error term.
3. P, within-subjects main effect.
4. PA, between x within interaction.
5. PS/A, error term for P and PA effects.

As before, each of these is associated with a multiple-regression step. The R2 and
SS are determined for the first four steps; the last or phantom step exhausts the
degrees of freedom and accounts for all variance. The steps and exact
computations required for analysis of the present mixed two-factor example are
demonstrated in the next exercise.
Before beginning the next exercise, you need to know how to code the N - 1
between-subjects predictor variables when there is a between-subjects factor.
This is an important although somewhat technical detail. For the single-factor
repeated-measures studies described in the previous chapter this was easy
enough. Subjects were dummy coded. The first subject was coded one for the
first predictor variable and zero otherwise, the second subject was coded one for
the second predictor variable and zero otherwise, and so forth, and the last
subject was coded zero for all between-subjects predictor variables (see Figs. 15.3,
15.4, and 15.7). Subjects constituted a single group. In other words, there was no
between-subjects factor (other than the nested subject factor) and hence no need
for predictor variables that coded group membership.
In the present case, there are eight subjects, so seven predictor variables are
required to represent the total between-subjects effect. There are two groups of
subjects, however—male and female—so one of those seven predictor variables
has to code for the between-subjects factor, A or gender. The remaining six
predictor variables code for the between-subjects error term, S/A (whose degrees
of freedom are, of course, six). These are dummy coded, but nested within group.
What this means should become clear in the course of the next exercise, if not the
next paragraph.
Recall that the first predictor variable codes for gender. Using contrast
codes, males might be -1 and females +1 (see Fig. 16.2). The second predictor
variable is coded one for the first male and zero otherwise, the third predictor
variable is coded one for the second male and zero otherwise, and the fourth
predictor is coded one for the third male and zero otherwise. The fourth and final
male in his group receives codes of zero for all between-subjects predictor
variables except the first (which is -1 indicating a male). Similarly, the fifth
predictor variable is coded one for the first female and zero otherwise, the sixth
predictor variable is coded one for the second female and zero otherwise, and the
seventh predictor is coded one for the third female and zero otherwise. The
fourth and final female in her group receives codes of zero for all between-
subjects predictor variables except the first (which is +1, indicating a female).
Sometimes students ask, why not use just three predictor variables, in
addition to the one coding for gender, to indicate the S/A effect? The first
variable could be coded -1 for male and +1 for female. Then the second variable
could be coded one for first subject in group (for both the first male and the first
female) and zero otherwise, the third variable could be coded one for second
subject in the group and zero otherwise, and so forth. Such a coding, however,
implies that the subject factor is crossed instead of nested and some subjects
were randomly assigned to receive the "male" treatment, others the "female"
treatment. However, gender is usually believed to be a relatively enduring
attribute not so easily subject to experimental manipulation. In fact, we define a
group (male or female) and then select subjects from within each group, which is
what nesting implies. The between-subjects predictor variables reflect this
reality, which is why the first predictor variable codes group membership, the
second three indicate males within the male group, and the final three indicate
females within the female group—which reflects, of course, the division of the
seven between-subjects degrees of freedom into one for the effect and six for
error. The next exercise requires that you form predictor variables according to
these principles.
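
To see the scheme concretely, the following NumPy sketch (our illustration,
using the row order described in Exercise 16.1) builds all nine predictor
variables:

import numpy as np

# Row order as in Exercise 16.1: four males under set I, the same males
# under set II, then four females under sets I and II.
gender = np.repeat([-1.0, 1.0], 8)              # A: -1 = male, +1 = female
inst = np.tile(np.repeat([-1.0, 1.0], 4), 2)    # P: -1 = set I, +1 = set II

block = np.tile(np.eye(4)[:, :3], (2, 1))       # dummies for subjects 1-3;
zeros = np.zeros_like(block)                    #   the 4th subject is all zeros
S = np.vstack([np.hstack([block, zeros]),       # males:   S1-S3 coded
               np.hstack([zeros, block])])      # females: S4-S6 coded

X = np.column_stack([gender, S, inst, gender * inst])
print(X.shape)   # (16, 9): A, six nested subject codes, P, and AP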

Exercise 16.1
Analysis of a One-Between, One-Within Two-factor Study
The template developed for this exercise allows you to analyze data from a
mixed two-factor study. You will use data from the button-pushing study, but will
assume that a group of eight men and a second group of eight women were each
exposed to two different sets of instructions.

General Instructions
1. Analyze data from the button-pushing study assuming two factors, one
between and one within subjects. Assume the first four scores are for males
exposed to instruction set I and the second four scores are for the same
males exposed to set II. Likewise, assume scores 9-12 are for females
exposed to set I and scores 13-16 are for the same females exposed to set
II.
2. Establish seven predictor variables for the between-subjects variables. One
codes gender (use contrast codes) and the remaining six code for subjects
nested within gender. Use dummy codes as described in the last paragraph
before this exercise. You may also want to examine Fig. 16.2 to see an
example of how the nesting of subjects within groups works. Establish two
additional predictor variables, one for the within-subjects variable (use
contrast codes) and one for its interaction with the dummy-coded variable for
gender.
3. Guided by Fig. 16.1, create a source table appropriate for this study using
general formulas for degrees of freedom. Compute the total R2 for each of
the four steps. Provide correct formulas for all steps, including the phantom
fifth step. Check that all formulas are correct and then answer the questions
in item 19 of the detailed instructions.
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 15.7. Insert three columns after
column H. The present example requires nine predictor variables whereas
the previous required six, so an additional three columns are needed.
2. Enter the labels "A," "S1," "S2," "S3," "S4," "S5," "S6," "P," and "AP" in cells
C2-K2. These columns will contain the coded variable for the between-
subjects variable gender, the remaining six coded subject variables (for a
total of seven), the coded variable for instruction group, and the variable for
the gender x instruction interaction. Enter the label "Sex" in cell C1 and the
label "Inst" in cell J1.
3. Enter the subject numbers in column A, cells 3-18. Recall that for the 2 x 2
factorial example using these same data (Fig. 15.12) subjects 1-8 were male
and 9-16 were female; subjects 1-4 and 9-12 were exposed to instruction
set I and 5-8 and 13-16 to set II. For this example, we want to keep the
characteristics (gender of subject, instruction set) of the four groups the
same, at the same time changing instruction to a within-subjects variable.
Thus use the numbers 1-4 for the male subjects, who appear in rows 3-6
and again in rows 7-10, and use the numbers 5-8 for the female subjects,
who appear in rows 11-14 and again in rows 15-18.
4. There are 8 subjects, so there are 7 (N - 1) coded between-subjects
variables in all. Enter contrast codes for the first between-subjects predictor
variable, which represents gender of subject, in column C. Use -1 for males
and +1 for females. Remember that the first eight scores are for the males
(subjects 1-4) and the last eight scores are for the females (subjects 5-8).
5. Next enter dummy-coded variables for the remaining six between-subject
variables, S1-S6, in columns D-I. Remember that the codes used for
subject 1 must be the same everywhere subject 1 appears; likewise for the
other subjects. (In this case, lines 3 and 7, 4 and 8, ..., 14 and 18, and so
forth, are the same; see Fig. 16.2.) Note how the codes are formed within
groups (recall that the subject factor is nested within groups). All codes are
set to zero initially. The first variable is coded one for the first subject in the
first group. The next variable is coded one for the second subject, and so
forth, up to but not including the last subject within the group. The next
variable after that is coded one for the first subject in the second group, and
so forth. This ensures that the appropriate number of predictor variables (N -
a) are coded for between-subjects error (subjects within groups or S/A).
6. Enter contrast codes for instruction set in column J. Use -1 for set I and +1
for set II. Remember that the first four scores for the male and for the female
subjects represent the number of button pushes for instruction set I.
7. In column K enter a formula for the gender x instruction interaction (the code
for gender multiplied by the code for instruction set, or the code in column C
times the one in column J). At this point, values for the nine coded predictor
variables for all 16 scores should be in place.
8. On a separate sheet in the workbook, create a source table for the results of
your hierarchical regression. In the first row, enter the labels as shown in Fig.
16.3. In column A enter the labels "N=", "p=", and "a=", in rows 9, 10, and 11,
respectively. In column B, enter the number of subjects and the numbers of
levels for the within- and between-subjects factors.
9. Label the sources of variance in column B (rows 2-7) as shown in Fig. 16.3.
10. In column F enter the appropriate degrees of freedom using the values you
entered in cells B9-B11; these formulas are given in Fig. 16.1. If done
correctly, the degrees of freedom for the between components should sum to
total between, the within components should sum to total within, and the total
between and total within should sum to the grand total (between + within).
11. At this point, you have a spreadsheet that partitions degrees of freedom for
any mixed two-factor design. Experiment with different values for N, a, and p
and note the effect on how degrees of freedom are apportioned. Now
replace N, a, and p with 8, 2, and 2, the values for this exercise.
12. Do step 1. Invoke the multiple-regression routine, specifying number of
button pushes (column B) as the dependent variable and gender (column C)
as the independent variable. Point step 1 RSQ in the source table to the R2
output by the regression program for this step.
13. Do step 2. Again invoke the multiple-regression program, this time
specifying all between-subjects variables (columns C-I) as the independent
variables. Point step 2 RSQ to the R2 just computed.
14. Do step 3. This time the independent variables are the between-subjects
variables plus instruction set (columns C-J). Point step 3 RSQ to the R2 just
computed.
15. Do step 4. The independent variables are the between-subjects variables,
instruction set, and the gender x instruction interaction (columns C-K). Point
step 4 RSQ to the R2 just computed.
16. The phantom step 5 would exhaust the degrees of freedom. Enter 1 (all
variance accounted for) in cell C6. Enter the SStotal in cell E7 of the source
table.
17. As a check, you should enter the regression statistics from the last step in
cells B22-K22, correct the prediction equation in column L, and see if the R2
computed in cell M23 agrees with the value computed by the regression
program.
18. Complete the source table. The SSs for each step and changes in R2s and
SSs for each step can now be computed. Enter formulas for SSs in column
E. Enter the appropriate change or summation formulas in column D.
19. Finally, ensure that the appropriate formulas for MSs, Fs, and partial η2s are
entered in columns G, H, and I. What are the critical values for the three F
ratios? Which are significant? How do you interpret these results? (A code
sketch of these hierarchical steps appears below.)
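If you want to verify the spreadsheet computations independently, the following
minimal sketch in Python with NumPy (our own illustration, not part of the book's
materials; all variable names are ours) performs the same four hierarchical steps
and recovers the sum of squares for each source from the change in R2 at each step.

import numpy as np

# The 16 button-push scores (males first, then females; set I before set II).
y = np.array([102, 125, 95, 130, 79, 93, 75, 69,
              43, 82, 69, 66, 101, 94, 84, 69], dtype=float)

A = np.repeat([0.0, 1.0], 8)               # dummy code: 0 = male, 1 = female
P = np.tile(np.repeat([-1.0, 1.0], 4), 2)  # contrast code: -1 = set I, +1 = set II
AP = A * P                                 # gender x instruction interaction

# Six dummy-coded subject variables, nested within gender; the last subject in
# each group (subjects 4 and 8) is coded all zeros.
S = np.zeros((16, 6))
order = [1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8]  # subject for each score
col = {1: 0, 2: 1, 3: 2, 5: 3, 6: 4, 7: 5}
for i, s in enumerate(order):
    if s in col:
        S[i, col[s]] = 1.0

def rsq(*blocks):
    """R-squared from regressing y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones_like(y)] + list(blocks))
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - (e @ e) / (((y - y.mean()) ** 2).sum())

steps = [rsq(A), rsq(A, S), rsq(A, S, P), rsq(A, S, P, AP), 1.0]  # phantom step 5
ss_total = ((y - y.mean()) ** 2).sum()                            # 7438
print(np.round(np.diff([0.0] + steps) * ss_total, 1))
# SS for A, S/A, P, PA, and PS/A, in that order

The printed values should agree, within rounding, with the sums of squares in the
source table: 1600, 1130, 144, 3136, and 1428.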

Your results should look like those given in Figs. 16.2 and 16.3. It is worth
comparing the results of the two-factor one-between, one-within analysis of
variance just performed with the results of the two-between factorial analysis
performed in chapter 14 (see Fig. 14.12). The sums of squares associated with the

     A    B                  C      D          E     F   G      H      I
1    Step Source             R2     R2 change  SS    df  MS     F      pη2
2    1    A (Gender)         0.215  0.215      1600  1   1600   8.496  0.586
3    2    S/A (error)        0.367  0.152      1130  6   188.3
4    3    P (Instruction)    0.386  0.019      144   1   144    0.605  0.092
5    4    PA (Gend x Inst)   0.808  0.422      3136  1   3136   13.18  0.687
6    5    PS/A (error)       1      0.192      1428  6   238
7         TOTAL (B+W)               1          7438  15
FIG. 16.3. Spreadsheet for analyzing the effect of gender (between-subjects
factor A) and instruction set (within-subjects factor P) on number of button
pushes (source table).
gender (SSA = 1600) and instruction (SSB = SSP = 144) main effects and the
gender x instruction interaction (SSAB = SSAP = 3136) remain the same. But the
error sums of squares change. The error term for the two-factor between-
subjects study (SSS/AB = 2558) is, in effect, split into two pieces, an error term for
the between-subjects factor (SSS/A = 1130) and an error term for the within-
subjects factor (SSPS/A = 1428). For these data, it happens that the two different
analyses did not give dramatically different results. For both, the gender effect
and the gender x instruction interaction are significant. But note that the one-
between, one-within study required half as many subjects as the two-factor
between-subjects study.
As noted in the previous paragraph, for the present example both the gender
main effect (F(1,6) = 8.50, p < .05) and the gender x instruction interaction
(F(1,6) = 13.2, p < .05) are significant. This suggests that males and females
responded differently to the different instructions. The gender main effect is
qualified by a higher order interaction; consequently, the nature of the
interaction needs to be understood before we interpret the gender main effect,
and that requires a post hoc test. In the last chapter we analyzed these same
scores, assuming that the four groups defined here by a 2 x 2 factorial (one-
between, one-within) represented four levels of a single within-subjects factor
(infant's diagnosis, see Fig. 15.7). The within-subjects group effect was

     A     B     C   D   E   F   G   H   I   J    K    L
1          #BPs  Sex                         Inst      SStot
2    s     Y     A   S1  S2  S3  S4  S5  S6  P    PA   y*y
3    1     102   0   1   0   0   0   0   0   -1   0    256
4    2     125   0   0   1   0   0   0   0   -1   0    1521
5    3     95    0   0   0   1   0   0   0   -1   0    81
6    4     130   0   0   0   0   0   0   0   -1   0    1936
7    1     79    0   1   0   0   0   0   0   1    0    49
8    2     93    0   0   1   0   0   0   0   1    0    49
9    3     75    0   0   0   1   0   0   0   1    0    121
10   4     69    0   0   0   0   0   0   0   1    0    289
11   5     43    1   0   0   0   1   0   0   -1   -1   1849
12   6     82    1   0   0   0   0   1   0   -1   -1   16
13   7     69    1   0   0   0   0   0   1   -1   -1   289
14   8     66    1   0   0   0   0   0   0   -1   -1   400
15   5     101   1   0   0   0   1   0   0   1    1    225
16   6     94    1   0   0   0   0   1   0   1    1    64
17   7     84    1   0   0   0   0   0   1   1    1    4
18   8     69    1   0   0   0   0   0   0   1    1    289
19   Sum=  1376                                        7438
20   N=    16
21   Mean= 86
FIG. 16.2. Spreadsheet for analyzing the effect of gender (between-subjects
factor A) and instruction set (within-subjects factor P) on number of button
pushes (predictor variables).
significant but we delayed interpretation of the results, which likewise required a
post hoc test, until this chapter. In a subsequent section, the appropriate post
hoc tests for both these studies are described, but first a third exemplar of a 2 x 2
factorial study, one that includes only within-subjects factors, is presented.

Equality of Covariances
When a study includes between- and within- subjects factors, an additional
assumption, the homogeneity of covariances, is required. This assumption states
that the covariances of the levels of the within-subjects factor are the same for
each group. In the present case, this would mean that the covariance between
set I scores and set II scores is the same for males and females. A test of this
assumption, as described in the next exercise, is provided by SPSS.
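As a rough numerical illustration of what this assumption refers to (this is our
own sketch in Python, not a substitute for Box's M, which SPSS computes for you
in the next exercise), you could compute the set I-set II covariance separately for
each group:

import numpy as np

# Each row pairs one subject's set I and set II scores from the button study.
males = np.array([[102, 79], [125, 93], [95, 75], [130, 69]], dtype=float)
females = np.array([[43, 101], [82, 94], [69, 84], [66, 69]], dtype=float)

# The assumption concerns the equality of these within-group covariances
# in the population.
print(np.cov(males.T)[0, 1], np.cov(females.T)[0, 1])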

Exercise 16.2
SPSS Analysis of a One-Between, One-Within Two-Factor Study
This exercise walks you through an SPSS analysis for a mixed two-factor study.
You will use data from Exercise 16.1.
1. Create a new SPSS data file, or adapt one from Exercise 15.1. The data file
should contain variables for subject number (s), gender (gen), instruction set I
(set1), and instruction set II (set2).
2. Enter 1 through 8 for the 8 subjects in the s column. Enter 0 for males and 1
for females in the gen column. Finally, enter the number of button pushes for
each subject in the appropriate column for instruction set. Create appropriate
variable labels and value labels for the variables.
3. Select Analyze->General Linear Model->Repeated Measures from the
main menu. In the Repeated Measures Define Factor(s) window, type in a
name to describe the repeated measures factor in the Within-Subject
Factor Name box (e.g., instruct for instruction set). Enter 2 for the number of
levels of the factor and click Add to move this factor definition to the lower
box. Click the Define button.
4. Move set1 and set2 to the Within-Subjects Variables box. Move gen to the
Between-Subjects Factor(s) box. Check the Descriptive statistics,
Estimates of effect size, and Homogeneity tests boxes under Options.
Run the analysis.
5. Check the descriptive statistics to ensure that you set up the variables and
entered the data correctly. Next, check Box's M to determine if there is
homogeneity of covariances. Usually, you would then check the sphericity
assumption, but because there are only two levels of the repeated measure
in this design, ε = 1 and Mauchly's test of sphericity does not apply. Finally,
check Levene's test to make sure the homogeneity of variances assumption
holds for the between-subjects factor.
6. Examine the SS, df, MS, F, significance levels, and partial η2 of the within-subjects
effects. Do they agree with your spreadsheet? Because there are only two
levels of the within-subjects factor, the multivariate tests, sphericity assumed
tests, lower bound, and ε-corrected tests will all yield the same F ratios and
significance levels.
7. Examine the test of the between-subjects effect for gender. Are the SS, MS,
and df correct for the sex factor and the error term? What about the F ratio
and partial η2?

16.2 TWO WITHIN-SUBJECTS FACTORS

In addition to studies with two between-subjects factors (presented in chap. 13),
and mixed studies with one between- and one within-subjects factor (presented
in the previous section), there is a third possibility for studies with two factors:
Both factors could be within subjects. The mixed study discussed in the last
section consisted of two groups, one of males and one of females, and we
assumed subjects in both groups were randomly selected. But now, imagine that
instead of separate individuals, we selected male-female couples. The
sampling unit would be the couple, not the individual.
When couples rather than individuals are sampled, gender of spouse
becomes a within-subjects factor. Each spouse would be exposed separately to
instruction set I and II, and thus each couple would contribute four scores to the
analysis: his scores for the two instruction sets and her scores for the two
instruction sets. In this case, the appropriate analysis would treat both gender
and instruction set as within-subjects variables. It is worthwhile to consider for a
moment how total variance is partitioned for three kinds of repeated-measures
studies: one involving a single within-subjects factor, one involving both a
between- and a within-subjects factor, and one involving two within-subjects
factors.
For a study involving one within-subjects factor, like the examples in the last
chapter, total variance is partitioned into three components: The first component
is associated with variability between subjects, the second with the P main effect,
and the third with the interaction between P and subject (see Fig. 15.5). Mean
squares are formed for these components and the mean square for P is tested
against the mean square for the PS interaction.
For a study involving one within- and one between-subjects factor, like the
example in the last section, total variance is partitioned into five components (see
Fig. 16.1). The first two are associated with variability between subjects. The first
is associated with the A main effect, the second with variability between subjects
within A, and the mean square for A is tested against the mean square for S/A.
The last three are associated with variability within subjects. The third is
associated with the P main effect, the fourth with the P by A (or A by P)
interaction, and the fifth with the P by subjects within A interaction. Mean
squares for P and for PA are tested against the mean square for the PS/A
interaction.
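To make the partition concrete, here is a small Python helper (our own
construction; it simply mirrors the degrees-of-freedom formulas used by the
spreadsheet exercises) for a one-between, one-within design with N subjects, a
between-subjects levels, and p within-subjects levels:

def mixed_df(N, a, p):
    """Degrees-of-freedom partition for a one-between, one-within design."""
    between = {"A": a - 1, "S/A": N - a}          # sums to N - 1
    within = {"P": p - 1,
              "PA": (a - 1) * (p - 1),
              "PS/A": (N - a) * (p - 1)}          # sums to N*p - N
    return between, within                        # grand total is N*p - 1

print(mixed_df(N=8, a=2, p=2))
# For the button-push example: A = 1, S/A = 6, P = 1, PA = 1, PS/A = 6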
For a study involving two within-subjects factors, as for the present example,
total variance is partitioned into seven components (see Fig. 16.4). The first is
associated with variability between subjects and, as with all repeated-measures
studies, can be viewed as an irrelevant source of variance, dispensed with (or
controlled for) at the outset. Because there are no between-subjects factors, it is
not subdivided further. The remaining six components are associated with the P
and Q main effects and the PQ, PS, QS, and PQS interactions:

1. S, between-subjects error.
2. P, first within-subjects main effect.
3. PS, error term for P effect.
4. Q, second within-subjects main effect.
5. QS, error term for Q effect.
6. PQ, P by Q interaction.
7. PQS, error term for P by Q interaction.
As you might guess, generalizing from the previous examples, the mean square
for P is tested against the mean square for the PS interaction, the MS for Q
against the MS for the QS interaction, and the MS for the PQ interaction against
the MS for the PQS interaction.
For a PQ design with N subjects (p levels for within-subjects factor P, q levels
for within-subjects factor Q), there are a total of Npq criterion scores (each
subject or sampling unit contributes p times q scores). Hence the total number of
degrees of freedom associated with the criterion scores is the number of scores
minus one:

    dfTOTAL = Npq - 1
As always, the degrees of freedom between subjects is N - 1 and, as noted
previously, for a two-factor repeated-measures study it is not subdivided further:

    dfBETWEEN = N - 1
The degrees of freedom within subjects is the number of scores minus the
number of subjects, which for the PQ design is Npq - N:

    dfWITHIN = Npq - N
The partitioning of these Npq - N within-subjects degrees of freedom into
separate components is straightforward, as long as you remember that the error
terms for each effect consist of an interaction with subject. Thus the degrees of
freedom associated with the P main effect are p - 1 and the degrees of freedom
for the PS interaction are p - 1 times N - 1:

    dfP = p - 1          dfPS = (p - 1)(N - 1)
Similar reasoning applies to the Q, QS, PQ, and PQS components:

    dfQ = q - 1          dfQS = (q - 1)(N - 1)
    dfPQ = (p - 1)(q - 1)          dfPQS = (p - 1)(q - 1)(N - 1)
Source                      Degrees of freedom
S, subjects                 N - 1
TOTAL between subjects      N - 1
P main effect               p - 1
PS interaction              (p - 1)(N - 1)
Q main effect               q - 1
QS interaction              (q - 1)(N - 1)
PQ interaction              (p - 1)(q - 1)
PQS interaction             (p - 1)(q - 1)(N - 1)
TOTAL within subjects       Npq - N
TOTAL (between + within)    Npq - 1
FIG. 16.4. Degrees of freedom for a two-factor within-subjects study. The
number of levels for within-subjects factors P and Q are symbolized with p
and q respectively. The number of subjects is symbolized with N.
With a little bit of algebraic manipulation, you should be able to convince yourself
that the degrees of freedom for the within-subjects components do indeed sum to
Npq - N, as they should.
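The check is quick to automate; this Python fragment (ours) partitions the
degrees of freedom for any PQ design and asserts that the within-subjects
components sum to Npq - N:

def pq_df(N, p, q):
    """Degrees of freedom for a two-factor within-subjects (PQ) design."""
    within = {"P": p - 1, "PS": (p - 1) * (N - 1),
              "Q": q - 1, "QS": (q - 1) * (N - 1),
              "PQ": (p - 1) * (q - 1),
              "PQS": (p - 1) * (q - 1) * (N - 1)}
    assert sum(within.values()) == N * p * q - N  # the components sum as claimed
    return {"S (between)": N - 1, **within, "TOTAL": N * p * q - 1}

print(pq_df(N=4, p=2, q=2))  # the values used in the next exercise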
The predictor variables required for the hierarchic analysis of a two-factor
repeated-measures study are formed using the same techniques we have
developed for earlier studies. First the between-subjects predictor variables are
formed (step 1). Then predictor variables representing the within-subjects effects
and their interactions with subjects are entered, following a P, PS, Q, QS, PQ,
PQS order (steps 2-7). Exactly how this works is perhaps best conveyed with an
example.

Exercise 16.3
Analysis of a Two-Within Two-Factor Study
The template developed for this exercise allows you to analyze data from a two-
factor study when both factors are within subjects. You use data from the button-
pushing study, but assume that husband and wife pairs were each exposed
separately to two different sets of instructions.

General Instructions
1. Analyze data from the button-pushing study assuming that both factors are
within subjects. Assume the first four scores are for husbands exposed to
instruction set I and the second four scores are for the same husbands
exposed to set II. Likewise, assume scores 9-12 are for their wives exposed
to set I and scores 13-16 are for the same wives exposed to set II.
2. There are four couples; therefore, establish three dummy-coded predictor
variables to identify which couple is associated with each of the 16 scores.
Establish 9 additional predictor variables: 1 codes the gender of the person
associated with the score, 3 the gender x subject (spouse) interaction, 1 the
instruction associated with the score, 3 the instruction by subject interaction,
and the final predictor codes the gender x instruction interaction. It is not
necessary to form the 3 (and final) predictor variables associated with the
gender x instruction by subject interaction explicitly. The R2 associated with
them is the residual R2 when all the other 12 predictor variables have been
entered.
3. Guided by Fig. 16.4, create a source table appropriate for this study using
general formulas for degrees of freedom. Compute the total R2 for each of
the six steps. Provide correct formulas for all steps, including the phantom
seventh step. Check that all formulas are correct and then answer the questions
in part 22 of the detailed instructions.
Detailed Instructions
1. Begin with the spreadsheet shown in Fig. 16.2. Insert three columns after
column K. The present example requires 12 predictor variables whereas the
previous required 9.
2. Enter the labels "S1," "S2," "S3," "P," "S1P," "S2P," "S3P," "Q," "S1Q," "S2Q,"
"S3Q," and "PQ" in cells C2-N2. These columns will contain the coded
variables for the subject factor, for gender and its interaction with subject, for
instruction group and its interaction with subject, and for the gender x
instruction interaction. Move the label "Sex" from cell C1 to F1. Leave the
label "Inst" in cell J1.

3. There are only four subjects (husband-wife pairs) for this example. Enter
appropriate subject numbers in column A.
4. There are 4 subjects; thus there are 3 (N - 1) coded between-subjects
variables. Enter dummy-coded variables for the subject codes, S1-S3, in
columns C-E. Remember the codes used for subject 1 must be the same
everywhere subject 1 appears; likewise for the other subjects. (In this case,
lines 3, 7, 11, and 15, and so forth, are the same; see Fig. 16.6.)
5. Enter contrast codes for gender (spouse) in column F. Use -1 for husbands
and +1 for wives. Assume the first eight scores given (rows 3-10) are for
husbands.
6. In columns G-I enter formulas for the subject by gender interaction (the code
for each subject variable multiplied by the code for gender). The code in
column G is the product of columns C and F, the code in column H is the
product of columns D and F, and the code in column I is the product of
columns E and F.
7. Enter contrast codes for instruction set in column J. Use -1 for Set I and +1
for Set II. Assume the first four scores for the husbands and the first four for
the wives represent the number of button pushes for instruction Set I. (The
contrast codes currently in place from the last exercise should be correct.)
8. In columns K-M enter formulas for the subject by instruction interaction (the
code for subject multiplied by the code for instruction set). The code in
column K is the product of columns C and J, the code in column L is the
product of columns D and J, and the code in column M is the product of
columns E and J.
9. In column N enter a formula for the gender x instruction interaction (the code
for gender multiplied by the code for instruction set, or the product of
columns F and J). At this point, values for the 12 coded predictor variables
for all 16 scores should be in place.
10. Update the source table you created in the last exercise. Change the label
"a=" in cell A11 to "p=".
11. Extend and modify the 1-between, 1-within source table so that it is
appropriate for the present 0-between, 2-within example. Label the sources
of variance as shown in Fig. 16.5.
12. In column F enter the appropriate formulas for degrees of freedom; these
formulas are given in Fig. 16.4. If done correctly, the degrees of freedom for
the within components should sum to total within, and the total between and
total within should sum to the grand total (between + within).
13. At this point, you have a spreadsheet that partitions degrees of freedom for
any PQ design (both factors within subjects). Experiment with different
values for N, p, and q and note the effect on how degrees of freedom are
apportioned. Now replace N, p, and q with 4, 2, and 2, the values for this
exercise.
14. Do step 1. Invoke the multiple-regression routine, specifying number of
button pushes (column B) as the dependent variable and the coded variables
for subject (columns C-E) as the independent variables. Point step 1 R2 in
the source table to the R2 computed by the regression program for this step.
15. Do step 2. Again invoke the multiple-regression program, this time adding
the coded variable for gender (column F) to the independent variable list.
Point step 2 R2 to the R2 just computed.
16. Do step 3. This time add the variables for the gender x subject interaction
(columns G-I) to the equation. Point step 3 R2 to the R2 just computed.
17. Do step 4. Add the variable for instruction set (column J) to the equation.
Point step 4 R2 to the R2 just computed.
18. Do step 5. Add the variables for the instruction set by subject interaction
(columns K-M) to the equation. Point step 5 R2 to the R2 just computed.
19. Do step 6. Add the variable for the gender x instruction set interaction
(column N) to the equation. Point step 6 R2 to the R2 just computed.
20. The phantom step 7 would exhaust the degrees of freedom. Enter 1.00 (all
variance accounted for) in cell C8. Enter the SStotal in cell E9.
21. Complete the source table. The SSs for each step and changes in R2s and
SSs for each step can now be computed. Enter formulas for SSs in column
E. Enter the appropriate change or summation formulas in column D.
22. Finally, ensure that the appropriate formulas for MSs, Fs, and partial η2s are
entered in columns G through I. What are the critical values for the three F
ratios? Which are significant? How do you interpret these results? (A code
sketch of these steps follows below.)
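As promised in item 22, here is a sketch of the same hierarchical steps in Python
with NumPy (our own illustration; variable names are ours). It builds the 12
coded predictors just described and recovers the sums of squares for all seven
sources:

import numpy as np

y = np.array([102, 125, 95, 130, 79, 93, 75, 69,
              43, 82, 69, 66, 101, 94, 84, 69], dtype=float)

# Three dummy-coded couple variables, repeated for each of the four conditions.
S = np.tile(np.vstack([np.eye(3), np.zeros(3)]), (4, 1))
P = np.repeat([-1.0, 1.0], 8)              # gender: -1 husbands, +1 wives
Q = np.tile(np.repeat([-1.0, 1.0], 4), 2)  # instruction: -1 set I, +1 set II
SP, SQ, PQ = S * P[:, None], S * Q[:, None], P * Q

def rsq(*blocks):
    """R-squared from regressing y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones_like(y)] + list(blocks))
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - (e @ e) / (((y - y.mean()) ** 2).sum())

steps = [rsq(S), rsq(S, P), rsq(S, P, SP), rsq(S, P, SP, Q),
         rsq(S, P, SP, Q, SQ), rsq(S, P, SP, Q, SQ, PQ), 1.0]  # phantom step 7
print(np.round(np.diff([0.0] + steps) * ((y - y.mean()) ** 2).sum(), 1))
# Should match the SS column: S, P, PxS, Q, QxS, PQ, PQxS (within rounding)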

At this point, your spreadsheet should look like the one shown in Figs. 16.5 and
16.6. Now you can compare results of three different analyses of the same 2 x 2
factorial data: the two-between, no-within (Fig. 14.12), the one-between, one-
within (Fig. 16.3), and the no-between, two-within (Fig. 16.6) arrangements. For
these data, the interaction effect was quite strong and, as a result, it was
significant no matter the kind of factors employed. But remember, the first study
consisted of 16, the second of 8, and the third of just 4 subjects. This
demonstrates the greater economy often afforded by studies that include within-
subjects factors. In this case, we were able to detect effects present in the data
with fewer subjects when we construed the study as one including repeated
measures.

     A    B                    C      D          E      F   G      H      I
1    Step Source               R2     R2 change  SS     df  MS     F      pη2
2    1    TOTAL btwn Ss        0.114  0.114      850.5  3
3    2    P (gender)           0.329  0.215      1600   1   1600   17.17  0.851
4    3    PxS (error)          0.367  0.038      279.5  3   93.17
5    4    Q (instruction)      0.386  0.019      144    1   144    0.389  0.115
6    5    QxS (error)          0.536  0.149      1110   3   369.8
7    6    PQ (gender x inst)   0.957  0.422      3136   1   3136   29.54  0.908
8    7    PQxS (error)         1      0.043      318.5  3   106.2
9         TOTAL (B+W)                 1          7438   15
FIG. 16.5. Source table for analyzing the effect of gender of spouse and
instruction set (within-subjects factors P and Q) on number of button pushes.

Exercise 16.4
SPSS Analysis of a Two-Within Two-Factor Study
This exercise walks you through an SPSS analysis for a two-factor study in which
both factors are within subjects. You will use data from Exercise 16.3.
1. Create a new SPSS data file, or adapt one from Exercise 16.2. The data file
should contain variables for subject number (s), and button pushes for the
four repeated-measures conditions: husbands in instruction set I (set1h),
wives in instruction set I (set1w), husbands in instruction set II (set2h), and
wives in instruction set II (set2w).
2. Enter 1 through 4 for the four sets of matched scores. Enter the number of
button pushes in the appropriate column for the instruction set and gender
combinations. Create appropriate variable labels and value labels for the
variables.
3. Select Analyze->General Linear Model->Repeated Measures from the
main menu. In the Repeated Measures Define Factor(s) window, type in a
name to describe the first repeated-measures factor, instruction set (e.g.,
instruct for instruction set), in the Within-Subject Factor Name box. Enter 2
for the number of levels of the factor and click Add to move this factor
definition to the lower box. Do the same for the gender variable. Click the
Define button.

     A    B     C   D   E   F   G    H    I    J    K    L    M    N   O
1         #BPs  Ss          Sex                Inst IxS                SStot
2    s    Y     S1  S2  S3  P   S1P  S2P  S3P  Q    S1Q  S2Q  S3Q  PQ  y*y
3    1    102   1   0   0   -1  -1   0    0    -1   -1   0    0    1   256
4    2    125   0   1   0   -1  0    -1   0    -1   0    -1   0    1   1521
5    3    95    0   0   1   -1  0    0    -1   -1   0    0    -1   1   81
6    4    130   0   0   0   -1  0    0    0    -1   0    0    0    1   1936
7    1    79    1   0   0   -1  -1   0    0    1    1    0    0    -1  49
8    2    93    0   1   0   -1  0    -1   0    1    0    1    0    -1  49
9    3    75    0   0   1   -1  0    0    -1   1    0    0    1    -1  121
10   4    69    0   0   0   -1  0    0    0    1    0    0    0    -1  289
11   1    43    1   0   0   1   1    0    0    -1   -1   0    0    -1  1849
12   2    82    0   1   0   1   0    1    0    -1   0    -1   0    -1  16
13   3    69    0   0   1   1   0    0    1    -1   0    0    -1   -1  289
14   4    66    0   0   0   1   0    0    0    -1   0    0    0    -1  400
15   1    101   1   0   0   1   1    0    0    1    1    0    0    1   225
16   2    94    0   1   0   1   0    1    0    1    0    1    0    1   64
17   3    84    0   0   1   1   0    0    1    1    0    0    1    1   4
18   4    69    0   0   0   1   0    0    0    1    0    0    0    1   289
19   Sum= 1376                                                         7438
20   N=   16
21   Mean= 86
FIG. 16.6. Spreadsheet for analyzing the effect of gender of spouse and
instruction set (within-subjects factors P and Q) on number of button pushes.
4. Move set1h, set1w, set2h, and set2w to the correct places in the Within-
Subjects Variables box. Be careful that you move the variables to the
correct location; note the order SPSS expects for the variables above the
window. For example, if you defined instruction set as the first variable in the
previous step, then SPSS expects you to enter all levels of set I first, followed
by all levels of set II. Check the Descriptive Statistics and Estimates of
Effect Size boxes under Options. Run the analysis.
5. Check the descriptive statistics to ensure that you set up the variables and
entered the data correctly.
6. Examine the SS, df, MS, F, significance levels, and partial η2 of the within-subjects
effects. Do they agree with your spreadsheet?

16.3 EXPLICATING INTERACTIONS WITH REPEATED MEASURES

For all three two-factor designs analyzing number of button pushes discussed in
this and previous chapters, the gender main effect and the gender x instruction
interaction were significant. In chapter 14 you were cautioned that significant
main effects should not be discussed until their qualifying interactions are
understood. It does not matter whether factors involved in the interaction are
between or within subjects, the same caution applies.
Moreover, for all three designs, the same test, the Tukey post hoc test
described in chapter 13, can be used to understand the nature of any significant
interactions. There is one qualification. As explained shortly, the MSerror used for
the post hoc analysis of the two-within-subjects study is not the same as the
MSerror from its source table.
The partitioning of the total sum of squares for these three different 2 x 2
factorial studies is shown in Fig. 16.7, which should clarify exactly how variance
components are partitioned, depending on whether the two factors of a 2 x 2
study are both between subjects, mixed, or both within subjects. Fig. 16.8
gives the various statistics required for a post hoc test of a significant interaction
for each of the three studies. The post hoc analysis for the two-between, no-
within study was presented earlier (see Fig. 14.13). Here we see that the number
of groups or G is the same for all three analyses, as is the number of scores per
group or n. However, the degrees of freedom for error varies, as does the MSerror.

2-between, 0-within       1-between, 1-within       0-between, 2-within
(Fig. 14.12)              (Fig. 16.3)               (Fig. 16.6)
Source  SS    df          Source  SS    df          Source  SS    df
                                                    S       851   3
A       1600  1           A       1600  1           P       1600  1
                          S/A     1130  6           PxS     280   3
B       144   1           P       144   1           Q       144   1
                                                    QxS     1109  3
AB      3136  1           PA      3136  1           PQ      3136  1
S/AB    2558  12          PxS/A   1428  6           PQxS    319   3
FIG. 16.7. Partitioning total sums of squares and degrees of freedom for a
two-between, no-within; a one-between, one-within; and a no-between, two-
within study with the same data; sums of squares may not add exactly, due
to rounding.
The dferror for the two-between study is 12, which makes sense when we recall
that 1 degree of freedom each is lost for the A main effect, the B main effect, and
the AB interaction; the MSerror is then read directly from the source table (Fig.
14.12). Similarly, dferror for the one-between, one-within study is 6 because 1
degree of freedom each is lost for the A main effect, the P main effect, and the PA
interaction and an additional 6 are lost for S/A; again the MSerror can be read
directly from the source table (Fig. 16.3).
Matters are somewhat different for the two-within study. Here, the
significant PQ interaction tells us that the means of the four groups defined by
crossing the P and Q factors differ. The two-within analysis separates the error
terms into three, but to perform the Tukey post hoc test, we need just one error
term for the four groups. That is, instead of analyzing these four groups as a two-
factor PQ design, we need to redo the analysis as a four-group analysis, one
that has 3 degrees of freedom for the four groups and 9 degrees of freedom for
the PS error term. In fact, we have already performed this 4-group analysis (see
Fig. 15.8). Thus we know that the MSerror for the Tukey post hoc test for the PQ
interaction is 189.7, as shown in Fig. 16.8.
For these data, the Tukey critical difference (TCD) for the one-between, one-
within and for the no-between, two-within studies are 37.8 and 30.4,
respectively, and, as a result, the post hoc analysis for the one-between, one-
within study is somewhat different from the post hoc results for the two-between,
no-within study presented earlier. The next exercise provides you with the
opportunity to describe exactly how the results of these different analyses vary.
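The arithmetic behind the TCD row of Fig. 16.8 is simply TCD = q * sqrt(MSerror / n);
a few lines of Python (ours) reproduce all three values:

import math

# (q, MSerror) pairs for the three analyses of Fig. 16.8; n = 4 scores per group
for q, ms_error in [(4.20, 213.2), (4.90, 238.0), (4.41, 189.7)]:
    print(round(q * math.sqrt(ms_error / 4), 1))  # 30.7, 37.8, 30.4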

Exercise 16.5
Post Hoc Tests for Repeated-Measures Studies
This exercise provides practice in interpreting post hoc results for studies that
include repeated measures.
1. For the two studies analyzed in this chapter, demarcate differences between
group means that do and do not exceed the Tukey critical difference.
Display your results as in Figs. 13.4 and 13.5. How do you interpret these
results?

                             Analysis
Post hoc      2-between,      1-between,      0-between,
statistic     0-within        1-within        2-within
              (Fig. 14.12)    (Fig. 16.3)     (Fig. 16.6)
G                  4               4               4
dferror           12               6               9
q               4.20            4.90            4.41
MSerror        213.2           238.0           189.7
n                  4               4               4
TCD             30.7            37.8            30.4
FIG. 16.8. Statistics needed for a post hoc test of a significant interaction for
a two-between, no-within; a one-between, one-within; and a no-between,
two-within study with the same data.

16.4 GENERALIZING TO MORE COMPLEX DESIGNS

At this point you should be able to generalize what you have learned about
repeated-measures studies in this and the last chapter to more complex
situations, which is the point of the next exercise.

Exercise 16.6
Degrees of Freedom for Three-Factor Repeated-Measures Studies
For this exercise you are asked to generalize what you have learned about
partitioning total sums of squares and degrees of freedom to more complex
studies involving repeated measures.
1. Create a table, modeled on the one shown in Fig. 16.4, showing sources of
variance and general formulas for degrees of freedom for a two-between,
one-within design. Compute the degrees of freedom if N = 96, a = 2, b = 4,
and p = 3.
2. Now create a table showing sources of variance and general formulas for
degrees of freedom for a one-between, two-within design. Compute degrees
of freedom assuming N = 45, a = 3, p = 2, and q = 4.
3. Describe studies for which these designs would be appropriate. Provide
plausible names and descriptions for each of the factors and its levels.

It is easy to imagine even more complex designs than those listed in the last
exercise. Interpreting the results of the appropriate analyses, however, is less
easy. Studies that include many factors also include many possible interactions
between factors, some of which may turn out to be statistically significant, and
making sense of unexpected but significant second- and third-order interactions
can challenge even the most seasoned investigator. If interactions have been
predicted, there is no problem, of course. Interpretation follows from whatever
theorizing formed the basis for the prediction. But unexpected interactions are
another matter, and often there may be no obvious or plausible explanation for a
particular pattern of results.
Investigators often feel compelled to lavish considerable interpretive energy
on such unexpected and seemingly difficult to explain results. However, because
such results may reflect nothing more than a chance result, one unlikely to
replicate, whatever interpretation is offered should in any case be labeled clearly
for what it is—post hoc speculation, a tale woven after the fact. There is no
reason to pay much attention to such findings unless and until they reappear in
subsequent studies.

Exercise 16.7
Analysis of Number of Hours Spent Studying
This exercise gives you an opportunity to analyze data from a new two-factor
study that includes a repeated measure. In addition, this exercise provides
additional practice with a post hoc analysis of a repeated-measures factor and
introduces a trend analysis as well.
1. A class consists of five males and six females. They are asked to keep
careful records of the number of hours they spend studying statistics on each
of three successive Wednesdays (the data are provided in Fig. 16.9).
                            Number of hours
Subject    Gender     Week 1    Week 2    Week 3
1 M 5 3 5
2 F 6 5 6
3 F 4 6 7
4 F 5 5 8
5 M 5 7 6
6 M 6 6 7
7 M 3 6 6
8 F 5 6 7
9 F 7 5 8
10 F 4 7 6
11 M 5 5 5

FIG. 16.9. Data for sex x week, repeated-measures, trend analysis.

2. Analyze these data using contrast codes for gender (M = -1, F = +1) and
linear (P1) and quadratic (P2) contrast codes for week (see Fig. 16.10).
Verify that the contrast codes for week obey contrast code formation rules.
Use dummy-coded variables for subject. It is not necessary to reorder the
subjects, putting all males and all females together, but you must pay
attention to which subjects are nested within which gender group.
3. Perform an analysis of variance for these data using the appropriate design.
Is there a gender effect on number of hours? (Do males study significantly
more, or fewer, hours than females?) Is there an effect for week? (Do
students study more some weeks than others?) Do gender and week
interact? (Is the pattern of differences among weeks different for males and
females?) If there is a week effect, or a gender x week interaction, perform
the appropriate post hoc analysis to elucidate the exact nature of the effect.
4. The first within-subjects predictor variable, P1, can be interpreted as a linear
trend. Note that values increase from -1, to 0, to +1 for weeks 1-3, respectively.
The second within-subjects predictor variable, P2, can be interpreted as a
quadratic trend. Note that the values change from +1 to -2 and then back to
+1 for weeks 1-3 respectively, tracing a "U-shaped" relation. (Contrast
codes like these are called orthogonal polynomials.) For the previous
analysis, in order to evaluate the significance of the within-subjects variable,
you entered P1 and P2 in one step. Now enter each one separately and
evaluate the significance of each. This is like a planned comparison
analysis, except that P1 represents a linear trend and P2 a quadratic trend. In
this case, which kind of trend accounts for the differences among weeks?

            Coded variable
Week        P1        P2
Week 1      -1         1
Week 2       0        -2
Week 3       1         1
FIG. 16.10. Coded variables representing linear and quadratic trend for week.
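You can confirm that these codes obey the contrast-code formation rules (each
column sums to zero, and the two columns are orthogonal) with a quick check,
sketched here in Python:

p1, p2 = [-1, 0, 1], [1, -2, 1]  # linear and quadratic codes from Fig. 16.10

print(sum(p1), sum(p2))                    # each contrast sums to zero
print(sum(a * b for a, b in zip(p1, p2)))  # cross products sum to zero: orthogonal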
Limitations of Repeated-Measures Analyses With Spreadsheets
As mentioned earlier, factorial studies—both with and without repeated-
measures factors—often offer interesting economies: Several research factors can
be investigated in the context of a single study. However, as discussed in the
previous paragraphs, when too many factors are included, the results may
become so complex that interpretation is strained. This limitation on the number
of factors is purely conceptual. In addition, depending on the computer program
used to analyze data, the number of factors can be limited by technical
considerations as well. But more than that, the sheer number of dummy-coded
predictor variables that a multiple-regression approach to repeated measures
requires to represent subjects—to say nothing of the other factors—will almost
always cause you to use programs designed specifically for repeated-measures
analyses and not multiple regression.
Standard statistical packages (e.g., SAS and SPSS) limit the number of
variables that can be included in their multiple-regression routines, but typically
the limit is so large that it rarely if ever becomes a consideration. Multiple-
regression routines incorporated in spreadsheets historically were quite
constrained, but currently are not, so such constraints are no longer a barrier to
using spreadsheet programs for repeated-measures analyses.
Studies with repeated measures, as emphasized throughout this chapter and
the previous one, have a number of advantages. A thorough understanding of
analysis of variance designs that include repeated measures is thus an important
skill to include in your statistical armamentarium. And indeed, given the specific
examples described here in detail, you should now be able to generalize what you
have learned to other, more complex studies. The advantage of the approach to
repeated measures adopted here—regarding subject as an explicit research factor,
representing it with appropriately coded predictor variables, and treating it as a
covariate—is the ease with which studies with and without repeated measures can
all be integrated into a common hierarchic multiple-regression approach.
True, when adopting a multiple-regression approach, studies with repeated
measures can generate an inordinate number of predictor variables. For that
reason, use of specific programs in standard packages (such as SPSS GLM for
repeated measures, as described in Exercise 16.4)—which require only that the
number of between and within factors and the number of levels for each be
specified—is especially attractive on practical grounds. Nonetheless, once you
understand how to approach studies including repeated measures as a series of
hierarchic steps, you should have a clear grasp of the logic of these analyses. In
particular, you should understand the way total variance and degrees of freedom
are partitioned among the various effects. This understanding will be helpful to
you no matter whether you use the hierarchic multiple-regression procedures
described here or come to rely on specific repeated-measures programs in
standard statistical packages. As a practical matter, however, now that you
understand the logic of repeated-measures analyses, you will almost always use a
program in a statistical package to perform your analyses. In practice, the
number of dummy-coded predictor variables that a reasonable number of
subjects requires is simply too cumbersome for routine use of multiple regression
for repeated measures analyses.
A final comment: Some situations that seem appropriate for a repeated-
measures analysis may, in fact, be better handled in a more typical multiple-
regression way. The next chapter describes one commonly encountered situation
for which this is true and also describes power analytic methods for determining
the minimum number of subjects that should be included in a study.
17 Power, Pitfalls, and
Practical Matters

In this chapter you will:


1. Learn how to analyze data from pretest, posttest repeated-measures
studies using analysis of covariance.
2. Learn how to do a power analysis. A power analysis helps you decide
how many subjects should be included in a study. If too few subjects are
included, significant effects may not be detected.

In this final chapter two important but somewhat disparate topics are discussed.
The first topic is quite specific. Given what you have just learned about analyzing
studies that include within-subjects factors, you might be tempted to apply such
analyses to studies including pretest and posttest measures because such
measures are clearly repeated. Actually, data from such studies are analyzed
more easily as an analysis of covariance and an example is presented here. Some
readers may be able to apply this example directly to their own work, but for all
readers this final example serves as a way to summarize and exemplify the
general analytic principles emphasized throughout this book.
The second topic is more general. Before conducting any study it is
important to determine how many subjects should be included. If too few
subjects are studied, the probability of finding significant results may be
unacceptably low. On the other hand, if too many subjects are included, the
investigator may spend more time and effort on the study than is warranted.
Some general methods for determining what constitutes an appropriate sample
size have been developed and are presented here.

17.1 PRETEST, POSTTEST: REPEATED MEASURE OR COVARIATE?


The 16 scores for the button-pushing study have been used to illustrate a number
of different analyses throughout this book. In chapter 14, for example, we
assumed that each score was contributed by a different person, and that the first
eight subjects were male and the second eight were female; the first four male
and female subjects received instruction set I whereas the last four received
instruction set II. Accordingly, these data were analyzed as a 2 x 2 between-
subjects design (two-between). Then, in chapter 16, we showed how to analyze
these same data, first assuming that instruction set was a within-subjects factor
(one-between, one-within), then assuming that both gender (of spouse) and
instruction set were within-subjects factors (two-within).
In this section, a somewhat different set of assumptions is made but again the
same scores are used as an example. Assume that eight people have been
randomly selected to serve as subjects and four of them have been randomly
assigned to a training group whereas the second four are assigned to a no-
training control group. Thus treatment (training vs. no training) for the present
example, like gender in the first two of the previous three examples, is a between-
subjects factor or variable. Assume further that subjects are exposed to
videotapes and are asked to push a button whenever the infants portrayed in the
tapes do something communicative. The number of button pushes constitutes
the pretest or the pretreatment assessment. Next, the four treatment-group
subjects receive training concerning common infant cues. Finally, all subjects are
again asked to push a button whenever infants do something communicative.
The number of times the button is pushed this time is the posttest or
posttreatment assessment.
As with all such studies involving a pretest, a posttest, and different
treatments given to different subjects, the research question is: Does the
treatment matter? Are the treated subjects' scores, on average, different from the
untreated subjects' scores? There are two ways we might seek to answer this
question, given the situation described in the preceding paragraph, but before we
do that it is useful to discuss an even simpler situation. Imagine that we did not
have a control group—that all subjects were first given a pretest, then a
treatment, and finally a posttest. Such a single-group procedure is not
recommended as a way to demonstrate unambiguously that a particular
treatment has a desired effect. Subjects might score higher on the posttest, for
example, simply because of some effect of elapsed time, and not because of the
treatment. Still, it is useful for present purposes to consider how data from such
a study would be analyzed.
Pretest and posttest scores constitute repeated measures, two scores from
each subject. If this single-group study had been presented in chapter 9, you
might have been tempted to correlate pretest and posttest scores. Certainly you
could compute regression and correlation coefficients and these would tell you
how, and how well, you could predict the posttreatment from the pretreatment
assessment. But this would not tell you whether the pretreatment and
posttreatment assessments were different. This is a different question and
requires a different analysis.
In order to determine if the mean pretest score is different from the mean
posttest score, you would perform an analysis of variance. Time of test (pre
versus post) would be the single within-subjects factor. If the P main effect were
significant, you would conclude that the mean pretest and posttest scores were
significantly different. Do not confuse these two very different analyses. Given
two repeated measures, such as a pretest and a posttest:

1. The correlation coefficient tells you how well the second score can be
predicted from the first.
2. The analysis of variance, on the other hand, tells you whether the mean
score for the first test is different from the mean score for the second test.
These are two separate but important pieces of information. But note that there
is no necessary relation between their significance: One, neither, or both might
be significant.
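A toy example in Python (the numbers are invented purely for illustration) makes
the distinction vivid: pretest and posttest scores can be perfectly correlated even
though their means clearly differ.

import numpy as np

pre = np.array([5.0, 7.0, 9.0, 11.0])  # invented pretest scores
post = pre + 3                         # every subject gains exactly 3 points

print(np.corrcoef(pre, post)[0, 1])    # correlation = 1.0: perfect prediction
print(post.mean() - pre.mean())        # mean difference = 3.0: means still differ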
The single-factor study just described, like the somewhat more complex two-
factor study described a few paragraphs earlier, assesses subjects both before and
after treatment, but the more complex study, in addition, assigns some subjects
to a treatment group and the remaining subjects to a no-treatment control group.
For this reason, your first thought might be to analyze the data from the two-
factor study as a one-between, one-within, 2 x 2 design. The between-subjects
factor is treatment (training vs. no training) and the within-subjects factor is time
of testing (before vs. after training). A main effect for training (factor A) would
indicate that, considering both pretest and posttest scores, one group scored
higher than the other. A main effect for time of testing (factor P) would indicate
that, considering both treatment conditions, scores tended to be higher at one
time compared to the other.
The result the investigator most likely desires, however, is a significant
interaction. If the treatment affects the posttest score, then there should be a
difference between pretest and posttest scores only for treated subjects. In other
words, the effect of the treatment factor on score should be conditional on—that
is, should interact with—the level of treatment. It is a significant interaction
between treatment level and time of testing that allows the investigator to
conclude that subjects were affected by the treatment offered in the treatment
condition. Thus one way to test the effectiveness of a treatment, given the
present example, is to analyze the data as a one-between (factor A = treatment
level), one-within (factor P = time of assessment) study, in which case a
significant AP interaction might indicate that scores for the treated subjects, but
not the untreated subjects, were significantly different at the second assessment.
There is a second way to analyze these same data. It is somewhat simpler
and more straightforward than the mixed two-factor strategy just described and,
in fact, it is usually preferred by experts. Simply put, the question of interest
concerns whether the treatment matters—that is, whether it has an effect on a
particular outcome. Treatment is the research factor of interest and, as noted
previously, it is a between-subjects factor. This does not change. However,
rather than treat the repeated measures as levels of a within-subjects factor, it
makes sense to regard the second measure as the outcome variable and the first
measure as the covariate. This reduces a one-between, one-within design to a
one-between design with a covariate.
Quite likely the second score can be predicted to some extent by the first one,
but the question of interest is, above and beyond this expected effect (the effect of
the covariate), does knowledge of whether or not a subject received training (the
effect of the between-subjects factor) allow us to make significantly better
predictions for the second score? In other words, when accounting for criterion
or outcome score variance, in addition to whatever effect the covariate may have,
does the training factor matter?
As the discussion in the preceding paragraph makes clear, the present
example can be analyzed with an analysis of covariance, exactly as described in
chapters 10 and 12. Recall that 16 scores were collected for the button-pushing
study. Assume that scores 1-4 are posttreatment scores for the training group
subjects and scores 5-8 are their pretreatment scores; similarly, assume scores
9-12 are posttreatment scores for the no-training control group and scores 13-16
are their pretreatment scores. In other words, the preceding between-subjects
variable of gender is now the between-subjects variable of training. However,
scores that were previously ascribed to set I for the instruction factor (scores 1-4
and 9-12) are now scores for the criterion variable, whereas scores that were
previously ascribed to set II for the instruction factor (scores 5-8 and 13-16) are
now scores for the covariate.
The analysis of covariance for the present example proceeds in three steps.
First the posttest scores (representing the dependent or criterion variable) are
regressed on the pretest scores (representing the covariate). Then, in order to
test whether training has an effect, the coded variable for group membership
(training vs. no training) is added to the regression equation. Finally, in order to
test for homogeneity of regression, a variable representing the covariate by group
interaction is added. Computations for this analysis are carried out during the
course of the next exercise. Note that because there are no within-subjects
factors for this analysis, it is not necessary to code subjects as a factor. In this
case, the covariate is the pretest score, not the subject factor as it was for the
analyses described in chapters 14 and 15.
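The three steps can be sketched in Python with NumPy (our own illustration;
names are ours, with the data arranged as just described):

import numpy as np

post = np.array([102, 125, 95, 130, 43, 82, 69, 66], dtype=float)  # criterion
pre = np.array([79, 93, 75, 69, 101, 94, 84, 69], dtype=float)     # covariate
grp = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)          # training contrast

def rsq(*cols):
    """R-squared from regressing post on an intercept plus the given columns."""
    X = np.column_stack([np.ones_like(post)] + list(cols))
    e = post - X @ np.linalg.lstsq(X, post, rcond=None)[0]
    return 1 - (e @ e) / (((post - post.mean()) ** 2).sum())

r1 = rsq(pre)                  # step 1: covariate only
r2 = rsq(pre, grp)             # step 2: add the treatment-group code
r3 = rsq(pre, grp, pre * grp)  # step 3: add the covariate x group interaction
print(round(r1, 3), round(r2 - r1, 3), round(r3 - r2, 3))  # ~.132, .607, .012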

Exercise 17.1
Analysis of Covariance of a Pretest, Posttest Study
The template that results from this exercise allows you to perform a simple
analysis of covariance for pretest and posttest scores. Data from the button-
pushing study will again be used. This time you will assume that subjects were
tested twice and half of the subjects received sensitivity training between the two
assessments.
1. You will probably want to modify an earlier spreadsheet, for example, the one
shown in Fig. 14.10. However, for this analysis there are only eight subjects
and three predictor variables. The first four scores from the button-pushing
study are now the values for the dependent variable for the first four
subjects—the subjects who receive training—and scores 5-8 are now the
values for the covariate for the first four subjects (their pretest scores).
Scores 9-12 are now the values for the dependent variable for the second
four subjects—the subjects who did not receive training—and scores 13-16
are now the values for the covariate for the second four subjects.
2. Analyze these data and enter the appropriate information into a summary
table. First, do the analysis of covariance, regressing the posttest score on
the pretest score (step 1) and then on both the pretest score and a contrast
coded variable for treatment group (step 2). Second, check for homogeneity
of regression, following the preceding two steps with a third step that
regresses the posttest score on the pretest score, a contrast-coded variable
for treatment group, and a third variable representing their interaction.
3. Is the effect of the covariate statistically significant? Is the effect of the
sensitivity training statistically significant? Is the assumption of the
homogeneity of regression warranted?

The spreadsheet that results from the last exercise is shown in Figs. 17.1 and
17.2. The pretest by training interaction is not significant (F(1,4) < 1, NS); thus
homogeneity of regression is not violated and we can conclude that these are
appropriate data for an analysis of covariance. The effect of pretraining on
posttraining scores is not significant (F(1,5) = 2.5, NS), but its significance, or
lack thereof, is not critical. The major question concerns the effect of training
and it is important to note that, controlling for pretest scores, the treatment
variable—training versus no training—significantly affects posttest scores (F(1,5)
= 11.6, p < .05).

     A    B                  C      D          E      F   G      H      I
1    Step Source             R2     R2 change  SS     df  MS     F      pη2
2    1    X, pretest         0.132  0.132      826    1   826    2.514  0.335
3    2    A, training        0.738  0.607      3807   1   3807   11.58  0.699
4    3    error              1      0.262      1643   5   328.6
5         TOTAL btwn Ss             1          6276   7
6
7    1    X, pretest         0.132  0.132      826    1   826    2.105  0.345
8    2    A, training        0.738  0.607      3807   1   3807   9.703  0.708
9    3    XA, interaction    0.75   0.012      73.74  1   73.74  0.188  0.045
10   4    error              1      0.25       1569   4   392.3
11        TOTAL btwn Ss             1          6276   7
FIG. 17.1. Source table for analyzing the effect of training (between-subjects
factor A) on posttreatment number of button pushes. Pretreatment number
of button pushes (labeled X) is a covariate.

One final set of computations remains. We could report raw means, noting
that after training the mean number of button pushes for the treatment group
was 113 whereas the corresponding number for the no-treatment control group
was 65. In addition, however, and in line with the usual reporting of analysis of
covariance results, we should also report the adjusted means, computed as
described in chapter 13.
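As a back-of-envelope check (ours) of the adjusted means you will compute in
Exercise 17.2, using the pooled pretest slope of -0.17 from Fig. 17.2:

b = -0.17          # regression weight for the pretest (covariate), from Fig. 17.2
grand_pre = 83.0   # grand mean of the eight pretest scores

# (raw posttest mean, group pretest mean) for the training and control groups
for raw_post, group_pre in [(113.0, 79.0), (65.0, 87.0)]:
    print(round(raw_post - b * (group_pre - grand_pre), 1))  # adjusted means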

     A     B     C     D    E     F      G      H      I            J    K
1          Post  Pre   Tr         y=     m=     e=                       sstot
2    s     Y     X     A    XA    Y'     Y-My   Y'-My  Y-Y'              y*y
3    1     102   79    -1   -79   113    13     24     -11               169
4    2     125   93    -1   -93   110.7  36     21.66  14.34             1296
5    3     95    75    -1   -75   113.7  6      24.67  -18.7             36
6    4     130   69    -1   -69   114.8  41     25.67  15.33             1681
7    5     43    101   1    101   62.66  -46    -26.3  -19.7             2116
8    6     82    94    1    94    63.83  -7     -25.2  18.17             49
9    7     69    84    1    84    65.5   -20    -23.5  3.498             400
10   8     66    69    1    69    68.01  -23    -21    -2.01             529
11   Sum=  712              Sum=  0      -0     0                 SS=    6276
12   N=    8                N=    8      8      8                 df=    7
13   Mean= 89               VAR=  784.5  579.1  205.4             MS=    896.6
14   a,b=  102.9 -0.17 -23.3      SD=    28.01         SD'= 29.94
15   R=    0.859 R2=   0.738      R2adj= 0.633
FIG. 17.2. Spreadsheet for analyzing the effect of training (between-subjects
factor A) on posttreatment number of button pushes (labeled Post).
Pretreatment number of button pushes (labeled Pre) is a covariate.

Exercise 17.2
Adjusted Scores for a Pretest, Posttest Study
This exercise provides additional practice in computing adjusted scores as
required for an analysis of covariance.
1. Modify the spreadsheet shown in Fig. 17.1 to compute posttest scores,
adjusted for the effect of pretest scores. If you have forgotten how to do this,
refer to Exercise 13.7.
2. What are the adjusted posttest means for the treated and untreated groups?

The example just described illustrates the general analytic strategy emphasized
throughout this book, which can be summarized as follows:

1. A quantitative dependent or criterion variable (in this case, number of
posttreatment button pushes) is identified and measured.
2. Independent or predictor variables are identified and hierarchically
ordered. Predictor variables may be quantitative, like age or number of
pretreatment button pushes, or they may be categorical, like treatment
group. Categorical variables distinguishing among G groups are
represented with G - 1 coded predictor variables. Predictor variables
representing an interaction are formed by multiplying their component
predictors together.
3. The criterion variable is regressed on the first predictor variable, or set of
predictor variables. Then the second variable or set (if there is one) is
added to the regression equation, then the third, fourth, and so forth.
The increase in the proportion of criterion variance accounted for at each
step is noted and indicates the strength of the effect associated with the
variable (or variables) added at that step. Normally predictor variables
that serve as covariates, like number of pretreatment button pushes or
coded variables for subjects when analyzing within-subjects factors, are
added to the regression equation first, whereas predictor variables that
represent interactions between other variables are added last.
4. An F ratio is computed for the increase or change in R2 at each step and
is tested for significance. This test indicates whether or not the effect
associated with that step is statistically significant.
5. The magnitude of the effect can be represented with either the increase in
R2 or the partial η2.
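In SAS, for example, the entire sequence for the training study reduces to a
few lines (a sketch using the same PROC REG syntax as the Appendix A
exercises; the data set and variable names are hypothetical):
* The covariate (pre) enters first and the treatment code (train) ;
* enters second, matching the hierarchical order described above. ;
PROC REG DATA = sasuser.ex17;
MODEL post = pre train / SCORR1(TESTS);
RUN;
The SCORR1(TESTS) option prints the increase in R2 at each step, in the
order the predictors are listed, along with a significance test for each
increase, that is, steps 3 and 4 of the summary just given.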

The approach just summarized is amazingly general. Using only a few basic
and easily understood principles and techniques, it incorporates under one roof,
so to speak, not only standard regression analyses, but also standard analyses of
variance and covariance—and does so in a way that remains remarkably faithful
to the investigator's primary concern with identifying the strength and
significance of research factors of interest. Still, unless enough subjects are
studied, effects strong enough to be interesting may not be detected as
statistically significant. How investigators can guard against this undesirable
outcome is discussed in the next section.

17.2 POWER ANALYSIS: HOW MANY SUBJECTS ARE ENOUGH?

Power, as you learned in chapter 2, is the probability that a real effect will be
detected as significant by a particular statistical test. Again as you learned in
chapter 2, if the alpha level is made less stringent, power is increased. However,
because the alpha level regarded as appropriate is set not by individual
investigators acting unilaterally but by social consensus, modifying the alpha
level is usually not an acceptable way to gain power.
Some tests are more powerful than others. For example, as a general rule,
tests that rely on quantitative data are more powerful than tests that make use of
categorical data and, as discussed in the last two chapters, usually tests involving
within-subjects factors are more powerful than tests involving between-subjects
factors. Under most circumstances, however, the nature of the data and the
design appropriate for analyzing those data are limited by the nature of the
problem and cannot easily be changed.
By far, the easiest way to gain power is to increase the number of subjects
studied. How this works with the F distribution is easy to demonstrate. Look at
Table D in the statistical tables appendix. Notice that as the degrees of freedom
for the denominator increases, critical values of F decrease. In other words, other
things being equal, as sample size increases, the bulk of the sampling distribution
shifts to the center, leaving a thinner tail. As a result, the value of F that
demarcates 5% of the area becomes steadily smaller—and an F ratio that was not
deemed significant may become significant if it is based on data from more
subjects.
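You can see the same trend without the table. This SAS sketch uses the
built-in FINV quantile function to print the .05-level critical value of F for 1
numerator degree of freedom over a range of denominator degrees of freedom
(the data set name critF is arbitrary):
DATA critF;
DO dferror = 5 TO 60 BY 5;
Fcrit = FINV(0.95, 1, dferror);  * 95th percentile of F(1, dferror);
OUTPUT;
END;
RUN;
PROC PRINT DATA = critF;
RUN;
The printed values of Fcrit shrink steadily as dferror grows, which is exactly
why adding subjects makes a given F ratio easier to declare significant.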
In other words, very large effects may be found significant even with quite
small sample sizes, whereas minuscule effects may be found significant if sample
size is large enough. A practical question to address before beginning any study
is, how many subjects are enough? If too few subjects are included in a study,
there may be little hope of finding significant results. Both time and money will
have been wasted in a futile effort. On the other hand, if too many subjects are
included, effects too small to be of either theoretical or practical interest may be
detected. This too can be viewed as a waste of resources.
Happily, there are fairly straightforward procedures for determining an
appropriate sample size. The standard reference is Jacob Cohen's Statistical
Power Analysis for the Behavioral Sciences (1988), which should be consulted
for situations that do not fit the basic guidelines given here. The procedures
presented in this chapter are adapted from Cohen and Cohen's Applied Multiple
Regression/Correlation Analysis for the Behavioral Sciences (1983). As such,
they apply to multiple-regression analyses and so are appropriate for almost all of
the analyses described in this book (the exception is the sign test).
A power analysis can be decomposed into six constituent steps (Fig. 17.3).
First select an alpha level. This step requires only that you state your alpha level,
which is part of the planning for any study. Usually it will be .05 or .01.
Second, set power. Decide what level of power you prefer or find acceptable.
A common choice is .9, but under some circumstances you may be willing to
accept only a .8 or less chance of detecting effects as significant.
Third, count predictors. The number of predictors is symbolized with K and
is the number of predictors uniquely associated with the effect. In other words, K
is dfchange or the number of predictor variables added to the regression equation at
the step associated with the effect.
Fourth, find the L score. L is a statistic whose values Cohen has computed
and tabled (Cohen & Cohen, 1983; see Tables F.1 and F.2 in the statistical tables
appendix). It will be used in the sixth step to compute the number of subjects.
Note that the values from the first three steps are used to determine the value of
L; thus for the fourth step you scan Table F.1 or F.2 (depending on the alpha level
you selected) for the value of L that corresponds to the power you desire and the
number of predictor variables you have.
The fifth step requires that you determine the size of the effect of interest.
This requires somewhat more judgment than the first four steps. Along with L,
the effect size will be used in the sixth step to compute the number of subjects.
Following Cohen and Cohen (1983), the effect size is symbolized as f2 and is
defined as follows:
f2 = (R2large - R2small) / (1 - R2final)                          (17.1)
The large, small, and final R2s assume that we are testing effects hierarchically.
Thus R2change (i.e., R2large - R2small) refers to the proportion of variance we think
will be accounted for uniquely by the effect of interest, whereas R2final refers to the
proportion of variance we think will be accounted for by all the effects tested by a
particular design. (Recall the discussion of larger, smaller, and final models in
chap. 12.)
It is important to note that these are population, not sample, R2s. Thus, f2 is
a population statistic that reflects the size we expect for a given effect not in a
sample but in the population. This is hardly surprising. Presumably determining
an appropriate sample size is something we do before selecting a sample, so we
would hardly expect that procedure to rely on sample data. But how are the
population R2s of Equation 17.1 determined? There are several possibilities.
Unfortunately, none are simply mechanical and all require that the investigator
exercise some judgment. The investigator could be guided by previous results in
the field, arguing that because effects of a certain size have typically been found
in the past, it seems reasonable to think that effects of a similar size would be
found in the future. Or, if there is little in the way of precedent, an investigator
could simply argue that only effects of a certain size are of interest, for either
theoretical or practical reasons.
In any case, in order to proceed with a power analysis, it is necessary to
commit to some expected effect size. And it is highly desirable, of course, to have
some rationale for the size selected. This is not quite as difficult as it sounds.
Even though we do not know the true population values for particular R2s, for
most areas of research there is usually some consensus as to what constitutes a
small, medium, or large effect. And if no other rationale is available, one can use

Steps required for a power analysis

Step 1  Select an alpha level.
Step 2  Set power.
Step 3  Count predictors.
Step 4  Find the L score.
Step 5  Select an effect size.
Step 6  Compute the number of subjects required.

FIG. 17.3. Steps required for a power analysis.


Cohen's guidelines. For the behavioral sciences, Cohen (1988) wrote, an R2 of .01
is regarded as small, .09 as medium, and .25 as large (which translate into
correlation coefficients of .1, .3, and .5, respectively). Not all investigators will
think that accounting for 25% of the criterion variance constitutes a large effect,
but nonetheless to proceed with a power analysis some desired effect size must be
selected.
Having defined the alpha level, the power desired, the number of predictors,
the value of L, and the effect size, it is now possible to perform the desired
computation; thus, for the sixth step, compute the number of subjects required.
The formula, where n* indicates the number of subjects, is:
n* = L / f2 + K + 1
This is the smallest sample size you should consider. If an effect is real and of the
strength claimed, then the probability of detecting it as significant, with K
predictor variables at the stated alpha level, is the probability you selected for
power at step 2—which is why a value of .9 for power was recommended. Under
most circumstances it does not seem desirable to have, for example, only a 50%
chance of finding significant results of a strength you believe important.
The lie detection study described earlier can be used to exemplify a power
analysis. As presented in chapter 10, this study included only 10 subjects and,
not surprisingly given the small number of subjects, did not find a significant
effect of drug on number of lies detected (see Fig. 10.2). The number of subjects,
although convenient for exposition, was unrealistically small. Still, the R2 for this
sample—.19—does not seem trivial. This raises a question: In a real study, how
many subjects would be needed to detect a similarly sized effect as significant?
The .19 is a sample R2. Its adjusted value (see chap. 11), which is a reasonable
estimate for the population value, is .09, exactly what Cohen called medium. (As
an exercise, verify that .09 is the population value estimated from the sample
value of .19.) The conventional value for alpha is .05 and a commonly selected
value for power is .9. In this case, there is one predictor variable, the coded
variable that indicates whether or not the subject received a drug. The value for L
is 10.51 (K = 1, power = .90). The estimated population value for R2 is .09, and
the effect is associated with the first and only step (which means that R2small = 0
and R2large = R2final); consequently, for this example,
f2 = .09 / (1 - .09) = .09 / .91 = .0989
Finally,
n* = 10.51 / .0989 + 1 + 1 = 108.3
Instead of rounding to the nearest whole number, always round up, because
this calculation determines a minimum number of subjects. In other
words, in order to have a 9 in 10 chance (power = .9) of finding significant at the
.05 level an effect of drug that accounts for 9% of the variance in number of lies
detected in the population, at least 109 subjects would be needed. Approximately
half would receive the drug treatment and the remaining subjects would be
assigned to a no-drug control group.
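The arithmetic of the six steps is also easy to automate. Here is a minimal
SAS sketch of the lie detection computation (the value of L comes from Table
F.1, as before; the data set name nstar is arbitrary):
DATA nstar;
L = 10.51;                      * from Table F.1 with alpha = .05 and power = .90;
K = 1;                          * predictors added at the tested step;
R2change = 0.09;                * expected population R2 for the effect;
R2final = 0.09;                 * only one step, so final equals change;
f2 = R2change / (1 - R2final);  * Equation 17.1;
nstar = CEIL(L / f2 + K + 1);   * the n* formula, rounded up;
PUT 'Subjects required: ' nstar=;
RUN;
The PUT statement writes nstar=109 to the SAS log, matching the hand
computation above.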
Power analysis is important primarily because it provides a rational way to
determine the number of subjects needed before a study begins. It prevents you
from being embarrassed when, after the study is over and you are bemoaning the
lack of significant results, someone points out you only had a 40% (or whatever)
chance anyway of finding significant effects you thought important, given the
number of subjects. In addition, power analysis provides a way to interpret
negative results after the study is over. It is never correct to claim that negative
results "prove" the null hypothesis. And certainly, given the results of the lie
detection study, it would be misleading to even say that they support the notion
that drug has no effect on number of lies detected. After all, 19% of the variance
in the sample was accounted for by drug group—an effect that would be
statistically significant with a larger sample.
Imagine, however, that we redo the study with 109 subjects, the number of
subjects we computed based on power analysis considerations, and again find no
significant effect. We cannot claim that this proves that drug has no effect, of
course, but we can say that we had a 90% chance of detecting a drug effect that
accounts for 9% of the population variance as significant at the .05 level—and we
failed to do so. We do not know the true magnitude of the drug effect in the
population, but we do know that if it really does account for 9% of the variance in
number of lies detected, then 90% of the studies conducted with a sample size of
109 should find the drug effect significant (at the .05 level). Such a statement
helps to put negative results in perspective and—coupled with description and
discussion of the size of the effect we actually found—seems considerably more
informative than simply stating that an effect achieved, or did not achieve,
statistical significance.

Exercise 17.3
A Power Analysis
This exercise provides practice in power analysis.
1. Assume you want at least a .9 chance of detecting an effect as significant at
the .05 level. Assume further that you believe the effect, which is coded
using two predictor variables, should account for at least 36% of the variance
in the population. What sample size should you use?
2. Assume you want at least a .9 chance of detecting an effect as significant at
the .05 level. Assume further a 2 x 2 between-subjects factorial design. You
believe that the A, B, and AB effects together account for 30% and you want
to detect a B main effect that accounts for 22% of the variance in the
population. What sample size should you use?
3. Assume you want at least a .9 chance of detecting an effect as significant at
the .05 level. Assume a covariate accounts for 7% and you want to detect an
effect for the between-subjects factor, which is coded using three predictor
variables and accounts for an additional 17% of the variance in the
population. What sample size should you use?

Note 17.1
L A statistic used in power analysis calculations. Its value is
determined by the power desired and the number of
predictor variables used. See Table F in the statistical tables
section.
f2 A measure of effect size. Technically, the ratio of variance
accounted for by an effect to residual or error variance. Used
in power analysis calculations.
n* The number of subjects needed in a study in order to detect
effects of a specified size as significant. Determined by power
analysis calculations.

The approach to power analysis described here—which requires you to
identify the sources of variance, arrange them hierarchically, and identify an
increase in R2 for each step—has been automated. If you provide this
information to BWPower, a computer program written by Bakeman and
McArthur (1999), the program computes power for various sample sizes (see Fig.
17.4). BWPower can be downloaded from:

www.gsu.edu/~psyrab/BakemanPrograms.htm.


FIG. 17.4. Program BWPower showing answer for Part 2 from Exercise
17.3.
A Final Comment
At this point, after 17 chapters, you should have a firm, conceptual grasp of how
variation in an outcome of interest (a dependent or criterion variable) can be
explained or accounted for in terms of research factors (independent or predictor
variables). It is worthwhile to reflect on what you have gained. You have learned
exactly what it means to say that an effect accounts for a stated proportion of
criterion variance, and you have learned how to describe the magnitude of a given
effect and to test that magnitude for statistical significance. You have also
learned that statistical significance is important—you want to know that observed
results are probably not just chance happenings—but that it is not all-important.
After all, any effect that is not zero will be statistically significant if the sample
size is large enough. But now you know how to decide on an appropriate sample
size before launching a study.
From the examples and exercises throughout this book you have gained an
appreciation for statistical hypothesis testing as it has come to be conventionally
understood and practiced—and presumably you see it as a useful tool in data
interpretation, not as a cult idol to be adhered to without question. If you wish to
deepen your understanding of hypothesis testing and the seemingly arbitrary .05
cutoff point, two essays, one by Cohen (1990) and one by Rosnow and Rosenthal
(1989), are especially recommended and serve as a superb complement to the
material presented here.
Along the way you have accumulated an impressive array of skills. Coding
categorical predictor variables and arranging both quantitative and qualitative
predictor variables into a series of hierarchic steps for multiple regression should
by now be second nature. Given the integrated approach adopted in this book,
you should be able to perform simple multiple-regression analyses, one-way and
factorial analyses of variance with and without repeated measures, and analyses
of covariance—and you should also be adept at describing the results of such
analyses, using post hoc tests and adjusted means. One topic not discussed here
is the analysis of categorical dependent variables using chi-square and log-linear
techniques. As mentioned in the preface, such analyses are described in a
companion volume (Bakeman & Robinson, 1994). Otherwise this book has
provided you with an integrated understanding of most of the basic sorts of data
analyses used by behavioral scientists.
You have reason to be pleased with your accomplishment but you should not
take this as license to rest on your laurels. William Shakespeare has Richard II
say: "But what e'er I be,/ nor I, nor any man that but man is/ with nothing shall
be pleased till he be eased/ with being nothing" (Act 5, Scene 5, lines 38-41).
This is not a somber thought, but simply suggests the restless quality of an active
person with an active mind; it suggests that anyone too satisfied is no longer fully
alive. It seems a fitting moral with which to end this book. The diligent reader
will have learned a great deal about statistics, at an introductory level and as
usually applied to behavioral science data. But it is only a foundation, a
beginning. We hope that the material presented throughout this book will
provide a useful base, not just for your understanding of the simpler statistical
analyses found in the research reports you read, and not just for the basic
analyses you may need to perform on your own data, but also for all of your
future learning about statistics and statistical data analysis.
References

Bakeman, R., & McArthur, D. (1999). Determining the power of multiple
regression analyses both with and without repeated measures. Behavior
Research Methods, Instruments, and Computers, 31, 150-154.
Bakeman, R., & Robinson, B. F. (1994). Understanding log-linear analysis with
ILOG: An interactive approach. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate
analysis. Cambridge, MA: MIT Press.
Cohen, J. (1968). Multiple regression as a general data-analytic system.
Psychological Bulletin, 70, 426-443.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd
ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45,
1304-1312.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis
for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Greenhouse, S. W., & Geisser, S. (1959). On methods in the analysis of profile
data. Psychometrika, 24, 95-112.
Hays, W. L. (1981). Statistics. New York: Holt, Rinehart & Winston.
Huynh, H., & Feldt, L. S. (1976). Estimation of the Box correction for degrees of
freedom from sample data in randomized block and split-plot designs.
Journal of Educational Statistics, 1, 69-82.
Keppel, G. (1982). Design and analysis: A researcher's handbook. Englewood
Cliffs, NJ: Prentice Hall.
Keppel, G., & Saufley, W. H., Jr. (1980). Introduction to design and analysis: A
student's handbook. New York: W. H. Freeman.
Kessen, W. (1979). The American child and other cultural inventions. American
Psychologist, 34, 815-820.
Kirk, R. E. (1982). Experimental design: Procedures for the behavioral
sciences. Belmont, CA: Brooks/Cole.
Loftus, G. R., & Loftus, E. F. (1988). Essence of statistics. New York: Knopf.
Marascuilo, L. A., & Serlin, R. C. (1988). Statistical methods for the social and
behavioral sciences. New York: W. H. Freeman.

Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics:
Measures of effect size for some common research designs. Psychological
Methods, 8, 434-447.
Robinson, B. F., Mervis, C. B., & Robinson, B. W. (2003). The roles of verbal
short-term memory and working memory in the acquisition of grammar by
children with Williams syndrome. Developmental Neuropsychology, 23, 13-
32.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the
justification of knowledge in psychological science. American Psychologist,
44, 1276-1284.
Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66,
605-610.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New
York: McGraw-Hill.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103,
677-680.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty
before 1900. Cambridge, MA: Harvard University Press.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.).
Boston: Allyn and Bacon.
Tufte, E. R. (1983). The visual display of quantitative information. Cheshire,
CT: Graphics Press.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-
Wesley.
Wainer, H. (1984). How to display data badly. American Statistician, 38, 137-
147.
Wilkinson, L., & the Task Force on Statistical Inference, American Psychological
Association Board of Scientific Affairs (1999). Statistical methods in
psychology journals: Guidelines and explanations. American Psychologist,
54, 594-604.
Winer, B. J. (1971). Statistical principles in experimental design. New York:
McGraw-Hill.
Glossary of Symbols
and Key Terms

a The regression constant. When Y, the criterion variable, is
regressed on X, a single predictor variable, a is the Y intercept of
the best-fit line. To compute, subtract the product of b (the
simple regression coefficient) and the mean of X from the mean
of Y (a = MY - bMX).
Also, the number of levels for the between-subjects factor A.
A The first between-subjects factor in a factorial study.
α (alpha) The probability of making a false claim (Type I error).
Alpha Level Maximum acceptable probability for type I error (claiming an
effect based on sample data when there is none in the
population). Conventionally set to .05 or .01.
b The simple regression coefficient. Given a single predictor
variable, b is the slope of the best-fit line. It indicates the exact
nature of the relation between the predictor and criterion
variables. Specifically, b is the change in the criterion variable
for each one-unit change in the predictor variable. To compute
(assuming that Y is the criterion or dependent variable), divide
the XY covariance by the X variance (b = COVXY / VARX).
Also, the number of levels for the between-subjects factor B.
B The second between-subjects factor in a factorial study.
β (beta) The probability of missing a real effect (Type II error).
c The number of levels for the between-subjects factor C.
C The third between-subjects factor in a factorial study.
COV The covariance. If for the X and Y scores, it is SSXY divided by N.
Thus COVXY is the average cross product of the X and Y deviation
scores. It indicates the direction (plus or minus) and the extent
to which the X and Y scores covary, and is related to both the
correlation and the regression coefficients.

Critical Values of a test statistic that would occur 5% of the time or less
values (assuming an alpha level of .05) if the null hypothesis were true.
Depending on the test statistic used and how the null hypothesis
is stated, these values could fall only at the extreme end of one
tail of a distribution or at the extreme ends of both tails. If a test
statistic assumes a critical value, the null hypothesis is rejected.
For that reason, critical values define the critical region or the
region of rejection of a sampling distribution.
df Degrees of freedom.
dferror Degrees of freedom for error. It is equal to N minus 1 (for the
regression constant) minus the number of predictor variables.
dfmodel Degrees of freedom for the model. It is equal to the number of
predictor variables.
DV The variable a researcher wants to explain or account for is called
the dependent variable or DV. It is also called the criterion
variable or the response variable.
F ratio Usually the ratio of MSmodel to MSerror; more generally, the ratio of
two variances. If the null hypothesis is true, it will be distributed
as F with the degrees of freedom associated with the numerator
and denominator sums of squares.
f2 A measure of effect size. Technically, the ratio of variance
accounted for by an effect to residual or error variance. Used in
power analysis calculations.
G The number of groups or cells included in a study.
IV Variables a researcher thinks account for or affect the values of
the dependent variable are called independent variables or IVs.
They are also called predictor variables or explanatory variables.
Interval The intervals of an interval scale are equal, no matter where they
Scale fall on the measurement continuum. The placement of zero,
however, is arbitrary, like zero degrees Fahrenheit.
K The number of predictor variables.
L A statistic used in power analysis calculations. Its value is
determined by the power desired and the number of predictor
variables used. See Table F in the statistical tables section.
M The sample mean, which is the sum of the scores divided by N.
The mean for the Y scores is often symbolized as Y with a bar
above it (read Y-bar), the mean for the X scores as X with a bar
above, and so forth. In this book, to avoid symbols difficult to
portray in spreadsheets, My represents the mean of the Y scores,
MX the mean of the X scores, and so forth.
MX The mean of the X scores.
My The mean of the Y scores.
μ (mu) The population mean. It can be estimated by the sample mean,
or its value can be assumed based on theoretical considerations.
μM The mean of the distribution of sample means. It can be
estimated by the population mean.
MSerror Mean square for error. Computed by dividing the error sum of
squares by the error degrees of freedom. It is also called the
mean square within groups or the residual mean square.
MSmodel Mean square for the model. Computed by dividing the sum of
squares associated with the model by the degrees of freedom for
the model. It is also called mean square between groups or mean
square due to regression.
n The number of subjects within one group or cell of a study
N The total number of subjects (sampling units) included in a
study.
n* The number of subjects needed in a study in order to detect
effects of a specified size as significant. Determined by power
analysis calculations.
Nominal The levels of a nominal or categorical scale are category names,
Scale like Catholic | Jew | Muslim | Other for religion or male | female for
sex. Their order is arbitrary; there is no obviously correct way to
order the names.
Ordinal The values or levels of an ordinal scale are named but also have
Scale an obvious order, like first | second | third. Another example is
freshman | sophomore | junior | senior. However, there is no
obvious way to quantify the distance between levels or ranks.
p The number of levels for the within-subjects factor P.
P The first within-subjects factor in a factorial study.
Power The probability of detecting a real effect, or the probability of
correctly rejecting the null hypothesis. The power of a test is one
minus the probability of missing a real effect, or 1 - beta.
' (prime) In this book, a prime or apostrophe after a symbol indicates an
estimate, for example, VAR' and SD'. Some texts use the
circumflex (the ^ or hat) for this purpose.
q The number of levels for the within-subjects factor Q.
Also, the Studentized Range Statistic used to compute the Tukey
critical difference used in post hoc tests.
Q The second within-subjects factor in a factorial study.
Also, the complement of a probability, or 1 - P.
r The correlation coefficient or, more formally, the Pearson
product-moment correlation coefficient. It is an index of the
strength of the relation between two variables, and its values can
range from -1 to +1. To compute, find the average cross product
of the ZX and ZY scores (r = Σ ZXi ZYi / N, where i = 1, N), or take
the square root of r2 and assign it the same sign as the
covariance.
R Multiple R. Just as r is the correlation between a criterion
variable (Y) and a single predictor variable (X), so R is the
correlation between a criterion variable and an optimally
weighted sum of a set of predictor variables (X1, X2, and so
forth).
r2 The coefficient of determination, or r-squared. It indicates the
proportion of criterion variable variance that can be accounted
for given knowledge of the predictor variable. To compute,
divide model variance by total variance (r2 = VARmodel / VARtotal).
R2 Multiple R squared. It is the proportion of criterion variable
variance accounted for by a set of predictor variables, working in
concert.
R2adj Adjusted multiple R squared. It is the population value for R2
estimated from a sample, which will be somewhat smaller than
the sample R2. Consistent with the notation used in this text, it
could also be symbolized as R2' with the prime indicating an
estimated value, but the adjusted subscript is used far more
frequently in multiple-regression texts.
Ratio The intervals of a ratio scale are equal (as for an interval scale),
Scale but zero indicates truly none of the quantity, like zero degrees
Kelvin. Thus the ratio of two numbers is meaningful for
numbers measured on a ratio but not an interval scale.
Rejection Designates values for the test statistic whose probability of
Region occurrence is equal to or less than the alpha level, assuming that
the null hypothesis is true. If the test statistic assumes any of
these values (falls in the region of rejection), the null hypothesis
is rejected.
S The sample standard deviation. Alternatives are SD (used in this
book) and s.
σ (sigma) The population standard deviation. It can be estimated by
taking the square root of the quotient of the sum of the squared
deviation scores divided by N - 1. This estimate is symbolized
SD'.
S2 The sample variance. Alternatives are VAR (used in this book)
and s2. Computed by dividing the sum of squares by N.
σ2 The population variance. It can be estimated by the sum of the
squared deviation scores divided by N - 1. This estimate is
symbolized VAR'.
SD The sample standard deviation, which is the square root of the
sample variance. Often it is symbolized S or s. If not clear from
context, SDy indicates the standard deviation for the Y scores,
SDx the standard deviation for the X scores, and so forth.
SD' The estimated standard deviation. It is almost an unbiased
estimate (especially if N > 10) of σ, the true population value. It
can be estimated by taking the square root of VAR', or by taking
the square root of the quotient of the sum of squares divided by
N - 1. Often it is symbolized with a circumflex (a ^ or hat) above
σ or s.

SD'Y-Y' The estimated standard error of estimate. It is the estimated
population standard deviation for the regression residuals, that
is, the differences between raw and predicted scores. It can be
regarded as an estimate of the average error made in the
population when prediction is based on a particular regression
equation. Sometimes it is called the standard error of estimate,
leaving off the qualifying estimated. Often it is symbolized with
a circumflex or ^ above the SD and above the second Y subscript
instead of a prime after them.
SD'M The estimated standard error of the mean.
σM The standard error of the mean, or the standard deviation for the
distribution of sample means. It can be estimated by dividing σ
by the square root of N, or by dividing SD' by the square root of
N, or by taking the square root of the quotient of the SS divided
by N times N - 1. This estimate is symbolized SD'M.
SS The sum of squares is formed by first squaring each residual
and then summing the resulting squares; thus it is the sum of the
squared residuals. In other words, SS = Σ (Yi - Yi')2 (where i =
1, N).
SSerror The error sum of squares, or Y - Y' for each subject, squared and
summed. This can also be symbolized SSres, for residual sum of
squares, or SSunexp, for sum of squares left unexplained by the
model.
SSmodel The model sum of squares, or Y' - My for each subject, squared
and summed. This can also be symbolized SSreg, for regression
sum of squares, or SSexp, for sum of squares explained by the
model.
SStotal The total sum of squares, or Y - My for each subject, squared and
summed. This can also be symbolized SSy.
SSXY The XY sum of squares, or Σ (Xi - MX)(Yi - MY), where i = 1, N.
Thus SSXY is the sum of the cross products of the X and Y
deviation scores.
t A test statistic similar to Z but for small samples. The shape of
the t distribution is similar to the normal distribution but flatter
and more spread out. As a result, critical values for t are larger
than corresponding critical values for Z.
TCD Tukey Critical Difference. Computed based on the Studentized
Range Statistics and used in post hoc tests. If differences
between means exceed the TCD, then the difference between
those means is statistically significant.
Type I Error Making a false claim, or incorrectly rejecting the null hypothesis
(also called an alpha error). The probability of making a Type I
error is alpha.
Type II Error Missing a real effect, or incorrectly accepting the null hypothesis
(also called a beta error). The probability of making a Type II
error is beta.
VAR The sample variance, which is the sum of squares divided by N.
Often it is symbolized S2 or s2, which makes sense because
variance is a squared measure. If not clear from context, VARy
indicates the variance for the Y scores, VARx the variance for the
X scores, and so forth.
VAR' The estimated population variance. It is an unbiased estimate of
σ2, the true population value. It can be estimated by multiplying
the sample variance, VAR, by the quotient of N divided by N - 1,
or by dividing the sum of squares by N - 1. Often it is
symbolized with a circumflex (a ^ or hat) above σ2 or s2.
VARerror The error variance. It indicates the variability of raw scores
relative to scores predicted by a specified regression equation or
model. To compute, divide SSerror by N.
VARmodel The model variance. It indicates the variability of scores
predicted by a specified regression equation or model relative to
the group mean. To compute, divide SSmodel by N.
VARtotal The variance for the scores in a sample. It indicates the
variability of raw scores relative to the group mean. To compute,
divide SStotal by N. The subscript indicates the group of scores in
question, e.g., VARx for the X scores, VARy for the Y scores, and
so forth.
X An upper case X is used to indicate a generic raw score, usually
for the independent or predictor variable. If there is more than
one, the first is indicated X1, the second X2, and so forth.
Y An upper case Y is used to indicate a generic raw score, usually
for the dependent variable. Thus Y represents any score in the
set generally, and Yi indicates the score for the ith subject.
Y' Y' (read Y-prime) indicates a predicted score (think of "prime" as
standing for "predicted"). This is relatively standard regression
notation, although sometimes a Y with a circumflex or "hat" (i.e.,
a ^) over it is used instead, Ŷ. Usually the basis for prediction
will be clear from the context. Individual predicted scores are
symbolized Yi', but often the subscript is omitted.
Y' - MY The model (or regression) deviation score, that is, the difference
between the score predicted by the regression equation (or
model) for each subject and the group mean. (Subscripts
indicating subject are not shown.)
Y - Y' The difference or deviation between an observed score and a
predicted score is called a residual or error score. Again the
subscripts are often omitted. Often lower case letters are used to
represent residuals. For example, in Fig. 5.3 a lower case y
represents the deviation between a raw Y score and the mean for
the Y scores.
Z A standardized or z score. It is the difference between the raw
score and the mean divided by the standard deviation.
Appendix A:
SAS Exercises

A BRIEF INTRODUCTION TO SPSS AND SAS

A spreadsheet is a powerful tool for manipulating data and conducting basic
statistical analyses. Spreadsheets also allow students to explore the inner
workings of meaningful definitional formulas and avoid the drudgery of hand
calculations using opaque computational formulas. All of the computations
described in this book can be accomplished by most spreadsheet packages with a
regression function. As the analyses you wish to conduct become larger (i.e.,
contain many cases) and more complex (i.e., analysis of repeated measures),
however, it will be to the student's advantage to learn a computer package
dedicated to statistical analysis. To this end we have included exercises using
both SPSS and SAS that parallel and extend the spreadsheet exercises.
SPSS is a system of commands and procedures for data definition,
manipulation, and analysis. Recent versions of SPSS use a Windows-based menu
and a point-and-click driven interface. We will assume the reader has basic
familiarity with such Windows-based program features. SPSS includes a data
interface that is similar in many respects to a spreadsheet. Rows represent cases
and columns represent different variables. You can manipulate data using cut,
paste, and copy commands in a manner similar to spreadsheets. SPSS also
provides the ability to import spreadsheet files directly into the data editor.
Although similar to SPSS in its data analytic capabilities, SAS relies much less on
a Windows-based menu to perform operations. Rather, it requires that the user
become familiar with the SAS programming language to execute commands.
Nonetheless, data may be imported from spreadsheet files for further analysis in
SAS.
The SPSS and SAS exercises will provide enough information to navigate the
commands and procedures necessary to complete each exercise. You should,
however, consult the SPSS and SAS manuals and tutorials to gain a more
thorough background in all of the options and short-cuts available in these
programs. The following exercise will familiarize you with the SPSS and SAS
interfaces and basic commands, respectively.


Exercise 1.2
Using SAS
Now you will complete this first exercise using SAS. If you are not sure how to
perform any of these operations, ask someone to show you or experiment with
the help provided by the program.
1. Invoke SAS. Import the spreadsheet file you created in Exercise 1.1. Since
all SAS files have part of their names associated with a location on your
computer, use the LIBNAME command to generate the library under which
all of your work will be saved. For example, type (in the Program Editor
window) and run the following command:
LIBNAME sasuser 'c:\SASExercises';
Any spreadsheets you wish to import should be saved to this directory on
your computer. Spreadsheets containing the data for each exercise are
provided on the CD. You may want to copy all of them to the appropriate
directory. If you would like experience creating data files, you could create
your own spreadsheets, using the files on the CD as a guide. Now import
your data using this next command:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex1pt2.xls' OUT =
sasuser.ex1pt2;
PROC PRINT DATA = sasuser.ex1pt2;
TITLE 'Exercise 1.2.1';
RUN;
Look in the Output window and confirm that the variable names and data are
the same as your spreadsheet.
2. Change the name of the variable in the first column. As with the majority of
SAS programs, you will use a DATA command to tell SAS which data and
variables you want to work with and a PROC command to inform SAS of the
operations you would like performed with these data. In this case, you would
type:
DATA sasuser.ex1pt2 (RENAME = (test1 = exam1));
SET sasuser.ex1pt2;
RUN;
PROC PRINT DATA = sasuser.ex1pt2;
TITLE 'Exercise 1.2.2';
RUN;
in order to change variable one's name from Test 1 to Exam 1. Note that,
while there is a character maximum of 32 for variable names, you may use
the LABEL command to provide a more descriptive label that will appear in
the output.
3. Create a new variable by writing the following program:
DATA sasuser.ex1pt2;
SET sasuser.ex1pt2;
test4 =.;
RUN;
Use the LABEL command to give the variable a meaningful description.
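For example (a sketch; the label text itself is hypothetical):
DATA sasuser.ex1pt2;
SET sasuser.ex1pt2;
LABEL test4 = 'Score on the fourth test';  * the label replaces the bare name in output;
RUN;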
4. Enter data for the new variable. One means of doing this is to open the SAS
datafile that contains the original imported data and to switch from the
BROWSE to the EDIT mode by clicking the appropriate toggle icon in the
upper right hand corner of the toolbar. Once you are in edit mode, you are
free to hand enter five hypothetical values for your new variable, namely Test
4.
5. To display basic descriptive statistics for these data, write and run the
following code:
PROC MEANS DATA=sasuser.ex1pt2 N MEAN STDDEV MIN MAX
RANGE SUM;
TITLE 'Descriptives';
RUN;
This will open the Output Viewer and provide you with summary statistics for
all of the variables in the spreadsheet, unless otherwise specified. Confirm
that the N (i.e., count) and sum are the same values you obtained in Ex. 1.1.
6. You may return to the data file/editor and change some of the values. Run
the descriptive statistics code again (e.g., PROC MEANS). Were the
descriptive statistics updated correctly?
7. Save the output and data files. To save the output file, just make it the active
window and then select File->Save As from the main menu. Give the file a
meaningful name and click Save. The output will be saved in a file with a .lst
extension. Since using the LIBNAME command gets SAS to create a
permanent data file, there is no need to specify where you want your data to be
saved. Now return to the data editor and save the actual data using a
meaningful name. SAS data files are saved with a .sas7bdat extension. Exit SAS.

Exercise 3.6
The Sign Test in SAS
This exercise provides practice in using the sign test in SAS.
1. Invoke SAS and create two new variables. Name the first variable "outcome"
and the second variable "freq". Add value labels, 'no improvement' for '0' and
'improved' for '1', using the PROC FORMAT command. The code for these
steps is as follows:
PROC FORMAT;
value Outcome 1='improved' 0='no improvement';
DATA sasuser.ex3pt6;
Outcome=.;
Frequency= .;
FORMAT Outcome Outcome.;
OUTPUT;
RUN;
2. The component of code you will now write is that which identifies the
frequency of cases in each of the outcome classes. The frequency
associated with outcome 1 is 30, while the frequency associated with
outcome 0 is 10.
DATA sasuser.ex3pt6;
Outcome=1;
Frequency=30;
OUTPUT;
Outcome=0;
Frequency=10;
OUTPUT;
RUN;
In your code, you will ask SAS to weight the cases by the frequency variable,
which is essentially the same as entering 30 ones and 10 zeros in the
outcome column. You could simply enter a 1 or 0 for each case in the
outcome column, but doing so would become tedious when N is large.
3. Now, request a binomial sign test using the following code:
PROC FREQ;
WEIGHT Frequency;
TABLE Outcome / BINOMIAL;
RUN;
Note that the Test Proportion is set at .50 by default. This indicates that you
expect 50% of the outcomes to be positive. To change the expected
proportion, enter the correct value in parentheses after the BINOMIAL
statement:
BINOMIAL (p=.3);
4. Look at the output. Is the N correct? What proportion improved? What was
the expected proportion based on the Test Prop. column? Look at the '1-
sided Pr > Z' row towards the bottom of the output. This tells you the
probability of finding 30 out of 40 positive outcomes if you expected only 20.
If the value is less than alpha, then you reject the null hypothesis that the
treatment has no effect.
5. Re-run the program so that the values in the freq column reflect that 26
improved and 14 did not. Would you reject the null hypothesis with an alpha
of .05?
6. What if 30 improved, but you expected 80% to improve? Keeping the alpha
level at .05, would you reject the null hypothesis?

Exercise 5.5
SAS Descriptive Statistics
The purpose of this exercise is to familiarize you with requesting various
descriptive statistics in SAS.
1. Invoke SAS. Import part of the data from the lie detection study, as follows:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex5pt5.xls' OUT =
sasuser.ex5pt5;
RUN;
Label your two variables using the following code:
DATA sasuser.ex5pt5;
SET sasuser.ex5pt5;
LABEL s = 'Participant' y = 'Lies';
View your data in the output window:
PROC PRINT DATA = sasuser.ex5pt5;
TITLE 'Exercise 5.5';
RUN;
2. The most common command for requesting descriptive statistics is the
PROC MEANS command. To run descriptives for the imported data, type
and run:
PROC MEANS DATA = sasuser.ex5pt5 N MEAN STDDEV MIN MAX
RANGE VAR;
VAR y;
TITLE 'Descriptives for 5.5';
RUN;
Consult the output window for your results.
3. Examine the output. Do the values you obtained for N, the mean, variance,
and standard deviation agree with the results from the spreadsheet you
created in Exercise 5.4?
4. Note: A permanent SAS data file will be automatically saved under your
LIBNAME directory.

Exercise 5.8
Obtaining Standard Scores in SAS
1. Open the data file you created in exercise 5.5. To do this via the menu
system in SAS, select View->Explorer from the main menu. Highlight
sasuser and double click on the ex5pt5 data file. Run your program for
requesting descriptive statistics for the Lies variable as you did in the
previous exercise, but this time add a command that requests standardized
values to be saved as variables. The following code will allow you to do so:
PROC REG DATA=sasuser.ex5pt5;
MODEL Y=S / r cli clm;
RUN;
2. Although your program asked SAS to run an inferential data analytic
procedure, what you should be most interested in is the list of z-scores
provided for you towards the end of the output. Do these agree with the
scores you calculated in exercise 5.6?
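Alternatively, if all you want is the column of z-scores itself, PROC
STANDARD rescales a variable directly (a sketch; the output data set name
zscores is arbitrary, and VARDEF = N matches this book's definitional
formula, which divides by N rather than N - 1):
PROC STANDARD DATA = sasuser.ex5pt5 MEAN = 0 STD = 1 VARDEF = N
OUT = zscores;
VAR y;
RUN;
PROC PRINT DATA = zscores;
RUN;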

Exercise 6.2
Graphing in SAS
In this exercise you will learn how to request graphs in SAS.
1. Import the data from exercise 6.1 into SAS:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex6pt2.xls' OUT =
sasuser.ex6pt2;
PROC PRINT DATA = sasuser.ex6pt2;
TITLE 'Vocab & Grammar Data';
RUN;
2. By typing just one procedural command, namely:
DATA sasuser.ex6pt2;
SET sasuser.ex6pt2;
PROC UNIVARIATE PLOT DATA = sasuser.ex6pt2;
VAR vocab grammar;
TITLE 'Vocab & Grammar Var Descriptions';
RUN;
SAS will generate a stem-and-leaf plot, normal probability plot, and box plot
of your data in addition to several descriptive statistics.
3. Are the mean, median, standard deviation, interquartile range, and skewness
statistics what you would expect given your results from Ex. 6.1?
4. Once graphs and charts are generated in SAS, sometimes the option is
available to double click on an image and modify it. However, this is not
always the case and you may need to write additional code in order to edit
the characteristics of a chart or graph (e.g., the bin widths). The interested
reader is referred to the SAS documentation.
5. Examine the stem-and-leaf plots and the box plots. Do they look like the plots
you generated by hand?
6. Change some of the values in the data and rerun your program. See if you
can predict how the plots will look based on the changed values.
7. Return the numbers you changed to their original values and save your work.

Exercise 7.5
SAS: Single-Sample Tests and Confidence Intervals
This exercise provides practice in using SAS to conduct single-sample tests. You
will use the data from exercise questions 7.3.3 and 7.3.4 to determine if the
sample is drawn from a population with a mean of 20.
1. Open SAS and import the spreadsheet data:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex7pt5.xls' OUT =
sasuser.ex7pt5;
PROC PRINT DATA = sasuser.ex7pt5;
TITLE 'Samples 1 & 2';
RUN;
2. Run the following code to analyze the data for Sample 1:
PROC TTEST DATA = sasuser.ex7pt5 H0=20;
VAR sample1;
RUN;
3. Examine the output. Are the values for the standard deviation and the
standard error the same as the values you calculated in exercise 7.3? What
is the significance level of the t-test? Do you conclude that the sample was
drawn from a population with a mean of 20 at the alpha = .05 level? What
would your decision be if alpha were .01?
4. Conduct a single sample test using the data from question number four by
running the following code:
PROC TTEST DATA = sasuser.ex7pt5 H0=20;
VAR sample2;
RUN;
Why is this test not significant at the .05 level even though the mean is higher
than the mean from question number three?
5. To compute confidence intervals for the scores variable, run:
PROC MEANS DATA = sasuser.ex7pt5 MEAN ALPHA = .05 CLM;
VAR sample1 sample2;
TITLE '95% CIs';
RUN;
If you would like to calculate confidence intervals based on a value other
than 95%, change the value after 'ALPHA ='.

Exercise 9.7
Regression in SAS
In this exercise you will use SAS to conduct a regression analysis and create a
graph of the lies and mood data.
1. Invoke SAS. Import the Lies and Mood data you last used in exercise 9.2:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex9pt7.xls' OUT =
sasuser.ex9pt7;
PROC PRINT DATA = sasuser.ex9pt7;
TITLE 'Mood and Lies Data';
RUN;
2. To conduct a regression analysis, you will use the PROC REG command.
For these data, you would write:
PROC REG DATA = sasuser.ex9pt7;
MODEL lies = mood;
OUTPUT OUT=sasuser.ex9pt7 PREDICTED = predict RESIDUAL =
resid;
PLOT lies * mood;
TITLE 'Simple Linear Regression';
RUN;
3. To view the output, run:
PROC PRINT DATA = sasuser.ex9pt7;
TITLE 'Unstandardized Predicted Values & Residuals';
RUN;
4. Examine the Model Summary in the output. Do the values for R and R2 agree
with your spreadsheet? Now look at the results for the ANOVA. The Sums of
Squares Model, Error, and Total should agree with the values from your
spreadsheet for SSmod, SSerr, and SStot, respectively.
5. Examine the coefficients in the output. Can you find the values for a and b?
6. Finally, scroll down and observe the predicted values of Y (Y') and the
residuals (Y - Y'). Do these values agree with your spreadsheet?
7. As part of the code you wrote above, a request was made for a scatterplot of
Lies and Mood (PLOT lies * mood;).
8. Also as part of the program, the lies and mood data have been saved under
your chosen directory (sasuser.ex9pt7).
9. For additional practice you should try running the SAS Regression procedure
and creating scatter plots for the lies and drug data.

Exercise 10.6
The F test in SAS Regression and One-Way ANOVA
This will show you how to find the significance of the amount of variance
in Lies scores accounted for by Drug.
1. Import the Lies and Drug spreadsheet (Note you could also open the data file
you created in Exercise 9.6.6):
PROC IMPORT DATAFILE = 'c:\SASExercises\ex10pt6.xls' OUT =
sasuser.ex10pt6;
PROC PRINT DATA = sasuser.ex10pt6;
TITLE 'Drug and Lies Data';
RUN;
Rerun the Regression procedure:
PROC REG DATA = sasuser.ex10pt6;
MODEL lies = drug;
OUTPUT OUT=sasuser.ex10pt6 PREDICTED = predict RESIDUAL =
resid;
PLOT lies * drug;
TITLE 'Simple Linear Regression';
RUN;
2. Examine the ANOVA output. Does the F value agree with the one you calculated
in exercise 10.3?
3. The general command for a one-way ANOVA in SAS is PROC ANOVA. Run
code that conducts a one-way ANOVA on drug (IV) and lies (DV) that
requests descriptive statistics and Homogeneity of Variance:
PROC ANOVA DATA = sasuser.ex10pt6;
CLASS drug;
MODEL lies = drug;
MEANS drug / HOVTEST = LEVENE;
TITLE 'ANOVA Output Exercise 10.6';
RUN; PROC SORT DATA = sasuser.ex10pt6;
BY drug;
PROC MEANS DATA = sasuser.ex10pt6 N MEAN STDDEV;
BY drug;
VAR lies;
TITLE 'Summary of Lies by Drug';
RUN;
4. Examine the output. Make sure the N, means and standard deviations agree
with your spreadsheet. You will find the sums of squares and the F value
reported in the ANOVA table. The statistics should be identical to the ANOVA
output from the regression procedure with the exception that the regression
statistics are now termed between groups and the residual statistics are
called within groups. Thus, you may analyze a single factor study with a
categorical independent variable using either the Regression or One-Way
ANOVA procedures.
5. The One-way ANOVA procedure does, however, also provide a test of the
homogeneity of variances. Find Levene's statistic in your output. Levene's
statistic provides a test that the variances of groups formed by the
categorical independent variable are equal. This test is similar to the F test
for equal variances presented earlier in this chapter. Levene's, however, is
less likely to be biased by departures from normality.

Exercise 11.7
Hierarchical Regression in SAS
In this exercise you will learn how to conduct a hierarchical regression in SAS.
1. Import the ex11pt7 spreadsheet provided on the CD. For practice, you could
open the Lies and Mood SAS data file you created in Ex. 10.6, create a new
variable for the drug data, and dummy code the two groups.
2. For the hierarchical regression analysis, write and run the following code:
PROC REG DATA = sasuser.ex11pt7;
MODEL lies = mood drug / SCORR1(TESTS);
TITLE 'Hierarchical Regression in SAS';
RUN;
3. Examine the Model Summary in the output (under Analysis of Variance
heading). Notice that the model builds sequentially, such that the first level
describes the model with only mood entered. The next level provides
information about the model when the variable drug is added as a predictor.
If you use the cumulative R2 for any step in the model and subtract the R2
from the previous step in the model, you will get the R2 change
associated with adding that variable to the model. Make sure that the values
for R2, F, df, and significance agree with Figure 11.4.
4. Explore the ANOVA table to see if the overall model is significant. You can
look at the Parameter Estimates table to find the values of the
unstandardized and standardized partial regression coefficients.
5. As additional practice, use the SAS Regression procedure (PROC REG) to
reanalyze the button pushing study presented in exercise 11.6.

Exercise 12.5
Analyzing a single-factor between-subjects study in SAS
In this exercise you will analyze the button pushing study using the four groups
defined in Figure 12.3. In SAS you could analyze these data using the
Regression procedure and dummy or contrast codes. If using dummy codes, you
would run a regression notifying SAS of all of the coded vectors (X1 through X3)
to be analyzed in a single block. You would then examine the output to determine
if the model containing all three predictor variables is significant. If you want to
test individual contrast codes, you would request that each of the coded predictor
variables be analyzed in separate blocks, ultimately calculating the R squared
change associated with each step. The resulting hierarchical analysis (see Ex.
11.7) will allow you to determine the significance of each contrast. Finally, as
presented in this exercise, you could run the One-Way ANOVA procedure to
determine if group is a significant predictor of number of pushes.
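As a minimal sketch of the regression route just described (assuming the data
have been imported as in step 1 below, and that the coded vectors X1 through X3
have been added to the data set):
PROC REG DATA = sasuser.ex12pt5;
MODEL pushes = x1 x2 x3; * all three coded vectors in one block;
TITLE 'Dummy-Coded Regression, Four Groups';
RUN;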
1. In Excel, create a variable called Pushes and enter the data from Figure
12.2. Next, import these data into SAS by running:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex12pt5.xls' OUT =
sasuser.ex12pt5;
PROC PRINT DATA = sasuser.ex12pt5;
TITLE 'Group Predicting Button Pushes Data';
RUN;
2. When using the One-way ANOVA procedure, you do not need to use dummy
or contrast coding. Instead, create one variable called group and enter 0s for
all the cases in the ND group, 1s for the CO group, 2s for the C1 group, and
3s for the C2 group. The output will be more readable if you create value labels
for each of these groups using the FORMAT procedure:
PROC FORMAT;
VALUE Group 0="ND" 1="CO" 2="C1" 3="C2";
RUN;
3. To conduct your ANOVA (requesting Levene's test and descriptives), run the
following code:
PROC ANOVA DATA = sasuser.ex12pt5;
FORMAT group group.;
CLASS group;
MODEL pushes = group;
MEANS group / HOVTEST = LEVENE;
TITLE 'ANOVA Output Exercise 12.5';
RUN;
PROC MEANS DATA = sasuser.ex12pt5 N MEAN STDDEV;
FORMAT group group.;
BY group;
VAR pushes;
TITLE 'Summary of Pushes by Group';
RUN;
4. Examine the Descriptives output and confirm that the Ns, means, and
standard deviations are correct. Next look at Levene's test. Is the assumption
of equal variances met? Finally look at the output in the ANOVA table. Do the
sums of squares and df agree with values you calculated using the
spreadsheets? Based on the F and significance values, do you reject the null
hypothesis that the number of button pushes in each group is equal?

Exercise 13.5
Post-hoc tests in SAS
In this exercise you will learn how to use SAS to conduct post-hoc tests for the
button pressing study presented in Chapter 12.
1. Open the data file for the button pushing study you created in exercise 12.5.
By the nature of the code used to import these data, SAS created a
permanent file under your LIBNAME (e.g., sasuser) directory.
2. Redo the One-way ANOVA analysis. This time, however, make a request for
Tukey Post-hoc analyses, as such:
PROC ANOVA DATA = sasuser.ex12pt5;
CLASS group;
MODEL pushes = group;
MEANS group / HOVTEST = LEVENE TUKEY;
TITLE 'ANOVA Output Exercise 13.5';
RUN;
3. By making the request for the Post-hoc tests, a table is produced that groups
together all means that do not differ from each other at the selected alpha
level (the sketch after this list shows how to change that level). This grouping
should make it easy for you to create the subscripts necessary to present the
results of your post-hoc tests in the format of Figure 13.5.
4. To create a bar graph of the means that contains the standard error of the
means, run the following code:
PROC GCHART DATA = sasuser.ex12pt5;
VBAR group / SUMVAR = pushes AXIS = axis1 ERRORBAR= bars
WIDTH = 5 GSPACE=2 DISCRETE TYPE=mean CFRAME=ligr
COUTLINE= blue CERROR=black;
RUN;
If you choose to use the Confidence Interval of Mean, you must include the
size of the confidence interval (e.g., 95%, 99% or 90%).
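A minimal sketch of changing the alpha level used for the Tukey groupings (the
ALPHA= option on the MEANS statement; .05 is the default):
PROC ANOVA DATA = sasuser.ex12pt5;
CLASS group;
MODEL pushes = group;
MEANS group / TUKEY ALPHA = 0.01; * Tukey groupings at the .01 level;
RUN;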

Exercise 13.9
ANCOVA in SAS
In this exercise you will learn how to conduct an analysis of covariance in SAS.
1. Create a new SAS data file by importing the spreadsheet data from exercise
13.5:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex13pt9.xls' OUT =
sasuser.ex13pt9 REPLACE;
PROC PRINT DATA = sasuser.ex13pt9;
TITLE 'Age, Sex, and Smiles Data';
RUN;
2. Conduct a hierarchical regression as you did in Exercise 11.7. The code is as
follows:
PROC REG DATA = sasuser.ex13pt9;
MODEL smiles = age sex / SCORR1(TESTS);
TITLE 'ANCOVA in SAS';
RUN;
DATA sasuser.ex13pt9;
SET sasuser.ex13pt9;
agexsex = age * sex;
RUN;
The model summary output should correspond to the values found in Figure
13.8.
3. Test the homogeneity of regression assumption by running a second
analysis and entering age, sex, and the age by sex interaction term in the
third block. Similar to the code above, run the program
PROC REG DATA = sasuser.ex13pt9;
MODEL smiles = age sex agexsex/SCORR1(TESTS);
TITLE 'ANCOVA in SAS';
RUN;
4. You could also conduct an ANCOVA using the General Linear Model (GLM)
procedure (PROC GLM). To do this, run the code:
PROC GLM DATA = sasuser.ex13pt9;
CLASS sex;
MODEL smiles = age sex age*sex;
RUN;
5. Note that both Type I and Type III Sums of Squares are provided in the
output. For this analysis, attend to the Type I Sums of Squares.
6. Examine the output. The statistics for age, sex, error, and total should be
identical to the regression you ran in step 2.
7. When you use PROC GLM, the homogeneity of regression assumption is
automatically tested because the testing of the interaction between the
covariate and independent variable is built into the command. Whereas in
SPSS, a custom model must be created, SAS conveniently runs this step for
you.
8. Examine the Parameter Estimates. Do they agree with the values you
calculated in Exercise 13.7? (A sketch for obtaining the covariate-adjusted
group means follows.)
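Once the homogeneity of regression assumption is satisfied, one way to obtain
the covariate-adjusted group means is to drop the interaction term and request
least-squares means (a minimal sketch):
PROC GLM DATA = sasuser.ex13pt9;
CLASS sex;
MODEL smiles = age sex;
LSMEANS sex; * group means adjusted for the covariate age;
RUN;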

Exercise 14.7
SAS Analysis of a 2 x 2 Factorial Study
In this exercise you will learn how to use the General Linear Model procedure in
SAS to conduct a 2 X 2 Analysis of Variance for the button-pushing study.
1. Create a new SAS data file containing variables for the number of button
pushes (pushes), gender (sex), and instruction set (inst). Enter the data from
Figure 14.10. You can do this by entering the data directly into the Table
Editor in SAS or importing the spreadsheet from the CD:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex14pt7.xls' OUT =
sasuser.ex14pt7;
PROC PRINT DATA = sasuser.ex14pt7;
TITLE 'Gender, Instruction, and BPs';
RUN;
2. Using the PROC GLM command, request the following:
DATA sasuser.ex14pt7;
SET sasuser.ex14pt7;
instxsex = sex * inst;
RUN;
PROC GLM DATA = sasuser.ex14pt7;
CLASS sex inst;
MODEL pushes = sex inst sex * inst;
3. For options, you will want SAS to generate estimates of effect size and
homogeneity tests. You will also want Estimated Marginal Means. To get
these, continue the program above by writing:
MEANS sex inst sex * inst / HOVTEST=LEVENE WELCH;
LSMEANS sex inst / PDIFF STDERR; LSMEANS sex * inst / SLICE =
inst;
RUN;
Run the entire program.
4. Examine the means for each of the cells and marginals. Make sure these
values agree with your spreadsheets. Now look at the Estimated Marginal
Means (generated by LSMEANS). The values for the grand mean, sex and
instruction set main effects, and the interaction should be the same as those
generated by the descriptive statistics command. In the case of an
unbalanced design (i.e., one or more of the cells were of different size), the
descriptive statistics would provide traditional weighted means while the
estimated marginal means would be unweighted (see the sketch at the end
of this exercise). Typically, when cell sizes are unequal due to subject
attrition or random factors, you would want to report the unweighted means
for any significant effects resulting from your analysis.
5. Look at the Levene's Test of the Equality of Error Variances. Notice that this
test is not statistically significant, indicating that the assumption of equal
variances is met.
6. Finally, examine the box labeled The GLM Procedure. Look at the lines for
the SEX, INST, SEX*INST, Error, and Corrected Total. Check that the sums
of squares, df, mean square, F, and partial eta-squared values correspond
to your spreadsheet calculations.
7. For additional practice you should try reanalyzing the data from exercise 14.8
using SAS. Do all of the relevant statistics agree with your spreadsheet
analysis?
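A minimal sketch of the weighted versus unweighted distinction mentioned in
step 4, using hypothetical cell means and sizes:
DATA _null_;
m1 = 80; n1 = 4; * hypothetical cell 1 mean and n;
m2 = 90; n2 = 2; * hypothetical cell 2 mean and n;
weighted = (n1*m1 + n2*m2) / (n1 + n2); * 83.3, as PROC MEANS reports;
unweighted = (m1 + m2) / 2; * 85.0, as LSMEANS reports;
PUT weighted= unweighted=;
RUN;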

Exercise 15.3
A Single-Factor Repeated Measures Study in SAS
In this exercise you will analyze the lie-detection study, last described in Exercise
15.2, using the repeated measures PROC GLM of SAS.
1. Create a new SAS data file with three variables. Create one variable for
subject number (s), one variable for number of lies detected in the drug
condition (drug), and a third variable for number of lies detected in the
placebo condition (placebo). Give each variable a meaningful label.
2. Enter 1 - 5 in the subjects column. In the drug column, enter the number of
lies detected for subjects 1 - 5 in the drug condition. Do the same in the
placebo condition. Thus, the SAS file set up for a repeated measures study
will have a single row for each subject with the scores from each level of the
repeated measures variable in separate columns. You could also import the
spreadsheet from the CD where steps 1 and 2 are already complete:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex15pt3.xls' OUT =
sasuser.ex15pt3;
PROC PRINT DATA = sasuser.ex15pt3;
TITLE 'Lies, Drug/Placebo - Repeated Measures';
RUN;
3. To run this analysis, run the following code:
PROC GLM DATA = sasuser.ex15pt3;
MODEL drug placebo = / NOUNI;
REPEATED condition 2 / PRINTE;
RUN;
In this case, the PRINTE option will tell SAS to generate the various statistics
of interest to us, such as Mauchly's Test of Sphericity.
4. Examine the output. For the moment, ignore the Multivariate Tests and
Mauchly's Test of Sphericity. Look at the Sphericity values for your Within-
Subjects Effects. The SS, df, MS, F, and pη2 values should agree with your
spreadsheet results. Do they?

Exercise 15.5
SAS Analysis of a Repeated Measures Study with Four Levels
In this exercise you will analyze the data for the four-level repeated measures
study presented in Ex. 15.4.
1. Create a new SAS data file for the button pushes study that is set up for a
repeated measures analysis. You will need five variables, one for the subject
number (s), and four for each level of the infant diagnosis factor: Down
syndrome (ds), fetal alcohol syndrome (fas), low birth weight (lbw), and no
diagnosis comparison (cntrl). Give each variable a meaningful label (either by
hand in the Table Editor or by running the LABEL command).
2. Enter 1 - 4 in the subjects column and the appropriate number of button
presses for each of the four cases in the remaining columns. Alternatively, you
can import an Excel file from the CD with the data by typing:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex15pt5.xls' OUT =
sasuser.ex15pt5;
PROC PRINT DATA = sasuser.ex15pt5;
TITLE 'Repeated Measures Study w/ 4 Levels';
RUN;
3. To run the primary analysis, use the code:
PROC GLM DATA = sasuser.ex15pt5;
MODEL ds fas lbw cntrl = / NOUNI;
REPEATED dx 4 / PRINTE;
RUN;
4. Check your descriptive statistics. Do they agree with your spreadsheet?
Examine the Sphericity values in the output. Do the SS, df, MS, F, and pη2
values agree with your spreadsheet?
5. Look in the table labeled Mauchly's Test of Sphericity. Mauchly's test is not
significant, indicating that the sphericity assumption has been met.
Remember, however, that Mauchly's test is underpowered when the sample
size is small. It would therefore be prudent, in this case, to assume that there
is at least some violation of sphericity. Note that the lower bound is .33: 1/(k
- 1) = 1/(4 - 1). Also note that the Greenhouse-Geisser and Huynh-Feldt
estimates of epsilon differ by a large amount. This is due to the small number
of cases in the study. Typically the two estimates would be closer.
6. Examine the Multivariate Tests. None are significant, but remember that N is
small and the multivariate tests are not very powerful under these
circumstances. Apply the modified univariate approach. What is your
statistical decision based on this approach?
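A minimal sketch of the modified univariate (lower-bound) approach mentioned
in steps 5 and 6: both degrees of freedom for the within-subjects F are
multiplied by epsilon before the critical value is looked up.
DATA _null_;
k = 4; * levels of the repeated measure;
n = 4; * number of subjects;
eps = 1 / (k - 1); * lower-bound epsilon = .33;
df_num = eps * (k - 1); * corrected numerator df = 1;
df_den = eps * (k - 1) * (n - 1); * corrected denominator df = 3;
PUT eps= df_num= df_den=;
RUN;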

Exercise 16.2
SAS Analysis of a One-between, One-within Two-factor Study
This exercise walks you through an SAS analysis for a mixed two-factor study.
You will use data from Ex 16.1.
1. Create a new SAS data file. The data file should contain variables for subject
number (s), gender (gen), instruction set I (set1), and instruction set II (set2).
2. Enter 1 through 8 for the 8 subjects in the s column. Enter 0 for males and
1 for females in the gen column. Finally, enter the number of button pushes
for each subject in the appropriate column for instruction set. Create
appropriate variable labels and value labels for the variables. To import a file
from the CD, run the following program:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex16pt2.xls' OUT =
sasuser.ex16pt2;
PROC PRINT DATA = sasuser.ex16pt2;
TITLE '1 B/W & 1 W/I';
RUN;
3. Run the analysis, remembering to request the appropriate statistics, as
follows:
PROC GLM DATA = sasuser.ex16pt2;
CLASS gen;
MODEL set1 set2 = gen / NOUNI;
REPEATED inst 2 / PRINTE;
RUN;
4. Check the descriptive statistics to ensure that you set up the variables and
entered the data correctly. Next, determine if there is homogeneity of
covariances. Usually, you would then check the sphericity assumption, but
because there are only two levels of the repeated measure in this design,
epsilon = 1 and Mauchly's test of sphericity does not apply. Finally, check
Levene's test to make sure the homogeneity of variances assumption holds
for the between-subjects factor (a sketch for requesting it follows this list).
5. Examine the SS, df, MS, F, significance levels, and pη2 of the within-subjects
effects. Do they agree with your spreadsheet? Because there are only two
levels of the within-subjects factor, the multivariate tests, sphericity-assumed
tests, lower bound, and epsilon-corrected tests will all yield the same F ratios
and significance levels.
6. Examine the test of the between-subjects effect for gender. Are the SS, MS,
and df correct for the sex factor and the error term? What about the F ratio
and pη2?
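The repeated measures GLM above does not itself print Levene's test. One way
to obtain it for the between-subjects factor is to test each level of the repeated
measure separately (a minimal sketch):
PROC GLM DATA = sasuser.ex16pt2;
CLASS gen;
MODEL set1 = gen;
MEANS gen / HOVTEST = LEVENE; * repeat with set2 as the response;
RUN;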

Exercise 16.4
SAS Analysis of a Two-Within Two-Factor Study
This exercise walks you through a SAS analysis of a two-factor study in which
both factors are within subjects.
You will use data from Ex. 16.1.
1. Create a new SAS data file, or adapt one from Ex. 16.2. The data file should
contain variables for subject number (s), and button pushes for the 4
repeated-measures conditions: husbands in instruction set I (set1h), wives in
instruction set I (set1w), husbands in instruction set II (set2h), and wives in
instruction set II (set2w).
2. Enter 1 through 4 for the four sets of matched scores. Enter the number of
button pushes in the appropriate column for the instruction set and gender
combinations. Create appropriate variable labels and value labels for the
variables. To import a spreadsheet from the CD, run the following:
PROC IMPORT DATAFILE = 'c:\SASExercises\ex16pt4.xls' OUT =
sasuser.ex16pt4;
PROC PRINT DATA = sasuser.ex16pt4;
TITLE '2 W/I';
RUN;

3. This analysis and its relevant statistics can be obtained by running the
following program:
PROC GLM DATA = sasuser.ex16pt4;
MODEL set1h set1w set2h set2w = / NOUNI;
REPEATED inst 2, spouse 2 / PRINTE;
RUN;
4. Check the descriptive statistics (using the PROC MEANS command) to
ensure that you set up the variables and entered the data correctly.
5. Examine the SS, df, MS, F, significance levels, and pη2 of the within-subjects
effects. Do they agree with your spreadsheet?
Appendix B:
Answers to Selected Exercises

2. GETTING STARTED: AN INTRODUCTION TO HYPOTHESIS TESTING

2.1.1 8 or higher.
2.1.2 6 or higher.
2.1.3 P(type II) = P(4|5|6|7) = (1+2+3+4)/20 = .5; power = 1 - β = 1 - .5 = .5.
2.1.4 P(type II) = P(4|5) = (1+2)/20 = .15; power = 1 - β = 1 - .15 = .85.

3. INFERRING FROM A SAMPLE: THE BINOMIAL DISTRIBUTION

3.1.1 and 3.1.2

# tosses # outcomes # classes


4 16 5
5 32 6
6 64 7
7 128 8
8 256 9
10 1,024 11
20 1,048,576 21

3.3.1 P(6 heads) = 1/64 = .0156
P(0|6 heads) = (1+1)/64 = .0313
P(5|6 heads) = (6+1)/64 = .109
P(2|3|4 heads) = (15+20+15)/64 = .781
P(0|1|5|6 heads) = (1+6+6+1)/64 = .219
3.3.2 P(8 heads) = 1/256 = .00391
P(0|8 heads) = (1+1)/256 = .00781
P(7|8 heads) = (8+1)/256 = .0352
P(6|7|8 heads) = (28+8+1)/256 = .145
P(0|1|7|8 heads) = (1+8+8+1)/256 = .0703
P(0|1|2|6|7|8 heads) = (1+8+28+28+8+1)/256 = .289

3.3.3 A head can appear once in each serial position—the first, the second, and
so forth, up to the nth.

3.4.2
# Trials = 11 # Trials = 12
# Heads/ # Outcomes # Outcomes
Class in Class Probability in Class Probability
0 1 0.00049 1 0.00024
1 11 0.00537 12 0.00293
2 55 0.02686 66 0.01611
3 165 0.08057 220 0.05371
4 330 0.16113 495 0.12085
5 462 0.22559 792 0.19336
6 462 0.22559 924 0.22558
7 330 0.16113 792 0.19336
8 165 0.08057 495 0.12085
9 55 0.02686 220 0.05371
10 11 0.00537 66 0.01611
11 1 0.00049 12 0.00293
12 1 0.00024
# Classes 12 13
# Outcomes 2048 4096

3.4.3
        One-tailed          Two-tailed
N       a = .05   a = .10   a = .05     a = .10
4       X         4         X           X
5       5         5         X           0,5
6       6         6         0,6         0,6
7       7         6-7       0,7         0,7
8       7-8       7-8       0,8         0-1,7-8
9       8-9       7-9       0-1,8-9     0-1,8-9
10      9-10      8-10      0-1,9-10    0-1,9-10
11      9-11      9-11      0-1,10-11   0-2,9-11
12      10-12     9-12      0-2,10-12   0-2,10-12

3.5.1 Critical values for N = 40, alpha = .05, one-tailed, are 26-40. You would
reject the null hypothesis because 30 falls in this range.
3.5.2 Again, you would reject the null hypothesis because 26 falls in the range
26-40.
3.5.3 Critical values for N = 40, alpha = .05, two-tailed, are 0-13 and 27-40.
You would reject the null hypothesis if 30 improved because 30 is in this
range, but you would not reject the null hypothesis if only 26 improved.
For a two-tailed test, 26 is not in the critical range.
3.5.4 Critical values for N = 50, alpha = .05, two-tailed, are 0-17 and 33-50. If
the number who improved were any number 0-17 you would reject the
null hypothesis, thus 18 is the smallest number who, even if they all
improved, would not allow you to reject the null hypothesis. The largest
number who, even if they all improved, would not allow you to reject the
null hypothesis is 32.
3.5.5 Critical values for N = 50, alpha = .05, one-tailed, are 32-50. In this
case, the largest number who, even if they all improved, would not allow
you to reject the null hypothesis is 31. The smallest number that would
allow you to reject the null hypothesis is 32.
3.5.6 Critical values for N = 50, alpha = .01, two-tailed, are 0-15 and 35-50. If
the number who improved were any number 0-15 you would reject the
null hypothesis, thus 16 is the smallest number who, even if they all
improved, would not allow you to reject the null hypothesis. The largest
number who, even if they all improved, would not allow you to reject the
null hypothesis is 34.
3.5.7 Critical values for N = 50, alpha = .01, one-tailed, are 34-50. In this
case, the largest number who, even if they all improved, would not allow
you to reject the null hypothesis is 33. The smallest number that would
allow you to reject the null hypothesis is 34.

5. DESCRIBING A SAMPLE: BASIC DESCRIPTIVE STATISTICS

5.9.2 The mean is pulled in the direction of the outlier and the standard
deviation increases noticeably. The standard score for the outlier will be
quite large, probably greater than three.

7. INFERRING FROM A SAMPLE: THE NORMAL AND t DISTRIBUTIONS

7.1.2 T = 26, Z = 1.90. Would not reject: 1.90 < 1.96.
T = 28, Z = 2.53. Would reject: 2.53 > 1.96.
T = 30, Z = 3.16. Would reject: 3.16 > 1.96.
7.1.3 T = 26, Z = 1.90. Would not reject: 1.90 < 2.58.
T = 28, Z = 2.53. Would not reject: 2.53 < 2.58.
T = 30, Z = 3.16. Would reject: 3.16 > 2.58.
7.1.4 T = 25, Z = -1.83. Would not reject: -1.83 > -1.96.
T = 35, Z = +1.83. Would not reject: +1.83 < +1.96.
T = 23, Z = -2.56. Would reject: -2.56 < -1.96.
T = 37, Z = +2.56. Would reject: +2.56 > +1.96.
7.1.5 N = 80, reject if T is in the range 0-31 or 49-80.
7.1.6 N = 100, reject if T is in the range 0-40 or 60-100.
N = 200, reject if T is in the range 0-86 or 114-200.
7.1.7 N = 80, reject if T is in the range 0-8 or 24-80.
7.2.1 Only two digits after the decimal point are given for Z scores in the table,
so you will need to interpolate.
M = 100, S = 11.8, X = 105, Z = 0.424, 33.5% > 105.
M = 100, S = 11.8, X = 115, Z = 1.271, 10.2% > 115.
M = 100, S = 11.8, X = 80, Z = -1.695, 4.50% < 80.
M = 100, S = 11.8, X = 97|103, Z = ±0.254, 80.0% < 97 or > 103.
7.2.2 2.28% > 2S above mean. 0.13% > 3S above mean.
4.56% > +2S or < -2 from mean. 0.26% > +3S or < -3S from mean.
7.2.3 Z(critical, .05) = 1.65, one-tailed.
Z(critical, .01) = 2.33, one-tailed.
Z(critical, .05) = 1.96, two-tailed.
Z(critical, .01) = 2.58, two-tailed.
7.2.5 Z = 0.678, S = 11.8, X = 108, 12/41 = 29.3% > 108.
Z = 0.678, S = 11.8, X = 92|108, 17/41 = 41.5% < 92 or > 108.
7.3.1 σ = 11.8, N = 100, σM = 1.18. Critical value for t, df = 99, alpha = .01, two-
tailed, is ±2.66 (if the degrees of freedom you want are not tabled, always
use the next lowest value in the table, in this case the value for df = 60).
μ = 100, M = 94, ZM = -5.08. A sample mean this deviant from the
population mean would occur less than 1% of the time if the sample really
were drawn from a population of terrestrials. Therefore the null
hypothesis is rejected. The sample likely consists of aliens.
7.3.2 Standardized normal scores less than -2.63 occur less than 0.43% of the
time.
7.3.3 M = 36.83, SD = 9.33, SD' = 10.23, SD'M = 4.17, t = 2.60.
Critical value for t, df = 5, alpha = .05, two-tailed, is ±2.57. Because 2.60
> 2.57, reject null hypothesis that sample is drawn from a population
whose mean is 26. Critical value for alpha = .01 is 4.03. Because 2.60 <
4.03, you would not reject the null hypothesis if you had selected an
alpha level of .01.
7.3.4 M = 45.5, SD = 24.58, SD' = 26.93, SD'M = 10.99, t = 1.77.
Critical value for t is ±2.57. Because 1.77 < 2.57, you would not reject the
null hypothesis. Based on means alone, the evidence that the population
mean is not 26 seems stronger for the second set of data (M = 45.5) than
for the first set (M = 36.83) - yet you reject the null hypothesis for the
first but not for the second set.
This occurs because the data are much more variable for the second set
than for the first (SD = 24.58 compared to 9.33). Due to the greater raw
score variability, the standard error of the mean estimated from the
second set is larger than for the first, which offsets the larger deviation
between sample and population mean and results in a smaller
standardized value for that deviation (t = 1.77 compared to 2.60 for the
first set).
7.3.5 N = 16, σM = 2.95, t(15)critical = 2.13. Reject if M < 93 or > 107.
N = 32, σM = 2.09, t(31)critical = 2.04 (use df = 30).
Reject if M < 95 or > 105.
N = 64, σM = 1.48, t(63)critical = 2.00 (use df = 60).
Reject if M < 97 or > 103.
The larger the sample size, the smaller the standard error of the mean.
As sample size increases, the sample means (for samples of that size) are
distributed more tightly around the population mean.
7.3.6 SD'M(boys) = 4.02, SD'M(girls) = 3.02. See Fig. 6.2 in text.
7.4.1 lower confidence limit = 26.10, upper confidence limit = 47.56.
7.4.2 lower confidence limit = 17.24, upper confidence limit = 73.76.
7.4.3 Yes. The first confidence interval (26.10 to 47.56) does not include the
null hypothesis population mean of 26, whereas the second interval
(17.24 to 73.76) does include it. Therefore we reject the null hypothesis
that m = 26 for the first set of data, but not for the second set.
9. BIVARIATE RELATIONS:
THE REGRESSION AND CORRELATION COEFFICIENTS

9.3.4 The square root is always positive and so in this case the negative sign for
the correlation is lost.
9.4.4 The predicted score will be the mean of the group to which the raw score
belongs.
9.5.6 The Y intercept is the number of lies detected for the no-drug group; the
slope is the difference in number of lies detected between the no-drug
and drug group.
9.6.1 r = -0.657, r2 = 0.431. The more older siblings an infant has, the fewer
words he or she is likely to speak at 18 months of age. The predicted
number of words for an infant with no older siblings is 34.0 (Y intercept).
For each additional sibling, an infant is likely to speak 3.02 fewer words,
on average (slope).
9.6.2 If number of words = 4 for subject 3, r = -0.291, r2 = 0.0845.
If number of words = 77 for subject 16, r = 0.298, r2 = 0.0888.
If number of words = 222 for subject 16, r = 0.561, r2 = 0.315.
If number of sibs = 10 for subject 7, r = -0.082, r2 = 0.0067.
Note how a single data entry error can reduce a large correlation to a
small one, or can even change a large negative correlation to a large
positive one.
9.6.3 r = -0.571, r2 = 0.326. The mean number of words spoken by the group
with no older siblings (M = 34.7) is higher than the mean number spoken
by the group with one or more older siblings (M = 26.7).
9.6.4 The predicted score for each subject is the mean score for the group to
which that subject belongs, thus the predicted score for infants with no
siblings is 34.7 (the mean number of words spoken by those infants) and
the predicted score for infants with one or more siblings is 26.7.
If the number of subjects in the two groups had been equal initially, and
two subjects had been "lost" from the no-sibling group, the multiple
regression analysis would have yielded what in older texts is called an
unweighted means analysis for studies with unequal cell sizes. The moral
is, what is a problem for traditional analyses of variances (unequal cell
size) is routine within the multiple regression approach. In general,
unequal cell sizes present no computational problems.

10. INFERRING FROM A SAMPLE: THE F DISTRIBUTION

10.1.1 1.250, 1.111, 1.053, 1.026, 1.010.
10.1.2 About 40 or 50.
10.2.1 F(3,24)critical,05 = 3.01. F(6,24)critical,05 = 2.51.
10.2.2 F(1,8)critical,05 = 5.32. 5.24 is not big enough to reject.
10.2.3 F(2,30)critical,05 = 3.32. F(2,30)critical,01 = 5.39. The larger number, 5.39,
demarcates a smaller area under the curve, in this case 1%.
10.2.4 N = 20, because F(1,18)critical,05 = 4.41 and 4.42 > 4.41.
10.2.5 No, because the smallest critical value for F with one degree of freedom
in the numerator and an infinite number of degrees of freedom in the
denominator is 3.84 and 3.8 is less than that.
10.2.6 MSplacebo = 12,892 and N = 8; MSdrug = 49,481 and N = 10. The null
hypothesis is rejected because F(9,7)critical,05 = 3.68 is less than F(9,7) =
3.84.
10.3.8 F(1,8)computed = 3.46; F(1,8)critical,05 = 5.32. Do not reject; computed F not
big enough. For this sample size and these data, the effect of drug
treatment on number of lies detected is not statistically significant.
10.4.3 F(1,8)computed = 3.12; F(1,8)critical,05 = 5.32. Do not reject; computed F not
big enough. For this sample size and these data, the effect of mood on
number of lies detected is not statistically significant.
10.5.1 F(1,14) = 10.62; F(1,14)critical,01 = 8.86. Reject at both .05 and .01 levels.
10.5.2 F(1,14) = 6.77; F(1,14)critical,05 = 4.60. Reject at the .05 but not the .01
level.
10.5.3 Student's t test (for independent groups).
10.5.5 See Fig. 10.4 in text. For no sibs: N = 7, M = 34.7, SD'M = 2.62, M - SD'M
= 32.1, M + SD'M = 37.3, t(6)critical,05 = 2.45, lower 95% confidence limit
= 28.3, upper 95% confidence limit = 41.1. For one or more sibs: N = 9,
M = 26.7, SD'M = 1.83, M - SD'M = 24.8, M + SD'M = 28.5, t(8)critical,05 =
2.31, lower 95% confidence limit = 22.5, upper 95% confidence limit =
30.9. The mean for the no sib group (34.7) does not fall within the 95%
confidence interval for the sib group (22.5-30.9), nor does the mean for
the sib group (26.7) fall within the 95% confidence interval for the no sib
group (28.3-41.1). Therefore the two means are probably significantly
different (at the .05 level), but this would need to be verified with a
formal statistical test.

11. ACCOUNTING FOR VARIANCE: TWO OR MORE PREDICTORS

11.1.2 Estimated standard error of estimate (predicting lies from drug) = 1.87.
Estimated standard deviation (predicting lies from mean) = 2.002.
Percentage reduction = 6.6%. Estimated standard error of estimate
(predicting lies from mood) = 1.899. Estimated standard deviation
(predicting lies from mean) = 2.002. Percentage reduction = 5.29%.
11.1.3 Drug: R2 = 0.30, R2adj = 0.21.
Mood: R2 = 0.28, R2adj = 0.19.
11.4.3 R2 = .40, N = 10, F(1,8) = 5.33, F(1,8)critical,05 = 5.32
11.4.4 R2 = .30, N = 14, F(1,12) = 5.14, F(1,12)critical,05 = 4.75
R2 = .20, N = 20, F(1,18) = 4.50, F(1,18)critical,05 = 4.41
R2 = .10, N = 40, F(1,38) = 4.22, F(1,30)critical,05 = 4.17
R2 = .05, N = 80, F(1,78) = 4.00, F(1,60)critical,05 = 4.00
11.4.5 If a proportion of variance is just barely significant, and if you want to be
able to claim that half that amount is statistically significant, then you
would need to double the number of subjects included in the study.
11.5.1 Predicting lies detected: Adding mood to drug:
                        Total               Change
Step  Variable added    R2    df   F        R2    df   F
1     Drug              .302  1,8  3.46     .302  1,8  3.46
2     Mood              .346  2,7  1.85     .044  1,7  <1

11.5.2 Predicting words spoken: In addition to knowing whether or not an
infant has an older sibling, does knowing the exact number matter:
                        Total                Change
Step  Variable added    R2    df    F        R2    df    F
1     0 vs > 0          .326  1,14  6.77     .326  1,14  6.77
2     # Sibs            .450  2,13  5.32     .124  1,13  2.94

Infants who have no older siblings use more words than infants who have
one or more older siblings. The binary distinction between infants who
have no, and who have one or more, older siblings accounts for 32.6% of
the variance in number of words used (F(1,14) = 6.77, p < .05). An
additional 12.4% is accounted for by knowing the actual number of older
siblings, but this is not significant (F(1,13) = 2.94, NS).
11.6.10 See text.

12. SINGLE-FACTOR BETWEEN SUBJECTS STUDIES

12.1.4 Y' = a + b1X1 + b2X2 + b3X3, dfmodel = 3, dferror = 12.

12.1.5 Predicted scores are the means for the four groups.
12.1.6 F(3,12) = 7.63, F(3,12)critical,05 = 3.49, therefore reject.
12.1.7 R2 = 0.097, F(3,12) = 0.431, NS.
R2 = 0.562, F(3,13) = 5.562, p < .05.
12.2.3 One possible set of contrast codes for five groups is as follows:
Coded Variable
Group X1 X2 X3 X4
Has no desire for children -4 0 0 0
Has none but would like 1 -3 0 0
Has 1 child 1 1 -2 0
Has 2 children 1 1 1 -1
Has >2 children 1 1 1 1

12.3.2 R2 = 0.656, F(3,12) = 7.63, p < .01.


12.3.3 R2 = 0.656, F(3,12) = 7.63, p < .01.
12.3.4 As long as the subjects in each group and their scores on the DV do not
change, the G - 1 predictor variables coding for group membership will
always account for the same proportion of variance. The particular way
the predictors code for group membership does not matter.
12.3.5 R2 = 0.369, F(3,13) = 2.53, NS.
12.4.3 R2 = 0.562, F(3,13) = 5.562, p < .05.
13. PLANNED COMPARISONS, POST HOC TESTS, AND ADJUSTED MEANS
13.1.1 Stepwise results for a planned comparison analysis using contrast code
set II:
                          Total                Change
Step  Variable added      R2    df    F        R2    df    F
1     Want children?      .523  1,14  15.33    .523  1,12  18.24
2     Have children?      .526  2,13  7.21     .003  1,12  <1
3     >1 child?           .656  3,12  7.63     .130  1,12  4.53

13.2.1 There is a significant difference in the mean number of button pushes for
subjects in the no desire for children group (M = 113) and subjects in the
want/have children group (M = 77). However, there is no difference in
mean button pushes for those who have no children, but would like to (M
= 79) and those who already have one or more children (M = 76). There
is also no difference in mean number of button presses for those who have
only one child (M = 65), and those with more than one child (M = 87).
13.2.2 Stepwise results for a planned comparison analysis, comparing the seven
infants with no older siblings with the nine infants with older siblings
(contrast 1) and the four infants with one with the five infants with more
than one older sibling (contrast 2):
                      Total                Change
Step  Variable added  R2    df    F        R2    df    F
1     0 vs > 0        .326  1,14  6.77     .326  1,13  8.14
2     1 vs > 1        .479  2,13  5.98     .153  1,13  3.83

F(1,13)critical,05 = 4.67, thus only the first contrast is significant. The mean
number of words spoken by the seven infants with no older siblings (M =
34.7) differs significantly from the mean number of words spoken by the
nine infants with one or more older siblings (M = 26.7). The mean
number of words spoken by the four infants with only one older sibling
(M = 30.8) does not differ significantly from the mean number of words
spoken by the five infants with more than one older sibling (M = 23.4),
and hence normally these last two means would not be reported.
13.3.1 See Fig. 13.6 in text.
13.3.2 Post hoc analysis: MSerror = 92.5, df = 12, n = 4, q(4,12) = 4.20, therefore
TCD = 20.2 (a computational sketch follows the tables below).

                            Groups
                  M = 65    M = 79    M = 87    M = 113
Groups            C1        CO        C2        ND
C1 (1 child)      0         14        22        48
CO (none, desire)           0         8         34
C2 (> 1 child)                        0         26
ND (no desire)                                  0
                  c         c
                  b                   b
                                                a

Means
ND      CO       C1      C2
113a    79b,c    65c     87b
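For readers checking these numbers, a minimal SAS sketch of the Tukey critical
difference computation used in these post hoc answers (TCD = q times the
square root of MSerror divided by n):
DATA _null_;
q = 4.20; mse = 92.5; n = 4; * values from answer 13.3.2;
tcd = q * sqrt(mse / n); * Tukey critical difference;
PUT tcd= 5.1; * prints 20.2;
RUN;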

13.3.3 Post hoc analysis: MSerror = 23.2, df = 12, n = 4, q(4,12) = 4.20, therefore
TCD = 10.1.
                            Groups
                  M = 65    M = 79    M = 87    M = 113
Groups            C1        CO        C2        ND
C1 (1 child)      0         14        22        48
CO (none, desire)           0         8         34
C2 (> 1 child)                        0         26
ND (no desire)                                  0
                  c
                            b         b
                                                a

Means
ND      CO     C1     C2
113a    79b    65c    87b
13.4.1 Source table for want/has child status ANOVA: 14 subjects:
Source SS df MS F
Between groups 3137.5 3 1045.8 4.96
Within groups 2107.3 10 210.7
TOTAL between subjects 5244.9 13

The critical value for F(3,10), alpha = .05, is 3.88 and for alpha =.01,
5.27. Thus the group effect would be significant if alpha were set to .05
but it would not be significant if alpha were .01.
13.4.2 Post hoc analysis: MSerror = 210.7, df = 10, n = 3.43 (harmonic mean),
q(4,10) = 4.33, therefore TCD = 33.9.
                              Groups
                  M = 65.0   M = 79.0   M = 84.7   M = 107.3
Groups            C1         CO         C2         ND
C1 (1 child)      0          14.0       19.7       42.3
CO (none, desire)            0          5.7        28.3
C2 (> 1 child)                          0          22.7
ND (no desire)                                     0
                  b          b          b
                             a          a          a

Means
ND        CO        C1       C2
107.3a    79.0ab    65.0b    84.7ab

13.4.3 Post hoc analysis: MSerror = 31.34, df = 13, n = 5.06 (harmonic mean),
q(3,13) = 3.73, therefore TCD = 9.28.
                    Groups
            >1 Sib    =1 Sib    =0 Sib
Groups      (23.4)    (30.8)    (34.7)
> 1 sib     0         7.35      11.31
= 1 sib               0         3.96
= 0 sib                         0
            b         b
                      a         a

Means
= 0 sib    = 1 sib    > 1 sib
34.7a      30.8ab     23.4b
The post hoc analysis reveals that the mean number of words spoken by
infants who have no older siblings (M = 34.7) is significantly different
from the mean for infants with more than one older sibling (M = 23.4),
but that the mean number of words spoken by infants with only one older
sibling (M = 30.8) is not significantly different from either of the other
two means. Note that a post hoc analysis compares all possible pairs of
means for the groups involved. It does not allow some groups to be
lumped together, as a planned comparison analysis does.
13.8.2 For step 3, the Age x Sex interaction, R2total = 0.359, F(3,16) = 2.99, NS,
and R2change = 0.019, F(1,16) = 0.471, NS.

14. STUDIES WITH MORE THAN ONE BETWEEN-SUBJECTS FACTOR

14.3.1 Formulas for degrees of freedom for a four-way factorial study (numbers
at the right give values if a = 3, b = 2, c = 4, d = 2, and N = 240):
Source                   df formulas                      df
A main effect            a - 1                            2
B main effect            b - 1                            1
C main effect            c - 1                            3
D main effect            d - 1                            1
AB interaction           (a - 1)(b - 1)                   2
AC interaction           (a - 1)(c - 1)                   6
AD interaction           (a - 1)(d - 1)                   2
BC interaction           (b - 1)(c - 1)                   3
BD interaction           (b - 1)(d - 1)                   1
CD interaction           (c - 1)(d - 1)                   3
ABC interaction          (a - 1)(b - 1)(c - 1)            6
ABD interaction          (a - 1)(b - 1)(d - 1)            2
ACD interaction          (a - 1)(c - 1)(d - 1)            6
BCD interaction          (b - 1)(c - 1)(d - 1)            3
ABCD interaction         (a - 1)(b - 1)(c - 1)(d - 1)     6
S/ABCD, subjects/ABCD    N - abcd                         192

TOTAL between subjects   N - 1                            239

14.3.3 Degrees of freedom for a 3 x 4 factorial study, N = 108:


Source df
A main effect 2
B main effect 3
AB interaction 6
S/AB, subjects within AB 96

TOTAL between subjects 107

14.4.3 The mean for males is 96, for females 76, which represents a deviation of
10 (b1) from the grand mean of 86 (a). The prediction equation includes
a term for sex only (Y' = a + b1X1). Accordingly, the predicted scores are
96 for males (86 + 10) and 76 for females (86 - 10).
14.4.4 The mean for set I is 89, for set II 83, which represents a deviation of 3
(b2) from the grand mean. The prediction equation includes a term for
sex and for instruction set (Y' = a + b1X1 + b2X2). Accordingly, the
predicted scores are 99 for set I males (86 + 10 + 3), 93 for set II males
(86 + 10 - 3), 79 for set I females (86 - 10 + 3), and 73 for set II females
(86 - 10 - 3).
14.4.5 The four separate group means are 113, 79, 65, and 87, which deviate 14
(b3) from the scores predicted by the two main effects alone. (In the case
of a 2 x 2, the deviations for all four groups are always the same.) The
prediction equation includes a term for sex, for instruction set, and for
their interaction (Y' = a + b1X1 + b2X2 + b3X3). Accordingly, the predicted
scores are 113 for set I males (86 + 10 + 3 + 14), 79 for set II males (86 +
10 - 3 - 14), 65 for set I females (86 - 10 + 3 - 14), and 87 for set II
females (86 - 10 - 3 + 14).
14.4.6 Stepwise results for a 2 x 2 factorial analyzing sex and instruction set:
                       Total                Change
Step  Variable added   R2    df    F        R2    df    F
1     Sex              .215  1,14  3.84     .215  1,12  7.51
2     Inst             .234  2,13  1.99     .019  1,12  0.68
3     Sex x Inst       .656  3,12  7.63     .422  1,12  14.71

14.5.1 Instruction affects males and females differently. Males push the button
more when exposed to set I, females more when exposed to set II.
Moreover, instruction set I, but not II, differentiates males and females.
When exposed to set I, males push the button more than females, but
when exposed to set II, the mean number of button pushes is not
significantly different for males and females. Nor is the difference
between set I females and set II males significant.
          Instruction
Sex       Set I     Set II
Male      113a      79b,c
Female    65b       87c

14.5.2 The interpretation is essentially the same as for 14.5.1. Now the
difference between set I females and set II males is significant, but the
key differences mentioned in the preceding answer remain the same.
          Instruction
Sex       Set I     Set II
Male      113a      79b
Female    65c       87b
14.8.2 Your spreadsheet for this exercise, including the source table, should look
as follows (columns beyond K not shown):

A B C D E F G H I J K
1   Smiles Gen  Partner    GxP          Y'   y=    m=     e=
2 s Y      A    B1   B2    AB1   AB2         Y-My  Y'-My  Y-Y'
3 1 5 1 2 0 2 0 5.333 -0.25 0.08 -0.333
4 2 7 1 2 0 2 0 5.333 1.75 0.08 1.667
5 3 4 1 2 0 2 0 5.333 -1.25 0.08 -1.333
6 4 8 1 -1 1 -1 1 5.250 2.75 0.00 2.750
7 5 6 1 -1 1 -1 1 5.250 0.75 0.00 0.750
8 6 3 1 -1 1 -1 1 5.250 -2.25 0.00 -2.250
9 7 4 1 -1 1 -1 1 5.250 -1.25 0.00 -1.250
10 8 11 1 -1 -1 -1 -1 9.000 5.75 3.75 2.000
11 9 12 1 -1 -1 -1 -1 9.000 6.75 3.75 3.000
12 10 4 1 -1 -1 -1 -1 9.000 -1.25 3.75 -5.000
13 11 1 -1 2 0 -2 0 2.250 -4.25 -3.00 -1.250
14 12 1 -1 2 0 -2 0 2.250 -4.25 -3.00 -1.250
15 13 5 -1 2 0 -2 0 2.250 -0.25 -3.00 2.750
16 14 2 -1 2 0 -2 0 2.250 -3.25 -3.00 -0.250
17 15 1 -1 -1 1 1 -1 3.667 -4.25 -1.58 -2.667
18 16 5 -1 -1 1 1 -1 3.667 -0.25 -1.58 1.333
19 17 5 -1 -1 1 1 -1 3.667 -0.25 -1.58 1.333
20 18 8 -1 -1 -1 1 1 7.000 2.75 1.75 1.000
21 19 7 -1 -1 -1 1 1 7.000 1.75 1.75 0.000
22 20 6 -1 -1 -1 1 1 7.000 0.75 1.75 -1.000
23 Sum= 105 0 1 1 -3 1 105.0 0.00 0.000
24 N= 20 20 20 20 20 20 20 20 20
25 M= 5.25 0 0.05 0.05 -0.15 0.05 VAR= 8.788 4.746 4.042
26 a,b= 5.417 1.111 -0.813 -1.771 0.215 -0.104 SD= 2.964
27 R = 0.735 RSQ= 0.540

     A     B                    C      D         E       F    G      H      I
1    Step  Source               R2     R2change  SS      df   MS     F      pη2
2    1     A - Gender           0.150  0.150     26.45   1    26.45  4.581  0.247
3    2     B - Partner          0.528  0.378     66.42   2    33.21  5.752  0.451
4    3     AB - G x P           0.540  0.012     2.04    2    1.02   0.177  0.025
5    4     S/AB - error         1.000  0.460     80.83   14   5.77
6    5     Total between Ss     1.000            175.75
14.8.3 Post hoc analysis: MSerror = 5.77, df = 14, n = 6.63 (harmonic mean),
q(3,14) = 3.70, therefore TCD = 3.45:
            Groups
          Mother   Father   Stranger
Groups    (3.57)   (4.57)   (8.00)
Mother    0        1.00     4.43
Father             0        3.43
Stranger                    0
          a        a
                   b        b

Means
Mother   Father   Stranger
N = 7    N = 7    N = 6
3.57a    4.57ab   8.00b

According to this analysis, the sex by partner effect is not significant,
therefore we are free to examine the main effects. The sex main effect is
also not significant, although we might note that the difference between
the mean number of smiles for males (6.4) and females (4.1) approached,
but did not reach, the conventional .05 level of significance, F(1,14)critical,05
= 4.60. The partner effect, however, was significant, F(2,14) = 5.75, p <
.05. Infants smiled more to strangers (M = 8.00) than to mothers (M =
3.57) and this difference was significant (p < .05) according to the Tukey
post hoc test. The difference between the number of smiles to mothers
and to fathers, and the difference between the number of smiles to
fathers and strangers, however, were not significant.

15. SINGLE-FACTOR WITHIN SUBJECTS STUDIES

15.1.2 R2 = 0.638, SS = 25.6.


15.1.3 R2 = 0.940, SS = 37.7.
15.1.4 R2change = 0.302, SSchange = 12.1, dferror = 4, F(1,4) = 20.17, p < .05.

15.1.5 After step 1, predicted scores represent the mean score for the subject.
After step 2, they take into account the mean difference between time 1
and time 2 scores as well. Regression coefficients for dummy-coded
variables reflect deviations from the comparison group, the group coded
all zeros. Thus the first subject's mean score (3.5) minus the last
subject's mean score (7.5) is -4.0, and so forth, and the mean score for
the drug group (4.2) minus the mean score for the placebo group (6.4) is
-2.2.
15.2.9 F(1,4)computed = 20.17; F(1,4)critical,05 = 7.71. You reject the null hypothesis
because 20.17 > 7.71. For this sample size and these repeated measures
data, the effect of drug treatment on number of lies detected is
statistically significant.
16. TWO-FACTOR STUDIES INVOLVING REPEATED MEASURES

16.1.19 In this case, F(1,6)critical,05 = 5.99 for all three effects tested. The sex main
effect and the sex by instruction interaction are significant. The main
effect is qualified by an interaction, so first that interaction should be
understood and interpreted.
16.2.22 In this case, F(1,3)critical,05 = 10.13 for all three effects tested. The sex
main effect and the sex by instruction interaction are significant. The
main effect is qualified by an interaction, so first that interaction should
be understood and interpreted.
16.5.1 The TCD for the one-between, one-within is 37.8 and for the no-between,
two-within is 30.4. Thus for these data it happens that the post hoc
results for the two designs are the same:
                     Groups
          #3-F1     #2-M2     #4-F2     #1-M1
Groups    (M = 65)  (M = 79)  (M = 87)  (M = 113)
#3-F1     0         14.0      19.7      42.3
#2-M2               0         5.7       28.3
#4-F2                         0         22.7
#1-M1                                   0
          b         b         b
                    a         a         a

For both the one-between, one-within and the no-between, two-within
analyses, means are based on four scores each and are displayed in the
following table. Means that do not differ significantly according to the
Tukey test, alpha = .05, share a common subscript.

          Instruction
Sex       Set I     Set II
Male      113a      79ab
Female    65b       87ab

Neither males nor females were affected by the instructions they
received; the difference between the mean number of button pushes for
males who received set I versus II was not significant (113 versus 79), nor
was the corresponding difference significant for females (65 versus 87).
Instruction set I, however, but not set II, distinguished between males
and females; the difference between the mean number of button pushes
for males versus females exposed to set I was significant (113 versus 65),
but the corresponding difference for males versus females exposed to set
II was not significant (79 versus 87).
16.6.1 Degrees of freedom for a three-factor, two-between, one-within design
(numbers at the right give values if a = 2, b = 4, p = 3, and N = 96):
Source                   df formulas              df
A, main effect           a - 1                    1
B, main effect           b - 1                    3
AB interaction           (a - 1)(b - 1)           3
S/AB, subjects w/i AB    N - ab                   88
TOTAL b/w subjects       N - 1                    95

P main effect            p - 1                    2
PA interaction           (p - 1)(a - 1)           2
PB interaction           (p - 1)(b - 1)           6
PAB interaction          (p - 1)(a - 1)(b - 1)    6
PxS/AB interaction       (p - 1)(N - ab)          176
TOTAL w/i subjects       Np - N                   192

TOTAL (b/w + w/i)        Np - 1                   287

16.6.2 Degrees of freedom for a three-factor, one-between, two-within design
(numbers at the right give values if a = 3, p = 2, q = 4, and N = 45):
Source                   df formulas              df
A, main effect           a - 1                    2
S/A, subjects w/i A      N - a                    42
TOTAL b/w subjects       N - 1                    44

P main effect            p - 1                    1
PA interaction           (p - 1)(a - 1)           2
PxS/A interaction        (p - 1)(N - a)           42
Q main effect            q - 1                    3
QA interaction           (q - 1)(a - 1)           6
QxS/A interaction        (q - 1)(N - a)           126
PQ interaction           (p - 1)(q - 1)           3
PQA interaction          (p - 1)(q - 1)(a - 1)    6
PQxS/A interaction       (p - 1)(q - 1)(N - a)    126
TOTAL w/i subjects       Npq - N                  315

TOTAL (b/w + w/i)        Npq - 1                  359


16.7.1 Predictor variables for this one-between (sex), one-within (week)
2 x 3 factorial are as follows:
     #Hr  Sex            S/A                  Week     Wk x Sx
s    Y    A    S1 S2 S3 S4 S5 S6 S7 S8 S9     P1  P2   AP1 AP2
1 5 -1 1 0 0 0 0 0 0 0 0 -1 1 1 -1
2 6 1 0 0 0 0 0 0 0 1 0 -1 1 -1 1
3 4 1 0 0 0 0 0 0 0 0 1 -1 1 -1 1
4 5 1 0 0 0 0 0 0 0 0 0 -1 1 -1 1
5 5 -1 0 1 0 0 0 0 0 0 0 -1 1 1 -1
6 6 -1 0 0 1 0 0 0 0 0 0 -1 1 1 -1
7 3 -1 0 0 0 1 0 0 0 0 0 -1 1 1 -1
8 5 1 0 0 0 0 1 0 0 0 0 -1 1 -1 1
9 7 1 0 0 0 0 0 1 0 0 0 -1 1 -1 1
10 4 1 0 0 0 0 0 0 1 0 0 -1 1 -1 1
11 5 -1 0 0 0 0 0 0 0 0 0 -1 1 1 -1
1 3 -1 1 0 0 0 0 0 0 0 0 0 -2 0 2
2 5 1 0 0 0 0 0 0 0 1 0 0 -2 0 -2
3 6 1 0 0 0 0 0 0 0 0 1 0 -2 0 -2
4 5 1 0 0 0 0 0 0 0 0 0 0 -2 0 -2
5 7 -1 0 1 0 0 0 0 0 0 0 0 -2 0 2
6 6 -1 0 0 1 0 0 0 0 0 0 0 -2 0 2
7 6 -1 0 0 0 1 0 0 0 0 0 0 -2 0 2
8 6 1 0 0 0 0 1 0 0 0 0 0 -2 0 -2
9 5 1 0 0 0 0 0 1 0 0 0 0 -2 0 -2
10 7 1 0 0 0 0 0 0 1 0 0 0 -2 0 -2
11 5 -1 0 0 0 0 0 0 0 0 0 0 -2 0 2
1 5 -1 1 0 0 0 0 0 0 0 0 1 1 -1 -1
2 6 1 0 0 0 0 0 0 0 1 0 1 1 1 1
3 7 1 0 0 0 0 0 0 0 0 1 1 1 1 1
4 8 1 0 0 0 0 0 0 0 0 0 1 1 1 1
5 6 -1 0 1 0 0 0 0 0 0 0 1 1 -1 -1
6 7 -1 0 0 1 0 0 0 0 0 0 1 1 -1 -1
7 6 -1 0 0 0 1 0 0 0 0 0 1 1 -1 -1
8 7 1 0 0 0 0 1 0 0 0 0 1 1 1 1
9 8 1 0 0 0 0 0 1 0 0 0 1 1 1 1
10 6 1 0 0 0 0 0 0 1 0 0 1 1 1 1
11 5 -1 0 0 0 0 0 0 0 0 0 1 1 -1 -1
16.7.2 The spreadsheet for the analysis of variance source table is as follows:
1-between, 1-within          Total           Change
Step  Source                 RSQ    SS       RSQ    SS     df  MS    F
1     A (main effect)        0.065  3.06     0.065  3.06   1   3.06  2.676
2     S/A (error term)       0.282  13.33    0.217  10.28  9   1.14
      TOTAL between Ss                       0.282  13.33  10

3     P (main effect)        0.533  25.21    0.251  11.88  2   5.94  5.167
4     AP (interaction)       0.563  26.64    0.030  1.43   2   0.72  0.623
5     S/AxP (error term)     1.000  47.33    0.437  20.69  18  1.15
      TOTAL within Ss                        0.718  34.00  22

      TOT (between + within)                 1.000  47.33  32

Only the effect for weeks was significant (F(2,18) = 5.2, p < .05). The
post hoc analysis is as follows: MSerror = 1.149, df = 18, n = 11, q(3,18) =
3.61, therefore TCD = 1.167:

            Groups
          Week 1   Week 2   Week 3
Groups    (5.00)   (5.55)   (6.45)
Week 1    0        0.55     1.45
Week 2             0        0.91
Week 3                      0
          a        a
                   b        b

Means
Week 1    Week 2    Week 3
N = 11    N = 11    N = 11
5.00a     5.55ab    6.45b

For these data, the number of hours males and females studied did not
differ significantly, but the mean number of hours studied the first week
(M = 5.00 hours) was significantly less than the mean number of hours
studied the third week (M - 6.45 hours). The mean number of hours
studied the second week (M = 5.55) was between the means for the first
and third weeks and did not differ significantly from either of them.
16.7.3 The spreadsheet for the analysis of variance source table for a trend
analysis is as follows (the two planned comparisons are a linear and a
quadratic trend):
1-between, 1-within          Total           Change
Step  Source                 RSQ    SS       RSQ    SS     df  MS     F
1     A (main effect)        0.065  3.06     0.065  3.06   1   3.06   2.676
2     S/A (error term)       0.282  13.33    0.217  10.28  9   1.14
      TOTAL between Ss                       0.282  13.33  10

3     Linear                 0.528  24.97    0.246  11.64  1   11.64  10.124
4     Quadratic              0.533  25.21    0.005  0.24   1   0.24   0.211
      AP (interaction)       0.563  26.64    0.030  1.43   2   0.72   0.623
5     S/AxP (error term)     1.000  47.33    0.437  20.69  18  1.15
      TOTAL within Ss                        0.718  34.00  22

      TOT (between + within)                 1.000  47.33  32

By committing to a trend analysis, the two degrees of freedom within
weeks are partitioned into a linear and a quadratic component, each of
which is tested for significance. An omnibus test for weeks is not
performed, nor are post hoc tests. In this case, only the linear trend was
significant (F(1,18) = 10.1, p < .01), indicating that the monotonic
increase from 5.00, to 5.55, to 6.45 is significant. In other words, we can
account for an additional 24.6% of the variance (a significant increase) in
number of hours studied, above and beyond that accounted for by
knowing the particular student's mean for all three scores (R2 = .282), if
we fit a straight line to each subject's week 1, week 2, and week 3 scores.

17. POWER, PITFALLS, AND PRACTICAL MATTERS


17.1.3 See text.
17.2.1 Spreadsheet that computes posttest scores adjusted for the effect of the
pretest scores (labeled Yadj).
     A   B     C    D   E     F     G      H
1        Post  Pre  Tr              d=     Yadj=
2    S   Y     X    A   XxA   Y"    Y"-My  Y-d
3 1 102 79 -1 -79 89.66 0.669 101.3
4 2 125 93 -1 -93 87.32 -1.67 126.6
5 3 95 75 -1 -75 90.33 1.339 93.66
6 4 130 69 -1 -69 91.34 2.343 127.6
7 5 43 101 1 101 85.98 -3.01 46.01
8 6 82 94 1 94 87.15 -1.84 83.84
9 7 69 84 1 84 88.83 -0.16 69.16
10 8 66 69 1 69 91.34 2.343 63.65
11 Sum= 712 712 0 712
12 N= 8 8 8 8
13 Mean= 89
14 a,b= 102.8 -0.16 -3.3
15 R= 0.859
17.2.2 The adjusted means for the trained and untrained groups are 112.3 and
65.7 respectively (unadjusted means = 113 and 65). For these data the
adjustment is not great because the correlation between pretest and
posttest scores is not strong (R2 = 0.132, F(1,5) = 2.5, NS).
17.3.1 L = 12.65,f2 = .36/.64 = .563, n* = 12.65/.563 + 3 = 26.
17.3.2 L = 10.51, f2 = .22/.70 = .314, n* = 10.51/.314 + 2 = 36.
17.3.3 L = 14.17, f2 = .17/.76 = .224, n* = 14.17/.224 + 4 = 68.
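A minimal SAS sketch of the arithmetic behind these answers; the value of L
and the added constant are taken directly from answer 17.3.1 above:
DATA _null_;
L = 12.65; * value from answer 17.3.1;
f2 = .36 / .64; * effect size f-squared;
nstar = ceil(L / f2 + 3); * 3 is the constant from answer 17.3.1;
PUT nstar=; * prints 26;
RUN;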
Appendix C:
Statistical Tables

Table A
Critical Values for the Binomial Distribution, P = 0.5
The first number in each row, N, indicates the number of trials that are
categorized either plus or minus. The remaining numbers represent unlikely
values for the number of pluses. The column in which the value falls indicates
the probability of that or a smaller number of pluses occurring (lower tail), or the
probability of that or a larger number of pluses occurring (upper tail), assuming
the null hypothesis value of P = 0.5. Probability values for the columns represent
selected standard values.

1. a = .05, two-tailed: Reject the null hypothesis if the number of
pluses is equal to or less than the number in the .025 column for the
lower tail, or if the number of pluses is equal to or greater than the
lower tail, or if the number of pluses is equal to or greater than the
number in the .025 column for the upper tail.
2. a = .01, two-tailed: Reject the null hypothesis if the number of
pluses is equal to or less than the number in the .005 column for the
lower tail, or if the number of pluses is equal to or greater than the
number in the .005 column for the upper tail.
3. a = .05, one-tailed: If the alternative hypothesis predicts a lower-tail
value, reject the null hypothesis if the number of pluses is equal to or
less than the number in the .050 column for the lower tail. If the
alternative hypothesis predicts an upper-tail value, reject the null
hypothesis if the number of pluses is equal to or greater than the
number in the .050 column for the upper tail.
4. a = .01, one-tailed: If the alternative hypothesis predicts a lower-tail
value, reject the null hypothesis if the number of pluses is equal to or
less than the number in the .010 column for the lower tail. If the
alternative hypothesis predicts an upper-tail value, reject the null
hypothesis if the number of pluses is equal to or greater than the
number in the .010 column for the upper tail.
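The tabled values can be verified with SAS's cumulative binomial function
PROBBNML(p, n, m), which returns the probability of m or fewer pluses in n
trials; a minimal sketch for N = 10:
DATA _null_;
p = PROBBNML(0.5, 10, 1); * P(1 or fewer pluses in 10 trials);
PUT p= 6.4; * prints 0.0107, below .025, matching the table;
RUN;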

Table A
Critical Values for the Binomial Distribution, P = 0.5 (continued)

Lower Tail Upper Tail


N    .005  .010  .025  .050  .050  .025  .010  .005
5 0 5
6 0 0 6 6
7 0 0 0 7 7 7
8 0 0 0 1 7 8 8 8
9 0 0 1 1 8 8 9 9
10 0 0 1 1 9 9 10 10
11 0 1 1 2 9 10 10 11
12 1 1 2 2 10 10 11 11
13 1 1 2 3 10 11 12 12
14 1 2 2 3 11 12 12 13
15 2 2 3 3 12 12 13 13
16 2 2 3 4 12 13 14 14
17 2 3 4 4 13 13 14 15
18 3 3 4 5 13 14 15 15
19 3 4 4 5 14 15 15 16
20 3 4 5 5 15 15 16 17
21 4 4 5 6 15 16 17 17
22 4 5 5 6 16 17 17 18
23 4 5 6 7 16 17 18 19
24 5 6 6 7 17 18 18 19
25 5 6 7 7 18 18 19 20
26 6 6 7 8 18 19 20 20
27 6 7 7 8 19 20 20 21
28 6 7 8 9 19 20 21 22
29 7 7 8 9 20 21 22 22
30 7 8 9 10 20 21 22 23
31 7 8 9 10 21 22 23 24
32 8 8 10 10 22 22 24 24
33 8 9 10 11 22 23 24 25
34 9 9 10 11 23 24 25 25
35 9 10 11 12 23 24 25 26
36 9 10 11 12 24 25 26 27
37 10 11 12 13 24 25 26 27
38 10 11 12 13 25 26 27 28
39 11 11 12 13 26 27 28 28
40 11 12 13 14 26 27 28 29
41 11 12 13 14 27 28 29 30
42 12 13 14 15 27 28 29 30
43 12 13 14 15 28 29 30 31
44 13 13 15 16 28 29 31 31
45 13 14 15 16 29 30 31 32
46 13 14 15 16 30 31 32 33
47 14 15 16 17 30 31 32 33
48 14 15 16 17 31 32 33 34
49 15 15 17 18 31 32 34 34
50 15 16 17 18 32 33 34 35
Table B
Areas Under the Normal Curve
The first number in each row gives a Z score accurate to the first decimal digit.
Subsequent columns provide the second decimal digit. The numbers in the table
give the proportion of area under the normal curve to the left of the indicated Z
score. For example, the proportion of area to the left of a Z score of -1.96 is .0250
and the proportion of area to the left of a Z score of +2.58 is .9951.

1. What is the probability of obtaining a Z score less than a particular
value (under the curve to the left)? For both negative and positive Z
scores, this probability can be read directly from the table. For
example, the probability of a Z score being less than +1.00 is .8413.
2. What is the probability of obtaining a Z score greater than a
particular value (under the curve to the right)? For both negative
and positive Z scores, subtract the tabled probability from 1. For
example, the probability of a Z score being greater than +1.64 is 1
minus .9495, which is .0505, and the probability of a Z score being
greater than +1.65 is 1 minus .9505, which is .0495.
3. What is the probability of obtaining a Z score less than the absolute
value of a particular value (under the curve in the center)? Double
the tabled value for the negative Z score and subtract it from 1. For
example, the probability of a Z score being less than 1.00 absolute is
twice .1587 or .3174 subtracted from 1, which is .6826.
4. What is the probability of obtaining a Z score greater than the
absolute value of a particular value (under the curve in the tails)? It
is twice the tabled probability for the negative Z score. For example,
the probability of a Z score being greater than 2.57 absolute is twice
.0051, which is .0102, and the probability of a Z score being greater
than 2.58 absolute is twice .0049, which is .0098.
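The four look-ups above can be verified with SAS's PROBNORM function, which
returns the area under the normal curve to the left of a given Z score; a
minimal sketch:
DATA _null_;
p_left = PROBNORM(-1.96); * area to the left: .0250;
p_right = 1 - PROBNORM(1.64); * area to the right: .0505;
p_tails = 2 * PROBNORM(-1.00); * area in both tails: .3174;
PUT p_left= p_right= p_tails=;
RUN;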
Table B
Areas Under the Normal Curve (continued)

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.1 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.2 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.3 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.4 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.0 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.1 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.2 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.3 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.0 .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.1 .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.2 .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.3 .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.4 .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.5 .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.6 .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.7 .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.8 .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.9 .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.0 .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990

Note. Abridged from Table 1, Biometrika Tables for Statisticians (Vol.1,3rd ed.) edited
by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press.
Adapted by permission of the Biometrika Trustees.
Table B
Areas Under the Normal Curve (continued)

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
-0.0 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641
-0.1 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
-0.2 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
-0.3 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
-0.4 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
-0.5 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
-0.6 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
-0.7 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
-0.8 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
-0.9 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611
-1.0 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379
-1.1 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170
-1.2 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985
-1.3 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823
-1.4 .0808 .0793 .0778 .0764 .0749 .0735 .0721 .0708 .0694 .0681
-1.5 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .0559
-1.6 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455
-1.7 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367
-1.8 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .0294
-1.9 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .0233
-2.0 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183
-2.1 .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .0143
-2.2 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
-2.3 .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .0084
-2.4 .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .0064
-2.5 .0062 .0060 .0059 .0057 .0055 .0054 .0052 .0051 .0049 .0048
-2.6 .0047 .0045 .0044 .0043 .0041 .0040 .0039 .0038 .0037 .0036
-2.7 .0035 .0034 .0033 .0032 .0031 .0030 .0029 .0028 .0027 .0026
-2.8 .0026 .0025 .0024 .0023 .0023 .0022 .0021 .0021 .0020 .0019
-2.9 .0019 .0018 .0018 .0017 .0016 .0016 .0015 .0015 .0014 .0014
-3.0 .0013 .0013 .0013 .0012 .0012 .0011 .0011 .0011 .0010 .0010

Note. Abridged from Table 1, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited
by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press.
Adapted by permission of the Biometrika Trustees.
Table C
Critical Values for the t Distribution

Two-tailed Nondirectional Test One-tailed Directional Test


df 0.05 0.01 0.001 df 0.05 0.01 0.001
1 12.71 63.66 636.58 1 6.314 31.82 318.29
2 4.303 9.925 31.600 2 2.920 6.965 22.33
3 3.182 5.841 12.924 3 2.353 4.541 10.214
4 2.776 4.604 8.610 4 2.132 3.747 7.173
5 2.571 4.032 6.869 5 2.015 3.365 5.894
6 2.447 3.707 5.959 6 1.943 3.143 5.208
7 2.365 3.499 5.408 7 1.895 2.998 4.785
8 2.306 3.355 5.041 8 1.860 2.896 4.501
9 2.262 3.250 4.781 9 1.833 2.821 4.297
10 2.228 3.169 4.587 10 1.812 2.764 4.144
11 2.201 3.106 4.437 11 1.796 2.718 4.025
12 2.179 3.055 4.318 12 1.782 2.681 3.930
13 2.160 3.012 4.221 13 1.771 2.650 3.852
14 2.145 2.977 4.140 14 1.761 2.624 3.787
15 2.131 2.947 4.073 15 1.753 2.602 3.733
16 2.120 2.921 4.015 16 1.746 2.583 3.686
17 2.110 2.898 3.965 17 1.740 2.567 3.646
18 2.101 2.878 3.922 18 1.734 2.552 3.610
19 2.093 2.861 3.883 19 1.729 2.539 3.579
20 2.086 2.845 3.850 20 1.725 2.528 3.552
21 2.080 2.831 3.819 21 1.721 2.518 3.527
22 2.074 2.819 3.792 22 1.717 2.508 3.505
23 2.069 2.807 3.768 23 1.714 2.500 3.485
24 2.064 2.797 3.745 24 1.711 2.492 3.467
25 2.060 2.787 3.725 25 1.708 2.485 3.450
26 2.056 2.779 3.707 26 1.706 2.479 3.435
27 2.052 2.771 3.689 27 1.703 2.473 3.421
28 2.048 2.763 3.674 28 1.701 2.467 3.408
29 2.045 2.756 3.660 29 1.699 2.462 3.396
30 2.042 2.750 3.646 30 1.697 2.457 3.385
40 2.021 2.704 3.551 40 1.684 2.423 3.307
60 2.000 2.660 3.460 60 1.671 2.390 3.232
80 1.990 2.639 3.416 80 1.664 2.374 3.195
120 1.980 2.617 3.373 120 1.658 2.358 3.160
∞ 1.960 2.576 3.290 ∞ 1.645 2.326 3.090

Note. Abridged from Table 12, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited
by E. S. Pearson and H. O. Hartley, 1970, New York: Cambridge University Press.
Adapted by permission of the Biometrika Trustees.
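Any entry in Table C can be reproduced with a statistics package; a minimal sketch, again assuming Python with SciPy:

    # Minimal sketch, assuming SciPy: reproducing Table C entries for df = 10.
    from scipy.stats import t

    df, alpha = 10, 0.05
    print(t.ppf(1 - alpha / 2, df))  # 2.228, the two-tailed critical value
    print(t.ppf(1 - alpha, df))      # 1.812, the one-tailed critical value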
Table D.1
Critical Values for the F Distribution, alpha = .05
ν1
ν2 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 241.9 243.9 245.9 248.0 249.1 250.1 251.1 252.2 253.3 254.3
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45 19.45 19.46 19.47 19.48 19.49 19.50
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.36
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.70 1.64
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00
Note: ν1 = df numerator, ν2 = df denominator. Abridged from Table 18, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley,
1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
Table D.2
Critical Values for the F Distribution, alpha = .01
ν1
ν2 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞

1 4052 4999.5 5403 5625 5764 5859 5928 5982 6022 6056 6106 6157 6209 6235 6261 6287 6313 6339 6366
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 99.45 99.46 99.47 99.47 99.48 99.49 99.50
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.87 26.69 26.60 26.50 26.41 26.32 26.22 26.13
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.20 14.02 13.93 13.84 13.75 13.65 13.56 13.46
5 16.27 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11 9.02
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.06 6.97 6.88
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74 5.65
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95 4.86
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40 4.31
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00 3.91
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.25 4.10 4.02 3.94 3.86 3.78 3.69 3.60
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45 3.36
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25 3.17
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.80 3.66 3.51 3.43 3.35 3.27 3.18 3.09 3.00
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96 2.87
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.84 2.75
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75 2.65
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66 2.57
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58 2.49
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52 2.42
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46 2.36
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.40 2.31
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35 2.26
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31 2.21
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.85 2.70 2.62 2.54 2.45 2.36 2.27 2.17
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.81 2.66 2.58 2.50 2.42 2.33 2.23 2.13
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.78 2.63 2.55 2.47 2.38 2.29 2.20 2.10
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.90 2.75 2.60 2.52 2.44 2.35 2.26 2.17 2.06
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.73 2.57 2.49 2.41 2.33 2.23 2.14 2.03
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11 2.01
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92 1.80
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73 1.60
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53 1.38
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32 1.00
Note: ν1 = df numerator, ν2 = df denominator. Abridged from Table 18, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S. Pearson and H. O. Hartley,
1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
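Entries in Tables D.1 and D.2 can likewise be reproduced by computer; a minimal sketch, again assuming SciPy:

    # Minimal sketch, assuming SciPy: reproducing Table D entries.
    from scipy.stats import f

    v1, v2 = 4, 20                  # df numerator and df denominator
    print(f.ppf(1 - 0.05, v1, v2))  # 2.87, matching Table D.1 (alpha = .05)
    print(f.ppf(1 - 0.01, v1, v2))  # 4.43, matching Table D.2 (alpha = .01)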
Table E.1
Critical Values for the Studentized Range Statistic, alpha = .05
G
ν 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 17.97 26.98 32.82 37.08 40.41 43.12 45.40 47.36 49.07 50.59 51.96 53.20 54.33 55.36 56.32
2 6.08 8.33 9.80 10.88 11.74 12.44 13.03 13.54 13.99 14.39 14.75 15.08 15.38 15.65 15.91
3 4.50 5.91 6.82 7.50 8.04 8.48 8.85 9.18 9.46 9.72 9.95 10.15 10.35 10.52 10.69
4 3.93 5.04 5.76 6.29 6.71 7.05 7.35 7.60 7.83 8.03 8.21 8.37 8.52 8.66 8.79
5 3.64 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99 7.17 7.32 7.47 7.60 7.72 7.83
6 3.46 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49 6.65 6.79 6.92 7.03 7.14 7.24
7 3.34 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16 6.30 6.43 6.55 6.66 6.76 6.85
8 3.26 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92 6.05 6.18 6.29 6.39 6.48 6.57
9 3.20 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74 5.87 5.98 6.09 6.19 6.28 6.36
10 3.15 3.88 4.33 4.65 4.91 5.12 5.30 5.46 5.60 5.72 5.83 5.93 6.03 6.11 6.19
11 3.11 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49 5.61 5.71 5.81 5.90 5.98 6.06
12 3.08 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39 5.51 5.61 5.71 5.80 5.88 5.95
13 3.06 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32 5.43 5.53 5.63 5.71 5.79 5.86
14 3.03 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25 5.36 5.46 5.55 5.64 5.71 5.79
15 3.01 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20 5.31 5.40 5.49 5.57 5.65 5.72
16 3.00 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15 5.26 5.35 5.44 5.52 5.59 5.66
17 2.98 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11 5.21 5.31 5.39 5.47 5.54 5.61
18 2.97 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07 5.17 5.27 5.35 5.43 5.50 5.57
19 2.96 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04 5.14 5.23 5.31 5.39 5.46 5.53
20 2.95 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01 5.11 5.20 5.28 5.36 5.43 5.49
24 2.92 3.53 3.90 4.17 4.37 4.54 4.68 4.81 4.92 5.01 5.10 5.18 5.25 5.32 5.38
30 2.89 3.49 3.85 4.10 4.30 4.46 4.60 4.72 4.82 4.92 5.00 5.08 5.15 5.21 5.27
40 2.86 3.44 3.79 4.04 4.23 4.39 4.52 4.63 4.73 4.82 4.90 4.98 5.04 5.11 5.16
60 2.83 3.40 3.74 3.98 4.16 4.31 4.44 4.55 4.65 4.73 4.81 4.88 4.94 5.00 5.06
120 2.80 3.36 3.68 3.92 4.10 4.24 4.36 4.47 4.56 4.64 4.71 4.78 4.84 4.90 4.95
oo 2.77 3.31 3.63 3.86 4.03 4.17 4.29 4.39 4.47 4.55 4.62 4.68 4.74 4.80 4.85
Note: G = number of groups, ν = df error. Abridged from Table 29, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S.
Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
Table E.2
Critical Values for the Studentized Range Statistic, alpha = .01
G
ν 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

1 90.03 135.0 164.3 185.6 202.2 215.8 227.2 237.0 245.6 253.2 260.0 266.2 271.8 277.0 281.8
2 14.04 19.02 22.29 24.72 26.63 28.20 29.53 30.68 31.69 32.59 33.40 34.13 34.81 35.43 36.00
3 8.26 10.62 12.17 13.33 14.24 15.00 15.64 16.20 16.69 17.13 17.53 17.89 18.22 18.52 18.81
4 6.51 8.12 9.17 9.96 10.58 11.10 11.55 11.93 12.27 12.57 12.84 13.09 13.32 13.53 13.73
5 5.70 6.98 7.80 8.42 8.91 9.32 9.67 9.97 10.24 10.48 10.70 10.89 11.08 11.24 11.40
6 5.24 6.33 7.03 7.56 7.97 8.32 8.61 8.87 9.10 9.30 9.48 9.65 9.81 9.95 10.08
7 4.95 5.92 6.54 7.01 7.37 7.68 7.94 8.17 8.37 8.55 8.71 8.86 9.00 9.11 9.24
8 4.75 5.64 6.20 6.62 6.96 7.24 7.47 7.68 7.86 8.03 8.18 8.31 8.44 8.55 8.66
9 4.60 5.43 5.96 6.35 6.66 6.91 7.13 7.33 7.49 7.65 7.78 7.91 8.03 8.13 8.23
10 4.48 5.27 5.77 6.14 6.43 6.67 6.87 7.05 7.21 7.36 7.49 7.60 7.71 7.81 7.91
11 4.39 5.15 5.62 5.97 6.25 6.48 6.67 6.84 6.99 7.13 7.25 7.36 7.46 7.56 7.65
12 4.32 5.05 5.50 5.84 6.10 6.32 6.51 6.67 6.81 6.94 7.06 7.17 7.26 7.36 7.44
13 4.26 4.96 5.40 5.73 5.98 6.19 6.37 6.53 6.66 6.79 6.90 7.01 7.10 7.19 7.27
14 4.21 4.89 5.32 5.63 5.88 6.08 6.26 6.41 6.54 6.66 6.77 6.87 6.96 7.05 7.13
15 4.17 4.84 5.25 5.56 5.80 5.99 6.16 6.31 6.44 6.55 6.66 6.76 6.84 6.93 7.00
16 4.13 4.79 5.19 5.49 5.72 5.92 6.08 6.22 6.35 6.46 6.56 6.66 6.74 6.82 6.90
17 4.10 4.74 5.14 5.43 5.66 5.85 6.01 6.15 6.27 6.38 6.48 6.57 6.66 6.73 6.81
18 4.07 4.70 5.09 5.38 5.60 5.79 5.94 6.08 6.20 6.31 6.41 6.50 6.58 6.65 6.73
19 4.05 4.67 5.05 5.33 5.55 5.73 5.89 6.02 6.14 6.25 6.34 6.43 6.51 6.58 6.66
20 4.02 4.64 5.02 5.29 5.51 5.69 5.84 5.97 6.09 6.19 6.28 6.37 6.45 6.52 6.59
24 3.96 4.55 4.91 5.17 5.37 5.54 5.69 5.81 5.92 6.02 6.11 6.19 6.26 6.33 6.39
30 3.89 4.45 4.80 5.05 5.24 5.40 5.54 5.65 5.76 5.85 5.93 6.01 6.08 6.14 6.20
40 3.82 4.37 4.70 4.93 5.11 5.26 5.39 5.50 5.60 5.69 5.76 5.83 5.90 5.96 6.02
60 3.76 4.28 4.59 4.82 4.99 5.13 5.25 5.36 5.45 5.53 5.60 5.67 5.73 5.78 5.84
120 3.70 4.20 4.50 4.71 4.87 5.01 5.12 5.21 5.30 5.37 5.44 5.50 5.56 5.61 5.66
∞ 3.64 4.12 4.40 4.60 4.76 4.88 4.99 5.08 5.16 5.23 5.29 5.35 5.40 5.45 5.49
Note: G = number of groups, ν = df error. Abridged from Table 29, Biometrika Tables for Statisticians (Vol. 1, 3rd ed.) edited by E. S.
Pearson and H. O. Hartley, 1970, New York: Cambridge University Press. Adapted by permission of the Biometrika Trustees.
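The studentized range distribution behind Tables E.1 and E.2, used for the Tukey test, is less commonly built in, but recent versions of SciPy (1.7 or later) include it; a sketch under that assumption:

    # Sketch assuming SciPy 1.7 or later, which adds the studentized range.
    from scipy.stats import studentized_range

    G, v = 4, 20                              # number of groups and df error
    print(studentized_range.ppf(0.95, G, v))  # 3.96, matching Table E.1
    print(studentized_range.ppf(0.99, G, v))  # 5.02, matching Table E.2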
Table F.1
L Values for alpha = .05
Power
k .10 .30 .50 .60 .70 .75 .80 .85 .90 .95 .99
1 .43 2.06 3.84 4.90 6.17 6.94 7.85 8.98 10.51 13.00 18.37
2 .62 2.78 4.96 6.21 7.70 8.59 9.64 10.92 12.65 15.44 21.40
3 .78 3.30 5.76 7.15 8.79 9.77 10.90 12.30 14.17 17.17 23.52
4 .91 3.74 6.42 7.92 9.68 10.72 11.94 13.42 15.41 18.57 25.24
5 1.03 4.12 6.99 8.59 10.45 11.55 12.83 14.39 16.47 19.78 26.73
6 1.13 4.46 7.50 9.19 11.14 12.29 13.62 15.26 17.42 20.86 28.05
7 1.23 4.77 7.97 9.73 11.77 12.96 14.35 16.04 18.28 21.84 29.25
8 1.32 5.06 8.41 10.24 12.35 13.59 15.02 16.77 19.08 22.74 30.36
9 1.40 5.33 8.81 10.71 12.89 14.17 15.65 17.45 19.83 23.59 31.39
10 1.49 5.59 9.19 11.15 13.40 14.72 16.24 18.09 20.53 24.39 32.37
11 1.56 5.83 9.56 11.58 13.89 15.24 16.80 18.70 21.20 25.14 33.29
12 1.64 6.06 9.90 11.98 14.34 15.74 17.34 19.28 21.83 25.86 34.16
13 1.71 6.29 10.24 12.36 14.80 16.21 17.85 19.83 22.44 26.55 35.00
14 1.78 6.50 10.55 12.73 15.22 16.67 18.34 20.36 23.02 27.20 35.81
15 1.84 6.71 10.86 13.09 15.63 17.11 18.81 20.87 23.58 27.84 36.58
16 1.90 6.91 11.16 13.43 16.03 17.53 19.27 21.37 24.13 28.45 37.33
18 2.03 7.29 11.73 14.09 16.78 18.34 20.14 22.31 25.16 29.62 38.76
20 2.14 7.65 12.26 14.71 17.50 19.11 20.96 23.20 26.13 30.72 40.10
22 2.25 8.00 12.77 15.30 18.17 19.83 21.74 24.04 27.06 31.77 41.37
24 2.36 8.33 13.02 15.87 18.82 20.53 22.49 24.85 27.94 32.76 42.59
28 2.56 8.94 14.17 16.93 20.04 21.83 23.89 26.36 29.60 34.64 44.87
32 2.74 9.52 15.02 17.91 21.17 23.04 25.19 27.77 31.14 36.37 46.98
36 2.91 10.06 15.82 18.84 22.23 24.81 26.41 29.09 32.58 38.00 48.96
40 3.08 10.57 16.58 19.71 23.23 25.25 27.56 30.33 33.94 39.59 50.83
50 3.46 11.75 18.31 21.72 25.53 27.71 30.20 33.19 37.07 43.07 55.12
60 3.80 12.81 19.88 23.53 27.61 29.94 32.59 35.77 39.89 46.25 58.98
70 4.12 13.79 21.32 25.20 29.52 31.98 34.79 38.14 42.48 49.17 62.53
80 4.41 14.70 22.67 26.75 31.29 33.88 36.83 40.35 44.89 51.89 65.83
90 4.69 15.56 23.93 28.21 32.96 35.67 38.75 42.14 47.16 54.44 68.92
100 4.95 16.37 25.12 29.59 34.54 37.36 40.56 44.37 49.29 56.85 71.84
Note: k = number of variables added. From Applied multiple regression/correlation for the behavioral sciences (p. 526) by J. Cohen
and P. Cohen, 1983, Hillsdale, NJ: Lawrence Erlbaum Associates. Copyright 1983 by Lawrence Erlbaum Associates. Reprinted
by permission of the publisher.
Table F.2
L Values for alpha = .01

Power
k .10 .30 .50 .60 .70 .75 .80 .85 .90 .95 .99
1 1.67 4.21 6.64 8.00 9.61 10.57 11.68 13.05 14.88 17.81 24.03
2 2.30 5.37 8.19 9.75 11.57 12.64 13.88 15.40 17.43 20.65 27.42
3 2.76 6.22 9.31 11.01 12.97 14.12 15.46 17.09 19.25 22.67 29.83
4 3.15 6.92 10.23 12.04 14.12 15.34 16.75 18.47 20.74 24.33 31.80
5 3.49 7.52 11.03 12.94 15.12 16.40 17.87 19.66 22.03 25.76 33.50
6 3.79 8.07 11.79 13.74 16.01 17.34 18.87 20.73 23.18 27.04 35.02
7 4.08 8.57 12.41 14.47 16.83 18.20 19.79 21.71 24.24 28.21 36.41
8 4.34 9.03 13.02 15.15 17.59 19.00 20.64 22.61 25.21 29.29 37.69
9 4.58 9.47 13.59 15.79 18.30 19.75 21.43 23.46 26.12 30.31 38.89
10 4.82 9.88 14.13 16.39 18.97 20.46 22.18 24.25 26.98 31.26 40.02
11 5.04 10.27 14.64 16.96 19.60 21.13 22.89 25.01 27.80 32.16 41.09
12 5.25 10.64 15.13 17.51 20.21 21.77 23.56 25.73 28.58 33.02 42.11
13 5.45 11.00 15.59 18.03 20.78 22.38 24.21 26.42 29.32 33.85 43.09
14 5.65 11.35 16.04 18.53 21.34 22.97 24.83 27.09 30.03 34.64 44.03
15 5.84 11.67 16.48 19.01 21.88 23.53 25.43 27.72 30.72 35.40 44.93
16 6.02 12.00 16.90 19.48 22.40 24.08 26.01 28.34 31.39 36.14 45.80
18 6.37 12.61 17.70 20.37 23.39 25.12 27.12 29.52 32.66 37.54 47.46
20 6.70 13.19 18.45 21.21 24.32 26.11 28.16 30.63 33.85 38.87 49.03
22 7.02 13.74 19.17 22.01 25.21 27.05 29.15 31.69 34.99 40.12 50.51
24 7.32 14.27 19.86 22.78 26.06 27.94 30.10 32.69 36.07 41.32 51.93
28 7.89 15.26 21.15 24.21 27.65 29.62 31.88 34.59 38.11 43.58 54.60
32 8.42 16.19 22.35 25.55 29.13 31.19 33.53 36.35 40.01 45.67 57.07
36 8.92 17.06 23.48 26.80 30.52 32.65 35.09 38.00 41.78 47.63 59.39
40 9.39 17.88 24.54 27.99 31.84 34.04 36.55 39.56 43.46 49.49 61.57
50 10.48 19.77 27.00 30.72 34.86 37.23 39.92 43.14 47.31 53.74 66.59
60 11.46 21.48 29.21 33.18 37.59 40.10 42.96 46.38 50.79 57.58 71.12
70 12.37 23.05 31.25 35.45 40.10 42.75 45.76 49.35 53.99 61.11 75.27
80 13.22 24.51 33.15 37.55 42.43 45.21 48.36 52.11 56.96 64.39 79.13
90 14.01 25.89 34.93 39.53 44.62 47.52 50.80 54.71 59.75 67.47 82.76
100 14.76 27.19 36.62 41.41 46.70 49.70 53.11 57.16 62.38 70.37 86.18
Note: k = number of variables added. From Applied multiple regression/correlation for the behavioral sciences (p. 526) by J. Cohen
and P. Cohen, 1983, Hillsdale, NJ: Lawrence Erlbaum Associates. Copyright 1983 by Lawrence Erlbaum Associates. Reprinted
by permission of the publisher.
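Cohen and Cohen's L can be understood, to a close approximation, as the noncentrality of a noncentral chi-square distribution with k degrees of freedom for which the central critical value is exceeded with probability equal to the desired power. Treat that characterization, and the sketch below (again assuming SciPy), as a numerical check on the tables rather than as the original authors' computing method:

    # A sketch, assuming SciPy: recovering tabled L values by root finding.
    from scipy.optimize import brentq
    from scipy.stats import chi2, ncx2

    def L_value(k, power, alpha):
        crit = chi2.ppf(1 - alpha, k)   # central chi-square critical value
        # Find the noncentrality lam such that a noncentral chi-square with
        # k df and noncentrality lam exceeds crit with probability = power.
        return brentq(lambda lam: ncx2.sf(crit, k, lam) - power, 0.01, 200)

    print(L_value(1, 0.80, 0.05))   # about 7.85, compare Table F.1
    print(L_value(2, 0.90, 0.01))   # about 17.4, compare Table F.2 (17.43)

For example, with k = 1, alpha = .05, and power = .80, the function returns approximately 7.85, agreeing with Table F.1 to two decimals.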
Author Index

A

Algina, J., 239, 302

B

Bakeman, R., 49, 50, 299, 300, 301
Bishop, Y. M. M., 49, 301

C

Cohen, J., 13, 104, 151, 168, 179, 188, 204, 210, 295, 296, 297, 300, 301
Cohen, P., 13, 168, 179, 188, 204, 210, 295, 296, 301

F

Fidell, L. S., 50, 239, 302
Fienberg, S. E., 49, 301

H

Hays, W. L., 138, 139, 225, 301
Holland, P. W., 49, 301

K

Keppel, G., 51, 207, 225, 241, 301
Kessen, W., 2, 301
Kirk, R. E., 225, 301

L

Loftus, G. R., 39, 301
Loftus, E. F., 39, 301

M

Marascuilo, L. A., 85, 301
McArthur, D., 299, 301
Mervis, C. B., 80, 302

O

Olejnik, S., 239, 302

R

Robinson, B. F., 49, 50, 80, 300, 301, 302
Robinson, B. W., 80, 302
Rosenthal, R., 151, 300, 302
Rosnow, R. L., 151, 300, 302

S

Saufley, W. H., Jr., 51, 301
Scott, D. W., 76, 302
Serlin, R. C., 85, 301
Siegel, S., 50, 85, 302
Stevens, S. S., 45, 302
Stigler, S. M., 59, 66, 84, 87, 302

T

Tabachnick, B. G., 50, 239, 302
Tufte, E. R., 71, 73, 78, 79, 302
Tukey, J. W., 77, 302

W

Wainer, H., 73, 302
Wilkinson, L., 151, 302
Winer, B. J., 146, 211, 225, 241, 302
Subject Index

A

A priori tests, see Planned comparisons
Accounting for variance, 105, 113-117, 167-168
    and statistical significance, 140, 168-171
Adjusted means, see Analysis of covariance (ANCOVA)
Adjusted R2, see R squared, adjusted
Alpha error, see Type I error
Alpha level, 21, 23, 24, 25-26, 30
    and power, 295, 296
Alternative hypothesis, 21-22
Analysis of covariance (ANCOVA), 156, 171-172
    and adjusted means, 212, 293
    and adjusted individual scores, 216-217, 294
    and homogeneity of regression, 218-219, 292
    and pretest-posttest studies, 289-292
Analysis of variance (ANOVA), 12-13, 50, 51, 103-104, 290
    and multiple regression, see Multiple regression
    and unequal numbers of subjects per group, 210
    for two independent groups, 149-150
    one-way, 185-186, 194
    source table, 194, 241
    source table for repeated measures, 258-259, 261
    two-way, 235-238
And rule, see Probability
Arithmetic mean, see Mean

B

Best fit line, 104-107, 111, 113, 114, 116
Beta error, see Type II error
Between groups, see Degrees of freedom, between groups
Between-subjects factors, 181-182, 223-224
Between-subjects studies, see Designs
Biased estimates, 138
Binary predictor variables, see Predictor variables, binary
Binomial coefficients, 37
Binomial distribution, 31, 34, 39, 83-84
    normal approximation for, 84-86
Binomial parameters, 39
Binomial test, see Sign test
Box-and-whiskers plot, 77-78, 79
Button-pushing study, introduced, 172-173
    between-subjects 2 × 2 factorial version, 233-234
    mixed between-within-subjects 2 × 2 factorial version, 246, 270
    single-factor within subjects version, 262-263
    within-subjects 2 × 2 factorial version, 247, 278

C

Categorical scales, see Nominal scales
Categorical variables, see Coding categorical variables
Central limit theorem, 91
Chi-square analysis, 49, 51
Coding categorical variables, 182
    contrast coding, 186-189
    dummy variable coding, 182-184, 250, 253, 254
    for factorial studies, 226-231
    orthogonal contrasts, see Contrast coefficients
Coefficient of determination (r2), 116, 128
Conditional relations, see Interactions
Confidence intervals, 98-99, 152
Contrast coding, see Coding categorical variables
Contrast coefficients, 189-191, 228, 230, 231
Correlation, 50, 103-104
Correlation coefficient (r), 116-117, 124-128, 156, 290
Correlational studies, see Observational studies
Covariance, 121, 124, 126-127
Criterion variable, see Dependent variable
Critical region, see Region of rejection
Critical values, 40-41, 44

D

De Moivre, Abraham, 84, 85, 87, 91
Degrees of freedom, 140
    between groups, 233
    error, 144, 147
    for one-way ANOVA, 194-195
    for single-factor within-subjects studies, 259-260
    for mixed two-factor studies, 271
    for two-factor between-subjects studies, 233
    for two-factor within-subjects studies, 279
    for three-factor between-subjects studies, 234
    model, 145, 147
    total, 143-144
    within groups, 232-233
Dependent variable, 13-14, 48-49
Descriptive statistics, see Statistics
Designs, and statistical procedures, 49-51
    single-factor between-subjects, 181-182, 224
    single-factor within-subjects, 246-247
    mixed between-within, 246
    multi-factor between-subjects, 224-225, 246
Deviation scores, see Residuals
Directional test, see One-tailed test
Dummy variable coding, see Coding categorical variables

E

Effect size, 150-151, 296, 299
Error bars, 98, 152
Error sum of squares, see Sum of squares, error
Estimated standard deviation, see Standard deviation, estimated
Estimated standard error of estimate, see Standard error of estimate, estimated
Estimated standard error of the mean, see Standard error of the mean, estimated
Estimated variance, see Variance, estimated
Excel, 1, 6, 14, 62, 73, 105, 157
Experimental studies, 48-49
Explanatory variables, see Predictor variables

F

F distribution, 140-142
F ratio, 140, 146-147
    for model, 166
    for unique additional variance, 168-169
    using final error term, 204
F test, 142-143
Factorial studies, 224-225
    advantages of, 226
Fisher, Ronald A., 21, 140

G

Gauss, Carl Friedrich, 87
Gender smiling study, introduced, 212-213

H

Harmonic mean, 211
Hierarchic multiple regression, 171-172, 203, 214, 229, 280
Histogram, 75-77
Homogeneity of regression, see Analysis of covariance (ANCOVA)
Hypothesis testing, 20-22

I

Independence, assumption of, 23
Independent variables, 13-14, 48-49, see also Predictor variables
Inferential statistics, see Statistics
Interactions, 218-219, 226
    interpreting significance of, 235-237
Interval scales, 46, 47

L

Laplace, Pierre Simon, 87, 91
Least squares, method of, 54, 58
Legendre, Adrien Marie, 59, 87
Lie detection study, introduced, 54
    within subjects version, 251
Linear relation, 105
    exact nature of, 113
    strength of, 113-115
Log linear analysis, 49, 51

M

Magnitude of effect, see Effect size
Main effects, 226
    interpreting significance of, 235-237
Mean, 15, 54
    population, 19-20, 88, 93, 96
    of sample means, 96
    sample, 20, 64, 91-92, 96
    standard error of, see Standard error of the mean
    weighted sum, 91
Mean square, 140
    error, 145-146, 147
    model, 145-146, 147
Model sum of squares, see Sum of squares, model
Money cure study, introduced, 17-18
Multiple correlation coefficient (R), see Multiple R
Multiple R, 158, 163-164, 165
Multiple R squared, see R squared
Multiple regression, 12-14, 50, 51, 156
    and ANOVA, unified view, 12-13, 103-104
    and spreadsheets, 14, 288
Multivariate analysis of variance (MANOVA), 50

N

Neyman, Jerzy, 21
Nominal scales, 46, 49
Non-directional test, see Two-tailed test
Nonparametric tests, 50
Normal curve, see Normal distribution
Normal distribution, 87, 90
    functional definition, 87-88
    historical considerations, 87
    underlying circumstances, 89-90
Notation, statistical, see Statistical notation
Null hypothesis, 21-22

O

Observational studies, 48
Omnibus test, 186, 201-202
One-tailed test, 22, 27, 40-41
One-way analysis of variance, see Analysis of variance (ANOVA)
Or rule, see Probability
Ordinal scales, 46, 47
Orthogonal contrasts, see Coding categorical variables
Outcomes, see Probability
Outcome class, see Probability
Outliers, 69

P

Parameter, population, 19-20
Parametric tests, 50
Partial regression coefficients (bi), 157-158
    standardized (βi), 161
Partitioning variance,
    between and within groups, 114, 194
    between and within subjects, 248-249
Pascal's triangle, 35, 37
Pascal, Blaise, 31
Pearson, Karl, 21, 124
Pearson product-moment correlation coefficient (r), see Correlation coefficient
Phi coefficient, 127
Planned comparisons, 186, 201-203, 205, 234
Point estimate, 60
Population, 18, 19-20
Population mean, see Mean
Population parameter, see Parameter, population
Population standard deviation, see Standard deviation
Population variance, see Variance
Post hoc tests, 186, 201-202, 206-207, 237, 242-243
    and unequal numbers of subjects per group, 210-211
    for within subjects studies, 264-266, 284-285
Power, 25-26
Power analysis, 154, 295-299
Predicting the mean, 56-58
Predictor variables, see also Independent variables
    binary, 129, 130, 133
    for a single-factor between-subjects study, 253
    for a single-factor within-subjects study, 254, 264
    for multi-factor between-subjects studies, 228, 230, 231
Pretest-posttest studies, see Analysis of covariance
Probability, 32-33
    and rule, 33
    or rule, 33, 40
    outcomes, 31, 33-34, 35, 37, 39
    outcome class, 32, 33-34, 35, 39

Q

Qualitative scales, see Nominal scales
Qualitative variables, see Coding categorical variables
Quantitative variables, 13, 46-47, 183
Quetelet, Adolphe, 66, 87

R

R squared, 158-159, 165
    adjusted, 159, 165
Random sample, 18
Ratio scales, 46, 47
Region of rejection, 29, 30, 40-41
Regression, 103-104, 156
Regression coefficient (b), 105, 121, 127-128, 144, 218
    Partial, see Partial regression coefficient
Regression constant (a), 105, 121, 128
Regression line, 132, 134
Rejecting the null hypothesis, 22, 24
Rejection region, see Region of rejection
Repeated-measures factors, see Within-subjects factors
Residuals, 58, 60, 110-111, 112
Response variable, see Dependent variable

S

Sample, 18, 19-20
    Random, see Random sample
Sample mean, see Mean
Sample standard deviation, see Standard deviation
Sample statistics, see Statistics
Sample variance, see Variance
Sampling distribution, 21, 22-23
Scales of measurement, 45-47
Scattergram, 105, 106
Sign test, 24, 34, 39-41, 42, 49
Significance testing, 140-142, 146-147
    for main effects and interactions, 233-234
    for model, 166
    for unique additional variance, 168, 204
Single sample tests, 93-96
Slope, see Regression coefficient
Snedecor, George, 140
Source table, see Analysis of variance (ANOVA)
Spearman rank-order correlation coefficient, 127
Spreadsheets, 4-10
    defined, 4-5
    elements of, 6-10
    formulas, 7-9
    and multiple regression, see Multiple regression
    operators for, 8
Standard deviation, 63
    estimated, 139, 161
    population, 63, 64, 94, 96, 139
    sample, 63, 64, 96, 139
Standard error of estimate, 160
    estimated, 159-161, 165
Standard error of the mean, 94-96
    estimated, 94, 161
    population, 95, 96
Standard normal distribution, 88
Standard scores, see Z scores
Standardized partial regression coefficient, see Partial regression coefficient, standardized
Statistical interactions, see Interactions
Statistical notation, 14-15
Statistical test, and hypothesis testing, 21, 23-24
Statistics,
    descriptive, 19, 30, 70
    inferential, 19, 30
    sample, 19-20
Stem-and-leaf plot, 73-75, 79
Step-wise regression, 171
Student's t test, see t test
Sum of squares, 58, 60
    error, 107, 110-111, 145
    model, 107, 110-111, 145
    total, 105, 110-111
    within and between subjects, 249
    X, 120
    XY, 120, 124
    Y, 120
Suppressor effect, 168

T

t distribution, 92, 96, 97
t statistic, 96
t test, 51, 92, 129, 154
Test statistic, and hypothesis testing, 21, 23
Tree diagram, 33, 189
Tukey critical difference, 207-208, 265
Tukey test, 207-209, 284-285
Two-tailed test, 21, 27, 40-41
Type I error, 24, 26, 29, 202, 206-207
Type II error, 24-25, 26, 29, 207

U

Unbiased estimates, 138

V

Variance, 60-61
    error, 114-116, 117
    estimated, 137-138, 139
    model, 114-116, 117
    population, 62, 64, 138
    sample, 62, 64, 137, 138
    total, 114-116, 117

W

Within groups, see Degrees of freedom, within groups
Within-subjects factorial studies, see Designs
Within-subject factors, 245-246, 270-271
    advantages of, 247-248
    controlling between subject variability for, 250-251

Y

Y-intercept, see Regression constant

Z

Z scores, 66-67, 69, 70, 85, 126
